
Humboldt-Universität zu Berlin - Statistics

Roman Timofeev


Contact

E-mail: timofeev @ wiwi . hu-berlin . de
Phone: +49 30 2093 5721
Fax: +49 30 2093-5649
Office hours/location: Spa 1, 400

Mon 14:00 - 16:00

During the semester break:
by appointment
Postal address: Humboldt-Universität zu Berlin
Wirtschaftswissenschaftliche Fakultät
Institut für Statistik und Ökonometrie
Spandauer Strasse 1
10178 Berlin

Teaching



MVA I (winter semester 07-08)
  SPA 1, Room 203
  Fri 8:00 - 12:00
  Register for the course in the Moodle system

Study and Research

Classification and Regression Trees

This research is supported within the framework of a two-year doctoral scholarship. After the scholarship ends, students can also do a research-related internship with DekaBank. The students present the results of their research at a workshop held annually at DekaBank.

The method of Classification and Regression Trees (CART) is a classification method that uses historical data to construct so-called decision trees. Depending on the available information about the dataset, either a classification tree or a regression tree can be constructed. The grown tree can then be used to classify new observations.

Classification and Regression Trees is a classification method that uses historical data to construct so-called decision trees, which are then used to classify new data. To use CART, the number of classes must be known a priori. The CART methodology was developed in the 1980s by Breiman, Friedman, Olshen and Stone and published in their book "Classification and Regression Trees" (1984). To build decision trees, CART uses a so-called learning sample - a set of historical data with pre-assigned classes for all observations. For example, the learning sample for a credit scoring system would consist of fundamental information about previous borrowers (variables) matched with the actual repayment outcomes (classes).
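A learning sample of the kind described above could be represented as follows. This is only an illustrative sketch in Python: the field names and all records are made up and do not come from any real credit scoring system.

```python
from collections import Counter
from typing import NamedTuple


class Borrower(NamedTuple):
    """One observation of a hypothetical credit-scoring learning sample."""
    age: int          # numerical variable
    income: float     # numerical variable (made-up units)
    owns_home: bool   # categorical variable
    repaid: str       # pre-assigned class: "yes" or "no"


# Made-up historical records; real data would come from past loans.
learning_sample = [
    Borrower(25, 1800.0, False, "no"),
    Borrower(41, 3200.0, True, "yes"),
    Borrower(35, 2500.0, False, "yes"),
    Borrower(52, 4100.0, True, "yes"),
    Borrower(23, 1500.0, False, "no"),
]

# The set of classes is fixed in advance, as CART requires.
classes = sorted({b.repaid for b in learning_sample})
class_counts = Counter(b.repaid for b in learning_sample)
```

Every observation carries both its variables and its known class, which is what allows the tree to be fitted to past outcomes.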

A decision tree is represented by a set of questions that split the learning sample into smaller and smaller parts. CART asks only yes/no questions, for example "Is age greater than 50?" or "Is sex male?". The CART algorithm searches over all variables and all their possible values to find the best split - the question that divides the data into two parts with maximum homogeneity. The process is then repeated for each of the resulting data fragments. One example of a simple classification tree is the one used by the San Diego Medical Center to classify its patients into different risk levels.

In practice, decision trees can be much more complicated and include dozens of levels and hundreds of variables. As can be seen from Figure 1.1, CART easily handles both numerical and categorical variables. Another advantage of the CART method is its robustness to outliers: the splitting algorithm usually isolates outliers in an individual node or nodes. An important practical property of CART is that the structure of its classification or regression trees is invariant with respect to monotone transformations of the independent variables. One can replace any variable with its logarithm or square root, and the structure of the tree will not change.
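The split search and the invariance property can be checked on a toy sample. The sketch below is a minimal illustration, not the implementation from the thesis: it assumes Gini impurity as the homogeneity measure (the text above only speaks of "maximum homogeneity"), handles a single numerical variable, and uses made-up ages and risk labels. The function names gini and best_split are hypothetical.

```python
import math


def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Try every question "is value <= t?" and return (impurity, left_indices).

    Candidate thresholds are the observed values except the largest one,
    which would leave the right part empty.
    """
    best = None
    for t in sorted(set(values))[:-1]:
        left = [i for i, v in enumerate(values) if v <= t]
        right = [i for i, v in enumerate(values) if v > t]
        # Size-weighted average impurity of the two parts.
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(values)
        if best is None or score < best[0]:
            best = (score, left)
    return best


# Made-up toy data: patient ages and pre-assigned risk classes.
ages = [23, 25, 35, 41, 52, 67]
risk = ["low", "low", "low", "high", "high", "high"]

score_raw, left_raw = best_split(ages, risk)
# A monotone transformation changes the threshold value, not the partition.
score_log, left_log = best_split([math.log(a) for a in ages], risk)
```

Because the logarithm preserves the ordering of the values, both searches send the same observations to the left part; only the numerical threshold in the question changes.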

The CART methodology consists of three parts:

  1. Construction of the maximum tree
  2. Choice of the right tree size
  3. Classification of new data using the constructed tree
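
The three parts listed above can be sketched end to end on a single numerical variable. This is a simplified illustration under several assumptions: Gini impurity is assumed as the splitting criterion, the data are made up, and the "right size" is chosen here by limiting the depth with a hold-out sample, which is a simple stand-in and not CART's own pruning procedure. The names grow, classify and choose_depth are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def grow(rows, depth=0):
    """Part 1: grow the maximum tree, splitting until every node is pure.

    Each row is a pair (x, label) with one numerical variable x.
    """
    labels = [y for _, y in rows]
    node = {"label": Counter(labels).most_common(1)[0][0], "depth": depth}
    if len(set(labels)) == 1:
        return node  # pure node: nothing left to split
    best = None
    for t in sorted({x for x, _ in rows})[:-1]:
        left = [r for r in rows if r[0] <= t]
        right = [r for r in rows if r[0] > t]
        score = (len(left) * gini([y for _, y in left])
                 + len(right) * gini([y for _, y in right]))
        if best is None or score < best[0]:
            best = (score, t, left, right)
    if best is not None:
        node["t"] = best[1]
        node["left"] = grow(best[2], depth + 1)
        node["right"] = grow(best[3], depth + 1)
    return node


def classify(node, x, max_depth):
    """Part 3: answer the yes/no questions until a leaf or the depth limit."""
    while "t" in node and node["depth"] < max_depth:
        node = node["left"] if x <= node["t"] else node["right"]
    return node["label"]


def choose_depth(tree, validation, depths):
    """Part 2 (simplified): pick the depth with the fewest hold-out errors."""
    def errors(d):
        return sum(classify(tree, x, d) != y for x, y in validation)
    return min(depths, key=errors)


# Made-up data: class "a" for small x, "b" for large x, one noisy point (3, "b").
learning = [(1, "a"), (2, "a"), (3, "b"), (4, "a"),
            (6, "b"), (7, "b"), (8, "b"), (9, "b")]
validation = [(2, "a"), (3, "a"), (4, "a"), (6, "b"), (8, "b")]

tree = grow(learning)
best_depth = choose_depth(tree, validation, depths=[0, 1, 2, 3])
prediction = classify(tree, 3, best_depth)
```

In this toy case the maximum tree also fits the noisy observation, while the shallower tree chosen on the hold-out sample ignores it, which is the motivation for not using the maximum tree directly.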
Download my master's thesis on CART

Working Papers

  • A. Andriyashin, M. Benko, W. Härdle, R. Timofeev, U. Ziegenhagen (2005): Color Harmonization in Car Manufacturing Processes, special issue "Business, Industry and Government (BIG) Statistics" of Applied Stochastic Models in Business and Industry, volume 22, issue 5-6, pages 519-532, J. Wiley
  • A. Andriyashin, R. Timofeev, W. Härdle (2007): Recursive Portfolio Selection with Decision Trees, Discussion paper SFB 649
  • Y. Golubev, W. Härdle, R. Timofeev (2007): Testing Monotonicity of Pricing Kernels, submitted to Journal of Applied Econometrics

Talks

  • Classification and Regression Trees in XploRe
  • CART Implementation Issues

Projects

  • Yxilon - A Modular Programming Language

Modern statistical computing requires smooth integration of new algorithms and quantitative analysis results into all sorts of platforms, such as web browsers and standard and proprietary applications. With Yxilon we want to implement such a vertically integrable, modular environment that provides the user with a rich set of statistical methods and a variety of interfaces for using them. Yxilon will be the successor of XploRe, a complete statistical engine developed at Humboldt-Universität zu Berlin. Over several projects with international partners, the function set of XploRe was extended more and more at the cost of performance and stability, aspects we want to improve in the upcoming Yxilon. The main goals of Yxilon are:

  • platform independence
  • COM and client/server interfaces
  • database functionality and connectivity
  • multi-lingual user interfaces
  • full support of all XploRe functions and packages
  • support for XML