| E-mail: |
timofeev @
wiwi . hu-berlin . de
|
 |
| Phone: |
+49 30 2093 5721 |
| Fax: |
+49 30 2093-5649 |
| Office hours/location: |
Spa 1, 400 |
|
Mon 14:00 - 16:00 |
|
During semester breaks
by appointment |
| Postal address: |
Humboldt-Universität zu Berlin
Wirtschaftswissenschaftliche Fakultät
Institut für Statistik und Ökonometrie
Spandauer Strasse 1
10178 Berlin |
|
|
Teaching
MVA I WS 07-08
Study and Research
Classification and Regression Trees
|
Research is supported by
within the framework of a two-year doctoral scholarship. After the end of the
scholarship, students can also do a research-related internship with
DekaBank. The students present the results of their research at a
workshop held annually at DekaBank. |
|
The method of Classification and Regression
Trees (CART) is a classification method which uses historical data to
construct so-called decision trees. Depending on the available information
about the dataset, either a classification tree or a regression tree can be
constructed. The grown tree can then be used to classify new
observations.
|
|
Classification and Regression Trees (CART) is
a classification method which uses historical data to construct
so-called decision trees. Decision trees are then used to
classify new data. To use CART, the number of classes must be known
a priori. The CART methodology was developed in the 1980s by
Breiman, Friedman, Olshen and Stone in their monograph "Classification and
Regression Trees" (1984). To build decision trees, CART uses a
so-called learning sample: a set of historical data with pre-assigned
classes for all observations. For example, the learning sample for a credit
scoring system would be fundamental information about previous borrowers
(variables) matched with their actual repayment outcomes (classes).
Decision trees are represented by a set
of questions which split the learning sample into smaller and smaller
parts. CART asks only yes/no questions. A possible question could be
"Is age greater than 50?" or "Is sex male?". The CART algorithm searches
over all variables and all their possible values to find the
best split: the question that splits the data into two parts with
maximum homogeneity. The process is then repeated for each of the
resulting data fragments. Here is an example of a simple classification
tree, used by the San Diego Medical Center to classify its
patients into different levels of risk:
|
|
|
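The search for the best split described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the text speaks of "maximum homogeneity" without naming a measure, so Gini impurity is assumed here; the function names `gini` and `best_split` and the toy learning sample are made up for this sketch; and only ordered (numerical) variables are handled, not categorical questions such as "Is sex male?".

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Try every variable and every observed value and return the question
    'is row[var] <= threshold?' with the lowest weighted impurity,
    as a tuple (impurity, variable index, threshold)."""
    n = len(rows)
    best = None
    for var in range(len(rows[0])):
        for threshold in sorted({row[var] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[var] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[var] > threshold]
            if not left or not right:
                continue  # a question that separates nothing is of no use
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, var, threshold)
    return best

# Made-up learning sample: (age, income in thousands) with a risk class.
rows = [(25, 40), (35, 60), (45, 80), (55, 30), (65, 50), (75, 70)]
labels = ["low", "low", "low", "high", "high", "high"]

print(best_split(rows, labels))  # prints (0.0, 0, 45)
```

In this toy sample the question "Is age at most 45?" separates the two classes completely, so its weighted impurity is zero and no question on income can do better.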
In practice, decision trees can be much more
complicated and include dozens of levels and
hundreds of variables. As can be seen from Figure 1.1, CART can
easily handle both numerical and categorical variables. Another
advantage of the CART method is its robustness to outliers: usually the
splitting algorithm will isolate outliers in an individual node or nodes.
An important practical property of CART is that the structure of its
classification or regression trees is invariant with respect to
monotone transformations of the independent variables. One can replace any
variable with its logarithm or square root, and the structure of the
tree will not change.
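The invariance property can be checked on a small example. The sketch below assumes threshold questions of the form "is the value at most t?" and Gini impurity as the homogeneity measure; the helper `best_partition` and the income data are hypothetical. Because a strictly increasing transformation keeps the ordering of the observations, the same group of observations ends up in the left branch, even though the numerical threshold itself changes.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_partition(values, labels):
    """Return the indices of the observations sent to the left branch
    by the best threshold split on a single variable."""
    n = len(values)
    best_score, best_left = None, None
    # The largest value is skipped: it would leave the right branch empty.
    for threshold in sorted(set(values))[:-1]:
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / n
        if best_score is None or score < best_score:
            best_score, best_left = score, tuple(left)
    return best_left

# Made-up data with one strongly skewed variable.
income = [12.0, 18.0, 25.0, 40.0, 90.0, 300.0]
labels = ["bad", "bad", "bad", "good", "good", "good"]

raw = best_partition(income, labels)
logged = best_partition([math.log(v) for v in income], labels)
print(raw == logged)  # prints True
```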
The CART methodology consists of three parts:
- Construction of the maximum tree
- Choice of the right tree size
- Classification of new data using the constructed tree
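The three parts listed above can be illustrated with a short Python sketch. Several assumptions apply: Gini impurity is used as the homogeneity measure, only numerical variables are handled, and the choice of tree size is reduced to a simple depth limit (`max_depth`), which is a stand-in and not the procedure the text refers to. The functions `grow` and `classify` and the patient data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def grow(rows, labels, max_depth):
    """Parts 1 and 2: grow a tree recursively; max_depth crudely limits its size."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return {"leaf": majority}
    best = None
    for var in range(len(rows[0])):
        for threshold in sorted({row[var] for row in rows})[:-1]:
            left = [i for i, row in enumerate(rows) if row[var] <= threshold]
            right = [i for i, row in enumerate(rows) if row[var] > threshold]
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, var, threshold, left, right)
    if best is None:  # all rows are identical, so no question separates them
        return {"leaf": majority}
    _, var, threshold, left, right = best
    return {
        "var": var,
        "threshold": threshold,
        "yes": grow([rows[i] for i in left], [labels[i] for i in left], max_depth - 1),
        "no": grow([rows[i] for i in right], [labels[i] for i in right], max_depth - 1),
    }

def classify(tree, row):
    """Part 3: answer the yes/no questions from the root down to a leaf."""
    while "leaf" not in tree:
        tree = tree["yes"] if row[tree["var"]] <= tree["threshold"] else tree["no"]
    return tree["leaf"]

# Made-up learning sample: (age, blood pressure) with a risk class.
rows = [(40, 85), (50, 80), (48, 82), (70, 88), (45, 120),
        (55, 110), (35, 125), (68, 115), (75, 130)]
labels = ["high", "high", "high", "high", "low", "low", "low", "high", "high"]

tree = grow(rows, labels, max_depth=2)
print(classify(tree, (30, 84)))   # prints high
print(classify(tree, (50, 118)))  # prints low
```

On this sample the first question is "Is blood pressure at most 88?" and the second is "Is age at most 55?". With `max_depth=1` only the first question is kept, which shows how a tree that is too small loses the distinction between younger and older patients.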
|
|
Download my master's thesis on CART |
Working Papers
- A. Andriyashin, M. Benko, W. Härdle, R. Timofeev, U. Ziegenhagen
(2005): Color Harmonization in Car Manufacturing Processes, special
issue "Business, Industry and Government (BIG) Statistics" of Applied
Stochastic Models for Business and Industry, volume 22, issue 5-6,
pages 519-532, J. Wiley
- A. Andriyashin, R. Timofeev, W. Härdle (2007): Recursive Portfolio
Selection with Decision Trees, Discussion paper SFB 649
- Y. Golubev, W. Härdle, R. Timofeev (2007): Testing Monotonicity of
Pricing Kernels, submitted to Journal of Applied Econometrics
|
Talks
- Classification and Regression Trees in XploRe

- CART Implementation Issues

|
Projects
- Yxilon - A Modular Programming Language
Modern statistical computing requires
smooth integration of new algorithms and quantitative analysis results
into all sorts of platforms, such as web browsers and standard and proprietary
applications. With Yxilon we want to implement such a vertically
integrable, modular environment, providing the user with a rich set of
statistical methods and a variety of different interfaces to use these
methods. Yxilon will be the successor of XploRe, a complete statistical
engine developed at Humboldt-Universität zu Berlin. Over the course of
several projects with international partners, the function set of XploRe
was extended more and more at the cost of performance and
stability, aspects we want to improve in the upcoming Yxilon. The main
goals of Yxilon are:
- platform independence
- COM and client/server interfaces
- database functionality and connectivity
- multi-lingual user interfaces
- full support of all XploRe functions and packages
- support for XML
|
|
|