Day 19 - Regression trees

The predictive abstraction technique described in the last lecture was for one categorical predictor only. Can it be extended to more than one predictor? Today we explore one way, called recursive partitioning.

The idea behind recursive partitioning is to repeatedly select the most relevant predictor variable and use it to stratify the data. Within each stratum, we select the most relevant predictor on that data and sub-stratify until the strata are so small that we run out of data. The result is a stratification tree. By design, the responses within each stratum are as similar as possible. The usefulness of this procedure for data mining is that it finds the main subdivisions of the data as well as the most relevant predictors. It is not particularly good at providing a model for the data or making detailed predictions.

When predictors are categorical, there are different ways to implement this scheme. One way is to stratify across all values of the predictor. Another way, which works better, is to abstract the predictor into two values and stratify only across those two.

The relevance of a predictor is measured either by the sum of squares criterion or the sum of log-variance criterion, depending on your assumptions about the response variance. After abstraction into two bins, the predictor with the smallest sum of squares or smallest sum of log-variance is the most relevant.
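To make these criteria concrete, here is a minimal sketch in R of both scores for a given two-bin abstraction. The helper names within.ss and within.logvar are ours, not part of the course code; bin is a factor with two levels assigning each case to a bin:

within.ss <- function(y, bin)       # pooled sum of squares around each bin mean
  sum(tapply(y, bin, function(v) sum((v - mean(v))^2)))
within.logvar <- function(y, bin)   # sum of the log of each bin's variance
  sum(tapply(y, bin, function(v) log(var(v))))   # each bin needs at least 2 cases

The break.factor function used below searches over the possible two-bin abstractions of a predictor and reports the one with the smallest score.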

Virginia death rates

Here is an example. The dataset is death rates (per 1000 population) in Virginia in 1940, with three predictor variables: age, gender, and residence:
  Age Gender Residence Rate
50-54 Female     Urban  8.4
60-64 Female     Rural 20.3
50-54   Male     Rural 11.7
55-59   Male     Rural 18.1
70-74 Female     Urban 50.0
...
Using the function break.factor, we can abstract Age into two bins and measure the sum of squares:
> break.factor(x$Age, x$Rate, 2)
sum of squares = 2178.565 
[1] "50-54.55-59.60-64" "65-69.70-74"      

The binning which minimizes the sum of squares of the response is (50-64, 65-74). Note that by using the sum of squares we are assuming the response is normal and homoscedastic (it has a fixed variance). The variables Gender and Residence already have two levels, and their predictive sums of squares are 6577 and 7148, respectively. Rate does depend on all three variables, but Age is the most relevant. Therefore the tree should start by stratifying on Age.

Within the first stratum, ages 50-64, the most relevant variable is still Age. So we stratify Age some more. Within the 60-64 substratum, we find that Gender is now the most relevant predictor. And so on.
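To check this by hand, we can repeat the break.factor comparison within the stratum (a sketch; output omitted, and factor() is used to drop the age levels that do not occur in the subset):

> x1 <- x[x$Age %in% c("50-54","55-59","60-64"), ]
> break.factor(factor(x1$Age), x1$Rate, 2)
> break.factor(x1$Gender, x1$Rate, 2)
> break.factor(x1$Residence, x1$Rate, 2)

Age again gives the smallest sum of squares.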

The full tree can be generated with the tree command:

> tr <- tree(Rate ~ ., x, minsize=1)
> tr
node), split, n, yval
      * denotes terminal node

 1) root 20 30.92  
   2) Age: 50-54,55-59,60-64 12 17.95  
     4) Age: 50-54,55-59  8 13.99  
       8) Gender: Female  4 10.60 *
       9) Gender: Male  4 17.38 *
     5) Age: 60-64  4 25.88  
      10) Gender: Female  2 19.80 *
      11) Gender: Male  2 31.95 *
   3) Age: 65-69,70-74  8 50.38  
     6) Age: 65-69  4 40.40  
      12) Gender: Female  2 33.00 *
      13) Gender: Male  2 47.80  
        26) Residence: Rural  1 41.00 *
        27) Residence: Urban  1 54.60 *
     7) Age: 70-74  4 60.35  
      14) Gender: Female  2 52.15 *
      15) Gender: Male  2 68.55 *
The tree command always uses the sum of squares (same variance) criterion. As argued in the analysis above, the first stratification is between ages 50-64 and 65-74. After stratifying on Age, the most relevant variable in all strata is Gender. Finally, Residence appears but only in the (Age = 65-69, Gender = Male) stratum. In other strata, Residence makes so little impact on Rate that the tree function simply stops partitioning. The minsize=1 option was necessary for this dataset because of the way it was represented (only one data row per predictor combination).

Trees can also be plotted in the following way in S:

plot(tr,type="u");text(tr,pretty=0)

This type of plot is not recommended since it is confusing and hard to read. In particular, the label above each split gives the condition for the left child; the condition for the right child is not shown and must be inferred as the negation of the left child's condition. The numbers at the leaves are the mean response for each stratum.

Car prices

Here is a dataset of car information from the April 1990 issue of Consumer Reports:
                     Price Country   Type
Volkswagen Fox 4      7225    Brzl  Small
Porsche 944          41990    Grmn Sporty
Chrysler Imperial V6 25495     USA Medium
Buick Electra V6     20225     USA  Large
Ford Probe           11470     USA Sporty
...
We want to predict Price from the other two categorical variables. Here are the results of break.factor:
> break.factor(x$Country, x$Price, 2)
sum of squares = 5876419790 
[1] "Kore.Brzl.Mexc.J/US.USA.Japn.Frnc"
[2] "Swdn.Engl.Grmn"         

> break.factor(x$Type, x$Price, 2)
sum of squares = 5551739740 
[1] "Small.Van.Compact.Sporty" "Large.Medium"     

We find that Type has the lower sum of squares, so we should stratify on that variable first, using the bins Small.Van.Compact.Sporty and Large.Medium. Here is the full tree:
> tr <- tree(Price ~ ., x)
> tr
node), split, n, yval
      * denotes terminal node

 1) root 117 15740  
   2) Type: Compact,Small,Sporty,Van  80 13040  
     4) Country: Brzl,Frnc,Japn,J/US,Kore,Mexc,USA  69 11560  
       8) Type: Small  21  7629 *
       9) Type: Compact,Sporty,Van  48 13270  
        18) Country: J/US,Mexc,USA  29 12240 *
        19) Country: Frnc,Japn  19 14850 *
     5) Country: Grmn,Swdn  11 22320 *
   3) Type: Large,Medium  37 21600  
     6) Country: Frnc,Kore,USA  25 18700  
      12) Type: Large   7 21500 *
      13) Type: Medium  18 17610 *
     7) Country: Engl,Grmn,Japn,Swdn  12 27650 *
To get a list of the cases that fall into a particular stratum, use the function cases.tree. For example, suppose we want the cars under node 4:
> cases.tree(tr,"4")
 [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14
[15]  15  16  17  18  19  20  21  23  24  25  26  27  28  29
[29]  30  31  32  33  34  35  36  37  38  39  40  41  42  44
[43]  45  46  48  52  53  54  55  56  57  58  59  61  62  63
[57]  64  65  68  69 108 109 110 111 112 113 114 115 116
> y <- x[cases.tree(tr,"4"),]
y is now a reduced table containing only those cars. If we run break.factor on y, we find that Type is the most relevant predictor, which confirms the split made at node 4 into Type = Small versus Compact,Sporty,Van.
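For concreteness, the check looks like this (output omitted; factor() drops the Type and Country levels that do not occur under node 4):

> break.factor(factor(y$Type), y$Price, 2)
> break.factor(factor(y$Country), y$Price, 2)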

Mixed numerical and categorical predictors

Interestingly, the same technique for abstracting categorical predictors also works for numerical predictors. Given that you are going to divide the predictor range into two bins, you just need to find the breakpoint that minimizes the sum of squares in the two batches created. This sum of squares is directly comparable to that obtained from a categorical predictor, so you can use it to select which predictor to use in the tree. The ability to use categorical and numerical predictors equally is one of the attractive qualities of recursive partitioning.
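This breakpoint search is simple to write out. Here is a minimal sketch using the equal-variance (sum of squares) criterion; break.num is an illustrative name of ours, not one of the course functions. It scores every midpoint between adjacent sorted values of the predictor:

break.num <- function(x, y) {
  v <- sort(unique(x))
  mids <- (v[-1] + v[-length(v)]) / 2            # candidate breakpoints
  ss <- sapply(mids, function(b) {
    left <- y[x < b]; right <- y[x >= b]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cut = mids[which.min(ss)], ss = min(ss))  # best breakpoint and its score
}

On the Motor Trend data below, break.num(mtcars$wt, mtcars$mpg) should reproduce the wt = 2.26 breakpoint chosen at the root of the tree.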

Consider this dataset from Motor Trend magazine describing characteristics of 32 cars:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
...
We want to predict mpg (fuel efficiency) from the other predictors, which include engine displacement, rear axle ratio, number of forward gears, etc. A first-cut visualization of the data can be obtained by plotting mpg against each predictor individually:
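One way to draw such a matrix of plots in base R is sketched below (the original figure may have been produced with a different command):

par(mfrow=c(3,4))                                # one panel per predictor
for (v in setdiff(names(mtcars), "mpg"))
  plot(mtcars[[v]], mtcars$mpg, xlab=v, ylab="mpg")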

This matrix of plots shows that every variable seems to have something to do with fuel efficiency. But many of these relationships could disappear if we stratify the data according to the dominant trends. The most relevant variables appear to be disp, hp, and wt (weight). For these predictors, mpg shows the least vertical spread around the trend line, which is suggestive of a small sum of squares (but not conclusive). All are contenders for the root of the tree. It turns out that wt has the lowest sum of squares in predicting mpg, when you break at wt = 2.26. Here is the tree:
> tr <- tree(mpg ~ ., mtcars)
> tr
node), split, n, yval
      * denotes terminal node

 1) root 32 20.09  
   2) wt < 2.26  6 30.07 *
   3) wt > 2.26 26 17.79  
     6) cyl < 7 12 20.93  
      12) cyl < 5  5 22.58 *
      13) cyl > 5  7 19.74 *
     7) cyl > 7 14 15.1  
      14) hp < 192.5  7 16.79 *
      15) hp > 192.5  7 13.41 *
> plot(tr,type="uniform"); text(tr,pretty=0)

With numerical predictors, we can get a better idea of what the tree is doing by making a partition plot. First make a tree that uses only two predictor variables. In the tree above, wt and cyl are the most important, so we use those. Then run partition.tree or cplot to make a partition plot.
tr <- tree(mpg ~ wt+cyl, mtcars)
partition.tree(tr)
cplot(tr)

The partition plot shows the stratification cells defined over the two predictors. The mean response in each cell appears as a number. A cplot also shows the data from which the tree was built. The colors indicate the response value of each point: the response is abstracted into four balanced bins, ordered from low to high as black, red, green, and blue. We can immediately see that the first cut is on wt, and it separates several highly efficient cars (blue points) from the rest. The next two splits are on the number of cylinders.

Code

  1. To use trees in R, you must first install the tree package (one way to do this is shown after this list). This only needs to be done once. Please ask us if you need help with this. Splus has the tree function built-in; however, it tends to produce big trees, so you need to fiddle with the mindev parameter, as in tree(mpg ~ ., mtcars, mindev=0.1).

  2. For the additional routines which are used in this class, download and source one of these files: tree.r tree.s
    For it to work, you must also download the new version of clus1: clus1.r clus1.s

  3. Once you have done the above, every time you want to use trees in R, start by typing source("tree.r").
    You can have R automatically run commands for you when it starts up by putting those commands into a file called .Rprofile.
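As mentioned in item 1, the tree package only needs to be installed once. One way to do this from within R, assuming a network connection, is:

install.packages("tree")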

Functions used in this lecture: break.factor, tree, cases.tree, partition.tree, cplot.


Tom Minka
Last modified: Thu Oct 25 16:38:43 EDT 2001