For each categorical predictor, we can form a two-way contingency table relating it to the class variable. If the chi-square statistic of this table is small, then the predictor gives little information about the class: they are nearly independent. If the chi-square is large, then the predictor gives a lot of information about class. So we want to find the tables with large chi-square. Furthermore, as in regression trees, each split will be binary. So we need to merge predictor values (rows of the table) until only two remain. We already have a routine to do this: merge.table.
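As a small illustration of the scoring step only (not of merge.table), the chi-square statistic of a predictor-by-class table can be computed with base R's table and chisq.test. The toy vectors odor, shape and class and the helper chisq.stat are invented for this sketch; they are not the data or routines used in these notes.

```r
# Hypothetical toy data: one informative predictor, one uninformative.
set.seed(1)
n <- 200
class <- factor(sample(c("edible", "poisonous"), n, replace = TRUE))
# 'odor' mostly follows the class; 'shape' is drawn independently of it.
odor <- factor(ifelse(class == "poisonous",
                      sample(c("foul", "none"), n, replace = TRUE, prob = c(0.9, 0.1)),
                      sample(c("foul", "none"), n, replace = TRUE, prob = c(0.1, 0.9))))
shape <- factor(sample(c("convex", "flat"), n, replace = TRUE))

# Chi-square statistic of the two-way table of a predictor against the class.
chisq.stat <- function(predictor, class) {
  tab <- table(predictor, class)
  # correct = FALSE turns off the continuity correction used for 2x2 tables
  unname(chisq.test(tab, correct = FALSE)$statistic)
}

stat.odor <- chisq.stat(odor, class)    # large: odor is informative
stat.shape <- chisq.stat(shape, class)  # small: shape is nearly independent
print(c(odor = stat.odor, shape = stat.shape))
```

Ranking predictors by this statistic puts odor ahead of shape, which is the ordering the tree-growing procedure relies on.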
For a numerical predictor x, we exploit the fact that exactly two bins are required and simply search over all thresholds t for the one at which the binary variable (x < t versus x > t) has the largest dependence on the class. The dependence is again measured with a chi-square test.
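The threshold search can be sketched as follows. This is a minimal version under stated assumptions: the candidate thresholds are the midpoints between consecutive distinct values of x, and the helper names threshold.stat and best.threshold are made up here rather than taken from any library.

```r
# Chi-square statistic for the 2 x k table of (x < t) against the class.
threshold.stat <- function(x, class, t) {
  tab <- table(x < t, droplevels(factor(class)))
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Try every midpoint between consecutive distinct values of x and
# keep the one whose two-bin table has the largest chi-square.
best.threshold <- function(x, class) {
  v <- sort(unique(x))
  candidates <- (v[-1] + v[-length(v)]) / 2
  stats <- sapply(candidates, function(t) threshold.stat(x, class, t))
  list(threshold = candidates[which.max(stats)], statistic = max(stats))
}

# Toy data: the class flips from "No" to "Yes" around x = 0.5, with noise.
set.seed(1)
x <- runif(100)
y <- factor(x + rnorm(100) / 10 > 0.5, levels = c(FALSE, TRUE),
            labels = c("No", "Yes"))
fit <- best.threshold(x, y)
print(fit)
```

Because each candidate lies strictly between two observed values, both bins are always non-empty, so the expected counts in the table are never zero.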
The resulting algorithm: for each predictor, abstract into two bins. Choose the abstracted predictor with largest dependence on the class. Stratify and recurse.
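A rough sketch of this recursion, restricted to numeric predictors, is given below. It is not the implementation behind tree(): the function names grow, best.split and chisq.2way, and the stopping rule (the arbitrary cutoffs min.n and min.stat), are assumptions made for illustration. Categorical predictors would additionally need the merging step described above. The example uses R's built-in iris data.

```r
# Chi-square statistic of a two-bin variable against the class.
chisq.2way <- function(side, class) {
  tab <- table(side, droplevels(class))
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Best two-bin abstraction of one numeric predictor.
best.split <- function(x, class) {
  v <- sort(unique(x))
  if (length(v) < 2) return(list(stat = 0, threshold = NA))
  cand <- (v[-1] + v[-length(v)]) / 2
  stats <- sapply(cand, function(t) chisq.2way(x < t, class))
  list(stat = max(stats), threshold = cand[which.max(stats)])
}

# Abstract each predictor, pick the strongest, stratify, and recurse.
# Prints the tree as it goes and returns the number of leaves.
grow <- function(data, response, min.n = 10, min.stat = 10, indent = "") {
  class <- data[[response]]
  counts <- table(class)
  majority <- names(counts)[which.max(counts)]
  predictors <- setdiff(names(data), response)
  splits <- lapply(predictors, function(p) best.split(data[[p]], class))
  stats <- sapply(splits, function(s) s$stat)
  best <- which.max(stats)
  # Stop when the node is small or no predictor shows much dependence.
  if (nrow(data) < min.n || stats[best] < min.stat) {
    cat(indent, "predict ", majority, " (n = ", nrow(data), ")\n", sep = "")
    return(invisible(1))
  }
  p <- predictors[best]
  t <- splits[[best]]$threshold
  left <- data[[p]] < t
  cat(indent, p, " < ", format(t), "\n", sep = "")
  n.left <- grow(data[left, ], response, min.n, min.stat, paste0(indent, "  "))
  cat(indent, p, " > ", format(t), "\n", sep = "")
  n.right <- grow(data[!left, ], response, min.n, min.stat, paste0(indent, "  "))
  invisible(n.left + n.right)
}

leaves <- grow(iris, "Species")
```

Each split leaves both strata non-empty and strictly smaller than the parent, so the recursion always terminates.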
Classification trees are useful in data mining because they find the relevant variables and they are concise. They don't give precise predictive probabilities, however. We will discuss techniques for that later.
  cap.shape cap.surface cap.color bruises odor gill.attachment gill.spacing gill.size gill.color stalk.shape
1    convex     fibrous       red bruises none            free        close     broad     purple    tapering
  stalk.root stalk.surface.above.ring stalk.surface.below.ring stalk.color.above.ring stalk.color.below.ring
1    bulbous                   smooth                   smooth                   gray                   pink
  veil.type veil.color ring.number ring.type spore.print.color population habitat   class
1   partial      white         one   pendant             brown    several   woods edible.

Here is the resulting classification tree:
> tr <- tree(class~.,x)
> tr
node), split, n, yval, (yprob)
      * denotes terminal node

1) root 3803 edible. ( 0.623192217 0.376807783 )
  2) odor: almond,anise,none 2420 edible. ( 0.979338843 0.020661157 )
    4) spore.print.color: black,brown,purple,white 2379 edible. ( 0.996216898 0.003783102 )
      8) population: abundant,numerous,scattered,several,solitary 2370 edible. ( 1.000000000 0.000000000 ) *
      9) population: clustered 9 poisonous. ( 0.000000000 1.000000000 ) *
    5) spore.print.color: green 41 poisonous. ( 0.000000000 1.000000000 ) *
  3) odor: creosote,foul,musty,pungent 1383 poisonous. ( 0.000000000 1.000000000 ) *
> plot(tr,type="uniform"); text(tr,pretty=0,label="yprob")
> iris[1:5,]
Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
         6.4         2.8          5.6         2.2  virginica
         5.8         2.7          3.9         1.2 versicolor
         5.7         2.5          5.0         2.0  virginica
         5.5         2.4          3.8         1.1 versicolor
         5.6         2.9          3.6         1.3 versicolor
...
> tr <- tree(Species ~., iris)
> plot(tr,type="uniform"); text(tr,pretty=0,label="yprob")
To get a picture of what is going on, we can focus on Petal.Length and Petal.Width:
tr <- tree(Species ~ Petal.Length+Petal.Width, iris)
cplot(tr)
x <- sort(runif(100))
y <- x+rnorm(100)/10
y <- factor(y > 0.5)
levels(y) <- c("No","Yes")
tr <- tree(y~x)
cplot(tr)
  acct.len area intl vmail vmail.msgs day.mins day.calls day.charge eve.mins eve.calls eve.charge night.mins
1      128  415   no   yes         25    265.1       110      45.07    197.4        99      16.78      244.7
  night.calls night.charge intl.mins intl.calls intl.charge svc.calls churn
1          91        11.01        10          3         2.7         1    No

Some of these predictors are categorical and some numeric. Here is the resulting tree:
tr <- tree(churn~.,x)
plot(tr,type="uniform"); text(tr,label="yprob",pretty=0)
This classification scheme is quite sophisticated. Let's analyze it piece by piece.
On the right side of the tree we see that people with a large number of daytime and evening minutes tend to churn, but only if they have a small number of voicemail messages. Here is the result of predicting churn only on day and evening minutes, restricted to those with a small amount of voicemail:
i <- (x$vmail.msgs < 6.5)
tr2 <- tree(churn~eve.mins+day.mins,x[i,])
cplot(tr2)
In the middle of the tree we see that people who make many service calls yet have a small number of day minutes are likely to churn. These may be people who are just setting up their accounts. Here is a tree using only these two variables:
tr2 <- tree(churn~svc.calls+day.mins,x)
cplot(tr2)
On the left side of the tree we see an interaction between the number of international minutes and the number of international calls. We can focus in on this interaction by building a tree for the data that fall under node 9:
y <- x[cases.tree(tr,"9"),]
tr2 <- tree(churn~intl.calls+intl.mins,y)
cplot(tr2)
These plots are not just to check that the tree is correct. The idea is to use the tree as a springboard for a more detailed analysis that the tree cannot provide.