Day 23 - Examples of performance assessment

Null case

Last time I gave the example of a response which had nothing to do with the predictors. Here is R code for the example:
# make the dataset
n <- 100
y <- factor(sample(c(0,1),n,replace=TRUE))
x <- data.frame(f1=runif(n),f2=runif(n),y=y)

library(tree)   # load the tree-fitting package
tr <- tree(y~.,x,minsize=1)
cplot(tr)
misclass.tree(tr)

The tree has zero misclassifications on the dataset. But that is a very misleading result. Let's use the holdout method to get a fair estimate. The command rsplit will randomly split a data frame into a list of two smaller frames. We will name those list elements "train" and "test" (instead of 1 and 2).
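rsplit is part of the course code library. As a rough sketch of what it does (my assumption about its behavior, not the actual source), it could be written as:

```r
# Hypothetical sketch of rsplit: randomly partition the rows of a data
# frame into two frames, the first containing roughly a fraction p of them.
rsplit <- function(x, p) {
  i <- sample(nrow(x), round(p*nrow(x)))
  list(x[i, , drop=FALSE], x[-i, , drop=FALSE])
}
```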
x <- rsplit(x,0.5)
names(x) <- c("train","test")
tr <- tree(y~.,x$train,minsize=1)
cplot(tr)
misclass.tree(tr)

This tree, trained using half of the data points, also has zero misclassifications on its training set. To test it, use misclass.tree with a second argument:
misclass.tree(tr,x$test)
The result is 24 errors out of 50, for an error rate of 0.48. This is much closer to the truth, which we know is 0.5. A full confidence interval on the error can be obtained in the usual way:
nx <- misclass.tree(tr,x$test)
n <- nrow(x$test)
p <- nx/n
se <- sqrt(p*(1-p)/n)
z <- 1.96
cat("Error rate is in (",p-z*se,",",p+z*se,") with 95% confidence\n")
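The interval above uses the normal approximation, which is reasonable at n = 50. For small test sets, base R's binom.test gives an exact interval from the same counts:

```r
# Exact 95% confidence interval on the error rate, from 24 errors in 50 trials.
binom.test(24, 50)$conf.int
```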

Simple tree pruning

x <- read.table("ex20.dat")
tr <- tree(y~.,x)
cplot(tr)

This tree has many splits which seem dependent on the vagaries of this dataset. We would like to get a simpler tree which only contains the reliable splits. The function cv.tree runs cross-validation to find the best tree size.
plot(cv.tree(tr),type="o")

cv.tree uses 10 blocks by default, so for each of these six sizes 10 trees were trained, for a total of 60 trees to make this plot. The performance measure used here is deviance, not misclassification. Deviance is the log-probability of the data, multiplied by -2. It is similar to misclassification except it includes the confidence that the tree makes in its classifications. A tree which is confident when it is right and not confident when it is wrong will have low deviance.
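To make deviance concrete, here is a small numeric illustration (the probabilities are made up): two classifiers that make the same single mistake, but differ in confidence.

```r
# Predicted probability of the TRUE class for four cases; both classifiers
# misclassify only the last case (its probability is below 0.5).
overconfident <- c(0.95, 0.95, 0.95, 0.05)  # sure even when wrong
cautious      <- c(0.95, 0.95, 0.95, 0.45)  # unsure on the hard case
-2 * sum(log(overconfident))  # about 6.3
-2 * sum(log(cautious))       # about 1.9
```

Both make one error in four, but the cautious classifier has much lower deviance, which is why deviance is a finer-grained measure than misclassification.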

Since 4 leaves appears best, we could regrow the tree and stop when there are 4 leaves. Equivalently, we can run the tree growing procedure backwards, by merging leaves instead of splitting them. This is called pruning and is done by prune.tree:

tr2 <- prune.tree(tr,best=4)
cplot(tr2)

Both cross-validation and pruning can be done in one step via the routine best.size.tree:

tr2 <- best.size.tree(tr,10)
The second argument is the number of blocks to use for cross-validation, which defaults to 10.
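best.size.tree is also a course routine. Assuming it simply chains the two steps we just ran, its logic might look like this (a sketch under that assumption, not the actual source):

```r
# Hypothetical sketch of best.size.tree: cross-validate, find the size
# with the lowest deviance, and prune the tree to that size.
best.size.tree <- function(tr, k = 10) {
  cv <- cv.tree(tr, K = k)             # CV deviance at each candidate size
  best <- cv$size[which.min(cv$dev)]   # size minimizing CV deviance
  prune.tree(tr, best = best)
}
```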

Churn dataset

Here is an example of cross-validation and holdout on the churn dataset:
x <- rsplit(churn,0.5)
names(x) <- c("train","test")

tr <- tree(churn~.,x$train,mindev=0)
plot(tr,type="uniform"); text(tr,pretty=0)
misclass.tree(tr)
# deviance of all prunings
p <- prune.tree(tr)
plot(p,type="o")

tr2 <- best.size.tree(tr)

p <- prune.tree(tr,newdata=x$test)
plot(p,type="o")
cat("best size on test is",p$size[which.min(p$dev)],"leaves\n")

best.size.tree picks 14 leaves as the best size. From running prune.tree on the test data we see that 18 is really best, though 14 is still pretty good. So cross-validation has helped us choose a good tree size. Note that we cannot use the result of prune.tree on the test data to choose a tree size, only to evaluate a size that was already picked by best.size.tree. If you use the test data to choose tree size, you are violating the holdout method.
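Having fixed the size with cross-validation, the fair final step is to measure the chosen tree's error on the test set, using the course's two-argument misclass.tree as before:

```r
# Fair holdout estimate for the tree chosen by cross-validation: the test
# set played no part in picking the size, so this error rate is unbiased.
nx <- misclass.tree(tr2, x$test)
cat("test error rate is", nx/nrow(x$test), "\n")
```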

Code

Functions introduced in this lecture: rsplit, misclass.tree (with a test-set argument), cv.tree, prune.tree, best.size.tree.
Tom Minka
Last modified: Tue Apr 25 10:13:45 GMT 2006