Day 22 - Performance assessment
Once we have grown a classification tree from a dataset, how do we know
that the tree represents properties of the entire population and not just
the sample? In general, how much should we trust the results of data
mining? These are the questions addressed today.
One way to quantify whether a tree represents the right things is by its
predictive ability on future data, e.g. its error rate. If it can predict the
class of new data points accurately, then it must capture something
important about the population. How can we estimate the population error
rate? A naive approach is to measure the error rate on the training set
and assume that these errors are independent Bernoulli trials, resulting in
a confidence interval on the population error rate.
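Concretely, if that assumption held and there were e errors among n
training points, the estimated error rate would be e/n, and the usual
normal approximation to the binomial would give an approximate 95% interval
of

  e/n +/- 1.96 * sqrt( (e/n) * (1 - e/n) / n )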
This approach is
unreliable for a variety of reasons, mainly:
- The training data could have had errors.
- The training data could be biased, i.e. not a simple random sample
from the full population. Note that this applies equally to large and
small training sets; it is a separate issue from the
training set being too small.
- The training set errors are not independent Bernoulli trials, because
the classifier was designed specifically to minimize the number of training
set errors.
The solution to the first problem is to check and clean the data.
The solution to the second problem is either to correct for the bias using
advanced statistical procedures or to re-interpret "the population" in such
a way that the training set is a simple random sample from that population.
The solution to the third problem is what we will discuss.
To illustrate the problem, consider a situation where the predictors are
independent of the response and the two classes are equally common. Any
tree that you construct from the training data must have a 50% error rate
on future data. If the training points all
have distinct predictor values, then a big enough tree could put each one
into a separate leaf and get zero error on the training set, regardless of
how many points there are. So training set error rate is not a reliable
estimate of population error rate.
This problem is often called "overfitting", because the big tree over-fits
the data. But technically the problem exists no matter how big the tree
is and no matter how complex or simple the classifier is.
It is just especially bad for big trees.
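To see this numerically, here is a minimal simulation sketch, assuming
Python with numpy and scikit-learn (neither is part of these notes): the
predictors are pure noise, yet an unrestricted tree drives the training
error to zero while the test error stays near 50%.

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  rng = np.random.default_rng(0)

  # Predictors are independent of the class label: pure noise.
  n_train, n_test, n_features = 200, 2000, 5
  X_train = rng.normal(size=(n_train, n_features))
  y_train = rng.integers(0, 2, size=n_train)
  X_test = rng.normal(size=(n_test, n_features))
  y_test = rng.integers(0, 2, size=n_test)

  # An unrestricted tree puts every training point in its own leaf.
  tree = DecisionTreeClassifier()
  tree.fit(X_train, y_train)

  print("training error:", 1 - tree.score(X_train, y_train))  # ~0.0
  print("test error:    ", 1 - tree.score(X_test, y_test))    # ~0.5

Limiting the tree size shrinks the gap between the two error rates, but as
noted above it does not remove the problem.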
How well did I learn?
There are two main approaches to estimating the population error rate. The
first approach is model-based and falls under the name of computational
learning theory. If the classifier has been learned according to a
fixed procedure, then it is theoretically possible to construct a sampling
distribution for the training error rate given the population error rate
and other quantities like the training set size, the classification tree
size, the number of predictors available, etc. From the sampling
distribution one can produce a confidence interval on the population error
rate. In practice, the sampling distribution is extremely difficult to
construct, and this is still an area of research. For example, many
decisions are made along the way to growing a tree: what variable to split
on, how to split it, etc. Each of these choices allows the tree to
over-fit and therefore influences the sampling distribution.
The second approach is empirical. It is called the holdout method
because it requires you to hold out part of the dataset from training. You
split the original dataset into a training set and a test set. The
classifier is designed using only the training set. Then it is evaluated
on the test set, which has been carefully locked away. The errors on the
test set are independent Bernoulli trials, so you can get a confidence
interval on the population error rate in the usual way.
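Here is a minimal sketch of the holdout method, assuming Python with numpy
and scikit-learn; the breast-cancer demo dataset and the one-third holdout
fraction are arbitrary stand-ins:

  import numpy as np
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier

  # A stand-in dataset; any classification dataset would do.
  X, y = load_breast_cancer(return_X_y=True)

  # Lock away one third of the data as a test set.
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=1/3, random_state=0)

  # Design the classifier using only the training set.
  tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
  tree.fit(X_train, y_train)

  # Evaluate once on the held-out test set.
  n = len(y_test)
  p_hat = np.mean(tree.predict(X_test) != y_test)

  # Test set errors are independent Bernoulli trials, so the usual
  # normal-approximation 95% confidence interval applies.
  half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
  print(f"test error rate {p_hat:.3f} +/- {half_width:.3f}")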
The holdout method does not require the classifier to have been learned
according to a fixed procedure. This is important for data mining, which
typically does not follow a predefined routine and instead involves
visualization and serendipity. As long as you did not use the test set for
any of the exploratory analyses and visualizations, it doesn't matter what
path you took to get the classifier.
The holdout method answers the question "how well did I learn?". It
evaluates a given tree. It doesn't tell us how we should have altered the
process of learning the tree in order to get better performance.
That question is considered next.
How should I learn?
One of the main decisions in growing a tree is when to stop.
Intuitively, we should stop when the bins have too little data in them.
But how much is too little? To answer that, we need some quantitative
expectation of how we would perform with a big tree instead of a small
tree.
Computational learning theory is one way to do that.
The other way is empirical.
Cross-validation is an empirical method for estimating the
expected performance of different learning strategies on a given dataset.
Cross-validation works as follows. Let alpha be some parameter which
controls learning, such as the desired number of tree leaves.
- For each particular value alpha-hat, estimate the expected performance of
  using alpha = alpha-hat.
- Use the alpha-hat with best expected performance.
To estimate the expected performance of alpha = alpha-hat, we can apply the
holdout method within the training set itself:
- Break the training set into k blocks of equal size.
- For each block,
  - train a classifier on everything except that block,
  - test on the block.
  The result is an unbiased performance estimate of that classifier.
- Because there are k blocks, we will train k classifiers and get k
  performance estimates. Average the performance estimates together to get
  the expected performance of alpha = alpha-hat.
In our case, "performance" means "error rate over the
population".
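Here is a minimal sketch of this procedure, again assuming Python with
numpy and scikit-learn; alpha is taken to be the maximum number of tree
leaves, and the candidate values, the stand-in dataset, and k = 5 are
arbitrary choices for illustration:

  import numpy as np
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import KFold
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_breast_cancer(return_X_y=True)  # stand-in training set
  k = 5
  candidate_alphas = [2, 4, 8, 16, 32, 64]    # desired number of tree leaves

  best_alpha, best_error = None, np.inf
  for alpha in candidate_alphas:
      fold_errors = []
      # Break the training set into k blocks; train on everything except
      # one block, test on that block, and repeat for each block.
      folds = KFold(n_splits=k, shuffle=True, random_state=0)
      for train_idx, test_idx in folds.split(X):
          tree = DecisionTreeClassifier(max_leaf_nodes=alpha, random_state=0)
          tree.fit(X[train_idx], y[train_idx])
          fold_errors.append(np.mean(tree.predict(X[test_idx]) != y[test_idx]))
      # Average the k estimates to get the expected error for this alpha.
      mean_error = np.mean(fold_errors)
      if mean_error < best_error:
          best_alpha, best_error = alpha, mean_error

  print("best alpha:", best_alpha, "estimated error:", round(best_error, 3))

The alpha with the best cross-validated error is then used to grow the
final tree on the full training set.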
Examples of this procedure will be given next time.
References
For the current state of computational learning theory, see the COLT page.
Tom Minka
Last modified: Sun Oct 14 20:44:44 Eastern Daylight Time 2001