Day 22 - Performance assessment
Once we have grown a classification tree from a dataset, how do we know
that the tree represents properties of the entire population and not just
the sample? In general, how much should we trust the results of data
mining? These are the questions addressed today.
One way to quantify whether a tree represents the right things is by its
predictive ability on future data, e.g. its error rate. If it can predict the
class of new data points accurately, then it must capture something
important about the population. How can we estimate the population error
rate? A naive approach is to measure the error rate on the training set
and assume that these errors are independent Bernoulli trials, resulting in
a confidence interval on the population error rate.
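Concretely, if that assumption held and there were e errors among n
training points, the estimated error rate would be e/n, and the usual
normal approximation to the binomial would give an approximate 95% interval
of

  e/n +/- 1.96 * sqrt( (e/n) * (1 - e/n) / n )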
This approach is
unreliable for a variety of reasons, mainly:
- The training data could have had errors.
- The training data could be biased, i.e. not a simple random sample
from the full population. Note that this applies equally to large and
small training sets; it is a separate issue from the
training set being too small.
- The training set errors are not independent Bernoulli trials, because
the classifier was designed specifically to minimize the number of training
set errors.
The solution to the first problem is to check and clean the data.
The solution to the second problem is either to correct for the bias using
advanced statistical procedures or to re-interpret "the population" in such
a way that the training set is a simple random sample from that population.
The solution to the third problem is what we will discuss.
To illustrate the problem, consider a situation where the predictors are
independent of the response and the two classes are equally common. Any
tree that you construct from the training data must have a 50% error rate
on future data. If the training points all
have distinct predictor values, then a big enough tree could put each one
into a separate leaf and get zero error on the training set, regardless of
how many points there are. So training set error rate is not a reliable
estimate of population error rate.
This problem is often called "overfitting", because the big tree over-fits
the data. But technically the problem exists no matter how big the tree
is and no matter how complex or simple the classifier is.
It is just especially bad for big trees.
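To see this numerically, here is a minimal simulation sketch, assuming
Python with numpy and scikit-learn (neither is part of these notes): the
predictors are pure noise, yet an unrestricted tree drives the training
error to zero while the test error stays near 50%.

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  rng = np.random.default_rng(0)

  # Predictors are independent of the class label: pure noise.
  n_train, n_test, n_features = 200, 2000, 5
  X_train = rng.normal(size=(n_train, n_features))
  y_train = rng.integers(0, 2, size=n_train)
  X_test = rng.normal(size=(n_test, n_features))
  y_test = rng.integers(0, 2, size=n_test)

  # An unrestricted tree puts every training point in its own leaf.
  tree = DecisionTreeClassifier()
  tree.fit(X_train, y_train)

  print("training error:", 1 - tree.score(X_train, y_train))  # ~0.0
  print("test error:    ", 1 - tree.score(X_test, y_test))    # ~0.5

Limiting the tree size shrinks the gap between the two error rates, but as
noted above it does not remove the problem.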
How well did I learn?
There are two main approaches to estimating the population error rate. The
first approach is model-based and falls under the name of computational
learning theory. If the classifier has been learned according to a
fixed procedure, then it is theoretically possible to construct a sampling
distribution for the training error rate given the population error rate
and other quantities like the training set size, the classification tree
size, the number of predictors available, etc. From the sampling
distribution one can produce a confidence interval on the population error
rate. In practice, the sampling distribution is extremely difficult to
construct, and this is still an area of research. For example, many
decisions are made along the way to growing a tree: what variable to split
on, how to split it, etc. Each of these choices allows the tree to
over-fit and therefore influences the sampling distribution.
The second approach is empirical. It is called the holdout method
because it requires you to hold out part of the dataset from training. You
split the original dataset into a training set and a test set. The
classifier is designed using only the training set. Then it is evaluated
on the test set, which has been carefully locked away. The errors on the
test set are independent Bernoulli trials, so you can get a confidence
interval on the population error rate in the usual way.
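Here is a minimal sketch of the holdout method, assuming Python with numpy
and scikit-learn; the breast-cancer demo dataset and the one-third holdout
fraction are arbitrary stand-ins:

  import numpy as np
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier

  # A stand-in dataset; any classification dataset would do.
  X, y = load_breast_cancer(return_X_y=True)

  # Lock away one third of the data as a test set.
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=1/3, random_state=0)

  # Design the classifier using only the training set.
  tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
  tree.fit(X_train, y_train)

  # Evaluate once on the held-out test set.
  n = len(y_test)
  p_hat = np.mean(tree.predict(X_test) != y_test)

  # Test set errors are independent Bernoulli trials, so the usual
  # normal-approximation 95% confidence interval applies.
  half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
  print(f"test error rate {p_hat:.3f} +/- {half_width:.3f}")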
The holdout method does not require the classifier to have been learned
according to a fixed procedure. This is important for data mining, which
typically does not follow a predefined routine and instead involves
visualization and serendipity. As long as you did not use the test set for
any of the exploratory analyses and visualizations, it doesn't matter what
path you took to get the classifier.
The holdout method answers the question "how well did I learn?". It
evaluates a given tree. It doesn't tell us how we should have altered the
process of learning the tree in order to get better performance.
That question is considered next.
How should I learn?
One of the main decisions in growing a tree is when to stop.
Intuitively, we should stop when the bins have too little data in them.
But how much is too little? To answer that, we need some quantitative
expectation of how we would perform with a big tree instead of a small
tree.
Computational learning theory is one way to do that.
The other way is empirical.
Cross-validation is an empirical method for estimating the
expected performance of different learning strategies on a given dataset.
Cross-validation works as follows. Let alpha be some parameter which
controls learning, such as the desired number of tree leaves.
- For each particular value alpha-hat, estimate the expected performance of
  using alpha = alpha-hat.
- Use the alpha-hat with best expected performance.
To estimate the expected performance of alpha = alpha-hat, we can apply the
holdout method within the training set itself:
- Break the training set into k blocks of equal size.
- For each block,
  - train a classifier on everything except that block,
  - test on the block.
  The result is an unbiased performance estimate of that classifier.
- Because there are k blocks, we will train k classifiers and get k
  performance estimates. Average the performance estimates together to get
  the expected performance of alpha = alpha-hat.
In our case, "performance" means "error rate over the
population".
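Here is a minimal sketch of this procedure, again assuming Python with
numpy and scikit-learn; alpha is taken to be the maximum number of tree
leaves, and the candidate values, the stand-in dataset, and k = 5 are
arbitrary choices for illustration:

  import numpy as np
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import KFold
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_breast_cancer(return_X_y=True)  # stand-in training set
  k = 5
  candidate_alphas = [2, 4, 8, 16, 32, 64]    # desired number of tree leaves

  best_alpha, best_error = None, np.inf
  for alpha in candidate_alphas:
      fold_errors = []
      # Break the training set into k blocks; train on everything except
      # one block, test on that block, and repeat for each block.
      folds = KFold(n_splits=k, shuffle=True, random_state=0)
      for train_idx, test_idx in folds.split(X):
          tree = DecisionTreeClassifier(max_leaf_nodes=alpha, random_state=0)
          tree.fit(X[train_idx], y[train_idx])
          fold_errors.append(np.mean(tree.predict(X[test_idx]) != y[test_idx]))
      # Average the k estimates to get the expected error for this alpha.
      mean_error = np.mean(fold_errors)
      if mean_error < best_error:
          best_alpha, best_error = alpha, mean_error

  print("best alpha:", best_alpha, "estimated error:", round(best_error, 3))

The alpha with the best cross-validated error is then used to grow the
final tree on the full training set.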
Examples of this procedure will be given next time.
References
For the current state of computational learning theory, see the COLT page.
Tom Minka
Last modified: Sun Oct 14 20:44:44 Eastern Daylight Time 2001