Day 24 - Classification uncertainty

So far our discussion of performance assessment has focused on error rate: the fraction of times that the true class is different from the class predicted by the tree classifier. But trees give more information than a single class prediction: they give a probability distribution over possible classes, which reflects the classifier's confidence. These confidence values are important in real situations where different outcomes lead to different costs (not just "right" or "wrong"). Hence we will discuss methods to assess the quality of a classifier's confidence values and ways to improve them.

Minimizing classification cost

If a classification decision is simply "right" or "wrong", then we just need to choose the class with highest probability. This is what we do when measuring error rate. But if classification decisions produce varying costs, we must weigh the options more carefully.

Let Cost(Ck | Cj) be the cost of predicting Ck when the truth is Cj. Varying costs means that different mistakes have different costs; in particular, Cost(Ck | Cj) need not equal Cost(Cj | Ck). For observation x, the classifier gives class probabilities p(Cj | x). Then the expected cost of predicting Ck is:

Cost(Ck | x) = sum_j Cost(Ck | Cj) p(Cj | x)
The best prediction minimizes expected cost.

For example, suppose this is the cost matrix:
             predict C1   predict C2
  truth C1        0           10
  truth C2        1            0
We run an observation x1 through the tree and wind up with class probabilities (0.9, 0.1). The most likely class is C1. But which prediction minimizes cost? The expected cost of predicting C1 is 0.9*0 + 0.1*1 = 0.1, while the expected cost of predicting C2 is 0.9*10 + 0.1*0 = 9. Hence we should predict C1. This makes sense since the cost matrix is biased against predicting C2: wrongly predicting C2 costs 10, while wrongly predicting C1 costs only 1.

Suppose another observation x2 winds up with class probabilities (0.4, 0.6). The most likely class is C2, but it is not the most cost-effective decision. The expected cost of predicting C1 is 0.4*0 + 0.6*1 = 0.6, while the expected cost of predicting C2 is 0.4*10 + 0.6*0 = 4. In fact, predicting C2 is only cheaper when 10*p(C1 | x) < 1*p(C2 | x), i.e. when p(C2 | x) > 10/11, roughly 0.91. The probability of C2 must be very high before C2 is a cost-effective prediction.
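
Here is the same calculation written out as a small R sketch (R being the language of cv.tree mentioned below); the cost matrix and class probabilities are the ones from the example above.

# Cost matrix from the example: cost[j,k] = Cost(Ck | Cj),
# i.e. rows are the true class and columns are the prediction.
cost <- matrix(c(0, 10,
                 1,  0),
               nrow = 2, byrow = TRUE,
               dimnames = list(truth = c("C1","C2"), predict = c("C1","C2")))

# Expected cost of each prediction: Cost(Ck | x) = sum_j Cost(Ck | Cj) p(Cj | x)
expected.cost <- function(p) drop(p %*% cost)

expected.cost(c(0.9, 0.1))   # x1: 0.1 for C1 vs 9 for C2, so predict C1
expected.cost(c(0.4, 0.6))   # x2: 0.6 for C1 vs 4 for C2, so still predict C1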

Measuring performance

There are two main ways to measure the quality of a classifier's confidence values. If the problem has a cost structure, then you can use cost instead of error rate. For example, in the holdout method you would measure the cost on the test set instead of the number of errors. The cost on the test set is:
sum_i Cost(prediction for x_i | true class of x_i)
(In class I gave a different formula which is the expected cost on the test set.) This formula can also be used in cross-validation. A drawback of this formula is that it doesn't evaluate whether the classifier's cost estimates are actually correct---only whether they succeed in selecting the minimum-cost prediction.
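
As an illustration, here is a sketch of this computation in R with the tree package; the fitted tree fit, the held-out data frame test, and its label column test$y are assumed names, and cost is the cost matrix defined above (rows indexed by the true class, columns by the prediction, in the same class order as the tree's probability output).

# Sketch: total cost on a test set (assumed objects: fit, test, test$y, cost).
prob <- predict(fit, test, type = "vector")        # class probabilities, one row per test case
# expected cost of every possible prediction for every case
exp.cost <- prob %*% cost
# choose the minimum-expected-cost prediction for each case
pred <- colnames(cost)[apply(exp.cost, 1, which.min)]
# sum_i Cost(prediction for x_i | true class of x_i)
test.cost <- sum(cost[cbind(as.character(test$y), pred)])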

If the cost structure is unavailable or uncertain at the time the classifier is designed, there is another approach called deviance. Here the confidence values are evaluated directly. Given a test set, the formula for deviance is:

-2 sum_i log p(true class of x_i | x_i)
In other words, for each test point you get the class probabilities from the tree and take the logarithm of the probability assigned to the true class; then add up these log-probabilities across the test set and multiply by -2. Lower deviance is better: the measure rewards trees which give high confidence when they are right and low confidence when they are wrong. The best situation is when the classifier always gives probability 1 to the true class, in which case the deviance is zero. Note that it is possible for a classifier with lower error rate to have higher deviance, if its confidence values are not realistic.
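
With the same assumed objects fit and test as in the cost sketch above, the deviance formula takes a few lines of R:

# Sketch: deviance on a test set (assumed objects: fit, test, test$y).
prob <- predict(fit, test, type = "vector")       # class probabilities per test case
j <- match(as.character(test$y), colnames(prob))  # column holding the true class
p.true <- prob[cbind(seq_len(nrow(prob)), j)]     # p(true class of x_i | x_i)
dev <- -2 * sum(log(p.true))                      # smaller is better; 0 is perfect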

As mentioned in the day23 lecture notes, cv.tree uses the deviance measure by default, and can be instructed to use classification error instead (as was done in class).
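
For reference, the two choices look like this in R, assuming fit is a classification tree grown on the training set:

library(tree)
cv.tree(fit)                         # cross-validation scored by deviance (the default)
cv.tree(fit, FUN = prune.misclass)   # scored by classification error instead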

Computer intrusion detection

One application where costs are important (and trees have been successful) is computer intrusion detection. Earlier in day3 we discussed an approach based on user profiling. That approach is somewhat invasive and catches only certain kinds of attacks, though it is very effective at finding outliers. Classification trees offer a more general approach which is good when you know what you are looking for. The idea is to continuously snoop the network and look for patterns that are suggestive of an attack. This problem is of particular interest to security centers like CERT (at CMU) as well as government agencies like DARPA who deal with cyberwarfare and cyberterrorism. Computer intrusion detection was the subject of a recent competition between machine learning algorithms (KDD Cup'99).

Costs are important here because most network connections are normal, under almost any stratification of the data. If you simply predicted the most likely situation, you would never predict an attack. But sometimes you should predict an attack anyway, because the cost of missing an attack is much higher than the cost of falsely detecting one. To win the competition, your classifier had to make decisions which minimized overall cost.

The winner of the contest used trees with a special enhancement called "bagging" which improved the quality of the confidence values and thereby improved the selection of the minimal-cost decision. This method is described next.

Improving confidence values

To improve confidence values, we have to think more carefully about our uncertainty in the class probabilities, both locally to each leaf and globally across the whole tree. Locally, we know from day3 that the number of occurrences of class 1 in the leaf divided by the number of data points in the leaf is not a very good estimate of the probability of class 1, especially when the counts are small. This is a problem since the tree growing procedure encourages class counts to be small, and it especially likes to have zero counts in a leaf. But just because a count is zero doesn't mean the probability is zero. If the tree predicts zero probability for a class that is actually true for a given case, then the deviance is infinity. So we want to use a better estimate. The corrected estimate given on day3 is a good choice. Because the leaves cover different amounts of space, instead of adding 1 it may be better to add an area-scaled constant, as done on day7.
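
The day3 and day7 corrections are not restated here, but if the day3 correction is taken to be the familiar add-one (Laplace) estimate, a sketch of the idea in R is:

# Sketch of a smoothed leaf estimate, assuming the correction is the add-one rule.
# counts is the vector of class counts in a leaf.
smoothed.prob <- function(counts, add = 1) (counts + add) / sum(counts + add)

smoothed.prob(c(5, 0))   # raw estimate (1, 0); smoothed (6/7, 1/7), never exactly zero
# An area-scaled correction, as on day7, would replace add = 1 with a constant
# proportional to the volume of space covered by the leaf.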

Globally, we need to realize that the particular stratification chosen by the tree growing procedure is not necessarily the best one. There are usually many tree structures which are almost as good, and each alternative tree leads to different class probabilities, sometimes slightly different and sometimes very different. So our confidence values should incorporate our uncertainty about the tree structure. For example, suppose two trees fit the training data equally well. On a test case, tree 1 gives class probabilities (0.7,0.3) and tree 2 gives class probabilities (0.6,0.4). Since we have no reason to favor either tree, we should treat them as competing opinions and use the average probabilities (0.65,0.35). In general, we should average the predictions of all competing trees. If the trees are not equally good on the training data, then we should still average but using a weighted average.

A simple and general way to form a weighted average of the possible classifiers for a given training set is to use bagging, which is short for bootstrap aggregating (averaging over bootstrap samples). In this approach, you draw many bootstrap samples of the training data, i.e. random samples drawn with replacement, and train a classifier on each one in the normal way. Typically this gives you many different classifiers which are all reasonable classifiers for the training set. That is the bootstrapping part. To get predictive probabilities on a test case, you average the probabilities given by each of the trees. Interestingly, a weighted average is not needed, because good classifiers will automatically show up more often in the collection and thereby get more sway. Classifiers which are best on the entire training set will tend to be best on subsets of the training set as well.
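
A minimal R sketch of bagging with the tree package follows; the training frame train with class label y, the test frame test, and the number of bootstrap rounds are all assumptions, not part of the original notes.

library(tree)

# Sketch: bagged class probabilities (assumed objects: train, train$y, test).
n.boot <- 50
avg.prob <- 0
for (b in 1:n.boot) {
  # draw a bootstrap sample of the training set (same size, with replacement)
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit.b <- tree(y ~ ., data = boot)
  # accumulate the average of the class probabilities across the bootstrap trees
  avg.prob <- avg.prob + predict(fit.b, test, type = "vector") / n.boot
}
# avg.prob now holds the bagged confidence values; they can be plugged into the
# expected-cost rule above to choose the minimal-cost prediction.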

Bootstrapping is also useful for sensitivity analysis, i.e. seeing which parts of the tree are resistant to changes in the data. You simply look at how the tree structures vary over the collection.

References

A general discussion of performance assessment and characterizing uncertainty can be found in the paper "Statistical themes and lessons for data mining".


Tom Minka
Last modified: Mon Aug 22 16:40:54 GMT 2005