Let Cost(Ck | Cj) be the cost of predicting Ck when the truth is Cj. Varying costs means that, in general, Cost(Ck | Cj) != Cost(Cj | Ck). For observation x, the classifier gives class probabilities p(Cj | x). Then the expected cost of predicting Ck is:

Cost(Ck | x) = sum_j Cost(Ck | Cj) p(Cj | x)

The best prediction is the one that minimizes this expected cost.
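As a concrete sketch of this rule (in R, to match the cv.tree usage later in these notes), suppose the cost matrix is stored with rows indexed by the true class and columns by the predicted class; the function names here are just illustrative:

```r
# costs[j, k] = Cost(predict Ck | truth Cj); probs[j] = p(Cj | x)
expected.costs <- function(costs, probs) {
  colSums(costs * probs)            # sum_j Cost(Ck | Cj) p(Cj | x), one entry per k
}

# Minimum-expected-cost decision: the column with the smallest expected cost.
best.prediction <- function(costs, probs) {
  which.min(expected.costs(costs, probs))
}
```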
For example, suppose this is the cost matrix:
| | predict C1 | predict C2 |
|---|---|---|
| truth C1 | 0 | 10 |
| truth C2 | 1 | 0 |
Suppose an observation x2 has class probabilities (0.4, 0.6). The most likely class is C2, but it is not the most cost-effective decision. The expected cost of predicting C1 is 0.4*0 + 0.6*1 = 0.6, while the expected cost of predicting C2 is 0.4*10 + 0.6*0 = 4. The probability of C2 must be very high (here, above 10/11, about 0.91) before C2 is a cost-effective prediction.
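Plugging these numbers in (a self-contained R snippet; the layout matches the cost matrix above, with rows = truth and columns = prediction):

```r
costs <- matrix(c(0, 1,            # column "predict C1"
                  10, 0),          # column "predict C2"
                nrow = 2,
                dimnames = list(c("truth C1", "truth C2"),
                                c("predict C1", "predict C2")))
probs <- c(0.4, 0.6)               # p(C1 | x2), p(C2 | x2)

colSums(costs * probs)             # expected costs: 0.6 for C1, 4.0 for C2
which.min(colSums(costs * probs))  # 1: predict C1, even though C2 is more likely
```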
To evaluate the decisions made by a classifier under a given cost structure, you can compute the total cost of its predictions on a test set:

sum_i Cost(prediction for x_i | true class of x_i)

(In class I gave a different formula, which is the expected cost on the test set.) This formula can also be used in cross-validation. A drawback of this formula is that it doesn't evaluate whether the classifier's cost estimates are actually correct---only whether they succeed in selecting the minimum-cost prediction.
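A small R sketch of this total-cost evaluation, assuming `truth` and `pred` are vectors of class indices for the test set and `costs` is a matrix with rows indexed by truth and columns by prediction (names are placeholders):

```r
# Total cost of the classifier's decisions on a test set.
total.cost <- function(costs, truth, pred) {
  sum(costs[cbind(truth, pred)])   # Cost(prediction for x_i | true class of x_i), summed over i
}
```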
If the cost structure is unavailable or uncertain at the time the classifier is designed, there is another approach called deviance. Here the confidence values are evaluated directly. Given a test set, the formula for deviance is:
-2 sum_i log p(true class of x_i | x_i)

In other words, for each test point you get the class probabilities from the tree, take the logarithm of the probability of the true class, add up these log-probabilities across the test set, and multiply by -2. This measure rewards trees which give high confidence when they are right and low confidence when they are wrong. The best situation is when the classifier always gives probability 1 to the true class, in which case the deviance is zero. Note that it is possible for a classifier with a lower error rate to have a higher deviance, if its confidence values are not realistic.
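A minimal R sketch of the deviance computation, assuming `prob` is a matrix of predicted class probabilities (one row per test point, one column per class) and `truth` is the vector of true class indices:

```r
# Deviance: -2 times the summed log-probability of the true classes.
test.deviance <- function(prob, truth) {
  -2 * sum(log(prob[cbind(seq_len(nrow(prob)), truth)]))
}
```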
As mentioned in the day23 lecture notes, cv.tree uses the deviance measure by default, and can be instructed to use classification error instead (as was done in class).
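For reference, the call looks roughly like this with the tree package (the data frame `train` and response `y` are placeholders):

```r
library(tree)

fit <- tree(y ~ ., data = train)

cv.dev <- cv.tree(fit)                        # default: guided by deviance
cv.err <- cv.tree(fit, FUN = prune.misclass)  # use classification error instead
```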
Costs are important here because most network connections are normal, under almost any stratification of the data. If you simply predicted the most likely class, you would never predict an attack. But you should sometimes predict attacks, because the cost of missing an attack is much higher than the cost of falsely flagging a normal connection. To win the competition, your classifier had to make decisions which minimized overall cost.
The winner of the contest used trees with a special enhancement called "bagging" which improved the quality of the confidence values and thereby improved the selection of the minimal-cost decision. This method is described next.
More generally, we need to realize that the particular stratification chosen by the tree-growing procedure is not necessarily the best one. There are usually many tree structures which are almost as good, and each alternative tree leads to different class probabilities, sometimes slightly different and sometimes very different. So our confidence values should incorporate our uncertainty about the tree structure. For example, suppose two trees fit the training data equally well. On a test case, tree 1 gives class probabilities (0.7, 0.3) and tree 2 gives class probabilities (0.6, 0.4). Since we have no reason to favor either tree, we should treat them as competing opinions and use the average probabilities (0.65, 0.35). In general, we should average the predictions of all competing trees. If the trees are not equally good on the training data, then we should still average, but using a weighted average.
A simple and general way to form a weighted average of the possible classifiers for a given training set is bagging, which is short for bootstrap aggregating. In this approach, you draw many bootstrap samples of the training data (random samples drawn with replacement, typically the same size as the original training set) and train a classifier on each one, in the normal way. Typically this will give you many different classifiers, all of which are reasonable classifiers for the training set. That is the bootstrapping part. To get predictive probabilities on a test case, you average the probabilities given by each of the trees. Interestingly, a weighted average is not needed, because good classifiers will automatically show up more often in the collection and thereby get more sway. Classifiers which are best on the entire training set will tend to be best on the bootstrap samples as well.
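A minimal bagging sketch with the tree package, under the same placeholder names as before (a data frame `train` with factor response `y`, and a test set `test`): each tree is fit to a bootstrap resample, and the predicted class probabilities are averaged.

```r
library(tree)

bagged.probs <- function(train, test, B = 50) {
  n <- nrow(train)
  prob.sum <- NULL
  for (b in 1:B) {
    boot <- train[sample(n, n, replace = TRUE), ]   # bootstrap resample of the training data
    fit  <- tree(y ~ ., data = boot)                # fit a tree in the normal way
    p    <- predict(fit, newdata = test)            # matrix of class probabilities
    prob.sum <- if (is.null(prob.sum)) p else prob.sum + p
  }
  prob.sum / B                                      # averaged (bagged) probabilities
}
```

These averaged probabilities can then be fed into the expected-cost rule above to pick the minimum-cost prediction for each test case.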
Bootstrapping is also useful for sensitivity analysis, i.e. seeing which parts of the tree are resistant to changes in the data. You simply look at how the tree structures vary over the collection.