- Collect a dataset of email labeled as "spam" and "not spam".
- Define variables which can be used to discriminate the two classes. This is sometimes called **feature extraction**.
- Divide the data into a training set and test set.
- Train a classifier and evaluate it on the test set.

- histograms
- trees
- logistic regression
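
The four steps listed at the top can be sketched end to end in base R. This is only an illustration under stated assumptions: the data are simulated rather than taken from the spam dataset, the feature names `dollar` and `bang` are made up, and `sample` stands in for whatever splitting helper is used in practice.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Step 1: a labeled dataset (simulated here; not the spam data).
n <- 200
spam <- factor(rep(c("No", "Yes"), each = n / 2))

# Step 2: two made-up features whose means differ between the classes.
d <- data.frame(
  dollar = rnorm(n, mean = ifelse(spam == "Yes", 2, 0)),
  bang   = rnorm(n, mean = ifelse(spam == "Yes", 1, 0)),
  spam   = spam
)

# Step 3: random half/half split into a training set and a test set.
i <- sample(n, n / 2)
train <- d[i, ]
test  <- d[-i, ]

# Step 4: train a logistic regression classifier, then count test errors.
fit <- glm(spam ~ ., data = train, family = binomial)
p <- predict(fit, newdata = test, type = "response")  # estimated P(spam = "Yes")
predicted <- ifelse(p > 0.5, "Yes", "No")
errors <- sum(predicted != as.character(test$spam))
errors / nrow(test)  # misclassification rate on the test set
```

Any of the classifiers in the list could take the place of `glm` in step 4; the split and the evaluation stay the same.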

George Forman, a researcher at Hewlett-Packard, has made a spam dataset. He took 4601 of his own email messages, labeled them, and extracted various features. The features include the frequency of various words (e.g. "money"), special characters (e.g. dollar signs), and the use of capital letters in the message. For each message, the following variables are recorded:

    > colnames(x)
     [1] "word.freq.make"
     [2] "word.freq.address"
     [3] "word.freq.all"
     [4] "word.freq.3d"
    ...
    [48] "word.freq.conference"
    [49] "char.freq.semi"
    [50] "char.freq.paren"
    [51] "char.freq.bracket"
    [52] "char.freq.bang"
    [53] "char.freq.dollar"
    [54] "char.freq.pound"
    [55] "capital.run.length.average"
    [56] "capital.run.length.longest"
    [57] "capital.run.length.total"
    [58] "spam"

The last variable is the class variable ("Yes" or "No"). These variables only describe the content of the message. In a real system, we would probably also want to compare the sender name to an address book.

We split into train and test:

    x <- rsplit(x,0.5); names(x) <- c("train","test")

For a tree classifier, these are the commands we would use:

    tr <- tree(spam~.,x$train)
    plot(tr,type="u");text(tr,pretty=0)

The resulting tree depends strongly on the particular train/test split. Some variables which appear consistently are the frequency of dollar signs, exclamation points, and the word "hp", which is a good indicator of non-spam in this dataset. The tree is rather complicated, but only uses a few variables. Trees tend to use only a few variables, which is great for analysis but not so good for classification. For good classification we want to use as much information about the message as possible.

Logistic regression is quite different from trees in that it uses all of the variables. We can make a logistic regression classifier as follows:

    fit <- glm(spam~.,x$train,family=binomial)

The error rate for both models can be computed using

    > misclass(tr,x$test)
    [1] 224
    > misclass(fit,x$test)
    [1] 168

For one particular train/test split, the tree has 224/2300 errors while logistic regression has 168/2300. This difference is statistically significant:

    > chisq.test(array(c(224,2300,168,2300),c(2,2)))
    X-squared = 7.0897, df = 1, p-value = 0.007753

Cross-validation on tree size does not change the error rate.
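
Another common way to set up the comparison of the two error rates is a 2x2 table of wrong versus right classifications. The sketch below uses only the error counts quoted above. The table layout and the names `errors`, `n_test`, and `tab` are my own choices for illustration. It treats the two error counts as independent samples, which is only an approximation since both classifiers were scored on the same test messages.

```r
# Error counts on the 2300 test messages, as reported above.
errors <- c(tree = 224, logistic = 168)
n_test <- 2300

# Rows: wrong / right classifications; columns: the two classifiers.
tab <- rbind(wrong = errors, right = n_test - errors)

# Chi-squared test of equal error rates (continuity-corrected by default).
res <- chisq.test(tab)
res$p.value
```

A small p-value here points the same way as the test above: the gap between the two error rates is unlikely to be due to chance alone.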

Misclassification rate is not an entirely appropriate measure for this
problem, because the costs are not equal.
I would say that `C(y|n)` (falsely classifying as spam)
is greater than `C(n|y)`.
We can get a better idea of
performance by making a **confusion matrix**:

    > confusion(tr,x$test)
         predicted
    truth   No  Yes
      No  1330   82
      Yes  142  746
    > confusion(fit,x$test)
         predicted
    truth   No  Yes
      No  1345   67
      Yes  101  787

Logistic regression makes fewer errors of both types, so it is a better classifier for this data no matter what the cost structure.
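
For readers without the course helper functions, a confusion matrix and a cost-weighted error can be sketched in base R with `table`. The label vectors below are rebuilt from the counts reported above, and the 10-to-1 cost ratio is a made-up number used only to illustrate unequal costs. The names `cost`, `c_yn`, and `c_ny` are hypothetical.

```r
# Rebuild label vectors that reproduce the tree's confusion matrix above.
cell_counts <- c(1330, 82, 142, 746)
truth     <- rep(c("No", "No", "Yes", "Yes"), times = cell_counts)
predicted <- rep(c("No", "Yes", "No", "Yes"), times = cell_counts)

conf_tree <- table(truth = truth, predicted = predicted)
n_errors  <- sum(truth != predicted)   # total of the off-diagonal cells

# Logistic regression's confusion matrix, entered directly (column-wise).
conf_fit <- matrix(c(1345, 101, 67, 787), 2, 2,
                   dimnames = list(truth = c("No", "Yes"),
                                   predicted = c("No", "Yes")))

# Total cost when a false alarm (truth "No", predicted "Yes") costs c_yn
# and a missed spam costs c_ny. The default 10:1 ratio is an assumption.
cost <- function(conf, c_yn = 10, c_ny = 1) {
  c_yn * conf["No", "Yes"] + c_ny * conf["Yes", "No"]
}

cost(conf_tree)  # 10 * 82 + 142
cost(conf_fit)   # 10 * 67 + 101
```

Because logistic regression has the smaller count in both off-diagonal cells, its total cost is lower for any choice of nonnegative `c_yn` and `c_ny`.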

Can a tree ever be better than logistic regression? Absolutely. Remember that logistic regression always uses a linear boundary between the classes. But trees are not so constrained: a big enough tree, with small enough cells, can represent any class boundary. The problem is having enough data to figure out what that boundary is. If the tree doesn't have enough data, it will use a boundary with lots of corners. Therefore logistic regression wins in the above example because the true boundary is close to linear and the training set isn't big enough for the tree to figure this out.

Linear classifiers have been used since the early days of computing, since
they are so easy to implement in hardware. In theory, you just need a few
potentiometers to represent the coefficients. At the time, they were
called **neural networks**, by analogy to how neurons in the brain
combine messages from their neighbors (perhaps linearly) and then decide to
"fire" or "not fire" (perhaps by thresholding at zero). Since then, a
variety of spinoff methods have been developed, which were also called
neural networks. Eventually it got to a point where every new
classification algorithm was called a neural network. Consequently the
term carries little information today.

So what's the difference between logistic regression and the histogram
method? For one thing, the histogram method has a very specific idea of
what the predictors are. It regards each document as a batch of
independent observations; the predictors are relative frequencies within
the batch. The predictors must refer to disjoint possibilities. For
example, `word.freq.make` and
`capital.run.length.average` are not disjoint, because "make"
could be capitalized. To use the histogram method on the spam dataset, we
have to restrict ourselves to the word variables.

Secondly, the histogram method chooses its coefficients differently than logistic regression. Logistic regression chooses coefficients which separate the training data as well as possible. The histogram method tries to model the two classes, based on an independence assumption. This constraint means that the histogram method can only achieve a subset of the possible linear classifiers. But as we saw above, constraints aren't necessarily bad.
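
To see why the histogram method yields a linear classifier, here is a minimal sketch, assuming the method models each class by one pooled word histogram and classifies by the log-likelihood ratio. The word counts are made up, the add-one smoothing is my own assumption to avoid zero frequencies, and the names `coef_hist` and `intercept` are hypothetical.

```r
# Toy word counts: one row per message, one column per word (made-up numbers).
counts <- matrix(c(3, 2, 0,
                   2, 1, 0,
                   0, 0, 3,
                   0, 1, 2),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("money", "free", "meeting")))
label <- c("Yes", "Yes", "No", "No")

# One pooled word histogram per class, with add-one smoothing (an assumption).
hist_yes <- colSums(counts[label == "Yes", , drop = FALSE]) + 1
hist_no  <- colSums(counts[label == "No",  , drop = FALSE]) + 1
p_yes <- hist_yes / sum(hist_yes)
p_no  <- hist_no  / sum(hist_no)

# Under the independence assumption the log-likelihood ratio of a message is
# a weighted sum of its word counts, so the weights act as linear coefficients.
coef_hist <- log(p_yes / p_no)
intercept <- log(mean(label == "Yes") / mean(label == "No"))

score <- drop(counts %*% coef_hist) + intercept
predicted <- ifelse(score > 0, "Yes", "No")
```

The coefficients are fixed by the class histograms alone. Logistic regression is free to pick any coefficient vector, which is the sense in which the histogram method reaches only a subset of the linear classifiers.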

To test this theory, a logistic regression classifier (`fit`) and
a histogram classifier (`fit2`) were trained on the spam dataset
using word variables only.
Here are the results:

    > misclass(fit,x$test)
    [1] 216
    > misclass(fit2,x$test)
    [1] 309

This is the same train/test split used above, so we see that restricting to word variables has worsened logistic regression from 168/2300 to 216/2300 errors. The histogram classifier, even though it is also a linear classifier, does significantly worse, with 309/2300 errors.

Here's what happens if we do the same thing, but with the training set reduced to 231 examples:

    > misclass(fit,x$train)
    [1] 2
    > misclass(fit2,x$train)
    [1] 34
    > misclass(fit,x$test)
    [1] 755
    > misclass(fit2,x$test)
    [1] 564

Logistic regression does much better on the training set, as expected, but worse on the test set. How can this be? Logistic regression could easily have found the same linear classifier as the histogram method. But it chose another classifier which seemed better on the training set. The additional constraint imposed by the histogram method has helped it find a better classifier when the amount of data is small.

According to Bayesian statistics, most classifiers, including logistic
regression and trees, are using the wrong criteria to learn from a training
set. For example, let's look at the criterion used by logistic regression.
This is the maximum-likelihood formula given on day31. Look at what happens when the classes are
perfectly separable:

In such cases, it can be
shown that maximum-likelihood will try to place the boundary as far away
from the data as possible. Specifically, it will maximize the
**margin**: the distance to the closest boundary point. A consequence
is that the logistic regression coefficients are determined completely by
the boundary points (three points, in this case).

Some people say this is a good thing, because it makes the estimates robust
to noise in most of the dataset. On the other hand, it is bad, because it
throws away information in the dataset and makes the estimate very
sensitive to the position of the boundary points. Consider this example:

The maximum-margin solution is a vertical
line, determined only by the two points at (0,0) and (1,0). (The line is
not perfectly vertical in the figure because the optimization was stopped
early. `glm` has problems with separated classes.) By moving
those two points, we can swing the estimated boundary all the way from
vertical to nearly horizontal:

A better option in these situations is to consider all of the possible
classifiers, not just the one which has some special geometrical property.
For the above example, the classifiers range from vertical (90 degrees) to
horizontal (0 degrees). To hedge our bets among them, we should pick the
average: 45 degrees. Note that this is similar to what we would get if we
used bagging on logistic regression: by taking random subsets of the
training data, the boundary would swing from 0 to 90 degrees, and when we
voted them we would be classifying according to the 45 degree line.
According to **Bayesian statistics**, you should always average when
making predictions. So I am recommending a Bayesian approach to linear
classification. For my PhD thesis, I developed a technique for efficiently
computing the average linear classifier. It outperforms
logistic regression on many real-world datasets, which adds support for the
theory.
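
The claim that voting over boundaries from 0 to 90 degrees amounts to using the 45 degree line can be checked with a small sketch. It assumes every candidate boundary passes through a common point (the origin here) and that the angles are spread evenly. It illustrates plain voting only, not the technique from my thesis. The names `theta`, `phi`, and `votes` are hypothetical.

```r
theta <- 0:90                   # boundary angles in degrees, horizontal to vertical
phi   <- seq(10, 350, by = 20)  # directions of some test points around the origin

# votes[i, j] is TRUE when the classifier with boundary angle theta[j] puts
# test point i on its positive side (counterclockwise from the boundary).
votes <- outer(phi, theta, function(p, t) sin((p - t) * pi / 180) > 0)

# Majority vote over all 91 classifiers versus the single 45 degree boundary.
majority  <- rowMeans(votes) > 0.5
single_45 <- sin((phi - 45) * pi / 180) > 0
all(majority == single_45)
```

The test points are chosen to avoid lying exactly on the 45 degree line, where the vote would be close to a tie.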

Another problem with maximum-likelihood logistic regression is that it will always choose a classifier which separates the data, if this is possible. Along with maximum-margin, this is what causes logistic regression to overfit. A proper criterion would allow logistic regression to make errors on the training set, even if it didn't have to. We can achieve this by enlarging the averaging process to include classifiers that make errors.
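
This behavior is easy to reproduce on a toy problem. The sketch below uses four made-up points that are perfectly separable in one dimension. `glm` normally warns on such data, and the warnings are suppressed here only to keep the output quiet.

```r
# Made-up, perfectly separable 1-D data: class 1 whenever x > 0.
x <- c(-2, -1, 1, 2)
y <- c(0, 0, 1, 1)

# Maximum likelihood has no finite optimum here; glm stops after a limited
# number of iterations and reports whatever large coefficient it has reached.
fit_sep <- suppressWarnings(glm(y ~ x, family = binomial))

slope <- unname(coef(fit_sep)["x"])
train_errors <- sum((fitted(fit_sep) > 0.5) != (y == 1))

slope         # large: the fitted probabilities are pushed toward 0 and 1
train_errors  # the fitted classifier separates the training data
```

The fit makes no training errors and the slope keeps growing for as long as the optimizer runs, which is the overfitting tendency described above.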

So what about constrained classifiers? By the above arguments, constrained classifiers succeed because their constraints bring them close to the average classifier, when the training set is small. Hence some constraints really are better than others.

`misclass`, `confusion`

My PhD thesis on Bayesian linear classifiers and general algorithms for Bayesian inference. See Chapter 5.

Tom Minka Last modified: Tue Apr 25 09:48:24 GMT 2006