Day 6 - Classification and abstraction details

Today we covered the details of classification and data-driven abstraction.

Classification

We covered the solution to problem 4 (classification) on homework 1, using the maximum-likelihood method. You can see the test document for yourself.
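As a reminder of how the maximum-likelihood rule works on word histograms, here is a minimal sketch in R; the word counts and variable names are made up for illustration, and the word probabilities are estimated with add-one smoothing so that no word gets probability zero.

    # Made-up word counts over the same vocabulary, for illustration only.
    class1.counts <- c(10, 3, 0, 7)   # training counts for class 1
    class2.counts <- c(2, 8, 5, 1)    # training counts for class 2
    new.counts    <- c(4, 1, 2, 3)    # counts for the document to classify

    # Estimate word probabilities for each class, adding one to every count
    # so that no word has probability zero.
    class1.prob <- (class1.counts + 1) / sum(class1.counts + 1)
    class2.prob <- (class2.counts + 1) / sum(class2.counts + 1)

    # Log-likelihood of the new document under each class
    # (multinomial model, dropping the constant combinatorial term).
    loglik1 <- sum(new.counts * log(class1.prob))
    loglik2 <- sum(new.counts * log(class2.prob))

    # Label the document with the class of highest likelihood.
    if (loglik1 > loglik2) "class 1" else "class 2"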

Besides maximum-likelihood, an alternative way to do classification via histograms is to view the problem as a homogeneity test between two samples. We can set up a two-way table with two rows, where the first row has counts for one class and the second row has counts for the sample we are trying to classify. The chi-square test gives us a measure of the disagreement between the training data for the class and the new sample. The sample is then labeled according to the class with the lowest chi-square value.
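To make the two-way table concrete, here is a small sketch in R of the homogeneity-test version, again with made-up counts; the Pearson chi-square statistic is computed by hand from the two-row table.

    # Made-up counts over the same vocabulary.
    class.counts <- c(10, 3, 0, 7)   # training counts for one class
    new.counts   <- c(4, 1, 2, 3)    # counts for the document to classify

    # Two-way table: first row is the class, second row is the new sample.
    tab <- rbind(class.counts, new.counts)

    # Pearson chi-square statistic for the homogeneity test: expected counts
    # assume both rows are drawn from the same word population.
    expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
    chisq <- sum((tab - expected)^2 / expected)

    # Repeating this for every class, the document is labeled with the class
    # giving the smallest chi-square value.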

You do not need to add one to the counts to use the chi-square test. However, if there are many zeros, the answer will be slightly different from the maximum-likelihood method.

The file homogeneity.r is provided for computing chi-square values. Load it into R or S using source("homogeneity.r"). It is basically the same as the chisq.test function except it allows zero counts. For some reason, it also seems to run faster. chisq.test works on any two-way table, whereas homogeneity is only for tables with two rows. The arguments to the function are the two rows.
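For reference, a usage sketch with made-up counts (see homogeneity.r itself for the exact form of the return value):

    source("homogeneity.r")

    class.counts <- c(10, 3, 0, 7)   # training counts for one class
    new.counts   <- c(4, 1, 2, 3)    # counts for the document to classify

    # The two arguments are the two rows of the two-way table.
    homogeneity(class.counts, new.counts)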

Note that the p-value associated with the chi-square statistic is not particularly relevant, because the assumptions needed for this p-value to be valid do not hold. Generally the p-value will be very small, indicating the obvious fact that the words in a document are not really independent draws from a word population.

When the training set is large, maximum-likelihood and the chi-square test give the same answers. Furthermore, maximum-likelihood can be used with any probability model of the populations, not just histograms. Hence the chi-square test is not often used for classification. However, for "query by example" it is very useful, because typically the user will label only one or two documents. In that case, we can't realistically expect to estimate the population probability of each word, which the maximum-likelihood method needs. See the paper below.

Abstraction via balance principle

We worked an example of algorithms A and B on the following tree:

Both algorithms A and B can be carried out efficiently by precomputing the frequency and merging cost of each node in the tree (each possible subgroup). The output of the algorithms can then be written as an ordered list of the subgroups which are merged. For both algorithms, the ordering is F,C,D,E,B,A.

For abstraction into 5 bins, it turns out that algorithm B does not give the most balanced histogram. This is because algorithm B always produces a nested sequence of abstractions. Nesting is important for interpretability, but it means we have to tolerate some imbalance. If you really care about balance, it is possible to ignore nesting and find the best abstraction for each bin size by directly minimizing the quadratic imbalance function.
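As one concrete form of that objective, here is a sketch in R where the quadratic imbalance of a histogram is taken to be the sum of squared deviations of the bin probabilities from the uniform value 1/k; this particular form is an assumption for illustration, not necessarily the exact function used in class.

    # A sketch of a quadratic imbalance function, assuming imbalance is
    # measured as the sum of squared deviations of the bin probabilities
    # from the uniform value 1/k.  (This exact form is an assumption.)
    imbalance <- function(counts) {
      p <- counts / sum(counts)   # normalize bin counts to probabilities
      k <- length(p)
      sum((p - 1/k)^2)
    }

    imbalance(c(5, 5, 5, 5, 5))    # perfectly balanced: 0
    imbalance(c(20, 2, 1, 1, 1))   # unbalanced: larger value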

Optional reading

"Non-parametric Similarity Measures for Unsupervised Texture Segmentation and Image Retrieval", Jan Puzicha, Thomas Hofmann and Joachim Buhmann (1997)
This paper uses texture histograms instead of color, and considers alternative ways of comparing histograms. The chi-square method turns out to be best. Maximum-likelihood was not attempted.


Tom Minka
Last modified: Thu Sep 13 19:29:42 EDT 2001