Day 13 - Sorting contingency tables

The categories defining a contingency table often have no natural order, and the "default" ordering that we are given is usually not the best one. Fortunately, there is a systematic way to sort the categories in order to make the structure of the table more apparent.

For example, here is a table from Moore & McCabe (exercise 9-24) relating ethnic group (Hawaiian, Hawaiian-White, Hawaiian-Chinese, and White) to blood type:

     Group
Blood    H   HW   HC White
   O  1903 4469 2206 53759
   A  2490 4671 2368 50008
   B   178  606  568 16252
   AB   99  236  243  5001

This is the mosaic plot, with residual shading:

All differences in this table are significant, because of the large sample size. What are the overall trends? Suppose we order the rows and columns so that blue boxes are on the diagonal and red boxes are off the diagonal:

Now the trend among groups is more apparent. As you move from H to HW to HC to White, the proportion of type A decreases, the proportions of types B and AB increase, and the proportion of type O is relatively unchanged. These trends tell us something about the categories. The ordering suggests that HW is more similar to H than HC is, and that types O and AB are "in between" types A and B. This is the payoff of table sorting.
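
The trends described above can be checked directly from the counts. The following is a minimal sketch in base R, assuming the table is entered by hand as a matrix; the variable names are illustrative, and the base `mosaicplot` function with `shade = TRUE` is used as a stand-in for whatever plotting routine produced the original figures.

```r
# Blood type by ethnic group, copied from the table above.
blood <- matrix(c(1903, 4469, 2206, 53759,
                  2490, 4671, 2368, 50008,
                   178,  606,  568, 16252,
                    99,  236,  243,  5001),
                nrow = 4, byrow = TRUE,
                dimnames = list(Blood = c("O", "A", "B", "AB"),
                                Group = c("H", "HW", "HC", "White")))

# Distribution of blood type within each group (each column sums to 1).
props <- prop.table(blood, margin = 2)
print(round(props, 3))

# Rows and columns in the sorted order discussed in the text.
sorted <- blood[c("A", "O", "AB", "B"), c("H", "HW", "HC", "White")]

# Mosaic plot shaded by residuals from the independence model.
mosaicplot(sorted, shade = TRUE, main = "Blood type by group (sorted)")
```

Reading across the "A" row of `props` shows the decreasing proportion of type A, and the "B" row shows the increase.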

Correspondence analysis

There is a way to formalize the idea of "putting blue on the diagonal". It is called correspondence analysis. The idea is to convert the categories into numbers, called scores, such that the order and spacing of the numbers reflects the similarity between categories. In the above table, we would say that two blood categories are similar if they occur in the same groups, and that two groups are similar if they have the same distribution of blood type. We want similar categories to have similar scores, e.g. (A = -1, O = 0, AB = 1.5, B = 2.2). Then we sort the rows and columns by sorting their scores. (In fact these are the scores obtained for the above table.)

It turns out that we can achieve this by looking at the relationship between the row scores and column scores. Specifically, (row,column) pairs which occur frequently should have similar scores for the row and the column. In other words, if we convert the categorical observations into numerical observations, then most pairs should have similar numbers in them, like (3.5,3.6), not (3.5,100). This principle encourages unusually frequent pairs (blue boxes) to be on the diagonal of the sorted table. Let the table counts be n_ij, where i indexes the row variable and j the column variable, and let the row scores be a_i and the column scores b_j. Then we can represent the principle by a sum of squares cost function:

    \sum_{i,j} n_{ij} (a_i - b_j)^2

This cost function wants scores which occur together to be similar to each other. Shifting or rescaling the scores does not change the ordering they induce, and without a constraint the cost could be driven to zero simply by making all scores equal, so we force the row scores to have zero mean and unit variance. (This forces the optimal column scores to also have zero mean, but not necessarily unit variance.) There are other ways to write this criterion, for example as maximizing the correlation between the row and column scores.

Minimizing the sum of squares is easy. Because it is a quadratic bowl, we can start anywhere and move downhill until we reach the bottom. To move downhill, we solve for the best row scores given the column scores and vice versa. Here is the algorithm:

  1. Guess random values for the column scores.
  2. For each row, set the row score to the average column score for that row.
  3. Standardize the row scores to have zero mean and unit variance (by shifting and scaling).
  4. For each column, set the column score to the average row score for that column.
  5. If the column scores have changed significantly, go back to step 2. Otherwise stop.

Essentially, step 2 looks at all observed pairs where the row variable had value i, replaces the column value in those pairs with its column score, and takes the average. This minimizes the sum of squares if we regard the column scores as fixed.
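
To see why the average is the minimizer, write the cost as the count-weighted sum of squared differences between row and column scores (the form implied by the description above, with n_ij the counts, a_i the row scores and b_j the column scores). Setting the derivative with respect to a single row score to zero gives:

```latex
\frac{\partial}{\partial a_i} \sum_{i',j} n_{i'j}\,(a_{i'} - b_j)^2
  = 2 \sum_j n_{ij}\,(a_i - b_j) = 0
\quad\Longrightarrow\quad
a_i = \frac{\sum_j n_{ij}\, b_j}{\sum_j n_{ij}}
```

The right-hand side is the average column score over the observations in row i. The same argument with rows and columns exchanged gives the update in step 4.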

Example

Here is a table of census data relating occupation and education:

         education
workclass 10th 11th 12th 1st-4th 5th-6th 7th-8th 9th
  Fed        6    9    5       0       1       2   3
  Local     31   36   19       4       9      28  23
  None       2    1    0       0       0       2   0
  Private  695  923  333     136     266     424 387
  Self      86   74   26      15      23     108  44
  State     13   14   10       1       4      10   6
         education
workclass Assoc-acdm Assoc-voc Bachelors Doctorate HS-grad
  Fed             55        38       212        16     263
  Local           88        86       477        27     503
  None             1         0         0         0      10
  Private        729      1005      3551       181    7780
  Self           106       146       672        85    1145
  State           41        46       270        89     268
         education
workclass Masters Preschool Prof-school Some-college
  Fed          67         0          29          254
  Local       342         4          29          387
  None          0         0           0            5
  Private     894        41         257         5094
  Self        203         0         212          712
  State       169         1          31          325

There are 16 (unsorted) education levels and 6 (unsorted) occupation categories. "Fed", "Local", and "State" refer to government positions. "Private" is the (entire) private sector and "Self" means self-employed. The choice of categories is clearly lopsided: government is split three ways, while the whole private sector is a single category.

A mosaic plot of this big table is difficult to read. We could use prior knowledge to make a conceptual hierarchy and abstract the education variable. Interestingly, correspondence analysis does most of the work for us. Here are the education levels after sorting:

 [1] "1st-4th"      "5th-6th"      "11th"        
 [4] "10th"         "Preschool"    "9th"         
 [7] "12th"         "HS-grad"      "7th-8th"     
[10] "Assoc-voc"    "Some-college" "Assoc-acdm"  
[13] "Bachelors"    "Prof-school"  "Masters"     
[16] "Doctorate"   

This ordering comes simply from considering what kinds of jobs you get at each level. Here are the sorted occupation levels:

[1] "None"    "Private" "Self"    "Fed"     "Local"  
[6] "State"  

Notice how the government jobs are grouped together. This grouping comes simply from considering the education levels needed for each job.

Here is the table after abstracting the education variable and sorting:

The bottom row is "Advanced" degrees. The trend is clear: higher education levels are needed as you move from "None" (unemployed) to "Private" to "Self" to government jobs.

Image clustering

The same method works for sorting images by their color histograms. For each image, we make a histogram across 64 color bins. Each histogram becomes a row in a contingency table. Sorting this table will group similar images (and colors) together.
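
As an illustration of how such a table might be built, here is a minimal sketch in R. The notes do not say how the 64 bins are defined; this sketch assumes each of the three color channels is quantized to 4 levels (4 x 4 x 4 = 64). The helper `color_histogram` is hypothetical, and the "images" are random pixels used only to show the shape of the resulting table.

```r
# Hypothetical helper: 64-bin color histogram for one image.
# `pixels` is a numeric matrix, one row per pixel, columns R, G, B in [0, 1].
color_histogram <- function(pixels, levels = 4) {
  # Quantize each channel to 0 .. levels-1 (a value of exactly 1 goes in the top level).
  q <- pmin(floor(pixels * levels), levels - 1)
  # Combine the three channel levels into a single bin index in 1 .. levels^3.
  bin <- q[, 1] * levels^2 + q[, 2] * levels + q[, 3] + 1
  tabulate(bin, nbins = levels^3)
}

# Synthetic stand-in images (random pixels), not real data.
set.seed(1)
images <- list(img1 = matrix(runif(300), ncol = 3),
               img2 = matrix(runif(600), ncol = 3),
               img3 = matrix(runif(450), ncol = 3))

# One row per image, one column per color bin.
hist_table <- t(sapply(images, color_histogram))
print(dim(hist_table))
```

Each row of `hist_table` sums to the number of pixels in that image, so the table can be treated like any other contingency table of counts.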

To test this idea, the seven images from homework 2 were put into a 7 by 64 table. The rows after sorting are:

[1] "ocean"   "tiger1"  "tiger3"  "tiger2"  "flower2"
[6] "flower1" "flower3"

Notice how the sort has grouped the tiger images together and the flower images together.

Code

To sort a table using correspondence analysis, just say x <- sort.table(x). This function is part of the crosstab package.
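
For readers without the package, the alternating-averages algorithm described earlier can be sketched in a few lines of base R. This is a minimal sketch, not the crosstab implementation: the name `ca_sort`, the tolerance, the iteration cap, and the choice to weight the mean and variance in step 3 by the row totals are all assumptions. It also assumes that no row or column of the table is entirely zero and that the table is not exactly independent.

```r
# Hypothetical sketch of table sorting by alternating averages.
ca_sort <- function(x, tol = 1e-10, max_iter = 1000) {
  x <- as.matrix(x)
  row_tot <- rowSums(x)
  col_tot <- colSums(x)
  b <- rnorm(ncol(x))                              # step 1: random column scores
  for (iter in seq_len(max_iter)) {
    a <- as.vector(x %*% b) / row_tot              # step 2: average column score per row
    a <- a - weighted.mean(a, row_tot)             # step 3: zero mean ...
    a <- a / sqrt(weighted.mean(a^2, row_tot))     #         ... and unit variance
    b_new <- as.vector(crossprod(x, a)) / col_tot  # step 4: average row score per column
    converged <- max(abs(b_new - b)) < tol         # step 5: stop when scores settle
    b <- b_new
    if (converged) break
  }
  x[order(a), order(b), drop = FALSE]              # sort rows and columns by score
}

# The blood type table from the first section.
blood <- matrix(c(1903, 4469, 2206, 53759,
                  2490, 4671, 2368, 50008,
                   178,  606,  568, 16252,
                    99,  236,  243,  5001),
                nrow = 4, byrow = TRUE,
                dimnames = list(Blood = c("O", "A", "B", "AB"),
                                Group = c("H", "HW", "HC", "White")))

set.seed(42)
sorted <- ca_sort(blood)
print(sorted)
```

The sign of the scores is arbitrary, so the sorted table may come out with both rows and columns reversed, depending on the random start.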


Tom Minka
Last modified: Fri Sep 28 13:23:45 Eastern Daylight Time 2001