For example, here is a table from Moore & McCabe (exercise 9-24)
relating ethnic group (Hawaiian, Hawaiian-White, Hawaiian-Chinese, and White)
to blood type:
       Group
Blood      H    HW    HC  White
   O    1903  4469  2206  53759
   A    2490  4671  2368  50008
   B     178   606   568  16252
   AB     99   236   243   5001

This is the mosaic plot, with residual shading:
It turns out that we can achieve this by looking at the relationship
between the row scores and column scores. Specifically, (row,column) pairs
which occur frequently should have similar scores for the row and the
column. In other words, if we convert the categorical observations into
numerical observations, then most pairs should have similar numbers in
them, like (3.5,3.6), not (3.5,100). This principle encourages unusually
frequent pairs (blue boxes) to be on the diagonal of the sorted table.
Let the table counts be $n_{ij}$, the row variable $i$, the
column variable $j$, the row scores $a_i$, and the column
scores $b_j$.
Then we can represent the principle by a sum of squares cost function:

$$\sum_{i} \sum_{j} n_{ij} \, (a_i - b_j)^2$$
This cost function wants scores which occur together to be similar to each
other. Because scaling and shifting the scores has no bearing on the
criterion, we force the row scores to have zero mean and unit variance.
(This forces the optimal column scores to also have zero mean, but not
necessarily unit variance.) There are other ways to write this cost
function, for example as maximizing the correlation between the row and
column scores.
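As a small illustration of why a constraint is needed, here is a sketch in Python with NumPy. It assumes the cost is the count-weighted sum of squared differences between row and column scores; the function name `score_cost` and the trial scores are made up for this example. It uses the blood type table from above.

```python
import numpy as np

def score_cost(counts, row_scores, col_scores):
    """Count-weighted sum of squared differences between row and column scores."""
    counts = np.asarray(counts, dtype=float)
    a = np.asarray(row_scores, dtype=float)
    b = np.asarray(col_scores, dtype=float)
    diff = a[:, None] - b[None, :]          # diff[i, j] = a_i - b_j
    return float(np.sum(counts * diff ** 2))

# Blood type (rows O, A, B, AB) by ethnic group (columns H, HW, HC, White).
blood = np.array([
    [1903, 4469, 2206, 53759],
    [2490, 4671, 2368, 50008],
    [ 178,  606,  568, 16252],
    [  99,  236,  243,  5001],
])

# Giving every row and every column the same score costs nothing,
# so without a constraint the minimizer would be uninformative.
trivial = score_cost(blood, np.ones(4), np.ones(4))

# Shifting both sets of scores by the same constant leaves the cost unchanged.
a = np.array([0.0, 1.0, 2.0, 3.0])          # arbitrary illustrative scores
b = np.array([3.0, 2.0, 1.0, 0.0])
base = score_cost(blood, a, b)
shifted = score_cost(blood, a + 5.0, b + 5.0)
```

Fixing the mean and variance of the row scores rules out the constant solution and removes the freedom to shift.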
Minimizing the sum of squares is easy. Because it is a quadratic bowl, we can start anywhere and move downhill until we reach the bottom. To move downhill, we solve for the best row scores given the column scores and vice versa. Here is the algorithm:
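A minimal sketch of these alternating updates in Python with NumPy, under some assumptions: the cost is the count-weighted sum of squared score differences, the mean and variance of the row scores are weighted by the row totals, and every row and column total is positive. The names `standardize` and `alternating_scores` are invented for this sketch. Given the row scores, the best score for a column is the weighted average of the row scores in that column, and vice versa.

```python
import numpy as np

def standardize(scores, weights):
    """Shift and scale scores to weighted mean 0 and weighted variance 1."""
    w = weights / weights.sum()
    centered = scores - np.sum(w * scores)
    return centered / np.sqrt(np.sum(w * centered ** 2))

def alternating_scores(counts, max_iter=1000, tol=1e-12):
    """Alternate between the best row scores and the best column scores.

    Assumes at least two rows and strictly positive row and column totals.
    """
    n = np.asarray(counts, dtype=float)
    row_tot, col_tot = n.sum(axis=1), n.sum(axis=0)
    # Arbitrary starting point: any non-constant scores will do.
    a = standardize(np.arange(n.shape[0], dtype=float), row_tot)
    for _ in range(max_iter):
        # Best column scores given the rows: average row score in each column.
        b = (n.T @ a) / col_tot
        # Best row scores given the columns, then re-impose the constraint.
        a_new = standardize((n @ b) / row_tot, row_tot)
        converged = np.max(np.abs(a_new - a)) < tol
        a = a_new
        if converged:
            break
    b = (n.T @ a) / col_tot                 # column scores matching the final rows
    return a, b

# Blood type (rows O, A, B, AB) by ethnic group (columns H, HW, HC, White).
blood = np.array([
    [1903, 4469, 2206, 53759],
    [2490, 4671, 2368, 50008],
    [ 178,  606,  568, 16252],
    [  99,  236,  243,  5001],
])
row_scores, col_scores = alternating_scores(blood)
row_order = np.argsort(row_scores)          # row indices sorted by score
col_order = np.argsort(col_scores)          # column indices sorted by score
```

Sorting the rows and columns by `row_order` and `col_order` gives the rearranged table. The direction of the ordering depends on the starting point, so it may come out reversed.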
          education
workclass 10th 11th 12th 1st-4th 5th-6th 7th-8th 9th
  Fed        6    9    5       0       1       2   3
  Local     31   36   19       4       9      28  23
  None       2    1    0       0       0       2   0
  Private  695  923  333     136     266     424 387
  Self      86   74   26      15      23     108  44
  State     13   14   10       1       4      10   6
          education
workclass Assoc-acdm Assoc-voc Bachelors Doctorate HS-grad
  Fed             55        38       212        16     263
  Local           88        86       477        27     503
  None             1         0         0         0      10
  Private        729      1005      3551       181    7780
  Self           106       146       672        85    1145
  State           41        46       270        89     268
          education
workclass Masters Preschool Prof-school Some-college
  Fed          67         0          29          254
  Local       342         4          29          387
  None          0         0           0            5
  Private     894        41         257         5094
  Self        203         0         212          712
  State       169         1          31          325

There are 16 (unsorted) education levels and 6 (unsorted) occupation categories. "Fed", "Local", and "State" refer to government positions. "Private" is the (entire) private sector and "Self" means self-employed. Obviously there is a bias in this choice of categories.
A mosaic plot of this big table is difficult to read.
We could use prior knowledge to make a conceptual hierarchy and abstract
the education variable. Interestingly, correspondence analysis does most
of the work for us. Here are the education levels after sorting:
 [1] "1st-4th"      "5th-6th"      "11th"
 [4] "10th"         "Preschool"    "9th"
 [7] "12th"         "HS-grad"      "7th-8th"
[10] "Assoc-voc"    "Some-college" "Assoc-acdm"
[13] "Bachelors"    "Prof-school"  "Masters"
[16] "Doctorate"

And here are the workclass categories after sorting:

[1] "None"    "Private" "Self"    "Fed"     "Local"
[6] "State"

Notice how the government jobs are grouped together. This comes simply from considering the education levels needed for each job.
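For readers who want to try the sorting themselves, here is a sketch in Python with NumPy. It does not claim to be the procedure used to produce the lists above: it swaps in the standard singular value decomposition of the standardized residuals, which is a common way to compute first-axis correspondence analysis scores. The function name `ca_first_axis` is invented for this example, and the sketch assumes all row and column totals are positive. The resulting order need not match the lists above exactly, and it may come out reversed because the sign of a singular vector is arbitrary.

```python
import numpy as np

def ca_first_axis(counts):
    """First-axis row and column scores from an SVD of standardized residuals."""
    n = np.asarray(counts, dtype=float)
    p = n / n.sum()
    r, c = p.sum(axis=1), p.sum(axis=0)     # row and column masses
    resid = (p - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    u, s, vt = np.linalg.svd(resid, full_matrices=False)
    row_scores = u[:, 0] / np.sqrt(r)       # weighted mean 0, weighted variance 1
    col_scores = s[0] * vt[0] / np.sqrt(c)  # weighted averages of the row scores
    return row_scores, col_scores

workclass = ["Fed", "Local", "None", "Private", "Self", "State"]
education = ["10th", "11th", "12th", "1st-4th", "5th-6th", "7th-8th", "9th",
             "Assoc-acdm", "Assoc-voc", "Bachelors", "Doctorate", "HS-grad",
             "Masters", "Preschool", "Prof-school", "Some-college"]
# Rows follow `workclass`, columns follow `education` (counts from the table above).
table = np.array([
    [  6,   9,   5,   0,   1,   2,   3,  55,   38,  212,  16,  263,  67,  0,  29,  254],
    [ 31,  36,  19,   4,   9,  28,  23,  88,   86,  477,  27,  503, 342,  4,  29,  387],
    [  2,   1,   0,   0,   0,   2,   0,   1,    0,    0,   0,   10,   0,  0,   0,    5],
    [695, 923, 333, 136, 266, 424, 387, 729, 1005, 3551, 181, 7780, 894, 41, 257, 5094],
    [ 86,  74,  26,  15,  23, 108,  44, 106,  146,  672,  85, 1145, 203,  0, 212,  712],
    [ 13,  14,  10,   1,   4,  10,   6,  41,   46,  270,  89,  268, 169,  1,  31,  325],
])

row_scores, col_scores = ca_first_axis(table)
row_order = np.argsort(row_scores)          # workclass indices, lowest score first
col_order = np.argsort(col_scores)          # education indices, lowest score first

sorted_table = table[np.ix_(row_order, col_order)]
sorted_workclass = [workclass[i] for i in row_order]
sorted_education = [education[j] for j in col_order]
```

Printing `sorted_workclass` and `sorted_education` shows the category orderings, and `sorted_table` is the count table with rows and columns rearranged to match.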
Here is the table after abstracting the education variable and sorting:
The bottom row is "Advanced" degrees.
The trend is clear: higher education levels are needed as you move from
"None" (unemployed) to "Private" to "Self" to government jobs.
To test this idea, the seven images from homework 2 were put into a 7 by 64
table. The rows after sorting are:
[1] "ocean"   "tiger1"  "tiger3"  "tiger2"  "flower2"
[6] "flower1" "flower3"

Notice how it has grouped the tiger images and flower images.