Here is an example two-way contingency table. It reports the number of
passengers on the Titanic who survived and did not survive, classified by
age group:

           Survived
    Age       No  Yes
      Child   52   57
      Adult 1438  654
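The same table can be rebuilt in R. This is a minimal sketch that assumes R's built-in `Titanic` dataset (a four-way table of class, sex, age, and survival) is the source of these counts; the name `age_surv` is ours.

```r
# Titanic is a 4-way table with dimensions
# 1 = Class, 2 = Sex, 3 = Age, 4 = Survived.
# Summing over Class and Sex leaves the Age x Survived table.
age_surv <- margin.table(Titanic, c(3, 4))
print(age_surv)
```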

In this course, we go deeper than that and try to describe the nature of the relationship between the variables. In order to do that, we will apply tools like visualization, sorting, and clustering.

Note that the mosaic treats the row and column variables differently. The column variable is treated as a conditioning variable, a "treatment", and the row variable is treated as a "response". Each column is a histogram, with bins stacked on top of each other instead of side by side.

Here is the mosaic plot of the Titanic table and its transpose:

On the left plot, the height of the upper left box is the probability that
someone who did not survive would be a child: `p(age = child | survived
= no)`. The width is `p(survived = no)`.
On the right plot, the height of the upper left box is the
probability that a child would not be a survivor: `p(survived = no |
age = child)`. The width is `p(age = child)`.
The area of these two boxes is exactly the same, but they tell us different
things.
The plot on the left tells us that most of the people who survived, and
most of those who did not, were adults, though children make up a greater
proportion of the survivors.
The plot on the right tells us that children had a greater probability of
survival than adults. It also tells us that the number of children who
survived is approximately equal to the number who did not, a fact which
is difficult to see in the left plot.
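The heights, widths, and areas described above can be checked numerically. This sketch assumes the counts come from R's built-in `Titanic` dataset; all variable names are ours. Both box areas work out to the joint probability `p(age = child, survived = no)`, which is why they are equal.

```r
age_surv <- margin.table(Titanic, c(3, 4))   # rows = Age, columns = Survived
n <- sum(age_surv)

# Heights in the left plot: condition on Survived (each column sums to 1).
p_age_given_surv <- prop.table(age_surv, 2)
# Heights in the right plot: condition on Age (each row sums to 1).
p_surv_given_age <- prop.table(age_surv, 1)

# Widths are the marginal probabilities.
p_surv <- margin.table(age_surv, 2) / n
p_age  <- margin.table(age_surv, 1) / n

# Area = height * width; both equal the joint probability.
area_left  <- p_age_given_surv["Child", "No"] * p_surv["No"]
area_right <- p_surv_given_age["Child", "No"] * p_age["Child"]
joint      <- age_surv["Child", "No"] / n

print(round(p_surv_given_age, 3))
```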

These show that women also had a greater probability of survival, as did passengers in the upper classes.

Upper classes were
located higher on the ship, which may have aided their survival.
Women and children were more prevalent in the upper classes, so does
this explain away the sex and age effects? To answer this question,
we can **stratify** the data across class and then plot the relationship
between age and survival:

This shows that children had a higher survival rate within each class.
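The stratified comparison can also be made with numbers rather than plots. This sketch again assumes R's built-in `Titanic` dataset; the names `by_class` and `rates` are ours. Note that the crew has no children, so that cell is 0/0 and comes out as `NaN`.

```r
# Keep Class (1), Age (3), and Survived (4); sum over Sex.
by_class <- margin.table(Titanic, c(1, 3, 4))

# Normalize within each (Class, Age) cell, then keep the "Yes" slice:
# rates[class, age] = p(survived = yes | class, age).
rates <- prop.table(by_class, c(1, 2))[, , "Yes"]
print(round(rates, 2))
```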

It seems to indicate a gender bias. However, there is a lurking variable: the department applied to. Here is what happens if we stratify on department:

It appears that most departments have no gender bias, and those departments that are biased favor women. How can this be? First, note that departments A and B have very few female applicants (the columns are narrow). It is also relatively easy to get into those departments: the proportion rejected is lower than in the other departments, especially F. So one explanation is that more males get in because they are applying to the hungrier, perhaps fastest-growing, departments.
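The reversal can be reproduced numerically. This sketch assumes the admissions data discussed here corresponds to R's built-in `UCBAdmissions` table (dimensions Admit, Gender, Dept), which is our guess from the department labels; the variable names are ours.

```r
# Aggregate admission rate by gender, ignoring department.
overall <- prop.table(margin.table(UCBAdmissions, c(1, 2)), 2)["Admitted", ]

# Admission rate by gender within each department (stratified).
by_dept <- prop.table(UCBAdmissions, c(2, 3))["Admitted", , ]

# Number of applicants by gender and department (the column widths).
applicants <- margin.table(UCBAdmissions, c(2, 3))

print(round(overall, 2))
print(round(by_dept, 2))
print(applicants)
```

In the aggregate, males are admitted at a higher rate, yet within departments A and B the female rate is higher, and few women apply to those two departments.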

Here is a mosaic plot, with residual shading, of hair color versus eye
color in a group of statistics students (the `HairEyeColor`
dataset in R):

The residuals can be interpreted in the following way: a cell is shaded
**blue** if we are confident that it is **taller** than the other
cells in the same **row**. A cell is shaded **red** if we are confident
that it is **shorter** than the other cells in the same **row**.
If a cell is visibly short, but does not get shaded red, then there
is not enough data to conclude that the cell would continue to be short
if we took another sample.
A blue cell is usually accompanied by a red cell in the same row, but not
always---see e.g. the bottom row of the plot (green eyes).
Note that shading does **not** say anything about
the relative height of boxes in the same
**column**.
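To see which cells would be shaded, we can compute residuals directly. This is a sketch under an assumption: that the shading follows base R's `mosaicplot`, which colors cells by Pearson residuals from the independence fit, with default cut points at 2 and 4 in absolute value. The course's own plotting code may use a different rule. The variable names are ours.

```r
# HairEyeColor is Hair x Eye x Sex; sum over Sex to get a two-way table.
hair_eye <- margin.table(HairEyeColor, c(1, 2))

# Pearson residuals: (observed - expected) / sqrt(expected),
# where expected counts assume hair and eye color are independent.
fit <- chisq.test(hair_eye)
pearson <- fit$residuals

# Assumed shading rule: beyond +2 is blue, beyond -2 is red.
blue <- pearson > 2
red  <- pearson < -2

print(round(pearson, 1))
```

Calling `mosaicplot(hair_eye, shade = TRUE)` on the same table draws the shaded plot in base R.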

For a table with lots of data, shading is redundant because all differences are significant and can be seen from the box heights. However, heights can be difficult to compare when boxes aren't lined up, as in the "hazel eyes" row. Also, shading helps draw your eye to where the major associations are.

One problem with the mosaic plot in this context is that when a proportion is very small, the corresponding box is nearly invisible and you don't see that it is colored red. Unusually large cells are emphasized in a mosaic plot, while unusually small cells are hidden.

Contingency tables are created from `factor` variables, using the
`table` command. Recall that a `factor` is a vector
of categorical observations.
Here is an example:

    Pet <- factor(c("Cat","Dog","Cat","Dog","Cat","Dog"))
    Food <- factor(c("Dry","Dry","Dry","Wet","Wet","Dry"))
    x <- table(Pet,Food)

Now `x` is the table:

         Food
    Pet   Dry Wet
      Cat   2   1
      Dog   2   1

The transpose of this table is:

         Pet
    Food  Cat Dog
      Dry   2   2
      Wet   1   1
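The text does not name the command that produces the transpose. A minimal sketch, assuming base R's `t()` is what is meant (the name `tx` is ours):

```r
Pet  <- factor(c("Cat", "Dog", "Cat", "Dog", "Cat", "Dog"))
Food <- factor(c("Dry", "Dry", "Dry", "Wet", "Wet", "Dry"))
x <- table(Pet, Food)

# t() swaps the row and column variables of a two-way table.
tx <- t(x)
print(tx)
```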

To save a table to disk, use `write.crosstab(x,"filename")`.
To read it back, use `read.crosstab("filename")`.
The table is stored in an easy-to-read format.

You can perform a classical chi-square test with `chisq.test(x)`.
The expected counts under independence can be obtained with
`indep.fit(x)`.
The marginal counts can be obtained with
`margin.table(x,k)` or equivalently `apply(x,k,sum)`.
Here `k` is a number: 1 means the row variable and 2 means the column
variable.
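Here is a short sketch of these calls using only base R. The table is taken from R's built-in `Titanic` dataset, which is our choice of example. Since `indep.fit` is not part of base R, the expected counts are read from the `expected` component of the `chisq.test` result as a stand-in.

```r
x <- margin.table(Titanic, c(3, 4))   # Age x Survived

# Classical chi-square test of independence.
test <- chisq.test(x)
print(test$p.value)

# Expected counts under independence: row total * column total / n.
expected <- test$expected

# Marginal counts, computed two equivalent ways.
rows_a <- margin.table(x, 1)
rows_b <- apply(x, 1, sum)
cols   <- margin.table(x, 2)
```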

To make a mosaic plot of a table x, say `mosaicplot(x)`. To make a
mosaic plot with residual shading, say `mosaicplot(x,shade=TRUE)`.

Tom Minka Last modified: Mon Dec 17 14:01:24 EST 2001