Day 14 - Measuring deviations from independence

So far, we've talked about spotting deviations from independence via Pearson residuals as well as sorting a table to show the direction of the deviations. But we haven't talked about actually measuring the size of the deviations. Measuring size is important for determining whether a deviation is practically interesting. Knowing that two variables are dependent may not be so important if the difference from independence is only one percent.

Statistical significance, which is what Pearson residuals measure, only tells us that a deviation exists. It does not imply that the deviation is interesting. Even the weakest correlation can be statistically significant if the sample size is big enough. Consider a table where all counts have been multiplied by 100. The effect sizes are the same, but the deviations will be more statistically significant (each Pearson residual grows by a factor of 10). Furthermore, within a given table, the rows and columns with large marginal counts will automatically have larger residuals for the same size effect.
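
The scaling argument can be checked numerically. The following is a minimal Python sketch, not part of the crosstab package; the 2x2 table and the helper names (expected_counts, pearson_residuals, lifts) are made up for illustration. It assumes the usual definitions: expected count = row total * column total / n, and Pearson residual = (actual - expected)/sqrt(expected).

```python
from math import sqrt

def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_residuals(table):
    """(actual - expected) / sqrt(expected) for every cell."""
    expected = expected_counts(table)
    return [[(a - e) / sqrt(e) for a, e in zip(arow, erow)]
            for arow, erow in zip(table, expected)]

def lifts(table):
    """actual / expected for every cell (a ratio measure of effect size)."""
    expected = expected_counts(table)
    return [[a / e for a, e in zip(arow, erow)]
            for arow, erow in zip(table, expected)]

# Made-up 2x2 table of counts, purely for illustration.
small = [[30, 20], [20, 30]]
# The same table with every count multiplied by 100.
big = [[100 * count for count in row] for row in small]

# The ratio actual/expected is unchanged by the scaling...
print(lifts(small)[0][0], lifts(big)[0][0])
# ...but the Pearson residual is 10 times larger.
print(pearson_residuals(small)[0][0], pearson_residuals(big)[0][0])
```

For this table the top-left ratio is 1.2 in both cases, while the residual goes from 1.0 to 10.0.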

A deviation from independence is when p(i|j) does not match p(i). So it comes down to comparing proportions. There are good and bad ways to do this.

In the introductory statistics text by Moore & McCabe, it is recommended to compare proportions by differencing (see p602 in the Third Edition). If one proportion is 39% and the other 22%, we would say that the effect is 17 percentage points. This is the type of comparison that is often done in the news. However, it has a serious flaw: the same percentage-point difference can mean very different things depending on the original percentage. Suppose one proportion is 18% and the other 1%. Again the difference is 17 percentage points. Would you consider this change comparable to the first? In one case, the proportion changes by a factor of about 1.8; in the other case, by a factor of 18!
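
The arithmetic behind this comparison is short enough to spell out. Here is a small Python sketch using the two pairs of proportions from the text; the function name compare is hypothetical.

```python
def compare(p1, p2):
    """Return (difference, ratio) of two proportions."""
    return p1 - p2, p1 / p2

# The two pairs of proportions discussed above.
for p1, p2 in [(0.39, 0.22), (0.18, 0.01)]:
    difference, ratio = compare(p1, p2)
    print(f"{p1:.2f} vs {p2:.2f}: "
          f"difference = {100 * difference:.0f} points, ratio = {ratio:.1f}")
```

Both pairs differ by 17 points, but the ratios are about 1.8 and 18.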

Lift

A better way to compare proportions is by division. In the case of contingency tables, the ratio p(i|j)/p(i) is called lift. It is symmetric: p(i|j)/p(i) = p(i,j)/(p(i)*p(j)) = p(j|i)/p(j). The lift of i given j and the lift of j given i are equivalent. The difference p(i|j)-p(i), by contrast, is not symmetric. Like the Pearson residuals, lift can be written as a function of actual and expected counts: lift = actual/expected.
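
To see the symmetry concretely, here is a minimal Python sketch that computes the lift of one cell three ways: from p(i|j)/p(i), from p(j|i)/p(j), and from actual/expected. The table is made up for illustration, and the function names are hypothetical.

```python
# Made-up table of counts: rows are values of i, columns are values of j.
table = [[40, 10],
         [20, 30]]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

def lift_three_ways(i, j):
    """Lift of cell (i, j) via p(i|j)/p(i), p(j|i)/p(j), and actual/expected."""
    p_ij = table[i][j] / n
    p_i = row_totals[i] / n
    p_j = col_totals[j] / n
    via_i_given_j = (p_ij / p_j) / p_i
    via_j_given_i = (p_ij / p_i) / p_j
    expected = row_totals[i] * col_totals[j] / n
    via_counts = table[i][j] / expected
    return via_i_given_j, via_j_given_i, via_counts

def differences(i, j):
    """The two differences p(i|j) - p(i) and p(j|i) - p(j), for contrast."""
    p_ij = table[i][j] / n
    p_i = row_totals[i] / n
    p_j = col_totals[j] / n
    return p_ij / p_j - p_i, p_ij / p_i - p_j

print(lift_three_ways(0, 0))  # three values that agree up to rounding
print(differences(0, 0))      # two values that do not agree
```

For the top-left cell all three lift computations give about 1.33, while the two differences are about 0.17 and 0.20.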

A lift of 1.33 means "33% more" or "one third more". A lift of 0.33 means "one third as many". For most purposes, the latter effect is more interesting, since it is a threefold change rather than a change of one third. Remember that lift is unconnected to statistical significance: a lift of 1.01 could be statistically significant if the sample size is large enough, and a lift of 10 could be statistically insignificant if the sample size is small.

Confidence interval

To deal with sampling variation, it is useful to have a confidence interval on the lift. To do this, we need to specify what we mean by "true" lift. One way is to define it as the ratio of "true" counts nij/eij. The uncertainty in the true eij is small enough that we can equate it with the observed eij. A confidence interval on the true nij can be obtained by assuming the observed count is Poisson with mean nij.

Another approach is to define the true lift as the ratio of the true probabilities p(i,j) and p(i)*p(j). Again, we can assume p(i)*p(j) is essentially known. The usual standard error of a probability p(i,j) is sqrt(p(i,j)*(1-p(i,j))/n). Substituting the observed probability, and assuming p(i,j) is sufficiently small, gives SE = sqrt(nij)/n. This is the same as the Poisson result. (The standard error sqrt(eij) used in the Pearson residuals has a similar origin.)

In conclusion, an approximate 68% confidence interval on the lift is ((actual - sqrt(actual))/expected, (actual + sqrt(actual))/expected). If we are interested in finding the cells with largest lift, we use the lower bound. If we are interested in finding the cells with smallest lift, we use the upper bound. This way we focus our efforts on deviations from independence which are reliably interesting. Statistical significance and practical relevance are handled at the same time. Note that even though the two-sided interval contains the true lift only 68% of the time, each bound by itself is correct 84% of the time: the true lift falls below the lower bound about 16% of the time and above the upper bound about 16% of the time.
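
The interval and the ranking rule can be sketched in a few lines of Python. This is only an illustration of the formula above, not DuMouchel's method or the crosstab package; the function name lift_interval and the two cells are made up.

```python
from math import sqrt

def lift_interval(actual, expected):
    """Approximate 68% interval on lift: (actual -/+ sqrt(actual)) / expected."""
    half_width = sqrt(actual)  # Poisson standard error of the observed count
    return (actual - half_width) / expected, (actual + half_width) / expected

# Made-up (actual, expected) counts for two hypothetical cells.
cells = {"A": (4, 1.0), "B": (900, 300.0)}

for name, (actual, expected) in cells.items():
    lower, upper = lift_interval(actual, expected)
    print(f"{name}: lift = {actual / expected:.2f}, "
          f"interval = ({lower:.2f}, {upper:.2f})")

# To find reliably large lifts, rank cells by the lower bound, largest first.
by_lower = sorted(cells, key=lambda name: lift_interval(*cells[name])[0],
                  reverse=True)
print(by_lower)
```

Cell A has the larger lift (4 versus 3), but its count is so small that its interval is (2.0, 6.0). Cell B's interval is (2.9, 3.1), so B ranks first by lower bound.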

Code

To sort the cells of a table, e.g. a table of lifts, type sort.cells(x). The cells are printed as a list, ordered from smallest to largest. This function is part of the crosstab package.
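
For readers without the crosstab package, the same idea can be sketched in Python. The helper sorted_cells below is hypothetical and is not a port of sort.cells; it only assumes that the goal is to list every cell with its row and column labels, from smallest value to largest. The lifts and labels are made up.

```python
def sorted_cells(table, row_names, col_names):
    """Return (value, row name, column name) triples, smallest value first."""
    cells = [(value, r, c)
             for r, row in zip(row_names, table)
             for c, value in zip(col_names, row)]
    cells.sort(key=lambda cell: cell[0])
    return cells

# Made-up table of lifts with made-up labels.
lift_table = [[1.33, 0.50],
              [0.67, 1.50]]

for value, r, c in sorted_cells(lift_table, ["r1", "r2"], ["c1", "c2"]):
    print(f"{r}, {c}: {value:.2f}")
```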

References

The lift measure has been used in several data mining papers. However, the technique of using confidence intervals is due to William DuMouchel. His papers are described in the next lecture.


Tom Minka
Last modified: Wed Oct 03 17:56:56 Eastern Daylight Time 2001