Another example of market basket analysis is analyzing international calling patterns. This analysis was done at AT&T (see the references below). For each month, a customer's "basket" is the set of countries that they called. An association between countries means that people who call country A are likely to also call country B in the same month. Most of the associations they found were unsurprising: countries with the same language and countries which are close to each other. One can learn about immigration patterns this way. An extreme outlier that they noticed was an association between St. Kitts and Vanuatu, two small islands on opposite sides of the globe. The number of customers who called both countries in the same month was far above that predicted by independence, especially since these countries are rarely called. It turns out that both countries, among others, are involved in phone scams where people who call pay services are unwittingly forwarded across the globe, resulting in huge phone bills.
Market basket analysis has also been successfully used by the Food and Drug Administration (FDA) for spotting adverse drug reactions. Health care providers are asked to report adverse reactions by patients to any drug. Note that if a patient is taking several drugs at the time, all will be reported with the same event. Furthermore, the table suffers from reporting bias, so that the marginals do not reflect the true probability of taking a drug or of having an adverse reaction in the overall population. The lift measure is natural in this situation, because it focuses on departures from the observed marginals of the table, not absolute probabilities.
In a study done by William DuMouchel, there were about 1400 drugs and 952 adverse events in the table, with a total of 4.9 million observations. The lift measure was compared to using Pearson residuals. Of the top 1000 associations found by each method, only 65 were picked by both measures, so the two measures rank associations quite differently. Pearson residuals favor cells with large counts, even if there is also a large expected count under independence. These cells do not have large lift or practical significance.
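To make the contrast concrete, here is a minimal sketch that computes both measures for a single cell of the table. It assumes the usual definitions: lift is the observed count divided by the count expected under independence, and the Pearson residual is (observed - expected) / sqrt(expected). The function name and the counts are hypothetical, chosen only for illustration; they are not taken from the FDA data.

```python
import math

def lift_and_pearson(n_ij, n_i, n_j, n):
    """Return (lift, Pearson residual) for one cell of a two-way table.

    n_ij : observed count of reports containing both drug i and event j
    n_i  : number of reports containing drug i
    n_j  : number of reports containing event j
    n    : total number of reports
    """
    expected = n_i * n_j / n              # expected count under independence
    lift = n_ij / expected
    pearson = (n_ij - expected) / math.sqrt(expected)
    return lift, pearson

# Hypothetical counts: a very common cell and a rare cell.
common = lift_and_pearson(n_ij=108_000, n_i=300_000, n_j=300_000, n=1_000_000)
rare = lift_and_pearson(n_ij=4, n_i=200, n_j=200, n=1_000_000)

print("common cell: lift=%.2f  pearson=%.1f" % common)
print("rare cell:   lift=%.2f  pearson=%.1f" % rare)
```

With these made-up numbers the common cell is only 20% above its expected count but gets the larger Pearson residual, while the rare cell is about 100 times its expected count and gets the smaller one. Ranking by one measure or the other would therefore order these two cells differently.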
One of the associations found in this study was between Sudden Infant Death Syndrome (SIDS) and the polio vaccine. Does this mean the polio vaccine is dangerous? Not necessarily; the polio vaccine is mainly given to infants, who are the only possible victims of SIDS. Knowing that someone received the polio vaccine makes it much more likely that they are an infant, and being an infant is what raises the chance of SIDS. This could explain the association. To go deeper, we would need to stratify the database on age and see if there is an association even among infants. This example teaches an important lesson: association mining is the beginning of an analysis, not the end. You need to be skeptical of every association found and use your statistical training to explain what is really going on.
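The stratification step can be sketched as follows. The records are entirely made up to show how an association can appear in the pooled table and vanish within the infant stratum; the field names and counts are assumptions and say nothing about the real data.

```python
def lift(records, a, b):
    """Lift between two boolean fields a and b over a list of dict records."""
    n = len(records)
    n_a = sum(r[a] for r in records)
    n_b = sum(r[b] for r in records)
    n_ab = sum(r[a] and r[b] for r in records)
    return (n_ab / n) / ((n_a / n) * (n_b / n))

def make(count, infant, vaccine, sids):
    """Build `count` identical hypothetical reports."""
    return [{"infant": infant, "vaccine": vaccine, "sids": sids}] * count

# Hypothetical database: SIDS occurs at the same rate (1%) among
# vaccinated and unvaccinated infants, and never among non-infants.
records = (
    make(8, True, True, True) + make(792, True, True, False)
    + make(2, True, False, True) + make(198, True, False, False)
    + make(90, False, True, False) + make(8910, False, False, False)
)

infants = [r for r in records if r["infant"]]

print("pooled lift:       %.2f" % lift(records, "vaccine", "sids"))
print("lift among infants: %.2f" % lift(infants, "vaccine", "sids"))
```

In this constructed example the pooled lift is about 9, but within the infant stratum it is 1, which is what independence predicts. If the lift stayed well above 1 among infants, the association would deserve a closer look.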
Market basket analysis can also be used in the news filtering and document classification problems we discussed earlier. Instead of just dealing with individual words, it is useful to identify word phrases like "Los Angeles" which should be treated as a unit. If each pair of adjacent words is considered a "basket", then market basket analysis tells us which words occur together more often than would be expected by chance. Incorporating word phrases in this way can improve retrieval performance. This idea has also been applied to automatic transcription and indexing of medical reports, so that the computer can identify salient technical phrases used by doctors.
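As a rough illustration of treating adjacent word pairs as baskets, the sketch below computes the lift of every bigram in a token list. The function name and the toy sentence are assumptions made for this example; on a corpus this small almost every pair looks inflated, so in practice one would use a large corpus and probably a minimum count as well.

```python
from collections import Counter

def bigram_lift(tokens):
    """Map each adjacent word pair (w1, w2) to its lift.

    Each pair of adjacent words is one "basket". Lift compares how often
    w1 is followed by w2 with what independence of the two positions
    would predict.
    """
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(w1 for w1, _ in pairs)
    second_counts = Counter(w2 for _, w2 in pairs)
    return {
        (w1, w2): (c / n) / ((first_counts[w1] / n) * (second_counts[w2] / n))
        for (w1, w2), c in pair_counts.items()
    }

# Toy text, made up for illustration.
text = ("the mayor of los angeles said the team from "
        "los angeles won the game in the city")
lifts = bigram_lift(text.split())

for pair, value in sorted(lifts.items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(value, 2))
```

Here ("los", "angeles") scores higher than pairs such as ("the", "mayor"), because "the" is followed by many different words while "los" is always followed by "angeles".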
Unfortunately, the support/confidence measure has several drawbacks. First, it is asymmetric, which is unnatural in market basket analysis. Second, it doesn't compare p(i|j) to the baseline frequency p(i). A confidence of 96% is not so interesting if the baseline frequency is 97%. Third, it misses important but rare cases, because of the support restriction. For example, the (St. Kitts, Vanuatu) association above could easily be missed. The result of these drawbacks is that the data miner has to peruse a large number of associations in order to find the ones that are interesting. The emphasis on saving computer time has created a bottleneck in researcher time. Using a more informative measure like lift gives the computer more work but allows a more thorough analysis of the few good associations that are found.
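The second and third drawbacks can be checked with a few lines of arithmetic. This sketch assumes the standard definitions: support is p(i,j), confidence is p(i|j) = p(i,j)/p(j), and lift is p(i,j)/(p(i)p(j)). All probabilities and the support threshold are hypothetical, picked to match the 96% versus 97% example above.

```python
def confidence(p_ij, p_j):
    """Confidence of the rule j -> i, i.e. p(i|j)."""
    return p_ij / p_j

def lift_from_probs(p_ij, p_i, p_j):
    """Lift: joint probability relative to what independence predicts."""
    return p_ij / (p_i * p_j)

MIN_SUPPORT = 0.01   # hypothetical support threshold

# Case 1: high confidence, but the baseline p(i) is even higher.
p_i, p_j, p_ij = 0.97, 0.50, 0.48
conf_common = confidence(p_ij, p_j)            # 0.96
lift_common = lift_from_probs(p_ij, p_i, p_j)  # slightly below 1

# Case 2: a rare pair that fails the support cut but has enormous lift.
r_i, r_j, r_ij = 2e-5, 2e-5, 1e-5
passes_support = r_ij >= MIN_SUPPORT
lift_rare = lift_from_probs(r_ij, r_i, r_j)

print("common: confidence=%.2f lift=%.3f" % (conf_common, lift_common))
print("rare:   passes support=%s lift=%.0f" % (passes_support, lift_rare))
```

In the first case a confidence of 96% corresponds to a lift just under 1, meaning j makes i very slightly less likely than usual. In the second case the pair would be discarded by the support restriction even though it occurs about 25,000 times more often than independence predicts.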
Another measure that is sometimes used by statistically minded practitioners is a chi-square test of association between i and j. This works by reducing the original table to a 2 by 2 table with cells (i,j), (not i, j), (i, not j), and (not i, not j). The chi-square statistic on this table is the measure of association. This method is essentially the same as using the Pearson residual, and has the same drawbacks.
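A minimal sketch of the 2 by 2 reduction, written out by hand rather than with a statistics library so that each cell is visible. The function name and the example counts are assumptions; no continuity correction is applied.

```python
def chi_square_2x2(n_ij, n_i, n_j, n):
    """Chi-square statistic for the 2 by 2 table built from items i and j.

    n_ij : baskets containing both i and j
    n_i  : baskets containing i
    n_j  : baskets containing j
    n    : total number of baskets
    """
    observed = [
        n_ij,                    # (i, j)
        n_j - n_ij,              # (not i, j)
        n_i - n_ij,              # (i, not j)
        n - n_i - n_j + n_ij,    # (not i, not j)
    ]
    expected = [                 # counts under independence, same cell order
        n_i * n_j / n,
        (n - n_i) * n_j / n,
        n_i * (n - n_j) / n,
        (n - n_i) * (n - n_j) / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts.
independent = chi_square_2x2(n_ij=10, n_i=20, n_j=20, n=40)  # no association
perfect = chi_square_2x2(n_ij=20, n_i=20, n_j=20, n=40)      # i and j always together
print(independent, perfect)
```

The first term of the sum is the squared Pearson residual of the (i,j) cell, which is why this statistic inherits the same preference for cells with large counts.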
An overview of research in association mining:
"Mining Large Itemsets for Association Rules", by Aggarwal and Yu.