Day 15 - Association mining

Association mining is an application of the contingency table techniques we have been talking about so far. It is probably the most common type of analysis in data mining and is included in virtually all data mining packages. These packages mainly differ in how they compute the size of effects. As discussed last time, there are different ways of doing this, some good and some not so good. In this class we recommend using the lift measure. In association mining, the emphasis is almost always on positive associations (large lifts). The two main applications of association mining are market basket analysis and finding prediction rules.

Market basket analysis

In market basket analysis, we have a collection of sets ("baskets") and want to find elements which often occur together in these sets. For example, grocery stores want to know what items a person is likely to buy on the same trip. They can put these items closer together in the store, or in some cases farther apart so that you see more of the store. They can put certain items on sale, with the expectation that people will buy certain other items which are not on sale. If the store has a way of identifying you, such as a loyalty card, a market basket could extend over multiple purchases. These associations are used to trigger coupon suggestions. In e-commerce, the store always knows who you are, so this data is easy to gather. An online store can dynamically modify its web site to suggest other items you are likely to buy.
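
As a rough illustration of the idea, the sketch below computes the lift of every pair of items in a handful of baskets, taking lift to be the observed count of a pair divided by the count expected if the two items were independent. The baskets and the function name pair_lifts are made up for this example; this is a minimal sketch, not how any particular data mining package does it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented purely for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
    {"milk"},
    {"eggs", "milk"},
]

def pair_lifts(baskets):
    """Return {(a, b): lift} for every pair of items that co-occurs at least once."""
    n = len(baskets)
    # Number of baskets containing each item.
    item_counts = Counter(item for b in baskets for item in b)
    # Number of baskets containing each unordered pair (stored in sorted order).
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    lifts = {}
    for (a, b), observed in pair_counts.items():
        # Count we would expect if a and b occurred independently.
        expected = item_counts[a] * item_counts[b] / n
        lifts[(a, b)] = observed / expected
    return lifts

lifts = pair_lifts(baskets)
for pair, value in sorted(lifts.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 2))
```

In this toy data, bread and butter appear together twice as often as independence would predict, while eggs and milk appear together exactly as often as predicted.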

Another example of market basket analysis is analyzing international calling patterns. This analysis was done at AT&T; see the references below. For each month, a customer's "basket" is the set of countries that they made calls to. An association between countries means that people who call country A are likely to also call country B in the same month. Most of the associations found were unsurprising: countries with the same language and countries which are close to each other. One can learn about immigration patterns this way. An extreme outlier was an association between St. Kitts and Vanuatu, two small island nations on opposite sides of the globe. The number of times that these two countries are called in the same month is far above that predicted by independence, especially since these countries are rarely called. It turns out that both countries, among others, are involved in phone scams where people who call pay services are unwittingly forwarded across the globe, resulting in huge phone bills.

Market basket analysis has also been successfully used by the Food and Drug Administration (FDA) for spotting adverse drug reactions. Health care providers are asked to report adverse reactions by patients to any drug. Note that if a patient is taking several drugs at the time, all will be reported with the same event. Furthermore, the table suffers from reporting bias, so that the marginals do not reflect the true probability of taking a drug or of having an adverse reaction in the overall population. The lift measure is natural in this situation, because it focuses on departures from the observed marginals of the table, not absolute probabilities.

In a study done by William DuMouchel, there were about 1400 drugs and 952 adverse events in the table, with a total of 4.9 million observations. The lift measure was compared to using Pearson residuals. Of the top 1000 associations found by each method, only 65 were picked by both measures, which shows that the two measures behave quite differently. Pearson residuals favor cells with large counts, even if there is also a large expected count under independence. These cells do not have large lift or practical significance.
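
To see why the two measures can disagree so much, here is a small sketch comparing them on two cells. It assumes the usual definitions: lift is observed/expected, and the Pearson residual is (observed - expected)/sqrt(expected). The counts are invented for illustration and are not taken from the FDA study.

```python
import math

# Hypothetical cells: (observed count, expected count under independence).
cells = {
    "common pair": (11000, 10000.0),
    "rare pair": (8, 1.0),
}

def lift(observed, expected):
    """Ratio of the observed count to the count expected under independence."""
    return observed / expected

def pearson_residual(observed, expected):
    """Difference from the expected count, in units of sqrt(expected)."""
    return (observed - expected) / math.sqrt(expected)

for name, (obs, exp) in cells.items():
    print(name,
          "lift =", round(lift(obs, exp), 2),
          "residual =", round(pearson_residual(obs, exp), 2))
```

The common pair has the larger residual (10 versus 7) even though it occurs only 10% more often than expected, while the rare pair occurs 8 times more often than expected. Ranking by residual and ranking by lift therefore put these two cells in opposite orders.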

One of the associations found in this study was between Sudden Infant Death Syndrome (SIDS) and the polio vaccine. Does this mean the polio vaccine is dangerous? Not necessarily; the polio vaccine is mainly given to infants, who are the only possible victims of SIDS. Receiving the polio vaccine increases your likelihood of being an infant, which significantly increases your chance of SIDS. This could explain the association. To go deeper, we would need to stratify the database on age and see if there is an association even among infants. This example teaches an important lesson: association mining is the beginning of an analysis, not the end. You need to be skeptical of every association found and use your statistical training to explain what is really going on.
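
The effect of stratifying can be sketched with made-up numbers. The counts below are hypothetical and chosen so that, among infants, the vaccine and the event are exactly independent; they are not from the actual database.

```python
def lift(n_both, n_drug, n_event, n_total):
    """Observed joint count divided by the count expected under independence."""
    expected = n_drug * n_event / n_total
    return n_both / expected

# Invented counts: (reports with both, reports with vaccine,
#                   reports with event, total reports).
infants = (80, 800, 100, 1000)
older = (0, 90, 0, 9000)  # the event never occurs in this stratum

# Pooling the strata just adds the counts component by component.
pooled = tuple(a + b for a, b in zip(infants, older))

pooled_lift = lift(*pooled)
infant_lift = lift(*infants)
print("pooled lift:", round(pooled_lift, 2))
print("infant lift:", round(infant_lift, 2))
# The lift for the older stratum is not computed: its expected count is zero.
```

With these numbers the pooled table shows a lift of about 9, yet within the infant stratum the lift is exactly 1. The pooled association comes entirely from age, which is the kind of explanation we would want to rule out before concluding anything about the vaccine.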

Market basket analysis can also be used in the news filtering and document classification problems we discussed earlier. Instead of just dealing with individual words, it is useful to identify word phrases like "Los Angeles" which should be treated as a unit. If each pair of adjacent words is considered a "basket", then market basket analysis tells us which words occur together more often than would be expected by chance. Incorporating word phrases in this way can improve retrieval performance. This idea has also been applied to automatic transcription and indexing of medical reports, so that the computer can identify salient technical phrases used by doctors.
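
A minimal sketch of the adjacent-word idea follows. The text is a made-up toy string, and the table is indexed by (first word, second word) so that the expected count comes from the two marginals. With so little text nearly every observed pair has an inflated lift, so this only shows the mechanics.

```python
from collections import Counter

# Toy text, invented for illustration.
text = "los angeles is big and los angeles is sunny and the city is big"
words = text.split()

# Each pair of adjacent words is one "basket".
bigrams = list(zip(words, words[1:]))
n = len(bigrams)

pair_counts = Counter(bigrams)
first_counts = Counter(w1 for w1, _ in bigrams)   # marginal for the first slot
second_counts = Counter(w2 for _, w2 in bigrams)  # marginal for the second slot

def bigram_lift(w1, w2):
    """Observed count of (w1, w2) over the count expected under independence."""
    expected = first_counts[w1] * second_counts[w2] / n
    return pair_counts[(w1, w2)] / expected

print("los angeles:", round(bigram_lift("los", "angeles"), 2))
print("is big:", round(bigram_lift("is", "big"), 2))
print("and the:", round(bigram_lift("and", "the"), 2))
```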

Interest measures

A major choice in association mining is how the interestingness of an association should be measured. The lift measure has only recently emerged on the scene. Historically, the dominant approach has been support/confidence. In this approach, we seek the cells with the highest prediction confidence p(i|j), subject to having support above a threshold: p(i,j) > t. This measure of association arose mainly for computational reasons, since it allows us to restrict our attention to the largest counts, which are easy to identify. (This is not possible with lift, since lift can be high even if the actual count is small, as long as the expected count is even smaller.)
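
The support/confidence recipe can be sketched as follows. The baskets, the threshold value, and the function name support_confidence_rules are assumptions made for this example. A real implementation would avoid enumerating all pairs, which is the computational point of the support threshold; this brute-force sketch does not attempt that.

```python
from collections import Counter
from itertools import permutations

# Hypothetical baskets, invented purely for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
    {"milk"},
    {"eggs", "milk"},
]

def support_confidence_rules(baskets, min_support):
    """Return (i, j, support, confidence) tuples, highest confidence first."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    # Ordered pairs, because confidence p(i|j) depends on the direction.
    pair_counts = Counter(
        pair for b in baskets for pair in permutations(sorted(b), 2)
    )
    rules = []
    for (i, j), count in pair_counts.items():
        support = count / n                      # p(i, j)
        if support > min_support:
            confidence = count / item_counts[j]  # p(i | j)
            rules.append((i, j, support, confidence))
    rules.sort(key=lambda rule: -rule[3])
    return rules

rules = support_confidence_rules(baskets, min_support=0.25)
for i, j, support, confidence in rules:
    print(f"p({i} | {j}) = {confidence:.2f}, support = {support:.2f}")
```

Only pairs that occur in more than a quarter of the baskets survive the filter, and each surviving pair produces two rules, one per direction.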

Unfortunately, the support/confidence measure has several drawbacks. First, it is asymmetric, which is unnatural in market basket analysis. Second, it doesn't compare p(i|j) to the baseline frequency p(i). A confidence of 96% is not so interesting if the baseline frequency is 97%. Third, it misses important but rare cases, because of the support restriction. For example, the (St. Kitts, Vanuatu) association above could easily be missed. The result of these drawbacks is that the data miner has to peruse a large number of associations in order to find the ones that are interesting. The emphasis on saving computer time has created a bottleneck in researcher time. Using a more informative measure like lift gives the computer more work but allows a more thorough analysis of the few good associations that are found.
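
The first two drawbacks can be checked with a little arithmetic. The counts below are hypothetical, chosen to reproduce the 96% versus 97% example.

```python
# Hypothetical counts: item i is in 970 of 1000 baskets, item j in 100,
# and both appear together in 96.
n_total, n_i, n_j, n_both = 1000, 970, 100, 96

conf_i_given_j = n_both / n_j   # p(i | j)
conf_j_given_i = n_both / n_i   # p(j | i)
baseline_i = n_i / n_total      # p(i)
baseline_j = n_j / n_total      # p(j)

# Lift compares the confidence to the baseline, and is the same either way round.
lift_from_i = conf_i_given_j / baseline_i
lift_from_j = conf_j_given_i / baseline_j

print("p(i|j) =", round(conf_i_given_j, 3))
print("p(j|i) =", round(conf_j_given_i, 3))
print("lift   =", round(lift_from_i, 3), "=", round(lift_from_j, 3))
```

The two confidences are very different (0.96 versus about 0.099), which is the asymmetry. The lift is slightly below 1 in both directions, so the 96% confidence actually reflects a weak negative association.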

Another measure that is sometimes used by statistically minded practitioners is a chi-square test of association between i and j. This works by reducing the original table to a 2 by 2 table with cells (i,j), (not i, j), (i, not j), and (not i, not j). The chi-square statistic on this table is the measure of association. This method is essentially the same as using the Pearson residual, and has the same drawbacks.
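
Here is a sketch of the 2 by 2 reduction, using the standard Pearson chi-square statistic (sum of (observed - expected)^2 / expected over the four cells). The function names and all counts are made up for illustration.

```python
def collapse_to_2x2(n_ij, n_i, n_j, n_total):
    """Rows are (i, not i); columns are (j, not j)."""
    return [
        [n_ij, n_i - n_ij],
        [n_j - n_ij, n_total - n_i - n_j + n_ij],
    ]

def chi_square(table):
    """Pearson chi-square statistic for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical cells in a table with 100000 observations in total.
# Common pair: expected count 10000, observed 11000 (lift 1.1).
chi_common = chi_square(collapse_to_2x2(11000, 50000, 20000, 100000))
# Rare pair: expected count 1, observed 8 (lift 8).
chi_rare = chi_square(collapse_to_2x2(8, 1000, 100, 100000))

print("common pair:", round(chi_common, 2))
print("rare pair:", round(chi_rare, 2))
```

The common pair, whose lift is only 1.1, gets a chi-square statistic about five times larger than that of the rare pair with lift 8. This is the same preference for large counts that the Pearson residual shows.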

Optional reading

Several of the applications described above are the work of William DuMouchel:

International calling patterns: "Empirical Bayes screening for multi-item associations"
FDA adverse event database: "Bayesian data mining in large frequency tables, with an application to the FDA Spontaneous Reporting System"
Word phrases in medical transcriptions: "Two applications of statistical modelling to natural language processing"

An overview of research in association mining:
"Mining Large Itemsets for Association Rules", by Aggarwal and Yu.

Tom Minka

Last modified: Fri Sep 02 13:43:04 GMT 2005