Day 16 - Predictive abstraction

Today we talked about association mining for prediction rules and introduced the idea of predictive abstraction, which amounts to merging rows and columns of a contingency table.

Mining prediction rules

In market basket analysis, we were looking for items which often occurred in the same basket. A more general problem is to find unusually frequent combinations of values across a set of variables, e.g. combinations of passenger class, age, and survival on the Titanic. A natural place to start is examining all pairwise associations:

1. For each pair of variables in the dataset, construct a two-way table by marginalizing over all other variables.
2. Find the cells with the largest lower bound on lift.
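
The two steps above can be sketched in R as follows. This is only a minimal illustration and not the course's own implementation: the helper name pairwise.lift is made up here, and it ranks cells by raw lift rather than by a lower bound on lift, so its values come out somewhat larger than lower-bound figures would.

```r
# Raw lift p(i,j) / (p(i) p(j)) for every cell of every two-way marginal table.
# Assumption: plain lift is used here, not a lower confidence bound.
pairwise.lift <- function(x) {
  vars <- names(dimnames(x))
  pairs <- combn(length(vars), 2)              # all pairs of variables
  cells <- lapply(seq_len(ncol(pairs)), function(k) {
    ij <- pairs[, k]
    p <- prop.table(margin.table(x, ij))       # marginalize the other variables
    lift <- p / outer(rowSums(p), colSums(p))  # observed / expected under independence
    data.frame(var1 = vars[ij[1]], level1 = rownames(p)[row(p)],
               var2 = vars[ij[2]], level2 = colnames(p)[col(p)],
               lift = as.vector(lift), stringsAsFactors = FALSE)
  })
  cells <- do.call(rbind, cells)
  cells[order(-cells$lift), ]                  # strongest associations first
}

top <- head(pairwise.lift(Titanic), 3)
print(top)
```

On R's built-in Titanic table, the cell with the highest raw lift is Sex = Female with Survived = Yes, at about 2.27.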

I am providing a function, mine.associations, which does this. Here is the result of mine.associations(Titanic):

       Lift Class    Sex   Age Survived
10 1.196292  Crew   Male    NA       NA
9  1.207227   3rd Female    NA       NA
8  1.353335   2nd     NA Child       NA
7  1.404403    NA     NA Child      Yes
6  1.572567   2nd Female    NA       NA
5  1.645135    NA Female Child       NA
4  1.797873   1st     NA    NA      Yes
3  1.915820   1st Female    NA       NA
2  2.005303   3rd     NA Child       NA
1  2.143584    NA Female    NA      Yes

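Each row of this table describes a cell with high lift, and the Lift column is the lower bound on that cell's lift. Only two variables describe each cell; the other variables are listed as "NA". The rows are sorted by increasing lift, so the strongest associations are at the bottom. The top three associations in this list (rows 1 to 3) are: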

Females had a survival rate twice as high as that of the general population. Equivalently, survivors were twice as likely to be female compared to the general population.
The proportion of children in 3rd class was twice as high as the average.
The proportion of females in 1st class was 1.9 times the average.

Suppose we are specifically interested in associations regarding survival. Then we only consider two-way tables where "Survived" is one of the dimensions. The command is mine.associations(Titanic,target="Survived"):

        Lift Class    Sex   Age Survived
7  1.0566695   3rd     NA    NA       No
6  1.0800244  Crew     NA    NA       No
5  1.1324780    NA   Male    NA       No
4  1.1637132   2nd     NA    NA      Yes
3  1.4044028    NA     NA Child      Yes
2  1.7978733   1st     NA    NA      Yes
1  2.1435842    NA Female    NA      Yes

Now we see the associations identified on day 12:

Females had twice the average survival rate.
First-class passengers had 1.8 times the average survival rate.
Children had 1.4 times the average survival rate.
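
When one of the two variables is a target such as Survived, lift is simply the conditional rate divided by the base rate. The sketch below computes it directly. The helper name target.lift is an assumption for illustration, and it returns raw lifts rather than lower bounds, so its values are slightly higher than the Lift column shown above.

```r
# Lift of target = level for each value of one other variable:
# p(target = level | variable) / p(target = level).
target.lift <- function(x, variable, target = "Survived", level = "Yes") {
  idx <- match(c(variable, target), names(dimnames(x)))
  tab <- margin.table(x, idx)                  # variable x target counts
  cond <- tab[, level] / rowSums(tab)          # conditional rate per row
  base <- sum(tab[, level]) / sum(tab)         # overall (base) rate
  cond / base
}

sex.lift <- target.lift(Titanic, "Sex")
print(round(sex.lift, 2))                      # Female: about 2.27
print(round(target.lift(Titanic, "Class"), 2))
print(round(target.lift(Titanic, "Age"), 2))
```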

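Note that lifts whose lower bound is below 1 are not interesting: a lift of 1 is what independence would produce, so such cells are consistent with no positive association at all.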

Mining down an abstraction hierarchy

If an abstraction hierarchy is available, we can use it to aid the process of mining associations. Associations between abstract categories are easier to interpret and have larger support. So the idea is to start at the most abstract level and then work our way down, looking for associations between child categories that are stronger than those of the parents. Child associations which are weaker are not interesting. For example, we may find that beer and diapers are associated in market baskets, as well as the special case of Heineken and Pampers. If the more specific rule has smaller lift, then it is not interesting since it is already explained by the general rule. If the more specific rule has larger lift, then it is interesting since it suggests a reason for the beer and diapers association.
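
To make the parent versus child comparison concrete, here is a minimal sketch on R's built-in Titanic table. The grouping of 1st, 2nd and 3rd class into a parent category "Passenger" is a hypothetical hierarchy chosen only for illustration, the helper names are made up, and raw lift stands in for a lower bound.

```r
# Raw lift for every cell of a two-way count table.
table.lift <- function(tab) {
  p <- tab / sum(tab)
  p / outer(rowSums(p), colSums(p))
}

counts <- unclass(margin.table(Titanic, c(1, 4)))   # Class x Survived counts

# Hypothetical hierarchy: child level -> parent level.
parent <- c("1st" = "Passenger", "2nd" = "Passenger",
            "3rd" = "Passenger", "Crew" = "Crew")

# Abstract the rows by summing the counts of children that share a parent.
abstract <- rowsum(counts, parent[rownames(counts)])

child.lift  <- table.lift(counts)[, "Yes"]
parent.lift <- table.lift(abstract)[parent[rownames(counts)], "Yes"]

# A child association is interesting only if it is stronger than its parent's.
# The small tolerance keeps Crew, which is its own parent, from passing by rounding.
interesting <- child.lift > parent.lift + 1e-9
print(round(cbind(child.lift, parent.lift), 2))
print(interesting)
```

Here 1st and 2nd class survive at a higher rate than passengers as a whole, so they refine the parent association, while 3rd class does not.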

Here is an example using the census data from day 13. If we abstract the education variable into "HS" (high school), "College", and "Advanced", then these are the associations found:

        Lift workclass education
5  1.1981963       Fed   College
4  1.2337028       Fed  Advanced
3  1.5252680      Self  Advanced
2  2.1093621     Local  Advanced
1  2.4471763     State  Advanced

There is an association between government jobs and advanced degrees, as noted earlier. To go deeper, we redo with the detailed set of education levels:

       Lift workclass   education
5  1.599709      Self   Doctorate
4  2.204586     State     Masters
3  2.835249     Local     Masters
2  2.972812      Self Prof-school
1  4.732185     State   Doctorate

We see that the association between State jobs and Advanced education is primarily due to Doctorate degrees, which have a lift of 4.7 compared to 2.4 for Advanced education in general. A useful property of lift is that this ratio, 4.73/2.45 = 1.9, can be interpreted as the lift of a Doctorate education relative to all Advanced educations:

    lift(State, Doctorate) / lift(State, Advanced)
        = [p(State|Doctorate) / p(State)] / [p(State|Advanced) / p(State)]
        = p(State|Doctorate) / p(State|Advanced)

In other words, an individual with a Doctorate degree is 1.9 times more likely to work for State government compared to the average holder of an Advanced degree. Equivalently, a person known to have an Advanced education is 1.9 times more likely to have a Doctorate degree if they also work for State government. Likewise, we see that the association between Local government and Advanced education is primarily due to Masters degrees, but only by a factor of 2.84/2.11 = 1.3.
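
The ratio property can be checked numerically. The census table is not reproduced in these notes, so this sketch uses R's built-in Titanic data instead, with a hypothetical parent category "Passenger" (1st, 2nd and 3rd class together). The helper cell.lift is an illustrative name and computes raw lift.

```r
# Raw lift of a single cell (i, j) of a two-way count table.
cell.lift <- function(tab, i, j) {
  n <- sum(tab)
  (tab[i, j] / n) / ((sum(tab[i, ]) / n) * (sum(tab[, j]) / n))
}

counts <- unclass(margin.table(Titanic, c(1, 4)))    # Class x Survived counts
passengers <- counts[c("1st", "2nd", "3rd"), ]        # children of the hypothetical parent
merged <- rbind(Passenger = colSums(passengers), Crew = counts["Crew", ])

# Ratio of the child lift to the parent lift, both measured on everyone aboard.
ratio <- cell.lift(counts, "1st", "Yes") / cell.lift(merged, "Passenger", "Yes")

# Lift of 1st class measured only among passengers (the parent category).
relative <- cell.lift(passengers, "1st", "Yes")

print(c(ratio = ratio, relative = relative))
```

The two numbers agree (both are about 1.65), which is the same identity used in the Doctorate versus Advanced comparison.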

Predictive abstraction

If we are interested in mining associations at an abstract level, which abstraction principle should we use? It turns out that none of the principles we have talked about so far is appropriate, and we need a new one: the prediction principle. If we are interested in the associations between i and j, this principle says that we should preserve our ability to predict i from j and to predict j from i. Mathematically, this means that we should abstract i to preserve the conditional distribution p(j|i) and we should abstract j to preserve the conditional distribution p(i|j). For histogram merging, we would have only looked at p(i) or p(j). In fact, the prediction principle pays no attention to preserving p(i) or p(j).
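
As a small illustration of the two conditional distributions that the prediction principle asks us to preserve, the sketch below computes them for Class (i) and Survived (j) on R's built-in Titanic table. The variable names are chosen here for readability and are not part of the course code.

```r
counts <- unclass(margin.table(Titanic, c(1, 4)))   # Class (i) x Survived (j)

p.j.given.i <- prop.table(counts, 1)   # row i holds p(Survived | Class = i)
p.i.given.j <- prop.table(counts, 2)   # column j holds p(Class | Survived = j)

print(round(p.j.given.i, 3))
print(round(p.i.given.j, 3))

# Rows with nearly identical p(j|i) can be merged with little loss in our
# ability to predict j, no matter how large or small p(i) is for each row.
```

For example, the rows for 3rd class and Crew have survival rates of about 0.252 and 0.240, so merging them changes p(Survived | Class) very little.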

This principle leads us to the following abstraction algorithm:

1. Compute the similarity between each pair of rows and each pair of columns. If a dimension is naturally ordered, consider only adjacent pairs; otherwise, consider all pairs.
2. Merge the most similar pair.
3. Repeat from step 1.

Because we are interested in preserving the distribution in each row and column, the appropriate similarity measure is similarity between distributions. The method we previously used for this purpose is the homogeneity measure, a.k.a. the chi-square statistic.
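
One step of this procedure might look like the following sketch. It is not the course's table-merging code: the function names are made up, it handles only rows of an unordered dimension (columns could be handled by transposing the table), and it assumes the plain Pearson chi-square statistic on the two rows as the homogeneity measure.

```r
# Chi-square homogeneity statistic for two rows of a count table.
# Small values mean the two rows have similar conditional distributions.
row.chisq <- function(tab, a, b) {
  pair <- tab[c(a, b), , drop = FALSE]
  expected <- outer(rowSums(pair), colSums(pair)) / sum(pair)
  sum(((pair - expected)^2 / expected)[expected > 0])
}

# One greedy step: merge the most similar pair of rows.
greedy.merge.rows <- function(tab) {
  pairs <- combn(nrow(tab), 2)                       # all pairs of rows
  stat <- apply(pairs, 2, function(ab) row.chisq(tab, ab[1], ab[2]))
  best <- pairs[, which.min(stat)]                   # most similar pair
  merged <- tab[best[1], ] + tab[best[2], ]          # add their counts
  out <- rbind(tab[-best, , drop = FALSE], merged)
  rownames(out)[nrow(out)] <- paste(rownames(tab)[best], collapse = "+")
  out
}

counts <- unclass(margin.table(Titanic, c(1, 4)))   # Class x Survived counts
merged.once <- greedy.merge.rows(counts)
print(merged.once)
```

On the Class by Survived table, the first merge joins 3rd class with Crew, the two rows whose survival rates are closest. Calling the function again on its own result continues the process.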

Examples of this merging algorithm are given in the next lecture.

Code

The code for mine.associations and table merging is given in the next lecture.

Tom Minka

Last modified: Mon Aug 22 16:38:54 GMT 2005