Day 16 - Predictive abstraction
Today we talked about association mining for prediction rules and
introduced the idea of predictive abstraction, which amounts to merging
rows and columns of a contingency table.
Mining prediction rules
In market basket analysis, we were looking for items which often occurred
in the same basket. A more general problem is to find unusually frequent
combinations of values among a set of variables, e.g. combinations of
passenger class, age, and survival on the Titanic.
A natural place to start is
examining all pairwise associations:
- For each pair of variables in the dataset, construct a two-way table by
  marginalizing all other variables.
- Find the cells with largest lower bound on lift.
I am providing a function, mine.associations, which does this.
Here is the result of mine.associations(Titanic):
       Lift Class    Sex   Age Survived
10 1.196292  Crew   Male    NA       NA
9  1.207227   3rd Female    NA       NA
8  1.353335   2nd     NA Child       NA
7  1.404403    NA     NA Child      Yes
6  1.572567   2nd Female    NA       NA
5  1.645135    NA Female Child       NA
4  1.797873   1st     NA    NA      Yes
3  1.915820   1st Female    NA       NA
2  2.005303   3rd     NA Child       NA
1  2.143584    NA Female    NA      Yes
Each row of this table describes a cell with high lift.
Only two variables describe the cell; the other variables are listed as "NA".
The top three associations in this list (rows 1 to 3, at the bottom) are:
- Females had a survival rate twice as high as the general population.
Equivalently, survivors were twice as likely to be female compared to the
general population.
- The proportion of children in 3rd class was twice as high as the average.
- The proportion of females in 1st class was 1.9 times the average.
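The two readings of the female/survival association can be checked with a
few lines of Python. This is only a sketch, not the course's
mine.associations function: the helper name lift is made up for
illustration, the counts are the Sex by Survived margin of R's built-in
Titanic table, and the result is the plain point estimate of lift, so it
comes out somewhat higher than the lower bound of 2.14 listed above.

```python
# Sex x Survived counts: R's Titanic table summed over Class and Age.
COUNTS = {
    ("Female", "Yes"): 344,
    ("Female", "No"): 126,
    ("Male", "Yes"): 367,
    ("Male", "No"): 1364,
}


def lift(counts, row, col):
    """Point estimate of lift: p(row, col) / (p(row) * p(col))."""
    total = sum(counts.values())
    row_total = sum(n for (r, _), n in counts.items() if r == row)
    col_total = sum(n for (_, c), n in counts.items() if c == col)
    return counts[(row, col)] * total / (row_total * col_total)


total = sum(COUNTS.values())
females = COUNTS[("Female", "Yes")] + COUNTS[("Female", "No")]
survivors = COUNTS[("Female", "Yes")] + COUNTS[("Male", "Yes")]

# Reading 1: survival rate among females relative to the overall rate.
rate_ratio = (COUNTS[("Female", "Yes")] / females) / (survivors / total)
# Reading 2: share of females among survivors relative to the overall share.
share_ratio = (COUNTS[("Female", "Yes")] / survivors) / (females / total)

print(lift(COUNTS, "Female", "Yes"), rate_ratio, share_ratio)
```

All three printed numbers agree, which is why a single lift value supports
both statements.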
Suppose we are specifically interested in associations regarding
survival. Then we only consider two-way tables where "Survived" is one of
the dimensions.
The command is mine.associations(Titanic,target="Survived"):
       Lift Class    Sex   Age Survived
7 1.0566695   3rd     NA    NA       No
6 1.0800244  Crew     NA    NA       No
5 1.1324780    NA   Male    NA       No
4 1.1637132   2nd     NA    NA      Yes
3 1.4044028    NA     NA Child      Yes
2 1.7978733   1st     NA    NA      Yes
1 2.1435842    NA Female    NA      Yes
Now we see the associations identified on day 12:
- Females had twice the average survival rate.
- First-class passengers had 1.8 times the average survival rate.
- Children had 1.4 times the average survival rate.
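Note that lifts whose lower bound is below 1 are not interesting, since we
then cannot be confident that the combination occurs more often than it
would under independence (a lift of 1).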
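The pairwise procedure, with and without a target variable, can be sketched
in Python as follows. This is an illustration under stated assumptions, not
the mine.associations function itself: the names two_way_table and
mine_lifts are invented here, the counts are those of R's built-in Titanic
table, and cells are ranked by the point estimate of lift rather than by a
lower bound, so the values are larger than those in the tables above.

```python
from collections import Counter
from itertools import combinations

VARIABLES = ("Class", "Sex", "Age", "Survived")
CLASSES = ("1st", "2nd", "3rd", "Crew")

# Counts from R's Titanic table, keyed by (Sex, Age, Survived);
# each tuple is ordered as CLASSES.
_COUNTS = {
    ("Male", "Child", "No"): (0, 0, 35, 0),
    ("Female", "Child", "No"): (0, 0, 17, 0),
    ("Male", "Adult", "No"): (118, 154, 387, 670),
    ("Female", "Adult", "No"): (4, 13, 89, 3),
    ("Male", "Child", "Yes"): (5, 11, 13, 0),
    ("Female", "Child", "Yes"): (1, 13, 14, 0),
    ("Male", "Adult", "Yes"): (57, 14, 75, 192),
    ("Female", "Adult", "Yes"): (140, 80, 76, 20),
}
# Full four-way table keyed by (Class, Sex, Age, Survived).
TITANIC = {
    (cls, sex, age, surv): n
    for (sex, age, surv), row in _COUNTS.items()
    for cls, n in zip(CLASSES, row)
}


def two_way_table(cells, a, b):
    """Marginalize the full table down to the variables at positions a, b."""
    table = Counter()
    for key, n in cells.items():
        table[(key[a], key[b])] += n
    return table


def mine_lifts(cells, variables, target=None):
    """Return (lift, {variable: level}) for every two-way cell, best first."""
    total = sum(cells.values())
    results = []
    for a, b in combinations(range(len(variables)), 2):
        # With a target, keep only tables that have it as a dimension.
        if target is not None and target not in (variables[a], variables[b]):
            continue
        table = two_way_table(cells, a, b)
        row_tot, col_tot = Counter(), Counter()
        for (i, j), n in table.items():
            row_tot[i] += n
            col_tot[j] += n
        for (i, j), n in table.items():
            lift = n * total / (row_tot[i] * col_tot[j])
            results.append((lift, {variables[a]: i, variables[b]: j}))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


for value, cell in mine_lifts(TITANIC, VARIABLES, target="Survived")[:3]:
    print(round(value, 3), cell)
```

With target="Survived" the three strongest cells are the same ones listed
above (female, first class, child), in the same order. A lower bound would
additionally penalize cells with small counts, which a point estimate
cannot do.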
Mining down an abstraction hierarchy
If an abstraction hierarchy is available, we can use it to aid the process
of mining associations. Associations between abstract categories are
easier to interpret and have larger support.
So the idea is to start at the most abstract level and then work our way
down, looking for associations between child categories that are stronger
than those of the parents. Child associations which are weaker are not
interesting. For example, we may find that beer and diapers are associated
in market baskets, and so is the special case of Heineken and Pampers.
If the more specific rule has smaller lift, then it is not interesting
since it is already explained by the general rule. If the more specific
rule has larger lift, then it is interesting since it suggests
a reason for the beer and diapers association.
Here is an example using the census data from day 13.
If we abstract the education variable into "HS" (high school), "College",
and "Advanced", then these are the associations found:
       Lift workclass education
5 1.1981963       Fed   College
4 1.2337028       Fed  Advanced
3 1.5252680      Self  Advanced
2 2.1093621     Local  Advanced
1 2.4471763     State  Advanced
There is an association between government jobs and advanced degrees, as
noted earlier. To go deeper, we redo with the detailed set of education
levels:
      Lift workclass   education
5 1.599709      Self   Doctorate
4 2.204586     State     Masters
3 2.835249     Local     Masters
2 2.972812      Self Prof-school
1 4.732185     State   Doctorate
We see that the association between State jobs and Advanced education is
primarily due to Doctorate degrees, which have a lift of 4.7 compared to
2.4 for Advanced education in general. A useful property of lift is that
this ratio, 4.73/2.45 = 1.9, can be interpreted as the lift of a
Doctorate education relative to all Advanced educations:
  lift(State, Doctorate) / lift(State, Advanced)
    = p(State | Doctorate) / p(State | Advanced)
In other words, an individual with a Doctorate degree is 1.9 times
as likely to work for State government as the average holder of an
Advanced degree. Equivalently, a person known to have an Advanced
education is 1.9 times as likely to have a Doctorate degree if they
also work for State government.
Likewise, we see that the association between Local government and
Advanced education is primarily due to Masters degrees, but only by a
factor of 2.84/2.11 = 1.3.
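The comparison between child and parent associations can be sketched in
Python using the lift values from the two tables above. The function name
relative_lifts is invented for this example, and the mapping of Doctorate,
Masters, and Prof-school to "Advanced" is an assumption about how the
education levels were grouped. If the tabulated values are lower bounds
rather than exact lifts, the ratios are only approximate.

```python
# Lift values copied from the abstract-level table.
PARENT_LIFT = {
    ("State", "Advanced"): 2.4471763,
    ("Local", "Advanced"): 2.1093621,
    ("Self", "Advanced"): 1.5252680,
}
# Lift values copied from the detailed-level table.
CHILD_LIFT = {
    ("Self", "Doctorate"): 1.599709,
    ("State", "Masters"): 2.204586,
    ("Local", "Masters"): 2.835249,
    ("Self", "Prof-school"): 2.972812,
    ("State", "Doctorate"): 4.732185,
}
# Assumed abstraction hierarchy: detailed level -> abstract level.
PARENT_OF = {
    "Doctorate": "Advanced",
    "Masters": "Advanced",
    "Prof-school": "Advanced",
}


def relative_lifts(child_lift, parent_lift, parent_of):
    """Divide each child lift by the lift of its parent association."""
    ratios = {}
    for (work, edu), value in child_lift.items():
        parent = parent_lift.get((work, parent_of[edu]))
        if parent is not None:
            ratios[(work, edu)] = value / parent
    return ratios


ratios = relative_lifts(CHILD_LIFT, PARENT_LIFT, PARENT_OF)
# A child association is interesting only if it is stronger than its parent.
interesting = {cell: r for cell, r in ratios.items() if r > 1}
for cell, r in sorted(interesting.items(), key=lambda kv: -kv[1]):
    print(cell, round(r, 2))
```

Under this filter State/Masters drops out, because its lift is below that
of State/Advanced and is therefore already explained by the more general
rule.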
Predictive abstraction
If we are interested in mining associations at an abstract level, which
abstraction principle should we use? It turns out that none of the
principles we have talked about so far is appropriate, and we need a new
one: the prediction principle. If we are interested in the
associations between a row variable i and a column variable j, this
principle says that we should preserve our ability to predict i from
j and to predict j from i.
Mathematically, this means that we should abstract i to preserve
the conditional distribution p(j|i) and we should abstract
j to preserve the conditional distribution p(i|j).
For histogram merging, we would have only looked at p(i) or
p(j). In fact, the prediction principle pays no attention to
preserving p(i) or p(j).
This principle leads us to the following abstraction algorithm:
- Compute the similarity between each pair of rows and each pair of columns.
If the dimensions are naturally ordered, only consider adjacent pairs,
otherwise consider all pairs.
- Merge the most similar pair.
- Repeat.
Because we are interested in preserving the distribution in each row and
column, the appropriate similarity measure is similarity between
distributions. The method we previously used for this purpose is the
homogeneity measure, a.k.a. chi-square statistic.
Examples of this merging algorithm are given in the next lecture.
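A minimal Python sketch of the merging loop is given below. It is not the
course code: the function names chi_square and merge_rows, the n_groups
stopping rule, and the "+" label convention are assumptions made for
illustration. It merges rows only (columns can be handled by transposing
the table) and measures similarity with the chi-square homogeneity
statistic of the 2 x k table formed by two rows, where a smaller value
means more similar.

```python
from itertools import combinations


def chi_square(row_a, row_b):
    """Chi-square homogeneity statistic for the 2 x k table [row_a, row_b]."""
    total_a, total_b = sum(row_a), sum(row_b)
    if total_a == 0 or total_b == 0:
        return 0.0  # an empty row cannot contradict the other row
    total = total_a + total_b
    stat = 0.0
    for a, b in zip(row_a, row_b):
        col = a + b
        if col == 0:
            continue  # an empty column carries no evidence either way
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * col / total
            stat += (observed - expected) ** 2 / expected
    return stat


def merge_rows(table, labels, n_groups, ordered=False):
    """Greedily merge the most similar pair of rows until n_groups remain."""
    table = [list(row) for row in table]
    labels = list(labels)
    while len(table) > n_groups:
        if ordered:
            # Naturally ordered dimension: only adjacent rows may merge.
            pairs = [(i, i + 1) for i in range(len(table) - 1)]
        else:
            pairs = list(combinations(range(len(table)), 2))
        i, j = min(pairs, key=lambda p: chi_square(table[p[0]], table[p[1]]))
        table[i] = [a + b for a, b in zip(table[i], table[j])]
        labels[i] = labels[i] + "+" + labels[j]
        del table[j], labels[j]
    return table, labels


# Class x Survived counts as [No, Yes], from R's Titanic table.
LABELS = ["1st", "2nd", "3rd", "Crew"]
ROWS = [[122, 203], [167, 118], [528, 178], [673, 212]]
print(merge_rows(ROWS, LABELS, n_groups=3))
```

On this table the first merge pools 3rd class with Crew, the two rows with
the most similar survival distributions.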
Code
The code for mine.associations and table merging is given in
the next lecture.
Tom Minka
Last modified: Mon Aug 22 16:38:54 GMT 2005