Day 17 - Applying predictive abstraction

Predictive abstraction is useful for simplifying tables, making abstraction hierarchies, customer profiling, and time-series analysis.

Simple examples

Consider the following table:
   j
i   1 2 3 4
  1 0 0 5 5
  2 0 0 5 5
  3 5 5 0 0
  4 5 5 0 0
There is a definite relationship between i and j. Is it possible to abstract the values of i and j without losing the relationship? The marginal totals give a uniform distribution over i and over j, so they prefer no particular abstraction. But clearly we lose something if we merge i=2 and i=3.
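For reference, here is one way to build this table in R and check the margins (assuming, as the printed output suggests, that merge.table accepts a matrix with named dimnames):

x <- matrix(c(0,0,5,5,
              0,0,5,5,
              5,5,0,0,
              5,5,0,0),
            nrow=4, byrow=TRUE, dimnames=list(i=1:4, j=1:4))
rowSums(x)   # 10 10 10 10: uniform over i
colSums(x)   # 10 10 10 10: uniform over j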

The prediction principle says we should merge rows with similar distributions and columns with similar distributions. In this case, we should merge (i=1,i=2), (i=3,i=4), (j=1,j=2), and (j=3,j=4).

The function provided for predictive merging is called merge.table. If the above table is x, we can reduce it to a 2 by 2 table via merge.table(x, c(2,2)). The first argument is the table and the second argument is the desired number of abstraction bins. Because the table has two dimensions, we have to give a vector of two numbers, specifying the desired number of rows and columns. The c function makes a vector. Here is the result:

> merge.table(x,c(2,2))
merging i = 2 and 1 
merging i = 4 and 3 
merging j = 2 and 1 
merging j = 4 and 3 
total cost = 0 
     j
i     2.1 4.3
  2.1   0  20
  4.3  20   0
The merging cost in this case was zero, because all merges were between identical rows and columns.

Suppose the rows and columns were originally in a different order:

   j
i   1 2 3 4
  1 0 5 0 5
  2 5 0 5 0
  3 0 5 0 5
  4 5 0 5 0
If the dimensions are viewed as unordered categories, then we should expect the same result. In fact, that is what we get when we run merge.table.
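To check, note that this is just the first table with its rows and columns interleaved, so something like the following should reproduce the result (the labels come out as 1,3,2,4 rather than 1,2,3,4, but the merges are equivalent):

x2 <- x[c(1,3,2,4), c(1,3,2,4)]   # permute rows and columns of the first table
merge.table(x2, c(2,2))           # total cost 0, same abstraction up to relabeling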

How about a less trivial table? Here is Table 2.14 in Moore & McCabe:

             age
education     25-34 35-54   55+
  noHS         5325  9152 16035
  HS          14061 24070 18320
  SomeCollege 11659 19926  9662
  College     10342 19878  8005
It seems that the rows SomeCollege and College could be merged. As for columns, it is harder to tell. A mosaic plot can help:
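The two plots can be made with mosaicplot, once on the table and once on its transpose (shade=T colors cells by their deviation from independence, as in the later examples):

mosaicplot(x, shade=T)      # condition on education: compare age distributions
mosaicplot(t(x), shade=T)   # condition on age: compare education distributions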

The first plot confirms that SomeCollege and College have similar age distributions. The second plot tells us that ages 25-34 and 35-54 have similar education distributions. Note that a mosaic only tells us how to merge values of one variable: the conditioning variable. It doesn't say anything about how to merge the other variable (the response variable), so we need to make a second plot with the transposed table. merge.table(x,c(3,2)) gives the following result:
> merge.table(x,c(3,2))
merging age = 35-54 and 25-34 
merging education = College and SomeCollege 
total cost = 138.5215 
                     age
education             35-54.25-34   55+
  noHS                      14477 16035
  HS                        38131 18320
  College.SomeCollege       61805 17667

The merging trace is the same as for previous routines, except that it is now on its side so that it is easier to read. As cells are merged, we move upward, from a table of size 4x2 to 3x2 to 2x2. According to the plot, there is essentially zero merging cost until we move to 2x2, where it suddenly increases. So 3x2 is a good choice.

Making abstraction hierarchies

Abstraction hierarchies were introduced early on in this course, as a way of simplifying categorical data. Usually they come from expert knowledge. However, predictive merging opens up the possibility of automatically recovering abstraction hierarchies. Here are two examples.

Let's come back to the census data of day16. It is a table relating six workclasses to sixteen education types. Previously, the education types were manually abstracted into "high school", "college", and "advanced". Can we recover this division automatically? Using merge.table, we get the following abstract categories:

> y <- merge.table(x, c(3,3))
> dimnames(y)
$workclass
[1] "Private.None"    
[2] "Self"
[3] "State.Fed.Local"

$education
[1] "Bachelors.Assoc-acdm"
[2] "Prof-school.Doctorate.Masters"
[3] "Some-college.Assoc-voc.HS-grad.7th-8th.Preschool.12th.5th-6th.11th.9th.10th.1st-4th"
Note that the three types of government job (state/fed/local) have been merged together. The education categories are very similar to what we got by hand. The main difference is that Some-college and Assoc-voc are grouped with high school. This is because these education levels are similar to high school in terms of the kind of jobs you get, at least in this coarse division of jobs. Here is the final mosaicplot:
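It can be drawn from the merged table, e.g. as follows (this conditions on workclass; transpose for the other view):

mosaicplot(y, shade=T)   # merged education distribution within each workclass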
This abstraction seems to show the differences better than our previous one. Furthermore, from the merging process we get an entire hierarchy of education levels. For example, the "advanced" category divides into "Masters" vs. "Prof-school.Doctorate".

Here is another census example, considering age versus occupation. Age is initially categorized by year, and occupation has fourteen levels:

$age
 [1] "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27"
[12] "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38"
[23] "39" "40" "41" "42" "43" "44" "45" "46" "47" "48" "49"
[34] "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60"
[45] "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71"
[56] "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82"
[67] "83" "84" "85" "86" "87" "88" "90"

$occupation
 [1] "Adm-clerical"      "Armed-Forces"     
 [3] "Craft-repair"      "Exec-managerial"  
 [5] "Farming-fishing"   "Handlers-cleaners"
 [7] "Machine-op-inspct" "Other-service"    
 [9] "Priv-house-serv"   "Prof-specialty"   
[11] "Protective-serv"   "Sales"            
[13] "Tech-support"      "Transport-moving" 
We can use predictive abstraction to group the ages as well as the occupations. In this case, we have an ordered variable (age). To restrict the merging to preserve order, we provide an additional ordered argument to merge.table. It is a vector of TRUE/FALSE values indicating which dimensions are ordered. Here the row dimension is ordered but the column is not:
> y <- merge.table(x, c(3,4), ordered=c(T,F))
> dimnames(y)
$age
[1] "17-21" "22-35" "36-90"

$occupation
[1] "Other-service.Handlers-cleaners.Armed-Forces"
[2] "Prof-specialty.Exec-managerial"
[3] "Sales.Adm-clerical.Priv-house-serv.Farming-fishing"
[4] "Transport-moving.Protective-serv.Craft-repair.Tech-support.Machine-op-inspct"

The mosaic is a bit cluttered, but we can match up the rows with the occupation groups above. The general trend is that younger people are in service/sales/farming jobs, older people are in managerial jobs, and middle-aged people are in the miscellaneous skilled jobs.

Customer profiling

The goal of customer profiling is to guess what a customer will do in the future. Previously, we used the beyond principle to determine appropriate bins for each customer histogram. With predictive abstraction, we can find one binning which is appropriate for all customers. At the same time, we can cluster customers into groups.

Customer clustering is important because we may have very little data for any single customer. Clustering allows us to "borrow data" from similar customers. By assigning a new customer to a cluster, the company can automatically anticipate that customer's needs and make recommendations or special offers (e.g. grocery store coupons). Compared to other methods of recommendation, such as soliciting questionnaires, this method only requires observing what the customer has bought. Furthermore, when we match a customer to a cluster using the chi-square distance, their entire distribution of purchases is taken into account, not just a few triggers like beer and diapers.
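As a sketch, here is one common form of the chi-square distance between two count vectors (the variant used by the crosstab code may differ in details):

# Chi-square distance between two histograms of counts: compare their
# profiles (normalized versions), weighting each bin by the pooled
# profile so that differences in rare bins count more.
chisq.dist <- function(a, b) {
  p <- a/sum(a)
  q <- b/sum(b)
  w <- (a + b)/sum(a + b)
  sum(((p - q)^2 / w)[w > 0])
}
# A new customer would be assigned to the cluster whose profile is
# nearest in this distance.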

Consider the museum application from day10. We can construct a contingency table which reports, for each customer and each time duration (in multiples of 5 seconds), the number of times the customer spent that much time at an exhibit. In other words, each column is a histogram of durations for a given customer. Our goal is to abstract the times and cluster the customers.
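As a sketch, assuming the raw observations sit in a data frame visits with columns customer and duration (already rounded to multiples of 5 seconds; these names are made up for illustration), the table could be built with:

x <- with(visits, table(time = duration, customer = customer))

Each column of x is then one customer's histogram of durations.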

First we merge times (rows) into 4 bins:

> y <- merge.table(x,4,1,ord=T)
> rownames(y)
[1] "0"      "5-15"   "20-30"  "35-140"
The merging trace does not suggest any particular number of bins as interesting. Now we merge customers (columns) into 4 groups:
> z <- merge.table(y,4,2)
> colnames(z) <- c("nonsel","busy","select","greedy")
> mosaicplot(t(z),shade=T)

The merging trace strongly suggests four groups. I've named the four groups according to their distribution of times on the mosaic. A "nonselective" person skips exhibits less frequently than the others, usually spends a short amount of time, but sometimes a long time at an exhibit. The remaining groups fit the canonical "busy", "selective", and "greedy" archetypes. They skip exhibits at roughly the same rate. Busy people never spend a long amount of time, selective people rarely spend a medium amount of time, and greedy people spend unusually long amounts of time.

Code

The functions mine.associations and merge.table are provided as an extension to the crosstab package called crosstab2. source("crosstab2.r") automatically loads crosstab, so make sure you also have crosstab in your directory.

The original crosstab package has also been modified. Please download the latest versions of both files below.

crosstab.r crosstab.s
crosstab2.r crosstab2.s

The general usage of merge.table is merge.table(table, num.bins, which.dims, ordered). By default, which.dims is all dimensions of the table. If you only want to merge rows, use 1 for which.dims. If you only want to merge columns, use 2. If only one dimension is being merged, you only specify one number for num.bins and one value for ordered. Otherwise num.bins and ordered must be vectors. By default, ordered is false for all dimensions.
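To summarize, these are the calling patterns used in this lecture:

merge.table(x, c(2,2))                  # merge both dimensions: 2 rows, 2 columns
merge.table(x, 4, 1, ordered=T)         # merge rows only, into 4 ordered bins
merge.table(y, 4, 2)                    # merge columns only, into 4 groups
merge.table(x, c(3,4), ordered=c(T,F))  # rows ordered, columns unordered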

Background reading

Clustering purchase histories to make recommendations:

There is a research paper search engine called ResearchIndex which can make recommendations based on a user profile. The profile is made from explicit ratings and/or passive observation of the user.


Tom Minka
Last modified: Mon Aug 22 16:39:11 GMT 2005