   j
i   1 2 3 4
  1 0 0 5 5
  2 0 0 5 5
  3 5 5 0 0
  4 5 5 0 0

There is a definite relationship between i and j. Is it possible to abstract the values of i and j, without losing the relationship? If we look at the marginal totals, they give a uniform distribution over i and over j. No particular abstraction is preferred by them. But clearly we lose something if we merge i=2 and i=3.
The prediction principle says we should merge rows with similar distributions and columns with similar distributions. In this case, we should merge (i=1,i=2), (i=3,i=4), (j=1,j=2), and (j=3,j=4).
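To make "similar distributions" concrete, we can normalize each row (or column) of the table to a conditional distribution and compare the results. The following is a minimal base-R sketch that re-enters the table above by hand. It only inspects the profiles and does not call merge.table; the variable names are chosen here for illustration.

```r
# The 4 x 4 table from above, entered by hand.
x <- matrix(c(0, 0, 5, 5,
              0, 0, 5, 5,
              5, 5, 0, 0,
              5, 5, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(i = as.character(1:4), j = as.character(1:4)))

# Row profiles: the distribution of j given i (each row sums to 1).
row_profiles <- prop.table(x, 1)
# Column profiles: the distribution of i given j (each column sums to 1).
col_profiles <- prop.table(x, 2)

print(row_profiles)

# Compare profiles pairwise: identical profiles are natural merge candidates.
same_12 <- isTRUE(all.equal(row_profiles["1", ], row_profiles["2", ]))
same_23 <- isTRUE(all.equal(row_profiles["2", ], row_profiles["3", ]))
```

Rows 1 and 2 have the same profile, while rows 2 and 3 do not, which is why merging i=2 with i=3 would discard information about j.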
The function provided for predictive merging is called merge.table. If the above table is x, we can reduce it to a 2 by 2 table via merge.table(x, c(2,2)). The first argument is the table and the second argument is the desired number of abstraction bins. Because the table has two dimensions, we have to give a vector of two numbers, specifying the desired number of rows and columns. The c function makes a vector. Here is the result:
> merge.table(x,c(2,2))
merging i = 2 and 1
merging i = 4 and 3
merging j = 2 and 1
merging j = 4 and 3
total cost = 0
     j
i     2.1 4.3
  2.1   0  20
  4.3  20   0

The merging cost in this case was zero, because all merges were between identical rows and columns.
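These notes do not spell out how merge.table computes its cost, so the following is only a sketch under an assumption: the cost of pooling two rows is taken to be the chi-square statistic for testing that the two rows share one distribution, and merging proceeds greedily on the cheapest pair. The names pair_cost and greedy_merge_rows are hypothetical helpers written for this illustration, not part of the crosstab package.

```r
# ASSUMPTION: cost of pooling two rows = chi-square statistic for the
# hypothesis that both rows share one distribution. merge.table may differ.
pair_cost <- function(u, v) {
  keep <- (u + v) > 0                 # skip cells that are empty in both rows
  u <- u[keep]; v <- v[keep]
  pooled <- (u + v) / sum(u + v)      # profile of the merged row
  eu <- sum(u) * pooled               # expected counts under the pooled profile
  ev <- sum(v) * pooled
  sum((u - eu)^2 / eu) + sum((v - ev)^2 / ev)
}

# Hypothetical greedy loop: repeatedly merge the cheapest pair of rows.
greedy_merge_rows <- function(tab, num.bins) {
  total <- 0
  while (nrow(tab) > num.bins) {
    pairs <- t(combn(nrow(tab), 2))   # every unordered pair of rows
    costs <- apply(pairs, 1, function(p) pair_cost(tab[p[1], ], tab[p[2], ]))
    best <- pairs[which.min(costs), ]
    total <- total + min(costs)
    tab[best[1], ] <- tab[best[1], ] + tab[best[2], ]
    rownames(tab)[best[1]] <- paste(rownames(tab)[best[2]],
                                    rownames(tab)[best[1]], sep = ".")
    tab <- tab[-best[2], , drop = FALSE]
  }
  list(table = tab, cost = total)
}

x <- matrix(c(0, 0, 5, 5,
              0, 0, 5, 5,
              5, 5, 0, 0,
              5, 5, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(i = as.character(1:4), j = as.character(1:4)))

rows <- greedy_merge_rows(x, 2)               # merge rows first
both <- greedy_merge_rows(t(rows$table), 2)   # then merge columns
result <- t(both$table)
print(result)
cost_23 <- pair_cost(x["2", ], x["3", ])      # cost of a bad merge
```

On this table the sketch merges the identical rows and columns at a total cost of zero and arrives at the same 2 by 2 table as above. Under the assumed cost, merging i=2 with i=3 would instead cost 20.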
Suppose the rows and columns were originally in a different order:
   j
i   1 2 3 4
  1 0 5 0 5
  2 5 0 5 0
  3 0 5 0 5
  4 5 0 5 0

If the dimensions are viewed as unordered categories, then we should expect the same result. In fact, that is what we get when we run merge.table.
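One way to see why the order should not matter: the reordered table is the first table with categories 2 and 3 swapped in both dimensions. The base-R sketch below checks this and groups the rows by identical profile. It does not call merge.table, and the variable names are chosen here for illustration.

```r
# The first table, entered by hand.
x <- matrix(c(0, 0, 5, 5,
              0, 0, 5, 5,
              5, 5, 0, 0,
              5, 5, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(i = as.character(1:4), j = as.character(1:4)))

perm <- c(1, 3, 2, 4)          # swap the 2nd and 3rd categories
x2 <- x[perm, perm]            # permute rows and columns the same way
dimnames(x2) <- dimnames(x)    # relabel as 1..4, as in the reordered table
print(x2)

# Group the rows of x2 by identical conditional distribution.
profiles <- prop.table(x2, 1)
key <- apply(profiles, 1, paste, collapse = " ")
groups <- unname(split(rownames(x2), key))
print(groups)
```

The groups are now (i=1,i=3) and (i=2,i=4), which is the same abstraction as before up to the relabelling of categories.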
How about a less trivial table? Here is Table 2.14 in Moore & McCabe:
             age
education     25-34 35-54   55+
  noHS         5325  9152 16035
  HS          14061 24070 18320
  SomeCollege 11659 19926  9662
  College     10342 19878  8005

It seems that the rows SomeCollege and College could be merged. As for columns, it is harder to tell. A mosaic plot can help:
> merge.table(x,c(3,2))
merging age = 35-54 and 25-34
merging education = College and SomeCollege
total cost = 138.5215
                     age
education             35-54.25-34   55+
  noHS                      14477 16035
  HS                        38131 18320
  College.SomeCollege       61805 17667
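Once the groups are known, the merged table is just the original counts summed within each group. As a check, here is a small base-R sketch that collapses the education by age table by hand using rowsum. The grouping vectors are copied from the merging trace above; this reproduces only the final counts, not the choice of merges or the cost.

```r
# Table 2.14, entered by hand.
x <- matrix(c( 5325,  9152, 16035,
              14061, 24070, 18320,
              11659, 19926,  9662,
              10342, 19878,  8005),
            nrow = 4, byrow = TRUE,
            dimnames = list(education = c("noHS", "HS", "SomeCollege", "College"),
                            age = c("25-34", "35-54", "55+")))

# Group label for each original row and column, taken from the trace.
row_group <- c("noHS", "HS", "College.SomeCollege", "College.SomeCollege")
col_group <- c("35-54.25-34", "35-54.25-34", "55+")

# Sum rows within each group, then do the same for columns via transposes.
merged_rows <- rowsum(x, row_group, reorder = FALSE)
merged <- t(rowsum(t(merged_rows), col_group, reorder = FALSE))
print(merged)
```

The resulting counts agree with the 3 by 2 table printed by merge.table.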
Let's come back to the census data of day16. It is a table relating six workclasses to sixteen education types. Previously, the education types were manually abstracted into "high school", "college", and "advanced". Can we recover this division automatically? Using merge.table, we get the following abstract categories:
> y <- merge.table(x, c(3,3))
> dimnames(y)
$workclass
[1] "Private.None"
[2] "Self"
[3] "State.Fed.Local"

$education
[1] "Bachelors.Assoc-acdm"
[2] "Prof-school.Doctorate.Masters"
[3] "Some-college.Assoc-voc.HS-grad.7th-8th.Preschool.12th.5th-6th.11th.9th.10th.1st-4th"

Note that the three types of government job (state/fed/local) have been merged together. The education categories are very similar to what we got by hand. The main difference is that Some-college and Assoc-voc are grouped with high school. This is because these education levels are similar to high school in terms of the kind of jobs you get, at least in this coarse division of jobs. Here is the final mosaicplot:
Here is another census example, considering age versus occupation. Age is initially categorized by year and occupation into fourteen levels:
$age
 [1] "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27"
[12] "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38"
[23] "39" "40" "41" "42" "43" "44" "45" "46" "47" "48" "49"
[34] "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60"
[45] "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71"
[56] "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82"
[67] "83" "84" "85" "86" "87" "88" "90"

$occupation
 [1] "Adm-clerical"      "Armed-Forces"
 [3] "Craft-repair"      "Exec-managerial"
 [5] "Farming-fishing"   "Handlers-cleaners"
 [7] "Machine-op-inspct" "Other-service"
 [9] "Priv-house-serv"   "Prof-specialty"
[11] "Protective-serv"   "Sales"
[13] "Tech-support"      "Transport-moving"

We can use predictive abstraction to group the ages as well as the occupations. In this case, we have an ordered variable (ages). To restrict the merging to preserve order, we provide an additional ordered argument to merge.table. It is a vector of True/False values indicating which dimensions are ordered. Here the row dimension is ordered but the column is not:
> y <- merge.table(x, c(3,4), ordered=c(T,F))
> dimnames(y)
$age
[1] "17-21" "22-35" "36-90"

$occupation
[1] "Other-service.Handlers-cleaners.Armed-Forces"
[2] "Prof-specialty.Exec-managerial"
[3] "Sales.Adm-clerical.Priv-house-serv.Farming-fishing"
[4] "Transport-moving.Protective-serv.Craft-repair.Tech-support.Machine-op-inspct"
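The effect of the ordered argument can be pictured as a restriction on which pairs of categories are candidates for merging. The sketch below assumes that "preserving order" means only neighbouring categories may be merged, which is consistent with the contiguous age ranges in the output. The function candidate_pairs is a hypothetical helper written for this illustration, not part of the crosstab package.

```r
# Hypothetical helper: which pairs of categories may be merged?
candidate_pairs <- function(n, ordered = FALSE) {
  if (ordered) {
    cbind(seq_len(n - 1), seq_len(n - 1) + 1)   # neighbours only
  } else {
    t(combn(n, 2))                               # any two categories
  }
}

n_age <- 73   # ages 17..88 plus 90, as listed in the dimnames
n_occ <- 14   # occupation levels

age_pairs <- candidate_pairs(n_age, ordered = TRUE)
occ_pairs <- candidate_pairs(n_occ, ordered = FALSE)

# Number of candidate merges at the first step for each dimension.
print(c(age = nrow(age_pairs), occupation = nrow(occ_pairs)))
```

With ordering, the 73 ages give 72 candidate merges at the first step; without it there would be 2628. Under this assumption the restriction both keeps the bins interpretable as ranges and shrinks the search.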
Customer clustering is important because we may have very little data for any single customer. Clustering allows us to "borrow data" from similar customers. By assigning a new customer to a cluster, the company can automatically anticipate that customer's needs and make recommendations or special offers (e.g., grocery store coupons). Compared to other methods of recommendation, such as soliciting questionnaires, this method only requires observing what the customer has bought. Furthermore, when we match a customer to a cluster using the chi-square distance, their entire distribution of purchases is taken into account, not just a few triggers like beer and diapers.
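As a rough illustration of matching a customer to a cluster, the sketch below compares a new customer's purchase counts with each cluster's purchase profile. It assumes the chi-square distance is the usual sum of (observed - expected)^2 / expected, with expected counts taken from the cluster profile. The product names, cluster names, and numbers are invented for the example, and chisq_dist is a hypothetical helper.

```r
# ASSUMPTION: chi-square distance between observed counts and a profile.
chisq_dist <- function(counts, profile) {
  expected <- sum(counts) * profile   # counts we would expect from this cluster
  sum((counts - expected)^2 / expected)
}

# Invented cluster profiles: each row is a distribution over products.
clusters <- rbind(snacks  = c(0.6, 0.3, 0.1),
                  staples = c(0.2, 0.3, 0.5))
colnames(clusters) <- c("chips", "bread", "rice")

# Invented purchase history for a new customer.
new_customer <- c(chips = 5, bread = 3, rice = 2)

# Distance to every cluster; every product contributes, not just one trigger.
d <- apply(clusters, 1, chisq_dist, counts = new_customer)
best <- names(which.min(d))
print(d)
print(best)
```

Here the customer is assigned to the "snacks" cluster, because their whole purchase distribution is closer to that profile.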
Consider the museum application from day10. We can construct a contingency table which reports, for each customer and each time duration (in multiples of 5 seconds), the number of times the customer spent that much time at an exhibit. In other words, each column is a histogram of durations for a given customer. Our goal is to abstract the times and cluster the customers.
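A table of this shape can be built with the base-R table function. The sketch below uses a few invented visit records and assumes that durations are rounded down to a multiple of 5 seconds; the column names customer and seconds are hypothetical and need not match the museum data.

```r
# Invented visit records: one row per (customer, exhibit) visit.
visits <- data.frame(customer = c("a", "a", "b", "b", "b"),
                     seconds  = c(3, 12, 7, 41, 44))

# ASSUMPTION: durations are binned by rounding down to a multiple of 5.
visits$duration <- 5 * floor(visits$seconds / 5)

# Rows are durations, columns are customers; each column is a histogram.
x <- table(duration = visits$duration, customer = visits$customer)
print(x)
```

Each column of x counts how often that customer spent each duration at an exhibit, which is the form of input used in the merges that follow.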
First we merge times (rows) into 4 bins:
> y <- merge.table(x,4,1,ord=T)
> rownames(y)
[1] "0"      "5-15"   "20-30"  "35-140"

The merging trace does not suggest any particular number of bins as interesting. Now we merge customers (columns) into 4 groups:
z <- merge.table(y,4,2)
colnames(z) <- c("nonsel","busy","select","greedy")
mosaicplot(t(z),shade=T)
The original crosstab package has also been modified. Please download the latest version here.
crosstab.r
crosstab.s
crosstab2.r
crosstab2.s
The general usage of merge.table is merge.table(table, num.bins, which.dims, ordered). By default, which.dims is all dimensions of the table. If you only want to merge rows, use 1 for which.dims. If you only want to merge columns, use 2. If only one dimension is being merged, you only specify one number for num.bins and one value for ordered. Otherwise num.bins and ordered must be vectors. By default, ordered is false for all dimensions.
There is a research paper search engine called ResearchIndex which can make recommendations based on a user profile. The profile is made from explicit ratings and/or passive observation of the user.