Day 5 - Choosing abstraction levels

Once you have an abstraction hierarchy, you have to choose which level of it to use. Even if you know you want five bins, say, you still have options about which categories to merge. In fact, it is possible to let the data itself tell you how to merge. This is part of a topic called "data-driven abstraction."

For categorical variables, a useful principle to invoke in data-driven abstraction is the balance principle. This says that the volume of the different abstract categories should be roughly the same. Here "volume" can mean the frequency of the category in the dataset, or it can mean something problem-specific like sales volume in dollars.
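To make the two meanings of "volume" concrete, here is a small sketch. The records, category names, and dollar amounts are made up for illustration and do not come from the lecture.

```python
from collections import Counter, defaultdict

# Hypothetical transactions: (category, sale amount in dollars).
records = [("books", 12.0), ("books", 8.0), ("books", 10.0),
           ("music", 15.0), ("toys", 90.0)]

# Volume as frequency: how many records fall in each category.
frequency = Counter(category for category, _ in records)

# Volume as sales: total dollars per category.
sales = defaultdict(float)
for category, amount in records:
    sales[category] += amount
```

By frequency, "books" is the largest category; by dollars, "toys" is. Which bins count as balanced therefore depends on the volume you choose.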

Two algorithms were given for automatically finding balanced abstractions. The idea is that you merge categories until the desired number of bins is reached. Algorithm A repeatedly merges the two categories whose combined volume is smallest. This algorithm is simple and sometimes used in the literature, but it does not handle highly imbalanced hierarchies well. How should we proceed? Should we dream up and test other merging rules to see if any gives the results we want? No.
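Algorithm A can be sketched as follows. This is a minimal illustration, not the code used in the lecture: it assumes any two bins may be merged (the hierarchy is ignored), the function name `merge_smallest_sum` is hypothetical, and the volumes are made up.

```python
from itertools import combinations

def merge_smallest_sum(volumes, num_bins):
    """Repeatedly merge the two bins with the smallest combined volume
    until num_bins remain. Assumes any pair of bins may be merged."""
    # Each bin is (tuple of category names, total volume).
    bins = [((name,), vol) for name, vol in volumes.items()]
    while len(bins) > num_bins:
        # Pick the pair of bin indices with the smallest combined volume.
        i, j = min(combinations(range(len(bins)), 2),
                   key=lambda p: bins[p[0]][1] + bins[p[1]][1])
        merged = (bins[i][0] + bins[j][0], bins[i][1] + bins[j][1])
        bins = [b for k, b in enumerate(bins) if k not in (i, j)] + [merged]
    return bins

# Hypothetical volumes, made up for illustration.
volumes = {"a": 50, "b": 30, "c": 10, "d": 5, "e": 5}
result = merge_smallest_sum(volumes, 3)
bin_volumes = sorted(vol for _, vol in result)
```

On this input the sketch first merges d and e, then merges c into that bin, leaving bins of volume 20, 30, and 50.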

A better approach is to define a numeric measure of "imbalance" of a set of bins and try to minimize the imbalance. Algorithm B merges the categories which, when merged, give the smallest resulting imbalance. Two convenient and effective ways to measure imbalance are given. One is the quadratic imbalance, which is the sum of squares of the bin volumes:

Q = sum_x n(x)^2

where n(x) is the volume of bin x (a bin is a subset of the original categories). Another measure is the entropy imbalance:

H = sum_x n(x) log n(x)

(It doesn't matter whether you normalize n(x) or not: scaling every volume by a constant c > 0 multiplies Q by c^2 and turns H into c*H + c*N*log(c), where N = sum_x n(x) is the total volume, which merging never changes. Neither change affects which categories are merged.) For a fixed number of bins and a fixed total volume, both measures are minimized when all bins have the same volume: n(x) = constant.
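The two measures are easy to compute directly. The sketch below uses hypothetical function names and made-up volumes; it takes 0 * log 0 = 0 for empty bins, which is a convention assumed here rather than one stated in the notes.

```python
import math

def quadratic_imbalance(volumes):
    """Q = sum of squared bin volumes."""
    return sum(n * n for n in volumes)

def entropy_imbalance(volumes):
    """H = sum of n * log(n); empty bins contribute 0 by convention."""
    return sum(n * math.log(n) for n in volumes if n > 0)

# Two hypothetical sets of four bins with the same total volume (100).
balanced = [25, 25, 25, 25]
skewed = [70, 10, 10, 10]

q_balanced = quadratic_imbalance(balanced)
q_skewed = quadratic_imbalance(skewed)
h_balanced = entropy_imbalance(balanced)
h_skewed = entropy_imbalance(skewed)
```

Both measures come out smaller for the balanced bins than for the skewed ones (Q is 2500 versus 5200), as the balance principle requires.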

For each imbalance measure, we can define a merging cost, which is the amount by which a merge increases the measure. For the quadratic imbalance, the cost of merging bins x and y is

merging cost = (n(x) + n(y))^2 - n(x)^2 - n(y)^2 = 2*n(x)*n(y)

Unlike algorithm A, where we merged the bins whose volume sum is smallest, now we merge the bins whose volume product is smallest.
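The difference between the sum rule and the product rule can be seen in a small sketch. Everything here is an assumption made for illustration: the two-level hierarchy (`parent`), the restriction that only categories under the same parent may be merged, the function names, and the volumes. The entropy cost is not given in the notes; it is obtained by applying the same definition of merging cost to H.

```python
import math
from itertools import combinations

def greedy_merge(volumes, parent, num_bins, cost):
    """Greedily merge the cheapest allowed pair until num_bins remain.
    Assumption: only bins under the same parent may be merged."""
    # Each bin is (category names, parent label, total volume).
    bins = [((c,), parent[c], v) for c, v in volumes.items()]
    while len(bins) > num_bins:
        pairs = [(i, j) for i, j in combinations(range(len(bins)), 2)
                 if bins[i][1] == bins[j][1]]
        if not pairs:
            break  # no allowed merge is left
        i, j = min(pairs, key=lambda p: cost(bins[p[0]][2], bins[p[1]][2]))
        merged = (bins[i][0] + bins[j][0], bins[i][1], bins[i][2] + bins[j][2])
        bins = [b for k, b in enumerate(bins) if k not in (i, j)] + [merged]
    return sorted(b[2] for b in bins)

def sum_cost(nx, ny):
    # Algorithm A: combined volume.
    return nx + ny

def quadratic_cost(nx, ny):
    # Algorithm B with Q: (nx + ny)^2 - nx^2 - ny^2.
    return 2 * nx * ny

def entropy_cost(nx, ny):
    # Algorithm B with H (derived, not from the notes); needs nx, ny > 0.
    return ((nx + ny) * math.log(nx + ny)
            - nx * math.log(nx) - ny * math.log(ny))

# Hypothetical volumes and hierarchy.
volumes = {"a": 3, "b": 3, "c": 1, "d": 6}
parent = {"a": "G1", "b": "G1", "c": "G2", "d": "G2"}

bins_a = greedy_merge(volumes, parent, 3, sum_cost)
bins_b = greedy_merge(volumes, parent, 3, quadratic_cost)
bins_h = greedy_merge(volumes, parent, 3, entropy_cost)
q_a = sum(n * n for n in bins_a)
q_b = sum(n * n for n in bins_b)
```

Here the sum rule merges a and b (sum 6 versus 7), giving bins 1, 6, 6 with Q = 73, while the product rule merges c and d (product 6 versus 9), giving bins 3, 3, 7 with Q = 67. The entropy cost makes the same choice as the product rule on this input. Note that when volumes are positive and any pair may be merged, the smallest sum and the smallest product are both attained by the two smallest bins, so the two rules can only differ when merges are restricted in some way, as they are in this sketch.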

Background reading

Automatic abstraction using (essentially) algorithm A was first discussed in the paper "Dynamic Generation and Refinement of Concept Hierarchies for Knowledge Discovery in Databases", by Han and Fu, in 1994.


Tom Minka
Last modified: Thu Dec 06 22:53:28 Eastern Standard Time 2001