For categorical variables, a useful principle to invoke in data-driven abstraction is the balance principle: the volumes of the different abstract categories should be roughly the same. Here "volume" can mean the frequency of the category in the dataset, or it can mean something problem-specific like sales volume in dollars.
Two algorithms were given for automatically finding balanced abstractions. The idea is to merge categories until the desired number of bins is reached. Algorithm A merges the pair of categories whose combined volume is minimal. This algorithm is simple and sometimes used in the literature, but it does not handle highly imbalanced hierarchies well. How should we proceed? Should we dream up and test other merging rules to see if any gives the results we want? No.
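Algorithm A can be sketched in a few lines. This is a minimal illustration, not the original implementation; the function name, the bin representation as (categories, volume) pairs, and the example volumes are all my own.

```python
def merge_smallest_pair(volumes, target):
    """Algorithm A (sketch): repeatedly merge the two bins whose combined
    volume is smallest, until only `target` bins remain.
    `volumes` maps each category to its volume; each resulting bin is a
    (set_of_categories, total_volume) pair."""
    bins = [({c}, v) for c, v in volumes.items()]
    while len(bins) > target:
        bins.sort(key=lambda b: b[1])          # smallest volumes first
        (c1, v1), (c2, v2) = bins[0], bins[1]  # this pair has minimal sum
        bins = bins[2:] + [(c1 | c2, v1 + v2)]
    return bins

# Hypothetical volumes: the three smallest categories get merged together.
result = merge_smallest_pair({"A": 50, "B": 30, "C": 10, "D": 5, "E": 5}, 3)
```

Here D and E (volume 5 each) merge first, and the combined bin then merges with C, leaving bins of volume 50, 30, and 20.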
A better approach is to define a numeric measure of "imbalance" of a set of bins and try to minimize it. Algorithm B merges the pair of categories whose merge increases the imbalance the least (any merge increases it, so we pick the merge doing the least damage). Two convenient and effective ways to measure imbalance are given. One is the quadratic imbalance, which is the sum of squares of the bin volumes:
  Q = sum_x n(x)^2

where n(x) is the volume of bin x (a bin is a subset of the original categories). Another measure is the entropy imbalance:
  H = sum_x n(x) log n(x)

(It doesn't matter whether you normalize n(x) or not, because multiplying all volumes by a constant cannot change which categories are merged.) Both of these measures have the property that the imbalance is minimum when all bins have the same volume: n(x) = constant.
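Both measures are easy to compute and compare directly. A quick sketch, with example volumes of my own choosing (equal total, one balanced and one imbalanced split):

```python
import math

def quadratic_imbalance(volumes):
    # Q = sum_x n(x)^2
    return sum(n * n for n in volumes)

def entropy_imbalance(volumes):
    # H = sum_x n(x) log n(x)
    return sum(n * math.log(n) for n in volumes)

balanced   = [25, 25, 25, 25]   # same total volume in both cases
imbalanced = [70, 10, 10, 10]
print(quadratic_imbalance(balanced), quadratic_imbalance(imbalanced))  # 2500 5200
```

As claimed, both measures come out smaller for the equal-volume binning than for the skewed one.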
For each imbalance measure, we can define a merging cost: the amount by which the measure increases under a merge. For the quadratic imbalance, the cost of merging bins x and y is
  merging cost = (n(x) + n(y))^2 - n(x)^2 - n(y)^2 = 2 n(x) n(y)

Unlike algorithm A, where we merged the bins whose volume sum is smallest, now we merge the bins whose volume product is smallest.
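A sketch of Algorithm B under the quadratic cost. To show a case where the product rule and the sum rule of Algorithm A actually disagree, this version assumes merges are restricted to adjacent bins (as for ordered categories, or siblings in a hierarchy); the function names, that restriction, and the example volumes are my own.

```python
def quadratic_merge_cost(nx, ny):
    # (n(x) + n(y))^2 - n(x)^2 - n(y)^2 simplifies to 2 n(x) n(y)
    return 2 * nx * ny

def merge_min_cost(volumes, target):
    """Algorithm B (sketch, quadratic imbalance): greedily merge the
    adjacent pair of bins with the smallest merging cost 2*n(x)*n(y),
    until only `target` bins remain."""
    bins = list(volumes)
    while len(bins) > target:
        i = min(range(len(bins) - 1),
                key=lambda k: quadratic_merge_cost(bins[k], bins[k + 1]))
        bins[i:i + 2] = [bins[i] + bins[i + 1]]
    return bins

# The sum rule would merge (10, 10) (sum 20), but the product rule merges
# (1, 50) (cost 2*1*50 = 100 versus 2*10*10 = 200), which yields the
# smaller final Q.
print(merge_min_cost([1, 50, 10, 10], 3))  # [51, 10, 10]
```

In this example, merging (1, 50) gives bins [51, 10, 10] with Q = 2801, while merging (10, 10) would give [1, 50, 20] with Q = 2901 — the product rule keeps the imbalance lower.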