Day 7 - Density-preserving abstraction

View the class slides [PDF]

Today we discussed two alternatives to the balance principle for abstraction. For numerical data, the balance principle implies binning the data according to quantiles. This is rarely a natural abstraction of the data, since it doesn't respect the hills and valleys of the distribution. The same issue arises with any variable whose values are not just names but have a notion of distance or a neighborhood structure.

There are many ways to get more natural abstractions of numerical data. In fact, there are so many that it is easy to get confused about which one to use. We will start by dividing the methods into two main groups, each of which follows a different principle.

The first principle is that distinctions within the data should be preserved. For short, this is the within principle. The idea is to look for compact and separated subgroups in the data. This is called clustering. For example, we could break the bins at valleys of the distribution. If there are no valleys, or if one of the bins is too big, subdivide according to the balance principle. Clearly there are many choices here. We will talk about the different choices later.
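To make the within idea concrete, here is a small R sketch (not part of the course code; the name valley.breaks is made up for this example) that places bin breaks at valleys of a fine-grained histogram, where a valley is taken to be a bin whose count is no larger than either neighbor:

# Sketch: break points at valleys (local minima) of a fine histogram.
# valley.breaks is a made-up name, not part of bhist.r/bhist.s.
valley.breaks <- function(x, nfine = 30) {
  h <- hist(x, breaks = nfine, plot = FALSE)
  cnt <- h$counts
  m <- length(cnt)
  is.valley <- c(FALSE, cnt[2:(m-1)] <= cnt[1:(m-2)] & cnt[2:(m-1)] <= cnt[3:m], FALSE)
  c(min(x), h$mids[is.valley], max(x))   # valley midpoints plus the data range
}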

The second principle is that distinctions beyond this data should be preserved. That is, we should retain enough information to distinguish the data we actually received from the data we could have received. For short, this is the beyond principle. The idea is to break the bins at places where the density of data changes. To make this kind of judgement, we need to invoke statistical tests.

Histograms with confidence intervals

Suppose a bin contains some fraction q of the data. What do we know about the true probability of landing in that bin? From basic statistics, we have the approximate 68% confidence interval p = (q - SE, q + SE), where SE = sqrt(q(1-q)/n) and n is the size of the entire dataset. This formula is good only when nq (the number of individuals in the bin) is bigger than 5. A more exact formula, which works for any nq > 0, is:
SE = sqrt(q(1-q)/n) * exp(-1/(6nq))
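To make the formula concrete, here is a small R sketch (illustrative only, not the actual bhist code) that computes the interval for a single bin holding k of the n data points:

# 68% confidence interval for the probability of one bin (sketch)
k <- 12                                 # points in the bin (example value)
n <- 200                                # total number of points (example value)
q <- k/n                                # observed fraction
se <- sqrt(q*(1-q)/n) * exp(-1/(6*k))   # corrected standard error, since k = n*q
c(q - se, q + se)                       # approximate 68% interval for p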

The remaining problem is that if q = 0, then the estimate is p = 0 with a zero-width confidence interval, which is clearly wrong. The correct thing to do is to add a small number to all of the bin counts before doing any other calculations. Furthermore, wide bins should be increased more than narrow bins. A formula which does this is

(bin count) = (raw count) + (width)*a
Typically I use a = 8/sum(widths), which corresponds to 8 extra counts over the range of the data. Note that the bin counts will generally be fractions after this, which poses no problems mathematically.
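In R, the adjustment might look like the following sketch (the variable names are mine, not those used in bhist.r):

# Smooth the raw bin counts before estimating probabilities (sketch)
breaks <- c(0, 1, 3, 6, 10)             # example bin breaks
raw    <- c(5, 0, 8, 2)                 # example raw counts in each bin
widths <- diff(breaks)
a      <- 8/sum(widths)                 # 8 extra counts spread over the range
counts <- raw + widths*a                # smoothed (possibly fractional) counts
q      <- counts/sum(counts)            # estimated bin probabilities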

See the slides for examples of histograms with confidence intervals.

Recall the definition of density. If a bin has probability p and width w, then the density of the bin is f = p/w. Unlike probability, density can be greater than 1 (on small intervals). The confidence interval for f follows immediately from the one for p: just divide the endpoints by w.
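Continuing the sketch above, the density estimates and their intervals come from dividing by the bin widths:

# Density estimate and 68% interval for each bin (continuing the sketch)
n  <- sum(counts)                       # effective total after smoothing
se <- sqrt(q*(1-q)/n) * exp(-1/(6*counts))
f       <- q/widths                     # density estimates
f.lower <- (q - se)/widths              # lower endpoints of the intervals
f.upper <- (q + se)/widths              # upper endpoints of the intervals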

Bin merging

According to the beyond principle, we only need to keep bins with a significant difference in density. This can be done by merging, just as in the algorithms for balancing. At each step, we consider every pair of neighboring bins and merge the pair with the least significant difference in density. This gives a nested sequence of binnings. The user can pick the appropriate level of detail.

Let the two bins under consideration have widths w1 and w2, estimated probabilities p1 and p2, and estimated densities f1 and f2. The density if they are merged is f = (p1+p2)/(w1+w2). The significance of the difference can be measured by the following chi-square statistic: w1*(f1 - f)^2/f + w2*(f2 - f)^2/f. We are not interested in getting a p-value from this; we only need to pick the pair with the smallest value of the statistic.
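A minimal R sketch of this step (again, not the actual bhist.merge code) computes the statistic for every neighboring pair and picks the smallest:

# Chi-square statistic for merging each pair of neighboring bins (sketch)
merge.stat <- function(p, w) {
  i  <- 1:(length(p)-1)
  f1 <- p[i]/w[i]
  f2 <- p[i+1]/w[i+1]
  f  <- (p[i] + p[i+1])/(w[i] + w[i+1])     # density of the merged bin
  w[i]*(f1 - f)^2/f + w[i+1]*(f2 - f)^2/f   # one value per neighboring pair
}
stats <- merge.stat(q, widths)
which.min(stats)    # pair with the least significant difference in density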

See the slides for examples of merged histograms. These histograms are simpler for visualization and for computation. They are also more accurate, because they pool data together, and they represent what we know about the data distribution more honestly than a histogram with a large number of bins.

Density-preserving abstraction essentially answers the question we asked on day 3: how should we reduce the number of bins for classification and anomaly detection using histograms?

Code

You need to download a different file depending on whether you have R or S: bhist.r bhist.s
You must use source to load them, as usual.

The function for making histograms with confidence intervals is bhist(x,b), where x is a vector of numbers and b is the number of bins or a vector of bin breaks.

The function for making merged histograms is bhist.merge(x,b,n) where b is the original number of bins (or bin breaks) and n is the desired number of bins. It returns a set of bin breaks which can be passed to bhist. It also calls bhist automatically. The merged histogram is plotted underneath a plot of the chi-square values for each merge. The chi-square values are meant to help you decide on the best number of bins.
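Assuming the functions behave as described above, a typical session might look like this (the data vector x is just an example):

source("bhist.r")                # or source("bhist.s") in S
x <- rnorm(500)                  # example data
bhist(x, 30)                     # histogram with 30 bins and confidence intervals
b <- bhist.merge(x, 30, 8)       # merge from 30 bins down to 8; also plots via bhist
bhist(x, b)                      # re-plot using the returned bin breaks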

Reading

Read Moore and McCabe, chapters 8 and 9, to review inference for proportions.


Tom Minka
Last modified: Mon Aug 22 16:41:13 GMT 2005