Day 40 - Course overview
The objectives of this course were to teach you:
- How to tame complexity, via abstraction
- Principles of good visualization
- How to form predictive models
A common pattern throughout this course has been to simplify a dataset by
making abstractions, and then to gradually look at the details beyond those
abstractions.
Here is a master list of all the abstraction methods we've looked at:
Principle: Balance
Algorithms: A, B
Type being abstracted: categorical
A concept hierarchy is given and we want to find an
abstraction level with a given number of categories.
For example, we want to abstract the products in a supermarket into four
types.
A good abstraction will balance the category sizes, while respecting the
hierarchy.
This was our first example of a merging algorithm.
Algorithm A was based
on a direct merging rule. Algorithm B was indirect: we formulated a
measure of imbalance and then derived a merging rule from it; this indirect
rule performed better than Algorithm A.
See day5.
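To make the indirect approach concrete, here is a rough Python sketch. It is
not the actual Algorithm A or B from day5: it greedily merges the pair of
categories that most reduces one possible imbalance measure (the variance of
the category sizes), and it ignores the hierarchy constraint for brevity.

    # Hypothetical sketch of "indirect" merging: repeatedly merge the pair of
    # categories that most reduces an imbalance measure, until only the
    # desired number of categories remains.  (The real algorithms from day 5
    # also respect the concept hierarchy; that constraint is omitted here.)
    from itertools import combinations

    def imbalance(sizes):
        """Imbalance measure: variance of category sizes (one of many choices)."""
        mean = sum(sizes) / len(sizes)
        return sum((s - mean) ** 2 for s in sizes) / len(sizes)

    def merge_to_k(counts, k):
        """counts: dict mapping category -> size.  Returns a dict mapping
        each merged group (a tuple of original categories) -> total size."""
        groups = {(c,): n for c, n in counts.items()}
        while len(groups) > k:
            best = None
            for a, b in combinations(groups, 2):
                sizes = [n for g, n in groups.items() if g not in (a, b)]
                sizes.append(groups[a] + groups[b])
                score = imbalance(sizes)
                if best is None or score < best[0]:
                    best = (score, a, b)
            _, a, b = best
            groups[a + b] = groups.pop(a) + groups.pop(b)
        return groups

    # Example: abstract six product types into three roughly balanced groups.
    print(merge_to_k({'milk': 40, 'bread': 35, 'soap': 10,
                      'shampoo': 12, 'tv': 3, 'radio': 2}, 3))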
Principle: Beyond
Algorithms: Histogram bin merging
Type being abstracted: numerical
Convert numbers into categories (bins), so that numbers
in the same bin have similar probability. This is important when trying to
make predictions about future data, such as in fraud detection and customer
profiling. The merging cost is given by a chi-square statistic which
measures the difference in the probability density of two bins.
See day7.
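As a rough illustration (not the exact procedure from day7), the following
Python sketch merges adjacent histogram bins using a chi-square cost for the
hypothesis that two neighboring bins have the same probability density; the
cheapest (most similar) pair is merged first.

    # Sketch of histogram bin merging with a chi-square merging cost.
    import numpy as np

    def chi_square_cost(c1, w1, c2, w2):
        """Chi-square statistic for the hypothesis that two adjacent bins have
        the same density (expected counts proportional to bin widths)."""
        total = c1 + c2
        if total == 0:
            return 0.0          # two empty bins: merging costs nothing
        e1 = total * w1 / (w1 + w2)
        e2 = total * w2 / (w1 + w2)
        return (c1 - e1) ** 2 / e1 + (c2 - e2) ** 2 / e2

    def merge_bins(edges, counts, target_bins):
        edges, counts = list(edges), list(counts)
        while len(counts) > target_bins:
            costs = [chi_square_cost(counts[i], edges[i + 1] - edges[i],
                                     counts[i + 1], edges[i + 2] - edges[i + 1])
                     for i in range(len(counts) - 1)]
            i = int(np.argmin(costs))   # cheapest merge = most similar bins
            counts[i] += counts.pop(i + 1)
            edges.pop(i + 1)
        return edges, counts

    # Example: start from 20 equal-width bins of some skewed data.
    data = np.random.gamma(2.0, size=1000)
    counts, edges = np.histogram(data, bins=20)
    print(merge_bins(edges, counts, target_bins=5))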
Principle: Within
Algorithms: Ward, k-means, PCA
Type being abstracted: numerical
Convert numbers into categories (bins), so that the bins are well-separated
and approximately balanced. This is useful for identifying the subgroups of
a population, such as market segments. Analysis can continue on the
subgroups.
Ward's method and k-means optimize the sum of squares criterion
introduced on day9 and extended to multivariate
data on day37. Ward's method uses merging while
k-means uses iterative reassignment.
PCA is a "within" method which projects several numerical variables into a
smaller number, so that a scatterplot can be made. See day35 and day36. PCA is useful for
sorting a star plot (day38).
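All three tools have standard implementations; the sketch below (not the
course code) runs Ward's method, k-means, and PCA on the same synthetic data
using scipy and scikit-learn.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # Two synthetic "market segments" in 4 numerical variables.
    X = np.vstack([rng.normal(0, 1, size=(50, 4)),
                   rng.normal(3, 1, size=(50, 4))])

    # Ward: bottom-up merging that minimizes the increase in sum of squares.
    ward_labels = fcluster(linkage(X, method='ward'), t=2, criterion='maxclust')

    # k-means: iterative reassignment optimizing the same sum-of-squares criterion.
    kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # PCA: project 4 variables to 2, e.g. for a scatterplot or sorting a star plot.
    X2 = PCA(n_components=2).fit_transform(X)

    print(ward_labels[:5], kmeans_labels[:5], X2[:2])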
Principle: Prediction
Algorithms: merging
Type being abstracted: categorical
Merge categories which predict the same responses.
There are two possibilities. If the response is categorical then we
compare the distribution of the response for different predictor values
using a homogeneity test.
See day16 and day17.
If the response is numerical then we have to
assume something about the response distribution. If we assume it is
normal with constant variance, then we merge predictor values based on the
sum of squares. If we assume the response is normal with non-constant
variance then we merge based on log-variance.
See day18.
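For the categorical-response case, here is a hypothetical sketch: it greedily
merges the pair of predictor values whose response distributions look most
similar, using scipy's chi-square homogeneity statistic as the merging cost.
The course's exact rule from days 16-17 may differ.

    import numpy as np
    from itertools import combinations
    from scipy.stats import chi2_contingency

    def merge_predictor_values(table, target):
        """table: dict mapping predictor value -> array of response counts."""
        groups = {(k,): np.asarray(v, dtype=float) for k, v in table.items()}
        while len(groups) > target:
            best = None
            for a, b in combinations(groups, 2):
                stat, _, _, _ = chi2_contingency(np.vstack([groups[a], groups[b]]))
                if best is None or stat < best[0]:
                    best = (stat, a, b)
            _, a, b = best
            groups[a + b] = groups.pop(a) + groups.pop(b)
        return groups

    # Example: response counts (buy / don't buy) for four predictor values.
    counts = {'north': [30, 70], 'south': [32, 68],
              'east': [60, 40], 'west': [58, 42]}
    print(merge_predictor_values(counts, target=2))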
Principle: Prediction
Algorithms: m, v, mv
Type being abstracted: numerical
Project several numerical predictors of a categorical response (a
class variable) into a smaller number of numerical
predictors (h1 and h2).
The projection axes are hard to interpret in themselves, because they
are a mixture of the original variables. But you can use them to make a
scatterplot and visualize the shape and boundary of the classes.
There are three algorithms depending on whether you want to separate the
means, variances, or both means and variances of the classes.
See day35 and day36.
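The mean-separating ("m") projection is in the same spirit as Fisher's linear
discriminant. As a stand-in sketch (not the course's m/v/mv code),
scikit-learn's LDA projects several predictors down to two axes h1 and h2
that can be scatterplotted and colored by class.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    # Three classes in 5 numerical predictors.
    X = np.vstack([rng.normal(m, 1, size=(40, 5)) for m in (0, 2, 4)])
    y = np.repeat([0, 1, 2], 40)

    lda = LinearDiscriminantAnalysis(n_components=2)
    H = lda.fit_transform(X, y)   # columns are the projected axes h1 and h2

    # H[:, 0] and H[:, 1] can now be scatterplotted, colored by class y,
    # to visualize the shape and boundaries of the classes.
    print(H.shape)   # (120, 2)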
Tom Minka
Last modified: Mon Aug 22 16:41:57 GMT 2005