Day 40 - Course overview
The objectives of this course were to teach you:
- How to tame complexity, via abstraction
- Principles of good visualization
- How to form predictive models
A common pattern throughout this course has been to simplify a dataset by
making abstractions, and then to gradually look at the details beyond those
abstractions.
Here is a master list of all the abstraction methods we've looked at:
Principle: Balance
Algorithms: A, B
Type being abstracted: categorical
A concept hierarchy is given and we want to find an
abstraction level with a given number of categories.
For example, we want to abstract the products in a supermarket into four
types.
A good abstraction will balance the category sizes, while respecting the
hierarchy.
This was our first example of a merging algorithm.
Algorithm A was based
on a direct merging rule. Algorithm B was indirect: we formulated a
measure of imbalance and then derived a merging rule from it; this indirect
rule performed better than Algorithm A.
See day5.
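To make the indirect approach concrete, here is a rough Python sketch. It is
not the actual Algorithm A or B from day5: it greedily merges the pair of
categories that most reduces one possible imbalance measure (the variance of
the category sizes), and it ignores the hierarchy constraint for brevity.

    # Hypothetical sketch of "indirect" merging: repeatedly merge the pair of
    # categories that most reduces an imbalance measure, until only the
    # desired number of categories remains.  (The real algorithms from day 5
    # also respect the concept hierarchy; that constraint is omitted here.)
    from itertools import combinations

    def imbalance(sizes):
        """Imbalance measure: variance of category sizes (one of many choices)."""
        mean = sum(sizes) / len(sizes)
        return sum((s - mean) ** 2 for s in sizes) / len(sizes)

    def merge_to_k(counts, k):
        """counts: dict mapping category -> size.  Returns a dict mapping
        each merged group (a tuple of original categories) -> total size."""
        groups = {(c,): n for c, n in counts.items()}
        while len(groups) > k:
            best = None
            for a, b in combinations(groups, 2):
                sizes = [n for g, n in groups.items() if g not in (a, b)]
                sizes.append(groups[a] + groups[b])
                score = imbalance(sizes)
                if best is None or score < best[0]:
                    best = (score, a, b)
            _, a, b = best
            groups[a + b] = groups.pop(a) + groups.pop(b)
        return groups

    # Example: abstract six product types into three roughly balanced groups.
    print(merge_to_k({'milk': 40, 'bread': 35, 'soap': 10,
                      'shampoo': 12, 'tv': 3, 'radio': 2}, 3))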
Principle: Beyond
Algorithms: Histogram bin merging
Type being abstracted: numerical
Convert numbers into categories (bins), so that numbers
in the same bin have similar probability. This is important when trying to
make predictions about future data, such as in fraud detection and customer
profiling. The merging cost is given by a chi-square statistic which
measures the difference in the probability density of two bins.
See day7.
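As a rough illustration (not the exact procedure from day7), the following
Python sketch merges adjacent histogram bins using a chi-square cost for the
hypothesis that two neighboring bins have the same probability density; the
cheapest (most similar) pair is merged first.

    # Sketch of histogram bin merging with a chi-square merging cost.
    import numpy as np

    def chi_square_cost(c1, w1, c2, w2):
        """Chi-square statistic for the hypothesis that two adjacent bins have
        the same density (expected counts proportional to bin widths)."""
        total = c1 + c2
        if total == 0:
            return 0.0          # two empty bins: merging costs nothing
        e1 = total * w1 / (w1 + w2)
        e2 = total * w2 / (w1 + w2)
        return (c1 - e1) ** 2 / e1 + (c2 - e2) ** 2 / e2

    def merge_bins(edges, counts, target_bins):
        edges, counts = list(edges), list(counts)
        while len(counts) > target_bins:
            costs = [chi_square_cost(counts[i], edges[i + 1] - edges[i],
                                     counts[i + 1], edges[i + 2] - edges[i + 1])
                     for i in range(len(counts) - 1)]
            i = int(np.argmin(costs))   # cheapest merge = most similar bins
            counts[i] += counts.pop(i + 1)
            edges.pop(i + 1)
        return edges, counts

    # Example: start from 20 equal-width bins of some skewed data.
    data = np.random.gamma(2.0, size=1000)
    counts, edges = np.histogram(data, bins=20)
    print(merge_bins(edges, counts, target_bins=5))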
Principle: Within
Algorithms: Ward, k-means, PCA
Type being abstracted: numerical
Convert numbers into categories (bins), so that the bins are well-separated
and approximately balanced. This is useful for identifying the subgroups of
a population, such as market segments. Analysis can continue on the
subgroups.
Ward's method and k-means optimize the sum of squares criterion
introduced on day9 and extended to multivariate
data on day37. Ward's method uses merging while
k-means uses iterative reassignment.
PCA is a "within" method which projects several numerical variables into a
smaller number, so that a scatterplot can be made. See day35 and day36. PCA is useful for
sorting a star plot (day38).
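All three tools have standard implementations; the sketch below (not the
course code) runs Ward's method, k-means, and PCA on the same synthetic data
using scipy and scikit-learn.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # Two synthetic "market segments" in 4 numerical variables.
    X = np.vstack([rng.normal(0, 1, size=(50, 4)),
                   rng.normal(3, 1, size=(50, 4))])

    # Ward: bottom-up merging that minimizes the increase in sum of squares.
    ward_labels = fcluster(linkage(X, method='ward'), t=2, criterion='maxclust')

    # k-means: iterative reassignment optimizing the same sum-of-squares criterion.
    kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # PCA: project 4 variables to 2, e.g. for a scatterplot or sorting a star plot.
    X2 = PCA(n_components=2).fit_transform(X)

    print(ward_labels[:5], kmeans_labels[:5], X2[:2])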
Principle: Prediction
Algorithms: merging
Type being abstracted: categorical
Merge categories which predict the same responses.
There are two possibilities. If the response is categorical then we
compare the distribution of the response for different predictor values
using a homogeneity test.
See day16 and day17.
If the response is numerical then we have to
assume something about the response distribution. If we assume it is
normal with constant variance, then we merge predictor values based on the
sum of squares. If we assume the response is normal with non-constant
variance then we merge based on log-variance.
See day18.
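For the categorical-response case, here is a hypothetical sketch: it greedily
merges the pair of predictor values whose response distributions look most
similar, using scipy's chi-square homogeneity statistic as the merging cost.
The course's exact rule from days 16-17 may differ.

    import numpy as np
    from itertools import combinations
    from scipy.stats import chi2_contingency

    def merge_predictor_values(table, target):
        """table: dict mapping predictor value -> array of response counts."""
        groups = {(k,): np.asarray(v, dtype=float) for k, v in table.items()}
        while len(groups) > target:
            best = None
            for a, b in combinations(groups, 2):
                stat, _, _, _ = chi2_contingency(np.vstack([groups[a], groups[b]]))
                if best is None or stat < best[0]:
                    best = (stat, a, b)
            _, a, b = best
            groups[a + b] = groups.pop(a) + groups.pop(b)
        return groups

    # Example: response counts (buy / don't buy) for four predictor values.
    counts = {'north': [30, 70], 'south': [32, 68],
              'east': [60, 40], 'west': [58, 42]}
    print(merge_predictor_values(counts, target=2))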
Principle: Prediction
Algorithms: m, v, mv
Type being abstracted: numerical
Project several numerical predictors of a categorical response (a
class variable) into a smaller number of numerical
predictors (h1 and h2).
The projection axes are hard to interpret in themselves, because they
are a mixture of the original variables. But you can use them to make a
scatterplot and visualize the shape and boundary of the classes.
There are three algorithms depending on whether you want to separate the
means, variances, or both means and variances of the classes.
See day35 and day36.
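The mean-separating ("m") projection is in the same spirit as Fisher's linear
discriminant. As a stand-in sketch (not the course's m/v/mv code),
scikit-learn's LDA projects several predictors down to two axes h1 and h2
that can be scatterplotted and colored by class.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    # Three classes in 5 numerical predictors.
    X = np.vstack([rng.normal(m, 1, size=(40, 5)) for m in (0, 2, 4)])
    y = np.repeat([0, 1, 2], 40)

    lda = LinearDiscriminantAnalysis(n_components=2)
    H = lda.fit_transform(X, y)   # columns are the projected axes h1 and h2

    # H[:, 0] and H[:, 1] can now be scatterplotted, colored by class y,
    # to visualize the shape and boundaries of the classes.
    print(H.shape)   # (120, 2)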
Tom Minka
Last modified: Mon Aug 22 16:41:57 GMT 2005