Day 37 - Multivariate clustering

Last time we saw that PCA was effective in revealing the major subgroups of a multivariate dataset. However, it is limited by what can be seen in a two-dimensional projection. Sometimes the group structure is more complex than that. A more general way to break a dataset into subgroups is to use clustering. Geometrically, clustering tries to find compact and well-separated point clouds within the overall cloud. The methods for multivariate clustering are similar to those discussed on day9, but with some new twists.

Ward's method

Previously we used a sum of squares criterion for clustering, with the option to get a hierarchy (Ward's method) or a partitioning (k-means). How do we assess the quality of a multivariate clustering? Using the definition of distance from day35, we can construct a generalized sum of squares criterion:
 SS = sum_{clusters j} sum_{points i in j} ||x_i - m_j||^2
where m_j is the mean of the points in cluster j.
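
To make the criterion concrete, here is a minimal sketch in base R. The function name sum.of.squares and the five data points are made up for illustration and are not part of the course code. The sketch computes SS for a given assignment of points to clusters, and checks numerically that merging two clusters increases SS by na*nb/(na+nb)*||a - b||^2, where a and b are the two cluster means and na and nb are the cluster sizes (the merging cost used by Ward's method).

```r
# Sum of squares of a clustering: squared distance from each point to its
# cluster mean, summed over all points. `x` is a numeric matrix (rows are
# points) and `cluster` is a vector of cluster labels, one per row.
# (Hypothetical helper, not part of the course code.)
sum.of.squares <- function(x, cluster) {
  total <- 0
  for (j in unique(cluster)) {
    xj <- x[cluster == j, , drop = FALSE]
    m <- colMeans(xj)                       # cluster mean m_j
    total <- total + sum(sweep(xj, 2, m)^2) # subtract m_j from every row
  }
  total
}

# Two small, made-up clusters in two dimensions.
xa <- rbind(c(0, 0), c(1, 0), c(0, 1))      # na = 3 points
xb <- rbind(c(4, 4), c(5, 5))               # nb = 2 points
x <- rbind(xa, xb)

ss.separate <- sum.of.squares(x, c(1, 1, 1, 2, 2))
ss.merged   <- sum.of.squares(x, c(1, 1, 1, 1, 1))

# Merging cost predicted by the formula, using the two cluster means.
na <- nrow(xa); nb <- nrow(xb)
a <- colMeans(xa); b <- colMeans(xb)
cost <- na * nb / (na + nb) * sum((a - b)^2)

# The increase in SS caused by the merge should equal the predicted cost.
print(c(increase = ss.merged - ss.separate, predicted = cost))
```
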
Ward's method works by making each point its own cluster, then repeatedly merging the pair of clusters whose merger increases the sum of squares the least. The merging cost that results from the sum of squares criterion is na*nb/(na+nb)*||a - b||^2, where a and b are the means of the two clusters and na and nb are their sizes. This is a simple generalization of the merging cost in the one-dimensional case. Iterative reassignment is also possible using this sum of squares criterion, giving a multivariate k-means algorithm. The commands to use in R are hclust and kmeans. Here is an example two-dimensional dataset:

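# Squared Euclidean distances give the sum of squares merging cost.
# Recent versions of R name this method "ward.D"; "ward" is still accepted with a message.
hc <- hclust(dist(x)^2,method="ward")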

The merging trace shows that 4 is an interesting number of clusters. Given the desired number of clusters, the function cutree will cut the tree and return a vector telling you which cluster each case falls into. We will save this in the data frame as an additional variable, allowing us to make a cplot:
x$cluster <- factor(cutree(hc,k=4))
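
The whole sequence can be tried on synthetic data. The sketch below is only an illustration under stated assumptions: the four blobs are made up, method="ward.D" is the name current versions of R use for what the code above calls "ward", and the k-means step is started from the means of the Ward clusters, which is one reasonable choice rather than something prescribed here.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical data: four well-separated blobs of 25 points each.
centers <- rbind(c(0, 0), c(10, 0), c(0, 10), c(10, 10))
truth <- rep(1:4, each = 25)
x <- data.frame(x1 = centers[truth, 1] + rnorm(100, sd = 0.5),
                x2 = centers[truth, 2] + rnorm(100, sd = 0.5))

# Ward's method on squared Euclidean distances, then cut into 4 clusters.
hc <- hclust(dist(x)^2, method = "ward.D")
cluster <- cutree(hc, k = 4)

# Heights of the last few merges, latest first. A large jump suggests
# stopping before that merge.
print(rev(hc$height)[1:6])

# Refine the partition by iterative reassignment (k-means), starting from
# the means of the Ward clusters (a 4-by-2 matrix of centers).
start <- apply(x, 2, function(v) tapply(v, cluster, mean))
km <- kmeans(x, centers = start)

# Cross-tabulate the two partitions.
print(table(ward = cluster, kmeans = km$cluster))
```

On data this cleanly separated the two partitions should coincide. On messier data, k-means may move some boundary points, since iterative reassignment can only lower the sum of squares of its starting partition.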

The sum of squares criterion desires separation and balance. The merging cost prefers to merge small clusters rather than large clusters, given the same amount of separation. In the multivariate case, sum of squares also wants the clusters to be spherical; it does not want highly elongated clusters. This is because distance is measured equally in all directions. Counterintuitive clusterings can result from this property. Consider this two-dimensional dataset:

The data falls into six vertical strips. This can easily happen when the clusters are compact in one dimension (horizontal) and highly variable in another (vertical). Here is the result of Ward's method with 6 clusters:

The clusters are chosen to be small and round, in violation of the structure of the data.
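
This behavior is easy to reproduce. The sketch below builds a made-up strips dataset (the variable names h and v, the strip width, and the spacing are all assumptions, not the data plotted above) and cross-tabulates the Ward clusters against the true strips. As before, method="ward.D" is the current R name for the method.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# Hypothetical strips data: six vertical strips at h = 1, ..., 6, each
# narrow horizontally (width 0.1) and long vertically (0 to 10).
strip <- rep(1:6, each = 50)
x <- data.frame(h = strip + runif(300, -0.05, 0.05),
                v = rep(seq(0, 10, length.out = 50), times = 6))

hc <- hclust(dist(x)^2, method = "ward.D")
ward <- cutree(hc, k = 6)

# Rows are Ward clusters, columns are the true strips. A row with more
# than one nonzero entry is a cluster that cuts across strips.
print(table(ward = ward, strip = strip))
```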

Single-link clustering

This property of the sum of squares criterion has caused people to search for alternatives. An extreme alternative is known as single-link (or nearest-neighbor) clustering. It is a merging algorithm: you start with all points in separate clusters, and repeatedly merge the two closest clusters. The difference is in how the distance between clusters is measured. Instead of using the distance between means, as Ward's method does, it uses the distance between the two closest points from each cluster, i.e. the size of the gap between the clusters. This algorithm has no preference for spherical clusters or any other shape. It does not scale the distance by cluster size, so it doesn't prefer balance either. It is extreme among clustering algorithms in that it only desires separation.

You can switch to single-link clustering by telling hclust to use method="single". Here is the result on the strips data:

hc <- hclust(dist(x)^2,method="single")
x$cluster <- factor(cutree(hc,k=6))

It divides the data into the six strips.
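
Here is a small sketch of the same experiment on made-up strips data (the variable names, strip width, and spacing are assumptions, not the data plotted above). The helper gap is hypothetical and simply computes the single-link distance between two clusters, that is, the smallest distance between a point of one and a point of the other.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# Hypothetical strips data: six vertical strips at h = 1, ..., 6.
strip <- rep(1:6, each = 50)
x <- data.frame(h = strip + runif(300, -0.05, 0.05),
                v = rep(seq(0, 10, length.out = 50), times = 6))

# Single-link distance between two clusters: the smallest distance between
# a point of one and a point of the other. (Hypothetical helper.)
gap <- function(xa, xb) {
  d <- as.matrix(dist(rbind(xa, xb)))
  min(d[seq_len(nrow(xa)), nrow(xa) + seq_len(nrow(xb))])
}
print(gap(x[strip == 1, ], x[strip == 2, ]))  # gap between strips 1 and 2

hc <- hclust(dist(x)^2, method = "single")
single <- cutree(hc, k = 6)

# Rows are single-link clusters, columns are the true strips.
print(table(single = single, strip = strip))
```

Within a strip, neighboring points are much closer together than the gap between strips, so every within-strip merge happens before any merge across strips.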

However, the blind pursuit of separation can also lead to counterintuitive clusterings. Here is the result on the first example above, showing the k=5 solution:

Four outlying data points have each been assigned their own cluster, and the rest of the data in the middle, which is very dense, has been lumped into the one remaining cluster. Single-link can do this because it doesn't care about balance.
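
The effect can be reproduced with a small made-up dataset (not the one plotted above): two dense groups separated by a modest gap, plus four isolated points. This is only a sketch, and the coordinates are arbitrary.

```r
# Hypothetical data: two dense 10-by-10 grids of points (spacing 0.1)
# separated by a gap of 0.5, plus four isolated outliers.
grid <- expand.grid(h = seq(0, 0.9, by = 0.1), v = seq(0, 0.9, by = 0.1))
left <- grid
right <- transform(grid, h = h + 1.4)   # left grid ends at 0.9, right starts at 1.4
outliers <- data.frame(h = c(-4, 6, -4, 6), v = c(-4, -4, 5, 5))
x <- rbind(left, right, outliers)

hc <- hclust(dist(x)^2, method = "single")
single <- cutree(hc, k = 5)

# Cluster sizes: the outliers are each more than 5 units from everything
# else, so the two grids (only 0.5 apart) are merged with each other first.
print(table(single))
```

With k=5, single-link spends four of its five clusters on the outliers and merges the two dense groups, even though each group has 100 points.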

Here are some more examples:


Let's compare clustering to PCA on the car dataset. Here is the PCA projection, with points colored by cluster:
sx <- scale(x)
w <- pca(sx,2)
hc <- hclust(dist(sx)^2,method="ward")
sx$cluster <- factor(cutree(hc,k=5))

The clusters follow the PCA projection pretty closely, even though the clusters were computed in the full 10-dimensional space, not the 2-dimensional projection. The only difference from our analysis on day36 is that the cars with high MPG have been divided into two groups, corresponding to compact and midsize cars.
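
For readers without the course functions, a rough equivalent can be sketched in base R. This is an illustration under stated assumptions: the built-in mtcars data stands in for the car dataset used here, prcomp stands in for pca, and method="ward.D" is the current R name for Ward's method. The results will not match the figure above.

```r
# Stand-in data: the built-in mtcars data frame (32 cars, 11 numeric
# variables), not the car dataset used in the lecture.
sx <- scale(mtcars)                      # standardize each variable
pc <- prcomp(sx)                         # base R principal components
hc <- hclust(dist(sx)^2, method = "ward.D")
cluster <- factor(cutree(hc, k = 5))     # clusters from the full space

# Coordinates of each car on the first two principal components.
proj <- pc$x[, 1:2]

# Mean position of each cluster in the projection.
print(aggregate(as.data.frame(proj), list(cluster = cluster), mean))

# Scatterplot of the projection, colored by cluster (interactive use only).
if (interactive()) {
  plot(proj, col = as.integer(cluster), pch = 19)
}
```

Note that the clustering is computed from all of the standardized variables, while the projection is used only for display.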

Another way to view the result of clustering is to treat the clusters as classes and make a discriminative projection:

x$cluster <- factor(cutree(hc,k=5))
w2 <- projection(x,2)

The projection is a little odd in that the seemingly most important variables, like Horsepower and MPG, are not used. The projection function found a different set of variables which can also separate the clusters.


Here is the PCA projection of the states data, colored by cluster:
sx <- scale(x)
w <- pca(sx,2)
hc <- hclust(dist(sx)^2,method="ward")
sx$cluster <- factor(cutree(hc,k=4))

In this case, the clustering does not completely agree with PCA. The green class is split up and one red point is far from its cluster. Here is the discriminative projection:
x$cluster <- factor(cutree(hc,k=4))
w2 <- projection(x,2,type="m")

Only the three variables Life.Exp, Illiteracy, and Homocide are needed to distinguish the clusters. This projection shows that the green cluster is most unusual compared to the other three. We will examine this more closely in the next lecture.


To use these functions, you need the latest version of clus1.r or clus1.s.

Functions introduced in this lecture:

Tom Minka
Last modified: Mon Aug 22 16:41:25 GMT 2005