Day 36 - Examples of projection

Last time we discussed two ways to project multivariate data for visualization: Principal Components Analysis (PCA) and discriminative projection (types m, v, and mv). Today we will use these methods to analyze several datasets, introducing a few new functions along the way.

Handwritten digit recognition

We looked at several projections of this dataset last time, and found that the class boundary was quadratic. This explains why linear logistic regression didn't work very well. Another way to diagnose problems with logistic regression is to look directly at the classification boundary it is using. Recall that logistic regression uses a model of the form
p(y=1|x) = sigma(a + b1*x1 + b2*x2 + ...)
The argument to the logistic function is a projection of the data onto one dimension. If the projection exceeds zero, the classifier says class 1; otherwise it says class 2. By choosing a second projection dimension, e.g. using one of the discriminative criteria (m/v/mv), we can see how the data is distributed around that boundary. The function cplot.project.glm will do this if you give it a logistic regression fit. Here is the result on the digit problem:
fit <- glm(digit8~.,x8,family=binomial)
cplot.project.glm(fit)
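
For intuition, here is a minimal base-R sketch of the same idea, assuming a data frame dat whose column y is a binary factor: the argument of the logistic function is a one-dimensional projection of the data, and its sign determines the predicted class.
# sketch only: the linear predictor of a logistic regression is a projection
fit2 <- glm(y~.,dat,family=binomial)
proj <- predict(fit2,type="link")    # a + b1*x1 + b2*x2 + ...
pred <- ifelse(proj > 0, levels(dat$y)[2], levels(dat$y)[1])
table(pred, dat$y)                   # which side of the boundary each class falls on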

Logistic regression has done its best to fit a linear boundary, but clearly a curved boundary would be more appropriate.

Diabetes

Let's look again at the diabetes dataset from day31:
> x[1:5,]
  npreg glu bp skin  bmi   ped age type
1     5  86 68   28 30.2 0.364  24   No
2     7 195 70   33 25.1 0.163  55  Yes
3     5  77 82   41 35.8 0.156  35   No
4     0 165 76   43 47.9 0.259  26   No
5     0 107 60   25 26.4 0.133  23   No
> w <- projection(x,2)
Projection type mv 
> cplot(project(x,w))

fit <- glm(type~.,x,family=binomial)
cplot.project.glm(fit)

Both projections show that the classes are roughly ellipsoidal and overlap substantially, so we cannot expect a low test error rate from any classifier. A logistic regression classifier seems to be sufficient. An extra bit of information revealed by these plots is that the red class (diabetes) shows more variation, especially as you move farther away from the black class.
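
One rough way to quantify that overlap is the training (resubstitution) error of the logistic fit above: even on the data it was fit to, a fair fraction of cases fall on the wrong side of the boundary. This is only a sketch, assuming type is a No/Yes factor as shown.
# training error of the logistic regression fit
pred <- ifelse(fitted(fit) > 0.5, "Yes", "No")
mean(pred != x$type)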

Vehicle classification

In the Vehicle dataset, four vehicles were observed at several different camera angles, and the silhouette of each vehicle was extracted. Each silhouette is described by 18 variables measuring properties like circularity (Circ) and elongatedness (Elong). We want to discriminate vans from the other vehicles based on the silhouette alone; the variable Class indicates whether the vehicle is a van. Here is the first row:
  Comp Circ D.Circ Rad.Ra Pr.Axis.Ra Max.L.Ra Scat.Ra Elong
1   95   48     83    178         72       10     162    42
  Pr.Axis.Rect Max.L.Rect Sc.Var.Maxis Sc.Var.maxis Ra.Gyr
1           20        159          176          379    184
  Skew.Maxis Skew.maxis Kurt.maxis Kurt.Maxis Holl.Ra Class
1         70          6         16        187     197   Yes

w <- projection(x,2,type="m")
cplot(project(x,w))
w <- projection(x,2,type="mv")
cplot(project(x,w))

The Fisher projection (type m) suggests that a linear boundary is appropriate. The mv projection shows that the classes have quite different spreads, but only along a dimension irrelevant to discrimination, so it does not contradict the suitability of a linear boundary. A quantitative comparison of logistic regression and nearest-neighbor classification on this dataset is made in homework 11.
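
To give a flavor of that kind of comparison (not necessarily how homework 11 does it), here is a rough sketch using a single random train/test split and the knn function from the class package; the column selection below assumes Class is the only non-numeric column.
# sketch: test error of logistic regression versus 1-nearest-neighbor
library(class)                              # for knn()
i <- sample(nrow(x), round(nrow(x)/2))      # random half for training
train <- x[i,]; test <- x[-i,]
fit <- glm(Class~.,train,family=binomial)
p.glm <- ifelse(predict(fit,test,type="link") > 0,
                levels(x$Class)[2], levels(x$Class)[1])
p.knn <- knn(train[names(x) != "Class"], test[names(x) != "Class"],
             train$Class, k=1)
mean(p.glm != test$Class)                   # logistic regression error
mean(p.knn != test$Class)                   # nearest-neighbor error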

Speech recognition

Fifteen speakers utter 11 different vowels 6 times each. The sound signal was recorded and transformed into 9 variables measuring harmonic properties. We want to classify the vowel "hid" versus all others.
w <- projection(Vowel,2,type="mv")
cplot(project(Vowel,w))
w <- projection(Vowel,2,type="m")
cplot(project(Vowel,w))

These two projections seem to contradict each other, until you realize that the first projection is just a sideways look at the second. This was shown in class by making a three-dimensional mv projection and rotating it in ggobi. The classes are fairly separable if you use a curved boundary, such as a quadratic.
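
For completeness, the three-dimensional view can be set up the same way, assuming projection takes the number of dimensions as its second argument just as in the two-dimensional calls; pairs gives static views, while a rotating viewer such as ggobi gives the full effect.
# sketch: a 3-D mv projection of the Vowel data
w3 <- projection(Vowel,3,type="mv")
pairs(project(Vowel,w3))    # static pairwise views of the three dimensions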

Cars

Now let's consider some datasets where there is no designated response variable and we just want to understand the structure of the data. In this situation, PCA is the appropriate projection. Consider the Cars93 dataset from day20, excluding the non-numeric variables. Here is the first row:
              Price MPG.highway EngineSize Horsepower Passengers
Acura Integra  15.9          31        1.8        140          5
              Length Wheelbase Width Turn.circle Weight
Acura Integra    177       102    68          37   2705
We will treat Price as just another attribute of the car, not as a response variable. Before running PCA, it is a good idea to standardize the variables so that they have the same variance; otherwise differences in units will distort the result. The function scale will standardize all variables in a data frame.
sx <- scale(x)
w <- pca(sx,2)
plot(project(sx,w))
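
If you do not have the course library, roughly the same picture can be made with base R's prcomp, assuming the course pca computes ordinary principal components; prcomp can also do the standardization itself.
# base-R sketch: principal components of the standardized variables
p <- prcomp(x, scale.=TRUE)    # scale.=TRUE standardizes each variable
plot(p$x[,1:2])                # scores on the first two components
p$rotation[,1:2]               # component directions, analogous to w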

This plot shows a funnel effect: a narrow band of cars on the left blends into a wide band of cars on the right, with some highly unusual cars at the bottom. To understand which variables are driving this, it helps to examine the projection coefficients:
> w
                    h1           h2
Price        0.2561774 -0.495802418
MPG.highway -0.3017322  0.049590394
EngineSize   0.3475447 -0.090019577
Horsepower   0.2896481 -0.503803781
Passengers   0.2088062  0.630746684
Length       0.3337987  0.109095127
Wheelbase    0.3404057  0.247025091
Width        0.3520862  0.085103623
Turn.circle  0.3239262  0.108473122
Weight       0.3726352  0.005041253
The horizontal axis is h1, which has large positive contributions from all variables except MPG.highway, which has a large negative contribution. So h1 represents the tradeoff between MPG and the other car variables, and is the most significant way that the cars vary. A secondary effect is captured by h2, which measures the tradeoff between number of passengers and horsepower/price. (Expensive sports cars tend to have a small number of passengers.) We can see these variable contributions visually by plotting the rows of w as vectors. This is done by the function plot.axes:
plot.axes(w)

Essentially what you are seeing is the projection of unit vectors pointing along each axis. Vectors which line up correspond to variables which are positively correlated, vectors which point in opposite directions correspond to variables which are negatively correlated, and vectors at right angles are roughly uncorrelated. The funnel shape means that high MPG cars are relatively similar to each other, while low MPG cars can be quite different. The variables Length, Width, EngineSize, etc. are clearly correlated with one another and associated with low MPG. At the top right we have big cars, such as minivans, that hold many passengers. At the bottom right we have expensive sports cars. The bottom left and top left are vacant regions with no cars: high MPG cars tend not to have high horsepower, nor to carry many passengers.
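
In base R, the analogous picture is a biplot, which overlays these variable arrows on the projected points; this is again a sketch using prcomp rather than the course functions.
# scores and variable arrows in one plot, similar in spirit to plot.axes(w)
biplot(prcomp(x, scale.=TRUE))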

Because of the strong correlations among the variables, we can get a decent picture of the spectrum of cars in a single two-dimensional plot; the R-squared of the projection is 0.82. The plot resembles a clustering of the cars, but it is better than a clustering because it reflects a continuum of variation.
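
The R-squared of a PCA projection is being used here as the fraction of total (standardized) variance captured by the retained components; assuming that definition, it can be computed from the component standard deviations returned by prcomp.
# fraction of variance explained by the first two principal components
p <- prcomp(x, scale.=TRUE)
sum(p$sdev[1:2]^2) / sum(p$sdev^2)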

US demographics

The following dataset, used in homework 10, has demographic information about the 50 states:
           Income Illiteracy Life.Exp Homocide HS.Grad Frost
Alabama      3624        2.1    69.05     15.1    41.3    20
Alaska       6315        1.5    69.31     11.3    66.7   152
Arizona      4530        1.8    70.55      7.8    58.1    15
Arkansas     3378        1.9    70.66     10.1    39.9    65
California   5114        1.1    71.71     10.3    62.6    20
...
Let's analyze it using PCA, remembering to standardize the variables:
sx <- scale(x)
w <- pca(sx,2)
plot(project(sx,w))
plot.axes(w)
identify(project(sx,w))
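
identify is base R: after the plot is drawn, clicking near a point labels it, which is how individual states can be picked out. Passing the row names makes the labels readable; this assumes project returns just the two projected coordinates.
# label clicked points with state names; right-click or Esc to finish
identify(project(sx,w), labels=rownames(x))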

Like the cars, this dataset shows a three-way division. On the left we have states with high illiteracy rates, high homicide rates, and low average income. At the top right are states with heavy frost (which is negatively correlated with homicide) and high life expectancy. At the bottom right are states with high graduation rates and high income. Of course, this is only an approximate representation of the full dataset; we know, for example, that Alaska is pretty frosty. Nevertheless, it is useful for showing the general trends in a convenient two-dimensional plot. The overall R-squared is 0.76.

Code

Functions introduced in this lecture: cplot.project.glm and plot.axes.

Tom Minka
Last modified: Thu Nov 29 21:11:15 Eastern Standard Time 2001