Multivariate analysis is useful whether or not you have a designated response variable. When there is a designated response, as in classification and regression, multivariate analysis is useful for diagnosing non-additive interactions. For example, we saw last week that the geometrical shape of a dataset was crucial in choosing between logistic regression and nearest-neighbor classification. When there is no designated response, multivariate analysis is useful for finding trends, anomalies from those trends, and clusters.
It helps to think about datasets in terms of multivariate geometry. Consider this data frame describing iris flowers:
Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
         6.0         2.2          4.0         1.0 versicolor
         6.1         2.8          4.7         1.2 versicolor
         5.8         2.7          5.1         1.9  virginica
         6.8         2.8          4.8         1.4 versicolor
         7.7         3.8          6.7         2.2  virginica

Instead of viewing this as a table of predictors and responses, we can view it as a set of points in high-dimensional space. Each row is a point with four numeric dimensions and one categorical dimension. For now, let's focus on the four numeric dimensions. Each of the three species forms a cloud of points in four-dimensional space, defined by the four measurements above.
What is four-dimensional space? It is a mathematical abstraction. You take the Cartesian coordinate system that we are familiar with in 1d, 2d, and 3d and extend it in the natural way. A point x is indexed by four numbers: (x1,x2,x3,x4). The length or norm of a point x is denoted
||x|| = sqrt(x1^2 + x2^2 + x3^2 + x4^2)

which is a straightforward generalization of the definition of length in 1d, 2d, and 3d. The distance between two points is the length of their difference:
d(x,z) = ||x-z||

The main difficulty of four-dimensional space is visualizing it.
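To make this concrete in base R, take the first and third rows of the table above as two points:

x <- c(6.0, 2.2, 4.0, 1.0)   # one flower, as a point in four dimensions
z <- c(5.8, 2.7, 5.1, 1.9)   # another flower
sqrt(sum(x^2))               # ||x||, the norm (length) of x
sqrt(sum((x - z)^2))         # d(x,z) = ||x - z||, the distance between them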
There are several ways to look into four-dimensional space. The technique we will use today is the same one used by photographers to capture a three-dimensional world on two-dimensional film: we will visualize high-dimensional data through projections. The simplest projection just drops some of the dimensions; plotting every pair of variables this way gives a scatterplot matrix, also known as a draftsman's display:
pairs(iris)
The draftsman's display is mainly used for determining the correlation structure of the variables. In this case, we see that all of the variables are correlated, especially Petal.Length and Petal.Width. We can also see that the dataset is clustered. A lot of information is contained in this display, and we will examine it further in a later lecture.
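For example, the panels can be colored by species using base R's pairs, which makes the clusters easier to see:

pairs(iris[1:4], col = as.integer(iris$Species))   # one color per species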
The draftsman's display is a special case of a more general operation. A general projection from three dimensions down to two is given by a pair of linear combinations:

h1 = w1*x1 + w2*x2 + w3*x3
h2 = u1*x1 + u2*x2 + u3*x3

where the w's and u's are real numbers. (Technically, we also want w and u to be orthogonal and unit length.) In photography terms, this is an orthographic projection, corresponding to a very large focal length. There are no perspective distortions in such a projection. The draftsman's display, which simply drops dimensions, corresponds to w = (1,0,0), u = (0,1,0), and the like. The formula for a projection from any number of dimensions down to one dimension is
h1 = w1*x1 + w2*x2 + w3*x3 + w4*x4 + ...

In linear algebra, this is an inner product between w and x, denoted w'x ("w transpose x"). To project down to more than one dimension, just combine multiple inner products, giving the point (h1,h2,...).
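In R, this is a matrix product between the data and the projection vectors; here w and u are the dimension-dropping vectors mentioned above:

w <- c(1, 0, 0, 0)                 # keep Sepal.Length
u <- c(0, 1, 0, 0)                 # keep Sepal.Width
X <- as.matrix(iris[, 1:4])        # the four measurements as a matrix
H <- X %*% cbind(h1 = w, h2 = u)   # row i of H is (w'x_i, u'x_i)
head(H)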
The idea of a grand tour is to smoothly pan around an object,
viewing it from all directions, instead of only from a few canonical
directions. We can do this by smoothly varying the projection vectors
w and u. You can make a grand tour via the ggobi package, an optional extension to R.
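Another option is the tourr package on CRAN, which animates a grand tour directly in R; a minimal sketch, assuming the package is installed:

library(tourr)                                # install.packages("tourr")
animate_xy(iris[, 1:4], col = iris$Species)   # animated grand tour, colored by species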
Here are some snapshots of a grand tour of the iris data:
That's right, you are looking at four-dimensional space! (Through a two-dimensional window.)
The points have been colored according to the Species variable.
Some projections keep the classes separate, and some don't.
In some projections the three classes have the same shape, while in other
projections they do not.
The second image above shows that the classes are "flat", implying there is a great deal of linear redundancy among the four measurements.
Mathematically, we can describe the "depth" information lost by a projection as the distance between an original point x and its projection Px. We can think of x and Px as both lying in the original high-dimensional space, except that Px lies on a plane (the "film"). The formula for "depth" is
d(x,Px) = ||x - Px||

The quality of a given projection vector w is the variance of the depth over the dataset. Principal component analysis (PCA) finds the w which minimizes this quantity. Equivalently, PCA finds the w which maximizes the spread of the projected points Px. This works when projecting down to one dimension or to any number of dimensions. The result is a projection that hopefully gives an informative look at the data, without the need for a grand tour. To get an idea of what we are missing in this projection, we can define an R-squared statistic, analogous to the one for regression:
R-squared = variance( ||Px|| ) / variance( ||x|| )

R-squared is the proportion of the variation which is captured by the projection. The larger the R-squared, the flatter the object is.
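Here is a sketch of both quantities in base R, using the simple projection that keeps only Sepal.Length (an arbitrary unit-length w, not the best one):

X <- as.matrix(iris[, 1:4])
w <- c(1, 0, 0, 0)                   # unit-length projection vector
PX <- (X %*% w) %*% t(w)             # projected points Px, still in four dimensions
depth <- sqrt(rowSums((X - PX)^2))   # d(x,Px) = ||x - Px|| for each row
var(sqrt(rowSums(PX^2))) / var(sqrt(rowSums(X^2)))   # R-squared for this projection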
To run PCA, use the function pca. For a one-dimensional projection, it returns a projection vector. For a two-dimensional projection, it returns a matrix of projection vectors. Here is the result on the iris data (first four columns only):
> w <- pca(iris[1:4])
R-squared = 0.965303
> w
                     h1
Sepal.Length -0.7511082
Sepal.Width  -0.3800862
Petal.Length -0.5130089
Petal.Width  -0.1679075
> w <- pca(iris[1:4],2)
R-squared = 0.998372
> w
                     h1         h2
Sepal.Length -0.7511082 -0.2841749
Sepal.Width  -0.3800862 -0.5467445
Petal.Length -0.5130089  0.7086646
Petal.Width  -0.1679075  0.3436708

A 2D projection always captures at least as much variation as a 1D projection, so its R-squared is at least as large. In this case, a 2D projection captures virtually all of the variation in the dataset. To apply a projection matrix to a data frame, use the command project:
> project(iris,w)
          h1        h2 Species
   -5.912747 -2.302033  setosa
   -5.572482 -1.971826  setosa
   -5.446977 -2.095206  setosa
   -5.436459 -1.870382  setosa
   -5.875645 -2.328290  setosa
   ...

Each row of this frame is a two-dimensional point, plus the Species variable, which is left unchanged. Here is what the projection looks like:
cplot(project(iris,w))
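Base R's prcomp computes the same kind of projection; here is a sketch for comparison (the signs and scaling of its loadings may differ from pca above):

p <- prcomp(iris[, 1:4])               # principal components of the four measurements
p$rotation[, 1:2]                      # projection vectors (loadings)
sum(p$sdev[1:2]^2) / sum(p$sdev^2)     # proportion of variance kept by two dimensions
plot(p$x[, 1:2], col = iris$Species)   # the projected points, colored by species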
"Class separation" is tricky to define mathematically. One approach is to define it as the distance between the class means. The mean of a class is the center of its point cloud; the coordinates are the mean along each dimension. Suppose we have two classes and the means after projection are Pm1 and Pm2. Then we would want to maximize ||Pm1 - Pm2||. This doesn't quite work, since the spread of the classes may differ in different projections. Assuming the classes have the same spread, we divide the distance between means by the projected standard deviation: ||Pm1 - Pm2||/sqrt(PVP'). This has the effect of standardizing the variables so that the choice of units doesn't matter (similar to what we did for nearest neighbor on day33). The projection which maximizes this criterion is called the Fisher projection. You can compute the Fisher projection of a dataset via the projection function, with type="m". It assumes that the last variable in the data frame is the response, which is true for iris since Species is last. Otherwise it is the same as using pca:
> w <- projection(iris,2,type="m")
> w
                  h1      h2
Sepal.Length -0.0741  0.0204
Sepal.Width  -0.4135  0.0128
Petal.Length  0.3334 -0.3998
Petal.Width   0.8440  0.9163
> cplot(project(iris,w))
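A closely related projection is available from the lda function in the MASS package, which implements Fisher's linear discriminant analysis. This is only a sketch; the discriminant directions will generally differ in sign and scale from the output of projection:

library(MASS)                    # lda() computes Fisher's linear discriminants
fit <- lda(Species ~ ., data = iris)
fit$scaling                      # discriminant directions (projection vectors)
ld <- predict(fit)$x             # the projected data: two discriminant scores
plot(ld, col = iris$Species)     # the classes separate well in this view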
The Fisher projection only looks at the class means. Sometimes the means are nearly equal, and it is the difference in spread that separates the classes. Here is a simple example where one class is contained in another. The dataset is two-dimensional, and we want to reduce it to one dimension. The projection line which maximizes the difference in spread (type="v") is shown in blue.
Here is another dataset, where one class has been shifted. Now we compromise between maximizing the difference in class means and the difference in class spread (type="mv").
> w <- pca(x8[,1:16],2)
R-squared = 0.83
> cplot(project(x8,w))
> w <- projection(x8,2,type="m")
> cplot(project(x8,w))
> w <- projection(x8,2,type="v")
> cplot(project(x8,w))
> w <- projection(x8,2,type="mv")
> cplot(project(x8,w))
Functions introduced in this lecture: pca, projection, project.