         x1         x2   y
  3.2283839  3.7587273 Yes
 -1.2221922 -2.5341930  No
  2.6448352  2.0397598 Yes
 -2.3814279 -0.9399438  No
 -0.2223163 -1.1950309  No
...
The function knn.model will make a knn classifier (k=1 by default):
nn <- knn.model(y~.,frame)
This function actually doesn't do anything but put the training set into an object. All of the work is done at prediction time by the function predict:
> predict(nn,data.frame(x1=0,x2=-1))
[1] No
To classify a future (x1,x2) observation, it finds the training example (z1,z2) with smallest distance. The distance measure is the familiar Euclidean distance:
d(x,z) = sqrt((x1 - z1)^2 + (x2 - z2)^2)
To see the model visually, use cplot(nn):
The 1-nn boundary is always a polygon. To understand this, note that each example has a small region around it where it is the nearest example. This region is a polygon, and the classification decision, whatever it is, is constant over the region. Some regions are labeled "Yes" and others "No". The class boundary is the boundary between the "Yes" regions and "No" regions, so it must be a polygon or a collection of polygons. This is more general than a tree classifier, which uses rectangles.
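To make the prediction rule concrete, here is a minimal base-R sketch of what happens at prediction time for k=1. The helper predict1nn is hypothetical, written only for illustration (it is not part of the course library); frame is the training set shown above.
# 1-nn prediction by hand: compute the Euclidean distance from the query
# point to every training example, then return the class of the closest one.
predict1nn <- function(frame, x1, x2) {
  d <- sqrt((frame$x1 - x1)^2 + (frame$x2 - x2)^2)
  frame$y[which.min(d)]
}
predict1nn(frame, x1=0, x2=-1)   # should agree with predict(nn, ...) above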
When we use k > 1, the boundary is smoother:
nn <- knn.model(y~.,frame,k=3)
cplot(nn)
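For k > 1 the prediction is presumably a majority vote among the k nearest training examples; extending the hypothetical sketch above:
# k-nn prediction by hand: take a majority vote among the k closest examples.
predictknn <- function(frame, x1, x2, k=3) {
  d <- sqrt((frame$x1 - x1)^2 + (frame$x2 - x2)^2)
  nearest <- order(d)[1:k]                  # indices of the k closest examples
  names(which.max(table(frame$y[nearest])))
}
predictknn(frame, x1=0, x2=-1, k=3)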
 x1  y1 x2 y2 x3  y3  x4  y4 x5 y5  x6 y6  x7  y7 x8 y8 digit
 47 100 27 81 57  37  26   0  0 23  56 53 100  90 40 98     8
 48  96 62 65 88  27  21   0 21 33  79 67 100 100  0 85     8
  0  57 31 68 72  90 100 100 76 75  50 51  28  25 16  0     1
  0 100  7 92  5  68  19  45 86 34 100 45  74  23 67  0     4
...
Here is what they look like in 2D:
To keep the example simple, we will only consider the two-class problem of classifying a digit as "8" or "not 8". We break the dataset into train and test (we're going to use a small 5% training set this time):
x <- rsplit(x8,0.05); names(x) <- c("train","test")
Let's start with logistic regression:
> fit <- glm(digit8~.,x$train,family=binomial)
> misclass(fit)/nrow(x$train)
[1] 0.02181818
> misclass(fit,x$test)/nrow(x$test)
[1] 0.03208198
Can we do better with a tree?
> tr <- tree(digit8~.,x$train)
> misclass(tr)/nrow(x$train)
[1] 0.009090909
> misclass(tr,x$test)/nrow(x$test)
[1] 0.02777246
The tree can fit the training set well, but doesn't do much better on the test set. Now let's try nearest-neighbor. Intuitively, this problem seems well-suited to nearest-neighbor because two examples of the same digit, written in a similar way, will have similar pen coordinates and hence be close to each other.
> nn <- knn.model(digit8~.,x$train)
> misclass(nn)/nrow(x$train)
[1] 0
> misclass(nn,x$test)/nrow(x$test)
[1] 0.002298410
On the training set, 1-nn is of course perfect, because it has memorized the training set. But this doesn't mean 1-nn has overfit; quite the contrary. Nearest-neighbor makes one-tenth as many errors on the test set as the other two classifiers. The distance measure that 1-nn used was
d(a,b) = sqrt((a.x1 - b.x1)^2 + (a.y1 - b.y1)^2 + (a.x2 - b.x2)^2 + ...)
This is Euclidean distance in the 16-dimensional space defined by (x1,y1,x2,...,y8).
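In R this distance is easy to compute directly; a sketch (the column names are those shown in the data listing above, and x8 is the full dataset):
# Euclidean distance between two digit examples, using the 16 coordinate columns.
coords <- c("x1","y1","x2","y2","x3","y3","x4","y4",
            "x5","y5","x6","y6","x7","y7","x8","y8")
d <- function(a, b) sqrt(sum((unlist(a[coords]) - unlist(b[coords]))^2))
d(x8[1,], x8[2,])   # distance between the first two examples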
One approach is to use geometry to skip irrelevant computations. For example, suppose we are looking for the nearest neighbor to x. We measure the distance to example z and find it is 2. This tells us that the nearest neighbor, whatever it is, cannot be farther than 2. So there is no need to compute the distance to any examples which are outside of a sphere of radius 2 centered at x. With certain data structures, it is possible to find out which points are outside of this sphere, without computing the distance to each one of them. After the next comparison, the sphere gets smaller, and so on. This technique can substantially reduce the number of comparisons, and it is widely used.
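Here is a rough sketch of the geometric idea, under the assumption that the training points have been grouped into balls (here with kmeans), each with a center and a radius. None of this is the course code; it only illustrates the pruning rule.
# Group the training rows of a numeric matrix X into balls using k-means.
make.balls <- function(X, nballs = 10) {
  km <- kmeans(X, centers = nballs)
  lapply(seq_len(nballs), function(j) {
    rows <- which(km$cluster == j)
    pts  <- X[rows, , drop = FALSE]
    ctr  <- km$centers[j, ]
    list(center = ctr,
         radius = max(sqrt(rowSums(sweep(pts, 2, ctr)^2))),
         points = pts, index = rows)
  })
}

# Nearest-neighbor search with pruning: a whole ball can be skipped when even
# its closest possible point (distance to the center minus the radius) is
# farther away than the best distance found so far.
nn.search <- function(q, balls) {
  best <- Inf; best.i <- NA
  for (b in balls) {
    if (sqrt(sum((q - b$center)^2)) - b$radius >= best) next  # prune the whole ball
    for (i in seq_len(nrow(b$points))) {
      d <- sqrt(sum((q - b$points[i, ])^2))
      if (d < best) { best <- d; best.i <- b$index[i] }
    }
  }
  best.i   # row index of the nearest training example
}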
Another approach is to prune away some of the examples. As discussed above, nearest-neighbor is largely unaffected by removing examples. Only the boundary examples are important. Examples can be pruned as follows: for each point, see if its nearest neighbor has the same class. If so, prune one of them. Repeat. For datasets with simple boundaries, this can substantially reduce the number of examples; sometimes to two or three. Note that you can combine this with the geometric method for additional speedup. With these tricks, nearest-neighbor can be quite practical.
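A sketch of this pruning procedure (again not the course code, and written for clarity rather than speed):
# Repeatedly find a training point whose nearest neighbor has the same class,
# and drop it; what remains is (roughly) the set of boundary examples.
prune.examples <- function(X, y) {
  repeat {
    D <- as.matrix(dist(X)); diag(D) <- Inf
    nn <- apply(D, 1, which.min)        # each point's nearest neighbor
    same <- which(y == y[nn])           # points whose neighbor agrees with them
    if (length(same) == 0 || nrow(X) <= 2) break
    keep <- -same[1]                    # prune one of them and repeat
    X <- X[keep, , drop = FALSE]; y <- y[keep]
  }
  list(X = X, y = y)
}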
nn <- knn.model(type~glu+ped,x)
cplot(nn)
> nn2 <- best.k.knn(nn,1:20)
best k is 13
cplot(nn2)
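best.k.knn is part of the course library; a standard way to pick k, which may or may not be what it does internally, is leave-one-out cross-validation, for example with knn.cv from the class package (assuming the training data here is Pima.tr from MASS, the companion of the Pima.te test set used below):
# Choose k by leave-one-out cross-validation on the training set.
library(MASS)    # Pima.tr, Pima.te
library(class)   # knn.cv
X <- scale(Pima.tr[, c("glu","ped")])       # standardize the two predictors
errors <- sapply(1:20, function(k)
  sum(knn.cv(X, Pima.tr$type, k = k) != Pima.tr$type))
which.min(errors)   # the k with the fewest leave-one-out errors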
> misclass(nn,Pima.te)
[1] 93
> misclass(nn2,Pima.te)
[1] 81
> misclass(fit,Pima.te)
[1] 68
Using k=13 helped, but logistic regression still wins. Why? From looking at the test data, the boundary is not quite linear. But by assuming linearity, logistic regression was able to get an estimate of the boundary which was closer to the truth than knn could, for the same amount of data. Once again, the more constrained classifier finds an advantage.
First, the distance measure should take the scale of each predictor into account. Dividing each squared difference by that predictor's variance prevents a predictor measured in large units from dominating the distance:
d(x,z) = sqrt((x1 - z1)^2/sd1^2 + (x2 - z2)^2/sd2^2)
This is done automatically by knn.model.
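The same effect can be obtained by hand; a sketch using the Pima training data (Pima.tr from MASS is an assumption about which training set is in use):
# Standardized Euclidean distance: divide each squared difference by that
# predictor's variance, so units of measurement do not decide which predictor matters.
library(MASS)
d.std <- function(a, b, sds) sqrt(sum(((a - b) / sds)^2))
sds <- c(sd(Pima.tr$glu), sd(Pima.tr$ped))
d.std(c(Pima.tr$glu[1], Pima.tr$ped[1]),
      c(Pima.tr$glu[2], Pima.tr$ped[2]), sds)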
Second, nearest-neighbor is very robust to noise in the dataset; even the removal of random points has little effect on its performance. Compare this to logistic regression and classification trees, which can change substantially from the alteration of a single training example. On the other hand, nearest-neighbor is especially sensitive to irrelevant predictors, and doesn't perform well when some predictors are more important than others. This is the opposite of logistic regression and classification trees, which deliberately exclude irrelevant predictors. This problem can only be fixed by designing your own distance measure which downweights certain predictors.
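A custom distance of that kind is simple to write; a sketch with made-up weights (nothing here comes from the course library):
# Weighted Euclidean distance: small weights shrink the influence of
# predictors believed to be less relevant.
d.wt <- function(a, b, w) sqrt(sum(w * (a - b)^2))
w <- c(1, 1, 0.1)                      # downweight a third, mostly irrelevant predictor
d.wt(c(1, 2, 50), c(1.5, 2.5, -40), w)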
Nearest-neighbor is especially useful for domains where a distance measure between examples is straightforward to define but a model relating the predictors to the response is not.
Functions introduced in this lecture: