Visualization is more difficult when there are multiple predictors. Our technique will be to use pairwise scatterplots and residual scatterplots. There are also inference methods that let us quickly sort the relevant predictors from the irrelevant ones.
Our first example is a dataset of energy usage for CMU campus buildings, one row per building:

                      building.type   sq.ft kwh.elec mlbs.steam mcf.gas kwh.sq.ft lbs.sq.ft cf.sq.ft
Apartments, Houses    Residence      114320   598552         NA    1093      5.24        NA     0.95
407-409 Craig Street  Administrative  10118   247560         NA      95     24.47        NA     0.93
6555 Penn Ave         Administrative  99369   538120         NA    3668      5.42        NA     3.69
Alumni Hall           Administrative   8223   107446        783      37     13.07      9.52     0.44
Baker/Porter          Academic       227617  2842770      21778      NA     12.49      9.56       NA
...

We can learn a lot about this dataset just by looking at pairs of variables. Let's start with electricity consumption (kwh.elec) versus building size (sq.ft):
predict.plot(kwh.elec~sq.ft, x)
The raw plot is hard to read because building sizes span several orders of magnitude, so we take logs of both variables:

predict.plot(log(kwh.elec)~log(sq.ft), x)
> lm(log(kwh.elec)~log(sq.ft), x)

Coefficients:
(Intercept)   log(sq.ft)
      2.808        0.993

Taking the slope as 1, this means that kwh.elec = sq.ft*exp(2.808), or that the ratio kwh.elec/sq.ft is approximately 16.6 across campus.
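The implied ratio can be pulled out of the fit directly. A small sketch, assuming x is the buildings data frame loaded above:

fit.log <- lm(log(kwh.elec) ~ log(sq.ft), x)
exp(coef(fit.log)[1])               # exp(2.808), about 16.6 kWh per sq.ft
median(x$kwh.sq.ft, na.rm = TRUE)   # the per-building ratios, for comparison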
Now consider electricity usage per square foot (kwh.sq.ft), plotted against every other variable at once:

predict.plot(kwh.sq.ft ~ ., x)
Let's focus on the relationship with steam usage per square foot (lbs.sq.ft):

predict.plot(kwh.sq.ft~lbs.sq.ft,x)
> fit <- lm(kwh.sq.ft~lbs.sq.ft,x)
> fit

Call:
lm(formula = kwh.sq.ft ~ lbs.sq.ft, data = x)

Coefficients:
(Intercept)    lbs.sq.ft
     12.074        1.084

The slope is approximately 1, which is interesting since it implies that electricity efficiency moves together with steam efficiency. However, there is a fair bit of noise around the line; how can we be sure that this effect is real? One approach is to test against the null hypothesis that the slope is zero. The result of this test can be found in the output of summary(fit):
> summary(fit)

Call:
lm(formula = kwh.sq.ft ~ lbs.sq.ft, data = x)

Residuals:
    Min      1Q  Median      3Q     Max
-14.807  -9.328  -2.106   4.565  55.095

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  12.0740     6.2727   1.925   0.0667 .
lbs.sq.ft     1.0845     0.7494   1.447   0.1614
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 14.02 on 23 degrees of freedom
Multiple R-Squared: 0.08345,    Adjusted R-squared: 0.0436
F-statistic: 2.094 on 1 and 23 DF,  p-value: 0.1614

The p-value of 0.16 says that a slope estimate at least this large would occur 16% of the time under the null hypothesis, which isn't unlikely enough for us to say with confidence that the slope is nonzero.
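The same numbers can be pulled out programmatically, which is handy in scripts. A sketch using base R accessors:

coef(summary(fit))["lbs.sq.ft", ]   # estimate, std. error, t value, p-value
confint(fit, "lbs.sq.ft")           # a 95% interval; here it straddles zero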
predict.plot(kwh.sq.ft~lbs.sq.ft,x,identify=T)

Left-click to identify a point; right-click to exit. Clicking on the outlier reveals that it is Cyert Hall, home of Computing Services at CMU. It uses quite a lot of electricity relative to its size. Because we can explain this outlier, we are justified in removing it and doing a new analysis:
x2 <- x[setdiff(rownames(x),"Cyert Hall"),]
predict.plot(kwh.sq.ft~lbs.sq.ft,x2)
fit <- lm(kwh.sq.ft~lbs.sq.ft,x2)
summary(fit)

The summary now gives a p-value of 1%, so the relationship between lbs.sq.ft and kwh.sq.ft is statistically significant.
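Had an interactive graphics device not been available, the outlier could also have been located as the row with the largest residual. A sketch on the original data frame x:

fit0 <- lm(kwh.sq.ft ~ lbs.sq.ft, x)
names(which.max(abs(resid(fit0))))   # should print "Cyert Hall"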
The next example is R's built-in trees dataset, which records measurements of felled black cherry trees:

  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
...

For logging, it is useful to know how much lumber you will get from a tree of known girth and height. We start with a predict.plot:
data(trees)
predict.plot(Volume~.,trees)
> fit <- lm(Volume~.,trees)   # equivalently:
> fit <- lm(Volume~Girth+Height,trees)
> fit

Coefficients:
(Intercept)        Girth       Height
   -57.9877       4.7082       0.3393

To see how well we've done, we plot each predictor against the residuals of the fit. This is done by giving the linear model to predict.plot:
predict.plot(fit)
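Note that predict.plot is not part of base R; the same residual plots can be drawn by hand. A sketch:

fit <- lm(Volume ~ Girth + Height, trees)
par(mfrow = c(1, 2))   # two panels side by side
plot(trees$Girth,  resid(fit), xlab = "Girth",  ylab = "residual"); abline(h = 0)
plot(trees$Height, resid(fit), xlab = "Height", ylab = "residual"); abline(h = 0)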
One way to improve the fit is to transform the response. Since volume has units of length cubed, a natural choice is the cube root, which puts the response on the same scale as Girth and Height:

predict.plot(lm(Volume^(1/3) ~ Height + Girth,trees))
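For later comparison, the R-squared of this fit can be extracted directly (a sketch; per the discussion below it should come out around 97.8%):

summary(lm(Volume^(1/3) ~ Height + Girth, trees))$r.squared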
From geometry, you might argue that the correct relationship is Volume proportional to Height*Girth^2 (treating a tree as roughly a cylinder). In the log domain, this means log(Volume) = log(Height) + 2*log(Girth) + constant. Let's try this:
> lm(log(Volume) ~ log(Height) + log(Girth),trees)

Coefficients:
(Intercept)  log(Height)   log(Girth)
     -6.632        1.117        1.983

The estimated exponents are close to 1 and 2, so the geometry seems to be correct. Interestingly, the R-squared of this fit is 97.8%, the same as the cube-root fit above.
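We can go a step further and formally test the geometric exponents (1 for Height, 2 for Girth) against the unconstrained fit. A sketch using an offset to pin the exponents:

fit.free <- lm(log(Volume) ~ log(Height) + log(Girth), trees)
fit.geom <- lm(log(Volume) ~ offset(log(Height) + 2*log(Girth)), trees)  # only the constant is estimated
anova(fit.geom, fit.free)   # a large p-value means the data are consistent with the geometry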
Our final example is a dataset on the setting of cement:

  x1 x2 x3 x4     y
1  7 26  6 60  78.5
2  1 29 15 52  74.3
3 11 56  8 20 104.3
4 11 31  8 47  87.6
5  7 52  6 33  95.9
...

The response y is the heat evolved in setting cement, and the predictors are the proportions of active ingredients in the cement. The four proportions should add to 100, though in the data they usually sum to slightly less because of rounding. Here is the predict.plot:
predict.plot(y~.,cement)
> fit <- lm(y ~ x1 + x2 + x3 + x4, cement)
> summary(fit)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  62.4054    70.0710   0.891   0.3991
x1            1.5511     0.7448   2.083   0.0708 .
x2            0.5102     0.7238   0.705   0.5009
x3            0.1019     0.7547   0.135   0.8959
x4           -0.1441     0.7091  -0.203   0.8441
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-Squared: 0.9824,     Adjusted R-squared: 0.9736
F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-007

According to this summary, none of the predictors has a statistically significant effect on the response. How can this be, when the overall fit is excellent (R-squared of 98%)? The p-value for each predictor assumes that the other predictors are in the model, and the predictors are linearly related by the fact that they sum to 100. Therefore each predictor seems irrelevant when you already know the other three. This illustrates a danger in trying to interpret p-value summaries.
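The near-dependence is easy to verify directly (a sketch):

rowSums(cement[, c("x1", "x2", "x3", "x4")])   # every row is close to 100

Because each predictor is nearly a linear function of the other three, its coefficient is poorly determined when all four are in the model, which inflates the standard errors and hence the p-values.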
A more robust way to determine the relevant variables is to fit models with different subsets of variables included and see which one is best. In doing this, we have to be careful about the meaning of "best". If we just look at performance on the training data, via R-squared or the sum of squares, then we will always favor the model that includes all of the predictors. Just as in building a tree, more complex models always win on the training data. To truly decide among models we have to estimate what will happen on future data. Cross-validation could be used for this, but it is expensive. For linear regression, learning theory is a practical alternative.
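For least squares specifically, leave-one-out cross-validation happens to have a cheap closed form, the PRESS statistic, so a sketch of the cross-validation route looks like this:

# Leave-one-out CV error for a least-squares fit (PRESS statistic):
# the i-th leave-one-out residual is resid_i / (1 - h_i), where h_i is
# the i-th leverage (hat value).
press <- function(fit) sum((resid(fit) / (1 - hatvalues(fit)))^2)
press(lm(y ~ x1 + x2 + x3 + x4, cement))   # full model
press(lm(y ~ x1 + x2 + x4, cement))        # candidate submodel; smaller is better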
There are various refinements of the theory of linear regression (some of which I am developing), but basically everyone agrees that more coefficients allow you to overfit more. So one has to strike a balance between the fit on the training set (the sum of squares) and the number of coefficients. One criterion which embodies this tradeoff is the AIC. For a least-squares fit with p coefficients and residual sum of squares RSS on n data points, R computes it (up to an additive constant) as

AIC = n*log(RSS/n) + 2*p

More coefficients can reduce the RSS, but each one pays a penalty of 2. To find the model minimizing AIC, use the function step. It works by stepwise selection: predictors are tentatively dropped or added, the change being accepted if the AIC goes down. Here is the result on the cement data:
> fit <- lm(y ~ x1 + x2 + x3 + x4, cement)
> fit <- step(fit)
Start:  AIC= 26.94
 y ~ x1 + x2 + x3 + x4

       Df Sum of Sq    RSS    AIC
- x3    1     0.109 47.973 24.974
- x4    1     0.247 48.111 25.011
- x2    1     2.972 50.836 25.728
<none>              47.864 26.944
- x1    1    25.951 73.815 30.576

Step:  AIC= 24.97
 y ~ x1 + x2 + x4

       Df Sum of Sq    RSS    AIC
<none>               47.97  24.97
- x4    1      9.93   57.90  25.42
- x2    1     26.79   74.76  28.74
- x1    1    820.91  868.88  60.63

In the first iteration, it tries to drop each predictor in turn. Notice how the RSS (sum of squares) always increases when a predictor is dropped. But dropping x3 increases the RSS only slightly, and the net effect is that the AIC drops by about two, since there is one coefficient less. This change is therefore accepted and we arrive at the model y ~ x1 + x2 + x4. Dropping any further predictor now increases the AIC, so the procedure stops.
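As a check on the formula above, the starting AIC can be reproduced by hand (a sketch; extractAIC is the function step uses internally):

fit <- lm(y ~ x1 + x2 + x3 + x4, cement)
rss <- sum(resid(fit)^2)                     # 47.864, the RSS of the full model
n <- nrow(cement); p <- length(coef(fit))    # n = 13, p = 5 (including the intercept)
n*log(rss/n) + 2*p                           # 26.94, matching "Start: AIC= 26.94"
extractAIC(fit)                              # same value, as computed by step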