Day 30 - Multiple regression with interactions
So far we have been assuming that the predictors are additive in producing
the response. But as we saw last week, this is a strong assumption. Today
we will learn how to diagnose and visualize interactions between numerical
predictors. This allows us to produce detailed analyses of realistic
datasets.
Simple example
Let's see how well the methods we have so far deal with a non-additive
dataset. We'll construct a dataset where the response y has a
bilinear term x1*x2:
n <- 100
x1 <- runif(n)          # two independent uniform predictors
x2 <- runif(n)
y <- x1 + x2 + x1*x2    # response with a bilinear term (no noise added)
predict.plot(y~x1+x2)
From the pairwise plots, the relationships seem linear enough.
One anomaly is that the response variance is non-constant, but it doesn't
look like a transformation would help. Let's look at residuals:
fit <- lm(y~x1+x2)
predict.plot(fit)
The residuals clearly show the changing variance but otherwise there is no
non-linearity in the pairwise plots. We can try plotting partial residuals
instead. Partial residuals are the difference between the actual response
and the expected response based on all predictors except one.
This lets you see what each predictor is doing.
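For concreteness, here is roughly how the partial residual for x1
could be computed by hand from the additive fit (a sketch; predict.plot
handles this internally and its details may differ):
# subtract the fitted contribution of every predictor except x1
pres.x1 <- residuals(fit) + coef(fit)["x1"] * x1
plot(x1, pres.x1)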
predict.plot(fit,partial=T)
Each partial residual is very linear in its predictor, so again we
can't see any problems.
This shows that the methods we have so far cannot detect interaction
(non-additivity) between the predictors. We need new methods.
Finding interaction
To find interactions, start by adding interaction terms to the regression,
so that the model is
y = a + b1*x1 + b2*x2 + b12*x1*x2
Typically one uses bilinear terms since bilinearity is a common type of
interaction and other types of interaction often have a
bilinear component. In R, you add bilinear terms to a linear model via the
":" notation:
> fit2 <- lm(y ~ x1 + x2 + x1:x2)
> fit2
Coefficients:
(Intercept) x1 x2 x1:x2
0 1 1 1
> summary(fit2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.441e-16 1.536e-16 -2.890e+00 0.00476 **
x1 1.000e+00 2.782e-16 3.595e+15 < 2e-16 ***
x2 1.000e+00 2.670e-16 3.746e+15 < 2e-16 ***
x1:x2 1.000e+00 4.724e-16 2.117e+15 < 2e-16 ***
The coefficient on the bilinear term is 1, which matches how we constructed
the data. The summary says that the bilinear term is statistically
significant, so we have detected an interaction.
To visualize the interaction and see whether or not it is bilinear, we use
a profile plot, just like last week. The difference is that here the
predictors are numerical, so we need to discretize one of them into bins.
predict.plot will do this for you if you use the
conditional notation "|":
predict.plot(y ~ x1 | x2)
This is the same as predict.plot(y~x1), except
the x2 predictor has been divided into four ranges which are
used to color the points. It is what we call a "raw" profile plot.
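For intuition, here is roughly how such a raw profile plot could be built
by hand in base R (a sketch; predict.plot's actual binning and coloring may
differ):
# cut x2 into four quartile bins and fit a line within each bin
bins <- cut(x2, quantile(x2, 0:4/4), include.lowest=TRUE)
plot(x1, y, col=as.integer(bins), pch=16)
for (b in 1:4) {
  i <- (as.integer(bins) == b)
  abline(lm(y[i] ~ x1[i]), col=b)
}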
We can see the interaction from the fact that the lines are not shifts of
each other but have different slopes. To see it more clearly, we can use
partial residuals from the additive fit:
predict.plot(y ~ x1 | x2, partial=fit)
By removing the x2 effect, we bring the lines closer together.
This plot confirms that the interaction is bilinear: the profiles are
intersecting lines.
Note that we passed in the original fit, without the interaction
term. Otherwise the interaction we are trying to see would have already
been subtracted out.
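Another way to see the fitted interaction is to plot the predictions of
fit2 against x1 at a few fixed values of x2; with a bilinear term the
lines fan out instead of staying parallel. A base R sketch:
x1.grid <- seq(0, 1, length.out=50)
plot(x1, y, col="gray")
# one fitted line for each of a few representative x2 values
for (v in quantile(x2, c(0.1, 0.5, 0.9))) {
  lines(x1.grid, predict(fit2, data.frame(x1=x1.grid, x2=v)))
}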
In this example, the predictors interact even though they are uncorrelated
with each other. Correlation and interaction are separate things.
Correlation is a property of two random variables while interaction is a
property of three (two predictors and a response).
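You can check the lack of correlation directly in the simulated data:
cor(x1, x2)   # should be small, since x1 and x2 were drawn independently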
General method
Putting all of this together gives a general method for doing regression
analysis on any dataset:
- Make a predict.plot of the response versus all predictors.
- Transform variables to make the relationships linear.
- Fit a linear model with no interaction terms.
Use the residuals (or partial residuals) from this model to suggest
additional transformations.
- Drop irrelevant predictors. You can spot them by using partial
residuals or summary. An automatic method is step.
- Test for interactions by adding bilinear terms to the model.
If one shows up as significant, then you have found an interaction.
An automatic method for this is step.up.
- Visualize and describe the interactions via profile plots.
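Schematically, for a hypothetical data frame x with response y, the
workflow looks like this (a sketch; predict.plot and step.up come from the
course code):
predict.plot(y ~ ., x)         # look at the response versus all predictors
# ... transform variables in x as suggested by the plots ...
fit <- lm(y ~ ., x)            # additive fit
predict.plot(fit, partial=T)   # check residuals / partial residuals
fit <- step(fit)               # drop irrelevant predictors
fit2 <- step.up(fit)           # add significant interaction terms
summary(fit2)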
This procedure is illustrated on the following two datasets.
Puromycin treatment
An experiment is conducted to determine the effect of Puromycin on
accelerating a reaction in a cell. The substrate concentration and the
presence of Puromycin are varied, and the reaction rate is observed. The
resulting dataset is:
conc rate state
1 0.02 76 treated
2 0.02 47 treated
3 0.06 97 treated
4 0.06 107 treated
5 0.11 123 treated
...
We want to know how rate depends on the other two variables.
Start with a predict.plot:
predict.plot(rate~.,Puromycin)
A transformation is called for here. The bulge rule tells us we could
transform rate upwards or conc downwards. From the
distribution of the data it looks like transforming conc makes
more sense. Here is the result:
x <- Puromycin
x$conc <- log(x$conc)
predict.plot(rate~.,x)
Now the relationships are linear enough to try fitting a model:
fit <- lm(rate~.,x)
predict.plot(fit)
predict.plot(fit,partial=T)
The fit looks good against each predictor (plots not shown). Now try adding
interaction terms:
> fit2 <- step.up(fit)
> summary(fit2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 209.194 4.453 46.974 < 2e-16 ***
conc 37.110 1.968 18.858 9.24e-14 ***
stateuntreated -44.606 6.811 -6.549 2.85e-06 ***
conc:stateuntreated -10.128 2.937 -3.448 0.00269 **
step.up has added an interaction term, and it is significant.
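For reference, the same model can be written directly using R's "*"
operator, which expands to the main effects plus their interaction:
lm(rate ~ conc * state, x)   # equivalent to conc + state + conc:state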
Now plot it:
predict.plot(rate~conc|state,x)
predict.plot(rate~conc|state,x,partial=fit)
There is an interaction because the slopes are not the same (this is
especially evident in the partial residuals).
Our conclusion is that Puromycin treatment increases the reaction rate,
especially at high concentrations of substrate.
A likely scenario is that treatment multiplies the rate, which means that
conc and state are additive in predicting
log(rate). We can verify this with a single plot:
predict.plot(log(rate)~conc|state,x)
The curves are parallel, so the hypothesis seems to be correct.
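We can also check this numerically by comparing the additive and
interaction models on the log scale (a sketch); if the multiplicative story
is right, the interaction term should no longer be clearly significant:
lfit <- lm(log(rate) ~ conc + state, x)
lfit2 <- lm(log(rate) ~ conc * state, x)
anova(lfit, lfit2)   # F test for the conc:state interaction on the log scale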
Boston housing
Consider the Boston housing dataset from day11
and homework 7.
We want to predict medv, the median house value.
As usual, start with a predict.plot of medv against all the other
variables (plot not shown).
The variables in most need of transformation are crim and
lstat. A log transformation works well for both of them.
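A sketch of this step, assuming the Boston data frame from the MASS
package (the course's copy from day 11 may differ slightly):
library(MASS)
x <- Boston
x$crim <- log(x$crim)      # log-transform the heavily skewed predictors
x$lstat <- log(x$lstat)
predict.plot(medv ~ ., x)  # re-check the relationships after transforming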
Now we fit a linear model and plot the partial residuals:
fit <- lm(medv~.,x)
predict.plot(fit,partial=T)
Running step gets rid of crim and
indus, which from the plot are clearly irrelevant.
> fit2 <- step(fit)
Step: AIC= 1492.58
medv ~ chas + nox + rm + age + dis + rad + tax + ptratio + black +
lstat
Df Sum of Sq RSS AIC
<none> 9254.2 1492.6
- age 1 73.1 9327.3 1494.6
- chas 1 167.7 9421.9 1499.7
- tax 1 190.3 9444.5 1500.9
- black 1 255.5 9509.6 1504.4
- rad 1 291.6 9545.8 1506.3
- nox 1 390.8 9645.0 1511.5
- rm 1 784.2 10038.3 1531.7
- dis 1 798.6 10052.7 1532.5
- ptratio 1 1209.3 10463.5 1552.7
- lstat 1 4915.8 14169.9 1706.2
The other variables, according to summary, have a nonzero effect.
But in many cases this effect is small. Take a look at the sum of squares
listing from step. The four variables
(rm,dis,ptratio,lstat) are much more important than the rest.
Let's look for interactions only among these variables:
> fit2 <- lm(medv~lstat+rm+dis+ptratio,x)
> fit3 <- step.up(fit2)
> summary(fit3)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -227.5485 49.2363 -4.622 4.87e-06 ***
lstat 50.8553 20.3184 2.503 0.012639 *
rm 38.1245 7.0987 5.371 1.21e-07 ***
dis -16.8163 2.9174 -5.764 1.45e-08 ***
ptratio 14.9592 2.5847 5.788 1.27e-08 ***
lstat:rm -6.8143 3.1209 -2.183 0.029475 *
lstat:dis 4.8736 1.3940 3.496 0.000514 ***
lstat:ptratio -3.3209 1.0345 -3.210 0.001412 **
rm:dis 2.0295 0.4435 4.576 5.99e-06 ***
rm:ptratio -1.9911 0.3757 -5.299 1.76e-07 ***
lstat:rm:dis -0.5216 0.2242 -2.327 0.020364 *
lstat:rm:ptratio 0.3368 0.1588 2.121 0.034423 *
Many statistically significant interactions are found, some of which we
already know about. Let's visualize the interaction between rm and lstat:
predict.plot(medv~rm|lstat,x,pch=".",partial=fit)
The interaction is that rm is only important when
lstat is low. Surprisingly, when lstat is high,
rm is essentially irrelevant and may even have a negative effect.
These are partial residuals, so we have taken into account the other
variables (but not their interactions).
The interaction between rm and lstat
is also visible in the original response (without partial=fit).
Now try dis and lstat:
predict.plot(medv~dis|lstat,x,pch=".",partial=fit2)
High lstat makes large dis good (increases the price),
while low lstat makes large dis bad (decreases the
price).
This is counterintuitive. If we consider the partial residual with
respect to all variables, not just the top four, then the interaction
disappears:
predict.plot(medv~dis|lstat,x,pch=".",partial=fit)
One of the less important variables must be causing this effect.
I leave it as an exercise to find out what that variable is.
Code
Functions introduced in this lecture:
- step.up(fit)
- predict.plot(fit,partial=T)
- predict.plot(y~x1|x2)
- predict.plot(y~x1|x2,partial=fit)
Tom Minka
Last modified: Mon Aug 22 16:37:29 GMT 2005