Day 30 - Multiple regression with interactions

So far we have been assuming that the predictors are additive in producing the response. But as we saw last week, this is a strong assumption. Today we will learn how to diagnose and visualize interactions between numerical predictors. This allows us to produce detailed analyses of realistic datasets.

Simple example

Let's see how well the methods we have so far deal with a non-additive dataset. We'll construct a dataset where the response y has a bilinear term x1*x2:
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- x1 + x2 + x1*x2   # no noise, so a model with the bilinear term will fit exactly
predict.plot(y~x1+x2)

From the pairwise plots, the relationships seem linear enough. One anomaly is that the response variance is non-constant, but it doesn't look like a transformation would help. Let's look at residuals:
fit <- lm(y~x1+x2)
predict.plot(fit)

The residuals clearly show the changing variance, but otherwise there is no non-linearity in the pairwise plots. We can try plotting partial residuals instead. The partial residual for a predictor is the difference between the actual response and the fitted response based on all predictors except that one. This lets you see what each predictor is doing on its own.
predict.plot(fit,partial=T)

Each partial residual is very linear in its predictor, so again we can't see any problems. This shows that the methods we have so far cannot diagnose interaction (non-additivity) between the predictors. We need new methods.
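To make the definition concrete, here is one way to compute a partial residual by hand from the additive fit (a sketch of the idea; predict.plot's exact computation may differ):
r1 <- residuals(fit) + coef(fit)["x1"]*x1   # residuals plus the fitted x1 term
plot(x1, r1)                                # should resemble the x1 panel of the partial residual plot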

Finding interaction

To find interactions, start by adding interaction terms to the regression, so that the model is
y = a + b1*x1 + b2*x2 + b12*x1*x2
Typically one uses bilinear terms since bilinearity is a common type of interaction and other types of interaction often have a bilinear component. In R, you add bilinear terms to a linear model via the ":" notation:
> fit2 <- lm(y ~ x1 + x2 + x1:x2)
> fit2
Coefficients:
(Intercept)            x1           x2           x1:x2  
          0            1            1            1  
> summary(fit2)
Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -4.441e-16  1.536e-16 -2.890e+00  0.00476 ** 
x1           1.000e+00  2.782e-16  3.595e+15  < 2e-16 ***
x2           1.000e+00  2.670e-16  3.746e+15  < 2e-16 ***
x1:x2        1.000e+00  4.724e-16  2.117e+15  < 2e-16 ***
The coefficient on the bilinear term is 1, which matches how we constructed the data. The summary says that the bilinear term is statistically significant, so we have detected an interaction.
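As an aside on R's formula notation, x1*x2 is shorthand for x1 + x2 + x1:x2, so the same model can be written more compactly:
fit2 <- lm(y ~ x1*x2)   # "*" expands to main effects plus the interaction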

To visualize the interaction and see whether it is bilinear, we use a profile plot, just like last week. The difference is that here the predictors are numerical, so we need to discretize one of them into bins. predict.plot will do this for you if you use the conditional notation "|":

predict.plot(y ~ x1 | x2)

This is the same as predict.plot(y~x1), except the x2 predictor has been divided into four ranges which are used to color the points. It is what we call a "raw" profile plot. We can see the interaction from the fact that the lines are not shifts of each other but have different slopes. To see it more clearly, we can use partial residuals from the additive fit:
predict.plot(y ~ x1 | x2, partial=fit)

By removing the x2 effect, we bring the lines closer together. This plot confirms that the interaction is bilinear: the profiles are intersecting lines. Note that we passed in the original fit, without the interaction term. Otherwise the interaction we are trying to see would have already been subtracted out.

In this example, the predictors interact even though they are uncorrelated with each other. Correlation and interaction are separate things. Correlation is a property of two random variables while interaction is a property of three (two predictors and a response).
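You can check the lack of correlation directly; since x1 and x2 were drawn independently, the sample correlation should be near zero:
cor(x1, x2)   # close to zero, yet the interaction above is strong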

General method

Putting all of this together gives a general method for doing regression analysis on any dataset (a code sketch follows the list):
  1. Make a predict.plot of the response versus all predictors.
  2. Transform variables to make the relationships linear.
  3. Fit a linear model with no interaction terms. Use the residuals (or partial residuals) from this model to suggest additional transformations.
  4. Drop irrelevant predictors. You can spot them by using partial residuals or summary. An automatic method is step.
  5. Test for interactions by adding bilinear terms to the model. If one shows up as significant, then you have found an interaction. An automatic method for this is step.up.
  6. Visualize and describe the interactions via profile plots.
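In code form, the procedure looks roughly like this for a data frame x with response y (a sketch only; steps 2 and 6 depend on what the plots show):
predict.plot(y~.,x)           # 1. response versus all predictors
# 2. transform variables in x as suggested by the plots
fit <- lm(y~.,x)              # 3. additive fit
predict.plot(fit,partial=T)   #    partial residuals may suggest more transformations
fit <- step(fit)              # 4. drop irrelevant predictors
fit2 <- step.up(fit)          # 5. try adding bilinear terms
summary(fit2)                 #    which interactions are significant?
# 6. profile plots, e.g. predict.plot(y ~ x1 | x2, x, partial=fit)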
This procedure is illustrated on the following two datasets.

Puromycin treatment

An experiment is conducted to determine the effect of Puromycin on accelerating a reaction in a cell. The substrate concentration and the presence of Puromycin are varied, and the reaction rate is observed. The resulting dataset is:
  conc rate   state
1 0.02   76 treated
2 0.02   47 treated
3 0.06   97 treated
4 0.06  107 treated
5 0.11  123 treated
...
We want to know how rate depends on the other two variables. Start with a predict.plot:
predict.plot(rate~.,Puromycin)

A transformation is called for here. The bulge rule tells us we could transform rate upwards or conc downwards. From the distribution of the data it looks like transforming conc makes more sense. Here is the result:
x <- Puromycin
x$conc <- log(x$conc)
predict.plot(rate~.,x)

Now the relationships are linear enough to try fitting a model:
fit <- lm(rate~.,x)
predict.plot(fit)
predict.plot(fit,partial=T)
The model fits the predictors well (plots not shown). Now try adding interaction terms:
> fit2 <- step.up(fit)
> summary(fit2)

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          209.194      4.453  46.974  < 2e-16 ***
conc                  37.110      1.968  18.858 9.24e-14 ***
stateuntreated       -44.606      6.811  -6.549 2.85e-06 ***
conc:stateuntreated  -10.128      2.937  -3.448  0.00269 ** 
step.up has added an interaction term, and it is significant. Now plot it:
predict.plot(rate~conc|state,x)
predict.plot(rate~conc|state,x,partial=fit)

There is an interaction because the slopes are not the same (this is especially evident in the partial residuals). Our conclusion is that Puromycin treatment increases the reaction rate, especially at high concentrations of substrate.

A likely scenario is that treatment multiplies the rate, which means that conc and state are additive in predicting log(rate). We can verify this with a single plot:

predict.plot(log(rate)~conc|state,x)

The curves are parallel, so the hypothesis seems to be correct.
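A quick numerical check is to fit the interaction model on the log scale and see whether the conc:state term is still significant; if the multiplicative story is right, it should be much weaker:
summary(lm(log(rate) ~ conc*state, x))   # look at the conc:state row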

Boston housing

Consider the Boston housing dataset from day11 and homework 7. We want to predict medv, the median house value. Here is the predict.plot:

The variables in most need of transformation are crim and lstat. A log transformation works well for both of them.
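A minimal version of this setup, assuming the Boston data frame from the MASS package (the day11 version of the data may differ slightly):
library(MASS)            # provides the Boston data frame
x <- Boston
x$crim <- log(x$crim)    # log-transform crim
x$lstat <- log(x$lstat)  # log-transform lstat
Now we fit a linear model and plot the partial residuals: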
fit <- lm(medv~.,x)
predict.plot(fit,partial=T)

Running step gets rid of crim and indus, which from the plot are clearly irrelevant.
> fit2 <- step(fit)
Step:  AIC= 1492.58 
 medv ~ chas + nox + rm + age + dis + rad + tax + ptratio + black +  
    lstat 

          Df Sum of Sq     RSS     AIC
<none>                  9254.2  1492.6
- age      1      73.1  9327.3  1494.6
- chas     1     167.7  9421.9  1499.7
- tax      1     190.3  9444.5  1500.9
- black    1     255.5  9509.6  1504.4
- rad      1     291.6  9545.8  1506.3
- nox      1     390.8  9645.0  1511.5
- rm       1     784.2 10038.3  1531.7
- dis      1     798.6 10052.7  1532.5
- ptratio  1    1209.3 10463.5  1552.7
- lstat    1    4915.8 14169.9  1706.2
The other variables, according to summary, have a nonzero effect, but in many cases the effect is small. Take a look at the sum-of-squares listing from step: the four variables (rm, dis, ptratio, lstat) are much more important than the rest. Let's look for interactions only among these variables:
> fit2 <- lm(medv~lstat+rm+dis+ptratio,x)
> fit3 <- step.up(fit2)
> summary(fit3)
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -227.5485    49.2363  -4.622 4.87e-06 ***
lstat              50.8553    20.3184   2.503 0.012639 *  
rm                 38.1245     7.0987   5.371 1.21e-07 ***
dis               -16.8163     2.9174  -5.764 1.45e-08 ***
ptratio            14.9592     2.5847   5.788 1.27e-08 ***
lstat:rm           -6.8143     3.1209  -2.183 0.029475 *  
lstat:dis           4.8736     1.3940   3.496 0.000514 ***
lstat:ptratio      -3.3209     1.0345  -3.210 0.001412 ** 
rm:dis              2.0295     0.4435   4.576 5.99e-06 ***
rm:ptratio         -1.9911     0.3757  -5.299 1.76e-07 ***
lstat:rm:dis       -0.5216     0.2242  -2.327 0.020364 *  
lstat:rm:ptratio    0.3368     0.1588   2.121 0.034423 *  
Many statistically significant interactions are found. Some of these we already know.
predict.plot(medv~rm|lstat,x,pch=".",partial=fit)

The interaction is that rm is only important when lstat is low. Surprisingly, rm is essentially irrelevant, perhaps even has a negative effect, when lstat is high. These are partial residuals, so we have taken into account the other variables (but not their interactions). The interaction between rm and lstat is also visible in the original response (without partial=fit). Now try dis and lstat:
predict.plot(medv~dis|lstat,x,pch=".",partial=fit2)

High lstat makes large dis good (increases the price), while low lstat makes large dis bad (decreases the price). This is counterintuitive. If we consider the partial residual with respect to all variables, not just the top four, then the interaction disappears:
predict.plot(medv~dis|lstat,x,pch=".",partial=fit)

One of the less important variables must be causing this effect. I leave it as an exercise to find out what that variable is.
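One way to investigate, without giving the answer away, is to add the remaining variables back into fit2 one at a time and redo the profile plot, watching for the addition that makes the interaction vanish. A sketch, assuming predict.plot accepts any of these fits through partial= as above:
others <- setdiff(names(x), c("medv","lstat","rm","dis","ptratio"))
for (v in others) {
  f <- update(fit2, as.formula(paste(". ~ . +", v)))
  predict.plot(medv ~ dis | lstat, x, pch=".", partial=f)
}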

Code

Functions introduced in this lecture:
Tom Minka
Last modified: Mon Aug 22 16:37:29 GMT 2005