Problem 3.1

Since each observation contributes equally (with weight 1/n) to the estimation of the OLS regression line (see the normal equations used to solve the OLS problem), outlying values may distort the slope of the fitted linear trend. If we instead minimize the sum of absolute vertical deviations, which is known as L1 regression, the contribution of each outlier is reduced.
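
As a minimal sketch of the difference between the two criteria (on toy data distinct from the example that follows; all object names are only illustrative), the L1 fit can be obtained by minimizing the sum of absolute residuals numerically and compared with the OLS solution:

## L2 (OLS) versus L1 (LAD) criteria on toy data -- illustrative only
set.seed(101)
x0 <- rnorm(50)
y0 <- 1 + 2*x0 + rexp(50, 1)           # same kind of positive outliers as below
l2.fit <- coef(lm(y0 ~ x0))            # closed-form OLS solution
l1.crit <- function(b) sum(abs(y0 - b[1] - b[2]*x0))
l1.fit <- optim(l2.fit, l1.crit)$par   # numerical minimisation of the L1 criterion
rbind(OLS = l2.fit, LAD = l1.fit)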

The following example should help compare the two approaches. We create artificial outliers by adding random exponential deviates to the y values. The rate parameter of the exponential distribution is fixed at 1, so the resulting outlying values are all on the positive side.

> n <- 100
> x <- rnorm(n)
> y <- 2*x+rexp(n,1)
> plot(x,y)
> points(x[x<1 & y>5],y[x<1 & y>5],pch=19)
> ols <- lm(y~x)
> abline(ols)
> summary(ols)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9057 -0.7418 -0.3131  0.4525  6.3636

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.9092     0.1106   8.217  8.7e-13 ***
x             1.9970     0.1008  19.812  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.091 on 98 degrees of freedom
Multiple R-Squared: 0.8002,     Adjusted R-squared: 0.7982
F-statistic: 392.5 on 1 and 98 DF,  p-value: < 2.2e-16
> library(quantreg)
> lad <- rq(y~x)
> abline(lad$coefficients[1],lad$coefficients[2],col=2)
> summary(lad)
Call: rq(formula = y ~ x)

tau: [1] 0.5

Coefficients:
            coefficients lower bd upper bd
(Intercept) 0.63507      0.45152  0.76212
x           2.08776      1.77945  2.20147

The OLS regression line is plotted together with the scatterplot in [Fig. 1]. Two influential points are highlighted in this figure with filled symbols. Various diagnostic plots, produced with plot(ols), are reported in [Fig. 2]. Some observations are flagged as influential (e.g. obs. 12, 13 and 32) on the basis of their residuals or Cook's distance; two of them were already identified as outliers. The OLS slope estimate (`beta_1`) is 1.997, while the LAD slope estimate is slightly larger (2.088). Comparing the intercepts, which in our case are directly affected by the random exponential deviates added to y, the OLS estimate (0.909) is larger than the LAD one (`beta_0`=0.635). This reflects the upward shift of the OLS predicted values toward the positive bias induced by the outliers.
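
The influential observations flagged by the diagnostic plots can also be listed directly from the fitted model; the 4/n cutoff used below is only a common rule of thumb:

## observations with a Cook's distance above the usual 4/n rule of thumb
cd <- cooks.distance(ols)
which(cd > 4/length(cd))
sort(cd, decreasing = TRUE)[1:5]   # the five most influential points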

The LAD technique builds on the work of Koenker and colleagues, e.g. [1] and [2]. There is also a more recent article by Koenker and Hallock entitled Quantile Regression (a local copy is provided here: QRJEP.pdf), published in the Journal of Economic Perspectives (15(4): 143-156, 2001). R provides tools for quantile regression in the quantreg package. By default the median is used (tau=0.5), and this corresponds to Least Absolute Deviation regression, since minimizing the L1 norm is equivalent to fitting the 0.5 quantile.
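
Since rq() accepts a vector of quantiles, the same call can be used to look beyond the median; the sketch below (the choice of quartiles is arbitrary) fits the first and third quartiles alongside the median:

## quantile regression at several quantiles; tau = 0.5 reproduces the LAD fit
library(quantreg)
qr.fits <- rq(y ~ x, tau = c(0.25, 0.5, 0.75))
coef(qr.fits)            # one column of coefficients per value of tau
plot(summary(qr.fits))   # slope and intercept as a function of tau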

plot31.png
Figure 1: Effect of different regression techniques on estimated slope.

The code below illustrates two other commonly used robust regression techniques, namely Huber regression and Least Trimmed Squares (LTS) regression. Huber robust regression is a compromise between OLS and LAD regression: instead of a constant weight (OLS) or a weight inversely proportional to the absolute residual (LAD), one can adopt a mixed approach in which the residuals are weighted according to an estimate of their scale. LTS regression, on the other hand, is a resistant regression technique which we will not detail here.
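
The Huber weighting scheme can be visualised with the psi.huber() function from the MASS package (k = 1.345 is the default tuning constant used by rlm()): residuals smaller than k in absolute value receive full weight, larger ones are progressively downweighted.

## Huber weights as a function of the scaled residual
library(MASS)
u <- seq(-6, 6, length = 200)
plot(u, psi.huber(u, k = 1.345), type = "l",
     xlab = "scaled residual", ylab = "weight",
     main = "Huber weight function (k = 1.345)")
abline(v = c(-1.345, 1.345), lty = 2)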

J. Faraway covers these topics in greater detail in his on-line handbook on regression (Chapter 13). G. V. Farnsworth also wrote a tutorial on doing Econometrics in R, which contains material relevant to quantile and robust regression techniques (Chapter 4).

> library(MASS)
> hm <- rlm(y~x)
> abline(hm$coefficients[1],hm$coefficients[2],col=3)
> summary(hm)
Call: rlm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max
-0.7872 -0.6093 -0.1838  0.5701  6.5004

Coefficients:
            Value   Std. Error t value
(Intercept)  0.7757  0.0756    10.2635
x            2.0105  0.0689    29.2002

Residual standard error: 0.8644 on 98 degrees of freedom
> #library(lqs)  # library lqs is now included in the MASS package
> lts <- ltsreg(y~x)
> abline(lts$coefficients[1],lts$coefficients[2],col=4)
> legend.text <-
+ c(paste("OLS",round(ols$coefficients[2],3)),paste("LAD",round(lad$coefficients[2],3)),
+ paste("Huber",round(hm$coefficients[2],3)),paste("LTS",round(lts$coefficients[2],3)))
> legend("topleft",legend.text,lty=1,col=1:4)
> title(main="Regression lines and Slopes estimates")

As can be seen from [Fig. 1], the Huber method leads to intermediate results: its regression line lies between the OLS and LAD fits, in particular for low to intermediate values of x, where the exponential deviates added to the y values exert their greatest influence. By contrast, LTS fails to recover the main association between X and Y: its regression line lies far below the principal axis of the bivariate cloud. This is partly because there are relatively few outliers compared to the sample size, and in this situation LTS, which bases its fit on only a subset of the observations, is clearly outperformed by the other methods.
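
We can also check that the Huber fit actually downweights the observations flagged above by inspecting the final IWLS weights returned by rlm() (the 0.9 cutoff is arbitrary, chosen only to list the clearly downweighted points):

## final IWLS weights from the Huber fit: 1 for well-fitted points,
## smaller values for downweighted (outlying) observations
round(sort(hm$w)[1:10], 3)      # the ten smallest weights
which(hm$w < 0.9)               # indices of the clearly downweighted points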

plot31bis.png
Figure 2: Diagnostic plots for the OLS regression.

Problem 3.3

Recall that the slope coefficient estimate `beta_1` in a simple linear model is related to the sample Pearson correlation coefficient r in the following way:

`beta_1 = r * (S_y / S_x)`

where S_x and S_y stand for the standard deviations of X and Y.

It is easily seen from this formula that the regression coefficient equals the correlation coefficient multiplied by a constant, namely the ratio of the Y and X standard deviations.

The following example illustrates that knowing the correlation between two continuous variables, together with their standard deviations, allows one to estimate the slope of a linear regression without explicitly fitting the regression model (the fit is shown at the end of the code snippet for comparison).

> n <- 100
> true.sdx <- 2.5
> true.slope <- 3
> x <- rnorm(n,23,true.sdx)
> y <- true.slope*x+rnorm(n)
> sdx <- sd(x)
> sdy <- sd(y)
> cor.xy <- cor(x,y)
> cor.xy*((n-1)/n*sdy)/((n-1)/n*sdx)
[1] 2.962331
> cor.xy*sdy/sdx
[1] 2.962331
> coef(lm(y~x))[2]
       x
2.962331
> plot(x,y)
plot33.png
Figure 3: Plot of outcome y versus predictor x (the slope is estimated to be 2.96).

Problem 3.4

If the relationship between y and x is quadratic, the Pearson correlation is obviously not a good way to summarize the trend between the two variables, since it is a measure of linear association only. In particular, if E[y|x]=x^2, the curve describing the variation of y as a function of x looks like a parabola [Fig. 4]. For a given value of y, the corresponding x values can then be both positive and negative (provided x is centred on 0, which is the case here since the x values lie uniformly in the [-10,10] interval), so that positive and negative deviations cancel out in the correlation.

The following piece of code demonstrates the intuitive idea developed above.

> n <- 100
> x <- runif(n,-10,10)
> expected.y <- x^2
> y <- x^2+rnorm(n)
> cor(x,y)
[1] -0.06221294
> plot(x,y)
plot34.png
Figure 4: Plot of y versus x, where the xi are uniform deviates and E[y|x]=x^2.

As can be seen, the correlation is close to zero. We can run a small simulation experiment in which we replicate the above code, say 100 times. Here is a short function doing the main computation:

sim <- function(n=100) {
    x <- runif(n,-10,10)
    y <- x^2+rnorm(n)
    return(cor(x,y))
}

Now, we can check the distribution of correlation estimate for 100 random samples as follows:

> set.seed(12345)
> res <- replicate(100,sim())
> summary(res)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.27970 -0.08975 -0.02419 -0.01665  0.07211  0.31570
> hist(res)

Estimated correlations range from -0.28 to 0.32 and the average correlation is -0.02. We can thus be fairly confident that the expected value is zero. The distribution of the statistic is illustrated in [Fig. 5].

plot34bis.png
Figure 5: Variation of Pearson correlation for 100 random samples.
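
By contrast, once the quadratic form of the relationship is taken into account the association is very strong; a quick check, reusing the x and y simulated above, is:

## correlation between y and x^2 is close to 1, and a quadratic fit
## explains nearly all of the variance in y
cor(x^2, y)
summary(lm(y ~ x + I(x^2)))$r.squared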

This is reminiscent of Anscombe's [3] data, where a slight rearrangement of the (x,y) values leaves the correlation unchanged but gives a drastically different picture of the relationship [Fig. 6].

> data(anscombe)
> var.names <- colnames(anscombe)
> par(mfrow=c(2,2))
> for (i in 1:4)
+ plot(anscombe[,i],anscombe[,i+4],xlab=var.names[i],ylab=var.names[i+4])
plot34ter.png
Figure 6: Anscombe's dataset: All correlations are 0.816.

Case 1 corresponds to a quite reasonable hypothesis of a linear relationship between the two variables. Case 2 corresponds to the quadratic situation discussed earlier. Case 3 illustrates the influence of a single outlier on the slope estimate of an otherwise perfect linear association. Case 4 suggests the absence of any real influence of x on y, since different values of y are observed for observations sharing the same x value, the apparent slope being driven by a single extreme point.
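
Anscombe's point can also be verified numerically by fitting the four regressions in a loop (a small sketch, reusing the anscombe data frame loaded above); the coefficients and correlations are essentially identical across the four cases:

## near-identical OLS fits for the four Anscombe datasets
fits <- lapply(1:4, function(i) lm(anscombe[, i+4] ~ anscombe[, i]))
sapply(fits, coef)                       # intercepts ~3.0, slopes ~0.5
sapply(1:4, function(i) cor(anscombe[, i], anscombe[, i+4]))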

Problem 3.5

First, we set up the data:

> HIV <- matrix(c(3,2,4,22),nrow=2)
> dimnames(HIV) <- list(c("Cases","Noncases"),c("Exposed","Unexposed"))
> HIV
         Exposed Unexposed
Cases          3         4
Noncases       2        22
> apply(HIV,1,sum)
   Cases Noncases
       7       24
> apply(HIV,2,sum)
  Exposed Unexposed
        5        26

Next, we compute the corresponding association measures:

> er <- (HIV[1,1]/sum(HIV[,1]))-(HIV[1,2]/sum(HIV[,2]))
> rr <- (HIV[1,1]/sum(HIV[,1]))/(HIV[1,2]/sum(HIV[,2]))
> or <- (HIV[1,1]*HIV[2,2])/(HIV[2,1]*HIV[1,2])

These estimates are reported in [Tab. 1]. The excess risk is the difference between the risk in the exposed and the unexposed population; here it is estimated to be 0.45. The relative risk is the ratio of the risk in the exposed population to that in the unexposed population, taken as the baseline: we conclude that exposed female partners are on average 3.9 times more likely to be infected by HIV than unexposed women. Finally, the odds ratio is 8.25. Its interpretation is less intuitive than that of the previous measures, as noted by the authors (pp. 46-47), but note that here RR and OR are not very close to each other. This is because the outcome is not rare (7/31), which could be partly due to the small sample size of this prospective study, giving an unclear picture of the overall prevalence in the population.

Table 1: Association measures for the HIV data.
  ER     RR     OR
  0.45   3.90   8.25
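
For reference, a large-sample (Woolf, log-based) 95% confidence interval for the odds ratio can be sketched from the same table; with such small cell counts it should be interpreted cautiously:

## Woolf (log-based) 95% confidence interval for the odds ratio
se.log.or <- sqrt(sum(1/HIV))
exp(log(or) + c(-1.96, 1.96) * se.log.or)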

Problem 3.6

We set up the data as before:

> ESOP <- matrix(c(255,9,520,191),nrow=2)
> dimnames(ESOP) <- list(c("Control","Case"),c("0-9 g/day","10+ g/day"))
> ESOP
        0-9 g/day 10+ g/day
Control       255       520
Case            9       191

The odds ratio is given by

> (ESOP[1,1]*ESOP[2,2])/(ESOP[1,2]*ESOP[2,1])
[1] 10.40705

So we would conclude that there is a strong association between the risk factor (tobacco consumption) and the outcome, here the risk of developing esophageal cancer.

> library(vcd)
> or <- oddsratio(ESOP,log=FALSE)
> or
[1] 10.40705
> confint(or)
          lwr      upr
[1,] 5.333583 20.30656
> plot(or)
> fourfold(ESOP)

The command plot(or) simply displays the OR and its confidence interval graphically. The last plot (fourfold(ESOP)) is reproduced in [Fig. 7], since it allows the table of counts itself to be visualized graphically. This kind of display was proposed by M. Friendly, see [4] and [5]. The OR clearly departs from 1 here, as the confidence rings of adjacent quadrants do not overlap.

plot36.png
Figure 7: Association plot for the study on esophageal cancer.

However, if we consider the relative proportions within the two subgroups, cases and controls, we see that the odds of exposure (tobacco consumption of 10+ g/day versus 0-9 g/day) are approximately 10 times larger in the case group than in the control group [Fig. 8].

> library(epicalc)
> cci(191,520,9,255,design="case-control")
          Exposure
Outcome    Non-exposed Exposed Total
  Negative         255     520   775
  Positive           9     191   200
  Total            264     711   975

OR =  10.39
95% CI = 5.24 23.46
Chi-squared = 64.95 ,  1 d.f. , P value = 0
Fisher's exact test (2-sided) P value = 0
plot36bis.png
Figure 8: Comparison of odds of exposure for case and control.
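
The factor of about 10 quoted above can be recovered directly from the table of counts; as a quick sketch, the odds of exposure among the cases divided by the odds of exposure among the controls gives back the odds ratio:

## odds of exposure (10+ g/day vs 0-9 g/day) in each group
odds.cases    <- ESOP["Case", "10+ g/day"] / ESOP["Case", "0-9 g/day"]
odds.controls <- ESOP["Control", "10+ g/day"] / ESOP["Control", "0-9 g/day"]
odds.cases / odds.controls      # equals the odds ratio, about 10.4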

Problem 3.7

It can be shown that the mean survival time satisfies `E(T) = int_0^oo S(t)\ dt`, see e.g. [6]. In practice the integration runs from time 0 up to the last recorded uncensored time. Clearly, if the largest follow-up time is right censored, the survival curve does not drop to zero and we cannot obtain a reliable estimate of the mean survival time.

If we take the leukemia data, a quick look at the survival curves indicates that we cannot estimate the mean survival time for the 6-MP group, since the last measurements are all right censored. We can, however, estimate it for the Placebo group.

> LEUK <- read.table("leuk.txt",header=TRUE)
> library(survival)
> LEUK.surv <- Surv(LEUK$time,LEUK$cens)
> LEUK.km <- survfit(LEUK.surv~group,data=LEUK)
> plot(LEUK.km[1])
> lines(LEUK.km[2],col="gray60")
> legend("topright",c("6-MP","Placebo"),lty=1,col=c("black","gray60"),bty="n")
> title(xlab="Weeks since randomization",ylab="Proportion in remission")
plot37.png
Figure 9: Survival function for 6-MP and placebo.
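
For completeness, the (restricted) mean survival time, i.e. the area under the Kaplan-Meier curve up to the largest follow-up time, can be requested directly from the survfit object; for the 6-MP group it must be read as a restricted mean only, since the largest times are censored:

## restricted mean survival time (area under the KM curve up to the last time)
print(LEUK.km, print.rmean = TRUE)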

References

  1. Koenker, R.W. and d'Orey (1987). Computing regression quantiles. Applied Statistics, 36, 383-393.

  2. Koenker, R.W. and d'Orey (1994). Computing regression quantiles. Applied Statistics, 43, 410-444.

  3. Anscombe, F.J. (1973). Graphs in statistical analysis. American Statistician, 27, 17-21.

  4. Friendly, M. (1994). A fourfold display for 2 by 2 by k tables. Technical Report 217, York University, Psychology Department, 4fold.ps.gz.

  5. Friendly, M. (2000). Visualizing Categorical Data. SAS Institute, Cary, NC.

  6. Rothman, K.J. and Greenland, S. (1998). Modern Epidemiology. Lippincott Williams & Wilkins.