Problem 3.1

Since each observation contributes equally (with weight 1/n) to the estimation of the OLS regression line (see the normal equations used to solve the OLS problem), outlying values may distort the slope of the fitted linear trend. If we instead minimize the sum of absolute vertical deviations, which is known as L1 regression, the contribution of each outlier is reduced.
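
As a minimal sketch of the difference between the two criteria (on toy data distinct from the example that follows; all object names are only illustrative), the L1 fit can be obtained by minimizing the sum of absolute residuals numerically and compared with the OLS solution:

## L2 (OLS) versus L1 (LAD) criteria on toy data -- illustrative only
set.seed(101)
x0 <- rnorm(50)
y0 <- 1 + 2*x0 + rexp(50, 1)           # same kind of positive outliers as below
l2.fit <- coef(lm(y0 ~ x0))            # closed-form OLS solution
l1.crit <- function(b) sum(abs(y0 - b[1] - b[2]*x0))
l1.fit <- optim(l2.fit, l1.crit)$par   # numerical minimisation of the L1 criterion
rbind(OLS = l2.fit, LAD = l1.fit)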

The following example should help compare the two approaches. We create artificial outliers by adding random exponential deviates to the y values. The rate parameter of the exponential distribution is fixed at 1, so the resulting outlying values are all on the positive side.

> n <- 100
> x <- rnorm(n)
> y <- 2*x+rexp(n,1)
> plot(x,y)
> points(x[x<1 & y>5],y[x<1 & y>5],pch=19)
> ols <- lm(y~x)
> abline(ols)
> summary(ols)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9057 -0.7418 -0.3131  0.4525  6.3636

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.9092     0.1106   8.217  8.7e-13 ***
x             1.9970     0.1008  19.812  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.091 on 98 degrees of freedom
Multiple R-Squared: 0.8002,     Adjusted R-squared: 0.7982
F-statistic: 392.5 on 1 and 98 DF,  p-value: < 2.2e-16
> library(quantreg)
> lad <- rq(y~x)
> abline(lad$coefficients[1],lad$coefficients[2],col=2)
> summary(lad)
Call: rq(formula = y ~ x)

tau: [1] 0.5

Coefficients:
            coefficients lower bd upper bd
(Intercept) 0.63507      0.45152  0.76212
x           2.08776      1.77945  2.20147

The OLS regression line is plotted together with the scatterplot in [Fig. 1]. Two influential points are highlighted in this figure with filled symbols. Various diagnostic plots, produced with plot(ols), are reported in [Fig. 2]. Some observations are flagged as influential (e.g. obs. 12, 13 and 32) on the basis of their residuals or Cook's distance; two of them were already identified as outliers. The OLS slope estimate (`beta_1`) is 1.997, while the LAD slope estimate is slightly larger (2.088). Comparing the intercepts, which in our case are directly affected by the random exponential deviates added to y, the OLS estimate (0.909) is larger than the LAD one (`beta_0`=0.635). This reflects the upward shift of the OLS predicted values toward the positive bias induced by the outliers.
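
The influential observations flagged by the diagnostic plots can also be listed directly from the fitted model; the 4/n cutoff used below is only a common rule of thumb:

## observations with a Cook's distance above the usual 4/n rule of thumb
cd <- cooks.distance(ols)
which(cd > 4/length(cd))
sort(cd, decreasing = TRUE)[1:5]   # the five most influential points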

The LAD technique builds on the work of Koenker and colleagues, e.g. [1] and [2]. There is also a more recent article by Koenker and Hallock entitled Quantile Regression (a local copy is provided here: QRJEP.pdf), published in the Journal of Economic Perspectives (15(4): 143-156, 2001). R provides tools for quantile regression in the quantreg package. By default the median is used (tau=0.5), and this corresponds to Least Absolute Deviation regression, since minimizing the L1 norm is equivalent to fitting the 0.5 quantile.
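
Since rq() accepts a vector of quantiles, the same call can be used to look beyond the median; the sketch below (the choice of quartiles is arbitrary) fits the first and third quartiles alongside the median:

## quantile regression at several quantiles; tau = 0.5 reproduces the LAD fit
library(quantreg)
qr.fits <- rq(y ~ x, tau = c(0.25, 0.5, 0.75))
coef(qr.fits)            # one column of coefficients per value of tau
plot(summary(qr.fits))   # slope and intercept as a function of tau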

plot31.png
Figure 1: Effect of different regression techniques on estimated slope.

The code below illustrates two other commonly used robust regression techniques, namely Huber regression and Least Trimmed Squares (LTS) regression. Huber robust regression is a compromise between OLS and LAD regression: instead of a constant weight (OLS) or a weight inversely proportional to the absolute residual (LAD), one can adopt a mixed approach in which the residuals are weighted according to an estimate of their scale. LTS regression, on the other hand, is a resistant regression technique which we will not detail here.
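
The Huber weighting scheme can be visualised with the psi.huber() function from the MASS package (k = 1.345 is the default tuning constant used by rlm()): residuals smaller than k in absolute value receive full weight, larger ones are progressively downweighted.

## Huber weights as a function of the scaled residual
library(MASS)
u <- seq(-6, 6, length = 200)
plot(u, psi.huber(u, k = 1.345), type = "l",
     xlab = "scaled residual", ylab = "weight",
     main = "Huber weight function (k = 1.345)")
abline(v = c(-1.345, 1.345), lty = 2)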

J. Faraway covers these topics in greater detail in his on-line handbook on regression (Chapter 13). G. V. Farnsworth also wrote a tutorial on doing Econometrics in R, which contains material relevant to quantile and robust regression techniques (Chapter 4).

> library(MASS)
> hm <- rlm(y~x)
> abline(hm$coefficients[1],hm$coefficients[2],col=3)
> summary(hm)
Call: rlm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max
-0.7872 -0.6093 -0.1838  0.5701  6.5004

Coefficients:
            Value   Std. Error t value
(Intercept)  0.7757  0.0756    10.2635
x            2.0105  0.0689    29.2002

Residual standard error: 0.8644 on 98 degrees of freedom
> #library(lqs)  # library lqs is now included in the MASS package
> lts <- ltsreg(y~x)
> abline(lts$coefficients[1],lts$coefficients[2],col=4)
> legend.text <-
+ c(paste("OLS",round(ols$coefficients[2],3)),paste("LAD",round(lad$coefficients[2],3)),
+ paste("Huber",round(hm$coefficients[2],3)),paste("LTS",round(lts$coefficients[2],3)))
> legend("topleft",legend.text,lty=1,col=1:4)
> title(main="Regression lines and Slopes estimates")

As can be seen from [Fig. 1], the Huber method leads to intermediate results: its regression line lies between the OLS and LAD fits, in particular for low to intermediate values of x, where the exponential deviates added to the y values exert their greatest influence. By contrast, LTS fails to recover the main association between X and Y: its regression line lies far below the principal axis of the bivariate cloud. This is partly because there are relatively few outliers compared to the sample size, and in this situation LTS, which bases its fit on only a subset of the observations, is clearly outperformed by the other methods.
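
We can also check that the Huber fit actually downweights the observations flagged above by inspecting the final IWLS weights returned by rlm() (the 0.9 cutoff is arbitrary, chosen only to list the clearly downweighted points):

## final IWLS weights from the Huber fit: 1 for well-fitted points,
## smaller values for downweighted (outlying) observations
round(sort(hm$w)[1:10], 3)      # the ten smallest weights
which(hm$w < 0.9)               # indices of the clearly downweighted points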

plot31bis.png
Figure 2: Diagnostic plots for the OLS regression.

Problem 3.3

Recall that the slope coefficient estimate `beta_1` in a simple linear model is related to the sample Pearson correlation coefficient r in the following way:

`beta_1 = r * (S_y / S_x)`

where S_x and S_y stand for the standard deviations of X and Y.

It is easily seen from this formula that the regression coefficient equals the correlation coefficient multiplied by a constant, namely the ratio of the Y and X standard deviations.

The following example illustrates that knowing the correlation between two continuous variables, together with their standard deviations, allows one to estimate the slope of a linear regression without explicitly fitting the regression model (the fit is shown at the end of the code snippet for comparison).

> n <- 100
> true.sdx <- 2.5
> true.slope <- 3
> x <- rnorm(n,23,true.sdx)
> y <- true.slope*x+rnorm(n)
> sdx <- sd(x)
> sdy <- sd(y)
> cor.xy <- cor(x,y)
> cor.xy*((n-1)/n*sdy)/((n-1)/n*sdx)
[1] 2.962331
> cor.xy*sdy/sdx
[1] 2.962331
> coef(lm(y~x))[2]
       x
2.962331
> plot(x,y)
plot33.png
Figure 3: Plot of outcome y versus predictor x (the slope is estimated to be 2.96).

Problem 3.4

If the relationship between y and x is quadratic, the Pearson correlation is obviously not a good way to summarize the trend between the two variables, since it is a measure of linear association only. In particular, if E[y|x]=x^2, the curve describing the variation of y as a function of x looks like a parabola [Fig. 4]. For a given value of y, the corresponding x values can then be both positive and negative (provided x is centred on 0, which is the case here since the x values lie uniformly in the [-10,10] interval), so that positive and negative deviations cancel out in the correlation.

The following piece of code demonstrates the intuitive idea developed above.

> n <- 100
> x <- runif(n,-10,10)
> expected.y <- x^2
> y <- x^2+rnorm(n)
> cor(x,y)
[1] -0.06221294
> plot(x,y)
plot34.png
Figure 4: Plot of y versus x, where the xi are uniform deviates and E[y|x]=x^2.

As can be seen, the correlation is close to zero. We can run a small simulation experiment in which we replicate the above code, say 100 times. Here is a short function doing the main computation:

sim <- function(n=100) {
    x <- runif(n,-10,10)
    y <- x^2+rnorm(n)
    return(cor(x,y))
}

Now, we can check the distribution of correlation estimate for 100 random samples as follows:

> set.seed(12345)
> res <- replicate(100,sim())
> summary(res)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.27970 -0.08975 -0.02419 -0.01665  0.07211  0.31570
> hist(res)

Estimated correlations range from -0.28 to 0.32 and the average correlation is -0.02. We can thus be fairly confident that the expected value is zero. The distribution of the statistic is illustrated in [Fig. 5].

plot34bis.png
Figure 5: Variation of Pearson correlation for 100 random samples.
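
By contrast, once the quadratic form of the relationship is taken into account the association is very strong; a quick check, reusing the x and y simulated above, is:

## correlation between y and x^2 is close to 1, and a quadratic fit
## explains nearly all of the variance in y
cor(x^2, y)
summary(lm(y ~ x + I(x^2)))$r.squared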

This is reminiscent of Anscombe's [3] data, where a slight rearrangement of the (x,y) values leaves the correlation unchanged but gives a drastically different picture of the relationship [Fig. 6].

> data(anscombe)
> var.names <- colnames(anscombe)
> par(mfrow=c(2,2))
> for (i in 1:4)
+ plot(anscombe[,i],anscombe[,i+4],xlab=var.names[i],ylab=var.names[i+4])
plot34ter.png
Figure 6: Anscombe's dataset: All correlations are 0.816.

Case 1 corresponds to a quite reasonable hypothesis of a linear relationship between the two variables. Case 2 corresponds to the quadratic situation discussed earlier. Case 3 illustrates the influence of a single outlier on the slope estimate of an otherwise perfect linear association. Case 4 suggests the absence of any real influence of x on y, since different values of y are observed for observations sharing the same x value, the apparent slope being driven by a single extreme point.
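
Anscombe's point can also be verified numerically by fitting the four regressions in a loop (a small sketch, reusing the anscombe data frame loaded above); the coefficients and correlations are essentially identical across the four cases:

## near-identical OLS fits for the four Anscombe datasets
fits <- lapply(1:4, function(i) lm(anscombe[, i+4] ~ anscombe[, i]))
sapply(fits, coef)                       # intercepts ~3.0, slopes ~0.5
sapply(1:4, function(i) cor(anscombe[, i], anscombe[, i+4]))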

Problem 3.5

First, we set up the data:

> HIV <- matrix(c(3,2,4,22),nrow=2)
> dimnames(HIV) <- list(c("Cases","Noncases"),c("Exposed","Unexposed"))
> HIV
         Exposed Unexposed
Cases          3         4
Noncases       2        22
> apply(HIV,1,sum)
   Cases Noncases
       7       24
> apply(HIV,2,sum)
  Exposed Unexposed
        5        26

Next, we compute the corresponding association measures:

> er <- (HIV[1,1]/sum(HIV[,1]))-(HIV[1,2]/sum(HIV[,2]))
> rr <- (HIV[1,1]/sum(HIV[,1]))/(HIV[1,2]/sum(HIV[,2]))
> or <- (HIV[1,1]*HIV[2,2])/(HIV[2,1]*HIV[1,2])

These estimates are reported in [Tab. 1]. The excess risk is the difference between the risk in the exposed and the unexposed population; here it is estimated to be 0.45. The relative risk is the ratio of the risk in the exposed population to that in the unexposed population, taken as the baseline: we conclude that exposed female partners are on average 3.9 times more likely to be infected by HIV than unexposed women. Finally, the odds ratio is 8.25. Its interpretation is less intuitive than that of the previous measures, as noted by the authors (pp. 46-47), but note that here RR and OR are not very close to each other. This is because the outcome is not rare (7/31), which could be partly due to the small sample size of this prospective study, giving an unclear picture of the overall prevalence in the population.

Table 1: Association measures for the HIV data.
  ER     RR     OR
  0.45   3.90   8.25
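
For reference, a large-sample (Woolf, log-based) 95% confidence interval for the odds ratio can be sketched from the same table; with such small cell counts it should be interpreted cautiously:

## Woolf (log-based) 95% confidence interval for the odds ratio
se.log.or <- sqrt(sum(1/HIV))
exp(log(or) + c(-1.96, 1.96) * se.log.or)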

Problem 3.6

We set up the data as before:

> ESOP <- matrix(c(255,9,520,191),nrow=2)
> dimnames(ESOP) <- list(c("Control","Case"),c("0-9 g/day","10+ g/day"))
> ESOP
        0-9 g/day 10+ g/day
Control       255       520
Case            9       191

The odds ratio is given by

> (ESOP[1,1]*ESOP[2,2])/(ESOP[1,2]*ESOP[2,1])
[1] 10.40705

So we would conclude that there is a strong association between the risk factor (tobacco consumption) and the outcome, here the risk of developing esophageal cancer.

> library(vcd)
> or <- oddsratio(ESOP,log=FALSE)
> or
[1] 10.40705
> confint(or)
          lwr      upr
[1,] 5.333583 20.30656
> plot(or)
> fourfold(ESOP)

The command plot(or) simply displays the OR and its confidence interval graphically. The last plot (fourfold(ESOP)) is reproduced in [Fig. 7], since it allows the table of counts itself to be visualized graphically. This kind of display was proposed by M. Friendly, see [4] and [5]. The OR clearly departs from 1 here, as the confidence rings of adjacent quadrants do not overlap.

plot36.png
Figure 7: Association plot for the study on esophageal cancer.

However, if we consider the relative proportions within the two subgroups, cases and controls, we see that the odds of exposure (tobacco consumption of 10+ g/day versus 0-9 g/day) are approximately 10 times larger in the case group than in the control group [Fig. 8].

> library(epicalc)
> cci(191,520,9,255,design="case-control")
          Exposure
Outcome    Non-exposed Exposed Total
  Negative         255     520   775
  Positive           9     191   200
  Total            264     711   975

OR =  10.39
95% CI = 5.24 23.46
Chi-squared = 64.95 ,  1 d.f. , P value = 0
Fisher's exact test (2-sided) P value = 0
plot36bis.png
Figure 8: Comparison of odds of exposure for case and control.
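
The factor of about 10 quoted above can be recovered directly from the table of counts; as a quick sketch, the odds of exposure among the cases divided by the odds of exposure among the controls gives back the odds ratio:

## odds of exposure (10+ g/day vs 0-9 g/day) in each group
odds.cases    <- ESOP["Case", "10+ g/day"] / ESOP["Case", "0-9 g/day"]
odds.controls <- ESOP["Control", "10+ g/day"] / ESOP["Control", "0-9 g/day"]
odds.cases / odds.controls      # equals the odds ratio, about 10.4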

Problem 3.7

It can be shown that the mean survival time satisfies `E(T) = int_0^oo S(t)\ dt`, see e.g. [6]. In practice the integration runs from time 0 up to the last recorded uncensored time. Clearly, if the largest follow-up time is right censored, the survival curve does not drop to zero and we cannot obtain a reliable estimate of the mean survival time.

If we take the leukemia data, a quick look at the survival curves indicates that we cannot estimate the mean survival time for the 6-MP group, since the last measurements are all right censored. We can, however, estimate it for the Placebo group.

> LEUK <- read.table("leuk.txt",header=TRUE)
> library(survival)
> LEUK.surv <- Surv(LEUK$time,LEUK$cens)
> LEUK.km <- survfit(LEUK.surv~group,data=LEUK)
> plot(LEUK.km[1])
> lines(LEUK.km[2],col="gray60")
> legend("topright",c("6-MP","Placebo"),lty=1,col=c("black","gray60"),bty="n")
> title(xlab="Weeks since randomization",ylab="Proportion in remission")
plot37.png
Figure 9: Survival function for 6-MP and placebo.
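
For completeness, the (restricted) mean survival time, i.e. the area under the Kaplan-Meier curve up to the largest follow-up time, can be requested directly from the survfit object; for the 6-MP group it must be read as a restricted mean only, since the largest times are censored:

## restricted mean survival time (area under the KM curve up to the last time)
print(LEUK.km, print.rmean = TRUE)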

References

  1. Koenker, R.W. and d'Orey (1987). Computing regression quantiles. Applied Statistics, 36, 383-393.

  2. Koenker, R.W. and d'Orey (1994). Computing regression quantiles. Applied Statistics, 43, 410-444.

  3. Anscombe, F.J. (1973). Graphs in statistical analysis. American Statistician, 27, 17-21.

  4. Friendly, M. (1994). A fourfold display for 2 by 2 by k tables. Technical Report 217, York University, Psychology Department, 4fold.ps.gz.

  5. Friendly, M. (2000). Visualizing Categorical Data. SAS Institute, Cary, NC.

  6. Rothman, K.J. and Greenland, S. (1998). Modern Epidemiology. Lippincott Williams & Wilkins.