Some example datasets for regression

We go through some example datasets to illustrate the utility of the regression model.

Set the working directory (you will need to change this to the directory on your machine that contains the data).

dataDir<-"../finalDataSets"

The following is housing price data from Ames, Iowa. The full dataset is available at https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt, and a description of the full dataset is at https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt. We will be working with a slightly smaller version of the full dataset. We read in the dataset:

dd = read.csv(file.path(dataDir, "Ames_Short.csv"), header = T,stringsAsFactors=TRUE)

To explore the data, we make a pairs plot.

pairs(dd)


The next dataset is from a bike rental company; it can also be found at https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. We again read in the data and convert some of the variables to factors.

bike <- read.csv(file.path(dataDir, "DailyBikeSharingDataset.csv"),stringsAsFactors=TRUE)
bike$yr<-factor(bike$yr)
bike$mnth<-factor(bike$mnth)

We do a pairs plot of only the continuous variables.

pairs(bike[,11:16])


The final dataset is the bodyfat dataset, which gives estimates of the percentage of body fat determined by underwater weighing, along with various body circumference measurements, for 252 men. Further explanation of the dataset can be found at http://lib.stat.cmu.edu/datasets/bodyfat.

body = read.csv(file.path(dataDir, "bodyfat_short.csv"), header = T,stringsAsFactors=TRUE)
pairs(body)


We can see that there are outliers in the data, which we can try to identify by their values on specific variables.

ou1 = which(body$HEIGHT < 30)
ou2 = which(body$WEIGHT > 300)
ou3 = which(body$HIP > 120)
ou = c(ou1, ou2, ou3)

We replot the data after removing these outliers and observe that we can now see the structure of the data much better.

pairs(body[-ou,])


Interpreting the variables

Here we make a pairs plot of the response and two highly correlated explanatory variables.

pairs(body[,c("BODYFAT","THIGH","ABDOMEN")])


To understand what a coefficient in a multiple regression means, we can consider a coplot, which divides up the data. It takes a formula like lm, but with an additional conditioning term |: the formula BODYFAT ~ THIGH | ABDOMEN means that BODYFAT will be plotted against THIGH for different ranges of ABDOMEN.

coplot(BODYFAT~THIGH|ABDOMEN,data=body)


The reverse conditioning looks quite different, showing that the relationship is not symmetric.

coplot(BODYFAT~ABDOMEN | THIGH,data=body)


The regression model

Here, we plot the response (body fat percentage) against each of the predictors.

par(mfrow = c(3, 3))
for(i in 2:8){
    plot(body[-ou, i], body[-ou, 1], xlab = names(body)[i], ylab = "BODYFAT")
}
par(mfrow = c(1, 1))


Now we fit our linear model with all of the variables:

ft = lm(BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + ABDOMEN + HIP + THIGH, data = body)
summary(ft)
## 
## Call:
## lm(formula = BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + ABDOMEN + 
##     HIP + THIGH, data = body)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0729  -3.2387  -0.0782   3.0623  10.3611 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.748e+01  1.449e+01  -2.585  0.01031 *  
## AGE          1.202e-02  2.934e-02   0.410  0.68246    
## WEIGHT      -1.392e-01  4.509e-02  -3.087  0.00225 ** 
## HEIGHT      -1.028e-01  9.787e-02  -1.051  0.29438    
## CHEST       -8.312e-04  9.989e-02  -0.008  0.99337    
## ABDOMEN      9.685e-01  8.531e-02  11.352  < 2e-16 ***
## HIP         -1.834e-01  1.448e-01  -1.267  0.20648    
## THIGH        2.857e-01  1.362e-01   2.098  0.03693 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.438 on 244 degrees of freedom
## Multiple R-squared:  0.7266, Adjusted R-squared:  0.7187 
## F-statistic: 92.62 on 7 and 244 DF,  p-value: < 2.2e-16

If we want to use all of the variables in a dataset, we can use a simpler notation that keeps us from having to write out all of the variable names:

ft = lm(BODYFAT ~ . , data = body)

Even if we want to use all but a few of the variables, it can be easier to use this notation and just subset the dataset, e.g. data = body[, -4]; see the example below.
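For instance, a model using every variable except HEIGHT (column 4 of body); the object name ftNoHeight here is just for illustration:

ftNoHeight = lm(BODYFAT ~ . , data = body[, -4])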

Here we rerun the linear regression, only now after scaling the explanatory variables. The coefficients are then comparable: each gives the change in the response for a one standard deviation change in the corresponding variable. Notice that we have to leave out the response variable before scaling. We also verify that the ratio of each scaled coefficient to its unscaled counterpart is exactly the standard deviation of that variable.

tempBody<-body
tempBody[,-1]<-scale(tempBody[,-1])
ftScale = lm(BODYFAT ~ . , data = tempBody)
cat("Coefficients with variables scaled:\n")
## Coefficients with variables scaled:
coef(ftScale)
## (Intercept)         AGE      WEIGHT      HEIGHT       CHEST     ABDOMEN 
## 19.15079365  0.15143812 -4.09098792 -0.37671913 -0.00700714 10.44300051 
##         HIP       THIGH 
## -1.31360120  1.50003073
cat("Coefficients on original scale:\n")
## Coefficients on original scale:
coef(ft)
##   (Intercept)           AGE        WEIGHT        HEIGHT         CHEST 
## -3.747573e+01  1.201695e-02 -1.392006e-01 -1.028485e-01 -8.311678e-04 
##       ABDOMEN           HIP         THIGH 
##  9.684620e-01 -1.833599e-01  2.857227e-01
sdVar<-apply(body[,-1],2,sd)
cat("Sd per variable:\n")
## Sd per variable:
sdVar
##       AGE    WEIGHT    HEIGHT     CHEST   ABDOMEN       HIP     THIGH 
## 12.602040 29.389160  3.662856  8.430476 10.783077  7.164058  5.249952
cat("Ratio of scaled lm coefficient to original lm coefficient\n")
## Ratio of scaled lm coefficient to original lm coefficient
coef(ftScale)[-1]/coef(ft)[-1]
##       AGE    WEIGHT    HEIGHT     CHEST   ABDOMEN       HIP     THIGH 
## 12.602040 29.389160  3.662856  8.430476 10.783077  7.164058  5.249952

Correlated variables

We find the correlations among the hip, thigh, abdomen, and chest circumference measurements.

cor(body[,c("HIP","THIGH","ABDOMEN","CHEST")])
##               HIP     THIGH   ABDOMEN     CHEST
## HIP     1.0000000 0.8964098 0.8740662 0.8294199
## THIGH   0.8964098 1.0000000 0.7666239 0.7298586
## ABDOMEN 0.8740662 0.7666239 1.0000000 0.9158277
## CHEST   0.8294199 0.7298586 0.9158277 1.0000000

We change our regression, and now drop the variables ABDOMEN and THIGH. We see that this affects the coefficients of the variables that remain.

ft1 = lm(BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + HIP, data = body) 
# Including abdomen
round(coef(ft),4)
## (Intercept)         AGE      WEIGHT      HEIGHT       CHEST     ABDOMEN 
##    -37.4757      0.0120     -0.1392     -0.1028     -0.0008      0.9685 
##         HIP       THIGH 
##     -0.1834      0.2857
# Excluding abdomen
round(coef(ft1),4)
## (Intercept)         AGE      WEIGHT      HEIGHT       CHEST         HIP 
##    -53.9871      0.1290     -0.0526     -0.3146      0.5148      0.4697

To find the fitted values in R, we can use the function fitted applied to the lm output that we saved earlier:

head(fitted(ft))
##        1        2        3        4        5        6 
## 16.32670 10.22019 18.42600 11.89502 25.97564 16.28529

Here we plot the fitted values against the response.

plot(fitted(ft), body$BODYFAT, xlab = "Fitted Values", ylab = "Bodyfat Percentage")


The squared correlation of these two variables is \(R^2\), the Coefficient of Determination or Multiple \(R^2\), which is reported in the summary of the lm object:

cor(body$BODYFAT, fitted(ft))^2
## [1] 0.7265596
summary(ft)
## 
## Call:
## lm(formula = BODYFAT ~ ., data = body)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0729  -3.2387  -0.0782   3.0623  10.3611 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.748e+01  1.449e+01  -2.585  0.01031 *  
## AGE          1.202e-02  2.934e-02   0.410  0.68246    
## WEIGHT      -1.392e-01  4.509e-02  -3.087  0.00225 ** 
## HEIGHT      -1.028e-01  9.787e-02  -1.051  0.29438    
## CHEST       -8.312e-04  9.989e-02  -0.008  0.99337    
## ABDOMEN      9.685e-01  8.531e-02  11.352  < 2e-16 ***
## HIP         -1.834e-01  1.448e-01  -1.267  0.20648    
## THIGH        2.857e-01  1.362e-01   2.098  0.03693 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.438 on 244 degrees of freedom
## Multiple R-squared:  0.7266, Adjusted R-squared:  0.7187 
## F-statistic: 92.62 on 7 and 244 DF,  p-value: < 2.2e-16

We can also pull out the residuals with the function residuals.

head(residuals(ft))
##         1         2         3         4         5         6 
## -4.026695 -4.120189  6.874004 -1.495017  2.724355  4.614712

We can then make the most common diagnostic plot: the residuals plotted against the fitted values.

plot(fitted(ft), residuals(ft), xlab = "Fitted Values", ylab = "Residuals")


We can also plot the residuals against each of the explanatory variables.

par(mfrow = c(3, 3))
for(i in 2:8){
    plot(body[, i], residuals(ft), xlab = names(body)[i], ylab = "Residuals")
}
par(mfrow = c(1, 1))


Note here the presence of the outliers.

We can look at the \(R^2\) after removing the variable ABDOMEN. Notice how I can pull out the information shown in summary to access it (here I did summary(ft.1)$r.squared to get the \(R^2\); similarly summary(ft.1)$coef would pull out the matrix of coefficients with their p-values).

ft.1 = lm(BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + HIP + THIGH, data = body)
r2.ft1 = summary(ft.1)$r.squared
r2.ft1
## [1] 0.5821305

Compare this to the \(R^2\) of the full regression. Note that the \(R^2\) is stored in the summary, so we need summary(ft)$r.squared, not ft$r.squared.

summary(ft)$r.squared
## [1] 0.7265596

Categorical Variables

We will now bring in the scorecard data on colleges from 2016.

scorecard <- read.csv(file.path(dataDir, "college.csv"),stringsAsFactors=TRUE)

Here we fit our lm, treating the CONTROL variable (whether the college is public, private, or private for-profit) as numeric.

req.bad = lm(RET_FT4~ TUITIONFEE_OUT + CONTROL, data = scorecard)
summary(req.bad)
## 
## Call:
## lm(formula = RET_FT4 ~ TUITIONFEE_OUT + CONTROL, data = scorecard)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69041 -0.04915  0.00516  0.05554  0.33165 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.661e-01  9.265e-03   71.90   <2e-16 ***
## TUITIONFEE_OUT  9.405e-06  3.022e-07   31.12   <2e-16 ***
## CONTROL        -8.898e-02  5.741e-03  -15.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08741 on 1238 degrees of freedom
## Multiple R-squared:  0.4391, Adjusted R-squared:  0.4382 
## F-statistic: 484.5 on 2 and 1238 DF,  p-value: < 2.2e-16

We can check that R does indeed consider CONTROL a numeric variable with is.numeric.

is.numeric(scorecard$CONTROL)
## [1] TRUE

Instead we need to treat this variable as a factor.

req = lm(RET_FT4~ TUITIONFEE_OUT + as.factor(CONTROL), data = scorecard)
summary(req)
## 
## Call:
## lm(formula = RET_FT4 ~ TUITIONFEE_OUT + as.factor(CONTROL), data = scorecard)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68856 -0.04910  0.00505  0.05568  0.33150 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.765e-01  7.257e-03  79.434  < 2e-16 ***
## TUITIONFEE_OUT       9.494e-06  3.054e-07  31.090  < 2e-16 ***
## as.factor(CONTROL)2 -9.204e-02  5.948e-03 -15.474  < 2e-16 ***
## as.factor(CONTROL)3 -1.218e-01  3.116e-02  -3.909 9.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08732 on 1237 degrees of freedom
## Multiple R-squared:  0.4408, Adjusted R-squared:  0.4394 
## F-statistic:   325 on 3 and 1237 DF,  p-value: < 2.2e-16

It is generally better, however, to actually fix it in your data.frame as a factor. That way you won't accidentally mess up, and all the various R commands you might use (plot, summary, etc.) will treat it correctly.

scorecard$CONTROL<-factor(scorecard$CONTROL,levels=c(1,2,3),labels=c("public","private","private for-profit"))
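A quick sanity check that the conversion did what we expect:

is.factor(scorecard$CONTROL)
## [1] TRUE
levels(scorecard$CONTROL)
## [1] "public"             "private"            "private for-profit"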

Now the regression will just know it should be a factor.

req = lm(RET_FT4~ TUITIONFEE_OUT + CONTROL, data = scorecard)
summary(req)
## 
## Call:
## lm(formula = RET_FT4 ~ TUITIONFEE_OUT + CONTROL, data = scorecard)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68856 -0.04910  0.00505  0.05568  0.33150 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                5.765e-01  7.257e-03  79.434  < 2e-16 ***
## TUITIONFEE_OUT             9.494e-06  3.054e-07  31.090  < 2e-16 ***
## CONTROLprivate            -9.204e-02  5.948e-03 -15.474  < 2e-16 ***
## CONTROLprivate for-profit -1.218e-01  3.116e-02  -3.909 9.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08732 on 1237 degrees of freedom
## Multiple R-squared:  0.4408, Adjusted R-squared:  0.4394 
## F-statistic:   325 on 3 and 1237 DF,  p-value: < 2.2e-16

Now we add an interaction between the continuous variable (TUITIONFEE_OUT) and the categorical one, meaning we will get a different slope for each group. Interactions are specified by putting a : between the two variables.

req.1 = lm(RET_FT4~ TUITIONFEE_OUT + CONTROL + TUITIONFEE_OUT:CONTROL, data = scorecard) 
summary(req.1)
## 
## Call:
## lm(formula = RET_FT4 ~ TUITIONFEE_OUT + CONTROL + TUITIONFEE_OUT:CONTROL, 
##     data = scorecard)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68822 -0.04982  0.00491  0.05555  0.32900 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               5.814e-01  1.405e-02  41.372  < 2e-16
## TUITIONFEE_OUT                            9.240e-06  6.874e-07  13.441  < 2e-16
## CONTROLprivate                           -9.830e-02  1.750e-02  -5.617  2.4e-08
## CONTROLprivate for-profit                -2.863e-01  1.568e-01  -1.826   0.0681
## TUITIONFEE_OUT:CONTROLprivate             2.988e-07  7.676e-07   0.389   0.6971
## TUITIONFEE_OUT:CONTROLprivate for-profit  7.215e-06  6.716e-06   1.074   0.2829
##                                             
## (Intercept)                              ***
## TUITIONFEE_OUT                           ***
## CONTROLprivate                           ***
## CONTROLprivate for-profit                .  
## TUITIONFEE_OUT:CONTROLprivate               
## TUITIONFEE_OUT:CONTROLprivate for-profit    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08734 on 1235 degrees of freedom
## Multiple R-squared:  0.4413, Adjusted R-squared:  0.4391 
## F-statistic: 195.1 on 5 and 1235 DF,  p-value: < 2.2e-16

Alternatively, and more commonly, you can just write TUITIONFEE_OUT * CONTROL, which is equivalent to TUITIONFEE_OUT + CONTROL + TUITIONFEE_OUT:CONTROL; in other words, you put in both the individual terms and their interaction.

summary(lm(RET_FT4~ TUITIONFEE_OUT * CONTROL, data = scorecard) )
## 
## Call:
## lm(formula = RET_FT4 ~ TUITIONFEE_OUT * CONTROL, data = scorecard)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68822 -0.04982  0.00491  0.05555  0.32900 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               5.814e-01  1.405e-02  41.372  < 2e-16
## TUITIONFEE_OUT                            9.240e-06  6.874e-07  13.441  < 2e-16
## CONTROLprivate                           -9.830e-02  1.750e-02  -5.617  2.4e-08
## CONTROLprivate for-profit                -2.863e-01  1.568e-01  -1.826   0.0681
## TUITIONFEE_OUT:CONTROLprivate             2.988e-07  7.676e-07   0.389   0.6971
## TUITIONFEE_OUT:CONTROLprivate for-profit  7.215e-06  6.716e-06   1.074   0.2829
##                                             
## (Intercept)                              ***
## TUITIONFEE_OUT                           ***
## CONTROLprivate                           ***
## CONTROLprivate for-profit                .  
## TUITIONFEE_OUT:CONTROLprivate               
## TUITIONFEE_OUT:CONTROLprivate for-profit    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08734 on 1235 degrees of freedom
## Multiple R-squared:  0.4413, Adjusted R-squared:  0.4391 
## F-statistic: 195.1 on 5 and 1235 DF,  p-value: < 2.2e-16

Now we fit a linear model for the bike data, with two categorical variables.

md1 = lm(casual ~ atemp + workingday + weathersit, data = bike)
summary(md1)
## 
## Call:
## lm(formula = casual ~ atemp + workingday + weathersit, data = bike)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1456.76  -243.97   -22.93   166.81  1907.20 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 350.31      55.11   6.357 3.63e-10 ***
## atemp                      2333.77      97.48  23.942  < 2e-16 ***
## workingdayYes              -794.11      33.95 -23.388  < 2e-16 ***
## weathersitLight Rain/Snow  -523.79      95.23  -5.500 5.26e-08 ***
## weathersitMisty            -150.79      33.75  -4.468 9.14e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 425.2 on 726 degrees of freedom
## Multiple R-squared:  0.6186, Adjusted R-squared:  0.6165 
## F-statistic: 294.3 on 4 and 726 DF,  p-value: < 2.2e-16

Now we add a different slope of atemp for each of the workingday levels by putting in an interaction.

md3 = lm(casual ~ atemp + workingday + weathersit + workingday:atemp, data = bike)
summary(md3)
## 
## Call:
## lm(formula = casual ~ atemp + workingday + weathersit + workingday:atemp, 
##     data = bike)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1709.76  -198.09   -55.12   152.88  1953.07 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -276.22      77.48  -3.565 0.000388 ***
## atemp                      3696.41     155.56  23.762  < 2e-16 ***
## workingdayYes               166.71      94.60   1.762 0.078450 .  
## weathersitLight Rain/Snow  -520.78      88.48  -5.886 6.05e-09 ***
## weathersitMisty            -160.28      31.36  -5.110 4.12e-07 ***
## atemp:workingdayYes       -2052.09     190.48 -10.773  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 395.1 on 725 degrees of freedom
## Multiple R-squared:  0.6712, Adjusted R-squared:  0.6689 
## F-statistic:   296 on 5 and 725 DF,  p-value: < 2.2e-16

Inference

The Global Fit

Here we create a function that implements a permutation test for the global fit.

set.seed(147980)
permutationLM <- function(y,data,  n.repetitions, STAT=function(lmFit){summary(lmFit)$r.squared}){ 
  # calculate the observed statistic
  stat.obs <- STAT(lm(y ~ ., data=data))
  #function to do the permutation and return statistic
  makePermutedStats<-function(){
      sampled <- sample(y) 
      fit<-lm(sampled ~ ., data=data)
      return(STAT(fit))
  }
  # calculate the permuted statistic
  stat.permute <-replicate(n.repetitions,makePermutedStats()) 
  p.value <- sum(stat.permute >= stat.obs) / n.repetitions
  return(list(p.value=p.value,observedStat=stat.obs,permutedStats=stat.permute))
}

We now apply it to our bodyfat data, using \(R^2\) as our statistic, and not surprisingly get very significant results.

permOut<-permutationLM(body$BODYFAT,data=body[,-1],n.repetitions=1000)
hist(permOut$permutedStats,breaks=50)


permOut[1:2]
## $p.value
## [1] 0
## 
## $observedStat
## [1] 0.7265596

We can also use the \(F\) statistic and compare the permutation distribution to the \(F\) distribution. The \(F\) distribution is an approximation of the permutation distribution for large sample sizes, so not surprisingly they are similar.
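For reference, the \(F\) statistic reported by summary is a monotone function of \(R^2\) for fixed \(n\) and \(p\):

\[ F = \frac{R^2/p}{(1-R^2)/(n-p-1)}, \]

so the two statistics order the permutations identically and give the same permutation p-value.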

n<-nrow(body)
p<-ncol(body)-1
permOutF<-permutationLM(body$BODYFAT,data=body[,-1],n.repetitions=1000, STAT=function(lmFit){summary(lmFit)$fstatistic["value"]})
hist(permOutF$permutedStats,freq=FALSE,breaks=50)
curve(df(x,df1=p,df2=n-p-1),add=TRUE, main=paste("F(",p,",",n-p-1,") distribution"))


permOutF[1:2]
## $p.value
## [1] 0
## 
## $observedStat
##    value 
## 92.61904

Variable importance

Bootstrap Confidence Intervals

The following function creates bootstrap confidence intervals for the coefficients of the linear regression, and is quite similar to the one I used for simple regression.

bootstrapLM <- function(y,x, repetitions, confidence.level=0.95){
    # calculate the observed statistics
  stat.obs <- coef(lm(y~., data=x))
  # calculate the bootstrapped statistics
  bootFun<-function(){
      sampled <- sample(1:length(y), size=length(y),replace = TRUE)
      coef(lm(y[sampled]~.,data=x[sampled,])) #small correction here to make it for a matrix x
  }  
  stat.boot<-replicate(repetitions,bootFun())
  level<-1-confidence.level
  confidence.interval <- apply(stat.boot,1,quantile,probs=c(level/2,1-level/2))
    return(list(confidence.interval = cbind("lower"=confidence.interval[1,],"estimate"=stat.obs,"upper"=confidence.interval[2,]), bootStats=stat.boot))
}

Now we apply it to the bodyfat dataset.

bodyBoot<-with(body,bootstrapLM(y=BODYFAT,x=body[,-1],repetitions=10000))
bodyBoot$conf
##                    lower      estimate       upper
## (Intercept) -75.68776383 -3.747573e+01 -3.84419402
## AGE          -0.03722018  1.201695e-02  0.06645578
## WEIGHT       -0.24629552 -1.392006e-01 -0.02076377
## HEIGHT       -0.41327145 -1.028485e-01  0.28042319
## CHEST        -0.25876131 -8.311678e-04  0.20995486
## ABDOMEN       0.81115069  9.684620e-01  1.13081481
## HIP          -0.46808557 -1.833599e-01  0.10637834
## THIGH         0.02272414  2.857227e-01  0.56054626

We can do plots of the confidence intervals (excluding the intercept) with plotCI from the gplots package.

require(gplots)
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 4.0.2
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
# Exclude the intercept
with(bodyBoot,plotCI(confidence.interval[-1,"estimate"],ui=confidence.interval[-1,"upper"],li=confidence.interval[-1,"lower"],xaxt="n"))
axis(side=1,at=1:(nrow(bodyBoot$conf)-1),rownames(bodyBoot$conf)[-1])

cat("Ratio of p/n in body fat: ",ncol(body)/nrow(body),"\n")
## Ratio of p/n in body fat:  0.03174603
library(pheatmap)

plot of chunk unnamed-chunk-41

pheatmap(summary(ft,correlation=TRUE)$corr[-1,-1],breaks=seq(-1,1,length=100), main="Correlation of the estimated coefficients")

plot of chunk unnamed-chunk-41

pheatmap(cor(body[,-1]),breaks=seq(-1,1,length=100), main="Correlation of the variables")

plot of chunk unnamed-chunk-41

Prediction Intervals

We can create confidence intervals for the prediction using the predict function. We first have to create a data.frame in the same format as the data.frame used in our lm model (i.e. with the same variable names).

x0 = data.frame(AGE = 30, WEIGHT = 180, HEIGHT = 70, CHEST = 95, ABDOMEN = 90, HIP = 100, THIGH = 60)

Now we can create confidence intervals for the average predicted value, with interval = "confidence":

predict(ft, x0, interval = "confidence")
##        fit      lwr      upr
## 1 16.51927 15.20692 17.83162

For prediction intervals, i.e. the expected range of values for a new individual, we set interval = "prediction":

predict(ft, x0, interval = "prediction")
##        fit      lwr      upr
## 1 16.51927 7.678715 25.35983

Regression diagnostics

If we run plot on our output from lm, it will automatically make four diagnostic plots. Note that if I don't set par(mfrow = c(2, 2)), it will go into "interactive" mode and expect you to hit return to see each successive plot.

par(mfrow = c(2, 2))
plot(ft)


par(mfrow = c(1, 1))

We can also choose to draw only some of these plots with the which argument. There are six plots to choose from, even though only four are drawn by default.

par(mfrow=c(1,2))
plot(ft,which=c(1,3))

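For instance, the sixth plot, Cook's distance against leverage, is one of those not drawn by default:

plot(ft, which = 6)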

Here we make the same plots for the bike data (we refit the model just for convenience, as a reminder of what it was).

md1 = lm(casual ~ atemp + workingday + weathersit, data = bike)
par(mfrow=c(1,2))
plot(md1,which=c(1,3))


Since there is heteroskedasticity, we consider transformations of our response (log and sqrt):

mdLog = lm(log(casual) ~ atemp + workingday + weathersit, data = bike)
mdSqrt= lm(sqrt(casual) ~ atemp + workingday + weathersit, data = bike)
par(mfrow=c(2,2))
plot(mdLog,which=1,main="Log")
plot(mdSqrt,which=1,main="Sqrt")
plot(mdLog,which=3,main="Log")
plot(mdSqrt,which=3,main="Sqrt")


# Here we plot only the QQ plot
par(mfrow=c(1,1))
plot(ft,which=2)


# And the QQ plot for the different bike models
par(mfrow=c(2,2))
plot(md1,which=2,main="Original (counts)")
plot(mdLog,which=2,main="Log")
plot(mdSqrt,which=2,main="Sqrt")


Plots 4 and 5 are useful for detecting outliers. Plot 4 shows Cook's distance, which flags observations that would change the fit a lot if removed. Plot 5 shows the residuals against the leverage of the observations, which gives similar information.

par(mfrow=c(1,2))
plot(ft,which=c(4:5))

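If you want the numbers behind the Cook's distance plot, you can compute them directly; this small sketch uses the base function cooks.distance and sorts so the most influential observations come first:

cd <- cooks.distance(ft)
head(sort(cd, decreasing = TRUE), 3)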

Here we look at these outlying points seen in the plots.

whOut<-c(39, 42, 36)
cat("High leverage points:\n")
## High leverage points:
body[whOut,]
##    BODYFAT AGE WEIGHT HEIGHT CHEST ABDOMEN   HIP THIGH
## 39    35.2  46 363.15  72.25 136.2   148.1 147.7  87.3
## 42    32.9  44 205.00  29.50 106.0   104.3 115.5  70.6
## 36    40.1  49 191.75  65.00 118.5   113.1 113.8  61.9
cat("Mean of each variables:\n")
## Mean of each variables:
apply(body, 2, mean)
##   BODYFAT       AGE    WEIGHT    HEIGHT     CHEST   ABDOMEN       HIP     THIGH 
##  19.15079  44.88492 178.92440  70.14881 100.82421  92.55595  99.90476  59.40595

We can do a PCA of the explanatory variables and see that these points are separated from the rest. Notice how I use the text function to plot character values, in this case the indices of the outlier points, instead of points. To avoid also drawing a point for these observations, I first make a blank plot by choosing type="n", then use points for all observations except the outliers, and finally use text to plot the indices of the outlier points. The blank plot is important because it makes sure that the limits of the plot are big enough for all the points; if I just drew the plot excluding the outliers (e.g. plot(bodyPca$x[-whOut,1:2])), the boundaries wouldn't be big enough to show the outlying points when we add them.

bodyPca<-prcomp(body[,-1])
plot(bodyPca$x[,1:2],type="n")
points(bodyPca$x[-whOut,1:2])
text(bodyPca$x[whOut,1:2],labels=whOut,col="red")


We can also do a pairs plot that highlights these points in a similar way, this time putting the plotting commands in the panel argument. Because it's in a panel function, you don't need the blank plot, and you use only plotting commands that add to an existing plot (like points and text).

pairs(body,panel=function(x,y){
    points(x[-whOut],y[-whOut])
    text(x[whOut],y[whOut],labels=whOut)})


I can run the linear model excluding those observations and compare the coefficients. Notice how I use summary(ftNoOut)$coef to get the full table of coefficients, including the p-values.

ftNoOut<-lm(BODYFAT~.,data=body[-whOut,])
cat("Coefficients without outliers:\n")
## Coefficients without outliers:
round(summary(ftNoOut)$coef,3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -22.902     20.297  -1.128    0.260
## AGE            0.021      0.029   0.717    0.474
## WEIGHT        -0.074      0.059  -1.271    0.205
## HEIGHT        -0.241      0.187  -1.288    0.199
## CHEST         -0.121      0.113  -1.065    0.288
## ABDOMEN        0.945      0.088  10.709    0.000
## HIP           -0.171      0.152  -1.124    0.262
## THIGH          0.223      0.141   1.584    0.114
cat("\nCoefficients in Original Model:\n")
## 
## Coefficients in Original Model:
round(summary(ft)$coef,3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -37.476     14.495  -2.585    0.010
## AGE            0.012      0.029   0.410    0.682
## WEIGHT        -0.139      0.045  -3.087    0.002
## HEIGHT        -0.103      0.098  -1.051    0.294
## CHEST         -0.001      0.100  -0.008    0.993
## ABDOMEN        0.968      0.085  11.352    0.000
## HIP           -0.183      0.145  -1.267    0.206
## THIGH          0.286      0.136   2.098    0.037

Variable Selection

We first introduce the car seat position dataset, which is provided as part of the package faraway. We fit a basic linear model to this dataset and look at the summary of the model.

library(faraway)
## Warning: package 'faraway' was built under R version 4.0.2
data(seatpos)
pairs(seatpos)


lmSeat = lm(hipcenter ~ ., seatpos)
summary(lmSeat)
## 
## Call:
## lm(formula = hipcenter ~ ., data = seatpos)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -73.827 -22.833  -3.678  25.017  62.337 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 436.43213  166.57162   2.620   0.0138 *
## Age           0.77572    0.57033   1.360   0.1843  
## Weight        0.02631    0.33097   0.080   0.9372  
## HtShoes      -2.69241    9.75304  -0.276   0.7845  
## Ht            0.60134   10.12987   0.059   0.9531  
## Seated        0.53375    3.76189   0.142   0.8882  
## Arm          -1.32807    3.90020  -0.341   0.7359  
## Thigh        -1.14312    2.66002  -0.430   0.6706  
## Leg          -6.43905    4.71386  -1.366   0.1824  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.72 on 29 degrees of freedom
## Multiple R-squared:  0.6866, Adjusted R-squared:  0.6001 
## F-statistic:  7.94 on 8 and 29 DF,  p-value: 1.306e-05

To compare a submodel to the full model using an F-test, I first need to run lm on both models. We already have the full model (ft); here is the submodel:

mod0<-lm(BODYFAT~ABDOMEN+AGE+WEIGHT,data=body)

Then I run the command anova on these two models.

anova(mod0,ft)
## Analysis of Variance Table
## 
## Model 1: BODYFAT ~ ABDOMEN + AGE + WEIGHT
## Model 2: BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + ABDOMEN + HIP + THIGH
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    248 4941.3                           
## 2    244 4806.8  4     134.5 1.7069 0.1491

Note that anova does a different kind of analysis if you give it only one linear model, somewhat distinct from this comparison, but we are not going to get into that here.

Here we again do an F-test for a submodel, only now the submodel is missing just a single variable (HEIGHT).

modNoHEIGHT<-lm(BODYFAT~ABDOMEN+AGE+WEIGHT+CHEST+HIP+THIGH,data=body)
anova(modNoHEIGHT,ft)
## Analysis of Variance Table
## 
## Model 1: BODYFAT ~ ABDOMEN + AGE + WEIGHT + CHEST + HIP + THIGH
## Model 2: BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + ABDOMEN + HIP + THIGH
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    245 4828.6                           
## 2    244 4806.8  1    21.753 1.1042 0.2944

We can see that we get the same p-value (0.2944) as the t-test for the coefficient corresponding to HEIGHT.

summary(ft)
## 
## Call:
## lm(formula = BODYFAT ~ ., data = body)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0729  -3.2387  -0.0782   3.0623  10.3611 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.748e+01  1.449e+01  -2.585  0.01031 *  
## AGE          1.202e-02  2.934e-02   0.410  0.68246    
## WEIGHT      -1.392e-01  4.509e-02  -3.087  0.00225 ** 
## HEIGHT      -1.028e-01  9.787e-02  -1.051  0.29438    
## CHEST       -8.312e-04  9.989e-02  -0.008  0.99337    
## ABDOMEN      9.685e-01  8.531e-02  11.352  < 2e-16 ***
## HIP         -1.834e-01  1.448e-01  -1.267  0.20648    
## THIGH        2.857e-01  1.362e-01   2.098  0.03693 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.438 on 244 degrees of freedom
## Multiple R-squared:  0.7266, Adjusted R-squared:  0.7187 
## F-statistic: 92.62 on 7 and 244 DF,  p-value: < 2.2e-16

We can use the function regsubsets in the package leaps to find, for each size k, the best model with k variables. Here we apply it to the bodyfat data:

library(leaps)
bFat = regsubsets(BODYFAT ~ ., body)
summary(bFat)
## Subset selection object
## Call: regsubsets.formula(BODYFAT ~ ., body)
## 7 Variables  (and intercept)
##         Forced in Forced out
## AGE         FALSE      FALSE
## WEIGHT      FALSE      FALSE
## HEIGHT      FALSE      FALSE
## CHEST       FALSE      FALSE
## ABDOMEN     FALSE      FALSE
## HIP         FALSE      FALSE
## THIGH       FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
##          AGE WEIGHT HEIGHT CHEST ABDOMEN HIP THIGH
## 1  ( 1 ) " " " "    " "    " "   "*"     " " " "  
## 2  ( 1 ) " " "*"    " "    " "   "*"     " " " "  
## 3  ( 1 ) " " "*"    " "    " "   "*"     " " "*"  
## 4  ( 1 ) " " "*"    " "    " "   "*"     "*" "*"  
## 5  ( 1 ) " " "*"    "*"    " "   "*"     "*" "*"  
## 6  ( 1 ) "*" "*"    "*"    " "   "*"     "*" "*"  
## 7  ( 1 ) "*" "*"    "*"    "*"   "*"     "*" "*"

Here is the same function applied to the car seat position data.

bSeat = regsubsets(hipcenter ~ ., seatpos)
summary(bSeat)
## Subset selection object
## Call: regsubsets.formula(hipcenter ~ ., seatpos)
## 8 Variables  (and intercept)
##         Forced in Forced out
## Age         FALSE      FALSE
## Weight      FALSE      FALSE
## HtShoes     FALSE      FALSE
## Ht          FALSE      FALSE
## Seated      FALSE      FALSE
## Arm         FALSE      FALSE
## Thigh       FALSE      FALSE
## Leg         FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          Age Weight HtShoes Ht  Seated Arm Thigh Leg
## 1  ( 1 ) " " " "    " "     "*" " "    " " " "   " "
## 2  ( 1 ) " " " "    " "     "*" " "    " " " "   "*"
## 3  ( 1 ) "*" " "    " "     "*" " "    " " " "   "*"
## 4  ( 1 ) "*" " "    "*"     " " " "    " " "*"   "*"
## 5  ( 1 ) "*" " "    "*"     " " " "    "*" "*"   "*"
## 6  ( 1 ) "*" " "    "*"     " " "*"    "*" "*"   "*"
## 7  ( 1 ) "*" "*"    "*"     " " "*"    "*" "*"   "*"
## 8  ( 1 ) "*" "*"    "*"     "*" "*"    "*" "*"   "*"

The object summary(bFat) stores all of this information. In particular, summary(bFat)$which gives a matrix of logical values, with a column for each variable and a row for each size k; each logical indicates whether the variable is contained in the best model of that size.
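For example, here is a sketch of pulling out the names of the variables in the best model of size 3 (row 3 of the matrix, dropping the intercept column):

inBest3 <- summary(bFat)$which[3, -1] # logical vector over the 7 variables
names(inBest3)[inBest3]
## [1] "WEIGHT"  "ABDOMEN" "THIGH"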

Here we illustrate the idea of dividing our data into test and training sets. We choose to make our test set a random 10% of the data, and our training set the rest.

set.seed(1249)
nTest<-.1*nrow(body)
whTest<-sample(1:nrow(body),size=nTest)
bodyTest<-body[whTest,]
bodyTrain<-body[-whTest,]

Now we can calculate the prediction error of each of the best models found by regsubsets. For each row of summary(bFat)$which, we fit the model on the training data using only the variables that the row indicates belong in the best model of that size. We then evaluate its performance on the test data; the average squared error on the test data is our estimate of the prediction error.

predError<-apply(summary(bFat)$which[,-1],1,function(x){
    lmObj<-lm(bodyTrain$BODYFAT~.,data=bodyTrain[,-1][,x,drop=FALSE])
    testPred<-predict(lmObj,newdata=bodyTest[,-1])
    mean((bodyTest$BODYFAT-testPred)^2)
})
cat("Predicted error on random 10% of data:\n")
## Predicted error on random 10% of data:
predError
##        1        2        3        4        5        6        7 
## 25.29712 28.86460 27.17047 28.65131 28.96773 28.92292 29.01328

Here we repeat the whole thing, only with a different random test/training split, to show the variability in our estimate of the prediction error (and thus the variability in which model looks "best").

set.seed(19085)
nTest<-.1*nrow(body)
whTest<-sample(1:nrow(body),size=nTest)
bodyTest<-body[whTest,]
bodyTrain<-body[-whTest,]

predError<-apply(summary(bFat)$which[,-1],1,function(x){
    lmObj<-lm(bodyTrain$BODYFAT~.,data=bodyTrain[,-1][,x,drop=FALSE])
    testPred<-predict(lmObj,newdata=bodyTest[,-1])
    mean((bodyTest$BODYFAT-testPred)^2)
})
cat("Predicted error on random 10% of data:\n")
## Predicted error on random 10% of data:
predError
##        1        2        3        4        5        6        7 
## 22.36633 22.58908 22.21784 21.90046 21.99034 21.94618 22.80151

Now we will implement cross-validation, in which we break our data into ten random parts, to compare the prediction error of each of the seven models. We do this by first randomly permuting the indices of the data.

set.seed(78912)
permutation<-sample(1:nrow(body))

Then we assign the first 10% of the permuted data to partition 1, the second 10% to partition 2, and so on. The object folds we create here simply divides the numbers 1 through n into ten consecutive parts (i.e. no randomness). But now if I take permutation[folds==1] I get the indices of the first random partition of the data, permutation[folds==2] corresponds to the second one, and so forth.

folds <- cut(1:nrow(body),breaks=10,labels=FALSE)
#Perform 10 fold cross validation
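As a quick check of this bookkeeping, we can peek at the indices in the first partition and confirm that the folds are roughly equal in size:

head(permutation[folds == 1]) # indices of the first random partition
table(folds)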

Now we run a for-loop, so that each partition takes a turn as the test data, with the rest used for training. predErrorMat is a matrix we set up to hold the results of the for-loop: each row corresponds to one of the partitions, and each column corresponds to a model.

predErrorMat<-matrix(nrow=10,ncol=nrow(summary(bFat)$which))
for(i in 1:10){
    #Segment the data by fold using the which() function 
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- body[permutation,][testIndexes, ]
    trainData <- body[permutation,][-testIndexes, ]
    #Fit each best model on the training partition and evaluate it on the test partition
    predError<-apply(summary(bFat)$which[,-1],1,function(x){
        lmObj<-lm(trainData$BODYFAT~.,data=trainData[,-1][,x,drop=FALSE])
        testPred<-predict(lmObj,newdata=testData[,-1])
        mean((testData$BODYFAT-testPred)^2)
    })
    predErrorMat[i,]<-predError
}
predErrorMat
##           [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
##  [1,] 18.72568 10.95537 11.68551 12.16354 11.83839 11.78985 11.93013
##  [2,] 21.41687 21.08760 21.53709 21.06757 21.10223 21.20400 21.62519
##  [3,] 32.47863 21.97477 22.48690 22.50871 22.97452 22.92450 24.05130
##  [4,] 21.05072 20.22509 19.16631 18.82538 18.90923 18.89133 18.94164
##  [5,] 26.47937 22.92690 23.76934 26.13180 26.17794 26.12684 26.28473
##  [6,] 26.60945 23.35274 22.06232 22.06825 22.15430 23.10201 25.29325
##  [7,] 25.65426 20.48995 19.95947 19.82442 19.53618 19.97744 20.29104
##  [8,] 17.54916 18.79081 18.14251 17.67780 17.74409 17.67456 17.71624
##  [9,] 33.52443 27.26399 25.83256 26.87850 27.80847 28.32894 28.41455
## [10,] 18.64271 14.11973 14.05815 14.53730 14.42609 14.36767 14.57028

To get the cross-validation estimate of the expected prediction error, we average across the 10 folds.

colMeans(predErrorMat)
## [1] 24.21313 20.11870 19.87002 20.16833 20.26714 20.43871 20.91184

Leave-one-out CV uses each individual observation as a test set. For regression, it can be calculated directly from the residuals and leverages of the linear model fit to the full data. I give a function here that calculates it (you don't need to understand why it works).
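For reference, the function relies on the standard leave-one-out identity: if \(e_i\) are the residuals of the fit on all \(n\) observations and \(h_{ii}\) are the leverage values (the diagonal of the hat matrix, returned by lm.influence), then

\[ \text{LOOCV} = \frac{1}{n}\sum_{i=1}^n \left(\frac{e_i}{1-h_{ii}}\right)^2 . \]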

LOOCV<-function(lm){
    vals<-residuals(lm)/(1-lm.influence(lm)$hat)
    sum(vals^2)/length(vals)
}
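For example, applied to the full bodyfat model:

LOOCV(ft)
## [1] 20.62801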

Now I create a function that calculates all of these criteria of interest for a submodel. y should be the response and dataset should be the data.frame with all of the p variables. Either the argument x or the argument lmObj should be given to define which submodel to use. If x is given, it is assumed to be a logical vector, in the same order as the columns of dataset, defining which variables to use, and the function will run lm on that submodel; if lmObj is given, it should be the result of running lm on the submodel.

calculateCriterion<-function(x=NULL,y,dataset,lmObj=NULL){
    # dataset contains only the explanatory variables
    # x is a vector of logicals, with length equal to the number of explanatory
    #   variables in dataset, telling us which variables to keep
    # either x or lmObj must be given to specify the smaller lm model
    # sigma2 is the variance estimate from the model on the full dataset
    sigma2=summary(lm(y~.,data=dataset))$sigma^2
    if(is.null(lmObj)) lmObj<-lm(y ~ ., data=dataset[,x,drop=FALSE])
    sumlmObj<-summary(lmObj)
    n<-nrow(dataset)
    # number of variables in the submodel, not counting the intercept
    # (counting coefficients also works when only lmObj is given and x is NULL)
    p<-length(coef(lmObj))-1
    RSS<-sumlmObj$sigma^2*(n-p-1)
    c(R2=sumlmObj$r.squared,
        R2adj=sumlmObj$adj.r.squared,
        "RSS/n"=RSS/n,
        LOOCV=LOOCV(lmObj),
        Cp=RSS/n+2*sigma2*(p+1)/n,
        CpAlt=RSS/sigma2-n+2*(p+1),
        AIC=AIC(lmObj), # n*log(RSS/n) + 2*p + constant
        BIC=BIC(lmObj) # n*log(RSS/n) + p*log(n) + constant
    )
}

Here we apply this function to the best model of each size found for the car seat position data.

cat("Criterion for the 8 best k-sized models of car seat position:\n")
## Criterion for the 8 best k-sized models of car seat position:
critSeat<-apply(summary(bSeat)$which[,-1],1,calculateCriterion,
    y=seatpos$hipcenter,
    dataset=seatpos[,-9])
critSeat<-t(critSeat)
critSeat
##          R2     R2adj    RSS/n    LOOCV       Cp      CpAlt      AIC      BIC
## 1 0.6382850 0.6282374 1253.047 1387.644 1402.818 -0.5342143 384.9060 389.8188
## 2 0.6594117 0.6399496 1179.860 1408.696 1404.516 -0.4888531 384.6191 391.1694
## 3 0.6814159 0.6533055 1103.634 1415.652 1403.175 -0.5246725 384.0811 392.2691
## 4 0.6848577 0.6466586 1091.711 1456.233 1466.137  1.1568934 385.6684 395.4939
## 5 0.6861644 0.6371276 1087.184 1548.041 1536.496  3.0359952 387.5105 398.9736
## 6 0.6864310 0.6257403 1086.261 1739.475 1610.457  5.0113282 389.4782 402.5789
## 7 0.6865154 0.6133690 1085.968 1911.701 1685.051  7.0035240 391.4680 406.2062
## 8 0.6865535 0.6000855 1085.836 1975.415 1759.804  9.0000000 393.4634 409.8392

Here we apply the function to the best model of each size for the body fat data, adding the 10-fold CV estimates computed earlier as an extra column.

cat("\nCriterion for the 7 best k-sized models of body fat:\n")
## 
## Criterion for the 7 best k-sized models of body fat:
critBody<-apply(summary(bFat)$which[,-1],1,calculateCriterion,
    y=body$BODYFAT,
    dataset=body[,-1])
critBody<-t(critBody)
critBody<-cbind(critBody,CV10=colMeans(predErrorMat))
critBody
##          R2     R2adj    RSS/n    LOOCV       Cp     CpAlt      AIC      BIC
## 1 0.6616721 0.6603188 23.60104 24.30696 23.91374 53.901272 1517.790 1528.379
## 2 0.7187981 0.7165395 19.61605 20.27420 20.08510  4.925819 1473.185 1487.302
## 3 0.7234261 0.7200805 19.29321 20.07151 19.91861  2.796087 1471.003 1488.650
## 4 0.7249518 0.7204976 19.18678 20.13848 19.96853  3.434662 1471.609 1492.785
## 5 0.7263716 0.7208100 19.08774 20.21249 20.02584  4.167779 1472.305 1497.011
## 6 0.7265595 0.7198630 19.07463 20.34676 20.16908  6.000069 1474.132 1502.367
## 7 0.7265596 0.7187150 19.07463 20.62801 20.32542  8.000000 1476.132 1507.896
##       CV10
## 1 24.21313
## 2 20.11870
## 3 19.87002
## 4 20.16833
## 5 20.26714
## 6 20.43871
## 7 20.91184

The function step will run a stepwise regression based on the AIC. The input object should be the full linear regression model (i.e. with all of the variables). Here we set the argument trace=0 to keep it from printing out the path it takes; note that this isn't the default. It can be useful to see the printout, which shows you the variables that are added or removed at each step, but for reports it can be good to suppress it. In the end, the output returned by the function is just the linear model object from running lm on the final choice of variables.

outBody<-step(ft,trace=0, direction="both")
outBody
## 
## Call:
## lm(formula = BODYFAT ~ WEIGHT + ABDOMEN + THIGH, data = body)
## 
## Coefficients:
## (Intercept)       WEIGHT      ABDOMEN        THIGH  
##    -52.9631      -0.1828       0.9919       0.2190

When given only the fitted model, step by default only removes variables (analogous to backward elimination). If you want it to be able to add variables back as well, you can set the direction argument, as we have done here. We compare the results of the linear model found by step to the best models found by looking at all subsets.

# results from best of size k
summary(bFat)$out
##          AGE WEIGHT HEIGHT CHEST ABDOMEN HIP THIGH
## 1  ( 1 ) " " " "    " "    " "   "*"     " " " "  
## 2  ( 1 ) " " "*"    " "    " "   "*"     " " " "  
## 3  ( 1 ) " " "*"    " "    " "   "*"     " " "*"  
## 4  ( 1 ) " " "*"    " "    " "   "*"     "*" "*"  
## 5  ( 1 ) " " "*"    "*"    " "   "*"     "*" "*"  
## 6  ( 1 ) "*" "*"    "*"    " "   "*"     "*" "*"  
## 7  ( 1 ) "*" "*"    "*"    "*"   "*"     "*" "*"
# results from criterion of best of each size k
critBody
##          R2     R2adj    RSS/n    LOOCV       Cp     CpAlt      AIC      BIC
## 1 0.6616721 0.6603188 23.60104 24.30696 23.91374 53.901272 1517.790 1528.379
## 2 0.7187981 0.7165395 19.61605 20.27420 20.08510  4.925819 1473.185 1487.302
## 3 0.7234261 0.7200805 19.29321 20.07151 19.91861  2.796087 1471.003 1488.650
## 4 0.7249518 0.7204976 19.18678 20.13848 19.96853  3.434662 1471.609 1492.785
## 5 0.7263716 0.7208100 19.08774 20.21249 20.02584  4.167779 1472.305 1497.011
## 6 0.7265595 0.7198630 19.07463 20.34676 20.16908  6.000069 1474.132 1502.367
## 7 0.7265596 0.7187150 19.07463 20.62801 20.32542  8.000000 1476.132 1507.896
##       CV10
## 1 24.21313
## 2 20.11870
## 3 19.87002
## 4 20.16833
## 5 20.26714
## 6 20.43871
## 7 20.91184

We repeat the stepwise selection for the car seat data.

outCarseat<-step(lmSeat,trace=0, direction="both")
outCarseat
## 
## Call:
## lm(formula = hipcenter ~ Age + HtShoes + Leg, data = seatpos)
## 
## Coefficients:
## (Intercept)          Age      HtShoes          Leg  
##    456.2137       0.5998      -2.3023      -6.8297

And we again compare it to the best models we found from all subsets.

# results from best of size k
summary(bSeat)$out
##          Age Weight HtShoes Ht  Seated Arm Thigh Leg
## 1  ( 1 ) " " " "    " "     "*" " "    " " " "   " "
## 2  ( 1 ) " " " "    " "     "*" " "    " " " "   "*"
## 3  ( 1 ) "*" " "    " "     "*" " "    " " " "   "*"
## 4  ( 1 ) "*" " "    "*"     " " " "    " " "*"   "*"
## 5  ( 1 ) "*" " "    "*"     " " " "    "*" "*"   "*"
## 6  ( 1 ) "*" " "    "*"     " " "*"    "*" "*"   "*"
## 7  ( 1 ) "*" "*"    "*"     " " "*"    "*" "*"   "*"
## 8  ( 1 ) "*" "*"    "*"     "*" "*"    "*" "*"   "*"
# results from criterion of best of each size k
critSeat
##          R2     R2adj    RSS/n    LOOCV       Cp      CpAlt      AIC      BIC
## 1 0.6382850 0.6282374 1253.047 1387.644 1402.818 -0.5342143 384.9060 389.8188
## 2 0.6594117 0.6399496 1179.860 1408.696 1404.516 -0.4888531 384.6191 391.1694
## 3 0.6814159 0.6533055 1103.634 1415.652 1403.175 -0.5246725 384.0811 392.2691
## 4 0.6848577 0.6466586 1091.711 1456.233 1466.137  1.1568934 385.6684 395.4939
## 5 0.6861644 0.6371276 1087.184 1548.041 1536.496  3.0359952 387.5105 398.9736
## 6 0.6864310 0.6257403 1086.261 1739.475 1610.457  5.0113282 389.4782 402.5789
## 7 0.6865154 0.6133690 1085.968 1911.701 1685.051  7.0035240 391.4680 406.2062
## 8 0.6865535 0.6000855 1085.836 1975.415 1759.804  9.0000000 393.4634 409.8392

We can also use the function I created above on this output by giving it to the lmObj argument (rounding the output for readability).

round(calculateCriterion(lmObj=outCarseat,y=seatpos$hipcenter,dataset=seatpos[,-9]),2)
##      R2   R2adj   RSS/n   LOOCV      Cp   CpAlt     AIC     BIC 
##    0.68    0.65 1104.15 1412.61 1403.69   -0.51  384.10  392.29