We go through some example datasets to illustrate the utility of the regression model.
Set the working directory (you will need to change this to the directory on your machine that contains the data).
dataDir<-"../finalDataSets"
The following is housing price data from Ames, Iowa. The full dataset is available at https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt and a description of the full dataset is at https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt. We will be working with a slightly smaller version of the full dataset. We read in the dataset:
dd = read.csv(file.path(dataDir, "Ames_Short.csv"), header = T,stringsAsFactors=TRUE)
To explore the data, we make a pairs plot.
pairs(dd)
The next dataset contains daily rentals from a bike sharing company. The dataset can also be found at https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. We again read in the data and convert some of the variables to factors.
bike <- read.csv(file.path(dataDir, "DailyBikeSharingDataset.csv"),stringsAsFactors=TRUE)
bike$yr<-factor(bike$yr)
bike$mnth<-factor(bike$mnth)
We do a pairs plot of only the continuous variables.
pairs(bike[,11:16])
The final dataset is the bodyfat dataset which gives estimates of the percentage of body fat determined by underwater weighing and various body circumference measurements for 252 men. Further explanation of the dataset can be found at http://lib.stat.cmu.edu/datasets/bodyfat
body = read.csv(file.path(dataDir, "bodyfat_short.csv"), header = T,stringsAsFactors=TRUE)
pairs(body)
We can see that there are outliers in the data, which we can try to identify by their values on specific variables.
ou1 = which(body$HEIGHT < 30)
ou2 = which(body$WEIGHT > 300)
ou3 = which(body$HIP > 120)
ou = c(ou1, ou2, ou3)
We replot the data after removing these outliers and observe that we can see the data much better.
pairs(body[-ou,])
Here we make a pairs plot of the response together with two highly correlated explanatory variables.
pairs(body[,c("BODYFAT","THIGH","ABDOMEN")])
To understand what a coefficient in the regression means, we can consider a coplot, which divides up the data. It takes a formula like lm, but also includes a condition |, meaning that BODYFAT ~ THIGH will be plotted for different ranges of ABDOMEN.
coplot(BODYFAT~THIGH|ABDOMEN,data=body)
The reverse looks quite different, showing that the plot is not symmetric.
coplot(BODYFAT~ABDOMEN | THIGH,data=body)
Here, we plot the response (body fat pct) against all the predictors.
par(mfrow = c(3, 3))
for(i in 2:8)
{
plot(body[-ou,i], body[-ou,1], xlab = names(body)[i], ylab = "BODYFAT")
}
par(mfrow = c(1, 1))
Now we fit our linear model with all of the variables:
ft = lm(BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + ABDOMEN + HIP + THIGH, data = body)
summary(ft)
##
## Call:
## lm(formula = BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + ABDOMEN +
## HIP + THIGH, data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0729 -3.2387 -0.0782 3.0623 10.3611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.748e+01 1.449e+01 -2.585 0.01031 *
## AGE 1.202e-02 2.934e-02 0.410 0.68246
## WEIGHT -1.392e-01 4.509e-02 -3.087 0.00225 **
## HEIGHT -1.028e-01 9.787e-02 -1.051 0.29438
## CHEST -8.312e-04 9.989e-02 -0.008 0.99337
## ABDOMEN 9.685e-01 8.531e-02 11.352 < 2e-16 ***
## HIP -1.834e-01 1.448e-01 -1.267 0.20648
## THIGH 2.857e-01 1.362e-01 2.098 0.03693 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.438 on 244 degrees of freedom
## Multiple R-squared: 0.7266, Adjusted R-squared: 0.7187
## F-statistic: 92.62 on 7 and 244 DF, p-value: < 2.2e-16
If we want to use all of the variables in a dataset, we can use a simpler notation, that keeps us from having to write out all of the variable names:
ft = lm(BODYFAT ~ . , data = body)
Even if we want to use all but a few variables, it can be easier to use this notation and just subset the dataset, e.g. data = body[, -4].
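For example, a minimal sketch of this approach that drops the 4th column (HEIGHT in this dataset) while keeping the dot notation; ftNoHeight is just an illustrative name.
# fit on all variables except the 4th column (HEIGHT)
ftNoHeight = lm(BODYFAT ~ ., data = body[, -4])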
Here we repeat the linear regression, but after scaling the explanatory variables. Now the coefficients are comparable: each gives the change in the response for a one standard deviation change in the corresponding variable. Notice that we have to leave out the response variable before scaling.
tempBody<-body
tempBody[,-1]<-scale(tempBody[,-1])
ftScale = lm(BODYFAT ~ . , data = tempBody)
cat("Coefficients with variables scaled:\n")
## Coefficients with variables scaled:
coef(ftScale)
## (Intercept) AGE WEIGHT HEIGHT CHEST ABDOMEN
## 19.15079365 0.15143812 -4.09098792 -0.37671913 -0.00700714 10.44300051
## HIP THIGH
## -1.31360120 1.50003073
cat("Coefficients on original scale:\n")
## Coefficients on original scale:
coef(ft)
## (Intercept) AGE WEIGHT HEIGHT CHEST
## -3.747573e+01 1.201695e-02 -1.392006e-01 -1.028485e-01 -8.311678e-04
## ABDOMEN HIP THIGH
## 9.684620e-01 -1.833599e-01 2.857227e-01
sdVar<-apply(body[,-1],2,sd)
cat("Sd per variable:\n")
## Sd per variable:
sdVar
## AGE WEIGHT HEIGHT CHEST ABDOMEN HIP THIGH
## 12.602040 29.389160 3.662856 8.430476 10.783077 7.164058 5.249952
cat("Ratio of scaled lm coefficient to original lm coefficient\n")
## Ratio of scaled lm coefficient to original lm coefficient
coef(ftScale)[-1]/coef(ft)[-1]
## AGE WEIGHT HEIGHT CHEST ABDOMEN HIP THIGH
## 12.602040 29.389160 3.662856 8.430476 10.783077 7.164058 5.249952
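The ratios are exactly the per-variable standard deviations, which we can confirm directly; a quick check:
# the ratio of scaled to unscaled coefficients equals each variable's standard deviation
all.equal(unname(coef(ftScale)[-1] / coef(ft)[-1]), unname(sdVar))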
We find the correlations among the hip, thigh, abdomen, and chest circumference variables.
cor(body[,c("HIP","THIGH","ABDOMEN","CHEST")])
## HIP THIGH ABDOMEN CHEST
## HIP 1.0000000 0.8964098 0.8740662 0.8294199
## THIGH 0.8964098 1.0000000 0.7666239 0.7298586
## ABDOMEN 0.8740662 0.7666239 1.0000000 0.9158277
## CHEST 0.8294199 0.7298586 0.9158277 1.0000000
We change our regression, and now drop the variables ABDOMEN and THIGH. We see that this affects the coefficients of the variables that remain.
ft1 = lm(BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + HIP, data = body)
# Including abdomen
round(coef(ft),4)
## (Intercept) AGE WEIGHT HEIGHT CHEST ABDOMEN
## -37.4757 0.0120 -0.1392 -0.1028 -0.0008 0.9685
## HIP THIGH
## -0.1834 0.2857
# Excluding abdomen
round(coef(ft1),4)
## (Intercept) AGE WEIGHT HEIGHT CHEST HIP
## -53.9871 0.1290 -0.0526 -0.3146 0.5148 0.4697
To find the fitted values in R we can use the function fitted, applied to the lm output that we saved earlier:
head(fitted(ft))
## 1 2 3 4 5 6
## 16.32670 10.22019 18.42600 11.89502 25.97564 16.28529
Here we plot the fitted values against the response.
plot(fitted(ft), body$BODYFAT, xlab = "Fitted Values", ylab = "Bodyfat Percentage")
The squared correlation of these two variables is \(R^2\), the Coefficient of Determination or Multiple \(R^2\), which is reported in the summary of the lm object:
cor(body$BODYFAT, fitted(ft))^2
## [1] 0.7265596
summary(ft)
##
## Call:
## lm(formula = BODYFAT ~ ., data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0729 -3.2387 -0.0782 3.0623 10.3611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.748e+01 1.449e+01 -2.585 0.01031 *
## AGE 1.202e-02 2.934e-02 0.410 0.68246
## WEIGHT -1.392e-01 4.509e-02 -3.087 0.00225 **
## HEIGHT -1.028e-01 9.787e-02 -1.051 0.29438
## CHEST -8.312e-04 9.989e-02 -0.008 0.99337
## ABDOMEN 9.685e-01 8.531e-02 11.352 < 2e-16 ***
## HIP -1.834e-01 1.448e-01 -1.267 0.20648
## THIGH 2.857e-01 1.362e-01 2.098 0.03693 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.438 on 244 degrees of freedom
## Multiple R-squared: 0.7266, Adjusted R-squared: 0.7187
## F-statistic: 92.62 on 7 and 244 DF, p-value: < 2.2e-16
We can also pull out the residuals with the function residuals:
head(residuals(ft))
## 1 2 3 4 5 6
## -4.026695 -4.120189 6.874004 -1.495017 2.724355 4.614712
We can then make the standard plot of residuals against the fitted values:
plot(fitted(ft), residuals(ft), xlab = "Fitted Values", ylab = "Residuals")
We can also plot the residuals against each of the explanatory variables:
par(mfrow = c(3, 3))
for(i in 2:8)
{
plot(body[,i], ft$residuals, xlab = names(body)[i], ylab = "Residuals")
}
par(mfrow = c(1, 1))
Note here the presence of the outliers.
We can look at the \(R^2\) after removing the variable ABDOMEN. Notice how I can pull out the information shown in summary to access it (here I use summary(ft.1)$r.squared to get the \(R^2\); similarly summary(ft.1)$coef would pull out the matrix of coefficients with the p-values).
ft.1 = lm(BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + HIP + THIGH, data = body)
r2.ft1 = summary(ft.1)$r.squared
r2.ft1
## [1] 0.5821305
Compare this to the \(R^2\) of the full regression.
summary(ft)$r.squared
## [1] 0.7265596
We will now bring in the scorecard data on colleges from 2016.
scorecard <- read.csv(file.path(dataDir, "college.csv"),stringsAsFactors=TRUE)
Here we fit our lm treating the CONTROL variable (public/private/private for-profit) as numeric.
req.bad = lm(RET_FT4~ TUITIONFEE_OUT + CONTROL, data = scorecard)
summary(req.bad)
##
## Call:
## lm(formula = RET_FT4 ~ TUITIONFEE_OUT + CONTROL, data = scorecard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69041 -0.04915 0.00516 0.05554 0.33165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.661e-01 9.265e-03 71.90 <2e-16 ***
## TUITIONFEE_OUT 9.405e-06 3.022e-07 31.12 <2e-16 ***
## CONTROL -8.898e-02 5.741e-03 -15.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08741 on 1238 degrees of freedom
## Multiple R-squared: 0.4391, Adjusted R-squared: 0.4382
## F-statistic: 484.5 on 2 and 1238 DF, p-value: < 2.2e-16
We can check that R does treat CONTROL as a numeric variable with is.numeric:
is.numeric(scorecard$CONTROL)
## [1] TRUE
Instead we need to treat this variable as a factor.
req = lm(RET_FT4~ TUITIONFEE_OUT + as.factor(CONTROL), data = scorecard)
summary(req)
##
## Call:
## lm(formula = RET_FT4 ~ TUITIONFEE_OUT + as.factor(CONTROL), data = scorecard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68856 -0.04910 0.00505 0.05568 0.33150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.765e-01 7.257e-03 79.434 < 2e-16 ***
## TUITIONFEE_OUT 9.494e-06 3.054e-07 31.090 < 2e-16 ***
## as.factor(CONTROL)2 -9.204e-02 5.948e-03 -15.474 < 2e-16 ***
## as.factor(CONTROL)3 -1.218e-01 3.116e-02 -3.909 9.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08732 on 1237 degrees of freedom
## Multiple R-squared: 0.4408, Adjusted R-squared: 0.4394
## F-statistic: 325 on 3 and 1237 DF, p-value: < 2.2e-16
It is generally better, however, to actually fix it in your data.frame as a factor. That way you won't accidentally mess up, and all the various R commands you might use (plot, summary, etc.) will treat it correctly.
scorecard$CONTROL<-factor(scorecard$CONTROL,levels=c(1,2,3),labels=c("public","private","private for-profit"))
Now the regression will just know it should be a factor.
req = lm(RET_FT4~ TUITIONFEE_OUT + CONTROL, data = scorecard)
summary(req)
##
## Call:
## lm(formula = RET_FT4 ~ TUITIONFEE_OUT + CONTROL, data = scorecard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68856 -0.04910 0.00505 0.05568 0.33150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.765e-01 7.257e-03 79.434 < 2e-16 ***
## TUITIONFEE_OUT 9.494e-06 3.054e-07 31.090 < 2e-16 ***
## CONTROLprivate -9.204e-02 5.948e-03 -15.474 < 2e-16 ***
## CONTROLprivate for-profit -1.218e-01 3.116e-02 -3.909 9.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08732 on 1237 degrees of freedom
## Multiple R-squared: 0.4408, Adjusted R-squared: 0.4394
## F-statistic: 325 on 3 and 1237 DF, p-value: < 2.2e-16
Now we add an interaction between the continuous variable (TUITIONFEE_OUT) and the categorical one, meaning we will get a different slope for each group. Interactions are given by putting a : between the two variables.
req.1 = lm(RET_FT4~ TUITIONFEE_OUT + CONTROL + TUITIONFEE_OUT:CONTROL, data = scorecard)
summary(req.1)
##
## Call:
## lm(formula = RET_FT4 ~ TUITIONFEE_OUT + CONTROL + TUITIONFEE_OUT:CONTROL,
## data = scorecard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68822 -0.04982 0.00491 0.05555 0.32900
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.814e-01 1.405e-02 41.372 < 2e-16
## TUITIONFEE_OUT 9.240e-06 6.874e-07 13.441 < 2e-16
## CONTROLprivate -9.830e-02 1.750e-02 -5.617 2.4e-08
## CONTROLprivate for-profit -2.863e-01 1.568e-01 -1.826 0.0681
## TUITIONFEE_OUT:CONTROLprivate 2.988e-07 7.676e-07 0.389 0.6971
## TUITIONFEE_OUT:CONTROLprivate for-profit 7.215e-06 6.716e-06 1.074 0.2829
##
## (Intercept) ***
## TUITIONFEE_OUT ***
## CONTROLprivate ***
## CONTROLprivate for-profit .
## TUITIONFEE_OUT:CONTROLprivate
## TUITIONFEE_OUT:CONTROLprivate for-profit
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08734 on 1235 degrees of freedom
## Multiple R-squared: 0.4413, Adjusted R-squared: 0.4391
## F-statistic: 195.1 on 5 and 1235 DF, p-value: < 2.2e-16
Alternatively, and more commonly, you can just write TUITIONFEE_OUT * CONTROL, which is equivalent to TUITIONFEE_OUT + CONTROL + TUITIONFEE_OUT:CONTROL; in other words, you put in both the individual terms and their interaction.
summary(lm(RET_FT4~ TUITIONFEE_OUT * CONTROL, data = scorecard) )
##
## Call:
## lm(formula = RET_FT4 ~ TUITIONFEE_OUT * CONTROL, data = scorecard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68822 -0.04982 0.00491 0.05555 0.32900
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.814e-01 1.405e-02 41.372 < 2e-16
## TUITIONFEE_OUT 9.240e-06 6.874e-07 13.441 < 2e-16
## CONTROLprivate -9.830e-02 1.750e-02 -5.617 2.4e-08
## CONTROLprivate for-profit -2.863e-01 1.568e-01 -1.826 0.0681
## TUITIONFEE_OUT:CONTROLprivate 2.988e-07 7.676e-07 0.389 0.6971
## TUITIONFEE_OUT:CONTROLprivate for-profit 7.215e-06 6.716e-06 1.074 0.2829
##
## (Intercept) ***
## TUITIONFEE_OUT ***
## CONTROLprivate ***
## CONTROLprivate for-profit .
## TUITIONFEE_OUT:CONTROLprivate
## TUITIONFEE_OUT:CONTROLprivate for-profit
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08734 on 1235 degrees of freedom
## Multiple R-squared: 0.4413, Adjusted R-squared: 0.4391
## F-statistic: 195.1 on 5 and 1235 DF, p-value: < 2.2e-16
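A quick check (a sketch, not shown in the output above) that the two formulations give identical coefficient estimates:
# the * formulation matches the explicit main-effects-plus-interaction formulation
all.equal(coef(req.1), coef(lm(RET_FT4 ~ TUITIONFEE_OUT * CONTROL, data = scorecard)))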
Now we fit a linear model for the bike data, with two categorical variables.
md1 = lm(casual ~ atemp + workingday + weathersit, data = bike)
summary(md1)
##
## Call:
## lm(formula = casual ~ atemp + workingday + weathersit, data = bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1456.76 -243.97 -22.93 166.81 1907.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 350.31 55.11 6.357 3.63e-10 ***
## atemp 2333.77 97.48 23.942 < 2e-16 ***
## workingdayYes -794.11 33.95 -23.388 < 2e-16 ***
## weathersitLight Rain/Snow -523.79 95.23 -5.500 5.26e-08 ***
## weathersitMisty -150.79 33.75 -4.468 9.14e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 425.2 on 726 degrees of freedom
## Multiple R-squared: 0.6186, Adjusted R-squared: 0.6165
## F-statistic: 294.3 on 4 and 726 DF, p-value: < 2.2e-16
Now we add a different slope of atemp for each of the workingday levels by putting in an interaction.
md3 = lm(casual ~ atemp + workingday + weathersit + workingday:atemp, data = bike)
summary(md3)
##
## Call:
## lm(formula = casual ~ atemp + workingday + weathersit + workingday:atemp,
## data = bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1709.76 -198.09 -55.12 152.88 1953.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -276.22 77.48 -3.565 0.000388 ***
## atemp 3696.41 155.56 23.762 < 2e-16 ***
## workingdayYes 166.71 94.60 1.762 0.078450 .
## weathersitLight Rain/Snow -520.78 88.48 -5.886 6.05e-09 ***
## weathersitMisty -160.28 31.36 -5.110 4.12e-07 ***
## atemp:workingdayYes -2052.09 190.48 -10.773 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 395.1 on 725 degrees of freedom
## Multiple R-squared: 0.6712, Adjusted R-squared: 0.6689
## F-statistic: 296 on 5 and 725 DF, p-value: < 2.2e-16
Here we create a function that implements a permutation test for the global fit of the regression.
set.seed(147980)
permutationLM <- function(y,data, n.repetitions, STAT=function(lmFit){summary(lmFit)$r.squared}){
# calculate the observed statistic
stat.obs <- STAT(lm(y ~ ., data=data))
#function to do the permutation and return statistic
makePermutedStats<-function(){
sampled <- sample(y)
fit<-lm(sampled ~ ., data=data)
return(STAT(fit))
}
# calculate the permuted statistic
stat.permute <-replicate(n.repetitions,makePermutedStats())
p.value <- sum(stat.permute >= stat.obs) / n.repetitions
return(list(p.value=p.value,observedStat=stat.obs,permutedStats=stat.permute))
}
We now apply it to our bodyfat data, using \(R^2\) as our statistic, and not surprisingly get very significant results.
permOut<-permutationLM(body$BODYFAT,data=body[,-1],n.repetitions=1000)
hist(permOut$permutedStats,breaks=50)
permOut[1:2]
## $p.value
## [1] 0
##
## $observedStat
## [1] 0.7265596
We can also use the \(F\) statistic and compare the permutation distribution to the \(F\) distribution. The \(F\) distribution is an approximation of the permutation distribution for large sample sizes, so not surprisingly they are similar.
n<-nrow(body)
p<-ncol(body)-1
permOutF<-permutationLM(body$BODYFAT,data=body[,-1],n.repetitions=1000, STAT=function(lmFit){summary(lmFit)$fstatistic["value"]})
hist(permOutF$permutedStats,freq=FALSE,breaks=50)
curve(df(x,df1=p,df2=n-p-1),add=TRUE, main=paste("F(",p,",",n-p-1,") distribution"))
permOutF[1:2]
## $p.value
## [1] 0
##
## $observedStat
## value
## 92.61904
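For comparison, here is a sketch of the corresponding parametric p-value based on the \(F\) distribution, using the observed statistic and the n and p defined above:
# parametric p-value from the F(p, n-p-1) distribution for the observed F statistic
pf(permOutF$observedStat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)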
We now turn to bootstrap confidence intervals. The following function creates bootstrap confidence intervals for the linear regression, and is quite similar to the one I used for simple regression.
bootstrapLM <- function(y,x, repetitions, confidence.level=0.95){
# calculate the observed statistics
stat.obs <- coef(lm(y~., data=x))
# calculate the bootstrapped statistics
bootFun<-function(){
sampled <- sample(1:length(y), size=length(y),replace = TRUE)
coef(lm(y[sampled]~.,data=x[sampled,])) #small correction here to make it for a matrix x
}
stat.boot<-replicate(repetitions,bootFun())
# nm <-deparse(substitute(x))
# row.names(stat.boot)[2]<-nm
level<-1-confidence.level
confidence.interval <- apply(stat.boot,1,quantile,probs=c(level/2,1-level/2))
return(list(confidence.interval = cbind("lower"=confidence.interval[1,],"estimate"=stat.obs,"upper"=confidence.interval[2,]), bootStats=stat.boot))
}
Now we apply it to the Bodyfat dataset.
bodyBoot<-with(body,bootstrapLM(y=BODYFAT,x=body[,-1],repetitions=10000))
bodyBoot$conf
## lower estimate upper
## (Intercept) -75.68776383 -3.747573e+01 -3.84419402
## AGE -0.03722018 1.201695e-02 0.06645578
## WEIGHT -0.24629552 -1.392006e-01 -0.02076377
## HEIGHT -0.41327145 -1.028485e-01 0.28042319
## CHEST -0.25876131 -8.311678e-04 0.20995486
## ABDOMEN 0.81115069 9.684620e-01 1.13081481
## HIP -0.46808557 -1.833599e-01 0.10637834
## THIGH 0.02272414 2.857227e-01 0.56054626
We can plot the confidence intervals (excluding the intercept) with plotCI from the gplots package:
require(gplots)
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 4.0.2
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
# Do not include the intercept
with(bodyBoot,plotCI(confidence.interval[-1,"estimate"],ui=confidence.interval[-1,"upper"],li=confidence.interval[-1,"lower"],xaxt="n"))
axis(side=1,at=1:(nrow(bodyBoot$conf)-1),rownames(bodyBoot$conf)[-1])
cat("Ratio of p/n in body fat: ",ncol(body)/nrow(body),"\n")
## Ratio of p/n in body fat: 0.03174603
We can also look at the correlation between the estimated coefficients, and between the variables themselves, using heatmaps from the pheatmap package.
library(pheatmap)
pheatmap(summary(ft,correlation=TRUE)$corr[-1,-1],breaks=seq(-1,1,length=100), main="Correlation of the estimated coefficients")
pheatmap(cor(body[,-1]),breaks=seq(-1,1,length=100), main="Correlation of the variables")
We can create confidence intervals for the prediction using the predict function. We first have to create a data.frame in the same format as the data.frame that was used in our lm model (i.e. with the same variable names).
x0 = data.frame(AGE = 30, WEIGHT = 180, HEIGHT = 70, CHEST = 95, ABDOMEN = 90, HIP = 100, THIGH = 60)
Now we can create confidence intervals for the average predicted value, with interval = "confidence"
predict(ft, x0, interval = "confidence")
## fit lwr upr
## 1 16.51927 15.20692 17.83162
For prediction intervals, i.e. the expected range of values for an individual, we set interval = "prediction"
predict(ft, x0, interval = "prediction")
## fit lwr upr
## 1 16.51927 7.678715 25.35983
If we run plot on our output from lm, it will automatically make four diagnostic plots. Note that if I don't set par(mfrow = c(2, 2)), it will go into “interactive” mode, and expect you to hit return to see the next plot.
par(mfrow = c(2, 2))
plot(ft)
par(mfrow = c(1, 1))
We can also choose to make only some of these plots with the which argument. There are six plots to choose from, even though only four are done by default.
par(mfrow=c(1,2))
plot(ft,which=c(1,3))
Here we do the same plots for the bike data (we refit the model just as a reminder of what it was):
md1 = lm(casual ~ atemp + workingday + weathersit, data = bike)
par(mfrow=c(1,2))
plot(md1,which=c(1,3))
Since there is heteroskedasticity, we consider transformations of our response (log and sqrt):
mdLog = lm(log(casual) ~ atemp + workingday + weathersit, data = bike)
mdSqrt= lm(sqrt(casual) ~ atemp + workingday + weathersit, data = bike)
par(mfrow=c(2,2))
plot(mdLog,which=1,main="Log")
plot(mdSqrt,which=1,main="Sqrt")
plot(mdLog,which=3,main="Log")
plot(mdSqrt,which=3,main="Sqrt")
# Here we plot only the QQ plot
par(mfrow=c(1,1))
plot(ft,which=2)
# And the QQ plot for the different bike models
par(mfrow=c(2,2))
plot(md1,which=2,main="Original (counts)")
plot(mdLog,which=2,main="Log")
plot(mdSqrt,which=2,main="Sqrt")
Plots 4 and 5 are useful for detecting outliers. Plot 4 shows Cook's distance, highlighting observations that would change the fit a lot if removed. Plot 5 shows the residuals versus the leverage of the observations, which gives similar information.
par(mfrow=c(1,2))
plot(ft,which=c(4:5))
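If we want the numerical values rather than reading them off the plot, cooks.distance gives the statistic plotted in plot 4; a minimal sketch listing the three largest:
# the three observations with the largest Cook's distance
head(sort(cooks.distance(ft), decreasing = TRUE), 3)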
Here we look at these outlying points seen in the plots.
whOut<-c(39, 42, 36)
cat("High leverage points:\n")
## High leverage points:
body[whOut,]
## BODYFAT AGE WEIGHT HEIGHT CHEST ABDOMEN HIP THIGH
## 39 35.2 46 363.15 72.25 136.2 148.1 147.7 87.3
## 42 32.9 44 205.00 29.50 106.0 104.3 115.5 70.6
## 36 40.1 49 191.75 65.00 118.5 113.1 113.8 61.9
cat("Mean of each variables:\n")
## Mean of each variables:
apply(body, 2, mean)
## BODYFAT AGE WEIGHT HEIGHT CHEST ABDOMEN HIP THIGH
## 19.15079 44.88492 178.92440 70.14881 100.82421 92.55595 99.90476 59.40595
We can do a PCA of the explanatory variables and see that these points are separated. Notice how I can use the text function to plot character values, in this case the indices of the outlier points, instead of points. To not have a point also drawn for these observations, I first made a blank plot by choosing type="n", and then drew points for all observations except the outliers. Then I used text to plot the indices of the outlier points. The blank plot is important, because it makes sure that the limits of the plot will be big enough for all the points; if I just drew the plot excluding the outliers (e.g. plot(bodyPca$x[-whOut,1:2])), the boundaries wouldn't be big enough to show the outlying points when you add them.
bodyPca<-prcomp(body[,-1])
plot(bodyPca$x[,1:2],type="n")
points(bodyPca$x[-whOut,1:2])
text(bodyPca$x[whOut,1:2],labels=whOut,col="red",pch=19)
We can also do a pairs plot that highlights these points in a similar way, only putting the plotting commands in the panel argument. Because it's in a panel function, you don't need the blank plot, and you use only plotting commands that add to an existing plot (like points and text).
pairs(body,panel=function(x,y){
points(x[-whOut],y[-whOut])
text(x[whOut],y[whOut],labels=whOut)})
I can run the linear model excluding those observations and compare the coefficients. Notice how I use summary(ftNoOut)$coef to get the full table of coefficients, including the p-values.
ftNoOut<-lm(BODYFAT~.,data=body[-whOut,])
cat("Coefficients without outliers:\n")
## Coefficients without outliers:
round(summary(ftNoOut)$coef,3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.902 20.297 -1.128 0.260
## AGE 0.021 0.029 0.717 0.474
## WEIGHT -0.074 0.059 -1.271 0.205
## HEIGHT -0.241 0.187 -1.288 0.199
## CHEST -0.121 0.113 -1.065 0.288
## ABDOMEN 0.945 0.088 10.709 0.000
## HIP -0.171 0.152 -1.124 0.262
## THIGH 0.223 0.141 1.584 0.114
cat("\nCoefficients in Original Model:\n")
##
## Coefficients in Original Model:
round(summary(ft)$coef,3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.476 14.495 -2.585 0.010
## AGE 0.012 0.029 0.410 0.682
## WEIGHT -0.139 0.045 -3.087 0.002
## HEIGHT -0.103 0.098 -1.051 0.294
## CHEST -0.001 0.100 -0.008 0.993
## ABDOMEN 0.968 0.085 11.352 0.000
## HIP -0.183 0.145 -1.267 0.206
## THIGH 0.286 0.136 2.098 0.037
We first introduce the car seat position dataset, which is provided as part of the package faraway. We fit a basic linear model on this dataset and look at the summary of the model.
library(faraway)
## Warning: package 'faraway' was built under R version 4.0.2
data(seatpos)
pairs(seatpos)
lmSeat = lm(hipcenter ~ ., seatpos)
summary(lmSeat)
##
## Call:
## lm(formula = hipcenter ~ ., data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.827 -22.833 -3.678 25.017 62.337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.43213 166.57162 2.620 0.0138 *
## Age 0.77572 0.57033 1.360 0.1843
## Weight 0.02631 0.33097 0.080 0.9372
## HtShoes -2.69241 9.75304 -0.276 0.7845
## Ht 0.60134 10.12987 0.059 0.9531
## Seated 0.53375 3.76189 0.142 0.8882
## Arm -1.32807 3.90020 -0.341 0.7359
## Thigh -1.14312 2.66002 -0.430 0.6706
## Leg -6.43905 4.71386 -1.366 0.1824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.72 on 29 degrees of freedom
## Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001
## F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05
To compare a submodel to a full model using an F-test, I first need to run lm on both of these models:
mod0<-lm(BODYFAT~ABDOMEN+AGE+WEIGHT,data=body)
Then I run the command anova on these two models.
anova(mod0,ft)
## Analysis of Variance Table
##
## Model 1: BODYFAT ~ ABDOMEN + AGE + WEIGHT
## Model 2: BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + ABDOMEN + HIP + THIGH
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 248 4941.3
## 2 244 4806.8 4 134.5 1.7069 0.1491
Note that anova does other kinds of analysis if you give it only one linear model, which is somewhat distinct from this analysis, but we are not going to get into that here.
Here we again do an F-test for a submodel, only now the submodel is missing just a single variable (HEIGHT).
modNoHEIGHT<-lm(BODYFAT~ABDOMEN+AGE+WEIGHT+CHEST+HIP+THIGH,data=body)
anova(modNoHEIGHT,ft)
## Analysis of Variance Table
##
## Model 1: BODYFAT ~ ABDOMEN + AGE + WEIGHT + CHEST + HIP + THIGH
## Model 2: BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + ABDOMEN + HIP + THIGH
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 245 4828.6
## 2 244 4806.8 1 21.753 1.1042 0.2944
We can see that we get the same result as if we did a t-test for the coefficient corresponding to HEIGHT.
summary(ft)
##
## Call:
## lm(formula = BODYFAT ~ ., data = body)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0729 -3.2387 -0.0782 3.0623 10.3611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.748e+01 1.449e+01 -2.585 0.01031 *
## AGE 1.202e-02 2.934e-02 0.410 0.68246
## WEIGHT -1.392e-01 4.509e-02 -3.087 0.00225 **
## HEIGHT -1.028e-01 9.787e-02 -1.051 0.29438
## CHEST -8.312e-04 9.989e-02 -0.008 0.99337
## ABDOMEN 9.685e-01 8.531e-02 11.352 < 2e-16 ***
## HIP -1.834e-01 1.448e-01 -1.267 0.20648
## THIGH 2.857e-01 1.362e-01 2.098 0.03693 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.438 on 244 degrees of freedom
## Multiple R-squared: 0.7266, Adjusted R-squared: 0.7187
## F-statistic: 92.62 on 7 and 244 DF, p-value: < 2.2e-16
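In fact, for a single dropped variable the F statistic is the square of that variable's t statistic in the full model. A quick check (a sketch; the two values match up to the rounding in the printed output):
# the F statistic for dropping HEIGHT equals the square of its t statistic
summary(ft)$coef["HEIGHT", "t value"]^2
anova(modNoHEIGHT, ft)$F[2]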
We can use the function regsubsets in the package leaps to find the best model of size k, for all values of k. Here we apply it to the bodyfat data:
library(leaps)
bFat = regsubsets(BODYFAT ~ ., body)
summary(bFat)
## Subset selection object
## Call: regsubsets.formula(BODYFAT ~ ., body)
## 7 Variables (and intercept)
## Forced in Forced out
## AGE FALSE FALSE
## WEIGHT FALSE FALSE
## HEIGHT FALSE FALSE
## CHEST FALSE FALSE
## ABDOMEN FALSE FALSE
## HIP FALSE FALSE
## THIGH FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
## AGE WEIGHT HEIGHT CHEST ABDOMEN HIP THIGH
## 1 ( 1 ) " " " " " " " " "*" " " " "
## 2 ( 1 ) " " "*" " " " " "*" " " " "
## 3 ( 1 ) " " "*" " " " " "*" " " "*"
## 4 ( 1 ) " " "*" " " " " "*" "*" "*"
## 5 ( 1 ) " " "*" "*" " " "*" "*" "*"
## 6 ( 1 ) "*" "*" "*" " " "*" "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
Here is the same function for the car seat position data.
bSeat = regsubsets(hipcenter ~ ., seatpos)
summary(bSeat)
## Subset selection object
## Call: regsubsets.formula(hipcenter ~ ., seatpos)
## 8 Variables (and intercept)
## Forced in Forced out
## Age FALSE FALSE
## Weight FALSE FALSE
## HtShoes FALSE FALSE
## Ht FALSE FALSE
## Seated FALSE FALSE
## Arm FALSE FALSE
## Thigh FALSE FALSE
## Leg FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## Age Weight HtShoes Ht Seated Arm Thigh Leg
## 1 ( 1 ) " " " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " " " "*" " " " " " " "*"
## 3 ( 1 ) "*" " " " " "*" " " " " " " "*"
## 4 ( 1 ) "*" " " "*" " " " " " " "*" "*"
## 5 ( 1 ) "*" " " "*" " " " " "*" "*" "*"
## 6 ( 1 ) "*" " " "*" " " "*" "*" "*" "*"
## 7 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
The object summary(bFat) stores all of this information. In particular, summary(bFat)$which gives a matrix of logical values with a column for each variable and a row for each value of k. The logical value indicates whether the variable is contained in the best model of size k.
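For example, a quick way to pull out the names of the variables in the best model of size 3 (which the table above shows are WEIGHT, ABDOMEN, and THIGH):
# variables included in the best model with 3 explanatory variables
names(which(summary(bFat)$which[3, -1]))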
Here we illustrate the idea of dividing our data into test and training data. We choose to make our test data a random 10% of the data, and our training dataset the rest.
set.seed(1249)
nTest<-.1*nrow(body)
whTest<-sample(1:nrow(body),size=nTest)
bodyTest<-body[whTest,]
bodyTrain<-body[-whTest,]
Now we can calculate the prediction error of each of the best models found by regsubsets. For each row of summary(bFat)$which, we will fit the model on the training data using only the variables indicated for the best model of that size, and then evaluate its performance on the test data. The average squared error on the test data will be our estimate of predicted error.
predError<-apply(summary(bFat)$which[,-1],1,function(x){
lmObj<-lm(bodyTrain$BODYFAT~.,data=bodyTrain[,-1][,x,drop=FALSE])
testPred<-predict(lmObj,newdata=bodyTest[,-1])
mean((bodyTest$BODYFAT-testPred)^2)
})
cat("Predicted error on random 10% of data:\n")
## Predicted error on random 10% of data:
predError
## 1 2 3 4 5 6 7
## 25.29712 28.86460 27.17047 28.65131 28.96773 28.92292 29.01328
Here we repeat the whole thing, only with a different random test/training split, to show the variability in our estimate of predicted error (and thus variability in which model appears “best”).
set.seed(19085)
nTest<-.1*nrow(body)
whTest<-sample(1:nrow(body),size=nTest)
bodyTest<-body[whTest,]
bodyTrain<-body[-whTest,]
predError<-apply(summary(bFat)$which[,-1],1,function(x){
lmObj<-lm(bodyTrain$BODYFAT~.,data=bodyTrain[,-1][,x,drop=FALSE])
testPred<-predict(lmObj,newdata=bodyTest[,-1])
mean((bodyTest$BODYFAT-testPred)^2)
})
cat("Predicted error on random 10% of data:\n")
## Predicted error on random 10% of data:
predError
## 1 2 3 4 5 6 7
## 22.36633 22.58908 22.21784 21.90046 21.99034 21.94618 22.80151
Now we will implement cross-validation, where we will break our data into 10 random parts, to compare the prediction error of each of these seven models. We will do this by first randomly permuting the indices of the data:
set.seed(78912)
permutation<-sample(1:nrow(body))
Then we will assign the first 10% of the permuted data to partition 1, the second 10% to partition 2, and so on. The object folds we create here simply divides the numbers 1 through n into ten parts (i.e. no randomness). But now if I take permutation[folds==1] I will get the indices of the first random partition of the data, permutation[folds==2] corresponds to the second one, and so forth.
folds <- cut(1:nrow(body),breaks=10,labels=FALSE)
#Perform 10 fold cross validation
Now we will run a for-loop, taking turns making each partition the test data, with the rest as the training data. predErrorMat is a matrix we set up to hold the results of the for-loop: each row will correspond to one of the partitions, and each column will correspond to a model.
predErrorMat<-matrix(nrow=10,ncol=nrow(summary(bFat)$which))
for(i in 1:10){
#Segment the data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- body[permutation,][testIndexes, ]
trainData <- body[permutation,][-testIndexes, ]
#Fit each of the best models on the training partition and evaluate it on the test partition
predError<-apply(summary(bFat)$which[,-1],1,function(x){
lmObj<-lm(trainData$BODYFAT~.,data=trainData[,-1][,x,drop=FALSE])
testPred<-predict(lmObj,newdata=testData[,-1])
mean((testData$BODYFAT-testPred)^2)
})
predErrorMat[i,]<-predError
}
predErrorMat
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 18.72568 10.95537 11.68551 12.16354 11.83839 11.78985 11.93013
## [2,] 21.41687 21.08760 21.53709 21.06757 21.10223 21.20400 21.62519
## [3,] 32.47863 21.97477 22.48690 22.50871 22.97452 22.92450 24.05130
## [4,] 21.05072 20.22509 19.16631 18.82538 18.90923 18.89133 18.94164
## [5,] 26.47937 22.92690 23.76934 26.13180 26.17794 26.12684 26.28473
## [6,] 26.60945 23.35274 22.06232 22.06825 22.15430 23.10201 25.29325
## [7,] 25.65426 20.48995 19.95947 19.82442 19.53618 19.97744 20.29104
## [8,] 17.54916 18.79081 18.14251 17.67780 17.74409 17.67456 17.71624
## [9,] 33.52443 27.26399 25.83256 26.87850 27.80847 28.32894 28.41455
## [10,] 18.64271 14.11973 14.05815 14.53730 14.42609 14.36767 14.57028
To get the cross-validation estimate of expected predicted error, we average across the 10 folds.
colMeans(predErrorMat)
## [1] 24.21313 20.11870 19.87002 20.16833 20.26714 20.43871 20.91184
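A quick way to see which model size has the smallest estimated prediction error (here it is the model of size 3):
# model size minimizing the 10-fold CV estimate of prediction error
which.min(colMeans(predErrorMat))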
Leave-one-out CV uses each individual observation as a test set. For regression, it can be calculated from the residuals of the linear model. I give a function here that calculates it (you don't need to understand why it works).
LOOCV<-function(lm){
vals<-residuals(lm)/(1-lm.influence(lm)$hat)
sum(vals^2)/length(vals)
}
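To see that this shortcut really computes the leave-one-out error, here is a brute-force check (a sketch; it refits the model n times, so it is slower, and looBrute is just an illustrative name):
# refit the model leaving out each observation in turn and average the squared
# prediction errors for the held-out points; this should match LOOCV(ft)
looBrute <- mean(sapply(1:nrow(body), function(i) {
    fit.i <- lm(BODYFAT ~ ., data = body[-i, ])
    (body$BODYFAT[i] - predict(fit.i, newdata = body[i, ]))^2
}))
all.equal(looBrute, LOOCV(ft))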
Now I create a function that will calculate all of these criteria of interest for a submodel. y should be the response and dataset should be the data.frame with all of the p variables. Either the argument x or the argument lmObj should be given to define which submodel to use. If x is given, it is assumed to be a logical vector, in the same order as the columns of dataset, defining which variables to use, and the function will run lm on that submodel; if lmObj is given, it should be the result of running lm on the submodel.
calculateCriterion<-function(x=NULL,y,dataset,lmObj=NULL){
#dataset contains only explanatory variables
#x is a vector of logicals, length equal to number of explanatory variables in dataset, telling us which variables to keep
#sigma2 is estimate of model on full dataset
# either x or lmObj must be given to specify the smaller lm model
sigma2=summary(lm(y~.,data=dataset))$sigma^2
if(is.null(lmObj)) lmObj<-lm(y ~ ., data=dataset[,x,drop=FALSE]) #don't include intercept
sumlmObj<-summary(lmObj)
n<-nrow(dataset)
p<-sum(x)
RSS<-sumlmObj$sigma^2*(n-p-1)
c(R2=sumlmObj$r.squared,
R2adj=sumlmObj$adj.r.squared,
"RSS/n"=RSS/n,
LOOCV=LOOCV(lmObj),
Cp=RSS/n+2*sigma2*(p+1)/n,
CpAlt=RSS/sigma2-n+2*(p+1),
AIC=AIC(lmObj), # n*log(RSS/n)+2*p +constant,
BIC=BIC(lmObj) # n*log(RSS/n)+p*log(n) + constant
)
}
Here we apply my function to the 8 best models (one of each size) found for the car seat position data.
cat("Criterion for the 8 best k-sized models of car seat position:\n")
## Criterion for the 8 best k-sized models of car seat position:
critSeat<-apply(summary(bSeat)$which[,-1],1,calculateCriterion,
y=seatpos$hipcenter,
dataset=seatpos[,-9])
critSeat<-t(critSeat)
critSeat
## R2 R2adj RSS/n LOOCV Cp CpAlt AIC BIC
## 1 0.6382850 0.6282374 1253.047 1387.644 1402.818 -0.5342143 384.9060 389.8188
## 2 0.6594117 0.6399496 1179.860 1408.696 1404.516 -0.4888531 384.6191 391.1694
## 3 0.6814159 0.6533055 1103.634 1415.652 1403.175 -0.5246725 384.0811 392.2691
## 4 0.6848577 0.6466586 1091.711 1456.233 1466.137 1.1568934 385.6684 395.4939
## 5 0.6861644 0.6371276 1087.184 1548.041 1536.496 3.0359952 387.5105 398.9736
## 6 0.6864310 0.6257403 1086.261 1739.475 1610.457 5.0113282 389.4782 402.5789
## 7 0.6865154 0.6133690 1085.968 1911.701 1685.051 7.0035240 391.4680 406.2062
## 8 0.6865535 0.6000855 1085.836 1975.415 1759.804 9.0000000 393.4634 409.8392
Here we apply my function to the 7 best models (one of each size) found for the body fat data.
cat("\nCriterion for the 7 best k-sized models of body fat:\n")
##
## Criterion for the 7 best k-sized models of body fat:
critBody<-apply(summary(bFat)$which[,-1],1,calculateCriterion,
y=body$BODYFAT,
dataset=body[,-1])
critBody<-t(critBody)
critBody<-cbind(critBody,CV10=colMeans(predErrorMat))
critBody
## R2 R2adj RSS/n LOOCV Cp CpAlt AIC BIC
## 1 0.6616721 0.6603188 23.60104 24.30696 23.91374 53.901272 1517.790 1528.379
## 2 0.7187981 0.7165395 19.61605 20.27420 20.08510 4.925819 1473.185 1487.302
## 3 0.7234261 0.7200805 19.29321 20.07151 19.91861 2.796087 1471.003 1488.650
## 4 0.7249518 0.7204976 19.18678 20.13848 19.96853 3.434662 1471.609 1492.785
## 5 0.7263716 0.7208100 19.08774 20.21249 20.02584 4.167779 1472.305 1497.011
## 6 0.7265595 0.7198630 19.07463 20.34676 20.16908 6.000069 1474.132 1502.367
## 7 0.7265596 0.7187150 19.07463 20.62801 20.32542 8.000000 1476.132 1507.896
## CV10
## 1 24.21313
## 2 20.11870
## 3 19.87002
## 4 20.16833
## 5 20.26714
## 6 20.43871
## 7 20.91184
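As a quick summary of the table above, here is a sketch that reports which model size minimizes each of the criteria that are meant to be minimized (R2, R2adj, and RSS/n are left out):
# model size minimizing each criterion
apply(critBody[, c("LOOCV", "Cp", "AIC", "BIC", "CV10")], 2, which.min)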
The function step will run a stepwise regression based on the AIC. The input object should be the full linear regression model (i.e. with all of the variables). Here we set the argument trace=0 to keep it from printing out the path it takes, but note that this isn't the default. It can be useful to see the printout, which shows you the variables that are added or removed at each step, but for reports it can be good to suppress it. In the end, the output that is returned by the function is just the linear model object from running lm on the final choice of variables.
outBody<-step(ft,trace=0, direction="both")
outBody
##
## Call:
## lm(formula = BODYFAT ~ WEIGHT + ABDOMEN + THIGH, data = body)
##
## Coefficients:
## (Intercept) WEIGHT ABDOMEN THIGH
## -52.9631 -0.1828 0.9919 0.2190
The default version of the step function only removes variables (analogous to backward elimination). If one wants to add variables as well, you can set the argument direction, as we have done here with direction="both".
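For comparison, here is a minimal sketch of an explicit backward-only run (the same as the default when no scope argument is given); outBodyBackward is just an illustrative name.
# backward elimination only, starting from the full model
outBodyBackward <- step(ft, trace = 0, direction = "backward")
coef(outBodyBackward)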
We compare the results of the linear model found by step to those found by looking at the RSS of all subsets.
# results from best of size k
summary(bFat)$out
## AGE WEIGHT HEIGHT CHEST ABDOMEN HIP THIGH
## 1 ( 1 ) " " " " " " " " "*" " " " "
## 2 ( 1 ) " " "*" " " " " "*" " " " "
## 3 ( 1 ) " " "*" " " " " "*" " " "*"
## 4 ( 1 ) " " "*" " " " " "*" "*" "*"
## 5 ( 1 ) " " "*" "*" " " "*" "*" "*"
## 6 ( 1 ) "*" "*" "*" " " "*" "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
# results from criterion of best of each size k
critBody
## R2 R2adj RSS/n LOOCV Cp CpAlt AIC BIC
## 1 0.6616721 0.6603188 23.60104 24.30696 23.91374 53.901272 1517.790 1528.379
## 2 0.7187981 0.7165395 19.61605 20.27420 20.08510 4.925819 1473.185 1487.302
## 3 0.7234261 0.7200805 19.29321 20.07151 19.91861 2.796087 1471.003 1488.650
## 4 0.7249518 0.7204976 19.18678 20.13848 19.96853 3.434662 1471.609 1492.785
## 5 0.7263716 0.7208100 19.08774 20.21249 20.02584 4.167779 1472.305 1497.011
## 6 0.7265595 0.7198630 19.07463 20.34676 20.16908 6.000069 1474.132 1502.367
## 7 0.7265596 0.7187150 19.07463 20.62801 20.32542 8.000000 1476.132 1507.896
## CV10
## 1 24.21313
## 2 20.11870
## 3 19.87002
## 4 20.16833
## 5 20.26714
## 6 20.43871
## 7 20.91184
We repeat the stepwise for the car seat data.
outCarseat<-step(lmSeat,trace=0, direction="both")
outCarseat
##
## Call:
## lm(formula = hipcenter ~ Age + HtShoes + Leg, data = seatpos)
##
## Coefficients:
## (Intercept) Age HtShoes Leg
## 456.2137 0.5998 -2.3023 -6.8297
And we again compare it to the best we found from all subsets.
# results from best of size k
summary(bSeat)$out
## Age Weight HtShoes Ht Seated Arm Thigh Leg
## 1 ( 1 ) " " " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " " " "*" " " " " " " "*"
## 3 ( 1 ) "*" " " " " "*" " " " " " " "*"
## 4 ( 1 ) "*" " " "*" " " " " " " "*" "*"
## 5 ( 1 ) "*" " " "*" " " " " "*" "*" "*"
## 6 ( 1 ) "*" " " "*" " " "*" "*" "*" "*"
## 7 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
# results from criterion of best of each size k
critSeat
## R2 R2adj RSS/n LOOCV Cp CpAlt AIC BIC
## 1 0.6382850 0.6282374 1253.047 1387.644 1402.818 -0.5342143 384.9060 389.8188
## 2 0.6594117 0.6399496 1179.860 1408.696 1404.516 -0.4888531 384.6191 391.1694
## 3 0.6814159 0.6533055 1103.634 1415.652 1403.175 -0.5246725 384.0811 392.2691
## 4 0.6848577 0.6466586 1091.711 1456.233 1466.137 1.1568934 385.6684 395.4939
## 5 0.6861644 0.6371276 1087.184 1548.041 1536.496 3.0359952 387.5105 398.9736
## 6 0.6864310 0.6257403 1086.261 1739.475 1610.457 5.0113282 389.4782 402.5789
## 7 0.6865154 0.6133690 1085.968 1911.701 1685.051 7.0035240 391.4680 406.2062
## 8 0.6865535 0.6000855 1085.836 1975.415 1759.804 9.0000000 393.4634 409.8392
We can also use the function I created above on this output by giving it to the lmObj argument.
calculateCriterion(lmObj=outCarseat,y=seatpos$hipcenter,dataset=seatpos[,-9])
## R2 R2adj RSS/n LOOCV Cp CpAlt
## 0.6812662 0.6531427 1201.5776327 1412.6121485 1276.4629022 -3.9088387
## AIC BIC
## 384.0989931 392.2869239