First set the right working directory (you will need to change dataDir below to point to the directory on your machine where the data files live).
dataDir <- "../finalDataSets"
We bring in the bodyfat dataset that we used in the regression chapter.
body = read.csv(file.path(dataDir, "bodyfat_short.csv"), header = T,stringsAsFactors=TRUE)
To fit a decision tree, we use the function rpart, which stands for recursive partitioning. It is available in the R package rpart.
library(rpart)
rt = rpart(BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + ABDOMEN + HIP + THIGH, data = body)
For now we will focus on just visualizing the results, and then go back and explore the output rt in more detail.
There is a built-in plotting command, plot, for the output. This plots just the tree. If you want the labels plotted on top (and you do!), you have to then call text afterward.
par(mar=c(.1,.1,.1,.1))
plot(rt)
text(rt)
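The built-in plot is fairly bare-bones. As an optional aside (a sketch assuming the separate rpart.plot package is installed, which is not otherwise used in these notes), a more readable version of the same tree can be drawn with:
library(rpart.plot)
rpart.plot(rt) # draws the tree with split rules and node predictions labeled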
Now we will run rpart on the college data, to demonstrate how rpart handles categorical explanatory variables. First we read in the data:
scorecard = read.csv(file.path(dataDir, "college.csv"), stringsAsFactors=TRUE)
We will convert the variable CONTROL into a factor. Recall that CONTROL is a categorical variable describing what kind of college the observation is.
scorecard$CONTROL<-factor(scorecard$CONTROL, labels=c("public","private","for-profit"))
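As a quick sanity check (a sketch; output not shown), we can tabulate the recoded variable to confirm the conversion and see how many colleges fall in each category:
table(scorecard$CONTROL) # counts of public, private, and for-profit colleges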
We will now fit a very simple tree with just two explanatory variables. This is not because it is the best way to fit a tree; rather, for illustration purposes we want to make sure that the resulting decision tree uses the categorical CONTROL variable.
collegeTree = rpart(RET_FT4 ~ TUITIONFEE_OUT + CONTROL, data = scorecard)
par(mar=rep(.1,4))
plot(collegeTree)
text(collegeTree)
Remember that the plot of the tree doesn't actually show the names of the levels of the categorical variable, but instead uses “a”, “b”, “c”, … corresponding to the levels of the categorical variable. We can find the levels of CONTROL and see that “a” corresponds to “public”, “b” corresponds to “private”, and “c” corresponds to “for-profit”.
levels(scorecard$CONTROL)
## [1] "public" "private" "for-profit"
The illustrations of the tree are nice, but rather messy and hard to read. Moreover, it can be hard to remember whether you should go left or right when you satisfy the condition, and dealing with the levels of a categorical variable is quite difficult. We can see the “tree” structure in text format by using the print command on our output.
print(rt)
## n= 252
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 252 17578.99000 19.150790
## 2) ABDOMEN< 91.9 132 4698.25500 13.606060
## 4) ABDOMEN< 85.45 66 1303.62400 10.054550
## 8) ABDOMEN< 75.5 7 113.54860 5.314286 *
## 9) ABDOMEN>=75.5 59 1014.12300 10.616950 *
## 5) ABDOMEN>=85.45 66 1729.68100 17.157580
## 10) HEIGHT>=71.875 19 407.33790 13.189470 *
## 11) HEIGHT< 71.875 47 902.23110 18.761700 *
## 3) ABDOMEN>=91.9 120 4358.48000 25.250000
## 6) ABDOMEN< 103 81 1752.42000 22.788890 *
## 7) ABDOMEN>=103 39 1096.45200 30.361540
## 14) ABDOMEN< 112.3 28 413.60000 28.300000
## 28) HEIGHT>=72.125 8 89.39875 23.937500 *
## 29) HEIGHT< 72.125 20 111.04950 30.045000 *
## 15) ABDOMEN>=112.3 11 260.94910 35.609090 *
This allows us to more precisely see how to traverse the tree, as well as the levels of the categorical variables. The printed deviances also let us do quick calculations by hand: summing the deviances of the two nodes created by the first split and dividing by the deviance of the root node gives the relative error of the tree after one split (compare with the rel error column from printcp below):
(4698.255 + 4358.48)/17578.99
## [1] 0.5152022
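To see how the levels of a categorical variable appear in this text form, we can also print the college tree (output not shown); splits on CONTROL are written out using the level names rather than the letters used in the plot.
print(collegeTree) # splits on CONTROL list the level names explicitly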
We can look at the choice of where to prune the tree using the function printcp. This allows us to decide whether R's default value of cp (0.01) was a good choice.
printcp(rt)
##
## Regression tree:
## rpart(formula = BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + ABDOMEN +
## HIP + THIGH, data = body)
##
## Variables actually used in tree construction:
## [1] ABDOMEN HEIGHT
##
## Root node error: 17579/252 = 69.758
##
## n= 252
##
## CP nsplit rel error xerror xstd
## 1 0.484798 0 1.00000 1.01709 0.082285
## 2 0.094713 1 0.51520 0.55931 0.050360
## 3 0.085876 2 0.42049 0.48323 0.038202
## 4 0.024000 3 0.33461 0.38474 0.032989
## 5 0.023899 4 0.31061 0.37870 0.032727
## 6 0.012125 5 0.28672 0.36315 0.030400
## 7 0.010009 6 0.27459 0.37437 0.030582
## 8 0.010000 7 0.26458 0.37953 0.030954
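The same cross-validation results can be viewed graphically with plotcp, which plots the cross-validated relative error against the complexity parameter; this is simply another view of the table printed above (plot not shown).
plotcp(rt) # cross-validated relative error versus cp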
Recall that printcp only prints out the results up to the value of cp given in the original call to rpart, in this case the default of cp=0.01. If you want to go beyond that value, you need to set cp to a smaller value in the original call to rpart.
Here we demonstrate changing the cp value to be slightly higher based on the results of printcp (though the difference between 0.0122 and 0.01 is probably not very meaningful):
rtd = rpart(BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + ABDOMEN + HIP + THIGH, data = body, cp = 0.0122)
plot(rtd)
text(rtd)
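Refitting is not strictly necessary here: because the trees for larger values of cp are nested inside the tree we already fit, we can also cut back the existing tree with the prune function. As a sketch (rtdPruned is just an illustrative name), this should give the same tree as the refit above:
rtdPruned = prune(rt, cp = 0.0122) # prune the already-fitted tree instead of refitting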
We will now demonstrate classification trees using the email spam dataset from the chapter on logistic regression.
library(DAAG)
## Warning: package 'DAAG' was built under R version 4.0.2
## Loading required package: lattice
data(spam7)
spam = spam7
We again call rpart, only now we need the additional argument method="class". This tells rpart to build a classification tree instead of a regression tree.
sprt = rpart(yesno ~ crl.tot + dollar + bang + money + n000 + make, method = "class", data = spam)
plot(sprt)
text(sprt)
We can similarly use printcp to evaluate the pruning of the classification tree.
printcp(sprt)
##
## Classification tree:
## rpart(formula = yesno ~ crl.tot + dollar + bang + money + n000 +
## make, data = spam, method = "class")
##
## Variables actually used in tree construction:
## [1] bang crl.tot dollar
##
## Root node error: 1813/4601 = 0.39404
##
## n= 4601
##
## CP nsplit rel error xerror xstd
## 1 0.476558 0 1.00000 1.00000 0.018282
## 2 0.075565 1 0.52344 0.55047 0.015420
## 3 0.011583 3 0.37231 0.38941 0.013484
## 4 0.010480 4 0.36073 0.37728 0.013310
## 5 0.010000 5 0.35025 0.37452 0.013270
The results of printcp show that the cross-validated prediction error (xerror) is still decreasing when we reach cp = 0.01, the smallest value considered. This indicates that we might want to try smaller values of cp in our call to rpart, so that we can evaluate where a good stopping point is.
sprt = rpart(yesno ~ crl.tot + dollar + bang + money + n000 + make, method = "class", cp = 0.001, data = spam)
Now we can see the results beyond cp=0.01, and indeed a smaller value of cp performs better.
printcp(sprt)
##
## Classification tree:
## rpart(formula = yesno ~ crl.tot + dollar + bang + money + n000 +
## make, data = spam, method = "class", cp = 0.001)
##
## Variables actually used in tree construction:
## [1] bang crl.tot dollar money n000
##
## Root node error: 1813/4601 = 0.39404
##
## n= 4601
##
## CP nsplit rel error xerror xstd
## 1 0.4765582 0 1.00000 1.00000 0.018282
## 2 0.0755654 1 0.52344 0.55323 0.015447
## 3 0.0115830 3 0.37231 0.38665 0.013445
## 4 0.0104799 4 0.36073 0.38389 0.013406
## 5 0.0063431 5 0.35025 0.37397 0.013262
## 6 0.0055157 10 0.31660 0.35742 0.013014
## 7 0.0044126 11 0.31109 0.35576 0.012989
## 8 0.0038610 12 0.30667 0.34418 0.012810
## 9 0.0027579 16 0.29123 0.33094 0.012599
## 10 0.0022063 17 0.28847 0.33149 0.012608
## 11 0.0019305 18 0.28627 0.33094 0.012599
## 12 0.0016547 20 0.28240 0.33039 0.012590
## 13 0.0010000 25 0.27413 0.33039 0.012590
We can use a better value (0.0028) and refit our tree.
sprt = rpart(yesno ~ crl.tot + dollar + bang + money + n000 + make, method = "class", cp = 0.0028, data = spam)
plot(sprt)
text(sprt)
We can also make predictions from our decision trees. We'll demonstrate this for our classification tree, but the main workhorse is the function predict, which can be used on any output of rpart, whether a classification or regression tree.
First we need to make a data.frame with the new observation(s) we want predictions for. We only predict for a single observation here, but you can make a data.frame with as many observations as you want predictions for.
x0 = data.frame(crl.tot = 100, dollar = 3, bang = 0.33, money = 1.2, n000 = 0, make = .3)
Now we call predict
, giving it our fitted tree along with the new data
predict(sprt, newdata=x0)
## n y
## 1 0.04916201 0.950838
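By default, predict on a classification tree returns the estimated class probabilities, as above. If we only want the predicted class label, we can instead set type = "class" (output not shown):
predict(sprt, newdata = x0, type = "class") # returns the predicted class label ("n" or "y")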
The following code is copied over from the previous chapter (07), and calculates the confusion matrix for our data.
confusion <- function (y, yhat, thres)
{
    # y: observed 0/1 (or logical) responses
    # yhat: predicted probabilities
    # thres: vector of cutoff thresholds to evaluate
    n <- length(thres)
    conf <- matrix(0, length(thres), ncol = 4)
    colnames(conf) <- c("C0", "W1", "W0", "C1")
    rownames(conf) <- round(thres, 3)
    for (i in 1:n) {
        C0 <- sum((!y) & (yhat <= thres[i]))  # correctly classified as 0
        W1 <- sum((!y) & (yhat > thres[i]))   # wrongly classified as 1
        W0 <- sum((y) & (yhat <= thres[i]))   # wrongly classified as 0
        C1 <- sum((y) & (yhat > thres[i]))    # correctly classified as 1
        conf[i, ] <- c(C0, W1, W0, C1)
    }
    return(conf)
}
makeConfMatrix <- function(v) {
    # v: counts in the order C0, W1, W0, C1 (e.g. one row of the output of confusion)
    v <- paste0("$", c("C_0", "W_1", "W_0", "C_1"), "=", v, "$")
    knitr::kable(
        rbind(
            c("", "$\\hat{y} = 0$", "$\\hat{y} = 1$"),
            c("$y = 0$", v[1], v[2]),
            c("$y = 1$", v[3], v[4])
        ),
        align = "r",
        escape = FALSE
    )
}
We can use our predictions and our observed data to calculate the confusion matrix for our classification tree:
y = as.numeric(spam$yesno == "y")
y.hat = predict(sprt, spam)[,2]
v<-seq(0.1, 0.9, by = 0.05) #a sequence of possible cutoff values
tree.conf = confusion(y, y.hat, thres=v)
We can plot the total errors (\(W_1 + W_0\)) as a function of our cutoff threshold.
plot(v,tree.conf[,"W1"]+tree.conf[,"W0"], xlab="threshold", ylab="Total error", type = "l")
makeConfMatrix(confusion(y, y.hat, thres=0.5))
## Warning in kable_pipe(x = structure(c("", "$y = 0$", "$y = 1$", "$\\hat{y} =
## 0$", : The table should have a header (column names)
|           | \(\hat{y} = 0\) | \(\hat{y} = 1\) |
|-----------|-----------------|-----------------|
| \(y = 0\) | \(C_0=2624\)    | \(W_1=164\)     |
| \(y = 1\) | \(W_0=364\)     | \(C_1=1449\)    |
We use the package randomForest to fit our random forests. The syntax is very similar to that of rpart.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.0.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
ft = randomForest(BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + ABDOMEN + HIP + THIGH, data = body, importance = TRUE)
We chose importance=TRUE to have randomForest calculate importance measures for our variables, for later use.
We can view a summary of our results:
ft
##
## Call:
## randomForest(formula = BODYFAT ~ AGE + WEIGHT + HEIGHT + CHEST + ABDOMEN + HIP + THIGH, data = body, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 23.04282
## % Var explained: 66.97
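We can also plot the fitted object; for a randomForest regression the plot method shows the out-of-bag mean squared error as a function of the number of trees, which is a useful check that 500 trees was enough (plot not shown).
plot(ft) # OOB mean squared error versus number of trees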
We can see these importance measures for each variable in our model.
importance(ft)
## %IncMSE IncNodePurity
## AGE 10.45760 1096.375
## WEIGHT 14.49798 1904.028
## HEIGHT 13.22469 1180.638
## CHEST 16.88800 3367.604
## ABDOMEN 35.83094 5681.652
## HIP 15.36587 2199.989
## THIGH 11.29928 1452.568
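The randomForest package also provides varImpPlot, which displays the same importance measures as dot charts, making it easier to compare variables at a glance (plot not shown).
varImpPlot(ft) # dot charts of %IncMSE and IncNodePurity for each variable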
Again we can use the predict function to predict a new observation from our random forest:
x0 = data.frame(AGE = 40, WEIGHT = 170, HEIGHT = 76, CHEST = 120, ABDOMEN = 100, HIP = 101, THIGH = 60)
predict(ft, x0)
## 1
## 24.07606
We can similarly use random forests for classification. We need to make sure that our response (yesno) is a factor. Unlike building the classification tree in rpart, we do not have to explicitly give method="class"; the fact that the response is a factor variable tells randomForest to build a classification forest as opposed to a regression forest.
sprf = randomForest(as.factor(yesno) ~ crl.tot + dollar + bang + money + n000 + make, data = spam)
sprf
##
## Call:
## randomForest(formula = as.factor(yesno) ~ crl.tot + dollar + bang + money + n000 + make, data = spam)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 11.76%
## Confusion matrix:
## n y class.error
## n 2645 143 0.05129125
## y 398 1415 0.21952565
Again, prediction is exactly the same using the predict function:
x0 = data.frame(crl.tot = 100, dollar = 3, bang = 0.33, money = 1.2, n000 = 0, make = 0.3)
predict(sprf, x0)
## 1
## y
## Levels: n y
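As with the classification tree, we can ask for estimated class probabilities rather than just the predicted label by setting type = "prob" in predict; for a random forest these are the proportions of trees voting for each class (output not shown).
predict(sprf, x0, type = "prob") # proportion of trees voting for each class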