We will work with a dataset packaged in the library DAAG:
library(DAAG)
## Warning: package 'DAAG' was built under R version 4.0.2
## Loading required package: lattice
data(frogs)
We will visualize this data by plotting the presence/absence of frogs geographically (northing is the distance north and easting is the distance east, both in meters).
We first define a factor from our binary variable to use for coloring the points:
presAbs<-factor(frogs$pres.abs,levels=c(0,1),labels=c("Absent","Present"))
We can choose the pch based on the values of presAbs. A pch of 16 gives a filled circle, while 1 leaves it empty:
plot(northing ~ easting, data=frogs, pch=c(1,16)[presAbs],
xlab="Meters east of reference point", ylab="Meters north")
legend("bottomleft", legend=levels(presAbs), pch=c(1,16))
We will also consider another dataset, spam7, which classifies emails as spam or not:
data(spam7)
head(spam7)
## crl.tot dollar bang money n000 make yesno
## 1 278 0.000 0.778 0.00 0.00 0.00 y
## 2 1028 0.180 0.372 0.43 0.43 0.21 y
## 3 2259 0.184 0.276 0.06 1.16 0.06 y
## 4 191 0.000 0.137 0.00 0.00 0.00 y
## 5 191 0.000 0.135 0.00 0.00 0.00 y
## 6 54 0.000 0.000 0.00 0.00 0.00 y
For simplicity, we are going to rename the dataset to just spam:
spam = spam7
We use the function glm to fit a logistic model to the frogs data. We will not use the geographic variables in our model, so we first create a data.frame that excludes them:
frogsNoGeo<-frogs[,-c(2,3)]
We now call glm using the same syntax as in lm: the response variable goes on the left of the ~, and the explanatory variables go on the right. Because we removed the geographic variables, we will use all of the remaining variables as explanatory variables, so we can simplify the formula by just using "." on the right of the ~, like with lm.
glmFrogs = glm(pres.abs ~ ., family = binomial, data = frogsNoGeo)
As with lm, the summary of the output gives relevant information regarding the significance of coefficients and the overall fit:
summary(glmFrogs)
##
## Call:
## glm(formula = pres.abs ~ ., family = binomial, data = frogsNoGeo)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7215 -0.7590 -0.2237 0.8320 2.6789
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.105e+02 1.388e+02 0.796 0.42587
## altitude -3.086e-02 4.076e-02 -0.757 0.44901
## distance -4.800e-04 2.055e-04 -2.336 0.01949 *
## NoOfPools 2.986e-02 9.276e-03 3.219 0.00129 **
## NoOfSites 4.364e-02 1.061e-01 0.411 0.68077
## avrain -1.140e-02 5.995e-02 -0.190 0.84920
## meanmin 4.899e+00 1.564e+00 3.133 0.00173 **
## meanmax -5.660e+00 5.049e+00 -1.121 0.26224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 279.99 on 211 degrees of freedom
## Residual deviance: 198.74 on 204 degrees of freedom
## AIC: 214.74
##
## Number of Fisher Scoring iterations: 6
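Since logistic regression coefficients are on the log-odds scale, exponentiating them gives the multiplicative change in the odds of presence per unit increase in each variable; a quick check:
# coefficients are log odds-ratios; exp() converts them to odds ratios
exp(coef(glmFrogs))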
Like in lm, we can use the function fitted to get the fitted values for our data. For a logistic model these are the fitted probabilities for each observation:
head(fitted(glmFrogs))
## 2 3 4 5 6 7
## 0.9994421 0.9391188 0.8683363 0.7443973 0.9427198 0.7107780
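As a quick sanity check, these are just the inverse logit of the linear predictor \(X\hat{\beta}\):
# predict(..., type="link") returns the linear predictor X beta;
# applying the inverse logit reproduces fitted(glmFrogs)
eta <- predict(glmFrogs, type = "link")
head(1 / (1 + exp(-eta)))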
We can plot the fitted probabilities against the actual response. However, since our actual responses are not probabilities but 0/1 values, it makes more sense to use a boxplot, so we can see the distribution of the fitted values separately for the 0 group and the 1 group.
boxplot(fitted(glmFrogs) ~ frogs$pres.abs, at = c(0,1))
points(x = jitter(frogs$pres.abs), y = fitted(glmFrogs))
We also overlay the individual predictions on top of the boxplot using points. If we had just called points(x=frogs$pres.abs, y=fitted(glmFrogs)), then all the x values within the 0 group would be identical and the points would be overplotted. So we used the function jitter to add a bit of random noise to the 0/1 values; this spreads out the x-axis positions so the points do not sit on top of each other.
We can see that some points have high predicted probabilities of being in the opposite class from the one we observed. For example, some points with a response of 0 had fitted probabilities \(>0.7\) of being a 1, and some points with a response of 1 had fitted probabilities \(<0.2\) of being a 1. We can isolate those points and inspect them to try to understand why.
First we find which points these are:
high0 <- frogs$pres.abs==0 & glmFrogs$fitted >.7
low1 <- frogs$pres.abs==1 & glmFrogs$fitted <.2
Once we've found them, we can highlight them in red on our plot:
par(mfrow=c(1,2))
plot(northing ~ easting, data=frogs, pch=c(1,16)[frogs$pres.abs+1], col=c("black","red")[factor(high0)], xlab="Meters east of reference point", ylab="Meters north",main="Points with no frogs, but high prob")
plot(northing ~ easting, data=frogs, pch=c(1,16)[frogs$pres.abs+1], col=c("black","red")[factor(low1)], xlab="Meters east of reference point", ylab="Meters north",main="Points with frogs, but low prob")
We can calculate the overall residual deviance of our model using the deviance function:
deviance(glmFrogs)
## [1] 198.7384
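For 0/1 responses the saturated model fits every observation perfectly (log-likelihood 0), so the residual deviance is just \(-2\) times the model's log-likelihood; we can verify:
# deviance equals -2 * logLik for Bernoulli (0/1) responses
-2 * as.numeric(logLik(glmFrogs))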
We can demonstrate the change in deviance that results from removing the variable NoOfPools. First we fit a model excluding this variable:
m2 = glm(pres.abs ~ altitude + distance + NoOfSites + avrain + meanmin + meanmax, family = binomial, data = frogs)
Now we calculate the deviance for this smaller model and the original (full) model:
deviance(m2)
## [1] 210.8392
deviance(glmFrogs)
## [1] 198.7384
We can similarly get the deviance for a null model with no variables. This is the intercept-only model:
m0 <- glm(pres.abs ~ 1, family = binomial, data = frogs)
deviance(m0)
## [1] 279.987
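This value is also stored directly in the original fitted object:
# null deviance saved by glm at fit time
glmFrogs$null.deviance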
Notice that this null deviance is what is reported in the summary of our glm object:
summary(glmFrogs)
##
## Call:
## glm(formula = pres.abs ~ ., family = binomial, data = frogsNoGeo)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7215 -0.7590 -0.2237 0.8320 2.6789
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.105e+02 1.388e+02 0.796 0.42587
## altitude -3.086e-02 4.076e-02 -0.757 0.44901
## distance -4.800e-04 2.055e-04 -2.336 0.01949 *
## NoOfPools 2.986e-02 9.276e-03 3.219 0.00129 **
## NoOfSites 4.364e-02 1.061e-01 0.411 0.68077
## avrain -1.140e-02 5.995e-02 -0.190 0.84920
## meanmin 4.899e+00 1.564e+00 3.133 0.00173 **
## meanmax -5.660e+00 5.049e+00 -1.121 0.26224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 279.99 on 211 degrees of freedom
## Residual deviance: 198.74 on 204 degrees of freedom
## AIC: 214.74
##
## Number of Fisher Scoring iterations: 6
This comparison of deviances can also be performed with the function anova:
anova(m0,glmFrogs)
## Analysis of Deviance Table
##
## Model 1: pres.abs ~ 1
## Model 2: pres.abs ~ altitude + distance + NoOfPools + NoOfSites + avrain +
## meanmin + meanmax
## Resid. Df Resid. Dev Df Deviance
## 1 211 279.99
## 2 204 198.74 7 81.249
The above did not give a p-value for the comparison, but if we specify the test we want to use, such as test="LRT" (likelihood ratio test), we will get p-values:
anova(m0,glmFrogs, test="LRT")
## Analysis of Deviance Table
##
## Model 1: pres.abs ~ 1
## Model 2: pres.abs ~ altitude + distance + NoOfPools + NoOfSites + avrain +
## meanmin + meanmax
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 211 279.99
## 2 204 198.74 7 81.249 7.662e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
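The p-value is just the upper tail of a \(\chi^2_7\) distribution evaluated at the drop in deviance, which we can reproduce by hand:
# LRT: change in deviance ~ chi-squared with df = 7 (parameters dropped)
pchisq(deviance(m0) - deviance(glmFrogs), df = 7, lower.tail = FALSE)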
We can also get a p-value for the effect of leaving out a single variable:
anova(m2,glmFrogs, test="LRT")
## Analysis of Deviance Table
##
## Model 1: pres.abs ~ altitude + distance + NoOfSites + avrain + meanmin +
## meanmax
## Model 2: pres.abs ~ altitude + distance + NoOfPools + NoOfSites + avrain +
## meanmin + meanmax
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 205 210.84
## 2 204 198.74 1 12.101 0.000504 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These p-values are slightly different from the results printed by summary, which uses a Wald test rather than the LRT:
cat("Summary results:\n")
## Summary results:
round(summary(glmFrogs)$coeff,4)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 110.4935 138.7622 0.7963 0.4259
## altitude -0.0309 0.0408 -0.7571 0.4490
## distance -0.0005 0.0002 -2.3360 0.0195
## NoOfPools 0.0299 0.0093 3.2192 0.0013
## NoOfSites 0.0436 0.1061 0.4114 0.6808
## avrain -0.0114 0.0599 -0.1901 0.8492
## meanmin 4.8991 1.5637 3.1329 0.0017
## meanmax -5.6603 5.0488 -1.1211 0.2622
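The Wald test divides each estimate by its standard error and compares the result to a standard normal; we can reproduce summary's p-values by hand:
# Wald p-values: z = estimate / SE, two-sided normal tail probability
coefs <- summary(glmFrogs)$coefficients
round(2 * pnorm(-abs(coefs[, "Estimate"] / coefs[, "Std. Error"])), 4)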
We can also use AIC for glm models:
AIC(glmFrogs)
## [1] 214.7384
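For these 0/1 data the AIC is simply the residual deviance plus twice the number of estimated coefficients (eight here, counting the intercept):
# AIC = deviance + 2 * (number of parameters), for Bernoulli responses
deviance(glmFrogs) + 2 * length(coef(glmFrogs))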
The AIC is used by the step function to do stepwise insertion and deletion of variables, just as with linear regression:
step(glmFrogs, direction="both",trace=0)
##
## Call: glm(formula = pres.abs ~ distance + NoOfPools + meanmin + meanmax,
## family = binomial, data = frogsNoGeo)
##
## Coefficients:
## (Intercept) distance NoOfPools meanmin meanmax
## 14.0074032 -0.0005138 0.0285643 5.6230647 -2.3717579
##
## Degrees of Freedom: 211 Total (i.e. Null); 207 Residual
## Null Deviance: 280
## Residual Deviance: 199.6 AIC: 209.6
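Since step returns the selected model, in practice we would save the result for further use; a small sketch (the name glmFrogsStep is ours):
# keep the model chosen by stepwise selection and confirm its AIC
glmFrogsStep <- step(glmFrogs, direction = "both", trace = 0)
AIC(glmFrogsStep)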
The following function, confusion, calculates the entries of the confusion matrix for each cutoff value in the vector thres:
confusion <- function(y, yhat, thres) {
  # y: true 0/1 responses; yhat: fitted probabilities; thres: vector of cutoffs
  n <- length(thres)
  conf <- matrix(0, nrow = n, ncol = 4)
  colnames(conf) <- c("C0", "W1", "W0", "C1")
  rownames(conf) <- round(thres, 3)
  for (i in 1:n) {
    C0 <- sum((!y) & (yhat <= thres[i])) # true 0, predicted 0 (correct)
    W1 <- sum((!y) & (yhat > thres[i]))  # true 0, predicted 1 (wrong)
    W0 <- sum((y) & (yhat <= thres[i]))  # true 1, predicted 0 (wrong)
    C1 <- sum((y) & (yhat > thres[i]))   # true 1, predicted 1 (correct)
    conf[i, ] <- c(C0, W1, W0, C1)
  }
  return(conf)
}
We can use this function to calculate the confusion matrix values over a large range of thresholds. The output is a matrix where each row corresponds to a threshold value and each column gives one of the four entries of the 2x2 confusion matrix.
thr<-seq(0,1,by=.05)
conf<-confusion(frogs$pres.abs, glmFrogs$fitted.values, thr)
conf
## C0 W1 W0 C1
## 0 0 133 0 79
## 0.05 33 100 3 76
## 0.1 47 86 5 74
## 0.15 61 72 5 74
## 0.2 69 64 6 73
## 0.25 80 53 10 69
## 0.3 84 49 10 69
## 0.35 91 42 13 66
## 0.4 100 33 14 65
## 0.45 106 27 18 61
## 0.5 112 21 21 58
## 0.55 118 15 26 53
## 0.6 121 12 35 44
## 0.65 126 7 44 35
## 0.7 129 4 50 29
## 0.75 130 3 59 20
## 0.8 133 0 69 10
## 0.85 133 0 71 8
## 0.9 133 0 73 6
## 0.95 133 0 78 1
## 1 133 0 79 0
Using this output, we can plot the total number of errors, W1+W0, as a function of the threshold:
plot(thr,conf[,"W1"]+conf[,"W0"], xlab="threshold", ylab="W1+W0")
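To pick the cutoff programmatically, one rough approach is to take the threshold that minimizes the total error (which.min returns the first minimizer in case of ties):
# threshold with the fewest total misclassifications
thr[which.min(conf[, "W1"] + conf[, "W0"])]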
We will now return to the spam dataset for demonstration.
head(spam7)
## crl.tot dollar bang money n000 make yesno
## 1 278 0.000 0.778 0.00 0.00 0.00 y
## 2 1028 0.180 0.372 0.43 0.43 0.21 y
## 3 2259 0.184 0.276 0.06 1.16 0.06 y
## 4 191 0.000 0.137 0.00 0.00 0.00 y
## 5 191 0.000 0.135 0.00 0.00 0.00 y
## 6 54 0.000 0.000 0.00 0.00 0.00 y
First we will visualize some of the variables:
summary(spam)
## crl.tot dollar bang money
## Min. : 1.0 Min. :0.00000 Min. : 0.0000 Min. : 0.00000
## 1st Qu.: 35.0 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.: 0.00000
## Median : 95.0 Median :0.00000 Median : 0.0000 Median : 0.00000
## Mean : 283.3 Mean :0.07581 Mean : 0.2691 Mean : 0.09427
## 3rd Qu.: 266.0 3rd Qu.:0.05200 3rd Qu.: 0.3150 3rd Qu.: 0.00000
## Max. :15841.0 Max. :6.00300 Max. :32.4780 Max. :12.50000
## n000 make yesno
## Min. :0.0000 Min. :0.0000 n:2788
## 1st Qu.:0.0000 1st Qu.:0.0000 y:1813
## Median :0.0000 Median :0.0000
## Mean :0.1016 Mean :0.1046
## 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :5.4500 Max. :4.5400
par(mfrow=c(3,2))
for (i in 1:5)
hist(spam[,i],main="",xlab=names(spam)[i],breaks=10000)
par(mfrow=c(1,1))
And do a pairs plot:
pairs(spam, cex=0.5)
The explanatory variables are very skewed, which may be hard for the model to handle. We can consider taking the log-transform of our explanatory variables; however, we add a small constant first so we don't run the risk of taking the log of 0:
s = 0.001
pairs(~log(crl.tot)+log(dollar+s) +log(bang+s)+log(money+s)+log(n000+s)+log(make+s) + yesno, data=spam, cex=.5)
Based on these plots, we are going to log-transform many of the explanatory variables before putting them into the glm model:
spam.glm <- glm(yesno ~ log(crl.tot) + log(dollar+s) + log(bang+s) +
                log(money+s) + log(n000+s) + log(make+s),
                family = binomial, data = spam)
summary(spam.glm)
##
## Call:
## glm(formula = yesno ~ log(crl.tot) + log(dollar + s) + log(bang +
## s) + log(money + s) + log(n000 + s) + log(make + s), family = binomial,
## data = spam)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1657 -0.4367 -0.2863 0.3609 2.7152
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.11947 0.36342 11.335 < 2e-16 ***
## log(crl.tot) 0.30228 0.03693 8.185 2.71e-16 ***
## log(dollar + s) 0.32586 0.02365 13.777 < 2e-16 ***
## log(bang + s) 0.40984 0.01597 25.661 < 2e-16 ***
## log(money + s) 0.34563 0.02800 12.345 < 2e-16 ***
## log(n000 + s) 0.18947 0.02931 6.463 1.02e-10 ***
## log(make + s) -0.11418 0.02206 -5.177 2.25e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6170.2 on 4600 degrees of freedom
## Residual deviance: 3245.1 on 4594 degrees of freedom
## AIC: 3259.1
##
## Number of Fisher Scoring iterations: 6
We can compare this to what we would have obtained by running a standard linear regression with lm:
spam.lm <- lm(as.numeric(yesno=="y") ~ log(crl.tot) + log(dollar+s) +
              log(bang+s) + log(money+s) + log(n000+s) + log(make+s),
              data = spam)
summary(spam.lm)
##
## Call:
## lm(formula = as.numeric(yesno == "y") ~ log(crl.tot) + log(dollar +
## s) + log(bang + s) + log(money + s) + log(n000 + s) + log(make +
## s), data = spam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10937 -0.13830 -0.05674 0.15262 1.05619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.078531 0.034188 31.547 < 2e-16 ***
## log(crl.tot) 0.028611 0.003978 7.193 7.38e-13 ***
## log(dollar + s) 0.054878 0.002934 18.703 < 2e-16 ***
## log(bang + s) 0.064522 0.001919 33.619 < 2e-16 ***
## log(money + s) 0.039776 0.002751 14.457 < 2e-16 ***
## log(n000 + s) 0.018530 0.002815 6.582 5.16e-11 ***
## log(make + s) -0.017380 0.002370 -7.335 2.61e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3391 on 4594 degrees of freedom
## Multiple R-squared: 0.5193, Adjusted R-squared: 0.5186
## F-statistic: 827.1 on 6 and 4594 DF, p-value: < 2.2e-16
We can first compare the fitted values from these two models.
par(mfrow=c(1,1))
plot(spam.lm$fitted.values, spam.glm$fitted.values,asp=1)
abline(c(0, 1), col = "red")
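One important difference is visible here: the glm fitted values are genuine probabilities, while the lm fitted values are unconstrained and can fall outside \([0,1]\):
# glm fits stay in [0, 1]; lm fits can go below 0 or above 1
range(spam.glm$fitted.values)
range(spam.lm$fitted.values)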
We can also compare the errors we make with the two models. We use our confusion function from before to calculate the different types of errors from both models:
v <- seq(0.001,0.999,length=50)
y <- as.numeric(spam$yesno=="y")
glm.conf <- confusion(y,spam.glm$fitted,v)
lm.conf <- confusion(y,spam.lm$fitted,v)
matplot(v,cbind((glm.conf[,"W1"]+glm.conf[,"W0"])/4601, (lm.conf[,"W1"]+lm.conf[,"W0"])/4601), xlab="threshold", ylab="W0+W1",type="b",pch=1)
legend(.8, .4, lty=1:2, col=1:2,c("glm","lm"))
We can also compare what would have happened if we hadn't taken the log of our explanatory variables.
spam.glm.nolog <- glm(yesno ~ crl.tot + dollar + bang + money + n000 + make,
                      family = binomial, data = spam)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
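This warning means some emails received fitted probabilities numerically indistinguishable from 0 or 1, i.e. extreme values of the linear predictor. A quick count, using an arbitrary tolerance of our choosing:
# eps is an arbitrary tolerance for "numerically 0 or 1"
eps <- 1e-8
sum(spam.glm.nolog$fitted.values < eps | spam.glm.nolog$fitted.values > 1 - eps)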
summary(spam.glm)
##
## Call:
## glm(formula = yesno ~ log(crl.tot) + log(dollar + s) + log(bang +
## s) + log(money + s) + log(n000 + s) + log(make + s), family = binomial,
## data = spam)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1657 -0.4367 -0.2863 0.3609 2.7152
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.11947 0.36342 11.335 < 2e-16 ***
## log(crl.tot) 0.30228 0.03693 8.185 2.71e-16 ***
## log(dollar + s) 0.32586 0.02365 13.777 < 2e-16 ***
## log(bang + s) 0.40984 0.01597 25.661 < 2e-16 ***
## log(money + s) 0.34563 0.02800 12.345 < 2e-16 ***
## log(n000 + s) 0.18947 0.02931 6.463 1.02e-10 ***
## log(make + s) -0.11418 0.02206 -5.177 2.25e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6170.2 on 4600 degrees of freedom
## Residual deviance: 3245.1 on 4594 degrees of freedom
## AIC: 3259.1
##
## Number of Fisher Scoring iterations: 6
summary(spam.glm.nolog)
##
## Call:
## glm(formula = yesno ~ crl.tot + dollar + bang + money + n000 +
## make, family = binomial, data = spam)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.6153 -0.5816 0.4439 1.9323
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.700e+00 5.361e-02 -31.717 < 2e-16 ***
## crl.tot 6.917e-04 9.745e-05 7.098 1.27e-12 ***
## dollar 8.013e+00 6.175e-01 12.976 < 2e-16 ***
## bang 1.572e+00 1.115e-01 14.096 < 2e-16 ***
## money 2.142e+00 2.418e-01 8.859 < 2e-16 ***
## n000 4.149e+00 4.371e-01 9.492 < 2e-16 ***
## make 1.698e-02 1.434e-01 0.118 0.906
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6170.2 on 4600 degrees of freedom
## Residual deviance: 4058.8 on 4594 degrees of freedom
## AIC: 4072.8
##
## Number of Fisher Scoring iterations: 16
We can see that the AIC is much lower for the model with the log-transformed variables. Similarly, with the lm models, we see a noticeable difference in the R-squared values:
spam.lglmFrogs = lm(as.numeric(yesno=="y") ~ crl.tot + dollar + bang +
                    money + n000 + make, data = spam)
summary(spam.lglmFrogs)
##
## Call:
## lm(formula = as.numeric(yesno == "y") ~ crl.tot + dollar + bang +
## money + n000 + make, data = spam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8650 -0.2758 -0.2519 0.4459 0.7499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.498e-01 7.488e-03 33.365 <2e-16 ***
## crl.tot 1.241e-04 1.058e-05 11.734 <2e-16 ***
## dollar 3.481e-01 2.733e-02 12.740 <2e-16 ***
## bang 1.113e-01 7.725e-03 14.407 <2e-16 ***
## money 1.765e-01 1.440e-02 12.262 <2e-16 ***
## n000 3.218e-01 1.891e-02 17.014 <2e-16 ***
## make 3.212e-02 2.101e-02 1.529 0.126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4223 on 4594 degrees of freedom
## Multiple R-squared: 0.2543, Adjusted R-squared: 0.2533
## F-statistic: 261.1 on 6 and 4594 DF, p-value: < 2.2e-16
summary(spam.lm)
##
## Call:
## lm(formula = as.numeric(yesno == "y") ~ log(crl.tot) + log(dollar +
## s) + log(bang + s) + log(money + s) + log(n000 + s) + log(make +
## s), data = spam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10937 -0.13830 -0.05674 0.15262 1.05619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.078531 0.034188 31.547 < 2e-16 ***
## log(crl.tot) 0.028611 0.003978 7.193 7.38e-13 ***
## log(dollar + s) 0.054878 0.002934 18.703 < 2e-16 ***
## log(bang + s) 0.064522 0.001919 33.619 < 2e-16 ***
## log(money + s) 0.039776 0.002751 14.457 < 2e-16 ***
## log(n000 + s) 0.018530 0.002815 6.582 5.16e-11 ***
## log(make + s) -0.017380 0.002370 -7.335 2.61e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3391 on 4594 degrees of freedom
## Multiple R-squared: 0.5193, Adjusted R-squared: 0.5186
## F-statistic: 827.1 on 6 and 4594 DF, p-value: < 2.2e-16
We can use the results of our confusion matrix to calculate a precision/recall curve or a sensitivity/specificity plot (essentially a ROC curve, which conventionally plots sensitivity against 1-specificity):
spamGlm.precision<-glm.conf[,"C1"]/(glm.conf[,"C1"]+glm.conf[,"W1"])
spamGlm.recall<-glm.conf[,"C1"]/(glm.conf[,"C1"]+glm.conf[,"W0"])
spamGlm.spec<-glm.conf[,"C0"]/(glm.conf[,"C0"]+glm.conf[,"W1"])
par(mfrow=c(1,2))
plot(x=spamGlm.precision,y=spamGlm.recall,xlab="Precision",ylab="Recall / Sensitivity",type="b",xlim=c(0,1),ylim=c(0,1), main="Precision-Recall")
plot(x=spamGlm.spec,y=spamGlm.recall,ylab="Recall / Sensitivity", xlab="Specificity",type="b",xlim=c(0,1),ylim=c(0,1),main="ROC")
Note that the precision is NaN at the largest threshold, where no emails are predicted to be spam (C1+W1 is 0), so that point is not plotted.
We can make the same plots for our linear regression model, overlaying the curves from both models:
spamlm.precision <- lm.conf[,"C1"]/(lm.conf[,"C1"]+lm.conf[,"W1"])
spamlm.recall <- lm.conf[,"C1"]/(lm.conf[,"C1"]+lm.conf[,"W0"])
spamlm.spec <- lm.conf[,"C0"]/(lm.conf[,"C0"]+lm.conf[,"W1"])
par(mfrow=c(1,2))
matplot(x = cbind(spamGlm.precision,
spamlm.precision),
y = cbind(spamGlm.recall,
spamlm.recall),
xlab="Precision", ylab="Recall",type="l")
legend(.8, .4, lty=1:2, col=1:2,c("glm","lm"))
matplot(x = cbind(spamGlm.spec,
spamlm.spec),
y = cbind(spamGlm.recall,
spamlm.recall),
ylab="Recall / Sensitivity", xlab="Specificity",type="l")
legend(.8, .4, lty=1:2, col=1:2,c("glm","lm"))