Logistic Regression

We will work with a dataset packaged in the library DAAG:

library(DAAG)

## Warning: package 'DAAG' was built under R version 4.0.2

## Loading required package: lattice

data(frogs)

We will visualize this data by plotting the presence/absence of frogs geographically (northing is the distance north, easting is distance east in meters). We first define a factor from our binary variable for coloring the points

presAbs<-factor(frogs$pres.abs,levels=c(0,1),labels=c("Absent","Present"))

We can chose the pch based on the values of presAbs. 16 means to fill in the circle, while 1 leaves it empty:

plot(northing ~ easting, data=frogs, pch=c(1,16)[presAbs],
     xlab="Meters east of reference point", ylab="Meters north")
legend("bottomleft", legend=levels(presAbs), pch=c(1,16))

plot of chunk unnamed-chunk-3

We will also consider another dataset, spam7 that gives classification of emails as spam or not.

data(spam7)
head(spam7)

##   crl.tot dollar  bang money n000 make yesno
## 1     278  0.000 0.778  0.00 0.00 0.00     y
## 2    1028  0.180 0.372  0.43 0.43 0.21     y
## 3    2259  0.184 0.276  0.06 1.16 0.06     y
## 4     191  0.000 0.137  0.00 0.00 0.00     y
## 5     191  0.000 0.135  0.00 0.00 0.00     y
## 6      54  0.000 0.000  0.00 0.00 0.00     y

For simplicity, we are going to rename the dataset to just spam

spam = spam7

We use the function glm to fit the logistic model to the frogs data. We will not use the geographic variables in our model, so we will first create a data.frame that excludes those.

frogsNoGeo<-frogs[,-c(2,3)]

We now call glm using the same syntax as in lm. We give the response variable on the left of the ~, and on the right we give the explanatory variables. Because we removed the geographic variables, we will use all of the remaining variables as explanatory variables, so we can simplify the formula by just using “.” on the right of the ~, like with lm.

glmFrogs = glm(pres.abs ~ ., family = binomial, data = frogsNoGeo)

As with lm, the summary of the output gives relevant information regarding the significance of coefficients and the overall fit.

summary(glmFrogs)

## 
## Call:
## glm(formula = pres.abs ~ ., family = binomial, data = frogsNoGeo)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7215  -0.7590  -0.2237   0.8320   2.6789  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.105e+02  1.388e+02   0.796  0.42587   
## altitude    -3.086e-02  4.076e-02  -0.757  0.44901   
## distance    -4.800e-04  2.055e-04  -2.336  0.01949 * 
## NoOfPools    2.986e-02  9.276e-03   3.219  0.00129 **
## NoOfSites    4.364e-02  1.061e-01   0.411  0.68077   
## avrain      -1.140e-02  5.995e-02  -0.190  0.84920   
## meanmin      4.899e+00  1.564e+00   3.133  0.00173 **
## meanmax     -5.660e+00  5.049e+00  -1.121  0.26224   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 198.74  on 204  degrees of freedom
## AIC: 214.74
## 
## Number of Fisher Scoring iterations: 6

Like in lm, we can use the function fitted to get the fitted values of our data. These are the fitted probabilities for each observation.

head(fitted(glmFrogs))

##         2         3         4         5         6         7 
## 0.9994421 0.9391188 0.8683363 0.7443973 0.9427198 0.7107780

We can plot the fitted probabilities against the actual response. However, since our actual responses aren't probabilities, but 0/1 values, it makes more sense to plot this as a boxplot, so I can see the distribution of the fitted values for the 0 group and the 1 group.

boxplot(fitted(glmFrogs)~frogs$pres.abs, at=c(0,1))
points(x=jitter(frogs$pres.abs),fitted(glmFrogs))

plot of chunk unnamed-chunk-10

I also overlay the individual predictions on top of the boxplot. To do this, I used points. But if I had just done points(x=frogs$pres.abs, y=fitted(glmFrogs)), then all the x values in the 0 group would be the same and they would all be overplotted. So I used the function jitter to add a bit of random noise to the 0/1 values; this helps spread out the x-axis values so the points don't get overlaid on each other.

We can see some of the points have high predicted probabilities to be in the opposite class from what we observed. For example, some points that had a response of 0 had fitted probabilities $>0.7$ to be a 1. Similarly some points with a response of 1 had fitted probabilities $<0.2$ to be a 1. We can isolate those points, and inspect them to try to understand why.

First we find which points these are:

high0 <- frogs$pres.abs==0 & glmFrogs$fitted >.7
low1 <- frogs$pres.abs==1 & glmFrogs$fitted <.2

Once we've found them, we can plot them as red on our plot.

par(mfrow=c(1,2))
plot(northing ~ easting, data=frogs, pch=c(1,16)[frogs$pres.abs+1], col=c("black","red")[factor(high0)], xlab="Meters east of reference point", ylab="Meters north",main="Points with no frogs, but high prob") 
plot(northing ~ easting, data=frogs, pch=c(1,16)[frogs$pres.abs+1], col=c("black","red")[factor(low1)], xlab="Meters east of reference point", ylab="Meters north",main="Points with frogs, but low prob")

plot of chunk unnamed-chunk-12

Comparing models

We can calculate the overall residual deviance from our object using the deviance function:

deviance(glmFrogs)

## [1] 198.7384

We demonstrate the difference in deviance we get from removing the variable NoOfPools. First we fit a model excluding this variable:

m2 = glm(pres.abs ~ altitude + distance + NoOfSites + avrain + meanmin + meanmax, family = binomial, data = frogs)

Now we calculate the deviance for this smaller model and the original (full) model:

deviance(m2)

## [1] 210.8392

deviance(glmFrogs)

## [1] 198.7384

We can similarly get the deviance for a null model of no variables. This is the

m0 <- glm(pres.abs ~ 1, family = binomial, data = frogs)
deviance(m0)

## [1] 279.987

Notice that this null deviance is what is reported in the summary of our glm object:

summary(glmFrogs)

## 
## Call:
## glm(formula = pres.abs ~ ., family = binomial, data = frogsNoGeo)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7215  -0.7590  -0.2237   0.8320   2.6789  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.105e+02  1.388e+02   0.796  0.42587   
## altitude    -3.086e-02  4.076e-02  -0.757  0.44901   
## distance    -4.800e-04  2.055e-04  -2.336  0.01949 * 
## NoOfPools    2.986e-02  9.276e-03   3.219  0.00129 **
## NoOfSites    4.364e-02  1.061e-01   0.411  0.68077   
## avrain      -1.140e-02  5.995e-02  -0.190  0.84920   
## meanmin      4.899e+00  1.564e+00   3.133  0.00173 **
## meanmax     -5.660e+00  5.049e+00  -1.121  0.26224   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 198.74  on 204  degrees of freedom
## AIC: 214.74
## 
## Number of Fisher Scoring iterations: 6

This comparison of deviances can also be compared via the function anova

anova(m0,glmFrogs)

## Analysis of Deviance Table
## 
## Model 1: pres.abs ~ 1
## Model 2: pres.abs ~ altitude + distance + NoOfPools + NoOfSites + avrain + 
##     meanmin + meanmax
##   Resid. Df Resid. Dev Df Deviance
## 1       211     279.99            
## 2       204     198.74  7   81.249

The above did not give a p-value for the comparison, but if we specify the test we want to use, such as test="LRT", you will get p-values.

anova(m0,glmFrogs, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: pres.abs ~ 1
## Model 2: pres.abs ~ altitude + distance + NoOfPools + NoOfSites + avrain + 
##     meanmin + meanmax
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       211     279.99                          
## 2       204     198.74  7   81.249 7.662e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also get p-values for compare the effect of leaving out a single variable:

anova(m2,glmFrogs, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: pres.abs ~ altitude + distance + NoOfSites + avrain + meanmin + 
##     meanmax
## Model 2: pres.abs ~ altitude + distance + NoOfPools + NoOfSites + avrain + 
##     meanmin + meanmax
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1       205     210.84                         
## 2       204     198.74  1   12.101 0.000504 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These p-values are slightly different than the results printed on summary which uses a different test (not the “LRT”)

cat("Summary results:\n")

## Summary results:

round(summary(glmFrogs)$coeff,4)

##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) 110.4935   138.7622  0.7963   0.4259
## altitude     -0.0309     0.0408 -0.7571   0.4490
## distance     -0.0005     0.0002 -2.3360   0.0195
## NoOfPools     0.0299     0.0093  3.2192   0.0013
## NoOfSites     0.0436     0.1061  0.4114   0.6808
## avrain       -0.0114     0.0599 -0.1901   0.8492
## meanmin       4.8991     1.5637  3.1329   0.0017
## meanmax      -5.6603     5.0488 -1.1211   0.2622

We can also use AIC for glm models:

AIC(glmFrogs)

## [1] 214.7384

The AIC is used in the step function to do stepwise insertion and deletion of variables, like with regression.

step(glmFrogs, direction="both",trace=0)

## 
## Call:  glm(formula = pres.abs ~ distance + NoOfPools + meanmin + meanmax, 
##     family = binomial, data = frogsNoGeo)
## 
## Coefficients:
## (Intercept)     distance    NoOfPools      meanmin      meanmax  
##  14.0074032   -0.0005138    0.0285643    5.6230647   -2.3717579  
## 
## Degrees of Freedom: 211 Total (i.e. Null);  207 Residual
## Null Deviance:       280 
## Residual Deviance: 199.6     AIC: 209.6

Confusion matrices and Classification

The following function confusion calculates the entries of the confusion matrix for a given cutoff, given by thres.

confusion <- function (y, yhat, thres)
  {
    n <- length(thres)
    conf <- matrix(0,length(thres),ncol=4)
    colnames(conf) <- c("C0","W1","W0","C1")
    rownames(conf)<-round(thres,3)
    for ( i in 1:n)
      {
        C0 <- sum((!y) & (yhat<=thres[i]))
        W1 <- sum((!y) & (yhat>thres[i]))
        W0 <- sum((y) & (yhat<=thres[i])) 
        C1 <- sum((y) & (yhat>thres[i]))
        conf[i,] <- c(C0, W1, W0, C1)
      }
    return(conf)
}

We can use this function to calculate the confusion matrix values for a large range of thresholds. The output is a matrix, where each row corresponds to a threshold value, and each columns one of the entries of the 2x2 matrix.

thr<-seq(0,1,by=.05)
conf<-confusion(frogs$pres.abs, glmFrogs$fitted.values, thr)
conf

##       C0  W1 W0 C1
## 0      0 133  0 79
## 0.05  33 100  3 76
## 0.1   47  86  5 74
## 0.15  61  72  5 74
## 0.2   69  64  6 73
## 0.25  80  53 10 69
## 0.3   84  49 10 69
## 0.35  91  42 13 66
## 0.4  100  33 14 65
## 0.45 106  27 18 61
## 0.5  112  21 21 58
## 0.55 118  15 26 53
## 0.6  121  12 35 44
## 0.65 126   7 44 35
## 0.7  129   4 50 29
## 0.75 130   3 59 20
## 0.8  133   0 69 10
## 0.85 133   0 71  8
## 0.9  133   0 73  6
## 0.95 133   0 78  1
## 1    133   0 79  0

Using this output, I can plot the total amount of error, W1+W0 as a function of the threshold:

plot(thr,conf[,"W1"]+conf[,"W0"], xlab="threshold", ylab="W1+W0")

plot of chunk unnamed-chunk-26

Example of spam data

We will now return to the spam dataset for demonstration.

head(spam7)

##   crl.tot dollar  bang money n000 make yesno
## 1     278  0.000 0.778  0.00 0.00 0.00     y
## 2    1028  0.180 0.372  0.43 0.43 0.21     y
## 3    2259  0.184 0.276  0.06 1.16 0.06     y
## 4     191  0.000 0.137  0.00 0.00 0.00     y
## 5     191  0.000 0.135  0.00 0.00 0.00     y
## 6      54  0.000 0.000  0.00 0.00 0.00     y

First we will visualize some of the variables:

summary(spam)

##     crl.tot            dollar             bang             money         
##  Min.   :    1.0   Min.   :0.00000   Min.   : 0.0000   Min.   : 0.00000  
##  1st Qu.:   35.0   1st Qu.:0.00000   1st Qu.: 0.0000   1st Qu.: 0.00000  
##  Median :   95.0   Median :0.00000   Median : 0.0000   Median : 0.00000  
##  Mean   :  283.3   Mean   :0.07581   Mean   : 0.2691   Mean   : 0.09427  
##  3rd Qu.:  266.0   3rd Qu.:0.05200   3rd Qu.: 0.3150   3rd Qu.: 0.00000  
##  Max.   :15841.0   Max.   :6.00300   Max.   :32.4780   Max.   :12.50000  
##       n000             make        yesno   
##  Min.   :0.0000   Min.   :0.0000   n:2788  
##  1st Qu.:0.0000   1st Qu.:0.0000   y:1813  
##  Median :0.0000   Median :0.0000           
##  Mean   :0.1016   Mean   :0.1046           
##  3rd Qu.:0.0000   3rd Qu.:0.0000           
##  Max.   :5.4500   Max.   :4.5400

par(mfrow=c(3,2))
for (i in 1:5)
  hist(spam[,i],main="",xlab=names(spam)[i],breaks=10000)
par(mfrow=c(1,1))

plot of chunk unnamed-chunk-28

And do a pairs plot:

pairs(spam, cex=0.5)

plot of chunk unnamed-chunk-29

The explanatory variables are very skewed, which may be hard for the model. We can consider taking the log-transform of our explanatory variables; however, we will add a small amount before taking the log so we don't run the risk of taking log of 0:

s = 0.001  
pairs(~log(crl.tot)+log(dollar+s) +log(bang+s)+log(money+s)+log(n000+s)+log(make+s) + yesno,       data=spam, cex=.5)

plot of chunk unnamed-chunk-30

Based on these plots, we are going to log-transform many of the explanatory variables before putting them into the glm model

spam.glm <- glm(yesno~ log(crl.tot) + log(dollar+s) + log(bang+s)
                +log(money+s) + log(n000+s) + log(make+s),
                family=binomial, data=spam)
summary(spam.glm)

## 
## Call:
## glm(formula = yesno ~ log(crl.tot) + log(dollar + s) + log(bang + 
##     s) + log(money + s) + log(n000 + s) + log(make + s), family = binomial, 
##     data = spam)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1657  -0.4367  -0.2863   0.3609   2.7152  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      4.11947    0.36342  11.335  < 2e-16 ***
## log(crl.tot)     0.30228    0.03693   8.185 2.71e-16 ***
## log(dollar + s)  0.32586    0.02365  13.777  < 2e-16 ***
## log(bang + s)    0.40984    0.01597  25.661  < 2e-16 ***
## log(money + s)   0.34563    0.02800  12.345  < 2e-16 ***
## log(n000 + s)    0.18947    0.02931   6.463 1.02e-10 ***
## log(make + s)   -0.11418    0.02206  -5.177 2.25e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6170.2  on 4600  degrees of freedom
## Residual deviance: 3245.1  on 4594  degrees of freedom
## AIC: 3259.1
## 
## Number of Fisher Scoring iterations: 6

We can compare this to if we had run a standard regression with lm:

spam.lm <-  lm(as.numeric(yesno=="y")~ log(crl.tot) + log(dollar+s)
               + log(bang+s) +log(money+s) + log(n000+s) + log(make+s)
               ,data=spam)
summary(spam.lm)

## 
## Call:
## lm(formula = as.numeric(yesno == "y") ~ log(crl.tot) + log(dollar + 
##     s) + log(bang + s) + log(money + s) + log(n000 + s) + log(make + 
##     s), data = spam)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10937 -0.13830 -0.05674  0.15262  1.05619 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.078531   0.034188  31.547  < 2e-16 ***
## log(crl.tot)     0.028611   0.003978   7.193 7.38e-13 ***
## log(dollar + s)  0.054878   0.002934  18.703  < 2e-16 ***
## log(bang + s)    0.064522   0.001919  33.619  < 2e-16 ***
## log(money + s)   0.039776   0.002751  14.457  < 2e-16 ***
## log(n000 + s)    0.018530   0.002815   6.582 5.16e-11 ***
## log(make + s)   -0.017380   0.002370  -7.335 2.61e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3391 on 4594 degrees of freedom
## Multiple R-squared:  0.5193, Adjusted R-squared:  0.5186 
## F-statistic: 827.1 on 6 and 4594 DF,  p-value: < 2.2e-16

We can first compare the fitted values from these two models.

par(mfrow=c(1,1))
plot(spam.lm$fitted.values, spam.glm$fitted.values,asp=1)
abline(c(0, 1), col = "red")

plot of chunk unnamed-chunk-33

We can also instead consider the error we make based on the two models. We use our confusion function from before to calculate the different type of errors from both models.

v <- seq(0.001,0.999,length=50)
y <- as.numeric(spam$yesno=="y")
glm.conf <- confusion(y,spam.glm$fitted,v)
lm.conf <- confusion(y,spam.lm$fitted,v)
matplot(v,cbind((glm.conf[,"W1"]+glm.conf[,"W0"])/4601, (lm.conf[,"W1"]+lm.conf[,"W0"])/4601), xlab="threshold", ylab="W0+W1",type="b",pch=1)
legend(.8, .4, lty=1:2, col=1:2,c("glm","lm"))

plot of chunk unnamed-chunk-34

We can also compare what would have happened if we hadn't taken the log of our explanatory variables.

spam.glm.nolog <- glm(yesno~ crl.tot + dollar + bang
                +money + n000 + make,
                family=binomial, data=spam)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(spam.glm)

## 
## Call:
## glm(formula = yesno ~ log(crl.tot) + log(dollar + s) + log(bang + 
##     s) + log(money + s) + log(n000 + s) + log(make + s), family = binomial, 
##     data = spam)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1657  -0.4367  -0.2863   0.3609   2.7152  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      4.11947    0.36342  11.335  < 2e-16 ***
## log(crl.tot)     0.30228    0.03693   8.185 2.71e-16 ***
## log(dollar + s)  0.32586    0.02365  13.777  < 2e-16 ***
## log(bang + s)    0.40984    0.01597  25.661  < 2e-16 ***
## log(money + s)   0.34563    0.02800  12.345  < 2e-16 ***
## log(n000 + s)    0.18947    0.02931   6.463 1.02e-10 ***
## log(make + s)   -0.11418    0.02206  -5.177 2.25e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6170.2  on 4600  degrees of freedom
## Residual deviance: 3245.1  on 4594  degrees of freedom
## AIC: 3259.1
## 
## Number of Fisher Scoring iterations: 6

summary(spam.glm.nolog)

## 
## Call:
## glm(formula = yesno ~ crl.tot + dollar + bang + money + n000 + 
##     make, family = binomial, data = spam)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.6153  -0.5816   0.4439   1.9323  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.700e+00  5.361e-02 -31.717  < 2e-16 ***
## crl.tot      6.917e-04  9.745e-05   7.098 1.27e-12 ***
## dollar       8.013e+00  6.175e-01  12.976  < 2e-16 ***
## bang         1.572e+00  1.115e-01  14.096  < 2e-16 ***
## money        2.142e+00  2.418e-01   8.859  < 2e-16 ***
## n000         4.149e+00  4.371e-01   9.492  < 2e-16 ***
## make         1.698e-02  1.434e-01   0.118    0.906    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6170.2  on 4600  degrees of freedom
## Residual deviance: 4058.8  on 4594  degrees of freedom
## AIC: 4072.8
## 
## Number of Fisher Scoring iterations: 16

We can see that the AIC is much lower in the one with the log. Similarly with the lm models, we have a noticable difference in the R-squared values.

spam.lglmFrogs = lm(as.numeric(yesno=="y")~ crl.tot + dollar
                + bang + money + n000 + make, data=spam)
summary(spam.lglmFrogs)

## 
## Call:
## lm(formula = as.numeric(yesno == "y") ~ crl.tot + dollar + bang + 
##     money + n000 + make, data = spam)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8650 -0.2758 -0.2519  0.4459  0.7499 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.498e-01  7.488e-03  33.365   <2e-16 ***
## crl.tot     1.241e-04  1.058e-05  11.734   <2e-16 ***
## dollar      3.481e-01  2.733e-02  12.740   <2e-16 ***
## bang        1.113e-01  7.725e-03  14.407   <2e-16 ***
## money       1.765e-01  1.440e-02  12.262   <2e-16 ***
## n000        3.218e-01  1.891e-02  17.014   <2e-16 ***
## make        3.212e-02  2.101e-02   1.529    0.126    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4223 on 4594 degrees of freedom
## Multiple R-squared:  0.2543, Adjusted R-squared:  0.2533 
## F-statistic: 261.1 on 6 and 4594 DF,  p-value: < 2.2e-16

summary(spam.lm)

## 
## Call:
## lm(formula = as.numeric(yesno == "y") ~ log(crl.tot) + log(dollar + 
##     s) + log(bang + s) + log(money + s) + log(n000 + s) + log(make + 
##     s), data = spam)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10937 -0.13830 -0.05674  0.15262  1.05619 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.078531   0.034188  31.547  < 2e-16 ***
## log(crl.tot)     0.028611   0.003978   7.193 7.38e-13 ***
## log(dollar + s)  0.054878   0.002934  18.703  < 2e-16 ***
## log(bang + s)    0.064522   0.001919  33.619  < 2e-16 ***
## log(money + s)   0.039776   0.002751  14.457  < 2e-16 ***
## log(n000 + s)    0.018530   0.002815   6.582 5.16e-11 ***
## log(make + s)   -0.017380   0.002370  -7.335 2.61e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3391 on 4594 degrees of freedom
## Multiple R-squared:  0.5193, Adjusted R-squared:  0.5186 
## F-statistic: 827.1 on 6 and 4594 DF,  p-value: < 2.2e-16

We can use the results of our confusion matrix to calculate a precision/recall curve or a Sensitivity/Specificity plot (also known as a ROC curve):

spamGlm.precision<-glm.conf[,"C1"]/(glm.conf[,"C1"]+glm.conf[,"W1"])
spamGlm.recall<-glm.conf[,"C1"]/(glm.conf[,"C1"]+glm.conf[,"W0"])
spamGlm.spec<-glm.conf[,"C0"]/(glm.conf[,"C0"]+glm.conf[,"W1"])
par(mfrow=c(1,2))
plot(x=spamGlm.precision,y=spamGlm.recall,xlab="Precision",ylab="Recall / Sensitivity",type="b",xlim=c(0,1),ylim=c(0,1), main="Precision-Recall")
plot(x=spamGlm.spec,y=spamGlm.recall,ylab="Recall / Sensitivity", xlab="Specificity",type="b",xlim=c(0,1),ylim=c(0,1),main="ROC")

plot of chunk unnamed-chunk-37

Note there is a NaN for precision for the last value of the threshold so there is no point plotted.

We can do the same plots for our linear regression curves:

spamlm.precision<-lm.conf[,"C1"]/(lm.conf[,"C1"]+glm.conf[,"W1"])
spamlm.recall<-lm.conf[,"C1"]/(lm.conf[,"C1"]+glm.conf[,"W0"])
spamlm.spec<-lm.conf[,"C0"]/(lm.conf[,"C0"]+glm.conf[,"W1"])

par(mfrow=c(1,2))
matplot(x = cbind(spamGlm.precision,
              spamlm.precision),
        y = cbind(spamGlm.recall,
              spamlm.recall),
        xlab="Precision", ylab="Recall",type="l")
legend(.8, .4, lty=1:2, col=1:2,c("glm","lm"))
matplot(x = cbind(spamGlm.spec,
              spamlm.spec),
        y = cbind(spamGlm.recall,
              spamlm.recall),
        ylab="Recall / Sensitivity", xlab="Specificity",type="l")
legend(.8, .4, lty=1:2, col=1:2,c("glm","lm"))

plot of chunk unnamed-chunk-38