Do not forget to change the directory to wherever the data files are located (see 01Probability_forClassCode.Rmd).

dataDir<-"../../finalDataSets"
# dataDir<-"."

I will first read in three data sets. One is a SRS (simple random sample) from the entire year 2016; one is just the month of January; and the last is a stratified sample, i.e. a SRS taken from each month. I will mainly use the SRS in what follows.

flightSFOJan<-read.table(file.path(dataDir,"SFO.txt"),
                         sep="\t", header=TRUE, stringsAsFactors = FALSE)
flightSFOSRS<-read.table(file.path(dataDir,"SFO_SRS.txt"),
                         sep="\t",header=TRUE, stringsAsFactors = FALSE)
flightSFOStratified<-read.table(file.path(dataDir,"SFO_Stratified.txt"),
                                sep="\t",header=TRUE, stringsAsFactors = FALSE)

Now let's subset to only those flights on United (UA) or American Airlines (AA).

flightSubset <- flightSFOSRS[flightSFOSRS$Carrier %in% c("UA", "AA"), ]

If I take the mean of the departure delay variable, I get NA because the variable contains missing values.

mean(flightSubset$DepDelay)
## [1] NA

To ignore the NA, I set na.rm=TRUE (uncomment the next line). Notice, however, that this may not be a wise thing to do, as we've discussed.

#mean(flightSubset$DepDelay,na.rm=TRUE) #remove the NAs, which may or may not be wise

We demonstrate the use of tapply to perform a calculation for each group. This function is for when you have one vector (or column of your data frame) with all of the observed measurements and another vector/column that says which group each observation belongs to. This is how data usually come, so tapply lets you calculate a statistic for each group without annoying subsetting or a for-loop over the different group values. We saw this before in lecture 1. When I apply it here, though, I just get NAs.
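
A toy example may make the shape of the two inputs clearer (the small vectors below are made up purely for illustration):

# toy example: mean of the values within each group
vals   <- c(1, 2, 3, 10, 20)
groups <- c("a", "a", "a", "b", "b")
tapply(vals, groups, mean)  # a: 2, b: 15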

# NA problem:
tapply(X=flightSubset$DepDelay, flightSubset$Carrier, mean)
## AA UA 
## NA NA

But I can give additional arguments to the mean function via tapply by just adding them in. If you look in the help of tapply you'll see that there is an argument named '...'. This is R's way of letting you pass along additional arguments without tapply needing to know what those arguments are. In this case we will give the na.rm=TRUE argument for mean. Notice that tapply just sends these arguments along to mean without looking at them, so tapply doesn't need to know what arguments the function you give in FUN actually takes, which makes it very flexible.

# remove NA:
tapply(flightSubset$DepDelay, flightSubset$Carrier, mean,na.rm=TRUE)
##        AA        UA 
##  7.728294 12.255649
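
As a sketch of how '...' works in your own functions, here is a hypothetical wrapper (summarizeBy is not a real R function, just an illustration) that hands any extra arguments straight on to mean:

# hypothetical wrapper: anything extra in ... is passed along to mean()
summarizeBy <- function(values, groups, ...) {
  tapply(values, groups, mean, ...)
}
# summarizeBy(flightSubset$DepDelay, flightSubset$Carrier, na.rm=TRUE) gives the same result as above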

We can also write our own function for tapply. The following code does the same as the above, only this time we wrote out a function that sets na.rm=TRUE in our call to mean.

# equivalent, only write a function to do it:
tapply(flightSubset$DepDelay, flightSubset$Carrier, 
       function(x){mean(x, na.rm = TRUE)})
##        AA        UA 
##  7.728294 12.255649

We could have made this two lines of code by defining the function separately first. As an experienced coder, I don't tend to do this, but certainly feel free to break it down into steps. Defining the function separately can also be helpful if it is long or complicated, so that you can check it does what you want.

#equivalent, only in pieces
f<-function(x){mean(x, na.rm = TRUE)}
tapply(flightSubset$DepDelay, flightSubset$Carrier, FUN=f)
##        AA        UA 
##  7.728294 12.255649

Now we do the median:

# median delay time
tapply(flightSubset$DepDelay, flightSubset$Carrier, 
       function(x){median(x, na.rm = TRUE)})
## AA UA 
## -2 -1

Now we calculate the percentage of flights that are delayed, writing our own function to do this. Notice how I create a logical vector x > 0 | is.na(x) to determine which observations are either delayed more than 0 minutes or missing (i.e. cancelled). Taking the sum of a logical vector counts the number of TRUE values.
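
As a small illustration of this idea (the vector x below is made up):

x <- c(5, -2, NA, 20)
x > 0 | is.na(x)       # TRUE FALSE TRUE TRUE (the NA counts as delayed/cancelled)
sum(x > 0 | is.na(x))  # 3, the number of TRUE values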

# percent flights delayed
tapply(flightSubset$DepDelay, flightSubset$Carrier, 
       function(x){sum(x > 0 | is.na(x)) / length(x)})
##        AA        UA 
## 0.3201220 0.4383791

Here is another example, this time calculating the percentage of flights delayed more than 15 minutes.

# percent flights delayed more than 15 minutes
tapply(flightSubset$DepDelay, flightSubset$Carrier, 
       function(x){sum(x > 15 | is.na(x)) / length(x)})
##        AA        UA 
## 0.1554878 0.2046216

And now the percentage of cancelled flights. Notice that the Cancelled variable is coded 1=cancelled and 0=not cancelled, so taking the mean of these 0-1 values is the same as calculating the proportion cancelled.
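
For example, with a made-up 0-1 vector:

mean(c(0, 0, 1, 0))  # 0.25, i.e. the proportion of 1s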

# percent flights cancelled
tapply(flightSubset$Cancelled, flightSubset$Carrier, mean)
##          AA          UA 
## 0.005081301 0.007032820

To illustrate different ways of sampling data, we had three different data sets (SRS, stratified, and all of January). Here we plot density estimates of the departure delays of UA and AA for these different samples to show how they can be rather different.

par(mfrow=c(1, 2))
xlim<-c(-20, 50)
ylim<-c(0, 0.08)
#draw same for them separately
for (airline in c("UA", "AA")){
    density.Jan <- density(flightSFOJan[flightSFOJan$Carrier==airline, "DepDelay"], 
                           na.rm = TRUE)
    density.SRS <- density(flightSFOSRS[flightSFOSRS$Carrier==airline, "DepDelay"], 
                           na.rm = TRUE)
    density.Stratified <- density(flightSFOStratified[flightSFOStratified$Carrier==airline, "DepDelay"], 
                                  na.rm = TRUE)
  plot(density.Jan, xlim=xlim, ylim=ylim, col="black",
       main=paste("Density estimates:",airline), xlab=airline)
  lines(density.SRS, col="red")
  lines(density.Stratified, col="blue")
  legend("topright",c("January","SRS","Stratified"), fill=c("black","red","blue"))
}

plot of chunk unnamed-chunk-14

Permutation Tests

Now we consider implementing permutation tests to compare two groups. The following code sets up a generic function that takes as input the data for group1 and group2, as well as a function FUN that defines the statistic we are using for the test. The last argument, repetitions, is the number of permutation repetitions to do. It returns a list with the p-value, the observed statistic (i.e. calculated on the real data), and a vector of the statistic calculated on the permuted data.

# Returns:
#   list with p-value, observed statistic, and vector of permuted statistics

permutation.test <- function(group1,group2, FUN, repetitions){ 
  # calculate the observed statistic
  stat.obs <- FUN(group1,  group2)
  #function to do the permutation and return statistic
  makePermutedStats<-function(){
      sampled <- sample(1:length(c(group1,group2)), size=length(group1),replace=FALSE) #sample indices that will go into making new group 1
      return(FUN(c(group1,group2)[sampled], c(group1,group2)[-sampled]))
  }
  # calculate the permuted statistic
  stat.permute <-replicate(repetitions,makePermutedStats()) 
  p.value <- sum(stat.permute >= stat.obs) / repetitions
  return(list(p.value=p.value,observedStat=stat.obs,permutedStats=stat.permute))
}

Notice how I give names to the values when defining the list: list(p.value=..., observedStat=..., permutedStats=...) means that the resulting list will have three elements with the names p.value, observedStat, and permutedStats.

Again, we calculate the percentage of flights delayed more than 15 minutes; this time we choose to remove the cancelled flights with the na.omit function.

# percent flights delayed more than 15 minutes
tapply(flightSFOSRS$DepDelay, flightSFOSRS$Carrier, 
       function(x){x<-na.omit(x); sum(x > 15) / length(x)})[c("AA","UA")]
##        AA        UA 
## 0.1511747 0.1989882

Now I want to run a permutation test. Since there is randomness in our permutations, I am going to call set.seed, which takes an arbitrary number as its argument. This means that I will make the same random draws every time I rerun the code, which is convenient when you want the exact same result each time (for example in homeworks or lectures).
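
As a quick illustration of what set.seed does (the seed value 5 here is arbitrary and unrelated to the one used below):

set.seed(5); sample(1:10, 3)  # three random numbers
set.seed(5); sample(1:10, 3)  # resetting the seed reproduces exactly the same three numbers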

set.seed(201728)

Now I need to define a function that will calculate my statistic when given two vectors of data as input.

diffProportion<-function(x1,x2){
    x1<-na.omit(x1)
    x2<-na.omit(x2)
    prop1<-sum(x1>15)/length(x1)
    prop2<-sum(x2>15)/length(x2)
    return(abs(prop1-prop2))
}

Now I will run my function permutation.test on the UA and AA data with 10,000 permutations and save the output.

dataset<-flightSFOSRS
output<-permutation.test(group1=dataset$DepDelay[dataset$Carrier == "UA"], group2=dataset$DepDelay[dataset$Carrier == "AA"],FUN=diffProportion, repetitions=10000)

The output contains not just the p-value, but the values of the statistic under all of the permutations. The output is a list, so I can access the values with a $ and the names.

names(output)
## [1] "p.value"       "observedStat"  "permutedStats"

For example, output$p.value will give me the p-value. I can plot the histogram of these permuted values, along with the observed value. Notice that I can use the function abline to draw horizontal or vertical lines; since I want a vertical line, I set v= to the value where the vertical line should be drawn.

xlim<-range(c(output$observedStat,output$permutedStats))
hist(output$permutedStats, main = "Histogram of permuted statistics",
    xlab = "Difference in proportion delay > 15 minutes", xlim=xlim)
abline(v = output$observedStat, col = "blue", lwd = 2)  

plot of chunk unnamed-chunk-21

cat("pvalue=",output$p.value)
## pvalue= 8e-04

I can repeat the whole process with the median.

# median of delay times
tapply(flightSFOSRS$DepDelay, flightSFOSRS$Carrier, 
       function(x){median(x,na.rm=TRUE)})[c("AA","UA")]
## AA UA 
## -2 -1

All I need to do is change the function that calculates the statistic to now calculate the absolute difference in the medians.

set.seed(2843261)
# statistic: absolute difference in medians
diffMedian<-function(x1,x2){
    med1<-median(x1,na.rm=TRUE)
    med2<-median(x2,na.rm=TRUE)
    return(abs(med1-med2))
}

And now I can run the exact same code to get the permutation test.

dataset<-flightSFOSRS
output<-permutation.test(group1=dataset$DepDelay[dataset$Carrier == "UA"], group2=dataset$DepDelay[dataset$Carrier == "AA"],FUN=diffMedian, repetitions=10000)
xlim<-range(c(output$observedStat,output$permutedStats))
hist(output$permutedStats, main = "Histogram of permuted statistics",
    xlab = "Difference in median flight delay", xlim=xlim)
abline(v = output$observedStat, col = "blue", lwd = 2)  

plot of chunk unnamed-chunk-24

cat("pvalue=",output$p.value)
## pvalue= 0.1279

T-test

This next plot is simply a plot of the standard normal density.

curve(dnorm,from=-3,to=3)

plot of chunk unnamed-chunk-25

This next plot demonstrates the difference between the distribution of a normal variable and the distribution of its absolute value.

par(mfrow=c(1,2))
curve(dnorm,from=-3,to=3,main="Normal density curve",ylim=c(0,0.8))
library(fdrtool)
f<-function(x){abs(dhalfnorm(x))}
curve(f,from=0,to=6,main="Absolute Normal density curve")

plot of chunk unnamed-chunk-26

Now I draw the t-distribution (i.e. the distribution of the t-statistic) for different sample sizes (degrees of freedom).

f<-function(x){dt(x,df=4)}
curve(f,from=-3,to=3,col="red")
f<-function(x){dt(x,df=9)}
curve(f,from=-3,to=3,col="orange",add=TRUE)
f<-function(x){dt(x,df=29)}
curve(f,from=-3,to=3,col="blue",add=TRUE)
curve(dnorm,add=TRUE)
legend("topright",legend=c("5/group","11/group","30/group","N(0,1)"),col=c("red","orange","blue","black"))

plot of chunk unnamed-chunk-27

Let's return to our flight data and consider using the t-test to compare the mean delays of the two carriers.

# mean departure delay by carrier
tapply(flightSFOSRS$DepDelay, flightSFOSRS$Carrier, 
       function(x){mean(x,na.rm=TRUE)})[c("AA","UA")]
##        AA        UA 
##  7.728294 12.255649

However, our data is very skewed.

hist(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"],main="Histogram Departure Delay, UA",xlab="In minutes",breaks=50)

plot of chunk unnamed-chunk-29

We can still run the t-test despite how skewed the data are – that doesn't mean it's a good idea, however. R provides a function, t.test, that calculates it.

t.test(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"],flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="AA"])
## 
##  Welch Two Sample t-test
## 
## data:  flightSFOSRS$DepDelay[flightSFOSRS$Carrier == "UA"] and flightSFOSRS$DepDelay[flightSFOSRS$Carrier == "AA"]
## t = 2.8325, df = 1703.1, p-value = 0.004673
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.392379 7.662332
## sample estimates:
## mean of x mean of y 
## 12.255649  7.728294

We can do a permutation test using the t-statistic as our statistic as well.

set.seed(489712)
# permutation test using the absolute value of the t-statistic as the statistic
tstatFun<-function(x1,x2){abs(t.test(x1,x2)$statistic)}
dataset<-flightSFOSRS
output<-permutation.test(group1=dataset$DepDelay[dataset$Carrier == "UA"], group2=dataset$DepDelay[dataset$Carrier == "AA"],FUN=tstatFun, repetitions=10000)
cat("pvalue=",output$p.value)
## pvalue= 0.0059

Now we can compare the distribution of the permuted t-statistics to the normal distribution the t-test would predict if we had normal data. For this plot, we will not take the absolute value of the t-statistic in the permutation test, so that we can more easily compare its distribution to the parametric model of the distribution of the t-statistic.

set.seed(489712)
# for this plot, won't take the absolute value, so will match the normal distribution plot
tstatFun2<-function(x1,x2){t.test(x1,x2)$statistic}
dataset<-flightSFOSRS[!is.na(flightSFOSRS$DepDelay),]
outputTPerm<-permutation.test(group1=dataset$DepDelay[dataset$Carrier == "UA"], group2=dataset$DepDelay[dataset$Carrier == "AA"],FUN=tstatFun2, repetitions=10000)
xlim<-range(c(outputTPerm$observedStat,outputTPerm$permutedStats))
hist(outputTPerm$permutedStats, main = "Histogram of permuted t-statistic",
    xlab = "T-statistic", xlim=xlim,freq=FALSE,breaks=100)
abline(v = outputTPerm$observedStat, col = "blue", lwd = 2) 
curve(dnorm,add=TRUE)
legend("topleft","N(0,1)",lty=1)

plot of chunk unnamed-chunk-32

Now we consider the same comparison of the parametric t-test with the permutation test based on the t-statistic, but with a much smaller sample size. I will subsample so that I only get 50 flights from each of the two airlines.

set.seed(489712)
dataset<-flightSFOSRS[flightSFOSRS$Carrier %in% c("UA","AA") & !is.na(flightSFOSRS$DepDelay),]
#sample smaller number of each
ind1<-sample(which(dataset$Carrier=="AA"),size=50)
ind2<-sample(which(dataset$Carrier=="UA"),size=50)
dataset<-dataset[c(ind1,ind2),]

Now I calculate the null distribution under the permutation test and under the parametric model (i.e. a t-distribution). To get the degrees of freedom for the t-test, I will just use the output of the t.test function (this isn't an important skill, but it is useful to know that you can save the result of t.test to access later).

outputTPermSmall<-permutation.test(group1=dataset$DepDelay[dataset$Carrier == "UA"], group2=dataset$DepDelay[dataset$Carrier == "AA"],FUN=tstatFun2, repetitions=10000)
toutput<-t.test(dataset$DepDelay[dataset$Carrier == "UA"],dataset$DepDelay[dataset$Carrier == "AA"])
xlim<-range(c(outputTPermSmall$observedStat,outputTPermSmall$permutedStats))
hist(outputTPermSmall$permutedStats, main = "Histogram of permuted t-statistic",
    xlab = "T-statistic", xlim=xlim,freq=FALSE,breaks=100)
abline(v = outputTPermSmall$observedStat, col = "blue", lwd = 2)    
tfun<-function(x){dt(x,df=toutput$parameter)}
curve(tfun,add=TRUE,col="red")
legend("topleft","t-distribution",lty=1)

plot of chunk unnamed-chunk-34

We can compare the pvalues:

cat("pvalue permutation=",outputTPermSmall$p.value,"\n")
## pvalue permutation= 0.5504
cat("pvalue t.test=",toutput$p.value,"\n")
## pvalue t.test= 0.6273047

Of course, the t-test assumptions don't match the flight data. We can consider what happens if we transform the data like we did in earlier lectures. We will repeat all of the above, only now with the log of the departure delay (plus a small constant to make all values positive).

#use the same subset of data found above
addValue<-abs(min(dataset$DepDelay,na.rm=TRUE))+1
dataset$logDepDelay<-log(dataset$DepDelay+addValue)
outputTPermSmall<-permutation.test(group1=dataset$logDepDelay[dataset$Carrier == "UA"], group2=dataset$logDepDelay[dataset$Carrier == "AA"],FUN=tstatFun2, repetitions=10000)
toutput<-t.test(dataset$DepDelay[dataset$Carrier == "UA"],dataset$DepDelay[dataset$Carrier == "AA"])
xlim<-range(c(outputTPermSmall$observedStat,outputTPermSmall$permutedStats))
hist(outputTPermSmall$permutedStats, main = "Histogram of permuted t-statistic, log-transformed",
    xlab = "T-statistic", xlim=xlim,freq=FALSE,breaks=100)
abline(v = outputTPermSmall$observedStat, col = "blue", lwd = 2)    
tfun<-function(x){dt(x,df=toutput$parameter)}
curve(tfun,add=TRUE,col="red")
legend("topleft","t-distribution",lty=1)

plot of chunk unnamed-chunk-36

cat("pvalue permutation=",outputTPermSmall$p.value,"\n")
## pvalue permutation= 0.2448
cat("pvalue t.test=",toutput$p.value,"\n")
## pvalue t.test= 0.6273047

The following is a small calculation to determine the number of pairs of carriers. Not incredibly important code.

carriers<-levels(as.factor(flightSFOSRS$Carrier))
ncarriers<-length(carriers)
npairs<-choose(ncarriers,2)
cat("number of pairs:",npairs,"\n")
## number of pairs: 45

Here is code to find all of the pairs of airline carriers, and calculate a t-test for each pair. The function combn gives all the combinations of size 2 (m=2); the result is a matrix with 2 rows, where each combination is a column.

pairsOfCarriers<-combn(x=carriers,m=2) #get all combinations of length 2, ie all pairs
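
To see the layout combn returns, here is a tiny example with three made-up labels:

combn(c("a","b","c"), m=2)
# a 2 x 3 matrix; each column is one pair: (a,b), (a,c), (b,c)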

I will again work with the transformed data, so let's create this variable again in my data.frame (before, it was only saved as a temporary value).

addValue<-abs(min(flightSFOSRS$DepDelay,na.rm=TRUE))+1
flightSFOSRS$logDepDelay<-log(flightSFOSRS$DepDelay+addValue)

Now, for each pair, I will do a t-test. I could do this as a for-loop, e.g. for(i in 1:ncol(pairsOfCarriers)){...}. This is perfectly acceptable. Instead I will use a common R function apply that is meant to loop over rows (or columns) of a matrix.

Specifically, apply takes a matrix as the first argument (X). The second argument (MARGIN) determines whether to loop over rows (MARGIN=1) or columns (MARGIN=2). Then FUN defines the function to apply to each row or column. I can give an existing function, like mean (which would result in row or column means, except that for this example my matrix is not numeric), or I can supply my own function. Here I write a function that takes the two carrier names (passed in as a character vector x by apply), subsets the data to the flights matching each carrier, does a t.test, and returns just the t-statistic and its p-value. The result is a matrix of dimension 2 x the number of combinations. I also added an argument that requires the user to specify the name of the column in the dataset where the group labels are; this is because I am going to reuse this code later with a different variable.
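
Here is a minimal sketch of apply on a small numeric matrix (made up for illustration) to show what MARGIN does:

m <- matrix(1:6, nrow=2)
apply(m, MARGIN=1, FUN=sum)  # row sums: 9 12
apply(m, MARGIN=2, FUN=sum)  # column sums: 3 7 11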

#each pair, do t-test
ttestFun<-function(x,variableName){
    tout<-t.test(flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[1]],flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[2]])
    unlist(tout[c("statistic","p.value")]) #unlist makes it a vector rather than list
}
t.testPairs<-apply(X=pairsOfCarriers,MARGIN=2,FUN=ttestFun,variableName="Carrier")

I would note that the function combn also lets you supply a function to run on each combination (via its FUN argument), so I could have avoided the separate apply step. But it is a good example of using the apply function.

# skipping apply step and doing it just within the combn function:
t.testPairsV2<-combn(x=carriers,m=2,FUN=ttestFun,variableName="Carrier")

I am going to pretty up the names of the columns here so I can keep track of which comparison each column corresponds to. The paste function allows me to glue together two vectors of strings, and sep="-" says to put a dash between them.
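
For example, with made-up carrier codes:

paste(c("AA","AA"), c("AS","B6"), sep="-")  # "AA-AS" "AA-B6"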

colnames(t.testPairs)<-paste(pairsOfCarriers[1,],pairsOfCarriers[2,],sep="-")
cat("Number found with p-value < 0.05: ",sum(t.testPairs["p.value",]<0.05))
## Number found with p-value < 0.05:  26

Notice that the function given to apply needs to assume the input is a vector corresponding to the row/column I'm looping over. I couldn't, for example, write a function function(group1,group2); I have to have function(x) where x[1] is the name of group 1 and x[2] is the name of group 2.

We can plot these p-values to get an idea of their magnitude. Notice that I give only one vector to plot. By default, if you give a single vector, say z, plot will use 1:length(z) as the x-values, i.e. plot the values across the graph in order.

plot(log(t.testPairs["p.value",]),ylab="log(P-value)",main="P-values from all pairwise tests",xaxt="n",xlab="")
abline(h=log(0.05),lty=2)
legend("bottomright",legend="P-value = -0.05",lty=2)
axis(1,at=1:ncol(t.testPairs),labels=colnames(t.testPairs),las=2)

plot of chunk unnamed-chunk-43

To consider whether this is a lot to find significant (i.e. there is strong signal) or whether this is due to just chance multiple testing issues, we redo the above. Only now we scramble the carrier labels, so there is actually no difference, and see how many we incorrectly reject. (I've set the seed here so that we get the same result each time in the lecture notes, but you can comment out the set.seed and rerun it many times to get an idea of how it changes.)

#can repeat this code over and over to see how changes with different scrambles.
set.seed(95289235) 
flightSFOSRS$fakeCarrier<-sample(flightSFOSRS$Carrier)
#each pair, do t-test
t.testPairsScramble<-apply(pairsOfCarriers,2,ttestFun,variableName="fakeCarrier")
colnames(t.testPairsScramble)<-paste(pairsOfCarriers[1,],pairsOfCarriers[2,],sep="-")
cat("Number permuted found with p-value < 0.05: ",sum(t.testPairsScramble["p.value",]<0.05))
## Number permuted found with p-value < 0.05:  6
plot(log(t.testPairsScramble["p.value",]),ylab="log(P-value)",main="P-values from all pairwise tests, scrambled carriers",xaxt="n",xlab="")
abline(h=log(0.05),lty=2)
legend("bottomright",legend="P-value = -0.05",lty=2)
axis(1,at=1:ncol(t.testPairs),labels=colnames(t.testPairs),las=2)

plot of chunk unnamed-chunk-44

We can adjust for multiple testing by adjusting our cutoff for declaring significance for each individual test (the “level”).
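
Concretely, with npairs tests the per-test cutoff becomes:

0.05/npairs  # 0.05/45, roughly 0.0011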

cat("Number found significant after Bonferonni: ",sum(t.testPairs["p.value",]<0.05/npairs))
## Number found significant after Bonferonni:  16
cat("Number of shuffled differences found significant after Bonferonni: ",sum(t.testPairsScramble["p.value",]<0.05/npairs))
## Number of shuffled differences found significant after Bonferonni:  0

We can also directly adjust our p-values and use our standard notion of level (e.g. 0.05).

t.testPairs<-rbind(t.testPairs,"p.value.adj"=t.testPairs["p.value",]*npairs)
t.testPairsScramble<-rbind(t.testPairsScramble,"p.value.adj"=t.testPairsScramble["p.value",]*npairs)
t.testPairs[,1:5]
##                  AA-AS         AA-B6        AA-DL      AA-F9       AA-HA
## statistic.t  1.1514752 -3.7413417707 -2.648054950 -0.3894014 3.101645905
## p.value      0.2501338  0.0002038769  0.008170586  0.6974224 0.003824936
## p.value.adj 11.2560196  0.0091744583  0.367676386 31.3840059 0.172122129

Notice that multiplying by npairs can produce adjusted p-values greater than 1, which does not make sense for a probability; we can cap the adjusted p-values at 1 using pmin.

t.testPairs<-rbind(t.testPairs[1:2,],"p.value.adj"=pmin(t.testPairs["p.value",]*npairs,1))
t.testPairsScramble<-rbind(t.testPairsScramble[1:2,],"p.value.adj"=pmin(t.testPairsScramble["p.value",]*npairs,1))
t.testPairs[,1:5]
##                 AA-AS         AA-B6        AA-DL      AA-F9       AA-HA
## statistic.t 1.1514752 -3.7413417707 -2.648054950 -0.3894014 3.101645905
## p.value     0.2501338  0.0002038769  0.008170586  0.6974224 0.003824936
## p.value.adj 1.0000000  0.0091744583  0.367676386  1.0000000 0.172122129

Let's identify those tests that were significant at the 0.05 level before adjustment but are no longer significant after adjustment.

whLostSig<-which(t.testPairs["p.value.adj",]>0.05 & t.testPairs["p.value",]<=0.05)
whLostSigScramble<-which(t.testPairsScramble["p.value.adj",]>0.05 & t.testPairsScramble["p.value",]<=0.05)

I will remake the previous plots, only now with adjusted p-values, and “highlight” the p-values that changed by plotting them in a different color. Notice how I can use the function points to add points to a scatterplot, just like I previously used lines to add points connected by lines to an existing scatterplot. I also use the argument pch to change from an open circle to a filled-in circle (see the help of points for a full list of the possible plotting symbols).

par(mfrow=c(1,2))
plot(log(t.testPairs["p.value.adj",]),ylab="log(adjusted P-value)",main="Adjusted P-values from all pairwise tests, scrambled carriers",xaxt="n",xlab="")
abline(h=log(0.05),lty=2)
n<-ncol(t.testPairs)
points((1:n)[whLostSig],log(t.testPairs["p.value.adj",whLostSig]),col="red",pch=19)
legend("bottomright",legend="P-value = -0.05",lty=2)
axis(1,at=1:ncol(t.testPairs),labels=colnames(t.testPairs),las=2)
plot(log(t.testPairsScramble["p.value.adj",]),ylab="log(adjusted P-value)",main="Adjusted P-values from all pairwise tests, scrambled carriers",xaxt="n",xlab="")
points((1:n)[whLostSigScramble],log(t.testPairsScramble["p.value.adj",whLostSigScramble]),col="red",pch=19)
abline(h=log(0.05),lty=2)
axis(1,at=1:ncol(t.testPairs),labels=colnames(t.testPairs),las=2)

plot of chunk unnamed-chunk-49

#legend("bottomright",legend="P-value = -0.05",lty=2)

This code demonstrates calculating quantiles using the qXXX functions, which go along with dXXX, rXXX, and pXXX. For a qXXX function the input is a probability p, and the output is the value such that the probability of being less than or equal to it is p (i.e. the p-th quantile). Here we demonstrate with the normal distribution.

qnorm(0.2, mean=0,sd=1)
## [1] -0.8416212
qnorm(0.9, mean=0,sd=1)
## [1] 1.281552
qnorm(0.0275, mean=0,sd=1)
## [1] -1.918876
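
As a check, the pXXX function inverts the corresponding qXXX function (and vice versa):

pnorm(qnorm(0.2, mean=0, sd=1), mean=0, sd=1)  # returns 0.2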

The t.test function also provides confidence intervals. Here we compute a one-sample confidence interval for the mean of the log-transformed UA delays.

t.test(log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"]+addValue))
## 
##  One Sample t-test
## 
## data:  log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier == "UA"] + addValue)
## t = 289.15, df = 2964, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3.236722 3.280920
## sample estimates:
## mean of x 
##  3.258821

We can transform this confidence interval back to the original scale (minutes of delay) by exponentiating and subtracting addValue.

logT<-t.test(log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"]+addValue))
exp(logT$conf.int)-addValue
## [1] 3.450158 4.600224
## attr(,"conf.level")
## [1] 0.95

We can also get a confidence interval for the difference in means of the log-transformed delays of UA and AA from the two-sample t-test.

t.test(log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"]+addValue),log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="AA"]+addValue))
## 
##  Welch Two Sample t-test
## 
## data:  log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier == "UA"] + addValue) and log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier == "AA"] + addValue)
## t = 5.7011, df = 1800.7, p-value = 1.389e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.07952358 0.16293414
## sample estimates:
## mean of x mean of y 
##  3.258821  3.137592

The following function creates bootstrap confidence intervals comparing two groups. The input is like the permutation.test I wrote above: the two vectors of data, the function FUN that defines your statistic, and the number of repetitions. It returns a list with the confidence interval and the vector of bootstrapped statistics.

# Parameters:
#   group1, group2: vector of data from each group
#   FUN: function; this function accepts two vectors (the two groups) 
#               and calculates the statistic (should be an absolute value)
#   repetitions: number of bootstrap resamples to draw
# 
# Returns:
#   list with confidence interval, and vector of bootstrapped statistics

bootstrap.test <- function(group1, group2, FUN, 
                           repetitions, confidence.level=0.95){
    # calculate the observed statistic
  stat.obs <- FUN(group1, group2)
  # calculate the bootstrapped statistic
  bootFun<-function(){
      sampled1 = sample(group1, size=length(group1),replace = TRUE)
      sampled2 = sample(group2, size=length(group2),replace = TRUE)
      FUN(sampled1,sampled2)    
  }  
  stat.boot<-replicate(repetitions,bootFun())
  level<-1-confidence.level
  confidence.interval <- quantile(stat.boot,probs=c(level/2,1-level/2))
    return(list(confidence.interval = c("lower"=confidence.interval[1],"estimate"=stat.obs,"upper"=confidence.interval[2]), bootStats=stat.boot))
}

Now we apply it to our flight data, reusing our function diffProportion from before.

set.seed(201728)
par(mfrow = c(1,2))
# do the bootstrap test for the difference in proportions and plot histogram
#for (dataset in list(flightSFOJan, flightSFOSRS, flightSFOStratified)){
dataset<-flightSFOSRS
   output<-bootstrap.test(group1=dataset$DepDelay[dataset$Carrier == "UA"], 
                          group2=dataset$DepDelay[dataset$Carrier == "AA"],
                          FUN=diffProportion, repetitions=10000)
   xlim<-range(c(output$confidence.interval["estimate"],output$bootStats))
   hist(output$bootStats, main = "Histogram of bootstrapped statistics", 
        xlim=xlim, xlab="Difference in proportions")
   abline(v = output$confidence.interval["estimate"], col = "blue", lwd = 2)    

plot of chunk unnamed-chunk-55

If you don't have the package gplots installed, uncomment the install.packages code (but only do it once!):

# install.packages("gplots")

We will use the package gplots to plot the confidence interval: x is the observed statistic, ui the upper confidence limit, and li the lower confidence limit.

require(gplots)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
plotCI(x=output$confidence.interval["estimate"],ui=output$confidence.interval["upper.97.5%"],li=output$confidence.interval["lower.2.5%"],ylab="Difference in Proportions",pch=19)

plot of chunk unnamed-chunk-57

We can reuse our code from before, only this time pulling the confidence interval rather than the p-value.

ttestCI<-function(x,variableName){
    tout<-t.test(flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[2]],flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[1]])
    unlist(tout[c("estimate","conf.int")]) #unlist makes it a vector rather than list
}
t.CIPairs<-apply(X=pairsOfCarriers,MARGIN=2,FUN=ttestCI,variableName="Carrier")
colnames(t.CIPairs)<-paste(pairsOfCarriers[2,],pairsOfCarriers[1,],sep="-")
t.CIPairs<-rbind(t.CIPairs,diff=t.CIPairs["estimate.mean of x",]-t.CIPairs["estimate.mean of y",])

Now we can plot all of the confidence intervals using plotCI. Notice that I set xaxt="n". This is to keep plotCI from drawing an x-axis annotation. Instead, I want the x-axis to give the character string indicating which airlines are being compared. I do this with the axis command, which allows me to put any annotation (labels) on the axis at any location (at). las says to turn the labels perpendicular to the axis so I can read them.

require(gplots)
plotCI(x=t.CIPairs["diff",],li=t.CIPairs["conf.int1",],ui=t.CIPairs["conf.int2",],ylab="Difference in mean log-delay",pch=19,xaxt="n")
axis(1,at=1:ncol(t.CIPairs),labels=colnames(t.CIPairs),las=2)
abline(h=0,lty=2)

plot of chunk unnamed-chunk-59

We can reuse our above code, but just adjust the conf.level argument in the t.test command to get Bonferroni-corrected CIs.

ttestCIAdj<-function(x,variableName){
    tout<-t.test(flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[2]],flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[1]],conf.level = 1-0.05/npairs)
    unlist(tout[c("estimate","conf.int")]) #unlist makes it a vector rather than list
}
t.CIPairsAdj<-apply(X=pairsOfCarriers,MARGIN=2,FUN=ttestCIAdj,variableName="Carrier")
colnames(t.CIPairsAdj)<-paste(pairsOfCarriers[2,],pairsOfCarriers[1,],sep="-")
t.CIPairsAdj<-rbind(t.CIPairsAdj,diff=t.CIPairsAdj["estimate.mean of x",]-t.CIPairsAdj["estimate.mean of y",])
par(mfrow=c(1,2))
plotCI(x=t.CIPairs["diff",],li=t.CIPairs["conf.int1",],ui=t.CIPairs["conf.int2",],ylab="Difference in mean log-delay",sub="Raw CI",pch=19,xaxt="n")
#axis(1,at=1:ncol(t.CIPairs),labels=colnames(t.CIPairs),las=2)
abline(h=0,lty=2)
plotCI(x=t.CIPairsAdj["diff",],li=t.CIPairsAdj["conf.int1",],ui=t.CIPairsAdj["conf.int2",],ylab="Difference in mean log-delay",sub="Bonferroni Adjusted CI",pch=19,xaxt="n")
#axis(1,at=1:ncol(t.CIPairsAdj),labels=colnames(t.CIPairsAdj),las=2)
abline(h=0,lty=2)

plot of chunk unnamed-chunk-60

Here we plot the TukeyHSD confidence intervals for pairwise comparisons:

tukeyCI<-TukeyHSD( aov(logDepDelay~Carrier, data=flightSFOSRS))
plot(tukeyCI,las=2)

plot of chunk unnamed-chunk-61

For side-by-side comparison, let's make a matrix with all of the CIs.

combMat<-cbind("diff"=c(t.CIPairsAdj["diff",],tukeyCI$Carrier[,"diff"]),
"lwr"=c(t.CIPairsAdj["conf.int1",],tukeyCI$Carrier[,"lwr"]),
"upr"=c(t.CIPairsAdj["conf.int2",],tukeyCI$Carrier[,"upr"])
)

Now we want to reorder the matrix so that the two intervals for each pair sit next to each other. I'm going to use rep to build replicated vectors that, when added together, give the sequence of row indices to reorder the matrix above by (don't worry too much about this code).
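
To see what this sequence looks like, here is the same construction for a hypothetical case with only 3 pairs:

rep(1:3, each=2) + rep(c(0,3), times=3)  # 1 4 2 5 3 6, interleaving rows 1-3 with rows 4-6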

reorderSeq<-rep(1:ncol(t.CIPairsAdj), each=2)+rep(c(0,ncol(t.CIPairsAdj)),times=ncol(t.CIPairsAdj))
plotCI(x=combMat[reorderSeq,"diff"],li=combMat[reorderSeq,"lwr"],
ui=combMat[reorderSeq,"upr"],pch=19,xaxt="n",col=c("black","red"))
axis(1,at=seq(1.5,2*ncol(t.CIPairsAdj),by=2),labels=colnames(t.CIPairsAdj),las=2)
abline(h=0,lty=2)
legend("topleft",c("Bonferroni","TukeyHSD"),fill=c("black","red"))

plot of chunk unnamed-chunk-63