Reading and summarizing the data

Do not forget to change directory for where the data will be (see 01Probability_forClassCode.Rmd)

dataDir<-"../finalDataSets"
# dataDir<-"."

I will first read in data sets. I have three different data sets. One is a SRS from then entire 2016 year; one is just the month of january; and the last is a Stratified sample, which is a SRS from each month. I will mainly use the SRS in what follows.

flightSFOJan<-read.table(file.path(dataDir,"SFO.txt"),
                         sep="\t", header=TRUE, stringsAsFactors = FALSE)
flightSFOSRS<-read.table(file.path(dataDir,"SFO_SRS.txt"),
                         sep="\t",header=TRUE, stringsAsFactors = FALSE)
flightSFOStratified<-read.table(file.path(dataDir,"SFO_Stratified.txt"),
                                sep="\t",header=TRUE, stringsAsFactors = FALSE)
flightSRS.UAAA<-subset(flightSFOSRS,Carrier%in% c("UA","AA"))

Now lets subset to only those flights involving United and American Airlines.

flightSubset <- flightSFOSRS[flightSFOSRS$Carrier %in% c("UA", "AA"), ]

If I take the mean of the departure delay variable, I get NAs.

mean(flightSubset$DepDelay)
## [1] NA

To ignore the NA, I set na.rm=TRUE (uncomment the next line). Notice, however, that this may not be a wise thing to do, as we've discussed.

mean(flightSubset$DepDelay,na.rm=TRUE) #remove the NAs, which may or may not be wise
## [1] 11.13185

We demonstrate the use of tapply to perform a calculation for each group. This function is for when you have one vector (or column of your data frame) with all of the observed measurements and another vector/column with a variable that says what group each observation is in. This is how data usually comes, so tapply helps you calculate for each group without having to do annoying subsetting or making a for-loop through the different values of your groups. We saw this before in lecture 1. When I apply it here, though, I just get NAs.

# NA problem:
tapply(X=flightSubset$DepDelay, flightSubset$Carrier, mean)
## AA UA 
## NA NA

But I can give additional arguments to the mean function via tapply by just adding them in. If you look in the help of tapply you'll see that there is an argument .... This is R's way of allowing you to pass along additional arguments, without the function tapply needing to know what those arguments are. In this case we will give the na.rm=TRUE argument for the mean. Notice that tapply just sends them along to mean without looking at them, so tapply doesn't need to know what arguments your function defined in FUN actually needs, which makes it very flexible.

# remove NA:
tapply(flightSubset$DepDelay, flightSubset$Carrier, mean,na.rm=TRUE)
##        AA        UA 
##  7.728294 12.255649

We can also write our own function for tapply. The following code does the same as the above, only this time we wrote out a function that sets na.rm=TRUE in our call to mean.

# equivalent, only write a function to do it:
tapply(flightSubset$DepDelay, flightSubset$Carrier, 
       function(x){mean(x, na.rm = TRUE)})
##        AA        UA 
##  7.728294 12.255649

We could have made this two lines of code. As an experienced coder, I don't tend to do this, but certainly feel free to break it down into steps. It can help to define the function separately, too, if it's long or complicated, so as to make sure it does what you want.

#equivalent, only in pieces
f<-function(x){mean(x, na.rm = TRUE)}
tapply(flightSubset$DepDelay, flightSubset$Carrier, FUN=f)
##        AA        UA 
##  7.728294 12.255649

Now we do the median:

# median delay time
tapply(flightSubset$DepDelay, flightSubset$Carrier, 
       function(x){median(x, na.rm = TRUE)})
## AA UA 
## -2 -1

Now we calculate the percentage of flights that are delayed, and write our own function to do this. Notice how I can create a logical vector x>0 | is.na(x) to determine with observations are either delayed >0 minutes, or are missing (i.e. cancelled). Then taking sum of a logical vector counts the number of TRUE values.

# percent flights delayed
tapply(flightSubset$DepDelay, flightSubset$Carrier, 
       function(x){sum(x > 0 | is.na(x)) / length(x)})
##        AA        UA 
## 0.3201220 0.4383791

Again, another example only calculating the % more than 15 minutes

# percent flights delayed more than 15 minutes
tapply(flightSubset$DepDelay, flightSubset$Carrier, 
       function(x){sum(x > 15 | is.na(x)) / length(x)})
##        AA        UA 
## 0.1554878 0.2046216

And now the percent of cancelled flights. Notice that cancelled values are 1=Cancelled and 0=not cancelled. Therefore taking the mean of the 0-1 values is the same as calculating the percentage.

# percent flights cancelled
tapply(flightSubset$Cancelled, flightSubset$Carrier, mean)
##          AA          UA 
## 0.005081301 0.007032820

To illustrate different ways of sampling data, we had three different data sets (SRS, Stratified, and all of January). Here we plot the density functions of UA and AA for these different samples to indicate how they can be rather different.

par(mfrow=c(1, 2))
xlim<-c(-20, 50)
ylim<-c(0, 0.08)
#draw same for them separately
for (airline in c("UA", "AA")){
    density.Jan <- density(flightSFOJan[flightSFOJan$Carrier==airline, "DepDelay"], 
    na.rm = TRUE)
    density.SRS <- density(flightSFOSRS[flightSFOSRS$Carrier==airline, "DepDelay"], 
    na.rm = TRUE)
    density.Stratified <- density(flightSFOStratified[flightSFOSRS$Carrier==airline, "DepDelay"], na.rm = TRUE)
  plot(density.Jan, xlim=xlim, ylim=ylim, col="black",
       main=paste("Density estimates:",airline), xlab=airline)
  lines(density.SRS, col="red")
  lines(density.Stratified, col="blue")
  legend("topright",c("January","SRS","Stratified"), fill=c("black","red","blue"))
}

plot of chunk unnamed-chunk-14

Permutation Tests

Here we will demonstrate a few permutations of the group labels, using the function sample. You should think that sample takes your input values (argument x) and puts them in a large bag and then draws at random from that bag over and over to create a random sample. The number of times it draws is the argument size and whether it replaces the selected value into the bag for the next draw is dependent on the argument replace. So if I set size to be the length of x (the default) and I choose replace=FALSE, I get a random permutation of my x values.

set.seed(18592)
permuteData<-data.frame(FlightDelay=flightSRS.UAAA$DepDelay,
    Observed=flightSRS.UAAA$Carrier,
    Permutation1=sample(flightSRS.UAAA$Carrier,replace=FALSE),
    Permutation2=sample(flightSRS.UAAA$Carrier,replace=FALSE),
    Permutation3=sample(flightSRS.UAAA$Carrier,replace=FALSE))
head(permuteData)
##   FlightDelay Observed Permutation1 Permutation2 Permutation3
## 1           5       UA           AA           UA           AA
## 2          -6       UA           UA           UA           UA
## 3         -10       AA           UA           UA           UA
## 4          -3       UA           UA           AA           UA
## 5          -3       UA           UA           AA           UA
## 6           0       UA           UA           UA           UA

Now for each permutation, I will calculate the proportion of flights with delay greater than 15 or cancelled (NA). To do this I will define a small function that does this:

propFun<-function(x){sum(x>15 | is.na(x))/length(x)}

Now I will apply that function to each of the permutations. I do this with the function apply which applies the same function to each column of my input data frame. So for every permutation in my dataset (3 + the observed one) I will use that permutation in defining my groups and calculate the proportions per group.

cat("Proportions per Carrier, each permutation:\n")
## Proportions per Carrier, each permutation:
medPerPermute<-apply(permuteData[,-1],MARGIN=2,function(x){
    tapply(permuteData[,1],x,propFun)})
medPerPermute
##     Observed Permutation1 Permutation2 Permutation3
## AA 0.1554878    0.2063008    0.1951220    0.1910569
## UA 0.2046216    0.1878768    0.1915606    0.1929002

Notice my above command set MARGIN=2, meaning I will apply my function to each column (I dropped the first column with permuteData[,-1]).

The output (medPerPermute) is a matrix of the values per airline (each row is an airline). I need to take the difference of these values. I will do this with the diff function, and again use apply to do this for each column.

cat("\n")
cat("Differences in Proportions per Carrier, each permutation:\n")
## Differences in Proportions per Carrier, each permutation:
diffPerPermute<-apply(medPerPermute,2,diff)
diffPerPermute
##     Observed Permutation1 Permutation2 Permutation3 
##  0.049133762 -0.018424055 -0.003561335  0.001843290

Note that I could have done this in one call to apply rather than break it up, but for teaching purposes I wanted to see the intermediate steps.

Writing a Function for Permutation tests

Now we consider actually implementing permutation tests to compare two groups. The following code sets up a generic function permutation.test that will take as input the data on the two groups as arguments group1 and group2, as well as a function FUN that defines the statistic we are using for the test. This makes it a generic function that I can apply for any statistic and set of data. The last argument is n.repetitions which is the number of repetitions of the permutation we choose to do, i.e. how many random permutations.

The function returns a list with the p-value, the observed statistic (i.e. as calculated on the real data), and a vector of the statistic calculated on all the permuted data.

# Returns:
#   list with p-value, observed statistic, and vector of permuted statistics
permutation.test <- function(group1,group2, FUN, n.repetitions){ 
  # calculate the observed statistic
  stat.obs <- FUN(group1,  group2)
  #function to do the permutation and return statistic
  makePermutedStats<-function(){
      sampled <- sample(1:length(c(group1,group2)), size=length(group1),replace=FALSE) #sample indices that will go into making new group 1
      return(FUN(c(group1,group2)[sampled], c(group1,group2)[-sampled]))
  }
  # calculate the permuted statistic
  stat.permute <-replicate(n.repetitions,makePermutedStats()) 
  p.value <- sum(stat.permute >= stat.obs) / n.repetitions
  return(list(p.value=p.value,observedStat=stat.obs,permutedStats=stat.permute))
}

Notice how I give names to the values when defining the list, list(p.value=..., observedStat=..., permutedStats=...). This means that the resulting list that is returned will have three elements with names p.value, observedStat, and permutedStats. This is a good coding convention, as it makes it easier to use the output. If I didn't, to use the output you have to use commands like output[[1]] rather than output$p.value, which is far less readable and easily broken, if for example you change the order of the output.

I need to define a function that will calculate my statistic when I give input of two vectors of data. I use my propFun from before to have a per-group statistic (proportion), and then take the absolute difference (i.e. make 1-step what I previously did in 2 steps):

diffProportion<-function(x1,x2){
    prop1<-propFun(x1)
    prop2<-propFun(x2)
    return(abs(prop1-prop2))
}
diffProportion(subset(flightSFOSRS,Carrier=="AA")$DepDelay,subset(flightSFOSRS,Carrier=="UA")$DepDelay)
## [1] 0.04913376

Now I want to run a permutation test using this function. Since there is randomness in our permutations, I am going to call set.seed which takes as an argument an arbitrary number. This means that I will make the same random draws even if I repeat the code. This is convenient when you want to make sure you will get the exact same result each time (for example in homeworks or lectures).

set.seed(201728)

Now I will run my function permutation.test on the UA and AA data with 10,000 permutations and save the output.

dataset<-flightSFOSRS
output<-permutation.test(
    group1=dataset$DepDelay[dataset$Carrier == "UA"], 
    group2=dataset$DepDelay[dataset$Carrier == "AA"],
    FUN=diffProportion, n.repetitions=10000)

The output contains not just the p-value, but the values of the statistic under all of the permutations. The output is a list, so I can access the values with a $ and the names. If I do names(output) I can see the possible values (commented out here because I didn't want it in class).

# names(output) 

For example, output$p.value will give me the p-value.

I can plot the histogram of these permuted values, along with the observed value. Notice that I can use the function abline to draw horizontal or vertical lines. Since I want a vertical line corresponding to the observed statistic, I set v= and the vertical line will be at that value.

xlim<-range(c(output$observedStat,output$permutedStats))
hist(output$permutedStats, main = "Histogram of permuted statistics",
    xlab = "Difference in proportion delay > 15 minutes", xlim=xlim)
abline(v = output$observedStat, col = "blue", lwd = 2)  

plot of chunk unnamed-chunk-24

To be safe, I set the limits of my plot to be the range of the values from the permutation and my observed using the argument xlim in hist; this is in case my observed statistic is so different my histogram command doesn't go out that far and so my blue line wouldn't be seen.

I can repeat the whole process with the median. All I need to do is change the function I give FUN to now be a function that calculates the absolute difference in the medians. Notice I am picking absolute everywhere, because my function will always count only the large values. This saves me trouble of figuring out how to count the positive and negative in my code.

set.seed(2843261)
# for each dataset, do permutation test and plot histogram
diffMedian<-function(x1,x2){
    prop1<-median(x1,na.rm=TRUE)
    prop2<-median(x2,na.rm=TRUE)
    return(abs(prop1-prop2))
}

And now I can run the exact same code to get the permutation test.

dataset<-flightSFOSRS
output<-permutation.test(
    group1=dataset$DepDelay[dataset$Carrier == "UA"], 
    group2=dataset$DepDelay[dataset$Carrier == "AA"],
    FUN=diffMedian, n.repetitions=10000)
xlim<-range(c(output$observedStat,output$permutedStats))
hist(output$permutedStats, 
    main = "Histogram of permuted statistics",
    xlab = "Difference in median flight delay", xlim=xlim)
abline(v = output$observedStat, col = "blue", lwd = 2)  

plot of chunk unnamed-chunk-26

cat("pvalue (median difference)=",output$p.value)
## pvalue (median difference)= 0.1287

The t-test

Let's return to our flight data and consider using the t-test to compare the means of the distribution.

# percent flights delayed more than 15 minutes

However, our data is very skewed.

hist(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"],
    main="Histogram Departure Delay, UA",
    xlab="In minutes",breaks=50)

plot of chunk unnamed-chunk-28

We can still run the t-test despite how bad the data is – it doesn't mean it's a good idea however!

There is a function in R that calculates this: t.test .

t.test(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"],
    flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="AA"])
## 
##  Welch Two Sample t-test
## 
## data:  flightSFOSRS$DepDelay[flightSFOSRS$Carrier == "UA"] and flightSFOSRS$DepDelay[flightSFOSRS$Carrier == "AA"]
## t = 2.8325, df = 1703.1, p-value = 0.004673
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.392379 7.662332
## sample estimates:
## mean of x mean of y 
## 12.255649  7.728294

Compare to the permutation test

We can do a permutation test using the t-statistic as our statistic as well. We need to first define a function that will calculate the t-statistic for each permutation. I am going to actually use t.test to do that. In particular, you can save the output of t.test and it has a lot of information, including the value of the t-statistic. While I'm not going to actually use the p-value or results from the test, this saves me the trouble of writing a function for the t-statistic (and makes sure I do not make an error in my code!). I will, as always, take the absolute value of my t-statistic to calculate the p-value.

set.seed(489712)
# for each dataset, do permutation test and plot histogram
tstatFun<-function(x1,x2){abs(t.test(x1,x2)$statistic)}

I will now run the permutation test and compare its p-values with those of the actual t-test.

dataset<-flightSFOSRS
output<-permutation.test(group1=dataset$DepDelay[dataset$Carrier == "UA"], group2=dataset$DepDelay[dataset$Carrier == "AA"],FUN=tstatFun, n.repetitions=10000)
cat("permutation pvalue=",output$p.value)
## permutation pvalue= 0.0076
tout<-t.test(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"],flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="AA"])
cat("t-test pvalue=",tout$p.value)
## t-test pvalue= 0.004673176

Now we can compare the distribution of the permuted mean differences to the normal predicted by the t-test if we had normal data. Before I took the absolute value of the t-statistic but that's not going to make for the right values to compare to the t-distribution – I need the positive and negative values. So to make this plot, I am going to rerun the permutation test with a different function that doesn't take the absolute value of the t-statistic, so we can more easily compare its distribution to that of the parametric model of the distribution of the t-statistic. However, this will be just for illustration, and the p-value will be wrong.

set.seed(489712)
# for this plot, won't take the absolute value, so will match the normal distribution plot
tstatFun2<-function(x1,x2){t.test(x1,x2)$statistic}

Doing this, I can now plot the histogram of the t-statistic under the permutation with the t-distribution from the corresponding t-test. Notice that because I have very large numbers of samples, I just draw the normal density.

dataset<-flightSFOSRS[!is.na(flightSFOSRS$DepDelay),]
outputTPerm<-permutation.test(group1=dataset$DepDelay[dataset$Carrier == "UA"], group2=dataset$DepDelay[dataset$Carrier == "AA"],FUN=tstatFun2, n.repetitions=10000)
xlim<-range(c(outputTPerm$observedStat,outputTPerm$permutedStats))
hist(outputTPerm$permutedStats, main = "Histogram of permuted t-statistic",
    xlab = "T-statistic", xlim=xlim,freq=FALSE,breaks=100)
abline(v = outputTPerm$observedStat, col = "blue", lwd = 2) 
curve(dnorm,add=TRUE)
legend("topleft","N(0,1)",lty=1)

plot of chunk unnamed-chunk-33

Now we consider the same comparison of the parametric t-test to the permutation test with the t-statistic, but with a smaller sample size. I will subsample the data so that I only get 50 in each of the two airline groups.

set.seed(489712)
dataset<-flightSFOSRS[flightSFOSRS$Carrier %in% c("UA","AA") & !is.na(flightSFOSRS$DepDelay),]
#sample smaller number of each
ind1<-sample(which(dataset$Carrier=="AA"),size=50)
ind2<-sample(which(dataset$Carrier=="UA"),size=50)
dataset<-dataset[c(ind1,ind2),]

Now I calculate the null distribution under the permutation test and under the parametric model (i.e. a t-distribution). Again I will do it two different ways for the histogram versus the p-value. To get the right degrees of freedom for drawing the density of the t-distribution, I will just use the output of the t.test function that saves that information.

outputTPermSmall<-permutation.test(
    group1=dataset$DepDelay[dataset$Carrier == "UA"], 
    group2=dataset$DepDelay[dataset$Carrier == "AA"],
    FUN=tstatFun2, n.repetitions=10000)
toutput<-t.test(dataset$DepDelay[dataset$Carrier == "UA"],
    dataset$DepDelay[dataset$Carrier == "AA"])
xlim<-range(c(outputTPermSmall$observedStat,
    outputTPermSmall$permutedStats))
hist(outputTPermSmall$permutedStats, 
    main = "Histogram of permuted t-statistic",
    xlab = "T-statistic", xlim=xlim,freq=FALSE,breaks=70)
abline(v = outputTPermSmall$observedStat, col = "blue", lwd = 2)    
# add the t-distribution with the right degrees of freedom
tfun<-function(x){dt(x,df=toutput$parameter)}
curve(tfun,add=TRUE,col="red")
legend("topleft","t-distribution",lty=1)

plot of chunk unnamed-chunk-35

We can compare the pvalues:

set.seed(21590)
outputTPermSmallPvalue<-permutation.test(
    group1=dataset$DepDelay[dataset$Carrier == "UA"], 
    group2=dataset$DepDelay[dataset$Carrier == "AA"],
    FUN=tstatFun, n.repetitions=10000)
cat("pvalue permutation=",
    outputTPermSmallPvalue$p.value,"\n")
## pvalue permutation= 0.4446
cat("pvalue t.test=",toutput$p.value,"\n")
## pvalue t.test= 0.394545

Transformations

Of course, the t-test assumptions don't match the flight data. We can consider what happens if we transform the data like we did in earlier lectures. We will repeat all of the above, only now with the log of the departure delay (+ a small amount)

#use the same subset of data found above
addValue<-abs(min(dataset$DepDelay,na.rm=TRUE))+1 #the small amount I will add to get positive values
dataset$logDepDelay<-log(dataset$DepDelay+addValue)
outputTPermSmall<-permutation.test(
    group1=dataset$logDepDelay[dataset$Carrier == "UA"], 
    group2=dataset$logDepDelay[dataset$Carrier == "AA"],
    FUN=tstatFun2, n.repetitions=10000)
toutput<-t.test(dataset$logDepDelay[dataset$Carrier == "UA"],
    dataset$logDepDelay[dataset$Carrier == "AA"])
xlim<-range(c(outputTPermSmall$observedStat,outputTPermSmall$permutedStats))
hist(outputTPermSmall$permutedStats, 
    main = "Histogram of permuted t-statistic, log-transformed",
    xlab = "T-statistic", xlim=xlim,freq=FALSE,breaks=100)
abline(v = outputTPermSmall$observedStat, col = "blue", lwd = 2)    
tfun<-function(x){dt(x,df=toutput$parameter)}
curve(tfun,add=TRUE,col="red")
legend("topleft","t-distribution",lty=1)

plot of chunk unnamed-chunk-37

# re-do to get proper p-values on absolute value:
set.seed(21590)
outputTPermSmallPvalue<-permutation.test(group1=dataset$DepDelay[dataset$Carrier == "UA"], group2=dataset$DepDelay[dataset$Carrier == "AA"],FUN=tstatFun, n.repetitions=10000)
cat("pvalue permutation=",outputTPermSmallPvalue$p.value,"\n")
## pvalue permutation= 0.4446
cat("pvalue t.test=",toutput$p.value,"\n")
## pvalue t.test= 0.3261271

Pairwise tests (Multiple testing)

We are going to make a vector of the names of all the airline carriers, as well as a variable (ncarriers) that gives the number of carriers

carriers<-levels(as.factor(flightSFOSRS$Carrier))
ncarriers<-length(carriers)

The following is a small calculation to determine the number of pairs of carriers using the choose function. It basically tells you how many ways there are to pick 2 objects from a list of ncarriers.

npairs<-choose(ncarriers,2)
cat("number of pairs:",npairs,"\n")
## number of pairs: 45

To compare all pairs of airlines, first we need to ennumerate all the possible combinations. The function combn gives all the combinations of length 2 (m=2), i.e. all pairs. The result is a matrix with 2 rows, with each column a combination.

pairsOfCarriers<-combn(x=carriers,m=2) #get all combinations of length 2, ie all pairs

I will again work with the transformed data, so lets create this as a variable in my data.frame (before it was saved as a temporary value)

addValue<-abs(min(flightSFOSRS$DepDelay,na.rm=TRUE))+1
flightSFOSRS$logDepDelay<-log(flightSFOSRS$DepDelay+addValue)

Now, for each pair, I want to perform a t-test. So I'm going to define a function that does that, and use apply to with my pairsOfCarriers matrix above to do it to each pair of airlines.

That means that the input that apply will be sending to my function is going to be a vector of airline names (a character vector of length two). So I need to write my function with that as the expected input, in addition to other possible variables I might give my function. My function will use that vector ( x) to subset the full dataset to those that match the input carriers given by x, do a t.test between the two-carriers, and for simplicity return only the t-statistic and its p-value under the t-test. I also put an argument variableName into my function that requires the user to specify the name of the column in the dataset where the groups are (i.e. Carrier); this is because I am going to reuse this code later with a different variable. Notice this isn't a very flexible code, because it only works with the flightSFOSRS dataset for the logDepDelay variable. I could add more arguments to make it more general.

#each pair, do t-test
ttestFun<-function(x,variableName){
    group1<-flightSFOSRS$logDepDelay[ flightSFOSRS[,variableName] == x[1] ]
    group2<-flightSFOSRS$logDepDelay[ flightSFOSRS[,variableName] == x[2] ]
    tout<-t.test(group1,group2)
    #unlist makes it a vector rather than list
    unlist(tout[c("statistic","p.value")]) 
}

Now, I could use this function and apply it to all my pairs with a for-loop, e.g. for(i in 1:ncol(pairsOfCarriers)){...}. This is perfectly acceptable. Instead I will use a common R function apply that we used above that is meant to loop over rows (or columns) of a matrix.

Specifically, apply takes a matrix as the first argument (X). The second argument (MARGIN) determines whether to loop over rows (MARGIN=1) or columns (MARGIN=2). Then FUN defines the function to apply to each row. I can give built-in function, like mean (which would result in row means, except that for this example my matrix is not numeric values), or I can supply a function, like my ttestFun I wrote above.

t.testPairs<-apply(X=pairsOfCarriers,
    MARGIN=2,FUN=ttestFun,
    variableName="Carrier")

The result is a matrix of dimension 2 x No. combinations.

dim(t.testPairs)
## [1]  2 45

Notice that apply gives the full vector corresponding to the column or row to your function in FUN. This means that the function I wrote had to assume the input is a vector. I couldn't, for example, write a function function(group1Name,group2Name); I have to have function(x) where x[1] is the name of group 1 and x[2] is the name of group 2.

I would note that the function combn allows you to run a function over all combinations, so I could have avoided the apply step (combn just runs apply for you). But it is a good example of using the apply function. Here's an example of doing it directly in combn:

# skipping apply step and doing it just within the combn function:
t.testPairsV2<-combn(x=carriers,m=2,FUN=ttestFun,variableName="Carrier")

Cleaning up the output It's pretty annoying to have the output in the rows of t.testPairs. I can't do things like t.testPairs$statistic.t for example. So I'm going to flip the matrix with the t function

t.testPairs<-t(t.testPairs)

And then I'm going to convert it to a data.frame:

t.testPairs<-as.data.frame(t.testPairs)

I am going to pretty up the names of the rows here so I can keep track of what comparison they correspond to (i.e. which Airline). I'd like to have names like “UA-AA”, for example so I can keep track. I know by my code that each result corresponds to the same element of the first row of pairsOfCarriers and the second row of pairsOfCarriers, so what I want to do is glue those values together and make my “UA-AA” string. The paste function allows me to glue together two vectors of strings, element-by-element, and sep="-" says to put a minus sign between them.

rownames(t.testPairs)<-paste(pairsOfCarriers[1,],pairsOfCarriers[2,],sep="-")
head(t.testPairs)
##       statistic.t      p.value
## AA-AS   1.1514752 0.2501337691
## AA-B6  -3.7413418 0.0002038769
## AA-DL  -2.6480549 0.0081705864
## AA-F9  -0.3894014 0.6974223534
## AA-HA   3.1016459 0.0038249362
## AA-OO  -2.0305868 0.0424142975

I find it rather annoying to do all this clean-up, and I'm going to do it several times in this module, so I'm going to write a little function that does it:

cleanUpPairs<-function(pairsResults){
    pairsResults<-t(pairsResults)
    pairsResults<-as.data.frame(pairsResults)
    rownames(pairsResults)<-paste(pairsOfCarriers[1,],
        pairsOfCarriers[2,],sep="-")
    return(pairsResults)
}
# Repeat the above calculation and instead use my `cleanUpPairs` function.
t.testPairs<-apply(X=pairsOfCarriers,
    MARGIN=2,FUN=ttestFun,
    variableName="Carrier")
t.testPairs<-cleanUpPairs(t.testPairs)
head(t.testPairs)
##       statistic.t      p.value
## AA-AS   1.1514752 0.2501337691
## AA-B6  -3.7413418 0.0002038769
## AA-DL  -2.6480549 0.0081705864
## AA-F9  -0.3894014 0.6974223534
## AA-HA   3.1016459 0.0038249362
## AA-OO  -2.0305868 0.0424142975

Now I can do some calculations, like the number and the proportion significant at level 0.05:

numbSig<-sum(t.testPairs$p.value<0.05)
propSig<-numbSig/length(t.testPairs$p.value)
cat("Number found with p-value < 0.05: ",numbSig, "(",round(propSig,2)," proportion of tests)")
## Number found with p-value < 0.05:  26 ( 0.58  proportion of tests)

We can plot these p-values to get an idea of their value. Notice that I give only 1 vector to plot. By default, if you give a single vector, say z it will just make the x-values be 1:length(z), i.e. plot them across the graph in order.

plot(log(t.testPairs$p.value),ylab="log(P-value)",main="P-values from all pairwise tests",xaxt="n",xlab="")
abline(h=log(0.05),lty=2)
legend("bottomright",legend="P-value = 0.05",lty=2)
axis(1,at=1:nrow(t.testPairs),labels=rownames(t.testPairs),las=2)

plot of chunk unnamed-chunk-51

To consider whether this is a lot to find significant (i.e. there is strong signal) or whether this is due to just chance multiple testing issues, we redo the above. Only now we scramble the carrier labels, so there is actually no difference, and see how many we incorrectly reject. (I've set the seed here so that we get the same result each time in the lecture notes, but you can comment out the set.seed and rerun it many times to get an idea of how it changes.)

#can repeat this code over and over to see how changes with different scrambles.
set.seed(95289235) 
flightSFOSRS$fakeCarrier<-sample(flightSFOSRS$Carrier)
#each pair, do t-test
t.testPairsScramble<-apply(pairsOfCarriers,2,ttestFun,variableName="fakeCarrier")
t.testPairsScramble<-cleanUpPairs(t.testPairsScramble)
numbSig<-sum(t.testPairsScramble$p.value<0.05)
propSig<-numbSig/length(t.testPairsScramble$p.value)
cat("Number found with p-value < 0.05: ",numbSig, "(",round(propSig,2)," proportion)")
## Number found with p-value < 0.05:  6 ( 0.13  proportion)
plot(log(t.testPairsScramble$p.value),ylab="log(P-value)",main="P-values from all pairwise tests, scrambled carriers",xaxt="n",xlab="")
abline(h=log(0.05),lty=2)
legend("bottomright",legend="P-value = 0.05",lty=2)
axis(1,at=1:nrow(t.testPairs),labels=rownames(t.testPairs),las=2)

plot of chunk unnamed-chunk-52

Bonferroni Correction

Adjust the Level We can adjust for multiple testing by adjusting the level of each test to 0.05/ # tests

cat("Number found significant after Bonferonni: ",
    sum(t.testPairs$p.value<0.05/npairs))
## Number found significant after Bonferonni:  16
cat("Number of shuffled differences found significant after Bonferonni: ",
    sum(t.testPairsScramble$p.value<0.05/npairs))
## Number of shuffled differences found significant after Bonferonni:  0

Adjust the p-values We can also directly adjust our p-values and use our standard notion of level (e.g. 0.05). We create a new column in our data.frame called p.value.adj by using the function cbind, which just pastes another column to an existing matrix/data.frame

t.testPairs<-cbind(t.testPairs,"p.value.adj"=t.testPairs$p.value*npairs)
t.testPairsScramble<-cbind(t.testPairsScramble,"p.value.adj"=t.testPairsScramble$p.value*npairs)
head(t.testPairs)
##       statistic.t      p.value  p.value.adj
## AA-AS   1.1514752 0.2501337691 11.256019611
## AA-B6  -3.7413418 0.0002038769  0.009174458
## AA-DL  -2.6480549 0.0081705864  0.367676386
## AA-F9  -0.3894014 0.6974223534 31.384005904
## AA-HA   3.1016459 0.0038249362  0.172122129
## AA-OO  -2.0305868 0.0424142975  1.908643388
head(t.testPairsScramble)
##       statistic.t    p.value p.value.adj
## AA-AS  -1.3008280 0.19388985    8.725043
## AA-B6   0.5208849 0.60264423   27.118990
## AA-DL  -2.0270773 0.04281676    1.926754
## AA-F9  -0.1804245 0.85698355   38.564260
## AA-HA  -1.5553127 0.13030058    5.863526
## AA-OO  -1.3227495 0.18607903    8.373556

To make sure that none of our adjusted p-values are greater than 1, we can use the function pmin on our adjusted values, which goes element by element and takes the min of the adjusted p-value and the value 1:

t.testPairs<-cbind(t.testPairs,"p.value.adj.final"=pmin(t.testPairs$p.value.adj,1))
t.testPairsScramble<-cbind(t.testPairsScramble,"p.value.adj.final"=pmin(t.testPairsScramble$p.value.adj,1))
head(t.testPairs)
##       statistic.t      p.value  p.value.adj p.value.adj.final
## AA-AS   1.1514752 0.2501337691 11.256019611       1.000000000
## AA-B6  -3.7413418 0.0002038769  0.009174458       0.009174458
## AA-DL  -2.6480549 0.0081705864  0.367676386       0.367676386
## AA-F9  -0.3894014 0.6974223534 31.384005904       1.000000000
## AA-HA   3.1016459 0.0038249362  0.172122129       0.172122129
## AA-OO  -2.0305868 0.0424142975  1.908643388       1.000000000

Let's identify those tests that were significant at 0.05 level before adjustment, but now are not

whLostSig<-which(t.testPairs$p.value.adj.final>0.05 & t.testPairs$p.value<=0.05)
whLostSigScramble<-which(t.testPairsScramble$p.value.adj.final>0.05 & t.testPairsScramble$p.value<=0.05)

I will remake the previous plots, only now with adjusted p-values, and “highlight” those p-values that changed to non-significant by plotting them a different color. Notice how I can use the function points to add points to a scatterplot, just like I previously used lines to add points connected by lines to an existing scatterplot. In this case I am overplotting, meaning I plotted all of the values in my plot command, and by calling points afterward using a different color, those points are covered up and plotted with a new color. I also use the argument pch to change from an open circle to a filled in circle (see the help of points to get a full list of the possible plotting symbols.)

ylab<-"log(adjusted P-value)"
par(mfrow=c(1,2))
plot(log(t.testPairs$p.value.adj.final),ylab=ylab,main="Real Data, after adjustment",xaxt="n",xlab="")
abline(h=log(0.05),lty=2)
n<-nrow(t.testPairs)
points((1:n)[whLostSig],log(t.testPairs$p.value.adj.final)[whLostSig],col="red",pch=19)
legend("bottomright",legend="P-value = 0.05",lty=2)
axis(1,at=1:n,labels=rownames(t.testPairs),las=2)

plot(log(t.testPairsScramble$p.value.adj.final),ylab=ylab,main="Scrambled Data, after adjustment",xaxt="n",xlab="", ylim=c(log(0.05),0))
points((1:n)[whLostSigScramble],log(t.testPairsScramble$p.value.adj.final[whLostSigScramble]),col="red",pch=19)
abline(h=log(0.05),lty=2)
axis(1,at=1:n,labels=rownames(t.testPairs),las=2)

plot of chunk unnamed-chunk-57

Confidence Intervals

Examples calculating quantiles This code demonstrates calculating quantiles using the qXXX function. qXXX goes along with dXXX, rXXX, and pXXX. For a qXXX function the input argument is a probability, and the output is the value for which you have the probability of being less than. Here we demonstrate with the normal distribution.

qnorm(0.2, mean=0,sd=1)
## [1] -0.8416212
qnorm(0.9, mean=0,sd=1)
## [1] 1.281552
qnorm(0.0275, mean=0,sd=1)
## [1] -1.918876

Parametric Confidence intervals for the mean

To get parametric confidence intervals, we again use the t.test function, which also provides confidence intervals. Here we get confidence intervals for our logged- delay values.

t.test(log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"]+addValue))
## 
##  One Sample t-test
## 
## data:  log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier == "UA"] + addValue)
## t = 289.15, df = 2964, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3.236722 3.280920
## sample estimates:
## mean of x 
##  3.258821

Here we take our confidence interval for the log of the mean, and convert it back to the original scale

logT<-t.test(log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"]+addValue))
exp(logT$conf.int)-addValue
## [1] 3.450158 4.600224
## attr(,"conf.level")
## [1] 0.95

The t.test function also provides confidence intervals between groups. Here we perform for the log-difference of our delay between UA and AA

logUA<-log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"]+addValue)
logAA<-log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="AA"]+addValue)
t.test(logUA,logAA)
## 
##  Welch Two Sample t-test
## 
## data:  logUA and logAA
## t = 5.7011, df = 1800.7, p-value = 1.389e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.07952358 0.16293414
## sample estimates:
## mean of x mean of y 
##  3.258821  3.137592

Bootstrap Confidence Intervals

The following function creates bootstrap confidence intervals comparing two groups. The input is like the permutation.test I wrote above: the two vectors of data, the function FUN that defines your statistic, and the number of repetitions. It returns a list with the confidence interval and the vector of bootstrapped statistics.

# Parameters:
#   group1, group2: vector of data from each group
#   FUN: function; this function accepts two vectors (two groups),
#               and calculate the statistics (should be absolute value.)
#   n.repetitions: number of permutations to be done
#
# Returns:
#   list with confidence interval, and vector of bootstrapped statistics

bootstrap.test <- function(group1, group2, FUN,
                           n.repetitions, confidence.level=0.95){
    # calculate the observed statistic
    stat.obs <- FUN(group1, group2)
    # calculate the bootstrapped statistic, notice replace=TRUE
    bootFun<-function(){
      sampled1 = sample(group1, size=length(group1),replace = TRUE)
      sampled2 = sample(group2, size=length(group2),replace = TRUE)
      FUN(sampled1,sampled2)
    }
    stat.boot<-replicate(n.repetitions,bootFun())
    level<-1-confidence.level
    confidence.interval <- quantile(stat.boot,
      probs=c(level/2,1-level/2))
    # below, `unname` simply takes off the existing names 
    # given by quantile function:
    return(list(
        confidence.interval = c(
            "lower"=unname(confidence.interval[1]),
            "estimate"=stat.obs,
            "upper"=unname(confidence.interval[2])), 
        bootStats=stat.boot
    ))
}

Now we apply our function to our flight data, reusing our function diffProportion from before.

set.seed(201728)
# for each dataset, do permutation test and plot histogram
#for (dataset in list(flightSFOJan, flightSFOSRS, flightSFOStratified)){
dataset<-flightSFOSRS #this makes it easy to recopy code later
output<-bootstrap.test(
    group1=dataset$DepDelay[dataset$Carrier == "UA"], 
    group2=dataset$DepDelay[dataset$Carrier == "AA"],
    FUN=diffProportion, 
    n.repetitions=10000)

We can draw a histogram of the bootstrap distribution of our estimates

par(mfrow = c(1,1))
hist(output$bootStats, 
    main = "Histogram of bootstrapped statistics", 
    xlab="Difference in proportions")
abline(v = output$confidence.interval["estimate"], 
   col = "blue", lwd = 2)
legend("topright",
    legend=expression(paste("Observed ",hat(delta))),
    fill="blue")

plot of chunk unnamed-chunk-64

If you don't have the package gplots installed, uncomment the install.packages code below (but only do it once!):

# install.packages("gplots")

We will use the package gplots to plot the confidence intervals. x refers to the observed statistic, ui the upper confidence interval value, and li the lower confidence interval.

require(gplots)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
plotCI(x=output$confidence.interval["estimate"],ui=output$confidence.interval["upper"],li=output$confidence.interval["lower"],ylab="Difference in Proportions",pch=19,xaxt="n",xlab="")

plot of chunk unnamed-chunk-66

CI of means vs Difference

Here I compare creating the confidence interval of the individual means, versus the confidence interval of the difference of the means.

First I calculate both the individual means and the difference in means using t.test.

uaT<-t.test(logUA)
aaT<-t.test(logAA)
diffT<-t.test(logUA,logAA)

Now I will plot the confidence intervals of the two groups' means, and then that of the difference using plotCI

par(mfrow=c(1,2))
plotCI(x=c(UA=uaT$estimate,aaT$estimate),ui=c(uaT$conf.int[2],aaT$conf.int[2]),li=c(uaT$conf.int[1],aaT$conf.int[1]),ylab="Mean Delay Time",pch=19,xlab="Airline",xaxt="n")
axis(1,at=c(1,2),c("UA","AA"))
plotCI(x=diffT$estimate[1]-diffT$estimate[2],ui=diffT$conf.int[2],li=diffT$conf.int[1],ylab="Difference in Mean Delay Time",xlab="",pch=19,xaxt="n")
axis(1,at=1,c("UA-AA"))

plot of chunk unnamed-chunk-68

Here is an artifical example where the CI of the means of each group overlap, but the CI of the difference between the means doesn't cover zero:

exampleData<-read.csv(file.path(dataDir,"DifferenceMeans.csv"),comment.char="#")
aT<-t.test( exampleData$A)
bT<-t.test(exampleData$B)
diffT<-t.test(exampleData$A,exampleData$B)
par(mfrow=c(1,2))
plotCI(x=c(aT$estimate,bT$estimate),ui=c(aT$conf.int[2],bT$conf.int[2]),li=c(aT$conf.int[1],bT$conf.int[1]),ylab="Mean of Groups",pch=19,xlab="",xaxt="n")
axis(1,at=c(1,2),c("A","B"))
plotCI(x=diffT$estimate[1]-diffT$estimate[2],ui=diffT$conf.int[2],li=diffT$conf.int[1],ylab="Difference in Mean",xlab="",pch=19,xaxt="n")
axis(1,at=1,c("A-B"))
abline(h=0,lty=2)

plot of chunk unnamed-chunk-69

Multiple Testing with CI

Now we are going to create CI for each of the pairs of carriers, like we did with testing before.

We can reuse our code from before, that can be applied to the combinations of carriers, only this time, pull the confidence interval information rather than p-value

ttestCI<-function(x,variableName){
    group1<-flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[2]]
    group2<- flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[1]]
    tout<-t.test(group1,group2)
    out<-unlist(tout[c("estimate","conf.int")]) #unlist makes it a vector rather than list
    #fix up the names
    names(out)<-c("mean.of.x","mean.of.y","lower","upper")
    return(out)
}
t.CIPairs<-apply(X=pairsOfCarriers,MARGIN=2,FUN=ttestCI,variableName="Carrier")
t.CIPairs<-cleanUpPairs(t.CIPairs)
head(t.CIPairs)
##       mean.of.x mean.of.y        lower       upper
## AA-AS  3.086589  3.137592 -0.138045593  0.03603950
## AA-B6  3.289174  3.137592  0.071983930  0.23118020
## AA-DL  3.209319  3.137592  0.018600177  0.12485342
## AA-F9  3.164201  3.137592 -0.108192832  0.16141032
## AA-HA  2.943335  3.137592 -0.321473062 -0.06704092
## AA-OO  3.184732  3.137592  0.001615038  0.09266604

The results of t.test only give the means of each group, so we'll create a new variable that is the difference.

t.CIPairs<-cbind(t.CIPairs,diff=t.CIPairs$mean.of.x-t.CIPairs$mean.of.y)

Now we can plot all of the confidence intevals using plotCI. Notice that I set xaxt="n". This is to keep plotCI from drawing in a x-axis annotation. Instead, I want the x-axis to give the character string indicating what airlines are being compared. I do this with the axis command, which allows me to put any annotation on the axis (labels) at any location (at). las says to turn the labels perpindicular to the axis so I can read them.

require(gplots)
par(mfrow=c(1,1))
plotCI(x=t.CIPairs$diff,li=t.CIPairs$lower,ui=t.CIPairs$upper,ylab="Difference in log-Proportions",pch=19,xaxt="n",xlab="")
axis(1,at=1:nrow(t.CIPairs),labels=rownames(t.CIPairs),las=2)
abline(h=0,lty=2)

plot of chunk unnamed-chunk-72

We can reuse our above code, but just adjust the conf.level argument in the t.test command to get Bonferonni corrected CI.

ttestCIAdj<-function(x,variableName,npairs){
    group1<-flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[2]]
    group2<- flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[1]]
    tout<-t.test(group1,group2,conf.level = 1-0.05/npairs)
    out<-unlist(tout[c("estimate","conf.int")]) #unlist makes it a vector rather than list
    #fix up the names
    names(out)<-c("mean.of.x","mean.of.y","lower","upper")
    return(out)
}
t.CIPairsAdj<-apply(X=pairsOfCarriers,MARGIN=2,
    FUN=ttestCIAdj,variableName="Carrier",npairs=npairs)
t.CIPairsAdj<-cleanUpPairs(t.CIPairsAdj)
t.CIPairsAdj<-cbind(t.CIPairsAdj,diff=t.CIPairsAdj$mean.of.x-t.CIPairsAdj$mean.of.y)
par(mfrow=c(1,2))
plotCI(x=t.CIPairs$diff,li=t.CIPairs$lower,ui=t.CIPairs$upper,ylab="Difference in log-Proportions",
    sub="Raw CI",pch=19,xaxt="n",xlab="")
axis(1,at=1:nrow(t.CIPairs),labels=rownames(t.CIPairs),las=2)
abline(h=0,lty=2)
plotCI(x=t.CIPairsAdj$diff,
    li=t.CIPairsAdj$lower,
    ui=t.CIPairsAdj$upper,
    ylab="Difference in log-Proportions",
    sub="Bonferonni Adjusted CI",pch=19,xaxt="n",xlab="")
axis(1,at=1:nrow(t.CIPairsAdj),labels=rownames(t.CIPairsAdj),las=2)
abline(h=0,lty=2)

plot of chunk unnamed-chunk-73

Here we plot the TukeyHSD confidence intervals for pairwise comparisons:

tukeyCI<-TukeyHSD( aov(logDepDelay~Carrier, data=flightSFOSRS))
plot(tukeyCI,las=2)

plot of chunk unnamed-chunk-74