Do not forget to change directory for where the data will be (see 01Probability_forClassCode.Rmd)
dataDir<-"../../finalDataSets"
# dataDir<-"."
I will first read in data sets. I have three different data sets. One is a SRS from then entire 2016 year; one is just the month of january; and the last is a Stratified sample, which is a SRS from each month. I will mainly use the SRS in what follows.
flightSFOJan<-read.table(file.path(dataDir,"SFO.txt"),
sep="\t", header=TRUE, stringsAsFactors = FALSE)
flightSFOSRS<-read.table(file.path(dataDir,"SFO_SRS.txt"),
sep="\t",header=TRUE, stringsAsFactors = FALSE)
flightSFOStratified<-read.table(file.path(dataDir,"SFO_Stratified.txt"),
sep="\t",header=TRUE, stringsAsFactors = FALSE)
flightSRS.UAAA<-subset(flightSFOSRS,Carrier%in% c("UA","AA"))
Now lets subset to only those flights involving United and American Airlines.
flightSubset <- flightSFOSRS[flightSFOSRS$Carrier %in% c("UA", "AA"), ]
If I take the mean of the departure delay variable, I get NAs.
mean(flightSubset$DepDelay)
## [1] NA
To ignore the NA, I set na.rm=TRUE
(uncomment the next line). Notice, however, that this may not be a wise thing to do, as we've discussed.
#mean(flightSubset$DepDelay,na.rm=TRUE) #remove the NAs, which may or may not be wise
We demonstrate the use of tapply
to perform a calculation for each group. This function is for when you have one vector (or column of your data frame) with all of the observed measurements and another vector/column with a variable that says what group each observation is in. This is how data usually comes, so tapply
helps you calculate for each group without having to do annoying subsetting or making a for-loop through the different values of your groups. We saw this before in lecture 1. When I apply it here, though, I just get NAs.
# NA problem:
tapply(X=flightSubset$DepDelay, flightSubset$Carrier, mean)
## AA UA
## NA NA
But I can give additional arguments to the mean
function via tapply
by just adding them in. If you look in the help of tapply
you'll see that there is an argument ...
. This is R's way of allowing you to pass along additional arguments, without the function tapply
needing to know what those arguments are. In this case we will give the na.rm=TRUE
argument for the mean. Notice that tapply
just sends them along to mean
without looking at them, so tapply
doesn't need to know what arguments your function defined in FUN
actually needs, which makes it very flexible.
# remove NA:
tapply(flightSubset$DepDelay, flightSubset$Carrier, mean,na.rm=TRUE)
## AA UA
## 7.728294 12.255649
We can also write our own function for tapply
. The following code does the same as the above, only this time we wrote out a function that sets na.rm=TRUE
in our call to mean
.
# equivalent, only write a function to do it:
tapply(flightSubset$DepDelay, flightSubset$Carrier,
function(x){mean(x, na.rm = TRUE)})
## AA UA
## 7.728294 12.255649
We could have made this two lines of code. As an experienced coder, I don't tend to do this, but certainly feel free to break it down into steps. It can help to define the function separately, too, if it's long or complicated, so as to make sure it does what you want.
#equivalent, only in pieces
f<-function(x){mean(x, na.rm = TRUE)}
tapply(flightSubset$DepDelay, flightSubset$Carrier, FUN=f)
## AA UA
## 7.728294 12.255649
Now we do the median:
# median delay time
tapply(flightSubset$DepDelay, flightSubset$Carrier,
function(x){median(x, na.rm = TRUE)})
## AA UA
## -2 -1
Now we calculate the percentage of flights that are delayed, and write our own function to do this. Notice how I can create a logical vector x>0 | is.na(x)
to determine with observations are either delayed >0 minutes, or are missing (i.e. cancelled). Then taking sum
of a logical vector counts the number of TRUE values.
# percent flights delayed
tapply(flightSubset$DepDelay, flightSubset$Carrier,
function(x){sum(x > 0 | is.na(x)) / length(x)})
## AA UA
## 0.3201220 0.4383791
Again, another example only calculating the % more than 15 minutes
# percent flights delayed more than 15 minutes
tapply(flightSubset$DepDelay, flightSubset$Carrier,
function(x){sum(x > 15 | is.na(x)) / length(x)})
## AA UA
## 0.1554878 0.2046216
# percent flights cancelled
And now the percent of cancelled flights. Notice that cancelled values are 1=Cancelled and 0=not cancelled. Therefore taking the mean of the 0-1 values is the same as calculating the percentage.
tapply(flightSubset$Cancelled, flightSubset$Carrier, mean)
## AA UA
## 0.005081301 0.007032820
To illustrate different ways of sampling data, we had three different data sets (SRS, Stratified, and all of January). Here we plot the density functions of UA and AA for these different samples to indicate how they can be rather different.
par(mfrow=c(1, 2))
xlim<-c(-20, 50)
ylim<-c(0, 0.08)
#draw same for them separately
for (airline in c("UA", "AA")){
density.Jan <- density(flightSFOJan[flightSFOJan$Carrier==airline, "DepDelay"],
na.rm = TRUE)
density.SRS <- density(flightSFOSRS[flightSFOSRS$Carrier==airline, "DepDelay"],
na.rm = TRUE)
density.Stratified <- density(flightSFOStratified[flightSFOSRS$Carrier==airline, "DepDelay"], na.rm = TRUE)
plot(density.Jan, xlim=xlim, ylim=ylim, col="black",
main=paste("Density estimates:",airline), xlab=airline)
lines(density.SRS, col="red")
lines(density.Stratified, col="blue")
legend("topright",c("January","SRS","Stratified"), fill=c("black","red","blue"))
}
Here we will demonstrate a few permutations of the group labels, using the function sample
. You should think that sample
takes your input values (argument x
) and puts them in a large bag and then draws at random from that bag over and over to create a random sample. The number of times it draws is the argument size
and whether it replaces the selected value into the bag for the next draw is dependent on the argument replace
. So if I set size
to be the length of x
(the default) and I choose replace=FALSE
, I get a random permutation of my x
values.
set.seed(18592)
permuteData<-data.frame(FlightDelay=flightSRS.UAAA$DepDelay,
Observed=flightSRS.UAAA$Carrier,
Permutation1=sample(flightSRS.UAAA$Carrier,replace=FALSE),
Permutation2=sample(flightSRS.UAAA$Carrier,replace=FALSE),
Permutation3=sample(flightSRS.UAAA$Carrier,replace=FALSE))
head(permuteData)
## FlightDelay Observed Permutation1 Permutation2 Permutation3
## 1 5 UA AA UA AA
## 2 -6 UA UA UA UA
## 3 -10 AA UA UA UA
## 4 -3 UA UA AA UA
## 5 -3 UA UA AA UA
## 6 0 UA UA UA UA
Now for each permutation, I will calculate the proportion of flights with delay greater than 15 or cancelled (NA). To do this I will define a small function that does this:
propFun<-function(x){sum(x>15 | is.na(x))/length(x)}
Now I will apply that function to each of the permutations.
I do this with the function apply
which applies the same function to each column of my input data frame. So for every permutation in my dataset (3 + the observed one) I will use that permutation in defining my groups and calculate the proportions per group.
cat("Proportions per Carrier, each permutation:\n")
## Proportions per Carrier, each permutation:
medPerPermute<-apply(permuteData[,-1],MARGIN=2,function(x){
tapply(permuteData[,1],x,propFun)})
medPerPermute
## Observed Permutation1 Permutation2 Permutation3
## AA 0.1554878 0.2063008 0.1951220 0.1910569
## UA 0.2046216 0.1878768 0.1915606 0.1929002
Notice my above command set MARGIN=2
, meaning I will apply my function to each column (I dropped the first column with permuteData[,-1]
).
The output (medPerPermute
) is a matrix of the values per airline (each row is an airline). I need to take the difference of these values. I will do this with the diff
function, and again use apply
to do this for each column.
cat("\n")
cat("Differences in Proportions per Carrier, each permutation:\n")
## Differences in Proportions per Carrier, each permutation:
diffPerPermute<-apply(medPerPermute,2,diff)
diffPerPermute
## Observed Permutation1 Permutation2 Permutation3
## 0.049133762 -0.018424055 -0.003561335 0.001843290
Note that I could have done this in one call to apply
rather than break it up, but for teaching purposes I wanted to see the intermediate steps.
Now we consider actually implementing permutation tests to compare two groups. The following code sets up a generic function permutation.test
that will take as input the data on the two groups as arguments group1
and group2
, as well as a function FUN
that defines the statistic we are using for the test. This makes it a generic function that I can apply for any statistic and set of data. The last argument is n.repetitions
which is the number of repetitions of the permutation we choose to do, i.e. how many random permutations.
The function returns a list with the p-value, the observed statistic (i.e. as calculated on the real data), and a vector of the statistic calculated on all the permuted data.
# Returns:
# list with p-value, observed statistic, and vector of permuted statistics
permutation.test <- function(group1,group2, FUN, n.repetitions){
# calculate the observed statistic
stat.obs <- FUN(group1, group2)
#function to do the permutation and return statistic
makePermutedStats<-function(){
sampled <- sample(1:length(c(group1,group2)), size=length(group1),replace=FALSE) #sample indices that will go into making new group 1
return(FUN(c(group1,group2)[sampled], c(group1,group2)[-sampled]))
}
# calculate the permuted statistic
stat.permute <-replicate(n.repetitions,makePermutedStats())
p.value <- sum(stat.permute >= stat.obs) / n.repetitions
return(list(p.value=p.value,observedStat=stat.obs,permutedStats=stat.permute))
}
Notice how I give names to the values when defining the list, list(p.value=..., observedStat=..., permutedStats=...)
. This means that the resulting list that is returned will have three elements with names p.value
, observedStat
, and permutedStats
. This is a good coding convention, as it makes it easier to use the output. If I didn't, to use the output you have to use commands like output[[1]]
rather than output$p.value
, which is far less readable and easily broken, if for example you change the order of the output.
Now I want to run a permutation test using this function. Since there is randomness in our permutations, I am going to call set.seed
which takes as an argument an arbitrary number. This means that I will make the same random draws even if I repeat the code. This is convenient when you want to make sure you will get the exact same result each time (for example in homeworks or lectures).
set.seed(201728)
Now I need to define a function that will calculate my statistic when I give input of two vectors of data. I use my propFun
from before to have a per-group statistic (proportion), and then take the absolute difference (i.e. make 1-step what I previously did in 2 steps):
diffProportion<-function(x1,x2){
prop1<-propFun(x1)
prop2<-propFun(x2)
return(abs(prop1-prop2))
}
Now I will run my function permutation.test
on the UA and AA data with 10,000 permutations and save the output.
dataset<-flightSFOSRS
output<-permutation.test(
group1=dataset$DepDelay[dataset$Carrier == "UA"],
group2=dataset$DepDelay[dataset$Carrier == "AA"],
FUN=diffProportion, n.repetitions=10000)
The output contains not just the p-value, but the values of the statistic under all of the permutations. The output is a list, so I can access the values with a $
and the names. If I do names(output)
I can see the possible values (commented out here because I didn't want it in class).
# names(output)
For example, output$p.value
will give me the p-value.
I can plot the histogram of these permuted values, along with the observed value. Notice that I can use the function abline
to draw horizontal or vertical lines. Since I want a vertical line corresponding to the observed statistic, I set v=
and the vertical line will be at that value.
xlim<-range(c(output$observedStat,output$permutedStats))
hist(output$permutedStats, main = "Histogram of permuted statistics",
xlab = "Difference in proportion delay > 15 minutes", xlim=xlim)
abline(v = output$observedStat, col = "blue", lwd = 2)
To be safe, I set the limits of my plot to be the range of the values from the permutation and my observed using the argument xlim
in hist
; this is in case my observed statistic is so different my histogram command doesn't go out that far and so my blue line wouldn't be seen.
I can repeat the whole process with the median.
All I need to do is change the function I give FUN
to now be a function that calculates the absolute difference in the medians. Notice I am picking absolute everywhere, because my function will always count only the large values. This saves me trouble of figuring out how to count the positive and negative in my code.
set.seed(2843261)
# for each dataset, do permutation test and plot histogram
diffMedian<-function(x1,x2){
prop1<-median(x1,na.rm=TRUE)
prop2<-median(x2,na.rm=TRUE)
return(abs(prop1-prop2))
}
And now I can run the exact same code to get the permutation test.
dataset<-flightSFOSRS
output<-permutation.test(
group1=dataset$DepDelay[dataset$Carrier == "UA"],
group2=dataset$DepDelay[dataset$Carrier == "AA"],
FUN=diffMedian, n.repetitions=10000)
xlim<-range(c(output$observedStat,output$permutedStats))
hist(output$permutedStats,
main = "Histogram of permuted statistics",
xlab = "Difference in median flight delay", xlim=xlim)
abline(v = output$observedStat, col = "blue", lwd = 2)
cat("pvalue=",output$p.value)
## pvalue= 0.1287
Let's return to our flight data and consider using the t-test to compare the means of the distribution.
# percent flights delayed more than 15 minutes
However, our data is very skewed.
hist(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"],
main="Histogram Departure Delay, UA",
xlab="In minutes",breaks=50)
We can still run the t-test despite how bad the data is – it doesn't mean it's a good idea however!
There is a function in R that calculates this: t.test
.
t.test(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"],
flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="AA"])
##
## Welch Two Sample t-test
##
## data: flightSFOSRS$DepDelay[flightSFOSRS$Carrier == "UA"] and flightSFOSRS$DepDelay[flightSFOSRS$Carrier == "AA"]
## t = 2.8325, df = 1703.1, p-value = 0.004673
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.392379 7.662332
## sample estimates:
## mean of x mean of y
## 12.255649 7.728294
We can do a permutation test using the t-statistic as our statistic as well. We need to first define a function that will calculate the t-statistic for each permutation. I am going to actually use t.test
to do that. In particular, you can save the output of t.test
and it has a lot of information, including the value of the t-statistic. While I'm not going to actually use the p-value or results from the test, this saves me the trouble of writing a function for the t-statistic (and makes sure I do not make an error in my code!). I will, as always, take the absolute value of my t-statistic to calculate the p-value.
set.seed(489712)
# for each dataset, do permutation test and plot histogram
tstatFun<-function(x1,x2){abs(t.test(x1,x2)$statistic)}
I will now run the permutation test and compare its p-values with those of the actual t-test.
dataset<-flightSFOSRS
output<-permutation.test(group1=dataset$DepDelay[dataset$Carrier == "UA"], group2=dataset$DepDelay[dataset$Carrier == "AA"],FUN=tstatFun, n.repetitions=10000)
cat("permutation pvalue=",output$p.value)
## permutation pvalue= 0.0076
tout<-t.test(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"],flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="AA"])
cat("t-test pvalue=",tout$p.value)
## t-test pvalue= 0.004673176
Now we can compare the distribution of the permuted mean differences to the normal predicted by the t-test if we had normal data. Before I took the absolute value of the t-statistic but that's not going to make for the right values to compare to the t-distribution – I need the positive and negative values. So to make this plot, I am going to rerun the permutation test with a different function that doesn't take the absolute value of the t-statistic, so we can more easily compare its distribution to that of the parametric model of the distribution of the t-statistic. However, this will be just for illustration, and the p-value will be wrong.
set.seed(489712)
# for this plot, won't take the absolute value, so will match the normal distribution plot
tstatFun2<-function(x1,x2){t.test(x1,x2)$statistic}
Doing this, I can now plot the histogram of the t-statistic under the permutation with the t-distribution from the corresponding t-test. Notice that because I have very large numbers of samples, I just draw the normal density.
dataset<-flightSFOSRS[!is.na(flightSFOSRS$DepDelay),]
outputTPerm<-permutation.test(group1=dataset$DepDelay[dataset$Carrier == "UA"], group2=dataset$DepDelay[dataset$Carrier == "AA"],FUN=tstatFun2, n.repetitions=10000)
xlim<-range(c(outputTPerm$observedStat,outputTPerm$permutedStats))
hist(outputTPerm$permutedStats, main = "Histogram of permuted t-statistic",
xlab = "T-statistic", xlim=xlim,freq=FALSE,breaks=100)
abline(v = outputTPerm$observedStat, col = "blue", lwd = 2)
curve(dnorm,add=TRUE)
legend("topleft","N(0,1)",lty=1)
Now we consider the same comparison of the parametric t-test to the permutation test with the t-statistic, but with a smaller sample size. I will subsample the data so that I only get 50 in each of the two airline groups.
set.seed(489712)
dataset<-flightSFOSRS[flightSFOSRS$Carrier %in% c("UA","AA") & !is.na(flightSFOSRS$DepDelay),]
#sample smaller number of each
ind1<-sample(which(dataset$Carrier=="AA"),size=50)
ind2<-sample(which(dataset$Carrier=="UA"),size=50)
dataset<-dataset[c(ind1,ind2),]
Now I calculate the null distribution under the permutation test and under the parametric model (i.e. a t-distribution). Again I will do it two different ways for the histogram versus the p-value. To get the right degrees of freedom for drawing the density of the t-distribution, I will just use the output of the t.test function that saves that information.
outputTPermSmall<-permutation.test(
group1=dataset$DepDelay[dataset$Carrier == "UA"],
group2=dataset$DepDelay[dataset$Carrier == "AA"],
FUN=tstatFun2, n.repetitions=10000)
toutput<-t.test(dataset$DepDelay[dataset$Carrier == "UA"],
dataset$DepDelay[dataset$Carrier == "AA"])
xlim<-range(c(outputTPermSmall$observedStat,
outputTPermSmall$permutedStats))
hist(outputTPermSmall$permutedStats,
main = "Histogram of permuted t-statistic",
xlab = "T-statistic", xlim=xlim,freq=FALSE,breaks=70)
abline(v = outputTPermSmall$observedStat, col = "blue", lwd = 2)
# add the t-distribution with the right degrees of freedom
tfun<-function(x){dt(x,df=toutput$parameter)}
curve(tfun,add=TRUE,col="red")
legend("topleft","t-distribution",lty=1)
We can compare the pvalues:
set.seed(21590)
outputTPermSmallPvalue<-permutation.test(
group1=dataset$DepDelay[dataset$Carrier == "UA"],
group2=dataset$DepDelay[dataset$Carrier == "AA"],
FUN=tstatFun, n.repetitions=10000)
cat("pvalue permutation=",
outputTPermSmallPvalue$p.value,"\n")
## pvalue permutation= 0.4446
cat("pvalue t.test=",toutput$p.value,"\n")
## pvalue t.test= 0.394545
Of course, the t-test assumptions don't match the flight data. We can consider what happens if we transform the data like we did in earlier lectures. We will repeat all of the above, only now with the log of the departure delay (+ a small amount)
#use the same subset of data found above
addValue<-abs(min(dataset$DepDelay,na.rm=TRUE))+1 #the small amount I will add to get positive values
dataset$logDepDelay<-log(dataset$DepDelay+addValue)
outputTPermSmall<-permutation.test(
group1=dataset$logDepDelay[dataset$Carrier == "UA"],
group2=dataset$logDepDelay[dataset$Carrier == "AA"],
FUN=tstatFun2, n.repetitions=10000)
toutput<-t.test(dataset$logDepDelay[dataset$Carrier == "UA"],
dataset$logDepDelay[dataset$Carrier == "AA"])
xlim<-range(c(outputTPermSmall$observedStat,outputTPermSmall$permutedStats))
hist(outputTPermSmall$permutedStats,
main = "Histogram of permuted t-statistic, log-transformed",
xlab = "T-statistic", xlim=xlim,freq=FALSE,breaks=100)
abline(v = outputTPermSmall$observedStat, col = "blue", lwd = 2)
tfun<-function(x){dt(x,df=toutput$parameter)}
curve(tfun,add=TRUE,col="red")
legend("topleft","t-distribution",lty=1)
# re-do to get proper p-values on absolute value:
set.seed(21590)
outputTPermSmallPvalue<-permutation.test(group1=dataset$DepDelay[dataset$Carrier == "UA"], group2=dataset$DepDelay[dataset$Carrier == "AA"],FUN=tstatFun, n.repetitions=10000)
cat("pvalue permutation=",outputTPermSmallPvalue$p.value,"\n")
## pvalue permutation= 0.4446
cat("pvalue t.test=",toutput$p.value,"\n")
## pvalue t.test= 0.3261271
To compare all pairs of airlines, first we need to ennumerate all the possible combinations. The function combn
gives all the combinations of length 2 (m=2
), i.e. all pairs. The result is a matrix with 2 rows, with each column a combination.
pairsOfCarriers<-combn(x=carriers,m=2) #get all combinations of length 2, ie all pairs
## Error in combn(x = carriers, m = 2): object 'carriers' not found
I will again work with the transformed data, so lets create this as a variable in my data.frame (before it was saved as a temporary value)
addValue<-abs(min(flightSFOSRS$DepDelay,na.rm=TRUE))+1
flightSFOSRS$logDepDelay<-log(flightSFOSRS$DepDelay+addValue)
Now, for each pair, I want to perform a t-test. So I'm going to define a function that does that, and use apply
to with my pairsOfCarriers
matrix above to do it to each pair of airlines.
That means that the input that apply will be sending to my function is going to be a vector of airline names (a character vector of length two). So I need to write my function with that as the expected input, in addition to other possible variables I might give my function. My function will use that vector ( x
) to subset the full dataset to those that match the input carriers given by x
, do a t.test between the two-carriers, and for simplicity return only the t-statistic and its p-value under the t-test. I also put an argument variableName
into my function that requires the user to specify the name of the column in the dataset where the groups are (i.e. Carrier
); this is because I am going to reuse this code later with a different variable. Notice this isn't a very flexible code, because it only works with the flightSFOSRS
dataset for the logDepDelay
variable. I could add more arguments to make it more general.
#each pair, do t-test
ttestFun<-function(x,variableName){
group1<-flightSFOSRS$logDepDelay[ flightSFOSRS[,variableName] == x[1] ]
group2<-flightSFOSRS$logDepDelay[ flightSFOSRS[,variableName] == x[2] ]
tout<-t.test(group1,group2)
#unlist makes it a vector rather than list
unlist(tout[c("statistic","p.value")])
}
Now, I could use this function and apply it to all my pairs with a for-loop, e.g. for(i in 1:ncol(pairsOfCarriers)){...}
. This is perfectly acceptable. Instead I will use a common R function apply
that we used above that is meant to loop over rows (or columns) of a matrix.
Specifically, apply
takes a matrix as the first argument (X
). The second argument (MARGIN
) determines whether to loop over rows (MARGIN=1
) or columns (MARGIN=2
). Then FUN
defines the function to apply to each row. I can give built-in function, like mean
(which would result in row means, except that for this example my matrix is not numeric values), or I can supply a function, like my ttestFun
I wrote above.
t.testPairs<-apply(X=pairsOfCarriers,
MARGIN=2,FUN=ttestFun,
variableName="Carrier")
## Error in apply(X = pairsOfCarriers, MARGIN = 2, FUN = ttestFun, variableName = "Carrier"): object 'pairsOfCarriers' not found
The result is a matrix of dimension 2 x No. combinations.
dim(t.testPairs)
## Error in eval(expr, envir, enclos): object 't.testPairs' not found
Notice that apply gives the full vector corresponding to the column or row to your function in FUN
. This means that the function I wrote had to assume the input is a vector. I couldn't, for example, write a function function(group1Name,group2Name)
; I have to have function(x)
where x[1]
is the name of group 1 and x[2]
is the name of group 2.
I would note that the function combn
allows you to run a function over all combinations, so I could have avoided the apply
step (combn
just runs apply for you). But it is a good example of using the apply
function. Here's an example of doing it directly in combn
:
# skipping apply step and doing it just within the combn function:
t.testPairsV2<-combn(x=carriers,m=2,FUN=ttestFun,variableName="Carrier")
## Error in combn(x = carriers, m = 2, FUN = ttestFun, variableName = "Carrier"): object 'carriers' not found
Cleaning up the output It's pretty annoying to have the output in the rows of t.testPairs
. I can't do things like t.testPairs$statistic.t
for example. So I'm going to flip the matrix with the t
function
t.testPairs<-t(t.testPairs)
## Error in t(t.testPairs): object 't.testPairs' not found
And then I'm going to convert it to a data.frame:
t.testPairs<-as.data.frame(t.testPairs)
## Error in as.data.frame(t.testPairs): object 't.testPairs' not found
I am going to pretty up the names of the rows here so I can keep track of what comparison they correspond to (i.e. which Airline). I'd like to have names like “UA-AA”, for example so I can keep track. I know by my code that each result corresponds to the same element of the first row of pairsOfCarriers
and the second row of pairsOfCarriers
, so what I want to do is glue those values together and make my “UA-AA” string. The paste
function allows me to glue together two vectors of strings, element-by-element, and sep="-"
says to put a minus sign between them.
rownames(t.testPairs)<-paste(pairsOfCarriers[1,],pairsOfCarriers[2,],sep="-")
## Error in paste(pairsOfCarriers[1, ], pairsOfCarriers[2, ], sep = "-"): object 'pairsOfCarriers' not found
head(t.testPairs)
## Error in head(t.testPairs): object 't.testPairs' not found
I find it rather annoying to do all this clean-up, and I'm going to do it several times in this module, so I'm going to write a little function that does it:
cleanUpPairs<-function(pairsResults){
pairsResults<-t(pairsResults)
pairsResults<-as.data.frame(pairsResults)
rownames(pairsResults)<-paste(pairsOfCarriers[1,],
pairsOfCarriers[2,],sep="-")
return(pairsResults)
}
# Repeat the above calculation and instead use my `cleanUpPairs` function.
t.testPairs<-apply(X=pairsOfCarriers,
MARGIN=2,FUN=ttestFun,
variableName="Carrier")
## Error in apply(X = pairsOfCarriers, MARGIN = 2, FUN = ttestFun, variableName = "Carrier"): object 'pairsOfCarriers' not found
t.testPairs<-cleanUpPairs(t.testPairs)
## Error in t(pairsResults): object 't.testPairs' not found
head(t.testPairs)
## Error in head(t.testPairs): object 't.testPairs' not found
Now I can do some calculations, like the number and the proportion significant at level 0.05:
numbSig<-sum(t.testPairs$p.value<0.05)
## Error in eval(expr, envir, enclos): object 't.testPairs' not found
propSig<-numbSig/length(t.testPairs$p.value)
## Error in eval(expr, envir, enclos): object 'numbSig' not found
cat("Number found with p-value < 0.05: ",numbSig, "(",round(propSig,2)," proportion of tests)")
## Error in cat("Number found with p-value < 0.05: ", numbSig, "(", round(propSig, : object 'numbSig' not found
We can plot these p-values to get an idea of their value. Notice that I give only 1 vector to plot
. By default, if you give a single vector, say z
it will just make the x-values be 1:length(z)
, i.e. plot them across the graph in order.
plot(log(t.testPairs$p.value),ylab="log(P-value)",main="P-values from all pairwise tests",xaxt="n",xlab="")
## Error in plot(log(t.testPairs$p.value), ylab = "log(P-value)", main = "P-values from all pairwise tests", : object 't.testPairs' not found
abline(h=log(0.05),lty=2)
## Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): plot.new has not been called yet
legend("bottomright",legend="P-value = 0.05",lty=2)
## Error in strwidth(legend, units = "user", cex = cex, font = text.font): plot.new has not been called yet
axis(1,at=1:nrow(t.testPairs),labels=rownames(t.testPairs),las=2)
## Error in nrow(t.testPairs): object 't.testPairs' not found
To consider whether this is a lot to find significant (i.e. there is strong signal) or whether this is due to just chance multiple testing issues, we redo the above. Only now we scramble the carrier labels, so there is actually no difference, and see how many we incorrectly reject. (I've set the seed here so that we get the same result each time in the lecture notes, but you can comment out the set.seed
and rerun it many times to get an idea of how it changes.)
#can repeat this code over and over to see how changes with different scrambles.
set.seed(95289235)
flightSFOSRS$fakeCarrier<-sample(flightSFOSRS$Carrier)
#each pair, do t-test
t.testPairsScramble<-apply(pairsOfCarriers,2,ttestFun,variableName="fakeCarrier")
## Error in apply(pairsOfCarriers, 2, ttestFun, variableName = "fakeCarrier"): object 'pairsOfCarriers' not found
t.testPairsScramble<-cleanUpPairs(t.testPairsScramble)
## Error in t(pairsResults): object 't.testPairsScramble' not found
numbSig<-sum(t.testPairsScramble$p.value<0.05)
## Error in eval(expr, envir, enclos): object 't.testPairsScramble' not found
propSig<-numbSig/length(t.testPairsScramble$p.value)
## Error in eval(expr, envir, enclos): object 'numbSig' not found
cat("Number found with p-value < 0.05: ",numbSig, "(",round(propSig,2)," proportion)")
## Error in cat("Number found with p-value < 0.05: ", numbSig, "(", round(propSig, : object 'numbSig' not found
plot(log(t.testPairsScramble$p.value),ylab="log(P-value)",main="P-values from all pairwise tests, scrambled carriers",xaxt="n",xlab="")
## Error in plot(log(t.testPairsScramble$p.value), ylab = "log(P-value)", : object 't.testPairsScramble' not found
abline(h=log(0.05),lty=2)
## Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): plot.new has not been called yet
legend("bottomright",legend="P-value = 0.05",lty=2)
## Error in strwidth(legend, units = "user", cex = cex, font = text.font): plot.new has not been called yet
axis(1,at=1:nrow(t.testPairs),labels=rownames(t.testPairs),las=2)
## Error in nrow(t.testPairs): object 't.testPairs' not found
Adjust the Level We can adjust for multiple testing by adjusting the level of each test to 0.05/ # tests
cat("Number found significant after Bonferonni: ",
sum(t.testPairs$p.value<0.05/npairs))
## Error in cat("Number found significant after Bonferonni: ", sum(t.testPairs$p.value < : object 't.testPairs' not found
cat("Number of shuffled differences found significant after Bonferonni: ",
sum(t.testPairsScramble$p.value<0.05/npairs))
## Error in cat("Number of shuffled differences found significant after Bonferonni: ", : object 't.testPairsScramble' not found
Adjust the p-values We can also directly adjust our p-values and use our standard notion of level (e.g. 0.05). We create a new column in our data.frame called p.value.adj
by using the function cbind
, which just pastes another column to an existing matrix/data.frame
t.testPairs<-cbind(t.testPairs,"p.value.adj"=t.testPairs$p.value*npairs)
## Error in cbind(t.testPairs, p.value.adj = t.testPairs$p.value * npairs): object 't.testPairs' not found
t.testPairsScramble<-cbind(t.testPairsScramble,"p.value.adj"=t.testPairsScramble$p.value*npairs)
## Error in cbind(t.testPairsScramble, p.value.adj = t.testPairsScramble$p.value * : object 't.testPairsScramble' not found
head(t.testPairs)
## Error in head(t.testPairs): object 't.testPairs' not found
head(t.testPairsScramble)
## Error in head(t.testPairsScramble): object 't.testPairsScramble' not found
To make sure that none of our adjusted p-values are greater than 1, we can use the function pmin
on our adjusted values, which goes element by element and takes the min of the adjusted p-value and the value 1:
t.testPairs<-cbind(t.testPairs,"p.value.adj.final"=pmin(t.testPairs$p.value.adj,1))
## Error in cbind(t.testPairs, p.value.adj.final = pmin(t.testPairs$p.value.adj, : object 't.testPairs' not found
t.testPairsScramble<-cbind(t.testPairsScramble,"p.value.adj.final"=pmin(t.testPairsScramble$p.value.adj,1))
## Error in cbind(t.testPairsScramble, p.value.adj.final = pmin(t.testPairsScramble$p.value.adj, : object 't.testPairsScramble' not found
head(t.testPairs)
## Error in head(t.testPairs): object 't.testPairs' not found
Let's identify those tests that were significant at 0.05 level before adjustment, but now are not
whLostSig<-which(t.testPairs$p.value.adj.final>0.05 & t.testPairs$p.value<=0.05)
## Error in which(t.testPairs$p.value.adj.final > 0.05 & t.testPairs$p.value <= : object 't.testPairs' not found
whLostSigScramble<-which(t.testPairsScramble$p.value.adj.final>0.05 & t.testPairsScramble$p.value<=0.05)
## Error in which(t.testPairsScramble$p.value.adj.final > 0.05 & t.testPairsScramble$p.value <= : object 't.testPairsScramble' not found
I will remake the previous plots, only now with adjusted p-values, and “highlight” those p-values that changed to non-significant by plotting them a different color. Notice how I can use the function points
to add points to a scatterplot, just like I previously used lines
to add points connected by lines to an existing scatterplot. In this case I am overplotting, meaning I plotted all of the values in my plot
command, and by calling points
afterward using a different color, those points are covered up and plotted with a new color. I also use the argument pch
to change from an open circle to a filled in circle (see the help of points
to get a full list of the possible plotting symbols.)
ylab<-"log(adjusted P-value)"
par(mfrow=c(1,2))
plot(log(t.testPairs$p.value.adj.final),ylab=ylab,main="Real Data, after adjustment",xaxt="n",xlab="")
## Error in plot(log(t.testPairs$p.value.adj.final), ylab = ylab, main = "Real Data, after adjustment", : object 't.testPairs' not found
abline(h=log(0.05),lty=2)
## Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): plot.new has not been called yet
n<-nrow(t.testPairs)
## Error in nrow(t.testPairs): object 't.testPairs' not found
points((1:n)[whLostSig],log(t.testPairs$p.value.adj.final)[whLostSig],col="red",pch=19)
## Error in points((1:n)[whLostSig], log(t.testPairs$p.value.adj.final)[whLostSig], : object 'n' not found
legend("bottomright",legend="P-value = 0.05",lty=2)
## Error in strwidth(legend, units = "user", cex = cex, font = text.font): plot.new has not been called yet
axis(1,at=1:n,labels=rownames(t.testPairs),las=2)
## Error in axis(1, at = 1:n, labels = rownames(t.testPairs), las = 2): object 'n' not found
plot(log(t.testPairsScramble$p.value.adj.final),ylab=ylab,main="Scrambled Data, after adjustment",xaxt="n",xlab="", ylim=c(log(0.05),0))
## Error in plot(log(t.testPairsScramble$p.value.adj.final), ylab = ylab, : object 't.testPairsScramble' not found
points((1:n)[whLostSigScramble],log(t.testPairsScramble$p.value.adj.final[whLostSigScramble]),col="red",pch=19)
## Error in points((1:n)[whLostSigScramble], log(t.testPairsScramble$p.value.adj.final[whLostSigScramble]), : object 'n' not found
abline(h=log(0.05),lty=2)
## Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): plot.new has not been called yet
axis(1,at=1:n,labels=rownames(t.testPairs),las=2)
## Error in axis(1, at = 1:n, labels = rownames(t.testPairs), las = 2): object 'n' not found
Examples calculating quantiles This code demonstrates calculating quantiles using the qXXX function. qXXX goes along with dXXX, rXXX, and pXXX. For a qXXX function the input argument is a probability, and the output is the value for which you have the probability of being less than. Here we demonstrate with the normal distribution.
qnorm(0.2, mean=0,sd=1)
## [1] -0.8416212
qnorm(0.9, mean=0,sd=1)
## [1] 1.281552
qnorm(0.0275, mean=0,sd=1)
## [1] -1.918876
To get parametric confidence intervals, we again use the t.test
function, which also provides confidence intervals. Here we get confidence intervals for our logged- delay values.
t.test(log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"]+addValue))
##
## One Sample t-test
##
## data: log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier == "UA"] + addValue)
## t = 289.15, df = 2964, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 3.236722 3.280920
## sample estimates:
## mean of x
## 3.258821
Here we take our confidence interval for the log of the mean, and convert it back to the original scale
logT<-t.test(log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"]+addValue))
exp(logT$conf.int)-addValue
## [1] 3.450158 4.600224
## attr(,"conf.level")
## [1] 0.95
The t.test
function also provides confidence intervals between groups. Here we perform for the log-difference of our delay between UA and AA
logUA<-log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="UA"]+addValue)
logAA<-log(flightSFOSRS$DepDelay[flightSFOSRS$Carrier=="AA"]+addValue)
t.test(logUA,logAA)
##
## Welch Two Sample t-test
##
## data: logUA and logAA
## t = 5.7011, df = 1800.7, p-value = 1.389e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.07952358 0.16293414
## sample estimates:
## mean of x mean of y
## 3.258821 3.137592
The following function creates bootstrap confidence intervals comparing two groups.
The input is like the permutation.test
I wrote above: the two vectors of data, the function FUN
that defines your statistic, and the number of repetitions. It returns a list with the confidence interval and the vector of bootstrapped statistics.
# Parameters:
# group1, group2: vector of data from each group
# FUN: function; this function accepts two vectors (two groups),
# and calculate the statistics (should be absolute value.)
# n.repetitions: number of permutations to be done
#
# Returns:
# list with confidence interval, and vector of bootstrapped statistics
bootstrap.test <- function(group1, group2, FUN,
n.repetitions, confidence.level=0.95){
# calculate the observed statistic
stat.obs <- FUN(group1, group2)
# calculate the bootstrapped statistic, notice replace=TRUE
bootFun<-function(){
sampled1 = sample(group1, size=length(group1),replace = TRUE)
sampled2 = sample(group2, size=length(group2),replace = TRUE)
FUN(sampled1,sampled2)
}
stat.boot<-replicate(n.repetitions,bootFun())
level<-1-confidence.level
confidence.interval <- quantile(stat.boot,
probs=c(level/2,1-level/2))
# below, `unname` simply takes off the existing names
# given by quantile function:
return(list(
confidence.interval = c(
"lower"=unname(confidence.interval[1]),
"estimate"=stat.obs,
"upper"=unname(confidence.interval[2])),
bootStats=stat.boot
))
}
Now we apply our function to our flight data, reusing our function diffProportion
from before.
set.seed(201728)
# for each dataset, do permutation test and plot histogram
#for (dataset in list(flightSFOJan, flightSFOSRS, flightSFOStratified)){
dataset<-flightSFOSRS #this makes it easy to recopy code later
output<-bootstrap.test(
group1=dataset$DepDelay[dataset$Carrier == "UA"],
group2=dataset$DepDelay[dataset$Carrier == "AA"],
FUN=diffProportion,
n.repetitions=10000)
We can draw a histogram of the bootstrap distribution of our estimates
par(mfrow = c(1,1))
hist(output$bootStats,
main = "Histogram of bootstrapped statistics",
xlab="Difference in proportions")
abline(v = output$confidence.interval["estimate"],
col = "blue", lwd = 2)
legend("topright",
legend=expression(paste("Observed ",hat(delta))),
fill="blue")
If you don't have the package gplots
installed, uncomment the install.packages
code below (but only do it once!):
# install.packages("gplots")
We will use the package gplots
to plot the confidence intervals. x
refers to the observed statistic, ui
the upper confidence interval value, and li
the lower confidence interval.
require(gplots)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
plotCI(x=output$confidence.interval["estimate"],ui=output$confidence.interval["upper"],li=output$confidence.interval["lower"],ylab="Difference in Proportions",pch=19,xaxt="n",xlab="")
Here I compare creating the confidence interval of the individual means, versus the confidence interval of the difference of the means.
First I calculate both the individual means and the difference in means using t.test.
uaT<-t.test(logUA)
aaT<-t.test(logAA)
diffT<-t.test(logUA,logAA)
Now I will plot the confidence intervals of the two groups' means, and then that of the difference using plotCI
par(mfrow=c(1,2))
plotCI(x=c(UA=uaT$estimate,aaT$estimate),ui=c(uaT$conf.int[2],aaT$conf.int[2]),li=c(uaT$conf.int[1],aaT$conf.int[1]),ylab="Mean Delay Time",pch=19,xlab="Airline",xaxt="n")
axis(1,at=c(1,2),c("UA","AA"))
plotCI(x=diffT$estimate[1]-diffT$estimate[2],ui=diffT$conf.int[2],li=diffT$conf.int[1],ylab="Difference in Mean Delay Time",xlab="",pch=19,xaxt="n")
axis(1,at=1,c("UA-AA"))
Here is an artifical example where the CI of the means of each group overlap, but the CI of the difference between the means doesn't cover zero:
exampleData<-read.csv(file.path(dataDir,"DifferenceMeans.csv"),comment.char="#")
aT<-t.test( exampleData$A)
bT<-t.test(exampleData$B)
diffT<-t.test(exampleData$A,exampleData$B)
par(mfrow=c(1,2))
plotCI(x=c(aT$estimate,bT$estimate),ui=c(aT$conf.int[2],bT$conf.int[2]),li=c(aT$conf.int[1],bT$conf.int[1]),ylab="Mean of Groups",pch=19,xlab="",xaxt="n")
axis(1,at=c(1,2),c("A","B"))
plotCI(x=diffT$estimate[1]-diffT$estimate[2],ui=diffT$conf.int[2],li=diffT$conf.int[1],ylab="Difference in Mean",xlab="",pch=19,xaxt="n")
axis(1,at=1,c("A-B"))
abline(h=0,lty=2)
Now we are going to create CI for each of the pairs of carriers, like we did with testing before.
We can reuse our code from before, that can be applied to the combinations of carriers, only this time, pull the confidence interval information rather than p-value
ttestCI<-function(x,variableName){
group1<-flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[2]]
group2<- flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[1]]
tout<-t.test(group1,group2)
out<-unlist(tout[c("estimate","conf.int")]) #unlist makes it a vector rather than list
#fix up the names
names(out)<-c("mean.of.x","mean.of.y","lower","upper")
return(out)
}
t.CIPairs<-apply(X=pairsOfCarriers,MARGIN=2,FUN=ttestCI,variableName="Carrier")
## Error in apply(X = pairsOfCarriers, MARGIN = 2, FUN = ttestCI, variableName = "Carrier"): object 'pairsOfCarriers' not found
t.CIPairs<-cleanUpPairs(t.CIPairs)
## Error in t(pairsResults): object 't.CIPairs' not found
head(t.CIPairs)
## Error in head(t.CIPairs): object 't.CIPairs' not found
The results of t.test
only give the means of each group, so we'll create a new variable that is the difference.
t.CIPairs<-cbind(t.CIPairs,diff=t.CIPairs$mean.of.x-t.CIPairs$mean.of.y)
## Error in cbind(t.CIPairs, diff = t.CIPairs$mean.of.x - t.CIPairs$mean.of.y): object 't.CIPairs' not found
Now we can plot all of the confidence intevals using plotCI
. Notice that I set xaxt="n"
. This is to keep plotCI
from drawing in a x-axis annotation. Instead, I want the x-axis to give the character string indicating what airlines are being compared. I do this with the axis
command, which allows me to put any annotation on the axis (labels
) at any location (at
). las
says to turn the labels perpindicular to the axis so I can read them.
require(gplots)
par(mfrow=c(1,1))
plotCI(x=t.CIPairs$diff,li=t.CIPairs$lower,ui=t.CIPairs$upper,ylab="Difference in log-Proportions",pch=19,xaxt="n",xlab="")
## Error in plotCI(x = t.CIPairs$diff, li = t.CIPairs$lower, ui = t.CIPairs$upper, : object 't.CIPairs' not found
axis(1,at=1:nrow(t.CIPairs),labels=rownames(t.CIPairs),las=2)
## Error in nrow(t.CIPairs): object 't.CIPairs' not found
abline(h=0,lty=2)
## Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): plot.new has not been called yet
We can reuse our above code, but just adjust the conf.level
argument in the t.test
command to get Bonferonni corrected CI.
ttestCIAdj<-function(x,variableName,npairs){
group1<-flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[2]]
group2<- flightSFOSRS$logDepDelay[flightSFOSRS[,variableName] == x[1]]
tout<-t.test(group1,group2,conf.level = 1-0.05/npairs)
out<-unlist(tout[c("estimate","conf.int")]) #unlist makes it a vector rather than list
#fix up the names
names(out)<-c("mean.of.x","mean.of.y","lower","upper")
return(out)
}
t.CIPairsAdj<-apply(X=pairsOfCarriers,MARGIN=2,
FUN=ttestCIAdj,variableName="Carrier",npairs=npairs)
## Error in apply(X = pairsOfCarriers, MARGIN = 2, FUN = ttestCIAdj, variableName = "Carrier", : object 'pairsOfCarriers' not found
t.CIPairsAdj<-cleanUpPairs(t.CIPairsAdj)
## Error in t(pairsResults): object 't.CIPairsAdj' not found
t.CIPairsAdj<-cbind(t.CIPairsAdj,diff=t.CIPairsAdj$mean.of.x-t.CIPairsAdj$mean.of.y)
## Error in cbind(t.CIPairsAdj, diff = t.CIPairsAdj$mean.of.x - t.CIPairsAdj$mean.of.y): object 't.CIPairsAdj' not found
par(mfrow=c(1,2))
plotCI(x=t.CIPairs$diff,li=t.CIPairs$lower,ui=t.CIPairs$upper,ylab="Difference in log-Proportions",
sub="Raw CI",pch=19,xaxt="n",xlab="")
## Error in plotCI(x = t.CIPairs$diff, li = t.CIPairs$lower, ui = t.CIPairs$upper, : object 't.CIPairs' not found
axis(1,at=1:nrow(t.CIPairs),labels=rownames(t.CIPairs),las=2)
## Error in nrow(t.CIPairs): object 't.CIPairs' not found
abline(h=0,lty=2)
## Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): plot.new has not been called yet
plotCI(x=t.CIPairsAdj$diff,
li=t.CIPairsAdj$lower,
ui=t.CIPairsAdj$upper,
ylab="Difference in log-Proportions",
sub="Bonferonni Adjusted CI",pch=19,xaxt="n",xlab="")
## Error in plotCI(x = t.CIPairsAdj$diff, li = t.CIPairsAdj$lower, ui = t.CIPairsAdj$upper, : object 't.CIPairsAdj' not found
axis(1,at=1:nrow(t.CIPairsAdj),labels=rownames(t.CIPairsAdj),las=2)
## Error in nrow(t.CIPairsAdj): object 't.CIPairsAdj' not found
abline(h=0,lty=2)
## Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): plot.new has not been called yet
Here we plot the TukeyHSD confidence intervals for pairwise comparisons:
tukeyCI<-TukeyHSD( aov(logDepDelay~Carrier, data=flightSFOSRS))
plot(tukeyCI,las=2)