This is the directory I have my data saved in.

dataDir<-"../../finalDataSets"

To change it to your working directory uncomment the following code (“.” just means current directory):

# dataDir<-"."

Now we read in the data. We give it the path of the data; sep=“\t” indicates it is tab-deliminated; header=TRUE means there is a row with header/column names.

flightSF<-read.table(file.path(dataDir,"SFO.txt"),sep="\t",header=TRUE)

This results in a data.frame. We can look at some basic information regarding this data.frame:

dim(flightSF)

## [1] 13207    64

names(flightSF)

##  [1] "Year"                 "Quarter"              "Month"               
##  [4] "DayofMonth"           "DayOfWeek"            "FlightDate"          
##  [7] "UniqueCarrier"        "AirlineID"            "Carrier"             
## [10] "TailNum"              "FlightNum"            "OriginAirportID"     
## [13] "OriginAirportSeqID"   "OriginCityMarketID"   "Origin"              
## [16] "OriginCityName"       "OriginState"          "OriginStateFips"     
## [19] "OriginStateName"      "OriginWac"            "DestAirportID"       
## [22] "DestAirportSeqID"     "DestCityMarketID"     "Dest"                
## [25] "DestCityName"         "DestState"            "DestStateFips"       
## [28] "DestStateName"        "DestWac"              "CRSDepTime"          
## [31] "DepTime"              "DepDelay"             "DepDelayMinutes"     
## [34] "DepDel15"             "DepartureDelayGroups" "DepTimeBlk"          
## [37] "TaxiOut"              "WheelsOff"            "WheelsOn"            
## [40] "TaxiIn"               "CRSArrTime"           "ArrTime"             
## [43] "ArrDelay"             "ArrDelayMinutes"      "ArrDel15"            
## [46] "ArrivalDelayGroups"   "ArrTimeBlk"           "Cancelled"           
## [49] "CancellationCode"     "Diverted"             "CRSElapsedTime"      
## [52] "ActualElapsedTime"    "AirTime"              "Flights"             
## [55] "Distance"             "DistanceGroup"        "CarrierDelay"        
## [58] "WeatherDelay"         "NASDelay"             "SecurityDelay"       
## [61] "LateAircraftDelay"    "FirstDepTime"         "TotalAddGTime"       
## [64] "LongestAddGTime"

We can just print a few rows and columns

flightSF[1:10,c("FlightDate","Carrier","FlightNum","Origin","Dest","DepDelay","DepDelayMinutes","ArrDelay","ArrDelayMinutes","Cancelled")]

##    FlightDate Carrier FlightNum Origin Dest DepDelay DepDelayMinutes
## 1  2016-01-01      AA       208    SFO  MIA       -2               0
## 2  2016-01-02      AA       208    SFO  MIA       -7               0
## 3  2016-01-03      AA       208    SFO  MIA       -4               0
## 4  2016-01-04      AA       208    SFO  MIA       -4               0
## 5  2016-01-05      AA       208    SFO  MIA       -8               0
## 6  2016-01-06      AA       208    SFO  MIA        0               0
## 7  2016-01-07      AA       208    SFO  MIA        2               2
## 8  2016-01-08      AA       208    SFO  MIA       -1               0
## 9  2016-01-09      AA       208    SFO  MIA       -1               0
## 10 2016-01-10      AA       208    SFO  MIA       -6               0
##    ArrDelay ArrDelayMinutes Cancelled
## 1       -27               0         0
## 2       -27               0         0
## 3       -27               0         0
## 4       -14               0         0
## 5       -19               0         0
## 6         6               6         0
## 7        10              10         0
## 8       -21               0         0
## 9        -5               0         0
## 10      -15               0         0

summary creates a 5-number summary of the departure delay, min/max/25,50,75 percentiles of data:

summary(flightSF$DepDelay)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -25.0    -5.0    -1.0    13.8    12.0   861.0     413

We will subset to those rows that do not have NA for DepDelay . Note use of subset , a useful function for getting a subset of rows, where the first argument is a data.frame and a second argument is a logical argument.

naDepDf<-subset(flightSF,is.na(DepDelay))

head shows the top 5 rows of a data.frame. Here, I have also selected only 5 columns. Note how I can select them by name of the column.

head(naDepDf[,c("FlightDate","Carrier","FlightNum","DepDelay","Cancelled")])

##     FlightDate Carrier FlightNum DepDelay Cancelled
## 44  2016-01-14      AA       209       NA         1
## 75  2016-01-14      AA       218       NA         1
## 112 2016-01-24      AA        12       NA         1
## 138 2016-01-22      AA        16       NA         1
## 139 2016-01-23      AA        16       NA         1
## 140 2016-01-24      AA        16       NA         1

summary applied to a data frame does summary of each column.

summary(naDepDf[,c("FlightDate","Carrier","FlightNum","DepDelay","Cancelled")])

##       FlightDate     Carrier      FlightNum       DepDelay     Cancelled
##  2016-01-22: 92   OO     :176   Min.   :   1   Min.   : NA   Min.   :1  
##  2016-01-06: 49   UA     : 76   1st Qu.: 616   1st Qu.: NA   1st Qu.:1  
##  2016-01-23: 41   WN     : 55   Median :2080   Median : NA   Median :1  
##  2016-01-24: 40   AA     : 35   Mean   :3059   Mean   :NaN   Mean   :1  
##  2016-01-19: 31   VX     : 33   3rd Qu.:5555   3rd Qu.: NA   3rd Qu.:1  
##  2016-01-03: 27   DL     : 17   Max.   :6503   Max.   : NA   Max.   :1  
##  (Other)   :133   (Other): 21                  NA's   :413

hist function creates histogram. Arguments 'main' and 'xlab' are shared across plotting commands. 'main' sets the title of plot, ' 'xlab' sets the label for the xaxis.

hist(flightSF$DepDelay,main="Departure Delay", xlab="Time (in minutes)")

plot of chunk unnamed-chunk-10

Here we create some 'fake' values for the cancelled flights so they show up on plot. Notice how we can use is.na to identify the NAs, and then assign only those the value 1200

flightSF$DepDelayWithCancel<-flightSF$DepDelay
flightSF$DepDelayWithCancel[is.na(flightSF$DepDelay)]<-1200
hist(flightSF$DepDelayWithCancel,xlab="Time (in minutes)", main="Departure delay, with cancellations=1200")

plot of chunk unnamed-chunk-11

Here we show the the effect of changing the argument breaks to histogram.

par(mfrow=c(2,2))
hist(flightSF$DepDelay,main="Departure Delay, default breaks", xlab="Time (in minutes)")
hist(flightSF$DepDelay,main="Departure Delay, breaks=100", xlab="Time (in minutes)",breaks=100)
hist(flightSF$DepDelay,main="Departure Delay, breaks=1000", xlab="Time (in minutes)",breaks=1000)

plot of chunk unnamed-chunk-12

We can truncate the x axis using the command xlim

hist(flightSF$DepDelay,main="Departure Delay, Zoomed-in", xlab="Time (in minutes)", xlim=c(-30,100), breaks=1000)

plot of chunk unnamed-chunk-13

Note that the command par(mfrow=c(2,2)) sets up a 2x2 grid of plots. Similarly, par(mfrow=c(1,2)) sets up a 1x2 grid. To get back to a single plot, we would type the command par(mfrow=c(1,1))

What about a smaller data frame with only flights from SFO to PHL, again truncating the axis

flightSFO_PHL <- subset(flightSF, Dest=="PHL")
dim(flightSFO_PHL)

## [1] 158  65

par(mfrow=c(2,2))
hist(flightSFO_PHL$DepDelay,main="SFO->PHL, default breaks", xlab="Time (in minutes)", xlim=c(-50,100))
hist(flightSFO_PHL$DepDelay,main="SFO->PHL, breaks=30", xlab="Time (in minutes)", xlim=c(-50,100), breaks=30)
hist(flightSFO_PHL$DepDelay,main="SFO->PHL, breaks=100", xlab="Time (in minutes)", xlim=c(-50,100), breaks=100)
hist(flightSFO_PHL$DepDelay,main="SFO->PHL, breaks=1000", xlab="Time (in minutes)", xlim=c(-50,100),breaks=1000)

plot of chunk unnamed-chunk-14

The command boxplot gives basic boxplots. Notice I can continue to use arguments like main and ylab to make the plot more informative.

boxplot(flightSF$DepDelay,main="Departure Delay", ylab="Time (in minutes)")

plot of chunk unnamed-chunk-15

Now I use a special formulation for telling R to divide into separate boxplots for different groups; specifically divide the numeric value 'DepDelay' into different groups based on airline (Carrier) and make a boxplot separately for each group. The y ~ fac notation means that y is the numeric values and fac is the factor defining groups you want to divide into.

boxplot(flightSF$DepDelay~flightSF$Carrier,
    main="Departure Delay, by airline carrier", 
    ylab="Time (in minutes)")

plot of chunk unnamed-chunk-16

Here I use the argument outline=FALSE to indicate not to plot the individual “outliers” beyond the whiskers.

boxplot(flightSF$DepDelay~flightSF$Carrier,
    main="Departure Delay, by airline carrier", 
    ylab="Time (in minutes)",outline=FALSE)

plot of chunk unnamed-chunk-17

set.seed(1)
par(mfrow=c(2,2))
hist(rgamma(1000,shape=2),main="Right Skew")
hist(rnorm(1000),main="Symmetric")
breaks=c(-1E6,seq(-20,20,1),1E6)
hist(rnorm(1000),main="Light tails",xlim=c(-20,20),breaks=breaks,freq=TRUE)

## Warning in plot.histogram(r, freq = freq1, col = col, border = border,
## angle = angle, : the AREAS in the plot are wrong -- rather use 'freq =
## FALSE'

hist(rcauchy(1000),main="Heavy tails",xlim=c(-20,20),breaks=breaks,freq=TRUE)

## Warning in plot.histogram(r, freq = freq1, col = col, border = border,
## angle = angle, : the AREAS in the plot are wrong -- rather use 'freq =
## FALSE'

plot of chunk unnamed-chunk-17

The curve function draws a function. The first argument is the function, and the from and to arguments give the range of values. Here I draw the log function then the next command is the sqrt funciton. For the sqrt I use the argument add to have the curve for the square-root plot on an existing plot. I also use the option col to choose what color each curve should be.

ylim<-c(-3,3)
curve(log,from=0,to=10,ylim=ylim,ylab="Transformed",xlab="Original")
curve(sqrt,from=0,to=10,add=TRUE,col="red")
legend("bottomright",legend=c("log","sqrt"),fill=c("black","red"))

plot of chunk unnamed-chunk-18

The function legend adds a legend to your existing figure. The first argument is the location of the legend ( bottomright ). The argument legend is the vector of character strings that make up the legend. The remaining arguments tell what should be in front of the text and should be the same length as the legend argument (if they are a different length, then the values will be “recycled”, i.e. start back at the beginning until get to the right length). Here we choose to have color-filled squares, using the argument fill to give the colors. Here I make some simulated data from the “gamma” distribution in order to show the effect of the log transformation on skewed data. A note about the simulated data. rXXX always stands for getting simulated data. The first argument is the number of simulated values (1000); the remaining arguments are parameters of the distribution. In the case of the gamma scale and shape are parameters of the distribution.

y<-rgamma(10000,scale=1,shape=0.1)

Note I first the regular y and then the log-transformed y, and I use the par(mfrow=c(1,2)) to draw them side-by-side as a 1x2 grid

par(mfrow=c(1,2))
hist(y,main="Original data",xlab="original scale",breaks=50)
hist(log(y),main="Log of data",xlab="log-scale",breaks=50)

plot of chunk unnamed-chunk-20

Now I will do a transformation of our flight data. Since I can't take the log (because I have negative values as well as 0's) I will add a little bit to each value before I take the log. This may not be the most brillant thing to do, frankly. I choose how much to add by finding the minimum value, taking the absolute value, and adding 1. Notice how I use na.rm=TRUE when I take the minimum. Otherwise the minimum where there are NA's is NA. na.rm=TRUE means to remove the NAs before taking the minimum.

addValue<-abs(min(flightSF$DepDelay,na.rm=TRUE))+1
par(mfrow=c(3,1))
boxplot(flightSF$DepDelay+addValue~flightSF$Carrier,main="Departure Delay, original", ylab="Time")
boxplot(log(flightSF$DepDelay+addValue)~flightSF$Carrier,main="Departure Delay, log transformed", ylab=paste("log(Time+",addValue,")"))
boxplot(sqrt(flightSF$DepDelay+addValue)~flightSF$Carrier,main="Departure Delay, sqrt-transformed", ylab=paste("sqrt(Time+",addValue,")"))

plot of chunk unnamed-chunk-21

Notice in creating the ylab value, I use the paste function. The paste function concatinates two strings. It's particularly helpful when you want to put a variable into your label/title/etc. Here, I paste the string sqrt(Time+ with the value of my variable addValue . ) I am going to make a subset of my data that removes the cancelled flights to make it easier going forward. This means that I don't need to worry about removing the NA values all of the time.

flightSF_nc<-subset(flightSF,flightSF$Cancelled!=1)

The following function is something I wrote does something I don't recommend that you do in a normal course of events. But it might be useful to look at for writing functions. Namely, it plots the probabilities of being within a bin. Recall, that's not what the y-axis of histograms normally is.

histBinProb<-function(x,ylab="Bin Probabilities",breaks = "Sturges",...){
    #plot=FALSE: doesn't plot, just returns the calculated values
    h<-hist(x,plot=FALSE,breaks=breaks) 
    h$counts=h$counts/sum(h$counts)
    plot(h,ylab=ylab,...)
}

There are some useful features to notice from this function, even though you shouldn't need to use it. One is to note that even though it doesn't look like it, hist actually has an output that you can save. Usually you just want the plot, but sometimes, you'd like to actually grab those calculated values (like the number of counts in a bin). The out put is a list, and is described in the help of hist . You also can do plot=FALSE with histogram if you want just the calculations and not the plot. These are all rather esoteric things to know, but FYI. Now I will plot the probabilities for each bin, for different numbers of breaks. Notice to make them all have the same y-axis limits, I can set the argument ylim . Again, this is a standard way for almost any plot to limit the y values that are plotted (or x if you use xlim ).

par(mfrow=c(1,2))
ylim<-c(0,0.6)
histBinProb(flightSF$DepDelay,main="Departure Delay, Probability of Bin", xlab="Time (in minutes)",ylim=ylim)
histBinProb(flightSF$DepDelay,main="Departure Delay, Probability of Bin, more breaks", xlab="Time (in minutes)",breaks=10000,ylim=ylim)

plot of chunk unnamed-chunk-24

flightSF_del <- subset(flightSF_nc, DepDelay > 0)
par(mfrow=c(1,2))
hist(flightSF_del$DepDelay, main="Delayed only", xlab="Time (in minutes)")
hist(log(flightSF_del$DepDelay), main="log-transformed", xlab="Time (in minutes)")

plot of chunk unnamed-chunk-24

Now I want to create a simple random sample from my data (SRS). I use the function sample to sample from my flights in January. x gives the vector of values I want to sample from (my flight delays) and size is the size of the sample I would like. A very important argument to sample is the argument replace . If it equals TRUE then we sample with replacement (a SRS).

flightSRS<-sample(x=flightSF_nc$DepDelay, size= 100, replace=TRUE)

To get a permutation of my data, I would do instead replace=FALSE and I could not put in a size argument. I'm going to show how a SRS estimates the actual probabilities in our January flights. To do this, I will plot the probabilities again using the little function I wrote above (the one I said you shouldn't normally do!). I will set the ylim to be the same for all the plots. I will also give a predefined vector of breaks by using the seq function. The seq function finds a evenly space set of values within an interval. I do this rather than setting the breaks=10 function to make sure I use exactly the same breakpoints for my SRS as for my full population. (R does various magic to interpret what breaks=10 means which we want to avoid.)

ylim<-c(0,1)
breaks<-seq(min(flightSF_nc$DepDelay),max(flightSF_nc$DepDelay),length=10)
histBinProb(flightSRS,main="Departure Delay", xlab="Time (in minutes)",border=NA,breaks=breaks,ylim=ylim,col="red",add=FALSE)
histBinProb(flightSF_nc$DepDelay,main="Departure Delay", xlab="Time (in minutes)",col=NULL,border="black",breaks=breaks,ylim=ylim,lwd=2,add=TRUE)
legend("topright",c("SRS","Truth"),fill=c("red","white"))

plot of chunk unnamed-chunk-26

Notice, I have changed a lot of options in my histogram of the SRS. In particular, I have changed the color to red, and said not to plot the border of the rectangle bins. In my full population ones, I set col=NULL so that I don't fill in the rectangles, and made the width of the lines ( lwd ) twice as big as normal. I also did add=TRUE which draws a histogram on top of the existing plot. Again, this is rather esoteric, but occassionally is useful, like now. I will repeat the same thing, but with more breaks to compare how the SRS estimates probabilities for smaller intervals versus wider intervals.

ylim<-c(0,0.6)
breaks<-seq(min(flightSF_nc$DepDelay),max(flightSF_nc$DepDelay),length=100)
histBinProb(flightSRS,main="Departure Delay", xlab="Time (in minutes)",border=NA,breaks=breaks,ylim=ylim,col="red",add=FALSE)
histBinProb(flightSF_nc$DepDelay,main="Departure Delay", xlab="Time (in minutes)",col=NULL,border="black",breaks=breaks,ylim=ylim,lwd=2,add=TRUE,lwd=3)
legend("topright",c("SRS","Truth"),fill=c("red","white"))

plot of chunk unnamed-chunk-27

Here I read in data that was a SRS from the entire year, and one that was a stratified sample of a SRS per month. I then compare the histograms to compare different types of samples.

flightSFOSRS<-read.table(file.path(dataDir,"SFO_SRS.txt"),
                         sep="\t",header=TRUE, stringsAsFactors = FALSE)
flightSFOStratified<-read.table(file.path(dataDir,"SFO_Stratified.txt"),
                                sep="\t",header=TRUE, stringsAsFactors = FALSE)
par(mfrow=c(2,2))
xlim<-c(-20,400)
hist(flightSF$DepDelay,breaks=100,xlim=xlim,freq=FALSE)
hist(flightSFOSRS$DepDelay,breaks=100,xlim=xlim,freq=FALSE)
hist(flightSFOStratified$DepDelay,breaks=100,xlim=xlim,freq=FALSE)
curve(x*exp(-x/2)/4,xlim=c(0,30),ylab="p(x)",xlab="x")

plot of chunk unnamed-chunk-28

This is not a function to worry about understanding. This is for teaching purposes and shades in the area under the curve. Notice, however, how I can define a function ( p ) to give to curve to plot.

plotUnderCurve<-function(x1, x2,p,...){
    x=seq(x1,x2,len=100) 
    y=p(x)
    polygon(c(x,tail(x,1),x[1]), c(y, 0,0), ...)
}
p<-function(x){x*exp(-x/2)/4} #define as function
curve(p,xlim=c(0,30),ylab="p(x)",xlab="x")
plotUnderCurve(5,10,p,col="red")

plot of chunk unnamed-chunk-29

Here I am going to create a SRS and take the mean of the SRS of size 1,000 over and over again (10,000 times) in order to demonstrate the Central Limit Theorem. I use the function replicate to do the exact same (usually random) something over and over again. I define the random thing I want to do to the argument expr . In this case it is to sample with replacement and take the mean. n=10000 means to do it 10,000 times. The 10,000 means are returned and saved as a vector.

sampleSize<-1000
sampleMean<-replicate(n=10000,expr=mean(sample(flightSF_nc$DepDelay,size=sampleSize,replace=TRUE)))
hist(sampleMean,xlab="Mean Departure Delay",main=paste("Mean of SRS of size",sampleSize))

plot of chunk unnamed-chunk-30

We now will plot the histogram of the 10,000 simulations of means of our SRS from the flight data and overlay the density. Note we need to draw the histogram on the “density” scale by setting freq=FALSE (i.e. NOT frequency/counts). We then define a curve to overlay the histogram using the dnorm function. dXXX functions calculate the density of a value x and take the form dXXX(x) and return $p(x)$. They can take a vector of values x or a single value. They also have the ability to give parameters. For the normal, this is mean and sd (so standard deviation, not variance). We do this by creating a function p and then do curve(p) . We do add=TRUE and col="red" and lwd=3 to get it to overlay a red curve that is 3 times the usual thickness.

hist(sampleMean,xlab="Mean Departure Delay",main=paste("Mean of SRS of size",sampleSize),freq=FALSE)
m<-mean(flightSF_nc$DepDelay)
s<-sqrt(var(flightSF_nc$DepDelay)/sampleSize)
p<-function(x){dnorm(x, mean=m,sd=s)}
curve(p,add=TRUE,col="red",lwd=3)

plot of chunk unnamed-chunk-31

Note in order to have the density of the normal centered and scaled at the right values, we need to calculate the mean and standard deviation predicted by the CLT. This is the mean of the original population, given by mean(flightSF_nc$DepDelay) . And the variance is the variance of the original population, divided by the size of our SRS (n=1000). We demonstrate the pdf of a uniform distribution ( dunif ); this corresponds to a pdf with uniform values between 0,1 (for illustration of properties of the density.)

curve(dunif,xlim=c(-1,2),ylab="p(x)",xlab="x")

plot of chunk unnamed-chunk-32

Now we demonstrate the pdf of a uniform distribution between (¼, ½) and how it effects the density values. Instead of using the function curve we show another way that is commonly used. We create a sequence of ordered values between -1 and 2, of length 100, using the seq command.

xvalues<-seq(-1,2,length=100)

We now evaluate dunif at each of those values; here we set the uniform parameters min and max to indicate that the uniform is between ¼ and ½.

yvalues<-dunif(xvalues,min=1/4,max=1/2)

We now do a scatter plot of the xvalues and the yvalues . But instead of drawing points, we will tell R to connect those points with a line by setting type="l" ( l for lines).

plot(xvalues,yvalues,type="l",xlim=c(-1,2),ylab="p(x)",xlab="x")

plot of chunk unnamed-chunk-35

We now draw the density histogram in minutes versus hours to demonstrate the effect of scaling the departure delay on the value of the density.

ylim<-c(0,0.8)
par(mfrow=c(1,2))
hist(flightSF$DepDelay,main="Departure Delay, minutes", xlab="Time (in minutes)",freq=FALSE,ylim=ylim)
hist(flightSF$DepDelay/60,main="Departure Delay, hours", xlab="Time (in in hours)",freq=FALSE,ylim=ylim)

plot of chunk unnamed-chunk-36

We now draw the density histogram with different numbers of breaks to illustrate the effect on the value of the density.

par(mfrow=c(1,2))
ylim<-c(0,0.1)
hist(flightSF$DepDelay,main="Departure Delay, default breaks", xlab="Time (in minutes)",freq=FALSE,ylim=ylim)
hist(flightSF$DepDelay,main="Departure Delay, 1000 breaks", xlab="Time (in minutes)",breaks=1000,freq=FALSE,ylim=ylim)

plot of chunk unnamed-chunk-37

We repeat the density histogram with different number of breaks, only now we draw the histogram for the simulation of 10,000 means of a SRS, where we know that the distribution approaches a normal, by the CLT. Then we can see the effect of too small or too big of bins.

m<-mean(flightSF_nc$DepDelay)
s<-sqrt(var(flightSF_nc$DepDelay)/sampleSize)
p<-function(x){dnorm(x, mean=m,sd=s)}
par(mfrow=c(2,2))
hist(sampleMean,xlab="Mean Departure Delay",main=expression(paste(bar(X),", 10 breaks")),freq=FALSE)
curve(p,add=TRUE,col="red",lwd=3)
hist(sampleMean,xlab="Mean Departure Delay",main=expression(paste(bar(X),", 30 breaks")),breaks=30,freq=FALSE)
curve(p,add=TRUE,col="red",lwd=3)
hist(sampleMean,xlab="Mean Departure Delay",main=expression(paste(bar(X),", 100 breaks")),freq=FALSE,breaks=100)
curve(p,add=TRUE,col="red",lwd=3)
hist(sampleMean,xlab="Mean Departure Delay",main=expression(paste(bar(X),", 1000 breaks")),freq=FALSE,breaks=1000)
curve(p,add=TRUE,col="red",lwd=3)

plot of chunk unnamed-chunk-38

Here are examples of different pdfs and different parameters. Notice the use of the dXXX formulation.

par(mfrow=c(2,2))
f<-function(x){dgamma(x,shape=5,scale=1)}
curve(f,from=0,to=20,ylab="p(x)",main="Gamma(5,1) distribution")
f<-function(x){dgamma(x,shape=1,scale=5)}
curve(f,from=0,to=20,ylab="p(x)",main="Gamma(1,5) distribution")
f<-function(x){dbeta(x,.5,.5)}
curve(f,from=0,to=1,ylab="p(x)",main="Beta(.5,.5) distribution")
f<-function(x){dbeta(x,2,5)}
curve(f,from=0,to=1,ylab="p(x)",main="Beta(2,5) distribution")

plot of chunk unnamed-chunk-39

These are examples of functions that are not pdfs.

par(mfrow=c(1,2))
f<-function(x){x^3}
curve(f,from=-10,to=10,ylab="p(x)",main=expression(x^3))
f<-function(x){x^2}
curve(f,from=-10,to=10,ylab="p(x)",main=expression(x^2))

plot of chunk unnamed-chunk-40

A histogram of a single, simple random sample of flights.

binWidth<-20
breaks<-seq(min(flightSRS),max(flightSRS)+binWidth,by=binWidth)
xlim<-range(breaks)
histBin<-hist(flightSRS,xlab="Departure Delay (in Minutes)",breaks=breaks,freq=TRUE,xlim=xlim)

plot of chunk unnamed-chunk-41

ylim<-range(histBin$density)

This next code draws the histogram as a step function – not important code to understand.

p<-stepfun(histBin$breaks,c(0,histBin$density,0),right=TRUE)
plot(p,do.points=FALSE,xlab="Departure Delay (in Minutes)",main="pdf estimated by histogram",verticals=FALSE,xlim=xlim,ylim=ylim)
x<-15
mtext(side=1,text="x",at=x)
points(x=x,y=p(x),pch=19)
text(x=x,y=p(x),labels=expression(paste(hat(p),"(x)")),pos=3)

plot of chunk unnamed-chunk-42

The next code is a plot showing the result of creating a density curve using a rectangle kernel. Note we call the function density on our SRS of flight data to get our kernel density estimate of the pdf. Here I set the bandwidth ( bw ), but normally you would use the default. I set the kernel to be rectangular to give the rectangular moving windows. The code starting with lines again overplots the step function of the histogram from above (again, not important code to understand). Notice I could have done this in two steps: d<-density(flightSRS,...) then plot(d) . But instead I do it in one line for convenience. plot knows how to plot the output of density (this is a light version of object-oriented programming in R)

plot(density(flightSRS,bw=binWidth/sqrt(12),kernel="rectangular"))
lines(p,do.points=FALSE,xlab="Departure Delay (in Minutes)",verticals=FALSE,ylim=ylim,xlim=xlim)

plot of chunk unnamed-chunk-43

The following code is a visualization for teaching purposes, but it's not important to understand the code.

plot(stepfun(c(-1,1),c(0,1,0)),do.points=FALSE,xaxt="n",yaxt="n",ylab="",xlab="",main="",bty="l",xlim=c(-2,2))
mtext(c("x-w/2","x","x-w/2"),side=1,at=c(-1,0,1))

plot of chunk unnamed-chunk-44

#mtext(expression(frac(1,w)),side=2,at=1,las=2,line=.5)

Here is a visualization of the gaussian kernel, not important to understand the code.

curve(dnorm,xaxt="n",yaxt="n",ylab="",xlab="",main="",bty="l",xlim=c(-2,2))
mtext(c("x"),side=1,at=c(0))
segments(x0=0,x1=1,y0=dnorm(1),y1=dnorm(1))
text(label=expression(w/2),x=0.5,y=dnorm(1),pos=1)

plot of chunk unnamed-chunk-45

Here I have plotted the density curve for the gaussian and the rectangular. First I plot the density using a Guassian/normal kernel. Again, I do in one line both the plotting and calculation of the density – keep careful track of the () to notice that main is an argument to plot and not to density . To overlay the rectangular density, I call lines instead of plot – lines automatically adds to the existing plot rather than making a new plot.

plot(density(flightSRS,kernel="gaussian"),
    main="Density estimate, both kernels")
lines(density(flightSRS,kernel="rectangular"),col="red")
lines(p,do.points=FALSE,xlab="Departure Delay (in Minutes)",verticals=FALSE,ylim=ylim,xlim=xlim,col="blue")
legend("topright",c("Normal","Rectangle","Histogram"),fill=c("black","red","blue"))

plot of chunk unnamed-chunk-46

Now I use a for-loop to repeat the same plot, only now choosing a different bandwidth for each plot. I set up a 2x2 grid and choose a uniform ylim for all of the plots. Then I run a for-loop over the values 1, 0.5, 2, and 10, and for each iteration of the for-loop, I multiply the bandwidth parameter by that value by setting the adjust parameter in the density function.

par(mfrow=c(2,2))
ylim<-c(0,0.08)
for(adjust in c(1,0.5,2,10)){
  plot(density(flightSRS,kernel="gaussian",adjust=adjust),main="Density estimate, different bandwidths",
  sub=paste("Bandwith multiplied by",adjust),ylim=ylim,xlim=xlim)
lines(density(flightSRS,kernel="rectangular",adjust=adjust),col="red")
lines(p,do.points=FALSE,xlab="Departure Delay (in Minutes)",verticals=FALSE,col="blue")
legend("topright",c("Normal","Rectangle","Histogram"),fill=c("black","red","blue"))
}

plot of chunk unnamed-chunk-47

Notice the new use of the sub argument to the first plotting call. This sets a subtitle underneath the x-axis. Now I plot the density for multiple groups on the same plot. This code gets more complicated. Don't worry if you don't follow this too closely. I will make this into a function and make it available for the class.

perGroupDensity<-tapply(X=flightSF_nc$DepDelay,INDEX=flightSF_nc$Carrier,FUN=density)
ylim<-range(sapply(perGroupDensity,function(x){range(x$y)}))
cols<-rainbow(length(perGroupDensity))
par(mfrow=c(1,1))
plot(density(flightSF_nc$DepDelay),main="All non-cancelled flight, by carrier",sub=paste("Bandwith multiplied by",adjust),lwd=2,lty=2,xlim=c(-20,50),ylim=ylim)
nullOut<-mapply(perGroupDensity,cols,FUN=function(x,col){lines(x,col=col)})

plot of chunk unnamed-chunk-48

Now I am going to make a violin plot. First I need to load the package that has the function, since this is not a standard function in R.

library(vioplot)

## Loading required package: sm

## Package 'sm', version 2.2-5.4: type help(sm) for summary information

If that command gave you an error, you need to install the package vioplot uncommenting the next code:

# install.packages("vioplot")
# library(vioplot)

Now I call vioplot on my flight departure data.

vioplot(flightSF_nc$DepDelay)

plot of chunk unnamed-chunk-51

Now I want to make violin plots for each of my airlines like I did with boxplots. I am going to read in the function I provided for you. source reads in R code in a file, and I have placed online code defining a function vioplot2 that will do this easily.

source("myvioplot.R")

Once I have done this, I can plot the violin plots of the departure delays, divided by group.

vioplot2(flightSF_nc$DepDelay,flightSF_nc$Carrier,col=palette())

plot of chunk unnamed-chunk-53

Notice I set the color argument, col to be palette() . palette() just returns a list of default colors – a quick way to get a list of colors in R. (All the colors available by name can be found by colors() )