This is the directory I have my data saved in.
dataDir<-"../../finalDataSets"
To change it to your working directory uncomment the following code (“.” just means current directory):
# dataDir<-"."
Now we read in the data. We give it the path of the data; sep=“\t” indicates it is tab-deliminated; header=TRUE means there is a row with header/column names.
flightSF<-read.table(file.path(dataDir,"SFO.txt"),sep="\t",header=TRUE)
This results in a data.frame. We can look at some basic information regarding this data.frame:
dim(flightSF)
## [1] 13207 64
names(flightSF)
## [1] "Year" "Quarter" "Month"
## [4] "DayofMonth" "DayOfWeek" "FlightDate"
## [7] "UniqueCarrier" "AirlineID" "Carrier"
## [10] "TailNum" "FlightNum" "OriginAirportID"
## [13] "OriginAirportSeqID" "OriginCityMarketID" "Origin"
## [16] "OriginCityName" "OriginState" "OriginStateFips"
## [19] "OriginStateName" "OriginWac" "DestAirportID"
## [22] "DestAirportSeqID" "DestCityMarketID" "Dest"
## [25] "DestCityName" "DestState" "DestStateFips"
## [28] "DestStateName" "DestWac" "CRSDepTime"
## [31] "DepTime" "DepDelay" "DepDelayMinutes"
## [34] "DepDel15" "DepartureDelayGroups" "DepTimeBlk"
## [37] "TaxiOut" "WheelsOff" "WheelsOn"
## [40] "TaxiIn" "CRSArrTime" "ArrTime"
## [43] "ArrDelay" "ArrDelayMinutes" "ArrDel15"
## [46] "ArrivalDelayGroups" "ArrTimeBlk" "Cancelled"
## [49] "CancellationCode" "Diverted" "CRSElapsedTime"
## [52] "ActualElapsedTime" "AirTime" "Flights"
## [55] "Distance" "DistanceGroup" "CarrierDelay"
## [58] "WeatherDelay" "NASDelay" "SecurityDelay"
## [61] "LateAircraftDelay" "FirstDepTime" "TotalAddGTime"
## [64] "LongestAddGTime"
We can just print a few rows and columns
flightSF[1:10,c("FlightDate","Carrier","FlightNum","Origin","Dest","DepDelay","DepDelayMinutes","ArrDelay","ArrDelayMinutes","Cancelled")]
## FlightDate Carrier FlightNum Origin Dest DepDelay DepDelayMinutes
## 1 2016-01-01 AA 208 SFO MIA -2 0
## 2 2016-01-02 AA 208 SFO MIA -7 0
## 3 2016-01-03 AA 208 SFO MIA -4 0
## 4 2016-01-04 AA 208 SFO MIA -4 0
## 5 2016-01-05 AA 208 SFO MIA -8 0
## 6 2016-01-06 AA 208 SFO MIA 0 0
## 7 2016-01-07 AA 208 SFO MIA 2 2
## 8 2016-01-08 AA 208 SFO MIA -1 0
## 9 2016-01-09 AA 208 SFO MIA -1 0
## 10 2016-01-10 AA 208 SFO MIA -6 0
## ArrDelay ArrDelayMinutes Cancelled
## 1 -27 0 0
## 2 -27 0 0
## 3 -27 0 0
## 4 -14 0 0
## 5 -19 0 0
## 6 6 6 0
## 7 10 10 0
## 8 -21 0 0
## 9 -5 0 0
## 10 -15 0 0
summary
creates a 5-number summary of the departure delay, min/max/25,50,75 percentiles of data:
summary(flightSF$DepDelay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -25.0 -5.0 -1.0 13.8 12.0 861.0 413
We will subset to those rows that do not have NA for DepDelay
. Note use of subset
, a useful function for getting a subset of rows, where the first argument is a data.frame and a second argument is a logical argument.
naDepDf<-subset(flightSF,is.na(DepDelay))
head
shows the top 5 rows of a data.frame. Here, I have also selected only 5 columns. Note how I can select them by name of the column.
head(naDepDf[,c("FlightDate","Carrier","FlightNum","DepDelay","Cancelled")])
## FlightDate Carrier FlightNum DepDelay Cancelled
## 44 2016-01-14 AA 209 NA 1
## 75 2016-01-14 AA 218 NA 1
## 112 2016-01-24 AA 12 NA 1
## 138 2016-01-22 AA 16 NA 1
## 139 2016-01-23 AA 16 NA 1
## 140 2016-01-24 AA 16 NA 1
summary
applied to a data frame does summary of each column.
summary(naDepDf[,c("FlightDate","Carrier","FlightNum","DepDelay","Cancelled")])
## FlightDate Carrier FlightNum DepDelay Cancelled
## 2016-01-22: 92 OO :176 Min. : 1 Min. : NA Min. :1
## 2016-01-06: 49 UA : 76 1st Qu.: 616 1st Qu.: NA 1st Qu.:1
## 2016-01-23: 41 WN : 55 Median :2080 Median : NA Median :1
## 2016-01-24: 40 AA : 35 Mean :3059 Mean :NaN Mean :1
## 2016-01-19: 31 VX : 33 3rd Qu.:5555 3rd Qu.: NA 3rd Qu.:1
## 2016-01-03: 27 DL : 17 Max. :6503 Max. : NA Max. :1
## (Other) :133 (Other): 21 NA's :413
hist function creates histogram. Arguments 'main' and 'xlab' are shared across plotting commands. 'main' sets the title of plot, ' 'xlab' sets the label for the xaxis.
hist(flightSF$DepDelay,main="Departure Delay", xlab="Time (in minutes)")
Here we create some 'fake' values for the cancelled flights so they show up on plot.
Notice how we can use is.na
to identify the NAs, and then assign only those the value 1200
flightSF$DepDelayWithCancel<-flightSF$DepDelay
flightSF$DepDelayWithCancel[is.na(flightSF$DepDelay)]<-1200
hist(flightSF$DepDelayWithCancel,xlab="Time (in minutes)", main="Departure delay, with cancellations=1200")
Here we show the the effect of changing the argument breaks
to histogram.
par(mfrow=c(2,2))
hist(flightSF$DepDelay,main="Departure Delay, default breaks", xlab="Time (in minutes)")
hist(flightSF$DepDelay,main="Departure Delay, breaks=100", xlab="Time (in minutes)",breaks=100)
hist(flightSF$DepDelay,main="Departure Delay, breaks=1000", xlab="Time (in minutes)",breaks=1000)
We can truncate the x axis using the command xlim
hist(flightSF$DepDelay,main="Departure Delay, Zoomed-in", xlab="Time (in minutes)", xlim=c(-30,100), breaks=1000)
Note that the command par(mfrow=c(2,2))
sets up a 2x2 grid of plots. Similarly, par(mfrow=c(1,2))
sets up a 1x2 grid.
To get back to a single plot, we would type the command par(mfrow=c(1,1))
What about a smaller data frame with only flights from SFO to PHL, again truncating the axis
flightSFO_PHL <- subset(flightSF, Dest=="PHL")
dim(flightSFO_PHL)
## [1] 158 65
par(mfrow=c(2,2))
hist(flightSFO_PHL$DepDelay,main="SFO->PHL, default breaks", xlab="Time (in minutes)", xlim=c(-50,100))
hist(flightSFO_PHL$DepDelay,main="SFO->PHL, breaks=30", xlab="Time (in minutes)", xlim=c(-50,100), breaks=30)
hist(flightSFO_PHL$DepDelay,main="SFO->PHL, breaks=100", xlab="Time (in minutes)", xlim=c(-50,100), breaks=100)
hist(flightSFO_PHL$DepDelay,main="SFO->PHL, breaks=1000", xlab="Time (in minutes)", xlim=c(-50,100),breaks=1000)
The command boxplot
gives basic boxplots. Notice I can continue to use arguments like main
and ylab
to make the plot more informative.
boxplot(flightSF$DepDelay,main="Departure Delay", ylab="Time (in minutes)")
Now I use a special formulation for telling R to divide into separate boxplots for different groups; specifically divide the numeric value 'DepDelay' into different groups based on airline (Carrier
) and make a boxplot separately for each group. The y ~ fac
notation means that y
is the numeric values and fac
is the factor defining groups you want to divide into.
boxplot(flightSF$DepDelay~flightSF$Carrier,
main="Departure Delay, by airline carrier",
ylab="Time (in minutes)")
Here I use the argument outline=FALSE
to indicate not to plot the individual “outliers” beyond the whiskers.
boxplot(flightSF$DepDelay~flightSF$Carrier,
main="Departure Delay, by airline carrier",
ylab="Time (in minutes)",outline=FALSE)
set.seed(1)
par(mfrow=c(2,2))
hist(rgamma(1000,shape=2),main="Right Skew")
hist(rnorm(1000),main="Symmetric")
breaks=c(-1E6,seq(-20,20,1),1E6)
hist(rnorm(1000),main="Light tails",xlim=c(-20,20),breaks=breaks,freq=TRUE)
## Warning in plot.histogram(r, freq = freq1, col = col, border = border,
## angle = angle, : the AREAS in the plot are wrong -- rather use 'freq =
## FALSE'
hist(rcauchy(1000),main="Heavy tails",xlim=c(-20,20),breaks=breaks,freq=TRUE)
## Warning in plot.histogram(r, freq = freq1, col = col, border = border,
## angle = angle, : the AREAS in the plot are wrong -- rather use 'freq =
## FALSE'
The curve
function draws a function. The first argument is the function, and the from
and to
arguments give the range of values. Here I draw the log
function then the next command is the sqrt
funciton. For the sqrt
I use the argument add
to have the curve for the square-root plot on an existing plot. I also use the option col
to choose what color each curve should be.
ylim<-c(-3,3)
curve(log,from=0,to=10,ylim=ylim,ylab="Transformed",xlab="Original")
curve(sqrt,from=0,to=10,add=TRUE,col="red")
legend("bottomright",legend=c("log","sqrt"),fill=c("black","red"))
The function legend
adds a legend to your existing figure. The first argument is the location of the legend ( bottomright
). The argument legend
is the vector of character strings that make up the legend. The remaining arguments tell what should be in front of the text and should be the same length as the legend
argument (if they are a different length, then the values will be “recycled”, i.e. start back at the beginning until get to the right length). Here we choose to have color-filled squares, using the argument fill
to give the colors.
Here I make some simulated data from the “gamma” distribution in order to show the effect of the log transformation on skewed data.
A note about the simulated data. rXXX always stands for getting simulated data. The first argument is the number of simulated values (1000); the remaining arguments are parameters of the distribution. In the case of the gamma scale
and shape
are parameters of the distribution.
y<-rgamma(10000,scale=1,shape=0.1)
Note I first the regular y and then the log-transformed y, and I use the par(mfrow=c(1,2))
to draw them side-by-side as a 1x2 grid
par(mfrow=c(1,2))
hist(y,main="Original data",xlab="original scale",breaks=50)
hist(log(y),main="Log of data",xlab="log-scale",breaks=50)
Now I will do a transformation of our flight data. Since I can't take the log (because I have negative values as well as 0's) I will add a little bit to each value before I take the log. This may not be the most brillant thing to do, frankly.
I choose how much to add by finding the minimum value, taking the absolute value, and adding 1. Notice how I use na.rm=TRUE
when I take the minimum. Otherwise the minimum where there are NA's is NA. na.rm=TRUE
means to remove the NAs before taking the minimum.
addValue<-abs(min(flightSF$DepDelay,na.rm=TRUE))+1
par(mfrow=c(3,1))
boxplot(flightSF$DepDelay+addValue~flightSF$Carrier,main="Departure Delay, original", ylab="Time")
boxplot(log(flightSF$DepDelay+addValue)~flightSF$Carrier,main="Departure Delay, log transformed", ylab=paste("log(Time+",addValue,")"))
boxplot(sqrt(flightSF$DepDelay+addValue)~flightSF$Carrier,main="Departure Delay, sqrt-transformed", ylab=paste("sqrt(Time+",addValue,")"))
Notice in creating the ylab
value, I use the paste
function. The paste function concatinates two strings. It's particularly helpful when you want to put a variable into your label/title/etc. Here, I paste the string sqrt(Time+
with the value of my variable addValue
. )
I am going to make a subset of my data that removes the cancelled flights to make it easier going forward. This means that I don't need to worry about removing the NA values all of the time.
flightSF_nc<-subset(flightSF,flightSF$Cancelled!=1)
The following function is something I wrote does something I don't recommend that you do in a normal course of events. But it might be useful to look at for writing functions. Namely, it plots the probabilities of being within a bin. Recall, that's not what the y-axis of histograms normally is.
histBinProb<-function(x,ylab="Bin Probabilities",breaks = "Sturges",...){
#plot=FALSE: doesn't plot, just returns the calculated values
h<-hist(x,plot=FALSE,breaks=breaks)
h$counts=h$counts/sum(h$counts)
plot(h,ylab=ylab,...)
}
There are some useful features to notice from this function, even though you shouldn't need to use it. One is to note that even though it doesn't look like it, hist
actually has an output that you can save. Usually you just want the plot, but sometimes, you'd like to actually grab those calculated values (like the number of counts in a bin). The out put is a list, and is described in the help of hist
. You also can do plot=FALSE
with histogram if you want just the calculations and not the plot. These are all rather esoteric things to know, but FYI.
Now I will plot the probabilities for each bin, for different numbers of breaks. Notice to make them all have the same y-axis limits, I can set the argument ylim
. Again, this is a standard way for almost any plot to limit the y values that are plotted (or x if you use xlim
).
par(mfrow=c(1,2))
ylim<-c(0,0.6)
histBinProb(flightSF$DepDelay,main="Departure Delay, Probability of Bin", xlab="Time (in minutes)",ylim=ylim)
histBinProb(flightSF$DepDelay,main="Departure Delay, Probability of Bin, more breaks", xlab="Time (in minutes)",breaks=10000,ylim=ylim)
flightSF_del <- subset(flightSF_nc, DepDelay > 0)
par(mfrow=c(1,2))
hist(flightSF_del$DepDelay, main="Delayed only", xlab="Time (in minutes)")
hist(log(flightSF_del$DepDelay), main="log-transformed", xlab="Time (in minutes)")
Now I want to create a simple random sample from my data (SRS). I use the function sample
to sample from my flights in January. x
gives the vector of values I want to sample from (my flight delays) and size
is the size of the sample I would like.
A very important argument to sample
is the argument replace
. If it equals TRUE
then we sample with replacement (a SRS).
flightSRS<-sample(x=flightSF_nc$DepDelay, size= 100, replace=TRUE)
To get a permutation of my data, I would do instead replace=FALSE
and I could not put in a size
argument.
I'm going to show how a SRS estimates the actual probabilities in our January flights. To do this, I will plot the probabilities again using the little function I wrote above (the one I said you shouldn't normally do!).
I will set the ylim
to be the same for all the plots. I will also give a predefined vector of breaks by using the seq
function. The seq
function finds a evenly space set of values within an interval. I do this rather than setting the breaks=10
function to make sure I use exactly the same breakpoints for my SRS as for my full population. (R does various magic to interpret what breaks=10
means which we want to avoid.)
ylim<-c(0,1)
breaks<-seq(min(flightSF_nc$DepDelay),max(flightSF_nc$DepDelay),length=10)
histBinProb(flightSRS,main="Departure Delay", xlab="Time (in minutes)",border=NA,breaks=breaks,ylim=ylim,col="red",add=FALSE)
histBinProb(flightSF_nc$DepDelay,main="Departure Delay", xlab="Time (in minutes)",col=NULL,border="black",breaks=breaks,ylim=ylim,lwd=2,add=TRUE)
legend("topright",c("SRS","Truth"),fill=c("red","white"))
Notice, I have changed a lot of options in my histogram of the SRS. In particular, I have changed the color to red, and said not to plot the border of the rectangle bins. In my full population ones, I set col=NULL
so that I don't fill in the rectangles, and made the width of the lines ( lwd
) twice as big as normal. I also did add=TRUE
which draws a histogram on top of the existing plot. Again, this is rather esoteric, but occassionally is useful, like now.
I will repeat the same thing, but with more breaks to compare how the SRS estimates probabilities for smaller intervals versus wider intervals.
ylim<-c(0,0.6)
breaks<-seq(min(flightSF_nc$DepDelay),max(flightSF_nc$DepDelay),length=100)
histBinProb(flightSRS,main="Departure Delay", xlab="Time (in minutes)",border=NA,breaks=breaks,ylim=ylim,col="red",add=FALSE)
histBinProb(flightSF_nc$DepDelay,main="Departure Delay", xlab="Time (in minutes)",col=NULL,border="black",breaks=breaks,ylim=ylim,lwd=2,add=TRUE,lwd=3)
legend("topright",c("SRS","Truth"),fill=c("red","white"))
Here I read in data that was a SRS from the entire year, and one that was a stratified sample of a SRS per month. I then compare the histograms to compare different types of samples.
flightSFOSRS<-read.table(file.path(dataDir,"SFO_SRS.txt"),
sep="\t",header=TRUE, stringsAsFactors = FALSE)
flightSFOStratified<-read.table(file.path(dataDir,"SFO_Stratified.txt"),
sep="\t",header=TRUE, stringsAsFactors = FALSE)
par(mfrow=c(2,2))
xlim<-c(-20,400)
hist(flightSF$DepDelay,breaks=100,xlim=xlim,freq=FALSE)
hist(flightSFOSRS$DepDelay,breaks=100,xlim=xlim,freq=FALSE)
hist(flightSFOStratified$DepDelay,breaks=100,xlim=xlim,freq=FALSE)
curve(x*exp(-x/2)/4,xlim=c(0,30),ylab="p(x)",xlab="x")
This is not a function to worry about understanding. This is for teaching purposes and shades in the area under the curve. Notice, however, how I can define a function ( p
) to give to curve
to plot.
plotUnderCurve<-function(x1, x2,p,...){
x=seq(x1,x2,len=100)
y=p(x)
polygon(c(x,tail(x,1),x[1]), c(y, 0,0), ...)
}
p<-function(x){x*exp(-x/2)/4} #define as function
curve(p,xlim=c(0,30),ylab="p(x)",xlab="x")
plotUnderCurve(5,10,p,col="red")
Here I am going to create a SRS and take the mean of the SRS of size 1,000 over and over again (10,000 times) in order to demonstrate the Central Limit Theorem.
I use the function replicate
to do the exact same (usually random) something over and over again. I define the random thing I want to do to the argument expr
. In this case it is to sample with replacement and take the mean. n=10000
means to do it 10,000 times. The 10,000 means are returned and saved as a vector.
sampleSize<-1000
sampleMean<-replicate(n=10000,expr=mean(sample(flightSF_nc$DepDelay,size=sampleSize,replace=TRUE)))
hist(sampleMean,xlab="Mean Departure Delay",main=paste("Mean of SRS of size",sampleSize))
We now will plot the histogram of the 10,000 simulations of means of our SRS from the flight data and overlay the density. Note we need to draw the histogram on the “density” scale by setting freq=FALSE
(i.e. NOT frequency/counts).
We then define a curve to overlay the histogram using the dnorm
function. dXXX
functions calculate the density of a value x
and take the form dXXX(x)
and return \(p(x)\). They can take a vector of values x
or a single value. They also have the ability to give parameters. For the normal, this is mean
and sd
(so standard deviation, not variance).
We do this by creating a function p
and then do curve(p)
. We do add=TRUE
and col="red"
and lwd=3
to get it to overlay a red curve that is 3 times the usual thickness.
hist(sampleMean,xlab="Mean Departure Delay",main=paste("Mean of SRS of size",sampleSize),freq=FALSE)
m<-mean(flightSF_nc$DepDelay)
s<-sqrt(var(flightSF_nc$DepDelay)/sampleSize)
p<-function(x){dnorm(x, mean=m,sd=s)}
curve(p,add=TRUE,col="red",lwd=3)
Note in order to have the density of the normal centered and scaled at the right values, we need to calculate the mean and standard deviation predicted by the CLT. This is the mean of the original population, given by mean(flightSF_nc$DepDelay)
. And the variance is the variance of the original population, divided by the size of our SRS (n=1000).
We demonstrate the pdf of a uniform distribution ( dunif
); this corresponds to a pdf with uniform values between 0,1 (for illustration of properties of the density.)
curve(dunif,xlim=c(-1,2),ylab="p(x)",xlab="x")
Now we demonstrate the pdf of a uniform distribution between (¼, ½) and how it effects the density values. Instead of using the function curve
we show another way that is commonly used. We create a sequence of ordered values between -1 and 2, of length 100, using the seq
command.
xvalues<-seq(-1,2,length=100)
We now evaluate dunif
at each of those values; here we set the uniform parameters min
and max
to indicate that the uniform is between ¼ and ½.
yvalues<-dunif(xvalues,min=1/4,max=1/2)
We now do a scatter plot of the xvalues
and the yvalues
. But instead of drawing points, we will tell R to connect those points with a line by setting type="l"
( l
for lines).
plot(xvalues,yvalues,type="l",xlim=c(-1,2),ylab="p(x)",xlab="x")
We now draw the density histogram in minutes versus hours to demonstrate the effect of scaling the departure delay on the value of the density.
ylim<-c(0,0.8)
par(mfrow=c(1,2))
hist(flightSF$DepDelay,main="Departure Delay, minutes", xlab="Time (in minutes)",freq=FALSE,ylim=ylim)
hist(flightSF$DepDelay/60,main="Departure Delay, hours", xlab="Time (in in hours)",freq=FALSE,ylim=ylim)
We now draw the density histogram with different numbers of breaks to illustrate the effect on the value of the density.
par(mfrow=c(1,2))
ylim<-c(0,0.1)
hist(flightSF$DepDelay,main="Departure Delay, default breaks", xlab="Time (in minutes)",freq=FALSE,ylim=ylim)
hist(flightSF$DepDelay,main="Departure Delay, 1000 breaks", xlab="Time (in minutes)",breaks=1000,freq=FALSE,ylim=ylim)
We repeat the density histogram with different number of breaks, only now we draw the histogram for the simulation of 10,000 means of a SRS, where we know that the distribution approaches a normal, by the CLT. Then we can see the effect of too small or too big of bins.
m<-mean(flightSF_nc$DepDelay)
s<-sqrt(var(flightSF_nc$DepDelay)/sampleSize)
p<-function(x){dnorm(x, mean=m,sd=s)}
par(mfrow=c(2,2))
hist(sampleMean,xlab="Mean Departure Delay",main=expression(paste(bar(X),", 10 breaks")),freq=FALSE)
curve(p,add=TRUE,col="red",lwd=3)
hist(sampleMean,xlab="Mean Departure Delay",main=expression(paste(bar(X),", 30 breaks")),breaks=30,freq=FALSE)
curve(p,add=TRUE,col="red",lwd=3)
hist(sampleMean,xlab="Mean Departure Delay",main=expression(paste(bar(X),", 100 breaks")),freq=FALSE,breaks=100)
curve(p,add=TRUE,col="red",lwd=3)
hist(sampleMean,xlab="Mean Departure Delay",main=expression(paste(bar(X),", 1000 breaks")),freq=FALSE,breaks=1000)
curve(p,add=TRUE,col="red",lwd=3)
Here are examples of different pdfs and different parameters. Notice the use of the dXXX formulation.
par(mfrow=c(2,2))
f<-function(x){dgamma(x,shape=5,scale=1)}
curve(f,from=0,to=20,ylab="p(x)",main="Gamma(5,1) distribution")
f<-function(x){dgamma(x,shape=1,scale=5)}
curve(f,from=0,to=20,ylab="p(x)",main="Gamma(1,5) distribution")
f<-function(x){dbeta(x,.5,.5)}
curve(f,from=0,to=1,ylab="p(x)",main="Beta(.5,.5) distribution")
f<-function(x){dbeta(x,2,5)}
curve(f,from=0,to=1,ylab="p(x)",main="Beta(2,5) distribution")
These are examples of functions that are not pdfs.
par(mfrow=c(1,2))
f<-function(x){x^3}
curve(f,from=-10,to=10,ylab="p(x)",main=expression(x^3))
f<-function(x){x^2}
curve(f,from=-10,to=10,ylab="p(x)",main=expression(x^2))
A histogram of a single, simple random sample of flights.
binWidth<-20
breaks<-seq(min(flightSRS),max(flightSRS)+binWidth,by=binWidth)
xlim<-range(breaks)
histBin<-hist(flightSRS,xlab="Departure Delay (in Minutes)",breaks=breaks,freq=TRUE,xlim=xlim)
ylim<-range(histBin$density)
This next code draws the histogram as a step function – not important code to understand.
p<-stepfun(histBin$breaks,c(0,histBin$density,0),right=TRUE)
plot(p,do.points=FALSE,xlab="Departure Delay (in Minutes)",main="pdf estimated by histogram",verticals=FALSE,xlim=xlim,ylim=ylim)
x<-15
mtext(side=1,text="x",at=x)
points(x=x,y=p(x),pch=19)
text(x=x,y=p(x),labels=expression(paste(hat(p),"(x)")),pos=3)
The next code is a plot showing the result of creating a density curve using a rectangle kernel. Note we call the function density
on our SRS of flight data to get our kernel density estimate of the pdf. Here I set the bandwidth ( bw
), but normally you would use the default. I set the kernel to be rectangular
to give the rectangular moving windows.
The code starting with lines
again overplots the step function of the histogram from above (again, not important code to understand).
Notice I could have done this in two steps: d<-density(flightSRS,...)
then plot(d)
. But instead I do it in one line for convenience. plot
knows how to plot the output of density (this is a light version of object-oriented programming in R)
plot(density(flightSRS,bw=binWidth/sqrt(12),kernel="rectangular"))
lines(p,do.points=FALSE,xlab="Departure Delay (in Minutes)",verticals=FALSE,ylim=ylim,xlim=xlim)
The following code is a visualization for teaching purposes, but it's not important to understand the code.
plot(stepfun(c(-1,1),c(0,1,0)),do.points=FALSE,xaxt="n",yaxt="n",ylab="",xlab="",main="",bty="l",xlim=c(-2,2))
mtext(c("x-w/2","x","x-w/2"),side=1,at=c(-1,0,1))
#mtext(expression(frac(1,w)),side=2,at=1,las=2,line=.5)
Here is a visualization of the gaussian kernel, not important to understand the code.
curve(dnorm,xaxt="n",yaxt="n",ylab="",xlab="",main="",bty="l",xlim=c(-2,2))
mtext(c("x"),side=1,at=c(0))
segments(x0=0,x1=1,y0=dnorm(1),y1=dnorm(1))
text(label=expression(w/2),x=0.5,y=dnorm(1),pos=1)
Here I have plotted the density curve for the gaussian and the rectangular. First I plot the density using a Guassian/normal kernel. Again, I do in one line both the plotting and calculation of the density – keep careful track of the ()
to notice that main
is an argument to plot
and not to density
. To overlay the rectangular density, I call lines
instead of plot
– lines
automatically adds to the existing plot rather than making a new plot.
plot(density(flightSRS,kernel="gaussian"),
main="Density estimate, both kernels")
lines(density(flightSRS,kernel="rectangular"),col="red")
lines(p,do.points=FALSE,xlab="Departure Delay (in Minutes)",verticals=FALSE,ylim=ylim,xlim=xlim,col="blue")
legend("topright",c("Normal","Rectangle","Histogram"),fill=c("black","red","blue"))
Now I use a for-loop to repeat the same plot, only now choosing a different bandwidth for each plot. I set up a 2x2 grid and choose a uniform ylim for all of the plots. Then I run a for-loop over the values 1, 0.5, 2, and 10, and for each iteration of the for-loop, I multiply the bandwidth parameter by that value by setting the adjust
parameter in the density
function.
par(mfrow=c(2,2))
ylim<-c(0,0.08)
for(adjust in c(1,0.5,2,10)){
plot(density(flightSRS,kernel="gaussian",adjust=adjust),main="Density estimate, different bandwidths",
sub=paste("Bandwith multiplied by",adjust),ylim=ylim,xlim=xlim)
lines(density(flightSRS,kernel="rectangular",adjust=adjust),col="red")
lines(p,do.points=FALSE,xlab="Departure Delay (in Minutes)",verticals=FALSE,col="blue")
legend("topright",c("Normal","Rectangle","Histogram"),fill=c("black","red","blue"))
}
Notice the new use of the sub
argument to the first plotting call. This sets a subtitle underneath the x-axis.
Now I plot the density for multiple groups on the same plot. This code gets more complicated. Don't worry if you don't follow this too closely. I will make this into a function and make it available for the class.
perGroupDensity<-tapply(X=flightSF_nc$DepDelay,INDEX=flightSF_nc$Carrier,FUN=density)
ylim<-range(sapply(perGroupDensity,function(x){range(x$y)}))
cols<-rainbow(length(perGroupDensity))
par(mfrow=c(1,1))
plot(density(flightSF_nc$DepDelay),main="All non-cancelled flight, by carrier",sub=paste("Bandwith multiplied by",adjust),lwd=2,lty=2,xlim=c(-20,50),ylim=ylim)
nullOut<-mapply(perGroupDensity,cols,FUN=function(x,col){lines(x,col=col)})
Now I am going to make a violin plot. First I need to load the package that has the function, since this is not a standard function in R.
library(vioplot)
## Loading required package: sm
## Package 'sm', version 2.2-5.4: type help(sm) for summary information
If that command gave you an error, you need to install the package vioplot
uncommenting the next code:
# install.packages("vioplot")
# library(vioplot)
Now I call vioplot on my flight departure data.
vioplot(flightSF_nc$DepDelay)
Now I want to make violin plots for each of my airlines like I did with boxplots. I am going to read in the function I provided for you. source
reads in R code in a file, and I have placed online code defining a function vioplot2
that will do this easily.
source("myvioplot.R")
Once I have done this, I can plot the violin plots of the departure delays, divided by group.
vioplot2(flightSF_nc$DepDelay,flightSF_nc$Carrier,col=palette())
Notice I set the color argument, col
to be palette()
. palette()
just returns a list of default colors – a quick way to get a list of colors in R. (All the colors available by name can be found by colors()
)