Do not forget to change the directory to where the data is located (see 01Probability_forClassCode.Rmd).

dataDir<-"../finalDataSets"
# dataDir<-"."

Pairs Plots

We will start by reading in the data of college information

scorecard <- read.csv(file.path(dataDir,"college.csv"), stringsAsFactors = FALSE,na.strings = c("NA","PrivacySuppressed"))

We will exclude those that are for-profit institutes:

scorecard<-scorecard[-which(scorecard$CONTROL==3),]

Now we will do a pairs plot. The default pairs plot requires a matrix of numeric values, and some of our variables are factors, so we will leave those off (the first 3 columns), along with a number of other variables. Notice how I can remove a range of columns with the - (and how I have to wrap the indices in c()).

smallScores<-scorecard[,-c(1:3,4,5,6,9,11,14:17,18:22,24:27,31)]
pairs(smallScores)
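
As an aside, here is a toy illustration (made-up data, not part of our dataset) of why the c() is needed when mixing single indices and ranges:

df <- data.frame(a=1:2, b=3:4, c=5:6, d=7:8)
df[,-c(1,4)]      # drops columns a and d, keeping b and c
df[,-c(1:2,4)]    # a range works too, but only when wrapped inside c()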


We can improve the pairs function by providing alternative functions to draw the lower, upper, or diagonal panels. Here we use a function provided in the help of pairs to draw a histogram of each variable on the diagonal, instead of just printing the name. A function we will see later (gpairs) does this automatically, so don't stress about following this.

panel.hist <- function(x, ...)
{
    #from help of pairs
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y)
}

Now we call pairs, setting the argument diag.panel to the function we just wrote. We can also make the scatterplots have a loess line through them by using panel.smooth as our panel function (panel.smooth is a built-in function, like smoothScatter). We also color the points by whether the school is public or private, using the col argument.

pairs(smallScores, lower.panel = panel.smooth, col=c("red","black")[smallScores$CONTROL],diag.panel = panel.hist)
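
The col=c("red","black")[smallScores$CONTROL] trick works because CONTROL is coded 1 (public) or 2 (private), and indexing the vector of colors by these codes picks out one color per point. A minimal sketch:

c("red","black")[c(1,2,2,1)]
## [1] "red"   "black" "black" "red"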


Now we make the upper and lower panels different. The upper panel will now print the correlation of the variables, with the size of the printed text proportional to the size of the correlation (again, this is a function from the help of pairs). A function we will see later (gpairs) does this automatically, so don't stress about following this.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    #from help of pairs
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y,use="pairwise.complete.obs"))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(smallScores, lower.panel = panel.smooth, upper.panel = panel.cor,col=c("red","black")[smallScores$CONTROL],diag.panel = panel.hist)


Using the function corrgram (from the corrgram package) we can visualize the correlations for a large number of variables. This takes a bit of time to render. It also outputs an enormous number of warnings, so we suppress those warnings so as to not clutter up this document.

library(gpairs)
library(corrgram) #corrgram() lives in the corrgram package
suppressWarnings(corrgram(scorecard[,-c(1:3)]))


Categorical Plots

We will read in data from the wellbeing survey. We set stringsAsFactors=TRUE to automatically convert strings to factors (this was the default before R 4.0; newer versions default to FALSE, so we set it explicitly).

wellbeing <- read.csv(file.path(dataDir,"wellbeing.csv"), stringsAsFactors = TRUE, header = TRUE)

Now we will make a number of adjustments to the variables to improve them for future use. First we will make a variable where the year is a factor, rather than a number. This can be helpful sometimes.

wellbeing$YearFactor<-factor(wellbeing$Gss.year.for.this.respondent)

Now we will categorize the year into specific decades. We can do this using the cut command, where we ask to bin a numeric variable into bins defined by the breaks argument.

wellbeing$Decade<-cut(wellbeing$Gss.year.for.this.respondent,breaks=c(1970,1980,1990,2000,2010,2015),labels=c("1970s","1980s","1990s","2000s","2010s"))
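
For intuition, here is a toy example of cut (made-up years). Note that the bins are right-closed by default, so a value exactly equal to a break point falls into the lower bin:

cut(c(1975,1980,1985),breaks=c(1970,1980,1990),labels=c("1970s","1980s"))
## [1] 1970s 1970s 1980s
## Levels: 1970s 1980s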

By default, the text variables were made factors, and the order of the levels was alphabetical. However, these factor levels have a natural order to them – from greater happiness to lesser happiness. Since all the commands that work with factors use this ordering of the levels, we would like the order to follow this pattern. We will write a small function to do this, since we want to apply it to several variables.

#change order of levels so that make more sense
changeOrderLevels<-function(x,newLevels){
  if(is.factor(x)){
    oldLevels<-levels(x)  
    if(!all(sort(oldLevels)==sort(newLevels))) stop("newLevels is not just a reorder of the old levels")
  }
  else{if(!all(sort(unique(x))==sort(newLevels))) stop("newLevels is not just a ordering of existing values")}
  return(factor(x,levels=newLevels))
}
wellbeing$General.happiness<-changeOrderLevels(wellbeing$General.happiness,newLevels=c("Very happy","Pretty happy","Not too happy","Don't know","Not applicable","No answer"))
wellbeing$Job.or.housework<-changeOrderLevels(wellbeing$Job.or.housework,newLevels=c("Very satisfied","Mod. satisfied","A little dissat","Very dissatisfied","Don't know","Not applicable","No answer"))
wellbeing$Satisfaction.with.financial.situation<-changeOrderLevels(wellbeing$Satisfaction.with.financial.situation,newLevels=c("Satisfied","More or less","Not at all sat","Don't know","Not applicable","No answer"))
wellbeing$Is.life.exciting.or.dull<-changeOrderLevels(wellbeing$Is.life.exciting.or.dull,newLevels=c("Exciting","Routine","Dull","Don't know","Not applicable","No answer"))
wellbeing$Happiness.of.marriage<-changeOrderLevels(wellbeing$Happiness.of.marriage,newLevels=c("Very happy","Pretty happy","Not too happy","Don't know","Not applicable","No answer"))
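
As a quick toy check (made-up values, not the survey data), reordering the levels changes the order in which commands like table present the categories, without changing the data:

f<-factor(c("Pretty happy","Very happy","Very happy"))
levels(f) #alphabetical by default
table(factor(f,levels=c("Very happy","Pretty happy"))) #counts now in our chosen order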

Now we will make a dataset that only consists of the recent responses (since 2000).

wellbeingRecent<-wellbeing[wellbeing$Decade %in% c("2000s","2010s"),]

Our last step will be to drop any unused levels in our data set using the droplevels command. This keeps zero-count levels of a factor from being carried around.

wellbeingRecent<-droplevels(wellbeingRecent)

We will now make a table of the different levels of the factor General.happiness and also a barplot.

table(wellbeingRecent$General.happiness)
## 
##     Very happy   Pretty happy  Not too happy     Don't know Not applicable 
##           4270           7979           1991             25           4383 
##      No answer 
##             18
barplot(table(wellbeingRecent$General.happiness))


We use the function gpairs to plot a pairs plot that takes into account categorical variables. (You may have to install the package gpairs). In order for this to work, we need to designate the variable CONTROL as a factor so that gpairs knows to treat it as a categorical variable.

library(gpairs)
smallScores$CONTROL<-factor(smallScores$CONTROL,levels=c(1,2),labels=c("public","private"))

Now we run gpairs. To save time, we have changed the upper-triangle plots to be boxplots, like those in the lower triangle. The default is a rug plot, which can be nice for small numbers of observations, but takes a long time to plot otherwise. (You can delete the upper.pars=... argument and see what it looks like.)

gpairs(smallScores,lower.pars=list(scatter="loess"),upper.pars=list(scatter="loess",conditional = "boxplot"),
scatter.pars = list( col = c("red","black")[smallScores$CONTROL]))


Two Categorical Variables

We now use the table command to create a contingency table between two of the variables in the wellbeing data.

tabGeneralJob<-with(wellbeingRecent,table(General.happiness,Job.or.housework))
tabGeneralJob
##                  Job.or.housework
## General.happiness Very satisfied Mod. satisfied A little dissat
##    Very happy               2137            843             154
##    Pretty happy             2725           2569             562
##    Not too happy             436            527             247
##    Don't know                 11              1               4
##    Not applicable            204            134              36
##    No answer                   8              2               1
##                  Job.or.housework
## General.happiness Very dissatisfied Don't know Not applicable No answer
##    Very happy                    61         25           1011        39
##    Pretty happy                 213         61           1776        73
##    Not too happy                161         39            549        32
##    Don't know                     0          1              8         0
##    Not applicable                12          1           3990         6
##    No answer                      3          0              4         0

We can use barplot to visualize the relationship between two categorical variables.

barplot(tabGeneralJob,legend=TRUE)


To be able to tell the two variables apart, we will change the names of the columns and rows of the contingency table (we could also do this in other, prettier ways…).

colnames(tabGeneralJob)<-paste(colnames(tabGeneralJob),"(Job)")
rownames(tabGeneralJob)<-paste(rownames(tabGeneralJob),"(General)")
barplot(tabGeneralJob,legend=TRUE)


We can also switch which variable is plotted by taking the transpose of the contingency table using the t() function (“transpose” just means to flip the matrix so that the rows become the columns and the columns become the rows).

barplot(t(tabGeneralJob),legend=TRUE)


Now we plot the barplot with beside=TRUE, so that we get separate, side-by-side bars for each level. We also adjust the colors using the col argument.

barplot(tabGeneralJob,beside=TRUE,legend=TRUE,col=palette()[1:6])


We can convert our contingency table into proportions within each category of job satisfaction using the function prop.table. The argument margin indicates whether we make the proportions out of the columns or the rows. So margin=2 means that the columns will now sum to 1, i.e. each column, corresponding to a response to Job.or.housework, has been converted into proportions.
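
For intuition, a tiny made-up example before we apply it to the real table:

m<-matrix(c(1,3,2,4),nrow=2)
prop.table(m,margin=2) #each column now sums to 1; the first column becomes c(0.25,0.75)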

prop.table(tabGeneralJob,margin=2)
##                           Job.or.housework
## General.happiness          Very satisfied (Job) Mod. satisfied (Job)
##   Very happy (General)             0.3870675602         0.2068204122
##   Pretty happy (General)           0.4935700054         0.6302747792
##   Not too happy (General)          0.0789712009         0.1292934249
##   Don't know (General)             0.0019923927         0.0002453386
##   Not applicable (General)         0.0369498279         0.0328753680
##   No answer (General)              0.0014490129         0.0004906771
##                           Job.or.housework
## General.happiness          A little dissat (Job) Very dissatisfied (Job)
##   Very happy (General)              0.1533864542            0.1355555556
##   Pretty happy (General)            0.5597609562            0.4733333333
##   Not too happy (General)           0.2460159363            0.3577777778
##   Don't know (General)              0.0039840637            0.0000000000
##   Not applicable (General)          0.0358565737            0.0266666667
##   No answer (General)               0.0009960159            0.0066666667
##                           Job.or.housework
## General.happiness          Don't know (Job) Not applicable (Job)
##   Very happy (General)         0.1968503937         0.1377759608
##   Pretty happy (General)       0.4803149606         0.2420278005
##   Not too happy (General)      0.3070866142         0.0748160262
##   Don't know (General)         0.0078740157         0.0010902153
##   Not applicable (General)     0.0078740157         0.5437448896
##   No answer (General)          0.0000000000         0.0005451077
##                           Job.or.housework
## General.happiness          No answer (Job)
##   Very happy (General)        0.2600000000
##   Pretty happy (General)      0.4866666667
##   Not too happy (General)     0.2133333333
##   Don't know (General)        0.0000000000
##   Not applicable (General)    0.0400000000
##   No answer (General)         0.0000000000
barplot(prop.table(tabGeneralJob,margin=2),beside=TRUE)


We can of course switch these and instead make each row sum to 1, using margin=1. To appropriately plot this with barplot, however, we need to flip the matrix (i.e. take the transpose).

prop.table(tabGeneralJob,margin=1)
##                           Job.or.housework
## General.happiness          Very satisfied (Job) Mod. satisfied (Job)
##   Very happy (General)             0.5004683841         0.1974238876
##   Pretty happy (General)           0.3415214939         0.3219701717
##   Not too happy (General)          0.2189854345         0.2646911100
##   Don't know (General)             0.4400000000         0.0400000000
##   Not applicable (General)         0.0465434634         0.0305726671
##   No answer (General)              0.4444444444         0.1111111111
##                           Job.or.housework
## General.happiness          A little dissat (Job) Very dissatisfied (Job)
##   Very happy (General)              0.0360655738            0.0142857143
##   Pretty happy (General)            0.0704348916            0.0266950746
##   Not too happy (General)           0.1240582622            0.0808638875
##   Don't know (General)              0.1600000000            0.0000000000
##   Not applicable (General)          0.0082135524            0.0027378508
##   No answer (General)               0.0555555556            0.1666666667
##                           Job.or.housework
## General.happiness          Don't know (Job) Not applicable (Job)
##   Very happy (General)         0.0058548009         0.2367681499
##   Pretty happy (General)       0.0076450683         0.2225842837
##   Not too happy (General)      0.0195881467         0.2757408338
##   Don't know (General)         0.0400000000         0.3200000000
##   Not applicable (General)     0.0002281542         0.9103353867
##   No answer (General)          0.0000000000         0.2222222222
##                           Job.or.housework
## General.happiness          No answer (Job)
##   Very happy (General)        0.0091334895
##   Pretty happy (General)      0.0091490162
##   Not too happy (General)     0.0160723255
##   Don't know (General)        0.0000000000
##   Not applicable (General)    0.0013689254
##   No answer (General)         0.0000000000
barplot(t(prop.table(tabGeneralJob,margin=1)),beside=TRUE,legend=TRUE)


Alluvial Plots

We can make a contingency table with more than two variables; however, the output is not easy to read, so the command is left commented out:

## with(wellbeingRecent,table(General.happiness,Job.or.housework,Happiness.of.marriage))

We can also calculate the values in a contingency table using aggregate. We have to create a variable Freq in our data.frame indicating that each row should count as 1 when we sum; then we can use formula notation to tell aggregate that we want to sum the variable Freq for each combination of the values of General.happiness and Job.or.housework.

wellbeingRecent$Freq<-1
wellbeingAggregates<-aggregate(Freq~ General.happiness+Job.or.housework ,data=wellbeingRecent[,-2],FUN=sum)
head(wellbeingAggregates,10)
##    General.happiness Job.or.housework Freq
## 1         Very happy   Very satisfied 2137
## 2       Pretty happy   Very satisfied 2725
## 3      Not too happy   Very satisfied  436
## 4         Don't know   Very satisfied   11
## 5     Not applicable   Very satisfied  204
## 6          No answer   Very satisfied    8
## 7         Very happy   Mod. satisfied  843
## 8       Pretty happy   Mod. satisfied 2569
## 9      Not too happy   Mod. satisfied  527
## 10        Don't know   Mod. satisfied    1
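
As an aside, the same long-format tally can be obtained with table followed by as.data.frame, though the rows may come out in a different order than from aggregate:

head(as.data.frame(with(wellbeingRecent,table(General.happiness,Job.or.housework))))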

We can extend this format easily to many variables:

wellbeingAggregatesBig<-aggregate(Freq~ General.happiness+Job.or.housework +Satisfaction.with.financial.situation+Happiness.of.marriage+Is.life.exciting.or.dull,data=wellbeingRecent[,-2],FUN=sum)
head(wellbeingAggregatesBig,5)
##   General.happiness Job.or.housework Satisfaction.with.financial.situation
## 1        Very happy   Very satisfied                             Satisfied
## 2      Pretty happy   Very satisfied                             Satisfied
## 3     Not too happy   Very satisfied                             Satisfied
## 4        Very happy   Mod. satisfied                             Satisfied
## 5      Pretty happy   Mod. satisfied                             Satisfied
##   Happiness.of.marriage Is.life.exciting.or.dull Freq
## 1            Very happy                 Exciting  333
## 2            Very happy                 Exciting   54
## 3            Very happy                 Exciting    3
## 4            Very happy                 Exciting   83
## 5            Very happy                 Exciting   38

With the data in this format (i.e. from aggregate) we can make an alluvial plot that tracks the data through the different categories (you may have to install the package alluvial). I chose to color-code the lines in the alluvial plot according to their value of General.happiness, so this is the variable I can track through the plot.

library(alluvial)
alluvial( wellbeingAggregates[,c("General.happiness","Job.or.housework")], freq=wellbeingAggregates$Freq, 
col=palette()[wellbeingAggregates$General.happiness])


Next we plot the alluvial for our big cross-tabulations. I want to use all of the columns except for the Freq column, so now I remove that column, which is the last column. This command takes a while for R to render, so it is commented out. You can run it by uncommenting the following code:

## ###Takes a bit of time...
## alluvial( wellbeingAggregatesBig[,-ncol(wellbeingAggregatesBig)], freq=wellbeingAggregatesBig$Freq,
## col=palette()[wellbeingAggregatesBig$General.happiness])

We are going to remove observations that are Not applicable in any of the three questions of interest, as well as those answering No answer or Don't know in any of the variables of interest.

#remove those that are 'Not applicable' in any of the three questions of interest
wh<-with(wellbeingRecent,which(General.happiness=="Not applicable" | Job.or.housework =="Not applicable" | Satisfaction.with.financial.situation=="Not applicable"))
wellbeingCondenseGroups<-wellbeingRecent[-wh,]
#keep only those not 'No answer' or 'Don't know' in any of the variables of interest
wellbeingCondenseGroups<-subset(wellbeingCondenseGroups,!General.happiness%in%c("No answer","Don't know") & !Job.or.housework %in%c("No answer","Don't know") &  !Satisfaction.with.financial.situation%in%c("No answer","Don't know")  & !Happiness.of.marriage%in%c("No answer","Don't know") & !Is.life.exciting.or.dull%in%c("No answer","Don't know") )

Now we'll get rid of any unused levels:

wellbeingCondenseGroups<-droplevels(wellbeingCondenseGroups)

And now we'll create new aggregates for this new dataset:

wellbeingCondenseAggregates<-aggregate(Freq~ General.happiness+Job.or.housework +Satisfaction.with.financial.situation+Happiness.of.marriage+Is.life.exciting.or.dull,data=wellbeingCondenseGroups,FUN=sum)

Now we will redraw the alluvial plot (again, it takes a while, so it is commented out). We can also hide particularly small subsets of the data to make a less messy plot, using the hide argument. Basically, you give hide a logical vector indicating which cross-tabulations not to plot; here we plot only the top 50% of our cross-tabulations, so as to get rid of the very small ones.

## alluvial( wellbeingCondenseAggregates[,-ncol(wellbeingCondenseAggregates)], freq=wellbeingCondenseAggregates$Freq,hide = wellbeingCondenseAggregates$Freq < quantile(wellbeingCondenseAggregates$Freq, .50),
## col=palette()[wellbeingCondenseAggregates$General.happiness])

Now we will focus only on those that are married and working/keeping house, to again remove some of the small noisy columns.

wh<-with(wellbeingCondenseGroups,which(Marital.status=="Married" & Labor.force.status %in% c("Working fulltime","Working parttime","Keeping house")))
wellbeingMarried<-wellbeingCondenseGroups[wh,]
wellbeingMarried<-droplevels(wellbeingMarried)
wellbeingMarriedAggregates<-aggregate(Freq~ General.happiness+Job.or.housework +Satisfaction.with.financial.situation+Happiness.of.marriage+Is.life.exciting.or.dull,data=wellbeingMarried,FUN=sum)

Now we will redraw the alluvial plot for the married, working people (this one is not commented out, so it may take a while to render):

alluvial( wellbeingMarriedAggregates[,-ncol(wellbeingMarriedAggregates)], freq=wellbeingMarriedAggregates$Freq,hide = wellbeingMarriedAggregates$Freq < quantile(wellbeingMarriedAggregates$Freq, .50),
col=palette()[wellbeingMarriedAggregates$General.happiness])

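As another example, we draw an alluvial plot of the built-in Titanic data, coloring each stream by whether the passengers survived: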

data(Titanic)
tit<-as.data.frame(Titanic)
alluvial( tit[,1:4], freq=tit$Freq, border=NA,
         col=ifelse( tit$Survived == "No", "red", "gray") )


Mosaic Plots

Now we draw a mosaic plot for two of these variables.

mosaicplot(~General.happiness+Job.or.housework,data=wellbeingMarried,las=1,col=palette())


Now we draw a mosaic plot for three of these variables.

mosaicplot(~General.happiness+Job.or.housework+Satisfaction.with.financial.situation,data=wellbeingMarried,las=1,col=palette())


We are going to return to our flight data. We are going to fix up some of the variables by making them factors and giving their levels relevant names.

flightSFOSRS<-read.table(file.path(dataDir,"SFO_SRS.txt"),
                         sep="\t",header=TRUE, stringsAsFactors = TRUE)
flightSFOSRS$DayOfWeek<-factor(flightSFOSRS$DayOfWeek,levels=c(1:7),
labels=c("M","Tu","W","Th","F","Sa","Su"))
flightSFOSRS$Cancelled<-factor(flightSFOSRS$Cancelled,levels=0:1,labels=c("No","Yes"))
flightSFOSRS$Diverted<-factor(flightSFOSRS$Diverted,levels=0:1,labels=c("No","Yes"))
flightSFOSRS$CancellationCode<-factor(flightSFOSRS$CancellationCode,levels=c("A","B","C","D",""),labels=c("Carrier","Weather","NationalAirSystem","Security","NotCancelled"))

We will also create a variable that indicates the cause of the delay (there is no such variable in the data, only the amount of delay time attributable to each possible cause).

whDelayed<-which(!is.na(flightSFOSRS$DepDelay) & flightSFOSRS$DepDelay>15)
flightSFOSRS$DelayCause<-"NoDelay"
flightSFOSRS$DelayCause[flightSFOSRS$Cancelled=="Yes"]<-NA #Cancelled is now a factor with levels No/Yes, not 0/1
delayVars<-c("CarrierDelay","WeatherDelay","NASDelay","SecurityDelay" ,"LateAircraftDelay")
delayCause<-apply(flightSFOSRS[whDelayed,delayVars]>0,1,function(x){
    #x is a logical vector of which delay types were > 0 for this flight
    if(all(is.na(x))) return(NA)
    if(any(x,na.rm=TRUE)){
        if(sum(x,na.rm=TRUE)==1) gsub("Delay","",delayVars[which(x)])
        else "Multiple"
    } else "NoDelay"
})
flightSFOSRS$DelayCause[whDelayed]<-delayCause
flightSFOSRS$DelayCause<-factor(flightSFOSRS$DelayCause)
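
A quick sanity check on the new variable is to tabulate it, counting the NAs from cancelled flights as well (output not shown):

table(flightSFOSRS$DelayCause,useNA="ifany")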

Now we make a pairs plot using gpairs. We can see that the panels corresponding to pairs of categorical variables show up as mosaic plots.

gpairs(droplevels(flightSFOSRS[whDelayed,c("AirTime","DepDelay","DayOfWeek","DelayCause")]),upper.pars=list(conditional = "boxplot"))


Heatmaps

We will read in the data on gene expression from breast tumors.

breast<-read.csv(file.path(dataDir,"highVarBreast.csv"))

We can visualize the correlations among our variables. We will not use the corrgram function, however; that function is only appropriate for a small number of variables. We will instead create a heatmap using a heatmap function. There are many heatmap functions in R. The default heatmap comes with the standard packages in R, but is not very good. Better options are heatmap.2 in the gplots package and heatmap.plus in the heatmap.plus package. I favor the aheatmap function in the NMF package. It's got a couple of strange quirks, but I find it generally gives the nicest pictures.

The first quirk is that if you make a pdf, it puts a blank page in front of it, unless you use the following commands:

library(NMF)
nmf.options(grid.patch=TRUE) #fixes so plots to pdf correctly

Now to plotting the correlations. First we calculate all the correlations with the cor function, which computes the correlations among the columns of a matrix. Notice that we need to remove the non-continuous variables.

corMat<-cor(breast[,-c(1:7)])

Now we run aheatmap. I have suppressed the ordering of the samples by turning off both row and column clustering (Rowv=NA and Colv=NA). This was for teaching purposes during class, and is not what you would normally want to do.

aheatmap(corMat,Rowv = NA, Colv = NA )


Now we will draw the same heatmap with the rows and columns clustered. However, I chose not to show the hierarchical clustering trees, by setting Rowv=FALSE and Colv=FALSE. Again this was for teaching purposes, though it is actually sometimes useful if you don't want the clutter of the dendrograms.

aheatmap(corMat,Rowv=FALSE,Colv=FALSE)


Now we do a heatmap of the actual data, rather than the correlation.

aheatmap(breast[,-c(1:7)],Rowv = FALSE, Colv = FALSE,scale="none")


Now we are going to demonstrate some options of aheatmap to show how you can customize it. First, we want to draw colors next to the samples to identify different characteristics of the samples (e.g. normal or cancer samples). So I am going to create vectors of colors, and then give names to each vector that exactly match the levels of the corresponding variable. Notice that I am NOT making the vector of colors the same length as the factor (which is what I would do in, say, plot); aheatmap handles that internally. I will do this for three variables: TypeSample, EstReceptor, and Progesteron.

typeCol<-c("red","black","yellow")
names(typeCol)<-levels(breast$TypeSample)
estCol<-palette()[c(4,3,5)]
names(estCol)<-levels(breast$EstReceptor)
proCol<-palette()[5:7]
names(proCol)<-levels(breast$Progesteron) #note: proCol, not estCol, gets these names

Now I'm going to change the color scheme for the heatmap. This is a color scheme I made up that I use frequently and like.

seqPal5<-colorRampPalette(c("black","navyblue","mediumblue","dodgerblue3","aquamarine4","green4","yellowgreen","yellow"))(16)

The last change I'm going to make is to set my own breaks for the colors, based on the quantiles of the data rather than equally spaced bins. First I find the maximum (absolute) value.

m<-max(abs(breast[,-c(1:7)]))

Then I find the 1% and 99% quantiles of the data.

qnt<-quantile(as.numeric(data.matrix(breast[,-c(1:7)])),c(0.01,.99)) #quantiles of the expression columns only

Now I make a vector of break points of length 52. These will define my bins. The outer values will be my maximum (m), and the rest of the bins will be equally spaced between my 1% and 99% quantiles. I pick 52 because of a weird quirk in aheatmap: if you set your own breaks, you should make the vector length 52 (and also set your own colors); otherwise weird things happen.

brks<-c(-m,seq(qnt[1],qnt[2],length=50),m)
head(brks)
## [1] -17.651678  -5.742913  -5.428952  -5.114990  -4.801028  -4.487066
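
And a quick check that we satisfied the quirk:

length(brks)
## [1] 52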

Now I draw the heatmap with all of these features. annRow gives the variable(s) that define the colors I want to draw next to the samples/rows (the argument annCol does the same thing for columns). annColors is a list that gives the colors for each of these variables. Note that the names of the list must exactly match the names of the variables in annRow. If you don't give colors, aheatmap will automatically create some, but I don't like its defaults.

fullHeat<-aheatmap(breast[,-c(1:7)],Rowv = FALSE, Colv = FALSE,
    breaks=brks,col=seqPal5,annRow=breast[,5:7],
    annColors=list("TypeSample"=typeCol,"EstReceptor"=estCol,"Progesteron"=proCol))


Now we will draw a heatmap for only the cancer samples. Notice I don't have to change much of the setup from above.

whCancer<-which(breast$Type!="Normal")
aheatmap(breast[whCancer,-c(1:7)],Rowv = FALSE, Colv = FALSE,
    breaks=brks,col=seqPal5,annRow=breast[whCancer,6:7],
    annColors=list("TypeSample"=typeCol,"EstReceptor"=estCol,"Progesteron"=proCol))


Here we are going to “center” the data by subtracting out the median of each variable, so that the variables are on a closer scale. I am actually going to center the data in two ways, first by subtracting off the mean and then the median, just to show you a few more functions.

There is a simple function for subtracting off the mean, called scale. This function subtracts the mean of each variable (column) and returns a matrix that has been centered. There is also an option to divide each variable by its standard deviation to put them on the same scale (scale=TRUE would do that).

breastCenteredMean<-scale(breast[,-c(1:7)],center=TRUE,scale=FALSE)

To subtract off the median requires a bit more work. First we need to calculate the median of each variable, which we will do with apply. This returns a vector whose length is the number of variables.

colMedian<-apply(breast[,-c(1:7)],2,median)

Then we need to subtract them from every column. The function sweep will take a vector and subtract it from every column of a matrix. It works similarly to apply. The first argument is the matrix/data.frame; the second argument, like in apply, is MARGIN, which defines whether you are subtracting from the rows (1) or the columns (2). Then you give the vector that you want to subtract, and finally the operation you want to perform (“-”, “+”, “/”, “*” are common ones, though you can write your own).

breastCenteredMed<-sweep(breast[,-c(1:7)],MARGIN=2, colMedian,"-")

If this feels like something you could have just done with a call to apply, you are right! But this is such a common task that there is a helper function. In particular, sweep guarantees that the output has the same dimensions as the input, which is frequently not the case for apply (which sometimes winds up “flipping” the rows and columns on you).
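
To see the flipping issue concretely, here is a sketch of the equivalent apply call (breastCenteredMed2 is just an illustrative name); note the t() needed to restore the original orientation:

#apply returns each row's result as a column, so we transpose afterwards
breastCenteredMed2<-t(apply(breast[,-c(1:7)],1,function(row){row-colMedian}))
all.equal(as.matrix(breastCenteredMed),breastCenteredMed2,check.attributes=FALSE)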

Now we are going to create breaks for this centered data that are symmetric around zero, and also chop off the really outlying points.

m<-max(abs(breastCenteredMed)) #breastCenteredMed already contains only the expression columns
qnt<-max(abs(quantile(as.numeric(data.matrix(breastCenteredMed)),c(0.01,.99))))
brksCentered<-c(-m,seq(-qnt,qnt,length=50),m)

We are also going to create a different color scale. Here's another one that I like for centered data.

seqPal2<- colorRampPalette(c("orange","black","blue"))(16)
seqPal2<-(c("yellow","gold2",seqPal2))
seqPal2<-rev(seqPal2)
aheatmap(breastCenteredMed[whCancer,],Rowv = FALSE, Colv = FALSE,
    breaks=brksCentered,annRow=breast[whCancer,6:7],labRow=NA,col=seqPal2,
    annColors=list("TypeSample"=typeCol,"EstReceptor"=estCol,"Progesteron"=proCol))


Now we are going to work with a small subset of genes and samples so that we can more clearly see the dendrograms from the clustering. Setting Rowv=TRUE (the default) gives the dendrogram for the rows.

smallBreast<-read.csv(file.path(dataDir,"smallVarBreast.csv"),header=TRUE)

The following command shows the dendrogram, but only for the rows:

aheatmap(smallBreast[,-c(1:7)],Rowv = TRUE, Colv = FALSE,
    breaks=brks,col=seqPal5,annRow=smallBreast[,5:7],
    annColors=list("TypeSample"=typeCol,"EstReceptor"=estCol,"Progesteron"=proCol))


The following command shows the dendrograms for both rows and columns (both Rowv and Colv are TRUE now):

aheatmap(smallBreast[,-c(1:7)],Rowv = TRUE, Colv = TRUE,
    breaks=brks,col=seqPal5,annRow=smallBreast[,5:7],
    annColors=list("TypeSample"=typeCol,"EstReceptor"=estCol,"Progesteron"=proCol))


PCA

Demonstrations on simulated data

We are going to simulate some data to demonstrate the ideas of PCA. There will be a lot of code here that creates the demonstrations, but it is not useful to worry about understanding this code until we get to the part that actually runs PCA for real. Skip to that section, and don't worry about the code here.

set.seed(275382)
trueQuality<-rnorm(150)
score2<-(trueQuality+100)*2+rnorm(150,sd=.5)
score1<-(trueQuality+80)*.5+rnorm(150,sd=.5)
x<-data.frame(score1=score1,score2=score2)
plot(x)


This is another demonstration, this time with negative correlation.

set.seed(231382)
skipped<- -(trueQuality+80)*2+rnorm(150,sd=.5)
skipped<-(skipped+abs(min(skipped)))
skipped<-skipped/max(skipped)
xNeg<-data.frame(score1=score1,skipped=skipped)
plot(xNeg)


We center the data, and replot it.

xcenter<-data.frame(scale(x,center=TRUE,scale=FALSE))
xcenterNeg<-data.frame(scale(xNeg,center=TRUE,scale=FALSE))
par(mfrow=c(1,2))
plot(xcenter,main="centered",xlim=range(xcenter),ylim=range(xcenter))
plot(xcenterNeg,main="centered",xlim=range(xcenterNeg),ylim=range(xcenterNeg))


Here we run PCA on the simulated data and plot histograms of the scores. I will explain this function more in the section on real data.

#demonstrate new scores:
pca<-prcomp(na.omit(xcenter),scale=FALSE,center=FALSE)

I will plot a histogram of the scores from PCA next to a histogram of the mean of the two scores.

xlim<-range(pca$x[,1],rowMeans(xcenter))
bks<-seq(xlim[1],xlim[2],length=20)
par(mfrow=c(2,1))
hist(pca$x[,1],xlim=xlim,breaks=bks,main="PCA scores")
hist(rowMeans(xcenter),xlim=xlim,breaks=bks,main="Mean scores")


This is just a simple illustration of the regression line for the simulated data.

par(mfrow=c(1,2))
plot(xcenter,asp=1)
abline(lm(score2~score1,data=xcenter))
plot(xcenterNeg,asp=1)
abline(lm(skipped~score1,data=xcenterNeg))


This is the flipped regression line – this code is not important. It is just a simple inversion of the regression line of x~y so as to be able to plot it.

par(mfrow=c(1,2))
plot(xcenter,asp=1)
abline(lm(score2~score1,data=xcenter))
#score1=beta0+beta1 score2 -> score2 = 1/beta1 * score1-beta0/beta1
cx<-lm(score1~score2,data=xcenter)$coef
abline(a=-cx[1]/cx[2],b=1/cx[2],lty=2)
legend("topleft",c("score2~score1","score1~score2"))
plot(xcenterNeg,asp=1)
abline(lm(skipped~score1,data=xcenterNeg))
cxNeg<-lm(score1~skipped,data=xcenterNeg)$coef
abline(a=-cxNeg[1]/cxNeg[2],b=1/cxNeg[2],lty=2)


This code is not important and is visualizing the error that regression is minimizing.

par(mfrow=c(1,2))
plot(xcenter,main=c("score2 ~ score1"),asp=1)
abline(lm(score2~score1,data=xcenter))
yhat<-predict(lm(score2~score1,data=xcenter))
segments(x0=xcenter$score1,y0=xcenter$score2,y1=yhat,col="red")
points(xcenter$score1,yhat,col="red",pch=19)
plot(xcenter,main=c("score1 ~ score2"),asp=1)
abline(a=-cx[1]/cx[2],b=1/cx[2],lty=2)
yhat<-predict(lm(score1~score2,data=xcenter))
segments(x0=xcenter$score1,y0=xcenter$score2,x1=yhat,col="red")
points(yhat,xcenter$score2,col="red",pch=19)


This code is particularly not important. I am projecting the data onto arbitrary lines to demonstrate the idea of the distance from a point to a line.

projectLine<-function(a,b,x,y){
    xproj<-b/(1+b^2)*(y+x/b-a)
    yproj<-a+b^2/(1+b^2)*(y+x/b-a)
    return(cbind(x=xproj,y=yproj))
}
par(mfrow=c(1,2))
plot(xcenter,main="Orthogonal projection to Arbitrary Line",asp=1)
arbitrarySlope<-1
abline(a=0,b=arbitrarySlope)
newValueArbitrary<-projectLine(a=0,b=arbitrarySlope,x=xcenter[,1],y=xcenter[,2])
segments(x0=xcenter[,1],y0=xcenter[,2],x1=newValueArbitrary[,"x"],y1=newValueArbitrary[,"y"],col="red")
b<-pca$rotation[2,1]/pca$rotation[1,1]
plot(xcenter,main="Orthogonal projection to Arbitrary Line",asp=1)
arbitrarySlope<-4
abline(a=0,b=arbitrarySlope)
newValueArbitrary<-projectLine(a=0,b=arbitrarySlope,x=xcenter[,1],y=xcenter[,2])
segments(x0=xcenter[,1],y0=xcenter[,2],x1=newValueArbitrary[,"x"],y1=newValueArbitrary[,"y"],col="red")


This code is not important. Demonstrates the best line (PCA line)

#point(x,y) projects to ax+by 
b<-pca$rotation[2,1]/pca$rotation[1,1]
plot(xcenter,main="Best Line for minimizing Orthogonal projection",asp=1)
abline(a=0,b=b,lty=1,col="black")
newValuePCA<-projectLine(a=0,b=b,x=xcenter[,1],y=xcenter[,2])
segments(x0=xcenter[,1],y0=xcenter[,2],x1=newValuePCA[,"x"],y1=newValuePCA[,"y"],col="red")


This code is not important. Demonstrates PCA line versus regression line.

plot(xcenter,main="Closest to point",asp=1)
abline(a=0,b=b,lty=1,col="red")
abline(lm(score2~score1,data=xcenter),lty=2)
legend("topleft",c("PCA line","Regression Line"),lty=c(1,2),col=c("red","black"))


This code is not important. Demonstrates how to get a new variable \(Z\) from the PCA line.

plot(xcenter,main="Getting new variable",xlim=range(xcenter),ylim=range(xcenter))
abline(a=0,b=b,lty=1,col="black",asp=1)
whExample<-which.max(xcenter[,1])
points(x=xcenter[whExample,1],y=xcenter[whExample,2],pch=19,cex=1.2)
text(x=xcenter[whExample,1],y=xcenter[whExample,2],expression(paste("(",list(x[i]^1,x[i]^2),")")),col="black",pos=4)
segments(x0=xcenter[whExample,1],y0=xcenter[whExample,2],x1=newValuePCA[whExample,"x"],y1=newValuePCA[whExample,"y"],col="red")
points(x=newValuePCA[whExample,"x"],y=newValuePCA[whExample,"y"],col="red",pch=19,cex=1.2)
points(x=0,y=0,col="red",pch=19,cex=1.2)
segments(x0=0,y0=0,x1=newValuePCA[whExample,"x"],y1=newValuePCA[whExample,"y"],col="red",lwd=4,lend="butt")
text(x=newValuePCA[whExample,"x"]/2,y=b*newValuePCA[whExample,"x"]/2,expression(z[i]),col="red",pos=2,lwd=4,cex=2.5)


This code is not important. Again shows project to PCA line.

pca<-prcomp(na.omit(xcenter),scale=FALSE,center=FALSE)
#point(x,y) projects to ax+by 
b<-pca$rotation[2,1]/pca$rotation[1,1]
plot(xcenter,main="Projection to PCA line",xlim=range(xcenter),ylim=range(xcenter),asp=1)
abline(a=0,b=b,lty=1,col="black")
newValuePCA<-projectLine(a=0,b=b,x=xcenter[,1],y=xcenter[,2])
segments(x0=xcenter[,1],y0=xcenter[,2],x1=newValuePCA[,"x"],y1=newValuePCA[,"y"],col="red")


This code is not important. Shows the linear combination defined by the mean as a line.

par(mfrow=c(1,2))
plot(xcenter,main="Orthogonal projection to y=x",xlim=range(xcenter),ylim=range(xcenter),asp=1)
abline(a=0,b=1)
newValue<-rowMeans(xcenter)
segments(x0=xcenter[,1],y0=xcenter[,2],x1=newValue,y1=newValue,col="red")
plot(xcenter,main="project to PCA line",xlim=range(xcenter),ylim=range(xcenter),asp=1)
abline(a=0,b=1,lty=2)
abline(a=0,b=b,lty=1,col="black")
newValuePCA<-projectLine(a=0,b=b,x=xcenter[,1],y=xcenter[,2])
segments(x0=xcenter[,1],y0=xcenter[,2],x1=newValuePCA[,"x"],y1=newValuePCA[,"y"],col="red")


This code demonstrates why we would want to maximize the variance – how the variance maintains the signal – again on simulated data.

set.seed(275382)
trueDiff<-rep(c(10,12),each=75)+rnorm(150,sd=1)
noise<-rnorm(150,sd=.5)
score1<-rowMeans(cbind(trueDiff,noise))
score2<-trueDiff-noise
xGroups<-data.frame(score1=score1,score2=score2)
plot(scale(xGroups,center=TRUE,scale=TRUE),col=rep(c("black","red"),each=75))
abline(a=0,b=1)


This code is useful! But only because we read in data on students taking an AP Statistics class that we will use later.

apscores<-read.csv(file = file.path(dataDir,"AP_Statistics_Predictions_2013-16.csv"), header = TRUE, stringsAsFactors = TRUE)
library(NMF)
#21-29 are various conversions of earlier values (26 is actual AP score)
aheatmap(cor(apscores[,-c(1,3,10,13,21:29)],use="pairwise.complete.obs"))


This code is not important – just drawing the PCA, mean, and regression line on two variables of the apscores dataset.

twovar<-c("MT"  ,   "Locus.Aug")
xAP2<-na.omit(apscores[,twovar])
xscaleAP2<-scale(xAP2,center=TRUE,scale=FALSE)
pcaAP2<-prcomp(na.omit(xscaleAP2),scale=FALSE,center=FALSE)

plot(xscaleAP2,main="AP Statistics data",xlab=twovar[1],ylab=twovar[2],asp=1)
abline(a=0,b=pcaAP2$rotation[2,1]/pcaAP2$rotation[1,1],lty=1,lwd=2)
abline(a=0,b=1,lty=2,col="green")
abline(lm(xscaleAP2[,2]~xscaleAP2[,1]),col="blue",lty=2)
legend("topleft",c("PCA line","LM line","Mean line"),fill=c("black","blue","green"))


This code is not important. It shows my simulated data in 3d.

set.seed(64289)
library(mvtnorm)
x3<-rmvnorm(150,sigma=cbind(c(1,.4,.5),c(.4,1,.8),c(.5,.8,1)))
colnames(x3)<-c("score1","score2","score3") #x3 is a matrix, so we set colnames, not names
library(scatterplot3d)
scatterplot3d(x3)


# not able to get interactive 3d working on computer. Freezes:
# library(rgl)
# plot3d(x3[,1],x3[,2],x3[,3])

This code is not terribly important. It shows another simulated dataset, only now one that lies more on a plane than a line.

set.seed(2358)
library(mvtnorm)
x3Plane<-rmvnorm(150,sigma=cbind(c(1,.4,0),c(.4,1,.8),c(0,.8,1)))
colnames(x3Plane)<-c("score1","score2","score3") #again a matrix, so colnames
library(scatterplot3d)
par(mfrow=c(2,2))
scatterplot3d(x3Plane,highlight.3d=TRUE)
scatterplot3d(x3Plane,highlight.3d=TRUE,angle=-40)
scatterplot3d(x3Plane,highlight.3d=TRUE,angle=-120)
scatterplot3d(x3Plane,highlight.3d=TRUE,angle=120)


# not able to get interactive 3d working on computer. Freezes:
# library(rgl)
# plot3d(x3[,1],x3[,2],x3[,3])

Now we return to our 2d simulation from before. This code is not very important. It demonstrates the line corresponding to the second component.

#point(x,y) projects to ax+by 
b2<-pca$rotation[2,2]/pca$rotation[1,2]
plot(xcenter,main="second component",xlim=range(xcenter),ylim=range(xcenter),asp=1)
abline(a=0,b=b2,lty=1,col="black")
abline(a=0,b=b,col="black",lty=2)
legend("topleft",c("1st PC","2nd PC"),lty=c(2,1))


This code is not important. It shows the projection to the 2nd component.

#point(x,y) projects to ax+by 
plot(xcenter,main="project to second component",xlim=range(xcenter),ylim=range(xcenter),asp=1)
abline(a=0,b=b2,lty=1,col="black")
abline(a=0,b=b,col="black",lty=2)
legend("topleft",c("1st PC","2nd PC"),lty=c(2,1))
newValuePCA2<-projectLine(a=0,b=b2,x=xcenter[,1],y=xcenter[,2])
segments(x0=xcenter[,1],y0=xcenter[,2],x1=newValuePCA2[,"x"],y1=newValuePCA2[,"y"],col="red")


This code is not important. This code demonstrates the coordinates of the data on 2 PCs.

plot(xcenter,main="project to second component",xlim=range(xcenter),ylim=range(xcenter),asp=1)
abline(a=0,b=b2,lty=1,col="black")
abline(a=0,b=b,col="black",lty=2)
legend("topleft",c("1st PC","2nd PC"),lty=c(2,1))
points(x=newValuePCA2[,"x"],y=newValuePCA2[,"y"],col="red")
points(x=newValuePCA[,"x"],y=newValuePCA[,"y"],col="red")
segments(x0=xcenter[whExample,1],y0=xcenter[whExample,2],x1=newValuePCA2[whExample,"x"],y1=newValuePCA2[whExample,"y"],col="blue")
segments(x0=xcenter[whExample,1],y0=xcenter[whExample,2],x1=newValuePCA[whExample,"x"],y1=newValuePCA[whExample,"y"],col="blue")
points(x=xcenter[whExample,1],y=xcenter[whExample,2],col="blue",pch=19)
points(x=newValuePCA2[whExample,"x"],y=newValuePCA2[whExample,"y"],col="blue",pch=19)
points(x=newValuePCA[whExample,"x"],y=newValuePCA[whExample,"y"],col="blue",pch=19)

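We can also plot the data in the new coordinate system given by the PC scores, again marking the example point: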

plot(pca$x,asp=1,main="Plotting of PCA coordinates (rotation)")
points(pca$x[whExample,1],pca$x[whExample,2],col="blue",pch=19,type="p")


This code is not important. We return to the 3d case, and show the PC coordinates of the 3d data.

pca3d<-prcomp(na.omit(x3Plane),scale=FALSE,center=TRUE)
par(mfrow=c(1,2))
scatterplot3d(x3Plane,highlight.3d=TRUE,angle=120)
plot(pca3d$x[,1:2],asp=1)


Examples on real data

Now we will look at useful code where we actually do PCA on real data.

Notice that I need to remove NA values to run PCA, because PCA cannot handle NA values. I could run na.omit on my data scorecard (this is what I do with my example above, in the section I told you not to look at). However, because I want to use variables in scorecard, like CONTROL, which defines public or private, I don't want to just remove those rows. Instead, I create an index so that I can remove them whenever I want. This information is saved in the output of na.omit (as part of its attributes, a rather esoteric feature that you don't need to know about).

whNACollege<-attributes(na.omit(scorecard[,-c(1:3,12)]))[["na.action"]]
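
To see what this attribute looks like, a toy example (made-up data):

df<-data.frame(a=c(1,NA,3),b=c(4,5,NA))
attributes(na.omit(df))[["na.action"]] #the indices of the dropped rows: 2 and 3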

Here we use the prcomp function to actually calculate the principal components of our data. The prcomp function has arguments center and scale to have the function center and scale the data (by the mean and the standard deviation). For teaching purposes, I do not want to scale the data yet.

pcaCollege<-prcomp(scorecard[-whNACollege,-c(1:3,12)],center=TRUE,scale=FALSE)

We then use the output of this to get the new scores, which are stored as the x element of the output of prcomp. As we've learned, PCA actually creates multiple scores (as many as there are variables), so the x element is a matrix, with each column corresponding to a new PC variable. They are also ordered, so that the first column is the first PC variable, etc.
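
As a sanity check (a sketch, not part of the original analysis), the scores are just the centered data multiplied by the matrix of coefficients that prcomp returns (the rotation, discussed below):

centered<-scale(scorecard[-whNACollege,-c(1:3,12)],center=TRUE,scale=FALSE)
head(centered %*% pcaCollege$rotation[,1,drop=FALSE]) #matches head(pcaCollege$x[,1])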

I will plot these scores, and color the points by whether they are public or private. This is where I make use of that whNACollege variable from before. Otherwise the scores from prcomp do not include the NA rows (because the input matrix didn't have them), while scorecard, which contains the categorical variable for public/private schools, does contain them, so the two would be of different lengths.

plot(pcaCollege$x[,1:2],col=c("red","black")[scorecard$CONTROL[-whNACollege]],asp=1)
legend("topright",c("public","private"),fill=c("red","black"))


Notice I set asp=1. This makes distances along the x and y axes equivalent (notice how the axes change if you resize the plot).

Now we run a PCA on the breast data. We already centered the breast data (with the median), so we will pick center=FALSE, but we will choose scale=TRUE. This time we will color the observations based on whether they are normal/cancer samples.

pcaBreast<-prcomp(breastCenteredMed,center=FALSE,scale=TRUE) #breastCenteredMed already holds only the expression columns
plot(pcaBreast$x[,1:2],col=typeCol[breast$TypeSample],asp=1,pch=19)
legend("bottomright",levels(breast$TypeSample),fill=typeCol)


We repeat the above, only now we remove the normal samples, color the points by estrogen receptor status, and pick a plotting symbol according to Progesteron status. We also use points to draw just the metastatic samples, bigger and with a different plotting symbol.

We have used different plotting symbols (argument pch). Some plotting symbols give you the ability to have one color for the outline of the shape and another color to fill the shape (you can see the help of points for which values of pch define such symbols). For these symbols, col refers to the outline of the symbol while bg defines the fill. Similarly, our legend has to make that distinction (pt.bg).

pcaBreastCancer<-prcomp(breastCenteredMed[whCancer,],center=FALSE,scale=TRUE)
whMets<-which(breast$Type[whCancer]=="Metastatic")
plot(pcaBreastCancer$x[,1:2],col="black",bg=estCol[breast$EstReceptor[whCancer]],asp=1,pch=c(21,22,24)[breast$Progesteron[whCancer]])
points(pcaBreastCancer$x[whMets,1:2],pch=8,cex=2)
legend("bottomright",levels(breast$EstReceptor),fill=estCol,title="Estrogen Receptor")
legend("bottomleft",levels(breast$Progesteron),pch=c(21,22,24),col="black",pt.bg="grey",title="Progesteron")


Now we draw a heatmap of the coefficients that create the new variables in the PCA of the college data. These coefficients are given in the output of prcomp, saved under the element rotation. Again, there are coefficients for all the PCs, so this is a matrix, with the rows being the original variables and the columns being the PCs. So the first column holds the coefficients that create the first PC variable: multiplying these values with the (centered) data values of an observation gives that observation's value of the first PC.

We choose not to cluster the columns (PCs), and simply plot them in the order they are given, by setting Colv=NA. We show only the top 2 PCs.

aheatmap(pcaCollege$rotation[,1:2],Colv=NA)


Here we calculate the correlations between the original variables in our college data and the new variables created by PCA. The result is a matrix, only this time the matrix is not symmetric: the rows correspond to the first argument of cor (the PC variables given by the $x element) and the columns to the second argument (the original variables in scorecard). Again we show only the correlations with the top 2 PCs.

corPCACollege<-cor(pcaCollege$x,scale(scorecard[-whNACollege,-c(1:3,12)],center=TRUE,scale=FALSE))
aheatmap(corPCACollege[1:2,],Colv=NA,col= seqPal2)


Now we draw a biplot of the result of prcomp on the college data. biplot is not a really great function, unfortunately; there is very little you can customize. Again, we are going to suppress the warnings from biplot so as to not clutter this document.

par(mfrow=c(1,2))
plot(pcaCollege$x[,1:2],col=c("red","black")[scorecard$CONTROL[-whNACollege]],asp=1)
legend("topright",c("public","private"),fill=c("red","black"))
suppressWarnings(biplot(pcaCollege,pch=19))


We repeat some of the same plots as above, only now for the AP Statistics scores.

whNAAP<-as.numeric(attributes(na.omit(apscores[,-c(1,3,10,13,21:29)]))[["na.action"]])
pcaAP<-prcomp(apscores[-whNAAP,-c(1,3,10,13,21:29)],center=TRUE,scale=FALSE)
par(mfrow=c(2,2))
plot(pcaAP$x[,1:2],asp=1)
biplot(pcaAP)
aheatmap(pcaAP$rotation[,1:2])


Here we repeat the PCA of the college data, only now scaling as well as centering the variables.

pcaCollegeScale<-prcomp(scorecard[-whNACollege,-c(1:3,12)],center=TRUE,scale=TRUE)
par(mfrow=c(1,2))
plot(pcaCollegeScale$x[,1:2],pch=19,col=c("red","black")[scorecard$CONTROL[-whNACollege]])
legend("topright",c("public","private"),fill=c("red","black"))
biplot(pcaCollegeScale,choices=c(1,2),pch=19)


We also redo the heatmap of the coefficients.

par(mfrow=c(1,2))
aheatmap(pcaCollegeScale$rotation[,1:2],Colv=NA,main="PCA Coefficients")
aheatmap(cor(scorecard[-whNACollege,-c(1:3,12)],pcaCollegeScale$x[,1:2]),Colv=NA,main="Correlation with Original Variables")


We plot the observations based on their coordinates in the 3rd and 4th PCs, along with a biplot of those dimensions. The argument choices lets you pick which PC variables are plotted in the biplot.

par(mfrow=c(1,2))
plot(pcaCollege$x[,3:4],pch=19,col=c("red","black")[scorecard$CONTROL[-whNACollege]])
legend("topright",c("public","private"),fill=c("red","black"))
suppressWarnings(biplot(pcaCollege,choices=3:4))


We repeat the above heatmap, but now we scale each coefficient in a PC by the standard deviation of the final variable it creates. prcomp returns these standard deviations in the $sdev element of its output. We will use the function sweep to multiply each PC (column) by its standard deviation.

par(mfrow=c(1,2))
aheatmap(pcaCollege$rotation,Colv=NA,main="Coefficients")
aheatmap(sweep(pcaCollege$rotation,2,pcaCollege$sdev,"*"),Colv=NA,main="Coefficients scaled by standard deviation")


We do the same thing for the breast dataset, starting with the coefficients. To make it relevant, we first replot the heatmap of the entire data matrix from above; then we plot the heatmap of the coefficients of the first 25 PCs.

par(mfrow=c(1,2))
breastHeat<-aheatmap(breastCenteredMed,Rowv = FALSE, Colv = FALSE,
    breaks=brksCentered,annRow=breast[,5:7],labRow=NA,col=seqPal2,
    annColors=list("TypeSample"=typeCol,"EstReceptor"=estCol,"Progesteron"=proCol))
aheatmap(t(pcaBreast$rotation[,1:25]),Colv=breastHeat$colInd,col=seqPal2,Rowv=NA)


We now do a heatmap of the coefficients of the top 25 PCs, scaled by their standard deviations.

par(mfrow=c(1,2))
aheatmap(breastCenteredMed,Rowv = FALSE, Colv = FALSE,
    breaks=brksCentered,annRow=breast[,5:7],labRow=NA,col=seqPal2,
    annColors=list("TypeSample"=typeCol,"EstReceptor"=estCol,"Progesteron"=proCol))
scaledPCs<-sweep(pcaBreast$rotation,2,pcaBreast$sdev,"*")
aheatmap(t(scaledPCs[,1:25]),Colv=breastHeat$colInd,col=seqPal2,Rowv=NA)


The function screeplot plots the variance of each PC as a function of the PC index. Here we do it for the simulated 3D data that lay on a plane.

screeplot(pca3d,type="lines",main="Simple simulated data in 3d")


Now we do a screeplot for the college and AP Statistics data.

par(mfrow=c(1,2))
screeplot(pcaCollege,type="lines",main="College")
screeplot(pcaAP,type="lines",main="AP Statistics")


We can also make screeplots where the y-axis is the variance of each component as a proportion of the sum of all the variances.

par(mfrow=c(1,2))
plot(pcaCollege$sdev^2/sum(pcaCollege$sdev^2),type="b",main="College",ylab="Proportion of variance")
plot(pcaAP$sdev^2/sum(pcaAP$sdev^2),type="b",main="AP Stats",ylab="Proportion of variance")
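
Relatedly, the cumulative proportion of variance is often used to decide how many PCs to keep; summary of a prcomp object reports the same quantities:

head(cumsum(pcaCollege$sdev^2)/sum(pcaCollege$sdev^2))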


And we repeat this for the breast cancer datasets.

par(mfrow=c(1,2))
plot(pcaBreast$sdev^2/sum(pcaBreast$sdev^2),type="b",main="Breast PCA",ylab="Proportion of variance",xlim=c(0,100))
plot(pcaBreastCancer$sdev^2/sum(pcaBreastCancer$sdev^2),type="b",main="Breast, Cancer only, PCA",ylab="Proportion of variance",xlim=c(0,100))
