Do not forget to change dataDir to point to the directory where the data are stored (see 01Probability_forClassCode.Rmd).
dataDir<-"../../finalDataSets"
# dataDir<-"."
We will start by reading in the college information data:
scorecard <- read.csv(file.path(dataDir,"college.csv"), stringsAsFactors = FALSE,na.strings = c("NA","PrivacySuppressed"))
We will exclude those that are for-profit institutes:
scorecard<-scorecard[-which(scorecard$CONTROL==3),]
Now we will do a pairs plot. The default pairs plot requires a matrix of numeric values, and some of our values are factors. So we will leave those off (the first 3 columns), as well as a number of other variables. Notice how I can remove a range of values with the - (and how I have to put c() around it).
smallScores<-scorecard[,-c(1:3,4,5,6,9,11,14:17,18:22,24:27,31)]
pairs(smallScores)
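As a quick aside, here is a minimal illustration of that negative indexing on a toy matrix (hypothetical data, not the scorecard):
m <- matrix(1:20, ncol=5)
m[, -c(2:4)]   # drops columns 2 through 4, keeping columns 1 and 5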
We can improve the pairs function by providing alternative functions to draw the lower, upper, or diagonal panels. Here we use a function provided in the help of pairs to draw a histogram of each variable on the diagonal, instead of just printing the variable's name. A function we will see later (gpairs) does this automatically, so don't stress about following this.
panel.hist <- function(x, ...)
{
    # from help of pairs: draw a histogram of x in the diagonal panel
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5))
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y)
}
Now we call pairs, setting the argument diag.panel to the function we just wrote. We also give the scatterplots a loess line through them by using panel.smooth as our panel function (this is a built-in function, like smoothScatter). And we color the points by whether the school is public or private using the col argument.
pairs(smallScores, lower.panel = panel.smooth, col=c("red","black")[smallScores$CONTROL],diag.panel = panel.hist)
Now we make the upper and lower panels different. The upper panel will now print the correlation of the variables, where the size of the printed text is proportional to the size of the correlation (again, this is a function from the help of pairs). A function we will see later (gpairs) does this automatically, so don't stress about following this.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    # from help of pairs: print the (absolute) correlation, sized by its magnitude
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y, use = "pairwise.complete.obs"))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(smallScores, lower.panel = panel.smooth, upper.panel = panel.cor,col=c("red","black")[smallScores$CONTROL],diag.panel = panel.hist)
Using the function corrgram we can visualize the correlations for a large number of variables. This takes a bit of time to render. It also outputs an enormous number of warnings, so we are going to suppress those warnings so as to not clutter up this document.
library(gpairs)
suppressWarnings(corrgram(scorecard[,-c(1:3)]))
We will read in data from the wellbeing survey. We will set stringsAsFactors=TRUE (the default in versions of R before 4.0) to automatically convert strings to factors.
wellbeing <- read.csv(file.path(dataDir,"wellbeing.csv"), stringsAsFactors = TRUE, header = TRUE)
Now we will do a number of adjustments of the variables to improve them for future use. First we will make a variable where the Year is a factor, rather than a number. This can be helpful sometimes.
wellbeing$YearFactor<-factor(wellbeing$Gss.year.for.this.respondent)
Now we will categorize the year into decades. We can do this using the cut command, which bins a numeric variable into intervals defined by the breaks argument.
wellbeing$Decade<-cut(wellbeing$Gss.year.for.this.respondent,breaks=c(1970,1980,1990,2000,2010,2015),labels=c("1970s","1980s","1990s","2000s","2010s"))
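As a quick sanity check on the binning, we can tabulate the new variable (note that cut's intervals are open on the left by default, so a value exactly equal to the lowest break would become NA):
table(wellbeing$Decade, useNA="ifany")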
By default the text variables were made factors, and the order of the levels was alphabetical. However, these factor levels have a natural order to them – from greater happiness to lesser happiness. Since all the commands that work with factors use this ordering of the levels, we would like the order to follow this pattern. We will write a small function that does this, since we want to apply it to several variables.
# change order of levels so that they make more sense
changeOrderLevels <- function(x, newLevels){
    if(is.factor(x)){
        oldLevels <- levels(x)
        if(!all(sort(oldLevels) == sort(newLevels))) stop("newLevels is not just a reordering of the old levels")
    }
    else{
        if(!all(sort(unique(x)) == sort(newLevels))) stop("newLevels is not just a reordering of the existing values")
    }
    return(factor(x, levels = newLevels))
}
wellbeing$General.happiness<-changeOrderLevels(wellbeing$General.happiness,newLevels=c("Very happy","Pretty happy","Not too happy","Don't know","Not applicable","No answer"))
wellbeing$Job.or.housework<-changeOrderLevels(wellbeing$Job.or.housework,newLevels=c("Very satisfied","Mod. satisfied","A little dissat","Very dissatisfied","Don't know","Not applicable","No answer"))
wellbeing$Satisfaction.with.financial.situation<-changeOrderLevels(wellbeing$Satisfaction.with.financial.situation,newLevels=c("Satisfied","More or less","Not at all sat","Don't know","Not applicable","No answer"))
wellbeing$Is.life.exciting.or.dull<-changeOrderLevels(wellbeing$Is.life.exciting.or.dull,newLevels=c("Exciting","Routine","Dull","Don't know","Not applicable","No answer"))
wellbeing$Happiness.of.marriage<-changeOrderLevels(wellbeing$Happiness.of.marriage,newLevels=c("Very happy","Pretty happy","Not too happy","Don't know","Not applicable","No answer"))
Now we will make a dataset that only consists of the recent responses (since 2000).
wellbeingRecent<-wellbeing[wellbeing$Decade %in% c("2000s","2010s"),]
Our last step will be to drop any unused levels in our data set using the droplevels command. This keeps empty (zero-count) levels of any factor from being carried around.
wellbeingRecent<-droplevels(wellbeingRecent)
We will now make a table of the different levels of the factor General.happiness, and also a barplot.
table(wellbeingRecent$General.happiness)
##
## Very happy Pretty happy Not too happy Don't know Not applicable
## 4270 7979 1991 25 4383
## No answer
## 18
barplot(table(wellbeingRecent$General.happiness))
We use the function gpairs to plot a pairs plot that takes into account categorical variables. (You may have to install the package gpairs.) In order for this to work, we need to designate the variable CONTROL as a factor so that gpairs knows to treat it as a categorical variable.
library(gpairs)
smallScores$CONTROL<-factor(smallScores$CONTROL,levels=c(1,2),labels=c("public","private"))
Now we run gpairs. To save time, we have changed the panels above the diagonal to show boxplots, like the panels below the diagonal. The default is a rug plot, which can be nice for small numbers of observations, but takes a long time to plot otherwise. (You can delete the upper.pars=... argument and see what it looks like.)
gpairs(smallScores,lower.pars=list(scatter="loess"),upper.pars=list(scatter="loess",conditional = "boxplot"),
scatter.pars = list( col = c("red","black")[smallScores$CONTROL]))
We now use the table command to create a contingency table between two of the variables in the wellbeing data.
tabGeneralJob<-with(wellbeingRecent,table(General.happiness,Job.or.housework))
tabGeneralJob
## Job.or.housework
## General.happiness Very satisfied Mod. satisfied A little dissat
## Very happy 2137 843 154
## Pretty happy 2725 2569 562
## Not too happy 436 527 247
## Don't know 11 1 4
## Not applicable 204 134 36
## No answer 8 2 1
## Job.or.housework
## General.happiness Very dissatisfied Don't know Not applicable No answer
## Very happy 61 25 1011 39
## Pretty happy 213 61 1776 73
## Not too happy 161 39 549 32
## Don't know 0 1 8 0
## Not applicable 12 1 3990 6
## No answer 3 0 4 0
We can use barplot to visualize the relationship between two categorical variables.
barplot(tabGeneralJob,legend=TRUE)
To be able to tell the two variables apart, we will change the names of the columns and rows of the contingency table (we could also do this in other, prettier ways…).
colnames(tabGeneralJob)<-paste(colnames(tabGeneralJob),"(Job)")
rownames(tabGeneralJob)<-paste(rownames(tabGeneralJob),"(General)")
barplot(tabGeneralJob,legend=TRUE)
We can also switch which variable is plotted by taking the transpose of the contingency table using the t() function (“transpose” just means flipping the matrix so the rows become columns and the columns become rows).
barplot(t(tabGeneralJob),legend=TRUE)
Now we plot the barplot so that we get separate bars for each level. We also adjust the colors using the col argument.
barplot(tabGeneralJob,beside=TRUE,legend=TRUE,col=palette()[1:6])
We can convert our contingency table into proportions within each category of job satisfaction using the function prop.table. The argument margin indicates whether we make the proportions out of the columns or the rows. So margin=2 means that the columns will now sum to 1, i.e. each column, corresponding to a response to Job Satisfaction, has been converted into proportions.
prop.table(tabGeneralJob,margin=2)
## Job.or.housework
## General.happiness Very satisfied (Job) Mod. satisfied (Job)
## Very happy (General) 0.3870675602 0.2068204122
## Pretty happy (General) 0.4935700054 0.6302747792
## Not too happy (General) 0.0789712009 0.1292934249
## Don't know (General) 0.0019923927 0.0002453386
## Not applicable (General) 0.0369498279 0.0328753680
## No answer (General) 0.0014490129 0.0004906771
## Job.or.housework
## General.happiness A little dissat (Job) Very dissatisfied (Job)
## Very happy (General) 0.1533864542 0.1355555556
## Pretty happy (General) 0.5597609562 0.4733333333
## Not too happy (General) 0.2460159363 0.3577777778
## Don't know (General) 0.0039840637 0.0000000000
## Not applicable (General) 0.0358565737 0.0266666667
## No answer (General) 0.0009960159 0.0066666667
## Job.or.housework
## General.happiness Don't know (Job) Not applicable (Job)
## Very happy (General) 0.1968503937 0.1377759608
## Pretty happy (General) 0.4803149606 0.2420278005
## Not too happy (General) 0.3070866142 0.0748160262
## Don't know (General) 0.0078740157 0.0010902153
## Not applicable (General) 0.0078740157 0.5437448896
## No answer (General) 0.0000000000 0.0005451077
## Job.or.housework
## General.happiness No answer (Job)
## Very happy (General) 0.2600000000
## Pretty happy (General) 0.4866666667
## Not too happy (General) 0.2133333333
## Don't know (General) 0.0000000000
## Not applicable (General) 0.0400000000
## No answer (General) 0.0000000000
barplot(prop.table(tabGeneralJob,margin=2),beside=TRUE)
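If the margin argument is confusing, here is a minimal toy example (hypothetical numbers, not the survey data):
m <- matrix(c(2, 1, 3, 4), nrow = 2)
prop.table(m, margin = 2)   # each column sums to 1 (margin=1 would make each row sum to 1)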
We can of course switch these, and instead make each row a proportion, using margin=1. To appropriately plot this with barplot, however, we need to flip the matrix (i.e. take the transpose).
prop.table(tabGeneralJob,margin=1)
## Job.or.housework
## General.happiness Very satisfied (Job) Mod. satisfied (Job)
## Very happy (General) 0.5004683841 0.1974238876
## Pretty happy (General) 0.3415214939 0.3219701717
## Not too happy (General) 0.2189854345 0.2646911100
## Don't know (General) 0.4400000000 0.0400000000
## Not applicable (General) 0.0465434634 0.0305726671
## No answer (General) 0.4444444444 0.1111111111
## Job.or.housework
## General.happiness A little dissat (Job) Very dissatisfied (Job)
## Very happy (General) 0.0360655738 0.0142857143
## Pretty happy (General) 0.0704348916 0.0266950746
## Not too happy (General) 0.1240582622 0.0808638875
## Don't know (General) 0.1600000000 0.0000000000
## Not applicable (General) 0.0082135524 0.0027378508
## No answer (General) 0.0555555556 0.1666666667
## Job.or.housework
## General.happiness Don't know (Job) Not applicable (Job)
## Very happy (General) 0.0058548009 0.2367681499
## Pretty happy (General) 0.0076450683 0.2225842837
## Not too happy (General) 0.0195881467 0.2757408338
## Don't know (General) 0.0400000000 0.3200000000
## Not applicable (General) 0.0002281542 0.9103353867
## No answer (General) 0.0000000000 0.2222222222
## Job.or.housework
## General.happiness No answer (Job)
## Very happy (General) 0.0091334895
## Pretty happy (General) 0.0091490162
## Not too happy (General) 0.0160723255
## Don't know (General) 0.0000000000
## Not applicable (General) 0.0013689254
## No answer (General) 0.0000000000
barplot(t(prop.table(tabGeneralJob,margin=1)),beside=TRUE,legend=TRUE)
We can also make a contingency table with more than two variables; however, the output is not easy to read, so we leave it commented out:
## with(wellbeingRecent,table(General.happiness,Job.or.housework,Happiness.of.marriage))
We can also calculate the values in a contingency table using aggregate. We have to create a variable Freq in our data.frame that indicates that each row of the data.frame should count as 1 when we sum; then we can use formula notation to tell aggregate that we want to sum the variable Freq for each combination of the values of General.happiness and Job.or.housework.
wellbeingRecent$Freq<-1
wellbeingAggregates<-aggregate(Freq~ General.happiness+Job.or.housework ,data=wellbeingRecent[,-2],FUN=sum)
head(wellbeingAggregates,10)
## General.happiness Job.or.housework Freq
## 1 Very happy Very satisfied 2137
## 2 Pretty happy Very satisfied 2725
## 3 Not too happy Very satisfied 436
## 4 Don't know Very satisfied 11
## 5 Not applicable Very satisfied 204
## 6 No answer Very satisfied 8
## 7 Very happy Mod. satisfied 843
## 8 Pretty happy Mod. satisfied 2569
## 9 Not too happy Mod. satisfied 527
## 10 Don't know Mod. satisfied 1
We can extend this format easily to many variables:
wellbeingAggregatesBig<-aggregate(Freq~ General.happiness+Job.or.housework +Satisfaction.with.financial.situation+Happiness.of.marriage+Is.life.exciting.or.dull,data=wellbeingRecent[,-2],FUN=sum)
head(wellbeingAggregatesBig,5)
## General.happiness Job.or.housework Satisfaction.with.financial.situation
## 1 Very happy Very satisfied Satisfied
## 2 Pretty happy Very satisfied Satisfied
## 3 Not too happy Very satisfied Satisfied
## 4 Very happy Mod. satisfied Satisfied
## 5 Pretty happy Mod. satisfied Satisfied
## Happiness.of.marriage Is.life.exciting.or.dull Freq
## 1 Very happy Exciting 333
## 2 Very happy Exciting 54
## 3 Very happy Exciting 3
## 4 Very happy Exciting 83
## 5 Very happy Exciting 38
With the data in this format (i.e. from aggregate) we can make an alluvial plot that tracks the data through the different categories. (You may have to install the package alluvial.) I chose to color-code the lines of the alluvial plot according to their value of General.happiness, so this is the variable I can track through the plot.
library(alluvial)
alluvial( wellbeingAggregates[,c("General.happiness","Job.or.housework")], freq=wellbeingAggregates$Freq,
col=palette()[wellbeingAggregates$General.happiness])
Next we plot the alluvial plot for our big cross-tabulation. I want to use all of the columns except for the Freq column, so now I remove that column, which is the last column. This command takes a while for R to render, so it is commented out. You can run it by uncommenting the following code:
## ###Takes a bit of time...
## alluvial( wellbeingAggregatesBig[,-ncol(wellbeingAggregatesBig)], freq=wellbeingAggregatesBig$Freq,
## col=palette()[wellbeingAggregatesBig$General.happiness])
We are going to remove observations that are Not applicable in any of the questions of interest, as well as those answering No answer or Don't know.
# remove those that are 'Not applicable' in any of the three questions
wh<-with(wellbeingRecent,which(General.happiness=="Not applicable" | Job.or.housework =="Not applicable" | Satisfaction.with.financial.situation=="Not applicable"))# & Happiness.of.marriage=="Not applicable" & Is.life.exciting.or.dull=="Not applicable"))
wellbeingCondenseGroups<-wellbeingRecent[-wh,]
# keep only those that are not 'No answer' or 'Don't know' in any of the variables of interest
wellbeingCondenseGroups<-subset(wellbeingCondenseGroups,!General.happiness%in%c("No answer","Don't know") & !Job.or.housework %in%c("No answer","Don't know") & !Satisfaction.with.financial.situation%in%c("No answer","Don't know") & !Happiness.of.marriage%in%c("No answer","Don't know") & !Is.life.exciting.or.dull%in%c("No answer","Don't know") )
Now we'll get rid of any unused levels
wellbeingCondenseGroups<-droplevels(wellbeingCondenseGroups)
And now we'll create new aggregates for this new dataset
wellbeingCondenseAggregates<-aggregate(Freq~ General.happiness+Job.or.housework +Satisfaction.with.financial.situation+Happiness.of.marriage+Is.life.exciting.or.dull,data=wellbeingCondenseGroups,FUN=sum)
Now we will redraw the alluvial plot (again, it takes a while, so it is commented out). We can also hide particularly small subsets of the data to make a less messy plot, using the hide argument. Basically you give hide a logical vector indicating which cross-tabulations not to plot; here we only plot the top 50% of our cross-tabulations so as to get rid of the very small ones.
## alluvial( wellbeingCondenseAggregates[,-ncol(wellbeingCondenseAggregates)], freq=wellbeingCondenseAggregates$Freq,hide = wellbeingCondenseAggregates$Freq < quantile(wellbeingCondenseAggregates$Freq, .50),
## col=palette()[wellbeingCondenseAggregates$General.happiness])
Now we will focus only on those that are married and working/keeping house, to again remove some of the small noisy columns.
wh<-with(wellbeingCondenseGroups,which(Marital.status=="Married" & Labor.force.status %in% c("Working fulltime","Working parttime","Keeping house")))
wellbeingMarried<-wellbeingCondenseGroups[wh,]
wellbeingMarried<-droplevels(wellbeingMarried)
wellbeingMarriedAggregates<-aggregate(Freq~ General.happiness+Job.or.housework +Satisfaction.with.financial.situation+Happiness.of.marriage+Is.life.exciting.or.dull,data=wellbeingMarried,FUN=sum)
Now we will redraw the alluvial plot for the married, working people. This subset is small enough that we can run it here:
alluvial( wellbeingMarriedAggregates[,-ncol(wellbeingMarriedAggregates)], freq=wellbeingMarriedAggregates$Freq,hide = wellbeingMarriedAggregates$Freq < quantile(wellbeingMarriedAggregates$Freq, .50),
col=palette()[wellbeingMarriedAggregates$General.happiness])
As another example, we draw an alluvial plot for the built-in Titanic data, coloring each line by whether the passengers survived:
data(Titanic)
tit<-as.data.frame(Titanic)
alluvial( tit[,1:4], freq=tit$Freq, border=NA,
col=ifelse( tit$Survived == "No", "red", "gray") )
Now we draw a mosaic plot for two of these variables.
mosaicplot(~General.happiness+Job.or.housework,data=wellbeingMarried,las=1,col=palette())
Now we draw a mosaic plot for three of these variables.
mosaicplot(~General.happiness+Job.or.housework+Satisfaction.with.financial.situation,data=wellbeingMarried,las=1,col=palette())
We are going to return to our flight data. We are going to fix up some of the variables by making them factors and giving their levels relevant names.
flightSFOSRS<-read.table(file.path(dataDir,"SFO_SRS.txt"),
sep="\t",header=TRUE, stringsAsFactors = TRUE)
flightSFOSRS$DayOfWeek<-factor(flightSFOSRS$DayOfWeek,levels=c(1:7),
labels=c("M","Tu","W","Th","F","Sa","Su"))
flightSFOSRS$Cancelled<-factor(flightSFOSRS$Cancelled,levels=0:1,labels=c("No","Yes"))
flightSFOSRS$Diverted<-factor(flightSFOSRS$Diverted,levels=0:1,labels=c("No","Yes"))
flightSFOSRS$CancellationCode<-factor(flightSFOSRS$CancellationCode,levels=c("A","B","C","D",""),labels=c("Carrier","Weather","NationalAirSystem","Security","NotCancelled"))
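As a quick cross-check on the recoding (the exact counts depend on the data, but every flight with Cancelled equal to No should be NotCancelled):
with(flightSFOSRS, table(Cancelled, CancellationCode))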
We will also create a variable that indicates the cause of the delay (there is no such variable in the data, only the amount of delay time attributed to each cause).
whDelayed<-which(!is.na(flightSFOSRS$DepDelay) & flightSFOSRS$DepDelay>15)
flightSFOSRS$DelayCause<-"NoDelay"
flightSFOSRS$DelayCause[flightSFOSRS$Cancelled=="Yes"]<-NA # Cancelled is now a factor with levels "No"/"Yes"
delayVars<-c("CarrierDelay","WeatherDelay","NASDelay","SecurityDelay" ,"LateAircraftDelay")
delayCause <- apply(flightSFOSRS[whDelayed, delayVars] > 0, 1, function(x){
    if(all(is.na(x))) return(NA)
    else{
        if(any(x)){
            if(sum(x) == 1) return(gsub("Delay", "", delayVars[x]))
            else return("Multiple")
        }
        else "NoDelay"
    }
})
flightSFOSRS$DelayCause[whDelayed]<-delayCause
flightSFOSRS$DelayCause<-factor(flightSFOSRS$DelayCause)
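As a quick check on the new variable (the exact counts depend on the data):
table(flightSFOSRS$DelayCause, useNA="ifany")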
Now we make a pairs plot using gpairs. We can see that the panels corresponding to two categorical variables show up as mosaic plots.
gpairs(droplevels(flightSFOSRS[whDelayed,c("AirTime","DepDelay","DayOfWeek","DelayCause")]),upper.pars=list(conditional = "boxplot"))
We will read in the data on gene expression from breast tumors.
breast<-read.csv(file.path(dataDir,"highVarBreast.csv"))
We can visualize the correlations of our variables. We will not use the corrgram function, however; that function is only appropriate for a small number of variables. We will instead create a heatmap. There are many heatmap functions in R. The default heatmap comes with standard R, but is not very good. Better options are heatmap.2 in the gplots package; heatmap.plus in the heatmap.plus package; and aheatmap in the NMF package. I favor the pheatmap function in the pheatmap package. It's got a couple of strange quirks, but I find it generally gives the nicest pictures.
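If you have not installed the pheatmap package yet, you can do so once with install.packages (commented out here so it does not run on every knit):
# install.packages("pheatmap")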
library(pheatmap)
Now to plotting the correlations. First we are going to calculate all the correlations with the cor function, which calculates the correlations among the columns of a matrix. Notice we need to remove the non-continuous variables.
corMat<-cor(breast[,-c(1:7)])
Now we run pheatmap. I have suppressed the reordering of the samples by turning off both row and column clustering (cluster_rows = FALSE and cluster_cols = FALSE). This was for teaching purposes during class, to show what it would look like without clustering, not what you would actually want to do.
pheatmap(corMat,cluster_rows = FALSE, cluster_cols = FALSE )
Now we will draw the same heatmap with the rows and columns clustered. However, I chose not to show the hierarchical clustering trees by setting treeheight_row = 0 and treeheight_col = 0. Again this was for teaching purposes, but it is actually sometimes useful if you don't want the clutter of the dendrograms. You'll see this same code a good deal, because I don't actually get to the idea of clustering until later.
pheatmap(corMat,cluster_rows = TRUE, cluster_cols = TRUE, treeheight_row =0, treeheight_col = 0)
Now we do a heatmap of the actual data, rather than the correlation.
pheatmap(breast[,-c(1:7)],cluster_rows = TRUE, cluster_cols = TRUE, treeheight_row =0, treeheight_col = 0)
Now we are going to demonstrate some options with pheatmap to show how you can customize it. First, we want to draw colors next to the samples to identify different characteristics of the samples (e.g. normal or cancer samples). So I am going to create vectors of colors, and then give names to each vector that exactly match the levels of the corresponding variable. Notice that I am NOT making the vector of colors the same length as the factor (which is what I would do in, say, plot); pheatmap will handle that internally. I will do this for three variables: TypeSample, EstReceptor, and Progesteron.
typeCol<-c("red","black","yellow")
names(typeCol)<-levels(breast$TypeSample)
estCol<-palette()[c(4,3,5)]
names(estCol)<-levels(breast$EstReceptor)
proCol<-palette()[5:7]
names(proCol)<-levels(breast$Progesteron)
The last change I'm going to make is to have my colors “max out” after a certain point, so that all values greater than a certain amount get the same color. This way, I don't “waste” a lot of colors trying to show the value of outliers. I will use the 1% and 99% quantiles of the data as the cutoffs.
qnt<-quantile(as.numeric(data.matrix((breast[,-c(1:7)]))),c(0.01,.99))
Now I make a vector of 20 break points. These will define my bins.
brks<-seq(qnt[1],qnt[2],length=20)
head(brks)
## [1] -5.744770 -4.949516 -4.154261 -3.359006 -2.563751 -1.768496
Now I'm going to change the color scheme of the heatmap by giving a smooth gradient of colors to pheatmap, using the colorRampPalette function, which creates a vector of colors that smoothly interpolates between the colors you give it. This is a color scheme I made up that I use frequently and like.
seqPal5<-colorRampPalette(c("black","navyblue","mediumblue","dodgerblue3","aquamarine4","green4","yellowgreen","yellow"))(length(brks)-1)
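To see what colorRampPalette is doing, here is a minimal standalone example: it returns a function, which you then call with the number of colors you want.
pal <- colorRampPalette(c("black", "yellow"))
pal(5)   # 5 hex colors interpolating from black to yellow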
I'm almost set; I only have to decide what annotation I want to show. I want it to be the columns saved in the data that are “TypeSample” (normal/cancer), “EstReceptor” (the estrogen receptor status), and “Progesteron” (the progesterone receptor status). One quirk of pheatmap: to do this labelling, you must have row names for both your data matrix and the annotation matrix, and they must match. So we are going to give row names to our breast dataset (that are just numbers!).
row.names(breast)<-c(1:nrow(breast))
Now I draw the heatmap with all of these features. annotation_row gives the variable(s) that define the colors I want to draw next to the samples/rows (the argument annotation_col does the same thing for columns). annotation_colors is a list that gives the colors for each of these variables. Note that the names of the list exactly match the names of the variables in annotation_row. Note also that if you don't give colors, pheatmap will automatically create some.
fullHeat<-pheatmap(breast[,-c(1:7)],cluster_rows = TRUE, cluster_cols = TRUE,
treeheight_row =0, treeheight_col = 0,
color=seqPal5,
breaks=brks,
annotation_row=breast[,5:7],
annotation_colors = list("TypeSample"=typeCol,"EstReceptor"=estCol,"Progesteron"=proCol))
Now we will draw a heatmap for only the cancer samples. Notice I don't have to change much of the setup from above.
whCancer<-which(breast$Type!="Normal")
pheatmap(breast[whCancer,-c(1:7)],cluster_rows = TRUE, cluster_cols = TRUE,
treeheight_row =0, treeheight_col = 0,
color=seqPal5,
breaks=brks,
annotation_row=breast[whCancer,5:7],
annotation_colors = list("TypeSample"=typeCol,"EstReceptor"=estCol,"Progesteron"=proCol))
Here we are going to “center” the data, by subtracting out the median of each variable so that the variables are on a closer scale. I am actually going to center the data in two ways, first by subtracting off the mean and then the median, just to show you a few more functions.
There is a simple function for subtracting off the mean, called scale. This function subtracts the mean of each variable (column) and returns a matrix that has been centered. There is also an option to divide each variable by its standard deviation to put them on the same scale (scale=TRUE would do that).
breastCenteredMean<-scale(breast[,-c(1:7)],center=TRUE,scale=FALSE)
To subtract off the median requires a bit more work. First we need to calculate the median of each variable, which we will do with apply. This returns a vector whose length is the number of variables.
colMedian<-apply(breast[,-c(1:7)],2,median)
Then we need to subtract them from every column. There is a function sweep that will take a vector and subtract it from every column of a matrix. It works similarly to apply. The first argument is the matrix/data.frame; the second argument, like in apply, is MARGIN, which defines whether you are sweeping over the rows (1) or the columns (2). Then you give the vector that you want to subtract, and finally the operation you want to perform (“-”, “+”, “/”, “*” are common ones, though you could write your own).
breastCenteredMed<-sweep(breast[,-c(1:7)],MARGIN=2, colMedian,"-")
If this feels like something you could have just done with a call to apply, you are right! This is just such a common task that there is a helper function. In particular, sweep guarantees that the output will be the same dimension as the input, which is frequently not the case for apply (which sometimes winds up “flipping” the rows and columns on you).
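Here is a minimal sketch of that point on a toy matrix: sweep keeps the dimensions of its input, while apply over the rows returns its results as columns, flipping the matrix.
m <- matrix(1:6, nrow=2)
rowMed <- apply(m, 1, median)
sweep(m, 1, rowMed, "-")                  # 2 x 3, same shape as m
apply(m, 1, function(x) x - median(x))    # 3 x 2: apply has flipped it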
Now we are going to create breaks for this centered data that are symmetric around zero, and also chop off the really outlying points.
qnt<-max(abs(quantile(as.numeric(data.matrix(breastCenteredMed)),c(0.01,.99)))) # breastCenteredMed already excludes the 7 non-numeric columns
brksCentered<-seq(-qnt,qnt,length=50)
We are also going to create a different color scale. Here's another one that I like for centered data.
seqPal2<- colorRampPalette(c("orange","black","blue"))(length(brksCentered)-1)
seqPal2<-(c("yellow","gold2",seqPal2))
seqPal2<-rev(seqPal2)
pheatmap(breastCenteredMed[whCancer,],cluster_rows = TRUE, cluster_cols = TRUE,
treeheight_row =0, treeheight_col = 0,
color=seqPal2,
breaks=brksCentered,
annotation_row=breast[whCancer,6:7],
annotation_colors = list("TypeSample"=typeCol,"EstReceptor"=estCol,"Progesteron"=proCol))
Now we are going to work with a small subset of genes and samples so that we can more clearly see the dendrograms from the clustering. By not setting treeheight_row to zero, we keep the dendrogram for the rows (both rows and columns are still clustered).
smallBreast<-read.csv(file.path(dataDir,"smallVarBreast.csv"),header=TRUE)
row.names(smallBreast)<-1:nrow(smallBreast)
The following command shows the dendrogram, but only for the rows:
pheatmap(smallBreast[,-c(1:7)],cluster_rows = TRUE, cluster_cols = FALSE, treeheight_col = 0,
breaks=brks,col=seqPal5)
The following command shows the dendrogram for both:
pheatmap(smallBreast[,-c(1:7)],cluster_rows = TRUE, cluster_cols = TRUE,
breaks=brks,col=seqPal5,annotation_row=smallBreast[,5:7],
annotation_colors=list("TypeSample"=typeCol,"EstReceptor"=estCol,"Progesteron"=proCol))
We read in data on students taking an AP Statistics class and draw a heatmap of the correlations among their scores.
apscores<-read.csv(file = file.path(dataDir,"AP_Statistics_Predictions_2013-16.csv"), header = TRUE, stringsAsFactors = TRUE)
library(pheatmap)
#21-29 are various conversions of earlier values (26 is actual AP score)
pheatmap(cor(apscores[,-c(1,3,10,13,21:29)],use="pairwise.complete.obs"))
To run PCA, I need only continuous values, so I am going to remove variables 1-3 and 12; these include INSTNM (the institution's name), STABBR (the institution's state), and CONTROL (whether the school is public or private).
Notice that I also need to remove NA values to run PCA, because PCA cannot handle them. I could run na.omit on my data scorecard (this is what I do with my example above in the section I told you not to look at). However, because I want to use variables in scorecard, like CONTROL, which defines public or private, I don't want to just remove the rows. Instead I want to create an index so that I can remove them whenever I want. This information is saved in the output of na.omit (as part of its attributes, a rather esoteric feature that you don't need to know about).
whNACollege<-attributes(na.omit(scorecard[,-c(1:3,12)]))[["na.action"]]
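To see what this attribute looks like, here is a toy example (hypothetical data, not the scorecard):
df <- data.frame(a=c(1,NA,3), b=c(4,5,NA))
attributes(na.omit(df))[["na.action"]]   # the indices of the dropped rows (2 and 3)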
Here we use the prcomp function to actually calculate the principal components of our data. The prcomp function has the arguments scale and center to have the function center and scale the data (by the mean and the standard deviation). For teaching purposes, I do not want to scale it yet.
pcaCollege<-prcomp(scorecard[-whNACollege,-c(1:3,12)],center=TRUE,scale=FALSE)
We then use the output of this to get the new scores, which are stored as the x element of the output of prcomp. As we've learned, there are actually multiple scores created by PCA (equal to the number of variables), so the x element is a matrix, with each column corresponding to a new PC variable. They are also ordered, so that the first column is the first PC variable, etc.
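We can confirm this structure directly (the exact dimensions depend on how many rows had NAs):
dim(pcaCollege$x)        # observations x number of PCs
head(pcaCollege$x[,1:2]) # scores on the first two PCs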
I will plot these scores and color the points by whether they are public or private. This is where I will make use of that whNACollege variable from before. Otherwise the lengths would not match: the result of prcomp gives scores that do not include the NA rows (because the input matrix didn't have them), while scorecard, which contains the categorical variable for public/private schools, does contain them.
plot(pcaCollege$x[,1:2],col=c("red","black")[scorecard$CONTROL[-whNACollege]],asp=1)
legend("topright",c("public","private"),fill=c("red","black"))
Notice I set asp=1. This makes distances on the x and y axes equivalent (notice how the axes change if you rescale the plot).
Now we run a PCA on the breast data. We already centered the breast data (with the median), so we will pick center=FALSE, but we will choose scale=TRUE. This time we will color the observations based on whether they are normal/cancer samples.
pcaBreast<-prcomp(breastCenteredMed,center=FALSE,scale=TRUE)
plot(pcaBreast$x[,1:2],col=typeCol[breast$TypeSample],asp=1,pch=19)
legend("bottomright",levels(breast$TypeSample),fill=typeCol)
We repeat the above, only now we remove the normal samples, color the points by estrogen receptor status, and pick a plotting symbol according to Progesteron status. We also use points to draw just the metastatic samples bigger and with a different plotting symbol.
We have used different plotting symbols (argument pch). Some plotting symbols give you the ability to have one color for the outline of the shape and another color to fill in the shape (you can see the help of points for which values of pch define such symbols). For these symbols, col refers to the outline of the symbol while bg defines the fill. Our legend has to make the same distinction.
pcaBreastCancer<-prcomp(breastCenteredMed[whCancer,],center=FALSE,scale=TRUE)
whMets<-which(breast$Type[whCancer]=="Metastatic")
plot(pcaBreastCancer$x[,1:2],col="black",bg=estCol[breast$EstReceptor[whCancer]],asp=1,pch=c(21,22,24)[breast$Progesteron[whCancer]])
points(pcaBreastCancer$x[whMets,1:2],pch=8,cex=2)
legend("bottomright",levels(breast$EstReceptor),fill=estCol,title="Estrogen Receptor")
legend("bottomleft",levels(breast$Progesteron),pch=c(21,22,24),col="black",pt.bg="grey",title="Progesteron")
Now we draw a heatmap of the coefficients that go into the new variables created by the PCA of the college data. These coefficients are given in the output of prcomp and are saved under the element rotation. Again, there are coefficients for all the PCs, so this is a matrix, with the rows being the original variables and the columns being the PCs. The first column gives the coefficients that create the first PC variable, and so on. Multiplying these coefficients with the (centered) data values of an observation in the college data gives that observation's value of the new variable.
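We can verify that relationship numerically; a minimal sketch, relying on the fact that we ran prcomp with center=TRUE and scale=FALSE:
centered <- scale(scorecard[-whNACollege,-c(1:3,12)], center=TRUE, scale=FALSE)
head(centered %*% pcaCollege$rotation[,1])   # should match head(pcaCollege$x[,1])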
We choose to not cluster the columns (PCs) and simply plot them in the order they are given, by setting cluster_cols=FALSE. We show only the top 2 PCs.
pheatmap(pcaCollege$rotation[,1:2],cluster_cols=FALSE)
Here we calculate the correlation between the original variables in our college data and the new variables created by PCA. These are given in a matrix, only this time the matrix is not symmetric. The rows correspond to the first argument to cor (the PC variables given by the x element) and the columns to the second argument (the original variables in scorecard). Again we show only the correlations with the top 2 PCs.
corPCACollege<-cor(pcaCollege$x,scale(scorecard[-whNACollege,-c(1:3,12)],center=TRUE,scale=FALSE))
pheatmap(corPCACollege[1:2,],cluster_cols=FALSE,col= seqPal2)
Now we draw a biplot of the result of prcomp on the college data. Unfortunately, biplot is not a really great function; there is very little you can customize. Again, we are going to suppress the warnings from biplot so as to not clutter this document.
par(mfrow=c(1,2))
plot(pcaCollege$x[,1:2],col=c("red","black")[scorecard$CONTROL[-whNACollege]],asp=1)
legend("topright",c("public","private"),fill=c("red","black"))
suppressWarnings(biplot(pcaCollege,pch=19,main="Biplot"))
We repeat some of the same plots from above, only now for the AP Statistics data.
whNAAP<-as.numeric(attributes(na.omit(apscores[,-c(1,3,4,8,10,13,21:29)]))[["na.action"]])
pcaAP<-prcomp(apscores[-whNAAP,-c(1,3,4,8,10,13,21:29)],center=TRUE,scale=FALSE)
par(mfrow=c(1,2))
plot(pcaAP$x[,1:2],asp=1)
biplot(pcaAP)
pheatmap(pcaAP$rotation[,1:2])
Here we repeat the PCA of the college data, only now with scaling as well as centering of the variables.
pcaCollegeScale<-prcomp(scorecard[-whNACollege,-c(1:3,12)],center=TRUE,scale=TRUE)
par(mfrow=c(1,2))
plot(pcaCollegeScale$x[,1:2],pch=19,col=c("red","black")[scorecard$CONTROL[-whNACollege]])
legend("topright",c("public","private"),fill=c("red","black"))
biplot(pcaCollegeScale,choices=c(1,2),pch=19)
We also redo the heatmap of the coefficients.
pheatmap(pcaCollegeScale$rotation[,1:2],cluster_cols=FALSE,main="PCA Coefficients (Loadings)")
pheatmap(cor(scorecard[-whNACollege,-c(1:3,12)],pcaCollegeScale$x[,1:2]),cluster_cols=FALSE,main="Correlation with Original Variables")
We plot the observations based on their coordinates in the 3rd and 4th PC dimensions, as well as a biplot of those dimensions. The argument choices lets you pick which PC variables are plotted in the biplot.
par(mfrow=c(1,2))
plot(pcaCollege$x[,3:4],pch=19,col=c("red","black")[scorecard$CONTROL[-whNACollege]])
legend("topright",c("public","private"),fill=c("red","black"))
suppressWarnings(biplot(pcaCollege,choices=3:4))
We repeat the above heatmap, but now we scale the coefficients of each PC by the standard deviation of the new variable it creates. prcomp returns these standard deviations in the sdev element of its output. We will use the function sweep to multiply each PC (column) by its standard deviation.
par(mfrow=c(1,2))
pheatmap(pcaCollege$rotation,cluster_cols=FALSE,main="Coefficients")
pheatmap(sweep(pcaCollege$rotation,2,pcaCollege$sdev,"*"),cluster_cols=FALSE,main="Coefficients scaled by variance")
We do the same thing for the breast dataset. First we look at the coefficients. To make it relevant, we will first replot the heatmap from above of the entire data matrix; then we will plot the heatmap of the coefficients of the first 25 PCs.
#par(mfrow=c(1,2))
breastHeat<-pheatmap(breastCenteredMed,treeheight_row=0, treeheight_col=0,
breaks=brksCentered,annotation_row=breast[,5:7],show_rownames=FALSE,color=seqPal2,
annotation_colors=list("TypeSample"=typeCol,"EstReceptor"=estCol,"Progesteron"=proCol))
pheatmap(t(pcaBreast$rotation[,1:25]),cluster_cols=breastHeat$tree_col,color=seqPal2,treeheight_row=0)
We now do a heatmap of the coefficients of the top 25 PCs, scaled by the variance.
#par(mfrow=c(1,2))
pheatmap(breastCenteredMed,treeheight_row=0, treeheight_col=0,
breaks=brksCentered,annotation_row=breast[,5:7],
show_rownames=FALSE,color=seqPal2,
annotation_colors=list("TypeSample"=typeCol,"EstReceptor"=estCol,"Progesteron"=proCol))
scaledPCs<-sweep(pcaBreast$rotation,2,pcaBreast$sdev,"*")
pheatmap(t(scaledPCs[,1:25]),cluster_cols=breastHeat$tree_col,color=seqPal2,treeheight_row=0)
The function screeplot plots the variance in each PC dimension as a function of the PC number. We would do it for the simulated 3D data that had a plane, but that pca3d object was created in a separate file, so the call is left commented out here:
## screeplot(pca3d,type="lines",main="Simple simulated data in 3d")
Now we do a screeplot for the college and AP Statistics data.
par(mfrow=c(1,2))
screeplot(pcaCollege,type="lines",main="College")
screeplot(pcaAP,type="lines",main="AP Statistics")
We can also make screeplots where the y-axis is the variance in each component as a proportion of the sum of all the variances.
par(mfrow=c(1,2))
plot(pcaCollege$sdev^2/sum(pcaCollege$sdev^2),type="b",main="College",ylab="Proportion of variance")
plot(pcaAP$sdev^2/sum(pcaAP$sdev^2),type="b",main="AP Stats",ylab="Proportion of variance")
And we repeat this for the breast cancer datasets.
par(mfrow=c(1,2))
plot(pcaBreast$sdev^2/sum(pcaBreast$sdev^2),type="b",main="Breast PCA",ylab="Proportion of variance",xlim=c(0,100))
plot(pcaBreastCancer$sdev^2/sum(pcaBreastCancer$sdev^2),type="b",main="Breast, Cancer only, PCA",ylab="Proportion of variance",xlim=c(0,100))
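A cumulative version of the same calculation is often useful, e.g. to find how many PCs are needed to capture, say, 90% of the total variance (a quick sketch, shown here for the breast PCA):
cumvar <- cumsum(pcaBreast$sdev^2)/sum(pcaBreast$sdev^2)
which(cumvar >= 0.9)[1]   # smallest number of PCs explaining at least 90%
plot(cumvar, type="b", ylab="Cumulative proportion of variance")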