This lab is expected to be self-paced and make sure that everyone is familiar with R and RStudio. This is a long lab, but we expect that for everyone, parts of this will be review and parts will be new. All of these concepts will continue to be reinforced in future labs.
While we give “Exercises”, nothing needs to be turned in – they are just for checking your understanding. Go at your own pace, and only do the exercises if they add to your understanding. Notice the table of contents on the left-hand side. These will allow you to efficiently navigate through the document, and if you click on one, it will open up into more detailed sub-sections.
Data 8 Students / Python Users For Data 8 students working in R and R Studio will be new. You will find that the basic concepts from python will transfer over, and you just need to concentrate on figuring out the difference in syntax. For those students who are more familiar with python, there are special notes about how R commands compare to python commands that look like this:
Python Users: An example of python note!
If you’re coming from STAT 20, you can ignore the comments and code with a green background.
Stat 20 Students / R users For Stat 20 students this will feel very familiar, but some of the programming concepts, like how to write your own function and for-loops may be new. (Data 8 students will be familiar with these concepts, but should focus on the syntax in R). Please slowly go over these sections and if you are not clear ask for help in lab or do some outside google searches to find further tutorials on these concepts.
For additional R support, see https://epurdom.github.io/Stat131A/RSupport/RResources.html
RStudio is an open-source Integrated Development Environment (IDE) which provides many user-friendly features. We will use RStudio throughout the rest of the course. RStudio can be started from the desktop (i.e. your computer) or through logging into a website and accessing a web server. The user interface is generally the same for the two.
We will use the web version for labs, since this makes it easier and quicker to get going during labs – by clicking on the links to the lab we provide you on bcourses, it will necessary additional files for the lab into your workspace.
We give brief instructions below for both methods, but for following this lab (and future labs), you should use the web version. For other assignments, you may find it easier to use RStudio on your own computer. In fact, we strongly recommend for your own convenience that if you have your own computer you should download and install R and do assignments (other than the labs) on your own machine.
A RStudio server can be accessed via a web browser for this class. All the data storage and computation are done remotely on the server (similar to what you may have experienced in Data 8 or Stat 20).
To login in to the server, open the url for this lab here. Once successfully authenticated, you will see the following RStudio interface.
The files you see in the lower left corner may look different, but you should see the folder “Stat131LabsPublic”. This will be where the necessary files for the lab will be located (e.g. data files, code files, etc.). We will explore this more below.
The layout looks similar to that of the web version. The installation of R and RStudio is straightforward, just as installing other applications in your computer. We strongly recommend for your own convenience that if you have your own computer, download and install R and do assignments (other than the labs) on your own machine.
Your working directory is the folder on your computer in which you are currently working. When you ask R to open a certain file, it will look in the working directory for this file, and when you tell R to save a data file or figure, it will save it in the working directory.
Before you start working, please set your working directory to where
all your data and script files are or should be stored. You can browse
and choose working directory by going to Session
->
Set working directory
->
Choose directory ...
as follows.
For this lab, change your working directory to where the uploaded lab 00 files should be:
Stat131Public
-> lab00
After doing this, you should see that the tab in the lower right marked files should show the files that were automatically imported when you clicked on the link to start this session:
These are the files you will be using in this lab, and you can create more during this lab (this is your personal working space…)
The main window is the “Console” on left panel. The symbol > is a command prompt. And this is where R evaluate your commands. For example, type the following commands in the console:
a = 1:5
a
You should see something like this:
R print the outputs (except figures) in the console, right below your
commands. You might notice that after you excute the command, vector
a
appears in the upper right environment
window. In fact, you can always check all the value of objects you
created in the environment
window. Now you may try to
explore how to execute commands in the console yourself.
R allows you to go back to previous commands using up and down arrow
in your keyboard. However, often times you would like to store R
commands in a file. To do so, you can create an R Script by clicking
File
-> New File
->
R Script
with the menu bar in the top. R will create an R
script with the file extention .R
and the
editor window will be invoked. R scripts is a file type
that stores only R commands and comments.
Type the same commands from above into the R script.
To run one line or several line of code in an R script, put your
cursor in that line or select mulitple lines, and then click the
Run
botton.
The RStudio interface contains several component:
This is where you can type simple command after the the prompt “>”. R will excute your current command and return the output in this window.
This window will contain the files you are working on, and is hidden on initial startup. This will show you files that you have opened or created. You will want to have separate files where you will save commands for editing and future use.
If you do not have any files open, this editor may not show up (like
when you first opened RStudio). You can make this window appear by
creating a new document (e.g. with the menu bar in the top
File
-> New File
->
R Script
.) or opening a file ( File
->
Open file ...
). The editor supports multiple file type,
including R script, R markdown, R notebooks, and Sweave files.
For example, open the R script example.R
in the lab 0
folder (under Stat131APublic/2019Fall/Labs/lab00) and try to run the
code (you don’t have to understand the code). You can run the code by
either cutting and pasting into the console, or highlighting the code
and clicking Run
button.
Try adding the following code to the example.R
file and
running it:
D<-data.frame(a=1:3, b=2:4)
You can also run all the code in the file by clicking on the
source
button.
The red font with *
at the end of the file name shows
that the current changes are not saved. While you are working in a file,
you should save it frequently to avoid losing data unexpectedly.
In the environment window you can see the data, values and functions
you created or loaded. You can view the data by clicking on them. In the
example below, if you click the data frame D
you created,
you can see the data on the editor window. The history window shows the
commands which has been excuted.
Upload
botton. To download files from server, check the box before that file
and go to More
-> Export
. This is useful,
for example, if you are working on the web server, and then want to move
your files to your own computer and work on your desktop version of
R.After changing your working directory, you should see the files that were imported
Run
you should see a histogram come up in
this tab.C=rnorm(1000)
hist(C)
R can do many statistical and data analyses. They are organized in packages.
To get a list of all installed packages, go to the packages window or
type library()
in the console window. If the box in front
of the package name is ticked, the package is loaded (activated) and can
be used.
ggplot2
, click install
packages in the packages window and type ggplot to search and install.
Or you can type install.packages("ggplot2")
in the
console.library("geometry")
in the command
window.Python Users: In python you would bring in modules with the command
import
to bring in other code; packages or libraries are the equivalent idea in R, and the commandlibrary
is like that ofimport
. A big difference is that in python you can also useimport
to bring in arbitrary code that you have written that is saved in a script file (e.g. functions you have written and saved in a.py
file). In R thelibrary
command is only for bringing in bundled packages; this means mean code that has been developed and distributed in a particular format and has been installed into your R system on your computer. To read in arbitrary code in a script.R
file you would use the functionsource
.
While R Scripts contains only R code, R Markdown enable us to create dynamic documents with embedded chunks of R code. Meaning you can write text and also have code that you want to describe in the same document. The R Markdown document can be converted to other formats such as HTML and PDF to have a “published” version. So you can develop a published report with the code embedded within it, ensuring that you can always recreate your work. This is opposed to, for example, cutting and pasting figures or results into a text editor like Word.
Python Users: The relationship between R Scripts (
*.R
) and R Markdown (*.Rmd
) is similar to the relationship between Python scripts (*.py
) and IPython Notebooks (*.ipynb
).
To create an R Markdown file, click on File
->
New File
-> R Markdown
. You should now see
a dialog as shown below. Select ‘Document’ in the left panel and fill in
title and author field and hit ‘OK’.
You should now have a document which looks like this:
RStudio has already populated the file with example text and code to
get you started (which you would obviously delete for a real
assignment). The code is contained within “chunks”, which start with
```{r}
and end with ```
(each on their own
line):
Extra options can be added in the ```{r}
code. In the
above example, this code chunk has been given a descriptive name
(“cars”), via ```{r cars}
To add new code chunks into the document, click in the document where you want the code chunk to go and click add chunk icon . Try to add a code chunk after the “cars” chunk that is already there. It should look like this
Add the following code into your chunk,
a=c(pi, 2, 4)
b=1:3
a>b
so that it looks like this
If you click on the run
icon at the top right of the
chunk (green triangle), the entire chunk will be run and the output will
either show the results below
or will show them in the console which look like this:
Python Users: Beneath the code is similar to how IPython notebooks work
You can change between these two options under the “gear” button:
Similar as in R script file, you can instead run one or several lines of code from the R Markdown. To run several lines of code, You may select a piece of code as follows:
Click Run
and get a drop-down menu about what you want
to run (there are simple key short-cuts that will show up there, but
they differ on different operating systems).
To generate an HTML file of the entire document, click on the icon .
It should generate the HTML in the same folder as the R Markdown file.
The above code chunk will look as follows in the HTML file:
Compiling the entire document will run all the commands from the beginning in a fresh environment. You will often discover mistakes when you do this, even if you think you have been testing your code interactively the whole time. Running everything fresh ensures that you have reproducible code, meaning someone else can start with just this document and recreate everything you show in your report.
For example, suppose your code ran fine, you changed something, but didn’t hit run. Or if you move a chunk around and don’t realize that you moved it after another chunk that depends on the results of that chunk – if you run the chunks in the right order, you won’t get an error. But when you compile it from top to bottom, it won’t work.
You should always give yourself time to test your assignment before the due date. Furthermore, it is wise as you finish one section/question of your assignment to compile the entire document and see if you get what you expect, rather than waiting until you are completely finished.
Please refer to an introduction of Markdown Basics on how to add headers, lists and images on the markdown file and generally formatting your text.
You also have an option to create a R notebook. R notebooks create a R markdown file on your computer so they’re not so different, but R Studio treats them a little bit differently.
Create a new R Notebook (under File/New… again). It should create a
file that again is populated with example code. Go to the first chunk
(plot(cars)
) and hit the Run triangle for that chunk. You
should see the plot appear under the code:
When you when save this notebook, it will save both the
.Rmd
file and also a html file with extension
.nb.html
.
You can see this file by going to the “Preview” button at the top (which will force you to save the file)
In the Viewer tab, you should see a html come up with the results. You can see the code and plot below, and this html has the option to hide your code.
This html file is not like the html file from our simple R markdown
from before. Before, with a simple R markdown, when we “knit” our
.Rmd
file, it reran all of our code from scratch
and created a html of the output. This means that all the code has to be
able to run in sequence correctly.
With the .nb.html
that is created, it is just a html
file that mimics what actually have on your screen, namely the output of
the chunk when it was last run by you in your editor is displayed.
Add the following code to your R notebook below the plot code but don’t run it
a=1:3
a<1
If you hit the preview button, you can scroll down and see this new code, but the actual results will not show:
If you do run the code and hit preview, you’ll see the results now show up:
This can be convenient to be able to see (and save) exactly where you are in terms of what you have tested and run. It is also much quicker than compiling the whole document if you just want to quickly see how the html (i.e. text) will look at the end. It also allows a preview, even if part of your code isn’t finished.
Warning But it also means that you could have broken code that you haven’t run and this preview will not let you know that.
So while this preview can be handy, you still need to be able to
compile it completely from scratch to know that everything works. The
preview button has a drop down menu that gives you the option to do this
(i.e. compile them like standard .Rmd
documents)
By clicking “Knit to html” (or pdf) you can make sure the entire thing compiles fresh from the beginning (which is what we will do in testing that your code works).
This tutorial is compiled from a Rmarkdown document, but is a little bit complicated as an introduction to Rmarkdown because of the screenshots and python comments that we have added.
In future labs, your labs will be Rmarkdown documents. You will open them up and interact with the code directly in the Rmarkdown file we have given you as part of the lab. You will turn in your modified Rmarkdown, along with a compiled pdf version as your solution to gradescope.
But for this lab (where you do not need to turn in anything), you will need to cut and paste (or type!) the commands from the html version of this lab into your console.
In what follows, we will move away from Rmarkdown, and just show you the R code. You can type this directly into the console, or (better!) use the R script document we were working with above.
Here we do a simple assignment of a number to a variable called
val
:
val <- 3
print(val)
## [1] 3
Val <- 7 # case-sensitive!
print(Val)
## [1] 7
print(val)
## [1] 3
Val-val #you don't really need the print statement
## [1] 4
Notice the assignment operator <-
, which consists of
the two characters <
(“less than”) and -
(“minus”), and must be strictly side-by-side and ‘point’ to the object
receiving the value of the expression. R expression is case sensitive,
which means A and a are different symbols and would refer to different
variables.
The =
operator can be used as an alternative (like in
python).
val = 3
print(val)
## [1] 3
Val = 7 # case-sensitive!
print(Val)
## [1] 7
You can use either one, and you should be comfortable reading code with either one.
Let’s learn basic arithmetic operators
Python Users: While in Python we import
math
ornumpy
packages to do calculations such as square root and log, in R these are built-in functionalities and can be applied directly.
# add numbers
2 + 3
## [1] 5
# powers
3^4
## [1] 81
# square root
sqrt(4^4)
## [1] 16
# 21 mod 5
21 %% 5
## [1] 1
# take log
log(10)
## [1] 2.302585
# exponential
exp(2)
## [1] 7.389056
# mathematical constant pi
2*pi
## [1] 6.283185
# absolute value
abs(-2)
## [1] 2
# scientific notation
5000000000 * 1000
## [1] 5e+12
# scientific notation
5e9 * 1e3
## [1] 5e+12
Python Users: The following would be the corresponding code in python for these arithmetic operations:
# python code:
import math
import numpy as np
print(2 + 3) # add numbers
print(3**4) # powers
print(pow(3, 4)) # powers
print(math.sqrt(4**4)) # functions
print(21 % 5) # 21 mod 5
print(math.log(10)) # take log
print(math.exp(2)) # exponential
print(np.abs(-2)) # absolute value
print(2*math.pi) # mathematical constant
# scientific notation
print(5000000000 * 1000)
print(5e9 * 1e3)
Exercise 1. You’ve seen the probability density function of Normal distribution \(N(\mu, \sigma^2)\) in Data8 or Stat 20 as follows:
\[p(x)=\frac{1}{\sqrt{2\sigma^2\pi}}\exp{-\frac{(x - \mu)^2}{2\sigma^2}}\]
Now let \(\mu = 1\), \(\sigma^2 = 2\), and calculate the value of the density function at \(X = 0\). Finish the chunk below
mu <- 1
sig.sq <- 2
x <- 0
# insert code here to calculate this value
You can check that you got the right answer by making sure that it is the same as the following (built-in) function in R that calculates this quantity:
dnorm(0, mean=1, sd=sqrt(2))
## [1] 0.2196956
R has similar data types as in python: numeric values, integer values, characters (i.e. strings), and logicals (i.e. booleans or TRUE/FALSE).
Python Users: In R the boolean value is
TRUE
orFALSE
(all caps), while in Python it would beTrue
orFalse
In R we can generally test what something is using functions like
is.X
is.numeric("my name is")
## [1] FALSE
is.character("my name is")
## [1] TRUE
is.numeric(5)
## [1] TRUE
is.integer(5) #This is false! you have to specify values to be integers, other wise they are saved as numeric (i.e. decimal valued)
## [1] FALSE
is.character(TRUE)
## [1] FALSE
is.logical(TRUE)
## [1] TRUE
Logical values generally arise from comparisons. Here we will review operators for comparing values in R.
The basic numerical comparisons are <
,
<=
, >
, >=
. To do put
together multiple comparisons we can require both logical statements to
be true (&
) or at least one (|
)
Python Users: In python, the equivalent of
&
isand
and of|
would beor
, i.e. python uses the words for these operations.
(1 > 0) & (3 <= 5)
## [1] TRUE
(1 < 0) | (3 > 5)
## [1] FALSE
(3 == 9/3) | (2 < 1)
## [1] TRUE
!(2 != 4/3)
## [1] FALSE
Python Users: The above would be the corresponding code in python for these logical operations:
print((1 > 0) and (3 <= 5))
print((1 < 0) or (3 > 5))
print((3 == 9/3) or (2 < 1) )
print(not(2 != 4/3))
Exercise 2 We want to use R and the function
dnorm
that evaluates the normal density to answer the
following questions. Recall that dnorm(x,mean=1,sd=2)
is
the value of the normal density with mean 1 and standard deviation 2
evaluated at x.
result1
.result2
.mu1 <- 1
mu2 <- 0
sig.sq <- 2
x <- 0
# Add code with result saved as `result1`
# uncomment to print annswer
# print(result1)
# Add code with result saved as `result2`
# uncomment to print annswer
# print(result2)
Vectors store a series of values that are of the same data type in R.
There are several ways to create an vector in R.
You can enter in comma-separated values with the function
c
# set up a vector
a <- c(0.125, 4.75, -1.3)
a
## [1] 0.125 4.750 -1.300
Python Users: In python this would be equivalent to:
a = [0.125, 4.75, -1.3]
#Or a numpy array
a = np.array([0.125, 4.75, -1.3])
We can also use c
to combine together existing
vectors
# set up another vector
b <- c(0, 1, -1, pi, exp(1))
b
## [1] 0.000000 1.000000 -1.000000 3.141593 2.718282
newVector <- c(a, b)
newVector
## [1] 0.125000 4.750000 -1.300000 0.000000 1.000000 -1.000000 3.141593
## [8] 2.718282
Python Users: In python this would be equivalent to
a = [0.125, 4.75, -1.3]
b = [0, 1, -1, math.pi, math.e]
print([a, b])
print(np.concatenate((a,b)))
Vectors can also hold logical or character values
bools <- c(TRUE, FALSE, TRUE)
bools
## [1] TRUE FALSE TRUE
mystring <- c("Hello", ",", " ", "world", "!")
mystring
## [1] "Hello" "," " " "world" "!"
If we have a vector of strings, we can put them together into a
single string (concatenate them) with the function cat
or
the function paste
to have finer control of how we combine
them (cat
always puts a space between each value)
cat(mystring)
## Hello , world !
paste(mystring,collapse="")##i.e. put nothing between the values
## [1] "Hello, world!"
seq
The function seq
generates sequences of numbers with a
similar pattern. Usually it has three parameters: from
,
to
, and by
, which stands for the starting
value, ending value and increment of the sequence.
seq1 <- seq(from=4, to=9, by=1)
seq1
## [1] 4 5 6 7 8 9
seq2 <- seq(1.1, 11.1, by = 2)
seq2
## [1] 1.1 3.1 5.1 7.1 9.1 11.1
Python Users: Notice the difference with
np.arange
, where then end value given tonp.arrange
(given bystop
) is one past the value you want:
>>> np.arange(start=4,stop=10,step=1)
array([4, 5, 6, 7, 8, 9])
You can also write function seq(from=a, to=b, by=1)
as
a:b
seq3 <- 1:6
seq3
## [1] 1 2 3 4 5 6
Getting Help More parameters are available for
seq
– how to figure out how to use them? R provides easy
accessible documents to search for help for functions, datasets and
packages. Before asking others for help, it will always be helpful to
read the documents. Documentation for a function usually includes
description, usages, arguments and examples. For example, there are two
ways to access the document of the function seq: ?seq
and
help(seq)
.
Exercise 3 To create a vector with replicates, we
usually use rep
as follows. Look at the documentation of
rep
and figure out what these arguments do
seq4 <- rep(4, times=6)
seq5 <- rep(1:2, times=5)
seq6 <- rep(1:2, each = 5)
In R, mathematical operations on vectors are usually done element-wise, meaning the operation is done on each element of the vector.
Let \(x=(x_1,x_2,\ldots,x_n\) and
\(y=(y_1,y_2,\ldots,y_n)\) Then
x*y
will return the vector \[(x_1 y_1, x_2 y_2,\ldots,x_n y_n)\]
Note: For those that are familiar with vector operations
this means that x*y
for vectors in R is NOT the inner
product, but element wise, and ditto for matrices.
And it’s similar for other operators. Look at how these mathematical operations work on vectors:
vec1 <- 1:5
vec2 <- seq(0.1, 0.5, by = 0.1)
vec1 + vec2
## [1] 1.1 2.2 3.3 4.4 5.5
vec2^vec1
## [1] 0.10000 0.04000 0.02700 0.02560 0.03125
vec1 > vec2
## [1] TRUE TRUE TRUE TRUE TRUE
vec1 < 5 & vec2 > 0.3
## [1] FALSE FALSE TRUE TRUE FALSE
vec1 < 5 | vec2 > 0.3
## [1] TRUE TRUE TRUE TRUE TRUE
Exercise 4
Create the following vectors.
# Insert code here
# Insert code here
# Insert code here
# Insert code here
# Insert code here
Python Users: Python and R have similar subsetting options, but have small differences in indexing for which you should take care! Please read this carefully!
Let’s use a integer vector example to show the differences.
vector1 <- 8:17
# the first element
vector1[1]
## [1] 8
vector1[10]
## [1] 17
Python Users: R starts at 1 while Python starts at 0. In python you would say
vector1[0]
to get the first element andvector1[9]
to get the 10th value of the vector
vector1[-1]
## [1] 9 10 11 12 13 14 15 16 17
Python Users: Python uses negative indexing differently. For example,
somevector[-1]
refers to the last element of the vector. However, negative index in R returns the vector with indicated elements DELETED. In python the R code above would be give the last element, not delete the first one:
>>> vector1[-1]
10
a:b
notationvector1[3:6]
## [1] 10 11 12 13
Python Users: When subsettting an vector using
a:b
, R subsets b-a+1 (i.e. includes all of the elements a through b) elements while Python takes out b-a elements (includes elements a+1 through b). In python you would need to dovector1[2:6]
to get these same values.
You can subset using non-consecutive integer vectors in the same way
# pull out the 1st and 5th elements of the vector
vector1[c(1, 5)]
## [1] 8 12
Python Users: In python you would do
vector1[[0, 4]]
to get these same values.
And using negative indices works the same way for vectors – this example removes the 1st and 5th element
# subsetting, noting (1, 5) here indicating the index instead of the values.\n",
vector1[-c(1, 5)]
## [1] 9 10 11 13 14 15 16 17
You can also subset based on logical values as well, meaning if you
have a logical vector of the same length as another vector, you
can use it to subset to only the TRUE
values of your
vector.
mylogical<-c(rep(TRUE, 3), rep(FALSE, 3), rep(TRUE, 4))
vector1[mylogical]
## [1] 8 9 10 14 15 16 17
Since vector1 > 5
creates a logical vector, I can use
this to pull out entries of my vector greater than 5
# indexing using boolean operators
vector1[vector1 > 5]
## [1] 8 9 10 11 12 13 14 15 16 17
Python Users: In python you would do similar indexing,
vector1[vector1>5]
Alternatively, you can also find the indices of the vector that
satisfy the logical using the which
function. Here I find
which of my vector values are even using %%
whEven<-which(vector1 %%2 ==0)
print(whEven)
## [1] 1 3 5 7 9
I can then index with this vector of indices
vector1[whEven]
## [1] 8 10 12 14 16
Note that I don’t have to have unique indices – I repeat the indices to get replicates of the values of the vector:
vector1[c(3,3,2,1,4,2,2)]
## [1] 10 10 9 8 11 9 9
The same indexing method works when assigning values, namely you can assign values to only a subset of the vector.
vector1 <- 1:10
vector1[1] <- 5
vector1[3:6] <- c(8, 8, 8, 8)
vector1
## [1] 5 2 8 8 8 8 7 8 9 10
You can also give names to each entry of your vector
vector1<-1:3
names(vector1)<-c("A","B","C")
vector1
## A B C
## 1 2 3
I can then subset the vector by a vector of characters giving the names I want
vector1["A"]
## A
## 1
vector1[c("B","C")]
## B C
## 2 3
sample
. We give the function a vector of values we
want to sample from, the size of the sample we want, and whether we want
to sample with or without replacement. We also use the argument
set.seed
so that in fact our random numbers will be the
same every time we rerun this code, rather than varying.# Can put any arbitrary integer in set.seed and the random code that follows will always be the same:
set.seed(27489)
# random sample 10 number from integers 1 to 10.
x<-1:100
# WITH replacement
samples <- sample(x, size=500, replace = TRUE)
# WITHOUT replacement
sample.withoutrep <- sample(x, size=10)
length(samples)
## [1] 500
max
, min
, mean
,
median
, var
, sum
min(samples)
## [1] 1
max(samples)
## [1] 100
mean(samples)
## [1] 51.136
median(samples)
## [1] 53
var(samples)
## [1] 845.3041
sd(samples)
## [1] 29.07411
summary(samples)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 25.75 53.00 51.14 76.00 100.00
Factors in R are vectors especially for storing categorical variable. While factors can seem like they are vectors of character values, in fact they are treated very different by R. By storing the values in a factor we are telling R to treat the values as categorical values, not as arbitrary strings. The implicit assumption is that there are only small set of discrete values the vector take on (small relative to the size of the data). The possible values that a categorical variable takes on are called levels in statistics.
It is very important for summarizing information and plot. Both
numeric and character variables can be made into factors by the function
factor
.
Here’s an example where we take character-valued vectors and change them into a factor.
# Create character vector
vec2 <- c('Fri', 'Thur', 'Mon', 'Tue', 'Wed', 'Thur', 'Mon', 'Mon', 'Tue', 'Wed', 'Wed', 'Tue', 'Fri')
summary(vec2)
## Length Class Mode
## 13 character character
# Create character vector
vec2.fac <- factor(vec2)
summary(vec2.fac)
## Fri Mon Thur Tue Wed
## 2 3 2 3 3
Notice how R treats the variable differently now that it’s a factor –
the function summary
gives entirely different results for
the factor vector as it does the character vector. In fact,
summary
for a factor works like the function
table
, which simply counts the number of times each unique
entry is present.
# table of factor
table(vec2.fac)
## vec2.fac
## Fri Mon Thur Tue Wed
## 2 3 2 3 3
# table of character
table(vec2)
## vec2
## Fri Mon Thur Tue Wed
## 2 3 2 3 3
# table of numeric
table(c(1,4,2,4,2,1))
##
## 1 2 4
## 2 2 2
We can also create factors from numeric vectors. This may not feel intuitive – character values feel like they are naturally categorical. However, many times categorical variables might be encoded as numeric (e.g. 1=No, 2=Maybe, 3=Yes )
# Create numeric variable
vec1 <- c(3, 2, 3, 2, 1, 3, 2, 3, 1, 1, 2)
vec1
## [1] 3 2 3 2 1 3 2 3 1 1 2
summary(vec1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.500 2.000 2.091 3.000 3.000
Now convert it to a factor
# Convert it to a factor
vec1.fac <- factor(vec1)
vec1.fac
## [1] 3 2 3 2 1 3 2 3 1 1 2
## Levels: 1 2 3
summary(vec1.fac)
## 1 2 3
## 3 4 4
Again, notice how R treats the variable differently now that it’s a factor. This is important, because it doesn’t make sense to take the mean of categorical variables, even if they are encoded as numbers.
I can also give more informative labels to my factor:
# Convert it to a factor
vec1.fac <- factor(vec1, labels=c("No","Maybe","Yes"), levels=c(1,2,3))
vec1.fac
## [1] Yes Maybe Yes Maybe No Yes Maybe Yes No No Maybe
## Levels: No Maybe Yes
summary(vec1.fac)
## No Maybe Yes
## 3 4 4
We can check what the possible levels are and how many:
# give the levels of the factor
levels(vec1.fac)
## [1] "No" "Maybe" "Yes"
# give the number of levels of the factor
nlevels(vec1.fac)
## [1] 3
To count the number of occurrence of each level, we use function
table
table(vec1.fac)
## vec1.fac
## No Maybe Yes
## 3 4 4
Note that you can also apply table
to non-factor
vectors, and that works – the function transforms them to factors first,
but without our nice mapping of the values to their labels that we
created above.
table(vec1)
## vec1
## 1 2 3
## 3 4 4
table(vec2)
## vec2
## Fri Mon Thur Tue Wed
## 2 3 2 3 3
You can think data frame as a collection of vectors, each element of
the vectors holding different types of data collected on the same
observations. Data frames can be a mix of types of vectors (e.g. numbers
and characters) and they are the main way to store data in R. Data
frames can be manually created with the function data.frame
as shown in the following examples. But usually, we would read them from
an external file files, as we’ll do later.
Python Users: A Data frame in R is very similar to Table in Python.
We will use examples of demonstrated in Data 8 below (LAB 3 and LAB 4)
When not specified, the function data.frame
will coerce
all character variables to factors, which is usually right for
characters. If you want to keep the strings as character variables, you
need to specify stringsAsFactors = FALSE
.
# Create a data frame from scratch (unusual)
imdb <- data.frame(Votes = c(1498733, 1027398, 692753),
Rating = c(9.2, 9.2, 9.0),
Title = c('The Shawshank Redemption (1994)', 'The Godfather (1972)',
'The Godfather: Part II (1974)'),
Year = c(1994, 1972, 1974),
Decade = c(1990, 1970, 1970), stringsAsFactors = FALSE)
imdb
## Votes Rating Title Year Decade
## 1 1498733 9.2 The Shawshank Redemption (1994) 1994 1990
## 2 1027398 9.2 The Godfather (1972) 1972 1970
## 3 692753 9.0 The Godfather: Part II (1974) 1974 1970
The data.frame
is generally thought of like a matrix,
where the vectors (the different variables or attributes) are stored as
columns and the observations in the rows.
# Get number of rows (observations)
nrow(imdb)
## [1] 3
# Get number of columns (attributes)
ncol(imdb)
## [1] 5
You can index them like a matrix too:
imdb[1:2,]
## Votes Rating Title Year Decade
## 1 1498733 9.2 The Shawshank Redemption (1994) 1994 1990
## 2 1027398 9.2 The Godfather (1972) 1972 1970
imdb[,c(3,5)]
## Title Decade
## 1 The Shawshank Redemption (1994) 1990
## 2 The Godfather (1972) 1970
## 3 The Godfather: Part II (1974) 1970
But in fact they are not matrices, and they have some
indexing options that don’t work for matrices. In particular, you can
call up a column using a $
and the name of the column
imdb$Title
## [1] "The Shawshank Redemption (1994)" "The Godfather (1972)"
## [3] "The Godfather: Part II (1974)"
All the tricks about indexing vectors will work for data frames (and matrices). For example, what do the following snippets do (in words)?
imdb[imdb$Rating > 9.0 & imdb$Decade == 1990, ]
## Votes Rating Title Year Decade
## 1 1498733 9.2 The Shawshank Redemption (1994) 1994 1990
oldest_rating <- max(imdb$Year)
imdb$Title[imdb$Year==oldest_rating]
## [1] "The Shawshank Redemption (1994)"
We will go through a couple of basic plotting commands here. Specific types of plots have different commands, but most plotting commands share basic arguments for certain aspects of the plot, such as axis labels, titles, and so forth. Here are some important ones:
main
: Title of the plotsub
: subtitle for plot (below x-axis)xlab
/ylab
: x/y axis labelxlim
/ylim
: the limits (starting and ending
values) for the x/y axis”,las
the orientation of the axis labelslty
the type of line (e.g. solid
,
dashed
, dotted
)lwd
the width of lines (>1 increases size, <1
decreases)pch
the type of point (see help of points
for the many options)col
color of the plotted images (e.g. “red”, “blue” –
colors()
prints out all possible character names of colors
you can give)cex
the relative size of plotting (>1 increases
size, <1 decreases)A full set of the parameters can be found in the help for
par
(?par
).
Not all of these parameters are applicable for any plot, often because they don’t make sense.
Plotting commands in R are generally of two kinds, the main plotting command, that sets up the axes, labels, etc and draws the plot and then plotting commands that can add on top of the existing plot.
Common main plotting commands are
plot
– for scatterplots (and plotting of other
objects)boxplot
– boxplotshist
– histogramsbarplot
– barplotscurve
– draw a functionCommon commands for adding features to a plot are:
lines
– drawing lines connecting points (input vectors
x and y and draw lines between each (x[i],y[i]) and (x[i+1],y[i+1]), in
the order they are listed in the vector)points
– drawing points (input vectors x and y and draw
point for each (x[i],y[i]) pair)legend
– add legend to plotabline
– draw a line on a plot based on an equation for
a line. Useful for horizontal and vertical plots, but can draw arbitrary
intercept and slope too.title
– add x/y labels, titles, subtitles, etc on top
of existing plotaxes
– add axes (tick marks and their labels) to
plotWe will go through three common examples, and you will see many more in future labs
The file baby.csv
contains data on a random sample of
1,174 mothers and their newborn babies. The column birthwt
contains the birth weight of the baby, in ounces; gest_days
is the number of gestational days, that is, the number of days the baby
was in the womb. There is also data on maternal age, maternal height,
maternal pregnancy weight, and whether or not the mother was a
smoker.
We will read this data in
baby <- read.csv('baby.csv', header = TRUE)
head(baby)
## X Birth.Weight Gestational.Days Maternal.Age Maternal.Height
## 1 1 120 284 27 62
## 2 2 113 282 33 64
## 3 3 128 279 28 64
## 4 4 108 282 23 67
## 5 5 136 286 25 62
## 6 6 138 244 33 62
## Maternal.Pregnancy.Weight Maternal.Smoker
## 1 100 FALSE
## 2 135 FALSE
## 3 115 TRUE
## 4 125 TRUE
## 5 93 FALSE
## 6 178 FALSE
The command hist
is used to create histograms.
hist(baby$Birth.Weight,
col="darkblue", # histogram color,
main = "Birth weights of babies",# plot title,
xlab = "Birth weights (in ounces)", # x axis label,
xlim = c(40, 180) # x axis range
)
I can also add a vertical line showing where the mean and median are
on this histogram using abline
and the argument
v
, for “vertical” (notice that I have to repeat the
hist
command from above. This is because you can’t spread
plotting commands across R chunks in a rmarkdown. If I was typing into a
console, I could just type in the abline
command after my
previous hist
command)
hist(baby$Birth.Weight,
col="darkblue", # histogram color,
main = "Birth weights of babies",# plot title,
xlab = "Birth weights (in ounces)", # x axis label,
xlim = c(40, 180) # x axis range
)
abline(v=c(mean(baby$Birth.Weight), median(baby$Birth.Weight)), lty=c("solid","dashed"), col=c("red","orange"))
legend("topright", legend=c("Mean","Median"), lty=c("solid","dashed"),col=c("red","orange"))
Notice in abline
I could give a vector of values I want
for my vertical lines, and it makes a separate vertical line for each.
Notice also that lty
and col
defines the type
and color of the lines, and that by giving a vector to lty
and col
, the first value of lty
and
col
is for the first vertical line in the v
argument, etc.
The legend’s first argument is the location (e.g. “topright”,
“center”, “bottomright” etc). The second argument is the vector of
characters that gives the text you want in the legend, and then the
remaining commands describe what you want in front of this text. In this
case, we want the text to correspond to different line types
and colors, so we give it our lty
and
col
values for each text element, and it draws lines with
those colors.
** Exercise** Plot a histogram of the birth weights of the babies of the smokers, with meaningful labels, title etc.
We create scatter plots using the plot
command.
plot(x=baby$Maternal.Pregnancy.Weight, y=baby$Birth.Weight,
pch=19, #colored in points
cex=.75, #smaller points
col="grey",
xlab="Maternal Pregnancy Weight (in lbs)",
ylab="Birth Weight (in ounces)")
I could chose to color some points differently. I could do this in
two ways. The first is to give a vector of colors, equal to the length
of x
and y
for each point. Here I give an
example of plotting different colors based on the variable
Maternal.Age
First I will make a variable that divides Maternal.Age
into three categories, <21, 21-35, 35+. I will use the function
cut
ageBinned<-cut(baby$Maternal.Age, c(0,21,35,100))
summary(ageBinned)
## (0,21] (21,35] (35,100]
## 194 852 128
class(ageBinned)
## [1] "factor"
Notice that the resulting variable, ageBinned
, is now a
factor variable, classifying each observation into three categories.
Now I can use that to define colors. Suppose I want the colors for
(0,21] to be red, (21,35] black and (35,100] green. If I make a variable
with the three colors, in the order of the levels of my factor variable
ageBinned
, I can do the following
levels(ageBinned)
## [1] "(0,21]" "(21,35]" "(35,100]"
ageColors<-c("red","black","green")
head(ageBinned)
## [1] (21,35] (21,35] (21,35] (21,35] (21,35] (21,35]
## Levels: (0,21] (21,35] (35,100]
head(ageColors[ageBinned])
## [1] "black" "black" "black" "black" "black" "black"
The vector ageColors[ageBinned]
has taken the vector of
colors and repeated it for every value that matches it’s level (it’s
using ageBinned
to do the repeated indices trick from above
and so get replicated values of the colors in the right order). I can
use this vector to tell the plot
command what colors to
give to each observations.
plot(x=baby$Maternal.Pregnancy.Weight, y=baby$Birth.Weight,
pch=19, #colored in points
cex=.75, #smaller points
col=ageColors[ageBinned],
xlab="Maternal Pregnancy Weight (in lbs)",
ylab="Birth Weight (in ounces)")
legend("bottomright",legend=levels(ageBinned),fill=ageColors, title="Age of Mother")
Another way to change something about a small number of points, is to
use the points
function to add points to the existing plot.
Because it plots over the existing points, it will effectively replot
the points you request. It’s a little less elegant, but can be useful to
highlight a subset of values.
Here I will plot differently points with birth weight under 60oz.
First I need to identify those points, using the function
which
whLowBirth<-which(baby$Birth.Weight<60)
length(whLowBirth) #check how many observations that is
## [1] 2
baby[whLowBirth, ]
## X Birth.Weight Gestational.Days Maternal.Age Maternal.Height
## 860 860 58 245 34 64
## 923 923 55 204 35 65
## Maternal.Pregnancy.Weight Maternal.Smoker
## 860 156 TRUE
## 923 140 FALSE
This gives me the indices of the observations with low birth weight
plot(x=baby$Maternal.Pregnancy.Weight, y=baby$Birth.Weight,
pch=19,
cex=.75,
col="grey",
xlab="Maternal Pregnancy Weight (in lbs)",
ylab="Birth Weight (in ounces)")
points(x=baby$Maternal.Pregnancy.Weight[whLowBirth], y=baby$Birth.Weight[whLowBirth],
pch=19,
cex=2, # Make the points BIG
col="red")
title(sub="Highlighting in red low weight babies")
However, usually we do not create data frame manually by typing the information. We read data from external files.
A common task is to read in comma-deliminated files with
read.csv
:
# Read csv files
twitter_follows <- read.csv("twitter_follows.csv")
twitter_follows
## Screen.name Followers Friends
## 1 LeoDiCaprio 14082200 142
## 2 SteveCarell 4607580 48
## 3 MarkRuffalo 2165110 1178
## 4 amyschumer 3452330 1931
## 5 TherealTaraji 3960390 702
## 6 Racheldoesstuff 31996 3341
## 7 IAMQUEENLATIFAH 6890940 458
We can also read tab-deliminated with read.delim
(these
are often saved as .txt
files, but can sometimes be
.tsv
)
# Read tab-delinated files
twitter_info <- read.delim("twitter_info.txt")
twitter_info
## Name Screen.name Gender Medium
## 1 Leonardo DiCaprio LeoDiCaprio M Film
## 2 Steve Carell SteveCarell M Both
## 3 Mark Ruffalo MarkRuffalo M Film
## 4 Amy Schumer amyschumer F Both
## 5 Taraji P. Henson TherealTaraji F Both
## 6 Aziz Ansari azizansari M TV
## 7 Rachel Bloom Racheldoesstuff F TV
## 8 Queen Latifah IAMQUEENLATIFAH F Both
R can handle any kind of delimination via read.table
(both read.csv
and read.delim
are just special
cases of read.table
), and has many options for dealing with
common formatting problems.
We can also join data frames together that have at least one shared
data attributes. In this case, both datasets have twitter information
about actors, and we can match the actors in different datasets based on
the attribute Screen.name
# join the data frame info and follows into one data frame called twitter
twitter = merge(twitter_info,twitter_follows,by="Screen.name")
twitter
## Screen.name Name Gender Medium Followers Friends
## 1 amyschumer Amy Schumer F Both 3452330 1931
## 2 IAMQUEENLATIFAH Queen Latifah F Both 6890940 458
## 3 LeoDiCaprio Leonardo DiCaprio M Film 14082200 142
## 4 MarkRuffalo Mark Ruffalo M Film 2165110 1178
## 5 Racheldoesstuff Rachel Bloom F TV 31996 3341
## 6 SteveCarell Steve Carell M Both 4607580 48
## 7 TherealTaraji Taraji P. Henson F Both 3960390 702
Notice that one of the observations in twitter_info
was
not in twitter_follows
and was silently dropped. We could
ask that all entries in either dataset be kept by setting
all=TRUE
:
# join the data frame info and follows into one data frame called twitter
twitter = merge(twitter_info,twitter_follows,by="Screen.name", all=TRUE)
twitter
## Screen.name Name Gender Medium Followers Friends
## 1 amyschumer Amy Schumer F Both 3452330 1931
## 2 azizansari Aziz Ansari M TV NA NA
## 3 IAMQUEENLATIFAH Queen Latifah F Both 6890940 458
## 4 LeoDiCaprio Leonardo DiCaprio M Film 14082200 142
## 5 MarkRuffalo Mark Ruffalo M Film 2165110 1178
## 6 Racheldoesstuff Rachel Bloom F TV 31996 3341
## 7 SteveCarell Steve Carell M Both 4607580 48
## 8 TherealTaraji Taraji P. Henson F Both 3960390 702
In this case, R adds in NA values for the missing entries.
There are many options for merge
, including if the two
datasets have different names for the same attribute.
We can also write data frames to a file using
write.table
. We can make them tab or comma deliminated by
setting the option sep
, but there is a special function
write.csv
for the special case of comma-deliminated.
# Write a data frame to csv
write.csv(twitter_follows,
file = "twitter_follows.csv",
row.names = FALSE)
# Write a data frame to tab-deliminated
write.table(twitter_follows, sep="\t",
file = "twitter_follows.txt",
col.names=FALSE,
row.names = FALSE )
R provide another method for saving and loading R data. You can
save(load) several R objects to(from) at the same timea binary file with
the extension .RData
. And load them when you start a new R
session in the future.
x <- 1:3
y <- list(a = 1, b = TRUE, c = "oops")
# save x, y
save(x, y, file = "xy.RData")
# load x, y
load("xy.RData")
R provides methods to save the environment (all data, values,
functions, etc.) in RData for future use. You could excute the code as
follows, or click the save botton in the environment window to save and
click the RData
files in the file window to load them.
# save the environment
save.image("myenv.RData")
# load the environment
load("myenv.RData")
While R comes with many functions, you can actually write your own functions.
You define a function in R using the command function
.
The format is
nothing <- function(){
x = 4
return(x)
}
Python Users: Unlike python, R is not sensitive to white space. Meaning you can tab or not as you like (though indenting is recommended for readability). Instead the brackets
{ }
define the limits of the function
You can now call your function
nothing()
## [1] 4
Python Users: The same function in python would be defined as:
def nothing():
x = 4
return x
output = nothing()
Often you will want your function to take information from the user.
This is done by defining arguments that the user can give in the
function(...)
part of the call.
randMean <- function(data){
x = sample(data, size=50, replace=FALSE)
return(mean(x))
}
set.seed(915820)
randMean(data=samples)
## [1] 54.28
Notice how I can use the variable data
in my function as
if it has been defined – it will be whatever data the user gives.
I can also have multiple arguments. Here I will allow the user to also determine the sample size; only this time I will give a default value in my definition of the function.
randMean <- function(data, sampleSize=50){
x = sample(data, size=sampleSize, replace=FALSE)
return(mean(x))
}
set.seed(915820)
randMean(data=samples) #should be same as above
## [1] 54.28
randMean(data=samples, sampleSize=20)
## [1] 51.35
Notice how because I set the seed I get the same answer (when
sampleSize=50
) as before, even though I have a “random”
sample.
In my function above, I have a problem – what if the data the user gives me is less than length 50? Since I’m sampling without replacement, this will be a problem. So I can add error handling using conditional statements. An “if statement” will evaluate a logical expression and do a series of actions if the expression is true; I can add an “else” which tells what to do if the expression is false.
randMean <- function(data, sampleSize=50){
if(length(data) < sampleSize){
stop("Input to argument data must be of length at least ", sampleSize)
}
else{
x = sample(data, size=sampleSize, replace=FALSE)
return(mean(x))
}
}
set.seed(915820)
randMean(data=samples)
## [1] 54.28
randMean(data=samples[1:10])
## Error in randMean(data = samples[1:10]): Input to argument data must be of length at least 50
Notice the function stop
– it will produce an error and
kill the function. You can also use warning
to give
warnings – notice the difference in the next function and how I changed
the conditional statements:
randMean <- function(data, sampleSize=50){
if(length(data) < sampleSize){
warning("Input to argument data is longer than requested sample size. Will set to default of ", length(data))
sampleSize<-length(data)
}
x = sample(data, size=sampleSize, replace=FALSE)
return(mean(x))
}
set.seed(915820)
randMean(data=samples)
## [1] 54.28
randMean(data=samples[1:10])
## Warning in randMean(data = samples[1:10]): Input to argument data is longer than
## requested sample size. Will set to default of 10
## [1] 55.4
Often we will want to repeat an expresion over and over again, but change the input into it each time. This is called iteration, and for-loops are a standard programming device for doing so. The basic idea is that you give a vector of values to the for loop, and for each element in the vector, the for loop will evaluate the expression on that element.
Here is a for-loop in R
for(animal in c('cat', 'dog', 'rabbit')){
print(animal)
}
## [1] "cat"
## [1] "dog"
## [1] "rabbit"
Python Users: This for-loop in python would be defined as:
for animal in make_array('cat', 'dog', 'rabbit'):
print(animal)
This means that for every element in the vector
c('cat', 'dog', 'rabbit')
, the variable animal
was given the value of the element of the vector, and then the code
beween the { }
is executed.
Here’s a more complex for-loop, that determines the sum and product of the numbers 1-10. It does this by defining variables it will at each iteration add (or multiple) the new integer to.
sum_10 = 0
prod_10 = 1
#loop over the integers 1-10:
for (i in 1:10){
sum_10 = sum_10 + i
prod_10 = prod_10 * i
}
print(sum_10)
## [1] 55
print(prod_10)
## [1] 3628800
Python Users: This for-loop in python would be defined as:
sum_10 = 0
prod_10 = 1
for i in range(1,11):
sum_10 = sum_10 + i
prod_10 = prod_10 * i
print(sum_10)
print(prod_10)
R has special additional functions that are used for iteration that
we will learn about in future labs (sapply
,
apply
, lapply
, tapply
)