Spring 2020
Course details
Form for concurrent enrollment students
Graded Course Components
Policies
Using R and resources
Week-to-week Syllabus
Course Details
Lectures
TTh 2-3:30pm, Evans 10
Labs
TTh 4-5, Evans 332
TTh 5-6, Evans 332
Instructors
Role  Name  Email  Office Hours

Instructor  Will Fithian  wfithian@berkeley.edu  M 2-3, Th 11-12, Evans 301
GSI  George Shan  gshan@berkeley.edu  W 4-5, F 4-5, Evans 432
Description
This is an upper-division course that is a follow-up to Data 8 or STAT 20. The course will teach a broad range of statistical methods that are used to solve data problems. Topics will include group comparisons and ANOVA, standard parametric statistical models, multivariate data visualization, multiple linear regression and classification, classification and regression trees, and random forests. Students will be introduced to the widely used R statistical language and will obtain hands-on experience in implementing a range of statistical methods on numerous real-world datasets.
This is an upper-division class, so the pace of the class is expected to be more challenging than introductory classes.
Course Text
There is no book for this class. Instead, we have provided extensive lecture notes. If you would like some additional optional reading, you can try the following books.
 The Statistical Sleuth: A Course in Methods of Data Analysis by Ramsey and Schafer
 Introductory Statistics with R by Peter Dalgaard
 Probability for Data Science by Ani Adhikari and Jim Pitman (http://prob140.org/textbook/). This is the online book for teaching probability used by STAT 140. Chapter 15, in particular, covers continuous distributions, which might be a useful complement to our coverage of the topic, though it goes into more depth than we will.
None of these books covers all of the topics we will cover, nor do they share the perspective and focus of this class; in particular, they do not make extensive use of bootstrapping and resampling methods. But for students wanting some additional structure or R assistance, these books may be helpful and should be at the right level for this class.
Stat 20 and Data 8 are similar courses, but each covers some subjects that the other does not. While we will cover these topics in class, you may find the following useful background if you are seeing them for the first time (more to follow):

The Foundations of Data Science, by Ani Adhikari and John DeNero (www.inferentialthinking.com) – chapters 11-13.
This is the online book used by Data 8, and these chapters introduce hypothesis testing using only resampling ideas, which are not necessarily covered in Stat 20. Their examples integrate Python code, which may not be familiar, but what the code is doing will probably be clear enough that the examples are not too difficult to follow.
Online services used
We will use several different online services in the class that are probably familiar to you:
Resource  Materials 

Course Website  Course material (lectures, labs, etc.) will be available at the course website, https://epurdom.github.io/Stat131A/
bCourses  We will also distribute some materials, such as homework, solutions, and practice midterms, on bCourses. Grades, announcements, updates, and the like will also go through bCourses.
Piazza  You can link to the Piazza account via bCourses; clicking on the Piazza link should enroll you automatically in the Piazza class (with your Berkeley email). A GSI will check Piazza once a day.
Gradescope  Homework assignments will be submitted to Gradescope. After the first week, we will enter the roster of emails, based on your official @berkeley.edu account on the student roster.
Concurrent enrollment students
Concurrent enrollment students wishing to register for the class should fill out a Google form at https://forms.gle/39cNK7BT85sTC7BR6 (note this is now the correct link for Spring 2020) to give me information about their previous coursework so I can assess whether they have satisfied the prerequisites.
Graded Course Components
Overall score
 Homeworks: 25%
 Labs: 20%
 Midterm: 25% (in class)
 Final: 30%
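As a rough illustration of how these weights combine, the sketch below computes an overall score in Python (which students coming from Data 8 will have seen). It assumes that each component is first expressed as a fraction of its available points and that the overall score is a plain weighted sum. The function name, the dictionary layout, and the example scores are hypothetical, and the instructors' actual calculation may differ in its details.

```python
# Weights are taken from the list above; everything else is illustrative.
WEIGHTS = {"homework": 0.25, "labs": 0.20, "midterm": 0.25, "final": 0.30}


def overall_score(fractions):
    """Return the weighted overall score on a 0-100 scale.

    `fractions` maps each component name to the fraction of available
    points earned (0.0 to 1.0). Assumes a plain weighted sum.
    """
    missing = set(WEIGHTS) - set(fractions)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return 100 * sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


# Hypothetical student, not real data.
example = {"homework": 0.90, "labs": 1.00, "midterm": 0.80, "final": 0.85}
print(round(overall_score(example), 1))
```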
Homework
Homework assignments will be posted to bCourses and will generally be due one week later. All homework will be submitted in Gradescope and is due by 5pm on the due date. There are 8-10 homework assignments planned, though the exact number may change. Homeworks will be a combination of computational exercises and data analysis using the computer, as well as comprehension questions. R code will also be submitted in Gradescope, where the code will be checked (see bCourses for more information about turning in homework).
The final homework score will be the sum of all homework grades, so 15 points in HW1 counts the same as 15 points in HW5. Points are allocated to questions across the semester with the aim of being comparable from one assignment to the next. This means each homework assignment (as a block) will not count equally, and it is always worth your while to turn in what you have done of your assignment, even if it is incomplete.
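To make the pooling concrete, here is a small Python sketch that contrasts the pooled score described above with a per-assignment average. The point values are made up, the function names are hypothetical, and dividing by the total points available is an assumption about how the sum would be normalized.

```python
def pooled_homework_score(assignments):
    """Fraction of all available homework points earned.

    `assignments` is a list of (points_earned, points_possible) pairs.
    Points are pooled across assignments, so a point counts the same
    no matter which assignment it comes from.
    """
    earned = sum(e for e, _ in assignments)
    possible = sum(p for _, p in assignments)
    return earned / possible


def per_assignment_average(assignments):
    """Alternative scheme (not the one described above), for contrast."""
    return sum(e / p for e, p in assignments) / len(assignments)


# Hypothetical scores: a short 20-point HW, a long 60-point HW,
# and a half-finished 40-point HW that was still turned in.
scores = [(20, 20), (30, 60), (20, 40)]
print(pooled_homework_score(scores))   # 70 points out of 120
print(per_assignment_average(scores))  # mean of 1.0, 0.5, 0.5
```

Under pooling, the 20 points from the half-finished assignment add directly to the total, which is why turning in incomplete work still helps.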
Labs
The labs will be a combination of review of material and an introduction to how to perform the analysis shown in class in the R environment. The labs will be self-paced, to be completed during the lab and turned in. Attendance at the labs is mandatory, and you will be expected to turn in the results of finishing the exercises in the lab. The exercises are intended to be turned in at the end of lab, but the GSI may use their discretion to allow people to turn them in later that day or to limit what is required to be turned in at the end of lab.
There is no lab in the first week, but there is a self-paced lab for you to go through on your own. You do not have to turn in this lab, but you will be expected to have gone through it.
Policies
 If you wish for your email to make it into our inbox, the subject of your email must contain the text “STAT 131A”
 The instructors do not explain course material over email and will not respond to emails with such requests. Please come to office hours, discussion section, or the GSI’s office hours (or schedule another time to meet if you have irreconcilable conflicts with the office hours). You can also post questions to Piazza.
 We respond to email regarding the class roughly once a day, and almost never in the evenings or on weekends. A GSI will monitor Piazza, again roughly once a day.
Academic Honesty Policy
 Homework and projects must be done independently. Obtaining and/or using solutions from previous years or from the internet, if such happen to be available, is considered cheating. You may discuss issues about the homework with classmates, but you must not sit down and do the assignment jointly.
 For exams, cheating includes, but is not limited to, bringing written or electronic materials into an exam or quiz beyond those allowed, using written or electronic materials during an exam or quiz, copying off another person’s exam or quiz, allowing someone to copy off of your exam or quiz, and having someone take an exam or quiz for you.
In fairness to students who put in an honest effort, cheaters will be harshly treated. Any evidence of cheating will result in a score of zero (0) on the entire assignment or examination. We will always report incidents of cheating to Student Judicial Affairs, which may administer additional punishment.
Disability
If you need accommodations for any physical, psychological, or learning disability, please speak to the instructor after class or during office hours. Please note you must make arrangements in a timely manner (through DSP) so that we can make the appropriate accommodations; last-minute requests cannot be accommodated.
Using R and resources for R
We will be using R to analyze data in this class. You do not need your own computer to use R and do the assignments, just access to a web browser (e.g. Chromebooks will be okay). See R resources for more information.
We expect that students have already been exposed to R or Python in either Stat 20 or Data 8, respectively. For those who come from Data 8, the first labs will help you transfer that knowledge over to R. Both languages are excellent platforms for analyzing data, are widely used in data science, and have their individual strengths. R has been developed within the statistics community specifically for data analysis, while Python is a general-purpose programming language with extensive data analysis capabilities.
The lectures will focus on the statistical concepts, while the labs will focus on learning how to do in R what you see in lecture. Do not be surprised or worried about not following the details of R code during class; that is not the point. Furthermore, there will sometimes be code that is specific to the instructional purposes of the lecture (e.g. to make a specific plot) and is beyond the scope of what you would be expected to know how to do yourself.
In addition to what you learn in lab, every lecture will have accompanying HTML and .Rmd files containing the code used in the lecture, with brief explanations.
Furthermore, we expect that you’ve already been introduced to the basics, so we will not need to walk you through every detail. The first step if you are confused about code is to look at the documentation; the second step is to do a quick Google search; and of course, ask us questions if you can’t figure it out after doing that.
See our page on R resources for useful tutorials and online help.
Syllabus
Exam dates
 The midterm is scheduled in class, Thursday, March 12
 The final exam is Monday, May 11 from 11:30 am – 2:30 pm
Week-by-week guide
The following is a rough guideline for the material we will cover in the semester, but the actual topics will vary as the semester goes along.
Note: links will only work once the material is posted (i.e. as the semester progresses). Visit the sites from previous semesters to get access to the reading material in advance (there is only small variation from year to year). The descriptions posted here are projected, and may be updated/changed depending on the pace of the class
Link to Lab 0
Week  Description  Chapter

01  Boxplots, discrete distributions, intro to continuous distributions  01: 1.1-1.3
02  Continuous distributions, density curves  01: 1.3-4.4
03  Density estimation, intro to hypothesis testing  01: 4.5-5; 02: 1
04  Permutation test, t-test  02: 2.1-3.7
05  Type I error, multiple testing, confidence intervals for t-test  02: 4.1-6.2
06  Bootstrap confidence intervals. Review simple regression. Polynomial regression, loess curves  02: 7.1-8.1; 03
07  Finish loess curves  03
08  Finish local fitting, smooth density plots. MIDTERM  03
09  Pairs plots, heatmaps, start PCA  04
10  PCA, start multiple linear regression  04; 05
11  Multiple linear regression, fitting and interpretation, fitted values, residuals, multiple R-squared, residual degrees of freedom and residual standard error  05
12  Multiple regression with categorical explanatory variables and interactions  05
13  Inference in multiple regression: F-tests via the anova function  05
14  Inference in multiple regression: t-tests, standard errors, confidence intervals and prediction intervals. Variable selection in linear regression. Regression diagnostics  05
15  The classification problem and logistic regression, interpretation in terms of odds, binary predictions via confusion matrices, precision and recall, deviance, variable selection via AIC  06
16  Reading and recitation week: regression trees, classification trees and random forests  07