Fall 2019
Course details
Form for concurrent enrollment students
Graded Course Components
Policies
Using R and resources
Week-to-week Syllabus
Course Details
Lectures
MWF 2-3pm, Hearst Mining 390
Labs
Tu/Th 10-11, Evans 332; Tu/Th 2-3, Evans 330
Instructors
Role | Name | Email | Office Hours |
---|---|---|---|
Instructor | Elizabeth Purdom | epurdom@berkeley.edu | F 3-5pm in Evans 433 |
GSI | George Shan | gshan@berkeley.edu | TBA |
Description
This course is an upper-division course that follows up on Data 8 or STAT 20. The course will teach a broad range of statistical methods that are used to solve data problems. Topics will include group comparisons and ANOVA, standard parametric statistical models, multivariate data visualization, multiple linear regression and classification, classification and regression trees, and random forests. Students will be introduced to the widely used R statistical language and will obtain hands-on experience in implementing a range of statistical methods on numerous real-world datasets.
This is an upper-division class, so the pace is expected to be more challenging than that of introductory classes.
Course Text
There is no book for this class. Instead, we have provided extensive lecture notes. If you would like some additional optional reading, you can try the following books.
- The Statistical Sleuth: A Course in Methods of Data Analysis by Ramsey and Schafer
- Introductory Statistics with R by Peter Dalgaard
- Probability for Data Science by Ani Adhikari and Jim Pitman (http://prob140.org/textbook/). This is the online book for teaching probability used by STAT 140. Chapter 15, in particular, covers continuous distributions, which might be a useful complement to our coverage of the topic, though it goes into more depth than we will.
None of these books covers all of the topics we will cover, nor do they share the perspective and focus of this class; in particular, they do not make extensive use of bootstrapping and resampling methods. But for students wanting some additional structure or R assistance, these books may be helpful and should be at the right level for this class.
Stat 20 and Data 8 are similar courses, but each covers some subjects that the other does not. While we will cover these topics in class, you may find the following useful background if you are seeing them for the first time (more to follow):
- The Foundations of Data Science, by Ani Adhikari and John DeNero (www.inferentialthinking.com), chapters 11-13. This is the online book used by Data 8, and these chapters introduce hypothesis testing using only resampling ideas, which are not necessarily covered in Stat 20. Their examples integrate Python code, which may not be familiar to you, but what the code is doing should be clear enough that the chapters are not too difficult to follow.
Online services used
We will use several different online services in the class that are probably familiar to you:
Resource | Materials |
---|---|
Course Website | Course material (lectures, labs, etc) will be available at the course website, https://epurdom.github.io/Stat131A/ |
Bcourses | We will also distribute some materials, such as HWs, solutions, and practice midterms, on Bcourses. Grades, announcements, updates, and the like will also be posted through Bcourses. |
Piazza | You can link to the Piazza account via Bcourses; clicking on the Piazza link should enroll you automatically in the Piazza class (with your Berkeley email). A GSI will check Piazza once a day. |
Gradescope | HW assignments will be submitted to Gradescope. After the first week, we will enter the roster of emails, based on your official @berkeley.edu account on the student roster. |
Concurrent enrollment students
Concurrent enrollment students wishing to register for the class should fill out a Google form at https://forms.gle/JXf6VZutKqwjrCmR7 to give me information about their previous coursework so I can assess whether they have satisfied the prerequisites.
Graded Course Components:
Overall score
- Homeworks: 20%
- Labs: 15%
- Projects (2): 20%
- Midterm: 20% (in class)
- Final: 25%
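As an illustration of how these weights combine, here is a minimal sketch in Python. It assumes each component is first expressed as a percentage on a 0-100 scale; the function name `overall_score` and the example scores are hypothetical, and the syllabus does not say how rounding or any curve would be applied.

```python
# Weights taken from the list above; they sum to 1.
WEIGHTS = {
    "homework": 0.20,
    "labs": 0.15,
    "projects": 0.20,
    "midterm": 0.20,
    "final": 0.25,
}


def overall_score(component_pct):
    """Weighted average of component percentages (each on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(component_pct)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_pct[name] for name in WEIGHTS)


# Made-up scores, for illustration only.
example = {"homework": 90, "labs": 100, "projects": 85, "midterm": 80, "final": 75}
print(round(overall_score(example), 2))
```

With these made-up numbers the components contribute 18, 15, 17, 16, and 18.75 points respectively.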
Homework
Homework assignments will be posted to Bcourses and will generally be due one week later. All homework must be submitted in Gradescope and is due by 5pm on the due date. There are 6-8 homework assignments planned, though the exact number may change. Homeworks will be a combination of computational exercises and data analysis using the computer, as well as comprehension questions. R code will also be submitted in Gradescope, where the code will be checked (see Bcourses for more information about turning in homework).
The final homework score will be the sum of all homework grades, so 15 points in HW1 counts the same as 15 points in HW5. Points are allocated to questions with the aim of making them comparable across the semester. This means that the homework assignments (as blocks) will not count equally, and it is always worth your while to turn in what you have done of your assignment, even if it is incomplete.
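To make the pooling concrete, here is a small Python sketch with made-up point totals. It assumes the pooled score is total points earned divided by total points possible; the syllabus only states that points are summed, and the helper names are hypothetical. The second function is shown only for contrast and is not the rule used in this class.

```python
# Made-up (points earned, points possible) pairs; real assignments will differ.
homeworks = {
    "HW1": (15, 20),
    "HW2": (30, 40),
    "HW3": (0, 10),  # not turned in
}


def pooled_homework_pct(hw):
    """Total points earned over total points possible, as a percentage."""
    earned = sum(e for e, _ in hw.values())
    possible = sum(p for _, p in hw.values())
    return 100 * earned / possible


def mean_of_pcts(hw):
    """For contrast only (NOT the syllabus rule): average per-assignment percentages."""
    return 100 * sum(e / p for e, p in hw.values()) / len(hw)


print(round(pooled_homework_pct(homeworks), 2))  # pooled points
print(round(mean_of_pcts(homeworks), 2))         # equal weight per assignment

# Turning in a partial HW3 (5 of 10 points) still raises the pooled score.
partial = dict(homeworks, HW3=(5, 10))
print(round(pooled_homework_pct(partial), 2))
```

Under pooling, a longer assignment such as the hypothetical HW2 carries more weight than a short one, and any points earned on an incomplete assignment add directly to the total.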
Labs
The labs will be a combination of review of material and an introduction to how to perform the analysis shown in class in the R environment. The labs are self-paced exercises to be completed and turned in during the lab session. Attendance at the labs is mandatory, and you will be expected to turn in the results of finishing the exercises in the lab. The exercises are intended to be turned in at the end of lab, but the GSI may use their discretion to allow people to turn them in later that day or to limit what is required to be turned in at the end of lab.
There is no lab in the first week, but there is a self-paced lab for you to go through on your own. You do not have to turn in this lab, but you will be expected to have gone through it.
Projects
In addition to the homeworks, there will be roughly 2 more extensive data analysis assignments that require more independent use of the tools.
Policies
- If you wish for your email to make it into our inbox, the subject of your email must contain the text “STAT 131A”
- The instructors do not explain course material over email and will not respond to emails with such requests. Please come to office hours, discussion section, or GSI’s office hours (or schedule another time to meet if you have irreconcilable conflicts with the office hours). You can also post questions to Piazza.
- We respond to email regarding the class roughly once a day, and almost never in the evening or on weekends. A GSI will monitor Piazza, again roughly once a day.
Academic Honesty Policy
- Homework and projects must be done independently. Obtaining and/or using solutions from previous years or from the internet, if such happen to be available, is considered cheating. You may discuss issues about the homework with classmates, but you must not sit down and do the assignment jointly.
- For exams, cheating includes, but is not limited to, bringing written or electronic materials into an exam or quiz beyond those allowed, using written or electronic materials during an exam or quiz, copying off another person’s exam or quiz, allowing someone to copy off of your exam or quiz, and having someone take an exam or quiz for you.
In fairness to students who put in an honest effort, cheaters will be harshly treated. Any evidence of cheating will result in a score of zero (0) on the entire assignment or examination. We will always report incidents of cheating to Student Judicial Affairs, which may administer additional punishment.
Disability
If you need accommodations for any physical, psychological, or learning disability, please speak to the instructor after class or during office hours. Please note that you must make arrangements in a timely manner (through DSP) so that we can make the appropriate accommodations; last-minute requests cannot be accommodated.
Using R and resources for R
We will be using R to analyze data in this class. You do not need your own computer to use R and do the assignments, just access to a web browser (e.g. chromebooks will be okay). See R resources for more information.
We expect that students have already been exposed to R or Python in either Stat 20 or Data 8, respectively. For those who come from Data 8, the first labs will help you transfer that knowledge over to R. Both languages are excellent platforms for analyzing data, are widely used in data science, and have their individual strengths. R has been developed within the statistics community specifically for data analysis, while Python is a general-purpose programming language that also has extensive data analysis capabilities.
The lectures will focus on the statistical concepts, while the labs will focus on learning how to do in R what you see in lecture. Do not be surprised or worried about not following the details of R code during class; that is not the point. Furthermore, there will sometimes be code that is specific to the instructional purposes of the lecture (e.g. to make a specific plot) and is beyond the scope of what you would be expected to understand how to do yourself.
In addition to what you learn in lab, every lecture will have a complementary HTML/.Rmd file containing the code used in the lecture, with brief explanations.
Furthermore, we expect that you’ve already been introduced to the basics, so we will not need to walk you through every detail. The first step if you are confused about code is to look at the documentation; the second step is to do a quick Google search; and, of course, ask us questions if you can’t figure it out after doing that.
See our page on R resources for useful tutorials and online help.
Syllabus
Key assignments
- Project 1 is tentatively due in class on Friday, October 11
- The midterm is scheduled in class on Friday, October 18
- Project 2 is tentatively due Thursday, December 12
- The final exam is Thurs, December 19 from 3–6 pm
Week-by-week guide
The following is a rough guideline for the material we will cover in the semester, but the actual topics will vary as the semester goes along.
Note: links will only work once the material is posted (i.e. as the semester progresses). Visit the sites from previous semesters to get access to the reading material in advance (there is only small variation from year to year). The descriptions posted here are projected and may be updated or changed depending on the pace of the class.
Link to Lab 0
Week | Description | Chapter |
:--- | :--- | :--- |
01 | Boxplots, discrete distributions, intro to continuous distributions | 01: 1.1-1.3 |
02 | Continuous distributions, density curves | 01: 1.3-4.4 |
03 | Density estimation, Intro to Hypothesis testing | 01: 4.5-5, 02: 1 |
04 | Permutation test, t-test | 02: 2.1-3.7 |
05 | Type I error, multiple testing, confidence intervals for t-test | 02: 4.1-6.2 |
06 | Bootstrap Confidence intervals. Review simple regression. Polynomial regression, loess curves | 02: 7.1-8.1, 03 |
07 | Finish loess curves (classes cancelled due to power outages) | 03 |
08 | Finish local fitting, smooth density plots. MIDTERM | 03 |
09 | Pairs plots, Heatmaps, Start PCA | 04 |
10 | PCA, start Multiple linear regression | 04, 05 |
11 | Multiple linear regression, fitting and interpretation, fitted values, residuals, Multiple R-squared, Residual degrees of freedom and residual standard error | 05 |
12 | Multiple regression with categorical explanatory variables and interactions | 05 |
13 | Inference in multiple regression: F-tests via the anova function | 05 |
14 | Inference in multiple regression: t-tests, standard errors, confidence intervals and prediction intervals. Variable selection in linear regression. Regression diagnostics | 05 |
15 | The Classification problem and logistic regression, interpretation in terms of odds, binary predictions via confusion matrices, precision and recall, deviance, variable selection via AIC | 06 |
16 | Reading and recitation week (make up material missed during power outage): Regression trees, classification trees and Random Forests | 07 |