Spring 2021
Course details
Form for concurrent enrollment students
Graded Course Components
Policies
Using R and resources
Week-to-week Syllabus
Course Details
Lectures
MWF 1 - 2 pm, Zoom link on bcourses
Labs
TuTh 2 - 3 pm & 5 - 6 pm
Instructors
Role | Name | Email | Office Hours (Zoom links on bcourses) |
---|---|---|---|
Instructor | Shobhana Stoyanov | shobhana@berkeley.edu | Tues 9-11 am & Wed 2-4 pm |
GSI | Shide Li | shideli99@berkeley.edu | Mon 4 - 5 pm & Fri 10 - 11 am |
Description
This course is an upper-division course that is a follow-up to Data 8 or STAT 20. The course will teach a broad range of statistical methods that are used to solve data problems. Topics will include group comparisons and ANOVA, standard parametric statistical models, multivariate data visualization, multiple linear regression and classification, classification and regression trees and random forests. Students will be introduced to the widely used R statistical language and they will obtain hands-on experience in implementing a range of statistical methods on numerous real-world datasets.
This is an upper-division class, so the pace of the class is expected to be more challenging than introductory classes.
Course Text
We will be following the online book developed for this course available here. If you would like some additional optional reading, you can try the following books.
- The Statistical Sleuth: A Course in Methods of Data Analysis by Ramsey and Schafer
- Introductory Statistics with R by Peter Dalgaard
- Probability for Data Science by Ani Adhikari and Jim Pitman (http://prob140.org/textbook/). This is the online book for teaching probability used by STAT 140. Chapter 15, in particular, covers continuous distributions, which might be a useful complement to our coverage of the topic – though more in depth than what we will do.
None of these books covers all of the topics we will cover, nor do they have the same perspective and focus as this class – they do not make extensive use of bootstrapping and resampling methods. But for those students wanting some additional structure or R assistance, these books may be helpful and should be at the right level for this class.
Stat 20 and Data 8 are similar courses, but each covers some subjects that the other does not. While we will cover these topics in class, you may find the following useful background if you are seeing them for the first time (more to follow):
- The Foundations of Data Science, by Ani Adhikari and John DeNero (www.inferentialthinking.com) – chapters 11-13. This is the online book used by Data 8, and these chapters introduce hypothesis testing using only resampling ideas, which are not necessarily covered in Stat 20. Their examples integrate Python code, which may not be familiar, but what the code is doing should be clear enough that the text remains easy to follow.
Online services used
We will use several different online services in the class that are probably familiar to you:
Resource | Materials |
---|---|
Textbook | Online book |
Course Website | Course material (lectures, labs, etc) will be available at the course website, https://epurdom.github.io/Stat131A/ |
Bcourses | We will distribute materials like HWs, Solutions, Practice midterms, etc. on bcourses. Grades, announcements, updates, and the like will also be through bcourses. |
Piazza | You can link to the Piazza account via bcourses – clicking on the Piazza link should enroll you automatically in the Piazza class (with your Berkeley email). A GSI will check Piazza once a day. |
Gradescope | HW Assignments will be submitted to Gradescope. After the first week, we will enter the roster of emails, based on your official @berkeley.edu account on the student roster. |
Concurrent enrollment students
Concurrent enrollment students wishing to register for the class should fill out a Google form at (coming soon) to give me information about their previous coursework so I can assess whether they have satisfied the prerequisites.
Graded Course Components:
Overall score
- Homeworks: 25%
- Labs: 15%
- Midterm: 20%
- Final project: 18%
- Final: 22%
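As a rough illustration of how these weights combine, here is a minimal R sketch. It assumes each component is first expressed as a fraction of the points available and that the overall score is the plain weighted sum; the syllabus does not specify rounding or any curve, and the component scores below are hypothetical.

```r
# Weights from the "Overall score" list above (they sum to 1).
weights <- c(homework = 0.25, labs = 0.15, midterm = 0.20,
             final_project = 0.18, final = 0.22)

# Hypothetical component scores, each as a fraction of the points available.
scores <- c(homework = 0.90, labs = 1.00, midterm = 0.80,
            final_project = 0.85, final = 0.75)

# Overall score is the weighted sum; index by name so ordering cannot matter.
overall <- sum(weights * scores[names(weights)])
overall
```

With these made-up scores the overall result is 0.853, i.e. 85.3%.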
Homework
The homework assignments will be posted to bcourses, and will generally be due every other Wednesday. All homework will be submitted in Gradescope and is due by 11:59 pm on the due date. There are 5-6 homework assignments planned, though the exact number may change. Homeworks will be a combination of computational exercises and data analysis using the computer, as well as conceptual questions. R code will also be submitted in Gradescope where the code will be checked (see bcourses for more information about turning in homework).
The final homework score will be the sum of all homework grades, so 15 points in HW1 counts the same as 15 points in HW5. Points are allocated to questions with the aim of keeping them comparable across the semester. This means each homework assignment (as a block) will not count equally, and it is always worth your while to turn in what you have done of your assignment, even if it is incomplete.
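To make the pooling concrete, here is a small R sketch. The point totals are hypothetical (actual assignments will differ), and the per-assignment average is shown only for contrast; it is not how the homework score is computed.

```r
# Hypothetical points earned and points possible on each homework.
earned   <- c(HW1 = 15, HW2 = 30, HW3 = 20, HW4 = 40, HW5 = 25)
possible <- c(HW1 = 20, HW2 = 40, HW3 = 25, HW4 = 50, HW5 = 30)

# Pooled score: total points earned over total points possible.
pooled <- sum(earned) / sum(possible)

# For contrast only: averaging per-assignment percentages would
# weight every homework equally, which is NOT the policy described above.
equal_weight <- mean(earned / possible)

# Share of the homework score carried by each assignment under pooling.
share <- possible / sum(possible)
round(share, 3)
```

Because points are pooled, a longer assignment carries a larger share of the homework score, and any points earned on a partially completed assignment still add to the total.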
Labs
There will be about 12 labs, which will combine a review of material with an introduction to performing the analyses shown in class in the R environment. The labs are self-paced: you work on them during section and turn them in every Friday by 11:59. Though the sections will be recorded, it is usually helpful to attend them.
There is no lab to be submitted this Friday, but there is a self-paced lab (Lab 00) for you to go through on your own. You do not have to turn in this lab, but we will assume that you have gone through this lab and understand the code.
Policies
- If you wish for your email to make it into our inbox, the subject of your email must contain the text “STAT 131A”
- The instructors do not explain course material over email and will not respond to emails with such requests. Please come to office hours or discussion sections (or schedule another time to meet if you have irreconcilable conflicts with the office hours). You can also post questions to Piazza.
- We respond to email regarding the class roughly once a day, and almost never in the evening or on the weekend.
Academic Honesty Policy
- Homework and projects must be done independently. Obtaining and/or using solutions from previous years or from the internet, if such happen to be available, is considered cheating. With other classmates, you may discuss issues about the homework, but you must not sit down and do the assignment jointly.
- For exams cheating includes, but is not limited to, using electronic materials in an exam beyond that allowed, copying off another person’s exam or quiz, allowing someone to copy off of your exam or quiz, and having someone take an exam or quiz for you.
In fairness to students who put in an honest effort, cheaters will be harshly treated. Any evidence of cheating will result in a score of zero (0) on the entire assignment or examination, and perhaps a failing grade in the class. We will always report incidents of cheating to the Office of Student Conduct, which may administer additional punishment.
Disability
If you need accommodations for any physical, psychological, or learning disability, please speak to the instructor after class or during office hours. Please note you must make arrangements in a timely manner (through DSP) so that we can make the appropriate accommodations; last-minute requests cannot be accommodated.
Using R and resources for R
We will be using R to analyze data in this class. See R resources for more information.
We expect that students have already been exposed to R or Python in either Stat 20 or Data 8, respectively. For those who come from Data 8, the first labs will help you transfer that knowledge over to R. Both languages are excellent platforms for analyzing data, are widely used in data science, and have their individual strengths. R has been developed within the statistics community specifically for data analysis, while Python is a general-purpose programming language with extensive data analysis capabilities.
The lectures will be focused on the statistical concepts, while learning how to do in R what you see in lecture will be the focus of the labs. Do not be surprised or worried about not following the details of R code during class – that is not the point (furthermore, there will sometimes be code that is really specific to the instructional purposes of the lecture, e.g. to make a specific plot, and is beyond the scope of what you would be expected to understand how to do yourself).
In addition to what you learn in lab, every lecture will have an accompanying html/.Rmd file containing the code used in the lecture along with brief explanations.
Furthermore, we expect that you’ve already been introduced to the basics, so that we will not need to walk you through every detail. The first step if you are confused about code is to look at the documentation; the second step is to do a quick google search; and of course, ask us questions if you can’t figure it out after doing that.
You will be going through an introductory lab (Lab 0) to help you refresh your skills. See our page on R resources for useful tutorials and online help.
Syllabus
Exam dates
- The midterm will be on Gradescope & is scheduled for Friday March 19
- The final project will be due on the Monday of finals week, May 10
- We will have an oral final exam in this course during finals week. Topics/questions will be released before RRR week, and each student will need to sign up for a 20-minute slot. More details will be supplied after the midterm.
Week-by-week guide
The following is a rough guideline for the material we will cover in the semester, but the actual topics will vary as the semester goes along.
The descriptions posted here are projected, and may be updated/changed depending on the pace of the class
Week | Description | Chapter |
---|---|---|
01 | Boxplots, discrete distributions, intro to continuous distributions | 2.1.1-2.1.3 |
02 | Continuous distributions, density curves | 2.1.3-2.4.4 |
03 | Density estimation, Intro to Hypothesis testing | 2.4.5-2.5, 3.1 |
04 | Permutation test, t-test | 3.2-3.3 |
05 | Type I error, multiple testing, confidence intervals for t-test | 3.4-3.6 |
06 | Bootstrap Confidence intervals. Review simple regression. Polynomial regression, loess curves | 3.7-3.9, 4.1.1-4.3 |
07 | Finish loess curves | 4.4 |
08 | Finish local fitting, smooth density plots. MIDTERM | 4.5-4.6 |
09 | Pairs plots, Heatmaps, Start PCA | 5.1, 5.3 |
10 | PCA, start Multiple linear regression | 5.4, 6.1 |
11 | Multiple linear regression, fitting and interpretation, fitted values, residuals, Multiple R-squared, Residual degrees of freedom and residual standard error | 6.2-6.3 |
12 | Multiple regression with categorical explanatory variables and interactions | 6.4 |
13 | Inference in multiple regression: F-tests via the anova function | 6.5 |
14 | Inference in multiple regression: t-tests, standard errors, confidence intervals and prediction intervals. Variable selection in linear regression. Regression diagnostics | 6.6 |
15 | The Classification problem and logistic regression, interpretation in terms of odds, binary predictions via confusion matrices, precision and recall, deviance, variable selection via AIC | 7 |
16 | Reading and recitation week: Regression trees, classification trees and Random Forests | 8 |