Course details
Form for concurrent enrollment or auditing students
Graded Course Components
Policies
Using R and resources
Week-to-week Syllabus

Course Details

Lectures

MWF 2-3pm, Social Sciences Building 170

Lectures during the week of April 3 may be conducted on Zoom; stay tuned for information.

Labs

MW 10-11am or MW 3-4pm, Evans Hall 334

Instructional Staff

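|            | Name             | Email                | Office Hours          |
|------------|------------------|----------------------|-----------------------|
| Instructor | Elizabeth Purdom | epurdom@berkeley.edu | F 3-5, 433 Evans Hall |
| GSI        | Saqib Mumtaz     | saqib@berkeley.edu   | TBA                   |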

 

Description

This course is an upper-division course that follows Data 8 or STAT 20. The course will teach a broad range of statistical methods that are used to solve data problems. Topics will include group comparisons and ANOVA, standard parametric statistical models, multivariate data visualization, multiple linear regression and classification, classification and regression trees, and random forests. Students will be introduced to the widely used R statistical language and will obtain hands-on experience in implementing a range of statistical methods on numerous real-world datasets.

This is an upper-division class, so the pace of the class is expected to be more challenging than introductory classes. Exams and final projects will expect students to pull together material taught in lectures, labs, and homeworks.

Course Text

We will be following the online book developed for this course, available at https://epurdom.github.io/Stat131A/book. If you would like some additional optional reading, you can try the following books.

  • The Statistical Sleuth: A Course in Methods of Data Analysis by Ramsey and Schafer
  • Introductory Statistics with R by Peter Dalgaard
  • Theory Meets Data by Ani Adhikari (http://stat88.org/textbook/content/intro.html). This is the online book for STAT 88 that covers introductory probability at the level of Stat 20.

None of these books covers all of the topics we will cover, nor do they have the same perspective and focus as this class – they do not have extensive use of bootstrapping and resampling methods. But for those students wanting some additional structure or R assistance these books may be helpful and should be at the right level for this class.

Stat 20 and Data 8 are similar courses, but each covers some subjects that the other does not. While we will cover these topics in class, you may find the following useful background if you are seeing them for the first time (more to follow):

  • The Foundations of Data Science, by Ani Adhikari and John DeNero (www.inferentialthinking.com) – chapters 11-13.

    This is the online book used by Data 8, and these chapters introduce hypothesis testing using only resampling ideas, which are not necessarily covered in Stat 20. Their examples integrate Python code, which may not be familiar to you, but what the code is doing will probably make sense, so the examples should not be too difficult to follow.

Online tools used

We will use several different online tools in the class that are probably familiar to you:

| Resource       | Materials |
|----------------|-----------|
| Textbook       | Online book: https://epurdom.github.io/Stat131A/book |
| Course Website | Basic course material (syllabus, book) can be viewed at the course website, https://epurdom.github.io/Stat131A/. |
| Bcourses       | Materials like Assignments, Labs, Solutions, Practice midterms, etc. will be distributed on bcourses. |
| Ed Discussion  | You can link to the Ed Discussion account via bcourses. A GSI will check Ed Discussion once a day. |
| Gradescope     | All assignments will be submitted to Gradescope. By the end of the first week, if not sooner, you should see yourself added to Gradescope if you are registered for the class. |

Concurrent enrollment or Auditing students

Concurrent enrollment students wishing to register for the class should fill out this Google form to give me information about their previous coursework so I can assess whether they have satisfied the prerequisites.

Students who wish to audit the course can also fill out the top portion of this form to get their email added to bcourses as a guest; only name and email are needed.

Graded Course Components:

Overall score

  • Homeworks: 30%
  • Labs: 10%
  • Midterm: 20%
  • Final project: 20%
  • Final: 20%
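As a rough illustration of how these weights combine, here is a minimal sketch in Python. It assumes each component is first expressed as a percentage (0-100) and then combined linearly; the syllabus does not say how components are scaled or whether any curve is applied, so the function name `overall_score` and the example student are hypothetical.

```python
# Component weights from the "Overall score" list above (they sum to 100%).
WEIGHTS = {
    "homework": 0.30,
    "labs": 0.10,
    "midterm": 0.20,
    "final_project": 0.20,
    "final": 0.20,
}


def overall_score(percentages):
    """Weighted average of component percentages (each on a 0-100 scale).

    Assumption: components are combined linearly with no curve or rescaling.
    """
    missing = set(WEIGHTS) - set(percentages)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * percentages[name] for name in WEIGHTS)


# Hypothetical student, not real data.
example = {
    "homework": 90,
    "labs": 100,
    "midterm": 80,
    "final_project": 85,
    "final": 75,
}
print(round(overall_score(example), 2))
```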

Homework

The homework assignments will be posted to bcourses, and will generally be due every other week. All homework will be submitted in Gradescope and is due by 11:59 pm on the due date. There are 5-6 homework assignments planned, though the exact number may change. Homework will be a combination of computational exercises and data analysis using the computer, as well as conceptual questions.

The final homework score will be the sum of all homework grades, so 15 points in HW1 counts the same as 15 points in HW5. Points are allocated to questions so that they are roughly comparable across the semester. This means each homework assignment (as a block) will not count equally, and it is always worth your while to turn in what you have done of your assignment, even if it is incomplete.
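To see why pooling points means assignments do not count equally, the following sketch compares a pooled score with an equal-weight average. The point totals are made up for illustration, and both function names are hypothetical. Only the pooled version reflects the rule described above.

```python
def pooled_homework_percent(scores):
    """Percent of all homework points earned, pooling points across assignments.

    `scores` is a list of (points_earned, points_possible) pairs.
    """
    earned = sum(e for e, _ in scores)
    possible = sum(p for _, p in scores)
    return 100 * earned / possible


def equal_weight_percent(scores):
    """Average of per-assignment percentages (NOT the rule used in this class)."""
    return sum(100 * e / p for e, p in scores) / len(scores)


# Hypothetical point totals: a short HW1 and a longer HW2.
scores = [(15, 20), (30, 60)]
print(pooled_homework_percent(scores))  # pooled: 45 of 80 points
print(equal_weight_percent(scores))     # each assignment weighted equally
```

Under pooling, the longer assignment carries more weight, and every point turned in on an incomplete assignment still adds to the total.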

Labs

Lab sections meet twice a week. Generally one lab will be reviewing conceptual material while the other lab will be going through a lab assignment, but this may vary at times.

There will be about 12 lab assignments, which will introduce how to perform the analyses shown in class in the R environment. The labs are self-paced and meant to be worked on primarily during section, but you will not have to turn them in until a couple of days later.

There are no lab sections the first week of classes and therefore no lab assignment to be submitted that week. BUT there is a self-paced lab (Lab 00) for you to go through on your own. You do not have to turn in this lab, but we will assume that you have gone through this lab and understand the code.

Policies

Late Assignments

Assignments are expected to be turned in promptly unless you have explicit permission from the instructor to turn the assignment in late. Gradescope will be set up to accept late assignments, but this is to accommodate people who received permission and does not mean that there is a blanket allowance for late submissions. Unless you receive permission to turn in a late assignment, late assignments will lose 10% of the total points per day (i.e. per 24 hours) and will not be accepted more than 3 days late. (You will see late deductions show up as a “question” on the graded homework.) Late projects will not be accepted at all, because we are on a tight timeline to grade them before grades are due.
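The arithmetic of the late penalty can be sketched as follows. This is an illustration, not the official calculation. It assumes that any started 24-hour period counts as a full day, that "total points" means the assignment's maximum possible points, and that a submission past the 3-day cutoff scores zero. The function name `late_adjusted_score` is hypothetical.

```python
import math


def late_adjusted_score(raw_points, total_points, hours_late):
    """Score after the late penalty, for a submission without permission.

    Assumptions (not stated in the policy): partial days round up, the 10%
    is taken from the assignment's total possible points, and a submission
    more than 3 days late is treated as a zero.
    """
    if hours_late <= 0:
        return float(raw_points)  # on time: no deduction
    days_late = math.ceil(hours_late / 24)
    if days_late > 3:
        return 0.0  # not accepted
    penalty = 0.10 * total_points * days_late
    return max(raw_points - penalty, 0.0)


# Hypothetical example: 45 of 50 points earned, submitted 30 hours late.
print(late_adjusted_score(45, 50, 30))
```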

Students with medical problems or other emergencies who need to submit an assignment late should reach out to the instructor (via a private message on Ed Discussion) promptly to avoid these penalties. Conflict with work in other classes will not be a reason for waiving late penalties, so please manage your time wisely.

Communicating with Instructional Staff

The best way to ask the instructor or GSI a question is to send a private message through Ed Discussion. This way it does not get lost in our inboxes and we can both see it and keep track of the responses to it. Ed Discussion will be checked by one of us at least once a day on weekdays, and almost never in the evening or on the weekend. You can mark these messages with “Administrative” to make sure we see them.

If you absolutely must email, the subject of your email must contain the text “STAT 131A” in order for it to make it into our inbox. The instructors do not explain course material over email and will not respond to emails with such requests. Please come to office hours or discussion sections for questions about the material, or post questions to Ed Discussion.

Academic Honesty Policy

  • Homework and projects must be done independently. You may discuss specific issues/questions you have about the homework with other classmates at a high level, but you must not sit down and do the assignment jointly. Giving advice about code or coding tips is also not cheating, but you cannot directly share code with other classmates. Requesting, obtaining, and/or using solutions from previous years or from the internet or other sources, if such happen to be available, is considered cheating.

  • For exams, cheating includes, but is not limited to, using electronic materials in an exam beyond those allowed, copying off another person’s exam or quiz, allowing someone to copy off of your exam or quiz, and having someone take an exam or quiz for you.

In fairness to students who put in an honest effort, cheaters will be harshly treated. Any evidence of cheating will result in a score of zero (0) on the entire assignment or examination, and perhaps a failing grade in the class. We will always report incidents of cheating to the Office of Student Conduct, which may administer additional punishment.

Disability

If you need accommodations for any disability, please speak to the instructor after class or during office hours. Please note you must make arrangements in a timely manner (through DSP) so that we can make the appropriate accommodations; last minute requests cannot be accommodated.

Using R and resources for R

We will be using R to analyze data in this class. You will be going through an introductory lab (Lab 00) to help you refresh your skills. See our page on R resources for useful tutorials and online help.

We expect that students have already been exposed to R or Python in Stat 20 or Data 8, respectively. For those who come from Data 8, the first labs will help you transfer that knowledge over to R. Both languages are excellent platforms for analyzing data, are widely used in data science, and have their individual strengths. R has been developed within the statistics community specifically for data analysis, while Python is a general-purpose programming language that also has extensive data analysis capabilities.

The lectures will be focused on the statistical concepts. Learning how to do these tasks in R will be the focus of the labs. Do not be surprised or worried about not following the details of R code during lectures – that is not the point (furthermore, there will sometimes be code that is really specific to the instructional purposes of the lecture, e.g. to make a specific plot, and is beyond the scope of what you would be expected to understand how to do yourself).

In addition to what you learn in lab, the class website provides, for every book chapter, a complementary html/.Rmd file with the code used in the lecture and brief explanations, available at https://epurdom.github.io/Stat131A/RSupport/index.html.

Furthermore, we expect that you’ve already been introduced to the basics, so that we will not need to walk you through every detail. The first step if you are confused about code is to look at the documentation of the function in R; the second step is to do a quick google search; and of course, ask us questions if you can’t figure it out after doing that.

Syllabus

Exam dates

  • The midterm is scheduled for Friday March 10
  • The final project will be due on the last day of reading week, Friday May 5
  • The final exam will be in class: Tuesday May 9, 11:30–2:30 pm (scheduled by the registrar)

Week-by-week guide

The following is a rough guideline for the material we will cover in the semester. The descriptions posted here are projected, and the actual topics may be updated or changed depending on the pace of the class.

| Week | Description | Chapter |
|------|-------------|---------|
| 01 | Boxplots, discrete distributions, intro to continuous distributions | 2.1.1-2.1.3 |
| 02 | Continuous distributions, density curves | 2.1.3-2.4.4 |
| 03 | Density estimation, Intro to Hypothesis testing | 2.4.5-2.5, 3.1 |
| 04 | Permutation test, t-test | 3.2-3.3 |
| 05 | Type I error, multiple testing, confidence intervals for t-test | 3.4-3.6 |
| 06 | Bootstrap Confidence intervals. Review simple regression. Polynomial regression, loess curves | 3.7-3.9, 4.1.1-4.3 |
| 07 | Finish loess curves | 4.4 |
| 08 | Finish local fitting, smooth density plots. MIDTERM | 4.5-4.6 |
| 09 | Pairs plots, Heatmaps, Start PCA | 5.1, 5.3 |
| 10 | PCA, start Multiple linear regression | 5.4, 6.1 |
| 11 | Multiple linear regression, fitting and interpretation, fitted values, residuals, Multiple R-squared, Residual degrees of freedom and residual standard error | 6.2-6.3 |
| 12 | Multiple regression with categorical explanatory variables and interactions | 6.4 |
| 13 | Inference in multiple regression: F-tests via the anova function | 6.5 |
| 14 | Inference in multiple regression: t-tests, standard errors, confidence intervals and prediction intervals. Variable selection in linear regression. Regression diagnostics | 6.6 |
| 15 | The Classification problem and logistic regression, interpretation in terms of odds, binary predictions via confusion matrices, precision and recall, deviance, variable selection via AIC | 7 |
| 16 | Reading and recitation week: Regression trees, classification trees and Random Forests | 8 |