Chapter 1 Introduction

This book consists of materials to accompany the course “Statistical Methods for Data Science” (STAT 131A) taught at UC Berkeley. STAT 131A is an upper-division course that follows an introductory statistics course, such as DATA 8 or STAT 20 taught at UC Berkeley.

The textbook teaches a broad range of statistical methods that are used to solve data problems. Topics include group comparisons and ANOVA, standard parametric statistical models, multivariate data visualization, multiple linear regression, logistic regression, classification and regression trees, and random forests.

These topics are covered at an intuitive level; only a semester of calculus is expected in order to follow the material. The goal of the book is to explain these more advanced topics at a level that is widely accessible.

In addition to an introductory statistics course, students in this course are expected to have had some introduction to programming. The textbook does not explain programming concepts, nor does it generally explain the R code shown in the book; the focus is on understanding the concepts and the output. To better understand the R code, please see the accompanying .Rmd file that steps through the code in each chapter (and the accompanying .html file that gives a compiled version). These can be found at

The datasets used in this manuscript should be made available to students in the class on bCourses by their instructor.

The contents of this book are licensed for free consumption under the following license: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).

1.1 Acknowledgements

This manuscript is based on lecture notes originally developed by Aditya Guntuboyina (Chapters 6-8) and Elizabeth Purdom (Chapters 2-5) in the spring of 2017, the first time the course was taught at UC Berkeley. Shobhana Stoyanov provided materials that aided in the writing of Section 2.2 of Chapter 2, as well as useful feedback.

