Practical Statistics for Data Scientists

March 26, 2020
A second edition, *Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python*, is scheduled for publication on June 9, 2020.

Who is this book for?

According to the preface:

"This book is aimed at the data scientist with some familiarity with the R programming language, and with some prior (perhaps spotty or ephemeral) exposure to statistics."
"Two goals underlie this book:
1) to lay out, in digestible, navigable, and easily referenced form, key concepts from statistics that are relevant to data science, and
2) to explain which concepts are important and useful from a data science perspective, which are less so, and why."

Book outline

1. Exploratory Data Analysis
Elements of Structured Data
Rectangular Data
Estimates of Location
Estimates of Variability
Exploring the Data Distribution
Exploring Binary and Categorical Data
Exploring Two or More Variables
2. Data and Sampling Distributions
Random Sampling and Sample Bias
Selection Bias
Sampling Distribution of a Statistic
The Bootstrap
Confidence Intervals
Normal Distribution
Long-Tailed Distributions
Student’s t-Distribution
Binomial Distribution
Poisson and Related Distributions
3. Statistical Experiments and Significance Testing
A/B Testing [March 23, 2020]
Hypothesis Tests [March 25, 2020]
Resampling [March 26, 2020]
Statistical Significance and P-values
Multiple Testing
Degrees of Freedom
Chi-Square Test
Multi-Arm Bandit Algorithm
Power and Sample Size
4. Regression and Prediction
Simple Linear Regression
Multiple Linear Regression
Prediction Using Regression
Factor Variables in Regression
Interpreting the Regression Equation
Testing the Assumptions: Regression Diagnostics
Polynomial and Spline Regression
5. Classification
Naive Bayes
Discriminant Analysis
Logistic Regression
Evaluating Classification Models
Strategies for Imbalanced Data
6. Statistical Machine Learning
K-Nearest Neighbours
Tree Models
Bagging and the Random Forest
7. Unsupervised Learning
Principal Components Analysis
K-Means Clustering
Hierarchical Clustering
Model-Based Clustering
Scaling and Categorical Variables



The book is divided into fairly bite-sized concepts, which provide useful chunks to guide learning.


The number of equations included in the text can interrupt the flow and confuse readers without a significant statistics background.