Intelligence Refinery

Practical Statistics for Data Scientists

March 26, 2020

A second edition, *Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python*, is scheduled for publication on June 9, 2020.

Who is this book for?

According to the preface:

"This book is aimed at the data scientist with some familiarity with the R programming language, and with some prior (perhaps spotty or ephemeral) exposure to statistics."

"Two goals underlie this book:
1) to lay out, in digestible, navigable, and easily referenced form, key concepts from statistics that are relevant to data science, and
2) to explain which concepts are important and useful from a data science perspective, which are less so, and why."

Book outline

1. Exploratory Data Analysis

Elements of Structured Data
Rectangular Data
Estimates of Location
Estimates of Variability
Exploring the Data Distribution
Exploring Binary and Categorical Data
Correlation
Exploring Two or More Variables
Summary

2. Data and Sampling Distributions

Random Sampling and Sample Bias
Selection Bias
Sampling Distribution of a Statistic
The Bootstrap
Confidence Intervals
Normal Distribution
Long-Tailed Distributions
Student’s t-Distribution
Binomial Distribution
Poisson and Related Distributions
Summary

3. Statistical Experiments and Significance Testing

A/B Testing [March 23, 2020]
Hypothesis Tests [March 25, 2020]
Resampling [March 26, 2020]
Statistical Significance and P-values
t-Tests
Multiple Testing
Degrees of Freedom
ANOVA
Chi-Square Test
Multi-Arm Bandit Algorithm
Power and Sample Size
Summary

4. Regression and Prediction

Simple Linear Regression
Multiple Linear Regression
Prediction Using Regression
Factor Variables in Regression
Interpreting the Regression Equation
Testing the Assumptions: Regression Diagnostics
Polynomial and Spline Regression
Summary

5. Classification

Naive Bayes
Discriminant Analysis
Logistic Regression
Evaluating Classification Models
Strategies for Imbalanced Data
Summary

6. Statistical Machine Learning

K-Nearest Neighbours
Tree Models
Bagging and the Random Forest
Boosting
Summary

7. Unsupervised Learning

Principal Components Analysis
K-Means Clustering
Hierarchical Clustering
Model-Based Clustering
Scaling and Categorical Variables
Summary

Impressions

Strengths

The book is divided into fairly bite-sized concepts, which provide useful chunks to guide learning.

Weaknesses

The number of equations included in the text can interrupt the flow and confuse readers without a significant statistics background.
