Data Mining: Practical Machine Learning Tools and Techniques

Mihai Chelaru-Centea

Book by Ian Witten, Eibe Frank, Mark Hall, and Chris Pal

Resource last updated: Dec. 1, 2016


Overview

This is a meaty textbook that offers a thorough summary of machine learning tools, with in-depth explanations that often go into the statistical rationale behind the techniques. The authors are also the creators of the WEKA machine learning software, which is written in Java and comes with a variety of built-in classifiers, regressors, and unsupervised learning algorithms. They also offer several MOOCs on FutureLearn about using WEKA for machine learning. This book is one of the most comprehensive on the subject.

Format

The book is split into 12 chapters, each divided into several sections that go into detail on a particular concept or implementation. The first few chapters introduce the topic of data mining and machine learning, and the rest elaborate on implementations of specific families of algorithms. Each of the later chapters ends with a section called WEKA Implementations that describes how the algorithms in that chapter have been implemented in the WEKA API.

Content

The book starts with an overview of what data mining and machine learning are all about, which serves as an introduction for people new to the field. The second chapter defines key terms such as examples and attributes, which describe the inputs to the algorithms covered in later chapters.

Chapter 3 describes the outputs, such as decision trees, association rules, linear models, and clusters. The fourth chapter introduces the basic algorithms, including trees, rules, instance-based learning methods like k-nearest neighbors, and unsupervised methods like clustering. Chapter 5 focuses on evaluation strategies and metrics, such as train-test splits and cross-validation, hyperparameter tuning, and evaluating predictions with cost functions.
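To make the evaluation idea concrete, here is a minimal sketch of k-fold cross-validation wrapped around a 1-nearest-neighbor classifier. It is plain Python, not code from the book or from WEKA, and the function names and toy dataset are made up for illustration.

```python
def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k contiguous folds of near-equal size."""
    folds = []
    start = 0
    for i in range(k):
        # The first (n % k) folds each take one extra item.
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nearest_neighbor_predict(train, x):
    """Return the label of the training example whose value is closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def cross_validate(data, k):
    """Hold out each fold in turn, train on the rest, and return overall accuracy."""
    correct = 0
    for test_idx in k_fold_indices(len(data), k):
        held_out = set(test_idx)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        for i in test_idx:
            x, y = data[i]
            if nearest_neighbor_predict(train, x) == y:
                correct += 1
    return correct / len(data)


# Hypothetical toy dataset: one numeric attribute and a class label.
toy_data = [(1.0, "a"), (8.0, "b"), (1.5, "a"), (8.5, "b"), (2.0, "a"), (9.0, "b")]
accuracy = cross_validate(toy_data, 3)
print(accuracy)
```

Every example is used for testing exactly once and is never part of the training set that predicts it, which is the point of the technique. Real datasets are usually shuffled or stratified before being split into folds, a step this sketch leaves out.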

The remaining chapters each cover more specific implementations of the basic algorithms introduced in Chapter 4, or explain how data can be transformed to work with different techniques. The subject matter runs the gamut of the field of data mining, and reading this book cover to cover will give anyone a good general understanding of the field as a whole.

Difficulty

I would describe the general level of the book as intermediate. When I read it as a beginner, I had some background in statistics and mathematics, as well as some programming, and I still found even some of the earlier chapters difficult to wrap my head around. The examples are good and based on real-life datasets that make sense conceptually, but the book can feel pretty dry at times. The later chapters in particular sometimes get into the details of the statistics involved, with lots of equations. I would say all but experienced statisticians should skim or skip those sections and come back to them when looking for details on a specific implementation.

The Bottom Line


This is a great textbook to have in your arsenal, but if you are a beginner I would not recommend reading past the fifth chapter, and you may want to look for something written in simpler terms when just starting your foray into data mining. As a reference, it is hard to find anything that covers a wider range of topics. The WEKA Explorer, by contrast, is point-and-click software for doing machine learning, and I would say it is great for any beginner who doesn't have programming experience and just wants to run some algorithms on fairly clean datasets.

I have not personally checked out the WEKA MOOCs, but if you're just starting out and want to get a good general sense of how machine learning works without having to program it all yourself, download the WEKA Explorer and some datasets from the UCI Machine Learning Repository and get cracking!