How to Use This Book

Structure

This book is organized into seven parts that build from fundamentals to production:

| Part | Focus | Chapters |
|---|---|---|
| I. The ML Mindset | How to think about prediction, experimentation, and the math that powers ML | 1–4 |
| II. Feature Engineering and Data Preparation | Getting data, building features, and assembling pipelines | 5–10 |
| III. Supervised Learning | Algorithms, evaluation, tuning, and interpretation | 11–19 |
| IV. Unsupervised Learning | Clustering, dimensionality reduction, anomaly detection, and recommendations | 20–24 |
| V. Specialized Data Types | Time series, text, geospatial, and large-scale data | 25–28 |
| VI. From Notebook to Production | Software engineering, deployment, monitoring, fairness, and business impact | 29–34 |
| VII. Synthesis | Capstone project and the road ahead | 35–36 |

Parts I–III should be read in order. Parts IV and V can be read in any order after Part III. Part VI should be read in order after at least Parts I–III. Part VII is the capstone.

Chapter Components

Every chapter contains seven files:

  • index.md — The main content. Read this first. Typically 8,000–12,000 words with embedded code, math, and visualizations.
  • exercises.md — Hands-on practice. Ranges from "apply this technique to the StreamFlow data" to "debug this intentionally broken pipeline."
  • quiz.md — Self-assessment. 10–15 multiple-choice and short-answer questions with answers in Appendix B.
  • case-study-01.md and case-study-02.md — Extended scenarios drawn from the anchor examples. Apply the chapter's concepts to realistic problems with real tradeoffs.
  • key-takeaways.md — The essential points. If you read nothing else, read this.
  • further-reading.md — Papers, blog posts, documentation, and other resources for going deeper.

The Progressive Project: StreamFlow Churn Prediction

Across the book, you build a complete ML system for predicting customer churn at StreamFlow, a subscription streaming analytics platform. Each chapter adds a component:

  • Part I: Frame the problem and design the validation experiment
  • Part II: Extract data with SQL, engineer features, build the preprocessing pipeline
  • Part III: Train models, evaluate properly, handle imbalance, tune, and interpret
  • Part VI: Deploy as a REST API, add monitoring, audit for fairness, calculate ROI
  • Part VII: Integrate everything into a portfolio-ready capstone

Look for the Progressive Project callout in each chapter. The milestones are cumulative — each builds on the previous.

Four Anchor Examples

Four recurring scenarios ground every concept in a specific domain:

  1. StreamFlow (SaaS/subscription) — The progressive project. Customer churn prediction from subscription and usage data.
  2. Metro General Hospital (healthcare) — 30-day readmission prediction with fairness constraints across patient demographics.
  3. ShopSmart (e-commerce) — A/B testing recommendation algorithms, conversion optimization, and market basket analysis.
  4. TurbineTech (manufacturing/IoT) — Predictive maintenance from sensor time series with cost-asymmetric errors.

Callout Boxes

Throughout the text, you will encounter these callout types:

War Story — Real-world anecdotes from production data science. These are the lessons that textbooks usually skip.

Common Mistake — The error that every junior data scientist makes at least once. We name it so you can avoid it.

Production Tip — Advice that matters in production but not in a homework assignment.

Math Sidebar — Deeper mathematical treatment for readers who want the formal details. Skippable without losing the main thread.

Try It — Interactive prompts to modify and experiment with the code.

Debug Challenge — Intentionally broken code. Find and fix the bug.

Code and Environment

All code is Python 3.10+. The core stack:

  • Data: pandas, numpy, SQL (PostgreSQL dialect)
  • ML: scikit-learn, XGBoost, LightGBM, CatBoost
  • Visualization: matplotlib, seaborn
  • Interpretation: SHAP
  • Tracking: MLflow
  • Deployment: FastAPI, Docker
  • Specialized: statsmodels, Prophet, geopandas, Dask, Polars, nltk

See Appendix D for complete setup instructions.
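Before diving in, a quick sanity check can confirm your interpreter version and flag missing core packages. This is only a sketch (the package list below is a subset of the stack above, using import names, which can differ from pip names, e.g. scikit-learn imports as sklearn); Appendix D remains the authoritative setup guide.

```python
import importlib.util
import sys

# The book's code targets Python 3.10 or newer.
ok = sys.version_info >= (3, 10)
print("python >= 3.10:", ok)

# A subset of the core stack, by import name (scikit-learn -> sklearn).
core = ["pandas", "numpy", "sklearn", "matplotlib"]
missing = [name for name in core if importlib.util.find_spec(name) is None]
print("missing packages:", missing or "none")
```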

All code uses random_state=42 for reproducibility. All model training uses scikit-learn Pipelines with proper train/test separation.
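That pattern can be sketched in a few lines. The data and column count here are illustrative stand-ins, not the StreamFlow dataset; the point is the shape of every training script in the book: split first, then fit a Pipeline on training data only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the book's chapters use real StreamFlow features.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Split BEFORE any fitting, so no test-set information leaks into preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The Pipeline fits the scaler on training data only, then reuses those
# learned statistics when scoring the held-out test set.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])
pipe.fit(X_train, y_train)
print(f"test accuracy: {pipe.score(X_test, y_test):.3f}")
```

Fixing `random_state=42` in the data split and the model makes the run repeatable; change it (or remove it) to see how much results vary across random splits.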

Learning Paths

Not everyone reads front to back. Here are suggested paths by goal:

The Job Seeker (fastest path to portfolio project): Ch 1–2 → Ch 5–6 → Ch 10 → Ch 11 → Ch 13–14 → Ch 16 → Ch 19 → Ch 29 → Ch 31 → Ch 35

The Algorithm Deep-Diver (master supervised + unsupervised learning): Ch 1 → Ch 4 → Ch 11–19 → Ch 20–24

The ML Engineer (focus on production): Ch 2 → Ch 10 → Ch 16 → Ch 29–32 → Ch 35

The Experimenter (A/B testing and causal reasoning): Ch 3 → Ch 4 → Ch 16 → Ch 17 → Ch 33 → Ch 34

For Instructors

See the Instructor Guide for:

  • Three syllabi (15-week semester, 30-week two-semester, self-paced)
  • A 12-week bootcamp syllabus
  • Chapter-by-chapter teaching notes and discussion guides
  • Midterm and final exams with rubrics
  • A guide to running in-class Kaggle-style competitions