Chapter 34: Further Reading — Predictive Models: Regression and Classification

These resources are selected for business practitioners who want to go deeper on the models covered in this chapter without needing advanced mathematics. All books and papers cited are published, readily available works.


Books

"An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani The most accessible rigorous treatment of machine learning for practitioners. Published by Springer and freely available as a PDF at statlearning.com. Chapters 3 (linear regression), 4 (logistic regression), 5 (cross-validation and resampling), and 8 (tree-based methods) map directly to this chapter's content. The R code throughout the book has a Python companion repository (ISLP). The mathematical derivations are present, but the authors consistently explain the intuition first, which is the right balance for business readers who want to understand what they are doing.

"Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Aurélien Géron The most practical book for Python-based machine learning. Part I covers every algorithm in this chapter — linear regression, logistic regression, decision trees, random forests — with full scikit-learn code examples and honest discussion of each algorithm's assumptions and limitations. If you want to go beyond this chapter's scope into gradient boosting (XGBoost, LightGBM) or neural networks, Géron's book is the natural next step. Published by O'Reilly.

"The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman The graduate-level companion to "An Introduction to Statistical Learning." More mathematical, but freely available at web.stanford.edu/~hastie/ElemStatLearn. For readers who want to understand why random forests work (bias-variance tradeoff, bagging theory), or the mathematical derivation of logistic regression's likelihood function, this is the authoritative reference. It is not a book to read cover to cover; it is a reference to consult when you want to understand something precisely.

"Machine Learning for Business" by Doug Hudgeon and Richard Nichol Explicitly aimed at business practitioners rather than engineers. Focuses on problem framing, model selection, and communicating results rather than implementation details. Particularly strong on the organizational and decision-making context — when to build a model vs. when a simpler analysis suffices, how to present probabilistic predictions to leadership, and how to avoid the trap of optimizing the wrong metric. Manning Publications.

"Naked Statistics" by Charles Wheelan Not a machine learning book — a statistics book aimed at general readers. But the chapters on regression, correlation vs. causation, and the problems with p-values provide essential conceptual grounding for anyone building predictive models professionally. Its discussion of overfitting is one of the clearest non-technical explanations of the problem you will find anywhere. W.W. Norton.


Articles and Papers

"A Few Useful Things to Know About Machine Learning" by Pedro Domingos, Communications of the ACM, 2012 A landmark practitioner's guide to the things that do not appear in textbooks: why more features can hurt, why evaluation is harder than it looks, why overfitting is the central problem of ML, and why interpretability often matters more than accuracy. A short paper worth more than many books. Freely available via search.

"Practical Advice for Analysis of Large, Complex Data Sets" by Patrick Riley (Google) A practitioner's checklist for applied machine learning: sanity-checking features, understanding model behavior before deploying, debugging surprising results. Written from the perspective of production machine learning at Google but directly applicable to business analysts working with customer data. Available at developers.googleblog.com.

"Machine Learning: The High Interest Credit Card of Technical Debt" by Sculley et al., SE4ML workshop at NIPS 2014 (expanded in 2015 as "Hidden Technical Debt in Machine Learning Systems") Explains why machine learning systems are expensive to maintain over time: data dependencies, hidden feedback loops, feature entanglement, and concept drift. Essential reading before deploying any model to production. The churn model from Case Study 34-01 will require ongoing monitoring for exactly the reasons this paper describes. Available via Google Scholar.

"Evaluating Machine Learning Models" by Alice Zheng (O'Reilly free report) A concise guide to evaluation metrics for both regression and classification, with clear explanations of when each metric is appropriate. Covers the precision-recall tradeoff, AUC interpretation, and the problems with accuracy on imbalanced datasets. Available as a free PDF at oreilly.com.
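One of the report's central points — that accuracy misleads on imbalanced datasets — can be seen in a few lines. The sketch below uses synthetic labels with an illustrative 5% positive rate (a churn-style imbalance); a model that always predicts the majority class scores high accuracy while catching no positives at all.

```python
# Why accuracy misleads on imbalanced data: always predicting the
# majority class looks accurate but has zero recall on the minority class.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)  # ~5% positive class (illustrative)
y_pred = np.zeros_like(y_true)                  # always predict the majority class

print(accuracy_score(y_true, y_pred))                 # high -- looks impressive
print(recall_score(y_true, y_pred, zero_division=0))  # 0.0 -- catches no churners
```

This is exactly why the report steers practitioners toward precision, recall, and AUC for imbalanced problems.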


Online Resources

scikit-learn User Guide (scikit-learn.org/stable/user_guide) The official documentation is unusually good — each algorithm's page includes both API reference and conceptual explanation, with worked examples. The "Supervised Learning" section maps directly to this chapter. The section on "Model Evaluation: Quantifying the Quality of Predictions" is a comprehensive reference for metrics. Readable top to bottom for learning; useful as a reference while working.

scikit-learn "Common Pitfalls and Recommended Practices" (official documentation) A dedicated page in the scikit-learn docs listing the most frequent mistakes practitioners make: data leakage, wrong evaluation procedures, inappropriate feature selection. Available at scikit-learn.org/stable/common_pitfalls.html. Read this before deploying any model.
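One pitfall that page covers, preprocessing leakage, deserves a concrete sketch. Fitting a scaler on the full dataset lets test-fold statistics leak into training; the safe pattern is to put the scaler and model in a single Pipeline and cross-validate the pipeline, so the scaler is refit on each training fold. The data below is synthetic and purely illustrative.

```python
# Avoiding preprocessing leakage: cross-validate a Pipeline so the
# scaler is fit only on each fold's training data, never on test data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)  # scaler refit inside each fold
print(scores.mean())
```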

Kaggle Learn: Machine Learning (kaggle.com/learn) Free, self-paced courses on machine learning with Jupyter notebook exercises that run in the browser without local setup. "Intro to Machine Learning" and "Intermediate Machine Learning" cover the tree-based methods from this chapter with hands-on practice. The "Feature Engineering" course is particularly good for developing intuition about what kinds of engineered features are most useful.

fast.ai Practical Deep Learning for Coders — Tabular Data (fast.ai) Jeremy Howard's course materials include an excellent module on tabular data prediction using random forests and gradient boosting. The philosophy — start with baselines, understand your data before your model, evaluate honestly — aligns with this chapter's approach. Even if you do not continue to deep learning, the tabular chapters are valuable.


Python Libraries for Going Deeper

xgboost and lightgbm Gradient boosting implementations that frequently outperform Random Forest on tabular business data. Both install via pip and integrate with scikit-learn through the familiar .fit() / .predict() interface. If cross-validation shows that Random Forest is not improving over logistic regression, gradient boosting is the logical next step before moving to neural networks. Documentation at xgboost.readthedocs.io and lightgbm.readthedocs.io.

shap SHAP (SHapley Additive exPlanations) provides mathematically grounded feature attributions for any model — including black-box models like Random Forest. Where Random Forest's built-in feature importances give a global average, SHAP gives per-prediction explanations: "for this specific customer, these three features drove the churn probability up by 0.31." When Sandra asks "why is Meridian Supply flagged?", SHAP provides the rigorous answer. Available on PyPI; documentation at shap.readthedocs.io.
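SHAP's additivity guarantee is easiest to see on a linear model, where each feature's attribution reduces to its coefficient times the feature's deviation from the mean (assuming roughly independent features). The sketch below hand-rolls that special case on synthetic data; shap's explainers generalize the same idea to models like Random Forest.

```python
# Per-prediction attributions for a linear model: phi_i = w_i * (x_i - mean_i).
# The attributions plus the base value recover the prediction exactly,
# which is the additivity property SHAP guarantees for any model.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
x = X[0]                                  # one specific "customer"
phi = model.coef_ * (x - X.mean(axis=0))  # per-feature contributions
base = model.predict(X).mean()            # average prediction (the base value)

print(phi)               # which features pushed this prediction up or down
print(base + phi.sum())  # equals model.predict for this row
```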

imbalanced-learn A scikit-learn-compatible library providing SMOTE (Synthetic Minority Oversampling Technique) and other approaches to imbalanced classification beyond class weighting. When class_weight="balanced" is not sufficient — for example, when the minority class has fewer than 50 examples — SMOTE-based approaches can provide additional improvement. Available on PyPI as imbalanced-learn.
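The core idea behind SMOTE can be sketched by hand: synthesize new minority examples by interpolating between a minority point and one of its nearest minority neighbors. The toy version below (synthetic data, simplified bookkeeping) shows the mechanism that imbalanced-learn wraps in a scikit-learn-style fit_resample API.

```python
# A hand-rolled sketch of SMOTE's mechanism: for each minority sample,
# pick a random nearby minority neighbor and create a synthetic point
# somewhere on the line segment between them.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
minority = rng.normal(loc=2.0, size=(20, 2))  # a small minority class

nn = NearestNeighbors(n_neighbors=3).fit(minority)
_, idx = nn.kneighbors(minority)              # idx[i][0] is the point itself

synthetic = []
for i in range(len(minority)):
    j = rng.choice(idx[i][1:])                # a random true neighbor
    gap = rng.random()                        # interpolation factor in [0, 1]
    synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
synthetic = np.array(synthetic)

print(synthetic.shape)  # one new synthetic sample per original minority point
```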

optuna or scikit-learn's GridSearchCV / RandomizedSearchCV Hyperparameter tuning: finding the best values for max_depth, n_estimators, C, and other parameters automatically. GridSearchCV exhaustively searches a parameter grid (slow but thorough); RandomizedSearchCV samples randomly (faster for large grids); optuna uses Bayesian optimization (most efficient for expensive models). Start with RandomizedSearchCV for most business problems. Documentation for scikit-learn options at scikit-learn.org; optuna at optuna.org.
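A minimal sketch of randomized search over a random forest follows; the parameter grid is illustrative, not a recommendation, and the data is synthetic.

```python
# Randomized hyperparameter search: sample parameter combinations,
# cross-validate each, and keep the best-scoring one.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, 10, None],
    },
    n_iter=5,        # try 5 random combinations instead of all 12
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)  # the winning combination
print(search.best_score_)   # its mean cross-validated accuracy
```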

mlflow Experiment tracking: logging which features you used, which hyperparameters you tried, and what metrics resulted, so you can compare runs and reproduce results months later. For any model you plan to update or maintain over time, tracking experiments with mlflow (or a similar tool) is professional practice. Documentation at mlflow.org.
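What mlflow automates can be mimicked in a few lines of plain Python. The toy stand-in below is not mlflow's API — mlflow adds persistent storage, a UI, and artifact logging — but it shows the underlying pattern: record parameters and metrics per run, then query for the best.

```python
# A toy experiment log illustrating the pattern mlflow automates:
# record each run's parameters and metrics, then find the best run.
# (Illustrative stand-in only; not mlflow's actual API.)
import json

runs = []

def log_run(params, metrics):
    runs.append({"params": params, "metrics": metrics})

log_run({"model": "logreg", "C": 1.0}, {"auc": 0.81})
log_run({"model": "rf", "max_depth": 5}, {"auc": 0.85})
log_run({"model": "rf", "max_depth": 10}, {"auc": 0.83})

best = max(runs, key=lambda r: r["metrics"]["auc"])
print(json.dumps(best, indent=2))  # a reproducible record of the winning run
```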


Connecting to Other Chapters in This Book

Chapter 26 (Time Series Forecasting): Chapter 26 built exponential smoothing models for inventory demand forecasting. The predictive models in this chapter are complementary — regression models work best when the relationship between features and outcome is relatively stable over time, whereas time series models explicitly handle temporal structure. For predictions where timing patterns matter (seasonal demand, monthly churn cycles), Chapter 26's techniques apply directly.

Chapter 32 (Supply Chain Analytics): The supply chain data built in Chapter 32 — inventory levels, lead times, demand velocity — is rich territory for predictive classification. Exercise 5.5 in this chapter explores connecting these systems directly.

Chapter 33 (Interactive Dashboards with Streamlit): A churn model that produces probability scores is most useful when those scores are surfaced in an interactive interface — allowing Sandra to filter by segment, sort by risk score, or drill into individual accounts. Chapter 33's Streamlit skills convert the static prediction outputs from this chapter into a live, browser-accessible tool.

Chapter 36 (Advanced Analytics and Model Deployment): This chapter covers training and evaluating models. Chapter 36 addresses what comes next: packaging models for production use, scheduling periodic retraining, monitoring for drift, and integrating predictions into existing business systems.