Chapter 7 Further Reading: Supervised Learning -- Classification


Classification Fundamentals

1. James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning with Applications in Python. Springer. The most widely recommended introduction to statistical learning, pairing clear mathematical intuition with practical Python code. The classification chapters cover logistic regression, discriminant analysis, and K-nearest neighbors; Chapter 4 (Classification) and Chapter 8 (Tree-Based Methods) are directly relevant. Freely available online at statlearning.com, making it an ideal first reference for any concept from this chapter that you want to explore more deeply.

2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. The more mathematically rigorous companion to ISLR. If you want to understand the mathematics behind logistic regression, information gain, or gradient boosting beyond the intuitive explanations in this chapter, this is the canonical reference. Chapters 4 (Linear Methods for Classification), 9 (Additive Models, Trees, and Related Methods), and 10 (Boosting and Additive Trees) provide the formal treatment. Also freely available online.

3. Raschka, S., Liu, Y. H., & Mirjalili, V. (2022). Machine Learning with PyTorch and Scikit-Learn. Packt Publishing. A hands-on guide to implementing ML algorithms in Python. The scikit-learn sections provide production-quality code patterns for classification workflows, feature engineering, model evaluation, and pipeline construction. Particularly useful if you want to extend the ChurnClassifier code from this chapter with more sophisticated scikit-learn techniques like Pipeline and GridSearchCV.
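
The Pipeline-plus-GridSearchCV pattern that Raschka et al. cover can be sketched briefly. The data here is synthetic and the hyperparameter grid is illustrative, not a recommendation; in practice you would substitute the chapter's churn features and a grid suited to your estimator.

```python
# A minimal sketch of chaining preprocessing and a classifier, then
# tuning the whole chain. Synthetic data stands in for real features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chaining the scaler and the estimator prevents test-set leakage:
# the scaler is re-fit on each cross-validation training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid keys follow the "<step>__<param>" naming convention.
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 3))
```

The key design point is that the search cross-validates the entire pipeline, so preprocessing statistics never leak from validation folds into training.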


Decision Trees and Ensemble Methods

4. Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5-32. The original paper introducing random forests. Breiman's writing is unusually clear for an academic paper, and the core intuitions he presents -- bagging, random feature selection, out-of-bag error estimation -- remain the best explanations of why random forests work. Worth reading even if you never plan to implement the algorithm from scratch. One of the most cited papers in machine learning.
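
Breiman's out-of-bag idea is easy to see in code. A minimal sketch on synthetic data: each tree trains on a bootstrap sample, so the roughly 37% of rows it never saw provide a free validation estimate without a holdout set.

```python
# Out-of-bag error estimation, one of the ideas from Breiman (2001).
# Data is synthetic; parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # score each row using only trees that never trained on it
    random_state=0,
)
forest.fit(X, y)
print(round(forest.oob_score_, 3))  # OOB accuracy, no separate holdout needed
```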

5. Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794. The paper that introduced XGBoost and helped establish gradient boosting as the dominant algorithm for structured data. Chen and Guestrin explain the system design decisions (approximate greedy algorithm, sparsity-aware split finding, cache-aware access) that make XGBoost both fast and accurate. Understanding these design choices helps explain why XGBoost dominates Kaggle competitions and production ML systems.

6. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." Advances in Neural Information Processing Systems, 30. The Microsoft Research paper introducing LightGBM. If you work with datasets larger than a few hundred thousand rows, LightGBM's innovations (gradient-based one-side sampling, exclusive feature bundling) offer meaningful speed improvements over XGBoost with comparable accuracy. The paper is worth reading for practitioners who need to scale gradient boosting to large business datasets.


Model Evaluation and Metrics

7. Provost, F., & Fawcett, T. (2013). Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking. O'Reilly Media. The best book for understanding classification metrics in a business context. Chapters 7 and 8 provide an exceptional treatment of evaluation metrics, ROC curves, profit curves, and the cost-sensitive decision framework that directly informs Section 7.10 of this chapter. If you read one supplementary text on connecting model performance to business outcomes, this should be it.

8. Davis, J., & Goadrich, M. (2006). "The Relationship Between Precision-Recall and ROC Curves." Proceedings of the 23rd International Conference on Machine Learning, 233-240. An important paper explaining when precision-recall curves are more informative than ROC curves, particularly for imbalanced datasets. For business applications where the positive class is rare (fraud, default, churn), precision-recall analysis often provides a more honest picture of model performance than AUC-ROC. This paper will be revisited in Chapter 11.
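
Davis and Goadrich's point can be reproduced in miniature. On a synthetic dataset with a 2% positive class (the numbers below are invented for illustration), the two summary metrics can tell quite different stories about the same scores.

```python
# Comparing ROC-AUC with average precision (the PR-curve summary)
# on a heavily imbalanced synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# ~2% positives with a little label noise: fraud/default territory.
X, y = make_classification(n_samples=10000, weights=[0.98],
                           flip_y=0.01, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC-AUC:", round(roc_auc_score(y_te, scores), 3))
print("Avg precision:", round(average_precision_score(y_te, scores), 3))
# Typically the second number is well below the first: ROC-AUC is
# insensitive to the flood of true negatives, average precision is not.
```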

9. Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities with Supervised Learning." Proceedings of the 22nd International Conference on Machine Learning, 625-632. A study comparing the calibration of different classification algorithms -- a topic introduced in Section 7.3. The authors find that logistic regression and neural networks tend to produce well-calibrated probabilities, while decision trees and random forests do not. Understanding calibration is essential for business applications where probability scores drive tiered actions (as in Athena's intervention strategy).
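
Calibration can be checked the way Niculescu-Mizil and Caruana do: bin the predicted probabilities and compare each bin's mean prediction to its observed positive rate. A sketch on synthetic data (well-calibrated bins should lie near the diagonal):

```python
# Reliability-diagram data via scikit-learn's calibration_curve.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# frac_pos[i] is the empirical positive rate in bin i;
# mean_pred[i] is the average predicted probability in that bin.
frac_pos, mean_pred = calibration_curve(y_te, probs, n_bins=10)
for fp, mp in zip(frac_pos, mean_pred):
    print(f"predicted {mp:.2f} -> observed {fp:.2f}")
```

For models that are poorly calibrated out of the box (the paper's finding for trees and forests), scikit-learn's CalibratedClassifierCV applies Platt scaling or isotonic regression as a post-processing step.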


Feature Engineering and Class Imbalance

10. Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media. A practical guide to feature engineering that covers encoding, scaling, interaction features, and domain-specific techniques. The chapter on text and categorical features is particularly useful for business applications with mixed data types. Extends the feature engineering concepts introduced in Section 7.7 with more advanced techniques.

11. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research, 16, 321-357. The original paper introducing SMOTE, one of the most widely used techniques for handling class imbalance. The paper explains the algorithm's intuition (interpolating between minority class examples) and evaluates its effectiveness across multiple datasets. If you are considering SMOTE for a business application, this paper provides the context for understanding when it works well and when it introduces artifacts.
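
The interpolation step at the heart of SMOTE is compact enough to sketch directly (the imbalanced-learn library provides a full, production-grade implementation): a synthetic minority example is a random point on the segment between a minority example and one of its minority-class nearest neighbors. The toy data below is invented for illustration.

```python
# The core SMOTE step, stripped to its intuition.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
minority = rng.normal(loc=2.0, size=(20, 2))  # toy minority class

# Each minority point's k nearest minority neighbors (excluding itself).
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
_, idx = nn.kneighbors(minority)

synthetic = []
for i in range(len(minority)):
    j = rng.choice(idx[i][1:])    # pick one neighbor at random
    gap = rng.random()            # interpolation factor in [0, 1)
    synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
synthetic = np.array(synthetic)
print(synthetic.shape)  # one synthetic point per original example
```

Seeing the algorithm this way also makes its failure mode concrete: interpolating between minority points that sit inside majority-class territory manufactures plausible-looking but misleading examples.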

12. He, H., & Garcia, E. A. (2009). "Learning from Imbalanced Data." IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284. A comprehensive survey of techniques for handling class imbalance, covering oversampling, undersampling, cost-sensitive learning, and ensemble approaches. Provides a systematic framework for choosing among the strategies discussed in Section 7.7. Essential reading for practitioners working with imbalanced business datasets (which is most business datasets).


Applied Classification in Business

13. Provost, F., & Fawcett, T. (2001). "Robust Classification for Imprecise Environments." Machine Learning, 42(3), 203-231. Addresses the practical reality that in business applications, the costs of different types of errors are often not precisely known. The authors develop cost-sensitive evaluation methods that remain robust across a range of cost assumptions -- a framework directly applicable to the threshold optimization problem discussed in Section 7.10. Highly relevant for any reader who found the business economics of classification compelling.
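
The threshold-optimization framework can be sketched as a profit sweep. Everything below is invented for illustration: the scores are synthetic, and the dollar figures are assumptions standing in for a real cost matrix.

```python
# Choosing a classification threshold by expected profit rather than
# accuracy, under an assumed (hypothetical) cost structure.
import numpy as np

rng = np.random.default_rng(0)
y = rng.random(5000) < 0.1                                 # 10% churners
scores = np.clip(y * 0.3 + rng.random(5000) * 0.7, 0, 1)   # noisy model scores

VALUE_SAVED = 200.0   # profit from retaining a true churner (assumption)
COST_OFFER = 20.0     # cost of making the retention offer (assumption)

best_t, best_profit = 0.0, -np.inf
for t in np.linspace(0.05, 0.95, 19):
    flagged = scores >= t
    tp = np.sum(flagged & y)                   # churners the offer reaches
    profit = tp * VALUE_SAVED - np.sum(flagged) * COST_OFFER
    if profit > best_profit:
        best_t, best_profit = t, profit
print(f"best threshold {best_t:.2f}, profit ${best_profit:,.0f}")
```

Provost and Fawcett's contribution is what to do when VALUE_SAVED and COST_OFFER are not known precisely: evaluate models across a range of cost assumptions and prefer the one that remains profitable throughout.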

14. Verbeke, W., Dejaeger, K., Martens, D., Hur, J., & Baesens, B. (2012). "New Insights into Churn Prediction in the Telecommunication Sector: A Profit Maximizing Approach." European Journal of Operational Research, 218(1), 211-229. An empirical study that demonstrates how profit-maximizing classification (as opposed to accuracy-maximizing classification) can significantly improve the business value of churn models. The authors show that the profit-optimal model is often not the one with the highest AUC, reinforcing the chapter's message that model metrics and business metrics are different things.

15. Baesens, B., Van Vlasselaer, V., & Verbeke, W. (2015). Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection. Wiley. A comprehensive treatment of classification for fraud detection -- one of the highest-value business applications. Covers the unique challenges of fraud classification: extreme class imbalance, adversarial dynamics (fraudsters adapt to models), concept drift, and real-time scoring requirements. Useful for readers interested in applying classification concepts beyond churn prediction.


Fairness, Interpretability, and Ethics in Classification

16. Barocas, S., Hardt, M., & Narayanan, A. (2023). Fairness and Machine Learning: Limitations and Opportunities. MIT Press. The definitive academic text on fairness in ML classification. Covers mathematical definitions of fairness (demographic parity, equalized odds, calibration), their inherent conflicts, and practical strategies for building fairer models. Directly relevant to the ZestFinance case study and essential preparation for Chapters 25 and 26. Freely available online at fairmlbook.org.
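
Two of the fairness definitions the book formalizes can be computed in a few lines. All numbers below are synthetic and invented for illustration; the point is only to show what each metric measures.

```python
# Demographic parity compares approval rates across groups; the
# true-positive-rate component of equalized odds compares how often
# truly qualified applicants are approved in each group.
import numpy as np

rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=4000)          # protected attribute, A/B
# Different base rates of qualification by group (synthetic assumption).
qualified = rng.random(4000) < np.where(group == 0, 0.5, 0.3)
# Approval correlated with qualification, imperfectly.
approved = rng.random(4000) < np.where(qualified, 0.8, 0.2)

for g in (0, 1):
    mask = group == g
    rate = approved[mask].mean()                 # demographic parity view
    tpr = approved[mask & qualified].mean()      # equalized-odds TPR view
    print(f"group {g}: approval rate {rate:.2f}, TPR {tpr:.2f}")
# With different base rates, (near-)equal TPRs still produce unequal
# approval rates: the two criteria pull in different directions.
```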

17. Rudin, C. (2019). "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead." Nature Machine Intelligence, 1(5), 206-215. A provocative and influential argument that post-hoc explanations of complex models (like SHAP values applied to XGBoost) are fundamentally unreliable and should not be used as a substitute for inherently interpretable models (like logistic regression or small decision trees) in high-stakes domains. Challenges the assumption that you can have both maximum accuracy and adequate interpretability. Essential reading for anyone deploying classification models in regulated industries.

18. Kleinberg, J., Mullainathan, S., & Raghavan, M. (2016). "Inherent Trade-Offs in the Fair Determination of Risk Scores." arXiv preprint arXiv:1609.05807. A landmark paper proving mathematically that certain definitions of fairness are incompatible with each other (except in trivial cases). This impossibility result has profound implications for businesses deploying classification models in contexts where fairness matters -- essentially all contexts. Understanding these trade-offs is necessary for making informed decisions about which fairness criteria to prioritize.


Industry Case Studies and Reports

19. Zest AI. (2023). "The State of AI in Financial Services." Zest AI. An industry report from the company featured in Case Study 2, providing data on ML adoption in lending, fairness measurement practices, and regulatory compliance approaches. Read with appropriate awareness of the source's commercial interest, but useful for understanding how ML credit scoring is marketed and deployed in practice.

20. American Express. (2022). "American Express Default Prediction." Kaggle Competition. The public dataset released by American Express for a Kaggle competition on default prediction. Exploring this dataset provides hands-on experience with a real (anonymized) financial services classification problem. The competition discussion forum contains insights from thousands of data scientists on feature engineering, model selection, and evaluation techniques. Available at kaggle.com/competitions/amex-default-prediction.

21. Consumer Financial Protection Bureau. (2022). "CFPB Acts to Protect the Public from Black-Box Credit Models Using Complex Algorithms." CFPB Press Release and Circular 2022-03. The regulatory guidance requiring lenders to provide specific adverse action reasons even when using complex ML models. Essential reading for anyone deploying classification models in financial services -- and instructive for other regulated industries where similar requirements may emerge.


scikit-learn Documentation and Tutorials

22. Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research, 12, 2825-2830. The foundational paper for the scikit-learn library used throughout this chapter. While the documentation at scikit-learn.org is the primary practical reference, the original paper provides context on the library's design philosophy: consistency, simplicity, and composability. The scikit-learn user guide sections on classification, model evaluation, and preprocessing are the most directly relevant to extending the code in this chapter.

23. scikit-learn Documentation: "Classification Metrics" and "Model Evaluation" sections. The official scikit-learn documentation for classification metrics (precision, recall, F1, ROC-AUC, confusion matrix) includes detailed explanations, mathematical formulas, and code examples. Available at scikit-learn.org/stable/modules/model_evaluation.html. This is the most authoritative and up-to-date reference for the metrics discussed in Section 7.9.
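
The metrics documented there agree with the hand formulas, which a tiny worked example confirms. The labels below are a toy confusion: 3 true positives, 1 false positive, 1 false negative, 5 true negatives.

```python
# scikit-learn's classification metrics versus the textbook formulas.
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# confusion_matrix returns [[tn, fp], [fn, tp]] for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision", precision_score(y_true, y_pred))  # tp / (tp + fp) = 0.75
print("recall", recall_score(y_true, y_pred))        # tp / (tp + fn) = 0.75
print("f1", f1_score(y_true, y_pred))                # harmonic mean = 0.75
```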


Advanced Topics (Preview of Later Chapters)

24. Lundberg, S. M., & Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." Advances in Neural Information Processing Systems, 30. The paper introducing SHAP (SHapley Additive exPlanations), which provides a unified framework for interpreting individual predictions from any ML model. SHAP values are increasingly the standard for model interpretability in business applications. We will cover SHAP in detail in Chapter 26 (Fairness, Explainability, and Transparency).

25. Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). "Hidden Technical Debt in Machine Learning Systems." Advances in Neural Information Processing Systems, 28. A seminal Google paper on the operational challenges of production ML systems. The authors argue that the ML code in a real-world system is a small fraction of the total code -- surrounded by configuration, data pipelines, feature engineering, monitoring, and serving infrastructure. This paper motivates the MLOps discussion in Chapter 12 and provides context for why Athena's churn model needs an operational framework, not just an algorithm.


Each item in this reading list was selected because it directly supports concepts introduced in Chapter 7 and developed throughout the textbook. Items are organized by topic to facilitate targeted deep dives, and entries with specific chapter references connect to more detailed treatment later in the book.