Further Reading: Logistic Regression and Classification — Predicting Categories

This chapter introduced classification, the confusion matrix, and the precision-recall tradeoff — ideas that are central to machine learning and have profound implications for how we use data to make decisions about people. These resources will take you deeper into the mathematics, applications, and ethics of classification.


Tier 1: Verified Sources

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning (Springer, 2nd edition, 2021). Chapter 4 covers logistic regression with mathematical rigor while remaining accessible. The discussion of the maximum likelihood estimation procedure (how logistic regression actually finds its coefficients) complements our more intuitive treatment. Free PDF from the authors' website.
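The maximum likelihood procedure James et al. derive can be summarized in two standard formulas: the sigmoid model and the log-likelihood that logistic regression maximizes (notation here follows the usual single-predictor convention, with β₀ and β₁ the coefficients):

```latex
% The logistic (sigmoid) model for the probability of the positive class:
p(x_i) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}

% The log-likelihood maximized over (beta_0, beta_1) -- unlike least
% squares, there is no closed-form solution, so it is found numerically:
\ell(\beta_0, \beta_1) = \sum_{i=1}^{n} \Bigl[\, y_i \log p(x_i) + (1 - y_i) \log\bigl(1 - p(x_i)\bigr) \Bigr]
```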

Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (O'Reilly, 3rd edition, 2022). Chapter 3 provides an excellent practical treatment of classification, including precision-recall curves, ROC curves, and multi-class classification. The code examples use scikit-learn and directly extend our chapter's approach. Particularly good on the precision-recall tradeoff and threshold selection.
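The precision-recall curves Géron builds come from scikit-learn's `precision_recall_curve`, which evaluates both metrics at every candidate threshold. A minimal sketch on made-up labels and scores (illustrative only, not the book's own example):

```python
# Sweep every candidate threshold and report the precision-recall
# tradeoff at each one. Labels and scores here are invented.
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 0, 1, 1, 1]
y_scores = [0.2, 0.4, 0.7, 0.3, 0.6, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Raising the threshold generally trades recall away for precision
for p, r, th in zip(precision, recall, thresholds):
    print(f"threshold={th:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Plotting `precision` against `recall` gives the curve itself; the point you pick on it is exactly the threshold decision discussed in the chapter.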

Cathy O'Neil, Weapons of Math Destruction (Crown, 2016). Several chapters deal directly with classification models used in high-stakes decisions — criminal justice risk scores, credit scoring, hiring algorithms. O'Neil shows how classification models can encode and amplify biases when applied to decisions about people. Essential reading for anyone building classification models that affect lives.

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner, "Machine Bias," ProPublica (May 23, 2016). The investigative report that revealed racial disparities in the COMPAS recidivism prediction tool. The article found that the tool had different false positive rates for Black and white defendants — a direct application of the confusion matrix concepts from this chapter. This is one of the most important pieces of journalism about algorithmic decision-making.

David Spiegelhalter, The Art of Statistics (Basic Books, 2019). Spiegelhalter's chapters on screening tests and diagnostic accuracy are outstanding. He explains sensitivity, specificity, positive predictive value, and the base rate fallacy with medical examples that directly connect to our confusion matrix and precision-recall discussion.
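Spiegelhalter's base-rate point can be reproduced with a few lines of arithmetic. The test characteristics below are hypothetical, chosen only to show how positive predictive value collapses when a condition is rare:

```python
# Positive predictive value (precision) via Bayes' rule, from a test's
# sensitivity, specificity, and the disease prevalence (base rate).
# All numbers are hypothetical, for illustration only.
def ppv(sensitivity, specificity, prevalence):
    """Fraction of positive test results that are true positives."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# A 95%-sensitive, 95%-specific test looks excellent when the
# condition is common...
print(ppv(0.95, 0.95, 0.50))  # PPV = 0.95

# ...but at 1% prevalence, most positives are false positives
print(ppv(0.95, 0.95, 0.01))  # PPV is roughly 0.16
```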

scikit-learn documentation: Classification Metrics. The official documentation provides clear explanations of precision, recall, F1-score, ROC curves, and other metrics. The "Model Evaluation" user guide includes practical advice on choosing metrics for different scenarios.
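The core metrics that documentation covers can be computed in a few lines. A sketch on made-up labels (the arrays are illustrative, not from the chapter's data):

```python
# scikit-learn's basic classification metrics on invented labels.
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of the two
```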


Tier 2: Attributed Resources

StatQuest with Josh Starmer, "Logistic Regression" series (YouTube). Starmer's multi-video series covers logistic regression, the sigmoid function, odds ratios, and maximum likelihood with his trademark clarity. Start with "Logistic Regression Details Pt 1: Coefficients" and work through the series. Search "StatQuest logistic regression."

Google's Machine Learning Crash Course: Classification section. A free, interactive course that covers classification, the confusion matrix, precision, recall, ROC curves, and threshold selection with clear explanations and interactive exercises. Particularly good on the precision-recall tradeoff with a visual, interactive component.

Nate Silver, The Signal and the Noise (Penguin, 2012). Silver's discussion of weather forecasting, earthquake prediction, and political polling includes extensive treatment of calibration — whether predicted probabilities match actual frequencies. Directly relevant to our predict_proba discussion and the question of what a "72% probability" actually means.
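Calibration can be checked directly with scikit-learn's `calibration_curve`, which bins predicted probabilities and compares each bin's average prediction with the actual outcome frequency. A sketch on synthetic data (the dataset and models are stand-ins, not anything from Silver's book):

```python
# Compare how well predict_proba is calibrated for logistic regression
# vs. a random forest, on synthetic data (illustrative only).
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    # frac_pos: actual frequency of the positive class in each bin;
    # mean_pred: the model's average predicted probability in that bin.
    # For a well-calibrated model the two columns track each other.
    frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=10)
    print(type(model).__name__)
    for f, m in zip(frac_pos, mean_pred):
        print(f"  predicted~{m:.2f}  actual={f:.2f}")
```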

Solon Barocas and Moritz Hardt, Fairness and Machine Learning (fairmlbook.org, 2023). A free online textbook on fairness in machine learning. Chapter 2 on classification directly extends our chapter's discussion of confusion matrices and shows how different definitions of fairness (equal false positive rates, equal false negative rates, calibration) are mathematically incompatible. Technical but essential if you're building models that affect people.

Wikipedia: Receiver Operating Characteristic (ROC curve). The ROC curve, which we'll cover in Chapter 29, plots true positive rate vs. false positive rate across all possible thresholds. The Wikipedia article includes clear explanations, historical context (the ROC curve was invented for radar signal detection in WWII), and interactive examples.
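The threshold sweep behind an ROC curve can be previewed with scikit-learn's `roc_curve`. A sketch on invented labels and scores:

```python
# The TPR-vs-FPR sweep behind an ROC curve, on illustrative labels
# and predicted scores (not real data).
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_scores = [0.1, 0.3, 0.4, 0.8, 0.35, 0.6, 0.7, 0.9]

# fpr/tpr are evaluated at every threshold where the curve can change
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

# The area under the ROC curve summarizes the whole sweep in one number
print(roc_auc_score(y_true, y_scores))
```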


Where to Go Next

  • If you want deeper mathematical understanding of logistic regression: Read James et al., Chapter 4. The maximum likelihood derivation shows how the coefficients are actually found — it's different from least squares in linear regression, and understanding why gives you deeper insight into the model.

  • If the precision-recall tradeoff fascinates you: Work through Géron's Chapter 3. He shows how to create precision-recall curves, which visualize the tradeoff across all possible thresholds, helping you choose the optimal one for your application.

  • If you're concerned about fairness in classification: Read the ProPublica article on COMPAS, then explore Barocas and Hardt's textbook. The mathematical impossibility of satisfying all fairness criteria simultaneously (the "impossibility theorem" of fair classification) is one of the most important results in the field.

  • If you want to understand calibration: Spiegelhalter's book and Silver's The Signal and the Noise both treat calibration in accessible ways. In practice, logistic regression tends to produce well-calibrated probabilities, but other classifiers (like random forests) do not, which matters when you need the probabilities — not just the rankings — to be meaningful.

  • If you're interested in medical applications: Look into the sensitivity and specificity framework used in diagnostic medicine. Sensitivity is another name for recall, and positive predictive value is another name for precision; specificity (the true negative rate) completes the picture. The positive and negative predictive values are standard in medical literature and depend critically on disease prevalence (the base rate).

  • If you want more hands-on practice: Try Kaggle's "Titanic: Machine Learning from Disaster" competition. It's a binary classification problem (survived vs. not survived) with the same workflow we used in this chapter: split, train, evaluate, iterate. It's the most popular beginner competition on Kaggle for good reason.

  • If you're ready to move on: Chapter 28 introduces decision trees, which handle nonlinear relationships naturally and produce highly interpretable models. Chapter 29 will formalize model evaluation with cross-validation and ROC curves, giving you more sophisticated tools for comparing models and choosing thresholds.
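The split, train, evaluate workflow behind the Titanic suggestion can be sketched end to end. The snippet below uses scikit-learn's built-in breast-cancer dataset as a stand-in for the Kaggle CSV, but the steps are identical:

```python
# The chapter's split -> train -> evaluate workflow, sketched on a
# built-in dataset as a stand-in for Kaggle's Titanic CSV.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1. Split first, so the test set never influences training
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42)

# 2. Train (scaling the features helps the solver converge)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# 3. Evaluate on held-out data: precision, recall, F1 per class
print(classification_report(y_te, model.predict(X_te)))
```

On Kaggle you would swap in `pd.read_csv` on the competition's training file and encode the categorical columns; the loop of splitting, training, evaluating, and iterating stays the same.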


A Final Thought

Classification seems simple: put things into categories. But as you've seen in this chapter, the simplicity is deceptive. The choice of threshold is not a technical parameter — it's a policy decision that determines who benefits and who bears the cost of errors. The confusion matrix is not just a diagnostic tool — it's a moral accounting of the model's impact.

A model that catches 95% of fraud but wrongly flags 1% of legitimate transactions sounds good in the abstract. But if that 1% amounts to thousands of people whose accounts are frozen, the impact is real. A model that identifies 80% of at-risk students sounds helpful, but if the 20% it misses are disproportionately from underrepresented groups, the model reinforces the very inequities it was meant to address.
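The fraud example becomes concrete with a few lines of arithmetic (the transaction volumes below are hypothetical, chosen only for illustration):

```python
# How an abstract-sounding "1% false positive rate" plays out at scale.
# All counts are hypothetical.
legitimate = 1_000_000      # legitimate transactions processed
fraudulent = 1_000          # actual fraud cases among them

false_positive_rate = 0.01  # 1% of legitimate transactions flagged
recall = 0.95               # 95% of fraud caught

wrongly_flagged = legitimate * false_positive_rate   # 10,000 customers
fraud_caught = fraudulent * recall                   # 950 cases

# Precision: of everything flagged, how much is actually fraud?
precision = fraud_caught / (fraud_caught + wrongly_flagged)
print(f"{wrongly_flagged:.0f} legitimate accounts flagged")
print(f"precision = {precision:.3f}")
```

With these numbers, over 90% of flagged accounts belong to innocent customers, even though the model's headline recall is 95%.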

These are not problems that better algorithms can solve. They are problems that require human judgment, institutional accountability, and ongoing attention. The model provides information; the decision about how to use that information is ours.