Further Reading: Chapter 11

Linear Models Revisited


Foundational Books

1. An Introduction to Statistical Learning (ISLR) --- James, Witten, Hastie, Tibshirani (2nd edition, 2021). Chapter 6 ("Linear Model Selection and Regularization") is the definitive accessible treatment of Ridge, Lasso, and Elastic Net. The mathematical exposition is clear, the figures are excellent (especially the geometric interpretation of the L1 vs. L2 constraints), and the bias-variance analysis of regularization is presented at exactly the right level. The Python edition (ISLP, 2023) includes lab exercises in scikit-learn. Free PDF at statlearning.com. Read Sections 6.1--6.2 first; they cover everything this chapter does with more mathematical rigor.

2. The Elements of Statistical Learning (ESL) --- Hastie, Tibshirani, Friedman (2nd edition, 2009). Chapter 3 ("Linear Methods for Regression") and Chapter 4 ("Linear Methods for Classification") provide the graduate-level treatment. Section 3.4 on Ridge and Lasso is the original reference that most textbooks draw from. Section 3.4.4 on the Lasso and LARS algorithm explains why L1 produces exactly-zero coefficients. Heavier on math than ISLR, but worth the effort if you want to understand the geometry. Free PDF from the authors' website.

3. Regression Shrinkage and Selection via the Lasso --- Robert Tibshirani (1996). Journal of the Royal Statistical Society, Series B, Vol. 58, No. 1, pp. 267--288. The original Lasso paper. Tibshirani introduces the L1 penalty for regression and demonstrates its dual role as regularizer and feature selector. The paper is surprisingly readable for a foundational statistics paper. If you read one original paper on the topics in this chapter, make it this one.
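
The dual role Tibshirani describes is easy to see in code. A minimal sketch (synthetic data, illustrative alpha, not from the paper):

```python
# Sketch: the L1 penalty drives coefficients of irrelevant features to exactly zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features carry signal; the other eight are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_.round(2))
print("exact zeros:", int(np.sum(lasso.coef_ == 0.0)), "of 10")
```

Ridge on the same data would shrink the noise coefficients toward zero but never set them to exactly zero; only the L1 penalty performs selection.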


Regularization Deep Dives

4. Regularization and Variable Selection via the Elastic Net --- Zou and Hastie (2005). Journal of the Royal Statistical Society, Series B, Vol. 67, No. 2, pp. 301--320. The paper that introduced Elastic Net. The key contribution is Theorem 1 (the "grouping effect"), which proves that Elastic Net assigns similar coefficients to correlated features, while Lasso assigns weight to only one. The practical implication: when your features come in correlated groups, Elastic Net is more appropriate than pure Lasso.

5. Regularization Paths for Generalized Linear Models via Coordinate Descent --- Friedman, Hastie, Tibshirani (2010). Journal of Statistical Software, Vol. 33, No. 1. The paper behind the glmnet R package and the coordinate descent algorithm that scikit-learn's Lasso and Elastic Net use under the hood (scikit-learn's Ridge solves its problem with direct solvers instead). Understanding coordinate descent is not required for using these methods, but it explains why Lasso is fast enough to run on millions of observations and why warm starts make computing the regularization path efficient.
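
The warm-start idea can be sketched with scikit-learn's `warm_start` flag (a toy illustration of the trick, not the library's internal path code):

```python
# Sketch: solving a sequence of Lasso problems from strong to weak regularization,
# reusing each solution as the starting point for the next (a "warm start").
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

model = Lasso(warm_start=True)          # keep coefficients between fits
n_nonzero = []
for alpha in np.logspace(0, -3, 10):    # strong penalty -> weak penalty
    model.set_params(alpha=alpha)
    model.fit(X, y)                     # starts from the previous solution
    n_nonzero.append(int(np.sum(model.coef_ != 0)))
print(n_nonzero)                        # sparsity falls as the penalty weakens
```

Each solve starts close to its optimum, so the whole path costs little more than a single cold fit; `lasso_path` and `LassoCV` exploit the same idea internally.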


Practical Guides

6. scikit-learn User Guide --- "Linear Models". The official scikit-learn documentation for Ridge, Lasso, Elastic Net, LogisticRegression, and LogisticRegressionCV. Pay particular attention to the solver table (which solvers support which penalties), the C vs. alpha convention, and the class_weight parameter documentation. The "Mathematical formulation" sections provide the exact loss functions being minimized, which clarifies what each parameter controls. Available at scikit-learn.org/stable/modules/linear_model.html.
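
The C vs. alpha convention is a frequent source of bugs, so a quick sketch (hypothetical values) of the inverted direction:

```python
# Sketch: Ridge/Lasso use alpha (bigger = MORE regularization), while
# LogisticRegression uses C = inverse strength (bigger = LESS regularization).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

weak = LogisticRegression(C=100.0).fit(X, y)   # nearly unregularized
strong = LogisticRegression(C=0.01).fit(X, y)  # heavily penalized
print("max |coef| at C=100: ", round(float(np.abs(weak.coef_).max()), 3))
print("max |coef| at C=0.01:", round(float(np.abs(strong.coef_).max()), 3))
```

Raising C loosens the penalty and the coefficients grow; raising alpha in Ridge or Lasso does the opposite.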

7. scikit-learn User Guide --- "Preprocessing Data". The official documentation for StandardScaler, MinMaxScaler, RobustScaler, and other scaling methods. Includes clear examples of fit-transform workflows and integration with Pipelines. The comparison table of scaler properties (centering, scaling, outlier sensitivity) is a useful reference. Available at scikit-learn.org/stable/modules/preprocessing.html.
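
A minimal fit-transform-in-a-Pipeline sketch of the workflow those docs describe (synthetic data; the feature scales and alpha are arbitrary):

```python
# Sketch: a Pipeline learns scaling statistics from the training split only,
# then reapplies them (transform, not fit) to held-out data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=[1.0, 10.0, 100.0], size=(200, 3))  # mixed scales
y = X @ np.array([1.0, 0.1, 0.01]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_train, y_train)          # fit_transform on train, then fit the model
print(round(pipe.score(X_test, y_test), 3))
```

Fitting the scaler inside the Pipeline means the test split never leaks into the scaling statistics, which is the point of the fit/transform separation.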

8. "How to Use StandardScaler and MinMaxScaler with scikit-learn" --- Various practitioners. Multiple high-quality blog posts demonstrate the practical differences between scaling methods on real datasets. The key insight many of them highlight: StandardScaler is generally preferred for regularized models because it is distorted less by outliers than MinMaxScaler and produces coefficients that are directly interpretable as "per-standard-deviation effects."


Interpretability and Deployment

9. Interpretable Machine Learning --- Christoph Molnar (2nd edition, 2022). Chapter 4 ("Interpretable Models") covers logistic regression's interpretability properties in detail: coefficient interpretation, odds ratios, and the conditions under which coefficient-based explanations are valid. Chapter 5 covers model-agnostic methods (SHAP, LIME) that extend interpretability to black-box models. Free online at christophm.github.io/interpretable-ml-book/. Essential reading for anyone who needs to explain model predictions to non-technical stakeholders.
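
The coefficient-to-odds-ratio recipe Molnar covers is one line of numpy; a hedged sketch on synthetic data:

```python
# Sketch: exp(coefficient) is the multiplicative change in the odds of the
# positive class for a one-unit increase in that feature, all else held fixed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)

odds_ratios = np.exp(clf.coef_[0])
for i, ratio in enumerate(odds_ratios):
    print(f"feature x{i}: odds ratio {ratio:.2f}")
```

If the features were standardized first, each ratio would instead read as the odds change per standard deviation, which is usually the more comparable unit.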

10. "Why Every Data Scientist Should Start with Logistic Regression" --- Various industry blogs. A recurring theme in practitioner writing: logistic regression as the underrated workhorse. The best versions of this argument come from machine learning engineers at large tech companies who have seen complex models replaced by simpler ones in production. The common thread: logistic regression is not the model that wins competitions; it is the model that survives production.


Feature Scaling and Preprocessing

11. "Importance of Feature Scaling" --- scikit-learn Examples. The scikit-learn documentation includes an example ("Compare the effect of different scalers on data with outliers") that demonstrates StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, and QuantileTransformer side by side on the same dataset. The visualizations make the differences immediately intuitive. Available in the scikit-learn example gallery.
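
In the same spirit as that gallery example, a tiny sketch (one artificial outlier) of how differently the scalers treat the bulk of the data:

```python
# Sketch: one extreme outlier crushes MinMaxScaler's output range for the
# "normal" points, affects StandardScaler less, and RobustScaler least of all.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

rng = np.random.default_rng(0)
X = np.append(rng.normal(size=99), 1000.0).reshape(-1, 1)  # 99 normals + 1 outlier

spreads = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    scaled = scaler.fit_transform(X)
    # spread (max - min) of the 99 well-behaved points after scaling
    spreads[type(scaler).__name__] = float(np.ptp(scaled[:99]))
print(spreads)
```

RobustScaler's median/IQR statistics barely notice the outlier, which is why the gallery example recommends it when extreme values are expected.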


Healthcare and Regulated Domains

12. "Hospital Readmissions Reduction Program (HRRP)" --- CMS.gov. The official CMS documentation on the readmission penalty structure, target conditions, and measurement methodology. Understanding the regulatory context makes Case Study 2 (Metro General) far more concrete. The penalty formula, excess readmission ratio calculation, and list of covered conditions are all specified here.

13. "A Comparison of Approaches to Improving the Predictive Accuracy of Hospital Readmission Models" --- Rajkomar et al. (2018). Published in JAMA Internal Medicine, this study compares logistic regression against deep learning for readmission prediction using EHR data from two academic medical centers. The finding: the deep learning model improved AUC by 2--3 points, but the logistic regression was considered sufficient for clinical use. The paper is frequently cited in debates about model complexity in healthcare.


How to Use This List

If you read nothing else, read ISLR Chapter 6 (item 1). It covers Ridge, Lasso, and Elastic Net with clear explanations and excellent figures. Pair it with the scikit-learn linear models documentation (item 6) for the implementation details.

If you want to understand the math behind Lasso's feature selection property, read Tibshirani's original paper (item 3). It is shorter and more accessible than you might expect.

If you are deploying models in a regulated industry, Molnar (item 9) is essential. It provides the vocabulary and frameworks for communicating model interpretability to non-technical stakeholders.

If you are building production pipelines, the scikit-learn preprocessing documentation (item 7) is your day-to-day reference. Bookmark it.


This reading list supports Chapter 11: Linear Models Revisited. Return to the chapter to review concepts before diving in.