Key Takeaways: Chapter 11

Linear Models Revisited


  1. Unregularized linear models fail in high dimensions because they have no mechanism to constrain themselves. With many features (especially when the number of features approaches or exceeds the number of observations), OLS will use every degree of freedom to fit the training data --- including noise. The result is a perfect training R-squared, an abysmal test R-squared, and coefficient estimates that are statistically unbiased but practically useless due to enormous variance.

  2. Regularization trades bias for variance, and the trade is almost always worth it. By adding a penalty for large coefficients, regularized models produce estimates that are slightly biased (pulled toward zero) but dramatically more stable. The coefficients generalize. The predictions generalize. A small amount of bias buys a large reduction in variance.

  3. Ridge (L2) shrinks; Lasso (L1) selects. Ridge regression penalizes the sum of squared coefficients, pulling everything toward zero but never reaching it. All features stay in the model. Lasso penalizes the sum of absolute coefficients and can drive coefficients to exactly zero, performing automatic feature selection. Choose Ridge when most features are relevant and you need stability. Choose Lasso when you suspect many features are irrelevant and you want sparsity.
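The shrink-versus-select contrast shows up directly in the fitted coefficients. A minimal sketch on synthetic data (the alpha values here are illustrative, not recommendations):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic problem: 50 features, only the first 5 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
coef_true = np.zeros(50)
coef_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ coef_true + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks every coefficient but leaves all of them nonzero;
# Lasso drives most of the irrelevant coefficients to exactly zero.
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
```

On data like this, Lasso's zeroed coefficients are exactly the automatic feature selection described above.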

  4. Elastic Net is the pragmatic default when you are unsure. Elastic Net combines L1 and L2 penalties, getting Lasso's sparsity while handling correlated feature groups more gracefully than Lasso alone. When you do not have strong prior knowledge about feature relevance or correlation structure, Elastic Net with l1_ratio=0.5 is a defensible starting point.
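As a sketch of that starting point, assuming synthetic regression data and scikit-learn's ElasticNetCV (which cross-validates alpha for a fixed l1_ratio):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Even L1/L2 mix; the penalty strength alpha is chosen by 5-fold CV.
enet = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
print("selected alpha:", round(enet.alpha_, 4))
print("nonzero coefficients:", int(np.sum(enet.coef_ != 0)))
```

In practice you can also cross-validate l1_ratio itself by passing a list of candidate values.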

  5. Feature scaling is not optional for regularized models. Without scaling, regularization penalizes features based on their measurement scale, not their predictive importance. A feature measured in dollars gets a tiny coefficient (penalized less) while a feature measured as a proportion gets a large coefficient (penalized more). StandardScaler is the default. Fit the scaler on training data only, then transform both train and test.
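A Pipeline makes the fit-on-train-only discipline automatic. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The Pipeline fits StandardScaler on the training data only and
# applies the learned scaling to anything passed to predict/score,
# so the test set never leaks into the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print("test accuracy:", round(model.score(X_test, y_test), 3))
```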

  6. The C parameter in LogisticRegression is the inverse of alpha. Large C = weak regularization (model has more freedom). Small C = strong regularization (more shrinkage). This inverse convention is a persistent source of confusion. Use LogisticRegressionCV to select C automatically via cross-validation.
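A quick way to internalize the inverse convention is to compare coefficient norms at a large and a small C. A sketch on synthetic data (the C values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

X, y = make_classification(n_samples=300, n_features=30, random_state=1)

# C = 1/alpha: a LARGE C means a WEAK penalty, and vice versa.
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
print(np.linalg.norm(strong.coef_) < np.linalg.norm(weak.coef_))  # True

# LogisticRegressionCV selects C by cross-validation over a log-spaced grid.
clf = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000).fit(X, y)
print("selected C:", clf.C_[0])
```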

  7. The coefficient path is your most important diagnostic tool. Plotting coefficients as a function of regularization strength shows which features activate first (most important), which features are stable across a range of alpha values (robust signals), and where the sweet spot lies between underfitting and overfitting.
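A sketch of computing the path with scikit-learn's lasso_path on synthetic data (plotting coefs against alphas gives the diagnostic described above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Returns a decreasing grid of alphas and the coefficients at each one;
# coefs has shape (n_features, n_alphas).
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

# Number of active (nonzero) features at each alpha: everything is zero
# at the strongest alpha, and features activate as alpha shrinks.
n_active = (np.abs(coefs) > 1e-8).sum(axis=0)
print("active at strongest alpha:", n_active[0])
print("active at weakest alpha:", n_active[-1])
```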

  8. Logistic regression with regularization is a production-grade classifier. It is fast (trains in seconds on millions of rows), interpretable (coefficients have direct meaning), well-calibrated (probabilities are reliable when the model is well-specified), and competitive (often within 2--5 AUC points of gradient boosting). It is your baseline. Every other model must justify its complexity by outperforming it.

  9. Interpretability is not a luxury --- it is a deployment requirement in many domains. In healthcare, finance, insurance, and other regulated industries, stakeholders need to understand why a model makes specific predictions. Logistic regression coefficients provide per-feature, per-prediction explanations without additional tooling. This is why hospitals choose logistic regression over XGBoost and why lenders document coefficient tables for regulators.

  10. Accuracy is the wrong metric for imbalanced classification. With 8.2% churn, predicting "no churn" for everyone achieves 91.8% accuracy. Always evaluate with AUC-ROC, precision, recall, and F1. Use class_weight='balanced' to prevent the model from defaulting to the majority class. And always record the baseline (majority-class accuracy) so you know what "better than guessing" looks like.
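A sketch on a synthetic problem with roughly the chapter's 8% positive rate (generated data, not the chapter's churn dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Roughly 8% positives, mirroring the chapter's churn rate.
X, y = make_classification(n_samples=5000, weights=[0.92], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Record the majority-class baseline first.
baseline = max(y_te.mean(), 1 - y_te.mean())

# class_weight='balanced' keeps the model from ignoring the minority class.
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"majority-class accuracy: {baseline:.3f}, AUC-ROC: {auc:.3f}")
```

An "accurate" model is only interesting if it beats that baseline; AUC makes the comparison meaningful.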


If You Remember One Thing

Logistic regression with L1 regularization and StandardScaler, wrapped in a Pipeline, is the model you build first on every tabular classification problem. Not because it is the best model --- it rarely is --- but because it is the fastest model to build, the easiest to interpret, and the hardest to beat by enough to justify something more complex. It is your floor. Everything else is measured against it. If your gradient-boosted ensemble cannot beat regularized logistic regression by a meaningful margin on your specific data, the ensemble is not worth its operational cost.
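A minimal sketch of that baseline, assuming synthetic data and illustrative hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=40, n_informative=8,
                           random_state=0)

# The baseline: StandardScaler + L1-penalized logistic regression in a Pipeline.
# liblinear is one solver that supports the L1 penalty.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
scores = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
print("cross-validated AUC:", round(scores.mean(), 3))
```

Any more complex model is then judged against this number.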


These takeaways summarize Chapter 11: Linear Models Revisited. Return to the chapter for full context.