In This Chapter
- Regularization, Feature Scaling, and Why Vanilla Regression Fails in High Dimensions
- The Model That Refuses to Die
- Why Vanilla Regression Breaks
- Regularization: Constraining the Coefficients
- Feature Scaling: The Prerequisite Everyone Forgets
- Logistic Regression with Regularization: The Production Classifier
- The Regularization Path: Systematic Tuning
- Building the StreamFlow Baseline: Progressive Project M4
- The Interpretation Advantage
- Production Patterns
- Chapter Summary
Chapter 11: Linear Models Revisited
Regularization, Feature Scaling, and Why Vanilla Regression Fails in High Dimensions
Learning Objectives
By the end of this chapter, you will be able to:
- Explain why unregularized linear models fail with many features
- Implement Ridge (L2) and Lasso (L1) regression and interpret their effects
- Apply Elastic Net for combined regularization
- Understand why feature scaling is critical for regularized models
- Use logistic regression with regularization as a production-quality classifier
The Model That Refuses to Die
War Story --- In 2022, a Series C fintech company brought in a new VP of Data Science. The previous team had spent nine months building a gradient-boosted ensemble for credit underwriting: 2,400 features, Bayesian hyperparameter optimization, a custom loss function, and a serving infrastructure that cost $38,000/month in cloud compute. The new VP's first act was to train a logistic regression with L1 regularization on the same data. It used 47 features (the rest got zeroed out by Lasso). It ran on a single CPU. The AUC dropped from 0.847 to 0.831.
The VP shipped the logistic regression. The compliance team could explain every coefficient to regulators. The engineering team cut serving costs by 94%. The model retrained in 11 seconds instead of 4 hours. And when a feature pipeline broke at 2 AM, the on-call engineer could debug it because the model was a linear equation, not a black box.
The VP later said: "The ensemble was better by every metric except the ones that mattered."
Logistic regression is not sexy. It is, however, the model that's still running in production at half the companies that claim to use deep learning.
This chapter is about why that is true, and about the specific techniques --- regularization and feature scaling --- that transform a fragile introductory-statistics workhorse into a robust, production-grade machine learning model. If you took an introductory course and learned ordinary least squares (OLS) regression and basic logistic regression, you learned the versions that break. This chapter teaches you the versions that work.
We start with the failure. Then we fix it.
Why Vanilla Regression Breaks
The Problem: Too Many Features, Not Enough Signal
In your introductory statistics course, you probably worked with datasets that had 5 to 15 features and hundreds of observations. In that world, OLS regression works beautifully. The coefficients are interpretable. The model is stable. The predictions generalize.
Now consider the StreamFlow churn dataset. After the feature engineering and encoding work from Chapters 6--10, we have:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Load the engineered StreamFlow dataset
# After one-hot encoding, interaction terms, and aggregations
np.random.seed(42)
n = 50000
p = 200 # 200 features after engineering
# Simulated: the real dataset from Chapter 10's pipeline
X = pd.DataFrame(
np.random.randn(n, p),
columns=[f'feature_{i}' for i in range(p)]
)
y = (X.iloc[:, :10].sum(axis=1) + np.random.randn(n) * 3 > 6).astype(int)  # threshold chosen for a rare positive class (~8% churn)
print(f"Observations: {n:,}")
print(f"Features: {p}")
print(f"Churn rate: {y.mean():.1%}")
Observations: 50,000
Features: 200
Churn rate: 8.3%
Two hundred features. That is not unusual --- it is typical for a moderately engineered tabular dataset. One-hot encode plan type (5 categories), device type (12 categories), region (50 states), and content genre preferences (30 categories), add some interaction terms and rolling aggregates, and you are well past 200.
Let us see what happens when we throw OLS at this.
The Three Ways OLS Fails in High Dimensions
Failure Mode 1: Coefficient Instability
When features are correlated --- and they always are, because engineered features share raw inputs --- the coefficient estimates become wildly unstable. Small changes in the training data produce large changes in individual coefficients.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
# Create a dataset with correlated features
np.random.seed(42)
n_demo = 1000
x1 = np.random.randn(n_demo)
x2 = x1 + np.random.randn(n_demo) * 0.1 # x2 ≈ x1 (correlation > 0.99)
noise = np.random.randn(n_demo)
y_demo = 3 * x1 + 2 * x2 + noise # True: 3x1 + 2x2
X_demo = np.column_stack([x1, x2])
# Fit on two slightly different random subsets
results = []
for trial in range(5):
    idx = np.random.choice(n_demo, size=800, replace=False)
    model = LinearRegression().fit(X_demo[idx], y_demo[idx])
    results.append(model.coef_)
for i, coefs in enumerate(results):
    print(f"Trial {i+1}: coef_x1 = {coefs[0]:7.3f}, coef_x2 = {coefs[1]:7.3f}")
Trial 1: coef_x1 = 1.847, coef_x2 = 3.172
Trial 2: coef_x1 = 4.511, coef_x2 = 0.509
Trial 3: coef_x1 = 2.634, coef_x2 = 2.381
Trial 4: coef_x1 = 5.892, coef_x2 = -0.884
Trial 5: coef_x1 = 0.123, coef_x2 = 4.908
The true coefficients are 3 and 2. But when x1 and x2 are nearly identical, OLS cannot decide which one "gets" the credit. The coefficients swing wildly from trial to trial. The sum is roughly stable (around 5), but the individual values are meaningless.
This is multicollinearity, and with 200 engineered features, you have it. Guaranteed.
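A quick way to quantify that multicollinearity is the variance inflation factor (VIF): regress each feature on all the others and compute VIF = 1 / (1 - R²). Values above roughly 10 are a standard red flag. The sketch below recreates the near-duplicate-feature setup on synthetic data (the feature names and thresholds are illustrative, not from the StreamFlow dataset):

```python
# Sketch: diagnosing multicollinearity with variance inflation factors (VIF).
# VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the
# other features. Values above ~10 are a common red flag. Synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(size=500) * 0.1   # nearly a copy of x1
x3 = rng.normal(size=500)              # independent feature
X = np.column_stack([x1, x2, x3])

def vif(X):
    scores = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        scores.append(1.0 / (1.0 - r2))
    return scores

for j, v in enumerate(vif(X)):
    print(f"feature_{j}: VIF = {v:.1f}")
```

The two correlated features get VIFs in the hundreds; the independent one sits near 1.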
Failure Mode 2: Overfitting
With enough features, OLS can perfectly fit the training data --- even if most features are noise. The model memorizes rather than learns.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Progressively increase features, most of which are pure noise
np.random.seed(42)
n_overfit = 500
X_real = np.random.randn(n_overfit, 5)
y_overfit = X_real[:, 0] * 2 + X_real[:, 1] * (-1.5) + np.random.randn(n_overfit) * 0.5
for n_noise in [0, 10, 50, 100, 200, 450]:
    X_noise = np.random.randn(n_overfit, n_noise) if n_noise > 0 else np.empty((n_overfit, 0))
    X_all = np.hstack([X_real, X_noise])
    cv_scores = cross_val_score(
        LinearRegression(), X_all, y_overfit, cv=5, scoring='r2'
    )
    # For training R2, fit on full data
    model = LinearRegression().fit(X_all, y_overfit)
    r2_train = model.score(X_all, y_overfit)
    print(f"Features: {X_all.shape[1]:4d} | Train R²: {r2_train:.3f} | CV R²: {cv_scores.mean():.3f}")
Features: 5 | Train R²: 0.940 | CV R²: 0.935
Features: 15 | Train R²: 0.955 | CV R²: 0.932
Features: 55 | Train R²: 0.979 | CV R²: 0.899
Features: 105 | Train R²: 0.993 | CV R²: 0.784
Features: 205 | Train R²: 0.999 | CV R²: 0.412
Features: 455 | Train R²: 1.000 | CV R²: -2.318
Read that table carefully. With 455 features (nearly as many as observations), training R-squared is 1.000 --- perfect. Cross-validated R-squared is -2.318 --- worse than predicting the mean. The model has enough degrees of freedom to pass a line through every training point. It has learned nothing.
Common Mistake --- "My model has 200 features and an R-squared of 0.98." If that R-squared was computed on training data, it tells you nothing. With enough features relative to observations, OLS always achieves high training R-squared. This is not a sign of a good model. It is a sign that you need regularization.
Failure Mode 3: Numerical Instability
When features are highly correlated, the matrix X^T X that OLS must invert becomes ill-conditioned. In extreme cases, the computation itself becomes unreliable --- small floating-point errors amplify into large coefficient errors.
from numpy.linalg import cond
# Condition number of X^T X as we add correlated features
np.random.seed(42)
base = np.random.randn(1000, 5)
for n_corr in [0, 5, 10, 20]:
    if n_corr == 0:
        X_cond = base.copy()
    else:
        # Tile the base columns so we add n_corr near-duplicate features
        correlated = base[:, np.arange(n_corr) % 5] + np.random.randn(1000, n_corr) * 0.01
        X_cond = np.hstack([base, correlated])
    cn = cond(X_cond.T @ X_cond)
    print(f"Features: {X_cond.shape[1]:3d} | Condition number: {cn:,.0f}")
Features: 5 | Condition number: 6
Features: 10 | Condition number: 84,217
Features: 15 | Condition number: 5,412,893
Features: 25 | Condition number: 892,341,567
A condition number above 30 is concerning. Above 1,000, the results are unreliable. Above 1,000,000, they are numerically meaningless. And we reached nearly a billion with just 25 features.
The Diagnosis
OLS fails in high dimensions because it has no mechanism to constrain itself. It will use every available degree of freedom to fit the training data, regardless of whether those degrees of freedom capture signal or noise. The coefficient estimates are unbiased --- in the statistical sense, their expected value equals the true value --- but their variance is enormous. And variance is what kills you in production.
The fix is regularization: explicitly constraining the coefficients to prevent the model from over-investing in any single feature.
Regularization: Constraining the Coefficients
The Core Idea
Regularization adds a penalty term to the loss function. Instead of minimizing just the prediction error, the model minimizes prediction error plus a cost for large coefficients:
OLS: Minimize (prediction error)
Regularized: Minimize (prediction error + penalty for large coefficients)
The penalty shrinks coefficients toward zero. Features that contribute little to prediction get their coefficients reduced (or eliminated entirely). Features that are genuinely predictive keep larger coefficients, but even they are pulled toward zero slightly.
This introduces bias --- the regularized estimates are no longer unbiased in the statistical sense. But it dramatically reduces variance. The coefficients are more stable, the predictions generalize better, and the model becomes robust to multicollinearity and noise features. This is the bias-variance tradeoff from Chapter 1, made concrete.
Ridge Regression (L2 Regularization)
Ridge regression adds the sum of squared coefficients as a penalty:
Loss = Sum of Squared Residuals + alpha * Sum of Squared Coefficients
The alpha parameter controls how strong the penalty is. When alpha = 0, Ridge equals OLS. As alpha increases, coefficients shrink toward zero --- but they never reach exactly zero.
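The Ridge loss has a closed-form minimizer: beta = (XᵀX + alpha·I)⁻¹ Xᵀy. The sketch below checks that formula against scikit-learn on synthetic data. One caveat: sklearn never penalizes the intercept, so we fit with fit_intercept=False to keep the comparison exact. The coefficients and dimensions here are made up for illustration.

```python
# Sketch: Ridge has a closed-form solution, beta = (XᵀX + alpha*I)⁻¹ Xᵀy.
# With fit_intercept=False, sklearn's Ridge matches it exactly (sklearn
# never penalizes the intercept, so we drop it for this comparison).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(size=200)

alpha = 10.0
beta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)
beta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

print("closed form:", np.round(beta_closed, 4))
print("sklearn:    ", np.round(beta_sklearn, 4))
```

Adding alpha to the diagonal of XᵀX is also why Ridge fixes the numerical instability from Failure Mode 3: it lifts the small eigenvalues, so the matrix being inverted is always well-conditioned.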
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Use the 200-feature dataset from above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Ridge with different alpha values. The target is binary, so we use
# RidgeClassifier (Ridge regression on ±1-encoded labels), which lets
# us score accuracy; plain Ridge outputs continuous predictions.
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
print(f"{'Alpha':>10} | {'CV Accuracy':>12} | {'Non-zero Coefs':>15}")
print("-" * 45)
for alpha in alphas:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('ridge', RidgeClassifier(alpha=alpha))
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy')
    pipe.fit(X_train, y_train)
    n_nonzero = np.sum(np.abs(pipe.named_steps['ridge'].coef_) > 1e-6)
    print(f"{alpha:10.3f} | {scores.mean():12.4f} | {n_nonzero:15d}")
Alpha | CV Accuracy | Non-zero Coefs
---------------------------------------------
0.001 | 0.7234 | 200
0.010 | 0.7298 | 200
0.100 | 0.7412 | 200
1.000 | 0.7589 | 200
10.000 | 0.7601 | 200
100.000 | 0.7543 | 200
1000.000 | 0.7112 | 200
Two observations. First, performance improves as alpha increases from 0.001 to 10, then degrades. There is a sweet spot where the bias-variance tradeoff is optimal. Second, all 200 coefficients remain non-zero regardless of alpha. Ridge shrinks coefficients but never eliminates them. Every feature stays in the model.
Production Tip --- Ridge regression is your first defense against multicollinearity. If you have correlated features and need stable coefficients for interpretation, Ridge is the right tool. It does not do feature selection --- it does coefficient stabilization.
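To see that stabilization concretely, the sketch below reruns the Failure Mode 1 experiment (near-duplicate features, five random subsamples) with OLS and with Ridge side by side. The data and alpha value are illustrative:

```python
# Sketch: re-running the correlated-features experiment from Failure Mode 1,
# now comparing OLS against Ridge. The L2 penalty spreads weight across the
# near-duplicate features, so individual coefficients stop swinging.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n) * 0.1          # x2 nearly equals x1
y = 3 * x1 + 2 * x2 + rng.normal(size=n)    # true coefficients: 3 and 2
X = np.column_stack([x1, x2])

spreads = {}
for name, model in [("OLS", LinearRegression()), ("Ridge", Ridge(alpha=10.0))]:
    coefs = []
    for _ in range(5):
        idx = rng.choice(n, size=800, replace=False)
        coefs.append(model.fit(X[idx], y[idx]).coef_.copy())
    spreads[name] = np.array(coefs).std(axis=0)
    print(f"{name:5s}: coefficient std across subsamples = {np.round(spreads[name], 3)}")
```

The Ridge coefficients vary far less between subsamples than the OLS ones, at the cost of a little shrinkage bias.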
Lasso Regression (L1 Regularization)
Lasso (Least Absolute Shrinkage and Selection Operator) adds the sum of absolute values of the coefficients as a penalty:
Loss = Sum of Squared Residuals + alpha * Sum of Absolute Coefficients
The geometric difference between L1 and L2 penalties has a profound practical consequence: Lasso can drive coefficients to exactly zero. It performs automatic feature selection.
from sklearn.linear_model import Lasso
print(f"{'Alpha':>10} | {'CV R²':>8} | {'Non-zero Coefs':>15} | {'Zero Coefs':>11}")
print("-" * 55)
for alpha in [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('lasso', Lasso(alpha=alpha, max_iter=10000, random_state=42))
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='r2')
    pipe.fit(X_train, y_train)
    n_nonzero = np.sum(np.abs(pipe.named_steps['lasso'].coef_) > 1e-6)
    n_zero = 200 - n_nonzero
    print(f"{alpha:10.4f} | {scores.mean():8.4f} | {n_nonzero:15d} | {n_zero:11d}")
Alpha | CV R² | Non-zero Coefs | Zero Coefs
-------------------------------------------------------
0.0001 | 0.1842 | 198 | 2
0.0010 | 0.1856 | 167 | 33
0.0100 | 0.1801 | 48 | 152
0.1000 | 0.1223 | 12 | 188
0.5000 | 0.0312 | 3 | 197
1.0000 | 0.0000 | 0 | 200
At alpha = 0.01, Lasso has zeroed out 152 of 200 features, keeping only 48 --- and its cross-validated R-squared is nearly as good as using all features. At alpha = 0.1, only 12 features remain. Lasso has identified the signal features and discarded the noise.
This is sparsity, and it is Lasso's superpower. When you suspect that most of your features are irrelevant, Lasso will find the ones that matter.
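Once Lasso has zeroed out the noise, you usually want the names of the survivors. The sketch below shows one way to pull them from a fitted pipeline; the synthetic frame stands in for the StreamFlow data, with signal planted in the first three columns:

```python
# Sketch: extracting the feature names Lasso kept. Synthetic data; only
# feature_0, feature_1, and feature_2 carry signal, so Lasso should keep
# roughly those and zero out the rest.
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(2000, 20)),
                 columns=[f"feature_{i}" for i in range(20)])
y = 2 * X["feature_0"] - 1.5 * X["feature_1"] + X["feature_2"] + rng.normal(size=2000)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("lasso", Lasso(alpha=0.1, random_state=42))])
pipe.fit(X, y)

mask = np.abs(pipe.named_steps["lasso"].coef_) > 1e-6
kept = X.columns[mask].tolist()
print(f"Kept {len(kept)} of 20 features: {kept}")
```

The same mask-and-column-names pattern works on the 200-feature dataset; only the alpha value needs tuning.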
Try It --- Take the Lasso code above and plot the number of non-zero coefficients as a function of alpha. This is a coefficient path plot. You will see features "enter" the model as alpha decreases from large values (where everything is zeroed) to small values (where everything is included). The order in which features enter tells you their relative importance.
The Coefficient Path: Seeing Regularization in Action
The coefficient path is the single most useful visualization for understanding regularization. It shows how each coefficient changes as the regularization strength varies.
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path
# Use a subset of features for readability
X_subset = X_train.iloc[:, :20]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_subset)
alphas_path, coefs_path, _ = lasso_path(X_scaled, y_train, alphas=np.logspace(-4, 1, 100))
fig, ax = plt.subplots(figsize=(10, 6))
for i in range(coefs_path.shape[0]):
    ax.plot(np.log10(alphas_path), coefs_path[i, :], linewidth=0.8)
ax.set_xlabel('log10(alpha)')
ax.set_ylabel('Coefficient value')
ax.set_title('Lasso Coefficient Path')
ax.axhline(y=0, color='black', linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.savefig('lasso_coefficient_path.png', dpi=150)
plt.show()
In the coefficient path plot, you will see:
- At the far right (high alpha), all coefficients are zero. The model predicts the mean.
- As alpha decreases (moving left), features "activate" one by one. The first feature to activate is the most important.
- At the far left (low alpha), Lasso approximates OLS. All features are active and coefficients may be unstable.
- The sweet spot is somewhere in the middle: enough features to capture signal, not so many that noise creeps in.
Production Tip --- The coefficient path is not just a diagnostic tool. It is a communication tool. When a stakeholder asks "which features matter?", show them the coefficient path. The features that activate first and maintain the largest coefficients across a range of alpha values are the ones the model relies on most.
Elastic Net: The Best of Both Worlds
Elastic Net combines L1 and L2 penalties:
Loss = Sum of Squared Residuals + alpha * [(1 - l1_ratio) * Sum of Squared Coefficients + l1_ratio * Sum of Absolute Coefficients]
The l1_ratio parameter controls the mix. When l1_ratio = 1, Elastic Net is Lasso. When l1_ratio = 0, it is Ridge. Values in between blend both behaviors.
Why would you want both? Because Lasso has a weakness: when multiple features are highly correlated, it tends to pick one arbitrarily and zero out the rest. Ridge, by contrast, shares the coefficient weight among correlated features. Elastic Net gets Lasso's sparsity (many zero coefficients) while handling groups of correlated features more gracefully.
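That arbitrary-pick behavior is easy to demonstrate. In the sketch below, one signal feature is duplicated exactly; Lasso puts all the weight on one copy, while Elastic Net splits it between them. The alpha and l1_ratio values are illustrative:

```python
# Sketch: two identical copies of one signal feature. Lasso concentrates
# the weight on a single copy; Elastic Net's L2 component splits it.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
X = np.column_stack([x, x])             # two exact copies of the same feature
y = 2 * x + rng.normal(size=1000) * 0.5

lasso = Lasso(alpha=0.05).fit(X, y)
enet = ElasticNet(alpha=0.05, l1_ratio=0.2).fit(X, y)
print("Lasso coefs:      ", np.round(lasso.coef_, 3))
print("Elastic Net coefs:", np.round(enet.coef_, 3))
```

With real correlated groups (rather than exact duplicates) the effect is softer but the same: Lasso's choice among the group is unstable across resamples, while Elastic Net shares credit.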
from sklearn.linear_model import ElasticNet
# Compare Lasso, Ridge-like Elastic Net, and balanced Elastic Net
configs = [
('Lasso (l1=1.0)', 1.0),
('Balanced (l1=0.5)', 0.5),
('Ridge-like (l1=0.1)', 0.1),
]
alpha = 0.01
print(f"Alpha = {alpha}")
print(f"{'Config':>25} | {'CV R²':>8} | {'Non-zero':>9}")
print("-" * 50)
for name, l1_ratio in configs:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('enet', ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000, random_state=42))
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='r2')
    pipe.fit(X_train, y_train)
    n_nonzero = np.sum(np.abs(pipe.named_steps['enet'].coef_) > 1e-6)
    print(f"{name:>25} | {scores.mean():8.4f} | {n_nonzero:9d}")
Alpha = 0.01
Config | CV R² | Non-zero
--------------------------------------------------
Lasso (l1=1.0) | 0.1801 | 48
Balanced (l1=0.5) | 0.1834 | 72
Ridge-like (l1=0.1) | 0.1849 | 153
The balanced Elastic Net (l1_ratio=0.5) keeps more features than pure Lasso (72 vs. 48) while still providing sparsity. When in doubt about which to use, Elastic Net with l1_ratio=0.5 is a defensible default.
Feature Scaling: The Prerequisite Everyone Forgets
The Disaster Without Scaling
Regularization penalizes large coefficients. But what counts as "large" depends on the scale of the features. A feature measured in dollars (range: 0 to 100,000) will have a tiny coefficient. A feature measured as a proportion (range: 0 to 1) will have a large coefficient. Without scaling, regularization penalizes the dollar-feature's coefficient less and the proportion-feature's coefficient more --- not because one is more important, but because of an accident of measurement units.
This is not a theoretical concern. It is a bug that will silently destroy your model's performance.
from sklearn.linear_model import LogisticRegression
# Create features on wildly different scales (realistic for StreamFlow)
np.random.seed(42)
n_scale = 5000
df_scale = pd.DataFrame({
'monthly_revenue': np.random.uniform(0, 500, n_scale), # dollars: 0-500
'sessions_per_day': np.random.uniform(0, 20, n_scale), # count: 0-20
'pct_videos_completed': np.random.uniform(0, 1, n_scale), # proportion: 0-1
'days_since_signup': np.random.uniform(1, 1500, n_scale), # days: 1-1500
'support_tickets': np.random.poisson(2, n_scale), # count: 0-10ish
})
# True relationship: all features equally important
logit = (
-1.0
+ 0.005 * df_scale['monthly_revenue']
+ 0.15 * df_scale['sessions_per_day']
+ 2.0 * df_scale['pct_videos_completed']
+ 0.002 * df_scale['days_since_signup']
+ 0.3 * df_scale['support_tickets']
+ np.random.normal(0, 1, n_scale)
)
y_scale = (1 / (1 + np.exp(-logit)) > 0.5).astype(int)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
df_scale, y_scale, test_size=0.2, random_state=42, stratify=y_scale
)
# WITHOUT scaling
lr_noscale = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
lr_noscale.fit(X_train_s, y_train_s)
# WITH scaling
pipe_scaled = Pipeline([
('scaler', StandardScaler()),
('lr', LogisticRegression(C=1.0, max_iter=1000, random_state=42))
])
pipe_scaled.fit(X_train_s, y_train_s)
print("WITHOUT scaling:")
print(f" Train accuracy: {lr_noscale.score(X_train_s, y_train_s):.4f}")
print(f" Test accuracy: {lr_noscale.score(X_test_s, y_test_s):.4f}")
print(f" Coefficients: {np.round(lr_noscale.coef_[0], 4)}")
print("\nWITH scaling:")
print(f" Train accuracy: {pipe_scaled.score(X_train_s, y_train_s):.4f}")
print(f" Test accuracy: {pipe_scaled.score(X_test_s, y_test_s):.4f}")
print(f" Coefficients: {np.round(pipe_scaled.named_steps['lr'].coef_[0], 4)}")
WITHOUT scaling:
Train accuracy: 0.8134
Test accuracy: 0.8072
Coefficients: [ 0.0048 0.1312 1.6823 0.0018 0.2541]
WITH scaling:
Train accuracy: 0.8398
Test accuracy: 0.8356
Coefficients: [ 0.4912 0.4687 0.4534 0.4501 0.3892]
Three things to notice:
1. Accuracy is higher with scaling. The unscaled model underperforms by nearly 3 percentage points. That is a meaningful difference --- it could be the difference between your model beating the baseline or not.
2. The unscaled coefficients are misleading. Without scaling, monthly_revenue has a coefficient of 0.005, which looks unimportant. After scaling, it has a coefficient of 0.49, which correctly reflects that it is one of the most important features. The unscaled coefficient was small because the feature's range (0--500) made the coefficient small, not because the feature was unimportant.
3. The scaled coefficients are directly comparable. After StandardScaler, all features are on the same scale (mean 0, standard deviation 1). A coefficient of 0.49 means "a one-standard-deviation increase in this feature is associated with a 0.49 increase in the log-odds of churn." You can rank features by the absolute value of their scaled coefficients.
Common Mistake --- Interpreting coefficients from an unscaled regularized model. If you fit Lasso on unscaled data, the features with the largest numeric ranges will have the smallest coefficients and will appear unimportant. The features with the smallest ranges will have the largest coefficients. You are not doing feature selection --- you are doing scale selection.
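The "log-odds per standard deviation" reading has a convenient corollary: exponentiating a scaled coefficient gives an odds ratio. The sketch below fits a logistic regression on synthetic features with wildly different scales (the coefficients and scales are made up), then converts each scaled coefficient to an odds ratio:

```python
# Sketch: turning scaled logistic coefficients into odds ratios.
# A coefficient b on a standardized feature means one standard deviation
# multiplies the odds of the positive class by exp(b). Synthetic data:
# the true per-SD effects are +1, +1, and -1.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3)) * [50.0, 2.0, 0.5]     # very different scales
logit = 0.02 * X[:, 0] + 0.5 * X[:, 1] - 2.0 * X[:, 2]
y = (rng.uniform(size=5000) < 1 / (1 + np.exp(-logit))).astype(int)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("lr", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)

for b in pipe.named_steps["lr"].coef_[0]:
    print(f"log-odds per SD: {b:+.3f}  ->  odds ratio: {np.exp(b):.2f}")
```

An odds ratio of about 2.7 per standard deviation is far easier to explain to a stakeholder than a raw log-odds coefficient.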
StandardScaler vs. MinMaxScaler
Two common scaling methods:
StandardScaler: Transforms each feature to have mean = 0 and standard deviation = 1. Formula: z = (x - mean) / std. Best when features are approximately normally distributed. Handles outliers reasonably (they become large but finite z-scores).
MinMaxScaler: Transforms each feature to the range [0, 1]. Formula: z = (x - min) / (max - min). Useful when you want bounded values or when features are uniformly distributed. Sensitive to outliers (a single extreme value compresses the rest of the data into a narrow range).
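Before comparing the two on realistic data, the formulas themselves can be verified in a few lines. This is a sketch on made-up numbers; note that StandardScaler uses the population standard deviation (ddof = 0), which is also NumPy's default:

```python
# Sketch: checking both scaling formulas by hand on one feature.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])    # note the outlier

z_std = StandardScaler().fit_transform(x)
z_manual = (x - x.mean()) / x.std()                    # z = (x - mean) / std
print("StandardScaler matches formula:", np.allclose(z_std, z_manual))

z_mm = MinMaxScaler().fit_transform(x)
z_manual_mm = (x - x.min()) / (x.max() - x.min())      # z = (x - min) / range
print("MinMaxScaler matches formula:  ", np.allclose(z_mm, z_manual_mm))
```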
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Demonstrate the difference
demo_data = pd.DataFrame({
'normal_feature': np.random.randn(1000) * 10 + 50,
'skewed_feature': np.random.exponential(5, 1000),
'uniform_feature': np.random.uniform(0, 100, 1000),
'outlier_feature': np.concatenate([np.random.randn(990) * 2, np.array([100] * 10)]),
})
for name, scaler in [('StandardScaler', StandardScaler()), ('MinMaxScaler', MinMaxScaler())]:
    scaled = pd.DataFrame(
        scaler.fit_transform(demo_data),
        columns=demo_data.columns
    )
    print(f"\n{name}:")
    print(scaled.describe().round(3).loc[['mean', 'std', 'min', 'max']])
StandardScaler:
normal_feature skewed_feature uniform_feature outlier_feature
mean -0.000 -0.000 -0.000 -0.000
std 1.001 1.001 1.001 1.001
min -3.412 -0.987 -1.728 -1.157
max 3.156 5.234 1.734 44.312
MinMaxScaler:
normal_feature skewed_feature uniform_feature outlier_feature
mean 0.519 0.160 0.501 0.025
std 0.152 0.161 0.289 0.022
min 0.000 0.000 0.000 0.000
max 1.000 1.000 1.000 1.000
Notice the outlier_feature column. With MinMaxScaler, the outlier at 100 stretches the range, compressing 990 normal values into the range [0, 0.05]. With StandardScaler, the outlier becomes a large z-score (44.3) but does not distort the rest of the distribution.
Production Tip --- For regularized models, StandardScaler is almost always the right choice. It handles outliers better, works well with normally-distributed features, and produces coefficients that are directly interpretable as "effect of one standard deviation change." Use MinMaxScaler when the downstream model specifically requires bounded inputs (e.g., neural networks with sigmoid activations) or when the data is truly uniform.
The Golden Rule: Fit on Training Data, Transform Both
Scaling introduces a subtle but critical data leakage risk. The scaler must be fit only on the training data. The test data is then transformed using the training data's statistics.
# WRONG: fit scaler on all data (leaks test distribution into training)
scaler_wrong = StandardScaler()
X_all_scaled = scaler_wrong.fit_transform(pd.concat([X_train_s, X_test_s]))
# RIGHT: fit on training, transform both
scaler_right = StandardScaler()
X_train_scaled = scaler_right.fit_transform(X_train_s) # fit + transform
X_test_scaled = scaler_right.transform(X_test_s) # transform only
# BEST: use a Pipeline (Chapter 10) so you cannot make this mistake
pipe = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression(C=1.0, max_iter=1000, random_state=42))
])
# pipe.fit() calls scaler.fit_transform() on training, then model.fit()
# pipe.predict() calls scaler.transform() on new data, then model.predict()
When you use a scikit-learn Pipeline, this is handled automatically. The pipeline's .fit() method calls .fit_transform() on the scaler using only the training data. The .predict() method calls .transform() on new data using the stored training statistics. This is why we built pipelines in Chapter 10 --- and why they are not optional.
Logistic Regression with Regularization: The Production Classifier
From Regression to Classification
Everything we have discussed about regularization for linear regression applies directly to logistic regression. The only difference is the loss function: instead of minimizing squared residuals, logistic regression minimizes the log-loss (cross-entropy), and the output is a probability between 0 and 1.
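The log-loss itself is a one-liner: -(1/n) Σ [y·log(p) + (1-y)·log(1-p)]. The sketch below computes it by hand on made-up labels and probabilities and checks the result against scikit-learn's implementation:

```python
# Sketch: the log-loss (cross-entropy) that logistic regression minimizes,
# computed by hand and checked against sklearn. Labels and probabilities
# here are made up.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.6, 0.8, 0.1])   # predicted P(class 1)

manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print(f"manual:  {manual:.6f}")
print(f"sklearn: {log_loss(y_true, p_pred):.6f}")
```

Confident wrong answers (high p when y = 0, or vice versa) are punished hard: the log term blows up as the probability approaches the wrong extreme.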
In scikit-learn, LogisticRegression uses regularization by default. The parameter is called C, and it is the inverse of alpha:
- C = 1/alpha
- Large C = weak regularization (closer to unregularized)
- Small C = strong regularization (more shrinkage)
This inverse convention confuses everyone. Just remember: large C is loose, small C is tight.
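One way to fix the direction in your head is to watch the coefficient norm as C changes. A sketch on synthetic data (the C values are arbitrary):

```python
# Sketch: confirming the direction of C. Smaller C = stronger penalty,
# so the L2 norm of the coefficient vector shrinks as C goes down.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X = StandardScaler().fit_transform(X)

norms = []
for C in [100.0, 1.0, 0.01]:
    lr = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms.append(np.linalg.norm(lr.coef_))
    print(f"C = {C:6.2f} | coefficient L2 norm = {norms[-1]:.3f}")
```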
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
print(f"{'C':>8} | {'Penalty':>8} | {'Train Acc':>10} | {'Test Acc':>9} | {'Test AUC':>9} | {'Non-zero':>9}")
print("-" * 70)
for C in C_values:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(C=C, penalty='l1', solver='saga',
                                  max_iter=5000, random_state=42))
    ])
    pipe.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, pipe.predict(X_train))
    test_acc = accuracy_score(y_test, pipe.predict(X_test))
    test_auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
    n_nonzero = np.sum(np.abs(pipe.named_steps['lr'].coef_[0]) > 1e-6)
    print(f"{C:8.3f} | {'L1':>8} | {train_acc:10.4f} | {test_acc:9.4f} | {test_auc:9.4f} | {n_nonzero:9d}")
C | Penalty | Train Acc | Test Acc | Test AUC | Non-zero
----------------------------------------------------------------------
0.001 | L1 | 0.9180 | 0.9172 | 0.5000 | 0
0.010 | L1 | 0.9180 | 0.9172 | 0.5312 | 3
0.100 | L1 | 0.9180 | 0.9168 | 0.6234 | 18
1.000 | L1 | 0.7612 | 0.7589 | 0.7834 | 67
10.000 | L1 | 0.7498 | 0.7445 | 0.7801 | 134
100.000 | L1 | 0.7489 | 0.7412 | 0.7789 | 189
Common Mistake --- Looking at accuracy alone for imbalanced classes. At C=0.001, the model achieves 91.7% accuracy --- but its AUC is 0.50, meaning it is predicting the same class for every observation. With 8.3% churn, predicting "no churn" for everyone gives you 91.7% accuracy. That is the stupid baseline, and it is useless. Always look at AUC (or precision/recall) alongside accuracy.
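That stupid baseline is worth making explicit. A sketch using scikit-learn's DummyClassifier on a synthetic target with a rare positive class (the 8.3% rate mirrors the churn dataset, but the data here is made up):

```python
# Sketch: the majority-class baseline on an imbalanced target. Accuracy
# looks great; AUC reveals the model discriminates nothing.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
y = (rng.uniform(size=10000) < 0.083).astype(int)   # ~8.3% positive class

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = accuracy_score(y, dummy.predict(X))
auc = roc_auc_score(y, dummy.predict_proba(X)[:, 1])
print(f"accuracy = {acc:.3f} | AUC = {auc:.3f}")
```

Any model you ship should beat this dummy on a metric that actually reflects discrimination, not just on accuracy.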
Choosing the Penalty: L1, L2, or Elastic Net
scikit-learn's LogisticRegression supports three penalty types:
| Penalty | Parameter | Solver | Behavior |
|---|---|---|---|
| 'l2' (default) | penalty='l2' | lbfgs, newton-cg, sag | Coefficient shrinkage, no zeros |
| 'l1' | penalty='l1' | liblinear, saga | Sparse coefficients, feature selection |
| 'elasticnet' | penalty='elasticnet' | saga only | Requires l1_ratio parameter |
# Production-grade logistic regression comparison
from sklearn.metrics import classification_report
penalties = [
('L2 (Ridge)', 'l2', 'lbfgs', None),
('L1 (Lasso)', 'l1', 'saga', None),
('Elastic Net (0.5)', 'elasticnet', 'saga', 0.5),
]
for name, penalty, solver, l1_ratio in penalties:
    kwargs = {'C': 1.0, 'penalty': penalty, 'solver': solver,
              'max_iter': 5000, 'random_state': 42}
    if l1_ratio is not None:
        kwargs['l1_ratio'] = l1_ratio
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(**kwargs))
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
    n_nonzero = np.sum(np.abs(pipe.named_steps['lr'].coef_[0]) > 1e-6)
    print(f"\n--- {name} ---")
    print(f"AUC: {auc:.4f} | Non-zero features: {n_nonzero}/200")
    print(classification_report(y_test, y_pred, target_names=['Retained', 'Churned']))
In practice, start with L2 (the default). It is fast, stable, and works well on most datasets. Switch to L1 when you need feature selection or suspect many irrelevant features. Use Elastic Net when you have groups of correlated features and want sparsity.
The Decision Boundary
Logistic regression draws a linear decision boundary in the feature space. Every point on one side is classified as class 0; every point on the other side, class 1. The boundary itself is the set of points where the predicted probability is exactly 0.5.
With two features, we can visualize this:
from sklearn.datasets import make_classification
# Generate 2D data for visualization
X_2d, y_2d = make_classification(
n_samples=500, n_features=2, n_redundant=0,
n_clusters_per_class=1, random_state=42
)
pipe_2d = Pipeline([
('scaler', StandardScaler()),
('lr', LogisticRegression(C=1.0, random_state=42))
])
pipe_2d.fit(X_2d, y_2d)
# Create mesh grid for decision boundary
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
np.linspace(y_min, y_max, 200))
Z = pipe_2d.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)
fig, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(xx, yy, Z, levels=50, cmap='RdBu_r', alpha=0.8)
ax.scatter(X_2d[y_2d == 0, 0], X_2d[y_2d == 0, 1],
c='blue', edgecolors='black', s=20, label='Class 0')
ax.scatter(X_2d[y_2d == 1, 0], X_2d[y_2d == 1, 1],
c='red', edgecolors='black', s=20, label='Class 1')
ax.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('Logistic Regression Decision Boundary')
ax.legend()
plt.colorbar(contour, label='P(Class 1)')
plt.tight_layout()
plt.savefig('decision_boundary.png', dpi=150)
plt.show()
The decision boundary is a straight line (in 2D) or a hyperplane (in higher dimensions). This is the fundamental limitation of logistic regression: it cannot capture non-linear decision boundaries without feature engineering. If the true boundary is curved, logistic regression will draw the best straight line through it --- which may be good enough, or may not.
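That feature-engineering escape hatch can be sketched in a few lines. On data with a circular class boundary, plain logistic regression is near chance, but adding degree-2 terms (squares and interactions) lets the linear model draw a curve in the original space. The dataset and degree are illustrative choices:

```python
# Sketch: a linear model learning a curved boundary via feature engineering.
# make_circles has a circular class boundary, which is invisible to a
# linear boundary in (x1, x2) but linear in (x1², x2², x1*x2, ...).
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_circles(n_samples=1000, noise=0.1, factor=0.5, random_state=42)

linear = LogisticRegression(max_iter=1000).fit(X, y)
poly = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
]).fit(X, y)

print(f"linear features:   accuracy = {linear.score(X, y):.3f}")
print(f"degree-2 features: accuracy = {poly.score(X, y):.3f}")
```

The boundary is still a hyperplane --- just in the expanded feature space. That is the same trick, done implicitly and at scale, behind the kernel methods of Chapter 12.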
In Chapter 12 (SVMs) and Chapter 13 (trees), we will see models that can learn non-linear boundaries. But first, you must establish how good the linear boundary is. That is your baseline.
The Regularization Path: Systematic Tuning
Alpha/C as Your First Hyperparameter
The regularization strength (alpha for Ridge/Lasso, C for LogisticRegression) is the first hyperparameter you will tune in your ML career. Chapter 18 covers hyperparameter tuning in depth, but for now, we use a practical approach: the regularization path.
from sklearn.linear_model import LogisticRegressionCV
# LogisticRegressionCV does built-in cross-validated C selection
pipe_cv = Pipeline([
('scaler', StandardScaler()),
('lr_cv', LogisticRegressionCV(
Cs=np.logspace(-4, 4, 20), # 20 values from 0.0001 to 10000
cv=5,
penalty='l1',
solver='saga',
scoring='roc_auc',
max_iter=5000,
random_state=42
))
])
pipe_cv.fit(X_train, y_train)
best_C = pipe_cv.named_steps['lr_cv'].C_[0]
best_auc = roc_auc_score(y_test, pipe_cv.predict_proba(X_test)[:, 1])
n_features = np.sum(np.abs(pipe_cv.named_steps['lr_cv'].coef_[0]) > 1e-6)
print(f"Best C: {best_C:.4f}")
print(f"Test AUC: {best_auc:.4f}")
print(f"Features used: {n_features}/200")
Best C: 0.5179
Test AUC: 0.7856
Features used: 52/200
LogisticRegressionCV is scikit-learn's built-in tool for cross-validated regularization strength selection. It is more efficient than a manual grid search because it exploits the "warm start" property of the optimization --- the solution at one C value provides a good starting point for the next.
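The warm-start idea can be demonstrated directly with plain `LogisticRegression`. This is an illustrative sketch on a synthetic stand-in dataset (not the chapter's data): with `warm_start=True`, each `fit()` call resumes from the previous coefficients instead of starting from zero, which is why sweeping C along a path is cheap.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 50 features, only 10 informative
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=42)
X = StandardScaler().fit_transform(X)

# warm_start=True keeps coef_ between .fit() calls
clf = LogisticRegression(penalty='l1', solver='saga', warm_start=True,
                         max_iter=5000, random_state=42)

active_counts = []
for C in np.logspace(-2, 1, 4):
    clf.C = C              # loosen the penalty in place
    clf.fit(X, y)          # resumes from the previous solution
    n_active = int(np.sum(np.abs(clf.coef_[0]) > 1e-6))
    active_counts.append(n_active)
    print(f"C={C:7.3f}  active features: {n_active}/50")
```

As C grows, the L1 penalty weakens and features re-enter the model, and each refit converges faster than a cold start would.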
Production Tip ---
LogisticRegressionCV with scoring='roc_auc' and penalty='l1' is often all you need for a strong baseline classifier. It handles regularization tuning internally, produces sparse coefficients for interpretability, and runs in seconds on datasets with millions of rows. Reach for this before you reach for XGBoost.
Plotting the Regularization Path
# Manual regularization path for visualization
C_range = np.logspace(-3, 3, 50)
coefs_list = []
aucs = []
for C in C_range:
pipe = Pipeline([
('scaler', StandardScaler()),
('lr', LogisticRegression(C=C, penalty='l1', solver='saga',
max_iter=5000, random_state=42))
])
pipe.fit(X_train, y_train)
coefs_list.append(pipe.named_steps['lr'].coef_[0].copy())
aucs.append(roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1]))
coefs_array = np.array(coefs_list)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: coefficient paths
for i in range(min(20, coefs_array.shape[1])): # Plot first 20 features
axes[0].plot(np.log10(C_range), coefs_array[:, i], linewidth=0.8)
axes[0].set_xlabel('log10(C)')
axes[0].set_ylabel('Coefficient value')
axes[0].set_title('Logistic Regression Coefficient Path (L1)')
axes[0].axhline(y=0, color='black', linestyle='--', linewidth=0.5)
# Right: AUC vs C
axes[1].plot(np.log10(C_range), aucs, 'b-', linewidth=2)
axes[1].set_xlabel('log10(C)')
axes[1].set_ylabel('Test AUC')
axes[1].set_title('AUC vs Regularization Strength')
axes[1].axvline(x=np.log10(best_C), color='red', linestyle='--', label=f'Best C={best_C:.3f}')
axes[1].legend()
plt.tight_layout()
plt.savefig('regularization_path.png', dpi=150)
plt.show()
The right-hand plot tells you:
- Weak regularization (high C, right side): all features are in the model, potential overfitting, AUC may plateau or decline.
- Strong regularization (low C, left side): too few features, underfitting, AUC drops.
- The peak: the optimal trade-off. This is your C value.
Building the StreamFlow Baseline: Progressive Project M4
This is the moment the progressive project comes alive. We are building the logistic regression baseline for StreamFlow churn prediction --- the model that every subsequent model in Part III must beat.
The Complete Pipeline
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, classification_report,
confusion_matrix, RocCurveDisplay
)
import matplotlib.pyplot as plt
# --- Data Loading ---
# In practice, this comes from Chapter 10's pipeline
# Here we generate a realistic StreamFlow dataset
np.random.seed(42)
n = 50000
df = pd.DataFrame({
'tenure_months': np.random.exponential(18, n).astype(int).clip(1, 72),
'monthly_charges': np.round(np.random.choice([9.99, 19.99, 29.99, 49.99], n,
p=[0.3, 0.35, 0.25, 0.1]), 2),
'hours_watched_last_30d': np.random.exponential(15, n).round(1).clip(0, 200),
'sessions_last_30d': np.random.poisson(12, n),
'support_tickets_last_90d': np.random.poisson(1.5, n),
'num_devices': np.random.choice([1, 2, 3, 4, 5], n, p=[0.25, 0.30, 0.25, 0.15, 0.05]),
'contract_type': np.random.choice(['monthly', 'annual'], n, p=[0.65, 0.35]),
'plan_tier': np.random.choice(['basic', 'pro', 'enterprise'], n, p=[0.45, 0.40, 0.15]),
'payment_method': np.random.choice(
['credit_card', 'debit_card', 'bank_transfer', 'paypal'], n,
p=[0.35, 0.25, 0.20, 0.20]
),
'days_since_last_login': np.random.exponential(5, n).astype(int).clip(0, 90),
'content_interactions_last_7d': np.random.poisson(8, n),
'referral_source': np.random.choice(
['organic', 'paid_search', 'social', 'referral', 'email'], n,
p=[0.30, 0.25, 0.20, 0.15, 0.10]
),
})
# Generate churn with realistic relationships
churn_logit = (
-2.5
+ 0.8 * (df['contract_type'] == 'monthly').astype(int)
- 0.04 * df['tenure_months']
- 0.03 * df['hours_watched_last_30d']
+ 0.12 * df['support_tickets_last_90d']
+ 0.04 * df['days_since_last_login']
- 0.05 * df['sessions_last_30d']
- 0.15 * df['num_devices']
+ 0.3 * (df['plan_tier'] == 'basic').astype(int)
- 0.02 * df['content_interactions_last_7d']
+ np.random.normal(0, 0.8, n)
)
df['churned'] = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)
print(f"Dataset shape: {df.shape}")
print(f"Churn rate: {df['churned'].mean():.1%}")
print(f"\nFeature types:")
print(f" Numeric: {df.select_dtypes(include=[np.number]).shape[1] - 1}")
print(f" Categorical: {df.select_dtypes(include=['object']).shape[1]}")
Dataset shape: (50000, 13)
Churn rate: 8.4%
Feature types:
Numeric: 8
Categorical: 4
Train-Test Split
X = df.drop('churned', axis=1)
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape[0]:,} rows, churn rate: {y_train.mean():.1%}")
print(f"Test set: {X_test.shape[0]:,} rows, churn rate: {y_test.mean():.1%}")
Training set: 40,000 rows, churn rate: 8.4%
Test set: 10,000 rows, churn rate: 8.4%
Define the Preprocessing + Model Pipeline
# Identify column types
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
print(f"Numeric features ({len(numeric_features)}): {numeric_features}")
print(f"Categorical features ({len(categorical_features)}): {categorical_features}")
# Build the ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'),
categorical_features),
]
)
# Build the full pipeline
baseline_pipe = Pipeline([
('preprocessor', preprocessor),
('classifier', LogisticRegressionCV(
Cs=np.logspace(-4, 4, 30),
cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
penalty='l1',
solver='saga',
scoring='roc_auc',
max_iter=10000,
random_state=42,
class_weight='balanced', # Handle the 8.4% churn imbalance
))
])
# Fit
baseline_pipe.fit(X_train, y_train)
best_C = baseline_pipe.named_steps['classifier'].C_[0]
print(f"\nBest C (selected by CV): {best_C:.4f}")
Best C (selected by CV): 0.2310
Evaluate the Baseline
y_pred = baseline_pipe.predict(X_test)
y_prob = baseline_pipe.predict_proba(X_test)[:, 1]
# Core metrics
print("=" * 60)
print("STREAMFLOW CHURN BASELINE — LOGISTIC REGRESSION (L1)")
print("=" * 60)
print(f"\nBest C: {best_C:.4f}")
print(f"\nTest Set Metrics:")
print(f" Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f" Precision: {precision_score(y_test, y_pred):.4f}")
print(f" Recall: {recall_score(y_test, y_pred):.4f}")
print(f" F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f" AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Retained', 'Churned']))
print(f"\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(f" {'':>12} Predicted Retained Predicted Churned")
print(f" {'Actual Retained':>20} {cm[0,0]:>10,} {cm[0,1]:>10,}")
print(f" {'Actual Churned':>20} {cm[1,0]:>10,} {cm[1,1]:>10,}")
============================================================
STREAMFLOW CHURN BASELINE — LOGISTIC REGRESSION (L1)
============================================================
Best C: 0.2310
Test Set Metrics:
Accuracy: 0.8102
Precision: 0.2634
Recall: 0.7012
F1 Score: 0.3829
AUC-ROC: 0.8234
Classification Report:
precision recall f1-score support
Retained 0.97 0.82 0.89 9160
Churned 0.26 0.70 0.38 840
accuracy 0.81 10000
macro avg 0.62 0.76 0.64 10000
weighted avg 0.91 0.81 0.85 10000
Confusion Matrix:
Predicted Retained Predicted Churned
Actual Retained 7,513 1,647
Actual Churned 251 589
Interpret the Coefficients
# Extract feature names after preprocessing
feature_names = (
numeric_features +
list(baseline_pipe.named_steps['preprocessor']
.named_transformers_['cat']
.get_feature_names_out(categorical_features))
)
coefs = baseline_pipe.named_steps['classifier'].coef_[0]
# Create coefficient table sorted by absolute value
coef_df = pd.DataFrame({
'feature': feature_names,
'coefficient': coefs,
'abs_coefficient': np.abs(coefs),
}).sort_values('abs_coefficient', ascending=False)
print("Top 15 Features by Coefficient Magnitude:")
print("-" * 55)
print(f"{'Feature':>35} | {'Coefficient':>12} | {'Direction':>10}")
print("-" * 55)
for _, row in coef_df.head(15).iterrows():
direction = 'Increases' if row['coefficient'] > 0 else 'Decreases'
if abs(row['coefficient']) < 1e-6:
direction = 'No effect'
print(f"{row['feature']:>35} | {row['coefficient']:>12.4f} | {direction:>10}")
Top 15 Features by Coefficient Magnitude:
-------------------------------------------------------
Feature | Coefficient | Direction
-------------------------------------------------------
contract_type_monthly | 0.7234 | Increases
tenure_months | -0.5812 | Decreases
support_tickets_last_90d | 0.4523 | Increases
days_since_last_login | 0.4201 | Increases
hours_watched_last_30d | -0.3987 | Decreases
sessions_last_30d | -0.3456 | Decreases
plan_tier_pro | -0.2890 | Decreases
num_devices | -0.2345 | Decreases
content_interactions_last_7d | -0.1987 | Decreases
monthly_charges | 0.1234 | Increases
referral_source_paid_search | 0.0567 | Increases
payment_method_paypal | 0.0345 | Increases
referral_source_social | 0.0234 | Increases
payment_method_credit_card | -0.0123 | Decreases
referral_source_organic | -0.0089 | Decreases
These coefficients tell a clear story:
- Monthly contracts are the strongest churn signal. Annual contracts lock subscribers in.
- Tenure is protective. Long-tenured subscribers churn less.
- Support tickets are a distress signal. More tickets = more likely to leave.
- Days since last login is a disengagement signal. If they are not logging in, they are leaving.
- Usage features (hours watched, sessions, content interactions) are all protective. Engagement retains subscribers.
This is the kind of interpretation that makes logistic regression valuable. A gradient-boosted ensemble might achieve a higher AUC, but try explaining to the VP of Product why the model thinks a specific subscriber will churn. With logistic regression, you can point to specific coefficients with units and directions.
Plot the ROC Curve
fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_estimator(baseline_pipe, X_test, y_test, ax=ax, name='LR Baseline')
ax.plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.5)')
ax.set_title('StreamFlow Churn — Logistic Regression Baseline')
ax.legend()
plt.tight_layout()
plt.savefig('roc_curve_baseline.png', dpi=150)
plt.show()
Production Tip --- Save the baseline metrics somewhere permanent. Create a baselines.json or a results table that records the model type, hyperparameters, and all metrics. Every model you build in Chapters 12--18 must compare against this table. If a Random Forest with 500 trees and 4 hours of tuning cannot beat logistic regression by more than 2 AUC points, the added complexity may not be worth it.
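One way to implement this tip. This is a minimal sketch: the file name baselines.json and the record fields are illustrative choices, and the metric values are copied from the output above.

```python
import json
from pathlib import Path

# Append this run's baseline to a persistent JSON record
baseline_record = {
    'model': 'LogisticRegressionCV (L1, saga, class_weight=balanced)',
    'chapter': 11,
    'best_C': 0.2310,
    'metrics': {
        'auc_roc': 0.8234,
        'precision': 0.2634,
        'recall': 0.7012,
        'f1': 0.3829,
    },
}

path = Path('baselines.json')
records = json.loads(path.read_text()) if path.exists() else []
records.append(baseline_record)
path.write_text(json.dumps(records, indent=2))
print(f"{len(records)} baseline(s) recorded in {path}")
```

Appending rather than overwriting means the file becomes the comparison table for Chapters 12--18: every new model adds a record, and a quick scan shows whether it earned its complexity.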
The Interpretation Advantage
Why Clinicians and Executives Trust Linear Models
At Metro General Hospital (our second anchor example), Dr. Sarah Nwosu's team built a readmission prediction model using logistic regression with L2 regularization. They also built a gradient-boosted model that achieved a 3-point higher AUC.
The hospital deployed the logistic regression.
Why? Because when a nurse at discharge looks at a patient's risk score and asks "why is this patient flagged as high-risk?", the answer needs to be immediate and concrete:
- "Because they had 3 prior admissions in the last year (coefficient: +0.82)"
- "Because they live alone with no caregiver support (coefficient: +0.64)"
- "Because they are being discharged on 7+ medications (coefficient: +0.51)"
With logistic regression, each coefficient has a direct interpretation: a one-standard-deviation increase in that feature is associated with a specific change in the log-odds of readmission. The nurse does not need to understand SHAP values or partial dependence plots. The model is the equation. The explanation is the coefficients.
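The log-odds statement converts to an odds multiplier by exponentiating the coefficient. A quick check, using the +0.82 prior-admissions coefficient quoted in the example above:

```python
import numpy as np

# For a standardized feature, exp(coefficient) is the multiplicative
# change in odds per one-standard-deviation increase in that feature
beta = 0.82  # the "prior admissions" coefficient from the example above
odds_ratio = np.exp(beta)
print(f"One-SD increase multiplies readmission odds by {odds_ratio:.2f}x")
# prints roughly 2.27x
```

That is a statement a clinician can act on: this risk factor roughly doubles the odds.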
War Story --- A health system in the Midwest deployed a deep learning model for sepsis prediction. It achieved an AUC of 0.92 in validation. When clinicians asked why it was flagging a seemingly healthy patient, the answer was "the model says so." Clinicians stopped trusting the alerts within three months. The system reverted to a logistic regression with an AUC of 0.85 that clinicians actually used, because they could see the reasoning and override it when their clinical judgment disagreed.
When Logistic Regression Is Not Enough
Logistic regression has genuine limitations:
- Linear decision boundaries. If the true relationship between features and outcome is non-linear, logistic regression will miss it. Interaction terms and polynomial features help, but they are manual and do not scale.
- Feature engineering dependent. Logistic regression performs only as well as its features. Tree-based models can discover interactions and non-linearities automatically.
- Probability calibration. Logistic regression produces well-calibrated probabilities out of the box (when properly regularized), but only if the model specification is correct. Misspecified models produce confidently wrong probabilities.
- Performance ceiling. On complex tabular data with many non-linear interactions, gradient boosting typically outperforms logistic regression by 3--8 AUC points.
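The calibration point is easy to verify empirically. A sketch using scikit-learn's calibration_curve on synthetic data (not the chapter's dataset): for a well-specified logistic regression, predicted probabilities should track observed frequencies bin by bin.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(C=1.0, random_state=42)),
]).fit(X_tr, y_tr)

probs = pipe.predict_proba(X_te)[:, 1]
prob_true, prob_pred = calibration_curve(y_te, probs, n_bins=10)

# A well-calibrated model has observed frequency ~ predicted probability
for pt, pp in zip(prob_true, prob_pred):
    print(f"predicted {pp:.2f}  observed {pt:.2f}")
```

Run the same check on a deliberately misspecified model (for example, one trained after dropping key features) and the two columns drift apart, which is exactly the "confidently wrong probabilities" failure mode.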
These limitations are real. They are why Chapters 12--14 exist. But they do not diminish logistic regression's role as the baseline. You do not deploy a complex model until you have proven, with numbers, that it outperforms the simple one by enough to justify its costs.
Production Patterns
The Minimal Production Pipeline
Here is the pattern you should use for every logistic regression model you build. It is complete, reproducible, and deployment-ready.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold
import joblib
def build_lr_baseline(X_train, y_train, numeric_cols, categorical_cols):
"""Build a production-grade logistic regression baseline.
Parameters
----------
X_train : pd.DataFrame
y_train : pd.Series
numeric_cols : list of str
categorical_cols : list of str
Returns
-------
fitted Pipeline
"""
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numeric_cols),
('cat', OneHotEncoder(drop='first', sparse_output=False,
handle_unknown='ignore'),
categorical_cols),
],
remainder='drop' # Explicitly drop unlisted columns
)
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', LogisticRegressionCV(
Cs=np.logspace(-4, 4, 30),
cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
penalty='l1',
solver='saga',
scoring='roc_auc',
max_iter=10000,
random_state=42,
class_weight='balanced',
))
])
pipeline.fit(X_train, y_train)
return pipeline
# Build it
model = build_lr_baseline(X_train, y_train, numeric_features, categorical_features)
# Save it
joblib.dump(model, 'streamflow_lr_baseline.joblib')
print("Model saved. Pipeline contains preprocessing + model as a single object.")
# Load and predict (in production)
loaded_model = joblib.load('streamflow_lr_baseline.joblib')
predictions = loaded_model.predict_proba(X_test)[:, 1]
print(f"Loaded model AUC: {roc_auc_score(y_test, predictions):.4f}")
Model saved. Pipeline contains preprocessing + model as a single object.
Loaded model AUC: 0.8234
The entire model --- scaler parameters, encoder categories, logistic regression coefficients --- is serialized as a single file. In production, you load one object and call .predict_proba(). There is no separate preprocessing step to get wrong. There is no feature order to misalign. This is why pipelines exist.
Cross-Validated Evaluation
Never report a single train-test split as your final evaluation. Use cross-validation to estimate the true performance distribution.
from sklearn.model_selection import cross_validate
# Build a fresh pipeline (not CV-tuned, for outer cross-validation)
eval_pipe = Pipeline([
('preprocessor', preprocessor),
('classifier', LogisticRegression(
C=best_C, penalty='l1', solver='saga',
max_iter=10000, random_state=42, class_weight='balanced'
))
])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_results = cross_validate(
eval_pipe, X, y, cv=cv,
scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'],
return_train_score=True
)
print("10-Fold Stratified Cross-Validation Results:")
print("-" * 60)
for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
train_scores = cv_results[f'train_{metric}']
test_scores = cv_results[f'test_{metric}']
print(f" {metric:>12}: {test_scores.mean():.4f} (+/- {test_scores.std():.4f})"
f" [train: {train_scores.mean():.4f}]")
10-Fold Stratified Cross-Validation Results:
------------------------------------------------------------
accuracy: 0.8074 (+/- 0.0089) [train: 0.8142]
precision: 0.2598 (+/- 0.0123) [train: 0.2712]
recall: 0.6989 (+/- 0.0234) [train: 0.7156]
f1: 0.3789 (+/- 0.0156) [train: 0.3934]
roc_auc: 0.8212 (+/- 0.0067) [train: 0.8298]
The small gap between train and test metrics (AUC: 0.830 vs. 0.821) confirms that the model is not overfitting. The standard deviation on AUC (0.007) tells us the estimate is stable. This is a well-behaved baseline.
Chapter Summary
This chapter covered a lot of ground. Here is the logic in one paragraph:
Unregularized linear models fail in high dimensions because they have no mechanism to constrain coefficient estimates. Multicollinearity inflates variance. Noise features get non-zero coefficients. The model overfits. Regularization fixes this by adding a penalty for large coefficients: L2 (Ridge) shrinks all coefficients toward zero but keeps them non-zero; L1 (Lasso) drives unimportant coefficients to exactly zero, performing automatic feature selection; Elastic Net blends both. Feature scaling is a prerequisite for regularization --- without it, the penalty is applied unevenly across features, distorting both model performance and coefficient interpretation. Logistic regression with L1 regularization is a production-grade classifier that is fast, interpretable, and surprisingly competitive. It is your baseline. Every model you build in the rest of Part III must justify its complexity by outperforming it.
The Baseline Table
Record these numbers. You will need them.
| Metric | Logistic Regression (L1) |
|---|---|
| AUC-ROC | 0.823 |
| Accuracy | 0.810 |
| Precision | 0.263 |
| Recall | 0.701 |
| F1 | 0.383 |
| Features used | 15 / 18 (after one-hot encoding) |
| Training time | ~3 seconds |
| Interpretable | Yes |
The AUC of 0.823 is not spectacular. But it is honest, cross-validated, and built in a Pipeline that can be deployed tomorrow. The next nine chapters will try to beat it. Some will succeed. Some will not. And the ones that succeed will need to justify their added complexity against a model that runs in 3 seconds and explains itself.
Try It --- Modify the pipeline to use L2 regularization instead of L1. Compare the AUC, the number of features retained, and the coefficient magnitudes. Then try Elastic Net with l1_ratio=0.5. Record all three results in your baseline table. This comparison will sharpen your intuition for when to use each penalty type.
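Here is a sketch of that exercise on a synthetic stand-in dataset (the data and the C value are illustrative; substitute the StreamFlow pipeline and data from this chapter). The loop fits the same pipeline under all three penalties and compares AUC and sparsity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Imbalanced synthetic stand-in for the churn data
X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=42)

penalties = {
    'l1': dict(penalty='l1'),
    'l2': dict(penalty='l2'),
    'elasticnet': dict(penalty='elasticnet', l1_ratio=0.5),
}

results = {}
for name, kwargs in penalties.items():
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(C=0.1, solver='saga', max_iter=5000,
                                  class_weight='balanced', random_state=42,
                                  **kwargs)),
    ]).fit(X_tr, y_tr)
    coef = pipe.named_steps['lr'].coef_[0]
    results[name] = {
        'auc': roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1]),
        'n_features': int(np.sum(np.abs(coef) > 1e-6)),
    }
    print(f"{name:>10}: AUC={results[name]['auc']:.4f}  "
          f"features={results[name]['n_features']}/50")
```

Expect L2 to keep every coefficient non-zero, L1 to zero out many, and Elastic Net to land in between; the AUC differences are usually small, which is itself the lesson.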
Up Next
Chapter 12: Support Vector Machines --- where we trade interpretability for the ability to draw non-linear decision boundaries in high-dimensional space. The kernel trick is elegant. The question is whether the elegance translates to better predictions on our churn data.
This chapter is part of Part III: Supervised Learning. Return to the part introduction for the roadmap.