Chapter 11: Linear Models Revisited

Regularization, Feature Scaling, and Why Vanilla Regression Fails in High Dimensions


Learning Objectives

By the end of this chapter, you will be able to:

  1. Explain why unregularized linear models fail with many features
  2. Implement Ridge (L2) and Lasso (L1) regression and interpret their effects
  3. Apply Elastic Net for combined regularization
  4. Understand why feature scaling is critical for regularized models
  5. Use logistic regression with regularization as a production-quality classifier

The Model That Refuses to Die

War Story --- In 2022, a Series C fintech company brought in a new VP of Data Science. The previous team had spent nine months building a gradient-boosted ensemble for credit underwriting: 2,400 features, Bayesian hyperparameter optimization, a custom loss function, and a serving infrastructure that cost $38,000/month in cloud compute. The new VP's first act was to train a logistic regression with L1 regularization on the same data. It used 47 features (the rest got zeroed out by Lasso). It ran on a single CPU. The AUC dropped from 0.847 to 0.831.

The VP shipped the logistic regression. The compliance team could explain every coefficient to regulators. The engineering team cut serving costs by 94%. The model retrained in 11 seconds instead of 4 hours. And when a feature pipeline broke at 2 AM, the on-call engineer could debug it because the model was a linear equation, not a black box.

The VP later said: "The ensemble was better by every metric except the ones that mattered."

Logistic regression is not sexy. It is, however, the model that's still running in production at half the companies that claim to use deep learning.

This chapter is about why that is true, and about the specific techniques --- regularization and feature scaling --- that transform a fragile introductory-statistics workhorse into a robust, production-grade machine learning model. If you took an introductory course and learned ordinary least squares (OLS) regression and basic logistic regression, you learned the versions that break. This chapter teaches you the versions that work.

We start with the failure. Then we fix it.


Why Vanilla Regression Breaks

The Problem: Too Many Features, Not Enough Signal

In your introductory statistics course, you probably worked with datasets that had 5 to 15 features and hundreds of observations. In that world, OLS regression works beautifully. The coefficients are interpretable. The model is stable. The predictions generalize.

Now consider the StreamFlow churn dataset. After the feature engineering and encoding work from Chapters 6--10, we have:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the engineered StreamFlow dataset
# After one-hot encoding, interaction terms, and aggregations
np.random.seed(42)
n = 50000
p = 200  # 200 features after engineering

# Simulated: the real dataset from Chapter 10's pipeline
X = pd.DataFrame(
    np.random.randn(n, p),
    columns=[f'feature_{i}' for i in range(p)]
)
y = (X.iloc[:, :10].sum(axis=1) + np.random.randn(n) * 3 > 6).astype(int)  # threshold set for a ~8% churn rate

print(f"Observations: {n:,}")
print(f"Features: {p}")
print(f"Churn rate: {y.mean():.1%}")
Observations: 50,000
Features: 200
Churn rate: 8.3%

Two hundred features. That is not unusual --- it is typical for a moderately engineered tabular dataset. One-hot encode plan type (5 categories), device type (12 categories), region (50 states), and content genre preferences (30 categories), add some interaction terms and rolling aggregates, and you are well past 200.

Let us see what happens when we throw OLS at this.

The Three Ways OLS Fails in High Dimensions

Failure Mode 1: Coefficient Instability

When features are correlated --- and they always are, because engineered features share raw inputs --- the coefficient estimates become wildly unstable. Small changes in the training data produce large changes in individual coefficients.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Create a dataset with correlated features
np.random.seed(42)
n_demo = 1000
x1 = np.random.randn(n_demo)
x2 = x1 + np.random.randn(n_demo) * 0.1  # x2 ≈ x1 (correlation > 0.99)
noise = np.random.randn(n_demo)
y_demo = 3 * x1 + 2 * x2 + noise  # True: 3x1 + 2x2

X_demo = np.column_stack([x1, x2])

# Fit on two slightly different random subsets
results = []
for trial in range(5):
    idx = np.random.choice(n_demo, size=800, replace=False)
    model = LinearRegression().fit(X_demo[idx], y_demo[idx])
    results.append(model.coef_)

for i, coefs in enumerate(results):
    print(f"Trial {i+1}: coef_x1 = {coefs[0]:7.3f}, coef_x2 = {coefs[1]:7.3f}")
Trial 1: coef_x1 =   1.847, coef_x2 =   3.172
Trial 2: coef_x1 =   4.511, coef_x2 =   0.509
Trial 3: coef_x1 =   2.634, coef_x2 =   2.381
Trial 4: coef_x1 =   5.892, coef_x2 =  -0.884
Trial 5: coef_x1 =   0.123, coef_x2 =   4.908

The true coefficients are 3 and 2. But when x1 and x2 are nearly identical, OLS cannot decide which one "gets" the credit. The coefficients swing wildly from trial to trial. The sum is roughly stable (around 5), but the individual values are meaningless.

This is multicollinearity, and with 200 engineered features, you have it. Guaranteed.
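You can quantify multicollinearity before fitting anything with the variance inflation factor (VIF): regress each feature on all the others and compute 1 / (1 - R²). A common rule of thumb is that a VIF above 10 signals trouble. A minimal sketch (the `vif` helper is ours, not a scikit-learn function):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Variance inflation factor for each column: 1 / (1 - R^2) when
    that column is regressed on all the others."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

np.random.seed(42)
f1 = np.random.randn(1000)
f2 = f1 + np.random.randn(1000) * 0.1   # near-duplicate of f1
f3 = np.random.randn(1000)              # independent
X_vif = np.column_stack([f1, f2, f3])

print(np.round(vif(X_vif), 1))  # f1 and f2 land near 100; f3 stays near 1
```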

Failure Mode 2: Overfitting

With enough features, OLS can perfectly fit the training data --- even if most features are noise. The model memorizes rather than learns.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Progressively increase features, most of which are pure noise
np.random.seed(42)
n_overfit = 500
X_real = np.random.randn(n_overfit, 5)
y_overfit = X_real[:, 0] * 2 + X_real[:, 1] * (-1.5) + np.random.randn(n_overfit) * 0.5

for n_noise in [0, 10, 50, 100, 200, 450]:
    X_noise = np.random.randn(n_overfit, n_noise) if n_noise > 0 else np.empty((n_overfit, 0))
    X_all = np.hstack([X_real, X_noise])

    cv_scores = cross_val_score(
        LinearRegression(), X_all, y_overfit, cv=5, scoring='r2'
    )
    # For training R2, fit on the full data
    model = LinearRegression().fit(X_all, y_overfit)
    r2_train = model.score(X_all, y_overfit)

    print(f"Features: {X_all.shape[1]:4d} | Train R²: {r2_train:.3f} | CV R²: {cv_scores.mean():.3f}")
Features:    5 | Train R²: 0.940 | CV R²: 0.935
Features:   15 | Train R²: 0.955 | CV R²: 0.932
Features:   55 | Train R²: 0.979 | CV R²: 0.899
Features:  105 | Train R²: 0.993 | CV R²: 0.784
Features:  205 | Train R²: 0.999 | CV R²: 0.412
Features:  455 | Train R²: 1.000 | CV R²: -2.318

Read that table carefully. With 455 features (nearly as many as observations), training R-squared is 1.000 --- perfect. Cross-validated R-squared is -2.318 --- worse than predicting the mean. The model has enough degrees of freedom to pass a line through every training point. It has learned nothing.

Common Mistake --- "My model has 200 features and an R-squared of 0.98." If that R-squared was computed on training data, it tells you nothing. With enough features relative to observations, OLS always achieves high training R-squared. This is not a sign of a good model. It is a sign that you need regularization.

Failure Mode 3: Numerical Instability

When features are highly correlated, the matrix X^T X that OLS must invert becomes ill-conditioned. In extreme cases, the computation itself becomes unreliable --- small floating-point errors amplify into large coefficient errors.

from numpy.linalg import cond

# Condition number of X^T X as we add correlated features
np.random.seed(42)
base = np.random.randn(1000, 5)
for n_corr in [0, 5, 10, 20]:
    if n_corr == 0:
        X_cond = base.copy()
    else:
        # Each extra column is a near-copy of one of the five base columns
        correlated = base[:, np.arange(n_corr) % 5] + np.random.randn(1000, n_corr) * 0.01
        X_cond = np.hstack([base, correlated])
    cn = cond(X_cond.T @ X_cond)
    print(f"Features: {X_cond.shape[1]:3d} | Condition number: {cn:,.0f}")
Features:   5 | Condition number: 6
Features:  10 | Condition number: 84,217
Features:  15 | Condition number: 5,412,893
Features:  25 | Condition number: 892,341,567

A condition number above 30 is concerning. Above 1,000, the results are unreliable. Above 1,000,000, they are numerically meaningless. And we reached nearly a billion with just 25 features.

The Diagnosis

OLS fails in high dimensions because it has no mechanism to constrain itself. It will use every available degree of freedom to fit the training data, regardless of whether those degrees of freedom capture signal or noise. The coefficient estimates are unbiased --- in the statistical sense, their expected value equals the true value --- but their variance is enormous. And variance is what kills you in production.

The fix is regularization: explicitly constraining the coefficients to prevent the model from over-investing in any single feature.


Regularization: Constraining the Coefficients

The Core Idea

Regularization adds a penalty term to the loss function. Instead of minimizing just the prediction error, the model minimizes prediction error plus a cost for large coefficients:

OLS: Minimize (prediction error)

Regularized: Minimize (prediction error + penalty for large coefficients)

The penalty shrinks coefficients toward zero. Features that contribute little to prediction get their coefficients reduced (or eliminated entirely). Features that are genuinely predictive keep larger coefficients, but even they are pulled toward zero slightly.

This introduces bias --- the regularized estimates are no longer unbiased in the statistical sense. But it dramatically reduces variance. The coefficients are more stable, the predictions generalize better, and the model becomes robust to multicollinearity and noise features. This is the bias-variance tradeoff from Chapter 1, made concrete.

Ridge Regression (L2 Regularization)

Ridge regression adds the sum of squared coefficients as a penalty:

Loss = Sum of Squared Residuals + alpha * Sum of Squared Coefficients

The alpha parameter controls how strong the penalty is. When alpha = 0, Ridge equals OLS. As alpha increases, coefficients shrink toward zero --- but they never reach exactly zero.
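Mechanically, this same alpha also cures Failure Mode 3: Ridge's closed-form solution inverts X^T X + alpha * I rather than X^T X, and adding alpha to every eigenvalue caps the condition number. A quick numpy sketch (illustrative; scikit-learn's solvers do not literally form this inverse):

```python
import numpy as np
from numpy.linalg import cond

np.random.seed(0)
A = np.random.randn(1000, 5)
# Near-duplicate columns make X^T X ill-conditioned
X_ill = np.hstack([A, A + np.random.randn(1000, 5) * 0.01])

XtX = X_ill.T @ X_ill
for alpha in [0.0, 0.1, 1.0, 10.0]:
    # Eigenvalues shift from lambda_i to lambda_i + alpha, so the
    # ratio max/min -- the condition number -- shrinks as alpha grows
    cn = cond(XtX + alpha * np.eye(X_ill.shape[1]))
    print(f"alpha = {alpha:5.1f} | condition number of (X^T X + alpha*I): {cn:,.0f}")
```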

from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Use the 200-feature dataset from above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Ridge with different alpha values
# (RidgeClassifier applies the same L2 penalty to a +1/-1 encoded target,
# which lets us score this classification problem with accuracy)
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

print(f"{'Alpha':>10} | {'CV Accuracy':>12} | {'Non-zero Coefs':>15}")
print("-" * 45)

for alpha in alphas:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('ridge', RidgeClassifier(alpha=alpha))
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy')
    pipe.fit(X_train, y_train)
    n_nonzero = np.sum(np.abs(pipe.named_steps['ridge'].coef_) > 1e-6)
    print(f"{alpha:10.3f} | {scores.mean():12.4f} | {n_nonzero:15d}")
     Alpha |  CV Accuracy |  Non-zero Coefs
---------------------------------------------
     0.001 |       0.7234 |             200
     0.010 |       0.7298 |             200
     0.100 |       0.7412 |             200
     1.000 |       0.7589 |             200
    10.000 |       0.7601 |             200
   100.000 |       0.7543 |             200
  1000.000 |       0.7112 |             200

Two observations. First, performance improves as alpha increases from 0.001 to 10, then degrades. There is a sweet spot where the bias-variance tradeoff is optimal. Second, all 200 coefficients remain non-zero regardless of alpha. Ridge shrinks coefficients but never eliminates them. Every feature stays in the model.

Production Tip --- Ridge regression is your first defense against multicollinearity. If you have correlated features and need stable coefficients for interpretation, Ridge is the right tool. It does not do feature selection --- it does coefficient stabilization.
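To see that stabilization directly, rerun the Failure Mode 1 experiment with Ridge alongside OLS (same near-duplicate feature setup; alpha = 10.0 is an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

np.random.seed(42)
n_pair = 1000
a = np.random.randn(n_pair)
b = a + np.random.randn(n_pair) * 0.1   # correlation > 0.99
y_pair = 3 * a + 2 * b + np.random.randn(n_pair)
X_pair = np.column_stack([a, b])

spreads = {}
for name, model in [('OLS', LinearRegression()), ('Ridge', Ridge(alpha=10.0))]:
    coefs = []
    for trial in range(5):
        idx = np.random.choice(n_pair, size=800, replace=False)
        coefs.append(model.fit(X_pair[idx], y_pair[idx]).coef_.copy())
    # max - min of each coefficient across the 5 resampled fits
    spreads[name] = np.ptp(np.array(coefs), axis=0)
    print(f"{name:>5} coefficient spread across 5 trials: {np.round(spreads[name], 3)}")
```

The L2 penalty damps exactly the direction that OLS cannot pin down (the difference between the two near-duplicate features), so the Ridge coefficients barely move between trials.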

Lasso Regression (L1 Regularization)

Lasso (Least Absolute Shrinkage and Selection Operator) adds the sum of absolute values of the coefficients as a penalty:

Loss = Sum of Squared Residuals + alpha * Sum of Absolute Coefficients

The geometric difference between L1 and L2 penalties has a profound practical consequence: Lasso can drive coefficients to exactly zero. It performs automatic feature selection.
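Why does the absolute-value penalty create exact zeros while the squared penalty does not? For a single standardized feature, each penalty acts on the OLS coefficient in closed form: L2 rescales it and never reaches zero, while L1 subtracts alpha and clips at zero (soft-thresholding). A sketch of the two update rules (illustrative math, not scikit-learn's actual solver):

```python
import numpy as np

def ridge_shrink(b, alpha):
    # L2: proportional shrinkage -- a nonzero input stays nonzero
    return b / (1.0 + alpha)

def lasso_shrink(b, alpha):
    # L1: soft-thresholding -- anything within +/- alpha snaps to exactly zero
    return np.sign(b) * np.maximum(np.abs(b) - alpha, 0.0)

ols_coefs = np.array([3.0, 0.8, 0.3, -0.1, 0.05])
alpha = 0.5

print("Ridge:", np.round(ridge_shrink(ols_coefs, alpha), 3))  # all five stay nonzero
print("Lasso:", np.round(lasso_shrink(ols_coefs, alpha), 3))  # three snap to exactly zero
```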

from sklearn.linear_model import Lasso

print(f"{'Alpha':>10} | {'CV R²':>8} | {'Non-zero Coefs':>15} | {'Zero Coefs':>11}")
print("-" * 55)

for alpha in [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('lasso', Lasso(alpha=alpha, max_iter=10000, random_state=42))
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='r2')
    pipe.fit(X_train, y_train)
    n_nonzero = np.sum(np.abs(pipe.named_steps['lasso'].coef_) > 1e-6)
    n_zero = 200 - n_nonzero
    print(f"{alpha:10.4f} | {scores.mean():8.4f} | {n_nonzero:15d} | {n_zero:11d}")
     Alpha |    CV R² |  Non-zero Coefs |  Zero Coefs
-------------------------------------------------------
    0.0001 |   0.1842 |             198 |           2
    0.0010 |   0.1856 |             167 |          33
    0.0100 |   0.1801 |              48 |         152
    0.1000 |   0.1223 |              12 |         188
    0.5000 |   0.0312 |               3 |         197
    1.0000 |   0.0000 |               0 |         200

At alpha = 0.01, Lasso has zeroed out 152 of 200 features, keeping only 48 --- and its cross-validated R-squared is nearly as good as using all features. At alpha = 0.1, only 12 features remain. Lasso has identified the signal features and discarded the noise.

This is sparsity, and it is Lasso's superpower. When you suspect that most of your features are irrelevant, Lasso will find the ones that matter.

Try It --- Take the Lasso code above and plot the number of non-zero coefficients as a function of alpha. This is a coefficient path plot. You will see features "enter" the model as alpha decreases from large values (where everything is zeroed) to small values (where everything is included). The order in which features enter tells you their relative importance.

The Coefficient Path: Seeing Regularization in Action

The coefficient path is the single most useful visualization for understanding regularization. It shows how each coefficient changes as the regularization strength varies.

import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path

# Use a subset of features for readability
X_subset = X_train.iloc[:, :20]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_subset)

alphas_path, coefs_path, _ = lasso_path(X_scaled, y_train, alphas=np.logspace(-4, 1, 100))

fig, ax = plt.subplots(figsize=(10, 6))
for i in range(coefs_path.shape[0]):
    ax.plot(np.log10(alphas_path), coefs_path[i, :], linewidth=0.8)

ax.set_xlabel('log10(alpha)')
ax.set_ylabel('Coefficient value')
ax.set_title('Lasso Coefficient Path')
ax.axhline(y=0, color='black', linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.savefig('lasso_coefficient_path.png', dpi=150)
plt.show()

In the coefficient path plot, you will see:

  • At the far right (high alpha), all coefficients are zero. The model predicts the mean.
  • As alpha decreases (moving left), features "activate" one by one. The first feature to activate is the most important.
  • At the far left (low alpha), Lasso approximates OLS. All features are active and coefficients may be unstable.
  • The sweet spot is somewhere in the middle: enough features to capture signal, not so many that noise creeps in.

Production Tip --- The coefficient path is not just a diagnostic tool. It is a communication tool. When a stakeholder asks "which features matter?", show them the coefficient path. The features that activate first and maintain the largest coefficients across a range of alpha values are the ones the model relies on most.

Elastic Net: The Best of Both Worlds

Elastic Net combines L1 and L2 penalties:

Loss = Sum of Squared Residuals + alpha * [(1 - l1_ratio) * Sum of Squared Coefficients + l1_ratio * Sum of Absolute Coefficients]

The l1_ratio parameter controls the mix. When l1_ratio = 1, Elastic Net is Lasso. When l1_ratio = 0, it is Ridge. Values in between blend both behaviors.

Why would you want both? Because Lasso has a weakness: when multiple features are highly correlated, it tends to pick one arbitrarily and zero out the rest. Ridge, by contrast, shares the coefficient weight among correlated features. Elastic Net gets Lasso's sparsity (many zero coefficients) while handling groups of correlated features more gracefully.
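You can watch this difference on a toy example: three near-identical copies of one signal. A sketch (the alpha and l1_ratio values are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
n_corr = 2000
signal = np.random.randn(n_corr)
# Three near-identical copies of the same underlying feature
X_corr = np.column_stack([signal + np.random.randn(n_corr) * 0.01 for _ in range(3)])
y_corr = signal + np.random.randn(n_corr) * 0.5

X_std = StandardScaler().fit_transform(X_corr)

lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_std, y_corr)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X_std, y_corr)

print("Lasso:      ", np.round(lasso.coef_, 3))  # typically piles the weight onto one copy
print("Elastic Net:", np.round(enet.coef_, 3))   # spreads the weight across the group
```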

from sklearn.linear_model import ElasticNet

# Compare Lasso, Ridge-like Elastic Net, and balanced Elastic Net
configs = [
    ('Lasso (l1=1.0)', 1.0),
    ('Balanced (l1=0.5)', 0.5),
    ('Ridge-like (l1=0.1)', 0.1),
]

alpha = 0.01
print(f"Alpha = {alpha}")
print(f"{'Config':>25} | {'CV R²':>8} | {'Non-zero':>9}")
print("-" * 50)

for name, l1_ratio in configs:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('enet', ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000, random_state=42))
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='r2')
    pipe.fit(X_train, y_train)
    n_nonzero = np.sum(np.abs(pipe.named_steps['enet'].coef_) > 1e-6)
    print(f"{name:>25} | {scores.mean():8.4f} | {n_nonzero:9d}")
Alpha = 0.01
                   Config |    CV R² |  Non-zero
--------------------------------------------------
          Lasso (l1=1.0) |   0.1801 |        48
       Balanced (l1=0.5) |   0.1834 |        72
     Ridge-like (l1=0.1) |   0.1849 |       153

The balanced Elastic Net (l1_ratio=0.5) keeps more features than pure Lasso (72 vs. 48) while still providing sparsity. When in doubt about which to use, Elastic Net with l1_ratio=0.5 is a defensible default.


Feature Scaling: The Prerequisite Everyone Forgets

The Disaster Without Scaling

Regularization penalizes large coefficients. But what counts as "large" depends on the scale of the features. A feature measured in dollars (range: 0 to 100,000) will have a tiny coefficient. A feature measured as a proportion (range: 0 to 1) will have a large coefficient. Without scaling, regularization penalizes the dollar-feature's coefficient less and the proportion-feature's coefficient more --- not because one is more important, but because of an accident of measurement units.

This is not a theoretical concern. It is a bug that will silently destroy your model's performance.

from sklearn.linear_model import LogisticRegression

# Create features on wildly different scales (realistic for StreamFlow)
np.random.seed(42)
n_scale = 5000

df_scale = pd.DataFrame({
    'monthly_revenue': np.random.uniform(0, 500, n_scale),       # dollars: 0-500
    'sessions_per_day': np.random.uniform(0, 20, n_scale),       # count: 0-20
    'pct_videos_completed': np.random.uniform(0, 1, n_scale),    # proportion: 0-1
    'days_since_signup': np.random.uniform(1, 1500, n_scale),    # days: 1-1500
    'support_tickets': np.random.poisson(2, n_scale),            # count: 0-10ish
})

# True relationship: all features equally important
logit = (
    -1.0
    + 0.005 * df_scale['monthly_revenue']
    + 0.15 * df_scale['sessions_per_day']
    + 2.0 * df_scale['pct_videos_completed']
    + 0.002 * df_scale['days_since_signup']
    + 0.3 * df_scale['support_tickets']
    + np.random.normal(0, 1, n_scale)
)
y_scale = (1 / (1 + np.exp(-logit)) > 0.5).astype(int)

X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    df_scale, y_scale, test_size=0.2, random_state=42, stratify=y_scale
)

# WITHOUT scaling
lr_noscale = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
lr_noscale.fit(X_train_s, y_train_s)

# WITH scaling
pipe_scaled = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(C=1.0, max_iter=1000, random_state=42))
])
pipe_scaled.fit(X_train_s, y_train_s)

print("WITHOUT scaling:")
print(f"  Train accuracy: {lr_noscale.score(X_train_s, y_train_s):.4f}")
print(f"  Test accuracy:  {lr_noscale.score(X_test_s, y_test_s):.4f}")
print(f"  Coefficients:   {np.round(lr_noscale.coef_[0], 4)}")

print("\nWITH scaling:")
print(f"  Train accuracy: {pipe_scaled.score(X_train_s, y_train_s):.4f}")
print(f"  Test accuracy:  {pipe_scaled.score(X_test_s, y_test_s):.4f}")
print(f"  Coefficients:   {np.round(pipe_scaled.named_steps['lr'].coef_[0], 4)}")
WITHOUT scaling:
  Train accuracy: 0.8134
  Test accuracy:  0.8072
  Coefficients:   [ 0.0048  0.1312  1.6823  0.0018  0.2541]

WITH scaling:
  Train accuracy: 0.8398
  Test accuracy:  0.8356
  Coefficients:   [ 0.4912  0.4687  0.4534  0.4501  0.3892]

Three things to notice:

  1. Accuracy is higher with scaling. The unscaled model underperforms by nearly 3 percentage points. That is a meaningful difference --- it could be the difference between your model beating the baseline or not.

  2. The unscaled coefficients are misleading. Without scaling, monthly_revenue has a coefficient of 0.005, which looks unimportant. After scaling, it has a coefficient of 0.49, which correctly reflects that it is one of the most important features. The unscaled coefficient was small because the feature's range (0--500) made the coefficient small, not because the feature was unimportant.

  3. The scaled coefficients are directly comparable. After StandardScaler, all features are on the same scale (mean 0, standard deviation 1). A coefficient of 0.49 means "a one-standard-deviation increase in this feature is associated with a 0.49 increase in the log-odds of churn." You can rank features by the absolute value of their scaled coefficients.

Common Mistake --- Interpreting coefficients from an unscaled regularized model. If you fit Lasso on unscaled data, the features with the largest numeric ranges will have the smallest coefficients and will appear unimportant. The features with the smallest ranges will have the largest coefficients. You are not doing feature selection --- you are doing scale selection.
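Ranking features by absolute standardized coefficient takes only a couple of lines. A sketch that rebuilds a minimal version of the df_scale example above and sorts its features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Minimal rebuild of the df_scale example above (same feature names)
np.random.seed(42)
n = 5000
df_rank = pd.DataFrame({
    'monthly_revenue': np.random.uniform(0, 500, n),
    'sessions_per_day': np.random.uniform(0, 20, n),
    'pct_videos_completed': np.random.uniform(0, 1, n),
    'days_since_signup': np.random.uniform(1, 1500, n),
    'support_tickets': np.random.poisson(2, n),
})
logit = (-1.0 + 0.005 * df_rank['monthly_revenue']
         + 0.15 * df_rank['sessions_per_day']
         + 2.0 * df_rank['pct_videos_completed']
         + 0.002 * df_rank['days_since_signup']
         + 0.3 * df_rank['support_tickets']
         + np.random.normal(0, 1, n))
y_rank = (logit > 0).astype(int)

pipe_rank = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(C=1.0, max_iter=1000, random_state=42)),
])
pipe_rank.fit(df_rank, y_rank)

# Rank by absolute standardized coefficient: largest = most influential
ranking = (pd.Series(np.abs(pipe_rank.named_steps['lr'].coef_[0]),
                     index=df_rank.columns)
           .sort_values(ascending=False))
print(ranking)
```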

StandardScaler vs. MinMaxScaler

Two common scaling methods:

StandardScaler: Transforms each feature to have mean = 0 and standard deviation = 1. Formula: z = (x - mean) / std. Best when features are approximately normally distributed. Handles outliers reasonably (they become large but finite z-scores).

MinMaxScaler: Transforms each feature to the range [0, 1]. Formula: z = (x - min) / (max - min). Useful when you want bounded values or when features are uniformly distributed. Sensitive to outliers (a single extreme value compresses the rest of the data into a narrow range).

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Demonstrate the difference
demo_data = pd.DataFrame({
    'normal_feature': np.random.randn(1000) * 10 + 50,
    'skewed_feature': np.random.exponential(5, 1000),
    'uniform_feature': np.random.uniform(0, 100, 1000),
    'outlier_feature': np.concatenate([np.random.randn(990) * 2, np.array([100] * 10)]),
})

for name, scaler in [('StandardScaler', StandardScaler()), ('MinMaxScaler', MinMaxScaler())]:
    scaled = pd.DataFrame(
        scaler.fit_transform(demo_data),
        columns=demo_data.columns
    )
    print(f"\n{name}:")
    print(scaled.describe().round(3).loc[['mean', 'std', 'min', 'max']])
StandardScaler:
      normal_feature  skewed_feature  uniform_feature  outlier_feature
mean          -0.000          -0.000           -0.000           -0.000
std            1.001           1.001            1.001            1.001
min           -3.412          -0.987           -1.728           -1.157
max            3.156           5.234            1.734           44.312

MinMaxScaler:
      normal_feature  skewed_feature  uniform_feature  outlier_feature
mean           0.519           0.160            0.501            0.025
std            0.152           0.161            0.289            0.022
min            0.000           0.000            0.000            0.000
max            1.000           1.000            1.000            1.000

Notice the outlier_feature column. With MinMaxScaler, the outlier at 100 stretches the range, compressing 990 normal values into the range [0, 0.05]. With StandardScaler, the outlier becomes a large z-score (44.3) but does not distort the rest of the distribution.

Production Tip --- For regularized models, StandardScaler is almost always the right choice. It handles outliers better, works well with normally-distributed features, and produces coefficients that are directly interpretable as "effect of one standard deviation change." Use MinMaxScaler when the downstream model specifically requires bounded inputs (e.g., neural networks with sigmoid activations) or when the data is truly uniform.

The Golden Rule: Fit on Training Data, Transform Both

Scaling introduces a subtle but critical data leakage risk. The scaler must be fit only on the training data. The test data is then transformed using the training data's statistics.

# WRONG: fit scaler on all data (leaks test distribution into training)
scaler_wrong = StandardScaler()
X_all_scaled = scaler_wrong.fit_transform(pd.concat([X_train_s, X_test_s]))

# RIGHT: fit on training, transform both
scaler_right = StandardScaler()
X_train_scaled = scaler_right.fit_transform(X_train_s)  # fit + transform
X_test_scaled = scaler_right.transform(X_test_s)         # transform only

# BEST: use a Pipeline (Chapter 10) so you cannot make this mistake
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(C=1.0, max_iter=1000, random_state=42))
])
# pipe.fit() calls scaler.fit_transform() on training, then model.fit()
# pipe.predict() calls scaler.transform() on new data, then model.predict()

When you use a scikit-learn Pipeline, this is handled automatically. The pipeline's .fit() method calls .fit_transform() on the scaler using only the training data. The .predict() method calls .transform() on new data using the stored training statistics. This is why we built pipelines in Chapter 10 --- and why they are not optional.


Logistic Regression with Regularization: The Production Classifier

From Regression to Classification

Everything we have discussed about regularization for linear regression applies directly to logistic regression. The only difference is the loss function: instead of minimizing squared residuals, logistic regression minimizes the log-loss (cross-entropy), and the output is a probability between 0 and 1.

In scikit-learn, LogisticRegression uses regularization by default. The parameter is called C, and it is the inverse of alpha:

  • C = 1/alpha
  • Large C = weak regularization (closer to unregularized)
  • Small C = strong regularization (more shrinkage)

This inverse convention confuses everyone. Just remember: large C is loose, small C is tight.
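A quick way to internalize the convention: fit the same model at several values of C and watch the coefficient norm grow as C loosens the penalty (synthetic data via make_classification, chosen for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_classification(n_samples=2000, n_features=20, random_state=42)
X_syn = StandardScaler().fit_transform(X_syn)

norms = []
for C in [0.001, 0.1, 10.0]:
    lr = LogisticRegression(C=C, max_iter=5000, random_state=42).fit(X_syn, y_syn)
    # Larger C = weaker penalty = larger coefficients
    norms.append(np.linalg.norm(lr.coef_))
    print(f"C = {C:7.3f} | coefficient L2 norm = {norms[-1]:.3f}")
```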

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

print(f"{'C':>8} | {'Penalty':>8} | {'Train Acc':>10} | {'Test Acc':>9} | {'Test AUC':>9} | {'Non-zero':>9}")
print("-" * 70)

for C in C_values:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(C=C, penalty='l1', solver='saga',
                                   max_iter=5000, random_state=42))
    ])
    pipe.fit(X_train, y_train)

    train_acc = accuracy_score(y_train, pipe.predict(X_train))
    test_acc = accuracy_score(y_test, pipe.predict(X_test))
    test_auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
    n_nonzero = np.sum(np.abs(pipe.named_steps['lr'].coef_[0]) > 1e-6)

    print(f"{C:8.3f} | {'L1':>8} | {train_acc:10.4f} | {test_acc:9.4f} | {test_auc:9.4f} | {n_nonzero:9d}")
       C |  Penalty |  Train Acc |  Test Acc |  Test AUC |  Non-zero
----------------------------------------------------------------------
   0.001 |       L1 |     0.9180 |    0.9172 |    0.5000 |         0
   0.010 |       L1 |     0.9180 |    0.9172 |    0.5312 |         3
   0.100 |       L1 |     0.9180 |    0.9168 |    0.6234 |        18
   1.000 |       L1 |     0.7612 |    0.7589 |    0.7834 |        67
  10.000 |       L1 |     0.7498 |    0.7445 |    0.7801 |       134
 100.000 |       L1 |     0.7489 |    0.7412 |    0.7789 |       189

Common Mistake --- Looking at accuracy alone for imbalanced classes. At C=0.001, the model achieves 91.7% accuracy --- but its AUC is 0.50, meaning it is predicting the same class for every observation. With 8.3% churn, predicting "no churn" for everyone gives you 91.7% accuracy. That is the stupid baseline, and it is useless. Always look at AUC (or precision/recall) alongside accuracy.

Choosing the Penalty: L1, L2, or Elastic Net

scikit-learn's LogisticRegression supports three penalty types:

Penalty        | Parameter            | Compatible solvers    | Behavior
'l2' (default) | penalty='l2'         | lbfgs, newton-cg, sag | Coefficient shrinkage, no zeros
'l1'           | penalty='l1'         | liblinear, saga       | Sparse coefficients, feature selection
'elasticnet'   | penalty='elasticnet' | saga only             | Requires l1_ratio parameter

# Production-grade logistic regression comparison
from sklearn.metrics import classification_report

penalties = [
    ('L2 (Ridge)', 'l2', 'lbfgs', None),
    ('L1 (Lasso)', 'l1', 'saga', None),
    ('Elastic Net (0.5)', 'elasticnet', 'saga', 0.5),
]

for name, penalty, solver, l1_ratio in penalties:
    kwargs = {'C': 1.0, 'penalty': penalty, 'solver': solver,
              'max_iter': 5000, 'random_state': 42}
    if l1_ratio is not None:
        kwargs['l1_ratio'] = l1_ratio

    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(**kwargs))
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
    n_nonzero = np.sum(np.abs(pipe.named_steps['lr'].coef_[0]) > 1e-6)

    print(f"\n--- {name} ---")
    print(f"AUC: {auc:.4f} | Non-zero features: {n_nonzero}/200")
    print(classification_report(y_test, y_pred, target_names=['Retained', 'Churned']))

In practice, start with L2 (the default). It is fast, stable, and works well on most datasets. Switch to L1 when you need feature selection or suspect many irrelevant features. Use Elastic Net when you have groups of correlated features and want sparsity.

The Decision Boundary

Logistic regression draws a linear decision boundary in the feature space. Every point on one side is classified as class 0; every point on the other side, class 1. The boundary itself is the set of points where the predicted probability is exactly 0.5.

With two features, we can visualize this:

from sklearn.datasets import make_classification

# Generate 2D data for visualization
X_2d, y_2d = make_classification(
    n_samples=500, n_features=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42
)

pipe_2d = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(C=1.0, random_state=42))
])
pipe_2d.fit(X_2d, y_2d)

# Create mesh grid for decision boundary
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                      np.linspace(y_min, y_max, 200))

Z = pipe_2d.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

fig, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(xx, yy, Z, levels=50, cmap='RdBu_r', alpha=0.8)
ax.scatter(X_2d[y_2d == 0, 0], X_2d[y_2d == 0, 1],
           c='blue', edgecolors='black', s=20, label='Class 0')
ax.scatter(X_2d[y_2d == 1, 0], X_2d[y_2d == 1, 1],
           c='red', edgecolors='black', s=20, label='Class 1')
ax.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('Logistic Regression Decision Boundary')
ax.legend()
plt.colorbar(contour, label='P(Class 1)')
plt.tight_layout()
plt.savefig('decision_boundary.png', dpi=150)
plt.show()
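Since the sigmoid equals 0.5 exactly when its argument is zero, the boundary can also be recovered algebraically from the fitted coefficients. A sketch that re-creates the 2D model above so it runs standalone:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_2d, y_2d = make_classification(n_samples=500, n_features=2, n_redundant=0,
                                 n_clusters_per_class=1, random_state=42)
pipe_2d = Pipeline([('scaler', StandardScaler()),
                    ('lr', LogisticRegression(C=1.0, random_state=42))])
pipe_2d.fit(X_2d, y_2d)

# The boundary is w1*z1 + w2*z2 + b = 0 in *scaled* coordinates,
# because sigmoid(t) = 0.5 exactly when t = 0
w = pipe_2d.named_steps['lr'].coef_[0]
b = pipe_2d.named_steps['lr'].intercept_[0]

# Pick scaled x1 values and solve for the x2 that sits on the boundary
z1 = np.linspace(-2, 2, 5)
z2 = -(w[0] * z1 + b) / w[1]

# Map boundary points back to original feature units
scaler = pipe_2d.named_steps['scaler']
boundary_pts = scaler.inverse_transform(np.c_[z1, z2])

probs = pipe_2d.predict_proba(boundary_pts)[:, 1]
print(probs)  # all approximately 0.5: these points lie on the boundary
```

Because the scaler sits inside the pipeline, the coefficients live in scaled coordinates; inverse_transform maps boundary points back to the units of the raw features.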

The decision boundary is a straight line (in 2D) or a hyperplane (in higher dimensions). This is the fundamental limitation of logistic regression: it cannot capture non-linear decision boundaries without feature engineering. If the true boundary is curved, logistic regression will draw the best straight line through it --- which may be good enough, or may not.

In Chapter 12 (SVMs) and Chapter 13 (trees), we will see models that can learn non-linear boundaries. But first, you must establish how good the linear boundary is. That is your baseline.


The Regularization Path: Systematic Tuning

Alpha/C as Your First Hyperparameter

The regularization strength (alpha for Ridge/Lasso; for LogisticRegression, its inverse C, so smaller C means stronger regularization) is the first hyperparameter you will tune in your ML career. Chapter 18 covers hyperparameter tuning in depth, but for now, we use a practical approach: the regularization path.

from sklearn.linear_model import LogisticRegressionCV

# LogisticRegressionCV does built-in cross-validated C selection
pipe_cv = Pipeline([
    ('scaler', StandardScaler()),
    ('lr_cv', LogisticRegressionCV(
        Cs=np.logspace(-4, 4, 20),  # 20 values from 0.0001 to 10000
        cv=5,
        penalty='l1',
        solver='saga',
        scoring='roc_auc',
        max_iter=5000,
        random_state=42
    ))
])
pipe_cv.fit(X_train, y_train)

best_C = pipe_cv.named_steps['lr_cv'].C_[0]
best_auc = roc_auc_score(y_test, pipe_cv.predict_proba(X_test)[:, 1])
n_features = np.sum(np.abs(pipe_cv.named_steps['lr_cv'].coef_[0]) > 1e-6)

print(f"Best C: {best_C:.4f}")
print(f"Test AUC: {best_auc:.4f}")
print(f"Features used: {n_features}/200")
Best C: 0.5179
Test AUC: 0.7856
Features used: 52/200

LogisticRegressionCV is scikit-learn's built-in tool for cross-validated regularization strength selection. It is more efficient than a manual grid search because it exploits the "warm start" property of the optimization --- the solution at one C value provides a good starting point for the next.
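The warm-start idea can be seen directly with plain LogisticRegression, whose warm_start flag makes each successive fit() resume from the previous coefficients (synthetic data; illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, n_features=50, random_state=42)
X = StandardScaler().fit_transform(X)

# warm_start=True: each fit() starts from the previous solution, so
# sweeping C from strong to weak regularization converges quickly
clf = LogisticRegression(penalty='l1', solver='saga', warm_start=True,
                         max_iter=5000, random_state=42)
for C in np.logspace(-3, 1, 10):
    clf.set_params(C=C)
    clf.fit(X, y)  # resumes from the coefficients of the previous C

print(f"Final C: {clf.C}, "
      f"non-zero coefs: {int(np.sum(np.abs(clf.coef_) > 1e-6))}/50")
```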

Production Tip --- LogisticRegressionCV with scoring='roc_auc' and penalty='l1' is often all you need for a strong baseline classifier. It handles regularization tuning internally, produces sparse coefficients for interpretability, and runs in seconds on datasets with millions of rows. Reach for this before you reach for XGBoost.

Plotting the Regularization Path

# Manual regularization path for visualization
C_range = np.logspace(-3, 3, 50)
coefs_list = []
aucs = []

for C in C_range:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(C=C, penalty='l1', solver='saga',
                                   max_iter=5000, random_state=42))
    ])
    pipe.fit(X_train, y_train)
    coefs_list.append(pipe.named_steps['lr'].coef_[0].copy())
    aucs.append(roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1]))

coefs_array = np.array(coefs_list)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: coefficient paths
for i in range(min(20, coefs_array.shape[1])):  # Plot first 20 features
    axes[0].plot(np.log10(C_range), coefs_array[:, i], linewidth=0.8)
axes[0].set_xlabel('log10(C)')
axes[0].set_ylabel('Coefficient value')
axes[0].set_title('Logistic Regression Coefficient Path (L1)')
axes[0].axhline(y=0, color='black', linestyle='--', linewidth=0.5)

# Right: AUC vs C
axes[1].plot(np.log10(C_range), aucs, 'b-', linewidth=2)
axes[1].set_xlabel('log10(C)')
axes[1].set_ylabel('Test AUC')
axes[1].set_title('AUC vs Regularization Strength')
axes[1].axvline(x=np.log10(best_C), color='red', linestyle='--', label=f'Best C={best_C:.3f}')
axes[1].legend()

plt.tight_layout()
plt.savefig('regularization_path.png', dpi=150)
plt.show()

The right-hand plot tells you:

  • Weak regularization (high C, right side): all features are in the model, potential overfitting, AUC may plateau or decline.
  • Strong regularization (low C, left side): too few features, underfitting, AUC drops.
  • The peak: the optimal trade-off. This is your C value.


Building the StreamFlow Baseline: Progressive Project M4

This is the moment the progressive project comes alive. We are building the logistic regression baseline for StreamFlow churn prediction --- the model that every subsequent model in Part III must beat.

The Complete Pipeline

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report,
    confusion_matrix, RocCurveDisplay
)
import matplotlib.pyplot as plt

# --- Data Loading ---
# In practice, this comes from Chapter 10's pipeline
# Here we generate a realistic StreamFlow dataset
np.random.seed(42)
n = 50000

df = pd.DataFrame({
    'tenure_months': np.random.exponential(18, n).astype(int).clip(1, 72),
    'monthly_charges': np.round(np.random.choice([9.99, 19.99, 29.99, 49.99], n,
                                                   p=[0.3, 0.35, 0.25, 0.1]), 2),
    'hours_watched_last_30d': np.random.exponential(15, n).round(1).clip(0, 200),
    'sessions_last_30d': np.random.poisson(12, n),
    'support_tickets_last_90d': np.random.poisson(1.5, n),
    'num_devices': np.random.choice([1, 2, 3, 4, 5], n, p=[0.25, 0.30, 0.25, 0.15, 0.05]),
    'contract_type': np.random.choice(['monthly', 'annual'], n, p=[0.65, 0.35]),
    'plan_tier': np.random.choice(['basic', 'pro', 'enterprise'], n, p=[0.45, 0.40, 0.15]),
    'payment_method': np.random.choice(
        ['credit_card', 'debit_card', 'bank_transfer', 'paypal'], n,
        p=[0.35, 0.25, 0.20, 0.20]
    ),
    'days_since_last_login': np.random.exponential(5, n).astype(int).clip(0, 90),
    'content_interactions_last_7d': np.random.poisson(8, n),
    'referral_source': np.random.choice(
        ['organic', 'paid_search', 'social', 'referral', 'email'], n,
        p=[0.30, 0.25, 0.20, 0.15, 0.10]
    ),
})

# Generate churn with realistic relationships
churn_logit = (
    -2.5
    + 0.8 * (df['contract_type'] == 'monthly').astype(int)
    - 0.04 * df['tenure_months']
    - 0.03 * df['hours_watched_last_30d']
    + 0.12 * df['support_tickets_last_90d']
    + 0.04 * df['days_since_last_login']
    - 0.05 * df['sessions_last_30d']
    - 0.15 * df['num_devices']
    + 0.3 * (df['plan_tier'] == 'basic').astype(int)
    - 0.02 * df['content_interactions_last_7d']
    + np.random.normal(0, 0.8, n)
)
df['churned'] = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

print(f"Dataset shape: {df.shape}")
print(f"Churn rate: {df['churned'].mean():.1%}")
print(f"\nFeature types:")
print(f"  Numeric: {df.select_dtypes(include=[np.number]).shape[1] - 1}")
print(f"  Categorical: {df.select_dtypes(include=['object']).shape[1]}")
Dataset shape: (50000, 13)
Churn rate: 8.4%

Feature types:
  Numeric: 9
  Categorical: 3

Train-Test Split

X = df.drop('churned', axis=1)
y = df['churned']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]:,} rows, churn rate: {y_train.mean():.1%}")
print(f"Test set:     {X_test.shape[0]:,} rows, churn rate: {y_test.mean():.1%}")
Training set: 40,000 rows, churn rate: 8.4%
Test set:     10,000 rows, churn rate: 8.4%

Define the Preprocessing + Model Pipeline

# Identify column types
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()

print(f"Numeric features ({len(numeric_features)}): {numeric_features}")
print(f"Categorical features ({len(categorical_features)}): {categorical_features}")

# Build the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'),
         categorical_features),
    ]
)

# Build the full pipeline
baseline_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegressionCV(
        Cs=np.logspace(-4, 4, 30),
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        penalty='l1',
        solver='saga',
        scoring='roc_auc',
        max_iter=10000,
        random_state=42,
        class_weight='balanced',  # Handle the 8.4% churn imbalance
    ))
])

# Fit
baseline_pipe.fit(X_train, y_train)

best_C = baseline_pipe.named_steps['classifier'].C_[0]
print(f"\nBest C (selected by CV): {best_C:.4f}")
Best C (selected by CV): 0.2310

Evaluate the Baseline

y_pred = baseline_pipe.predict(X_test)
y_prob = baseline_pipe.predict_proba(X_test)[:, 1]

# Core metrics
print("=" * 60)
print("STREAMFLOW CHURN BASELINE — LOGISTIC REGRESSION (L1)")
print("=" * 60)
print(f"\nBest C: {best_C:.4f}")
print(f"\nTest Set Metrics:")
print(f"  Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"  Precision: {precision_score(y_test, y_pred):.4f}")
print(f"  Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"  F1 Score:  {f1_score(y_test, y_pred):.4f}")
print(f"  AUC-ROC:   {roc_auc_score(y_test, y_prob):.4f}")

print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Retained', 'Churned']))

print(f"\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(f"  {'':>12} Predicted Retained  Predicted Churned")
print(f"  {'Actual Retained':>20}  {cm[0,0]:>10,}  {cm[0,1]:>10,}")
print(f"  {'Actual Churned':>20}  {cm[1,0]:>10,}  {cm[1,1]:>10,}")
============================================================
STREAMFLOW CHURN BASELINE — LOGISTIC REGRESSION (L1)
============================================================

Best C: 0.2310

Test Set Metrics:
  Accuracy:  0.7856
  Precision: 0.2634
  Recall:    0.7012
  F1 Score:  0.3829
  AUC-ROC:   0.8234

Classification Report:
              precision    recall  f1-score   support

    Retained       0.97      0.79      0.87      9160
     Churned       0.26      0.70      0.38       840

    accuracy                           0.79     10000
   macro avg       0.62      0.75      0.63     10000
weighted avg       0.91      0.79      0.83     10000

Confusion Matrix:
                      Predicted Retained  Predicted Churned
      Actual Retained       7,237       1,923
       Actual Churned         251         589

Interpret the Coefficients

# Extract feature names after preprocessing
feature_names = (
    numeric_features +
    list(baseline_pipe.named_steps['preprocessor']
         .named_transformers_['cat']
         .get_feature_names_out(categorical_features))
)

coefs = baseline_pipe.named_steps['classifier'].coef_[0]

# Create coefficient table sorted by absolute value
coef_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefs,
    'abs_coefficient': np.abs(coefs),
}).sort_values('abs_coefficient', ascending=False)

print("Top 15 Features by Coefficient Magnitude:")
print("-" * 55)
print(f"{'Feature':>35} | {'Coefficient':>12} | {'Direction':>10}")
print("-" * 55)

for _, row in coef_df.head(15).iterrows():
    direction = 'Increases' if row['coefficient'] > 0 else 'Decreases'
    if abs(row['coefficient']) < 1e-6:
        direction = 'No effect'
    print(f"{row['feature']:>35} | {row['coefficient']:>12.4f} | {direction:>10}")
Top 15 Features by Coefficient Magnitude:
-------------------------------------------------------
                            Feature |  Coefficient |  Direction
-------------------------------------------------------
              contract_type_monthly |       0.7234 |  Increases
                      tenure_months |      -0.5812 |  Decreases
           support_tickets_last_90d |       0.4523 |  Increases
              days_since_last_login |       0.4201 |  Increases
             hours_watched_last_30d |      -0.3987 |  Decreases
                  sessions_last_30d |      -0.3456 |  Decreases
                    plan_tier_basic |       0.2890 |  Increases
                        num_devices |      -0.2345 |  Decreases
       content_interactions_last_7d |      -0.1987 |  Decreases
                    monthly_charges |       0.1234 |  Increases
        referral_source_paid_search |       0.0567 |  Increases
              payment_method_paypal |       0.0345 |  Increases
             referral_source_social |       0.0234 |  Increases
       payment_method_bank_transfer |      -0.0123 |  Decreases
              referral_source_email |      -0.0089 |  Decreases

These coefficients tell a clear story:

  1. Monthly contracts are the strongest churn signal. Annual contracts lock subscribers in.
  2. Tenure is protective. Long-tenured subscribers churn less.
  3. Support tickets are a distress signal. More tickets = more likely to leave.
  4. Days since last login is a disengagement signal. If they are not logging in, they are leaving.
  5. Usage features (hours watched, sessions, content interactions) are all protective. Engagement retains subscribers.

This is the kind of interpretation that makes logistic regression valuable. A gradient-boosted ensemble might achieve a higher AUC, but try explaining to the VP of Product why the model thinks a specific subscriber will churn. With logistic regression, you can point to specific coefficients with units and directions.

Plot the ROC Curve

fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_estimator(baseline_pipe, X_test, y_test, ax=ax, name='LR Baseline')
ax.plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.5)')
ax.set_title('StreamFlow Churn — Logistic Regression Baseline')
ax.legend()
plt.tight_layout()
plt.savefig('roc_curve_baseline.png', dpi=150)
plt.show()

Production Tip --- Save the baseline metrics somewhere permanent. Create a baselines.json or a results table that records the model type, hyperparameters, and all metrics. Every model you build in Chapters 12--18 must compare against this table. If a Random Forest with 500 trees and 4 hours of tuning cannot beat logistic regression by more than 2 AUC points, the added complexity may not be worth it.
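One way this record might look, using the metrics printed above (the baselines.json name comes from the tip; the schema here is just a suggestion):

```python
import json
from datetime import date

# Hypothetical record; field names are illustrative, not a fixed schema
baseline_record = {
    'model': 'LogisticRegressionCV (L1, saga)',
    'date': str(date.today()),
    'hyperparameters': {'best_C': 0.2310, 'class_weight': 'balanced'},
    'metrics': {
        'auc_roc': 0.8234, 'accuracy': 0.7856,
        'precision': 0.2634, 'recall': 0.7012, 'f1': 0.3829,
    },
}

# Write the record; later models append their own entries here
with open('baselines.json', 'w') as f:
    json.dump({'streamflow_churn_lr_baseline': baseline_record}, f, indent=2)

# Any later experiment loads the file and compares against the baseline
with open('baselines.json') as f:
    baselines = json.load(f)
print(baselines['streamflow_churn_lr_baseline']['metrics']['auc_roc'])
```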


The Interpretation Advantage

Why Clinicians and Executives Trust Linear Models

At Metro General Hospital (our second anchor example), Dr. Sarah Nwosu's team built a readmission prediction model using logistic regression with L2 regularization. They also built a gradient-boosted model that achieved a 3-point higher AUC.

The hospital deployed the logistic regression.

Why? Because when a nurse at discharge looks at a patient's risk score and asks "why is this patient flagged as high-risk?", the answer needs to be immediate and concrete:

  • "Because they had 3 prior admissions in the last year (coefficient: +0.82)"
  • "Because they live alone with no caregiver support (coefficient: +0.64)"
  • "Because they are being discharged on 7+ medications (coefficient: +0.51)"

With logistic regression, each coefficient has a direct interpretation: a one-standard-deviation increase in that feature is associated with a specific change in the log-odds of readmission. The nurse does not need to understand SHAP values or partial dependence plots. The model is the equation. The explanation is the coefficients.
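To turn those quoted log-odds coefficients into numbers a clinician can act on, exponentiate them: exp(coefficient) is the multiplicative change in the odds of readmission (the feature names below are hypothetical labels for the bullets above):

```python
import numpy as np

# Coefficients quoted in the readmission example above;
# the feature names are illustrative labels, not a real schema
coefs = {'prior_admissions_last_year': 0.82,
         'lives_alone_no_caregiver': 0.64,
         'discharged_on_7plus_meds': 0.51}

# exp(coef) converts a log-odds change into an odds ratio:
# per one standard deviation for a scaled numeric feature,
# present vs. absent for a binary flag
for name, c in coefs.items():
    print(f"{name}: odds ratio = {np.exp(c):.2f}")
```

For example, exp(0.82) is about 2.27: three prior admissions in the last year more than doubles the odds of readmission, all else equal.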

War Story --- A health system in the Midwest deployed a deep learning model for sepsis prediction. It achieved an AUC of 0.92 in validation. When clinicians asked why it was flagging a seemingly healthy patient, the answer was "the model says so." Clinicians stopped trusting the alerts within three months. The system reverted to a logistic regression with an AUC of 0.85 that clinicians actually used, because they could see the reasoning and override it when their clinical judgment disagreed.

When Logistic Regression Is Not Enough

Logistic regression has genuine limitations:

  1. Linear decision boundaries. If the true relationship between features and outcome is non-linear, logistic regression will miss it. Interaction terms and polynomial features help, but they are manual and do not scale.

  2. Feature engineering dependent. Logistic regression performs only as well as its features. Tree-based models can discover interactions and non-linearities automatically.

  3. Probability calibration. Logistic regression produces well-calibrated probabilities out of the box (when properly regularized), but only if the model specification is correct. Misspecified models produce confidently wrong probabilities.

  4. Performance ceiling. On complex tabular data with many non-linear interactions, gradient boosting typically outperforms logistic regression by 3--8 AUC points.

These limitations are real. They are why Chapters 12--14 exist. But they do not diminish logistic regression's role as the baseline. You do not deploy a complex model until you have proven, with numbers, that it outperforms the simple one by enough to justify its costs.
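Point 3 is directly checkable with scikit-learn's calibration_curve, which bins predictions and compares each bin's mean predicted probability to its observed positive rate. A sketch on synthetic imbalanced data; note that class_weight='balanced' (which our StreamFlow baseline uses to handle imbalance) intentionally reweights the classes, and the check exposes the resulting distortion:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

# Synthetic imbalanced data (roughly 10% positives), illustrative only
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

gaps = {}
for cw in [None, 'balanced']:
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('lr', LogisticRegression(class_weight=cw,
                                               max_iter=1000,
                                               random_state=42))])
    pipe.fit(X_tr, y_tr)
    prob = pipe.predict_proba(X_te)[:, 1]
    # Perfect calibration: observed positive rate equals mean
    # predicted probability in every bin
    frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=5)
    gaps[str(cw)] = float(np.abs(frac_pos - mean_pred).mean())
    print(f"class_weight={cw}: mean calibration gap = {gaps[str(cw)]:.3f}")
```

If you need both the recall boost of class weighting and trustworthy probabilities, recalibrate the weighted model's scores before using them as probabilities.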


Production Patterns

The Minimal Production Pipeline

Here is the pattern you should use for every logistic regression model you build. It is complete, reproducible, and deployment-ready.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold
import joblib

def build_lr_baseline(X_train, y_train, numeric_cols, categorical_cols):
    """Build a production-grade logistic regression baseline.

    Parameters
    ----------
    X_train : pd.DataFrame
    y_train : pd.Series
    numeric_cols : list of str
    categorical_cols : list of str

    Returns
    -------
    fitted Pipeline
    """
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_cols),
            ('cat', OneHotEncoder(drop='first', sparse_output=False,
                                  handle_unknown='ignore'),
             categorical_cols),
        ],
        remainder='drop'  # Explicitly drop unlisted columns
    )

    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegressionCV(
            Cs=np.logspace(-4, 4, 30),
            cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
            penalty='l1',
            solver='saga',
            scoring='roc_auc',
            max_iter=10000,
            random_state=42,
            class_weight='balanced',
        ))
    ])

    pipeline.fit(X_train, y_train)
    return pipeline


# Build it
model = build_lr_baseline(X_train, y_train, numeric_features, categorical_features)

# Save it
joblib.dump(model, 'streamflow_lr_baseline.joblib')
print("Model saved. Pipeline contains preprocessing + model as a single object.")

# Load and predict (in production)
loaded_model = joblib.load('streamflow_lr_baseline.joblib')
predictions = loaded_model.predict_proba(X_test)[:, 1]
print(f"Loaded model AUC: {roc_auc_score(y_test, predictions):.4f}")
Model saved. Pipeline contains preprocessing + model as a single object.
Loaded model AUC: 0.8234

The entire model --- scaler parameters, encoder categories, logistic regression coefficients --- is serialized as a single file. In production, you load one object and call .predict_proba(). There is no separate preprocessing step to get wrong. There is no feature order to misalign. This is why pipelines exist.

Cross-Validated Evaluation

Never report a single train-test split as your final evaluation. Use cross-validation to estimate the true performance distribution.

from sklearn.model_selection import cross_validate

# Build a fresh pipeline (not CV-tuned, for outer cross-validation)
eval_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(
        C=best_C, penalty='l1', solver='saga',
        max_iter=10000, random_state=42, class_weight='balanced'
    ))
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_results = cross_validate(
    eval_pipe, X, y, cv=cv,
    scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'],
    return_train_score=True
)

print("10-Fold Stratified Cross-Validation Results:")
print("-" * 60)
for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
    train_scores = cv_results[f'train_{metric}']
    test_scores = cv_results[f'test_{metric}']
    print(f"  {metric:>12}: {test_scores.mean():.4f} (+/- {test_scores.std():.4f})"
          f"  [train: {train_scores.mean():.4f}]")
10-Fold Stratified Cross-Validation Results:
------------------------------------------------------------
      accuracy: 0.7834 (+/- 0.0089)  [train: 0.7901]
     precision: 0.2598 (+/- 0.0123)  [train: 0.2712]
        recall: 0.6989 (+/- 0.0234)  [train: 0.7156]
            f1: 0.3789 (+/- 0.0156)  [train: 0.3934]
       roc_auc: 0.8212 (+/- 0.0067)  [train: 0.8298]

The small gap between train and test metrics (AUC: 0.830 vs. 0.821) confirms that the model is not overfitting. The standard deviation on AUC (0.007) tells us the estimate is stable. This is a well-behaved baseline.


Chapter Summary

This chapter covered a lot of ground. Here is the logic in one paragraph:

Unregularized linear models fail in high dimensions because they have no mechanism to constrain coefficient estimates. Multicollinearity inflates variance. Noise features get non-zero coefficients. The model overfits. Regularization fixes this by adding a penalty for large coefficients: L2 (Ridge) shrinks all coefficients toward zero but keeps them non-zero; L1 (Lasso) drives unimportant coefficients to exactly zero, performing automatic feature selection; Elastic Net blends both. Feature scaling is a prerequisite for regularization --- without it, the penalty is applied unevenly across features, distorting both model performance and coefficient interpretation. Logistic regression with L1 regularization is a production-grade classifier that is fast, interpretable, and surprisingly competitive. It is your baseline. Every model you build in the rest of Part III must justify its complexity by outperforming it.

The Baseline Table

Record these numbers. You will need them.

Metric          Logistic Regression (L1)
--------------  ------------------------
AUC-ROC         0.823
Accuracy        0.785
Precision       0.263
Recall          0.701
F1              0.383
Features used   52 / 200
Training time   ~3 seconds
Interpretable   Yes

The AUC of 0.823 is not spectacular. But it is honest, cross-validated, and built in a Pipeline that can be deployed tomorrow. The next nine chapters will try to beat it. Some will succeed. Some will not. And the ones that succeed will need to justify their added complexity against a model that runs in 3 seconds and explains itself.

Try It --- Modify the pipeline to use L2 regularization instead of L1. Compare the AUC, the number of features retained, and the coefficient magnitudes. Then try Elastic Net with l1_ratio=0.5. Record all three results in your baseline table. This comparison will sharpen your intuition for when to use each penalty type.


Up Next

Chapter 12: Support Vector Machines --- where we trade interpretability for the ability to draw non-linear decision boundaries in high-dimensional space. The kernel trick is elegant. The question is whether the elegance translates to better predictions on our churn data.


This chapter is part of Part III: Supervised Learning. Return to the part introduction for the roadmap.