Case Study 1: StreamRec Training Optimization — From Unstable to Converged

Context

StreamRec's recommendation team has built a click-prediction MLP following the architecture from Chapter 6: a 4-layer network with hidden dimensions [512, 256, 128, 64], ReLU activations, and a sigmoid output for P(click). The model takes 128 handcrafted user and item features as input and predicts the probability that a user will click on a recommended content item.

The model runs — it computes outputs and gradients without crashing — but it does not train well. The team has been struggling for two weeks with unstable training, and the model's AUC on held-out data (0.71) barely exceeds the logistic regression baseline (0.69). Management is questioning whether the MLP approach is worth the added complexity.

This case study traces the debugging process from initial failure to production-ready training, applying the techniques from this chapter in sequence.

The Initial Training Run

The team's first training configuration:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from typing import Dict, List, Tuple

# Initial (problematic) configuration
model = nn.Sequential(
    nn.Linear(128, 512),
    nn.ReLU(),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 1),  # outputs a logit; the sigmoid is folded into the loss
)

optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally, numerically stable

The training curves told the story:

# Simulated training metrics from the initial run
initial_metrics = {
    "epoch": list(range(1, 51)),
    "train_loss": [
        0.693, 0.691, 0.688, 0.680, 0.665, 0.642, 0.608, 0.571, 0.530,
        0.485, 0.440, 0.395, 0.352, 0.310, 0.272, 0.238, 0.208, 0.182,
        0.160, 0.141, 0.125, 0.111, 0.099, 0.089, 0.080, 0.073, 0.067,
        0.061, 0.057, 0.053, 0.049, 0.046, 0.043, 0.041, 0.039, 0.037,
        0.035, 0.033, 0.032, 0.031, 0.030, 0.029, 0.028, 0.027, 0.027,
        0.026, 0.026, 0.025, 0.025, 0.025,
    ],
    "val_loss": [
        0.694, 0.692, 0.690, 0.685, 0.675, 0.660, 0.640, 0.618, 0.598,
        0.582, 0.570, 0.565, 0.562, 0.563, 0.567, 0.573, 0.580, 0.589,
        0.599, 0.610, 0.621, 0.632, 0.642, 0.651, 0.659, 0.666, 0.672,
        0.678, 0.683, 0.687, 0.691, 0.695, 0.698, 0.701, 0.704, 0.706,
        0.709, 0.711, 0.713, 0.715, 0.717, 0.718, 0.720, 0.721, 0.722,
        0.723, 0.724, 0.725, 0.726, 0.726,
    ],
}
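Rather than eyeballing the curve, the turning point can be extracted programmatically. A small helper (the name `find_best_epoch` is ours), applied to the dictionary above:

```python
import numpy as np

def find_best_epoch(epochs, val_losses):
    """Return the (epoch, loss) pair at the validation-loss minimum."""
    idx = int(np.argmin(val_losses))
    return epochs[idx], val_losses[idx]

# find_best_epoch(initial_metrics["epoch"], initial_metrics["val_loss"])
# returns (13, 0.562): the minimum sits at epoch 13, after which the curve only climbs.
```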

Diagnosis: Training loss decreases steadily toward zero, but validation loss bottoms out at epoch 13 (0.562) and rises for the rest of the run. This is textbook overfitting: the model memorizes the 2M training examples without learning generalizable patterns. By epoch 50 the train-val gap is enormous, 0.025 vs. 0.726.

Two secondary problems are also visible:

  1. Slow initial progress. The loss barely moves over the first 3 epochs (0.693 to 0.688), suggesting sluggish learning dynamics at initialization.
  2. No early stopping. The team trained for 50 epochs without monitoring validation loss, sailing past the minimum at epoch 13.
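The second problem is mechanical to fix. A minimal early-stopping helper, sketched here (the class name and structure are ours, not the team's actual code):

```python
import copy
import torch
import torch.nn as nn

class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 4, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_state = None
        self.counter = 0

    def step(self, val_loss: float, model: nn.Module) -> bool:
        """Record this epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            # Snapshot the best weights so they can be restored after stopping
            self.best_state = copy.deepcopy(model.state_dict())
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience
```

Calling step(val_loss, model) once per epoch and restoring best_state at the end would have captured the weights at the validation minimum instead of the badly overfit epoch-50 ones.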

Step 1: Diagnosing the Root Causes

The team applied the debugging checklist from Section 7.8.5.

Initialization check:

def diagnose_initialization(model: nn.Module) -> None:
    """Check weight variance at each layer against He initialization targets."""
    for name, param in model.named_parameters():
        if "weight" in name and param.dim() == 2:
            n_in = param.shape[1]
            actual_var = param.var().item()
            he_target = 2.0 / n_in
            ratio = actual_var / he_target
            print(
                f"{name:30s} | n_in={n_in:4d} | "
                f"Var={actual_var:.4e} | He target={he_target:.4e} | "
                f"Ratio={ratio:.2f}"
            )

diagnose_initialization(model)
0.weight                       | n_in= 128 | Var=2.6094e-03 | He target=1.5625e-02 | Ratio=0.17
2.weight                       | n_in= 512 | Var=6.4986e-04 | He target=3.9062e-03 | Ratio=0.17
4.weight                       | n_in= 256 | Var=1.3043e-03 | He target=7.8125e-03 | Ratio=0.17
6.weight                       | n_in= 128 | Var=2.5981e-03 | He target=1.5625e-02 | Ratio=0.17
8.weight                       | n_in=  64 | Var=5.2216e-03 | He target=3.1250e-02 | Ratio=0.17

The variance is consistently about one-sixth of the He target, not equal to it. PyTorch's default initialization is kaiming_uniform_ with a=math.sqrt(5), a legacy choice kept for backward compatibility with the original Lua Torch, which yields weight variance 1/(3 * fan_in) instead of the 2/fan_in that He initialization prescribes for ReLU. The result is that activations shrink as they pass through layers: not catastrophically, but enough to slow early training.
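The measured ratio can be derived rather than just observed: for kaiming_uniform_ with a = sqrt(5), the gain is sqrt(2 / (1 + a^2)) = sqrt(1/3), the uniform bound is 1/sqrt(fan_in), and the resulting variance is 1/(3 * fan_in), one-sixth of the He target 2/fan_in. A throwaway layer confirms it empirically:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(512, 256)          # PyTorch's default initialization
fan_in = layer.weight.shape[1]
actual = layer.weight.var().item()   # empirical variance over 256 * 512 weights
he_target = 2.0 / fan_in             # He (kaiming_normal_) variance for ReLU
print(f"ratio = {actual / he_target:.3f}")   # comes out near 1/6 (~0.167)
```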

Gradient flow check:

def check_gradient_flow(model: nn.Module, sample_batch: torch.Tensor) -> None:
    """Log gradient norms per layer after a backward pass."""
    model.zero_grad()
    output = model(sample_batch)
    loss = output.sum()
    loss.backward()

    for name, param in model.named_parameters():
        if param.grad is not None:
            norm = param.grad.norm().item()
            print(f"{name:30s} | grad norm: {norm:.4e}")

The gradient norms revealed a 100x spread between the first and last layer — a mild vanishing gradient problem exacerbated by the initialization issue.
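The per-layer printout can be condensed into a single number that is easy to track across fixes. A sketch (grad_norm_spread is our name; it re-runs the same dummy-loss backward pass as the diagnostic above):

```python
import torch
import torch.nn as nn

def grad_norm_spread(model: nn.Module, batch: torch.Tensor) -> float:
    """Return the ratio of the largest to the smallest per-layer weight-gradient norm."""
    model.zero_grad()
    model(batch).sum().backward()    # same dummy loss as the diagnostic pass
    norms = [
        p.grad.norm().item()
        for name, p in model.named_parameters()
        if "weight" in name and p.grad is not None
    ]
    return max(norms) / min(norms)
```

A healthy network stays within roughly one order of magnitude; the team's initial model scored near 100.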

Step 2: Applying the Training Toolkit

The team applied fixes incrementally, measuring the impact of each.

Fix 1: He Initialization

def apply_he_init(model: nn.Module) -> None:
    """Apply proper He initialization to all linear layers."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="relu")
            if module.bias is not None:
                nn.init.zeros_(module.bias)

apply_he_init(model)

Impact: The first 3 epochs now show a meaningful loss decrease (0.693 to 0.672, versus 0.693 to 0.688 before). Gradient norms across layers now span a 5x range instead of 100x.

Fix 2: Batch Normalization

model = nn.Sequential(
    nn.Linear(128, 512),
    nn.BatchNorm1d(512),
    nn.ReLU(),
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 1),
)
apply_he_init(model)

Impact: Convergence is roughly 3x faster, reaching the original run's best (epoch-13) performance by epoch 4. The gradient norms are now nearly uniform across layers: batch normalization smooths the loss landscape, as predicted by Santurkar et al. (2018).
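One operational caveat comes with batch normalization: the layer uses the current batch's statistics in training mode and accumulated running statistics in evaluation mode, so validation and serving code must switch modes explicitly. A minimal sketch of the pattern:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Linear(64, 1))
x = torch.randn(32, 128)

net.train()                 # BN normalizes with this batch's mean/variance
_ = net(x)                  # forward passes here also update BN's running estimates

net.eval()                  # BN switches to the accumulated running statistics
with torch.no_grad():       # no autograd bookkeeping needed for evaluation
    p_click = torch.sigmoid(net(x))   # the head emits logits; sigmoid recovers P(click)
```

Forgetting the eval() call is a classic source of train-serve skew, and single-example batches in training mode will error outright, since BatchNorm1d needs more than one value per channel to compute statistics.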

Fix 3: Dropout and Weight Decay

model = nn.Sequential(
    nn.Linear(128, 512),
    nn.BatchNorm1d(512),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(64, 1),
)
apply_he_init(model)

optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

Impact: The train-val gap dropped dramatically. At epoch 30, the training loss is 0.38 and validation loss is 0.41 — a gap of 0.03, down from 0.70 in the original configuration.
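A refinement the team could layer on top: with AdamW it is common to exempt biases and BatchNorm parameters from weight decay, since shrinking a normalization scale toward zero regularizes nothing useful. A sketch of the parameter-group split (the helper name is ours; the run above decayed every parameter):

```python
import torch.nn as nn
import torch.optim as optim

def split_weight_decay(model: nn.Module, weight_decay: float = 1e-2):
    """Build AdamW parameter groups: decay 2D+ weights, skip biases and norm params."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # 1D parameters are biases and BatchNorm scale/shift -- exempt them
        (no_decay if param.dim() < 2 else decay).append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

With this split, optim.AdamW(split_weight_decay(model), lr=1e-3) applies decay to the linear weight matrices only.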

Fix 4: One-Cycle Schedule with Early Stopping

total_steps = len(train_loader) * 30  # 30 epochs
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-3,
    total_steps=total_steps,
    pct_start=0.3,
    div_factor=25,
    final_div_factor=1e4,
)

Impact: The model converged to its best validation loss in 18 epochs instead of 30. Early stopping at epoch 22 (patience 4) captured the optimal model.
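A detail that trips people up with OneCycleLR: the scheduler is stepped once per batch, not once per epoch, which is exactly why total_steps above is computed from len(train_loader). A runnable sketch of the loop skeleton, with synthetic data standing in for the real loader:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for the real 128-feature click dataset
X = torch.randn(1024, 128)
y = (torch.rand(1024, 1) < 0.1).float()     # ~10% positive rate
train_loader = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

epochs = 3
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-3, total_steps=len(train_loader) * epochs, pct_start=0.3
)

for epoch in range(epochs):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        scheduler.step()        # per BATCH, not per epoch
```

Stepping once per epoch instead would leave the learning rate stuck near the warmup floor for the entire run.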

Final Results

final_metrics = {
    "configuration": [
        "Logistic Regression Baseline",
        "Ch. 6 MLP (no tricks)",
        "Ch. 7 MLP (full toolkit)",
    ],
    "val_auc": [0.690, 0.712, 0.789],
    "val_loss": [0.582, 0.562, 0.398],
    "epochs_to_best": ["-", 13, 18],
    "train_val_gap": [0.015, 0.537, 0.028],
    "training_time_minutes": [2, 45, 14],
}

Configuration                 | Val AUC | Val Loss | Epochs to Best | Train-Val Gap | Time (min)
------------------------------|---------|----------|----------------|---------------|-----------
Logistic Regression Baseline  | 0.690   | 0.582    | -              | 0.015         | 2
Ch. 6 MLP (no tricks)         | 0.712   | 0.562    | 13             | 0.537         | 45
Ch. 7 MLP (full toolkit)      | 0.789   | 0.398    | 18             | 0.028         | 14

The optimized MLP achieves a 10-point AUC improvement over the baseline (0.789 vs. 0.690) — a substantial gain for click prediction. Equally importantly, the train-val gap of 0.028 indicates the model generalizes well and will be stable in production.

Lessons Learned

  1. The default training configuration is rarely good enough. PyTorch's defaults (kaiming_uniform with a=sqrt(5), no normalization, no regularization) are chosen for backward compatibility, not for training quality. Proper initialization, normalization, and regularization are not optional enhancements — they are the difference between a model that barely beats logistic regression and one that justifies its complexity.

  2. Apply fixes incrementally and measure. The team could have applied all fixes at once, but the incremental approach revealed that batch normalization provided the largest single improvement, followed by dropout. This knowledge is valuable for future projects where compute budgets are tighter.

  3. The one-cycle policy saved both compute time and hyperparameter tuning. The team originally planned a grid search over 5 learning rates and 3 schedules. The one-cycle policy's single max_lr hyperparameter (found via the learning rate finder in 2 minutes) replaced hours of grid search.

  4. Mixed precision was not applied here but would be the next optimization. The team's A100 GPUs support bf16, which would reduce training time from 14 minutes to approximately 7 minutes without measurable accuracy loss. For a model retrained daily, this halves the compute cost.
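That next step is a small code change. A sketch of the bf16 autocast pattern (this assumes an Ampere-or-newer GPU for real speedups; the snippet falls back to CPU autocast so it runs anywhere):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(256, 128, device=device)
y = (torch.rand(256, 1, device=device) < 0.1).float()

optimizer.zero_grad()
# bf16 keeps fp32's exponent range, so no GradScaler is needed (unlike fp16)
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

Only the forward pass and loss computation sit inside the autocast region; parameters and optimizer state remain in fp32.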