Chapter 13 Key Takeaways

The Big Picture

Regularization is not optional---it is a fundamental requirement for building models that generalize to unseen data. Modern neural networks are massively overparameterized and can memorize arbitrary data, including random labels. Regularization techniques constrain model complexity, inject noise, augment data, and encourage simpler representations, all with the goal of reducing the gap between training and test performance.


The Generalization Problem

  • Overfitting: Low training error, high test error, large generalization gap. The model memorizes training data.
  • Underfitting: High training error, high test error, small gap. The model is too simple.
  • The bias-variance tradeoff frames regularization as reducing variance (overfitting) at the cost of slightly increased bias.
  • Neural networks can memorize random labels (Zhang et al., 2017), demonstrating that generalization is not an intrinsic property of the architecture alone.

Regularization Techniques

| Technique | Mechanism | When to Use | PyTorch |
|---|---|---|---|
| L2 / Weight Decay | Penalizes large weights | Always (default) | weight_decay in optimizer |
| L1 | Encourages sparsity | Feature selection | Manual penalty in loss |
| Dropout | Randomly zeros activations | FC layers, large models | nn.Dropout(p) |
| Data Augmentation | Artificially expands training set | Images, text, audio | torchvision.transforms |
| Early Stopping | Stop before overfitting | Always | Monitor val loss |
| Label Smoothing | Soft targets prevent overconfidence | Classification | label_smoothing param |
| Mixup | Interpolate between examples | Image/text classification | Custom implementation |
| CutMix | Cut-and-paste between images | Image classification | Custom implementation |
| Weight Pruning | Remove small weights | Deployment, compression | torch.nn.utils.prune |
| Stochastic Depth | Randomly drop residual blocks | Deep ResNets | Custom implementation |

L1 and L2 Regularization

  • L2 (weight decay): Adds $\frac{\lambda}{2}\|\mathbf{w}\|_2^2$ to the loss. Shrinks all weights toward zero proportionally. Produces smooth decision boundaries.
  • L1: Adds $\lambda\|\mathbf{w}\|_1$ to the loss. Drives some weights exactly to zero, producing sparse models.
  • With AdamW, weight decay is decoupled from the adaptive gradient update and applied directly to the weights, which is mathematically different from adding an L2 penalty to the loss under Adam.
  • Bias and normalization parameters should not have weight decay applied (see the parameter-group sketch below).
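
A minimal sketch of this setup in PyTorch, using a small illustrative model; splitting parameters into decay / no-decay groups by dimensionality is a common heuristic, and the coefficient values shown are placeholders rather than recommendations:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.LayerNorm(256),
    nn.Linear(256, 10),
)

# Split parameters: weight matrices get decay, biases and norm parameters do not.
decay, no_decay = [], []
for param in model.parameters():
    if param.ndim == 1:   # biases and LayerNorm/BatchNorm scales are 1-D
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=3e-4,
)

# Optional L1 penalty added by hand to the loss (drives some weights to exact zero).
def l1_penalty(model, lam=1e-5):
    return lam * sum(p.abs().sum() for p in model.parameters() if p.ndim > 1)
```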

Dropout

  • During training, randomly sets activations to zero with probability $p$ (typically 0.2--0.5).
  • Inverted dropout scales surviving activations by $1/(1-p)$ so no change is needed at test time.
  • Can be interpreted as training an ensemble of $2^n$ subnetworks, where $n$ is the number of neurons.
  • Should be disabled during evaluation via model.eval() (see the sketch after this list).
  • Interacts with batch normalization: the variance shift from dropout can conflict with BN's statistics. Common practice is to place dropout after BN or use only one.
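
A minimal sketch of dropout's train/eval behavior, assuming a toy MLP; the layer sizes and p=0.3 are illustrative only:

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, p=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Dropout(p),        # active only in training mode
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.net(x)

model = MLP(p=0.3)
x = torch.randn(8, 784)

model.train()                     # dropout zeroes activations, scales survivors by 1/(1-p)
noisy_out = model(x)

model.eval()                      # dropout becomes the identity; outputs are deterministic
with torch.no_grad():
    clean_out = model(x)
```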

Data Augmentation

  • Standard (images): Random crop, horizontal flip, color jitter, rotation.
  • Advanced: RandAugment, AutoAugment, TrivialAugment---learned or randomized augmentation policies.
  • Mixup: $\tilde{x} = \lambda x_i + (1-\lambda) x_j$, $\tilde{y} = \lambda y_i + (1-\lambda) y_j$, where $\lambda \sim \text{Beta}(\alpha, \alpha)$; a minimal implementation is sketched after this list.
  • CutMix: Replaces a rectangular patch of one image with a patch from another; labels mixed proportionally to area.
  • Data augmentation is often the most impactful regularization technique for image tasks, especially with small datasets.
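
A minimal sketch of a standard augmentation pipeline plus a mixup helper, assuming torchvision is available; the ImageNet normalization constants and $\alpha = 0.2$ are common defaults rather than values prescribed here:

```python
import torch
import torchvision.transforms as T

# Standard image augmentation pipeline (training only; keep eval transforms deterministic).
train_tf = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def mixup(x, y, alpha=0.2):
    """Blend a batch with a shuffled copy of itself; lambda ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    # Train with: lam * criterion(pred, y) + (1 - lam) * criterion(pred, y[perm])
    return x_mix, y, y[perm], lam
```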

Early Stopping

  • Monitor validation loss and stop when it stops improving (with patience).
  • Acts as an implicit regularizer, similar to L2: it limits the effective distance parameters can move from initialization.
  • Requires splitting data into train/validation/test sets.
  • Should restore the best model weights, not the weights at the stopping epoch (see the loop sketch below).
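
A minimal early-stopping loop; train_one_epoch and evaluate are hypothetical callables you supply for one training pass and one validation pass:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate, patience=10, max_epochs=200):
    """Stop when validation loss has not improved for `patience` epochs; keep best weights."""
    best_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = evaluate(model)

        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break

    model.load_state_dict(best_state)   # restore the best weights, not the last epoch's
    return model
```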

Label Smoothing

  • Replaces hard one-hot targets with soft targets: true class gets $1 - \epsilon$, others get $\epsilon / (K-1)$.
  • Prevents the model from becoming overconfident, which improves calibration.
  • Typical values: $\epsilon = 0.1$.
  • Built into PyTorch: nn.CrossEntropyLoss(label_smoothing=0.1); a usage sketch follows this list.
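
A brief usage sketch; note that PyTorch implements smoothing by mixing the one-hot target with a uniform distribution over all $K$ classes:

```python
import torch
import torch.nn as nn

# Soft targets are built into the loss; no change to the labels themselves is needed.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(4, 10)             # batch of 4, K = 10 classes
targets = torch.tensor([3, 1, 0, 7])    # ordinary integer class labels
loss = criterion(logits, targets)
```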

Modern Phenomena

Double Descent

  • Test error follows a U-shape (classical), then decreases again in the overparameterized regime.
  • The peak occurs at the "interpolation threshold" where the model has just enough capacity to memorize the training data.
  • Regularization (weight decay, early stopping) smooths the double descent curve.

Lottery Ticket Hypothesis

  • Dense networks contain sparse subnetworks ("winning tickets") that can achieve comparable performance when trained in isolation from the same initialization.
  • Found via iterative magnitude pruning: train, prune the smallest weights, rewind to the initial weights, repeat (a pruning sketch follows this list).
  • Practical implications: large models may be compressible by 90%+ with minimal accuracy loss.
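
A sketch of global magnitude pruning with torch.nn.utils.prune on a toy model; the rewind-to-initialization and retraining steps of the full lottery-ticket procedure are omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Prune the 80% smallest-magnitude weights across both Linear layers at once.
to_prune = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.8)

# Sparsity check: pruned entries are masked to zero via weight_mask buffers.
total = sum(m.weight.nelement() for m, _ in to_prune)
zeros = sum((m.weight == 0).sum().item() for m, _ in to_prune)
print(f"sparsity: {zeros / total:.1%}")

# Make the pruning permanent (folds the mask into the weight tensor).
for m, name in to_prune:
    prune.remove(m, name)
```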

Combining Regularization Techniques

| Combination | Recommendation |
|---|---|
| Dropout + BatchNorm | Use carefully; dropout variance can conflict with BN stats |
| Weight decay + LR | Higher weight decay pairs well with higher LR |
| Data augmentation + model size | Stronger augmentation allows larger models |
| Early stopping + LR schedule | Use together; schedule affects when to stop |
| Mixup + Label smoothing | Complementary; both prevent overconfidence |

A typical production recipe (a minimal setup sketch follows the list):

  1. Weight decay (AdamW with wd=0.01--0.05)
  2. Dropout (0.1--0.3, primarily in FC layers)
  3. Data augmentation (task-appropriate)
  4. Early stopping (patience=5--20)
  5. Label smoothing (0.1)
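
A minimal setup sketch of this recipe; the hyperparameters are illustrative, and for brevity all parameters share one weight-decay group rather than the parameter-group split shown earlier:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.2),   # dropout in the FC block
    nn.Linear(256, 10),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)  # decoupled weight decay
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)                           # label smoothing
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)   # step once per epoch

# Data augmentation lives in the Dataset/transform pipeline, and early stopping wraps
# the training loop (see the sketches in the sections above).
```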


Common Pitfalls

  1. Too much regularization: Can cause underfitting. If training loss is high, reduce regularization.
  2. Dropout during evaluation: Forgetting model.eval() keeps dropout active, making predictions stochastic and degrading accuracy.
  3. Augmenting validation/test data: Never augment evaluation data; it corrupts the performance estimate.
  4. One regularizer fits all: Different datasets and architectures need different regularization recipes.
  5. Ignoring the generalization gap: A small gap may indicate underfitting, not good generalization.

Looking Ahead

  • Chapter 14 (CNNs): Spatial augmentation (crop, flip, rotation) is critical; batch norm placement in conv blocks; transfer learning as regularization.
  • Chapter 15 (RNNs): Dropout applied to non-recurrent connections; variational dropout; weight tying as implicit regularization.