Chapter 13 Key Takeaways
The Big Picture
Regularization is not optional---it is a fundamental requirement for building models that generalize to unseen data. Modern neural networks are massively overparameterized and can memorize arbitrary data, including random labels. Regularization techniques constrain model complexity, inject noise, augment data, and encourage simpler representations, all with the goal of reducing the gap between training and test performance.
The Generalization Problem
- Overfitting: Low training error, high test error, large generalization gap. The model memorizes training data.
- Underfitting: High training error, high test error, small gap. The model is too simple.
- The bias-variance tradeoff frames regularization as reducing variance (overfitting) at the cost of slightly increased bias.
- Neural networks can memorize random labels (Zhang et al., 2017), proving that generalization is not an intrinsic property of the architecture.
Regularization Techniques
| Technique | Mechanism | When to Use | PyTorch |
|---|---|---|---|
| L2 / Weight Decay | Penalizes large weights | Always (default) | weight_decay in optimizer |
| L1 | Encourages sparsity | Feature selection | Manual penalty in loss |
| Dropout | Randomly zeros activations | FC layers, large models | nn.Dropout(p) |
| Data Augmentation | Artificially expands training set | Images, text, audio | torchvision.transforms |
| Early Stopping | Stop before overfitting | Always | Monitor val loss |
| Label Smoothing | Soft targets prevent overconfidence | Classification | label_smoothing param |
| Mixup | Interpolate between examples | Image/text classification | Custom implementation |
| CutMix | Cut-and-paste between images | Image classification | Custom implementation |
| Weight Pruning | Remove small weights | Deployment, compression | torch.nn.utils.prune |
| Stochastic Depth | Randomly drop residual blocks | Deep ResNets | Custom implementation |
L1 and L2 Regularization
- L2 (weight decay): Adds $\frac{\lambda}{2}\|\mathbf{w}\|_2^2$ to the loss. Shrinks all weights toward zero proportionally. Produces smooth decision boundaries.
- L1: Adds $\lambda\|\mathbf{w}\|_1$ to the loss. Drives some weights exactly to zero, producing sparse models.
- With AdamW, weight decay is decoupled from the gradient update, which is mathematically different from L2 regularization with Adam.
- Bias and normalization parameters should not have weight decay applied.
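A minimal PyTorch sketch of these points, assuming a generic model; the decay/no-decay parameter split, the weight-decay value, and the l1_lambda value are illustrative, not prescriptive.

```python
import torch
from torch import nn

def split_params(model: nn.Module):
    """Weight decay on weight matrices only, not on biases or norm parameters."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1-D parameters are biases and LayerNorm/BatchNorm scales/shifts.
        (no_decay if p.ndim == 1 or name.endswith(".bias") else decay).append(p)
    return decay, no_decay

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
decay, no_decay = split_params(model)

# AdamW applies decoupled weight decay, which differs from adding an L2 term
# to the loss when using Adam.
optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-3,
)

def l1_penalty(model: nn.Module, l1_lambda: float = 1e-5) -> torch.Tensor:
    """L1 must be added to the loss manually: loss = task_loss + l1_penalty(model)."""
    return l1_lambda * sum(p.abs().sum() for p in model.parameters())
```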
Dropout
- During training, randomly sets activations to zero with probability $p$ (typically 0.2--0.5).
- Inverted dropout scales surviving activations by $1/(1-p)$ during training, so no change is needed at test time.
- Can be interpreted as training an ensemble of $2^n$ subnetworks, where $n$ is the number of neurons.
- Should be disabled during evaluation via model.eval().
- Interacts with batch normalization: the variance shift from dropout can conflict with BN's statistics. Common practice is to place dropout after BN or use only one of the two.
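A minimal sketch of dropout in a fully connected head and the train/eval switch; the layer sizes and $p = 0.3$ are illustrative.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # zeros activations with prob. p, scales survivors by 1/(1-p)
    nn.Linear(256, 10),
)
x = torch.randn(8, 512)

model.train()            # dropout active: repeated forward passes differ
assert not torch.equal(model(x), model(x))

model.eval()             # dropout disabled: outputs are deterministic
with torch.no_grad():
    assert torch.equal(model(x), model(x))
```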
Data Augmentation
- Standard (images): Random crop, horizontal flip, color jitter, rotation.
- Advanced: RandAugment, AutoAugment, TrivialAugment---learned or randomized augmentation policies.
- Mixup: $\tilde{x} = \lambda x_i + (1-\lambda) x_j$, $\tilde{y} = \lambda y_i + (1-\lambda) y_j$, where $\lambda \sim \text{Beta}(\alpha, \alpha)$.
- CutMix: Replaces a rectangular patch of one image with a patch from another; labels mixed proportionally to area.
- Data augmentation is often the most impactful regularization technique for image tasks, especially with small datasets.
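A sketch of a standard augmentation pipeline plus a batch-level mixup helper following the formula above; the transform choices, $\alpha = 0.2$, and the helper name are illustrative. Recent PyTorch versions of nn.CrossEntropyLoss accept probabilistic (soft) targets, so y_mix can be passed to it directly.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Standard image augmentation (CIFAR-style 32x32 images; values are illustrative).
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

def mixup(x: torch.Tensor, y: torch.Tensor, num_classes: int, alpha: float = 0.2):
    """Mix a batch with a shuffled copy of itself: x~ = lam*x_i + (1-lam)*x_j."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    y_soft = F.one_hot(y, num_classes).float()
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_soft + (1 - lam) * y_soft[perm]
    return x_mix, y_mix
```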
Early Stopping
- Monitor validation loss and stop when it stops improving (with patience).
- Acts as an implicit form of L2 regularization: it limits the effective distance parameters can move from initialization.
- Requires splitting data into train/validation/test sets.
- Should restore the best model weights, not the weights at the stopping epoch.
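A skeleton of the early-stopping loop described above; train_one_epoch and evaluate are placeholder callables, and the patience value is illustrative.

```python
import copy
import math

def fit(model, optimizer, train_one_epoch, evaluate, max_epochs=100, patience=10):
    """Train until validation loss stops improving for `patience` epochs."""
    best_val, best_state, stale = math.inf, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)     # placeholder: one pass over training data
        val_loss = evaluate(model)            # placeholder: loss on the validation set
        if val_loss < best_val:
            best_val, stale = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stale += 1
            if stale >= patience:
                break
    if best_state is not None:                # restore the best weights, not the last ones
        model.load_state_dict(best_state)
    return model
```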
Label Smoothing
- Replaces hard one-hot targets with soft targets: true class gets $1 - \epsilon$, others get $\epsilon / (K-1)$.
- Prevents the model from becoming overconfident, which improves calibration.
- Typical value: $\epsilon = 0.1$.
- Built into PyTorch: nn.CrossEntropyLoss(label_smoothing=0.1).
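A minimal usage example. Note that PyTorch's label_smoothing argument mixes the one-hot target with a uniform distribution over all $K$ classes, a slight variant of the $\epsilon/(K-1)$ formulation above.

```python
import torch
from torch import nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # epsilon = 0.1

logits = torch.randn(4, 10)             # batch of 4, K = 10 classes
targets = torch.tensor([3, 1, 7, 0])    # hard class indices
loss = criterion(logits, targets)       # smoothing is applied internally
```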
Modern Phenomena
Double Descent
- Test error follows a U-shape (classical), then decreases again in the overparameterized regime.
- The peak occurs at the "interpolation threshold" where the model has just enough capacity to memorize the training data.
- Regularization (weight decay, early stopping) can suppress the peak and smooth out the double descent curve.
Lottery Ticket Hypothesis
- Dense networks contain sparse subnetworks ("winning tickets") that can achieve comparable performance when trained in isolation from the same initialization.
- Found via iterative magnitude pruning: train, prune smallest weights, rewind to initial weights, repeat.
- Practical implications: large models may be compressible by 90%+ with minimal accuracy loss.
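A rough sketch of iterative magnitude pruning with weight rewinding using torch.nn.utils.prune; the layer sizes, pruning fraction, number of rounds, and variable names are illustrative, and the training step is left as a placeholder.

```python
import torch
from torch import nn
from torch.nn.utils import prune

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
linears = [m for m in model.modules() if isinstance(m, nn.Linear)]
init_weights = {m: m.weight.detach().clone() for m in linears}   # rewind target

for _ in range(3):                       # each round prunes 20% of the surviving weights
    # ... train the model to convergence here (placeholder) ...
    prune.global_unstructured(
        [(m, "weight") for m in linears],
        pruning_method=prune.L1Unstructured,
        amount=0.2,
    )
    # Rewind: pruning reparameterizes each layer as weight = weight_orig * weight_mask,
    # so restoring weight_orig resets the surviving weights while the mask persists.
    with torch.no_grad():
        for m in linears:
            m.weight_orig.copy_(init_weights[m])
```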
Combining Regularization Techniques
| Combination | Recommendation |
|---|---|
| Dropout + BatchNorm | Use carefully; dropout variance can conflict with BN stats |
| Weight decay + LR | Higher weight decay pairs well with higher LR |
| Data augmentation + model size | Stronger augmentation allows larger models |
| Early stopping + LR schedule | Use together; schedule affects when to stop |
| Mixup + Label smoothing | Complementary; both prevent overconfidence |
A typical production recipe:
1. Weight decay (AdamW with wd=0.01--0.05)
2. Dropout (0.1--0.3, primarily in FC layers)
3. Data augmentation (task-appropriate)
4. Early stopping (patience=5--20)
5. Label smoothing (0.1)
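One possible instantiation of this recipe in PyTorch (all hyperparameters illustrative); augmentation and early stopping plug in as sketched in the earlier sections.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, 512), nn.ReLU(),
    nn.Dropout(0.2),                                       # 2. dropout in the FC head
    nn.Linear(512, 10),
)
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=3e-4, weight_decay=0.05)  # 1. decoupled weight decay
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)       # 5. label smoothing
# 3. data augmentation goes in the dataset/transform pipeline,
# 4. early stopping wraps the training loop with patience on validation loss.
```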
Common Pitfalls
- Too much regularization: Can cause underfitting. If training loss is high, reduce regularization.
- Dropout during evaluation: Forgetting model.eval() keeps dropout active at inference, injecting noise into predictions.
- Augmenting validation/test data: Never augment evaluation data; it corrupts the performance estimate.
- One regularizer fits all: Different datasets and architectures need different regularization recipes.
- Ignoring the generalization gap: A small gap may indicate underfitting, not good generalization.
Looking Ahead
- Chapter 14 (CNNs): Spatial augmentation (crop, flip, rotation) is critical; batch norm placement in conv blocks; transfer learning as regularization.
- Chapter 15 (RNNs): Dropout applied to non-recurrent connections; variational dropout; weight tying as implicit regularization.