Chapter 13 Key Takeaways
The Big Picture
Regularization is not optional---it is a fundamental requirement for building models that generalize to unseen data. Modern neural networks are massively overparameterized and can memorize arbitrary data, including random labels. Regularization techniques constrain model complexity, inject noise, augment data, and encourage simpler representations, all with the goal of reducing the gap between training and test performance.
The Generalization Problem
- Overfitting: Low training error, high test error, large generalization gap. The model memorizes training data.
- Underfitting: High training error, high test error, small gap. The model is too simple.
- The bias-variance tradeoff frames regularization as reducing variance (overfitting) at the cost of slightly increased bias.
- Neural networks can memorize random labels (Zhang et al., 2017), proving that generalization is not an intrinsic property of the architecture.
Regularization Techniques
| Technique | Mechanism | When to Use | PyTorch |
|---|---|---|---|
| L2 / Weight Decay | Penalizes large weights | Always (default) | weight_decay in optimizer |
| L1 | Encourages sparsity | Feature selection | Manual penalty in loss |
| Dropout | Randomly zeros activations | FC layers, large models | nn.Dropout(p) |
| Data Augmentation | Artificially expands training set | Images, text, audio | torchvision.transforms |
| Early Stopping | Stop before overfitting | Always | Monitor val loss |
| Label Smoothing | Soft targets prevent overconfidence | Classification | label_smoothing param |
| Mixup | Interpolate between examples | Image/text classification | Custom implementation |
| CutMix | Cut-and-paste between images | Image classification | Custom implementation |
| Weight Pruning | Remove small weights | Deployment, compression | torch.nn.utils.prune |
| Stochastic Depth | Randomly drop residual blocks | Deep ResNets | Custom implementation |
L1 and L2 Regularization
- L2 (weight decay): Adds $\frac{\lambda}{2}\|\mathbf{w}\|_2^2$ to the loss. Shrinks all weights toward zero proportionally. Produces smooth decision boundaries.
- L1: Adds $\lambda\|\mathbf{w}\|_1$ to the loss. Drives some weights exactly to zero, producing sparse models.
- With AdamW, weight decay is decoupled from the gradient update, which is mathematically different from L2 regularization with Adam.
- Bias and normalization parameters should not have weight decay applied.
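A minimal PyTorch sketch of these points, assuming a generic model; the decay/no-decay parameter split, the weight-decay value, and the l1_lambda value are illustrative, not prescriptive.

```python
import torch
from torch import nn

def split_params(model: nn.Module):
    """Weight decay on weight matrices only, not on biases or norm parameters."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1-D parameters are biases and LayerNorm/BatchNorm scales/shifts.
        (no_decay if p.ndim == 1 or name.endswith(".bias") else decay).append(p)
    return decay, no_decay

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
decay, no_decay = split_params(model)

# AdamW applies decoupled weight decay, which differs from adding an L2 term
# to the loss when using Adam.
optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-3,
)

def l1_penalty(model: nn.Module, l1_lambda: float = 1e-5) -> torch.Tensor:
    """L1 must be added to the loss manually: loss = task_loss + l1_penalty(model)."""
    return l1_lambda * sum(p.abs().sum() for p in model.parameters())
```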
Dropout
- During training, randomly sets activations to zero with probability $p$ (typically 0.2--0.5).
- Inverted dropout scales surviving activations by $1/(1-p)$ during training, so no change is needed at test time.
- Can be interpreted as training an ensemble of $2^n$ subnetworks, where $n$ is the number of neurons.
- Should be disabled during evaluation via model.eval().
- Interacts with batch normalization: the variance shift from dropout can conflict with BN's statistics. Common practice is to place dropout after BN or use only one of the two.
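A minimal sketch of dropout in a fully connected head and the train/eval switch; the layer sizes and $p = 0.3$ are illustrative.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # zeros activations with prob. p, scales survivors by 1/(1-p)
    nn.Linear(256, 10),
)
x = torch.randn(8, 512)

model.train()            # dropout active: repeated forward passes differ
assert not torch.equal(model(x), model(x))

model.eval()             # dropout disabled: outputs are deterministic
with torch.no_grad():
    assert torch.equal(model(x), model(x))
```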
Data Augmentation
- Standard (images): Random crop, horizontal flip, color jitter, rotation.
- Advanced: RandAugment, AutoAugment, TrivialAugment---learned or randomized augmentation policies.
- Mixup: $\tilde{x} = \lambda x_i + (1-\lambda) x_j$, $\tilde{y} = \lambda y_i + (1-\lambda) y_j$, where $\lambda \sim \text{Beta}(\alpha, \alpha)$.
- CutMix: Replaces a rectangular patch of one image with a patch from another; labels mixed proportionally to area.
- Data augmentation is often the most impactful regularization technique for image tasks, especially with small datasets.
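A sketch of a standard augmentation pipeline plus a batch-level mixup helper following the formula above; the transform choices, $\alpha = 0.2$, and the helper name are illustrative. Recent PyTorch versions of nn.CrossEntropyLoss accept probabilistic (soft) targets, so y_mix can be passed to it directly.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Standard image augmentation (CIFAR-style 32x32 images; values are illustrative).
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

def mixup(x: torch.Tensor, y: torch.Tensor, num_classes: int, alpha: float = 0.2):
    """Mix a batch with a shuffled copy of itself: x~ = lam*x_i + (1-lam)*x_j."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    y_soft = F.one_hot(y, num_classes).float()
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_soft + (1 - lam) * y_soft[perm]
    return x_mix, y_mix
```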
Early Stopping
- Monitor validation loss and stop when it stops improving (with patience).
- Acts as an implicit form of L2 regularization: it limits the effective distance parameters can move from initialization.
- Requires splitting data into train/validation/test sets.
- Should restore the best model weights, not the weights at the stopping epoch.
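A skeleton of the early-stopping loop described above; train_one_epoch and evaluate are placeholder callables, and the patience value is illustrative.

```python
import copy
import math

def fit(model, optimizer, train_one_epoch, evaluate, max_epochs=100, patience=10):
    """Train until validation loss stops improving for `patience` epochs."""
    best_val, best_state, stale = math.inf, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)     # placeholder: one pass over training data
        val_loss = evaluate(model)            # placeholder: loss on the validation set
        if val_loss < best_val:
            best_val, stale = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stale += 1
            if stale >= patience:
                break
    if best_state is not None:                # restore the best weights, not the last ones
        model.load_state_dict(best_state)
    return model
```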
Label Smoothing
- Replaces hard one-hot targets with soft targets: true class gets $1 - \epsilon$, others get $\epsilon / (K-1)$.
- Prevents the model from becoming overconfident, which improves calibration.
- Typical value: $\epsilon = 0.1$.
- Built into PyTorch: nn.CrossEntropyLoss(label_smoothing=0.1).
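A minimal usage example. Note that PyTorch's label_smoothing argument mixes the one-hot target with a uniform distribution over all $K$ classes, a slight variant of the $\epsilon/(K-1)$ formulation above.

```python
import torch
from torch import nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # epsilon = 0.1

logits = torch.randn(4, 10)             # batch of 4, K = 10 classes
targets = torch.tensor([3, 1, 7, 0])    # hard class indices
loss = criterion(logits, targets)       # smoothing is applied internally
```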
Modern Phenomena
Double Descent
- Test error follows a U-shape (classical), then decreases again in the overparameterized regime.
- The peak occurs at the "interpolation threshold" where the model has just enough capacity to memorize the training data.
- Regularization (weight decay, early stopping) can suppress the peak and smooth out the double descent curve.
Lottery Ticket Hypothesis
- Dense networks contain sparse subnetworks ("winning tickets") that can achieve comparable performance when trained in isolation from the same initialization.
- Found via iterative magnitude pruning: train, prune smallest weights, rewind to initial weights, repeat.
- Practical implications: large models may be compressible by 90%+ with minimal accuracy loss.
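A rough sketch of iterative magnitude pruning with weight rewinding using torch.nn.utils.prune; the layer sizes, pruning fraction, number of rounds, and variable names are illustrative, and the training step is left as a placeholder.

```python
import torch
from torch import nn
from torch.nn.utils import prune

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
linears = [m for m in model.modules() if isinstance(m, nn.Linear)]
init_weights = {m: m.weight.detach().clone() for m in linears}   # rewind target

for _ in range(3):                       # each round prunes 20% of the surviving weights
    # ... train the model to convergence here (placeholder) ...
    prune.global_unstructured(
        [(m, "weight") for m in linears],
        pruning_method=prune.L1Unstructured,
        amount=0.2,
    )
    # Rewind: pruning reparameterizes each layer as weight = weight_orig * weight_mask,
    # so restoring weight_orig resets the surviving weights while the mask persists.
    with torch.no_grad():
        for m in linears:
            m.weight_orig.copy_(init_weights[m])
```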
Combining Regularization Techniques
| Combination | Recommendation |
|---|---|
| Dropout + BatchNorm | Use carefully; dropout variance can conflict with BN stats |
| Weight decay + LR | Higher weight decay pairs well with higher LR |
| Data augmentation + model size | Stronger augmentation allows larger models |
| Early stopping + LR schedule | Use together; schedule affects when to stop |
| Mixup + Label smoothing | Complementary; both prevent overconfidence |
A typical production recipe:
1. Weight decay (AdamW with wd=0.01--0.05)
2. Dropout (0.1--0.3, primarily in FC layers)
3. Data augmentation (task-appropriate)
4. Early stopping (patience=5--20)
5. Label smoothing (0.1)
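One possible instantiation of this recipe in PyTorch (all hyperparameters illustrative); augmentation and early stopping plug in as sketched in the earlier sections.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, 512), nn.ReLU(),
    nn.Dropout(0.2),                                       # 2. dropout in the FC head
    nn.Linear(512, 10),
)
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=3e-4, weight_decay=0.05)  # 1. decoupled weight decay
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)       # 5. label smoothing
# 3. data augmentation goes in the dataset/transform pipeline,
# 4. early stopping wraps the training loop with patience on validation loss.
```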
Common Pitfalls
- Too much regularization: Can cause underfitting. If training loss is high, reduce regularization.
- Dropout during evaluation: Forgetting model.eval() keeps dropout active at inference, injecting noise into predictions.
- Augmenting validation/test data: Never augment evaluation data; it corrupts the performance estimate.
- One regularizer fits all: Different datasets and architectures need different regularization recipes.
- Ignoring the generalization gap: A small gap may indicate underfitting, not good generalization.
Looking Ahead
- Chapter 14 (CNNs): Spatial augmentation (crop, flip, rotation) is critical; batch norm placement in conv blocks; transfer learning as regularization.
- Chapter 15 (RNNs): Dropout applied to non-recurrent connections; variational dropout; weight tying as implicit regularization.