> "The goal of machine learning is not to fit the training data perfectly---it is to generalize to unseen data." --- Adapted from Vladimir Vapnik
In This Chapter
- 13.1 The Generalization Problem
- 13.2 L1 and L2 Regularization (Weight Decay)
- 13.3 Dropout
- 13.4 Data Augmentation
- 13.5 Early Stopping
- 13.6 Label Smoothing
- 13.7 Mixup and CutMix
- 13.8 Batch Size Effects on Generalization
- 13.9 The Double Descent Phenomenon
- 13.10 The Lottery Ticket Hypothesis
- 13.11 Combining Regularization Techniques
- 13.12 Implicit Regularization
- 13.13 Regularization in Modern Architectures
- 13.14 Monitoring Generalization
- 13.15 Summary
- References
Chapter 13: Regularization and Generalization
"The goal of machine learning is not to fit the training data perfectly---it is to generalize to unseen data." --- Adapted from Vladimir Vapnik
In the previous chapters, we built the foundational machinery of deep learning: neural network architectures (Chapter 7), activation functions (Chapter 8), loss functions (Chapter 9), optimizers (Chapter 10), weight initialization (Chapter 11), and batch normalization (Chapter 12). With these tools, you can train models that achieve near-zero training loss on virtually any dataset. But here is the uncomfortable truth: a model that memorizes training data is useless. What matters is performance on data the model has never seen before.
This chapter addresses the central challenge of machine learning -- and arguably the central challenge of intelligence itself: the tension between fitting the training data well and generalizing to new examples. We will explore a comprehensive toolkit of regularization techniques---from classical methods like L1 and L2 penalties to modern innovations like mixup, cutmix, and label smoothing. Along the way, we will encounter surprising phenomena like double descent and the lottery ticket hypothesis that challenge our intuitions about how neural networks learn.
By the end of this chapter, you will understand not just what regularization techniques exist, but why they work from both theoretical and practical perspectives, when to apply each one, and how to combine them effectively for maximum impact. This knowledge is essential for every AI engineer who wants to build models that work reliably in production.
13.1 The Generalization Problem
13.1.1 What Is Generalization?
Generalization is the ability of a model trained on a finite dataset to make accurate predictions on new, previously unseen data drawn from the same distribution. Formally, we distinguish between:
- Training error (empirical risk): The average loss computed over the training set.
- Test error (generalization error): The expected loss over the true data distribution.
The generalization gap is the difference between test error and training error:
$$\text{Generalization Gap} = \mathcal{L}_{\text{test}} - \mathcal{L}_{\text{train}}$$
A small generalization gap indicates that the model's performance on training data is representative of its performance on unseen data. A large gap signals trouble.
13.1.2 Overfitting and Underfitting
These two failure modes sit at opposite ends of a spectrum:
Underfitting occurs when the model is too simple to capture the underlying patterns in the data. Signs include:
- High training error
- High test error
- Training and test errors are close together
- The model fails to learn meaningful representations
Overfitting occurs when the model captures noise and idiosyncrasies in the training data rather than the true underlying pattern. Signs include:
- Low training error (sometimes near zero)
- High test error
- Large gap between training and test performance
- The model effectively memorizes the training set
In the context of deep learning, overfitting is far more common than underfitting because modern neural networks are massively overparameterized---they have far more parameters than training examples. A ResNet-50 has approximately 25 million parameters, yet it can be trained on datasets with only thousands of images. As we discussed in Chapter 7, this overparameterization gives networks enormous representational capacity, but it also means they can easily memorize arbitrary data, including random labels (Zhang et al., 2017).
13.1.3 PAC Learning: A Theoretical Foundation for Generalization
Before diving into practical techniques, it is worth understanding the theoretical underpinning of generalization. The Probably Approximately Correct (PAC) learning framework, introduced by Leslie Valiant in 1984, provides the formal foundation for understanding when and why learning from finite data can succeed.
The core idea of PAC learning is this: given a hypothesis class $\mathcal{H}$ (the set of functions our model can represent), a training set of $n$ examples drawn i.i.d. from some distribution $\mathcal{D}$, and a desired accuracy $\epsilon$ and confidence $1 - \delta$, PAC theory asks: how many examples $n$ do we need to guarantee that our learned hypothesis is within $\epsilon$ of optimal with probability at least $1 - \delta$?
The answer depends on the complexity of the hypothesis class. For finite hypothesis classes, the bound is:
$$n \geq \frac{1}{\epsilon} \left( \ln |\mathcal{H}| + \ln \frac{1}{\delta} \right)$$
where:
- $n$ is the number of training samples needed.
- $|\mathcal{H}|$ is the size of the hypothesis class.
- $\epsilon$ is the desired accuracy (how close to optimal).
- $\delta$ is the failure probability (how confident we want to be).
Intuition: The more complex your model class (larger $|\mathcal{H}|$), the more data you need to generalize. This is the formal version of the intuition that simple models generalize better with limited data.
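To make the bound concrete, here is a minimal sketch that plugs example numbers into the formula above (the function name pac_sample_bound and the example values are ours, chosen purely for illustration):
import math
def pac_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Samples needed for a finite hypothesis class in the realizable PAC setting."""
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)
# Example: one million hypotheses, 5% error tolerance, 95% confidence.
print(pac_sample_bound(hypothesis_count=10**6, epsilon=0.05, delta=0.05))  # 337
Even for a million candidate hypotheses, a few hundred samples suffice at this tolerance; the bound grows only logarithmically in the size of the hypothesis class but linearly in $1/\epsilon$.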
For infinite hypothesis classes like neural networks, the VC dimension (Vapnik-Chervonenkis dimension) replaces $\ln |\mathcal{H}|$ as the complexity measure. The VC dimension of a classifier is the largest number of points it can shatter (classify in all possible ways). A linear classifier in $d$ dimensions has VC dimension $d + 1$, while a neural network's VC dimension grows with the number of parameters.
However, PAC/VC theory has a critical limitation for modern deep learning: it predicts that networks with millions of parameters should require millions of training examples to generalize. In practice, deep networks generalize with far fewer examples than theory predicts. This gap between theory and practice -- sometimes called the generalization puzzle -- motivates much of the research discussed later in this chapter, including the double descent phenomenon (Section 13.9) and implicit regularization (Section 13.12).
Practical Tip: While PAC bounds are too loose for practical use, the core insight remains valuable: more model complexity demands more data or more regularization. When your dataset is small relative to your model size, aggressive regularization is not optional -- it is mathematically necessary.
13.1.4 The Bias-Variance Tradeoff
Classical statistical learning theory frames generalization through the bias-variance decomposition. For a given input $x$, the expected prediction error can be decomposed as:
$$\mathbb{E}[(\hat{f}(x) - y)^2] = \text{Bias}^2(\hat{f}(x)) + \text{Var}(\hat{f}(x)) + \sigma^2$$
where:
- Bias measures how far the average prediction is from the true value (underfitting).
- Variance measures how much predictions vary across different training sets (overfitting).
- $\sigma^2$ is the irreducible noise in the data.
Regularization techniques primarily work by reducing variance at the cost of slightly increased bias, achieving better overall generalization. However, as we will see in Section 13.9, the classical bias-variance picture does not fully explain the behavior of modern deep networks.
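The decomposition can be estimated empirically by training the same model class on many independently sampled training sets and measuring how the predictions vary. The sketch below does this for polynomial regression on a synthetic sine-wave task (the true function, noise level, and degrees 1 and 9 are illustrative assumptions, not a prescribed experiment):
import torch
torch.manual_seed(42)
def true_fn(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(2.0 * x)
def fit_poly(x: torch.Tensor, y: torch.Tensor, degree: int) -> torch.Tensor:
    """Least-squares polynomial fit via the design matrix [1, x, x^2, ...]."""
    design = torch.stack([x ** i for i in range(degree + 1)], dim=1)
    return torch.linalg.lstsq(design, y.unsqueeze(1)).solution.squeeze(1)
def predict(coeffs: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    design = torch.stack([x ** i for i in range(coeffs.numel())], dim=1)
    return design @ coeffs
x_test = torch.linspace(-1.0, 1.0, 50)
y_true = true_fn(x_test)
noise_std = 0.3
for degree in (1, 9):
    predictions = []
    for _ in range(200):  # 200 independent training sets of 20 points each
        x_train = torch.rand(20) * 2 - 1
        y_train = true_fn(x_train) + noise_std * torch.randn(20)
        predictions.append(predict(fit_poly(x_train, y_train, degree), x_test))
    predictions = torch.stack(predictions)
    bias_sq = ((predictions.mean(dim=0) - y_true) ** 2).mean()
    variance = predictions.var(dim=0).mean()
    print(f"degree={degree}: bias^2={bias_sq:.3f}, variance={variance:.3f}")
Typically the low-degree model shows higher bias and lower variance, while the high-degree model shows the opposite, which is exactly the tradeoff the decomposition describes.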
13.1.5 Why Regularization Matters in Practice
Consider a real-world scenario: you are building a model to classify skin lesions from dermatological images. Your training set contains 10,000 labeled images from three hospitals. Without regularization:
- The model might learn hospital-specific artifacts (watermarks, lighting conditions, camera models) rather than genuine medical features.
- It might memorize rare cases rather than learning generalizable patterns.
- It could perform brilliantly on your test set (from the same hospitals) but fail catastrophically when deployed to a new hospital.
Regularization is not an optional add-on---it is a fundamental requirement for building models that work in the real world. The techniques in this chapter are among the most practically important tools in your engineering arsenal.
13.2 L1 and L2 Regularization (Weight Decay)
13.2.1 The Core Idea
The simplest form of regularization adds a penalty term to the loss function that discourages the model from developing large weights. The intuition is that models with smaller weights tend to be simpler and less prone to overfitting because they produce smoother decision boundaries.
13.2.2 L2 Regularization (Ridge / Weight Decay)
L2 regularization adds the squared magnitude of all weights to the loss function:
$$\mathcal{L}_{\text{regularized}} = \mathcal{L}_{\text{original}} + \frac{\lambda}{2} \sum_{i} w_i^2$$
where $\lambda$ (often called the regularization strength or weight decay coefficient) controls the tradeoff between fitting the data and keeping weights small.
The gradient of the L2 penalty with respect to a weight $w_i$ is simply $\lambda w_i$. During gradient descent, this means each weight is multiplied by a factor of $(1 - \eta \lambda)$ before the gradient update, where $\eta$ is the learning rate. This is why L2 regularization is also called weight decay---weights literally decay toward zero at each step.
Important nuance: In the case of vanilla SGD, L2 regularization and weight decay are mathematically equivalent. However, for adaptive optimizers like Adam (Chapter 10), they differ. Loshchilov and Hutter (2019) showed that decoupled weight decay (AdamW) performs better than L2 regularization applied to Adam. This is because Adam's per-parameter learning rate scaling interacts poorly with the L2 penalty. Always use torch.optim.AdamW rather than adding an L2 penalty to the loss when using Adam.
import torch
import torch.nn as nn
torch.manual_seed(42)
model = nn.Linear(100, 10)
# L2 regularization via weight_decay parameter (preferred)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# Manual L2 penalty (for illustration only; under Adam/AdamW this is NOT
# equivalent to decoupled weight decay, as noted above)
lambda_l2 = 0.01
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(32, 100)
y = torch.randint(0, 10, (32,))
output = model(x)
loss = loss_fn(output, y)
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
loss_with_l2 = loss + lambda_l2 * l2_penalty
13.2.3 L1 Regularization (Lasso)
L1 regularization adds the absolute magnitude of weights:
$$\mathcal{L}_{\text{regularized}} = \mathcal{L}_{\text{original}} + \lambda \sum_{i} |w_i|$$
The gradient of the L1 penalty is $\lambda \cdot \text{sign}(w_i)$, which pushes all weights toward zero with equal force regardless of their magnitude. This produces sparse weight matrices where many weights are exactly zero.
L1 regularization is less commonly used in deep learning than L2 because:
- Sparse weights do not necessarily improve generalization in deep networks.
- The non-differentiability at zero can cause optimization issues.
- Modern techniques like dropout and data augmentation are generally more effective.
However, L1 can be useful when you want interpretable feature selection or when deploying models with limited memory (sparse weights compress well).
torch.manual_seed(42)
model = nn.Linear(100, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Manual L1 regularization
lambda_l1 = 0.001
x = torch.randn(32, 100)
y = torch.randint(0, 10, (32,))
loss_fn = nn.CrossEntropyLoss()
output = model(x)
loss = loss_fn(output, y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
total_loss = loss + lambda_l1 * l1_penalty
total_loss.backward()
optimizer.step()
13.2.4 Elastic Net: Combining L1 and L2
You can combine both penalties:
$$\mathcal{L}_{\text{regularized}} = \mathcal{L}_{\text{original}} + \lambda_1 \sum_{i} |w_i| + \frac{\lambda_2}{2} \sum_{i} w_i^2$$
This gives you both the sparsity-inducing property of L1 and the smoothness of L2. In practice, this is rarely used in deep learning but can be useful in certain specialized architectures.
13.2.5 Practical Recommendations for Weight Decay
| Scenario | Recommended Weight Decay |
|---|---|
| SGD with momentum | 1e-4 to 5e-4 |
| AdamW | 0.01 to 0.1 |
| Fine-tuning pretrained models | 0.01 to 0.1 (often higher than training from scratch) |
| Very small datasets | Higher values (0.1+) |
| Very large datasets | Lower values or none |
As a rule of thumb, weight decay is almost always beneficial. It is one of the first regularization techniques you should try, and it composes well with every other method in this chapter.
13.3 Dropout
13.3.1 The Mechanism
Dropout, introduced by Srivastava et al. (2014), is one of the most influential regularization techniques in deep learning. The idea is disarmingly simple: during training, randomly set each neuron's output to zero with probability $p$ (the dropout rate). During inference, use all neurons but scale their outputs by $(1 - p)$ to compensate.
In practice, PyTorch uses inverted dropout: during training, the surviving neurons are scaled up by $\frac{1}{1-p}$ so that no scaling is needed at inference time. This is more efficient because inference is the common case.
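The following quick check illustrates the inverted-dropout scaling: in training mode the surviving activations are scaled up by $\frac{1}{1-p}$, and in evaluation mode the layer acts as the identity.
import torch
import torch.nn as nn
torch.manual_seed(42)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)
drop.train()
print(drop(x))  # Surviving entries become 1 / (1 - p) = 2.0; dropped entries are 0.
drop.eval()
print(drop(x))  # Identity at inference: all ones, no rescaling needed.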
import torch
import torch.nn as nn
torch.manual_seed(42)
# Simple network with dropout
class RegularizedNet(nn.Module):
"""A feedforward network with dropout regularization.
Args:
input_dim: Number of input features.
hidden_dim: Number of hidden units per layer.
output_dim: Number of output classes.
dropout_rate: Probability of dropping a neuron. Defaults to 0.5.
"""
def __init__(
self,
input_dim: int,
hidden_dim: int,
output_dim: int,
dropout_rate: float = 0.5,
) -> None:
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(p=dropout_rate),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(p=dropout_rate),
nn.Linear(hidden_dim, output_dim),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass through the network.
Args:
x: Input tensor of shape (batch_size, input_dim).
Returns:
Output logits of shape (batch_size, output_dim).
"""
return self.net(x)
model = RegularizedNet(784, 256, 10, dropout_rate=0.5)
# IMPORTANT: Switch between train and eval mode
model.train() # Dropout is active
model.eval() # Dropout is disabled
13.3.2 Why Dropout Works
There are several complementary explanations for dropout's effectiveness:
- Ensemble interpretation: Each training step uses a different random subnetwork. Dropout trains an exponential number of weight-sharing subnetworks simultaneously. At test time, using all neurons approximates the ensemble average.
- Co-adaptation prevention: Without dropout, neurons can develop complex co-dependencies where specific neurons rely on the outputs of other specific neurons. Dropout forces each neuron to learn features that are useful on their own, leading to more robust representations.
- Noise injection: Dropout adds multiplicative noise to the hidden representations, which acts as a regularizer by preventing the network from becoming too confident about any single feature.
- Implicit Bayesian approximation: Gal and Ghahramani (2016) showed that a network with dropout applied before every weight layer is mathematically equivalent to an approximation of a Gaussian process. This connects dropout to Bayesian deep learning and uncertainty estimation (a topic we will explore further in later chapters).
13.3.3 Dropout Variants
Spatial Dropout (Dropout2d): For convolutional networks, standard dropout drops individual pixels, which is often too fine-grained because adjacent pixels are highly correlated. Spatial dropout drops entire feature maps instead:
# For convolutional networks, use Dropout2d
conv_block = nn.Sequential(
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.Dropout2d(p=0.2), # Drops entire feature maps
)
DropConnect: Instead of dropping neuron outputs, DropConnect drops individual weights. This provides finer-grained regularization but is more computationally expensive.
DropBlock: Designed for convolutional networks, DropBlock drops contiguous regions of feature maps rather than individual elements, forcing the network to learn from broader context.
13.3.4 Stochastic Depth
Stochastic depth (Huang et al., 2016) extends the dropout concept from individual neurons to entire layers. During training, each residual block in a deep network is randomly skipped (replaced by an identity mapping) with a probability that increases linearly with depth: the survival probability decreases from nearly 1 at the first block down to $p_L$ at the final block. Formally, for a network with $L$ residual blocks, block $l$ has survival probability:
$$p_l = 1 - \frac{l}{L}(1 - p_L)$$
where $p_L$ is the survival probability of the last layer (typically 0.5-0.8). During inference, all blocks are active but their outputs are scaled by $p_l$, analogous to the dropout scaling.
import torch
import torch.nn as nn
torch.manual_seed(42)
class StochasticDepthBlock(nn.Module):
"""A residual block with stochastic depth.
Args:
channels: Number of input and output channels.
survival_prob: Probability that this block is active during training.
"""
def __init__(self, channels: int, survival_prob: float = 0.8) -> None:
super().__init__()
self.survival_prob = survival_prob
self.block = nn.Sequential(
nn.Linear(channels, channels),
nn.ReLU(),
nn.Linear(channels, channels),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass with stochastic skip.
Args:
x: Input tensor.
Returns:
Output tensor, possibly with block skipped.
"""
if not self.training:
return x + self.survival_prob * self.block(x)
if torch.rand(1).item() < self.survival_prob:
return x + self.block(x)
else:
return x # Skip this block entirely
Stochastic depth provides several benefits: it regularizes by creating an implicit ensemble of networks with different depths, it reduces training time (skipped blocks require no forward or backward computation), and it alleviates the vanishing gradient problem in very deep networks. It is a standard component of modern architectures like Vision Transformers (ViT) and EfficientNet, as we will see in Chapter 14.
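As a small sketch of how the linear survival schedule from the formula above might be wired into a stack of blocks, reusing the StochasticDepthBlock class defined above (the block count, channel width, and $p_L$ value are illustrative):
import torch.nn as nn
num_blocks = 12
p_last = 0.8  # survival probability of the final block
blocks = nn.ModuleList([
    StochasticDepthBlock(
        channels=256,
        survival_prob=1.0 - (layer_idx / num_blocks) * (1.0 - p_last),
    )
    for layer_idx in range(1, num_blocks + 1)
])
print([round(b.survival_prob, 3) for b in blocks])  # 0.983 down to 0.8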
13.3.5 Practical Recommendations for Dropout
| Layer Type | Recommended Dropout Rate |
|---|---|
| Fully connected (hidden) | 0.3 - 0.5 |
| Convolutional layers | 0.1 - 0.3 (or use Dropout2d) |
| Recurrent layers (between layers) | 0.2 - 0.5 |
| After embeddings | 0.1 - 0.3 |
| Before final layer | 0.0 - 0.3 |
Key practical tips:
- Always call model.train() before training and model.eval() before evaluation. This is the most common dropout-related bug (as we noted in Chapter 12 regarding batch normalization).
- Dropout interacts with batch normalization. The conventional wisdom is to not use dropout in the same block as batch norm, though some architectures do combine them successfully.
- Higher dropout rates require longer training because the effective learning rate is reduced.
- For modern architectures like transformers, dropout rates of 0.1 are typical.
13.4 Data Augmentation
13.4.1 The Philosophy
Data augmentation is perhaps the most powerful regularization technique available because it addresses the root cause of overfitting: insufficient data diversity. Rather than constraining the model, data augmentation expands the effective training set by applying label-preserving transformations to existing examples.
The key insight is that for many domains, we know transformations that change the input but not the output. A cat rotated by 10 degrees is still a cat. A sentence with a synonym substituted still has the same sentiment. A time series shifted by a few timesteps still shows the same pattern.
13.4.2 Image Augmentation
Image augmentation is the most mature area, with a rich library of techniques. As discussed in Chapter 5 (data preprocessing), the torchvision.transforms module provides a comprehensive set of augmentations:
import torch
from torchvision import transforms
torch.manual_seed(42)
# Standard augmentation pipeline for image classification
train_transform = transforms.Compose([
transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
transforms.RandomHorizontalFlip(p=0.5),
transforms.ColorJitter(
brightness=0.4,
contrast=0.4,
saturation=0.4,
hue=0.1,
),
transforms.RandomRotation(degrees=15),
transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
transforms.ToTensor(),
transforms.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225],
),
transforms.RandomErasing(p=0.25),
])
# Validation/test transform: NO augmentation, only resize and normalize
val_transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225],
),
])
Common image augmentations and their effects:
| Augmentation | Effect | Typical Parameters |
|---|---|---|
| Random horizontal flip | Left-right reflection invariance | p=0.5 |
| Random rotation | Rotation invariance | 10-30 degrees |
| Color jitter | Lighting invariance | brightness, contrast, saturation=0.2-0.4 |
| Random crop | Scale/position invariance | Various scales |
| Random erasing | Occlusion robustness | p=0.2-0.5 |
| Gaussian blur | Noise robustness | kernel 3-7 |
13.4.3 Automated Augmentation Policies
Manually designing augmentation policies is tedious and domain-specific. Several methods learn augmentation policies automatically:
- AutoAugment (Cubuk et al., 2019): Uses reinforcement learning to search for optimal augmentation policies. Computationally expensive but produces strong results.
- RandAugment (Cubuk et al., 2020): Simplifies AutoAugment to just two hyperparameters: the number of augmentations $N$ and their magnitude $M$. Surprisingly competitive with AutoAugment at a fraction of the cost.
- TrivialAugment (Muller and Hutter, 2021): Even simpler---applies a single random augmentation with random magnitude. Achieves state-of-the-art results with zero hyperparameter tuning.
from torchvision.transforms import autoaugment, transforms
# RandAugment: simple and effective
train_transform_randaug = transforms.Compose([
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
autoaugment.RandAugment(num_ops=2, magnitude=9),
transforms.ToTensor(),
transforms.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225],
),
])
13.4.4 Text Data Augmentation
While image augmentation is well-established, text augmentation requires more care because small changes can alter meaning:
- Synonym replacement: Replace words with synonyms from WordNet.
- Random insertion/deletion/swap: Modify sentences at the word level.
- Back-translation: Translate to another language and back.
- Contextual augmentation: Use a language model to generate paraphrases.
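As a minimal sketch of the word-level operations listed above (random deletion and random swap), assuming simple whitespace tokenization; the function names and probabilities here are ours, chosen for illustration:
import random
random.seed(42)
def random_deletion(words: list[str], p: float = 0.1) -> list[str]:
    """Drop each word with probability p (always keep at least one word)."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]
def random_swap(words: list[str], n_swaps: int = 1) -> list[str]:
    """Swap two randomly chosen word positions n_swaps times."""
    words = words.copy()
    for _ in range(n_swaps):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words
sentence = "the service at this restaurant was genuinely excellent".split()
print(" ".join(random_deletion(sentence)))
print(" ".join(random_swap(sentence)))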
13.4.5 Augmentation for Other Domains
- Audio: Time stretching, pitch shifting, adding background noise, SpecAugment (masking frequency/time bands in spectrograms).
- Tabular data: SMOTE for oversampling, adding Gaussian noise to continuous features, feature dropout.
- Time series: Window slicing, magnitude warping, time warping, jittering.
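A minimal sketch of two of the time-series augmentations above, jittering and window slicing, with illustrative parameter values:
import torch
torch.manual_seed(42)
def jitter(series: torch.Tensor, sigma: float = 0.03) -> torch.Tensor:
    """Add small Gaussian noise to every timestep."""
    return series + sigma * torch.randn_like(series)
def window_slice(series: torch.Tensor, crop_ratio: float = 0.9) -> torch.Tensor:
    """Take a random contiguous crop covering crop_ratio of the series."""
    length = series.shape[-1]
    crop_len = int(length * crop_ratio)
    start = torch.randint(0, length - crop_len + 1, (1,)).item()
    return series[..., start:start + crop_len]
signal = torch.sin(torch.linspace(0, 6.28, 100))
print(jitter(signal).shape, window_slice(signal).shape)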
13.5 Early Stopping
13.5.1 The Concept
Early stopping is the practice of monitoring the model's performance on a validation set during training and stopping when performance begins to degrade. It is one of the oldest and most reliable regularization techniques.
The training dynamics typically follow a characteristic pattern:
- Early phase: Both training and validation loss decrease.
- Transition phase: Training loss continues to decrease, but validation loss plateaus.
- Overfitting phase: Training loss keeps decreasing, but validation loss increases.
Early stopping halts training at the transition between phases 2 and 3.
13.5.2 Implementation
import copy
import torch
import torch.nn as nn
class EarlyStopping:
"""Early stopping to terminate training when validation loss stops improving.
Args:
patience: Number of epochs to wait after last improvement.
min_delta: Minimum change to qualify as an improvement.
restore_best_weights: Whether to restore the best model weights.
"""
def __init__(
self,
patience: int = 10,
min_delta: float = 0.0,
restore_best_weights: bool = True,
) -> None:
self.patience = patience
self.min_delta = min_delta
self.restore_best_weights = restore_best_weights
self.best_loss: float = float("inf")
self.counter: int = 0
self.best_weights: dict | None = None
self.should_stop: bool = False
def __call__(self, val_loss: float, model: nn.Module) -> bool:
"""Check if training should stop.
Args:
val_loss: Current validation loss.
model: The model being trained.
Returns:
True if training should stop, False otherwise.
"""
if val_loss < self.best_loss - self.min_delta:
self.best_loss = val_loss
self.counter = 0
if self.restore_best_weights:
self.best_weights = copy.deepcopy(model.state_dict())
else:
self.counter += 1
if self.counter >= self.patience:
self.should_stop = True
if self.restore_best_weights and self.best_weights is not None:
model.load_state_dict(self.best_weights)
return True
return False
13.5.3 Using Early Stopping in a Training Loop
Here is a complete example showing how the EarlyStopping class integrates into a real training loop:
import torch
import torch.nn as nn
torch.manual_seed(42)
model = nn.Linear(100, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
early_stopping = EarlyStopping(patience=10, min_delta=1e-4)
# Simulated training loop
for epoch in range(500):
# Training phase
model.train()
x_train = torch.randn(256, 100)
y_train = torch.randint(0, 10, (256,))
train_loss = loss_fn(model(x_train), y_train)
train_loss.backward()
optimizer.step()
optimizer.zero_grad()
# Validation phase
model.eval()
with torch.no_grad():
x_val = torch.randn(64, 100)
y_val = torch.randint(0, 10, (64,))
val_loss = loss_fn(model(x_val), y_val).item()
# Check early stopping
if early_stopping(val_loss, model):
print(f"Early stopping at epoch {epoch}")
print(f"Best validation loss: {early_stopping.best_loss:.4f}")
break
A critical implementation detail: always call model.eval() before computing the validation loss. As we saw in Section 13.3 (dropout) and Chapter 12 (batch normalization), several layers behave differently during training and evaluation. Forgetting this switch is one of the most common bugs in deep learning code.
13.5.4 Early Stopping as Regularization
Early stopping has a beautiful theoretical connection to L2 regularization. For linear models trained with gradient descent, early stopping is equivalent to L2 regularization where the regularization strength is inversely proportional to the number of training steps (Bishop, 1995). Stopping earlier corresponds to stronger regularization.
This connection extends approximately to neural networks: early in training, the weights are close to their initial (small) values, and the model is effectively simpler. As training continues, the weights grow and the model becomes more complex. Early stopping restricts this complexity growth.
13.5.5 Practical Recommendations
- Patience: Use 5-20 epochs for most problems. Larger patience values allow the model to escape temporary plateaus but risk overfitting.
- Monitoring metric: Use validation loss for regression; for classification, validation accuracy or validation loss both work, but loss is more sensitive to small changes.
- Always save the best model: Do not just stop training---restore the weights from the best epoch.
- Combine with learning rate scheduling: As discussed in Chapter 10, reducing the learning rate can help the model converge to a better minimum. Use early stopping after learning rate reductions have been exhausted.
13.6 Label Smoothing
13.6.1 The Problem with Hard Labels
In standard classification, we train with hard labels---one-hot encoded targets where the correct class has probability 1 and all others have probability 0. This forces the model to predict increasingly extreme logits to minimize cross-entropy loss, which has two problems:
- Overconfidence: The model learns to be absolutely certain about its predictions, even when the input is ambiguous.
- Reduced generalization: The model overfits to the exact training labels rather than learning a calibrated probability distribution.
13.6.2 The Label Smoothing Solution
Label smoothing (Szegedy et al., 2016) replaces hard targets with soft targets:
$$y_{\text{smooth}}(k) = \begin{cases} 1 - \alpha + \frac{\alpha}{K} & \text{if } k = \text{true class} \\ \frac{\alpha}{K} & \text{otherwise} \end{cases}$$
where $\alpha$ is the smoothing parameter (typically 0.1) and $K$ is the number of classes.
For example, with 10 classes and $\alpha = 0.1$, the target for class 3 becomes:
$$[0.01, 0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]$$
instead of:
$$[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]$$
13.6.3 Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(42)
class LabelSmoothingCrossEntropy(nn.Module):
"""Cross-entropy loss with label smoothing.
Args:
smoothing: Label smoothing factor. Defaults to 0.1.
"""
def __init__(self, smoothing: float = 0.1) -> None:
super().__init__()
self.smoothing = smoothing
def forward(
self, pred: torch.Tensor, target: torch.Tensor
) -> torch.Tensor:
"""Compute label-smoothed cross-entropy loss.
Args:
pred: Predicted logits of shape (batch_size, num_classes).
target: Target class indices of shape (batch_size,).
Returns:
Scalar loss tensor.
"""
num_classes = pred.size(-1)
log_probs = F.log_softmax(pred, dim=-1)
# NLL loss for the true class
nll_loss = -log_probs.gather(dim=-1, index=target.unsqueeze(1))
nll_loss = nll_loss.squeeze(1)
# Smooth loss: uniform distribution over all classes
smooth_loss = -log_probs.mean(dim=-1)
# Combine
loss = (1.0 - self.smoothing) * nll_loss + self.smoothing * smooth_loss
return loss.mean()
# PyTorch also provides built-in label smoothing
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
13.6.4 Worked Example: Label Smoothing in Action
To see the effect of label smoothing concretely, consider a 3-class problem where the true label is class 1. With $\alpha = 0.1$:
Hard label (no smoothing): $[0, 1, 0]$
The cross-entropy loss pushes the model to make the logit for class 1 as large as possible relative to the others. As training progresses, the model might output logits like $[-5.2, 12.8, -4.1]$, producing a softmax output of $[0.0000, 0.9999, 0.0001]$. These extreme logits are brittle -- a small perturbation to the input can shift the prediction dramatically.
Soft label ($\alpha = 0.1$): $[0.033, 0.933, 0.033]$
Now the model is penalized for being too confident. The optimal logits are far more moderate: logits of $[-1.1, 2.2, -1.1]$, for example, produce a softmax output of roughly $[0.03, 0.93, 0.03]$, matching the soft target. These softer predictions are more robust to input perturbations and better reflect genuine uncertainty.
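You can verify these softmax values directly (the logit vectors below are the illustrative ones from the example above):
import torch
import torch.nn.functional as F
overconfident = torch.tensor([-5.2, 12.8, -4.1])
moderate = torch.tensor([-1.1, 2.2, -1.1])
print(F.softmax(overconfident, dim=0))  # ~[0.000, 1.000, 0.000]
print(F.softmax(moderate, dim=0))       # ~[0.034, 0.931, 0.034]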
Muller et al. (2019) showed that label smoothing also has a geometric effect: it encourages the penultimate layer representations of different classes to form tighter, more separated clusters. This improved structure in the representation space is partly why label smoothing helps with transfer learning and knowledge distillation.
13.6.5 Benefits and Considerations
Benefits:
- Improves model calibration (predicted probabilities better reflect true likelihoods).
- Reduces overfitting, especially for datasets with noisy labels.
- Encourages the model to learn more discriminative features for the penultimate layer.
- Almost always helps for classification tasks.
Considerations:
- Do not use label smoothing for knowledge distillation (the soft teacher labels already provide smoothing).
- Common values: 0.1 for most tasks, 0.05 for tasks with very clean labels, 0.2 for noisy labels.
- Label smoothing was a key component in the original Transformer paper (Chapter 14 will discuss transformers in detail).
13.7 Mixup and CutMix
13.7.1 Mixup
Mixup (Zhang et al., 2018) is an elegant data augmentation technique that creates new training examples by taking convex combinations of pairs of training examples and their labels:
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j$$ $$\tilde{y} = \lambda y_i + (1 - \lambda) y_j$$
where $\lambda \sim \text{Beta}(\alpha, \alpha)$ and $\alpha$ is a hyperparameter (typically 0.2-0.4).
The beauty of mixup is its simplicity: it requires just two lines of code to implement, yet it provides substantial regularization benefits.
import torch
import torch.nn as nn
torch.manual_seed(42)
def mixup_data(
x: torch.Tensor,
y: torch.Tensor,
alpha: float = 0.2,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, float]:
"""Apply mixup augmentation to a batch of data.
Args:
x: Input batch of shape (batch_size, ...).
y: Target batch of shape (batch_size,).
alpha: Mixup interpolation strength. Defaults to 0.2.
Returns:
Tuple of (mixed_x, y_a, y_b, lam) where lam is the mixing
coefficient.
"""
if alpha > 0:
lam = torch.distributions.Beta(alpha, alpha).sample().item()
else:
lam = 1.0
batch_size = x.size(0)
index = torch.randperm(batch_size, device=x.device)
mixed_x = lam * x + (1 - lam) * x[index]
y_a, y_b = y, y[index]
return mixed_x, y_a, y_b, lam
def mixup_criterion(
criterion: nn.Module,
pred: torch.Tensor,
y_a: torch.Tensor,
y_b: torch.Tensor,
lam: float,
) -> torch.Tensor:
"""Compute mixup loss.
Args:
criterion: Base loss function.
pred: Model predictions.
y_a: First set of targets.
y_b: Second set of targets.
lam: Mixing coefficient.
Returns:
Scalar loss tensor.
"""
return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)
13.7.2 Why Mixup Works
Mixup regularizes by:
- Expanding the training distribution: Interpolated examples lie between training points, filling in gaps in the data manifold.
- Enforcing linear behavior between classes: The model learns smooth transitions rather than sharp decision boundaries.
- Reducing memorization: Interpolated examples are harder to memorize because they are different every epoch.
- Improving calibration: Models trained with mixup produce better-calibrated probability estimates.
13.7.3 CutMix
CutMix (Yun et al., 2019) takes a different approach: instead of blending entire images, it cuts a rectangular region from one image and pastes it onto another. The label is mixed proportionally to the area:
$$\tilde{x} = \mathbf{M} \odot x_i + (1 - \mathbf{M}) \odot x_j$$ $$\tilde{y} = \lambda y_i + (1 - \lambda) y_j$$
where $\mathbf{M}$ is a binary mask and $\lambda$ is the proportion of the image from $x_i$.
import torch
torch.manual_seed(42)
def cutmix_data(
x: torch.Tensor,
y: torch.Tensor,
alpha: float = 1.0,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, float]:
"""Apply CutMix augmentation to a batch of images.
Args:
x: Input batch of shape (batch_size, channels, height, width).
y: Target batch of shape (batch_size,).
alpha: CutMix interpolation strength. Defaults to 1.0.
Returns:
Tuple of (mixed_x, y_a, y_b, lam).
"""
lam = torch.distributions.Beta(alpha, alpha).sample().item()
batch_size = x.size(0)
index = torch.randperm(batch_size, device=x.device)
_, _, h, w = x.shape
# Sample bounding box
cut_ratio = (1.0 - lam) ** 0.5
cut_h = int(h * cut_ratio)
cut_w = int(w * cut_ratio)
cx = torch.randint(0, w, (1,)).item()
cy = torch.randint(0, h, (1,)).item()
x1 = max(0, cx - cut_w // 2)
y1 = max(0, cy - cut_h // 2)
x2 = min(w, cx + cut_w // 2)
y2 = min(h, cy + cut_h // 2)
mixed_x = x.clone()
mixed_x[:, :, y1:y2, x1:x2] = x[index, :, y1:y2, x1:x2]
# Adjust lambda based on actual cut area
lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)
return mixed_x, y, y[index], lam
13.7.4 CutMix vs. Mixup
| Aspect | Mixup | CutMix |
|---|---|---|
| Operation | Pixel-wise blending | Region replacement |
| Visual realism | Blurry/ghostly images | Locally coherent |
| Best for | General regularization | Localization tasks |
| Object detection | Less effective | More effective |
| Typical alpha | 0.2-0.4 | 1.0 |
In practice, many state-of-the-art training pipelines use both mixup and CutMix, applying one or the other with some probability for each batch.
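A minimal sketch of that pattern, reusing the mixup_data and cutmix_data functions defined earlier in this section (the 50/50 switch probability is an illustrative choice):
import torch
torch.manual_seed(42)
def mixup_or_cutmix(
    x: torch.Tensor,
    y: torch.Tensor,
    cutmix_prob: float = 0.5,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, float]:
    """Apply either CutMix or mixup to a batch, chosen at random per batch."""
    if torch.rand(1).item() < cutmix_prob:
        return cutmix_data(x, y, alpha=1.0)
    return mixup_data(x, y, alpha=0.2)
images = torch.randn(16, 3, 32, 32)
labels = torch.randint(0, 10, (16,))
mixed, y_a, y_b, lam = mixup_or_cutmix(images, labels)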
13.8 Batch Size Effects on Generalization
13.8.1 The Large Batch Size Problem
The choice of batch size has a profound, sometimes surprising, effect on generalization. A seminal observation by Keskar et al. (2017) found that models trained with large batch sizes generalize worse than those trained with small batch sizes, even when trained to the same training loss.
The explanation relates to the geometry of the loss landscape:
- Small batches introduce noise in the gradient estimate, which helps the optimizer escape sharp local minima and settle into flat minima---regions where the loss is low across a wide neighborhood of parameter values.
- Large batches provide more accurate gradient estimates, which tend to converge to sharp minima---narrow valleys in the loss landscape where small perturbations in parameters cause large increases in loss.
Flat minima generalize better because the test loss landscape is slightly shifted from the training loss landscape. At a flat minimum, this shift causes minimal performance degradation; at a sharp minimum, the shift can cause the model to land on a slope with much higher loss.
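One crude way to probe this in practice is to perturb the trained weights with small random noise and measure how much the loss increases: a flat minimum tolerates the perturbation, a sharp one does not. The sketch below illustrates the idea (the noise scale, number of trials, and the untrained toy model are illustrative assumptions, not a standard sharpness metric):
import copy
import torch
import torch.nn as nn
torch.manual_seed(42)
def sharpness_estimate(
    model: nn.Module,
    loss_fn: nn.Module,
    x: torch.Tensor,
    y: torch.Tensor,
    noise_std: float = 0.01,
    num_trials: int = 10,
) -> float:
    """Average loss increase under small random weight perturbations."""
    with torch.no_grad():
        base_loss = loss_fn(model(x), y).item()
        original_state = copy.deepcopy(model.state_dict())
        increases = []
        for _ in range(num_trials):
            for param in model.parameters():
                param.add_(noise_std * torch.randn_like(param))
            increases.append(loss_fn(model(x), y).item() - base_loss)
            model.load_state_dict(original_state)  # Restore unperturbed weights
        return sum(increases) / num_trials
model = nn.Linear(100, 10)
x = torch.randn(64, 100)
y = torch.randint(0, 10, (64,))
print(f"Mean loss increase: {sharpness_estimate(model, nn.CrossEntropyLoss(), x, y):.4f}")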
13.8.2 The Linear Scaling Rule
When increasing batch size, a common practice is to proportionally increase the learning rate. This is known as the linear scaling rule (Goyal et al., 2017):
$$\eta_{\text{new}} = \eta_{\text{base}} \times \frac{B_{\text{new}}}{B_{\text{base}}}$$
This keeps the expected weight update magnitude approximately constant. However, the linear scaling rule breaks down for very large batch sizes.
13.8.3 Practical Recommendations
- Default batch sizes: 32-256 for most tasks.
- Gradient accumulation: If you want the gradient quality of a large batch without its memory footprint, you can accumulate gradients over multiple small batches before updating. Note that the resulting update matches the large-batch update, so accumulation does not recover the implicit regularization of genuinely small batches.
- Learning rate warmup: When using large batch sizes, gradually increase the learning rate over the first few epochs (as discussed in Chapter 10).
- LARS/LAMB optimizers: Specialized optimizers like LARS (You et al., 2017) and LAMB (You et al., 2020) can train with very large batch sizes while maintaining generalization.
import torch
import torch.nn as nn
torch.manual_seed(42)
model = nn.Linear(100, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accumulation_steps = 4 # Effective batch size = actual_batch * 4
for step in range(100):
x = torch.randn(32, 100)
y = torch.randint(0, 10, (32,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss = loss / accumulation_steps # Normalize loss
loss.backward()
if (step + 1) % accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad()
13.9 The Double Descent Phenomenon
13.9.1 Beyond the Classical U-Curve
Classical learning theory predicts a U-shaped test error curve: as model complexity increases, test error first decreases (underfitting regime) then increases (overfitting regime). The sweet spot is somewhere in the middle. This motivated decades of work on model selection---choosing the "right" model complexity.
However, Belkin et al. (2019) and Nakkiran et al. (2020) discovered that this picture is incomplete. When you continue to increase model complexity beyond the interpolation threshold (the point where the model can perfectly fit the training data), test error decreases again, forming a double descent curve.
13.9.2 The Three Regimes
The double descent curve has three distinct regimes:
- Under-parameterized regime (classical): Model complexity is less than needed to fit the training data. Increasing complexity helps. Classical bias-variance tradeoff applies.
- Interpolation threshold (critical): Model complexity is just barely sufficient to fit the training data. The model is forced to use all its capacity for memorization, leaving nothing for generalization. Test error peaks here.
- Over-parameterized regime (modern): Model complexity far exceeds what is needed to fit the data. The model can fit the training data in many ways and implicitly selects a smooth, well-generalizing solution. Test error decreases.
13.9.3 Epoch-Wise Double Descent
Double descent occurs not only as a function of model size but also as a function of training time. For a fixed model, the test error can show a second descent after initially overfitting. This means that the conventional practice of early stopping might sometimes halt training prematurely, before the model reaches the second descent.
13.9.4 A Worked Example of Double Descent
To make this concrete, consider training a series of neural networks with increasing width on CIFAR-10. Suppose we use a single hidden layer with width $h$:
| Width $h$ | Parameters | Train Error | Test Error | Regime |
|---|---|---|---|---|
| 32 | ~25K | 15.2% | 22.1% | Under-parameterized |
| 128 | ~100K | 4.1% | 14.8% | Under-parameterized |
| 512 | ~400K | 0.3% | 16.5% | Near interpolation |
| 1024 | ~800K | 0.0% | 18.9% | Interpolation threshold |
| 2048 | ~1.6M | 0.0% | 15.2% | Over-parameterized |
| 4096 | ~3.2M | 0.0% | 12.8% | Over-parameterized |
| 8192 | ~6.4M | 0.0% | 11.5% | Over-parameterized |
Notice the pattern: test error initially decreases, peaks near the interpolation threshold (width 1024, where the model has just enough parameters to memorize all 50,000 training images), and then decreases again as the model becomes increasingly overparameterized. The network with 8,192 hidden units has far more parameters than training examples, yet it generalizes better than the network with 512 units.
This phenomenon has profound implications for how we think about model selection. The classical approach of choosing the smallest model that fits the data adequately is not always correct -- sometimes a much larger model will generalize better.
13.9.5 Sample-Wise Double Descent
Nakkiran et al. (2020) also demonstrated sample-wise double descent: for a fixed model, increasing the number of training samples can temporarily hurt test performance. This happens when additional samples push the model closer to the interpolation threshold without pushing it past. The practical implication is that if you observe a performance dip after adding more data, the solution may be to add even more data (or increase model size) rather than reverting to the smaller dataset.
13.9.6 Implications for Practice
- Do not fear overparameterization: Modern deep networks are massively overparameterized, and this is often a feature rather than a bug. The models in Chapters 7-12 all benefited from having more parameters than strictly necessary.
- Be cautious around the interpolation threshold: If your model size is just barely sufficient for the task, you may be in the worst possible regime. Either simplify the model or make it larger.
- Regularization smooths the curve: Strong regularization (weight decay, dropout, data augmentation) reduces the peak at the interpolation threshold, making the test error curve more monotonically decreasing.
- More data helps: With more training data, the interpolation threshold shifts to larger models, and the double descent peak becomes less pronounced.
13.9.7 Connection to Regularization
The double descent phenomenon provides a deeper understanding of why regularization works. Regularization effectively constrains the set of solutions the optimizer can find, biasing it toward simpler functions even in the overparameterized regime. This is related to the concept of implicit regularization---the idea that SGD itself has an inductive bias toward simpler solutions, as discussed in Chapter 10.
13.10 The Lottery Ticket Hypothesis
13.10.1 The Central Claim
The lottery ticket hypothesis, proposed by Frankle and Carbin (2019), makes a surprising claim:
A randomly initialized dense network contains a subnetwork (the "winning ticket") that---when trained in isolation from the same initialization---can match the test accuracy of the full network in a comparable number of training steps.
In other words, large networks are not inherently necessary for good performance. They are necessary for finding good subnetworks during training. The large network acts as a search space, and training identifies which connections matter.
13.10.2 Finding Winning Tickets
The original algorithm for finding winning tickets is Iterative Magnitude Pruning (IMP):
- Initialize a network with random weights $\theta_0$.
- Train the network to convergence, reaching weights $\theta_T$.
- Prune the smallest-magnitude weights (e.g., remove the bottom 20%).
- Reset the remaining weights to their original values $\theta_0$.
- Repeat from step 2 with the pruned network.
This iterative process gradually identifies the critical substructure. Each round removes the least important connections (as measured by weight magnitude), and resetting to the original initialization is crucial -- randomly re-initializing the pruned network typically produces much worse results. This suggests that the specific initial values of the winning ticket's weights are important, not just the structure.
The IMP algorithm can find subnetworks that are 10-20% the size of the original network while matching its performance. However, there is an important caveat: for very large networks, Frankle and colleagues found in follow-up work that rewinding to an early training checkpoint (e.g., epoch 5) rather than the exact initialization $\theta_0$ produces better results. This variant, called late resetting, suggests that the first few training steps serve an important role in "priming" the network for later learning.
Worked Example: Consider a fully connected network for MNIST with 300-100-10 architecture (approximately 266,000 parameters). After 15 rounds of iterative magnitude pruning (removing 20% of remaining weights each round), roughly 3.5% of the weights remain, a subnetwork of about 9,400 parameters, and it achieves 98.0% test accuracy -- essentially matching the full network's 98.2%. The more than 96% of parameters that were pruned were not contributing meaningfully to the final prediction.
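The sketch below shows what a single IMP round might look like with PyTorch's pruning utilities: prune by magnitude, then rewind the surviving weights to their initial values. The train() call in the comments is a hypothetical training loop omitted here, and the architecture is the MNIST example above.
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
torch.manual_seed(42)
model = nn.Sequential(
    nn.Linear(784, 300), nn.ReLU(),
    nn.Linear(300, 100), nn.ReLU(),
    nn.Linear(100, 10),
)
initial_state = copy.deepcopy(model.state_dict())  # theta_0, kept for rewinding
def one_imp_round(model: nn.Module, prune_fraction: float = 0.2) -> None:
    """One round of magnitude pruning followed by rewinding to theta_0."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Prune the smallest-magnitude weights among those still unpruned.
            prune.l1_unstructured(module, name="weight", amount=prune_fraction)
    with torch.no_grad():
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear):
                # The mask stays; surviving weights return to their initial values.
                module.weight_orig.copy_(initial_state[f"{name}.weight"])
                module.bias.copy_(initial_state[f"{name}.bias"])
# train(model)          # hypothetical training loop (omitted)
# one_imp_round(model)  # prune 20% and rewind; repeat train/prune for more rounds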
13.10.3 Implications
For regularization:
- The lottery ticket hypothesis suggests that much of the network's capacity is redundant. Regularization works partly by suppressing these redundant pathways.
- Dropout can be seen as a stochastic way to search for winning tickets---it forces the network to find robust subnetworks.
For efficiency:
- If we could identify winning tickets before training, we could train much smaller networks from the start, saving computation.
- This has motivated extensive research in neural network pruning and sparse training.
For understanding generalization:
- The hypothesis suggests that what matters is not the total number of parameters but the structure of the connections. This partly explains why overparameterized networks generalize well---they provide a rich search space for finding good subnetworks.
13.10.4 Practical Pruning
While the full lottery ticket algorithm is computationally expensive (it requires multiple rounds of training), simpler pruning techniques can still yield significant benefits:
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
torch.manual_seed(42)
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 128),
nn.ReLU(),
nn.Linear(128, 10),
)
# Prune 30% of weights with smallest magnitude in each layer
for name, module in model.named_modules():
if isinstance(module, nn.Linear):
prune.l1_unstructured(module, name="weight", amount=0.3)
# Check sparsity
total_params = 0
zero_params = 0
for name, module in model.named_modules():
if isinstance(module, nn.Linear):
total_params += module.weight.nelement()
zero_params += (module.weight == 0).sum().item()
print(f"Sparsity: {100.0 * zero_params / total_params:.1f}%")
13.11 Combining Regularization Techniques
13.11.1 The Regularization Recipe
In practice, you will almost always use multiple regularization techniques simultaneously. The key is understanding how they interact and building a coherent regularization strategy.
Here is a recommended approach, ordered by priority:
- Data augmentation (almost always helps, especially for small datasets)
- Weight decay (low cost, reliable benefit)
- Dropout (effective for fully connected layers)
- Label smoothing (simple, often beneficial for classification)
- Early stopping (free regularization, always use)
- Mixup/CutMix (strong regularization, especially for image tasks)
- Stochastic depth (for very deep networks, as referenced in Chapter 7)
13.11.2 A Complete Training Pipeline
Here is a comprehensive example that combines multiple regularization techniques:
import torch
import torch.nn as nn
from torchvision import transforms
torch.manual_seed(42)
class WellRegularizedModel(nn.Module):
"""A CNN with comprehensive regularization.
Args:
num_classes: Number of output classes.
dropout_rate: Dropout probability for FC layers.
"""
def __init__(
self, num_classes: int = 10, dropout_rate: float = 0.5
) -> None:
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(3, 32, 3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.Conv2d(32, 64, 3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Dropout2d(0.1),
nn.Conv2d(64, 128, 3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.AdaptiveAvgPool2d(4),
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(128 * 4 * 4, 256),
nn.ReLU(),
nn.Dropout(dropout_rate),
nn.Linear(256, num_classes),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass.
Args:
x: Input tensor of shape (batch_size, 3, H, W).
Returns:
Logits of shape (batch_size, num_classes).
"""
x = self.features(x)
return self.classifier(x)
# Data augmentation
train_transform = transforms.Compose([
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ColorJitter(0.2, 0.2, 0.2),
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
transforms.RandomErasing(p=0.2),
])
# Model with dropout
model = WellRegularizedModel(num_classes=10, dropout_rate=0.5)
# Optimizer with weight decay
optimizer = torch.optim.AdamW(
model.parameters(), lr=1e-3, weight_decay=0.01
)
# Label smoothing
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=200
)
13.11.3 When to Add More Regularization
Monitor these signals to determine if you need more regularization:
| Signal | Diagnosis | Action |
|---|---|---|
| Train and test loss both high and close together | Underfitting | Reduce regularization, increase model capacity |
| Train loss low, test loss high | Overfitting | Add more regularization |
| Train loss low, test loss low, but gap growing | Beginning to overfit | Mild additional regularization or early stopping |
| Both losses plateau at similar values | Good fit but could improve | Try different augmentation, increase model size |
13.11.4 Regularization for Different Data Scales
The amount of regularization you need depends heavily on your dataset size:
| Dataset Size | Recommended Strategy |
|---|---|
| Very small (< 1K) | Heavy augmentation, strong dropout (0.5+), high weight decay, consider transfer learning (Chapter 15) |
| Small (1K-10K) | Moderate augmentation, dropout (0.3-0.5), weight decay, mixup |
| Medium (10K-100K) | Standard augmentation, moderate dropout (0.1-0.3), weight decay |
| Large (100K-1M) | Light augmentation, light dropout (0-0.1), light weight decay |
| Very large (1M+) | Minimal explicit regularization; the data itself regularizes |
13.12 Implicit Regularization
13.12.1 Beyond Explicit Techniques
Not all regularization is explicit. Many design choices in deep learning have implicit regularization effects:
SGD with small batch sizes: As discussed in Section 13.8, the noise in mini-batch gradient estimates acts as a regularizer by preventing the optimizer from settling into sharp minima.
Overparameterization: Counterintuitively, using a larger model can sometimes improve generalization. Overparameterized models have many possible solutions that perfectly fit the training data, and SGD tends to find the simplest one among these (related to the double descent discussion in Section 13.9).
Architecture choices: The architecture itself encodes inductive biases. Convolutional networks assume spatial locality and translation invariance (Chapter 7). Recurrent networks assume sequential structure. These biases act as regularizers by limiting the functions the network can represent.
Batch normalization: As discussed in Chapter 12, batch normalization adds noise during training (because batch statistics vary between mini-batches), which has a regularization effect. This is why dropout is often unnecessary when batch normalization is used.
Skip connections: Residual connections (Chapter 7) provide a regularization effect by making it easy for the network to learn identity mappings for layers that are not needed.
13.12.2 The Role of the Optimizer
The choice of optimizer itself introduces implicit regularization:
- SGD has an implicit bias toward minimum-norm solutions in linear models, which extends approximately to neural networks.
- Adam tends to find different solutions than SGD, sometimes with worse generalization (this is why AdamW with proper weight decay was developed).
- Learning rate acts as an implicit regularizer: larger learning rates create more noise and find flatter minima, while smaller learning rates allow the optimizer to exploit sharper features.
Understanding implicit regularization is important because it means that even without any explicit regularization techniques, your model is already being regularized by your choice of architecture, optimizer, batch size, and learning rate.
13.13 Regularization in Modern Architectures
13.13.1 Regularization in Transformers
Transformers (which we will cover in detail in Chapter 14) use a specific set of regularization techniques:
- Dropout: Applied to attention weights and feed-forward layers, typically with $p = 0.1$.
- Weight decay: Applied with AdamW, typically $\lambda = 0.01$ to $0.1$.
- Label smoothing: Commonly used with $\alpha = 0.1$.
- Stochastic depth: Randomly dropping entire transformer blocks during training.
- Attention dropout: Dropping elements of the attention weight matrix.
Notably, batch normalization is rarely used in transformers; instead, layer normalization provides normalization without the batch-size-dependent noise.
13.13.2 Regularization in Convolutional Networks
For convolutional neural networks (which we cover in depth in Chapter 14), regularization requires attention to the spatial structure of feature maps:
- Data augmentation is paramount. CNNs benefit enormously from geometric augmentations (rotation, flipping, cropping) that exploit the spatial nature of images. For tasks like medical imaging where labeled data is scarce, augmentation can be worth more than all other regularization techniques combined.
- Dropout2d over standard dropout. As discussed in Section 13.3.3, standard dropout drops individual pixels, which is too fine-grained for correlated spatial features. Dropout2d drops entire feature channels.
- Weight decay should exclude batch normalization parameters. A common practice is to apply weight decay only to convolutional and linear layer weights, not to batch normalization scale and shift parameters. This is because batch normalization parameters serve a different role (controlling the distribution of activations) and constraining them can harm training dynamics.
- Progressive resizing. Training on small images first and gradually increasing resolution acts as a form of curriculum-based regularization, as the model learns coarse features before fine details.
import torch
import torch.nn as nn
torch.manual_seed(42)
# Separate weight decay groups for a CNN
model = nn.Sequential(
nn.Conv2d(3, 32, 3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
)
# Apply weight decay only to conv/linear weights, not to BN scale/shift or biases.
# In an nn.Sequential the parameter names are just "0.weight", "1.weight", ...,
# so we split on dimensionality: normalization parameters and biases are 1-D.
decay_params = []
no_decay_params = []
for name, param in model.named_parameters():
    if param.ndim <= 1:  # BatchNorm weight/bias and all bias vectors
        no_decay_params.append(param)
    else:
        decay_params.append(param)
optimizer = torch.optim.AdamW([
{'params': decay_params, 'weight_decay': 0.01},
{'params': no_decay_params, 'weight_decay': 0.0},
], lr=1e-3)
13.13.3 Regularization in Large Language Models
For very large language models, the regularization landscape shifts dramatically:
- Data is the primary regularizer: With billions of training tokens, explicit regularization becomes less critical.
- Weight decay is still used but with lower values.
- Dropout is often set to zero for very large models because the models are so large that overfitting to the training distribution is less of a concern.
- Data deduplication and quality filtering serve as forms of data-level regularization.
13.13.4 Regularization in Few-Shot and Transfer Learning
When fine-tuning pretrained models on small datasets (as we will discuss in Chapter 15):
- Higher weight decay helps prevent the model from straying too far from the pretrained weights.
- Lower learning rates for pretrained layers serve a similar purpose.
- Freezing layers is an extreme form of regularization---frozen layers cannot overfit at all. A common approach is to freeze all pretrained layers initially, train only the new head for a few epochs, and then gradually unfreeze deeper layers with progressively smaller learning rates (see the sketch after this list).
- Data augmentation becomes even more critical because the dataset is small. Aggressive augmentation can effectively multiply your dataset size by 10x or more.
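A minimal sketch of the freeze-then-unfreeze recipe described in this list; the backbone and head below are small stand-ins for a real pretrained model and its new classification head:
import torch
import torch.nn as nn
torch.manual_seed(42)
# A hypothetical pretrained backbone plus a freshly initialized head.
backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
head = nn.Linear(128, 10)
# Phase 1: freeze the backbone; only the head is trained (and can overfit).
for param in backbone.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3, weight_decay=0.01)
# Phase 2 (later): unfreeze the backbone with a much smaller learning rate.
for param in backbone.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
], weight_decay=0.01)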
13.14 Monitoring Generalization
13.14.1 Essential Metrics
To effectively use regularization, you must monitor the right metrics:
import torch
torch.manual_seed(42)
def compute_generalization_metrics(
train_losses: list[float],
val_losses: list[float],
train_accs: list[float],
val_accs: list[float],
) -> dict[str, float]:
"""Compute metrics for monitoring generalization.
Args:
train_losses: Training losses per epoch.
val_losses: Validation losses per epoch.
train_accs: Training accuracies per epoch.
val_accs: Validation accuracies per epoch.
Returns:
Dictionary of generalization metrics.
"""
metrics = {
"generalization_gap_loss": val_losses[-1] - train_losses[-1],
"generalization_gap_acc": train_accs[-1] - val_accs[-1],
"best_val_loss": min(val_losses),
"best_val_acc": max(val_accs),
"overfitting_ratio": val_losses[-1] / max(train_losses[-1], 1e-8),
}
# Detect if still improving
if len(val_losses) >= 5:
recent_trend = val_losses[-1] - val_losses[-5]
metrics["val_loss_trend_5ep"] = recent_trend
return metrics
13.14.2 Learning Curve Analysis
Plotting learning curves (training and validation loss/accuracy over epochs) is the single most informative diagnostic tool for understanding your model's generalization behavior. The shape of these curves tells you:
- Parallel curves with gap: Overfitting. Add regularization.
- Converging curves: Good generalization. You might be able to train longer.
- Diverging curves: Increasing overfitting. Stop training or add regularization.
- Both curves plateau high: Underfitting. Increase model capacity or reduce regularization.
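A minimal plotting sketch for these curves; the loss values here are made up to show a widening generalization gap, and any real training history can be passed in instead:
import matplotlib.pyplot as plt
def plot_learning_curves(
    train_losses: list[float], val_losses: list[float]
) -> None:
    """Plot training and validation loss to diagnose over- and underfitting."""
    epochs = range(1, len(train_losses) + 1)
    plt.plot(epochs, train_losses, label="train loss")
    plt.plot(epochs, val_losses, label="validation loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.legend()
    plt.show()
plot_learning_curves(
    train_losses=[1.9, 1.2, 0.8, 0.5, 0.3, 0.2],
    val_losses=[2.0, 1.4, 1.1, 1.0, 1.05, 1.15],
)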
13.15 Summary
Regularization is not a single technique but a philosophy: every decision in your training pipeline affects generalization, and your job as an AI engineer is to make choices that produce models which perform well on data they have never seen.
The key takeaways from this chapter:
- Overfitting is the default behavior of modern neural networks. Without regularization, models will memorize the training data.
- Weight decay (L2 regularization) is the most fundamental explicit regularization. Use AdamW, not Adam with an L2 penalty. Typical values: 0.01-0.1.
- Dropout prevents co-adaptation of neurons. Use 0.1-0.5 depending on architecture. Always remember model.train() and model.eval().
- Data augmentation is the most powerful regularization technique because it addresses the root cause: insufficient data diversity. Use automated policies like RandAugment when possible.
- Early stopping is free regularization. Always monitor validation performance and save the best model.
- Label smoothing improves calibration and generalization for classification tasks. Use $\alpha = 0.1$ as a default.
- Mixup and CutMix provide strong regularization by interpolating between training examples. They are standard components of modern training pipelines.
- Batch size affects generalization: smaller batches provide implicit regularization through gradient noise.
- Double descent challenges classical intuitions: overparameterized models can generalize well, and the worst regime is at the interpolation threshold.
- The lottery ticket hypothesis suggests that large networks contain small subnetworks that can achieve comparable performance, connecting regularization to network pruning.
- Combine techniques thoughtfully: Start with data augmentation and weight decay, then add dropout and label smoothing. Adjust based on learning curves.
In the next chapter, we will explore the transformer architecture, which has revolutionized deep learning and requires its own specific regularization strategies that build on everything we have learned here.
References
- Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine learning practice and the bias-variance trade-off. PNAS.
- Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation.
- Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2019). AutoAugment: Learning augmentation strategies from data. CVPR.
- Cubuk, E. D., Zoph, B., Shlens, J., & Le, Q. V. (2020). RandAugment: Practical automated data augmentation with a reduced search space. CVPR Workshops.
- Frankle, J., & Carlin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. ICLR.
- Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. ICML.
- Goyal, P., Dollar, P., Girshick, R., et al. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint.
- Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On large-batch training for deep learning: Generalization gap and sharp minima. ICLR.
- Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. ICLR.
- Muller, S. G., & Hutter, F. (2021). TrivialAugment: Tuning-free yet state-of-the-art data augmentation. ICCV.
- Nakkiran, P., Kaplun, G., Bansal, Y., et al. (2020). Deep double descent: Where bigger models and more data can hurt. ICLR.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. JMLR.
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception architecture for computer vision. CVPR.
- Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). CutMix: Regularization strategy to train strong classifiers with localizable features. ICCV.
- Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. ICLR.
- Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). mixup: Beyond empirical risk minimization. ICLR.