Chapter 13: Exercises
Conceptual Exercises
Exercise 13.1: Overfitting Diagnosis
You are training a neural network and observe the following after 50 epochs:
- Training accuracy: 99.8%
- Validation accuracy: 72.3%
- Training loss: 0.005
- Validation loss: 1.84
(a) Diagnose the problem.
(b) List five specific regularization techniques you would try, ordered by priority.
(c) Explain why each technique would help in this scenario.
Exercise 13.2: Bias-Variance Decomposition
Consider the bias-variance decomposition: $\mathbb{E}[(\hat{f}(x) - y)^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$.
(a) Explain in your own words what each term represents.
(b) Which term does L2 regularization primarily affect? Why?
(c) Which term does data augmentation primarily affect? Why?
(d) Is it possible to reduce all three terms simultaneously? Explain.
Exercise 13.3: L1 vs. L2 Regularization
(a) Sketch the constraint regions for L1 and L2 regularization in 2D weight space.
(b) Explain geometrically why L1 regularization produces sparse solutions.
(c) Under what circumstances would you prefer L1 over L2 in a deep learning project?
(d) Why is L2 regularization more commonly used than L1 in modern deep networks?
Exercise 13.4: Weight Decay and Adam
(a) Explain the difference between L2 regularization and decoupled weight decay.
(b) Why does the standard L2 penalty interact poorly with Adam's adaptive learning rates?
(c) Show mathematically how the SGD update with L2 regularization is equivalent to weight decay.
Exercise 13.5: Dropout Theory
(a) During training with dropout rate $p = 0.5$, what is the expected value of a neuron's output relative to its value without dropout?
(b) Explain why inverted dropout is preferred over standard dropout in practice.
(c) A colleague argues that dropout is just noise injection. Provide at least two additional perspectives on why dropout works.
(d) How does the ensemble interpretation of dropout relate to the number of possible subnetworks in a network with $n$ neurons?
Exercise 13.6: Label Smoothing Mathematics
Given a classification problem with $K = 5$ classes and smoothing parameter $\alpha = 0.2$:
(a) Write out the smoothed label vector for class 2 (0-indexed).
(b) Compute the cross-entropy loss between the smoothed labels and the prediction $[0.1, 0.1, 0.6, 0.1, 0.1]$.
(c) Compare this with the cross-entropy loss using hard labels. Which is lower? Explain why.
(d) What happens as $\alpha \to 1$? Interpret this result.
Exercise 13.7: Mixup Theory
(a) If you apply mixup with $\alpha = 1.0$, what is the distribution of $\lambda$? What does this mean for the mixed examples?
(b) If you apply mixup with $\alpha \to 0$, what happens to the mixing coefficient?
(c) Prove that the expected value of $\lambda$ from a $\text{Beta}(\alpha, \alpha)$ distribution is always 0.5.
(d) Why does mixup encourage linear behavior between training examples?
Exercise 13.8: Double Descent
(a) Describe the three regimes of the double descent curve.
(b) What is the "interpolation threshold" and why does test error peak there?
(c) How does regularization affect the double descent curve?
(d) A colleague claims that you should always use the smallest model that fits your data well. Based on double descent, argue for or against this claim.
Exercise 13.9: Lottery Ticket Hypothesis
(a) State the lottery ticket hypothesis in your own words.
(b) Describe the iterative magnitude pruning algorithm.
(c) How does the lottery ticket hypothesis relate to dropout?
(d) What are the practical implications for model deployment?
Exercise 13.10: Regularization Interactions
Explain how the following pairs of regularization techniques interact:
(a) Dropout and batch normalization
(b) Weight decay and learning rate
(c) Data augmentation and model size
(d) Early stopping and learning rate scheduling
(e) Label smoothing and knowledge distillation
Coding Exercises
Exercise 13.11: Implementing L1 and L2 Regularization
Write a training loop that applies both L1 and L2 regularization manually (without using the weight_decay parameter). Train on a synthetic dataset and compare results with different regularization strengths.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Create a synthetic dataset with noise
# TODO: Define a model
# TODO: Implement training with manual L1/L2 regularization
# TODO: Compare results across lambda values [0, 0.001, 0.01, 0.1]
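Hint: one way to fold the penalties into the loss is sketched below, reusing the imports above. The stand-in model, batch, and lambda values are placeholders for whatever you build in the TODOs; treat this as a starting point, not a complete solution.
model = nn.Linear(10, 1)                          # stand-in model; replace with your own
criterion = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)    # stand-in synthetic batch
l1_lambda, l2_lambda = 0.01, 0.01                 # example strengths; sweep these
l1_penalty = sum(p.abs().sum() for p in model.parameters())
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
loss = criterion(model(x), y) + l1_lambda * l1_penalty + l2_lambda * l2_penalty
loss.backward()                                   # gradients now include both penalty terms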
Exercise 13.12: Dropout Rate Sweep
Implement an experiment that trains the same architecture with different dropout rates and plots the results.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Define a model class with configurable dropout
# TODO: Train with dropout rates [0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
# TODO: Record train/val accuracy for each
# TODO: Plot results showing the effect of dropout rate
Exercise 13.13: Early Stopping Implementation
Implement an EarlyStopping class with the following features:
- Configurable patience
- Configurable minimum improvement delta
- Best model weight restoration
- Logging of best epoch
Test it on a training loop and verify it stops at the right time.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement EarlyStopping class
# TODO: Create a scenario where overfitting occurs
# TODO: Demonstrate that early stopping catches it
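Hint: a minimal skeleton of the patience logic is sketched below, reusing the imports above; the class and method names are placeholders, and logging of the best epoch plus the stopping check inside the training loop are left to you.
import copy
class EarlyStoppingSketch:
    def __init__(self, patience=5, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best_loss = float("inf")
        self.counter = 0
        self.best_state = None
    def step(self, val_loss, model):
        # Returns True when training should stop.
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            self.best_state = copy.deepcopy(model.state_dict())  # remember best weights
            return False
        self.counter += 1
        return self.counter >= self.patience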
Exercise 13.14: Data Augmentation Pipeline
Create three different augmentation pipelines (weak, medium, strong) for CIFAR-10-style images. Train the same model with each and compare generalization.
import torch
from torchvision import transforms
torch.manual_seed(42)
# TODO: Define weak augmentation (only horizontal flip)
# TODO: Define medium augmentation (flip + crop + color jitter)
# TODO: Define strong augmentation (medium + random erasing + auto augment)
# TODO: Train and compare results
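Hint: the three pipelines could be composed roughly as follows, reusing the imports above; the specific jitter magnitudes and erasing probability are arbitrary choices, and AutoAugment assumes a reasonably recent torchvision.
weak = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
medium = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])
strong = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.AutoAugment(transforms.AutoAugmentPolicy.CIFAR10),  # applied before ToTensor
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),                              # operates on tensors
])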
Exercise 13.15: Mixup Implementation
Implement mixup training from scratch and demonstrate its effect on a classification task.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement mixup_data function
# TODO: Implement mixup_criterion function
# TODO: Train with and without mixup
# TODO: Compare generalization gaps
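Hint: the core interpolation can be sketched as follows, reusing the imports above; the function name matches the TODO, and the mixed loss is built from the two targets and the returned coefficient.
def mixup_data(x, y, alpha=1.0):
    # Sample lambda from Beta(alpha, alpha) and mix each example with a shuffled partner.
    lam = torch.distributions.Beta(alpha, alpha).sample().item() if alpha > 0 else 1.0
    index = torch.randperm(x.size(0))
    mixed_x = lam * x + (1 - lam) * x[index]
    return mixed_x, y, y[index], lam
# mixup_criterion then returns lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b).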
Exercise 13.16: CutMix Implementation
Implement CutMix from scratch, including the random bounding box generation.
import torch
torch.manual_seed(42)
# TODO: Implement random_bbox function
# TODO: Implement cutmix_data function
# TODO: Apply to image classification
# TODO: Visualize CutMix examples
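Hint: the bounding-box sampling can be sketched as follows, reusing the import above; the cut area fraction is approximately 1 - lambda, and after pasting the patch you should recompute lambda from the actual box area.
def random_bbox(height, width, lam):
    # Sample a box whose area is roughly (1 - lam) of the image.
    cut_ratio = (1.0 - lam) ** 0.5
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    cy, cx = torch.randint(height, (1,)).item(), torch.randint(width, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)
    return y1, y2, x1, x2
# cutmix_data pastes x[index, :, y1:y2, x1:x2] into x and sets lam = 1 - box_area / (height * width).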
Exercise 13.17: Label Smoothing
Implement label smoothing cross-entropy loss from scratch (without using PyTorch's built-in label_smoothing parameter). Verify your implementation matches PyTorch's output.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement LabelSmoothingLoss class
# TODO: Compare with nn.CrossEntropyLoss(label_smoothing=0.1)
# TODO: Verify outputs match within numerical tolerance
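Hint: one formulation that follows PyTorch's convention (smoothed target = (1 - alpha) * one-hot + alpha / K uniform) is sketched below, reusing the imports above; the class name is a placeholder.
import torch.nn.functional as F
class LabelSmoothingLossSketch(nn.Module):
    def __init__(self, num_classes, alpha=0.1):
        super().__init__()
        self.num_classes, self.alpha = num_classes, alpha
    def forward(self, logits, target):
        log_probs = F.log_softmax(logits, dim=-1)
        smooth = torch.full_like(log_probs, self.alpha / self.num_classes)
        smooth.scatter_(1, target.unsqueeze(1), 1.0 - self.alpha + self.alpha / self.num_classes)
        return -(smooth * log_probs).sum(dim=-1).mean()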
Exercise 13.18: Weight Pruning
Implement magnitude-based weight pruning and measure its effect on model accuracy and sparsity.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Train a model to convergence
# TODO: Prune 10%, 30%, 50%, 70%, 90% of weights
# TODO: Evaluate accuracy at each sparsity level
# TODO: Plot accuracy vs sparsity curve
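Hint: a global magnitude-pruning step can be sketched as follows, reusing the imports above; it zeroes weights in place and ignores biases. In practice you would keep the masks so pruned weights stay zero during any fine-tuning.
@torch.no_grad()
def magnitude_prune(model, sparsity):
    # Zero out the smallest-magnitude weights so that a `sparsity` fraction becomes zero.
    weights = torch.cat([p.abs().flatten() for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(weights, sparsity)
    for p in model.parameters():
        if p.dim() > 1:
            p.mul_((p.abs() > threshold).float())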
Exercise 13.19: Gradient Accumulation
Implement gradient accumulation to simulate large-batch training with limited memory. Compare its results with actual large-batch training.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement training with gradient accumulation
# TODO: Compare with actual large batch training
# TODO: Verify gradients are (approximately) equivalent
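Hint: the accumulation pattern is sketched below with a stand-in model and synthetic micro-batches, reusing the imports above; the key detail is dividing each micro-batch loss by the number of accumulation steps.
model = nn.Linear(10, 1)                         # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]
accum_steps = 4                                  # 4 micro-batches of 8 ~ one batch of 32
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = criterion(model(x), y) / accum_steps  # scale so summed gradients match the large batch
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()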
Exercise 13.20: Regularization Ablation Study
Perform an ablation study that starts with a fully regularized model and removes one technique at a time to measure its individual contribution.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Train with all regularization: dropout, weight decay,
# label smoothing, data augmentation, early stopping
# TODO: Remove one at a time and retrain
# TODO: Record the change in validation accuracy
# TODO: Rank techniques by their individual contribution
Advanced Exercises
Exercise 13.21: Visualizing the Loss Landscape
Use random direction perturbation to visualize a 1D slice of the loss landscape. Compare landscapes for models trained with and without regularization.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Train a model with and without weight decay
# TODO: Generate random direction vectors
# TODO: Evaluate loss at points along the direction
# TODO: Plot 1D loss landscape slices for both models
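Hint: evaluating the loss along one random direction can be sketched as follows, reusing the imports above; here `direction` is a list of tensors such as [torch.randn_like(p) for p in model.parameters()], and the original weights are restored afterwards.
@torch.no_grad()
def loss_along_direction(model, criterion, x, y, direction, alphas):
    # Evaluate the loss at theta + alpha * d for each alpha, then restore the weights.
    original = [p.clone() for p in model.parameters()]
    losses = []
    for alpha in alphas:
        for p, p0, d in zip(model.parameters(), original, direction):
            p.copy_(p0 + alpha * d)
        losses.append(criterion(model(x), y).item())
    for p, p0 in zip(model.parameters(), original):
        p.copy_(p0)
    return losses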
Exercise 13.22: Measuring Double Descent
Create a controlled experiment to observe double descent by training models of increasing width on a fixed-size dataset.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Create a small fixed dataset
# TODO: Train models with widths [4, 8, 16, 32, 64, 128, 256, 512, 1024]
# TODO: Record final train and test errors for each width
# TODO: Plot to observe double descent
Exercise 13.23: Monte Carlo Dropout for Uncertainty
Implement Monte Carlo Dropout to estimate prediction uncertainty.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Train a model with dropout
# TODO: At inference, keep dropout active
# TODO: Run 50 forward passes for each test input
# TODO: Compute mean and variance of predictions
# TODO: Show that uncertain inputs have higher variance
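Hint: the inference-time part can be sketched as follows, reusing the imports above; it assumes a classifier whose outputs are logits, and it re-enables only the dropout layers while everything else stays in eval mode.
def mc_dropout_predict(model, x, n_samples=50):
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()                              # keep dropout stochastic at inference
    with torch.no_grad():
        preds = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)     # predictive mean and variance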
Exercise 13.24: Data Augmentation for Tabular Data
Design and implement a data augmentation strategy for tabular data including:
- Gaussian noise injection for continuous features
- Feature dropout (randomly zeroing features)
- SMOTE-like interpolation between samples
import torch
torch.manual_seed(42)
# TODO: Implement TabularAugmentation class
# TODO: Apply to a synthetic tabular dataset
# TODO: Compare model performance with and without augmentation
Exercise 13.25: Implementing Stochastic Depth
Implement stochastic depth (randomly dropping entire residual blocks during training) for a ResNet-style model.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement StochasticDepthBlock
# TODO: Build a model with stochastic depth
# TODO: Compare with standard ResNet training
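Hint: a toy block is sketched below, reusing the imports above, with a small linear branch standing in for the convolutional branch of a real ResNet; during training the branch is skipped with probability 1 - survival_prob, and at test time its output is scaled by survival_prob.
class StochasticDepthBlockSketch(nn.Module):
    def __init__(self, dim, survival_prob=0.8):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.survival_prob = survival_prob
    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return x + self.branch(x)
            return x                                    # entire branch dropped this pass
        return x + self.survival_prob * self.branch(x)  # expected-depth scaling at test time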
Exercise 13.26: Spectral Normalization
Implement spectral normalization, which rescales each weight matrix by an estimate of its largest singular value (its spectral norm).
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement spectral normalization using power iteration
# TODO: Apply to a discriminator network
# TODO: Compare training stability with/without spectral normalization
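Hint: the power-iteration estimate of the spectral norm can be sketched as follows, reusing the imports above plus torch.nn.functional; dividing the weight by this estimate before the forward pass gives the normalization, and PyTorch's built-in nn.utils.spectral_norm is a useful reference for comparison.
import torch.nn.functional as F
@torch.no_grad()
def spectral_norm_estimate(weight, n_iters=5):
    # Estimate the largest singular value of a 2D weight matrix by power iteration.
    u = F.normalize(torch.randn(weight.size(0)), dim=0)
    for _ in range(n_iters):
        v = F.normalize(weight.t() @ u, dim=0)
        u = F.normalize(weight @ v, dim=0)
    return u @ weight @ v                 # sigma ~ u^T W v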
Exercise 13.27: R-Drop Regularization
Implement R-Drop, which penalizes the symmetric KL divergence between the output distributions of two forward passes of the same input (each with a different dropout mask).
import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(42)
# TODO: Implement R-Drop loss
# TODO: Integrate into a training loop
# TODO: Compare with standard dropout training
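Hint: the loss can be sketched as follows, reusing the imports above; the model must be in train mode so the two passes use different dropout masks, and kl_weight scales the consistency term.
def r_drop_loss(model, x, target, kl_weight=1.0):
    logits1, logits2 = model(x), model(x)          # two passes, two dropout masks
    ce = F.cross_entropy(logits1, target) + F.cross_entropy(logits2, target)
    lp1, lp2 = F.log_softmax(logits1, dim=-1), F.log_softmax(logits2, dim=-1)
    kl = F.kl_div(lp1, lp2, log_target=True, reduction="batchmean") \
       + F.kl_div(lp2, lp1, log_target=True, reduction="batchmean")
    return ce + kl_weight * 0.5 * kl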
Exercise 13.28: Shake-Shake Regularization
Implement Shake-Shake regularization for a multi-branch residual network where forward and backward passes use different random interpolation coefficients.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement ShakeShake module
# TODO: Use different alpha for forward and backward
# TODO: Build a simple multi-branch network
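Hint: the forward/backward decoupling is most easily expressed with a custom autograd function, sketched below reusing the import above; alpha is drawn in the forward pass and an independent beta in the backward pass, and at evaluation time you would simply average the branches with a fixed 0.5.
class ShakeShakeFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x1, x2):
        alpha = torch.rand(1, device=x1.device)          # forward mixing coefficient
        return alpha * x1 + (1 - alpha) * x2
    @staticmethod
    def backward(ctx, grad_output):
        beta = torch.rand(1, device=grad_output.device)  # independent backward coefficient
        return beta * grad_output, (1 - beta) * grad_output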
Research-Oriented Exercises
Exercise 13.29: Regularization and Memorization
Design an experiment inspired by Zhang et al. (2017) to test whether a model can memorize random labels.
(a) Train on true labels with no regularization. Record convergence time.
(b) Train on random labels with no regularization. Record convergence time.
(c) Train on random labels with dropout (0.5). Does it still memorize?
(d) Train on random labels with weight decay (0.1). Does it still memorize?
(e) What do these results tell you about the nature of regularization?
Exercise 13.30: Adaptive Regularization Scheduling
Design a system that automatically adjusts regularization strength based on the generalization gap during training.
import torch
torch.manual_seed(42)
# TODO: Implement AdaptiveRegularizer class
# TODO: Monitor train/val gap
# TODO: Increase regularization when gap grows
# TODO: Decrease regularization when gap shrinks
# TODO: Compare with fixed regularization
Exercise 13.31: Cross-Domain Augmentation Transfer
Train models with a fixed augmentation policy on one domain and test whether the generalization benefits of that augmentation transfer to a related domain.
Exercise 13.32: Combining Mixup with Label Smoothing
(a) Mathematically derive the effective label distribution when combining mixup (Beta parameter $\alpha = 0.2$) with label smoothing (smoothing parameter $\alpha = 0.1$) for a 10-class problem; note that the two $\alpha$ values are separate hyperparameters.
(b) Implement this combination and compare with each technique alone.
(c) Do the benefits stack, or are there diminishing returns?
Exercise 13.33: Pruning Schedule Optimization
Implement and compare three pruning schedules:
(a) One-shot pruning at the end of training.
(b) Gradual magnitude pruning during training.
(c) Lottery ticket iterative pruning.
Compare final accuracy and training time for each approach at 90% sparsity.
Exercise 13.34: Noise as Regularization
Implement three forms of noise-based regularization and compare their effects:
(a) Input noise (Gaussian noise added to inputs)
(b) Weight noise (Gaussian noise added to weights during training)
(c) Gradient noise (Gaussian noise added to gradients)
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement InputNoise, WeightNoise, GradientNoise modules
# TODO: Train with each and compare generalization
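Hint: the input-noise variant can be sketched as a module, reusing the imports above; weight noise and gradient noise follow the same pattern, perturbing the parameters before the forward pass or the gradients just before optimizer.step().
class GaussianInputNoise(nn.Module):
    def __init__(self, std=0.1):
        super().__init__()
        self.std = std
    def forward(self, x):
        if self.training:
            return x + self.std * torch.randn_like(x)  # noise only during training
        return x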
Exercise 13.35: Regularization Under Distribution Shift
Test how different regularization techniques perform when the test distribution differs from the training distribution.
(a) Train on clean data, test on corrupted data (Gaussian noise, blur, contrast changes).
(b) Which regularization techniques improve robustness to distribution shift?
(c) Which regularization techniques only help with in-distribution generalization?