Chapter 13: Exercises
Conceptual Exercises
Exercise 13.1: Overfitting Diagnosis
You are training a neural network and observe the following after 50 epochs:
- Training accuracy: 99.8%
- Validation accuracy: 72.3%
- Training loss: 0.005
- Validation loss: 1.84
(a) Diagnose the problem.
(b) List five specific regularization techniques you would try, ordered by priority.
(c) Explain why each technique would help in this scenario.
Exercise 13.2: Bias-Variance Decomposition
Consider the bias-variance decomposition: $\mathbb{E}[(\hat{f}(x) - y)^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$.
(a) Explain in your own words what each term represents.
(b) Which term does L2 regularization primarily affect? Why?
(c) Which term does data augmentation primarily affect? Why?
(d) Is it possible to reduce all three terms simultaneously? Explain.
Exercise 13.3: L1 vs. L2 Regularization
(a) Sketch the constraint regions for L1 and L2 regularization in 2D weight space.
(b) Explain geometrically why L1 regularization produces sparse solutions.
(c) Under what circumstances would you prefer L1 over L2 in a deep learning project?
(d) Why is L2 regularization more commonly used than L1 in modern deep networks?
Exercise 13.4: Weight Decay and Adam
(a) Explain the difference between L2 regularization and decoupled weight decay.
(b) Why does the standard L2 penalty interact poorly with Adam's adaptive learning rates?
(c) Show mathematically how the SGD update with L2 regularization is equivalent to weight decay.
Exercise 13.5: Dropout Theory
(a) During training with dropout rate $p = 0.5$, what is the expected value of a neuron's output relative to its value without dropout?
(b) Explain why inverted dropout is preferred over standard dropout in practice.
(c) A colleague argues that dropout is just noise injection. Provide at least two additional perspectives on why dropout works.
(d) How does the ensemble interpretation of dropout relate to the number of possible subnetworks in a network with $n$ neurons?
Exercise 13.6: Label Smoothing Mathematics
Given a classification problem with $K = 5$ classes and smoothing parameter $\alpha = 0.2$:
(a) Write out the smoothed label vector for class 2 (0-indexed).
(b) Compute the cross-entropy loss between the smoothed labels and the prediction $[0.1, 0.1, 0.6, 0.1, 0.1]$.
(c) Compare this with the cross-entropy loss using hard labels. Which is lower? Explain why.
(d) What happens as $\alpha \to 1$? Interpret this result.
Exercise 13.7: Mixup Theory
(a) If you apply mixup with $\alpha = 1.0$, what is the distribution of $\lambda$? What does this mean for the mixed examples?
(b) If you apply mixup with $\alpha \to 0$, what happens to the mixing coefficient?
(c) Prove that the expected value of $\lambda$ from a $\text{Beta}(\alpha, \alpha)$ distribution is always 0.5.
(d) Why does mixup encourage linear behavior between training examples?
Exercise 13.8: Double Descent
(a) Describe the three regimes of the double descent curve.
(b) What is the "interpolation threshold" and why does test error peak there?
(c) How does regularization affect the double descent curve?
(d) A colleague claims that you should always use the smallest model that fits your data well. Based on double descent, argue for or against this claim.
Exercise 13.9: Lottery Ticket Hypothesis
(a) State the lottery ticket hypothesis in your own words.
(b) Describe the iterative magnitude pruning algorithm.
(c) How does the lottery ticket hypothesis relate to dropout?
(d) What are the practical implications for model deployment?
Exercise 13.10: Regularization Interactions
Explain how the following pairs of regularization techniques interact:
(a) Dropout and batch normalization
(b) Weight decay and learning rate
(c) Data augmentation and model size
(d) Early stopping and learning rate scheduling
(e) Label smoothing and knowledge distillation
Coding Exercises
Exercise 13.11: Implementing L1 and L2 Regularization
Write a training loop that applies both L1 and L2 regularization manually (without using the weight_decay parameter). Train on a synthetic dataset and compare results with different regularization strengths.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Create a synthetic dataset with noise
# TODO: Define a model
# TODO: Implement training with manual L1/L2 regularization
# TODO: Compare results across lambda values [0, 0.001, 0.01, 0.1]
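Hint: one way to fold the penalties into the loss is sketched below, reusing the imports above. The stand-in model, batch, and lambda values are placeholders for whatever you build in the TODOs; treat this as a starting point, not a complete solution.
model = nn.Linear(10, 1)                          # stand-in model; replace with your own
criterion = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)    # stand-in synthetic batch
l1_lambda, l2_lambda = 0.01, 0.01                 # example strengths; sweep these
l1_penalty = sum(p.abs().sum() for p in model.parameters())
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
loss = criterion(model(x), y) + l1_lambda * l1_penalty + l2_lambda * l2_penalty
loss.backward()                                   # gradients now include both penalty terms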
Exercise 13.12: Dropout Rate Sweep
Implement an experiment that trains the same architecture with different dropout rates and plots the results.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Define a model class with configurable dropout
# TODO: Train with dropout rates [0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
# TODO: Record train/val accuracy for each
# TODO: Plot results showing the effect of dropout rate
Exercise 13.13: Early Stopping Implementation
Implement an EarlyStopping class with the following features:
- Configurable patience
- Configurable minimum improvement delta
- Best model weight restoration
- Logging of best epoch
Test it on a training loop and verify it stops at the right time.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement EarlyStopping class
# TODO: Create a scenario where overfitting occurs
# TODO: Demonstrate that early stopping catches it
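Hint: a minimal skeleton of the patience logic is sketched below, reusing the imports above; the class and method names are placeholders, and logging of the best epoch plus the stopping check inside the training loop are left to you.
import copy
class EarlyStoppingSketch:
    def __init__(self, patience=5, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best_loss = float("inf")
        self.counter = 0
        self.best_state = None
    def step(self, val_loss, model):
        # Returns True when training should stop.
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            self.best_state = copy.deepcopy(model.state_dict())  # remember best weights
            return False
        self.counter += 1
        return self.counter >= self.patience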
Exercise 13.14: Data Augmentation Pipeline
Create three different augmentation pipelines (weak, medium, strong) for CIFAR-10-style images. Train the same model with each and compare generalization.
import torch
from torchvision import transforms
torch.manual_seed(42)
# TODO: Define weak augmentation (only horizontal flip)
# TODO: Define medium augmentation (flip + crop + color jitter)
# TODO: Define strong augmentation (medium + random erasing + auto augment)
# TODO: Train and compare results
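Hint: the three pipelines could be composed roughly as follows, reusing the imports above; the specific jitter magnitudes and erasing probability are arbitrary choices, and AutoAugment assumes a reasonably recent torchvision.
weak = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
medium = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])
strong = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.AutoAugment(transforms.AutoAugmentPolicy.CIFAR10),  # applied before ToTensor
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),                              # operates on tensors
])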
Exercise 13.15: Mixup Implementation
Implement mixup training from scratch and demonstrate its effect on a classification task.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement mixup_data function
# TODO: Implement mixup_criterion function
# TODO: Train with and without mixup
# TODO: Compare generalization gaps
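Hint: the core interpolation can be sketched as follows, reusing the imports above; the function name matches the TODO, and the mixed loss is built from the two targets and the returned coefficient.
def mixup_data(x, y, alpha=1.0):
    # Sample lambda from Beta(alpha, alpha) and mix each example with a shuffled partner.
    lam = torch.distributions.Beta(alpha, alpha).sample().item() if alpha > 0 else 1.0
    index = torch.randperm(x.size(0))
    mixed_x = lam * x + (1 - lam) * x[index]
    return mixed_x, y, y[index], lam
# mixup_criterion then returns lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b).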
Exercise 13.16: CutMix Implementation
Implement CutMix from scratch, including the random bounding box generation.
import torch
torch.manual_seed(42)
# TODO: Implement random_bbox function
# TODO: Implement cutmix_data function
# TODO: Apply to image classification
# TODO: Visualize CutMix examples
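Hint: the bounding-box sampling can be sketched as follows, reusing the import above; the cut area fraction is approximately 1 - lambda, and after pasting the patch you should recompute lambda from the actual box area.
def random_bbox(height, width, lam):
    # Sample a box whose area is roughly (1 - lam) of the image.
    cut_ratio = (1.0 - lam) ** 0.5
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    cy, cx = torch.randint(height, (1,)).item(), torch.randint(width, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)
    return y1, y2, x1, x2
# cutmix_data pastes x[index, :, y1:y2, x1:x2] into x and sets lam = 1 - box_area / (height * width).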
Exercise 13.17: Label Smoothing
Implement label smoothing cross-entropy loss from scratch (without using PyTorch's built-in label_smoothing parameter). Verify your implementation matches PyTorch's output.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement LabelSmoothingLoss class
# TODO: Compare with nn.CrossEntropyLoss(label_smoothing=0.1)
# TODO: Verify outputs match within numerical tolerance
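Hint: one formulation that follows PyTorch's convention (smoothed target = (1 - alpha) * one-hot + alpha / K uniform) is sketched below, reusing the imports above; the class name is a placeholder.
import torch.nn.functional as F
class LabelSmoothingLossSketch(nn.Module):
    def __init__(self, num_classes, alpha=0.1):
        super().__init__()
        self.num_classes, self.alpha = num_classes, alpha
    def forward(self, logits, target):
        log_probs = F.log_softmax(logits, dim=-1)
        smooth = torch.full_like(log_probs, self.alpha / self.num_classes)
        smooth.scatter_(1, target.unsqueeze(1), 1.0 - self.alpha + self.alpha / self.num_classes)
        return -(smooth * log_probs).sum(dim=-1).mean()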
Exercise 13.18: Weight Pruning
Implement magnitude-based weight pruning and measure its effect on model accuracy and sparsity.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Train a model to convergence
# TODO: Prune 10%, 30%, 50%, 70%, 90% of weights
# TODO: Evaluate accuracy at each sparsity level
# TODO: Plot accuracy vs sparsity curve
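Hint: a global magnitude-pruning step can be sketched as follows, reusing the imports above; it zeroes weights in place and ignores biases. In practice you would keep the masks so pruned weights stay zero during any fine-tuning.
@torch.no_grad()
def magnitude_prune(model, sparsity):
    # Zero out the smallest-magnitude weights so that a `sparsity` fraction becomes zero.
    weights = torch.cat([p.abs().flatten() for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(weights, sparsity)
    for p in model.parameters():
        if p.dim() > 1:
            p.mul_((p.abs() > threshold).float())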
Exercise 13.19: Gradient Accumulation
Implement gradient accumulation to simulate large-batch training with limited memory. Compare its results with actual large-batch training.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement training with gradient accumulation
# TODO: Compare with actual large batch training
# TODO: Verify gradients are (approximately) equivalent
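Hint: the accumulation pattern is sketched below with a stand-in model and synthetic micro-batches, reusing the imports above; the key detail is dividing each micro-batch loss by the number of accumulation steps.
model = nn.Linear(10, 1)                         # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]
accum_steps = 4                                  # 4 micro-batches of 8 ~ one batch of 32
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = criterion(model(x), y) / accum_steps  # scale so summed gradients match the large batch
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()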
Exercise 13.20: Regularization Ablation Study
Perform an ablation study that starts with a fully regularized model and removes one technique at a time to measure its individual contribution.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Train with all regularization: dropout, weight decay,
# label smoothing, data augmentation, early stopping
# TODO: Remove one at a time and retrain
# TODO: Record the change in validation accuracy
# TODO: Rank techniques by their individual contribution
Advanced Exercises
Exercise 13.21: Visualizing the Loss Landscape
Use random direction perturbation to visualize a 1D slice of the loss landscape. Compare landscapes for models trained with and without regularization.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Train a model with and without weight decay
# TODO: Generate random direction vectors
# TODO: Evaluate loss at points along the direction
# TODO: Plot 1D loss landscape slices for both models
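Hint: evaluating the loss along one random direction can be sketched as follows, reusing the imports above; here `direction` is a list of tensors such as [torch.randn_like(p) for p in model.parameters()], and the original weights are restored afterwards.
@torch.no_grad()
def loss_along_direction(model, criterion, x, y, direction, alphas):
    # Evaluate the loss at theta + alpha * d for each alpha, then restore the weights.
    original = [p.clone() for p in model.parameters()]
    losses = []
    for alpha in alphas:
        for p, p0, d in zip(model.parameters(), original, direction):
            p.copy_(p0 + alpha * d)
        losses.append(criterion(model(x), y).item())
    for p, p0 in zip(model.parameters(), original):
        p.copy_(p0)
    return losses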
Exercise 13.22: Measuring Double Descent
Create a controlled experiment to observe double descent by training models of increasing width on a fixed-size dataset.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Create a small fixed dataset
# TODO: Train models with widths [4, 8, 16, 32, 64, 128, 256, 512, 1024]
# TODO: Record final train and test errors for each width
# TODO: Plot to observe double descent
Exercise 13.23: Monte Carlo Dropout for Uncertainty
Implement Monte Carlo Dropout to estimate prediction uncertainty.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Train a model with dropout
# TODO: At inference, keep dropout active
# TODO: Run 50 forward passes for each test input
# TODO: Compute mean and variance of predictions
# TODO: Show that uncertain inputs have higher variance
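Hint: the inference-time part can be sketched as follows, reusing the imports above; it assumes a classifier whose outputs are logits, and it re-enables only the dropout layers while everything else stays in eval mode.
def mc_dropout_predict(model, x, n_samples=50):
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()                              # keep dropout stochastic at inference
    with torch.no_grad():
        preds = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)     # predictive mean and variance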
Exercise 13.24: Data Augmentation for Tabular Data
Design and implement a data augmentation strategy for tabular data including:
- Gaussian noise injection for continuous features
- Feature dropout (randomly zeroing features)
- SMOTE-like interpolation between samples
import torch
torch.manual_seed(42)
# TODO: Implement TabularAugmentation class
# TODO: Apply to a synthetic tabular dataset
# TODO: Compare model performance with and without augmentation
Exercise 13.25: Implementing Stochastic Depth
Implement stochastic depth (randomly dropping entire residual blocks during training) for a ResNet-style model.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement StochasticDepthBlock
# TODO: Build a model with stochastic depth
# TODO: Compare with standard ResNet training
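Hint: a toy block is sketched below, reusing the imports above, with a small linear branch standing in for the convolutional branch of a real ResNet; during training the branch is skipped with probability 1 - survival_prob, and at test time its output is scaled by survival_prob.
class StochasticDepthBlockSketch(nn.Module):
    def __init__(self, dim, survival_prob=0.8):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.survival_prob = survival_prob
    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return x + self.branch(x)
            return x                                    # entire branch dropped this pass
        return x + self.survival_prob * self.branch(x)  # expected-depth scaling at test time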
Exercise 13.26: Spectral Normalization
Implement spectral normalization, which rescales each weight matrix by an estimate of its largest singular value (its spectral norm).
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement spectral normalization using power iteration
# TODO: Apply to a discriminator network
# TODO: Compare training stability with/without spectral normalization
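Hint: the power-iteration estimate of the spectral norm can be sketched as follows, reusing the imports above plus torch.nn.functional; dividing the weight by this estimate before the forward pass gives the normalization, and PyTorch's built-in nn.utils.spectral_norm is a useful reference for comparison.
import torch.nn.functional as F
@torch.no_grad()
def spectral_norm_estimate(weight, n_iters=5):
    # Estimate the largest singular value of a 2D weight matrix by power iteration.
    u = F.normalize(torch.randn(weight.size(0)), dim=0)
    for _ in range(n_iters):
        v = F.normalize(weight.t() @ u, dim=0)
        u = F.normalize(weight @ v, dim=0)
    return u @ weight @ v                 # sigma ~ u^T W v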
Exercise 13.27: R-Drop Regularization
Implement R-Drop, which penalizes the symmetric KL divergence between the output distributions of two forward passes of the same input (each with a different dropout mask).
import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(42)
# TODO: Implement R-Drop loss
# TODO: Integrate into a training loop
# TODO: Compare with standard dropout training
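Hint: the loss can be sketched as follows, reusing the imports above; the model must be in train mode so the two passes use different dropout masks, and kl_weight scales the consistency term.
def r_drop_loss(model, x, target, kl_weight=1.0):
    logits1, logits2 = model(x), model(x)          # two passes, two dropout masks
    ce = F.cross_entropy(logits1, target) + F.cross_entropy(logits2, target)
    lp1, lp2 = F.log_softmax(logits1, dim=-1), F.log_softmax(logits2, dim=-1)
    kl = F.kl_div(lp1, lp2, log_target=True, reduction="batchmean") \
       + F.kl_div(lp2, lp1, log_target=True, reduction="batchmean")
    return ce + kl_weight * 0.5 * kl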
Exercise 13.28: Shake-Shake Regularization
Implement Shake-Shake regularization for a multi-branch residual network where forward and backward passes use different random interpolation coefficients.
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement ShakeShake module
# TODO: Use different alpha for forward and backward
# TODO: Build a simple multi-branch network
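Hint: the forward/backward decoupling is most easily expressed with a custom autograd function, sketched below reusing the import above; alpha is drawn in the forward pass and an independent beta in the backward pass, and at evaluation time you would simply average the branches with a fixed 0.5.
class ShakeShakeFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x1, x2):
        alpha = torch.rand(1, device=x1.device)          # forward mixing coefficient
        return alpha * x1 + (1 - alpha) * x2
    @staticmethod
    def backward(ctx, grad_output):
        beta = torch.rand(1, device=grad_output.device)  # independent backward coefficient
        return beta * grad_output, (1 - beta) * grad_output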
Research-Oriented Exercises
Exercise 13.29: Regularization and Memorization
Design an experiment inspired by Zhang et al. (2017) to test whether a model can memorize random labels.
(a) Train on true labels with no regularization. Record convergence time.
(b) Train on random labels with no regularization. Record convergence time.
(c) Train on random labels with dropout (0.5). Does it still memorize?
(d) Train on random labels with weight decay (0.1). Does it still memorize?
(e) What do these results tell you about the nature of regularization?
Exercise 13.30: Adaptive Regularization Scheduling
Design a system that automatically adjusts regularization strength based on the generalization gap during training.
import torch
torch.manual_seed(42)
# TODO: Implement AdaptiveRegularizer class
# TODO: Monitor train/val gap
# TODO: Increase regularization when gap grows
# TODO: Decrease regularization when gap shrinks
# TODO: Compare with fixed regularization
Exercise 13.31: Cross-Domain Augmentation Transfer
Train models with a fixed augmentation policy on one domain and test whether the generalization benefits of that augmentation transfer to a related domain.
Exercise 13.32: Combining Mixup with Label Smoothing
(a) Mathematically derive the effective label distribution when combining mixup (Beta parameter $\alpha = 0.2$) with label smoothing (smoothing parameter $\alpha = 0.1$) for a 10-class problem; note that the two $\alpha$ values are separate hyperparameters.
(b) Implement this combination and compare with each technique alone.
(c) Do the benefits stack, or are there diminishing returns?
Exercise 13.33: Pruning Schedule Optimization
Implement and compare three pruning schedules:
(a) One-shot pruning at the end of training.
(b) Gradual magnitude pruning during training.
(c) Lottery ticket iterative pruning.
Compare final accuracy and training time for each approach at 90% sparsity.
Exercise 13.34: Noise as Regularization
Implement three forms of noise-based regularization and compare their effects:
(a) Input noise (Gaussian noise added to inputs)
(b) Weight noise (Gaussian noise added to weights during training)
(c) Gradient noise (Gaussian noise added to gradients)
import torch
import torch.nn as nn
torch.manual_seed(42)
# TODO: Implement InputNoise, WeightNoise, GradientNoise modules
# TODO: Train with each and compare generalization
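Hint: the input-noise variant can be sketched as a module, reusing the imports above; weight noise and gradient noise follow the same pattern, perturbing the parameters before the forward pass or the gradients just before optimizer.step().
class GaussianInputNoise(nn.Module):
    def __init__(self, std=0.1):
        super().__init__()
        self.std = std
    def forward(self, x):
        if self.training:
            return x + self.std * torch.randn_like(x)  # noise only during training
        return x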
Exercise 13.35: Regularization Under Distribution Shift
Test how different regularization techniques perform when the test distribution differs from the training distribution.
(a) Train on clean data, test on corrupted data (Gaussian noise, blur, contrast changes).
(b) Which regularization techniques improve robustness to distribution shift?
(c) Which regularization techniques only help with in-distribution generalization?