Chapter 12: Exercises

These exercises progress from foundational concepts to advanced implementation challenges. They are designed to reinforce the key ideas from the chapter and build practical skills in training deep networks with PyTorch.


Conceptual Exercises

Exercise 12.1: Loss Function Selection

For each of the following scenarios, choose the most appropriate loss function and justify your choice:

  1. Predicting house prices from features (square footage, bedrooms, etc.)
  2. Classifying images into 1000 ImageNet categories
  3. Detecting whether an email is spam or not spam
  4. Object detection where 99.9% of anchor boxes are background
  5. Training a student model to mimic a teacher model's output distribution
  6. Predicting multiple labels per image (e.g., "outdoor", "sunny", "beach" can all be true)

Exercise 12.2: Numerical Stability

Explain why computing torch.log(torch.softmax(logits, dim=-1)) is numerically inferior to torch.nn.functional.log_softmax(logits, dim=-1). Provide a concrete example with extreme logit values that demonstrates the difference.
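A minimal setup for the comparison, as a starting point (the logit values below are only illustrative; part of the exercise is choosing your own extreme example):

import torch
import torch.nn.functional as F

# Illustrative extreme logits; replace with your own example.
logits = torch.tensor([[0.0, -1000.0, 1000.0]])

naive = torch.log(torch.softmax(logits, dim=-1))  # two-step: softmax, then log
stable = F.log_softmax(logits, dim=-1)            # fused, numerically stable

print(naive)   # look for -inf entries
print(stable)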

Exercise 12.3: Adam vs. SGD

A colleague claims that "Adam always converges faster than SGD, so there is no reason to use SGD." Critically evaluate this claim. In your answer, discuss:

  1. Convergence speed vs. generalization quality
  2. Memory requirements of each optimizer
  3. Scenarios where SGD with momentum outperforms Adam
  4. The role of learning rate schedules in SGD's performance

Exercise 12.4: Batch Norm vs. Layer Norm

Explain why batch normalization is problematic for:

  1. Very small batch sizes (e.g., batch size = 1)
  2. Recurrent neural networks processing variable-length sequences
  3. Inference-time behavior when batch statistics differ from training

For each case, explain how layer normalization addresses the issue.

Exercise 12.5: Weight Initialization Derivation

Starting from the assumption that we want $\text{Var}(h_l) = \text{Var}(h_{l-1})$ for a linear layer $h_l = W_l h_{l-1}$ with $n_{\text{in}}$ input units:

  1. Derive that $\text{Var}(W_{ij}) = \frac{1}{n_{\text{in}}}$ assuming zero-mean weights and activations
  2. Explain why ReLU activations require the He correction factor of $\frac{2}{n_{\text{in}}}$
  3. Show why Xavier initialization uses $\frac{2}{n_{\text{in}} + n_{\text{out}}}$

Exercise 12.6: Gradient Clipping Analysis

Consider a network training with gradient clipping at max_norm=1.0. The gradient norms for 10 consecutive steps are: [0.5, 0.8, 1.2, 15.0, 0.9, 0.7, 1.1, 0.6, 0.8, 0.4].

  1. For which steps is the gradient clipped?
  2. By what factor is the gradient scaled at step 4?
  3. If this pattern of occasional large spikes continues, what does it suggest about the training dynamics?
  4. What interventions would you try?
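If you want to collect gradient-norm traces like this from your own runs, note that torch.nn.utils.clip_grad_norm_ returns the total norm computed before clipping. A minimal sketch, assuming model, criterion, optimizer, and train_loader already exist:

import torch

grad_norms = []
for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    # Returns the pre-clipping total norm, then clips gradients in place.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    grad_norms.append(total_norm.item())
    optimizer.step()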

Exercise 12.7: Mixed Precision Trade-offs

  1. Explain why loss scaling is necessary in FP16 mixed precision training but not in BF16.
  2. What happens if the loss scale is too large? Too small?
  3. How does the dynamic loss scaler in PyTorch's GradScaler adapt the scale?
  4. Which operations should NOT be run in FP16 and why?

Exercise 12.8: Learning Rate Warmup

  1. Explain why training transformers without warmup often fails.
  2. A model uses a linear warmup over 1000 steps to a peak learning rate of 5e-4. What is the learning rate at step 250? At step 500?
  3. Does warmup duration need to scale with batch size? Why or why not?

Coding Exercises

Exercise 12.9: Implement Focal Loss

Implement focal loss as a PyTorch nn.Module that:

  - Accepts raw logits (not probabilities)
  - Supports configurable gamma and alpha parameters
  - Handles multi-class classification
  - Is numerically stable (uses log-softmax, not log of softmax)

import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    """Focal loss for multi-class classification.

    Args:
        gamma: Focusing parameter (default: 2.0).
        alpha: Class balance weight (default: None).
        reduction: Reduction method ('mean', 'sum', 'none').
    """
    # Your implementation here

Test your implementation against standard cross-entropy when gamma=0.

Exercise 12.10: Custom Learning Rate Scheduler

Implement a linear warmup with polynomial decay scheduler:

  - Phase 1: Linear warmup from 0 to base_lr over warmup_steps steps
  - Phase 2: Polynomial decay from base_lr to end_lr over remaining steps
  - The polynomial decay follows: $\eta_t = (\eta_{\text{base}} - \eta_{\text{end}}) \times (1 - \frac{t - t_w}{T - t_w})^p + \eta_{\text{end}}$

where $p$ is the polynomial power (e.g., $p=2$ for quadratic decay).
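One possible skeleton builds on torch.optim.lr_scheduler.LambdaLR; the function and argument names below are suggestions rather than a required interface:

from torch.optim.lr_scheduler import LambdaLR

def warmup_poly_lambda(warmup_steps, total_steps, base_lr, end_lr, power=2.0):
    """Returns a multiplicative factor on base_lr for each step."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                  # linear warmup: 0 -> 1
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        progress = min(progress, 1.0)
        lr = (base_lr - end_lr) * (1.0 - progress) ** power + end_lr
        return lr / base_lr                                      # factor relative to base_lr
    return lr_lambda

# Usage sketch (optimizer assumed to exist with lr=5e-4):
# scheduler = LambdaLR(optimizer, lr_lambda=warmup_poly_lambda(1000, 10000, 5e-4, 1e-6))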

Exercise 12.11: Gradient Flow Visualization

Write a function that, given a model and a batch of data:

  1. Performs a forward and backward pass
  2. Collects the gradient norm for each layer
  3. Plots the gradient norms as a bar chart (layer index on x-axis, norm on y-axis)
  4. Returns a dictionary mapping layer names to gradient norms

Use this to compare gradient flow in:

  - A 10-layer MLP with sigmoid activations and random initialization
  - The same MLP with ReLU activations and He initialization
  - The same MLP with ReLU, He init, and batch normalization
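A sketch of the gradient-norm collection step (plotting is left to you; the function name and signature are just one way to organize it):

import torch

def layer_gradient_norms(model, inputs, targets, criterion):
    """One forward/backward pass; returns {parameter_name: gradient_norm}."""
    model.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    return {
        name: param.grad.norm().item()
        for name, param in model.named_parameters()
        if param.grad is not None
    }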

Exercise 12.12: Training Loop from Scratch

Implement a complete training pipeline that:

  1. Creates a 5-layer MLP for classifying FashionMNIST (10 classes)
  2. Uses He initialization, batch normalization, and dropout
  3. Uses AdamW optimizer with weight decay, excluding bias and norm parameters (see the sketch after this exercise)
  4. Implements cosine annealing with linear warmup
  5. Uses mixed precision training
  6. Implements gradient clipping
  7. Logs training and validation loss/accuracy per epoch
  8. Saves the best model checkpoint based on validation accuracy
  9. Implements early stopping

The model should achieve at least 89% test accuracy.
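For requirement 3, one common way to exclude bias and normalization parameters from weight decay is to build optimizer parameter groups; a minimal sketch, using the dimension-based heuristic (1-D parameters are biases or norm scales/shifts):

import torch

def build_param_groups(model, weight_decay=0.01):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Biases and normalization parameters are 1-D; exclude them from weight decay.
        if param.ndim <= 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# optimizer = torch.optim.AdamW(build_param_groups(model), lr=1e-3)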

Exercise 12.13: Overfit-One-Batch Debugging

Given the following buggy training code, identify and fix all bugs:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.Softmax(dim=1),  # Bug 1
    nn.Linear(128, 10),
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=10.0)  # Bug 2

for epoch in range(100):
    model.eval()  # Bug 3
    for inputs, targets in train_loader:
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        optimizer.step()  # Bug 4
        loss.backward()
        optimizer.zero_grad()

List each bug, explain why it is a bug, and provide the corrected code.

Exercise 12.14: Learning Rate Finder

Implement a learning rate finder that:

  1. Trains for one epoch with exponentially increasing learning rate (from 1e-7 to 10)
  2. Records the loss at each step
  3. Plots loss vs. learning rate (log scale)
  4. Returns the learning rate at which the loss starts to decrease most rapidly

This technique (Smith, 2017) helps find a good initial learning rate: a good choice is typically one to two orders of magnitude below the learning rate at which the loss starts to explode.
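A sketch of the exponential sweep (one reasonable way to set the learning rate at each step; the helper name is a placeholder):

def finder_lr(step, total_steps, lr_min=1e-7, lr_max=10.0):
    """Exponentially interpolate from lr_min to lr_max over total_steps."""
    ratio = step / max(1, total_steps - 1)
    return lr_min * (lr_max / lr_min) ** ratio

# Inside the training loop (optimizer assumed to exist):
# for group in optimizer.param_groups:
#     group["lr"] = finder_lr(step, total_steps)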

Exercise 12.15: Custom Batch Normalization

Implement batch normalization from scratch (without using nn.BatchNorm1d):

  - Track running mean and variance using exponential moving average
  - Handle training vs. evaluation mode correctly
  - Include learnable scale ($\gamma$) and shift ($\beta$) parameters
  - Use the epsilon parameter for numerical stability

Verify your implementation produces the same output as nn.BatchNorm1d (up to floating-point precision).
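A possible skeleton; the parameter and buffer names mirror nn.BatchNorm1d, while the forward logic is yours to complete:

import torch
import torch.nn as nn

class MyBatchNorm1d(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        self.weight = nn.Parameter(torch.ones(num_features))   # gamma (scale)
        self.bias = nn.Parameter(torch.zeros(num_features))    # beta (shift)
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        # Training mode: normalize with batch statistics and update the running
        # estimates. Evaluation mode: normalize with the running estimates.
        raise NotImplementedError  # your implementation here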

Exercise 12.16: Gradient Accumulation

Modify the standard training loop to support gradient accumulation with N accumulation steps. Verify that:

  1. The effective batch size equals batch_size * N
  2. The loss is normalized correctly (divided by N)
  3. The gradient norm matches what you'd get with a single batch of size batch_size * N (approximately)
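A sketch of the basic accumulation pattern, assuming model, criterion, optimizer, and train_loader already exist:

accum_steps = 4  # N

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(train_loader):
    loss = criterion(model(inputs), targets)
    (loss / accum_steps).backward()   # normalize so accumulated gradients match a larger batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()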

Exercise 12.17: Optimizer Comparison Experiment

Train the same model (a 3-layer MLP on MNIST) with four optimizers:

  1. SGD (lr=0.01, no momentum)
  2. SGD (lr=0.01, momentum=0.9)
  3. Adam (lr=0.001)
  4. AdamW (lr=0.001, weight_decay=0.01)

For each optimizer:

  - Plot training loss curves on the same graph
  - Plot validation accuracy curves on the same graph
  - Report final test accuracy and total training time
  - Discuss the trade-offs observed

Exercise 12.18: Label Smoothing Implementation

Implement label smoothing cross-entropy loss from scratch:

  1. Convert hard labels to soft labels by mixing with a uniform distribution: the true class gets $(1 - \epsilon) + \frac{\epsilon}{C}$, every other class gets $\frac{\epsilon}{C}$
  2. Compute cross-entropy with the soft labels
  3. Verify your implementation matches nn.CrossEntropyLoss(label_smoothing=epsilon)
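A sketch of the soft-label construction, matching PyTorch's uniform-mixing convention (the helper name and the commented loss line are placeholders):

import torch
import torch.nn.functional as F

def smoothed_targets(targets, num_classes, epsilon=0.1):
    """Hard labels (N,) -> soft labels (N, C): (1 - eps) * one_hot + eps / C."""
    one_hot = F.one_hot(targets, num_classes).float()
    return (1.0 - epsilon) * one_hot + epsilon / num_classes

# loss = -(smoothed_targets(y, C) * F.log_softmax(logits, dim=-1)).sum(dim=1).mean()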

Exercise 12.19: EMA Implementation and Evaluation

Implement an EMA model wrapper and demonstrate its effect:

  1. Train a model on CIFAR-10 for 50 epochs
  2. At each evaluation point, compare the performance of the raw model vs. the EMA model
  3. Plot both validation accuracy curves
  4. Experiment with different decay rates (0.99, 0.999, 0.9999) and discuss the trade-offs
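A minimal sketch of the EMA update; the wrapper interface (and how you handle buffers such as batch-norm statistics) is up to you:

import copy
import torch

class EmaModel:
    """Maintains an exponential moving average of a model's parameters."""

    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.ema = copy.deepcopy(model).eval()
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for ema_p, p in zip(self.ema.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)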

Exercise 12.20: Weight Initialization Experiment

For a 20-layer MLP with 256 hidden units:

  1. Initialize with zeros and observe what happens
  2. Initialize with large random values ($\sigma = 1.0$) and observe what happens
  3. Initialize with Xavier and observe training with tanh activations
  4. Initialize with He and observe training with ReLU activations

For each case, plot:

  - Activation magnitudes per layer (forward pass)
  - Gradient magnitudes per layer (backward pass)
  - Training loss curve
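The relevant initializers from torch.nn.init, for reference; a helper like this, applied to each Linear layer, is one way to switch between the four cases:

import torch.nn as nn

def init_linear(layer, scheme):
    if scheme == "zeros":
        nn.init.zeros_(layer.weight)
    elif scheme == "large_normal":
        nn.init.normal_(layer.weight, mean=0.0, std=1.0)
    elif scheme == "xavier":
        nn.init.xavier_uniform_(layer.weight)                       # pair with tanh
    elif scheme == "he":
        nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")  # pair with ReLU
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)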


Challenge Exercises

Exercise 12.21: Implement AdamW from Scratch

Implement the AdamW optimizer from scratch as a subclass of torch.optim.Optimizer:

  - Support bias correction for first and second moments
  - Implement decoupled weight decay (not L2 regularization)
  - Support parameter groups with different learning rates and weight decay
  - Verify your implementation produces identical updates to torch.optim.AdamW
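A skeleton to start from; the constructor defaults mirror torch.optim.AdamW, and the update logic is yours to fill in:

import torch

class MyAdamW(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]  # per-parameter state: step count, exp_avg, exp_avg_sq
                # 1. Decoupled weight decay: p <- p - lr * weight_decay * p
                # 2. Update the biased first and second moment estimates
                # 3. Bias-correct and apply the Adam update
                raise NotImplementedError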

Exercise 12.22: Training Instability Analysis

Create a synthetic scenario where training is unstable, then systematically stabilize it:

  1. Start with a deep network (20+ layers) with no normalization, no gradient clipping, and a high learning rate
  2. Add one stabilization technique at a time (normalization, clipping, schedule, initialization) and measure the effect
  3. Create a table showing which combinations of techniques are sufficient for stable training

Exercise 12.23: Mixed Precision from Scratch

Implement a simplified version of mixed precision training without using torch.amp:

  1. Manually cast model inputs to FP16 for the forward pass
  2. Compute loss in FP32
  3. Implement manual loss scaling (multiply loss by a fixed scale, divide gradients by the same scale)
  4. Keep master weights in FP32

Compare training speed and final accuracy against full FP32 and PyTorch AMP.
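One way to wire up a fixed loss scale with FP32 master weights, as a starting point; model32 (an ordinary FP32 model on the GPU), criterion, and the data are assumed to exist, and this design is only one of several reasonable ones:

import copy
import torch

scale = 2 ** 14                                             # fixed loss scale
model16 = copy.deepcopy(model32).half()                     # FP16 copy used for forward/backward
optimizer = torch.optim.SGD(model32.parameters(), lr=0.1)   # optimizer updates the FP32 masters

def train_step(inputs, targets):
    model16.zero_grad()
    loss = criterion(model16(inputs.half()).float(), targets)  # loss computed in FP32
    (loss * scale).backward()                                   # scaled backward pass in FP16
    for p32, p16 in zip(model32.parameters(), model16.parameters()):
        p32.grad = p16.grad.float() / scale                     # unscale into FP32 gradients
    optimizer.step()
    for p32, p16 in zip(model32.parameters(), model16.parameters()):
        p16.data.copy_(p32.data)                                # refresh the FP16 copy
    return loss.item()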

Exercise 12.24: Distributed Training Simulation

Simulate distributed training on a single GPU:

  1. Split a batch into N sub-batches
  2. Compute gradients for each sub-batch independently
  3. Average the gradients (simulating all-reduce)
  4. Apply the optimizer step
  5. Verify the results match single-GPU training with the full batch
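A sketch of the split/accumulate/average step (the function name and interface are placeholders):

import torch

def simulated_data_parallel_step(model, criterion, optimizer, inputs, targets, num_shards):
    """Split one batch into shards and accumulate the averaged gradients."""
    optimizer.zero_grad()
    for shard_x, shard_y in zip(inputs.chunk(num_shards), targets.chunk(num_shards)):
        loss = criterion(model(shard_x), shard_y)
        # With equal-sized shards, accumulating each per-shard loss divided by
        # num_shards yields exactly the average of the per-shard gradients.
        (loss / num_shards).backward()
    optimizer.step()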

Exercise 12.25: Custom Training Monitor

Build a training monitor class that:

  1. Tracks loss, accuracy, gradient norms, learning rate, and parameter norms at every step
  2. Detects anomalies (loss spikes, gradient explosions, NaN values)
  3. Automatically generates a diagnostic report at the end of training
  4. Supports plotting all tracked metrics
  5. Issues warnings during training if anomalies are detected

Exercise 12.26: Cosine Annealing with Warm Restarts

Implement CosineAnnealingWarmRestarts from scratch:

  - Support configurable restart period $T_0$ and period multiplier $T_{\text{mult}}$
  - After each restart, multiply the period by $T_{\text{mult}}$
  - Plot the learning rate schedule for $T_0=10$, $T_{\text{mult}}=2$ over 70 epochs
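For the plot, one way to compute the schedule point-wise (epoch-level granularity; the minimum learning rate is assumed to be 0 unless you choose otherwise):

import math

def cosine_with_restarts(epoch, base_lr, t_0, t_mult, eta_min=0.0):
    """Learning rate at `epoch` for an SGDR-style schedule with warm restarts."""
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:          # locate the current cycle and the position within it
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

# lrs = [cosine_with_restarts(e, base_lr=0.1, t_0=10, t_mult=2) for e in range(70)]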

Exercise 12.27: Loss Landscape Visualization

For a simple 2D classification problem:

  1. Train a small network to convergence
  2. Choose two random directions in parameter space
  3. Evaluate the loss along a grid in these two directions centered at the final parameters
  4. Create a 3D surface plot and contour plot of the loss landscape
  5. Compare the landscape when trained with SGD vs. Adam

Exercise 12.28: Hyperparameter Sensitivity Analysis

For a fixed architecture (ResNet-18 on CIFAR-10), systematically study the sensitivity to:

  1. Learning rate (log scale from 1e-5 to 1.0)
  2. Weight decay (0, 1e-4, 1e-3, 1e-2, 1e-1)
  3. Batch size (16, 32, 64, 128, 256, 512)
  4. Dropout rate (0, 0.1, 0.2, 0.3, 0.5)

For each hyperparameter, plot final validation accuracy as a function of the hyperparameter value. Which hyperparameter has the largest effect on performance?

Exercise 12.29: Gradient Noise Scale

Implement the gradient noise scale (McCandlish et al., 2018), which measures the ratio of gradient noise to gradient signal:

$$B_{\text{noise}} = \frac{\text{tr}(\Sigma)}{|G|^2}$$

where $\Sigma$ is the covariance of per-sample gradients and $G$ is the full-batch gradient. Estimate this quantity by comparing gradients computed with different batch sizes.

Exercise 12.30: Training with Noisy Labels

Simulate training with label noise:

  1. Take MNIST and randomly flip 10%, 20%, and 40% of labels
  2. Train the same model with standard cross-entropy on each noisy dataset
  3. Train with label smoothing on each noisy dataset
  4. Train with a symmetric cross-entropy loss (Wang et al., 2019) on each noisy dataset
  5. Compare test accuracy (on the clean test set) for all combinations


Integration Exercises

Exercise 12.31: End-to-End Pipeline

Build a complete, production-ready training pipeline for classifying CIFAR-10 that includes:

  - Data loading with augmentation (random crop, horizontal flip, normalization)
  - A configurable CNN architecture
  - AdamW optimizer with parameter groups (different weight decay for norm/bias)
  - Cosine annealing with warmup
  - Mixed precision training
  - Gradient clipping
  - Checkpointing (best model and periodic)
  - Early stopping
  - TensorBoard logging
  - Reproducibility (seeds, deterministic algorithms)

Target: 93%+ test accuracy with a ResNet-style architecture.

Exercise 12.32: Training Diagnostic Dashboard

Create a function that takes a training log (dictionary of metrics over time) and generates a comprehensive diagnostic report including:

  1. Loss curves (train and val) with annotations for learning rate changes
  2. Accuracy curves with the gap between train and val highlighted
  3. Gradient norm plot with a horizontal line at the clipping threshold
  4. Learning rate schedule plot
  5. A text summary of potential issues (overfitting, underfitting, instability)

Exercise 12.33: Ablation Study

Starting from the full training recipe in Section 12.11, systematically remove each component and measure the impact on final validation accuracy:

  1. No weight initialization (use PyTorch defaults)
  2. No normalization layers
  3. No learning rate warmup
  4. No gradient clipping
  5. No weight decay
  6. No data augmentation
  7. No mixed precision (FP32 only)
  8. No learning rate schedule (constant LR)

Report results in a table and discuss which components have the largest impact.

Exercise 12.34: Transfer Learning Pipeline

Implement a transfer learning pipeline that:

  1. Loads a pretrained ResNet-18 from torchvision.models
  2. Replaces the final classification head for a new task (e.g., Flowers102)
  3. Uses discriminative learning rates (lower LR for pretrained layers; see the sketch below)
  4. Implements gradual unfreezing (first epoch: only the head; then unfreeze the top layers; then all layers)
  5. Uses the training techniques from this chapter (warmup, cosine schedule, mixed precision)

Compare against training from scratch.
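A sketch of the head replacement and discriminative learning rates; the learning-rate values are illustrative, and the weights argument may need adjusting for your torchvision version:

import torch
from torchvision import models

model = models.resnet18(weights="DEFAULT")                 # pretrained backbone
model.fc = torch.nn.Linear(model.fc.in_features, 102)      # new head for Flowers102

backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
optimizer = torch.optim.AdamW([
    {"params": backbone_params, "lr": 1e-4},               # pretrained layers: smaller LR
    {"params": model.fc.parameters(), "lr": 1e-3},         # new head: larger LR
], weight_decay=0.01)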

Exercise 12.35: Reproducing a Paper's Training Recipe

Choose one of the following papers and reproduce their training recipe exactly:

  1. ResNet (He et al., 2016): SGD, momentum 0.9, step decay, weight decay 1e-4
  2. BERT fine-tuning (Devlin et al., 2019): AdamW, linear warmup, linear decay
  3. GPT-2 (Radford et al., 2019): Adam, cosine schedule with warmup

Train on a small dataset and verify that your implementation matches the paper's described setup.