Chapter 7: Exercises

Exercises are graded by difficulty:

  • One star (*): Apply the technique from the chapter to a new dataset or scenario
  • Two stars (**): Extend the technique or combine it with a previous chapter's methods
  • Three stars (***): Derive a result, implement from scratch, or design a system component
  • Four stars (****): Research-level problems that connect to open questions in the field


Weight Initialization

Exercise 7.1 (*)

Consider a fully connected layer with $n_{\text{in}} = 512$ and $n_{\text{out}} = 256$.

(a) Compute the variance and standard deviation for weights under Xavier normal initialization.

(b) Compute the variance and standard deviation for weights under He normal initialization.

(c) If you initialize with $w \sim \mathcal{N}(0, 0.01^2)$, i.e., standard deviation $0.01$ (a common but often incorrect default), compute the ratio $\text{Var}(z) / \text{Var}(x)$ for a single layer. What happens after 20 layers?
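For parts (a) and (b), you can check your hand computation numerically; a minimal sketch:

```python
import math

n_in, n_out = 512, 256

# Xavier normal: Var(w) = 2 / (n_in + n_out)
xavier_var = 2 / (n_in + n_out)
# He normal: Var(w) = 2 / n_in
he_var = 2 / n_in

print(round(math.sqrt(xavier_var), 4))  # Xavier std -> 0.051
print(round(math.sqrt(he_var), 4))      # He std -> 0.0625
```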


Exercise 7.2 (**)

Implement an experiment that tracks the variance of activations across 50 layers of a linear network (no activation functions) for three initialization schemes: (1) $\text{Var}(w) = 1/n_{\text{in}}$, (2) $\text{Var}(w) = 2/(n_{\text{in}} + n_{\text{out}})$, and (3) $\text{Var}(w) = 1/n_{\text{out}}$. Alternate the layer widths between 512 and 128; with equal widths all three schemes assign the identical variance $1/n$, so the comparison would be vacuous.

(a) Plot activation variance vs. layer index for all three schemes. Which produces stable variance?

(b) Now repeat with ReLU activations. Which of the three schemes is no longer suitable, and why?

(c) Add He initialization ($\text{Var}(w) = 2/n_{\text{in}}$) to the ReLU experiment. Verify that it produces stable variance.

import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from typing import Callable, List, Tuple

def track_activation_variance(
    dims: List[int],
    init_var_fn: Callable[[int, int], float],
    use_relu: bool,
    n_samples: int = 1000,
) -> List[float]:
    """Track activation variance through layers.

    Args:
        dims: Widths of successive layers; the first entry is the
            input width, e.g. [512, 128] * 25 plus a final width.
        init_var_fn: Maps (n_in, n_out) to the weight variance.
        use_relu: Whether to apply ReLU after each layer.
        n_samples: Number of input samples.

    Returns:
        List of activation variances at each layer.
    """
    # YOUR CODE HERE
    pass
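In a purely linear network each layer multiplies the activation variance by $n_{\text{in}} \cdot \text{Var}(w)$, so any mismatch compounds geometrically. A back-of-envelope check (the width and inits here are illustrative):

```python
n_in = 256
n_layers = 50

for var_w in (1 / n_in, 0.01 ** 2):  # 1/n_in vs. a naive std=0.01 default
    f = n_in * var_w         # per-layer variance multiplier
    print(f, f ** n_layers)  # compounded over 50 layers: 1.0 stays put,
                             # 0.0256 collapses toward zero
```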

Exercise 7.3 (***)

Derive He initialization from first principles.

(a) Starting from the forward pass $z = Wx + b$ where $w_{ij} \sim \mathcal{N}(0, \sigma^2_w)$ and $x_i$ are zero-mean with variance $\text{Var}(x)$, show that $\text{Var}(z_j) = n_{\text{in}} \cdot \sigma^2_w \cdot \text{Var}(x)$.

(b) For $a = \text{ReLU}(z)$ where $z \sim \mathcal{N}(0, \sigma^2_z)$, prove that $\mathbb{E}[a^2] = \frac{1}{2}\sigma^2_z$. (Hint: integrate $z^2$ over the positive half of the Gaussian. Note that $\mathbb{E}[a] \neq 0$, so this is the second moment rather than the variance; it is the second moment that propagates into the next layer's pre-activations.)

(c) Combine parts (a) and (b) to derive $\sigma^2_w = 2/n_{\text{in}}$.

(d) Repeat the backward pass analysis. Show that variance preservation during backpropagation requires $\sigma^2_w = 2/n_{\text{out}}$.
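A Monte Carlo spot-check of the identity in part (b), using only the standard library (the sample size and $\sigma$ are arbitrary):

```python
import random

random.seed(0)
sigma = 1.5
n = 200_000

# E[ReLU(z)^2] for z ~ N(0, sigma^2) should equal sigma^2 / 2.
second_moment = sum(
    max(0.0, random.gauss(0.0, sigma)) ** 2 for _ in range(n)
) / n
print(abs(second_moment - sigma ** 2 / 2) < 0.05)  # -> True
```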


Exercise 7.4 (*)

PyTorch's kaiming_uniform_ initializes weights from $U[-b, b]$ where $b = \sqrt{6 / n_{\text{in}}}$ for ReLU mode.

(a) Verify that $U[-b, b]$ has variance $b^2/3 = 2/n_{\text{in}}$, matching the He normal variance.

(b) Create a network with 20 layers of width 512 and compare the final-layer activation variance between kaiming_normal_ and kaiming_uniform_. Are they equivalent in practice?
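Part (a) reduces to the variance of a uniform distribution, $\text{Var}(U[-b, b]) = b^2/3$; a one-line numeric check (the width is illustrative):

```python
n_in = 512
b = (6 / n_in) ** 0.5                      # kaiming_uniform_ bound for ReLU
print(abs(b ** 2 / 3 - 2 / n_in) < 1e-12)  # -> True: matches He normal variance
```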


Normalization

Exercise 7.5 (*)

Given a mini-batch of 4 samples with 3 features:

$$X = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \end{bmatrix}$$

(a) Compute the batch normalization output (with $\gamma = 1$, $\beta = 0$, $\epsilon = 0$) manually.

(b) Compute the layer normalization output for the first sample only.

(c) Verify your answers using torch.nn.BatchNorm1d and torch.nn.LayerNorm.
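Before reaching for PyTorch in part (c), the per-feature normalization of part (a) can be checked with plain Python; a sketch:

```python
# Manual batch norm over the mini-batch from this exercise (gamma=1, beta=0, eps=0).
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
n = len(X)
out = [[0.0] * 3 for _ in range(n)]
for j in range(3):  # normalize each feature (column) independently
    col = [row[j] for row in X]
    mu = sum(col) / n
    var = sum((v - mu) ** 2 for v in col) / n  # biased variance, as BN uses
    std = var ** 0.5
    for i in range(n):
        out[i][j] = (X[i][j] - mu) / std

print([round(v, 4) for v in out[0]])  # -> [-1.3416, -1.3416, -1.3416]
```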


Exercise 7.6 (**)

Implement batch normalization from scratch in PyTorch (using tensors, not nn.BatchNorm1d) and verify that it produces the same output as the built-in module for both training and eval modes.

def manual_batch_norm(
    x: torch.Tensor,
    gamma: torch.Tensor,
    beta: torch.Tensor,
    running_mean: torch.Tensor,
    running_var: torch.Tensor,
    training: bool = True,
    momentum: float = 0.1,
    eps: float = 1e-5,
) -> torch.Tensor:
    """Manual batch normalization forward pass.

    Args:
        x: Input tensor of shape (N, C).
        gamma: Scale parameter of shape (C,).
        beta: Shift parameter of shape (C,).
        running_mean: Running mean of shape (C,).
        running_var: Running variance of shape (C,).
        training: Whether in training mode.
        momentum: EMA momentum for running stats.
        eps: Numerical stability constant.

    Returns:
        Normalized output of shape (N, C).
    """
    # YOUR CODE HERE
    pass

Exercise 7.7 (***)

Derive the backward pass for batch normalization.

(a) Write the batch normalization forward pass as a sequence of elementary operations: $\mu = \frac{1}{m}\sum_i x_i$, $\sigma^2 = \frac{1}{m}\sum_i(x_i - \mu)^2$, $\hat{x}_i = (x_i - \mu)/\sqrt{\sigma^2 + \epsilon}$, $y_i = \gamma\hat{x}_i + \beta$.

(b) Draw the computational graph for these operations.

(c) Apply the chain rule backward through the graph to derive $\frac{\partial L}{\partial x_i}$, $\frac{\partial L}{\partial \gamma}$, and $\frac{\partial L}{\partial \beta}$.

(d) Implement the backward pass in PyTorch and verify using torch.autograd.gradcheck (use double-precision inputs; gradcheck is unreliable in fp32).


Exercise 7.8 (**)

A production team reports that their recommendation model produces different predictions for the same input depending on what other inputs are in the batch. Explain why this happens and propose two solutions.


Exercise 7.9 (**)

Implement group normalization from scratch. Verify that with $G = 1$ it matches layer normalization (normalizing over all channels and spatial positions), and with $G = C$ (number of channels) it matches instance normalization.

def group_norm(
    x: torch.Tensor,
    num_groups: int,
    gamma: torch.Tensor,
    beta: torch.Tensor,
    eps: float = 1e-5,
) -> torch.Tensor:
    """Manual group normalization.

    Args:
        x: Input of shape (N, C, *) where * is any number of spatial dims.
        num_groups: Number of groups G. C must be divisible by G.
        gamma: Scale parameter of shape (C,).
        beta: Shift parameter of shape (C,).
        eps: Numerical stability constant.

    Returns:
        Group-normalized output of shape (N, C, *).
    """
    # YOUR CODE HERE
    pass

Regularization

Exercise 7.10 (*)

A 3-layer MLP with 512 hidden units is trained on 5,000 credit scoring examples. The training loss reaches 0.01, but the validation loss is 0.45.

(a) Is the model underfitting or overfitting? Justify your answer.

(b) Propose three techniques from this chapter that could reduce the gap. Explain the mechanism of each.

(c) Would increasing the number of hidden layers help? Why or why not?


Exercise 7.11 (**)

Implement inverted dropout from scratch and verify the expected value property.

(a) Implement an InvertedDropout class that applies dropout during training and passes through during inference.

(b) Empirically verify that $\mathbb{E}[\text{dropout}(x)] = x$ by computing the mean of 10,000 dropout applications to a fixed tensor.

(c) Compare the variance of the output with and without dropout. By what factor does dropout increase the variance?
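As a warm-up for part (b), the expected-value property is easy to see on plain Python lists (the exercise asks for a tensor version):

```python
import random

def inverted_dropout(x, p, training=True):
    """Inverted dropout on a list of floats. p is the DROP probability."""
    if not training:
        return list(x)  # inference: identity, no rescaling needed
    keep = 1.0 - p
    return [(v / keep if random.random() < keep else 0.0) for v in x]

random.seed(0)
x = [2.0]  # a single fixed activation
# Averaging many dropout applications should recover x itself.
mean = sum(inverted_dropout(x, p=0.5)[0] for _ in range(100_000)) / 100_000
print(abs(mean - 2.0) < 0.05)  # -> True
```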


Exercise 7.12 (***)

Show that dropout with rate $p$ applied to a layer with $n$ neurons is equivalent to averaging $2^n$ subnetworks, where each subnetwork uses a different subset of neurons.

(a) For $n = 3$ neurons and $p = 0.5$, enumerate all $2^3 = 8$ subnetworks and their weights in the ensemble.

(b) Show that for a linear layer, the dropout prediction (with test-time scaling) equals the average of subnetwork predictions weighted by their selection probabilities. (With nonlinearities this equivalence holds only approximately.)

(c) Explain why this "exponential ensemble" interpretation suggests that dropout should be more effective for wider networks.
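For part (b), the linear-layer case can be verified exhaustively for $n = 3$; a sketch with illustrative weights and activations:

```python
from itertools import product

# Linear readout y = w . h over 3 hidden units; drop each unit w.p. 0.5.
w = [1.0, 2.0, 3.0]
h = [0.5, -1.0, 2.0]
p_drop = 0.5

# Average the 2^3 subnetwork outputs, each weighted by its selection probability.
ensemble = 0.0
for mask in product([0, 1], repeat=3):
    prob = (1 - p_drop) ** sum(mask) * p_drop ** (3 - sum(mask))
    ensemble += prob * sum(m * wi * hi for m, wi, hi in zip(mask, w, h))

# Test-time dropout prediction: scale each activation by the keep probability.
test_time = sum((1 - p_drop) * wi * hi for wi, hi in zip(w, h))
print(abs(ensemble - test_time) < 1e-12)  # -> True
```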


Exercise 7.13 (**)

Compare L2 regularization and decoupled weight decay on a simple logistic regression problem.

(a) Generate a binary classification dataset with 20 features, 10 of which are informative and 10 of which are noise.

(b) Train with torch.optim.Adam(weight_decay=1e-4) (L2 regularization).

(c) Train with torch.optim.AdamW(weight_decay=1e-2) (decoupled weight decay).

(d) Compare the final weight magnitudes for informative vs. noise features. Which optimizer does a better job of driving noise feature weights to zero?


Exercise 7.14 (**)

Implement early stopping with a twist: instead of restoring the best model, compute an exponential moving average (EMA) of model weights during training and evaluate the EMA model.

class EMAModel:
    """Exponential moving average of model parameters.

    Args:
        model: The model to track.
        decay: EMA decay factor (higher = slower update).
    """

    def __init__(self, model: nn.Module, decay: float = 0.999) -> None:
        self.decay = decay
        self.shadow = {
            name: param.data.clone()
            for name, param in model.named_parameters()
        }

    def update(self, model: nn.Module) -> None:
        """Update the EMA parameters."""
        # YOUR CODE HERE
        pass

    def apply(self, model: nn.Module) -> None:
        """Copy EMA parameters into the model."""
        # YOUR CODE HERE
        pass

Compare EMA with standard early stopping on a classification task. When does each approach perform better?
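The EMA recurrence is easiest to see on a scalar; assuming the standard update shadow ← decay · shadow + (1 − decay) · param:

```python
decay = 0.9
shadow = 1.0                    # initial shadow value
for param in [2.0, 2.0, 2.0]:   # parameter held at 2.0 for three steps
    shadow = decay * shadow + (1 - decay) * param

print(round(shadow, 4))  # -> 1.271: drifting from 1.0 toward 2.0
```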


Learning Rate Schedules

Exercise 7.15 (*)

Plot the following learning rate schedules over 100 epochs (assume 100 steps per epoch = 10,000 total steps):

(a) Step decay: $\eta_0 = 0.1$, $\gamma = 0.1$ every 30 epochs.

(b) Cosine annealing: $\eta_{\max} = 0.1$, $\eta_{\min} = 10^{-6}$.

(c) One-cycle: $\eta_{\max} = 0.1$, 30% warmup, $\text{div\_factor} = 25$.

(d) Linear warmup (2,000 steps) + cosine annealing.

Which schedule reaches the lowest learning rate at the end? Which maintains the highest average learning rate?
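The cosine curve of part (b) is itself a one-liner; a sketch using the constants from the exercise:

```python
import math

total_steps = 10_000
eta_max, eta_min = 0.1, 1e-6

def cosine_lr(step):
    """Cosine annealing from eta_max down to eta_min over total_steps."""
    t = step / total_steps
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t))

print(round(cosine_lr(0), 6), cosine_lr(10_000))  # -> 0.1 1e-06
```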


Exercise 7.16 (***)

Implement the learning rate finder from Section 7.8.2.

(a) Apply it to a 3-layer MLP (256 hidden units each) on the MNIST dataset. Plot loss vs. learning rate.

(b) Identify the optimal learning rate from the plot. Train the model with this learning rate and compare convergence to training with $\eta = 0.1$ and $\eta = 10^{-5}$.

(c) Does the optimal learning rate change when you add batch normalization? Why or why not?


Exercise 7.17 (**)

The one-cycle policy uses an inverse momentum schedule: low momentum when the learning rate is high, and high momentum when the learning rate is low.

(a) Explain the intuition behind this coupling. (Hint: think about the effective step size as a function of both learning rate and momentum.)

(b) Implement a one-cycle schedule that couples learning rate and momentum. Verify that torch.optim.lr_scheduler.OneCycleLR implements the same behavior.


Mixed Precision Training

Exercise 7.18 (*)

(a) What is the largest integer that can be represented exactly in fp16? In bf16? In fp32?

(b) A gradient has value $3.2 \times 10^{-6}$. Can this be represented in fp16? In bf16? What happens if it cannot?

(c) Explain why loss scaling solves the problem in part (b). If the scale factor is $2^{16} = 65536$, what is the scaled gradient value, and can it be represented in fp16?
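The arithmetic in part (c) is worth confirming against fp16's limits; the smallest normal fp16 value is $2^{-14} \approx 6.1 \times 10^{-5}$:

```python
g = 3.2e-6                    # raw gradient from part (b)
scale = 2 ** 16               # loss scale factor
fp16_min_normal = 2.0 ** -14  # smallest normal fp16 value, ~6.1e-5

scaled = g * scale            # ~0.21
print(scaled > fp16_min_normal)  # -> True: the scaled gradient is representable
```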


Exercise 7.19 (**)

Implement mixed precision training for the StreamRec MLP and measure the speedup.

(a) Train the model in fp32 and record the wall-clock time per epoch and peak GPU memory.

(b) Train with fp16 mixed precision (using torch.cuda.amp.autocast and GradScaler) and record the same metrics.

(c) Train with bf16 mixed precision (if hardware supports it) without loss scaling. Compare all three.

(d) Is the final validation loss the same across all three? If not, which is best and why?


Exercise 7.20 (***)

Dynamic loss scaling adjusts the scale factor based on gradient overflow detection. Implement a simple dynamic loss scaler.

class SimpleDynamicScaler:
    """Simplified dynamic loss scaler.

    Args:
        init_scale: Initial loss scale factor.
        growth_factor: Factor to increase scale by.
        backoff_factor: Factor to decrease scale by on overflow.
        growth_interval: Steps between scale increases.
    """

    def __init__(
        self,
        init_scale: float = 2.0**16,
        growth_factor: float = 2.0,
        backoff_factor: float = 0.5,
        growth_interval: int = 2000,
    ) -> None:
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.steps_since_growth = 0

    def scale_loss(self, loss: torch.Tensor) -> torch.Tensor:
        """Scale the loss before backward."""
        # YOUR CODE HERE
        pass

    def unscale_and_step(
        self, optimizer: optim.Optimizer, model: nn.Module
    ) -> bool:
        """Unscale gradients and step if no overflow.

        Returns True if the step was taken, False if skipped.
        """
        # YOUR CODE HERE
        pass
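One possible shape for the scale-update policy, shown on plain floats (overflow detection and the actual optimizer step are the exercise's job):

```python
# Toy state machine for dynamic loss scaling (no tensors involved).
scale, growth, backoff, interval = 2.0 ** 16, 2.0, 0.5, 2000
steps_since_growth = 0

def step(overflow: bool):
    """Update the scale after one training step."""
    global scale, steps_since_growth
    if overflow:
        scale *= backoff          # shrink the scale and reset the counter
        steps_since_growth = 0
    else:
        steps_since_growth += 1
        if steps_since_growth >= interval:
            scale *= growth       # grow the scale after a quiet stretch
            steps_since_growth = 0

step(True)
print(scale == 2.0 ** 15)  # -> True: one overflow halves the scale
```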

Debugging and Diagnosis

Exercise 7.21 (*)

For each of the following loss curve patterns, identify the likely cause and propose a fix:

(a) Training loss: 2.30, 2.30, 2.30, 2.30, ... (constant at $\ln(10) \approx 2.30$ for a 10-class problem)

(b) Training loss: 0.50, 0.45, NaN, NaN, NaN, ...

(c) Training loss: 0.80, 0.60, 0.45, 0.35, ... (smoothly decreasing) Validation loss: 0.80, 0.75, 0.78, 0.82, ... (increasing after epoch 2)

(d) Training loss: 0.80, 0.30, 1.50, 0.25, 2.10, 0.20, 3.40, ... (oscillating with increasing amplitude)


Exercise 7.22 (**)

Build a training diagnostics dashboard that logs the following per epoch:

  • Training and validation loss
  • Per-layer gradient norms (min, mean, max)
  • Per-layer activation means and standard deviations
  • Fraction of dead neurons (activations identically zero) per ReLU layer
  • Learning rate

Use PyTorch hooks to collect activation statistics. Log to a dictionary and plot after training.


Exercise 7.23 (***)

Implement the "overfit one batch" debugging technique as a reusable function.

def overfit_single_batch(
    model: nn.Module,
    batch: Tuple[torch.Tensor, torch.Tensor],
    criterion: nn.Module,
    lr: float = 1e-2,
    max_steps: int = 1000,
    target_loss: float = 0.01,
    device: str = "cuda",
) -> dict:
    """Attempt to overfit a single batch as a sanity check.

    If the model cannot drive the loss below target_loss on a
    single batch, there is likely a bug in the model or loss.

    Args:
        model: Model to test.
        batch: Single (input, target) batch.
        criterion: Loss function.
        lr: Learning rate (use a high value).
        max_steps: Maximum optimization steps.
        target_loss: Target loss to reach.
        device: Device.

    Returns:
        Dict with 'success' (bool), 'final_loss', 'steps', 'losses'.
    """
    # YOUR CODE HERE
    pass

Integration and Application

Exercise 7.24 (**)

Credit scoring regularization study. Generate a synthetic credit scoring dataset with 20 features, 5 of which are highly correlated (pairwise correlation > 0.9).

(a) Train a 3-layer MLP without regularization. Inspect the weights on the correlated features. What pattern do you observe?

(b) Train with L2 regularization ($\lambda = 10^{-4}$) via torch.optim.Adam. Inspect the weights.

(c) Train with decoupled weight decay ($\lambda = 10^{-2}$) via torch.optim.AdamW. Inspect the weights.

(d) Perturb the test data by adding Gaussian noise ($\sigma = 0.1$) to the correlated features only. Compare the prediction stability (standard deviation of output logits) across the three models. Which is most stable?


Exercise 7.25 (***)

Full training pipeline. Implement a complete training pipeline for a 5-layer MLP on CIFAR-10 (flattened to 3072 input features) that includes:

  • He initialization
  • Batch normalization after each linear layer
  • Dropout (0.3) after each activation
  • AdamW optimizer with weight decay $10^{-2}$
  • One-cycle learning rate schedule
  • Gradient clipping (max norm 1.0)
  • Early stopping (patience 10)
  • Training and validation loss logging

Train for 100 epochs and report the final test accuracy.


Exercise 7.26 (***)

Ablation study. Using the CIFAR-10 pipeline from Exercise 7.25, systematically remove one technique at a time and measure the impact on final test accuracy and convergence speed:

| Configuration | Test Accuracy | Epochs to Converge |
|---|---|---|
| Full pipeline | | |
| No batch normalization | | |
| No dropout | | |
| No weight decay | | |
| Constant learning rate (no schedule) | | |
| No gradient clipping | | |
| Xavier init (instead of He) | | |
Which technique has the largest individual impact? Do any techniques interact (i.e., removing one makes another more or less important)?


Exercise 7.27 (****)

The interaction between batch size and learning rate. The "linear scaling rule" (Goyal et al., 2017) suggests that when you multiply the batch size by $k$, you should multiply the learning rate by $k$.

(a) Provide a theoretical justification based on the relationship between batch gradient and full-batch gradient variance.

(b) Implement an experiment on CIFAR-10: train with batch sizes {32, 64, 128, 256, 512} and for each batch size, sweep learning rates {$\eta_0/4$, $\eta_0/2$, $\eta_0$, $2\eta_0$, $4\eta_0$} where $\eta_0$ is the optimal learning rate for batch size 32.

(c) Does the linear scaling rule hold exactly? For what batch sizes does it break down? How does learning rate warmup affect the range of batch sizes where it holds?


Exercise 7.28 (****)

Batch normalization and residual connections. Implement a residual MLP (with skip connections over every 2 linear layers) and compare the following normalization placements:

(a) Pre-activation BN: BN → ReLU → Linear → BN → ReLU → Linear + skip

(b) Post-activation BN: Linear → BN → ReLU → Linear → BN → ReLU + skip

(c) Pre-norm (transformer style): LayerNorm → Linear → ReLU → LayerNorm → Linear → ReLU + skip

Train each variant on CIFAR-10 with depths {4, 8, 16, 32} residual blocks. Which normalization placement trains deepest? Connect your finding to He et al. (2016), "Identity Mappings in Deep Residual Networks."


Exercise 7.29 (***)

Mixed precision edge cases. Construct a neural network and training scenario where fp16 mixed precision training fails (produces NaN or significantly degraded accuracy) but bf16 succeeds.

(a) Design the scenario. (Hint: think about architectures that produce very large or very small intermediate values.)

(b) Demonstrate the failure with fp16 (with loss scaling).

(c) Show that bf16 handles the same scenario correctly.

(d) Can you fix the fp16 failure by modifying the architecture (without changing to bf16)?


Exercise 7.30 (****)

Weight decay as implicit Bayesian prior. Santos (1996) and others have shown that L2 regularization is equivalent to a Gaussian prior on the weights in a Bayesian framework.

(a) Show that MAP estimation with a $\mathcal{N}(0, \sigma^2)$ prior on each weight is equivalent to L2 regularization with $\lambda = 1/\sigma^2$.

(b) Under this interpretation, what prior does decoupled weight decay (AdamW) correspond to? Is it still a Gaussian prior?

(c) Train a Bayesian logistic regression (using PyMC or similar) on the credit scoring dataset with a $\mathcal{N}(0, \sigma^2)$ prior. Compare the posterior weight distributions with the weights found by AdamW with $\lambda = 1/\sigma^2$. How closely do they match?

(d) Discuss: in what sense is early stopping also a Bayesian procedure? (See Duvenaud et al., 2016, "Early Stopping as Nonparametric Variational Inference.")