Chapter 6: Exercises

Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field


Foundational Computations

Exercise 6.1 (*)

A single perceptron has weights $\mathbf{w} = [2, -1, 3]$ and bias $b = -1$.

(a) Compute the output for input $\mathbf{x} = [1, 0, -1]$ using the step activation function.

(b) Compute the output for input $\mathbf{x} = [0.5, 2, 0.5]$.

(c) What is the equation of the decision boundary hyperplane in $\mathbb{R}^3$? Describe its geometric interpretation.


Exercise 6.2 (*)

Compute the following activation function outputs and their derivatives at the given points:

(a) $\sigma(0)$, $\sigma(3)$, $\sigma(-3)$ and their gradients.

(b) $\tanh(0)$, $\tanh(2)$, $\tanh(-2)$ and their gradients.

(c) $\text{ReLU}(-1)$, $\text{ReLU}(0)$, $\text{ReLU}(3)$ and their (sub)gradients.

(d) For each of the three functions, at what input value is the gradient maximized? What is the maximum gradient value?
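A quick way to check your hand computations numerically (the helper names here are ours, not the chapter's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    # Subgradient convention: take 0 at z == 0.
    return (np.asarray(z) > 0).astype(float)

for z in (0.0, 3.0, -3.0):
    print(f"sigma({z}) = {sigmoid(z):.4f}, gradient = {sigmoid_grad(z):.4f}")
```

Comparing these outputs against your derivative formulas also answers part (d): each gradient is maximized at $z = 0$.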


Exercise 6.3 (*)

Consider a two-layer MLP with architecture [3, 4, 1] (3 inputs, 4 hidden neurons, 1 output), ReLU hidden activation, and sigmoid output activation.

(a) How many trainable parameters does this network have? Count weights and biases separately.

(b) If the batch size is 32, what are the shapes of $\mathbf{Z}^{(1)}$, $\mathbf{A}^{(1)}$, $\mathbf{Z}^{(2)}$, and $\mathbf{A}^{(2)}$ during the forward pass?

(c) What is the total memory required to cache all intermediate activations for backpropagation, assuming float32 precision?
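The parameter count in part (a) can be automated with a few lines (a hypothetical helper, not part of the chapter code):

```python
def count_params(layer_sizes):
    """Count weights and biases separately for a fully connected MLP."""
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights, biases

print(count_params([3, 4, 1]))  # (16, 5): 3*4 + 4*1 weights, 4 + 1 biases
```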


Exercise 6.4 (*)

Show algebraically that composing two linear transformations (without activation functions) produces another linear transformation. Specifically, if $\mathbf{h} = \mathbf{W}_2(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$, find the equivalent single-layer parameters $\mathbf{W}'$ and $\mathbf{b}'$.
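Whatever $\mathbf{W}'$ and $\mathbf{b}'$ you derive can be verified numerically. The check below includes one candidate collapse, so treat it as a partial spoiler:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x + b1) + b2

# Candidate collapsed parameters -- verify your own derivation the same way.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
assert np.allclose(two_layer, W_prime @ x + b_prime)
```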


Exercise 6.5 (*)

Use the generate_click_data() function from the chapter to generate a dataset with n_samples=5000 and n_features=10.

(a) Train a logistic regression model and record the validation AUC.

(b) Train the NumpyMLP with architecture [10, 32, 16, 1], ReLU activation, learning rate 0.01, for 100 epochs. Record the validation AUC.

(c) Does the MLP outperform logistic regression? Why or why not, given the data-generating process?


Backpropagation and Gradients

Exercise 6.6 (**)

Derive the backpropagation equations for a three-layer network [d, h1, h2, 1] with tanh hidden activations and sigmoid output, using binary cross-entropy loss. Write out the gradient expressions for all six parameter groups ($\mathbf{W}^{(1)}, \mathbf{b}^{(1)}, \mathbf{W}^{(2)}, \mathbf{b}^{(2)}, \mathbf{W}^{(3)}, \mathbf{b}^{(3)}$) explicitly.


Exercise 6.7 (**)

Implement gradient checking for the NumpyMLP using only 5 training examples and a network of architecture [4, 3, 2, 1].

(a) Verify that the analytical gradients match the numerical gradients with relative error below $10^{-5}$.

(b) Intentionally introduce a bug into the backward pass (e.g., remove the activation gradient term for one layer). Run gradient checking again. What is the relative error now?

(c) Based on this experiment, what threshold would you use in practice to distinguish "correct" from "buggy" gradients?
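A minimal central-difference checker, assuming your loss can be evaluated as a scalar function of a parameter array (names here are ours):

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-5):
    """Central-difference approximation of the gradient of scalar f at theta."""
    grad = np.zeros_like(theta)
    it = np.nditer(theta, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        old = theta[idx]
        theta[idx] = old + eps
        f_plus = f(theta)
        theta[idx] = old - eps
        f_minus = f(theta)
        theta[idx] = old  # restore
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

def relative_error(a, b):
    return np.abs(a - b) / np.maximum(1e-12, np.abs(a) + np.abs(b))

# Sanity check on f(theta) = sum(theta**2), whose gradient is 2*theta.
theta = np.array([1.0, -2.0, 0.5])
err = relative_error(numerical_grad(lambda t: float(np.sum(t ** 2)), theta), 2 * theta)
print(err.max())  # well below the 1e-5 threshold from part (a)
```

The same loop applies to each parameter array in the NumpyMLP if you wrap the network's loss in a closure over that array.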


Exercise 6.8 (**)

Implement the backward pass for a network with softmax output and categorical cross-entropy loss. The softmax function is:

$$\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$$

(a) Derive the Jacobian matrix $\frac{\partial \text{softmax}(\mathbf{z})}{\partial \mathbf{z}}$.

(b) Show that the combined gradient of categorical cross-entropy + softmax simplifies to $\hat{\mathbf{y}} - \mathbf{y}$ (where $\mathbf{y}$ is one-hot), analogous to the sigmoid + BCE simplification.

(c) Implement a NumpyMLP variant that supports multi-class classification with softmax output and categorical cross-entropy loss.
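For parts (a) and (b), a stable softmax and the simplified gradient can be sanity-checked as follows (a sketch, not the chapter's API):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax with max subtraction for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

# Part (b)'s simplification in code: with one-hot targets Y, the gradient of
# categorical cross-entropy with respect to the logits is softmax(Z) - Y.
Z = np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 1000.0]])  # second row stresses stability
Y = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
P = softmax(Z)
dZ = P - Y
print(P.sum(axis=1))  # each row sums to 1, even with the huge logit
```

Note that each row of dZ sums to zero, since both P and the one-hot Y sum to one per row.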


Exercise 6.9 (**)

The chapter implements vanilla SGD. Extend the NumpyMLP class to support SGD with momentum:

$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \frac{\partial \mathcal{L}}{\partial \theta}$$

$$\theta_t = \theta_{t-1} - \eta \mathbf{v}_t$$

(a) Implement the momentum update by adding a velocity dictionary to the NumpyMLP class.

(b) Train on the click prediction dataset with $\beta = 0.9$ and compare the loss curve to vanilla SGD. How does momentum affect convergence speed?

(c) Experiment with $\beta \in \{0.0, 0.5, 0.9, 0.99\}$. How does the choice of $\beta$ affect training stability?
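One possible shape for the velocity bookkeeping in part (a), sketched as a standalone optimizer rather than inside the NumpyMLP class (the class and names are ours):

```python
import numpy as np

class MomentumSGD:
    """Velocity-based update matching the equations above."""

    def __init__(self, params, lr=0.01, beta=0.9):
        self.params, self.lr, self.beta = params, lr, beta
        self.velocity = {k: np.zeros_like(v) for k, v in params.items()}

    def step(self, grads):
        for k in self.params:
            self.velocity[k] = self.beta * self.velocity[k] + grads[k]
            self.params[k] -= self.lr * self.velocity[k]

# With a constant gradient, the velocity builds up toward grad / (1 - beta).
params = {"w": np.array([1.0])}
opt = MomentumSGD(params, lr=0.01, beta=0.9)
opt.step({"w": np.array([1.0])})  # v = 1.0, w = 0.99
opt.step({"w": np.array([1.0])})  # v = 1.9, w = 0.971
print(params["w"])
```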


Exercise 6.10 (**)

Draw the computational graph for a two-layer MLP computing $\hat{y} = \sigma(\mathbf{w}_2^T \text{ReLU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + b_2)$ with input $\mathbf{x} \in \mathbb{R}^3$, hidden dimension 4, and scalar output.

(a) Label each node with its operation (matmul, add, ReLU, sigmoid).

(b) Label each edge with the shape of the intermediate tensor.

(c) For the backward pass, annotate each node with the local Jacobian (or its shape).

(d) Trace the gradient computation for $\frac{\partial \mathcal{L}}{\partial \mathbf{W}_1}$ through the graph, identifying each chain rule application.


Activation Functions and Initialization

Exercise 6.11 (**)

Implement the Leaky ReLU activation function:

$$\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$

(a) Implement leaky_relu(z, alpha=0.01) and leaky_relu_gradient(z, alpha=0.01).

(b) Add Leaky ReLU to the ACTIVATIONS and ACTIVATION_GRADIENTS dictionaries.

(c) Train two networks on the click data: one with ReLU, one with Leaky ReLU ($\alpha = 0.01$). Compare dead neuron counts after training.
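The definition above transcribes almost directly into numpy; one possible implementation for part (a):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_gradient(z, alpha=0.01):
    # Convention: use the alpha branch at z == 0, matching the definition above.
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(z), leaky_relu_gradient(z))
```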


Exercise 6.12 (**)

Reproduce the initialization experiment from Section 6.9 with 20 layers instead of 10.

(a) Record the activation variance at each layer for all four initialization schemes (too_small, too_large, Xavier, He).

(b) At what depth does Xavier initialization show significant variance decay for ReLU networks?

(c) Plot the activation variance vs. layer depth for all four schemes on the same axes. What does this tell you about the relationship between initialization and trainable depth?


Exercise 6.13 (**)

The chapter discusses He initialization for ReLU. Derive the appropriate initialization variance for Leaky ReLU with slope $\alpha$ for negative inputs.

(a) If $z \sim \mathcal{N}(0, \sigma^2)$ and $a = \text{LeakyReLU}(z)$, show that $\text{Var}(a) = \frac{1 + \alpha^2}{2} \sigma^2$.

(b) What initialization variance preserves the activation variance across layers?

(c) Verify your derivation empirically by running the initialization experiment with Leaky ReLU and your derived variance.


Exercise 6.14 (**)

Implement SELU (Scaled Exponential Linear Unit):

$$\text{SELU}(z) = \lambda \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{if } z \leq 0 \end{cases}$$

where $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$ are fixed constants derived to ensure self-normalizing behavior.

(a) Implement selu(z) and selu_gradient(z).

(b) Run the initialization experiment from Section 6.9 with SELU activation and normal initialization with $\text{Var}(w) = 1/n_\text{in}$ (LeCun initialization). Does the variance remain stable across 20 layers?

(c) Why does SELU not require He or Xavier initialization? What property of the $\alpha$ and $\lambda$ constants ensures self-normalization?


PyTorch and Framework Comparison

Exercise 6.15 (**)

Using PyTorch's autograd, verify the backpropagation equations from Section 6.7.

(a) Create a 2-layer network with known weights (do not use random initialization; set specific values). Compute the forward pass and loss for a single input.

(b) Call loss.backward() and extract the gradients of each parameter.

(c) Manually compute the same gradients using the chain rule equations from Section 6.7. Verify that they match PyTorch's gradients.


Exercise 6.16 (**)

PyTorch's nn.BCEWithLogitsLoss combines the sigmoid activation and binary cross-entropy loss into a single numerically stable operation.

(a) Explain why computing sigmoid(z) followed by BCELoss can be numerically unstable. What input values cause problems?

(b) Derive the formula for BCEWithLogitsLoss directly from the logit $z$ (without computing $\sigma(z)$ as an intermediate):

$$\mathcal{L}(z, y) = \max(z, 0) - z \cdot y + \log(1 + e^{-|z|})$$

Show that this is algebraically equivalent to $-[y \log \sigma(z) + (1-y)\log(1-\sigma(z))]$.

(c) Modify the PyTorchMLP to use nn.BCEWithLogitsLoss instead of nn.BCELoss (removing the sigmoid from the output layer). Verify that training produces the same results.
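The formula in part (b) transcribes directly into numpy; the sketch below (our own helpers, not the chapter's code) also demonstrates the instability from part (a):

```python
import numpy as np

def bce_with_logits(z, y):
    """Stable binary cross-entropy computed from the logit, per part (b)."""
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

def naive_bce(z, y):
    """Sigmoid followed by BCE -- the unstable route."""
    s = 1.0 / (1.0 + np.exp(-z))
    with np.errstate(divide="ignore"):
        return -(y * np.log(s) + (1 - y) * np.log(1 - s))

# The two agree at moderate logits...
assert np.isclose(bce_with_logits(2.0, 1.0), naive_bce(2.0, 1.0))
# ...but only the stable form survives extreme logits, where sigmoid
# saturates to exactly 1.0 and log(1 - s) becomes log(0).
print(bce_with_logits(1000.0, 0.0), naive_bce(1000.0, 0.0))
```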


Exercise 6.17 (**)

Compare the training speed of the numpy and PyTorch implementations.

(a) Time both implementations on the click prediction dataset (10,000 samples, 100 epochs, batch size 64) using time.perf_counter().

(b) Explain the performance difference. Which operations benefit most from PyTorch's optimizations?

(c) If a GPU is available, time the PyTorch implementation on GPU vs. CPU. At what batch size does the GPU become faster than the CPU for this problem size?


Deeper Investigations

Exercise 6.18 (***)

Implement a NumpyMLP variant that supports arbitrary computational graphs (not just sequential layers). Specifically, implement a network with a skip connection: the output of layer $\ell$ is $\mathbf{a}^{(\ell)} = \sigma(\mathbf{z}^{(\ell)}) + \mathbf{a}^{(\ell-2)}$ (adding the activation from two layers back).

(a) Modify the forward pass to support skip connections.

(b) Derive and implement the backward pass with skip connections. How does the gradient flow change?

(c) Train a 10-layer network with and without skip connections on the click data. Compare gradient norms at the first layer. Does the skip connection reduce vanishing gradients?


Exercise 6.19 (***)

Implement automatic differentiation for a small set of operations. Build a Value class that records operations in a computational graph and supports reverse-mode autodiff.

class Value:
    """A scalar value with automatic differentiation support."""

    def __init__(self, data: float, _children=(), _op=""):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        # Implement: record the operation, define _backward
        ...

    def __mul__(self, other):
        # Implement: record the operation, define _backward
        ...

    def relu(self):
        # Implement: record the operation, define _backward
        ...

    def backward(self):
        # Topological sort, then call _backward in reverse order
        ...

(a) Implement __add__, __mul__, __neg__, __sub__, __pow__, relu(), sigmoid(), log().

(b) Implement the backward() method using topological sort.

(c) Verify correctness with gradient checking on $f(x, y) = \sigma(x \cdot y + x^2)$.

(d) Build a 2-layer perceptron using your Value class and train it on a toy dataset (XOR or circles).
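To calibrate your implementation, here is one minimal version of __add__, __mul__, and backward(); the remaining operations in (a) follow the same record-the-operation, define-_backward pattern:

```python
class Value:
    """Minimal reference sketch; see the skeleton above for the full interface."""

    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), "+")
        def _backward():
            # d(out)/d(self) = d(out)/d(other) = 1; accumulate (+=) for fan-out.
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), "*")
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort, then propagate gradients in reverse order.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(3.0), Value(4.0)
out = a * b + a      # df/da = b + 1 = 5, df/db = a = 3
out.backward()
print(a.grad, b.grad)
```

The accumulation with += (rather than =) is what makes fan-out correct: a appears twice in the expression above, and both paths contribute to a.grad.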


Exercise 6.20 (***)

The universal approximation theorem guarantees that a single hidden layer with enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy. Demonstrate this empirically.

(a) Define the target function $f(x) = \sin(3x) + 0.5\cos(7x)$ on $[-\pi, \pi]$.

(b) Train single-hidden-layer networks with width $N \in \{4, 16, 64, 256, 1024\}$ to approximate $f$. Use MSE loss and report the final MSE for each width.

(c) Plot the learned function vs. the true function for each width. At what width does the approximation become visually indistinguishable?

(d) Now try approximating $g(\mathbf{x}) = \sin(3x_1) \cdot \cos(3x_2)$ on $[-\pi, \pi]^2$. How does the required width scale with input dimension?


Exercise 6.21 (***)

Implement batch normalization for the numpy MLP (a preview of Chapter 7).

Batch normalization normalizes the pre-activation values within each mini-batch:

$$\hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

$$\tilde{z}_i = \gamma \hat{z}_i + \beta$$

where $\mu_B$ and $\sigma_B^2$ are the batch mean and variance, and $\gamma, \beta$ are learnable parameters.

(a) Implement the forward pass for batch normalization.

(b) Derive and implement the backward pass (the gradient with respect to $\gamma$, $\beta$, and the input $z$).

(c) Add batch normalization to the NumpyMLP after each hidden layer's linear transformation (before the activation). Compare training with and without batch normalization on the click data.
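A sketch of the forward pass for part (a), assuming Z has shape (batch, features); the helper name and cache layout are our own choices:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    """Forward pass matching the equations above; cache what backward will need."""
    mu = Z.mean(axis=0)
    var = Z.var(axis=0)
    Z_hat = (Z - mu) / np.sqrt(var + eps)
    out = gamma * Z_hat + beta
    cache = (Z_hat, var, gamma, eps)
    return out, cache

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
out, _ = batchnorm_forward(Z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))  # approximately 0 and 1 per column
```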


Exercise 6.22 (***)

Analyze the loss landscape of a small neural network.

(a) Train a NumpyMLP with architecture [2, 4, 1] on a 2D classification dataset (use sklearn.datasets.make_moons).

(b) After training, take the final weight vector $\theta^*$ and two random directions $\mathbf{d}_1, \mathbf{d}_2$ in parameter space. Compute the loss on a grid:

$$L(\alpha, \beta) = \mathcal{L}(f_{\theta^* + \alpha \mathbf{d}_1 + \beta \mathbf{d}_2}(X), y)$$

for $\alpha, \beta \in [-1, 1]$.

(c) Plot the loss landscape as a contour plot. Is the minimum sharp or flat? How does the landscape change with different initialization seeds?
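A sketch of the grid evaluation in part (b), checked here on a toy quadratic loss rather than a trained network (the helper name is ours):

```python
import numpy as np

def loss_slice(loss_fn, theta_star, d1, d2, grid=25, span=1.0):
    """Evaluate the loss on the 2-D slice theta* + alpha*d1 + beta*d2."""
    alphas = np.linspace(-span, span, grid)
    betas = np.linspace(-span, span, grid)
    L = np.empty((grid, grid))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            L[i, j] = loss_fn(theta_star + a * d1 + b * d2)
    return alphas, betas, L

# Toy check: a quadratic "loss" should give a paraboloid centered on theta*.
theta_star = np.zeros(6)
d1, d2 = np.eye(6)[0], np.eye(6)[1]
_, _, L = loss_slice(lambda t: float(np.sum(t ** 2)), theta_star, d1, d2)
print(L.min(), L[12, 12])  # minimum of ~0 at the center of the grid
```

For a real network, loss_fn flattens theta back into the weight matrices before the forward pass, and d1, d2 are drawn randomly in the flattened space.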


Applied and Production Problems

Exercise 6.23 (**)

StreamRec's data science team wants to understand which features are most important for click prediction.

(a) Train the numpy MLP on the click data and compute the gradient of the output with respect to the input features: $\frac{\partial \hat{y}}{\partial \mathbf{x}}$.

(b) Average the absolute gradients across all test examples to get a feature importance score for each input feature.

(c) Compare these gradient-based importance scores to the feature importance from a random forest. Do they agree on the top-5 most important features?
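If the model's input gradient is available, parts (a) and (b) reduce to a few lines. The sketch below approximates the gradient by central differences on an arbitrary predict function, and the toy linear model is our own illustration:

```python
import numpy as np

def input_gradient(predict, x, eps=1e-4):
    """Central-difference gradient of a scalar predict(x) wrt input features."""
    g = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        g[i] = (predict(xp) - predict(xm)) / (2 * eps)
    return g

def importance(predict, X):
    """Mean absolute input gradient across examples, as in part (b)."""
    return np.mean([np.abs(input_gradient(predict, x)) for x in X], axis=0)

# Toy model: feature 0 matters 3x as much as feature 1; feature 2 is ignored.
predict = lambda x: 3.0 * x[0] + 1.0 * x[1]
X = np.random.default_rng(0).normal(size=(10, 3))
print(importance(predict, X))  # approximately [3, 1, 0]
```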


Exercise 6.24 (***)

Implement dropout for the numpy MLP.

(a) During training, implement inverted dropout: multiply each activation by a random binary mask and divide by the keep probability $p$.

(b) During inference, use the full activations (no dropout).

(c) Train two networks on the click data with 50,000 samples: one with dropout ($p = 0.5$) and one without. Compare training vs. validation loss curves. Does dropout reduce overfitting?
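One possible implementation of inverted dropout for part (a), with the 1/p scaling that makes part (b) a no-op at inference time (the function name is ours):

```python
import numpy as np

def dropout_forward(A, p_keep, training, rng):
    """Inverted dropout: scale at train time so inference needs no change."""
    if not training:
        return A, None
    mask = (rng.random(A.shape) < p_keep) / p_keep
    return A * mask, mask

rng = np.random.default_rng(0)
A = np.ones((1000, 8))
out, mask = dropout_forward(A, p_keep=0.5, training=True, rng=rng)
print(out.mean())  # close to 1.0 in expectation, thanks to the 1/p scaling
```

The mask must also be cached and applied to the incoming gradient in the backward pass, since dropped units contribute nothing to the loss.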


Exercise 6.25 (***)

Meridian Financial needs to understand why their credit scoring MLP makes specific decisions. Implement integrated gradients — a method for attributing a prediction to input features.

Given input $\mathbf{x}$ and baseline $\mathbf{x}_0$ (typically $\mathbf{0}$), integrated gradients are:

$$\text{IG}_i(\mathbf{x}) = (x_i - x_{0,i}) \int_0^1 \frac{\partial f(\mathbf{x}_0 + \alpha(\mathbf{x} - \mathbf{x}_0))}{\partial x_i} d\alpha$$

(a) Implement integrated gradients using the Riemann sum approximation with $m = 50$ steps.

(b) Verify the completeness axiom: $\sum_i \text{IG}_i(\mathbf{x}) = f(\mathbf{x}) - f(\mathbf{x}_0)$.

(c) Apply integrated gradients to explain three predictions from the credit scoring MLP. Which features contribute most to high-risk predictions?
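A sketch of the Riemann-sum approximation for part (a), assuming you can query the model's input gradient as a function grad_fn (our own interface, not the chapter's); the toy quadratic also checks part (b)'s completeness axiom:

```python
import numpy as np

def integrated_gradients(grad_fn, x, x0=None, m=50):
    """Midpoint-rule IG along the straight-line path from x0 to x."""
    x0 = np.zeros_like(x) if x0 is None else x0
    alphas = (np.arange(m) + 0.5) / m
    avg_grad = np.mean([grad_fn(x0 + a * (x - x0)) for a in alphas], axis=0)
    return (x - x0) * avg_grad

# Completeness check on a model with a known gradient: f(x) = sum(x**2).
f = lambda x: float(np.sum(x ** 2))
grad_fn = lambda x: 2 * x
x = np.array([1.0, -2.0, 0.5])
ig = integrated_gradients(grad_fn, x)
print(ig.sum(), f(x) - f(np.zeros(3)))  # both approximately 5.25
```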


Research-Level Problems

Exercise 6.26 (****)

The lottery ticket hypothesis (Frankle and Carbin, 2019) conjectures that dense neural networks contain sparse subnetworks ("winning tickets") that, when trained in isolation from their original initialization, achieve comparable accuracy.

(a) Train a NumpyMLP with architecture [20, 128, 64, 32, 1] on the click data.

(b) Prune 50% of the weights (set the smallest magnitude weights to zero). Fine-tune the pruned network. What is the accuracy loss?

(c) Reset the surviving weights to their original initialization values and retrain from scratch. Does this "rewound" subnetwork achieve comparable accuracy to the full network?

(d) Repeat with 70%, 80%, and 90% pruning. At what sparsity level does performance degrade significantly?


Exercise 6.27 (****)

The neural tangent kernel (NTK) theory (Jacot et al., 2018) shows that infinitely wide neural networks trained with gradient descent behave as linear models in a fixed feature space determined by the initialization.

(a) For a single-hidden-layer network $f(\mathbf{x}; \theta) = \frac{1}{\sqrt{N}} \sum_{j=1}^{N} c_j \sigma(\mathbf{w}_j^T \mathbf{x})$, derive the NTK:

$$K(\mathbf{x}, \mathbf{x}') = \left\langle \nabla_\theta f(\mathbf{x}; \theta), \nabla_\theta f(\mathbf{x}'; \theta) \right\rangle$$

evaluated at initialization.

(b) Implement the empirical NTK computation for a network with width $N$. Compute the NTK matrix for 100 training points.

(c) As $N \to \infty$, the NTK converges to a deterministic kernel. Compute the NTK for $N \in \{16, 64, 256, 1024, 4096\}$ and measure the variance of the kernel entries across random initializations. Does the variance decrease?

(d) Train the wide network with gradient descent and compare the predictions to kernel regression with the NTK. At what width do the predictions become effectively identical?


Exercise 6.28 (****)

Investigate the double descent phenomenon empirically.

(a) Generate a regression dataset with $n = 100$ training points in $d = 20$ dimensions with moderate noise.

(b) Train single-hidden-layer networks with width $N$ ranging from 5 to 500 (in increments of 5). For each width, train for enough epochs to converge. Record the test MSE.

(c) Plot test MSE vs. network width. Do you observe the classical U-shaped bias-variance tradeoff? Does a second descent in test error occur after the interpolation threshold (where the number of parameters exceeds the number of training points)?

(d) How does the double descent curve change with the amount of label noise? Repeat with noise standard deviations of $\{0.0, 0.1, 0.5, 1.0\}$.


Exercise 6.29 (****)

Implement a neural network with complex-valued weights and activations.

(a) Extend the NumpyMLP to support complex-valued parameters. The forward pass computes $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$ where $\mathbf{W} \in \mathbb{C}^{n_\ell \times n_{\ell-1}}$. Use the split-complex activation: apply ReLU independently to the real and imaginary parts.

(b) Derive the backward pass for complex-valued networks using Wirtinger derivatives.

(c) Train on a regression task where the target function is $f(x_1, x_2) = \text{Re}(e^{i(x_1 + x_2 i)})$. Compare performance with a real-valued network of the same parameter count.


Exercise 6.30 (****)

Design an experiment to test whether depth or width matters more for approximation quality.

(a) Define a target function $f: \mathbb{R}^{10} \to \mathbb{R}$ that has compositional structure (e.g., $f(\mathbf{x}) = g(h_1(x_1, x_2, x_3), h_2(x_4, x_5, x_6), h_3(x_7, x_8, x_9, x_{10}))$ where $g, h_1, h_2, h_3$ are simple nonlinear functions).

(b) Train networks with approximately the same total parameter count but different depth-width tradeoffs: (1 hidden layer, width 500), (3 hidden layers, width 100), (5 hidden layers, width 50), (10 hidden layers, width 25). Use He initialization and ReLU.

(c) Compare approximation quality (test MSE) after training. Which architecture works best? How does this relate to the compositional structure of $f$?

(d) Repeat with a non-compositional target function (e.g., a random smooth function). Does the depth advantage disappear?