Exercises — Chapter 2: Multivariate Calculus and Optimization
Exercises are graded by difficulty:

- ⭐ Apply a technique from the chapter
- ⭐⭐ Extend or combine techniques
- ⭐⭐⭐ Derive, implement from scratch, or design a component
- ⭐⭐⭐⭐ Research-level problems connecting to open questions
Gradients and Partial Derivatives
Exercise 2.1 ⭐
Compute the gradient of the following functions by hand. Verify each result numerically using finite differences in numpy.
(a) $f(x_1, x_2) = x_1^2 x_2 + 3x_2^3 - 7x_1$
(b) $f(\mathbf{x}) = \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$ where $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{b} \in \mathbb{R}^m$
(c) $f(\mathbf{x}) = \log \sum_{i=1}^{n} \exp(x_i)$ (the log-sum-exp function)
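As a starting point for the numerical verification, here is a minimal central-difference gradient checker, applied to part (a). The function names (`numerical_grad`, `grad_f`) and the test point are our own choices, not fixed by the exercise:

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar f at x."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# Part (a): f(x1, x2) = x1^2 x2 + 3 x2^3 - 7 x1, with its hand-derived gradient.
f = lambda x: x[0]**2 * x[1] + 3 * x[1]**3 - 7 * x[0]
grad_f = lambda x: np.array([2 * x[0] * x[1] - 7, x[0]**2 + 9 * x[1]**2])

x = np.array([1.5, -0.5])
assert np.allclose(numerical_grad(f, x), grad_f(x), atol=1e-5)
```

The same checker works unchanged for (b) and (c) once you supply the corresponding `f`.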
Exercise 2.2 ⭐
The directional derivative of $f$ at $\mathbf{x}$ in direction $\mathbf{v}$ is $D_\mathbf{v} f = \nabla f^T \mathbf{v}$. For $f(x_1, x_2) = x_1^2 + 4x_2^2$ at the point $(1, 1)$:
(a) Compute $\nabla f(1, 1)$.
(b) Find the directional derivative in the direction $\mathbf{v} = \frac{1}{\sqrt{2}}(1, 1)$.
(c) Find the unit direction $\mathbf{v}^*$ that maximizes $D_\mathbf{v} f$. What is the maximum directional derivative?
(d) Find a direction $\mathbf{v}_0$ along which the directional derivative is exactly zero. Interpret this geometrically.
Exercise 2.3 ⭐⭐
Derive the gradient of the softmax cross-entropy loss. Given logits $\mathbf{z} \in \mathbb{R}^K$ and a one-hot target $\mathbf{y}$ (where $y_c = 1$ for the true class $c$):
$$ \mathcal{L}(\mathbf{z}) = -\sum_{k=1}^K y_k \log \frac{\exp(z_k)}{\sum_j \exp(z_j)} $$
(a) Show that $\frac{\partial \mathcal{L}}{\partial z_k} = \text{softmax}(\mathbf{z})_k - y_k$.
(b) Why is this expression so "clean"? Connect it to the theory of exponential families.
(c) Implement both the loss and gradient in numpy. Verify the gradient using finite differences for $K = 5$ and a random logit vector.
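A minimal sketch of part (c), using the max-shift trick for a numerically stable log-sum-exp (the function name and random seed are our own):

```python
import numpy as np

def softmax_xent(z, y):
    """Loss and gradient of softmax cross-entropy for logits z, one-hot y."""
    z_shift = z - z.max()                     # stable log-sum-exp
    log_probs = z_shift - np.log(np.exp(z_shift).sum())
    loss = -(y * log_probs).sum()
    grad = np.exp(log_probs) - y              # the 'clean' softmax(z) - y
    return loss, grad

rng = np.random.default_rng(0)
K = 5
z = rng.normal(size=K)
y = np.eye(K)[2]                              # true class c = 2

loss, grad = softmax_xent(z, y)

# Finite-difference check of the gradient, coordinate by coordinate.
eps = 1e-6
num = np.array([(softmax_xent(z + eps * e, y)[0]
                 - softmax_xent(z - eps * e, y)[0]) / (2 * eps)
                for e in np.eye(K)])
assert np.allclose(grad, num, atol=1e-6)
```

Note that the gradient components sum to zero, since the softmax probabilities and the one-hot target each sum to one.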
Exercise 2.4 ⭐⭐
For the regularized logistic regression loss from the Meridian Financial credit scoring example (Section 2.2.3):
(a) Compute the Hessian $\mathbf{H} = \nabla^2 \mathcal{L}(\boldsymbol{\theta})$.
(b) Show that $\mathbf{H}$ is positive semidefinite for any $\boldsymbol{\theta}$, and positive definite when $\lambda > 0$. What does this imply about the optimization landscape?
(c) Compute the condition number $\kappa(\mathbf{H})$ at the optimal $\boldsymbol{\theta}^*$ for a synthetic dataset with 100 features sampled from $\mathcal{N}(\mathbf{0}, \mathbf{I})$. How does $\kappa$ change as $\lambda$ varies from $10^{-6}$ to $10$?
Jacobians and the Chain Rule
Exercise 2.5 ⭐
Consider the function $\mathbf{f}(\mathbf{x}) = \mathbf{A} \sigma(\mathbf{B}\mathbf{x})$ where $\sigma$ is the element-wise sigmoid function, $\mathbf{A} \in \mathbb{R}^{p \times q}$, $\mathbf{B} \in \mathbb{R}^{q \times n}$, and $\mathbf{x} \in \mathbb{R}^n$.
(a) Compute the Jacobian $\mathbf{J} = \frac{\partial \mathbf{f}}{\partial \mathbf{x}}$ using the chain rule. Express your answer in terms of $\mathbf{A}$, $\mathbf{B}$, and a diagonal matrix involving $\sigma'$.
(b) What are the dimensions of $\mathbf{J}$?
(c) Implement the Jacobian computation in numpy and verify against torch.autograd.functional.jacobian.
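A sketch of the closed form for parts (a)-(c): $\mathbf{J} = \mathbf{A}\,\text{diag}(\sigma'(\mathbf{B}\mathbf{x}))\,\mathbf{B} \in \mathbb{R}^{p \times n}$, with the diagonal matrix realized by broadcasting. The dimensions and the numerical check below are illustrative choices:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def jacobian_f(A, B, x):
    """Jacobian of f(x) = A @ sigmoid(B @ x), i.e. A @ diag(sigma'(Bx)) @ B."""
    s = sigmoid(B @ x)
    d = s * (1 - s)                  # sigma'(Bx), shape (q,)
    return A @ (d[:, None] * B)      # shape (p, n)

rng = np.random.default_rng(1)
p, q, n = 3, 4, 5
A, B, x = rng.normal(size=(p, q)), rng.normal(size=(q, n)), rng.normal(size=n)

J = jacobian_f(A, B, x)
assert J.shape == (p, n)

# Central-difference check, column by column.
eps = 1e-6
f = lambda x: A @ sigmoid(B @ x)
J_num = np.stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                  for e in np.eye(n)], axis=1)
assert np.allclose(J, J_num, atol=1e-6)
```

The torch.autograd.functional.jacobian comparison asked for in (c) should agree with both matrices above.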
Exercise 2.6 ⭐⭐
Computational graph tracing. Consider $f(x, y, z) = \frac{xy + \sin(z)}{1 + e^{-x}}$.
(a) Draw the computational graph, decomposing into elementary operations ($+$, $\times$, $\sin$, $\exp$, reciprocal).
(b) Perform the forward pass for $x = 1, y = 2, z = \pi/4$. Label every intermediate node with its value.
(c) Perform the backward pass, computing $\partial f / \partial x$, $\partial f / \partial y$, $\partial f / \partial z$. Show your work at each node.
(d) Verify your answers with numerical differentiation.
Exercise 2.7 ⭐⭐⭐
Forward mode vs. reverse mode AD.
(a) For a function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, how many forward passes are needed to compute the full Jacobian using forward mode AD? How many backward passes are needed using reverse mode?
(b) For a neural network with $n = 10^6$ parameters and a scalar loss ($m = 1$), compute the ratio of forward-mode cost to reverse-mode cost.
(c) Implement forward-mode AD for a simple computational graph (the two-layer network from Section 2.4.3) by propagating Jacobian-vector products (tangent vectors) through the graph. Compare the output to the backward pass result.
(d) Describe a scenario where forward mode is more efficient than reverse mode. Hint: consider computing the diagonal of the Hessian.
Convexity and Optimization Theory
Exercise 2.8 ⭐
Determine whether each function is convex, concave, or neither. Justify your answer using the second-order condition (eigenvalues of the Hessian) where the function is twice differentiable; for (d), argue directly from the definition of convexity.
(a) $f(x_1, x_2) = x_1^2 + x_2^2 + x_1 x_2$
(b) $f(x_1, x_2) = e^{x_1 + 2x_2}$
(c) $f(x_1, x_2) = x_1 x_2$
(d) $f(\mathbf{x}) = \max(x_1, x_2, \ldots, x_n)$
Exercise 2.9 ⭐⭐
Condition number experiment. Generate a 2D quadratic loss $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{A} \mathbf{x}$ with:
(a) $\kappa(\mathbf{A}) = 1$ (isotropic: $\mathbf{A} = \mathbf{I}$)
(b) $\kappa(\mathbf{A}) = 10$
(c) $\kappa(\mathbf{A}) = 100$
(d) $\kappa(\mathbf{A}) = 1000$
For each case, run gradient descent with optimal learning rate $\eta = 2 / (\lambda_{\max} + \lambda_{\min})$ for 200 steps from the same starting point. Plot the convergence curves and the trajectories on contour plots. Write a paragraph explaining the relationship between condition number and convergence behavior.
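The experimental setup above can be sketched as follows; we take $\mathbf{A} = \text{diag}(\kappa, 1)$ so that $\kappa(\mathbf{A}) = \kappa$ by construction (the diagonal choice and starting point are our own, the plotting is left out):

```python
import numpy as np

def run_gd(kappa, steps=200, x0=(1.0, 1.0)):
    """Gradient descent on f(x) = 0.5 x^T A x with A = diag(kappa, 1)."""
    A = np.diag([float(kappa), 1.0])
    eta = 2.0 / (kappa + 1.0)            # optimal step 2 / (lam_max + lam_min)
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = x - eta * (A @ x)            # grad f(x) = A x
        traj.append(x.copy())
    return np.array(traj)

# Distance to the minimum after 200 steps, for each conditioning regime.
finals = {k: np.linalg.norm(run_gd(k)[-1]) for k in (1, 10, 100, 1000)}
```

With $\kappa = 1$ the optimal step size solves the problem in a single iteration; as $\kappa$ grows, the contraction factor $(\kappa - 1)/(\kappa + 1)$ approaches 1 and convergence stalls, which is what the convergence curves should show.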
Exercise 2.10 ⭐⭐
Convexity of composition. Prove that:
(a) If $f$ is convex and $g$ is convex and non-decreasing, then $g \circ f$ is convex.
(b) The sum of convex functions is convex.
(c) Use (a) and (b) to prove that the logistic regression loss $\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N}\sum_i \log(1 + \exp(-y_i \boldsymbol{\theta}^T \mathbf{x}_i))$ is convex, where $y_i \in \{-1, +1\}$.
Exercise 2.11 ⭐⭐⭐
Strong convexity and convergence rate. For a $\mu$-strongly convex, $L$-smooth function $f$:
(a) Prove that $f(\mathbf{x}) \geq f(\mathbf{y}) + \nabla f(\mathbf{y})^T(\mathbf{x} - \mathbf{y}) + \frac{\mu}{2}\|\mathbf{x} - \mathbf{y}\|^2$ for all $\mathbf{x}, \mathbf{y}$.
(b) Using the result from (a), show that gradient descent with step size $\eta = 1/L$ satisfies $\|\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}^*\|^2 \leq (1 - \mu/L)\|\boldsymbol{\theta}_t - \boldsymbol{\theta}^*\|^2$.
(c) Derive the convergence rate $f(\boldsymbol{\theta}_t) - f(\boldsymbol{\theta}^*) \leq \frac{L}{2}(1 - \mu/L)^t \|\boldsymbol{\theta}_0 - \boldsymbol{\theta}^*\|^2$.
(d) How many iterations does gradient descent need to achieve $f(\boldsymbol{\theta}_t) - f(\boldsymbol{\theta}^*) \leq \epsilon$ as a function of $\kappa = L/\mu$ and $\epsilon$?
Implementing Optimizers
Exercise 2.12 ⭐
Implement gradient descent with a step-decay learning rate schedule. Run it on the credit scoring logistic regression problem with initial $\eta = 0.1$ and decay by factor 0.5 every 200 steps. Compare convergence to constant-LR gradient descent.
Exercise 2.13 ⭐⭐
Implement Nesterov momentum from scratch. Apply it to:
(a) The 2D quadratic with $\kappa = 100$.
(b) The Rosenbrock function starting from $(-1, 1)$.
Compare the trajectories to standard momentum. Does the "lookahead" improve convergence in both cases?
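A starter sketch of the Nesterov update, demonstrated on the quadratic from part (a); the step size and momentum coefficient are illustrative, and the Rosenbrock case in (b) will need a much smaller $\eta$ and more steps:

```python
import numpy as np

def nesterov(grad, x0, eta=1e-2, beta=0.9, steps=1000):
    """Nesterov momentum: evaluate the gradient at the lookahead point x + beta*v."""
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x + beta * v)        # lookahead gradient
        v = beta * v - eta * g
        x = x + v
    return x

# Part (a): the 2D quadratic with kappa = 100.
A = np.diag([100.0, 1.0])
quad_grad = lambda x: A @ x
x_star = nesterov(quad_grad, [1.0, 1.0])
assert np.linalg.norm(x_star) < 1e-6
```

Standard momentum evaluates the gradient at $\mathbf{x}$ rather than at $\mathbf{x} + \beta\mathbf{v}$; that single-line change is the comparison the exercise asks for.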
Exercise 2.14 ⭐⭐
Implement AdaGrad from scratch and apply it to a sparse feature problem. Generate a dataset with 100 features where only 5 features are relevant (non-zero) for any given example. Compare AdaGrad and SGD on convergence speed and final loss. Why does AdaGrad excel in this setting?
Exercise 2.15 ⭐⭐
Adam ablation study. Implement Adam with optional ablation of each component:
(a) Adam without bias correction (set $\hat{\mathbf{m}} = \mathbf{m}$, $\hat{\mathbf{v}} = \mathbf{v}$)
(b) Adam without the first moment (set $\hat{\mathbf{m}} = \nabla \mathcal{L}$, reducing to RMSProp with bias-corrected second moment)
(c) Adam without the second moment (set $\hat{\mathbf{v}} = 1$, reducing to momentum with bias-corrected first moment)
Run all four variants on the matrix factorization problem. Plot convergence curves. Which component contributes most to Adam's performance?
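One way to structure the ablation is a single update function with switches for each component (the flag names and the smoke-test objective are our own choices):

```python
import numpy as np

def adam_step(theta, g, state, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              bias_correct=True, use_m=True, use_v=True):
    """One Adam update with optional ablations for parts (a)-(c)."""
    state['t'] += 1
    t = state['t']
    state['m'] = b1 * state['m'] + (1 - b1) * g
    state['v'] = b2 * state['v'] + (1 - b2) * g**2
    m_hat = state['m'] / (1 - b1**t) if bias_correct else state['m']
    v_hat = state['v'] / (1 - b2**t) if bias_correct else state['v']
    if not use_m:
        m_hat = g                      # (b) drop first moment (RMSProp-like)
    if not use_v:
        v_hat = np.ones_like(g)        # (c) drop second moment (momentum-like)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps)

# Smoke test on f(theta) = 0.5 ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
state = {'t': 0, 'm': np.zeros(2), 'v': np.zeros(2)}
for _ in range(2000):
    theta = adam_step(theta, theta, state, eta=0.05)
```

Running the four variants then only requires toggling the flags while keeping the same `state` structure per run.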
Exercise 2.16 ⭐⭐⭐
Implement AdamW and compare it to Adam with L2 regularization on a logistic regression problem with $\lambda = 0.1$.
(a) Show that for Adam with L2 regularization, the effective weight decay is scaled by $1/\sqrt{\hat{v}_j}$ for each parameter $j$.
(b) Demonstrate empirically that Adam + L2 and AdamW produce different solutions.
(c) For which parameters is the difference largest? Plot the ratio of effective weight decay (Adam + L2) to the intended weight decay ($\lambda$).
Exercise 2.17 ⭐⭐⭐
Learning rate finder. Implement the learning rate range test (Smith, 2017):
- Start with a very small learning rate ($10^{-7}$).
- Increase it geometrically each step by a constant factor $\alpha = (10^{1} / 10^{-7})^{1/N}$, so that after $N$ steps the learning rate reaches $10^{1}$.
- Record the loss at each step.
- The optimal learning rate is just below the point where the loss starts increasing.
Apply this to: (a) Logistic regression on the credit scoring problem. (b) Matrix factorization on the StreamRec data. (c) The Rosenbrock function.
Do the suggested learning rates match the ones used in the chapter?
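A sketch of the range test for a plain gradient-descent setting, demonstrated on a small quadratic; the $10\times$ blow-up threshold used to stop early is our own heuristic, not part of Smith's procedure:

```python
import numpy as np

def lr_range_test(grad, loss, theta0, n_steps=100, lr_min=1e-7, lr_max=1e1):
    """Learning-rate range test: grow eta geometrically, record the loss."""
    alpha = (lr_max / lr_min) ** (1.0 / n_steps)
    theta = np.array(theta0, dtype=float)
    lrs, losses = [], []
    eta = lr_min
    for _ in range(n_steps):
        lrs.append(eta)
        losses.append(loss(theta))
        theta = theta - eta * grad(theta)
        eta *= alpha
        if losses[-1] > 10 * losses[0]:    # stop once the loss blows up
            break
    return np.array(lrs), np.array(losses)

# Demo on a 2D quadratic; plot losses vs. lrs on a log-x axis in practice.
A = np.diag([10.0, 1.0])
lrs, losses = lr_range_test(lambda th: A @ th,
                            lambda th: 0.5 * th @ A @ th, [1.0, 1.0])
```

For mini-batch problems, the loss at each step should be the running (smoothed) training loss rather than a single-batch value.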
Second-Order Methods and Advanced Topics
Exercise 2.18 ⭐⭐
Implement Newton's method with line search (backtracking Armijo condition) for a 10-dimensional quadratic function. Compare the number of iterations to reach $|f(\mathbf{x}_t) - f(\mathbf{x}^*)| < 10^{-10}$ for:
(a) Newton's method
(b) Gradient descent with optimal step size
(c) Adam with default hyperparameters
Comment on the tradeoff between iteration count and per-iteration cost.
Exercise 2.19 ⭐⭐⭐
Hessian-free optimization. Instead of computing and storing the full Hessian, use Pearlmutter's (1994) "trick": compute the Hessian-vector product $\mathbf{H}\mathbf{v}$ directly, at roughly the cost of two gradient evaluations.
(a) Show that $\mathbf{H}\mathbf{v} = \lim_{\epsilon \to 0} \frac{\nabla f(\mathbf{x} + \epsilon \mathbf{v}) - \nabla f(\mathbf{x})}{\epsilon}$.
(b) Implement this as a function hessian_vector_product(f, x, v, eps=1e-5) using finite differences.
(c) Use the conjugate gradient method with Hessian-vector products to solve $\mathbf{H}\mathbf{d} = -\nabla f$ without ever forming $\mathbf{H}$ explicitly. Apply this to a 100-dimensional quadratic.
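The pieces of parts (b) and (c) can be sketched as below, verified on a quadratic where the finite-difference HVP is exact up to roundoff (the matrix construction and seed are illustrative):

```python
import numpy as np

def hessian_vector_product(grad_f, x, v, eps=1e-5):
    """Central finite-difference Hv ~ (grad f(x+eps v) - grad f(x-eps v)) / (2 eps)."""
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

def cg_solve(hvp, b, tol=1e-10, max_iter=200):
    """Conjugate gradients for H d = b using only Hessian-vector products."""
    d = np.zeros_like(b)
    r = b.copy()                     # residual b - H d
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

# 100-dimensional quadratic: f(x) = 0.5 x^T H x - b^T x, so grad f = H x - b.
rng = np.random.default_rng(0)
M = rng.normal(size=(100, 100))
H = M @ M.T + 100 * np.eye(100)      # symmetric positive definite
b = rng.normal(size=100)
grad_f = lambda x: H @ x - b

x0 = np.zeros(100)
d = cg_solve(lambda v: hessian_vector_product(grad_f, x0, v), -grad_f(x0))
assert np.allclose(H @ (x0 + d), b, atol=1e-4)   # Newton step hits the minimizer
```

Note that $\mathbf{H}$ is used here only to define `grad_f` and to check the answer; the solver itself never forms it.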
Exercise 2.20 ⭐⭐⭐
KKT conditions in practice. Consider Meridian Financial's credit scoring model with a fairness constraint: the model's positive prediction rate for protected group A must be within 5% of the positive prediction rate for group B.
(a) Formulate this as a constrained optimization problem. What is the objective? What are the constraints?
(b) Write the Lagrangian and derive the KKT conditions.
(c) Implement a projected gradient descent method that enforces the fairness constraint at each step.
(d) Compare the constrained solution to the unconstrained solution. How much does the loss increase to satisfy the fairness constraint?
Exercise 2.21 ⭐⭐⭐
Loss landscape visualization. Implement the filter-normalized random direction method of Li et al. (2018) for visualizing loss surfaces.
(a) Generate two random direction vectors $\boldsymbol{\delta}_1, \boldsymbol{\delta}_2$ in parameter space.
(b) Normalize each direction so that the filter norms match the corresponding filter norms of the trained network.
(c) Evaluate the loss on a grid: $\mathcal{L}(\boldsymbol{\theta}^* + \alpha \boldsymbol{\delta}_1 + \beta \boldsymbol{\delta}_2)$ for $\alpha, \beta \in [-1, 1]$.
(d) Create a 3D surface plot and a contour plot of the resulting loss landscape. Apply this to a small neural network (2-layer, 50 hidden units) trained on MNIST.
Integration and Research
Exercise 2.22 ⭐
Run all five optimizers (SGD, Momentum, Nesterov, RMSProp, Adam) on the Rosenbrock function from the same starting point. Create a single contour plot with all five trajectories. Which optimizer reaches the minimum first? Which takes the most direct path?
Exercise 2.23 ⭐⭐
Gradient noise and generalization. Train a simple 2-layer neural network on a small classification dataset (e.g., make_moons from scikit-learn) using:
(a) Full-batch gradient descent
(b) SGD with batch size 32
(c) SGD with batch size 4
Measure both training loss and test accuracy. Does smaller batch size lead to better generalization? Connect your findings to the noise-regularization hypothesis (Keskar et al., 2017).
Exercise 2.24 ⭐⭐⭐
Optimizer sensitivity analysis. For the StreamRec matrix factorization problem, create a heatmap showing final validation loss as a function of:
(a) Learning rate $\eta \in \{10^{-4}, 3 \times 10^{-4}, 10^{-3}, 3 \times 10^{-3}, 10^{-2}, 3 \times 10^{-2}, 10^{-1}\}$
(b) Momentum $\beta \in \{0.0, 0.5, 0.9, 0.95, 0.99\}$ (for SGD with momentum)
(c) $(\beta_1, \beta_2) \in \{0.8, 0.9, 0.95\} \times \{0.99, 0.999, 0.9999\}$ (for Adam)
Which optimizer has the widest "basin" of good hyperparameters? Which is the most sensitive to hyperparameter choices?
Exercise 2.25 ⭐⭐⭐
Warm restarts. Implement cosine annealing with warm restarts (SGDR; Loshchilov & Hutter, 2017). The learning rate follows a cosine curve within each "cycle" of length $T_i$, then jumps back to $\eta_{\max}$:
$$ \eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t_{\text{cur}}}{T_i}\pi\right)\right) $$
where $t_{\text{cur}}$ is the number of epochs since the last restart and $T_i$ is the current cycle length (which can increase after each restart).
(a) Implement SGDR with cycle length doubling ($T_{i+1} = 2T_i$).
(b) Apply SGDR to the matrix factorization problem and compare convergence to standard cosine annealing.
(c) Plot the learning rate schedule and loss curve together. Do the warm restarts correlate with temporary loss increases?
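The schedule itself, with the cycle-doubling of part (a), can be sketched as a standalone function (the default $T_0$, $\eta_{\min}$, $\eta_{\max}$ values are illustrative):

```python
import numpy as np

def sgdr_schedule(n_epochs, T0=10, mult=2, eta_min=1e-4, eta_max=1e-1):
    """Cosine annealing with warm restarts; cycle length scales by mult at each restart."""
    etas = []
    T_i, t_cur = T0, 0
    for _ in range(n_epochs):
        etas.append(eta_min + 0.5 * (eta_max - eta_min)
                    * (1 + np.cos(np.pi * t_cur / T_i)))
        t_cur += 1
        if t_cur == T_i:                 # restart: reset t_cur, grow the cycle
            t_cur, T_i = 0, T_i * mult
    return np.array(etas)

etas = sgdr_schedule(70)   # cycles of length 10, 20, 40
```

The restarts are visible as discontinuous jumps back to $\eta_{\max}$ at epochs 10 and 30 in this configuration.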
Exercise 2.26 ⭐⭐⭐⭐
Sharpness-Aware Minimization (SAM). Foret et al. (2021) proposed SAM, which seeks parameters that lie in flat regions of the loss landscape:
$$ \min_{\boldsymbol{\theta}} \max_{\|\boldsymbol{\epsilon}\| \leq \rho} \mathcal{L}(\boldsymbol{\theta} + \boldsymbol{\epsilon}) $$
(a) Show that the inner maximization has an approximate solution $\boldsymbol{\epsilon}^* \approx \rho \frac{\nabla \mathcal{L}(\boldsymbol{\theta})}{\|\nabla \mathcal{L}(\boldsymbol{\theta})\|}$.
(b) Implement SAM on top of SGD. Each SAM step requires two forward-backward passes: one to compute $\boldsymbol{\epsilon}^*$, one to compute the gradient at $\boldsymbol{\theta} + \boldsymbol{\epsilon}^*$.
(c) Train a 3-layer neural network on CIFAR-10 (first 10,000 examples) with and without SAM. Compare test accuracy.
(d) After training, compute the trace of the Hessian (using the Hutchinson trace estimator) for both SAM and vanilla SGD solutions. Does SAM find flatter minima?
Exercise 2.27 ⭐⭐⭐⭐
Does Adam converge? Reddi et al. (2018) showed that Adam can fail to converge on simple convex problems.
(a) Construct the counterexample from the paper: an online convex optimization problem where Adam diverges.
(b) Implement AMSGrad (Reddi et al., 2018), which fixes the convergence issue by using $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$ instead of the exponentially decaying $v_t$.
(c) Run both Adam and AMSGrad on the counterexample. Does AMSGrad converge?
(d) Run both on a practical problem (logistic regression, matrix factorization, or neural network training). Does the convergence difference matter in practice? Discuss why AMSGrad has not replaced Adam as the default.
Exercise 2.28 ⭐⭐⭐⭐
The edge of stability. Cohen et al. (2021) observed that gradient descent on neural network losses enters an "edge of stability" regime: the largest eigenvalue of the Hessian hovers near $2/\eta$ (the stability threshold), and the loss oscillates but ultimately decreases.
(a) Train a 2-layer fully connected network on a subset of MNIST using full-batch gradient descent with a fixed learning rate.
(b) At every 10th step, estimate the largest eigenvalue of the Hessian using the power iteration method.
(c) Plot the largest Hessian eigenvalue, the loss, and the stability threshold $2/\eta$ on the same plot.
(d) Do your results match the "edge of stability" prediction? What does this phenomenon imply about the classical convergence theory from Section 2.5?
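The eigenvalue estimator for part (b) can be sketched with power iteration driven purely by Hessian-vector products; the sanity check below uses a known matrix, whereas for the network the `hvp` callable would come from autograd or a finite difference of gradients. Note that plain power iteration converges to the eigenvalue of largest magnitude, which matches the largest eigenvalue when the Hessian's top eigenvalue dominates in absolute value:

```python
import numpy as np

def top_hessian_eigenvalue(hvp, dim, n_iter=100, seed=0):
    """Power iteration for the dominant Hessian eigenvalue via Hv products."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(n_iter):
        Hv = hvp(v)
        lam = v @ Hv                 # Rayleigh quotient estimate
        v = Hv / np.linalg.norm(Hv)
    return lam

# Sanity check on a matrix with known spectrum {5, 2, 1}.
H = np.diag([5.0, 2.0, 1.0])
lam = top_hessian_eigenvalue(lambda v: H @ v, dim=3)
assert abs(lam - 5.0) < 1e-6
```

Plotting this estimate every 10th training step against the threshold $2/\eta$ gives the comparison asked for in (c).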