Chapter 3 Exercises: Calculus, Optimization, and Automatic Differentiation
These exercises progress from foundational calculations to implementation challenges. Exercises marked with a single star (*) are straightforward applications. Double-star (**) exercises require deeper thinking or more substantial coding. Triple-star (***) exercises are research-level challenges.
Section A: Derivatives and Gradients
Exercise 3.1 — Derivative Practice (*)
Compute the derivatives of the following functions analytically:
a) f(x) = 3x^4 - 2x^2 + 7x - 5
b) f(x) = e^(2x) sin(x)
c) f(x) = ln(x^2 + 1)
d) f(x) = (x^2 + 1) / (x - 1)
e) f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Exercise 3.2 — Gradient Computation (*)
Compute the gradient of each function:
a) f(x, y) = x^2 y + x y^3
b) f(x, y, z) = x e^(y z)
c) f(w) = w^T A w where A is a symmetric matrix (use results from Chapter 1)
d) f(x) = ||x||_2^2 = sum of x_i^2
Exercise 3.3 — Numerical vs. Analytical Derivatives (**)
Write a Python function that computes the derivative of f(x) = x sin(1/x) (defined for x != 0) at x = 0.1 using:
a) The forward difference formula with h = 10^(-1), 10^(-3), 10^(-5), 10^(-7), 10^(-9), 10^(-12), 10^(-15).
b) The central difference formula with the same values of h.
c) The analytical derivative.
Plot the absolute error as a function of h for both methods on a log-log scale. Explain why the error does not keep decreasing as h gets smaller.
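A minimal starting-point sketch for parts (a)-(c), assuming NumPy and Matplotlib are available; the step sizes come from the exercise and the analytical derivative is worked out in the comment.

```python
import numpy as np
import matplotlib.pyplot as plt

f = lambda x: x * np.sin(1 / x)
# Analytical derivative: d/dx [x sin(1/x)] = sin(1/x) - cos(1/x) / x
df = lambda x: np.sin(1 / x) - np.cos(1 / x) / x

x0 = 0.1
hs = np.array([1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-12, 1e-15])

forward = (f(x0 + hs) - f(x0)) / hs             # forward difference
central = (f(x0 + hs) - f(x0 - hs)) / (2 * hs)  # central difference

plt.loglog(hs, np.abs(forward - df(x0)), "o-", label="forward")
plt.loglog(hs, np.abs(central - df(x0)), "s-", label="central")
plt.xlabel("h")
plt.ylabel("absolute error")
plt.legend()
plt.show()
```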
Exercise 3.4 — Gradient of Softmax (**)
The softmax function maps a vector z in R^n to a probability distribution:
softmax(z)_i = e^(z_i) / sum_j e^(z_j)
a) Compute the Jacobian matrix of the softmax function. Show that the diagonal entries are softmax_i (1 - softmax_i) and the off-diagonal entries are -softmax_i softmax_j.
b) Implement the softmax Jacobian computation in NumPy and verify it against numerical differentiation.
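A sketch for part (b), assuming NumPy; the Jacobian formula J_ij = s_i (delta_ij - s_j) is the result from part (a), and the check uses central differences column by column.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)   # J_ij = s_i (delta_ij - s_j)

z = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(z)

# Numerical check, one column at a time
h = 1e-6
J_num = np.zeros_like(J)
for j in range(len(z)):
    e_j = np.zeros_like(z)
    e_j[j] = h
    J_num[:, j] = (softmax(z + e_j) - softmax(z - e_j)) / (2 * h)

print(np.max(np.abs(J - J_num)))       # should be tiny (around 1e-10)
```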
Exercise 3.5 — Directional Derivatives (*)
For f(x, y) = x^2 - x y + y^2:
a) Compute the gradient at the point (1, 2).
b) Compute the directional derivative at (1, 2) in the direction u = [1/sqrt(2), 1/sqrt(2)]^T.
c) In what direction is the directional derivative maximized? What is its value?
d) Find a direction in which the directional derivative is zero.
Exercise 3.6 — Hessian Analysis (**)
For f(x, y) = x^4 + y^4 - 2x^2 y^2:
a) Compute the gradient and find all critical points.
b) Compute the Hessian matrix.
c) Evaluate the Hessian at each critical point and classify it as a local minimum, local maximum, or saddle point using the eigenvalues.
d) Write NumPy code to verify your classifications numerically.
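For part (d), one option is a finite-difference Hessian, sketched below (NumPy assumed); evaluate it at each critical point you found in part (a) and inspect the eigenvalue signs.

```python
import numpy as np

f = lambda x, y: x**4 + y**4 - 2 * x**2 * y**2

def numerical_hessian(f, x, y, h=1e-4):
    # Second-order central differences for each Hessian entry
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return np.array([[fxx, fxy], [fxy, fyy]])

H = numerical_hessian(f, 1.0, 1.0)   # substitute each critical point here
print(np.linalg.eigvalsh(H))         # eigenvalue signs inform the classification
```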
Section B: The Chain Rule
Exercise 3.7 — Chain Rule Practice (*)
Compute df/dx for each composite function:
a) f = sin(x^2 + 1)
b) f = exp(-x^2 / 2) (the Gaussian kernel)
c) f = ln(1 + e^x) (the softplus function)
d) f = 1 / (1 + e^(-x)) (the sigmoid function)
Exercise 3.8 — Neural Network Gradient (**)
Consider a simple two-neuron network: f(x) = sigma(w_2 * sigma(w_1 * x + b_1) + b_2), where sigma is the sigmoid function.
a) Draw the computational graph.
b) Compute df/dw_1, df/dw_2, df/db_1, df/db_2 using the chain rule.
c) Evaluate all four derivatives at x = 1, w_1 = 0.5, b_1 = -0.1, w_2 = -0.3, b_2 = 0.2.
d) Verify your answers using numerical differentiation.
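For part (d), a central-difference check of one of the four derivatives might look like the sketch below; the other three follow the same pattern.

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def net(x, w1, b1, w2, b2):
    return sigmoid(w2 * sigmoid(w1 * x + b1) + b2)

# Values from part (c)
x, w1, b1, w2, b2 = 1.0, 0.5, -0.1, -0.3, 0.2
h = 1e-6

# Numerical df/dw1; compare with your chain-rule expression
num_dw1 = (net(x, w1 + h, b1, w2, b2) - net(x, w1 - h, b1, w2, b2)) / (2 * h)
print(num_dw1)
```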
Exercise 3.9 — Chain Rule in Matrix Form (**)
For a neural network layer h = sigma(W x + b), where sigma is applied element-wise:
a) Compute the Jacobian dh/dx.
b) Compute the Jacobian dh/dW (this will be a higher-order tensor; express the result component-wise).
c) If the loss is L = g(h) and we know dL/dh, write the expression for dL/dW using the chain rule.
Exercise 3.10 — Chain Rule Depth (***)
Consider a network with L layers: h^(l) = sigma(W^(l) h^(l-1)) for l = 1, ..., L, with h^(0) = x and a scalar loss L = g(h^(L)).
a) Write the chain rule expression for dL/dW^(1) (the gradient with respect to the first layer's weights).
b) If all W^(l) are the same matrix W and sigma is the identity, show that the gradient contains the factor W^(L-1) (W raised to the power L - 1), which can either explode or vanish depending on the spectral radius of W. Connect this to the vanishing/exploding gradient problem.
Section C: Optimization
Exercise 3.11 — Gradient Descent by Hand (*)
Apply three steps of gradient descent to f(x) = x^4 - 4x^2 + 4 starting at x_0 = 3 with learning rate eta = 0.01. Show the value of x and f(x) at each step.
Exercise 3.12 — Learning Rate Sensitivity (**)
Implement gradient descent for f(x, y) = 10x^2 + y^2 (a narrow valley).
a) Run gradient descent from (5, 5) with learning rates eta = 0.01, 0.05, 0.09, 0.1, and 0.2 for 100 iterations.
b) Plot the trajectories on a contour plot of f.
c) Plot the loss curves. For what learning rate does the algorithm diverge?
d) Analytically, what is the maximum learning rate for convergence on this function? (Hint: consider the eigenvalues of the Hessian.)
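A possible skeleton for parts (a) and (c), assuming NumPy; the stored path array can be reused for the contour-plot trajectories in part (b).

```python
import numpy as np

f = lambda p: 10 * p[0]**2 + p[1]**2
grad = lambda p: np.array([20 * p[0], 2 * p[1]])

def gradient_descent(eta, steps=100, start=(5.0, 5.0)):
    p = np.array(start, dtype=float)
    path = [p.copy()]
    for _ in range(steps):
        p = p - eta * grad(p)
        path.append(p.copy())
    return np.array(path)

for eta in [0.01, 0.05, 0.09, 0.1, 0.2]:
    path = gradient_descent(eta)
    print(f"eta = {eta}: final loss = {f(path[-1]):.3e}")
```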
Exercise 3.13 — Mini-Batch Size Experiment (**)
Using the linear regression gradient descent code from the chapter:
a) Generate a synthetic dataset with 10,000 samples and 5 features.
b) Train with batch sizes of 1, 8, 32, 128, 512, and 10,000.
c) For each batch size, record the wall-clock time and final loss for 50 epochs.
d) Plot: (i) loss vs. epoch for each batch size, (ii) loss vs. wall time for each batch size, (iii) final loss vs. batch size.
e) Discuss the trade-offs you observe.
Exercise 3.14 — Implementing Momentum (**)
Implement SGD with momentum from scratch.
a) Apply it to the Rosenbrock function: f(x, y) = (1 - x)^2 + 100(y - x^2)^2.
b) Compare convergence with and without momentum (beta = 0.9) using learning rate eta = 0.0001.
c) Also implement Nesterov momentum and compare all three variants.
d) Visualize the optimization paths on a contour plot.
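A sketch of the heavy-ball (momentum) update for parts (a) and (b); Nesterov momentum and the contour-plot visualization are left to add. The starting point and step count below are illustrative choices, not part of the exercise.

```python
import numpy as np

def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),
                     200 * (y - x**2)])

def momentum_descent(eta=1e-4, beta=0.9, steps=50_000, start=(-1.5, 1.5)):
    p = np.array(start, dtype=float)
    v = np.zeros_like(p)
    for _ in range(steps):
        v = beta * v - eta * rosenbrock_grad(p)   # accumulate velocity
        p = p + v
    return p, rosenbrock(p)

print("with momentum:   ", momentum_descent(beta=0.9))
print("without momentum:", momentum_descent(beta=0.0))
```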
Exercise 3.15 — Adam Implementation (**)
Implement the Adam optimizer from scratch.
a) Apply it to minimize f(x, y) = sin(x) cos(y) + 0.1(x^2 + y^2) starting from (3, 3).
b) Show the effect of the bias correction by plotting the first and second moments with and without correction for the first 100 steps.
c) What happens when you set beta_1 = 0 (no momentum)? When you set beta_2 = 0 (no averaging of the second moment)?
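A compact Adam sketch for part (a); the bias-corrected moments m_hat and v_hat are the quantities to log for part (b). The hyperparameter values here are illustrative.

```python
import numpy as np

def f(p):
    x, y = p
    return np.sin(x) * np.cos(y) + 0.1 * (x**2 + y**2)

def grad(p):
    x, y = p
    return np.array([np.cos(x) * np.cos(y) + 0.2 * x,
                     -np.sin(x) * np.sin(y) + 0.2 * y])

def adam(eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500, start=(3.0, 3.0)):
    p = np.array(start, dtype=float)
    m = np.zeros_like(p)   # first moment (running mean of gradients)
    v = np.zeros_like(p)   # second moment (running mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad(p)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)   # bias correction
        v_hat = v / (1 - beta2**t)
        p = p - eta * m_hat / (np.sqrt(v_hat) + eps)
    return p, f(p)

print(adam())
```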
Exercise 3.16 — Optimizer Comparison (***)
Implement and compare five optimizers (SGD, Momentum, Nesterov, RMSProp, Adam) on three test functions:
a) A quadratic: f(x) = x^T A x where A has eigenvalues [1, 100] (ill-conditioned).
b) The Rosenbrock function (non-convex, narrow valley).
c) The Rastrigin function: f(x) = 10n + sum_i [x_i^2 - 10 cos(2 pi x_i)] (many local minima).
For each, report the number of iterations to convergence (where applicable) and create visualizations of the trajectories.
Exercise 3.17 — Learning Rate Schedules (**)
Implement and compare three learning rate schedules:
a) Step decay: reduce learning rate by factor 0.5 every 50 epochs.
b) Exponential decay: eta_t = eta_0 * 0.95^t.
c) Cosine annealing: eta_t = eta_min + 0.5(eta_max - eta_min)(1 + cos(pi t / T)).
Apply each to training a linear regression model on synthetic data. Compare convergence behavior and final loss.
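Each schedule can be written as a small function of the epoch index t, as in the sketch below (the values of T, eta_min, and eta_max are illustrative).

```python
import numpy as np

def step_decay(eta0, t, drop=0.5, every=50):
    return eta0 * drop ** (t // every)

def exponential_decay(eta0, t, gamma=0.95):
    return eta0 * gamma ** t

def cosine_annealing(eta_min, eta_max, t, T):
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * t / T))

T = 200
for t in (0, 50, 100, 199):
    print(t, step_decay(0.1, t), exponential_decay(0.1, t),
          cosine_annealing(1e-4, 0.1, t, T))
```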
Exercise 3.18 — Gradient Clipping (*)
Write a function that implements two forms of gradient clipping:
a) Clip by value: Clip each component of the gradient to [-c, c].
b) Clip by norm: If the gradient norm exceeds c, scale the gradient to have norm c.
Show that clip-by-norm preserves the direction of the gradient while clip-by-value does not.
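A minimal NumPy sketch of both clipping rules; the printed example illustrates why clip-by-value can change the gradient's direction while clip-by-norm cannot.

```python
import numpy as np

def clip_by_value(g, c):
    # Each component is clamped to [-c, c] independently
    return np.clip(g, -c, c)

def clip_by_norm(g, c):
    # Rescale the whole vector only when its norm exceeds c
    norm = np.linalg.norm(g)
    return g * (c / norm) if norm > c else g

g = np.array([3.0, 0.1])
print(clip_by_value(g, 1.0))   # [1.0, 0.1] -- direction has changed
print(clip_by_norm(g, 1.0))    # parallel to g, with norm exactly 1.0
```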
Section D: Automatic Differentiation
Exercise 3.19 — Forward-Mode Autodiff (*)
Trace through forward-mode autodiff to compute df/dx and df/dy for f(x, y) = x^2 y + sin(x y) at (x, y) = (pi/4, 2). Show the dual number at each step.
Exercise 3.20 — Reverse-Mode Autodiff (*)
Trace through reverse-mode autodiff to compute df/dx and df/dy for f(x, y) = ln(x^2 + y) + x e^y at (x, y) = (1, 0). Show the adjoint at each node.
Exercise 3.21 — Extending the Autodiff Engine (**)
Extend the Value class from the chapter to support:
a) Subtraction and negation
b) Division
c) Power (including non-integer exponents)
d) exp() and log()
e) relu(): max(0, x)
f) sigmoid(): 1 / (1 + exp(-x))
For each, implement the forward computation and the correct backward pass. Verify with numerical gradient checking.
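For each new operation, the backward pass should add (local derivative) * out.grad to each parent's grad; for exp, for example, the local derivative is the output value itself. A self-contained helper for the final numerical check might look like the sketch below (the names are illustrative and not part of the chapter's Value class):

```python
import numpy as np

def grad_check(f, x, analytic, h=1e-6, tol=1e-5):
    """Compare an analytic derivative of a scalar function f at x
    against a central-difference estimate."""
    numeric = (f(x + h) - f(x - h)) / (2 * h)
    rel_err = abs(numeric - analytic) / max(1e-12, abs(numeric) + abs(analytic))
    return rel_err < tol, numeric, rel_err

# Example: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) at x = 0.3
sigmoid = lambda x: 1 / (1 + np.exp(-x))
s = sigmoid(0.3)
print(grad_check(sigmoid, 0.3, s * (1 - s)))
```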
Exercise 3.22 — Vector-Valued Autodiff (***)
Extend the autodiff engine to support vector-valued operations. Implement:
a) Element-wise addition of two vectors.
b) Dot product of two vectors.
c) Matrix-vector multiplication.
d) Use your engine to compute the gradient of f(x) = ||A x - b||^2 and verify against the analytical gradient 2 A^T (A x - b).
Exercise 3.23 — Computational Graph Visualization (**)
Write a function that takes a Value object (the output node) and generates a visualization of the computational graph using either Graphviz or a text-based representation. Each node should display its value, operation, and gradient.
Exercise 3.24 — Dual Numbers (**)
Implement a DualNumber class that supports forward-mode autodiff.
a) Support addition, multiplication, subtraction, division, and power.
b) Implement exp(), log(), sin(), and cos().
c) Use it to compute the derivative of f(x) = exp(sin(cos(x))) at x = 1.
d) Verify against numerical differentiation.
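A minimal dual-number sketch covering addition, multiplication, and part (c); subtraction, division, power, and the remaining functions from part (b) follow the same pattern.

```python
import math

class Dual:
    """a + b*eps with eps^2 = 0: the real part carries the value,
    the dual part carries the derivative."""
    def __init__(self, real, dual=0.0):
        self.real, self.dual = real, dual

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real + other.real, self.dual + other.dual)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real * other.real,
                    self.real * other.dual + self.dual * other.real)

def sin(d): return Dual(math.sin(d.real),  math.cos(d.real) * d.dual)
def cos(d): return Dual(math.cos(d.real), -math.sin(d.real) * d.dual)
def exp(d): return Dual(math.exp(d.real),  math.exp(d.real) * d.dual)

# Part (c): seed the dual part with 1 to differentiate with respect to x
x = Dual(1.0, 1.0)
y = exp(sin(cos(x)))
print(y.real, y.dual)   # function value and derivative at x = 1
```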
Exercise 3.25 — Jacobian Computation (***)
Using your autodiff engine (or NumPy with numerical differentiation):
a) Compute the full Jacobian of a function f: R^3 -> R^2 defined by f_1(x) = x_1 x_2 + sin(x_3) and f_2(x) = x_1^2 + x_2 x_3.
b) Compare the cost of computing the Jacobian using forward-mode vs. reverse-mode autodiff.
c) For what dimensions of input and output is forward mode more efficient?
Section E: Applications and Integration
Exercise 3.26 — Logistic Regression from Scratch (**)
Implement logistic regression trained with mini-batch gradient descent.
a) Generate a 2D binary classification dataset using np.random.randn.
b) Implement the sigmoid function, binary cross-entropy loss, and gradient computation.
c) Train using mini-batch SGD with batch size 32 for 200 epochs.
d) Plot the decision boundary at epochs 1, 10, 50, and 200.
e) Compare convergence using SGD, Momentum, and Adam.
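A bare-bones mini-batch SGD loop for parts (a)-(c), assuming two Gaussian blobs as the synthetic dataset; the decision-boundary plots and the optimizer comparison are left to add.

```python
import numpy as np

rng = np.random.default_rng(0)
# Part (a): two Gaussian blobs as a toy binary dataset
X = np.vstack([rng.normal(-1, 1, (500, 2)), rng.normal(1, 1, (500, 2))])
y = np.hstack([np.zeros(500), np.ones(500)])

sigmoid = lambda z: 1 / (1 + np.exp(-z))
w, b, eta = np.zeros(2), 0.0, 0.1

for epoch in range(200):
    order = rng.permutation(len(X))
    for start in range(0, len(X), 32):                 # mini-batches of size 32
        idx = order[start:start + 32]
        p = sigmoid(X[idx] @ w + b)
        grad_w = X[idx].T @ (p - y[idx]) / len(idx)    # gradient of BCE loss
        grad_b = np.mean(p - y[idx])
        w -= eta * grad_w
        b -= eta * grad_b

print("train accuracy:", np.mean((sigmoid(X @ w + b) > 0.5) == y))
```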
Exercise 3.27 — Softmax Regression (**)
Extend Exercise 3.26 to multi-class classification:
a) Generate a 3-class dataset in 2D.
b) Implement the softmax function, cross-entropy loss, and gradient.
c) Train using Adam with learning rate 0.01.
d) Plot the decision boundaries for each class.
Exercise 3.28 — Neural Network from Scratch (***)
Build a complete two-layer neural network (one hidden layer with ReLU, output layer with softmax) from scratch using only NumPy.
a) Implement the forward pass.
b) Implement the backward pass using the chain rule.
c) Implement gradient checking to verify your backward pass.
d) Train on a synthetic spiral dataset (3 classes).
e) Visualize the decision boundaries and the loss curve.
Exercise 3.29 — Loss Landscape Visualization (***)
For the neural network from Exercise 3.28:
a) Choose two random directions d_1, d_2 in parameter space.
b) Evaluate the loss on a grid of points theta + alpha d_1 + beta d_2 for alpha, beta in [-1, 1].
c) Create a 3D surface plot and contour plot of the loss landscape.
d) Plot the actual optimization trajectory projected onto this 2D plane.
e) Discuss what the visualization reveals about the loss landscape.
Exercise 3.30 — Convergence Rate Analysis (***)
Empirically investigate convergence rates:
a) For a strongly convex quadratic f(x) = 0.5 x^T A x - b^T x, measure the convergence rate of gradient descent as a function of the condition number of A.
b) Show that, with the optimal constant step size, the convergence rate is ((kappa - 1) / (kappa + 1))^t, where kappa is the condition number.
c) Repeat with momentum. Show that optimal momentum achieves a rate of ((sqrt(kappa) - 1) / (sqrt(kappa) + 1))^t, which is faster for large kappa.
d) Plot the number of iterations to converge to a tolerance of 10^(-6) as a function of kappa for both methods.
Exercise 3.31 — Saddle Point Escape (**)
Create a 3D function with a known saddle point:
a) Use f(x, y) = x^2 - y^2 (saddle at origin).
b) Start gradient descent exactly at the saddle point. What happens? Why?
c) Start slightly perturbed from the saddle point. How does the behavior depend on the direction of perturbation?
d) Compare the behavior of SGD (with noise) vs. batch gradient descent near the saddle point.
Exercise 3.32 — Gradient Accumulation (**)
In practice, when GPU memory is limited, we simulate larger batch sizes using gradient accumulation.
a) Implement a training loop that accumulates gradients over k mini-batches before performing an update.
b) Verify that accumulating gradients over 4 batches of size 8 produces the same result as a single batch of size 32 (on the same data).
c) Discuss under what conditions gradient accumulation is mathematically equivalent to training with a larger batch size.
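A sketch verifying part (b) on a toy mean-squared-error model: averaging the mean gradients of four equal mini-batches reproduces the full-batch mean gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
y = rng.normal(size=32)
w = np.zeros(5)

def grad(Xb, yb, w):
    # Gradient of mean squared error: 2 X^T (X w - y) / m
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

g_full = grad(X, y, w)                 # single batch of 32

g_accum = np.zeros_like(w)             # accumulate over 4 batches of 8
for k in range(4):
    sl = slice(8 * k, 8 * (k + 1))
    g_accum += grad(X[sl], y[sl], w)
g_accum /= 4

print(np.allclose(g_full, g_accum))    # True, up to floating-point error
```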
Bonus Exercises
Exercise 3.33 — Higher-Order Derivatives (***)
Extend the reverse-mode autodiff engine to compute second-order derivatives (the Hessian) by applying reverse-mode autodiff twice. Use this to compute the Hessian of f(x, y) = x^2 y + sin(x y) at (1, 1).
Exercise 3.34 — Automatic Differentiation for ODEs (***)
Consider the ordinary differential equation dy/dt = -y + sin(t) with y(0) = 1.
a) Solve it numerically using Euler's method with step size h = 0.01.
b) Now suppose the ODE has a parameter alpha: dy/dt = -alpha y + sin(t). Use your autodiff engine to compute dy(T)/dalpha at T = 1.
c) Verify your answer against numerical differentiation with respect to alpha.
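A sketch for parts (a) and (c); for part (b), the same Euler loop can be run on your autodiff engine's number type to obtain dy(T)/dalpha directly.

```python
import numpy as np

def solve_y_at_T(alpha, T=1.0, h=0.01, y0=1.0):
    """Euler's method for dy/dt = -alpha*y + sin(t)."""
    y, t = y0, 0.0
    for _ in range(int(round(T / h))):
        y = y + h * (-alpha * y + np.sin(t))
        t += h
    return y

alpha = 1.0
print("y(1) approx:", solve_y_at_T(alpha))

# Part (c): numerical sensitivity with central differences
eps = 1e-6
dy_dalpha = (solve_y_at_T(alpha + eps) - solve_y_at_T(alpha - eps)) / (2 * eps)
print("dy(1)/dalpha approx:", dy_dalpha)
```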
Exercise 3.35 — Natural Gradient (***)
The natural gradient pre-multiplies the ordinary gradient by the inverse of the Fisher information matrix F:
theta_{t+1} = theta_t - eta F^(-1) nabla L(theta_t)
a) For a simple 1D Gaussian model with parameters mu and sigma^2, compute the Fisher information matrix.
b) Implement one step of natural gradient descent and compare it to ordinary gradient descent on a maximum likelihood estimation problem.
c) Discuss why the natural gradient is invariant to parameterization.