Quiz — Chapter 2: Multivariate Calculus and Optimization

Test your understanding of gradients, backpropagation, convexity, and optimization algorithms. An answer follows each question.


Question 1

What is the geometric interpretation of the gradient $\nabla f(\mathbf{x})$?

Answer

The gradient $\nabla f(\mathbf{x})$ points in the direction of steepest *ascent* of $f$ at $\mathbf{x}$, and its magnitude $\|\nabla f(\mathbf{x})\|$ equals the maximum rate of change of $f$ in any direction. Gradient descent steps in the *opposite* direction ($-\nabla f$) to decrease $f$ as rapidly as possible.

Question 2

For the function $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T \mathbf{x}$ with symmetric positive definite $\mathbf{A}$, what is the gradient $\nabla f(\mathbf{x})$ and the unique minimizer $\mathbf{x}^*$?

Answer

The gradient is $\nabla f(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b}$. Setting it to zero gives the unique minimizer $\mathbf{x}^* = \mathbf{A}^{-1}\mathbf{b}$. This is the matrix form of solving a linear system, and the positive definiteness of $\mathbf{A}$ guarantees that this critical point is the unique global minimum.
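
As a sanity check, a few lines of NumPy (with a hypothetical 2×2 example, not from the chapter) confirm that gradient descent on this quadratic converges to $\mathbf{A}^{-1}\mathbf{b}$:

```python
import numpy as np

# Hypothetical SPD matrix and right-hand side for f(x) = 0.5 x^T A x - b^T x.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

# Closed-form minimizer: x* = A^{-1} b.
x_star = np.linalg.solve(A, b)

# Gradient descent using grad f(x) = A x - b.
x = np.zeros(2)
eta = 0.1
for _ in range(500):
    x -= eta * (A @ x - b)

print(x, x_star)  # both approximately [0.2, 0.4]
```

The step size $\eta = 0.1$ is safely below $2/L$ for this matrix (its largest eigenvalue is about $3.6$), so the iteration contracts at every step.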

Question 3

For the regularized logistic regression loss $\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N}\sum_i [-y_i \log \hat{y}_i - (1-y_i)\log(1-\hat{y}_i)] + \frac{\lambda}{2}\|\boldsymbol{\theta}\|^2$, what is the gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}$?

Answer

$$\nabla_{\boldsymbol{\theta}} \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)\mathbf{x}_i + \lambda \boldsymbol{\theta}$$

where $\hat{y}_i = \sigma(\boldsymbol{\theta}^T \mathbf{x}_i)$. The data term is an average of feature vectors weighted by the prediction error $(\hat{y}_i - y_i)$; the second term comes from the regularizer.
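
A finite-difference check on toy data (sizes and seed are illustrative) verifies the formula numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy problem: N examples, d features, regularization strength lam.
N, d, lam = 20, 3, 0.1
X = rng.normal(size=(N, d))
y = rng.integers(0, 2, size=N).astype(float)
theta = rng.normal(size=d)

def loss(theta):
    p = sigmoid(X @ theta)
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ce + 0.5 * lam * theta @ theta

# Analytic gradient from the answer: mean of (y_hat - y) x_i, plus lam * theta.
grad = X.T @ (sigmoid(X @ theta) - y) / N + lam * theta

# Central finite differences, one coordinate at a time.
eps = 1e-6
num = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                for e in np.eye(d)])
print(np.max(np.abs(grad - num)))  # tiny: the analytic gradient matches
```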

Question 4

What does the Hessian matrix $\mathbf{H}$ tell us about a critical point where $\nabla f = \mathbf{0}$?

Answer

The eigenvalues of the Hessian classify the critical point:

- **All positive eigenvalues** ($\mathbf{H} \succ 0$, positive definite): local minimum
- **All negative eigenvalues** ($\mathbf{H} \prec 0$, negative definite): local maximum
- **Mixed positive and negative eigenvalues** (indefinite): saddle point
- **Some zero eigenvalues** (semidefinite): degenerate case requiring higher-order analysis

In high dimensions, saddle points are exponentially more likely than local minima at random critical points.
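
This classification is mechanical enough to script. A minimal sketch (the function name and tolerance are illustrative, not from the chapter):

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from the eigenvalues of a symmetric Hessian H."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "degenerate"

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at the origin: a saddle.
print(classify_critical_point(np.diag([2.0, -2.0])))  # saddle point
print(classify_critical_point(np.diag([2.0, 2.0])))   # local minimum
```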

Question 5

Why is reverse mode automatic differentiation (backpropagation) preferred over forward mode AD for training neural networks?

Answer

For a function $f : \mathbb{R}^n \to \mathbb{R}^m$:

- Forward mode computes one column of the Jacobian per pass, requiring $n$ passes for the full Jacobian.
- Reverse mode computes one row of the Jacobian per pass, requiring $m$ passes.

In neural network training, we have $n \sim 10^6{-}10^{11}$ parameters and $m = 1$ (scalar loss). Reverse mode computes all $n$ gradient components in a *single* backward pass, making it roughly $n$ times cheaper than forward mode. This is precisely why backpropagation (reverse mode AD) is the standard training algorithm.

Question 6

State the definition of a convex function. Why is convexity important for optimization?

Answer

A function $f$ is convex if for all $\mathbf{x}, \mathbf{y}$ in its domain and all $\lambda \in [0, 1]$:

$$f(\lambda \mathbf{x} + (1-\lambda)\mathbf{y}) \leq \lambda f(\mathbf{x}) + (1-\lambda)f(\mathbf{y})$$

Geometrically, the graph of $f$ lies on or below every chord. Convexity is important because it guarantees that **every local minimum is a global minimum**: any algorithm that finds a stationary point ($\nabla f = \mathbf{0}$) has found an optimal solution. Without convexity, gradient descent may converge to a suboptimal local minimum or a saddle point.

Question 7

What is the condition number $\kappa$ of a convex optimization problem, and how does it affect the convergence of gradient descent?

Answer

For a function with $L$-Lipschitz gradient and $\mu$-strong convexity, the condition number is $\kappa = L / \mu$, which for a quadratic equals the ratio of the largest to smallest eigenvalue of the Hessian. With the optimal fixed step size, the error of gradient descent contracts by a factor of $\frac{\kappa - 1}{\kappa + 1}$ per iteration. When $\kappa$ is large (ill-conditioned), convergence is slow because the loss landscape has very different curvatures in different directions, causing gradient descent to oscillate across the steep directions while crawling along the flat ones. L2 regularization reduces $\kappa$ by increasing $\mu$, which is one reason regularization improves optimization (not just generalization).
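
The slowdown is easy to measure on a diagonal quadratic (a sketch with illustrative tolerances; `gd_steps` is not a chapter function):

```python
import numpy as np

def gd_steps(kappa, tol=1e-6, max_steps=100_000):
    """Gradient descent steps to reach ||x|| < tol on f(x) = 0.5*(x1^2 + kappa*x2^2).

    Uses the classical optimal fixed step eta = 2/(L + mu) = 2/(kappa + 1),
    so the error contracts by (kappa - 1)/(kappa + 1) per step.
    """
    h = np.array([1.0, kappa])      # Hessian diagonal: mu = 1, L = kappa
    eta = 2.0 / (kappa + 1.0)
    x = np.array([1.0, 1.0])
    for t in range(1, max_steps + 1):
        x = x - eta * h * x         # grad f(x) = h * x
        if np.linalg.norm(x) < tol:
            return t
    return max_steps

print(gd_steps(10), gd_steps(100))  # step count grows roughly linearly with kappa
```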

Question 8

What is the difference between batch gradient descent and stochastic gradient descent (SGD)? What is the tradeoff?

Answer

- **Batch gradient descent** computes the exact gradient over the entire dataset at each step: $\nabla \mathcal{L} = \frac{1}{N}\sum_{i=1}^N \nabla \ell_i$.
- **SGD** estimates the gradient from a mini-batch of $B \ll N$ samples: $\hat{\nabla} \mathcal{L} = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla \ell_i$.

The tradeoff: SGD is $N/B$ times cheaper per step but introduces gradient noise (variance $\propto 1/B$). However, this noise is often *beneficial*: it provides implicit regularization and helps escape sharp local minima that generalize poorly (Keskar et al., 2017).

Question 9

Explain how momentum addresses a specific weakness of vanilla SGD.

Answer

Vanilla SGD oscillates on ill-conditioned problems: it bounces back and forth across narrow valleys while making slow progress along the valley floor. Momentum adds a velocity term $\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \nabla \mathcal{L}(\boldsymbol{\theta}_t)$ that accumulates consistent gradient directions and cancels the oscillating components. Physically, it behaves like a heavy ball that builds speed on downhill slopes and rolls past small bumps. On quadratics with condition number $\kappa$, momentum improves the per-step convergence factor from $(\kappa-1)/(\kappa+1)$ to $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$.
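
The improvement is visible in a few lines on an ill-conditioned quadratic (the matrix and step counts are illustrative; the momentum parameters are the classical heavy-ball tuning for $\kappa = 100$):

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T diag(1, 100) x, so kappa = 100.
h = np.array([1.0, 100.0])          # Hessian diagonal; grad f(x) = h * x

def run(beta, eta, steps=200):
    """Heavy-ball momentum: v <- beta*v + grad f, x <- x - eta*v (beta=0 is plain GD)."""
    x, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        v = beta * v + h * x
        x = x - eta * v
    return np.linalg.norm(x)

plain = run(beta=0.0, eta=2.0 / 101.0)       # optimal fixed step for plain GD
heavy = run(beta=(9.0 / 11.0) ** 2,          # beta = ((sqrt(k)-1)/(sqrt(k)+1))^2
            eta=(2.0 / 11.0) ** 2)           # eta  = (2/(sqrt(L)+sqrt(mu)))^2
print(plain, heavy)  # momentum reaches a far smaller error in the same step count
```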

Question 10

What problem does AdaGrad solve, and what new problem does it introduce?

Answer

**Problem solved:** AdaGrad adapts the learning rate per parameter by dividing by $\sqrt{\sum_{\tau=1}^t g_\tau^2}$, the root of the accumulated squared gradients. Parameters that receive large, frequent gradients get smaller learning rates; rarely updated parameters get larger ones. This is ideal for sparse data.

**Problem introduced:** The accumulated sum only grows, so the effective learning rate monotonically decreases toward zero. On non-convex problems (deep learning), this causes training to stall prematurely. RMSProp fixes this by replacing the sum with an exponentially decaying average.
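
The contrast shows up even under a constant gradient. A sketch (base rate, decay constant, and horizon are illustrative):

```python
import numpy as np

# Effective per-parameter step size under a constant unit gradient, contrasting
# AdaGrad's growing accumulator with RMSProp's exponential moving average.
eta, eps, rho, T = 0.1, 1e-8, 0.9, 1000

acc, v = 0.0, 0.0
adagrad_lr, rmsprop_lr = [], []
for t in range(T):
    acc += 1.0 ** 2                   # sum of squared gradients: grows forever
    v = rho * v + (1 - rho) * 1.0 ** 2  # EMA of squared gradients: plateaus at 1
    adagrad_lr.append(eta / (np.sqrt(acc) + eps))
    rmsprop_lr.append(eta / (np.sqrt(v) + eps))

print(adagrad_lr[-1], rmsprop_lr[-1])  # AdaGrad decays toward 0; RMSProp levels off near eta
```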

Question 11

Write the Adam update equations. What is the purpose of bias correction?

Answer

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1)\nabla \mathcal{L}_t \qquad \text{(first moment)}$$

$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)(\nabla \mathcal{L}_t)^2 \qquad \text{(second moment, elementwise square)}$$

$$\hat{\mathbf{m}}_t = \mathbf{m}_t / (1 - \beta_1^t) \qquad \hat{\mathbf{v}}_t = \mathbf{v}_t / (1 - \beta_2^t) \qquad \text{(bias correction)}$$

$$\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta\, \hat{\mathbf{m}}_t / (\sqrt{\hat{\mathbf{v}}_t} + \epsilon)$$

**Bias correction** compensates for initializing $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$. At early steps, $\mathbf{m}_t$ and $\mathbf{v}_t$ are biased toward zero because they have accumulated little gradient history. The correction factors $1/(1-\beta_i^t)$ are large when $t$ is small, rescaling the early estimates to their expected magnitudes.
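
A direct transcription of the equations (the function name and defaults are illustrative) makes the effect of bias correction concrete: with corrected moments, even the very first update has size about $\eta$, regardless of the raw gradient scale.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, following the equations above (t counts from 1)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)        # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Under a constant gradient g, the corrected moments at t=1 already equal g and g^2,
# so the first step has magnitude ~eta even for a huge gradient.
theta, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
g = np.array([1000.0])                  # arbitrary large gradient
theta, m, v = adam_step(theta, g, m, v, t=1)
print(theta)  # approximately [-0.001], a step of size ~eta
```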

Question 12

What is the key difference between Adam with L2 regularization and AdamW?

Answer

In **Adam + L2**, the regularization gradient $\lambda \boldsymbol{\theta}$ is added to $\nabla \mathcal{L}$ *before* the adaptive scaling, so the weight decay is divided by $\sqrt{\hat{v}_j} + \epsilon$ for each parameter. The decay becomes non-uniform and is dampened for parameters with large historical gradients.

In **AdamW**, the weight decay is *decoupled*:

$$\boldsymbol{\theta}_t = (1 - \eta\lambda)\boldsymbol{\theta}_{t-1} - \eta\, \hat{\mathbf{m}}_t / (\sqrt{\hat{\mathbf{v}}_t} + \epsilon)$$

The decay acts directly on the parameter magnitude, unaffected by the adaptive rate, so it shrinks all parameters uniformly as intended. AdamW is the standard optimizer for transformer training.
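
A one-parameter sketch shows the dampening directly. All the numbers below are illustrative, and the bias-corrected moments `m_hat`, `v_hat` are assumed precomputed:

```python
import numpy as np

eta, lam, eps = 1e-3, 0.1, 1e-8
theta = 1.0
m_hat, v_hat = 0.0, 100.0    # a parameter with large historical gradients

# Adam + L2: lam*theta enters the gradient, then gets divided by sqrt(v_hat).
adam_l2 = theta - eta * (m_hat + lam * theta) / (np.sqrt(v_hat) + eps)

# AdamW: decay is applied directly to the parameter, outside the adaptive scaling.
adamw = (1 - eta * lam) * theta - eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta - adam_l2, theta - adamw)  # shrinkage is 10x weaker under Adam + L2 here
```

The factor of 10 is exactly $\sqrt{\hat{v}} = 10$: the larger a parameter's gradient history, the less Adam + L2 actually decays it.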

Question 13

Why are second-order methods (Newton's method, L-BFGS) rarely used for training deep neural networks?

Answer

**Computational cost:** For $n$ parameters, Newton's method requires computing and inverting the $n \times n$ Hessian, costing $O(n^2)$ storage and $O(n^3)$ computation per step. For modern networks with $n \sim 10^8{-}10^{11}$ parameters, this is infeasible.

**Stochastic gradients:** L-BFGS assumes deterministic gradients. The noise in mini-batch gradients corrupts its curvature approximation, causing instability.

**Non-smoothness:** ReLU activations create non-smooth loss surfaces where the Hessian is undefined at kink points.

Despite this, understanding second-order methods is valuable: Adam can be interpreted as a diagonal approximation to the inverse Hessian, connecting the first-order methods we use in practice to second-order theory.

Question 14

Explain the concept of a saddle point. Why are saddle points more problematic than local minima for gradient-based optimization?

Answer

A **saddle point** is a critical point where $\nabla f = \mathbf{0}$ but the Hessian has both positive and negative eigenvalues: the loss increases in some directions and decreases in others, like a mountain pass. Saddle points are more problematic than local minima for two reasons:

1. **Prevalence:** In $d$ dimensions, if the eigenvalue signs at a random critical point behaved like independent coin flips, the probability of a local minimum would be $(1/2)^d$. For $d > 100$, essentially all critical points are saddle points.
2. **Slow escape:** Near a saddle point the gradient is small, so gradient descent takes tiny steps and may linger for many iterations before drifting away along a direction of negative curvature. Adaptive methods like Adam escape faster because they normalize by gradient magnitude.
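
The prevalence claim can be checked by sampling random symmetric Hessians (a rough Monte Carlo sketch; the Gaussian ensemble, dimensions, and trial count are illustrative, and real loss Hessians are not this simple):

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_local_minima(d, trials=5000):
    """Fraction of random symmetric d x d matrices with all positive eigenvalues."""
    count = 0
    for _ in range(trials):
        A = rng.normal(size=(d, d))
        H = (A + A.T) / 2                       # symmetrize
        if np.all(np.linalg.eigvalsh(H) > 0):
            count += 1
    return count / trials

fracs = {d: frac_local_minima(d) for d in (2, 4, 8)}
print(fracs)  # the fraction drops rapidly toward zero as d grows
```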

Question 15

What is learning rate warmup, and why is it important for training with Adam?

Answer

**Warmup** starts training with a very small learning rate and increases it linearly to the target rate over $T_w$ steps. The purpose is to give Adam's second-moment estimates $\hat{\mathbf{v}}_t$ time to accumulate accurate statistics before committing to large parameter updates. At initialization, the second-moment estimates are based on very few gradient samples and are unreliable; large updates based on unreliable variance estimates can push parameters into a region from which recovery is difficult. This is a more precise explanation than the common belief that warmup is needed because "gradients are large at initialization."
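
A minimal schedule sketch, pairing linear warmup with cosine decay (the target rate, step counts, and the cosine pairing are illustrative choices, not prescribed by the chapter):

```python
import numpy as np

def lr_schedule(step, target_lr=3e-4, warmup_steps=1000, total_steps=10_000):
    """Linear warmup to target_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps       # linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * target_lr * (1 + np.cos(np.pi * progress))  # cosine decay

print(lr_schedule(0), lr_schedule(999), lr_schedule(9_999))
```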

Question 16

For the StreamRec matrix factorization problem $\min_{\mathbf{U}, \mathbf{V}} \|\mathbf{R} - \mathbf{UV}^T\|_F^2$, is the loss function convex? What does this imply?

Answer

The loss is **not convex** in $(\mathbf{U}, \mathbf{V})$ jointly. The product $\mathbf{UV}^T$ introduces a bilinear interaction between parameters, and the rescaling $(\mathbf{U}, \mathbf{V}) \to (\alpha \mathbf{U}, \mathbf{V}/\alpha)$ preserves the loss for any $\alpha \neq 0$, creating manifolds of equivalent solutions. However, the loss *is* convex in $\mathbf{U}$ with $\mathbf{V}$ fixed (and vice versa), which is why alternating least squares (ALS) works well. Furthermore, when the factorization rank exceeds the true rank of $\mathbf{R}$, all local minima are global minima (Bhojanapalli et al., 2016), meaning that gradient descent will eventually find a good solution despite the non-convexity.
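
The convex-in-each-block structure is exactly what ALS exploits: each half-update is an ordinary least-squares solve. A toy sketch on a synthetic exactly-rank-3 matrix (sizes, seed, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic exactly rank-k matrix R and random rank-k factor initializations.
n, m, k = 20, 15, 3
R = rng.normal(size=(n, k)) @ rng.normal(size=(k, m))

U = rng.normal(size=(n, k))
V = rng.normal(size=(m, k))

# Alternating least squares: each subproblem is a convex least-squares solve.
for _ in range(100):
    U = np.linalg.lstsq(V, R.T, rcond=None)[0].T   # solve for U with V fixed
    V = np.linalg.lstsq(U, R, rcond=None)[0].T     # solve for V with U fixed

err = np.linalg.norm(R - U @ V.T)
print(err / np.linalg.norm(R))  # relative reconstruction error: tiny
```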

Question 17

State the four KKT conditions for a constrained optimization problem $\min f(\boldsymbol{\theta})$ subject to $g_i(\boldsymbol{\theta}) \leq 0$.

Answer

At an optimal point $\boldsymbol{\theta}^*$ with Lagrange multipliers $\mu_i^*$:

1. **Stationarity:** $\nabla f(\boldsymbol{\theta}^*) + \sum_i \mu_i^* \nabla g_i(\boldsymbol{\theta}^*) = \mathbf{0}$
2. **Primal feasibility:** $g_i(\boldsymbol{\theta}^*) \leq 0$ for all $i$
3. **Dual feasibility:** $\mu_i^* \geq 0$ for all $i$
4. **Complementary slackness:** $\mu_i^* g_i(\boldsymbol{\theta}^*) = 0$ for all $i$

Complementary slackness means that either the constraint is active ($g_i = 0$) or its multiplier is zero ($\mu_i = 0$). The multiplier $\mu_i^*$ measures the sensitivity of the optimal objective value to relaxing the $i$-th constraint.

Question 18

How does Adam implicitly approximate second-order optimization?

Answer

Newton's method computes the ideal update $\Delta \boldsymbol{\theta} = -\mathbf{H}^{-1}\nabla \mathcal{L}$. Adam approximates this diagonally: for each parameter $j$, the update is $\Delta \theta_j \approx -\eta\, \hat{m}_j / \sqrt{\hat{v}_j}$. The term $\sqrt{\hat{v}_j}$ estimates the RMS gradient magnitude for parameter $j$, which serves as a rough proxy for the local curvature along that coordinate (the diagonal Hessian entry $H_{jj}$). Thus Adam implicitly rescales each parameter's update by a curvature estimate. This is why Adam handles ill-conditioned problems better than SGD, and why it struggles when significant off-diagonal Hessian structure exists.

Question 19

You are training a model and observe that the training loss oscillates without decreasing. Name three possible causes and the corresponding fixes.

Answer

1. **Learning rate too high.** The step size overshoots the minimum on each iteration. **Fix:** reduce the learning rate, or use a learning rate schedule (warmup plus cosine decay).
2. **Poor conditioning.** The loss landscape has very different curvatures in different directions, causing oscillation along the steep directions. **Fix:** use momentum (to smooth oscillations) or an adaptive method (Adam/AdamW) that rescales per parameter.
3. **Batch size too small.** Extreme gradient noise from tiny mini-batches causes erratic updates. **Fix:** increase the batch size (or use gradient accumulation), or reduce the learning rate proportionally.

Other possibilities include numerical instability (exploding gradients), a bug in the loss computation, or a fundamentally misspecified model.

Question 20

Explain why the gradient of the logistic regression loss $\nabla_{\boldsymbol{\theta}} \mathcal{L} = \frac{1}{N}\sum_i (\hat{y}_i - y_i)\mathbf{x}_i$ has a natural interpretation. What does it mean for the model to "focus on where it is most wrong"?

Answer

The gradient is a weighted sum of feature vectors $\mathbf{x}_i$, where the weight for example $i$ is the prediction error $(\hat{y}_i - y_i)$, which ranges from $-1$ to $+1$:

- When $\hat{y}_i \approx y_i$ (correct prediction), the weight is near zero: this example contributes almost nothing to the gradient.
- When $\hat{y}_i \gg y_i$ (false positive, e.g., $\hat{y} = 0.9$, $y = 0$), the weight is large and positive, pushing $\boldsymbol{\theta}$ away from this example's features.
- When $\hat{y}_i \ll y_i$ (false negative, e.g., $\hat{y} = 0.1$, $y = 1$), the weight is large and negative, pulling $\boldsymbol{\theta}$ toward this example's features.

The model "focuses on where it is most wrong" because examples with large errors dominate the gradient, while well-classified examples are effectively ignored, directing the model's learning capacity toward its failure cases. This is analogous to a good study strategy: spend your attention on the material you do not yet understand.