
Chapter 2: Multivariate Calculus and Optimization

Gradients, Loss Landscapes, and Finding the Minimum

"Gradient descent is not a single algorithm — it is a family tree. Understanding that family tree means understanding why your model converges, oscillates, or quietly diverges at 3 a.m. on a Saturday."


Learning Objectives

By the end of this chapter, you will be able to:

  1. Compute gradients, Jacobians, and Hessians for multivariate functions relevant to machine learning
  2. Derive the backpropagation algorithm as an application of the chain rule on computational graphs
  3. Analyze convexity and its implications for optimization guarantees
  4. Compare and implement SGD variants (momentum, RMSProp, Adam) from first principles
  5. Explain second-order methods and why they matter despite rarely being used in deep learning

2.1 Why Optimization Is the Heart of Machine Learning

Every supervised learning algorithm solves the same problem: find parameters $\boldsymbol{\theta}^*$ that minimize a loss function $\mathcal{L}(\boldsymbol{\theta})$ over a dataset. Linear regression minimizes squared error. Logistic regression minimizes cross-entropy. A 175-billion-parameter language model minimizes next-token prediction loss. The models differ enormously, but the optimization machinery is shared.

In Chapter 1, we represented this problem algebraically: given a user-item interaction matrix $\mathbf{R} \in \mathbb{R}^{m \times n}$ with missing entries, we factorized it into $\mathbf{U} \in \mathbb{R}^{m \times k}$ and $\mathbf{V} \in \mathbb{R}^{n \times k}$ by minimizing the Frobenius norm of the reconstruction error. We defined the loss. But we did not ask: how do we actually find the minimum?

That is the question this chapter answers. We begin with the mathematical machinery — gradients, Jacobians, Hessians — then trace the chain rule through computational graphs to derive backpropagation. We study convexity to understand when optimization problems are "easy" (with guarantees) and when they are "hard" (without them). We build the SGD family tree from first principles, implementing each variant in numpy. And we close with second-order methods: beautiful, theoretically optimal, and almost never used at scale.

The central insight: understanding why Adam converges faster than vanilla SGD on your problem requires understanding the full family tree.

Common Misconception: "Optimization is a solved problem — just use Adam." Adam is an excellent default, but it is not universally superior. Understanding when and why it fails (and when SGD with momentum actually converges to better solutions) requires the theory developed in this chapter. Empirical evidence from Wilson et al. (2017) demonstrates that SGD with momentum can generalize better than adaptive methods on certain deep learning tasks.


2.2 Gradients and Partial Derivatives

2.2.1 The Gradient as a Vector of Partial Derivatives

Consider a scalar-valued function $f : \mathbb{R}^n \to \mathbb{R}$. The gradient of $f$ at a point $\mathbf{x} \in \mathbb{R}^n$ is the vector of all partial derivatives:

$$ \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} $$

The gradient has a precise geometric meaning: it points in the direction of steepest ascent of $f$ at $\mathbf{x}$, and its magnitude $\|\nabla f(\mathbf{x})\|$ equals the rate of change in that direction.

Example: Quadratic loss. Let $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T \mathbf{x}$ where $\mathbf{A}$ is symmetric positive definite. Then:

$$ \nabla f(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b} $$

Setting $\nabla f(\mathbf{x}) = \mathbf{0}$ yields $\mathbf{x}^* = \mathbf{A}^{-1}\mathbf{b}$, the closed-form solution. Most interesting problems do not admit closed-form solutions, which is why we need iterative methods.

import numpy as np

def quadratic_loss(x: np.ndarray, A: np.ndarray, b: np.ndarray) -> float:
    """Compute f(x) = 0.5 * x^T A x - b^T x."""
    return 0.5 * x @ A @ x - b @ x

def quadratic_gradient(x: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Compute gradient: A x - b."""
    return A @ x - b

# Example: 2D quadratic
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 2.0])

x = np.array([0.0, 0.0])
print(f"f(x) = {quadratic_loss(x, A, b):.4f}")
print(f"grad f(x) = {quadratic_gradient(x, A, b)}")
print(f"Optimal x* = {np.linalg.solve(A, b)}")

2.2.2 The Directional Derivative

The directional derivative of $f$ at $\mathbf{x}$ in direction $\mathbf{v}$ (where $\|\mathbf{v}\| = 1$) is:

$$ D_\mathbf{v} f(\mathbf{x}) = \nabla f(\mathbf{x})^T \mathbf{v} = \|\nabla f(\mathbf{x})\| \cos \theta $$

where $\theta$ is the angle between $\nabla f(\mathbf{x})$ and $\mathbf{v}$. This expression reveals why gradient descent works: the direction that maximally decreases $f$ is $\mathbf{v} = -\nabla f(\mathbf{x}) / \|\nabla f(\mathbf{x})\|$, because that makes $\cos \theta = -1$.

Mathematical Foundation: The directional derivative formula $D_\mathbf{v} f = \nabla f^T \mathbf{v}$ is a consequence of the first-order Taylor expansion: $f(\mathbf{x} + \epsilon \mathbf{v}) \approx f(\mathbf{x}) + \epsilon \nabla f(\mathbf{x})^T \mathbf{v}$. The gradient concentrates all local information about $f$ into a single vector.
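This claim is easy to check numerically. A small sketch, reusing the quadratic from above, that compares a finite-difference directional derivative against $\nabla f^T \mathbf{v}$, and confirms (by random sampling) that no unit direction decreases $f$ faster than the negative normalized gradient:

```python
import numpy as np

# Quadratic from above: f(x) = 0.5 x^T A x - b^T x, grad f = A x - b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x = np.array([1.0, -1.0])
g = grad(x)
v = g / np.linalg.norm(g)          # unit vector along the gradient

# Finite-difference directional derivative matches grad^T v
eps = 1e-6
num = (f(x + eps * v) - f(x - eps * v)) / (2 * eps)
assert np.isclose(num, g @ v, atol=1e-6)

# By Cauchy-Schwarz, g @ u >= -||g|| for every unit u: the best
# achievable rate of decrease is along -g / ||g||
rng = np.random.default_rng(0)
for _ in range(1000):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)
    assert g @ u >= -np.linalg.norm(g) - 1e-12
```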

2.2.3 Gradients in Machine Learning: The Credit Scoring Example

Consider Meridian Financial's credit scoring model from the anchor examples. They use logistic regression with L2 regularization to predict default probability. The loss function for a single data point $(\mathbf{x}_i, y_i)$ with parameters $\boldsymbol{\theta}$ is:

$$ \ell(\boldsymbol{\theta}; \mathbf{x}_i, y_i) = -\left[ y_i \log \sigma(\boldsymbol{\theta}^T \mathbf{x}_i) + (1 - y_i) \log(1 - \sigma(\boldsymbol{\theta}^T \mathbf{x}_i)) \right] $$

where $\sigma(z) = 1 / (1 + e^{-z})$ is the sigmoid function. The full regularized loss over $N$ data points is:

$$ \mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \ell(\boldsymbol{\theta}; \mathbf{x}_i, y_i) + \frac{\lambda}{2} \|\boldsymbol{\theta}\|^2 $$

Let us derive the gradient step by step:

Step 1. Compute $\frac{\partial \sigma}{\partial z}$. The sigmoid has a remarkably clean derivative:

$$ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $$

Step 2. Let $\hat{y}_i = \sigma(\boldsymbol{\theta}^T \mathbf{x}_i)$. By the chain rule:

$$ \frac{\partial \ell}{\partial \boldsymbol{\theta}} = -\left[ y_i (1 - \hat{y}_i) - (1 - y_i) \hat{y}_i \right] \mathbf{x}_i = (\hat{y}_i - y_i) \mathbf{x}_i $$

Step 3. The gradient of the full loss, including regularization:

$$ \nabla_{\boldsymbol{\theta}} \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) \mathbf{x}_i + \lambda \boldsymbol{\theta} $$

This is the expression that every optimizer in this chapter will use. The gradient is a weighted average of the feature vectors $\mathbf{x}_i$, weighted by the prediction error $(\hat{y}_i - y_i)$. Larger errors contribute more to the gradient — the model focuses its updates where it is most wrong.

def sigmoid(z: np.ndarray) -> np.ndarray:
    """Numerically stable sigmoid function.

    Note: np.where evaluates both branches, so harmless overflow
    warnings may appear for extreme z; the selected values are correct.
    """
    return np.where(z >= 0,
                    1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))

def logistic_loss_and_gradient(
    theta: np.ndarray,
    X: np.ndarray,
    y: np.ndarray,
    lam: float = 0.01
) -> tuple[float, np.ndarray]:
    """
    Compute regularized logistic regression loss and gradient.

    Args:
        theta: Parameter vector, shape (d,)
        X: Feature matrix, shape (N, d)
        y: Binary labels, shape (N,)
        lam: L2 regularization strength

    Returns:
        Tuple of (loss, gradient)
    """
    N = X.shape[0]
    z = X @ theta                          # (N,)
    y_hat = sigmoid(z)                     # (N,)

    # Cross-entropy loss (numerically stable form)
    loss = -np.mean(
        y * np.log(y_hat + 1e-12) + (1 - y) * np.log(1 - y_hat + 1e-12)
    ) + 0.5 * lam * np.dot(theta, theta)

    # Gradient
    grad = (1 / N) * X.T @ (y_hat - y) + lam * theta  # (d,)

    return loss, grad

Implementation Note: The naive computation log(sigmoid(z)) is numerically unstable when $z$ is very negative (because sigmoid(z) underflows to 0.0). In production, use torch.nn.functional.binary_cross_entropy_with_logits, which computes the loss directly from the logits $z$ using the stable identity $\log(1 + e^{-|z|}) + \max(z, 0) - yz$.
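For illustration, a minimal numpy sketch of that identity (a stand-in to show the idea, not the actual PyTorch implementation):

```python
import numpy as np

def bce_with_logits(z: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Stable binary cross-entropy computed directly from logits z.

    Uses the identity  -[y log sigma(z) + (1-y) log(1 - sigma(z))]
                     = max(z, 0) - y*z + log(1 + exp(-|z|)),
    which never exponentiates a large positive number.
    """
    return np.maximum(z, 0) - y * z + np.log1p(np.exp(-np.abs(z)))

# Moderate logits: matches the naive formula
z = np.array([-3.0, 0.5, 2.0])
y = np.array([0.0, 1.0, 1.0])
sig = 1 / (1 + np.exp(-z))
naive = -(y * np.log(sig) + (1 - y) * np.log(1 - sig))
assert np.allclose(bce_with_logits(z, y), naive)

# Extreme logits: the stable form stays finite where the naive form
# would produce log(0) = -inf
extreme = bce_with_logits(np.array([-500.0]), np.array([1.0]))
assert np.isfinite(extreme).all()
```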


2.3 The Jacobian and the Hessian

2.3.1 The Jacobian Matrix

When $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ is a vector-valued function, its derivative is the Jacobian matrix $\mathbf{J} \in \mathbb{R}^{m \times n}$:

$$ \mathbf{J}_{ij} = \frac{\partial f_i}{\partial x_j}, \qquad \mathbf{J} = \begin{bmatrix} \nabla f_1^T \\ \nabla f_2^T \\ \vdots \\ \nabla f_m^T \end{bmatrix} $$

Each row of the Jacobian is the transpose of the gradient of one output component. The Jacobian generalizes the gradient to vector-valued functions.

ML application. Consider a neural network layer $\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$ where $\mathbf{W} \in \mathbb{R}^{m \times n}$. The Jacobian $\frac{\partial \mathbf{h}}{\partial \mathbf{x}} \in \mathbb{R}^{m \times n}$ tells us how each output neuron responds to each input feature. This is exactly what backpropagation computes implicitly.
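For an element-wise sigmoid activation, that layer Jacobian has the closed form $\mathrm{diag}(\sigma'(\mathbf{z}))\,\mathbf{W}$ with $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$. A sketch that builds it analytically and checks it against finite differences (the dimensions and values here are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(m, n))
b_vec = rng.normal(size=m)
x = rng.normal(size=n)

# Analytical Jacobian of h = sigmoid(W x + b): diag(sigmoid'(z)) @ W
z = W @ x + b_vec
h = sigmoid(z)
J_analytic = (h * (1 - h))[:, None] * W      # shape (m, n)

# Finite-difference Jacobian, one input coordinate at a time
eps = 1e-6
J_numeric = np.zeros((m, n))
for j in range(n):
    e = np.zeros(n); e[j] = eps
    J_numeric[:, j] = (sigmoid(W @ (x + e) + b_vec)
                       - sigmoid(W @ (x - e) + b_vec)) / (2 * eps)

assert np.allclose(J_analytic, J_numeric, atol=1e-6)
```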

2.3.2 The Hessian Matrix

For a scalar function $f : \mathbb{R}^n \to \mathbb{R}$, the Hessian matrix $\mathbf{H} \in \mathbb{R}^{n \times n}$ is the matrix of second partial derivatives:

$$ \mathbf{H}_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} $$

The Hessian encodes the curvature of $f$. If $\nabla f(\mathbf{x})$ tells you the slope, $\mathbf{H}(\mathbf{x})$ tells you how the slope is changing.

Key properties:

  • $\mathbf{H}$ is symmetric (by Schwarz's theorem, when the second partials are continuous)
  • If $\nabla f(\mathbf{x}) = \mathbf{0}$ and $\mathbf{H}$ is positive definite at $\mathbf{x}$, then $\mathbf{x}$ is a local minimum
  • If $\nabla f(\mathbf{x}) = \mathbf{0}$ and $\mathbf{H}$ is negative definite, then $\mathbf{x}$ is a local maximum
  • If $\nabla f(\mathbf{x}) = \mathbf{0}$ and $\mathbf{H}$ has both positive and negative eigenvalues, $\mathbf{x}$ is a saddle point

def numerical_hessian(f, x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """
    Compute the Hessian matrix of f at x using finite differences.

    Args:
        f: Scalar-valued function of a vector
        x: Point at which to evaluate the Hessian, shape (n,)
        eps: Step size for finite differences

    Returns:
        Hessian matrix, shape (n, n)
    """
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            x_pp = x.copy(); x_pp[i] += eps; x_pp[j] += eps
            x_pm = x.copy(); x_pm[i] += eps; x_pm[j] -= eps
            x_mp = x.copy(); x_mp[i] -= eps; x_mp[j] += eps
            x_mm = x.copy(); x_mm[i] -= eps; x_mm[j] -= eps
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * eps**2)
    return H

# Hessian of the quadratic: should recover A
f_quad = lambda x: quadratic_loss(x, A, b)
H = numerical_hessian(f_quad, np.array([1.0, 1.0]))
print(f"Hessian:\n{H}")   # Should equal A
print(f"Eigenvalues: {np.linalg.eigvalsh(H)}")  # Both positive → local minimum

Production Reality: For a neural network with $n$ parameters, the Hessian has $n^2$ entries. GPT-3 has $n = 175 \times 10^9$ parameters, so its Hessian would require $\sim 3 \times 10^{22}$ entries — at 4 bytes each, roughly $10^{11}$ terabytes. This is why second-order methods are impractical at scale, and why the entire field uses first-order methods (gradient descent) with clever approximations.


2.4 The Chain Rule and Computational Graphs

2.4.1 The Multivariate Chain Rule

If $f = g \circ h$, meaning $f(\mathbf{x}) = g(h(\mathbf{x}))$, and $h : \mathbb{R}^n \to \mathbb{R}^m$, $g : \mathbb{R}^m \to \mathbb{R}$, then:

$$ \frac{\partial f}{\partial x_j} = \sum_{i=1}^{m} \frac{\partial g}{\partial h_i} \cdot \frac{\partial h_i}{\partial x_j} $$

In matrix notation: $\nabla_\mathbf{x} f = \mathbf{J}_h^T \nabla_\mathbf{h} g$, where $\mathbf{J}_h$ is the Jacobian of $h$.

This is the single most important equation in deep learning. A neural network is a composition of functions $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$. The chain rule tells us how to propagate gradient information backward through the entire composition.
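The matrix form $\nabla_\mathbf{x} f = \mathbf{J}_h^T \nabla_\mathbf{h} g$ can be verified numerically on a toy composition (the particular $h$ and $g$ below are illustrative choices, not from the text):

```python
import numpy as np

# h : R^2 -> R^3 and g : R^3 -> R, with f = g o h
h = lambda x: np.array([x[0]**2, x[0] * x[1], np.exp(x[1])])
g = lambda u: 0.5 * np.sum(u**2)          # so grad_h g = h
f = lambda x: g(h(x))

def J_h(x):
    """Analytical Jacobian of h, shape (3, 2)."""
    return np.array([[2 * x[0], 0.0],
                     [x[1],     x[0]],
                     [0.0,      np.exp(x[1])]])

x = np.array([0.7, -0.3])
grad_chain = J_h(x).T @ h(x)              # J_h^T grad_h g

# Compare against central finite differences on f directly
eps = 1e-6
grad_num = np.array([
    (f(x + eps * np.eye(2)[i]) - f(x - eps * np.eye(2)[i])) / (2 * eps)
    for i in range(2)
])
assert np.allclose(grad_chain, grad_num, atol=1e-6)
```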

2.4.2 Computational Graphs

A computational graph is a directed acyclic graph (DAG) where:

  • Each node represents a variable (input, intermediate, or output)
  • Each edge represents a dependency: if node $v$ is an input to the operation that produces node $u$, there is an edge from $v$ to $u$
  • Each operation has a local derivative that can be computed independently

Consider the simple function $f(x, y) = (x + y) \cdot \sigma(x)$. We decompose it into elementary operations:

Computational Graph:

  x ──┬──> [+] ──> a = x + y ──> [*] ──> f = a * b
      │                            ^
  y ──┘                            │
                                   │
  x ────> [σ] ──> b = σ(x)  ──────┘

The forward pass computes:

  1. $a = x + y$
  2. $b = \sigma(x)$
  3. $f = a \cdot b$

The backward pass applies the chain rule from the output backward:

  1. $\frac{\partial f}{\partial f} = 1$
  2. $\frac{\partial f}{\partial a} = b, \quad \frac{\partial f}{\partial b} = a$
  3. $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial x} + \frac{\partial f}{\partial b} \cdot \frac{\partial b}{\partial x} = b \cdot 1 + a \cdot \sigma'(x)$
  4. $\frac{\partial f}{\partial y} = \frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial y} = b \cdot 1$

def computational_graph_forward_backward(x: float, y: float) -> dict:
    """
    Forward and backward pass through f(x, y) = (x + y) * sigmoid(x).

    Returns:
        Dictionary with forward values and gradients.
    """
    # Forward pass
    a = x + y
    b = 1.0 / (1.0 + np.exp(-x))       # sigmoid(x)
    f = a * b

    # Backward pass
    df_df = 1.0                          # seed gradient
    df_da = b * df_df                    # ∂f/∂a = b
    df_db = a * df_df                    # ∂f/∂b = a
    db_dx = b * (1 - b)                  # sigmoid derivative
    da_dx = 1.0
    da_dy = 1.0

    df_dx = df_da * da_dx + df_db * db_dx
    df_dy = df_da * da_dy

    return {
        "forward": {"a": a, "b": b, "f": f},
        "gradients": {"df/dx": df_dx, "df/dy": df_dy}
    }

# Verify against numerical gradient
result = computational_graph_forward_backward(1.0, 2.0)
print(f"Analytical: df/dx = {result['gradients']['df/dx']:.6f}")

eps = 1e-7
f_plus = (1.0 + eps + 2.0) * sigmoid(np.array([1.0 + eps]))[0]
f_minus = (1.0 - eps + 2.0) * sigmoid(np.array([1.0 - eps]))[0]
print(f"Numerical:  df/dx = {(f_plus - f_minus) / (2 * eps):.6f}")

2.4.3 Deriving Backpropagation

Now we scale this to a full neural network. Consider a two-layer network:

$$ \begin{aligned} \mathbf{z}_1 &= \mathbf{W}_1 \mathbf{x} + \mathbf{b}_1 \\ \mathbf{h}_1 &= \text{ReLU}(\mathbf{z}_1) \\ \mathbf{z}_2 &= \mathbf{W}_2 \mathbf{h}_1 + \mathbf{b}_2 \\ \hat{y} &= \sigma(\mathbf{z}_2) \\ \mathcal{L} &= -\left[ y \log \hat{y} + (1-y) \log(1 - \hat{y}) \right] \end{aligned} $$

Forward pass: Compute $\mathbf{z}_1 \to \mathbf{h}_1 \to \mathbf{z}_2 \to \hat{y} \to \mathcal{L}$. Each intermediate value must be cached.

Backward pass: Apply the chain rule in reverse order.

$$ \begin{aligned} \frac{\partial \mathcal{L}}{\partial \mathbf{z}_2} &= \hat{y} - y \\ \frac{\partial \mathcal{L}}{\partial \mathbf{W}_2} &= \frac{\partial \mathcal{L}}{\partial \mathbf{z}_2} \cdot \mathbf{h}_1^T \\ \frac{\partial \mathcal{L}}{\partial \mathbf{b}_2} &= \frac{\partial \mathcal{L}}{\partial \mathbf{z}_2} \\ \frac{\partial \mathcal{L}}{\partial \mathbf{h}_1} &= \mathbf{W}_2^T \frac{\partial \mathcal{L}}{\partial \mathbf{z}_2} \\ \frac{\partial \mathcal{L}}{\partial \mathbf{z}_1} &= \frac{\partial \mathcal{L}}{\partial \mathbf{h}_1} \odot \mathbb{1}[\mathbf{z}_1 > 0] \\ \frac{\partial \mathcal{L}}{\partial \mathbf{W}_1} &= \frac{\partial \mathcal{L}}{\partial \mathbf{z}_1} \cdot \mathbf{x}^T \\ \frac{\partial \mathcal{L}}{\partial \mathbf{b}_1} &= \frac{\partial \mathcal{L}}{\partial \mathbf{z}_1} \end{aligned} $$

The $\odot$ denotes element-wise multiplication, and $\mathbb{1}[\mathbf{z}_1 > 0]$ is the ReLU derivative (1 where $\mathbf{z}_1 > 0$, 0 otherwise).

class TwoLayerNet:
    """
    A two-layer neural network with backpropagation implemented from scratch.
    Architecture: Linear -> ReLU -> Linear -> Sigmoid -> BCE Loss
    """

    def __init__(self, input_dim: int, hidden_dim: int, seed: int = 42):
        rng = np.random.default_rng(seed)
        # He initialization for ReLU
        self.W1 = rng.normal(0, np.sqrt(2 / input_dim), (hidden_dim, input_dim))
        self.b1 = np.zeros(hidden_dim)
        # Xavier initialization for sigmoid output
        self.W2 = rng.normal(0, np.sqrt(1 / hidden_dim), (1, hidden_dim))
        self.b2 = np.zeros(1)
        # Cache for backward pass
        self._cache: dict = {}

    def forward(self, x: np.ndarray, y: np.ndarray) -> float:
        """Forward pass. Returns scalar loss."""
        z1 = self.W1 @ x + self.b1
        h1 = np.maximum(0, z1)              # ReLU
        z2 = self.W2 @ h1 + self.b2
        y_hat = sigmoid(z2)

        loss = -(y * np.log(y_hat + 1e-12)
                 + (1 - y) * np.log(1 - y_hat + 1e-12))

        self._cache = {"x": x, "z1": z1, "h1": h1, "z2": z2,
                        "y_hat": y_hat, "y": y}
        return float(loss)

    def backward(self) -> dict:
        """Backward pass. Returns gradients for all parameters."""
        c = self._cache
        dz2 = c["y_hat"] - c["y"]           # (1,)
        dW2 = dz2[:, None] * c["h1"][None, :]  # outer product: (1, hidden)
        db2 = dz2
        dh1 = self.W2.T @ dz2               # (hidden,)
        dz1 = dh1 * (c["z1"] > 0).astype(float)  # ReLU derivative
        dW1 = dz1[:, None] * c["x"][None, :]  # (hidden, input)
        db1 = dz1
        return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
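A gradient check is the standard way to validate a backward pass like this one. Below is a self-contained sketch that restates the forward pass as a pure function of the parameters and compares the analytic $\partial \mathcal{L} / \partial \mathbf{W}_1$ from the equations above against central differences (tiny arbitrary dimensions):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(params, x, y):
    """Loss of the two-layer net, restated as a pure function of its parameters."""
    W1, b1, W2, b2 = params
    h1 = np.maximum(0, W1 @ x + b1)
    y_hat = sigmoid(W2 @ h1 + b2)
    return (-(y * np.log(y_hat + 1e-12)
              + (1 - y) * np.log(1 - y_hat + 1e-12)))[0]

def grad_W1(params, x, y):
    """Analytic dL/dW1, following the backward equations above."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h1 = np.maximum(0, z1)
    y_hat = sigmoid(W2 @ h1 + b2)
    dz2 = y_hat - y                      # (1,)
    dh1 = W2.T @ dz2                     # (hidden,)
    dz1 = dh1 * (z1 > 0)                 # ReLU gate
    return dz1[:, None] * x[None, :]     # (hidden, input)

rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 3)), rng.normal(size=4),
          rng.normal(size=(1, 4)), rng.normal(size=1))
x, y = rng.normal(size=3), 1.0

dW1 = grad_W1(params, x, y)
eps = 1e-6
for i in range(4):
    for j in range(3):
        plus = [p.copy() for p in params]; plus[0][i, j] += eps
        minus = [p.copy() for p in params]; minus[0][i, j] -= eps
        num = (forward(plus, x, y) - forward(minus, x, y)) / (2 * eps)
        assert abs(dW1[i, j] - num) < 1e-5
print("gradient check passed")
```

The same loop, applied to every parameter tensor, is how autograd engines are tested in practice.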

Research Insight: Backpropagation was not invented once. Bryson and Ho (1969) described it for control systems. Werbos (1974) applied it to neural networks in his PhD thesis. Rumelhart, Hinton, and Williams (1986) popularized it in "Learning representations by back-propagating errors." The algorithm is simply the chain rule applied systematically — the insight was recognizing its computational efficiency.

2.4.4 Forward Mode vs. Reverse Mode Automatic Differentiation

The backward pass above is an instance of reverse mode automatic differentiation (AD). There is an alternative: forward mode AD.

| Property | Forward Mode | Reverse Mode |
|---|---|---|
| Direction | Input → output | Output → input |
| Computes | $\frac{\partial \mathbf{f}}{\partial x_j}$ (one input at a time) | $\frac{\partial f_i}{\partial \mathbf{x}}$ (one output at a time) |
| Cost per pass | One forward sweep | One forward + one backward sweep |
| Efficient when | $n \ll m$ (few inputs, many outputs) | $n \gg m$ (many inputs, few outputs) |
| ML use case | Rare (Jacobian-vector products) | Standard (backpropagation) |

In machine learning, we typically have millions of parameters (inputs to the loss) and a single scalar loss (one output). Reverse mode computes all $n$ partial derivatives in a single backward pass, whereas forward mode would require $n$ separate passes. This is why backpropagation (reverse mode AD) dominates.

Advanced Sidebar: Forward mode AD becomes relevant in two settings: (1) computing Jacobian-vector products for Hessian-free optimization (Pearlmutter, 1994), and (2) computing per-example gradients efficiently in differential privacy (Goodfellow, 2015). JAX supports both modes via jax.jvp (forward) and jax.vjp (reverse).
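To make the contrast concrete, here is a minimal forward-mode AD sketch using dual numbers (a toy illustration of the idea, not how JAX implements jvp), applied to the $f(x, y) = (x + y) \cdot \sigma(x)$ example from Section 2.4.2:

```python
import numpy as np

class Dual:
    """Minimal dual number: carries a value and its derivative (tangent).

    Forward-mode AD seeds dot=1 on one chosen input; every operation then
    propagates both the value and d(value)/d(input) in a single sweep.
    """
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # product rule, forward direction
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def sigmoid_dual(x: Dual) -> Dual:
    s = 1 / (1 + np.exp(-x.val))
    return Dual(s, s * (1 - s) * x.dot)   # chain rule, forward direction

# f(x, y) = (x + y) * sigmoid(x): one forward sweep per input variable
x = Dual(1.0, dot=1.0)    # seed: differentiate with respect to x
y = Dual(2.0, dot=0.0)
f = (x + y) * sigmoid_dual(x)
print(f"df/dx = {f.dot:.6f}")   # matches the reverse-mode result above
```

With $n$ inputs, recovering the full gradient this way requires $n$ separate sweeps (one seed per input), which is exactly why reverse mode wins for scalar losses.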


2.5 Convexity: When Optimization Is "Easy"

2.5.1 Convex Sets and Convex Functions

A set $\mathcal{C} \subseteq \mathbb{R}^n$ is convex if, for any two points $\mathbf{x}, \mathbf{y} \in \mathcal{C}$ and any $\lambda \in [0, 1]$:

$$ \lambda \mathbf{x} + (1 - \lambda) \mathbf{y} \in \mathcal{C} $$

Geometrically: the line segment between any two points in the set lies entirely within the set.

A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if its domain is convex and, for all $\mathbf{x}, \mathbf{y}$ in the domain and $\lambda \in [0, 1]$:

$$ f(\lambda \mathbf{x} + (1 - \lambda) \mathbf{y}) \leq \lambda f(\mathbf{x}) + (1 - \lambda) f(\mathbf{y}) $$

Equivalently: the function lies on or below the line segment connecting any two points on its graph. This is Jensen's inequality applied to $f$.

2.5.2 Why Convexity Matters: The Optimization Guarantee

For convex functions, every local minimum is a global minimum. This is the strongest guarantee in optimization theory, and it means that any algorithm that finds a stationary point ($\nabla f(\mathbf{x}) = \mathbf{0}$) has found the global optimum.

Second-order characterization. A twice-differentiable function $f$ is convex if and only if its Hessian is positive semidefinite everywhere:

$$ \mathbf{H}(\mathbf{x}) \succeq \mathbf{0} \quad \forall \mathbf{x} $$

This connects Chapter 1 (eigenvalues of the Hessian) to optimization: if all eigenvalues of $\mathbf{H}$ are non-negative, the function curves upward in every direction.

Example. The regularized logistic regression loss from Meridian Financial's credit model is convex. The cross-entropy is convex in the logits (a known result from the theory of exponential families), and the L2 regularization term $\frac{\lambda}{2}\|\boldsymbol{\theta}\|^2$ is strongly convex. Their sum is strongly convex, guaranteeing a unique global minimum.
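The strong-convexity claim can be checked directly. The Hessian of the regularized loss is $\mathbf{H} = \frac{1}{N}\mathbf{X}^T \mathrm{diag}(\hat{y}_i(1-\hat{y}_i))\,\mathbf{X} + \lambda \mathbf{I}$; the first term is a weighted Gram matrix and hence positive semidefinite, so every eigenvalue of $\mathbf{H}$ is at least $\lambda$. A sketch on synthetic data:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# H = (1/N) X^T diag(y_hat * (1 - y_hat)) X + lam * I
rng = np.random.default_rng(0)
N, d, lam = 200, 3, 0.1
X = rng.normal(size=(N, d))
theta = rng.normal(size=d)

y_hat = sigmoid(X @ theta)
W = y_hat * (1 - y_hat)                  # per-example Hessian weights
H = (X.T * W) @ X / N + lam * np.eye(d)  # weighted Gram matrix + ridge

eigs = np.linalg.eigvalsh(H)
assert (eigs >= lam - 1e-10).all()       # H >= lam * I: lam-strong convexity
print(f"Hessian eigenvalues: {eigs}")
```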

def verify_convexity_numerically(
    loss_fn,
    grad_fn,
    theta_1: np.ndarray,
    theta_2: np.ndarray,
    n_points: int = 100
) -> bool:
    """
    Numerically verify convexity by checking Jensen's inequality
    along a line segment between two parameter vectors.

    Returns True if no violations are found.
    """
    violations = 0
    for alpha in np.linspace(0, 1, n_points):
        theta_mid = alpha * theta_1 + (1 - alpha) * theta_2
        lhs = loss_fn(theta_mid)
        rhs = alpha * loss_fn(theta_1) + (1 - alpha) * loss_fn(theta_2)
        if lhs > rhs + 1e-10:  # tolerance for floating point
            violations += 1
    return violations == 0

2.5.3 Strong Convexity and Conditioning

A function $f$ is $\mu$-strongly convex if $f(\mathbf{x}) - \frac{\mu}{2}\|\mathbf{x}\|^2$ is convex. Equivalently, the Hessian satisfies $\mathbf{H}(\mathbf{x}) \succeq \mu \mathbf{I}$ everywhere. Strong convexity implies:

  • A unique global minimum
  • Gradient descent converges at a linear rate (exponential decay of the error)
  • The convergence rate depends on the condition number $\kappa = L / \mu$, where $L$ is the Lipschitz constant of the gradient

Mathematical Foundation: The convergence rate of gradient descent on an $L$-smooth, $\mu$-strongly convex function is:

$$f(\mathbf{x}_t) - f(\mathbf{x}^*) \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2t} \left(f(\mathbf{x}_0) - f(\mathbf{x}^*) \right)$$

where $\kappa = L/\mu$ is the condition number. When $\kappa$ is large (ill-conditioned), convergence is slow: the loss landscape is a narrow valley, and gradient descent oscillates between the walls. L2 regularization increases $\mu$ (and thus decreases $\kappa$), which partly explains why regularization aids optimization — not just generalization.

2.5.4 Lipschitz Continuity of the Gradient

A function $f$ has an $L$-Lipschitz continuous gradient if:

$$ \|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L \|\mathbf{x} - \mathbf{y}\| \quad \forall \mathbf{x}, \mathbf{y} $$

The constant $L$ bounds how fast the gradient can change. This is the key assumption for proving convergence of gradient descent: if the gradient changes too rapidly, a fixed-size step might overshoot. The optimal learning rate for gradient descent on an $L$-smooth function is $\eta = 1/L$.
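On the quadratic from Section 2.2, $L$ is simply the largest eigenvalue of $\mathbf{A}$, so the theory can be tested directly: $\eta = 1/L$ converges, while any $\eta > 2/L$ diverges. A sketch:

```python
import numpy as np

# For f(x) = 0.5 x^T A x - b^T x, the gradient is A x - b,
# so L = lambda_max(A) and mu = lambda_min(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 2.0])

L = np.linalg.eigvalsh(A).max()
eta = 1.0 / L                        # safe step size from the theory

x = np.zeros(2)
for _ in range(500):
    x -= eta * (A @ x - b)

x_star = np.linalg.solve(A, b)
assert np.allclose(x, x_star, atol=1e-6)

# A step size above 2/L diverges: the iteration matrix I - eta * A
# then has an eigenvalue with |1 - eta * lambda_max| > 1.
x_bad = np.zeros(2)
for _ in range(100):
    x_bad -= (2.1 / L) * (A @ x_bad - b)
assert np.linalg.norm(x_bad) > np.linalg.norm(x_star)
```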

2.5.5 The Non-Convex Reality of Deep Learning

The loss function of a neural network is almost never convex. The composition of linear layers and nonlinear activations creates a highly non-convex landscape with:

  • Multiple local minima (though research suggests most are nearly as good as the global minimum for overparameterized networks; Choromanska et al., 2015)
  • Saddle points (far more numerous than local minima in high dimensions; Dauphin et al., 2014)
  • Flat regions (where the gradient is near zero but the loss is not near optimal)

Research Insight: Dauphin et al. (2014) showed that in high-dimensional loss landscapes, saddle points are exponentially more common than local minima. Under a random-sign heuristic for the Hessian eigenvalues, a random critical point (where $\nabla f = \mathbf{0}$) in a $d$-dimensional space has probability $(1/2)^d$ of being a local minimum (all eigenvalues positive). For $d = 1000$, this probability is effectively zero. The practical implication: gradient-based methods rarely get trapped in local minima. They get trapped near saddle points — and adaptive methods like Adam escape saddle points faster because they normalize by the gradient magnitude.

Despite non-convexity, gradient-based methods work remarkably well for deep learning. The theoretical understanding of this phenomenon is one of the most active research areas in optimization and learning theory. Key hypotheses include the overparameterization theory (the network has far more parameters than data points, making local minima benign) and the loss surface connectivity theory (Draxler et al., 2018, showed that local minima are connected by paths of nearly constant loss).


2.6 Gradient Descent and Its Variants: The Family Tree

We now build the optimizer family tree from first principles. Each variant addresses a specific failure mode of its predecessor.

2.6.1 Vanilla Gradient Descent

The update rule is:

$$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla \mathcal{L}(\boldsymbol{\theta}_t) $$

where $\eta > 0$ is the learning rate. This is the simplest possible first-order method: take a step in the direction of steepest descent.

def gradient_descent(
    loss_and_grad_fn,
    theta_init: np.ndarray,
    lr: float = 0.01,
    n_steps: int = 1000,
    verbose: bool = False
) -> tuple[np.ndarray, list[float]]:
    """
    Vanilla gradient descent.

    Args:
        loss_and_grad_fn: Function returning (loss, gradient)
        theta_init: Initial parameters
        lr: Learning rate
        n_steps: Number of optimization steps
        verbose: Print loss every 100 steps

    Returns:
        Tuple of (final parameters, loss history)
    """
    theta = theta_init.copy()
    history = []

    for t in range(n_steps):
        loss, grad = loss_and_grad_fn(theta)
        history.append(loss)
        theta -= lr * grad

        if verbose and t % 100 == 0:
            print(f"Step {t:4d}: loss = {loss:.6f}, |grad| = {np.linalg.norm(grad):.6f}")

    return theta, history

Failure mode: On ill-conditioned problems (high $\kappa$), vanilla GD oscillates across the narrow direction of the loss landscape while making slow progress along the elongated direction. It is also expensive on large datasets because it computes the full gradient over all $N$ data points at each step.
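The oscillation is easy to reproduce on a diagonal quadratic with $\kappa = 100$. A sketch (the step size 0.019, just under the stability limit $2/L = 0.02$, is chosen to make the effect visible):

```python
import numpy as np

# Ill-conditioned quadratic: curvature 100x stronger along x1 than x2
A = np.diag([100.0, 1.0])          # condition number kappa = 100
grad = lambda x: A @ x             # minimum at the origin

eta = 0.019                        # just under the stability limit 2/L
x = np.array([1.0, 1.0])
traj = [x.copy()]
for _ in range(200):
    x = x - eta * grad(x)
    traj.append(x.copy())
traj = np.array(traj)

# x1 flips sign every step (per-step factor 1 - eta*100 = -0.9): oscillation
# x2 shrinks slowly       (per-step factor 1 - eta*1   =  0.981): crawl
print("first x1 values:", traj[:4, 0])   # ~ 1.0, -0.9, 0.81, -0.729
print("x2 after 10 steps:", traj[10, 1])
```

The walls of the narrow valley (the $x_1$ axis) dominate the step-size budget, starving progress along the elongated $x_2$ direction — exactly the failure mode momentum is designed to fix.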

2.6.2 Stochastic Gradient Descent (SGD)

Instead of computing the full gradient, SGD samples a mini-batch $\mathcal{B} \subset \{1, \ldots, N\}$ of size $B$ and uses the mini-batch gradient as a noisy estimate:

$$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \cdot \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla \ell(\boldsymbol{\theta}_t; \mathbf{x}_i, y_i) $$

The mini-batch gradient is an unbiased estimate of the full gradient: $\mathbb{E}[\nabla_\mathcal{B}] = \nabla \mathcal{L}$. The variance of this estimate decreases as $O(1/B)$, which means larger batches give more stable (but more expensive) updates.

Common Misconception: "Stochastic gradient descent is just a cheaper approximation of gradient descent." While this is technically true, the noise in SGD is not merely a cost to be tolerated — it is beneficial. The noise helps SGD escape sharp local minima that generalize poorly (Keskar et al., 2017) and acts as implicit regularization. This is why SGD often finds solutions that generalize better than those found by exact gradient descent.
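The mini-batch update can be implemented in a few lines. A self-contained sketch on synthetic data (the function name and hyperparameters here are illustrative, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sgd_logistic(X, y, lr=0.1, batch_size=32, n_epochs=50, lam=0.01, seed=0):
    """Mini-batch SGD on the regularized logistic loss of Section 2.2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):
        perm = rng.permutation(N)              # fresh shuffle each epoch
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]
            y_hat = sigmoid(X[idx] @ theta)
            # mini-batch gradient: unbiased estimate of the full gradient
            g = X[idx].T @ (y_hat - y[idx]) / len(idx) + lam * theta
            theta -= lr * g
    return theta

# Synthetic data drawn from a known logistic model
rng = np.random.default_rng(1)
true_theta = np.array([2.0, -1.0])
X = rng.normal(size=(500, 2))
y = (sigmoid(X @ true_theta) > rng.uniform(size=500)).astype(float)

theta = sgd_logistic(X, y)
acc = np.mean((sigmoid(X @ theta) > 0.5) == (y == 1))
print(f"train accuracy: {acc:.2f}")
```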

2.6.3 SGD with Momentum

Momentum addresses the oscillation problem by adding a "velocity" term that accumulates gradient information over time:

$$ \begin{aligned} \mathbf{v}_{t+1} &= \beta \mathbf{v}_t + \nabla \mathcal{L}(\boldsymbol{\theta}_t) \\ \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \eta \mathbf{v}_{t+1} \end{aligned} $$

where $\beta \in [0, 1)$ is the momentum coefficient (typically 0.9). The velocity $\mathbf{v}_t$ is an exponentially decaying average of past gradients. Gradients that consistently point in the same direction accumulate, while oscillating components cancel out.

Physical analogy. Imagine a ball rolling down a hilly landscape. Without momentum, the ball stops wherever the local slope is zero. With momentum, the ball builds up speed on consistent downhill slopes and rolls past small bumps and shallow valleys.

Convergence improvement. On a quadratic with condition number $\kappa$, momentum reduces the convergence factor from $(\kappa - 1)/(\kappa + 1)$ to $(\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$. For $\kappa = 100$, this improves from 0.98 to 0.82 — a dramatic speedup.

def sgd_momentum(
    loss_and_grad_fn,
    theta_init: np.ndarray,
    lr: float = 0.01,
    beta: float = 0.9,
    n_steps: int = 1000
) -> tuple[np.ndarray, list[float]]:
    """SGD with momentum."""
    theta = theta_init.copy()
    velocity = np.zeros_like(theta)
    history = []

    for t in range(n_steps):
        loss, grad = loss_and_grad_fn(theta)
        history.append(loss)
        velocity = beta * velocity + grad
        theta -= lr * velocity

    return theta, history

2.6.4 Nesterov Accelerated Gradient (NAG)

Nesterov momentum modifies the standard momentum update by computing the gradient at the "lookahead" position $\boldsymbol{\theta}_t - \eta \beta \mathbf{v}_t$ rather than at $\boldsymbol{\theta}_t$:

$$ \begin{aligned} \mathbf{v}_{t+1} &= \beta \mathbf{v}_t + \nabla \mathcal{L}(\boldsymbol{\theta}_t - \eta \beta \mathbf{v}_t) \\ \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \eta \mathbf{v}_{t+1} \end{aligned} $$

The intuition: before committing to a step, peek ahead in the momentum direction and correct based on the gradient there. This prevents overshooting and improves convergence on convex problems from $O(1/t)$ to $O(1/t^2)$ — a theoretically optimal rate due to Nesterov (1983).

def nesterov_momentum(
    loss_and_grad_fn,
    theta_init: np.ndarray,
    lr: float = 0.01,
    beta: float = 0.9,
    n_steps: int = 1000
) -> tuple[np.ndarray, list[float]]:
    """Nesterov accelerated gradient descent."""
    theta = theta_init.copy()
    velocity = np.zeros_like(theta)
    history = []

    for t in range(n_steps):
        # Lookahead position
        theta_lookahead = theta - lr * beta * velocity
        loss, grad = loss_and_grad_fn(theta_lookahead)

        # Record loss at current (non-lookahead) position
        current_loss, _ = loss_and_grad_fn(theta)
        history.append(current_loss)

        velocity = beta * velocity + grad
        theta -= lr * velocity

    return theta, history

2.6.5 AdaGrad: Per-Parameter Learning Rates

All methods so far use a single learning rate $\eta$ for all parameters. But different parameters may require different step sizes. A parameter that has received many large gradients (a frequently updated feature) should take smaller steps than one that has received few small gradients (a rare feature).

AdaGrad (Duchi, Hazan, & Singer, 2011) adapts the learning rate per parameter by dividing by the square root of the sum of all past squared gradients:

$$ \begin{aligned} \mathbf{G}_{t+1} &= \mathbf{G}_t + \nabla \mathcal{L}(\boldsymbol{\theta}_t)^2 \quad \text{(element-wise square)} \\ \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{G}_{t+1}} + \epsilon} \odot \nabla \mathcal{L}(\boldsymbol{\theta}_t) \end{aligned} $$

where $\epsilon \approx 10^{-8}$ prevents division by zero.

Failure mode of AdaGrad: The denominator $\mathbf{G}_t$ only grows. Eventually, the effective learning rate shrinks to near zero, and training stalls. This is beneficial for convex problems (where you want to converge and stop) but catastrophic for non-convex problems (where you may need to keep exploring).
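
The AdaGrad update can be sketched in the same style as the other implementations in this section (the function name and default hyperparameters are mine, not from a reference implementation):

```python
import numpy as np

def adagrad(
    loss_and_grad_fn,
    theta_init: np.ndarray,
    lr: float = 0.01,
    eps: float = 1e-8,
    n_steps: int = 1000
) -> tuple[np.ndarray, list[float]]:
    """AdaGrad: per-parameter rates from the accumulated sum of squared gradients."""
    theta = theta_init.copy()
    G = np.zeros_like(theta)  # running sum of squared gradients -- only ever grows
    history = []

    for t in range(n_steps):
        loss, grad = loss_and_grad_fn(theta)
        history.append(loss)
        G += grad**2
        theta -= lr * grad / (np.sqrt(G) + eps)

    return theta, history
```

On a simple quadratic you can watch the effective step size $\eta / \sqrt{G_t}$ shrink as $G$ accumulates — the failure mode described above.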

2.6.6 RMSProp: Fixing AdaGrad's Decay

RMSProp (Hinton, unpublished lecture notes, 2012) fixes AdaGrad's monotonic decay by using an exponentially decaying average of squared gradients instead of a sum:

$$ \begin{aligned} \mathbf{s}_{t+1} &= \rho \mathbf{s}_t + (1 - \rho) \nabla \mathcal{L}(\boldsymbol{\theta}_t)^2 \\ \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{s}_{t+1}} + \epsilon} \odot \nabla \mathcal{L}(\boldsymbol{\theta}_t) \end{aligned} $$

where $\rho \in [0, 1)$ (typically 0.99) controls the decay rate. Now the denominator adapts to the recent gradient magnitudes, not the entire history.

def rmsprop(
    loss_and_grad_fn,
    theta_init: np.ndarray,
    lr: float = 0.001,
    rho: float = 0.99,
    eps: float = 1e-8,
    n_steps: int = 1000
) -> tuple[np.ndarray, list[float]]:
    """RMSProp optimizer."""
    theta = theta_init.copy()
    s = np.zeros_like(theta)
    history = []

    for t in range(n_steps):
        loss, grad = loss_and_grad_fn(theta)
        history.append(loss)
        s = rho * s + (1 - rho) * grad**2
        theta -= lr * grad / (np.sqrt(s) + eps)

    return theta, history

2.6.7 Adam: Combining Momentum and Adaptive Rates

Adam (Kingma & Ba, 2015) combines the best of momentum (first moment) and RMSProp (second moment):

$$ \begin{aligned} \mathbf{m}_{t+1} &= \beta_1 \mathbf{m}_t + (1 - \beta_1) \nabla \mathcal{L}(\boldsymbol{\theta}_t) & \text{(first moment estimate)} \\ \mathbf{v}_{t+1} &= \beta_2 \mathbf{v}_t + (1 - \beta_2) \nabla \mathcal{L}(\boldsymbol{\theta}_t)^2 & \text{(second moment estimate)} \\ \hat{\mathbf{m}}_{t+1} &= \frac{\mathbf{m}_{t+1}}{1 - \beta_1^{t+1}} & \text{(bias correction)} \\ \hat{\mathbf{v}}_{t+1} &= \frac{\mathbf{v}_{t+1}}{1 - \beta_2^{t+1}} & \text{(bias correction)} \\ \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_{t+1}} + \epsilon} \odot \hat{\mathbf{m}}_{t+1} \end{aligned} $$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.

Bias correction is crucial. At $t=0$, $\mathbf{m}_0 = \mathbf{0}$ and $\mathbf{v}_0 = \mathbf{0}$, so early estimates of the first and second moments are biased toward zero. The correction factors $1 / (1 - \beta_i^t)$ counteract this initialization bias. Without them, the early updates would be badly mis-scaled: because $\beta_2 > \beta_1$ under the defaults, the second-moment estimate is far more biased toward zero than the first, and the very first uncorrected update $\mathbf{m}_1 / \sqrt{\mathbf{v}_1}$ would be inflated by a factor of $(1 - \beta_1)/\sqrt{1 - \beta_2} \approx 3.2$.

def adam(
    loss_and_grad_fn,
    theta_init: np.ndarray,
    lr: float = 0.001,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
    n_steps: int = 1000
) -> tuple[np.ndarray, list[float]]:
    """
    Adam optimizer (Kingma & Ba, 2015).

    Args:
        loss_and_grad_fn: Function returning (loss, gradient)
        theta_init: Initial parameters
        lr: Learning rate
        beta1: First moment decay rate
        beta2: Second moment decay rate
        eps: Numerical stability constant
        n_steps: Number of optimization steps

    Returns:
        Tuple of (final parameters, loss history)
    """
    theta = theta_init.copy()
    m = np.zeros_like(theta)  # first moment
    v = np.zeros_like(theta)  # second moment
    history = []

    for t in range(1, n_steps + 1):
        loss, grad = loss_and_grad_fn(theta)
        history.append(loss)

        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2

        m_hat = m / (1 - beta1**t)   # bias correction
        v_hat = v / (1 - beta2**t)

        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)

    return theta, history

2.6.8 AdamW: Decoupled Weight Decay

Loshchilov and Hutter (2019) showed that Adam's interaction with L2 regularization is problematic. In standard Adam with L2 regularization, the weight decay gradient $\lambda \boldsymbol{\theta}$ is scaled by the adaptive learning rate — but weight decay should apply uniformly, not be dampened for parameters with large historical gradients.

AdamW decouples weight decay from the adaptive update:

$$ \boldsymbol{\theta}_{t+1} = (1 - \eta \lambda) \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_{t+1}} + \epsilon} \odot \hat{\mathbf{m}}_{t+1} $$

The key difference: the weight decay term $(1 - \eta \lambda) \boldsymbol{\theta}_t$ operates on $\boldsymbol{\theta}$ directly, not through the Adam update mechanism.

def adamw(
    loss_and_grad_fn,
    theta_init: np.ndarray,
    lr: float = 0.001,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
    weight_decay: float = 0.01,
    n_steps: int = 1000
) -> tuple[np.ndarray, list[float]]:
    """AdamW optimizer with decoupled weight decay."""
    theta = theta_init.copy()
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    history = []

    for t in range(1, n_steps + 1):
        loss, grad = loss_and_grad_fn(theta)
        history.append(loss)

        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2

        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)

        # Decoupled weight decay: applied to theta, not through adaptive rate
        theta *= (1 - lr * weight_decay)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)

    return theta, history

Production Reality: AdamW is the default optimizer for training large language models. The GPT-3 paper (Brown et al., 2020) used AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and cosine learning rate decay. Chinchilla (Hoffmann et al., 2022) used $\beta_1 = 0.9$, $\beta_2 = 0.95$, cosine decay to 10% of peak LR. These are not arbitrary choices — they are the product of extensive hyperparameter sweeps costing millions of dollars.

2.6.9 The Family Tree: A Summary

Gradient Descent (full batch, single LR)
│
├── SGD (mini-batch: noisy but cheap)
│   │
│   ├── SGD + Momentum (accumulate velocity, smooth oscillations)
│   │   │
│   │   └── Nesterov Momentum (look ahead before stepping)
│   │
│   ├── AdaGrad (per-parameter LR, monotonically decreasing)
│   │   │
│   │   └── RMSProp (exponentially decaying average of squared gradients)
│   │       │
│   │       └── Adam (RMSProp + Momentum + bias correction)
│   │           │
│   │           └── AdamW (decoupled weight decay)
│   │
│   └── [Other branches: Adadelta, Nadam, RAdam, LAMB, ...]

Each branch addresses a specific failure mode:

  • SGD: Full gradient is too expensive → use mini-batches
  • Momentum: Oscillation in narrow valleys → accumulate direction
  • Nesterov: Overshooting → look ahead before committing
  • AdaGrad: Same LR for all parameters → adapt per-parameter
  • RMSProp: AdaGrad decays to zero → use recent history
  • Adam: Why choose between momentum and adaptive rates? → combine both
  • AdamW: L2 regularization interacts badly with Adam → decouple


2.7 Learning Rate Schedules

The learning rate $\eta$ is arguably the single most important hyperparameter in deep learning. A schedule that varies $\eta$ over training can significantly improve convergence and final performance.

2.7.1 Common Schedules

Step decay: Reduce $\eta$ by a factor (e.g., $\times 0.1$) every $k$ epochs. Simple and widely used.

$$ \eta_t = \eta_0 \cdot \gamma^{\lfloor t / k \rfloor} $$

Exponential decay: Smooth version of step decay.

$$ \eta_t = \eta_0 \cdot e^{-\lambda t} $$
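
Both decay rules translate directly into code (a small sketch; the helper names `step_decay` and `exponential_decay` are mine):

```python
import numpy as np

def step_decay(step: int, lr0: float = 0.1, gamma: float = 0.1, k: int = 30) -> float:
    """Step decay: eta_t = eta_0 * gamma^floor(t / k)."""
    return lr0 * gamma ** (step // k)

def exponential_decay(step: int, lr0: float = 0.1, lam: float = 0.01) -> float:
    """Exponential decay: eta_t = eta_0 * exp(-lambda * t)."""
    return lr0 * float(np.exp(-lam * step))
```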

Cosine annealing (Loshchilov & Hutter, 2017): Smoothly anneal from $\eta_{\max}$ to $\eta_{\min}$:

$$ \eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right) $$

Linear warmup + cosine decay: Start with a very small LR, linearly increase to $\eta_{\max}$ over $T_w$ warmup steps, then cosine-anneal to $\eta_{\min}$. This is the standard schedule for transformer training.

def warmup_cosine_schedule(
    step: int,
    total_steps: int,
    warmup_steps: int,
    lr_max: float = 1e-3,
    lr_min: float = 1e-5
) -> float:
    """
    Linear warmup followed by cosine decay.

    Standard schedule for transformer training.
    """
    if step < warmup_steps:
        # Linear warmup
        return lr_max * step / warmup_steps
    else:
        # Cosine decay
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * progress))

Common Misconception: "Learning rate warmup is needed because gradients are large at initialization." The actual reason is more subtle. At initialization, the Adam second-moment estimates $\hat{\mathbf{v}}_t$ are based on very few samples and are unreliable. Large updates based on unreliable variance estimates can push parameters into a region from which recovery is difficult. Warmup gives Adam time to build accurate second-moment estimates before committing to large steps (Liu et al., 2020, "On the Variance of the Adaptive Learning Rate and Beyond" — RAdam).


2.8 Second-Order Methods

2.8.1 Newton's Method

While gradient descent uses only first-order information (the gradient), Newton's method uses second-order information (the Hessian) to find the optimal step size and direction:

$$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \mathbf{H}^{-1}(\boldsymbol{\theta}_t) \nabla \mathcal{L}(\boldsymbol{\theta}_t) $$

For a quadratic function, Newton's method converges in a single step (the Hessian is constant, and inverting it perfectly "undoes" the curvature). For general smooth functions, Newton's method has quadratic convergence near the optimum: the error is roughly squared at each iteration, so the number of correct digits doubles per step.

The problem: Computing and inverting the Hessian costs $O(n^2)$ storage and $O(n^3)$ computation per step, where $n$ is the number of parameters. For modern neural networks with $n \sim 10^8$ to $10^{11}$ parameters, this is utterly impractical.

def newtons_method(
    loss_fn,
    grad_fn,
    hessian_fn,
    theta_init: np.ndarray,
    n_steps: int = 20,
    damping: float = 1e-4
) -> tuple[np.ndarray, list[float]]:
    """
    Newton's method with Levenberg-Marquardt damping.
    Only practical for small-scale problems.

    Args:
        damping: Added to diagonal of Hessian for numerical stability
    """
    theta = theta_init.copy()
    history = []

    for t in range(n_steps):
        loss = loss_fn(theta)
        history.append(loss)
        grad = grad_fn(theta)
        H = hessian_fn(theta) + damping * np.eye(len(theta))
        step = np.linalg.solve(H, grad)
        theta -= step

    return theta, history
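
The single-step claim for quadratics is easy to verify directly: an undamped Newton step lands exactly on the minimizer (a quick check, with an arbitrary example matrix):

```python
import numpy as np

# f(theta) = 0.5 theta^T A theta - b^T theta has gradient A theta - b
# and constant Hessian A, so one undamped Newton step solves it exactly.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # symmetric positive definite
b = np.array([1.0, -1.0])

theta = np.array([5.0, -7.0])       # arbitrary starting point
grad = A @ theta - b
theta = theta - np.linalg.solve(A, grad)   # Newton step: -H^{-1} grad

# theta now equals the exact minimizer A^{-1} b
print(np.allclose(theta, np.linalg.solve(A, b)))  # True
```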

2.8.2 L-BFGS: Approximating the Hessian

L-BFGS (Limited-memory BFGS; Nocedal, 1980) approximates the inverse Hessian using information from the last $m$ gradient evaluations (typically $m = 5{-}20$). It stores $m$ pairs $(\mathbf{s}_k, \mathbf{y}_k)$ where $\mathbf{s}_k = \boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}_k$ and $\mathbf{y}_k = \nabla \mathcal{L}_{k+1} - \nabla \mathcal{L}_k$, and uses them to implicitly represent $\mathbf{H}^{-1}$.

Storage cost: $O(mn)$ instead of $O(n^2)$. Computation per step: $O(mn)$ instead of $O(n^3)$.

L-BFGS works well for:

  • Convex problems with moderate dimensionality
  • Fine-tuning the last layers of a neural network
  • Classical ML models (logistic regression, CRFs)

It struggles with:

  • Stochastic gradients (it assumes deterministic, exact gradients)
  • Non-smooth objectives
  • Very high-dimensional problems ($n > 10^7$)
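
In practice you rarely implement L-BFGS by hand; SciPy ships a mature implementation. A minimal usage sketch on the 2-D Rosenbrock function (assuming SciPy is available in your environment):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])          # classic Rosenbrock starting point
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
               options={"maxiter": 500})

print(res.success, np.round(res.x, 4))
```

`res.x` should land at the global minimum $(1, 1)$, and `res.nit` reports the iteration count — on the order of tens, far fewer steps than first-order methods need on this function.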

Advanced Sidebar: Recent work on stochastic second-order methods includes K-FAC (Kronecker-Factored Approximate Curvature; Martens & Grosse, 2015), which approximates the Fisher information matrix as a Kronecker product of smaller matrices, reducing per-step cost from $O(n^2)$ to $O(n)$. Shampoo (Gupta, Koren, & Singer, 2018) extends this to general preconditioners. These methods are increasingly used in large-scale training at Google DeepMind, but adoption outside large labs remains limited.

2.8.3 Why Second-Order Methods Illuminate First-Order Methods

Even if you never use Newton's method in practice, understanding it reveals why Adam works:

  1. Newton's method computes the ideal update: $\Delta \boldsymbol{\theta} = -\mathbf{H}^{-1} \nabla \mathcal{L}$
  2. Adam approximates this with a diagonal approximation: $\Delta \theta_j \approx -\eta \, \frac{m_j}{\sqrt{v_j}}$, where $\sqrt{v_j} \approx |H_{jj}|^{1/2}$
  3. The adaptive learning rate in Adam is a crude diagonal approximation of the inverse Hessian

This connection explains why Adam works well on ill-conditioned problems (it implicitly rescales by curvature) and why it can fail when the diagonal approximation is poor (the Hessian has significant off-diagonal terms).
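
The connection can be made concrete on a small ill-conditioned quadratic: gradient descent's single learning rate is capped by the largest curvature, forcing a crawl along the flat axis, while Adam's per-parameter scaling treats both axes comparably. A rough numerical sketch (the setup and hyperparameters are mine, chosen for illustration only):

```python
import numpy as np

# Ill-conditioned quadratic: curvatures 1 and 100 along the two axes,
# so the Hessian is diag(1, 100) and the condition number is 100.
H = np.array([1.0, 100.0])
loss = lambda th: 0.5 * float(np.sum(H * th**2))

def run_gd(theta, lr, n):
    for _ in range(n):
        theta = theta - lr * H * theta          # gradient is H * theta
    return theta

def run_adam(theta, lr, n, b1=0.9, b2=0.999, eps=1e-8):
    m, v = np.zeros_like(theta), np.zeros_like(theta)
    for t in range(1, n + 1):
        g = H * theta
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        theta = theta - lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
    return theta

theta0 = np.array([1.0, 1.0])
# GD's lr must stay below 2/100 to avoid divergence on the stiff axis,
# which forces tiny steps on the flat axis; Adam rescales per parameter.
print(loss(run_gd(theta0, lr=0.01, n=200)), loss(run_adam(theta0, lr=0.05, n=200)))
```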

2.8.4 KKT Conditions for Constrained Optimization

Many ML problems involve constraints: fairness constraints on Meridian Financial's credit model ("the approval rate for group A must be within 10% of group B's"), capacity constraints on StreamRec's recommendation system ("each content creator must receive at least $k$ impressions per day"), or simplex constraints that keep probability distributions valid.

The Karush-Kuhn-Tucker (KKT) conditions generalize the gradient-equals-zero condition to constrained optimization. For the problem:

$$ \min_{\boldsymbol{\theta}} f(\boldsymbol{\theta}) \quad \text{s.t.} \quad g_i(\boldsymbol{\theta}) \leq 0, \; i = 1, \ldots, m $$

the KKT conditions at an optimal $\boldsymbol{\theta}^*$ are:

  1. Stationarity: $\nabla f(\boldsymbol{\theta}^*) + \sum_i \mu_i \nabla g_i(\boldsymbol{\theta}^*) = \mathbf{0}$
  2. Primal feasibility: $g_i(\boldsymbol{\theta}^*) \leq 0$ for all $i$
  3. Dual feasibility: $\mu_i \geq 0$ for all $i$
  4. Complementary slackness: $\mu_i g_i(\boldsymbol{\theta}^*) = 0$ for all $i$

The multipliers $\mu_i$ are called Lagrange multipliers or dual variables. At the optimum, $\mu_i$ represents the sensitivity of the objective to the $i$-th constraint: how much the optimal loss would decrease if the constraint were relaxed by one unit.

Production Reality: In practice, constrained optimization in ML is often handled by (a) converting constraints to penalty terms in the loss (soft constraints), (b) projection methods (projecting back to the feasible set after each unconstrained step), or (c) Lagrangian relaxation with alternating updates. The KKT conditions are the theoretical foundation for all three approaches.
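
Approach (b), projection, is the easiest to sketch. For a box constraint $\mathbf{l} \leq \boldsymbol{\theta} \leq \mathbf{u}$, the projection onto the feasible set is an element-wise clip (a minimal illustration; the helper is hypothetical):

```python
import numpy as np

def projected_gd(grad_fn, theta0, lower, upper, lr=0.1, n_steps=100):
    """Projected gradient descent for a box constraint lower <= theta <= upper."""
    theta = theta0.copy()
    for _ in range(n_steps):
        theta = theta - lr * grad_fn(theta)       # unconstrained step
        theta = np.clip(theta, lower, upper)      # project onto the feasible set
    return theta

# Minimize ||theta - target||^2 where the target lies outside the box [0, 1]^2.
target = np.array([2.0, -0.5])
grad_fn = lambda th: 2.0 * (th - target)
theta_star = projected_gd(grad_fn, np.array([0.5, 0.5]), 0.0, 1.0)
print(theta_star)  # converges to the closest feasible point, [1.0, 0.0]
```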


2.9 Loss Landscape Visualization

Understanding the loss landscape geometrically provides intuition that equations alone cannot. We use two techniques: 2D contour plots and the method of Li et al. (2018) for neural network loss surfaces.

2.9.1 Contour Plots for Low-Dimensional Problems

For the credit scoring problem (logistic regression with 2 features for visualization), we can plot the loss as a function of $\theta_1$ and $\theta_2$ and overlay the optimizer trajectories.

import matplotlib.pyplot as plt

def plot_optimization_trajectories(
    loss_fn,
    trajectories: dict[str, list[np.ndarray]],
    theta_range: tuple[float, float] = (-3.0, 3.0),
    n_grid: int = 200,
    title: str = "Optimizer Trajectories on Loss Landscape"
) -> None:
    """
    Plot 2D loss contours with overlaid optimizer trajectories.

    Args:
        loss_fn: Loss function f(theta) where theta is 2D
        trajectories: Dict mapping optimizer name to list of parameter vectors
        theta_range: Range for both axes
        n_grid: Grid resolution
        title: Plot title
    """
    t1 = np.linspace(theta_range[0], theta_range[1], n_grid)
    t2 = np.linspace(theta_range[0], theta_range[1], n_grid)
    T1, T2 = np.meshgrid(t1, t2)
    Z = np.zeros_like(T1)
    for i in range(n_grid):
        for j in range(n_grid):
            Z[i, j] = loss_fn(np.array([T1[i, j], T2[i, j]]))

    fig, ax = plt.subplots(1, 1, figsize=(10, 8))
    ax.contour(T1, T2, Z, levels=30, cmap="viridis", alpha=0.6)
    ax.contourf(T1, T2, Z, levels=30, cmap="viridis", alpha=0.3)

    colors = {"SGD": "red", "Momentum": "blue", "Adam": "green", "AdamW": "orange"}
    for name, traj in trajectories.items():
        pts = np.array(traj)
        ax.plot(pts[:, 0], pts[:, 1], "-o", markersize=2,
                label=name, color=colors.get(name, "black"), linewidth=1.5)
        ax.plot(pts[0, 0], pts[0, 1], "s", color=colors.get(name, "black"),
                markersize=8)  # start

    ax.set_xlabel(r"$\theta_1$", fontsize=14)
    ax.set_ylabel(r"$\theta_2$", fontsize=14)
    ax.set_title(title, fontsize=16)
    ax.legend(fontsize=12)
    plt.tight_layout()
    plt.savefig("optimizer_trajectories.png", dpi=150)
    plt.show()

2.9.2 The Loss Surface of Matrix Factorization

The StreamRec matrix factorization problem from Chapter 1 has a non-convex loss surface. Consider factoring a $3 \times 3$ matrix into rank-1 factors $\mathbf{u} \in \mathbb{R}^3$ and $\mathbf{v} \in \mathbb{R}^3$ by minimizing $\|\mathbf{R} - \mathbf{u}\mathbf{v}^T\|_F^2$.

This loss has a fundamental symmetry: $(\mathbf{u}, \mathbf{v})$ and $(\alpha \mathbf{u}, \mathbf{v}/\alpha)$ give the same loss for any $\alpha \neq 0$. This creates "ravines" in the loss landscape — continuous curves of equal loss. Along such a ravine the loss is flat: the gradient has zero component in that direction, and the Hessian is singular there. Near the valley floor the gradient is tiny and the problem is badly conditioned, which makes optimization slow.
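
The rescaling symmetry is easy to verify numerically (a quick sanity check, not part of the StreamRec code):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(3, 3))
u = rng.normal(size=3)
v = rng.normal(size=3)

loss = lambda u, v: float(np.sum((R - np.outer(u, v))**2))

# (u, v) and (alpha*u, v/alpha) produce the same outer product, hence the same loss.
for alpha in [0.1, 2.0, -5.0]:
    print(np.isclose(loss(u, v), loss(alpha * u, v / alpha)))  # True each time
```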

def matrix_factorization_loss(
    params: np.ndarray,
    R: np.ndarray,
    rank: int,
    reg: float = 0.01
) -> tuple[float, np.ndarray]:
    """
    Compute loss and gradient for matrix factorization: ||R - UV^T||_F^2.

    Args:
        params: Flattened [U, V] parameters
        R: Target matrix, shape (m, n)
        rank: Factorization rank
        reg: L2 regularization strength

    Returns:
        Tuple of (loss, gradient)
    """
    m, n = R.shape
    U = params[:m * rank].reshape(m, rank)
    V = params[m * rank:].reshape(n, rank)

    residual = R - U @ V.T                        # (m, n)
    loss = 0.5 * np.sum(residual**2) + 0.5 * reg * (
        np.sum(U**2) + np.sum(V**2)
    )

    grad_U = -residual @ V + reg * U               # (m, rank)
    grad_V = -residual.T @ U + reg * V             # (n, rank)

    grad = np.concatenate([grad_U.ravel(), grad_V.ravel()])
    return loss, grad

Research Insight: Bhojanapalli, Neyshabur, and Srebro (2016) proved that for matrix factorization with overparameterization (rank of factors $\geq$ rank of target), all local minima are global minima, and all saddle points are strict (they have at least one direction of negative curvature). This means that gradient descent with noise (SGD) will almost surely escape saddle points and converge to a global minimum. The non-convex landscape is not as hostile as it first appears — but the path to the minimum can be tortuous.


2.10 Saddle Points and the High-Dimensional Landscape

2.10.1 Saddle Points in High Dimensions

A saddle point is a critical point ($\nabla f = \mathbf{0}$) where the Hessian has both positive and negative eigenvalues. At a saddle point, the loss decreases in some directions and increases in others.

In high-dimensional optimization, saddle points dominate the landscape. For a random function in $d$ dimensions, a critical point is a saddle point with probability approaching 1 as $d$ grows (as noted in Section 2.5.5). This means gradient descent spends most of its time near saddle points, not local minima.

Escaping saddle points. First-order methods can be very slow near saddle points because the gradient is small. Adaptive methods (Adam, RMSProp) escape faster because they normalize by the gradient magnitude: small gradients produce steps of roughly constant size rather than vanishingly small ones. Adding noise (as in SGD) also helps: random perturbations push the iterate off the saddle point along directions of negative curvature, after which the loss decreases rapidly.

def rosenbrock(x: np.ndarray) -> tuple[float, np.ndarray]:
    """
    Rosenbrock function: a classic non-convex test function with
    a narrow, curved valley. The global minimum is at (1, 1, ..., 1).

    f(x) = sum_{i=1}^{n-1} [100(x_{i+1} - x_i^2)^2 + (1 - x_i)^2]
    """
    n = len(x)
    loss = 0.0
    grad = np.zeros(n)
    for i in range(n - 1):
        loss += 100 * (x[i + 1] - x[i]**2)**2 + (1 - x[i])**2
        grad[i] += -400 * x[i] * (x[i + 1] - x[i]**2) - 2 * (1 - x[i])
        grad[i + 1] += 200 * (x[i + 1] - x[i]**2)
    return loss, grad

2.10.2 Comparing Optimizers on the Rosenbrock Function

The Rosenbrock function provides a stringent test: its minimum lies at the bottom of a narrow, curved valley. Vanilla SGD oscillates wildly. Momentum helps follow the valley. Adam's adaptive rates handle the extreme difference in curvature between the valley floor and walls.

def compare_optimizers(
    loss_and_grad_fn,
    theta_init: np.ndarray,
    n_steps: int = 5000
) -> dict[str, list[float]]:
    """Compare SGD, Momentum, and Adam on the same problem."""
    results = {}

    _, results["SGD"] = gradient_descent(
        loss_and_grad_fn, theta_init, lr=0.0001, n_steps=n_steps
    )
    _, results["Momentum"] = sgd_momentum(
        loss_and_grad_fn, theta_init, lr=0.0001, beta=0.9, n_steps=n_steps
    )
    _, results["Adam"] = adam(
        loss_and_grad_fn, theta_init, lr=0.001, n_steps=n_steps
    )

    return results

# Example comparison
theta0 = np.array([-1.0, 1.0])
histories = compare_optimizers(rosenbrock, theta0, n_steps=5000)

plt.figure(figsize=(10, 6))
for name, hist in histories.items():
    plt.semilogy(hist, label=name)
plt.xlabel("Step", fontsize=14)
plt.ylabel("Loss (log scale)", fontsize=14)
plt.title("Optimizer Comparison on Rosenbrock Function", fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("optimizer_comparison_rosenbrock.png", dpi=150)
plt.show()

2.11 Progressive Project: Optimizing Matrix Factorization for StreamRec

This section extends the matrix factorization problem from Chapter 1's progressive project (M0). You will implement three optimizers from scratch and apply them to the StreamRec recommendation problem.

2.11.1 Task Specification

Objective: Minimize the regularized matrix factorization loss:

$$ \mathcal{L}(\mathbf{U}, \mathbf{V}) = \frac{1}{2|\Omega|} \sum_{(i,j) \in \Omega} (R_{ij} - \mathbf{u}_i^T \mathbf{v}_j)^2 + \frac{\lambda}{2}\left(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2\right) $$

where $\Omega$ is the set of observed entries.

Implementation:

def create_synthetic_streamrec_data(
    n_users: int = 500,
    n_items: int = 200,
    true_rank: int = 5,
    observation_rate: float = 0.1,
    noise_std: float = 0.5,
    seed: int = 42
) -> tuple[np.ndarray, np.ndarray]:
    """
    Generate synthetic StreamRec user-item interaction data.

    The true rating matrix is low-rank (rank=5), observed with noise
    at a 10% observation rate.

    Returns:
        R_observed: Observed rating matrix with NaN for missing entries
        R_true: Ground truth full rating matrix
    """
    rng = np.random.default_rng(seed)

    # True low-rank factors
    U_true = rng.normal(0, 1, (n_users, true_rank))
    V_true = rng.normal(0, 1, (n_items, true_rank))
    R_true = U_true @ V_true.T

    # Observe a subset with noise
    mask = rng.random((n_users, n_items)) < observation_rate
    R_observed = np.full((n_users, n_items), np.nan)
    R_observed[mask] = R_true[mask] + rng.normal(0, noise_std, mask.sum())

    return R_observed, R_true


def mf_loss_and_grad(
    params: np.ndarray,
    R_observed: np.ndarray,
    rank: int,
    reg: float = 0.01
) -> tuple[float, np.ndarray]:
    """
    Matrix factorization loss and gradient over observed entries only.
    """
    m, n = R_observed.shape
    U = params[:m * rank].reshape(m, rank)
    V = params[m * rank:].reshape(n, rank)

    mask = ~np.isnan(R_observed)
    R_pred = U @ V.T
    residual = np.where(mask, R_observed - R_pred, 0.0)
    n_observed = mask.sum()

    loss = 0.5 * np.sum(residual**2) / n_observed + 0.5 * reg * (
        np.sum(U**2) + np.sum(V**2)
    )

    grad_U = (-residual @ V) / n_observed + reg * U
    grad_V = (-residual.T @ U) / n_observed + reg * V

    grad = np.concatenate([grad_U.ravel(), grad_V.ravel()])
    return loss, grad


# Generate data
R_observed, R_true = create_synthetic_streamrec_data()

# Initialize parameters
n_users, n_items = R_observed.shape
rank = 10  # overparameterize relative to true rank=5
rng = np.random.default_rng(0)
theta_init = rng.normal(0, 0.1, n_users * rank + n_items * rank)

# Create loss function with fixed data
loss_fn = lambda theta: mf_loss_and_grad(theta, R_observed, rank, reg=0.01)

# Run all three optimizers
_, hist_sgd = gradient_descent(loss_fn, theta_init, lr=0.01, n_steps=2000)
_, hist_mom = sgd_momentum(loss_fn, theta_init, lr=0.01, beta=0.9, n_steps=2000)
_, hist_adam = adam(loss_fn, theta_init, lr=0.005, n_steps=2000)

# Plot convergence
plt.figure(figsize=(10, 6))
plt.semilogy(hist_sgd, label="SGD", alpha=0.8)
plt.semilogy(hist_mom, label="Momentum (β=0.9)", alpha=0.8)
plt.semilogy(hist_adam, label="Adam", alpha=0.8)
plt.xlabel("Step", fontsize=14)
plt.ylabel("Loss (log scale)", fontsize=14)
plt.title("StreamRec Matrix Factorization: Optimizer Comparison", fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("streamrec_optimizer_comparison.png", dpi=150)
plt.show()

2.11.2 Analysis Questions

After running the optimizers:

  1. Which optimizer reaches the lowest loss? Which converges fastest in terms of wall-clock time?
  2. Compute the reconstruction error on held-out entries (those in $R_{\text{true}}$ but not in $\Omega$). Does the optimizer that achieves the lowest training loss also achieve the lowest test error? Why or why not?
  3. Try rank $k = 3, 5, 10, 20$. How does the overparameterization ratio affect convergence speed and final loss?
  4. Visualize the loss landscape by projecting the trajectory onto the two principal components of the parameter differences between initialization and convergence. This is a simplified version of the Li et al. (2018) technique.
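
A minimal sketch of the projection in question 4, assuming you have collected the iterates during training (the helper name is mine):

```python
import numpy as np

def project_trajectory(trajectory: list[np.ndarray]) -> np.ndarray:
    """
    Project an optimization trajectory onto the top-2 principal components
    of the displacements from the final iterate. Returns (n_steps, 2).
    """
    T = np.stack(trajectory)            # (n_steps, n_params)
    D = T - T[-1]                       # displacements from the convergence point
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    return D @ Vt[:2].T                 # coordinates in the top-2 PC plane
```

Plotting the two returned columns against each other (optionally over loss contours evaluated in the same plane) recovers a 2-D view of the optimization path.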

2.12 Connecting to What Follows

This chapter has built the optimization toolkit that the rest of the book depends on. Every chapter from here forward uses gradient-based optimization:

  • Chapter 3 (Probability) derives maximum likelihood estimation as an optimization problem — the gradient of the log-likelihood is the score function.
  • Chapter 6 (Neural Networks from Scratch) applies backpropagation to deeper architectures, where the vanishing/exploding gradient problem (a consequence of the chain rule) motivates architectural choices like skip connections and normalization.
  • Chapter 7 (Training Deep Networks) extends the optimizer family tree with learning rate schedules, gradient clipping, and distributed training — all of which require the foundations from this chapter.
  • Chapter 10 (Transformers) reveals why Adam is the default optimizer for attention-based models and why the learning rate schedule matters so much.
  • Chapter 26 (Training at Scale) addresses distributed optimization, where the interaction between mini-batch size, learning rate, and momentum becomes critical.

The gradient is the lens through which a model sees its own errors. Understanding gradients — how they flow, how they vanish, how they are accumulated and adapted — is understanding the mechanism by which neural networks learn.

Fundamentals > Frontier. Adam was published in 2015. AdamW in 2019. The learning rate schedules used to train GPT-4 in 2023 are, at their core, the same mathematics we derived in this chapter. The frontier moves fast, but the optimization foundations move slow. Master them once, and every new training configuration becomes readable.


Chapter Summary

Gradient descent is a family of algorithms, not a single algorithm. The family tree grows by addressing specific failure modes:

  1. The gradient is the vector of partial derivatives — it points toward steepest ascent. We descend by stepping in the opposite direction.
  2. The chain rule on computational graphs gives us backpropagation — the efficient algorithm for computing gradients through deep compositions of functions. Reverse mode AD computes all gradients in a single backward pass.
  3. Convexity guarantees that every local minimum is global. Most deep learning problems are non-convex, but convex theory gives us the vocabulary (condition number, saddle points, strong convexity) to reason about what we observe.
  4. Momentum smooths oscillations by accumulating velocity. Nesterov momentum looks ahead before committing.
  5. Adaptive methods (AdaGrad, RMSProp, Adam) assign per-parameter learning rates, implicitly approximating diagonal second-order information.
  6. AdamW decouples weight decay from the adaptive update — the default for transformer training.
  7. Second-order methods (Newton, L-BFGS) use curvature information for optimal steps but are impractical at the scale of modern deep learning. Understanding them illuminates why Adam works.
  8. The learning rate schedule — particularly warmup + cosine decay — can matter as much as the choice of optimizer.

The next chapter shifts from "how do we find the best parameters?" to "what are we optimizing, and why?" — probability theory and statistical inference provide the principled foundation for choosing loss functions in the first place.