Learning Objectives

  • Compute derivatives, partial derivatives, and gradients for multivariate functions
  • Apply the chain rule to composite functions and understand its role in backpropagation
  • Characterize optimization landscapes including minima, maxima, and saddle points
  • Implement gradient descent and its variants from scratch
  • Explain momentum, RMSProp, and Adam optimizers and their advantages
  • Distinguish forward-mode and reverse-mode automatic differentiation
  • Build and traverse computational graphs for gradient computation

"The calculus was the first achievement of modern mathematics, and it is difficult to overestimate its importance. I think it defines more unequivocally than anything else the inception of modern mathematics, and the system of mathematical analysis, which is its logical development, still constitutes the greatest technical advance in exact thinking." — John von Neumann

Chapter 3: Calculus, Optimization, and Automatic Differentiation

Every time a machine learning model improves its predictions, calculus is at work beneath the surface. When a language model learns to complete your sentences, when a recommendation system refines its suggestions, when an image classifier sharpens its accuracy — in each case, the model is following gradients downhill through a mathematical landscape, adjusting its parameters one small step at a time. Calculus provides the language for describing these changes, optimization provides the strategies for navigating toward better solutions, and automatic differentiation provides the computational machinery that makes it all tractable at scale.

In Chapter 1, we established that linear algebra gives us the vocabulary to describe data and transformations. In Chapter 2, we saw how probability and statistics allow us to reason about uncertainty and learn from observations. Now we turn to the third pillar of the mathematical foundation: calculus and optimization. If linear algebra describes the structure of our models, and probability describes the uncertainty in our data, then calculus describes how things change — and optimization tells us how to change them for the better.

This chapter is not a comprehensive calculus course. Instead, we focus on the specific concepts and techniques that appear most frequently in AI engineering. We begin with the fundamentals — derivatives, gradients, and the chain rule — building intuition before introducing formalism. We then explore the optimization landscape and develop the gradient descent algorithm and its modern variants. Finally, we examine automatic differentiation, the remarkable technique that allows frameworks like PyTorch and TensorFlow to compute gradients through arbitrarily complex programs.

By the end of this chapter, you will understand not just what these tools are, but why they work, when to use each variant, and how they are implemented in practice.


3.1 Derivatives and the Language of Change

3.1.1 The Derivative: Instantaneous Rate of Change

At its core, a derivative measures how a function's output changes as its input changes. If you are driving a car and your position changes over time, the derivative of position with respect to time is your velocity. If your model's loss changes as you adjust a weight, the derivative of loss with respect to that weight tells you how sensitive the loss is to that particular weight.

Intuition. Imagine standing on a hillside. The steepness of the slope beneath your feet — how quickly the elevation changes as you take a step — is the derivative of the elevation function at your current position. A steep slope means a large derivative; flat ground means a derivative near zero.

Definition. For a function f(x), the derivative at a point x is defined as:

$$ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} $$

where:

  • f(x) is the function value at point x
  • h is a small perturbation to the input
  • The limit captures the instantaneous rate of change as the perturbation shrinks to zero

Worked Example. Consider f(x) = x^2. Applying the definition:

$$ f'(x) = \lim_{h \to 0} \frac{(x + h)^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} (2x + h) = 2x $$

At x = 3, the derivative is f'(3) = 6, meaning a small increase in x near 3 produces approximately six times that increase in f(x).

3.1.2 Numerical Differentiation

Before diving deeper into analytical derivatives, it is worth noting that we can approximate derivatives numerically. This is invaluable for testing and debugging gradient computations.

The forward difference approximation uses:

$$ f'(x) \approx \frac{f(x + h) - f(x)}{h} $$

A better approximation is the central difference:

$$ f'(x) \approx \frac{f(x + h) - f(x - h)}{2h} $$

where:

  • h is a small step size, typically around 10^(-5) to 10^(-7)
  • The central difference has error proportional to h^2 rather than h, making it significantly more accurate

import numpy as np


def numerical_derivative(f, x: float, h: float = 1e-5) -> float:
    """Compute the numerical derivative using central differences.

    Args:
        f: A scalar-valued function of one variable.
        x: The point at which to evaluate the derivative.
        h: Step size for the finite difference.

    Returns:
        Approximate derivative of f at x.
    """
    return (f(x + h) - f(x - h)) / (2 * h)


# Example: derivative of x^2 at x = 3
f = lambda x: x ** 2
print(f"Numerical: {numerical_derivative(f, 3.0):.6f}")  # ≈ 6.000000
print(f"Analytical: {2 * 3.0:.6f}")                       # = 6.000000

3.1.3 Common Derivative Rules

In practice, we rarely compute derivatives from the limit definition. Instead, we use a handful of rules that cover the vast majority of functions encountered in machine learning:

| Rule | Function | Derivative |
| --- | --- | --- |
| Power rule | x^n | n x^(n-1) |
| Exponential | e^x | e^x |
| Logarithm | ln(x) | 1/x |
| Sum rule | f(x) + g(x) | f'(x) + g'(x) |
| Product rule | f(x) g(x) | f'(x) g(x) + f(x) g'(x) |
| Quotient rule | f(x)/g(x) | [f'(x) g(x) - f(x) g'(x)] / [g(x)]^2 |

These rules are the building blocks. The chain rule, which we discuss in Section 3.1.5, allows us to combine them to differentiate arbitrarily complex compositions.
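
A quick numerical check helps confirm these rules. The sketch below verifies the product rule for the illustrative choice f(x) = x^2 and g(x) = e^x at an arbitrary point, redefining the central-difference helper from Section 3.1.2 so the snippet is self-contained.

import numpy as np


def numerical_derivative(f, x: float, h: float = 1e-5) -> float:
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Product rule check: d/dx [x^2 * e^x] should equal 2x e^x + x^2 e^x
x0 = 1.5
product = lambda x: x ** 2 * np.exp(x)
analytical = 2 * x0 * np.exp(x0) + x0 ** 2 * np.exp(x0)
print(f"Numerical:  {numerical_derivative(product, x0):.6f}")
print(f"Analytical: {analytical:.6f}")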

3.1.4 Partial Derivatives and Gradients

Machine learning models almost never depend on a single variable. A neural network may have millions of parameters, and the loss depends on all of them simultaneously. To handle this, we extend the derivative to functions of multiple variables.

Intuition. Imagine the surface of a landscape that depends on two coordinates, x and y. The partial derivative with respect to x tells you how the elevation changes if you walk purely in the east-west direction, holding your north-south position fixed. The partial derivative with respect to y tells you how the elevation changes walking purely north-south.

Definition. For a function f(x_1, x_2, ..., x_n), the partial derivative with respect to x_i is:

$$ \frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h} $$

where all variables except x_i are held constant.

The Gradient. The gradient collects all partial derivatives into a single vector:

$$ \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} $$

where:

  • x = [x_1, x_2, ..., x_n]^T is the vector of inputs
  • Each component of the gradient measures the sensitivity of f to the corresponding input
  • The gradient points in the direction of steepest ascent
  • The magnitude of the gradient indicates how steep that ascent is

Worked Example. Let f(x, y) = x^2 + 3xy + y^2.

The partial derivatives are:

$$ \frac{\partial f}{\partial x} = 2x + 3y, \quad \frac{\partial f}{\partial y} = 3x + 2y $$

At the point (1, 2):

$$ \nabla f(1, 2) = \begin{bmatrix} 2(1) + 3(2) \\ 3(1) + 2(2) \end{bmatrix} = \begin{bmatrix} 8 \\ 7 \end{bmatrix} $$

The gradient tells us that at (1, 2), the function increases most rapidly in the direction [8, 7]^T.

import numpy as np


def compute_gradient(f, x: np.ndarray, h: float = 1e-5) -> np.ndarray:
    """Compute the gradient of f at x using central differences.

    Args:
        f: A scalar-valued function that takes a NumPy array.
        x: The point at which to evaluate the gradient.
        h: Step size for the finite difference.

    Returns:
        Gradient vector as a NumPy array.
    """
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[i] += h
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad


# Example
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1] + x[1] ** 2
point = np.array([1.0, 2.0])
print(f"Numerical gradient: {compute_gradient(f, point)}")   # ≈ [8, 7]
print(f"Analytical gradient: [{2*1 + 3*2}, {3*1 + 2*2}]")   # = [8, 7]

3.1.5 The Chain Rule: Backbone of Backpropagation

The chain rule is arguably the single most important calculus concept for deep learning. It tells us how to compute the derivative of a composite function — a function built by plugging one function into another.

Intuition. Suppose temperature affects ice cream sales, and ice cream sales affect a shop's revenue. The chain rule tells us how to calculate the effect of temperature on revenue by multiplying the two individual effects together: the effect of temperature on sales, times the effect of sales on revenue.

Single-Variable Chain Rule. If y = f(g(x)), then:

$$ \frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} = f'(g(x)) \cdot g'(x) $$

Multivariable Chain Rule. If z = f(x, y) where x = x(t) and y = y(t), then:

$$ \frac{dz}{dt} = \frac{\partial f}{\partial x} \cdot \frac{dx}{dt} + \frac{\partial f}{\partial y} \cdot \frac{dy}{dt} $$

Worked Example (Deep Learning Context). Consider a simple two-layer computation:

$$ z = w_2 \cdot \sigma(w_1 \cdot x + b_1) + b_2 $$

where sigma is the sigmoid activation sigma(a) = 1 / (1 + e^(-a)). To compute dz/dw_1, we apply the chain rule:

Let a = w_1 x + b_1 and h = sigma(a). Then z = w_2 h + b_2.

$$ \frac{dz}{dw_1} = \frac{dz}{dh} \cdot \frac{dh}{da} \cdot \frac{da}{dw_1} = w_2 \cdot \sigma(a)(1 - \sigma(a)) \cdot x $$

This chain of multiplications is exactly what backpropagation computes, layer by layer, from the output back to the input. We will explore backpropagation in detail in Chapter 6 when we discuss neural network training.
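
To make this concrete in code, the short sketch below evaluates dz/dw_1 for the two-layer expression above via the chain rule and checks it against a finite-difference estimate. The particular values of w_1, b_1, w_2, b_2, and x are arbitrary choices for illustration.

import numpy as np


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + np.exp(-a))


# Arbitrary example values
w1, b1, w2, b2, x = 0.5, 0.1, -1.2, 0.3, 2.0


def forward(w1_val: float) -> float:
    """z = w2 * sigma(w1 * x + b1) + b2, viewed as a function of w1."""
    return w2 * sigmoid(w1_val * x + b1) + b2


# Chain rule: dz/dw1 = w2 * sigma(a) * (1 - sigma(a)) * x
a = w1 * x + b1
chain_rule_grad = w2 * sigmoid(a) * (1 - sigmoid(a)) * x

# Finite-difference check
h = 1e-6
numerical_grad = (forward(w1 + h) - forward(w1 - h)) / (2 * h)

print(f"Chain rule: {chain_rule_grad:.8f}")
print(f"Numerical:  {numerical_grad:.8f}")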

3.1.6 The Jacobian and Hessian

Two matrix-valued generalizations of the derivative arise frequently in optimization and machine learning.

The Jacobian Matrix. When a function maps a vector to a vector, f: R^n -> R^m, the Jacobian is the m x n matrix of all partial derivatives:

$$ \mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} $$

The Jacobian generalizes the gradient: if the output is a scalar (m = 1), the Jacobian reduces to the gradient (as a row vector).

The Hessian Matrix. For a scalar-valued function f: R^n -> R, the Hessian is the n x n matrix of second partial derivatives:

$$ \mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} $$

where:

  • Diagonal entries measure the curvature of f along each axis
  • Off-diagonal entries measure how the slope in one direction changes as you move in another direction
  • The Hessian is symmetric for functions with continuous second derivatives (by Schwarz's theorem)

The Hessian's eigenvalues reveal the local geometry of the function: all positive eigenvalues indicate a local minimum, all negative indicate a local maximum, and a mix of signs indicates a saddle point — a concept we will revisit in Section 3.2.
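
In low dimensions we can see this directly. The sketch below approximates the Hessian with central differences and inspects its eigenvalues; the function f(x, y) = x^2 - y^2 is an illustrative choice with a saddle point at the origin.

import numpy as np


def numerical_hessian(f, x: np.ndarray, h: float = 1e-4) -> np.ndarray:
    """Approximate the Hessian of f at x with central differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            def f_shifted(si: int, sj: int) -> float:
                xs = x.astype(float).copy()
                xs[i] += si * h
                xs[j] += sj * h
                return f(xs)

            H[i, j] = (
                f_shifted(1, 1) - f_shifted(1, -1)
                - f_shifted(-1, 1) + f_shifted(-1, -1)
            ) / (4 * h ** 2)
    return H


# f(x, y) = x^2 - y^2 has a critical point at the origin
f_saddle = lambda p: p[0] ** 2 - p[1] ** 2
H = numerical_hessian(f_saddle, np.array([0.0, 0.0]))
print(f"Eigenvalues: {np.linalg.eigvalsh(H).round(4)}")  # mixed signs -> saddle point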


3.2 Taylor Series and the Optimization Landscape

3.2.1 Taylor Series Approximations

Taylor series allow us to approximate a function locally using polynomials. This is not merely a mathematical curiosity; it is the theoretical foundation for why gradient descent works.

Intuition. A Taylor series says: "Near a point a, we can approximate any smooth function as a polynomial. The zeroth-order term tells us the function value, the first-order term captures the slope, the second-order term captures the curvature, and so on."

First-Order (Linear) Approximation:

$$ f(x) \approx f(a) + f'(a)(x - a) $$

Second-Order (Quadratic) Approximation:

$$ f(x) \approx f(a) + f'(a)(x - a) + \frac{1}{2}f''(a)(x - a)^2 $$

Multivariate Second-Order Approximation:

$$ f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^T (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^T \mathbf{H}(\mathbf{a}) (\mathbf{x} - \mathbf{a}) $$

where:

  • f(a) is the function value at the expansion point
  • nabla f(a) is the gradient at a
  • H(a) is the Hessian matrix at a
  • The linear term captures the slope; the quadratic term captures the curvature

The key insight for optimization: gradient descent uses the first-order approximation to decide which direction to move. Newton's method uses the second-order approximation for a more refined step. This explains both the simplicity of gradient descent and why second-order methods can converge faster (at the cost of computing the Hessian).
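
The following sketch makes the approximation tangible by comparing e^x against its first- and second-order Taylor expansions around a = 0; the function and expansion point are illustrative choices.

import numpy as np


def taylor_approximations(a: float, x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """First- and second-order Taylor expansions of e^x around a."""
    f_a = np.exp(a)  # for e^x, the value and all derivatives at a coincide
    first_order = f_a + f_a * (x - a)
    second_order = first_order + 0.5 * f_a * (x - a) ** 2
    return first_order, second_order


xs = np.linspace(-1.0, 1.0, 5)
first, second = taylor_approximations(0.0, xs)
for x, f1, f2 in zip(xs, first, second):
    print(f"x={x:+.2f}  exact={np.exp(x):.4f}  linear={f1:.4f}  quadratic={f2:.4f}")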

3.2.1b Second-Order Methods: Newton's Method and L-BFGS

While gradient descent uses only first-order information (the gradient), second-order methods also use curvature information (the Hessian) to take more informed steps. The trade-off is that these methods converge in fewer iterations but each iteration is much more expensive.

Newton's Method. Rather than stepping in the direction of steepest descent, Newton's method finds the minimum of the local quadratic approximation. Setting the gradient of the quadratic approximation to zero gives the update rule:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \mathbf{H}^{-1} \nabla L(\boldsymbol{\theta}_t)$$

where:

  • H is the Hessian matrix of the loss function at theta_t
  • H^(-1) nabla L is the Newton step, which accounts for curvature
  • No learning rate is needed in the pure Newton step (though in practice a step size is often used)

Intuition. Imagine you are trying to reach the bottom of a valley. Gradient descent always walks directly downhill, which can lead to zig-zagging across narrow valleys. Newton's method is like having a map that shows the valley's shape: it can walk diagonally toward the bottom in a single step if the landscape is well-approximated by a quadratic.

Worked Example. Consider minimizing f(x, y) = x^2 + 10y^2 starting at (10, 10). The gradient is [2x, 20y]^T and the Hessian is diag(2, 20). The Newton step from (10, 10) is:

$$\begin{bmatrix} x_{new} \\ y_{new} \end{bmatrix} = \begin{bmatrix} 10 \\ 10 \end{bmatrix} - \begin{bmatrix} 1/2 & 0 \\ 0 & 1/20 \end{bmatrix} \begin{bmatrix} 20 \\ 200 \end{bmatrix} = \begin{bmatrix} 10 - 10 \\ 10 - 10 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Newton's method reaches the optimum in a single step because the function is exactly quadratic. Gradient descent, by contrast, would take many iterations due to the different curvatures along x and y.
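
The sketch below reproduces this example numerically and contrasts Newton's single step with repeated gradient descent steps on the same quadratic; the learning rate and iteration count are illustrative.

import numpy as np

# f(x, y) = x^2 + 10 y^2
grad = lambda p: np.array([2 * p[0], 20 * p[1]])
hessian = np.array([[2.0, 0.0], [0.0, 20.0]])
start = np.array([10.0, 10.0])

# Newton's method: one step reaches the minimum of an exactly quadratic function
newton_step = start - np.linalg.solve(hessian, grad(start))
print(f"Newton, 1 step:              {newton_step}")

# Gradient descent with a fixed learning rate needs many iterations
theta = start.copy()
eta = 0.05  # must be below 2/20 = 0.1 to avoid divergence along y
for _ in range(100):
    theta = theta - eta * grad(theta)
print(f"Gradient descent, 100 steps: {theta.round(4)}")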

Why Newton's Method Is Impractical for Deep Learning. Computing the full Hessian requires O(n^2) storage and O(n^3) computation for n parameters. A ResNet-50 with 25 million parameters would require storing a 25M x 25M matrix (about 5 petabytes in float64), which is clearly infeasible. This is why first-order methods dominate deep learning.

L-BFGS. The Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm is a practical quasi-Newton method that approximates the inverse Hessian using only the most recent m gradient evaluations (typically m = 5 to 20). It provides second-order-like convergence without explicitly computing or storing the Hessian. L-BFGS is widely used for optimization problems with moderate numbers of parameters (thousands to millions) and is particularly effective for convex problems like logistic regression. However, it requires full-batch gradients, making it less suitable for the stochastic mini-batch setting of deep learning.

Convergence Comparison. For a strongly convex function with Lipschitz continuous Hessian, the convergence rates are:

  • Gradient descent: linear convergence, with the error shrinking geometrically as O(c^t) for some constant c < 1
  • Newton's method: quadratic convergence, O(c^(2^t)), where the number of correct digits roughly doubles each step
  • L-BFGS: superlinear convergence (faster than linear, slower than quadratic)

These convergence rates explain why second-order methods are so attractive for well-conditioned problems and why researchers continue to develop approximations that bring second-order benefits to deep learning, such as K-FAC (Kronecker-Factored Approximate Curvature) and Shampoo.

3.2.2 Understanding the Optimization Landscape

In machine learning, we define a loss function (also called an objective function or cost function) that measures how well our model performs. Training is the process of finding parameters that minimize this loss. The loss function over the space of all possible parameter values forms an optimization landscape.

Critical Points. A critical point is where the gradient is zero: nabla f(x) = 0. Critical points come in three flavors:

  1. Local minimum: The function value is lower than all nearby points. The Hessian is positive semi-definite (all eigenvalues >= 0).
  2. Local maximum: The function value is higher than all nearby points. The Hessian is negative semi-definite (all eigenvalues <= 0).
  3. Saddle point: The function curves up in some directions and down in others. The Hessian has both positive and negative eigenvalues.

Global vs. Local Minima. The global minimum is the lowest point across the entire landscape. Local minima are the lowest points in their neighborhood but may not be globally optimal. A major challenge in training neural networks is that the loss landscape is highly non-convex, containing many local minima and saddle points.

A Surprising Fact About High Dimensions. Research has shown that in high-dimensional optimization (typical for neural networks), saddle points are far more common than local minima. Intuitively, at a critical point in n dimensions, each dimension independently curves up or down. A local minimum requires all n dimensions to curve up, which becomes exponentially unlikely as n grows. This insight, discussed in the influential work by Dauphin et al. (2014), helps explain why gradient descent often works well in practice despite the non-convexity of neural network loss landscapes: most of the critical points an optimizer encounters are saddle points, which gradient-based methods (especially with stochastic noise) can escape, rather than poor local minima.

Loss Landscapes and Saddle Points in Practice. The geometry of neural network loss landscapes has profound practical implications. Li et al. (2018) developed visualization techniques that project the high-dimensional loss surface onto two dimensions, revealing that well-designed architectures (like ResNets with skip connections, which we will explore in Chapter 14) produce smoother, more navigable loss landscapes than architectures without skip connections. This visual intuition helps explain why certain architectural choices lead to easier training.

Saddle points are particularly problematic because the gradient is zero at a saddle point, just as it is at a minimum. Vanilla gradient descent slows to a crawl near saddle points because the gradients become vanishingly small. However, stochastic gradient descent (SGD) with mini-batches provides a natural escape mechanism: the gradient noise from sampling random mini-batches perturbs the parameters away from the saddle point. This is one of the reasons why SGD often outperforms full-batch gradient descent in practice, and it connects to the broader theme that noise can be beneficial in optimization — a topic we will revisit in Chapter 13 when we discuss regularization.

Flat Minima and Generalization. Not all local minima are created equal. Hochreiter and Schmidhuber (1997) conjectured, and later work by Keskar et al. (2017) confirmed, that flat minima — regions of the loss landscape where the loss changes slowly as parameters vary — tend to produce models that generalize better than sharp minima, where the loss changes rapidly. The intuition is that the test loss landscape is slightly different from the training loss landscape, and at a flat minimum, this shift has minimal effect on performance. This insight has practical implications: optimization techniques that find flatter minima (like SGD with large learning rates, or Stochastic Weight Averaging) tend to produce better-generalizing models, as we will discuss in Chapter 13.

3.2.3 Convexity

A function f is convex if the line segment between any two points on its graph lies above or on the graph:

$$ f(\lambda \mathbf{x} + (1 - \lambda)\mathbf{y}) \leq \lambda f(\mathbf{x}) + (1 - \lambda) f(\mathbf{y}) \quad \text{for all } \lambda \in [0, 1] $$

For convex functions, every local minimum is also the global minimum — there is no risk of getting trapped. Many classical machine learning problems (linear regression, logistic regression, SVMs) have convex loss functions. Neural network loss functions, however, are generally non-convex, which is why training them is both challenging and fascinating.

import numpy as np


def check_convexity_numerical(
    f, x: np.ndarray, y: np.ndarray, n_samples: int = 100
) -> bool:
    """Check convexity numerically along the line between x and y.

    Args:
        f: A scalar-valued function of a NumPy array.
        x: First point.
        y: Second point.
        n_samples: Number of interpolation points to check.

    Returns:
        True if the convexity condition holds for all sampled points.
    """
    lambdas = np.linspace(0, 1, n_samples)
    for lam in lambdas:
        interpolated = lam * x + (1 - lam) * y
        lhs = f(interpolated)
        rhs = lam * f(x) + (1 - lam) * f(y)
        if lhs > rhs + 1e-10:  # small tolerance for numerical error
            return False
    return True


# x^2 is convex
f_convex = lambda x: np.sum(x ** 2)
x, y = np.array([1.0, 2.0]), np.array([-1.0, 3.0])
print(f"x^2 convex: {check_convexity_numerical(f_convex, x, y)}")  # True

# sin(x) is not convex
f_nonconvex = lambda x: np.sum(np.sin(x))
print(f"sin(x) convex: {check_convexity_numerical(f_nonconvex, x, y)}")  # False

3.3 Gradient Descent: The Workhorse of Optimization

3.3.1 The Basic Idea

Gradient descent is beautifully simple: to minimize a function, repeatedly take small steps in the direction opposite to the gradient (the direction of steepest descent).

Intuition. You are blindfolded on a hilly landscape and want to reach the lowest point. At each step, you feel the slope beneath your feet and walk downhill. That is gradient descent.

Update Rule:

$$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla L(\boldsymbol{\theta}_t) $$

where:

  • theta_t is the parameter vector at step t
  • eta is the learning rate, controlling the step size
  • nabla L(theta_t) is the gradient of the loss function at the current parameters
  • The negative sign means we move opposite to the gradient (downhill)

The Learning Rate. The learning rate is perhaps the most important hyperparameter in deep learning. Too large, and the algorithm overshoots, bouncing around or diverging. Too small, and convergence is painfully slow. Finding the right learning rate is both art and science. As a rule of thumb, if the loss is oscillating wildly, reduce the learning rate; if it is decreasing too slowly, increase it. We will explore learning rate schedules in Chapter 8 when we discuss training strategies.
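
A tiny experiment makes this trade-off visible. The sketch below runs gradient descent on f(x) = x^2 (whose gradient is 2x) with three illustrative learning rates.

def gradient_descent_1d(eta: float, x0: float = 5.0, n_steps: int = 20) -> float:
    """Minimize f(x) = x^2 and return the final iterate."""
    x = x0
    for _ in range(n_steps):
        x -= eta * 2 * x  # gradient of x^2 is 2x
    return x


for eta in [0.01, 0.1, 1.1]:
    print(f"eta={eta:<5} final x = {gradient_descent_1d(eta):.4f}")
# eta=0.01 converges slowly, eta=0.1 converges quickly, eta=1.1 diverges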

3.3.2 Batch Gradient Descent

In batch gradient descent (also called full-batch or vanilla gradient descent), we compute the gradient using the entire training dataset:

$$ \nabla L(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \nabla \ell(\boldsymbol{\theta}; \mathbf{x}_i, y_i) $$

where:

  • N is the total number of training examples
  • ell(theta; x_i, y_i) is the loss for a single training example
  • The gradient is averaged over all examples

Advantages:

  • Stable convergence with a smooth loss curve
  • Guaranteed to converge to a local minimum for convex functions (with appropriate learning rate)

Disadvantages:

  • Computationally expensive for large datasets (must process all N examples before a single parameter update)
  • Cannot exploit redundancy in the data
  • More likely to get stuck in sharp local minima

3.3.3 Stochastic Gradient Descent (SGD)

Stochastic gradient descent uses a single randomly selected example to estimate the gradient at each step:

$$ \nabla L(\boldsymbol{\theta}) \approx \nabla \ell(\boldsymbol{\theta}; \mathbf{x}_i, y_i) $$

where (x_i, y_i) is a single randomly selected training example.

Intuition. Instead of surveying the entire landscape carefully before each step (batch), you glance at a single data point and immediately take a step. The individual steps are noisy, but on average, they point in the right direction.

Advantages:

  • Very fast updates (one example per step)
  • The noise in gradient estimates can help escape shallow local minima and saddle points
  • Can handle online learning (streaming data)

Disadvantages:

  • Very noisy convergence — the loss oscillates significantly
  • May never settle precisely at the minimum without learning rate decay
  • Cannot take advantage of vectorized operations efficiently

3.3.4 Mini-Batch Gradient Descent

Mini-batch gradient descent strikes a balance by using a small batch of B examples:

$$ \nabla L(\boldsymbol{\theta}) \approx \frac{1}{B} \sum_{i=1}^{B} \nabla \ell(\boldsymbol{\theta}; \mathbf{x}_i, y_i) $$

where:

  • B is the batch size, typically 32, 64, 128, or 256
  • Examples are randomly sampled (or shuffled and drawn sequentially)

This is the standard approach in practice. When practitioners say "SGD," they almost always mean mini-batch SGD. Common batch sizes are powers of 2 (for GPU memory alignment), with 32 being a popular default.

import numpy as np


def mini_batch_sgd(
    X: np.ndarray,
    y: np.ndarray,
    theta_init: np.ndarray,
    learning_rate: float = 0.01,
    batch_size: int = 32,
    n_epochs: int = 100,
    loss_fn=None,
    grad_fn=None,
) -> tuple[np.ndarray, list[float]]:
    """Perform mini-batch stochastic gradient descent.

    Args:
        X: Training features, shape (n_samples, n_features).
        y: Training targets, shape (n_samples,).
        theta_init: Initial parameter vector.
        learning_rate: Step size for each update.
        batch_size: Number of examples per mini-batch.
        n_epochs: Number of full passes through the dataset.
        loss_fn: Function(theta, X, y) -> scalar loss.
        grad_fn: Function(theta, X, y) -> gradient vector.

    Returns:
        Tuple of (optimized parameters, loss history).
    """
    theta = theta_init.copy()
    n_samples = X.shape[0]
    loss_history = []

    for epoch in range(n_epochs):
        # Shuffle the data at the start of each epoch
        indices = np.random.permutation(n_samples)
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        for start in range(0, n_samples, batch_size):
            end = min(start + batch_size, n_samples)
            X_batch = X_shuffled[start:end]
            y_batch = y_shuffled[start:end]

            gradient = grad_fn(theta, X_batch, y_batch)
            theta -= learning_rate * gradient

        # Record loss after each epoch
        loss_history.append(loss_fn(theta, X, y))

    return theta, loss_history

3.3.5 Convergence and Learning Rate Schedules

The convergence behavior of gradient descent depends critically on the learning rate and the properties of the loss function.

For a convex function with L-Lipschitz continuous gradients, batch gradient descent with learning rate eta = 1/L converges at a rate:

$$ f(\boldsymbol{\theta}_t) - f(\boldsymbol{\theta}^*) \leq \frac{\|\boldsymbol{\theta}_0 - \boldsymbol{\theta}^*\|^2}{2 \eta t} $$

This is O(1/t) convergence, meaning it takes roughly 1/epsilon steps to get within epsilon of the optimum.

In practice, learning rate schedules that reduce the learning rate over time often work better than a fixed learning rate. Let us examine each schedule with its formula and practical guidance:

Step Decay. Reduce the learning rate by a factor every k epochs:

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / k \rfloor}$$

where:

  • eta_0 is the initial learning rate
  • gamma is the decay factor (typically 0.1 or 0.5)
  • k is the step size in epochs (e.g., every 30 epochs)

This is the simplest schedule and was used in many early deep learning papers (e.g., ResNet). It works well when you have a good sense of how many epochs training will take.

Exponential Decay. A smoother version of step decay:

$$\eta_t = \eta_0 \cdot \gamma^t$$

where gamma is close to 1 (e.g., 0.995 per epoch). This produces a gradual, monotonic decrease that avoids the abrupt jumps of step decay.

Cosine Annealing. One of the most popular modern schedules:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$$

where:

  • eta_min is the minimum learning rate (often 0 or 1e-6)
  • eta_max is the maximum learning rate (the initial learning rate)
  • T is the total number of training steps
  • The learning rate follows a smooth cosine curve from eta_max down to eta_min

Cosine annealing has become the default schedule for many modern training pipelines, including transformer-based models. Its smooth decay avoids the sharp transitions of step decay while still spending significant time at lower learning rates.

Linear Warm-up with Decay. Start with a very small learning rate and linearly increase it over the first W steps, then apply a decay schedule:

$$\eta_t = \begin{cases} \eta_{\max} \cdot \frac{t}{W} & \text{if } t < W \\ \text{decay\_schedule}(t - W) & \text{if } t \geq W \end{cases}$$

Warm-up is particularly important for training transformers and when using large batch sizes. Without warm-up, the randomly initialized model can produce large, poorly-directed gradient updates that destabilize training. We will see this schedule used extensively when we study transformer architectures in Chapter 19.
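
These schedules are straightforward to implement as functions of the step index. The sketch below covers step decay, cosine annealing, and linear warm-up followed by cosine decay; the hyperparameter values are illustrative defaults rather than prescriptions.

import numpy as np


def step_decay(t: int, eta0: float = 0.1, gamma: float = 0.1, k: int = 30) -> float:
    """Multiply the learning rate by gamma every k epochs."""
    return eta0 * gamma ** (t // k)


def cosine_annealing(t: int, T: int, eta_max: float = 0.1, eta_min: float = 0.0) -> float:
    """Smooth cosine decay from eta_max to eta_min over T steps."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * t / T))


def warmup_cosine(t: int, T: int, warmup: int = 10, eta_max: float = 0.1) -> float:
    """Linear warm-up for the first `warmup` steps, then cosine decay."""
    if t < warmup:
        return eta_max * t / warmup
    return cosine_annealing(t - warmup, T - warmup, eta_max=eta_max)


for t in [0, 5, 10, 50, 90]:
    print(f"t={t:3d}  step={step_decay(t):.4f}  "
          f"cosine={cosine_annealing(t, 100):.4f}  warmup={warmup_cosine(t, 100):.4f}")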

Practical Debugging with Learning Rates. When training stalls or diverges, the learning rate is almost always the first thing to investigate. Here are common symptoms and their diagnoses:

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Loss oscillates wildly | Learning rate too high | Reduce by 2-10x |
| Loss decreases very slowly | Learning rate too low | Increase by 2-5x |
| Loss decreases then spikes | Learning rate too high late in training | Add decay schedule |
| Loss plateaus early | Learning rate too low or stuck in saddle point | Try warm restarts or increase LR temporarily |
| NaN in loss | Learning rate far too high, or numerical instability | Reduce LR drastically; check for log(0) or division by zero |

A practical technique called the learning rate finder (Smith, 2017) helps identify a good initial learning rate: train for a few hundred steps while exponentially increasing the learning rate from very small (1e-7) to very large (10), and plot the loss. The best learning rate is typically an order of magnitude below the point where the loss begins to explode.
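
A minimal sketch of this procedure, applied to a toy quadratic loss rather than a real model: the learning rate grows exponentially each step while the loss is recorded. In practice the same loop would run on your actual model and mini-batches.

import numpy as np


def lr_range_test(loss_fn, grad_fn, theta0: np.ndarray,
                  lr_min: float = 1e-7, lr_max: float = 10.0,
                  n_steps: int = 200) -> tuple[np.ndarray, list[float]]:
    """Sweep the learning rate exponentially and record the loss after each step."""
    lrs = np.geomspace(lr_min, lr_max, n_steps)
    theta = theta0.astype(float).copy()
    losses = []
    for lr in lrs:
        theta -= lr * grad_fn(theta)
        losses.append(loss_fn(theta))
    return lrs, losses


# Toy quadratic loss for illustration
lrs, losses = lr_range_test(
    loss_fn=lambda th: float(np.sum(th ** 2)),
    grad_fn=lambda th: 2 * th,
    theta0=np.array([5.0, -3.0]),
)
# Pick a learning rate roughly an order of magnitude below where the loss starts to explode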


3.4 Advanced Optimizers: Momentum and Adaptive Methods

Vanilla gradient descent has significant limitations: it oscillates in narrow valleys, progresses slowly along shallow dimensions, and treats all parameters identically. The optimizers in this section address these shortcomings and form the backbone of modern deep learning training.

3.4.1 Momentum

Intuition. Imagine a ball rolling downhill. It does not just follow the instantaneous slope; it accumulates velocity. If the slope consistently points in one direction, the ball speeds up. If the slope oscillates, the back-and-forth averaging dampens the oscillation. This is the idea behind momentum.

Update Rules:

$$ \mathbf{v}_{t+1} = \beta \mathbf{v}_t + \nabla L(\boldsymbol{\theta}_t) $$

$$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \mathbf{v}_{t+1} $$

where:

  • v_t is the velocity (accumulated gradient) at step t
  • beta is the momentum coefficient, typically 0.9
  • The velocity is a running average of past gradients, giving more weight to recent ones
  • A gradient that consistently points in the same direction builds up velocity in that direction
  • Oscillating gradients cancel out, damping the oscillation

Why Momentum Helps. Consider a loss surface shaped like a narrow valley — steep walls on the sides but a gentle slope along the bottom. Vanilla gradient descent oscillates back and forth across the valley while making slow progress along it. Momentum dampens the cross-valley oscillations (because they alternate in sign and cancel) while amplifying the along-valley motion (because it consistently points the same way).
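
A minimal sketch of momentum on the narrow-valley quadratic f(x, y) = x^2 + 10y^2; setting beta = 0 recovers vanilla gradient descent, and the learning rate and step count are illustrative.

import numpy as np

grad = lambda p: np.array([2 * p[0], 20 * p[1]])  # gradient of x^2 + 10 y^2


def run(eta: float, beta: float, n_steps: int = 50) -> np.ndarray:
    """Gradient descent with momentum; beta = 0 gives vanilla gradient descent."""
    theta = np.array([10.0, 1.0])
    velocity = np.zeros_like(theta)
    for _ in range(n_steps):
        velocity = beta * velocity + grad(theta)
        theta = theta - eta * velocity
    return theta


print(f"Vanilla GD:    {run(eta=0.02, beta=0.0).round(4)}")
print(f"With momentum: {run(eta=0.02, beta=0.9).round(4)}")
# Momentum makes faster progress along the shallow x-direction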

Nesterov Accelerated Gradient (NAG). A variant called Nesterov momentum looks ahead before computing the gradient:

$$ \mathbf{v}_{t+1} = \beta \mathbf{v}_t + \nabla L(\boldsymbol{\theta}_t - \eta \beta \mathbf{v}_t) $$

$$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \mathbf{v}_{t+1} $$

The key difference is that the gradient is computed at the "look-ahead" position theta_t - eta beta v_t rather than the current position. This anticipatory correction leads to better convergence, particularly near the optimum.

3.4.2 RMSProp

Intuition. Different parameters may have very different gradient magnitudes. A parameter that frequently receives large gradients needs a smaller effective learning rate; one with consistently small gradients should get a larger one. RMSProp adapts the learning rate for each parameter individually.

Update Rules:

$$ \mathbf{s}_{t+1} = \rho \mathbf{s}_t + (1 - \rho) (\nabla L(\boldsymbol{\theta}_t))^2 $$

$$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\mathbf{s}_{t+1} + \epsilon}} \odot \nabla L(\boldsymbol{\theta}_t) $$

where:

  • s_t is the exponential moving average of squared gradients
  • rho is the decay rate, typically 0.99
  • epsilon is a small constant (e.g., 10^(-8)) to prevent division by zero
  • The squaring and square root operations are applied element-wise
  • The symbol odot denotes element-wise multiplication

How It Works. Parameters with large recent gradients have a large s value, so their effective learning rate eta/sqrt(s + epsilon) is small. Parameters with small recent gradients have a small s value, getting a relatively larger effective learning rate. This per-parameter adaptation makes RMSProp effective on problems with very different gradient scales across dimensions.
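
A minimal RMSProp sketch implementing the update rules above, exercised on a toy quadratic with very different curvatures per dimension; the hyperparameters are the typical defaults mentioned in the text.

import numpy as np


class RMSPropOptimizer:
    """Minimal RMSProp: per-parameter adaptive learning rates."""

    def __init__(self, learning_rate: float = 0.01, rho: float = 0.99,
                 epsilon: float = 1e-8) -> None:
        self.learning_rate = learning_rate
        self.rho = rho
        self.epsilon = epsilon
        self.s = None  # Running average of squared gradients

    def step(self, params: np.ndarray, grads: np.ndarray) -> np.ndarray:
        if self.s is None:
            self.s = np.zeros_like(params)
        self.s = self.rho * self.s + (1 - self.rho) * grads ** 2
        return params - self.learning_rate * grads / np.sqrt(self.s + self.epsilon)


# Toy problem: gradient scales differ by 100x across the two parameters
grad = lambda p: np.array([2 * p[0], 200 * p[1]])  # gradient of x^2 + 100 y^2
theta = np.array([5.0, 5.0])
opt = RMSPropOptimizer(learning_rate=0.1)
for _ in range(200):
    theta = opt.step(theta, grad(theta))
print(f"After 200 RMSProp steps: {theta.round(4)}")  # both parameters approach the minimum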

3.4.3 The Adam Optimizer

Adam (Adaptive Moment Estimation) combines the ideas of momentum and RMSProp, maintaining both a first moment (mean) and second moment (uncentered variance) of the gradients.

Update Rules:

$$ \mathbf{m}_{t+1} = \beta_1 \mathbf{m}_t + (1 - \beta_1) \nabla L(\boldsymbol{\theta}_t) $$

$$ \mathbf{v}_{t+1} = \beta_2 \mathbf{v}_t + (1 - \beta_2) (\nabla L(\boldsymbol{\theta}_t))^2 $$

Bias Correction:

$$ \hat{\mathbf{m}}_{t+1} = \frac{\mathbf{m}_{t+1}}{1 - \beta_1^{t+1}}, \quad \hat{\mathbf{v}}_{t+1} = \frac{\mathbf{v}_{t+1}}{1 - \beta_2^{t+1}} $$

Parameter Update:

$$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_{t+1}} + \epsilon} \odot \hat{\mathbf{m}}_{t+1} $$

where:

  • m_t is the first moment estimate (mean of gradients, like momentum)
  • v_t is the second moment estimate (mean of squared gradients, like RMSProp)
  • beta_1 = 0.9 and beta_2 = 0.999 are the default decay rates
  • The bias correction compensates for the initialization at zero, which biases the estimates toward zero in the early steps
  • eta = 0.001 is the default learning rate
  • epsilon = 10^(-8) prevents division by zero

Why Adam Is So Popular. Adam works well "out of the box" with its default hyperparameters across a wide range of problems. It adapts learning rates per-parameter (like RMSProp) while also incorporating momentum. The bias correction ensures stable behavior in the early training steps. For these reasons, Adam is often the default optimizer for new projects.

Known Limitations of Adam. Despite its popularity, Adam has known issues:

  • It can converge to suboptimal solutions compared to well-tuned SGD with momentum, particularly for image classification tasks
  • The adaptive learning rates can be too aggressive, leading to poor generalization
  • AdamW, which decouples weight decay from the adaptive learning rate, is often preferred (we will explore regularization techniques including weight decay in Chapter 8)

import numpy as np


class AdamOptimizer:
    """Adam optimizer implementation.

    Attributes:
        learning_rate: Step size for parameter updates.
        beta1: Exponential decay rate for the first moment.
        beta2: Exponential decay rate for the second moment.
        epsilon: Small constant for numerical stability.
    """

    def __init__(
        self,
        learning_rate: float = 0.001,
        beta1: float = 0.9,
        beta2: float = 0.999,
        epsilon: float = 1e-8,
    ) -> None:
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # First moment
        self.v = None  # Second moment
        self.t = 0     # Time step

    def step(self, params: np.ndarray, grads: np.ndarray) -> np.ndarray:
        """Perform a single optimization step.

        Args:
            params: Current parameter values.
            grads: Current gradient values.

        Returns:
            Updated parameter values.
        """
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)

        self.t += 1

        # Update biased first and second moment estimates
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * grads ** 2

        # Bias correction
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        # Parameter update
        params = params - self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
        return params

3.4.4 Comparing Optimizers: A Summary

| Optimizer | Key Idea | Hyperparameters | Best For |
| --- | --- | --- | --- |
| SGD | Raw gradient descent | eta | Simple, well-understood problems |
| SGD + Momentum | Accumulate velocity | eta, beta | Standard deep learning training |
| RMSProp | Adaptive per-parameter LR | eta, rho, epsilon | RNNs, non-stationary objectives |
| Adam | Momentum + adaptive LR | eta, beta_1, beta_2, epsilon | Default choice, rapid prototyping |
| AdamW | Adam + decoupled weight decay | Same as Adam + lambda | Large-scale training, transformers |

In practice, start with Adam (or AdamW) for new projects. If you need to squeeze out the last bit of performance, especially for computer vision tasks, consider switching to SGD with momentum and a carefully tuned learning rate schedule.


3.5 Automatic Differentiation

3.5.1 The Need for Automatic Differentiation

So far, we have discussed computing derivatives analytically (by hand) and numerically (via finite differences). Both approaches have significant limitations in the context of modern deep learning:

  • Symbolic differentiation (as done by computer algebra systems) produces exact derivatives but can lead to expression swell — the symbolic expressions grow exponentially with the depth of the computation, making them impractical for deep networks.
  • Numerical differentiation is simple but suffers from numerical errors (truncation and rounding) and is slow for functions with many parameters (it requires O(n) function evaluations for n parameters using forward differences).

Automatic differentiation (autodiff) is a fundamentally different approach. It computes exact derivatives (up to floating-point precision) by decomposing the computation into elementary operations and applying the chain rule systematically. It avoids both the expression swell of symbolic methods and the numerical issues of finite differences.

The key insight is that every computation, no matter how complex, is ultimately a sequence of elementary operations (addition, multiplication, exponentiation, etc.) for which we know the derivatives. By tracking these operations and applying the chain rule, we can compute derivatives efficiently and exactly.

3.5.2 Computational Graphs

A computational graph is a directed acyclic graph (DAG) that represents a computation. Each node represents a variable or an operation, and each edge represents a data dependency.

Example. Consider f(x, y) = (x + y) * (x * y). The computational graph is:

Input:  x    y
        |  X  |
        | / \ |
        v     v
  a = x+y   b = x*y
        \     /
         \   /
          v v
       f = a * b

Each intermediate variable (a, b) and the final output (f) is a node. The edges encode which inputs feed into which operations. This graph structure is what frameworks like PyTorch and TensorFlow build internally. When you write z = torch.mm(W, x) + b, you are constructing nodes in a computational graph.

3.5.3 Forward-Mode Automatic Differentiation

In forward-mode autodiff, we propagate derivatives forward through the computational graph, from inputs to outputs. Alongside each intermediate value, we compute its derivative with respect to a chosen input variable.

Dual Numbers. Forward-mode autodiff can be elegantly implemented using dual numbers. A dual number has the form a + b epsilon, where epsilon is an infinitesimal with the property that epsilon^2 = 0 (similar to the imaginary unit i where i^2 = -1).

When we evaluate f(x + epsilon) and expand, the coefficient of epsilon in the result is exactly f'(x):

$$ f(x + \epsilon) = f(x) + f'(x)\epsilon + \frac{f''(x)}{2}\epsilon^2 + \cdots = f(x) + f'(x)\epsilon $$

because all terms with epsilon^2 and higher vanish.

Forward-Mode Algorithm:

  1. Seed the input variable of interest with a derivative of 1 (and all other inputs with 0).
  2. Propagate through the graph: at each operation, compute both the value and the derivative using the chain rule.
  3. The derivative at the output is the desired partial derivative.

Worked Example. Compute df/dx for f(x, y) = (x + y) * (x * y) at x = 2, y = 3.

| Variable | Value | d/dx |
| --- | --- | --- |
| x | 2 | 1 (seed) |
| y | 3 | 0 |
| a = x + y | 5 | 1 + 0 = 1 |
| b = x * y | 6 | 1 * 3 + 2 * 0 = 3 |
| f = a * b | 30 | 1 * 6 + 5 * 3 = 21 |

So df/dx = 21.
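
Forward mode is easy to implement with a small dual-number class. The sketch below reproduces the table above: each DualNumber carries a value and the coefficient of epsilon (its derivative), and the seed on x selects which input we differentiate with respect to.

class DualNumber:
    """A value together with its derivative (the epsilon coefficient)."""

    def __init__(self, value: float, deriv: float = 0.0) -> None:
        self.value = value
        self.deriv = deriv

    def __add__(self, other: "DualNumber") -> "DualNumber":
        return DualNumber(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other: "DualNumber") -> "DualNumber":
        # (a + a' eps)(b + b' eps) = ab + (a' b + a b') eps, since eps^2 = 0
        return DualNumber(self.value * other.value,
                          self.deriv * other.value + self.value * other.deriv)


# f(x, y) = (x + y) * (x * y) at x = 2, y = 3, differentiating with respect to x
x = DualNumber(2.0, 1.0)  # seed: dx/dx = 1
y = DualNumber(3.0, 0.0)  # dy/dx = 0
f = (x + y) * (x * y)
print(f"f = {f.value}, df/dx = {f.deriv}")  # f = 30.0, df/dx = 21.0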

Complexity. One forward pass computes the derivative with respect to one input variable. For n inputs and one output, we need n forward passes. This makes forward mode efficient when the number of outputs exceeds the number of inputs (tall Jacobians).

3.5.4 Reverse-Mode Automatic Differentiation (Backpropagation)

In reverse-mode autodiff, we first perform a forward pass to compute all intermediate values, then a backward pass to propagate derivatives from the output back to the inputs.

Reverse-Mode Algorithm:

  1. Forward pass: Compute and store all intermediate values.
  2. Backward pass: Starting from the output (with a "seed" derivative of 1), propagate derivatives backward through the graph using the chain rule.

At each node, the adjoint (the derivative of the output with respect to that node's value) is computed by summing contributions from all downstream nodes:

$$ \bar{v}_i = \sum_{j \in \text{children}(i)} \bar{v}_j \frac{\partial v_j}{\partial v_i} $$

where:

  • v-bar_i is the adjoint of node i (the derivative of the output with respect to v_i)
  • The sum is over all nodes j that directly depend on node i

Worked Example. Compute both df/dx and df/dy for f(x, y) = (x + y) * (x * y) at x = 2, y = 3.

Forward pass (as before): a = 5, b = 6, f = 30.

Backward pass:

| Variable | Adjoint (df/d·) | Computation |
| --- | --- | --- |
| f | 1 | (seed) |
| a | df/da = b = 6 | f = a * b, so df/da = b |
| b | df/db = a = 5 | f = a * b, so df/db = a |
| x | df/dx = 6 * 1 + 5 * y = 6 + 15 = 21 | via a and b |
| y | df/dy = 6 * 1 + 5 * x = 6 + 10 = 16 | via a and b |

In a single backward pass, we computed the derivatives with respect to both inputs.

Complexity. One backward pass computes the derivatives with respect to all input variables. For n inputs and one output, we need just one backward pass (plus one forward pass). This is why reverse mode is overwhelmingly preferred in deep learning, where we have one scalar loss and millions of parameters.

Backpropagation is reverse-mode autodiff applied to neural networks. The terms are often used interchangeably, though "backpropagation" typically refers specifically to the application in neural networks.

3.5.5 Forward Mode vs. Reverse Mode: When to Use Which

|  | Forward Mode | Reverse Mode |
| --- | --- | --- |
| Computes | df/dx_i for one i per pass | df/dx_i for all i in one pass |
| Cost per pass | O(ops) | O(ops) |
| Passes needed | n (number of inputs) | m (number of outputs) |
| Memory | Low (no need to store graph) | Higher (must store forward pass) |
| Best for | Few inputs, many outputs | Many inputs, few outputs |
| Deep learning | Rarely used | Standard (one scalar loss) |

In deep learning, we almost always have one scalar output (the loss) and many inputs (the parameters), making reverse mode the clear winner. However, forward mode appears in specific applications such as computing Jacobian-vector products and in certain scientific computing scenarios.

3.5.6 Implementing a Simple Autodiff Engine

Let us build a minimal reverse-mode automatic differentiation engine. This will demystify what happens inside PyTorch and TensorFlow.



class Value:
    """A node in a computational graph supporting reverse-mode autodiff.

    Attributes:
        data: The numerical value.
        grad: The gradient (adjoint), accumulated during backward pass.
    """

    def __init__(
        self,
        data: float,
        children: tuple = (),
        operation: str = "",
    ) -> None:
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = set(children)
        self._operation = operation

    def __repr__(self) -> str:
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"

    def __add__(self, other: "Value") -> "Value":
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), "+")

        def _backward() -> None:
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad

        out._backward = _backward
        return out

    def __mul__(self, other: "Value") -> "Value":
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), "*")

        def _backward() -> None:
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self) -> None:
        """Perform reverse-mode autodiff from this node."""
        # Topological sort
        topo_order = []
        visited = set()

        def build_topo(v: "Value") -> None:
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo_order.append(v)

        build_topo(self)

        # Backward pass
        self.grad = 1.0
        for node in reversed(topo_order):
            node._backward()


# Example usage
x = Value(2.0)
y = Value(3.0)
a = x + y       # a = 5
b = x * y       # b = 6
f = a * b       # f = 30

f.backward()
print(f"f = {f.data}")       # 30.0
print(f"df/dx = {x.grad}")   # 21.0
print(f"df/dy = {y.grad}")   # 16.0

This simple engine captures the essence of what production frameworks do, though they add many optimizations: GPU support, batched operations, memory management, and hundreds of supported operations.


3.6 Putting It All Together: From Theory to Practice

3.6.1 Gradients in Linear Regression

Let us trace the full gradient computation for the simplest machine learning model: linear regression. The model is:

$$ \hat{y} = \mathbf{X}\boldsymbol{\theta} $$

The mean squared error loss is:

$$ L(\boldsymbol{\theta}) = \frac{1}{2N} \|\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\|^2 = \frac{1}{2N} (\mathbf{X}\boldsymbol{\theta} - \mathbf{y})^T (\mathbf{X}\boldsymbol{\theta} - \mathbf{y}) $$

Taking the gradient with respect to theta (using the matrix calculus results from Chapter 1):

$$ \nabla_{\boldsymbol{\theta}} L = \frac{1}{N} \mathbf{X}^T (\mathbf{X}\boldsymbol{\theta} - \mathbf{y}) $$

where:

  • X is the N x d design matrix (as introduced in Chapter 1)
  • theta is the d-dimensional parameter vector
  • y is the N-dimensional target vector
  • The factor 1/2 in the loss was chosen to simplify the derivative

Setting the gradient to zero gives the closed-form solution (the normal equations):

$$ \boldsymbol{\theta}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} $$

But for large-scale problems, computing this inverse is impractical, and we use gradient descent instead.

import numpy as np


def linear_regression_gradient_descent(
    X: np.ndarray,
    y: np.ndarray,
    learning_rate: float = 0.01,
    n_iterations: int = 1000,
) -> tuple[np.ndarray, list[float]]:
    """Train linear regression using gradient descent.

    Args:
        X: Design matrix, shape (n_samples, n_features).
        y: Target vector, shape (n_samples,).
        learning_rate: Step size for gradient descent.
        n_iterations: Number of gradient descent steps.

    Returns:
        Tuple of (learned parameters, loss history).
    """
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    loss_history = []

    for i in range(n_iterations):
        # Predictions
        y_pred = X @ theta

        # Loss
        residuals = y_pred - y
        loss = np.mean(residuals ** 2) / 2
        loss_history.append(loss)

        # Gradient
        gradient = X.T @ residuals / n_samples

        # Update
        theta -= learning_rate * gradient

    return theta, loss_history


# Generate synthetic data
np.random.seed(42)
n_samples, n_features = 100, 3
X = np.random.randn(n_samples, n_features)
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * np.random.randn(n_samples)

# Train
theta_gd, losses = linear_regression_gradient_descent(X, y, learning_rate=0.1)
theta_exact = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"True parameters:      {true_theta}")
print(f"Gradient descent:     {theta_gd.round(4)}")
print(f"Exact solution:       {theta_exact.round(4)}")
print(f"Final loss:           {losses[-1]:.6f}")

3.6.2 Gradients in Logistic Regression

Logistic regression extends this to classification. The model predicts probabilities:

$$ p(y = 1 | \mathbf{x}) = \sigma(\boldsymbol{\theta}^T \mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\theta}^T \mathbf{x}}} $$

The binary cross-entropy loss is (as derived from maximum likelihood in Chapter 2):

$$ L(\boldsymbol{\theta}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \ln(\hat{y}_i) + (1 - y_i) \ln(1 - \hat{y}_i) \right] $$

The gradient is:

$$ \nabla_{\boldsymbol{\theta}} L = \frac{1}{N} \mathbf{X}^T (\hat{\mathbf{y}} - \mathbf{y}) $$

where y-hat is the vector of predicted probabilities. Notice the remarkable similarity to the linear regression gradient — the difference is that the residuals are between probabilities and labels rather than between predicted and true values.
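
A short sketch computing this gradient and checking one component against a finite difference; the synthetic data generation is purely illustrative.

import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def logistic_loss(theta: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """Binary cross-entropy loss (with a small epsilon for numerical safety)."""
    p = sigmoid(X @ theta)
    eps = 1e-12
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))


def logistic_gradient(theta: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Gradient of the binary cross-entropy loss: X^T (y_hat - y) / N."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)


# Synthetic data and a numerical check of one gradient component
np.random.seed(0)
X = np.random.randn(50, 3)
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
theta = np.zeros(3)

grad = logistic_gradient(theta, X, y)
h = 1e-5
theta_plus, theta_minus = theta.copy(), theta.copy()
theta_plus[0] += h
theta_minus[0] -= h
numerical = (logistic_loss(theta_plus, X, y) - logistic_loss(theta_minus, X, y)) / (2 * h)
print(f"Analytical d/dtheta_0: {grad[0]:.6f}")
print(f"Numerical  d/dtheta_0: {numerical:.6f}")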

3.6.3 The Gradient Computation Pipeline in Deep Learning

In a modern deep learning framework, the gradient computation pipeline follows these steps:

  1. Define the computation: Build the computational graph by writing the forward pass (model architecture + loss function).
  2. Forward pass: Evaluate the computation, storing intermediate values.
  3. Backward pass: Traverse the graph in reverse topological order, applying the chain rule at each node.
  4. Update parameters: Use the computed gradients with an optimizer (SGD, Adam, etc.) to update the model parameters.
  5. Repeat: Iterate until convergence.

This pipeline — forward, backward, update — is the heartbeat of deep learning training. Every training step in every deep learning experiment ever run follows this pattern. Understanding it deeply is essential for AI engineering.

3.6.4 Common Pitfalls and Debugging Tips

Vanishing Gradients. When gradients become very small as they propagate backward through many layers, learning stalls. This is common with sigmoid and tanh activations, which saturate (have near-zero derivatives) for large inputs. Mitigation strategies include using ReLU activations, careful weight initialization (as discussed in Chapter 2's section on distributions), and residual connections (which we will explore in Chapter 7).

Exploding Gradients. The opposite problem: gradients grow exponentially large, leading to numerical overflow and unstable training. Gradient clipping — limiting the norm of the gradient vector — is the standard remedy:

import numpy as np


def clip_gradients(gradient: np.ndarray, max_norm: float = 1.0) -> np.ndarray:
    """Clip gradient vector to a maximum norm.

    Args:
        gradient: The gradient vector.
        max_norm: Maximum allowed gradient norm.

    Returns:
        Clipped gradient vector.
    """
    norm = np.linalg.norm(gradient)
    if norm > max_norm:
        gradient = gradient * (max_norm / norm)
    return gradient

Practical Optimization Debugging. Beyond vanishing and exploding gradients, AI engineers encounter several other optimization pathologies in practice:

  • Loss plateaus: The loss stops decreasing but has not reached a satisfactory level. This can indicate a learning rate that is too small, a saddle point region, or an insufficient model capacity. Try increasing the learning rate temporarily, using a warm restart schedule, or increasing model size.

  • Oscillating loss: The training loss bounces up and down without converging. This typically indicates a learning rate that is too large for the current loss landscape. Reduce the learning rate or switch to an adaptive optimizer like Adam. If using mini-batch SGD, increasing the batch size can also reduce oscillation by providing more stable gradient estimates.

  • Training loss decreases but validation loss increases from the start: This unusual pattern often indicates a bug in the data pipeline (e.g., data leakage, incorrect labels, or a mismatch between training and validation preprocessing). Always investigate data issues before tuning optimization hyperparameters.

  • NaN or Inf values in the loss: These are almost always caused by numerical instability, such as taking the log of zero, dividing by zero, or exponentiating a large number. Common fixes include adding small epsilon values to denominators, using numerically stable implementations of log-sum-exp and softmax, and gradient clipping.

A systematic debugging workflow for optimization problems follows this order: (1) verify the data pipeline, (2) check for numerical instability, (3) try a known-good learning rate and optimizer (Adam with lr=1e-3), (4) simplify the model to verify that it can overfit a small batch, (5) gradually add complexity back. As we saw in Chapter 1, the ability to diagnose and fix optimization problems is one of the core skills that distinguishes effective AI engineers.

Gradient Checking. To verify that your analytical gradient computation (or autodiff implementation) is correct, compare it against numerical gradients:

import numpy as np


def gradient_check(
    f,
    x: np.ndarray,
    analytical_grad: np.ndarray,
    h: float = 1e-5,
    tolerance: float = 1e-7,
) -> bool:
    """Verify analytical gradients against numerical gradients.

    Args:
        f: Scalar-valued function.
        x: Point at which to check gradients.
        analytical_grad: The analytically computed gradient.
        h: Step size for numerical differentiation.
        tolerance: Maximum allowed relative error.

    Returns:
        True if all gradient components are within tolerance.
    """
    numerical_grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[i] += h
        x_minus[i] -= h
        numerical_grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)

    # Relative error
    diff = np.linalg.norm(analytical_grad - numerical_grad)
    denom = np.linalg.norm(analytical_grad) + np.linalg.norm(numerical_grad)
    relative_error = diff / (denom + 1e-15)

    if relative_error > tolerance:
        print(f"Gradient check FAILED: relative error = {relative_error:.2e}")
        print(f"Analytical: {analytical_grad}")
        print(f"Numerical:  {numerical_grad}")
        return False

    print(f"Gradient check PASSED: relative error = {relative_error:.2e}")
    return True

3.6.5 Computational Considerations

Memory vs. Computation Trade-offs. Reverse-mode autodiff requires storing all intermediate values from the forward pass, which can be memory-intensive for deep networks. Gradient checkpointing is a technique that trades computation for memory: instead of storing all intermediate values, only a subset of "checkpoints" are stored, and the remaining values are recomputed during the backward pass. This reduces memory from O(L) to O(sqrt(L)) for a network with L layers, at the cost of one additional forward pass.

Mixed Precision Training. Modern hardware (GPUs with Tensor Cores) supports faster computation in lower precision (float16 or bfloat16). By performing the forward and backward passes in reduced precision while maintaining a master copy of parameters in float32, we can train models faster with less memory. The key is that the gradient accumulation and parameter update happen in float32 to avoid precision loss. We will explore this in detail in Chapter 10 when we discuss computational efficiency.


Summary

This chapter has covered the essential calculus and optimization concepts that power modern machine learning:

  1. Derivatives and gradients provide the language of change, telling us how sensitive a function is to its inputs. The gradient vector points in the direction of steepest ascent.

  2. The chain rule allows us to compute derivatives of composite functions, forming the mathematical basis for backpropagation.

  3. Taylor series explain why gradient descent works: near any point, a smooth function is well-approximated by its linear (or quadratic) Taylor expansion.

  4. The optimization landscape of neural networks is complex, with local minima, saddle points, and flat regions. Understanding this landscape guides our choice of optimization algorithm.

  5. Gradient descent and its variants form a progression from simple (batch GD) to practical (mini-batch SGD) to sophisticated (Adam). Each variant addresses specific limitations of its predecessors.

  6. Automatic differentiation provides the computational machinery to compute exact gradients efficiently. Reverse-mode autodiff (backpropagation) is the engine that makes training deep neural networks tractable.

These tools are not abstract mathematics — they are the daily working toolkit of the AI engineer. Every model you train uses gradient descent (or a variant), every gradient is computed via autodiff, and every optimizer relies on the concepts we have covered here.

In the next chapter, we will explore information theory and its connections to machine learning, including concepts like entropy, cross-entropy, and KL divergence that underpin many of the loss functions used in practice. The gradient machinery we have developed here will be essential for optimizing those information-theoretic objectives.


References

  1. Ruder, S. (2016). "An overview of gradient descent optimization algorithms." arXiv:1609.04747.
  2. Kingma, D. P., & Ba, J. (2015). "Adam: A Method for Stochastic Optimization." ICLR.
  3. Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). "Automatic Differentiation in Machine Learning: a Survey." JMLR, 18(153), 1-43.
  4. Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization." NeurIPS.
  5. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapters 4, 6, 8.
  6. Griewank, A., & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM.
  7. Loshchilov, I., & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR.
  8. Bottou, L. (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent." COMPSTAT.