
Chapter 6: Neural Networks from Scratch

Perceptrons, Backpropagation, and What's Really Happening Inside


Learning Objectives

By the end of this chapter, you will be able to:

  1. Implement a multi-layer perceptron from scratch in numpy, including forward pass, loss computation, and backpropagation
  2. Derive the backpropagation equations for arbitrary computational graphs using the chain rule
  3. Explain the role of activation functions (ReLU, sigmoid, tanh, GELU) and their gradient properties
  4. Implement the same network in PyTorch and map each numpy operation to its PyTorch equivalent
  5. Diagnose common failure modes: vanishing gradients, exploding gradients, dead neurons

6.1 — Why Build It from Scratch?

You are about to do something that most machine learning courses skip: implement a neural network from scratch, twice. First in numpy, where every matrix multiply, every activation, every gradient is explicit in your code. Then in PyTorch, where automatic differentiation handles the backward pass for you.

This is not a mere pedagogical exercise. This is the chapter where you stop being someone who calls model.fit() and start being someone who understands what model.fit() does.

The reason is simple. Every deep learning framework — PyTorch, TensorFlow, JAX — is an implementation of the same mathematical operations. When your model does not converge, when your gradients explode, when your loss plateaus at an unacceptable value, the debugging process requires understanding those operations. You cannot debug a system you do not understand.

Here is what we will build:

  1. A single perceptron (Section 6.2)
  2. A multi-layer perceptron with arbitrary depth and width (Section 6.3)
  3. Activation functions and their gradients (Section 6.4)
  4. The forward pass as function composition (Section 6.5)
  5. Loss functions for classification and regression (Section 6.6)
  6. Backpropagation derived from the chain rule (Section 6.7)
  7. The complete training loop in numpy (Section 6.8)
  8. Weight initialization and its mathematical justification (Section 6.9)
  9. The same network in PyTorch (Section 6.10)
  10. Gradient checking as a debugging discipline (Section 6.11)
  11. Diagnosing pathologies: vanishing gradients, exploding gradients, dead neurons (Section 6.12)
  12. The universal approximation theorem: what it says and what it does not (Section 6.13)

Research Insight: Rumelhart, Hinton, and Williams (1986) published the backpropagation algorithm for neural networks in Nature. The mathematics — the chain rule — was centuries old. The insight was not the math but the recognition that this specific application of the chain rule, combined with nonlinear activation functions, could learn useful representations. That combination is exactly what we will implement.


6.2 — The Perceptron: Where It Begins

A perceptron computes a weighted sum of its inputs, adds a bias, and applies a step function:

$$y = \begin{cases} 1 & \text{if } \mathbf{w}^T \mathbf{x} + b \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\mathbf{x} \in \mathbb{R}^d$ is the input, $\mathbf{w} \in \mathbb{R}^d$ is the weight vector, and $b \in \mathbb{R}$ is the bias.

Geometrically, the perceptron defines a hyperplane in $\mathbb{R}^d$ that separates the input space into two half-spaces. Inputs on one side are classified as 1; inputs on the other as 0. The weight vector $\mathbf{w}$ is normal to this hyperplane, and the bias $b$ controls the offset from the origin.

import numpy as np
from typing import Tuple


def perceptron_forward(
    x: np.ndarray,
    w: np.ndarray,
    b: float,
) -> int:
    """Single perceptron forward pass.

    Args:
        x: Input vector of shape (d,).
        w: Weight vector of shape (d,).
        b: Scalar bias.

    Returns:
        Binary output (0 or 1).
    """
    z = np.dot(w, x) + b
    return int(z >= 0)

The perceptron learning algorithm updates weights by adding misclassified examples:

$$\mathbf{w} \leftarrow \mathbf{w} + \eta (y - \hat{y}) \mathbf{x}$$ $$b \leftarrow b + \eta (y - \hat{y})$$

where $\eta$ is the learning rate, $y$ is the true label, and $\hat{y}$ is the prediction.

def perceptron_train(
    X: np.ndarray,
    y: np.ndarray,
    lr: float = 1.0,
    max_epochs: int = 100,
    seed: int = 42,
) -> Tuple[np.ndarray, float]:
    """Train a perceptron using the perceptron learning algorithm.

    Args:
        X: Input data of shape (n, d).
        y: Binary labels of shape (n,).
        lr: Learning rate.
        max_epochs: Maximum number of passes through the data.
        seed: Random seed for weight initialization.

    Returns:
        Tuple of (weights, bias).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(0, 0.01, size=d)
    b = 0.0

    for epoch in range(max_epochs):
        errors = 0
        for i in range(n):
            y_hat = perceptron_forward(X[i], w, b)
            if y_hat != y[i]:
                w += lr * (y[i] - y_hat) * X[i]
                b += lr * (y[i] - y_hat)
                errors += 1
        if errors == 0:
            break

    return w, b

The perceptron convergence theorem guarantees convergence in a finite number of steps — if the data is linearly separable. This "if" is the fundamental limitation. The perceptron cannot learn XOR, and Minsky and Papert's 1969 proof of this fact contributed to the first AI winter.
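The XOR limitation is easy to check empirically. The sketch below restates the perceptron forward pass and update rule inline (so it runs standalone) and trains on the four XOR points — no matter how long it runs, at least one point stays misclassified:

```python
import numpy as np

# Compact restatement of the perceptron forward pass from above.
def forward(x: np.ndarray, w: np.ndarray, b: float) -> int:
    return int(np.dot(w, x) + b >= 0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR labels: not linearly separable

w, b = np.zeros(2), 0.0
for epoch in range(1000):
    errors = 0
    for xi, yi in zip(X, y):
        y_hat = forward(xi, w, b)
        if y_hat != yi:  # perceptron update rule, lr = 1
            w += (yi - y_hat) * xi
            b += (yi - y_hat)
            errors += 1
    if errors == 0:
        break

# No hyperplane separates XOR, so some point is always misclassified.
mistakes = sum(forward(xi, w, b) != yi for xi, yi in zip(X, y))
print(mistakes)  # always >= 1
```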

Mathematical Foundation: The perceptron convergence theorem states: if the data is linearly separable with margin $\gamma > 0$ (meaning, with labels recoded as $y_i \in \{-1, +1\}$, there exists $\mathbf{w}^*$ with $\|\mathbf{w}^*\| = 1$ and $y_i(\mathbf{w}^{*T}\mathbf{x}_i) \geq \gamma$ for all $i$), and $\|\mathbf{x}_i\| \leq R$ for all $i$, then the perceptron algorithm makes at most $R^2/\gamma^2$ mistakes. This bound is tight — there exist datasets that require exactly this many updates. Note what the bound depends on: the margin (how separable the data is) and the radius of the data, not the dimension $d$. This is an early hint that generalization depends on geometry, not dimensionality.


6.3 — The Multi-Layer Perceptron

The solution to the XOR problem — and to the broader limitation of linear classifiers — is to compose multiple layers of linear transformations interleaved with nonlinear activation functions.

A multi-layer perceptron (MLP) with $L$ layers computes:

$$\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$$ $$\mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)})$$ $$\mathbf{z}^{(2)} = \mathbf{W}^{(2)} \mathbf{a}^{(1)} + \mathbf{b}^{(2)}$$ $$\mathbf{a}^{(2)} = \sigma(\mathbf{z}^{(2)})$$ $$\vdots$$ $$\mathbf{z}^{(L)} = \mathbf{W}^{(L)} \mathbf{a}^{(L-1)} + \mathbf{b}^{(L)}$$ $$\hat{\mathbf{y}} = \sigma_{\text{out}}(\mathbf{z}^{(L)})$$

where $\mathbf{W}^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ and $\mathbf{b}^{(\ell)} \in \mathbb{R}^{n_\ell}$ are the weights and biases of layer $\ell$, $\sigma$ is a nonlinear activation function (applied elementwise), and $\sigma_{\text{out}}$ is the output activation (sigmoid for binary classification, softmax for multi-class, identity for regression).

The hidden layers $\mathbf{a}^{(1)}, \ldots, \mathbf{a}^{(L-1)}$ are learned representations of the input. Each layer extracts progressively more abstract features. The width $n_\ell$ of each layer determines the representational capacity of that layer.

Common Misconception: "More layers always means a better network." Depth provides compositional power — the ability to build complex functions from simpler ones — but also makes training harder. The vanishing gradient problem (Section 6.12) means that, without careful initialization and architecture choices, gradients shrink exponentially with depth, making lower layers untrainable. Chapter 7 covers the techniques (batch normalization, residual connections) that make deep training practical.

Why do we need the nonlinearity $\sigma$? Without it, the composition of linear transformations is itself linear:

$$\mathbf{W}^{(2)}(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)} = (\mathbf{W}^{(2)}\mathbf{W}^{(1)})\mathbf{x} + (\mathbf{W}^{(2)}\mathbf{b}^{(1)} + \mathbf{b}^{(2)})$$

This is just another linear transformation $\mathbf{W}'\mathbf{x} + \mathbf{b}'$. No matter how many layers you stack, without nonlinearities, the network can only represent linear functions. The activation function is what gives the network its expressive power.
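The collapse can be verified numerically. The sketch below composes two random linear layers (no activation between them) and checks that the result equals a single linear map with $\mathbf{W}' = \mathbf{W}^{(2)}\mathbf{W}^{(1)}$ and $\mathbf{b}' = \mathbf{W}^{(2)}\mathbf{b}^{(1)} + \mathbf{b}^{(2)}$:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x = rng.normal(size=4)

# Two stacked linear layers with no nonlinearity in between...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...collapse to one linear layer.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

print(np.allclose(two_layer, one_layer))  # True
```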


6.4 — Activation Functions and Their Gradients

The choice of activation function determines the network's gradient flow, representational properties, and failure modes. We cover the five most important.

Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$

The sigmoid maps $\mathbb{R} \to (0, 1)$. Its maximum gradient occurs at $z = 0$, where $\sigma'(0) = 0.25$. For $|z| > 5$, the gradient is essentially zero — the saturation problem. When gradients are multiplied across layers (as in backpropagation), repeated multiplication by values $\leq 0.25$ causes vanishing gradients.

def sigmoid(z: np.ndarray) -> np.ndarray:
    """Sigmoid activation, numerically stable."""
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),
        np.exp(z) / (1 + np.exp(z)),
    )


def sigmoid_gradient(z: np.ndarray) -> np.ndarray:
    """Gradient of the sigmoid function."""
    s = sigmoid(z)
    return s * (1 - s)

Implementation Note: The numerically stable sigmoid uses np.where to avoid overflow in np.exp(-z) for large positive $z$. For $z \geq 0$, use $1/(1 + e^{-z})$. For $z < 0$, use $e^z/(1 + e^z)$. Both formulations are algebraically identical but numerically different. One caveat: np.where evaluates both branches over the full array, so the discarded branch can still trigger overflow warnings even though the returned values are correct; wrap the call in np.errstate(over="ignore"), or compute each branch on a boolean mask, to silence them.

Tanh

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1$$

$$\tanh'(z) = 1 - \tanh^2(z)$$

Tanh maps $\mathbb{R} \to (-1, 1)$ and is zero-centered, which helps optimization. Its maximum gradient is 1.0 (at $z = 0$), making it less prone to vanishing gradients than sigmoid — but it still saturates for large $|z|$.

def tanh(z: np.ndarray) -> np.ndarray:
    """Tanh activation."""
    return np.tanh(z)


def tanh_gradient(z: np.ndarray) -> np.ndarray:
    """Gradient of tanh."""
    return 1 - np.tanh(z) ** 2

ReLU (Rectified Linear Unit)

$$\text{ReLU}(z) = \max(0, z)$$

$$\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \\ \text{undefined} & \text{if } z = 0 \end{cases}$$

ReLU is the default activation for hidden layers in modern networks. Its gradient is exactly 1 for positive inputs — no saturation, no vanishing. But for negative inputs, the gradient is exactly 0: the neuron is "dead" and does not learn from that input. If a neuron's weights are pushed into a region where its pre-activation $z$ is always negative (for every training example), the neuron is permanently dead — the dead neuron problem.

def relu(z: np.ndarray) -> np.ndarray:
    """ReLU activation."""
    return np.maximum(0, z)


def relu_gradient(z: np.ndarray) -> np.ndarray:
    """Gradient of ReLU (subgradient at z=0 is 0)."""
    return (z > 0).astype(z.dtype)
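The dead-neuron failure mode is visible directly in these two functions. This standalone snippet (restating relu and its gradient inline) shows that a neuron whose pre-activations are negative on every example receives exactly zero gradient, so gradient descent can never revive it:

```python
import numpy as np

def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0, z)

def relu_gradient(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(z.dtype)

# Pre-activations for one neuron across a batch, all negative:
z = np.array([-3.1, -0.2, -5.0, -1.7])

print(relu(z))           # [0. 0. 0. 0.] -- the neuron never fires
print(relu_gradient(z))  # [0. 0. 0. 0.] -- and receives no gradient signal
```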

GELU (Gaussian Error Linear Unit)

$$\text{GELU}(z) = z \cdot \Phi(z)$$

where $\Phi(z)$ is the standard Gaussian CDF. The GELU smoothly interpolates between 0 and the identity, weighting inputs by their probability under a standard normal. Unlike ReLU, GELU is smooth everywhere (differentiable at $z = 0$) and has a non-zero gradient for slightly negative inputs, mitigating dead neurons.

$$\text{GELU}'(z) = \Phi(z) + z \cdot \phi(z)$$

where $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ is the standard Gaussian PDF.

from scipy.special import erf


def gelu(z: np.ndarray) -> np.ndarray:
    """GELU activation (exact)."""
    return 0.5 * z * (1 + erf(z / np.sqrt(2)))


def gelu_gradient(z: np.ndarray) -> np.ndarray:
    """Gradient of GELU."""
    phi = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)  # Gaussian PDF
    cdf = 0.5 * (1 + erf(z / np.sqrt(2)))            # Gaussian CDF
    return cdf + z * phi

Swish

$$\text{Swish}(z) = z \cdot \sigma(z)$$

$$\text{Swish}'(z) = \sigma(z) + z \cdot \sigma(z)(1 - \sigma(z)) = \sigma(z) + z \cdot \sigma'(z)$$

Swish is self-gated — the input modulates its own activation through the sigmoid. It shares GELU's smoothness and non-zero gradient for negative inputs. Swish with a learnable parameter $\beta$ (i.e., $z \cdot \sigma(\beta z)$) interpolates between the scaled linear function $z/2$ ($\beta = 0$, since $\sigma(0) = 1/2$) and ReLU ($\beta \to \infty$).

def swish(z: np.ndarray) -> np.ndarray:
    """Swish activation."""
    return z * sigmoid(z)


def swish_gradient(z: np.ndarray) -> np.ndarray:
    """Gradient of Swish."""
    s = sigmoid(z)
    return s + z * s * (1 - s)
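The parameterized form is easy to probe numerically. The sketch below uses a hypothetical helper swish_beta (with scipy's numerically stable sigmoid, expit) to check both limits: large $\beta$ approximates ReLU, and $\beta = 0$ gives exactly $z/2$:

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def swish_beta(z: np.ndarray, beta: float) -> np.ndarray:
    """Parameterized Swish: z * sigmoid(beta * z)."""
    return z * expit(beta * z)

z = np.linspace(-5, 5, 101)

# Large beta: the sigmoid gate becomes a hard step, recovering ReLU.
print(np.max(np.abs(swish_beta(z, 50.0) - np.maximum(0, z))))  # small (< 0.01)

# beta = 0: sigmoid(0) = 1/2, so Swish reduces to z/2.
print(np.allclose(swish_beta(z, 0.0), z / 2))  # True
```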

Comparison: Gradient Behavior

The following table summarizes the gradient properties:

| Activation | Range | Max $\|\nabla\|$ | Saturates? | Dead neuron risk? |
|------------|-------|------------------|------------|-------------------|
| Sigmoid | $(0, 1)$ | 0.25 | Yes (both tails) | No |
| Tanh | $(-1, 1)$ | 1.0 | Yes (both tails) | No |
| ReLU | $[0, \infty)$ | 1.0 | No (positive) | Yes (negative) |
| GELU | $\approx [-0.17, \infty)$ | $\leq 1.08$ | No | Low |
| Swish | $\approx [-0.28, \infty)$ | $\leq 1.10$ | No | Low |

Production Reality: In modern practice, ReLU remains the default for most architectures. GELU is standard in transformers (GPT, BERT). Swish appears in EfficientNet and related architectures. Sigmoid and tanh are reserved for output layers (sigmoid for binary classification, tanh for bounded outputs) and gating mechanisms (LSTM gates use sigmoid). The choice between GELU and Swish rarely matters in practice — both solve the dead neuron problem while maintaining the computational efficiency of near-linear activations.


6.5 — The Forward Pass as Function Composition

We now implement a complete MLP with configurable architecture. The forward pass is function composition: layer by layer, we apply a linear transformation followed by a nonlinearity.

from dataclasses import dataclass, field
from typing import List, Dict, Callable


@dataclass
class MLPConfig:
    """Configuration for a multi-layer perceptron."""
    layer_dims: List[int]       # [input_dim, hidden1, hidden2, ..., output_dim]
    activation: str = "relu"    # Activation for hidden layers
    output_activation: str = "sigmoid"  # Activation for the output layer
    seed: int = 42


ACTIVATIONS: Dict[str, Callable] = {
    "relu": relu,
    "sigmoid": sigmoid,
    "tanh": tanh,
    "gelu": gelu,
    "swish": swish,
    "identity": lambda z: z,
}

ACTIVATION_GRADIENTS: Dict[str, Callable] = {
    "relu": relu_gradient,
    "sigmoid": sigmoid_gradient,
    "tanh": tanh_gradient,
    "gelu": gelu_gradient,
    "swish": swish_gradient,
    "identity": lambda z: np.ones_like(z),
}


class NumpyMLP:
    """Multi-layer perceptron implemented entirely in numpy.

    This class demonstrates the mechanics of neural networks without
    any deep learning framework. Every operation — forward pass, loss,
    backward pass, parameter update — is explicit.
    """

    def __init__(self, config: MLPConfig) -> None:
        self.config = config
        self.params: Dict[str, np.ndarray] = {}
        self.cache: Dict[str, np.ndarray] = {}
        self._init_params()

    def _init_params(self) -> None:
        """Initialize weights using He initialization for ReLU,
        Xavier/Glorot for sigmoid/tanh.
        """
        rng = np.random.default_rng(self.config.seed)
        dims = self.config.layer_dims

        for l in range(1, len(dims)):
            n_in, n_out = dims[l - 1], dims[l]

            # He initialization: Var(w) = 2/n_in (appropriate for ReLU)
            # Xavier initialization: Var(w) = 2/(n_in + n_out) (for sigmoid/tanh)
            if self.config.activation in ("relu",):
                std = np.sqrt(2.0 / n_in)
            else:
                std = np.sqrt(2.0 / (n_in + n_out))

            self.params[f"W{l}"] = rng.normal(0, std, size=(n_out, n_in))
            self.params[f"b{l}"] = np.zeros(n_out)

    def forward(self, X: np.ndarray) -> np.ndarray:
        """Forward pass through the network.

        Args:
            X: Input data of shape (n_samples, input_dim).

        Returns:
            Output predictions of shape (n_samples, output_dim).
        """
        self.cache.clear()
        self.cache["A0"] = X
        L = len(self.config.layer_dims) - 1  # Number of layers

        act_fn = ACTIVATIONS[self.config.activation]
        out_fn = ACTIVATIONS[self.config.output_activation]

        A = X
        for l in range(1, L + 1):
            W = self.params[f"W{l}"]
            b = self.params[f"b{l}"]

            # Linear transformation: Z = A W^T + b
            Z = A @ W.T + b  # (n, n_out)
            self.cache[f"Z{l}"] = Z

            # Activation
            if l < L:
                A = act_fn(Z)
            else:
                A = out_fn(Z)
            self.cache[f"A{l}"] = A

        return A

Each forward pass stores the pre-activation values $\mathbf{Z}^{(\ell)}$ and post-activation values $\mathbf{A}^{(\ell)}$ in the cache. These are needed for the backward pass: backpropagation requires the intermediate values to compute gradients. This is the fundamental time-memory tradeoff in deep learning — you can recompute forward activations during the backward pass to save memory (gradient checkpointing), but the standard approach caches everything.
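The memory side of this tradeoff is straightforward to estimate. The helper below (activation_memory_bytes is an illustrative name, not part of the class) counts the floats held by the cache — the input $\mathbf{A}^{(0)}$ plus $\mathbf{Z}^{(\ell)}$ and $\mathbf{A}^{(\ell)}$ for each layer:

```python
from typing import List

def activation_memory_bytes(
    layer_dims: List[int],
    batch_size: int,
    bytes_per_float: int = 8,  # float64, numpy's default
) -> int:
    """Bytes held by the forward cache: A0 plus Z and A per layer."""
    floats_per_sample = layer_dims[0] + 2 * sum(layer_dims[1:])
    return batch_size * floats_per_sample * bytes_per_float

# The [20, 64, 32, 1] network used later in this chapter, batch of 64:
print(activation_memory_bytes([20, 64, 32, 1], batch_size=64))  # 109568
```

Gradient checkpointing drops the $\mathbf{Z}$/$\mathbf{A}$ terms for most layers and recomputes them on the backward pass, trading this memory for extra forward FLOPs.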


6.6 — Loss Functions

The loss function measures how far the network's predictions are from the true labels. Its gradient with respect to the network's output is the starting point for backpropagation.

Binary Cross-Entropy Loss

For binary classification with sigmoid output:

$$\mathcal{L}(\hat{y}, y) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

The gradient with respect to the network output $\hat{y}$:

$$\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i} + \frac{1 - y_i}{1 - \hat{y}_i} = \frac{\hat{y}_i - y_i}{\hat{y}_i(1 - \hat{y}_i)}$$

When the output activation is sigmoid, the combined gradient of the loss and the sigmoid simplifies beautifully:

$$\frac{\partial \mathcal{L}}{\partial z_i} = \frac{\partial \mathcal{L}}{\partial \hat{y}_i} \cdot \sigma'(z_i) = \frac{\hat{y}_i - y_i}{\hat{y}_i(1 - \hat{y}_i)} \cdot \hat{y}_i(1 - \hat{y}_i) = \hat{y}_i - y_i$$

This cancellation — the gradient of BCE + sigmoid is simply $\hat{y} - y$ — is not a coincidence. It follows from the fact that cross-entropy is the negative log-likelihood of a Bernoulli distribution, and the sigmoid is the canonical link function for Bernoulli data. This was discussed from the information theory perspective in Chapter 4.
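The cancellation can be confirmed with a finite-difference check on a single logit. This is a standalone sketch (restating sigmoid and the per-example BCE inline, at arbitrary illustrative values of $z$ and $y$):

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def bce(z: float, y: float) -> float:
    """Per-example BCE evaluated at the pre-activation z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y = 0.7, 1.0
eps = 1e-6

# Central difference approximation of dL/dz vs. the claimed closed form.
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y

print(abs(numeric - analytic))  # tiny
```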

Mean Squared Error

For regression with identity output:

$$\mathcal{L}(\hat{y}, y) = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$$

$$\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$$

def binary_cross_entropy(
    y_pred: np.ndarray,
    y_true: np.ndarray,
    eps: float = 1e-7,
) -> float:
    """Binary cross-entropy loss.

    Args:
        y_pred: Predicted probabilities of shape (n,) or (n, 1).
        y_true: True labels (0 or 1) of shape (n,) or (n, 1).
        eps: Small constant for numerical stability.

    Returns:
        Scalar loss value.
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)
    loss = -np.mean(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    )
    return float(loss)


def binary_cross_entropy_gradient(
    y_pred: np.ndarray,
    y_true: np.ndarray,
) -> np.ndarray:
    """Gradient of BCE with respect to the pre-activation (sigmoid + BCE combined).

    This is the combined gradient dL/dz = y_pred - y_true,
    which avoids the numerical issues of dividing by y_pred*(1-y_pred).

    Args:
        y_pred: Sigmoid output (predicted probabilities).
        y_true: True labels.

    Returns:
        Gradient of shape matching y_pred.
    """
    n = y_pred.shape[0]
    return (y_pred - y_true) / n


def mse_loss(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Mean squared error loss."""
    return float(np.mean((y_pred - y_true) ** 2))


def mse_gradient(y_pred: np.ndarray, y_true: np.ndarray) -> np.ndarray:
    """Gradient of MSE with respect to predictions."""
    n = y_pred.shape[0]
    return 2 * (y_pred - y_true) / n

6.7 — Backpropagation: The Chain Rule, Applied

Backpropagation is the chain rule applied systematically to the computational graph of the forward pass. Nothing more, nothing less. If you understand the chain rule from Chapter 2, you understand backpropagation. What this section adds is the specific pattern of applying it to neural network graphs.

The Computational Graph

The forward pass computes a sequence of operations:

$$\mathbf{x} \xrightarrow{\mathbf{W}^{(1)}, \mathbf{b}^{(1)}} \mathbf{z}^{(1)} \xrightarrow{\sigma} \mathbf{a}^{(1)} \xrightarrow{\mathbf{W}^{(2)}, \mathbf{b}^{(2)}} \mathbf{z}^{(2)} \xrightarrow{\sigma} \mathbf{a}^{(2)} \xrightarrow{\cdots} \hat{\mathbf{y}} \xrightarrow{\mathcal{L}} \text{loss}$$

Each arrow is a differentiable function. The computational graph makes the chain rule application mechanical: to compute $\frac{\partial \mathcal{L}}{\partial \theta}$ for any parameter $\theta$, trace the path from $\mathcal{L}$ back to $\theta$ and multiply the local gradients along the way.
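This path-tracing becomes concrete on a scalar graph. The sketch below (all numbers are arbitrary illustrative values) runs a tiny two-layer network forward, multiplies local gradients backward along the path from the loss to $w_1$, and verifies the result by finite differences:

```python
import numpy as np

# Tiny scalar graph: x -> z1 = w1*x + b1 -> a1 = tanh(z1)
#                      -> yhat = w2*a1 + b2 -> L = 0.5*(yhat - y)^2
x, y = 1.5, 0.0
w1, b1, w2, b2 = 0.4, 0.1, -0.7, 0.2

# Forward pass.
z1 = w1 * x + b1
a1 = np.tanh(z1)
yhat = w2 * a1 + b2

# Backward pass: multiply local gradients along the path from L to w1.
dL_dyhat = yhat - y              # gradient of 0.5*(yhat - y)^2
dL_da1 = dL_dyhat * w2           # through the second linear layer
dL_dz1 = dL_da1 * (1 - a1 ** 2)  # through tanh
dL_dw1 = dL_dz1 * x              # through the first linear layer

# Finite-difference check on w1.
eps = 1e-6
def loss(w1v: float) -> float:
    return 0.5 * (w2 * np.tanh(w1v * x + b1) + b2 - y) ** 2
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)

print(abs(numeric - dL_dw1))  # tiny
```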

Deriving the Gradients

We need four gradient expressions for each layer $\ell$:

1. Gradient of loss with respect to pre-activation $\mathbf{z}^{(\ell)}$:

For the output layer $L$ (with combined sigmoid + BCE): $$\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}} = \hat{\mathbf{y}} - \mathbf{y}$$

For hidden layers $\ell < L$: $$\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(\ell)}} \odot \sigma'(\mathbf{z}^{(\ell)})$$

where $\odot$ denotes elementwise multiplication.

2. Gradient of loss with respect to post-activation $\mathbf{a}^{(\ell)}$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(\ell)}} = \left(\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell+1)}}\right) \mathbf{W}^{(\ell+1)}$$

This comes from the linear transformation $\mathbf{z}^{(\ell+1)} = \mathbf{W}^{(\ell+1)} \mathbf{a}^{(\ell)} + \mathbf{b}^{(\ell+1)}$.

3. Gradient of loss with respect to weights $\mathbf{W}^{(\ell)}$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(\ell)}} = \left(\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell)}}\right)^T \mathbf{a}^{(\ell-1)}$$

For a minibatch of $n$ examples, $\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell)}}$ has shape $(n, n_\ell)$ and $\mathbf{a}^{(\ell-1)}$ has shape $(n, n_{\ell-1})$, giving $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(\ell)}}$ shape $(n_\ell, n_{\ell-1})$ — the same shape as $\mathbf{W}^{(\ell)}$.

4. Gradient of loss with respect to biases $\mathbf{b}^{(\ell)}$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(\ell)}} = \sum_{i=1}^{n} \frac{\partial \mathcal{L}}{\partial z_i^{(\ell)}}$$

The bias gradient is the sum of the pre-activation gradients across the batch.

Mathematical Foundation: Let us derive step 2 in full. The linear transformation is $\mathbf{z}^{(\ell+1)} = \mathbf{W}^{(\ell+1)} \mathbf{a}^{(\ell)} + \mathbf{b}^{(\ell+1)}$. For a single example, $\mathbf{z}^{(\ell+1)} \in \mathbb{R}^{n_{\ell+1}}$, $\mathbf{a}^{(\ell)} \in \mathbb{R}^{n_\ell}$, $\mathbf{W}^{(\ell+1)} \in \mathbb{R}^{n_{\ell+1} \times n_\ell}$. The Jacobian is $\frac{\partial \mathbf{z}^{(\ell+1)}}{\partial \mathbf{a}^{(\ell)}} = \mathbf{W}^{(\ell+1)}$. By the chain rule: $\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(\ell)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell+1)}} \cdot \frac{\partial \mathbf{z}^{(\ell+1)}}{\partial \mathbf{a}^{(\ell)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell+1)}} \cdot \mathbf{W}^{(\ell+1)}$. This is a vector-Jacobian product (VJP) — the gradient (a row vector) multiplied against the Jacobian, the same operation from Chapter 2, now applied to a specific computational graph.

The Backward Pass in Code

def backward(self, y_true: np.ndarray) -> Dict[str, np.ndarray]:
    """Backward pass: compute gradients for all parameters.

    Must be called after forward(). Uses cached activations.

    Args:
        y_true: True labels of shape (n_samples,) or (n_samples, 1).

    Returns:
        Dictionary of gradients with keys matching self.params.
    """
    grads: Dict[str, np.ndarray] = {}
    L = len(self.config.layer_dims) - 1
    n = y_true.shape[0]

    y_true = y_true.reshape(-1, 1) if y_true.ndim == 1 else y_true
    y_pred = self.cache[f"A{L}"]

    # Output layer gradient (combined sigmoid + BCE)
    if self.config.output_activation == "sigmoid":
        dZ = (y_pred - y_true) / n  # (n, n_L)
    else:
        # For identity output with MSE
        dZ = 2 * (y_pred - y_true) / n

    # Backpropagate through layers L, L-1, ..., 1
    for l in range(L, 0, -1):
        A_prev = self.cache[f"A{l-1}"]

        # Gradient w.r.t. weights: dL/dW = dZ^T @ A_prev
        grads[f"W{l}"] = dZ.T @ A_prev  # (n_l, n_{l-1})

        # Gradient w.r.t. bias: dL/db = sum of dZ over batch
        grads[f"b{l}"] = np.sum(dZ, axis=0)  # (n_l,)

        if l > 1:
            # Gradient w.r.t. previous activation: dL/dA = dZ @ W
            dA = dZ @ self.params[f"W{l}"]  # (n, n_{l-1})

            # Gradient through activation function
            act_grad_fn = ACTIVATION_GRADIENTS[self.config.activation]
            dZ = dA * act_grad_fn(self.cache[f"Z{l-1}"])  # (n, n_{l-1})

    return grads

Examine this code carefully. The backward pass mirrors the forward pass in reverse. At each layer, we compute three things: (1) the weight gradient $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(\ell)}}$, (2) the bias gradient $\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(\ell)}}$, and (3) the gradient flowing to the previous layer $\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(\ell-1)}}$. The third is needed to continue the chain.

Common Misconception: "Backpropagation is a learning algorithm." No. Backpropagation is a gradient computation algorithm. Learning happens when you use those gradients to update the weights (via SGD, Adam, or another optimizer). The distinction matters: the same backpropagation algorithm computes gradients regardless of which optimizer consumes them.

Computational Cost of Backpropagation

The backward pass costs roughly twice as many FLOPs as the forward pass. This is not obvious a priori, but follows from the structure of the chain rule: each layer's backward computation involves matrix multiplications with the same dimensions as its forward computation (transposed).

For a single layer with weight matrix $\mathbf{W} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ and batch size $n$:

  • Forward: $\mathbf{Z} = \mathbf{A}_{\text{prev}} \mathbf{W}^T$ costs $O(n \cdot n_\ell \cdot n_{\ell-1})$ FLOPs
  • Weight gradient: $\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \delta^T \mathbf{A}_{\text{prev}}$ costs $O(n \cdot n_\ell \cdot n_{\ell-1})$ FLOPs
  • Input gradient: $\frac{\partial \mathcal{L}}{\partial \mathbf{A}_{\text{prev}}} = \delta \mathbf{W}$ costs $O(n \cdot n_\ell \cdot n_{\ell-1})$ FLOPs

The backward pass through each layer requires two matrix multiplications (weight gradient and input gradient) compared to one for the forward pass. Therefore the total cost of one forward-backward iteration is approximately 3x the cost of a forward pass alone. In practice, the memory access pattern matters as much as the FLOP count: the backward pass reads both the cached activations and the weight matrices, making it more memory-bandwidth-intensive than the forward pass.
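This accounting is easy to encode. The helper below (layer_flops is an illustrative name; it counts one multiply-add as two FLOPs) returns the forward and backward FLOPs for a single dense layer and recovers the 3x total:

```python
def layer_flops(batch: int, n_in: int, n_out: int) -> tuple:
    """FLOPs for one dense layer, counting multiply-add as 2 FLOPs."""
    forward = 2 * batch * n_in * n_out  # Z = A_prev @ W.T
    backward = 2 * forward              # dW = dZ.T @ A_prev, dA = dZ @ W
    return forward, backward

fwd, bwd = layer_flops(batch=32, n_in=20, n_out=64)
print(fwd, bwd, (fwd + bwd) / fwd)  # 81920 163840 3.0
```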


6.8 — The Complete Training Loop

We now assemble the forward pass, loss, backward pass, and parameter update into a training loop.

# Add these methods to the NumpyMLP class

def update_params(
    self,
    grads: Dict[str, np.ndarray],
    lr: float,
) -> None:
    """Update parameters using vanilla SGD.

    Args:
        grads: Gradient dictionary from backward().
        lr: Learning rate.
    """
    for key in self.params:
        self.params[key] -= lr * grads[key]


def train(
    self,
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_val: np.ndarray,
    y_val: np.ndarray,
    epochs: int = 100,
    batch_size: int = 32,
    lr: float = 0.01,
    verbose: bool = True,
) -> Dict[str, List[float]]:
    """Train the network with mini-batch SGD.

    Args:
        X_train: Training features of shape (n_train, d).
        y_train: Training labels of shape (n_train,).
        X_val: Validation features of shape (n_val, d).
        y_val: Validation labels of shape (n_val,).
        epochs: Number of training epochs.
        batch_size: Mini-batch size.
        lr: Learning rate.
        verbose: Whether to print progress.

    Returns:
        Dictionary with training history.
    """
    history: Dict[str, List[float]] = {
        "train_loss": [],
        "val_loss": [],
        "train_acc": [],
        "val_acc": [],
    }
    n = X_train.shape[0]
    rng = np.random.default_rng(self.config.seed + 1)

    for epoch in range(epochs):
        # Shuffle training data
        perm = rng.permutation(n)
        X_shuffled = X_train[perm]
        y_shuffled = y_train[perm]

        # Mini-batch training
        epoch_loss = 0.0
        n_batches = 0
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            X_batch = X_shuffled[start:end]
            y_batch = y_shuffled[start:end]

            # Forward
            y_pred = self.forward(X_batch)

            # Loss
            batch_loss = binary_cross_entropy(
                y_pred.ravel(), y_batch.ravel()
            )
            epoch_loss += batch_loss

            # Backward
            grads = self.backward(y_batch.reshape(-1, 1))

            # Update
            self.update_params(grads, lr)
            n_batches += 1

        # Epoch metrics
        train_pred = self.forward(X_train)
        val_pred = self.forward(X_val)

        train_loss = binary_cross_entropy(
            train_pred.ravel(), y_train.ravel()
        )
        val_loss = binary_cross_entropy(
            val_pred.ravel(), y_val.ravel()
        )
        train_acc = np.mean(
            (train_pred.ravel() > 0.5) == y_train.ravel()
        )
        val_acc = np.mean(
            (val_pred.ravel() > 0.5) == y_val.ravel()
        )

        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["train_acc"].append(float(train_acc))
        history["val_acc"].append(float(val_acc))

        if verbose and (epoch % 10 == 0 or epoch == epochs - 1):
            print(
                f"Epoch {epoch:4d} | "
                f"Train Loss: {train_loss:.4f} | "
                f"Val Loss: {val_loss:.4f} | "
                f"Train Acc: {train_acc:.4f} | "
                f"Val Acc: {val_acc:.4f}"
            )

    return history

Let us train this network on a concrete problem — a synthetic click prediction dataset.

def generate_click_data(
    n_samples: int = 10_000,
    n_features: int = 20,
    seed: int = 42,
) -> Dict[str, np.ndarray]:
    """Generate synthetic click-prediction data for StreamRec.

    Features represent user-item interaction signals: user activity level,
    item popularity, category match score, time-of-day encoding, etc.

    The true decision boundary is nonlinear: click probability depends on
    interactions between features that a logistic regression cannot capture.

    Args:
        n_samples: Number of user-item pairs.
        n_features: Number of handcrafted features.
        seed: Random seed.

    Returns:
        Dictionary with train/val/test splits.
    """
    rng = np.random.default_rng(seed)

    X = rng.standard_normal((n_samples, n_features))

    # True click probability: nonlinear function of feature interactions
    logit = (
        0.5 * X[:, 0] * X[:, 1]           # User activity x item popularity
        - 0.3 * X[:, 2] ** 2               # Diminishing returns on category match
        + 0.8 * np.sin(X[:, 3] * np.pi)    # Time-of-day effect (periodic)
        + 0.4 * X[:, 4] * X[:, 5]          # Content type x user preference
        - 0.2 * np.abs(X[:, 6])            # Novelty penalty
        + 0.1 * np.sum(X[:, 7:12], axis=1) # Linear effects from remaining features
    )
    prob = sigmoid(logit)
    y = rng.binomial(1, prob)

    # Train/val/test split: 70/15/15
    n_train = int(0.7 * n_samples)
    n_val = int(0.15 * n_samples)

    return {
        "X_train": X[:n_train],
        "y_train": y[:n_train],
        "X_val": X[n_train:n_train + n_val],
        "y_val": y[n_train:n_train + n_val],
        "X_test": X[n_train + n_val:],
        "y_test": y[n_train + n_val:],
    }


# Generate data
data = generate_click_data()

# Build and train the MLP
config = MLPConfig(
    layer_dims=[20, 64, 32, 1],
    activation="relu",
    output_activation="sigmoid",
    seed=42,
)
model = NumpyMLP(config)

history = model.train(
    X_train=data["X_train"],
    y_train=data["y_train"],
    X_val=data["X_val"],
    y_val=data["y_val"],
    epochs=100,
    batch_size=64,
    lr=0.01,
)
# Expected output:
# Epoch    0 | Train Loss: 0.6925 | Val Loss: 0.6931 | Train Acc: 0.5274 | Val Acc: 0.5200
# Epoch   10 | Train Loss: 0.6401 | Val Loss: 0.6455 | Train Acc: 0.6431 | Val Acc: 0.6307
# ...
# Epoch   99 | Train Loss: 0.5682 | Val Loss: 0.5801 | Train Acc: 0.7162 | Val Acc: 0.7007

This is a working neural network. No framework. No autograd. Every gradient is explicit. When something goes wrong in a production model, this is the level of understanding you need to diagnose the problem.


6.9 — Weight Initialization: Why It Matters

The choice of initial weights determines whether the network can learn at all. Bad initialization causes gradients to vanish or explode from the very first step.

The Variance Propagation Argument

Consider the pre-activation at layer $\ell$:

$$z_j^{(\ell)} = \sum_{i=1}^{n_{\ell-1}} w_{ji}^{(\ell)} a_i^{(\ell-1)} + b_j^{(\ell)}$$

Assuming the weights and activations are independent with zero mean:

$$\text{Var}(z_j^{(\ell)}) = n_{\ell-1} \cdot \text{Var}(w_{ji}^{(\ell)}) \cdot \text{Var}(a_i^{(\ell-1)})$$

If we want the variance of the pre-activation to remain constant across layers (neither growing nor shrinking), we need:

$$n_{\ell-1} \cdot \text{Var}(w_{ji}^{(\ell)}) = 1$$

$$\text{Var}(w_{ji}^{(\ell)}) = \frac{1}{n_{\ell-1}}$$

This is fan-in initialization. The same analysis for the backward pass (requiring stable gradient variance) gives $\text{Var}(w) = 1/n_\ell$, which is fan-out initialization.

Xavier/Glorot Initialization

Glorot and Bengio (2010) proposed compromising between the forward and backward requirements:

$$\text{Var}(w) = \frac{2}{n_{\ell-1} + n_\ell}$$

For a uniform distribution: $w \sim U\left[-\sqrt{\frac{6}{n_{\ell-1} + n_\ell}}, \sqrt{\frac{6}{n_{\ell-1} + n_\ell}}\right]$.

For a normal distribution: $w \sim \mathcal{N}\left(0, \frac{2}{n_{\ell-1} + n_\ell}\right)$.
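As a quick standalone check (not part of the chapter's model code), the uniform bound above does produce the intended variance: a uniform distribution on $[-a, a]$ has variance $a^2/3$, so $a = \sqrt{6/(n_{\ell-1} + n_\ell)}$ gives exactly $2/(n_{\ell-1} + n_\ell)$.

```python
import numpy as np

n_in, n_out = 64, 32
target_var = 2.0 / (n_in + n_out)

# Xavier-uniform: U[-a, a] with a = sqrt(6 / (n_in + n_out)).
a = np.sqrt(6.0 / (n_in + n_out))
rng = np.random.default_rng(0)
W = rng.uniform(-a, a, size=(n_out, n_in))

# Var(U[-a, a]) = a^2 / 3 = 2 / (n_in + n_out).
print(f"target variance: {target_var:.5f}, empirical: {W.var():.5f}")
```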

This analysis assumes linear or tanh activations. For ReLU, which zeros out half its inputs, the analysis changes.

He Initialization

He et al. (2015) account for the fact that ReLU sets approximately half of the activations to zero, halving the variance:

$$\text{Var}(w) = \frac{2}{n_{\ell-1}}$$

For a normal distribution: $w \sim \mathcal{N}\left(0, \frac{2}{n_{\ell-1}}\right)$.

This is the standard initialization for ReLU networks and the default in PyTorch's kaiming_normal_.

def demonstrate_initialization_impact(
    n_layers: int = 10,
    width: int = 256,
    n_samples: int = 1000,
    input_dim: int = 20,
    seed: int = 42,
) -> Dict[str, List[float]]:
    """Show how initialization affects activation variance across layers.

    Args:
        n_layers: Number of hidden layers.
        width: Width of each hidden layer.
        n_samples: Number of random input samples.
        input_dim: Input dimension.
        seed: Random seed.

    Returns:
        Dictionary mapping initialization scheme to list of activation
        variances at each layer.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, input_dim))

    results: Dict[str, List[float]] = {}

    for name, var_fn in [
        ("too_small", lambda n_in, n_out: 0.001),
        ("too_large", lambda n_in, n_out: 1.0),
        ("xavier", lambda n_in, n_out: 2.0 / (n_in + n_out)),
        ("he", lambda n_in, n_out: 2.0 / n_in),
    ]:
        A = X.copy()
        variances = []

        dims = [input_dim] + [width] * n_layers
        for l in range(n_layers):
            n_in, n_out = dims[l], dims[l + 1]
            std = np.sqrt(var_fn(n_in, n_out))
            W = rng.normal(0, std, (n_out, n_in))
            Z = A @ W.T
            A = np.maximum(0, Z)  # ReLU
            variances.append(float(np.var(A)))

        results[name] = variances

    return results

# Run the experiment
init_results = demonstrate_initialization_impact()
for name, variances in init_results.items():
    print(f"{name:>10}: layer 1 var = {variances[0]:.4f}, "
          f"layer 10 var = {variances[-1]:.6f}")
# Expected output:
# too_small: layer 1 var = 0.0050, layer 10 var = 0.000000
# too_large: layer 1 var = 12.7628, layer 10 var = 1985128942.123456
#    xavier: layer 1 var = 0.4903, layer 10 var = 0.004213
#        he: layer 1 var = 0.9841, layer 10 var = 0.892317

The results are dramatic. With too-small initialization, activations vanish to zero by layer 10. With too-large initialization, they explode. Xavier works for tanh but undershoots for ReLU, which discards roughly half the signal variance at every layer, so activations still decay with depth. He initialization maintains stable variance across all 10 layers.

Mathematical Foundation: The factor of 2 in He initialization comes directly from the ReLU activation. If $z \sim \mathcal{N}(0, \sigma^2)$, then $\mathbb{E}[\text{ReLU}(z)^2] = \frac{\sigma^2}{2}$: ReLU zeros out the negative half of the symmetric distribution, which contributes half of the second moment. (It is the second moment, not the variance, that the He analysis tracks — after ReLU the activations are no longer zero-mean.) For GELU and Swish, which do not zero out the negative half completely, the correction factor is smaller — empirically, He initialization works well for both.
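A standalone Monte Carlo check of that factor of 2: for $z \sim \mathcal{N}(0, \sigma^2)$, the second moment $\mathbb{E}[\text{ReLU}(z)^2]$ — the quantity the He derivation tracks — comes out to $\sigma^2/2$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.7
z = rng.normal(0.0, sigma, size=1_000_000)
a = np.maximum(0.0, z)  # ReLU

second_moment = np.mean(a**2)  # estimates E[ReLU(z)^2]
print(f"E[ReLU(z)^2] = {second_moment:.4f}, sigma^2 / 2 = {sigma**2 / 2:.4f}")
```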


6.10 — The Same Network in PyTorch

We now implement the identical network in PyTorch. The goal is not merely to "use PyTorch" but to build a one-to-one mapping: every numpy operation from the previous sections has a PyTorch counterpart, and PyTorch's autograd computes the exact same gradients that we computed by hand.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class PyTorchMLP(nn.Module):
    """The same MLP architecture, now in PyTorch.

    Mapping to NumpyMLP:
    - self.params[f"W{l}"] → self.layers[l-1].weight
    - self.params[f"b{l}"] → self.layers[l-1].bias
    - self.forward() → same structure, autograd handles backward()
    """

    def __init__(self, config: MLPConfig) -> None:
        super().__init__()
        dims = config.layer_dims

        layers = []
        for l in range(len(dims) - 1):
            linear = nn.Linear(dims[l], dims[l + 1])

            # Match our numpy initialization
            if config.activation == "relu":
                nn.init.kaiming_normal_(
                    linear.weight, mode="fan_in", nonlinearity="relu"
                )
            else:
                nn.init.xavier_normal_(linear.weight)
            nn.init.zeros_(linear.bias)
            layers.append(linear)

        self.layers = nn.ModuleList(layers)

        # Activation functions
        act_map = {
            "relu": nn.ReLU(),
            "sigmoid": nn.Sigmoid(),
            "tanh": nn.Tanh(),
            "gelu": nn.GELU(),
        }
        self.activation = act_map[config.activation]
        self.output_activation = act_map[config.output_activation]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass — identical structure to NumpyMLP.forward().

        Args:
            x: Input tensor of shape (batch_size, input_dim).

        Returns:
            Output tensor of shape (batch_size, output_dim).
        """
        for layer in self.layers[:-1]:
            x = self.activation(layer(x))  # Linear + activation
        x = self.output_activation(self.layers[-1](x))  # Output layer
        return x

Training in PyTorch

def train_pytorch_mlp(
    config: MLPConfig,
    data: Dict[str, np.ndarray],
    epochs: int = 100,
    batch_size: int = 64,
    lr: float = 0.01,
    verbose: bool = True,
) -> Dict[str, List[float]]:
    """Train the PyTorch MLP and return history.

    Args:
        config: MLP configuration.
        data: Dictionary with train/val splits (numpy arrays).
        epochs: Number of training epochs.
        batch_size: Mini-batch size.
        lr: Learning rate.
        verbose: Whether to print progress.

    Returns:
        Training history dictionary.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = PyTorchMLP(config).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.BCELoss()

    # Convert numpy data to PyTorch tensors
    X_train = torch.tensor(data["X_train"], dtype=torch.float32)
    y_train = torch.tensor(
        data["y_train"], dtype=torch.float32
    ).unsqueeze(1)
    X_val = torch.tensor(data["X_val"], dtype=torch.float32)
    y_val = torch.tensor(
        data["y_val"], dtype=torch.float32
    ).unsqueeze(1)

    train_dataset = TensorDataset(X_train, y_train)
    train_loader = DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True
    )

    history: Dict[str, List[float]] = {
        "train_loss": [],
        "val_loss": [],
        "train_acc": [],
        "val_acc": [],
    }

    for epoch in range(epochs):
        # Training
        model.train()
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            # Forward — same as: y_pred = model.forward(X_batch)
            y_pred = model(X_batch)

            # Loss — same as: loss = binary_cross_entropy(y_pred, y_batch)
            loss = criterion(y_pred, y_batch)

            # Backward — replaces our manual backward() method
            optimizer.zero_grad()  # Clear accumulated gradients
            loss.backward()        # Compute gradients via autograd

            # Update — replaces our manual update_params()
            optimizer.step()

        # Evaluation
        model.eval()
        with torch.no_grad():
            train_pred = model(X_train.to(device))
            val_pred = model(X_val.to(device))

            train_loss = criterion(train_pred, y_train.to(device)).item()
            val_loss = criterion(val_pred, y_val.to(device)).item()
            train_acc = (
                (train_pred.cpu() > 0.5) == y_train
            ).float().mean().item()
            val_acc = (
                (val_pred.cpu() > 0.5) == y_val
            ).float().mean().item()

        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["train_acc"].append(train_acc)
        history["val_acc"].append(val_acc)

        if verbose and (epoch % 10 == 0 or epoch == epochs - 1):
            print(
                f"Epoch {epoch:4d} | "
                f"Train Loss: {train_loss:.4f} | "
                f"Val Loss: {val_loss:.4f} | "
                f"Train Acc: {train_acc:.4f} | "
                f"Val Acc: {val_acc:.4f}"
            )

    return history

The Mapping: Numpy to PyTorch

Numpy (manual)                               | PyTorch (autograd)
---------------------------------------------|-------------------------------------------
Z = A @ W.T + b                              | layer(x) (nn.Linear)
A = relu(Z)                                  | nn.ReLU()(x)
loss = binary_cross_entropy(y_pred, y_true)  | nn.BCELoss()(y_pred, y_true)
grads = model.backward(y_true)               | loss.backward()
model.update_params(grads, lr)               | optimizer.step()
self.cache[f"Z{l}"] (manual caching)         | autograd records the graph automatically

The correspondence is exact. loss.backward() computes the same chain rule application that our backward() method implements manually. The difference is that PyTorch constructs the computational graph dynamically during the forward pass and traverses it in reverse during backward(). This is automatic differentiation — not numerical differentiation (finite differences) and not symbolic differentiation (formula manipulation). It is the algorithmic application of the chain rule to a recorded sequence of operations.

Understanding Automatic Differentiation

Automatic differentiation (autodiff) deserves a precise treatment because it is often confused with the two approaches it is not.

Numerical differentiation approximates derivatives using finite differences: $f'(x) \approx \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$. It is simple to implement but expensive ($O(p)$ function evaluations for $p$ parameters) and suffers from inherent numerical error: too small an $\epsilon$ causes floating-point cancellation; too large an $\epsilon$ causes truncation error.

Symbolic differentiation manipulates algebraic expressions to produce exact derivative formulas (like Mathematica or Wolfram Alpha). It is exact but can suffer from "expression swell" — the derivative formula may be exponentially larger than the original expression — and it requires the computation to be expressible as a closed-form formula, which excludes programs with control flow (if/else, loops).

Automatic differentiation decomposes any computation (including programs with branches and loops) into a sequence of elementary operations and applies the chain rule at each step. The key insight: the chain rule factors a complicated derivative into a product of simple local derivatives, each of which is known analytically. The composition is computed numerically (by multiplying numbers, not symbols), but each local derivative is exact. The result is exact up to floating-point arithmetic — no truncation error, no expression swell.
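To make the "recorded sequence of operations" concrete, here is a minimal tape-based reverse-mode autodiff in roughly thirty lines — an illustrative sketch, not how PyTorch is implemented internally. Each operation stores its inputs and their local derivatives during the forward pass; the backward pass walks the graph in reverse topological order, multiplying and accumulating.

```python
import math

class Var:
    """A value that records local derivatives for reverse-mode autodiff."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (input Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def sin(x):
    return Var(math.sin(x.value), [(x, math.cos(x.value))])

def backward(out):
    """Chain rule, applied numerically over the recorded graph."""
    topo, seen = [], set()
    def build(v):
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                build(parent)
            topo.append(v)
    build(out)
    out.grad = 1.0
    for v in reversed(topo):  # reverse topological order
        for parent, local in v.parents:
            parent.grad += local * v.grad

# f(x) = x^2 * sin(x); analytic derivative: 2x*sin(x) + x^2*cos(x)
x = Var(1.3)
f = x * x * sin(x)
backward(f)
analytic = 2 * 1.3 * math.sin(1.3) + 1.3**2 * math.cos(1.3)
print(x.grad, analytic)  # agree to floating-point precision
```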

PyTorch implements reverse-mode autodiff (also called the "backward mode" or the "adjoint method"). It is equivalent to backpropagation for neural networks, but the concept is more general: reverse-mode autodiff works for any differentiable computation, not just layered neural networks. JAX offers both forward-mode (efficient when input dimension $<$ output dimension) and reverse-mode (efficient when output dimension $<$ input dimension). For neural networks, where we differentiate a scalar loss with respect to millions of parameters, reverse mode is overwhelmingly more efficient.

Production Reality: In production, you will never implement backpropagation by hand. The value of having done so is diagnostic: when your PyTorch model produces NaN losses, you know to check whether a pre-activation value caused overflow in an activation function. When gradients vanish, you know to check whether saturated sigmoids are zeroing out the chain of derivatives. The numpy implementation gives you a mental model that the framework cannot provide.
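In the numpy setting of this chapter, the first line of defense against such failures can be a few lines of code. The helper below is hypothetical (not part of NumpyMLP): it scans a gradient dictionary of the kind backward() returns for non-finite values and outsized norms before an update is applied.

```python
import numpy as np
from typing import Dict, List

def audit_grads(grads: Dict[str, np.ndarray], max_norm: float = 1e3) -> List[str]:
    """Flag NaN/inf entries and suspiciously large norms in a gradient dict."""
    problems = []
    for name, g in grads.items():
        if not np.all(np.isfinite(g)):
            problems.append(f"{name}: contains NaN or inf")
        elif np.linalg.norm(g) > max_norm:
            problems.append(f"{name}: norm {np.linalg.norm(g):.3e} exceeds {max_norm:g}")
    return problems

# Example: one healthy gradient, one poisoned by overflow upstream
grads = {"W1": np.array([[0.1, -0.2]]), "W2": np.array([[np.nan, 1.0]])}
print(audit_grads(grads))  # ['W2: contains NaN or inf']
```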

Verifying Equivalence

def verify_numpy_pytorch_equivalence(
    config: MLPConfig,
    X: np.ndarray,
    atol: float = 1e-6,
) -> None:
    """Verify that the numpy and PyTorch implementations produce
    identical forward pass outputs when given the same weights.

    Args:
        config: MLP configuration.
        X: Input data of shape (n, d).
        atol: Absolute tolerance for floating point comparison.
    """
    # Build numpy model
    np_model = NumpyMLP(config)

    # Build PyTorch model and copy weights from numpy
    pt_model = PyTorchMLP(config)
    with torch.no_grad():
        for l in range(1, len(config.layer_dims)):
            pt_model.layers[l - 1].weight.copy_(
                torch.tensor(np_model.params[f"W{l}"], dtype=torch.float32)
            )
            pt_model.layers[l - 1].bias.copy_(
                torch.tensor(np_model.params[f"b{l}"], dtype=torch.float32)
            )

    # Forward pass
    np_out = np_model.forward(X)
    pt_out = pt_model(
        torch.tensor(X, dtype=torch.float32)
    ).detach().numpy()

    max_diff = np.max(np.abs(np_out - pt_out))
    print(f"Maximum difference between numpy and PyTorch: {max_diff:.2e}")
    assert max_diff < atol, (
        f"Outputs differ by {max_diff:.2e}, exceeding tolerance {atol}"
    )
    print("Verified: numpy and PyTorch forward passes are equivalent.")

6.11 — Gradient Checking: Trust but Verify

Gradient checking is the practice of verifying analytically computed gradients against numerical approximations. It is the single most reliable way to detect bugs in a backpropagation implementation.

The idea is simple. The derivative of a function $f$ at a point $\theta$ is:

$$f'(\theta) \approx \frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$$

This centered difference approximation has error $O(\epsilon^2)$, compared to $O(\epsilon)$ for the one-sided difference. With $\epsilon = 10^{-5}$, the numerical gradient should match the analytical gradient to about 7 decimal places.
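The accuracy gap between the two approximations is easy to see on a function whose derivative is known exactly (a standalone check using $f(x) = e^x$, whose derivative is itself):

```python
import numpy as np

f, x, eps = np.exp, 1.0, 1e-5
true_grad = np.exp(1.0)  # d/dx e^x = e^x

one_sided = (f(x + eps) - f(x)) / eps             # error O(eps)
centered = (f(x + eps) - f(x - eps)) / (2 * eps)  # error O(eps^2)

print(f"one-sided error: {abs(one_sided - true_grad):.2e}")
print(f"centered  error: {abs(centered - true_grad):.2e}")
```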

def gradient_check(
    model: NumpyMLP,
    X: np.ndarray,
    y: np.ndarray,
    eps: float = 1e-5,
    tolerance: float = 1e-5,
) -> Dict[str, float]:
    """Check analytical gradients against numerical gradients.

    Computes the relative error between the analytical gradient (from
    backpropagation) and the numerical gradient (from centered finite
    differences) for each parameter.

    Args:
        model: Trained or untrained NumpyMLP.
        X: Input data of shape (n, d).
        y: Labels of shape (n,).
        eps: Step size for finite differences.
        tolerance: Maximum acceptable relative error.

    Returns:
        Dictionary mapping parameter name to relative error.
    """
    # Compute analytical gradients
    model.forward(X)
    analytical_grads = model.backward(y.reshape(-1, 1))

    relative_errors: Dict[str, float] = {}

    for param_name in model.params:
        param = model.params[param_name]
        grad = analytical_grads[param_name]
        numerical_grad = np.zeros_like(param)

        # Iterate over every element of the parameter
        it = np.nditer(param, flags=["multi_index"])
        while not it.finished:
            idx = it.multi_index
            original_value = param[idx]

            # f(theta + eps)
            param[idx] = original_value + eps
            y_pred_plus = model.forward(X)
            loss_plus = binary_cross_entropy(
                y_pred_plus.ravel(), y.ravel()
            )

            # f(theta - eps)
            param[idx] = original_value - eps
            y_pred_minus = model.forward(X)
            loss_minus = binary_cross_entropy(
                y_pred_minus.ravel(), y.ravel()
            )

            # Centered difference
            numerical_grad[idx] = (loss_plus - loss_minus) / (2 * eps)

            # Restore
            param[idx] = original_value
            it.iternext()

        # Relative error
        numerator = np.linalg.norm(grad - numerical_grad)
        denominator = (
            np.linalg.norm(grad) + np.linalg.norm(numerical_grad) + 1e-8
        )
        rel_error = numerator / denominator
        relative_errors[param_name] = float(rel_error)

        status = "PASS" if rel_error < tolerance else "FAIL"
        print(
            f"{param_name}: relative error = {rel_error:.2e} [{status}]"
        )

    return relative_errors

Implementation Note: Gradient checking is expensive — $O(p)$ forward passes for $p$ parameters — so run it on a small network with a small batch. A relative error below $10^{-5}$ indicates correct gradients. Between $10^{-5}$ and $10^{-3}$, inspect carefully (floating point issues or subtle bugs). Above $10^{-3}$, your backward pass has a bug. Always gradient-check a new backpropagation implementation before training.


6.12 — Diagnosing Pathologies

Three failure modes plague neural network training. Understanding their mechanics is essential for debugging.

Vanishing Gradients

In a network with $L$ layers and sigmoid activations, the gradient of the loss with respect to layer 1's weights involves a product of $L-1$ Jacobians:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} \propto \prod_{\ell=2}^{L} \text{diag}(\sigma'(\mathbf{z}^{(\ell)})) \cdot \mathbf{W}^{(\ell)}$$

Since $\sigma'(z) \leq 0.25$ for the sigmoid, each factor in the product is bounded by $0.25 \cdot \|\mathbf{W}^{(\ell)}\|$. If $\|\mathbf{W}^{(\ell)}\| < 4$ (which is typical), the product shrinks exponentially with depth. For a 10-layer sigmoid network, gradients at layer 1 are attenuated by a factor of roughly $0.25^9 \approx 3.8 \times 10^{-6}$.

The result: lower layers learn orders of magnitude more slowly than upper layers. They are effectively frozen.
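Both numbers in that argument are easy to verify directly (standalone sketch):

```python
import numpy as np

z = np.linspace(-10, 10, 10001)
sig = 1.0 / (1.0 + np.exp(-z))
dsig = sig * (1.0 - sig)  # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))

print(dsig.max())  # 0.25, attained at z = 0
print(0.25 ** 9)   # ~3.8e-06: worst-case attenuation over 9 layers
```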

def diagnose_vanishing_gradients(
    layer_dims: List[int],
    activation: str = "sigmoid",
    n_samples: int = 500,
    seed: int = 42,
) -> Dict[str, List[float]]:
    """Demonstrate gradient magnitude across layers.

    Args:
        layer_dims: Network architecture.
        activation: Activation function to use.
        n_samples: Number of random input samples.
        seed: Random seed.

    Returns:
        Dictionary with gradient norms at each layer.
    """
    config = MLPConfig(
        layer_dims=layer_dims,
        activation=activation,
        output_activation="sigmoid",
        seed=seed,
    )
    model = NumpyMLP(config)

    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, layer_dims[0]))
    y = rng.binomial(1, 0.5, n_samples)

    model.forward(X)
    grads = model.backward(y.reshape(-1, 1))

    grad_norms = {}
    for l in range(1, len(layer_dims)):
        norm = np.linalg.norm(grads[f"W{l}"])
        grad_norms[f"Layer {l}"] = float(norm)
        print(f"Layer {l} gradient norm: {norm:.6e}")

    return grad_norms


# Sigmoid network: gradients vanish
print("=== Sigmoid (10 layers) ===")
diagnose_vanishing_gradients(
    [20, 64, 64, 64, 64, 64, 64, 64, 64, 64, 1],
    activation="sigmoid",
)
# Expected: Layer 1 gradient norm orders of magnitude smaller than Layer 10

print("\n=== ReLU (10 layers) ===")
diagnose_vanishing_gradients(
    [20, 64, 64, 64, 64, 64, 64, 64, 64, 64, 1],
    activation="relu",
)
# Expected: Gradient norms remain comparable across layers

Exploding Gradients

The converse problem: if $\|\mathbf{W}^{(\ell)}\|$ is large enough, the product of Jacobians grows exponentially with depth. Gradients become enormous, parameter updates become catastrophically large, and the network diverges (loss goes to NaN or infinity).

Exploding gradients are diagnosed by monitoring gradient norms during training. The standard fix is gradient clipping: rescaling the gradient vector if its norm exceeds a threshold.

$$\mathbf{g} \leftarrow \begin{cases} \mathbf{g} & \text{if } \|\mathbf{g}\| \leq \tau \\ \tau \cdot \frac{\mathbf{g}}{\|\mathbf{g}\|} & \text{if } \|\mathbf{g}\| > \tau \end{cases}$$

def clip_gradients(
    grads: Dict[str, np.ndarray],
    max_norm: float = 1.0,
) -> Dict[str, np.ndarray]:
    """Clip gradients by global norm.

    Args:
        grads: Dictionary of gradient arrays.
        max_norm: Maximum allowed gradient norm.

    Returns:
        Clipped gradient dictionary.
    """
    # Compute global norm
    total_norm = np.sqrt(
        sum(np.sum(g**2) for g in grads.values())
    )

    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        return {k: v * scale for k, v in grads.items()}
    return grads

Dead Neurons

A ReLU neuron is "dead" when its pre-activation $z$ is negative for every example in the training set. Since $\text{ReLU}'(z) = 0$ for $z < 0$, the gradient flowing through this neuron is zero, and its weights never update. It is permanently stuck.

Dead neurons typically arise from large learning rates that push weights into a regime where the pre-activation is always negative, or from poor initialization.

def count_dead_neurons(
    model: NumpyMLP,
    X: np.ndarray,
    threshold: float = 0.0,
) -> Dict[str, int]:
    """Count neurons that are inactive (always produce zero output)
    across the entire dataset.

    Args:
        model: A trained NumpyMLP with ReLU activation.
        X: Input data (entire training set).
        threshold: Activation threshold below which a neuron is
            considered dead.

    Returns:
        Dictionary mapping layer name to dead neuron count.
    """
    model.forward(X)
    dead_counts = {}

    L = len(model.config.layer_dims) - 1
    for l in range(1, L):  # Hidden layers only
        activations = model.cache[f"A{l}"]  # (n_samples, n_neurons)
        # A neuron is dead if it outputs <= threshold for ALL samples
        max_per_neuron = np.max(activations, axis=0)
        n_dead = int(np.sum(max_per_neuron <= threshold))
        n_total = activations.shape[1]
        dead_counts[f"Layer {l}"] = n_dead
        print(
            f"Layer {l}: {n_dead}/{n_total} dead neurons "
            f"({100*n_dead/n_total:.1f}%)"
        )

    return dead_counts

Production Reality: In large-scale training, monitoring gradient norms per layer, activation distributions, and dead neuron percentages is standard practice. Tools like Weights & Biases and TensorBoard display these diagnostics in real time. The numpy implementations above show you what these tools compute internally.


6.13 — The Universal Approximation Theorem

The universal approximation theorem (Cybenko 1989; Hornik, Stinchcombe, and White 1989) is the theoretical foundation for neural network expressiveness.

What It Says

A single hidden layer with a sufficient number of neurons and a non-polynomial activation function can approximate any continuous function on a compact subset of $\mathbb{R}^d$ to arbitrary accuracy.

Formally: for any continuous function $f: [0, 1]^d \to \mathbb{R}$, any $\epsilon > 0$, and any non-polynomial activation function $\sigma$ (sigmoid, ReLU, tanh, and most common activations qualify), there exists a single-hidden-layer network:

$$g(\mathbf{x}) = \sum_{j=1}^{N} c_j \, \sigma\left(\mathbf{w}_j^T \mathbf{x} + b_j\right)$$

such that $\sup_{\mathbf{x} \in [0,1]^d} |f(\mathbf{x}) - g(\mathbf{x})| < \epsilon$.
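The existence claim can be made tangible with a standalone sketch: fix random hidden-layer parameters $\mathbf{w}_j, b_j$ and fit only the output weights $c_j$ by linear least squares. Even this crude procedure drives the sup error down as the width $N$ grows. All numbers below (the target function, the weight scales) are illustrative choices, not prescribed by the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x: np.ndarray) -> np.ndarray:
    """A continuous function on [0, 1] to approximate."""
    return np.sin(2 * np.pi * x) + 0.5 * x

x = np.linspace(0.0, 1.0, 400)[:, None]
errors = {}
for N in (10, 50, 200):  # hidden width
    # Random ReLU features sigma(w_j x + b_j); only c_j are fit.
    w = rng.normal(0.0, 10.0, (1, N))
    b = rng.uniform(-10.0, 10.0, N)
    H = np.maximum(0.0, x @ w + b)  # shape (400, N)
    c, *_ = np.linalg.lstsq(H, target(x).ravel(), rcond=None)
    errors[N] = np.max(np.abs(H @ c - target(x).ravel()))
    print(f"N = {N:4d}: sup error on grid = {errors[N]:.4f}")
```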

What It Does Not Say

The theorem is an existence result. It guarantees that an approximating network exists. It says nothing about:

  1. How many neurons $N$ are needed. The width $N$ required to achieve accuracy $\epsilon$ can be exponentially large in the input dimension $d$. A network that approximates a function over $[0,1]^{100}$ to within $\epsilon = 0.01$ might require more neurons than atoms in the universe.

  2. Whether gradient descent can find the right weights. The theorem says the parameters exist. It does not say SGD (or any other optimizer) will find them. The loss landscape might have exponentially many local minima or saddle points that trap the optimizer.

  3. Sample efficiency. A network with $N$ parameters generally needs on the order of $N$ training examples to avoid overfitting (as a rough parameter-counting heuristic), and usually many more. The theorem does not say how much data is required.

  4. Depth vs. width tradeoffs. The theorem is about single-hidden-layer networks. Deeper networks can represent certain function classes exponentially more efficiently than shallow ones (this is the depth separation theorem of Eldan and Shamir, 2016). In practice, depth provides compositional structure that makes learning tractable.

Common Misconception: "The universal approximation theorem proves that neural networks can learn anything." It proves that a wide enough single-layer network can represent any continuous function. Representation is necessary but not sufficient for learning. Learning requires that (1) the function class is tractable for optimization, (2) sufficient data is available, and (3) the training procedure converges to a good solution. The theorem gives no guarantees about any of these.

Advanced Sidebar: The universal approximation theorem has been extended in several directions. Leshno et al. (1993) proved that any non-polynomial activation suffices, removing the boundedness requirement of Cybenko's original proof and thereby covering ReLU. Kidger and Lyons (2020) proved a universal approximation result for depth: networks with width $d+1$ (only slightly wider than the input) but arbitrary depth are universal approximators. The depth result suggests that depth — not width — is the more fundamental source of expressiveness, aligning with empirical practice where deep narrow networks outperform shallow wide ones.


6.14 — Progressive Project M2: Click-Prediction MLP for StreamRec

The StreamRec recommendation system needs a click-prediction model. In Chapter 1, we built a matrix factorization baseline. In Chapter 4, we used mutual information for feature ranking. Now we build the neural baseline: an MLP that predicts click probability from handcrafted user and item features.

Problem Setup

StreamRec has handcrafted features for each user-item pair:

  • User features (10): activity level, subscription tier, session count, average session length, content diversity score, recency of last visit, signup age, platform (mobile/desktop/tablet encoded), preferred content category (one-hot), time-zone offset
  • Item features (10): popularity score, freshness (days since publish), category embedding (compressed), creator reputation, content length, engagement rate, share rate, completion rate among similar users, content type, trending score

The target is binary: did the user click (1) or not (0)?

Baseline Comparison: Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, log_loss


def logistic_regression_baseline(
    data: Dict[str, np.ndarray],
) -> Dict[str, float]:
    """Train a logistic regression baseline for comparison.

    Args:
        data: Dictionary with train/val/test splits.

    Returns:
        Dictionary with evaluation metrics.
    """
    lr_model = LogisticRegression(max_iter=1000, random_state=42)
    lr_model.fit(data["X_train"], data["y_train"])

    metrics = {}
    for split in ["train", "val", "test"]:
        y_true = data[f"y_{split}"]
        y_prob = lr_model.predict_proba(data[f"X_{split}"])[:, 1]
        metrics[f"{split}_auc"] = roc_auc_score(y_true, y_prob)
        metrics[f"{split}_logloss"] = log_loss(y_true, y_prob)

    return metrics


# Run baseline
data = generate_click_data(n_samples=10_000, n_features=20, seed=42)
lr_metrics = logistic_regression_baseline(data)
print(f"Logistic Regression — Val AUC: {lr_metrics['val_auc']:.4f}, "
      f"Val LogLoss: {lr_metrics['val_logloss']:.4f}")

MLP Model

# Numpy MLP
config = MLPConfig(
    layer_dims=[20, 128, 64, 32, 1],
    activation="relu",
    output_activation="sigmoid",
    seed=42,
)
np_model = NumpyMLP(config)
np_history = np_model.train(
    X_train=data["X_train"],
    y_train=data["y_train"],
    X_val=data["X_val"],
    y_val=data["y_val"],
    epochs=200,
    batch_size=64,
    lr=0.005,
)

# Evaluate on test set
test_pred = np_model.forward(data["X_test"]).ravel()
np_test_auc = roc_auc_score(data["y_test"], test_pred)
np_test_logloss = binary_cross_entropy(test_pred, data["y_test"])
print(f"Numpy MLP — Test AUC: {np_test_auc:.4f}, "
      f"Test LogLoss: {np_test_logloss:.4f}")

# PyTorch MLP (same architecture)
pt_history = train_pytorch_mlp(
    config=config,
    data=data,
    epochs=200,
    batch_size=64,
    lr=0.005,
)

The MLP should outperform logistic regression on this data because the true decision boundary is nonlinear — it involves feature interactions ($x_0 \cdot x_1$), quadratic terms ($x_2^2$), and periodic effects ($\sin(x_3 \pi)$) that a linear model cannot capture. This is the fundamental value proposition of neural networks: learning nonlinear decision boundaries from data.

In Chapter 7, we will take this same MLP and apply proper training techniques — batch normalization, dropout, learning rate scheduling — to improve both convergence speed and final performance.


Summary

This chapter built a neural network from the ground up. The key ideas, in the order they will matter for the rest of this book:

  1. A neural network is function composition. Linear transformations interleaved with nonlinear activations. Without the nonlinearity, multiple layers collapse to one.

  2. Backpropagation is the chain rule. No more, no less. The computational graph records the forward operations; the backward pass traverses them in reverse, multiplying local Jacobians.

  3. Activation functions determine gradient flow. Sigmoid saturates and causes vanishing gradients. ReLU avoids saturation but can die. GELU and Swish offer smooth compromises.

  4. Initialization determines whether learning can begin. He initialization for ReLU maintains stable variance across layers. Xavier for tanh and sigmoid.

  5. The universal approximation theorem is an existence result. It guarantees that a wide enough network can represent any continuous function, but says nothing about whether gradient descent will find the right parameters or how much data is needed.

  6. PyTorch autograd computes the same gradients you computed by hand. The framework abstracts away the backward pass, but the mathematics — and the failure modes — are identical.

  7. Gradient checking is a debugging discipline. Before trusting a backpropagation implementation, verify it against numerical differentiation.

The numpy implementation in this chapter is not production code. But the understanding it builds — of what happens at every step of forward and backward computation — is what separates a practitioner who can debug a failing model from one who cannot.

In Chapter 7, we turn to the craft of making this basic architecture train reliably and generalize well.