35 min read

> "You don't understand anything until you learn it more than one way."

Learning Objectives

Understand the biological inspiration behind artificial neural networks
Implement a perceptron and multi-layer perceptron from scratch in NumPy
Master key activation functions and their properties
Derive and implement the backpropagation algorithm step by step
Compute forward and backward passes for arbitrary network architectures
Transition from NumPy to PyTorch for neural network construction
Build, train, and evaluate neural networks using PyTorch's nn.Module

In This Chapter

11.1 Biological Inspiration and the Artificial Neuron
11.2 Activation Functions
11.3 Multi-Layer Perceptrons and the Forward Pass
11.4 Loss Functions
11.5 Backpropagation: The Learning Algorithm
11.6 Implementing a Neural Network in NumPy
11.7 Introduction to PyTorch
11.8 Putting It All Together
Summary

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

"You don't understand anything until you learn it more than one way." --- Marvin Minsky, co-founder of the MIT AI Laboratory

Chapter 11: Neural Networks from Scratch

In Chapters 9 and 10, you learned how gradient descent optimizes a cost function and how logistic regression draws a linear decision boundary between classes. These tools are powerful, but they share a fundamental limitation: they can only model linear relationships. If you try to separate two classes that swirl around each other in a spiral, or predict a target that depends on the product of two features, a single linear model will fail.

Neural networks shatter this limitation. By stacking layers of simple computational units---each one little more than a logistic regression node---and connecting them with nonlinear activation functions, neural networks can approximate virtually any continuous function. This is not a vague promise; it is a mathematical theorem (the universal approximation theorem) that we will encounter later in this chapter.

In this chapter, you will build a neural network from the ground up. We begin with the biological neuron that inspired the entire field, move through the perceptron and its limitations, and then construct multi-layer perceptrons that overcome those limitations. You will implement the forward pass, choose loss functions, and---most importantly---derive and implement the backpropagation algorithm that makes training feasible. Every line of code will be written first in pure NumPy so that you see exactly what happens inside the black box. Then, we will transition to PyTorch, the modern framework that automates the tedious parts while giving you full control over the interesting ones.

By the end of this chapter, you will have both the theoretical understanding and the practical skills to design, implement, train, and debug neural networks. This foundation will serve you through every subsequent chapter in this book, from convolutional networks in Chapter 13 to transformers in Chapter 16.

11.1 Biological Inspiration and the Artificial Neuron

11.1.1 The Biological Neuron

The human brain contains roughly 86 billion neurons, each connected to thousands of others through synapses. A biological neuron operates through a deceptively simple process:

Dendrites receive electrochemical signals from other neurons.
The cell body (soma) integrates these signals.
If the combined signal exceeds a threshold, the neuron fires, sending an electrical impulse down its axon.
The signal crosses synapses to reach the dendrites of downstream neurons.

The strength of each synapse---how much influence one neuron's signal has on the next---changes over time through a process called synaptic plasticity. This is the biological basis of learning: frequently used pathways become stronger, while neglected ones weaken. Donald Hebb summarized this in 1949 with his famous principle: "Neurons that fire together wire together."

11.1.2 From Biology to Mathematics

Warren McCulloch and Walter Pitts proposed the first mathematical model of a neuron in 1943. Their McCulloch-Pitts neuron accepts binary inputs, applies binary weights, sums the result, and fires if the sum exceeds a threshold. While historically significant, this model is too rigid for practical use because the weights and threshold are fixed, not learned.

The artificial neuron used in modern neural networks is a direct generalization. Given an input vector x = [x_1, x_2, ..., x_n]^T and a weight vector w = [w_1, w_2, ..., w_n]^T, the neuron computes:

$$ z = \mathbf{w}^T \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b $$

where: - w is the weight vector (analogous to synaptic strengths), - x is the input vector (analogous to incoming signals), - b is the bias term (analogous to the firing threshold), and - z is the pre-activation value.

The neuron then applies a nonlinear activation function f to produce the output:

$$ a = f(z) $$

This is the fundamental building block of every neural network. If you understand this single computation, you understand the atom from which all deep learning architectures are built.

11.1.3 The Perceptron

Frank Rosenblatt introduced the perceptron in 1958 as the first trainable artificial neuron. The perceptron uses a step activation function:

$$ f(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases} $$

The perceptron learning algorithm adjusts weights based on classification errors:

$$ \mathbf{w} \leftarrow \mathbf{w} + \eta (y - \hat{y}) \mathbf{x} $$

where: - eta is the learning rate (a small positive number), - y is the true label (0 or 1), and - y_hat is the predicted label.

Worked example. Consider a perceptron with two inputs and initial weights w = [0.5, -0.3]^T, bias b = 0.1, and learning rate eta = 0.1. Given input x = [1.0, 2.0]^T with true label y = 1:

Compute pre-activation: z = 0.5(1.0) + (-0.3)(2.0) + 0.1 = 0.5 - 0.6 + 0.1 = 0.0
Apply step function: y_hat = f(0.0) = 1 (since z >= 0)
Compute error: y - y_hat = 1 - 1 = 0
Since the error is zero, no weight update is needed.

Now suppose the true label were y = 0. Then the error would be 0 - 1 = -1, and we would update: - w_1 = 0.5 + 0.1(-1)(1.0) = 0.4 - w_2 = -0.3 + 0.1(-1)(2.0) = -0.5 - b = 0.1 + 0.1(-1) = 0.0

The Perceptron Convergence Theorem guarantees that if the data is linearly separable, the algorithm will find a separating hyperplane in a finite number of steps. However, if the data is not linearly separable---as Minsky and Papert famously demonstrated with the XOR problem in 1969---the perceptron will never converge. This limitation nearly killed the field of neural networks for over a decade.

The solution? Stack multiple neurons into layers.

11.2 Activation Functions

Before we build multi-layer networks, we need to understand the nonlinear functions that give neural networks their power. Without nonlinearity, stacking layers would be pointless: a composition of linear functions is still linear, so a 100-layer linear network would be equivalent to a single-layer one. The activation function breaks this linearity.

11.2.1 Sigmoid

The sigmoid function, which you encountered in Chapter 10 (Logistic Regression), maps any real number to the interval (0, 1):

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Its derivative has a convenient form:

$$ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $$

Properties: - Output range: (0, 1), useful for probabilities - Smooth and differentiable everywhere - Saturates for large |z|, causing the vanishing gradient problem - Output is not zero-centered, which can slow convergence

Worked example. For z = 2.0: - sigma(2.0) = 1 / (1 + e^{-2.0}) = 1 / (1 + 0.1353) = 1 / 1.1353 = 0.8808 - sigma'(2.0) = 0.8808 * (1 - 0.8808) = 0.8808 * 0.1192 = 0.1050

Notice how the gradient is already quite small at z = 2.0. At z = 10.0, sigma'(10.0) is approximately 0.0000454---nearly zero. This means that during backpropagation, gradients flowing through a sigmoid neuron with a large pre-activation will be almost completely attenuated. In deep networks, this attenuation compounds across layers, making training extremely slow or impossible. This is the vanishing gradient problem.

11.2.2 Hyperbolic Tangent (tanh)

The tanh function maps inputs to the interval (-1, 1):

$$ \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1 $$

Its derivative is:

$$ \tanh'(z) = 1 - \tanh^2(z) $$

Properties: - Output range: (-1, 1), zero-centered (better than sigmoid for hidden layers) - Stronger gradients than sigmoid near zero - Still suffers from vanishing gradients for large |z|

Worked example. For z = 1.0: - tanh(1.0) = (e^1 - e^{-1}) / (e^1 + e^{-1}) = (2.718 - 0.368) / (2.718 + 0.368) = 2.350 / 3.086 = 0.7616 - tanh'(1.0) = 1 - 0.7616^2 = 1 - 0.5800 = 0.4200

11.2.3 Rectified Linear Unit (ReLU)

ReLU, introduced by Nair and Hinton in 2010, transformed deep learning by providing a simple activation function that avoids the vanishing gradient problem:

$$ \text{ReLU}(z) = \max(0, z) $$

Its derivative is:

$$ \text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \end{cases} $$

(The derivative is undefined at z = 0; in practice, we set it to 0 or 1.)

Properties: - Computationally efficient (no exponentials) - Non-saturating for positive inputs, so gradients flow freely - Sparse activation: many neurons output exactly zero - Dying ReLU problem: if a neuron's pre-activation is always negative, its gradient is always zero, and it can never recover

Variants. Leaky ReLU addresses the dying ReLU problem by allowing a small negative slope:

$$ \text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases} $$

where alpha is typically 0.01.

11.2.4 Gaussian Error Linear Unit (GELU)

GELU, proposed by Hendrycks and Gimpel in 2016, has become the default activation in modern transformer architectures (Chapter 16):

$$ \text{GELU}(z) = z \cdot \Phi(z) $$

where: - Phi(z) is the cumulative distribution function of the standard normal distribution.

Intuition. GELU can be understood as a smooth, deterministic approximation to a stochastic process. Imagine that for each neuron, you randomly multiply the input by either 0 or 1, with the probability of keeping it proportional to how large the input is. Inputs with large positive values are almost certainly kept (like ReLU); inputs with large negative values are almost certainly zeroed (like ReLU); but inputs near zero have a probability of being kept that depends on their magnitude. GELU computes the expected value of this stochastic masking, which is $z \cdot \Phi(z)$. This gives GELU a natural connection to dropout-style regularization.

An efficient approximation is:

$$ \text{GELU}(z) \approx 0.5 z \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \left(z + 0.044715 z^3\right)\right]\right) $$

Worked example. For z = 1.0: - Phi(1.0) = 0.8413 (from standard normal CDF tables) - GELU(1.0) = 1.0 * 0.8413 = 0.8413

For z = -1.0: - Phi(-1.0) = 0.1587 - GELU(-1.0) = (-1.0) * 0.1587 = -0.1587

Notice that GELU is slightly negative for negative inputs---it does not completely zero them out like ReLU, but it does strongly suppress them. This smooth behavior near zero makes the gradient more informative than ReLU's abrupt transition.

Properties: - Smooth, non-monotonic (slightly negative for small negative inputs) - Combines properties of ReLU and dropout (stochastic regularization interpretation) - Used in BERT, GPT, and most modern language models

11.2.5 Swish and Mish

Two other modern activation functions deserve mention, as they appear in several influential architectures.

Swish (Ramachandran et al., 2017), also called SiLU (Sigmoid Linear Unit), was discovered through automated search over activation function spaces:

$$ \text{Swish}(z) = z \cdot \sigma(\beta z) $$

where $\sigma$ is the sigmoid function and $\beta$ is a learnable or fixed parameter (typically $\beta = 1$). When $\beta = 1$, Swish is nearly identical to GELU. When $\beta \to \infty$, Swish converges to ReLU. Swish is the default activation in EfficientNet and several other computer vision architectures.

Mish (Misra, 2019) is defined as:

$$ \text{Mish}(z) = z \cdot \tanh(\ln(1 + e^z)) = z \cdot \tanh(\text{softplus}(z)) $$

Mish is smooth, non-monotonic (like GELU and Swish), and self-regularizing. It was used in YOLOv4 and other object detection models. In practice, the differences between GELU, Swish, and Mish are small, and the choice often depends on which framework and architecture you are using.

Worked example for Swish. For z = 2.0 with $\beta = 1$: - sigma(2.0) = 0.8808 - Swish(2.0) = 2.0 * 0.8808 = 1.7616

For z = -2.0: - sigma(-2.0) = 0.1192 - Swish(-2.0) = (-2.0) * 0.1192 = -0.2384

Like GELU, Swish allows a small negative output for negative inputs, providing a non-zero gradient signal that helps avoid the dying neuron problem.

11.2.6 Choosing an Activation Function

Here is practical guidance:

Layer Type	Recommended Activation	Reason
Hidden layers (default)	ReLU	Fast, avoids vanishing gradients
Hidden layers (transformers)	GELU	Smoother; standard for attention models
Output (binary classification)	Sigmoid	Outputs probability in (0, 1)
Output (multi-class)	Softmax	Outputs probability distribution
Output (regression)	None (linear)	Unrestricted output range

11.3 Multi-Layer Perceptrons and the Forward Pass

11.3.1 Architecture

A multi-layer perceptron (MLP) organizes neurons into layers:

Input layer: receives the feature vector x. This layer performs no computation.
Hidden layers: one or more layers of neurons that transform the input. Each neuron in a hidden layer connects to every neuron in the previous layer (fully connected or dense layers).
Output layer: produces the final prediction.

For a network with L layers (not counting the input), we define: - W^[l] is the weight matrix for layer l, with shape (n_l, n_{l-1}) - b^[l] is the bias vector for layer l, with shape (n_l, 1) - z^[l] = W^[l] a^[l-1] + b^[l] is the pre-activation at layer l - a^[l] = f^l is the activation (output) of layer l - a^[0] = x is the input

11.3.2 The Forward Pass

The forward pass computes the network's output by propagating the input through each layer sequentially:

$$ \mathbf{z}^{[1]} = \mathbf{W}^{[1]} \mathbf{x} + \mathbf{b}^{[1]}, \quad \mathbf{a}^{[1]} = f^{[1]}(\mathbf{z}^{[1]}) $$

$$ \mathbf{z}^{[2]} = \mathbf{W}^{[2]} \mathbf{a}^{[1]} + \mathbf{b}^{[2]}, \quad \mathbf{a}^{[2]} = f^{[2]}(\mathbf{z}^{[2]}) $$

$$ \vdots $$

$$ \mathbf{z}^{[L]} = \mathbf{W}^{[L]} \mathbf{a}^{[L-1]} + \mathbf{b}^{[L]}, \quad \hat{\mathbf{y}} = f^{[L]}(\mathbf{z}^{[L]}) $$

Worked example. Consider a 2-layer network (one hidden layer with 3 neurons, one output neuron) with ReLU in the hidden layer and sigmoid at the output. Input: x = [1.0, 0.5]^T.

Suppose:

$$ \mathbf{W}^{[1]} = \begin{bmatrix} 0.2 & 0.4 \\ -0.5 & 0.1 \\ 0.3 & -0.2 \end{bmatrix}, \quad \mathbf{b}^{[1]} = \begin{bmatrix} 0.1 \\ 0.0 \\ -0.1 \end{bmatrix} $$

$$ \mathbf{W}^{[2]} = \begin{bmatrix} 0.7 & -0.3 & 0.5 \end{bmatrix}, \quad b^{[2]} = 0.2 $$

Layer 1 (hidden):

$$ \mathbf{z}^{[1]} = \begin{bmatrix} 0.2(1.0) + 0.4(0.5) + 0.1 \\ -0.5(1.0) + 0.1(0.5) + 0.0 \\ 0.3(1.0) + (-0.2)(0.5) + (-0.1) \end{bmatrix} = \begin{bmatrix} 0.5 \\ -0.45 \\ 0.1 \end{bmatrix} $$

$$ \mathbf{a}^{[1]} = \text{ReLU}(\mathbf{z}^{[1]}) = \begin{bmatrix} 0.5 \\ 0.0 \\ 0.1 \end{bmatrix} $$

Layer 2 (output):

$$ z^{[2]} = 0.7(0.5) + (-0.3)(0.0) + 0.5(0.1) + 0.2 = 0.35 + 0.0 + 0.05 + 0.2 = 0.6 $$

$$ \hat{y} = \sigma(0.6) = \frac{1}{1 + e^{-0.6}} = 0.6457 $$

The network predicts a probability of 0.6457 for the positive class.

11.3.3 The Universal Approximation Theorem

The universal approximation theorem (Cybenko, 1989; Hornik, 1991) states that a feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of R^n, given a suitable activation function and enough neurons.

Intuition: Why Does This Work? Consider a one-dimensional function $f(x)$ that you want to approximate. A single ReLU neuron produces a "hinge"---a function that is zero on one side and linear on the other. By combining many hinges with different positions and slopes, you can construct a piecewise linear approximation to any continuous curve. The more neurons (hinges) you add, the finer the approximation. This is the same idea as approximating a curve with many short line segments.

In two dimensions, each neuron defines a hyperplane that divides the input space into two half-spaces. A hidden layer with $n$ neurons creates up to $2^n$ distinct regions (though the actual number is usually less). Within each region, the network computes a different linear function. By choosing the right boundaries and linear functions, the network can approximate any continuous surface.

More formally, Cybenko's original theorem showed that for any continuous function $f: [0,1]^d \to \mathbb{R}$ and any $\epsilon > 0$, there exists a network with a single hidden layer using sigmoid activations such that:

$$\sup_{\mathbf{x} \in [0,1]^d} |f(\mathbf{x}) - \hat{f}(\mathbf{x})| < \epsilon$$

Hornik (1991) generalized this to a broad class of activation functions, including ReLU and tanh.

Practical Caveats. This theorem is reassuring but has important limitations:

Width vs. existence: It says nothing about how many neurons are needed. The required width may be exponential in the input dimension. For some functions, a single hidden layer with $n$ neurons requires $n$ to be astronomically large.
Learnability: It says nothing about whether gradient descent will find the right weights. The theorem is an existence result, not a constructive one. There may be many local minima in the loss landscape.
Depth efficiency: While a single hidden layer suffices in theory, deeper networks (more layers) tend to learn more efficiently than very wide, shallow ones. Telgarsky (2016) showed that there exist functions that can be represented by a network with $O(k)$ layers and polynomial width, but require exponential width with $O(1)$ layers. This is the theoretical motivation for "deep" learning.
Generalization: Even if a network memorizes the training data perfectly, it may not generalize. The theorem guarantees approximation on the training domain, not prediction on unseen data.

In practice, we use networks with multiple hidden layers (hence "deep" learning) because they can represent hierarchical features: early layers detect simple patterns (edges, frequencies), middle layers combine them into parts (textures, phonemes), and later layers compose parts into complex abstractions (objects, words). You will see this hierarchy clearly in convolutional networks (Chapter 13).

11.3.4 Vectorized Forward Pass for Mini-Batches

When training, we process multiple examples simultaneously in a mini-batch. If the input matrix X has shape (n_features, m) where m is the batch size, the vectorized forward pass is:

$$ \mathbf{Z}^{[l]} = \mathbf{W}^{[l]} \mathbf{A}^{[l-1]} + \mathbf{b}^{[l]} $$

where: - A^[0] = X has shape (n_0, m), - Z^[l] has shape (n_l, m), - b^[l] is broadcast across the m columns.

This is a matrix multiplication followed by a broadcast addition---exactly the kind of operation that NumPy and PyTorch execute efficiently on modern hardware.

11.4 Loss Functions

To train a neural network, we need a loss function (also called a cost function) that quantifies how wrong the network's predictions are. The choice of loss function depends on the task.

11.4.1 Mean Squared Error (Regression)

For regression tasks, the standard choice is mean squared error:

$$ \mathcal{L}_{\text{MSE}} = \frac{1}{m} \sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2 $$

where: - m is the number of examples, - y^(i) is the true value for example i, and - y_hat^(i) is the predicted value.

11.4.2 Binary Cross-Entropy (Binary Classification)

For binary classification, we use binary cross-entropy, which you derived in Chapter 10:

$$ \mathcal{L}_{\text{BCE}} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] $$

This loss heavily penalizes confident wrong predictions. If the true label is 1 and the model predicts 0.01, the loss contribution is -log(0.01) = 4.605, which is very large.

11.4.3 Categorical Cross-Entropy (Multi-Class Classification)

For multi-class classification with K classes, the output layer uses a softmax activation:

$$ \text{softmax}(z_k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} $$

And the loss is:

$$ \mathcal{L}_{\text{CE}} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log(\hat{y}_k^{(i)}) $$

where y_k^(i) is 1 if example i belongs to class k and 0 otherwise (one-hot encoding).

In practice, we combine softmax and cross-entropy into a single numerically stable operation (as PyTorch's nn.CrossEntropyLoss does).

11.5 Backpropagation: The Learning Algorithm

Backpropagation is the algorithm that makes neural networks trainable. It computes the gradient of the loss with respect to every weight and bias in the network, enabling gradient descent (Chapter 9) to update the parameters. Without backpropagation, training a network with thousands or millions of parameters would be computationally infeasible.

11.5.1 The Key Insight: The Chain Rule

Backpropagation is an application of the chain rule from calculus (Chapter 6). Recall that if y = f(g(x)), then:

$$ \frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx} $$

In a neural network, the loss is a composition of many functions: the loss function applied to the output activation, which depends on the pre-activation, which depends on the previous layer's activation, and so on. The chain rule lets us decompose this complex derivative into a product of simpler derivatives, one for each layer.

11.5.2 Deriving Backpropagation for a Two-Layer Network

Let us derive backpropagation step by step for a network with one hidden layer (ReLU activation) and one output neuron (sigmoid activation), trained with binary cross-entropy loss.

Setup: - Input: x (shape: n_0 x 1) - Hidden layer: W^[1] (shape: n_1 x n_0), b^[1] (shape: n_1 x 1) - Output layer: W^[2] (shape: 1 x n_1), b^[2] (scalar)

Forward pass (for a single example):

$$ \mathbf{z}^{[1]} = \mathbf{W}^{[1]} \mathbf{x} + \mathbf{b}^{[1]} $$

$$ \mathbf{a}^{[1]} = \text{ReLU}(\mathbf{z}^{[1]}) $$

$$ z^{[2]} = \mathbf{W}^{[2]} \mathbf{a}^{[1]} + b^{[2]} $$

$$ \hat{y} = \sigma(z^{[2]}) $$

Loss:

$$ \mathcal{L} = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})] $$

Step 1: Output layer gradient.

We need dL/dz^[2]. Using the chain rule:

$$ \frac{\partial \mathcal{L}}{\partial z^{[2]}} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{[2]}} $$

We compute each factor:

$$ \frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} $$

$$ \frac{\partial \hat{y}}{\partial z^{[2]}} = \hat{y}(1-\hat{y}) $$

Multiplying these together:

$$ \frac{\partial \mathcal{L}}{\partial z^{[2]}} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right) \cdot \hat{y}(1-\hat{y}) $$

$$ = -y(1-\hat{y}) + (1-y)\hat{y} = \hat{y} - y $$

This remarkably clean result---y_hat - y---is one of the beautiful simplifications in neural networks. We denote this as:

$$ \delta^{[2]} = \hat{y} - y $$

Step 2: Gradients for output layer weights and bias.

$$ \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}} = \delta^{[2]} \cdot (\mathbf{a}^{[1]})^T $$

$$ \frac{\partial \mathcal{L}}{\partial b^{[2]}} = \delta^{[2]} $$

Step 3: Propagate the gradient backward to the hidden layer.

$$ \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[1]}} = (\mathbf{W}^{[2]})^T \delta^{[2]} $$

$$ \delta^{[1]} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[1]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[1]}} \odot \text{ReLU}'(\mathbf{z}^{[1]}) $$

where the circle-dot symbol denotes element-wise multiplication and ReLU'(z^[1]) is a vector of ones and zeros (1 where z_j > 0, 0 otherwise).

Step 4: Gradients for hidden layer weights and bias.

$$ \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[1]}} = \delta^{[1]} \cdot \mathbf{x}^T $$

$$ \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[1]}} = \delta^{[1]} $$

11.5.3 General Backpropagation for L Layers

The pattern generalizes to a network with L layers. During the backward pass, for each layer l from L down to 1:

$$ \delta^{[l]} = \begin{cases} \nabla_{\hat{\mathbf{y}}} \mathcal{L} \odot f'^{[L]}(\mathbf{z}^{[L]}) & \text{if } l = L \\ \left[(\mathbf{W}^{[l+1]})^T \delta^{[l+1]}\right] \odot f'^{[l]}(\mathbf{z}^{[l]}) & \text{if } l < L \end{cases} $$

$$ \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \frac{1}{m} \delta^{[l]} (\mathbf{A}^{[l-1]})^T $$

$$ \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} = \frac{1}{m} \sum_{i=1}^{m} \delta^{[l](i)} $$

where: - delta^[l] is the error signal at layer l, - f'^[l] is the derivative of the activation function at layer l, and - the (1/m) factor averages the gradients over the mini-batch.

11.5.4 Detailed Backpropagation Walkthrough with Numbers

Let us trace backpropagation through a concrete numerical example to make the abstract derivation tangible. We use the same 2-layer network from Section 11.3.2: hidden layer with 3 ReLU neurons and a sigmoid output, with input x = [1.0, 0.5]^T and true label y = 1.

Forward pass (from Section 11.3.2): - z^[1] = [0.5, -0.45, 0.1]^T - a^[1] = ReLU(z^[1]) = [0.5, 0.0, 0.1]^T - z^[2] = 0.6 - y_hat = sigma(0.6) = 0.6457

Loss: L = -[1 * ln(0.6457) + 0 * ln(1 - 0.6457)] = -ln(0.6457) = 0.4375

Step 1: Output layer error signal.

delta^[2] = y_hat - y = 0.6457 - 1.0 = -0.3543

Step 2: Output layer gradients.

dL/dW^[2] = delta^[2] * (a^[1])^T = -0.3543 * [0.5, 0.0, 0.1] = [-0.1772, 0.0, -0.0354]

dL/db^[2] = delta^[2] = -0.3543

Step 3: Propagate error to hidden layer.

dL/da^[1] = (W^[2])^T * delta^[2] = [0.7, -0.3, 0.5]^T * (-0.3543) = [-0.2480, 0.1063, -0.1772]^T

Step 4: Apply ReLU derivative. Since z^[1] = [0.5, -0.45, 0.1]^T, the ReLU derivative is [1, 0, 1]^T (1 where z > 0, 0 elsewhere).

delta^[1] = [-0.2480, 0.1063, -0.1772]^T odot [1, 0, 1]^T = [-0.2480, 0.0, -0.1772]^T

Notice that the second neuron (where z^[1]_2 = -0.45 < 0) has its gradient zeroed out. The ReLU killed this neuron's gradient---it contributed nothing to the prediction, so it receives no gradient signal for the weight update. This is the ReLU gate in action.

Step 5: Hidden layer gradients.

dL/dW^[1] = delta^[1] * x^T = [-0.2480, 0.0, -0.1772]^T * [1.0, 0.5] = [[-0.2480, -0.1240], [0.0, 0.0], [-0.1772, -0.0886]]

dL/db^[1] = delta^[1] = [-0.2480, 0.0, -0.1772]^T

Step 6: Update weights. With learning rate eta = 0.1:

W^[2]_new = [0.7, -0.3, 0.5] - 0.1 * [-0.1772, 0.0, -0.0354] = [0.7177, -0.3, 0.5035]

b^[2]_new = 0.2 - 0.1 * (-0.3543) = 0.2354

And similarly for W^[1] and b^[1]. After this update, the network's prediction will move closer to the true label y = 1. Over many iterations with many training examples, the weights converge to values that minimize the average loss.

11.5.5 Backpropagation as Computational Graph

An alternative way to understand backpropagation is through computational graphs. Each operation in the forward pass becomes a node in a directed acyclic graph (DAG). During the backward pass, we traverse this graph in reverse, applying the chain rule at each node.

Building the Graph. Consider the simple computation L = -(y * log(sigma(w * x + b))). The forward pass creates nodes for each operation:

Node A: multiply w and x to get wx
Node B: add b to get z = wx + b
Node C: apply sigmoid to get a = sigma(z)
Node D: take log to get log(a)
Node E: multiply by y to get y * log(a)
Node F: negate to get L = -y * log(a)

Each node stores its output (for use in the backward pass) and knows how to compute the local derivative of its output with respect to its inputs.

Traversing Backward. Starting from the loss node F, we propagate gradients in reverse topological order. At each node, we multiply the incoming gradient (from downstream) by the local derivative (the Jacobian of the node's operation). This is a direct application of the chain rule.

This computational graph perspective is powerful because it generalizes beyond feedforward networks to any differentiable computation: recurrent networks, attention mechanisms, custom loss functions, and even control flow like if-statements and loops. This is exactly how PyTorch's autograd system works internally---it records a computational graph during the forward pass and uses it to compute gradients automatically during the backward pass.

Building a Mini Autograd Engine. To solidify the computational graph concept, here is a minimal autograd engine that tracks operations and computes gradients automatically:

class Value:
    """A scalar value that tracks its computation history.

    This is a simplified version of what PyTorch does internally.
    Each Value stores its data, gradient, and the operation that created it.
    """

    def __init__(self, data: float, _children: tuple = (), _op: str = "") -> None:
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), "+")

        def _backward():
            self.grad += out.grad   # d(a+b)/da = 1
            other.grad += out.grad  # d(a+b)/db = 1

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), "*")

        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a

        out._backward = _backward
        return out

    def tanh(self):
        import math
        t = math.tanh(self.data)
        out = Value(t, (self,), "tanh")

        def _backward():
            self.grad += (1 - t ** 2) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        """Compute gradients for all nodes via reverse-mode autodiff."""
        topo = []
        visited = set()

        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)

        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

This miniature engine (inspired by Andrej Karpathy's micrograd) demonstrates the essential idea: each operation records a backward function, and calling backward() on the final output triggers a reverse traversal that accumulates gradients via the chain rule. PyTorch's autograd follows the same principle but handles tensors, GPU operations, and hundreds of operation types.

11.5.6 The Gradient Descent Update

After computing all gradients via backpropagation, we update the parameters using gradient descent (as you learned in Chapter 9):

$$ \mathbf{W}^{[l]} \leftarrow \mathbf{W}^{[l]} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} $$

$$ \mathbf{b}^{[l]} \leftarrow \mathbf{b}^{[l]} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} $$

where eta is the learning rate. In Chapter 12, we will explore more sophisticated update rules---momentum, Adam, AdamW---that converge faster and more reliably than plain gradient descent.

11.5.7 Numerical Gradient Checking

When implementing backpropagation by hand, bugs are easy to introduce and hard to detect. Gradient checking compares your analytical gradients to numerically approximated gradients:

$$ \frac{\partial \mathcal{L}}{\partial \theta_j} \approx \frac{\mathcal{L}(\theta_j + \epsilon) - \mathcal{L}(\theta_j - \epsilon)}{2\epsilon} $$

where epsilon is a small number like 10^{-7}. The two-sided finite difference formula has approximation error $O(\epsilon^2)$, much better than the one-sided formula $(\mathcal{L}(\theta_j + \epsilon) - \mathcal{L}(\theta_j)) / \epsilon$ which has error $O(\epsilon)$.

To check your entire implementation, compute the relative error for each parameter:

$$ \text{relative error} = \frac{|g_{\text{analytical}} - g_{\text{numerical}}|}{\max(|g_{\text{analytical}}|, |g_{\text{numerical}}|) + \epsilon} $$

If the relative difference is less than 10^{-5} for all parameters, your implementation is likely correct. If any parameter shows a relative error above 10^{-3}, there is almost certainly a bug. Gradient checking is too slow for training ($O(p)$ forward passes for $p$ parameters), but it is invaluable for debugging.

def gradient_check(
    x: np.ndarray,
    y: np.ndarray,
    parameters: dict,
    layer_dims: list[int],
    epsilon: float = 1e-7,
) -> float:
    """Check analytical gradients against numerical approximation.

    Args:
        x: Input data.
        y: Labels.
        parameters: Network parameters.
        layer_dims: Layer dimensions.
        epsilon: Perturbation size for finite differences.

    Returns:
        Maximum relative error across all parameters.
    """
    num_layers = len(layer_dims) - 1
    y_hat, cache = forward_pass(x, parameters, num_layers)
    gradients = backward_pass(y, parameters, cache, num_layers)

    max_error = 0.0
    for key in parameters:
        param = parameters[key]
        grad = gradients[f"d{key}"]

        # Compute numerical gradient for each element
        numerical_grad = np.zeros_like(param)
        it = np.nditer(param, flags=["multi_index"])
        while not it.finished:
            idx = it.multi_index
            old_val = param[idx]

            param[idx] = old_val + epsilon
            y_plus, _ = forward_pass(x, parameters, num_layers)
            loss_plus = compute_loss(y_plus, y)

            param[idx] = old_val - epsilon
            y_minus, _ = forward_pass(x, parameters, num_layers)
            loss_minus = compute_loss(y_minus, y)

            numerical_grad[idx] = (loss_plus - loss_minus) / (2 * epsilon)
            param[idx] = old_val
            it.iternext()

        # Relative error
        diff = np.linalg.norm(grad - numerical_grad)
        norm = np.linalg.norm(grad) + np.linalg.norm(numerical_grad) + epsilon
        error = diff / norm
        max_error = max(max_error, error)
        print(f"{key}: relative error = {error:.2e}")

    return max_error

11.6 Implementing a Neural Network in NumPy

Now you will put everything together and implement a neural network from scratch using only NumPy. This implementation follows the mathematical derivations exactly, so you can verify each line against the equations above.

11.6.1 Weight Initialization

Proper weight initialization is critical. If weights are too large, activations saturate and gradients vanish or explode. If weights are too small, signals shrink to zero as they propagate through layers.

Why Random Initialization Matters: Symmetry Breaking. If you initialize all weights to the same value (say, zero), then every neuron in a given layer computes exactly the same function. During backpropagation, every neuron receives exactly the same gradient and updates identically. No matter how long you train, all neurons in a layer remain identical---the network has effectively only one neuron per layer. This is the symmetry problem. Random initialization breaks this symmetry, ensuring that neurons specialize during training.

Xavier (Glorot) Initialization. Before He initialization, the standard approach was Xavier initialization (Glorot and Bengio, 2010), designed for networks with symmetric activations like sigmoid and tanh. The idea is to maintain constant variance of activations and gradients across layers. If a layer has $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs, Xavier initialization samples from:

$$ w \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right) \quad \text{or} \quad w \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right) $$

The derivation assumes that activations are linear around zero (true for tanh and sigmoid near the origin), and that the forward and backward passes should both maintain unit variance.

He Initialization. He et al. (2015) observed that ReLU zeroes out roughly half of its inputs, effectively halving the variance at each layer. To compensate, He initialization samples from:

$$ w \sim \mathcal{N}\left(0, \sqrt{\frac{2}{n_{\text{in}}}}\right) $$

where n_in is the number of inputs to the layer. The factor of 2 (instead of 1) accounts for the variance halving caused by ReLU. This keeps the variance of activations roughly constant across layers, even in very deep networks.

LSUV: Layer-Sequential Unit-Variance Initialization. Mishkin and Matas (2016) proposed a data-driven approach that does not rely on assumptions about the activation function. LSUV works in two phases:

Initialize all weight matrices with orthonormal matrices (using SVD decomposition of random matrices).
For each layer sequentially, pass a mini-batch of real data through the network, measure the variance of that layer's output, and rescale the weights to achieve unit variance.

This approach is more robust than formula-based initialization because it accounts for the actual data distribution and any nonlinearities. In practice, LSUV can improve convergence speed, especially for architectures where analytical formulas are unavailable.

Practical Advice. For most modern architectures, He initialization with ReLU (or its variants) is the safe default. Use Xavier initialization when your activation function is tanh or sigmoid. Consider LSUV when training novel architectures where standard formulas may not apply. As we will see in Chapter 12, normalization layers (BatchNorm, LayerNorm) can partially compensate for imperfect initialization, but starting from a good initial point still leads to faster and more stable training.

import numpy as np

def initialize_parameters(layer_dims: list[int]) -> dict:
    """Initialize weights using He initialization and biases to zero.

    Args:
        layer_dims: List of layer sizes, e.g., [784, 128, 64, 10].

    Returns:
        Dictionary containing W1, b1, W2, b2, ..., WL, bL.
    """
    np.random.seed(42)
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters[f"W{l}"] = np.random.randn(
            layer_dims[l], layer_dims[l - 1]
        ) * np.sqrt(2.0 / layer_dims[l - 1])
        parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return parameters

11.6.2 Activation Functions in NumPy

def relu(z: np.ndarray) -> np.ndarray:
    """Compute ReLU activation element-wise.

    Args:
        z: Pre-activation array of any shape.

    Returns:
        Array of same shape with ReLU applied.
    """
    return np.maximum(0, z)


def relu_derivative(z: np.ndarray) -> np.ndarray:
    """Compute derivative of ReLU.

    Args:
        z: Pre-activation array of any shape.

    Returns:
        Array of same shape: 1 where z > 0, 0 elsewhere.
    """
    return (z > 0).astype(float)


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Compute sigmoid activation element-wise.

    Args:
        z: Pre-activation array of any shape.

    Returns:
        Array of same shape with values in (0, 1).
    """
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))


def softmax(z: np.ndarray) -> np.ndarray:
    """Compute softmax activation (numerically stable).

    Args:
        z: Pre-activation array of shape (K, m) where K is the
           number of classes and m is the batch size.

    Returns:
        Array of same shape with columns summing to 1.
    """
    exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

11.6.3 Forward Pass Implementation

def forward_pass(
    x: np.ndarray,
    parameters: dict,
    num_layers: int,
) -> tuple[np.ndarray, dict]:
    """Execute the forward pass through all layers.

    Uses ReLU for hidden layers and sigmoid for the output layer.

    Args:
        x: Input data of shape (n_features, m).
        parameters: Dictionary with W1, b1, ..., WL, bL.
        num_layers: Number of layers (not counting input).

    Returns:
        Tuple of (output predictions, cache dictionary for backprop).
    """
    cache = {"A0": x}
    a = x

    # Hidden layers with ReLU
    for l in range(1, num_layers):
        z = parameters[f"W{l}"] @ a + parameters[f"b{l}"]
        a = relu(z)
        cache[f"Z{l}"] = z
        cache[f"A{l}"] = a

    # Output layer with sigmoid
    z = parameters[f"W{num_layers}"] @ a + parameters[f"b{num_layers}"]
    a = sigmoid(z)
    cache[f"Z{num_layers}"] = z
    cache[f"A{num_layers}"] = a

    return a, cache

11.6.4 Loss Computation

def compute_loss(
    y_hat: np.ndarray,
    y: np.ndarray,
) -> float:
    """Compute binary cross-entropy loss.

    Args:
        y_hat: Predictions of shape (1, m), values in (0, 1).
        y: True labels of shape (1, m), values in {0, 1}.

    Returns:
        Scalar loss value.
    """
    m = y.shape[1]
    epsilon = 1e-8  # Prevent log(0)
    loss = -np.sum(
        y * np.log(y_hat + epsilon)
        + (1 - y) * np.log(1 - y_hat + epsilon)
    ) / m
    return float(loss)

11.6.5 Backward Pass Implementation

def backward_pass(
    y: np.ndarray,
    parameters: dict,
    cache: dict,
    num_layers: int,
) -> dict:
    """Execute backpropagation to compute all gradients.

    Args:
        y: True labels of shape (1, m).
        parameters: Dictionary with W1, b1, ..., WL, bL.
        cache: Dictionary from forward pass with Z1, A1, ..., ZL, AL.
        num_layers: Number of layers.

    Returns:
        Dictionary with dW1, db1, ..., dWL, dbL.
    """
    m = y.shape[1]
    gradients = {}

    # Output layer (sigmoid + BCE gives clean gradient)
    dz = cache[f"A{num_layers}"] - y  # Shape: (1, m)

    gradients[f"dW{num_layers}"] = (
        dz @ cache[f"A{num_layers - 1}"].T / m
    )
    gradients[f"db{num_layers}"] = np.sum(dz, axis=1, keepdims=True) / m

    # Hidden layers
    for l in range(num_layers - 1, 0, -1):
        da = parameters[f"W{l + 1}"].T @ dz
        dz = da * relu_derivative(cache[f"Z{l}"])
        gradients[f"dW{l}"] = dz @ cache[f"A{l - 1}"].T / m
        gradients[f"db{l}"] = np.sum(dz, axis=1, keepdims=True) / m

    return gradients

11.6.6 Training Loop

def train(
    x: np.ndarray,
    y: np.ndarray,
    layer_dims: list[int],
    learning_rate: float = 0.01,
    num_epochs: int = 1000,
    print_every: int = 100,
) -> tuple[dict, list[float]]:
    """Train a neural network using gradient descent.

    Args:
        x: Training data of shape (n_features, m).
        y: Labels of shape (1, m).
        layer_dims: List of layer sizes including input.
        learning_rate: Step size for gradient descent.
        num_epochs: Number of training iterations.
        print_every: Print loss every this many epochs.

    Returns:
        Tuple of (trained parameters, loss history).
    """
    parameters = initialize_parameters(layer_dims)
    num_layers = len(layer_dims) - 1
    losses = []

    for epoch in range(num_epochs):
        # Forward pass
        y_hat, cache = forward_pass(x, parameters, num_layers)

        # Compute loss
        loss = compute_loss(y_hat, y)
        losses.append(loss)

        # Backward pass
        gradients = backward_pass(y, parameters, cache, num_layers)

        # Update parameters
        for l in range(1, num_layers + 1):
            parameters[f"W{l}"] -= learning_rate * gradients[f"dW{l}"]
            parameters[f"b{l}"] -= learning_rate * gradients[f"db{l}"]

        if epoch % print_every == 0:
            print(f"Epoch {epoch:4d} | Loss: {loss:.6f}")

    return parameters, losses

11.6.7 Solving XOR

Let us verify our implementation by solving the XOR problem---the very problem that defeated the single-layer perceptron:

# XOR dataset
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]])  # Shape: (2, 4)
Y = np.array([[0, 1, 1, 0]])   # Shape: (1, 4)

# Train a network with one hidden layer of 4 neurons
params, loss_history = train(
    X, Y,
    layer_dims=[2, 4, 1],
    learning_rate=1.0,
    num_epochs=5000,
    print_every=1000,
)

# Test predictions
predictions, _ = forward_pass(X, params, num_layers=2)
print("Predictions:", np.round(predictions, 3))
# Expected: approximately [[0, 1, 1, 0]]

The network learns to solve XOR---something impossible for a single perceptron. The hidden layer creates a nonlinear feature space in which XOR becomes linearly separable.

Why Does This Work? The hidden layer with ReLU activations transforms the 2D input into a 4D representation where the XOR classes become separable. To see this intuitively: the four XOR inputs [0,0], [0,1], [1,0], [1,1] are not linearly separable in 2D (you cannot draw a single line that separates the "0" points from the "1" points). But each hidden neuron computes a different linear boundary and applies ReLU, creating a new feature that is nonzero on one side of the boundary and zero on the other. With four such features, the hidden layer "unfolds" the input space so that the output neuron can draw a single linear boundary in the 4D space.

This is the fundamental mechanism of all deep learning: each layer applies a nonlinear transformation that reshapes the representation space, making previously inseparable patterns become separable. As we will see in Chapter 13, convolutional layers do this for spatial patterns in images, and as we will see in Chapter 16, attention layers do this for relationships between tokens in sequences.

11.7 Introduction to PyTorch

Writing neural networks from scratch in NumPy is educational, but it does not scale. For networks with millions of parameters, dozens of layer types, and GPU acceleration, you need a framework. PyTorch is the dominant framework in AI research and increasingly in production.

11.7.1 Why PyTorch?

PyTorch provides three core capabilities:

Tensors with GPU acceleration: PyTorch tensors are like NumPy arrays but can run on GPUs, providing 10-100x speedups for large matrix operations.
Automatic differentiation (autograd): PyTorch records a computational graph during the forward pass and computes all gradients automatically. You never need to derive or implement backpropagation by hand.
Neural network modules: The torch.nn module provides pre-built layers, loss functions, and optimizers that compose cleanly.

11.7.2 Tensors: The Foundation

A PyTorch tensor is a multi-dimensional array, similar to a NumPy array:

import torch

# Creating tensors
x = torch.tensor([1.0, 2.0, 3.0])
print(x)          # tensor([1.0000, 2.0000, 3.0000])
print(x.shape)    # torch.Size([3])
print(x.dtype)    # torch.float32

# From NumPy (shares memory when possible)
import numpy as np
np_array = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(np_array)

# Random tensors with reproducibility
torch.manual_seed(42)
w = torch.randn(3, 2)  # Shape (3, 2), standard normal

# GPU transfer (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x_gpu = x.to(device)

Key differences from NumPy: - Tensors track gradients when requires_grad=True - Tensors can live on CPU or GPU - PyTorch uses torch.matmul or @ for matrix multiplication (same as NumPy)

11.7.3 Autograd: Automatic Differentiation

Autograd is PyTorch's secret weapon. When you set requires_grad=True on a tensor, PyTorch records every operation applied to it. Calling .backward() on the result computes all gradients automatically:

torch.manual_seed(42)

# Create tensors with gradient tracking
x = torch.tensor([2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, -1.0], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

# Forward pass
z = torch.dot(w, x) + b     # z = 0.5*2 + (-1)*3 + 0.1 = -1.9
y_hat = torch.sigmoid(z)     # sigmoid(-1.9)

# Define a simple loss
y = torch.tensor(1.0)
loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))

# Backward pass: compute all gradients
loss.backward()

print(f"dL/dw = {w.grad}")   # Gradient of loss w.r.t. weights
print(f"dL/db = {b.grad}")   # Gradient of loss w.r.t. bias

Compare this to the dozens of lines we wrote for manual backpropagation. Autograd handles all the chain rule computations, correctly handles arbitrary computational graphs, and is numerically exact (not an approximation).

11.7.4 Building Networks with nn.Module

PyTorch's nn.Module class is the foundation of all neural network code in PyTorch. Understanding its internals will help you debug issues and design custom architectures.

How nn.Module Works Internally. When you create a subclass of nn.Module, PyTorch uses Python's __setattr__ method to intercept attribute assignments. If you assign an nn.Module or nn.Parameter to an attribute, PyTorch automatically registers it. This is how model.parameters() knows about all your weights without you manually maintaining a list.

The key internal data structures are: - _parameters: A dictionary mapping parameter names to nn.Parameter objects (the learnable weights) - _modules: A dictionary mapping submodule names to nn.Module objects (child layers) - _buffers: A dictionary for non-learnable state (e.g., running mean/variance in BatchNorm)

When you call model.parameters(), it recursively walks through _modules and collects all _parameters. When you call model.to(device), it moves all parameters and buffers to the specified device. When you call model.train() or model.eval(), it recursively sets the training flag on all submodules.

nn.Module provides a clean interface for defining neural networks:

import torch
import torch.nn as nn

class NeuralNetwork(nn.Module):
    """A multi-layer perceptron for binary classification.

    Architecture: input -> hidden (ReLU) -> output (sigmoid).

    Args:
        input_dim: Number of input features.
        hidden_dim: Number of neurons in the hidden layer.
        output_dim: Number of output neurons.
    """

    def __init__(
        self,
        input_dim: int,
        hidden_dim: int,
        output_dim: int = 1,
    ) -> None:
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass through the network.

        Args:
            x: Input tensor of shape (batch_size, input_dim).

        Returns:
            Output tensor of shape (batch_size, output_dim).
        """
        x = self.relu(self.hidden(x))
        x = self.sigmoid(self.output(x))
        return x

11.7.5 Training in PyTorch

The training loop in PyTorch follows a standard pattern:

torch.manual_seed(42)

# Create the model
model = NeuralNetwork(input_dim=2, hidden_dim=4, output_dim=1)

# Loss function and optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

# XOR dataset (note: PyTorch uses row-major format)
X = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
Y = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)

# Training loop
for epoch in range(5000):
    # Forward pass
    y_hat = model(X)
    loss = criterion(y_hat, Y)

    # Backward pass and update
    optimizer.zero_grad()  # Clear previous gradients
    loss.backward()        # Compute gradients
    optimizer.step()       # Update parameters

    if epoch % 1000 == 0:
        print(f"Epoch {epoch:4d} | Loss: {loss.item():.6f}")

# Test
with torch.no_grad():
    predictions = model(X)
    print("Predictions:", predictions.round().squeeze().tolist())

Notice the critical three-step pattern inside the training loop: 1. optimizer.zero_grad() --- clears old gradients (PyTorch accumulates gradients by default) 2. loss.backward() --- computes gradients via autograd 3. optimizer.step() --- updates parameters using the chosen optimization algorithm

This pattern is identical whether your network has 4 parameters or 400 billion.

11.7.6 PyTorch nn.Module: Key Methods and Patterns

Understanding the key methods of nn.Module will make you a more effective PyTorch user:

model.parameters() and model.named_parameters(). These iterators yield all learnable parameters. named_parameters() also gives you the parameter name, which is useful for debugging, logging, and applying different learning rates to different layers.

# Count total parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total_params:,}, Trainable: {trainable_params:,}")

model.train() and model.eval(). These methods set the training mode flag, which affects layers like dropout and batch normalization. Forgetting to call model.eval() before inference is one of the most common bugs in PyTorch code, as we will discuss in Chapter 12.

model.state_dict() and model.load_state_dict(). These methods serialize and deserialize the model's parameters, enabling checkpointing and model sharing.

model.to(device). Moves all parameters and buffers to the specified device (CPU or GPU). This is essential for GPU training.

nn.Sequential. For simple architectures where layers are applied in sequence, nn.Sequential provides a concise syntax:

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

This is equivalent to defining a custom nn.Module with a forward method that chains these layers. Use nn.Sequential for simple architectures and custom nn.Module subclasses when you need conditional logic, multiple inputs/outputs, or non-sequential data flow.

11.7.7 Comparing NumPy and PyTorch

Aspect	NumPy Implementation	PyTorch Implementation
Forward pass	Manual matrix multiplications	`model(x)`
Backward pass	Manual gradient derivation	`loss.backward()`
Parameter updates	Manual subtraction	`optimizer.step()`
GPU support	None	`model.to(device)`
Lines of code	~100	~30
Educational value	Very high	Moderate
Production readiness	Low	High

The NumPy implementation teaches you what happens inside the black box. The PyTorch implementation is what you will use in practice. Both are essential: you need to understand the mechanics to debug problems, design architectures, and read research papers, but you also need the efficiency and reliability of a mature framework to build real systems.

11.8 Putting It All Together

11.8.1 A Complete Example: Classifying Moons

Let us train both a NumPy network and a PyTorch network on the same non-linear classification problem to see them side by side.

from sklearn.datasets import make_moons

# Generate dataset
np.random.seed(42)
X_np, Y_np = make_moons(n_samples=500, noise=0.2, random_state=42)
X_np = X_np.T  # Shape: (2, 500) for our NumPy convention
Y_np = Y_np.reshape(1, -1)  # Shape: (1, 500)

# NumPy training
params, losses = train(
    X_np, Y_np,
    layer_dims=[2, 16, 8, 1],
    learning_rate=0.5,
    num_epochs=3000,
    print_every=500,
)

# NumPy accuracy
preds_np, _ = forward_pass(X_np, params, num_layers=3)
accuracy_np = np.mean((preds_np > 0.5).astype(float) == Y_np)
print(f"NumPy accuracy: {accuracy_np:.4f}")

# PyTorch training on the same data
torch.manual_seed(42)

X_torch = torch.tensor(X_np.T, dtype=torch.float32)  # (500, 2)
Y_torch = torch.tensor(Y_np.T, dtype=torch.float32)  # (500, 1)

class MoonsNet(nn.Module):
    """Three-layer network for moon classification."""

    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = MoonsNet()
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for epoch in range(3000):
    y_hat = model(X_torch)
    loss = criterion(y_hat, Y_torch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    preds_torch = model(X_torch)
    accuracy_torch = ((preds_torch > 0.5).float() == Y_torch).float().mean()
    print(f"PyTorch accuracy: {accuracy_torch:.4f}")

Both implementations achieve similar accuracy (typically > 95%) on the moons dataset, confirming that our NumPy implementation is correct and that PyTorch provides an equivalent but more concise interface.

Interpreting the Results. The moons dataset consists of two interleaved half-circles that no linear classifier can separate. Our network with architecture [2, 16, 8, 1] learns a decision boundary that curves through the gap between the two moons. The first hidden layer (16 neurons) creates 16 linear boundaries, and the ReLU activations create a piecewise linear partition of the input space. The second hidden layer (8 neurons) combines these features into higher-level patterns. The output layer then draws a single (linear) boundary in the 8-dimensional feature space created by the second hidden layer. Visualizing this decision boundary on the 2D input space reveals a smooth, curved boundary that cleanly separates the two classes---a dramatic improvement over the linear boundaries available to logistic regression.

11.8.2 Common Pitfalls and Debugging

As you implement and train neural networks, watch for these common issues:

Exploding loss (NaN): Learning rate too high, or numerical overflow in sigmoid/softmax. Lower the learning rate or add gradient clipping.
Loss not decreasing: Learning rate too low, poor weight initialization, or a bug in backpropagation. Use gradient checking to verify. A useful debugging technique is to start with a very high learning rate and decrease it until training becomes stable.
Loss decreasing then increasing: Overfitting. Add regularization (Chapter 12) or reduce model size.
All predictions the same: Dead neurons (dying ReLU) or saturated sigmoid. Check weight initialization and learning rate. A common cause is initializing weights too large, which pushes all sigmoid activations to 0 or 1.
Forgetting optimizer.zero_grad(): Gradients accumulate across iterations, leading to incorrect updates. Always zero gradients before loss.backward().
Shape mismatches: Transposing or reshaping data incorrectly. In our NumPy implementation, we use column vectors (each sample is a column), while PyTorch typically uses row vectors (each sample is a row). Mixing conventions silently produces wrong results.
Data not normalized: Feeding raw pixel values (0--255) into a network designed for standardized inputs (mean 0, std 1) causes activations to saturate immediately. Always preprocess your data (as we discussed in Chapter 3).

The Debugging Protocol. When training fails, follow this systematic protocol: 1. Verify shapes at every layer using print statements or a debugger 2. Train on a single batch---if the loss does not go to zero, there is a bug 3. Use gradient checking to verify backpropagation 4. Visualize activations and gradients at each layer to identify vanishing or exploding signals 5. Compare against a known-good reference implementation (e.g., PyTorch) on the same data

11.8.3 Looking Ahead

In this chapter, you built neural networks from first principles. You now understand every component: neurons, activation functions, forward passes, loss functions, and backpropagation. You can implement all of these in NumPy, and you know how PyTorch automates them.

In Chapter 12, we will explore training deep networks, covering essential techniques like batch normalization, dropout, learning rate scheduling, and advanced optimizers (Adam, AdamW) that make training deep networks reliable. In Chapter 13, we will introduce convolutional neural networks that exploit spatial structure in images. And in Chapter 16, we will encounter transformers, the architecture that powers modern language models.

Every one of those architectures uses the same fundamental operations you learned here: matrix multiplications, activation functions, loss computation, and backpropagation. The only difference is how the layers are connected and what assumptions they encode about the structure of the data.

Summary

This chapter covered the complete foundations of neural networks:

Biological inspiration: Artificial neurons are mathematical abstractions of biological neurons, computing a weighted sum of inputs followed by a nonlinear activation function.
The perceptron can learn linearly separable patterns but fails on non-linear problems like XOR. Multi-layer networks overcome this limitation.
Activation functions (sigmoid, tanh, ReLU, GELU) introduce the non-linearity that gives neural networks their representational power. ReLU is the default for hidden layers; GELU is standard in transformers.
The forward pass propagates inputs through layers via matrix multiplications and activation functions to produce predictions.
Loss functions (MSE, binary cross-entropy, categorical cross-entropy) quantify prediction errors and define the optimization objective.
Backpropagation applies the chain rule to efficiently compute gradients of the loss with respect to all parameters, enabling gradient descent.
NumPy implementation from scratch reveals exactly how forward passes, loss computation, backpropagation, and parameter updates work at the lowest level.
PyTorch provides tensors with GPU support, automatic differentiation (autograd), and the nn.Module abstraction, replacing hundreds of lines of manual code with a clean, efficient, and production-ready interface.

The transition from NumPy to PyTorch mirrors the transition every AI engineer must make: from understanding the fundamentals to leveraging tools that handle the mechanics so you can focus on architecture design, experimentation, and solving real problems.

What comes next. With the foundations from this chapter firmly in place, you are ready to tackle the practical challenges of training deep networks (Chapter 12), where we will cover learning rate schedules, normalization layers, regularization techniques, mixed precision training, and the debugging practices that separate productive practitioners from frustrated ones. You will also be ready for the specialized architectures that dominate modern AI: convolutional networks for images (Chapter 13), recurrent networks for sequences (Chapter 15), and Transformers for everything from language to vision (Chapter 16). Every one of these architectures is built from the same primitives you learned here---matrix multiplications, activation functions, loss functions, and gradient-based optimization.