Chapter 6: Key Takeaways

  1. A neural network is function composition: linear transformations interleaved with nonlinear activations. Without nonlinear activation functions, any number of stacked linear layers collapses to a single linear transformation. The activation function is what gives the network its expressive power — the ability to represent nonlinear decision boundaries. This is the most fundamental fact about neural networks, and every design decision in deep learning flows from it.
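The linear-collapse fact can be checked in a few lines of numpy. This is a minimal sketch; the matrix shapes and random values are illustrative, not from the chapter:

```python
import numpy as np

# Two stacked linear layers (no activation) collapse to one linear map.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first linear layer
W2 = rng.normal(size=(2, 4))   # second linear layer
x = rng.normal(size=(3,))

two_layer = W2 @ (W1 @ x)      # composing the two layers...
collapsed = (W2 @ W1) @ x      # ...equals one layer with weight W2 @ W1
assert np.allclose(two_layer, collapsed)

# A nonlinearity between the layers breaks the collapse: no single
# matrix reproduces W2 @ relu(W1 @ x) for all inputs.
relu = lambda z: np.maximum(z, 0.0)
nonlinear = W2 @ relu(W1 @ x)
```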

  2. Backpropagation is the chain rule applied to a computational graph. The forward pass records a sequence of differentiable operations. The backward pass traverses this graph in reverse, multiplying local Jacobians at each node. This is not a learning algorithm — it is a gradient computation algorithm. Learning happens when an optimizer (SGD, Adam, or another method) uses these gradients to update parameters. The chain rule has not changed since Leibniz; the insight of backpropagation is its systematic application to neural network graphs.
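The reverse traversal can be made concrete on the smallest possible graph. A sketch with illustrative scalar values (not the chapter's notation), computing the loss $(wx + b - y)^2$ forward and then its gradients backward:

```python
# Forward pass: record intermediates for loss = (w*x + b - y)**2.
w, x, b, y = 2.0, 3.0, 1.0, 10.0
z = w * x + b            # linear node: z = 7.0
loss = (z - y) ** 2      # squared-error node: loss = 9.0

# Backward pass: multiply local derivatives, node by node, in reverse.
dloss_dz = 2 * (z - y)   # local derivative of the loss node: -6.0
dz_dw, dz_db = x, 1.0    # local derivatives of the linear node
grad_w = dloss_dz * dz_dw   # chain rule: -18.0
grad_b = dloss_dz * dz_db   # chain rule: -6.0
```

Note that nothing here updates `w` or `b` — that is the optimizer's job, which is exactly the distinction the takeaway draws.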

  3. Activation functions determine gradient flow and failure modes. Sigmoid saturates at both tails (maximum gradient 0.25), causing vanishing gradients in deep networks. ReLU avoids saturation for positive inputs but kills neurons for negative inputs (gradient exactly zero). GELU and Swish provide smooth, non-zero gradients everywhere, mitigating both problems. The choice of activation function is a choice about how gradients propagate — and therefore about which layers can learn.
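The saturation and dead-neuron failure modes above can be seen directly by evaluating the gradients. A minimal numpy sketch (the sample points are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)             # peaks at exactly 0.25 when z = 0

def relu_grad(z):
    return (z > 0).astype(float)   # exactly zero for non-positive inputs

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_grad(z))   # near zero at both tails: saturation
print(relu_grad(z))      # 0.0 for z <= 0: the dead-neuron regime
```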

  4. Weight initialization determines whether learning can begin. Random initialization with the wrong variance causes activations to either vanish (too small) or explode (too large) across layers. He initialization ($\text{Var}(w) = 2/n_\text{in}$) maintains stable activation variance for ReLU networks; Xavier initialization ($\text{Var}(w) = 2/(n_\text{in} + n_\text{out})$) does the same for sigmoid and tanh. Correct initialization is not optional — it is the difference between a network that learns and one that does not.
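The variance-preservation claim can be checked empirically. A sketch of a deep ReLU stack under He initialization — the width and depth here are illustrative choices, not from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 20                 # illustrative width and depth
relu = lambda z: np.maximum(z, 0.0)

h = rng.normal(size=(n,))          # input with unit variance
for _ in range(depth):
    # He initialization: Var(w) = 2 / n_in, i.e. scale = sqrt(2 / n_in)
    W = rng.normal(scale=np.sqrt(2.0 / n), size=(n, n))
    h = relu(W @ h)

print(h.std())   # stays on the order of the input std, 20 layers deep
# With scale = 0.01 instead, h would underflow toward zero;
# with scale = 1.0, it would explode by orders of magnitude.
# Xavier for tanh/sigmoid would use scale = sqrt(2 / (n_in + n_out)).
```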

  5. The universal approximation theorem is an existence result, not a learning guarantee. It proves that a sufficiently wide single-hidden-layer network can represent any continuous function. It says nothing about how many neurons are needed (possibly exponentially many), whether gradient descent can find the right parameters, or how much training data is required. Representation power is necessary but not sufficient for practical learning. Depth, in practice, is more powerful than width for capturing compositional structure.

  6. Build it twice: first in numpy, then in PyTorch. The numpy implementation makes every operation explicit — every matrix multiply, every activation gradient, every parameter update. The PyTorch implementation abstracts away the backward pass through automatic differentiation. The correspondence is exact: loss.backward() computes the same chain rule you computed by hand. The framework saves effort; the understanding saves you when something breaks.
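A condensed sketch of the numpy half of the exercise — one explicit forward/backward pass for a tiny one-hidden-layer ReLU network with squared-error loss. The shapes and data are illustrative; in the PyTorch version, everything below the forward pass is replaced by a single loss.backward() call:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                       # batch of 5, 3 features
y = rng.normal(size=(5, 1))
W1 = rng.normal(scale=np.sqrt(2 / 3), size=(3, 4))  # He-initialized
W2 = rng.normal(scale=np.sqrt(2 / 4), size=(4, 1))

# Forward pass: every operation explicit.
z1 = x @ W1
h = np.maximum(z1, 0.0)                           # ReLU
pred = h @ W2
loss = np.mean((pred - y) ** 2)

# Backward pass: the chain rule, written out by hand.
dpred = 2 * (pred - y) / y.size                   # d loss / d pred
dW2 = h.T @ dpred                                 # output-layer gradient
dh = dpred @ W2.T
dz1 = dh * (z1 > 0)                               # ReLU gradient: 0 where z1 <= 0
dW1 = x.T @ dz1

# Parameter update: plain SGD (the optimizer's job, not backprop's).
lr = 0.1
W1 -= lr * dW1
W2 -= lr * dW2
```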

  7. Gradient checking is a debugging discipline, not a luxury. Comparing analytical gradients to numerical gradients (centered finite differences with $\epsilon = 10^{-5}$) is the most reliable way to verify a backpropagation implementation. Relative error below $10^{-5}$ indicates correct gradients; above $10^{-3}$ indicates a bug. Always gradient-check a new implementation before trusting it with real training.
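The centered-difference recipe above fits in a short function. A sketch for an illustrative function $f(w) = \sum_i ((wx)_i)^2$, whose analytical gradient is easy to verify by hand:

```python
import numpy as np

def f(w, x):
    return np.sum((w @ x) ** 2)

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 3))
x = rng.normal(size=(3,))

# Analytical gradient: d/dw sum((w @ x)**2) = 2 * outer(w @ x, x).
grad_analytic = 2.0 * np.outer(w @ x, x)

# Numerical gradient: centered finite differences, eps = 1e-5.
eps = 1e-5
grad_numeric = np.zeros_like(w)
for i in range(w.shape[0]):
    for j in range(w.shape[1]):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i, j] += eps
        w_minus[i, j] -= eps
        grad_numeric[i, j] = (f(w_plus, x) - f(w_minus, x)) / (2 * eps)

rel_err = np.abs(grad_analytic - grad_numeric).max() / (
    np.abs(grad_analytic).max() + np.abs(grad_numeric).max())
print(rel_err)   # well below 1e-5 here, indicating a correct gradient
```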