Key Takeaways — Chapter 2: Multivariate Calculus and Optimization
- The gradient is the fundamental object of machine learning optimization. It is the vector of all partial derivatives, pointing in the direction of steepest ascent. Every training algorithm — from logistic regression to GPT — works by stepping against the gradient to decrease the loss. The gradient of regularized logistic regression $\nabla \mathcal{L} = \frac{1}{N}\sum_i (\hat{y}_i - y_i)\mathbf{x}_i + \lambda\boldsymbol{\theta}$ has a natural interpretation: the model focuses its updates on the examples it gets most wrong.
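A minimal NumPy sketch of this gradient, assuming the L2 penalty is written $\frac{\lambda}{2}\lVert\boldsymbol{\theta}\rVert^2$ so that its gradient is $\lambda\boldsymbol{\theta}$; the data and function names are illustrative, and the loss is included so the gradient can be sanity-checked by finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_logistic_grad(theta, X, y, lam):
    """(1/N) * sum_i (yhat_i - y_i) x_i  +  lam * theta."""
    y_hat = sigmoid(X @ theta)                  # predicted probabilities yhat_i
    return X.T @ (y_hat - y) / len(y) + lam * theta

def reg_logistic_loss(theta, X, y, lam):
    """Mean cross-entropy plus (lam/2)||theta||^2, whose gradient is lam * theta."""
    p = sigmoid(X @ theta)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)) + 0.5 * lam * theta @ theta

# synthetic data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.5).astype(float)
theta = rng.normal(size=3)
g = reg_logistic_grad(theta, X, y, lam=0.1)
```

Note the "focus on wrong examples" reading: each example's contribution is weighted by its residual $\hat{y}_i - y_i$, so confidently correct examples contribute almost nothing.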
- Backpropagation is the chain rule applied to computational graphs. A neural network is a composition of differentiable functions. Reverse-mode automatic differentiation (backpropagation) computes the gradient of a scalar loss with respect to all $n$ parameters in a single backward pass — a factor-of-$n$ speedup over forward mode. This computational efficiency is why deep learning is practical.
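As a concrete illustration, here is the chain rule unrolled by hand for a tiny two-layer network (one tanh hidden layer, squared loss; all shapes and values are illustrative), caching the forward activations and reusing them on the backward pass:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)                      # input
W1 = rng.normal(size=(3, 4))                # layer-1 weights
W2 = rng.normal(size=(1, 3))                # layer-2 weights
t = 0.5                                     # regression target

# forward pass: cache intermediate activations
h = np.tanh(W1 @ x)
y_hat = (W2 @ h)[0]
loss = 0.5 * (y_hat - t) ** 2

# backward pass: one chain-rule application per edge of the graph;
# a single sweep yields gradients for every parameter at once
d_yhat = y_hat - t                          # dL/dy_hat
dW2 = d_yhat * h[None, :]                   # dL/dW2
dh = d_yhat * W2[0]                         # dL/dh
dW1 = ((1.0 - h**2) * dh)[:, None] * x[None, :]   # tanh'(a) = 1 - tanh(a)^2
```

Forward mode would need one pass per parameter (here 15 passes) to recover the same gradient; the backward sweep gets all of them in one.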
- Convexity determines whether optimization has guarantees. For convex functions, every local minimum is a global minimum, and gradient descent provably converges at a rate determined by the condition number $\kappa = L/\mu$. Most deep learning losses are non-convex, but convex theory provides the vocabulary (saddle points, conditioning, strong convexity) to reason about what we observe.
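A small sketch of how the condition number bites in practice: gradient descent on the quadratic $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^\top \mathbf{H}\mathbf{x}$ with $\kappa = L/\mu = 100$, using the standard step size $1/L$ (constants illustrative):

```python
import numpy as np

mu, L = 1.0, 100.0               # strong-convexity and smoothness constants; kappa = 100
H = np.diag([mu, L])             # Hessian of f(x) = 0.5 x^T H x
x = np.array([1.0, 1.0])
eta = 1.0 / L                    # step size safe for the sharpest direction

for _ in range(500):
    x = x - eta * (H @ x)        # the slow axis contracts by only (1 - mu/L) per step

# the stiff coordinate is gone almost immediately, but the flat coordinate
# still retains (1 - 1/kappa)^500 of its initial value
```

This is the sense in which $\kappa$ sets the convergence rate: the well-conditioned direction is solved in one step, while the ill-conditioned one decays geometrically at rate $1 - 1/\kappa$.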
- Gradient descent is a family of algorithms, not a single algorithm. Each variant addresses a specific failure mode: SGD reduces per-step cost, momentum smooths oscillations, adaptive methods (AdaGrad, RMSProp, Adam) assign per-parameter learning rates, and AdamW decouples weight decay. Understanding the family tree means understanding why your model converges or diverges.
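The family resemblance is easiest to see as update rules. A sketch of three core steps (plain SGD, momentum, Adam) with textbook default hyperparameters; the function names are illustrative:

```python
import numpy as np

def sgd_step(theta, g, lr):
    """Plain (stochastic) gradient descent: cheapest per step."""
    return theta - lr * g

def momentum_step(theta, g, v, lr, beta=0.9):
    """Heavy-ball momentum: an exponential average of gradients damps oscillation."""
    v = beta * v + g
    return theta - lr * v, v

def adam_step(theta, g, m, v, t, lr, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: per-parameter learning rates from running first/second moments."""
    m = b1 * m + (1 - b1) * g              # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g**2           # second moment (uncentered variance)
    mhat = m / (1 - b1**t)                 # bias correction for zero initialization
    vhat = v / (1 - b2**t)
    return theta - lr * mhat / (np.sqrt(vhat) + eps), m, v
```

AdamW differs from `adam_step` only in applying weight decay directly to `theta` instead of folding it into `g`, which is what "decoupled" means here.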
- Adam approximates second-order optimization with first-order cost. Adam's adaptive learning rate $\eta / \sqrt{\hat{v}_j}$ implicitly estimates the diagonal of the inverse Hessian, rescaling each parameter's update by its curvature. This is why Adam handles ill-conditioned problems well and why it struggles when the Hessian has significant off-diagonal structure.
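A quick illustration of the diagonal-rescaling effect, comparing plain gradient descent and Adam on an ill-conditioned diagonal quadratic (all constants illustrative):

```python
import numpy as np

h = np.array([1.0, 100.0])            # diagonal Hessian: curvatures differ by 100x
theta_gd = np.array([1.0, 1.0])
theta_adam = theta_gd.copy()
m = np.zeros(2); v = np.zeros(2)
lr_gd = 1.0 / h.max()                 # GD's step size is capped by the sharpest axis
lr_adam = 0.05

for t in range(1, 101):
    # gradient descent: the flat axis shrinks by only (1 - 1/kappa) per step
    theta_gd = theta_gd - lr_gd * h * theta_gd
    # Adam: dividing by sqrt(vhat) ~ |gradient| equalizes progress across axes
    g = h * theta_adam
    m = 0.9 * m + 0.1 * g
    v = 0.999 * v + 0.001 * g**2
    mhat = m / (1 - 0.9**t)
    vhat = v / (1 - 0.999**t)
    theta_adam = theta_adam - lr_adam * mhat / (np.sqrt(vhat) + 1e-8)
```

Because the Hessian here is diagonal, the rescaling is exactly right; rotating the quadratic so curvature lives off the diagonal would remove that advantage, which is the failure mode noted above.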
- The learning rate schedule can matter as much as the optimizer choice. Linear warmup followed by cosine decay is the standard schedule for transformer training. Warmup gives Adam time to build reliable second-moment estimates; cosine decay smoothly reduces the learning rate for fine-grained convergence.
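The schedule itself is a few lines; a sketch as a function of step count, with illustrative names and hyperparameters:

```python
import math

def lr_schedule(step, warmup_steps, total_steps, peak_lr, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                  # linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

During warmup, $\hat{v}$ is built from only a handful of gradients; keeping the learning rate small while those estimates are noisy is what prevents destabilizing early steps.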
- Second-order methods are impractical at scale but essential for understanding. Newton's method converges quadratically but requires $O(n^2)$ storage and $O(n^3)$ computation — prohibitive at the scale of modern networks. Understanding Newton's method reveals that Adam is a diagonal approximation to the ideal (Hessian-informed) update, connecting the practical tools we use to the theory that explains them.
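A sketch of Newton's method on a tiny convex but non-quadratic objective (the function is illustrative), with the $\mathbf{H}\mathbf{d} = -\nabla f$ linear solve written out explicitly:

```python
import numpy as np

# illustrative objective: f(theta) = theta_0^2 + theta_1^2 + exp(theta_0)
def grad(t):
    return np.array([2.0 * t[0] + np.exp(t[0]), 2.0 * t[1]])

def hessian(t):
    return np.array([[2.0 + np.exp(t[0]), 0.0],
                     [0.0,                2.0]])

theta = np.array([1.0, 1.0])
for _ in range(8):
    # each Newton step solves H d = -grad(f): a dense O(n^3) solve in general,
    # which is what makes the method infeasible at network scale
    d = np.linalg.solve(hessian(theta), -grad(theta))
    theta = theta + d
```

Replacing the full Hessian with its diagonal turns this solve into elementwise division, which is exactly the kind of curvature rescaling Adam approximates with $\sqrt{\hat{v}}$.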