Further Reading — Chapter 2: Multivariate Calculus and Optimization
Annotated recommendations for going deeper into the topics covered in this chapter, ordered from most foundational to most specialized.
1. Boyd, S. & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
What it covers: The definitive textbook on convex optimization. Chapters 1-5 develop the theory of convex sets, convex functions, and optimality conditions (including KKT). Chapters 9-11 cover unconstrained and constrained minimization algorithms. Chapter 2 (convex sets) and Chapter 3 (convex functions) provide far more depth on convexity than this chapter can.
How to read it: Start with Chapters 2 and 3 for the theory, then skip to Chapter 9 (unconstrained minimization) for the algorithmic perspective. The duality theory in Chapters 5 and 11 is essential if you will work on constrained optimization problems (fairness constraints, portfolio optimization, support vector machines). The entire book is available free at https://web.stanford.edu/~boyd/cvxbook/.
Connection to this chapter: Sections 2.5 (convexity) and 2.8.4 (KKT conditions) are condensed treatments of material developed rigorously in Chapters 2-3 and Chapter 5 of Boyd & Vandenberghe.
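As a small illustration of the convexity material in these chapters, the standard second-order test says a twice-differentiable function is convex iff its Hessian is positive semidefinite everywhere. For a quadratic f(x) = ½xᵀQx + bᵀx the Hessian is just Q, so convexity reduces to an eigenvalue check. The sketch below is illustrative (the matrix Q is an invented example, not one from the chapter):

```python
import numpy as np

# Hypothetical example: f(x) = 0.5 x^T Q x + b^T x is convex
# exactly when the symmetric matrix Q is positive semidefinite.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])

# eigvalsh is for symmetric matrices and returns real eigenvalues
eigvals = np.linalg.eigvalsh(Q)
is_convex = bool(np.all(eigvals >= 0))
print(is_convex)  # True: both eigenvalues are positive
```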
2. Ruder, S. (2016). "An overview of gradient descent optimization algorithms." arXiv:1609.04747.
What it covers: A comprehensive survey of gradient descent variants, from vanilla SGD through Adam, with clear equations, intuitive explanations, and convergence analysis. Also covers less common methods (Adadelta, Nadam, AMSGrad) and provides guidance on optimizer selection. Updated periodically since initial publication.
How to read it: Read front to back — it is written as a tutorial, building each method as an improvement on the last. Pay particular attention to Section 4 (challenges motivating each variant) and Section 6 (which optimizer to use). This is one of the most cited resources in the deep learning optimization literature.
Connection to this chapter: The SGD family tree in Section 2.6 follows the same logical progression as Ruder's survey but includes implementation code and the connection to second-order methods.
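The first step in Ruder's progression, vanilla gradient descent to momentum, can be sketched in a few lines. The toy quadratic and the specific step sizes below are illustrative choices, not taken from the survey or the chapter:

```python
import numpy as np

def grad(x):
    # Gradient of the toy quadratic f(x) = 0.5 * (x1^2 + 10 * x2^2),
    # a poorly conditioned bowl with its minimum at the origin.
    return np.array([1.0, 10.0]) * x

# Vanilla gradient descent: step directly along the negative gradient
x = np.array([1.0, 1.0])
for _ in range(100):
    x = x - 0.05 * grad(x)

# Momentum: accumulate a velocity, an exponentially decaying
# average of past gradients, and step along it instead
y, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    v = 0.9 * v + 0.05 * grad(y)
    y = y - v
# Both iterates approach the minimum at the origin.
```

Each later method in the survey (Adagrad, RMSprop, Adam) modifies this loop by rescaling the step per coordinate.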
3. Kingma, D. P. & Ba, J. (2015). "Adam: A Method for Stochastic Optimization." Proceedings of ICLR 2015. arXiv:1412.6980.
What it covers: The original Adam paper, introducing the combination of momentum (first moment) and adaptive learning rates (second moment) with bias correction. Includes the convergence analysis, the connection to natural gradient methods, and experimental results on deep learning benchmarks.
How to read it: The paper is short (15 pages including appendix) and well-written. Focus on Section 2 (the algorithm itself), the bias correction derivation in Section 3, and the regret bound in Section 4. The connection to the natural gradient (Section 5) provides deep insight into why Adam approximates second-order information.
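The update in Section 2 of the paper is compact enough to sketch directly. The function name `adam_step` below is an illustrative choice; the update rule and default hyperparameters follow Algorithm 1 of Kingma & Ba:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update (Algorithm 1 in Kingma & Ba), t counted from 1.
    m = b1 * m + (1 - b1) * g          # first moment: momentum
    v = b2 * v + (1 - b2) * g * g      # second moment: adaptive scaling
    m_hat = m / (1 - b1 ** t)          # bias correction: m and v are
    v_hat = v / (1 - b2 ** t)          # initialized at zero
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# One step on f(theta) = theta^2 with gradient 2 * theta.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
g = 2 * theta
theta, m, v = adam_step(theta, g, m, v, t=1)
# The bias correction makes m_hat = g and sqrt(v_hat) = |g| at t = 1,
# so the very first step has magnitude ~lr regardless of gradient scale.
```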
Companion paper: Loshchilov, I. & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. This paper introduces AdamW and demonstrates that the standard practice of combining Adam with L2 regularization is theoretically flawed. Essential reading for anyone training transformer models.
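The AdamW fix is a one-line change in where the decay term enters. In Adam-with-L2, the penalty gradient `wd * theta` is added to `g` and therefore passes through the adaptive rescaling; in AdamW it is applied directly to the weights. A minimal sketch (the name `adamw_step` and the hyperparameter values are illustrative):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=0.001, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    # AdamW (Loshchilov & Hutter): the weight-decay term is NOT folded
    # into the gradient g, so it is not divided by sqrt(v_hat) and
    # decays all weights at the same relative rate.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

# With a zero gradient the adaptive term vanishes and only the decay
# acts: theta shrinks by exactly a factor of (1 - lr * wd).
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
theta, m, v = adamw_step(theta, np.zeros(1), m, v, t=1, lr=0.1, wd=0.5)
```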
4. Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). "Automatic Differentiation in Machine Learning: a Survey." Journal of Machine Learning Research, 18(153), 1-43.
What it covers: A thorough survey of automatic differentiation (AD), covering forward mode, reverse mode (backpropagation), higher-order derivatives, and the implementation of AD in modern frameworks (TensorFlow, PyTorch, JAX). Includes the historical development of AD, its connections to numerical differentiation and symbolic differentiation, and advanced topics like checkpointing for memory-efficient backpropagation.
How to read it: Sections 1-3 provide the essential theory (forward vs. reverse mode, computational graphs, the chain rule). Section 4 on implementation is valuable if you want to understand how PyTorch's autograd engine works internally. Section 5 on applications connects AD to areas beyond standard backpropagation: scientific computing, probabilistic programming, and differentiable rendering.
Connection to this chapter: Section 2.4.4 (forward vs. reverse mode AD) gives a condensed treatment. This survey provides the full theoretical framework and places backpropagation in the broader context of automatic differentiation research.
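Forward mode, the simpler of the two modes the survey develops, can be demonstrated with dual numbers: each value carries its derivative alongside it, and every primitive operation propagates both via the chain rule. The tiny `Dual` class below is an illustrative sketch, not the machinery any real framework uses:

```python
class Dual:
    # Minimal forward-mode AD: carry (value, derivative) pairs and
    # apply the chain rule at every primitive operation.
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__

def f(x):
    return x * x * x + 2 * x  # f(x) = x^3 + 2x, so f'(x) = 3x^2 + 2

x = Dual(2.0, 1.0)  # seed the input's derivative with 1
y = f(x)
# y.val is f(2) = 12.0, y.dot is f'(2) = 14.0
```

Reverse mode records the same graph but propagates sensitivities backward from the output, which is why it is the efficient choice when one scalar loss depends on many parameters.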
5. Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018). "Visualizing the Loss Landscape of Neural Nets." NeurIPS 2018.
What it covers: The seminal paper on loss landscape visualization for neural networks. Introduces the "filter normalization" technique that produces meaningful 2D slices of high-dimensional loss surfaces. Shows that skip connections in ResNets create dramatically smoother loss landscapes than plain networks, partially explaining why ResNets are easier to train. Includes stunning visualizations of loss surfaces for various architectures.
How to read it: The key contribution is the filter normalization method (Section 3), which addresses the scale-invariance problem that makes naive random-direction visualization misleading. The visualization results (Section 4) provide powerful geometric intuition for why certain architectural choices (skip connections, batch normalization) improve training. The discussion of landscape sharpness and generalization (Section 5) ties into the broader question of why neural networks generalize despite non-convexity.
Connection to this chapter: Section 2.9 introduces loss landscape visualization concepts. This paper provides the rigorous methodology for extending those techniques to real neural networks, and is the foundation for Exercise 2.21.
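The core of filter normalization is easy to sketch: draw a random Gaussian direction with the same shape as each weight tensor, then rescale each filter of the direction to match the norm of the corresponding filter in the trained weights, so the visualization is comparable across layers of different scales. The function name and the choice to treat each leading-axis slice as a "filter" are illustrative assumptions, following Section 3 of Li et al.:

```python
import numpy as np

def filter_normalized_direction(weights, rng):
    # One random direction d, with each filter (leading-axis slice)
    # rescaled so ||d_filter|| = ||w_filter||, removing the scale
    # ambiguity that makes raw random directions misleading.
    direction = []
    for w in weights:
        d = rng.standard_normal(w.shape)
        w2 = w.reshape(w.shape[0], -1)
        d2 = d.reshape(d.shape[0], -1)
        norms_w = np.linalg.norm(w2, axis=1, keepdims=True)
        norms_d = np.linalg.norm(d2, axis=1, keepdims=True) + 1e-10
        direction.append((d2 * norms_w / norms_d).reshape(w.shape))
    return direction

# Toy "model": two weight tensors with known per-filter norms.
rng = np.random.default_rng(0)
W = [np.ones((4, 3)), np.full((2, 5), 2.0)]
D = filter_normalized_direction(W, rng)
# Each row of D[0] now has norm sqrt(3); each row of D[1], 2 * sqrt(5).
```

A 2D slice of the loss surface is then obtained by evaluating the loss at the trained weights plus α·D₁ + β·D₂ for two such directions over a grid of (α, β).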