Chapter 3 Further Reading
Foundational Textbooks
Calculus and Optimization
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapters 4 (Numerical Computation) and 8 (Optimization for Training Deep Models) provide an excellent treatment of gradient-based optimization from a deep learning perspective. Chapter 6 covers backpropagation in neural networks. Freely available at deeplearningbook.org.
- Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. The definitive reference on convex optimization. While neural network loss functions are non-convex, many of the concepts (duality, convergence rates, KKT conditions) provide essential intuition. Freely available at stanford.edu/~boyd/cvxbook.
- Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. Comprehensive coverage of line search methods, trust region methods, conjugate gradient, and quasi-Newton methods. Particularly useful for understanding the theoretical underpinnings of gradient descent convergence.
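The convergence results these texts analyze are easy to see in code. Below is a minimal sketch of plain gradient descent on a convex quadratic; the function, learning rate, and stopping tolerance are arbitrary illustrative choices, not taken from any of the references above.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=1000):
    """Minimal gradient descent: step against the gradient until it (nearly) vanishes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - lr * g
    return x

# Minimize f(x) = 0.5 * x^T A x with A positive definite; the minimizer is the origin.
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])
x_star = gradient_descent(lambda x: A @ x, x0=[2.0, -1.5])
print(x_star)  # approximately [0, 0]
```

For a quadratic like this, each step contracts the error whenever the learning rate is below 2 divided by the largest eigenvalue of A, which is the simplest instance of the step-size conditions analyzed in Nocedal & Wright.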
Automatic Differentiation
- Griewank, A., & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed.). SIAM. The most thorough treatment of automatic differentiation theory, covering forward mode, reverse mode, higher-order derivatives, and sparsity exploitation. Essential reading for anyone implementing or extending autodiff systems.
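Forward mode, one of the two modes Griewank and Walther analyze in depth, can be illustrated with dual numbers in a few lines. The sketch below is a toy construction for illustration only (the Dual class and sin helper are ad hoc names, not code from the book): each value carries its derivative alongside it, and every operation propagates both.

```python
import math

class Dual:
    """Toy forward-mode AD via dual numbers: carry (value, derivative) through each op."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(x):
    # Chain rule for sin: d/dx sin(u) = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# Differentiate f(x) = x * sin(x) + x at x = 1 by seeding dx/dx = 1.
x = Dual(1.0, 1.0)
y = x * sin(x) + x
print(y.val, y.dot)  # f(1) and f'(1) = sin(1) + cos(1) + 1
```

One forward pass yields the derivative with respect to a single input, which is why reverse mode is preferred for functions with many inputs and a scalar output, the typical situation when training neural networks.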
Key Research Papers
Optimization Algorithms
- Robbins, H., & Monro, S. (1951). "A Stochastic Approximation Method." Annals of Mathematical Statistics, 22(3), 400-407. The foundational paper on stochastic approximation that underlies SGD. Establishes the convergence conditions for stochastic gradient methods.
- Polyak, B. T. (1964). "Some Methods of Speeding Up the Convergence of Iteration Methods." USSR Computational Mathematics and Mathematical Physics, 4(5), 1-17. Introduces the momentum method (heavy ball method) for accelerating gradient descent convergence.
- Kingma, D. P., & Ba, J. (2015). "Adam: A Method for Stochastic Optimization." ICLR. The paper introducing Adam, now the most widely used optimizer in deep learning. Clearly explains the bias correction mechanism and provides convergence analysis.
- Loshchilov, I., & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR. Introduces AdamW, which decouples weight decay from the adaptive learning rate. This is now the standard optimizer for training transformers and large language models. (A sketch of the Adam/AdamW update appears after this list.)
- Ruder, S. (2016). "An Overview of Gradient Descent Optimization Algorithms." arXiv:1609.04747. An excellent survey that covers SGD, momentum, Nesterov momentum, Adagrad, Adadelta, RMSProp, and Adam in a unified framework. One of the most cited overview papers in the field.
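To make the bias-correction and decoupled-weight-decay mechanisms mentioned above concrete, here is a sketch of a single Adam-style update in NumPy. The hyperparameter defaults are the values commonly used with Adam, and the weight_decay branch marks where AdamW applies decay directly to the weights rather than through the adaptive scaling; this is an illustration, not a substitute for a library optimizer.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam-style update; with weight_decay > 0 the decay is applied in the
    decoupled (AdamW-style) fashion. A sketch, not a library implementation."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)                # bias correction; t starts at 1
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    if weight_decay:
        # Decoupled weight decay: shrink the weights directly, outside the adaptive step.
        theta = theta - lr * weight_decay * theta
    return theta, m, v

theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):
    grad = 2 * theta                          # gradient of f(theta) = ||theta||^2
    theta, m, v = adam_step(theta, grad, m, v, t, weight_decay=0.01)
print(theta)                                  # moving slowly toward the minimum at 0
```

The bias-correction terms matter most during the first few steps, when the moving averages m and v are still close to their zero initialization and would otherwise understate the gradient statistics.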
Loss Landscapes and Optimization Theory
- Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). "Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization." NeurIPS. Demonstrates that saddle points, not local minima, are the primary obstacle in high-dimensional optimization, fundamentally changing how we think about neural network training. (A small numerical illustration follows this list.)
- Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018). "Visualizing the Loss Landscape of Neural Nets." NeurIPS. Introduces filter-normalized loss landscape visualization, revealing how architecture choices (skip connections, batch normalization) affect landscape smoothness.
- Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., & LeCun, Y. (2015). "The Loss Surfaces of Multilayer Networks." AISTATS. Connects neural network loss surfaces to the energy landscapes of spin glass models in statistical physics, providing theoretical justification for why local minima are often good enough.
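A two-variable example makes the saddle-point picture from Dauphin et al. tangible. The function below is a standard textbook saddle, not one drawn from the paper: its gradient vanishes at the origin, yet the Hessian there has eigenvalues of both signs, so the point is neither a minimum nor a maximum.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has zero gradient at the origin, but its Hessian there is
# diag(2, -2): curvature up in one direction, down in the other -- a saddle point.
hessian_at_origin = np.array([[2.0, 0.0],
                              [0.0, -2.0]])
print(np.linalg.eigvalsh(hessian_at_origin))  # [-2.  2.] -> mixed signs, a saddle
```

Dauphin et al. argue that in high dimensions critical points with such mixed curvature vastly outnumber genuine local minima, which is why escaping saddles dominates the analysis of neural network training.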
Automatic Differentiation in Practice
- Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). "Automatic Differentiation in Machine Learning: A Survey." JMLR, 18(153), 1-43. The definitive survey on autodiff in machine learning. Covers forward mode, reverse mode, implementation strategies, and connections to related techniques.
- Paszke, A., et al. (2019). "PyTorch: An Imperative Style, High-Performance Deep Learning Library." NeurIPS. Describes PyTorch's autograd system, which implements reverse-mode autodiff with dynamic computational graphs. (A minimal usage example follows this list.)
- Bradbury, J., et al. (2018). "JAX: Composable Transformations of Python+NumPy Programs." Software available at github.com/google/jax. JAX implements a functional approach to autodiff based on composable program transformations, supporting both forward and reverse mode as well as higher-order derivatives.
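The stylistic difference between PyTorch's dynamic graph and JAX's composable transformations shows up even in a one-line derivative. The snippet below computes the same derivative both ways; it assumes both torch and jax are installed and is meant only as a minimal illustration of the two APIs.

```python
import torch
import jax

# PyTorch: operations on tensors with requires_grad build a graph as they run;
# backward() performs reverse-mode autodiff over that recorded graph.
x = torch.tensor(2.0, requires_grad=True)
y = x**3 + 2 * x
y.backward()
print(x.grad)                      # 3*x^2 + 2 = 14 at x = 2

# JAX: grad transforms a pure function into a new function; transformations
# compose, so grad(grad(f)) gives the second derivative directly.
f = lambda x: x**3 + 2 * x
print(jax.grad(f)(2.0))            # 14.0
print(jax.grad(jax.grad(f))(2.0))  # 6*x = 12 at x = 2
```

In PyTorch the graph is built during the forward computation and consumed by backward(); in JAX, differentiation is itself a function transformation, so it composes freely with jit, vmap, and further applications of grad.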
Learning Rate Schedules and Training Dynamics
- Smith, L. N. (2017). "Cyclical Learning Rates for Training Neural Networks." WACV. Introduces cyclical learning rates and the learning rate range test, practical techniques for finding good learning rate schedules.
- Loshchilov, I., & Hutter, F. (2017). "SGDR: Stochastic Gradient Descent with Warm Restarts." ICLR. Introduces cosine annealing with warm restarts, a widely used learning rate schedule for training deep networks. (The schedule is sketched after this list.)
- Bottou, L. (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent." COMPSTAT. A practical guide to scaling SGD to large datasets, with insights on learning rate selection and batch size trade-offs.
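The cosine-annealing-with-warm-restarts schedule from the SGDR entry is simple enough to write out directly. The sketch below keeps the restart period fixed, whereas the paper also allows the period to grow by a multiplier after each restart; the learning-rate bounds and period here are placeholder values.

```python
import math

def cosine_annealing_with_restarts(step, eta_max=1e-3, eta_min=0.0, period=1000):
    """Learning rate at a given step: cosine decay from eta_max to eta_min,
    restarting from eta_max every `period` steps (fixed-period simplification)."""
    t_cur = step % period
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / period))

# The rate decays smoothly within each cycle, then jumps back up at the restart.
print(cosine_annealing_with_restarts(0))     # 1e-3 (start of a cycle)
print(cosine_annealing_with_restarts(500))   # ~5e-4 (midway through the cycle)
print(cosine_annealing_with_restarts(1000))  # 1e-3 (warm restart)
```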
Online Resources
Tutorials and Courses
- Stanford CS231n: Convolutional Neural Networks for Visual Recognition. Lecture notes on backpropagation and optimization are particularly clear and include detailed worked examples. Available at cs231n.github.io.
- 3Blue1Brown: "The Essence of Calculus" (YouTube series). Grant Sanderson's visual explanations of derivatives, integrals, and the chain rule provide outstanding geometric intuition.
- Andrej Karpathy: "micrograd" (GitHub). A minimal autograd engine in about 100 lines of Python. The code directly corresponds to the autodiff engine built in this chapter's Case Study 2. Available at github.com/karpathy/micrograd.
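In the same spirit as micrograd and this chapter's Case Study 2, the sketch below shows the core of a scalar reverse-mode engine. It is a stripped-down illustration, not Karpathy's code: only addition and multiplication are implemented, and the class and method names are ad hoc.

```python
class Value:
    """A scalar that records the operations producing it, so gradients can be
    propagated backwards through the resulting graph."""
    def __init__(self, data, _parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = _parents
        self._backward = lambda: None  # how to route out.grad to the parents

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad            # d(a+b)/da = 1
            other.grad += out.grad           # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build(p)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# d/dx of (x*y + x) at x=2, y=3 is y + 1 = 4; d/dy is x = 2.
x, y = Value(2.0), Value(3.0)
z = x * y + x
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

The two ingredients are the same as in any reverse-mode system: each operation records how to route gradients to its inputs, and backward() visits the recorded graph in reverse topological order, applying the chain rule node by node.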
Documentation
- PyTorch Autograd Documentation. Detailed explanation of PyTorch's computational graph and gradient computation system. Available at pytorch.org/docs/stable/autograd.html.
- JAX Autodiff Cookbook. Practical examples of forward-mode and reverse-mode differentiation in JAX. Available at jax.readthedocs.io.
Connections to Other Chapters
| Topic | Where It Leads |
|---|---|
| Gradient descent for neural networks | Chapter 6 (Neural Network Fundamentals), Chapter 8 (Training Deep Networks) |
| Adam and AdamW optimizers | Chapter 8 (Training Strategies), Chapter 19 (Transformers) |
| Backpropagation implementation | Chapter 6 (Backpropagation in Practice) |
| Loss landscape geometry | Chapter 8 (Training Dynamics), Chapter 12 (Scaling Laws) |
| Learning rate schedules | Chapter 8 (Hyperparameter Tuning) |
| Gradient clipping | Chapter 8 (Training Stability), Chapter 20 (Sequence Models) |
| Mixed precision and gradient checkpointing | Chapter 10 (Computational Efficiency) |
| Second-order methods and the Hessian | Chapter 13 (Regularization), Chapter 10 (Advanced Optimization) |