Chapter 12: Further Reading
Textbooks
Optimization and Training
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8 (Optimization for Training Deep Models) provides a rigorous treatment of SGD, momentum, adaptive learning rates, and second-order methods. Chapter 7 (Regularization) covers weight decay, dropout, and early stopping. Available free at deeplearningbook.org.
- Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). Dive into Deep Learning. Cambridge University Press. The optimization chapters (d2l.ai) include executable PyTorch code for SGD, momentum, Adam, and learning rate schedules. The hands-on approach complements the theoretical treatment in this chapter.
- Bishop, C. M. & Bishop, H. (2024). Deep Learning: Foundations and Concepts. Springer. The chapters on optimization and regularization provide a mathematically precise treatment with clear derivations of momentum, Adam, and normalization techniques.
Implementation-Focused
- Stevens, E., Antiga, L., & Viehmann, T. (2020). Deep Learning with PyTorch. Manning Publications. Chapters 5-8 cover the training loop, loss functions, and optimization in PyTorch. Excellent practical examples for building production training pipelines.
- Howard, J. & Gugger, S. (2020). Deep Learning for Coders with fastai and PyTorch. O'Reilly Media. Covers the one-cycle policy, the learning rate finder, and discriminative learning rates with practical implementations. The fastai library embodies many best practices from this chapter.
Key Papers
Optimizers
- Kingma, D. P. & Ba, J. (2015). "Adam: A method for stochastic optimization." ICLR. The original Adam paper, introducing adaptive moment estimation. The default hyperparameters ($\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$) proposed here remain the standard.
- Loshchilov, I. & Hutter, F. (2019). "Decoupled weight decay regularization." ICLR. Introduces AdamW, showing that decoupling weight decay from the gradient-based update produces better generalization than L2 regularization with Adam. This paper changed the default optimizer for transformers.
- Smith, L. N. & Topin, N. (2019). "Super-convergence: Very fast training of neural networks using large learning rates." AAAI. Describes the one-cycle learning rate policy and the super-convergence phenomenon, achieving state-of-the-art results in fewer epochs (see the sketch after this list).
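As a rough illustration of how these three papers translate into a PyTorch setup, here is a minimal sketch combining AdamW (with the Kingma & Ba defaults) and a one-cycle schedule. The model, learning rate, and step count are placeholders, not recommendations:

```python
import torch
from torch import nn

model = nn.Linear(128, 10)  # placeholder model, for illustration only

# AdamW: the Adam defaults from Kingma & Ba (2015), with weight decay
# decoupled from the gradient-based update as in Loshchilov & Hutter (2019).
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01
)

# One-cycle policy (Smith & Topin, 2019): warm up to max_lr, then anneal.
# scheduler.step() is called once per batch, after optimizer.step().
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=10_000
)
```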
Normalization
- Ioffe, S. & Szegedy, C. (2015). "Batch normalization: Accelerating deep network training by reducing internal covariate shift." ICML. The original batch normalization paper. While the "internal covariate shift" explanation has been debated, batch norm remains one of the most impactful techniques in deep learning.
- Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). "Layer normalization." arXiv:1607.06450. Introduces layer normalization, which normalizes over the feature dimension rather than the batch dimension. Essential for transformers and RNNs.
- Wu, Y. & He, K. (2018). "Group normalization." ECCV. Proposes group normalization as a compromise between batch norm and layer norm, performing well across a range of batch sizes (see the sketch after this list).
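The three normalization schemes map directly onto PyTorch modules. The sketch below is illustrative only; the channel and group counts are arbitrary choices, not defaults from the papers:

```python
import torch
from torch import nn

x = torch.randn(8, 64, 32, 32)  # (batch, channels, height, width)

# Batch norm: per-channel statistics over the batch and spatial dimensions.
bn = nn.BatchNorm2d(num_features=64)

# Layer norm: per-sample statistics over the feature dimensions.
ln = nn.LayerNorm(normalized_shape=[64, 32, 32])

# Group norm: channels split into groups; statistics independent of batch size.
gn = nn.GroupNorm(num_groups=8, num_channels=64)

print(bn(x).shape, ln(x).shape, gn(x).shape)  # each torch.Size([8, 64, 32, 32])
```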
Learning Rate Schedules
- Loshchilov, I. & Hutter, F. (2017). "SGDR: Stochastic gradient descent with warm restarts." ICLR. Introduces cosine annealing with warm restarts, showing that periodic learning rate resets can improve optimization.
- Smith, L. N. (2017). "Cyclical learning rates for training neural networks." WACV. Proposes the learning rate range test (LR finder) and cyclical learning rates, providing practical tools for learning rate selection.
- Goyal, P., et al. (2017). "Accurate, large minibatch SGD: Training ImageNet in 1 hour." arXiv:1706.02677. Demonstrates the linear scaling rule for learning rates with large batch sizes and the importance of warmup for training stability (see the sketch after this list).
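A minimal sketch of how these ideas can be combined in PyTorch: the linear scaling rule, a short linear warmup, and SGDR-style cosine annealing with warm restarts. All numbers here are placeholders chosen for illustration:

```python
import torch
from torch import nn

model = nn.Linear(128, 10)           # placeholder model
base_lr, base_batch = 0.1, 256       # illustrative reference setting
batch_size = 1024

# Linear scaling rule (Goyal et al., 2017): scale the learning rate with batch size.
lr = base_lr * batch_size / base_batch

optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

# Warm up linearly for 5 epochs, then switch to cosine annealing with warm
# restarts (Loshchilov & Hutter, 2017), chained via SequentialLR.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=5)
sgdr = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, sgdr], milestones=[5]
)
```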
Initialization
- He, K., Zhang, X., Ren, S., & Sun, J. (2015). "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification." ICCV. Derives He/Kaiming initialization for ReLU networks and demonstrates its critical importance for training very deep networks.
- Glorot, X. & Bengio, Y. (2010). "Understanding the difficulty of training deep feedforward neural networks." AISTATS. Analyzes activation and gradient flow, deriving Xavier/Glorot initialization for sigmoid and tanh networks (see the sketch after this list).
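A minimal sketch of applying these schemes with `torch.nn.init`; the layer types and the ReLU assumption are illustrative, not prescriptions from the papers:

```python
import torch
from torch import nn

def init_weights(module: nn.Module) -> None:
    """He/Kaiming init for layers feeding into ReLU; zero the biases."""
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.apply(init_weights)

# For tanh or sigmoid networks, Xavier/Glorot init is the usual alternative:
# nn.init.xavier_uniform_(module.weight, gain=nn.init.calculate_gain("tanh"))
```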
Mixed Precision and Scaling
- Micikevicius, P., et al. (2018). "Mixed precision training." ICLR. The foundational paper on FP16 mixed precision training, introducing loss scaling and master weights. This work enabled practical training of large models on GPUs with limited memory.
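A minimal sketch of the loss-scaling pattern this paper introduced, expressed with PyTorch's `torch.amp` API (autocast plus `GradScaler`). The model, optimizer, and one-batch loader are placeholders, and a CUDA device is assumed:

```python
import torch
from torch import nn

model = nn.Linear(128, 10).cuda()              # assumes a CUDA device is available
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.amp.GradScaler("cuda")          # dynamic loss scaling for FP16 gradients

# Stand-in data loader: one random batch, purely for illustration.
loader = [(torch.randn(16, 128), torch.randint(0, 10, (16,)))]

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)
    with torch.amp.autocast("cuda"):           # forward pass runs in mixed precision
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()              # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                     # unscales gradients, steps if they are finite
    scaler.update()                            # adjust the scale factor for the next iteration
```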
Gradient Clipping
- Pascanu, R., Mikolov, T., & Bengio, Y. (2013). "On the difficulty of training recurrent neural networks." ICML. Analyzes exploding and vanishing gradients in RNNs and proposes gradient clipping as a solution. The analysis extends to deep feedforward networks.
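A minimal sketch of clip-by-global-norm in a single training step; the model, data, and the threshold of 1.0 are placeholders for illustration:

```python
import torch
from torch import nn

model = nn.Linear(64, 10)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(16, 64), torch.randint(0, 10, (16,))
optimizer.zero_grad()
loss_fn(model(x), y).backward()

# Rescale all gradients so their global L2 norm does not exceed max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```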
Focal Loss
- Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). "Focal loss for dense object detection." ICCV. Introduces focal loss to address the extreme foreground-background class imbalance in object detection. The technique generalizes to any imbalanced classification problem.
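A minimal sketch of the binary (one-vs-all) focal loss, using the commonly cited defaults $\gamma = 2$ and $\alpha = 0.25$; the tensor shapes and the `focal_loss` helper name are illustrative:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Binary focal loss: down-weights easy examples by (1 - p_t)^gamma."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)               # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)   # class-balance weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Illustrative use: 8 samples, 5 one-vs-all targets in {0, 1}.
logits = torch.randn(8, 5)
targets = torch.randint(0, 2, (8, 5)).float()
print(focal_loss(logits, targets))
```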
Online Resources
- PyTorch Documentation: Optimization. https://pytorch.org/docs/stable/optim.html Comprehensive reference for all built-in optimizers, learning rate schedulers, and their hyperparameters.
- PyTorch Documentation: Automatic Mixed Precision. https://pytorch.org/docs/stable/amp.html Guide to using `torch.amp.autocast` and `GradScaler` for mixed precision training.
- PyTorch Tutorials: "Training a Classifier." https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html End-to-end tutorial for training a CNN on CIFAR-10 with the complete training loop pattern from this chapter.
- Andrej Karpathy: "A Recipe for Training Neural Networks" (2019). http://karpathy.github.io/2019/04/25/recipe/ A practical guide to the debugging and diagnostic process described in Section 12.9. Essential reading for any deep learning practitioner.
Looking Ahead
The training techniques from this chapter connect directly to:
- Chapter 13 (Regularization): Weight decay is an optimizer setting; dropout and data augmentation interact with the training loop; early stopping requires validation monitoring.
- Chapter 14 (CNNs): Batch normalization placement in convolutional blocks; transfer learning requires parameter groups and discriminative learning rates (a minimal sketch follows this list).
- Chapter 15 (RNNs): Gradient clipping is essential; layer normalization replaces batch normalization; TBPTT requires careful scheduler integration.
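Previewing the Chapter 14 connection, here is a minimal sketch of parameter groups with discriminative learning rates. The `backbone` and `head` modules are placeholders, not a real pretrained network, and the learning rates are illustrative:

```python
import torch
from torch import nn

backbone = nn.Linear(512, 256)   # stands in for a pretrained feature extractor
head = nn.Linear(256, 10)        # stands in for a freshly initialized classifier

# Two parameter groups: a small lr preserves the pretrained weights,
# while a larger lr lets the new head train quickly.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.01,
)
```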