Chapter 7: Further Reading

Essential Sources

1. Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry, "How Does Batch Normalization Help Optimization?" (NeurIPS, 2018)

The paper that upended the standard explanation for batch normalization. Santurkar et al. demonstrate that BN does not significantly reduce internal covariate shift (the original hypothesis from Ioffe and Szegedy). Instead, BN smooths the loss landscape, making the loss more Lipschitz continuous and the gradients more predictive of the actual loss change. The experimental methodology is instructive: they inject additional covariate shift into BN-equipped networks and show that those networks still train faster than networks without BN. This paper is a case study in how the ML community can use a technique successfully for years before understanding why it works. Sections 3 and 4 (the smoothness analysis) are the essential reading.

Reading guidance: Pair this with the original BN paper (Ioffe and Szegedy, 2015) to understand both the initial motivation and the corrected understanding. The distinction matters for practice: if BN worked by fixing input distributions, then normalizing inputs would be sufficient. Since BN works by smoothing the loss landscape, the learnable affine parameters ($\gamma, \beta$) and the specific placement within the architecture matter.
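To make the role of the learnable affine parameters concrete, here is a minimal NumPy sketch of the BN transform in training mode (the function name is an assumption, and the running-statistics machinery a full implementation needs at inference time is omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization over the batch axis with learnable affine
    parameters gamma (scale) and beta (shift), as in Ioffe and Szegedy
    (2015). Training-mode sketch: uses batch statistics directly."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # ~zero mean, unit variance
    return gamma * x_hat + beta             # affine restores expressivity

# Normalization removes the batch's mean/variance; gamma and beta then
# let the network choose whatever scale and shift it actually needs.
x = np.random.randn(64, 8) * 3.0 + 2.0
y = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
```

With gamma fixed at 1 and beta at 0 this is pure normalization; in practice both are trained, which is part of why BN is more than input standardization.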

2. Ilya Loshchilov and Frank Hutter, "Decoupled Weight Decay Regularization" (ICLR, 2019)

This concise but impactful paper resolves a long-standing confusion in deep learning practice. Loshchilov and Hutter show that L2 regularization and weight decay are equivalent for vanilla SGD but not for Adam. When L2 regularization is added to the Adam gradient, the adaptive learning rate rescales the regularization term unevenly across parameters, weakening regularization on high-gradient parameters and strengthening it on low-gradient parameters. Decoupled weight decay (AdamW) applies weight decay independently of the gradient, providing uniform regularization. The paper is the origin of PyTorch's torch.optim.AdamW and should be read by anyone who uses Adam.

Reading guidance: Focus on Section 2 (the mathematical analysis of why L2 and weight decay diverge under Adam) and Section 4 (experimental comparison on image classification). The practical takeaway is simple: always use AdamW rather than Adam with weight_decay, and note that the optimal weight decay coefficient is typically $10^{-2}$ for AdamW vs. $10^{-4}$ for L2 in Adam.
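The coupling Loshchilov and Hutter analyze is easiest to see in a scalar sketch of the two update rules (the function names are hypothetical; real implementations operate on parameter tensors and carry the moment state per parameter):

```python
import math

def adam_l2_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=1e-2):
    """Adam with L2 regularization: the decay term is folded into the
    gradient, so it is rescaled by the adaptive denominator and is
    weakened wherever gradients are large."""
    g = g + wd * w                          # L2 term enters the moments
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """AdamW: weight decay is applied directly to the weights, outside
    the adaptive rescaling, so every parameter decays uniformly."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * wd * w, m, v
```

For a parameter with a large gradient, the `wd * w` term in the first version is divided by a large `sqrt(v_hat)` and nearly vanishes, while the decoupled `lr * wd * w` term in the second applies in full.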

3. Leslie N. Smith and Nicholay Topin, "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates" (2019)

The paper behind the one-cycle learning rate policy and the learning rate range test (LR finder). Smith and Topin make an empirical observation that challenges conventional wisdom: using a learning rate that is much higher than standard practice, combined with a specific warmup-then-annealing schedule and an inverse momentum schedule, produces models that converge 5-10x faster and generalize as well or better than models trained with standard schedules. The theoretical explanation — that large learning rates act as regularizers by steering the model toward wider minima — is suggestive rather than rigorous, but the empirical results are compelling and have been reproduced across architectures and datasets.

Reading guidance: Start with Section 2 (the LR range test) and Section 3 (the 1cycle policy). Section 5 discusses the connection between learning rate and regularization. The paper is also a useful case study in how empirical findings can outpace theory — the one-cycle policy works better than any theoretically motivated schedule in most practical settings, and the field's understanding of why is still catching up.
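The shape of the schedule can be sketched as follows. This follows the cosine-annealing variant popularized by PyTorch's OneCycleLR rather than the linear ramps in Smith and Topin's original paper, and the parameter names and defaults are assumptions:

```python
import math

def one_cycle_lr(step, total_steps, max_lr=0.1, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """One-cycle LR sketch: ramp from max_lr/div_factor up to max_lr
    over the first pct_start of training, then anneal down to
    max_lr/(div_factor * final_div_factor). Cosine interpolation."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = int(pct_start * total_steps)
    if step < warmup_steps:
        frac = step / max(1, warmup_steps)
        # warmup phase: rise from initial_lr to max_lr
        return initial_lr + (max_lr - initial_lr) * (1 - math.cos(math.pi * frac)) / 2
    # annealing phase: fall from max_lr to min_lr
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + (max_lr - min_lr) * (1 + math.cos(math.pi * frac)) / 2
```

The paper pairs this with an inverse momentum schedule (momentum falls while the learning rate rises, and vice versa), which a full implementation would add alongside.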

4. Paulius Micikevicius et al., "Mixed Precision Training" (ICLR, 2018)

The NVIDIA research paper that established the practice of mixed-precision training. Micikevicius et al. show that neural networks can be trained in fp16 without accuracy loss, provided three conditions are met: (1) maintain an fp32 master copy of the weights, (2) use loss scaling to prevent gradient underflow, and (3) keep certain operations (reductions, normalization) in fp32. The paper covers the IEEE 754 fp16 format, the loss scaling trick, and a systematic evaluation across CNNs, RNNs, and GANs. Though the paper predates bf16's widespread adoption, the principles extend directly: bf16's wider dynamic range eliminates condition (2) in most cases.

Reading guidance: Sections 2 and 3 (the mixed-precision recipe and loss scaling) are the core. Figure 1 — showing the distribution of gradient values and how many fall below fp16's representable range — is one of the most instructive diagrams in the mixed-precision literature. For the current state of the art, supplement with NVIDIA's documentation on Ampere/Hopper Tensor Cores and bf16 support.
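A tiny NumPy demonstration makes the underflow problem behind condition (2) tangible; the scale factor of 2^16 here is an illustrative choice, not a value from the paper:

```python
import numpy as np

# fp16's smallest positive subnormal is 2**-24 (about 6e-8); gradient
# values below roughly half of that round to zero when cast to fp16.
tiny_grad = np.float32(1e-8)
underflowed = np.float16(tiny_grad)          # becomes exactly 0.0

# Loss scaling: multiply the loss (and therefore every gradient) by a
# large constant before the backward pass, shifting gradients up into
# fp16's representable range, then unscale before the fp32 update.
scale = np.float32(2.0 ** 16)
scaled = np.float16(tiny_grad * scale)       # 6.55e-4: safely in range
recovered = np.float32(scaled) / scale       # unscale in fp32
```

The recovered value matches the original gradient up to fp16's ~10-bit mantissa precision, whereas without scaling the information is simply gone. Dynamic loss scaling (as in `torch.cuda.amp.GradScaler`) automates the choice of `scale`.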

5. Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, Chapters 7-8 (MIT Press, 2016)

Chapter 7 ("Regularization for Deep Learning") and Chapter 8 ("Optimization for Training Deep Models") of the Goodfellow-Bengio-Courville textbook remain the most comprehensive single treatment of the material in this chapter. Chapter 7 covers L2/L1 regularization, dropout (including the ensemble and noise-injection interpretations), early stopping, data augmentation, and multi-task learning. Chapter 8 covers SGD, momentum, Adam, initialization, batch normalization, and the role of the loss landscape geometry. The treatment is mathematically rigorous and pedagogically clear.
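As one concrete instance of Chapter 7's material, inverted dropout (the noise-injection view GBC analyzes) can be sketched in a few lines; the function name and seed handling here are assumptions, not GBC's notation:

```python
import numpy as np

def inverted_dropout(x, p_drop=0.5, train=True, seed=0):
    """Inverted dropout: during training, zero each activation with
    probability p_drop and rescale survivors by 1/(1 - p_drop) so the
    expected activation is unchanged. At test time this is the
    identity, so no rescaling is needed at inference."""
    if not train:
        return x
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= p_drop     # Bernoulli keep-mask
    return x * mask / (1.0 - p_drop)         # rescale to preserve E[x]
```

The `1/(1 - p_drop)` rescaling is what makes the train-time and test-time networks agree in expectation, which is the starting point for the ensemble interpretation GBC develops in Section 7.12.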

Reading guidance: If this chapter felt dense, GBC Chapters 7-8 provide a slower-paced, more thorough exposition of the same material. Section 7.12 (dropout) includes the connection to Bayesian inference that we mentioned briefly. Section 8.4 (initialization) provides additional derivations. The textbook is freely available at deeplearningbook.org.