Chapter 13: Further Reading

Textbooks

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 7 (Regularization for Deep Learning) provides a comprehensive mathematical treatment of weight decay, dropout, data augmentation, noise injection, and early stopping. The analysis of dropout as ensemble averaging is particularly clear.

  • Bishop, C. M. & Bishop, H. (2024). Deep Learning: Foundations and Concepts. Springer. Covers regularization from a Bayesian perspective, connecting weight decay to Gaussian priors and dropout to model averaging; a brief sketch of the weight decay derivation follows this list. The theoretical depth complements the practical focus of this chapter.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer. Chapters 3 and 7 provide the classical statistical perspective on regularization, including the bias-variance tradeoff, ridge regression, and LASSO. Essential background for understanding how deep learning regularization relates to classical methods.
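
A brief sketch of the Bayesian connection highlighted in Bishop & Bishop, included here only as a pointer: placing an isotropic Gaussian prior on the weights and taking the maximum a posteriori (MAP) estimate recovers the L2 weight decay penalty.

    \begin{aligned}
    p(\mathbf{w}) &= \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \sigma^2 \mathbf{I}) \\
    \hat{\mathbf{w}}_{\mathrm{MAP}} &= \arg\max_{\mathbf{w}} \left[ \log p(\mathcal{D} \mid \mathbf{w}) + \log p(\mathbf{w}) \right] \\
    &= \arg\min_{\mathbf{w}} \left[ -\log p(\mathcal{D} \mid \mathbf{w}) + \tfrac{1}{2\sigma^2} \lVert \mathbf{w} \rVert_2^2 \right]
    \end{aligned}

An L2 penalty with coefficient λ therefore corresponds to a Gaussian prior with variance σ² = 1/(2λ).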

Key Papers

Foundational

  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). "Dropout: A simple way to prevent neural networks from overfitting." JMLR, 15(1), 1929-1958. The original dropout paper, demonstrating that randomly dropping neurons during training prevents co-adaptation and improves generalization. The ensemble interpretation is developed in detail; a minimal code sketch follows this list.

  • Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). "Understanding deep learning requires rethinking generalization." ICLR. The landmark paper showing that neural networks can memorize random labels, challenging classical theories of generalization. This result motivates the search for new explanations of why regularization works.
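
To connect Srivastava et al.'s description to code, here is a minimal sketch of inverted dropout applied to a single activation tensor, assuming PyTorch; the 1/(1 - p) rescaling during training is what lets the trained network be used unchanged at test time. torch.nn.Dropout provides the same behavior and is what you would use in practice.

    import torch

    def inverted_dropout(x: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
        """Zero each activation independently with probability p during training.

        Scaling the survivors by 1/(1 - p) keeps the expected activation unchanged,
        so no rescaling is needed at test time (the 'inverted' formulation).
        """
        if not training or p == 0.0:
            return x
        keep_mask = (torch.rand_like(x) >= p).to(x.dtype)  # Bernoulli keep-mask
        return x * keep_mask / (1.0 - p)

    # During evaluation, pass training=False (with torch.nn.Dropout, call model.eval()).
    h = torch.randn(32, 128)
    h_dropped = inverted_dropout(h, p=0.5, training=True)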

Data Augmentation

  • Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). "mixup: Beyond empirical risk minimization." ICLR. Introduces mixup training, which creates virtual examples by interpolating between pairs of training examples and their labels. Simple to implement and consistently improves generalization; a minimal code sketch follows this list.

  • Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). "CutMix: Regularization strategy to train strong classifiers with localizable features." ICCV. Proposes CutMix, which combines cut-and-paste augmentation with label mixing. Outperforms mixup and cutout on image classification while producing better-localized features.

  • Cubuk, E. D., Zoph, B., Shlens, J., & Le, Q. V. (2020). "RandAugment: Practical automated data augmentation with a reduced search space." NeurIPS. Simplifies AutoAugment to just two hyperparameters (number of transformations and magnitude), making augmentation search practical for any dataset.
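
As a pointer to how little code mixup requires, here is a minimal sketch of building one mixed batch; it assumes integer class labels that are converted to one-hot targets so the labels can be interpolated along with the inputs, and it illustrates the idea rather than reproducing the authors' reference implementation. CutMix differs mainly in pasting a rectangular patch from the paired image and mixing the labels in proportion to the patch area.

    import torch
    import torch.nn.functional as F

    def mixup_batch(x, y, num_classes, alpha=0.2):
        """Return convex combinations of example pairs and of their one-hot labels."""
        lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing coefficient
        perm = torch.randperm(x.size(0))                              # random pairing
        y_onehot = F.one_hot(y, num_classes).float()
        x_mixed = lam * x + (1.0 - lam) * x[perm]
        y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
        return x_mixed, y_mixed

    # Usage sketch (recent PyTorch CrossEntropyLoss accepts class probabilities as
    # targets; with older versions, mix the two hard-label losses instead).
    x = torch.randn(16, 3, 32, 32)
    y = torch.randint(0, 10, (16,))
    x_mixed, y_mixed = mixup_batch(x, y, num_classes=10)

Recent torchvision releases also ship a RandAugment transform (torchvision.transforms.RandAugment) exposing essentially the two hyperparameters Cubuk et al. describe.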

Modern Regularization

  • Müller, R., Kornblith, S., & Hinton, G. E. (2019). "When does label smoothing help?" NeurIPS. Provides theoretical and empirical analysis of when and why label smoothing improves generalization and calibration; a short usage sketch follows this list.

  • Huang, G., Sun, Y., Liu, Z., Sedra, D., & Weinberger, K. Q. (2016). "Deep networks with stochastic depth." ECCV. Introduces stochastic depth, which randomly drops entire residual blocks during training. Reduces training time and improves generalization for very deep networks.

  • Loshchilov, I. & Hutter, F. (2019). "Decoupled weight decay regularization." ICLR. Shows that weight decay and L2 regularization are not equivalent for adaptive optimizers and proposes AdamW with decoupled weight decay.
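
The label smoothing and decoupled weight decay papers above both correspond to small, self-contained changes in a PyTorch training setup. A minimal sketch, assuming a reasonably recent PyTorch that exposes the label_smoothing argument on CrossEntropyLoss:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

    # Label smoothing: soften the one-hot targets (here by 0.1), which Müller et al.
    # analyze as improving both generalization and calibration.
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

    # Decoupled weight decay (AdamW): the decay is applied directly to the weights
    # rather than being folded into the adaptive gradient update.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

    x = torch.randn(8, 784)
    y = torch.randint(0, 10, (8,))
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

Recent torchvision releases also provide a stochastic-depth operator (torchvision.ops.stochastic_depth); a hand-rolled version simply skips a residual block's transform with some probability during training and passes the identity through instead.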

Pruning and Compression

  • Frankle, J. & Carbin, M. (2019). "The lottery ticket hypothesis: Finding sparse, trainable neural networks." ICLR. Demonstrates that dense networks contain sparse subnetworks that can match the full network's performance when trained from the same initialization; a magnitude-pruning sketch follows this list.

  • Han, S., Pool, J., Tran, J., & Dally, W. J. (2015). "Learning both weights and connections for efficient neural networks." NeurIPS. Introduces magnitude-based pruning and shows that neural networks can be pruned by 90%+ with minimal accuracy loss.
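
As a sketch of the magnitude-pruning step both papers build on, the following zeroes the smallest-magnitude weights globally; in the lottery ticket procedure, the surviving weights are then reset to their original initialization and retrained. This is a simplified illustration with ad hoc helper names, not either paper's exact protocol.

    import torch
    import torch.nn as nn

    def magnitude_masks(model: nn.Module, sparsity: float = 0.9) -> dict:
        """Per-parameter masks that keep only the largest-magnitude weights globally."""
        all_weights = torch.cat([p.detach().abs().flatten()
                                 for name, p in model.named_parameters() if "weight" in name])
        threshold = torch.quantile(all_weights, sparsity)  # sparsity=0.9 prunes 90% of weights
        return {name: (p.detach().abs() > threshold).float()
                for name, p in model.named_parameters() if "weight" in name}

    def apply_masks(model: nn.Module, masks: dict) -> None:
        """Zero out pruned weights in place; reapply after each update to keep them at zero."""
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])

    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
    masks = magnitude_masks(model, sparsity=0.9)  # in practice, computed after training
    apply_masks(model, masks)

PyTorch also ships torch.nn.utils.prune, which provides ready-made unstructured and structured magnitude pruning and handles the masking bookkeeping.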

Theoretical Perspectives

  • Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2020). "Deep double descent: Where bigger models and more data hurt." ICLR. Extends the double descent phenomenon to model size, data size, and training time, providing a unified view of when overparameterization helps or hurts.

  • Neyshabur, B., Bhojanapalli, S., McAllester, D., & Srebro, N. (2017). "Exploring generalization in deep learning." NeurIPS. Analyzes generalization bounds for deep networks and proposes norm-based measures that better predict generalization than parameter count; a small sketch of one such measure appears below.
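
To make "norm-based measures" slightly more concrete, here is a small sketch of one commonly studied proxy, the product of the per-layer spectral norms of the weight matrices; it is an illustrative capacity proxy in the spirit of that literature, not the specific quantity analyzed in the paper.

    import torch
    import torch.nn as nn

    def spectral_norm_product(model: nn.Module) -> float:
        """Product of each linear layer's largest singular value: a simple norm-based
        capacity proxy that can differ widely between networks with identical
        parameter counts."""
        prod = 1.0
        for module in model.modules():
            if isinstance(module, nn.Linear):
                # ord=2 gives the spectral norm (largest singular value) of the weight matrix.
                prod *= torch.linalg.matrix_norm(module.weight.detach(), ord=2).item()
        return prod

    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
    print(spectral_norm_product(model))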

Online Resources

  • PyTorch Documentation: Regularization Techniques (https://pytorch.org/docs/stable/nn.html#dropout-layers). Reference for dropout, batch normalization, and other regularization-related modules.

  • Papers With Code: Regularization Methods (https://paperswithcode.com/methods/category/regularization). Comprehensive catalog of regularization techniques with benchmarks and implementations.

Looking Ahead

  • Chapter 14 (CNNs): Spatial augmentation (crop, flip, rotation) is the primary regularizer; batch normalization provides implicit regularization; transfer learning reduces the need for task-specific regularization.
  • Chapter 15 (RNNs): Dropout applied to non-recurrent connections; variational dropout maintains the same mask across time steps; weight tying between embedding and output layers.