Chapter 14: Further Reading

Textbooks

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 9 (Convolutional Networks) provides a thorough mathematical treatment of convolution, pooling, and CNN architectures. The analysis of equivariance and invariance is particularly rigorous; a small numerical check of equivariance follows this list.

  • Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). Dive into Deep Learning. Cambridge University Press. Chapters 7-8 cover CNNs with executable PyTorch code; the full text is freely available at d2l.ai. The progressive implementation from basic convolutions to modern architectures mirrors this chapter's approach.

  • Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapters 10-11 provide clear explanations of convolution mechanics and modern CNN architectures with excellent diagrams.
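
A quick way to make the equivariance discussion concrete is to check it numerically. The sketch below (in PyTorch, the framework used by Zhang et al. above; the 8x8 input and kernel size are arbitrary choices) uses circular padding, under which convolution commutes exactly with circular shifts:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Circular padding makes the convolution exactly equivariant
    # to circular (wrap-around) translations of the input.
    conv = nn.Conv2d(1, 1, kernel_size=3, padding=1,
                     padding_mode="circular", bias=False)

    x = torch.randn(1, 1, 8, 8)                            # random "image"
    x_shifted = torch.roll(x, shifts=(2, 3), dims=(2, 3))  # translate input

    # Shift-then-convolve equals convolve-then-shift.
    lhs = conv(x_shifted)
    rhs = torch.roll(conv(x), shifts=(2, 3), dims=(2, 3))
    print(torch.allclose(lhs, rhs, atol=1e-6))  # True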

Key Papers

Foundational Architectures

  • LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11), 2278-2324. The LeNet paper that established CNNs as practical tools. Introduced the conv-pool-conv-pool pattern that defined CNN architecture for nearly two decades.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks." NeurIPS. AlexNet launched the deep learning revolution in computer vision. Key innovations: GPU training, ReLU activations, dropout regularization, and data augmentation.

  • Simonyan, K. & Zisserman, A. (2015). "Very deep convolutional networks for large-scale image recognition." ICLR. VGGNet demonstrated that depth matters: using only 3x3 convolutions but going 16-19 layers deep achieved excellent results. Established the principle that stacks of small kernels are preferable to single large ones: two stacked 3x3 convolutions cover the same 5x5 receptive field as one 5x5 convolution, with fewer parameters and an extra nonlinearity.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep residual learning for image recognition." CVPR. Introduced residual connections (skip connections), enabling training of networks with 100+ layers. Perhaps the most influential CNN paper, with skip connections now used across all of deep learning.
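
The residual connection from He et al. is compact enough to sketch directly. Below is a minimal basic block in PyTorch; the channel count is illustrative and the stride-1, same-width case is assumed (the paper also defines a projection shortcut for dimension changes):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        """Basic residual block: output = relu(F(x) + x)."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            # The identity shortcut: gradients flow straight through
            # the addition, which is what makes 100+ layers trainable.
            return F.relu(out + x)

    x = torch.randn(1, 64, 32, 32)
    print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])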

Efficient Architectures

  • Howard, A. G., et al. (2017). "MobileNets: Efficient convolutional neural networks for mobile vision applications." arXiv:1704.04861. Introduces depthwise separable convolutions for efficient mobile deployment. The width multiplier and resolution multiplier provide a systematic way to trade accuracy for efficiency. A sketch of the depthwise separable factorization follows this list.

  • Tan, M. & Le, Q. V. (2019). "EfficientNet: Rethinking model scaling for convolutional neural networks." ICML. Proposes compound scaling, which jointly scales width, depth, and resolution, achieving state-of-the-art efficiency across a range of computational budgets; a worked scaling calculation follows this list.

  • Szegedy, C., et al. (2015). "Going deeper with convolutions." CVPR. GoogLeNet/Inception introduced multi-scale processing within a single layer using parallel convolutions of different kernel sizes.
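
The depthwise separable factorization from Howard et al. is a one-argument change in PyTorch (groups=in_channels) followed by a 1x1 convolution. A minimal comparison, with illustrative channel counts:

    import torch
    import torch.nn as nn

    def n_params(m):
        return sum(p.numel() for p in m.parameters())

    in_ch, out_ch, k = 64, 128, 3

    # Standard convolution: every filter mixes all input channels.
    standard = nn.Conv2d(in_ch, out_ch, k, padding=1, bias=False)

    # Depthwise separable: filter each channel independently, then
    # mix channels with a cheap 1x1 pointwise convolution.
    separable = nn.Sequential(
        nn.Conv2d(in_ch, in_ch, k, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
    )

    x = torch.randn(1, in_ch, 32, 32)
    assert standard(x).shape == separable(x).shape

    # 73728 vs 8768 parameters: close to the k^2 = 9 reduction
    # factor the paper predicts.
    print(n_params(standard), n_params(separable))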
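
Compound scaling in Tan & Le is a small piece of arithmetic: fix one coefficient each for depth, width, and resolution, then raise all three to a shared exponent phi. Using the coefficients published in the paper (alpha = 1.2, beta = 1.1, gamma = 1.15, chosen so that alpha * beta^2 * gamma^2 is roughly 2):

    # Each unit increase of phi roughly doubles total FLOPs, since
    # FLOPs scale with depth * width^2 * resolution^2.
    alpha, beta, gamma = 1.2, 1.1, 1.15

    for phi in range(4):
        print(f"phi={phi}: depth x{alpha**phi:.2f}, width x{beta**phi:.2f}, "
              f"resolution x{gamma**phi:.2f}, "
              f"FLOPs x{(alpha * beta**2 * gamma**2)**phi:.2f}")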

Transfer Learning

  • Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). "How transferable are features in deep neural networks?" NeurIPS. Seminal study on transfer learning, showing that early CNN layers learn general features (edges, textures) while later layers become task-specific. Quantifies the transferability of each layer; a minimal freeze-and-fine-tune sketch follows this list.

  • Kornblith, S., Shlens, J., & Le, Q. V. (2019). "Do better ImageNet models transfer better?" CVPR. Empirically verifies that better ImageNet accuracy generally leads to better transfer performance, validating the use of ImageNet pretraining as a foundation.
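
The freeze-and-fine-tune recipe these two papers motivate is short in torchvision. A minimal sketch, assuming a 10-class downstream task (the class count is a placeholder):

    import torch.nn as nn
    from torchvision import models

    # ImageNet-pretrained backbone (Kornblith et al.: better ImageNet
    # models generally transfer better).
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

    # Freeze everything: early layers hold general features
    # (Yosinski et al.), so only the new head will be trained.
    for p in model.parameters():
        p.requires_grad = False

    # Replace the classifier head; the new layer trains from scratch.
    model.fc = nn.Linear(model.fc.in_features, 10)

    trainable = [p for p in model.parameters() if p.requires_grad]
    print(sum(p.numel() for p in trainable))  # 20490: the head only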

Visualization and Interpretability

  • Zeiler, M. D. & Fergus, R. (2014). "Visualizing and understanding convolutional networks." ECCV. Develops deconvolutional visualization to understand what CNN features learn at each layer, revealing the hierarchical feature progression from edges to object parts.

  • Selvaraju, R. R., et al. (2017). "Grad-CAM: Visual explanations from deep networks via gradient-based localization." ICCV. Introduces Grad-CAM for visualizing which regions of an image are important for a classification decision. Widely used for CNN interpretability.
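
Grad-CAM itself is only a few lines once the feature maps and their gradients are captured with hooks: global-average-pool the gradients of the class score to get channel weights, form the weighted sum of feature maps, and apply a ReLU. A minimal sketch (the choice of ResNet-50 and its layer4 stage is an assumption; the random input stands in for a preprocessed image):

    import torch
    import torch.nn.functional as F
    from torchvision import models

    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

    feats, grads = {}, {}
    model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))
    model.layer4.register_full_backward_hook(
        lambda m, gi, go: grads.update(a=go[0]))

    x = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
    model(x)[0].max().backward()      # backprop the top class score

    # Channel weights = global-average-pooled gradients (alpha_k in the
    # paper); the heatmap is the ReLU of the weighted sum of feature maps.
    w = grads["a"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear")
    print(cam.shape)  # torch.Size([1, 1, 224, 224])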

Beyond Standard Convolutions

  • Yu, F. & Koltun, V. (2016). "Multi-scale context aggregation by dilated convolutions." ICLR. Introduces dilated (atrous) convolutions for expanding the receptive field without increasing parameters or reducing resolution. Essential for semantic segmentation; a short dilation example follows this list.

  • Dai, J., et al. (2017). "Deformable convolutional networks." ICCV. Adds learnable offsets to convolution sampling locations, allowing the network to adaptively adjust its receptive field shape.
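
Dilation as described by Yu & Koltun is a single argument in PyTorch, and the effective kernel size grows as d*(k-1)+1 while the parameter count stays fixed. A quick check with illustrative shapes:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 16, 64, 64)

    for d in (1, 2, 4):
        # padding = d * (k - 1) / 2 keeps the output the same size for odd k.
        conv = nn.Conv2d(16, 16, kernel_size=3, padding=d, dilation=d,
                         bias=False)
        params = sum(p.numel() for p in conv.parameters())
        effective = d * (3 - 1) + 1   # effective kernel size: 3, 5, 9
        # params is 2304 at every dilation rate; only the receptive
        # field grows, never the cost.
        print(d, tuple(conv(x).shape), params, effective)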

Online Resources

  • Stanford CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.stanford.edu/ The definitive course on CNNs with excellent lecture notes, assignments, and visualization tools.

  • PyTorch Vision Library (torchvision). https://pytorch.org/vision/stable/index.html Pretrained models, standard datasets, and transformation utilities for computer vision. A minimal loading example follows this list.

  • Papers With Code: Image Classification. https://paperswithcode.com/task/image-classification Up-to-date leaderboards and implementations for image classification benchmarks.
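
For the torchvision entry above, the weights-enum API bundles each checkpoint with its matching preprocessing (torchvision 0.13+ is assumed; the image path is a placeholder):

    from torchvision import models
    from torchvision.io import read_image

    weights = models.ResNet50_Weights.DEFAULT
    model = models.resnet50(weights=weights).eval()
    preprocess = weights.transforms()    # resize/crop/normalize pipeline
                                         # that these weights expect

    img = read_image("example.jpg")      # placeholder path; (3, H, W) uint8
    logits = model(preprocess(img).unsqueeze(0))
    print(weights.meta["categories"][logits.argmax().item()])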

Looking Ahead

  • Chapter 15 (RNNs): 1D convolutions as a non-recurrent approach to sequence modeling; temporal convolutional networks (TCNs). A causal convolution sketch follows this list.
  • Chapters 16--17 (Transformers): Vision Transformers (ViT) as a non-convolutional alternative; hybrid CNN-transformer architectures.
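
As a small preview of the TCN idea, a causal 1D convolution pads only on the left so that each output step depends only on current and past inputs (channel sizes here are illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConv1d(nn.Module):
        """1D convolution that never looks at future time steps."""
        def __init__(self, channels, kernel_size, dilation=1):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation   # left padding only
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  dilation=dilation)

        def forward(self, x):
            # F.pad pads the last dimension as (left, right).
            return self.conv(F.pad(x, (self.pad, 0)))

    x = torch.randn(1, 8, 100)                        # (batch, channels, time)
    y = CausalConv1d(8, kernel_size=3, dilation=2)(x)
    print(y.shape)                                    # same length: (1, 8, 100)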