Chapter 14: Further Reading
Textbooks
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 9 (Convolutional Networks) provides a thorough mathematical treatment of convolution, pooling, and CNN architectures. The analysis of equivariance and invariance is particularly rigorous.
- Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). Dive into Deep Learning. Cambridge University Press. Chapters 7-8 (d2l.ai) cover CNNs with executable code in PyTorch. The progressive implementation from basic convolutions to modern architectures mirrors this chapter's approach.
- Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapters 10-11 provide clear explanations of convolution mechanics and modern CNN architectures with excellent diagrams.
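The translation equivariance that Goodfellow et al. analyze can also be checked numerically. A minimal sketch: with circular padding (so the boundary matches a circular shift exactly), convolving a shifted input gives the same result as shifting the convolved output.

```python
import torch
import torch.nn.functional as F

# Translation equivariance: conv(shift(x)) == shift(conv(x)).
# Circular padding makes the identity exact rather than approximate.
torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)   # toy single-channel input
w = torch.randn(1, 1, 3, 3)   # arbitrary 3x3 kernel

def shift(t):
    # Circular shift right by one pixel along the width dimension.
    return torch.roll(t, shifts=1, dims=-1)

def conv(t):
    # Pad circularly by 1 on each side, then apply a valid 3x3 convolution.
    return F.conv2d(F.pad(t, (1, 1, 1, 1), mode="circular"), w)

print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-5))  # True
```

Note that pooling breaks exact equivariance, trading it for approximate local invariance, which is precisely the distinction Chapter 9 of Goodfellow et al. makes rigorous.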
Key Papers
Foundational Architectures
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11), 2278-2324. The LeNet paper that established CNNs as practical tools. Introduced the conv-pool-conv-pool pattern that defined CNN architecture for nearly two decades.
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks." NeurIPS. AlexNet launched the deep learning revolution in computer vision. Key innovations: GPU training, ReLU activations, dropout regularization, and data augmentation.
- Simonyan, K. & Zisserman, A. (2015). "Very deep convolutional networks for large-scale image recognition." ICLR. VGGNet demonstrated that depth matters: using only 3x3 convolutions but going 16-19 layers deep achieved excellent results. Established the principle that stacking small kernels is preferable to using a single large kernel.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep residual learning for image recognition." CVPR. Introduced residual connections (skip connections), enabling training of networks with 100+ layers. Perhaps the most influential CNN paper, with skip connections now used across all of deep learning.
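The residual connection of He et al. is simple to express in code: the block computes F(x) + x, so gradients can flow unimpeded through the identity path. A minimal sketch of a basic residual block (channel counts and layer choices here are illustrative, not the exact ResNet configuration):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block in the spirit of He et al. (2016):
    output = relu(F(x) + x), where F is two 3x3 conv + batch-norm layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: the identity term gives gradients a direct path
        # backward, which is what makes 100+ layer training feasible.
        return torch.relu(out + x)

x = torch.randn(1, 64, 32, 32)
block = ResidualBlock(64)
print(block(x).shape)  # torch.Size([1, 64, 32, 32]) -- shape is preserved
```

Because the skip path is the identity, the block can fall back to a near-identity mapping when extra depth does not help, which is one explanation for why deeper residual networks do not degrade the way plain deep networks do.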
Efficient Architectures
- Howard, A. G., et al. (2017). "MobileNets: Efficient convolutional neural networks for mobile vision applications." arXiv:1704.04861. Introduces depthwise separable convolutions for efficient mobile deployment. The width multiplier and resolution multiplier provide a systematic way to trade accuracy for efficiency.
- Tan, M. & Le, Q. V. (2019). "EfficientNet: Rethinking model scaling for convolutional neural networks." ICML. Proposes compound scaling, which scales width, depth, and resolution simultaneously, achieving state-of-the-art efficiency across a range of computational budgets.
- Szegedy, C., et al. (2015). "Going deeper with convolutions." CVPR. GoogLeNet/Inception introduced multi-scale processing within a single layer using parallel convolutions of different kernel sizes.
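The efficiency gain from the depthwise separable convolutions in MobileNets is easy to quantify: a standard k x k convolution uses c_in * c_out * k^2 weights, while the depthwise + pointwise factorization uses only c_in * k^2 + c_in * c_out. A minimal sketch comparing parameter counts (layer sizes are illustrative):

```python
import torch.nn as nn

def standard_conv(c_in, c_out, k=3):
    # One dense k x k convolution mixing all channels at once.
    return nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)

def depthwise_separable(c_in, c_out, k=3):
    # Depthwise: one k x k filter per input channel (groups=c_in),
    # then pointwise: a 1x1 convolution to mix channels.
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),
        nn.Conv2d(c_in, c_out, 1, bias=False),
    )

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(standard_conv(64, 128)))        # 73728 = 64 * 128 * 9
print(n_params(depthwise_separable(64, 128)))  # 8768 = 64 * 9 + 64 * 128
```

Here the factorized version uses roughly 8x fewer parameters for the same input/output shape, which is the source of MobileNets' mobile-friendly footprint.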
Transfer Learning
- Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). "How transferable are features in deep neural networks?" NeurIPS. Seminal study on transfer learning, showing that early CNN layers learn general features (edges, textures) while later layers become task-specific. Quantifies the transferability of each layer.
- Kornblith, S., Shlens, J., & Le, Q. V. (2019). "Do better ImageNet models transfer better?" CVPR. Empirically verifies that better ImageNet accuracy generally leads to better transfer performance, validating the use of ImageNet pretraining as a foundation.
Visualization and Interpretability
-
Zeiler, M. D. & Fergus, R. (2014). "Visualizing and understanding convolutional networks." ECCV. Develops deconvolutional visualization to understand what CNN features learn at each layer, revealing the hierarchical feature progression from edges to object parts.
-
Selvaraju, R. R., et al. (2017). "Grad-CAM: Visual explanations from deep networks via gradient-based localization." ICCV. Introduces Grad-CAM for visualizing which regions of an image are important for a classification decision. Widely used for CNN interpretability.
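The core of Grad-CAM fits in a few lines: weight each feature map by the spatially averaged gradient of the class score with respect to that map, sum the weighted maps, and apply ReLU. A minimal sketch on a toy CNN (the network and class index are illustrative, not a real trained model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Toy CNN split into a feature extractor and a classification head.
features = nn.Sequential(nn.Conv2d(1, 4, 3, padding=1), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(4, 3))

x = torch.randn(1, 1, 8, 8)
A = features(x)           # (1, 4, 8, 8): the feature maps Grad-CAM weights
A.retain_grad()           # keep gradients on this non-leaf tensor
score = head(A)[0, 2]     # score for an arbitrary class (index 2)
score.backward()

# Per-map weights: spatially averaged gradients (alpha_k in the paper).
alpha = A.grad.mean(dim=(2, 3), keepdim=True)      # (1, 4, 1, 1)
# Weighted combination of feature maps, rectified to keep positive evidence.
cam = F.relu((alpha * A).sum(dim=1))               # (1, 8, 8) importance map
print(cam.shape)
```

In practice the map is taken from the last convolutional layer and upsampled to the input resolution; the ReLU keeps only regions that push the class score up.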
Beyond Standard Convolutions
- Yu, F. & Koltun, V. (2016). "Multi-scale context aggregation by dilated convolutions." ICLR. Introduces dilated (atrous) convolutions for expanding the receptive field without increasing parameters or reducing resolution. Essential for semantic segmentation.
- Dai, J., et al. (2017). "Deformable convolutional networks." ICCV. Adds learnable offsets to convolution sampling locations, allowing the network to adaptively adjust its receptive field shape.
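The dilated-convolution trade-off described by Yu & Koltun can be seen directly in PyTorch: a 3x3 kernel with dilation d covers an effective k + (k-1)(d-1) window using the same 9 weights, and with matching padding the spatial resolution is untouched. A minimal sketch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 16, 16)

# Dense 3x3 convolution: receptive field 3x3.
dense = nn.Conv2d(1, 1, 3, padding=1)
# Dilated 3x3 convolution (dilation=2): samples a 5x5 neighborhood
# with only 9 weights; padding=2 keeps the output size unchanged.
dilated = nn.Conv2d(1, 1, 3, padding=2, dilation=2)

print(dense(x).shape, dilated(x).shape)  # both torch.Size([1, 1, 16, 16])
```

Stacking layers with exponentially increasing dilation (1, 2, 4, ...) grows the receptive field exponentially with depth at constant resolution, which is why the technique is standard in semantic segmentation.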
Online Resources
- Stanford CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.stanford.edu/ The definitive course on CNNs with excellent lecture notes, assignments, and visualization tools.
- PyTorch Vision Library (torchvision). https://pytorch.org/vision/stable/index.html Pretrained models, standard datasets, and transformation utilities for computer vision.
- Papers With Code: Image Classification. https://paperswithcode.com/task/image-classification Up-to-date leaderboards and implementations for image classification benchmarks.
Looking Ahead
- Chapter 15 (RNNs): 1D convolutions as a non-recurrent approach to sequence modeling; temporal convolution networks (TCN).
- Chapters 16--17 (Transformers): Vision Transformers (ViT) as a non-convolutional alternative; hybrid CNN-transformer architectures.