Chapter 14: Further Reading
Textbooks
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 9 (Convolutional Networks) provides a thorough mathematical treatment of convolution, pooling, and CNN architectures. The analysis of equivariance and invariance is particularly rigorous.
- Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). Dive into Deep Learning. Cambridge University Press. Chapters 7-8 (d2l.ai) cover CNNs with executable code in PyTorch. The progressive implementation from basic convolutions to modern architectures mirrors this chapter's approach.
- Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapters 10-11 provide clear explanations of convolution mechanics and modern CNN architectures with excellent diagrams.
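The translation equivariance that Goodfellow et al. analyze can also be checked numerically. A minimal sketch: with circular padding (so the boundary matches a circular shift exactly), convolving a shifted input gives the same result as shifting the convolved output.

```python
import torch
import torch.nn.functional as F

# Translation equivariance: conv(shift(x)) == shift(conv(x)).
# Circular padding makes the identity exact rather than approximate.
torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)   # toy single-channel input
w = torch.randn(1, 1, 3, 3)   # arbitrary 3x3 kernel

def shift(t):
    # Circular shift right by one pixel along the width dimension.
    return torch.roll(t, shifts=1, dims=-1)

def conv(t):
    # Pad circularly by 1 on each side, then apply a valid 3x3 convolution.
    return F.conv2d(F.pad(t, (1, 1, 1, 1), mode="circular"), w)

print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-5))  # True
```

Note that pooling breaks exact equivariance, trading it for approximate local invariance, which is precisely the distinction Chapter 9 of Goodfellow et al. makes rigorous.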
Key Papers
Foundational Architectures
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11), 2278-2324. The LeNet paper that established CNNs as practical tools. Introduced the conv-pool-conv-pool pattern that defined CNN architecture for nearly two decades.
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks." NeurIPS. AlexNet launched the deep learning revolution in computer vision. Key innovations: GPU training, ReLU activations, dropout regularization, and data augmentation.
- Simonyan, K. & Zisserman, A. (2015). "Very deep convolutional networks for large-scale image recognition." ICLR. VGGNet demonstrated that depth matters: using only 3x3 convolutions but going 16-19 layers deep achieved excellent results. Established the principle that stacking small kernels is preferable to using a single large kernel.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep residual learning for image recognition." CVPR. Introduced residual connections (skip connections), enabling training of networks with 100+ layers. Perhaps the most influential CNN paper, with skip connections now used across all of deep learning.
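The residual connection of He et al. is simple to express in code: the block computes F(x) + x, so gradients can flow unimpeded through the identity path. A minimal sketch of a basic residual block (channel counts and layer choices here are illustrative, not the exact ResNet configuration):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block in the spirit of He et al. (2016):
    output = relu(F(x) + x), where F is two 3x3 conv + batch-norm layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: the identity term gives gradients a direct path
        # backward, which is what makes 100+ layer training feasible.
        return torch.relu(out + x)

x = torch.randn(1, 64, 32, 32)
block = ResidualBlock(64)
print(block(x).shape)  # torch.Size([1, 64, 32, 32]) -- shape is preserved
```

Because the skip path is the identity, the block can fall back to a near-identity mapping when extra depth does not help, which is one explanation for why deeper residual networks do not degrade the way plain deep networks do.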
Efficient Architectures
- Howard, A. G., et al. (2017). "MobileNets: Efficient convolutional neural networks for mobile vision applications." arXiv:1704.04861. Introduces depthwise separable convolutions for efficient mobile deployment. The width multiplier and resolution multiplier provide a systematic way to trade accuracy for efficiency.
- Tan, M. & Le, Q. V. (2019). "EfficientNet: Rethinking model scaling for convolutional neural networks." ICML. Proposes compound scaling, which scales width, depth, and resolution simultaneously, achieving state-of-the-art efficiency across a range of computational budgets.
- Szegedy, C., et al. (2015). "Going deeper with convolutions." CVPR. GoogLeNet/Inception introduced multi-scale processing within a single layer using parallel convolutions of different kernel sizes.
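The efficiency gain from the depthwise separable convolutions in MobileNets is easy to quantify: a standard k x k convolution uses c_in * c_out * k^2 weights, while the depthwise + pointwise factorization uses only c_in * k^2 + c_in * c_out. A minimal sketch comparing parameter counts (layer sizes are illustrative):

```python
import torch.nn as nn

def standard_conv(c_in, c_out, k=3):
    # One dense k x k convolution mixing all channels at once.
    return nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)

def depthwise_separable(c_in, c_out, k=3):
    # Depthwise: one k x k filter per input channel (groups=c_in),
    # then pointwise: a 1x1 convolution to mix channels.
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),
        nn.Conv2d(c_in, c_out, 1, bias=False),
    )

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(standard_conv(64, 128)))        # 73728 = 64 * 128 * 9
print(n_params(depthwise_separable(64, 128)))  # 8768 = 64 * 9 + 64 * 128
```

Here the factorized version uses roughly 8x fewer parameters for the same input/output shape, which is the source of MobileNets' mobile-friendly footprint.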
Transfer Learning
- Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). "How transferable are features in deep neural networks?" NeurIPS. Seminal study on transfer learning, showing that early CNN layers learn general features (edges, textures) while later layers become task-specific. Quantifies the transferability of each layer.
- Kornblith, S., Shlens, J., & Le, Q. V. (2019). "Do better ImageNet models transfer better?" CVPR. Empirically verifies that better ImageNet accuracy generally leads to better transfer performance, validating the use of ImageNet pretraining as a foundation.
Visualization and Interpretability
-
Zeiler, M. D. & Fergus, R. (2014). "Visualizing and understanding convolutional networks." ECCV. Develops deconvolutional visualization to understand what CNN features learn at each layer, revealing the hierarchical feature progression from edges to object parts.
-
Selvaraju, R. R., et al. (2017). "Grad-CAM: Visual explanations from deep networks via gradient-based localization." ICCV. Introduces Grad-CAM for visualizing which regions of an image are important for a classification decision. Widely used for CNN interpretability.
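The core of Grad-CAM fits in a few lines: weight each feature map by the spatially averaged gradient of the class score with respect to that map, sum the weighted maps, and apply ReLU. A minimal sketch on a toy CNN (the network and class index are illustrative, not a real trained model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Toy CNN split into a feature extractor and a classification head.
features = nn.Sequential(nn.Conv2d(1, 4, 3, padding=1), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(4, 3))

x = torch.randn(1, 1, 8, 8)
A = features(x)           # (1, 4, 8, 8): the feature maps Grad-CAM weights
A.retain_grad()           # keep gradients on this non-leaf tensor
score = head(A)[0, 2]     # score for an arbitrary class (index 2)
score.backward()

# Per-map weights: spatially averaged gradients (alpha_k in the paper).
alpha = A.grad.mean(dim=(2, 3), keepdim=True)      # (1, 4, 1, 1)
# Weighted combination of feature maps, rectified to keep positive evidence.
cam = F.relu((alpha * A).sum(dim=1))               # (1, 8, 8) importance map
print(cam.shape)
```

In practice the map is taken from the last convolutional layer and upsampled to the input resolution; the ReLU keeps only regions that push the class score up.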
Beyond Standard Convolutions
- Yu, F. & Koltun, V. (2016). "Multi-scale context aggregation by dilated convolutions." ICLR. Introduces dilated (atrous) convolutions for expanding the receptive field without increasing parameters or reducing resolution. Essential for semantic segmentation.
- Dai, J., et al. (2017). "Deformable convolutional networks." ICCV. Adds learnable offsets to convolution sampling locations, allowing the network to adaptively adjust its receptive field shape.
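The dilated-convolution trade-off described by Yu & Koltun can be seen directly in PyTorch: a 3x3 kernel with dilation d covers an effective k + (k-1)(d-1) window using the same 9 weights, and with matching padding the spatial resolution is untouched. A minimal sketch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 16, 16)

# Dense 3x3 convolution: receptive field 3x3.
dense = nn.Conv2d(1, 1, 3, padding=1)
# Dilated 3x3 convolution (dilation=2): samples a 5x5 neighborhood
# with only 9 weights; padding=2 keeps the output size unchanged.
dilated = nn.Conv2d(1, 1, 3, padding=2, dilation=2)

print(dense(x).shape, dilated(x).shape)  # both torch.Size([1, 1, 16, 16])
```

Stacking layers with exponentially increasing dilation (1, 2, 4, ...) grows the receptive field exponentially with depth at constant resolution, which is why the technique is standard in semantic segmentation.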
Online Resources
- Stanford CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.stanford.edu/ The definitive course on CNNs with excellent lecture notes, assignments, and visualization tools.
- PyTorch Vision Library (torchvision). https://pytorch.org/vision/stable/index.html Pretrained models, standard datasets, and transformation utilities for computer vision.
- Papers With Code: Image Classification. https://paperswithcode.com/task/image-classification Up-to-date leaderboards and implementations for image classification benchmarks.
Looking Ahead
- Chapter 15 (RNNs): 1D convolutions as a non-recurrent approach to sequence modeling; temporal convolution networks (TCN).
- Chapters 16--17 (Transformers): Vision Transformers (ViT) as a non-convolutional alternative; hybrid CNN-transformer architectures.