Chapter 8: Further Reading
Essential Sources
1. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition" (CVPR, 2016)
The paper that made deep networks trainable. He et al. showed that adding identity shortcuts — $\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$ — allows networks of 152 layers (and beyond) to train successfully, winning the 2015 ImageNet challenge with a top-5 error rate of 3.57%. The paper is clearly written and the experiments are thorough: the authors demonstrate that plain networks degrade with depth (even on the training set), argue that this degradation is an optimization problem rather than one of capacity, and show that residual connections solve it.
Reading guidance: Start with Sections 1-3 (motivation, residual learning formulation, and the key experiment comparing 20-layer and 56-layer plain vs. residual networks). Figure 1 is one of the most important figures in deep learning — the training curves that show deeper plain networks performing worse. Section 4 covers the bottleneck block architecture used in ResNet-50/101/152. The follow-up paper — He et al., "Identity Mappings in Deep Residual Networks" (ECCV, 2016) — introduces the pre-activation variant and provides deeper analysis of gradient flow through residual connections. Read both papers together.
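To make the residual formulation concrete, here is a minimal NumPy sketch of $\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$, using fully connected layers rather than convolutions for brevity (the layer sizes and weights are illustrative, not from the paper):

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = ReLU(F(x) + x), where F is a two-layer residual branch.

    A plain block would have to learn the identity mapping through its
    weights; the shortcut supplies it for free, so F only needs to model
    the residual (the deviation from identity).
    """
    h = np.maximum(0.0, W1 @ x)    # first layer + ReLU
    f = W2 @ h                     # second layer: the residual branch F(x)
    return np.maximum(0.0, f + x)  # add the identity shortcut, then ReLU

# With all-zero weights, F(x) = 0 and the block reduces to ReLU(x):
# the network starts near the identity instead of having to discover it.
x = np.array([1.0, -2.0, 3.0])
W = np.zeros((3, 3))
y = residual_block(x, W, W)
```

This is why depth stops hurting: a stack of such blocks can do no worse than a shallower network, since each extra block can default to (approximately) the identity.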
2. Mingxing Tan and Quoc V. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" (ICML, 2019)
This paper introduced compound scaling — the principle that depth, width, and resolution should be scaled together under a fixed compute constraint. The key contribution is the parameterization $d = \alpha^\phi$, $w = \beta^\phi$, $r = \gamma^\phi$ with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$, where $\phi$ controls total compute. The authors first search for $\alpha, \beta, \gamma$ on a small baseline network (EfficientNet-B0, found via neural architecture search), then scale $\phi$ to produce B1 through B7, achieving state-of-the-art accuracy at every compute level.
Reading guidance: Section 3 is the core contribution — the compound scaling method. The key insight is Figure 2, which shows that scaling any single dimension (depth alone, width alone, resolution alone) saturates, while compound scaling continues to improve. Table 1 compares EfficientNet variants against prior architectures at similar FLOPs, demonstrating consistent improvements. The paper also introduces the EfficientNet-B0 baseline architecture, which uses mobile inverted bottleneck blocks (MBConv) with depthwise separable convolutions — understanding this baseline requires familiarity with MobileNetV2 (Sandler et al., 2018). For a deeper understanding of the neural architecture search component, see the companion paper "MnasNet" (Tan et al., CVPR, 2019).
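The compound scaling rule is simple enough to sketch directly. The sketch below uses the grid-searched values $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ reported in the paper; since FLOPs scale roughly with $d \cdot w^2 \cdot r^2$, and $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$, each unit increase in $\phi$ approximately doubles compute:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution) multipliers and the approximate
    FLOPs factor for compound coefficient phi (Tan & Le, 2019).
    """
    d = alpha ** phi   # depth multiplier
    w = beta ** phi    # width multiplier
    r = gamma ** phi   # resolution multiplier
    return d, w, r, d * w**2 * r**2  # FLOPs grow as d * w^2 * r^2

for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"res x{r:.2f}, FLOPs x{f:.2f}")
```

Note the division of labor: the expensive search is done once, for the small exponent base ($\alpha, \beta, \gamma$ at $\phi = 1$), and the whole B1–B7 family then follows by turning a single dial.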
3. Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization" (ICCV, 2017)
Grad-CAM provides class-discriminative visual explanations for CNN predictions by computing gradient-weighted combinations of the last convolutional layer's feature maps. The method is model-agnostic (works with any CNN architecture), requires no retraining or architectural modification, and produces coarse localization maps that can be refined with guided backpropagation (Guided Grad-CAM). The paper demonstrates applications to image classification, visual question answering, and image captioning.
Reading guidance: Sections 3.1-3.2 derive the Grad-CAM formula and explain the role of the ReLU (suppressing features with negative influence on the target class). Section 5 is particularly valuable: it uses Grad-CAM to diagnose model failures, showing cases where models achieve high accuracy by attending to spurious features (e.g., classifying "nurse" based on the presence of a hospital bed rather than the person). This connects to the broader interpretability discussion in Chapter 35. For the mathematical relationship between Grad-CAM and CAM (Class Activation Mapping; Zhou et al., 2016), see Section 3.3. The extension Grad-CAM++ (Chattopadhyay et al., 2018) provides improved localization for multiple instances of the same class in a single image.
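The core computation from Sections 3.1–3.2 fits in a few lines. Given the last convolutional layer's activations $A^k$ and the gradients of the target class score with respect to them, the channel weights $\alpha_k$ are global-average-pooled gradients, and the map is a ReLU over the weighted sum. A minimal NumPy sketch (real implementations would obtain the gradients via autograd hooks in a framework such as PyTorch):

```python
import numpy as np

def grad_cam(feature_maps, grads):
    """Grad-CAM heatmap from conv activations and class-score gradients.

    feature_maps, grads: arrays of shape (K, H, W), where K is the number
    of channels in the last convolutional layer.
    """
    weights = grads.mean(axis=(1, 2))                    # alpha_k: GAP of gradients
    cam = np.einsum('k,khw->hw', weights, feature_maps)  # weighted channel sum
    return np.maximum(cam, 0.0)  # ReLU: keep only positive influence on the class

# Toy example: channel 0 fires top-left with positive gradient, channel 1
# fires bottom-right with negative gradient; only the top-left survives.
fmaps = np.zeros((2, 2, 2))
fmaps[0, 0, 0] = 1.0
fmaps[1, 1, 1] = 1.0
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
cam = grad_cam(fmaps, grads)
```

The resulting map has the (coarse) spatial resolution of the last conv layer; the paper upsamples it to the input size and, for sharp detail, multiplies it with guided backpropagation.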
4. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep Learning" (Nature, 2015)
A landmark review article by three pioneers of deep learning. The section on convolutional networks (pp. 439-440) provides a concise, authoritative summary of the convolution operation, weight sharing, and pooling. The article places CNNs in the broader context of representation learning and traces the intellectual lineage from biological vision to modern architectures.
Reading guidance: Read the full article — it is only 9 pages and serves as an excellent overview of the entire deep learning landscape as of 2015. The CNN section is brief but every sentence is carefully chosen. The article predates ResNet, so it stops at VGG/GoogLeNet, but its discussion of hierarchical feature learning and translational equivariance remains the clearest short treatment available. For the historical perspective on how convolution was inspired by Hubel and Wiesel's work on the visual cortex (simple cells and complex cells), see LeCun et al., "Gradient-Based Learning Applied to Document Recognition" (Proceedings of the IEEE, 1998) — the original LeNet paper, which remains surprisingly readable.
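The three mechanics the review summarizes — the convolution operation, weight sharing, and pooling — can be stated in a few lines of NumPy. This sketch uses valid cross-correlation (what deep learning frameworks call "convolution") with a single shared kernel, followed by non-overlapping max pooling:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: the same kernel (shared weights) slides
    over every spatial position, which is the source of the translational
    equivariance the article discusses."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2x2(x):
    """Non-overlapping 2x2 max pooling (assumes even dimensions)."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)
diag_kernel = np.array([[1.0, 0.0], [0.0, -1.0]])  # a simple diagonal detector
feat = conv2d(img, diag_kernel)
pooled = max_pool2x2(img)
```

The loop form is deliberately naive: it exposes the weight sharing (one `kernel`, reused at every position) that the review identifies as the key to both parameter efficiency and equivariance.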
5. Richard Zhang, "Making Convolutional Networks Shift-Invariant Again" (ICML, 2019)
This paper demonstrates that standard CNNs are not truly translation equivariant due to aliasing in strided convolutions and max pooling. When a strided operation subsamples the feature map, it violates the Nyquist sampling theorem, causing small input shifts to produce qualitatively different outputs. Zhang proposes anti-aliased downsampling: applying a low-pass filter (blur) before subsampling, which restores shift equivariance. The paper includes extensive experiments showing improved consistency and, surprisingly, improved accuracy as well.
Reading guidance: Section 2 explains the aliasing problem with clear diagrams. The key figure is Figure 1, which shows that shifting an input image by one pixel can dramatically change the output of max pooling. Section 3 presents the fix (blur-then-subsample), which is elegant in its simplicity — it is the same anti-aliasing that signal processing textbooks have taught for decades. The practical implication is important: if shift-equivariance matters for your application (e.g., object detection, where bounding box predictions should not flicker with small camera movements), use anti-aliased pooling. The paper exemplifies the deep learning community's rediscovery of classical signal processing principles — a satisfying instance of the "Fundamentals > Frontier" theme.
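A 1-D NumPy sketch makes the aliasing failure easy to see (this illustrates only the blur-then-subsample idea, not the paper's full max-blur-pool operator). On a high-frequency signal, naive stride-2 subsampling reads entirely different samples when the input shifts by one position, while a binomial low-pass filter applied first makes the two outputs nearly identical:

```python
import numpy as np

def strided_subsample(x, stride=2):
    """Naive downsampling, as in a strided conv or pool: just drop samples."""
    return x[::stride]

def blur_then_subsample(x, stride=2):
    """Anti-aliased downsampling: low-pass filter (a [1, 2, 1]/4 binomial
    blur here) before subsampling, in the spirit of Zhang (2019)."""
    kernel = np.array([1.0, 2.0, 1.0]) / 4.0
    blurred = np.convolve(x, kernel, mode='same')  # zero-padded at the edges
    return blurred[::stride]

# An alternating signal is exactly the Nyquist-violating worst case.
x = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
x_shift = np.roll(x, 1)  # shift by one sample

naive_a, naive_b = strided_subsample(x), strided_subsample(x_shift)      # all 0s vs. all 1s
smooth_a, smooth_b = blur_then_subsample(x), blur_then_subsample(x_shift)  # nearly equal
```

The same mechanism explains the flickering detections mentioned above: without the blur, a one-pixel camera shift can flip which samples survive downsampling at every strided layer.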