Chapter 14 Key Takeaways

The Big Picture

Convolutional Neural Networks (CNNs) embed two powerful inductive biases---local connectivity and weight sharing---directly into the architecture. These biases dramatically reduce parameter count while making networks equivariant to spatial translations. CNNs have dominated computer vision for over a decade, and their principles extend to audio, text, and scientific computing.


The Convolution Operation

  • A convolutional layer slides a small kernel across the input, computing weighted sums at each position. The same kernel (weights) is used everywhere---weight sharing.
  • Output dimensions: $H_{out} = \lfloor(H - k + 2p)/s\rfloor + 1$, where $k$ = kernel size, $p$ = padding, $s$ = stride.
  • Multi-channel convolution: Each filter is a 3D tensor of shape $(C_{in}, k_h, k_w)$. Using $C_{out}$ filters gives a 4D weight tensor: $(C_{out}, C_{in}, k_h, k_w)$.
  • Parameter savings are enormous: a 3x3 conv with 64 filters on a 3-channel input uses $3 \times 3 \times 3 \times 64 = 1{,}728$ weights (plus 64 biases), vs. millions for a fully connected layer; the sketch below checks both this count and the output-size formula.
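
As a quick check of the formula and the parameter count above, here is a minimal sketch, assuming PyTorch is available (the 32x32 input size is illustrative):

    import torch
    import torch.nn as nn

    def conv_out_size(H, k, p, s):
        # H_out = floor((H - k + 2p) / s) + 1
        return (H - k + 2 * p) // s + 1

    # 3x3 conv, 3 input channels, 64 filters, stride 1, "same" padding
    conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)

    x = torch.randn(1, 3, 32, 32)               # (batch, channels, height, width)
    print(conv(x).shape)                        # torch.Size([1, 64, 32, 32])
    print(conv_out_size(32, k=3, p=1, s=1))     # 32, matching the formula

    # Weight tensor has shape (C_out, C_in, k_h, k_w) = (64, 3, 3, 3)
    print(conv.weight.numel(), conv.bias.numel())   # 1728 64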

Stride, Padding, and Pooling

Concept | Effect | Common Values
--- | --- | ---
Stride | Downsamples spatial dimensions | 1 (default), 2 (downsample)
Padding | Preserves spatial dimensions | padding = k // 2 for "same" output (odd k, stride 1)
Max Pooling | Takes the maximum value in each region | 2x2 with stride 2
Average Pooling | Takes the mean value in each region | Global average pooling before the FC head
Dilation | Increases the receptive field without adding parameters | 2 or 4 for segmentation
  • Global Average Pooling replaces the need for large FC layers at the end of CNNs, reducing parameters and overfitting; the sketch below shows how each of these operations changes tensor shapes.
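
The shape effects of stride, padding, pooling, and dilation, in the same PyTorch sketch style (layer sizes are illustrative):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 64, 32, 32)   # (batch, channels, H, W)

    # Stride 2 halves the spatial resolution: 32 -> 16
    print(nn.Conv2d(64, 64, 3, stride=2, padding=1)(x).shape)    # [1, 64, 16, 16]

    # padding = k // 2 with stride 1 preserves H and W ("same" output)
    print(nn.Conv2d(64, 64, 3, stride=1, padding=1)(x).shape)    # [1, 64, 32, 32]

    # 2x2 max pooling with stride 2 also halves H and W
    print(nn.MaxPool2d(2, stride=2)(x).shape)                    # [1, 64, 16, 16]

    # Dilation 2 widens the receptive field; padding=2 keeps the size here
    print(nn.Conv2d(64, 64, 3, dilation=2, padding=2)(x).shape)  # [1, 64, 32, 32]

    # Global average pooling collapses each channel map to a single value
    print(nn.AdaptiveAvgPool2d(1)(x).shape)                      # [1, 64, 1, 1]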

Receptive Field

  • The receptive field of a neuron is the region of the input that influences its value.
  • Stacking 3x3 convolutions increases the receptive field efficiently: two 3x3 layers give a 5x5 receptive field; three give 7x7, with fewer parameters than a single 7x7 kernel (see the sketch after this list).
  • Deeper networks have larger receptive fields, enabling them to capture global context.
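
A small arithmetic sketch of how stacking 3x3 layers grows the receptive field and saves parameters (the channel count of 64 is illustrative):

    # Receptive field of n stacked 3x3, stride-1 convolutions: each layer adds k - 1 = 2
    def receptive_field(num_3x3_layers):
        return 1 + 2 * num_3x3_layers

    for n in (1, 2, 3):
        print(n, receptive_field(n))   # 1 -> 3, 2 -> 5, 3 -> 7

    # Parameter comparison for C input and output channels (ignoring biases)
    C = 64
    print(3 * (3 * 3 * C * C))         # three stacked 3x3 layers: 27 * C^2 = 110,592
    print(7 * 7 * C * C)               # one 7x7 layer:            49 * C^2 = 200,704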

Landmark Architectures

Architecture | Year | Key Innovation | Depth | Params
--- | --- | --- | --- | ---
LeNet-5 | 1998 | First practical CNN (digit recognition) | 5 | 60K
AlexNet | 2012 | GPU training, dropout, ReLU | 8 | 60M
VGG | 2014 | Uniform 3x3 convolutions throughout | 16/19 | 138M (VGG-16)
GoogLeNet | 2014 | Inception modules (multi-scale) | 22 | 6.8M
ResNet | 2015 | Skip connections / residual learning | 50/101/152 | 25M (ResNet-50)
MobileNet | 2017 | Depthwise separable convolutions | ~28 | 4.2M
EfficientNet | 2019 | Compound scaling (width, depth, resolution) | Variable | Variable

Residual Connections (Skip Connections)

  • Core idea: $\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$. The network learns the residual $F(\mathbf{x}) = \mathbf{y} - \mathbf{x}$ rather than the full mapping.
  • Why they work: Gradient flows directly through the identity shortcut, mitigating vanishing gradients. Since $\partial \mathbf{y}/\partial \mathbf{x} = \partial F/\partial \mathbf{x} + I$, the identity term always provides a direct gradient path back to earlier layers.
  • When dimensions differ: Use a 1x1 convolution on the shortcut to match channel dimensions (projection shortcut).
  • Skip connections made it practical to train networks with 100+ layers, something plain (unbranched) architectures of that depth failed to achieve; a minimal residual block is sketched below.
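
A minimal residual block sketch in the same PyTorch style, with a projection shortcut when the shape changes; the structure follows the basic ResNet block described above, but the sizes are illustrative:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Basic residual block: y = F(x) + shortcut(x), with conv -> BN -> ReLU inside F."""
        def __init__(self, in_ch, out_ch, stride=1):
            super().__init__()
            self.f = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
                nn.BatchNorm2d(out_ch),
            )
            # Projection shortcut (1x1 conv) when spatial size or channel count changes
            if stride != 1 or in_ch != out_ch:
                self.shortcut = nn.Sequential(
                    nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                    nn.BatchNorm2d(out_ch),
                )
            else:
                self.shortcut = nn.Identity()

        def forward(self, x):
            return torch.relu(self.f(x) + self.shortcut(x))

    x = torch.randn(1, 64, 32, 32)
    print(ResidualBlock(64, 128, stride=2)(x).shape)   # torch.Size([1, 128, 16, 16])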

Depthwise Separable Convolutions

  • Factorize a standard convolution into: (1) depthwise convolution (one filter per channel) + (2) pointwise 1x1 convolution (mixes channels).
  • Parameter reduction: from $C_{in} \times C_{out} \times k^2$ to $C_{in} \times k^2 + C_{in} \times C_{out}$.
  • For a 3x3 conv with 128 input and 256 output channels: standard = 294,912 params; separable = 128 x 9 + 128 x 256 = 33,920 params (~8.7x reduction, ignoring biases). The sketch below reproduces these counts.
  • Foundation of MobileNet and efficient architectures for mobile/edge deployment.
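
A depthwise separable sketch that reproduces the parameter counts above; in PyTorch the depthwise step is expressed as a grouped convolution with groups equal to the channel count:

    import torch
    import torch.nn as nn

    def count_params(module):
        return sum(p.numel() for p in module.parameters())

    # Standard 3x3 convolution: 128 -> 256 channels
    standard = nn.Conv2d(128, 256, kernel_size=3, padding=1, bias=False)

    # Depthwise separable: depthwise 3x3 (one filter per channel) + pointwise 1x1
    separable = nn.Sequential(
        nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=128, bias=False),  # depthwise
        nn.Conv2d(128, 256, kernel_size=1, bias=False),                         # pointwise
    )

    print(count_params(standard))    # 294912
    print(count_params(separable))   # 33920

    x = torch.randn(1, 128, 32, 32)
    print(separable(x).shape)        # torch.Size([1, 256, 32, 32])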

Transfer Learning

Strategy | When to Use | What to Do
--- | --- | ---
Feature extraction | Very small dataset, similar domain | Freeze backbone, train only new head
Fine-tune last layers | Small to medium dataset | Freeze early layers, fine-tune later ones
Fine-tune all layers | Large dataset or different domain | Lower LR for pretrained layers
  • Transfer learning is often the most impactful technique for small datasets.
  • Discriminative learning rates: Use a lower LR for early (pretrained) layers, a higher LR for the new classification head (see the sketch after this list).
  • Gradual unfreezing: Start by training only the head, then progressively unfreeze deeper layers.
  • ImageNet-pretrained features transfer remarkably well even to very different domains (medical imaging, satellite imagery).
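
A feature-extraction and discriminative-learning-rate sketch, assuming a recent torchvision (for the weights= argument) and ResNet-50 as the backbone; num_classes and the learning rates are illustrative:

    import torch.nn as nn
    import torch.optim as optim
    from torchvision import models

    num_classes = 10                                          # illustrative
    model = models.resnet50(weights="IMAGENET1K_V1")          # ImageNet-pretrained backbone

    # Feature extraction: freeze the backbone, replace and train only the head
    for param in model.parameters():
        param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # new head (trainable by default)

    # Fine-tuning with discriminative learning rates: unfreeze, then use a lower LR
    # for the pretrained layers and a higher LR for the new head.
    for param in model.parameters():
        param.requires_grad = True
    optimizer = optim.SGD(
        [
            {"params": [p for n, p in model.named_parameters() if not n.startswith("fc.")],
             "lr": 1e-4},
            {"params": model.fc.parameters(), "lr": 1e-2},
        ],
        momentum=0.9,
    )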

CNN Design Principles

  1. Increase channels, decrease spatial resolution as you go deeper.
  2. Use 3x3 kernels almost exclusively (stacking for larger receptive fields).
  3. Add batch normalization after every convolution.
  4. Use skip connections for networks deeper than ~10 layers.
  5. Replace FC layers with global average pooling where possible (see the sketch after this list).
  6. Start with a pretrained model when data is limited.
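
A tiny backbone-plus-head sketch that follows principles 1, 2, 3, and 5; the channel counts and the 10-class head are illustrative:

    import torch
    import torch.nn as nn

    def conv_bn_relu(in_ch, out_ch, stride=1):
        # Principles 2 and 3: 3x3 kernels, conv -> BN -> ReLU
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    # Principle 1: double the channels each time the resolution is halved
    backbone = nn.Sequential(
        conv_bn_relu(3, 32),
        conv_bn_relu(32, 64, stride=2),     # 32x32 -> 16x16
        conv_bn_relu(64, 128, stride=2),    # 16x16 -> 8x8
    )

    # Principle 5: global average pooling instead of large FC layers
    head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 10))

    x = torch.randn(1, 3, 32, 32)
    print(head(backbone(x)).shape)          # torch.Size([1, 10])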

Common Pitfalls

  1. Applying CNNs to non-spatial data where the inductive biases are not appropriate.
  2. Forgetting to recompute downstream layer sizes when changing stride or padding, causing dimension mismatches.
  3. Using batch norm before the convolution instead of after (the standard is conv -> BN -> ReLU).
  4. Training from scratch on small datasets when pretrained models are available.
  5. Ignoring spatial resolution in the classification head: flattening too early discards spatial information.

Looking Ahead

  • Chapter 15 (RNNs): From spatial to temporal weight sharing; 1D convolutions as an alternative to recurrence for some sequence tasks.
  • Chapters 16--17 (Transformers): Self-attention as a generalization of convolution without the locality bias; Vision Transformers (ViT) challenge CNN dominance.