Convolutional Neural Networks embed two powerful inductive biases---local connectivity and weight sharing---directly into the architecture. These biases dramatically reduce parameter count while making networks equivariant to spatial translations. CNNs have dominated computer vision for over a decade, and their principles extend to audio, text, and scientific computing.
The Convolution Operation
A convolutional layer slides a small kernel across the input, computing weighted sums at each position. The same kernel (weights) is used everywhere---weight sharing.
Multi-channel convolution: Each filter is a 3D tensor of shape $(C_{in}, k_h, k_w)$. Using $C_{out}$ filters gives a 4D weight tensor: $(C_{out}, C_{in}, k_h, k_w)$.
Parameter savings are enormous: a 3x3 conv with 64 filters on 3-channel input uses $3 \times 3 \times 3 \times 64 = 1{,}728$ parameters vs. millions for a fully connected layer.
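A quick way to sanity-check this count is to build the layer and count its weights. The sketch below uses PyTorch (the framework choice and the 224x224 input size are illustrative assumptions, not something the text prescribes); bias terms are omitted to match the figure above.

```python
import torch
import torch.nn as nn

# 3x3 convolution: 3 input channels, 64 output channels, no bias
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, bias=False)
print(sum(p.numel() for p in conv.parameters()))  # 3 * 3 * 3 * 64 = 1728

# For comparison: a fully connected layer from a 224x224 RGB image to 64 units
fc = nn.Linear(224 * 224 * 3, 64, bias=False)
print(sum(p.numel() for p in fc.parameters()))    # 9,633,792 weights
```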
Stride, Padding, and Pooling
| Concept | Effect | Common Values |
|---|---|---|
| Stride | Downsamples spatial dimensions | 1 (default), 2 (downsample) |
| Padding | Preserves spatial dimensions | padding=k//2 for "same" output |
| Max Pooling | Takes maximum value in each region | 2x2 with stride 2 |
| Average Pooling | Takes mean value in each region | Global average pooling before FC |
| Dilation | Increases receptive field without parameters | 2 or 4 for segmentation |
Global Average Pooling replaces the need for large FC layers at the end of CNNs, reducing parameters and overfitting.
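The shape effects in the table are easy to verify directly. A minimal sketch in PyTorch (framework and tensor sizes are assumptions) traces one input through a "same" convolution, a strided convolution, max pooling, and global average pooling:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                                 # (batch, channels, height, width)

same = nn.Conv2d(3, 16, kernel_size=3, padding=1)             # padding = k // 2 -> "same" spatial size
print(same(x).shape)                                          # torch.Size([1, 16, 32, 32])

down = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)   # stride 2 halves height and width
print(down(x).shape)                                          # torch.Size([1, 16, 16, 16])

pool = nn.MaxPool2d(kernel_size=2, stride=2)                  # 2x2 max pooling, stride 2
print(pool(x).shape)                                          # torch.Size([1, 3, 16, 16])

gap = nn.AdaptiveAvgPool2d(1)                                 # global average pooling
print(gap(x).shape)                                           # torch.Size([1, 3, 1, 1])
```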
Receptive Field
The receptive field of a neuron is the region of the input that influences its value.
Stacking 3x3 convolutions increases the receptive field efficiently: two 3x3 layers give a 5x5 receptive field; three give 7x7, with fewer parameters than a single 7x7 kernel.
Deeper networks have larger receptive fields, enabling them to capture global context.
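To make the parameter claim concrete, compare three stacked 3x3 layers against a single 7x7 layer with the same receptive field; the channel count C = 64 below is just an illustrative choice, with biases ignored:

```python
C = 64  # illustrative channel count, kept constant across layers

stacked_3x3 = 3 * (3 * 3 * C * C)   # three 3x3 conv layers -> 7x7 receptive field
single_7x7 = 7 * 7 * C * C          # one 7x7 conv layer    -> same receptive field

print(stacked_3x3, single_7x7)      # 110592 vs 200704 weights
```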
Landmark Architectures
| Architecture | Year | Key Innovation | Depth | Params |
|---|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN (digit recognition) | 5 | 60K |
| AlexNet | 2012 | GPU training, dropout, ReLU | 8 | 60M |
| VGG | 2014 | Uniform 3x3 convolutions throughout | 16/19 | 138M |
| GoogLeNet | 2014 | Inception modules (multi-scale) | 22 | 6.8M |
| ResNet | 2015 | Skip connections / residual learning | 50/101/152 | 25M |
| MobileNet | 2017 | Depthwise separable convolutions | ~28 | 3.4M |
| EfficientNet | 2019 | Compound scaling (width, depth, resolution) | Variable | Variable |
Residual Connections (Skip Connections)
Core idea: $\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$. The network learns the residual $F(\mathbf{x}) = \mathbf{y} - \mathbf{x}$ rather than the full mapping.
Why they work: Gradient flows directly through the identity shortcut, mitigating vanishing gradients. Since $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + I$, the skip contributes an identity term, so the shortcut path passes gradients back unattenuated even when $\partial F / \partial \mathbf{x}$ is small.
When dimensions differ: Use a 1x1 convolution on the shortcut to match channel dimensions (projection shortcut).
Skip connections enabled training networks with 100+ layers---impossible with plain architectures.
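A minimal residual block, sketched in PyTorch (the framework, layer sizes, and exact layer ordering are assumptions, not the precise ResNet definition). The projection shortcut is a 1x1 convolution used only when the channel count or stride changes:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: y = F(x) + shortcut(x), followed by ReLU."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # Projection shortcut (1x1 conv) when dimensions differ, identity otherwise
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        return torch.relu(self.f(x) + self.shortcut(x))

block = ResidualBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 16, 16])
```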
Depthwise Separable Convolutions
Factorize a standard convolution into: (1) depthwise convolution (one filter per channel) + (2) pointwise 1x1 convolution (mixes channels).
Parameter reduction: from $C_{in} \times C_{out} \times k^2$ to $C_{in} \times k^2 + C_{in} \times C_{out}$.
For a 3x3 conv with 128 input and 256 output channels: standard = $128 \times 256 \times 9 = 294{,}912$ params; separable = $128 \times 9 + 128 \times 256 = 33{,}920$ params (~8.7x reduction).
Foundation of MobileNet and efficient architectures for mobile/edge deployment.
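The factorization maps directly onto the groups argument of a standard conv layer. A sketch in PyTorch (framework is an assumption; the 128 -> 256 sizes mirror the example above, with biases omitted to match the formula):

```python
import torch.nn as nn

def count(module):
    return sum(p.numel() for p in module.parameters())

# Standard 3x3 convolution: 128 -> 256 channels
standard = nn.Conv2d(128, 256, kernel_size=3, padding=1, bias=False)

# Depthwise conv (groups = in_channels) followed by pointwise 1x1 conv
separable = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=128, bias=False),  # depthwise
    nn.Conv2d(128, 256, kernel_size=1, bias=False),                         # pointwise
)

print(count(standard), count(separable))  # 294912 vs 33920 (~8.7x fewer)
```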
Transfer Learning
| Strategy | When to Use | What to Do |
|---|---|---|
| Feature extraction | Very small dataset, similar domain | Freeze backbone, train only new head |
| Fine-tune last layers | Small-medium dataset | Freeze early layers, fine-tune later ones |
| Fine-tune all layers | Large dataset or different domain | Lower LR for pretrained layers |
Transfer learning is often the most impactful technique for small datasets.
Discriminative learning rates: Use lower LR for early (pretrained) layers, higher LR for the new classification head.
Gradual unfreezing: Start by training only the head, then progressively unfreeze deeper layers.
ImageNet-pretrained features transfer remarkably well even to very different domains (medical imaging, satellite imagery).
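As a concrete sketch of feature extraction plus discriminative learning rates, using torchvision's pretrained ResNet-18 (the specific backbone, learning rates, class count, and the torchvision >= 0.13 weights API are all illustrative assumptions):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

num_classes = 10  # illustrative

# Load an ImageNet-pretrained backbone (torchvision >= 0.13 weights API)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Feature extraction: freeze the backbone, then replace and train only the head
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head, trainable by default

# Discriminative learning rates for when the backbone is later unfrozen:
# small LR for pretrained layers, larger LR for the new classification head
optimizer = optim.SGD(
    [
        {"params": [p for n, p in model.named_parameters() if not n.startswith("fc")], "lr": 1e-4},
        {"params": model.fc.parameters(), "lr": 1e-2},
    ],
    momentum=0.9,
)
```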
CNN Design Principles
- Increase channels and decrease spatial resolution as you go deeper.
- Use 3x3 kernels almost exclusively (stack them for larger receptive fields).
- Add batch normalization after every convolution.
- Use skip connections for networks deeper than ~10 layers.
- Replace FC layers with global average pooling where possible.
- Start with a pretrained model when data is limited.
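A small network that follows these principles end to end (all layer sizes are illustrative assumptions, not a recommended recipe): conv -> BN -> ReLU blocks with 3x3 kernels, channels doubling while resolution halves, and global average pooling instead of large FC layers.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    # Principle: conv -> BN -> ReLU, 3x3 kernels only
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32),
            conv_block(32, 64, stride=2),    # halve resolution, double channels
            conv_block(64, 128, stride=2),
            conv_block(128, 256, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling instead of big FC layers
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.head(x)

print(TinyCNN()(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```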
Common Pitfalls
- Applying CNNs to non-spatial data where the inductive biases are not appropriate.
- Forgetting to adjust the input size when changing stride or padding, causing dimension mismatches.
- Using batch norm before the convolution instead of after (the standard ordering is conv -> BN -> ReLU).
- Training from scratch on small datasets when pretrained models are available.
- Ignoring spatial resolution in the classification head: flattening too early discards spatial information.
Looking Ahead
Chapter 15 (RNNs): From spatial to temporal weight sharing; 1D convolutions as an alternative to recurrence for some sequence tasks.
Chapters 16--17 (Transformers): Self-attention as a generalization of convolution without the locality bias; Vision Transformers (ViT) challenge CNN dominance.