Convolutional Neural Networks embed two powerful inductive biases---local connectivity and weight sharing---directly into the architecture. These biases dramatically reduce parameter count while making networks equivariant to spatial translations. CNNs have dominated computer vision for over a decade, and their principles extend to audio, text, and scientific computing.
The Convolution Operation
A convolutional layer slides a small kernel across the input, computing weighted sums at each position. The same kernel (weights) is used everywhere---weight sharing.
Multi-channel convolution: Each filter is a 3D tensor of shape $(C_{in}, k_h, k_w)$. Using $C_{out}$ filters gives a 4D weight tensor: $(C_{out}, C_{in}, k_h, k_w)$.
Parameter savings are enormous: a 3x3 conv with 64 filters on 3-channel input uses $3 \times 3 \times 3 \times 64 = 1{,}728$ parameters vs. millions for a fully connected layer.
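A quick way to sanity-check this count is to build the layer and count its weights. The sketch below uses PyTorch (the framework choice and the 224x224 input size are illustrative assumptions, not something the text prescribes); bias terms are omitted to match the figure above.

```python
import torch
import torch.nn as nn

# 3x3 convolution: 3 input channels, 64 output channels, no bias
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, bias=False)
print(sum(p.numel() for p in conv.parameters()))  # 3 * 3 * 3 * 64 = 1728

# For comparison: a fully connected layer from a 224x224 RGB image to 64 units
fc = nn.Linear(224 * 224 * 3, 64, bias=False)
print(sum(p.numel() for p in fc.parameters()))    # 9,633,792 weights
```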
Stride, Padding, and Pooling
| Concept | Effect | Common Values |
|---|---|---|
| Stride | Downsamples spatial dimensions | 1 (default), 2 (downsample) |
| Padding | Preserves spatial dimensions | padding=k//2 for "same" output |
| Max Pooling | Takes maximum value in each region | 2x2 with stride 2 |
| Average Pooling | Takes mean value in each region | Global average pooling before FC |
| Dilation | Increases receptive field without parameters | 2 or 4 for segmentation |
Global Average Pooling replaces the need for large FC layers at the end of CNNs, reducing parameters and overfitting.
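The shape effects in the table are easy to verify directly. A minimal sketch in PyTorch (framework and tensor sizes are assumptions) traces one input through a "same" convolution, a strided convolution, max pooling, and global average pooling:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                                 # (batch, channels, height, width)

same = nn.Conv2d(3, 16, kernel_size=3, padding=1)             # padding = k // 2 -> "same" spatial size
print(same(x).shape)                                          # torch.Size([1, 16, 32, 32])

down = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)   # stride 2 halves height and width
print(down(x).shape)                                          # torch.Size([1, 16, 16, 16])

pool = nn.MaxPool2d(kernel_size=2, stride=2)                  # 2x2 max pooling, stride 2
print(pool(x).shape)                                          # torch.Size([1, 3, 16, 16])

gap = nn.AdaptiveAvgPool2d(1)                                 # global average pooling
print(gap(x).shape)                                           # torch.Size([1, 3, 1, 1])
```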
Receptive Field
The receptive field of a neuron is the region of the input that influences its value.
Stacking 3x3 convolutions increases the receptive field efficiently: two 3x3 layers give a 5x5 receptive field; three give 7x7, with fewer parameters than a single 7x7 kernel.
Deeper networks have larger receptive fields, enabling them to capture global context.
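To make the parameter claim concrete, compare three stacked 3x3 layers against a single 7x7 layer with the same receptive field; the channel count C = 64 below is just an illustrative choice, with biases ignored:

```python
C = 64  # illustrative channel count, kept constant across layers

stacked_3x3 = 3 * (3 * 3 * C * C)   # three 3x3 conv layers -> 7x7 receptive field
single_7x7 = 7 * 7 * C * C          # one 7x7 conv layer    -> same receptive field

print(stacked_3x3, single_7x7)      # 110592 vs 200704 weights
```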
Landmark Architectures
| Architecture | Year | Key Innovation | Depth | Params |
|---|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN (digit recognition) | 5 | 60K |
| AlexNet | 2012 | GPU training, dropout, ReLU | 8 | 60M |
| VGG | 2014 | Uniform 3x3 convolutions throughout | 16/19 | 138M |
| GoogLeNet | 2014 | Inception modules (multi-scale) | 22 | 6.8M |
| ResNet | 2015 | Skip connections / residual learning | 50/101/152 | 25M |
| MobileNet | 2017 | Depthwise separable convolutions | ~28 | 3.4M |
| EfficientNet | 2019 | Compound scaling (width, depth, resolution) | Variable | Variable |
Residual Connections (Skip Connections)
Core idea: $\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$. The network learns the residual $F(\mathbf{x}) = \mathbf{y} - \mathbf{x}$ rather than the full mapping.
Why they work: Gradient flows directly through the identity shortcut, mitigating vanishing gradients. Since $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + I$, the skip contributes an identity term, so the shortcut path passes gradients back unattenuated even when $\partial F / \partial \mathbf{x}$ is small.
When dimensions differ: Use a 1x1 convolution on the shortcut to match channel dimensions (projection shortcut).
Skip connections enabled training networks with 100+ layers---impossible with plain architectures.
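A minimal residual block, sketched in PyTorch (the framework, layer sizes, and exact layer ordering are assumptions, not the precise ResNet definition). The projection shortcut is a 1x1 convolution used only when the channel count or stride changes:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: y = F(x) + shortcut(x), followed by ReLU."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # Projection shortcut (1x1 conv) when dimensions differ, identity otherwise
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        return torch.relu(self.f(x) + self.shortcut(x))

block = ResidualBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 16, 16])
```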
Depthwise Separable Convolutions
Factorize a standard convolution into: (1) depthwise convolution (one filter per channel) + (2) pointwise 1x1 convolution (mixes channels).
Parameter reduction: from $C_{in} \times C_{out} \times k^2$ to $C_{in} \times k^2 + C_{in} \times C_{out}$.
For a 3x3 conv with 128 input and 256 output channels: standard = $128 \times 256 \times 9 = 294{,}912$ params; separable = $128 \times 9 + 128 \times 256 = 33{,}920$ params (~8.7x reduction).
Foundation of MobileNet and efficient architectures for mobile/edge deployment.
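The factorization maps directly onto the groups argument of a standard conv layer. A sketch in PyTorch (framework is an assumption; the 128 -> 256 sizes mirror the example above, with biases omitted to match the formula):

```python
import torch.nn as nn

def count(module):
    return sum(p.numel() for p in module.parameters())

# Standard 3x3 convolution: 128 -> 256 channels
standard = nn.Conv2d(128, 256, kernel_size=3, padding=1, bias=False)

# Depthwise conv (groups = in_channels) followed by pointwise 1x1 conv
separable = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=128, bias=False),  # depthwise
    nn.Conv2d(128, 256, kernel_size=1, bias=False),                         # pointwise
)

print(count(standard), count(separable))  # 294912 vs 33920 (~8.7x fewer)
```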
Transfer Learning
| Strategy | When to Use | What to Do |
|---|---|---|
| Feature extraction | Very small dataset, similar domain | Freeze backbone, train only new head |
| Fine-tune last layers | Small-medium dataset | Freeze early layers, fine-tune later ones |
| Fine-tune all layers | Large dataset or different domain | Lower LR for pretrained layers |
Transfer learning is often the most impactful technique for small datasets.
Discriminative learning rates: Use lower LR for early (pretrained) layers, higher LR for the new classification head.
Gradual unfreezing: Start by training only the head, then progressively unfreeze deeper layers.
ImageNet-pretrained features transfer remarkably well even to very different domains (medical imaging, satellite imagery).
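As a concrete sketch of feature extraction plus discriminative learning rates, using torchvision's pretrained ResNet-18 (the specific backbone, learning rates, class count, and the torchvision >= 0.13 weights API are all illustrative assumptions):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

num_classes = 10  # illustrative

# Load an ImageNet-pretrained backbone (torchvision >= 0.13 weights API)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Feature extraction: freeze the backbone, then replace and train only the head
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head, trainable by default

# Discriminative learning rates for when the backbone is later unfrozen:
# small LR for pretrained layers, larger LR for the new classification head
optimizer = optim.SGD(
    [
        {"params": [p for n, p in model.named_parameters() if not n.startswith("fc")], "lr": 1e-4},
        {"params": model.fc.parameters(), "lr": 1e-2},
    ],
    momentum=0.9,
)
```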
CNN Design Principles
- Increase channels and decrease spatial resolution as you go deeper.
- Use 3x3 kernels almost exclusively (stack them for larger receptive fields).
- Add batch normalization after every convolution.
- Use skip connections for networks deeper than ~10 layers.
- Replace FC layers with global average pooling where possible.
- Start with a pretrained model when data is limited.
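A small network that follows these principles end to end (all layer sizes are illustrative assumptions, not a recommended recipe): conv -> BN -> ReLU blocks with 3x3 kernels, channels doubling while resolution halves, and global average pooling instead of large FC layers.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    # Principle: conv -> BN -> ReLU, 3x3 kernels only
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32),
            conv_block(32, 64, stride=2),    # halve resolution, double channels
            conv_block(64, 128, stride=2),
            conv_block(128, 256, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling instead of big FC layers
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.head(x)

print(TinyCNN()(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```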
Common Pitfalls
- Applying CNNs to non-spatial data where the inductive biases are not appropriate.
- Forgetting to adjust the input size when changing stride or padding, causing dimension mismatches.
- Using batch norm before the convolution instead of after (the standard ordering is conv -> BN -> ReLU).
- Training from scratch on small datasets when pretrained models are available.
- Ignoring spatial resolution in the classification head: flattening too early discards spatial information.
Looking Ahead
Chapter 15 (RNNs): From spatial to temporal weight sharing; 1D convolutions as an alternative to recurrence for some sequence tasks.
Chapters 16--17 (Transformers): Self-attention as a generalization of convolution without the locality bias; Vision Transformers (ViT) challenge CNN dominance.