> "The visual cortex is not a general-purpose pattern recognizer; it is a hierarchy of increasingly complex feature detectors." -- David Hubel and Torsten Wiesel, Nobel Laureates in Physiology
In This Chapter
- 14.1 The Convolution Operation
- 14.2 Stride and Padding
- 14.3 Feature Maps and Receptive Fields
- 14.4 Pooling Layers
- 14.5 Building a Basic CNN
- 14.6 Landmark CNN Architectures
- 14.7 1x1 Convolutions
- 14.8 Depthwise Separable Convolutions
- 14.9 Transfer Learning with Pretrained CNNs
- 14.10 Visualizing Learned Features
- 14.11 Batch Normalization in CNNs
- 14.12 Modern CNN Design Principles
- 14.13 CNNs Beyond Image Classification
- 14.14 CNNs vs. Vision Transformers
- 14.15 Putting It All Together: A Complete Training Pipeline
- 14.16 Common Pitfalls and Debugging Tips
- 14.17 Summary
Chapter 14: Convolutional Neural Networks
"The visual cortex is not a general-purpose pattern recognizer; it is a hierarchy of increasingly complex feature detectors." -- David Hubel and Torsten Wiesel, Nobel Laureates in Physiology
In Chapter 13, we explored the mechanics of deep feedforward networks -- layers of neurons connected end to end, each transforming its input through learned weights and nonlinear activations. Those fully connected architectures are powerful general-purpose function approximators, but they carry a fundamental limitation: they treat every input feature as independent of its neighbors. When you flatten a 224x224 color image into a vector of 150,528 values and feed it into a dense layer, the network has no notion that pixel (10, 10) is spatially adjacent to pixel (10, 11). It must learn spatial relationships entirely from scratch, requiring enormous numbers of parameters and enormous amounts of data.
Convolutional Neural Networks (CNNs) solve this problem elegantly. Inspired by the hierarchical organization of the biological visual cortex -- where simple cells detect oriented edges and complex cells pool over spatial regions -- CNNs embed two powerful inductive biases directly into their architecture: local connectivity and weight sharing. These biases dramatically reduce the number of learnable parameters while simultaneously making the network equivariant to spatial translations. The result is a family of architectures that have dominated computer vision for over a decade and whose principles extend far beyond images into audio processing, natural language understanding, and scientific computing.
This chapter takes you from the mathematical foundations of the convolution operation through the landmark architectures that defined modern deep learning. By the end, you will understand not just how CNNs work, but why each design choice was made -- and you will have the practical skills to build, train, and fine-tune convolutional networks using PyTorch.
14.1 The Convolution Operation
14.1.1 From Signals to Images
The convolution operation is borrowed from signal processing, where it has been used for over a century to filter and transform signals. In continuous mathematics, the convolution of two functions $f$ and $g$ is defined as:
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) \, g(t - \tau) \, d\tau$$
For discrete signals -- and all digital images are discrete -- we replace the integral with a summation:
$$(f * g)[n] = \sum_{k=-\infty}^{\infty} f[k] \, g[n - k]$$
In the context of neural networks, we typically work with the cross-correlation operation rather than true convolution (which would require flipping the kernel). Since the kernel weights are learned, the distinction is immaterial -- the network will simply learn flipped versions of whatever a true convolution would learn. Following the convention used by PyTorch and virtually every deep learning framework, we will use the term "convolution" to refer to cross-correlation throughout this chapter.
14.1.2 2D Convolution for Images
For a 2D input image $I$ and a 2D kernel $K$ of size $k_h \times k_w$, the discrete cross-correlation is:
$$(I * K)[i, j] = \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} I[i + m, \, j + n] \cdot K[m, n]$$
The kernel slides across the input image, computing a weighted sum at each position. Each output value depends only on a small local region of the input -- the receptive field of that output neuron. This local connectivity is the first key inductive bias of CNNs.
The same kernel is applied at every spatial position, meaning that the weights are shared across the entire image. This is the second key inductive bias. Weight sharing means that if the network learns to detect a vertical edge in the top-left corner, it can detect the same edge anywhere in the image without learning separate detectors for each location. Formally, the convolution operation is equivariant to translation: if the input shifts by $(dx, dy)$, the output shifts by the same amount.
Consider the parameter savings. For a 224x224 grayscale image fed into a fully connected layer with 1000 neurons, we need $224 \times 224 \times 1000 = 50{,}176{,}000$ parameters. A single 3x3 convolutional kernel requires just 9 parameters (plus a bias term), yet it can process the entire image.
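To make the formula concrete, here is a minimal sketch (the helper name naive_cross_correlation is ours) that applies the cross-correlation sum with explicit loops on a single-channel input and checks the result against PyTorch's built-in F.conv2d:
import torch
import torch.nn.functional as F
def naive_cross_correlation(image: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """Apply the 2D cross-correlation sum with explicit loops (single channel)."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = torch.zeros(out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum over the k_h x k_w window anchored at (i, j)
            output[i, j] = (image[i:i + k_h, j:j + k_w] * kernel).sum()
    return output
torch.manual_seed(42)
image = torch.randn(8, 8)
kernel = torch.randn(3, 3)
ours = naive_cross_correlation(image, kernel)
# F.conv2d expects (N, C, H, W) tensors; PyTorch's "convolution" is cross-correlation
theirs = F.conv2d(image.view(1, 1, 8, 8), kernel.view(1, 1, 3, 3)).squeeze()
print(torch.allclose(ours, theirs, atol=1e-5))  # True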
14.1.3 Multi-Channel Convolution
Real images have multiple channels -- three for RGB color images, or potentially hundreds for the feature maps produced by intermediate CNN layers. For an input with $C_{in}$ channels, each convolutional filter is actually a 3D tensor of shape $C_{in} \times k_h \times k_w$. The convolution sums over all input channels:
$$\text{output}[i, j] = \sum_{c=0}^{C_{in}-1} \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} I[c, i+m, j+n] \cdot K[c, m, n] + b$$
where $b$ is a scalar bias term. To produce $C_{out}$ output channels (feature maps), we use $C_{out}$ such filters, each with its own set of weights. The full convolutional layer thus has a 4D weight tensor of shape $C_{out} \times C_{in} \times k_h \times k_w$.
In PyTorch:
import torch
import torch.nn as nn
torch.manual_seed(42)
# A convolutional layer: 3 input channels (RGB), 16 output channels, 3x3 kernels
conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
# Input: batch of 4 RGB images, each 32x32
x = torch.randn(4, 3, 32, 32)
output = conv_layer(x)
print(f"Input shape: {x.shape}") # torch.Size([4, 3, 32, 32])
print(f"Output shape: {output.shape}") # torch.Size([4, 16, 30, 30])
print(f"Parameters: {conv_layer.weight.shape}") # torch.Size([16, 3, 3, 3])
Notice that the spatial dimensions decreased from 32x32 to 30x30. This brings us to the concepts of stride and padding.
14.2 Stride and Padding
14.2.1 Stride
The stride controls how far the kernel moves between successive applications. With stride 1 (the default), the kernel moves one pixel at a time. With stride 2, it skips every other position, effectively downsampling the output by a factor of 2 in each spatial dimension.
For an input of size $H \times W$, kernel size $k$, stride $s$, and padding $p$, the output size is:
$$H_{out} = \left\lfloor \frac{H - k + 2p}{s} \right\rfloor + 1$$
$$W_{out} = \left\lfloor \frac{W - k + 2p}{s} \right\rfloor + 1$$
This formula is essential for designing CNN architectures -- you will use it constantly to verify that layer dimensions are compatible. We first encountered the importance of tracking tensor dimensions in Chapter 12 when discussing matrix operations in neural networks.
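As a quick sanity check, the formula can be wrapped in a small helper (conv_output_size is our name, not a PyTorch function) and compared against an actual nn.Conv2d layer:
import torch
import torch.nn as nn
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output size along one spatial dimension, per the floor formula above."""
    return (size - kernel + 2 * padding) // stride + 1
# 32x32 input, 3x3 kernel, stride 2, padding 1 -> 16x16
print(conv_output_size(32, kernel=3, stride=2, padding=1))  # 16
torch.manual_seed(42)
layer = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
print(layer(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 8, 16, 16])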
14.2.2 Padding
Without padding, each convolution layer shrinks the spatial dimensions. For a 3x3 kernel, the output is 2 pixels smaller in each dimension. After several layers, the feature maps become tiny, and information at the borders is underrepresented because border pixels participate in fewer convolution operations than central pixels.
Zero padding (also called "same" padding when chosen to preserve spatial dimensions) adds zeros around the border of the input. For a kernel of size $k$, padding of $p = \lfloor k/2 \rfloor$ preserves the spatial dimensions when the stride is 1:
torch.manual_seed(42)
# "Same" padding preserves spatial dimensions
conv_same = nn.Conv2d(3, 16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)
output = conv_same(x)
print(f"Output shape with padding=1: {output.shape}") # torch.Size([1, 16, 32, 32])
# Strided convolution for downsampling
conv_strided = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
output_strided = conv_strided(x)
print(f"Output shape with stride=2: {output_strided.shape}") # torch.Size([1, 16, 16, 16])
Other padding strategies include replicate padding (repeating border pixels) and reflect padding (mirroring the input at the boundary). These can reduce border artifacts in certain applications like image generation.
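In PyTorch these alternatives are selected with the padding_mode argument of nn.Conv2d; a brief sketch:
import torch
import torch.nn as nn
torch.manual_seed(42)
x = torch.randn(1, 3, 32, 32)
# Same geometry as zero padding, different treatment of the boundary
conv_reflect = nn.Conv2d(3, 16, kernel_size=3, padding=1, padding_mode="reflect")
conv_replicate = nn.Conv2d(3, 16, kernel_size=3, padding=1, padding_mode="replicate")
print(conv_reflect(x).shape)    # torch.Size([1, 16, 32, 32])
print(conv_replicate(x).shape)  # torch.Size([1, 16, 32, 32])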
14.2.3 Dilated (Atrous) Convolutions
Dilated convolutions insert gaps between kernel elements, effectively expanding the receptive field without increasing the number of parameters or reducing resolution. A dilation rate of $d$ means there are $d-1$ zeros between each kernel element. The effective kernel size becomes:
$$k_{eff} = k + (k - 1)(d - 1)$$
Dilated convolutions are particularly useful in semantic segmentation (as in the DeepLab architecture) where maintaining spatial resolution while capturing long-range context is critical.
torch.manual_seed(42)
# Standard 3x3 convolution: receptive field = 3x3
conv_standard = nn.Conv2d(1, 1, kernel_size=3, padding=1)
# Dilated 3x3 convolution with dilation=2: effective receptive field = 5x5
conv_dilated = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2)
x = torch.randn(1, 1, 8, 8)
print(f"Standard output: {conv_standard(x).shape}") # [1, 1, 8, 8]
print(f"Dilated output: {conv_dilated(x).shape}") # [1, 1, 8, 8]
14.3 Feature Maps and Receptive Fields
14.3.1 Feature Maps
Each output channel of a convolutional layer is called a feature map. Early layers tend to learn low-level features: edges, corners, color gradients, and simple textures. As we move deeper, feature maps capture increasingly abstract concepts: object parts, textures, and eventually entire objects or scenes.
This hierarchical feature extraction mirrors the organization of the primate visual cortex, where neurons in area V1 respond to oriented edges, neurons in V2 respond to more complex patterns like corners and contours, and neurons in the inferotemporal cortex respond to faces and objects. This biological parallel is not coincidental -- it reflects the mathematical properties of composing local operations into a deep hierarchy.
14.3.2 Receptive Field
The receptive field of a neuron in a CNN is the region of the input image that influences that neuron's activation. For the first convolutional layer with a $k \times k$ kernel, the receptive field is simply $k \times k$ pixels. For deeper layers, the receptive field grows because each neuron depends on a $k \times k$ region of the previous layer's feature map, and each of those neurons in turn depends on the layer before it.
For a network with $L$ layers, each using kernel size $k$ and stride $s$, the receptive field $r_L$ at layer $L$ is:
$$r_L = r_{L-1} + (k_L - 1) \cdot \prod_{i=1}^{L-1} s_i$$
with $r_0 = 1$. For a stack of three 3x3 convolutional layers with stride 1, the receptive field is $7 \times 7$ -- the same as a single 7x7 kernel, but with far fewer parameters and more nonlinearities (as we discussed regarding depth vs. width tradeoffs in Chapter 13).
This insight -- that stacking small kernels achieves the same receptive field as larger kernels with fewer parameters -- was a key motivation behind the VGG architecture, which we will examine in Section 14.6.
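The recurrence is easy to turn into a few lines of code; the helper below (receptive_field is our name) reproduces the 7x7 result for three stacked 3x3, stride-1 layers:
def receptive_field(kernels: list[int], strides: list[int]) -> int:
    """Apply the recurrence r_L = r_{L-1} + (k_L - 1) * product of earlier strides."""
    r, jump = 1, 1  # jump accumulates the product of strides of earlier layers
    for k, s in zip(kernels, strides):
        r += (k - 1) * jump
        jump *= s
    return r
# Three stacked 3x3, stride-1 convolutions -> 7x7 receptive field
print(receptive_field([3, 3, 3], [1, 1, 1]))  # 7
# A stride-2 first layer makes later layers cover more of the input
print(receptive_field([3, 3, 3], [2, 1, 1]))  # 11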
14.3.3 Understanding Feature Map Dimensions
A practical skill that separates experienced CNN practitioners from beginners is the ability to mentally track feature map dimensions through a network. For each convolutional or pooling layer, apply the output size formula from Section 14.2.1. Here is a worked example:
| Layer | Operation | Input Size | Output Size |
|---|---|---|---|
| 1 | Conv2d(3, 64, 3, padding=1) | 3 x 224 x 224 | 64 x 224 x 224 |
| 2 | MaxPool2d(2, 2) | 64 x 224 x 224 | 64 x 112 x 112 |
| 3 | Conv2d(64, 128, 3, padding=1) | 64 x 112 x 112 | 128 x 112 x 112 |
| 4 | MaxPool2d(2, 2) | 128 x 112 x 112 | 128 x 56 x 56 |
| 5 | Conv2d(128, 256, 3, padding=1) | 128 x 56 x 56 | 256 x 56 x 56 |
Tracking these dimensions is crucial when debugging shape mismatches -- one of the most common errors when building CNNs. As we emphasized in Chapter 11 on debugging neural networks, shape errors are often the first sign that your architecture specification is incorrect.
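You can confirm the table above programmatically by pushing a dummy tensor through the same stack and printing the shape after each layer; a short sketch:
import torch
import torch.nn as nn
torch.manual_seed(42)
stack = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),     # 3 x 224 x 224   -> 64 x 224 x 224
    nn.MaxPool2d(2, 2),                 # 64 x 224 x 224  -> 64 x 112 x 112
    nn.Conv2d(64, 128, 3, padding=1),   # 64 x 112 x 112  -> 128 x 112 x 112
    nn.MaxPool2d(2, 2),                 # 128 x 112 x 112 -> 128 x 56 x 56
    nn.Conv2d(128, 256, 3, padding=1),  # 128 x 56 x 56   -> 256 x 56 x 56
)
x = torch.randn(1, 3, 224, 224)
for layer in stack:
    x = layer(x)
    print(f"{layer.__class__.__name__}: {tuple(x.shape)}")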
14.4 Pooling Layers
14.4.1 Max Pooling
Pooling layers reduce the spatial dimensions of feature maps, providing a form of spatial invariance and reducing computational cost. Max pooling takes the maximum value within each pooling window:
$$\text{MaxPool}(I)[i, j] = \max_{0 \le m < k, \, 0 \le n < k} I[i \cdot s + m, \, j \cdot s + n]$$
The most common configuration is a 2x2 window with stride 2, which halves each spatial dimension:
torch.manual_seed(42)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 64, 32, 32)
pooled = pool(x)
print(f"Before pooling: {x.shape}") # [1, 64, 32, 32]
print(f"After pooling: {pooled.shape}") # [1, 64, 16, 16]
Max pooling introduces a degree of translation invariance -- small shifts in the input lead to the same (or very similar) max values. This complements the translation equivariance of the convolution operation itself.
14.4.2 Average Pooling
Average pooling computes the mean within each window instead of the maximum. It is less commonly used in hidden layers but plays a critical role at the end of modern architectures through Global Average Pooling (GAP):
torch.manual_seed(42)
# Global Average Pooling: reduces spatial dimensions to 1x1
gap = nn.AdaptiveAvgPool2d(1)
x = torch.randn(1, 512, 7, 7)
output = gap(x)
print(f"Before GAP: {x.shape}") # [1, 512, 7, 7]
print(f"After GAP: {output.shape}") # [1, 512, 1, 1]
GAP replaces the large fully connected layers that were common in early CNN architectures. Instead of flattening a 512x7x7 feature map into a 25,088-dimensional vector (requiring 25 million parameters for a 1000-class classifier), GAP reduces it to a 512-dimensional vector, needing only 512,000 parameters. This dramatically reduces overfitting and the total model size.
14.4.3 The Debate Over Pooling
Pooling has been controversial. While it provides computational savings and some degree of invariance, it also discards spatial information -- the exact location of a feature is lost, only its presence is preserved. Some researchers, notably those behind the all-convolutional network (Springenberg et al., 2015), have argued that strided convolutions can replace pooling entirely, allowing the network to learn its own downsampling strategy. This is an area where the field continues to evolve, and we will see both approaches in the architectures discussed below.
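The sketch below contrasts the two downsampling options: a fixed 2x2 max pool and a learned stride-2 convolution, which produce feature maps of the same spatial size:
import torch
import torch.nn as nn
torch.manual_seed(42)
x = torch.randn(1, 64, 32, 32)
# Fixed downsampling: 2x2 max pooling
pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)
# Learned downsampling: stride-2 convolution in the all-convolutional style
strided_conv = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
print(pooled.shape)           # torch.Size([1, 64, 16, 16])
print(strided_conv(x).shape)  # torch.Size([1, 64, 16, 16])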
14.5 Building a Basic CNN
Before examining landmark architectures, let us build a simple CNN from scratch and understand how all the pieces fit together. Recall from Chapter 13 that a neural network is a composition of differentiable functions -- a CNN simply replaces some of the fully connected layers with convolutional and pooling layers.
A typical CNN follows the pattern:
Input -> [Conv -> Activation -> Pool] x N -> Flatten -> [FC -> Activation] x M -> Output
import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(42)
class SimpleCNN(nn.Module):
"""A basic CNN for CIFAR-10 classification.
Architecture:
- Two convolutional blocks (conv -> relu -> maxpool)
- Two fully connected layers
- Output layer with 10 classes
Args:
num_classes: Number of output classes. Defaults to 10.
"""
def __init__(self, num_classes: int = 10) -> None:
super().__init__()
# Convolutional layers
self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
self.pool = nn.MaxPool2d(2, 2)
# Fully connected layers
self.fc1 = nn.Linear(64 * 8 * 8, 256)
self.fc2 = nn.Linear(256, num_classes)
self.dropout = nn.Dropout(0.5)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass through the network.
Args:
x: Input tensor of shape (batch_size, 3, 32, 32).
Returns:
Logits tensor of shape (batch_size, num_classes).
"""
# Block 1: Conv -> ReLU -> Pool (32x32 -> 16x16)
x = self.pool(F.relu(self.conv1(x)))
# Block 2: Conv -> ReLU -> Pool (16x16 -> 8x8)
x = self.pool(F.relu(self.conv2(x)))
# Flatten
x = x.view(x.size(0), -1)
# Fully connected layers
x = self.dropout(F.relu(self.fc1(x)))
x = self.fc2(x)
return x
# Verify the architecture
model = SimpleCNN()
x = torch.randn(2, 3, 32, 32)
output = model(x)
print(f"Output shape: {output.shape}") # [2, 10]
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
This simple architecture already demonstrates the key CNN principles: local connectivity through 3x3 kernels, weight sharing across spatial positions, hierarchical feature extraction through stacked layers, and spatial downsampling through pooling. Following the discussion of dropout and batch normalization as regularizers in Chapter 13, we include dropout before the final classification layer.
14.6 Landmark CNN Architectures
The history of CNN architectures is a story of increasing depth, more efficient parameter usage, and clever structural innovations. Understanding these architectures provides design intuition that remains valuable even as new architectures emerge.
14.6.1 LeNet-5 (1998)
Yann LeCun's LeNet-5 was one of the first successful applications of CNNs, designed for handwritten digit recognition on the MNIST dataset. Its architecture is simple by modern standards:
- Input: 1 x 32 x 32 grayscale image
- C1: 6 filters of size 5x5, producing 6 x 28 x 28 feature maps
- S2: 2x2 average pooling, producing 6 x 14 x 14
- C3: 16 filters of size 5x5, producing 16 x 10 x 10
- S4: 2x2 average pooling, producing 16 x 5 x 5
- C5: 120 filters of size 5x5, producing 120 x 1 x 1
- F6: Fully connected layer with 84 units
- Output: 10 units (one per digit)
LeNet-5 introduced several ideas that remain foundational: the alternation of convolutional and pooling layers, the progressive increase in the number of channels while decreasing spatial dimensions, and the use of fully connected layers at the end for classification.
torch.manual_seed(42)
class LeNet5(nn.Module):
"""LeNet-5 architecture for digit recognition.
Args:
num_classes: Number of output classes. Defaults to 10.
"""
def __init__(self, num_classes: int = 10) -> None:
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(1, 6, kernel_size=5),
nn.Tanh(),
nn.AvgPool2d(kernel_size=2, stride=2),
nn.Conv2d(6, 16, kernel_size=5),
nn.Tanh(),
nn.AvgPool2d(kernel_size=2, stride=2),
nn.Conv2d(16, 120, kernel_size=5),
nn.Tanh(),
)
self.classifier = nn.Sequential(
nn.Linear(120, 84),
nn.Tanh(),
nn.Linear(84, num_classes),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass.
Args:
x: Input tensor of shape (batch_size, 1, 32, 32).
Returns:
Logits tensor of shape (batch_size, num_classes).
"""
x = self.features(x)
x = x.view(x.size(0), -1)
x = self.classifier(x)
return x
Note the use of tanh activations -- ReLU had not yet been popularized. As we discussed in Chapter 13 regarding activation functions, the shift to ReLU was one of the key enablers of deeper networks.
14.6.2 AlexNet (2012)
AlexNet (Krizhevsky, Sutskever, and Hinton, 2012) is often credited with igniting the deep learning revolution. It won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 by a dramatic margin, reducing the top-5 error rate from 26% to 16%.
Key innovations of AlexNet:
- ReLU activations instead of tanh or sigmoid, enabling faster training (see Chapter 13)
- Dropout for regularization (see Chapter 13, Section 13.7)
- Data augmentation (random crops, horizontal flips, color jittering)
- GPU training -- the network was split across two GPUs due to memory constraints
- Local Response Normalization (LRN) -- later superseded by batch normalization
AlexNet's architecture: 5 convolutional layers followed by 3 fully connected layers, with about 60 million parameters. The first layer uses large 11x11 filters with stride 4, which was later recognized as suboptimal.
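If you want to inspect AlexNet directly, torchvision ships a pretrained single-GPU variant of the architecture; the snippet below (which downloads the weights on first use and requires a recent torchvision) loads it and counts its parameters:
import torch
import torchvision.models as models
torch.manual_seed(42)
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
num_params = sum(p.numel() for p in alexnet.parameters())
print(f"AlexNet parameters: {num_params:,}")  # roughly 60 million
x = torch.randn(1, 3, 224, 224)
print(alexnet(x).shape)  # torch.Size([1, 1000])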
14.6.3 VGGNet (2014)
The VGG architecture (Simonyan and Zisserman, 2014) demonstrated a powerful design principle: use small 3x3 filters everywhere. The key insight is that two stacked 3x3 convolutional layers have the same effective receptive field as a single 5x5 layer, but with fewer parameters and more nonlinear activations:
- One 5x5 layer: $5 \times 5 \times C^2 = 25C^2$ parameters
- Two 3x3 layers: $2 \times 3 \times 3 \times C^2 = 18C^2$ parameters
Three 3x3 layers match a 7x7 receptive field with $27C^2$ vs $49C^2$ parameters.
VGG comes in several variants (VGG-11, VGG-13, VGG-16, VGG-19), all following the same pattern: blocks of 3x3 convolutions followed by 2x2 max pooling, with the number of channels doubling after each pooling layer (64 -> 128 -> 256 -> 512 -> 512).
torch.manual_seed(42)
def make_vgg_block(in_channels: int, out_channels: int, num_convs: int) -> nn.Sequential:
"""Create a VGG-style convolutional block.
Args:
in_channels: Number of input channels.
out_channels: Number of output channels.
num_convs: Number of convolutional layers in the block.
Returns:
Sequential block of conv-relu layers followed by max pooling.
"""
layers: list[nn.Module] = []
for i in range(num_convs):
layers.append(
nn.Conv2d(
in_channels if i == 0 else out_channels,
out_channels,
kernel_size=3,
padding=1,
)
)
layers.append(nn.ReLU(inplace=True))
layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
return nn.Sequential(*layers)
class VGG16(nn.Module):
"""VGG-16 architecture.
Args:
num_classes: Number of output classes. Defaults to 1000.
"""
def __init__(self, num_classes: int = 1000) -> None:
super().__init__()
self.features = nn.Sequential(
make_vgg_block(3, 64, 2), # 224 -> 112
make_vgg_block(64, 128, 2), # 112 -> 56
make_vgg_block(128, 256, 3), # 56 -> 28
make_vgg_block(256, 512, 3), # 28 -> 14
make_vgg_block(512, 512, 3), # 14 -> 7
)
self.classifier = nn.Sequential(
nn.Linear(512 * 7 * 7, 4096),
nn.ReLU(inplace=True),
nn.Dropout(0.5),
nn.Linear(4096, 4096),
nn.ReLU(inplace=True),
nn.Dropout(0.5),
nn.Linear(4096, num_classes),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass.
Args:
x: Input tensor of shape (batch_size, 3, 224, 224).
Returns:
Logits tensor of shape (batch_size, num_classes).
"""
x = self.features(x)
x = x.view(x.size(0), -1)
x = self.classifier(x)
return x
VGG-16 has approximately 138 million parameters, most of which (about 124 million) reside in the fully connected layers. This observation motivated the move toward global average pooling in later architectures.
14.6.4 ResNet and Skip Connections (2015)
ResNet (He et al., 2015) introduced perhaps the single most important architectural innovation in deep learning: the residual connection (also called a skip connection or shortcut connection). This idea solved the degradation problem -- the surprising observation that simply adding more layers to a network can decrease accuracy, not because of overfitting, but because of optimization difficulties.
The key insight is deceptively simple. Instead of asking a layer to learn the desired mapping $H(x)$ directly, ResNet asks it to learn the residual $F(x) = H(x) - x$. The output of the block is then:
$$y = F(x) + x$$
If the optimal transformation is close to the identity (which is often the case in very deep networks), then learning $F(x) \approx 0$ is much easier than learning $H(x) \approx x$.
The residual block comes in two forms:
- Basic Block (used in ResNet-18 and ResNet-34): Two 3x3 convolutional layers with a skip connection
- Bottleneck Block (used in ResNet-50, ResNet-101, ResNet-152): A 1x1 convolution to reduce channels, a 3x3 convolution, and a 1x1 convolution to restore channels
torch.manual_seed(42)
class BasicBlock(nn.Module):
"""ResNet Basic Block with skip connection.
Args:
in_channels: Number of input channels.
out_channels: Number of output channels.
stride: Stride for the first convolution. Defaults to 1.
"""
def __init__(
self, in_channels: int, out_channels: int, stride: int = 1
) -> None:
super().__init__()
self.conv1 = nn.Conv2d(
in_channels, out_channels, kernel_size=3,
stride=stride, padding=1, bias=False
)
self.bn1 = nn.BatchNorm2d(out_channels)
self.conv2 = nn.Conv2d(
out_channels, out_channels, kernel_size=3,
stride=1, padding=1, bias=False
)
self.bn2 = nn.BatchNorm2d(out_channels)
# Shortcut connection for dimension matching
self.shortcut = nn.Sequential()
if stride != 1 or in_channels != out_channels:
self.shortcut = nn.Sequential(
nn.Conv2d(
in_channels, out_channels, kernel_size=1,
stride=stride, bias=False
),
nn.BatchNorm2d(out_channels),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass with residual connection.
Args:
x: Input tensor.
Returns:
Output tensor after residual addition and ReLU.
"""
identity = self.shortcut(x)
out = F.relu(self.bn1(self.conv1(x)))
out = self.bn2(self.conv2(out))
out += identity # The residual connection
out = F.relu(out)
return out
class BottleneckBlock(nn.Module):
"""ResNet Bottleneck Block for deeper networks.
Uses 1x1 -> 3x3 -> 1x1 convolution pattern to reduce computation.
Args:
in_channels: Number of input channels.
bottleneck_channels: Number of channels in the bottleneck.
out_channels: Number of output channels.
stride: Stride for the 3x3 convolution. Defaults to 1.
"""
def __init__(
self,
in_channels: int,
bottleneck_channels: int,
out_channels: int,
stride: int = 1,
) -> None:
super().__init__()
self.conv1 = nn.Conv2d(
in_channels, bottleneck_channels, kernel_size=1, bias=False
)
self.bn1 = nn.BatchNorm2d(bottleneck_channels)
self.conv2 = nn.Conv2d(
bottleneck_channels, bottleneck_channels, kernel_size=3,
stride=stride, padding=1, bias=False
)
self.bn2 = nn.BatchNorm2d(bottleneck_channels)
self.conv3 = nn.Conv2d(
bottleneck_channels, out_channels, kernel_size=1, bias=False
)
self.bn3 = nn.BatchNorm2d(out_channels)
self.shortcut = nn.Sequential()
if stride != 1 or in_channels != out_channels:
self.shortcut = nn.Sequential(
nn.Conv2d(
in_channels, out_channels, kernel_size=1,
stride=stride, bias=False
),
nn.BatchNorm2d(out_channels),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass with bottleneck residual connection.
Args:
x: Input tensor.
Returns:
Output tensor after residual addition and ReLU.
"""
identity = self.shortcut(x)
out = F.relu(self.bn1(self.conv1(x)))
out = F.relu(self.bn2(self.conv2(out)))
out = self.bn3(self.conv3(out))
out += identity
out = F.relu(out)
return out
The impact of residual connections cannot be overstated. Before ResNet, training networks deeper than about 20 layers was extremely difficult. ResNet-152, with 152 layers, won ILSVRC 2015 with a 3.57% top-5 error rate -- surpassing human-level performance (estimated at about 5.1%). The residual connection idea has since been adopted far beyond CNNs, appearing in transformers (Chapter 19), recurrent networks, and virtually every modern deep architecture.
From the perspective of gradient flow (discussed in Chapter 13), the skip connection creates an "information highway" that allows gradients to flow directly from the loss function to early layers. During backpropagation, the gradient through a residual block is:
$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot \left(1 + \frac{\partial F(x)}{\partial x}\right)$$
The constant term of 1 ensures that gradients never vanish completely, even if $\frac{\partial F(x)}{\partial x}$ is small.
14.7 1x1 Convolutions
A 1x1 convolution might seem trivial -- after all, it looks at only a single pixel. But its power lies in the channel dimension. A 1x1 convolution with $C_{in}$ input channels and $C_{out}$ output channels performs a linear transformation across channels at each spatial position, independently:
$$\text{output}[c_{out}, i, j] = \sum_{c_{in}} W[c_{out}, c_{in}] \cdot \text{input}[c_{in}, i, j] + b[c_{out}]$$
This is equivalent to applying a fully connected layer to each pixel's channel vector independently. The uses of 1x1 convolutions include:
- Channel dimensionality reduction: Reducing the number of channels before an expensive 3x3 or 5x5 convolution (used in GoogLeNet's Inception modules and ResNet's bottleneck blocks)
- Channel dimensionality increase: Expanding the channel dimension (used in the expansion phase of bottleneck blocks)
- Adding nonlinearity: When followed by an activation function, a 1x1 convolution adds a nonlinear transformation without changing spatial dimensions
- Cross-channel interaction: Mixing information across channels, which is the key operation in architectures like Network in Network (Lin et al., 2013)
torch.manual_seed(42)
# Reduce 256 channels to 64 channels
reduce = nn.Conv2d(256, 64, kernel_size=1)
x = torch.randn(1, 256, 14, 14)
output = reduce(x)
print(f"Channel reduction: {x.shape} -> {output.shape}")
# [1, 256, 14, 14] -> [1, 64, 14, 14]
# Parameter comparison: 3x3 conv on 256 channels vs 1x1 reduce + 3x3 conv
direct_params = 256 * 256 * 3 * 3 # 589,824
bottleneck_params = 256 * 64 * 1 * 1 + 64 * 64 * 3 * 3 # 16,384 + 36,864 = 53,248
print(f"Direct 3x3: {direct_params:,} params")
print(f"Bottleneck: {bottleneck_params:,} params")
print(f"Reduction: {bottleneck_params / direct_params:.1%}")
The 1x1 convolution achieves an 11x reduction in parameters in this example -- a key insight exploited by GoogLeNet and ResNet.
14.8 Depthwise Separable Convolutions
Standard convolution jointly learns spatial and cross-channel patterns. Depthwise separable convolutions factorize this into two steps:
- Depthwise convolution: Apply a separate $k \times k$ filter to each input channel independently
- Pointwise convolution: Apply a 1x1 convolution to combine the outputs across channels
The parameter savings are dramatic. For a standard convolution with $C_{in}$ input channels, $C_{out}$ output channels, and kernel size $k$:
$$\text{Standard params} = C_{in} \times C_{out} \times k^2$$
For depthwise separable convolution:
$$\text{Separable params} = C_{in} \times k^2 + C_{in} \times C_{out}$$
The ratio of separable to standard parameters is:
$$\frac{C_{in} \times k^2 + C_{in} \times C_{out}}{C_{in} \times C_{out} \times k^2} = \frac{1}{C_{out}} + \frac{1}{k^2}$$
For $C_{out} = 256$ and $k = 3$, this is approximately $\frac{1}{256} + \frac{1}{9} \approx 0.115$ -- nearly a 9x reduction in parameters.
torch.manual_seed(42)
class DepthwiseSeparableConv(nn.Module):
"""Depthwise separable convolution block.
Factorizes a standard convolution into a depthwise convolution
(spatial filtering per channel) and a pointwise convolution
(channel mixing via 1x1 conv).
Args:
in_channels: Number of input channels.
out_channels: Number of output channels.
kernel_size: Size of the depthwise kernel. Defaults to 3.
stride: Stride for the depthwise convolution. Defaults to 1.
padding: Padding for the depthwise convolution. Defaults to 1.
"""
def __init__(
self,
in_channels: int,
out_channels: int,
kernel_size: int = 3,
stride: int = 1,
padding: int = 1,
) -> None:
super().__init__()
# Depthwise: groups=in_channels means each channel gets its own filter
self.depthwise = nn.Conv2d(
in_channels, in_channels, kernel_size=kernel_size,
stride=stride, padding=padding, groups=in_channels, bias=False,
)
self.pointwise = nn.Conv2d(
in_channels, out_channels, kernel_size=1, bias=False
)
self.bn1 = nn.BatchNorm2d(in_channels)
self.bn2 = nn.BatchNorm2d(out_channels)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass through depthwise separable convolution.
Args:
x: Input tensor.
Returns:
Output tensor after depthwise and pointwise convolutions.
"""
x = F.relu(self.bn1(self.depthwise(x)))
x = F.relu(self.bn2(self.pointwise(x)))
return x
# Compare parameter counts
standard_conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
dw_sep_conv = DepthwiseSeparableConv(64, 128)
standard_params = sum(p.numel() for p in standard_conv.parameters())
dw_sep_params = sum(p.numel() for p in dw_sep_conv.parameters())
print(f"Standard conv params: {standard_params:,}")
print(f"DW separable params: {dw_sep_params:,}")
print(f"Reduction factor: {standard_params / dw_sep_params:.1f}x")
Depthwise separable convolutions are the foundation of MobileNet (Howard et al., 2017), designed for deployment on mobile and edge devices. They represent a key instance of the efficiency vs. accuracy tradeoffs discussed in Chapter 6 on model selection.
14.9 Transfer Learning with Pretrained CNNs
14.9.1 Why Transfer Learning Works
Training a large CNN from scratch requires massive datasets (ImageNet has 1.2 million images) and significant compute resources. Transfer learning leverages the fact that features learned on one task often generalize to others. The lower layers of a CNN learn universal features -- edges, textures, color patterns -- that are useful for virtually any visual task. Higher layers learn increasingly task-specific features.
This concept connects directly to our discussion of representation learning in Chapter 10. A pretrained CNN provides a powerful feature extractor that encodes useful visual priors, similar to how pretrained word embeddings encode linguistic priors.
14.9.2 Transfer Learning Strategies
There are two main approaches:
Feature Extraction: Freeze all convolutional layers and only train a new classification head. The pretrained CNN acts as a fixed feature extractor.
import torchvision.models as models
torch.manual_seed(42)
# Load pretrained ResNet-18
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
# Freeze all parameters
for param in model.parameters():
param.requires_grad = False
# Replace the final fully connected layer
num_classes = 5 # For example, 5 flower categories
model.fc = nn.Linear(model.fc.in_features, num_classes)
# Only the new fc layer's parameters will be trained
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} / {total_params:,} "
f"({trainable_params / total_params:.1%})")
Fine-Tuning: Start with pretrained weights but allow some or all layers to update during training. Typically, earlier layers (which learn general features) are kept frozen while later layers are fine-tuned with a small learning rate.
torch.manual_seed(42)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
# Freeze early layers (layer1, layer2)
for name, param in model.named_parameters():
if "layer1" in name or "layer2" in name or "conv1" in name or "bn1" in name:
param.requires_grad = False
# Replace final layer
model.fc = nn.Linear(model.fc.in_features, 5)
# Use different learning rates for pretrained vs new layers
pretrained_params = [p for n, p in model.named_parameters()
if p.requires_grad and "fc" not in n]
new_params = [p for n, p in model.named_parameters()
if p.requires_grad and "fc" in n]
optimizer = torch.optim.Adam([
{"params": pretrained_params, "lr": 1e-4}, # Small LR for pretrained
{"params": new_params, "lr": 1e-3}, # Larger LR for new layers
])
14.9.3 When to Use Each Strategy
The choice between feature extraction and fine-tuning depends on:
- Dataset size: Small datasets favor feature extraction (less risk of overfitting). Large datasets can benefit from fine-tuning.
- Domain similarity: If your target domain is similar to ImageNet (natural images), feature extraction often works well. For very different domains (medical images, satellite imagery), fine-tuning deeper layers may be necessary.
- Compute budget: Feature extraction is much faster since you only train a few layers. As we discussed regarding computational tradeoffs in Chapter 7, this can be the deciding factor in resource-constrained settings.
A practical guideline:
| Scenario | Dataset Size | Domain Similarity | Strategy |
|---|---|---|---|
| A | Small | Similar | Feature extraction |
| B | Small | Different | Fine-tune later layers |
| C | Large | Similar | Fine-tune all layers |
| D | Large | Different | Fine-tune all layers (or retrain from scratch) |
14.9.4 Progressive Fine-Tuning
A more sophisticated approach to fine-tuning, sometimes called progressive unfreezing or gradual unfreezing, trains the model in stages:
- Stage 1: Freeze the entire pretrained backbone. Train only the new classification head for 5-10 epochs. This gives the new layers time to adapt to the pretrained features without disrupting them.
- Stage 2: Unfreeze the last one or two residual blocks. Fine-tune with a small learning rate (1/10 of the initial rate) for another 5-10 epochs. These later layers learn task-specific features.
- Stage 3: Optionally unfreeze the entire network. Fine-tune with an even smaller learning rate (1/100 of the initial rate) using a learning rate schedule like cosine annealing (Chapter 10).
This staged approach is more stable than unfreezing everything at once because it prevents the randomly initialized classification head from generating large gradients that corrupt the pretrained features. The technique is closely related to discriminative learning rates, where different layers receive different learning rates -- earlier (more general) layers get smaller learning rates than later (more task-specific) layers.
torch.manual_seed(42)
# Stage 1: Train head only
for param in model.parameters():
param.requires_grad = False
model.fc = nn.Linear(512, num_classes)
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ... train for 5-10 epochs ...
# Stage 2: Unfreeze last block
for param in model.layer4.parameters():
param.requires_grad = True
optimizer = torch.optim.Adam([
{"params": model.layer4.parameters(), "lr": 1e-4},
{"params": model.fc.parameters(), "lr": 1e-3},
])
# ... train for 5-10 more epochs ...
14.9.5 Data Preprocessing for Transfer Learning
When using pretrained models, you must apply the same preprocessing that was used during pretraining. For ImageNet-pretrained models, this means:
from torchvision import transforms
# Standard ImageNet preprocessing
imagenet_transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225],
),
])
These specific mean and standard deviation values come from the statistics of the ImageNet training set. Using different normalization would produce feature maps that the pretrained weights are not calibrated for, degrading performance. This is a common mistake that can lead to hours of wasted debugging time -- an issue we addressed in Chapter 11 on systematic debugging practices.
14.10 Visualizing Learned Features
Understanding what a CNN has learned is crucial for debugging, building trust in model predictions, and gaining scientific insight. Several visualization techniques have been developed.
14.10.1 Filter Visualization
The simplest approach is to directly visualize the learned filters. First-layer filters are interpretable because they operate on raw pixels:
import matplotlib.pyplot as plt
torch.manual_seed(42)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
# Get first convolutional layer filters
filters = model.conv1.weight.data.clone()
print(f"First layer filters shape: {filters.shape}") # [64, 3, 7, 7]
# Normalize for visualization
filters = filters - filters.min()
filters = filters / filters.max()
# Plot first 16 filters
fig, axes = plt.subplots(4, 4, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
if i < 16:
# Transpose from (C, H, W) to (H, W, C) for matplotlib
img = filters[i].permute(1, 2, 0).numpy()
ax.imshow(img)
ax.axis("off")
plt.suptitle("First Layer Filters of ResNet-18")
plt.tight_layout()
plt.savefig("first_layer_filters.png", dpi=150)
plt.close()
First-layer filters typically reveal edge detectors at various orientations, color-opponent filters, and low-frequency pattern detectors -- strikingly similar to the receptive fields of neurons in area V1 of the visual cortex.
14.10.2 Feature Map Visualization
We can visualize the feature maps (activations) produced by intermediate layers when processing a specific image:
import torch
from torchvision import transforms
from PIL import Image
torch.manual_seed(42)
def get_feature_maps(
model: nn.Module,
image: torch.Tensor,
layer_name: str,
) -> torch.Tensor:
"""Extract feature maps from a specific layer.
Args:
model: The CNN model.
image: Input image tensor of shape (1, C, H, W).
layer_name: Name of the layer to extract features from.
Returns:
Feature map tensor from the specified layer.
"""
features = {}
def hook_fn(module: nn.Module, input: tuple, output: torch.Tensor) -> None:
features["output"] = output.detach()
# Register hook on the target layer
for name, module in model.named_modules():
if name == layer_name:
handle = module.register_forward_hook(hook_fn)
break
# Forward pass
with torch.no_grad():
model(image)
handle.remove()
return features["output"]
14.10.3 Gradient-Based Visualization
Grad-CAM (Gradient-weighted Class Activation Mapping) produces a coarse localization map highlighting which regions of the input are important for a particular class prediction:
torch.manual_seed(42)
def grad_cam(
model: nn.Module,
image: torch.Tensor,
target_class: int,
target_layer: str,
) -> torch.Tensor:
"""Compute Grad-CAM heatmap for a target class.
Args:
model: The CNN model.
image: Input image tensor of shape (1, C, H, W).
target_class: Index of the target class.
target_layer: Name of the target convolutional layer.
Returns:
Heatmap tensor of shape (H, W).
"""
gradients = {}
activations = {}
def forward_hook(module: nn.Module, input: tuple, output: torch.Tensor) -> None:
activations["value"] = output
def backward_hook(module: nn.Module, grad_input: tuple, grad_output: tuple) -> None:
gradients["value"] = grad_output[0]
# Register hooks
for name, module in model.named_modules():
if name == target_layer:
fh = module.register_forward_hook(forward_hook)
bh = module.register_full_backward_hook(backward_hook)
break
# Forward pass
output = model(image)
model.zero_grad()
# Backward pass for target class
target = output[0, target_class]
target.backward()
# Compute Grad-CAM
weights = gradients["value"].mean(dim=[2, 3], keepdim=True) # GAP of gradients
cam = (weights * activations["value"]).sum(dim=1, keepdim=True)
cam = F.relu(cam) # Only positive contributions
cam = cam.squeeze()
# Normalize
cam = cam - cam.min()
cam = cam / (cam.max() + 1e-8)
fh.remove()
bh.remove()
return cam
Grad-CAM is invaluable for verifying that your model is "looking at the right thing." If a model trained to classify dogs is actually relying on the background (perhaps all dog images were taken outdoors), Grad-CAM will reveal this spurious correlation. This connects to our discussion of model interpretability in Chapter 9.
14.10.4 Activation Maximization
Activation maximization synthesizes an input image that maximally activates a particular neuron. Starting from random noise, we perform gradient ascent on the input to maximize the neuron's activation:
torch.manual_seed(42)
def activation_maximization(
model: nn.Module,
layer_name: str,
filter_index: int,
num_iterations: int = 200,
learning_rate: float = 0.1,
) -> torch.Tensor:
"""Generate an image that maximally activates a specific filter.
Args:
model: The CNN model (set to eval mode).
layer_name: Name of the target layer.
filter_index: Index of the filter to maximize.
num_iterations: Number of optimization steps. Defaults to 200.
learning_rate: Step size for gradient ascent. Defaults to 0.1.
Returns:
Optimized image tensor of shape (1, 3, 224, 224).
"""
model.eval()
# Start from random noise
image = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=learning_rate)
for _ in range(num_iterations):
optimizer.zero_grad()
# Get activations at target layer
activation = {}
def hook(module, input, output):
activation["value"] = output
for name, module in model.named_modules():
if name == layer_name:
handle = module.register_forward_hook(hook)
break
model(image)
handle.remove()
# Maximize the mean activation of the target filter
loss = -activation["value"][0, filter_index].mean()
loss.backward()
optimizer.step()
return image.detach()
These visualizations reveal that deeper layers learn increasingly complex and abstract patterns -- from simple Gabor-like filters in the first layer to complex textures, object parts, and full objects in deeper layers.
14.11 Batch Normalization in CNNs
We introduced batch normalization in Chapter 13 as a general technique for stabilizing training. In CNNs, batch normalization is applied slightly differently than in fully connected networks. For a feature map with shape $(N, C, H, W)$, batch normalization computes statistics over the $(N, H, W)$ dimensions for each channel $c$:
$$\hat{x}_{n,c,h,w} = \frac{x_{n,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}$$
where:
$$\mu_c = \frac{1}{N \cdot H \cdot W} \sum_{n,h,w} x_{n,c,h,w}$$
$$\sigma_c^2 = \frac{1}{N \cdot H \cdot W} \sum_{n,h,w} (x_{n,c,h,w} - \mu_c)^2$$
This means each channel has its own mean and variance, computed across all spatial positions and all samples in the mini-batch. The rationale is that each channel represents a different feature, so normalizing per-channel preserves the spatial structure within each feature map.
In PyTorch, nn.BatchNorm2d handles this automatically:
torch.manual_seed(42)
# Batch norm after convolution
conv_bn_relu = nn.Sequential(
nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),
nn.BatchNorm2d(128),
nn.ReLU(inplace=True),
)
Note that we set bias=False in the convolution when followed by batch normalization. The batch norm's learnable shift parameter $\beta$ serves the same role as the convolution bias, so including both would be redundant.
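The per-channel behavior is easy to verify by normalizing a tensor manually over the (N, H, W) dimensions and comparing with nn.BatchNorm2d (affine=False removes the learnable scale and shift so only the normalization remains):
import torch
import torch.nn as nn
torch.manual_seed(42)
x = torch.randn(8, 3, 16, 16)  # (N, C, H, W)
bn = nn.BatchNorm2d(3, affine=False)  # normalization only, no learnable scale/shift
out_bn = bn(x)  # training mode: uses batch statistics
# Manual normalization: statistics over the (N, H, W) dimensions, one pair per channel
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
out_manual = (x - mu) / torch.sqrt(var + bn.eps)
print(torch.allclose(out_bn, out_manual, atol=1e-5))  # True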
14.12 Modern CNN Design Principles
Looking across the evolution from LeNet to modern architectures, several design principles have emerged:
14.12.1 Progressive Downsampling
Spatial dimensions should decrease gradually while the number of channels increases. This preserves the total information capacity at each layer. A common pattern is to double the channels every time the spatial dimensions are halved.
14.12.2 Skip Connections Everywhere
After ResNet demonstrated their effectiveness, skip connections have become nearly universal. Even architectures that predate ResNet (like VGG) can be improved by adding skip connections. The principle extends to dense connections (DenseNet, Huang et al., 2017), where each layer receives input from all preceding layers.
14.12.3 Efficient Building Blocks
Modern efficient architectures combine depthwise separable convolutions, squeeze-and-excitation blocks (which learn channel-wise attention weights), and inverted residual blocks (MobileNetV2) to achieve strong performance with minimal computation.
14.12.4 EfficientNet and Compound Scaling
A breakthrough in CNN design came with the observation that scaling networks along a single dimension (depth, width, or resolution) yields diminishing returns. EfficientNet (Tan and Le, 2019) introduced compound scaling, which scales all three dimensions simultaneously using fixed ratios:
$$\text{depth:} \quad d = \alpha^\phi, \qquad \text{width:} \quad w = \beta^\phi, \qquad \text{resolution:} \quad r = \gamma^\phi$$
subject to the constraint $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ (so that total computation roughly doubles with each increment of $\phi$), where $\alpha$, $\beta$, and $\gamma$ are determined by a small grid search on the base model.
The intuition is straightforward: a deeper network needs wider layers to capture richer features at each level, and higher-resolution inputs require more layers to process the finer details. Scaling only one dimension creates bottlenecks. For instance, doubling depth without increasing width forces each layer to compress the same information through a narrower channel, limiting capacity. Compound scaling balances these dimensions.
The EfficientNet family ranges from EfficientNet-B0 (5.3M parameters, 77.1% ImageNet accuracy) to EfficientNet-B7 (66M parameters, 84.3% accuracy). Notably, EfficientNet-B0 matches ResNet-50's accuracy with roughly 5x fewer parameters, demonstrating the power of principled scaling combined with an efficient base architecture (built using depthwise separable convolutions and squeeze-and-excitation blocks).
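A few lines of arithmetic show how the compound-scaling rule plays out; the coefficients alpha=1.2, beta=1.1, gamma=1.15 are the values reported in the EfficientNet paper, and phi=0 corresponds to the B0 baseline:
# Compound scaling with the coefficients reported for EfficientNet:
# alpha = 1.2 (depth), beta = 1.1 (width), gamma = 1.15 (resolution)
alpha, beta, gamma = 1.2, 1.1, 1.15
for phi in range(4):  # phi = 0 is the B0 baseline
    depth_mult = alpha ** phi
    width_mult = beta ** phi
    res_mult = gamma ** phi
    flops_mult = depth_mult * width_mult ** 2 * res_mult ** 2  # roughly 2 ** phi
    print(f"phi={phi}: depth x{depth_mult:.2f}, width x{width_mult:.2f}, "
          f"resolution x{res_mult:.2f}, FLOPs x{flops_mult:.2f}")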
14.12.5 Neural Architecture Search
The design of CNN architectures has increasingly been automated through Neural Architecture Search (NAS) (Zoph and Le, 2017). NAS uses reinforcement learning or evolutionary algorithms to search over the space of possible architectures, discovering designs like EfficientNet (Tan and Le, 2019) that achieve state-of-the-art accuracy with fewer parameters than hand-designed networks.
14.12.6 The Role of Data Augmentation
CNN performance is strongly tied to the data augmentation strategy. Common augmentations for image classification include:
train_transform = transforms.Compose([
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
transforms.RandomRotation(15),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
More advanced techniques include Cutout (randomly masking rectangular patches), Mixup (blending pairs of images and their labels), and CutMix (replacing patches of one image with patches from another). These methods relate to the regularization strategies we covered in Chapter 13 and the data preprocessing techniques from Chapter 4.
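As an illustration, here is a minimal Mixup sketch (mixup_batch is our name): it blends a batch with a shuffled copy of itself and produces soft label targets, which require a loss that accepts probability targets (recent versions of nn.CrossEntropyLoss do):
import torch
import torch.nn.functional as F
torch.manual_seed(42)
def mixup_batch(
    images: torch.Tensor, labels: torch.Tensor, num_classes: int, alpha: float = 0.2
) -> tuple[torch.Tensor, torch.Tensor]:
    """Blend a batch with a shuffled copy of itself; labels become soft targets."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1 - lam) * one_hot[perm]
    return mixed_images, mixed_labels
images = torch.randn(4, 3, 32, 32)
labels = torch.tensor([0, 1, 2, 3])
mixed_x, mixed_y = mixup_batch(images, labels, num_classes=10)
print(mixed_x.shape, mixed_y.shape)  # torch.Size([4, 3, 32, 32]) torch.Size([4, 10])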
14.13 CNNs Beyond Image Classification
While we have focused on classification, CNNs are the backbone of many computer vision tasks:
14.13.1 Object Detection
Object detection requires both classification (what is in the image) and localization (where is it). The field has evolved through two paradigms:
Two-stage detectors like Faster R-CNN first generate region proposals (candidate bounding boxes likely to contain objects), then classify and refine each proposal. The CNN backbone extracts features, a Region Proposal Network (RPN) generates proposals, and a classification head produces final predictions. These models are accurate but relatively slow.
Single-stage detectors like YOLO (You Only Look Once) predict class probabilities and bounding box coordinates directly from the feature map in a single forward pass. YOLO divides the image into a grid and has each grid cell predict a fixed number of bounding boxes. This approach is significantly faster -- YOLOv5 can process images at over 100 frames per second on a modern GPU, making it suitable for real-time applications like autonomous driving and video surveillance. The trade-off is slightly lower accuracy on small objects, though recent versions (YOLOv8 and beyond) have largely closed this gap.
DETR (Detection Transformer, Carion et al., 2020) represents a paradigm shift: it treats object detection as a set prediction problem using a transformer decoder (as we will explore in Chapter 19). DETR removes the need for hand-designed components like anchor boxes and non-maximum suppression, replacing them with learned attention mechanisms. While DETR was initially slower to converge than traditional detectors, subsequent work (Deformable DETR, DINO) has addressed these limitations and achieved state-of-the-art results.
14.13.2 Semantic Segmentation
Semantic segmentation assigns a class label to every pixel in the image. This is fundamentally different from classification (one label per image) or detection (bounding boxes) -- it requires pixel-level precision.
U-Net (Ronneberger et al., 2015) introduced the encoder-decoder architecture with skip connections that has become the standard template for segmentation. The encoder (contracting path) progressively downsamples while extracting features, identical to a standard CNN. The decoder (expanding path) progressively upsamples using transposed convolutions, and crucially, skip connections concatenate feature maps from the encoder to the decoder at each resolution level. This allows the decoder to combine high-level semantic information from deep layers with fine-grained spatial information from shallow layers:
Encoder: Input -> 64 -> 128 -> 256 -> 512 -> 1024
| | | |
v v v v (skip connections)
Decoder: 1024 -> 512 -> 256 -> 128 -> 64 -> Output
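The sketch below is a toy two-level version of this idea (TinyUNet is our name, not the original architecture): one downsampling step, one upsampling step, and a single skip connection implemented by channel concatenation:
import torch
import torch.nn as nn
torch.manual_seed(42)
class TinyUNet(nn.Module):
    """A toy two-level U-Net-style model: one downsampling and one upsampling step."""
    def __init__(self, in_channels: int = 3, num_classes: int = 2) -> None:
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        # The decoder sees upsampled features concatenated with the encoder skip (64 + 64)
        self.dec = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)  # per-pixel class scores
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skip = self.enc1(x)                    # (N, 64, H, W)
        bottom = self.enc2(self.pool(skip))    # (N, 128, H/2, W/2)
        up = self.up(bottom)                   # (N, 64, H, W)
        merged = torch.cat([up, skip], dim=1)  # skip connection by concatenation
        return self.head(self.dec(merged))     # (N, num_classes, H, W)
model = TinyUNet()
print(model(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])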
DeepLab takes a different approach using atrous (dilated) convolutions to maintain spatial resolution while expanding the receptive field. DeepLabv3+ combines an encoder using atrous convolutions with a decoder module, and introduces Atrous Spatial Pyramid Pooling (ASPP) -- parallel dilated convolutions at multiple rates that capture multi-scale context. This addresses a fundamental challenge in segmentation: objects appear at many different scales, and the network must recognize both a small bird and a large building in the same image.
14.13.3 Image Generation
Convolutional architectures also power image generation through transposed convolutions (sometimes called deconvolutions). These upsampling operations are used in autoencoders (Chapter 17), GANs, and other generative models. A transposed convolution can be thought of as the "reverse" of a standard convolution, mapping from a smaller spatial dimension to a larger one.
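A minimal shape check with nn.ConvTranspose2d shows the upsampling behavior: with kernel size 4, stride 2, and padding 1, the spatial dimensions double:
import torch
import torch.nn as nn
torch.manual_seed(42)
# Output size per dimension: (H - 1) * stride - 2 * padding + kernel_size
upsample = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 128, 16, 16)
print(upsample(x).shape)  # torch.Size([1, 64, 32, 32])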
14.13.4 Beyond Vision
CNNs have been successfully applied to:
- Natural language processing: 1D convolutions over word embeddings (as we will see in Chapter 15)
- Audio processing: Spectrograms treated as 2D images, or 1D convolutions on raw waveforms
- Graph neural networks: Generalizing convolution to irregular graph structures
- Scientific computing: Processing grid-structured data like weather maps or molecular structures
14.14 CNNs vs. Vision Transformers
The rise of Vision Transformers (ViT) (Dosovitskiy et al., 2021) has raised the question: are CNNs obsolete? ViT divides images into patches (typically 16x16 pixels), treats each patch as a "token," and processes the sequence with a standard transformer encoder (as we will discuss in Chapter 19). The key differences are:
| Aspect | CNNs | Vision Transformers |
|---|---|---|
| Inductive bias | Local connectivity, translation equivariance | Minimal (global attention from layer 1) |
| Data efficiency | Better with small datasets | Requires large datasets or strong augmentation |
| Computational scaling | Fixed receptive field growth | Quadratic in sequence length (number of patches) |
| Feature locality | Inherently local, gradually building global context | Global attention from the first layer |
| Inference speed | Generally faster for similar accuracy | Slower due to attention computation |
| Interpretability | Feature maps are spatially aligned | Attention maps provide some interpretability |
One striking difference is data efficiency: CNNs learn spatial priors through their architecture (local connectivity, weight sharing), while ViTs must learn these patterns entirely from data. The original ViT paper reported that ViT only outperformed CNNs when pretrained on datasets of 14 million or more images. On smaller datasets, the CNN's built-in inductive biases give it a significant advantage.
The practical verdict as of 2025: hybrid approaches win. ConvNeXt (Liu et al., 2022) showed that a CNN designed using transformer-inspired principles (larger kernels, fewer normalization layers, GELU activations) can match ViT performance. Meanwhile, architectures like Swin Transformer add local windowed attention that mimics CNN-style locality. For most practitioners, the choice between CNN and ViT depends on dataset size, deployment constraints, and available pretrained models rather than fundamental architectural superiority.
Practical Tip: If you have fewer than 10,000 labeled images, start with a pretrained CNN (ResNet or EfficientNet) and fine-tune. If you have millions of images or access to a pretrained ViT, transformers may offer better performance. For real-time applications on edge devices, efficient CNNs (MobileNet, EfficientNet-Lite) remain the go-to choice.
14.15 Putting It All Together: A Complete Training Pipeline
Let us combine everything we have learned into a complete, production-quality training pipeline. This brings together concepts from loss functions (Chapter 8), optimization (Chapter 12), and regularization (Chapter 13):
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models
from typing import Tuple
torch.manual_seed(42)
def create_data_loaders(
batch_size: int = 32,
num_workers: int = 4,
) -> Tuple[DataLoader, DataLoader]:
"""Create train and validation data loaders for CIFAR-10.
Args:
batch_size: Number of samples per batch. Defaults to 32.
num_workers: Number of data loading workers. Defaults to 4.
Returns:
Tuple of (train_loader, val_loader).
"""
train_transform = transforms.Compose([
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
val_transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_dataset = datasets.CIFAR10(
root="./data", train=True, download=True, transform=train_transform
)
val_dataset = datasets.CIFAR10(
root="./data", train=False, download=True, transform=val_transform
)
train_loader = DataLoader(
train_dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers
)
val_loader = DataLoader(
val_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers
)
return train_loader, val_loader
def train_one_epoch(
model: nn.Module,
train_loader: DataLoader,
criterion: nn.Module,
optimizer: optim.Optimizer,
device: torch.device,
) -> float:
"""Train the model for one epoch.
Args:
model: The CNN model.
train_loader: Training data loader.
criterion: Loss function.
optimizer: Optimizer.
device: Device to train on.
Returns:
Average training loss for the epoch.
"""
model.train()
running_loss = 0.0
for inputs, targets in train_loader:
inputs, targets = inputs.to(device), targets.to(device)
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
running_loss += loss.item() * inputs.size(0)
return running_loss / len(train_loader.dataset)
@torch.no_grad()
def evaluate(
model: nn.Module,
val_loader: DataLoader,
criterion: nn.Module,
device: torch.device,
) -> Tuple[float, float]:
"""Evaluate the model on the validation set.
Args:
model: The CNN model.
val_loader: Validation data loader.
criterion: Loss function.
device: Device to evaluate on.
Returns:
Tuple of (average loss, accuracy).
"""
model.eval()
running_loss = 0.0
correct = 0
total = 0
for inputs, targets in val_loader:
inputs, targets = inputs.to(device), targets.to(device)
outputs = model(inputs)
loss = criterion(outputs, targets)
running_loss += loss.item() * inputs.size(0)
_, predicted = outputs.max(1)
total += targets.size(0)
correct += predicted.eq(targets).sum().item()
avg_loss = running_loss / len(val_loader.dataset)
accuracy = correct / total
return avg_loss, accuracy
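The functions above define the pieces; the driver below ties them together. It is a minimal sketch rather than a canonical recipe: the ResNet-18 backbone, the SGD hyperparameters, the cosine schedule, and the checkpoint filename are illustrative choices, and the code reuses the imports and functions defined above:

```python
def main(num_epochs: int = 20) -> None:
    """Train a small CNN on CIFAR-10 using the helpers defined above."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    train_loader, val_loader = create_data_loaders()

    # Adapt a standard ResNet-18 head to the 10 CIFAR-10 classes.
    model = models.resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, 10)
    model = model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

    best_acc = 0.0
    for epoch in range(num_epochs):
        train_loss = train_one_epoch(model, train_loader, criterion, optimizer, device)
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)
        scheduler.step()

        # Keep the checkpoint with the best validation accuracy.
        if val_acc > best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), "best_model.pt")

        print(f"epoch {epoch + 1:02d} | train loss {train_loss:.4f} | "
              f"val loss {val_loss:.4f} | val acc {val_acc:.4f}")

if __name__ == "__main__":
    main()
```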
This pipeline incorporates best practices we have discussed throughout the book: proper train/validation splitting (Chapter 5), data augmentation for regularization (Chapter 4), separate transforms for training and evaluation, and the use of torch.no_grad() during evaluation for memory efficiency.
14.16 Common Pitfalls and Debugging Tips
Drawing on our discussion of debugging in Chapter 11, here are CNN-specific issues to watch for:
- Shape mismatches: The most common CNN bug. Always compute expected dimensions before running the network, and print shapes at each layer during development.
- Forgetting to freeze batch norm: When fine-tuning, batch norm layers continue to update their running statistics by default. Call model.eval() during validation, and consider freezing batch norm layers when fine-tuning with small batches (see the sketch after this list).
- Wrong normalization: Using ImageNet normalization for data that was not preprocessed for ImageNet will degrade performance. Always match the normalization to the pretrained model.
- Learning rate too high for fine-tuning: Pretrained weights are already close to a good solution, and high learning rates will destroy them. Start with learning rates 10-100x smaller than you would use for training from scratch.
- Not using data augmentation: CNNs are data-hungry. Without augmentation, they will overfit quickly on small datasets, even with transfer learning.
- Ignoring class imbalance: For imbalanced datasets, use weighted loss functions (Chapter 8) or oversampling strategies. Accuracy can be misleading when classes are imbalanced, as we discussed in Chapter 9 on evaluation metrics.
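For the batch-norm pitfall above, the helper below is a minimal sketch (the function name is ours, not a PyTorch API) of how to freeze every batch-norm layer in a model before fine-tuning:

```python
import torch.nn as nn

def freeze_batch_norm(model: nn.Module) -> None:
    """Put all batch-norm layers in eval mode and stop updating their parameters."""
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()                       # stop updating running mean/variance
            for param in module.parameters():   # freeze the learned scale and shift
                param.requires_grad = False
```

One caveat: calling model.train() at the start of each epoch switches these modules back to training mode, so apply the helper again after model.train() (or override the model's train() method) if you want the statistics to stay frozen.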
14.17 Summary
Convolutional Neural Networks represent one of the great success stories of deep learning. By embedding the inductive biases of local connectivity and weight sharing, they achieve remarkable efficiency on spatially structured data. The evolution from LeNet's 60,000 parameters to ResNet's skip connections and MobileNet's depthwise separable convolutions illustrates how architectural innovation -- not just more data and compute -- drives progress in deep learning.
Key concepts from this chapter:
- The convolution operation performs local, weight-shared feature extraction
- Stride and padding control the spatial dimensions of feature maps
- Pooling layers provide spatial downsampling and a degree of translation invariance
- Receptive fields grow with depth, enabling hierarchical feature detection
- Landmark architectures (LeNet, AlexNet, VGG, ResNet) each introduced design principles still in use today
- Skip connections solved the degradation problem and enabled networks with hundreds of layers
- 1x1 convolutions enable efficient channel mixing and dimensionality reduction
- Depthwise separable convolutions factorize spatial and channel-wise operations for efficiency
- Transfer learning leverages pretrained features for new tasks with limited data
- Visualization techniques (filter visualization, Grad-CAM, activation maximization) provide interpretability
Decision Framework for CNN Architecture Selection
When choosing a CNN architecture for a new project, use this practical guide:
Is inference speed critical (real-time, edge deployment)?
├── Yes → MobileNetV3 or EfficientNet-Lite
└── No → Is your dataset large (>100K images)?
    ├── Yes → EfficientNet-B4/B5 or ConvNeXt-Base
    └── No → Are pretrained weights available for your domain?
        ├── Yes → ResNet-50 or EfficientNet-B0 with fine-tuning
        └── No → Start with ResNet-18, add capacity as needed
For segmentation tasks, U-Net remains the default starting point, especially in medical imaging, where it was originally developed. For object detection, single-stage YOLO-family detectors (such as YOLOv8) offer a strong speed-accuracy trade-off for real-time applications, while two-stage detectors like Faster R-CNN with a ResNet or EfficientNet backbone typically deliver higher accuracy when speed is not the primary constraint.
Historical Perspective
The development of CNNs illustrates a recurring pattern in AI research: biological inspiration leading to mathematical formalization leading to engineering innovation. Hubel and Wiesel's Nobel Prize-winning experiments on the cat visual cortex in the 1960s revealed the hierarchical structure of visual processing. Fukushima's Neocognitron (1980) translated this insight into a computational model. LeCun refined it into the trainable CNN architecture. And decades of engineering innovation -- from AlexNet's GPU training to ResNet's skip connections to EfficientNet's compound scaling -- have transformed this biological inspiration into the practical tools that power computer vision today.
In Chapter 15, we will explore recurrent neural networks, which extend the idea of weight sharing from the spatial domain to the temporal domain, processing sequences of arbitrary length with a fixed set of parameters. The skip connection principle from ResNet will reappear in a new form when we study LSTM gating mechanisms and their role in preserving long-range information through sequences.