
Chapter 8: Convolutional Neural Networks — Architecture, Intuition, and Computer Vision Applications

"We can hope that machines will eventually compete with humans in all purely intellectual fields. But which are the best ones to start with? Even if we restrict ourselves to the theoretical aspects, some problems seem more suited to mechanized treatment than others. I suggest that a fruitful approach is to investigate the use of machines for solving problems connected with geometry, topology, and the visual aspects of pattern recognition." — Alan Turing (1948), quoted in B. Jack Copeland, The Essential Turing


Learning Objectives

By the end of this chapter, you will be able to:

  1. Derive the convolution operation from a constrained fully connected layer and explain why weight sharing and locality are useful inductive biases
  2. Trace the evolution of CNN architectures (LeNet → AlexNet → VGG → ResNet → EfficientNet) and identify the key innovation at each stage
  3. Implement a CNN in PyTorch for image classification with proper data augmentation
  4. Explain residual connections and their impact on gradient flow and trainability
  5. Apply CNNs beyond images: 1D convolutions for time series and text processing

8.1 Why Convolutional Neural Networks?

In Chapter 6, we built a multilayer perceptron (MLP) from scratch and trained it on structured data. In Chapter 7, we learned the craft of training deep networks — initialization, normalization, regularization, and learning rate schedules. The MLP is a powerful universal approximator, but it has a fatal weakness for spatial data: it has no concept of structure.

Consider a grayscale image of size $224 \times 224$. Flattened into a vector, this is $224 \times 224 = 50{,}176$ input features. If the first hidden layer has 4,096 neurons (modest by modern standards), the weight matrix for this single layer has $50{,}176 \times 4{,}096 = 205{,}520{,}896$ parameters — over 200 million, just for the first layer. This is computationally absurd, but the deeper problem is statistical: with 200 million parameters and perhaps 50,000 training images, the model will memorize the training set long before it learns any generalizable pattern.

The problem is not merely computational cost. The problem is that the MLP treats pixel (0, 0) and pixel (223, 223) with the same relational structure as pixel (112, 112) and pixel (112, 113). It has no notion that nearby pixels are more related than distant pixels. It has no concept that a cat in the upper-left corner of an image is the same cat when shifted to the lower-right corner. Every spatial relationship must be learned from data, and the data budget is never large enough.

Convolutional neural networks solve this by encoding two inductive biases directly into the architecture:

  1. Locality. Each neuron connects only to a small spatial region of the input (the receptive field), not to the entire input.
  2. Weight sharing. The same set of weights (the kernel or filter) is applied at every spatial location, so a feature detector learned at one position automatically generalizes to all positions.

These two constraints are not arbitrary. They reflect the statistical structure of natural images — and, as we will see, of any data with local spatial or temporal structure.

Understanding Why: The convolutional layer is not a new kind of computation — it is a fully connected layer with most weights set to zero and the remaining weights tied across positions. Understanding this equivalence is the key to understanding why CNNs work, when they fail, and how to extend them.


8.2 Deriving Convolution from Constrained Fully Connected Layers

The Fully Connected Baseline

Consider a 1D input signal $\mathbf{x} \in \mathbb{R}^n$ and a layer that produces one output $y_j$ at each position $j$. In a fully connected layer:

$$y_j = \sum_{i=0}^{n-1} w_{ji} x_i + b_j$$

The weight matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$ is dense: every output position connects to every input position, and each connection has its own independent weight. For $m$ output positions, we have $mn$ parameters.

Constraint 1: Locality

We impose the constraint that output position $j$ depends only on input positions within a window of size $k$ centered at $j$:

$$y_j = \sum_{i=j-\lfloor k/2 \rfloor}^{j+\lfloor k/2 \rfloor} w_{ji} x_i + b_j$$

This makes $\mathbf{W}$ a banded matrix. The parameter count drops from $O(mn)$ to $O(mk)$, where $k \ll n$. But each position still has its own local weights.

Constraint 2: Weight Sharing (Translation Equivariance)

We impose the additional constraint that the local weights are the same at every position:

$$w_{ji} = w_{\,i - j + \lfloor k/2 \rfloor}$$

Now there is a single set of $k$ weights — the kernel $\mathbf{w} = [w_0, w_1, \ldots, w_{k-1}]$ — shared across all positions. The output becomes:

$$y_j = \sum_{t=0}^{k-1} w_t \cdot x_{j+t-\lfloor k/2 \rfloor} + b$$

This is exactly the discrete cross-correlation operation (which deep learning frameworks call "convolution"):

$$y_j = (\mathbf{w} \star \mathbf{x})_j = \sum_{t=0}^{k-1} w_t \cdot x_{j+t-\lfloor k/2 \rfloor}$$

The parameter count is now $O(k)$ — independent of the input size. A $3 \times 3$ kernel on a $1000 \times 1000$ image has 9 parameters, regardless of the million input pixels.

Mathematical Foundation: In signal processing, true convolution flips the kernel before sliding: $(w * x)_j = \sum_t w_t \cdot x_{j - t}$. Cross-correlation does not flip: $(w \star x)_j = \sum_t w_t \cdot x_{j + t}$. Since the kernel weights are learned, flipping is irrelevant — the network simply learns the flipped version. All major deep learning frameworks implement cross-correlation and call it conv. This is a harmless notational inconsistency, but it is important to know when reading signal processing literature.

Visualizing the Constraint

The weight matrix of a convolutional layer is a sparse, banded Toeplitz matrix. For a 1D input of length 5 and kernel $[w_0, w_1, w_2]$ with stride 1 and no padding:

$$\mathbf{W}_{\text{conv}} = \begin{bmatrix} w_0 & w_1 & w_2 & 0 & 0 \\ 0 & w_0 & w_1 & w_2 & 0 \\ 0 & 0 & w_0 & w_1 & w_2 \end{bmatrix}$$

Compare this to the dense $\mathbf{W}_{\text{FC}} \in \mathbb{R}^{3 \times 5}$, which would have 15 independent parameters. The convolutional version has 3. The sparsity pattern encodes locality; the repeated values encode weight sharing.

Implementation: Convolution as Matrix Multiplication

import numpy as np

def conv1d_as_matmul(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Implement 1D convolution by constructing the Toeplitz weight matrix.

    This demonstrates that convolution IS a constrained linear operation.

    Args:
        x: Input signal of shape (n,).
        kernel: Convolution kernel of shape (k,).

    Returns:
        Output signal of shape (n - k + 1,) — valid convolution.
    """
    n = len(x)
    k = len(kernel)
    out_len = n - k + 1

    # Build the Toeplitz (banded, weight-shared) matrix
    W = np.zeros((out_len, n))
    for j in range(out_len):
        W[j, j:j+k] = kernel

    return W @ x


# Verify equivalence with direct computation
x = np.array([1.0, 3.0, 2.0, 4.0, 1.0, 5.0, 2.0])
kernel = np.array([0.5, 1.0, 0.5])

matmul_result = conv1d_as_matmul(x, kernel)
direct_result = np.correlate(x, kernel, mode='valid')
print(f"Matrix multiply: {matmul_result}")
print(f"Direct correlate: {direct_result}")
print(f"Equal: {np.allclose(matmul_result, direct_result)}")
Matrix multiply: [4.5 5.5 5.5 5.5 6.5]
Direct correlate: [4.5 5.5 5.5 5.5 6.5]
Equal: True

Parameter Reduction: The Numbers

| Architecture | Input Size | First-Layer Parameters |
| --- | --- | --- |
| Fully connected | $224 \times 224 \times 3 = 150{,}528$ | $150{,}528 \times 4{,}096 = 616{,}562{,}688$ |
| Conv ($3 \times 3$, 64 filters) | $224 \times 224 \times 3$ | $(3 \times 3 \times 3 + 1) \times 64 = 1{,}792$ |
| Reduction factor | | $\approx 344{,}000\times$ fewer parameters |

This is not merely a computational trick. The dramatic parameter reduction is a form of regularization: the model cannot memorize spatial configurations because it is forced to learn local, translationally equivariant features.


8.3 The Mechanics of 2D Convolution

Extending to Two Dimensions

Natural images have two spatial dimensions plus a channel dimension (RGB). A 2D convolutional layer operates with kernels of shape $(C_{\text{in}}, k_h, k_w)$, where $C_{\text{in}}$ is the number of input channels and $k_h, k_w$ are the spatial kernel dimensions (typically $k_h = k_w = k$). With $C_{\text{out}}$ output channels (filters), the full weight tensor is $\mathbf{W} \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times k_h \times k_w}$.

For input tensor $\mathbf{X} \in \mathbb{R}^{C_{\text{in}} \times H \times W}$, the output feature map at channel $c$ and spatial position $(i, j)$ is:

$$Y_{c,i,j} = b_c + \sum_{c'=0}^{C_{\text{in}}-1} \sum_{p=0}^{k_h-1} \sum_{q=0}^{k_w-1} W_{c,c',p,q} \cdot X_{c', i \cdot s + p, j \cdot s + q}$$

where $s$ is the stride and $b_c$ is the bias for output channel $c$.

Output Dimensions

For input spatial size $H \times W$, kernel size $k$, stride $s$, and padding $p$:

$$H_{\text{out}} = \left\lfloor \frac{H + 2p - k}{s} \right\rfloor + 1, \qquad W_{\text{out}} = \left\lfloor \frac{W + 2p - k}{s} \right\rfloor + 1$$

Three common padding strategies:

  • Valid (no padding): $p = 0$. Output shrinks by $k - 1$ in each dimension.
  • Same padding: $p = \lfloor k/2 \rfloor$ (for odd $k$ and stride 1). Output has the same spatial dimensions as the input.
  • Full padding: $p = k - 1$. Output is larger than the input. Rarely used in deep learning.
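These formulas are easy to sanity-check in code. The helper below is a hypothetical utility (not from any framework) that evaluates the output-size formula for the three padding strategies:

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size: floor((H + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Valid: a 3x3 kernel shrinks 224 by k - 1 = 2
print(conv_output_size(224, kernel=3))                        # 222
# Same: p = k // 2 preserves the size at stride 1
print(conv_output_size(224, kernel=3, padding=1))             # 224
# Strided: stride 2 with same padding halves the spatial size
print(conv_output_size(224, kernel=3, stride=2, padding=1))   # 112
```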

Stride

Stride $s > 1$ subsamples the output, reducing spatial dimensions by a factor of $s$. This is an alternative to pooling for spatial downsampling and is now preferred in many modern architectures (the "strided convolution" approach).

Feature Maps: What Does Each Channel Represent?

Each output channel corresponds to one learned filter. In early layers, filters learn simple patterns: edges, color gradients, textures. In deeper layers, filters learn complex compositions: parts of objects, object prototypes, scene layouts.

import torch
import torch.nn as nn

# A single convolutional layer
conv = nn.Conv2d(
    in_channels=3,      # RGB input
    out_channels=64,     # 64 learned filters
    kernel_size=3,       # 3x3 spatial kernels
    stride=1,
    padding=1            # Same padding
)

# Inspect the weight shape
print(f"Weight shape: {conv.weight.shape}")  # (64, 3, 3, 3)
print(f"Bias shape:   {conv.bias.shape}")    # (64,)
print(f"Parameters:   {conv.weight.numel() + conv.bias.numel()}")  # 1,792

# Forward pass on a batch of 8 images
x = torch.randn(8, 3, 224, 224)
y = conv(x)
print(f"Input shape:  {x.shape}")   # (8, 3, 224, 224)
print(f"Output shape: {y.shape}")   # (8, 64, 224, 224) — same spatial, 64 channels
Weight shape: torch.Size([64, 3, 3, 3])
Bias shape:   torch.Size([64])
Parameters:   1792
Input shape:  torch.Size([8, 3, 224, 224])
Output shape: torch.Size([8, 64, 224, 224])

8.4 Translation Equivariance and Invariance

Two related but distinct properties make CNNs powerful for spatial data.

Translation Equivariance

A function $f$ is translation equivariant if shifting the input shifts the output by the same amount:

$$f(T_{\Delta}[\mathbf{x}]) = T_{\Delta}[f(\mathbf{x})]$$

where $T_{\Delta}$ is a translation operator. Convolution is translation equivariant: if a cat appears 10 pixels to the right in the input, the feature map response shifts 10 pixels to the right. The features do not change — only their positions.

This is a direct consequence of weight sharing. Because the same kernel is applied at every spatial position, the output at position $(i, j)$ depends only on the local patch centered at $(i, j)$, not on the absolute position.
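Equivariance can be verified numerically. With zero padding the property holds only approximately near the image border, so this sketch uses circular padding, for which the shift commutes exactly with the convolution:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Circular padding makes the convolution exactly shift-equivariant
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular')

x = torch.randn(1, 1, 16, 16)
shift = dict(shifts=(3, 5), dims=(2, 3))    # translate by (3, 5) pixels

lhs = conv(torch.roll(x, **shift))          # shift, then convolve
rhs = torch.roll(conv(x), **shift)          # convolve, then shift
print(torch.allclose(lhs, rhs, atol=1e-6))  # True
```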

Translation Invariance

A function $f$ is translation invariant if shifting the input does not change the output:

$$f(T_{\Delta}[\mathbf{x}]) = f(\mathbf{x})$$

Convolution alone is equivariant, not invariant. Classification requires invariance — we want to classify "cat" regardless of where the cat appears. Invariance is achieved by combining convolution with pooling and, ultimately, with global average pooling or a fully connected classifier that aggregates over all spatial positions.

Common Misconception: "CNNs are translation invariant." This is imprecise. Individual convolutional layers are translation equivariant. The full CNN achieves approximate translation invariance through pooling and spatial downsampling. The distinction matters: equivariance preserves spatial information (needed for detection and segmentation), while invariance discards it (needed for classification).


8.5 Pooling

Pooling layers reduce spatial dimensions while introducing a degree of local translation invariance.

Max Pooling

$$y_{i,j} = \max_{(p,q) \in \mathcal{R}_{i,j}} x_{p,q}$$

where $\mathcal{R}_{i,j}$ is the pooling region (typically $2 \times 2$) at output position $(i, j)$. Max pooling selects the strongest activation in each region, providing invariance to small spatial shifts within the pooling window.

Average Pooling

$$y_{i,j} = \frac{1}{|\mathcal{R}_{i,j}|} \sum_{(p,q) \in \mathcal{R}_{i,j}} x_{p,q}$$

Average pooling computes the mean activation. It preserves more information than max pooling but provides weaker invariance.

Global Average Pooling (GAP)

$$y_c = \frac{1}{H \times W} \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} x_{c,i,j}$$

Global average pooling reduces each channel to a single scalar by averaging over the entire spatial extent. Introduced by Lin et al. (2014), GAP replaces the large fully connected layers at the end of a CNN, dramatically reducing parameters and overfitting. It is now standard in most modern architectures.
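In PyTorch, GAP is typically implemented with `nn.AdaptiveAvgPool2d(1)`; a quick sketch confirms it is just a mean over the spatial dimensions:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 512, 7, 7)    # a typical final feature map
gap = nn.AdaptiveAvgPool2d(1)    # pool each channel down to 1x1
pooled = gap(x).flatten(1)       # shape (8, 512): one scalar per channel
print(pooled.shape)

# Equivalent to averaging over height and width directly
print(torch.allclose(pooled, x.mean(dim=(2, 3)), atol=1e-6))  # True
```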

The Receptive Field

The receptive field of a neuron in layer $l$ is the region of the input image that affects its activation. For a stack of $L$ convolutional layers, each with kernel size $k$ and stride 1:

$$R_L = 1 + L(k - 1)$$

With strides and pooling, the receptive field grows faster. For stride $s$ at each layer:

$$R_L = 1 + \sum_{l=1}^{L} (k_l - 1) \prod_{i=1}^{l-1} s_i$$

The receptive field determines what spatial context each neuron can access. Early layers have small receptive fields (local edges); deep layers have large receptive fields (global structure). A well-designed CNN ensures that the receptive field at the final feature map covers the entire input.
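The general formula translates into a short helper (hypothetical, for illustration) that tracks the cumulative stride ("jump") layer by layer:

```python
def receptive_field(layers: list) -> int:
    """Receptive field of a conv stack, given (kernel_size, stride) per layer."""
    rf, jump = 1, 1          # jump = product of strides of earlier layers
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Three stride-1 3x3 convs: 1 + 3 * (3 - 1) = 7
print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # 7
# Two VGG-style blocks (conv3, conv3, pool2): the field grows to 16
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]))  # 16
```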


8.6 Architecture Evolution: Principles Through History

The history of CNN architectures is not merely a chronicle of increasing accuracy on ImageNet. Each milestone introduced a principle that transcends the specific architecture and applies to deep learning broadly. We trace six landmarks, extracting the lasting lesson from each.

LeNet-5 (LeCun et al., 1998)

Architecture: Two convolutional layers (5x5 kernels), two average pooling layers, three fully connected layers. Total: ~60,000 parameters. Trained for handwritten digit recognition (MNIST).

The principle: Learned local feature detectors, composed hierarchically, outperform handcrafted features. LeNet showed that gradient-based learning could discover edge detectors, corner detectors, and stroke patterns — features that computer vision researchers had been designing by hand for decades.

AlexNet (Krizhevsky, Sutskever & Hinton, 2012)

Architecture: Five convolutional layers, three fully connected layers, ReLU activations, dropout, local response normalization. Total: ~62 million parameters. Trained on ImageNet (1.2M images, 1000 classes) across two GPUs.

The principle: Scale + ReLU + GPU = qualitative breakthrough. AlexNet won the 2012 ImageNet challenge by a massive margin (top-5 error: 15.3% vs. 26.2% for the runner-up), demonstrating that the ideas from the 1990s worked dramatically better with more data, more compute, and ReLU activations. The switch from sigmoid/tanh to ReLU was critical — it largely solved the vanishing gradient problem for feed-forward networks (as discussed in Chapter 7).

VGGNet (Simonyan & Zisserman, 2014)

Architecture: 16 or 19 layers of exclusively $3 \times 3$ convolutions with max pooling. Total: ~138 million parameters.

The principle: Depth through simplicity. VGG showed that a stack of small ($3 \times 3$) kernels achieves the same receptive field as fewer large kernels, with fewer parameters and more nonlinearities. Two stacked $3 \times 3$ layers have the same receptive field ($5 \times 5$) as a single $5 \times 5$ layer, but with $2 \times (3^2) = 18$ parameters vs. $5^2 = 25$, and two ReLU nonlinearities instead of one.

More formally, a stack of $L$ layers of $3 \times 3$ convolutions has receptive field $1 + 2L$ (from our formula above), so:

  • $L = 2$: receptive field 5 (matches a $5 \times 5$ kernel)
  • $L = 3$: receptive field 7 (matches a $7 \times 7$ kernel)

The parameter ratio scales as $L \cdot 3^2 C^2$ vs. $(2L+1)^2 C^2$ for $C$ channels, which favors the stacked $3 \times 3$ approach for $L \geq 2$.
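Counting parameters with PyTorch confirms the arithmetic (for an assumed $C = 256$ channels in and out):

```python
import torch.nn as nn

C = 256  # assumed channel count, in and out

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

# Two stacked 3x3 convs: same 5x5 receptive field as one 5x5 conv
stacked = nn.Sequential(
    nn.Conv2d(C, C, 3, padding=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(C, C, 3, padding=1, bias=False),
)
single = nn.Conv2d(C, C, 5, padding=2, bias=False)

print(f"{n_params(stacked):,}")  # 1,179,648  (2 * 9 * C^2)
print(f"{n_params(single):,}")   # 1,638,400  (25 * C^2)
```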

The Depth Crisis: Why Deeper Is Not Always Better

VGG's principle — deeper is better — hit a wall. When researchers tried training networks with 50, 100, or 1,000 layers, performance degraded. This was not merely the vanishing gradient problem (which batch normalization largely addressed). Even with batch normalization, a 56-layer network performed worse than a 20-layer network on both training and test sets.

The problem is optimization, not capacity. A 56-layer network is strictly more expressive than a 20-layer one (it can represent any function the 20-layer network can, by setting extra layers to identity mappings). Yet gradient-based optimization cannot find those solutions. The loss landscape of deep plain networks has pathological curvature that makes optimization increasingly difficult with depth.

ResNet (He, Zhang, Ren & Sun, 2015)

Architecture: Residual connections that skip one or more layers. Networks of 50, 101, or 152 layers — dramatically deeper than anything before.

The key insight: Instead of learning the desired mapping $\mathcal{H}(\mathbf{x})$ directly, learn the residual $\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}$ and compute:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$$

If the optimal mapping is close to the identity (i.e., the layer should mostly pass information through), learning $\mathcal{F}(\mathbf{x}) \approx \mathbf{0}$ is much easier than learning $\mathcal{H}(\mathbf{x}) \approx \mathbf{x}$ — pushing weights toward zero is what regularization does naturally.

The principle: Residual connections create gradient highways that enable arbitrary depth. We derive this rigorously in Section 8.7.

EfficientNet (Tan & Le, 2019)

Architecture: Compound scaling of depth, width, and resolution using a principled search over scaling coefficients.

The principle: Depth, width, and resolution must be scaled together. Prior work scaled one dimension at a time (deeper ResNets, wider WideResNets, higher-resolution inputs). Tan and Le showed that these dimensions are interdependent and proposed compound scaling:

$$d = \alpha^\phi, \quad w = \beta^\phi, \quad r = \gamma^\phi$$

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ (so that total FLOPs scale as $2^\phi$), where:

  • $d$ is the depth multiplier
  • $w$ is the width multiplier
  • $r$ is the resolution multiplier
  • $\phi$ is a user-specified compound coefficient that controls the total compute budget
  • $\alpha, \beta, \gamma$ are found by grid search on a small baseline network

The constraint $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ arises because depth scales FLOPs linearly while width and resolution scale FLOPs quadratically (width affects both input and output channels; resolution affects both spatial dimensions). Setting $\phi = 1$ and searching over $\alpha, \beta, \gamma$ on EfficientNet-B0 yielded $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$. Scaling $\phi$ from 1 to 7 produced EfficientNet-B0 through B7, each achieving state-of-the-art accuracy at its compute level.
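The scaling rule is simple enough to tabulate directly, using the published B0 coefficients:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # EfficientNet-B0 grid-search results

for phi in range(1, 4):
    d, w, r = alpha ** phi, beta ** phi, gamma ** phi
    flops_factor = d * w**2 * r**2   # tracks 2^phi, since alpha*beta^2*gamma^2 ~ 2
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{flops_factor:.2f} (target {2**phi})")
```

Note that $\alpha \cdot \beta^2 \cdot \gamma^2 = 1.92$ for these coefficients, so the FLOPs factor tracks $2^\phi$ only approximately, as the constraint states.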

Fundamentals > Frontier: The architecture evolution teaches principles that outlast any specific model. Locality and weight sharing (LeNet) underpin all spatial architectures. Small kernels composed deeply (VGG) generalize to the design of any deep network. Residual connections (ResNet) appear in transformers, diffusion models, and reinforcement learning architectures. Compound scaling (EfficientNet) is a framework for principled architecture scaling in any domain.


8.7 Residual Connections and Gradient Flow: A Rigorous Treatment

The Gradient Highway

Consider a plain network (no skip connections) with $L$ layers. The output is:

$$\mathbf{y} = f_L \circ f_{L-1} \circ \cdots \circ f_1(\mathbf{x})$$

By the chain rule, the gradient of the loss $\mathcal{L}$ with respect to the parameters $\theta_l$ of layer $l$ involves:

$$\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L} \cdot \prod_{i=l+1}^{L} \frac{\partial \mathbf{h}_i}{\partial \mathbf{h}_{i-1}} \cdot \frac{\partial \mathbf{h}_l}{\partial \theta_l}$$

The product $\prod_{i=l+1}^{L} \frac{\partial \mathbf{h}_i}{\partial \mathbf{h}_{i-1}}$ is a product of $L - l$ Jacobian matrices. If the spectral norms of these Jacobians are consistently less than 1, the product vanishes exponentially. If consistently greater than 1, it explodes. Maintaining spectral norms near 1 for hundreds of layers is extremely difficult in practice.

Residual Connections Transform the Gradient

Now consider a residual network where layer $l$ computes:

$$\mathbf{h}_l = \mathcal{F}_l(\mathbf{h}_{l-1}) + \mathbf{h}_{l-1}$$

Unrolling this recurrence from layer $l$ to layer $L$:

$$\mathbf{h}_L = \mathbf{h}_l + \sum_{i=l+1}^{L} \mathcal{F}_i(\mathbf{h}_{i-1})$$

The gradient of the loss with respect to $\mathbf{h}_l$ becomes:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L} \cdot \frac{\partial \mathbf{h}_L}{\partial \mathbf{h}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L} \left( \mathbf{I} + \frac{\partial}{\partial \mathbf{h}_l} \sum_{i=l+1}^{L} \mathcal{F}_i(\mathbf{h}_{i-1}) \right)$$

The critical term is $\mathbf{I}$ — the identity matrix. It guarantees that the gradient has a direct path from the loss to any layer, bypassing all intermediate transformations. Even if the residual Jacobians $\frac{\partial \mathcal{F}_i}{\partial \mathbf{h}_{i-1}}$ are poorly conditioned, the identity term ensures that the gradient signal is never completely lost. This is the gradient highway: the skip connections create a direct information path through which gradients flow unimpeded.
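A minimal numerical sketch makes the highway visible. We stack 50 small linear layers (weights scaled down so each plain-network Jacobian has spectral norm well below 1) and compare the gradient magnitude that reaches the input with and without skip connections:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 64

def input_grad_norm(residual: bool) -> float:
    layers = [nn.Linear(dim, dim) for _ in range(depth)]
    for layer in layers:
        layer.weight.data *= 0.5   # shrink Jacobians to induce vanishing
    x = torch.randn(1, dim, requires_grad=True)
    h = x
    for layer in layers:
        f = torch.tanh(layer(h))
        h = h + f if residual else f   # skip connection on/off
    h.sum().backward()
    return x.grad.norm().item()

plain, res = input_grad_norm(False), input_grad_norm(True)
print(f"plain:    {plain:.3e}")   # vanishes exponentially with depth
print(f"residual: {res:.3e}")     # stays orders of magnitude larger
```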

The Bottleneck Block

For deep networks (ResNet-50 and beyond), the standard residual block is replaced by a bottleneck block that reduces computational cost:

$$\mathbf{x} \;\xrightarrow{\;1 \times 1,\ C/4\;}\; \cdot \;\xrightarrow{\;3 \times 3,\ C/4\;}\; \cdot \;\xrightarrow{\;1 \times 1,\ C\;}\; \mathcal{F}(\mathbf{x}), \qquad \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$$

The first $1 \times 1$ convolution reduces channels from $C$ to $C/4$ (the "bottleneck"), the $3 \times 3$ convolution operates on the reduced channels, and the final $1 \times 1$ convolution expands back to $C$ channels. This reduces the parameter count of the $3 \times 3$ convolution by a factor of 16 ($C/4$ input channels and $C/4$ output channels vs. $C$ input and $C$ output).

import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """ResNet bottleneck residual block.

    Architecture: 1x1 (reduce) -> 3x3 (process) -> 1x1 (expand) + shortcut.

    Args:
        in_channels: Number of input channels.
        bottleneck_channels: Number of channels in the bottleneck (typically in_channels // 4).
        stride: Stride for the 3x3 convolution (for downsampling).
    """

    def __init__(
        self,
        in_channels: int,
        bottleneck_channels: int,
        stride: int = 1,
    ) -> None:
        super().__init__()
        out_channels = in_channels  # Bottleneck preserves channel count

        self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)

        self.conv2 = nn.Conv2d(
            bottleneck_channels, bottleneck_channels, 3,
            stride=stride, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)

        self.conv3 = nn.Conv2d(bottleneck_channels, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)

        self.relu = nn.ReLU(inplace=True)

        # Shortcut projection if spatial dimensions change
        self.shortcut = nn.Identity()
        if stride != 1:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = self.shortcut(x)

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))

        out += identity  # The residual connection
        out = self.relu(out)
        return out


# Verify shapes
block = BottleneckBlock(in_channels=256, bottleneck_channels=64)
x = torch.randn(4, 256, 56, 56)
y = block(x)
print(f"Input:  {x.shape}")  # (4, 256, 56, 56)
print(f"Output: {y.shape}")  # (4, 256, 56, 56)

# Count parameters
params = sum(p.numel() for p in block.parameters())
print(f"Bottleneck block parameters: {params:,}")
# Compare: a plain 3x3 conv block with 256 channels would have
# 2 * (256 * 256 * 3 * 3) = 1,179,648 parameters
# The bottleneck has roughly 70,000 — a ~17x reduction
Input:  torch.Size([4, 256, 56, 56])
Output: torch.Size([4, 256, 56, 56])
Bottleneck block parameters: 70,400

Pre-Activation vs. Post-Activation

The original ResNet places batch normalization and ReLU after the addition: $\text{ReLU}(\text{BN}(\mathcal{F}(\mathbf{x})) + \mathbf{x})$. He et al. (2016b) showed that placing them before the convolution (pre-activation) improves training:

$$\mathbf{h}_l = \mathcal{F}_l(\text{ReLU}(\text{BN}(\mathbf{h}_{l-1}))) + \mathbf{h}_{l-1}$$

Pre-activation ensures that the skip connection carries a clean, untransformed signal. The identity shortcut is truly an identity — no batch norm or ReLU modifies it. This produces a purer gradient highway and consistently improves results on very deep networks (1,001 layers).
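A minimal pre-activation block (a sketch of the basic, non-bottleneck variant) looks like this — note that the shortcut path contains no batch norm or ReLU:

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> conv, twice, then add."""

    def __init__(self, channels: int) -> None:
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + x   # the shortcut carries the raw, untransformed signal


block = PreActBlock(64)
y = block(torch.randn(2, 64, 8, 8))
print(y.shape)  # torch.Size([2, 64, 8, 8])
```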


8.8 Depthwise Separable Convolutions

A standard convolution with $C_{\text{in}}$ input channels, $C_{\text{out}}$ output channels, and kernel size $k$ has $C_{\text{in}} \cdot C_{\text{out}} \cdot k^2$ parameters and $O(C_{\text{in}} \cdot C_{\text{out}} \cdot k^2 \cdot H \cdot W)$ FLOPs.

A depthwise separable convolution (Chollet, 2017; Howard et al., 2017) factorizes this into two operations:

  1. Depthwise convolution: Apply a separate $k \times k$ kernel to each input channel independently. Parameters: $C_{\text{in}} \cdot k^2$. This captures spatial patterns within each channel.

  2. Pointwise convolution: Apply a $1 \times 1$ convolution across channels. Parameters: $C_{\text{in}} \cdot C_{\text{out}}$. This mixes information across channels.

Total parameters: $C_{\text{in}} \cdot k^2 + C_{\text{in}} \cdot C_{\text{out}}$.

The ratio of depthwise separable to standard convolution parameters is:

$$\frac{C_{\text{in}} \cdot k^2 + C_{\text{in}} \cdot C_{\text{out}}}{C_{\text{in}} \cdot C_{\text{out}} \cdot k^2} = \frac{1}{C_{\text{out}}} + \frac{1}{k^2}$$

For $C_{\text{out}} = 256$ and $k = 3$: $\frac{1}{256} + \frac{1}{9} \approx 0.115$ — an $8.7\times$ reduction.

This factorization is the foundation of MobileNets and EfficientNet's baseline architecture. It achieves comparable accuracy to standard convolutions at a fraction of the cost.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution block.

    Factorizes a standard convolution into:
    1. Depthwise: spatial filtering per channel
    2. Pointwise: channel mixing via 1x1 conv

    Args:
        in_channels: Number of input channels.
        out_channels: Number of output channels.
        kernel_size: Spatial kernel size for depthwise convolution.
        stride: Stride for the depthwise convolution.
        padding: Padding for the depthwise convolution.
    """

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int = 3,
        stride: int = 1,
        padding: int = 1,
    ) -> None:
        super().__init__()
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            stride=stride, padding=padding, groups=in_channels, bias=False
        )
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x


# Compare parameter counts
standard = nn.Conv2d(256, 256, 3, padding=1)
separable = DepthwiseSeparableConv(256, 256, 3, padding=1)

standard_params = sum(p.numel() for p in standard.parameters())
separable_params = sum(p.numel() for p in separable.parameters())

print(f"Standard conv parameters:  {standard_params:,}")
print(f"Separable conv parameters: {separable_params:,}")
print(f"Ratio: {separable_params / standard_params:.3f}")
Standard conv parameters:  590,080
Separable conv parameters: 68,864
Ratio: 0.117

8.9 Data Augmentation

Data augmentation is the single most effective regularization technique for CNNs. It exploits the known symmetries of the problem: if flipping an image horizontally does not change its label, then we should train on both the original and flipped version. This artificially increases the effective training set size without collecting new data.

Common Augmentation Transforms

| Transform | Effect | Typical Parameters |
| --- | --- | --- |
| Random horizontal flip | Invariance to horizontal reflection | $p = 0.5$ |
| Random crop | Translation invariance, partial occlusion | Pad 4, crop to original size |
| Color jitter | Invariance to illumination changes | Brightness, contrast, saturation, hue |
| Random rotation | Rotation invariance (when appropriate) | $\pm 15°$ |
| Random erasing (Cutout) | Occlusion robustness | Erase 10-30% of image |
| Mixup | Convex combinations of image pairs | $\alpha = 0.2$ for Beta distribution |
| CutMix | Paste patches between image pairs | Region size from Beta distribution |
| RandAugment | Automated augmentation policy | $N = 2$, $M = 9$ |

Implementation

import torch
from torchvision import transforms

def get_train_transforms(image_size: int = 224) -> transforms.Compose:
    """Standard training augmentation pipeline for image classification.

    Args:
        image_size: Target image size after resizing and cropping.

    Returns:
        Composed transform pipeline.
    """
    return transforms.Compose([
        transforms.RandomResizedCrop(image_size, scale=(0.08, 1.0)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ColorJitter(
            brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1
        ),
        transforms.RandomGrayscale(p=0.1),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],  # ImageNet statistics
            std=[0.229, 0.224, 0.225],
        ),
        transforms.RandomErasing(p=0.25, scale=(0.02, 0.33)),
    ])


def get_eval_transforms(image_size: int = 224) -> transforms.Compose:
    """Evaluation transforms — deterministic, no augmentation.

    Args:
        image_size: Target image size.

    Returns:
        Composed transform pipeline.
    """
    return transforms.Compose([
        transforms.Resize(int(image_size * 1.143)),  # Resize to 256 for 224
        transforms.CenterCrop(image_size),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225],
        ),
    ])

Implementation Note: Data augmentation is applied only during training, never during evaluation. A common bug is applying random transforms during validation, which introduces noise into the evaluation metrics. Use separate transform pipelines for training and evaluation. For test-time augmentation (TTA), apply a deterministic set of transforms (e.g., horizontal flip, five crops) and average predictions — this is distinct from random training augmentation.
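The TTA procedure described in the note above can be sketched in a few lines. This is a minimal illustration: `predict_with_tta` and the stand-in demo model are our own names, not part of torchvision, and a real system would use the trained classifier with its evaluation transforms.

```python
import torch
import torch.nn as nn


def predict_with_tta(model: nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Average softmax predictions over a deterministic set of augmented views."""
    model.eval()
    views = [image, torch.flip(image, dims=[3])]  # original + horizontal flip
    with torch.no_grad():
        probs = [torch.softmax(model(v), dim=1) for v in views]
    return torch.stack(probs).mean(dim=0)


# Demo with a stand-in model; in practice this would be a trained CNN
demo_model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
image = torch.randn(1, 3, 32, 32)  # a single normalized image
tta_probs = predict_with_tta(demo_model, image)
print(tta_probs.shape)  # torch.Size([1, 10])
```

Because each view passes through softmax before averaging, the result is still a valid probability distribution over classes.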

Mixup: A Mathematical Perspective

Mixup (Zhang et al., 2018) creates virtual training examples by taking convex combinations:

$$\tilde{\mathbf{x}} = \lambda \mathbf{x}_i + (1 - \lambda) \mathbf{x}_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j$$

where $\lambda \sim \text{Beta}(\alpha, \alpha)$. For classification with one-hot labels, this means the model is trained to predict a mixture of two class probabilities, not a single hard label. This has a regularization effect: the model cannot memorize individual examples because the targets are no longer one-hot.

The loss under Mixup is:

$$\mathcal{L}_{\text{Mixup}} = \lambda \cdot \mathcal{L}(\hat{y}, y_i) + (1 - \lambda) \cdot \mathcal{L}(\hat{y}, y_j)$$

From a Bayesian perspective, Mixup acts as a data-dependent prior that encourages linear behavior between training examples, which improves calibration and reduces overconfident predictions on out-of-distribution inputs.
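The two equations above translate directly into batch-level code. This is a minimal sketch (the helper names `mixup_batch` and `mixup_loss` are ours) that samples one $\lambda$ per batch and pairs each example with a random permutation of the same batch:

```python
import torch
import torch.nn.functional as F


def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    """Mix a batch with a random permutation of itself; lambda ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], y, y[perm], lam


def mixup_loss(logits: torch.Tensor, y_a: torch.Tensor, y_b: torch.Tensor, lam: float):
    """Lambda-weighted combination of the two cross-entropy terms."""
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)


# Demo on random data
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
x_mixed, y_a, y_b, lam = mixup_batch(x, y)
logits = torch.randn(8, 10)              # stand-in for model(x_mixed)
loss = mixup_loss(logits, y_a, y_b, lam)
```

Note that the labels are never mixed explicitly; keeping both label tensors and weighting the losses is equivalent to training against the soft target $\tilde{y}$ and avoids constructing one-hot vectors.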


8.10 Putting It Together: A Complete CNN in PyTorch

We now implement a simplified ResNet-style CNN for image classification, incorporating the concepts from this chapter: convolutional layers, batch normalization, residual connections, global average pooling, and data augmentation.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from typing import List, Tuple


class ResidualBlock(nn.Module):
    """Basic residual block with two 3x3 convolutions.

    Args:
        in_channels: Number of input channels.
        out_channels: Number of output channels.
        stride: Stride for the first convolution (downsampling).
    """

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1) -> None:
        super().__init__()
        self.conv1 = nn.Conv2d(
            in_channels, out_channels, 3, stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, 3, stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = self.shortcut(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += identity
        out = self.relu(out)
        return out


class SimpleCNN(nn.Module):
    """A simplified ResNet-style CNN for image classification.

    Architecture:
        - Initial 7x7 conv with stride 2, followed by max pool
        - Four stages of residual blocks with increasing channels
        - Global average pooling
        - Linear classifier

    Args:
        num_classes: Number of output classes.
        blocks_per_stage: Number of residual blocks in each stage.
    """

    def __init__(
        self,
        num_classes: int = 10,
        blocks_per_stage: List[int] = [2, 2, 2, 2],
    ) -> None:
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )

        channels = [64, 128, 256, 512]
        self.stages = nn.ModuleList()
        in_ch = 64
        for stage_idx, (num_blocks, out_ch) in enumerate(
            zip(blocks_per_stage, channels)
        ):
            blocks = []
            for block_idx in range(num_blocks):
                stride = 2 if block_idx == 0 and stage_idx > 0 else 1
                blocks.append(ResidualBlock(in_ch, out_ch, stride=stride))
                in_ch = out_ch
            self.stages.append(nn.Sequential(*blocks))

        self.gap = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
        x = self.gap(x)
        x = x.flatten(1)
        x = self.classifier(x)
        return x


def train_cnn(
    model: nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    epochs: int = 30,
    lr: float = 0.1,
    weight_decay: float = 1e-4,
    device: str = "cuda",
) -> List[dict]:
    """Train a CNN with standard recipe: SGD + momentum + cosine LR + augmentation.

    Args:
        model: The CNN model.
        train_loader: Training data loader (with augmentation).
        val_loader: Validation data loader (no augmentation).
        epochs: Number of training epochs.
        lr: Initial learning rate.
        weight_decay: L2 regularization coefficient.
        device: Device to train on.

    Returns:
        List of per-epoch metrics dictionaries.
    """
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(
        model.parameters(), lr=lr, momentum=0.9, weight_decay=weight_decay
    )
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    history = []
    for epoch in range(epochs):
        # Training
        model.train()
        train_loss, train_correct, train_total = 0.0, 0, 0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            train_loss += loss.item() * images.size(0)
            train_correct += (outputs.argmax(1) == labels).sum().item()
            train_total += images.size(0)

        # Validation
        model.eval()
        val_loss, val_correct, val_total = 0.0, 0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                loss = criterion(outputs, labels)
                val_loss += loss.item() * images.size(0)
                val_correct += (outputs.argmax(1) == labels).sum().item()
                val_total += images.size(0)

        scheduler.step()

        metrics = {
            "epoch": epoch + 1,
            "train_loss": train_loss / train_total,
            "train_acc": train_correct / train_total,
            "val_loss": val_loss / val_total,
            "val_acc": val_correct / val_total,
            "lr": scheduler.get_last_lr()[0],
        }
        history.append(metrics)
        print(
            f"Epoch {metrics['epoch']:3d} | "
            f"Train Loss: {metrics['train_loss']:.4f} | "
            f"Train Acc: {metrics['train_acc']:.4f} | "
            f"Val Loss: {metrics['val_loss']:.4f} | "
            f"Val Acc: {metrics['val_acc']:.4f} | "
            f"LR: {metrics['lr']:.6f}"
        )

    return history


# Example usage with CIFAR-10
# (In practice, use a larger dataset; CIFAR-10 is used here for demonstration)
if __name__ == "__main__":
    train_transforms = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])
    val_transforms = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])

    train_dataset = datasets.CIFAR10("./data", train=True, download=True, transform=train_transforms)
    val_dataset = datasets.CIFAR10("./data", train=False, download=True, transform=val_transforms)
    train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=4)
    val_loader = DataLoader(val_dataset, batch_size=256, shuffle=False, num_workers=4)

    model = SimpleCNN(num_classes=10, blocks_per_stage=[2, 2, 2, 2])
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Total parameters: {total_params:,}")

    history = train_cnn(model, train_loader, val_loader, epochs=30, lr=0.1)

8.11 Grad-CAM: Understanding What the CNN Sees

A persistent criticism of deep learning is that models are "black boxes." Gradient-weighted Class Activation Mapping (Grad-CAM; Selvaraju et al., 2017) addresses this by producing a coarse heatmap highlighting the image regions that are most important for a particular class prediction.

The Derivation

Given a target class $c$ and the feature maps $A^k \in \mathbb{R}^{H \times W}$ from the last convolutional layer (where $k$ indexes channels), Grad-CAM computes:

Step 1. Compute the gradient of the logit $y^c$ (before softmax) with respect to each feature map:

$$\frac{\partial y^c}{\partial A^k_{ij}}$$

Step 2. Global average pool the gradients to obtain channel importance weights:

$$\alpha_k^c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \frac{\partial y^c}{\partial A^k_{ij}}$$

This $\alpha_k^c$ is a scalar representing how important feature map $k$ is for class $c$.

Step 3. Compute the weighted combination of feature maps, followed by ReLU:

$$L^c_{\text{Grad-CAM}} = \text{ReLU}\left( \sum_k \alpha_k^c A^k \right)$$

The ReLU ensures we focus on features that have a positive influence on the class of interest. Negative activations are suppressed because they correspond to features that reduce confidence in class $c$.

Step 4. Upsample $L^c_{\text{Grad-CAM}}$ (which has the resolution of the last conv layer, e.g., $7 \times 7$ for ResNet) to the input resolution via bilinear interpolation, then overlay on the original image.

Implementation

import torch
import torch.nn.functional as F
from typing import Optional
import numpy as np


class GradCAM:
    """Grad-CAM: Gradient-weighted Class Activation Mapping.

    Produces a heatmap showing which spatial regions of the input are
    most relevant for a given class prediction.

    Args:
        model: A CNN model.
        target_layer: The convolutional layer to extract feature maps from
                      (typically the last conv layer before global pooling).
    """

    def __init__(self, model: torch.nn.Module, target_layer: torch.nn.Module) -> None:
        self.model = model
        self.target_layer = target_layer
        self.feature_maps: Optional[torch.Tensor] = None
        self.gradients: Optional[torch.Tensor] = None

        # Register hooks
        target_layer.register_forward_hook(self._save_feature_maps)
        target_layer.register_full_backward_hook(self._save_gradients)

    def _save_feature_maps(
        self, module: torch.nn.Module, input: tuple, output: torch.Tensor
    ) -> None:
        self.feature_maps = output.detach()

    def _save_gradients(
        self, module: torch.nn.Module, grad_input: tuple, grad_output: tuple
    ) -> None:
        self.gradients = grad_output[0].detach()

    def generate(
        self,
        input_tensor: torch.Tensor,
        target_class: Optional[int] = None,
    ) -> np.ndarray:
        """Generate a Grad-CAM heatmap.

        Args:
            input_tensor: Input image tensor of shape (1, C, H, W).
            target_class: Class index to generate heatmap for.
                          If None, uses the predicted class.

        Returns:
            Heatmap of shape (H_input, W_input) with values in [0, 1].
        """
        self.model.eval()

        # Forward pass
        output = self.model(input_tensor)

        if target_class is None:
            target_class = output.argmax(dim=1).item()

        # Backward pass for the target class
        self.model.zero_grad()
        target_score = output[0, target_class]
        target_score.backward()

        # Step 2: Global average pool the gradients -> channel importance
        # gradients shape: (1, C, H_feat, W_feat)
        alpha = self.gradients.mean(dim=(2, 3), keepdim=True)  # (1, C, 1, 1)

        # Step 3: Weighted combination + ReLU
        # feature_maps shape: (1, C, H_feat, W_feat)
        cam = (alpha * self.feature_maps).sum(dim=1, keepdim=True)  # (1, 1, H_feat, W_feat)
        cam = F.relu(cam)

        # Step 4: Upsample to input resolution
        cam = F.interpolate(
            cam, size=input_tensor.shape[2:], mode="bilinear", align_corners=False
        )

        # Normalize to [0, 1]
        cam = cam.squeeze().cpu().numpy()  # Move to CPU before converting to NumPy
        if cam.max() > 0:
            cam = cam / cam.max()

        return cam


# Usage example (assumes a trained model and a loaded image)
# model = SimpleCNN(num_classes=10)
# model.load_state_dict(torch.load("model.pt"))
#
# # Target layer: last conv layer in the final stage
# target_layer = model.stages[-1][-1].conv2
# grad_cam = GradCAM(model, target_layer)
#
# # Generate heatmap
# image_tensor = val_transforms(image).unsqueeze(0)
# heatmap = grad_cam.generate(image_tensor)
#
# # Overlay on image
# import matplotlib.pyplot as plt
# plt.imshow(image)
# plt.imshow(heatmap, alpha=0.5, cmap='jet')
# plt.title("Grad-CAM heatmap")
# plt.colorbar()
# plt.show()

Research Insight: Grad-CAM reveals not just where the model looks, but whether it looks at the right things. A model that classifies "boat" by attending to the water region (a spurious correlation — boats in ImageNet are typically photographed on water) will have high accuracy on ImageNet but fail on boats in other contexts. Grad-CAM makes such failure modes visible and auditable. This connects to the broader theme of Chapter 35 (Interpretability and Explainability at Scale).


8.12 Beyond Images: 1D Convolutions for Sequences

Convolution is not inherently a 2D operation. The same principles — locality, weight sharing, hierarchical feature composition — apply to any data with local structure. 1D convolutions are powerful tools for processing sequences: text, time series, audio, and sensor data.

1D Convolution Mechanics

For an input sequence $\mathbf{X} \in \mathbb{R}^{C_{\text{in}} \times L}$ (where $C_{\text{in}}$ is the number of input channels and $L$ is the sequence length), a 1D convolution with kernel $\mathbf{W} \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times k}$ produces:

$$Y_{c,j} = b_c + \sum_{c'=0}^{C_{\text{in}}-1} \sum_{t=0}^{k-1} W_{c,c',t} \cdot X_{c', j \cdot s + t}$$

This is identical to the 2D case but with one spatial dimension instead of two.
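As a sanity check, the sum above can be evaluated by hand for a single output position and compared against `nn.Conv1d` (here with stride $s = 1$ and no padding):

```python
import torch
import torch.nn as nn

# A tiny Conv1d: C_in = 2, C_out = 1, kernel size k = 3, stride 1, no padding
conv = nn.Conv1d(in_channels=2, out_channels=1, kernel_size=3)
x = torch.randn(1, 2, 8)  # (batch, C_in, L)

# Evaluate Y_{0,0} by hand: sum over input channels c' and kernel taps t
W, b = conv.weight, conv.bias  # W: (C_out, C_in, k) = (1, 2, 3)
manual = b[0] + sum(
    W[0, c, t] * x[0, c, 0 + t]   # j = 0, stride s = 1
    for c in range(2) for t in range(3)
)

out = conv(x)
print(out.shape)  # torch.Size([1, 1, 6]) — output length L - k + 1
assert torch.allclose(out[0, 0, 0], manual, atol=1e-6)
```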

1D CNNs for Text

For text processing, the standard pipeline is:

  1. Embed each token as a dense vector: $\mathbf{X} \in \mathbb{R}^{d_{\text{emb}} \times L}$ (embedding dimension as channels, sequence length as the spatial dimension).
  2. Apply 1D convolutions with multiple kernel sizes to capture n-gram patterns at different scales.
  3. Pool across the sequence dimension (global max or average pooling) to produce a fixed-size representation regardless of input length.
  4. Classify or use the representation as an embedding.

Implementation

import torch
import torch.nn as nn
from typing import List


class TextCNN(nn.Module):
    """1D CNN for text classification / embedding extraction.

    Uses multiple kernel sizes to capture n-gram patterns at different
    scales, inspired by Kim (2014) "Convolutional Neural Networks for
    Sentence Classification."

    Args:
        vocab_size: Size of the vocabulary.
        embedding_dim: Dimension of token embeddings.
        num_filters: Number of filters per kernel size.
        kernel_sizes: List of kernel sizes (n-gram widths).
        num_classes: Number of output classes (0 for embedding mode).
        dropout: Dropout probability.
        max_length: Maximum sequence length.
    """

    def __init__(
        self,
        vocab_size: int = 30000,
        embedding_dim: int = 128,
        num_filters: int = 100,
        kernel_sizes: List[int] = [3, 4, 5],
        num_classes: int = 10,
        dropout: float = 0.5,
        max_length: int = 256,
    ) -> None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(embedding_dim, num_filters, k, padding=k // 2),
                nn.BatchNorm1d(num_filters),
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])
        self.dropout = nn.Dropout(dropout)

        total_filters = num_filters * len(kernel_sizes)
        self.num_classes = num_classes
        if num_classes > 0:
            self.classifier = nn.Linear(total_filters, num_classes)

    def get_embedding(self, x: torch.Tensor) -> torch.Tensor:
        """Extract a fixed-size text embedding via 1D CNN + global max pool.

        Args:
            x: Token indices of shape (batch_size, seq_length).

        Returns:
            Text embedding of shape (batch_size, total_filters).
        """
        # Embedding: (batch, seq_len) -> (batch, seq_len, emb_dim)
        emb = self.embedding(x)
        # Transpose for Conv1d: (batch, emb_dim, seq_len)
        emb = emb.transpose(1, 2)

        # Apply each conv + global max pool
        pooled = []
        for conv in self.convs:
            h = conv(emb)                   # (batch, num_filters, seq_len')
            h = h.max(dim=2).values         # (batch, num_filters) — global max pool
            pooled.append(h)

        # Concatenate across kernel sizes
        out = torch.cat(pooled, dim=1)      # (batch, total_filters)
        out = self.dropout(out)
        return out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass for classification.

        Args:
            x: Token indices of shape (batch_size, seq_length).

        Returns:
            Logits of shape (batch_size, num_classes).
        """
        emb = self.get_embedding(x)
        if self.num_classes > 0:
            return self.classifier(emb)
        return emb


# Example: extract item embeddings for StreamRec
model = TextCNN(
    vocab_size=30000,
    embedding_dim=128,
    num_filters=100,
    kernel_sizes=[3, 4, 5],
    num_classes=0,  # Embedding mode — no classifier head
)

# Simulate a batch of item descriptions (token indices)
batch_tokens = torch.randint(1, 30000, (32, 128))  # 32 items, max 128 tokens
embeddings = model.get_embedding(batch_tokens)
print(f"Input shape:     {batch_tokens.shape}")      # (32, 128)
print(f"Embedding shape: {embeddings.shape}")         # (32, 300) = 100 * 3 kernel sizes

total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
Input shape:     torch.Size([32, 128])
Embedding shape: torch.Size([32, 300])
Total parameters: 3,994,500

Why Multi-Kernel 1D CNN?

A kernel of size $k$ over an embedding sequence acts as a learned n-gram detector:

  • $k = 3$: trigram patterns ("not very good", "state of the")
  • $k = 4$: 4-gram patterns ("on the other hand")
  • $k = 5$: 5-gram patterns ("this is the best movie")

By using multiple kernel sizes and concatenating the results, the model captures patterns at multiple granularities simultaneously. Global max pooling then selects the most salient pattern at each granularity, regardless of position — achieving the same translation invariance principle as in image CNNs.
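The position-invariance claim can be verified directly. In this toy sketch, a fixed pattern is embedded at two different positions in otherwise-zero sequences; the global-max-pooled features come out identical, provided the pattern stays clear of the sequence boundaries:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=3)

# A short "pattern" embedded at two different positions in zero sequences
pattern = torch.randn(4, 5)
seq_a = torch.zeros(1, 4, 32); seq_a[0, :, 2:7] = pattern    # near the start
seq_b = torch.zeros(1, 4, 32); seq_b[0, :, 20:25] = pattern  # near the end

with torch.no_grad():
    feat_a = conv(seq_a).max(dim=2).values  # global max pool over time
    feat_b = conv(seq_b).max(dim=2).values

# Same pooled features regardless of where the pattern occurred
assert torch.allclose(feat_a, feat_b, atol=1e-6)
```

Average pooling would weaken this property: the mean over the sequence dilutes a localized pattern with the zero background, which is why max pooling is the usual choice for n-gram detection.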

1D CNNs for Time Series

The same architecture applies to time series data. For the Climate DL anchor, gridded temperature data at a single spatial location over time forms a 1D signal. Multivariate time series (temperature, precipitation, wind speed) map to multi-channel 1D inputs, and 1D convolutions capture temporal patterns at multiple scales.

class TemporalCNN(nn.Module):
    """1D CNN for multivariate time series processing.

    Designed for climate data: multiple meteorological variables
    measured over time at a grid point.

    Args:
        n_variables: Number of input variables (channels).
        n_filters: Number of filters per layer.
        kernel_size: Temporal kernel size.
        n_layers: Number of convolutional layers.
        output_dim: Dimension of the output embedding.
    """

    def __init__(
        self,
        n_variables: int = 6,
        n_filters: int = 64,
        kernel_size: int = 5,
        n_layers: int = 4,
        output_dim: int = 128,
    ) -> None:
        super().__init__()
        layers = []
        in_ch = n_variables
        for i in range(n_layers):
            out_ch = n_filters * (2 ** min(i, 2))  # Increase channels with depth
            layers.append(nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2))
            layers.append(nn.BatchNorm1d(out_ch))
            layers.append(nn.ReLU())
            if i < n_layers - 1:
                layers.append(nn.MaxPool1d(2))  # Halve temporal resolution
            in_ch = out_ch

        self.backbone = nn.Sequential(*layers)
        self.gap = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(out_ch, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Process multivariate time series.

        Args:
            x: Input tensor of shape (batch, n_variables, time_steps).

        Returns:
            Embedding of shape (batch, output_dim).
        """
        h = self.backbone(x)
        h = self.gap(h).squeeze(-1)
        return self.fc(h)


# Example: climate time series
model = TemporalCNN(n_variables=6, n_filters=64, n_layers=4, output_dim=128)
x = torch.randn(16, 6, 365)  # 16 samples, 6 variables, 365 days
out = model(x)
print(f"Input:  {x.shape}")   # (16, 6, 365)
print(f"Output: {out.shape}")  # (16, 128)
Input:  torch.Size([16, 6, 365])
Output: torch.Size([16, 128])

Fundamentals > Frontier: 1D CNNs for text have been largely superseded by transformers for most NLP tasks. But they remain highly competitive for several important use cases: (1) when inference latency matters (CNNs are much faster than transformers for short sequences), (2) when model size must be small (edge deployment), (3) as feature extractors within larger systems (e.g., extracting text embeddings as input to a recommendation model, as in the StreamRec progressive project), and (4) for time series data where temporal locality is a strong inductive bias. Knowing when the simpler model suffices is a hallmark of senior practice.


8.13 Progressive Project Milestone M3: Content Embeddings for StreamRec

In Milestone M2 (Chapter 6), you built a click-prediction MLP for StreamRec using structured features (user age, subscription tier, content category). Now we add text features: each item on StreamRec has a description (title + synopsis, typically 20-100 tokens). We will use a 1D CNN to convert these descriptions into dense embeddings and compare with a bag-of-words baseline.

The Setup

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
from typing import Dict, List, Tuple
from collections import Counter


class StreamRecItemDataset(Dataset):
    """StreamRec item dataset with text descriptions and category labels.

    In a real system, the descriptions would come from the content catalog.
    Here we simulate with category-specific vocabulary patterns.

    Args:
        n_items: Number of items to generate.
        max_length: Maximum token sequence length.
        vocab_size: Vocabulary size.
        n_categories: Number of content categories.
        seed: Random seed for reproducibility.
    """

    def __init__(
        self,
        n_items: int = 50000,
        max_length: int = 64,
        vocab_size: int = 10000,
        n_categories: int = 20,
        seed: int = 42,
    ) -> None:
        super().__init__()
        rng = np.random.RandomState(seed)

        self.max_length = max_length
        self.vocab_size = vocab_size

        # Generate category-specific token distributions
        # Each category has "signature" tokens that appear more frequently
        category_signatures = {}
        for cat in range(n_categories):
            sig_tokens = rng.choice(
                range(2, vocab_size), size=50, replace=False
            )
            category_signatures[cat] = sig_tokens

        # Generate items
        self.tokens = []
        self.labels = []
        for _ in range(n_items):
            category = rng.randint(0, n_categories)
            length = rng.randint(15, max_length)

            # Mix signature tokens (60%) with random tokens (40%)
            sig = category_signatures[category]
            n_sig = int(0.6 * length)
            n_rand = length - n_sig
            seq = np.concatenate([
                rng.choice(sig, size=n_sig),
                rng.randint(2, vocab_size, size=n_rand),
            ])
            rng.shuffle(seq)

            # Pad to max_length
            padded = np.zeros(max_length, dtype=np.int64)
            padded[:length] = seq[:max_length]

            self.tokens.append(padded)
            self.labels.append(category)

        self.tokens = np.array(self.tokens)
        self.labels = np.array(self.labels)

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, int]:
        return torch.tensor(self.tokens[idx], dtype=torch.long), self.labels[idx]


class BagOfWordsModel(nn.Module):
    """Bag-of-words baseline: average token embeddings, then classify.

    This model ignores token order entirely — it treats text as a set
    of words, not a sequence.

    Args:
        vocab_size: Vocabulary size.
        embedding_dim: Token embedding dimension.
        num_classes: Number of categories.
    """

    def __init__(
        self, vocab_size: int = 10000, embedding_dim: int = 128, num_classes: int = 20
    ) -> None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        emb = self.embedding(x)           # (batch, seq_len, emb_dim)
        mask = (x != 0).float().unsqueeze(-1)
        emb = (emb * mask).sum(1) / mask.sum(1).clamp(min=1)  # Masked mean
        return self.classifier(emb)


def train_and_evaluate(
    model: nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    epochs: int = 20,
    lr: float = 1e-3,
) -> Dict[str, float]:
    """Train a text model and return final validation metrics.

    Args:
        model: Text model (BoW or CNN).
        train_loader: Training data loader.
        val_loader: Validation data loader.
        epochs: Number of training epochs.
        lr: Learning rate.

    Returns:
        Dictionary with final train and validation accuracy.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        model.train()
        for tokens, labels in train_loader:
            tokens, labels = tokens.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(tokens), labels)
            loss.backward()
            optimizer.step()

    # Final evaluation
    model.eval()
    results = {}
    for name, loader in [("train", train_loader), ("val", val_loader)]:
        correct, total = 0, 0
        with torch.no_grad():
            for tokens, labels in loader:
                tokens, labels = tokens.to(device), labels.to(device)
                preds = model(tokens).argmax(1)
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        results[f"{name}_acc"] = correct / total

    return results


# Compare BoW vs. 1D CNN
dataset = StreamRecItemDataset(n_items=50000, max_length=64, n_categories=20)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(
    dataset, [train_size, val_size], generator=torch.Generator().manual_seed(42)
)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=256, shuffle=False)

# Bag of Words
bow_model = BagOfWordsModel(vocab_size=10000, embedding_dim=128, num_classes=20)
bow_results = train_and_evaluate(bow_model, train_loader, val_loader, epochs=20)

# 1D CNN (using TextCNN from Section 8.12)
cnn_model = TextCNN(
    vocab_size=10000, embedding_dim=128, num_filters=100,
    kernel_sizes=[3, 4, 5], num_classes=20, dropout=0.3,
)
cnn_results = train_and_evaluate(cnn_model, train_loader, val_loader, epochs=20)

print("Results:")
print(f"  BoW  — Train: {bow_results['train_acc']:.4f}, Val: {bow_results['val_acc']:.4f}")
print(f"  CNN  — Train: {cnn_results['train_acc']:.4f}, Val: {cnn_results['val_acc']:.4f}")

On real descriptions, the 1D CNN should outperform bag-of-words because it captures local token patterns (n-grams) that signal category membership, while BoW treats each token independently. Note that the simulated dataset shuffles its tokens, so it carries only vocabulary-level signal; expect the gap between the two models to be larger on real text with genuine word order. The extracted embeddings (via get_embedding) will be used as item features in the recommendation model, providing a richer representation than bag-of-words or simple category labels.


8.14 Climate Deep Learning: CNNs for Spatial Data

The Climate DL anchor illustrates why CNNs are natural for gridded earth science data. CMIP6 global climate models produce outputs on grids of approximately $100 \text{ km}$ resolution — too coarse for regional planning. Statistical downscaling uses a CNN to learn the mapping from coarse-resolution input to high-resolution output, exploiting the spatial structure of climate fields.

The input is a coarse grid (e.g., $16 \times 16$ cells covering a region) with multiple climate variables (temperature, pressure, humidity) as channels. The output is a high-resolution grid (e.g., $64 \times 64$ cells) of the target variable (e.g., daily maximum temperature). This is a dense prediction problem (every output pixel gets a prediction), and the CNN's translational equivariance is essential: a cold front in the northwest corner of the grid should be downscaled the same way as a cold front in the southeast corner.
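A minimal sketch of such a downscaling network follows; the layer widths and $4\times$ scale factor are illustrative, not the Case Study 1 architecture. Because the model is fully convolutional, it inherits exactly the translation equivariance the paragraph describes:

```python
import torch
import torch.nn as nn


class DownscalingCNN(nn.Module):
    """Illustrative 4x super-resolution network for gridded climate fields.

    Maps a coarse multi-variable grid (e.g., 3 variables at 16x16) to a
    high-resolution single-variable grid (64x64) via two 2x upsampling stages.
    """

    def __init__(self, in_variables: int = 3, width: int = 64) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_variables, width, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(width, width, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(width, 1, 3, padding=1),  # one output variable per pixel
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


model = DownscalingCNN(in_variables=3)
coarse = torch.randn(8, 3, 16, 16)   # batch of coarse climate grids
fine = model(coarse)
print(fine.shape)  # torch.Size([8, 1, 64, 64]) — dense prediction per output cell
```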

This application is explored in depth in Case Study 1.


8.15 Summary

This chapter derived convolutional neural networks from first principles — not as a new architecture to memorize, but as a fully connected layer with two constraints (locality and weight sharing) that encode the statistical structure of spatial data. These constraints reduce parameters by orders of magnitude, introduce translation equivariance, and enable the learning of hierarchical feature representations.

The architecture evolution from LeNet to EfficientNet is a story of principles, not just models: - Learned features beat handcrafted ones (LeNet) - Scale and activation functions matter (AlexNet) - Depth through small kernels (VGG) - Skip connections as gradient highways (ResNet) - Compound scaling of depth, width, and resolution (EfficientNet)

Each principle transcends the specific architecture and applies broadly across deep learning. Residual connections, in particular, are now ubiquitous — they appear in transformers (Chapter 10), diffusion models (Chapter 12), and graph neural networks (Chapter 14).

Grad-CAM provides interpretability: we can see what the CNN attends to and verify that it is using the right features for the right reasons. This is essential for deployment in domains like climate science and healthcare where trust requires explanations.

Finally, convolution is not limited to images. 1D convolutions over text and time series exploit the same locality and weight sharing principles, and they remain practical tools in the deep learning toolkit — especially as efficient feature extractors within larger systems.

In the progressive project, you built content embeddings for StreamRec using a 1D CNN over item descriptions, demonstrating that capturing local token patterns (n-grams) produces richer representations than bag-of-words features. These embeddings will serve as item features in the recommendation model as it grows in subsequent chapters.

Looking Ahead: Chapter 9 introduces recurrent networks, which handle sequences by maintaining a hidden state over time — a fundamentally different approach from the fixed-window convolutions of this chapter. Chapter 10 introduces transformers, which combine the parallelism of CNNs with the global context of RNNs through the self-attention mechanism. The evolution from CNNs to RNNs to transformers is a story of trading one inductive bias for another, and understanding all three gives you the vocabulary to choose the right architecture for any problem.