Chapter 8: Exercises

Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field


Convolution Fundamentals

Exercise 8.1 (*)

A 1D input signal $\mathbf{x} = [2, 1, 3, 0, 4, 2]$ is convolved with kernel $\mathbf{w} = [1, 0, -1]$ using stride 1 and no padding.

(a) Compute the output by hand. What is the output length?

(b) Write out the Toeplitz weight matrix $\mathbf{W}_{\text{conv}}$ for this operation. Verify that $\mathbf{W}_{\text{conv}} \mathbf{x}$ gives the same result.

(c) How many independent parameters does $\mathbf{W}_{\text{conv}}$ have? How many would a fully connected layer mapping 6 inputs to 4 outputs have?
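A short sketch for checking (a) and (b), assuming the deep-learning convention in which "convolution" means cross-correlation (which is what F.conv1d computes):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([2., 1., 3., 0., 4., 2.])
w = torch.tensor([1., 0., -1.])

# F.conv1d computes cross-correlation, the usual deep-learning convention
out = F.conv1d(x.view(1, 1, -1), w.view(1, 1, -1)).view(-1)
print(out)  # output length is 6 - 3 + 1 = 4

# Toeplitz form: row i holds the kernel shifted i positions to the right
W_conv = torch.zeros(4, 6)
for i in range(4):
    W_conv[i, i:i + 3] = w
assert torch.allclose(W_conv @ x, out)
```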


Exercise 8.2 (*)

For a 2D convolution with the following parameters, compute the output spatial dimensions:

(a) Input: $32 \times 32$, kernel: $3 \times 3$, stride: 1, padding: 1

(b) Input: $224 \times 224$, kernel: $7 \times 7$, stride: 2, padding: 3

(c) Input: $56 \times 56$, kernel: $3 \times 3$, stride: 2, padding: 1

(d) For each case, verify your answer using PyTorch:

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)
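For a closed-form check, the standard output-size formula can be wrapped in a small helper (a sketch; the floor division matches PyTorch's behavior for non-divisible strides):

```python
def conv_out_size(size: int, k: int, stride: int = 1, padding: int = 0) -> int:
    """Standard formula: floor((size + 2 * padding - k) / stride) + 1."""
    return (size + 2 * padding - k) // stride + 1

for size, k, s, p in [(32, 3, 1, 1), (224, 7, 2, 3), (56, 3, 2, 1)]:
    print(conv_out_size(size, k, s, p))
```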

Exercise 8.3 (*)

Compute the total number of learnable parameters (weights + biases) for each layer:

(a) nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)

(b) nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)

(c) nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1, groups=256) (depthwise)

(d) nn.Conv2d(256, 512, kernel_size=1) (pointwise)
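Your hand counts can be verified by summing parameter tensors directly; the helper below is a generic sketch:

```python
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    """Total learnable parameters (weights + biases)."""
    return sum(p.numel() for p in module.parameters())

for layer in [
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1, groups=256),
    nn.Conv2d(256, 512, kernel_size=1),
]:
    print(n_params(layer))
```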


Exercise 8.4 (*)

A CNN has three convolutional layers, each with $3 \times 3$ kernels and stride 1.

(a) What is the receptive field of a neuron in the output of layer 3?

(b) If layers 1 and 2 use stride 2 instead, what is the receptive field at layer 3?

(c) If we insert $2 \times 2$ max pooling after layers 1 and 2 (each with stride 2), what is the effective receptive field at layer 3?
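A small helper implementing the standard receptive-field recursion may be useful for checking your hand calculations (a sketch; it treats a pooling layer as just another layer with its own kernel and stride):

```python
def receptive_field(layers: list[tuple[int, int]]) -> int:
    """Receptive field of stacked layers given (kernel, stride) pairs.

    Each layer widens the field by (k - 1) times the product of all
    earlier strides (the "jump" between adjacent output positions).
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # part (a)
```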


Architecture and Design

Exercise 8.5 (**)

(a) Show algebraically that two stacked $3 \times 3$ convolutional layers have the same receptive field as a single $5 \times 5$ convolutional layer.

(b) For $C$ input and output channels, compute the parameter count for: (i) two stacked $3 \times 3$ conv layers, and (ii) one $5 \times 5$ conv layer. What is the ratio?

(c) The stacked $3 \times 3$ approach also has two ReLU activations instead of one. Why does this additional nonlinearity help the network, beyond the parameter savings?


Exercise 8.6 (**)

Consider a ResNet-50 bottleneck block operating on a feature map with $C = 512$ channels.

(a) Draw the block diagram showing the three convolutions ($1 \times 1$, $3 \times 3$, $1 \times 1$) with their channel dimensions.

(b) Compute the total parameter count for this bottleneck block (ignoring batch norm parameters).

(c) Compute the parameter count for a plain two-layer residual block ($3 \times 3 \to 3 \times 3$) with 512 channels. What is the compression factor of the bottleneck design?


Exercise 8.7 (**)

EfficientNet uses compound scaling with $d = \alpha^\phi$, $w = \beta^\phi$, $r = \gamma^\phi$ subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$.

(a) Explain why depth multiplies FLOPs linearly but width and resolution multiply FLOPs quadratically.

(b) The EfficientNet-B0 search found $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$. Verify that the constraint is approximately satisfied.

(c) For $\phi = 3$ (EfficientNet-B3), compute the depth, width, and resolution multipliers. If the B0 baseline has 18 layers, 32 base channels, and $224 \times 224$ resolution, what are the B3 values (approximately)?

(d) What total FLOP multiplier does $\phi = 3$ correspond to?
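Parts (b) and (c) reduce to a few lines of arithmetic, which you can sanity-check numerically once you have worked them out by hand:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15

# part (b): the compound-scaling constraint alpha * beta^2 * gamma^2
print(alpha * beta**2 * gamma**2)

# part (c): depth, width, and resolution multipliers for phi = 3
phi = 3
print(alpha**phi, beta**phi, gamma**phi)
```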


Exercise 8.8 (**)

Compare the parameter counts of the following architectures for classifying $224 \times 224 \times 3$ images into 1000 classes. For each, compute the approximate total parameter count:

(a) A single fully connected layer (no hidden layers): input $\to$ 1000 classes.

(b) VGG-16 (use the published architecture: 13 conv layers + 3 FC layers with 4096, 4096, and 1000 neurons).

(c) ResNet-50 (use the published architecture with global average pooling — no FC layers with 4096 neurons).

(d) Which part of VGG-16 contains the most parameters? Why did ResNet's use of global average pooling solve this?


Implementation

Exercise 8.9 (**)

Implement the convolution operation from scratch (no nn.Conv2d or F.conv2d). Your implementation should:

(a) Accept a 4D input tensor (batch, C_in, H, W), a weight tensor (C_out, C_in, k, k), stride, and padding.

(b) Return the correct output tensor (batch, C_out, H_out, W_out).

(c) Verify that your output matches F.conv2d on random inputs to within floating-point tolerance.

import torch
import torch.nn.functional as F

def my_conv2d(
    x: torch.Tensor,
    weight: torch.Tensor,
    stride: int = 1,
    padding: int = 0,
) -> torch.Tensor:
    """Implement 2D convolution from scratch using loops.

    Args:
        x: Input tensor of shape (batch, C_in, H, W).
        weight: Weight tensor of shape (C_out, C_in, k_h, k_w).
        stride: Convolution stride.
        padding: Zero-padding added to both sides.

    Returns:
        Output tensor of shape (batch, C_out, H_out, W_out).
    """
    # YOUR CODE HERE
    pass

# Test
x = torch.randn(2, 3, 8, 8)
w = torch.randn(16, 3, 3, 3)
out_mine = my_conv2d(x, w, stride=1, padding=1)
out_ref = F.conv2d(x, w, stride=1, padding=1)
print(f"Max difference: {(out_mine - out_ref).abs().max().item():.2e}")

Exercise 8.10 (**)

Implement data augmentation for CIFAR-10 and measure its impact:

(a) Train a SimpleCNN (from Section 8.10) on CIFAR-10 for 50 epochs without data augmentation (only normalize).

(b) Train the same architecture for 50 epochs with standard augmentation (random crop with padding, horizontal flip, color jitter).

(c) Plot training loss and validation accuracy for both runs. What is the gap between training and validation accuracy in each case?

(d) What does the gap tell you about overfitting, and how does augmentation change it?


Exercise 8.11 (**)

Implement Grad-CAM for the SimpleCNN trained in Exercise 8.10:

(a) Select 5 correctly classified images from different classes.

(b) Generate Grad-CAM heatmaps for the predicted class.

(c) For one of the images, generate Grad-CAM heatmaps for two different classes (the predicted class and one other). How do the heatmaps differ?

(d) Find an example where the model makes a correct prediction but attends to a suspicious region (e.g., background instead of the object). What does this suggest about the model's learned features?
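The core Grad-CAM mechanics (capture activations and gradients at a target layer, weight channels by spatially averaged gradients) fit in a few lines. The toy model below is only a stand-in for your trained SimpleCNN; swap in your own model and target layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# stand-in model; replace with your trained SimpleCNN
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
target_layer = model[0]

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 32, 32)
logits = model(x)
logits[0, logits.argmax()].backward()   # gradient of the predicted-class logit

weights = grads["v"].mean(dim=(2, 3), keepdim=True)  # spatially averaged grads
cam = F.relu((weights * acts["v"]).sum(dim=1))       # weighted channel sum + ReLU
print(cam.shape)  # (1, 32, 32): upsample and overlay on the input image
```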


Exercise 8.12 (**)

Implement the Mixup data augmentation technique and integrate it into the training loop:

(a) Write a mixup_data function that takes a batch of images and labels and returns the mixed versions.

import torch

def mixup_data(
    x: torch.Tensor,
    y: torch.Tensor,
    alpha: float = 0.2,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, float]:
    """Apply Mixup augmentation to a batch.

    Args:
        x: Batch of images, shape (B, C, H, W).
        y: Batch of labels, shape (B,).
        alpha: Beta distribution parameter.

    Returns:
        Tuple of (mixed_x, y_a, y_b, lam) for computing Mixup loss.
    """
    # YOUR CODE HERE
    pass

(b) Modify the training loop to use the Mixup loss: $\mathcal{L} = \lambda \cdot \text{CE}(\hat{y}, y_a) + (1 - \lambda) \cdot \text{CE}(\hat{y}, y_b)$.


(c) Train with Mixup ($\alpha = 0.2$) for 50 epochs and compare validation accuracy with the augmentation-only baseline from Exercise 8.10.


Residual Connections and Depth

Exercise 8.13 (***)

Derive the gradient flow through a 3-block residual network.

(a) Write out the forward computation for $\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3$ with skip connections.

(b) Expand $\frac{\partial \mathcal{L}}{\partial \mathbf{h}_0}$ using the chain rule. Show that it contains $2^3 = 8$ terms corresponding to all possible paths through or around each block.

(c) Argue that at least one of these 8 paths has no multiplicative Jacobian terms (the "gradient highway"). How does this prevent vanishing gradients?
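As a starting point for (a) and (b), the forward recursion and the resulting Jacobian product take the form:

$$\mathbf{h}_l = \mathbf{h}_{l-1} + F_l(\mathbf{h}_{l-1}), \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{h}_0} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_3} \prod_{l=1}^{3} \left( \mathbf{I} + \frac{\partial F_l}{\partial \mathbf{h}_{l-1}} \right)$$

Multiplying out the three binomial factors yields the $2^3$ path terms of part (b); the all-identity term is the gradient highway of part (c).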


Exercise 8.14 (***)

Empirically measure gradient flow in deep networks.

(a) Build a plain network (no skip connections) with 20 layers of $3 \times 3$ convolutions and batch normalization. Pass a random input through the network and compute $\|\frac{\partial \mathcal{L}}{\partial \mathbf{h}_l}\|$ for each layer $l$.

(b) Build an equivalent residual network (same layers but with skip connections). Compute the same gradient norms.

(c) Plot the gradient norms for both networks. You should observe the gradient magnitude decaying steadily with depth in the plain network but remaining stable in the residual network.

(d) Repeat with 50 layers. At what depth does the plain network's gradient become numerically zero?

import torch
import torch.nn as nn
from typing import List

def measure_gradient_norms(
    model: nn.Module,
    input_shape: tuple = (1, 64, 32, 32),
) -> List[float]:
    """Measure gradient norms at each layer of a network.

    Args:
        model: A sequential network.
        input_shape: Shape of the random input tensor.

    Returns:
        List of gradient L2 norms, one per layer.
    """
    # YOUR CODE HERE
    pass

Exercise 8.15 (***)

Compare pre-activation and post-activation ResNet blocks.

(a) Implement both variants:

- Post-activation: $\mathbf{y} = \text{ReLU}(\text{BN}(\text{Conv}(\text{ReLU}(\text{BN}(\text{Conv}(\mathbf{x}))))) + \mathbf{x})$
- Pre-activation: $\mathbf{y} = \text{Conv}(\text{ReLU}(\text{BN}(\text{Conv}(\text{ReLU}(\text{BN}(\mathbf{x})))))) + \mathbf{x}$

(b) Train a 56-layer ResNet on CIFAR-10 with each variant for 100 epochs.

(c) Compare final validation accuracy and training curves. Which converges faster?

(d) Explain why the pre-activation variant performs better for very deep networks by analyzing the shortcut path.


1D Convolutions and Text

Exercise 8.16 (**)

Implement a 1D CNN for sentiment classification on a text dataset.

(a) Use the IMDB movie review dataset (available via torchtext or download). Build a vocabulary of the top 20,000 words and tokenize reviews to a maximum length of 256 tokens.

(b) Train the TextCNN from Section 8.12 with kernel sizes $[2, 3, 4, 5]$ and 128 filters each.

(c) Report test accuracy. Compare with a bag-of-words baseline (mean of word embeddings + linear classifier).

(d) Which kernel size contributes most to classification? (Hint: train four models, each with a single kernel size, and compare.)


Exercise 8.17 (**)

Analyze the learned filters of a 1D CNN trained on text.

(a) After training the TextCNN from Exercise 8.16, extract the top-5 most activated n-grams for 10 selected filters (5 from the kernel-3 bank, 5 from the kernel-5 bank). An n-gram's activation is the dot product of its embedding sequence with the filter weights.

(b) Do the extracted n-grams have interpretable patterns? For a sentiment task, do you find filters that activate on phrases like "not good", "waste of time", or "absolutely loved"?

(c) Compare with the Grad-CAM-like approach: for a given review, which tokens have the highest gradient magnitude with respect to the positive-class logit?


Exercise 8.18 (**)

Apply a 1D CNN to a time series classification task.

(a) Generate synthetic multivariate time series from 5 classes, where each class is defined by a distinct temporal pattern (e.g., upward trend, periodic oscillation with different frequencies, step function, damped oscillation, random walk with drift).

(b) Train the TemporalCNN from Section 8.12 on this data.

(c) Compare with: (i) a global average of input features + linear classifier, (ii) an MLP on the flattened time series.

(d) Which model handles the periodic patterns best? Why does the CNN's locality bias help for this task?


Depthwise Separable and Efficient Convolutions

Exercise 8.19 (**)

(a) Prove that the FLOP ratio of depthwise separable to standard convolution is:

$$\frac{H_{\text{out}} W_{\text{out}} (C_{\text{in}} k^2 + C_{\text{in}} C_{\text{out}})}{H_{\text{out}} W_{\text{out}} C_{\text{in}} C_{\text{out}} k^2} = \frac{1}{C_{\text{out}}} + \frac{1}{k^2}$$

(b) For a layer with $C_{\text{in}} = C_{\text{out}} = 512$ and $k = 3$, compute the FLOP and parameter reduction factors.

(c) Benchmark the wall-clock time of a standard nn.Conv2d(512, 512, 3) vs. a DepthwiseSeparableConv(512, 512, 3) on a $32 \times 32$ feature map with batch size 64. Is the speedup consistent with the FLOP ratio? If not, explain why (hint: memory bandwidth and kernel launch overhead).
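A scaled-down timing harness along these lines can serve as a starting point for (c). The separable layer here is assembled inline from a grouped (depthwise) and a $1 \times 1$ (pointwise) convolution, since the chapter's DepthwiseSeparableConv class is not repeated here; scale the sizes up to those in the exercise for the actual benchmark:

```python
import time
import torch
import torch.nn as nn

def bench(layer: nn.Module, x: torch.Tensor, iters: int = 10) -> float:
    """Average forward-pass wall-clock time in seconds."""
    with torch.no_grad():
        layer(x)                      # warm-up pass
        t0 = time.perf_counter()
        for _ in range(iters):
            layer(x)
        return (time.perf_counter() - t0) / iters

c = 128                               # scaled down from 512 for a quick run
x = torch.randn(8, c, 16, 16)
standard = nn.Conv2d(c, c, 3, padding=1)
separable = nn.Sequential(
    nn.Conv2d(c, c, 3, padding=1, groups=c),  # depthwise
    nn.Conv2d(c, c, 1),                       # pointwise
)
print(f"standard:  {bench(standard, x):.5f}s")
print(f"separable: {bench(separable, x):.5f}s")
```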


Exercise 8.20 (***)

Implement an inverted residual block (MobileNetV2). In this design, the bottleneck is inverted: the input and output have few channels, and the intermediate $3 \times 3$ depthwise convolution operates on expanded channels.

import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2 inverted residual block.

    Architecture: 1x1 expand -> 3x3 depthwise -> 1x1 project + shortcut.

    Args:
        in_channels: Input channels.
        out_channels: Output channels.
        expansion: Channel expansion ratio.
        stride: Stride for the depthwise convolution.
    """

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        expansion: int = 6,
        stride: int = 1,
    ) -> None:
        super().__init__()
        # YOUR CODE HERE
        pass

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # YOUR CODE HERE
        pass

(a) Implement the block.

(b) Compare parameter counts with a standard residual block for $C_{\text{in}} = C_{\text{out}} = 32$, expansion = 6.

(c) Why is the shortcut only applied when stride = 1 and $C_{\text{in}} = C_{\text{out}}$?


Advanced and Research-Level

Exercise 8.21 (***)

Convolution as matrix multiplication via im2col.

The im2col operation rearranges input patches into columns of a matrix, converting convolution into a single matrix multiplication.

(a) For a $4 \times 4$ single-channel input and a $3 \times 3$ kernel with stride 1, write out the im2col matrix. What are its dimensions?

(b) Show that the convolution output equals the matrix product $\mathbf{W}_{\text{reshape}} \cdot \mathbf{X}_{\text{im2col}}$, where $\mathbf{W}_{\text{reshape}}$ is the kernel reshaped into a row vector.

(c) Implement im2col and verify:

import torch

def im2col(
    x: torch.Tensor,
    kernel_size: int,
    stride: int = 1,
    padding: int = 0,
) -> torch.Tensor:
    """Convert input patches to columns for matrix-multiply convolution.

    Args:
        x: Input of shape (1, C_in, H, W).
        kernel_size: Spatial kernel size.
        stride: Convolution stride.
        padding: Zero padding.

    Returns:
        Column matrix of shape (C_in * k * k, H_out * W_out).
    """
    # YOUR CODE HERE
    pass

(d) Why do GPU frameworks (cuDNN) use implicit im2col rather than explicit matrix construction?
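When checking part (c), torch.nn.functional.unfold — PyTorch's built-in im2col — makes a convenient reference implementation:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 2, 4, 4)
cols = F.unfold(x, kernel_size=3)      # built-in im2col
print(cols.shape)                      # (1, C_in * k * k, H_out * W_out)

# convolution as a matrix product, reshaped back to the spatial output
w = torch.randn(5, 2, 3, 3)
out = (w.view(5, -1) @ cols[0]).view(1, 5, 2, 2)
assert torch.allclose(out, F.conv2d(x, w), atol=1e-5)
```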


Exercise 8.22 (***)

Translation equivariance under different operations. For each of the following operations, determine whether it is translation equivariant, approximately equivariant, or not equivariant. Prove or provide a counterexample.

(a) Convolution with stride 1 and "same" padding

(b) Convolution with stride 2

(c) Max pooling with $2 \times 2$ window and stride 2

(d) Batch normalization (during training vs. during inference)

(e) Global average pooling

(f) A fully connected layer
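For part (c), a one-dimensional impulse gives a quick numerical counterexample:

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 8)
x[0, 0, 2] = 1.0
shifted = torch.roll(x, shifts=1, dims=-1)   # impulse moves from index 2 to 3

out = F.max_pool1d(x, kernel_size=2, stride=2)
out_shifted = F.max_pool1d(shifted, kernel_size=2, stride=2)
print(out, out_shifted)
# the two outputs are identical: the input shift is absorbed inside a pooling
# window, so the output is NOT a shifted copy and strict equivariance fails
```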


Exercise 8.23 (***)

Derive the backward pass for a convolutional layer.

Given the forward pass $\mathbf{Y} = \mathbf{W} \star \mathbf{X}$ (cross-correlation), derive:

(a) $\frac{\partial \mathcal{L}}{\partial \mathbf{W}}$ — the gradient with respect to the kernel (needed for weight updates). Show that this is a cross-correlation between the input and the upstream gradient.

(b) $\frac{\partial \mathcal{L}}{\partial \mathbf{X}}$ — the gradient with respect to the input (needed for backpropagation to earlier layers). Show that this is a full convolution (with kernel flipping) of the upstream gradient with the kernel.

(c) Verify your derivations numerically using PyTorch autograd:

import torch

x = torch.randn(1, 1, 5, 5, requires_grad=True)
w = torch.randn(1, 1, 3, 3, requires_grad=True)
y = torch.nn.functional.conv2d(x, w)
loss = y.sum()
loss.backward()

# Compare x.grad and w.grad with your derived formulas

Exercise 8.24 (***)

Dilated (atrous) convolutions. A dilated convolution inserts $d - 1$ zeros between kernel elements, increasing the receptive field without increasing parameters.

(a) For a $3 \times 3$ kernel with dilation $d$, what is the effective kernel size? What is the receptive field of a single layer?

(b) Show that a stack of three $3 \times 3$ dilated convolutions with dilation rates $d = 1, 2, 4$ achieves the same receptive field as a $15 \times 15$ standard convolution. How many parameters does each require for $C$ channels?

(c) Implement a dilated convolution using nn.Conv2d(..., dilation=d) and verify the receptive field empirically by backpropagating through the network and checking the gradient support on the input.
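A stride-1 helper for checking your receptive-field arithmetic in (a) and (b): each dilated layer widens the field by $(k - 1)\,d$.

```python
def dilated_rf(layers: list[tuple[int, int]]) -> int:
    """Receptive field of stacked stride-1 convs given (kernel, dilation) pairs."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

print(dilated_rf([(3, 1), (3, 2), (3, 4)]))  # the stack from part (b)
```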


Exercise 8.25 (***)

Progressive project extension. In the StreamRec progressive project, you built 1D CNN embeddings for item descriptions (Milestone M3).

(a) Train the TextCNN with three different configurations: (i) kernel sizes $[3]$ only, (ii) kernel sizes $[3, 5, 7]$, (iii) kernel sizes $[3, 4, 5]$ with twice as many filters. Compare embedding quality using category classification accuracy.

(b) Extract embeddings for the full item catalog and compute pairwise cosine similarities. Do items in the same category cluster together? Visualize with t-SNE or UMAP.

(c) Compare the CNN embeddings with TF-IDF bag-of-words vectors (from scikit-learn). For the top-10 nearest neighbors of 5 randomly selected items, which representation produces more semantically coherent neighbors?


Exercise 8.26 (****)

Connections between CNNs and signal processing.

(a) Show that a convolutional layer with a fixed (non-learned) Gaussian kernel implements a low-pass filter. What determines the cutoff frequency?

(b) Show that the commonly learned Sobel-like edge detector ($[[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]$) implements a high-pass filter in the horizontal direction.

(c) Modern CNNs often learn filters that, in the first layer, approximate Gabor filters (oriented, frequency-selective). Why does this emerge from gradient descent on natural images? Connect your answer to the sparse coding hypothesis (Olshausen & Field, 1996).

(d) Zhang (2019, "Making Convolutional Networks Shift-Invariant Again") showed that aliasing in strided convolutions and max pooling breaks strict translation equivariance. Explain the aliasing mechanism and how anti-aliased downsampling (applying a low-pass filter before striding) addresses it.
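For part (a), the frequency response of a Gaussian kernel can be inspected directly; the sketch below uses a binomial approximation of a Gaussian, which is a common discrete stand-in:

```python
import torch

g = torch.tensor([1., 4., 6., 4., 1.]) / 16.0   # binomial approx. of a Gaussian
H = torch.fft.rfft(g, n=64).abs()               # magnitude response (zero-padded)
print(H[0].item(), H[-1].item())  # approximately 1 at DC, approximately 0 at Nyquist
```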


Exercise 8.27 (****)

The Lottery Ticket Hypothesis applied to CNNs.

Frankle and Carbin (2019) showed that dense, randomly initialized networks contain sparse subnetworks ("winning tickets") that can be trained to full accuracy.

(a) Train a ResNet-18 on CIFAR-10 to convergence. Apply global unstructured magnitude pruning at 50%, 70%, and 90% sparsity. Retrain each pruned network from the original initialization. Report accuracy at each sparsity level.

(b) Compare with random pruning at the same sparsity levels (prune random weights instead of lowest-magnitude). Is magnitude pruning significantly better?

(c) Does the pruning distribution differ across layers? Specifically, are early layers (low-level feature detectors) pruned less aggressively than later layers? What does this tell you about the importance of low-level features?


Exercise 8.28 (****)

Climate DL: Designing a downscaling CNN.

(a) The climate downscaling problem takes a $16 \times 16 \times 6$ input (6 climate variables at coarse resolution) and produces a $64 \times 64 \times 1$ output (high-resolution temperature). This is a $4\times$ spatial upscaling. Design two architectures:

- (i) Encoder-decoder with transposed convolutions for upsampling
- (ii) Encoder with sub-pixel convolution (pixel shuffle) for upsampling

(b) Implement both architectures in PyTorch. Compare parameter counts.

(c) The sub-pixel convolution (Shi et al., 2016) rearranges a $(C \cdot r^2, H, W)$ feature map into a $(C, rH, rW)$ output, where $r$ is the upscaling factor. Implement this using nn.PixelShuffle(r) and explain why it avoids the checkerboard artifacts that plague transposed convolutions.

(d) What loss function would you use for this task? Why might L1 loss be preferred over L2 for spatial prediction? How would you incorporate a perceptual loss or adversarial loss to improve the sharpness of downscaled fields?
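For part (c), the shape bookkeeping of nn.PixelShuffle matches the climate task's $4\times$ factor exactly:

```python
import torch
import torch.nn as nn

r = 4
ps = nn.PixelShuffle(r)
x = torch.randn(1, 1 * r**2, 16, 16)    # a (C * r^2, H, W) feature map
y = ps(x)
print(y.shape)                           # torch.Size([1, 1, 64, 64])
```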