Chapter 14: Exercises

Conceptual Exercises

Exercise 14.1: Convolution Output Dimensions

Calculate the output dimensions for each of the following convolutional layers applied to an input of shape (1, 3, 64, 64):

a) nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=0)
b) nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=2)
c) nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)
d) nn.Conv2d(3, 32, kernel_size=7, stride=2, padding=3)
e) nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=2, dilation=2)

Show your work using the output dimension formula from Section 14.2.
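
After working through the formula by hand, you can check your answers empirically by running each layer on a dummy input and printing the output shape. A minimal sketch (the layer configurations repeat parts (a) through (e) above):

import torch
import torch.nn as nn

# Each entry mirrors one of the layers in parts (a)-(e).
layers = {
    "a": nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=0),
    "b": nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=2),
    "c": nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    "d": nn.Conv2d(3, 32, kernel_size=7, stride=2, padding=3),
    "e": nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=2, dilation=2),
}

x = torch.randn(1, 3, 64, 64)  # dummy input with the stated shape
for part, layer in layers.items():
    print(part, tuple(layer(x).shape))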

Exercise 14.2: Parameter Counting

For each architecture below, compute the total number of learnable parameters (weights and biases). Compare the results and explain the differences.

a) A single fully connected layer from a flattened 32x32 RGB image to 128 neurons
b) A convolutional layer, nn.Conv2d(3, 128, kernel_size=3, padding=1), followed by global average pooling
c) A convolutional layer, nn.Conv2d(3, 128, kernel_size=5, padding=2), followed by global average pooling
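
A quick way to check your hand counts is to instantiate each layer and sum over its parameter tensors. A minimal sketch:

import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Sum the number of elements of every learnable parameter tensor."""
    return sum(p.numel() for p in module.parameters())

print(count_params(nn.Linear(32 * 32 * 3, 128)))                  # part (a)
print(count_params(nn.Conv2d(3, 128, kernel_size=3, padding=1)))  # part (b)
print(count_params(nn.Conv2d(3, 128, kernel_size=5, padding=2)))  # part (c)
# Global average pooling contributes no learnable parameters.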

Exercise 14.3: Receptive Field Computation

Compute the receptive field at the output of each architecture:

a) Three stacked Conv2d(C, C, 3, padding=1) layers (stride 1)
b) Two stacked Conv2d(C, C, 5, padding=2) layers (stride 1)
c) One Conv2d(C, C, 3, padding=1, stride=2) followed by one Conv2d(C, C, 3, padding=1, stride=1)
d) One Conv2d(C, C, 3, padding=2, dilation=2) followed by one Conv2d(C, C, 3, padding=1)

Which architecture achieves the largest receptive field? Which uses the fewest parameters?

Exercise 14.4: Translation Equivariance vs. Invariance

Explain the difference between translation equivariance and translation invariance. For each of the following operations, state whether it provides equivariance, invariance, both, or neither:

a) A single convolutional layer
b) A max pooling layer
c) A convolutional layer followed by max pooling
d) Global average pooling
e) A fully connected layer applied to a flattened feature map

Exercise 14.5: The Role of Nonlinearity

Consider a network with two convolutional layers but no activation functions between them. Show mathematically that this is equivalent to a single convolutional layer with a larger kernel. What does this imply about the importance of nonlinear activations in CNNs? (Hint: Recall the discussion of linear composition in Chapter 13.)

Exercise 14.6: Batch Normalization in CNNs

Explain why batch normalization in CNNs computes statistics per channel rather than per neuron (as in fully connected networks). How many learnable parameters does nn.BatchNorm2d(64) have? What are the running statistics it maintains, and how are they used differently during training and inference?

Exercise 14.7: Skip Connections and Gradient Flow

Consider a residual block with output $y = F(x) + x$. Derive the gradient of the loss $\mathcal{L}$ with respect to $x$ in terms of $\frac{\partial \mathcal{L}}{\partial y}$ and $\frac{\partial F}{\partial x}$. Explain why this expression helps prevent vanishing gradients, referencing the gradient flow problems discussed in Chapter 13.

Exercise 14.8: Depthwise Separable Convolutions

A standard Conv2d(128, 256, kernel_size=3, padding=1) is applied to a 14x14 feature map.

a) How many parameters does this layer have (including bias)?
b) How many multiply-accumulate (MAC) operations does the forward pass require?
c) Replace this with a depthwise separable convolution (depthwise 3x3 + pointwise 1x1). How many parameters and MACs does this require?
d) What is the reduction factor for parameters and MACs?

Exercise 14.9: Transfer Learning Strategy Selection

For each scenario below, recommend a transfer learning strategy (feature extraction, fine-tune last layers, or fine-tune all layers) and justify your choice:

a) 500 medical X-ray images, 3 classes, using ImageNet-pretrained ResNet-50
b) 100,000 food images, 100 classes, using ImageNet-pretrained ResNet-50
c) 2,000 satellite images for land-use classification, using ImageNet-pretrained VGG-16
d) 50,000 images of handwritten Chinese characters, 3,000 classes, using ImageNet-pretrained ResNet-50

Exercise 14.10: Architecture Comparison

Fill in the following table for each architecture. Then discuss the key insight that each architecture introduced.

| Architecture | Year | Depth | Params (approx.) | Top-5 Error (ImageNet) | Key Innovation |
|--------------|------|-------|------------------|------------------------|----------------|
| LeNet-5      |      |       |                  |                        |                |
| AlexNet      |      |       |                  |                        |                |
| VGG-16       |      |       |                  |                        |                |
| ResNet-50    |      |       |                  |                        |                |
| MobileNetV2  |      |       |                  |                        |                |

Programming Exercises

Exercise 14.11: Manual Convolution

Implement a 2D convolution operation from scratch (without using nn.Conv2d or F.conv2d). Your function should support arbitrary kernel sizes, stride, and zero padding. Verify that your implementation produces the same output as PyTorch's built-in convolution for several test cases.

import torch


def manual_conv2d(
    input_tensor: torch.Tensor,
    kernel: torch.Tensor,
    stride: int = 1,
    padding: int = 0,
) -> torch.Tensor:
    """Perform 2D convolution manually.

    Args:
        input_tensor: Input of shape (H, W).
        kernel: Kernel of shape (kH, kW).
        stride: Stride of the convolution.
        padding: Zero padding to add.

    Returns:
        Output tensor of the convolution.
    """
    # Your implementation here
    pass
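
Once your implementation is complete, you can compare it against PyTorch's functional convolution. A minimal verification sketch, assuming your function follows the signature above and returns a 2D tensor (F.conv2d expects 4D tensors, so the input and kernel are reshaped accordingly):

import torch
import torch.nn.functional as F

x = torch.randn(9, 9)
k = torch.randn(3, 3)

for stride, padding in [(1, 0), (1, 1), (2, 1), (2, 2)]:
    expected = F.conv2d(
        x.view(1, 1, 9, 9), k.view(1, 1, 3, 3),
        stride=stride, padding=padding,
    ).squeeze()
    result = manual_conv2d(x, k, stride=stride, padding=padding)
    assert torch.allclose(result, expected, atol=1e-5), (stride, padding)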

Exercise 14.12: Visualize Feature Maps

Using a pretrained ResNet-18, load an image of your choice and visualize the feature maps at layers conv1, layer1, layer2, layer3, and layer4. Create a grid of the first 16 feature maps at each layer. Describe the patterns you observe and how they change with depth.
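
A convenient way to capture intermediate outputs is a forward hook. A minimal sketch, assuming a recent torchvision with pretrained weights available (the plotting grid itself is left to you):

import torch
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1").eval()
captured = {}

def save_output(name):
    # Returns a hook that stores the layer's output under the given name.
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

for name in ["conv1", "layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(save_output(name))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))  # replace with your preprocessed image

for name, fmap in captured.items():
    print(name, tuple(fmap.shape))  # e.g. layer1 -> (1, 64, 56, 56)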

Exercise 14.13: Build LeNet-5 for MNIST

Implement LeNet-5 and train it on the MNIST dataset. Report training and validation accuracy at each epoch. Experiment with replacing:

a) Tanh activations with ReLU
b) Average pooling with max pooling
c) Both changes together

How does each change affect convergence speed and final accuracy?

Exercise 14.14: VGG Block Implementation

Implement a configurable VGG-style network where the user can specify the number of convolutional layers in each block. Train VGG-11, VGG-13, and VGG-16 variants on CIFAR-10 (with reduced channel counts to fit in memory) and compare their performance. Does deeper always mean better?

Exercise 14.15: ResNet from Scratch

Implement ResNet-18 from scratch using the BasicBlock class from Section 14.6.4. Train it on CIFAR-10. Then create a variant without skip connections (set the shortcut to zero). Compare:

a) Training loss curves
b) Validation accuracy
c) Gradient magnitudes at each layer (see the sketch below)
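
For part (c), one way to record gradient magnitudes is to iterate over the named parameters after a backward pass. A minimal sketch, assuming `model` and the backward call come from your training loop:

# After loss.backward(), record the mean absolute gradient of each conv weight.
grad_stats = {}
for name, param in model.named_parameters():
    if param.grad is not None and "conv" in name:
        grad_stats[name] = param.grad.abs().mean().item()

for name, magnitude in grad_stats.items():
    print(f"{name}: {magnitude:.3e}")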

Exercise 14.16: The Effect of Kernel Size

Train three versions of the SimpleCNN from Section 14.5 with kernel sizes of 3x3, 5x5, and 7x7. Keep all other hyperparameters identical. Compare:

a) Number of parameters
b) Training time per epoch
c) Final validation accuracy
d) Qualitative examination of learned first-layer filters

Exercise 14.17: Pooling Ablation Study

Using the SimpleCNN architecture, compare three downsampling strategies:

a) Max pooling (2x2, stride 2)
b) Average pooling (2x2, stride 2)
c) Strided convolution (stride 2, no pooling)

Train each variant on CIFAR-10 for 50 epochs and report the results. Which strategy works best, and why do you think that is?

Exercise 14.18: Depthwise Separable CNN

Replace all standard convolutional layers in a simple CNN with depthwise separable convolutions. Train both versions on CIFAR-10 and compare:

a) Total parameter count
b) Training time per epoch
c) Final validation accuracy
d) Inference time for a batch of 100 images
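
As a starting point, a depthwise separable layer can be written as a depthwise convolution (groups equal to the input channels) followed by a 1x1 pointwise convolution. A minimal sketch of such a drop-in block (the class name is illustrative):

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""

    def __init__(self, in_channels: int, out_channels: int) -> None:
        super().__init__()
        # groups=in_channels makes each filter act on a single input channel.
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size=3, padding=1, groups=in_channels
        )
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))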

Exercise 14.19: Transfer Learning Comparison

Using a pretrained ResNet-18, compare three transfer learning approaches on a small subset (1,000 images) of a dataset of your choice:

a) Feature extraction (freeze all layers, train only the classifier)
b) Fine-tune the last residual block + classifier
c) Fine-tune the entire network with discriminative learning rates (see the sketch below)

Report accuracy, training time, and the number of epochs to convergence for each approach.
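
For part (c), discriminative learning rates can be set up with optimizer parameter groups. A minimal sketch, assuming torchvision's resnet18 with its final fully connected layer replaced by a classifier for your dataset; the specific learning rates are illustrative:

import torch.optim as optim
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1")
# Replace model.fc with a classifier for your dataset before building the optimizer.

param_groups = [
    {"params": model.fc.parameters(), "lr": 1e-2},      # new classifier: largest lr
    {"params": model.layer4.parameters(), "lr": 1e-3},  # last residual block
    {"params": [p for name, p in model.named_parameters()
                if not name.startswith(("fc", "layer4"))], "lr": 1e-4},  # earlier layers
]
optimizer = optim.SGD(param_groups, lr=1e-4, momentum=0.9)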

Exercise 14.20: Grad-CAM Implementation

Implement Grad-CAM from scratch and apply it to a pretrained ResNet-18. For at least 5 images from different ImageNet classes:

a) Generate Grad-CAM heatmaps at layer4
b) Generate Grad-CAM heatmaps at layer3 and layer2
c) Compare how the heatmaps change across layers
d) Overlay the heatmaps on the original images
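
The core of Grad-CAM is capturing both the activations of the chosen layer and the gradients flowing back into it. A minimal sketch of that capturing step, assuming torchvision's pretrained resnet18 (the channel weighting, ReLU, and upsampling of the heatmap are left to you):

import torch
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1").eval()
activations, gradients = {}, {}

def forward_hook(module, inputs, output):
    activations["value"] = output.detach()

def backward_hook(module, grad_input, grad_output):
    gradients["value"] = grad_output[0].detach()

model.layer4.register_forward_hook(forward_hook)
model.layer4.register_full_backward_hook(backward_hook)

x = torch.randn(1, 3, 224, 224)  # replace with a real preprocessed image
scores = model(x)
scores[0, scores.argmax()].backward()

# activations["value"] and gradients["value"] now have shape (1, 512, 7, 7);
# average the gradients over the spatial dimensions to get the channel weights.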

Exercise 14.21: 1x1 Convolution Bottleneck

Implement an Inception-like module that uses 1x1 convolutions to reduce the channel dimension before applying larger convolutions:

import torch.nn as nn


class InceptionModule(nn.Module):
    """Simplified Inception module with 1x1 bottlenecks."""
    # Implement: 1x1 branch, 1x1->3x3 branch, 1x1->5x5 branch, pool->1x1 branch
    pass

Compare the parameter count and computational cost with and without the 1x1 bottlenecks.

Exercise 14.22: Data Augmentation Study

Train a ResNet-18 on CIFAR-10 with the following augmentation strategies and compare results:

a) No augmentation
b) Random horizontal flip only
c) Random crop + horizontal flip
d) Random crop + flip + color jitter
e) All of the above + Cutout (randomly zero out a 16x16 patch)

Plot training and validation accuracy curves for all five experiments on the same graph.
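
As a starting point, the torchvision pipelines for parts (a) through (d) might look like the sketch below; for Cutout in part (e), torchvision's RandomErasing (applied after ToTensor) is a close substitute if you prefer not to implement it yourself.

from torchvision import transforms

pipelines = {
    "none": transforms.Compose([transforms.ToTensor()]),
    "flip": transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]),
    "crop_flip": transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]),
    "crop_flip_jitter": transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
        transforms.ToTensor(),
    ]),
}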

Exercise 14.23: Learning Rate Scheduling for CNNs

Train a ResNet-18 on CIFAR-10 with four different learning rate schedules, using concepts from Chapter 12:

a) Constant learning rate (0.1)
b) Step decay (multiply by 0.1 at epochs 50 and 75)
c) Cosine annealing
d) One-cycle policy

Train for 100 epochs each and compare convergence behavior and final accuracy.
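
The built-in schedulers in torch.optim.lr_scheduler cover parts (b) through (d). A minimal sketch, assuming `model` and `train_loader` are already defined and a 100-epoch budget:

import torch.optim as optim
from torch.optim import lr_scheduler

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# (a) constant: simply use no scheduler.
step = lr_scheduler.MultiStepLR(optimizer, milestones=[50, 75], gamma=0.1)  # (b)
cosine = lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)               # (c)
one_cycle = lr_scheduler.OneCycleLR(                                        # (d)
    optimizer, max_lr=0.1, epochs=100, steps_per_epoch=len(train_loader)
)
# Create only the scheduler for the run at hand; step (b) and (c) once per epoch,
# but step OneCycleLR once per training batch.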

Exercise 14.24: CNN for 1D Signals

Adapt the SimpleCNN architecture to work with 1D signals. Apply it to a time series classification task (e.g., ECG heartbeat classification or activity recognition). Replace Conv2d with Conv1d and MaxPool2d with MaxPool1d. How does performance compare to a simple fully connected network on the same task?

Exercise 14.25: Architecture Search Experiment

Design and test 5 different CNN architectures for CIFAR-10, varying:

- Number of convolutional blocks (2-5)
- Number of filters per block (16-256)
- Kernel size (3 or 5)
- Use of batch normalization (yes/no)
- Use of skip connections (yes/no)

Create a table summarizing the parameter count, training time, and validation accuracy for each architecture. Which design choices have the largest impact on performance?

Challenge Exercises

Exercise 14.26: Implement Transposed Convolution

Implement a 2D transposed convolution from scratch. Show that applying a transposed convolution followed by a regular convolution (with the same kernel) does not recover the original input. Under what conditions does the composition of convolution and transposed convolution approximate the identity?

Exercise 14.27: Channel Attention (SE Block)

Implement a Squeeze-and-Excitation (SE) block that learns channel-wise attention weights. Integrate it into a ResNet basic block and train on CIFAR-10. Does the SE block improve performance? Visualize the learned channel attention weights for different layers.

Exercise 14.28: Mixed Precision Training

Implement mixed precision training for a CNN using PyTorch's torch.cuda.amp module. Compare training time and memory usage against standard float32 training. Is there any impact on final accuracy? (Requires a CUDA-capable GPU.)
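
A minimal sketch of the mixed precision training step with torch.cuda.amp, assuming `model`, `optimizer`, `criterion`, and `train_loader` are already defined and a CUDA device is available:

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    with autocast():                       # run the forward pass in mixed precision
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()          # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                 # unscale gradients, then optimizer step
    scaler.update()                        # adjust the scale factor for the next step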

Exercise 14.29: Knowledge Distillation

Train a large "teacher" CNN (e.g., ResNet-50) on CIFAR-10. Then train a small "student" CNN (e.g., a 4-layer network) using knowledge distillation, where the student learns to match the teacher's soft probability outputs. Compare the student's accuracy when trained with:

a) Hard labels only
b) Soft labels from the teacher only
c) A combination of hard and soft labels
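
The distillation objective is typically a temperature-scaled KL divergence between the student's and teacher's softened outputs, optionally blended with the usual cross-entropy on hard labels. A minimal sketch (the temperature T and weight alpha are hyperparameters you will need to tune):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Blend soft-label KL divergence with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes are comparable across temperatures
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard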

Exercise 14.30: Adversarial Robustness

Generate adversarial examples for a pretrained ResNet-18 using the Fast Gradient Sign Method (FGSM):

$$x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x \mathcal{L}(\theta, x, y))$$

a) For $\epsilon$ values of 0.01, 0.05, 0.1, and 0.3, compute the classification accuracy on adversarial examples.
b) Visualize the adversarial perturbations and the resulting adversarial images.
c) Train a model using adversarial training (include adversarial examples in each training batch) and compare its robustness to the standard model.
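
A minimal FGSM sketch following the formula above, assuming `model` is a pretrained classifier in eval mode and the inputs are already preprocessed tensors:

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """Generate FGSM adversarial examples for a batch (x, y)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()  # step in the direction of the gradient sign
    # Optionally clamp x_adv to the valid pixel range for your preprocessing.
    return x_adv.detach()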

Exercise 14.31: Implementing Group Convolution

Implement group convolution from scratch and verify it against nn.Conv2d(groups=g). Show that:

a) groups=1 is standard convolution
b) groups=C_in is depthwise convolution
c) For intermediate values, group convolution partitions channels into independent groups

Discuss the trade-off between parameter reduction and representational capacity.

Exercise 14.32: Class Activation Mapping Without Gradients

Implement the original CAM (Class Activation Mapping) technique that works without gradients, but requires the architecture to have a GAP layer followed by a single linear layer. Compare the resulting heatmaps to Grad-CAM on the same images. When do they agree, and when do they differ?

Exercise 14.33: Progressive Resizing

Implement a training strategy that starts with low-resolution images (e.g., 64x64) and progressively increases to the full resolution (224x224) during training. Compare convergence speed and final accuracy against training at full resolution from the start. (This technique was popularized by fast.ai and is related to curriculum learning from Chapter 7.)

Exercise 14.34: Feature Map Statistics

For a pretrained ResNet-18, pass 1,000 randomly chosen ImageNet images through the network and compute the following statistics for each convolutional layer's output feature maps:

a) Mean activation value per channel
b) Sparsity (fraction of activations that are zero after ReLU)
c) Correlation between channels

Discuss what these statistics reveal about the learned representations. Are deeper layers sparser? Are channels in the same layer correlated?

Exercise 14.35: Build a CNN Visualizer

Create an interactive visualization tool (using matplotlib or a web framework) that allows a user to:

a) Load a pretrained CNN (ResNet-18, VGG-16, etc.)
b) Upload an image
c) Select a layer to visualize
d) View feature maps, Grad-CAM heatmaps, and filter weights

This exercise combines concepts from this chapter with practical software engineering.