Chapter 14: Quiz

Test your understanding of Convolutional Neural Networks. Try to answer each question before reading the answer.


Question 1

What are the two key inductive biases that CNNs embed into their architecture?

**Answer:** **Local connectivity** and **weight sharing** (also called parameter sharing). Local connectivity means each neuron connects only to a small spatial region of the input (the receptive field). Weight sharing means the same filter weights are applied at every spatial position, making the convolution operation equivariant to translation.

Question 2

For an input of shape (1, 3, 128, 128) passed through `nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2)`, what is the output shape?

**Answer:** Using the output-size formula: $H_{out} = \lfloor (128 - 5 + 2 \times 2) / 2 \rfloor + 1 = \lfloor 127/2 \rfloor + 1 = 63 + 1 = 64$. By symmetry, $W_{out} = 64$ as well. Output shape: **(1, 64, 64, 64)**.
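A quick way to confirm the arithmetic is to run the layer on a dummy tensor; a minimal sketch assuming PyTorch is available:

```python
import torch
import torch.nn as nn

# Verify the output shape of Conv2d(3, 64, kernel_size=5, stride=2, padding=2)
conv = nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2)
x = torch.randn(1, 3, 128, 128)
print(conv(x).shape)  # torch.Size([1, 64, 64, 64])
```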

Question 3

What is the difference between translation equivariance and translation invariance? Which does convolution provide?

**Answer:** **Translation equivariance** means that if the input is shifted, the output shifts by the same amount. The convolution operation is translation equivariant -- if a cat moves to the right in the image, the feature map response also moves to the right. **Translation invariance** means the output does not change when the input is shifted. Pooling layers (especially global average pooling) provide a degree of translation invariance. A full CNN typically achieves approximate translation invariance through the combination of equivariant convolutions and invariant pooling layers.

Question 4

How many learnable parameters does `nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=True)` have?

**Answer:** Weights: $128 \times 64 \times 3 \times 3 = 73{,}728$. Bias: $128$. Total: **73,856 parameters**.
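The count can be checked directly in PyTorch (a minimal sketch):

```python
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=True)
# weight: 128 x 64 x 3 x 3 = 73,728; bias: 128
print(sum(p.numel() for p in conv.parameters()))  # 73856
```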

Question 5

Explain why three stacked 3x3 convolutional layers are preferred over a single 7x7 layer, even though they have the same effective receptive field.

**Answer:** Three 3x3 layers have **fewer parameters** ($3 \times 9C^2 = 27C^2$ vs. $49C^2$) and **more nonlinear activations** (three ReLUs vs. one), making them more expressive. The additional nonlinearities allow the network to learn more complex functions within the same receptive field. This insight was the foundation of the VGG architecture.
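The parameter comparison is easy to verify; the sketch below assumes $C = 64$ channels and omits biases:

```python
import torch.nn as nn

C = 64
stack_3x3 = nn.Sequential(*[nn.Conv2d(C, C, 3, padding=1, bias=False) for _ in range(3)])
single_7x7 = nn.Conv2d(C, C, 7, padding=3, bias=False)

print(sum(p.numel() for p in stack_3x3.parameters()))   # 3 * 9 * C^2 = 110,592
print(sum(p.numel() for p in single_7x7.parameters()))  # 49 * C^2    = 200,704
```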

Question 6

What problem did ResNet's skip connections solve? How do they help gradient flow during backpropagation?

**Answer:** Skip connections solved the **degradation problem**: the observation that adding more layers to a deep network could paradoxically *decrease* accuracy, not due to overfitting but due to optimization difficulties. For gradient flow, the skip connection gives: $\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot (1 + \frac{\partial F(x)}{\partial x})$. The constant term of **1** ensures that gradients can flow directly through the identity path, preventing vanishing gradients even in very deep networks. This creates an "information highway" from the loss back to the earliest layers.
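For concreteness, here is a minimal sketch of a residual block with an identity skip connection (stride 1 and matching channel counts are assumed, so no projection shortcut is needed):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # the identity term "+ x" contributes the constant 1 in the gradient
        return torch.relu(out + x)

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```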

Question 7

What is the purpose of a 1x1 convolution? Give three specific uses.

**Answer:** A 1x1 convolution performs a linear transformation across channels at each spatial position independently. Three uses:

1. **Channel dimensionality reduction**: Reducing the number of channels before expensive larger convolutions (used in Inception modules and ResNet bottleneck blocks)
2. **Channel dimensionality expansion**: Increasing channels (used in bottleneck block expansion)
3. **Adding nonlinearity**: When followed by an activation function, it adds a nonlinear transformation without changing spatial dimensions (used in Network in Network)

Additional uses include cross-channel feature interaction and producing task-specific output maps (e.g., in segmentation heads).
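A small sketch showing that a 1x1 convolution changes only the channel dimension:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)
reduce = nn.Conv2d(256, 64, kernel_size=1)   # channel reduction, as in a bottleneck block
expand = nn.Conv2d(64, 256, kernel_size=1)   # channel expansion back

y = reduce(x)
print(y.shape)          # torch.Size([1, 64, 28, 28]) -- spatial size unchanged
print(expand(y).shape)  # torch.Size([1, 256, 28, 28])
```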

Question 8

What is the difference between max pooling and average pooling? When is each preferred?

**Answer:** **Max pooling** takes the maximum value in each pooling window, making it sensitive to the *presence* of a feature regardless of its exact position. It is preferred in hidden layers because it provides strong translation invariance and tends to preserve the most salient features. **Average pooling** computes the mean in each window, providing a smoother downsampling. It is preferred as **Global Average Pooling (GAP)** at the end of modern architectures, where it replaces large fully connected layers and dramatically reduces parameter count and overfitting.
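A minimal sketch contrasting a local max pool with global average pooling:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 8, 8)
print(nn.MaxPool2d(2)(x).shape)          # torch.Size([1, 64, 4, 4]) -- local downsampling
print(nn.AdaptiveAvgPool2d(1)(x).shape)  # torch.Size([1, 64, 1, 1]) -- global average pooling
```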

Question 9

How does depthwise separable convolution differ from standard convolution? What is the approximate parameter reduction factor?

**Answer:** A **standard convolution** jointly learns spatial and cross-channel patterns with a single operation. A **depthwise separable convolution** factorizes this into two steps: (1) a depthwise convolution that applies a separate spatial filter to each channel, and (2) a pointwise (1x1) convolution that mixes channels. The parameter ratio is approximately $\frac{1}{C_{out}} + \frac{1}{k^2}$. For typical values ($C_{out} = 256$, $k = 3$), this ratio is roughly $\frac{1}{9}$, i.e. an **8-9x reduction** in parameters and computation.
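A sketch comparing parameter counts for a standard 3x3 convolution and a depthwise separable replacement at 256 channels (biases omitted):

```python
import torch.nn as nn

standard = nn.Conv2d(256, 256, 3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1, groups=256, bias=False),  # depthwise: one spatial filter per channel
    nn.Conv2d(256, 256, 1, bias=False),                         # pointwise: mixes channels
)

print(sum(p.numel() for p in standard.parameters()))   # 589,824
print(sum(p.numel() for p in separable.parameters()))  # 2,304 + 65,536 = 67,840 (~8.7x fewer)
```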

Question 10

When fine-tuning a pretrained CNN on a new dataset, why should you use a smaller learning rate for the pretrained layers than for the new classification head?

**Answer:** Pretrained layers already contain useful, well-optimized features. A large learning rate would make large updates to these weights, potentially **destroying the learned representations** -- a phenomenon sometimes called "catastrophic forgetting." The new classification head, initialized randomly, needs larger updates to learn meaningful weights quickly. Using **discriminative learning rates** (smaller for pretrained, larger for new layers) allows the pretrained features to adapt gently while the new head trains more aggressively.
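One way to implement discriminative learning rates is with optimizer parameter groups. This is a sketch assuming a recent torchvision ResNet-18 backbone; the learning rates are illustrative:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 10)  # new, randomly initialized head

head_ids = {id(p) for p in model.fc.parameters()}
backbone_params = [p for p in model.parameters() if id(p) not in head_ids]

optimizer = torch.optim.SGD([
    {"params": backbone_params, "lr": 1e-4},        # gentle updates for pretrained features
    {"params": model.fc.parameters(), "lr": 1e-2},  # larger updates for the new head
], momentum=0.9)
```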

Question 11

Why do we set `bias=False` in convolutional layers that are immediately followed by batch normalization?

**Answer:** Batch normalization includes a learnable shift parameter $\beta$ that serves the same role as the bias term. Since batch normalization first subtracts the mean (which would absorb any constant bias), and then applies $\gamma \cdot \hat{x} + \beta$, the convolution's bias is redundant. Setting `bias=False` saves a small number of parameters and avoids redundant computation.
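In code this is the familiar Conv-BN-ReLU pattern (a minimal sketch):

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 128, 3, padding=1, bias=False),  # bias omitted: BatchNorm's beta plays that role
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)
```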

Question 12

What is a dilated (atrous) convolution? How does it differ from a standard convolution, and what is its primary advantage?

**Answer:** A **dilated convolution** inserts gaps (zeros) between kernel elements. A dilation rate of $d$ means there are $d-1$ zeros between consecutive kernel elements. The effective kernel size becomes $k + (k-1)(d-1)$. Its primary advantage is that it **increases the receptive field without increasing the number of parameters or reducing spatial resolution**. This is particularly valuable in tasks like semantic segmentation, where you need both fine-grained spatial detail and large-context understanding. Standard alternatives -- larger kernels (more parameters) or pooling/striding (reduced resolution) -- sacrifice one of these properties.
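A small sketch showing that a dilated 3x3 convolution keeps both the parameter count and, with matching padding, the spatial resolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)
# dilation=2 spreads the 3x3 kernel over a 5x5 area using only 9 weights;
# padding=2 keeps the output the same size as the input
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)

print(dilated(x).shape)                              # torch.Size([1, 1, 32, 32])
print(sum(p.numel() for p in dilated.parameters()))  # 10 (9 weights + 1 bias)
```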

Question 13

In the context of transfer learning, what does "feature extraction" mean, and how does it differ from "fine-tuning"?

**Answer:** **Feature extraction**: All pretrained convolutional layers are **frozen** (parameters not updated). Only the new classification head is trained. The pretrained CNN acts as a fixed feature extractor. This is faster, requires less data, and carries less risk of overfitting.

**Fine-tuning**: Some or all pretrained layers are **unfrozen** and allowed to update during training, typically with a small learning rate. This allows the features to adapt to the new domain but requires more data and careful hyperparameter tuning to avoid overfitting or destroying pretrained features.

Feature extraction is preferred for small datasets similar to the pretraining domain; fine-tuning is preferred for larger datasets or when the target domain is very different.
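A minimal feature-extraction sketch, assuming a recent torchvision with the weights API:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False                      # freeze the pretrained backbone

model.fc = nn.Linear(model.fc.in_features, 10)   # new head; its parameters remain trainable

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['fc.weight', 'fc.bias']
```

For fine-tuning, one would instead re-enable `requires_grad` on selected later layers and train them with a small learning rate.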

Question 14

What is Grad-CAM, and how does it produce class-specific localization maps?

**Answer:** **Grad-CAM** (Gradient-weighted Class Activation Mapping) produces a coarse heatmap showing which spatial regions of a feature map are most important for predicting a specific class. The process:

1. Perform a forward pass and compute the score for the target class
2. Backpropagate the gradient to the target convolutional layer
3. Compute the **global average** of the gradients over the spatial dimensions to get channel importance weights $\alpha_k$
4. Compute a weighted combination of the feature maps: $L^c = \text{ReLU}(\sum_k \alpha_k A^k)$
5. Apply ReLU to keep only positive contributions (features that increase the class score)

The resulting heatmap can be upsampled and overlaid on the original image to visualize which regions the model considers important for its prediction.
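The whole procedure fits in a short sketch. This assumes a torchvision ResNet-18 and hooks its last convolutional stage (`layer4`), with a random tensor standing in for a preprocessed image:

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
feats, grads = {}, {}

def hook(module, inputs, output):
    feats["a"] = output.detach()                                 # activations A^k
    output.register_hook(lambda g: grads.update(g=g.detach()))   # gradients dL/dA^k

model.layer4.register_forward_hook(hook)

x = torch.randn(1, 3, 224, 224)   # stand-in for a normalized input image
scores = model(x)
target = scores[0].argmax()
scores[0, target].backward()      # backpropagate the target class score

alpha = grads["g"].mean(dim=(2, 3), keepdim=True)   # spatially averaged gradients (channel weights)
cam = F.relu((alpha * feats["a"]).sum(dim=1))       # weighted sum over channels, then ReLU
cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0]  # upsample for overlay
print(cam.shape)  # torch.Size([1, 224, 224])
```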

Question 15

What is the receptive field of a neuron? How does it grow with network depth?

**Answer:** The **receptive field** is the region of the original input image that can influence a particular neuron's activation. It grows with each additional layer according to:

$$r_L = r_{L-1} + (k_L - 1) \cdot \prod_{i=1}^{L-1} s_i$$

For stride-1 layers with kernel size $k$, the receptive field grows linearly with depth: each additional layer adds $k-1$ pixels. Strided layers or pooling multiply the growth rate of subsequent layers. For example, three 3x3 stride-1 layers have a receptive field of $7 \times 7$, but adding a stride-2 layer before them would double the effective growth, resulting in a much larger receptive field.

Question 16

Why is it important to use the same normalization (mean and standard deviation) when using a pretrained model on new data?

**Answer:** Pretrained models were trained with specific normalization statistics (e.g., ImageNet's mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]). The learned weights are calibrated to expect inputs with this distribution. Using different normalization would shift and scale the input features in ways the pretrained filters are not designed to handle, causing:

1. Feature maps with unexpected magnitudes
2. Batch normalization layers receiving out-of-distribution inputs
3. Significant degradation in performance

This is one of the most common mistakes in transfer learning and can lead to hours of debugging.
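In torchvision this typically looks like the following preprocessing pipeline (the resize/crop sizes follow the usual ImageNet convention):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet channel means
                         std=[0.229, 0.224, 0.225]),   # ImageNet channel stds
])
```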

Question 17

Explain the bottleneck block design in ResNet-50/101/152. Why does it use 1x1 convolutions?

**Answer:** The **bottleneck block** uses a three-layer design: **1x1 reduce -> 3x3 conv -> 1x1 expand**.

1. The first 1x1 convolution **reduces** the channel dimension (e.g., from 256 to 64)
2. The 3x3 convolution operates on the reduced channels (much cheaper: 64 channels instead of 256)
3. The second 1x1 convolution **expands** the channels back (64 to 256)

This design dramatically reduces computation. A direct 3x3 conv on 256 channels would need $256 \times 256 \times 9 = 589{,}824$ parameters. The bottleneck needs $256 \times 64 + 64 \times 64 \times 9 + 64 \times 256 = 69{,}632$ parameters -- about **8.5x fewer**. This efficiency allows ResNet-50/101/152 to be much deeper than the basic block variants while using comparable compute.
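The parameter comparison can be checked directly (biases and BatchNorm omitted for clarity):

```python
import torch.nn as nn

direct = nn.Conv2d(256, 256, 3, padding=1, bias=False)
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1, bias=False),             # 1x1 reduce
    nn.Conv2d(64, 64, 3, padding=1, bias=False),   # 3x3 on reduced channels
    nn.Conv2d(64, 256, 1, bias=False),             # 1x1 expand
)

print(sum(p.numel() for p in direct.parameters()))      # 589,824
print(sum(p.numel() for p in bottleneck.parameters()))  # 69,632
```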

Question 18

What is Global Average Pooling, and why did it replace fully connected layers in modern CNN architectures?

**Answer:** **Global Average Pooling (GAP)** computes the spatial average of each feature map, reducing the tensor from $(N, C, H, W)$ to $(N, C, 1, 1)$. Benefits over fully connected layers:

1. **Dramatic parameter reduction**: The FC layers account for ~124M of VGG-16's 138M total parameters. GAP eliminates this entirely.
2. **Reduced overfitting**: Fewer parameters means less capacity to memorize training data.
3. **Spatial flexibility**: The network can accept inputs of any spatial dimension (no fixed flattening size).
4. **More interpretable**: Each channel's average directly corresponds to the confidence for a specific spatial feature, enabling Class Activation Mapping.

GAP was introduced in Network in Network (Lin et al., 2013) and adopted by GoogLeNet, ResNet, and virtually all subsequent architectures.

Question 19

A CNN is trained on CIFAR-10 (32x32 images) with the following architecture. What is the shape of the tensor at each stage?

```
Input -> Conv2d(3, 32, 3, padding=1) -> ReLU -> MaxPool2d(2,2) ->
Conv2d(32, 64, 3, padding=1) -> ReLU -> MaxPool2d(2,2) ->
Conv2d(64, 128, 3, padding=1) -> ReLU -> AdaptiveAvgPool2d(1) ->
Flatten -> Linear(128, 10)
```
**Answer:**

| Stage | Shape |
|-------|-------|
| Input | (N, 3, 32, 32) |
| After Conv2d(3, 32, 3, padding=1) | (N, 32, 32, 32) |
| After ReLU | (N, 32, 32, 32) |
| After MaxPool2d(2,2) | (N, 32, 16, 16) |
| After Conv2d(32, 64, 3, padding=1) | (N, 64, 16, 16) |
| After ReLU | (N, 64, 16, 16) |
| After MaxPool2d(2,2) | (N, 64, 8, 8) |
| After Conv2d(64, 128, 3, padding=1) | (N, 128, 8, 8) |
| After ReLU | (N, 128, 8, 8) |
| After AdaptiveAvgPool2d(1) | (N, 128, 1, 1) |
| After Flatten | (N, 128) |
| After Linear(128, 10) | (N, 10) |
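The table can be reproduced by stepping through an equivalent `nn.Sequential` model (a minimal sketch):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(128, 10),
)

x = torch.randn(4, 3, 32, 32)  # a batch of N=4 CIFAR-10-sized images
for layer in model:
    x = layer(x)
    print(f"{type(layer).__name__:20s} {tuple(x.shape)}")
```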

Question 20

What is the key difference between the ResNet basic block and the ResNet bottleneck block? When is each used?

**Answer:** The **basic block** uses two 3x3 convolutions with a skip connection: `x -> Conv3x3 -> BN -> ReLU -> Conv3x3 -> BN -> (+x) -> ReLU`. It is used in **ResNet-18 and ResNet-34**.

The **bottleneck block** uses a 1x1 -> 3x3 -> 1x1 pattern with a skip connection: `x -> Conv1x1 -> BN -> ReLU -> Conv3x3 -> BN -> ReLU -> Conv1x1 -> BN -> (+x) -> ReLU`. It is used in **ResNet-50, ResNet-101, and ResNet-152**.

The bottleneck block is more parameter-efficient for deeper networks because the 1x1 convolutions reduce and then restore the channel dimension, making the expensive 3x3 convolution operate on fewer channels. This allows for much deeper networks without proportional increases in computation.

Question 21

True or False: In PyTorch, `nn.Conv2d` implements mathematical convolution (with kernel flipping).

**Answer:** **False.** PyTorch's `nn.Conv2d` implements **cross-correlation**, not true mathematical convolution. True convolution would flip the kernel horizontally and vertically before sliding it over the input. Since the kernel weights are *learned*, this distinction is irrelevant in practice -- the network simply learns the flipped version of whatever a true convolution would learn. This convention is standard across virtually all deep learning frameworks.

Question 22

How does data augmentation function as a regularizer in CNN training? Name four common augmentation techniques.

**Answer:** Data augmentation creates **modified versions of training images** that preserve the label but vary the appearance, effectively increasing the diversity of the training set. This prevents the network from memorizing specific pixel patterns and forces it to learn invariances. Four common techniques:

1. **Random horizontal flipping**: Mirrors the image left-to-right (appropriate for most natural images but not, e.g., text)
2. **Random cropping** (with padding): Shifts the image slightly, forcing spatial invariance
3. **Color jittering**: Randomly perturbs brightness, contrast, saturation, and hue, forcing color invariance
4. **Random rotation**: Rotates the image by small angles, forcing rotational invariance

Advanced techniques include Cutout, Mixup, CutMix, and AutoAugment.
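A typical training-time pipeline combining these four techniques might look like this sketch (the parameter values are illustrative, for 32x32 inputs):

```python
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),                     # 1. horizontal flip
    transforms.RandomCrop(32, padding=4),                  # 2. random crop with padding
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.05),      # 3. color jittering
    transforms.RandomRotation(15),                         # 4. small random rotations
    transforms.ToTensor(),
])
```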

Question 23

Explain the concept of "groups" in `nn.Conv2d(in_channels, out_channels, kernel_size, groups=g)`. What happens when `groups=1` and when `groups=in_channels`?

**Answer:** The `groups` parameter partitions input and output channels into `g` separate groups, where each group performs an independent convolution.

- **`groups=1`** (default): Standard convolution. Every input channel is connected to every output channel. Parameters: $C_{out} \times C_{in} \times k^2$.
- **`groups=in_channels`** (and `out_channels=in_channels`): **Depthwise convolution**. Each input channel has its own separate filter. Parameters: $C_{in} \times k^2$. This is the spatial filtering step of depthwise separable convolution.
- **Intermediate values** (e.g., `groups=2`): The input is split into `g` groups, each processed by $C_{out}/g$ filters acting on $C_{in}/g$ channels. This was used in the original AlexNet to split computation across two GPUs.
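Parameter counts make the effect of `groups` concrete (64 input/output channels, 3x3 kernels, biases omitted):

```python
import torch.nn as nn

standard  = nn.Conv2d(64, 64, 3, padding=1, groups=1, bias=False)   # 64*64*9   = 36,864
depthwise = nn.Conv2d(64, 64, 3, padding=1, groups=64, bias=False)  # 64*1*9    =    576
grouped   = nn.Conv2d(64, 64, 3, padding=1, groups=2, bias=False)   # 2*32*32*9 = 18,432

for conv in (standard, depthwise, grouped):
    print(sum(p.numel() for p in conv.parameters()))
```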

Question 24

What is the degradation problem, and why can it not be explained by overfitting?

**Answer:** The **degradation problem** is the observation that adding more layers to a deep network can lead to *higher* training error (not just higher test error). If the problem were overfitting, we would expect training error to decrease (or stay the same) while only test error increases. The fact that *training* error increases indicates that the optimizer is unable to find a good solution -- deeper networks are harder to optimize, not just more prone to overfitting.

The degradation problem suggests that it is difficult for deep networks to learn identity mappings when needed. ResNet's skip connections solve this by providing an explicit identity path: $y = F(x) + x$. Learning $F(x) = 0$ (the residual) is much easier than learning $H(x) = x$ (the full mapping).

Question 25

You need to classify 500 X-ray images into 3 categories using a pretrained ImageNet model. Describe your transfer learning approach step by step.

**Answer:** Step-by-step approach:

1. **Choose a pretrained model**: ResNet-18 or ResNet-34 (not too large for 500 images)
2. **Preprocess data**: Resize to 224x224, apply ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
3. **Split data**: 80/20 train/validation split with stratification
4. **Feature extraction first**: Freeze all pretrained layers, replace the final FC layer with `nn.Linear(512, 3)`, and train only the new layer for 10-20 epochs
5. **Optional fine-tuning**: If feature extraction accuracy plateaus, unfreeze the last 1-2 residual blocks and fine-tune with a very small learning rate (e.g., 1e-5 for pretrained layers, 1e-3 for the new head)
6. **Data augmentation**: Use random rotation, horizontal flip, brightness/contrast jitter to increase effective dataset size
7. **Regularization**: Use dropout before the classifier and potentially weight decay
8. **Monitor carefully**: With only 500 images, watch for overfitting by tracking the gap between training and validation accuracy

Feature extraction is preferred here because the dataset is small and X-rays are somewhat different from ImageNet images. Fine-tuning the later layers may help since X-ray features differ from natural image features.

Question 26

What is activation maximization, and what does it reveal about CNN learned representations?

**Answer:** **Activation maximization** is a visualization technique that generates a synthetic input image that maximally activates a specific neuron or filter. Starting from random noise, gradient ascent is performed on the *input* (not the weights) to maximize the target neuron's activation. It reveals what each filter has "learned to detect":

- **Early layers**: Simple patterns like edges at specific orientations, color gradients, and frequency patterns (similar to Gabor filters)
- **Middle layers**: Textures, patterns, and object parts (fur textures, wheels, eyes)
- **Late layers**: Complex, class-specific concepts (dog faces, keyboards, flowers)

This progressive increase in abstraction with depth confirms the hierarchical feature learning principle of CNNs. It also helps diagnose learned spurious correlations.
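A bare-bones sketch of the gradient-ascent loop, assuming a torchvision ResNet-18; real implementations add regularizers such as jitter, blurring, or total-variation penalties, and the class index here is arbitrary:

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)                      # we optimize the input, not the weights

img = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from random noise
optimizer = torch.optim.Adam([img], lr=0.05)
target_class = 207                               # arbitrary ImageNet class index (illustrative)

for _ in range(100):
    optimizer.zero_grad()
    loss = -model(img)[0, target_class]          # maximize the logit = minimize its negative
    loss.backward()
    optimizer.step()
```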

Question 27

How does batch normalization work differently during training vs. inference in a CNN?

**Answer:** **During training**: Batch normalization computes the mean $\mu_B$ and variance $\sigma_B^2$ from the *current mini-batch*. It normalizes using these statistics, then applies learnable scale $\gamma$ and shift $\beta$. It also maintains exponential moving averages of $\mu$ and $\sigma^2$ (the running statistics).

**During inference** (`model.eval()`): Batch normalization uses the **running statistics** (accumulated during training) instead of computing batch statistics. This makes the output deterministic and independent of other samples in the batch. The formula becomes a fixed linear transformation: $y = \frac{\gamma}{\sqrt{\sigma_{running}^2 + \epsilon}} x + \left(\beta - \frac{\gamma \mu_{running}}{\sqrt{\sigma_{running}^2 + \epsilon}}\right)$.

A common bug is forgetting to call `model.eval()` during inference, which causes batch norm to use (unstable) mini-batch statistics instead of the stable running statistics.
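The difference is easy to observe directly (a minimal sketch):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(8)
x = torch.randn(16, 8, 4, 4)

bn.train()
y_train = bn(x)   # uses this mini-batch's mean/var and updates the running statistics
bn.eval()
y_eval = bn(x)    # uses the running statistics accumulated so far

print(torch.allclose(y_train, y_eval))  # False -- the two modes normalize differently
```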