# Chapter 8: Quiz

Test your understanding of convolutional neural networks. Answers follow each question.


## Question 1

What two constraints, when applied to a fully connected layer, produce a convolutional layer?

**Answer:** **Locality** (each output neuron connects only to a small spatial region of the input, not the entire input) and **weight sharing** (the same set of weights is used at every spatial position). Locality makes the weight matrix sparse (banded), and weight sharing makes the nonzero entries identical across positions (Toeplitz structure). Together, they reduce parameters from $O(n^2)$ to $O(k^2)$, where $k$ is the kernel size.

## Question 2

A Conv2d layer has in_channels=64, out_channels=128, kernel_size=3, and bias=True. How many learnable parameters does it have?

**Answer:** Parameters $= C_{\text{out}} \times (C_{\text{in}} \times k_h \times k_w + 1) = 128 \times (64 \times 3 \times 3 + 1) = 128 \times 577 = 73{,}856$. The weight tensor has shape $(128, 64, 3, 3)$ with $128 \times 64 \times 9 = 73{,}728$ entries, plus 128 bias terms.
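The count is easy to sanity-check with a few lines of pure Python mirroring the formula above (no framework needed):

```python
def conv2d_param_count(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Learnable parameters of a square-kernel Conv2d: one (c_in, k, k)
    kernel per output channel, plus one bias scalar per output channel."""
    return c_out * (c_in * k * k + (1 if bias else 0))

print(conv2d_param_count(64, 128, 3))              # 73856
print(conv2d_param_count(64, 128, 3, bias=False))  # 73728 (weights only)
```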

## Question 3

What is the output spatial size of a Conv2d with input $56 \times 56$, kernel $3 \times 3$, stride 2, and padding 1?

**Answer:** $H_{\text{out}} = \lfloor \frac{56 + 2(1) - 3}{2} \rfloor + 1 = \lfloor \frac{55}{2} \rfloor + 1 = 27 + 1 = 28$. The output is $28 \times 28$.
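The same formula in code, as a reusable helper:

```python
def conv_output_size(n: int, k: int, stride: int = 1, pad: int = 0) -> int:
    """Output spatial size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * pad - k) // stride + 1

print(conv_output_size(56, k=3, stride=2, pad=1))  # 28
print(conv_output_size(56, k=3, stride=1, pad=1))  # 56 ("same" padding)
```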

## Question 4

True or False: A single convolutional layer is translation invariant.

**Answer:** **False.** A convolutional layer is translation **equivariant**, not invariant. If the input is shifted by $\Delta$, the output feature map shifts by $\Delta$ (scaled by stride). The values do not change — only their positions. Translation **invariance** (the output does not change at all when the input is shifted) is achieved through pooling and global aggregation, not convolution alone.
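Equivariance can be demonstrated with a toy 1D example (pure Python, valid-mode sliding window): shifting the input by one position shifts the output by one position, away from the boundary.

```python
def cross_correlate(x, w):
    """Valid-mode 1D cross-correlation: y[j] = sum_t w[t] * x[j + t]."""
    k = len(w)
    return [sum(w[t] * x[j + t] for t in range(k)) for j in range(len(x) - k + 1)]

x = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 0.0]
shifted = [0.0] + x[:-1]        # input shifted right by one position
w = [1.0, -1.0, 2.0]

y = cross_correlate(x, w)
y_shifted = cross_correlate(shifted, w)
# Equivariance: the output of the shifted input is the shifted output.
print(y_shifted[1:] == y[:-1])  # True
```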

## Question 5

Explain the difference between max pooling and global average pooling. When is each used in a typical CNN?

**Answer:** **Max pooling** selects the maximum activation within a local window (e.g., $2 \times 2$) and is used within the network to progressively reduce spatial dimensions while retaining the strongest features. It provides local translation invariance. **Global average pooling (GAP)** averages each channel over the entire spatial extent, reducing a $(C, H, W)$ feature map to a $(C,)$ vector. It is used at the end of the network (before the classifier) to aggregate spatial information into a single representation. GAP replaces the large fully connected layers used in older architectures (VGG), dramatically reducing parameters and overfitting.
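A minimal sketch of both operations on a single-channel feature map (lists of lists stand in for tensors):

```python
def max_pool_2x2(fm):
    """2x2 max pooling with stride 2 on a 2D feature map."""
    return [[max(fm[i][j], fm[i][j + 1], fm[i + 1][j], fm[i + 1][j + 1])
             for j in range(0, len(fm[0]), 2)]
            for i in range(0, len(fm), 2)]

def global_avg_pool(fm):
    """Average one channel over its entire spatial extent -> one scalar."""
    return sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))

fm = [[1, 3, 2, 0],
      [4, 2, 1, 1],
      [0, 6, 5, 2],
      [1, 2, 3, 8]]
print(max_pool_2x2(fm))     # [[4, 2], [6, 8]] -- strongest feature per window
print(global_avg_pool(fm))  # 2.5625 -- one number per channel
```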

## Question 6

What is the receptive field of a neuron in the output of 5 stacked $3 \times 3$ convolutional layers (all with stride 1, no pooling)?

**Answer:** $R = 1 + L(k - 1) = 1 + 5 \times 2 = 11$. The neuron "sees" an $11 \times 11$ region of the input. Each $3 \times 3$ layer adds 2 to the receptive field (one pixel on each side).
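The formula generalizes to mixed kernel sizes and strides by tracking the cumulative stride ("jump") alongside the receptive field, a standard recurrence:

```python
def receptive_field(layers):
    """layers: sequence of (kernel_size, stride) pairs, input to output.
    Each layer grows r by (k - 1) times the cumulative stride so far."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

print(receptive_field([(3, 1)] * 5))  # 11
```
With stride > 1 the growth compounds: for example, `[(3, 1), (2, 2), (3, 1)]` gives a receptive field of 8, because the final 3×3 layer operates on a grid whose cells are 2 input pixels apart.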

## Question 7

In a residual block $\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$, what happens to the gradient $\frac{\partial \mathcal{L}}{\partial \mathbf{x}}$?

**Answer:**
$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \left(\frac{\partial \mathcal{F}}{\partial \mathbf{x}} + \mathbf{I}\right) = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \frac{\partial \mathcal{F}}{\partial \mathbf{x}} + \frac{\partial \mathcal{L}}{\partial \mathbf{y}}.$$
The gradient has two components: one that passes through the residual function $\mathcal{F}$ (and may vanish) and one that passes directly through the identity shortcut. The identity term $\frac{\partial \mathcal{L}}{\partial \mathbf{y}}$ ensures the gradient is never completely lost, creating a "gradient highway" that enables training of very deep networks.
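A toy scalar illustration of the highway effect (an assumed simplification: each layer's residual branch contributes a local derivative of $w = 0.05$, and backprop multiplies the per-layer factors):

```python
depth, w = 20, 0.05

# Plain net: backprop multiplies 20 small local derivatives together.
plain = w ** depth
# Residual net: the identity shortcut adds +1 to each layer's factor.
residual = (w + 1.0) ** depth

print(f"plain:    {plain:.2e}")  # ~1e-26, effectively vanished
print(f"residual: {residual:.2f}")  # ~2.65, still O(1)
```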

## Question 8

Why does the ResNet bottleneck block use $1 \times 1$ convolutions? What purpose does each serve?

**Answer:** The bottleneck block uses three convolutions: $1 \times 1 \to 3 \times 3 \to 1 \times 1$.

- **First $1 \times 1$:** Reduces the channel dimension (e.g., from 256 to 64), decreasing the computational cost of the subsequent $3 \times 3$ convolution.
- **$3 \times 3$:** Performs spatial processing on the reduced-channel feature map.
- **Second $1 \times 1$:** Expands the channel dimension back to the original size for the residual addition.

This reduces the parameter and FLOP cost of the $3 \times 3$ convolution by a factor of $(C/C_b)^2$, where $C_b$ is the bottleneck width (typically $C/4$).
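The savings are easy to verify by counting weights for the 256-to-64 example above (bias terms omitted for simplicity):

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a conv layer (no bias)."""
    return c_in * c_out * k * k

# A single plain 3x3 conv at full width:
standard = conv_params(256, 256, 3)            # 589,824
# The 1x1 -> 3x3 -> 1x1 bottleneck:
bottleneck = (conv_params(256, 64, 1)          # reduce:  16,384
              + conv_params(64, 64, 3)         # spatial: 36,864
              + conv_params(64, 256, 1))       # expand:  16,384
print(standard, bottleneck)                    # 589824 69632
print(round(standard / bottleneck, 1))         # ~8.5x fewer parameters
```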

## Question 9

What is the key innovation that AlexNet introduced relative to earlier CNNs like LeNet?

**Answer:** AlexNet demonstrated that CNNs trained at scale (large datasets, GPU computation) with ReLU activations, dropout, and data augmentation could dramatically outperform traditional computer vision methods. The specific technical contributions were: (1) ReLU activations (replacing sigmoid/tanh, which suffer from vanishing gradients), (2) training on GPUs (enabling much larger models and datasets), (3) dropout for regularization, and (4) local response normalization. The key insight was that the ideas from the 1990s worked dramatically better with more data and compute.

## Question 10

What is depthwise separable convolution? By what factor does it reduce parameters compared to standard convolution for $C = 256$ channels and $k = 3$?

**Answer:** Depthwise separable convolution factorizes a standard convolution into: (1) a **depthwise** convolution (separate $k \times k$ kernel per channel, parameters $= C \cdot k^2$), and (2) a **pointwise** $1 \times 1$ convolution (mixes channels, parameters $= C \cdot C$). The parameter ratio is $\frac{1}{C} + \frac{1}{k^2} = \frac{1}{256} + \frac{1}{9} \approx 0.115$. This is approximately an $8.7\times$ reduction.
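A quick numerical check of the ratio, assuming $C_{\text{in}} = C_{\text{out}} = C$ as in the question:

```python
def separable_ratio(c: int, k: int) -> float:
    """(depthwise + pointwise params) / standard conv params."""
    standard = c * c * k * k        # full conv: every output mixes all inputs
    separable = c * k * k + c * c   # per-channel spatial + 1x1 channel mixing
    return separable / standard

r = separable_ratio(256, 3)
print(round(r, 3))                  # 0.115
print(round(1 / r, 1))              # ~8.7x reduction
```
Algebraically, the ratio simplifies to $\frac{Ck^2 + C^2}{C^2 k^2} = \frac{1}{C} + \frac{1}{k^2}$, matching the closed form above.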

## Question 11

In Grad-CAM, why is a ReLU applied to the weighted combination of feature maps?

**Answer:** The ReLU ensures the heatmap highlights only features that have a **positive** influence on the target class score. Negative activations correspond to features that *decrease* confidence in the target class (i.e., features that support other classes). Including them would produce a misleading heatmap that mixes "evidence for" and "evidence against" the class. The ReLU zeros out the negative contributions, leaving only the positively contributing spatial regions.
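A minimal sketch of the combination step on toy data: channels with negative weights would drag regions below zero, and the ReLU discards exactly those regions.

```python
def grad_cam_map(weights, fmaps):
    """Weighted sum of per-channel feature maps, then ReLU.
    weights: one importance scalar per channel; fmaps: one 2D map per channel."""
    h, w = len(fmaps[0]), len(fmaps[0][0])
    cam = [[sum(a * fm[i][j] for a, fm in zip(weights, fmaps))
            for j in range(w)] for i in range(h)]
    # Keep only positive evidence for the target class.
    return [[max(0.0, v) for v in row] for row in cam]

fmaps = [[[1.0, 0.0], [0.0, 2.0]],   # channel supporting the target class
         [[0.0, 3.0], [1.0, 0.0]]]   # channel with negative weight below
print(grad_cam_map([1.0, -1.0], fmaps))  # [[1.0, 0.0], [0.0, 2.0]]
```
Without the ReLU, the off-diagonal cells would be $-3$ and $-1$, mixing "evidence against" into the heatmap.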

## Question 12

For a 1D CNN processing text, why are multiple kernel sizes (e.g., 3, 4, 5) typically used?

**Answer:** Different kernel sizes capture n-gram patterns at different scales: a kernel of size 3 detects trigrams, size 4 detects 4-grams, and size 5 detects 5-grams. By using multiple sizes and concatenating the resulting feature maps (after global max pooling), the model captures linguistic patterns at multiple granularities simultaneously. This is analogous to using edge detectors at multiple orientations in image processing — no single scale captures all the relevant patterns.
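A toy sketch of the slide-pool-concatenate pattern. Here scalar per-token scores stand in for embedding vectors and summation stands in for the learned kernel; a real model convolves filter banks over embedding windows, but the multi-scale structure is the same.

```python
def multi_scale_features(scores, kernel_sizes=(3, 4, 5)):
    """For each kernel size: slide a window over the sequence, then
    global-max-pool, yielding one feature per scale; concatenate them."""
    feats = []
    for k in kernel_sizes:
        windows = [sum(scores[j:j + k]) for j in range(len(scores) - k + 1)]
        feats.append(max(windows))   # global max pooling over positions
    return feats

# One "feature" per n-gram scale, regardless of sequence length:
print(multi_scale_features([1, 0, 2, 3, 0, 1, 4]))  # [5, 8, 10]
```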

## Question 13

EfficientNet uses compound scaling with the constraint $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$. Why does width appear squared but depth appears linearly?

**Answer:** **Depth** scales FLOPs linearly: doubling the number of layers doubles the total computation (each layer adds a fixed amount of work). **Width** scales FLOPs quadratically: each convolution layer has FLOPs proportional to $C_{\text{in}} \times C_{\text{out}} \times k^2 \times H \times W$. When width is multiplied by $\beta$, both $C_{\text{in}}$ and $C_{\text{out}}$ are scaled, so FLOPs scale as $\beta^2$. **Resolution** scales FLOPs quadratically: the spatial dimensions $H$ and $W$ each multiply by $\gamma$, so FLOPs scale as $\gamma^2$. The constraint ensures that the total FLOP budget doubles when $\phi$ increases by 1: $\alpha \cdot \beta^2 \cdot \gamma^2 = 2^1$.
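Plugging in the coefficients reported in the EfficientNet paper ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$) shows the constraint holding approximately:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15   # coefficients from the EfficientNet paper
phi = 1                               # compound scaling exponent

# FLOPs scale linearly in depth, quadratically in width and resolution.
flops_multiplier = (alpha * beta**2 * gamma**2) ** phi
print(round(flops_multiplier, 3))     # ~1.92, i.e. roughly 2x per unit of phi
```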

## Question 14

What is the difference between convolution and cross-correlation? Does this distinction matter in deep learning?

**Answer:** **Convolution** flips the kernel before sliding: $(w * x)_j = \sum_t w_t \cdot x_{j-t}$. **Cross-correlation** does not flip: $(w \star x)_j = \sum_t w_t \cdot x_{j+t}$. In deep learning, the distinction does not matter because the kernel weights are learned — the network simply learns the flipped version if convolution is used, or the direct version if cross-correlation is used. All major frameworks (PyTorch, TensorFlow) implement cross-correlation and call it "convolution." The distinction matters only when reading signal processing literature, where the term "convolution" strictly implies kernel flipping.
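The relationship is easy to see in code: true convolution is just cross-correlation with the kernel reversed, so the two differ only for asymmetric kernels.

```python
def cross_correlate(w, x):
    """Valid-mode 1D cross-correlation: y[j] = sum_t w[t] * x[j + t]."""
    k = len(w)
    return [sum(w[t] * x[j + t] for t in range(k)) for j in range(len(x) - k + 1)]

def convolve(w, x):
    """True convolution = cross-correlation with a flipped kernel."""
    return cross_correlate(w[::-1], x)

x = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 0.0, -1.0]                 # asymmetric kernel: results differ
print(cross_correlate(w, x))         # [-2.0, -2.0]
print(convolve(w, x))                # [2.0, 2.0]
```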

## Question 15

Why did simply making networks deeper (beyond ~20 layers) fail before ResNet, even with batch normalization?

**Answer:** The problem was not vanishing gradients (which batch normalization largely addresses) but **optimization difficulty**. A 56-layer plain network performed worse than a 20-layer network on *both* training and test sets — this is a training problem, not a generalization problem. The loss landscape of deep plain networks has pathological curvature that makes gradient descent unable to find good solutions. Deeper networks are strictly more expressive (they can represent identity mappings in the extra layers), but the optimizer cannot discover those solutions. Residual connections solve this by making identity the default: the optimizer only needs to learn small deviations from identity, which is much easier.

## Question 16

How does data augmentation act as a regularizer? Why is it different from explicit regularization like L2 weight decay?

**Answer:** Data augmentation artificially increases the effective training set size by generating label-preserving transformations of existing examples. This reduces overfitting because the model sees more diverse input patterns during training, making it harder to memorize specific examples. Unlike L2 weight decay (which penalizes large weights regardless of their utility) or dropout (which randomly zeros activations), data augmentation injects domain knowledge about the *symmetries of the problem*: horizontal flips encode the knowledge that mirror images have the same label, random crops encode partial occlusion robustness, and color jitter encodes illumination invariance. This makes augmentation more targeted and often more effective than generic regularizers.

## Question 17

In a 1D CNN for text, the embedding dimension serves as the "channel" dimension. What is the "spatial" dimension?

**Answer:** The **sequence length** (number of tokens) is the spatial dimension. The input to `Conv1d` has shape `(batch, channels, length)` = `(batch, embedding_dim, seq_len)`. The 1D kernel slides along the sequence dimension, computing dot products between the kernel weights and windows of consecutive token embeddings. A kernel of size $k$ at position $j$ processes the token embeddings at positions $j, j+1, \ldots, j+k-1$.
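Tracking the shapes concretely, with hypothetical sizes (batch 32, 100-dimensional embeddings, 50 tokens, 64 filters):

```python
batch, embed_dim, seq_len = 32, 100, 50   # hypothetical example sizes
k, num_filters = 3, 64

# Input to Conv1d: (batch, channels, length) = (batch, embed_dim, seq_len).
# Each filter has shape (embed_dim, k) and slides along the sequence axis;
# with no padding, it has seq_len - k + 1 valid positions.
out_len = seq_len - k + 1
out_shape = (batch, num_filters, out_len)
print(out_shape)  # (32, 64, 48)
```
Note that the convolution never slides along the embedding dimension: each output position mixes *all* embedding channels of $k$ consecutive tokens, which is why the embedding dimension plays the role of channels.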

## Question 18

What does Grad-CAM reveal if a model classifies an image of a boat correctly but the heatmap highlights the water rather than the boat?

**Answer:** This reveals that the model has learned a **spurious correlation**: boats in the training data typically appear on water, so the model learned to associate "water texture" with the "boat" class. The model achieves high accuracy on the test set (which has the same bias), but it would fail on images of boats on land, in dry dock, or on trailers. This is a failure of generalization, and Grad-CAM makes it visible. The appropriate response is to augment the training data with diverse boat contexts, or to evaluate on out-of-distribution test sets specifically designed to break such correlations.

## Question 19

Why are 1D CNNs sometimes preferred over transformers for extracting text features in a recommendation system?

**Answer:** Three practical reasons: (1) **Inference speed** — 1D CNNs are much faster than transformers for short sequences because they avoid the $O(n^2)$ self-attention computation and are highly parallelizable. In a recommendation system serving millions of requests per second, latency matters. (2) **Model size** — 1D CNNs have far fewer parameters than transformers, reducing memory and deployment costs. (3) **The features are secondary** — in a recommendation system, text embeddings are just one input feature among many (user features, interaction history, item metadata). A lightweight CNN that captures n-gram patterns may be sufficient when the text embedding is combined with other signals in a larger model. The principle is: use the simplest model that solves the subproblem.

## Question 20

A CNN for dense prediction (e.g., climate downscaling from $16 \times 16$ to $64 \times 64$) requires spatial upsampling. Name two upsampling methods and their tradeoffs.

**Answer:** Two common choices:

- **Transposed convolution (deconvolution):** Inserts zeros between input pixels and applies a standard convolution, effectively upsampling. It is learnable (the upsampling kernel is trained) but prone to **checkerboard artifacts** when the kernel size is not divisible by the stride, because some output positions receive more overlapping kernel contributions than others.
- **Sub-pixel convolution (pixel shuffle):** Produces $C r^2$ channels at the low resolution, then rearranges each group of $r^2$ channels into one channel at $r\times$ resolution. It avoids checkerboard artifacts because every output pixel is computed exactly once, and it is generally preferred for super-resolution and spatial upsampling tasks.

An alternative is bilinear upsampling followed by a standard convolution, which is simple and artifact-free but not end-to-end learnable in the upsampling step.
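The pixel-shuffle rearrangement itself is just bookkeeping, sketched here in pure Python for a single output channel following PyTorch's `PixelShuffle` layout (`out[h*r + dh][w*r + dw] = channels[dh*r + dw][h][w]`):

```python
def pixel_shuffle(channels, r):
    """Rearrange r*r low-resolution maps into one map at r-times resolution.
    channels: list of r*r 2D maps, each of shape (h, w); returns (h*r, w*r)."""
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0.0] * (w * r) for _ in range(h * r)]
    for c, fm in enumerate(channels):
        dh, dw = divmod(c, r)             # sub-pixel offset encoded by channel index
        for i in range(h):
            for j in range(w):
                out[i * r + dh][j * r + dw] = fm[i][j]
    return out

# Four 1x1 maps -> one 2x2 map (r = 2): each channel fills one sub-pixel slot.
four_maps = [[[1]], [[2]], [[3]], [[4]]]
print(pixel_shuffle(four_maps, 2))  # [[1, 2], [3, 4]]
```
Because each low-resolution channel maps to exactly one sub-pixel offset, every output pixel is written exactly once, which is why this method is free of checkerboard artifacts.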