Chapter 19: Exercises

Conceptual Exercises

Exercise 1: Positional Encoding Properties

Explain why the sinusoidal positional encoding allows the model to learn to attend to relative positions. Specifically, show that for any fixed offset $k$, the positional encoding at position $pos + k$ can be written as a linear transformation of the encoding at position $pos$. (Hint: Use the trigonometric addition formulas.)
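For reference, this exercise assumes the standard sinusoidal definition (Vaswani et al., 2017), where $i$ indexes the sine/cosine dimension pairs:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right).
$$

The hint refers to the identities $\sin(a + b) = \sin a \cos b + \cos a \sin b$ and $\cos(a + b) = \cos a \cos b - \sin a \sin b$.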

Exercise 2: Why Not Concatenate?

The Transformer adds positional encodings to token embeddings rather than concatenating them. What are the advantages and disadvantages of each approach? What would change about the model architecture if we used concatenation instead?

Exercise 3: Pre-Norm vs. Post-Norm Gradient Flow

Draw the computational graph for a single Transformer block using (a) post-norm and (b) pre-norm configurations. Trace the gradient flow from the output back to the input and explain why pre-norm provides a more direct gradient path.
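As a starting point, the two orderings can be written side by side (a minimal sketch; `sublayer` stands for either the attention or FFN sublayer, and `norm` for a layer-normalization module):

```python
# Post-norm (original Transformer): normalize after the residual addition,
# so every gradient path must pass through LayerNorm.
def post_norm_step(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-norm: normalize only inside the branch; the residual path x -> output
# is an identity, giving gradients a direct route to earlier layers.
def pre_norm_step(x, sublayer, norm):
    return x + sublayer(norm(x))
```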

Exercise 4: Causal Mask Purpose

During training, the Transformer decoder receives the entire target sequence at once. Explain why this does not constitute "cheating" --- how does the causal mask preserve the autoregressive property even when the full target is available?

Exercise 5: Cross-Attention Asymmetry

In the decoder's cross-attention layer, queries come from the decoder and keys/values come from the encoder. What would happen if we reversed this --- using the encoder output as queries and the decoder state as keys and values? Would the model still work? Why or why not?

Exercise 6: Feed-Forward Network Interpretation

The position-wise FFN processes each position independently. Some researchers interpret the FFN as a key-value memory (Geva et al., 2021). Explain this interpretation: what plays the role of keys, what plays the role of values, and how does retrieval work?

Exercise 7: Residual Stream Bottleneck

In the Transformer, the residual stream dimension $d_{\text{model}}$ is fixed throughout the network. All sublayers must read from and write to vectors of this dimension. How might this create a bottleneck? How does the FFN's expansion to $d_{\text{ff}} = 4d_{\text{model}}$ help address this?

Exercise 8: Attention vs. Recurrence Complexity

Compare the computational complexity of self-attention and recurrence for a sequence of length $n$ with model dimension $d$. Under what conditions is attention more efficient? Under what conditions is recurrence more efficient?
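For reference, the per-layer costs reported in the original Transformer paper (Vaswani et al., 2017), which this exercise asks you to derive and interpret, are

$$
\text{self-attention: } O(n^2 \cdot d), \qquad \text{recurrence: } O(n \cdot d^2).
$$

Self-attention also needs only $O(1)$ sequential operations per layer versus $O(n)$ for recurrence, which matters when comparing wall-clock efficiency on parallel hardware.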

Exercise 9: Encoder Masking

The encoder uses unmasked self-attention, meaning every position can attend to every other position. Why is this appropriate for the encoder but not for the decoder? Can you think of a scenario where masking in the encoder might be useful?

Exercise 10: Label Smoothing Trade-offs

The original Transformer uses label smoothing with $\epsilon = 0.1$. Explain the intuition behind label smoothing. Why does it hurt perplexity but improve BLEU scores? What would happen with $\epsilon = 0.5$?
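For reference, with uniform smoothing over a vocabulary of size $V$ (the formulation assumed here), the training target for correct class $y$ becomes

$$
q'(k) = \begin{cases} 1 - \epsilon + \dfrac{\epsilon}{V} & k = y \\[4pt] \dfrac{\epsilon}{V} & k \neq y \end{cases}
$$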


Coding Exercises

Exercise 11: Implement Learned Positional Encoding

Implement a LearnedPositionalEncoding module that uses nn.Embedding instead of sinusoidal functions. It should have the same interface as SinusoidalPositionalEncoding. Compare the two by training the same toy Transformer with each and plotting the loss curves.
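A possible starting skeleton (a sketch only; the constructor signature and the assumed (batch, seq_len, d_model) input layout should be adjusted to match your SinusoidalPositionalEncoding):

```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    """Positional encoding as a trainable embedding table."""
    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, d_model)  # one learned vector per position
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)   # (seq_len,)
        return self.dropout(x + self.pos_embed(positions))     # broadcast over the batch
```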

Exercise 12: Implement Post-Norm Transformer Block

Modify the TransformerEncoderBlock to use post-norm instead of pre-norm. Compare training stability by training the same model architecture with pre-norm and post-norm on the toy reversal task, using various learning rates (1e-3, 5e-4, 1e-4, 5e-5).

Exercise 13: Visualize Positional Encodings

Write code to generate and visualize the sinusoidal positional encoding matrix for a sequence length of 100 and $d_{\text{model}} = 64$. Create two plots: (a) a heatmap of the full encoding matrix, and (b) the encoding vectors for positions 0, 1, 2, and 50 plotted as line graphs. Compute and plot the dot product between all pairs of position encodings to verify that nearby positions are more similar.
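A minimal sketch of the matrix construction and plot (a) using NumPy and Matplotlib; plot (b) and the pairwise dot-product plot follow the same pattern:

```python
import numpy as np
import matplotlib.pyplot as plt

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(100, 64)
plt.imshow(pe, aspect="auto", cmap="RdBu")         # (a) heatmap of the full matrix
plt.xlabel("dimension"); plt.ylabel("position"); plt.colorbar()
plt.show()
```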

Exercise 14: Implement Multi-Head Attention from Scratch

Without using nn.MultiheadAttention, implement multi-head attention as described in Chapter 18 and use it in a Transformer encoder block. Verify that your implementation produces the same shaped outputs as the PyTorch built-in version. (Cross-reference: Chapter 18, Section 18.4.)

Exercise 15: Causal Mask Verification

Write a test that verifies the causal mask works correctly. Feed a sequence through the decoder, then modify a token at position $t$ and verify that (a) all decoder outputs at positions $< t$ remain unchanged, and (b) outputs at positions $\geq t$ may change.
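One way to structure the test (a sketch; `model(src, tgt)` is a placeholder for your teacher-forced encoder-decoder forward pass, and `vocab_size` is the target vocabulary size):

```python
import torch

@torch.no_grad()
def check_causal_mask(model, src, tgt, t: int, vocab_size: int):
    model.eval()                                    # disable dropout for determinism
    out_before = model(src, tgt)                    # (batch, tgt_len, vocab)
    tgt_mod = tgt.clone()
    tgt_mod[:, t] = torch.randint(0, vocab_size, (tgt.size(0),), device=tgt.device)
    out_after = model(src, tgt_mod)
    # (a) positions strictly before t must be unaffected by the perturbation
    assert torch.allclose(out_before[:, :t], out_after[:, :t], atol=1e-5)
    # (b) positions >= t are allowed (and expected) to change
    print("max change at positions >= t:",
          (out_before[:, t:] - out_after[:, t:]).abs().max().item())
```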

Exercise 16: Weight Tying Implementation

Implement weight tying between the decoder embedding and the output projection layer. Train the toy Transformer with and without weight tying, comparing: (a) total parameter count, (b) final loss, and (c) convergence speed.
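In PyTorch, tying typically means pointing both modules at the same parameter (a sketch; `tgt_embed` and `generator` are assumed attribute names for the decoder embedding and the output projection):

```python
# nn.Embedding stores a (vocab, d_model) weight and nn.Linear(d_model, vocab)
# stores its weight as (vocab, d_model) too, so the matrix can be shared directly.
model.generator.weight = model.tgt_embed.weight
```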

Exercise 17: Custom Learning Rate Scheduler

Implement the full Transformer learning rate schedule (warm-up + inverse square root decay). Plot the learning rate as a function of step for $d_{\text{model}} \in \{64, 128, 256, 512\}$ and $\text{warmup\_steps} \in \{1000, 4000, 8000\}$. What patterns do you notice?
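The schedule itself is a one-line function of the step count (a sketch to plug into your plotting loop):

```python
def transformer_lr(step: int, d_model: int, warmup_steps: int) -> float:
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```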

Exercise 18: Beam Search Decoder

Implement beam search decoding for the trained Transformer. Your implementation should accept a beam width parameter $k$ and return the top-$k$ candidate sequences. Compare the results with greedy decoding on the toy reversal task.

Exercise 19: Gradient Flow Analysis

Write code that registers backward hooks on each layer of the Transformer to track gradient magnitudes. Train the model for a few steps and plot the average gradient magnitude at each layer. Compare pre-norm and post-norm configurations.
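A sketch of the hook registration (assuming the chapter's TransformerEncoderBlock class and a `model` instance; hooking whole blocks avoids sublayers that return tuples):

```python
from collections import defaultdict

grad_log = defaultdict(list)  # block name -> gradient norms observed during training

def make_hook(name):
    def hook(module, grad_input, grad_output):
        # grad_output[0] is the gradient of the loss w.r.t. this block's output
        if grad_output[0] is not None:
            grad_log[name].append(grad_output[0].norm().item())
    return hook

for name, module in model.named_modules():
    if isinstance(module, TransformerEncoderBlock):  # register decoder blocks analogously
        module.register_full_backward_hook(make_hook(name))
```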

Exercise 20: Attention Head Visualization

Train the Transformer on the toy reversal task and extract attention weights from all heads of all layers. Create visualizations showing which source positions each target position attends to in (a) self-attention and (b) cross-attention. Do any heads learn the reversal pattern explicitly?


Mathematical Exercises

Exercise 21: Positional Encoding Orthogonality

For the sinusoidal positional encoding, compute the dot product $PE_{pos_1} \cdot PE_{pos_2}$ analytically. Show that it is a function of the offset $pos_1 - pos_2$ only (i.e., it depends on relative, not absolute, position).

Exercise 22: Attention Complexity Derivation

Derive the exact number of floating-point operations (FLOPs) required for a single multi-head attention operation with $h$ heads, sequence length $n$, model dimension $d_{\text{model}}$, and head dimension $d_k = d_{\text{model}} / h$. Include the projections for Q, K, V, and the output projection.

Exercise 23: Parameter Count

Compute the exact number of trainable parameters in a Transformer with:

- $d_{\text{model}} = 512$, $d_{\text{ff}} = 2048$, $h = 8$
- $N_{\text{enc}} = 6$ encoder layers, $N_{\text{dec}} = 6$ decoder layers
- a source vocabulary of 32,000 tokens and a target vocabulary of 32,000 tokens

Report the count both with and without weight tying.

Exercise 24: Layer Normalization Jacobian

Compute the Jacobian matrix $\frac{\partial \text{LayerNorm}(\mathbf{x})}{\partial \mathbf{x}}$ for an input vector $\mathbf{x} \in \mathbb{R}^d$ (ignoring the learnable scale and shift). Show that this Jacobian has a specific structure that helps prevent both vanishing and exploding gradients.

Exercise 25: FFN as Key-Value Memory

Consider the FFN as computing $\text{FFN}(\mathbf{x}) = W_2 \cdot \text{ReLU}(W_1 \mathbf{x})$ (ignoring biases). Let $\mathbf{k}_i$ be the $i$-th row of $W_1$ and $\mathbf{v}_i$ be the $i$-th column of $W_2$. Show that $\text{FFN}(\mathbf{x}) = \sum_{i} \max(0, \mathbf{k}_i \cdot \mathbf{x}) \cdot \mathbf{v}_i$ and interpret this as a memory lookup.


Applied Exercises

Exercise 26: Sorting Task

Train a Transformer on a sorting task: given a sequence of random integers, output them in sorted order. This is more challenging than reversal because the output depends on global properties of the input. Experiment with model size and report the smallest configuration that achieves > 95% accuracy on sequences of length 5--10.

Exercise 27: Simple Calculator

Train a Transformer to perform simple arithmetic. Create a dataset of additions of 2-digit numbers (e.g., "23+45" -> "68"). Represent numbers as sequences of individual digits with a special separator token. Train and evaluate the model, reporting accuracy as a function of training steps.
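Data generation for this task is straightforward; a possible sketch (character-level, with '+' serving as the separator token):

```python
import random

def make_addition_example():
    a, b = random.randint(10, 99), random.randint(10, 99)
    src = list(f"{a}+{b}")   # e.g. ['2', '3', '+', '4', '5']
    tgt = list(str(a + b))   # e.g. ['6', '8']
    return src, tgt
```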

Exercise 28: Copy with Noise

Modify the toy task: the source sequence has random tokens inserted (noise), and the target should be the original clean sequence. This requires the model to learn which tokens to copy and which to ignore. How does performance change with increasing noise levels?

Exercise 29: Encoder-Only Classification

Build an encoder-only Transformer for sequence classification. Use the [CLS] token approach (prepend a special token and use its final representation for classification). Train on a simple task: classify whether the sum of digits in a sequence is even or odd.

Exercise 30: Decoder-Only Language Model

Build a decoder-only Transformer (no encoder, no cross-attention). Train it as a character-level language model on a small text corpus. Generate samples using temperature-scaled sampling and report the perplexity.
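Temperature-scaled sampling can be sketched as follows (assuming `model` maps a (batch, seq_len) tensor of token ids to (batch, seq_len, vocab) logits):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, prompt_ids, max_new_tokens: int, temperature: float = 1.0):
    ids = prompt_ids                                    # (batch, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                   # logits for the next token
        probs = F.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids
```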

Exercise 31: Ablation Study

Perform an ablation study on the toy reversal task. Train the model with each of the following removed (one at a time) and compare performance:

- Positional encoding
- Residual connections
- Layer normalization
- Multi-head attention (replace with single-head)
- Feed-forward network (replace with identity)

Report and discuss which components are most critical.

Exercise 32: Shared Weights Across Layers

The original Transformer uses separate parameters for each layer. Implement a variant where all encoder layers share the same parameters (and similarly for decoder layers). This is the approach used in ALBERT. Compare the parameter count and performance.

Exercise 33: Relative Positional Encoding

Implement a simple form of relative positional encoding by adding a learnable bias to the attention scores based on the relative distance between query and key positions. Compare with sinusoidal absolute positional encoding on the toy task.
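One minimal sketch of such a bias module (the names and per-head bias table are assumptions; the returned tensor is added to the raw attention scores before the softmax):

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    def __init__(self, max_len: int, num_heads: int):
        super().__init__()
        # one learnable scalar per head for each distance in [-(max_len-1), max_len-1]
        self.bias = nn.Parameter(torch.zeros(num_heads, 2 * max_len - 1))
        self.max_len = max_len

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len, device=self.bias.device)
        rel = pos[None, :] - pos[:, None] + self.max_len - 1   # shift indices to be >= 0
        return self.bias[:, rel]   # (num_heads, seq_len, seq_len), added to the scores
```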

Exercise 34: KV Cache for Efficient Inference

Implement key-value caching for the decoder during inference. Instead of recomputing attention over all previous tokens at each step, cache the key and value projections and only compute the new token's attention. Measure the speedup compared to naive decoding.
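A minimal cache for one attention layer might look like this (a sketch; at each decoding step you project only the newest token to keys and values and append):

```python
import torch

class KVCache:
    def __init__(self):
        self.k = None  # (batch, cached_len, d_model) once populated
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # Append the projections for the newly generated position and return the
        # full cache; the new query then attends over these tensors.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
        return self.k, self.v
```

Note that no causal mask is needed during cached decoding, since the cache only ever contains positions up to the current step.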

Exercise 35: Mixed-Precision Training

Implement mixed-precision training for the Transformer using torch.cuda.amp. Compare training speed and memory usage with full-precision training. Verify that the final model quality is comparable.
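A minimal training-step sketch with torch.cuda.amp (the `model`, `criterion`, `optimizer`, `loader`, and teacher-forcing slicing are placeholders for your existing training loop):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for src, tgt in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass runs in float16 where safe
        logits = model(src, tgt[:, :-1])
        loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
    scaler.scale(loss).backward()          # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                 # unscales gradients, then steps the optimizer
    scaler.update()                        # adjust the scale factor for the next step
```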

Exercise 36: Subword Tokenization

Replace the character-level tokenization in the toy task with byte pair encoding (BPE). Implement a simple BPE tokenizer from scratch, train it on a corpus, and use it with the Transformer. Compare vocabulary sizes and sequence lengths.

Exercise 37: Attention Dropout Analysis

The original Transformer applies dropout to attention weights. Train models with attention dropout rates of 0.0, 0.1, 0.2, and 0.3. Plot training and validation loss curves. At what point does attention dropout hurt training more than it helps generalization?

Exercise 38: Layer-by-Layer Probing

Train a Transformer on the reversal task. Then, for each layer $l$, add a linear probe that predicts the correct output directly from the intermediate representation at layer $l$. Plot probe accuracy vs. layer number. At which layer does the model "solve" the task?

Exercise 39: Scaling Experiment

Train Transformers of different sizes (varying $d_{\text{model}}$ from 32 to 512 and $N$ from 1 to 8) on the reversal task with a fixed dataset size. Plot final loss vs. parameter count on a log-log scale. Do you observe the power-law scaling behavior predicted by scaling laws?

Exercise 40: Transformer Debugging Challenge

You are given a Transformer implementation with three deliberate bugs:

1. Missing scaling by $\sqrt{d_k}$ in attention
2. Positional encoding applied after the first layer instead of before
3. Causal mask applied to the encoder instead of the decoder

For each bug, explain what symptoms you would observe during training and how you would diagnose the problem. Then fix the bugs and verify the model trains correctly.