Chapter 21: Exercises
Conceptual Exercises
Exercise 21.1: Autoregressive Factorization
Write out the full autoregressive factorization for the sentence "The cat sat on the mat" (treating each word as a token). Explicitly write each conditional probability term in the product. How many conditional distributions must the model evaluate to compute the joint probability of this 6-token sequence?
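For reference, the chain rule gives the general factorization for any sequence of $T$ tokens:

$$p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}).$$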
Exercise 21.2: Causal Mask Properties
Consider a sequence of length 8. (a) Draw the causal attention mask matrix (using 0 for "attend" and X for "masked"). (b) How many non-zero attention weights does position 5 have? (c) If we sum across each row of the attention weight matrix (after softmax), what value do we get? Explain why. (d) What would happen if we accidentally used a non-causal (bidirectional) mask during training but a causal mask during generation?
Exercise 21.3: GPT Architecture Comparison
Create a detailed comparison table of GPT-1, GPT-2, and GPT-3 covering: number of parameters, number of layers, hidden dimension, number of attention heads, head dimension, context length, vocabulary size, training data size, and layer normalization placement. Identify which changes had the largest impact on performance and explain your reasoning.
Exercise 21.4: Pre-Norm vs. Post-Norm
(a) Write the equations for a Transformer block using post-norm (GPT-1 style) and pre-norm (GPT-2 style). (b) Explain why pre-norm is more stable during training by analyzing the gradient flow through the residual stream. (c) What is the potential disadvantage of pre-norm? (Hint: think about the effective depth of the network.)
Exercise 21.5: Weight Tying
(a) In our mini-GPT implementation, we tie the input embedding matrix with the output projection matrix. Explain the intuition behind this: why should the same matrix that maps tokens to embeddings also map hidden states to logits? (b) Calculate the number of parameters saved by weight tying in a model with vocabulary size 50,257 and embedding dimension 768. (c) Can weight tying hurt performance? In what situations might untied weights be preferable?
Exercise 21.6: Temperature Analysis
Consider a vocabulary of 4 tokens with logits $z = [2.0, 1.0, 0.5, -1.0]$. (a) Compute the probabilities after softmax with temperature $\tau = 0.5, 1.0, 2.0$. (b) Compute the entropy of each resulting distribution. (c) At what temperature does the distribution become approximately uniform? (d) What is the limiting behavior as $\tau \to 0^+$ and $\tau \to \infty$?
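For reference, the temperature-scaled softmax and the entropy needed in parts (a) and (b) are

$$p_i(\tau) = \frac{\exp(z_i / \tau)}{\sum_{j} \exp(z_j / \tau)}, \qquad H(p) = -\sum_i p_i \log p_i.$$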
Exercise 21.7: Top-k vs. Top-p
Given a probability distribution over 10 tokens: $[0.35, 0.20, 0.15, 0.10, 0.08, 0.05, 0.03, 0.02, 0.01, 0.01]$: (a) Which tokens are included in top-k sampling with $k = 3$? What are the renormalized probabilities? (b) Which tokens are included in nucleus sampling with $p = 0.9$? What are the renormalized probabilities? (c) For this specific distribution, which method gives a more appropriate truncation? Justify your answer. (d) Construct an example distribution where top-k with $k=5$ includes highly unlikely tokens.
Exercise 21.8: KV Cache Computation
For a model with 12 layers, 12 attention heads, head dimension 64, and a context window of 1024: (a) Calculate the total memory required for the KV cache in float32 and float16. (b) If we batch 8 sequences simultaneously, what is the total KV cache memory? (c) Suppose we generate 512 new tokens from a prompt of 256 tokens. Compare the total attention FLOPs with and without KV caching.
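One way to organize the arithmetic for parts (a) and (b) is sketched below, under the assumption that the cache stores one key tensor and one value tensor per layer, each of shape (heads, context_len, head_dim); adapt the constants if your setup differs.

```python
# KV cache memory estimate (a sketch under the assumptions stated above).
n_layers, n_heads, head_dim, context_len = 12, 12, 64, 1024

def kv_cache_bytes(bytes_per_element, batch_size=1):
    per_layer = 2 * batch_size * n_heads * context_len * head_dim  # factor 2 = keys and values
    return n_layers * per_layer * bytes_per_element

print(f"float32: {kv_cache_bytes(4) / 2**20:.1f} MiB")
print(f"float16: {kv_cache_bytes(2) / 2**20:.1f} MiB")
print(f"float16, batch of 8: {kv_cache_bytes(2, batch_size=8) / 2**20:.1f} MiB")
```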
Exercise 21.9: Perplexity Interpretation
(a) If a language model has a perplexity of 25 on a test set, what is the average cross-entropy loss per token? (b) A uniform model over a vocabulary of 50,000 tokens assigns equal probability to every token. What is its perplexity? (c) Is a perplexity of 50 "good" or "bad"? Explain why this question cannot be answered without additional context. (d) A model achieves perplexity 15 on English text and perplexity 45 on Python code. Does this mean the model is better at English than Python? Discuss.
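Recall that perplexity is the exponential of the average per-token cross-entropy (in nats):

$$\text{PPL} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right).$$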
Exercise 21.10: Autoregressive vs. Bidirectional
Compare and contrast autoregressive (decoder-only) models with bidirectional (encoder-only) models like BERT: (a) Which is better suited for text generation? Explain why. (b) Which is better suited for text classification? Explain why. (c) Can an autoregressive model perform bidirectional-style tasks? If so, how? (d) The GPT-3 paper showed that a sufficiently large autoregressive model can match BERT-style models on many NLU benchmarks. What does this suggest about the relationship between model size and architecture choice?
Programming Exercises
Exercise 21.11: Implement Causal Masking from Scratch
Implement the create_causal_mask function and verify it by showing that when applied to an attention matrix, positions cannot attend to future tokens. Test with a small example (sequence length 4) and print the attention weights before and after masking.
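A minimal PyTorch sketch to start from (the boolean-mask convention here is one of several reasonable choices):

```python
import torch

def create_causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular boolean matrix: entry (i, j) is True when position i may attend to j <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Quick check with sequence length 4: masked positions get -inf before the softmax.
scores = torch.randn(4, 4)
mask = create_causal_mask(4)
masked_scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)
print(weights)  # strictly upper-triangular entries should be exactly zero
```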
Exercise 21.12: Build a Minimal Attention Layer
Implement a single-head causal self-attention layer from scratch (without using nn.MultiheadAttention). Your implementation should:
(a) Accept input of shape (batch, seq_len, d_model).
(b) Compute Q, K, V projections.
(c) Apply the causal mask.
(d) Return the attention output and the attention weights.
Test it with random input and verify the attention weights are lower-triangular.
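A minimal skeleton to adapt, assuming a boolean lower-triangular mask and no dropout:

```python
import math
import torch
import torch.nn as nn

class SingleHeadCausalAttention(nn.Module):
    """Single-head causal self-attention (an illustrative sketch, not the chapter's reference code)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)      # (batch, seq_len, seq_len)
        causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~causal, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        return self.out_proj(weights @ v), weights

x = torch.randn(2, 4, 32)
out, attn = SingleHeadCausalAttention(32)(x)
print(attn[0])  # each row sums to 1; strictly upper-triangular entries are 0
```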
Exercise 21.13: Temperature Sampling Implementation
Implement a function that takes logits, a temperature, and an optional top-k value, and returns a sampled token index. Test it by generating 1000 samples at different temperatures and plotting the histogram of selected tokens.
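One possible shape for such a function (a sketch, not the chapter's reference implementation):

```python
import torch

def sample_token(logits, temperature=1.0, top_k=None):
    # logits: 1-D tensor over the vocabulary; returns a sampled token id.
    logits = logits / max(temperature, 1e-8)              # temperature scaling
    if top_k is not None:
        kth_value = torch.topk(logits, top_k).values[-1]  # smallest logit that survives
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```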
Exercise 21.14: Nucleus Sampling from Scratch
Implement nucleus (top-p) sampling from scratch without using any library functions for the filtering. Your implementation should: (a) Sort the logits in descending order. (b) Compute cumulative probabilities. (c) Find the cutoff point. (d) Mask out tokens below the cutoff. (e) Renormalize and sample. Verify by comparing with the reference implementation from the chapter.
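A sketch of steps (a)-(e), assuming 1-D logits; it leans on torch.sort and torch.cumsum, which you may or may not count as "filtering" helpers:

```python
import torch

def nucleus_sample(logits, p=0.9):
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)   # (a) sort descending
    probs = torch.softmax(sorted_logits, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)                           # (b) cumulative probabilities
    # (c)/(d) drop tokens whose preceding cumulative mass already exceeds p,
    # so the token that crosses the threshold is kept.
    remove = cumulative - probs > p
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    probs = torch.softmax(sorted_logits, dim=-1)                       # (e) renormalize and sample
    choice = torch.multinomial(probs, num_samples=1)
    return sorted_idx[choice].item()
```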
Exercise 21.15: Complete Mini-GPT Training
Using the mini-GPT architecture from the chapter, train a character-level language model on a text file of your choice (e.g., Shakespeare, a novel from Project Gutenberg). Plot the training loss curve and generate sample text after each epoch to observe how quality improves during training.
Exercise 21.16: KV Cache Implementation
Extend the CausalSelfAttention class to support KV caching. Implement a generation loop that uses the cache and compare the wall-clock time of generation with and without caching for sequences of length 100, 200, and 500. Plot the speedup factor as a function of sequence length.
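The core cache update can look like the sketch below; the dictionary-based cache and the exact shapes are illustrative assumptions, not the chapter's interface:

```python
import torch

def attend_with_cache(q_new, k_new, v_new, cache):
    # q_new, k_new, v_new: (batch, heads, 1, head_dim) for the latest position only.
    if cache is not None:
        k_new = torch.cat([cache["k"], k_new], dim=2)   # append to cached keys
        v_new = torch.cat([cache["v"], v_new], dim=2)   # append to cached values
    cache = {"k": k_new, "v": v_new}
    scores = q_new @ k_new.transpose(-2, -1) / k_new.shape[-1] ** 0.5
    out = torch.softmax(scores, dim=-1) @ v_new         # no causal mask needed: the cache only holds the past
    return out, cache
```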
Exercise 21.17: Attention Pattern Analysis
Load a pre-trained GPT-2 model and extract attention patterns for 5 different input sentences. For each sentence: (a) Visualize the attention patterns for 4 different layers and 4 different heads. (b) Identify heads that appear to specialize (e.g., attending to the previous token, attending to punctuation, etc.). (c) Do you see evidence of induction heads? Describe what you observe.
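A possible starting point using the HuggingFace transformers API, where passing output_attentions=True returns one attention tensor per layer:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attn_layer0_head3 = outputs.attentions[0][0, 3]
print(attn_layer0_head3.shape)
```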
Exercise 21.18: Beam Search vs. Sampling
Implement beam search for the mini-GPT model and compare the generated text quality (as measured by perplexity under GPT-2) against: (a) Greedy decoding. (b) Top-k sampling (k=50). (c) Nucleus sampling (p=0.92). (d) Temperature sampling (T=0.7). Run each strategy 10 times and report the mean and standard deviation of perplexity.
Exercise 21.19: Repetition Penalty
Implement the repetition penalty described in Section 21.11.2. Apply it during generation with GPT-2 and compare: (a) No penalty ($\alpha = 1.0$). (b) Mild penalty ($\alpha = 1.2$). (c) Strong penalty ($\alpha = 1.5$). (d) Aggressive penalty ($\alpha = 2.0$). For each, generate 10 samples of 200 tokens and compute the fraction of repeated n-grams (bigrams, trigrams, and 4-grams).
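One common formulation (the CTRL-style penalty) is sketched below; check it against the exact definition in Section 21.11.2, which may differ in details:

```python
import torch

def apply_repetition_penalty(logits, generated_ids, alpha=1.2):
    # Shrink the logit of every token that has already been generated so it
    # becomes less likely to be picked again; alpha = 1.0 leaves logits unchanged.
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] = logits[token_id] / alpha
        else:
            logits[token_id] = logits[token_id] * alpha
    return logits
```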
Exercise 21.20: Positional Embedding Visualization
Extract the learned positional embeddings from GPT-2 and analyze them: (a) Compute the cosine similarity matrix between all 1024 positional embeddings. (b) Apply PCA to reduce the positional embeddings to 2D and plot them. (c) What patterns do you observe? Do nearby positions have similar embeddings? (d) Compare with sinusoidal positional embeddings of the same dimension.
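A possible starting point for part (a), assuming the HuggingFace GPT-2 implementation, where the learned positional embeddings live in transformer.wpe:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
pos_emb = model.transformer.wpe.weight.detach()        # (1024, 768)

normed = pos_emb / pos_emb.norm(dim=-1, keepdim=True)
cosine_sim = normed @ normed.T                          # (1024, 1024) similarity matrix
print(cosine_sim.shape, cosine_sim[0, :5])
```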
Exercise 21.21: GPT-2 Fine-Tuning
Fine-tune GPT-2 on a small domain-specific corpus (e.g., cooking recipes, legal text, or poetry): (a) Prepare the dataset using the GPT-2 tokenizer. (b) Fine-tune for 3 epochs with a learning rate of 5e-5. (c) Compare generated text before and after fine-tuning. (d) Compute perplexity on a held-out test set before and after fine-tuning.
Exercise 21.22: Model Size Experiment
Create three versions of the mini-GPT model with different sizes:
- Small: 2 layers, 128 hidden dim, 4 heads (~2M params)
- Medium: 4 layers, 256 hidden dim, 8 heads (~10M params)
- Large: 6 layers, 384 hidden dim, 12 heads (~30M params)
Train all three on the same dataset and compare: (a) Training loss curves. (b) Validation perplexity. (c) Quality of generated text (subjective evaluation). (d) Training time per epoch.
Exercise 21.23: Tokenizer Effects
Take a paragraph of text and tokenize it with three different tokenizers: (a) Character-level tokenization. (b) GPT-2's BPE tokenizer. (c) A word-level tokenizer (splitting on whitespace and punctuation).
For each, compute the sequence length and discuss the implications for:
- Context window utilization.
- Training efficiency.
- The model's ability to handle rare words.
Exercise 21.24: Cross-Entropy Loss Decomposition
Given a trained GPT-2 model and a text passage: (a) Compute the per-token cross-entropy loss for each token in the passage. (b) Identify the 10 tokens with the highest loss (most surprising to the model). (c) Identify the 10 tokens with the lowest loss (most predictable). (d) Does the loss tend to be higher at the beginning of sentences? At content words vs. function words? Analyze and discuss.
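A sketch of the per-token loss computation in part (a): shift the logits by one position and use an unreduced cross-entropy.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("Language models assign a loss to every token.", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                                   # (1, seq_len, vocab_size)

# Position t predicts token t+1, so align logits[:-1] with targets[1:].
losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
for token_id, loss in zip(ids[0, 1:].tolist(), losses.tolist()):
    print(f"{tokenizer.decode([token_id])!r:>14}  {loss:.3f}")
```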
Challenge Exercises
Exercise 21.25: Implement RoPE
Implement Rotary Positional Embeddings (RoPE) from scratch and integrate them into the mini-GPT model: (a) Implement the rotation matrix computation. (b) Modify the attention mechanism to apply RoPE to queries and keys. (c) Train the model with RoPE and compare with learned positional embeddings. (d) Test whether the RoPE model can generalize to longer sequences than it was trained on.
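A sketch of one common "rotate-half" RoPE formulation (as used in GPT-NeoX/Llama-style code); other variants pair the channels differently:

```python
import torch

def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq_len, head_dim); rotate channel pairs by a position-dependent angle.
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    half = head_dim // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)             # theta_i
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]    # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Apply this to the queries and keys (not the values) inside the attention layer.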
Exercise 21.26: Sliding Window Attention
Implement sliding window attention (as used in Mistral) where each position can only attend to the previous $W$ positions: (a) Modify the causal mask to include the sliding window constraint. (b) Analyze how this reduces computation and memory. (c) Train the model with different window sizes and compare performance. (d) Can information still propagate across the entire sequence? If so, how?
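A sketch of the combined causal-plus-window mask for part (a):

```python
import torch

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: position i may attend to positions j
    # with i - window < j <= i (causal AND within the last `window` tokens).
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, window=3).int())
```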
Exercise 21.27: Speculative Decoding
Implement speculative decoding, where a smaller "draft" model generates candidate tokens that are then verified by the larger model: (a) Train a small (2-layer) and large (6-layer) mini-GPT on the same data. (b) Implement the speculative decoding algorithm. (c) Measure the acceptance rate and effective speedup. (d) How does the speedup depend on the quality of the draft model?
Exercise 21.28: Grouped Query Attention
Implement Grouped Query Attention (GQA) as used in Llama 2: (a) Modify the multi-head attention to share key-value heads across query heads. (b) Compare the parameter count and KV cache size with standard MHA. (c) Train models with MHA, GQA, and Multi-Query Attention (MQA) and compare performance.
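The key step, sketched below, is sharing each key/value head across a group of query heads; expanding the KV heads with repeat_interleave is one simple way to do it:

```python
import torch

batch, seq_len, head_dim = 2, 16, 64
n_q_heads, n_kv_heads = 12, 4              # n_q_heads must be a multiple of n_kv_heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)      # (batch, n_q_heads, seq_len, head_dim)
v = v.repeat_interleave(group, dim=1)
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v    # causal masking omitted for brevity
print(out.shape)
```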
Exercise 21.29: Attention Sink Analysis
Implement and test the "attention sink" phenomenon (Xiao et al., 2023): (a) Load GPT-2 and process a long sequence. (b) Extract attention patterns and verify that the first few tokens receive disproportionate attention regardless of their content. (c) Implement the StreamingLLM approach of keeping a few "sink tokens" when evicting from the KV cache. (d) Compare generation quality with and without sink tokens for very long sequences.
Exercise 21.30: Prefix Language Modeling
Implement prefix language modeling, where the model uses bidirectional attention on a prefix and autoregressive attention on the rest: (a) Modify the causal mask to allow bidirectional attention on the first $P$ positions. (b) Train the model and compare with standard causal language modeling. (c) Evaluate on tasks where the prefix provides context (e.g., text completion given a paragraph).
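A sketch of the modified mask for part (a):

```python
import torch

def prefix_lm_mask(seq_len, prefix_len):
    # Causal everywhere, except that the first `prefix_len` positions
    # may attend to each other bidirectionally.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask[:prefix_len, :prefix_len] = True
    return mask

print(prefix_lm_mask(6, prefix_len=3).int())
```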
Exercise 21.31: Contrastive Decoding
Implement contrastive decoding (Li et al., 2022), which uses the difference between a large and small model's logits: (a) Train a large and small mini-GPT on the same data. (b) Implement the contrastive decoding formula: $\text{logits} = \text{logits}_{\text{large}} - \alpha \cdot \text{logits}_{\text{small}}$. (c) Compare generation quality against standard sampling from the large model. (d) Explore different values of $\alpha$ and analyze their effects.
Exercise 21.32: Watermarking
Implement a simple text watermarking scheme for autoregressive generation: (a) At each generation step, hash the previous $k$ tokens to create a pseudo-random partition of the vocabulary into "green" and "red" lists. (b) During generation, add a small bias $\delta$ to green-list token logits. (c) Implement a detection algorithm that tests whether a text contains a statistically significant proportion of green-list tokens. (d) Evaluate the trade-off between watermark detectability and text quality.
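A sketch of the green-list construction and logit bias, loosely following Kirchenbauer-style watermarking; the hash, the green-list fraction gamma, and the bias delta are illustrative assumptions:

```python
import torch

def green_list_mask(prev_tokens, vocab_size, gamma=0.5):
    # Seed a PRNG with a hash of the previous k tokens, then mark a fraction
    # gamma of the vocabulary as "green".
    seed = hash(tuple(prev_tokens)) % (2 ** 31)
    gen = torch.Generator().manual_seed(seed)
    perm = torch.randperm(vocab_size, generator=gen)
    green = torch.zeros(vocab_size, dtype=torch.bool)
    green[perm[: int(gamma * vocab_size)]] = True
    return green

def watermark_logits(logits, prev_tokens, delta=2.0, gamma=0.5):
    green = green_list_mask(prev_tokens, logits.shape[-1], gamma)
    return logits + delta * green.float()
```

Detection in part (c) can reuse green_list_mask to count how many generated tokens landed on their green list and test that count against the null fraction gamma.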
Exercise 21.33: Multi-Token Prediction
Instead of predicting only the next token, implement a model that predicts the next $N$ tokens simultaneously: (a) Add $N$ separate prediction heads to the model (one for each future position). (b) Train with the combined loss across all $N$ heads. (c) During generation, use only the first head's prediction. Does the multi-head training improve generation quality? (d) Experiment with different values of $N$ (2, 4, 8) and analyze the effect.
Exercise 21.34: Efficient Inference with Quantization
Implement post-training quantization for the mini-GPT model: (a) Quantize the model weights from float32 to int8 using PyTorch's quantization utilities. (b) Measure the impact on generation quality (perplexity) and inference speed. (c) Compare naive round-to-nearest quantization with absmax quantization. (d) At what bit width does generation quality start to degrade noticeably?
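A sketch of absmax (symmetric) quantization for part (c); parts (a) and (b) can rely on PyTorch's built-in quantization utilities instead:

```python
import torch

def absmax_quantize(w, bits=8):
    # Symmetric quantization: scale by the largest absolute weight, round, clamp.
    # Values are stored in int8 regardless of `bits`; lower bit widths just use a narrower range.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp((w / scale).round(), -qmax - 1, qmax).to(torch.int8)
    return q, scale                      # dequantize with q.float() * scale
```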
Exercise 21.35: Knowledge Distillation
Train a smaller "student" model to mimic a larger "teacher" model: (a) Train a 6-layer teacher model on text data. (b) Train a 2-layer student model using a combination of the standard cross-entropy loss and a distillation loss that matches the teacher's output distribution. (c) Compare the student trained with distillation against one trained only with the standard loss. (d) Vary the temperature used for distillation and analyze its effect.