Chapter 22: Exercises
Part A: Scaling Laws (Conceptual)
Exercise 22.1: Power-Law Relationships
The Kaplan scaling law for model size is $L(N) = (N_c / N)^{\alpha_N}$ with $\alpha_N \approx 0.076$. (a) If a model with $N = 10^9$ parameters achieves loss $L = 3.0$, what loss would you predict for a model with $N = 10^{10}$ parameters (assuming sufficient data)? (b) How many parameters would be needed to halve the loss? (c) Plot (on paper or in code) the scaling curve on both linear and log-log axes. What property of the log-log plot confirms the power-law relationship?
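As a starting point, here is a short numerical sketch for parts (a)--(c), assuming only the power law stated above and treating the given $(N, L)$ point as the reference:

```python
import numpy as np
import matplotlib.pyplot as plt

alpha_N = 0.076
N0, L0 = 1e9, 3.0                       # reference point given in the exercise

# (a) L scales as N^(-alpha_N), so scaling N by 10x multiplies L by 10^(-alpha_N)
L_10x = L0 * (N0 / 1e10) ** alpha_N     # ~2.52

# (b) to halve the loss, solve (N0 / N)^alpha_N = 1/2 for N
N_half = N0 * 2 ** (1 / alpha_N)        # ~9e12 parameters

# (c) plot on linear and log-log axes; a straight line on log-log axes is the power-law signature
N = np.logspace(7, 13, 200)
L = L0 * (N0 / N) ** alpha_N
fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))
ax_lin.plot(N, L)
ax_log.loglog(N, L)
plt.show()
```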
Exercise 22.2: Chinchilla vs. Kaplan Allocation
You have a fixed compute budget of $C = 10^{22}$ FLOPs and the approximation $C \approx 6ND$. (a) Under Kaplan's allocation ($N_{\text{opt}} \propto C^{0.73}$, $D_{\text{opt}} \propto C^{0.27}$), compute the approximate optimal $N$ and $D$. Use the constraint $C = 6ND$ to find the tokens-per-parameter ratio. (b) Under Chinchilla's allocation ($N_{\text{opt}} \propto C^{0.50}$, $D_{\text{opt}} \propto C^{0.50}$), compute the approximate optimal $N$ and $D$ for the same budget. What is the tokens-per-parameter ratio? (c) Explain qualitatively why the Chinchilla allocation produces a better final model.
Exercise 22.3: Inference-Optimal Training
Suppose you plan to serve a model for 1 year, processing $Q = 10^{12}$ total query tokens. The total cost is $C_{\text{train}} + C_{\text{inference}} = 6ND + 2NQ$ (where inference uses approximately 2 FLOPs per parameter per token for a forward pass only). (a) For fixed total budget $C_{\text{total}}$, derive the optimal $N$ as a function of $D$ and $Q$. (b) Show that when $Q \gg D$, the optimal model is much smaller than Chinchilla-optimal. (c) Llama 3 8B was trained on 15 trillion tokens, which uses roughly the same training compute as training a 40B model on 3T tokens. What value of $Q$ would justify choosing the smaller, longer-trained configuration over the 40B model?
Exercise 22.4: Scaling Law Extrapolation
A research lab trains three models on the same data with different sizes and observes:
- $N_1 = 10^7$ parameters, $L_1 = 4.2$
- $N_2 = 10^8$ parameters, $L_2 = 3.6$
- $N_3 = 10^9$ parameters, $L_3 = 3.1$
(a) Fit a power law $L(N) = aN^{-b}$ by performing linear regression on the log-transformed data. Find $a$ and $b$. (b) Predict the loss at $N_4 = 10^{10}$ parameters. (c) What assumptions must hold for this extrapolation to be valid?
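One way to do the fit in part (a) is linear regression in log space; a minimal NumPy sketch (variable names are illustrative):

```python
import numpy as np

N = np.array([1e7, 1e8, 1e9])
L = np.array([4.2, 3.6, 3.1])

# L = a * N^(-b)  =>  log L = log a - b * log N, so fit a line to (log N, log L)
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
a, b = np.exp(intercept), -slope

L_pred = a * (1e10) ** (-b)             # (b) extrapolate to N = 10^10
print(f"a = {a:.3f}, b = {b:.4f}, predicted L(1e10) = {L_pred:.2f}")
```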
Exercise 22.5: Irreducible Loss
The Chinchilla scaling law is $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ with $E \approx 1.69$ nats. (a) What is the entropy interpretation of $E$? What does it represent physically? (b) Convert $E = 1.69$ nats to bits. What does this tell us about the per-token predictability of natural language? (c) If the vocabulary size is 32,000, what is the perplexity corresponding to the irreducible loss? Compare this to the perplexity of a uniform distribution over the vocabulary.
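A quick numerical check for parts (b) and (c), sketching only the unit conversions:

```python
import numpy as np

E = 1.69                              # irreducible loss in nats per token
bits_per_token = E / np.log(2)        # ~2.44 bits per token
ppl_irreducible = np.exp(E)           # perplexity at the irreducible loss, ~5.4
ppl_uniform = 32_000                  # perplexity of a uniform distribution over a 32,000-token vocabulary
print(bits_per_token, ppl_irreducible, ppl_uniform)
```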
Exercise 22.6: Data Scaling Limits
(a) Estimate the total number of high-quality English text tokens available on the public internet (use Common Crawl as a reference: on the order of 100 billion crawled web pages averaging ~1,000 tokens each, of which only a few percent survive deduplication and quality filtering, yielding roughly 3--5 trillion high-quality tokens). (b) If a 100B-parameter model needs 2T tokens for Chinchilla-optimal training, how many such models could be trained on the available data before exhaustion? (c) Discuss three strategies for addressing data scarcity. For each, explain the potential benefits and risks.
Part B: Emergent Abilities and Benchmarks
Exercise 22.7: Emergence vs. Metric Choice
Consider a 5-digit addition task (e.g., "12345 + 67890 = ?"). (a) If a model has per-digit accuracy of 90%, what is the probability that it gets the entire 5-digit answer exactly right (assuming digit errors are independent)? (b) What per-digit accuracy is needed for 50% exact-match accuracy on 5-digit answers? (c) Plot exact-match accuracy as a function of per-digit accuracy for answer lengths 1, 3, 5, and 10. Explain how this relates to the "emergent abilities as mirage" argument.
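For part (c), a small plotting sketch (assuming independent digit errors, so exact-match accuracy is the per-digit accuracy raised to the answer length):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.5, 1.0, 200)                   # per-digit accuracy
for n_digits in (1, 3, 5, 10):
    plt.plot(p, p ** n_digits, label=f"{n_digits}-digit answer")
plt.axhline(0.5, linestyle="--", color="gray")   # 50% exact-match threshold from part (b)
plt.xlabel("per-digit accuracy")
plt.ylabel("exact-match accuracy")
plt.legend()
plt.show()
```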
Exercise 22.8: MMLU Analysis
MMLU reports accuracy averaged across 57 subjects. Suppose a model scores 70% average across all subjects. (a) Is it possible for the model to score below 50% (random chance for 4-choice questions) on some subjects while still achieving 70% average? Give a concrete example. (b) Why is subject-level reporting important for understanding model capabilities? (c) If the model was trained on data that includes MMLU-style questions, does the 70% score overestimate or underestimate true capability? Discuss the contamination problem.
Exercise 22.9: HumanEval pass@k
The pass@k metric is defined as $\text{pass@}k = 1 - \binom{n-c}{k}/\binom{n}{k}$ where $n$ is total samples and $c$ is correct samples. (a) If $n = 100$ and $c = 20$, compute pass@1, pass@5, and pass@10. (b) Show that as $k \to n$, pass@k approaches 1 if $c \geq 1$. (c) A model achieves pass@1 = 0.40 and pass@10 = 0.75. What does the gap between these two numbers tell you about the model's capabilities? (d) Why is pass@k a more informative metric than simple accuracy for code generation?
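For part (a), computing the binomial coefficients directly can overflow; a numerically stable product form of the same estimator (similar to the one used in the original HumanEval evaluation code) is sketched below:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n-c, k) / C(n, k), written as a product to avoid huge factorials."""
    if n - c < k:
        return 1.0   # fewer than k incorrect samples, so any draw of k contains a correct one
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(100, 20, k):.3f}")
```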
Exercise 22.10: Benchmark Saturation
(a) Define benchmark saturation. Why is it a problem for the field? (b) List three benchmarks that are considered saturated (top models score >90%) and three that remain challenging. (c) Propose a strategy for creating benchmarks that are resistant to saturation. What trade-offs does your strategy involve?
Exercise 22.11: LLM-as-Judge
MT-Bench uses GPT-4 as a judge to evaluate model responses on a 1--10 scale. (a) What are two advantages of using an LLM as a judge compared to human evaluation? (b) What are two potential biases introduced by using an LLM as a judge? (c) Design a simple experiment to measure the agreement between GPT-4 judge scores and human expert scores. What metrics would you use?
Exercise 22.12: Designing a Benchmark
Design a benchmark for evaluating LLMs on a specific domain of your choice (e.g., legal reasoning, medical diagnosis, software debugging). Your design should include:
(a) Task specification (10 example items)
(b) Evaluation metric(s) with precise definitions
(c) A contamination mitigation strategy
(d) Expected baseline performance for random guessing, small models (7B), and frontier models
Part C: LLM Families and Architecture
Exercise 22.13: Architecture Comparison Table
Create a detailed comparison table for GPT-4, Claude 3 Opus, Llama 3 70B, Mixtral 8x7B, and Gemini 1.5 Pro. Include: estimated parameter count, architecture type (dense vs. MoE), context window, tokenizer vocabulary size, position encoding type, activation function, normalization type, and whether weights are publicly available. Use "undisclosed" where information is not public.
Exercise 22.14: Active Parameter Computation
Mixtral 8x7B has 8 experts per MoE layer with top-2 routing. (a) If each expert FFN has dimensions $d_{\text{model}} = 4096$ and $d_{\text{ffn}} = 14336$, calculate the total parameters in one MoE layer (including the gating network). (b) Calculate the active parameters per token in one MoE layer. (c) If the model has 32 transformer layers and every FFN is an MoE layer, calculate the total and active parameter counts for the entire model (ignore attention parameters for this calculation).
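A counting sketch for parts (a)--(c). Note that the answer depends on the assumed expert structure: a plain two-matrix FFN has $2 \cdot d_{\text{model}} \cdot d_{\text{ffn}}$ parameters per expert, while Mixtral's actual SwiGLU experts use three matrices; the sketch computes both, and the router is assumed to be a single $d_{\text{model}} \times E$ linear layer.

```python
d_model, d_ffn, E, k, n_layers = 4096, 14336, 8, 2, 32

gate = E * d_model                                    # router: one logit per expert per token
per_expert = {"2-matrix FFN": 2 * d_model * d_ffn,    # up + down projection
              "SwiGLU (3 matrices)": 3 * d_model * d_ffn}

for name, expert in per_expert.items():
    layer_total = gate + E * expert                   # (a) all experts present in the layer
    layer_active = gate + k * expert                  # (b) only the top-2 experts run per token
    print(f"{name}: layer total {layer_total/1e6:.0f}M, layer active {layer_active/1e6:.0f}M, "
          f"model total {n_layers*layer_total/1e9:.1f}B, model active {n_layers*layer_active/1e9:.1f}B")
```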
Exercise 22.15: GQA Analysis
Grouped-Query Attention (GQA) uses fewer key-value heads than query heads. (a) In standard multi-head attention with $h = 32$ heads and $d_k = 128$, calculate the KV cache size per token per layer in bytes (float16). (b) In GQA with $h_q = 32$ query heads and $h_{kv} = 8$ key-value groups, calculate the KV cache size per token per layer. (c) Compute the memory savings ratio for GQA vs. standard MHA. (d) Does GQA affect model quality? Discuss the empirical findings.
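A short arithmetic sketch for parts (a)--(c), assuming the cache stores one key and one value vector per KV head per token in float16:

```python
d_k, fp16_bytes = 128, 2

def kv_bytes_per_token_per_layer(n_kv_heads: int) -> int:
    return 2 * n_kv_heads * d_k * fp16_bytes   # factor 2 = one K vector + one V vector

mha = kv_bytes_per_token_per_layer(32)         # (a) standard MHA: 32 KV heads
gqa = kv_bytes_per_token_per_layer(8)          # (b) GQA: 8 KV groups
print(mha, gqa, mha / gqa)                     # (c) savings ratio
```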
Exercise 22.16: Open vs. Closed Models
(a) List three advantages of open-weight models (e.g., Llama 3) for the research community. (b) List three advantages of closed-API models (e.g., GPT-4) for enterprise deployment. (c) A startup needs to build a customer service chatbot. They have moderate GPU resources and strict data privacy requirements. Should they use an open-weight model or a closed API? Justify your recommendation.
Part D: Tokenizers and Context Windows
Exercise 22.17: BPE by Hand
Start with the corpus: "low low low low low lowest lowest newer newer newer wider wider wider wider". (a) Write the initial character-level vocabulary (including the end-of-word symbol). (b) Perform 5 iterations of BPE merging. Show the most frequent pair and the resulting merge at each step. (c) Tokenize the word "newest" using your learned vocabulary. What happens to out-of-vocabulary substrings?
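After working the merges by hand, you can check them with a small BPE sketch like the one below (the end-of-word marker and tie-breaking rule are implementation choices, so your hand-derived merges may differ on ties):

```python
from collections import Counter

corpus = ("low " * 5 + "lowest " * 2 + "newer " * 3 + "wider " * 4).split()
vocab = Counter(tuple(w) + ("</w>",) for w in corpus)   # each word as a tuple of symbols + end-of-word

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(vocab, pair):
    merged = Counter()
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

for step in range(5):
    pair = most_frequent_pair(vocab)
    print(f"merge {step + 1}: {pair}")
    vocab = apply_merge(vocab, pair)
```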
Exercise 22.18: Tokenizer Fertility
Using a tokenizer (e.g., the GPT-2 encoding via tiktoken, or the Llama tokenizer via HuggingFace), measure the fertility (tokens per word) for the following text samples:
(a) An English Wikipedia paragraph (~200 words)
(b) The same paragraph translated into Chinese
(c) The same paragraph translated into Arabic
(d) A Python code snippet (~50 lines)
Report the fertility for each sample and explain the differences.
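A measurement sketch is shown below; the tokenizer names are examples (the Llama repository requires access approval), and whitespace splitting is only a rough word count for scripts such as Chinese that are not whitespace-delimited:

```python
import tiktoken

def fertility(text: str, encode) -> float:
    """Tokens per whitespace-separated 'word', a rough proxy that is crude for unsegmented scripts."""
    return len(encode(text)) / len(text.split())

sample = "Paste your ~200-word Wikipedia paragraph here."
gpt2 = tiktoken.get_encoding("gpt2")
print("GPT-2 fertility:", fertility(sample, gpt2.encode))

# For a Llama tokenizer via HuggingFace (the model id is an example and may require access):
# from transformers import AutoTokenizer
# llama_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# print("Llama 3 fertility:", fertility(sample, llama_tok.encode))
```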
Exercise 22.19: Vocabulary Size Trade-offs
(a) A model with vocabulary $V = 32{,}000$ and embedding dimension $d = 4096$ has an embedding matrix of what size (in parameters and in megabytes in float16)? (b) If we increase $V$ to $128{,}000$, what is the new embedding matrix size? (c) A 7B-parameter model has roughly what percentage of its parameters in the embedding matrix for $V = 32{,}000$ and $V = 128{,}000$? (d) Discuss the trade-off between vocabulary size and embedding matrix overhead.
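A quick sketch of the arithmetic for parts (a)--(c); it counts only the input embedding matrix (with untied input and output embeddings the overhead roughly doubles):

```python
d = 4096
for V in (32_000, 128_000):
    params = V * d
    size_mb = params * 2 / 1e6          # float16: 2 bytes per parameter
    share_of_7b = 100 * params / 7e9    # rough share of a 7B-parameter model
    print(f"V={V:>7}: {params/1e6:.0f}M parameters, {size_mb:.0f} MB, {share_of_7b:.1f}% of 7B")
```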
Exercise 22.20: RoPE Position Encoding
Rotary Position Embeddings (RoPE) encode position $m$ by rotating query and key vectors. (a) For a 2D subspace, write the rotation matrix $R_m$ for position $m$ with base frequency $\theta$. (b) Show that the dot product $\langle R_m q, R_n k \rangle$ depends only on $m - n$ (the relative position). (c) If a model is trained with context length 4096 and $\theta_{\text{base}} = 10{,}000$, what happens if we try to use it at position 8192 without any modification? Why does "NTK-aware" scaling help?
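For part (b), here is a numerical sanity check of the relative-position property in a single 2D subspace (the frequency value is arbitrary here):

```python
import numpy as np

def rotate(v, m, theta):
    """Apply the RoPE rotation R_m for position m and frequency theta to a 2-D vector."""
    angle = m * theta
    R = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle),  np.cos(angle)]])
    return R @ v

rng = np.random.default_rng(0)
q, k, theta = rng.normal(size=2), rng.normal(size=2), 0.05

# <R_m q, R_n k> should be identical whenever m - n is the same
for m, n in [(5, 3), (105, 103), (1005, 1003)]:
    print(m - n, float(rotate(q, m, theta) @ rotate(k, n, theta)))
```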
Exercise 22.21: Context Window and Memory
A Llama 3 70B model uses GQA with $h_{kv} = 8$, $d_k = 128$, and 80 layers. (a) Calculate the KV cache size for a single sequence of 128K tokens in float16. (b) How many concurrent 128K-token sequences can fit in the memory of a single 80GB A100 GPU? Assume the model weights (70B parameters $\times$ 2 bytes $\approx$ 140 GB in float16) are sharded across other GPUs, so only the KV cache needs to fit in this GPU's memory. (c) Explain why KV cache management (e.g., PagedAttention) is critical for long-context inference.
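A sketch of the arithmetic for parts (a) and (b), under the exercise's assumptions (float16 cache, K and V stored for every layer, 80 GB read as $80 \times 10^9$ bytes):

```python
h_kv, d_k, n_layers, fp16_bytes = 8, 128, 80, 2
seq_len = 128 * 1024

bytes_per_token_per_layer = 2 * h_kv * d_k * fp16_bytes     # 2 = K and V
cache_bytes = bytes_per_token_per_layer * n_layers * seq_len
print(f"KV cache per 128K-token sequence: {cache_bytes / 2**30:.1f} GiB")
print(f"Sequences fitting in an 80 GB GPU: {int(80e9 // cache_bytes)}")
```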
Exercise 22.22: Sliding Window Attention
A Mistral 7B model uses sliding window attention with window size $W = 4096$ and 32 layers. (a) What is the theoretical receptive field (the maximum distance over which information can propagate)? (b) In practice, is the effective receptive field equal to the theoretical receptive field? Why or why not? (c) Compare the memory complexity of sliding window attention vs. full attention for a sequence of length $L = 32{,}768$.
Part E: Mixture of Experts (Coding and Analysis)
Exercise 22.23: Implementing a Gating Network
Implement a top-k gating network in PyTorch that takes a hidden state of dimension $d = 512$, has $E = 8$ experts, and routes to the top $k = 2$ experts. Your implementation should: (a) Compute the routing logits and probabilities. (b) Select the top-k experts per token. (c) Return the expert indices and their normalized weights. (d) Compute and return the load balancing loss.
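One possible solution outline is sketched below (not the only valid design; the load-balancing convention here divides routed slots by tokens $\times$ k so the fractions sum to 1):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Minimal top-k gating network sketch."""
    def __init__(self, d_model: int = 512, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k, self.n_experts = k, n_experts
        self.w_gate = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x):                                        # x: (num_tokens, d_model)
        logits = self.w_gate(x)                                  # (a) routing logits
        probs = F.softmax(logits, dim=-1)                        #     routing probabilities
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)        # (b) top-k experts per token
        weights = topk_probs / topk_probs.sum(-1, keepdim=True)  # (c) renormalized weights

        # (d) load-balancing loss: E * sum_i f_i * P_i, with f_i the fraction of routed
        #     slots assigned to expert i and P_i the mean routing probability of expert i
        f = torch.bincount(topk_idx.flatten(), minlength=self.n_experts).float() / topk_idx.numel()
        P = probs.mean(dim=0)
        balance_loss = self.n_experts * torch.sum(f * P)
        return topk_idx, weights, balance_loss

gate = TopKGate()
idx, w, aux = gate(torch.randn(16, 512))
print(idx.shape, w.shape, aux.item())
```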
Exercise 22.24: MoE vs. Dense Scaling
You have a budget of 100B total parameters. Compare two architectures:
- Dense: a single 100B-parameter model, all parameters active.
- MoE: 16 experts of ~6.25B each, top-2 routing, ~12.5B active parameters per token (plus shared attention parameters).
(a) Estimate the per-token FLOPs for each architecture. (b) If the MoE model achieves loss equivalent to a dense model with 2x its active parameters, what effective dense-equivalent size is the MoE model? (c) Calculate the memory requirements for each architecture in float16. (d) Which architecture would you choose for (i) a research lab with 8 A100 GPUs, and (ii) a cloud deployment serving millions of queries?
Exercise 22.25: Expert Specialization Analysis
After training an MoE model, you want to analyze what each expert has specialized in. (a) Describe a method to analyze expert specialization by examining which tokens are routed to each expert. (b) Design an experiment to test whether experts specialize by topic, by syntactic role, by position in the sequence, or by some other criterion. (c) If you find that two experts have very similar specialization patterns, what does this suggest about training, and what could you do about it?
Exercise 22.26: Load Balancing Loss
The load balancing loss is $\mathcal{L}_{\text{balance}} = \alpha \cdot E \cdot \sum_{i=1}^{E} f_i \cdot P_i$. (a) Show that this loss is minimized when $f_i = 1/E$ and $P_i = 1/E$ for all $i$. (b) Compute the loss value at the minimum. (c) If $\alpha = 0.01$ and $E = 8$, and one expert receives 50% of all tokens while the rest share equally, compute the load balancing loss. Compare to the minimum. (d) Why is the coefficient $E$ included in the formula?
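A small numerical check for parts (b) and (c); since part (c) does not specify the router probabilities, this sketch assumes $P_i$ mirrors the observed load $f_i$:

```python
import numpy as np

alpha, E = 0.01, 8

def balance_loss(f, P):
    return alpha * E * float(np.sum(np.asarray(f) * np.asarray(P)))

uniform = np.full(E, 1 / E)
print("minimum :", balance_loss(uniform, uniform))   # (b) equals alpha

# (c) one expert takes 50% of tokens, the remaining seven share the other 50% equally;
#     assumption: the mean routing probabilities P_i match the loads f_i
skewed = np.array([0.5] + [0.5 / (E - 1)] * (E - 1))
print("skewed  :", balance_loss(skewed, skewed))
```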
Exercise 22.27: Expert Capacity Factor
In the Switch Transformer, each expert has a capacity of $C = \lceil (\text{tokens in batch} / E) \cdot c_f \rceil$ where $c_f$ is the capacity factor (typically 1.0--1.5). (a) For a batch of 2048 tokens, $E = 16$ experts, and $c_f = 1.25$, compute the capacity per expert. (b) If the routing is perfectly balanced, what fraction of the expert capacity is used? (c) With $c_f = 1.0$ and imperfect balancing, what happens to tokens that cannot fit into their assigned expert? (d) Why is $c_f > 1.0$ preferred in practice?
Exercise 22.28: MoE Communication Pattern
In a distributed MoE model with 4 GPUs, each holding 4 experts (16 total), and a batch of 4096 tokens distributed evenly across GPUs: (a) Describe the all-to-all communication pattern required for top-2 routing. (b) In the worst case, how many tokens might a single GPU need to receive? (c) Estimate the communication volume (in bytes, float16) for one MoE layer. (d) How does expert parallelism interact with data parallelism and tensor parallelism?
Exercise 22.29: Complete MoE Forward Pass
Implement a complete MoE layer in PyTorch with the following specifications:
- Input dimension: 512
- Number of experts: 8
- Expert FFN hidden dimension: 2048
- Top-k routing: k = 2
- Include a load balancing loss
Test your implementation on a random batch of shape (batch=4, seq_len=128, d_model=512).
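A compact reference sketch of one possible implementation is given below (a loop-over-experts version written for clarity rather than speed), followed by the requested test on a random batch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Top-k MoE layer sketch: simple GELU experts, loop-over-experts dispatch."""
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, k=2):
        super().__init__()
        self.k, self.n_experts = k, n_experts
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        B, S, D = x.shape
        tokens = x.reshape(-1, D)                            # (B*S, D)
        probs = F.softmax(self.gate(tokens), dim=-1)
        topk_p, topk_i = probs.topk(self.k, dim=-1)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)   # renormalize over the chosen experts

        out = torch.zeros_like(tokens)
        for e in range(self.n_experts):                      # dispatch tokens expert by expert
            token_idx, slot = (topk_i == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += topk_p[token_idx, slot].unsqueeze(-1) * self.experts[e](tokens[token_idx])

        # load-balancing loss: E * sum_i f_i * P_i (fractions of routed slots vs. mean router probs)
        f = torch.bincount(topk_i.flatten(), minlength=self.n_experts).float() / topk_i.numel()
        P = probs.mean(dim=0)
        balance_loss = self.n_experts * (f * P).sum()
        return out.reshape(B, S, D), balance_loss

moe = MoELayer()
y, aux = moe(torch.randn(4, 128, 512))
print(y.shape, aux.item())
```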
Exercise 22.30: Scaling Laws Meta-Analysis
Conduct a mini-meta-analysis of scaling laws: (a) Find the reported parameter counts and benchmark scores (MMLU, HumanEval, or another common benchmark) for at least 6 models spanning at least 2 orders of magnitude in parameter count. (b) Plot the scores against parameter count on a log-linear scale. (c) Fit a curve to the data. Does a power law fit well? What about a sigmoid? (d) Based on your curve, predict the parameter count needed to achieve 95% on the benchmark. Discuss the reliability of this prediction.