Chapter 22: Quiz

Test your understanding of scaling laws, large language models, benchmarking, tokenizers, context windows, and mixture-of-experts architectures.

Question 1

What is the functional form of the Kaplan neural scaling law relating model loss $L$ to model size $N$?

A) $L(N) = a \cdot N + b$ (linear) B) $L(N) = a \cdot \log N + b$ (logarithmic) C) $L(N) = (N_c / N)^{\alpha_N}$ (power law) D) $L(N) = a \cdot e^{-bN}$ (exponential decay)

Show Answer

**C) $(N_c / N)^{\alpha_N}$ (power law)** Kaplan et al. found that language model loss follows a power-law relationship with model size, dataset size, and compute. Power laws appear as straight lines on log-log plots and hold over many orders of magnitude. The exponent $\alpha_N \approx 0.076$ governs how quickly loss decreases with model size.

Question 2

The Chinchilla paper's key finding was that, for a fixed compute budget, the optimal allocation should:

A) Invest nearly all compute into model size, keeping data fixed B) Invest nearly all compute into data, keeping model size fixed C) Scale model size and training data roughly equally D) Keep both model size and data fixed but train for more epochs

Show Answer

**C) Scale model size and training data roughly equally** Hoffmann et al. found $N_{\text{opt}} \propto C^{0.50}$ and $D_{\text{opt}} \propto C^{0.50}$, meaning that doubling compute should double both model size and training data. This contrasts with Kaplan's original recommendation to invest disproportionately in model size. The optimal ratio is approximately 20 tokens per parameter.

Question 3

GPT-3 (175B parameters, 300B tokens) has a tokens-per-parameter ratio of approximately:

A) 0.6 B) 1.7 C) 20 D) 100

Show Answer

**B) 1.7** $300\text{B} / 175\text{B} \approx 1.7$ tokens per parameter. By the Chinchilla standard (20 tokens per parameter), GPT-3 was significantly undertrained. The Chinchilla-optimal configuration for the same compute budget would have been a ~40B model trained on ~800B tokens.

Question 4

What does the Chinchilla scaling law's term $E \approx 1.69$ nats represent?

A) The maximum achievable loss B) The irreducible entropy of natural language that cannot be reduced by scaling C) The minimum number of epochs required for convergence D) The energy cost of training in joules per parameter

Show Answer

**B) The irreducible entropy of natural language that cannot be reduced by scaling** In the formula $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$, the term $E$ represents the inherent unpredictability (entropy) of natural language. No amount of scaling in model size or data can reduce the loss below $E$. It reflects the genuine randomness and ambiguity in language.

Question 5

Which of the following is the strongest argument from Schaeffer et al. (2023) against emergent abilities?

A) Large language models are no more capable than small ones B) Scaling laws break down above 100 billion parameters C) Apparent emergence is an artifact of using discontinuous metrics like exact-match accuracy D) Emergent abilities only appear in English, not other languages

Show Answer

**C) Apparent emergence is an artifact of using discontinuous metrics like exact-match accuracy** Schaeffer et al. showed that when the same tasks are measured with linear/continuous metrics (like per-token accuracy or edit distance), the apparent sharp phase transitions disappear and performance scales smoothly. The "emergence" is a property of the measurement, not the model.

Question 6

Which LLM family was the first to release competitive open-weight models that catalyzed the open-source ecosystem?

A) GPT B) Claude C) LLaMA D) Gemini

Show Answer

**C) LLaMA** Meta's LLaMA (2023) was the first wave of truly competitive open-weight LLMs. The 65B model matched GPT-3.5 on many benchmarks, and its release catalyzed an explosion of open-source research including fine-tuned variants, quantized versions, and specialized adaptations.

Question 7

Llama 3 8B was trained on 15 trillion tokens, giving a tokens-per-parameter ratio of approximately:

A) 2 B) 20 C) 200 D) 2000

Show Answer

**D) 2000** $15\text{T} / 8\text{B} = 1875 \approx 2000$ tokens per parameter. This is roughly 100x the Chinchilla-optimal ratio. The motivation is inference-optimal training: a smaller model trained on vastly more data costs less to serve, even though training compute is higher than "optimal" by the Chinchilla definition.

Question 8

What is the pass@k metric used in HumanEval?

A) The probability of passing exactly $k$ test cases out of a test suite B) The probability that at least one of $k$ generated code samples passes all unit tests C) The accuracy of the model after $k$ attempts at the same problem D) The percentage of problems where $k$ or more solutions are correct

Show Answer

**B) The probability that at least one of $k$ generated code samples passes all unit tests** pass@k = $1 - \binom{n-c}{k}/\binom{n}{k}$, where $n$ is total samples generated and $c$ is the count of correct samples. It measures the probability that a user who generates $k$ samples will find at least one correct solution. This is important because code generation often benefits from generating multiple candidates.

Question 9

Which benchmark provides the most comprehensive evaluation across accuracy, fairness, bias, toxicity, and efficiency?

A) MMLU B) HumanEval C) HELM D) GSM8K

Show Answer

**C) HELM (Holistic Evaluation of Language Models)** HELM, developed by Stanford's CRFM, evaluates models across 42 scenarios on multiple dimensions including accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. It is the most comprehensive single evaluation framework for LLMs.

Question 10

The "Lost in the Middle" phenomenon refers to:

A) Models forgetting information from early layers of the network B) Models performing well on information at the beginning and end of the context but poorly in the middle C) Expert networks in MoE models receiving unequal token loads D) Tokenizers losing information when encoding rare characters

Show Answer

**B) Models performing well on information at the beginning and end of the context but poorly in the middle** Liu et al. (2023) demonstrated that LLMs show a U-shaped retrieval curve: they attend well to information near the start and end of the context window but poorly to information in the middle. This is a fundamental limitation of current attention mechanisms for long-context tasks.

Question 11

What is the primary purpose of instruction tuning (SFT)?

A) To increase the model's parameter count B) To teach the model to follow natural language instructions and produce appropriate responses C) To compress the model for faster inference D) To remove toxic content from the training data

Show Answer

**B) To teach the model to follow natural language instructions and produce appropriate responses** Instruction tuning (supervised fine-tuning on instruction-response pairs) transforms a base language model from a next-token predictor into an instruction-following assistant. Without it, the model may continue the prompt rather than responding to it as an instruction.

Question 12

What is the "alignment tax"?

A) The financial cost of hiring human annotators for RLHF B) A slight decrease in raw knowledge benchmark scores after instruction tuning C) The compute overhead of running a reward model during inference D) Government regulations on AI alignment

Show Answer

**B) A slight decrease in raw knowledge benchmark scores after instruction tuning** Instruction-tuned models sometimes score slightly lower than base models on pure knowledge benchmarks like MMLU, because the format-following bias introduces a small overhead. However, on practical tasks and in real-world usage, instruction-tuned models are dramatically more useful.

Question 13

Byte-Pair Encoding (BPE) builds its vocabulary by:

A) Starting with whole words and splitting them into subwords B) Starting with individual bytes/characters and iteratively merging the most frequent adjacent pairs C) Using a neural network to learn optimal tokenization D) Randomly sampling subword units from the training data

Show Answer

**B) Starting with individual bytes/characters and iteratively merging the most frequent adjacent pairs** BPE begins with single bytes (or characters) and repeatedly merges the most frequent pair of adjacent tokens into a new token. This bottom-up process continues until the desired vocabulary size is reached. It produces a vocabulary that balances between character-level and word-level representation.

Question 14

GPT-2's tokenizer has a "fertility" (tokens per word) of ~1.3 for English but ~15 for Burmese. What is the practical consequence?

A) The model generates Burmese text 15x faster B) The model's effective context window is roughly 10x shorter for Burmese C) Burmese text is 15x more accurate D) The model cannot process Burmese at all

Show Answer

**B) The model's effective context window is roughly 10x shorter for Burmese** If a model has a 4096-token context window, that window holds approximately 3150 English words ($4096 / 1.3$) but only about 273 Burmese words ($4096 / 15$). This means the model can process far less Burmese text per inference call, effectively discriminating against languages with high tokenizer fertility.

Question 15

Which position encoding method encodes position by rotating query and key vectors in 2D subspaces?

A) Sinusoidal positional encoding B) Learned positional embeddings C) ALiBi (Attention with Linear Biases) D) RoPE (Rotary Position Embeddings)

Show Answer

**D) RoPE (Rotary Position Embeddings)** RoPE encodes position $m$ by applying a rotation of angle $m \cdot \theta$ to pairs of dimensions in the query and key vectors. The key property is that the dot product between rotated query and key vectors depends only on the relative position $m - n$, making it a relative position encoding that can be extended beyond the training context length.

Question 16

FlashAttention achieves its speedup by:

A) Approximating attention with a sparse matrix B) Reducing the number of attention heads C) Restructuring the computation to minimize memory I/O between SRAM and HBM D) Using a smaller embedding dimension

Show Answer

**C) Restructuring the computation to minimize memory I/O between SRAM and HBM** FlashAttention computes exact standard attention (no approximation) but tiles the computation to keep intermediate results in fast GPU SRAM rather than writing them to slow HBM. This I/O-aware algorithm achieves 2--4x speedups and reduces peak memory from $O(L^2)$ to $O(L)$.

Question 17

In a Mixture of Experts layer with 8 experts and top-2 routing, what fraction of total expert parameters are active for each token?

A) 1/8 B) 2/8 = 1/4 C) 4/8 = 1/2 D) 8/8 = all of them

Show Answer

**B) 2/8 = 1/4** With top-2 routing, each token activates exactly 2 of the 8 experts, meaning 2/8 = 25% of the expert parameters are used per token. This is the key efficiency gain of MoE: the model has 8x more expert parameters than a dense model but only uses 2x per token.

Question 18

What is the primary purpose of the load balancing loss in MoE training?

A) To ensure all tokens are processed in the correct order B) To prevent the model from routing most tokens to a few experts (expert collapse) C) To balance the learning rates across different experts D) To ensure equal training time across GPUs

Show Answer

**B) To prevent the model from routing most tokens to a few experts (expert collapse)** Without the load balancing loss, the routing function tends to collapse, sending most tokens to a few "popular" experts while other experts receive almost no tokens and learn nothing. The load balancing loss penalizes uneven expert utilization, encouraging the router to spread tokens across all experts.

Question 19

Mixtral 8x7B has 46.7B total parameters but only ~12.9B active per token. What is the primary disadvantage of this architecture compared to a dense 13B model?

A) Lower accuracy on all benchmarks B) Much higher memory requirements (all 46.7B parameters must be stored) C) Slower training convergence D) Cannot be fine-tuned

Show Answer

**B) Much higher memory requirements (all 46.7B parameters must be stored)** Even though only ~12.9B parameters are active per token, all 46.7B parameters must reside in memory (either GPU or CPU) because different tokens route to different experts. This means the memory footprint is 3.6x larger than a dense model with the same active parameter count, even though the compute per token is similar.

Question 20

The Switch Transformer (Fedus et al., 2021) differed from previous MoE approaches primarily by:

A) Using top-1 routing instead of top-2 routing B) Using the MoE layer only in the first Transformer block C) Removing the gating network entirely D) Using continuous rather than discrete routing

Show Answer

**A) Using top-1 routing instead of top-2 routing** The Switch Transformer simplified MoE by routing each token to exactly one expert (top-1) instead of two (top-2). This reduced computation and communication costs while still achieving significant improvements over dense baselines. The simplified design also used a cleaner load balancing loss formulation.

Question 21

Which of the following is NOT a strategy for mitigating benchmark contamination?

A) Embedding canary strings in test sets B) Using benchmarks created after the model's training data cutoff C) Training the model specifically on benchmark test sets to ensure familiarity D) Keeping test examples confidential (private test sets)

Show Answer

**C) Training the model specifically on benchmark test sets to ensure familiarity** Training on benchmark test sets is the definition of contamination, not a mitigation strategy. Legitimate mitigation strategies include canary strings, temporal holdout, dynamic benchmarks with programmatically generated instances, and private test sets.

Question 22

Which model family was the first to demonstrate the practical viability of MoE at the LLM scale, matching Llama 2 70B with fewer active parameters?

A) Switch Transformer B) Mixtral 8x7B C) GPT-4 D) Gemini 1.0

Show Answer

**B) Mixtral 8x7B** Mixtral 8x7B (Mistral AI, 2023) was a landmark demonstration that MoE architectures could match or exceed much larger dense models (like Llama 2 70B and GPT-3.5) at practical scales. With only ~12.9B active parameters out of 46.7B total, it achieved comparable quality at significantly lower inference cost.

Question 23

A system prompt in the instruction-following paradigm is:

A) The first user message in a conversation B) A special instruction defining the model's persona, constraints, and behavioral guidelines, treated as higher priority than user messages C) A prompt used only during model training, not inference D) The operating system command that launches the model server

Show Answer

**B) A special instruction defining the model's persona, constraints, and behavioral guidelines, treated as higher priority than user messages** The system prompt is prepended to every conversation and defines the model's role, behavioral constraints, output format, and other high-level instructions. It is treated as higher-priority than user instructions, allowing developers to set guardrails that ordinary user prompts cannot easily override.

Question 24

Sliding Window Attention with window size $W = 4096$ and $L = 32$ layers has a theoretical receptive field of:

A) 4,096 tokens B) 32 tokens C) $32 \times 4096 = 131{,}072$ tokens D) $4096^{32}$ tokens

Show Answer

**C) $32 \times 4096 = 131{,}072$ tokens** In sliding window attention, each layer allows information to propagate $W$ positions. With $L$ layers, information can propagate up to $L \times W$ positions in total. For $W = 4096$ and $L = 32$, the theoretical receptive field is $32 \times 4096 = 131{,}072$ tokens, though the effective receptive field in practice may be smaller due to information decay.

Question 25

Which of the following best describes why many production LLMs are trained beyond the Chinchilla-optimal point?

A) Because Chinchilla's analysis was proven to be incorrect B) Because a smaller model trained on more data is cheaper to serve at inference time, even if training is more expensive C) Because there is unlimited high-quality training data available D) Because larger models always outperform smaller models regardless of training data

Show Answer

**B) Because a smaller model trained on more data is cheaper to serve at inference time, even if training is more expensive** The Chinchilla-optimal point minimizes training compute, but inference cost depends on model size, not training data. Since a model may serve millions of queries over its lifetime, the total inference cost can far exceed training cost. Training a smaller model on more data than Chinchilla-optimal increases training cost but reduces the per-query inference cost, which is often the better trade-off for production systems.