Chapter 11: Quiz
Test your understanding of large language models. Answers follow each question.
Question 1
What single training objective is used during the pretraining stage of a decoder-only LLM?
Answer
**Next-token prediction** (autoregressive language modeling). The model minimizes the cross-entropy loss: $\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$, the average negative log-likelihood of each token given all preceding tokens $x_{<t}$.
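As a toy illustration of this objective, the loss can be computed from the probabilities the model assigns to each true next token (the values below are hand-picked, not from a real model):

```python
import math

def next_token_loss(token_probs):
    """Average negative log-likelihood over a sequence.

    token_probs: model probability assigned to each actual next token,
    i.e. P(x_t | x_{<t}) for t = 1..T (toy values, not a real model).
    """
    T = len(token_probs)
    return -sum(math.log(p) for p in token_probs) / T

# A model confident in the true tokens gets a lower loss.
confident = next_token_loss([0.9, 0.8, 0.95])
uncertain = next_token_loss([0.1, 0.2, 0.05])
assert confident < uncertain
```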
Question 2
How does a decoder-only transformer differ from the original encoder-decoder transformer? Why did the field converge on decoder-only for language modeling?
Answer
A decoder-only transformer uses only the decoder stack with **causal self-attention** (each token attends only to itself and preceding tokens). The original encoder-decoder architecture uses bidirectional self-attention in the encoder and both causal self-attention and cross-attention in the decoder. The field converged on decoder-only because: (1) a single architecture handles both "understanding" (via the prompt, processed in parallel) and "generation" (via autoregressive decoding), simplifying the system; (2) the same model can serve both as a text completer and as a question answerer; and (3) scaling experiments showed that decoder-only models achieve comparable or better performance with simpler training pipelines.

Question 3
What is the purpose of the causal mask in a decoder-only transformer? Write the mask matrix for a sequence of length 4.
Answer
The causal mask prevents each token from attending to future tokens, enforcing the autoregressive property: the prediction for position $t$ can only depend on positions $1, \ldots, t$. Without the mask, the model could "cheat" by looking at the token it is supposed to predict. For length 4: $$M = \begin{pmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{pmatrix}$$ After softmax, the $-\infty$ entries become 0, so position $i$ attends only to positions $j \leq i$.
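A minimal sketch of building this additive mask (plain Python lists standing in for tensors):

```python
def causal_mask(T, neg_inf=float("-inf")):
    """Return a T x T additive mask: 0.0 where attention is allowed
    (j <= i), -inf where position i would see a future position j."""
    return [[0.0 if j <= i else neg_inf for j in range(T)]
            for i in range(T)]

mask = causal_mask(4)
# Row 0 can attend only to position 0; the last row sees everything.
assert mask[0][1] == float("-inf")
assert all(v == 0.0 for v in mask[3])
```

Adding this matrix to the attention scores before the softmax zeroes out all forbidden positions, exactly as described above.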
Question 4
What is perplexity, and why is it preferred over raw cross-entropy loss for reporting language model performance?
Answer
Perplexity is $\text{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})\right)$, the exponential of the average cross-entropy loss. It is preferred because it has a direct interpretation: the effective branching factor, i.e., the number of equally likely tokens the model is choosing among at each step. A perplexity of 20 means the model is, on average, as uncertain as if it were picking uniformly from 20 candidates. Raw cross-entropy (in nats or bits) lacks this intuition. Note that perplexity values are only comparable between models that share a tokenizer, since tokenization determines what counts as a prediction step.
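The branching-factor interpretation can be checked directly with toy probabilities:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood of the true tokens."""
    T = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / T
    return math.exp(avg_nll)

# A model that assigns p = 1/20 to every true token has perplexity
# exactly 20: as uncertain as a uniform choice among 20 candidates.
assert abs(perplexity([1 / 20] * 5) - 20.0) < 1e-9
```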
Question 5
Explain the three stages of the modern LLM training pipeline. What data is used at each stage, and what is the objective?
Answer
1. **Pretraining:** Trained on trillions of tokens of diverse text (web crawls, books, code, Wikipedia) with next-token prediction. Objective: learn general language understanding and world knowledge. Duration: weeks to months.
2. **Supervised fine-tuning (SFT) / Instruction tuning:** Trained on 10K-100K curated (prompt, response) pairs, with the loss computed only on response tokens. Objective: teach the model to follow instructions rather than simply complete text. Duration: hours.
3. **Alignment (RLHF or DPO):** Trained on 10K-100K human preference pairs (preferred vs. dispreferred responses). Objective: align model outputs with human values — helpfulness, truthfulness, harmlessness. RLHF uses a learned reward model + PPO; DPO optimizes directly on preferences without an explicit reward model. Duration: hours to days.

Question 6
What did the Chinchilla paper (Hoffmann et al., 2022) demonstrate, and what is its practical significance?
Answer
Chinchilla showed that for a given compute budget, model size and dataset size should be **scaled equally**, with approximately 20 tokens per parameter being optimal. Many existing models (including Gopher at 280B parameters) were severely undertrained. Chinchilla (70B parameters, 1.4T tokens) matched Gopher's performance at 4x fewer parameters because it used a 4x larger dataset. Practical significance: (1) A smaller, well-trained model is often better than a larger, undertrained one — this directly affects GPU procurement. (2) Inference cost scales with model size, not training data, so Chinchilla-optimal models are cheaper to deploy. (3) Data quality and quantity become the binding constraints, not model architecture.
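The 20-tokens-per-parameter rule of thumb is simple enough to check as arithmetic:

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb compute-optimal token count (~20 tokens/parameter)."""
    return n_params * tokens_per_param

# 70B parameters -> ~1.4T training tokens, matching the Chinchilla setup.
assert chinchilla_optimal_tokens(70e9) == 1.4e12
```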
Question 7
In RLHF, what is "reward hacking" and how is it prevented?
Answer
**Reward hacking** occurs when the policy model learns to exploit artifacts in the reward model — generating outputs that score high according to the reward model but are not actually preferred by humans. For example, the model might learn that the reward model gives higher scores to longer, more verbose responses, even when a concise answer would be better. It is prevented by the **KL divergence penalty**: $\beta \cdot D_{\text{KL}}[\pi_\theta(y|x) \| \pi_{\text{SFT}}(y|x)]$. This term penalizes the policy for deviating too far from the SFT model, constraining exploration to the neighborhood of known-good behavior. The hyperparameter $\beta$ controls the trade-off between reward maximization and proximity to the reference policy.
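A simplified sketch of how this penalty is typically estimated in practice: a Monte-Carlo estimate over the tokens actually sampled, using the log-probabilities from the policy and the frozen reference. The log-probability values below are invented for illustration; a real system gets them from the two models.

```python
def kl_penalty(policy_logprobs, ref_logprobs, beta=0.1):
    """Monte-Carlo estimate of the KL term in the RLHF objective.

    policy_logprobs / ref_logprobs: log-probabilities that the current
    policy and the frozen SFT reference assign to the sampled tokens.
    """
    kl_est = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return beta * kl_est

# A policy drifting toward higher probability than the reference on its
# own samples yields a positive penalty, which is subtracted from the reward.
penalty = kl_penalty([-0.5, -0.7], [-1.0, -1.2], beta=0.1)
assert penalty > 0
```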
Question 8
How does DPO simplify the RLHF pipeline? What components does it eliminate?
Answer
DPO eliminates **two major components**: (1) the explicit reward model and its training pipeline, and (2) the reinforcement learning loop (PPO with its multiple hyperparameters, value function, advantage estimation, and training instability). DPO works by deriving a closed-form relationship between the optimal policy and the reward function, then substituting this into the Bradley-Terry preference model to obtain a supervised loss that can be optimized directly on preference data. The result is a single supervised fine-tuning step on paired preferences — no reward model, no RL, no PPO. This reduces engineering complexity (one training loop instead of three interacting components), eliminates PPO hyperparameter tuning, and produces comparable results in practice.

Question 9
What is LoRA, and why does low-rank adaptation work for fine-tuning LLMs?
Answer
**LoRA** (Low-Rank Adaptation) fine-tunes a model by learning low-rank matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ (with $r \ll d$) that are added to frozen pretrained weights: $h = W_0 x + \frac{\alpha}{r} BAx$. It works because: (1) **Intrinsic dimensionality** — the fine-tuning loss landscape has a much lower effective dimension than the full parameter space (Aghajanyan et al., 2021). (2) **The weight update $\Delta W$ is empirically low-rank** — most of the model's knowledge is preserved from pretraining, and only a small task-specific adjustment is needed. The singular values of $\Delta W$ decay rapidly, confirming that rank $r = 8$-$64$ captures most of the update. Key advantage: LoRA reduces trainable parameters by 100x+ and can be merged into the base weights at inference time with zero latency overhead.
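A minimal sketch of the LoRA forward pass, using plain Python lists as stand-in tensors (real implementations use GPU tensors and train only $A$ and $B$):

```python
import random

def lora_forward(x, W0, A, B, alpha=32):
    """h = W0 x + (alpha/r) * B(Ax), with W0 frozen, only A and B trained.
    A: r x d, B: d x r, W0: d x d (square here for simplicity)."""
    r = len(A)

    def matvec(M, v):
        return [sum(m * vj for m, vj in zip(row, v)) for row in M]

    base = matvec(W0, x)
    delta = matvec(B, matvec(A, x))  # low-rank path: cost O(d*r), not O(d*d)
    return [b + (alpha / r) * u for b, u in zip(base, delta)]

# Standard initialization sets B to zeros, so before any training the
# adapter is a no-op and the model behaves exactly like the base model.
d, r = 4, 2
x = [random.random() for _ in range(d)]
W0 = [[random.random() for _ in range(d)] for _ in range(d)]
A = [[random.random() for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]
h = lora_forward(x, W0, A, B)
base = [sum(w * xi for w, xi in zip(row, x)) for row in W0]
assert all(abs(a - b) < 1e-12 for a, b in zip(h, base))
```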
Question 10
What is QLoRA and what three innovations does it introduce?
Answer
QLoRA combines LoRA with aggressive quantization to enable fine-tuning large models on consumer hardware. Its three innovations:
1. **4-bit NormalFloat (NF4) quantization:** Quantizes pretrained weights to 4 bits using a data type whose quantization levels match the quantiles of the normal distribution (since neural network weights are approximately normally distributed). This provides better accuracy than uniform INT4.
2. **Double quantization:** The quantization constants (scale factors) are themselves quantized from fp32 to fp8, saving an additional ~0.4 bits per parameter.
3. **Paged optimizers:** Optimizer states are offloaded to CPU memory when GPU memory is exhausted, using unified memory page management. This prevents out-of-memory errors during training spikes.
Together, these allow fine-tuning a 65B model on a single 48GB GPU.

Question 11
Describe the four stages of a RAG pipeline.
Answer
1. **Indexing:** Documents are split into chunks, each chunk is embedded using an embedding model, and the embeddings are stored in a vector database with metadata.
2. **Retrieval:** Given a user query, embed the query with the same embedding model and retrieve the $k$ most similar chunks using approximate nearest neighbor (ANN) search.
3. **Augmentation:** Construct a prompt that includes the retrieved chunks as context, typically with a system instruction explaining how to use the context.
4. **Generation:** The LLM generates a response grounded in the retrieved context, ideally answering the question using information from the chunks and indicating when the context is insufficient.
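The retrieval stage can be sketched end-to-end with a deliberately crude stand-in embedding (a bag-of-characters vector instead of a real embedding model) and exact cosine search instead of ANN:

```python
import math

def embed(text):
    """Toy bag-of-characters 'embedding' standing in for a real model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query, chunks, k=2):
    """Stage 2: rank indexed chunks by similarity to the query embedding."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = ["the KV cache stores keys and values",
          "perplexity measures model uncertainty",
          "keys and values are cached per layer"]
top = retrieve("what does the KV cache store?", chunks, k=2)
# Stage 3 would then build a prompt: "Context: ... Question: ..."
assert len(top) == 2
```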
Question 12
What are the trade-offs of chunk size in a RAG system?
Answer
**Smaller chunks** (128-256 tokens): Higher retrieval precision (the chunk is more likely to contain exactly the needed information and less noise), but each chunk provides less surrounding context. The model may miss information that spans multiple sentences. More chunks must be retrieved and stored.

**Larger chunks** (512-1024 tokens): More context per chunk (preserving paragraph-level coherence), but lower precision — the chunk may contain relevant information diluted by irrelevant text, which can confuse the LLM. Fewer chunks fit in the context window.

The practical sweet spot is typically 256-512 tokens with 50-100 token overlap. A common pattern is to retrieve more chunks than needed (top-20) and then rerank with a cross-encoder to select the best 3-5.
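A minimal sketch of overlapping chunking over a token list (sizes chosen per the sweet spot above; any tokenizer's output can be substituted):

```python
def chunk_tokens(tokens, size=256, overlap=64):
    """Split a token list into fixed-size chunks with overlap, so that
    information spanning a boundary appears intact in at least one chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = list(range(1000))  # stand-in for real token IDs
chunks = chunk_tokens(tokens, size=256, overlap=64)
# Adjacent chunks share their boundary region, and nothing is dropped.
assert chunks[0][-64:] == chunks[1][:64]
assert chunks[-1][-1] == 999
```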
Question 13
Why does BPE tokenization represent common words as single tokens but rare words as multiple subword tokens?
Answer
BPE starts with individual characters (or bytes) and iteratively merges the most frequent adjacent pair. Common words like "the" quickly become single tokens because their character sequences (t-h-e) are among the most frequent pairs in any English corpus. Rare words like "photovoltaic" are assembled from subword units ("photo", "volt", "aic") because their full character sequences are too infrequent to be merged into a single token within the vocabulary budget. This means: (1) common text is tokenized efficiently (fewer tokens, lower cost, more fits in the context window); (2) the tokenizer never encounters out-of-vocabulary words (it can always fall back to characters); and (3) the model can compose meanings from subword units, potentially generalizing to unseen words that share morphological components with training words.
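The merge loop can be demonstrated on a tiny invented corpus; after just two merges the frequent word "the" collapses into a single symbol while the rare word stays split:

```python
from collections import Counter

def bpe_merges(word_counts, num_merges):
    """Learn BPE merges from a toy corpus of word -> count.
    Each word starts as a tuple of characters; every step merges the
    most frequent adjacent symbol pair across the whole corpus."""
    corpus = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        new_corpus = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = count
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges({"the": 50, "then": 5, "photo": 1}, num_merges=2)
assert ("the",) in corpus  # frequent word became one token; "photo" did not
```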
Question 14
What is the KV-cache and why is it necessary for efficient autoregressive generation?
Answer
During autoregressive generation, each new token must attend to all previous tokens. Without caching, generating the $n$-th token would require recomputing the key ($K$) and value ($V$) projections for all $n-1$ previous tokens — $O(n)$ computation that makes total generation cost $O(n^2)$. The **KV-cache** stores the $K$ and $V$ tensors for all previously generated tokens. When generating the next token, only the new token's $Q$, $K$, and $V$ need to be computed; the new $K$ and $V$ are appended to the cache, and attention is computed between the new $Q$ and the full cached $K, V$. This reduces per-token computation from $O(n)$ to $O(1)$ (for the linear projections) plus $O(n)$ for the attention dot product itself. The trade-off: the KV-cache consumes memory proportional to $L \times h \times n \times d_k \times 2$ (layers times heads times sequence length times head dimension times 2 for K and V). For long sequences, this can exceed the model weight memory.
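A single-head sketch of the cached generation loop (the projections are stand-in constant vectors rather than learned layers):

```python
import math

def attend(q, K_cache, V_cache):
    """Attention for ONE new query against cached keys/values (one head).
    Old K, V come from the cache instead of being recomputed."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K_cache]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    d_v = len(V_cache[0])
    return [sum(w * v[j] for w, v in zip(weights, V_cache)) for j in range(d_v)]

# Generation loop sketch: compute each step's K, V once, reuse forever.
K_cache, V_cache = [], []
for step in range(3):
    q = k = v = [float(step + 1)] * 4  # stand-ins for learned projections
    K_cache.append(k)
    V_cache.append(v)
    out = attend(q, K_cache, V_cache)
assert len(K_cache) == 3  # one cached (K, V) pair per generated token
```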
Question 15
Explain the difference between top-$k$ and top-$p$ (nucleus) sampling. When would you prefer one over the other?
Answer
**Top-$k$** samples from a fixed number of candidate tokens (the $k$ highest-probability tokens), regardless of the probability distribution's shape. If $k = 50$ and the model is very confident (top token has probability 0.95), 49 low-probability tokens are still included.

**Top-$p$ (nucleus)** samples from the smallest set of tokens whose cumulative probability exceeds $p$. This adapts dynamically: when the model is confident, the nucleus is small (perhaps 1-5 tokens); when uncertain, it expands to include many tokens.

Top-$p$ is generally preferred because it adapts to the model's confidence. Top-$k$ with a fixed $k$ can include too many unlikely tokens (when the model is confident) or too few diverse options (when the distribution is flat). In practice, many systems combine both: `top_k=50, top_p=0.95`.
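The two filters side by side, on a toy vocabulary-to-probability dict; the confident distribution shows nucleus sampling adapting while top-$k$ does not:

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens, renormalized."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    keep, cum = [], 0.0
    for t in sorted(probs, key=probs.get, reverse=True):
        keep.append(t)
        cum += probs[t]
        if cum >= p:
            break
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}

confident = {"the": 0.95, "a": 0.02, "an": 0.01, "this": 0.01, "that": 0.01}
assert len(top_p_filter(confident, 0.9)) == 1  # nucleus shrinks to "the"
assert len(top_k_filter(confident, 3)) == 3    # fixed k keeps unlikely tokens
```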
Question 16
Why is hallucination considered a fundamental property of LLMs rather than a bug to be fixed?
Answer
Hallucination is fundamental because the training objective — next-token prediction — optimizes for **plausibility**, not **truthfulness**. The model learns $P(x_t \mid x_{<t})$ from corpus statistics: it is rewarded for producing continuations that are statistically likely, and nothing in the objective distinguishes facts the model has reliably learned from plausible-sounding fabrications. When the model lacks the relevant knowledge, the highest-probability continuation is still fluent, confident text. Mitigations such as retrieval grounding (RAG) and alignment training can reduce the hallucination rate, but they do not change the underlying objective, so some rate of confident fabrication remains.

Question 17
What is BERTScore and why does it correlate better with human judgments than BLEU or ROUGE?
Answer
**BERTScore** computes similarity between generated and reference text by matching tokens using contextual embeddings from a pretrained language model (BERT). Each token in the candidate is matched to its most similar token in the reference (computing precision), and vice versa (recall). The F1 score summarizes both. BERTScore correlates better with human judgments because it captures **semantic similarity** rather than exact lexical overlap. If the candidate says "automobile" and the reference says "car," BLEU and ROUGE score this as zero overlap, but BERTScore recognizes the semantic equivalence through similar embedding vectors. Similarly, paraphrases, synonym substitutions, and syntactic rearrangements are penalized by n-gram metrics but handled gracefully by BERTScore. Limitation: BERTScore still requires a reference text, which may not exist for open-ended generation tasks.
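The greedy-matching computation can be sketched on hand-made 2-d vectors standing in for contextual BERT embeddings (real BERTScore also applies optional IDF weighting, omitted here):

```python
import math

def greedy_f1(cand_vecs, ref_vecs):
    """BERTScore-style score on toy embedding vectors: each candidate
    token matches its most similar reference token (precision) and vice
    versa (recall); F1 combines the two."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    precision = sum(max(cos(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cos(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    return 2 * precision * recall / (precision + recall)

# "automobile" vs "car": different strings, nearby embedding vectors,
# so the score stays high where n-gram overlap would be zero.
ref = [[1.0, 0.0], [0.0, 1.0]]    # e.g. embeddings for "the", "car"
cand = [[1.0, 0.0], [0.1, 0.95]]  # e.g. embeddings for "the", "automobile"
assert greedy_f1(cand, ref) > 0.9
```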
Question 18
In the LLM-as-judge evaluation paradigm, name three known biases and one mitigation for each.
Answer
1. **Self-preference bias:** The judge model tends to prefer outputs generated by itself or similar models. *Mitigation:* Use a different model family as the judge (e.g., Claude to judge GPT-4 outputs) or use multiple diverse judges and aggregate scores.
2. **Verbosity bias:** The judge assigns higher scores to longer, more detailed responses even when a concise answer is better. *Mitigation:* Include explicit instructions in the judge prompt to value conciseness, or normalize scores by response length.
3. **Position bias:** When comparing two responses, the judge prefers whichever is presented first. *Mitigation:* Evaluate each pair twice with the order swapped and average the scores, discarding cases where the ordering affects the judgment.

Question 19
A model with hidden dimension $d = 4{,}096$ is adapted with LoRA rank $r = 16$ and $\alpha = 32$ on the $Q$ and $V$ projections of each attention layer (32 layers). How many trainable parameters are added?
Answer
Each LoRA adaptation adds matrices $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times r}$. Per adapted matrix: $$\text{params per LoRA} = r \times d + d \times r = 2 \times r \times d = 2 \times 16 \times 4096 = 131{,}072$$ With $Q$ and $V$ adapted in each of 32 layers: $$\text{total} = 2 \times 32 \times 131{,}072 = 8{,}388{,}608 \approx 8.4\text{M parameters}$$ For a 7B model, this is $8.4\text{M} / 7{,}000\text{M} \approx 0.12\%$ of total parameters.
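The arithmetic above can be packaged as a quick sanity check:

```python
def lora_param_count(d, r, n_layers, n_matrices):
    """Trainable parameters added by LoRA: each adapted matrix gets
    A (r x d) and B (d x r), i.e. 2*r*d parameters."""
    return 2 * r * d * n_matrices * n_layers

total = lora_param_count(d=4096, r=16, n_layers=32, n_matrices=2)  # Q and V
assert total == 8_388_608  # ~8.4M, about 0.12% of a 7B model
```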
Question 20
Explain why hybrid search (combining dense vector retrieval with sparse keyword retrieval like BM25) often outperforms either method alone in RAG systems.