Chapter 11: Exercises
Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field
Architecture and Fundamentals
Exercise 11.1 (*)
A decoder-only transformer has the following configuration: vocabulary size $V = 32{,}000$, hidden dimension $d = 2{,}048$, $L = 24$ layers, $h = 16$ attention heads, and FFN intermediate dimension $d_{\text{ff}} = 5{,}504$ (SwiGLU).
(a) Compute the total number of parameters, broken down by component: token embeddings, attention (per layer), FFN (per layer), and final layer norm. Assume weight tying between the token embedding and the language model head.
(b) How much GPU memory is required to store this model in fp16? In INT4?
(c) During training with the Adam optimizer, what is the total memory requirement per parameter (model weights + gradients + optimizer states)? What is the total training memory for model state alone (excluding activations)?
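As a sanity check for part (a), the component counts can be tallied programmatically. This is a minimal sketch that assumes bias-free projections and two RMSNorm-style weight vectors per layer plus the final norm (the per-layer norm accounting is an assumption, not stated in the exercise):

```python
V, d, L, d_ff = 32_000, 2_048, 24, 5_504

embed = V * d                   # token embeddings (tied with the LM head)
attn_per_layer = 4 * d * d      # Q, K, V, O projections, no biases
ffn_per_layer = 3 * d * d_ff    # SwiGLU: W_1, W_g (d -> d_ff) and W_2 (d_ff -> d)
norms = 2 * L * d + d           # two norm weight vectors per layer + final norm

total = embed + L * (attn_per_layer + ffn_per_layer) + norms
print(f"embeddings: {embed:,}")
print(f"attention/layer: {attn_per_layer:,}, FFN/layer: {ffn_per_layer:,}")
print(f"total: {total:,} ({total / 1e9:.2f}B)")
```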
Exercise 11.2 (*)
Consider the causal mask matrix $M$ for a sequence of length 5:
(a) Write out the full $5 \times 5$ mask matrix $M$ and the resulting attention weight matrix after applying $\text{softmax}(QK^T / \sqrt{d_k} + M)$, assuming $QK^T / \sqrt{d_k}$ is the all-ones matrix.
(b) What is the maximum number of positions that the token at position $t$ can attend to? What is the minimum?
(c) Explain why the causal mask is essential for next-token prediction but would be harmful for tasks like sentiment classification.
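Part (a) can be checked numerically. This sketch builds the additive mask with $-\infty$ above the diagonal and applies softmax to the all-ones score matrix:

```python
import torch

scores = torch.ones(5, 5)  # QK^T / sqrt(d_k) assumed all ones, as in (a)
mask = torch.triu(torch.full((5, 5), float("-inf")), diagonal=1)
weights = torch.softmax(scores + mask, dim=-1)
print(weights)  # row t is uniform: 1/(t+1) over positions 0..t
```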
Exercise 11.3 (*)
Compute the perplexity for the following token sequences given the model's predicted probabilities:
(a) Tokens: [A, B, C, D], predicted probabilities: [0.8, 0.6, 0.9, 0.7]
(b) Tokens: [A, B, C, D], predicted probabilities: [0.5, 0.5, 0.5, 0.5]
(c) What is the perplexity of a perfect model that assigns probability 1.0 to every correct token? What is the perplexity of a uniform distribution over a vocabulary of 32,000 tokens?
(d) A model achieves perplexity 12.5 on WikiText-103. Interpret this number in terms of "equivalent uniform vocabulary size."
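A quick way to check parts (a)-(c) is to compute perplexity directly from the per-token probabilities, using the standard definition as the exponential of the average negative log-likelihood:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.8, 0.6, 0.9, 0.7]))   # part (a): ~1.35
print(perplexity([0.5, 0.5, 0.5, 0.5]))   # part (b): exactly 2.0
print(perplexity([1.0] * 4))              # perfect model: 1.0
```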
Exercise 11.4 (*)
For BPE tokenization with a target vocabulary size of 1,000, starting from a character-level vocabulary of 256 byte values:
(a) How many merge operations are performed to reach the target vocabulary size?
(b) Given the corpus "aaabdaaabac", trace the first 3 BPE merge operations. Show the vocabulary and the encoded corpus after each merge.
(c) Explain why BPE naturally handles rare and unseen words, unlike word-level tokenization.
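The merge trace in part (b) can be verified with a minimal BPE step. Note that tie-breaking among equally frequent pairs is an implementation choice; this sketch takes the first pair encountered:

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One BPE merge: find the most frequent adjacent pair and fuse it."""
    pairs = Counter(zip(tokens, tokens[1:]))
    best = max(pairs, key=pairs.get)  # ties broken by first occurrence
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, best

tokens = list("aaabdaaabac")
for step in range(3):
    tokens, pair = bpe_merge_step(tokens)
    print(f"merge {step + 1}: {pair} -> {''.join(pair)!r}, corpus: {tokens}")
```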
Training Pipeline
Exercise 11.5 (**)
(a) The Chinchilla scaling law states that for a compute budget $C$, the optimal model size $N$ and dataset size $D$ satisfy $N_{\text{opt}} \propto C^{0.5}$ and $D_{\text{opt}} \propto C^{0.5}$, with approximately 20 tokens per parameter.
You have a compute budget sufficient to train a 3B-parameter model on 300B tokens. According to the Chinchilla rule, is this allocation optimal? If not, what is the optimal allocation, and what is the expected improvement?
(b) For LLaMA 3, Meta trained an 8B model on 15T tokens (~1,875 tokens per parameter). Why would Meta deliberately deviate from the Chinchilla-optimal ratio? What economic trade-off drives this decision?
(c) A startup has a fixed compute budget of $\$500{,}000$ (approximately $10^{22}$ FLOPs on cloud GPUs). Using the Chinchilla ratio, estimate the optimal model size and dataset size. What constraints might prevent achieving this in practice?
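For part (c), a back-of-envelope sketch, assuming the common approximation $C \approx 6ND$ training FLOPs and the ~20 tokens-per-parameter rule from the chapter:

```python
# Chinchilla-style allocation under a fixed compute budget
C = 1e22                    # compute budget in FLOPs
ND = C / 6                  # N*D implied by C = 6*N*D
N_opt = (ND / 20) ** 0.5    # D = 20*N  =>  20*N^2 = N*D
D_opt = 20 * N_opt
print(f"N* ~ {N_opt / 1e9:.1f}B parameters, D* ~ {D_opt / 1e9:.0f}B tokens")
```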
Exercise 11.6 (**)
(a) In supervised fine-tuning (SFT), the loss is computed only over response tokens, not prompt tokens. Write the loss function explicitly, with indicator variables distinguishing prompt from response tokens.
(b) Why is it important to exclude the prompt from the loss computation? What would happen if you included the prompt tokens in the loss?
(c) An SFT dataset contains 50,000 instruction-response pairs. The average prompt is 100 tokens and the average response is 300 tokens. If you train for 3 epochs with a batch size of 32 sequences, how many gradient steps are performed? How many total response tokens are trained on?
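The arithmetic in part (c) can be checked directly. Whether the last partial batch of each epoch is kept as its own step is an assumption (here it is):

```python
pairs, epochs, batch = 50_000, 3, 32
steps_per_epoch = -(-pairs // batch)    # ceiling division: 1,563
total_steps = epochs * steps_per_epoch
response_tokens = pairs * 300 * epochs  # only response tokens enter the loss
print(total_steps, response_tokens)
```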
Exercise 11.7 (**)
(a) Write out the DPO loss function. Explain each term and the role of the hyperparameter $\beta$.
(b) What happens as $\beta \to 0$? As $\beta \to \infty$? Why is choosing $\beta$ important in practice?
(c) In the RLHF pipeline, the reward model is trained on human preference data. Describe a scenario where the reward model assigns high scores to outputs that humans would actually dislike (reward hacking). How does the KL penalty mitigate this?
(d) Compare the engineering complexity of RLHF vs. DPO. List the components needed for each and explain why DPO is often preferred for smaller teams.
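For reference when working through parts (a) and (b), here is a minimal sketch of the DPO loss for a single preference pair, taking sequence-level log-probabilities as inputs:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w / logp_l: log-prob of the chosen / rejected response under the
    policy; ref_logp_w / ref_logp_l: the same under the frozen reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin)

# Equal implicit rewards give a margin of 0 and a loss of -log(0.5) ~ 0.693
loss = dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
                torch.tensor(-12.0), torch.tensor(-15.0))
print(loss.item())
```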
LoRA and Fine-Tuning
Exercise 11.8 (**)
(a) A LoRA configuration uses rank $r = 16$ on all attention and FFN projections of a 7B model with $d = 4{,}096$. Compute the number of trainable parameters and the percentage of total parameters being trained. Assume attention has 4 projections ($Q, K, V, O$) and the SwiGLU FFN has 3 projections ($W_1, W_g, W_2$ with $d_{\text{ff}} = 11{,}008$), across 32 layers.
(b) The same model is fine-tuned with LoRA rank $r = 64$. By what factor do the trainable parameters increase compared to $r = 16$? Is the training time expected to increase by the same factor? Why or why not?
(c) At inference time, LoRA weights are merged: $W_{\text{merged}} = W_0 + \frac{\alpha}{r} BA$. What is the inference latency overhead of the merged model compared to the original model?
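A sketch for checking part (a). The shapes follow the standard LoRA parameterization, $A \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B \in \mathbb{R}^{d_{\text{out}} \times r}$:

```python
d, d_ff, r, layers = 4_096, 11_008, 16, 32

def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) per adapted projection."""
    return r * (d_in + d_out)

attn = 4 * lora_params(d, d, r)        # Q, K, V, O are all d -> d
ffn = (2 * lora_params(d, d_ff, r)     # W_1 and W_g: d -> d_ff
       + lora_params(d_ff, d, r))      # W_2: d_ff -> d
trainable = layers * (attn + ffn)
print(f"{trainable:,} trainable ({100 * trainable / 7e9:.2f}% of 7B)")
```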
Exercise 11.9 (***)
(a) Derive the LoRA forward pass from the perspective of the SVD. If $W_0 = U \Sigma V^T$ is the pretrained weight matrix and $\Delta W = BA$ is the LoRA update, express the adapted weight matrix in terms of its modified singular values.
(b) Hu et al. (2022) initialize $A$ from a Gaussian and $B = 0$. What is $\Delta W$ at initialization? Why is this important for training stability?
(c) An alternative initialization is $A = 0$, $B \sim \mathcal{N}(0, \sigma^2)$. Would this work equally well? Explain.
(d) Implement a LoRALinear layer from scratch in PyTorch (no external libraries). It should wrap an existing nn.Linear, freeze its weights, and add trainable low-rank matrices $A$ and $B$.
```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Linear layer with LoRA adaptation.

    Args:
        linear: Pretrained nn.Linear layer to adapt.
        rank: Rank of the low-rank matrices.
        alpha: LoRA scaling factor.
    """

    def __init__(self, linear: nn.Linear, rank: int = 16, alpha: float = 32.0) -> None:
        super().__init__()
        self.linear = linear
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        # Freeze the pretrained weights
        for param in self.linear.parameters():
            param.requires_grad = False
        # Initialize LoRA matrices
        d_out, d_in = linear.weight.shape
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original forward + scaled low-rank update
        base_output = self.linear(x)
        lora_output = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return base_output + lora_output

    def merge(self) -> nn.Linear:
        """Merge LoRA weights into the base linear layer.

        Returns:
            A new nn.Linear with merged weights and no LoRA overhead.
        """
        merged = nn.Linear(
            self.linear.in_features,
            self.linear.out_features,
            bias=self.linear.bias is not None,
        )
        merged.weight.data = (
            self.linear.weight.data + self.scaling * (self.lora_B @ self.lora_A)
        )
        if self.linear.bias is not None:
            merged.bias.data = self.linear.bias.data.clone()
        return merged
```
Verify your implementation: create a random nn.Linear(256, 256), wrap it in LoRALinear(rank=8), pass a batch of inputs, and confirm that the output shape is correct. Then verify that merge() produces identical outputs.
Exercise 11.10 (**)
QLoRA uses NormalFloat4 (NF4) quantization.
(a) Standard INT4 quantization maps values to 16 evenly spaced levels. NF4 maps values to 16 levels that are evenly spaced in the quantiles of the standard normal distribution. Why is NF4 better for neural network weights?
(b) A 13B-parameter model requires ~26 GB in fp16. Estimate the memory required in NF4 (4-bit weights + fp16 LoRA adapters with $r = 16$ on all linear layers). Can this fit on a single 24 GB GPU?
(c) Double quantization quantizes the quantization constants (scale factors). If there is one fp32 scale factor per 64 weights, how many bits per parameter does this add? How does double quantization reduce it?
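For part (c), the overhead arithmetic. The block sizes used here (64-weight first-level blocks with fp32 scales; 8-bit second-level scales with one fp32 constant per 256 blocks) are the values reported in the QLoRA paper and are assumptions for this sketch:

```python
plain = 32 / 64                       # fp32 scale per 64 weights: 0.5 bits/param
double = 8 / 64 + 32 / (64 * 256)     # 8-bit scales + fp32 per 256 blocks
print(f"plain: {plain} bits/param, double-quantized: {double:.3f} bits/param")
```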
RAG and Retrieval
Exercise 11.11 (**)
(a) You are building a RAG system for a legal document corpus with 50,000 documents averaging 5,000 words each. Using chunks of 512 tokens with 50-token overlap, estimate the total number of chunks. If each chunk is embedded as a 1024-dimensional float32 vector, what is the storage requirement for the vector index?
(b) At query time, you need to search this index in under 50ms. Would brute-force cosine similarity suffice? If not, what approximate nearest neighbor algorithm would you use, and what trade-off does it introduce?
(c) A user queries "cases about environmental liability for groundwater contamination." The top-3 retrieved chunks all come from the same document. Is this a problem? Propose two strategies to ensure diversity in retrieval results.
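A back-of-envelope sketch for part (a). The 1.3 tokens-per-word ratio and the interpretation of 50-token overlap as a 462-token stride are assumptions:

```python
docs, words_per_doc = 50_000, 5_000
tokens_per_doc = int(words_per_doc * 1.3)           # ~6,500 tokens per document
chunks_per_doc = -(-(tokens_per_doc - 50) // 462)   # ceiling over the stride
n_chunks = docs * chunks_per_doc
index_bytes = n_chunks * 1024 * 4                   # 1024-dim float32 vectors
print(f"{n_chunks:,} chunks, index ~ {index_bytes / 1e9:.1f} GB")
```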
Exercise 11.12 (**)
(a) Compare three chunking strategies on the following document excerpt by manually splitting it:
"The Supreme Court ruled in Chevron v. NRDC (1984) that courts should defer to agency interpretations of ambiguous statutes. This doctrine, known as Chevron deference, has been one of the most cited principles in administrative law. However, in Loper Bright v. Raimondo (2024), the Court overruled Chevron, holding that courts must exercise independent judgment in interpreting statutes."
Strategy 1: Fixed-size (30 words per chunk, no overlap).
Strategy 2: Sentence-level (one chunk per sentence).
Strategy 3: Semantic (chunk by topic shift).
(b) A user queries "Was Chevron deference overruled?" For each chunking strategy, which chunk(s) would be retrieved? Which strategy gives the best answer?
(c) Propose a hybrid chunking strategy that combines the strengths of all three approaches.
Exercise 11.13 (***)
Implement a RAG pipeline with the following components:
- Load a set of 100 Wikipedia-style paragraphs (you may generate synthetic ones).
- Chunk with recursive character splitting (split on \n\n, then \n, then ., then by character count).
- Embed using sentence-transformers/all-MiniLM-L6-v2.
- Index in FAISS with an IndexFlatIP (inner product) index.
- Given a query, retrieve top-5 chunks and construct a RAG prompt.
Evaluate: manually label 20 queries with relevant document IDs, then compute precision@5 and recall@5.
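The evaluation step can use a helper like the following; the document IDs in the usage line are hypothetical:

```python
def precision_recall_at_k(retrieved, relevant, k=5):
    """retrieved: ranked list of doc IDs; relevant: set of gold doc IDs."""
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall_at_k(["d3", "d7", "d1", "d9", "d2"], {"d1", "d3", "d8"})
print(p, r)  # 2 of 5 retrieved are relevant; 2 of 3 relevant docs were found
```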
Prompt Engineering and Generation
Exercise 11.14 (*)
(a) Write zero-shot, one-shot, and three-shot prompts for the task: "Extract the medication name, dosage, and frequency from a clinical note." Use the following note: "Patient started on metformin 500mg twice daily for type 2 diabetes."
(b) For each prompt variant, predict which is most likely to produce correctly structured output. Justify your answer.
(c) Add chain-of-thought reasoning to the three-shot prompt. How does this change the expected output?
Exercise 11.15 (**)
(a) Given a vocabulary of 5 tokens with logits $z = [2.0, 1.0, 0.5, 0.1, -1.0]$, compute the sampling probability for each token under:

- Greedy decoding
- Temperature $\tau = 0.5$
- Temperature $\tau = 2.0$
- Top-$k$ with $k = 3$
- Top-$p$ with $p = 0.9$
(b) Implement a function sample_token(logits, temperature, top_k, top_p) in PyTorch that applies temperature scaling, top-$k$ filtering, and top-$p$ filtering in sequence, then samples from the resulting distribution.
```python
import torch
import torch.nn.functional as F


def sample_token(
    logits: torch.Tensor,
    temperature: float = 1.0,
    top_k: int = 0,
    top_p: float = 1.0,
) -> int:
    """Sample a token from logits with temperature, top-k, and top-p.

    Args:
        logits: Raw logits of shape (vocab_size,).
        temperature: Temperature for softmax scaling.
        top_k: If > 0, keep only top-k logits.
        top_p: If < 1.0, keep smallest set with cumulative prob >= top_p.

    Returns:
        Sampled token index.
    """
    # Temperature scaling
    if temperature != 1.0:
        logits = logits / temperature
    # Top-k filtering
    if top_k > 0:
        top_k = min(top_k, logits.size(-1))
        threshold = torch.topk(logits, top_k).values[-1]
        logits = torch.where(logits < threshold, torch.tensor(float("-inf")), logits)
    # Top-p (nucleus) filtering
    if top_p < 1.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        # Remove a token once the cumulative prob *before* it reaches top_p,
        # so the first token crossing the threshold is kept
        sorted_mask = cumulative_probs - F.softmax(sorted_logits, dim=-1) >= top_p
        sorted_logits[sorted_mask] = float("-inf")
        # Scatter back to original order (sorted_indices is a full
        # permutation, so every position is overwritten)
        logits = sorted_logits.scatter(0, sorted_indices, sorted_logits)
    # Sample
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```
Verify with the logits from part (a).
Evaluation
Exercise 11.16 (**)
(a) Compute BLEU-1 and BLEU-2 (without brevity penalty) for:

- Reference: "the cat sat on the mat"
- Candidate: "the cat is on the mat"
(b) Compute ROUGE-1 precision, recall, and F1 for the same pair.
(c) Explain a scenario where BLEU gives a high score but the candidate is semantically wrong. Explain a scenario where BLEU gives a low score but the candidate is semantically correct.
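For checking parts (a) and (b), here is a sketch of clipped n-gram precision, the overlap count at the core of both BLEU and ROUGE:

```python
from collections import Counter

def ngram_precision(reference, candidate, n):
    """Clipped n-gram precision (the modified precision used in BLEU)."""
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

ref = "the cat sat on the mat".split()
cand = "the cat is on the mat".split()
bleu1 = ngram_precision(ref, cand, 1)   # 5/6
bleu2 = ngram_precision(ref, cand, 2)   # 3/5
# ROUGE-1 recall counts reference unigrams covered by the candidate; here
# precision equals recall because both sentences have six tokens
rouge1_recall = ngram_precision(cand, ref, 1)
print(bleu1, bleu2, rouge1_recall)
```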
Exercise 11.17 (***)
Design an LLM-as-judge evaluation for a RAG system. Your evaluation should:
(a) Define 5 evaluation criteria specific to RAG (not generic quality metrics). Examples: faithfulness to retrieved context, answer completeness relative to question, appropriate hedging when evidence is insufficient.
(b) Write the judge prompt template with scoring rubric (1-5 scale per criterion).
(c) Describe how you would calibrate the judge: what is your ground truth, how many examples do you need, and what inter-annotator agreement metric would you report?
(d) Address the following bias: the LLM judge consistently rates longer responses higher, regardless of quality. Propose two mitigation strategies.
Integration and Systems Design
Exercise 11.18 (***)
You are designing a clinical note extraction system for MediCore Pharmaceuticals. The system must extract: patient demographics, symptoms, diagnoses, medications (name, dose, frequency, route), lab results, and follow-up plans.
(a) Compare three approaches: (i) fine-tuned BERT NER model, (ii) few-shot LLM prompting, (iii) fine-tuned LLM with LoRA. For each, specify the data requirements, expected accuracy, inference cost, and regulatory considerations.
(b) Design a validation pipeline that flags extraction errors before they enter the database. What heuristic checks can catch common LLM failure modes?
(c) The system processes 10,000 notes per day. Estimate the daily API cost for approach (ii) using GPT-4 pricing (assume $10/million input tokens, $30/million output tokens, 500 input tokens per note, 200 output tokens per extraction).
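The cost estimate in part (c) is a few lines of arithmetic under the stated assumptions:

```python
notes = 10_000
input_cost = notes * 500 / 1e6 * 10     # $10 per million input tokens
output_cost = notes * 200 / 1e6 * 30    # $30 per million output tokens
print(f"daily cost ~ ${input_cost + output_cost:.0f}")  # $50 + $60 = $110
```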
Exercise 11.19 (***)
(a) StreamRec wants to generate personalized content summaries for each user's homepage. Each summary is ~100 tokens, and there are 10 million daily active users. Estimate the daily inference cost using a 7B model served on A100 GPUs (assume 50 tokens/second throughput per GPU, $3/GPU-hour).
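A sketch of the part (a) estimate, assuming full GPU utilization and ignoring prompt-processing time:

```python
users, tokens_per_summary = 10_000_000, 100
throughput = 50        # tokens/second per GPU
gpu_hourly = 3.0       # dollars per GPU-hour
total_tokens = users * tokens_per_summary       # 1e9 tokens per day
gpu_hours = total_tokens / throughput / 3600
print(f"~{gpu_hours:,.0f} GPU-hours/day, ~${gpu_hours * gpu_hourly:,.0f}/day")
```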
(b) Propose three strategies to reduce this cost by at least 10x without significantly degrading quality.
(c) One proposed strategy is to cache summaries and reuse them across users with similar viewing histories. What are the trade-offs? How would you define "similar enough" to justify cache reuse?
Exercise 11.20 (***)
(a) Implement a complete LoRA fine-tuning pipeline for a small language model (GPT-2) on a custom instruction dataset. Use the Hugging Face peft and transformers libraries. Your pipeline should include:
- Data preparation (format instructions as ### Instruction:\n{instruction}\n\n### Response:\n{response})
- LoRA configuration ($r = 8$, $\alpha = 16$, targeting attention projections)
- Training loop with gradient accumulation
- Evaluation on a held-out set (compute perplexity)
- Weight merging and inference
(b) Compare the fine-tuned model's perplexity to the base model's perplexity on the held-out set. Is the improvement consistent across different instruction categories?
Exercise 11.21 (***)
Design an end-to-end RAG system for StreamRec that handles the following requirements:
(a) The content catalog is updated hourly (new items added, items removed, metadata changed). Describe the indexing pipeline: how do you handle incremental updates to the vector store without rebuilding the entire index?
(b) Some queries are factual ("when was Oppenheimer released?") and some are subjective ("recommend something uplifting"). Should the RAG pipeline treat these differently? Design a query classifier and describe how the downstream pipeline changes for each query type.
(c) Users report that the system sometimes recommends items that have been removed from the catalog. Diagnose the root cause and propose a fix.
Theoretical and Research-Level Problems
Exercise 11.22 (****)
(a) The Chinchilla scaling law predicts loss as:
$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$
where $A, B, E, \alpha, \beta$ are fitted constants. Explain the interpretation of each term. What does $E$ represent?
(b) The optimal allocation under a compute budget $C \propto ND$ (ignoring constants) can be found by minimizing $L(N, D)$ subject to $ND = C$. Use Lagrange multipliers to derive the optimal ratio $N^*/D^*$ in terms of $\alpha$ and $\beta$. Verify that for $\alpha = \beta$, the optimal ratio is $N^* = D^*$ (i.e., equal scaling).
(c) Hoffmann et al. estimate $\alpha \approx 0.34$ and $\beta \approx 0.28$. What does $\alpha > \beta$ imply about the relative importance of model size vs. data size?
Exercise 11.23 (****)
(a) Prove that the DPO loss is equivalent to the RLHF objective under the Bradley-Terry preference model. Start from the RLHF objective:
$$\max_\pi \mathbb{E}_{x, y \sim \pi}[r(x, y)] - \beta D_{\text{KL}}[\pi(y|x) \| \pi_{\text{ref}}(y|x)]$$
Show that the optimal policy has the form $\pi^*(y|x) \propto \pi_{\text{ref}}(y|x) \exp(r(x,y)/\beta)$, then substitute into the Bradley-Terry model to obtain the DPO loss.
(b) What assumption about the reward function does this derivation require? Under what conditions might this assumption fail?
Exercise 11.24 (****)
(a) Hallucination can be formalized as follows: let $\mathcal{F}$ be the set of factually correct statements and let $P_\theta(y \mid x)$ be the model's distribution over responses. A hallucination occurs when the model generates $y \notin \mathcal{F}$ with high probability. Show that for any model trained on a finite dataset with next-token prediction, $P_\theta(y \notin \mathcal{F}) > 0$ for infinitely many prompts $x$, assuming the model has nonzero temperature.
(b) Does RAG eliminate this problem? Formalize the conditions under which a RAG system can still hallucinate despite having the correct answer in its retrieved context.
(c) Propose a theoretical framework for bounding the hallucination rate of a RAG system as a function of retrieval quality (precision@$k$), context window utilization, and model size.
Exercise 11.25 (****)
The context window length of LLMs is limited (4K-128K tokens in current models). Several approaches extend effective context:
(a) Describe how RoPE enables length extrapolation. What happens when the model encounters positions beyond its training context length? Why does NTK-aware scaling (by modifying the base frequency of RoPE) improve extrapolation?
(b) Sliding window attention (Mistral) limits each token to attending to only the previous $W$ tokens. What is the effective receptive field after $L$ layers of sliding window attention? How does this compare to full causal attention?
(c) Ring attention distributes a long sequence across multiple devices, with each device handling a segment and passing KV-cache information in a ring topology. Analyze the communication cost of ring attention vs. standard tensor parallelism for a sequence of length $S$ on $P$ devices.
Exercise 11.26 (****)
(a) The "reversal curse" (Berglund et al., 2023) demonstrates that models trained on "A is B" do not learn "B is A." For example, a model trained on "The mother of Valentina Tereshkova is Elena Tereshkova" cannot answer "Who is Elena Tereshkova's daughter?" Explain why this occurs from the perspective of the autoregressive training objective.
(b) Does this phenomenon have implications for the reliability of LLM-based knowledge extraction in the pharma use case (extracting facts from clinical notes)? Provide a concrete example of how the reversal curse could cause an extraction error.
(c) Propose a data augmentation strategy during fine-tuning that would mitigate the reversal curse for a specific domain.