Chapter 25: Exercises
Conceptual Exercises
Exercise 25.1: The Alignment Problem
(a) Explain why a language model trained on next-token prediction is not inherently aligned with human values. Give three specific examples of misalignment. (b) Define the three dimensions of alignment: helpfulness, honesty, and harmlessness. For each, give an example where maximizing one dimension conflicts with another. (c) What is the "alignment tax"? When is it acceptable, and when does it become problematic (over-refusal)?
Exercise 25.2: The RLHF Pipeline
Draw a diagram of the three-stage RLHF pipeline (SFT, Reward Modeling, RL). For each stage: (a) What is the input data? (b) What is the training objective? (c) What is the output? (d) What are the key hyperparameters?
Exercise 25.3: Reward Model Mathematics
(a) Write the Bradley-Terry model for preferences: $P(y_1 \succ y_2 \mid x) = \sigma(r(x, y_1) - r(x, y_2))$. Explain each term. (b) Derive the negative log-likelihood loss for the reward model. Show that it reduces to binary cross-entropy. (c) If the reward model assigns $r(x, y_w) = 3.0$ and $r(x, y_l) = 1.0$, what is the predicted probability that $y_w$ is preferred? (d) What assumption does the Bradley-Terry model make about human preferences? When might this assumption fail?
Exercise 25.4: KL Divergence in RLHF
(a) Write the RLHF objective: $\max_\pi \mathbb{E}[r(x,y)] - \beta \cdot \text{KL}(\pi \| \pi_{\text{ref}})$. Explain the role of each term. (b) What happens if $\beta = 0$ (no KL penalty)? Describe the failure mode. (c) What happens if $\beta \to \infty$? What does the policy converge to? (d) How is $\beta$ typically chosen in practice? What are the signs that $\beta$ is too low or too high?
Exercise 25.5: PPO for Language Models
(a) In the PPO formulation for LLMs, identify the states, actions, and rewards. (b) Explain why the reward is zero for all tokens except the last. What are the implications for credit assignment? (c) What role does the value model $V_\psi$ play? Why is it needed alongside the reward model? (d) Explain why PPO requires four models simultaneously in GPU memory. Estimate the memory requirement for a 7B-parameter model, stating your assumptions about numeric precision and optimizer state.
Exercise 25.6: DPO Derivation
Starting from the RLHF objective, derive the DPO loss step by step: (a) Write the optimal policy $\pi^*$ in terms of the reward and reference policy. (b) Rearrange to express the reward in terms of the policy. (c) Substitute into the Bradley-Terry model and show that the partition function cancels. (d) Write the final DPO loss and explain why it is simpler than RLHF.
Exercise 25.7: DPO Gradient Analysis
The gradient of the DPO loss is: $$\nabla_\theta \mathcal{L} = -\beta \mathbb{E}[\sigma(\hat{r}(y_l) - \hat{r}(y_w))(\nabla \log \pi(y_w) - \nabla \log \pi(y_l))]$$ (a) Interpret the weighting term $\sigma(\hat{r}(y_l) - \hat{r}(y_w))$. When is it large? When is it small? (b) How does this adaptive weighting make DPO training efficient? (c) Compare this gradient to the reward model training gradient. What are the similarities?
Exercise 25.8: DPO vs. RLHF Comparison
Create a detailed comparison table of DPO vs. RLHF (PPO) covering: (a) Number of models required. (b) Training stability and sensitivity to hyperparameters. (c) Computational cost. (d) Data requirements. (e) Theoretical guarantees. (f) Performance on alignment benchmarks. For each dimension, explain which method has the advantage and why.
Exercise 25.9: ORPO, IPO, KTO, SimPO
(a) Explain how ORPO eliminates the need for a reference model. What is the "odds ratio" in ORPO? (b) IPO uses a regression loss instead of the Bradley-Terry model. What theoretical concern about DPO does this address? (c) KTO works with unpaired data (thumbs up/down). Why is this practically important? What is lost compared to pairwise comparisons? (d) SimPO uses average log probability as the implicit reward. Why is length normalization important?
Exercise 25.10: Constitutional AI
(a) Describe the two stages of Constitutional AI: SL-CAI and RL-CAI. (b) Write a constitution of 5 principles for a customer support chatbot. (c) What are the advantages of using AI feedback over human feedback for alignment? (d) What are the risks of using AI feedback? When might it fail?
Exercise 25.11: Red Teaming Strategies
(a) Design 5 adversarial prompts targeting different attack vectors: direct injection, role-play, multi-step manipulation, encoding, and social engineering. (b) For each prompt, describe the defense that would mitigate it. (c) How would you measure the attack success rate (ASR) of a red-teaming campaign? (d) What is the over-refusal rate and why is it important to track alongside ASR?
Exercise 25.12: Preference Data Quality
(a) Explain why pairwise comparison is cognitively easier than generating a good response. How does this relate to the Karpathy principle "it is easier to judge than to create"? (b) If inter-annotator agreement (Cohen's kappa) is 0.5, what does this suggest about the data quality? (c) Describe three strategies for improving annotation quality. (d) When is synthetic preference data (from AI feedback) appropriate? When should you insist on human feedback?
Programming Exercises
Exercise 25.13: Implement Bradley-Terry Reward Model
Implement a reward model from scratch: (a) Initialize from a pre-trained language model with a scalar output head. (b) Implement the preference loss function. (c) Train on a small preference dataset. (d) Evaluate by computing accuracy on held-out preference pairs.
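A starter sketch for parts (a) and (b), assuming a Hugging Face-style backbone that returns `last_hidden_state`; the names `RewardModel` and `preference_loss` are illustrative, not a fixed API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Causal-LM backbone plus a scalar reward head (illustrative sketch)."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # e.g. a Hugging Face AutoModel
        self.reward_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                           # (batch, seq, hidden)
        # Pool on the last non-padding token of each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        return self.reward_head(pooled).squeeze(-1)   # (batch,) scalar rewards

def preference_loss(rewards_chosen, rewards_rejected):
    """Bradley-Terry negative log-likelihood: -log sigma(r(x, y_w) - r(x, y_l))."""
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
```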
Exercise 25.14: Implement DPO Loss
Implement the DPO loss function from scratch: (a) Compute log probabilities of chosen and rejected responses under the policy. (b) Compute log probabilities under the reference model. (c) Compute the implicit rewards and the DPO loss. (d) Verify the gradient has the correct form.
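A minimal sketch of part (c), assuming the per-response log-probabilities from parts (a) and (b) are already available as 1-D tensors; the function name and signature are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from sequence-level log-probabilities (a minimal sketch).

    Each argument has shape (batch,) and holds the summed log-probability of the
    chosen / rejected response under the policy or the frozen reference model.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref), one scalar per response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # -log sigma(r_hat_w - r_hat_l), averaged over the batch.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # The detached rewards are handy for monitoring margins and accuracy.
    return loss, chosen_rewards.detach(), rejected_rewards.detach()
```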
Exercise 25.15: DPO Training Loop
Implement a complete DPO training loop: (a) Load an SFT model as both policy and reference. (b) Prepare preference data in the required format. (c) Implement the training step with proper log-probability computation. (d) Monitor chosen rewards, rejected rewards, and accuracy.
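The subtle step is part (c): only response tokens should contribute to the sequence log-probability. A sketch, assuming a Hugging Face-style model that returns `.logits` and a hypothetical `response_mask` (1 on response tokens, 0 on prompt and padding) built during preprocessing:

```python
import torch.nn.functional as F

def sequence_logprob(model, input_ids, attention_mask, response_mask):
    """Sum of token log-probabilities over the response portion of each sequence."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # Shift so that logits at position t predict the token at position t+1.
    logits = logits[:, :-1, :]
    labels = input_ids[:, 1:]
    mask = response_mask[:, 1:].float()          # exclude prompt and padding tokens
    logps = F.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(dim=-1)      # (batch,)
```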
Exercise 25.16: Reward Model Evaluation
Build a reward model evaluation pipeline: (a) Compute accuracy (fraction of pairs where chosen > rejected). (b) Compute calibration (under the Bradley-Terry model, a reward margin of 2.0 predicts a preference probability of $\sigma(2.0) \approx 0.88$; do empirical preference rates match the predicted probabilities?). (c) Test for length bias (does the model prefer longer responses regardless of quality?). (d) Test for sycophancy bias (does the model prefer agreeable responses?).
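A sketch of parts (a) and (c), assuming the per-pair rewards and response lengths have already been collected into tensors; the helper name and metric keys are illustrative:

```python
import torch

@torch.no_grad()
def evaluate_reward_model(rewards_chosen, rewards_rejected,
                          lengths_chosen, lengths_rejected):
    """Pairwise accuracy plus a simple length-bias probe (inputs are precomputed tensors)."""
    # (a) Accuracy: how often the chosen response receives the higher reward.
    accuracy = (rewards_chosen > rewards_rejected).float().mean().item()

    # (c) Length bias: correlation between reward margin and length difference.
    margin = rewards_chosen - rewards_rejected
    length_diff = (lengths_chosen - lengths_rejected).float()
    length_corr = torch.corrcoef(torch.stack([margin, length_diff]))[0, 1].item()

    return {"accuracy": accuracy, "length_reward_corr": length_corr}
```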
Exercise 25.17: Preference Data Generator
Build a synthetic preference data generator: (a) Given a prompt, generate N responses at different temperatures. (b) Use a simple heuristic (e.g., length, perplexity) to rank them. (c) Create preference pairs from the ranked responses. (d) Evaluate the quality of synthetic preferences vs. random pairs.
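One way to structure parts (a)-(c), with `generate_fn` and `score_fn` as placeholders for whatever sampling backend and ranking heuristic you plug in:

```python
import itertools

def make_preference_pairs(prompt, generate_fn, score_fn,
                          n=4, temperatures=(0.7, 1.0, 1.3)):
    """Generate N candidates and turn a heuristic ranking into preference pairs.

    generate_fn(prompt, temperature) -> str and score_fn(prompt, response) -> float
    are hypothetical hooks for your sampler and heuristic (length, perplexity, a judge).
    """
    candidates = [
        generate_fn(prompt, temperature=t)
        for t, _ in zip(itertools.cycle(temperatures), range(n))
    ]
    ranked = sorted(candidates, key=lambda resp: score_fn(prompt, resp), reverse=True)

    # Every higher-ranked response beats every lower-ranked one.
    return [
        {"prompt": prompt, "chosen": winner, "rejected": loser}
        for i, winner in enumerate(ranked)
        for loser in ranked[i + 1:]
    ]
```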
Exercise 25.18: Safety Evaluation Pipeline
Implement a safety evaluation pipeline: (a) Create a set of 20 adversarial prompts covering different harm categories. (b) Generate responses from the model. (c) Classify each response as safe/unsafe using keyword matching and a toxicity classifier. (d) Compute the Attack Success Rate (ASR) and over-refusal rate.
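A rough sketch of parts (c) and (d) using keyword matching only; the marker lists are deliberately crude placeholders, and a real pipeline would add a toxicity classifier on top of the keyword pass:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")
UNSAFE_MARKERS = ("here's how to", "step 1")   # illustrative keywords only

def safety_metrics(adversarial_responses, benign_responses):
    """Keyword-based attack success rate and over-refusal rate (a sketch)."""
    def is_refusal(text):
        return any(m in text.lower() for m in REFUSAL_MARKERS)

    def is_unsafe(text):
        return (not is_refusal(text)) and any(m in text.lower() for m in UNSAFE_MARKERS)

    asr = sum(is_unsafe(r) for r in adversarial_responses) / len(adversarial_responses)
    over_refusal = sum(is_refusal(r) for r in benign_responses) / len(benign_responses)
    return {"attack_success_rate": asr, "over_refusal_rate": over_refusal}
```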
Exercise 25.19: KL Divergence Monitor
Implement a KL divergence monitor for alignment training: (a) Compute per-token KL divergence between policy and reference. (b) Compute average KL across a batch of prompts. (c) Plot KL over training steps. (d) Implement an early stopping criterion based on KL threshold.
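A sketch of parts (a) and (b), assuming both models are Hugging Face-style causal LMs evaluated on the same batch:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_kl(policy, reference, input_ids, attention_mask):
    """Average per-token KL(policy || reference) over the non-padding tokens of a batch."""
    p_logits = policy(input_ids=input_ids, attention_mask=attention_mask).logits
    r_logits = reference(input_ids=input_ids, attention_mask=attention_mask).logits

    p_logp = F.log_softmax(p_logits, dim=-1)
    r_logp = F.log_softmax(r_logits, dim=-1)

    # KL(p || r) = sum_v p(v) * (log p(v) - log r(v)) at each position.
    token_kl = (p_logp.exp() * (p_logp - r_logp)).sum(dim=-1)   # (batch, seq)

    mask = attention_mask.float()
    return (token_kl * mask).sum() / mask.sum()
```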
Exercise 25.20: Implicit Reward Analysis
Analyze the implicit rewards learned by a DPO-trained model: (a) Compute $\hat{r}(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$ for chosen and rejected responses. (b) Plot the distribution of reward margins. (c) Identify examples where the implicit reward disagrees with human preference. (d) Analyze whether the implicit reward correlates with response length.
Challenge Exercises
Exercise 25.21: Implement PPO for Language Models
Implement a simplified PPO training loop for language models: (a) Generate responses using the current policy. (b) Compute rewards using a reward model. (c) Compute advantages using GAE. (d) Perform PPO updates with clipping. (e) Monitor for training instability.
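A sketch of parts (c) and (d), covering GAE over a single response and the clipped surrogate loss; the hyperparameter defaults are illustrative:

```python
import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over one response (1-D tensors of equal length)."""
    advantages = torch.zeros_like(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages

def ppo_policy_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss over response tokens."""
    ratio = torch.exp(new_logps - old_logps)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```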
Exercise 25.22: Constitutional AI Implementation
Implement a simplified Constitutional AI pipeline: (a) Define a constitution with 5 principles. (b) Generate initial responses to potentially harmful prompts. (c) Use the model to critique and revise responses according to the constitution. (d) Fine-tune on the revised responses. (e) Evaluate safety improvement.
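A sketch of the critique-and-revise step in part (c), with `chat_fn` as a placeholder for your model call and the prompt templates purely illustrative:

```python
import random

def critique_and_revise(prompt, initial_response, constitution, chat_fn, n_rounds=2):
    """SL-CAI-style self-revision loop (sketch; chat_fn(text) -> str is a placeholder)."""
    response = initial_response
    for _ in range(n_rounds):
        principle = random.choice(constitution)
        critique = chat_fn(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = chat_fn(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so that it addresses the critique."
        )
    return response
```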
Exercise 25.23: ORPO Implementation
Implement ORPO from scratch: (a) Compute the SFT loss on the chosen response. (b) Compute the odds ratio between chosen and rejected. (c) Combine into the ORPO loss. (d) Compare training dynamics with DPO.
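A sketch of parts (b) and (c), taking the SFT loss from part (a) and the length-averaged response log-probabilities under the policy as precomputed inputs; `lam` is the odds-ratio weighting hyperparameter:

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_nll, chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """ORPO = SFT loss on the chosen response + lambda * odds-ratio penalty (sketch).

    chosen_avg_logp / rejected_avg_logp are length-averaged log-probabilities of the
    responses under the policy; chosen_nll is the usual token-level cross-entropy.
    """
    # log odds(y) = log p - log(1 - p), computed stably via log(1 - p) = log1p(-exp(log p)).
    def log_odds(avg_logp):
        return avg_logp - torch.log1p(-torch.exp(avg_logp))

    odds_ratio_logit = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    l_or = -F.logsigmoid(odds_ratio_logit).mean()
    return chosen_nll + lam * l_or
```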
Exercise 25.24: Multi-Objective Alignment
Implement multi-objective alignment where helpfulness and safety are separate rewards: (a) Train two separate reward models (helpfulness and safety). (b) Combine them with adjustable weights. (c) Explore the Pareto frontier of helpfulness vs. safety. (d) Find the weight setting that maximizes helpfulness subject to a safety threshold.
Exercise 25.25: Iterative DPO
Implement iterative DPO (online DPO): (a) Train a DPO model on initial preference data. (b) Use the trained model to generate new responses. (c) Create new preference pairs from the generated responses. (d) Continue DPO training on the combined data. (e) Measure improvement across iterations.
Exercise 25.26: Length-Controlled DPO
Implement length-controlled DPO to mitigate verbosity bias: (a) Implement standard DPO and measure average response length. (b) Add length normalization: $\hat{r}(x,y) = \frac{\beta}{|y|} \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$. (c) Compare response lengths and quality with and without normalization. (d) Is there a trade-off between length control and response quality?
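A sketch of part (b), reusing the DPO log-probability inputs and dividing each implicit reward by its response length:

```python
import torch.nn.functional as F

def length_controlled_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                               ref_chosen_logps, ref_rejected_logps,
                               chosen_lens, rejected_lens, beta=0.1):
    """DPO loss with the implicit reward normalized by response length (sketch of part (b))."""
    r_chosen = beta * (policy_chosen_logps - ref_chosen_logps) / chosen_lens
    r_rejected = beta * (policy_rejected_logps - ref_rejected_logps) / rejected_lens
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```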
Exercise 25.27: Preference Data Scaling
Study how the amount of preference data affects alignment quality: (a) Train DPO with 100, 500, 1000, 5000 preference pairs. (b) Evaluate on a held-out preference test set. (c) Evaluate on safety benchmarks. (d) Plot learning curves and identify diminishing returns.
Exercise 25.28: Reward Hacking Detection
Implement a reward hacking detection system: (a) Train a reward model and a DPO/PPO policy. (b) Generate responses and compute reward model scores. (c) Use a separate, stronger model to evaluate response quality. (d) Identify cases where high reward does not correspond to high quality. (e) Propose mitigation strategies.
Exercise 25.29: Annotator Modeling
Implement a simple annotator modeling system: (a) Simulate noisy annotators with different accuracy levels. (b) Implement the Dawid-Skene model to estimate true labels from noisy annotations. (c) Compare DPO training with raw vs. denoised labels. (d) How much does annotator modeling improve alignment quality?
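A sketch of part (b): a simplified two-coin Dawid-Skene EM for binary annotations, written with NumPy; the initialization and stopping rule are deliberately basic:

```python
import numpy as np

def dawid_skene_binary(labels, n_iters=50):
    """Simplified Dawid-Skene EM for binary annotations (a sketch).

    labels: array of shape (n_items, n_annotators) with entries in {0, 1} and
    np.nan for missing annotations. Returns the posterior P(true label = 1) per item.
    """
    n_items, n_annotators = labels.shape
    observed = ~np.isnan(labels)

    # Initialize item posteriors with a (soft) majority vote.
    posterior = np.nanmean(labels, axis=1)

    for _ in range(n_iters):
        # M-step: per-annotator accuracy on each true class, estimated with soft counts.
        acc1 = np.zeros(n_annotators)   # P(annotator says 1 | true label = 1)
        acc0 = np.zeros(n_annotators)   # P(annotator says 0 | true label = 0)
        for j in range(n_annotators):
            obs = observed[:, j]
            y, p = labels[obs, j], posterior[obs]
            acc1[j] = (p * y).sum() / max(p.sum(), 1e-8)
            acc0[j] = ((1 - p) * (1 - y)).sum() / max((1 - p).sum(), 1e-8)
        prior1 = posterior.mean()

        # E-step: update each item's posterior from the annotator reliabilities.
        for i in range(n_items):
            log_p1, log_p0 = np.log(prior1 + 1e-8), np.log(1 - prior1 + 1e-8)
            for j in range(n_annotators):
                if not observed[i, j]:
                    continue
                y = labels[i, j]
                log_p1 += np.log((acc1[j] if y == 1 else 1 - acc1[j]) + 1e-8)
                log_p0 += np.log((1 - acc0[j] if y == 1 else acc0[j]) + 1e-8)
            posterior[i] = 1.0 / (1.0 + np.exp(log_p0 - log_p1))

    return posterior
```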
Exercise 25.30: Full Alignment Pipeline
Implement the complete alignment pipeline from scratch: (a) Start with a base model and perform SFT on instruction data. (b) Collect preference data (simulated or from a strong model). (c) Train a reward model and evaluate it. (d) Train with DPO using the preference data. (e) Conduct red-teaming evaluation. (f) Iterate: identify failures, collect targeted data, retrain. Report metrics at each stage.