Chapter 25: Quiz

Multiple Choice Questions

Question 1

The alignment problem refers to:

A) Aligning the model's weights to reduce memory usage. B) Ensuring a language model's outputs are helpful, honest, and harmless, consistent with human intent. C) Aligning the tokenizer vocabulary with the training data. D) Ensuring multiple GPUs are synchronized during training.

Answer: B The alignment problem is the challenge of steering model behavior toward outputs that are helpful, honest, and harmless while maintaining broad capabilities. Pre-trained models are highly capable, but pre-training alone provides no reliable control over whether those capabilities are deployed helpfully or harmfully.


Question 2

In the RLHF pipeline, the three stages in correct order are:

A) Reward Modeling, SFT, PPO. B) SFT, PPO, Reward Modeling. C) SFT, Reward Modeling, PPO. D) PPO, SFT, Reward Modeling.

Answer: C The standard pipeline is: (1) Supervised Fine-Tuning (SFT) on demonstration data, (2) Reward Modeling on human preference data, (3) Reinforcement Learning (PPO) to optimize the policy against the reward model.


Question 3

The Bradley-Terry model for preferences assumes that the probability of preferring $y_1$ over $y_2$ is:

A) $P(y_1 \succ y_2) = r(x, y_1) - r(x, y_2)$ B) $P(y_1 \succ y_2) = \sigma(r(x, y_1) - r(x, y_2))$ where $\sigma$ is the sigmoid function. C) $P(y_1 \succ y_2) = \frac{r(x, y_1)}{r(x, y_1) + r(x, y_2)}$ D) $P(y_1 \succ y_2) = \max(r(x, y_1), r(x, y_2))$

Answer: B The Bradley-Terry model maps the reward difference through a sigmoid function: $P(y_1 \succ y_2) = \sigma(r(x, y_1) - r(x, y_2))$. This ensures the output is a valid probability in [0, 1] and that preference probability depends only on reward differences.
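
A minimal sketch of the Bradley-Terry probability in plain Python; the reward values below are illustrative placeholders, not outputs of a real reward model.

```python
import math

def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """P(y1 > y2) = sigmoid(r(x, y1) - r(x, y2))."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

print(bradley_terry_prob(1.5, 1.5))  # 0.5: equal rewards imply no preference
print(bradley_terry_prob(2.0, 0.0))  # ~0.88: a 2-point reward gap strongly favors y1
```

Adding the same constant to both rewards leaves the probability unchanged, which is why only reward differences matter.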


Question 4

Why is the KL divergence penalty $\beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})$ included in the RLHF objective?

A) To speed up training convergence. B) To prevent reward hacking and maintain general capabilities. C) To reduce the memory footprint of the policy model. D) To ensure the reward model is well-calibrated.

Answer: B The KL penalty serves two purposes: it prevents reward hacking (the policy exploiting weaknesses in the reward model) and maintains general capabilities (keeping the policy close to the capable SFT model). Without it, the policy tends to drift into degenerate, reward-hacked outputs.
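
As a rough sketch of how the penalty typically enters training, a per-token log-probability ratio between policy and reference can be subtracted from the reward-model score. The function name, shapes, and $\beta$ value below are illustrative, and real implementations differ in how they estimate the KL term.

```python
import torch

def kl_penalized_reward(rm_score: torch.Tensor,
                        policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Reward-model score minus a sampled KL estimate for one response.

    rm_score:        scalar score from the reward model for the full response
    policy_logprobs: per-token log-probs of the response under pi_theta, shape (T,)
    ref_logprobs:    per-token log-probs under the frozen pi_ref, shape (T,)
    """
    kl_estimate = (policy_logprobs - ref_logprobs).sum()  # single-sample KL estimate
    return rm_score - beta * kl_estimate
```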


Question 5

How many models must be kept in GPU memory for PPO-based RLHF on language models?

A) 1 (the policy model). B) 2 (policy and reward models). C) 3 (policy, reference, and reward models). D) 4 (policy, reference, reward, and value models).

Answer: D PPO requires: (1) the policy model $\pi_\theta$ being trained, (2) the frozen reference model $\pi_{\text{ref}}$, (3) the frozen reward model $r_\phi$, and (4) the value model $V_\psi$ being trained. For a 7B model, this means ~28B parameters in memory.


Question 6

What is the key insight of DPO (Direct Preference Optimization)?

A) Preferences can be optimized without any reference model. B) The optimal RLHF policy has a closed-form solution, allowing reward modeling and RL to be bypassed. C) Human preferences can be perfectly captured by a linear reward model. D) PPO always outperforms supervised learning for alignment.

Answer: B DPO's key insight is that the optimal policy under the RLHF objective can be expressed in closed form as a function of the reward and reference policy. By rearranging this relationship, the reward can be expressed through the policy itself, eliminating the need for an explicit reward model and RL.


Question 7

In DPO, the implicit reward is defined as:

A) $\hat{r}(x,y) = \pi_\theta(y|x)$ B) $\hat{r}(x,y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$ C) $\hat{r}(x,y) = \log \pi_\theta(y|x) - \log \pi_\theta(y_l|x)$ D) $\hat{r}(x,y) = \sigma(\pi_\theta(y|x))$

Answer: B The implicit reward in DPO is $\hat{r}(x,y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$, which measures how much the policy deviates from the reference for a given response. The DPO loss maximizes the margin between implicit rewards of chosen and rejected responses.
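
A compact sketch of the DPO loss built directly from this implicit reward, assuming PyTorch and sequence-level log-probabilities (summed over response tokens) computed elsewhere; the function name is hypothetical.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from sequence log-probs, each a tensor of shape (batch,)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)        # r_hat(x, y_w)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)  # r_hat(x, y_l)
    # Maximize the implicit-reward margin via -log sigmoid of the difference.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards, rejected_rewards
```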


Question 8

Compared to RLHF (PPO), DPO requires:

A) More models in memory (6 instead of 4). B) Fewer models in memory (2 instead of 4) and is more stable. C) The same computational resources but different hyperparameters. D) A more complex training loop with multiple phases.

Answer: B DPO requires only 2 models (the policy being trained and the frozen reference), compared to PPO's 4 models. DPO is also more stable because it uses a supervised learning objective rather than the complex RL training loop of PPO.


Question 9

In DPO, the $\beta$ parameter controls:

A) The learning rate of the optimizer. B) The strength of the KL regularization (how close the policy stays to the reference). C) The number of samples used per training step. D) The maximum sequence length.

Answer: B $\beta$ controls KL regularization strength. Lower $\beta$ allows more deviation from the reference policy (more aggressive optimization), while higher $\beta$ keeps the policy closer to the reference (more conservative). Typical values range from 0.1 to 0.5.


Question 10

"Reward hacking" in RLHF refers to:

A) An attacker modifying the reward model weights. B) The policy finding responses that receive high reward scores without being genuinely good. C) The reward model assigning the same score to all responses. D) The training data being corrupted.

Answer: B Reward hacking occurs when the policy exploits weaknesses in the reward model to achieve high scores without producing genuinely good responses. For example, the model might learn that longer responses receive higher rewards and produce unnecessarily verbose outputs.


Question 11

Which alignment method eliminates the need for a reference model entirely?

A) DPO. B) PPO. C) ORPO. D) IPO.

Answer: C ORPO (Odds Ratio Preference Optimization) combines SFT and preference optimization in a single stage using odds ratio penalties, eliminating both the RL stage and the reference model requirement. DPO and IPO still require a reference model.


Question 12

KTO (Kahneman-Tversky Optimization) is designed for:

A) Pairwise preference data with explicit rankings. B) Unpaired preference data (individual responses labeled as good or bad). C) Numerical reward scores assigned by human annotators. D) Multi-turn conversation evaluation.

Answer: B KTO works with unpaired feedback where responses are individually labeled as "good" or "bad" without explicit pairwise comparisons. This is motivated by the observation that unpaired feedback is much easier to collect at scale than pairwise preferences.
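
To make the data requirement concrete, a hypothetical unpaired-feedback record might look like the following; the field names are illustrative rather than any specific library's schema.

```python
# Each response is labeled on its own; there is no chosen/rejected pairing.
kto_examples = [
    {"prompt": "Explain KL divergence.",
     "completion": "KL divergence measures how one distribution differs from another...",
     "label": True},    # thumbs up
    {"prompt": "Explain KL divergence.",
     "completion": "idk, look it up.",
     "label": False},   # thumbs down
]
```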


Question 13

Constitutional AI uses AI feedback to:

A) Replace the need for any training data. B) Scale alignment by having the model critique and revise its own responses according to principles. C) Generate the pre-training corpus. D) Automatically design the model architecture.

Answer: B Constitutional AI uses a set of principles (the "constitution") to guide AI-generated feedback. The model generates responses, critiques them according to the principles, and revises them. This reduces dependence on human annotators while making alignment criteria explicit.


Question 14

The Attack Success Rate (ASR) in safety evaluation measures:

A) The percentage of model parameters that are vulnerable to attacks. B) The fraction of adversarial prompts that successfully elicit harmful content. C) The speed at which the model processes adversarial inputs. D) The number of security patches applied.

Answer: B ASR is defined as the number of successful attacks divided by the total number of attack attempts. A lower ASR indicates a more robust model. It should be tracked alongside the over-refusal rate to ensure the model is not simply refusing everything.


Question 15

The over-refusal rate measures:

A) The fraction of harmful requests that the model refuses. B) The fraction of benign requests that the model incorrectly refuses. C) The rate at which the model generates excessively long responses. D) The frequency of repeating the same response.

Answer: B The over-refusal rate captures cases where the model is too cautious, refusing perfectly reasonable requests. This is a common side effect of aggressive safety training. A well-aligned model should have both a low ASR and a low over-refusal rate.
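
A minimal sketch computing both safety metrics, this over-refusal rate and the attack success rate from the previous question, from boolean evaluation outcomes; the counts in the usage example are made up for illustration.

```python
def attack_success_rate(attack_succeeded: list[bool]) -> float:
    """Fraction of adversarial prompts that elicited harmful content (lower is better)."""
    return sum(attack_succeeded) / len(attack_succeeded)

def over_refusal_rate(benign_refused: list[bool]) -> float:
    """Fraction of benign prompts the model incorrectly refused (lower is better)."""
    return sum(benign_refused) / len(benign_refused)

# Illustrative: 3 of 100 attacks succeed, 8 of 200 benign prompts are refused.
print(attack_success_rate([True] * 3 + [False] * 97))    # 0.03
print(over_refusal_rate([True] * 8 + [False] * 192))     # 0.04
```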


Question 16

Which preference data collection strategy provides the most information per annotation?

A) Pairwise comparison (which of two responses is better?). B) Ranked preferences (rank $k$ responses from best to worst). C) Likert scale rating (rate each response 1-5). D) Binary thumbs up/down.

Answer: B Ranked preferences of $k$ responses yield $\binom{k}{2}$ pairwise comparisons per annotation. For example, ranking 4 responses provides 6 pairwise comparisons, making it the most information-dense annotation strategy.
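
A small sketch of how a single ranking expands into pairwise training examples; the function name and record format are illustrative.

```python
from itertools import combinations

def ranking_to_pairs(prompt: str, ranked_best_to_worst: list[str]) -> list[dict]:
    """Expand a best-to-worst ranking of k responses into C(k, 2) preference pairs."""
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_best_to_worst, 2)
    ]

# Ranking 4 responses yields 6 pairwise comparisons.
print(len(ranking_to_pairs("some prompt", ["best", "good", "okay", "worst"])))  # 6
```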


Question 17

Inter-annotator agreement measured by Cohen's kappa of 0.4 indicates:

A) Perfect agreement. B) Moderate agreement. C) Fair agreement. D) Slight agreement.

Answer: C Cohen's kappa values are typically interpreted as: <0.20 = slight, 0.21-0.40 = fair, 0.41-0.60 = moderate, 0.61-0.80 = substantial, 0.81-1.00 = near-perfect. A kappa of 0.4 indicates fair agreement, suggesting significant annotator disagreement.
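
The bands above translate directly into a small helper; the function name is illustrative.

```python
def interpret_kappa(kappa: float) -> str:
    """Map Cohen's kappa onto the conventional agreement bands."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "near-perfect"

print(interpret_kappa(0.4))  # "fair"
```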


Question 18

In the DPO training data format, each example consists of:

A) A prompt and a single response. B) A prompt, a chosen response, and a rejected response. C) A prompt and a numerical reward score. D) Two prompts and two responses.

Answer: B DPO requires triples of (prompt, chosen response, rejected response) where the chosen response is preferred over the rejected one. This pairwise format directly corresponds to the Bradley-Terry preference model that DPO optimizes.
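
A hypothetical example record in this format (contents invented for illustration):

```python
dpo_example = {
    "prompt": "How do I politely decline a meeting invitation?",
    "chosen": "Thank them for the invitation, explain that you are unavailable, "
              "and offer an alternative time if appropriate.",
    "rejected": "Just ignore the invite.",
}
```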


Question 19

During DPO training, which of the following should increase over time?

A) The implicit reward of rejected responses. B) The margin between chosen and rejected implicit rewards. C) The KL divergence from the reference model (without bound). D) The loss value.

Answer: B The reward margin $\hat{r}(y_w) - \hat{r}(y_l)$ should increase during training, indicating the model is learning to distinguish preferred from dispreferred responses. Chosen rewards should increase, rejected rewards should decrease, and the accuracy (fraction where chosen > rejected) should approach 1.
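
These diagnostics are easy to compute from the per-example implicit rewards (see the DPO loss sketch earlier in this quiz); the function name is illustrative.

```python
import torch

def dpo_batch_metrics(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> dict:
    """Logging metrics from per-example implicit rewards, each of shape (batch,)."""
    margin = chosen_rewards - rejected_rewards
    return {
        "reward_margin": margin.mean().item(),                    # should increase
        "reward_accuracy": (margin > 0).float().mean().item(),    # should approach 1
        "chosen_reward": chosen_rewards.mean().item(),            # should increase
        "rejected_reward": rejected_rewards.mean().item(),        # should decrease
    }
```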


Question 20

Why is SFT alone insufficient for alignment?

A) SFT cannot learn from any training data. B) SFT learns to imitate average quality of demonstrations, not to optimize for the best responses. C) SFT only works with encoder models. D) SFT requires more data than RLHF.

Answer: B SFT learns to mimic the average behavior of demonstrations through imitation learning. It cannot distinguish between good and bad responses or learn that some behaviors are better than others. RLHF/DPO provide a preference signal that guides the model toward better responses beyond what demonstrations show.


Question 21

In the RLHF objective, the reward model $r_\phi$ is typically initialized from:

A) A randomly initialized model. B) The SFT model with the language modeling head replaced by a scalar output. C) A separate, pre-trained classifier. D) The base pre-trained model before SFT.

Answer: B The reward model is initialized from the SFT model with the language modeling head replaced by a linear layer that outputs a scalar reward. Using the same architecture ensures the reward model can understand the same features as the policy.
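
A structural sketch of this initialization, assuming a backbone that returns final hidden states of shape (batch, seq_len, hidden_size); the class and call signature are illustrative, not a specific library's API.

```python
import torch.nn as nn

class RewardModel(nn.Module):
    """SFT backbone with the LM head replaced by a scalar value head."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # initialized from the SFT model
        self.value_head = nn.Linear(hidden_size, 1)   # scalar reward instead of vocab logits

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask)
        # Score the hidden state of the final token as the sequence-level reward.
        return self.value_head(hidden[:, -1, :]).squeeze(-1)
```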


Question 22

SimPO simplifies DPO by:

A) Using a fixed reward model instead of implicit rewards. B) Using average log probability as the implicit reward, eliminating the reference model. C) Training only the last layer of the model. D) Using greedy decoding instead of sampling.

Answer: B SimPO uses $\hat{r}(x,y) = \frac{\beta}{|y|} \log \pi_\theta(y|x)$ as the implicit reward, where length normalization prevents verbosity bias. This eliminates the need for a reference model, simplifying the training pipeline further.
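
A sketch of the SimPO reward and loss under these definitions, assuming PyTorch and precomputed sequence log-probabilities; the default $\beta$ and the target-margin term $\gamma$ below are illustrative values.

```python
import torch
import torch.nn.functional as F

def simpo_reward(logps: torch.Tensor, lengths: torch.Tensor, beta: float = 2.0) -> torch.Tensor:
    """Length-normalized implicit reward: (beta / |y|) * log pi_theta(y|x)."""
    return beta * logps / lengths

def simpo_loss(chosen_logps, chosen_len, rejected_logps, rejected_len,
               beta: float = 2.0, gamma: float = 1.0):
    """-log sigmoid of the reward margin minus a target margin gamma; no reference model."""
    margin = (simpo_reward(chosen_logps, chosen_len, beta)
              - simpo_reward(rejected_logps, rejected_len, beta))
    return -F.logsigmoid(margin - gamma).mean()
```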


Question 23

In PPO for LLMs, the advantage $A_t$ is estimated using GAE (Generalized Advantage Estimation) because:

A) It reduces variance in advantage estimates while controlling bias. B) It is the simplest possible advantage estimator. C) It eliminates the need for a value function. D) It ensures the advantage is always positive.

Answer: A GAE provides a family of advantage estimators controlled by $\lambda$, trading off bias and variance. Higher $\lambda$ gives lower bias but higher variance; lower $\lambda$ gives higher bias but lower variance. This is crucial for stable PPO training with the noisy reward signals typical in RLHF.
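
A minimal sketch of GAE over one trajectory, assuming per-step rewards and value estimates are already available; the $\gamma$ and $\lambda$ values are illustrative.

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 1.0, lam: float = 0.95) -> torch.Tensor:
    """GAE over one trajectory.

    rewards: per-step rewards, shape (T,)
    values:  value estimates V(s_t), shape (T + 1,) including the bootstrap value
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = torch.tensor(0.0)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error at step t
        gae = delta + gamma * lam * gae                          # lambda trades bias vs. variance
        advantages[t] = gae
    return advantages
```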


Question 24

The practical alignment recipe from Section 25.12 recommends:

A) Starting with DPO directly on the base model. B) SFT on 10K-100K examples, then DPO on 10K-50K preference pairs with $\beta = 0.1$-$0.3$. C) Using only Constitutional AI without any human data. D) Training with PPO for 100 epochs on a small dataset.

Answer: B The recommended recipe is: (1) SFT on high-quality instruction data (10K-100K examples, 1-3 epochs), (2) Collect preference data (10K-50K pairs), (3) DPO training (1-2 epochs, $\beta = 0.1$-$0.3$), (4) Safety evaluation, and (5) Iterate on failure modes.
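
As a compact reminder, the recipe's rough ranges (echoing the answer above) can be kept in a small config; the structure is purely illustrative.

```python
alignment_recipe = {
    "sft": {"examples": "10K-100K", "epochs": "1-3"},
    "preference_data": {"pairs": "10K-50K"},
    "dpo": {"epochs": "1-2", "beta": "0.1-0.3"},
    "evaluation": ["attack success rate", "over-refusal rate"],
    "then": "iterate on observed failure modes",
}
```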


Question 25

Which alignment method is most appropriate when you have unpaired feedback (thumbs up/down) rather than pairwise comparisons?

A) DPO. B) RLHF (PPO). C) KTO. D) ORPO.

Answer: C KTO (Kahneman-Tversky Optimization) is specifically designed for unpaired preference data where responses are individually labeled as good or bad. DPO and RLHF require explicit pairwise comparisons (chosen vs. rejected). ORPO requires the same paired format as DPO.