Chapter 25: Key Takeaways
The Alignment Problem
- Pre-trained models are capable but uncontrolled. A language model trained on next-token prediction generates text that is statistically likely, including misinformation, harmful content, and biased reasoning alongside useful text. Alignment bridges the gap between raw capability and reliable, intended behavior.
- Alignment balances helpfulness, honesty, and harmlessness. These dimensions sometimes conflict: a maximally helpful model might provide dangerous information; a maximally safe model might refuse benign requests. Alignment methods must navigate these trade-offs without degenerating into over-refusal.
- SFT alone is insufficient because it imitates rather than optimizes. Supervised fine-tuning learns to reproduce the average quality of its demonstrations. It cannot distinguish between good and bad responses or push the model beyond the quality ceiling of the training data. Preference-based methods (RLHF, DPO) provide the signal needed to exceed demonstration quality.
Reward Modeling
- The reward model is the quality bottleneck of RLHF. Its biases and blind spots transfer directly to the aligned policy. Systematic evaluation for length bias, sycophancy, and calibration is essential before using a reward model for policy optimization.
- The Bradley-Terry model maps reward differences through a sigmoid. The preference probability $P(y_w \succ y_l \mid x) = \sigma(r(x, y_w) - r(x, y_l))$ depends only on the reward difference, not the absolute values. The training loss is equivalent to binary cross-entropy on the reward margin (see the sketch after this list).
- Reward model architecture mirrors the policy. Initialize from the SFT model with the language modeling head replaced by a scalar output head. This ensures the reward model understands the same features as the policy, producing more meaningful reward signals.
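A minimal sketch of these two ideas in PyTorch-style code: a scalar head of the kind that would replace the SFT model's language modeling head, and the Bradley-Terry loss computed on the resulting reward margins. The `ScalarRewardHead` class, tensor shapes, and hidden size are illustrative placeholders, not a specific library's API.

```python
import torch
import torch.nn.functional as F


class ScalarRewardHead(torch.nn.Module):
    """Replaces the LM head: maps a hidden state to a single scalar score."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = torch.nn.Linear(hidden_size, 1, bias=False)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Score the final token position: (batch, seq_len, hidden) -> (batch,)
        return self.score(last_hidden_state[:, -1, :]).squeeze(-1)


def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of P(chosen > rejected) = sigmoid(r_w - r_l).

    Only the margin r_w - r_l matters, so this is binary cross-entropy
    on the reward difference with an implicit label of 1.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# Illustrative usage with random hidden states standing in for transformer outputs.
head = ScalarRewardHead(hidden_size=64)
h_chosen = torch.randn(8, 16, 64)     # (batch, seq_len, hidden)
h_rejected = torch.randn(8, 16, 64)
loss = bradley_terry_loss(head(h_chosen), head(h_rejected))
loss.backward()
```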
RLHF and PPO
- The KL divergence penalty prevents reward hacking and maintains capabilities. Without $\beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})$, the policy exploits reward model weaknesses and degenerates. Too low a $\beta$ allows reward hacking; too high a $\beta$ collapses the policy onto the reference model. Typical values range from 0.01 to 0.2 (see the shaping sketch after this list).
- PPO for LLMs requires four models in GPU memory. The policy $\pi_\theta$, reference $\pi_{\text{ref}}$, reward model $r_\phi$, and value model $V_\psi$ must all be resident simultaneously. For a 7B model, this means roughly 28B parameters and the corresponding memory, making PPO computationally expensive.
- PPO training instability is the primary practical challenge. The interaction of reward scale, KL coefficient, clipping parameter, learning rate, and batch size creates a complex hyperparameter space. Common failure modes include reward hacking, KL explosion, and training collapse.
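A minimal sketch of KL-penalized reward shaping, assuming one common arrangement: a per-token $-\beta \cdot \text{KL}$ penalty estimated from sampled-token log-probabilities, with the scalar reward-model score added at the final token. The tensor names and the $\beta$ default are illustrative.

```python
import torch


def kl_shaped_rewards(policy_logprobs: torch.Tensor,  # (batch, seq_len) log pi_theta of sampled tokens
                      ref_logprobs: torch.Tensor,     # (batch, seq_len) log pi_ref of the same tokens
                      rm_scores: torch.Tensor,        # (batch,) scalar reward-model score per response
                      beta: float = 0.05) -> torch.Tensor:
    """Per-token rewards for PPO: -beta * KL at every token, plus the
    reward-model score at the final token. Inputs are assumed detached,
    since rewards are treated as constants when computing advantages.

    KL is estimated on sampled tokens as log pi_theta(y_t|x) - log pi_ref(y_t|x).
    """
    rewards = -beta * (policy_logprobs - ref_logprobs)
    rewards[:, -1] += rm_scores   # terminal reward from the reward model
    return rewards


# Illustrative usage with random log-probabilities.
rewards = kl_shaped_rewards(torch.randn(4, 32), torch.randn(4, 32), torch.randn(4))
```

A larger $\beta$ shrinks every per-token reward toward zero whenever the policy drifts from the reference, which is exactly the trade-off described in the first bullet above.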
Direct Preference Optimization (DPO)
- DPO eliminates the reward model and RL by exploiting a closed-form solution. The optimal RLHF policy has the analytical form $\pi^*(y|x) \propto \pi_{\text{ref}}(y|x) \exp(r(x,y)/\beta)$. Rearranging expresses the reward through the policy itself, and substituting into the Bradley-Terry model yields a supervised loss that directly optimizes preferences (see the sketch after this list).
- DPO reduces infrastructure from four models to two. Only the policy $\pi_\theta$ (being trained) and the frozen reference $\pi_{\text{ref}}$ are needed. This halves memory requirements and eliminates the complexity of RL training loops, reward model training, and value function estimation.
- The DPO implicit reward $\hat{r}(x,y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$ measures policy deviation. Monitoring this quantity during training reveals whether the model is learning meaningful preferences: the chosen reward should increase, the rejected reward should decrease, and the margin between them should grow.
- The $\beta$ parameter is the most important DPO hyperparameter. It controls how far the policy can deviate from the reference. Lower $\beta$ (e.g., 0.1) allows aggressive optimization; higher $\beta$ (e.g., 0.5) is conservative. Start with $\beta = 0.1$ and increase it if the model degrades on general tasks.
- DPO's adaptive gradient weighting makes training efficient. The gradient is weighted by $\sigma(\hat{r}(x, y_l) - \hat{r}(x, y_w))$, which is large when the model incorrectly ranks the losing response above the winning response. This focuses learning on the examples the model most needs to correct.
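A minimal sketch of the DPO objective described above, computing the implicit rewards $\beta \log(\pi_\theta / \pi_{\text{ref}})$ for chosen and rejected responses, the loss, and the diagnostics worth logging. The sequence log-probabilities are assumed to be pre-computed sums over response tokens; the metric names are illustrative.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,    # (batch,) sum log pi_theta over y_w tokens
             policy_rejected_logps: torch.Tensor,  # (batch,) sum log pi_theta over y_l tokens
             ref_chosen_logps: torch.Tensor,       # (batch,) sum log pi_ref over y_w tokens
             ref_rejected_logps: torch.Tensor,     # (batch,) sum log pi_ref over y_l tokens
             beta: float = 0.1):
    """DPO loss plus the implicit-reward diagnostics to monitor during training."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(margin).mean()
    # The gradient weight sigma(r_hat_l - r_hat_w) is largest on mis-ranked pairs,
    # so learning concentrates where the model disagrees with the labels.
    metrics = {
        "rewards/chosen": chosen_rewards.mean().item(),
        "rewards/rejected": rejected_rewards.mean().item(),
        "rewards/margin": margin.mean().item(),
        "rewards/accuracy": (margin > 0).float().mean().item(),
    }
    return loss, metrics
```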
Beyond DPO: ORPO, KTO, SimPO
- ORPO eliminates the reference model entirely. By combining SFT and preference optimization in a single stage using an odds-ratio penalty, ORPO simplifies the pipeline further. This is advantageous when storage and memory for a frozen reference model are constrained.
- KTO enables alignment from unpaired feedback. When only thumbs-up/thumbs-down labels are available (rather than explicit pairwise comparisons), KTO provides a viable alignment signal. Unpaired feedback is far cheaper to collect at scale than pairwise preferences.
- SimPO uses length-normalized average log probability as the implicit reward. The formulation $\hat{r}(x,y) = \frac{\beta}{|y|} \log \pi_\theta(y|x)$ counteracts verbosity bias while eliminating the reference model, simplifying the training pipeline further (see the sketch below).
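A minimal sketch of a SimPO-style loss using the length-normalized implicit reward above. The target-margin term `gamma` and the default values for `beta` and `gamma` are assumptions for illustration, not canonical settings.

```python
import torch
import torch.nn.functional as F


def simpo_loss(chosen_logps: torch.Tensor,      # (batch,) sum log pi_theta over y_w tokens
               rejected_logps: torch.Tensor,    # (batch,) sum log pi_theta over y_l tokens
               chosen_lengths: torch.Tensor,    # (batch,) |y_w| in tokens
               rejected_lengths: torch.Tensor,  # (batch,) |y_l| in tokens
               beta: float = 2.0,
               gamma: float = 0.5) -> torch.Tensor:
    """Length-normalized implicit rewards; no reference model is needed."""
    r_chosen = beta * chosen_logps / chosen_lengths
    r_rejected = beta * rejected_logps / rejected_lengths
    # gamma is a target margin by which the chosen reward should exceed the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()
```

Dividing by response length removes the advantage that longer responses would otherwise gain from accumulating more log-probability terms.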
Safety and Evaluation
- Constitutional AI scales alignment through explicit principles. Instead of collecting human preference annotations for every scenario, CAI uses a set of principles (the "constitution") to guide AI-generated feedback. This makes alignment criteria transparent and reduces annotation costs.
- Red teaming must be paired with over-refusal measurement. Tracking the Attack Success Rate (ASR) alone is insufficient: a model that refuses everything has an ASR of zero but is useless. The over-refusal rate on benign prompts must be tracked alongside safety metrics to ensure the model remains helpful.
- Preference data quality directly determines alignment quality. Ranking $k$ responses yields $\binom{k}{2}$ pairwise comparisons per annotation, making ranked collection the most information-dense strategy (see the sketch below). Inter-annotator agreement (Cohen's kappa) should be in the 0.6-0.8 range for a reliable training signal.
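A small sketch of the counting claim in the last bullet: a single best-to-worst ranking of $k$ responses expands into $\binom{k}{2}$ (chosen, rejected) pairs. The response names are placeholders.

```python
from itertools import combinations
from math import comb


def ranking_to_pairs(ranked_responses: list[str]) -> list[tuple[str, str]]:
    """Expand a best-to-worst ranking into (chosen, rejected) training pairs."""
    # combinations preserves order, so each tuple's first element outranks the second.
    return list(combinations(ranked_responses, 2))


ranking = ["response_a", "response_b", "response_c", "response_d"]  # best -> worst
pairs = ranking_to_pairs(ranking)
assert len(pairs) == comb(len(ranking), 2) == 6  # k ranked responses -> C(k, 2) pairs
```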