In This Chapter
- 25.1 Introduction: The Alignment Problem
- 25.2 Why Alignment Matters
- 25.3 The RLHF Pipeline
- 25.4 Reward Modeling
- 25.5 PPO for Language Models
- 25.6 Direct Preference Optimization (DPO)
- 25.7 ORPO and Other Alignment Methods
- 25.8 Constitutional AI
- 25.9 Red Teaming and Safety Evaluation
- 25.10 Preference Data Collection
- 25.11 Implementing DPO with TRL
- 25.12 Putting It All Together: The Modern Alignment Pipeline
- 25.13 Summary
- 25.14 Open Challenges and Future Directions
- References
Chapter 25: Alignment: RLHF, DPO, and Beyond
Part IV: Attention, Transformers, and Language Models
25.1 Introduction: The Alignment Problem
A language model trained to predict the next token is not inherently aligned with human values, preferences, or intentions. It will generate text that is statistically likely given the training data---which includes misinformation, harmful content, biased reasoning, and unhelpful patterns alongside useful, accurate, and safe text. The alignment problem is the challenge of steering model behavior toward outputs that are helpful, honest, and harmless, while maintaining the model's broad capabilities.
This chapter covers the theory and practice of alignment techniques that transform a capable but uncontrolled language model into one that reliably follows instructions, provides helpful responses, and avoids harmful outputs. We trace the full alignment pipeline: from supervised fine-tuning (SFT) through reward modeling and reinforcement learning from human feedback (RLHF), to more recent approaches like Direct Preference Optimization (DPO) and beyond.
The alignment landscape has evolved rapidly. OpenAI's InstructGPT (Ouyang et al., 2022) demonstrated the power of RLHF for producing helpful, truthful assistants. Anthropic's Constitutional AI (Bai et al., 2022) showed that AI feedback could partially replace human feedback. Rafailov et al. (2023) simplified the pipeline dramatically with DPO, eliminating the need for a separate reward model. Understanding these methods---their assumptions, trade-offs, and practical implementation---is essential for any engineer building production AI systems.
What You Will Learn
By the end of this chapter, you will be able to:
- Explain the alignment problem and why pre-training alone is insufficient
- Design and train a reward model from human preference data
- Implement the full RLHF pipeline: SFT, reward modeling, and PPO
- Derive and implement Direct Preference Optimization (DPO)
- Compare RLHF, DPO, ORPO, and other alignment methods
- Apply Constitutional AI principles to reduce reliance on human annotation
- Conduct red-teaming and safety evaluation of aligned models
- Collect and curate preference data for alignment training
- Use the TRL library to implement DPO in practice
Prerequisites
This chapter assumes familiarity with:
- Transformer architecture and attention (Chapter 20)
- Language model training and fine-tuning (Chapters 21, 24)
- Reinforcement learning basics (policy, reward, value functions)
- Prompt engineering concepts (Chapter 23)
25.2 Why Alignment Matters
25.2.1 The Gap Between Capability and Intent
Pre-training produces models with remarkable capabilities---reasoning, coding, creative writing, analysis---but without reliable control over when and how these capabilities are deployed. Consider a model that can:
- Generate both truthful medical advice and convincing medical misinformation
- Write both helpful code and malicious exploits
- Produce both balanced analysis and manipulative persuasion
Without alignment, the model treats all these outputs as equally valid completions. The goal of alignment is to create a reliable mapping from human intent to model behavior.
25.2.2 Dimensions of Alignment
Alignment encompasses several dimensions:
Helpfulness. The model should make a genuine effort to assist the user, providing accurate, relevant, and complete responses.
Honesty. The model should be truthful, express uncertainty when appropriate, and avoid fabricating information (hallucination).
Harmlessness. The model should refuse to assist with harmful activities, avoid generating offensive content, and minimize potential negative impacts.
Instruction following. The model should understand and faithfully execute user instructions, including constraints on format, style, and content.
These dimensions sometimes conflict. A maximally helpful model might provide dangerous information; a maximally safe model might refuse benign requests. Alignment techniques must navigate these trade-offs, and different deployment contexts demand different balances. A medical AI assistant may prioritize accuracy and honesty above all else, while a creative writing assistant may prioritize helpfulness and flexibility.
25.2.3 A Brief History of Alignment
The idea of aligning AI systems with human values predates large language models. The field has roots in:
- Reward shaping in RL (Ng et al., 1999): The observation that specifying the right reward function is crucial---and difficult---for training agents that behave as intended.
- Value alignment (Russell, 2019): The philosophical and technical challenge of ensuring AI systems pursue objectives that are truly aligned with human values, rather than proxies that may diverge in unexpected ways.
- RLHF for summarization (Stiennon et al., 2020): An early demonstration that reinforcement learning from human feedback could train language models to produce better summaries than those generated by supervised learning alone.
- InstructGPT (Ouyang et al., 2022): The landmark paper that applied the full RLHF pipeline (SFT + RM + PPO) to GPT-3, producing a model that was significantly preferred by human evaluators. InstructGPT demonstrated that alignment could dramatically improve usefulness with relatively modest additional training cost.
The success of InstructGPT and its successor ChatGPT catalyzed a rapid expansion of alignment research, producing the methods we cover in this chapter.
25.2.4 The Alignment Tax
Alignment typically reduces raw performance on some benchmarks. A model that refuses to generate harmful content will score lower on benchmarks that include harmful content generation. This "alignment tax" is generally considered acceptable, but excessive alignment can make the model overly cautious, refusing reasonable requests---a phenomenon known as "over-refusal."
Quantifying the alignment tax is important for making informed decisions. Typically, well-aligned models show less than 5% degradation on standard capability benchmarks (MMLU, HumanEval, GSM8K) while showing dramatic improvements on helpfulness and safety evaluations. If you observe more than 10% capability regression after alignment, this suggests that the alignment process was too aggressive---the learning rate was too high, the KL penalty was too low, or the preference data was too narrow. In such cases, reducing the alignment intensity or mixing in general-purpose training data can help recover lost capability, as we discussed in the context of catastrophic forgetting in Chapter 24.
25.3 The RLHF Pipeline
25.3.1 Overview
The standard RLHF pipeline consists of three stages:
- Supervised Fine-Tuning (SFT): Fine-tune the base model on high-quality demonstration data
- Reward Modeling (RM): Train a reward model to predict human preferences
- Reinforcement Learning (RL): Optimize the SFT model using the reward model via PPO
Each stage builds on the previous one, progressively refining the model's alignment.
25.3.2 Stage 1: Supervised Fine-Tuning
The SFT stage (covered in detail in Chapter 24) fine-tunes the pre-trained model on a curated dataset of high-quality demonstrations. This brings the model into the "right neighborhood" of behavior, making the subsequent RL stage more efficient.
The SFT dataset consists of prompt-response pairs $(x_i, y_i)$ where $y_i$ represents a high-quality, aligned response. The training objective is standard language modeling:
$$\mathcal{L}_{\text{SFT}} = -\mathbb{E}_{(x,y) \sim \mathcal{D}_{\text{SFT}}} \left[ \sum_{t=1}^{|y|} \log \pi_{\text{SFT}}(y_t \mid x, y_{<t}) \right]$$
The resulting model $\pi_{\text{SFT}}$ serves as both the starting point for RL training and the reference policy for KL regularization.
SFT learns to mimic demonstrations but has fundamental limitations. RLHF addresses these limitations by providing a preference signal that guides the model toward better responses, even beyond the quality of the demonstrations.
To build intuition for why RLHF improves upon SFT, consider an analogy. Imagine teaching someone to cook by providing recipes (SFT) versus providing recipes plus having them taste-test their results and get feedback (RLHF). With SFT alone, the model learns: "When asked to write a poem, produce text that looks like the demonstration poems." It imitates the average quality of the demonstrations, including any idiosyncrasies or limitations in the demonstration data. With RLHF, the model learns: "When asked to write a poem, produce the poem that a discerning reader would most prefer." This preference-based objective allows the model to discover responses that are better than any individual demonstration, because it optimizes a quality signal rather than imitating a fixed dataset.
Mathematically, SFT maximizes $\log p(y_{\text{demo}} \mid x)$ (the probability of the demonstration), while RLHF maximizes $\mathbb{E}[r(x, y)]$ (the expected reward). The reward model captures a richer signal than any fixed set of demonstrations because it generalizes from the preference data to unseen prompt-response combinations.
The reward model (RM) serves as a learned proxy for human judgment. Given a prompt $x$ and a response $y$, the reward model produces a scalar score: $$r_\phi(x, y) \in \mathbb{R}$$ where higher scores indicate more preferred responses. This model encodes human preferences into a differentiable function that can be used for optimization.
Reward models are trained on comparison data. For each prompt $x$, human annotators are shown two (or more) model responses and asked to indicate which they prefer: $$(x, y_w, y_l) \quad \text{where } y_w \succ y_l \text{ (}y_w \text{ is preferred over } y_l\text{)}$$ The annotator's task is easier than writing a good response from scratch---comparing two texts is cognitively simpler than generating text. This is a key insight: it is easier for humans to express preferences than to demonstrate optimal behavior.
Preferences are modeled using the Bradley-Terry model, which assumes the probability that response $y_1$ is preferred over $y_2$ is: $$P(y_1 \succ y_2 \mid x) = \sigma(r_\phi(x, y_1) - r_\phi(x, y_2))$$ where $\sigma$ is the sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$. This model assumes that preferences are determined by the difference in reward scores and that the preference probability follows a logistic function.
The reward model is trained to minimize the negative log-likelihood of the observed preferences: $$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{pref}}} \left[ \log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l)) \right]$$ This is equivalent to binary cross-entropy loss where the label is always 1 (the preferred response should receive a higher reward). The loss is minimized when $r_\phi(x, y_w) \gg r_\phi(x, y_l)$, driving the reward model to assign clearly distinct scores to preferred and rejected responses.
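To make the Bradley-Terry loss concrete, here is a small worked example with hypothetical reward scores (the numbers are illustrative, not from any real model). Suppose the reward model assigns $r_\phi(x, y_w) = 1.2$ and $r_\phi(x, y_l) = 0.3$. Then
$$P(y_w \succ y_l \mid x) = \sigma(1.2 - 0.3) = \sigma(0.9) \approx 0.71, \qquad \mathcal{L}_{\text{RM}} = -\log 0.71 \approx 0.34.$$
If the scores were reversed ($r_\phi(x, y_w) = 0.3$, $r_\phi(x, y_l) = 1.2$), the loss would rise to $-\log \sigma(-0.9) \approx 1.24$, so gradient descent pushes the model to raise the chosen score relative to the rejected one.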
In practice, training converges when the reward model correctly orders approximately 70-80% of held-out preference pairs, which reflects the noise ceiling of human annotation. The reward model is typically initialized from the SFT model with the language modeling head replaced by a scalar output head: $$r_\phi(x, y) = \text{Linear}(\text{Transformer}(x, y)_{\text{last\_token}})$$ The final hidden state of the last token is projected to a scalar reward. Using the same architecture as the policy model ensures the reward model can understand the same features and representations. Initializing from the SFT checkpoint (rather than from the base pre-trained model) also provides a better starting point, since the SFT model has already learned to process instruction-response pairs. Let us walk through the reward model training process in detail. Step 1: Data collection. For each of $N$ prompts, generate $K$ responses from the SFT model (typically $K = 2$ to $4$). Human annotators rank or compare the responses, producing preference pairs $(y_w, y_l)$. Step 2: Architecture setup. Initialize the reward model from the SFT checkpoint. Replace the language modeling head (which maps to vocabulary logits) with a scalar head: Step 3: Training loop. For each batch of preference pairs, compute the reward for both the chosen and rejected responses, and minimize the Bradley-Terry loss: Step 4: Validation. Evaluate the reward model's accuracy on held-out preference pairs---does the reward model assign higher scores to the human-preferred responses? A well-trained reward model achieves 70-80% accuracy on held-out data (recall that human inter-annotator agreement is typically 60-80%, so the reward model approaches the ceiling of human consistency). Reward hacking. The policy may find responses that receive high reward scores without being genuinely good. For example, the model might learn that longer responses tend to receive higher rewards and produce unnecessarily verbose outputs. More subtle forms of reward hacking include the model producing responses that are confident-sounding but factually incorrect, or responses that use flattery to game the reward model. Distribution shift. The reward model is trained on outputs from $\pi_{\text{SFT}}$, but during RL training, the policy $\pi_\theta$ diverges from $\pi_{\text{SFT}}$, producing responses outside the reward model's training distribution. When the reward model encounters out-of-distribution responses, its predictions become unreliable, and the policy can exploit these unreliable predictions. Annotation noise. Human annotators disagree---inter-annotator agreement is typically 60--80%. The reward model must learn robust preferences from noisy labels. Strategies to address noise include filtering for high-agreement pairs, using multiple annotations per pair, and training with label smoothing. Scaling rewards. The reward model's output scale is arbitrary. Normalizing rewards during RL training (e.g., subtracting the running mean and dividing by the running standard deviation) helps stabilize optimization and makes hyperparameter choices more transferable across reward models. Given a reward model $r_\phi$, we want to find a policy $\pi_\theta$ that maximizes expected reward while staying close to the reference policy $\pi_{\text{ref}}$ (typically $\pi_{\text{SFT}}$). 
The objective is: $$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) \right] - \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})$$ The KL divergence penalty serves two purposes:
1. Prevents reward hacking: Limits how far the policy can deviate from sensible behavior
2. Maintains capabilities: Keeps the model close to the capable SFT model
The KL divergence between the policy and reference for a given prompt $x$ is: $$\text{KL}(\pi_\theta(\cdot|x) \| \pi_{\text{ref}}(\cdot|x)) = \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} \left[ \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \right]$$ For autoregressive models, this decomposes token-by-token: $$\text{KL}(\pi_\theta \| \pi_{\text{ref}}) = \sum_{t=1}^{T} \mathbb{E}_{y_{<t} \sim \pi_\theta} \left[ \text{KL}\left(\pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \pi_{\text{ref}}(\cdot \mid x, y_{<t})\right) \right]$$
In practice, the KL penalty is often computed per-token and added to the reward: $$r_{\text{total}}(x, y) = r_\phi(x, y) - \beta \sum_{t=1}^{T} \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}$$
PPO (Schulman et al., 2017) is a policy gradient algorithm that constrains policy updates to a trust region. The clipped PPO objective is: $$\mathcal{L}_{\text{PPO}} = -\mathbb{E}_t \left[ \min\left( \rho_t A_t, \; \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) A_t \right) \right]$$ where $\rho_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio between the current and old policies, $A_t$ is the advantage estimate, and $\epsilon$ is the clip parameter (typically around 0.2).
In the language model setting:
- States $s_t$ are the partial sequences $(x, y_{<t})$
- Actions $a_t$ are the next tokens $y_t$, and the reward $r_\phi(x, y)$ is assigned once the full response has been generated, with the per-token KL penalty applied along the way
The advantage is estimated using Generalized Advantage Estimation (GAE): $$\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \delta_{t+l}$$ where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual, $\gamma$ is the discount factor (often set to 1 for language), and $\lambda$ is the GAE parameter.
A value function $V_\psi(s_t)$ is trained alongside the policy to estimate expected returns: $$\mathcal{L}_V = \mathbb{E}_t \left[ (V_\psi(s_t) - R_t)^2 \right]$$ where $R_t = \sum_{l=0}^{T-t-1} \gamma^l r_{t+l}$ is the return.
The KL penalty deserves special attention because it is the mechanism that prevents the policy from collapsing to a degenerate solution. Without the KL penalty, the policy would learn to produce whatever outputs maximize the reward model's score, regardless of whether those outputs are sensible. This typically leads to outputs that are grammatically correct but repetitive, flattering, or otherwise "gaming" the reward model.
The intuition behind the KL penalty is straightforward: we want the aligned model to behave similarly to the SFT model on most inputs, deviating only where the reward signal provides a clear reason to do so. Mathematically, the KL divergence measures the "distance" between two probability distributions: $$\text{KL}(\pi_\theta \| \pi_{\text{ref}}) = \sum_{y} \pi_\theta(y|x) \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$$ This quantity is always non-negative and equals zero only when $\pi_\theta = \pi_{\text{ref}}$. Two further properties matter in practice: the divergence is asymmetric, and because the expectation is taken under $\pi_\theta$, deviations on tokens the policy actually favors are penalized most heavily.
Worked example. Suppose the reference model assigns probability 0.3 to token "certainly" and the policy assigns probability 0.8. The per-token KL contribution is: $$0.8 \times \log\frac{0.8}{0.3} = 0.8 \times 0.98 = 0.78 \text{ nats}$$ With $\beta = 0.1$, this contributes a penalty of $0.078$ to the total reward. If the reward model gave the response a score of 1.5, the penalty reduces it to approximately 1.42. This shows how the KL penalty gently discourages large deviations at the token level.
The PPO training loop for language models follows these steps: sample a batch of prompts, generate responses with the current policy, score the responses with the reward model (adding the per-token KL penalty), estimate advantages with GAE using the value model, and update the policy and value model with the clipped PPO objective for a few epochs on the batch.
Computational cost. PPO requires maintaining four models simultaneously:
- The policy model $\pi_\theta$ (being trained)
- The reference model $\pi_{\text{ref}}$ (frozen SFT model)
- The reward model $r_\phi$ (frozen)
- The value model $V_\psi$ (being trained)
For a 7B parameter model, this means ~28B parameters in GPU memory.
Training instability. PPO is sensitive to hyperparameters. The reward scale, KL coefficient $\beta$, clip parameter $\epsilon$, learning rate, and batch size all interact in complex ways. Common failure modes include:
- Reward hacking (exploiting reward model weaknesses)
- KL explosion (policy diverges rapidly from reference)
- Training collapse (loss of diversity in outputs)
Reward model quality. The quality ceiling of RLHF is fundamentally limited by the reward model. If the reward model has systematic biases, the policy will exploit them. For practitioners who need RLHF with PPO, the TRL library provides a PPOTrainer class; an example appears in Section 25.5.8.
DPO (Rafailov et al., 2023) was motivated by the observation that the RLHF pipeline is complex, unstable, and computationally expensive. The key insight is that the optimal policy under the RLHF objective has a closed-form solution, and this solution can be used to derive a simpler training objective that directly optimizes preferences without an explicit reward model or RL.
Recall the RLHF objective: $$\max_\pi \; \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi(\cdot|x)} \left[ r(x, y) \right] - \beta \; \text{KL}(\pi \| \pi_{\text{ref}})$$ The optimal policy $\pi^*$ that maximizes this objective is: $$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$ where $Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$ is the partition function.
The DPO derivation is one of the most elegant results in modern machine learning. Let us trace each step carefully.
Step 1: Rearrange the optimal policy equation. Starting from the optimal policy: $$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$ Take the logarithm of both sides: $$\log \pi^*(y|x) = \log \pi_{\text{ref}}(y|x) + \frac{1}{\beta} r(x, y) - \log Z(x)$$ Rearrange to isolate the reward: $$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$ This expression tells us that the reward of any response $y$ can be recovered from the optimal policy and the reference policy, up to a prompt-dependent constant $\beta \log Z(x)$.
Step 2: Substitute into the Bradley-Terry model. The preference model says: $$P(y_w \succ y_l | x) = \sigma(r(x, y_w) - r(x, y_l))$$ Substituting our expression for $r$: $$P(y_w \succ y_l | x) = \sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \beta \log Z(x)\right)$$
Step 3: The partition function cancels. The $\beta \log Z(x)$ terms appear with opposite signs and cancel exactly: $$P(y_w \succ y_l | x) = \sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$$ This cancellation is the key insight: the intractable partition function $Z(x)$ drops out entirely. We never need to compute it. The preference probability depends only on the log-ratio of the policy and reference model probabilities, which are both computable.
This leads directly to the DPO loss function. We parameterize $\pi^*$ as $\pi_\theta$ and minimize the negative log-likelihood of the preferences: $$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right) \right]$$
This loss has an elegant interpretation. Define the implicit reward of a response: $$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$$ Then the DPO loss becomes: $$\mathcal{L}_{\text{DPO}} = -\mathbb{E} \left[ \log \sigma(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)) \right]$$ This is exactly the reward modeling loss, but with the reward parameterized implicitly through the policy.
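To make the objective concrete, here is a minimal sketch of the DPO loss in PyTorch, assuming you have already summed the per-token log-probabilities of each response under the policy and the frozen reference model. The function name and arguments are illustrative, not the TRL API.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # sum of log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # sum of log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # sum of log pi_ref(y_w | x), computed under no_grad
    ref_rejected_logps: torch.Tensor,     # sum of log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit rewards: beta times the policy/reference log-ratio
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry loss on the implicit reward margin
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
The reference log-probabilities are computed once per batch under torch.no_grad(), since the reference model stays frozen throughout training.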
The gradient of the DPO loss with respect to $\theta$ is: $$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta \mathbb{E} \left[ \underbrace{\sigma(\hat{r}_\theta(y_l) - \hat{r}_\theta(y_w))}_{\text{weight}} \left( \underbrace{\nabla_\theta \log \pi_\theta(y_w|x)}_{\text{increase } y_w \text{ probability}} - \underbrace{\nabla_\theta \log \pi_\theta(y_l|x)}_{\text{decrease } y_l \text{ probability}} \right) \right]$$ The weighting term $\sigma(\hat{r}_\theta(y_l) - \hat{r}_\theta(y_w))$ is large when the model currently assigns a higher implicit reward to the losing response than the winning response---precisely when the model most needs to be corrected. This adaptive weighting makes DPO training efficient. Intuitively, DPO spends its gradient budget where it matters most---on examples where the model currently disagrees with the human preferences. Examples where the model already assigns higher implicit reward to the preferred response contribute little gradient, since $\sigma(\hat{r}_w - \hat{r}_l)$ is close to 1 and the weighting term $\sigma(\hat{r}_l - \hat{r}_w)$ is close to 0. This is in contrast to standard supervised learning losses, which weight all examples equally regardless of how well the model already handles them. DPO's adaptive weighting is similar in spirit to hard-negative mining in contrastive learning or focal loss in object detection---the model focuses on the cases it finds most challenging. The $\beta$ parameter. Controls the strength of the KL constraint. Lower $\beta$ allows more deviation from the reference policy; higher $\beta$ keeps the policy closer. Typical values range from 0.1 to 0.5. Reference model. The reference policy is typically the SFT model. It must be kept frozen during DPO training and used to compute $\pi_{\text{ref}}(y|x)$ for both chosen and rejected responses. Data requirements. DPO requires preference pairs $(x, y_w, y_l)$. The quality and diversity of these pairs significantly affects the final model. Length bias. DPO can develop a bias toward longer or shorter responses depending on the preference data. Normalizing log-probabilities by sequence length can mitigate this: $$\hat{r}_\theta(x, y) = \frac{\beta}{|y|} \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$$ Label smoothing. Adding label smoothing to the DPO loss can prevent the model from becoming overconfident in its preferences: $$\mathcal{L}_{\text{DPO-smooth}} = -(1-\epsilon) \log \sigma(\hat{r}_w - \hat{r}_l) - \epsilon \log \sigma(\hat{r}_l - \hat{r}_w)$$ where $\epsilon \in [0, 0.1]$ is the smoothing factor. Label smoothing is especially useful when the preference data contains noise or when annotators disagree frequently. Preference data quality. The most important factor in DPO performance is the quality and diversity of preference pairs. In practice: ORPO (Hong et al., 2024) simplifies alignment further by eliminating the need for a reference model. It combines SFT and preference optimization in a single training stage. 
The ORPO loss combines a standard language modeling loss with an odds ratio penalty: $$\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}}(y_w) + \lambda \cdot \mathcal{L}_{\text{OR}}$$ where the odds ratio loss is: $$\mathcal{L}_{\text{OR}} = -\log \sigma \left( \log \frac{\text{odds}_\theta(y_w|x)}{\text{odds}_\theta(y_l|x)} \right)$$ and the odds are defined as: $$\text{odds}_\theta(y|x) = \frac{p_\theta(y|x)}{1 - p_\theta(y|x)}$$ The key advantage of ORPO is simplicity: it requires only a single model (no reference model), and it combines SFT and alignment in one training stage. The SFT component of the loss ensures the model learns to generate the preferred responses, while the odds ratio component ensures it learns to prefer those responses over the rejected ones. ORPO's practical appeal lies in its reduced infrastructure requirements. While DPO requires maintaining two models in memory (the policy and the frozen reference), and PPO requires four, ORPO needs only one. This makes it particularly attractive for teams with limited GPU resources. IPO (Azar et al., 2024) addresses a theoretical concern with DPO: the assumption that the Bradley-Terry model perfectly describes human preferences. IPO uses a different loss that does not rely on this assumption: $$\mathcal{L}_{\text{IPO}} = \left( \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \frac{1}{2\beta} \right)^2$$ This regression-style loss penalizes the model when the log-probability margin deviates from the target margin $\frac{1}{2\beta}$. KTO (Ethayarajh et al., 2024) works with unpaired preference data---responses labeled as simply "good" or "bad" without explicit pairwise comparisons. This is motivated by the observation that unpaired feedback is much easier to collect than pairwise preferences. The KTO loss is inspired by Kahneman and Tversky's prospect theory: $$\mathcal{L}_{\text{KTO}} = \mathbb{E}_{y_w} \left[ w(y_w) \cdot (1 - v_w) \right] + \mathbb{E}_{y_l} \left[ w(y_l) \cdot v_l \right]$$ where $v_w$ and $v_l$ are the values of chosen and rejected responses relative to the reference. SimPO (Meng et al., 2024) further simplifies DPO by using the average log probability as an implicit reward, eliminating the need for a reference model: $$\hat{r}_\theta(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y|x)$$ This results in the loss: $$\mathcal{L}_{\text{SimPO}} = -\log \sigma \left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w|x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l|x) - \gamma \right)$$ where $\gamma$ is a target margin that ensures the model maintains a minimum preference gap. Typical values of $\gamma$ range from 0.5 to 1.5, with higher values encouraging a larger gap between chosen and rejected responses. SimPO's key simplification is the elimination of the reference model. By using the average log probability as the reward (rather than the log-ratio with a reference), SimPO avoids the need to maintain a frozen copy of the initial model. This halves the memory requirement compared to DPO while maintaining competitive performance. The trend in alignment research is clearly toward simplicity: fewer models, fewer training stages, and less infrastructure. However, simplicity sometimes comes at the cost of flexibility. RLHF with PPO remains the most flexible approach because the reward model can be trained on arbitrary signals and updated independently of the policy. DPO and its variants sacrifice this flexibility for stability and simplicity. 
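As a concrete illustration of how lightweight these reference-free objectives are, here is a minimal sketch of the SimPO loss, following the formula above. The argument names and default values are illustrative assumptions, not the authors' reference implementation.
import torch
import torch.nn.functional as F

def simpo_loss(
    chosen_logps: torch.Tensor,      # sum of log pi_theta(y_w | x), shape (batch,)
    rejected_logps: torch.Tensor,    # sum of log pi_theta(y_l | x)
    chosen_lengths: torch.Tensor,    # |y_w| in tokens
    rejected_lengths: torch.Tensor,  # |y_l| in tokens
    beta: float = 2.0,
    gamma: float = 1.0,
) -> torch.Tensor:
    # Length-normalized implicit rewards -- no reference model required
    r_chosen = beta * chosen_logps / chosen_lengths
    r_rejected = beta * rejected_logps / rejected_lengths
    # Enforce a target margin gamma between chosen and rejected
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()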
Constitutional AI (CAI), introduced by Bai et al. (2022), addresses the scalability problem of human preference annotation. Instead of relying entirely on human annotators, CAI uses a set of principles---a "constitution"---to guide AI-generated feedback.
The pipeline has two stages. In Stage 1: Supervised Learning from AI Feedback (SL-CAI), the model critiques and revises its own responses according to the constitution (an example critique prompt appears in Section 25.8.2), and the revised responses are used for supervised fine-tuning. In Stage 2: RL from AI Feedback (RL-CAI), an AI judge guided by the constitution supplies the preference labels used for preference optimization, in place of human annotators.
A constitution typically includes principles like "The response should not help with illegal activities," along with principles covering helpfulness, honesty, and the avoidance of offensive content.
Advantages:
- Scalable: AI feedback is cheaper and faster than human feedback
- Consistent: AI applies principles more uniformly than diverse human annotators
- Transparent: The constitution makes the alignment criteria explicit
- Iterative: Easy to update principles without collecting new human data
Limitations:
- Dependent on model capability to understand and apply principles
- May miss subtle alignment issues that humans would catch
- Constitution design itself requires careful human judgment
- Can lead to over-cautious behavior if principles are too restrictive
A simplified implementation of the critique-and-revise loop from Stage 1 appears in Section 25.8.5. This loop can be applied iteratively with multiple principles, producing increasingly refined responses. The resulting (prompt, revised_response) pairs form the SFT dataset for Stage 1 of CAI.
Red teaming involves systematically attempting to elicit harmful, biased, or undesired outputs from a model. It is a critical component of alignment evaluation.
Manual red teaming. Human red teamers craft prompts designed to:
- Bypass safety filters (jailbreaking)
- Elicit harmful content (violence, self-harm, illegal activities)
- Reveal biases (gender, race, religion, political)
- Cause hallucination of false information
- Extract training data or private information
- Trigger inconsistent behavior Automated red teaming. Use another LLM to generate adversarial prompts: Several benchmarks evaluate model safety: TruthfulQA. Tests whether models generate truthful answers to questions where humans commonly have misconceptions. Measures both truthfulness and informativeness. BBQ (Bias Benchmark for QA). Tests for social biases across nine categories including age, gender, race, and disability status. RealToxicityPrompts. Measures the tendency of models to generate toxic text when given naturally occurring prompts with varying toxicity levels. HarmBench. A standardized benchmark for evaluating LLM safety against direct and indirect attacks. Attack success rate (ASR). The fraction of adversarial prompts that successfully elicit harmful content: $$\text{ASR} = \frac{\text{Number of successful attacks}}{\text{Total number of attack attempts}}$$ Over-refusal rate. The fraction of benign requests that the model incorrectly refuses: $$\text{Over-refusal rate} = \frac{\text{Benign requests refused}}{\text{Total benign requests}}$$ Toxicity score. Use a toxicity classifier (e.g., Perspective API) to measure the toxicity of model outputs. Bias metrics. Measure disparities in model behavior across demographic groups using metrics like demographic parity and equalized odds. Safety evaluation should be iterative: A systematic red-teaming process includes: Phase 1: Taxonomy development. Create a comprehensive taxonomy of potential harms, organized by category (violence, deception, bias, privacy, illegal activity) and severity (low, medium, high, critical). This taxonomy guides the red team's efforts and ensures broad coverage. Phase 2: Attack vector exploration. For each harm category, develop multiple attack vectors: Phase 3: Systematic testing. Each red-team member generates a set of test prompts covering their assigned categories and attack vectors. The prompts are run against the model, and the outputs are evaluated for harmfulness, policy violations, and quality of refusals. Phase 4: Analysis and reporting. Aggregate results to identify patterns: which categories are most vulnerable? Which attack vectors are most successful? What does a typical failure look like? This analysis guides the next iteration of alignment training. Phase 5: Targeted data collection. For each identified vulnerability, create preference pairs where the "rejected" response is the harmful output and the "chosen" response is a safe, helpful refusal. This targeted data is used in the next round of alignment training. Anthropic's HHH framework (Helpful, Honest, Harmless) provides a structured way to evaluate alignment quality across three dimensions: Each dimension can be measured independently, allowing fine-grained analysis of alignment quality. A model that is maximally helpful but not harmless (answering harmful requests without hesitation) is misaligned. A model that is maximally harmless but not helpful (refusing almost all requests) is also misaligned. The goal is to find the right balance across all three dimensions. The quality of alignment critically depends on the quality of preference data. Several strategies are used: Pairwise comparison. Show annotators two responses and ask which is better. This is the simplest and most common approach. Ranked preferences. Show annotators $k$ responses and ask them to rank all of them. This provides $\binom{k}{2}$ pairwise comparisons per annotation. Likert scale rating. Ask annotators to rate each response on a numerical scale (e.g., 1-5). 
Pairwise preferences are derived from rating differences.
Attribute-based feedback. Ask annotators to rate specific attributes (helpfulness, accuracy, safety, relevance) separately. This provides richer signal but is more expensive.
Clear, detailed annotation guidelines are essential.
Inter-annotator agreement. Measure agreement using Cohen's kappa or Fleiss' kappa: $$\kappa = \frac{p_o - p_e}{1 - p_e}$$ where:
- $p_o$ is the observed agreement between annotators
- $p_e$ is the expected agreement by chance
A kappa of 0.6-0.8 is considered substantial agreement. For alignment preference data, kappa values of 0.5-0.7 are typical, reflecting the inherent subjectivity of quality judgments. If kappa is below 0.4, the task definition or annotation guidelines likely need revision---the annotators are not sufficiently aligned on what constitutes a "better" response.
Consensus filtering. Only include preferences where multiple annotators agree. This reduces noise at the cost of data volume.
Annotator modeling. Some annotators are more reliable than others. Weight their contributions accordingly or use the Dawid-Skene model to estimate true labels from noisy annotations.
How much preference data do you need? Empirical results suggest the following rough economics: the cost of human annotation ranges from $0.50 to $5.00 per preference pair, depending on task complexity and annotator expertise. For a 10K-pair dataset, this translates to $5,000-$50,000---a significant but often justifiable investment for production systems. Synthetic preference data (generated by AI) can supplement human annotations at much lower cost, but should be validated against human preferences to ensure quality. A common approach is to use synthetic data for 80% of the training set and reserve high-quality human annotations for the most important or ambiguous cases.
Increasingly, preference data is generated synthetically:
Self-play. Generate responses at different temperatures or with different system prompts, then rank them using a strong model.
Best-of-N. Generate $N$ responses per prompt, use a reward model or heuristic to select the best and worst, creating preference pairs.
AI feedback. Use a strong model (e.g., GPT-4) to compare responses and generate preferences. This approach is central to Constitutional AI, as discussed in Section 25.8.
AI-generated preferences are cheaper and faster than human annotations, but they may miss nuanced quality differences and can amplify biases present in the judge model. A practical compromise is to use AI feedback for initial data collection and reserve human annotation for quality-critical subsets and validation.
The TRL library provides a DPOTrainer class that handles the DPO training loop; Section 25.11 walks through the details, and a single preference record looks like the sketch below.
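To make the storage format concrete, here is a sketch of one preference record, using the prompt/chosen/rejected field convention that TRL expects (the example strings mirror the illustrative data in Section 25.11.3).
preference_record = {
    "prompt": "Explain quantum computing simply.",
    "chosen": "Quantum computing uses quantum bits...",
    "rejected": "Quantum computing is very complex...",
}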
Section 25.11.3 presents a complete, working example of DPO training using the TRL library. The DPOTrainer automatically handles:
- Creating a frozen copy of the initial model as the reference policy
- Computing log-probabilities for both the policy and reference models
- Implementing the DPO loss function
- Logging implicit rewards, margins, and accuracy metrics
The typical DPO training procedure, its key hyperparameters, and the metrics to monitor during training are covered in Section 25.11.
Based on current best practices, here is a practical alignment pipeline:
1. Start with a strong base model: Llama 3, Mistral, or similar. The base model determines the ceiling of what alignment can achieve---you cannot align capabilities that the model does not possess.
2. SFT on high-quality data: 10K-100K instruction-response pairs, 1-3 epochs. Use a diverse mix of tasks as discussed in Chapter 24. The SFT stage brings the model into the "right neighborhood" of behavior.
3. Collect preference data: 10K-50K preference pairs, covering diverse prompts. Include both "easy" pairs (clearly better/worse) and "hard" pairs (both good, but one slightly better). Hard pairs provide the most signal for alignment.
4. DPO training: 1-2 epochs with $\beta = 0.1$-$0.3$. Use a lower learning rate than SFT ($5 \times 10^{-6}$ to $5 \times 10^{-7}$). Monitor the reward margin and accuracy throughout training.
5. Safety evaluation: Red teaming + safety benchmarks. Use both automated benchmarks (TruthfulQA, BBQ) and manual red-teaming. Aim for a low attack success rate (<5%) and a low over-refusal rate (<10%).
6. Iterate: Address failure modes with targeted data collection. The most impactful improvement usually comes from collecting preference data specifically for the failure modes identified during evaluation.
Effective monitoring throughout the alignment pipeline requires tracking different metrics at each stage:
During SFT:
- Training loss (should decrease smoothly)
- Validation loss (should decrease; divergence from training loss indicates overfitting)
- Response quality samples (manual inspection of generated responses)
During DPO/RLHF:
- Chosen reward: $\hat{r}(y_w)$ should increase
- Rejected reward: $\hat{r}(y_l)$ should decrease
- Reward margin: $\hat{r}(y_w) - \hat{r}(y_l)$ should increase, but not explode
- KL divergence from reference: should remain bounded (typically < 10 nats)
- Accuracy: fraction of pairs correctly ordered should increase toward 1.0
- Response length: monitor for length gaming (responses getting longer without becoming better)
After alignment:
- Safety benchmarks: TruthfulQA, HarmBench, BBQ
- Capability benchmarks: MMLU, HumanEval, GSM8K (to detect capability regression)
- User satisfaction: if in production, track thumbs up/down ratings A common failure pattern is "alignment collapse," where the model becomes overly cautious and refuses reasonable requests. Monitor the over-refusal rate on a set of benign prompts to detect this early. The choice also depends on practical constraints. If you have a single 24GB GPU, ORPO or SimPO is the most feasible option because they require only one model in memory. If you have a cluster of GPUs and experienced ML engineers, RLHF with PPO provides the most control and flexibility. For most teams starting with alignment, DPO is the recommended starting point: it is well-understood, well-supported in libraries, and produces results competitive with PPO at a fraction of the complexity. Alignment transforms capable language models into reliable, helpful assistants. In this chapter, we covered: The journey from a raw pre-trained model to an aligned assistant involves multiple stages of refinement, each building on the previous one. The field continues to evolve rapidly, with new alignment methods offering better efficiency, stability, and performance. Despite rapid progress, several fundamental challenges in alignment remain unresolved: As models become more capable, it becomes harder for humans to evaluate whether model outputs are correct and aligned. This is the scalable oversight problem, first articulated by Amodei et al. (2016). If a model generates a sophisticated mathematical proof, a complex piece of code, or a nuanced legal argument, how do we verify that it is correct without ourselves being experts in that domain? As models become superhuman at certain tasks, the gap between model capability and human evaluator capability will widen, making alignment increasingly difficult. Current approaches include: Current reward models are brittle. They can be "hacked" by policies that learn to produce outputs with high reward scores that do not actually represent good responses (as discussed in Section 25.4.7). Improving reward model robustness---through better training data, adversarial training, or ensemble reward models---is an active research area. Alignment can be fragile. Fine-tuning an aligned model on new data can undo alignment (alignment forgetting). Minor changes to the system prompt can cause aligned models to behave in unaligned ways. Developing methods that produce robust, stable alignment that persists across downstream uses is an important open problem. Current alignment techniques are better at making models sound confident than at making them truthful. A model that has been trained with RLHF may learn to express uncertainty only when the human-annotated data demonstrates uncertainty, not when the model itself is genuinely uncertain. This disconnect between expressed confidence and actual reliability is a fundamental challenge, closely related to the broader problem of hallucination in language models. Whose preferences should the model be aligned to? Different users, cultures, and organizations have different values and preferences. A model aligned to the preferences of one group may be misaligned for another. Techniques for personalizable alignment---where the model can adapt to different preference profiles while maintaining core safety constraints---are an emerging area of research. Our ability to build alignment techniques has outpaced our ability to evaluate them. 
We can measure toxicity and bias with automated classifiers, but we lack reliable metrics for more subtle alignment properties like truthfulness, manipulation resistance, and genuine helpfulness. Developing comprehensive, reliable alignment benchmarks remains a critical need.
25.3.3 Why SFT Alone Is Insufficient
25.3.4 The SFT-to-RLHF Transition: An Intuition
25.4 Reward Modeling
25.4.1 The Role of the Reward Model
25.4.2 Preference Data Collection
25.4.3 The Bradley-Terry Model
25.4.4 Training Objective
25.4.5 Architecture
25.4.6 Training the Reward Model: Step by Step
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class RewardModel(nn.Module):
    """A reward model built from a pre-trained language model.

    Produces a scalar reward for each (prompt, response) pair.
    """

    def __init__(self, base_model_name: str) -> None:
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(
            base_model_name
        ).model  # Get the transformer without LM head
        hidden_size = self.backbone.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
    ) -> torch.Tensor:
        """Compute reward for a batch of sequences.

        Args:
            input_ids: Token IDs of shape (batch, seq_len).
            attention_mask: Attention mask of shape (batch, seq_len).

        Returns:
            Scalar rewards of shape (batch,).
        """
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        hidden_states = outputs.last_hidden_state
        # Use the last token's hidden state as the sequence representation
        # (find the last non-padding token for each sequence)
        sequence_lengths = attention_mask.sum(dim=1) - 1
        last_hidden = hidden_states[
            torch.arange(hidden_states.size(0)),
            sequence_lengths,
        ]
        rewards = self.reward_head(last_hidden).squeeze(-1)
        return rewards

def reward_model_loss(
    model: RewardModel,
    chosen_ids: torch.Tensor,
    chosen_mask: torch.Tensor,
    rejected_ids: torch.Tensor,
    rejected_mask: torch.Tensor,
) -> torch.Tensor:
    """Compute the Bradley-Terry preference loss.

    Args:
        model: The reward model.
        chosen_ids: Token IDs for preferred responses.
        chosen_mask: Attention mask for preferred responses.
        rejected_ids: Token IDs for rejected responses.
        rejected_mask: Attention mask for rejected responses.

    Returns:
        Scalar loss tensor.
    """
    r_chosen = model(chosen_ids, chosen_mask)
    r_rejected = model(rejected_ids, rejected_mask)
    # Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected))
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    return loss
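A minimal training-step sketch for this reward model might look as follows. The tokenization, batching, and dataset handling are assumed; pairs_loader is a hypothetical DataLoader that yields tokenized chosen/rejected pairs.
from torch.optim import AdamW

model = RewardModel("your-sft-model")
optimizer = AdamW(model.parameters(), lr=1e-5)

for batch in pairs_loader:  # hypothetical DataLoader of tokenized preference pairs
    loss = reward_model_loss(
        model,
        batch["chosen_input_ids"],
        batch["chosen_attention_mask"],
        batch["rejected_input_ids"],
        batch["rejected_attention_mask"],
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()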
25.4.7 Reward Model Challenges
25.5 PPO for Language Models
25.5.1 From Reward to Policy Optimization
25.5.2 KL Divergence in the Language Model Setting
25.5.3 Proximal Policy Optimization (PPO)
25.5.4 Advantage Estimation
25.5.5 Understanding the KL Divergence Penalty
25.5.6 PPO Training Loop for LLMs
25.5.7 Practical Challenges of PPO for LLMs
25.5.8 PPO Implementation with TRL
The PPOTrainer class is used as follows:
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
# Configuration
ppo_config = PPOConfig(
model_name="your-sft-model",
learning_rate=1.41e-5,
batch_size=64,
mini_batch_size=4,
gradient_accumulation_steps=1,
ppo_epochs=4, # PPO update epochs per batch
kl_penalty="kl", # KL penalty type
init_kl_coeff=0.2, # Initial beta
adap_kl_ctrl=True, # Adaptive KL control
target_kl=6.0, # Target KL divergence
)
# Load model with value head (policy + value function)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
"your-sft-model"
)
# Load frozen reference model
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(
"your-sft-model"
)
# Initialize PPO trainer
ppo_trainer = PPOTrainer(
config=ppo_config,
model=model,
ref_model=ref_model,
tokenizer=tokenizer,
)
# Training loop (simplified)
for batch in dataloader:
    # Generate responses
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors)
    # Compute rewards using the reward model
    rewards = [reward_model(q, r) for q, r in zip(query_tensors, response_tensors)]
    # PPO update
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    print(f"KL: {stats['objective/kl']:.3f}, "
          f"Reward: {stats['ppo/returns/mean']:.3f}")
The adap_kl_ctrl=True setting enables adaptive KL control, which automatically adjusts $\beta$ during training to keep the KL divergence near the target value. This is a practical necessity because the optimal $\beta$ changes during training as the policy evolves.
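The idea behind adaptive KL control can be sketched as a simple proportional controller; this is an illustrative simplification, not TRL's internal implementation.
def update_kl_coeff(beta: float, observed_kl: float, target_kl: float,
                    step_size: float = 0.1) -> float:
    """Nudge beta up when the policy drifts past the target KL, down when it stays well below."""
    proportional_error = max(min(observed_kl / target_kl - 1.0, 0.2), -0.2)
    return beta * (1.0 + step_size * proportional_error)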
25.6 Direct Preference Optimization (DPO)
25.6.1 Motivation
25.6.2 The RLHF Objective Revisited
25.6.3 The DPO Derivation
25.6.4 The DPO Loss Function
25.6.5 Gradient Analysis
25.6.6 DPO vs. RLHF Comparison
| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Requires reward model | Yes (separate training) | No (implicit) |
| Requires RL algorithm | Yes (PPO) | No (supervised) |
| Models in memory | 4 (policy, ref, RM, value) | 2 (policy, ref) |
| Training stability | Sensitive to hyperparameters | More stable |
| Computational cost | High | Moderate |
| Hyperparameters | Many (lr, KL coeff, clip, etc.) | Few (lr, $\beta$) |
| Performance | Strong | Competitive |
| Theory | Well-established | Clean closed-form solution |
25.6.7 Practical Considerations for DPO
25.7 ORPO and Other Alignment Methods
25.7.1 ORPO: Odds Ratio Preference Optimization
25.7.2 IPO: Identity Preference Optimization
25.7.3 KTO: Kahneman-Tversky Optimization
25.7.4 SimPO: Simple Preference Optimization
25.7.5 Summary of Alignment Methods
| Method | Models Required | Needs Reference | Needs RM | Training Stages | Complexity |
|---|---|---|---|---|---|
| RLHF (PPO) | 4 | Yes | Yes | 3 (SFT + RM + RL) | High |
| DPO | 2 | Yes | No | 2 (SFT + DPO) | Medium |
| IPO | 2 | Yes | No | 2 (SFT + IPO) | Medium |
| ORPO | 1 | No | No | 1 (combined) | Low |
| KTO | 2 | Yes | No | 2 (SFT + KTO) | Medium |
| SimPO | 1 | No | No | 2 (SFT + SimPO) | Low |
25.8 Constitutional AI
25.8.1 Motivation
25.8.2 The CAI Pipeline
Critique the following response according to this principle:
"The response should not help with illegal activities."
Response: [model's original response]
Critique: [model generates critique]
Revision: Based on the critique, here is a revised response:
[model generates revised response]
25.8.3 Constitutional Principles
25.8.4 Advantages and Limitations
25.8.5 Implementing Constitutional AI: A Simplified Example
def constitutional_revision(
    model,
    tokenizer,
    harmful_prompt: str,
    initial_response: str,
    principle: str,
) -> str:
    """Apply one round of constitutional critique and revision.

    Args:
        model: The language model.
        tokenizer: The tokenizer.
        harmful_prompt: The original (potentially harmful) prompt.
        initial_response: The model's initial response.
        principle: The constitutional principle to apply.

    Returns:
        The revised response.
    """
    # Step 1: Critique
    critique_prompt = (
        f"Consider the following response to a user request.\n\n"
        f"User request: {harmful_prompt}\n"
        f"Response: {initial_response}\n\n"
        f"Critique this response according to the following "
        f"principle: '{principle}'\n\n"
        f"Critique:"
    )
    critique = generate(model, tokenizer, critique_prompt)
    # Step 2: Revise
    revision_prompt = (
        f"User request: {harmful_prompt}\n"
        f"Initial response: {initial_response}\n"
        f"Critique: {critique}\n\n"
        f"Based on the critique, write an improved response "
        f"that addresses the issues raised while remaining "
        f"helpful.\n\n"
        f"Revised response:"
    )
    revised = generate(model, tokenizer, revision_prompt)
    return revised
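The constitutional_revision function above assumes a generate helper that is not shown. A minimal version using the Hugging Face generation API might look like this; the sampling parameters are illustrative.
import torch

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> str:
    """Generate a completion for a plain-text prompt and return only the new text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
        )
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0, prompt_len:], skip_special_tokens=True)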
25.9 Red Teaming and Safety Evaluation
25.9.1 Red Teaming
Generate 20 diverse prompts that might cause a language model to
produce harmful output. The prompts should cover different attack
vectors: direct requests, role-playing scenarios, hypothetical
framing, multi-step manipulation, and encoded instructions.
25.9.2 Safety Benchmarks
25.9.3 Evaluation Metrics
25.9.4 Iterative Safety Improvement
25.9.5 Red-Teaming Methodology in Practice
25.9.6 The HHH Framework
25.10 Preference Data Collection
25.10.1 Data Collection Strategies
25.10.2 Annotation Guidelines
25.10.3 Quality Control
25.10.4 Practical Guidance on Data Volume
25.10.5 Synthetic Preference Data
25.11 Implementing DPO with TRL
25.11.1 Overview
The TRL library provides a DPOTrainer class that handles the DPO training loop, including reference model management, loss computation, and logging.
25.11.2 Data Format
DPO requires data with three fields:
- prompt: The input prompt
- chosen: The preferred response
- rejected: The dispreferred response
25.11.3 Complete DPO Training Example
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import DPOTrainer
# 1. Load the SFT model (serves as both policy and reference)
model_name = "your-sft-model"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# The reference model is loaded automatically by DPOTrainer
# (a frozen copy of the initial model)
# 2. Load preference dataset
# Each example needs: prompt, chosen, rejected
dataset = load_dataset("your-preference-data", split="train")
# Example data format:
# {
# "prompt": "Explain quantum computing simply.",
# "chosen": "Quantum computing uses quantum bits...",
# "rejected": "Quantum computing is very complex..."
# }
# 3. Configure LoRA (optional but recommended)
peft_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
# 4. Configure training
training_args = TrainingArguments(
output_dir="./dpo-model",
num_train_epochs=1,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=5e-6, # Lower than SFT
warmup_ratio=0.1,
lr_scheduler_type="cosine",
logging_steps=10,
save_strategy="epoch",
bf16=True,
remove_unused_columns=False,
)
# 5. Initialize DPO trainer
dpo_trainer = DPOTrainer(
model=model,
args=training_args,
beta=0.1, # KL regularization strength
train_dataset=dataset,
tokenizer=tokenizer,
peft_config=peft_config,
max_length=1024,
max_prompt_length=512,
)
# 6. Train
dpo_trainer.train()
# 7. Save
dpo_trainer.save_model("./dpo-model/final")
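After training, the saved model can be loaded for a quick qualitative check. Here is a minimal sketch, assuming the LoRA configuration above was used and that peft's AutoPeftModelForCausalLM is available; the tokenizer is reloaded from the original SFT checkpoint.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

dpo_model = AutoPeftModelForCausalLM.from_pretrained(
    "./dpo-model/final",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-sft-model")

prompt = "Explain quantum computing simply."
inputs = tokenizer(prompt, return_tensors="pt").to(dpo_model.device)
output = dpo_model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))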
25.11.4 Key Hyperparameters
| Hyperparameter | Typical Range | Effect |
|---|---|---|
| $\beta$ | 0.1 -- 0.5 | KL regularization strength |
| Learning rate | 1e-6 -- 5e-6 | Update magnitude (lower than SFT) |
| Batch size | 4 -- 16 | Per-device batch size |
| Epochs | 1 -- 3 | Risk of overfitting with more |
| Max sequence length | 512 -- 2048 | Depends on data |
| Warmup ratio | 0.1 | Fraction of steps |
| Label smoothing | 0.0 -- 0.1 | Prevents overconfident preferences |
25.11.5 Monitoring Training
25.12 Putting It All Together: The Modern Alignment Pipeline
25.12.1 A Practical Alignment Recipe
25.12.2 Monitoring the Alignment Pipeline
25.12.3 When to Use Which Method
25.13 Summary
25.14 Open Challenges and Future Directions
25.14.1 Scalable Oversight
25.14.2 Reward Model Robustness
25.14.3 Alignment Stability
25.14.4 Truthfulness and Hallucination
25.14.5 Multi-Stakeholder Alignment
25.14.6 Evaluation Gaps
References