41 min read

A language model trained to predict the next token is not inherently aligned with human values, preferences, or intentions. It will generate text that is statistically likely given the training data---which includes misinformation, harmful content...

Chapter 25: Alignment: RLHF, DPO, and Beyond

Part IV: Attention, Transformers, and Language Models


25.1 Introduction: The Alignment Problem

A language model trained to predict the next token is not inherently aligned with human values, preferences, or intentions. It will generate text that is statistically likely given the training data---which includes misinformation, harmful content, biased reasoning, and unhelpful patterns alongside useful, accurate, and safe text. The alignment problem is the challenge of steering model behavior toward outputs that are helpful, honest, and harmless, while maintaining the model's broad capabilities.

This chapter covers the theory and practice of alignment techniques that transform a capable but uncontrolled language model into one that reliably follows instructions, provides helpful responses, and avoids harmful outputs. We trace the full alignment pipeline: from supervised fine-tuning (SFT) through reward modeling and reinforcement learning from human feedback (RLHF), to more recent approaches like Direct Preference Optimization (DPO) and beyond.

The alignment landscape has evolved rapidly. OpenAI's InstructGPT (Ouyang et al., 2022) demonstrated the power of RLHF for producing helpful, truthful assistants. Anthropic's Constitutional AI (Bai et al., 2022) showed that AI feedback could partially replace human feedback. Rafailov et al. (2023) simplified the pipeline dramatically with DPO, eliminating the need for a separate reward model. Understanding these methods---their assumptions, trade-offs, and practical implementation---is essential for any engineer building production AI systems.

What You Will Learn

By the end of this chapter, you will be able to:

  • Explain the alignment problem and why pre-training alone is insufficient
  • Design and train a reward model from human preference data
  • Implement the full RLHF pipeline: SFT, reward modeling, and PPO
  • Derive and implement Direct Preference Optimization (DPO)
  • Compare RLHF, DPO, ORPO, and other alignment methods
  • Apply Constitutional AI principles to reduce reliance on human annotation
  • Conduct red-teaming and safety evaluation of aligned models
  • Collect and curate preference data for alignment training
  • Use the TRL library to implement DPO in practice

Prerequisites

This chapter assumes familiarity with:

  • Transformer architecture and attention (Chapter 20)
  • Language model training and fine-tuning (Chapters 21, 24)
  • Reinforcement learning basics (policy, reward, value functions)
  • Prompt engineering concepts (Chapter 23)

25.2 Why Alignment Matters

25.2.1 The Gap Between Capability and Intent

Pre-training produces models with remarkable capabilities---reasoning, coding, creative writing, analysis---but without reliable control over when and how these capabilities are deployed. Consider a model that can:

  • Generate both truthful medical advice and convincing medical misinformation
  • Write both helpful code and malicious exploits
  • Produce both balanced analysis and manipulative persuasion

Without alignment, the model treats all these outputs as equally valid completions. The goal of alignment is to create a reliable mapping from human intent to model behavior.

25.2.2 Dimensions of Alignment

Alignment encompasses several dimensions:

Helpfulness. The model should make a genuine effort to assist the user, providing accurate, relevant, and complete responses.

Honesty. The model should be truthful, express uncertainty when appropriate, and avoid fabricating information (hallucination).

Harmlessness. The model should refuse to assist with harmful activities, avoid generating offensive content, and minimize potential negative impacts.

Instruction following. The model should understand and faithfully execute user instructions, including constraints on format, style, and content.

These dimensions sometimes conflict. A maximally helpful model might provide dangerous information; a maximally safe model might refuse benign requests. Alignment techniques must navigate these trade-offs, and different deployment contexts demand different balances. A medical AI assistant may prioritize accuracy and honesty above all else, while a creative writing assistant may prioritize helpfulness and flexibility.

25.2.3 A Brief History of Alignment

The idea of aligning AI systems with human values predates large language models. The field has roots in:

  • Reward shaping in RL (Ng et al., 1999): The observation that specifying the right reward function is crucial---and difficult---for training agents that behave as intended.
  • Value alignment (Russell, 2019): The philosophical and technical challenge of ensuring AI systems pursue objectives that are truly aligned with human values, rather than proxies that may diverge in unexpected ways.
  • RLHF for summarization (Stiennon et al., 2020): An early demonstration that reinforcement learning from human feedback could train language models to produce better summaries than those generated by supervised learning alone.
  • InstructGPT (Ouyang et al., 2022): The landmark paper that applied the full RLHF pipeline (SFT + RM + PPO) to GPT-3, producing a model that was significantly preferred by human evaluators. InstructGPT demonstrated that alignment could dramatically improve usefulness with relatively modest additional training cost.

The success of InstructGPT and its successor ChatGPT catalyzed a rapid expansion of alignment research, producing the methods we cover in this chapter.

25.2.4 The Alignment Tax

Alignment typically reduces raw performance on some benchmarks. A model that refuses to generate harmful content will score lower on benchmarks that include harmful content generation. This "alignment tax" is generally considered acceptable, but excessive alignment can make the model overly cautious, refusing reasonable requests---a phenomenon known as "over-refusal."

Quantifying the alignment tax is important for making informed decisions. Typically, well-aligned models show less than 5% degradation on standard capability benchmarks (MMLU, HumanEval, GSM8K) while showing dramatic improvements on helpfulness and safety evaluations. If you observe more than 10% capability regression after alignment, this suggests that the alignment process was too aggressive---the learning rate was too high, the KL penalty was too low, or the preference data was too narrow. In such cases, reducing the alignment intensity or mixing in general-purpose training data can help recover lost capability, as we discussed in the context of catastrophic forgetting in Chapter 24.


25.3 The RLHF Pipeline

25.3.1 Overview

The standard RLHF pipeline consists of three stages:

  1. Supervised Fine-Tuning (SFT): Fine-tune the base model on high-quality demonstration data
  2. Reward Modeling (RM): Train a reward model to predict human preferences
  3. Reinforcement Learning (RL): Optimize the SFT model using the reward model via PPO

Each stage builds on the previous one, progressively refining the model's alignment.

25.3.2 Stage 1: Supervised Fine-Tuning

The SFT stage (covered in detail in Chapter 24) fine-tunes the pre-trained model on a curated dataset of high-quality demonstrations. This brings the model into the "right neighborhood" of behavior, making the subsequent RL stage more efficient.

The SFT dataset consists of prompt-response pairs $(x_i, y_i)$ where $y_i$ represents a high-quality, aligned response. The training objective is standard language modeling:

$$\mathcal{L}_{\text{SFT}} = -\mathbb{E}_{(x,y) \sim \mathcal{D}_{\text{SFT}}} \left[ \sum_{t=1}^{|y|} \log \pi_{\text{SFT}}(y_t \mid x, y_{

The resulting model $\pi_{\text{SFT}}$ serves as both the starting point for RL training and the reference policy for KL regularization.

25.3.3 Why SFT Alone Is Insufficient

SFT learns to mimic demonstrations but has fundamental limitations:

  1. Imitation, not optimization: SFT learns to imitate the average quality of demonstrations, not to produce optimal responses
  2. Mode averaging: When demonstrations have variable quality, SFT averages over all modes, potentially producing mediocre outputs
  3. Limited generalization: SFT generalizes poorly to prompts that differ significantly from the training distribution
  4. No preference signal: SFT treats all demonstrations as equally good; it cannot learn that some responses are better than others

RLHF addresses these limitations by providing a preference signal that guides the model toward better responses, even beyond the quality of the demonstrations.

25.3.4 The SFT-to-RLHF Transition: An Intuition

To build intuition for why RLHF improves upon SFT, consider an analogy. Imagine teaching someone to cook by providing recipes (SFT) versus providing recipes plus having them taste-test their results and get feedback (RLHF).

With SFT alone, the model learns: "When asked to write a poem, produce text that looks like the demonstration poems." It imitates the average quality of the demonstrations, including any idiosyncrasies or limitations in the demonstration data.

With RLHF, the model learns: "When asked to write a poem, produce the poem that a discerning reader would most prefer." This preference-based objective allows the model to discover responses that are better than any individual demonstration, because it optimizes a quality signal rather than imitating a fixed dataset.

Mathematically, SFT maximizes $\log p(y_{\text{demo}} | x)$ (probability of the demonstration), while RLHF maximizes $\mathbb{E}[r(x, y)]$ (expected reward). The reward model captures a richer signal than any fixed set of demonstrations because it generalizes from the preference data to unseen prompt-response combinations.


25.4 Reward Modeling

25.4.1 The Role of the Reward Model

The reward model (RM) serves as a learned proxy for human judgment. Given a prompt $x$ and a response $y$, the reward model produces a scalar score:

$$r_\phi(x, y) \in \mathbb{R}$$

where higher scores indicate more preferred responses. This model encodes human preferences into a differentiable function that can be used for optimization.

25.4.2 Preference Data Collection

Reward models are trained on comparison data. For each prompt $x$, human annotators are shown two (or more) model responses and asked to indicate which they prefer:

$$(x, y_w, y_l) \quad \text{where } y_w \succ y_l \text{ (}y_w \text{ is preferred over } y_l\text{)}$$

The annotator's task is easier than writing a good response from scratch---comparing two texts is cognitively simpler than generating text. This is a key insight: it is easier for humans to express preferences than to demonstrate optimal behavior.

25.4.3 The Bradley-Terry Model

Preferences are modeled using the Bradley-Terry model, which assumes the probability that response $y_1$ is preferred over $y_2$ is:

$$P(y_1 \succ y_2 \mid x) = \sigma(r_\phi(x, y_1) - r_\phi(x, y_2))$$

where $\sigma$ is the sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$.

This model assumes that preferences are determined by the difference in reward scores and that the preference probability follows a logistic function.

25.4.4 Training Objective

The reward model is trained to minimize the negative log-likelihood of the observed preferences:

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{pref}}} \left[ \log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l)) \right]$$

This is equivalent to binary cross-entropy loss where the label is always 1 (the preferred response should receive a higher reward). The loss is minimized when $r_\phi(x, y_w) \gg r_\phi(x, y_l)$, driving the reward model to assign clearly distinct scores to preferred and rejected responses. In practice, training converges when the reward model correctly orders approximately 70-80% of held-out preference pairs, which reflects the noise ceiling of human annotation.

25.4.5 Architecture

The reward model is typically initialized from the SFT model with the language modeling head replaced by a scalar output head:

$$r_\phi(x, y) = \text{Linear}(\text{Transformer}(x, y)_{\text{last\_token}})$$

The final hidden state of the last token is projected to a scalar reward. Using the same architecture as the policy model ensures the reward model can understand the same features and representations. Initializing from the SFT checkpoint (rather than from the base pre-trained model) also provides a better starting point, since the SFT model has already learned to process instruction-response pairs.

25.4.6 Training the Reward Model: Step by Step

Let us walk through the reward model training process in detail.

Step 1: Data collection. For each of $N$ prompts, generate $K$ responses from the SFT model (typically $K = 2$ to $4$). Human annotators rank or compare the responses, producing preference pairs $(y_w, y_l)$.

Step 2: Architecture setup. Initialize the reward model from the SFT checkpoint. Replace the language modeling head (which maps to vocabulary logits) with a scalar head:

import torch.nn as nn
from transformers import AutoModelForCausalLM

class RewardModel(nn.Module):
    """A reward model built from a pre-trained language model.

    Produces a scalar reward for each (prompt, response) pair.
    """

    def __init__(self, base_model_name: str) -> None:
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(
            base_model_name
        ).model  # Get the transformer without LM head
        hidden_size = self.backbone.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
    ) -> torch.Tensor:
        """Compute reward for a batch of sequences.

        Args:
            input_ids: Token IDs of shape (batch, seq_len).
            attention_mask: Attention mask of shape (batch, seq_len).

        Returns:
            Scalar rewards of shape (batch,).
        """
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        hidden_states = outputs.last_hidden_state

        # Use the last token's hidden state as the sequence representation
        # (find the last non-padding token for each sequence)
        sequence_lengths = attention_mask.sum(dim=1) - 1
        last_hidden = hidden_states[
            torch.arange(hidden_states.size(0)),
            sequence_lengths,
        ]

        rewards = self.reward_head(last_hidden).squeeze(-1)
        return rewards

Step 3: Training loop. For each batch of preference pairs, compute the reward for both the chosen and rejected responses, and minimize the Bradley-Terry loss:

def reward_model_loss(
    model: RewardModel,
    chosen_ids: torch.Tensor,
    chosen_mask: torch.Tensor,
    rejected_ids: torch.Tensor,
    rejected_mask: torch.Tensor,
) -> torch.Tensor:
    """Compute the Bradley-Terry preference loss.

    Args:
        model: The reward model.
        chosen_ids: Token IDs for preferred responses.
        chosen_mask: Attention mask for preferred responses.
        rejected_ids: Token IDs for rejected responses.
        rejected_mask: Attention mask for rejected responses.

    Returns:
        Scalar loss tensor.
    """
    r_chosen = model(chosen_ids, chosen_mask)
    r_rejected = model(rejected_ids, rejected_mask)

    # Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected))
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    return loss

Step 4: Validation. Evaluate the reward model's accuracy on held-out preference pairs---does the reward model assign higher scores to the human-preferred responses? A well-trained reward model achieves 70-80% accuracy on held-out data (recall that human inter-annotator agreement is typically 60-80%, so the reward model approaches the ceiling of human consistency).

25.4.7 Reward Model Challenges

Reward hacking. The policy may find responses that receive high reward scores without being genuinely good. For example, the model might learn that longer responses tend to receive higher rewards and produce unnecessarily verbose outputs. More subtle forms of reward hacking include the model producing responses that are confident-sounding but factually incorrect, or responses that use flattery to game the reward model.

Distribution shift. The reward model is trained on outputs from $\pi_{\text{SFT}}$, but during RL training, the policy $\pi_\theta$ diverges from $\pi_{\text{SFT}}$, producing responses outside the reward model's training distribution. When the reward model encounters out-of-distribution responses, its predictions become unreliable, and the policy can exploit these unreliable predictions.

Annotation noise. Human annotators disagree---inter-annotator agreement is typically 60--80%. The reward model must learn robust preferences from noisy labels. Strategies to address noise include filtering for high-agreement pairs, using multiple annotations per pair, and training with label smoothing.

Scaling rewards. The reward model's output scale is arbitrary. Normalizing rewards during RL training (e.g., subtracting the running mean and dividing by the running standard deviation) helps stabilize optimization and makes hyperparameter choices more transferable across reward models.


25.5 PPO for Language Models

25.5.1 From Reward to Policy Optimization

Given a reward model $r_\phi$, we want to find a policy $\pi_\theta$ that maximizes expected reward while staying close to the reference policy $\pi_{\text{ref}}$ (typically $\pi_{\text{SFT}}$). The objective is:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) \right] - \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})$$

The KL divergence penalty serves two purposes: 1. Prevents reward hacking: Limits how far the policy can deviate from sensible behavior 2. Maintains capabilities: Keeps the model close to the capable SFT model

25.5.2 KL Divergence in the Language Model Setting

The KL divergence between the policy and reference for a given prompt $x$ is:

$$\text{KL}(\pi_\theta(\cdot|x) \| \pi_{\text{ref}}(\cdot|x)) = \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} \left[ \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \right]$$

For autoregressive models, this decomposes token-by-token:

$$\text{KL}(\pi_\theta \| \pi_{\text{ref}}) = \sum_{t=1}^{T} \mathbb{E}_{y_{

In practice, the KL penalty is often computed per-token and added to the reward:

$$r_{\text{total}}(x, y) = r_\phi(x, y) - \beta \sum_{t=1}^{T} \log \frac{\pi_\theta(y_t|x, y_{

25.5.3 Proximal Policy Optimization (PPO)

PPO (Schulman et al., 2017) is a policy gradient algorithm that constrains policy updates to a trust region. The clipped PPO objective is:

$$\mathcal{L}_{\text{PPO}} = -\mathbb{E}_t \left[ \min\left( \rho_t A_t, \; \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) A_t \right) \right]$$

where:

  • $\rho_t = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the importance sampling ratio
  • $A_t$ is the advantage estimate
  • $\epsilon \in [0.1, 0.2]$ is the clipping parameter

In the language model setting: - States $s_t$ are the partial sequences $(x, y_{Actions $a_t$ are the next tokens $y_t$ - Rewards are zero for all tokens except the last, which receives the reward model score (minus per-token KL penalties)

25.5.4 Advantage Estimation

The advantage is estimated using Generalized Advantage Estimation (GAE):

$$\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual, $\gamma$ is the discount factor (often set to 1 for language), and $\lambda$ is the GAE parameter.

A value function $V_\psi(s_t)$ is trained alongside the policy to estimate expected returns:

$$\mathcal{L}_V = \mathbb{E}_t \left[ (V_\psi(s_t) - R_t)^2 \right]$$

where $R_t = \sum_{l=0}^{T-t-1} \gamma^l r_{t+l}$ is the return.

25.5.5 Understanding the KL Divergence Penalty

The KL penalty deserves special attention because it is the mechanism that prevents the policy from collapsing to a degenerate solution. Without the KL penalty, the policy would learn to produce whatever outputs maximize the reward model's score, regardless of whether those outputs are sensible. This typically leads to outputs that are grammatically correct but repetitive, flattering, or otherwise "gaming" the reward model.

The intuition behind the KL penalty is straightforward: we want the aligned model to behave similarly to the SFT model on most inputs, deviating only where the reward signal provides a clear reason to do so. Mathematically, the KL divergence measures the "distance" between two probability distributions:

$$\text{KL}(\pi_\theta \| \pi_{\text{ref}}) = \sum_{y} \pi_\theta(y|x) \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$$

This quantity is always non-negative and equals zero only when $\pi_\theta = \pi_{\text{ref}}$. Key properties:

  • Asymmetry: $\text{KL}(\pi_\theta \| \pi_{\text{ref}}) \neq \text{KL}(\pi_{\text{ref}} \| \pi_\theta)$. The direction matters: the forward KL used here penalizes the policy for assigning high probability to tokens that the reference model considers unlikely.
  • Per-token decomposition: For autoregressive models, the KL decomposes into a sum of per-token KLs, as shown in Section 25.5.2. This allows fine-grained monitoring of where the policy diverges most from the reference.
  • The $\beta$ trade-off: A higher $\beta$ keeps the policy closer to the reference, producing more conservative but safer outputs. A lower $\beta$ allows more deviation, potentially producing better-aligned but riskier outputs. Typical values range from 0.01 to 0.2.

Worked example. Suppose the reference model assigns probability 0.3 to token "certainly" and the policy assigns probability 0.8. The per-token KL contribution is:

$$0.8 \times \log\frac{0.8}{0.3} = 0.8 \times 0.98 = 0.78 \text{ nats}$$

With $\beta = 0.1$, this contributes a penalty of $0.078$ to the total reward. If the reward model gave the response a score of 1.5, the penalty reduces it to approximately 1.42. This shows how the KL penalty gently discourages large deviations at the token level.

25.5.6 PPO Training Loop for LLMs

The PPO training loop for language models follows these steps:

  1. Sample a batch of prompts $\{x_i\}$ from the dataset
  2. Generate responses $\{y_i\}$ using the current policy $\pi_\theta$
  3. Compute rewards $\{r_\phi(x_i, y_i)\}$ using the reward model
  4. Compute per-token KL penalties
  5. Compute advantages using GAE
  6. Perform multiple epochs of PPO updates on the batch
  7. Update the value function

25.5.7 Practical Challenges of PPO for LLMs

Computational cost. PPO requires maintaining four models simultaneously: - The policy model $\pi_\theta$ (being trained) - The reference model $\pi_{\text{ref}}$ (frozen SFT model) - The reward model $r_\phi$ (frozen) - The value model $V_\psi$ (being trained)

For a 7B parameter model, this means ~28B parameters in GPU memory.

Training instability. PPO is sensitive to hyperparameters. The reward scale, KL coefficient $\beta$, clip parameter $\epsilon$, learning rate, and batch size all interact in complex ways. Common failure modes include: - Reward hacking (exploiting reward model weaknesses) - KL explosion (policy diverges rapidly from reference) - Training collapse (loss of diversity in outputs)

Reward model quality. The quality ceiling of RLHF is fundamentally limited by the reward model. If the reward model has systematic biases, the policy will exploit them.

25.5.8 PPO Implementation with TRL

For practitioners who need RLHF with PPO, the TRL library provides a PPOTrainer class:

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# Configuration
ppo_config = PPOConfig(
    model_name="your-sft-model",
    learning_rate=1.41e-5,
    batch_size=64,
    mini_batch_size=4,
    gradient_accumulation_steps=1,
    ppo_epochs=4,          # PPO update epochs per batch
    kl_penalty="kl",       # KL penalty type
    init_kl_coeff=0.2,     # Initial beta
    adap_kl_ctrl=True,     # Adaptive KL control
    target_kl=6.0,         # Target KL divergence
)

# Load model with value head (policy + value function)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "your-sft-model"
)

# Load frozen reference model
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "your-sft-model"
)

# Initialize PPO trainer
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop (simplified)
for batch in dataloader:
    # Generate responses
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors)

    # Compute rewards using the reward model
    rewards = [reward_model(q, r) for q, r in zip(query_tensors, response_tensors)]

    # PPO update
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    print(f"KL: {stats['objective/kl']:.3f}, "
          f"Reward: {stats['ppo/returns/mean']:.3f}")

The adap_kl_ctrl=True setting enables adaptive KL control, which automatically adjusts $\beta$ during training to keep the KL divergence near the target value. This is a practical necessity because the optimal $\beta$ changes during training as the policy evolves.


25.6 Direct Preference Optimization (DPO)

25.6.1 Motivation

DPO (Rafailov et al., 2023) was motivated by the observation that the RLHF pipeline is complex, unstable, and computationally expensive. The key insight is that the optimal policy under the RLHF objective has a closed-form solution, and this solution can be used to derive a simpler training objective that directly optimizes preferences without an explicit reward model or RL.

25.6.2 The RLHF Objective Revisited

Recall the RLHF objective:

$$\max_\pi \; \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi(\cdot|x)} \left[ r(x, y) \right] - \beta \; \text{KL}(\pi \| \pi_{\text{ref}})$$

The optimal policy $\pi^*$ that maximizes this objective is:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$ is the partition function.

25.6.3 The DPO Derivation

The DPO derivation is one of the most elegant results in modern machine learning. Let us trace each step carefully.

Step 1: Rearrange the optimal policy equation. Starting from the optimal policy:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$

Take the logarithm of both sides:

$$\log \pi^*(y|x) = \log \pi_{\text{ref}}(y|x) + \frac{1}{\beta} r(x, y) - \log Z(x)$$

Rearrange to isolate the reward:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

This expression tells us that the reward of any response $y$ can be recovered from the optimal policy and the reference policy, up to a prompt-dependent constant $\beta \log Z(x)$.

Step 2: Substitute into the Bradley-Terry model. The preference model says:

$$P(y_w \succ y_l | x) = \sigma(r(x, y_w) - r(x, y_l))$$

Substituting our expression for $r$:

$$P(y_w \succ y_l | x) = \sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \beta \log Z(x)\right)$$

Step 3: The partition function cancels. The $\beta \log Z(x)$ terms appear with opposite signs and cancel exactly:

$$P(y_w \succ y_l | x) = \sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$$

This cancellation is the key insight: the intractable partition function $Z(x)$ drops out entirely. We never need to compute it. The preference probability depends only on the log-ratio of the policy and reference model probabilities, which are both computable.

25.6.4 The DPO Loss Function

This leads directly to the DPO loss function. We parameterize $\pi^*$ as $\pi_\theta$ and minimize the negative log-likelihood of the preferences:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right) \right]$$

This loss has an elegant interpretation. Define the implicit reward of a response:

$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$$

Then the DPO loss becomes:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E} \left[ \log \sigma(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)) \right]$$

This is exactly the reward modeling loss, but with the reward parameterized implicitly through the policy.

25.6.5 Gradient Analysis

The gradient of the DPO loss with respect to $\theta$ is:

$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta \mathbb{E} \left[ \underbrace{\sigma(\hat{r}_\theta(y_l) - \hat{r}_\theta(y_w))}_{\text{weight}} \left( \underbrace{\nabla_\theta \log \pi_\theta(y_w|x)}_{\text{increase } y_w \text{ probability}} - \underbrace{\nabla_\theta \log \pi_\theta(y_l|x)}_{\text{decrease } y_l \text{ probability}} \right) \right]$$

The weighting term $\sigma(\hat{r}_\theta(y_l) - \hat{r}_\theta(y_w))$ is large when the model currently assigns a higher implicit reward to the losing response than the winning response---precisely when the model most needs to be corrected. This adaptive weighting makes DPO training efficient. Intuitively, DPO spends its gradient budget where it matters most---on examples where the model currently disagrees with the human preferences. Examples where the model already assigns higher implicit reward to the preferred response contribute little gradient, since $\sigma(\hat{r}_w - \hat{r}_l)$ is close to 1 and the weighting term $\sigma(\hat{r}_l - \hat{r}_w)$ is close to 0.

This is in contrast to standard supervised learning losses, which weight all examples equally regardless of how well the model already handles them. DPO's adaptive weighting is similar in spirit to hard-negative mining in contrastive learning or focal loss in object detection---the model focuses on the cases it finds most challenging.

25.6.6 DPO vs. RLHF Comparison

Aspect RLHF (PPO) DPO
Requires reward model Yes (separate training) No (implicit)
Requires RL algorithm Yes (PPO) No (supervised)
Models in memory 4 (policy, ref, RM, value) 2 (policy, ref)
Training stability Sensitive to hyperparameters More stable
Computational cost High Moderate
Hyperparameters Many (lr, KL coeff, clip, etc.) Few (lr, $\beta$)
Performance Strong Competitive
Theory Well-established Clean closed-form solution

25.6.7 Practical Considerations for DPO

The $\beta$ parameter. Controls the strength of the KL constraint. Lower $\beta$ allows more deviation from the reference policy; higher $\beta$ keeps the policy closer. Typical values range from 0.1 to 0.5.

Reference model. The reference policy is typically the SFT model. It must be kept frozen during DPO training and used to compute $\pi_{\text{ref}}(y|x)$ for both chosen and rejected responses.

Data requirements. DPO requires preference pairs $(x, y_w, y_l)$. The quality and diversity of these pairs significantly affects the final model.

Length bias. DPO can develop a bias toward longer or shorter responses depending on the preference data. Normalizing log-probabilities by sequence length can mitigate this:

$$\hat{r}_\theta(x, y) = \frac{\beta}{|y|} \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$$

Label smoothing. Adding label smoothing to the DPO loss can prevent the model from becoming overconfident in its preferences:

$$\mathcal{L}_{\text{DPO-smooth}} = -(1-\epsilon) \log \sigma(\hat{r}_w - \hat{r}_l) - \epsilon \log \sigma(\hat{r}_l - \hat{r}_w)$$

where $\epsilon \in [0, 0.1]$ is the smoothing factor. Label smoothing is especially useful when the preference data contains noise or when annotators disagree frequently.

Preference data quality. The most important factor in DPO performance is the quality and diversity of preference pairs. In practice:

  • Diverse prompts matter more than diverse responses. Covering many different prompt types ensures the alignment generalizes broadly.
  • Clear preference margins produce better training signal. Pairs where the chosen response is only marginally better than the rejected response provide weak signal and can add noise.
  • On-policy data (generated by the current model) tends to produce better alignment than off-policy data (generated by a different model). This is because the model learns most from correcting its own mistakes.

25.7 ORPO and Other Alignment Methods

25.7.1 ORPO: Odds Ratio Preference Optimization

ORPO (Hong et al., 2024) simplifies alignment further by eliminating the need for a reference model. It combines SFT and preference optimization in a single training stage.

The ORPO loss combines a standard language modeling loss with an odds ratio penalty:

$$\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}}(y_w) + \lambda \cdot \mathcal{L}_{\text{OR}}$$

where the odds ratio loss is:

$$\mathcal{L}_{\text{OR}} = -\log \sigma \left( \log \frac{\text{odds}_\theta(y_w|x)}{\text{odds}_\theta(y_l|x)} \right)$$

and the odds are defined as:

$$\text{odds}_\theta(y|x) = \frac{p_\theta(y|x)}{1 - p_\theta(y|x)}$$

The key advantage of ORPO is simplicity: it requires only a single model (no reference model), and it combines SFT and alignment in one training stage. The SFT component of the loss ensures the model learns to generate the preferred responses, while the odds ratio component ensures it learns to prefer those responses over the rejected ones.

ORPO's practical appeal lies in its reduced infrastructure requirements. While DPO requires maintaining two models in memory (the policy and the frozen reference), and PPO requires four, ORPO needs only one. This makes it particularly attractive for teams with limited GPU resources.

25.7.2 IPO: Identity Preference Optimization

IPO (Azar et al., 2024) addresses a theoretical concern with DPO: the assumption that the Bradley-Terry model perfectly describes human preferences. IPO uses a different loss that does not rely on this assumption:

$$\mathcal{L}_{\text{IPO}} = \left( \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \frac{1}{2\beta} \right)^2$$

This regression-style loss penalizes the model when the log-probability margin deviates from the target margin $\frac{1}{2\beta}$.

25.7.3 KTO: Kahneman-Tversky Optimization

KTO (Ethayarajh et al., 2024) works with unpaired preference data---responses labeled as simply "good" or "bad" without explicit pairwise comparisons. This is motivated by the observation that unpaired feedback is much easier to collect than pairwise preferences.

The KTO loss is inspired by Kahneman and Tversky's prospect theory:

$$\mathcal{L}_{\text{KTO}} = \mathbb{E}_{y_w} \left[ w(y_w) \cdot (1 - v_w) \right] + \mathbb{E}_{y_l} \left[ w(y_l) \cdot v_l \right]$$

where $v_w$ and $v_l$ are the values of chosen and rejected responses relative to the reference.

25.7.4 SimPO: Simple Preference Optimization

SimPO (Meng et al., 2024) further simplifies DPO by using the average log probability as an implicit reward, eliminating the need for a reference model:

$$\hat{r}_\theta(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y|x)$$

This results in the loss:

$$\mathcal{L}_{\text{SimPO}} = -\log \sigma \left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w|x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l|x) - \gamma \right)$$

where $\gamma$ is a target margin that ensures the model maintains a minimum preference gap. Typical values of $\gamma$ range from 0.5 to 1.5, with higher values encouraging a larger gap between chosen and rejected responses.

SimPO's key simplification is the elimination of the reference model. By using the average log probability as the reward (rather than the log-ratio with a reference), SimPO avoids the need to maintain a frozen copy of the initial model. This halves the memory requirement compared to DPO while maintaining competitive performance.

25.7.5 Summary of Alignment Methods

Method Models Required Needs Reference Needs RM Training Stages Complexity
RLHF (PPO) 4 Yes Yes 3 (SFT + RM + RL) High
DPO 2 Yes No 2 (SFT + DPO) Medium
IPO 2 Yes No 2 (SFT + IPO) Medium
ORPO 1 No No 1 (combined) Low
KTO 2 Yes No 2 (SFT + KTO) Medium
SimPO 1 No No 2 (SFT + SimPO) Low

The trend in alignment research is clearly toward simplicity: fewer models, fewer training stages, and less infrastructure. However, simplicity sometimes comes at the cost of flexibility. RLHF with PPO remains the most flexible approach because the reward model can be trained on arbitrary signals and updated independently of the policy. DPO and its variants sacrifice this flexibility for stability and simplicity.


25.8 Constitutional AI

25.8.1 Motivation

Constitutional AI (CAI), introduced by Bai et al. (2022), addresses the scalability problem of human preference annotation. Instead of relying entirely on human annotators, CAI uses a set of principles---a "constitution"---to guide AI-generated feedback.

25.8.2 The CAI Pipeline

Stage 1: Supervised Learning from AI Feedback (SL-CAI)

  1. Generate responses to harmful prompts using the model
  2. Ask the model to critique its own response according to the constitution
  3. Ask the model to revise its response based on the critique
  4. Fine-tune on the revised responses

Example critique prompt:

Critique the following response according to this principle:
"The response should not help with illegal activities."

Response: [model's original response]

Critique: [model generates critique]

Revision: Based on the critique, here is a revised response:
[model generates revised response]

Stage 2: RL from AI Feedback (RL-CAI)

  1. Generate pairs of responses
  2. Use an AI model to choose the preferred response according to the constitution
  3. Train a reward model on the AI-generated preferences
  4. Fine-tune with RL using the reward model

25.8.3 Constitutional Principles

A constitution typically includes principles like:

  1. Choose the response that is most helpful while being safe
  2. Choose the response that is least likely to be used for harmful purposes
  3. Choose the response that is most honest and accurate
  4. Choose the response that most clearly expresses uncertainty when appropriate
  5. Choose the response that is least biased and most fair

25.8.4 Advantages and Limitations

Advantages: - Scalable: AI feedback is cheaper and faster than human feedback - Consistent: AI applies principles more uniformly than diverse human annotators - Transparent: The constitution makes the alignment criteria explicit - Iterative: Easy to update principles without collecting new human data

Limitations: - Dependent on model capability to understand and apply principles - May miss subtle alignment issues that humans would catch - Constitution design itself requires careful human judgment - Can lead to over-cautious behavior if principles are too restrictive

25.8.5 Implementing Constitutional AI: A Simplified Example

Here is a simplified example of the critique-and-revise loop from Stage 1:

def constitutional_revision(
    model,
    tokenizer,
    harmful_prompt: str,
    initial_response: str,
    principle: str,
) -> str:
    """Apply one round of constitutional critique and revision.

    Args:
        model: The language model.
        tokenizer: The tokenizer.
        harmful_prompt: The original (potentially harmful) prompt.
        initial_response: The model's initial response.
        principle: The constitutional principle to apply.

    Returns:
        The revised response.
    """
    # Step 1: Critique
    critique_prompt = (
        f"Consider the following response to a user request.\n\n"
        f"User request: {harmful_prompt}\n"
        f"Response: {initial_response}\n\n"
        f"Critique this response according to the following "
        f"principle: '{principle}'\n\n"
        f"Critique:"
    )
    critique = generate(model, tokenizer, critique_prompt)

    # Step 2: Revise
    revision_prompt = (
        f"User request: {harmful_prompt}\n"
        f"Initial response: {initial_response}\n"
        f"Critique: {critique}\n\n"
        f"Based on the critique, write an improved response "
        f"that addresses the issues raised while remaining "
        f"helpful.\n\n"
        f"Revised response:"
    )
    revised = generate(model, tokenizer, revision_prompt)
    return revised

This loop can be applied iteratively with multiple principles, producing increasingly refined responses. The resulting (prompt, revised_response) pairs form the SFT dataset for Stage 1 of CAI.


25.9 Red Teaming and Safety Evaluation

25.9.1 Red Teaming

Red teaming involves systematically attempting to elicit harmful, biased, or undesired outputs from a model. It is a critical component of alignment evaluation.

Manual red teaming. Human red teamers craft prompts designed to: - Bypass safety filters (jailbreaking) - Elicit harmful content (violence, self-harm, illegal activities) - Reveal biases (gender, race, religion, political) - Cause hallucination of false information - Extract training data or private information - Trigger inconsistent behavior

Automated red teaming. Use another LLM to generate adversarial prompts:

Generate 20 diverse prompts that might cause a language model to
produce harmful output. The prompts should cover different attack
vectors: direct requests, role-playing scenarios, hypothetical
framing, multi-step manipulation, and encoded instructions.

25.9.2 Safety Benchmarks

Several benchmarks evaluate model safety:

TruthfulQA. Tests whether models generate truthful answers to questions where humans commonly have misconceptions. Measures both truthfulness and informativeness.

BBQ (Bias Benchmark for QA). Tests for social biases across nine categories including age, gender, race, and disability status.

RealToxicityPrompts. Measures the tendency of models to generate toxic text when given naturally occurring prompts with varying toxicity levels.

HarmBench. A standardized benchmark for evaluating LLM safety against direct and indirect attacks.

25.9.3 Evaluation Metrics

Attack success rate (ASR). The fraction of adversarial prompts that successfully elicit harmful content:

$$\text{ASR} = \frac{\text{Number of successful attacks}}{\text{Total number of attack attempts}}$$

Over-refusal rate. The fraction of benign requests that the model incorrectly refuses:

$$\text{Over-refusal rate} = \frac{\text{Benign requests refused}}{\text{Total benign requests}}$$

Toxicity score. Use a toxicity classifier (e.g., Perspective API) to measure the toxicity of model outputs.

Bias metrics. Measure disparities in model behavior across demographic groups using metrics like demographic parity and equalized odds.

25.9.4 Iterative Safety Improvement

Safety evaluation should be iterative:

  1. Evaluate the model with red teaming and benchmarks
  2. Identify failure modes and vulnerability categories
  3. Collect additional preference data targeting these failures
  4. Retrain with updated preference data
  5. Re-evaluate and repeat

25.9.5 Red-Teaming Methodology in Practice

A systematic red-teaming process includes:

Phase 1: Taxonomy development. Create a comprehensive taxonomy of potential harms, organized by category (violence, deception, bias, privacy, illegal activity) and severity (low, medium, high, critical). This taxonomy guides the red team's efforts and ensures broad coverage.

Phase 2: Attack vector exploration. For each harm category, develop multiple attack vectors:

  • Direct requests: "How do I [harmful action]?"
  • Role-playing: "Pretend you are a character who [harmful action]"
  • Hypothetical framing: "In a fictional story, how would a character [harmful action]?"
  • Gradual escalation: Start with benign questions and gradually steer toward harmful territory
  • Language manipulation: Use euphemisms, foreign languages, or encoded text to bypass filters
  • Multi-turn manipulation: Build rapport over multiple turns before making the harmful request
  • System prompt extraction: Attempt to extract the system prompt or internal instructions

Phase 3: Systematic testing. Each red-team member generates a set of test prompts covering their assigned categories and attack vectors. The prompts are run against the model, and the outputs are evaluated for harmfulness, policy violations, and quality of refusals.

Phase 4: Analysis and reporting. Aggregate results to identify patterns: which categories are most vulnerable? Which attack vectors are most successful? What does a typical failure look like? This analysis guides the next iteration of alignment training.

Phase 5: Targeted data collection. For each identified vulnerability, create preference pairs where the "rejected" response is the harmful output and the "chosen" response is a safe, helpful refusal. This targeted data is used in the next round of alignment training.

25.9.6 The HHH Framework

Anthropic's HHH framework (Helpful, Honest, Harmless) provides a structured way to evaluate alignment quality across three dimensions:

  • Helpful: Does the model make a genuine effort to assist? Does it answer the question directly and completely? Does it ask clarifying questions when appropriate?
  • Honest: Does the model express uncertainty when uncertain? Does it avoid fabricating information? Does it acknowledge its limitations?
  • Harmless: Does the model refuse harmful requests? Does it avoid generating biased, toxic, or offensive content? Does it minimize potential negative impacts?

Each dimension can be measured independently, allowing fine-grained analysis of alignment quality. A model that is maximally helpful but not harmless (answering harmful requests without hesitation) is misaligned. A model that is maximally harmless but not helpful (refusing almost all requests) is also misaligned. The goal is to find the right balance across all three dimensions.


25.10 Preference Data Collection

25.10.1 Data Collection Strategies

The quality of alignment critically depends on the quality of preference data. Several strategies are used:

Pairwise comparison. Show annotators two responses and ask which is better. This is the simplest and most common approach.

Ranked preferences. Show annotators $k$ responses and ask them to rank all of them. This provides $\binom{k}{2}$ pairwise comparisons per annotation.

Likert scale rating. Ask annotators to rate each response on a numerical scale (e.g., 1-5). Pairwise preferences are derived from rating differences.

Attribute-based feedback. Ask annotators to rate specific attributes (helpfulness, accuracy, safety, relevance) separately. This provides richer signal but is more expensive.

25.10.2 Annotation Guidelines

Clear, detailed annotation guidelines are essential:

  1. Define what "better" means for the specific task
  2. Provide examples of edge cases and how to handle them
  3. Specify how to handle ties (acceptable? which response wins?)
  4. Address subjectivity by defining objective criteria where possible
  5. Include calibration examples to align annotator understanding

25.10.3 Quality Control

Inter-annotator agreement. Measure agreement using Cohen's kappa or Fleiss' kappa:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where: - $p_o$ is the observed agreement between annotators - $p_e$ is the expected agreement by chance

A kappa of 0.6-0.8 is considered substantial agreement. For alignment preference data, kappa values of 0.5-0.7 are typical, reflecting the inherent subjectivity of quality judgments. If kappa is below 0.4, the task definition or annotation guidelines likely need revision---the annotators are not sufficiently aligned on what constitutes a "better" response.

Consensus filtering. Only include preferences where multiple annotators agree. This reduces noise at the cost of data volume.

Annotator modeling. Some annotators are more reliable than others. Weight their contributions accordingly or use the Dawid-Skene model to estimate true labels from noisy annotations.

25.10.4 Practical Guidance on Data Volume

How much preference data do you need? Empirical results suggest:

  • 1,000-5,000 pairs: Sufficient for basic alignment of a 7B model, but may not cover all failure modes. Suitable for narrow tasks with well-defined preference criteria.
  • 10,000-50,000 pairs: The sweet spot for general-purpose alignment. This volume provides broad coverage of prompts and failure modes while remaining practical to annotate.
  • 50,000-200,000 pairs: Used by frontier labs (OpenAI, Anthropic) for production models. Provides extensive coverage and multiple annotations per prompt for quality filtering.

The cost of human annotation ranges from $0.50 to $5.00 per preference pair, depending on task complexity and annotator expertise. For a 10K-pair dataset, this translates to $5,000-$50,000---a significant but often justifiable investment for production systems.

Synthetic preference data (generated by AI) can supplement human annotations at much lower cost, but should be validated against human preferences to ensure quality. A common approach is to use synthetic data for 80% of the training set and reserve high-quality human annotations for the most important or ambiguous cases.

25.10.4 Synthetic Preference Data

Increasingly, preference data is generated synthetically:

Self-play. Generate responses at different temperatures or with different system prompts, then rank them using a strong model.

Best-of-N. Generate $N$ responses per prompt, use a reward model or heuristic to select the best and worst, creating preference pairs.

AI feedback. Use a strong model (e.g., GPT-4) to compare responses and generate preferences. This approach is central to Constitutional AI, as discussed in Section 25.8. AI-generated preferences are cheaper and faster than human annotations, but they may miss nuanced quality differences and can amplify biases present in the judge model. A practical compromise is to use AI feedback for initial data collection and reserve human annotation for quality-critical subsets and validation.


25.11 Implementing DPO with TRL

25.11.1 Overview

The TRL library provides a DPOTrainer class that handles the DPO training loop, including reference model management, loss computation, and logging.

25.11.2 Data Format

DPO requires data with three fields: - prompt: The input prompt - chosen: The preferred response - rejected: The dispreferred response

25.11.3 Complete DPO Training Example

Here is a complete, working example of DPO training using the TRL library:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import DPOTrainer

# 1. Load the SFT model (serves as both policy and reference)
model_name = "your-sft-model"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# The reference model is loaded automatically by DPOTrainer
# (a frozen copy of the initial model)

# 2. Load preference dataset
# Each example needs: prompt, chosen, rejected
dataset = load_dataset("your-preference-data", split="train")

# Example data format:
# {
#   "prompt": "Explain quantum computing simply.",
#   "chosen": "Quantum computing uses quantum bits...",
#   "rejected": "Quantum computing is very complex..."
# }

# 3. Configure LoRA (optional but recommended)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# 4. Configure training
training_args = TrainingArguments(
    output_dir="./dpo-model",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,         # Lower than SFT
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    remove_unused_columns=False,
)

# 5. Initialize DPO trainer
dpo_trainer = DPOTrainer(
    model=model,
    args=training_args,
    beta=0.1,                   # KL regularization strength
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_length=1024,
    max_prompt_length=512,
)

# 6. Train
dpo_trainer.train()

# 7. Save
dpo_trainer.save_model("./dpo-model/final")

The DPOTrainer automatically handles: - Creating a frozen copy of the initial model as the reference policy - Computing log-probabilities for both the policy and reference models - Implementing the DPO loss function - Logging implicit rewards, margins, and accuracy metrics

The typical DPO training procedure is:

  1. Load the SFT model (serves as both the initial policy and the reference)
  2. Prepare preference data in the required format
  3. Configure DPO hyperparameters ($\beta$, learning rate, batch size)
  4. Train with the DPO loss
  5. Evaluate on held-out preference data and downstream tasks

25.11.4 Key Hyperparameters

Hyperparameter Typical Range Effect
$\beta$ 0.1 -- 0.5 KL regularization strength
Learning rate 1e-6 -- 5e-6 Update magnitude (lower than SFT)
Batch size 4 -- 16 Per-device batch size
Epochs 1 -- 3 Risk of overfitting with more
Max sequence length 512 -- 2048 Depends on data
Warmup ratio 0.1 Fraction of steps
Label smoothing 0.0 -- 0.1 Prevents overconfident preferences

25.11.5 Monitoring Training

Key metrics to monitor during DPO training:

  • Chosen rewards: The implicit reward $\hat{r}(y_w)$ should increase
  • Rejected rewards: The implicit reward $\hat{r}(y_l)$ should decrease
  • Reward margin: $\hat{r}(y_w) - \hat{r}(y_l)$ should increase
  • Accuracy: The fraction of pairs where $\hat{r}(y_w) > \hat{r}(y_l)$ should approach 1
  • KL divergence: The average KL from the reference should remain bounded

25.12 Putting It All Together: The Modern Alignment Pipeline

25.12.1 A Practical Alignment Recipe

Based on current best practices, here is a practical alignment pipeline:

  1. Start with a strong base model: Llama 3, Mistral, or similar. The base model determines the ceiling of what alignment can achieve---you cannot align capabilities that the model does not possess.

  2. SFT on high-quality data: 10K-100K instruction-response pairs, 1-3 epochs. Use a diverse mix of tasks as discussed in Chapter 24. The SFT stage brings the model into the "right neighborhood" of behavior.

  3. Collect preference data: 10K-50K preference pairs, covering diverse prompts. Include both "easy" pairs (clearly better/worse) and "hard" pairs (both good, but one slightly better). Hard pairs provide the most signal for alignment.

  4. DPO training: 1-2 epochs with $\beta = 0.1$-$0.3$. Use a lower learning rate than SFT ($5 \times 10^{-6}$ to $5 \times 10^{-7}$). Monitor the reward margin and accuracy throughout training.

  5. Safety evaluation: Red teaming + safety benchmarks. Use both automated benchmarks (TruthfulQA, BBQ) and manual red-teaming. Aim for a low attack success rate (<5%) and a low over-refusal rate (<10%).

  6. Iterate: Address failure modes with targeted data collection. The most impactful improvement usually comes from collecting preference data specifically for the failure modes identified during evaluation.

25.12.2 Monitoring the Alignment Pipeline

Effective monitoring throughout the alignment pipeline requires tracking different metrics at each stage:

During SFT: - Training loss (should decrease smoothly) - Validation loss (should decrease; divergence from training loss indicates overfitting) - Response quality samples (manual inspection of generated responses)

During DPO/RLHF: - Chosen reward: $\hat{r}(y_w)$ should increase - Rejected reward: $\hat{r}(y_l)$ should decrease - Reward margin: $\hat{r}(y_w) - \hat{r}(y_l)$ should increase, but not explode - KL divergence from reference: should remain bounded (typically < 10 nats) - Accuracy: fraction of pairs correctly ordered should increase toward 1.0 - Response length: monitor for length gaming (responses getting longer without becoming better)

After alignment: - Safety benchmarks: TruthfulQA, HarmBench, BBQ - Capability benchmarks: MMLU, HumanEval, GSM8K (to detect capability regression) - User satisfaction: if in production, track thumbs up/down ratings

A common failure pattern is "alignment collapse," where the model becomes overly cautious and refuses reasonable requests. Monitor the over-refusal rate on a set of benign prompts to detect this early.

25.12.3 When to Use Which Method

  • RLHF (PPO): When you need maximum control over the reward signal, have significant compute, and have experience with RL training
  • DPO: When you have pairwise preference data and want a simpler, more stable training process
  • ORPO: When you want to combine SFT and alignment in one stage with minimal infrastructure
  • KTO: When you have unpaired feedback (thumbs up/down) rather than pairwise comparisons
  • Constitutional AI: When you need to scale alignment without proportionally scaling human annotation

The choice also depends on practical constraints. If you have a single 24GB GPU, ORPO or SimPO is the most feasible option because they require only one model in memory. If you have a cluster of GPUs and experienced ML engineers, RLHF with PPO provides the most control and flexibility. For most teams starting with alignment, DPO is the recommended starting point: it is well-understood, well-supported in libraries, and produces results competitive with PPO at a fraction of the complexity.


25.13 Summary

Alignment transforms capable language models into reliable, helpful assistants. In this chapter, we covered:

  • The alignment problem arises because pre-trained models lack the ability to distinguish between helpful and harmful outputs; alignment techniques provide this distinction
  • Reward modeling learns a scalar function that predicts human preferences, trained on pairwise comparison data using the Bradley-Terry model
  • RLHF optimizes a policy to maximize the learned reward while staying close to the SFT reference via KL regularization, using PPO as the optimization algorithm
  • PPO for language models treats token generation as a sequential decision process, requiring careful management of four models and many hyperparameters
  • DPO eliminates the need for a separate reward model and RL by directly optimizing the policy on preference data. The derivation shows that the optimal RLHF policy has a closed-form relationship to the reward, enabling the reward model to be bypassed entirely
  • ORPO combines SFT and alignment in a single stage using odds ratio penalties, eliminating the reference model requirement
  • Constitutional AI scales alignment by using AI feedback guided by explicit principles, reducing dependence on human annotators
  • Red teaming and safety evaluation systematically probe for vulnerabilities and measure alignment quality across multiple dimensions
  • Preference data collection is the foundation of all alignment methods; data quality directly determines alignment quality
  • SimPO eliminates the reference model entirely by using average log probability as the implicit reward
  • DPO with TRL provides a practical, accessible implementation path for alignment training
  • Open challenges include scalable oversight, reward model robustness, alignment stability, truthfulness, multi-stakeholder alignment, and evaluation gaps

The journey from a raw pre-trained model to an aligned assistant involves multiple stages of refinement, each building on the previous one. The field continues to evolve rapidly, with new alignment methods offering better efficiency, stability, and performance.


25.14 Open Challenges and Future Directions

Despite rapid progress, several fundamental challenges in alignment remain unresolved:

25.14.1 Scalable Oversight

As models become more capable, it becomes harder for humans to evaluate whether model outputs are correct and aligned. This is the scalable oversight problem, first articulated by Amodei et al. (2016). If a model generates a sophisticated mathematical proof, a complex piece of code, or a nuanced legal argument, how do we verify that it is correct without ourselves being experts in that domain? As models become superhuman at certain tasks, the gap between model capability and human evaluator capability will widen, making alignment increasingly difficult. Current approaches include:

  • Debate: Two model instances argue for and against a claim, with a human judge deciding the winner. The adversarial dynamic helps surface errors.
  • Recursive reward modeling: Use aligned models to help humans evaluate model outputs, bootstrapping alignment at increasingly difficult levels.
  • Process-based reward models: Instead of scoring only the final answer, train reward models to evaluate each reasoning step, making it easier for humans to verify intermediate work.

25.14.2 Reward Model Robustness

Current reward models are brittle. They can be "hacked" by policies that learn to produce outputs with high reward scores that do not actually represent good responses (as discussed in Section 25.4.7). Improving reward model robustness---through better training data, adversarial training, or ensemble reward models---is an active research area.

25.14.3 Alignment Stability

Alignment can be fragile. Fine-tuning an aligned model on new data can undo alignment (alignment forgetting). Minor changes to the system prompt can cause aligned models to behave in unaligned ways. Developing methods that produce robust, stable alignment that persists across downstream uses is an important open problem.

25.14.4 Truthfulness and Hallucination

Current alignment techniques are better at making models sound confident than at making them truthful. A model that has been trained with RLHF may learn to express uncertainty only when the human-annotated data demonstrates uncertainty, not when the model itself is genuinely uncertain. This disconnect between expressed confidence and actual reliability is a fundamental challenge, closely related to the broader problem of hallucination in language models.

25.14.5 Multi-Stakeholder Alignment

Whose preferences should the model be aligned to? Different users, cultures, and organizations have different values and preferences. A model aligned to the preferences of one group may be misaligned for another. Techniques for personalizable alignment---where the model can adapt to different preference profiles while maintaining core safety constraints---are an emerging area of research.

25.14.6 Evaluation Gaps

Our ability to build alignment techniques has outpaced our ability to evaluate them. We can measure toxicity and bias with automated classifiers, but we lack reliable metrics for more subtle alignment properties like truthfulness, manipulation resistance, and genuine helpfulness. Developing comprehensive, reliable alignment benchmarks remains a critical need.


References

  1. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
  2. Rafailov, R., et al. (2023). Direct preference optimization: Your language model is secretly a reward model. NeurIPS.
  3. Schulman, J., et al. (2017). Proximal policy optimization algorithms. arXiv.
  4. Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv.
  5. Ziegler, D. M., et al. (2019). Fine-tuning language models from human preferences. arXiv.
  6. Stiennon, N., et al. (2020). Learning to summarize from human feedback. NeurIPS.
  7. Hong, J., et al. (2024). ORPO: Monolithic preference optimization without reference model. arXiv.
  8. Azar, M. G., et al. (2024). A general theoretical paradigm to understand learning from human feedback. AISTATS.
  9. Ethayarajh, K., et al. (2024). KTO: Model alignment as prospect theoretic optimization. arXiv.
  10. Meng, Y., et al. (2024). SimPO: Simple preference optimization with a reference-free reward. arXiv.
  11. Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika.
  12. Perez, E., et al. (2022). Red teaming language models with language models. EMNLP.
  13. Amodei, D., et al. (2016). Concrete problems in AI safety. arXiv.
  14. Russell, S. (2019). Human Compatible: Artificial Intelligence and the Problem of Control. Viking.
  15. Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations. ICML.