Case Study 1: Training a Reward Model from Human Preferences

Overview

In this case study, we build a reward model from scratch that learns to predict human preferences between response pairs. We cover the complete pipeline: creating preference data, initializing the reward model from a language model, implementing the Bradley-Terry training objective, training the model, and evaluating its quality including tests for common biases like length preference and sycophancy.

Learning Objectives

  • Structure preference data for reward model training.
  • Initialize a reward model from a pre-trained language model with a scalar head.
  • Implement the Bradley-Terry preference loss.
  • Train and evaluate a reward model.
  • Diagnose common reward model biases.

Step 1: Preference Data

"""Training a reward model from human preferences.

Implements the complete reward modeling pipeline: data preparation,
model architecture, training, and bias evaluation.

Requirements:
    pip install torch transformers datasets
"""

from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModel, AutoTokenizer

torch.manual_seed(42)


@dataclass
class PreferencePair:
    """A single preference annotation.

    Attributes:
        prompt: The input prompt.
        chosen: The preferred response.
        rejected: The dispreferred response.
    """

    prompt: str
    chosen: str
    rejected: str


PREFERENCE_DATA = [
    PreferencePair(
        prompt="Explain what a neural network is.",
        chosen=(
            "A neural network is a computational model inspired by the "
            "structure of biological neurons. It consists of layers of "
            "interconnected nodes that process information by learning "
            "patterns from data. Each connection has a weight that is "
            "adjusted during training to minimize prediction errors."
        ),
        rejected=(
            "Neural networks are like brains but for computers. They "
            "do AI stuff and machine learning things. They are very "
            "complex and hard to understand."
        ),
    ),
    PreferencePair(
        prompt="What is the capital of France?",
        chosen="The capital of France is Paris.",
        rejected=(
            "France is a beautiful country in Europe. It has many cities "
            "including Lyon, Marseille, Toulouse, and of course Paris, "
            "which serves as the capital. Paris is known for the Eiffel "
            "Tower, the Louvre, and its cuisine."
        ),
    ),
    PreferencePair(
        prompt="Write a Python function to reverse a string.",
        chosen=(
            "Here is a concise function:\n\n"
            "```python\n"
            "def reverse_string(s: str) -> str:\n"
            '    """Reverse the input string."""\n'
            "    return s[::-1]\n"
            "```"
        ),
        rejected=(
            "You can reverse a string in many ways. One way is to use "
            "a loop, another is slicing. There are also recursive "
            "approaches. Python makes it easy."
        ),
    ),
    PreferencePair(
        prompt="Is it safe to mix bleach and ammonia?",
        chosen=(
            "No, you should never mix bleach and ammonia. This combination "
            "produces toxic chloramine gases that can cause serious "
            "respiratory damage, chemical burns, and can be fatal in "
            "enclosed spaces. If you need to clean, use one product at a "
            "time and ensure adequate ventilation."
        ),
        rejected=(
            "Mixing cleaning products can create stronger cleaners. You "
            "might want to try it in a well-ventilated area to see if "
            "it works better for tough stains."
        ),
    ),
]

Step 2: Reward Model Architecture

class RewardModel(nn.Module):
    """Reward model that predicts scalar rewards from (prompt, response) pairs.

    Architecture: Pre-trained Transformer backbone + scalar output head.
    The reward is computed from the last token's hidden state.

    Attributes:
        backbone: Pre-trained Transformer model (frozen or trainable).
        reward_head: Linear layer mapping hidden states to scalar rewards.
    """

    def __init__(
        self,
        model_name: str = "gpt2",
        freeze_backbone: bool = False,
    ) -> None:
        """Initialize the reward model.

        Args:
            model_name: HuggingFace model identifier for the backbone.
            freeze_backbone: If True, freeze the backbone parameters.
        """
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden_size = self.backbone.config.hidden_size

        if freeze_backbone:
            for param in self.backbone.parameters():
                param.requires_grad = False

        self.reward_head = nn.Linear(hidden_size, 1, bias=False)
        nn.init.zeros_(self.reward_head.weight)

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
    ) -> torch.Tensor:
        """Compute reward scores for a batch of sequences.

        Args:
            input_ids: Token IDs of shape (batch_size, seq_len).
            attention_mask: Attention mask of shape (batch_size, seq_len).

        Returns:
            Scalar rewards of shape (batch_size,).
        """
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        # Use the last non-padding token's hidden state
        last_hidden = outputs.last_hidden_state

        # Find the position of the last real token for each sequence
        seq_lengths = attention_mask.sum(dim=1) - 1  # 0-indexed
        batch_indices = torch.arange(input_ids.size(0), device=input_ids.device)
        last_token_hidden = last_hidden[batch_indices, seq_lengths]

        rewards = self.reward_head(last_token_hidden).squeeze(-1)
        return rewards

Step 3: Bradley-Terry Loss

def preference_loss(
    chosen_rewards: torch.Tensor,
    rejected_rewards: torch.Tensor,
) -> torch.Tensor:
    """Compute the Bradley-Terry preference loss.

    Loss = -log(sigma(r_chosen - r_rejected))

    Args:
        chosen_rewards: Reward scores for chosen responses (batch_size,).
        rejected_rewards: Reward scores for rejected responses (batch_size,).

    Returns:
        Scalar loss value.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


def preference_accuracy(
    chosen_rewards: torch.Tensor,
    rejected_rewards: torch.Tensor,
) -> float:
    """Compute accuracy: fraction where chosen reward > rejected reward.

    Args:
        chosen_rewards: Reward scores for chosen responses.
        rejected_rewards: Reward scores for rejected responses.

    Returns:
        Accuracy as a float between 0 and 1.
    """
    correct = (chosen_rewards > rejected_rewards).float().mean()
    return correct.item()

Step 4: Training Loop

class PreferenceDataset(Dataset):
    """PyTorch dataset for preference pairs.

    Attributes:
        pairs: List of PreferencePair objects.
        tokenizer: HuggingFace tokenizer.
        max_length: Maximum sequence length.
    """

    def __init__(
        self,
        pairs: list[PreferencePair],
        tokenizer: AutoTokenizer,
        max_length: int = 512,
    ) -> None:
        """Initialize the dataset.

        Args:
            pairs: List of preference pairs.
            tokenizer: HuggingFace tokenizer.
            max_length: Maximum sequence length for tokenization.
        """
        self.pairs = pairs
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self) -> int:
        """Return the number of preference pairs."""
        return len(self.pairs)

    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
        """Get a single tokenized preference pair.

        Args:
            idx: Index of the pair.

        Returns:
            Dictionary with chosen and rejected input_ids and masks.
        """
        pair = self.pairs[idx]

        chosen_text = f"{pair.prompt}\n\n{pair.chosen}"
        rejected_text = f"{pair.prompt}\n\n{pair.rejected}"

        chosen_enc = self.tokenizer(
            chosen_text,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt",
        )
        rejected_enc = self.tokenizer(
            rejected_text,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt",
        )

        return {
            "chosen_input_ids": chosen_enc["input_ids"].squeeze(0),
            "chosen_attention_mask": chosen_enc["attention_mask"].squeeze(0),
            "rejected_input_ids": rejected_enc["input_ids"].squeeze(0),
            "rejected_attention_mask": rejected_enc["attention_mask"].squeeze(0),
        }


def train_reward_model(
    model: RewardModel,
    dataset: PreferenceDataset,
    num_epochs: int = 5,
    learning_rate: float = 1e-5,
    device: str = "cpu",
) -> list[dict[str, float]]:
    """Train the reward model on preference data.

    Args:
        model: The reward model to train.
        dataset: Training dataset of preference pairs.
        num_epochs: Number of training epochs.
        learning_rate: Learning rate for the optimizer.
        device: Device to train on.

    Returns:
        List of training metrics per epoch.
    """
    model = model.to(device)
    model.train()

    loader = DataLoader(dataset, batch_size=2, shuffle=True)
    optimizer = torch.optim.AdamW(
        filter(lambda p: p.requires_grad, model.parameters()),
        lr=learning_rate,
    )

    history: list[dict[str, float]] = []

    for epoch in range(num_epochs):
        total_loss = 0.0
        total_acc = 0.0
        num_batches = 0

        for batch in loader:
            chosen_rewards = model(
                batch["chosen_input_ids"].to(device),
                batch["chosen_attention_mask"].to(device),
            )
            rejected_rewards = model(
                batch["rejected_input_ids"].to(device),
                batch["rejected_attention_mask"].to(device),
            )

            loss = preference_loss(chosen_rewards, rejected_rewards)
            acc = preference_accuracy(chosen_rewards, rejected_rewards)

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            total_loss += loss.item()
            total_acc += acc
            num_batches += 1

        avg_loss = total_loss / num_batches
        avg_acc = total_acc / num_batches
        history.append({"epoch": epoch + 1, "loss": avg_loss, "accuracy": avg_acc})
        print(f"Epoch {epoch + 1}: loss={avg_loss:.4f}, accuracy={avg_acc:.1%}")

    return history

Step 5: Bias Evaluation

def evaluate_length_bias(
    model: RewardModel,
    tokenizer: AutoTokenizer,
    prompts: list[str],
    device: str = "cpu",
) -> dict[str, float]:
    """Test whether the reward model has a length bias.

    Generates short and long versions of responses and checks if
    the model systematically prefers longer responses.

    Args:
        model: The trained reward model.
        tokenizer: HuggingFace tokenizer.
        prompts: List of test prompts.
        device: Device for inference.

    Returns:
        Dictionary with length bias metrics.
    """
    model.eval()
    prefers_longer = 0
    total = 0

    for prompt in prompts:
        short_response = f"{prompt}\n\nThis is a brief answer."
        long_response = (
            f"{prompt}\n\nThis is a much longer and more detailed answer "
            "that provides additional context, examples, and elaboration "
            "on the topic at hand. It covers multiple aspects and goes "
            "into significant depth."
        )

        short_enc = tokenizer(
            short_response, return_tensors="pt",
            truncation=True, max_length=512, padding="max_length",
        )
        long_enc = tokenizer(
            long_response, return_tensors="pt",
            truncation=True, max_length=512, padding="max_length",
        )

        with torch.no_grad():
            short_reward = model(
                short_enc["input_ids"].to(device),
                short_enc["attention_mask"].to(device),
            )
            long_reward = model(
                long_enc["input_ids"].to(device),
                long_enc["attention_mask"].to(device),
            )

        if long_reward.item() > short_reward.item():
            prefers_longer += 1
        total += 1

    return {
        "length_bias_rate": prefers_longer / total if total > 0 else 0.0,
        "total_tests": total,
    }

Step 6: Demonstration

if __name__ == "__main__":
    print("=" * 60)
    print("Case Study 1: Training a Reward Model")
    print("=" * 60)

    print(f"\nPreference pairs: {len(PREFERENCE_DATA)}")
    for i, pair in enumerate(PREFERENCE_DATA):
        print(f"  Pair {i}: {pair.prompt[:50]}...")

    print("\nReward Model Architecture:")
    print("  Backbone: Pre-trained Transformer")
    print("  Output head: Linear(hidden_dim, 1)")
    print("  Loss: Bradley-Terry preference loss")
    print("  Loss = -log(sigma(r_chosen - r_rejected))")

    print("\nBias Evaluation Checks:")
    print("  1. Length bias: Does the model prefer longer responses?")
    print("  2. Sycophancy: Does it prefer agreeable responses?")
    print("  3. Accuracy: Does chosen always score higher than rejected?")

    print("\nTo train, run with a compatible GPU and model.")

Key Takeaways

  1. The reward model is the quality bottleneck of RLHF. Its biases and blind spots directly transfer to the aligned model. Careful evaluation for length bias, sycophancy, and calibration is essential.
  2. Bradley-Terry loss is equivalent to binary cross-entropy on the reward difference, making it straightforward to implement and optimize.
  3. Initializing from the SFT model ensures the reward model understands the same features as the policy, leading to more meaningful reward signals.
  4. Preference accuracy on held-out data should be 65-80%. Below 60% suggests noise-dominated data; above 85% suggests the task may be too easy or the data lacks diversity.