Case Study 1: Training a Reward Model from Human Preferences
Overview
In this case study, we build a reward model from scratch that learns to predict human preferences between response pairs. We cover the complete pipeline: creating preference data, initializing the reward model from a language model, implementing the Bradley-Terry training objective, training the model, and evaluating its quality including tests for common biases like length preference and sycophancy.
Learning Objectives
- Structure preference data for reward model training.
- Initialize a reward model from a pre-trained language model with a scalar head.
- Implement the Bradley-Terry preference loss.
- Train and evaluate a reward model.
- Diagnose common reward model biases.
Step 1: Preference Data
"""Training a reward model from human preferences.
Implements the complete reward modeling pipeline: data preparation,
model architecture, training, and bias evaluation.
Requirements:
pip install torch transformers datasets
"""
from dataclasses import dataclass
from typing import Optional
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModel, AutoTokenizer
torch.manual_seed(42)
@dataclass
class PreferencePair:
"""A single preference annotation.
Attributes:
prompt: The input prompt.
chosen: The preferred response.
rejected: The dispreferred response.
"""
prompt: str
chosen: str
rejected: str
PREFERENCE_DATA = [
PreferencePair(
prompt="Explain what a neural network is.",
chosen=(
"A neural network is a computational model inspired by the "
"structure of biological neurons. It consists of layers of "
"interconnected nodes that process information by learning "
"patterns from data. Each connection has a weight that is "
"adjusted during training to minimize prediction errors."
),
rejected=(
"Neural networks are like brains but for computers. They "
"do AI stuff and machine learning things. They are very "
"complex and hard to understand."
),
),
PreferencePair(
prompt="What is the capital of France?",
chosen="The capital of France is Paris.",
rejected=(
"France is a beautiful country in Europe. It has many cities "
"including Lyon, Marseille, Toulouse, and of course Paris, "
"which serves as the capital. Paris is known for the Eiffel "
"Tower, the Louvre, and its cuisine."
),
),
PreferencePair(
prompt="Write a Python function to reverse a string.",
chosen=(
"Here is a concise function:\n\n"
"```python\n"
"def reverse_string(s: str) -> str:\n"
' """Reverse the input string."""\n'
" return s[::-1]\n"
"```"
),
rejected=(
"You can reverse a string in many ways. One way is to use "
"a loop, another is slicing. There are also recursive "
"approaches. Python makes it easy."
),
),
PreferencePair(
prompt="Is it safe to mix bleach and ammonia?",
chosen=(
"No, you should never mix bleach and ammonia. This combination "
"produces toxic chloramine gases that can cause serious "
"respiratory damage, chemical burns, and can be fatal in "
"enclosed spaces. If you need to clean, use one product at a "
"time and ensure adequate ventilation."
),
rejected=(
"Mixing cleaning products can create stronger cleaners. You "
"might want to try it in a well-ventilated area to see if "
"it works better for tough stains."
),
),
]
Step 2: Reward Model Architecture
class RewardModel(nn.Module):
"""Reward model that predicts scalar rewards from (prompt, response) pairs.
Architecture: Pre-trained Transformer backbone + scalar output head.
The reward is computed from the last token's hidden state.
Attributes:
backbone: Pre-trained Transformer model (frozen or trainable).
reward_head: Linear layer mapping hidden states to scalar rewards.
"""
def __init__(
self,
model_name: str = "gpt2",
freeze_backbone: bool = False,
) -> None:
"""Initialize the reward model.
Args:
model_name: HuggingFace model identifier for the backbone.
freeze_backbone: If True, freeze the backbone parameters.
"""
super().__init__()
self.backbone = AutoModel.from_pretrained(model_name)
hidden_size = self.backbone.config.hidden_size
if freeze_backbone:
for param in self.backbone.parameters():
param.requires_grad = False
self.reward_head = nn.Linear(hidden_size, 1, bias=False)
nn.init.zeros_(self.reward_head.weight)
def forward(
self,
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
) -> torch.Tensor:
"""Compute reward scores for a batch of sequences.
Args:
input_ids: Token IDs of shape (batch_size, seq_len).
attention_mask: Attention mask of shape (batch_size, seq_len).
Returns:
Scalar rewards of shape (batch_size,).
"""
outputs = self.backbone(
input_ids=input_ids,
attention_mask=attention_mask,
)
# Use the last non-padding token's hidden state
last_hidden = outputs.last_hidden_state
# Find the position of the last real token for each sequence
seq_lengths = attention_mask.sum(dim=1) - 1 # 0-indexed
batch_indices = torch.arange(input_ids.size(0), device=input_ids.device)
last_token_hidden = last_hidden[batch_indices, seq_lengths]
rewards = self.reward_head(last_token_hidden).squeeze(-1)
return rewards
Step 3: Bradley-Terry Loss
def preference_loss(
chosen_rewards: torch.Tensor,
rejected_rewards: torch.Tensor,
) -> torch.Tensor:
"""Compute the Bradley-Terry preference loss.
Loss = -log(sigma(r_chosen - r_rejected))
Args:
chosen_rewards: Reward scores for chosen responses (batch_size,).
rejected_rewards: Reward scores for rejected responses (batch_size,).
Returns:
Scalar loss value.
"""
return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
def preference_accuracy(
chosen_rewards: torch.Tensor,
rejected_rewards: torch.Tensor,
) -> float:
"""Compute accuracy: fraction where chosen reward > rejected reward.
Args:
chosen_rewards: Reward scores for chosen responses.
rejected_rewards: Reward scores for rejected responses.
Returns:
Accuracy as a float between 0 and 1.
"""
correct = (chosen_rewards > rejected_rewards).float().mean()
return correct.item()
Step 4: Training Loop
class PreferenceDataset(Dataset):
"""PyTorch dataset for preference pairs.
Attributes:
pairs: List of PreferencePair objects.
tokenizer: HuggingFace tokenizer.
max_length: Maximum sequence length.
"""
def __init__(
self,
pairs: list[PreferencePair],
tokenizer: AutoTokenizer,
max_length: int = 512,
) -> None:
"""Initialize the dataset.
Args:
pairs: List of preference pairs.
tokenizer: HuggingFace tokenizer.
max_length: Maximum sequence length for tokenization.
"""
self.pairs = pairs
self.tokenizer = tokenizer
self.max_length = max_length
def __len__(self) -> int:
"""Return the number of preference pairs."""
return len(self.pairs)
def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
"""Get a single tokenized preference pair.
Args:
idx: Index of the pair.
Returns:
Dictionary with chosen and rejected input_ids and masks.
"""
pair = self.pairs[idx]
chosen_text = f"{pair.prompt}\n\n{pair.chosen}"
rejected_text = f"{pair.prompt}\n\n{pair.rejected}"
chosen_enc = self.tokenizer(
chosen_text,
truncation=True,
max_length=self.max_length,
padding="max_length",
return_tensors="pt",
)
rejected_enc = self.tokenizer(
rejected_text,
truncation=True,
max_length=self.max_length,
padding="max_length",
return_tensors="pt",
)
return {
"chosen_input_ids": chosen_enc["input_ids"].squeeze(0),
"chosen_attention_mask": chosen_enc["attention_mask"].squeeze(0),
"rejected_input_ids": rejected_enc["input_ids"].squeeze(0),
"rejected_attention_mask": rejected_enc["attention_mask"].squeeze(0),
}
def train_reward_model(
model: RewardModel,
dataset: PreferenceDataset,
num_epochs: int = 5,
learning_rate: float = 1e-5,
device: str = "cpu",
) -> list[dict[str, float]]:
"""Train the reward model on preference data.
Args:
model: The reward model to train.
dataset: Training dataset of preference pairs.
num_epochs: Number of training epochs.
learning_rate: Learning rate for the optimizer.
device: Device to train on.
Returns:
List of training metrics per epoch.
"""
model = model.to(device)
model.train()
loader = DataLoader(dataset, batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(
filter(lambda p: p.requires_grad, model.parameters()),
lr=learning_rate,
)
history: list[dict[str, float]] = []
for epoch in range(num_epochs):
total_loss = 0.0
total_acc = 0.0
num_batches = 0
for batch in loader:
chosen_rewards = model(
batch["chosen_input_ids"].to(device),
batch["chosen_attention_mask"].to(device),
)
rejected_rewards = model(
batch["rejected_input_ids"].to(device),
batch["rejected_attention_mask"].to(device),
)
loss = preference_loss(chosen_rewards, rejected_rewards)
acc = preference_accuracy(chosen_rewards, rejected_rewards)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
total_acc += acc
num_batches += 1
avg_loss = total_loss / num_batches
avg_acc = total_acc / num_batches
history.append({"epoch": epoch + 1, "loss": avg_loss, "accuracy": avg_acc})
print(f"Epoch {epoch + 1}: loss={avg_loss:.4f}, accuracy={avg_acc:.1%}")
return history
Step 5: Bias Evaluation
def evaluate_length_bias(
model: RewardModel,
tokenizer: AutoTokenizer,
prompts: list[str],
device: str = "cpu",
) -> dict[str, float]:
"""Test whether the reward model has a length bias.
Generates short and long versions of responses and checks if
the model systematically prefers longer responses.
Args:
model: The trained reward model.
tokenizer: HuggingFace tokenizer.
prompts: List of test prompts.
device: Device for inference.
Returns:
Dictionary with length bias metrics.
"""
model.eval()
prefers_longer = 0
total = 0
for prompt in prompts:
short_response = f"{prompt}\n\nThis is a brief answer."
long_response = (
f"{prompt}\n\nThis is a much longer and more detailed answer "
"that provides additional context, examples, and elaboration "
"on the topic at hand. It covers multiple aspects and goes "
"into significant depth."
)
short_enc = tokenizer(
short_response, return_tensors="pt",
truncation=True, max_length=512, padding="max_length",
)
long_enc = tokenizer(
long_response, return_tensors="pt",
truncation=True, max_length=512, padding="max_length",
)
with torch.no_grad():
short_reward = model(
short_enc["input_ids"].to(device),
short_enc["attention_mask"].to(device),
)
long_reward = model(
long_enc["input_ids"].to(device),
long_enc["attention_mask"].to(device),
)
if long_reward.item() > short_reward.item():
prefers_longer += 1
total += 1
return {
"length_bias_rate": prefers_longer / total if total > 0 else 0.0,
"total_tests": total,
}
Step 6: Demonstration
if __name__ == "__main__":
print("=" * 60)
print("Case Study 1: Training a Reward Model")
print("=" * 60)
print(f"\nPreference pairs: {len(PREFERENCE_DATA)}")
for i, pair in enumerate(PREFERENCE_DATA):
print(f" Pair {i}: {pair.prompt[:50]}...")
print("\nReward Model Architecture:")
print(" Backbone: Pre-trained Transformer")
print(" Output head: Linear(hidden_dim, 1)")
print(" Loss: Bradley-Terry preference loss")
print(" Loss = -log(sigma(r_chosen - r_rejected))")
print("\nBias Evaluation Checks:")
print(" 1. Length bias: Does the model prefer longer responses?")
print(" 2. Sycophancy: Does it prefer agreeable responses?")
print(" 3. Accuracy: Does chosen always score higher than rejected?")
print("\nTo train, run with a compatible GPU and model.")
Key Takeaways
- The reward model is the quality bottleneck of RLHF. Its biases and blind spots directly transfer to the aligned model. Careful evaluation for length bias, sycophancy, and calibration is essential.
- Bradley-Terry loss is equivalent to binary cross-entropy on the reward difference, making it straightforward to implement and optimize.
- Initializing from the SFT model ensures the reward model understands the same features as the policy, leading to more meaningful reward signals.
- Preference accuracy on held-out data should be 65-80%. Below 60% suggests noise-dominated data; above 85% suggests the task may be too easy or the data lacks diversity.