Chapter 25: Further Reading

Foundational Papers

RLHF and InstructGPT

  • Ouyang, L., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS. The InstructGPT paper that established the modern RLHF pipeline: SFT on demonstrations, reward modeling on comparisons, and PPO optimization. Shows that a 1.3B-parameter InstructGPT model can be preferred by annotators over the 175B GPT-3, alongside gains in truthfulness and reductions in toxic output. Essential reading for understanding the full alignment pipeline.

  • Stiennon, N., et al. (2020). "Learning to Summarize from Human Feedback." NeurIPS. An early demonstration of RLHF for text summarization, showing that models trained with human feedback produce summaries preferred over those from supervised baselines. Establishes key techniques for reward modeling and KL-constrained optimization.

  • Ziegler, D. M., et al. (2019). "Fine-Tuning Language Models from Human Preferences." arXiv. One of the earliest applications of RLHF to language models, demonstrating the approach on stylistic continuation and summarization. Introduces the per-token KL penalty that became standard in subsequent work.
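
For orientation, the KL-penalized objective that runs through these three papers can be written, in common notation (with $\pi_\theta$ the policy being tuned, $\pi_{\mathrm{ref}}$ the frozen reference model, $r_\phi$ the learned reward, and $\beta$ the KL coefficient), as:

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\, \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
$$

The per-token variant introduced by Ziegler et al. folds the KL term into the reward as a penalty applied at each generation step.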

Reinforcement Learning

  • Schulman, J., et al. (2017). "Proximal Policy Optimization Algorithms." arXiv. The PPO paper introducing the clipped surrogate objective that enables stable policy gradient training. PPO became the default RL algorithm for RLHF due to its balance of simplicity and reliability.
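
The clipped surrogate objective at the heart of PPO is, in the paper's notation (with $r_t(\theta)$ the probability ratio between the new and old policies and $\hat{A}_t$ the advantage estimate):

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$

The clip keeps each update close to the data-collecting policy, which is part of what makes PPO tolerant of noisy reward signals in RLHF.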

Reward Modeling

  • Bradley, R. A., & Terry, M. E. (1952). "Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons." Biometrika. The original statistical paper on paired comparisons that forms the mathematical foundation of reward modeling. The Bradley-Terry model's assumption that preference probability is a logistic function of quality difference remains central to modern alignment.

Direct Preference Optimization and Variants

DPO

  • Rafailov, R., et al. (2023). "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model." NeurIPS. The DPO paper showing that the KL-constrained RLHF objective has a closed-form optimal policy, which lets the preference loss be written directly on the policy and bypasses explicit reward model training and RL. Reports performance competitive with PPO-based RLHF at a fraction of the computational cost. Essential reading for modern alignment.
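
For reference, the DPO loss derived in the paper (with $(y_w, y_l)$ the preferred and rejected responses and $\beta$ the implicit KL strength) is:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

The scaled log-probability ratios act as an implicit reward, hence the paper's subtitle.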

IPO

  • Azar, M. G., et al. (2024). "A General Theoretical Paradigm to Understand Learning from Human Feedback." AISTATS. Introduces Identity Preference Optimization (IPO), which replaces DPO's Bradley-Terry-based log-sigmoid loss with a bounded regression loss on preference log-ratios. Addresses theoretical concerns about DPO overfitting when observed preferences are nearly deterministic, and situates RLHF and DPO within a common framework.
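
Up to notational differences from the paper, the IPO objective regresses the preference log-ratio gap onto a constant target set by the regularization strength $\tau$:

$$
\mathcal{L}_{\mathrm{IPO}}(\theta) = \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\left(\log \frac{\pi_\theta(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}{\pi_\theta(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)} - \frac{1}{2\tau}\right)^{\!2}\right]
$$

Because the target is bounded, the loss does not push the policy to make the rejected response arbitrarily unlikely.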

KTO

  • Ethayarajh, K., et al. (2024). "KTO: Model Alignment as Prospect Theoretic Optimization." arXiv. Introduces KTO for alignment from unpaired feedback (binary good/bad labels). Draws on Kahneman and Tversky's prospect theory to model human preferences asymmetrically, reflecting loss aversion. Practically important because unpaired feedback is cheaper to collect.

ORPO

  • Hong, J., et al. (2024). "ORPO: Monolithic Preference Optimization Without Reference Model." arXiv. Proposes combining SFT and preference optimization into a single stage using an odds-ratio penalty. Eliminates both the RL stage and the reference model, yielding one of the simplest alignment pipelines available.
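
Schematically (hedging on the paper's exact notation), ORPO adds an odds-ratio penalty to the standard SFT loss on the preferred response, with no reference model involved:

$$
\mathcal{L}_{\mathrm{ORPO}} = \mathcal{L}_{\mathrm{SFT}}(y_w) + \lambda \left(-\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right)\right),
\qquad \mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
$$

where $P_\theta(y \mid x)$ is a length-normalized sequence likelihood and $\lambda$ weights the preference term against the SFT term.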

SimPO

  • Meng, Y., et al. (2024). "SimPO: Simple Preference Optimization with a Reference-Free Reward." arXiv. Uses the length-normalized average log probability as the implicit reward, eliminating the reference model. The length normalization counteracts the verbosity bias that DPO-trained models often exhibit.
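
Stated loosely, SimPO scores each response by its length-normalized log probability and asks for a margin $\gamma$ between preferred and rejected responses:

$$
r_{\mathrm{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x),
\qquad
\mathcal{L}_{\mathrm{SimPO}}(\theta) = -\,\mathbb{E}\big[\log \sigma\big(r_{\mathrm{SimPO}}(x, y_w) - r_{\mathrm{SimPO}}(x, y_l) - \gamma\big)\big]
$$

Dividing by the response length $|y|$ removes the incentive to win comparisons simply by generating more tokens.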

Constitutional AI and AI Feedback

  • Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv. Introduces the CAI framework where models critique and revise their own outputs according to explicit principles. Demonstrates that AI feedback can scale alignment without proportionally scaling human annotation.

  • Lee, H., et al. (2023). "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback." arXiv. Compares RLHF (human feedback) with RLAIF (AI feedback) and finds that AI feedback can approach human feedback quality for many alignment tasks. Establishes practical guidelines for when AI feedback is sufficient.

Red Teaming and Safety

  • Perez, E., et al. (2022). "Red Teaming Language Models with Language Models." EMNLP. Demonstrates automated red teaming using LLMs to generate adversarial prompts. Shows that automated approaches can discover failure modes that human red teamers miss, while scaling to thousands of test cases.

  • Mazeika, M., et al. (2024). "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal." arXiv. Introduces a standardized benchmark for evaluating LLM safety across multiple attack categories and defense strategies. Provides reproducible metrics for comparing alignment methods.

  • Lin, S., et al. (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods." ACL. A benchmark testing whether models generate truthful answers to questions where humans commonly have misconceptions. Important for evaluating the honesty dimension of alignment.

Surveys and Overviews

  • Kaufmann, T., et al. (2024). "A Survey of Reinforcement Learning from Human Feedback." arXiv. A comprehensive survey covering the full RLHF landscape: reward modeling, PPO, DPO, and variants. Provides clear comparisons and a unified framework for understanding different alignment approaches.

  • Wang, Z., et al. (2024). "Secrets of RLHF in Large Language Models." ICLR. Provides practical insights into RLHF implementation, including reward model training strategies, PPO hyperparameter tuning, and common failure modes. Valuable for practitioners implementing alignment pipelines.

  • Shen, W., et al. (2023). "Large Language Model Alignment: A Survey." arXiv. A broad survey covering alignment methods from RLHF through DPO and Constitutional AI, including discussion of evaluation approaches and open problems in the field.

Preference Data and Annotation

  • Bai, Y., et al. (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv. Describes Anthropic's approach to collecting preference data and training aligned models. Provides detailed annotation guidelines, quality control procedures, and analysis of annotator agreement.

  • Dubois, Y., et al. (2023). "AlpacaFarm: A Simulation Framework for Methods That Learn from Human Feedback." NeurIPS. Introduces a simulation environment for testing alignment methods using LLM-based annotators. Enables rapid prototyping and comparison of alignment approaches without expensive human annotation.

Libraries and Tools

  • HuggingFace TRL Library (https://huggingface.co/docs/trl). The primary library for alignment training, providing DPOTrainer, PPOTrainer, and reward model training utilities. Integrates with PEFT for memory-efficient alignment. A minimal usage sketch follows this list.

  • HuggingFace Alignment Handbook (https://github.com/huggingface/alignment-handbook). Recipes and best practices for aligning language models using TRL. Includes configurations for SFT, DPO, and ORPO with various model families.

  • OpenRLHF (https://github.com/OpenRLHF/OpenRLHF). An open-source framework for large-scale RLHF training with Ray and DeepSpeed integration. Supports PPO, DPO, and reward modeling with efficient distributed training.
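
To make the TRL entry above concrete, here is a minimal sketch of a DPO run. It is illustrative only: the model and dataset identifiers are placeholders, and argument names (for example `processing_class` versus the older `tokenizer` keyword) vary between TRL releases, so check the documentation for the version you have installed.

```python
# Minimal DPO fine-tuning sketch with TRL. Illustrative only: the model and
# dataset names below are placeholders, and keyword arguments differ slightly
# across TRL releases.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "your-org/your-sft-model"  # placeholder: an SFT checkpoint to align
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference data with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("your-org/your-preference-data", split="train")  # placeholder

config = DPOConfig(
    output_dir="dpo-output",
    beta=0.1,                      # strength of the implicit KL constraint
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                   # a frozen reference copy is created internally
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,    # called `tokenizer=` in older TRL versions
)
trainer.train()
```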

Looking Ahead

The alignment concepts from this chapter connect to upcoming topics:

  • Chapter 26 (Vision Transformers): Alignment principles extend to multimodal models where visual content introduces new safety challenges.
  • Chapter 31 (RAG): Retrieval-augmented generation changes the alignment landscape by grounding model outputs in external knowledge, reducing hallucination but introducing new failure modes.
  • Chapter 39 (Ethics and Responsible AI): Alignment is one component of responsible AI deployment, which also includes fairness, transparency, and accountability considerations.