Chapter 36: Further Reading
Foundational Texts
- Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press. Available at: http://incompleteideas.net/book/the-book-2nd.html. The definitive textbook on RL, covering MDPs, dynamic programming, Monte Carlo methods, TD learning, policy gradients, and function approximation. Essential reading for any serious study of RL.
- Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Morgan & Claypool. A concise mathematical treatment of RL algorithms, suitable for readers comfortable with formal proofs and convergence analysis.
Tabular Methods
- Watkins, C. J. C. H. & Dayan, P. (1992). "Q-Learning." Machine Learning, 8(3-4), 279–292. The original Q-learning paper, proving convergence of tabular Q-learning to the optimal action-value function.
- Rummery, G. A. & Niranjan, M. (1994). "On-Line Q-Learning Using Connectionist Systems." Technical Report CUED/F-INFENG/TR 166, University of Cambridge. Introduces SARSA, the on-policy counterpart to Q-learning.
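The difference between these two papers comes down to a single term in the tabular update. A minimal sketch, assuming `Q` is a NumPy array indexed by (state, action); the step size and discount values are illustrative defaults, not prescriptions from either paper:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy (Watkins & Dayan): bootstrap from the greedy action in s_next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy (Rummery & Niranjan): bootstrap from the action actually taken in s_next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```

SARSA's target depends on the next action the behavior policy actually selects, which is precisely what makes it on-policy.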
Deep Reinforcement Learning
- Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). "Human-Level Control Through Deep Reinforcement Learning." Nature, 518, 529–533. The DQN paper that demonstrated superhuman Atari game play using experience replay and target networks.
- Van Hasselt, H., Guez, A., & Silver, D. (2016). "Deep Reinforcement Learning with Double Q-Learning." AAAI 2016. Introduces Double DQN, which addresses the overestimation bias in standard DQN (the two target computations are contrasted in the sketch after this list).
- Wang, Z., Schaul, T., Hessel, M., et al. (2016). "Dueling Network Architectures for Deep Reinforcement Learning." ICML 2016. Introduces the dueling architecture, which separates value and advantage estimation.
- Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). "Prioritized Experience Replay." ICLR 2016. Improves sample efficiency by prioritizing transitions with high TD error in the replay buffer.
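As noted above, Double DQN changes only how the bootstrap target is formed. A minimal PyTorch sketch, assuming `online_net` and `target_net` map a batch of observations to per-action Q-values and that `reward` and `done` are float tensors; all names are illustrative:

```python
import torch

def dqn_target(reward, next_obs, done, target_net, gamma=0.99):
    # Standard DQN (Mnih et al., 2015): max over the target network's own Q-values.
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
    return reward + gamma * (1.0 - done) * next_q

def double_dqn_target(reward, next_obs, done, online_net, target_net, gamma=0.99):
    # Double DQN (Van Hasselt et al., 2016): select the action with the online
    # network, evaluate it with the target network to reduce overestimation.
    with torch.no_grad():
        best_action = online_net(next_obs).argmax(dim=1, keepdim=True)
        next_q = target_net(next_obs).gather(1, best_action).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```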
Policy Gradient Methods
- Williams, R. J. (1992). "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning." Machine Learning, 8(3-4), 229–256. The original REINFORCE paper introducing Monte Carlo policy gradient methods.
- Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). "Policy Gradient Methods for Reinforcement Learning with Function Approximation." NeurIPS 1999. States and proves the policy gradient theorem, providing the theoretical foundation for all policy gradient algorithms.
- Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015a). "Trust Region Policy Optimization." ICML 2015. Introduces TRPO, which constrains policy updates with a KL-divergence trust region to ensure monotonic improvement.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv preprint arXiv:1707.06347. Introduces PPO with the clipped surrogate objective, now the most widely used policy gradient algorithm.
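The clipped surrogate objective is compact enough to state directly. A minimal PyTorch sketch, assuming per-sample log-probabilities under the current and data-collecting policies plus precomputed advantages; the 0.2 clip range is the paper's default but still a tunable hyperparameter:

```python
import torch

def ppo_clipped_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s), computed in log space.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) of the two surrogates; negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```

Clipping removes the incentive to push the ratio outside [1 − ε, 1 + ε], which keeps updates close to the old policy without TRPO's explicit KL constraint.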
Advantage Estimation
- Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015b). "High-Dimensional Continuous Control Using Generalized Advantage Estimation." ICLR 2016. Introduces GAE, providing a principled way to trade off bias and variance in advantage estimation.
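The GAE recursion itself is short. A minimal sketch, assuming `values` carries one extra bootstrap entry for the state after the last step and that `dones` marks episode ends with 1.0; λ = 0.95 is a common default rather than a universal choice:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: len(values) == len(rewards) + 1."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        # One-step TD error for step t.
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # Exponentially weighted sum of future TD errors, cut off at episode ends.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```

Setting `lam=0` recovers the one-step TD advantage (more bias, less variance), while `lam=1` recovers Monte Carlo returns minus the value baseline (less bias, more variance).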
RLHF and LLM Alignment
- Christiano, P., Leike, J., Brown, T., et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017. The foundational paper on learning reward functions from human preference comparisons.
- Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS 2022. The InstructGPT paper describing the three-step RLHF pipeline: supervised fine-tuning, reward model training, and PPO optimization.
- Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model." NeurIPS 2023. Introduces DPO, which eliminates the separate reward model by reformulating preference learning as a classification problem on paired responses (a sketch of the loss follows this list).
- Shao, Z., Wang, P., Zhu, Q., et al. (2024). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv preprint arXiv:2402.03300. Introduces GRPO (Group Relative Policy Optimization), which removes the value model from PPO by normalizing rewards within a group of sampled completions, reducing memory requirements.
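Two of the ideas above reduce to short computations. A minimal PyTorch sketch, assuming summed per-response log-probabilities are already available for the policy and the frozen reference model; β = 0.1 and the ε term are illustrative defaults:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO (Rafailov et al., 2023): classify the preferred response via implicit rewards."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin between chosen and rejected responses.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

def grpo_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantages (Shao et al., 2024): normalize each completion's reward
    against the mean and std of its group of samples for the same prompt, so no
    learned value model is required."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)
```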
Exploration
- Bellemare, M. G., Srinivasan, S., Ostrovski, G., et al. (2016). "Unifying Count-Based Exploration and Intrinsic Motivation." NeurIPS 2016. Connects count-based exploration to intrinsic motivation through pseudo-counts.
- Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2019). "Exploration by Random Network Distillation." ICLR 2019. Introduces RND, using the prediction error of a random network as an intrinsic reward for exploration.
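The RND bonus needs only two small networks. A minimal PyTorch sketch; the MLP encoders and layer sizes are illustrative choices, not the paper's Atari architecture:

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Random Network Distillation: the predictor's error against a fixed random
    target network serves as the intrinsic reward, staying high for rarely seen
    observations and shrinking as the predictor is trained on visited states."""
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)  # the target network is random and never trained

    def forward(self, obs):
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        # Per-observation prediction error: used as the intrinsic reward, and its
        # mean is the loss that trains the predictor.
        return (pred_feat - target_feat).pow(2).mean(dim=1)
```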
Practical Guides
- Gymnasium Documentation. (2024). Available at: https://gymnasium.farama.org. Official documentation for Gymnasium (the successor to OpenAI Gym), the standard environment interface for RL research (a minimal interaction loop is sketched after this list).
- Stable-Baselines3. (2024). Available at: https://stable-baselines3.readthedocs.io. A reliable library of RL algorithm implementations in PyTorch, widely used for benchmarking and prototyping.
- Henderson, P., Islam, R., Bachman, P., et al. (2018). "Deep Reinforcement Learning That Matters." AAAI 2018. A critical examination of reproducibility in deep RL, highlighting the impact of hyperparameters, random seeds, and evaluation methodology.
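For readers who want to try the tools above, a minimal sketch of the Gymnasium interaction loop followed by the corresponding Stable-Baselines3 call; CartPole-v1, the seed, and the timestep budget are arbitrary example choices:

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Random-policy rollout using the standard Gymnasium API.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()  # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated
env.close()
print(f"Episode return: {episode_return}")

# The same task trained with a packaged PPO implementation from Stable-Baselines3.
model = PPO("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=10_000)
```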