Chapter 36: Further Reading
Foundational Texts
- Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press. Available at: http://incompleteideas.net/book/the-book-2nd.html. The definitive textbook on RL, covering MDPs, dynamic programming, Monte Carlo methods, TD learning, policy gradients, and function approximation. Essential reading for any serious study of RL.
- Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Morgan & Claypool. A concise mathematical treatment of RL algorithms, suitable for readers comfortable with formal proofs and convergence analysis.
Tabular Methods
- Watkins, C. J. C. H. & Dayan, P. (1992). "Q-Learning." Machine Learning, 8(3-4), 279–292. The original Q-learning paper, proving convergence of tabular Q-learning to the optimal action-value function.
- Rummery, G. A. & Niranjan, M. (1994). "On-Line Q-Learning Using Connectionist Systems." Technical Report CUED/F-INFENG/TR 166, University of Cambridge. Introduces SARSA, the on-policy counterpart to Q-learning.
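The difference between these two papers comes down to a single term in the tabular update. A minimal sketch, assuming `Q` is a NumPy array indexed by (state, action); the step size and discount values are illustrative defaults, not prescriptions from either paper:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy (Watkins & Dayan): bootstrap from the greedy action in s_next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy (Rummery & Niranjan): bootstrap from the action actually taken in s_next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```

SARSA's target depends on the next action the behavior policy actually selects, which is precisely what makes it on-policy.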
Deep Reinforcement Learning
- Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). "Human-Level Control Through Deep Reinforcement Learning." Nature, 518, 529–533. The DQN paper that demonstrated superhuman Atari game play using experience replay and target networks.
- Van Hasselt, H., Guez, A., & Silver, D. (2016). "Deep Reinforcement Learning with Double Q-Learning." AAAI 2016. Introduces Double DQN, which addresses the overestimation bias in standard DQN (the two target computations are contrasted in the sketch after this list).
- Wang, Z., Schaul, T., Hessel, M., et al. (2016). "Dueling Network Architectures for Deep Reinforcement Learning." ICML 2016. Introduces the dueling architecture, which separates value and advantage estimation.
- Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). "Prioritized Experience Replay." ICLR 2016. Improves sample efficiency by prioritizing transitions with high TD error in the replay buffer.
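As noted above, Double DQN changes only how the bootstrap target is formed. A minimal PyTorch sketch, assuming `online_net` and `target_net` map a batch of observations to per-action Q-values and that `reward` and `done` are float tensors; all names are illustrative:

```python
import torch

def dqn_target(reward, next_obs, done, target_net, gamma=0.99):
    # Standard DQN (Mnih et al., 2015): max over the target network's own Q-values.
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
    return reward + gamma * (1.0 - done) * next_q

def double_dqn_target(reward, next_obs, done, online_net, target_net, gamma=0.99):
    # Double DQN (Van Hasselt et al., 2016): select the action with the online
    # network, evaluate it with the target network to reduce overestimation.
    with torch.no_grad():
        best_action = online_net(next_obs).argmax(dim=1, keepdim=True)
        next_q = target_net(next_obs).gather(1, best_action).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```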
Policy Gradient Methods
- Williams, R. J. (1992). "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning." Machine Learning, 8(3-4), 229–256. The original REINFORCE paper introducing Monte Carlo policy gradient methods.
- Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). "Policy Gradient Methods for Reinforcement Learning with Function Approximation." NeurIPS 1999. States and proves the policy gradient theorem, providing the theoretical foundation for all policy gradient algorithms.
- Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015a). "Trust Region Policy Optimization." ICML 2015. Introduces TRPO, which constrains policy updates with a KL-divergence trust region to ensure monotonic improvement.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv preprint arXiv:1707.06347. Introduces PPO with the clipped surrogate objective, now the most widely used policy gradient algorithm.
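The clipped surrogate objective is compact enough to state directly. A minimal PyTorch sketch, assuming per-sample log-probabilities under the current and data-collecting policies plus precomputed advantages; the 0.2 clip range is the paper's default but still a tunable hyperparameter:

```python
import torch

def ppo_clipped_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s), computed in log space.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) of the two surrogates; negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```

Clipping removes the incentive to push the ratio outside [1 − ε, 1 + ε], which keeps updates close to the old policy without TRPO's explicit KL constraint.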
Advantage Estimation
- Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015b). "High-Dimensional Continuous Control Using Generalized Advantage Estimation." ICLR 2016. Introduces GAE, providing a principled way to trade off bias and variance in advantage estimation.
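The GAE recursion itself is short. A minimal sketch, assuming `values` carries one extra bootstrap entry for the state after the last step and that `dones` marks episode ends with 1.0; λ = 0.95 is a common default rather than a universal choice:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: len(values) == len(rewards) + 1."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        # One-step TD error for step t.
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # Exponentially weighted sum of future TD errors, cut off at episode ends.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```

Setting `lam=0` recovers the one-step TD advantage (more bias, less variance), while `lam=1` recovers Monte Carlo returns minus the value baseline (less bias, more variance).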
RLHF and LLM Alignment
- Christiano, P., Leike, J., Brown, T., et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017. The foundational paper on learning reward functions from human preference comparisons.
- Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS 2022. The InstructGPT paper describing the three-step RLHF pipeline: supervised fine-tuning, reward model training, and PPO optimization.
- Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model." NeurIPS 2023. Introduces DPO, which eliminates the separate reward model by reformulating preference learning as a classification problem on paired responses (a sketch of the loss follows this list).
- Shao, Z., Wang, P., Zhu, Q., et al. (2024). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv preprint arXiv:2402.03300. Introduces GRPO (Group Relative Policy Optimization), which removes the value model from PPO by normalizing rewards within a group of sampled completions, reducing memory requirements.
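Two of the ideas above reduce to short computations. A minimal PyTorch sketch, assuming summed per-response log-probabilities are already available for the policy and the frozen reference model; β = 0.1 and the ε term are illustrative defaults:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO (Rafailov et al., 2023): classify the preferred response via implicit rewards."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin between chosen and rejected responses.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

def grpo_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantages (Shao et al., 2024): normalize each completion's reward
    against the mean and std of its group of samples for the same prompt, so no
    learned value model is required."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)
```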
Exploration
- Bellemare, M. G., Srinivasan, S., Ostrovski, G., et al. (2016). "Unifying Count-Based Exploration and Intrinsic Motivation." NeurIPS 2016. Connects count-based exploration to intrinsic motivation through pseudo-counts.
- Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2019). "Exploration by Random Network Distillation." ICLR 2019. Introduces RND, using the prediction error of a random network as an intrinsic reward for exploration.
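The RND bonus needs only two small networks. A minimal PyTorch sketch; the MLP encoders and layer sizes are illustrative choices, not the paper's Atari architecture:

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Random Network Distillation: the predictor's error against a fixed random
    target network serves as the intrinsic reward, staying high for rarely seen
    observations and shrinking as the predictor is trained on visited states."""
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)  # the target network is random and never trained

    def forward(self, obs):
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        # Per-observation prediction error: used as the intrinsic reward, and its
        # mean is the loss that trains the predictor.
        return (pred_feat - target_feat).pow(2).mean(dim=1)
```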
Practical Guides
- Gymnasium Documentation. (2024). Available at: https://gymnasium.farama.org. Official documentation for Gymnasium (the successor to OpenAI Gym), the standard environment interface for RL research (a minimal interaction loop is sketched after this list).
- Stable-Baselines3. (2024). Available at: https://stable-baselines3.readthedocs.io. A reliable library of RL algorithm implementations in PyTorch, widely used for benchmarking and prototyping.
- Henderson, P., Islam, R., Bachman, P., et al. (2018). "Deep Reinforcement Learning That Matters." AAAI 2018. A critical examination of reproducibility in deep RL, highlighting the impact of hyperparameters, random seeds, and evaluation methodology.
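For readers who want to try the tools above, a minimal sketch of the Gymnasium interaction loop followed by the corresponding Stable-Baselines3 call; CartPole-v1, the seed, and the timestep budget are arbitrary example choices:

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Random-policy rollout using the standard Gymnasium API.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()  # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated
env.close()
print(f"Episode return: {episode_return}")

# The same task trained with a packaged PPO implementation from Stable-Baselines3.
model = PPO("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=10_000)
```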