Chapter 36: Key Takeaways

1. Reinforcement Learning Learns from Interaction, Not Labels

Unlike supervised learning, RL agents learn by interacting with an environment, receiving scalar rewards, and discovering strategies that maximize cumulative return. There is no dataset of correct actions. This trial-and-error framework underpins game-playing AI, robotic control, and the RLHF alignment of large language models.
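
As a minimal sketch of that interaction loop, the toy environment and random policy below are purely illustrative:

```python
import random

# Toy one-dimensional environment: the agent starts at position 0 and is
# rewarded only when it reaches position 3. There is no labeled "correct" action.
class LineWorld:
    def __init__(self):
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):              # action is -1 (left) or +1 (right)
        self.pos += action
        done = self.pos == 3
        reward = 1.0 if done else 0.0    # scalar reward signal
        return self.pos, reward, done

env = LineWorld()
state, episode_return = env.reset(), 0.0
for t in range(20):                      # one trial-and-error rollout
    action = random.choice([-1, 1])      # random policy; learning would improve on this
    state, reward, done = env.step(action)
    episode_return += reward
    if done:
        break
print("return from this episode:", episode_return)
```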

2. MDPs Provide the Mathematical Foundation

Markov Decision Processes formalize RL as a tuple of states, actions, transition dynamics, rewards, and a discount factor. The Markov property -- that the future depends only on the current state, not the history -- enables tractable algorithms. Understanding MDPs is essential for correctly formulating any RL problem.
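
As a concrete illustration, a small finite MDP can be written down as explicit tables, which makes the tuple structure visible (the two-state example below is invented for this sketch):

```python
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    states: list        # S
    actions: list       # A
    transitions: dict   # P[(s, a)] -> {s_next: probability}
    rewards: dict       # R[(s, a, s_next)] -> float
    gamma: float        # discount factor

# A machine that earns more reward running "fast" but risks overheating.
mdp = FiniteMDP(
    states=["cool", "hot"],
    actions=["slow", "fast"],
    transitions={
        ("cool", "slow"): {"cool": 1.0},
        ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
        ("hot", "slow"):  {"cool": 0.5, "hot": 0.5},
        ("hot", "fast"):  {"hot": 1.0},
    },
    rewards={
        ("cool", "slow", "cool"): 1.0,
        ("cool", "fast", "cool"): 2.0,
        ("cool", "fast", "hot"): 2.0,
        ("hot", "slow", "cool"): 1.0,
        ("hot", "slow", "hot"): 1.0,
        ("hot", "fast", "hot"): -10.0,
    },
    gamma=0.9,
)
```

Note the Markov property at work: the transition probabilities depend only on the current state and action, never on how the agent arrived there.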

3. The Bellman Equations Are the Basis of Nearly Every RL Algorithm

The Bellman equations express a recursive relationship: the value of a state equals the expected immediate reward plus the discounted value of its successor states. Q-learning, DQN, policy gradients, and PPO all derive from this principle, either by explicitly estimating value functions or by optimizing policies that implicitly satisfy the optimality conditions.
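
In standard notation (assumed here, with transition probabilities P and discount factor γ), the Bellman optimality equations are:

```latex
V^*(s)   = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big]
Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^*(s', a')
```

Value-based methods estimate these functions directly; policy-based methods optimize policies whose value functions satisfy them at the optimum.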

4. Q-Learning Converges to the Optimal Policy Without a Model

Q-learning is a model-free, off-policy algorithm that directly learns the optimal action-value function. It updates each Q-value toward the observed reward plus the discounted maximum next-state Q-value. Convergence to the optimal policy is guaranteed provided every state-action pair is visited sufficiently often and the learning rate decays appropriately, making it a foundational algorithm for discrete state-action spaces.
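
A tabular sketch of the update rule, assuming a simple environment interface with reset() returning a state and step(action) returning (next_state, reward, done); all hyperparameter values are illustrative:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy behavior policy; the update below bootstraps from the
            # greedy (max) action instead, which is what makes Q-learning off-policy.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```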

5. DQN Made Deep RL Practical Through Two Key Innovations

Experience replay and target networks solved the instability of combining neural networks with TD learning. Replay breaks temporal correlations in the training data, while the target network provides a stable optimization target. Extensions like Double DQN and Dueling DQN further improve stability and accuracy.
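
A compact PyTorch sketch of the two mechanisms; network sizes, buffer capacity, and hyperparameters are illustrative, and transitions are assumed to be stored as (state, action, reward, next_state, done) tuples with states as float sequences:

```python
import random
from collections import deque

import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())    # target starts as a copy of the online net

replay = deque(maxlen=100_000)                     # experience replay buffer
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)      # random sampling breaks temporal correlations
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    q = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # bootstrap from the slowly-updated target net
        target = r.float() + gamma * (1 - done.float()) * target_net(s2.float()).max(1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically (e.g. every few thousand environment steps):
# target_net.load_state_dict(q_net.state_dict())
```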

6. Policy Gradient Methods Naturally Handle Continuous Actions

Policy gradient methods parameterize the policy directly and optimize it via gradient ascent on expected return. This enables handling continuous action spaces (impossible for standard DQN) and learning stochastic policies. The REINFORCE algorithm is the simplest instance, and subtracting a baseline reduces variance without introducing bias.
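
In standard notation (assumed here), the REINFORCE gradient estimator with a baseline is:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big( G_t - b(s_t) \big) \right]
```

Because the baseline b(s_t) depends only on the state, subtracting it leaves the expectation unchanged while reducing the variance of the estimate; a learned state-value function is the usual choice of baseline.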

7. PPO Is the Workhorse of Modern Deep RL

Proximal Policy Optimization combines clipped surrogate objectives, Generalized Advantage Estimation, and an entropy bonus into a simple yet effective algorithm. The clipping mechanism prevents destructive policy updates without the complexity of TRPO's constrained optimization. PPO is the default choice for both game environments and RLHF fine-tuning.
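
The clipping mechanism itself is only a few lines; a sketch with illustrative tensor arguments and clip range:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)             # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum removes any incentive to push the ratio
    # outside the clip range, which is what prevents destructive updates.
    return -torch.min(unclipped, clipped).mean()
```

In the full algorithm this term is combined with a value-function loss and the entropy bonus, and the advantages come from GAE.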

8. The Exploration-Exploitation Trade-Off Is Fundamental

An agent must balance exploiting known good actions with exploring unknown actions that might be better. Epsilon-greedy, Boltzmann exploration, and UCB provide different approaches. Getting this balance right is often more important than the choice of learning algorithm, especially in environments with sparse or deceptive rewards.
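
For a discrete action set, the three strategies can be sketched as follows; all hyperparameter values are illustrative:

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(q_values))        # explore uniformly at random
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, temperature=1.0):
    # Softmax over value estimates: better actions are sampled more often,
    # but every action keeps some probability mass.
    prefs = [math.exp(q / temperature) for q in q_values]
    return random.choices(range(len(q_values)), weights=prefs)[0]

def ucb(q_values, counts, t, c=2.0):
    # Optimism in the face of uncertainty: prefer high value or rarely tried actions.
    def score(a):
        if counts[a] == 0:
            return float("inf")
        return q_values[a] + c * math.sqrt(math.log(t + 1) / counts[a])
    return max(range(len(q_values)), key=score)
```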

9. RLHF Connects RL to Language Model Alignment

Reinforcement Learning from Human Feedback trains a reward model on human preference data, then uses PPO to fine-tune a language model against that reward model. DPO simplifies this by eliminating the explicit reward model, and GRPO further reduces memory by removing the need for a value function. Understanding these connections is essential for modern AI engineering.
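
As a rough sketch of how DPO folds the reward model into the policy objective, the loss below compares per-sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model; the argument names and beta value are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: how much more likely the policy makes each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between the chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```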

10. Reward Design Is the Most Critical and Underappreciated Challenge

A poorly designed reward function leads to reward hacking: the agent finds unintended shortcuts that maximize the reward signal without achieving the desired behavior. Reward engineering requires careful thought about edge cases, implicit incentives, and alignment between the mathematical objective and the true goal. This challenge directly parallels the alignment problem in large language models.
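
A contrived illustration of the failure mode: suppose a cleaning robot is rewarded per unit of dirt vacuumed. The proxy reward below is maximized by dumping dirt back out and re-collecting it, whereas rewarding the reduction in remaining dirt tracks the true goal more closely (both functions are invented for this sketch):

```python
def proxy_reward(dirt_vacuumed_this_step):
    # Hackable: the agent can create more dirt to vacuum and get paid for it.
    return dirt_vacuumed_this_step

def shaped_reward(dirt_remaining_prev, dirt_remaining):
    # Rewards actual progress toward the true goal: less dirt left in the room.
    return dirt_remaining_prev - dirt_remaining
```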