Chapter 36 Exercises: Reinforcement Learning for AI Engineers
Conceptual Exercises
Exercise 1: MDP Identification
For each scenario below, identify the state space, action space, reward signal, and discount factor you would choose. Justify your choices.
a) A robot vacuum cleaning a house
b) A stock trading bot managing a portfolio
c) An automated traffic light controller
Exercise 2: Markov Property Analysis
Which of the following problems satisfy the Markov property? For those that do not, describe what additional information you would include in the state to make them approximately Markov.
a) Playing chess (observing the full board)
b) Predicting the next word in a sentence
c) A self-driving car navigating an intersection
d) A thermostat controlling room temperature
Exercise 3: Discount Factor Impact
Consider an MDP with two policies:
- Policy A: Receives reward +1 at every time step for 10 steps, then 0 forever.
- Policy B: Receives reward 0 for the first 5 steps, then +3 for the next 10 steps.
Calculate the total discounted return for each policy with $\gamma = 0.5$, $\gamma = 0.9$, and $\gamma = 0.99$. Which policy is preferred under each setting?
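As a sanity check on your hand calculations, a short helper like the one below computes the discounted return of any finite reward sequence (the function name and the reward lists simply encode the two policies above):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence, starting at t = 0."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Policy A: +1 for 10 steps; Policy B: 0 for 5 steps, then +3 for 10 steps.
policy_a = [1.0] * 10
policy_b = [0.0] * 5 + [3.0] * 10
for gamma in (0.5, 0.9, 0.99):
    print(f"gamma={gamma}: A={discounted_return(policy_a, gamma):.3f}, "
          f"B={discounted_return(policy_b, gamma):.3f}")
```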
Exercise 4: Bellman Equation Derivation
Starting from the definition $V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]$, derive the Bellman expectation equation for $V^\pi$. Show each step clearly.
Exercise 5: Q-Learning vs. SARSA
Consider a cliff-walking environment where the agent must navigate along a cliff edge to reach a goal. The agent gets -100 reward for falling off the cliff and -1 for each step.
a) Explain why Q-learning and SARSA would learn different paths.
b) Which algorithm would produce a safer policy during training? Why?
c) Which would find the optimal policy more quickly?
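For reference, the two algorithms differ only in the bootstrap target of the TD update:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right] \quad \text{(Q-learning)}$$

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma\, Q(s',a') - Q(s,a)\right] \quad \text{(SARSA, where } a' \text{ is the action actually taken in } s'\text{)}$$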
Exercise 6: Exploration Trade-offs
Compare epsilon-greedy, Boltzmann, and UCB exploration strategies along these dimensions:
a) Computational complexity
b) Sensitivity to hyperparameters
c) Performance in environments with many similar-valued actions
d) Behavior as training progresses
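To ground the comparison, here is a minimal sketch of the three action-selection rules for a tabular setting (the function names and the exploration constant `c` are illustrative choices, not part of the exercise):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature):
    # Sample proportionally to exp(Q / tau); lower tau means greedier behavior.
    prefs = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

def ucb(q_values, counts, t, c=2.0):
    # Optimism bonus that shrinks as an action's visit count grows.
    counts = np.asarray(counts, dtype=float)
    bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-8))
    return int(np.argmax(np.asarray(q_values) + bonus))
```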
Exercise 7: DQN Innovations
Explain why each of the following DQN improvements is necessary:
a) Experience replay: What problem does it solve? What happens without it?
b) Target network: What instability does it address?
c) Double DQN: What bias does it correct? Provide a concrete example.
Exercise 8: Policy Gradient Variance
The REINFORCE algorithm suffers from high variance.
a) Explain mathematically why using the full return $G_t$ leads to high variance.
b) Show that subtracting a baseline $b(s)$ does not introduce bias.
c) Why is $V(s)$ a good baseline choice?
Exercise 9: PPO Clipping Analysis
Draw the PPO clipped objective function $L^{\text{CLIP}}$ as a function of the probability ratio $r(\theta)$ for:
a) Positive advantage ($\hat{A} > 0$)
b) Negative advantage ($\hat{A} < 0$)
Explain the behavior in each region and why clipping helps prevent destructive policy updates.
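For reference, the objective being plotted is

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.$$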
Exercise 10: RLHF vs. DPO vs. GRPO
Compare RLHF (with PPO), DPO, and GRPO along these dimensions:
a) Need for a reward model
b) Memory requirements
c) Stability of training
d) Ability to handle nuanced preferences
e) Sample efficiency
Programming Exercises
Exercise 11: Implement SARSA
Implement a SARSA agent (on-policy TD control) for the FrozenLake environment. Compare its learning curve with that of Q-learning over 10,000 episodes. Plot the results.
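A minimal tabular SARSA loop using the Gymnasium API might look like the sketch below; the hyperparameters are placeholders, and you will want to add per-episode reward logging for the comparison:

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # placeholder hyperparameters

def act(state):
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(Q[state]))

for episode in range(10_000):
    state, _ = env.reset()
    action = act(state)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        next_action = act(next_state)
        # On-policy target: bootstrap from the action the policy actually takes next.
        target = reward + (0.0 if terminated else gamma * Q[next_state, next_action])
        Q[state, action] += alpha * (target - Q[state, action])
        state, action = next_state, next_action
        done = terminated or truncated
```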
Exercise 12: Epsilon Decay Schedules
Implement three epsilon decay schedules for Q-learning:
a) Linear decay
b) Exponential decay
c) Cosine annealing
Train each on CartPole-v1 (after discretizing the state space) for 5,000 episodes and compare learning curves.
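One way to structure the three schedules is as pure functions of the training step; the start and end values below are illustrative defaults, not prescribed by the exercise:

```python
import math

def linear_decay(step, total_steps, eps_start=1.0, eps_end=0.05):
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def exponential_decay(step, decay_rate=0.999, eps_start=1.0, eps_end=0.05):
    return max(eps_end, eps_start * decay_rate ** step)

def cosine_annealing(step, total_steps, eps_start=1.0, eps_end=0.05):
    frac = min(step / total_steps, 1.0)
    return eps_end + 0.5 * (eps_start - eps_end) * (1 + math.cos(math.pi * frac))
```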
Exercise 13: Experience Replay Variants
Implement and compare three experience replay strategies for DQN on CartPole-v1:
a) Uniform random sampling (standard)
b) Prioritized experience replay (proportional to TD error)
c) Reservoir sampling
Report the average reward over the last 100 episodes for each approach.
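For variant (b), a minimal proportional prioritized buffer might look like this; it uses an O(n) sampling pass instead of a sum-tree, which is fine at CartPole scale, and importance-sampling corrections are omitted for brevity (class and parameter names are illustrative):

```python
import numpy as np

class PrioritizedReplay:
    """Minimal proportional prioritized replay (no sum-tree; O(n) sampling)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size):
        probs = np.array(self.priorities) / sum(self.priorities)
        idx = np.random.choice(len(self.buffer), size=batch_size, p=probs)
        return [self.buffer[i] for i in idx], idx
```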
Exercise 14: Dueling DQN
Implement the dueling DQN architecture. Train both standard DQN and dueling DQN on LunarLander-v3. Compare:
a) Learning curves (mean reward over episodes)
b) Convergence speed
c) Final performance
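The core of the dueling architecture is splitting the network into value and advantage streams and recombining them with a mean-subtracted aggregation; a PyTorch sketch (layer sizes are placeholders):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, obs):
        h = self.features(obs)
        v, a = self.value(h), self.advantage(h)
        # Subtract the mean advantage so V and A are identifiable.
        return v + a - a.mean(dim=-1, keepdim=True)
```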
Exercise 15: REINFORCE with Different Baselines
Implement REINFORCE with three different baselines:
a) No baseline
b) Running average of returns
c) Learned value function baseline
Train on CartPole-v1 and compare the variance of policy gradient estimates across the three variants.
Exercise 16: GAE Lambda Sweep
Implement GAE and run a hyperparameter sweep over $\lambda \in \{0.0, 0.5, 0.9, 0.95, 0.99, 1.0\}$ on CartPole-v1 using an actor-critic agent. Plot the learning curves and discuss the bias-variance trade-off.
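A reference implementation of the GAE recursion, assuming `values` already includes a bootstrap value for the state after the last step:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """rewards, dones have length T; values has length T + 1 (bootstrap appended)."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]  # zero out the bootstrap at episode boundaries
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
    return advantages
```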
Exercise 17: PPO Implementation from Scratch
Implement PPO completely from scratch (without using existing RL libraries) for the LunarLander-v3 environment. Your implementation should include:
a) Actor-critic network with shared features
b) GAE for advantage estimation
c) Clipped surrogate objective
d) Value function loss with clipping
e) Entropy bonus
Train for 500,000 timesteps and plot the learning curve.
Exercise 18: Custom Environment
Create a custom Gymnasium environment for a simple inventory management problem:
- State: current inventory level (0-100)
- Actions: order 0, 10, 20, 30, or 40 units
- Demand: random, drawn from Poisson(15) each day
- Reward: +2 per unit sold, -1 per unit of unsatisfied demand, -0.5 per unit of excess inventory
Train a DQN agent and analyze the learned ordering policy.
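A skeleton for the environment could look like the following; the 30-day horizon, the starting inventory of 50, orders arriving before demand, and charging the excess-inventory penalty on end-of-day stock are all assumptions you may change:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class InventoryEnv(gym.Env):
    """Sketch of the inventory MDP described above."""

    def __init__(self, horizon=30):
        self.observation_space = spaces.Box(0, 100, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(5)  # order 0, 10, 20, 30, or 40 units
        self.horizon = horizon

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.inventory, self.day = 50, 0
        return np.array([self.inventory], dtype=np.float32), {}

    def step(self, action):
        self.inventory = min(100, self.inventory + 10 * action)  # order arrives first
        demand = self.np_random.poisson(15)
        sold = min(self.inventory, demand)
        unmet = demand - sold
        self.inventory -= sold
        # +2 per unit sold, -1 per unit of unmet demand, -0.5 per leftover unit.
        reward = 2.0 * sold - 1.0 * unmet - 0.5 * self.inventory
        self.day += 1
        obs = np.array([self.inventory], dtype=np.float32)
        return obs, reward, False, self.day >= self.horizon, {}
```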
Exercise 19: Reward Shaping
Consider the MountainCar-v0 environment, whose reward signal (-1 per step) gives the agent no useful guidance until the goal is first reached.
a) Train a standard DQN agent and record how many episodes it takes to first reach the goal.
b) Implement potential-based reward shaping using the car's height as the potential function.
c) Compare learning curves with and without reward shaping.
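For part (b), potential-based shaping augments the environment reward with a shaping term

$$r'(s, a, s') = r(s, a, s') + \gamma\,\Phi(s') - \Phi(s),$$

where the potential $\Phi(s)$ would here be the car's height; shaping of this form is guaranteed not to change the optimal policy.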
Exercise 20: Multi-Seed Evaluation
Train a PPO agent on CartPole-v1 with 10 different random seeds. Compute and plot:
a) Mean and standard deviation of episode rewards over training
b) The interquartile range of final performance
c) Time to reach a reward threshold of 450
Discuss the reproducibility implications.
Advanced Exercises
Exercise 21: Continuous Action Spaces
Extend the PPO implementation to handle continuous action spaces using a Gaussian policy: $$\pi_\theta(a|s) = \mathcal{N}\big(\mu_\theta(s), \sigma_\theta(s)^2\big)$$
Test on the Pendulum-v1 environment.
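A minimal Gaussian policy head in PyTorch might look like this; note it uses a state-independent log standard deviation, a common simplification of the state-dependent $\sigma_\theta(s)$ in the exercise (layer sizes are placeholders):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, act_dim)
        # State-independent log std, learned as a free parameter.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mu(self.body(obs))
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()
        # Sum per-dimension log probs to get the joint log prob of the action vector.
        return action, dist.log_prob(action).sum(-1)
```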
Exercise 22: Implement GRPO for a Simplified Setting
Implement GRPO for a bandit problem where:
- The "prompt" is an integer context $c \in \{1, \ldots, 5\}$.
- The "policy" generates a real-valued response $y$.
- The "reward" is $R(c, y) = -|y - c^2|$ (the optimal response for context $c$ is $c^2$).
Use a simple neural network as the policy and show that GRPO can learn the optimal mapping.
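The heart of GRPO is replacing a learned value baseline with a group-relative one: sample several responses for the same context and normalize their rewards within the group. A simplified update for this bandit setting, omitting GRPO's clipping and KL terms and assuming a `policy` module that returns the mean and standard deviation of a Gaussian over $y$, might look like:

```python
import torch

def grpo_step(policy, optimizer, context, group_size=8):
    """One simplified GRPO update for the bandit task; `context` is a scalar tensor."""
    mean, std = policy(context)
    dist = torch.distributions.Normal(mean, std)
    responses = dist.sample((group_size,))        # a group of responses for one context
    rewards = -(responses - context ** 2).abs()   # R(c, y) = -|y - c^2|
    # Group-relative advantage: normalize rewards within the sampled group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    loss = -(advantages.detach() * dist.log_prob(responses)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```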
Exercise 23: DPO Implementation
Implement DPO for a toy text generation task:
- Use a small GPT-2 model as both the policy and reference.
- Create synthetic preference data where shorter responses are preferred.
- Train with DPO and verify that the model learns to generate shorter responses.
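Once you have per-sequence log-probabilities (summed over response tokens) from the policy and the frozen reference model, the DPO loss itself is only a few lines; the sketch below uses an illustrative $\beta$:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO objective on batched per-sequence log probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```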
Exercise 24: Curiosity-Driven Exploration
Implement Random Network Distillation (RND) for intrinsic motivation:
a) Create a fixed random target network and a predictor network.
b) Use the prediction error as an intrinsic reward bonus.
c) Train on MountainCar-v0 and compare with vanilla DQN.
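The core of RND fits in a few lines; the sketch below assumes MountainCar's 2-dimensional observations and uses illustrative layer sizes. The same squared error serves as the predictor's training loss and, detached, as the intrinsic bonus:

```python
import torch
import torch.nn as nn

obs_dim, embed_dim = 2, 32  # MountainCar-v0 observations are 2-dimensional

target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))
for p in target.parameters():
    p.requires_grad_(False)  # the target network stays fixed and random

def rnd_error(obs):
    """Squared prediction error against the frozen target network.

    Backpropagate through this to train the predictor; call .detach() on the
    result when using it as the intrinsic reward bonus.
    """
    with torch.no_grad():
        t = target(obs)
    return ((predictor(obs) - t) ** 2).mean(dim=-1)
```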
Exercise 25: Model-Based RL
Implement a simple model-based RL approach for CartPole:
a) Collect initial data with a random policy.
b) Train a neural network dynamics model: $\hat{s}_{t+1} = f_\theta(s_t, a_t)$.
c) Use the learned model for planning (generate imagined trajectories and select the best action sequence).
d) Compare sample efficiency with model-free DQN.
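For step (b), a one-step dynamics model can be as simple as an MLP over the concatenated state and one-hot action; the sketch below uses CartPole's dimensions and placeholder layer sizes (predicting the state delta $s_{t+1} - s_t$ instead of $s_{t+1}$ is a common refinement):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicsModel(nn.Module):
    """One-step model: predicts s_{t+1} from the state and a one-hot action."""

    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim))

    def forward(self, state, action):
        # state: (batch, obs_dim); action: (batch,) integer tensor
        action_onehot = F.one_hot(action, num_classes=self.n_actions).float()
        return self.net(torch.cat([state, action_onehot], dim=-1))
```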
Exercise 26: Offline RL
Implement Conservative Q-Learning (CQL) for offline RL:
a) Collect a dataset of transitions using a partially trained DQN agent.
b) Train a standard DQN on this offline dataset (no environment interaction).
c) Train a CQL agent on the same dataset (add a conservative regularizer to penalize overestimation of out-of-distribution actions).
d) Compare the performance of both agents when evaluated in the environment.
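For part (c) with discrete actions, the conservative regularizer added to the usual TD loss can be written as a logsumexp over all actions minus the Q-value of the dataset action; a sketch (weight it by a coefficient $\alpha$ and add it to the DQN loss):

```python
import torch

def cql_regularizer(q_values, dataset_actions):
    """q_values: (batch, n_actions); dataset_actions: (batch,) integer indices.

    Pushes down Q-values on all actions (via logsumexp) while pushing up
    Q-values on the actions that actually appear in the offline dataset.
    """
    logsumexp_q = torch.logsumexp(q_values, dim=1)
    data_q = q_values.gather(1, dataset_actions.unsqueeze(1)).squeeze(1)
    return (logsumexp_q - data_q).mean()
```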
Exercise 27: Hierarchical RL
Design and implement a two-level hierarchical RL agent for a navigation task:
- High-level policy: Selects subgoals (target positions on the grid).
- Low-level policy: Selects primitive actions to reach the subgoal.
Test on a 10x10 grid world with obstacles.
Exercise 28: Multi-Agent Competition
Create a simple two-player competitive environment (e.g., a simplified version of Pong) using Gymnasium. Train two independent DQN agents against each other using self-play. Track whether the agents' policies converge toward a Nash equilibrium (for example, by monitoring win rates against earlier checkpoints).
Exercise 29: Safe RL with Constraints
Implement a constrained RL variant where the agent must maximize reward while keeping a safety cost below a threshold:
- Environment: CartPole with an added safety cost when the pole angle exceeds 10 degrees.
- Algorithm: Lagrangian relaxation of the constrained objective.
Compare the policies learned with and without the safety constraint.
Exercise 30: Reward Hacking
Design an experiment that demonstrates reward hacking:
a) Create an environment where the reward function has an unintended shortcut.
b) Train an RL agent and observe it exploiting the shortcut.
c) Fix the reward function and retrain.
Document your findings and discuss implications for AI alignment.
Exercise 31: PPO Hyperparameter Sensitivity
Conduct a systematic study of PPO hyperparameter sensitivity on LunarLander-v3:
a) Vary the clip parameter $\epsilon \in \{0.05, 0.1, 0.2, 0.3, 0.5\}$
b) Vary the number of epochs per update $\in \{1, 3, 5, 10, 20\}$
c) Vary the learning rate $\in \{10^{-5}, 3 \times 10^{-4}, 10^{-3}, 3 \times 10^{-3}\}$
Create a heatmap of final performance and identify the most sensitive hyperparameters.
Exercise 32: Imitation Learning Baseline
Implement behavioral cloning (supervised learning on expert demonstrations) for CartPole:
a) Collect 100 episodes from a trained PPO expert.
b) Train a policy using supervised learning (cross-entropy loss).
c) Compare with PPO trained from scratch in terms of sample efficiency and final performance.
d) Implement DAgger (Dataset Aggregation) and compare with pure behavioral cloning.
Exercise 33: Visualizing Value Functions
For a trained DQN agent on CartPole:
a) Plot the Q-values as a function of pole angle and angular velocity (fixing other state dimensions).
b) Overlay the policy (selected action) on the Q-value landscape.
c) Identify regions where the agent is most uncertain.
Exercise 34: Natural Policy Gradients
Implement the natural policy gradient algorithm:
a) Compute the Fisher information matrix for a simple policy.
b) Apply the natural gradient update: $\theta \leftarrow \theta + \alpha F^{-1} \nabla_\theta J(\theta)$.
c) Compare convergence with standard policy gradient on CartPole.
Exercise 35: Distributional RL
Implement C51 (categorical distributional RL):
a) Instead of learning $Q(s,a)$, learn the distribution of returns using 51 atoms.
b) Train on CartPole and compare with standard DQN.
c) Visualize the learned return distributions for different state-action pairs.
Challenge Exercises
Exercise 36: End-to-End RLHF Pipeline
Build a minimal RLHF pipeline:
a) Fine-tune a small language model (GPT-2 or similar) on a text dataset.
b) Create synthetic preference data.
c) Train a reward model on the preferences.
d) Use PPO to optimize the language model against the reward model.
e) Evaluate whether the KL constraint prevents reward hacking.
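For step (c), reward models are typically trained with a Bradley-Terry pairwise loss over preference pairs; given scalar scores for the chosen and rejected responses, a sketch of the loss is:

```python
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Bradley-Terry pairwise loss: the chosen response should score higher."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```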
Exercise 37: Curriculum Learning for RL
Implement automatic curriculum learning for a difficult environment:
a) Start with easy versions of the task (e.g., shorter episodes, simpler dynamics).
b) Automatically increase difficulty as the agent improves.
c) Compare with training directly on the hard task.
Exercise 38: Meta-RL
Implement a simple meta-RL agent (e.g., RL$^2$) that learns to learn:
a) Create a family of related MDPs (e.g., grid worlds with different goal locations).
b) Train an LSTM-based policy across many MDPs.
c) Evaluate whether the agent can quickly adapt to new MDPs at test time.
Exercise 39: Reproducibility Report
Select one result from a published deep RL paper. Attempt to reproduce it:
a) Implement the algorithm exactly as described.
b) Run with the reported hyperparameters.
c) Report your results alongside the original.
d) Document any discrepancies and hypothesize explanations.
Exercise 40: RL for Combinatorial Optimization
Apply RL to the Traveling Salesman Problem (TSP):
a) Formulate TSP as an MDP (state = partial tour, action = next city to visit).
b) Train a policy network using REINFORCE with a greedy rollout baseline.
c) Compare solutions with the nearest-neighbor heuristic on random 20-city instances.