Chapter 36 Exercises: Reinforcement Learning for AI Engineers

Conceptual Exercises

Exercise 1: MDP Identification

For each scenario below, identify the state space, action space, reward signal, and discount factor you would choose. Justify your choices.
a) A robot vacuum cleaning a house
b) A stock trading bot managing a portfolio
c) An automated traffic light controller

Exercise 2: Markov Property Analysis

Which of the following problems satisfy the Markov property? For those that do not, describe what additional information you would include in the state to make them approximately Markov.
a) Playing chess (observing the full board)
b) Predicting the next word in a sentence
c) A self-driving car navigating an intersection
d) A thermostat controlling room temperature

Exercise 3: Discount Factor Impact

Consider an MDP with two policies:
- Policy A: Receives reward +1 at every time step for 10 steps, then 0 forever.
- Policy B: Receives reward 0 for the first 5 steps, then +3 for the next 10 steps.

Calculate the total discounted return for each policy with $\gamma = 0.5$, $\gamma = 0.9$, and $\gamma = 0.99$. Which policy is preferred under each setting?
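
A few lines of Python can serve as a sanity check on your hand calculations; the reward sequences below encode Policies A and B as described, with returns computed from time step 0.

```python
# Sanity check for Exercise 3: discounted return of a finite reward sequence.
def discounted_return(rewards, gamma):
    return sum(r * gamma**t for t, r in enumerate(rewards))

policy_a = [1.0] * 10              # +1 for 10 steps, then 0 forever
policy_b = [0.0] * 5 + [3.0] * 10  # 0 for 5 steps, then +3 for 10 steps

for gamma in (0.5, 0.9, 0.99):
    print(gamma, discounted_return(policy_a, gamma), discounted_return(policy_b, gamma))
```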

Exercise 4: Bellman Equation Derivation

Starting from the definition $V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]$, derive the Bellman expectation equation for $V^\pi$. Show each step clearly.

Exercise 5: Q-Learning vs. SARSA

Consider a cliff-walking environment where the agent must navigate along a cliff edge to reach a goal. The agent gets -100 reward for falling off the cliff and -1 for each step.

a) Explain why Q-learning and SARSA would learn different paths.
b) Which algorithm would produce a safer policy during training? Why?
c) Which would find the optimal policy more quickly?

Exercise 6: Exploration Trade-offs

Compare epsilon-greedy, Boltzmann, and UCB exploration strategies along these dimensions:
a) Computational complexity
b) Sensitivity to hyperparameters
c) Performance in environments with many similar-valued actions
d) Behavior as training progresses

Exercise 7: DQN Innovations

Explain why each of the following DQN improvements is necessary:
a) Experience replay: What problem does it solve? What happens without it?
b) Target network: What instability does it address?
c) Double DQN: What bias does it correct? Provide a concrete example.

Exercise 8: Policy Gradient Variance

The REINFORCE algorithm suffers from high variance.
a) Explain mathematically why using the full return $G_t$ leads to high variance.
b) Show that subtracting a baseline $b(s)$ does not introduce bias.
c) Why is $V(s)$ a good baseline choice?

Exercise 9: PPO Clipping Analysis

Sketch the PPO clipped objective $L^{\text{CLIP}}$ as a function of the probability ratio $r(\theta)$ for:
a) Positive advantage ($\hat{A} > 0$)
b) Negative advantage ($\hat{A} < 0$)

Explain the behavior in each region and why clipping helps prevent destructive policy updates.

Exercise 10: RLHF vs. DPO vs. GRPO

Compare RLHF (with PPO), DPO, and GRPO along these dimensions:
a) Need for a reward model
b) Memory requirements
c) Stability of training
d) Ability to handle nuanced preferences
e) Sample efficiency

Programming Exercises

Exercise 11: Implement SARSA

Implement a SARSA agent (on-policy TD control) for the FrozenLake environment. Compare its learning curve with Q-learning over 10,000 episodes. Plot the results.
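
A minimal sketch of the tabular SARSA loop, assuming the Gymnasium API; the epsilon-greedy helper and the hyperparameter values are illustrative choices, not part of the exercise specification.

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def eps_greedy(s):
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(Q[s]))

for episode in range(10_000):
    s, _ = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a_next = eps_greedy(s_next)
        # On-policy target: bootstraps from the action actually taken next,
        # unlike Q-learning's max over actions.
        target = r + gamma * Q[s_next, a_next] * (not terminated)
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
```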

Exercise 12: Epsilon Decay Schedules

Implement three epsilon decay schedules for Q-learning:
a) Linear decay
b) Exponential decay
c) Cosine annealing

Train each on CartPole-v1 (after discretizing the state space) for 5,000 episodes and compare learning curves.
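
The three schedules can be written as small functions of the episode index; the start/end values and decay rate below are illustrative defaults.

```python
import math

def linear_decay(episode, total=5_000, eps_start=1.0, eps_end=0.01):
    frac = min(episode / total, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def exponential_decay(episode, eps_start=1.0, eps_end=0.01, rate=1e-3):
    return eps_end + (eps_start - eps_end) * math.exp(-rate * episode)

def cosine_annealing(episode, total=5_000, eps_start=1.0, eps_end=0.01):
    frac = min(episode / total, 1.0)
    return eps_end + 0.5 * (eps_start - eps_end) * (1.0 + math.cos(math.pi * frac))
```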

Exercise 13: Experience Replay Variants

Implement and compare three experience replay strategies for DQN on CartPole-v1:
a) Uniform random sampling (standard)
b) Prioritized experience replay (proportional to TD error)
c) Reservoir sampling

Report the average reward over the last 100 episodes for each approach.
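
For part (b), a minimal proportional prioritized buffer (O(N) sampling, no sum-tree) might look like the sketch below; the alpha/beta exponents and the small constant added to priorities are conventional choices, and the class name is hypothetical.

```python
import numpy as np

class ProportionalReplay:
    """Prioritized replay with sampling probability proportional to |TD error|^alpha."""
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.buffer) >= self.capacity:   # drop the oldest transition
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + 1e-5) ** self.alpha)

    def sample(self, batch_size, beta=0.4):
        p = np.array(self.priorities)
        p = p / p.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=p)
        # Importance-sampling weights correct for the non-uniform sampling.
        weights = (len(self.buffer) * p[idx]) ** (-beta)
        weights = weights / weights.max()
        return [self.buffer[i] for i in idx], idx, weights
```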

Exercise 14: Dueling DQN

Implement the dueling DQN architecture. Train both standard DQN and dueling DQN on LunarLander-v3. Compare:
a) Learning curves (mean reward over episodes)
b) Convergence speed
c) Final performance
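
A sketch of the dueling head in PyTorch, assuming a flat observation vector; layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, obs):
        h = self.features(obs)
        v, a = self.value(h), self.advantage(h)
        # Subtracting the mean advantage keeps V and A identifiable.
        return v + a - a.mean(dim=-1, keepdim=True)
```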

Exercise 15: REINFORCE with Different Baselines

Implement REINFORCE with three different baselines:
a) No baseline
b) Running average of returns
c) Learned value function baseline

Train on CartPole-v1 and compare the variance of policy gradient estimates across the three variants.
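
For variant (c), the per-episode loss might be assembled as below (PyTorch, names hypothetical); detaching the baseline in the advantage keeps the variance-reduction term from biasing the policy gradient.

```python
import torch

def reinforce_loss(log_probs, returns, baselines):
    """log_probs, returns, baselines: 1-D tensors over one episode's time steps."""
    advantages = returns - baselines.detach()   # baseline reduces variance, adds no bias
    policy_loss = -(log_probs * advantages).mean()
    value_loss = torch.nn.functional.mse_loss(baselines, returns)  # fit V(s) to observed returns
    return policy_loss + 0.5 * value_loss
```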

Exercise 16: GAE Lambda Sweep

Implement GAE and run a hyperparameter sweep over $\lambda \in \{0.0, 0.5, 0.9, 0.95, 0.99, 1.0\}$ on CartPole-v1 using an actor-critic agent. Plot the learning curves and discuss the bias-variance trade-off.
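
A reference sketch of the GAE recursion (NumPy); it assumes `values` carries one extra bootstrap entry for the state after the last step.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """rewards, dones: length T; values: length T+1 (bootstrap value appended)."""
    values = np.asarray(values, dtype=np.float32)
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    returns = advantages + values[:T]   # lambda-returns, usable as value-loss targets
    return advantages, returns
```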

Exercise 17: PPO Implementation from Scratch

Implement PPO completely from scratch (without using existing RL libraries) for the LunarLander-v3 environment. Your implementation should include:
a) Actor-critic network with shared features
b) GAE for advantage estimation
c) Clipped surrogate objective
d) Value function loss with clipping
e) Entropy bonus

Train for 500,000 timesteps and plot the learning curve.
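
The heart of part (c) is the clipped surrogate; a minimal version (PyTorch) is shown below, with the clip range as a placeholder default.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (min) bound removes the incentive to push the ratio outside the clip range.
    return -torch.min(unclipped, clipped).mean()
```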

Exercise 18: Custom Environment

Create a custom Gymnasium environment for a simple inventory management problem:
- State: current inventory level (0-100)
- Actions: order 0, 10, 20, 30, or 40 units
- Demand: random, drawn from Poisson(15) each day
- Reward: +2 per unit sold, -1 per unit of unsatisfied demand, -0.5 per unit of excess inventory

Train a DQN agent and analyze the learned ordering policy.
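
A possible environment skeleton is sketched below; the 30-day horizon, the initial stock of 50 units, and the interpretation of "excess inventory" as end-of-day leftover stock are assumptions you may change.

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class InventoryEnv(gym.Env):
    ORDER_SIZES = [0, 10, 20, 30, 40]

    def __init__(self, max_inventory=100, episode_length=30):
        self.observation_space = spaces.Box(0.0, max_inventory, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(len(self.ORDER_SIZES))
        self.max_inventory, self.episode_length = max_inventory, episode_length

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.inventory, self.day = 50, 0
        return np.array([self.inventory], dtype=np.float32), {}

    def step(self, action):
        self.inventory = min(self.inventory + self.ORDER_SIZES[action], self.max_inventory)
        demand = int(self.np_random.poisson(15))
        sold = min(demand, self.inventory)
        unmet = demand - sold
        self.inventory -= sold
        reward = 2.0 * sold - 1.0 * unmet - 0.5 * self.inventory
        self.day += 1
        truncated = self.day >= self.episode_length
        return np.array([self.inventory], dtype=np.float32), reward, False, truncated, {}
```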

Exercise 19: Reward Shaping

Consider the MountainCar-v0 environment, which has extremely sparse rewards.
a) Train a standard DQN agent and record how many episodes it takes to first reach the goal.
b) Implement potential-based reward shaping using the car's height as the potential function.
c) Compare learning curves with and without reward shaping.
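
For part (b), potential-based shaping can be added as a wrapper; the sketch below uses position as a simple proxy for height (MountainCar's true height is proportional to sin(3·position)), and the scale factor is arbitrary.

```python
import gymnasium as gym

class PotentialShaping(gym.Wrapper):
    """Adds F(s, s') = gamma * phi(s') - phi(s), which leaves the optimal policy unchanged."""
    def __init__(self, env, gamma=0.99, scale=10.0):
        super().__init__(env)
        self.gamma, self.scale = gamma, scale

    def _potential(self, obs):
        return self.scale * obs[0]   # obs = [position, velocity]; position as a height proxy

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.prev_potential = self._potential(obs)
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        potential = self._potential(obs)
        reward += self.gamma * potential - self.prev_potential
        self.prev_potential = potential
        return obs, reward, terminated, truncated, info
```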

Exercise 20: Multi-Seed Evaluation

Train a PPO agent on CartPole-v1 with 10 different random seeds. Compute and plot:
a) Mean and standard deviation of episode rewards over training
b) The interquartile range of final performance
c) Time to reach a reward threshold of 450

Discuss the reproducibility implications.

Advanced Exercises

Exercise 21: Continuous Action Spaces

Extend the PPO implementation to handle continuous action spaces using a Gaussian policy:
$$\pi_\theta(a|s) = \mathcal{N}\left(\mu_\theta(s), \sigma_\theta(s)^2\right)$$

Test on the Pendulum-v1 environment.
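
One way to parameterize the policy head in PyTorch is sketched below; it uses a state-independent log standard deviation for stability, whereas the $\sigma_\theta(s)$ in the exercise would come from a second output layer.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent log std

    def forward(self, obs):
        mu = self.mu(self.body(obs))
        return torch.distributions.Normal(mu, self.log_std.exp())

# Usage: dist = policy(obs); action = dist.sample(); log_prob = dist.log_prob(action).sum(-1)
```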

Exercise 22: Implement GRPO for a Simplified Setting

Implement GRPO for a bandit problem where:
- The "prompt" is an integer context $c \in \{1, \ldots, 5\}$.
- The "policy" generates a real-valued response $y$.
- The "reward" is $R(c, y) = -|y - c^2|$ (the optimal response for context $c$ is $c^2$).

Use a simple neural network as the policy and show that GRPO can learn the optimal mapping.
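
The defining ingredient of GRPO is the group-relative advantage: sample several responses for the same context and normalize their rewards within the group instead of learning a value baseline. A sketch (PyTorch, names hypothetical):

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: shape (group_size,), one entry per response sampled for the same context."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Policy-gradient step for one group, given log-probs of each response under the current policy:
# loss = -(group_relative_advantages(rewards).detach() * log_probs).mean()
```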

Exercise 23: DPO Implementation

Implement DPO for a toy text generation task:
- Use a small GPT-2 model as both the policy and reference.
- Create synthetic preference data where shorter responses are preferred.
- Train with DPO and verify that the model learns to generate shorter responses.
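
The DPO objective itself is compact; a sketch in PyTorch is below, where the inputs are summed log-probabilities of whole responses under the trainable policy and the frozen reference model, and beta is a placeholder value.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the preferred response to gain probability (relative to the reference)
    # faster than the rejected one.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```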

Exercise 24: Curiosity-Driven Exploration

Implement Random Network Distillation (RND) for intrinsic motivation:
a) Create a fixed random target network and a predictor network.
b) Use the prediction error as an intrinsic reward bonus.
c) Train on MountainCar-v0 and compare with vanilla DQN.

Exercise 25: Model-Based RL

Implement a simple model-based RL approach for CartPole:
a) Collect initial data with a random policy.
b) Train a neural network dynamics model: $\hat{s}_{t+1} = f_\theta(s_t, a_t)$.
c) Use the learned model for planning (generate imagined trajectories and select the best action sequence).
d) Compare sample efficiency with model-free DQN.

Exercise 26: Offline RL

Implement Conservative Q-Learning (CQL) for offline RL:
a) Collect a dataset of transitions using a partially trained DQN agent.
b) Train a standard DQN on this offline dataset (no environment interaction).
c) Train a CQL agent on the same dataset (add a conservative regularizer to penalize overestimation of out-of-distribution actions).
d) Compare the performance of both agents when evaluated in the environment.
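
For part (c), the discrete-action conservative term can be as small as the sketch below; `q_values` holds Q-values for all actions, `q_data` the Q-values of the actions stored in the dataset, and the weight on the regularizer is a tuning knob.

```python
import torch

def cql_regularizer(q_values, q_data):
    """q_values: (batch, n_actions); q_data: (batch,) Q-values of the dataset actions."""
    # Penalize large Q on actions the dataset never took, relative to the actions it did take.
    return (torch.logsumexp(q_values, dim=1) - q_data).mean()

# total_loss = td_loss + cql_weight * cql_regularizer(q_all_actions, q_taken_actions)
```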

Exercise 27: Hierarchical RL

Design and implement a two-level hierarchical RL agent for a navigation task:
- High-level policy: Selects subgoals (target positions on the grid).
- Low-level policy: Selects primitive actions to reach the subgoal.

Test on a 10x10 grid world with obstacles.

Exercise 28: Multi-Agent Competition

Create a simple two-player competitive environment (e.g., a simplified version of Pong) using Gymnasium. Train two independent DQN agents against each other using self-play. Track whether the agents converge toward an approximate Nash equilibrium.

Exercise 29: Safe RL with Constraints

Implement a constrained RL variant where the agent must maximize reward while keeping a safety cost below a threshold:
- Environment: CartPole with an added safety cost when the pole angle exceeds 10 degrees.
- Algorithm: Lagrangian relaxation of the constrained objective.

Compare the policies learned with and without the safety constraint.
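
One way to adapt the Lagrange multiplier is plain gradient ascent on the constraint violation, sketched below in PyTorch; the cost limit and learning rate are placeholder values, and the policy loss would subtract the multiplier times the cost advantage.

```python
import torch

# Lagrangian relaxation: maximize E[reward] - lambda * (E[cost] - cost_limit),
# with lambda adapted so it grows while the constraint is violated.
log_lambda = torch.zeros(1, requires_grad=True)   # optimize in log space so lambda stays positive
lambda_opt = torch.optim.Adam([log_lambda], lr=1e-2)

def update_lagrange_multiplier(mean_episode_cost, cost_limit=25.0):
    lam = log_lambda.exp()
    loss = -lam * (mean_episode_cost - cost_limit)  # gradient ascent on the violation
    lambda_opt.zero_grad()
    loss.backward()
    lambda_opt.step()
    return float(log_lambda.exp())
```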

Exercise 30: Reward Hacking

Design an experiment that demonstrates reward hacking:
a) Create an environment where the reward function has an unintended shortcut.
b) Train an RL agent and observe it exploiting the shortcut.
c) Fix the reward function and retrain.

Document your findings and discuss implications for AI alignment.

Exercise 31: PPO Hyperparameter Sensitivity

Conduct a systematic study of PPO hyperparameter sensitivity on LunarLander-v3:
a) Vary clip epsilon in {0.05, 0.1, 0.2, 0.3, 0.5}
b) Vary number of epochs in {1, 3, 5, 10, 20}
c) Vary learning rate in {1e-5, 3e-4, 1e-3, 3e-3}

Create a heatmap of final performance and identify the most sensitive hyperparameters.

Exercise 32: Imitation Learning Baseline

Implement behavioral cloning (supervised learning on expert demonstrations) for CartPole:
a) Collect 100 episodes from a trained PPO expert.
b) Train a policy using supervised learning (cross-entropy loss).
c) Compare with PPO trained from scratch in terms of sample efficiency and final performance.
d) Implement DAgger (Dataset Aggregation) and compare with pure behavioral cloning.

Exercise 33: Visualizing Value Functions

For a trained DQN agent on CartPole:
a) Plot the Q-values as a function of pole angle and angular velocity (fixing other state dimensions).
b) Overlay the policy (selected action) on the Q-value landscape.
c) Identify regions where the agent is most uncertain.

Exercise 34: Natural Policy Gradients

Implement the natural policy gradient algorithm:
a) Compute the Fisher information matrix for a simple policy.
b) Apply the natural gradient update: $\theta \leftarrow \theta + \alpha F^{-1} \nabla_\theta J(\theta)$.
c) Compare convergence with standard policy gradient on CartPole.

Exercise 35: Distributional RL

Implement C51 (categorical distributional RL):
a) Instead of learning $Q(s,a)$, learn the distribution of returns using 51 atoms.
b) Train on CartPole and compare with standard DQN.
c) Visualize the learned return distributions for different state-action pairs.

Challenge Exercises

Exercise 36: End-to-End RLHF Pipeline

Build a minimal RLHF pipeline:
a) Fine-tune a small language model (GPT-2 or similar) on a text dataset.
b) Create synthetic preference data.
c) Train a reward model on the preferences.
d) Use PPO to optimize the language model against the reward model.
e) Evaluate whether the KL constraint prevents reward hacking.

Exercise 37: Curriculum Learning for RL

Implement automatic curriculum learning for a difficult environment:
a) Start with easy versions of the task (e.g., shorter episodes, simpler dynamics).
b) Automatically increase difficulty as the agent improves.
c) Compare with training directly on the hard task.

Exercise 38: Meta-RL

Implement a simple meta-RL agent (e.g., RL^2) that learns to learn:
a) Create a family of related MDPs (e.g., grid worlds with different goal locations).
b) Train an LSTM-based policy across many MDPs.
c) Evaluate whether the agent can quickly adapt to new MDPs at test time.

Exercise 39: Reproducibility Report

Select one result from a published deep RL paper. Attempt to reproduce it:
a) Implement the algorithm exactly as described.
b) Run with the reported hyperparameters.
c) Report your results alongside the original.
d) Document any discrepancies and hypothesize explanations.

Exercise 40: RL for Combinatorial Optimization

Apply RL to the Traveling Salesman Problem (TSP):
a) Formulate TSP as an MDP (state = partial tour, action = next city to visit).
b) Train a policy network using REINFORCE with a greedy rollout baseline.
c) Compare solutions with the nearest-neighbor heuristic on random 20-city instances.