Chapter 36: Quiz -- Reinforcement Learning for AI Engineers
Test your understanding of reinforcement learning concepts, algorithms, and their connections to modern AI.
Question 1. Which property of the MDP framework states that the future depends only on the current state and action, not on the history?
(a) Ergodicity (b) The Markov property (c) Stationarity (d) The Bellman property
Question 2. In the discounted return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, what happens as $\gamma \to 0$?
(a) The agent becomes far-sighted, valuing future rewards equally (b) The return diverges to infinity (c) The agent becomes myopic, caring only about immediate reward (d) The discount factor has no effect on the return
Question 3. What is the key difference between Q-learning and SARSA?
(a) Q-learning uses a neural network while SARSA uses a table (b) Q-learning is on-policy while SARSA is off-policy (c) Q-learning uses the maximum Q-value of the next state while SARSA uses the Q-value of the action actually taken (d) Q-learning requires a model of the environment while SARSA does not
Question 4. In $\epsilon$-greedy exploration, what does the agent do with probability $\epsilon$?
(a) Select the action with the highest Q-value (b) Select a random action (c) Terminate the episode (d) Reduce the learning rate
Question 5. Which two innovations made Deep Q-Networks (DQN) stable enough to play Atari games?
(a) Batch normalization and dropout (b) Experience replay and target network (c) Adam optimizer and learning rate scheduling (d) Data augmentation and weight decay
Question 6. Double DQN addresses what problem in standard DQN?
(a) Catastrophic forgetting (b) Overestimation of Q-values (c) Underestimation of Q-values (d) Slow convergence
Question 7. What is the main advantage of policy gradient methods over value-based methods like DQN?
(a) They always converge faster (b) They require less memory (c) They naturally handle continuous action spaces (d) They do not require a neural network
Question 8. In the REINFORCE algorithm, what is the purpose of subtracting a baseline from the return?
(a) To introduce bias for faster convergence (b) To reduce the variance of the gradient estimate without introducing bias (c) To normalize the reward signal (d) To clip the gradient
Question 9. In the PPO clipped objective, what does the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}$ measure?
(a) The reward received at time step $t$ (b) How much the policy has changed for the given state-action pair (c) The advantage of the action (d) The entropy of the policy
Question 10. In PPO, why is the probability ratio clipped to the range $[1-\epsilon, 1+\epsilon]$?
(a) To ensure gradients are always positive (b) To prevent excessively large policy updates that could destroy performance (c) To make the algorithm on-policy (d) To reduce memory usage
Question 11. What does the entropy bonus in the PPO objective encourage?
(a) Faster convergence (b) Exploration by preventing the policy from becoming too deterministic (c) Lower variance in gradient estimates (d) Higher rewards
Question 12. In Generalized Advantage Estimation (GAE), what does the parameter $\lambda$ control?
(a) The learning rate (b) The bias-variance trade-off in advantage estimation (c) The discount factor (d) The clipping range
Question 13. In RLHF (Reinforcement Learning from Human Feedback), what is the role of the reward model?
(a) To generate training data (b) To predict human preferences between model outputs, providing a reward signal for RL fine-tuning (c) To directly update the policy weights (d) To pre-train the language model
Question 14. What is the main advantage of DPO (Direct Preference Optimization) over standard RLHF with PPO?
(a) DPO produces better models (b) DPO eliminates the need for a separate reward model and RL training loop (c) DPO requires more compute but produces more aligned models (d) DPO works only with small models
Question 15. Which of the following is an example of the exploration-exploitation dilemma?
(a) Choosing between a restaurant you love and trying a new restaurant that might be better (b) Choosing between training a larger model or a smaller model (c) Choosing between SGD and Adam optimizer (d) Choosing between float16 and bfloat16 precision
Answer Key
Answer 1. (b) The Markov property states that $P(s_{t+1} | s_t, a_t, s_{t-1}, \ldots) = P(s_{t+1} | s_t, a_t)$; the future is conditionally independent of the past given the present state and action.
Answer 2. (c) As $\gamma \to 0$, the return reduces to the immediate reward $r_{t+1}$. The agent becomes myopic, optimizing only for immediate reward and ignoring future consequences.
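A quick numeric check of this behavior, using a made-up reward sequence (the numbers are illustrative, not from the chapter):

```python
# Discounted return G_t for a fixed reward stream under different gammas.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]

def discounted_return(rewards, gamma):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return(rewards, 0.99))  # ~4.90: future rewards still matter
print(discounted_return(rewards, 0.0))   # 1.00: only the immediate reward counts
```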
Answer 3. (c) Q-learning (off-policy) uses $\max_{a'} Q(s_{t+1}, a')$ for the target, while SARSA (on-policy) uses $Q(s_{t+1}, a_{t+1})$, where $a_{t+1}$ is the action actually taken by the current policy.
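The difference is visible in the one-line TD target. A minimal tabular sketch, assuming `Q` is a NumPy array of shape `(n_states, n_actions)` and the `alpha`/`gamma` values are hypothetical settings:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy action in s_next,
    # regardless of what the behavior policy does next.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from a_next, the action the current
    # (e.g. epsilon-greedy) policy actually selected in s_next.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```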
Answer 4. (b) With probability $\epsilon$, the agent selects a random action to explore the environment. With probability $1 - \epsilon$, it exploits its current knowledge by selecting the greedy action.
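A minimal sketch of the selection rule (function and variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    # Explore with probability epsilon, exploit otherwise.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # uniform random action
    return int(np.argmax(q_values))               # greedy action
```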
Answer 5. (b) Experience replay breaks temporal correlations by sampling random mini-batches, and the target network provides a stable optimization target by using a periodically updated copy of the network.
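A sketch of how the two pieces fit into the loss computation, assuming PyTorch and batched tensors; the networks, optimizer, and environment loop are omitted:

```python
from collections import deque
import torch
import torch.nn.functional as F

replay_buffer = deque(maxlen=100_000)   # experience replay: stores (s, a, r, s', done) transitions

def dqn_loss(policy_net, target_net, s, a, r, s_next, done, gamma=0.99):
    q_sa = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target network: a periodically synced copy gives a stable bootstrap target.
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)

# Training-loop sketch: sample a random mini-batch (breaks temporal correlation)
# with random.sample(replay_buffer, k=64), stack it into tensors, and every N steps
# sync the copy: target_net.load_state_dict(policy_net.state_dict())
```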
Answer 6. (b) Standard DQN overestimates Q-values because the same network selects and evaluates actions. Double DQN decouples these by using the policy network to select actions and the target network to evaluate them.
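The decoupling amounts to a small change in the target computation. A sketch, assuming the same PyTorch setup as above:

```python
import torch

def double_dqn_target(policy_net, target_net, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        # Select the argmax action with the online (policy) network...
        best_a = policy_net(s_next).argmax(dim=1, keepdim=True)
        # ...but evaluate that action with the target network, damping overestimation.
        next_q = target_net(s_next).gather(1, best_a).squeeze(1)
    return r + gamma * (1.0 - done) * next_q
```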
Answer 7. (c) Policy gradient methods parameterize the policy directly and can output continuous action distributions (e.g., Gaussian), while DQN requires computing $\max_a Q(s, a)$ over all actions, which is intractable for continuous spaces.
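For intuition, a sketch of a Gaussian policy head: the network outputs a mean and log-std, and continuous actions are sampled rather than found by an argmax (names are hypothetical):

```python
import torch

def sample_gaussian_action(mean, log_std):
    # A diagonal Gaussian over continuous actions; no max over an action set is needed.
    dist = torch.distributions.Normal(mean, log_std.exp())
    action = dist.sample()
    log_prob = dist.log_prob(action).sum(dim=-1)   # joint log-prob over action dimensions
    return action, log_prob
```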
Answer 8. (b) Subtracting a baseline $b(s)$ leaves the gradient estimate unbiased, since $\mathbb{E}_{a \sim \pi}[\nabla_\theta \log \pi(a|s) \, b(s)] = 0$, while centering the returns shrinks the magnitude of the terms in the estimator and therefore its variance.
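A sketch of the loss with a simple batch-mean baseline (one common choice; a learned value function $V(s)$ is the other standard option):

```python
import torch

def reinforce_loss(log_probs, returns):
    # log_probs: log pi(a_t|s_t) per step; returns: discounted returns G_t per step.
    baseline = returns.mean()                     # constant baseline over the batch
    advantages = (returns - baseline).detach()    # same expectation, smaller terms
    return -(log_probs * advantages).sum()
```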
Answer 9. (b) The ratio measures how much the probability of taking action $a_t$ in state $s_t$ has changed between the old and new policy. A ratio of 1 means no change.
Answer 10. (b) Clipping prevents the ratio from moving too far from 1, which limits the size of each policy update. This is a simpler alternative to TRPO's KL-divergence constraint and prevents catastrophic performance collapse.
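A sketch of the clipped surrogate loss in PyTorch; the ratio from Question 9 is computed in log space for numerical stability:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The min makes the surrogate a pessimistic bound, so large ratio moves
    # cannot be exploited to inflate the objective.
    return -torch.min(unclipped, clipped).mean()
```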
Answer 11. (b) The entropy bonus $-c_2 \sum_a \pi(a|s) \log \pi(a|s)$ is maximized when the policy is uniform. Adding it to the objective discourages premature convergence to a deterministic policy, maintaining exploration.
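A sketch for a categorical (discrete-action) policy; the bonus is added to the objective, i.e. subtracted from the total loss:

```python
import torch

def entropy_bonus(logits, c2=0.01):
    probs = torch.softmax(logits, dim=-1)
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(dim=-1)     # maximal for a uniform policy
    return c2 * entropy.mean()                     # subtract this from the total loss
```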
Answer 12. (b) GAE with $\lambda = 0$ gives the one-step TD estimate (low variance, high bias), while $\lambda = 1$ gives the Monte Carlo estimate (high variance, low bias). Intermediate values interpolate between these extremes.
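A sketch of the backward recursion for a single trajectory without terminal masking; `values` is assumed to hold one extra bootstrap entry for the state after the final reward:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD error
        running = delta + gamma * lam * running                  # lambda-weighted sum of deltas
        advantages[t] = running
    return advantages   # lam=0 -> TD(0) advantage; lam=1 -> Monte Carlo advantage
```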
Answer 13. (b) The reward model is trained on human preference data (pairwise comparisons of model outputs) and then provides scalar rewards to the RL algorithm (typically PPO) during fine-tuning of the language model.
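Reward models are commonly trained with a Bradley-Terry style pairwise loss on the preference data, sketched below (tensor names are hypothetical):

```python
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    # Push the reward model to score the human-preferred response above the rejected one.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```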
Answer 14. (b) DPO reformulates the RLHF objective as a classification loss directly on the preference data, eliminating the need to train a separate reward model and run RL. This simplifies the pipeline and reduces computational cost.
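A sketch of the DPO loss, which needs only sequence log-probabilities from the policy and a frozen reference model (argument names and `beta` are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Classification on log-prob margins replaces the reward model and the RL loop.
    chosen_margin = pi_chosen_logp - ref_chosen_logp
    rejected_margin = pi_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```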
Answer 15. (a) The restaurant example captures the exploration-exploitation trade-off: exploiting the known-good restaurant (high expected reward) versus exploring a new one (uncertain reward but potential for discovery).
Scoring Guide
| Score | Level |
|---|---|
| 13--15 | Excellent: Strong grasp of RL fundamentals and modern methods |
| 10--12 | Good: Solid understanding with minor gaps |
| 7--9 | Fair: Review the sections on algorithms you missed |
| Below 7 | Needs review: Revisit the chapter, especially Sections 36.4--36.8 |