Case Study 1: Training CartPole with DQN
Overview
A robotics startup needs to develop an automated control system for balancing an inverted pendulum on a moving cart. Before deploying to physical hardware, the team prototypes the controller using Deep Q-Networks (DQN) in the Gymnasium CartPole-v1 environment. The goal is to achieve a stable policy that consistently reaches the maximum episode length (500 steps) within 300 training episodes, while systematically comparing DQN variants.
Problem Statement
The CartPole environment presents a classic control challenge:
- State space: Four continuous values (cart position, cart velocity, pole angle, angular velocity).
- Action space: Two discrete actions (push left, push right).
- Reward: +1 for every step the pole stays upright.
- Termination: Pole angle exceeds ±12 degrees or cart position leaves the ±2.4 range; episodes are truncated at 500 steps.
- Success criterion: Average reward of 475+ over 100 consecutive episodes.
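These dimensions can be read straight off the environment object. A quick sanity check using the Gymnasium API (nothing here is specific to the case study's code):

```python
import gymnasium as gym

# Create the environment described above and inspect its spaces.
env = gym.make("CartPole-v1")
print(env.observation_space)        # Box of 4 floats: cart position, cart velocity, pole angle, angular velocity
print(env.action_space)             # Discrete(2): push left (0), push right (1)
print(env.spec.max_episode_steps)   # 500 for CartPole-v1
env.close()
```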
The key engineering challenges are:
- Choosing an appropriate network architecture and hyperparameters.
- Balancing exploration and exploitation during training.
- Ensuring stable convergence despite non-stationary data.
Approach
Step 1: Baseline Random Agent
The team first establishes a random baseline:
- Average reward: 22.3 steps per episode.
- Maximum reward: 62 steps.
- The pole falls quickly without learned control.
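A baseline of this kind takes only a few lines with Gymnasium. The sketch below uses an illustrative 100-episode evaluation and per-episode seeds, which are assumptions rather than details from the case study:

```python
import gymnasium as gym
import numpy as np

# Roll out a uniformly random policy to get a no-learning baseline.
env = gym.make("CartPole-v1")
returns = []
for episode in range(100):                 # episode count is illustrative
    obs, info = env.reset(seed=episode)    # per-episode seed is illustrative
    done, total = False, 0.0
    while not done:
        action = env.action_space.sample()  # uniform random action
        obs, reward, terminated, truncated, info = env.step(action)
        total += reward
        done = terminated or truncated
    returns.append(total)
print(f"avg={np.mean(returns):.1f}  max={np.max(returns):.0f}")
env.close()
```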
Step 2: Vanilla DQN
Architecture and hyperparameters:
- Network: 2 hidden layers, 128 units each, ReLU activation.
- Replay buffer: 100,000 transitions.
- Batch size: 64.
- Learning rate: 1e-3.
- Gamma (discount): 0.99.
- Epsilon: linear decay from 1.0 to 0.01 over 10,000 steps.
- Target network update: every 1,000 steps.
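A minimal PyTorch sketch of this setup is shown below; the class and constant names are illustrative, the replay buffer and training loop are omitted, and the values mirror the list above:

```python
import torch
import torch.nn as nn

# Q-network matching the listed architecture: 4 inputs -> 128 -> 128 -> 2 Q-values.
class QNetwork(nn.Module):
    def __init__(self, obs_dim: int = 4, n_actions: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Hyperparameters as listed above.
BUFFER_SIZE = 100_000
BATCH_SIZE = 64
LR = 1e-3
GAMMA = 0.99
EPS_START, EPS_END, EPS_DECAY_STEPS = 1.0, 0.01, 10_000
TARGET_UPDATE_EVERY = 1_000  # environment steps between target-network syncs

policy_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(policy_net.state_dict())  # start with identical weights
optimizer = torch.optim.Adam(policy_net.parameters(), lr=LR)
```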
Step 3: Double DQN
To address Q-value overestimation, the team implements Double DQN:
- Uses the policy network to select the best action.
- Uses the target network to evaluate that action's Q-value.
- All other hyperparameters remain the same.
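The change is confined to how the bootstrap target is computed. A sketch of both targets, written as standalone helper functions (names and signatures are assumptions, not taken from the case study's code):

```python
import torch

def vanilla_dqn_targets(target_net, rewards, next_states, dones, gamma=0.99):
    """Vanilla DQN: the target net both selects and evaluates the next action."""
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * next_q

def double_dqn_targets(policy_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: the policy net selects the next action, the target net evaluates it."""
    with torch.no_grad():
        next_actions = policy_net(next_states).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
        return rewards + gamma * (1.0 - dones) * next_q
```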
Step 4: Dueling DQN
The team adds the dueling architecture:
- Separate value and advantage streams after the shared feature layers.
- Value stream: Linear(128) -> 1.
- Advantage stream: Linear(128) -> 2.
- Q = V + (A - mean(A)).
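A sketch of the dueling head under the same assumptions as the earlier network (illustrative class name; layer sizes follow the list above):

```python
import torch
import torch.nn as nn

# Dueling head: shared features feed separate value and advantage streams,
# recombined as Q = V + (A - mean(A)).
class DuelingQNetwork(nn.Module):
    def __init__(self, obs_dim: int = 4, n_actions: int = 2, hidden: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)
        v = self.value(h)
        a = self.advantage(h)
        return v + (a - a.mean(dim=1, keepdim=True))
```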
Step 5: Combined Improvements
The final agent combines Double DQN and the dueling architecture with tuned hyperparameters:
- Reduced learning rate: 5e-4.
- Less frequent target updates: every 2,000 steps.
- Gradient clipping: max norm 1.0.
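Putting these together, the update step might look like the sketch below, which reuses DuelingQNetwork and double_dqn_targets from the earlier sketches. The Huber (smooth L1) loss and Adam optimizer are assumptions; the learning rate, target-update interval, and clipping norm follow the list above:

```python
import torch
import torch.nn as nn

# Tuned settings for the combined agent.
LR = 5e-4
TARGET_UPDATE_EVERY = 2_000  # environment steps between target-network syncs
MAX_GRAD_NORM = 1.0

policy_net = DuelingQNetwork()  # dueling head from the sketch above
target_net = DuelingQNetwork()
target_net.load_state_dict(policy_net.state_dict())
optimizer = torch.optim.Adam(policy_net.parameters(), lr=LR)

def optimize(batch, global_step):
    """One gradient step on a sampled batch of transitions."""
    states, actions, rewards, next_states, dones = batch
    q = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    targets = double_dqn_targets(policy_net, target_net, rewards, next_states, dones)
    loss = nn.functional.smooth_l1_loss(q, targets)  # loss choice is an assumption

    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping: cap the global gradient norm at 1.0 before the update.
    torch.nn.utils.clip_grad_norm_(policy_net.parameters(), MAX_GRAD_NORM)
    optimizer.step()

    # Periodically sync the target network.
    if global_step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())
    return loss.item()
```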
Results
| Agent | Episodes to Solve | Avg Reward (last 100) | Max Reward | Training Time |
|---|---|---|---|---|
| Random | N/A | 22.3 | 62 | N/A |
| Vanilla DQN | 185 | 487.2 | 500 | 45 sec |
| Double DQN | 152 | 492.1 | 500 | 47 sec |
| Dueling DQN | 168 | 489.5 | 500 | 48 sec |
| Combined | 128 | 496.8 | 500 | 50 sec |
Q-Value Analysis
| Agent | Mean Q-value (trained) | Overestimation (vs. true value) |
|---|---|---|
| Vanilla DQN | 52.3 | +18.7 |
| Double DQN | 41.2 | +7.6 |
| Combined | 38.9 | +5.3 |
Hyperparameter Sensitivity
| Hyperparameter | Best Value | Sensitivity |
|---|---|---|
| Learning rate | 5e-4 | High -- 1e-2 causes divergence |
| Epsilon decay steps | 10,000 | Medium -- affects early exploration |
| Target update freq | 2,000 | Medium -- too frequent causes instability |
| Replay buffer size | 100,000 | Low -- 10,000+ works well |
| Batch size | 64 | Low -- 32-128 all work |
Key Lessons
- Experience replay and target networks are essential. Without replay, the DQN diverges within 50 episodes due to correlated training data. Without the target network, Q-value estimates oscillate wildly.
- Double DQN reliably reduces overestimation. The vanilla DQN overestimates Q-values by 36% on average; Double DQN cuts this to 18%, resulting in better policies with less wasted exploration.
- The dueling architecture helps most in states where the choice of action matters less. Near the center of the track (small pole angle), the advantage stream is near-zero because both actions are acceptable. Near failure states, the advantage stream clearly separates good and bad actions.
- The epsilon decay schedule significantly affects sample efficiency. Decaying too fast (5,000 steps) causes the agent to miss good state-action pairs, while decaying too slowly (50,000 steps) wastes episodes on random exploration after the Q-values are already accurate. A linear schedule of the kind used here is sketched after this list.
- Gradient clipping prevents catastrophic updates. Without clipping, occasional large TD errors (from rare transitions) cause gradient explosions that destabilize the policy for many subsequent episodes.
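For reference, the linear epsilon schedule described in Step 2 and discussed in the epsilon-decay lesson can be written as a single function (the function name is illustrative):

```python
def linear_epsilon(step: int, start: float = 1.0, end: float = 0.01, decay_steps: int = 10_000) -> float:
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` environment steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)
```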
Code Reference
The complete implementation is available in code/case-study-code.py.