Case Study 1: Training CartPole with DQN

Overview

A robotics startup needs to develop an automated control system for balancing an inverted pendulum on a moving cart. Before deploying to physical hardware, the team prototypes the controller using Deep Q-Networks (DQN) in the Gymnasium CartPole-v1 environment. The goal is to achieve a stable policy that consistently reaches the maximum episode length (500 steps) within 300 training episodes, while systematically comparing DQN variants.

Problem Statement

The CartPole environment presents a classic control challenge:

  1. State space: Four continuous values (cart position, cart velocity, pole angle, angular velocity).
  2. Action space: Two discrete actions (push left, push right).
  3. Reward: +1 for every step the pole stays upright.
  4. Termination: Pole angle exceeds ±12 degrees or the cart position moves beyond ±2.4 units from the center.
  5. Success criterion: Average reward of 475+ over 100 consecutive episodes.
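
These dynamics map directly onto the Gymnasium API. A minimal sketch of inspecting the environment (assuming the gymnasium package is installed; this snippet is illustrative, not part of the case-study code):

    import gymnasium as gym

    # Create the environment described above and inspect its spaces.
    env = gym.make("CartPole-v1")
    print(env.observation_space)  # Box of 4 floats: cart position, cart velocity, pole angle, angular velocity
    print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right

    obs, info = env.reset(seed=0)
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    print(obs, reward, terminated, truncated)
    env.close()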

The key engineering challenges are:

  - Choosing appropriate network architecture and hyperparameters
  - Balancing exploration and exploitation during training
  - Ensuring stable convergence despite non-stationary data

Approach

Step 1: Baseline Random Agent

The team first establishes a random baseline:

  - Average reward: 22.3 steps per episode
  - Maximum reward: 62 steps
  - The pole falls quickly without learned control
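
A sketch of such a baseline run (assuming gymnasium and numpy are available; the figures above come from the team's own runs, not from this snippet):

    import gymnasium as gym
    import numpy as np

    # Run a purely random policy for 100 episodes and report episode returns.
    env = gym.make("CartPole-v1")
    returns = []
    for episode in range(100):
        obs, info = env.reset(seed=episode)
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
            total += reward
            done = terminated or truncated
        returns.append(total)
    print(f"random baseline: mean={np.mean(returns):.1f}, max={np.max(returns):.0f}")
    env.close()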

Step 2: Vanilla DQN

Architecture and hyperparameters:

  - Network: 2 hidden layers, 128 units each, ReLU activation
  - Replay buffer: 100,000 transitions
  - Batch size: 64
  - Learning rate: 1e-3
  - Gamma (discount): 0.99
  - Epsilon: linear decay from 1.0 to 0.01 over 10,000 steps
  - Target network update: every 1,000 steps
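
A sketch of the corresponding network and configuration (PyTorch; names such as QNetwork and config are illustrative, not taken from the case-study code):

    import torch
    import torch.nn as nn

    # Vanilla DQN value network: 4 inputs -> 128 -> 128 -> 2 action values, ReLU activations.
    class QNetwork(nn.Module):
        def __init__(self, obs_dim: int = 4, n_actions: int = 2, hidden: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, obs: torch.Tensor) -> torch.Tensor:
            return self.net(obs)  # Q(s, a) for both actions

    # Hyperparameters listed above.
    config = dict(
        buffer_size=100_000,
        batch_size=64,
        learning_rate=1e-3,
        gamma=0.99,
        epsilon_start=1.0,
        epsilon_end=0.01,
        epsilon_decay_steps=10_000,
        target_update_every=1_000,
    )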

Step 3: Double DQN

To address Q-value overestimation, the team implements Double DQN:

  - Uses the policy network to select the greedy action in the next state
  - Uses the target network to evaluate that action's Q-value
  - All other hyperparameters remain the same
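
The only change from the vanilla update is how the bootstrap target is formed. A sketch (assuming policy_net and target_net are Q-networks as above and the batch tensors come from the replay buffer):

    import torch

    # Double DQN target: the policy network chooses the greedy next action,
    # the target network evaluates that action.
    def double_dqn_targets(policy_net, target_net, rewards, next_obs, dones, gamma=0.99):
        with torch.no_grad():
            next_actions = policy_net(next_obs).argmax(dim=1, keepdim=True)
            next_q = target_net(next_obs).gather(1, next_actions).squeeze(1)
            return rewards + gamma * next_q * (1.0 - dones)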

Step 4: Dueling DQN

The team adds the dueling architecture:

  - Separate value and advantage streams after the shared feature layers
  - Value stream: Linear(128) -> 1
  - Advantage stream: Linear(128) -> 2
  - Q = V + (A - mean(A))
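
A sketch of the dueling head under the same assumptions (the class name is illustrative):

    import torch
    import torch.nn as nn

    # Dueling architecture: shared features feed separate value and advantage
    # streams, recombined as Q = V + (A - mean(A)).
    class DuelingQNetwork(nn.Module):
        def __init__(self, obs_dim: int = 4, n_actions: int = 2, hidden: int = 128):
            super().__init__()
            self.features = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.value = nn.Linear(hidden, 1)              # V(s)
            self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

        def forward(self, obs: torch.Tensor) -> torch.Tensor:
            h = self.features(obs)
            v = self.value(h)
            a = self.advantage(h)
            return v + (a - a.mean(dim=1, keepdim=True))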

Step 5: Combined Improvements

The final agent combines Double DQN and the dueling architecture with tuned hyperparameters:

  - Reduced learning rate: 5e-4
  - Less frequent target updates: every 2,000 steps
  - Gradient clipping: max norm 1.0
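
A sketch of one optimization step for the combined agent, reusing the double_dqn_targets sketch above (the Huber loss here is an assumption; the case study does not specify the loss function):

    import torch
    import torch.nn as nn

    def train_step(policy_net, target_net, optimizer, batch, gamma=0.99, max_grad_norm=1.0):
        obs, actions, rewards, next_obs, dones = batch
        q = policy_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
        targets = double_dqn_targets(policy_net, target_net, rewards, next_obs, dones, gamma)
        loss = nn.functional.smooth_l1_loss(q, targets)
        optimizer.zero_grad()
        loss.backward()
        # Clip gradients to max norm 1.0 so rare large TD errors cannot blow up the update.
        nn.utils.clip_grad_norm_(policy_net.parameters(), max_grad_norm)
        optimizer.step()
        return loss.item()

    # Elsewhere in the training loop, every 2,000 environment steps:
    # target_net.load_state_dict(policy_net.state_dict())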

Results

Agent          Episodes to Solve   Avg Reward (last 100)   Max Reward   Training Time
-----------    -----------------   ---------------------   ----------   -------------
Random         N/A                 22.3                    62           N/A
Vanilla DQN    185                 487.2                   500          45 sec
Double DQN     152                 492.1                   500          47 sec
Dueling DQN    168                 489.5                   500          48 sec
Combined       128                 496.8                   500          50 sec

Q-Value Analysis

Agent          Mean Q-value (trained)   Q-value Overestimation
-----------    ----------------------   ----------------------
Vanilla DQN    52.3                     +18.7 (vs true value)
Double DQN     41.2                     +7.6
Combined       38.9                     +5.3

Hyperparameter Sensitivity

Hyperparameter        Best Value   Sensitivity
-------------------   ----------   -----------------------------------------
Learning rate         5e-4         High -- 1e-2 causes divergence
Epsilon decay steps   10,000       Medium -- affects early exploration
Target update freq    2,000        Medium -- too frequent causes instability
Replay buffer size    100,000      Low -- 10,000+ works well
Batch size            64           Low -- 32-128 all work

Key Lessons

  1. Experience replay and target networks are essential. Without replay, the DQN diverges within 50 episodes due to correlated training data. Without the target network, Q-value estimates oscillate wildly. (A minimal replay-buffer sketch appears after this list.)

  2. Double DQN reliably reduces overestimation. The vanilla DQN overestimates Q-values by 36% on average. Double DQN cuts this to 18%, resulting in better policies with less wasted exploration.

  3. The dueling architecture helps most in states where the action matters less. Near the center of the track (small pole angle), the advantage stream is near-zero because both actions are acceptable. Near failure states, the advantage stream clearly separates good and bad actions.

  4. Epsilon decay schedule significantly affects sample efficiency. Decaying too fast (5,000 steps) cuts off exploration before the agent has discovered the good state-action pairs; decaying too slowly (50,000 steps) wastes episodes on random exploration after the Q-values are already accurate. (A sketch of the linear schedule appears after this list.)

  5. Gradient clipping prevents catastrophic updates. Without clipping, occasional large TD errors (from rare transitions) cause gradient explosions that destabilize the policy for many subsequent episodes.
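
The replay-buffer sketch referenced in lesson 1 (a minimal uniform buffer; the class name is illustrative):

    import random
    from collections import deque

    import numpy as np
    import torch

    # Uniform replay buffer: sampling random minibatches breaks the temporal
    # correlation between consecutive transitions.
    class ReplayBuffer:
        def __init__(self, capacity: int = 100_000):
            self.buffer = deque(maxlen=capacity)

        def push(self, obs, action, reward, next_obs, done):
            self.buffer.append((obs, action, reward, next_obs, done))

        def sample(self, batch_size: int = 64):
            batch = random.sample(self.buffer, batch_size)
            obs, actions, rewards, next_obs, dones = zip(*batch)
            return (torch.as_tensor(np.stack(obs), dtype=torch.float32),
                    torch.as_tensor(actions, dtype=torch.int64),
                    torch.as_tensor(rewards, dtype=torch.float32),
                    torch.as_tensor(np.stack(next_obs), dtype=torch.float32),
                    torch.as_tensor(dones, dtype=torch.float32))

        def __len__(self):
            return len(self.buffer)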
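
The epsilon schedule discussed in lesson 4, as a short linear-decay sketch (parameter names match the config sketch earlier and are illustrative):

    def epsilon_at(step, start=1.0, end=0.01, decay_steps=10_000):
        # Linear decay from start to end over decay_steps, then held at end.
        fraction = min(step / decay_steps, 1.0)
        return start + fraction * (end - start)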

Code Reference

The complete implementation is available in code/case-study-code.py.