Case Study 1: Training CartPole with DQN
Overview
A robotics startup needs to develop an automated control system for balancing an inverted pendulum on a moving cart. Before deploying to physical hardware, the team prototypes the controller using Deep Q-Networks (DQN) in the Gymnasium CartPole-v1 environment. The goal is to achieve a stable policy that consistently reaches the maximum episode length (500 steps) within 300 training episodes, while systematically comparing DQN variants.
Problem Statement
The CartPole environment presents a classic control challenge:
- State space: Four continuous values (cart position, cart velocity, pole angle, angular velocity).
- Action space: Two discrete actions (push left, push right).
- Reward: +1 for every step the pole stays upright.
- Termination: Pole angle exceeds ±12 degrees or cart position leaves the ±2.4 range; episodes are truncated at 500 steps.
- Success criterion: Average reward of 475+ over 100 consecutive episodes.
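These dimensions can be read straight off the environment object. A quick sanity check using the Gymnasium API (nothing here is specific to the case study's code):

```python
import gymnasium as gym

# Create the environment described above and inspect its spaces.
env = gym.make("CartPole-v1")
print(env.observation_space)        # Box of 4 floats: cart position, cart velocity, pole angle, angular velocity
print(env.action_space)             # Discrete(2): push left (0), push right (1)
print(env.spec.max_episode_steps)   # 500 for CartPole-v1
env.close()
```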
The key engineering challenges are:
- Choosing an appropriate network architecture and hyperparameters.
- Balancing exploration and exploitation during training.
- Ensuring stable convergence despite non-stationary data.
Approach
Step 1: Baseline Random Agent
The team first establishes a random baseline:
- Average reward: 22.3 steps per episode.
- Maximum reward: 62 steps.
- The pole falls quickly without learned control.
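A baseline of this kind takes only a few lines with Gymnasium. The sketch below uses an illustrative 100-episode evaluation and per-episode seeds, which are assumptions rather than details from the case study:

```python
import gymnasium as gym
import numpy as np

# Roll out a uniformly random policy to get a no-learning baseline.
env = gym.make("CartPole-v1")
returns = []
for episode in range(100):                 # episode count is illustrative
    obs, info = env.reset(seed=episode)    # per-episode seed is illustrative
    done, total = False, 0.0
    while not done:
        action = env.action_space.sample()  # uniform random action
        obs, reward, terminated, truncated, info = env.step(action)
        total += reward
        done = terminated or truncated
    returns.append(total)
print(f"avg={np.mean(returns):.1f}  max={np.max(returns):.0f}")
env.close()
```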
Step 2: Vanilla DQN
Architecture and hyperparameters:
- Network: 2 hidden layers, 128 units each, ReLU activation.
- Replay buffer: 100,000 transitions.
- Batch size: 64.
- Learning rate: 1e-3.
- Gamma (discount): 0.99.
- Epsilon: linear decay from 1.0 to 0.01 over 10,000 steps.
- Target network update: every 1,000 steps.
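A minimal PyTorch sketch of this setup is shown below; the class and constant names are illustrative, the replay buffer and training loop are omitted, and the values mirror the list above:

```python
import torch
import torch.nn as nn

# Q-network matching the listed architecture: 4 inputs -> 128 -> 128 -> 2 Q-values.
class QNetwork(nn.Module):
    def __init__(self, obs_dim: int = 4, n_actions: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Hyperparameters as listed above.
BUFFER_SIZE = 100_000
BATCH_SIZE = 64
LR = 1e-3
GAMMA = 0.99
EPS_START, EPS_END, EPS_DECAY_STEPS = 1.0, 0.01, 10_000
TARGET_UPDATE_EVERY = 1_000  # environment steps between target-network syncs

policy_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(policy_net.state_dict())  # start with identical weights
optimizer = torch.optim.Adam(policy_net.parameters(), lr=LR)
```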
Step 3: Double DQN
To address Q-value overestimation, the team implements Double DQN:
- Uses the policy network to select the best action.
- Uses the target network to evaluate that action's Q-value.
- All other hyperparameters remain the same.
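The change is confined to how the bootstrap target is computed. A sketch of both targets, written as standalone helper functions (names and signatures are assumptions, not taken from the case study's code):

```python
import torch

def vanilla_dqn_targets(target_net, rewards, next_states, dones, gamma=0.99):
    """Vanilla DQN: the target net both selects and evaluates the next action."""
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * next_q

def double_dqn_targets(policy_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: the policy net selects the next action, the target net evaluates it."""
    with torch.no_grad():
        next_actions = policy_net(next_states).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
        return rewards + gamma * (1.0 - dones) * next_q
```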
Step 4: Dueling DQN
The team adds the dueling architecture:
- Separate value and advantage streams after the shared feature layers.
- Value stream: Linear(128) -> 1.
- Advantage stream: Linear(128) -> 2.
- Q = V + (A - mean(A)).
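A sketch of the dueling head under the same assumptions as the earlier network (illustrative class name; layer sizes follow the list above):

```python
import torch
import torch.nn as nn

# Dueling head: shared features feed separate value and advantage streams,
# recombined as Q = V + (A - mean(A)).
class DuelingQNetwork(nn.Module):
    def __init__(self, obs_dim: int = 4, n_actions: int = 2, hidden: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)
        v = self.value(h)
        a = self.advantage(h)
        return v + (a - a.mean(dim=1, keepdim=True))
```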
Step 5: Combined Improvements
The final agent combines Double DQN and the dueling architecture with tuned hyperparameters:
- Reduced learning rate: 5e-4.
- Less frequent target updates: every 2,000 steps.
- Gradient clipping: max norm 1.0.
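Putting these together, the update step might look like the sketch below, which reuses DuelingQNetwork and double_dqn_targets from the earlier sketches. The Huber (smooth L1) loss and Adam optimizer are assumptions; the learning rate, target-update interval, and clipping norm follow the list above:

```python
import torch
import torch.nn as nn

# Tuned settings for the combined agent.
LR = 5e-4
TARGET_UPDATE_EVERY = 2_000  # environment steps between target-network syncs
MAX_GRAD_NORM = 1.0

policy_net = DuelingQNetwork()  # dueling head from the sketch above
target_net = DuelingQNetwork()
target_net.load_state_dict(policy_net.state_dict())
optimizer = torch.optim.Adam(policy_net.parameters(), lr=LR)

def optimize(batch, global_step):
    """One gradient step on a sampled batch of transitions."""
    states, actions, rewards, next_states, dones = batch
    q = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    targets = double_dqn_targets(policy_net, target_net, rewards, next_states, dones)
    loss = nn.functional.smooth_l1_loss(q, targets)  # loss choice is an assumption

    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping: cap the global gradient norm at 1.0 before the update.
    torch.nn.utils.clip_grad_norm_(policy_net.parameters(), MAX_GRAD_NORM)
    optimizer.step()

    # Periodically sync the target network.
    if global_step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())
    return loss.item()
```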
Results
| Agent | Episodes to Solve | Avg Reward (last 100) | Max Reward | Training Time |
|---|---|---|---|---|
| Random | N/A | 22.3 | 62 | N/A |
| Vanilla DQN | 185 | 487.2 | 500 | 45 sec |
| Double DQN | 152 | 492.1 | 500 | 47 sec |
| Dueling DQN | 168 | 489.5 | 500 | 48 sec |
| Combined | 128 | 496.8 | 500 | 50 sec |
Q-Value Analysis
| Agent | Mean Q-value (trained) | Overestimation (vs. true value) |
|---|---|---|
| Vanilla DQN | 52.3 | +18.7 |
| Double DQN | 41.2 | +7.6 |
| Combined | 38.9 | +5.3 |
Hyperparameter Sensitivity
| Hyperparameter | Best Value | Sensitivity |
|---|---|---|
| Learning rate | 5e-4 | High -- 1e-2 causes divergence |
| Epsilon decay steps | 10,000 | Medium -- affects early exploration |
| Target update freq | 2,000 | Medium -- too frequent causes instability |
| Replay buffer size | 100,000 | Low -- 10,000+ works well |
| Batch size | 64 | Low -- 32-128 all work |
Key Lessons
- Experience replay and target networks are essential. Without replay, the DQN diverges within 50 episodes due to correlated training data. Without the target network, Q-value estimates oscillate wildly.
- Double DQN reliably reduces overestimation. The vanilla DQN overestimates Q-values by 36% on average; Double DQN cuts this to 18%, resulting in better policies with less wasted exploration.
- The dueling architecture helps most in states where the choice of action matters less. Near the center of the track (small pole angle), the advantage stream is near-zero because both actions are acceptable. Near failure states, the advantage stream clearly separates good and bad actions.
- The epsilon decay schedule significantly affects sample efficiency. Decaying too fast (5,000 steps) causes the agent to miss good state-action pairs, while decaying too slowly (50,000 steps) wastes episodes on random exploration after the Q-values are already accurate. A linear schedule of the kind used here is sketched after this list.
- Gradient clipping prevents catastrophic updates. Without clipping, occasional large TD errors (from rare transitions) cause gradient explosions that destabilize the policy for many subsequent episodes.
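For reference, the linear epsilon schedule described in Step 2 and discussed in the epsilon-decay lesson can be written as a single function (the function name is illustrative):

```python
def linear_epsilon(step: int, start: float = 1.0, end: float = 0.01, decay_steps: int = 10_000) -> float:
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` environment steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)
```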
Code Reference
The complete implementation is available in code/case-study-code.py.