Case Study 2: Policy Gradient for Custom Environment

Overview

A logistics company wants to build an RL agent that learns an optimal inventory ordering policy. The environment models daily demand for a perishable product with stochastic customer arrivals. The team implements a custom Gymnasium environment, trains both DQN and PPO agents, and compares the learned policies with a hand-crafted heuristic.

Problem Statement

The inventory management problem has the following structure:

  1. State: Current inventory level (0--100 units), day of week (0--6), recent demand history (last 3 days).
  2. Action: Order quantity -- one of {0, 10, 20, 30, 40, 50} units.
  3. Dynamics: Daily demand follows a Poisson distribution with mean that varies by day of week (weekdays: mean 15, weekends: mean 25). Orders arrive the next day.
  4. Rewards:
       • +2 per unit sold (revenue)
       • -1 per unit of unmet demand (lost-sales penalty)
       • -0.5 per unit of excess inventory at the end of the day (holding cost)
       • -3 per unit of spoiled product (items older than 3 days)
  5. Episode length: 30 days (one month).

The challenges:

  • Balancing overstocking (holding costs, spoilage) with understocking (lost sales)
  • Adapting to varying demand across days of the week
  • Long-term planning: today's order affects tomorrow's inventory
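
A minimal numeric sketch of the demand model and reward arithmetic described above is shown below. The function names and the aggregate-spoilage argument are illustrative, not the exact code from code/case-study-code.py.

    import numpy as np

    WEEKDAY_MEAN, WEEKEND_MEAN = 15, 25   # Poisson means from the problem statement

    def sample_demand(day_of_week: int, rng: np.random.Generator) -> int:
        """Daily demand; days 5 and 6 are assumed to be the weekend."""
        mean = WEEKEND_MEAN if day_of_week >= 5 else WEEKDAY_MEAN
        return int(rng.poisson(mean))

    def daily_reward(units_sold: int, unmet_demand: int,
                     leftover_inventory: int, spoiled_units: int) -> float:
        """Revenue minus lost-sales, holding, and spoilage penalties."""
        return (2.0 * units_sold
                - 1.0 * unmet_demand
                - 0.5 * leftover_inventory
                - 3.0 * spoiled_units)

    # Example: selling 15 units, missing 3 units of demand, and ending the day
    # with 10 unsold units in stock plus 2 spoiled units yields
    # 2*15 - 1*3 - 0.5*10 - 3*2 = 16.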

Approach

Step 1: Custom Environment

The team implements a Gymnasium-compatible environment with:

  • reset(): initialize inventory to 30 units on a random day
  • step(action): process the order, simulate demand, compute the reward
  • State includes: inventory level, day of week, 3-day demand history (6 values total)
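
A condensed sketch of such an environment follows. It simplifies spoilage to a fixed fraction of leftover stock (rather than explicit 3-day age tracking) and includes the in-transit order as a sixth observation feature to keep the state Markovian; these details, the class name, and the info keys are assumptions for illustration, and code/case-study-code.py remains the authoritative implementation.

    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces

    ORDER_QUANTITIES = [0, 10, 20, 30, 40, 50]

    class InventoryEnv(gym.Env):
        """Simplified inventory environment (illustrative sketch)."""

        def __init__(self, episode_length: int = 30):
            super().__init__()
            self.episode_length = episode_length
            self.action_space = spaces.Discrete(len(ORDER_QUANTITIES))
            # Observation: [inventory, day_of_week, in_transit_order,
            #               demand on day t-1, t-2, t-3]
            self.observation_space = spaces.Box(
                low=0.0, high=np.inf, shape=(6,), dtype=np.float32)

        def _obs(self):
            return np.array(
                [self.inventory, self.day, self.pending_order, *self.demand_history],
                dtype=np.float32)

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            self.inventory = 30
            self.day = int(self.np_random.integers(0, 7))   # random starting day
            self.demand_history = [0, 0, 0]
            self.pending_order = 0                           # arrives tomorrow
            self.t = 0
            return self._obs(), {}

        def step(self, action):
            # Yesterday's order arrives this morning; today's order arrives tomorrow.
            self.inventory += self.pending_order
            self.pending_order = ORDER_QUANTITIES[int(action)]

            mean = 25 if self.day >= 5 else 15               # weekend vs. weekday demand
            demand = int(self.np_random.poisson(mean))

            sold = min(self.inventory, demand)
            unmet = demand - sold
            self.inventory -= sold

            # Simplified spoilage: a fixed share of leftover stock spoils each day.
            spoiled = int(0.1 * self.inventory)
            self.inventory -= spoiled

            reward = 2.0 * sold - 1.0 * unmet - 0.5 * self.inventory - 3.0 * spoiled

            self.demand_history = [demand] + self.demand_history[:2]
            self.day = (self.day + 1) % 7
            self.t += 1
            terminated = self.t >= self.episode_length       # the 30-day month is over
            info = {"sold": sold, "unmet": unmet, "spoiled": spoiled}
            return self._obs(), reward, terminated, False, info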

Step 2: Heuristic Baseline

A simple order-up-to policy:

  • Target inventory: 35 units
  • Order quantity: max(0, target - current_inventory)
  • Rounded to the nearest valid action
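
Expressed against the environment sketch above (reusing ORDER_QUANTITIES), the baseline might look like the following; the exact rounding rule is an assumption.

    def heuristic_action(current_inventory: int, target: int = 35) -> int:
        """Order-up-to policy: order the gap to the target, as the nearest valid action."""
        gap = max(0, target - current_inventory)
        quantity = min(50, round(gap / 10) * 10)   # snap to the action grid {0, 10, ..., 50}
        return ORDER_QUANTITIES.index(quantity)

The function returns an index into the Discrete action space, so it can be dropped into the same evaluation loop as the learned agents.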

Step 3: DQN Agent

  • Network: 2 hidden layers (128, 64), ReLU
  • Replay buffer: 50,000 transitions
  • Epsilon: decayed from 1.0 to 0.05 over 20,000 steps
  • Trained for 500 episodes
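
For reference, here is one way these settings could be expressed with Stable-Baselines3 against the environment sketch above; the case study's own training loop in code/case-study-code.py may be implemented differently.

    from stable_baselines3 import DQN

    env = InventoryEnv()
    total_timesteps = 500 * 30          # 500 episodes of 30 days each

    dqn_model = DQN(
        "MlpPolicy",
        env,
        policy_kwargs=dict(net_arch=[128, 64]),   # two hidden layers (ReLU is the DQN default)
        buffer_size=50_000,
        exploration_initial_eps=1.0,
        exploration_final_eps=0.05,
        # Fraction of training spent decaying epsilon; the case study decays
        # over 20,000 steps, so scale it to the total step budget.
        exploration_fraction=min(1.0, 20_000 / total_timesteps),
        verbose=0,
    )
    dqn_model.learn(total_timesteps=total_timesteps)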

Step 4: PPO Agent

  • Actor-critic with shared features: 2 hidden layers (128, 64)
  • GAE lambda: 0.95
  • Clip epsilon: 0.2
  • Entropy coefficient: 0.01
  • Trained for 500 episodes, rollouts of 2,048 steps
  • 4 epochs per rollout
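
The corresponding Stable-Baselines3 sketch (again an assumption about tooling, not the case study's exact code):

    from stable_baselines3 import PPO

    ppo_model = PPO(
        "MlpPolicy",
        InventoryEnv(),
        policy_kwargs=dict(net_arch=[128, 64]),   # hidden layers for the actor-critic network
        n_steps=2_048,       # rollout length
        n_epochs=4,          # optimization epochs per rollout
        gae_lambda=0.95,
        clip_range=0.2,
        ent_coef=0.01,
        verbose=0,
    )
    ppo_model.learn(total_timesteps=500 * 30)    # 500 episodes of 30 days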

Step 5: Analysis

The team analyzes the learned policies by:

  • Comparing average reward per episode
  • Examining the ordering pattern by day of week
  • Measuring stockout rate and spoilage rate
  • Conducting a sensitivity analysis on demand variance
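
A sketch of an evaluation loop that produces the first three metrics is shown below. It assumes the info keys from the environment sketch above and counts stockout and spoilage rates as the fraction of days on which any units were lost -- one plausible definition, not necessarily the one used in the case study.

    import numpy as np

    def evaluate(model, env, episodes: int = 50):
        """Roll out a trained SB3 policy and report average reward and loss rates."""
        rewards, stockout_days, spoilage_days, total_days = [], 0, 0, 0
        for _ in range(episodes):
            obs, _ = env.reset()
            done, ep_reward = False, 0.0
            while not done:
                action, _ = model.predict(obs, deterministic=True)
                obs, reward, terminated, truncated, info = env.step(action)
                ep_reward += reward
                stockout_days += int(info["unmet"] > 0)
                spoilage_days += int(info["spoiled"] > 0)
                total_days += 1
                done = terminated or truncated
            rewards.append(ep_reward)
        return {
            "avg_reward": float(np.mean(rewards)),
            "stockout_rate": stockout_days / total_days,
            "spoilage_rate": spoilage_days / total_days,
        }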

Results

Training Performance

  Agent       Avg Reward (last 50 ep.)   Stockout Rate   Spoilage Rate   Training Time
  Random      -42.7                      38.2%           22.4%           N/A
  Heuristic    28.4                       8.1%            5.3%           N/A
  DQN          35.2                       5.7%            3.8%           2 min
  PPO          38.9                       4.2%            2.9%           3 min

Learned Policy Analysis

Average order quantity (units) by day of week for each policy:

  Policy      Mon   Tue   Wed   Thu   Fri   Sat   Sun
  Heuristic    15    15    15    15    15    25    25
  DQN          10    10    20    10    30    20    30
  PPO          10    10    20    10    40    20    10

Notable: The PPO agent learns to place a large order on Friday to prepare for weekend demand, then orders minimally on Sunday because Monday demand is lower. This anticipatory behavior emerges without any explicit programming.
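
The per-day averages in the table can be tabulated with a short rollout loop like the sketch below, which assumes the observation layout from the environment sketch above (the day of week is the second feature).

    from collections import defaultdict

    def average_order_by_day(model, env, episodes: int = 50):
        """Mean order quantity placed by a trained policy on each day of the week."""
        orders = defaultdict(list)
        for _ in range(episodes):
            obs, _ = env.reset()
            done = False
            while not done:
                day = int(obs[1])                        # day-of-week feature
                action, _ = model.predict(obs, deterministic=True)
                orders[day].append(ORDER_QUANTITIES[int(action)])
                obs, _, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
        return {day: sum(qs) / len(qs) for day, qs in sorted(orders.items())}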

Sensitivity to Demand Variance

  Demand Variance    Heuristic Reward   DQN Reward   PPO Reward
  Low (std = 3)      32.1               37.8         41.2
  Medium (std = 7)   28.4               35.2         38.9
  High (std = 12)    18.6               28.3         31.5

The RL agents degrade more gracefully under high variance because they learn implicit safety margins, while the heuristic's fixed target becomes suboptimal.
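
A sweep like this can be scripted in a few lines, assuming a variant of the environment that exposes a demand-spread parameter; the Poisson sketch above ties the variance to the mean, so it would need a different demand model (e.g. negative binomial) to vary the spread independently.

    # Hypothetical sensitivity sweep; demand_std is an assumed constructor argument.
    for std in (3, 7, 12):
        eval_env = InventoryEnv(demand_std=std)
        print(f"std={std}:", evaluate(ppo_model, eval_env, episodes=50))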

Key Lessons

  1. PPO outperforms DQN on this structured decision problem. The stochastic policy from PPO naturally handles the uncertainty in demand, while DQN's deterministic policy is less adaptable. PPO's advantage is most pronounced under high demand variance.

  2. RL discovers non-obvious temporal strategies. The Friday pre-ordering pattern and the day-dependent ordering emerged purely from reward optimization. A human expert might design a similar heuristic, but the RL agent found it automatically.

  3. Custom environments require careful reward engineering. The initial reward function only included revenue and holding costs. The team had to add the spoilage penalty and lost-sales penalty to prevent degenerate policies (e.g., ordering maximum every day).

  4. The heuristic baseline is surprisingly strong in low-variance settings. Under stable demand, the simple order-up-to policy comes within about 15% of DQN's average reward and just over 20% of PPO's. The RL advantage becomes significant only when the environment has sufficient complexity and variability.

  5. State representation matters more than algorithm choice. Including the 3-day demand history improved both DQN and PPO by 15-20%. Without it, the agents could not anticipate demand patterns and performed only marginally better than the heuristic.

  6. Training stability requires tuning. DQN was sensitive to learning rate and target update frequency. PPO was more robust but required careful tuning of the entropy coefficient to prevent premature convergence to a deterministic (but suboptimal) ordering policy.

Code Reference

The complete implementation is available in code/case-study-code.py.