Case Study 2: Policy Gradient for Custom Environment
Overview
A logistics company wants to build an RL agent that learns an optimal inventory ordering policy. The environment models daily demand for a perishable product with stochastic customer arrivals. The team implements a custom Gymnasium environment, trains both DQN and PPO agents, and compares the learned policies with a hand-crafted heuristic.
Problem Statement
The inventory management problem has the following structure:
- State: Current inventory level (0--100 units), day of week (0--6), recent demand history (last 3 days).
- Action: Order quantity -- one of {0, 10, 20, 30, 40, 50} units.
- Dynamics: Daily demand follows a Poisson distribution with mean that varies by day of week (weekdays: mean 15, weekends: mean 25). Orders arrive the next day.
- Rewards (sketched in code after the challenge list below):
  - +2 per unit sold (revenue)
  - -1 per unit of unmet demand (lost-sales penalty)
  - -0.5 per unit of excess inventory at end of day (holding cost)
  - -3 per unit of spoiled product (items older than 3 days)
- Episode length: 30 days (one month).
The challenges:
- Balancing overstocking (holding costs, spoilage) with understocking (lost sales)
- Adapting to varying demand across days of the week
- Long-term planning: today's order affects tomorrow's inventory
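To make the reward structure concrete, the per-day reward can be written as a single function. This is a minimal sketch, not the team's code; the name `daily_reward` and the worked numbers are illustrative.

```python
def daily_reward(sold: int, unmet: int, leftover: int, spoiled: int) -> float:
    """Per-day reward: revenue minus lost-sales, holding, and spoilage costs."""
    return 2.0 * sold - 1.0 * unmet - 0.5 * leftover - 3.0 * spoiled

# Example: sell 15 units, miss 3, carry 8 overnight, spoil 2
# -> 2*15 - 1*3 - 0.5*8 - 3*2 = 30 - 3 - 4 - 6 = 17
print(daily_reward(sold=15, unmet=3, leftover=8, spoiled=2))  # 17.0
```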
Approach
Step 1: Custom Environment
The team implements a Gymnasium-compatible environment with:
- reset(): Initialize inventory to 30 units on a random day
- step(action): Process order, simulate demand, compute reward
- State includes: inventory level, day of week, 3-day demand history (6 values total)
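A minimal sketch of what such an environment could look like, assuming Gymnasium's `reset`/`step` API. The class name `InventoryEnv` is illustrative, not the team's implementation; the 6-value observation here includes the in-transit order (an assumption, to keep the state Markovian), and per-item age tracking for spoilage is omitted for brevity.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

DEMAND_MEAN = [15, 15, 15, 15, 15, 25, 25]   # Mon-Fri: mean 15, Sat-Sun: mean 25
ORDER_SIZES = [0, 10, 20, 30, 40, 50]

class InventoryEnv(gym.Env):
    """Simplified inventory environment; spoilage (per-item age tracking) omitted for brevity."""

    def __init__(self, episode_days: int = 30):
        super().__init__()
        self.episode_days = episode_days
        self.action_space = spaces.Discrete(len(ORDER_SIZES))
        # inventory, day of week, last 3 days of demand, in-transit order (assumed encoding)
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(6,), dtype=np.float32)

    def _obs(self):
        return np.array(
            [self.inventory, self.day, *self.demand_history, self.pending_order],
            dtype=np.float32,
        )

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.inventory = 30                           # start with 30 units
        self.day = int(self.np_random.integers(7))    # random day of week
        self.demand_history = [0, 0, 0]
        self.pending_order = 0
        self.t = 0
        return self._obs(), {}

    def step(self, action):
        self.inventory += self.pending_order            # yesterday's order arrives
        self.pending_order = ORDER_SIZES[int(action)]   # today's order arrives tomorrow
        demand = int(self.np_random.poisson(DEMAND_MEAN[self.day]))
        sold = min(self.inventory, demand)
        unmet = demand - sold
        self.inventory -= sold
        reward = 2.0 * sold - 1.0 * unmet - 0.5 * self.inventory   # spoilage term omitted here
        self.demand_history = self.demand_history[1:] + [demand]
        self.day = (self.day + 1) % 7
        self.t += 1
        terminated = self.t >= self.episode_days         # fixed 30-day horizon
        return self._obs(), reward, terminated, False, {"sold": sold, "unmet": unmet}
```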
Step 2: Heuristic Baseline
A simple order-up-to policy:
- Target inventory: 35 units
- Order quantity: max(0, target - current_inventory)
- Rounded to the nearest valid action
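A sketch of this rule, assuming the same action set as above; the helper name `heuristic_order` is illustrative.

```python
ORDER_SIZES = [0, 10, 20, 30, 40, 50]

def heuristic_order(current_inventory: int, target: int = 35) -> int:
    """Order-up-to policy: cover the gap to the target, rounded to the nearest valid order size."""
    gap = max(0, target - current_inventory)
    return min(ORDER_SIZES, key=lambda q: abs(q - gap))  # ties resolve to the smaller size
```

For example, with 22 units on hand the gap to the target is 13, so the rule orders 10.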
Step 3: DQN Agent
- Network: 2 hidden layers (128, 64), ReLU
- Replay buffer: 50,000 transitions
- Epsilon: decayed from 1.0 to 0.05 over 20,000 steps
- Trained for 500 episodes
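The case study does not state which library the team used. As one possible realization, the listed hyperparameters map onto Stable-Baselines3's DQN roughly as follows; the `exploration_fraction` value and `total_timesteps` are assumptions about how the 20,000-step epsilon schedule and 500 episodes were configured.

```python
from stable_baselines3 import DQN

env = InventoryEnv()  # the environment sketched in Step 1

dqn_model = DQN(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=[128, 64]),  # two hidden layers; ReLU is the default activation
    buffer_size=50_000,                      # replay buffer of 50,000 transitions
    exploration_initial_eps=1.0,
    exploration_final_eps=0.05,
    exploration_fraction=1.0,                # anneal epsilon across training (assumed mapping)
    verbose=0,
)
dqn_model.learn(total_timesteps=500 * 30)    # roughly 500 episodes of 30 days
```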
Step 4: PPO Agent
- Actor-critic with shared features: 2 hidden layers (128, 64)
- GAE lambda: 0.95
- Clip epsilon: 0.2
- Entropy coefficient: 0.01
- Trained for 500 episodes, rollouts of 2,048 steps
- 4 epochs per rollout
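Again assuming Stable-Baselines3, the PPO settings could be passed as shown below. Note that in recent SB3 versions `net_arch=[128, 64]` gives the actor and critic separate (not shared) trunks, so the shared-feature detail is only approximated in this sketch.

```python
from stable_baselines3 import PPO

ppo_model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=[128, 64]),  # hidden layer sizes for actor and critic
    n_steps=2048,       # rollout length
    n_epochs=4,         # optimization epochs per rollout
    gae_lambda=0.95,    # GAE lambda
    clip_range=0.2,     # PPO clipping epsilon
    ent_coef=0.01,      # entropy bonus coefficient
    verbose=0,
)
ppo_model.learn(total_timesteps=500 * 30)
```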
Step 5: Analysis
The team analyzes the learned policies by:
- Comparing average reward per episode
- Examining the ordering pattern by day of week
- Measuring stockout rate and spoilage rate
- Conducting sensitivity analysis on demand variance
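A sketch of an evaluation loop for the first three metrics, built on the `InventoryEnv` and `heuristic_order` helpers sketched above; the info keys `"sold"` and `"unmet"` are assumptions of that sketch, and spoilage is not tracked there.

```python
import numpy as np

def evaluate(policy_fn, env, episodes: int = 100):
    """Return mean episode reward and the fraction of days with unmet demand (stockouts)."""
    rewards, stockout_days, total_days = [], 0, 0
    for _ in range(episodes):
        obs, _ = env.reset()
        done, ep_reward = False, 0.0
        while not done:
            obs, r, terminated, truncated, info = env.step(policy_fn(obs))
            ep_reward += r
            stockout_days += int(info["unmet"] > 0)
            total_days += 1
            done = terminated or truncated
        rewards.append(ep_reward)
    return float(np.mean(rewards)), stockout_days / total_days

# Heuristic baseline: obs[0] holds the inventory level in the sketched observation
print(evaluate(lambda obs: ORDER_SIZES.index(heuristic_order(int(obs[0]))), InventoryEnv()))
# Trained agent, e.g.: evaluate(lambda obs: ppo_model.predict(obs, deterministic=True)[0], InventoryEnv())
```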
Results
Training Performance
| Agent | Avg Reward (last 50 ep.) | Stockout Rate | Spoilage Rate | Training Time |
|---|---|---|---|---|
| Random | -42.7 | 38.2% | 22.4% | N/A |
| Heuristic | 28.4 | 8.1% | 5.3% | N/A |
| DQN | 35.2 | 5.7% | 3.8% | 2 min |
| PPO | 38.9 | 4.2% | 2.9% | 3 min |
Learned Policy Analysis
Average order quantity by day of week for each policy:
| Policy | Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|---|---|---|---|---|---|---|---|
| Heuristic | 15 | 15 | 15 | 15 | 15 | 25 | 25 |
| DQN | 10 | 10 | 20 | 10 | 30 | 20 | 30 |
| PPO | 10 | 10 | 20 | 10 | 40 | 20 | 10 |
Notable: The PPO agent learns to place a large order on Friday to prepare for weekend demand, then orders minimally on Sunday because Monday demand is lower. This anticipatory behavior emerges without any explicit programming.
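One way such a table can be produced is by logging the order placed on each day of the week during evaluation rollouts. A sketch, assuming the 6-value observation above where `obs[1]` holds the day index:

```python
from collections import defaultdict
import numpy as np

def average_order_by_day(policy_fn, env, episodes: int = 50):
    """Average order quantity placed on each day of the week over evaluation rollouts."""
    orders = defaultdict(list)
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action = policy_fn(obs)
            orders[int(obs[1])].append(ORDER_SIZES[int(action)])  # obs[1] is the day of week
            obs, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
    return {day: float(np.mean(qty)) for day, qty in sorted(orders.items())}
```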
Sensitivity to Demand Variance
| Demand Variance | Heuristic Reward | DQN Reward | PPO Reward |
|---|---|---|---|
| Low (std=3) | 32.1 | 37.8 | 41.2 |
| Medium (std=7) | 28.4 | 35.2 | 38.9 |
| High (std=12) | 18.6 | 28.3 | 31.5 |
The RL agents degrade more gracefully under high variance because they learn implicit safety margins, while the heuristic's fixed target becomes suboptimal.
Key Lessons
- PPO outperforms DQN on this structured decision problem. The stochastic policy from PPO naturally handles the uncertainty in demand, while DQN's deterministic policy is less adaptable. PPO's advantage is most pronounced under high demand variance.
- RL discovers non-obvious temporal strategies. The Friday pre-ordering pattern and the day-dependent ordering emerged purely from reward optimization. A human expert might design a similar heuristic, but the RL agent found it automatically.
- Custom environments require careful reward engineering. The initial reward function only included revenue and holding costs. The team had to add the spoilage penalty and lost-sales penalty to prevent degenerate policies (e.g., ordering the maximum every day).
- The heuristic baseline is surprisingly strong in low-variance settings. Under stable demand the simple order-up-to policy trails the RL agents by its smallest margin (roughly 6-9 reward points in the sensitivity table), and the RL advantage becomes significant only when the environment has sufficient complexity and variability.
- State representation matters more than algorithm choice. Including the 3-day demand history improved both DQN and PPO by 15-20%. Without it, the agents could not anticipate demand patterns and performed only marginally better than the heuristic.
- Training stability requires tuning. DQN was sensitive to learning rate and target update frequency. PPO was more robust but required careful tuning of the entropy coefficient to prevent premature convergence to a deterministic (but suboptimal) ordering policy.
Code Reference
The complete implementation is available in code/case-study-code.py.