Case Study 2: Post-Mortem: When the Bot Went Wrong
Overview
Every automated trading system will eventually encounter failures. What separates good operators from bad ones is not the absence of failures but the quality of the response: detection speed, damage limitation, root cause analysis, and implementation of fixes. This case study examines three fictional but realistic bot failure scenarios, diagnoses their root causes, and implements fixes.
Incident 1: The Silent API Outage
Timeline
Tuesday 14:32 UTC -- The prediction market platform begins experiencing degraded API performance. Response times increase from 200ms to 3-5 seconds.
14:35 -- The trading bot's retry handler starts triggering. Requests succeed on retry, so the circuit breaker does not trip. No alerts are sent.
14:42 -- The platform's API begins returning stale data. Price quotes are 10 minutes old, but the response format is correct and the HTTP status is 200 OK. The bot does not detect the staleness.
14:48 -- The bot's signal generator sees a "deviation" between the stale price and what it expects based on its EMA. It generates a BUY signal for Market A at 0.42, believing the price has dropped significantly.
14:49 -- The order is submitted. Despite the slow API, the order endpoint is working normally. The order is placed at 0.42 when the actual market price is 0.50. The order does not fill immediately (it is 8 cents below the ask).
14:55 -- More stale-data signals fire. The bot submits three more orders in other markets, all based on stale prices.
15:10 -- The platform resolves the data issue. Fresh prices flow in. The bot's EMA updates, and the "deviations" disappear. No new signals fire.
15:15 -- One of the four stale-data orders fills. A seller hits the 0.42 bid in Market A (perhaps also using stale data or making an error). The bot now has a long position at 0.42 in a market trading at 0.50.
15:20 -- Alex (the bot operator) returns from lunch and notices the position. The unrealized P&L is +$1.20 (8 cents * 15 contracts), but Alex is concerned because the entry was based on bad data.
15:30 -- Alex reviews the logs and discovers the stale data issue. The other three orders are still open and unfilled. Alex manually cancels them.
Impact
- One position entered based on stale data (happened to be profitable by luck)
- Three additional orders placed based on stale data (unfilled, no damage)
- 58 minutes from the onset of degradation (14:32) to human detection (15:30), with no automated alert; the data itself was stale for roughly 28 of those minutes
Root Cause Analysis
Primary cause: The bot did not validate data freshness. It relied solely on receiving a 200 OK response and valid-looking data. It did not check whether the timestamp of the data was current.
Contributing cause: The retry handler masked the degradation. Slow but successful responses prevented the circuit breaker from tripping, which would have stopped trading.
Contributing cause: No alert for elevated response times. The bot monitored for failures but not for degradation.
Fix Implementation
```python
import logging
from collections import deque
from datetime import datetime
from typing import Dict, Tuple


class StaleDataDetector:
    """Detects when market data is stale based on timestamps."""

    def __init__(self, max_staleness_seconds: float = 120.0,
                 max_unchanged_intervals: int = 10):
        self.max_staleness_seconds = max_staleness_seconds
        self.max_unchanged_intervals = max_unchanged_intervals
        self.last_prices: Dict[str, float] = {}
        self.unchanged_count: Dict[str, int] = {}
        self.logger = logging.getLogger("StaleDataDetector")

    def check_freshness(self, market_id: str, price: float,
                        data_timestamp: datetime) -> Tuple[bool, str]:
        """
        Returns (is_fresh, reason).
        is_fresh is True if data appears current.
        """
        # Naive UTC; data_timestamp must also be naive UTC.
        now = datetime.utcnow()

        # Check 1: Is the data timestamp recent?
        if data_timestamp:
            age_seconds = (now - data_timestamp).total_seconds()
            if age_seconds > self.max_staleness_seconds:
                return False, (
                    f"Data is {age_seconds:.0f}s old "
                    f"(max: {self.max_staleness_seconds}s)"
                )

        # Check 2: Has the price been unchanged for too many intervals?
        last_price = self.last_prices.get(market_id)
        if last_price is not None and abs(price - last_price) < 1e-9:
            self.unchanged_count[market_id] = (
                self.unchanged_count.get(market_id, 0) + 1
            )
            if self.unchanged_count[market_id] >= self.max_unchanged_intervals:
                return False, (
                    f"Price unchanged for "
                    f"{self.unchanged_count[market_id]} intervals"
                )
        else:
            self.unchanged_count[market_id] = 0

        self.last_prices[market_id] = price
        return True, "OK"


class LatencyMonitor:
    """Monitors API response latency and alerts on degradation."""

    def __init__(self, warning_threshold_ms: float = 1000.0,
                 critical_threshold_ms: float = 5000.0,
                 window_size: int = 20):
        self.warning_threshold_ms = warning_threshold_ms
        self.critical_threshold_ms = critical_threshold_ms
        self.latencies: deque = deque(maxlen=window_size)
        self.logger = logging.getLogger("LatencyMonitor")

    def record_latency(self, latency_ms: float) -> str:
        self.latencies.append(latency_ms)
        # Alert on the average of the five most recent requests.
        if len(self.latencies) >= 5:
            recent_avg = sum(list(self.latencies)[-5:]) / 5
            if recent_avg > self.critical_threshold_ms:
                self.logger.critical(
                    f"API latency critical: {recent_avg:.0f}ms avg"
                )
                return "CRITICAL"
            elif recent_avg > self.warning_threshold_ms:
                self.logger.warning(
                    f"API latency elevated: {recent_avg:.0f}ms avg"
                )
                return "WARNING"
        return "OK"
```
Post-Fix Validation
After implementing these fixes:
- The `StaleDataDetector` is integrated into the `DataFeedHandler`. Stale data triggers a `stale_data` event that pauses signal generation for the affected market.
- The `LatencyMonitor` is integrated into the API client. High latency triggers an alert and, at CRITICAL level, pauses all trading.
- Alex runs a simulation where data timestamps are artificially aged. The detector correctly identifies stale data and pauses trading within one poll cycle.
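That simulation can also be captured as regression tests. A sketch, assuming direct access to the detector (the two-hour offset and market ID are arbitrary):

```python
from datetime import datetime, timedelta

def test_stale_timestamp_is_rejected():
    detector = StaleDataDetector(max_staleness_seconds=120.0)
    aged = datetime.utcnow() - timedelta(hours=2)  # artificially aged
    is_fresh, reason = detector.check_freshness("market-a", 0.42, aged)
    assert not is_fresh
    assert "old" in reason

def test_frozen_feed_is_rejected():
    detector = StaleDataDetector(max_unchanged_intervals=10)
    for _ in range(11):  # same price 11 polls in a row
        is_fresh, _ = detector.check_freshness(
            "market-a", 0.42, datetime.utcnow()
        )
    assert not is_fresh
```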
Lessons Learned
- Do not trust HTTP 200 blindly. A successful HTTP response does not mean the data is correct or current.
- Monitor degradation, not just failures. The circuit breaker pattern handles failures well, but gradual degradation requires different detection.
- Always validate data timestamps. If the data source provides timestamps, check them. If it does not, track price changes to detect frozen feeds.
- Profitable outcomes from buggy behavior are still bugs. The position was profitable, but the process was broken. Next time it might not be profitable.
Incident 2: The Position Limit Bypass
Timeline
Thursday 09:15 UTC -- The bot is operating normally with these risk limits:

- Max position per market: 100 contracts
- Max order size: 50 contracts
09:22 -- A strong signal fires for Market B. The bot submits a BUY order for 45 contracts at 0.38. (Risk check passes: 45 < 50 order limit, 0+45 < 100 position limit.)
09:23 -- The order is submitted to the exchange but does not fill immediately.
09:25 -- Another strong signal fires for Market B (the price has moved further in the favorable direction). The bot submits a second BUY order for 42 contracts at 0.36.
Risk check evaluation:

- Order size: 42 < 50. PASS.
- Position limit: current_position(0) + order_quantity(42) = 42 < 100. PASS.
The bug: The risk check uses current_position, which only reflects filled orders. The first order for 45 contracts is still outstanding (unfilled). The risk check does not account for pending orders.
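For concreteness, the flawed check would have looked something like this sketch (names follow the fix code later in this section):

```python
def check_position_limit(self, market_id: str, quantity: float) -> bool:
    # BUG: counts only confirmed fills; the 45-contract order resting
    # unfilled on the book is invisible to this check.
    pos = self.position_tracker.get_position(market_id)
    current_qty = pos["quantity"] if pos else 0.0
    return current_qty + quantity < self.limits.max_position_size
```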
09:27 -- Both orders fill within seconds of each other. The bot now has a position of 87 contracts (45 + 42), which is within the 100-contract limit. The safe outcome was luck, not design: had the bot already held 49 contracts, two maximum-size orders of 50 would each have passed the check against the confirmed position (49 + 50 = 99 < 100), and the filled position would have been 149 contracts -- well above the limit.
09:30 -- A third signal fires. The bot submits a BUY order for 40 contracts. Risk check:

- Position limit: current_position(87) + order_quantity(40) = 127 > 100. FAIL.
This order is correctly rejected. But the underlying damage remains: the effective position limit was not 100 contracts but nearly 100 + max_order_size = 150 contracts in the worst case, where two max-size orders are outstanding against the same confirmed position.
Impact
- Position limit was effectively 50% higher than intended
- No actual loss in this incident, but the risk was materially higher than configured
- If both orders had been larger, the limit could have been meaningfully breached
Root Cause Analysis
Primary cause: The pre-trade risk check did not account for pending (submitted but unfilled) orders. It only checked against confirmed positions.
Contributing cause: No "exposure including pending" metric was tracked or displayed.
Fix Implementation
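The fix references a `RiskLimits` container from the bot's existing codebase. Its exact shape is not shown in this case study; a plausible sketch, with the per-market dollar cap being purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class RiskLimits:
    max_order_size: float = 50.0             # contracts per order
    max_position_size: float = 100.0         # contracts per market
    max_position_value: float = 500.0        # dollars per market (illustrative)
    max_portfolio_exposure: float = 2000.0   # dollars across all markets
```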
```python
import logging
from typing import Tuple


class ExposureAwareRiskManager:
    """
    Risk manager that accounts for both filled positions
    AND pending orders when checking limits.
    """

    def __init__(self, limits: RiskLimits, position_tracker,
                 order_manager, event_bus):
        self.limits = limits
        self.position_tracker = position_tracker
        self.order_manager = order_manager
        self.event_bus = event_bus
        self.logger = logging.getLogger("ExposureAwareRisk")

    def _get_pending_exposure(self, market_id: str, side: str) -> float:
        """
        Calculate the quantity of pending (unfilled) orders
        for a given market and side.
        """
        open_orders = self.order_manager.get_open_orders(market_id)
        pending_qty = 0.0
        for order in open_orders:
            if order.side == side:
                pending_qty += order.remaining_quantity
        return pending_qty

    def _get_effective_position(self, market_id: str,
                                order_side: str,
                                order_qty: float) -> float:
        """
        Calculate the effective position including pending orders
        and the proposed new order.
        """
        # Current confirmed position
        pos = self.position_tracker.get_position(market_id)
        current_qty = pos["quantity"] if pos else 0.0

        # Pending buy and sell orders
        pending_buys = self._get_pending_exposure(market_id, "BUY")
        pending_sells = self._get_pending_exposure(market_id, "SELL")

        # Worst-case position if all pending orders fill
        worst_case_qty = current_qty + pending_buys - pending_sells

        # Add the proposed order
        if order_side == "BUY":
            worst_case_qty += order_qty
        else:
            worst_case_qty -= order_qty
        return worst_case_qty

    def check_position_limit(self, market_id: str, side: str,
                             quantity: float,
                             price: float) -> Tuple[bool, str]:
        """Check position limit including pending orders."""
        effective_position = self._get_effective_position(
            market_id, side, quantity
        )
        if abs(effective_position) > self.limits.max_position_size:
            return False, (
                f"Effective position {effective_position:.0f} "
                f"(including pending orders) exceeds limit "
                f"{self.limits.max_position_size}"
            )

        effective_value = abs(effective_position * price)
        if effective_value > self.limits.max_position_value:
            return False, (
                f"Effective position value ${effective_value:.2f} "
                f"exceeds limit ${self.limits.max_position_value:.2f}"
            )
        return True, "OK"

    def check_portfolio_exposure(self, market_id: str, side: str,
                                 quantity: float,
                                 price: float) -> Tuple[bool, str]:
        """Check portfolio exposure including all pending orders."""
        # Current confirmed exposure
        current_exposure = self.position_tracker.get_total_exposure()

        # Add all pending orders across all markets
        all_open_orders = self.order_manager.get_open_orders()
        pending_exposure = sum(
            o.remaining_quantity * o.price for o in all_open_orders
        )

        # Add proposed order
        proposed_exposure = quantity * price
        total_exposure = (
            current_exposure + pending_exposure + proposed_exposure
        )
        if total_exposure > self.limits.max_portfolio_exposure:
            return False, (
                f"Total exposure ${total_exposure:.2f} "
                f"(confirmed: ${current_exposure:.2f}, "
                f"pending: ${pending_exposure:.2f}, "
                f"proposed: ${proposed_exposure:.2f}) "
                f"exceeds limit ${self.limits.max_portfolio_exposure:.2f}"
            )
        return True, "OK"
```
Post-Fix Validation
After implementing the fix:
- Alex writes a unit test that places orders without fills and verifies a subsequent order is rejected when combined exposure exceeds limits (a sketch follows this list).
- Alex runs paper trading for 48 hours to verify the fix does not reject orders too aggressively.
- Alex adds a dashboard metric: "Effective Exposure" that shows confirmed positions + pending orders.
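That unit test might look like the following sketch, replaying the incident's order sequence; the stub classes stand in for the bot's real position tracker and order manager:

```python
class StubPositionTracker:
    def get_position(self, market_id):
        return None          # no confirmed fills yet
    def get_total_exposure(self):
        return 0.0

class StubOrder:
    def __init__(self, side, remaining_quantity, price):
        self.side = side
        self.remaining_quantity = remaining_quantity
        self.price = price

class StubOrderManager:
    def __init__(self):
        self.open_orders = []
    def get_open_orders(self, market_id=None):
        return self.open_orders

def test_pending_orders_count_toward_limit():
    orders = StubOrderManager()
    risk = ExposureAwareRiskManager(
        RiskLimits(), StubPositionTracker(), orders, event_bus=None
    )
    # The incident's two orders rest unfilled on the book.
    orders.open_orders.append(StubOrder("BUY", 45, 0.38))
    orders.open_orders.append(StubOrder("BUY", 42, 0.36))
    # A third 40-contract order must now be rejected even though the
    # confirmed position is still zero: 45 + 42 + 40 = 127 > 100.
    ok, reason = risk.check_position_limit("market-b", "BUY", 40, 0.36)
    assert not ok
    assert "pending" in reason
```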
Lessons Learned
- Pending orders are exposure. Risk checks must account for the worst case where all pending orders fill.
- Test boundary conditions. The original risk check was tested with single orders, not with concurrent orders in the same market.
- Display effective exposure. The operator should always see the worst-case exposure, not just confirmed positions.
Incident 3: The Correlated Market Cascade
Timeline
Monday 11:00 UTC -- A major political event occurs (surprise election result). Multiple prediction markets that are correlated with the outcome begin moving rapidly.
11:02 -- The bot has positions in 5 markets, all of which are correlated with the political event:

- Market C: Long 60 contracts (political outcome A)
- Market D: Long 40 contracts (policy consequence of A)
- Market E: Short 30 contracts (alternative outcome B)
- Market F: Long 25 contracts (economic impact of A)
- Market G: Long 35 contracts (related political figure)
All positions are individually within limits. But they are all effectively bets on the same underlying outcome.
11:05 -- The event resolves contrary to the bot's positions. All five markets move against the bot simultaneously:

- Market C: drops from 0.55 to 0.35 (-$12.00 unrealized)
- Market D: drops from 0.60 to 0.42 (-$7.20 unrealized)
- Market E: rises from 0.40 to 0.62 (-$6.60 unrealized)
- Market F: drops from 0.45 to 0.30 (-$3.75 unrealized)
- Market G: drops from 0.50 to 0.35 (-$5.25 unrealized)
Total unrealized loss: -$34.80 in 3 minutes.
11:06 -- The daily loss limit is -$200. The combined loss of $34.80 does not trigger it. But the bot's effective exposure to "political outcome A" is far larger than any single position limit suggests.
11:08 -- The bot's signal generator sees the price drops as mean-reversion opportunities. It generates BUY signals for Markets C, D, and F (prices have "deviated" from their EMAs). The risk manager approves these orders because each individual position is within limits.
11:10 -- Some of the new orders fill, increasing the bot's exposure to the same correlated risk.
11:15 -- Alex sees the alerts (daily P&L is now -$42 after the new positions also lose). Alex manually cancels all orders and begins unwinding positions.
11:30 -- After unwinding, the total realized loss is $58.40.
Impact
- $58.40 realized loss from correlated positions
- The bot added to losing positions because it could not recognize the correlation
- Daily loss limit was not reached, so the automated safeguard did not trigger
Root Cause Analysis
Primary cause: The risk manager had no concept of correlation between markets. It evaluated each position independently, missing the concentrated exposure to a single underlying factor.
Contributing cause: The signal generator interpreted a correlated sell-off as independent mean-reversion opportunities, which caused the bot to add to already-losing correlated positions.
Contributing cause: The daily loss limit ($200) was set too high relative to the likely maximum correlated loss. The five positions could have lost $80+ before triggering the limit.
Fix Implementation
```python
import logging
from typing import Dict, List, Set, Tuple


class MarketCorrelationManager:
    """
    Tracks correlations between markets and enforces
    concentration limits on correlated groups.
    """

    def __init__(self, max_group_exposure: float = 500.0):
        self.max_group_exposure = max_group_exposure
        # Manually maintained correlation groups
        self.correlation_groups: Dict[str, Set[str]] = {}
        # market_id -> list of group names
        self.market_to_groups: Dict[str, List[str]] = {}
        self.logger = logging.getLogger("CorrelationManager")

    def define_group(self, group_name: str, market_ids: List[str]):
        """Define a group of correlated markets."""
        self.correlation_groups[group_name] = set(market_ids)
        for mid in market_ids:
            if mid not in self.market_to_groups:
                self.market_to_groups[mid] = []
            self.market_to_groups[mid].append(group_name)
        self.logger.info(
            f"Defined correlation group '{group_name}': {market_ids}"
        )

    def check_group_exposure(
        self, market_id: str, additional_exposure: float,
        position_tracker
    ) -> Tuple[bool, str]:
        """
        Check if adding exposure to a market would breach
        any group limit.
        """
        groups = self.market_to_groups.get(market_id, [])
        if not groups:
            return True, "Market not in any correlation group"

        for group_name in groups:
            group_markets = self.correlation_groups[group_name]
            current_group_exposure = 0.0
            for gm in group_markets:
                pos = position_tracker.get_position(gm)
                if pos:
                    current_group_exposure += abs(
                        pos["quantity"] * pos.get(
                            "average_entry_price", 0
                        )
                    )
            new_group_exposure = (
                current_group_exposure + abs(additional_exposure)
            )
            if new_group_exposure > self.max_group_exposure:
                return False, (
                    f"Group '{group_name}' exposure "
                    f"${new_group_exposure:.2f} would exceed limit "
                    f"${self.max_group_exposure:.2f} "
                    f"(current: ${current_group_exposure:.2f})"
                )
        return True, "OK"


class CorrelatedLossDetector:
    """
    Detects when multiple positions are losing simultaneously,
    indicating correlated risk.
    """

    def __init__(self, loss_threshold_pct: float = 0.05,
                 min_positions: int = 3):
        self.loss_threshold_pct = loss_threshold_pct
        self.min_positions = min_positions
        self.logger = logging.getLogger("CorrelatedLossDetector")

    def check(self, positions: List[dict]) -> Tuple[bool, str]:
        """
        Check if multiple positions are losing simultaneously.
        Returns (alert_needed, message).
        """
        losing_positions = [
            p for p in positions
            if p.get("unrealized_pnl", 0) < 0
            and p.get("market_value", 0) > 0
            and abs(p["unrealized_pnl"] / p["market_value"])
            > self.loss_threshold_pct
        ]
        if len(losing_positions) >= self.min_positions:
            total_loss = sum(
                p["unrealized_pnl"] for p in losing_positions
            )
            return True, (
                f"{len(losing_positions)} positions losing "
                f"simultaneously (total: ${total_loss:.2f}). "
                f"Possible correlated risk event."
            )
        return False, "OK"


class SignalSuppressor:
    """
    Suppresses signals that would add to positions in markets
    that are currently experiencing correlated losses.
    """

    def __init__(self, correlation_manager, position_tracker,
                 loss_threshold_pct: float = 0.03):
        self.correlation_manager = correlation_manager
        self.position_tracker = position_tracker
        self.loss_threshold_pct = loss_threshold_pct
        self.suppressed_groups: Set[str] = set()
        self.logger = logging.getLogger("SignalSuppressor")

    def should_suppress(self, market_id: str, side: str) -> bool:
        """
        Returns True if signals for this market should be
        suppressed due to correlated losses.
        """
        groups = self.correlation_manager.market_to_groups.get(
            market_id, []
        )
        for group_name in groups:
            if group_name in self.suppressed_groups:
                self.logger.info(
                    f"Suppressing signal for {market_id}: "
                    f"group '{group_name}' is in loss suppression"
                )
                return True
        return False

    def update_suppression(self):
        """
        Update which groups are suppressed based on current
        correlated losses.
        """
        for group_name, market_ids in (
            self.correlation_manager.correlation_groups.items()
        ):
            group_positions = []
            for mid in market_ids:
                pos = self.position_tracker.get_position(mid)
                if pos and pos.get("quantity", 0) != 0:
                    group_positions.append(pos)

            losing_count = sum(
                1 for p in group_positions
                if p.get("unrealized_pnl", 0) < 0
            )

            # If more than half of correlated positions are losing,
            # suppress new signals for the group
            if (len(group_positions) >= 2 and
                    losing_count > len(group_positions) / 2):
                if group_name not in self.suppressed_groups:
                    self.suppressed_groups.add(group_name)
                    self.logger.warning(
                        f"Suppressing group '{group_name}': "
                        f"{losing_count}/{len(group_positions)} "
                        f"positions losing"
                    )
            else:
                self.suppressed_groups.discard(group_name)
```
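A minimal sketch of how these pieces could be wired into the signal path; `position_tracker` and `submit_order` stand in for the bot's real components, and the group name is illustrative:

```python
correlations = MarketCorrelationManager(max_group_exposure=500.0)
correlations.define_group(
    "political-outcome-a",
    ["market-c", "market-d", "market-e", "market-f", "market-g"],
)

suppressor = SignalSuppressor(correlations, position_tracker)

def on_signal(market_id: str, side: str, quantity: float, price: float):
    suppressor.update_suppression()
    if suppressor.should_suppress(market_id, side):
        return  # do not add to a group in correlated drawdown
    ok, reason = correlations.check_group_exposure(
        market_id, quantity * price, position_tracker
    )
    if not ok:
        return  # group concentration limit would be breached
    submit_order(market_id, side, quantity, price)  # hypothetical
```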
Post-Fix Validation
After implementing these fixes:
- Alex defines correlation groups for markets sharing underlying political, economic, or thematic factors.
- Alex adds a group exposure limit of $500 (lower than the portfolio limit of $2,000) per correlation group.
- Alex runs a simulation of the incident scenario and verifies:
  - The correlated loss detector fires when 3+ positions lose simultaneously.
  - The signal suppressor prevents new buy signals in affected groups.
  - The group exposure limit prevents excessive concentration.
- Alex lowers the daily loss limit from $200 to $100 to provide a tighter safety net.
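The first of those simulation checks can be pinned down as a regression test. A sketch, with position dicts shaped the way `CorrelatedLossDetector.check` expects and values taken from the 11:05 snapshot:

```python
def test_detector_fires_on_simultaneous_losses():
    detector = CorrelatedLossDetector(loss_threshold_pct=0.05,
                                      min_positions=3)
    # Unrealized P&L and post-move market values from the incident.
    positions = [
        {"unrealized_pnl": -12.00, "market_value": 21.00},  # Market C
        {"unrealized_pnl": -7.20, "market_value": 16.80},   # Market D
        {"unrealized_pnl": -6.60, "market_value": 18.60},   # Market E
        {"unrealized_pnl": -3.75, "market_value": 7.50},    # Market F
        {"unrealized_pnl": -5.25, "market_value": 12.25},   # Market G
    ]
    alert, message = detector.check(positions)
    assert alert
    assert "5 positions" in message
```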
Lessons Learned
- Diversification is an illusion without correlation awareness. Five positions in correlated markets is effectively one large position.
- Mean-reversion signals during correlated sell-offs are dangerous. A price drop due to new information is not mean-reversion -- it is a permanent level shift.
- Define correlation groups proactively. Before trading in a market, classify which factor group it belongs to.
- Automate correlated loss detection. Human traders recognize correlated losses immediately. Bots need explicit programming to detect them.
- Daily loss limits should be set relative to realistic worst-case correlated losses. If your maximum correlated exposure can lose $80 in a single event, a $200 daily loss limit is too loose.
Summary of All Incidents
| | Incident 1 | Incident 2 | Incident 3 |
|---|---|---|---|
| Category | Data quality | Risk management | Correlation risk |
| Detection | Manual (58 min) | Manual (8 min) | Semi-auto (15 min) |
| Damage | None (lucky) | None (lucky) | $58.40 loss |
| Root cause | Missing data validation | Pending orders not in risk calc | No correlation awareness |
| Fix complexity | Medium | Low | High |
| Prevention | Stale data detector | Include pending in exposure | Correlation groups + suppression |
Universal Post-Mortem Principles
- Every incident is a learning opportunity. Document every failure, even near-misses, with the same rigor as costly failures.
- Fix the system, not just the symptom. A patch that addresses only the specific failure scenario will not prevent the next similar failure. Fix the class of failures.
- Test fixes against the incident scenario. Write a test that reproduces the exact conditions of the incident and verify the fix prevents it.
- Review neighboring assumptions. If one assumption was wrong (e.g., "data is always fresh"), check related assumptions (e.g., "order responses are always current").
- Share learnings. If you trade with others, share post-mortems. If you trade alone, write them down for your future self.