Case Study 2: Detecting and Recovering from Pipeline Failures
Executive Summary
Production ML systems fail. Data sources go down, APIs change schemas, models drift silently, and edge cases surface that no test suite anticipated. This case study examines five real failure scenarios drawn from operating an NFL betting pipeline over a full 18-week season. For each failure, we analyze what went wrong, how the monitoring system detected (or failed to detect) the issue, what the financial impact was, and what remediation was implemented. The five scenarios cover stale odds data during a provider outage, a feature computation bug that introduced look-ahead bias, model drift following a mid-season rule emphasis change, a bet execution race condition that placed duplicate bets, and a cascading failure triggered by a database disk-space exhaustion. We conclude with a comprehensive resilience checklist that pipeline operators can use to harden their systems.
Background
The Pipeline
The pipeline under study is an NFL game prediction system that has been in production for three seasons. It processes approximately 270 regular-season games per year, placing bets on roughly 30% of them (80--90 bets per season). The system architecture follows the monolithic design recommended in Chapter 31: a single Python application with clear module boundaries, running on a single cloud server with scheduled cron jobs for batch processing and a FastAPI endpoint for on-demand predictions.
Why Failures Matter More in Betting
In a typical ML application, a model serving incorrect predictions for a few hours is an inconvenience. In a betting application, incorrect predictions can cause immediate financial loss. A model that is confidently wrong (high predicted edge, large bet size) causes more damage than a model that is cautiously wrong (small predicted edge, small bet size). This asymmetry makes monitoring and failure detection critical.
Scenario 1: Stale Odds During Provider Outage
What Happened
On the Sunday morning of Week 6, the odds data provider suffered a 90-minute outage, from 10:30 AM to noon. The ingestion pipeline exhausted its retries and logged errors, but the prediction pipeline proceeded on schedule at 11:00 AM using the most recently cached odds, from 10:00 AM. During the outage, several lines moved significantly on late injury news.
Impact
The pipeline generated 3 bet recommendations using stale odds. One bet was placed on a team whose line had moved from -3 to -5 after a starting quarterback was downgraded to "out." The model's edge was computed against the stale -3 line, but by the time of execution the actual line available was -5, reducing the true edge to near zero.
Detection
from datetime import datetime
from typing import Dict, List


class DataFreshnessMonitor:
    """Monitor data source freshness and alert on staleness."""

    def __init__(self, max_age_minutes: Dict[str, int]):
        self.max_age = max_age_minutes
        self.last_update: Dict[str, datetime] = {}

    def record_update(self, source: str) -> None:
        """Record a successful data update."""
        self.last_update[source] = datetime.utcnow()

    def check_freshness(self) -> List[Dict[str, str]]:
        """Check all sources and return alerts for stale data."""
        alerts = []
        now = datetime.utcnow()
        for source, max_minutes in self.max_age.items():
            last = self.last_update.get(source)
            if last is None:
                alerts.append({
                    "source": source,
                    "severity": "critical",
                    "message": f"No data ever received from {source}",
                })
            else:
                age_minutes = (now - last).total_seconds() / 60
                if age_minutes > max_minutes:
                    alerts.append({
                        "source": source,
                        "severity": "warning",
                        "message": (
                            f"{source} data is {age_minutes:.0f} min old "
                            f"(threshold: {max_minutes} min)"
                        ),
                    })
        return alerts
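For illustration, a minimal wiring of the monitor, assuming hourly odds ingestion and a 60-minute staleness threshold; the source names are placeholders:

monitor = DataFreshnessMonitor(max_age_minutes={"odds_api": 60, "injury_feed": 120})

# Called by the ingestion job after each successful fetch.
monitor.record_update("odds_api")

# Called at the start of every prediction run.
for alert in monitor.check_freshness():
    print(alert["severity"], alert["message"])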
Remediation
- Pre-execution odds verification: Before placing any bet, the execution engine now fetches live odds and verifies they match (within 0.5 points for spreads, within 10 cents for moneylines) the odds used in the edge calculation; a sketch of this check follows the list.
- Staleness gate: The prediction pipeline now checks data freshness before proceeding. If any critical data source is older than 60 minutes, the pipeline enters a "degraded mode" that generates predictions but does not execute bets.
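A minimal sketch of the pre-execution check, assuming live odds can be re-fetched at execution time. The fetch_live_odds callable, the bet dictionary layout, and the field names are illustrative; the tolerances are the ones stated above:

from typing import Callable, Dict

SPREAD_TOLERANCE_POINTS = 0.5     # from the remediation above
MONEYLINE_TOLERANCE_CENTS = 10    # from the remediation above

def verify_odds_before_execution(
    bet: Dict,
    fetch_live_odds: Callable[[str], Dict],
) -> bool:
    """Return True only if the live price still matches the price the edge was computed on."""
    live = fetch_live_odds(bet["game_id"])   # hypothetical odds-feed call
    if bet["market"] == "spread":
        return abs(live["spread"] - bet["spread_used"]) <= SPREAD_TOLERANCE_POINTS
    if bet["market"] == "moneyline":
        return abs(live["moneyline"] - bet["moneyline_used"]) <= MONEYLINE_TOLERANCE_CENTS
    return False   # unknown market: refuse to execute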
Scenario 2: Feature Computation Bug Introducing Look-Ahead Bias
What Happened
A code change to the rolling-average feature transformer inadvertently removed the .shift(1) call that prevents data leakage. Instead of computing the 10-game rolling average through the previous game, the feature now included the current game's statistics. This bug was introduced during a refactoring that improved the transformer's performance by 3x but broke the temporal correctness invariant.
Impact
The model was retrained on features contaminated with look-ahead bias. In backtesting, the contaminated model showed a Brier score of 0.215 (much better than the historical 0.228), which should have been a red flag. The model was promoted to production, where it performed no better than the uncontaminated version because the leakage was not available at serving time (features were computed correctly for future games since there was no "current game" to leak from). However, the model's internal feature weights were miscalibrated because they were trained on artificially informative features. The net effect was a subtle degradation in calibration.
Detection
The bug was caught 3 weeks later when a routine calibration check showed that the model's predicted probabilities in the 0.55--0.65 range were winning at only 51%, indicating systematic overconfidence.
from typing import Dict, List

import numpy as np


class CalibrationMonitor:
    """Monitor model calibration over a rolling window."""

    def __init__(self, n_bins: int = 10, min_samples: int = 20):
        self.n_bins = n_bins
        self.min_samples = min_samples
        self.predictions: List[float] = []
        self.outcomes: List[int] = []

    def add_result(self, predicted_prob: float, actual: int) -> None:
        """Record a prediction-outcome pair."""
        self.predictions.append(predicted_prob)
        self.outcomes.append(actual)

    def compute_calibration_error(self) -> Dict[str, object]:
        """Compute expected calibration error (ECE) with per-bin detail."""
        preds = np.array(self.predictions)
        actuals = np.array(self.outcomes)
        bin_edges = np.linspace(0, 1, self.n_bins + 1)
        ece = 0.0
        bin_details = []
        for i in range(self.n_bins):
            # Include the right edge in the final bin so p = 1.0 is counted.
            if i == self.n_bins - 1:
                mask = (preds >= bin_edges[i]) & (preds <= bin_edges[i + 1])
            else:
                mask = (preds >= bin_edges[i]) & (preds < bin_edges[i + 1])
            n = mask.sum()
            if n < self.min_samples:
                continue
            avg_pred = preds[mask].mean()
            avg_actual = actuals[mask].mean()
            bin_error = abs(avg_pred - avg_actual)
            ece += (n / len(preds)) * bin_error
            bin_details.append({
                "bin": f"{bin_edges[i]:.2f}-{bin_edges[i + 1]:.2f}",
                "count": int(n),
                "predicted": round(avg_pred, 4),
                "actual": round(avg_actual, 4),
                "error": round(bin_error, 4),
            })
        return {"ece": round(ece, 4), "bins": bin_details}
Remediation
- Temporal correctness test: An automated test was added to the CI pipeline that verifies, for a set of known games, that changing the outcome of game N does not affect any feature computed for game N. This test catches any missing .shift() or incorrect temporal join; a sketch appears after this list.
- Backtest sanity check: Any model that shows a Brier score improvement greater than 5% over the previous version triggers a manual review before promotion. Dramatic improvements are more likely to indicate a bug than a genuine advance.
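A sketch of such a CI test, assuming a hypothetical build_features(games_df) that returns one feature row per game; the fixture values and column names are illustrative:

import pandas as pd

def test_future_outcome_does_not_leak():
    """Perturbing the outcome of game N must not change any feature computed for game N."""
    games = pd.DataFrame({
        "game_id": [1, 2, 3, 4],
        "team": ["A", "A", "A", "A"],
        "points_scored": [21, 27, 17, 30],
    })
    baseline = build_features(games)                  # hypothetical feature builder
    perturbed_games = games.copy()
    perturbed_games.loc[perturbed_games["game_id"] == 3, "points_scored"] = 99
    perturbed = build_features(perturbed_games)
    # Features for game 3 may depend only on games 1 and 2, so they must be unchanged.
    pd.testing.assert_frame_equal(
        baseline[baseline["game_id"] == 3].reset_index(drop=True),
        perturbed[perturbed["game_id"] == 3].reset_index(drop=True),
    )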
Scenario 3: Model Drift After Rule Emphasis Change
What Happened
In mid-October, the NFL announced a point of emphasis on defensive pass interference, leading to more flags and longer pass plays. This shifted the distribution of several features: offensive passing efficiency increased league-wide, and the variance of defensive ratings expanded. The model, trained on pre-emphasis data, did not account for this shift.
Impact
The model's predictions became systematically biased for 4 weeks. It underestimated high-scoring games (because historical defensive ratings were lower) and overestimated the edge on "under" totals bets. The pipeline lost money on 8 of 11 totals bets during weeks 7--10, a run that was outside normal variance.
Detection
The drift was detected by the PSI monitor tracking the distribution of the "defensive_epa_per_play" feature.
from typing import Dict

import numpy as np


class FeatureDriftDetector:
    """Detect feature distribution drift using PSI."""

    def __init__(self, n_bins: int = 10):
        self.n_bins = n_bins
        self.reference_distributions: Dict[str, Dict[str, np.ndarray]] = {}

    def set_reference(self, feature_name: str,
                      values: np.ndarray) -> None:
        """Set the reference distribution for a feature."""
        hist, bin_edges = np.histogram(values, bins=self.n_bins)
        self.reference_distributions[feature_name] = {
            "hist": hist / hist.sum(),
            "bin_edges": bin_edges,
        }

    def compute_psi(self, feature_name: str,
                    current_values: np.ndarray) -> float:
        """Compute PSI between reference and current distribution."""
        ref = self.reference_distributions.get(feature_name)
        if ref is None:
            return 0.0
        current_hist, _ = np.histogram(
            current_values, bins=ref["bin_edges"]
        )
        current_pct = current_hist / max(current_hist.sum(), 1)
        ref_pct = ref["hist"]
        # Add a small epsilon to avoid log(0) on empty bins.
        eps = 1e-6
        current_pct = np.clip(current_pct, eps, None)
        ref_pct = np.clip(ref_pct, eps, None)
        psi = np.sum(
            (current_pct - ref_pct) * np.log(current_pct / ref_pct)
        )
        return float(psi)
Remediation
- Automated retraining trigger: When PSI for any top-10 feature exceeds 0.20 for two consecutive weeks, the pipeline automatically retrains the model on an expanded window that includes the recent data, giving the model exposure to the shifted distribution (see the sketch after this list).
- Feature normalization: Defensive and offensive efficiency features were changed from raw values to z-scores computed against the trailing 4-week league average, reducing sensitivity to league-wide level shifts.
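A sketch of the trigger logic, reusing the FeatureDriftDetector above. The 0.20 threshold and the two-consecutive-weeks rule come from the remediation; the retrain_model callable and the feature list are placeholders:

from typing import Callable, Dict, List

import numpy as np

PSI_THRESHOLD = 0.20
CONSECUTIVE_WEEKS_REQUIRED = 2

class RetrainingTrigger:
    """Fire a retrain when any monitored feature drifts for two straight weeks."""

    def __init__(self, detector: FeatureDriftDetector, features: List[str],
                 retrain_model: Callable[[], None]):
        self.detector = detector
        self.features = features          # e.g. the model's top-10 features
        self.retrain_model = retrain_model
        self.weeks_over_threshold: Dict[str, int] = {f: 0 for f in features}

    def check_week(self, weekly_values: Dict[str, np.ndarray]) -> bool:
        """Update per-feature drift streaks with this week's values; retrain if a streak hits two."""
        for feature in self.features:
            psi = self.detector.compute_psi(feature, weekly_values[feature])
            self.weeks_over_threshold[feature] = (
                self.weeks_over_threshold[feature] + 1 if psi > PSI_THRESHOLD else 0
            )
            if self.weeks_over_threshold[feature] >= CONSECUTIVE_WEEKS_REQUIRED:
                self.retrain_model()      # retrain on the expanded window per the remediation
                self.weeks_over_threshold = {f: 0 for f in self.features}
                return True
        return False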
Scenario 4: Duplicate Bet Execution
What Happened
A network timeout during bet placement caused the HTTP request to the sportsbook API to fail with a timeout exception. The retry logic re-sent the request, but the original request had actually succeeded (the timeout was on the response, not the request). The result was a duplicate bet: two identical $200 wagers on the same game, doubling the intended exposure.
Remediation
import sqlite3
from datetime import datetime
from typing import Dict


class IdempotentBetPlacer:
    """Place bets with idempotency keys to prevent duplicates."""

    def __init__(self, db_path: str = "bet_ledger.db"):
        self.db_path = db_path
        self._init_db()

    def _init_db(self) -> None:
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS placed_bets (
                    idempotency_key TEXT PRIMARY KEY,
                    game_id TEXT NOT NULL,
                    side TEXT NOT NULL,
                    size REAL NOT NULL,
                    odds INTEGER NOT NULL,
                    status TEXT NOT NULL,
                    placed_at TEXT NOT NULL
                )
            """)

    def place_bet(self, game_id: str, side: str,
                  size: float, odds: int) -> Dict[str, str]:
        """Place a bet with duplicate prevention."""
        # The key is deterministic, so a retried request maps to the
        # same ledger row as the original attempt.
        key = f"{game_id}_{side}_{size}_{odds}"
        with sqlite3.connect(self.db_path) as conn:
            existing = conn.execute(
                "SELECT * FROM placed_bets WHERE idempotency_key = ?",
                (key,),
            ).fetchone()
            if existing:
                return {"status": "duplicate_prevented", "key": key}
            # Record the bet before sending it to the sportsbook so a
            # retry after a response timeout is blocked by the ledger.
            conn.execute(
                """INSERT INTO placed_bets
                   (idempotency_key, game_id, side, size, odds,
                    status, placed_at)
                   VALUES (?, ?, ?, ?, ?, ?, ?)""",
                (key, game_id, side, size, odds,
                 "placed", datetime.utcnow().isoformat()),
            )
        return {"status": "placed", "key": key}
Scenario 5: Cascading Failure from Disk Space Exhaustion
What Happened
The server's disk filled up because the logging system was writing verbose DEBUG-level logs without rotation. When the disk was full, SQLite could no longer write to the feature store, causing the feature computation job to fail. The prediction job detected missing features and entered degraded mode, but a bug in the degraded-mode logic caused it to use default feature values (all zeros) instead of skipping the predictions entirely. These zero-feature predictions produced near-random probabilities, which the execution engine evaluated as having small but positive edge on several games.
Remediation
- Log rotation: Implemented logrotate with a maximum of 500 MB of log files, rotated daily.
- Disk space monitor: Added a system-level check that alerts when disk usage exceeds 80% and halts all write operations at 95%; a minimal version is sketched after this list.
- Degraded mode fix: The degraded-mode logic was fixed to require at least 80% of features to be non-null before generating predictions. Missing features below this threshold result in a full pipeline halt with an alert.
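A minimal version of the disk-space check, using only the standard library; the thresholds mirror the remediation and the halting behavior is left to the caller:

import shutil

ALERT_THRESHOLD = 0.80   # warn the operator above 80% usage
HALT_THRESHOLD = 0.95    # stop all write operations above 95% usage

def check_disk_usage(path: str = "/") -> str:
    """Return 'ok', 'alert', or 'halt' based on disk usage at the given path."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    if used_fraction >= HALT_THRESHOLD:
        return "halt"    # caller halts feature-store and log writes
    if used_fraction >= ALERT_THRESHOLD:
        return "alert"   # caller pages the operator
    return "ok"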
Resilience Checklist
Based on these five scenarios, every betting pipeline should implement:
- Data freshness gates that prevent predictions when input data is stale
- Temporal correctness tests that verify no feature uses future data
- Pre-execution odds verification that confirms prices have not moved
- Idempotency keys on bet placement to prevent duplicates
- PSI-based drift detection on all features with automated retraining triggers
- Calibration monitoring with rolling ECE computation
- Log rotation and disk space monitoring
- Graceful degradation with explicit thresholds for minimum data quality
- Backtest sanity checks that flag suspiciously good improvements
- Kill switches that halt all betting when anomalies are detected (a minimal sketch follows)
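Most of these checks ultimately need somewhere to report to. One minimal kill-switch sketch, assuming a file-based flag that the execution engine consults before every bet; the flag path is arbitrary:

from pathlib import Path

KILL_SWITCH_FILE = Path("/var/run/betting_pipeline/HALT")   # illustrative location

def trip_kill_switch(reason: str) -> None:
    """Halt all betting; any monitor can call this when it detects an anomaly."""
    KILL_SWITCH_FILE.parent.mkdir(parents=True, exist_ok=True)
    KILL_SWITCH_FILE.write_text(reason)

def betting_allowed() -> bool:
    """Checked by the execution engine immediately before placing any bet."""
    return not KILL_SWITCH_FILE.exists()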
Exercises for the Reader
- Implement a complete circuit-breaker pattern for the odds data provider that transitions between Closed, Open, and Half-Open states based on consecutive failure counts.
- Build a "chaos testing" framework that randomly injects failures (stale data, missing features, API timeouts) into the pipeline and verifies that each failure is detected and handled correctly.
- Design a post-mortem template for pipeline failures that captures: timeline of events, root cause, impact (financial and operational), detection latency, and preventive measures. Apply it to a hypothetical scenario where the model serves predictions using a model artifact from a different sport.