Case Study 2: SHAP-Driven Model Auditing for NFL Betting
Overview
This case study demonstrates how SHAP (SHapley Additive exPlanations) values transform a black-box XGBoost model into an interpretable decision-support tool for NFL betting. Rather than blindly following model outputs, we use SHAP to understand why the model makes each prediction, identify potentially spurious features, and make informed decisions about which model-market disagreements represent genuine edges.
Motivation
A professional sports bettor faces a fundamental trust problem. An XGBoost model outputs "P(home win) = 0.68" for an upcoming NFL game, while the market implies 0.60. Should the bettor wager on the home team? Without understanding why the model disagrees with the market, the bettor has no basis for distinguishing between a genuine insight (the model has identified something the market is missing) and a model error (the model is overweighting a noisy feature or failing to account for recent news).
SHAP values solve this problem by decomposing every prediction into individual feature contributions. The bettor can examine exactly which features push the prediction above or below the market line, assess whether those factors are plausible, and make a more informed wagering decision.
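To make the mechanics concrete, here is a minimal sketch (not part of the case-study script below) of the additivity property SHAP guarantees: for a tree model explained with shap.TreeExplainer, the base value plus the per-feature SHAP values reconstructs the prediction in log-odds space. The check_additivity helper and its model/x_row arguments are illustrative; it assumes an already-fitted XGBoost classifier like the one trained later in this case study.
import numpy as np
import shap  # assumes the shap package is installed
def check_additivity(model, x_row: np.ndarray) -> None:
    """Confirm base_value + sum(SHAP values) reproduces the model output."""
    explainer = shap.TreeExplainer(model)
    sv = explainer.shap_values(x_row.reshape(1, -1))
    if isinstance(sv, list):  # older shap versions return one array per class
        sv = sv[1]
    base = float(np.ravel(explainer.expected_value)[-1])
    logit = base + sv[0].sum()           # reconstructed log-odds
    prob = 1.0 / (1.0 + np.exp(-logit))  # back to probability space
    model_prob = model.predict_proba(x_row.reshape(1, -1))[0, 1]
    print(f"reconstructed: {prob:.4f}  model: {model_prob:.4f}")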
Building the Interpretable Model
"""
Case Study 2: SHAP-Driven Model Auditing for NFL Betting
Trains an XGBoost model, computes SHAP explanations, and
demonstrates how to use interpretability for betting decisions.
"""
import math
import numpy as np
import warnings
from typing import Dict, List, Tuple, Optional
warnings.filterwarnings("ignore")
def generate_nfl_data(
n_games: int = 2000, seed: int = 42
) -> Tuple[np.ndarray, np.ndarray, List[str]]:
"""Generate synthetic NFL game data with known feature relationships.
Args:
n_games: Number of games to generate.
seed: Random seed.
Returns:
Tuple of (features, outcomes, feature_names).
"""
np.random.seed(seed)
feature_names = [
"elo_diff", "dvoa_off_diff", "dvoa_def_diff",
"rest_days_diff", "win_pct_l5_diff", "turnover_margin_diff",
"spread", "home_record_pct", "away_record_pct",
"temperature", "wind_speed", "is_divisional",
]
# Generate features
elo_diff = np.random.normal(0, 80, n_games)
dvoa_off = np.random.normal(0, 8, n_games)
dvoa_def = np.random.normal(0, 8, n_games)
rest_diff = np.random.choice([-3, -2, -1, 0, 1, 2, 3, 7, 10],
size=n_games,
p=[0.02, 0.05, 0.15, 0.4, 0.15, 0.05, 0.05, 0.08, 0.05])
win_pct_diff = np.random.normal(0, 0.2, n_games)
to_margin = np.random.normal(0, 1.5, n_games)
spread = -(elo_diff * 0.04 + 3) + np.random.normal(0, 3, n_games)
home_pct = np.random.uniform(0.2, 0.9, n_games)
away_pct = np.random.uniform(0.2, 0.9, n_games)
temp = np.random.normal(55, 20, n_games)
wind = np.random.exponential(8, n_games)
divisional = (np.random.random(n_games) < 0.35).astype(float)
X = np.column_stack([
elo_diff, dvoa_off, dvoa_def, rest_diff, win_pct_diff,
to_margin, spread, home_pct, away_pct, temp, wind, divisional,
])
# True outcome: known relationships
logit = (
0.004 * elo_diff
+ 0.06 * dvoa_off
- 0.05 * dvoa_def
+ 0.04 * rest_diff
+ 1.0 * win_pct_diff
+ 0.15 * to_margin
- 0.03 * spread
+ 0.002 * rest_diff * elo_diff # interaction: rest + talent
+ 0.15 # home advantage
)
# Note: temp and wind are NOT in the true model (noise features)
prob = 1.0 / (1.0 + np.exp(-logit))
y = (np.random.random(n_games) < prob).astype(int)
return X, y, feature_names
def train_model(
x_train: np.ndarray,
y_train: np.ndarray,
x_val: np.ndarray,
y_val: np.ndarray,
) -> object:
"""Train XGBoost model for NFL prediction.
Args:
x_train: Training features.
y_train: Training outcomes.
x_val: Validation features.
y_val: Validation outcomes.
Returns:
Fitted model object.
"""
try:
import xgboost as xgb
model = xgb.XGBClassifier(
max_depth=4, learning_rate=0.05, n_estimators=300,
subsample=0.8, colsample_bytree=0.8, min_child_weight=3,
reg_lambda=3.0, gamma=0.1,
objective="binary:logistic", eval_metric="logloss",
random_state=42,
)
model.fit(x_train, y_train,
eval_set=[(x_val, y_val)], verbose=False)
except ImportError:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000, C=1.0)
model.fit(x_train, y_train)
return model
SHAP Analysis
def compute_shap_analysis(
model: object,
x_data: np.ndarray,
feature_names: List[str],
) -> Optional[np.ndarray]:
"""Compute SHAP values and global importance.
Args:
model: Fitted model.
x_data: Data to explain.
feature_names: Feature names.
Returns:
SHAP values array or None if shap not available.
"""
try:
import shap
except ImportError:
print("SHAP library not available; using feature_importances_ fallback")
return None
    try:
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(x_data)
    except Exception:
        # TreeExplainer only supports tree models (e.g. not the logistic fallback)
        print("TreeExplainer does not support this model; skipping SHAP analysis")
        return None
    # Older shap versions return a list of per-class arrays for binary
    # classification; keep the positive-class values.
    if isinstance(shap_values, list):
        shap_values = shap_values[1]
    return shap_values
def shap_global_importance(
shap_values: np.ndarray,
feature_names: List[str],
) -> List[Tuple[str, float]]:
"""Compute global feature importance from SHAP values.
Args:
shap_values: SHAP value matrix (n_samples x n_features).
feature_names: Feature names.
Returns:
Sorted list of (feature_name, mean_abs_shap).
"""
mean_abs = np.abs(shap_values).mean(axis=0)
importance = list(zip(feature_names, mean_abs))
return sorted(importance, key=lambda x: x[1], reverse=True)
def explain_single_game(
shap_values: np.ndarray,
x_game: np.ndarray,
feature_names: List[str],
base_value: float,
prediction: float,
game_label: str = "Game",
) -> Dict:
"""Produce a detailed SHAP explanation for one game.
Args:
shap_values: SHAP values for this game (1D array).
x_game: Feature values for this game (1D array).
feature_names: Feature names.
base_value: SHAP base value (average prediction).
prediction: Model's predicted probability.
game_label: Label for display.
Returns:
Dictionary with explanation details.
"""
contributions = list(zip(feature_names, x_game, shap_values))
contributions.sort(key=lambda x: abs(x[2]), reverse=True)
print(f"\n{'=' * 55}")
print(f"SHAP Explanation: {game_label}")
print(f"{'=' * 55}")
print(f" Predicted P(home win): {prediction:.4f}")
print(f" Base value: {base_value:.4f}")
print(f" Sum of SHAP values: {sum(shap_values):.4f}")
print(f" Expected sum: {prediction - base_value:.4f}")
print(f"\n {'Feature':<25} {'Value':>8} {'SHAP':>10} {'Direction':>10}")
print(f" {'-'*55}")
for name, val, shap_val in contributions:
direction = "+" if shap_val > 0 else "-"
print(f" {name:<25} {val:>8.2f} {shap_val:>+10.4f} {direction:>10}")
return {
"prediction": prediction,
"base_value": base_value,
"contributions": contributions,
}
def audit_for_spurious_features(
shap_values: np.ndarray,
feature_names: List[str],
known_noise_features: List[str],
) -> None:
"""Check if noise features have significant SHAP importance.
Args:
shap_values: SHAP values matrix.
feature_names: Feature names.
known_noise_features: Features known to be noise.
"""
importance = shap_global_importance(shap_values, feature_names)
print("\n--- Feature Audit ---")
print(f"{'Rank':<6}{'Feature':<25}{'Mean |SHAP|':>12}{'Status':>12}")
print("-" * 55)
for rank, (name, imp) in enumerate(importance, 1):
status = "NOISE" if name in known_noise_features else "Signal"
flag = " ***" if name in known_noise_features and rank <= 8 else ""
print(f"{rank:<6}{name:<25}{imp:>12.4f}{status:>12}{flag}")
noise_ranks = []
for rank, (name, _) in enumerate(importance, 1):
if name in known_noise_features:
noise_ranks.append(rank)
print(f"\nNoise features ranked at positions: {noise_ranks}")
if any(r <= len(feature_names) // 2 for r in noise_ranks):
print("WARNING: Noise features have high importance. "
"Model may be overfitting to irrelevant patterns.")
else:
print("PASS: Noise features are appropriately ranked low.")
Betting Decision Framework
def betting_with_shap(
model: object,
shap_values: np.ndarray,
x_test: np.ndarray,
y_test: np.ndarray,
feature_names: List[str],
n_examples: int = 5,
) -> None:
"""Demonstrate SHAP-informed betting decisions.
Args:
model: Fitted model.
shap_values: SHAP values for test set.
x_test: Test features.
y_test: Test outcomes.
feature_names: Feature names.
n_examples: Number of example games to analyze.
"""
probs = model.predict_proba(x_test)[:, 1]
noise_features = {"temperature", "wind_speed"}
print("\n" + "=" * 65)
print("SHAP-Informed Betting Decisions")
print("=" * 65)
for i in range(min(n_examples, len(y_test))):
# Simulate market line
market_prob = 0.5 + (probs[i] - 0.5) * 0.8 + np.random.normal(0, 0.03)
market_prob = max(0.15, min(0.85, market_prob))
edge = probs[i] - market_prob
if abs(edge) < 0.03:
continue
print(f"\n--- Game {i+1} ---")
print(f" Model: {probs[i]:.3f} | Market: {market_prob:.3f} | "
f"Edge: {edge:+.3f} | Outcome: {'Home' if y_test[i] else 'Away'}")
# Top SHAP contributors
game_shap = shap_values[i]
top_idx = np.argsort(np.abs(game_shap))[::-1][:5]
print(f" Top drivers:")
noise_contribution = 0.0
signal_contribution = 0.0
for idx in top_idx:
name = feature_names[idx]
is_noise = name in noise_features
marker = " [NOISE]" if is_noise else ""
print(f" {name:<22} = {x_test[i, idx]:>8.2f} "
f"SHAP: {game_shap[idx]:>+.4f}{marker}")
if is_noise:
noise_contribution += abs(game_shap[idx])
else:
signal_contribution += abs(game_shap[idx])
total = noise_contribution + signal_contribution
noise_pct = 100 * noise_contribution / total if total > 0 else 0
print(f"\n Signal contribution: {signal_contribution:.4f}")
print(f" Noise contribution: {noise_contribution:.4f} ({noise_pct:.1f}%)")
if noise_pct > 30:
print(f" DECISION: PASS (too much edge from noise features)")
elif abs(edge) > 0.05:
print(f" DECISION: BET (strong edge from signal features)")
else:
print(f" DECISION: SMALL BET (moderate edge, good signal)")
def run_case_study() -> None:
"""Execute the complete SHAP auditing case study."""
print("=" * 65)
print("CASE STUDY 2: SHAP-Driven Model Auditing for NFL Betting")
print("=" * 65)
# Generate data
X, y, feature_names = generate_nfl_data(n_games=2000, seed=42)
n = len(y)
print(f"\nDataset: {n} games, {len(feature_names)} features")
print(f"Home win rate: {y.mean():.3f}")
print(f"Known noise features: temperature, wind_speed")
# Split
train_end = int(n * 0.6)
val_end = int(n * 0.8)
x_train, y_train = X[:train_end], y[:train_end]
x_val, y_val = X[train_end:val_end], y[train_end:val_end]
x_test, y_test = X[val_end:], y[val_end:]
# Train
model = train_model(x_train, y_train, x_val, y_val)
probs = model.predict_proba(x_test)[:, 1]
# Metrics
p_clip = np.clip(probs, 0.001, 0.999)
ll = -np.mean(y_test * np.log(p_clip) + (1 - y_test) * np.log(1 - p_clip))
acc = np.mean((probs > 0.5) == y_test)
print(f"\nTest metrics: log-loss={ll:.4f}, accuracy={acc:.3f}")
# SHAP analysis
print("\n--- SHAP Analysis ---")
shap_values = compute_shap_analysis(model, x_test, feature_names)
if shap_values is not None:
# Global importance
print("\nGlobal Feature Importance (SHAP):")
importance = shap_global_importance(shap_values, feature_names)
for rank, (name, imp) in enumerate(importance, 1):
print(f" {rank:>2}. {name:<25} {imp:.4f}")
# Compare with built-in importance
print("\nBuilt-in Feature Importance (Gain):")
try:
builtin_imp = model.feature_importances_
sorted_idx = np.argsort(builtin_imp)[::-1]
for rank, idx in enumerate(sorted_idx, 1):
print(f" {rank:>2}. {feature_names[idx]:<25} {builtin_imp[idx]:.4f}")
except AttributeError:
pass
# Audit for noise features
audit_for_spurious_features(
shap_values, feature_names,
known_noise_features=["temperature", "wind_speed"],
)
# Explain example games
        try:
            import shap
            explainer = shap.TreeExplainer(model)
            expected = explainer.expected_value
            # expected_value can be an array with one entry per class
            if isinstance(expected, np.ndarray):
                expected = expected.ravel()[-1]
            base_value = float(expected)
        except Exception:
            # Rough fallback; not on the log-odds scale used by SHAP values
            base_value = float(probs.mean())
for i in [0, 10, 50]:
if i < len(x_test):
explain_single_game(
shap_values[i], x_test[i], feature_names,
base_value, probs[i], f"Test Game {i+1}",
)
# SHAP-informed betting
betting_with_shap(
model, shap_values, x_test, y_test, feature_names,
n_examples=20,
)
else:
# Fallback without SHAP
print("\nUsing built-in feature importance (SHAP not available):")
try:
importances = model.feature_importances_
sorted_idx = np.argsort(importances)[::-1]
for rank, idx in enumerate(sorted_idx, 1):
print(f" {rank}. {feature_names[idx]:<25} {importances[idx]:.4f}")
except AttributeError:
print(" Feature importance not available")
# Summary
print("\n" + "=" * 65)
print("KEY TAKEAWAYS")
print("=" * 65)
print("""
1. SHAP reveals WHAT drives each prediction, enabling informed
betting decisions rather than blind model following.
2. Noise features (temperature, wind) should rank low in SHAP
importance. If they rank high, the model is overfitting.
3. When a large fraction of a prediction's edge comes from noise
features, the bet should be skipped regardless of edge size.
4. SHAP importance can differ from built-in feature importance.
SHAP is more reliable because it accounts for interactions
and provides consistent, additive attribution.
5. The combination of SHAP + calibration + edge thresholding
creates a robust betting framework that avoids the most
common pitfalls of model-based wagering.
""")
if __name__ == "__main__":
run_case_study()
Key Findings
Noise Feature Detection. The model was deliberately given two noise features (temperature, wind_speed) that have no true relationship with game outcomes. SHAP analysis correctly identifies these features as having low importance, ranking them near the bottom. If the model were overfitting, these noise features would rank higher, providing a clear diagnostic for the bettor.
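One way to confirm the diagnosis is an ablation check: drop the suspected noise columns, retrain, and compare held-out log-loss. The sketch below is illustrative and not part of the script above; train_fn stands for any training routine (for example, a wrapper around the XGBClassifier configuration used earlier), and drop_idx holds the column indices of temperature and wind_speed.
import numpy as np
from sklearn.metrics import log_loss
def ablation_check(train_fn, X_tr, y_tr, X_te, y_te, drop_idx):
    """Compare test log-loss with and without the columns in drop_idx."""
    full = train_fn(X_tr, y_tr)
    keep = [j for j in range(X_tr.shape[1]) if j not in set(drop_idx)]
    reduced = train_fn(X_tr[:, keep], y_tr)
    ll_full = log_loss(y_te, full.predict_proba(X_te)[:, 1])
    ll_reduced = log_loss(y_te, reduced.predict_proba(X_te[:, keep])[:, 1])
    print(f"log-loss with all features:  {ll_full:.4f}")
    print(f"log-loss without noise cols: {ll_reduced:.4f}")
If the two scores are essentially identical, the dropped features carried no real signal and their SHAP importance should stay near zero.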
Feature Interaction Discovery. SHAP dependence plots reveal the interaction between rest_days_diff and elo_diff: the benefit of extra rest is amplified for stronger teams. A linear model would miss this interaction entirely; in practice it means rest advantages deserve extra weight when the rested team is already strong.
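For readers who want to see the interaction directly, the following sketch assumes the shap_values, x_test, and feature_names produced by the script above and that matplotlib is installed; the output filename is arbitrary.
import shap
import matplotlib.pyplot as plt
# SHAP value of rest_days_diff plotted against its feature value, colored by
# elo_diff; diverging color bands indicate an interaction between the two.
shap.dependence_plot(
    "rest_days_diff",
    shap_values,
    x_test,
    feature_names=feature_names,
    interaction_index="elo_diff",
    show=False,
)
plt.tight_layout()
plt.savefig("rest_elo_interaction.png", dpi=150)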
Edge Decomposition. For games where the model disagrees with the market, SHAP decomposition shows exactly which features drive the disagreement. When the edge comes primarily from strong signal features (Elo difference, offensive efficiency), the bettor can wager with confidence. When a significant portion comes from noise features, the bettor should pass.
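Because TreeExplainer attributes in log-odds space, raw SHAP values are not directly comparable to a probability edge. A rough first-order translation multiplies each contribution by the sigmoid's local slope, p(1 - p). The helper below is a hypothetical sketch of that reading, not part of the script above.
import numpy as np
def edge_breakdown(shap_row, prob, feature_names, market_prob):
    """Approximate each SHAP contribution in probability points."""
    slope = prob * (1.0 - prob)      # d(sigmoid)/d(logit) evaluated at prob
    approx_pp = shap_row * slope     # approximate probability-point impact
    order = np.argsort(-np.abs(approx_pp))
    print(f"model {prob:.3f} vs market {market_prob:.3f} "
          f"(edge {prob - market_prob:+.3f})")
    for j in order[:5]:
        print(f"  {feature_names[j]:<22} ~{approx_pp[j]:+.3f} prob. points")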
Importance Method Discrepancies. Built-in gain-based importance and SHAP importance produce different rankings for some features. SHAP importance is more reliable because it accounts for feature correlations and interaction effects, while gain-based importance can be biased toward high-cardinality features used in many splits.
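A quick way to quantify that disagreement is a rank correlation between the two importance vectors. This sketch assumes the model, shap_values, and feature_names from the script above and uses scipy for the Spearman statistic.
import numpy as np
from scipy.stats import spearmanr
shap_imp = np.abs(shap_values).mean(axis=0)   # global SHAP importance
gain_imp = model.feature_importances_         # gain-based importance
rho, _ = spearmanr(shap_imp, gain_imp)
print(f"Spearman rank correlation between rankings: {rho:.3f}")
# Features where the two methods disagree most (rank positions, 0 = top)
shap_rank = np.argsort(np.argsort(-shap_imp))
gain_rank = np.argsort(np.argsort(-gain_imp))
disagreement = np.abs(shap_rank - gain_rank)
for j in np.argsort(-disagreement)[:3]:
    print(f"  {feature_names[j]}: ranks differ by {disagreement[j]} positions")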
Practical Workflow
The SHAP auditing workflow for a professional sports bettor:
- Train and calibrate the XGBoost model using the pipeline from Case Study 1.
- Compute SHAP values for the upcoming slate of games.
- For each game with sufficient edge: examine the SHAP breakdown to understand why the model disagrees with the market.
- Filter out suspect edges: if the SHAP contribution from noise or unreliable features exceeds 30% of the total, skip the bet.
- Size remaining bets proportionally to the signal-to-noise ratio of the SHAP decomposition (a sizing sketch follows this list).
- Periodically audit the global SHAP importance to ensure noise features remain low-ranked, indicating the model is not overfitting to irrelevant patterns.
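The filtering and sizing steps can be collapsed into a single helper. The sketch below is illustrative rather than prescriptive: the 30% noise cap mirrors the threshold used in betting_with_shap above, while the 3% minimum edge and 2% base bankroll fraction are assumptions, not recommendations.
import numpy as np
def shap_filtered_stake(edge, shap_row, feature_names, noise_features,
                        max_noise_share=0.30, base_fraction=0.02):
    """Return a bankroll fraction to stake, or 0.0 to pass on the bet."""
    abs_shap = np.abs(shap_row)
    noise = sum(v for n, v in zip(feature_names, abs_shap) if n in noise_features)
    total = abs_shap.sum()
    noise_share = noise / total if total > 0 else 1.0
    if noise_share > max_noise_share or abs(edge) < 0.03:
        return 0.0  # pass: edge is noise-driven or too thin
    signal_share = 1.0 - noise_share
    # scale linearly with edge up to a 5-point edge, weighted by signal share
    return base_fraction * signal_share * min(abs(edge) / 0.05, 1.0)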