Chapter 27: Advanced Regression and Classification
In This Chapter
The rating systems developed in Chapter 26 produce a single number for each team --- a strength estimate. But team performance is far more complex than a scalar can capture. A team might have an elite offense but a porous defense, play exceptionally well at home but collapse on the road, or perform differently depending on rest days, travel distance, or weather conditions. To capture this multidimensional reality, we need models that can ingest dozens or even hundreds of features and learn the nonlinear relationships that determine outcomes.
This chapter introduces the most powerful supervised learning methods available to the modern sports bettor: gradient-boosted trees (XGBoost), random forests, and ensemble stacking. But raw predictive power is only half the battle. A model that outputs "0.62" for a win probability is useless if that number does not actually correspond to a 62% win rate. We therefore devote substantial attention to probability calibration --- the art and science of ensuring your model's probabilities are trustworthy. We then address the challenge of imbalanced outcomes (predicting rare events like upsets or blowouts) and conclude with model interpretability, using SHAP values and related tools to understand not just what your model predicts, but why.
27.1 XGBoost for Sports Prediction
Gradient Boosting: The Intuition
Gradient boosting is one of the most successful machine learning algorithms in applied prediction. It builds an ensemble of weak learners (typically decision trees) sequentially, where each new tree corrects the errors of the ensemble so far.
The process works as follows:
- Start with a simple initial prediction (e.g., the average outcome).
- Compute the residuals --- the difference between the actual outcomes and the current predictions.
- Fit a decision tree to these residuals.
- Add this tree to the ensemble (scaled by a learning rate).
- Repeat steps 2--4 for a specified number of iterations.
Formally, the ensemble after $t$ trees is:
$$\hat{y}^{(t)}(x) = \hat{y}^{(t-1)}(x) + \eta \cdot f_t(x)$$
where $\eta$ is the learning rate and $f_t$ is the $t$-th tree. XGBoost (eXtreme Gradient Boosting), developed by Tianqi Chen in 2016, extends basic gradient boosting with several innovations:
- Second-order gradients: XGBoost uses both the first derivative (gradient) and the second derivative (Hessian) of the loss function when fitting each tree. This provides more information for split selection and leads to better convergence.
- Regularization: XGBoost adds L1 and L2 regularization terms to the objective function:
$$\mathcal{L}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{t=1}^{T} \Omega(f_t)$$
where:
$$\Omega(f) = \gamma J + \frac{1}{2}\lambda \sum_{j=1}^{J} w_j^2$$
Here $J$ is the number of leaves in the tree (distinct from $T$, the number of trees in the objective above), $w_j$ is the weight of leaf $j$, $\gamma$ penalizes each additional leaf (more leaves = higher penalty), and $\lambda$ is the L2 regularization strength on leaf weights. An optional L1 term, $\alpha \sum_{j} |w_j|$, can be added via reg_alpha.
- Column subsampling: Like random forests, XGBoost can randomly select a subset of features at each split, reducing correlation between trees.
- Handling missing values: XGBoost learns the optimal direction (left or right) for missing values during training.
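Before turning to XGBoost's extensions, it helps to see how little code the basic residual-fitting loop at the start of this section requires. The sketch below is a minimal illustration of plain gradient boosting with squared-error loss on synthetic data; the toy features, target, and settings are illustrative only and are not part of the XGBoost library.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                          # toy features
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, 500)    # toy target

eta, n_rounds = 0.1, 200
prediction = np.full(len(y), y.mean())                 # initial prediction: the mean
trees = []
for _ in range(n_rounds):
    residuals = y - prediction                         # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                             # fit a weak learner to the residuals
    prediction += eta * tree.predict(X)                # add it, scaled by the learning rate
    trees.append(tree)
print(f"Training MSE after boosting: {np.mean((y - prediction) ** 2):.4f}")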
Hyperparameter Tuning for Sports Data
XGBoost has many hyperparameters, but a handful dominate performance. Understanding what each controls and how to tune it is essential.
max_depth (tree depth). Controls the maximum depth of each individual tree. Deeper trees capture more complex interactions but are more prone to overfitting.
| max_depth | Behavior | Sports Recommendation |
|---|---|---|
| 2--3 | Shallow trees, captures main effects | Good starting point for small datasets |
| 4--6 | Moderate complexity, captures interactions | Most sports prediction tasks |
| 7--10 | Deep trees, many interactions | Only with large datasets (10,000+ games) |
| 10+ | Very deep, high risk of overfitting | Rarely appropriate for sports |
learning_rate (eta). Controls how much each tree contributes to the ensemble. Lower values require more trees but often yield better generalization.
$$\text{learning\_rate} \in [0.001, 0.3]$$
A common strategy: set learning_rate to 0.01--0.05 and use early stopping to determine the number of trees.
n_estimators (number of trees). The total number of boosting rounds. With early stopping, this serves as an upper bound; training halts when validation performance stops improving.
subsample. Fraction of training samples used to build each tree (0.5--1.0). Values below 1.0 introduce randomness that reduces overfitting.
colsample_bytree. Fraction of features considered for each tree (0.5--1.0). Similar to random forests' feature subsampling.
min_child_weight. Minimum sum of instance weights needed in a child node. Higher values prevent the model from learning overly specific patterns. For classification with balanced data, values of 1--10 work well.
gamma. Minimum loss reduction required for a split. Acts as a pruning threshold. Setting $\gamma > 0$ creates more conservative trees.
reg_alpha (L1) and reg_lambda (L2). Regularization terms. L1 promotes sparsity (some features get zero weight), while L2 smooths weights.
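Pulling these guidelines together, the configuration below is one reasonable starting point for a dataset of a few thousand games. The specific numbers are illustrative defaults to tune from, not values fitted to any particular sport.
import xgboost as xgb

# Illustrative starting configuration; tune max_depth, learning_rate,
# and the regularization terms from here.
starting_params = {
    "max_depth": 4,             # moderate interaction depth
    "learning_rate": 0.03,      # low eta; rely on early stopping
    "n_estimators": 2000,       # upper bound; early stopping picks the real count
    "subsample": 0.8,           # row subsampling for variance reduction
    "colsample_bytree": 0.8,    # column subsampling decorrelates trees
    "min_child_weight": 3,      # guards against overly specific leaves
    "gamma": 0.1,               # mild pruning threshold
    "reg_alpha": 0.0,           # L1 (sparsity) off by default
    "reg_lambda": 3.0,          # L2 smoothing on leaf weights
}
model = xgb.XGBClassifier(objective="binary:logistic", **starting_params)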
Feature Engineering for Sports Prediction
The quality of features determines the ceiling of any model. For sports prediction, features typically fall into several categories:
Team-level statistics (rolling):
- Offensive/defensive efficiency (points per 100 possessions, yards per play)
- Win/loss record over recent games (last 5, 10, 20)
- Home/away splits
- Rating system outputs (Elo, Glicko-2, Massey, PageRank from Chapter 26)
Matchup-specific features:
- Difference in key statistics (offensive rating A minus defensive rating B)
- Historical head-to-head record
- Style matchup indicators (e.g., run-heavy offense vs. weak run defense)
Contextual features:
- Days of rest for each team
- Travel distance
- Back-to-back game indicator
- Altitude (e.g., Denver's mile-high advantage)
- Time zone changes
Market-derived features:
- Opening line
- Line movement (closing minus opening spread)
- Consensus percentage of bets on each side
- Over/under as a proxy for game pace
Python Implementation
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.metrics import log_loss, brier_score_loss, roc_auc_score
from typing import Dict, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')
class XGBoostSportsPredictor:
"""
XGBoost-based sports prediction model with proper temporal validation.
"""
def __init__(self, objective: str = 'binary:logistic',
random_state: int = 42):
self.objective = objective
self.random_state = random_state
self.model: Optional[xgb.XGBClassifier] = None
self.best_params: Optional[Dict] = None
self.feature_names: Optional[list] = None
def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
"""
Create features from raw game data.
Expected columns: home_team, away_team, home_score, away_score,
date, plus any team statistics columns.
Returns DataFrame with engineered features.
"""
features = pd.DataFrame(index=df.index)
# Example feature engineering (adapt to your data)
if 'home_elo' in df.columns and 'away_elo' in df.columns:
features['elo_diff'] = df['home_elo'] - df['away_elo']
features['elo_sum'] = df['home_elo'] + df['away_elo']
if 'home_off_rating' in df.columns:
features['off_diff'] = (df['home_off_rating']
- df['away_off_rating'])
features['def_diff'] = (df['home_def_rating']
- df['away_def_rating'])
# Matchup: home offense vs away defense
features['matchup_home'] = (df['home_off_rating']
- df['away_def_rating'])
features['matchup_away'] = (df['away_off_rating']
- df['home_def_rating'])
if 'home_rest_days' in df.columns:
features['rest_diff'] = (df['home_rest_days']
- df['away_rest_days'])
features['both_rested'] = (
(df['home_rest_days'] >= 3) & (df['away_rest_days'] >= 3)
).astype(int)
if 'home_win_pct' in df.columns:
features['win_pct_diff'] = (df['home_win_pct']
- df['away_win_pct'])
if 'spread' in df.columns:
features['spread'] = df['spread']
return features
def tune_hyperparameters(self, X: np.ndarray, y: np.ndarray,
n_splits: int = 5) -> Dict:
"""
Tune hyperparameters using time-series cross-validation.
Uses TimeSeriesSplit to respect temporal ordering ---
never train on future data.
"""
param_grid = {
'max_depth': [3, 4, 5, 6],
'learning_rate': [0.01, 0.05, 0.1],
'n_estimators': [100, 200, 500],
'subsample': [0.7, 0.8, 0.9],
'colsample_bytree': [0.7, 0.8, 1.0],
'min_child_weight': [1, 3, 5],
'gamma': [0, 0.1, 0.3],
'reg_lambda': [1, 3, 5],
}
# For practical purposes, use a smaller search space
# or switch to RandomizedSearchCV / Bayesian optimization
reduced_grid = {
'max_depth': [3, 5],
'learning_rate': [0.01, 0.05],
'n_estimators': [200, 500],
'subsample': [0.8],
'colsample_bytree': [0.8],
'min_child_weight': [1, 5],
'reg_lambda': [1, 5],
}
tscv = TimeSeriesSplit(n_splits=n_splits)
base_model = xgb.XGBClassifier(
objective=self.objective,
random_state=self.random_state,
use_label_encoder=False,
eval_metric='logloss',
)
grid_search = GridSearchCV(
base_model, reduced_grid,
cv=tscv,
scoring='neg_log_loss',
n_jobs=-1,
verbose=0
)
grid_search.fit(X, y)
self.best_params = grid_search.best_params_
print(f"Best parameters: {self.best_params}")
print(f"Best CV log-loss: {-grid_search.best_score_:.4f}")
return self.best_params
def fit(self, X_train: np.ndarray, y_train: np.ndarray,
X_val: Optional[np.ndarray] = None,
y_val: Optional[np.ndarray] = None,
feature_names: Optional[list] = None,
params: Optional[Dict] = None) -> None:
"""
Train the XGBoost model with optional early stopping.
"""
if params is None:
params = self.best_params or {}
self.feature_names = feature_names
self.model = xgb.XGBClassifier(
objective=self.objective,
random_state=self.random_state,
use_label_encoder=False,
eval_metric='logloss',
n_estimators=params.get('n_estimators', 500),
max_depth=params.get('max_depth', 4),
learning_rate=params.get('learning_rate', 0.05),
subsample=params.get('subsample', 0.8),
colsample_bytree=params.get('colsample_bytree', 0.8),
min_child_weight=params.get('min_child_weight', 3),
gamma=params.get('gamma', 0.1),
reg_lambda=params.get('reg_lambda', 3),
)
        if X_val is not None and y_val is not None:
            # Early stopping needs an explicit patience setting; recent
            # XGBoost versions expect early_stopping_rounds on the estimator
            # rather than as a fit() argument.
            self.model.set_params(early_stopping_rounds=50)
            self.model.fit(
                X_train, y_train,
                eval_set=[(X_val, y_val)],
                verbose=False
            )
            print(f"Best iteration: {self.model.best_iteration}")
else:
self.model.fit(X_train, y_train)
def predict_proba(self, X: np.ndarray) -> np.ndarray:
"""Return probability predictions for class 1 (home win)."""
if self.model is None:
raise ValueError("Model not fitted")
return self.model.predict_proba(X)[:, 1]
def evaluate(self, X_test: np.ndarray,
y_test: np.ndarray) -> Dict[str, float]:
"""Evaluate model on test set."""
probs = self.predict_proba(X_test)
return {
'log_loss': log_loss(y_test, probs),
'brier_score': brier_score_loss(y_test, probs),
'auc_roc': roc_auc_score(y_test, probs),
'accuracy': ((probs > 0.5) == y_test).mean(),
}
def feature_importance(self, importance_type: str = 'gain') -> pd.Series:
"""
Get feature importance scores.
Args:
importance_type: 'gain' (avg gain from splits),
'weight' (number of times used),
'cover' (avg number of samples affected)
"""
if self.model is None:
raise ValueError("Model not fitted")
importances = self.model.get_booster().get_score(
importance_type=importance_type
)
if self.feature_names:
# Map feature indices to names
named_importances = {}
for key, val in importances.items():
idx = int(key.replace('f', ''))
if idx < len(self.feature_names):
named_importances[self.feature_names[idx]] = val
else:
named_importances[key] = val
importances = named_importances
return pd.Series(importances).sort_values(ascending=False)
# --- Worked Example ---
if __name__ == "__main__":
np.random.seed(42)
n_games = 2000
# Generate synthetic sports data
data = {
'elo_diff': np.random.normal(0, 100, n_games),
'off_diff': np.random.normal(0, 3, n_games),
'def_diff': np.random.normal(0, 3, n_games),
'rest_diff': np.random.randint(-3, 4, n_games),
'win_pct_diff': np.random.normal(0, 0.2, n_games),
'spread': np.random.normal(0, 7, n_games),
}
X = pd.DataFrame(data)
# Generate outcomes with known relationship
logit = (0.003 * X['elo_diff'] + 0.15 * X['off_diff']
- 0.12 * X['def_diff'] + 0.05 * X['rest_diff']
+ 0.5 * X['win_pct_diff'] + 0.04 * X['spread']
+ 0.15) # Home advantage
prob = 1 / (1 + np.exp(-logit))
y = (np.random.random(n_games) < prob).astype(int)
# Temporal split
train_end = 1400
val_end = 1700
X_train = X.iloc[:train_end].values
y_train = y[:train_end]
X_val = X.iloc[train_end:val_end].values
y_val = y[train_end:val_end]
X_test = X.iloc[val_end:].values
y_test = y[val_end:]
# Train model
predictor = XGBoostSportsPredictor()
predictor.fit(X_train, y_train, X_val, y_val,
feature_names=list(X.columns),
params={
'max_depth': 4,
'learning_rate': 0.05,
'n_estimators': 500,
'subsample': 0.8,
'colsample_bytree': 0.8,
'min_child_weight': 3,
'reg_lambda': 3,
})
# Evaluate
metrics = predictor.evaluate(X_test, y_test)
print("\n=== XGBoost Test Set Performance ===")
for metric, value in metrics.items():
print(f" {metric}: {value:.4f}")
# Feature importance
print("\nFeature Importance (by gain):")
importance = predictor.feature_importance('gain')
for feat, imp in importance.items():
print(f" {feat}: {imp:.2f}")
XGBoost for Point Spread and Over/Under Prediction
While the binary classification approach (win/loss) is natural for moneyline betting, XGBoost can also be adapted for spread and total betting:
Spread prediction (regression): Use objective='reg:squarederror' and predict the point differential directly. Then compare the predicted differential to the posted spread.
Over/under (regression or classification): Predict total points using regression, or frame as binary classification (over vs. under the posted total).
Multi-output: Train separate models for home score and away score, then combine. This provides richer information than a single-output model.
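A minimal sketch of the regression variant follows, assuming three hypothetical inputs: a feature matrix X, a point-differential target margin (home score minus away score), and the posted home spread. The helper names are placeholders, not part of any library.
import numpy as np
import xgboost as xgb

def fit_spread_model(X: np.ndarray, margin: np.ndarray) -> xgb.XGBRegressor:
    """Regress the home-minus-away point differential directly."""
    model = xgb.XGBRegressor(
        objective="reg:squarederror",
        n_estimators=500, max_depth=4, learning_rate=0.05,
        subsample=0.8, colsample_bytree=0.8,
    )
    model.fit(X, margin)
    return model

def edge_vs_spread(model, X_new, posted_spread):
    """Predicted margin minus the margin implied by the posted spread."""
    predicted_margin = model.predict(X_new)
    # A home spread of -3.5 implies the market expects home to win by 3.5,
    # so the market's implied margin is the negative of the posted spread.
    market_margin = -posted_spread
    return predicted_margin - market_margin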
27.2 Random Forests and Ensemble Methods
Bagging: Bootstrap Aggregation
Before understanding random forests, we must understand bagging (bootstrap aggregation), introduced by Leo Breiman in 1996. The idea is simple yet powerful:
- Create $B$ bootstrap samples from the training data (sampling with replacement).
- Train a decision tree on each bootstrap sample.
- Average the predictions (regression) or vote (classification) across all $B$ trees.
Bagging reduces variance without increasing bias. If individual trees have variance $\sigma^2$ and pairwise correlation $\rho$, the variance of the average is:
$$\text{Var}_{\text{ensemble}} = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2$$
As $B \to \infty$, the second term vanishes, leaving $\rho\sigma^2$. This means the only way to further reduce variance is to reduce the correlation $\rho$ between trees --- which is exactly what random forests do.
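Plugging representative numbers into this formula makes the role of correlation concrete; the values of $\rho$, $\sigma^2$, and $B$ below are arbitrary illustrations.
# Ensemble variance: rho * sigma^2 + (1 - rho) / B * sigma^2
sigma2 = 1.0
for rho in (0.6, 0.3, 0.1):
    for B in (10, 100, 1000):
        var = rho * sigma2 + (1 - rho) / B * sigma2
        print(f"rho={rho:.1f}, B={B:4d}: ensemble variance = {var:.3f}")
# With rho=0.6, going from B=100 to B=1000 barely helps (0.604 -> 0.600);
# lowering rho to 0.1 cuts the variance floor itself, which is what
# feature subsampling in random forests is designed to do.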
Random Forests
Random forests extend bagging with feature subsampling: at each split in each tree, only a random subset of $m$ features (out of $p$ total) are considered as candidates. This decorrelates the trees, reducing $\rho$ and thus reducing ensemble variance.
The typical choice is $m = \sqrt{p}$ for classification and $m = p/3$ for regression, but this should be tuned empirically.
Key hyperparameters:
| Parameter | Description | Typical Range |
|---|---|---|
| n_estimators | Number of trees | 100--1000 |
| max_depth | Maximum tree depth | None (full), or 10--30 |
| max_features | Features per split | 'sqrt', 'log2', or float 0.3--0.8 |
| min_samples_leaf | Min samples in a leaf | 1--20 |
| min_samples_split | Min samples to split | 2--20 |
Random Forests vs. XGBoost
| Property | Random Forest | XGBoost |
|---|---|---|
| Training | Parallel (trees are independent) | Sequential (each tree corrects errors) |
| Overfitting risk | Lower (bagging + subsampling) | Higher without regularization |
| Tuning effort | Fewer critical hyperparameters | More hyperparameters to tune |
| Performance ceiling | Good | Often higher |
| Training speed | Fast (parallelizable) | Moderate |
| Interpretability | Moderate | Moderate |
| Handles noisy features | Well (averages out noise) | Can overfit to noise |
In practice, XGBoost tends to outperform random forests on tabular data when properly tuned, but random forests are an excellent default model that requires less tuning and is more robust to hyperparameter choices.
Stacking and Blending
Stacking (introduced by David Wolpert in 1992) combines multiple diverse base models using a meta-learner:
- Level 0: Train $K$ diverse base models (e.g., logistic regression, random forest, XGBoost, neural network).
- Cross-validation predictions: For each base model, generate out-of-fold predictions using $k$-fold cross-validation. This produces a prediction from each model for every training example, without any model seeing its own training data.
- Level 1: Train a meta-model on the out-of-fold predictions (as features) to predict the true outcome.
The meta-model learns the optimal combination of base models, potentially capturing nonlinear relationships (e.g., "trust XGBoost when the spread is large, trust logistic regression when teams are closely matched").
Blending is a simpler variant:
- Split the training data into a training set and a blending set (holdout).
- Train base models on the training set.
- Generate predictions on the blending set.
- Train the meta-model on the blending set predictions.
Blending is easier to implement but wastes data (the blending set is not used for training base models).
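A minimal sketch of blending along these lines is shown below, reusing the same three kinds of base models featured elsewhere in this chapter; the split fraction and model settings are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

def fit_blend(X, y, blend_frac=0.2):
    """Blending: base models on the early split, meta-model on the holdout."""
    cut = int(len(X) * (1 - blend_frac))       # temporal split, not shuffled
    X_base, y_base = X[:cut], y[:cut]
    X_blend, y_blend = X[cut:], y[cut:]
    base_models = [
        LogisticRegression(max_iter=1000),
        RandomForestClassifier(n_estimators=300, max_depth=8, random_state=0),
        xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                          eval_metric="logloss", random_state=0),
    ]
    for m in base_models:
        m.fit(X_base, y_base)
    # Meta-features: each base model's probability on the blending holdout
    meta_X = np.column_stack([m.predict_proba(X_blend)[:, 1] for m in base_models])
    meta_model = LogisticRegression(max_iter=1000)
    meta_model.fit(meta_X, y_blend)
    return base_models, meta_model

def predict_blend(base_models, meta_model, X_new):
    meta_X = np.column_stack([m.predict_proba(X_new)[:, 1] for m in base_models])
    return meta_model.predict_proba(meta_X)[:, 1]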
When Ensembles Beat Single Models
Ensembles work best when:
- Base models are diverse: They make different types of errors. A stack of 5 XGBoost models with slightly different hyperparameters is less diverse than a mix of XGBoost, logistic regression, and a neural network.
- Base models are individually reasonably accurate: Combining garbage does not produce gold. Each model should be at least somewhat predictive.
- There is sufficient data to train the meta-model: With very small datasets, stacking can overfit at the meta-level.
For sports betting, a practical ensemble might combine:
- Logistic regression (simple, interpretable, hard to overfit)
- Random forest (robust, handles nonlinearity)
- XGBoost (high performance, captures interactions)
- Rating system outputs (Elo, Massey, etc., from Chapter 26)
Python Implementation
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_predict
from sklearn.metrics import log_loss, brier_score_loss
import xgboost as xgb
from typing import Dict, List, Optional
class SportsEnsemble:
"""
Ensemble model combining random forest, XGBoost, and logistic regression
using stacking for sports prediction.
"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
self.base_models = self._create_base_models()
self.stacking_model = None
self.blending_model = None
def _create_base_models(self) -> Dict:
return {
'logistic': LogisticRegression(
C=1.0, max_iter=1000,
random_state=self.random_state
),
'random_forest': RandomForestClassifier(
n_estimators=300,
max_depth=8,
max_features='sqrt',
min_samples_leaf=10,
n_jobs=-1,
random_state=self.random_state
),
'xgboost': xgb.XGBClassifier(
n_estimators=300,
max_depth=4,
learning_rate=0.05,
subsample=0.8,
colsample_bytree=0.8,
min_child_weight=3,
reg_lambda=3,
use_label_encoder=False,
eval_metric='logloss',
random_state=self.random_state,
),
}
def fit_stacking(self, X: np.ndarray, y: np.ndarray,
cv_splits: int = 5) -> None:
"""
Train stacking ensemble with time-series cross-validation.
"""
tscv = TimeSeriesSplit(n_splits=cv_splits)
estimators = [
('lr', self.base_models['logistic']),
('rf', self.base_models['random_forest']),
('xgb', self.base_models['xgboost']),
]
self.stacking_model = StackingClassifier(
estimators=estimators,
final_estimator=LogisticRegression(C=1.0, max_iter=1000),
cv=tscv,
stack_method='predict_proba',
passthrough=False, # Only use base model predictions
n_jobs=-1,
)
self.stacking_model.fit(X, y)
def fit_manual_stacking(self, X: np.ndarray, y: np.ndarray,
n_splits: int = 5) -> None:
"""
Manual stacking implementation with more control.
Generates out-of-fold predictions for the meta-learner
using time-series splits.
"""
tscv = TimeSeriesSplit(n_splits=n_splits)
n_models = len(self.base_models)
meta_features = np.zeros((len(X), n_models))
# Generate out-of-fold predictions
for fold_idx, (train_idx, val_idx) in enumerate(tscv.split(X)):
X_fold_train, X_fold_val = X[train_idx], X[val_idx]
y_fold_train = y[train_idx]
for model_idx, (name, model) in enumerate(
self.base_models.items()
):
model_clone = self._clone_model(name)
model_clone.fit(X_fold_train, y_fold_train)
meta_features[val_idx, model_idx] = (
model_clone.predict_proba(X_fold_val)[:, 1]
)
# Train meta-learner on the complete set of OOF predictions
# Only use rows that have OOF predictions (not the first fold's train)
valid_mask = meta_features.sum(axis=1) != 0
meta_X = meta_features[valid_mask]
meta_y = y[valid_mask]
self.meta_learner = LogisticRegression(C=1.0, max_iter=1000)
self.meta_learner.fit(meta_X, meta_y)
# Refit all base models on full training data
for name, model in self.base_models.items():
model.fit(X, y)
print("Meta-learner coefficients:")
for name, coef in zip(self.base_models.keys(),
self.meta_learner.coef_[0]):
print(f" {name}: {coef:.4f}")
def _clone_model(self, name: str):
"""Create a fresh copy of a base model."""
from sklearn.base import clone
return clone(self.base_models[name])
def predict_proba_stacking(self, X: np.ndarray) -> np.ndarray:
"""Predict using sklearn StackingClassifier."""
return self.stacking_model.predict_proba(X)[:, 1]
def predict_proba_manual(self, X: np.ndarray) -> np.ndarray:
"""Predict using manually stacked model."""
meta_features = np.column_stack([
model.predict_proba(X)[:, 1]
for model in self.base_models.values()
])
return self.meta_learner.predict_proba(meta_features)[:, 1]
def compare_models(self, X_test: np.ndarray,
y_test: np.ndarray) -> pd.DataFrame:
"""Compare individual models vs ensemble on test set."""
results = []
for name, model in self.base_models.items():
probs = model.predict_proba(X_test)[:, 1]
results.append({
'model': name,
'log_loss': log_loss(y_test, probs),
'brier_score': brier_score_loss(y_test, probs),
'accuracy': ((probs > 0.5) == y_test).mean(),
})
if self.stacking_model is not None:
probs = self.predict_proba_stacking(X_test)
results.append({
'model': 'stacking_ensemble',
'log_loss': log_loss(y_test, probs),
'brier_score': brier_score_loss(y_test, probs),
'accuracy': ((probs > 0.5) == y_test).mean(),
})
return pd.DataFrame(results).sort_values('log_loss')
# --- Worked Example ---
if __name__ == "__main__":
np.random.seed(42)
n = 3000
# Synthetic data with nonlinear relationships
X = np.random.randn(n, 8)
logit = (0.5 * X[:, 0] + 0.3 * X[:, 1] - 0.4 * X[:, 2]
+ 0.2 * X[:, 0] * X[:, 1] # Interaction
+ 0.1 * X[:, 3]**2 # Nonlinearity
+ 0.15)
prob = 1 / (1 + np.exp(-logit))
y = (np.random.random(n) < prob).astype(int)
# Split
X_train, X_test = X[:2400], X[2400:]
y_train, y_test = y[:2400], y[2400:]
ensemble = SportsEnsemble()
ensemble.fit_stacking(X_train, y_train, cv_splits=5)
print("=== Model Comparison ===\n")
comparison = ensemble.compare_models(X_test, y_test)
print(comparison.to_string(index=False))
27.3 Probability Calibration
Why Calibration Matters for Betting
In sports betting, the actual probability estimate matters enormously --- not just the ranking of outcomes. Consider two models that both correctly identify team A as the favorite:
- Model 1 predicts P(A wins) = 0.72
- Model 2 predicts P(A wins) = 0.58
If the bookmaker's implied probability (after removing vig) is 0.65, Model 1 sees a strong value bet on A, Model 2 sees value on B, and the correct action depends entirely on which probability is more accurate. An uncalibrated model that consistently overestimates the probability of favorites will see phantom value everywhere and bleed money.
Calibration means that when your model says an event has probability $p$, it actually occurs with frequency $p$. Formally, a model is perfectly calibrated if:
$$P(Y = 1 \mid \hat{p}(X) = p) = p \quad \text{for all } p \in [0, 1]$$
Calibration Curves (Reliability Diagrams)
A calibration curve plots predicted probabilities against observed frequencies. To construct one:
- Sort predictions into bins (e.g., 10 bins: [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]).
- For each bin, compute the mean predicted probability and the observed frequency of positive outcomes.
- Plot mean prediction (x-axis) vs. observed frequency (y-axis).
A perfectly calibrated model produces a 45-degree line. Common miscalibration patterns:
- Overconfident: The curve is S-shaped, predicting more extreme probabilities than warranted. Predictions near 0.8 actually correspond to only 0.65 win rates.
- Underconfident: The curve is inverse-S-shaped. Predictions are too moderate.
- Biased: The curve is shifted above or below the diagonal. The model systematically over- or under-predicts.
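The binning procedure above takes only a few lines of pandas. The sketch below assumes y_true holds 0/1 outcomes and y_prob holds the model's predicted probabilities; both names are placeholders.
import numpy as np
import pandas as pd

def reliability_table(y_true: np.ndarray, y_prob: np.ndarray,
                      n_bins: int = 10) -> pd.DataFrame:
    """Mean predicted probability vs. observed frequency per bin."""
    bins = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)
    df = pd.DataFrame({"bin": bins, "prob": y_prob, "won": y_true})
    table = df.groupby("bin").agg(
        n=("won", "size"),
        mean_predicted=("prob", "mean"),
        observed_freq=("won", "mean"),
    )
    table["gap"] = table["observed_freq"] - table["mean_predicted"]
    return table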
Platt Scaling
Platt scaling (John Platt, 1999) fits a logistic regression to the model's outputs:
$$P(y = 1 \mid f(x)) = \frac{1}{1 + \exp(Af(x) + B)}$$
where $f(x)$ is the uncalibrated model output (raw score or log-odds) and $A$, $B$ are parameters learned by maximum likelihood on a held-out calibration set.
For a model that outputs probabilities, we first convert to log-odds: $f(x) = \ln(p/(1-p))$, then fit the logistic regression.
Advantages:
- Simple, with only 2 parameters.
- Works well when the miscalibration is roughly sigmoid-shaped.
- Fast to fit.
Disadvantages:
- Assumes the calibration function is logistic (may not hold).
- Can underperform when the true calibration function is non-monotonic.
Isotonic Regression
Isotonic regression fits a non-decreasing step function to the relationship between predicted probabilities and actual outcomes. It is more flexible than Platt scaling because it makes no parametric assumptions --- it only requires that the calibration function is monotonically non-decreasing.
The isotonic regression problem is:
$$\min_{z} \sum_{i=1}^{n} (y_i - z_i)^2 \quad \text{subject to } z_1 \leq z_2 \leq \cdots \leq z_n$$
where the $y_i$ are the actual outcomes (0 or 1) and the $z_i$ are the calibrated probabilities, ordered by the model's original predictions.
The solution is computed efficiently using the pool adjacent violators algorithm (PAVA) in $O(n)$ time.
Advantages:
- Nonparametric: adapts to any shape of miscalibration.
- Can fix non-sigmoid miscalibration patterns.
Disadvantages:
- Requires more calibration data than Platt scaling (being nonparametric, it is prone to overfitting with small samples).
- The step function can produce non-smooth calibration maps.
Platt vs. Isotonic: A Comparison
| Property | Platt Scaling | Isotonic Regression |
|---|---|---|
| Parametric | Yes (2 params) | No |
| Flexibility | Low | High |
| Data requirement | Small (100+) | Moderate (500+) |
| Overfitting risk | Low | Moderate |
| Handles non-sigmoid | Poorly | Well |
| Smoothness | Smooth (sigmoid) | Step function |
Recommendation for sports betting: Use Platt scaling when your calibration set is small (a single season of games). Use isotonic regression when you have multiple seasons of data. In practice, isotonic regression with careful cross-validation tends to produce slightly better-calibrated probabilities for well-established models.
Python Implementation
import numpy as np
import matplotlib
matplotlib.use('Agg') # Non-interactive backend
import matplotlib.pyplot as plt
from sklearn.calibration import (
CalibratedClassifierCV, calibration_curve
)
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import log_loss, brier_score_loss
from typing import Tuple, Optional
class ProbabilityCalibrator:
"""
Calibrates model probabilities using Platt scaling or isotonic regression.
"""
def __init__(self, method: str = 'isotonic'):
"""
Args:
method: 'platt' (logistic/sigmoid) or 'isotonic'
"""
self.method = method
self.calibrator = None
def fit(self, uncalibrated_probs: np.ndarray,
true_outcomes: np.ndarray) -> None:
"""
Fit the calibration model.
Args:
uncalibrated_probs: Model's probability predictions (1D array)
true_outcomes: Actual outcomes, 0 or 1 (1D array)
"""
if self.method == 'platt':
# Platt scaling: logistic regression on log-odds
log_odds = np.log(
np.clip(uncalibrated_probs, 1e-10, 1-1e-10)
/ np.clip(1-uncalibrated_probs, 1e-10, 1-1e-10)
).reshape(-1, 1)
self.calibrator = LogisticRegression(C=1e10, max_iter=10000)
self.calibrator.fit(log_odds, true_outcomes)
elif self.method == 'isotonic':
self.calibrator = IsotonicRegression(
y_min=0.001, y_max=0.999, out_of_bounds='clip'
)
self.calibrator.fit(uncalibrated_probs, true_outcomes)
def transform(self, uncalibrated_probs: np.ndarray) -> np.ndarray:
"""Apply calibration to new probability predictions."""
if self.calibrator is None:
raise ValueError("Must call fit() first")
if self.method == 'platt':
log_odds = np.log(
np.clip(uncalibrated_probs, 1e-10, 1-1e-10)
/ np.clip(1-uncalibrated_probs, 1e-10, 1-1e-10)
).reshape(-1, 1)
return self.calibrator.predict_proba(log_odds)[:, 1]
elif self.method == 'isotonic':
return self.calibrator.transform(uncalibrated_probs)
def plot_calibration_curve(self, probs_before: np.ndarray,
probs_after: np.ndarray,
true_outcomes: np.ndarray,
n_bins: int = 10,
save_path: Optional[str] = None) -> None:
"""
Plot calibration curves before and after calibration.
"""
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, probs, title in [
(axes[0], probs_before, 'Before Calibration'),
(axes[1], probs_after, 'After Calibration')
]:
fraction_pos, mean_predicted = calibration_curve(
true_outcomes, probs, n_bins=n_bins, strategy='uniform'
)
ax.plot(mean_predicted, fraction_pos, 's-',
label='Model', color='blue')
ax.plot([0, 1], [0, 1], '--', color='gray',
label='Perfectly Calibrated')
ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Observed Frequency')
ax.set_title(title)
ax.legend()
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
# Add histogram of predictions
ax2 = ax.twinx()
ax2.hist(probs, bins=20, alpha=0.2, color='orange')
ax2.set_ylabel('Count')
plt.tight_layout()
if save_path:
plt.savefig(save_path, dpi=150, bbox_inches='tight')
plt.close()
def calibration_metrics(y_true: np.ndarray,
y_pred: np.ndarray,
n_bins: int = 10) -> dict:
"""
Compute comprehensive calibration metrics.
Returns:
Dictionary with ECE, MCE, log_loss, and brier_score
"""
    # Expected and Maximum Calibration Error (ECE / MCE), computed per bin
    # directly so that empty bins cannot misalign the bin weights
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.digitize(y_pred, bin_edges) - 1
    bin_indices = np.clip(bin_indices, 0, n_bins - 1)
    ece = 0.0
    mce = 0.0
    for b in range(n_bins):
        mask = bin_indices == b
        if not mask.any():
            continue
        gap = np.abs(y_true[mask].mean() - y_pred[mask].mean())
        ece += (mask.sum() / len(y_pred)) * gap
        mce = max(mce, gap)
return {
'ECE': round(ece, 4),
'MCE': round(mce, 4),
'log_loss': round(log_loss(y_true, y_pred), 4),
'brier_score': round(brier_score_loss(y_true, y_pred), 4),
}
# --- Worked Example ---
if __name__ == "__main__":
np.random.seed(42)
n = 2000
# Simulate an overconfident model
true_prob = np.random.uniform(0.2, 0.8, n)
y = (np.random.random(n) < true_prob).astype(int)
# Model predictions: overconfident (pushed toward 0 and 1)
uncalibrated = np.clip(true_prob + 0.3 * (true_prob - 0.5), 0.01, 0.99)
uncalibrated += np.random.normal(0, 0.05, n)
uncalibrated = np.clip(uncalibrated, 0.01, 0.99)
# Split
cal_split = 1000
cal_probs = uncalibrated[:cal_split]
cal_y = y[:cal_split]
test_probs = uncalibrated[cal_split:]
test_y = y[cal_split:]
print("=== Probability Calibration ===\n")
print("Before calibration:")
metrics_before = calibration_metrics(test_y, test_probs)
for k, v in metrics_before.items():
print(f" {k}: {v}")
# Platt scaling
platt = ProbabilityCalibrator(method='platt')
platt.fit(cal_probs, cal_y)
platt_preds = platt.transform(test_probs)
print("\nAfter Platt scaling:")
metrics_platt = calibration_metrics(test_y, platt_preds)
for k, v in metrics_platt.items():
print(f" {k}: {v}")
# Isotonic regression
isotonic = ProbabilityCalibrator(method='isotonic')
isotonic.fit(cal_probs, cal_y)
iso_preds = isotonic.transform(test_probs)
print("\nAfter isotonic regression:")
metrics_iso = calibration_metrics(test_y, iso_preds)
for k, v in metrics_iso.items():
print(f" {k}: {v}")
Practical Calibration Workflow for Bettors
- Train your prediction model on data from seasons 1 through $N-1$.
- Generate predictions on season $N$ (hold-out calibration set).
- Fit a calibrator (Platt or isotonic) on the hold-out predictions and actual outcomes.
- Apply the calibrator to all future predictions.
- Monitor calibration throughout the current season. If the calibration curve drifts, refit with recent data.
- Use cross-validated calibration (e.g., CalibratedClassifierCV with cv=5) when you cannot afford to set aside a separate calibration set.
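For the last point, a minimal sketch of cross-validated calibration is shown below; the base model settings are illustrative, and note that CalibratedClassifierCV uses standard (non-temporal) folds internally, so for strictly chronological data the season-based workflow above is often preferable.
from sklearn.calibration import CalibratedClassifierCV
import xgboost as xgb

# Each fold's model is calibrated on the data it did not see,
# so no separate calibration set is required.
base = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                         eval_metric="logloss", random_state=42)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
# calibrated.fit(X_train, y_train)              # X_train, y_train are placeholders
# probs = calibrated.predict_proba(X_new)[:, 1]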
27.4 Handling Imbalanced Outcomes
The Imbalance Problem in Sports Betting
Many valuable betting opportunities involve predicting rare outcomes:
- Upsets: A significant underdog winning outright (moneyline).
- Blowouts: Winning by a large margin (alternative spread markets).
- High-scoring games: Total points exceeding a high threshold.
- Player props: A player achieving an unusually high statistical line.
- Parlays and teasers: Multiple correlated outcomes occurring simultaneously.
When one class is much rarer than the other, standard models struggle. If upsets occur in only 20% of games, a model that always predicts "no upset" achieves 80% accuracy while being completely useless. Standard metrics like accuracy are misleading, and the model's probability estimates for the rare class are often poorly calibrated.
Why Standard Models Fail on Imbalanced Data
Most classification algorithms optimize an objective function that treats all misclassifications equally. With an 80/20 class split:
- The model learns to predict the majority class most of the time.
- The loss function is dominated by majority-class examples.
- The decision boundary is pushed toward the minority class.
- Probability estimates for the minority class are compressed toward zero.
For a bettor, this is disastrous. The rare outcomes (upsets, blowouts) are precisely where the market offers the highest expected value, because bookmakers also struggle to price them accurately.
Techniques for Handling Imbalance
1. Class Weights
The simplest approach: assign higher weight to minority-class examples in the loss function. Instead of minimizing:
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n} [y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)]$$
We minimize:
$$\mathcal{L}_w = -\frac{1}{n}\sum_{i=1}^{n} [w_1 \cdot y_i \log(\hat{p}_i) + w_0 \cdot (1-y_i)\log(1-\hat{p}_i)]$$
where $w_1 > w_0$. A common choice is balanced weights: $w_k = n / (2 \cdot n_k)$ where $n_k$ is the number of examples in class $k$. In scikit-learn, this is achieved with class_weight='balanced'.
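The arithmetic is worth seeing once. The sketch below works through the balanced weights for a hypothetical 80/20 split and shows the equivalent XGBoost setting, scale_pos_weight, used in the implementation later in this section.
import numpy as np

y = np.array([0] * 800 + [1] * 200)        # hypothetical 80/20 class split
n = len(y)
n_pos, n_neg = (y == 1).sum(), (y == 0).sum()

# Balanced weights: w_k = n / (2 * n_k)
w_pos = n / (2 * n_pos)                    # 1000 / 400  = 2.5
w_neg = n / (2 * n_neg)                    # 1000 / 1600 = 0.625
print(f"w_pos = {w_pos:.3f}, w_neg = {w_neg:.3f}")

# XGBoost expresses the same idea as a single ratio applied to positives
scale_pos_weight = n_neg / n_pos           # 800 / 200 = 4.0
print(f"scale_pos_weight = {scale_pos_weight:.1f}")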
2. SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE (Chawla et al., 2002) generates synthetic minority-class examples by interpolating between existing minority examples:
- For each minority example $x_i$, find its $k$ nearest minority-class neighbors.
- Randomly select one neighbor $x_{nn}$.
- Create a synthetic example: $x_{\text{new}} = x_i + \lambda(x_{nn} - x_i)$ where $\lambda \sim \text{Uniform}(0, 1)$.
This increases the effective size of the minority class without simply duplicating examples (which can lead to overfitting).
Variants of SMOTE:
- Borderline-SMOTE: Only oversamples minority examples near the decision boundary.
- SMOTE-ENN: Combines SMOTE with Edited Nearest Neighbors to clean noisy examples.
- ADASYN: Adaptively generates more synthetic examples for harder-to-classify minority instances.
3. Threshold Optimization
Rather than using the default 0.5 threshold for classification, optimize the threshold on a validation set. For imbalanced problems, the optimal threshold is often below 0.5.
4. Ensemble Methods for Imbalanced Data
- BalancedRandomForest: Random forest variant that balances each bootstrap sample.
- EasyEnsemble: Trains multiple models on balanced subsets and averages.
- RUSBoost: Combines random undersampling with boosting.
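These estimators live in the third-party imbalanced-learn package. The sketch below shows the first of them, assuming imblearn is installed (the worked example later in this section guards the same import); the data names are placeholders.
try:
    from imblearn.ensemble import BalancedRandomForestClassifier
    # Each tree is grown on a bootstrap sample that undersamples the
    # majority class, so every tree sees a roughly balanced problem.
    brf = BalancedRandomForestClassifier(
        n_estimators=300, random_state=42, n_jobs=-1
    )
    # brf.fit(X_train, y_train)                 # placeholders for your data
    # probs = brf.predict_proba(X_test)[:, 1]
except ImportError:
    print("imbalanced-learn not installed; skipping BalancedRandomForest")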
Evaluation Metrics for Imbalanced Data
Standard accuracy is misleading. Use these instead:
| Metric | Formula | Interpretation |
|---|---|---|
| Precision | $\frac{TP}{TP+FP}$ | Of predicted positives, what fraction is correct? |
| Recall | $\frac{TP}{TP+FN}$ | Of actual positives, what fraction did we catch? |
| F1 Score | $\frac{2 \cdot P \cdot R}{P + R}$ | Harmonic mean of precision and recall |
| PR-AUC | Area under precision-recall curve | Overall quality for imbalanced data |
| Log-loss | $-\frac{1}{n}\sum [y\log\hat{p} + (1-y)\log(1-\hat{p})]$ | Penalizes poor probability estimates |
| Brier Score | $\frac{1}{n}\sum(y - \hat{p})^2$ | Mean squared probability error |
For betting applications, log-loss and Brier score are the most relevant because they evaluate the quality of probability estimates directly, which is what matters for identifying value bets.
Python Implementation
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
classification_report, precision_recall_curve,
average_precision_score, log_loss, brier_score_loss,
roc_auc_score, f1_score
)
from sklearn.model_selection import TimeSeriesSplit
import xgboost as xgb
from typing import Dict, Optional
try:
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.combine import SMOTEENN
from imblearn.ensemble import BalancedRandomForestClassifier
HAS_IMBLEARN = True
except ImportError:
HAS_IMBLEARN = False
class ImbalancedSportsPredictor:
"""
Handles imbalanced outcomes in sports prediction.
Compares multiple approaches: class weights, SMOTE, threshold
optimization, and specialized ensemble methods.
"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
self.models: Dict = {}
self.optimal_thresholds: Dict[str, float] = {}
def fit_class_weighted(self, X_train: np.ndarray,
y_train: np.ndarray) -> None:
"""Train models with class weight adjustment."""
self.models['xgb_weighted'] = xgb.XGBClassifier(
n_estimators=300,
max_depth=4,
learning_rate=0.05,
scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
use_label_encoder=False,
eval_metric='logloss',
random_state=self.random_state,
)
self.models['xgb_weighted'].fit(X_train, y_train)
self.models['rf_weighted'] = RandomForestClassifier(
n_estimators=300,
max_depth=8,
class_weight='balanced',
random_state=self.random_state,
)
self.models['rf_weighted'].fit(X_train, y_train)
def fit_smote(self, X_train: np.ndarray,
y_train: np.ndarray) -> None:
"""Train models on SMOTE-resampled data."""
if not HAS_IMBLEARN:
print("imbalanced-learn not installed; skipping SMOTE")
return
# Standard SMOTE
smote = SMOTE(random_state=self.random_state, k_neighbors=5)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"Original: {len(y_train)} samples "
f"({y_train.sum()} positive, {(1-y_train).sum()} negative)")
print(f"After SMOTE: {len(y_resampled)} samples "
f"({y_resampled.sum()} positive, "
f"{(1-y_resampled).sum()} negative)")
self.models['xgb_smote'] = xgb.XGBClassifier(
n_estimators=300, max_depth=4, learning_rate=0.05,
use_label_encoder=False, eval_metric='logloss',
random_state=self.random_state,
)
self.models['xgb_smote'].fit(X_resampled, y_resampled)
# Borderline-SMOTE (focuses on borderline examples)
bsmote = BorderlineSMOTE(
random_state=self.random_state, k_neighbors=5
)
X_border, y_border = bsmote.fit_resample(X_train, y_train)
self.models['xgb_borderline_smote'] = xgb.XGBClassifier(
n_estimators=300, max_depth=4, learning_rate=0.05,
use_label_encoder=False, eval_metric='logloss',
random_state=self.random_state,
)
self.models['xgb_borderline_smote'].fit(X_border, y_border)
def fit_baseline(self, X_train: np.ndarray,
y_train: np.ndarray) -> None:
"""Train baseline model without any imbalance handling."""
self.models['xgb_baseline'] = xgb.XGBClassifier(
n_estimators=300, max_depth=4, learning_rate=0.05,
use_label_encoder=False, eval_metric='logloss',
random_state=self.random_state,
)
self.models['xgb_baseline'].fit(X_train, y_train)
def optimize_threshold(self, model_name: str,
X_val: np.ndarray,
y_val: np.ndarray,
metric: str = 'f1') -> float:
"""
Find the optimal classification threshold.
Args:
model_name: Name of the model to optimize
X_val: Validation features
y_val: Validation outcomes
metric: 'f1', 'precision', or 'recall'
"""
model = self.models[model_name]
probs = model.predict_proba(X_val)[:, 1]
best_threshold = 0.5
best_score = 0
for threshold in np.arange(0.1, 0.9, 0.01):
preds = (probs >= threshold).astype(int)
if metric == 'f1':
score = f1_score(y_val, preds, zero_division=0)
elif metric == 'precision':
tp = ((preds == 1) & (y_val == 1)).sum()
fp = ((preds == 1) & (y_val == 0)).sum()
score = tp / (tp + fp) if (tp + fp) > 0 else 0
elif metric == 'recall':
tp = ((preds == 1) & (y_val == 1)).sum()
fn = ((preds == 0) & (y_val == 1)).sum()
score = tp / (tp + fn) if (tp + fn) > 0 else 0
if score > best_score:
best_score = score
best_threshold = threshold
self.optimal_thresholds[model_name] = best_threshold
print(f"Optimal threshold for {model_name}: "
f"{best_threshold:.2f} ({metric} = {best_score:.4f})")
return best_threshold
def evaluate_all(self, X_test: np.ndarray,
y_test: np.ndarray) -> pd.DataFrame:
"""Comprehensive evaluation of all models."""
results = []
for name, model in self.models.items():
probs = model.predict_proba(X_test)[:, 1]
threshold = self.optimal_thresholds.get(name, 0.5)
preds = (probs >= threshold).astype(int)
tp = ((preds == 1) & (y_test == 1)).sum()
fp = ((preds == 1) & (y_test == 0)).sum()
fn = ((preds == 0) & (y_test == 1)).sum()
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = (2 * precision * recall / (precision + recall)
if (precision + recall) > 0 else 0)
results.append({
'model': name,
'threshold': threshold,
'log_loss': round(log_loss(y_test, probs), 4),
'brier': round(brier_score_loss(y_test, probs), 4),
'auc_roc': round(roc_auc_score(y_test, probs), 4),
'precision': round(precision, 4),
'recall': round(recall, 4),
'f1': round(f1, 4),
})
return pd.DataFrame(results).sort_values('log_loss')
# --- Worked Example: Predicting Upsets ---
if __name__ == "__main__":
np.random.seed(42)
n = 3000
# Generate imbalanced data: upsets occur ~20% of the time
X = np.random.randn(n, 10)
logit = (-1.5 + 0.4 * X[:, 0] + 0.3 * X[:, 1] - 0.5 * X[:, 2]
+ 0.2 * X[:, 3] + 0.1 * X[:, 4])
prob = 1 / (1 + np.exp(-logit))
y = (np.random.random(n) < prob).astype(int)
print(f"Class distribution: {y.mean():.3f} positive rate "
f"({y.sum()} upsets out of {n} games)\n")
# Split temporally
train_end = 2000
val_end = 2500
X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]
predictor = ImbalancedSportsPredictor()
# Fit all approaches
predictor.fit_baseline(X_train, y_train)
predictor.fit_class_weighted(X_train, y_train)
predictor.fit_smote(X_train, y_train)
# Optimize thresholds on validation set
for name in predictor.models:
predictor.optimize_threshold(name, X_val, y_val, metric='f1')
# Evaluate
print("\n=== Imbalanced Classification Results ===\n")
results = predictor.evaluate_all(X_test, y_test)
print(results.to_string(index=False))
Practical Advice for Bettors
- Use class weights rather than SMOTE for most sports prediction tasks. SMOTE can introduce artifacts that distort probability estimates, while class weights simply reweight the loss function.
- Calibrate after rebalancing. Any rebalancing technique (weights, SMOTE, undersampling) distorts the output probabilities. Always calibrate the model's probabilities on the original (unbalanced) distribution using the methods from Section 27.3.
- Focus on log-loss and Brier score, not F1 or accuracy. For betting, you need accurate probability estimates, not accurate class predictions.
- Consider the base rate. If upsets occur 20% of the time and your model correctly identifies some upsets while maintaining good calibration, you can exploit moneyline value on underdogs.
- Threshold optimization is for classification, not betting. In betting, you never "classify" a game as win or loss --- you estimate the probability and compare to the market price. Threshold optimization is useful for internal model evaluation but should not drive betting decisions.
27.5 Model Interpretability (SHAP, Feature Importance)
Why Interpretability Matters for Betting
A model that outputs "P(home win) = 0.67" is useful. A model that says "P(home win) = 0.67, primarily because the home team has a +5 Elo advantage and is well-rested, partially offset by a poor matchup against the away team's defense" is far more useful. Interpretability serves several critical functions for bettors:
- Sanity checking. Does the model make predictions for sensible reasons? If your model favors a team primarily because of the color of their jerseys (an inadvertently included feature), you want to know.
- Feature discovery. Which features actually drive predictions? This guides future feature engineering and data collection.
- Edge identification. Understanding why your model disagrees with the market helps you determine whether the disagreement is likely to persist (a real edge) or is based on noise.
- Model debugging. When the model makes a surprising prediction, interpretability tools help you understand whether it is a genuine insight or a model error.
- Communication. If you are betting with other people's money, you need to explain your rationale beyond "the model says so."
Built-in Feature Importance
Tree-based models (random forests, XGBoost) provide built-in feature importance metrics:
Gain-based importance: For each feature, sum the improvement in the loss function across all splits using that feature. This measures how much each feature contributes to reducing prediction error.
Split-based importance (weight): Count the number of times each feature is used as a split criterion. Features used in many splits are deemed more important.
Cover-based importance: For each feature, average the number of training examples affected by splits using that feature.
Limitations of built-in importance:
- Biased toward high-cardinality features. Features with many unique values have more potential split points and are used more often, even if they are not more predictive.
- Does not account for feature interactions. A feature might be unimportant on its own but highly important in combination with another feature.
- Inconsistent across methods. Gain, weight, and cover can give very different rankings.
- No directionality. The importance tells you that a feature matters, but not the direction of its effect.
SHAP Values
SHAP (SHapley Additive exPlanations), developed by Scott Lundberg and Su-In Lee in 2017, addresses all of these limitations. SHAP is grounded in cooperative game theory, specifically the concept of Shapley values from 1953.
The Shapley value of a feature is defined as the average marginal contribution of that feature across all possible orderings of features. Formally, the SHAP value for feature $j$ for prediction $i$ is:
$$\phi_j^{(i)} = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!(|F|-|S|-1)!}{|F|!} [f(S \cup \{j\}) - f(S)]$$
where $F$ is the set of all features, $S$ is a subset not including $j$, and $f(S)$ is the model's prediction using only features in $S$.
Intuitively, SHAP asks: "For this specific prediction, how much did each feature push the prediction away from the average?" The sum of all SHAP values equals the difference between the prediction and the average prediction:
$$f(x) = \phi_0 + \sum_{j=1}^{p} \phi_j$$
where $\phi_0$ is the base value (average model output across all training data).
Key properties of SHAP:
- Local accuracy: SHAP values sum to the actual prediction.
- Consistency: If a feature's contribution increases, its SHAP value never decreases.
- Missingness: Features not in the model get zero SHAP value.
- Additivity: For ensemble models, SHAP values of the ensemble equal the sum of SHAP values of the component models.
TreeSHAP
For tree-based models, Lundberg developed TreeSHAP, an exact polynomial-time algorithm for computing SHAP values. This is dramatically faster than the exponential-time naive computation and makes SHAP practical for large models.
TreeSHAP runs in $O(TLD^2)$ time where $T$ is the number of trees, $L$ is the maximum number of leaves, and $D$ is the maximum depth. For typical XGBoost models, this allows computing SHAP values for thousands of predictions in seconds.
Partial Dependence Plots (PDPs)
A partial dependence plot shows the marginal effect of a single feature on the model's predictions, averaged over all other features:
$$\text{PDP}(x_j) = \frac{1}{n}\sum_{i=1}^{n} f(x_j, \mathbf{x}_{-j}^{(i)})$$
where $\mathbf{x}_{-j}^{(i)}$ represents all features except $j$ for the $i$-th training example. The PDP shows how the average prediction changes as feature $j$ varies, marginalizing out the effects of all other features.
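The formula translates into a brute-force computation: hold feature $j$ at each grid value for every row, average the predictions, and sweep the grid. A minimal sketch, assuming a fitted classifier exposing predict_proba and a NumPy feature matrix X:
import numpy as np

def partial_dependence_1d(model, X: np.ndarray, feature_idx: int,
                          grid: np.ndarray) -> np.ndarray:
    """Average predicted probability as one feature is swept over a grid."""
    pd_curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value          # overwrite feature j everywhere
        pd_curve.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(pd_curve)

# Example usage (placeholders):
# grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 25)
# curve = partial_dependence_1d(model, X, feature_idx=0, grid=grid)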
SHAP dependence plots are a more informative alternative. They plot the SHAP value of a feature against the feature's value for each individual prediction. This reveals not just the average effect but also interactions (visible as vertical spread in the plot).
LIME (Local Interpretable Model-agnostic Explanations)
LIME (Ribeiro et al., 2016) explains individual predictions by fitting a simple interpretable model (usually linear regression) in the neighborhood of the prediction:
- Generate perturbed samples around the instance of interest.
- Get the complex model's predictions for these perturbed samples.
- Weight the perturbed samples by their proximity to the original instance.
- Fit a linear model to the weighted perturbed predictions.
- The linear model's coefficients are the feature explanations.
LIME is model-agnostic (works with any model) but less theoretically grounded than SHAP. For tree-based models, SHAP is generally preferred due to TreeSHAP's efficiency and SHAP's stronger theoretical guarantees.
Python Implementation
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from typing import Optional, List
try:
import shap
HAS_SHAP = True
except ImportError:
HAS_SHAP = False
try:
import lime
import lime.lime_tabular
HAS_LIME = True
except ImportError:
HAS_LIME = False
class ModelInterpreter:
"""
Comprehensive model interpretability toolkit for sports prediction.
Provides SHAP values, feature importance, partial dependence,
and LIME explanations.
"""
def __init__(self, model, X_train: np.ndarray,
feature_names: List[str]):
"""
Args:
model: Fitted model (must support predict_proba)
X_train: Training data (for background distribution)
feature_names: List of feature names
"""
self.model = model
self.X_train = X_train
self.feature_names = feature_names
self.shap_values = None
self.explainer = None
def compute_shap_values(self, X: np.ndarray,
check_additivity: bool = True) -> np.ndarray:
"""
Compute SHAP values for a set of predictions.
Args:
X: Feature matrix to explain
check_additivity: Verify SHAP values sum to predictions
Returns:
SHAP values array of shape (n_samples, n_features)
"""
if not HAS_SHAP:
raise ImportError("shap library required")
self.explainer = shap.TreeExplainer(self.model)
self.shap_values = self.explainer.shap_values(X)
# For binary classification, shap_values may be a list of 2 arrays
if isinstance(self.shap_values, list):
self.shap_values = self.shap_values[1] # Class 1 (positive)
if check_additivity:
# Verify: base_value + sum(shap_values) ≈ model output
base = self.explainer.expected_value
if isinstance(base, np.ndarray):
base = base[1]
raw_preds = self.model.predict_proba(X)[:, 1]
# Convert from log-odds to probability for comparison
shap_sum = base + self.shap_values.sum(axis=1)
# For tree models, SHAP works in log-odds or margin space
# so direct comparison with probabilities requires sigmoid
from scipy.special import expit
reconstructed = expit(shap_sum)
max_error = np.max(np.abs(reconstructed - raw_preds))
print(f"SHAP additivity check: max error = {max_error:.6f}")
return self.shap_values
def shap_summary_plot(self, X: np.ndarray,
save_path: Optional[str] = None) -> None:
"""
Create SHAP summary (beeswarm) plot.
Shows the distribution of SHAP values for each feature,
revealing both importance and directionality.
"""
if self.shap_values is None:
self.compute_shap_values(X)
plt.figure(figsize=(10, 8))
shap.summary_plot(
self.shap_values,
X,
feature_names=self.feature_names,
show=False
)
if save_path:
plt.savefig(save_path, dpi=150, bbox_inches='tight')
plt.close()
def shap_bar_plot(self, X: np.ndarray,
save_path: Optional[str] = None) -> None:
"""
Create SHAP bar plot showing mean absolute SHAP values.
"""
if self.shap_values is None:
self.compute_shap_values(X)
mean_abs_shap = np.abs(self.shap_values).mean(axis=0)
importance = pd.Series(
mean_abs_shap, index=self.feature_names
).sort_values(ascending=True)
plt.figure(figsize=(10, 6))
importance.plot(kind='barh', color='steelblue')
plt.xlabel('Mean |SHAP value|')
plt.title('Feature Importance (SHAP)')
plt.tight_layout()
if save_path:
plt.savefig(save_path, dpi=150, bbox_inches='tight')
plt.close()
return importance
def explain_prediction(self, x: np.ndarray,
game_label: str = "Game") -> pd.DataFrame:
"""
Explain a single prediction using SHAP values.
Args:
x: Feature vector for one game (1D or 2D with shape (1, n))
game_label: Label for the game
Returns:
DataFrame with features, values, and SHAP contributions
"""
if self.explainer is None:
self.explainer = shap.TreeExplainer(self.model)
x_2d = x.reshape(1, -1) if x.ndim == 1 else x
sv = self.explainer.shap_values(x_2d)
if isinstance(sv, list):
sv = sv[1]
base = self.explainer.expected_value
if isinstance(base, np.ndarray):
base = base[1]
prediction = self.model.predict_proba(x_2d)[0, 1]
explanation = pd.DataFrame({
'feature': self.feature_names,
'value': x_2d[0],
'shap_value': sv[0],
'abs_shap': np.abs(sv[0]),
}).sort_values('abs_shap', ascending=False)
print(f"\n=== Explanation: {game_label} ===")
print(f"Prediction: {prediction:.4f}")
print(f"Base value: {base:.4f}")
print(f"Sum of SHAP values: {sv[0].sum():.4f}")
print(f"\nTop contributing features:")
for _, row in explanation.head(10).iterrows():
direction = "+" if row['shap_value'] > 0 else "-"
print(f" {row['feature']:20s} = {row['value']:+8.3f} "
f"SHAP: {row['shap_value']:+.4f} ({direction})")
return explanation
def shap_dependence_plot(self, feature_name: str, X: np.ndarray,
interaction_feature: Optional[str] = None,
save_path: Optional[str] = None) -> None:
"""
Create SHAP dependence plot for a single feature.
Shows how the SHAP value of a feature varies with the feature value,
with optional coloring by an interaction feature.
"""
if self.shap_values is None:
self.compute_shap_values(X)
feat_idx = self.feature_names.index(feature_name)
interact_idx = (self.feature_names.index(interaction_feature)
if interaction_feature else None)
plt.figure(figsize=(10, 6))
shap.dependence_plot(
feat_idx,
self.shap_values,
X,
feature_names=self.feature_names,
interaction_index=interact_idx,
show=False
)
if save_path:
plt.savefig(save_path, dpi=150, bbox_inches='tight')
plt.close()
def lime_explanation(self, x: np.ndarray,
game_label: str = "Game") -> Optional[object]:
"""
Generate LIME explanation for a single prediction.
Args:
x: Feature vector for one game
game_label: Label for the game
Returns:
LIME explanation object
"""
if not HAS_LIME:
print("LIME library not installed; skipping")
return None
explainer = lime.lime_tabular.LimeTabularExplainer(
self.X_train,
feature_names=self.feature_names,
class_names=['Away Win', 'Home Win'],
mode='classification',
)
x_1d = x.flatten() if x.ndim > 1 else x
exp = explainer.explain_instance(
x_1d,
self.model.predict_proba,
num_features=len(self.feature_names),
)
print(f"\n=== LIME Explanation: {game_label} ===")
print(f"Prediction: {self.model.predict_proba(x.reshape(1, -1))[0]}")
print(f"\nFeature contributions:")
for feat, weight in exp.as_list():
print(f" {feat}: {weight:+.4f}")
return exp
def compare_importances(self, X: np.ndarray) -> pd.DataFrame:
"""
Compare feature importance from multiple methods.
Returns DataFrame with importance from: gain, weight, cover, SHAP.
"""
# Built-in importances
booster = self.model.get_booster()
importance_types = {}
for imp_type in ['gain', 'weight', 'cover']:
raw_imp = booster.get_score(importance_type=imp_type)
imp_dict = {}
for key, val in raw_imp.items():
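                # Booster keys default to 'f0', 'f1', ... when the model was
                # trained on a bare NumPy array; map them to readable names.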
idx = int(key.replace('f', ''))
if idx < len(self.feature_names):
imp_dict[self.feature_names[idx]] = val
importance_types[imp_type] = imp_dict
# SHAP importance
if self.shap_values is None:
self.compute_shap_values(X)
mean_abs_shap = np.abs(self.shap_values).mean(axis=0)
shap_imp = {self.feature_names[i]: mean_abs_shap[i]
for i in range(len(self.feature_names))}
# Combine into DataFrame
df = pd.DataFrame({
'gain': pd.Series(importance_types.get('gain', {})),
'weight': pd.Series(importance_types.get('weight', {})),
'cover': pd.Series(importance_types.get('cover', {})),
'shap': pd.Series(shap_imp),
})
# Normalize each column to 0-1
for col in df.columns:
col_max = df[col].max()
if col_max > 0:
df[col] = df[col] / col_max
return df.sort_values('shap', ascending=False)
# --- Worked Example ---
if __name__ == "__main__":
np.random.seed(42)
n = 2000
feature_names = [
'elo_diff', 'off_rating_diff', 'def_rating_diff',
'rest_days_diff', 'win_pct_diff', 'spread',
'pace_diff', 'turnover_diff', 'rebound_diff', 'home_record'
]
X = np.random.randn(n, len(feature_names))
# Generate outcomes with known relationships
logit = (0.003 * X[:, 0] * 100 # elo_diff (rescaled)
+ 0.15 * X[:, 1] # off_rating_diff
- 0.12 * X[:, 2] # def_rating_diff
+ 0.05 * X[:, 3] # rest_days_diff
+ 0.5 * X[:, 4] # win_pct_diff
+ 0.04 * X[:, 5] # spread
+ 0.02 * X[:, 6] # pace_diff
+ 0.2 * X[:, 0] * X[:, 3] # interaction: elo * rest
+ 0.15) # home advantage
prob = 1 / (1 + np.exp(-logit))
y = (np.random.random(n) < prob).astype(int)
# Split
X_train, X_test = X[:1600], X[1600:]
y_train, y_test = y[:1600], y[1600:]
# Train XGBoost
model = xgb.XGBClassifier(
n_estimators=300, max_depth=4, learning_rate=0.05,
subsample=0.8, colsample_bytree=0.8,
use_label_encoder=False, eval_metric='logloss',
random_state=42,
)
model.fit(X_train, y_train)
# Interpret
interpreter = ModelInterpreter(model, X_train, feature_names)
# Global importance comparison
print("=== Feature Importance Comparison ===\n")
importance_df = interpreter.compare_importances(X_test)
print(importance_df.round(3).to_string())
# Explain a specific prediction
game_idx = 0
interpreter.explain_prediction(
X_test[game_idx],
game_label="Test Game 1"
)
# SHAP dependence for top feature
interpreter.shap_dependence_plot(
'elo_diff', X_test,
interaction_feature='rest_days_diff',
save_path='elo_diff_dependence.png'
)
# Compare built-in vs SHAP importance
print("\n=== Key Insight ===")
print("Notice how SHAP importance can differ from gain-based importance.")
print("SHAP accounts for feature interactions and provides directional")
print("information, making it more reliable for understanding model")
print("behavior in sports prediction contexts.")
Interpreting SHAP Values for Betting Decisions
When you obtain SHAP explanations for a predicted game, you can use them to make more informed betting decisions:
Example interpretation:
Suppose your model predicts P(home win) = 0.67 for a game where the market implies 0.60. The SHAP breakdown shows:
| Feature | Value | SHAP Contribution |
|---|---|---|
| elo_diff | +120 | +0.08 |
| rest_days_diff | +2 | +0.05 |
| off_rating_diff | +3.2 | +0.04 |
| def_rating_diff | -1.1 | -0.02 |
| spread | -3.5 | +0.02 |
By SHAP's additivity, these contributions (together with the base value and any smaller terms not shown) sum to the model's 0.67. From this breakdown, you can see that the model's edge over the market comes primarily from the Elo difference and the rest advantage. You can then ask:
- Is the Elo difference real? Was it earned against quality opponents or inflated by easy schedule?
- Is the rest advantage priced in? Most sophisticated bettors know about rest advantages; this might already be in the line.
- Are there factors the model is missing? Injury news, weather, motivation --- things not captured in your features.
This kind of analysis transforms model-based betting from black-box gambling into an informed analytical process.
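The short sketch below shows one way to mechanize that comparison. It reuses the `ModelInterpreter` class defined earlier in this section; `edge_report` and `market_implied_prob` are hypothetical helpers, and the odds conversion ignores the bookmaker's vig.
```python
import numpy as np

def market_implied_prob(decimal_odds: float) -> float:
    """Convert decimal odds to the market's implied probability (vig not removed)."""
    return 1.0 / decimal_odds

def edge_report(interpreter, x: np.ndarray, decimal_odds: float,
                game_label: str = "Game", top_k: int = 5) -> float:
    """
    Compare the model's probability to the market's implied probability and list
    the features pushing the model hardest, so each can be sanity-checked against
    injury news, schedule strength, and other information outside the feature set.
    """
    explanation = interpreter.explain_prediction(x, game_label=game_label)
    model_prob = interpreter.model.predict_proba(x.reshape(1, -1))[0, 1]
    market_prob = market_implied_prob(decimal_odds)
    edge = model_prob - market_prob

    print(f"\nModel: {model_prob:.3f}  Market: {market_prob:.3f}  Edge: {edge:+.3f}")
    print("Largest SHAP contributions (verify each before betting):")
    for _, row in explanation.head(top_k).iterrows():
        print(f"  {row['feature']:20s} SHAP {row['shap_value']:+.4f}")
    return edge

# Hypothetical usage: home side quoted at decimal odds of 1.67 (implied ~0.60)
# edge_report(interpreter, X_test[0], decimal_odds=1.67, game_label="Test Game 1")
```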
SHAP vs. LIME vs. Built-in Importance: A Comparison
| Property | Built-in Importance | SHAP | LIME |
|---|---|---|---|
| Theoretical foundation | Ad hoc | Game theory (Shapley) | Local approximation |
| Consistency | No | Yes | No |
| Local explanations | No | Yes | Yes |
| Global explanations | Yes | Yes | Approximate |
| Model-agnostic | No (tree-specific) | Partial (TreeSHAP for trees) | Yes |
| Computational cost | Very low | Low (TreeSHAP) | Moderate |
| Interaction detection | No | Yes (interaction values) | Limited |
| Directionality | No | Yes | Yes |
| Recommended for betting | Baseline only | Primary method | Secondary / verification |
27.6 Chapter Summary
This chapter has covered the advanced machine learning techniques that form the backbone of modern quantitative sports betting systems. Each section addresses a specific challenge that separates amateur modelers from professionals.
XGBoost provides the raw predictive power. Gradient-boosted trees with second-order optimization, regularization, and column subsampling can capture the complex nonlinear relationships and feature interactions present in sports data. The key to success lies in disciplined hyperparameter tuning using time-series cross-validation (never train on future data) and thoughtful feature engineering that transforms raw statistics into predictive signals.
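As a quick reference, here is a minimal sketch of that temporal validation loop, assuming `X` and `y` are NumPy arrays already sorted by game date; `tune_with_temporal_cv` and the small grid are illustrative, not a replacement for the fuller tuning discussion in Section 27.1.
```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import log_loss

def tune_with_temporal_cv(X: np.ndarray, y: np.ndarray, param_grid: list,
                          n_splits: int = 5) -> list:
    """Score each setting with forward-chaining splits: validate only on later games."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    results = []
    for params in param_grid:
        fold_losses = []
        for train_idx, val_idx in tscv.split(X):
            model = xgb.XGBClassifier(eval_metric='logloss', random_state=42, **params)
            model.fit(X[train_idx], y[train_idx])
            fold_losses.append(log_loss(y[val_idx],
                                        model.predict_proba(X[val_idx])[:, 1]))
        results.append((params, float(np.mean(fold_losses))))
    return sorted(results, key=lambda r: r[1])  # best (lowest log-loss) first

# Conservative, heavily regularized settings suit noisy sports data
grid = [
    {'n_estimators': 300, 'max_depth': 3, 'learning_rate': 0.05, 'subsample': 0.8},
    {'n_estimators': 300, 'max_depth': 4, 'learning_rate': 0.05, 'subsample': 0.8},
    {'n_estimators': 500, 'max_depth': 3, 'learning_rate': 0.03, 'subsample': 0.8},
]
# best_params, best_loss = tune_with_temporal_cv(X, y, grid)[0]
```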
Random Forests and Ensemble Methods offer robustness and diversity. Random forests are harder to overfit than boosted trees and serve as an excellent complement in a stacking ensemble. The mathematical insight behind bagging --- that averaging reduces variance proportionally to the correlation between base learners --- explains why diversity matters more than individual model accuracy. Stacking, with its meta-learner trained on out-of-fold predictions, provides the principled framework for combining diverse models.
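A compact sketch of such a stack, using scikit-learn's `StackingClassifier` with hypothetical hyperparameters, is shown below. Note that its internal cross-validation is stratified rather than chronological, so for a live betting model you would generate the out-of-fold predictions with a time-aware split instead.
```python
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Diverse base learners from different model families, plus a simple meta-learner
base_learners = [
    ('logit', LogisticRegression(max_iter=1000)),
    ('forest', RandomForestClassifier(n_estimators=400, min_samples_leaf=20,
                                      random_state=42)),
    ('xgb', xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                              subsample=0.8, eval_metric='logloss',
                              random_state=42)),
]

stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),  # meta-learner fit on out-of-fold probabilities
    stack_method='predict_proba',
    cv=5,
)
# stack.fit(X_train, y_train)
# p_home = stack.predict_proba(X_test)[:, 1]
```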
Probability Calibration is the bridge between raw model outputs and actionable betting decisions. An overconfident model sees value everywhere and bleeds money; an underconfident model misses legitimate opportunities. Platt scaling and isotonic regression provide the mathematical tools to ensure that a predicted probability of 0.65 truly corresponds to a 65% win rate. The Expected Calibration Error (ECE) quantifies this property, and regular monitoring of calibration curves ensures your model stays honest.
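The sketch below shows the basic workflow, assuming `base_model` is an unfitted classifier and `X_train`, `y_train`, `X_test`, `y_test` are available; the ECE helper is an illustrative implementation of the bin-weighted definition, not the full diagnostic suite from Section 27.3.
```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

def expected_calibration_error(y_true: np.ndarray, y_prob: np.ndarray,
                               n_bins: int = 10) -> float:
    """Bin-weighted average gap between predicted probability and observed win rate."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bin_idx = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# method='sigmoid' is Platt scaling; 'isotonic' is preferable with a few thousand games
# calibrated = CalibratedClassifierCV(base_model, method='isotonic', cv=5)
# calibrated.fit(X_train, y_train)
# p = calibrated.predict_proba(X_test)[:, 1]
# print(f"ECE: {expected_calibration_error(y_test, p):.4f}  "
#       f"Brier: {brier_score_loss(y_test, p):.4f}")
```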
Handling Imbalanced Outcomes enables profitable betting on rare events --- upsets, blowouts, and extreme player performances --- where the market often misprices risk. Class weights, SMOTE, and threshold optimization each address imbalance from a different angle, but the overriding principle for bettors is clear: focus on well-calibrated probability estimates (log-loss, Brier score) rather than classification accuracy.
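For the class-weighting route, a minimal sketch follows, assuming `y_train` marks the rare outcome (e.g., an upset) as 1; `fit_rare_event_model` and `evaluate_rare_event_model` are hypothetical helpers built around XGBoost's `scale_pos_weight`.
```python
import xgboost as xgb
from sklearn.metrics import average_precision_score, brier_score_loss

def fit_rare_event_model(X_train, y_train, **xgb_params):
    """Class-weighted XGBoost for a rare positive outcome such as an upset."""
    pos_rate = y_train.mean()
    model = xgb.XGBClassifier(
        n_estimators=400, max_depth=3, learning_rate=0.05,
        scale_pos_weight=(1.0 - pos_rate) / pos_rate,  # up-weight the rare class
        eval_metric='logloss', random_state=42,
        **xgb_params,
    )
    model.fit(X_train, y_train)
    return model

def evaluate_rare_event_model(model, X_test, y_test) -> dict:
    """Judge by PR-AUC plus a proper scoring rule; recalibrate before sizing bets."""
    p = model.predict_proba(X_test)[:, 1]
    return {'pr_auc': average_precision_score(y_test, p),
            'brier': brier_score_loss(y_test, p)}
```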
Model Interpretability through SHAP values transforms sports betting from a black-box exercise into an analytical discipline. SHAP's theoretical grounding in Shapley values ensures consistent, additive explanations that reveal exactly why your model makes each prediction. This enables sanity checking, edge identification, and the kind of deep understanding that distinguishes sustainable edges from data-mining artifacts.
Integration with Previous Chapters
The techniques in this chapter build directly on the rating systems from Chapter 26. Rating system outputs (Elo ratings, Glicko-2 ratings and RDs, Massey scores, PageRank values) serve as powerful features for the XGBoost and random forest models. The ensemble rating methods from Section 26.5 can be viewed as a special case of the model stacking framework developed in Section 27.2. And the probability calibration techniques from Section 27.3 should be applied to any rating-based probability estimate, including the raw Elo expected-score formula.
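As a small illustration of that last point, the sketch below recalibrates the raw Elo expected score with a Platt-style logistic regression on its log-odds; the variable names (`diff_train`, `y_train`, and so on) are assumptions for the example, not fixed objects from Chapter 26.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def elo_expected_score(rating_diff) -> np.ndarray:
    """Raw Elo expected score for the home side (home-minus-away rating difference)."""
    return 1.0 / (1.0 + 10.0 ** (-np.asarray(rating_diff, dtype=float) / 400.0))

def elo_log_odds(rating_diff) -> np.ndarray:
    p = elo_expected_score(rating_diff)
    return np.log(p / (1.0 - p)).reshape(-1, 1)

def fit_elo_calibrator(rating_diff_train, home_win_train) -> LogisticRegression:
    """Platt-style recalibration: one-feature logistic regression on the Elo log-odds."""
    return LogisticRegression().fit(elo_log_odds(rating_diff_train), home_win_train)

# calibrator = fit_elo_calibrator(diff_train, y_train)
# calibrated_p = calibrator.predict_proba(elo_log_odds(diff_test))[:, 1]
```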
Decision Framework for Model Selection
| Scenario | Recommended Approach |
|---|---|
| Small dataset (< 500 games) | Logistic regression + Elo features + Platt calibration |
| Medium dataset (500--5000 games) | XGBoost with regularization + stacking + isotonic calibration |
| Large dataset (5000+ games) | Full stacking ensemble + SHAP analysis + isotonic calibration |
| Predicting rare outcomes | Class-weighted XGBoost + calibration + PR-AUC evaluation |
| Need for explanation | SHAP with TreeSHAP for any tree-based model |
| Unbalanced schedule (college) | Include PageRank features + strength-of-schedule features |
Key Takeaways
- XGBoost is the workhorse of modern sports prediction, but it must be properly regularized and validated using temporal splits.
- Diverse ensembles beat individual models --- combine different model families, not just different hyperparameters.
- Calibration is not optional. Uncalibrated probabilities lead to incorrect bet sizing and phantom value.
- Class imbalance requires special handling for rare-outcome markets, but always recalibrate after rebalancing.
- SHAP values are the gold standard for understanding model predictions and identifying genuine versus spurious edges.
Exercises
- XGBoost Feature Engineering. Using NBA game data, create at least 20 features (including rating system outputs from Chapter 26). Train an XGBoost model with time-series cross-validation and report log-loss, Brier score, and AUC-ROC.
- Stacking Depth. Compare a 2-model stack (logistic regression + XGBoost) to a 4-model stack (logistic regression + random forest + XGBoost + neural network). Does adding more models help? At what point does the meta-learner start to overfit?
- Calibration Impact on Profit. Train a model with and without probability calibration. Simulate flat betting on all games where your model disagrees with the market by more than 5 percentage points. Compare cumulative profit/loss for the calibrated vs. uncalibrated model.
- Upset Detection. Build a model specifically to predict NBA upsets (games where the team with the lower Elo rating wins). Use class-weighted XGBoost and evaluate with PR-AUC. What features are most predictive of upsets?
- SHAP Audit. Train an XGBoost model on NFL data. Use SHAP to identify the top 5 most important features. Then retrain the model using only those 5 features. How much performance is lost? This tells you how much of the model's knowledge is concentrated in a few key variables.