Chapter 27: Advanced Regression and Classification

The rating systems developed in Chapter 26 produce a single number for each team --- a strength estimate. But team performance is far more complex than a scalar can capture. A team might have an elite offense but a porous defense, play exceptionally well at home but collapse on the road, or perform differently depending on rest days, travel distance, or weather conditions. To capture this multidimensional reality, we need models that can ingest dozens or even hundreds of features and learn the nonlinear relationships that determine outcomes.

This chapter introduces the most powerful supervised learning methods available to the modern sports bettor: gradient-boosted trees (XGBoost), random forests, and ensemble stacking. But raw predictive power is only half the battle. A model that outputs "0.62" for a win probability is useless if that number does not actually correspond to a 62% win rate. We therefore devote substantial attention to probability calibration --- the art and science of ensuring your model's probabilities are trustworthy. We then address the challenge of imbalanced outcomes (predicting rare events like upsets or blowouts) and conclude with model interpretability, using SHAP values and related tools to understand not just what your model predicts, but why.


27.1 XGBoost for Sports Prediction

Gradient Boosting: The Intuition

Gradient boosting is one of the most successful machine learning algorithms in applied prediction. It builds an ensemble of weak learners (typically decision trees) sequentially, where each new tree corrects the errors of the ensemble so far.

The process works as follows:

  1. Start with a simple initial prediction (e.g., the average outcome).
  2. Compute the residuals --- the difference between the actual outcomes and the current predictions.
  3. Fit a decision tree to these residuals.
  4. Add this tree to the ensemble (scaled by a learning rate).
  5. Repeat steps 2--4 for a specified number of iterations.
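
These five steps translate almost directly into code. Below is a minimal, illustrative sketch for squared-error regression on synthetic data (a hand-rolled loop, not XGBoost itself; the library adds second-order gradients, regularization, and many engineering optimizations on top of this recursion):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

eta = 0.1                                   # learning rate
prediction = np.full(len(y), y.mean())      # step 1: initial prediction
trees = []

for _ in range(100):                        # step 5: repeat
    residuals = y - prediction              # step 2: residuals
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)                  # step 3: fit a tree to the residuals
    prediction += eta * tree.predict(X)     # step 4: add the scaled tree
    trees.append(tree)

print(f"Training MSE after boosting: {np.mean((y - prediction) ** 2):.4f}")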

Formally, the ensemble after $t$ trees is:

$$\hat{y}^{(t)}(x) = \hat{y}^{(t-1)}(x) + \eta \cdot f_t(x)$$

where $\eta$ is the learning rate and $f_t$ is the $t$-th tree. XGBoost (eXtreme Gradient Boosting), developed by Tianqi Chen and described in Chen and Guestrin's 2016 paper, extends basic gradient boosting with several innovations:

  • Second-order gradients: XGBoost uses both the first derivative (gradient) and the second derivative (Hessian) of the loss function when fitting each tree. This provides more information for split selection and leads to better convergence.

  • Regularization: XGBoost adds L1 and L2 regularization terms to the objective function:

$$\mathcal{L}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

Here $T$ is the number of leaves in the tree (distinct from the number of trees $K$ above), $w_j$ is the weight of leaf $j$, $\gamma$ penalizes each additional leaf (more leaves = higher penalty), and $\lambda$ is the L2 regularization strength on leaf weights. An optional L1 term $\alpha \sum_j |w_j|$ (reg_alpha) can be added as well.

  • Column subsampling: Like random forests, XGBoost can randomly select a subset of features at each split, reducing correlation between trees.
  • Handling missing values: XGBoost learns the optimal direction (left or right) for missing values during training.

Hyperparameter Tuning for Sports Data

XGBoost has many hyperparameters, but a handful dominate performance. Understanding what each controls and how to tune it is essential.

max_depth (tree depth). Controls the maximum depth of each individual tree. Deeper trees capture more complex interactions but are more prone to overfitting.

| max_depth | Behavior | Sports Recommendation |
|---|---|---|
| 2--3 | Shallow trees, captures main effects | Good starting point for small datasets |
| 4--6 | Moderate complexity, captures interactions | Most sports prediction tasks |
| 7--10 | Deep trees, many interactions | Only with large datasets (10,000+ games) |
| 10+ | Very deep, high risk of overfitting | Rarely appropriate for sports |

learning_rate (eta). Controls how much each tree contributes to the ensemble. Lower values require more trees but often yield better generalization.

$$\text{learning\_rate} \in [0.001, 0.3]$$

A common strategy: set learning_rate to 0.01--0.05 and use early stopping to determine the number of trees.

n_estimators (number of trees). The total number of boosting rounds. With early stopping, this serves as an upper bound; training halts when validation performance stops improving.

subsample. Fraction of training samples used to build each tree (0.5--1.0). Values below 1.0 introduce randomness that reduces overfitting.

colsample_bytree. Fraction of features considered for each tree (0.5--1.0). Similar to random forests' feature subsampling.

min_child_weight. Minimum sum of instance weights needed in a child node. Higher values prevent the model from learning overly specific patterns. For classification with balanced data, values of 1--10 work well.

gamma. Minimum loss reduction required for a split. Acts as a pruning threshold. Setting $\gamma > 0$ creates more conservative trees.

reg_alpha (L1) and reg_lambda (L2). Regularization terms. L1 promotes sparsity (some features get zero weight), while L2 smooths weights.
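
Pulling these ranges together, the configuration below is one reasonable starting point for a mid-sized sports dataset. Treat the specific numbers as illustrative assumptions to be tuned, not recommendations:

# Illustrative starting values drawn from the ranges discussed above.
# Tune per sport and dataset size; rely on early stopping for n_estimators.
starting_params = {
    'max_depth': 4,
    'learning_rate': 0.03,
    'n_estimators': 2000,       # upper bound; early stopping picks the real count
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 5,
    'gamma': 0.1,
    'reg_alpha': 0.0,
    'reg_lambda': 3.0,
}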

Feature Engineering for Sports Prediction

The quality of features determines the ceiling of any model. For sports prediction, features typically fall into several categories:

Team-level statistics (rolling):
  • Offensive/defensive efficiency (points per 100 possessions, yards per play)
  • Win/loss record over recent games (last 5, 10, 20)
  • Home/away splits
  • Rating system outputs (Elo, Glicko-2, Massey, PageRank from Chapter 26)

Matchup-specific features:
  • Difference in key statistics (offensive rating A minus defensive rating B)
  • Historical head-to-head record
  • Style matchup indicators (e.g., run-heavy offense vs. weak run defense)

Contextual features:
  • Days of rest for each team
  • Travel distance
  • Back-to-back game indicator
  • Altitude (e.g., Denver's mile-high advantage)
  • Time zone changes

Market-derived features:
  • Opening line
  • Line movement (closing minus opening spread)
  • Consensus percentage of bets on each side
  • Over/under as a proxy for game pace

Python Implementation

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.metrics import log_loss, brier_score_loss, roc_auc_score
from typing import Dict, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

class XGBoostSportsPredictor:
    """
    XGBoost-based sports prediction model with proper temporal validation.
    """

    def __init__(self, objective: str = 'binary:logistic',
                 random_state: int = 42):
        self.objective = objective
        self.random_state = random_state
        self.model: Optional[xgb.XGBClassifier] = None
        self.best_params: Optional[Dict] = None
        self.feature_names: Optional[list] = None

    def create_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Create features from raw game data.

        Expected columns: home_team, away_team, home_score, away_score,
        date, plus any team statistics columns.

        Returns DataFrame with engineered features.
        """
        features = pd.DataFrame(index=df.index)

        # Example feature engineering (adapt to your data)
        if 'home_elo' in df.columns and 'away_elo' in df.columns:
            features['elo_diff'] = df['home_elo'] - df['away_elo']
            features['elo_sum'] = df['home_elo'] + df['away_elo']

        if 'home_off_rating' in df.columns:
            features['off_diff'] = (df['home_off_rating']
                                     - df['away_off_rating'])
            features['def_diff'] = (df['home_def_rating']
                                     - df['away_def_rating'])
            # Matchup: home offense vs away defense
            features['matchup_home'] = (df['home_off_rating']
                                         - df['away_def_rating'])
            features['matchup_away'] = (df['away_off_rating']
                                         - df['home_def_rating'])

        if 'home_rest_days' in df.columns:
            features['rest_diff'] = (df['home_rest_days']
                                      - df['away_rest_days'])
            features['both_rested'] = (
                (df['home_rest_days'] >= 3) & (df['away_rest_days'] >= 3)
            ).astype(int)

        if 'home_win_pct' in df.columns:
            features['win_pct_diff'] = (df['home_win_pct']
                                         - df['away_win_pct'])

        if 'spread' in df.columns:
            features['spread'] = df['spread']

        return features

    def tune_hyperparameters(self, X: np.ndarray, y: np.ndarray,
                             n_splits: int = 5) -> Dict:
        """
        Tune hyperparameters using time-series cross-validation.

        Uses TimeSeriesSplit to respect temporal ordering ---
        never train on future data.
        """
        param_grid = {
            'max_depth': [3, 4, 5, 6],
            'learning_rate': [0.01, 0.05, 0.1],
            'n_estimators': [100, 200, 500],
            'subsample': [0.7, 0.8, 0.9],
            'colsample_bytree': [0.7, 0.8, 1.0],
            'min_child_weight': [1, 3, 5],
            'gamma': [0, 0.1, 0.3],
            'reg_lambda': [1, 3, 5],
        }

        # For practical purposes, use a smaller search space
        # or switch to RandomizedSearchCV / Bayesian optimization
        reduced_grid = {
            'max_depth': [3, 5],
            'learning_rate': [0.01, 0.05],
            'n_estimators': [200, 500],
            'subsample': [0.8],
            'colsample_bytree': [0.8],
            'min_child_weight': [1, 5],
            'reg_lambda': [1, 5],
        }

        tscv = TimeSeriesSplit(n_splits=n_splits)

        base_model = xgb.XGBClassifier(
            objective=self.objective,
            random_state=self.random_state,
            use_label_encoder=False,
            eval_metric='logloss',
        )

        grid_search = GridSearchCV(
            base_model, reduced_grid,
            cv=tscv,
            scoring='neg_log_loss',
            n_jobs=-1,
            verbose=0
        )

        grid_search.fit(X, y)

        self.best_params = grid_search.best_params_
        print(f"Best parameters: {self.best_params}")
        print(f"Best CV log-loss: {-grid_search.best_score_:.4f}")

        return self.best_params

    def fit(self, X_train: np.ndarray, y_train: np.ndarray,
            X_val: Optional[np.ndarray] = None,
            y_val: Optional[np.ndarray] = None,
            feature_names: Optional[list] = None,
            params: Optional[Dict] = None) -> None:
        """
        Train the XGBoost model with optional early stopping.
        """
        if params is None:
            params = self.best_params or {}

        self.feature_names = feature_names

        # Early stopping requires a validation set. Note that in
        # xgboost >= 1.6 early_stopping_rounds is a constructor argument;
        # older versions expect it as a fit() keyword instead.
        use_early_stopping = X_val is not None and y_val is not None

        self.model = xgb.XGBClassifier(
            objective=self.objective,
            random_state=self.random_state,
            use_label_encoder=False,
            eval_metric='logloss',
            n_estimators=params.get('n_estimators', 500),
            max_depth=params.get('max_depth', 4),
            learning_rate=params.get('learning_rate', 0.05),
            subsample=params.get('subsample', 0.8),
            colsample_bytree=params.get('colsample_bytree', 0.8),
            min_child_weight=params.get('min_child_weight', 3),
            gamma=params.get('gamma', 0.1),
            reg_lambda=params.get('reg_lambda', 3),
            early_stopping_rounds=50 if use_early_stopping else None,
        )

        if use_early_stopping:
            self.model.fit(
                X_train, y_train,
                eval_set=[(X_val, y_val)],
                verbose=False
            )
            print(f"Best iteration: {self.model.best_iteration}")
        else:
            self.model.fit(X_train, y_train)

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Return probability predictions for class 1 (home win)."""
        if self.model is None:
            raise ValueError("Model not fitted")
        return self.model.predict_proba(X)[:, 1]

    def evaluate(self, X_test: np.ndarray,
                 y_test: np.ndarray) -> Dict[str, float]:
        """Evaluate model on test set."""
        probs = self.predict_proba(X_test)

        return {
            'log_loss': log_loss(y_test, probs),
            'brier_score': brier_score_loss(y_test, probs),
            'auc_roc': roc_auc_score(y_test, probs),
            'accuracy': ((probs > 0.5) == y_test).mean(),
        }

    def feature_importance(self, importance_type: str = 'gain') -> pd.Series:
        """
        Get feature importance scores.

        Args:
            importance_type: 'gain' (avg gain from splits),
                           'weight' (number of times used),
                           'cover' (avg number of samples affected)
        """
        if self.model is None:
            raise ValueError("Model not fitted")

        importances = self.model.get_booster().get_score(
            importance_type=importance_type
        )

        if self.feature_names:
            # Map feature indices to names
            named_importances = {}
            for key, val in importances.items():
                idx = int(key.replace('f', ''))
                if idx < len(self.feature_names):
                    named_importances[self.feature_names[idx]] = val
                else:
                    named_importances[key] = val
            importances = named_importances

        return pd.Series(importances).sort_values(ascending=False)


# --- Worked Example ---
if __name__ == "__main__":
    np.random.seed(42)
    n_games = 2000

    # Generate synthetic sports data
    data = {
        'elo_diff': np.random.normal(0, 100, n_games),
        'off_diff': np.random.normal(0, 3, n_games),
        'def_diff': np.random.normal(0, 3, n_games),
        'rest_diff': np.random.randint(-3, 4, n_games),
        'win_pct_diff': np.random.normal(0, 0.2, n_games),
        'spread': np.random.normal(0, 7, n_games),
    }

    X = pd.DataFrame(data)

    # Generate outcomes with known relationship
    logit = (0.003 * X['elo_diff'] + 0.15 * X['off_diff']
             - 0.12 * X['def_diff'] + 0.05 * X['rest_diff']
             + 0.5 * X['win_pct_diff'] + 0.04 * X['spread']
             + 0.15)  # Home advantage
    prob = 1 / (1 + np.exp(-logit))
    y = (np.random.random(n_games) < prob).astype(int)

    # Temporal split
    train_end = 1400
    val_end = 1700

    X_train = X.iloc[:train_end].values
    y_train = y[:train_end]
    X_val = X.iloc[train_end:val_end].values
    y_val = y[train_end:val_end]
    X_test = X.iloc[val_end:].values
    y_test = y[val_end:]

    # Train model
    predictor = XGBoostSportsPredictor()
    predictor.fit(X_train, y_train, X_val, y_val,
                  feature_names=list(X.columns),
                  params={
                      'max_depth': 4,
                      'learning_rate': 0.05,
                      'n_estimators': 500,
                      'subsample': 0.8,
                      'colsample_bytree': 0.8,
                      'min_child_weight': 3,
                      'reg_lambda': 3,
                  })

    # Evaluate
    metrics = predictor.evaluate(X_test, y_test)
    print("\n=== XGBoost Test Set Performance ===")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")

    # Feature importance
    print("\nFeature Importance (by gain):")
    importance = predictor.feature_importance('gain')
    for feat, imp in importance.items():
        print(f"  {feat}: {imp:.2f}")

XGBoost for Point Spread and Over/Under Prediction

While the binary classification approach (win/loss) is natural for moneyline betting, XGBoost can also be adapted for spread and total betting:

Spread prediction (regression): Use objective='reg:squarederror' and predict the point differential directly. Then compare the predicted differential to the posted spread.

Over/under (regression or classification): Predict total points using regression, or frame as binary classification (over vs. under the posted total).

Multi-output: Train separate models for home score and away score, then combine. This provides richer information than a single-output model.
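
A brief sketch of the regression framing for spreads follows. The synthetic data, feature names, and the sign convention (the spread quoted from the home team's perspective, so a home favorite carries a negative number) are assumptions for illustration:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 6))                      # engineered features
margin = 3.0 + 4.0 * features[:, 0] + rng.normal(scale=10, size=1000)
posted_spread = -(3.0 + 4.0 * features[:, 0]) + rng.normal(scale=2, size=1000)

spread_model = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=300, max_depth=4, learning_rate=0.05,
)
spread_model.fit(features[:800], margin[:800])

predicted_margin = spread_model.predict(features[800:])
# With the spread quoted from the home side, the home team covers when
# actual margin + spread > 0, so the model leans home when
# predicted margin + spread > 0.
edge = predicted_margin + posted_spread[800:]
print(f"Games where the model leans home against the spread: {(edge > 0).sum()}")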


27.2 Random Forests and Ensemble Methods

Bagging: Bootstrap Aggregation

Before understanding random forests, we must understand bagging (bootstrap aggregation), introduced by Leo Breiman in 1996. The idea is simple yet powerful:

  1. Create $B$ bootstrap samples from the training data (sampling with replacement).
  2. Train a decision tree on each bootstrap sample.
  3. Average the predictions (regression) or vote (classification) across all $B$ trees.

Bagging reduces variance without increasing bias. If individual trees have variance $\sigma^2$ and pairwise correlation $\rho$, the variance of the average is:

$$\text{Var}_{\text{ensemble}} = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2$$

As $B \to \infty$, the second term vanishes, leaving $\rho\sigma^2$. This means the only way to further reduce variance is to reduce the correlation $\rho$ between trees --- which is exactly what random forests do.
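
The formula is easy to verify numerically. The sketch below builds $B$ correlated "tree errors" from a shared component plus independent noise and checks the ensemble variance against the expression above (purely illustrative; the correlation structure is an assumption):

import numpy as np

rho, sigma2, B = 0.25, 1.0, 200
rng = np.random.default_rng(0)

# Each column behaves like one tree's error: variance sigma2,
# pairwise correlation rho (induced by the shared component).
shared = rng.normal(scale=np.sqrt(rho * sigma2), size=(20_000, 1))
noise = rng.normal(scale=np.sqrt((1 - rho) * sigma2), size=(20_000, B))
tree_errors = shared + noise
ensemble_error = tree_errors.mean(axis=1)

print(f"Empirical ensemble variance: {ensemble_error.var():.4f}")
print(f"Formula rho*sigma2 + (1-rho)/B*sigma2: "
      f"{rho * sigma2 + (1 - rho) / B * sigma2:.4f}")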

Random Forests

Random forests extend bagging with feature subsampling: at each split in each tree, only a random subset of $m$ features (out of $p$ total) are considered as candidates. This decorrelates the trees, reducing $\rho$ and thus reducing ensemble variance.

The typical choice is $m = \sqrt{p}$ for classification and $m = p/3$ for regression, but this should be tuned empirically.

Key hyperparameters:

| Parameter | Description | Typical Range |
|---|---|---|
| n_estimators | Number of trees | 100--1000 |
| max_depth | Maximum tree depth | None (full), or 10--30 |
| max_features | Features per split | 'sqrt', 'log2', or float 0.3--0.8 |
| min_samples_leaf | Min samples in a leaf | 1--20 |
| min_samples_split | Min samples to split | 2--20 |

Random Forests vs. XGBoost

| Property | Random Forest | XGBoost |
|---|---|---|
| Training | Parallel (trees are independent) | Sequential (each tree corrects errors) |
| Overfitting risk | Lower (bagging + subsampling) | Higher without regularization |
| Tuning effort | Fewer critical hyperparameters | More hyperparameters to tune |
| Performance ceiling | Good | Often higher |
| Training speed | Fast (parallelizable) | Moderate |
| Interpretability | Moderate | Moderate |
| Handles noisy features | Well (averages out noise) | Can overfit to noise |

In practice, XGBoost tends to outperform random forests on tabular data when properly tuned, but random forests are an excellent default model that requires less tuning and is more robust to hyperparameter choices.

Stacking and Blending

Stacking (introduced by David Wolpert in 1992) combines multiple diverse base models using a meta-learner:

  1. Level 0: Train $K$ diverse base models (e.g., logistic regression, random forest, XGBoost, neural network).
  2. Cross-validation predictions: For each base model, generate out-of-fold predictions using $k$-fold cross-validation. This produces a prediction from each model for every training example, without any model seeing its own training data.
  3. Level 1: Train a meta-model on the out-of-fold predictions (as features) to predict the true outcome.

The meta-model learns the optimal combination of base models, potentially capturing nonlinear relationships (e.g., "trust XGBoost when the spread is large, trust logistic regression when teams are closely matched").

Blending is a simpler variant:

  1. Split the training data into a training set and a blending set (holdout).
  2. Train base models on the training set.
  3. Generate predictions on the blending set.
  4. Train the meta-model on the blending set predictions.

Blending is easier to implement but wastes data (the blending set is not used for training base models).
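
Because the steps are so short, a blending sketch is shown below (illustrative only, with synthetic data; the SportsEnsemble class later in this section implements the stacking variant instead):

import numpy as np
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 6))
y = (rng.random(3000) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

# 1. Split into a training set and a blending (holdout) set.
X_fit, y_fit = X[:2000], y[:2000]
X_blend, y_blend = X[2000:], y[2000:]

# 2. Train base models on the training set.
base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=200, max_depth=8, random_state=0),
    xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.05,
                      eval_metric='logloss'),
]
for m in base_models:
    m.fit(X_fit, y_fit)

# 3. Generate predictions on the blending set.
blend_features = np.column_stack(
    [m.predict_proba(X_blend)[:, 1] for m in base_models]
)

# 4. Train the meta-model on the blending-set predictions.
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(blend_features, y_blend)
print("Meta-model weights:", np.round(meta_model.coef_[0], 3))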

When Ensembles Beat Single Models

Ensembles work best when:

  • Base models are diverse: They make different types of errors. A stack of 5 XGBoost models with slightly different hyperparameters is less diverse than a mix of XGBoost, logistic regression, and a neural network.
  • Base models are individually reasonably accurate: Combining garbage does not produce gold. Each model should be at least somewhat predictive.
  • There is sufficient data to train the meta-model: With very small datasets, stacking can overfit at the meta-level.

For sports betting, a practical ensemble might combine:

  • Logistic regression (simple, interpretable, hard to overfit)
  • Random forest (robust, handles nonlinearity)
  • XGBoost (high performance, captures interactions)
  • Rating system outputs (Elo, Massey, etc., from Chapter 26)

Python Implementation

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_predict
from sklearn.metrics import log_loss, brier_score_loss
import xgboost as xgb
from typing import Dict, List, Optional

class SportsEnsemble:
    """
    Ensemble model combining random forest, XGBoost, and logistic regression
    using stacking for sports prediction.
    """

    def __init__(self, random_state: int = 42):
        self.random_state = random_state
        self.base_models = self._create_base_models()
        self.stacking_model = None
        self.blending_model = None

    def _create_base_models(self) -> Dict:
        return {
            'logistic': LogisticRegression(
                C=1.0, max_iter=1000,
                random_state=self.random_state
            ),
            'random_forest': RandomForestClassifier(
                n_estimators=300,
                max_depth=8,
                max_features='sqrt',
                min_samples_leaf=10,
                n_jobs=-1,
                random_state=self.random_state
            ),
            'xgboost': xgb.XGBClassifier(
                n_estimators=300,
                max_depth=4,
                learning_rate=0.05,
                subsample=0.8,
                colsample_bytree=0.8,
                min_child_weight=3,
                reg_lambda=3,
                use_label_encoder=False,
                eval_metric='logloss',
                random_state=self.random_state,
            ),
        }

    def fit_stacking(self, X: np.ndarray, y: np.ndarray,
                     cv_splits: int = 5) -> None:
        """
        Train stacking ensemble with time-series cross-validation.
        """
        tscv = TimeSeriesSplit(n_splits=cv_splits)

        estimators = [
            ('lr', self.base_models['logistic']),
            ('rf', self.base_models['random_forest']),
            ('xgb', self.base_models['xgboost']),
        ]

        self.stacking_model = StackingClassifier(
            estimators=estimators,
            final_estimator=LogisticRegression(C=1.0, max_iter=1000),
            cv=tscv,
            stack_method='predict_proba',
            passthrough=False,  # Only use base model predictions
            n_jobs=-1,
        )

        self.stacking_model.fit(X, y)

    def fit_manual_stacking(self, X: np.ndarray, y: np.ndarray,
                             n_splits: int = 5) -> None:
        """
        Manual stacking implementation with more control.

        Generates out-of-fold predictions for the meta-learner
        using time-series splits.
        """
        tscv = TimeSeriesSplit(n_splits=n_splits)
        n_models = len(self.base_models)
        meta_features = np.zeros((len(X), n_models))

        # Generate out-of-fold predictions
        for fold_idx, (train_idx, val_idx) in enumerate(tscv.split(X)):
            X_fold_train, X_fold_val = X[train_idx], X[val_idx]
            y_fold_train = y[train_idx]

            for model_idx, (name, model) in enumerate(
                self.base_models.items()
            ):
                model_clone = self._clone_model(name)
                model_clone.fit(X_fold_train, y_fold_train)
                meta_features[val_idx, model_idx] = (
                    model_clone.predict_proba(X_fold_val)[:, 1]
                )

        # Train meta-learner on the complete set of OOF predictions
        # Only use rows that have OOF predictions (not the first fold's train)
        valid_mask = meta_features.sum(axis=1) != 0
        meta_X = meta_features[valid_mask]
        meta_y = y[valid_mask]

        self.meta_learner = LogisticRegression(C=1.0, max_iter=1000)
        self.meta_learner.fit(meta_X, meta_y)

        # Refit all base models on full training data
        for name, model in self.base_models.items():
            model.fit(X, y)

        print("Meta-learner coefficients:")
        for name, coef in zip(self.base_models.keys(),
                               self.meta_learner.coef_[0]):
            print(f"  {name}: {coef:.4f}")

    def _clone_model(self, name: str):
        """Create a fresh copy of a base model."""
        from sklearn.base import clone
        return clone(self.base_models[name])

    def predict_proba_stacking(self, X: np.ndarray) -> np.ndarray:
        """Predict using sklearn StackingClassifier."""
        return self.stacking_model.predict_proba(X)[:, 1]

    def predict_proba_manual(self, X: np.ndarray) -> np.ndarray:
        """Predict using manually stacked model."""
        meta_features = np.column_stack([
            model.predict_proba(X)[:, 1]
            for model in self.base_models.values()
        ])
        return self.meta_learner.predict_proba(meta_features)[:, 1]

    def compare_models(self, X_test: np.ndarray,
                       y_test: np.ndarray) -> pd.DataFrame:
        """Compare individual models vs ensemble on test set."""
        results = []

        for name, model in self.base_models.items():
            probs = model.predict_proba(X_test)[:, 1]
            results.append({
                'model': name,
                'log_loss': log_loss(y_test, probs),
                'brier_score': brier_score_loss(y_test, probs),
                'accuracy': ((probs > 0.5) == y_test).mean(),
            })

        if self.stacking_model is not None:
            probs = self.predict_proba_stacking(X_test)
            results.append({
                'model': 'stacking_ensemble',
                'log_loss': log_loss(y_test, probs),
                'brier_score': brier_score_loss(y_test, probs),
                'accuracy': ((probs > 0.5) == y_test).mean(),
            })

        return pd.DataFrame(results).sort_values('log_loss')


# --- Worked Example ---
if __name__ == "__main__":
    np.random.seed(42)
    n = 3000

    # Synthetic data with nonlinear relationships
    X = np.random.randn(n, 8)
    logit = (0.5 * X[:, 0] + 0.3 * X[:, 1] - 0.4 * X[:, 2]
             + 0.2 * X[:, 0] * X[:, 1]  # Interaction
             + 0.1 * X[:, 3]**2          # Nonlinearity
             + 0.15)
    prob = 1 / (1 + np.exp(-logit))
    y = (np.random.random(n) < prob).astype(int)

    # Split
    X_train, X_test = X[:2400], X[2400:]
    y_train, y_test = y[:2400], y[2400:]

    ensemble = SportsEnsemble()
    ensemble.fit_stacking(X_train, y_train, cv_splits=5)

    print("=== Model Comparison ===\n")
    comparison = ensemble.compare_models(X_test, y_test)
    print(comparison.to_string(index=False))

27.3 Probability Calibration

Why Calibration Matters for Betting

In sports betting, the actual probability estimate matters enormously --- not just the ranking of outcomes. Consider two models that both correctly identify team A as the favorite:

  • Model 1 predicts P(A wins) = 0.72
  • Model 2 predicts P(A wins) = 0.58

If the bookmaker's implied probability (after removing vig) is 0.65, Model 1 sees a strong value bet on A, Model 2 sees value on B, and the correct action depends entirely on which probability is more accurate. An uncalibrated model that consistently overestimates the probability of favorites will see phantom value everywhere and bleed money.
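
To make the comparison concrete, here is a small sketch of removing the vig from a hypothetical two-way moneyline and comparing the fair probability to a model estimate (the odds and the proportional vig-removal method are illustrative assumptions):

def american_to_prob(odds: int) -> float:
    """Implied probability (vig included) from American odds."""
    return 100 / (odds + 100) if odds > 0 else -odds / (-odds + 100)

home_odds, away_odds = -185, +160            # hypothetical two-way market
p_home_raw = american_to_prob(home_odds)
p_away_raw = american_to_prob(away_odds)
overround = p_home_raw + p_away_raw          # > 1 because of the vig

p_home_fair = p_home_raw / overround         # proportional vig removal
model_prob = 0.72                            # Model 1 from the example above

edge = model_prob - p_home_fair
print(f"No-vig home probability: {p_home_fair:.3f}, model edge: {edge:+.3f}")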

Calibration means that when your model says an event has probability $p$, it actually occurs with frequency $p$. Formally, a model is perfectly calibrated if:

$$P(Y = 1 \mid \hat{p}(X) = p) = p \quad \text{for all } p \in [0, 1]$$

Calibration Curves (Reliability Diagrams)

A calibration curve plots predicted probabilities against observed frequencies. To construct one:

  1. Sort predictions into bins (e.g., 10 bins: [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]).
  2. For each bin, compute the mean predicted probability and the observed frequency of positive outcomes.
  3. Plot mean prediction (x-axis) vs. observed frequency (y-axis).

A perfectly calibrated model produces a 45-degree line. Common miscalibration patterns:

  • Overconfident: The curve has an inverse-S (transposed sigmoid) shape because the model predicts more extreme probabilities than warranted. Predictions near 0.8 actually correspond to only 0.65 win rates.
  • Underconfident: The curve is S-shaped (sigmoid). Predictions are too moderate, clustering near 0.5 while the observed frequencies are more extreme.
  • Biased: The curve is shifted above or below the diagonal. The model systematically over- or under-predicts.

Platt Scaling

Platt scaling (John Platt, 1999) fits a logistic regression to the model's outputs:

$$P(y = 1 \mid f(x)) = \frac{1}{1 + \exp(Af(x) + B)}$$

where $f(x)$ is the uncalibrated model output (raw score or log-odds) and $A$, $B$ are parameters learned by maximum likelihood on a held-out calibration set.

For a model that outputs probabilities, we first convert to log-odds: $f(x) = \ln(p/(1-p))$, then fit the logistic regression.

Advantages:
  • Simple, with only 2 parameters.
  • Works well when the miscalibration is roughly sigmoid-shaped.
  • Fast to fit.

Disadvantages:
  • Assumes the calibration function is logistic (may not hold).
  • Can underperform when the true calibration function is non-monotonic.

Isotonic Regression

Isotonic regression fits a non-decreasing step function to the relationship between predicted probabilities and actual outcomes. It is more flexible than Platt scaling because it makes no parametric assumptions --- it only requires that the calibration function is monotonically non-decreasing.

The isotonic regression problem is:

$$\min_{z} \sum_{i=1}^{n} (y_i - z_i)^2 \quad \text{subject to } z_1 \leq z_2 \leq \cdots \leq z_n$$

where the $y_i$ are the actual outcomes (0 or 1) and the $z_i$ are the calibrated probabilities, ordered by the model's original predictions.

The solution is computed efficiently using the pool adjacent violators algorithm (PAVA) in $O(n)$ time.

Advantages:
  • Nonparametric: adapts to any shape of miscalibration.
  • Can fix non-sigmoid miscalibration patterns.

Disadvantages:
  • Requires more calibration data than Platt scaling (being nonparametric, it is prone to overfitting with small samples).
  • The step function can produce non-smooth calibration maps.

Platt vs. Isotonic: A Comparison

| Property | Platt Scaling | Isotonic Regression |
|---|---|---|
| Parametric | Yes (2 params) | No |
| Flexibility | Low | High |
| Data requirement | Small (100+) | Moderate (500+) |
| Overfitting risk | Low | Moderate |
| Handles non-sigmoid | Poorly | Well |
| Smoothness | Smooth (sigmoid) | Step function |

Recommendation for sports betting: Use Platt scaling when your calibration set is small (a single season of games). Use isotonic regression when you have multiple seasons of data. In practice, isotonic regression with careful cross-validation tends to produce slightly better-calibrated probabilities once the calibration set is large enough to support its flexibility.

Python Implementation

import numpy as np
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend
import matplotlib.pyplot as plt
from sklearn.calibration import (
    CalibratedClassifierCV, calibration_curve
)
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import log_loss, brier_score_loss
from typing import Tuple, Optional

class ProbabilityCalibrator:
    """
    Calibrates model probabilities using Platt scaling or isotonic regression.
    """

    def __init__(self, method: str = 'isotonic'):
        """
        Args:
            method: 'platt' (logistic/sigmoid) or 'isotonic'
        """
        self.method = method
        self.calibrator = None

    def fit(self, uncalibrated_probs: np.ndarray,
            true_outcomes: np.ndarray) -> None:
        """
        Fit the calibration model.

        Args:
            uncalibrated_probs: Model's probability predictions (1D array)
            true_outcomes: Actual outcomes, 0 or 1 (1D array)
        """
        if self.method == 'platt':
            # Platt scaling: logistic regression on log-odds
            log_odds = np.log(
                np.clip(uncalibrated_probs, 1e-10, 1-1e-10)
                / np.clip(1-uncalibrated_probs, 1e-10, 1-1e-10)
            ).reshape(-1, 1)

            self.calibrator = LogisticRegression(C=1e10, max_iter=10000)
            self.calibrator.fit(log_odds, true_outcomes)

        elif self.method == 'isotonic':
            self.calibrator = IsotonicRegression(
                y_min=0.001, y_max=0.999, out_of_bounds='clip'
            )
            self.calibrator.fit(uncalibrated_probs, true_outcomes)

    def transform(self, uncalibrated_probs: np.ndarray) -> np.ndarray:
        """Apply calibration to new probability predictions."""
        if self.calibrator is None:
            raise ValueError("Must call fit() first")

        if self.method == 'platt':
            log_odds = np.log(
                np.clip(uncalibrated_probs, 1e-10, 1-1e-10)
                / np.clip(1-uncalibrated_probs, 1e-10, 1-1e-10)
            ).reshape(-1, 1)
            return self.calibrator.predict_proba(log_odds)[:, 1]

        elif self.method == 'isotonic':
            return self.calibrator.transform(uncalibrated_probs)

    def plot_calibration_curve(self, probs_before: np.ndarray,
                                probs_after: np.ndarray,
                                true_outcomes: np.ndarray,
                                n_bins: int = 10,
                                save_path: Optional[str] = None) -> None:
        """
        Plot calibration curves before and after calibration.
        """
        fig, axes = plt.subplots(1, 2, figsize=(14, 6))

        for ax, probs, title in [
            (axes[0], probs_before, 'Before Calibration'),
            (axes[1], probs_after, 'After Calibration')
        ]:
            fraction_pos, mean_predicted = calibration_curve(
                true_outcomes, probs, n_bins=n_bins, strategy='uniform'
            )

            ax.plot(mean_predicted, fraction_pos, 's-',
                    label='Model', color='blue')
            ax.plot([0, 1], [0, 1], '--', color='gray',
                    label='Perfectly Calibrated')
            ax.set_xlabel('Mean Predicted Probability')
            ax.set_ylabel('Observed Frequency')
            ax.set_title(title)
            ax.legend()
            ax.set_xlim([0, 1])
            ax.set_ylim([0, 1])

            # Add histogram of predictions
            ax2 = ax.twinx()
            ax2.hist(probs, bins=20, alpha=0.2, color='orange')
            ax2.set_ylabel('Count')

        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        plt.close()


def calibration_metrics(y_true: np.ndarray,
                        y_pred: np.ndarray,
                        n_bins: int = 10) -> dict:
    """
    Compute comprehensive calibration metrics.

    Returns:
        Dictionary with ECE, MCE, log_loss, and brier_score
    """
    # Expected Calibration Error (ECE) and Maximum Calibration Error (MCE),
    # computed with explicit uniform bins so that bin weights always align
    # with the bins that actually contain predictions.
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.clip(np.digitize(y_pred, bin_edges) - 1, 0, n_bins - 1)

    ece = 0.0
    mce = 0.0
    for b in range(n_bins):
        mask = bin_indices == b
        if not mask.any():
            continue  # skip empty bins
        gap = abs(y_true[mask].mean() - y_pred[mask].mean())
        ece += (mask.sum() / len(y_pred)) * gap
        mce = max(mce, gap)

    return {
        'ECE': round(ece, 4),
        'MCE': round(mce, 4),
        'log_loss': round(log_loss(y_true, y_pred), 4),
        'brier_score': round(brier_score_loss(y_true, y_pred), 4),
    }


# --- Worked Example ---
if __name__ == "__main__":
    np.random.seed(42)
    n = 2000

    # Simulate an overconfident model
    true_prob = np.random.uniform(0.2, 0.8, n)
    y = (np.random.random(n) < true_prob).astype(int)

    # Model predictions: overconfident (pushed toward 0 and 1)
    uncalibrated = np.clip(true_prob + 0.3 * (true_prob - 0.5), 0.01, 0.99)
    uncalibrated += np.random.normal(0, 0.05, n)
    uncalibrated = np.clip(uncalibrated, 0.01, 0.99)

    # Split
    cal_split = 1000
    cal_probs = uncalibrated[:cal_split]
    cal_y = y[:cal_split]
    test_probs = uncalibrated[cal_split:]
    test_y = y[cal_split:]

    print("=== Probability Calibration ===\n")
    print("Before calibration:")
    metrics_before = calibration_metrics(test_y, test_probs)
    for k, v in metrics_before.items():
        print(f"  {k}: {v}")

    # Platt scaling
    platt = ProbabilityCalibrator(method='platt')
    platt.fit(cal_probs, cal_y)
    platt_preds = platt.transform(test_probs)

    print("\nAfter Platt scaling:")
    metrics_platt = calibration_metrics(test_y, platt_preds)
    for k, v in metrics_platt.items():
        print(f"  {k}: {v}")

    # Isotonic regression
    isotonic = ProbabilityCalibrator(method='isotonic')
    isotonic.fit(cal_probs, cal_y)
    iso_preds = isotonic.transform(test_probs)

    print("\nAfter isotonic regression:")
    metrics_iso = calibration_metrics(test_y, iso_preds)
    for k, v in metrics_iso.items():
        print(f"  {k}: {v}")

Practical Calibration Workflow for Bettors

  1. Train your prediction model on data from seasons 1 through $N-1$.
  2. Generate predictions on season $N$ (hold-out calibration set).
  3. Fit a calibrator (Platt or isotonic) on the hold-out predictions and actual outcomes.
  4. Apply the calibrator to all future predictions.
  5. Monitor calibration throughout the current season. If the calibration curve drifts, refit with recent data.
  6. Use cross-validated calibration (e.g., CalibratedClassifierCV with cv=5) when you cannot afford to set aside a separate calibration set.
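
A minimal sketch of step 6, wrapping a base model in cross-validated calibration (the estimator, synthetic data, and cv=5 folds here are placeholders; note that the standard k-fold used internally does not respect temporal ordering, so prefer a separate calibration season when you can):

import numpy as np
import xgboost as xgb
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
X_hist = rng.normal(size=(2000, 6))                 # placeholder features
y_hist = (rng.random(2000) < 0.55).astype(int)      # placeholder outcomes

base = xgb.XGBClassifier(
    n_estimators=300, max_depth=4, learning_rate=0.05,
    eval_metric='logloss',
)

# Each fold trains the base model on k-1 folds and calibrates on the
# held-out fold; predictions average the calibrated fold models.
calibrated = CalibratedClassifierCV(base, method='isotonic', cv=5)
calibrated.fit(X_hist, y_hist)

print(np.round(calibrated.predict_proba(X_hist[:5])[:, 1], 3))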

27.4 Handling Imbalanced Outcomes

The Imbalance Problem in Sports Betting

Many valuable betting opportunities involve predicting rare outcomes:

  • Upsets: A significant underdog winning outright (moneyline).
  • Blowouts: Winning by a large margin (alternative spread markets).
  • High-scoring games: Total points exceeding a high threshold.
  • Player props: A player achieving an unusually high statistical line.
  • Parlays and teasers: Multiple correlated outcomes occurring simultaneously.

When one class is much rarer than the other, standard models struggle. If upsets occur in only 20% of games, a model that always predicts "no upset" achieves 80% accuracy while being completely useless. Standard metrics like accuracy are misleading, and the model's probability estimates for the rare class are often poorly calibrated.

Why Standard Models Fail on Imbalanced Data

Most classification algorithms optimize an objective function that treats all misclassifications equally. With an 80/20 class split:

  • The model learns to predict the majority class most of the time.
  • The loss function is dominated by majority-class examples.
  • The decision boundary is pushed toward the minority class.
  • Probability estimates for the minority class are compressed toward zero.

For a bettor, this is disastrous. The rare outcomes (upsets, blowouts) are precisely where the market offers the highest expected value, because bookmakers also struggle to price them accurately.

Techniques for Handling Imbalance

1. Class Weights

The simplest approach: assign higher weight to minority-class examples in the loss function. Instead of minimizing:

$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n} [y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)]$$

We minimize:

$$\mathcal{L}_w = -\frac{1}{n}\sum_{i=1}^{n} [w_1 \cdot y_i \log(\hat{p}_i) + w_0 \cdot (1-y_i)\log(1-\hat{p}_i)]$$

where $w_1 > w_0$. A common choice is balanced weights: $w_k = n / (2 \cdot n_k)$ where $n_k$ is the number of examples in class $k$. In scikit-learn, this is achieved with class_weight='balanced'.
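
A quick sketch of the balanced-weight arithmetic and XGBoost's one-parameter analogue, scale_pos_weight (the 80/20 split below is a made-up example):

import numpy as np

y = np.array([0] * 800 + [1] * 200)           # hypothetical 80/20 outcome split
n = len(y)
n_pos, n_neg = (y == 1).sum(), (y == 0).sum()

w0 = n / (2 * n_neg)                          # weight for class 0 -> 0.625
w1 = n / (2 * n_pos)                          # weight for class 1 -> 2.5
scale_pos_weight = n_neg / n_pos              # extra weight on positives in XGBoost

print(f"w0={w0:.3f}, w1={w1:.3f}, ratio w1/w0={w1 / w0:.1f}, "
      f"scale_pos_weight={scale_pos_weight:.1f}")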

2. SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE (Chawla et al., 2002) generates synthetic minority-class examples by interpolating between existing minority examples:

  1. For each minority example $x_i$, find its $k$ nearest minority-class neighbors.
  2. Randomly select one neighbor $x_{nn}$.
  3. Create a synthetic example: $x_{\text{new}} = x_i + \lambda(x_{nn} - x_i)$ where $\lambda \sim \text{Uniform}(0, 1)$.

This increases the effective size of the minority class without simply duplicating examples (which can lead to overfitting).

Variants of SMOTE:
  • Borderline-SMOTE: Only oversamples minority examples near the decision boundary.
  • SMOTE-ENN: Combines SMOTE with Edited Nearest Neighbors to clean noisy examples.
  • ADASYN: Adaptively generates more synthetic examples for harder-to-classify minority instances.

3. Threshold Optimization

Rather than using the default 0.5 threshold for classification, optimize the threshold on a validation set. For imbalanced problems, the optimal threshold is often below 0.5.

4. Ensemble Methods for Imbalanced Data

  • BalancedRandomForest: Random forest variant that balances each bootstrap sample.
  • EasyEnsemble: Trains multiple models on balanced subsets and averages.
  • RUSBoost: Combines random undersampling with boosting.

Evaluation Metrics for Imbalanced Data

Standard accuracy is misleading. Use these instead:

| Metric | Formula | Interpretation |
|---|---|---|
| Precision | $\frac{TP}{TP+FP}$ | Of predicted positives, what fraction is correct? |
| Recall | $\frac{TP}{TP+FN}$ | Of actual positives, what fraction did we catch? |
| F1 Score | $\frac{2 \cdot P \cdot R}{P + R}$ | Harmonic mean of precision and recall |
| PR-AUC | Area under precision-recall curve | Overall quality for imbalanced data |
| Log-loss | $-\frac{1}{n}\sum [y\log\hat{p} + (1-y)\log(1-\hat{p})]$ | Penalizes poor probability estimates |
| Brier Score | $\frac{1}{n}\sum(y - \hat{p})^2$ | Mean squared probability error |

For betting applications, log-loss and Brier score are the most relevant because they evaluate the quality of probability estimates directly, which is what matters for identifying value bets.

Python Implementation

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report, precision_recall_curve,
    average_precision_score, log_loss, brier_score_loss,
    roc_auc_score, f1_score
)
from sklearn.model_selection import TimeSeriesSplit
import xgboost as xgb
from typing import Dict, Optional

try:
    from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
    from imblearn.combine import SMOTEENN
    from imblearn.ensemble import BalancedRandomForestClassifier
    HAS_IMBLEARN = True
except ImportError:
    HAS_IMBLEARN = False


class ImbalancedSportsPredictor:
    """
    Handles imbalanced outcomes in sports prediction.

    Compares multiple approaches: class weights, SMOTE, threshold
    optimization, and specialized ensemble methods.
    """

    def __init__(self, random_state: int = 42):
        self.random_state = random_state
        self.models: Dict = {}
        self.optimal_thresholds: Dict[str, float] = {}

    def fit_class_weighted(self, X_train: np.ndarray,
                           y_train: np.ndarray) -> None:
        """Train models with class weight adjustment."""
        self.models['xgb_weighted'] = xgb.XGBClassifier(
            n_estimators=300,
            max_depth=4,
            learning_rate=0.05,
            scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
            use_label_encoder=False,
            eval_metric='logloss',
            random_state=self.random_state,
        )
        self.models['xgb_weighted'].fit(X_train, y_train)

        self.models['rf_weighted'] = RandomForestClassifier(
            n_estimators=300,
            max_depth=8,
            class_weight='balanced',
            random_state=self.random_state,
        )
        self.models['rf_weighted'].fit(X_train, y_train)

    def fit_smote(self, X_train: np.ndarray,
                  y_train: np.ndarray) -> None:
        """Train models on SMOTE-resampled data."""
        if not HAS_IMBLEARN:
            print("imbalanced-learn not installed; skipping SMOTE")
            return

        # Standard SMOTE
        smote = SMOTE(random_state=self.random_state, k_neighbors=5)
        X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

        print(f"Original: {len(y_train)} samples "
              f"({y_train.sum()} positive, {(1-y_train).sum()} negative)")
        print(f"After SMOTE: {len(y_resampled)} samples "
              f"({y_resampled.sum()} positive, "
              f"{(1-y_resampled).sum()} negative)")

        self.models['xgb_smote'] = xgb.XGBClassifier(
            n_estimators=300, max_depth=4, learning_rate=0.05,
            use_label_encoder=False, eval_metric='logloss',
            random_state=self.random_state,
        )
        self.models['xgb_smote'].fit(X_resampled, y_resampled)

        # Borderline-SMOTE (focuses on borderline examples)
        bsmote = BorderlineSMOTE(
            random_state=self.random_state, k_neighbors=5
        )
        X_border, y_border = bsmote.fit_resample(X_train, y_train)

        self.models['xgb_borderline_smote'] = xgb.XGBClassifier(
            n_estimators=300, max_depth=4, learning_rate=0.05,
            use_label_encoder=False, eval_metric='logloss',
            random_state=self.random_state,
        )
        self.models['xgb_borderline_smote'].fit(X_border, y_border)

    def fit_baseline(self, X_train: np.ndarray,
                     y_train: np.ndarray) -> None:
        """Train baseline model without any imbalance handling."""
        self.models['xgb_baseline'] = xgb.XGBClassifier(
            n_estimators=300, max_depth=4, learning_rate=0.05,
            use_label_encoder=False, eval_metric='logloss',
            random_state=self.random_state,
        )
        self.models['xgb_baseline'].fit(X_train, y_train)

    def optimize_threshold(self, model_name: str,
                           X_val: np.ndarray,
                           y_val: np.ndarray,
                           metric: str = 'f1') -> float:
        """
        Find the optimal classification threshold.

        Args:
            model_name: Name of the model to optimize
            X_val: Validation features
            y_val: Validation outcomes
            metric: 'f1', 'precision', or 'recall'
        """
        model = self.models[model_name]
        probs = model.predict_proba(X_val)[:, 1]

        best_threshold = 0.5
        best_score = 0

        for threshold in np.arange(0.1, 0.9, 0.01):
            preds = (probs >= threshold).astype(int)
            if metric == 'f1':
                score = f1_score(y_val, preds, zero_division=0)
            elif metric == 'precision':
                tp = ((preds == 1) & (y_val == 1)).sum()
                fp = ((preds == 1) & (y_val == 0)).sum()
                score = tp / (tp + fp) if (tp + fp) > 0 else 0
            elif metric == 'recall':
                tp = ((preds == 1) & (y_val == 1)).sum()
                fn = ((preds == 0) & (y_val == 1)).sum()
                score = tp / (tp + fn) if (tp + fn) > 0 else 0

            if score > best_score:
                best_score = score
                best_threshold = threshold

        self.optimal_thresholds[model_name] = best_threshold
        print(f"Optimal threshold for {model_name}: "
              f"{best_threshold:.2f} ({metric} = {best_score:.4f})")
        return best_threshold

    def evaluate_all(self, X_test: np.ndarray,
                     y_test: np.ndarray) -> pd.DataFrame:
        """Comprehensive evaluation of all models."""
        results = []

        for name, model in self.models.items():
            probs = model.predict_proba(X_test)[:, 1]
            threshold = self.optimal_thresholds.get(name, 0.5)
            preds = (probs >= threshold).astype(int)

            tp = ((preds == 1) & (y_test == 1)).sum()
            fp = ((preds == 1) & (y_test == 0)).sum()
            fn = ((preds == 0) & (y_test == 1)).sum()

            precision = tp / (tp + fp) if (tp + fp) > 0 else 0
            recall = tp / (tp + fn) if (tp + fn) > 0 else 0
            f1 = (2 * precision * recall / (precision + recall)
                  if (precision + recall) > 0 else 0)

            results.append({
                'model': name,
                'threshold': threshold,
                'log_loss': round(log_loss(y_test, probs), 4),
                'brier': round(brier_score_loss(y_test, probs), 4),
                'auc_roc': round(roc_auc_score(y_test, probs), 4),
                'precision': round(precision, 4),
                'recall': round(recall, 4),
                'f1': round(f1, 4),
            })

        return pd.DataFrame(results).sort_values('log_loss')


# --- Worked Example: Predicting Upsets ---
if __name__ == "__main__":
    np.random.seed(42)
    n = 3000

    # Generate imbalanced data: upsets occur ~20% of the time
    X = np.random.randn(n, 10)
    logit = (-1.5 + 0.4 * X[:, 0] + 0.3 * X[:, 1] - 0.5 * X[:, 2]
             + 0.2 * X[:, 3] + 0.1 * X[:, 4])
    prob = 1 / (1 + np.exp(-logit))
    y = (np.random.random(n) < prob).astype(int)

    print(f"Class distribution: {y.mean():.3f} positive rate "
          f"({y.sum()} upsets out of {n} games)\n")

    # Split temporally
    train_end = 2000
    val_end = 2500
    X_train, y_train = X[:train_end], y[:train_end]
    X_val, y_val = X[train_end:val_end], y[train_end:val_end]
    X_test, y_test = X[val_end:], y[val_end:]

    predictor = ImbalancedSportsPredictor()

    # Fit all approaches
    predictor.fit_baseline(X_train, y_train)
    predictor.fit_class_weighted(X_train, y_train)
    predictor.fit_smote(X_train, y_train)

    # Optimize thresholds on validation set
    for name in predictor.models:
        predictor.optimize_threshold(name, X_val, y_val, metric='f1')

    # Evaluate
    print("\n=== Imbalanced Classification Results ===\n")
    results = predictor.evaluate_all(X_test, y_test)
    print(results.to_string(index=False))

Practical Advice for Bettors

  1. Use class weights rather than SMOTE for most sports prediction tasks. SMOTE can introduce artifacts that distort probability estimates, while class weights simply reweight the loss function.

  2. Calibrate after rebalancing. Any rebalancing technique (weights, SMOTE, undersampling) distorts the output probabilities. Always calibrate the model's probabilities on the original (unbalanced) distribution using the methods from Section 27.3.

  3. Focus on log-loss and Brier score, not F1 or accuracy. For betting, you need accurate probability estimates, not accurate class predictions.

  4. Consider the base rate. If upsets occur 20% of the time and your model correctly identifies some upsets while maintaining good calibration, you can exploit moneyline value on underdogs.

  5. Threshold optimization is for classification, not betting. In betting, you never "classify" a game as win or loss --- you estimate the probability and compare to the market price. Threshold optimization is useful for internal model evaluation but should not drive betting decisions.


27.5 Model Interpretability (SHAP, Feature Importance)

Why Interpretability Matters for Betting

A model that outputs "P(home win) = 0.67" is useful. A model that says "P(home win) = 0.67, primarily because the home team has a +5 Elo advantage and is well-rested, partially offset by a poor matchup against the away team's defense" is far more useful. Interpretability serves several critical functions for bettors:

  1. Sanity checking. Does the model make predictions for sensible reasons? If your model favors a team primarily because of the color of their jerseys (an inadvertently included feature), you want to know.

  2. Feature discovery. Which features actually drive predictions? This guides future feature engineering and data collection.

  3. Edge identification. Understanding why your model disagrees with the market helps you determine whether the disagreement is likely to persist (a real edge) or is based on noise.

  4. Model debugging. When the model makes a surprising prediction, interpretability tools help you understand whether it is a genuine insight or a model error.

  5. Communication. If you are betting with other people's money, you need to explain your rationale beyond "the model says so."

Built-in Feature Importance

Tree-based models (random forests, XGBoost) provide built-in feature importance metrics:

Gain-based importance: For each feature, sum the improvement in the loss function across all splits using that feature. This measures how much each feature contributes to reducing prediction error.

Split-based importance (weight): Count the number of times each feature is used as a split criterion. Features used in many splits are deemed more important.

Cover-based importance: For each feature, average the number of training examples affected by splits using that feature.

Limitations of built-in importance:
  • Biased toward high-cardinality features. Features with many unique values have more potential split points and are used more often, even if they are not more predictive.
  • Does not account for feature interactions. A feature might be unimportant on its own but highly important in combination with another feature.
  • Inconsistent across methods. Gain, weight, and cover can give very different rankings.
  • No directionality. The importance tells you that a feature matters, but not the direction of its effect.

SHAP Values

SHAP (SHapley Additive exPlanations), developed by Scott Lundberg and Su-In Lee in 2017, addresses all of these limitations. SHAP is grounded in cooperative game theory, specifically the concept of Shapley values from 1953.

The Shapley value of a feature is defined as the average marginal contribution of that feature across all possible orderings of features. Formally, the SHAP value for feature $j$ for prediction $i$ is:

$$\phi_j^{(i)} = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!(|F|-|S|-1)!}{|F|!} [f(S \cup \{j\}) - f(S)]$$

where $F$ is the set of all features, $S$ is a subset not including $j$, and $f(S)$ is the model's prediction using only features in $S$.

Intuitively, SHAP asks: "For this specific prediction, how much did each feature push the prediction away from the average?" The sum of all SHAP values equals the difference between the prediction and the average prediction:

$$f(x) = \phi_0 + \sum_{j=1}^{p} \phi_j$$

where $\phi_0$ is the base value (average model output across all training data).

Key properties of SHAP:
  • Local accuracy: SHAP values sum to the actual prediction.
  • Consistency: If a feature's contribution increases, its SHAP value never decreases.
  • Missingness: Features not in the model get zero SHAP value.
  • Additivity: For ensemble models, SHAP values of the ensemble equal the sum of SHAP values of the component models.

TreeSHAP

For tree-based models, Lundberg developed TreeSHAP, an exact polynomial-time algorithm for computing SHAP values. This is dramatically faster than the exponential-time naive computation and makes SHAP practical for large models.

TreeSHAP runs in $O(TLD^2)$ time where $T$ is the number of trees, $L$ is the maximum number of leaves, and $D$ is the maximum depth. For typical XGBoost models, this allows computing SHAP values for thousands of predictions in seconds.

Partial Dependence Plots (PDPs)

A partial dependence plot shows the marginal effect of a single feature on the model's predictions, averaged over all other features:

$$\text{PDP}(x_j) = \frac{1}{n}\sum_{i=1}^{n} f(x_j, \mathbf{x}_{-j}^{(i)})$$

where $\mathbf{x}_{-j}^{(i)}$ represents all features except $j$ for the $i$-th training example. The PDP shows how the average prediction changes as feature $j$ varies, marginalizing out the effects of all other features.
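
The computation is simple enough to sketch directly; the standalone helper below averages predictions over a grid of values for one feature (model is any fitted classifier with predict_proba, and both arguments are placeholders; the ModelInterpreter class later in this section bundles a fuller toolkit):

import numpy as np
from typing import Tuple

def partial_dependence(model, X: np.ndarray, feature_idx: int,
                       grid_size: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    """Average predicted probability as feature `feature_idx` varies."""
    grid = np.linspace(X[:, feature_idx].min(),
                       X[:, feature_idx].max(), grid_size)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value   # set x_j to the grid value for every row
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return grid, np.array(averages)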

SHAP dependence plots are a more informative alternative. They plot the SHAP value of a feature against the feature's value for each individual prediction. This reveals not just the average effect but also interactions (visible as vertical spread in the plot).

LIME (Local Interpretable Model-agnostic Explanations)

LIME (Ribeiro et al., 2016) explains individual predictions by fitting a simple interpretable model (usually linear regression) in the neighborhood of the prediction:

  1. Generate perturbed samples around the instance of interest.
  2. Get the complex model's predictions for these perturbed samples.
  3. Weight the perturbed samples by their proximity to the original instance.
  4. Fit a linear model to the weighted perturbed predictions.
  5. The linear model's coefficients are the feature explanations.

LIME is model-agnostic (works with any model) but less theoretically grounded than SHAP. For tree-based models, SHAP is generally preferred due to TreeSHAP's efficiency and SHAP's stronger theoretical guarantees.

Python Implementation

import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from typing import Optional, List

try:
    import shap
    HAS_SHAP = True
except ImportError:
    HAS_SHAP = False

try:
    import lime
    import lime.lime_tabular
    HAS_LIME = True
except ImportError:
    HAS_LIME = False


class ModelInterpreter:
    """
    Comprehensive model interpretability toolkit for sports prediction.

    Provides SHAP values, feature importance, partial dependence,
    and LIME explanations.
    """

    def __init__(self, model, X_train: np.ndarray,
                 feature_names: List[str]):
        """
        Args:
            model: Fitted model (must support predict_proba)
            X_train: Training data (for background distribution)
            feature_names: List of feature names
        """
        self.model = model
        self.X_train = X_train
        self.feature_names = feature_names
        self.shap_values = None
        self.explainer = None

    def compute_shap_values(self, X: np.ndarray,
                             check_additivity: bool = True) -> np.ndarray:
        """
        Compute SHAP values for a set of predictions.

        Args:
            X: Feature matrix to explain
            check_additivity: Verify SHAP values sum to predictions

        Returns:
            SHAP values array of shape (n_samples, n_features)
        """
        if not HAS_SHAP:
            raise ImportError("shap library required")

        self.explainer = shap.TreeExplainer(self.model)
        self.shap_values = self.explainer.shap_values(X)

        # For binary classification, shap_values may be a list of 2 arrays
        if isinstance(self.shap_values, list):
            self.shap_values = self.shap_values[1]  # Class 1 (positive)

        if check_additivity:
            # Verify: base_value + sum(shap_values) ≈ model output
            base = self.explainer.expected_value
            if isinstance(base, np.ndarray):
                base = base[1]

            raw_preds = self.model.predict_proba(X)[:, 1]

            # For XGBoost classifiers, TreeSHAP attributions live in log-odds
            # (margin) space, so base + sum(SHAP) must be passed through the
            # sigmoid before comparing against predict_proba outputs.
            shap_sum = base + self.shap_values.sum(axis=1)
            from scipy.special import expit
            reconstructed = expit(shap_sum)

            max_error = np.max(np.abs(reconstructed - raw_preds))
            print(f"SHAP additivity check: max error = {max_error:.6f}")

        return self.shap_values

    def shap_summary_plot(self, X: np.ndarray,
                          save_path: Optional[str] = None) -> None:
        """
        Create SHAP summary (beeswarm) plot.

        Shows the distribution of SHAP values for each feature,
        revealing both importance and directionality.
        """
        if self.shap_values is None:
            self.compute_shap_values(X)

        plt.figure(figsize=(10, 8))
        shap.summary_plot(
            self.shap_values,
            X,
            feature_names=self.feature_names,
            show=False
        )
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        plt.close()

    def shap_bar_plot(self, X: np.ndarray,
                      save_path: Optional[str] = None) -> pd.Series:
        """
        Create SHAP bar plot showing mean absolute SHAP values.

        Returns:
            Series of mean |SHAP| per feature, sorted ascending.
        """
        if self.shap_values is None:
            self.compute_shap_values(X)

        mean_abs_shap = np.abs(self.shap_values).mean(axis=0)
        importance = pd.Series(
            mean_abs_shap, index=self.feature_names
        ).sort_values(ascending=True)

        plt.figure(figsize=(10, 6))
        importance.plot(kind='barh', color='steelblue')
        plt.xlabel('Mean |SHAP value|')
        plt.title('Feature Importance (SHAP)')
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        plt.close()

        return importance

    def explain_prediction(self, x: np.ndarray,
                           game_label: str = "Game") -> pd.DataFrame:
        """
        Explain a single prediction using SHAP values.

        Args:
            x: Feature vector for one game (1D or 2D with shape (1, n))
            game_label: Label for the game

        Returns:
            DataFrame with features, values, and SHAP contributions
        """
        if self.explainer is None:
            self.explainer = shap.TreeExplainer(self.model)

        x_2d = x.reshape(1, -1) if x.ndim == 1 else x
        sv = self.explainer.shap_values(x_2d)
        if isinstance(sv, list):
            sv = sv[1]

        base = self.explainer.expected_value
        if isinstance(base, np.ndarray):
            base = base[1]

        prediction = self.model.predict_proba(x_2d)[0, 1]

        explanation = pd.DataFrame({
            'feature': self.feature_names,
            'value': x_2d[0],
            'shap_value': sv[0],
            'abs_shap': np.abs(sv[0]),
        }).sort_values('abs_shap', ascending=False)

        print(f"\n=== Explanation: {game_label} ===")
        print(f"Prediction: {prediction:.4f}")
        print(f"Base value: {base:.4f}")
        print(f"Sum of SHAP values: {sv[0].sum():.4f}")
        print(f"\nTop contributing features:")
        for _, row in explanation.head(10).iterrows():
            direction = "+" if row['shap_value'] > 0 else "-"
            print(f"  {row['feature']:20s} = {row['value']:+8.3f}  "
                  f"SHAP: {row['shap_value']:+.4f} ({direction})")

        return explanation

    def shap_dependence_plot(self, feature_name: str, X: np.ndarray,
                              interaction_feature: Optional[str] = None,
                              save_path: Optional[str] = None) -> None:
        """
        Create SHAP dependence plot for a single feature.

        Shows how the SHAP value of a feature varies with the feature value,
        with optional coloring by an interaction feature.
        """
        if self.shap_values is None:
            self.compute_shap_values(X)

        feat_idx = self.feature_names.index(feature_name)
        interact_idx = (self.feature_names.index(interaction_feature)
                       if interaction_feature else None)

        plt.figure(figsize=(10, 6))
        shap.dependence_plot(
            feat_idx,
            self.shap_values,
            X,
            feature_names=self.feature_names,
            interaction_index=interact_idx,
            show=False
        )
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        plt.close()

    def lime_explanation(self, x: np.ndarray,
                         game_label: str = "Game") -> Optional[object]:
        """
        Generate LIME explanation for a single prediction.

        Args:
            x: Feature vector for one game
            game_label: Label for the game

        Returns:
            LIME explanation object
        """
        if not HAS_LIME:
            print("LIME library not installed; skipping")
            return None

        explainer = lime.lime_tabular.LimeTabularExplainer(
            self.X_train,
            feature_names=self.feature_names,
            class_names=['Away Win', 'Home Win'],
            mode='classification',
        )

        x_1d = x.flatten() if x.ndim > 1 else x
        exp = explainer.explain_instance(
            x_1d,
            self.model.predict_proba,
            num_features=len(self.feature_names),
        )

        print(f"\n=== LIME Explanation: {game_label} ===")
        print(f"Prediction: {self.model.predict_proba(x.reshape(1, -1))[0]}")
        print(f"\nFeature contributions:")
        for feat, weight in exp.as_list():
            print(f"  {feat}: {weight:+.4f}")

        return exp

    def compare_importances(self, X: np.ndarray) -> pd.DataFrame:
        """
        Compare feature importance from multiple methods.

        Returns DataFrame with importance from: gain, weight, cover, SHAP.
        """
        # Built-in importances
        booster = self.model.get_booster()

        importance_types = {}
        for imp_type in ['gain', 'weight', 'cover']:
            raw_imp = booster.get_score(importance_type=imp_type)
            imp_dict = {}
            for key, val in raw_imp.items():
                idx = int(key.replace('f', ''))
                if idx < len(self.feature_names):
                    imp_dict[self.feature_names[idx]] = val
            importance_types[imp_type] = imp_dict

        # SHAP importance
        if self.shap_values is None:
            self.compute_shap_values(X)

        mean_abs_shap = np.abs(self.shap_values).mean(axis=0)
        shap_imp = {self.feature_names[i]: mean_abs_shap[i]
                    for i in range(len(self.feature_names))}

        # Combine into DataFrame
        df = pd.DataFrame({
            'gain': pd.Series(importance_types.get('gain', {})),
            'weight': pd.Series(importance_types.get('weight', {})),
            'cover': pd.Series(importance_types.get('cover', {})),
            'shap': pd.Series(shap_imp),
        })

        # Normalize each column to 0-1
        for col in df.columns:
            col_max = df[col].max()
            if col_max > 0:
                df[col] = df[col] / col_max

        return df.sort_values('shap', ascending=False)


# --- Worked Example ---
if __name__ == "__main__":
    np.random.seed(42)
    n = 2000

    feature_names = [
        'elo_diff', 'off_rating_diff', 'def_rating_diff',
        'rest_days_diff', 'win_pct_diff', 'spread',
        'pace_diff', 'turnover_diff', 'rebound_diff', 'home_record'
    ]

    X = np.random.randn(n, len(feature_names))

    # Generate outcomes with known relationships
    logit = (0.003 * X[:, 0] * 100     # elo_diff (rescaled)
             + 0.15 * X[:, 1]           # off_rating_diff
             - 0.12 * X[:, 2]           # def_rating_diff
             + 0.05 * X[:, 3]           # rest_days_diff
             + 0.5 * X[:, 4]            # win_pct_diff
             + 0.04 * X[:, 5]           # spread
             + 0.02 * X[:, 6]           # pace_diff
             + 0.2 * X[:, 0] * X[:, 3]  # interaction: elo * rest
             + 0.15)                     # home advantage

    prob = 1 / (1 + np.exp(-logit))
    y = (np.random.random(n) < prob).astype(int)

    # Split
    X_train, X_test = X[:1600], X[1600:]
    y_train, y_test = y[:1600], y[1600:]

    # Train XGBoost
    model = xgb.XGBClassifier(
        n_estimators=300, max_depth=4, learning_rate=0.05,
        subsample=0.8, colsample_bytree=0.8,
        eval_metric='logloss',  # use_label_encoder is deprecated in recent xgboost
        random_state=42,
    )
    model.fit(X_train, y_train)

    # Interpret
    interpreter = ModelInterpreter(model, X_train, feature_names)

    # Global importance comparison
    print("=== Feature Importance Comparison ===\n")
    importance_df = interpreter.compare_importances(X_test)
    print(importance_df.round(3).to_string())

    # Explain a specific prediction
    game_idx = 0
    interpreter.explain_prediction(
        X_test[game_idx],
        game_label="Test Game 1"
    )

    # SHAP dependence for top feature
    interpreter.shap_dependence_plot(
        'elo_diff', X_test,
        interaction_feature='rest_days_diff',
        save_path='elo_diff_dependence.png'
    )

    # Compare built-in vs SHAP importance
    print("\n=== Key Insight ===")
    print("Notice how SHAP importance can differ from gain-based importance.")
    print("SHAP accounts for feature interactions and provides directional")
    print("information, making it more reliable for understanding model")
    print("behavior in sports prediction contexts.")

Interpreting SHAP Values for Betting Decisions

When you obtain SHAP explanations for a predicted game, you can use them to make more informed betting decisions:

Example interpretation:

Suppose your model predicts P(home win) = 0.67 for a game where the market implies 0.60. The SHAP breakdown shows:

| Feature         | Value | SHAP Contribution |
|-----------------|-------|-------------------|
| elo_diff        | +120  | +0.08             |
| rest_days_diff  | +2    | +0.05             |
| off_rating_diff | +3.2  | +0.04             |
| def_rating_diff | -1.1  | -0.02             |
| spread          | -3.5  | +0.02             |

From this, you can see that the model's edge over the market comes primarily from the Elo difference and the rest advantage. You can then ask:

  1. Is the Elo difference real? Was it earned against quality opponents or inflated by easy schedule?
  2. Is the rest advantage priced in? Most sophisticated bettors know about rest advantages; this might already be in the line.
  3. Are there factors the model is missing? Injury news, weather, motivation --- things not captured in your features.

This kind of analysis transforms model-based betting from black-box gambling into an informed analytical process.
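
As a minimal sketch of this workflow, the snippet below compares the model's probability to a market-implied probability you supply and lists the features driving the disagreement. It assumes the `interpreter` object from the implementation above, a single-game feature vector `x_game`, and a `market_prob` you derive from the posted line; all three names are illustrative.

# Quantify the edge over the market and identify what is driving it.
explanation = interpreter.explain_prediction(x_game, game_label="Example game")
model_prob = interpreter.model.predict_proba(x_game.reshape(1, -1))[0, 1]

edge = model_prob - market_prob
print(f"Model: {model_prob:.3f}  Market: {market_prob:.3f}  Edge: {edge:+.3f}")

# Top SHAP drivers of the prediction: these are the assumptions to interrogate
# (schedule strength behind elo_diff, whether rest is already priced in, etc.).
print(explanation.head(3)[['feature', 'value', 'shap_value']].to_string(index=False))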

SHAP vs. LIME vs. Built-in Importance: A Comparison

| Property                | Built-in Importance | SHAP                         | LIME                     |
|-------------------------|---------------------|------------------------------|--------------------------|
| Theoretical foundation  | Ad hoc              | Game theory (Shapley)        | Local approximation      |
| Consistency             | No                  | Yes                          | No                       |
| Local explanations      | No                  | Yes                          | Yes                      |
| Global explanations     | Yes                 | Yes                          | Approximate              |
| Model-agnostic          | No (tree-specific)  | Partial (TreeSHAP for trees) | Yes                      |
| Computational cost      | Very low            | Low (TreeSHAP)               | Moderate                 |
| Interaction detection   | No                  | Yes (interaction values)     | Limited                  |
| Directionality          | No                  | Yes                          | Yes                      |
| Recommended for betting | Baseline only       | Primary method               | Secondary / verification |

27.6 Chapter Summary

This chapter has covered the advanced machine learning techniques that form the backbone of modern quantitative sports betting systems. Each section addresses a specific challenge that separates amateur modelers from professionals.

XGBoost provides the raw predictive power. Gradient-boosted trees with second-order optimization, regularization, and column subsampling can capture the complex nonlinear relationships and feature interactions present in sports data. The key to success lies in disciplined hyperparameter tuning using time-series cross-validation (never train on future data) and thoughtful feature engineering that transforms raw statistics into predictive signals.

Random Forests and Ensemble Methods offer robustness and diversity. Random forests are harder to overfit than boosted trees and serve as an excellent complement in a stacking ensemble. The mathematical insight behind bagging --- that averaging reduces variance proportionally to the correlation between base learners --- explains why diversity matters more than individual model accuracy. Stacking, with its meta-learner trained on out-of-fold predictions, provides the principled framework for combining diverse models.

Probability Calibration is the bridge between raw model outputs and actionable betting decisions. An overconfident model sees value everywhere and bleeds money; an underconfident model misses legitimate opportunities. Platt scaling and isotonic regression provide the mathematical tools to ensure that a predicted probability of 0.65 truly corresponds to a 65% win rate. The Expected Calibration Error (ECE) quantifies this property, and regular monitoring of calibration curves ensures your model stays honest.

Handling Imbalanced Outcomes enables profitable betting on rare events --- upsets, blowouts, and extreme player performances --- where the market often misprices risk. Class weights, SMOTE, and threshold optimization each address imbalance from a different angle, but the overriding principle for bettors is clear: focus on well-calibrated probability estimates (log-loss, Brier score) rather than classification accuracy.

Model Interpretability through SHAP values transforms sports betting from a black-box exercise into an analytical discipline. SHAP's theoretical grounding in Shapley values ensures consistent, additive explanations that reveal exactly why your model makes each prediction. This enables sanity checking, edge identification, and the kind of deep understanding that distinguishes sustainable edges from data-mining artifacts.

Integration with Previous Chapters

The techniques in this chapter build directly on the rating systems from Chapter 26. Rating system outputs (Elo ratings, Glicko-2 ratings and RDs, Massey scores, PageRank values) serve as powerful features for the XGBoost and random forest models. The ensemble rating methods from Section 26.5 can be viewed as a special case of the model stacking framework developed in Section 27.2. And the probability calibration techniques from Section 27.3 should be applied to any rating-based probability estimate, including the raw Elo expected-score formula.

Decision Framework for Model Selection

| Scenario                      | Recommended Approach                                           |
|-------------------------------|----------------------------------------------------------------|
| Small dataset (< 500 games)   | Logistic regression + Elo features + Platt calibration        |
| Medium dataset (500--5000)    | XGBoost with regularization + stacking + isotonic calibration |
| Large dataset (5000+)         | Full stacking ensemble + SHAP analysis + isotonic calibration |
| Predicting rare outcomes      | Class-weighted XGBoost + calibration + PR-AUC evaluation      |
| Need for explanation          | SHAP with TreeSHAP for any tree-based model                   |
| Unbalanced schedule (college) | Include PageRank features + strength-of-schedule features     |

Key Takeaways

  1. XGBoost is the workhorse of modern sports prediction, but it must be properly regularized and validated using temporal splits.
  2. Diverse ensembles beat individual models --- combine different model families, not just different hyperparameters.
  3. Calibration is not optional. Uncalibrated probabilities lead to incorrect bet sizing and phantom value.
  4. Class imbalance requires special handling for rare-outcome markets, but always recalibrate after rebalancing.
  5. SHAP values are the gold standard for understanding model predictions and identifying genuine versus spurious edges.

Exercises

  1. XGBoost Feature Engineering. Using NBA game data, create at least 20 features (including rating system outputs from Chapter 26). Train an XGBoost model with time-series cross-validation and report log-loss, Brier score, and AUC-ROC.

  2. Stacking Depth. Compare a 2-model stack (logistic regression + XGBoost) to a 4-model stack (logistic regression + random forest + XGBoost + neural network). Does adding more models help? At what point does the meta-learner start to overfit?

  3. Calibration Impact on Profit. Train a model with and without probability calibration. Simulate flat betting on all games where your model disagrees with the market by more than 5 percentage points. Compare cumulative profit/loss for the calibrated vs. uncalibrated model.

  4. Upset Detection. Build a model specifically to predict NBA upsets (games where the team with a lower Elo rating wins). Use class-weighted XGBoost and evaluate with PR-AUC. What features are most predictive of upsets?

  5. SHAP Audit. Train an XGBoost model on NFL data. Use SHAP to identify the top 5 most important features. Then retrain the model using only those 5 features. How much performance is lost? This tells you how much of the model's knowledge is concentrated in a few key variables.