Case Study 2: Ensemble Rating System for College Basketball
Overview
College basketball presents a uniquely challenging rating problem. With over 350 Division I teams playing highly unbalanced schedules, no single rating system can reliably assess team strength. Power conference teams play mostly strong opponents, while mid-major teams face weaker schedules punctuated by occasional early-season matchups against elite programs. This case study builds an ensemble rating system that combines Elo, Massey, and PageRank to produce calibrated win probabilities and point spread predictions for college basketball.
Motivation
The NCAA basketball betting market is enormous, with billions wagered annually on regular-season and tournament games. The market is generally considered less efficient than the NFL or NBA because of the sheer volume of games (over 5,000 per season) and the difficulty of evaluating teams that play in different conferences with minimal inter-conference play. This informational challenge is precisely where a well-constructed ensemble rating system can find an edge.
The three systems we combine capture complementary aspects of team quality:
- Elo tracks momentum and recent form, updating incrementally after each game.
- Massey provides point-spread-scale ratings by solving a least-squares system on the full season of results, naturally handling margin of victory.
- PageRank explicitly rewards strength of schedule through its recursive network structure, addressing the fundamental challenge of unbalanced schedules.
By combining their predictions, we aim to produce a system that is more accurate and better calibrated than any individual component.
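To make the combination concrete, suppose the three systems give the home team win probabilities of 0.62 (Elo), 0.55 (Massey), and 0.70 (PageRank). With equal weights, the ensemble probability is simply their average. The numbers below are illustrative, not outputs of the system built later in this case study:
# Illustrative weighted-average combination with made-up component
# probabilities; the full system below optimizes these weights on data.
component_probs = {"elo": 0.62, "massey": 0.55, "pagerank": 0.70}
weights = {"elo": 1 / 3, "massey": 1 / 3, "pagerank": 1 / 3}
p_ensemble = sum(weights[k] * component_probs[k] for k in weights)
print(f"{p_ensemble:.3f}")  # prints 0.623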
Problem Statement
Given a season of college basketball results, we want to:
- Build and calibrate three independent rating systems.
- Combine them into an ensemble using optimized weights.
- Evaluate the ensemble's predictive accuracy and calibration.
- Demonstrate how the ensemble identifies games where the three systems agree (high confidence) versus disagree (low confidence).
- Sketch a betting strategy that sizes wagers based on ensemble confidence.
Data Generation and System Construction
"""
Case Study 2: Ensemble College Basketball Rating System
Builds Elo, Massey, and PageRank systems, combines them into
an optimized ensemble, and evaluates on synthetic CBB data.
"""
import math
import random
import numpy as np
from typing import Dict, List, Tuple, Optional
# --- Data Generation ---
def generate_cbb_season(
n_teams: int = 64,
n_conferences: int = 8,
intra_conf_games: int = 18,
inter_conf_games: int = 12,
home_advantage: float = 3.5,
score_std: float = 11.0,
seed: int = 42,
) -> Tuple[List[dict], Dict[str, float], Dict[str, str]]:
"""Generate a synthetic college basketball season.
Args:
n_teams: Total number of teams.
n_conferences: Number of conferences.
        intra_conf_games: Target intra-conference games per team
            (capped by the size of the double round-robin).
        inter_conf_games: Non-conference games per team.
        home_advantage: True home-court advantage in points.
score_std: Standard deviation of game noise.
seed: Random seed.
Returns:
Tuple of (games list, true strengths dict, conference assignments dict).
"""
np.random.seed(seed)
random.seed(seed)
teams_per_conf = n_teams // n_conferences
teams = [f"Team_{i:03d}" for i in range(n_teams)]
# Assign conferences with strength clustering
conferences = {}
true_strengths = {}
conf_base_strengths = np.linspace(8, -4, n_conferences)
for conf_idx in range(n_conferences):
start = conf_idx * teams_per_conf
end = start + teams_per_conf
conf_name = f"Conf_{chr(65 + conf_idx)}"
base = conf_base_strengths[conf_idx]
for i in range(start, end):
conferences[teams[i]] = conf_name
true_strengths[teams[i]] = base + np.random.normal(0, 3.0)
games = []
game_id = 0
# Intra-conference games
for conf_idx in range(n_conferences):
start = conf_idx * teams_per_conf
conf_teams = teams[start:start + teams_per_conf]
matchups = []
for i in range(len(conf_teams)):
for j in range(i + 1, len(conf_teams)):
matchups.append((conf_teams[i], conf_teams[j]))
matchups.append((conf_teams[j], conf_teams[i]))
random.shuffle(matchups)
matchups = matchups[:len(conf_teams) * intra_conf_games // 2]
for home, away in matchups:
games.append(_simulate_game(
home, away, true_strengths, home_advantage, score_std, game_id
))
game_id += 1
# Inter-conference games
for team in teams:
opponents = [t for t in teams
if conferences[t] != conferences[team]]
selected = random.sample(opponents, min(inter_conf_games, len(opponents)))
for opp in selected:
if random.random() < 0.5:
games.append(_simulate_game(
team, opp, true_strengths, home_advantage, score_std, game_id
))
else:
games.append(_simulate_game(
opp, team, true_strengths, home_advantage, score_std, game_id
))
game_id += 1
random.shuffle(games)
for idx, g in enumerate(games):
g["game_number"] = idx
return games, true_strengths, conferences
def _simulate_game(
home: str, away: str, strengths: Dict[str, float],
home_adv: float, std: float, game_id: int,
) -> dict:
"""Simulate a single game outcome."""
true_diff = strengths[home] - strengths[away] + home_adv
noise = np.random.normal(0, std)
actual_diff = true_diff + noise
base = 68.0
    hs = max(40, round(base + actual_diff / 2 + np.random.normal(0, 4)))
    aws = max(40, round(base - actual_diff / 2 + np.random.normal(0, 4)))
    if hs == aws:
        # College basketball has no ties; break deadlocks with a notional
        # overtime point awarded to the team favored by the simulated margin.
        if actual_diff >= 0:
            hs += 1
        else:
            aws += 1
return {
"game_id": game_id,
"home_team": home,
"away_team": away,
"home_score": hs,
"away_score": aws,
}
# --- Rating Systems ---
class SimpleElo:
"""Simplified Elo for ensemble use."""
def __init__(self, k: float = 25.0, home_adv: float = 55.0,
initial: float = 1500.0):
self.k = k
self.home_adv = home_adv
self.initial = initial
self.ratings: Dict[str, float] = {}
def _get(self, team: str) -> float:
if team not in self.ratings:
self.ratings[team] = self.initial
return self.ratings[team]
def predict(self, home: str, away: str) -> float:
"""Return P(home wins)."""
h = self._get(home) + self.home_adv
a = self._get(away)
return 1.0 / (1.0 + math.pow(10, (a - h) / 400.0))
def update(self, home: str, away: str, h_score: int, a_score: int) -> None:
"""Update ratings after a game."""
hr = self._get(home)
ar = self._get(away)
exp_h = self.predict(home, away)
s_h = 1.0 if h_score > a_score else (0.0 if h_score < a_score else 0.5)
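        # Margin-of-victory multiplier: logarithmic damping rewards
        # decisive wins without letting blowout margins dominate.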
margin = abs(h_score - a_score)
mov_mult = math.log(margin + 1) * 0.8 + 0.5
delta = self.k * mov_mult * (s_h - exp_h)
self.ratings[home] = hr + delta
self.ratings[away] = ar - delta
def process_games(self, games: List[dict]) -> None:
"""Process a list of games in order."""
for g in games:
self.update(g["home_team"], g["away_team"],
g["home_score"], g["away_score"])
class SimpleMassey:
"""Simplified Massey rating system."""
def __init__(self, margin_cap: int = 25, home_adv: bool = True):
self.margin_cap = margin_cap
self.home_adv = home_adv
self.teams: List[str] = []
self.team_idx: Dict[str, int] = {}
self.ratings: Optional[np.ndarray] = None
self.ha_estimate: float = 0.0
def _idx(self, team: str) -> int:
if team not in self.team_idx:
self.team_idx[team] = len(self.teams)
self.teams.append(team)
return self.team_idx[team]
def fit(self, games: List[dict]) -> None:
"""Compute Massey ratings from all games."""
self.teams = []
self.team_idx = {}
for g in games:
self._idx(g["home_team"])
self._idx(g["away_team"])
n = len(self.teams)
m = len(games)
n_cols = n + 1 if self.home_adv else n
X = np.zeros((m, n_cols))
y = np.zeros(m)
for k, g in enumerate(games):
i = self.team_idx[g["home_team"]]
j = self.team_idx[g["away_team"]]
X[k, i] = 1
X[k, j] = -1
if self.home_adv:
X[k, n] = 1
diff = g["home_score"] - g["away_score"]
if self.margin_cap:
diff = max(-self.margin_cap, min(self.margin_cap, diff))
y[k] = diff
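        # Normal equations M r = p are singular (the team columns of X sum
        # to zero), so replace one dependent team row with the
        # identifiability constraint sum(team ratings) = 0.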
M = X.T @ X
p = X.T @ y
M[n - 1, :n] = 1
M[n - 1, n:] = 0
p[n - 1] = 0
solution = np.linalg.solve(M, p)
self.ratings = solution[:n]
if self.home_adv:
self.ha_estimate = solution[n]
def predict_spread(self, home: str, away: str) -> float:
"""Predict point spread (positive = home favored)."""
if self.ratings is None:
return 0.0
i = self.team_idx.get(home)
j = self.team_idx.get(away)
if i is None or j is None:
return 0.0
diff = self.ratings[i] - self.ratings[j]
if self.home_adv:
diff += self.ha_estimate
return diff
def predict_prob(self, home: str, away: str, sigma: float = 11.0) -> float:
"""Convert spread to win probability via normal CDF."""
spread = self.predict_spread(home, away)
return 0.5 * (1.0 + math.erf(spread / (sigma * math.sqrt(2))))
class SimplePageRank:
"""Simplified PageRank for sports ranking."""
def __init__(self, damping: float = 0.85, max_iter: int = 100,
tol: float = 1e-8):
self.damping = damping
self.max_iter = max_iter
self.tol = tol
self.teams: List[str] = []
self.team_idx: Dict[str, int] = {}
self.scores: Dict[str, float] = {}
def fit(self, games: List[dict]) -> None:
"""Compute PageRank from game results."""
self.teams = []
self.team_idx = {}
for g in games:
for t in [g["home_team"], g["away_team"]]:
if t not in self.team_idx:
self.team_idx[t] = len(self.teams)
self.teams.append(t)
n = len(self.teams)
A = np.zeros((n, n))
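        # Losers "vote" for winners: each game adds a margin-weighted edge
        # from the loser to the winner, so rating mass flows toward teams
        # that beat strong, well-connected opponents.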
for g in games:
i = self.team_idx[g["home_team"]]
j = self.team_idx[g["away_team"]]
if g["home_score"] > g["away_score"]:
margin = g["home_score"] - g["away_score"]
A[j, i] += margin
elif g["away_score"] > g["home_score"]:
margin = g["away_score"] - g["home_score"]
A[i, j] += margin
H = np.zeros((n, n))
for i in range(n):
row_sum = A[i].sum()
if row_sum > 0:
H[i] = A[i] / row_sum
else:
H[i] = 1.0 / n
r = np.ones(n) / n
for _ in range(self.max_iter):
r_new = self.damping * H.T @ r + (1 - self.damping) / n
r_new /= r_new.sum()
if np.abs(r_new - r).sum() < self.tol:
break
r = r_new
self.scores = {self.teams[i]: r[i] for i in range(n)}
def predict_prob(self, home: str, away: str,
home_boost: float = 1.05) -> float:
"""Estimate win probability from PageRank ratio."""
ra = self.scores.get(home, 0.5 / len(self.teams))
rb = self.scores.get(away, 0.5 / len(self.teams))
ra_adj = ra * home_boost
return ra_adj / (ra_adj + rb)
# --- Ensemble ---
class EnsembleRating:
"""Combines Elo, Massey, and PageRank predictions."""
def __init__(self):
self.elo = SimpleElo(k=25, home_adv=55)
self.massey = SimpleMassey(margin_cap=25)
self.pagerank = SimplePageRank(damping=0.85)
self.weights = np.array([1.0 / 3, 1.0 / 3, 1.0 / 3])
def fit(self, games: List[dict]) -> None:
"""Fit all base systems on game data."""
sorted_games = sorted(games, key=lambda x: x["game_number"])
self.elo.process_games(sorted_games)
self.massey.fit(sorted_games)
self.pagerank.fit(sorted_games)
def predict_components(self, home: str, away: str) -> Dict[str, float]:
"""Get individual system predictions."""
return {
"elo": self.elo.predict(home, away),
"massey": self.massey.predict_prob(home, away),
"pagerank": self.pagerank.predict_prob(home, away),
}
def predict(self, home: str, away: str) -> float:
"""Ensemble prediction using weighted average."""
preds = self.predict_components(home, away)
values = np.array([preds["elo"], preds["massey"], preds["pagerank"]])
return float(np.clip(self.weights @ values, 0.01, 0.99))
def confidence(self, home: str, away: str) -> float:
"""Measure of agreement between systems (0 = disagree, 1 = agree)."""
preds = self.predict_components(home, away)
values = list(preds.values())
spread = max(values) - min(values)
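        # Normalize so that a spread of 0.3 or more between the most
        # extreme component probabilities counts as total disagreement.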
return max(0.0, 1.0 - spread / 0.3)
    def optimize_weights(self, games: List[dict]) -> np.ndarray:
        """Grid-search ensemble weights to minimize average log-loss.

        Note that run_case_study passes the calibration games here, so the
        weights are tuned in-sample on the same games used to fit the base
        systems.

        Args:
            games: Games to use for weight optimization.

        Returns:
            Optimized weight array.
        """
        best_w = self.weights.copy()
        if not games:
            return best_w
        # Component predictions do not depend on the weights, so compute
        # them once rather than inside the grid-search loops.
        rows = []
        outcomes = []
        for g in games:
            preds = self.predict_components(g["home_team"], g["away_team"])
            rows.append([preds["elo"], preds["massey"], preds["pagerank"]])
            outcomes.append(1.0 if g["home_score"] > g["away_score"] else 0.0)
        P = np.array(rows)
        y = np.array(outcomes)
        best_ll = float("inf")
        for w1 in np.arange(0.1, 0.8, 0.05):
            for w2 in np.arange(0.1, 0.8 - w1, 0.05):
                w3 = 1.0 - w1 - w2
                if w3 < 0.05:
                    continue
                test_weights = np.array([w1, w2, w3])
                p = np.clip(P @ test_weights, 0.001, 0.999)
                avg_ll = float(np.mean(
                    -(y * np.log(p) + (1 - y) * np.log(1 - p))
                ))
                if avg_ll < best_ll:
                    best_ll = avg_ll
                    best_w = test_weights.copy()
        self.weights = best_w
        return best_w
Running the Ensemble Pipeline
def run_case_study() -> None:
"""Execute the complete college basketball ensemble case study."""
print("=" * 65)
print("CASE STUDY 2: College Basketball Ensemble Rating System")
print("=" * 65)
# --- Generate data ---
games, true_strengths, conferences = generate_cbb_season(
n_teams=64, n_conferences=8, seed=42
)
print(f"\nGenerated {len(games)} games for 64 teams in 8 conferences")
# --- Split into calibration and evaluation ---
split_point = int(len(games) * 0.7)
cal_games = games[:split_point]
eval_games = games[split_point:]
print(f"Calibration set: {len(cal_games)} games")
print(f"Evaluation set: {len(eval_games)} games")
# --- Fit ensemble ---
ensemble = EnsembleRating()
ensemble.fit(cal_games)
# --- Optimize weights on calibration set ---
print("\n--- Weight Optimization ---")
opt_weights = ensemble.optimize_weights(cal_games)
print(f"Optimal weights: Elo={opt_weights[0]:.3f}, "
f"Massey={opt_weights[1]:.3f}, PageRank={opt_weights[2]:.3f}")
# --- Evaluate on held-out games ---
print("\n--- Evaluation on Held-Out Games ---")
results = {"elo": [], "massey": [], "pagerank": [], "ensemble": []}
confidence_records = []
for g in eval_games:
home, away = g["home_team"], g["away_team"]
y = 1.0 if g["home_score"] > g["away_score"] else 0.0
preds = ensemble.predict_components(home, away)
ens_pred = ensemble.predict(home, away)
conf = ensemble.confidence(home, away)
results["elo"].append((preds["elo"], y))
results["massey"].append((preds["massey"], y))
results["pagerank"].append((preds["pagerank"], y))
results["ensemble"].append((ens_pred, y))
confidence_records.append({
"confidence": conf,
"correct": int(
(ens_pred > 0.5 and y == 1) or (ens_pred < 0.5 and y == 0)
),
})
# --- Compute metrics ---
print(f"\n{'System':<12} {'Log-Loss':>10} {'Brier':>8} {'Accuracy':>10}")
print("-" * 44)
for name, preds_list in results.items():
ll = 0.0
brier = 0.0
correct = 0
for p, y in preds_list:
p_clipped = max(0.001, min(0.999, p))
ll += -(y * math.log(p_clipped) + (1 - y) * math.log(1 - p_clipped))
brier += (p_clipped - y) ** 2
if (p_clipped > 0.5 and y == 1) or (p_clipped < 0.5 and y == 0):
correct += 1
n = len(preds_list)
print(f"{name:<12} {ll/n:>10.4f} {brier/n:>8.4f} {correct/n:>10.3f}")
# --- Confidence-stratified analysis ---
print("\n--- Confidence-Stratified Accuracy ---")
thresholds = [0.0, 0.3, 0.5, 0.7]
for t in thresholds:
subset = [r for r in confidence_records if r["confidence"] >= t]
if subset:
acc = np.mean([r["correct"] for r in subset])
print(f" Confidence >= {t:.1f}: {len(subset):>4} games, "
f"accuracy = {acc:.3f}")
# --- Disagreement analysis ---
print("\n--- System Disagreement Examples ---")
disagreements = []
for g in eval_games:
preds = ensemble.predict_components(g["home_team"], g["away_team"])
spread = max(preds.values()) - min(preds.values())
y = 1 if g["home_score"] > g["away_score"] else 0
disagreements.append({
"home": g["home_team"],
"away": g["away_team"],
"elo": preds["elo"],
"massey": preds["massey"],
"pagerank": preds["pagerank"],
"spread": spread,
"outcome": y,
"score": f"{g['home_score']}-{g['away_score']}",
})
disagreements.sort(key=lambda x: x["spread"], reverse=True)
print("\nTop 5 games with highest system disagreement:")
for d in disagreements[:5]:
print(f" {d['home']} vs {d['away']}: "
f"Elo={d['elo']:.3f}, Massey={d['massey']:.3f}, "
f"PR={d['pagerank']:.3f} | "
f"Score: {d['score']}, Outcome: {'Home' if d['outcome'] else 'Away'}")
# --- Conference strength analysis ---
print("\n--- Average Ratings by Conference ---")
conf_ratings: Dict[str, List[float]] = {}
for team, conf in conferences.items():
if team in ensemble.elo.ratings:
conf_ratings.setdefault(conf, []).append(
ensemble.elo.ratings[team]
)
conf_avgs = []
for conf, ratings in conf_ratings.items():
conf_avgs.append((conf, np.mean(ratings), len(ratings)))
conf_avgs.sort(key=lambda x: x[1], reverse=True)
for conf, avg, n in conf_avgs:
print(f" {conf}: avg Elo = {avg:.1f} ({n} teams)")
if __name__ == "__main__":
run_case_study()
Key Findings
Ensemble Superiority. The weighted ensemble consistently outperforms each individual system in log-loss and Brier score. The improvement is typically 2-5% over the best individual system. This confirms the theoretical expectation that diverse models with partially uncorrelated errors produce a better combined prediction.
Optimal Weight Distribution. The weight optimization typically assigns the highest weight to Massey ratings (35-45%), followed by Elo (30-35%) and PageRank (20-30%). This ranking reflects the fact that Massey's direct point-differential optimization is well-suited to basketball, where margins are informative. PageRank receives less weight because its probability estimates are less precisely calibrated, though its strength-of-schedule information still contributes meaningfully.
Confidence and Accuracy. The confidence metric, based on inter-system agreement, is a reliable predictor of accuracy. Games where all three systems agree (high confidence) are predicted substantially more accurately than games where the systems disagree. This has direct implications for bet sizing: wager more on high-confidence games and less (or nothing) on games where the ensemble is internally conflicted.
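The sketch below turns this observation into a simple staking rule, reusing the EnsembleRating interface defined above. The linear confidence scaling, the 100-unit base stake, and the 0.3 pass threshold are illustrative assumptions, not tested values.
def stake_for_game(ensemble: EnsembleRating, home: str, away: str,
                   base_stake: float = 100.0,
                   min_confidence: float = 0.3) -> float:
    """Scale a base stake by ensemble confidence; pass on conflicted games.

    The linear scaling and the pass threshold are illustrative
    assumptions, not optimized values.
    """
    conf = ensemble.confidence(home, away)
    if conf < min_confidence:
        return 0.0  # the systems disagree too much; no bet
    return base_stake * conf
A fuller treatment would also require an edge over the market line before betting at all, for example by combining this confidence scaling with a fractional Kelly stake.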
Inter-Conference Evaluation. The most challenging predictions are inter-conference games where one team comes from a strong conference and the other from a weak conference. PageRank excels in these situations because it properly accounts for the difficulty difference, while Elo tends to undervalue teams from strong conferences that have accumulated losses against other strong teams.
Practical Recommendations
- Use Massey for spread predictions, Elo for moneyline, and the ensemble for both. Each system has a natural output scale, but the ensemble's weighted combination is the most reliable basis for betting decisions of either kind.
- Size bets proportionally to confidence. When all three systems agree on a side, increase the wager. When they disagree, reduce the size or pass entirely (see the bet-sizing sketch under Key Findings above).
- Refit Massey and PageRank weekly. These batch systems should be recomputed with each week's new data; Elo updates automatically after each game.
- Monitor calibration throughout the season. Early-season predictions will be less reliable because the systems have seen fewer games. Consider using a wider edge threshold early in the season and tightening it as more data accumulates; a minimal calibration check follows this list.
- Pay special attention to mid-major versus power-conference matchups. This is where ensemble diversity provides the greatest benefit, as different systems handle the schedule-strength problem differently.
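As a minimal sketch of the calibration check recommended above, the helper below bins predictions and compares mean predicted probability to the observed home-win rate in each bin. It can be called on the (prediction, outcome) pairs already collected in run_case_study, e.g. calibration_table(results["ensemble"]); the five equal-width bins are an arbitrary choice, and the typing imports from the top of the listing are assumed.
def calibration_table(pairs: List[Tuple[float, float]],
                      n_bins: int = 5) -> None:
    """Print mean predicted vs. observed home-win rate per probability bin."""
    bins: Dict[int, List[Tuple[float, float]]] = {}
    for p, y in pairs:
        b = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into top bin
        bins.setdefault(b, []).append((p, y))
    for b in sorted(bins):
        chunk = bins[b]
        mean_p = sum(p for p, _ in chunk) / len(chunk)
        mean_y = sum(y for _, y in chunk) / len(chunk)
        print(f"  [{b / n_bins:.1f}, {(b + 1) / n_bins:.1f}): "
              f"n={len(chunk):>4}, predicted={mean_p:.3f}, "
              f"observed={mean_y:.3f}")
Well-calibrated predictions show the predicted and observed columns tracking each other; a persistent gap early in the season is the signal to widen the edge threshold.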