Chapter 26: Ratings and Ranking Systems
Every sports bettor eventually confronts the same fundamental question: how strong is each team, really? The point spread, the moneyline, and the over/under all derive from some underlying assessment of relative team quality. Bookmakers invest enormous resources into answering this question with precision. Bettors who want to compete must develop their own rigorous, quantitative systems for measuring team strength and converting those measurements into probabilities.
This chapter explores the mathematical foundations of rating and ranking systems --- from the venerable Elo system born in competitive chess, through the uncertainty-aware Glicko framework, to the linear-algebraic elegance of Massey ratings, and finally to network-based approaches inspired by Google's PageRank algorithm. Each system encodes different assumptions about competition, and each offers distinct advantages for the sports bettor. By the end of this chapter, you will have implemented each system from scratch in Python and learned how to combine them into ensemble ratings that outperform any individual approach.
26.1 The Elo Rating System
Historical Origins
The Elo rating system was developed by Arpad Elo, a Hungarian-American physics professor and chess master, in the 1960s. The United States Chess Federation adopted his system in 1960, and FIDE (the international chess governing body) followed in 1970. Elo's fundamental insight was elegant: every player (or team) possesses a latent strength that can be estimated from the outcomes of their games. When two players meet, the difference in their ratings determines the expected outcome, and ratings are updated based on how the actual result compares to that expectation.
The system has since been adopted far beyond chess. FIFA used an Elo variant for its world rankings from 2018 onward. FiveThirtyEight popularized Elo for NFL, NBA, and MLB forecasting. Competitive gaming platforms use Elo or Elo-derived systems for matchmaking. For sports bettors, Elo provides a transparent, mathematically grounded framework for estimating win probabilities --- the raw material from which profitable wagers are identified.
Mathematical Formulation
Let $R_A$ and $R_B$ denote the current ratings of teams A and B, respectively. The expected score for team A when facing team B is:
$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$
This is a logistic function. The constant 400 is a scaling parameter inherited from chess: a 400-point rating difference corresponds to an expected score of approximately 0.91 (about a 91% win probability). The expected score for team B is simply:
$$E_B = 1 - E_A = \frac{1}{1 + 10^{(R_A - R_B)/400}}$$
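A quick numeric check of these formulas (a standalone sketch, separate from the chapter's full implementation below):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Elo expected score for A against B: 1 / (1 + 10^((R_B - R_A)/400))."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

print(expected_score(1500, 1500))            # 0.5 for equal ratings
print(round(expected_score(1900, 1500), 3))  # 0.909 for a 400-point edge
# E_A + E_B is always 1 (up to float error)
print(round(expected_score(1600, 1450) + expected_score(1450, 1600), 6))  # 1.0
```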
After the game is played, each team's rating is updated:
$$R_A^{\text{new}} = R_A + K \cdot (S_A - E_A)$$
$$R_B^{\text{new}} = R_B + K \cdot (S_B - E_B)$$
where $S_A$ is the actual score (1 for a win, 0 for a loss, 0.5 for a draw) and $K$ is the update factor that controls how much a single game shifts the rating.
The system is zero-sum: the total rating points in the system are conserved (assuming no new teams enter or leave). If team A gains $\Delta$ points, team B loses $\Delta$ points.
The K-Factor: Sensitivity and Responsiveness
The K-factor is the single most important tuning parameter in an Elo system. It governs the tradeoff between responsiveness (how quickly ratings react to new results) and stability (how resistant ratings are to noise from individual game outcomes).
| K-Factor | Behavior | Best For |
|---|---|---|
| 5--10 | Very stable, slow to update | Large sample sizes, long seasons (MLB) |
| 15--25 | Moderate responsiveness | Standard applications (NBA, NFL regular season) |
| 30--40 | Highly responsive | Small sample sizes, volatile sports |
| 40+ | Extremely reactive | Early season, rapid form changes |
In chess, FIDE uses $K = 40$ for new players (under 30 games), $K = 20$ for established players below 2400 rating, and $K = 10$ for elite players above 2400. For sports betting applications, the optimal K-factor depends on the sport's inherent variance and sample size. NFL football, with only 16--17 regular-season games and high single-game variance, typically benefits from higher K-factors (20--30) compared to NBA basketball, where an 82-game season and lower single-game variance favor K-factors of 10--20.
Adapting Elo for Sports
Raw Elo was designed for chess: a two-player game with no home advantage and no meaningful margin of victory (you win, lose, or draw). Sports require several adaptations.
Home-Field Advantage. Most sports exhibit a measurable home advantage. We incorporate this by adding a constant $H$ to the home team's rating before computing expected scores:
$$E_{\text{home}} = \frac{1}{1 + 10^{(R_{\text{away}} - (R_{\text{home}} + H))/400}}$$
The value of $H$ should be calibrated empirically. In the NFL, home teams win approximately 56--57% of games, corresponding to $H \approx 48$ rating points. In the NBA, home teams win approximately 58--60% of games, corresponding to $H \approx 70$ rating points. In European soccer, home advantage is typically $H \approx 60$--$100$ depending on the league.
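The calibration can be done by inverting the expected-score formula: given an observed home win rate $p$ between roughly equal teams, $H = 400 \log_{10}(p/(1-p))$. A minimal sketch (the win rates below are illustrative):

```python
import math

def home_advantage_points(home_win_rate: float) -> float:
    """Solve E = 1 / (1 + 10^(-H/400)) for H, given E = home_win_rate."""
    return 400.0 * math.log10(home_win_rate / (1.0 - home_win_rate))

print(round(home_advantage_points(0.57), 1))  # 49.0 Elo points (NFL-like)
print(round(home_advantage_points(0.60), 1))  # 70.4 Elo points (NBA-like)
```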
Margin of Victory (MOV). In chess, a win is a win. In sports, beating an opponent by 30 points conveys more information than winning by 1 point. We can incorporate margin of victory through a multiplier on the K-factor:
$$M = \ln(|MOV| + 1) \cdot \frac{2.2}{(\text{Elo}_{\text{diff}} \cdot 0.001) + 2.2}$$
where $MOV$ is the point differential and $\text{Elo}_{\text{diff}}$ is the Elo difference between the winning team and the losing team. The logarithm prevents blowouts from distorting ratings excessively. The denominator is an autocorrelation adjustment that prevents strong teams from being rewarded too much for running up the score against weak opponents (since large margins are already expected in such matchups).
The update becomes:
$$R_A^{\text{new}} = R_A + K \cdot M \cdot (S_A - E_A)$$
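To see how the multiplier behaves, here is the expression above evaluated at a few margins (a standalone sketch):

```python
import math

def mov_multiplier(margin: int, winner_elo_diff: float) -> float:
    """M = ln(|MOV| + 1) * 2.2 / (winner_elo_diff * 0.001 + 2.2)."""
    return math.log(abs(margin) + 1) * (2.2 / (winner_elo_diff * 0.001 + 2.2))

print(round(mov_multiplier(3, 0.0), 2))     # 1.39: a close win between equals
print(round(mov_multiplier(21, 0.0), 2))    # 3.09: a blowout, but only ~2.2x more
print(round(mov_multiplier(21, 200.0), 2))  # 2.83: discounted for a big favorite
```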
Season-to-Season Regression. At the start of a new season, teams undergo roster changes, coaching changes, and other transitions. It is standard practice to regress each team's rating toward the mean between seasons:
$$R_{\text{new season}} = R_{\text{end of season}} \cdot (1 - \alpha) + R_{\text{mean}} \cdot \alpha$$
A typical regression factor is $\alpha = 1/3$, meaning each team carries two-thirds of its end-of-season rating and one-third of the league average. This prevents stale ratings from dominating early-season predictions.
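As a worked example of the regression formula with $\alpha = 1/3$:

```python
def regress_rating(end_rating: float, league_mean: float = 1500.0,
                   alpha: float = 1.0 / 3.0) -> float:
    """Carry (1 - alpha) of last season's rating plus alpha of the mean."""
    return end_rating * (1.0 - alpha) + league_mean * alpha

print(round(regress_rating(1650.0), 1))  # 1600.0: a strong team gives back a third
print(round(regress_rating(1380.0), 1))  # 1420.0: a weak team claws a third back
```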
Python Implementation from Scratch
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Optional
@dataclass
class EloSystem:
"""
A complete Elo rating system adapted for sports betting.
Attributes:
k_factor: Base K-factor for rating updates
home_advantage: Rating points added for home team
initial_rating: Starting rating for new teams
use_mov: Whether to use margin-of-victory adjustment
season_regression: Fraction to regress toward mean between seasons
"""
k_factor: float = 20.0
home_advantage: float = 48.0
initial_rating: float = 1500.0
use_mov: bool = True
season_regression: float = 1/3
ratings: Dict[str, float] = field(default_factory=dict)
history: List[dict] = field(default_factory=list)
def get_rating(self, team: str) -> float:
"""Get current rating for a team, initializing if needed."""
if team not in self.ratings:
self.ratings[team] = self.initial_rating
return self.ratings[team]
def expected_score(self, rating_a: float, rating_b: float) -> float:
"""
Compute expected score for team A against team B.
E_A = 1 / (1 + 10^((R_B - R_A) / 400))
"""
return 1.0 / (1.0 + math.pow(10, (rating_b - rating_a) / 400.0))
def mov_multiplier(self, margin: int, elo_diff: float) -> float:
"""
Margin of victory multiplier.
Uses log scaling with autocorrelation adjustment to prevent
strong teams from being over-rewarded for blowout wins.
"""
return math.log(abs(margin) + 1) * (2.2 / (elo_diff * 0.001 + 2.2))
def predict(self, home_team: str, away_team: str) -> Tuple[float, float]:
"""
Predict win probabilities for a matchup.
Returns:
(home_win_prob, away_win_prob)
"""
home_rating = self.get_rating(home_team) + self.home_advantage
away_rating = self.get_rating(away_team)
home_prob = self.expected_score(home_rating, away_rating)
return home_prob, 1.0 - home_prob
def update(self, home_team: str, away_team: str,
home_score: int, away_score: int) -> dict:
"""
Update ratings after a game result.
Args:
home_team: Name of the home team
away_team: Name of the away team
home_score: Points scored by home team
away_score: Points scored by away team
Returns:
Dictionary with pre-game predictions and rating changes
"""
home_rating = self.get_rating(home_team)
away_rating = self.get_rating(away_team)
# Compute expected scores with home advantage
home_expected = self.expected_score(
home_rating + self.home_advantage, away_rating
)
away_expected = 1.0 - home_expected
# Determine actual scores (1 = win, 0 = loss, 0.5 = tie)
if home_score > away_score:
home_actual, away_actual = 1.0, 0.0
elif home_score < away_score:
home_actual, away_actual = 0.0, 1.0
else:
home_actual, away_actual = 0.5, 0.5
        # Compute K-factor with optional MOV adjustment
        k = self.k_factor
        if self.use_mov:
            margin = abs(home_score - away_score)
            # Guard against ties: ln(0 + 1) = 0 would zero out the update
            if margin > 0:
                winner_elo = home_rating if home_score > away_score else away_rating
                loser_elo = away_rating if home_score > away_score else home_rating
                elo_diff = winner_elo - loser_elo
                k *= self.mov_multiplier(margin, elo_diff)
# Update ratings
home_delta = k * (home_actual - home_expected)
away_delta = k * (away_actual - away_expected)
self.ratings[home_team] = home_rating + home_delta
self.ratings[away_team] = away_rating + away_delta
result = {
'home_team': home_team,
'away_team': away_team,
'home_score': home_score,
'away_score': away_score,
'home_expected': round(home_expected, 4),
'away_expected': round(away_expected, 4),
'home_delta': round(home_delta, 2),
'away_delta': round(away_delta, 2),
'home_rating_new': round(self.ratings[home_team], 1),
'away_rating_new': round(self.ratings[away_team], 1),
}
self.history.append(result)
return result
def regress_to_mean(self):
"""Apply between-season regression toward the mean rating."""
mean_rating = sum(self.ratings.values()) / len(self.ratings)
for team in self.ratings:
self.ratings[team] = (
self.ratings[team] * (1 - self.season_regression)
+ mean_rating * self.season_regression
)
def get_rankings(self) -> List[Tuple[str, float]]:
"""Return teams sorted by rating in descending order."""
return sorted(self.ratings.items(), key=lambda x: x[1], reverse=True)
# --- Worked Example ---
if __name__ == "__main__":
elo = EloSystem(k_factor=20, home_advantage=48, use_mov=True)
# Simulate a short season
games = [
("Chiefs", "Ravens", 27, 20),
("Bills", "Dolphins", 31, 17),
("Ravens", "Bills", 24, 21),
("Dolphins", "Chiefs", 14, 35),
("Chiefs", "Bills", 17, 20),
("Ravens", "Dolphins", 28, 10),
]
print("=== Elo Rating System: NFL Example ===\n")
for home, away, hs, as_ in games:
result = elo.update(home, away, hs, as_)
print(f"{home} {hs} - {as_} {away}")
print(f" Expected: {result['home_expected']:.3f} vs {result['away_expected']:.3f}")
print(f" Rating changes: {home} {result['home_delta']:+.1f} -> "
f"{result['home_rating_new']:.1f}, "
f"{away} {result['away_delta']:+.1f} -> "
f"{result['away_rating_new']:.1f}\n")
print("Final Rankings:")
for rank, (team, rating) in enumerate(elo.get_rankings(), 1):
print(f" {rank}. {team}: {rating:.1f}")
# Predict next game
home_prob, away_prob = elo.predict("Bills", "Chiefs")
print(f"\nPrediction: Bills (home) vs Chiefs")
print(f" Bills win probability: {home_prob:.3f}")
print(f" Chiefs win probability: {away_prob:.3f}")
print(f" Implied fair moneyline Bills: {-100*home_prob/(1-home_prob):+.0f}"
if home_prob > 0.5
else f" Implied fair moneyline Bills: +{100*(1-home_prob)/home_prob:.0f}")
Converting Elo to Betting Lines
The expected score function directly yields a win probability. To convert to American odds:
$$\text{If } p > 0.5: \quad \text{ML} = -\frac{100p}{1-p}$$
$$\text{If } p < 0.5: \quad \text{ML} = +\frac{100(1-p)}{p}$$
To convert an Elo difference to a point spread, we need the empirical relationship between Elo margin and scoring margin. In the NFL, each Elo point corresponds to roughly 0.035--0.04 points of scoring margin. So a 100-point Elo advantage (including home advantage) translates to roughly a 3.5--4 point spread.
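These conversions can be sketched as follows; the points-per-Elo constant is the empirical NFL figure from the text and would need recalibration for other sports:

```python
def fair_moneyline(p: float) -> int:
    """No-vig American odds: -100p/(1-p) for favorites, +100(1-p)/p otherwise."""
    if p >= 0.5:
        return round(-100.0 * p / (1.0 - p))
    return round(100.0 * (1.0 - p) / p)

def elo_diff_to_spread(elo_diff: float, pts_per_elo: float = 0.0375) -> float:
    """Convert an Elo edge to an expected scoring margin (NFL-calibrated)."""
    return elo_diff * pts_per_elo

p = 1.0 / (1.0 + 10 ** (-100 / 400))      # 100-Elo favorite (incl. home edge)
print(round(p, 3))                         # 0.64
print(fair_moneyline(p))                   # -178
print(round(elo_diff_to_spread(100), 2))   # 3.75
```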
26.2 Glicko and Glicko-2
Motivation: The Problem with Elo
Elo treats all ratings as equally reliable. A team rated 1600 after 100 games has the same rating "confidence" as a team rated 1600 after 3 games. This is clearly wrong --- we should be far less certain about the second team's true strength. Furthermore, Elo cannot distinguish between a team that has been inactive (whose true strength is now highly uncertain) and a team that has played regularly.
Mark Glickman, a statistician at Harvard, developed the Glicko system (1995) and its successor Glicko-2 (2001) to address these shortcomings. The key innovation is that each team carries not just a rating $r$ but also a rating deviation (RD) $\phi$ that quantifies the uncertainty in that rating. Glicko-2 further adds a volatility parameter $\sigma$ that measures how consistently a team performs relative to its rating.
Glicko: Rating with Uncertainty
In Glicko, each team has a rating $r$ and a rating deviation $\phi$. The rating represents the best estimate of team strength; the RD represents the uncertainty. We can think of the team's "true" rating as lying in the interval $[r - 2\phi, r + 2\phi]$ with approximately 95% confidence.
Step 1: Convert to the Glicko scale. Glicko uses a different scale than Elo:
$$\mu = \frac{r - 1500}{173.7178}$$
$$\phi = \frac{\text{RD}}{173.7178}$$
The constant $173.7178 = 400 / \ln(10)$ converts between the familiar 1500-centered rating scale and the Glicko internal scale.
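The scale conversion in Step 1 (and its inverse) is a one-liner; the values below are the conventional Glicko starting points:

```python
SCALE = 173.7178  # 400 / ln(10)

def to_internal(rating: float, rd: float) -> tuple:
    """Convert a 1500-centered rating and RD to the internal (mu, phi) scale."""
    return (rating - 1500.0) / SCALE, rd / SCALE

def from_internal(mu: float, phi: float) -> tuple:
    """Convert back to the familiar scale."""
    return mu * SCALE + 1500.0, phi * SCALE

mu, phi = to_internal(1500.0, 350.0)
print(mu)             # 0.0
print(round(phi, 4))  # 2.0148
```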
Step 2: Compute the g-function. The RD of an opponent modifies the expected outcome:
$$g(\phi) = \frac{1}{\sqrt{1 + 3\phi^2/\pi^2}}$$
This function downweights the impact of games against opponents with uncertain ratings. When $\phi$ is small (certain rating), $g(\phi) \approx 1$ and the game is treated normally. When $\phi$ is large (uncertain rating), $g(\phi) < 1$ and the game carries less weight.
Step 3: Expected outcome.
$$E(\mu, \mu_j, \phi_j) = \frac{1}{1 + \exp(-g(\phi_j)(\mu - \mu_j))}$$
Step 4: Update the rating deviation. The new RD incorporates information from all games in the rating period:
$$\frac{1}{\phi'^2} = \frac{1}{\phi^2} + \sum_{j=1}^{m} g(\phi_j)^2 \cdot E_j \cdot (1 - E_j)$$
where $m$ is the number of games played in the period.
Step 5: Update the rating.
$$\mu' = \mu + \phi'^2 \sum_{j=1}^{m} g(\phi_j)(s_j - E_j)$$
where $s_j$ is the actual outcome of game $j$ (1, 0, or 0.5).
Step 6: Increase RD for inactivity. If a team does not play in a rating period, its RD increases:
$$\phi_{\text{new}} = \sqrt{\phi_{\text{old}}^2 + c^2}$$
where $c$ is a constant that controls how quickly uncertainty grows during inactivity.
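Putting Steps 2 through 5 together, a single rating-period update on the internal scale might look like this (a minimal sketch; the full Glicko-2 implementation with volatility follows in the next subsection):

```python
import math

def g(phi: float) -> float:
    """Downweight games against uncertainly rated opponents."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def expect(mu: float, mu_j: float, phi_j: float) -> float:
    """Expected outcome against opponent j."""
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

def glicko_update(mu, phi, opponents, outcomes):
    """One rating period; opponents is a list of (mu_j, phi_j) pairs."""
    d_inv = sum(g(pj) ** 2 * expect(mu, mj, pj) * (1.0 - expect(mu, mj, pj))
                for mj, pj in opponents)
    phi_new = 1.0 / math.sqrt(1.0 / phi ** 2 + d_inv)
    mu_new = mu + phi_new ** 2 * sum(
        g(pj) * (s - expect(mu, mj, pj))
        for (mj, pj), s in zip(opponents, outcomes))
    return mu_new, phi_new

# One win over an equally rated opponent: rating rises, uncertainty shrinks
mu1, phi1 = glicko_update(0.0, 350.0 / 173.7178, [(0.0, 200.0 / 173.7178)], [1.0])
print(mu1 > 0.0, phi1 < 350.0 / 173.7178)  # True True
```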
Glicko-2: Adding Volatility
Glicko-2 adds a volatility parameter $\sigma$ that represents the degree of expected fluctuation in a team's rating. A team with high volatility has erratic results; a team with low volatility performs consistently.
The Glicko-2 update involves an iterative algorithm to estimate the new volatility:
Step 1. Compute the estimated variance $v$ of the team's rating based on game outcomes:
$$v = \left[\sum_{j=1}^{m} g(\phi_j)^2 \cdot E_j \cdot (1 - E_j)\right]^{-1}$$
Step 2. Compute the estimated improvement $\Delta$:
$$\Delta = v \sum_{j=1}^{m} g(\phi_j)(s_j - E_j)$$
Step 3. Determine the new volatility $\sigma'$ using the Illinois algorithm to solve:
$$f(x) = \frac{e^x(\Delta^2 - \phi^2 - v - e^x)}{2(\phi^2 + v + e^x)^2} - \frac{x - \ln(\sigma^2)}{\tau^2} = 0$$
where $\tau$ is the system constant (typically 0.3--1.2; smaller values constrain volatility changes more).
Step 4. Update the rating deviation:
$$\phi^* = \sqrt{\phi^2 + \sigma'^2}$$
$$\phi' = \frac{1}{\sqrt{1/\phi^{*2} + 1/v}}$$
Step 5. Update the rating:
$$\mu' = \mu + \phi'^2 \sum_{j=1}^{m} g(\phi_j)(s_j - E_j)$$
Advantages Over Elo for Betting
| Feature | Elo | Glicko-2 |
|---|---|---|
| Rating uncertainty | No | Yes (RD) |
| Handles inactivity | No | Yes (RD increases) |
| Volatility tracking | No | Yes |
| Confidence intervals | No | Yes |
| Information per game | Fixed | Variable (based on opponent RD) |
| Computational cost | Low | Moderate |
| Better calibrated probabilities | Moderate | Higher |
For betting, Glicko-2's confidence intervals are invaluable. A matchup where both teams have low RDs produces a more reliable probability estimate than one where either team has high RD. This allows the bettor to size wagers proportionally to confidence, or to skip games entirely when the uncertainty is too high.
Python Implementation
import math
from dataclasses import dataclass, field
from typing import List, Tuple, Dict
GLICKO2_SCALE = 173.7178 # 400 / ln(10)
@dataclass
class Glicko2Player:
"""Represents a team/player in the Glicko-2 system."""
mu: float = 0.0 # Rating on Glicko-2 internal scale
phi: float = 350 / GLICKO2_SCALE # Rating deviation (internal)
sigma: float = 0.06 # Volatility
@property
def rating(self) -> float:
"""Rating on the traditional Elo-like scale."""
return self.mu * GLICKO2_SCALE + 1500
@property
def rd(self) -> float:
"""Rating deviation on the traditional scale."""
return self.phi * GLICKO2_SCALE
def confidence_interval(self, z: float = 1.96) -> Tuple[float, float]:
"""95% confidence interval for the true rating."""
return (self.rating - z * self.rd, self.rating + z * self.rd)
@dataclass
class Glicko2System:
"""
Complete Glicko-2 rating system implementation.
Attributes:
tau: System constant controlling volatility change (0.3 to 1.2)
epsilon: Convergence tolerance for volatility iteration
"""
tau: float = 0.5
epsilon: float = 1e-6
players: Dict[str, Glicko2Player] = field(default_factory=dict)
def get_player(self, name: str) -> Glicko2Player:
if name not in self.players:
self.players[name] = Glicko2Player()
return self.players[name]
@staticmethod
def g(phi: float) -> float:
"""The g-function that scales by opponent's rating deviation."""
return 1.0 / math.sqrt(1.0 + 3.0 * phi**2 / math.pi**2)
@staticmethod
def expect(mu: float, mu_j: float, phi_j: float) -> float:
"""Expected outcome against opponent j."""
return 1.0 / (1.0 + math.exp(-Glicko2System.g(phi_j) * (mu - mu_j)))
def update_player(self, player: Glicko2Player,
opponents: List[Glicko2Player],
outcomes: List[float]) -> None:
"""
Update a single player's rating based on a rating period's games.
Args:
player: The player to update
opponents: List of opponent Glicko2Player objects
outcomes: List of outcomes (1.0 = win, 0.0 = loss, 0.5 = draw)
"""
if not opponents:
# No games played: only increase RD
player.phi = math.sqrt(player.phi**2 + player.sigma**2)
return
mu = player.mu
phi = player.phi
sigma = player.sigma
# Step 1: Compute variance v
v_inv = 0.0
delta_sum = 0.0
for opp, s in zip(opponents, outcomes):
g_val = self.g(opp.phi)
e_val = self.expect(mu, opp.mu, opp.phi)
v_inv += g_val**2 * e_val * (1.0 - e_val)
delta_sum += g_val * (s - e_val)
v = 1.0 / v_inv
delta = v * delta_sum
# Step 2: Determine new volatility via Illinois algorithm
a = math.log(sigma**2)
tau2 = self.tau**2
def f(x):
ex = math.exp(x)
num = ex * (delta**2 - phi**2 - v - ex)
den = 2.0 * (phi**2 + v + ex)**2
return num / den - (x - a) / tau2
# Initial bounds
A = a
if delta**2 > phi**2 + v:
B = math.log(delta**2 - phi**2 - v)
else:
k = 1
while f(a - k * self.tau) < 0:
k += 1
B = a - k * self.tau
# Illinois algorithm iteration
fA = f(A)
fB = f(B)
while abs(B - A) > self.epsilon:
C = A + (A - B) * fA / (fB - fA)
fC = f(C)
if fC * fB <= 0:
A = B
fA = fB
else:
fA /= 2.0
B = C
fB = fC
sigma_new = math.exp(A / 2.0)
# Step 3: Update RD (pre-rating period)
phi_star = math.sqrt(phi**2 + sigma_new**2)
# Step 4: Update RD and rating
phi_new = 1.0 / math.sqrt(1.0 / phi_star**2 + 1.0 / v)
mu_new = mu + phi_new**2 * delta_sum
player.mu = mu_new
player.phi = phi_new
player.sigma = sigma_new
def process_period(self, games: List[Tuple[str, str, float]]) -> None:
"""
Process all games in a rating period.
Args:
games: List of (team_a, team_b, outcome_for_a) tuples
"""
# Collect opponents and outcomes for each player
opponents_map: Dict[str, List[Glicko2Player]] = {}
outcomes_map: Dict[str, List[float]] = {}
# Snapshot current ratings (updates use pre-period ratings)
snapshots = {name: Glicko2Player(p.mu, p.phi, p.sigma)
for name, p in self.players.items()}
for team_a, team_b, outcome_a in games:
pa = self.get_player(team_a)
pb = self.get_player(team_b)
if team_a not in snapshots:
snapshots[team_a] = Glicko2Player(pa.mu, pa.phi, pa.sigma)
if team_b not in snapshots:
snapshots[team_b] = Glicko2Player(pb.mu, pb.phi, pb.sigma)
opponents_map.setdefault(team_a, []).append(snapshots[team_b])
outcomes_map.setdefault(team_a, []).append(outcome_a)
opponents_map.setdefault(team_b, []).append(snapshots[team_a])
outcomes_map.setdefault(team_b, []).append(1.0 - outcome_a)
# Update each player
for name, player in self.players.items():
opps = opponents_map.get(name, [])
outs = outcomes_map.get(name, [])
self.update_player(player, opps, outs)
def predict(self, team_a: str, team_b: str) -> Tuple[float, float, float]:
"""
Predict outcome probabilities with confidence measure.
Returns:
(prob_a_wins, prob_b_wins, confidence)
where confidence is inversely related to combined RD
"""
pa = self.get_player(team_a)
pb = self.get_player(team_b)
prob_a = self.expect(pa.mu, pb.mu, pb.phi)
combined_rd = math.sqrt(pa.rd**2 + pb.rd**2)
confidence = max(0, 1 - combined_rd / 500) # Normalized 0-1
return prob_a, 1 - prob_a, confidence
def get_rankings(self) -> List[Tuple[str, float, float, float]]:
"""Return rankings: (name, rating, RD, volatility)."""
return sorted(
[(n, p.rating, p.rd, p.sigma) for n, p in self.players.items()],
key=lambda x: x[1],
reverse=True
)
# --- Worked Example ---
if __name__ == "__main__":
system = Glicko2System(tau=0.5)
# Week 1 games
week1 = [
("Chiefs", "Ravens", 1.0), # Chiefs beat Ravens
("Bills", "Dolphins", 1.0), # Bills beat Dolphins
]
week2 = [
("Ravens", "Bills", 1.0), # Ravens beat Bills
("Chiefs", "Dolphins", 1.0), # Chiefs beat Dolphins
]
for week_num, week_games in enumerate([week1, week2], 1):
system.process_period(week_games)
print(f"After Week {week_num}:")
for name, rating, rd, vol in system.get_rankings():
lo, hi = system.get_player(name).confidence_interval()
print(f" {name}: {rating:.1f} (RD: {rd:.1f}, "
f"Vol: {vol:.4f}, 95% CI: [{lo:.0f}, {hi:.0f}])")
print()
# Prediction with confidence
pa, pb, conf = system.predict("Chiefs", "Bills")
print(f"Chiefs vs Bills: {pa:.3f} vs {pb:.3f} (confidence: {conf:.3f})")
26.3 Massey Ratings and Least Squares
Formulating Ratings as a Linear System
Kenneth Massey developed his rating system as part of his undergraduate honors thesis at Bluefield College in 1997. The Massey method approaches ratings from a fundamentally different angle than Elo or Glicko: instead of incrementally updating ratings game by game, it solves for the entire set of ratings simultaneously using the complete history of results.
The core idea is beautifully simple. If team $i$ has rating $r_i$ and team $j$ has rating $r_j$, then the expected point differential when team $i$ plays team $j$ is:
$$r_i - r_j \approx y_{ij}$$
where $y_{ij}$ is the actual point differential (positive if $i$ won, negative if $i$ lost). Each game gives us one equation. With $n$ teams and $m$ games, we have $m$ equations and $n$ unknowns. Since typically $m \gg n$, the system is overdetermined, and we solve it using least squares.
The Massey Matrix
We can write the system in matrix form:
$$\mathbf{X} \mathbf{r} = \mathbf{y}$$
where:
- $\mathbf{X}$ is an $m \times n$ matrix. For game $k$ between teams $i$ and $j$, row $k$ has $+1$ in column $i$, $-1$ in column $j$, and $0$ elsewhere.
- $\mathbf{r}$ is the $n \times 1$ vector of ratings we want to solve for.
- $\mathbf{y}$ is the $m \times 1$ vector of point differentials.
The normal equations for the least squares solution are:
$$\mathbf{X}^T \mathbf{X} \mathbf{r} = \mathbf{X}^T \mathbf{y}$$
Let $\mathbf{M} = \mathbf{X}^T \mathbf{X}$ (the Massey matrix) and $\mathbf{p} = \mathbf{X}^T \mathbf{y}$ (the point differential vector). Then:
$$\mathbf{M} \mathbf{r} = \mathbf{p}$$
The Massey matrix $\mathbf{M}$ has a special structure:
- The diagonal entry $M_{ii}$ equals the number of games played by team $i$.
- The off-diagonal entry $M_{ij}$ equals the negative of the number of games between teams $i$ and $j$.
The vector $\mathbf{p}$ has entry $p_i$ equal to the total point differential for team $i$ across all games.
The matrix $\mathbf{M}$ is singular (its rows sum to zero), so we need to add a constraint. The standard approach is to replace the last row of $\mathbf{M}$ with all ones and the last entry of $\mathbf{p}$ with zero, which forces the ratings to sum to zero.
Solving with Linear Algebra
The solution is then:
$$\mathbf{r} = \mathbf{M}^{-1} \mathbf{p}$$
In practice we avoid forming the inverse explicitly and call numpy.linalg.solve, which is faster and more numerically stable. The resulting ratings are on a point-differential scale: a team rated $+5$ is expected to beat a team rated $0$ by 5 points.
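Here is the pipeline on a tiny, perfectly consistent schedule (three teams, three games); because the margins are consistent, the least-squares fit reproduces them exactly:

```python
import numpy as np

teams = ["A", "B", "C"]
# A beat B by 7, B beat C by 3, A beat C by 10
X = np.array([[1, -1, 0],
              [0, 1, -1],
              [1, 0, -1]], dtype=float)
y = np.array([7.0, 3.0, 10.0])

M = X.T @ X       # Massey matrix
p = X.T @ y       # point-differential vector
M[-1, :] = 1.0    # replace last row with the sum-to-zero constraint
p[-1] = 0.0
r = np.linalg.solve(M, p)
print({t: float(v) for t, v in zip(teams, np.round(r, 2))})
# {'A': 5.67, 'B': -1.33, 'C': -4.33}
```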
Weighted Massey Ratings
We can extend the basic Massey method by weighting games differently. Recent games should carry more weight than early-season games. Close games might be more informative than blowouts. We introduce a diagonal weight matrix $\mathbf{W}$:
$$\mathbf{X}^T \mathbf{W} \mathbf{X} \mathbf{r} = \mathbf{X}^T \mathbf{W} \mathbf{y}$$
Common weighting schemes:
- Temporal decay: $w_k = \lambda^{T-t_k}$ where $T$ is the current time and $t_k$ is the time of game $k$, with $\lambda \in (0, 1)$.
- Margin capping: Cap point differentials at some maximum (e.g., 20 points in NFL) to reduce the influence of garbage time.
- Home advantage: Add a home-advantage column to $\mathbf{X}$ (1 for home, -1 for away, 0 for neutral).
Python Implementation
import numpy as np
from typing import List, Tuple, Dict, Optional
class MasseyRatings:
"""
Massey rating system using least-squares optimization.
Solves for team ratings that best explain observed point differentials.
"""
def __init__(self, home_advantage: bool = True,
recency_weight: float = 1.0,
margin_cap: Optional[int] = None):
"""
Args:
home_advantage: Whether to estimate home-field advantage
recency_weight: Decay factor (1.0 = no decay, 0.99 = slight decay)
margin_cap: Maximum absolute point differential to use
"""
self.home_advantage = home_advantage
self.recency_weight = recency_weight
self.margin_cap = margin_cap
self.teams: List[str] = []
self.team_index: Dict[str, int] = {}
self.ratings: Optional[np.ndarray] = None
self.home_adv_estimate: float = 0.0
def _get_team_idx(self, team: str) -> int:
if team not in self.team_index:
self.team_index[team] = len(self.teams)
self.teams.append(team)
return self.team_index[team]
def fit(self, games: List[Tuple[str, str, int, int, bool]]) -> None:
"""
Compute ratings from a set of game results.
Args:
games: List of (team_a, team_b, score_a, score_b, a_is_home)
"""
# Reset teams
self.teams = []
self.team_index = {}
# First pass: register all teams
for team_a, team_b, _, _, _ in games:
self._get_team_idx(team_a)
self._get_team_idx(team_b)
n = len(self.teams)
m = len(games)
# Number of columns: n teams + optional home advantage
n_cols = n + 1 if self.home_advantage else n
# Build the design matrix X and outcome vector y
X = np.zeros((m, n_cols))
y = np.zeros(m)
W = np.zeros(m) # Weights
for k, (team_a, team_b, score_a, score_b, a_home) in enumerate(games):
i = self.team_index[team_a]
j = self.team_index[team_b]
X[k, i] = 1
X[k, j] = -1
if self.home_advantage:
if a_home:
X[k, n] = 1 # Home advantage column
else:
X[k, n] = -1 # Away team in column
# Point differential
diff = score_a - score_b
if self.margin_cap is not None:
diff = max(-self.margin_cap, min(self.margin_cap, diff))
y[k] = diff
# Recency weighting
W[k] = self.recency_weight ** (m - 1 - k)
# Weighted normal equations: X'WX r = X'Wy
W_diag = np.diag(W)
M = X.T @ W_diag @ X
p = X.T @ W_diag @ y
# Handle singularity: replace last team row with sum-to-zero constraint
M[n-1, :n] = 1
M[n-1, n:] = 0 # Don't include home advantage in constraint
p[n-1] = 0
# Solve
solution = np.linalg.solve(M, p)
self.ratings = solution[:n]
if self.home_advantage:
self.home_adv_estimate = solution[n]
def predict(self, team_a: str, team_b: str,
a_is_home: bool = True) -> float:
"""
Predict point differential (team_a score - team_b score).
Returns:
Expected point differential for team_a
"""
if self.ratings is None:
raise ValueError("Must call fit() before predict()")
i = self.team_index[team_a]
j = self.team_index[team_b]
diff = self.ratings[i] - self.ratings[j]
if a_is_home and self.home_advantage:
diff += self.home_adv_estimate
elif not a_is_home and self.home_advantage:
diff -= self.home_adv_estimate
return diff
def point_diff_to_win_prob(self, predicted_diff: float,
sigma: float = 13.5) -> float:
"""
Convert predicted point differential to win probability.
Uses a normal CDF approximation. Sigma should be calibrated
to the sport (NFL ~ 13.5, NBA ~ 12, soccer ~ 1.1).
"""
from scipy.stats import norm
return norm.cdf(predicted_diff / sigma)
def get_rankings(self) -> List[Tuple[str, float]]:
"""Return teams sorted by rating."""
if self.ratings is None:
raise ValueError("Must call fit() before get_rankings()")
indexed = [(self.teams[i], self.ratings[i]) for i in range(len(self.teams))]
return sorted(indexed, key=lambda x: x[1], reverse=True)
# --- Worked Example ---
if __name__ == "__main__":
massey = MasseyRatings(home_advantage=True, margin_cap=28,
recency_weight=0.98)
# (team_a, team_b, score_a, score_b, a_is_home)
games = [
("Chiefs", "Ravens", 27, 20, True),
("Bills", "Dolphins", 31, 17, True),
("Ravens", "Bills", 24, 21, True),
("Dolphins", "Chiefs", 14, 35, True),
("Chiefs", "Bills", 17, 20, True),
("Ravens", "Dolphins", 28, 10, True),
("Bills", "Ravens", 17, 24, False),
("Dolphins", "Bills", 20, 27, False),
]
massey.fit(games)
print("=== Massey Ratings ===\n")
print(f"Estimated home advantage: {massey.home_adv_estimate:.2f} points\n")
print("Rankings:")
for rank, (team, rating) in enumerate(massey.get_rankings(), 1):
print(f" {rank}. {team}: {rating:+.2f}")
print("\nPredictions:")
for home, away in [("Chiefs", "Bills"), ("Ravens", "Dolphins")]:
diff = massey.predict(home, away, a_is_home=True)
prob = massey.point_diff_to_win_prob(diff, sigma=13.5)
print(f" {home} (home) vs {away}: "
f"predicted spread = {diff:+.1f}, "
f"win prob = {prob:.3f}")
Strengths and Limitations
Massey ratings have several attractive properties for bettors:
Strengths:
- Ratings are in interpretable units (points).
- Naturally produces point spread predictions.
- Home advantage is estimated directly from the data.
- Can incorporate weighting (recency, margin caps) flexibly.
- Computation is efficient and deterministic.

Limitations:
- Assumes linearity: the difference in ratings equals the expected point differential.
- No built-in uncertainty quantification (unlike Glicko).
- Requires a connected schedule (every team must be linkable through chains of opponents).
- Sensitive to outlier games unless margins are capped.
26.4 PageRank and Network-Based Rankings
Graph Theory Meets Sports
In 1998, Larry Page and Sergey Brin developed the PageRank algorithm to rank web pages for Google's search engine. The core insight was that a page is important if many important pages link to it. This idea transfers naturally to sports: a team is strong if it beats many strong teams.
We model a sports season as a directed graph (digraph). Each team is a node. Each game result creates a directed edge from the losing team to the winning team, weighted by the margin of victory (or simply unweighted for a basic version). A team's strength is then determined by the strengths of the teams it has beaten --- a recursive definition that PageRank resolves elegantly.
The PageRank Algorithm for Sports
Let $n$ be the number of teams. Define the adjacency matrix $\mathbf{A}$ where $A_{ij}$ represents the "vote" from team $i$ to team $j$. In the sports context, $A_{ij}$ is the total margin by which team $j$ has beaten team $i$ (or the number of wins, in the unweighted case). If team $j$ has never beaten team $i$, then $A_{ij} = 0$.
We normalize each row so that it sums to 1, creating the transition matrix $\mathbf{H}$:
$$H_{ij} = \frac{A_{ij}}{\sum_k A_{ik}}$$
An undefeated team has no losses to "vote" with, so its row of $\mathbf{A}$ is all zeros --- a "dangling node." We handle these by distributing the team's vote equally among all teams.
The PageRank vector $\mathbf{r}$ is the stationary distribution of the Markov chain defined by the modified transition matrix with a damping factor $d$:
$$\mathbf{r} = d \cdot \mathbf{H}^T \mathbf{r} + \frac{1-d}{n} \mathbf{1}$$
The damping factor $d$ (typically 0.85) can be interpreted in the sports context as follows: with probability $d$, a team's strength comes from the teams it has beaten; with probability $1-d$, all teams share a baseline level of competitiveness. This prevents isolated teams or undefeated teams from dominating the rankings unreasonably.
The Power Method
PageRank is typically computed iteratively using the power method:
- Initialize $\mathbf{r}^{(0)} = \frac{1}{n}\mathbf{1}$
- Iterate: $\mathbf{r}^{(k+1)} = d \cdot \mathbf{H}^T \mathbf{r}^{(k)} + \frac{1-d}{n}\mathbf{1}$
- Stop when $\|\mathbf{r}^{(k+1)} - \mathbf{r}^{(k)}\|_1 < \epsilon$
Convergence is guaranteed and is typically fast (20--50 iterations).
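The iteration above can be sketched in a few lines of NumPy. This is an illustrative sketch only (the full `SportsPageRank` class below adds edge construction and margin weighting); it assumes dangling rows of $\mathbf{H}$ have already been replaced by uniform rows, as described above:

```python
import numpy as np

def pagerank_power(H: np.ndarray, d: float = 0.85, tol: float = 1e-8,
                   max_iter: int = 100) -> np.ndarray:
    """Power iteration on a row-stochastic transition matrix H."""
    n = H.shape[0]
    r = np.ones(n) / n                       # r^(0): uniform start
    for _ in range(max_iter):
        r_new = d * H.T @ r + (1 - d) / n    # damped update
        if np.abs(r_new - r).sum() < tol:    # L1 convergence check
            return r_new
        r = r_new
    return r

# Two teams: B beat A, so A "votes" for B; B is undefeated (dangling),
# so its row has already been replaced by a uniform row.
H = np.array([[0.0, 1.0],
              [0.5, 0.5]])
r = pagerank_power(H)   # B ends up with the larger share
```

Because $\mathbf{H}$ is row-stochastic, the damped update preserves the property that the entries of $\mathbf{r}$ sum to 1.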
Variations for Sports
Several modifications improve PageRank for sports ranking:
Weighted Edges. Rather than binary win/loss, use the margin of victory as edge weight. A 30-point win contributes more than a 1-point win.
Win Percentage Weighting. Scale edges by the loser's winning percentage, so that beating a strong team contributes more than beating a weak team. This creates a second-order effect that rewards strength of schedule.
Score-Based PageRank. Use points scored rather than margin. Team $j$ gets credit proportional to the points scored against team $i$ divided by team $i$'s total points allowed.
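The score-based variant changes only the edge-construction step. A minimal sketch (the division by team $i$'s total points allowed happens automatically when the rows are normalized into $\mathbf{H}$): here *both* teams vote for each other in proportion to points conceded, rather than only the loser voting:

```python
import numpy as np

def score_based_matrix(games, team_index):
    """A[i, j] = total points team j has scored against team i."""
    n = len(team_index)
    A = np.zeros((n, n))
    for team_a, team_b, score_a, score_b in games:
        i, j = team_index[team_a], team_index[team_b]
        A[i, j] += score_b   # i "votes" for j by the points it allowed to j
        A[j, i] += score_a
    return A

idx = {"Chiefs": 0, "Ravens": 1}
A = score_based_matrix([("Chiefs", "Ravens", 27, 20)], idx)
# A[0, 1] == 20 (Ravens' points vs. Chiefs); A[1, 0] == 27
```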
Python Implementation
import numpy as np
from typing import List, Tuple, Dict, Optional
try:
import networkx as nx
HAS_NETWORKX = True
except ImportError:
HAS_NETWORKX = False
class SportsPageRank:
"""
PageRank-based rating system for sports.
Models the season as a directed graph where losses point to winners,
weighted by margin of victory.
"""
def __init__(self, damping: float = 0.85, max_iter: int = 100,
tol: float = 1e-8, weight_by_margin: bool = True):
self.damping = damping
self.max_iter = max_iter
self.tol = tol
self.weight_by_margin = weight_by_margin
self.teams: List[str] = []
self.team_index: Dict[str, int] = {}
self.rankings: Optional[Dict[str, float]] = None
def _get_idx(self, team: str) -> int:
if team not in self.team_index:
self.team_index[team] = len(self.teams)
self.teams.append(team)
return self.team_index[team]
def fit(self, games: List[Tuple[str, str, int, int]]) -> None:
"""
Compute PageRank ratings from game results.
Args:
games: List of (team_a, team_b, score_a, score_b)
"""
self.teams = []
self.team_index = {}
for a, b, _, _ in games:
self._get_idx(a)
self._get_idx(b)
n = len(self.teams)
A = np.zeros((n, n))
for team_a, team_b, score_a, score_b in games:
i = self.team_index[team_a]
j = self.team_index[team_b]
if score_a > score_b:
# Loser j "votes" for winner i
weight = (score_a - score_b) if self.weight_by_margin else 1
A[j, i] += weight
elif score_b > score_a:
weight = (score_b - score_a) if self.weight_by_margin else 1
A[i, j] += weight
else:
# Draw: exchange half-votes
A[i, j] += 0.5
A[j, i] += 0.5
# Normalize rows to create transition matrix H
H = np.zeros((n, n))
for i in range(n):
row_sum = A[i].sum()
if row_sum > 0:
H[i] = A[i] / row_sum
else:
# Dangling node: distribute evenly
H[i] = 1.0 / n
# Power iteration
r = np.ones(n) / n
for iteration in range(self.max_iter):
r_new = self.damping * H.T @ r + (1 - self.damping) / n
# Handle any numerical issues
r_new = r_new / r_new.sum()
if np.abs(r_new - r).sum() < self.tol:
break
r = r_new
self.rankings = {self.teams[i]: r[i] for i in range(n)}
def fit_networkx(self, games: List[Tuple[str, str, int, int]]) -> None:
"""Alternative implementation using networkx."""
if not HAS_NETWORKX:
raise ImportError("networkx is required for this method")
G = nx.DiGraph()
for team_a, team_b, score_a, score_b in games:
G.add_node(team_a)
G.add_node(team_b)
if score_a > score_b:
weight = (score_a - score_b) if self.weight_by_margin else 1
if G.has_edge(team_b, team_a):
G[team_b][team_a]['weight'] += weight
else:
G.add_edge(team_b, team_a, weight=weight)
elif score_b > score_a:
weight = (score_b - score_a) if self.weight_by_margin else 1
if G.has_edge(team_a, team_b):
G[team_a][team_b]['weight'] += weight
else:
G.add_edge(team_a, team_b, weight=weight)
self.rankings = nx.pagerank(G, alpha=self.damping, weight='weight',
max_iter=self.max_iter, tol=self.tol)
self.teams = list(self.rankings.keys())
def get_rankings(self) -> List[Tuple[str, float]]:
"""Return teams sorted by PageRank score."""
if self.rankings is None:
raise ValueError("Must call fit() first")
return sorted(self.rankings.items(), key=lambda x: x[1], reverse=True)
def predict_win_prob(self, team_a: str, team_b: str) -> float:
"""
Estimate win probability for team_a against team_b.
Uses the ratio of PageRank scores as a probability estimate.
"""
if self.rankings is None:
raise ValueError("Must call fit() first")
ra = self.rankings[team_a]
rb = self.rankings[team_b]
return ra / (ra + rb)
# --- Worked Example ---
if __name__ == "__main__":
pr = SportsPageRank(damping=0.85, weight_by_margin=True)
games = [
("Chiefs", "Ravens", 27, 20),
("Bills", "Dolphins", 31, 17),
("Ravens", "Bills", 24, 21),
("Dolphins", "Chiefs", 14, 35),
("Chiefs", "Bills", 17, 20),
("Ravens", "Dolphins", 28, 10),
]
pr.fit(games)
print("=== PageRank Sports Ratings ===\n")
print("Rankings (higher = better):")
for rank, (team, score) in enumerate(pr.get_rankings(), 1):
print(f" {rank}. {team}: {score:.6f}")
print("\nWin probability estimates:")
for a, b in [("Chiefs", "Bills"), ("Ravens", "Dolphins"),
("Chiefs", "Ravens")]:
prob = pr.predict_win_prob(a, b)
print(f" {a} vs {b}: {a} wins {prob:.3f}, {b} wins {1-prob:.3f}")
PageRank vs. Traditional Ratings
| Property | Elo/Glicko | Massey | PageRank |
|---|---|---|---|
| Update style | Incremental | Batch | Batch |
| Handles margin | With adjustment | Natively | With weighting |
| Strength of schedule | Implicit | Implicit | Explicit (recursive) |
| Interpretability | Rating points | Point differential | Relative importance |
| Transitivity | Not assumed | Assumed (linear) | Not assumed |
| Computational cost | O(games) | O(n^3) | O(n^2 * iterations) |
PageRank's distinctive advantage is its explicit modeling of strength of schedule through the recursive structure. Beating a team that has itself beaten many strong teams contributes more than beating a team with an easy schedule. This makes it particularly valuable in sports with unbalanced schedules, such as college football or college basketball, where teams play very different opponents.
26.5 Combining Multiple Rating Systems
Why Combine?
Each rating system encodes different assumptions and captures different aspects of team strength:
- Elo captures momentum and recent form through incremental updates.
- Glicko-2 provides uncertainty quantification and handles inactivity.
- Massey directly estimates point differentials and home advantage.
- PageRank emphasizes strength of schedule and quality of wins.
No single system dominates across all sports and all situations. By combining multiple systems into an ensemble, we can capture the strengths of each while mitigating their individual weaknesses. Research in machine learning consistently demonstrates that ensembles of diverse models outperform any individual model, and this principle applies directly to rating systems.
Simple Averaging
The simplest ensemble approach is to average the outputs of multiple systems. If system $k$ produces a predicted win probability $p_k$ for some matchup, the ensemble prediction is:
$$\hat{p} = \frac{1}{K} \sum_{k=1}^{K} p_k$$
This already provides substantial benefits if the systems make different errors. However, it assumes all systems are equally accurate, which is rarely true.
Weighted Averaging
We can assign weights $w_k$ to each system based on its historical accuracy:
$$\hat{p} = \sum_{k=1}^{K} w_k \cdot p_k, \quad \text{where } \sum_k w_k = 1$$
The weights can be determined by:
- Log-loss minimization: Find weights that minimize the log-loss on a validation set.
- Brier score minimization: Minimize the mean squared error between predictions and outcomes.
- Inverse error weighting: $w_k \propto 1/\text{error}_k$
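Of these, inverse error weighting is the cheapest to compute. A minimal sketch using Brier scores as the error measure (the `EnsembleRatings` class below learns weights by direct optimization instead):

```python
import numpy as np

def inverse_error_weights(preds_by_system: np.ndarray,
                          outcomes: np.ndarray) -> np.ndarray:
    """w_k proportional to 1 / Brier_k, normalized to sum to 1.

    preds_by_system: shape (n_games, n_systems) of win probabilities.
    outcomes: shape (n_games,) of 0/1 results.
    """
    briers = np.mean((preds_by_system - outcomes[:, None]) ** 2, axis=0)
    w = 1.0 / briers
    return w / w.sum()

# Toy data: system 0 tracks the outcomes more closely than system 1
P = np.array([[0.7, 0.6],
              [0.4, 0.5],
              [0.8, 0.6]])
y = np.array([1, 0, 1])
w = inverse_error_weights(P, y)   # the more accurate system gets more weight
```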
Model Stacking
A more sophisticated approach treats the individual system predictions as features and trains a meta-model to combine them. This is called stacking (or stacked generalization):
- Generate predictions from each base rating system on a validation set.
- Use these predictions as input features to a meta-model (e.g., logistic regression, gradient boosting).
- The meta-model learns the optimal nonlinear combination of the base systems.
Stacking can capture interactions between systems --- for example, it might learn that when Elo and PageRank disagree strongly, the Massey prediction is most reliable.
Calibration of Combined Predictions
After combining systems, the resulting probabilities may not be well-calibrated. A probability estimate of 0.70 should correspond to a 70% win rate in practice. We can apply isotonic regression or Platt scaling (covered in detail in Chapter 27) to recalibrate the combined output.
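Platt scaling, for instance, fits a one-dimensional logistic regression from the ensemble's raw probabilities (on the log-odds scale) to the observed outcomes. A minimal sketch with scikit-learn, using synthetic overconfident predictions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(raw_probs: np.ndarray, outcomes: np.ndarray):
    """Fit a logistic recalibration map on the log-odds of raw_probs."""
    z = np.log(raw_probs / (1 - raw_probs)).reshape(-1, 1)
    model = LogisticRegression()
    model.fit(z, outcomes)

    def recalibrate(p: np.ndarray) -> np.ndarray:
        zp = np.log(p / (1 - p)).reshape(-1, 1)
        return model.predict_proba(zp)[:, 1]

    return recalibrate

# Synthetic example: raw probabilities are stretched away from 0.5
# (overconfident), so the learned map should pull them back in.
rng = np.random.default_rng(0)
true_p = rng.uniform(0.2, 0.8, 2000)
raw = np.clip(true_p + 0.3 * (true_p - 0.5), 0.01, 0.99)
y = (rng.random(2000) < true_p).astype(int)
cal = fit_platt(raw, y)
```

Isotonic regression replaces the logistic map with a monotone step function and needs more data to fit reliably; Chapter 27 covers both in detail.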
Python Implementation
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from typing import List, Dict, Tuple, Callable, Optional
class EnsembleRatings:
"""
Combines multiple rating systems into an ensemble.
Supports simple averaging, weighted averaging, and model stacking.
"""
def __init__(self, systems: Dict[str, object],
predict_fns: Dict[str, Callable]):
"""
Args:
systems: Dict mapping system names to fitted system objects
predict_fns: Dict mapping system names to prediction functions.
Each function takes (system, team_a, team_b) and
returns P(team_a wins).
"""
self.systems = systems
self.predict_fns = predict_fns
self.system_names = list(systems.keys())
self.weights: Optional[np.ndarray] = None
self.meta_model = None
def get_base_predictions(self, team_a: str,
team_b: str) -> Dict[str, float]:
"""Get predictions from all base systems."""
preds = {}
for name in self.system_names:
try:
preds[name] = self.predict_fns[name](
self.systems[name], team_a, team_b
)
except (KeyError, ValueError):
preds[name] = 0.5 # Default for missing teams
return preds
def predict_simple_average(self, team_a: str, team_b: str) -> float:
"""Predict using simple average of all systems."""
preds = self.get_base_predictions(team_a, team_b)
return np.mean(list(preds.values()))
def fit_weights(self, matchups: List[Tuple[str, str]],
outcomes: List[int],
metric: str = 'log_loss') -> np.ndarray:
"""
Learn optimal weights for weighted averaging.
Args:
matchups: List of (team_a, team_b) tuples
outcomes: List of 1 (team_a won) or 0 (team_b won)
metric: 'log_loss' or 'brier'
Returns:
Optimal weight vector
"""
# Get base predictions for all matchups
all_preds = []
for team_a, team_b in matchups:
preds = self.get_base_predictions(team_a, team_b)
all_preds.append([preds[name] for name in self.system_names])
P = np.array(all_preds) # (n_games, n_systems)
y = np.array(outcomes)
n_systems = len(self.system_names)
def objective(w):
w_norm = np.exp(w) / np.exp(w).sum() # Softmax for constraints
p_ensemble = P @ w_norm
p_ensemble = np.clip(p_ensemble, 1e-10, 1 - 1e-10)
if metric == 'log_loss':
return -np.mean(y * np.log(p_ensemble) +
(1-y) * np.log(1-p_ensemble))
elif metric == 'brier':
return np.mean((p_ensemble - y)**2)
result = minimize(objective, np.zeros(n_systems), method='Nelder-Mead')
self.weights = np.exp(result.x) / np.exp(result.x).sum()
print("Optimal weights:")
for name, w in zip(self.system_names, self.weights):
print(f" {name}: {w:.4f}")
return self.weights
def predict_weighted(self, team_a: str, team_b: str) -> float:
"""Predict using optimized weights."""
if self.weights is None:
raise ValueError("Must call fit_weights() first")
preds = self.get_base_predictions(team_a, team_b)
pred_array = np.array([preds[name] for name in self.system_names])
return float(pred_array @ self.weights)
def fit_stacking(self, matchups: List[Tuple[str, str]],
outcomes: List[int],
calibrate: bool = True) -> None:
"""
Train a stacking meta-model using logistic regression.
Args:
matchups: List of (team_a, team_b) tuples
outcomes: List of outcomes
calibrate: Whether to apply probability calibration
"""
# Get base predictions
X = []
for team_a, team_b in matchups:
preds = self.get_base_predictions(team_a, team_b)
X.append([preds[name] for name in self.system_names])
X = np.array(X)
y = np.array(outcomes)
if calibrate:
base_model = LogisticRegression(C=1.0, max_iter=1000)
self.meta_model = CalibratedClassifierCV(
base_model, cv=5, method='isotonic'
)
else:
self.meta_model = LogisticRegression(C=1.0, max_iter=1000)
self.meta_model.fit(X, y)
# Display learned coefficients
if hasattr(self.meta_model, 'coef_'):
print("Stacking coefficients:")
for name, coef in zip(self.system_names,
self.meta_model.coef_[0]):
print(f" {name}: {coef:.4f}")
def predict_stacked(self, team_a: str, team_b: str) -> float:
"""Predict using the stacking meta-model."""
if self.meta_model is None:
raise ValueError("Must call fit_stacking() first")
preds = self.get_base_predictions(team_a, team_b)
X = np.array([[preds[name] for name in self.system_names]])
return float(self.meta_model.predict_proba(X)[0, 1])
def evaluate(self, matchups: List[Tuple[str, str]],
outcomes: List[int]) -> Dict[str, float]:
"""
Evaluate all methods on a test set.
Returns log-loss and Brier score for each method.
"""
results = {}
for method_name, predict_fn in [
('simple_avg', self.predict_simple_average),
('weighted', self.predict_weighted if self.weights is not None
else None),
('stacked', self.predict_stacked if self.meta_model is not None
else None),
]:
if predict_fn is None:
continue
preds = []
for team_a, team_b in matchups:
p = predict_fn(team_a, team_b)
preds.append(np.clip(p, 1e-10, 1 - 1e-10))
preds = np.array(preds)
y = np.array(outcomes)
log_loss = -np.mean(y * np.log(preds) +
(1-y) * np.log(1-preds))
brier = np.mean((preds - y)**2)
results[method_name] = {
'log_loss': round(log_loss, 4),
'brier': round(brier, 4)
}
# Also evaluate individual systems
for name in self.system_names:
preds = []
for team_a, team_b in matchups:
p = self.get_base_predictions(team_a, team_b)[name]
preds.append(np.clip(p, 1e-10, 1 - 1e-10))
preds = np.array(preds)
            y = np.array(outcomes)
            log_loss = -np.mean(y * np.log(preds) +
                                (1 - y) * np.log(1 - preds))
            brier = np.mean((preds - y) ** 2)
results[f'base_{name}'] = {
'log_loss': round(log_loss, 4),
'brier': round(brier, 4)
}
return results
# --- Worked Example ---
if __name__ == "__main__":
# This example assumes you have fitted Elo, Massey, and PageRank systems
# from the previous sections. Here we demonstrate the ensemble structure.
# Simulate base system predictions for demonstration
np.random.seed(42)
n_games = 200
# Generate synthetic predictions from 3 "systems"
true_probs = np.random.uniform(0.3, 0.7, n_games)
outcomes = (np.random.random(n_games) < true_probs).astype(int)
# Each system adds different noise
elo_preds = np.clip(true_probs + np.random.normal(0, 0.08, n_games),
0.01, 0.99)
massey_preds = np.clip(true_probs + np.random.normal(0, 0.10, n_games),
0.01, 0.99)
pr_preds = np.clip(true_probs + np.random.normal(0.02, 0.12, n_games),
0.01, 0.99)
# Split into train/test
split = 150
# Compute metrics for individual and ensemble
from sklearn.metrics import log_loss, brier_score_loss
# Simple average
avg_preds = (elo_preds + massey_preds + pr_preds) / 3
print("=== Ensemble Rating System Comparison ===\n")
for name, preds in [("Elo", elo_preds), ("Massey", massey_preds),
("PageRank", pr_preds), ("Simple Average", avg_preds)]:
test_preds = preds[split:]
test_outcomes = outcomes[split:]
ll = log_loss(test_outcomes, test_preds)
bs = brier_score_loss(test_outcomes, test_preds)
print(f" {name:15s}: Log-Loss = {ll:.4f}, Brier = {bs:.4f}")
# Weighted combination via optimization
from scipy.optimize import minimize
train_P = np.column_stack([elo_preds[:split], massey_preds[:split],
pr_preds[:split]])
train_y = outcomes[:split]
def neg_log_likelihood(w):
w_norm = np.exp(w) / np.exp(w).sum()
p = train_P @ w_norm
p = np.clip(p, 1e-10, 1-1e-10)
return -np.mean(train_y * np.log(p) + (1-train_y) * np.log(1-p))
result = minimize(neg_log_likelihood, np.zeros(3), method='Nelder-Mead')
opt_w = np.exp(result.x) / np.exp(result.x).sum()
print(f"\n Optimal weights: Elo={opt_w[0]:.3f}, "
f"Massey={opt_w[1]:.3f}, PR={opt_w[2]:.3f}")
test_P = np.column_stack([elo_preds[split:], massey_preds[split:],
pr_preds[split:]])
weighted_preds = test_P @ opt_w
ll = log_loss(outcomes[split:], weighted_preds)
bs = brier_score_loss(outcomes[split:], weighted_preds)
print(f" {'Weighted Avg':15s}: Log-Loss = {ll:.4f}, Brier = {bs:.4f}")
Practical Recommendations for Bettors
- Start with Elo for its simplicity and interpretability. Use it as a baseline.
- Add Massey for direct point spread predictions. This is especially useful for over/under and spread betting.
- Add PageRank for sports with unbalanced schedules (college sports).
- Use Glicko-2 when you need confidence intervals for bet sizing (Kelly criterion).
- Combine using weighted averaging --- it is robust and requires minimal tuning. Use stacking only when you have enough historical data (500+ games) to train the meta-model reliably.
- Recalibrate the ensemble periodically. Optimal weights shift over time as the competitive landscape changes.
26.6 Chapter Summary
This chapter has equipped you with four distinct approaches to rating and ranking sports teams, each grounded in rigorous mathematics and implemented from scratch in Python.
The Elo System provides an elegant, incremental rating method rooted in the logistic function. Its expected score formula $E_A = 1/(1 + 10^{(R_B - R_A)/400})$ converts rating differences into win probabilities. The K-factor controls the balance between responsiveness and stability, and adaptations for home advantage and margin of victory make Elo suitable for sports beyond its chess origins.
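As a compact recap, the two Elo formulas reduce to a few lines (a minimal sketch of the expected-score and update rules stated above):

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """E_A = 1 / (1 + 10^((R_B - R_A)/400))."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, expected: float, score: float,
               k: float = 20) -> float:
    """R_A' = R_A + K(S_A - E_A), with score 1 for a win, 0.5 draw, 0 loss."""
    return r_a + k * (score - expected)

e = elo_expected(1600, 1500)       # a 100-point favorite: E_A ≈ 0.640
new_r = elo_update(1600, e, 1.0)   # win: rating rises by K(1 - E_A) ≈ 7.2
```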
Glicko-2 extends Elo with two critical additions: the rating deviation $\phi$ that quantifies how uncertain we are about a team's true strength, and the volatility $\sigma$ that captures how erratically a team performs. For bettors, these uncertainty measures are essential for sizing bets appropriately and knowing when to abstain.
Massey Ratings approach the problem from a linear algebra perspective, formulating ratings as the least-squares solution to an overdetermined system of equations relating team strength differences to observed point differentials. The Massey matrix $\mathbf{M} = \mathbf{X}^T\mathbf{X}$ captures the structure of the schedule, and the solution provides ratings directly in point-differential units.
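The least-squares core can be sketched in a few lines on toy data (a sketch only, not the full `MasseyRatings` implementation from earlier in the chapter): each row of $\mathbf{X}$ has $+1$ for the first team and $-1$ for its opponent, and because $\mathbf{M}$ is singular, one equation is replaced by the constraint that the ratings sum to zero:

```python
import numpy as np

# Three teams (0, 1, 2) and three games; p holds point differentials
X = np.array([[1.0, -1.0, 0.0],    # team 0 vs team 1
              [0.0, 1.0, -1.0],    # team 1 vs team 2
              [-1.0, 0.0, 1.0]])   # team 2 vs team 0
p = np.array([7.0, 3.0, -4.0])

M = X.T @ X                      # the Massey matrix
rhs = X.T @ p
M[-1, :] = 1.0                   # replace last row: sum of ratings = 0
rhs[-1] = 0.0
r = np.linalg.solve(M, rhs)      # ratings in point-differential units
```

The three games are intentionally inconsistent (no rating vector fits all of them exactly), so the solution is the least-squares compromise.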
PageRank brings network science to sports ranking. By modeling the season as a directed graph where losses create edges pointing toward winners, PageRank recursively defines team strength in terms of the strength of defeated opponents. This provides an explicit, principled treatment of strength of schedule.
Finally, ensemble methods combine these diverse perspectives. Simple averaging, optimized weighting, and model stacking allow bettors to leverage the distinct strengths of each system while compensating for their individual weaknesses. The empirical finding that diverse ensembles outperform individual models holds as strongly in sports rating as in any other domain of predictive modeling.
In the next chapter, we will build on these rating systems by exploring advanced regression and classification methods --- including gradient boosting, random forests, and probability calibration --- that can incorporate ratings as features alongside other predictive variables to produce even more accurate betting models.
Key Formulas Reference
| System | Core Formula | Output |
|---|---|---|
| Elo | $E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$ | Win probability |
| Elo update | $R_A' = R_A + K(S_A - E_A)$ | Updated rating |
| Glicko-2 $g$ | $g(\phi) = 1/\sqrt{1 + 3\phi^2/\pi^2}$ | Uncertainty scaling |
| Massey | $\mathbf{M}\mathbf{r} = \mathbf{p}$ | Point differential ratings |
| PageRank | $\mathbf{r} = d\mathbf{H}^T\mathbf{r} + \frac{1-d}{n}\mathbf{1}$ | Relative importance |
Exercises
1. Elo Calibration. Using historical NFL data, find the K-factor and home-advantage parameter that minimize the log-loss of Elo predictions. Plot log-loss as a function of K for K in {5, 10, 15, 20, 25, 30, 40}.
2. Glicko-2 Confidence. Implement a betting strategy that only wagers when both teams' RD is below a threshold. Test thresholds from 30 to 100 and measure profitability.
3. Massey with Recency. Compare Massey ratings with uniform weights versus exponential decay ($\lambda = 0.99, 0.97, 0.95$). Which decay factor produces the best point spread predictions?
4. PageRank Damping. Test damping factors from 0.5 to 0.95 for college basketball PageRank. Does a different damping factor emerge as optimal compared to the web-standard 0.85?
5. Ensemble Construction. Build an ensemble of all four systems for NBA prediction. Compare simple averaging, weighted averaging, and logistic regression stacking. Report log-loss and calibration curves for each.