Chapter 26: Ratings and Ranking Systems
Every sports bettor eventually confronts the same fundamental question: how strong is each team, really? The point spread, the moneyline, and the over/under all derive from some underlying assessment of relative team quality. Bookmakers invest enormous resources into answering this question with precision. Bettors who want to compete must develop their own rigorous, quantitative systems for measuring team strength and converting those measurements into probabilities.
This chapter explores the mathematical foundations of rating and ranking systems --- from the venerable Elo system born in competitive chess, through the uncertainty-aware Glicko framework, to the linear-algebraic elegance of Massey ratings, and finally to network-based approaches inspired by Google's PageRank algorithm. Each system encodes different assumptions about competition, and each offers distinct advantages for the sports bettor. By the end of this chapter, you will have implemented each system from scratch in Python and learned how to combine them into ensemble ratings that outperform any individual approach.
26.1 The Elo Rating System
Historical Origins
The Elo rating system was developed by Arpad Elo, a Hungarian-American physics professor and chess master, in the 1960s. The United States Chess Federation adopted his system in 1960, and FIDE (the international chess governing body) followed in 1970. Elo's fundamental insight was elegant: every player (or team) possesses a latent strength that can be estimated from the outcomes of their games. When two players meet, the difference in their ratings determines the expected outcome, and ratings are updated based on how the actual result compares to that expectation.
The system has since been adopted far beyond chess. FIFA used an Elo variant for its world rankings from 2018 onward. FiveThirtyEight popularized Elo for NFL, NBA, and MLB forecasting. Competitive gaming platforms use Elo or Elo-derived systems for matchmaking. For sports bettors, Elo provides a transparent, mathematically grounded framework for estimating win probabilities --- the raw material from which profitable wagers are identified.
Mathematical Formulation
Let $R_A$ and $R_B$ denote the current ratings of teams A and B, respectively. The expected score for team A when facing team B is:
$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$
This is a logistic function. The constant 400 is a scaling parameter inherited from chess: a 400-point rating difference corresponds to an expected score of approximately 0.91 (about a 91% win probability). The expected score for team B is simply:
$$E_B = 1 - E_A = \frac{1}{1 + 10^{(R_A - R_B)/400}}$$
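A quick numeric check of these formulas (a standalone sketch, separate from the chapter's full implementation below):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Elo expected score for A against B: 1 / (1 + 10^((R_B - R_A)/400))."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

print(expected_score(1500, 1500))            # 0.5 for equal ratings
print(round(expected_score(1900, 1500), 3))  # 0.909 for a 400-point edge
# E_A + E_B is always 1 (up to float error)
print(round(expected_score(1600, 1450) + expected_score(1450, 1600), 6))  # 1.0
```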
After the game is played, each team's rating is updated:
$$R_A^{\text{new}} = R_A + K \cdot (S_A - E_A)$$
$$R_B^{\text{new}} = R_B + K \cdot (S_B - E_B)$$
where $S_A$ is the actual score (1 for a win, 0 for a loss, 0.5 for a draw) and $K$ is the update factor that controls how much a single game shifts the rating.
The system is zero-sum: the total rating points in the system are conserved (assuming no new teams enter or leave). If team A gains $\Delta$ points, team B loses $\Delta$ points.
The K-Factor: Sensitivity and Responsiveness
The K-factor is the single most important tuning parameter in an Elo system. It governs the tradeoff between responsiveness (how quickly ratings react to new results) and stability (how resistant ratings are to noise from individual game outcomes).
| K-Factor | Behavior | Best For |
|---|---|---|
| 5--10 | Very stable, slow to update | Large sample sizes, long seasons (MLB) |
| 15--25 | Moderate responsiveness | Standard applications (NBA, NFL regular season) |
| 30--40 | Highly responsive | Small sample sizes, volatile sports |
| 40+ | Extremely reactive | Early season, rapid form changes |
In chess, FIDE uses $K = 40$ for new players (under 30 games), $K = 20$ for established players below 2400 rating, and $K = 10$ for elite players above 2400. For sports betting applications, the optimal K-factor depends on the sport's inherent variance and sample size. NFL football, with only 16--17 regular-season games and high single-game variance, typically benefits from higher K-factors (20--30) compared to NBA basketball, where an 82-game season and lower single-game variance favor K-factors of 10--20.
Adapting Elo for Sports
Raw Elo was designed for chess: a two-player game with no home advantage and no meaningful margin of victory (you win, lose, or draw). Sports require several adaptations.
Home-Field Advantage. Most sports exhibit a measurable home advantage. We incorporate this by adding a constant $H$ to the home team's rating before computing expected scores:
$$E_{\text{home}} = \frac{1}{1 + 10^{(R_{\text{away}} - (R_{\text{home}} + H))/400}}$$
The value of $H$ should be calibrated empirically. In the NFL, home teams win approximately 56--57% of games, corresponding to $H \approx 48$ rating points. In the NBA, home teams win approximately 58--60% of games, corresponding to $H \approx 70$ rating points. In European soccer, home advantage is typically $H \approx 60$--$100$ depending on the league.
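The calibration can be done by inverting the expected-score formula: given an observed home win rate $p$ between roughly equal teams, $H = 400 \log_{10}(p/(1-p))$. A minimal sketch (the win rates below are illustrative):

```python
import math

def home_advantage_points(home_win_rate: float) -> float:
    """Solve E = 1 / (1 + 10^(-H/400)) for H, given E = home_win_rate."""
    return 400.0 * math.log10(home_win_rate / (1.0 - home_win_rate))

print(round(home_advantage_points(0.57), 1))  # 49.0 Elo points (NFL-like)
print(round(home_advantage_points(0.60), 1))  # 70.4 Elo points (NBA-like)
```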
Margin of Victory (MOV). In chess, a win is a win. In sports, beating an opponent by 30 points conveys more information than winning by 1 point. We can incorporate margin of victory through a multiplier on the K-factor:
$$M = \ln(|MOV| + 1) \cdot \frac{2.2}{(\text{Elo}_{\text{diff}} \cdot 0.001) + 2.2}$$
where $MOV$ is the point differential and $\text{Elo}_{\text{diff}}$ is the Elo difference between the winning team and the losing team. The logarithm prevents blowouts from distorting ratings excessively. The denominator is an autocorrelation adjustment that prevents strong teams from being rewarded too much for running up the score against weak opponents (since large margins are already expected in such matchups).
The update becomes:
$$R_A^{\text{new}} = R_A + K \cdot M \cdot (S_A - E_A)$$
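To see how the multiplier behaves, here is the expression above evaluated at a few margins (a standalone sketch):

```python
import math

def mov_multiplier(margin: int, winner_elo_diff: float) -> float:
    """M = ln(|MOV| + 1) * 2.2 / (winner_elo_diff * 0.001 + 2.2)."""
    return math.log(abs(margin) + 1) * (2.2 / (winner_elo_diff * 0.001 + 2.2))

print(round(mov_multiplier(3, 0.0), 2))     # 1.39: a close win between equals
print(round(mov_multiplier(21, 0.0), 2))    # 3.09: a blowout, but only ~2.2x more
print(round(mov_multiplier(21, 200.0), 2))  # 2.83: discounted for a big favorite
```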
Season-to-Season Regression. At the start of a new season, teams undergo roster changes, coaching changes, and other transitions. It is standard practice to regress each team's rating toward the mean between seasons:
$$R_{\text{new season}} = R_{\text{end of season}} \cdot (1 - \alpha) + R_{\text{mean}} \cdot \alpha$$
A typical regression factor is $\alpha = 1/3$, meaning each team carries two-thirds of its end-of-season rating and one-third of the league average. This prevents stale ratings from dominating early-season predictions.
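As a worked example of the regression formula with $\alpha = 1/3$:

```python
def regress_rating(end_rating: float, league_mean: float = 1500.0,
                   alpha: float = 1.0 / 3.0) -> float:
    """Carry (1 - alpha) of last season's rating plus alpha of the mean."""
    return end_rating * (1.0 - alpha) + league_mean * alpha

print(round(regress_rating(1650.0), 1))  # 1600.0: a strong team gives back a third
print(round(regress_rating(1380.0), 1))  # 1420.0: a weak team claws a third back
```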
Python Implementation from Scratch
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Optional
@dataclass
class EloSystem:
"""
A complete Elo rating system adapted for sports betting.
Attributes:
k_factor: Base K-factor for rating updates
home_advantage: Rating points added for home team
initial_rating: Starting rating for new teams
use_mov: Whether to use margin-of-victory adjustment
season_regression: Fraction to regress toward mean between seasons
"""
k_factor: float = 20.0
home_advantage: float = 48.0
initial_rating: float = 1500.0
use_mov: bool = True
season_regression: float = 1/3
ratings: Dict[str, float] = field(default_factory=dict)
history: List[dict] = field(default_factory=list)
def get_rating(self, team: str) -> float:
"""Get current rating for a team, initializing if needed."""
if team not in self.ratings:
self.ratings[team] = self.initial_rating
return self.ratings[team]
def expected_score(self, rating_a: float, rating_b: float) -> float:
"""
Compute expected score for team A against team B.
E_A = 1 / (1 + 10^((R_B - R_A) / 400))
"""
return 1.0 / (1.0 + math.pow(10, (rating_b - rating_a) / 400.0))
def mov_multiplier(self, margin: int, elo_diff: float) -> float:
"""
Margin of victory multiplier.
Uses log scaling with autocorrelation adjustment to prevent
strong teams from being over-rewarded for blowout wins.
"""
return math.log(abs(margin) + 1) * (2.2 / (elo_diff * 0.001 + 2.2))
def predict(self, home_team: str, away_team: str) -> Tuple[float, float]:
"""
Predict win probabilities for a matchup.
Returns:
(home_win_prob, away_win_prob)
"""
home_rating = self.get_rating(home_team) + self.home_advantage
away_rating = self.get_rating(away_team)
home_prob = self.expected_score(home_rating, away_rating)
return home_prob, 1.0 - home_prob
def update(self, home_team: str, away_team: str,
home_score: int, away_score: int) -> dict:
"""
Update ratings after a game result.
Args:
home_team: Name of the home team
away_team: Name of the away team
home_score: Points scored by home team
away_score: Points scored by away team
Returns:
Dictionary with pre-game predictions and rating changes
"""
home_rating = self.get_rating(home_team)
away_rating = self.get_rating(away_team)
# Compute expected scores with home advantage
home_expected = self.expected_score(
home_rating + self.home_advantage, away_rating
)
away_expected = 1.0 - home_expected
# Determine actual scores (1 = win, 0 = loss, 0.5 = tie)
if home_score > away_score:
home_actual, away_actual = 1.0, 0.0
elif home_score < away_score:
home_actual, away_actual = 0.0, 1.0
else:
home_actual, away_actual = 0.5, 0.5
        # Compute K-factor with optional MOV adjustment
        k = self.k_factor
        if self.use_mov:
            margin = abs(home_score - away_score)
            # Guard against ties: ln(0 + 1) = 0 would zero out the update
            if margin > 0:
                winner_elo = home_rating if home_score > away_score else away_rating
                loser_elo = away_rating if home_score > away_score else home_rating
                elo_diff = winner_elo - loser_elo
                k *= self.mov_multiplier(margin, elo_diff)
# Update ratings
home_delta = k * (home_actual - home_expected)
away_delta = k * (away_actual - away_expected)
self.ratings[home_team] = home_rating + home_delta
self.ratings[away_team] = away_rating + away_delta
result = {
'home_team': home_team,
'away_team': away_team,
'home_score': home_score,
'away_score': away_score,
'home_expected': round(home_expected, 4),
'away_expected': round(away_expected, 4),
'home_delta': round(home_delta, 2),
'away_delta': round(away_delta, 2),
'home_rating_new': round(self.ratings[home_team], 1),
'away_rating_new': round(self.ratings[away_team], 1),
}
self.history.append(result)
return result
def regress_to_mean(self):
"""Apply between-season regression toward the mean rating."""
mean_rating = sum(self.ratings.values()) / len(self.ratings)
for team in self.ratings:
self.ratings[team] = (
self.ratings[team] * (1 - self.season_regression)
+ mean_rating * self.season_regression
)
def get_rankings(self) -> List[Tuple[str, float]]:
"""Return teams sorted by rating in descending order."""
return sorted(self.ratings.items(), key=lambda x: x[1], reverse=True)
# --- Worked Example ---
if __name__ == "__main__":
elo = EloSystem(k_factor=20, home_advantage=48, use_mov=True)
# Simulate a short season
games = [
("Chiefs", "Ravens", 27, 20),
("Bills", "Dolphins", 31, 17),
("Ravens", "Bills", 24, 21),
("Dolphins", "Chiefs", 14, 35),
("Chiefs", "Bills", 17, 20),
("Ravens", "Dolphins", 28, 10),
]
print("=== Elo Rating System: NFL Example ===\n")
for home, away, hs, as_ in games:
result = elo.update(home, away, hs, as_)
print(f"{home} {hs} - {as_} {away}")
print(f" Expected: {result['home_expected']:.3f} vs {result['away_expected']:.3f}")
print(f" Rating changes: {home} {result['home_delta']:+.1f} -> "
f"{result['home_rating_new']:.1f}, "
f"{away} {result['away_delta']:+.1f} -> "
f"{result['away_rating_new']:.1f}\n")
print("Final Rankings:")
for rank, (team, rating) in enumerate(elo.get_rankings(), 1):
print(f" {rank}. {team}: {rating:.1f}")
# Predict next game
home_prob, away_prob = elo.predict("Bills", "Chiefs")
print(f"\nPrediction: Bills (home) vs Chiefs")
print(f" Bills win probability: {home_prob:.3f}")
print(f" Chiefs win probability: {away_prob:.3f}")
print(f" Implied fair moneyline Bills: {-100*home_prob/(1-home_prob):+.0f}"
if home_prob > 0.5
else f" Implied fair moneyline Bills: +{100*(1-home_prob)/home_prob:.0f}")
Converting Elo to Betting Lines
The expected score function directly yields a win probability. To convert to American odds:
$$\text{If } p > 0.5: \quad \text{ML} = -\frac{100p}{1-p}$$
$$\text{If } p < 0.5: \quad \text{ML} = +\frac{100(1-p)}{p}$$
To convert an Elo difference to a point spread, we need the empirical relationship between Elo margin and scoring margin. In the NFL, each Elo point corresponds to roughly 0.035--0.04 points of scoring margin. So a 100-point Elo advantage (including home advantage) translates to roughly a 3.5--4 point spread.
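These conversions can be sketched as follows; the points-per-Elo constant is the empirical NFL figure from the text and would need recalibration for other sports:

```python
def fair_moneyline(p: float) -> int:
    """No-vig American odds: -100p/(1-p) for favorites, +100(1-p)/p otherwise."""
    if p >= 0.5:
        return round(-100.0 * p / (1.0 - p))
    return round(100.0 * (1.0 - p) / p)

def elo_diff_to_spread(elo_diff: float, pts_per_elo: float = 0.0375) -> float:
    """Convert an Elo edge to an expected scoring margin (NFL-calibrated)."""
    return elo_diff * pts_per_elo

p = 1.0 / (1.0 + 10 ** (-100 / 400))      # 100-Elo favorite (incl. home edge)
print(round(p, 3))                         # 0.64
print(fair_moneyline(p))                   # -178
print(round(elo_diff_to_spread(100), 2))   # 3.75
```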
26.2 Glicko and Glicko-2
Motivation: The Problem with Elo
Elo treats all ratings as equally reliable. A team rated 1600 after 100 games has the same rating "confidence" as a team rated 1600 after 3 games. This is clearly wrong --- we should be far less certain about the second team's true strength. Furthermore, Elo cannot distinguish between a team that has been inactive (whose true strength is now highly uncertain) and a team that has played regularly.
Mark Glickman, a statistician at Harvard, developed the Glicko system (1995) and its successor Glicko-2 (2001) to address these shortcomings. The key innovation is that each team carries not just a rating $r$ but also a rating deviation (RD) $\phi$ that quantifies the uncertainty in that rating. Glicko-2 further adds a volatility parameter $\sigma$ that measures how consistently a team performs relative to its rating.
Glicko: Rating with Uncertainty
In Glicko, each team has a rating $r$ and a rating deviation $\phi$. The rating represents the best estimate of team strength; the RD represents the uncertainty. We can think of the team's "true" rating as lying in the interval $[r - 2\phi, r + 2\phi]$ with approximately 95% confidence.
Step 1: Convert to the Glicko scale. Glicko uses a different scale than Elo:
$$\mu = \frac{r - 1500}{173.7178}$$
$$\phi = \frac{\text{RD}}{173.7178}$$
The constant $173.7178 = 400 / \ln(10)$ converts between the familiar 1500-centered rating scale and the Glicko internal scale.
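The scale conversion in Step 1 (and its inverse) is a one-liner; the values below are the conventional Glicko starting points:

```python
SCALE = 173.7178  # 400 / ln(10)

def to_internal(rating: float, rd: float) -> tuple:
    """Convert a 1500-centered rating and RD to the internal (mu, phi) scale."""
    return (rating - 1500.0) / SCALE, rd / SCALE

def from_internal(mu: float, phi: float) -> tuple:
    """Convert back to the familiar scale."""
    return mu * SCALE + 1500.0, phi * SCALE

mu, phi = to_internal(1500.0, 350.0)
print(mu)             # 0.0
print(round(phi, 4))  # 2.0148
```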
Step 2: Compute the g-function. The RD of an opponent modifies the expected outcome:
$$g(\phi) = \frac{1}{\sqrt{1 + 3\phi^2/\pi^2}}$$
This function downweights the impact of games against opponents with uncertain ratings. When $\phi$ is small (certain rating), $g(\phi) \approx 1$ and the game is treated normally. When $\phi$ is large (uncertain rating), $g(\phi) < 1$ and the game carries less weight.
Step 3: Expected outcome.
$$E(\mu, \mu_j, \phi_j) = \frac{1}{1 + \exp(-g(\phi_j)(\mu - \mu_j))}$$
Step 4: Update the rating deviation. The new RD incorporates information from all games in the rating period:
$$\frac{1}{\phi'^2} = \frac{1}{\phi^2} + \sum_{j=1}^{m} g(\phi_j)^2 \cdot E_j \cdot (1 - E_j)$$
where $m$ is the number of games played in the period.
Step 5: Update the rating.
$$\mu' = \mu + \phi'^2 \sum_{j=1}^{m} g(\phi_j)(s_j - E_j)$$
where $s_j$ is the actual outcome of game $j$ (1, 0, or 0.5).
Step 6: Increase RD for inactivity. If a team does not play in a rating period, its RD increases:
$$\phi_{\text{new}} = \sqrt{\phi_{\text{old}}^2 + c^2}$$
where $c$ is a constant that controls how quickly uncertainty grows during inactivity.
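Putting Steps 2 through 5 together, a single rating-period update on the internal scale might look like this (a minimal sketch; the full Glicko-2 implementation with volatility follows in the next subsection):

```python
import math

def g(phi: float) -> float:
    """Downweight games against uncertainly rated opponents."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def expect(mu: float, mu_j: float, phi_j: float) -> float:
    """Expected outcome against opponent j."""
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

def glicko_update(mu, phi, opponents, outcomes):
    """One rating period; opponents is a list of (mu_j, phi_j) pairs."""
    d_inv = sum(g(pj) ** 2 * expect(mu, mj, pj) * (1.0 - expect(mu, mj, pj))
                for mj, pj in opponents)
    phi_new = 1.0 / math.sqrt(1.0 / phi ** 2 + d_inv)
    mu_new = mu + phi_new ** 2 * sum(
        g(pj) * (s - expect(mu, mj, pj))
        for (mj, pj), s in zip(opponents, outcomes))
    return mu_new, phi_new

# One win over an equally rated opponent: rating rises, uncertainty shrinks
mu1, phi1 = glicko_update(0.0, 350.0 / 173.7178, [(0.0, 200.0 / 173.7178)], [1.0])
print(mu1 > 0.0, phi1 < 350.0 / 173.7178)  # True True
```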
Glicko-2: Adding Volatility
Glicko-2 adds a volatility parameter $\sigma$ that represents the degree of expected fluctuation in a team's rating. A team with high volatility has erratic results; a team with low volatility performs consistently.
The Glicko-2 update involves an iterative algorithm to estimate the new volatility:
Step 1. Compute the estimated variance $v$ of the team's rating based on game outcomes:
$$v = \left[\sum_{j=1}^{m} g(\phi_j)^2 \cdot E_j \cdot (1 - E_j)\right]^{-1}$$
Step 2. Compute the estimated improvement $\Delta$:
$$\Delta = v \sum_{j=1}^{m} g(\phi_j)(s_j - E_j)$$
Step 3. Determine the new volatility $\sigma'$ using the Illinois algorithm to solve:
$$f(x) = \frac{e^x(\Delta^2 - \phi^2 - v - e^x)}{2(\phi^2 + v + e^x)^2} - \frac{x - \ln(\sigma^2)}{\tau^2} = 0$$
where $\tau$ is the system constant (typically 0.3--1.2; smaller values constrain volatility changes more).
Step 4. Update the rating deviation:
$$\phi^* = \sqrt{\phi^2 + \sigma'^2}$$
$$\phi' = \frac{1}{\sqrt{1/\phi^{*2} + 1/v}}$$
Step 5. Update the rating:
$$\mu' = \mu + \phi'^2 \sum_{j=1}^{m} g(\phi_j)(s_j - E_j)$$
Advantages Over Elo for Betting
| Feature | Elo | Glicko-2 |
|---|---|---|
| Rating uncertainty | No | Yes (RD) |
| Handles inactivity | No | Yes (RD increases) |
| Volatility tracking | No | Yes |
| Confidence intervals | No | Yes |
| Information per game | Fixed | Variable (based on opponent RD) |
| Computational cost | Low | Moderate |
| Better calibrated probabilities | Moderate | Higher |
For betting, Glicko-2's confidence intervals are invaluable. A matchup where both teams have low RDs produces a more reliable probability estimate than one where either team has high RD. This allows the bettor to size wagers proportionally to confidence, or to skip games entirely when the uncertainty is too high.
Python Implementation
import math
from dataclasses import dataclass, field
from typing import List, Tuple, Dict
GLICKO2_SCALE = 173.7178 # 400 / ln(10)
@dataclass
class Glicko2Player:
"""Represents a team/player in the Glicko-2 system."""
mu: float = 0.0 # Rating on Glicko-2 internal scale
phi: float = 350 / GLICKO2_SCALE # Rating deviation (internal)
sigma: float = 0.06 # Volatility
@property
def rating(self) -> float:
"""Rating on the traditional Elo-like scale."""
return self.mu * GLICKO2_SCALE + 1500
@property
def rd(self) -> float:
"""Rating deviation on the traditional scale."""
return self.phi * GLICKO2_SCALE
def confidence_interval(self, z: float = 1.96) -> Tuple[float, float]:
"""95% confidence interval for the true rating."""
return (self.rating - z * self.rd, self.rating + z * self.rd)
@dataclass
class Glicko2System:
"""
Complete Glicko-2 rating system implementation.
Attributes:
tau: System constant controlling volatility change (0.3 to 1.2)
epsilon: Convergence tolerance for volatility iteration
"""
tau: float = 0.5
epsilon: float = 1e-6
players: Dict[str, Glicko2Player] = field(default_factory=dict)
def get_player(self, name: str) -> Glicko2Player:
if name not in self.players:
self.players[name] = Glicko2Player()
return self.players[name]
@staticmethod
def g(phi: float) -> float:
"""The g-function that scales by opponent's rating deviation."""
return 1.0 / math.sqrt(1.0 + 3.0 * phi**2 / math.pi**2)
@staticmethod
def expect(mu: float, mu_j: float, phi_j: float) -> float:
"""Expected outcome against opponent j."""
return 1.0 / (1.0 + math.exp(-Glicko2System.g(phi_j) * (mu - mu_j)))
def update_player(self, player: Glicko2Player,
opponents: List[Glicko2Player],
outcomes: List[float]) -> None:
"""
Update a single player's rating based on a rating period's games.
Args:
player: The player to update
opponents: List of opponent Glicko2Player objects
outcomes: List of outcomes (1.0 = win, 0.0 = loss, 0.5 = draw)
"""
if not opponents:
# No games played: only increase RD
player.phi = math.sqrt(player.phi**2 + player.sigma**2)
return
mu = player.mu
phi = player.phi
sigma = player.sigma
# Step 1: Compute variance v
v_inv = 0.0
delta_sum = 0.0
for opp, s in zip(opponents, outcomes):
g_val = self.g(opp.phi)
e_val = self.expect(mu, opp.mu, opp.phi)
v_inv += g_val**2 * e_val * (1.0 - e_val)
delta_sum += g_val * (s - e_val)
v = 1.0 / v_inv
delta = v * delta_sum
# Step 2: Determine new volatility via Illinois algorithm
a = math.log(sigma**2)
tau2 = self.tau**2
def f(x):
ex = math.exp(x)
num = ex * (delta**2 - phi**2 - v - ex)
den = 2.0 * (phi**2 + v + ex)**2
return num / den - (x - a) / tau2
# Initial bounds
A = a
if delta**2 > phi**2 + v:
B = math.log(delta**2 - phi**2 - v)
else:
k = 1
while f(a - k * self.tau) < 0:
k += 1
B = a - k * self.tau
# Illinois algorithm iteration
fA = f(A)
fB = f(B)
while abs(B - A) > self.epsilon:
C = A + (A - B) * fA / (fB - fA)
fC = f(C)
if fC * fB <= 0:
A = B
fA = fB
else:
fA /= 2.0
B = C
fB = fC
sigma_new = math.exp(A / 2.0)
# Step 3: Update RD (pre-rating period)
phi_star = math.sqrt(phi**2 + sigma_new**2)
# Step 4: Update RD and rating
phi_new = 1.0 / math.sqrt(1.0 / phi_star**2 + 1.0 / v)
mu_new = mu + phi_new**2 * delta_sum
player.mu = mu_new
player.phi = phi_new
player.sigma = sigma_new
def process_period(self, games: List[Tuple[str, str, float]]) -> None:
"""
Process all games in a rating period.
Args:
games: List of (team_a, team_b, outcome_for_a) tuples
"""
# Collect opponents and outcomes for each player
opponents_map: Dict[str, List[Glicko2Player]] = {}
outcomes_map: Dict[str, List[float]] = {}
# Snapshot current ratings (updates use pre-period ratings)
snapshots = {name: Glicko2Player(p.mu, p.phi, p.sigma)
for name, p in self.players.items()}
for team_a, team_b, outcome_a in games:
pa = self.get_player(team_a)
pb = self.get_player(team_b)
if team_a not in snapshots:
snapshots[team_a] = Glicko2Player(pa.mu, pa.phi, pa.sigma)
if team_b not in snapshots:
snapshots[team_b] = Glicko2Player(pb.mu, pb.phi, pb.sigma)
opponents_map.setdefault(team_a, []).append(snapshots[team_b])
outcomes_map.setdefault(team_a, []).append(outcome_a)
opponents_map.setdefault(team_b, []).append(snapshots[team_a])
outcomes_map.setdefault(team_b, []).append(1.0 - outcome_a)
# Update each player
for name, player in self.players.items():
opps = opponents_map.get(name, [])
outs = outcomes_map.get(name, [])
self.update_player(player, opps, outs)
def predict(self, team_a: str, team_b: str) -> Tuple[float, float, float]:
"""
Predict outcome probabilities with confidence measure.
Returns:
(prob_a_wins, prob_b_wins, confidence)
where confidence is inversely related to combined RD
"""
pa = self.get_player(team_a)
pb = self.get_player(team_b)
prob_a = self.expect(pa.mu, pb.mu, pb.phi)
combined_rd = math.sqrt(pa.rd**2 + pb.rd**2)
confidence = max(0, 1 - combined_rd / 500) # Normalized 0-1
return prob_a, 1 - prob_a, confidence
def get_rankings(self) -> List[Tuple[str, float, float, float]]:
"""Return rankings: (name, rating, RD, volatility)."""
return sorted(
[(n, p.rating, p.rd, p.sigma) for n, p in self.players.items()],
key=lambda x: x[1],
reverse=True
)
# --- Worked Example ---
if __name__ == "__main__":
system = Glicko2System(tau=0.5)
# Week 1 games
week1 = [
("Chiefs", "Ravens", 1.0), # Chiefs beat Ravens
("Bills", "Dolphins", 1.0), # Bills beat Dolphins
]
week2 = [
("Ravens", "Bills", 1.0), # Ravens beat Bills
("Chiefs", "Dolphins", 1.0), # Chiefs beat Dolphins
]
for week_num, week_games in enumerate([week1, week2], 1):
system.process_period(week_games)
print(f"After Week {week_num}:")
for name, rating, rd, vol in system.get_rankings():
lo, hi = system.get_player(name).confidence_interval()
print(f" {name}: {rating:.1f} (RD: {rd:.1f}, "
f"Vol: {vol:.4f}, 95% CI: [{lo:.0f}, {hi:.0f}])")
print()
# Prediction with confidence
pa, pb, conf = system.predict("Chiefs", "Bills")
print(f"Chiefs vs Bills: {pa:.3f} vs {pb:.3f} (confidence: {conf:.3f})")
26.3 Massey Ratings and Least Squares
Formulating Ratings as a Linear System
Kenneth Massey developed his rating system as part of his undergraduate honors thesis at Bluefield College in 1997. The Massey method approaches ratings from a fundamentally different angle than Elo or Glicko: instead of incrementally updating ratings game by game, it solves for the entire set of ratings simultaneously using the complete history of results.
The core idea is beautifully simple. If team $i$ has rating $r_i$ and team $j$ has rating $r_j$, then the expected point differential when team $i$ plays team $j$ is:
$$r_i - r_j \approx y_{ij}$$
where $y_{ij}$ is the actual point differential (positive if $i$ won, negative if $i$ lost). Each game gives us one equation. With $n$ teams and $m$ games, we have $m$ equations and $n$ unknowns. Since typically $m \gg n$, the system is overdetermined, and we solve it using least squares.
The Massey Matrix
We can write the system in matrix form:
$$\mathbf{X} \mathbf{r} = \mathbf{y}$$
where:
- $\mathbf{X}$ is an $m \times n$ matrix. For game $k$ between teams $i$ and $j$, row $k$ has $+1$ in column $i$, $-1$ in column $j$, and $0$ elsewhere.
- $\mathbf{r}$ is the $n \times 1$ vector of ratings we want to solve for.
- $\mathbf{y}$ is the $m \times 1$ vector of point differentials.
The normal equations for the least squares solution are:
$$\mathbf{X}^T \mathbf{X} \mathbf{r} = \mathbf{X}^T \mathbf{y}$$
Let $\mathbf{M} = \mathbf{X}^T \mathbf{X}$ (the Massey matrix) and $\mathbf{p} = \mathbf{X}^T \mathbf{y}$ (the point differential vector). Then:
$$\mathbf{M} \mathbf{r} = \mathbf{p}$$
The Massey matrix $\mathbf{M}$ has a special structure:
- The diagonal entry $M_{ii}$ equals the number of games played by team $i$.
- The off-diagonal entry $M_{ij}$ equals the negative of the number of games between teams $i$ and $j$.
The vector $\mathbf{p}$ has entry $p_i$ equal to the total point differential for team $i$ across all games.
The matrix $\mathbf{M}$ is singular (its rows sum to zero), so we need to add a constraint. The standard approach is to replace the last row of $\mathbf{M}$ with all ones and the last entry of $\mathbf{p}$ with zero, which forces the ratings to sum to zero.
Solving with Linear Algebra
The solution is then:
$$\mathbf{r} = \mathbf{M}^{-1} \mathbf{p}$$
In practice we avoid forming the inverse explicitly and call numpy.linalg.solve, which is faster and more numerically stable. The resulting ratings are on a point-differential scale: a team rated $+5$ is expected to beat a team rated $0$ by 5 points.
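Here is the pipeline on a tiny, perfectly consistent schedule (three teams, three games); because the margins are consistent, the least-squares fit reproduces them exactly:

```python
import numpy as np

teams = ["A", "B", "C"]
# A beat B by 7, B beat C by 3, A beat C by 10
X = np.array([[1, -1, 0],
              [0, 1, -1],
              [1, 0, -1]], dtype=float)
y = np.array([7.0, 3.0, 10.0])

M = X.T @ X       # Massey matrix
p = X.T @ y       # point-differential vector
M[-1, :] = 1.0    # replace last row with the sum-to-zero constraint
p[-1] = 0.0
r = np.linalg.solve(M, p)
print({t: float(v) for t, v in zip(teams, np.round(r, 2))})
# {'A': 5.67, 'B': -1.33, 'C': -4.33}
```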
Weighted Massey Ratings
We can extend the basic Massey method by weighting games differently. Recent games should carry more weight than early-season games. Close games might be more informative than blowouts. We introduce a diagonal weight matrix $\mathbf{W}$:
$$\mathbf{X}^T \mathbf{W} \mathbf{X} \mathbf{r} = \mathbf{X}^T \mathbf{W} \mathbf{y}$$
Common weighting schemes:
- Temporal decay: $w_k = \lambda^{T-t_k}$ where $T$ is the current time and $t_k$ is the time of game $k$, with $\lambda \in (0, 1)$.
- Margin capping: Cap point differentials at some maximum (e.g., 20 points in NFL) to reduce the influence of garbage time.
- Home advantage: Add a home-advantage column to $\mathbf{X}$ (1 for home, -1 for away, 0 for neutral).
Python Implementation
import numpy as np
from typing import List, Tuple, Dict, Optional
class MasseyRatings:
"""
Massey rating system using least-squares optimization.
Solves for team ratings that best explain observed point differentials.
"""
def __init__(self, home_advantage: bool = True,
recency_weight: float = 1.0,
margin_cap: Optional[int] = None):
"""
Args:
home_advantage: Whether to estimate home-field advantage
recency_weight: Decay factor (1.0 = no decay, 0.99 = slight decay)
margin_cap: Maximum absolute point differential to use
"""
self.home_advantage = home_advantage
self.recency_weight = recency_weight
self.margin_cap = margin_cap
self.teams: List[str] = []
self.team_index: Dict[str, int] = {}
self.ratings: Optional[np.ndarray] = None
self.home_adv_estimate: float = 0.0
def _get_team_idx(self, team: str) -> int:
if team not in self.team_index:
self.team_index[team] = len(self.teams)
self.teams.append(team)
return self.team_index[team]
def fit(self, games: List[Tuple[str, str, int, int, bool]]) -> None:
"""
Compute ratings from a set of game results.
Args:
games: List of (team_a, team_b, score_a, score_b, a_is_home)
"""
# Reset teams
self.teams = []
self.team_index = {}
# First pass: register all teams
for team_a, team_b, _, _, _ in games:
self._get_team_idx(team_a)
self._get_team_idx(team_b)
n = len(self.teams)
m = len(games)
# Number of columns: n teams + optional home advantage
n_cols = n + 1 if self.home_advantage else n
# Build the design matrix X and outcome vector y
X = np.zeros((m, n_cols))
y = np.zeros(m)
W = np.zeros(m) # Weights
for k, (team_a, team_b, score_a, score_b, a_home) in enumerate(games):
i = self.team_index[team_a]
j = self.team_index[team_b]
X[k, i] = 1
X[k, j] = -1
if self.home_advantage:
if a_home:
X[k, n] = 1 # Home advantage column
else:
X[k, n] = -1 # Away team in column
# Point differential
diff = score_a - score_b
if self.margin_cap is not None:
diff = max(-self.margin_cap, min(self.margin_cap, diff))
y[k] = diff
# Recency weighting
W[k] = self.recency_weight ** (m - 1 - k)
# Weighted normal equations: X'WX r = X'Wy
W_diag = np.diag(W)
M = X.T @ W_diag @ X
p = X.T @ W_diag @ y
# Handle singularity: replace last team row with sum-to-zero constraint
M[n-1, :n] = 1
M[n-1, n:] = 0 # Don't include home advantage in constraint
p[n-1] = 0
# Solve
solution = np.linalg.solve(M, p)
self.ratings = solution[:n]
if self.home_advantage:
self.home_adv_estimate = solution[n]
def predict(self, team_a: str, team_b: str,
a_is_home: bool = True) -> float:
"""
Predict point differential (team_a score - team_b score).
Returns:
Expected point differential for team_a
"""
if self.ratings is None:
raise ValueError("Must call fit() before predict()")
i = self.team_index[team_a]
j = self.team_index[team_b]
diff = self.ratings[i] - self.ratings[j]
if a_is_home and self.home_advantage:
diff += self.home_adv_estimate
elif not a_is_home and self.home_advantage:
diff -= self.home_adv_estimate
return diff
def point_diff_to_win_prob(self, predicted_diff: float,
sigma: float = 13.5) -> float:
"""
Convert predicted point differential to win probability.
Uses a normal CDF approximation. Sigma should be calibrated
to the sport (NFL ~ 13.5, NBA ~ 12, soccer ~ 1.1).
"""
from scipy.stats import norm
return norm.cdf(predicted_diff / sigma)
def get_rankings(self) -> List[Tuple[str, float]]:
"""Return teams sorted by rating."""
if self.ratings is None:
raise ValueError("Must call fit() before get_rankings()")
indexed = [(self.teams[i], self.ratings[i]) for i in range(len(self.teams))]
return sorted(indexed, key=lambda x: x[1], reverse=True)
# --- Worked Example ---
if __name__ == "__main__":
massey = MasseyRatings(home_advantage=True, margin_cap=28,
recency_weight=0.98)
# (team_a, team_b, score_a, score_b, a_is_home)
games = [
("Chiefs", "Ravens", 27, 20, True),
("Bills", "Dolphins", 31, 17, True),
("Ravens", "Bills", 24, 21, True),
("Dolphins", "Chiefs", 14, 35, True),
("Chiefs", "Bills", 17, 20, True),
("Ravens", "Dolphins", 28, 10, True),
("Bills", "Ravens", 17, 24, False),
("Dolphins", "Bills", 20, 27, False),
]
massey.fit(games)
print("=== Massey Ratings ===\n")
print(f"Estimated home advantage: {massey.home_adv_estimate:.2f} points\n")
print("Rankings:")
for rank, (team, rating) in enumerate(massey.get_rankings(), 1):
print(f" {rank}. {team}: {rating:+.2f}")
print("\nPredictions:")
for home, away in [("Chiefs", "Bills"), ("Ravens", "Dolphins")]:
diff = massey.predict(home, away, a_is_home=True)
prob = massey.point_diff_to_win_prob(diff, sigma=13.5)
print(f" {home} (home) vs {away}: "
f"predicted spread = {diff:+.1f}, "
f"win prob = {prob:.3f}")
Strengths and Limitations
Massey ratings have several attractive properties for bettors:
Strengths:
- Ratings are in interpretable units (points).
- Naturally produces point spread predictions.
- Home advantage is estimated directly from the data.
- Can incorporate weighting (recency, margin caps) flexibly.
- Computation is efficient and deterministic.

Limitations:
- Assumes linearity: the difference in ratings equals the expected point differential.
- No built-in uncertainty quantification (unlike Glicko).
- Requires a connected schedule (every team must be linkable through chains of opponents).
- Sensitive to outlier games unless margins are capped.
26.4 PageRank and Network-Based Rankings
Graph Theory Meets Sports
In 1998, Larry Page and Sergey Brin developed the PageRank algorithm to rank web pages for Google's search engine. The core insight was that a page is important if many important pages link to it. This idea transfers naturally to sports: a team is strong if it beats many strong teams.
We model a sports season as a directed graph (digraph). Each team is a node. Each game result creates a directed edge from the losing team to the winning team, weighted by the margin of victory (or simply unweighted for a basic version). A team's strength is then determined by the strengths of the teams it has beaten --- a recursive definition that PageRank resolves elegantly.
The PageRank Algorithm for Sports
Let $n$ be the number of teams. Define the adjacency matrix $\mathbf{A}$ where $A_{ij}$ represents the "vote" from team $i$ to team $j$. In the sports context, $A_{ij}$ is the total margin by which team $j$ has beaten team $i$ (or the number of wins, in the unweighted case). If team $j$ has never beaten team $i$, then $A_{ij} = 0$.
We normalize each row so that it sums to 1, creating the transition matrix $\mathbf{H}$:
$$H_{ij} = \frac{A_{ij}}{\sum_k A_{ik}}$$
An undefeated team has no losses to "vote" with, so its row of $\mathbf{A}$ is all zeros --- a "dangling node." We handle these by distributing the team's vote equally among all teams.
The PageRank vector $\mathbf{r}$ is the stationary distribution of the Markov chain defined by the modified transition matrix with a damping factor $d$:
$$\mathbf{r} = d \cdot \mathbf{H}^T \mathbf{r} + \frac{1-d}{n} \mathbf{1}$$
The damping factor $d$ (typically 0.85) can be interpreted in the sports context as follows: with probability $d$, a team's strength comes from the teams it has beaten; with probability $1-d$, all teams share a baseline level of competitiveness. This prevents isolated teams or undefeated teams from dominating the rankings unreasonably.
The Power Method
PageRank is typically computed iteratively using the power method:
- Initialize $\mathbf{r}^{(0)} = \frac{1}{n}\mathbf{1}$
- Iterate: $\mathbf{r}^{(k+1)} = d \cdot \mathbf{H}^T \mathbf{r}^{(k)} + \frac{1-d}{n}\mathbf{1}$
- Stop when $\|\mathbf{r}^{(k+1)} - \mathbf{r}^{(k)}\|_1 < \epsilon$
Convergence is guaranteed and is typically fast (20--50 iterations).
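The iteration above can be sketched in a few lines of NumPy. This is an illustrative sketch only (the full `SportsPageRank` class below adds edge construction and margin weighting); it assumes dangling rows of $\mathbf{H}$ have already been replaced by uniform rows, as described above:

```python
import numpy as np

def pagerank_power(H: np.ndarray, d: float = 0.85, tol: float = 1e-8,
                   max_iter: int = 100) -> np.ndarray:
    """Power iteration on a row-stochastic transition matrix H."""
    n = H.shape[0]
    r = np.ones(n) / n                       # r^(0): uniform start
    for _ in range(max_iter):
        r_new = d * H.T @ r + (1 - d) / n    # damped update
        if np.abs(r_new - r).sum() < tol:    # L1 convergence check
            return r_new
        r = r_new
    return r

# Two teams: B beat A, so A "votes" for B; B is undefeated (dangling),
# so its row has already been replaced by a uniform row.
H = np.array([[0.0, 1.0],
              [0.5, 0.5]])
r = pagerank_power(H)   # B ends up with the larger share
```

Because $\mathbf{H}$ is row-stochastic, the damped update preserves the property that the entries of $\mathbf{r}$ sum to 1.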
Variations for Sports
Several modifications improve PageRank for sports ranking:
Weighted Edges. Rather than binary win/loss, use the margin of victory as edge weight. A 30-point win contributes more than a 1-point win.
Win Percentage Weighting. Scale edges by the loser's winning percentage, so that beating a strong team contributes more than beating a weak team. This creates a second-order effect that rewards strength of schedule.
Score-Based PageRank. Use points scored rather than margin. Team $j$ gets credit proportional to the points scored against team $i$ divided by team $i$'s total points allowed.
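The score-based variant changes only the edge-construction step. A minimal sketch (the division by team $i$'s total points allowed happens automatically when the rows are normalized into $\mathbf{H}$): here *both* teams vote for each other in proportion to points conceded, rather than only the loser voting:

```python
import numpy as np

def score_based_matrix(games, team_index):
    """A[i, j] = total points team j has scored against team i."""
    n = len(team_index)
    A = np.zeros((n, n))
    for team_a, team_b, score_a, score_b in games:
        i, j = team_index[team_a], team_index[team_b]
        A[i, j] += score_b   # i "votes" for j by the points it allowed to j
        A[j, i] += score_a
    return A

idx = {"Chiefs": 0, "Ravens": 1}
A = score_based_matrix([("Chiefs", "Ravens", 27, 20)], idx)
# A[0, 1] == 20 (Ravens' points vs. Chiefs); A[1, 0] == 27
```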
Python Implementation
import numpy as np
from typing import List, Tuple, Dict, Optional
try:
import networkx as nx
HAS_NETWORKX = True
except ImportError:
HAS_NETWORKX = False
class SportsPageRank:
"""
PageRank-based rating system for sports.
Models the season as a directed graph where losses point to winners,
weighted by margin of victory.
"""
def __init__(self, damping: float = 0.85, max_iter: int = 100,
tol: float = 1e-8, weight_by_margin: bool = True):
self.damping = damping
self.max_iter = max_iter
self.tol = tol
self.weight_by_margin = weight_by_margin
self.teams: List[str] = []
self.team_index: Dict[str, int] = {}
self.rankings: Optional[Dict[str, float]] = None
def _get_idx(self, team: str) -> int:
if team not in self.team_index:
self.team_index[team] = len(self.teams)
self.teams.append(team)
return self.team_index[team]
def fit(self, games: List[Tuple[str, str, int, int]]) -> None:
"""
Compute PageRank ratings from game results.
Args:
games: List of (team_a, team_b, score_a, score_b)
"""
self.teams = []
self.team_index = {}
for a, b, _, _ in games:
self._get_idx(a)
self._get_idx(b)
n = len(self.teams)
A = np.zeros((n, n))
for team_a, team_b, score_a, score_b in games:
i = self.team_index[team_a]
j = self.team_index[team_b]
if score_a > score_b:
# Loser j "votes" for winner i
weight = (score_a - score_b) if self.weight_by_margin else 1
A[j, i] += weight
elif score_b > score_a:
weight = (score_b - score_a) if self.weight_by_margin else 1
A[i, j] += weight
else:
# Draw: exchange half-votes
A[i, j] += 0.5
A[j, i] += 0.5
# Normalize rows to create transition matrix H
H = np.zeros((n, n))
for i in range(n):
row_sum = A[i].sum()
if row_sum > 0:
H[i] = A[i] / row_sum
else:
# Dangling node: distribute evenly
H[i] = 1.0 / n
# Power iteration
r = np.ones(n) / n
for iteration in range(self.max_iter):
r_new = self.damping * H.T @ r + (1 - self.damping) / n
# Handle any numerical issues
r_new = r_new / r_new.sum()
if np.abs(r_new - r).sum() < self.tol:
break
r = r_new
self.rankings = {self.teams[i]: r[i] for i in range(n)}
def fit_networkx(self, games: List[Tuple[str, str, int, int]]) -> None:
"""Alternative implementation using networkx."""
if not HAS_NETWORKX:
raise ImportError("networkx is required for this method")
G = nx.DiGraph()
for team_a, team_b, score_a, score_b in games:
G.add_node(team_a)
G.add_node(team_b)
if score_a > score_b:
weight = (score_a - score_b) if self.weight_by_margin else 1
if G.has_edge(team_b, team_a):
G[team_b][team_a]['weight'] += weight
else:
G.add_edge(team_b, team_a, weight=weight)
elif score_b > score_a:
weight = (score_b - score_a) if self.weight_by_margin else 1
if G.has_edge(team_a, team_b):
G[team_a][team_b]['weight'] += weight
else:
G.add_edge(team_a, team_b, weight=weight)
self.rankings = nx.pagerank(G, alpha=self.damping, weight='weight',
max_iter=self.max_iter, tol=self.tol)
self.teams = list(self.rankings.keys())
def get_rankings(self) -> List[Tuple[str, float]]:
"""Return teams sorted by PageRank score."""
if self.rankings is None:
raise ValueError("Must call fit() first")
return sorted(self.rankings.items(), key=lambda x: x[1], reverse=True)
def predict_win_prob(self, team_a: str, team_b: str) -> float:
"""
Estimate win probability for team_a against team_b.
Uses the ratio of PageRank scores as a probability estimate.
"""
if self.rankings is None:
raise ValueError("Must call fit() first")
ra = self.rankings[team_a]
rb = self.rankings[team_b]
return ra / (ra + rb)
# --- Worked Example ---
if __name__ == "__main__":
pr = SportsPageRank(damping=0.85, weight_by_margin=True)
games = [
("Chiefs", "Ravens", 27, 20),
("Bills", "Dolphins", 31, 17),
("Ravens", "Bills", 24, 21),
("Dolphins", "Chiefs", 14, 35),
("Chiefs", "Bills", 17, 20),
("Ravens", "Dolphins", 28, 10),
]
pr.fit(games)
print("=== PageRank Sports Ratings ===\n")
print("Rankings (higher = better):")
for rank, (team, score) in enumerate(pr.get_rankings(), 1):
print(f" {rank}. {team}: {score:.6f}")
print("\nWin probability estimates:")
for a, b in [("Chiefs", "Bills"), ("Ravens", "Dolphins"),
("Chiefs", "Ravens")]:
prob = pr.predict_win_prob(a, b)
print(f" {a} vs {b}: {a} wins {prob:.3f}, {b} wins {1-prob:.3f}")
PageRank vs. Traditional Ratings
| Property | Elo/Glicko | Massey | PageRank |
|---|---|---|---|
| Update style | Incremental | Batch | Batch |
| Handles margin | With adjustment | Natively | With weighting |
| Strength of schedule | Implicit | Implicit | Explicit (recursive) |
| Interpretability | Rating points | Point differential | Relative importance |
| Transitivity | Not assumed | Assumed (linear) | Not assumed |
| Computational cost | O(games) | O(n^3) | O(n^2 * iterations) |
PageRank's distinctive advantage is its explicit modeling of strength of schedule through the recursive structure. Beating a team that has itself beaten many strong teams contributes more than beating a team with an easy schedule. This makes it particularly valuable in sports with unbalanced schedules, such as college football or college basketball, where teams play very different opponents.
26.5 Combining Multiple Rating Systems
Why Combine?
Each rating system encodes different assumptions and captures different aspects of team strength:
- Elo captures momentum and recent form through incremental updates.
- Glicko-2 provides uncertainty quantification and handles inactivity.
- Massey directly estimates point differentials and home advantage.
- PageRank emphasizes strength of schedule and quality of wins.
No single system dominates across all sports and all situations. By combining multiple systems into an ensemble, we can capture the strengths of each while mitigating their individual weaknesses. Research in machine learning consistently demonstrates that ensembles of diverse models outperform any individual model, and this principle applies directly to rating systems.
Simple Averaging
The simplest ensemble approach is to average the outputs of multiple systems. If system $k$ produces a predicted win probability $p_k$ for some matchup, the ensemble prediction is:
$$\hat{p} = \frac{1}{K} \sum_{k=1}^{K} p_k$$
This already provides substantial benefits if the systems make different errors. However, it assumes all systems are equally accurate, which is rarely true.
Weighted Averaging
We can assign weights $w_k$ to each system based on its historical accuracy:
$$\hat{p} = \sum_{k=1}^{K} w_k \cdot p_k, \quad \text{where } \sum_k w_k = 1$$
The weights can be determined by:
- Log-loss minimization: Find weights that minimize the log-loss on a validation set.
- Brier score minimization: Minimize the mean squared error between predictions and outcomes.
- Inverse error weighting: $w_k \propto 1/\text{error}_k$
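Of these, inverse error weighting is the cheapest to compute. A minimal sketch using Brier scores as the error measure (the `EnsembleRatings` class below learns weights by direct optimization instead):

```python
import numpy as np

def inverse_error_weights(preds_by_system: np.ndarray,
                          outcomes: np.ndarray) -> np.ndarray:
    """w_k proportional to 1 / Brier_k, normalized to sum to 1.

    preds_by_system: shape (n_games, n_systems) of win probabilities.
    outcomes: shape (n_games,) of 0/1 results.
    """
    briers = np.mean((preds_by_system - outcomes[:, None]) ** 2, axis=0)
    w = 1.0 / briers
    return w / w.sum()

# Toy data: system 0 tracks the outcomes more closely than system 1
P = np.array([[0.7, 0.6],
              [0.4, 0.5],
              [0.8, 0.6]])
y = np.array([1, 0, 1])
w = inverse_error_weights(P, y)   # the more accurate system gets more weight
```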
Model Stacking
A more sophisticated approach treats the individual system predictions as features and trains a meta-model to combine them. This is called stacking (or stacked generalization):
- Generate predictions from each base rating system on a validation set.
- Use these predictions as input features to a meta-model (e.g., logistic regression, gradient boosting).
- The meta-model learns the optimal nonlinear combination of the base systems.
Stacking can capture interactions between systems --- for example, it might learn that when Elo and PageRank disagree strongly, the Massey prediction is most reliable.
Calibration of Combined Predictions
After combining systems, the resulting probabilities may not be well-calibrated. A probability estimate of 0.70 should correspond to a 70% win rate in practice. We can apply isotonic regression or Platt scaling (covered in detail in Chapter 27) to recalibrate the combined output.
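Platt scaling, for instance, fits a one-dimensional logistic regression from the ensemble's raw probabilities (on the log-odds scale) to the observed outcomes. A minimal sketch with scikit-learn, using synthetic overconfident predictions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(raw_probs: np.ndarray, outcomes: np.ndarray):
    """Fit a logistic recalibration map on the log-odds of raw_probs."""
    z = np.log(raw_probs / (1 - raw_probs)).reshape(-1, 1)
    model = LogisticRegression()
    model.fit(z, outcomes)

    def recalibrate(p: np.ndarray) -> np.ndarray:
        zp = np.log(p / (1 - p)).reshape(-1, 1)
        return model.predict_proba(zp)[:, 1]

    return recalibrate

# Synthetic example: raw probabilities are stretched away from 0.5
# (overconfident), so the learned map should pull them back in.
rng = np.random.default_rng(0)
true_p = rng.uniform(0.2, 0.8, 2000)
raw = np.clip(true_p + 0.3 * (true_p - 0.5), 0.01, 0.99)
y = (rng.random(2000) < true_p).astype(int)
cal = fit_platt(raw, y)
```

Isotonic regression replaces the logistic map with a monotone step function and needs more data to fit reliably; Chapter 27 covers both in detail.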
Python Implementation
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from typing import List, Dict, Tuple, Callable, Optional
class EnsembleRatings:
"""
Combines multiple rating systems into an ensemble.
Supports simple averaging, weighted averaging, and model stacking.
"""
def __init__(self, systems: Dict[str, object],
predict_fns: Dict[str, Callable]):
"""
Args:
systems: Dict mapping system names to fitted system objects
predict_fns: Dict mapping system names to prediction functions.
Each function takes (system, team_a, team_b) and
returns P(team_a wins).
"""
self.systems = systems
self.predict_fns = predict_fns
self.system_names = list(systems.keys())
self.weights: Optional[np.ndarray] = None
self.meta_model = None
def get_base_predictions(self, team_a: str,
team_b: str) -> Dict[str, float]:
"""Get predictions from all base systems."""
preds = {}
for name in self.system_names:
try:
preds[name] = self.predict_fns[name](
self.systems[name], team_a, team_b
)
except (KeyError, ValueError):
preds[name] = 0.5 # Default for missing teams
return preds
def predict_simple_average(self, team_a: str, team_b: str) -> float:
"""Predict using simple average of all systems."""
preds = self.get_base_predictions(team_a, team_b)
return np.mean(list(preds.values()))
def fit_weights(self, matchups: List[Tuple[str, str]],
outcomes: List[int],
metric: str = 'log_loss') -> np.ndarray:
"""
Learn optimal weights for weighted averaging.
Args:
matchups: List of (team_a, team_b) tuples
outcomes: List of 1 (team_a won) or 0 (team_b won)
metric: 'log_loss' or 'brier'
Returns:
Optimal weight vector
"""
# Get base predictions for all matchups
all_preds = []
for team_a, team_b in matchups:
preds = self.get_base_predictions(team_a, team_b)
all_preds.append([preds[name] for name in self.system_names])
P = np.array(all_preds) # (n_games, n_systems)
y = np.array(outcomes)
n_systems = len(self.system_names)
def objective(w):
w_norm = np.exp(w) / np.exp(w).sum() # Softmax for constraints
p_ensemble = P @ w_norm
p_ensemble = np.clip(p_ensemble, 1e-10, 1 - 1e-10)
if metric == 'log_loss':
return -np.mean(y * np.log(p_ensemble) +
(1-y) * np.log(1-p_ensemble))
elif metric == 'brier':
return np.mean((p_ensemble - y)**2)
result = minimize(objective, np.zeros(n_systems), method='Nelder-Mead')
self.weights = np.exp(result.x) / np.exp(result.x).sum()
print("Optimal weights:")
for name, w in zip(self.system_names, self.weights):
print(f" {name}: {w:.4f}")
return self.weights
def predict_weighted(self, team_a: str, team_b: str) -> float:
"""Predict using optimized weights."""
if self.weights is None:
raise ValueError("Must call fit_weights() first")
preds = self.get_base_predictions(team_a, team_b)
pred_array = np.array([preds[name] for name in self.system_names])
return float(pred_array @ self.weights)
def fit_stacking(self, matchups: List[Tuple[str, str]],
outcomes: List[int],
calibrate: bool = True) -> None:
"""
Train a stacking meta-model using logistic regression.
Args:
matchups: List of (team_a, team_b) tuples
outcomes: List of outcomes
calibrate: Whether to apply probability calibration
"""
# Get base predictions
X = []
for team_a, team_b in matchups:
preds = self.get_base_predictions(team_a, team_b)
X.append([preds[name] for name in self.system_names])
X = np.array(X)
y = np.array(outcomes)
if calibrate:
base_model = LogisticRegression(C=1.0, max_iter=1000)
self.meta_model = CalibratedClassifierCV(
base_model, cv=5, method='isotonic'
)
else:
self.meta_model = LogisticRegression(C=1.0, max_iter=1000)
self.meta_model.fit(X, y)
# Display learned coefficients
if hasattr(self.meta_model, 'coef_'):
print("Stacking coefficients:")
for name, coef in zip(self.system_names,
self.meta_model.coef_[0]):
print(f" {name}: {coef:.4f}")
def predict_stacked(self, team_a: str, team_b: str) -> float:
"""Predict using the stacking meta-model."""
if self.meta_model is None:
raise ValueError("Must call fit_stacking() first")
preds = self.get_base_predictions(team_a, team_b)
X = np.array([[preds[name] for name in self.system_names]])
return float(self.meta_model.predict_proba(X)[0, 1])
def evaluate(self, matchups: List[Tuple[str, str]],
outcomes: List[int]) -> Dict[str, float]:
"""
Evaluate all methods on a test set.
Returns log-loss and Brier score for each method.
"""
results = {}
for method_name, predict_fn in [
('simple_avg', self.predict_simple_average),
('weighted', self.predict_weighted if self.weights is not None
else None),
('stacked', self.predict_stacked if self.meta_model is not None
else None),
]:
if predict_fn is None:
continue
preds = []
for team_a, team_b in matchups:
p = predict_fn(team_a, team_b)
preds.append(np.clip(p, 1e-10, 1 - 1e-10))
preds = np.array(preds)
y = np.array(outcomes)
log_loss = -np.mean(y * np.log(preds) +
(1-y) * np.log(1-preds))
brier = np.mean((preds - y)**2)
results[method_name] = {
'log_loss': round(log_loss, 4),
'brier': round(brier, 4)
}
# Also evaluate individual systems
for name in self.system_names:
preds = []
for team_a, team_b in matchups:
p = self.get_base_predictions(team_a, team_b)[name]
preds.append(np.clip(p, 1e-10, 1 - 1e-10))
preds = np.array(preds)
            y = np.array(outcomes)
            log_loss = -np.mean(y * np.log(preds) +
                                (1 - y) * np.log(1 - preds))
            brier = np.mean((preds - y) ** 2)
results[f'base_{name}'] = {
'log_loss': round(log_loss, 4),
'brier': round(brier, 4)
}
return results
# --- Worked Example ---
if __name__ == "__main__":
# This example assumes you have fitted Elo, Massey, and PageRank systems
# from the previous sections. Here we demonstrate the ensemble structure.
# Simulate base system predictions for demonstration
np.random.seed(42)
n_games = 200
# Generate synthetic predictions from 3 "systems"
true_probs = np.random.uniform(0.3, 0.7, n_games)
outcomes = (np.random.random(n_games) < true_probs).astype(int)
# Each system adds different noise
elo_preds = np.clip(true_probs + np.random.normal(0, 0.08, n_games),
0.01, 0.99)
massey_preds = np.clip(true_probs + np.random.normal(0, 0.10, n_games),
0.01, 0.99)
pr_preds = np.clip(true_probs + np.random.normal(0.02, 0.12, n_games),
0.01, 0.99)
# Split into train/test
split = 150
# Compute metrics for individual and ensemble
from sklearn.metrics import log_loss, brier_score_loss
# Simple average
avg_preds = (elo_preds + massey_preds + pr_preds) / 3
print("=== Ensemble Rating System Comparison ===\n")
for name, preds in [("Elo", elo_preds), ("Massey", massey_preds),
("PageRank", pr_preds), ("Simple Average", avg_preds)]:
test_preds = preds[split:]
test_outcomes = outcomes[split:]
ll = log_loss(test_outcomes, test_preds)
bs = brier_score_loss(test_outcomes, test_preds)
print(f" {name:15s}: Log-Loss = {ll:.4f}, Brier = {bs:.4f}")
# Weighted combination via optimization
from scipy.optimize import minimize
train_P = np.column_stack([elo_preds[:split], massey_preds[:split],
pr_preds[:split]])
train_y = outcomes[:split]
def neg_log_likelihood(w):
w_norm = np.exp(w) / np.exp(w).sum()
p = train_P @ w_norm
p = np.clip(p, 1e-10, 1-1e-10)
return -np.mean(train_y * np.log(p) + (1-train_y) * np.log(1-p))
result = minimize(neg_log_likelihood, np.zeros(3), method='Nelder-Mead')
opt_w = np.exp(result.x) / np.exp(result.x).sum()
print(f"\n Optimal weights: Elo={opt_w[0]:.3f}, "
f"Massey={opt_w[1]:.3f}, PR={opt_w[2]:.3f}")
test_P = np.column_stack([elo_preds[split:], massey_preds[split:],
pr_preds[split:]])
weighted_preds = test_P @ opt_w
ll = log_loss(outcomes[split:], weighted_preds)
bs = brier_score_loss(outcomes[split:], weighted_preds)
print(f" {'Weighted Avg':15s}: Log-Loss = {ll:.4f}, Brier = {bs:.4f}")
Practical Recommendations for Bettors
- Start with Elo for its simplicity and interpretability. Use it as a baseline.
- Add Massey for direct point spread predictions. This is especially useful for over/under and spread betting.
- Add PageRank for sports with unbalanced schedules (college sports).
- Use Glicko-2 when you need confidence intervals for bet sizing (Kelly criterion).
- Combine using weighted averaging --- it is robust and requires minimal tuning. Use stacking only when you have enough historical data (500+ games) to train the meta-model reliably.
- Recalibrate the ensemble periodically. Optimal weights shift over time as the competitive landscape changes.
26.6 Chapter Summary
This chapter has equipped you with four distinct approaches to rating and ranking sports teams, each grounded in rigorous mathematics and implemented from scratch in Python.
The Elo System provides an elegant, incremental rating method rooted in the logistic function. Its expected score formula $E_A = 1/(1 + 10^{(R_B - R_A)/400})$ converts rating differences into win probabilities. The K-factor controls the balance between responsiveness and stability, and adaptations for home advantage and margin of victory make Elo suitable for sports beyond its chess origins.
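As a compact recap, the two Elo formulas reduce to a few lines (a minimal sketch of the expected-score and update rules stated above):

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """E_A = 1 / (1 + 10^((R_B - R_A)/400))."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, expected: float, score: float,
               k: float = 20) -> float:
    """R_A' = R_A + K(S_A - E_A), with score 1 for a win, 0.5 draw, 0 loss."""
    return r_a + k * (score - expected)

e = elo_expected(1600, 1500)       # a 100-point favorite: E_A ≈ 0.640
new_r = elo_update(1600, e, 1.0)   # win: rating rises by K(1 - E_A) ≈ 7.2
```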
Glicko-2 extends Elo with two critical additions: the rating deviation $\phi$ that quantifies how uncertain we are about a team's true strength, and the volatility $\sigma$ that captures how erratically a team performs. For bettors, these uncertainty measures are essential for sizing bets appropriately and knowing when to abstain.
Massey Ratings approach the problem from a linear algebra perspective, formulating ratings as the least-squares solution to an overdetermined system of equations relating team strength differences to observed point differentials. The Massey matrix $\mathbf{M} = \mathbf{X}^T\mathbf{X}$ captures the structure of the schedule, and the solution provides ratings directly in point-differential units.
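The least-squares core can be sketched in a few lines on toy data (a sketch only, not the full `MasseyRatings` implementation from earlier in the chapter): each row of $\mathbf{X}$ has $+1$ for the first team and $-1$ for its opponent, and because $\mathbf{M}$ is singular, one equation is replaced by the constraint that the ratings sum to zero:

```python
import numpy as np

# Three teams (0, 1, 2) and three games; p holds point differentials
X = np.array([[1.0, -1.0, 0.0],    # team 0 vs team 1
              [0.0, 1.0, -1.0],    # team 1 vs team 2
              [-1.0, 0.0, 1.0]])   # team 2 vs team 0
p = np.array([7.0, 3.0, -4.0])

M = X.T @ X                      # the Massey matrix
rhs = X.T @ p
M[-1, :] = 1.0                   # replace last row: sum of ratings = 0
rhs[-1] = 0.0
r = np.linalg.solve(M, rhs)      # ratings in point-differential units
```

The three games are intentionally inconsistent (no rating vector fits all of them exactly), so the solution is the least-squares compromise.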
PageRank brings network science to sports ranking. By modeling the season as a directed graph where losses create edges pointing toward winners, PageRank recursively defines team strength in terms of the strength of defeated opponents. This provides an explicit, principled treatment of strength of schedule.
Finally, ensemble methods combine these diverse perspectives. Simple averaging, optimized weighting, and model stacking allow bettors to leverage the distinct strengths of each system while compensating for their individual weaknesses. The empirical finding that diverse ensembles outperform individual models holds as strongly in sports rating as in any other domain of predictive modeling.
In the next chapter, we will build on these rating systems by exploring advanced regression and classification methods --- including gradient boosting, random forests, and probability calibration --- that can incorporate ratings as features alongside other predictive variables to produce even more accurate betting models.
Key Formulas Reference
| System | Core Formula | Output |
|---|---|---|
| Elo | $E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$ | Win probability |
| Elo update | $R_A' = R_A + K(S_A - E_A)$ | Updated rating |
| Glicko-2 $g$ | $g(\phi) = 1/\sqrt{1 + 3\phi^2/\pi^2}$ | Uncertainty scaling |
| Massey | $\mathbf{M}\mathbf{r} = \mathbf{p}$ | Point differential ratings |
| PageRank | $\mathbf{r} = d\mathbf{H}^T\mathbf{r} + \frac{1-d}{n}\mathbf{1}$ | Relative importance |
Exercises
1. Elo Calibration. Using historical NFL data, find the K-factor and home-advantage parameter that minimize the log-loss of Elo predictions. Plot log-loss as a function of K for K in {5, 10, 15, 20, 25, 30, 40}.
2. Glicko-2 Confidence. Implement a betting strategy that only wagers when both teams' RD is below a threshold. Test thresholds from 30 to 100 and measure profitability.
3. Massey with Recency. Compare Massey ratings with uniform weights versus exponential decay ($\lambda = 0.99, 0.97, 0.95$). Which decay factor produces the best point spread predictions?
4. PageRank Damping. Test damping factors from 0.5 to 0.95 for college basketball PageRank. Does a different damping factor emerge as optimal compared to the web-standard 0.85?
5. Ensemble Construction. Build an ensemble of all four systems for NBA prediction. Compare simple averaging, weighted averaging, and logistic regression stacking. Report log-loss and calibration curves for each.