Chapter 26 Key Takeaways: Ratings and Ranking Systems
Core Concepts
- Every rating system answers the same question differently. The fundamental question, "how strong is each team, really?", admits multiple valid answers depending on what aspects of strength you prioritize. Elo emphasizes recent form, Glicko-2 adds uncertainty, Massey focuses on point differentials, and PageRank rewards the quality of opponents beaten.
- The Elo system is the foundational framework. Its expected score formula $E_A = 1/(1 + 10^{(R_B - R_A)/400})$ converts rating differences into win probabilities using a logistic function. The K-factor controls responsiveness, home advantage adjusts for venue effects, and margin-of-victory adjustments extract additional information from game scores. Elo is simple, transparent, and effective.
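The expected-score formula and update rule fit in a few lines. A minimal sketch (the K value and home-advantage parameter here are illustrative defaults, not prescriptions):

```python
def elo_expected(r_a, r_b):
    """Win probability for team A implied by the logistic Elo formula."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=20.0, home_adv_a=0.0):
    """Update both ratings after one game.
    score_a is 1.0 (A wins), 0.5 (draw), or 0.0 (A loses).
    home_adv_a shifts A's effective rating only when computing the expectation."""
    e_a = elo_expected(r_a + home_adv_a, r_b)
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta

# Equal ratings imply a 50% expectation; a 400-point edge implies ~91%.
print(elo_expected(1500, 1500))            # 0.5
print(round(elo_expected(1900, 1500), 3))  # 0.909
```

Note the zero-sum update: whatever the winner gains, the loser loses, so the league-average rating stays fixed.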
- Glicko-2 introduces uncertainty quantification. By tracking a rating deviation (RD) alongside the point estimate, Glicko-2 tells you not just what a team's rating is, but how confident you should be in that rating. This is directly useful for bet sizing: wager more when ratings are precise, less when they are uncertain.
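One way to act on the RD is to widen the implied rating interval and shrink stakes as uncertainty grows. The sizing rule below is purely illustrative (the `rd_reference` threshold is a hypothetical tuning knob, not part of Glicko-2):

```python
def rating_interval(rating, rd, z=1.96):
    """Approximate 95% interval implied by a Glicko-style rating deviation."""
    return rating - z * rd, rating + z * rd

def uncertainty_scaled_stake(base_stake, rd, rd_reference=50.0):
    """Illustrative sizing rule (not part of Glicko-2): full stake at or
    below a reference RD, proportionally smaller stakes as RD grows."""
    return base_stake * min(1.0, rd_reference / rd)

lo, hi = rating_interval(1500, 80)               # wide band: uncertain team
stake = uncertainty_scaled_stake(100.0, 160.0)   # RD far above reference
```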
- Massey ratings solve a linear algebra problem. By formulating ratings as the least-squares solution to $\mathbf{Mr} = \mathbf{p}$, Massey produces ratings in interpretable point-differential units. The Massey matrix captures the schedule structure, and extensions for temporal weighting, margin capping, and home advantage make the system flexible and powerful.
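A minimal worked example on a hypothetical four-team schedule (the game results are invented for illustration):

```python
import numpy as np

# Games as (winner_idx, loser_idx, point_margin) for a 4-team season.
games = [(0, 1, 7), (0, 2, 3), (1, 2, 10), (2, 3, 14), (0, 3, 21), (1, 3, 2)]
n = 4

M = np.zeros((n, n))   # Massey matrix
p = np.zeros(n)        # cumulative point differentials
for w, l, margin in games:
    M[w, w] += 1; M[l, l] += 1   # diagonal: games played
    M[w, l] -= 1; M[l, w] -= 1   # off-diagonal: minus the matchup count
    p[w] += margin; p[l] -= margin

# M is singular (only rating differences are identified), so replace the
# last row with a sum-to-zero constraint before solving.
M[-1, :] = 1.0
p[-1] = 0.0
r = np.linalg.solve(M, p)   # ratings in point-differential units
```

The solved ratings sum to zero, and the undefeated team 0 lands on top while winless team 3 lands at the bottom.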
- PageRank brings network science to sports. By modeling the season as a directed graph where losses point to winners, PageRank recursively defines team strength in terms of the strength of defeated opponents. This provides the most explicit treatment of strength of schedule among the four systems studied.
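A compact power-method sketch. The damping factor and the uniform-link handling of undefeated teams are common conventions, not the only choices:

```python
import numpy as np

def pagerank_ratings(n_teams, games, d=0.85, tol=1e-10, max_iter=200):
    """Power-method PageRank on the losses-point-to-winners graph.
    games: list of (winner_idx, loser_idx) pairs."""
    A = np.zeros((n_teams, n_teams))
    for w, l in games:
        A[l, w] += 1.0   # edge from loser to winner
    # Row-normalize; undefeated teams (no outgoing edges) link uniformly.
    row_sums = A.sum(axis=1, keepdims=True)
    T = np.where(row_sums > 0, A / np.maximum(row_sums, 1e-12), 1.0 / n_teams)
    r = np.full(n_teams, 1.0 / n_teams)
    for _ in range(max_iter):
        r_new = (1 - d) / n_teams + d * (r @ T)
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

r = pagerank_ratings(4, [(0, 1), (0, 2), (1, 2), (2, 3), (0, 3), (1, 3)])
```

On this toy schedule the scores form a probability distribution (they sum to one), with the undefeated team 0 ranked highest.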
Practical Insights for Bettors
- K-factor selection is the most important tuning decision. For short seasons with high variance (NFL), use K = 20-30. For long seasons with lower variance (NBA, MLB), use K = 10-20. Always validate with out-of-sample log-loss, never in-sample metrics.
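That validation loop can be sketched as follows, with a synthetic schedule standing in for real data (game tuples and team counts are illustrative):

```python
import math

def log_loss(probs, outcomes):
    """Mean negative log-likelihood of binary outcomes."""
    eps = 1e-12
    return -sum(o * math.log(p + eps) + (1 - o) * math.log(1 - p + eps)
                for p, o in zip(probs, outcomes)) / len(probs)

def evaluate_k(k, train_games, test_games, n_teams):
    """Fit Elo with K-factor k on train_games, score held-out test_games.
    Each game is (home_idx, away_idx, home_won)."""
    ratings = [1500.0] * n_teams
    def expected(a, b):
        return 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
    for h, a, won in train_games:
        e = expected(h, a)
        ratings[h] += k * (won - e)
        ratings[a] -= k * (won - e)
    probs = [expected(h, a) for h, a, _ in test_games]
    return log_loss(probs, [w for _, _, w in test_games])

# Sweep candidate K values and keep the one with the lowest held-out loss:
# best_k = min(candidates, key=lambda k: evaluate_k(k, train, test, n_teams))
```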
- Season-to-season regression is essential. Without regression, ratings carry stale information into a new season where rosters and coaching staffs have changed. A regression fraction of 1/3 (carrying 2/3 of end-of-season rating) is a robust starting point for most sports.
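The carry-over rule is a one-liner; 1500 as the league mean follows Elo convention, and both defaults should be tuned per sport:

```python
def regress_to_mean(rating, league_mean=1500.0, fraction=1/3):
    """Pull one third of last season's deviation back toward the league
    mean, carrying the remaining two thirds into the new season."""
    return league_mean + (1 - fraction) * (rating - league_mean)

new_rating = regress_to_mean(1650.0)  # roughly 1600: two thirds of +150 kept
```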
- Margin of victory is informative but noisy. Log-scaled MOV adjustments improve predictions by 1-3% in log-loss compared to win/loss-only Elo. The autocorrelation adjustment prevents strong teams from being over-rewarded for expected blowouts.
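One widely used form of this adjustment uses the multiplier popularized by FiveThirtyEight's NFL Elo; treat its constants as a starting point rather than sport-universal values:

```python
import math

def mov_multiplier(point_margin, winner_elo_diff):
    """Log-scaled margin-of-victory multiplier with an autocorrelation
    correction: blowouts by heavy favorites earn a dampened bonus.
    winner_elo_diff is the winner's pregame rating minus the loser's,
    including any venue adjustment."""
    return math.log(abs(point_margin) + 1) * (2.2 / (winner_elo_diff * 0.001 + 2.2))

# The same 7-point win counts for less when the winner was a big favorite:
even_match = mov_multiplier(7, 0)      # ~2.08
big_favorite = mov_multiplier(7, 200)  # smaller multiplier
```

The multiplier scales the usual K-factor update, so K, MOV, and the autocorrelation term should be tuned jointly.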
- Home advantage varies by sport and is changing over time. NFL home advantage has declined from roughly 60% to 57% over recent decades. COVID-era empty stadiums provided evidence that crowd effects account for a significant portion of home advantage. Always calibrate home advantage empirically rather than using fixed values.
- Ensemble ratings outperform individual systems. Combining Elo, Massey, and PageRank through weighted averaging typically improves log-loss by 2-5% over the best individual system. The improvement comes from diversity: each system captures different aspects of team strength and makes different types of errors.
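The weighted average itself is trivial; the real work is choosing the weights by out-of-sample validation. A sketch with purely illustrative weights and probabilities:

```python
def ensemble_prob(system_probs, weights):
    """Weighted average of win probabilities from several rating systems."""
    total = sum(weights)
    return sum(p * w for p, w in zip(system_probs, weights)) / total

# Hypothetical Elo, Massey, and PageRank-derived probabilities for one game,
# blended with weights selected on validation log-loss:
p_home = ensemble_prob([0.62, 0.58, 0.55], [0.5, 0.3, 0.2])
```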
Mathematical Foundations
- The logistic function is the natural link between rating differences and probabilities. It arises from the assumption that team performances follow logistic (or approximately normal) distributions, making it theoretically grounded as well as practically useful.
- The Massey matrix has elegant structure. Diagonal entries count games played; off-diagonal entries count matchups between teams. The matrix is singular because only rating differences are determined, requiring a constraint (usually sum-to-zero) for a unique solution.
- PageRank convergence is guaranteed. The power method converges to the unique stationary distribution of the Markov chain defined by the damped transition matrix. Convergence is typically fast (20-50 iterations) and insensitive to the starting vector.
- Calibration is the bridge between ratings and betting. A well-calibrated system produces probabilities that match observed frequencies. The Expected Calibration Error (ECE) quantifies this property. Calibration should be monitored continuously and recalibrated when it drifts.
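ECE can be computed in plain Python over equal-width probability bins (10 bins is a common default):

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin predictions by predicted probability, compare each bin's mean
    prediction with its observed win frequency, and weight by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into top bin
        bins[idx].append((p, o))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)
        freq = sum(o for _, o in b) / len(b)
        ece += (len(b) / n) * abs(avg_p - freq)
    return ece
```

A perfectly calibrated set of predictions scores 0; systematic overconfidence (say, 90% predictions that win only half the time) shows up directly in the gap.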
Common Pitfalls
- Overfitting the K-factor to a single season. The optimal K-factor varies slightly between seasons. Use multi-season validation to select a robust value rather than optimizing for one season's results.
- Ignoring the startup period. All rating systems produce unreliable predictions for the first several games while ratings are still far from their equilibrium values. Exclude early-season predictions from evaluation metrics.
- Treating PageRank scores as calibrated probabilities. Raw PageRank scores are relative importance measures, not probabilities. Converting them to win probabilities via score ratios ($r_A/(r_A + r_B)$) produces estimates that typically need further calibration.
- Assuming stationarity. Rating systems assume that the underlying team strengths evolve slowly. Mid-season trades, injuries, and coaching changes violate this assumption. Build monitoring systems that detect when predictions suddenly degrade.
Connections to Other Chapters
- Chapter 9 (Regression Analysis): Massey ratings are fundamentally a linear regression problem, connecting rating systems to the statistical foundations covered earlier.
- Chapter 10 (Bayesian Methods): Glicko-2's uncertainty tracking is Bayesian in spirit, updating beliefs about team strength as new evidence arrives.
- Chapter 27 (Advanced Regression and Classification): Rating system outputs serve as powerful features for the machine learning models covered next. XGBoost models that include Elo, Massey, and PageRank ratings as features consistently outperform models without them.