Chapter 26 Exercises: Ratings and Ranking Systems
Instructions: Complete all exercises in the parts assigned by your instructor. Show all work for calculation problems. For programming challenges, include comments explaining your logic and provide sample output. For analysis and research problems, cite your sources where applicable.
Part A: Conceptual Understanding
Each problem is worth 5 points. Answer in complete sentences unless otherwise directed.
Exercise A.1 --- Elo System Foundations
Explain the mathematical intuition behind the Elo expected score formula $E_A = 1/(1 + 10^{(R_B - R_A)/400})$. Address (a) why the logistic function is a natural choice for modeling win probabilities, (b) what the scaling constant 400 controls and how changing it would affect the system, (c) why the system is zero-sum and what practical implications this has for rating inflation or deflation, and (d) how the expected score formula relates to the Bradley-Terry model from paired comparison theory.
Exercise A.2 --- The K-Factor Tradeoff
The K-factor governs the tradeoff between responsiveness and stability in Elo ratings. Explain (a) why a bettor using Elo for NFL prediction might choose a different K-factor than one modeling MLB, referencing sample size and single-game variance, (b) why using a single fixed K-factor for all teams is suboptimal and what alternatives exist (e.g., dynamic K-factors), (c) how excessively high K-factors can cause "whipsaw" effects that degrade predictive accuracy, and (d) the relationship between K-factor selection and the concept of effective sample size in the rating window.
Exercise A.3 --- Glicko-2 Uncertainty Quantification
Explain the role of the rating deviation (RD) in the Glicko-2 system. Address (a) why treating all ratings as equally certain (as Elo does) leads to suboptimal predictions, (b) how RD increases during periods of inactivity and why this is appropriate, (c) how the g-function $g(\phi) = 1/\sqrt{1 + 3\phi^2/\pi^2}$ downweights the influence of games against opponents with uncertain ratings, and (d) how a bettor can use the RD to make more informed wagering decisions, including confidence-based bet sizing.
Exercise A.4 --- Massey Ratings and Linear Algebra
Describe the Massey rating system from a linear algebra perspective. Explain (a) how each game contributes a row to the design matrix $\mathbf{X}$ and why the system is typically overdetermined, (b) why the Massey matrix $\mathbf{M} = \mathbf{X}^T\mathbf{X}$ is singular and how the sum-to-zero constraint resolves this, (c) why the resulting ratings are in interpretable units (points of margin), and (d) how the batch computation approach of Massey compares to the incremental approach of Elo, including when each is more appropriate for sports betting.
Exercise A.5 --- PageRank for Sports Rankings
Explain how Google's PageRank algorithm adapts to sports ranking. Address (a) the analogy between web page links and game results (losses as "votes" for the winner), (b) the role of the damping factor $d$ and why the web-standard value of 0.85 may not be optimal for sports, (c) how PageRank's recursive definition of strength explicitly models strength of schedule, and (d) why PageRank is particularly valuable for ranking teams in leagues with unbalanced schedules such as college football.
Exercise A.6 --- Margin of Victory Adjustments
Discuss the incorporation of margin of victory into rating systems. Explain (a) why raw margin of victory is a noisy but informative signal, (b) why logarithmic scaling (e.g., $\ln(|MOV| + 1)$) is preferred over linear margin, (c) the autocorrelation problem (strong teams beating weak teams by large margins that are already expected) and how the Elo MOV multiplier addresses it, and (d) why capping margins at a sport-specific threshold (e.g., 28 points in the NFL) can improve predictive accuracy.
Exercise A.7 --- Season-to-Season Regression
Explain why between-season regression toward the mean is a critical component of any sports rating system. Address (a) the sources of roster and coaching turnover that justify regression, (b) how the regression fraction $\alpha$ should differ across sports (NFL vs. NBA vs. MLB), (c) how setting $\alpha$ too low (insufficient regression) versus too high (excessive regression) affects early-season predictions, and (d) the relationship between season-to-season regression in ratings and the statistical phenomenon of regression to the mean described by Francis Galton.
Exercise A.8 --- Ensemble Rating Systems
Describe why combining multiple rating systems into an ensemble is advantageous. Explain (a) the concept of model diversity and why systems that make different types of errors are most valuable to combine, (b) the difference between simple averaging, weighted averaging, and model stacking as combination methods, (c) how log-loss minimization determines optimal weights for a weighted ensemble, and (d) why an ensemble of Elo, Glicko-2, Massey, and PageRank captures complementary aspects of team strength.
Part B: Calculations
Each problem is worth 5 points. Show all work and round final answers to the indicated precision.
Exercise B.1 --- Elo Expected Score Computation
Team A has an Elo rating of 1620 and Team B has a rating of 1480. Team A is playing at home with a home-field advantage of $H = 48$ points.
(a) Compute the effective rating difference (including home advantage).
(b) Compute the expected score $E_A$ for Team A. Round to four decimal places.
(c) Compute the expected score $E_B$ for Team B.
(d) Convert $E_A$ to American moneyline odds for Team A.
(e) If the sportsbook posts Team A at $-180$ (implied probability $\approx 0.643$), does the Elo model identify value on either side?
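Hint: the expected score formula from A.1 and the probability-to-moneyline conversion can be sketched as below. This is a minimal sketch, not a reference solution. The function names `elo_expected` and `prob_to_american` are illustrative, the conversion gives fair odds with no vig, and the numbers are deliberately not the exercise values.

```python
def elo_expected(r_a: float, r_b: float, home_adv: float = 0.0) -> float:
    """Expected score for team A; home_adv (in Elo points) is added to A's rating."""
    return 1.0 / (1.0 + 10 ** ((r_b - (r_a + home_adv)) / 400.0))


def prob_to_american(p: float) -> float:
    """Convert a win probability (0 < p < 1) to fair American odds (no vig)."""
    if p >= 0.5:
        return -100.0 * p / (1.0 - p)   # favorite: negative odds
    return 100.0 * (1.0 - p) / p        # underdog: positive odds


# Illustrative values only.
print(round(elo_expected(1500, 1500), 4))   # 0.5 (evenly matched, neutral site)
print(round(prob_to_american(0.75)))        # -300
print(round(prob_to_american(0.25)))        # 300
```

The two expected scores for a game always sum to 1, which is a quick check on parts (b) and (c).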
Exercise B.2 --- Elo Rating Update with Margin of Victory
Using the setup from B.1, suppose Team A (home) defeats Team B 31-17. The base K-factor is $K = 20$.
(a) Compute the margin of victory: $MOV = |31 - 17| = 14$.
(b) Compute the MOV multiplier $M = \ln(|MOV| + 1) \times \frac{2.2}{(\text{ELOW}_{\text{diff}} \times 0.001) + 2.2}$. The Elo difference between the winner (A, rating 1620) and the loser (B, rating 1480) is 140. Round $M$ to four decimal places.
(c) Compute the effective K-factor: $K_{\text{eff}} = K \times M$.
(d) Compute Team A's rating change: $\Delta_A = K_{\text{eff}} \times (S_A - E_A)$, where $S_A = 1$.
(e) What are Team A's and Team B's new ratings after this game?
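Hint: the update in parts (b) through (e) can be organized as in the sketch below, which assumes the multiplier formula exactly as written in part (b). The helper names are illustrative and the example uses two equally rated teams rather than the exercise values.

```python
import math


def mov_multiplier(mov: float, winner_elo_diff: float) -> float:
    """ln(|MOV| + 1) * 2.2 / (winner_elo_diff * 0.001 + 2.2).

    winner_elo_diff is the winner's rating minus the loser's rating.
    """
    return math.log(abs(mov) + 1.0) * 2.2 / (winner_elo_diff * 0.001 + 2.2)


def elo_update(r_winner: float, r_loser: float, e_winner: float,
               mov: float, k: float = 20.0) -> tuple[float, float]:
    """Return (new winner rating, new loser rating) after one game."""
    k_eff = k * mov_multiplier(mov, r_winner - r_loser)
    delta = k_eff * (1.0 - e_winner)   # winner's actual score S = 1
    return r_winner + delta, r_loser - delta


# Illustrative values only: equal ratings, expected score 0.5, 7-point win.
print(round(mov_multiplier(7, 0), 4))            # 2.0794 (= ln 8)
new_w, new_l = elo_update(1500, 1500, 0.5, mov=7)
print(round(new_w, 2), round(new_l, 2))          # 1520.79 1479.21
```

Because the loser gives up exactly what the winner gains, the sum of the two ratings is unchanged by the update.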
Exercise B.3 --- Glicko-2 Scale Conversion
A team has a traditional Elo-scale rating of $r = 1650$ and a rating deviation of $\text{RD} = 90$.
(a) Convert the rating to the Glicko-2 internal scale: $\mu = (r - 1500) / 173.7178$. Round to four decimal places.
(b) Convert the RD to the internal scale: $\phi = \text{RD} / 173.7178$.
(c) Compute the 95% confidence interval on the traditional scale: $[r - 1.96 \times \text{RD}, \; r + 1.96 \times \text{RD}]$.
(d) Compute the g-function value for this team's RD: $g(\phi) = 1 / \sqrt{1 + 3\phi^2/\pi^2}$.
(e) Another team has $\mu_2 = 0.5$ and $\phi_2 = 0.35$. Compute the expected outcome $E = 1/(1 + \exp(-g(\phi_2)(\mu - \mu_2)))$ for the first team against the second.
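Hint: the conversions and the g-function in this exercise translate directly into code. The sketch below uses the formulas as stated in parts (a), (b), (d), and (e); the function names are illustrative and the sample RD of 100 is not the exercise value.

```python
import math

SCALE = 173.7178  # Glicko-2 scale constant used in parts (a) and (b)


def to_glicko2(r: float, rd: float) -> tuple[float, float]:
    """Convert an Elo-scale rating and RD to the internal (mu, phi) scale."""
    return (r - 1500.0) / SCALE, rd / SCALE


def g(phi: float) -> float:
    """Downweighting factor for an opponent whose deviation is phi."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def expected_outcome(mu: float, mu_opp: float, phi_opp: float) -> float:
    """E = 1 / (1 + exp(-g(phi_opp) * (mu - mu_opp)))."""
    return 1.0 / (1.0 + math.exp(-g(phi_opp) * (mu - mu_opp)))


# Illustrative values only.
mu, phi = to_glicko2(1500, 100)
print(mu)                    # 0.0
print(round(g(phi), 3))      # 0.953
print(expected_outcome(0.0, 0.0, phi))   # 0.5 for equal ratings
```

Note that $g(\phi)$ equals 1 when $\phi = 0$ and shrinks toward 0 as the opponent's uncertainty grows, which pulls the expected outcome toward 0.5.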
Exercise B.4 --- Massey Rating Construction
Four teams play a round-robin tournament with the following results (home team listed first):
| Home | Away | Home Score | Away Score |
|---|---|---|---|
| A | B | 28 | 14 |
| C | D | 21 | 17 |
| A | C | 24 | 20 |
| B | D | 10 | 30 |
| A | D | 35 | 7 |
| B | C | 17 | 24 |
Ignoring home advantage:
(a) Construct the design matrix $\mathbf{X}$ (6 rows, 4 columns) and the point differential vector $\mathbf{y}$.
(b) Compute $\mathbf{M} = \mathbf{X}^T\mathbf{X}$ (the Massey matrix).
(c) Compute $\mathbf{p} = \mathbf{X}^T\mathbf{y}$ (the aggregated point differential vector).
(d) Replace the last row of $\mathbf{M}$ with $[1, 1, 1, 1]$ and the last entry of $\mathbf{p}$ with 0. Solve $\mathbf{Mr} = \mathbf{p}$ for the ratings (use any method; verify ratings sum to zero).
(e) Rank the four teams by their Massey ratings and interpret the ratings in terms of expected point differential.
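Hint: the construction in parts (b) through (d) can be checked numerically. The sketch below builds $\mathbf{M}$ and $\mathbf{p}$ directly from a game list, applies the sum-to-zero row replacement, and solves the system with a small hand-written Gauss-Jordan routine so that it runs without extra libraries (numpy.linalg.solve would do the same job). The function names and the three-team schedule are hypothetical, and the sketch assumes every team is connected to every other through the schedule.

```python
def solve(A: list[list[float]], b: list[float]) -> list[float]:
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[k][n] / aug[k][k] for k in range(n)]


def massey_ratings(games: list[tuple[int, int, float]], n_teams: int) -> list[float]:
    """games holds (i, j, margin) with margin = score_i - score_j."""
    M = [[0.0] * n_teams for _ in range(n_teams)]
    p = [0.0] * n_teams
    for i, j, margin in games:
        M[i][i] += 1.0          # diagonal: games played
        M[j][j] += 1.0
        M[i][j] -= 1.0          # off-diagonal: minus games between i and j
        M[j][i] -= 1.0
        p[i] += margin          # net point differential
        p[j] -= margin
    M[-1] = [1.0] * n_teams     # sum-to-zero constraint replaces the last row
    p[-1] = 0.0
    return solve(M, p)


# Hypothetical schedule: team 0 beats 1 by 10, 1 beats 2 by 5, 0 beats 2 by 15.
ratings = massey_ratings([(0, 1, 10), (1, 2, 5), (0, 2, 15)], n_teams=3)
print([round(r, 3) for r in ratings])   # [8.333, -1.667, -6.667]
```

In this toy schedule the margins are perfectly consistent, so each rating difference reproduces the observed margin exactly; with real results the ratings are a least-squares compromise.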
Exercise B.5 --- PageRank Computation
Using the tournament from B.4, construct the PageRank adjacency matrix with margin-of-victory weighting.
(a) Build the adjacency matrix $\mathbf{A}$ where $A_{ij}$ is the total margin by which team $j$ has beaten team $i$.
(b) Normalize each row to create the transition matrix $\mathbf{H}$. Handle any dangling nodes (rows summing to zero) by distributing weight equally.
(c) Perform two iterations of the power method with damping factor $d = 0.85$, starting from $\mathbf{r}^{(0)} = [0.25, 0.25, 0.25, 0.25]^T$.
(d) Report the ranking after two iterations.
(e) Compare the PageRank ranking to the Massey ranking from B.4. Do they agree? If not, explain why.
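Hint: parts (b) and (c) can be sketched as a short power iteration. The sketch assumes the convention from part (a), where row $i$ holds the margins by which team $i$ lost, and it treats a dangling row by spreading its weight equally over all teams. The function names and the three-team adjacency matrix are hypothetical.

```python
def transition_matrix(adj: list[list[float]]) -> list[list[float]]:
    """Row-normalize adj; a row of zeros (dangling node) becomes uniform."""
    n = len(adj)
    H = []
    for row in adj:
        total = sum(row)
        H.append([x / total for x in row] if total > 0 else [1.0 / n] * n)
    return H


def pagerank(adj: list[list[float]], d: float = 0.85, iters: int = 2) -> list[float]:
    """Run `iters` power-method steps starting from the uniform vector."""
    n = len(adj)
    H = transition_matrix(adj)
    r = [1.0 / n] * n
    for _ in range(iters):
        # Team j collects rank from every team i that "voted" for it by losing.
        r = [(1.0 - d) / n + d * sum(r[i] * H[i][j] for i in range(n))
             for j in range(n)]
    return r


# Hypothetical results: team 0 beat 1 by 10 and beat 2 by 15; team 1 beat 2 by 5.
adj = [
    [0, 0, 0],    # team 0 never lost (dangling row)
    [10, 0, 0],   # team 1 lost to team 0 by 10
    [15, 5, 0],   # team 2 lost to team 0 by 15 and to team 1 by 5
]
ranks = pagerank(adj, d=0.85, iters=2)
print([round(x, 4) for x in ranks])
```

With this dangling-node treatment every row of $\mathbf{H}$ sums to 1, so the rank vector keeps summing to 1 after each iteration, which is a useful sanity check on hand calculations.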
Exercise B.6 --- Converting Ratings to Betting Lines
Team X has an Elo rating of 1580 and Team Y has a rating of 1520. Team X is playing at home ($H = 48$).
(a) Compute the expected win probability for Team X.
(b) Convert this probability to decimal odds and American odds.
(c) Using the NFL approximation that 1 Elo point $\approx$ 0.0375 points of scoring margin, compute the implied point spread for this game.
(d) The sportsbook line is X $-4.5$ at $-110$. After accounting for the vig, what is the implied break-even probability? Compare to your model.
(e) Compute the expected value (EV) of a \$100 bet on Team X at $-4.5$ using your model's probability and the sportsbook payout structure.
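Hint: parts (d) and (e) rely on converting American odds to a payout and a break-even probability. The sketch below shows one way to organize that arithmetic. The function names are illustrative, the odds in the example are not the exercise values, and `p_win` is assumed to be the probability that the specific bet wins (for a spread bet, the probability of covering), with pushes ignored.

```python
def american_to_decimal(odds: float) -> float:
    """Decimal odds (total return per unit staked) from American odds."""
    return 1.0 + odds / 100.0 if odds > 0 else 1.0 + 100.0 / abs(odds)


def break_even_prob(odds: float) -> float:
    """Win probability at which a bet at these odds has zero expected value."""
    return 1.0 / american_to_decimal(odds)


def expected_value(p_win: float, odds: float, stake: float = 100.0) -> float:
    """EV of a single bet: p * profit - (1 - p) * stake."""
    profit = stake * (american_to_decimal(odds) - 1.0)
    return p_win * profit - (1.0 - p_win) * stake


# Illustrative values only.
print(round(break_even_prob(-120), 4))        # 0.5455
print(round(expected_value(0.45, 150), 2))    # 12.5
```

A bet has positive expected value only when the model's probability exceeds the break-even probability at the posted price.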
Exercise B.7 --- Ensemble Weight Optimization
Three rating systems produce the following win probabilities for 5 games, along with the actual outcomes:
| Game | Elo $p_1$ | Massey $p_2$ | PageRank $p_3$ | Outcome |
|---|---|---|---|---|
| 1 | 0.65 | 0.60 | 0.70 | 1 |
| 2 | 0.40 | 0.35 | 0.45 | 0 |
| 3 | 0.55 | 0.50 | 0.52 | 1 |
| 4 | 0.72 | 0.68 | 0.75 | 1 |
| 5 | 0.30 | 0.38 | 0.28 | 0 |
(a) Compute the log-loss for each individual system.
(b) Compute the log-loss for a simple average ensemble: $p_{\text{avg}} = (p_1 + p_2 + p_3)/3$.
(c) Find the weights $w_1, w_2, w_3$ (summing to 1) that minimize the ensemble log-loss. (Hint: try a grid search over $w_1, w_2 \in \{0, 0.1, 0.2, \ldots, 1.0\}$ with $w_3 = 1 - w_1 - w_2$.)
(d) Compute the Brier score for the optimal weighted ensemble.
(e) Does the weighted ensemble outperform the best individual system? By how much?
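Hint: the metrics and the grid search suggested in part (c) can be sketched as follows. The function names are illustrative, the clipping constant `eps` is an added safeguard against taking the log of 0, and the two-game dataset is hypothetical rather than the table above.

```python
import math


def log_loss(probs: list[float], outcomes: list[int], eps: float = 1e-12) -> float:
    """Mean negative log-likelihood of binary outcomes."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)


def brier(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probabilities and outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def grid_search(systems: list[list[float]], outcomes: list[int], steps: int = 10):
    """Search weights (w1, w2, w3) on a grid of 1/steps; return (best loss, weights)."""
    best = (math.inf, None)
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            w = (a / steps, b / steps, (steps - a - b) / steps)
            blended = [sum(wk * s[g] for wk, s in zip(w, systems))
                       for g in range(len(outcomes))]
            loss = log_loss(blended, outcomes)
            if loss < best[0]:
                best = (loss, w)
    return best


# Hypothetical two-game example in which the first system is sharpest.
systems = [[0.9, 0.2], [0.6, 0.4], [0.5, 0.5]]
outcomes = [1, 0]
print(round(log_loss(systems[2], outcomes), 4))   # 0.6931 (= ln 2)
print(brier(systems[2], outcomes))                # 0.25
print(grid_search(systems, outcomes)[1])          # (1.0, 0.0, 0.0)
```

Indexing the grid with integers avoids floating-point drift in the weights. With only a handful of games, weights fitted this way are fitted to the same data they are scored on, so the in-sample improvement overstates what to expect on new games.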
Part C: Programming Challenges
Each problem is worth 10 points. Write clean, well-documented Python code. Include docstrings, type hints, and at least three test cases per function.
Exercise C.1 --- Complete Elo System with Backtesting
Build a complete Elo rating system with a backtesting framework for evaluating predictive performance.
Requirements:
- Implement the full Elo system with configurable K-factor, home advantage, margin-of-victory adjustment, and season-to-season regression.
- Generate synthetic season data for 8 teams playing a balanced schedule of 14 games each.
- Backtest over at least 3 simulated seasons, recording predictions versus actual outcomes.
- Compute evaluation metrics: log-loss, Brier score, accuracy, and calibration error.
- Implement a K-factor grid search that tests K values from 5 to 40 in steps of 5, reporting log-loss for each.
- Plot (or print in tabular form) the relationship between K-factor and predictive accuracy.
Exercise C.2 --- Glicko-2 System with Confidence-Based Betting
Implement a Glicko-2 rating system and use the rating deviations to build a confidence-based betting strategy.
Requirements:
- Implement the full Glicko-2 algorithm including the Illinois algorithm for volatility estimation.
- Track rating, RD, and volatility for each team across multiple rating periods.
- Implement a betting strategy that only wagers when both teams' RDs are below a configurable threshold.
- Compare betting results (simulated P&L) across RD thresholds of 40, 60, 80, and 100.
- Produce a report showing how many games are bet at each threshold and the resulting accuracy.
Exercise C.3 --- Massey Ratings with Weighted Least Squares
Implement Massey ratings with extensions for weighted least squares and home advantage estimation.
Requirements:
- Implement the base Massey system using numpy.linalg.solve.
- Add temporal weighting with configurable decay parameter $\lambda$.
- Add margin capping at a configurable threshold.
- Add home-advantage estimation by augmenting the design matrix.
- Compare predictions across configurations: (1) uniform weights no cap, (2) $\lambda = 0.98$ with cap at 21, (3) $\lambda = 0.95$ with cap at 14.
- Report predicted point spreads for 5 hypothetical matchups under each configuration.
Exercise C.4 --- PageRank Sports Ranking Engine
Build a PageRank ranking engine with multiple weighting options and convergence analysis.
Requirements:
- Implement PageRank from scratch using the power method (no external graph libraries).
- Support three edge weighting modes: (1) binary (win/loss only), (2) margin-weighted, (3) score-proportional.
- Track convergence: record the L1 norm of the change vector at each iteration and report the number of iterations to convergence.
- Test damping factors from 0.5 to 0.95 in steps of 0.05 and report how rankings change.
- Compare rankings from the three weighting modes on the same dataset.
Exercise C.5 --- Ensemble Rating System with Stacking
Build an ensemble that combines Elo, Massey, and PageRank predictions using multiple combination strategies.
Requirements:
- Use the implementations from C.1, C.3, and C.4 (or simplified versions) as base systems.
- Generate predictions from all three systems on a synthetic season of at least 100 games.
- Implement three combination strategies: (1) simple averaging, (2) log-loss-optimized weights, (3) logistic regression stacking.
- Evaluate all strategies plus individual systems using log-loss and Brier score.
- Produce a summary table ranking all approaches by predictive performance.
Part D: Analysis & Interpretation
Each problem is worth 5 points. Provide structured, well-reasoned responses.
Exercise D.1 --- Evaluating FiveThirtyEight's Elo System
FiveThirtyEight (now owned by ABC News) used Elo ratings extensively for NFL, NBA, and MLB predictions. Analyze their system:
(a) What K-factor and home-advantage values did FiveThirtyEight use for NFL Elo? How do these compare to the values recommended in Section 26.1?
(b) FiveThirtyEight incorporated quarterback-adjusted Elo for the NFL. Explain why incorporating individual player information into a team-level rating system is valuable but also risky. What assumptions must hold for this adjustment to improve predictions?
(c) Evaluate the claim that "Elo is intentionally simple and thus underperforms more complex models." Under what circumstances might Elo's simplicity be an advantage for a bettor?
(d) If you were to build a competing rating system to FiveThirtyEight's NFL Elo, what specific modifications would you make and why?
Exercise D.2 --- Rating System Selection by Sport
Different sports have different characteristics that favor different rating systems. For each of the following sports, recommend the most appropriate primary rating system from {Elo, Glicko-2, Massey, PageRank} and justify your choice:
(a) English Premier League soccer (38-game season, promotion/relegation, relatively few draws resolved by margin)
(b) NCAA March Madness basketball (300+ teams, highly unbalanced schedules, single-elimination tournament)
(c) NFL football (17-game season, high single-game variance, significant roster turnover)
(d) Tennis (individual sport, frequent matches, wide range of skill levels, varied surfaces)
(e) For each sport, identify which secondary system you would combine with your primary choice and explain what complementary information it provides.
Exercise D.3 --- Diagnosing Rating System Failures
You have been running an Elo system for NBA predictions throughout a season. At the midpoint, you notice that your log-loss has increased significantly compared to the previous season. Investigation reveals:
- Three teams made major mid-season trades, fundamentally changing their roster composition.
- Two star players returned from long-term injuries, dramatically improving their teams.
- Your K-factor is 15 (moderate responsiveness).
(a) Explain why a standard Elo system struggles with abrupt changes in team strength.
(b) Would Glicko-2 handle this situation better? Why or why not?
(c) Propose a specific modification to your Elo system that would detect and respond to mid-season roster changes.
(d) How would you use the magnitude of line movement (the market's reaction to news) as a signal to adjust your ratings more aggressively?
(e) What is the risk of making your system too responsive to mid-season events?
Exercise D.4 --- Home Advantage Calibration
The table below shows home winning percentages across major sports leagues:
| League | Home Win % | Approximate Elo $H$ |
|---|---|---|
| NFL | 57% | 48 |
| NBA | 59% | 70 |
| MLB | 54% | 24 |
| NHL | 55% | 30 |
| EPL | 46% | -30 (away advantage?) |
(a) Verify that the NFL home advantage of $H = 48$ approximately corresponds to a 57% win rate by computing $E = 1/(1 + 10^{-48/400})$.
(b) The EPL home win percentage has dropped below 50% in recent seasons (when excluding draws). Propose three explanations for the declining home advantage in soccer.
(c) Should home advantage be treated as a constant throughout a season, or should it vary? What factors might cause within-season variation in home advantage?
(d) During the COVID-19 pandemic (2020), many sports were played without fans. How could you use this natural experiment to isolate the crowd component of home advantage from other factors?
Exercise D.5 --- The Predictive Value of Strength of Schedule
Compare how Elo, Massey, and PageRank handle strength of schedule (SOS).
(a) Explain how Elo implicitly accounts for SOS through the sequential updating process. Why is this implicit treatment sometimes insufficient?
(b) Massey ratings incorporate SOS through the structure of the Massey matrix. Demonstrate with a simple example (two teams with identical records but different opponents) how Massey differentiates them.
(c) PageRank treats SOS most explicitly through its recursive definition. Explain why this recursive treatment can sometimes overvalue teams in "strong conferences" that primarily play each other.
(d) A college football team goes 10-2 with losses to the #1 and #2 teams. Another goes 10-2 with losses to the #50 and #75 teams. How would each rating system handle this difference?
(e) For betting purposes, which treatment of SOS do you find most useful, and why?
Part E: Research & Extension
Each problem is worth 5 points. These require independent research beyond Chapter 26. Cite all sources.
Exercise E.1 --- History of Rating Systems
Research and write a brief essay (500-700 words) tracing the history of rating systems in competitive games and sports. Cover (a) the origins of the Elo system in chess and its adoption by FIDE, (b) the development of Glicko and Glicko-2 by Mark Glickman, (c) Kenneth Massey's college football rating system and its role in the BCS, (d) the application of PageRank to sports by various researchers, and (e) modern developments including TrueSkill (Microsoft) and the whole-history rating approach.
Exercise E.2 --- Rating Systems in Esports
Research how rating and ranking systems are used in competitive esports (e.g., League of Legends, Chess.com, Dota 2, Counter-Strike). For at least two games, report (a) which rating system is used (or what it is derived from), (b) what modifications were made to adapt it to the game's specific characteristics, (c) how the system handles team-based games where individual skill must be estimated, and (d) how the esports betting market uses or reacts to these official ratings.
Exercise E.3 --- TrueSkill and Beyond
Research Microsoft's TrueSkill rating system (used in Xbox matchmaking) and its successor TrueSkill 2. Write a 400-600 word summary explaining (a) how TrueSkill differs from Glicko-2 in its treatment of uncertainty, (b) how TrueSkill handles multiplayer games and team-based games (a limitation of both Elo and Glicko), (c) the Bayesian inference framework underlying TrueSkill, and (d) whether TrueSkill's innovations could improve sports betting models.
Exercise E.4 --- Market Efficiency and Rating Systems
Research the relationship between public rating systems and market efficiency in sports betting. Address (a) whether publicly available Elo ratings (such as FiveThirtyEight's) are already priced into betting lines, (b) empirical evidence on whether simple Elo-based strategies can beat closing lines, (c) the concept of "closing line value" and how it relates to the predictive accuracy of rating systems, and (d) what edge, if any, remains for bettors who build custom rating systems versus using publicly available ones.
Exercise E.5 --- Network Science Applications in Sports
Research how network science methods beyond PageRank have been applied to sports analytics. Investigate (a) community detection algorithms for identifying "tiers" of teams, (b) centrality measures (betweenness, eigenvector) as alternative ranking criteria, (c) directed weighted networks for modeling scoring patterns within games, and (d) dynamic network models that capture how competitive relationships evolve over a season. Provide at least two specific published examples with citations.
Scoring Guide
| Part | Problems | Points Each | Total Points |
|---|---|---|---|
| A: Conceptual Understanding | 8 | 5 | 40 |
| B: Calculations | 7 | 5 | 35 |
| C: Programming Challenges | 5 | 10 | 50 |
| D: Analysis & Interpretation | 5 | 5 | 25 |
| E: Research & Extension | 5 | 5 | 25 |
| Total | 30 | --- | 175 |
Grading Criteria
Part A (Conceptual): Full credit requires clear, accurate explanations that demonstrate understanding of the underlying mathematical concepts and their relevance to sports betting. Partial credit for incomplete but correct reasoning.
Part B (Calculations): Full credit requires correct final answers with all work shown. Partial credit for correct methodology with arithmetic errors.
Part C (Programming): Graded on correctness (40%), code quality and documentation (30%), and test coverage (30%). Code must execute without errors.
Part D (Analysis): Graded on analytical depth, logical reasoning, and appropriate application of rating system concepts to real-world betting scenarios. Multiple valid approaches may exist.
Part E (Research): Graded on research quality, source credibility, analytical depth, and clear writing. Minimum source requirements specified per problem.
Solutions: Complete worked solutions for all exercises are available in code/exercise-solutions.py. For programming challenges, reference implementations are provided in the code/ directory.