Chapter 19: Elo and Power Ratings

The backbone of sports prediction systems


Introduction

When chess player Arpad Elo developed his rating system in the 1960s, he created something that would eventually transcend chess and become the foundation for ranking systems across sports, video games, and countless competitive domains. The Elo system's elegance lies in its simplicity: teams or players gain rating points when they win and lose points when they lose, with the magnitude of change determined by expectation versus outcome.

This chapter explores Elo ratings and their broader category—power ratings—in the context of NFL prediction. We'll build rating systems from scratch, understand their mathematical foundations, explore variations that improve predictive accuracy, and learn how to calibrate and maintain ratings across seasons.


Learning Objectives

By the end of this chapter, you will be able to:

  1. Explain the mathematical foundations of Elo ratings
  2. Implement a basic Elo system for NFL teams
  3. Modify Elo to incorporate margin of victory
  4. Build alternative power rating systems (DVOA-style, SRS)
  5. Handle season transitions and regression to the mean
  6. Calibrate ratings to spread predictions
  7. Evaluate and compare different rating systems
  8. Identify appropriate use cases for each rating approach

Part 1: The Elo Rating System

Historical Background

Arpad Elo, a Hungarian-American physics professor and chess master, developed his rating system to address limitations in the existing chess ranking methods. The United States Chess Federation adopted it in 1960, and FIDE (the international chess federation) followed in 1970.

The system's key innovation was treating ratings as probability estimates. A rating difference between players implies a specific expected win probability, which updates based on actual results. This self-correcting mechanism allows ratings to converge toward "true" skill levels over time.

The Core Mathematics

The Elo system rests on two fundamental equations:

Expected Score Calculation:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A) / 400}}$$

Where:

- $E_A$ = Expected score for player/team A (between 0 and 1)
- $R_A$ = Rating of player/team A
- $R_B$ = Rating of player/team B
- 400 = Scaling factor (determines how rating differences map to probabilities)

Rating Update:

$$R_{A,new} = R_A + K \times (S_A - E_A)$$

Where:

- $K$ = Update factor (how much ratings change per game)
- $S_A$ = Actual score (1 for win, 0 for loss, 0.5 for tie)
- $E_A$ = Expected score from above
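
For example, suppose team A is rated 1560, team B is rated 1500, and A wins. A minimal worked calculation (with an illustrative K of 28):

r_a, r_b, k = 1560, 1500, 28

# Expected score for A from the first equation
expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))   # ≈ 0.586

# Rating updates from the second equation (A wins, so S_A = 1, S_B = 0)
change_a = k * (1 - expected_a)                     # ≈ +11.6
change_b = k * (0 - (1 - expected_a))               # ≈ -11.6

print(round(expected_a, 3), round(change_a, 1), round(change_b, 1))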

Understanding the Components

The Expected Score Function:

The expected score function converts rating differences into win probabilities. Here's how different rating gaps translate:

| Rating Difference | Win Probability |
|---|---|
| 0 | 50.0% |
| 50 | 57.1% |
| 100 | 64.0% |
| 150 | 70.3% |
| 200 | 75.9% |
| 300 | 84.9% |
| 400 | 90.9% |

The 400 scaling factor means that a 400-point rating advantage implies approximately 91% win probability—a dominant edge but not certainty.

The K-Factor:

The K-factor controls rating volatility. Higher K values make ratings more responsive to recent results but also more volatile. Lower K values create more stable ratings but slower adaptation.

For NFL applications, K-factor selection involves tradeoffs:

| K-Factor | Characteristics |
|---|---|
| 15-20 | Stable ratings, slow to react to changes |
| 25-35 | Balanced responsiveness and stability |
| 40-50 | Highly reactive, captures streaks quickly |
| 60+ | Very volatile, potentially overreacting |

Most NFL Elo implementations use K-factors between 20 and 32.
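
A quick way to see this tradeoff is to apply the same upset result under different K values. The sketch below assumes a 100-point underdog winning outright:

# How much a 100-point underdog gains from an upset win under different K-factors
expected = 1 / (1 + 10 ** (100 / 400))   # underdog's expected score ≈ 0.36

for k in (20, 28, 40, 60):
    print(f"K={k}: underdog gains {k * (1 - expected):.1f} rating points")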

Implementing Basic NFL Elo

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
import pandas as pd
import numpy as np

@dataclass
class NFLEloSystem:
    """
    Basic Elo rating system for NFL teams.

    Attributes:
        k_factor: How much ratings change per game
        home_advantage: Rating points added for home team
        initial_rating: Starting rating for all teams
    """
    k_factor: float = 28
    home_advantage: float = 48  # ~2 points in spread terms (25 Elo ≈ 1 point)
    initial_rating: float = 1500
    ratings: Dict[str, float] = field(default_factory=dict)

    def get_rating(self, team: str) -> float:
        """Get current rating for a team."""
        return self.ratings.get(team, self.initial_rating)

    def expected_score(self, rating_a: float, rating_b: float) -> float:
        """Calculate expected score for team A vs team B."""
        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

    def predict(self, home_team: str, away_team: str) -> Dict:
        """
        Predict game outcome.

        Returns dictionary with:
        - home_win_prob: Probability home team wins
        - predicted_spread: Implied point spread
        - predicted_winner: Team predicted to win
        """
        home_rating = self.get_rating(home_team) + self.home_advantage
        away_rating = self.get_rating(away_team)

        home_win_prob = self.expected_score(home_rating, away_rating)

        # Convert rating difference to spread
        # ~25 Elo points ≈ 1 point spread
        rating_diff = home_rating - away_rating
        predicted_spread = -rating_diff / 25

        return {
            'home_team': home_team,
            'away_team': away_team,
            'home_win_prob': round(home_win_prob, 3),
            'away_win_prob': round(1 - home_win_prob, 3),
            'predicted_spread': round(predicted_spread, 1),
            'predicted_winner': home_team if home_win_prob > 0.5 else away_team
        }

    def update(self, home_team: str, away_team: str,
               home_score: int, away_score: int) -> Dict:
        """
        Update ratings based on game result.

        Returns dictionary with rating changes.
        """
        # Get pre-game ratings (with HFA for home team)
        home_rating = self.get_rating(home_team) + self.home_advantage
        away_rating = self.get_rating(away_team)

        # Calculate expected scores
        home_expected = self.expected_score(home_rating, away_rating)
        away_expected = 1 - home_expected

        # Determine actual scores
        if home_score > away_score:
            home_actual, away_actual = 1, 0
        elif away_score > home_score:
            home_actual, away_actual = 0, 1
        else:
            home_actual, away_actual = 0.5, 0.5

        # Calculate rating changes
        home_change = self.k_factor * (home_actual - home_expected)
        away_change = self.k_factor * (away_actual - away_expected)

        # Update ratings (note: HFA not stored, just used for calculation)
        old_home = self.get_rating(home_team)
        old_away = self.get_rating(away_team)

        self.ratings[home_team] = old_home + home_change
        self.ratings[away_team] = old_away + away_change

        return {
            'home_team': home_team,
            'away_team': away_team,
            'home_change': round(home_change, 1),
            'away_change': round(away_change, 1),
            'home_new_rating': round(self.ratings[home_team], 1),
            'away_new_rating': round(self.ratings[away_team], 1)
        }

    def process_season(self, games: pd.DataFrame) -> pd.DataFrame:
        """
        Process all games in a season chronologically.

        Returns DataFrame with predictions and outcomes.
        """
        games = games.sort_values(['week', 'game_id']).copy()
        results = []

        for _, game in games.iterrows():
            if pd.isna(game['home_score']):
                continue

            # Get prediction before updating
            pred = self.predict(game['home_team'], game['away_team'])

            # Update ratings
            update = self.update(
                game['home_team'], game['away_team'],
                game['home_score'], game['away_score']
            )

            # Record result
            actual_winner = (game['home_team'] if game['home_score'] > game['away_score']
                           else game['away_team'])

            results.append({
                'game_id': game.get('game_id', ''),
                'week': game['week'],
                'home_team': game['home_team'],
                'away_team': game['away_team'],
                'home_score': game['home_score'],
                'away_score': game['away_score'],
                'predicted_winner': pred['predicted_winner'],
                'actual_winner': actual_winner,
                'correct': pred['predicted_winner'] == actual_winner,
                'home_win_prob': pred['home_win_prob'],
                'predicted_spread': pred['predicted_spread'],
                'actual_spread': game['away_score'] - game['home_score']
            })

        return pd.DataFrame(results)
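
A minimal usage sketch, assuming a games DataFrame with the columns the class expects (game_id, week, home_team, away_team, home_score, away_score); the two games below are purely illustrative:

# Hypothetical schedule slice to illustrate the workflow
games = pd.DataFrame([
    {'game_id': 'g1', 'week': 1, 'home_team': 'BAL', 'away_team': 'KC',
     'home_score': 20, 'away_score': 27},
    {'game_id': 'g2', 'week': 1, 'home_team': 'DET', 'away_team': 'LA',
     'home_score': 26, 'away_score': 20},
])

elo = NFLEloSystem(k_factor=28)
season_results = elo.process_season(games)

print(season_results[['home_team', 'away_team', 'predicted_winner', 'correct']])
print({team: round(r, 1) for team, r in elo.ratings.items()})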

Properties of Elo Ratings

Conservation:

In a pure Elo system, total rating points are conserved. When one team gains points, the opponent loses the same amount. This property emerges from the symmetric update rules.

For a league of N teams starting at 1500:

- Total points always equal N × 1500
- Average rating always equals 1500

This conservation helps interpret ratings: a team rated 1550 is exactly 50 points above average.

Self-Correction:

Elo ratings are self-correcting. If a team is under-rated, they'll outperform expectations and gain rating points until reaching their "true" level. Overrated teams face the opposite pressure.

This property makes Elo robust to initialization errors—ratings eventually converge regardless of starting values, though convergence takes longer with poor initialization.

Stability:

The expected value of rating change for a correctly-rated team is zero. If ratings accurately reflect true skill, wins and losses will balance on average, keeping ratings stable.


Part 2: Margin-Adjusted Elo

The Limitation of Win/Loss

Basic Elo treats all wins equally—a 1-point victory provides the same rating boost as a 35-point blowout. But in NFL prediction, margin matters. A team that consistently wins by large margins is likely stronger than one barely squeaking out victories.

Incorporating Margin of Victory

The most common approach multiplies the K-factor by a margin-based multiplier:

def margin_multiplier(self, point_diff: int, elo_diff: float) -> float:
    """
    Calculate margin of victory multiplier.

    Uses FiveThirtyEight-style formula that:
    1. Rewards larger margins
    2. Dampens effect when favorite wins big
    3. Amplifies effect for upset blowouts

    Args:
        point_diff: Winner's margin (always positive)
        elo_diff: Winner's Elo minus loser's Elo (can be negative for upsets)
    """
    # Logarithmic scaling prevents extreme margins from dominating
    base_multiplier = np.log(abs(point_diff) + 1)

    # Autocorrelation adjustment
    # Prevents strong teams from gaining too many points in expected blowouts
    autocorr = 2.2 / ((elo_diff * 0.001) + 2.2)

    return base_multiplier * autocorr

Why Autocorrelation Matters:

Without the autocorrelation adjustment, strong teams would gain disproportionate rating points. When the Chiefs beat a weak opponent by 28 points, the margin tells us less about the Chiefs' true strength than when they beat a good opponent by 28 points.

The adjustment dampens margin's effect when the winner was already favored, and amplifies it when the underdog wins big.
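
To see the dampening in action, compare the multiplier for the same 28-point margin when the winner entered as a 150-point favorite versus a 150-point underdog (a sketch of the formula above, without the margin cap introduced below):

import numpy as np

def raw_margin_multiplier(point_diff, elo_diff):
    # log scaling × autocorrelation adjustment
    return np.log(abs(point_diff) + 1) * (2.2 / ((elo_diff * 0.001) + 2.2))

print(round(raw_margin_multiplier(28, +150), 2))   # favorite wins big: dampened
print(round(raw_margin_multiplier(28, -150), 2))   # underdog wins big: amplified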

Full Margin-Adjusted Implementation

@dataclass
class MarginAdjustedElo(NFLEloSystem):
    """
    Elo system that incorporates margin of victory.

    Based on FiveThirtyEight's NFL Elo methodology.
    """
    margin_cap: int = 24  # Maximum margin to count

    def margin_multiplier(self, point_diff: int, elo_diff: float) -> float:
        """Calculate margin of victory multiplier."""
        # Cap extreme margins
        capped_diff = min(abs(point_diff), self.margin_cap)

        # Logarithmic scaling
        base = np.log(capped_diff + 1)

        # Autocorrelation adjustment
        # elo_diff is winner rating - loser rating
        autocorr = 2.2 / ((elo_diff * 0.001) + 2.2)

        return base * autocorr

    def update(self, home_team: str, away_team: str,
               home_score: int, away_score: int) -> Dict:
        """Update ratings with margin adjustment."""
        # Get pre-game ratings
        home_rating = self.get_rating(home_team) + self.home_advantage
        away_rating = self.get_rating(away_team)

        # Calculate expected scores
        home_expected = self.expected_score(home_rating, away_rating)
        away_expected = 1 - home_expected

        # Determine winner and margin
        point_diff = abs(home_score - away_score)

        if home_score > away_score:
            winner_elo_diff = home_rating - away_rating
            home_actual, away_actual = 1, 0
        elif away_score > home_score:
            winner_elo_diff = away_rating - home_rating
            home_actual, away_actual = 0, 1
        else:
            # Ties: no margin multiplier
            home_actual, away_actual = 0.5, 0.5
            multiplier = 1.0

        if home_score != away_score:
            multiplier = self.margin_multiplier(point_diff, winner_elo_diff)

        # Calculate rating changes with margin adjustment
        effective_k = self.k_factor * multiplier

        home_change = effective_k * (home_actual - home_expected)
        away_change = effective_k * (away_actual - away_expected)

        # Update ratings
        old_home = self.get_rating(home_team)
        old_away = self.get_rating(away_team)

        self.ratings[home_team] = old_home + home_change
        self.ratings[away_team] = old_away + away_change

        return {
            'home_team': home_team,
            'away_team': away_team,
            'point_diff': point_diff,
            'margin_multiplier': round(multiplier, 2),
            'effective_k': round(effective_k, 1),
            'home_change': round(home_change, 1),
            'away_change': round(away_change, 1),
            'home_new_rating': round(self.ratings[home_team], 1),
            'away_new_rating': round(self.ratings[away_team], 1)
        }

Margin Multiplier Effects

Here's how different scenarios affect rating changes with the margin-adjusted system:

| Scenario | Margin | Base K | Multiplier | Effective K |
|---|---|---|---|---|
| Close game, any teams | 3 | 28 | 1.4 | 39 |
| Moderate win, favorite | 14 | 28 | 2.2 | 62 |
| Moderate win, underdog | 14 | 28 | 3.1 | 87 |
| Blowout, big favorite | 28 | 28 | 2.4 | 67 |
| Blowout, underdog | 28 | 28 | 4.2 | 118 |

The key insight: upset blowouts cause the largest rating swings, while expected blowouts have dampened effects.


Part 3: Season Transitions and Regression

The Offseason Problem

NFL teams change significantly between seasons. Players retire, get traded, or signed in free agency. Coaches change. Simply carrying over last year's ratings creates problems:

  1. Overconfidence: End-of-season ratings assume team quality that may have changed
  2. Roster Turnover: Key players leaving can dramatically alter team strength
  3. Coaching Changes: New systems take time to implement

Regression to the Mean

The solution is regression to the mean: pulling ratings toward average between seasons. This acknowledges that:

  1. Extreme ratings often reflect luck in addition to skill
  2. Teams change over the offseason
  3. New data will eventually correct ratings anyway

def regress_to_mean(self, regression_factor: float = 0.33) -> Dict[str, float]:
    """
    Regress all ratings toward the mean between seasons.

    Args:
        regression_factor: How much to regress (0.33 = regress 1/3 to mean)

    Returns:
        Dictionary of new ratings
    """
    mean_rating = np.mean(list(self.ratings.values()))

    new_ratings = {}
    for team, rating in self.ratings.items():
        distance_from_mean = rating - mean_rating
        new_ratings[team] = rating - (distance_from_mean * regression_factor)

    self.ratings = new_ratings
    return new_ratings

Choosing the Regression Factor

The optimal regression factor balances two errors:

  1. Under-regression: Carrying too much prior belief, slow to recognize change
  2. Over-regression: Throwing away useful information, ratings too uniform

Research on NFL Elo suggests:

| Regression Factor | Effect |
|---|---|
| 0.20-0.25 | Minimal regression, assumes continuity |
| 0.30-0.40 | Moderate regression, typical choice |
| 0.50-0.60 | Heavy regression, fresh-start emphasis |

FiveThirtyEight uses approximately 1/3 regression (moving ratings 1/3 toward the mean).

Practical Example

End of 2023 season ratings:

- Chiefs: 1680
- Patriots: 1380
- League mean: 1500

After 1/3 regression:

- Chiefs: 1680 - (180 × 0.33) = 1621
- Patriots: 1380 + (120 × 0.33) = 1420

The Chiefs remain strong favorites but with reduced certainty. The Patriots improve somewhat, acknowledging potential positive changes.
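
The same arithmetic falls straight out of the regress_to_mean method above. A direct sketch (using 1500 as the league mean, which conservation guarantees for a full league):

# Reproducing the worked example: regress one third of the distance to the mean
ratings = {'KC': 1680, 'NE': 1380}
league_mean = 1500
factor = 0.33

regressed = {team: r - (r - league_mean) * factor for team, r in ratings.items()}
print(regressed)   # KC ≈ 1620.6, NE ≈ 1419.6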


Part 4: Simple Rating System (SRS)

Beyond Sequential Updates

Elo updates ratings game-by-game, which has advantages (recency weighting, simple updates) but also limitations (path dependence, sensitivity to game order).

An alternative approach solves for ratings all at once using the entire season's data.

The SRS Method

The Simple Rating System calculates ratings that satisfy:

For each team: Rating = Average Point Margin + Average Opponent Rating

This creates a system of equations that can be solved simultaneously.

class SimpleRatingSystem:
    """
    Simple Rating System (SRS) for NFL teams.

    Iteratively solves for ratings where each team's rating
    equals their average margin plus average opponent rating.
    """

    def __init__(self, home_advantage: float = 2.5, margin_cap: int = 21):
        self.home_advantage = home_advantage
        self.margin_cap = margin_cap
        self.ratings = {}

    def fit(self, games: pd.DataFrame, iterations: int = 100) -> 'SimpleRatingSystem':
        """
        Calculate ratings from game results.

        Args:
            games: DataFrame with home_team, away_team, home_score, away_score
            iterations: Number of iterations for convergence
        """
        games = games[games['home_score'].notna()].copy()

        # Get all teams
        teams = set(games['home_team'].tolist() + games['away_team'].tolist())

        # Calculate average margins for each team
        margins = {team: [] for team in teams}
        opponents = {team: [] for team in teams}

        for _, game in games.iterrows():
            home, away = game['home_team'], game['away_team']

            # Adjust for home advantage
            home_margin = game['home_score'] - game['away_score'] - self.home_advantage
            away_margin = game['away_score'] - game['home_score'] + self.home_advantage

            # Cap margins
            home_margin = np.clip(home_margin, -self.margin_cap, self.margin_cap)
            away_margin = np.clip(away_margin, -self.margin_cap, self.margin_cap)

            margins[home].append(home_margin)
            margins[away].append(away_margin)
            opponents[home].append(away)
            opponents[away].append(home)

        # Calculate average margins
        avg_margins = {team: np.mean(m) if m else 0 for team, m in margins.items()}

        # Initialize ratings to average margins
        self.ratings = avg_margins.copy()

        # Iterate to convergence
        for _ in range(iterations):
            new_ratings = {}

            for team in teams:
                # Average opponent rating
                if opponents[team]:
                    avg_opp_rating = np.mean([self.ratings[opp] for opp in opponents[team]])
                else:
                    avg_opp_rating = 0

                # New rating = margin + SOS
                new_ratings[team] = avg_margins[team] + avg_opp_rating

            # Normalize to mean 0
            mean_rating = np.mean(list(new_ratings.values()))
            self.ratings = {team: r - mean_rating for team, r in new_ratings.items()}

        return self

    def predict(self, home_team: str, away_team: str) -> Dict:
        """Predict game outcome."""
        home_rating = self.ratings.get(home_team, 0)
        away_rating = self.ratings.get(away_team, 0)

        # Spread = away rating - home rating - HFA
        predicted_spread = away_rating - home_rating - self.home_advantage

        # Convert to probability using the Elo scale
        # (1 point of spread ≈ 25 Elo, so the divisor is 400 / 25 = 16)
        home_win_prob = 1 / (1 + 10 ** (predicted_spread / 16))

        return {
            'home_team': home_team,
            'away_team': away_team,
            'home_rating': round(home_rating, 1),
            'away_rating': round(away_rating, 1),
            'predicted_spread': round(predicted_spread, 1),
            'home_win_prob': round(home_win_prob, 3),
            'predicted_winner': home_team if predicted_spread < 0 else away_team
        }

    def rankings(self) -> pd.DataFrame:
        """Return teams ranked by rating."""
        return pd.DataFrame([
            {'team': team, 'rating': rating}
            for team, rating in sorted(self.ratings.items(),
                                       key=lambda x: x[1], reverse=True)
        ])
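
Usage mirrors the Elo workflow; a sketch assuming a completed-games DataFrame like the one used earlier:

# Fit SRS on completed games, then inspect rankings and a single matchup
srs = SimpleRatingSystem(home_advantage=2.5)
srs.fit(games)

print(srs.rankings().head(10))
print(srs.predict('KC', 'BAL'))   # hypothetical matchup: KC at home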

SRS Properties

Strength of Schedule Built-In:

Unlike simple margin averages, SRS automatically adjusts for opponent quality. A team rated 0.0 that played the hardest schedule is judged just as strong as a 0.0 team that played the easiest schedule, because the strength-of-schedule adjustment is already embedded in the rating.

Convergence:

SRS typically converges within 20-50 iterations. The final ratings satisfy the defining equation to within floating-point precision.

Margin Matters:

Every point of margin affects the rating. This can be good (margins are informative) or bad (garbage time can distort).

Comparing Elo and SRS

| Aspect | Elo | SRS |
|---|---|---|
| Update frequency | Game-by-game | All at once |
| Recency weighting | Natural (recent games more impact) | None (all games equal) |
| Path dependence | Yes (order matters) | No (order irrelevant) |
| SOS adjustment | Implicit (through opponents) | Explicit (in formula) |
| Margin handling | Configurable | Always uses margin |
| Cross-season | Natural (carry forward) | Requires separate handling |

Part 5: DVOA-Style Efficiency Ratings

The Play-by-Play Approach

While Elo and SRS use game outcomes, efficiency ratings analyze play-by-play data. Football Outsiders' DVOA (Defense-adjusted Value Over Average) exemplifies this approach.

Core Concept

Instead of asking "did the team win?", efficiency ratings ask "how well did the team perform on each play?"

Key components:

  1. Success Rate: What percentage of plays gained positive expected value?
  2. Explosive Plays: How often did big plays occur?
  3. Negative Plays: How often did drives stall?
  4. Situational Performance: How did performance vary by down, distance, and field position?

class EfficiencyRatingSystem:
    """
    Efficiency-based rating system using play-by-play data.

    Calculates offense and defense ratings separately,
    adjusting for opponent strength.
    """

    def __init__(self, league_avg_epa: float = 0.0):
        self.league_avg_epa = league_avg_epa
        self.offensive_ratings = {}
        self.defensive_ratings = {}
        self.overall_ratings = {}

    def success_rate(self, plays: pd.DataFrame) -> float:
        """
        Calculate success rate for a set of plays.

        Success defined as:
        - 1st down: 45%+ of yards needed
        - 2nd down: 60%+ of yards needed
        - 3rd/4th down: 100% of yards needed
        """
        if len(plays) == 0:
            return 0.5

        successes = 0
        counted = 0
        for _, play in plays.iterrows():
            # Skip plays with missing yardage so they don't dilute the denominator
            if pd.isna(play.get('yards_gained')):
                continue

            needed = play.get('ydstogo', 10)
            gained = play['yards_gained']
            down = play.get('down', 1)

            if down == 1:
                threshold = 0.45
            elif down == 2:
                threshold = 0.60
            else:
                threshold = 1.0

            if gained >= needed * threshold:
                successes += 1
            counted += 1

        return successes / counted if counted else 0.5

    def calculate_epa_efficiency(self, team: str, plays: pd.DataFrame,
                                  side: str = 'offense') -> float:
        """
        Calculate EPA-based efficiency rating.

        Args:
            team: Team abbreviation
            plays: Play-by-play data
            side: 'offense' or 'defense'
        """
        if side == 'offense':
            team_plays = plays[plays['posteam'] == team]
        else:
            team_plays = plays[plays['defteam'] == team]

        if len(team_plays) == 0:
            return 0.0

        # Filter to relevant plays
        relevant = team_plays[
            team_plays['play_type'].isin(['run', 'pass']) &
            team_plays['epa'].notna()
        ]

        if len(relevant) == 0:
            return 0.0

        # Calculate EPA per play
        epa_per_play = relevant['epa'].mean()

        # Convert to rating (multiply by 100 for readability)
        return epa_per_play * 100

    def fit(self, plays: pd.DataFrame, games: pd.DataFrame,
            iterations: int = 10) -> 'EfficiencyRatingSystem':
        """
        Calculate efficiency ratings from play-by-play data.

        Uses iterative adjustment for opponent strength.
        """
        teams = set(plays['posteam'].dropna().tolist())

        # Initial ratings: raw EPA
        for team in teams:
            self.offensive_ratings[team] = self.calculate_epa_efficiency(
                team, plays, 'offense'
            )
            self.defensive_ratings[team] = -self.calculate_epa_efficiency(
                team, plays, 'defense'
            )  # Negative because lower defensive EPA is better

        # Iterative opponent adjustment
        for _ in range(iterations):
            new_off = {}
            new_def = {}

            for team in teams:
                # Get opponents
                home_games = games[games['home_team'] == team]
                away_games = games[games['away_team'] == team]

                opponents = (
                    home_games['away_team'].tolist() +
                    away_games['home_team'].tolist()
                )

                if not opponents:
                    new_off[team] = self.offensive_ratings[team]
                    new_def[team] = self.defensive_ratings[team]
                    continue

                # Adjust for opponent strength
                raw_off = self.calculate_epa_efficiency(team, plays, 'offense')
                raw_def = -self.calculate_epa_efficiency(team, plays, 'defense')

                # Opponent defensive ratings (for offense adjustment)
                opp_def_avg = np.mean([
                    self.defensive_ratings.get(opp, 0) for opp in opponents
                ])

                # Opponent offensive ratings (for defense adjustment)
                opp_off_avg = np.mean([
                    self.offensive_ratings.get(opp, 0) for opp in opponents
                ])

                # Facing good defenses depresses raw offensive EPA, so credit it back
                new_off[team] = raw_off + opp_def_avg
                # Facing good offenses depresses the raw defensive rating, so credit it back
                new_def[team] = raw_def + opp_off_avg

            self.offensive_ratings = new_off
            self.defensive_ratings = new_def

        # Calculate overall ratings
        for team in teams:
            self.overall_ratings[team] = (
                self.offensive_ratings[team] +
                self.defensive_ratings[team]
            )

        # Normalize to mean 0
        mean_overall = np.mean(list(self.overall_ratings.values()))
        self.overall_ratings = {
            t: r - mean_overall for t, r in self.overall_ratings.items()
        }

        return self

    def rankings(self, category: str = 'overall') -> pd.DataFrame:
        """
        Return rankings for specified category.

        Args:
            category: 'overall', 'offense', or 'defense'
        """
        if category == 'overall':
            ratings = self.overall_ratings
        elif category == 'offense':
            ratings = self.offensive_ratings
        else:
            ratings = self.defensive_ratings

        return pd.DataFrame([
            {'team': team, 'rating': rating, 'category': category}
            for team, rating in sorted(ratings.items(),
                                       key=lambda x: x[1], reverse=True)
        ])
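
A usage sketch, assuming play-by-play data with nflfastR-style columns (posteam, defteam, play_type, epa) and a schedule table with home_team/away_team columns — for example loaded via the nfl_data_py package (any source with these columns works):

# Fit efficiency ratings from one season of play-by-play data (illustrative)
import nfl_data_py as nfl

pbp = nfl.import_pbp_data([2023])
schedule = nfl.import_schedules([2023])

eff = EfficiencyRatingSystem()
eff.fit(pbp, schedule)

print(eff.rankings('overall').head(10))
print(eff.rankings('offense').head(5))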

Advantages of Efficiency Ratings

Stability:

Play-by-play data provides many more observations than game outcomes. A team plays ~1,000 offensive plays per season versus ~17 games. More data means more stable estimates.

Granularity:

Separate offensive and defensive ratings enable matchup analysis. Can the Rams' offense exploit the Cowboys' weak secondary?

Context Awareness:

Efficiency ratings can incorporate situation (down, distance, field position, time), revealing teams that perform well in critical moments.

Disadvantages

Complexity:

Requires play-by-play data, which is more complex to obtain and process than game scores.

Noise Sensitivity:

Play-by-play data includes noise (randomness, garbage time, unusual situations) that game outcomes aggregate away.

Outcome Disconnect:

Teams with great efficiency ratings sometimes underperform in actual wins/losses, and vice versa. Efficiency doesn't always translate to victories.


Part 6: Calibrating Ratings to Spreads

The Calibration Problem

Different rating systems use different scales:

- Elo: Typically 1200-1800 range
- SRS: Typically -15 to +15 range
- Efficiency: Various scales

To make predictions useful, we need to convert ratings to point spreads.

Linear Calibration

The simplest approach assumes a linear relationship:

$$Spread = a \times (Away\_Rating - Home\_Rating) + b$$

Where $a$ and $b$ are calibration constants, and $b$ typically equals the home field advantage.

def calibrate_to_spreads(self, games: pd.DataFrame) -> Tuple[float, float]:
    """
    Calibrate rating differences to actual point spreads.

    Uses linear regression to find the best mapping.

    Returns:
        (multiplier, intercept) for spread conversion
    """
    rating_diffs = []
    actual_spreads = []

    for _, game in games.iterrows():
        if pd.isna(game['home_score']):
            continue

        home_r = self.ratings.get(game['home_team'], 0)
        away_r = self.ratings.get(game['away_team'], 0)

        rating_diffs.append(away_r - home_r)
        actual_spreads.append(game['away_score'] - game['home_score'])

    # Simple linear regression
    X = np.array(rating_diffs)
    y = np.array(actual_spreads)

    # Add constant term
    X_with_const = np.column_stack([X, np.ones(len(X))])

    # Solve least squares
    coeffs = np.linalg.lstsq(X_with_const, y, rcond=None)[0]

    return coeffs[0], coeffs[1]  # multiplier, intercept
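
A sketch of applying the fitted constants, assuming the method is attached to a points-scale rating system such as the SRS class above:

# Fit the linear mapping on past games, then convert a rating gap to a spread
multiplier, intercept = srs.calibrate_to_spreads(games)

home_r = srs.ratings.get('KC', 0)
away_r = srs.ratings.get('BAL', 0)
calibrated_spread = multiplier * (away_r - home_r) + intercept

print(round(calibrated_spread, 1))   # negative = home team favored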

Calibration Results by System

Typical calibration factors:

| System | Rating Diff → Spread Multiplier |
|---|---|
| Elo | ~0.04 (25 Elo ≈ 1 point) |
| SRS | ~1.0 (ratings are in points) |
| Efficiency | ~0.15-0.25 (depends on scale) |

Converting to Win Probability

Once we have predicted spreads, converting to win probability:

def spread_to_probability(spread: float, sigma: float = 13.5) -> float:
    """
    Convert point spread to win probability.

    Uses normal CDF with NFL's ~13.5 point standard deviation.

    Args:
        spread: Predicted point spread (negative = home favored)
        sigma: Standard deviation of NFL score differences

    Returns:
        Home team win probability
    """
    from scipy import stats

    # Probability home team wins = P(home_score - away_score > 0)
    # If spread = away_score - home_score, then
    # P(home wins) = P(normal > spread)
    return 1 - stats.norm.cdf(spread / sigma)
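
For instance, a 3-point home favorite (spread = -3) maps to roughly a 59% home win probability and a 7-point favorite to roughly 70%:

# Converting a few common spreads to home win probabilities
for spread in (-3, -7, -10):
    print(spread, round(spread_to_probability(spread), 3))
# -3 → ~0.588, -7 → ~0.698, -10 → ~0.771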

Part 7: Evaluating Rating Systems

Which System Is Best?

Evaluating rating systems requires clear metrics and honest assessment.

class RatingSystemEvaluator:
    """
    Compare and evaluate different rating systems.
    """

    def evaluate(self, system, games: pd.DataFrame,
                 market_spreads: pd.DataFrame = None) -> Dict:
        """
        Comprehensive evaluation of a rating system.

        Args:
            system: Rating system with predict() method
            games: Games to evaluate on
            market_spreads: Optional market lines for comparison
        """
        games = games[games['home_score'].notna()].copy()

        predictions = []
        actuals = []

        for _, game in games.iterrows():
            pred = system.predict(game['home_team'], game['away_team'])

            actual_winner = (game['home_team'] if game['home_score'] > game['away_score']
                           else game['away_team'])
            actual_spread = game['away_score'] - game['home_score']

            predictions.append(pred)
            actuals.append({
                'winner': actual_winner,
                'spread': actual_spread,
                'home_won': 1 if actual_winner == game['home_team'] else 0
            })

        # Calculate metrics
        n = len(predictions)

        # Straight-up accuracy
        su_correct = sum(
            1 for p, a in zip(predictions, actuals)
            if p['predicted_winner'] == a['winner']
        )
        su_accuracy = su_correct / n

        # Brier score
        brier = np.mean([
            (p['home_win_prob'] - a['home_won']) ** 2
            for p, a in zip(predictions, actuals)
        ])

        # Mean absolute error of spreads
        mae = np.mean([
            abs(p['predicted_spread'] - a['spread'])
            for p, a in zip(predictions, actuals)
        ])

        # Against the spread (if market spreads provided)
        ats_record = None
        if market_spreads is not None:
            ats_correct = 0
            ats_total = 0

            for i, (pred, actual) in enumerate(zip(predictions, actuals)):
                if i >= len(market_spreads):
                    break

                market = market_spreads.iloc[i]['spread']
                predicted = pred['predicted_spread']

                # Skip if too close
                if abs(predicted - market) < 0.5:
                    continue

                # Did our spread beat the market?
                if predicted < market:  # We like home more
                    beat_spread = actual['spread'] < market
                else:  # We like away more
                    beat_spread = actual['spread'] > market

                if beat_spread:
                    ats_correct += 1
                ats_total += 1

            if ats_total > 0:
                ats_record = {
                    'correct': ats_correct,
                    'total': ats_total,
                    'accuracy': ats_correct / ats_total
                }

        return {
            'n_games': n,
            'straight_up': {
                'correct': su_correct,
                'accuracy': round(su_accuracy, 3)
            },
            'brier_score': round(brier, 4),
            'spread_mae': round(mae, 2),
            'ats_record': ats_record
        }

    def compare_systems(self, systems: Dict, games: pd.DataFrame) -> pd.DataFrame:
        """
        Compare multiple rating systems on the same games.

        Args:
            systems: Dictionary of {name: system}
            games: Games to evaluate on
        """
        results = []

        for name, system in systems.items():
            eval_result = self.evaluate(system, games)
            results.append({
                'system': name,
                'su_accuracy': eval_result['straight_up']['accuracy'],
                'brier_score': eval_result['brier_score'],
                'spread_mae': eval_result['spread_mae']
            })

        return pd.DataFrame(results).sort_values('brier_score')
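
A comparison sketch; the system objects and the holdout_games DataFrame here are placeholders for systems already fit on earlier weeks:

# Compare several fitted systems on a held-out slate of games
evaluator = RatingSystemEvaluator()

comparison = evaluator.compare_systems(
    {'basic_elo': elo, 'margin_elo': margin_elo, 'srs': srs},
    holdout_games
)
print(comparison)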

Typical Performance Benchmarks

For reference, here are typical performance levels:

| Metric | Random | Average Model | Good Model | Excellent |
|---|---|---|---|---|
| SU Accuracy | 50% | 55-58% | 59-62% | 63%+ |
| Brier Score | 0.250 | 0.230 | 0.215-0.220 | <0.210 |
| Spread MAE | 14.0 | 11.0-12.0 | 10.0-11.0 | <10.0 |
| ATS vs Market | 50% | 50-51% | 52-54% | 54%+ |

The market (betting lines) typically achieves:

- ~63% straight-up accuracy
- ~0.210 Brier score
- ~10.5 spread MAE

Beating the market consistently is extremely difficult.


Part 8: Advanced Topics

Combining Multiple Rating Systems

No single rating system captures all information. Combining systems can improve predictions.

def ensemble_prediction(systems: List, weights: List[float],
                        home_team: str, away_team: str) -> Dict:
    """
    Combine predictions from multiple rating systems.

    Args:
        systems: List of rating systems with predict() methods
        weights: Relative weights for each system
        home_team, away_team: Teams playing

    Returns:
        Ensemble prediction
    """
    # Normalize weights
    weights = np.array(weights) / sum(weights)

    # Get predictions from each system
    spreads = []
    probs = []

    for system in systems:
        pred = system.predict(home_team, away_team)
        spreads.append(pred['predicted_spread'])
        probs.append(pred['home_win_prob'])

    # Weighted average
    ensemble_spread = np.average(spreads, weights=weights)
    ensemble_prob = np.average(probs, weights=weights)

    return {
        'home_team': home_team,
        'away_team': away_team,
        'predicted_spread': round(ensemble_spread, 1),
        'home_win_prob': round(ensemble_prob, 3),
        'predicted_winner': home_team if ensemble_spread < 0 else away_team,
        'component_spreads': dict(zip(range(len(systems)), spreads))
    }
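
For example, blending an Elo system and an SRS system with 60/40 weights (the weights here are illustrative, not tuned):

# Blend two fitted systems for a single matchup
blend = ensemble_prediction(
    systems=[elo, srs],
    weights=[0.6, 0.4],
    home_team='KC',
    away_team='BAL'
)
print(blend['predicted_spread'], blend['home_win_prob'])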

Handling Team Changes

Ratings struggle with major team changes (new quarterback, coaching changes, significant roster turnover). Options:

Manual Adjustments:

Explicitly adjust ratings based on known changes:

def apply_qb_adjustment(self, team: str, change_type: str):
    """
    Adjust rating for quarterback change.

    Typical adjustments:
    - Elite QB added: +100-150 Elo
    - Average QB added: +0-50 Elo
    - Starter lost to injury: -100-150 Elo
    - Backup playing: -50-100 Elo
    """
    adjustments = {
        'elite_added': 125,
        'average_added': 25,
        'starter_injured': -125,
        'backup_playing': -75
    }

    if change_type in adjustments:
        self.ratings[team] += adjustments[change_type]

Roster Continuity:

Weight prior ratings by how much the roster has changed:

def roster_adjusted_regression(self, team: str,
                                continuity: float) -> float:
    """
    Adjust regression based on roster continuity.

    Args:
        team: Team abbreviation
        continuity: 0-1 measure of how much roster returned
                   (1.0 = identical roster, 0.0 = complete turnover)
    """
    base_regression = 0.33

    # More continuity = less regression needed
    adjusted_regression = base_regression * (1 - continuity * 0.5)

    current = self.ratings[team]
    mean = 1500

    return current - (current - mean) * adjusted_regression

Real-Time Updates

For in-season predictions, ratings should update in real-time:

class RealTimeElo(MarginAdjustedElo):
    """
    Elo system with real-time game updates.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.game_log = []

    def live_update(self, game_id: str, home_team: str, away_team: str,
                    home_score: int, away_score: int, final: bool = False):
        """
        Update ratings during or after a game.

        For live games, uses current score.
        For final games, locks in the update.
        """
        if not final:
            # During game: tentative update (don't store)
            return self.predict(home_team, away_team)

        # Game final: permanent update
        result = self.update(home_team, away_team, home_score, away_score)

        self.game_log.append({
            'game_id': game_id,
            'home_team': home_team,
            'away_team': away_team,
            'home_score': home_score,
            'away_score': away_score,
            **result
        })

        return result

Practical Considerations

Choosing Your System

Use Elo when:

- You need cross-season comparisons
- You want intuitive, interpretable ratings
- You're building a public-facing system
- Simplicity is important

Use SRS when:

- You're analyzing a single season
- You want pure point-spread predictions
- Strength of schedule adjustment is crucial
- Path independence matters

Use Efficiency Ratings when:

- You have play-by-play data
- You need offense/defense splits
- You want maximum predictive power
- Context matters (situational analysis)

Common Mistakes

1. Overfitting K-Factor:

Optimizing K-factor on the same data you'll evaluate creates overfitting. Use separate calibration and test sets.

2. Ignoring Sample Size:

Early-season ratings are unreliable. Consider:

- Higher uncertainty estimates
- Heavier regression to prior
- Explicit confidence intervals

3. Forgetting Home Advantage:

Even modern analyses need home advantage adjustment, though the advantage has decreased over time.

4. Margin Extremes:

Without capping, one 45-point blowout can distort ratings for weeks.


Summary

Rating systems provide the foundation for NFL predictions. This chapter covered:

  1. Elo Basics: Expected score calculation, K-factor selection, rating updates
  2. Margin Adjustment: Incorporating point differential while preventing distortion
  3. Season Transitions: Regression to mean between seasons
  4. SRS: Simultaneous solve for strength-of-schedule-adjusted ratings
  5. Efficiency Ratings: Play-by-play based approaches
  6. Calibration: Converting ratings to spreads and probabilities
  7. Evaluation: Comparing systems fairly

Key takeaways:

- Simple systems (Elo, SRS) perform surprisingly well
- Margin of victory adds predictive value when handled correctly
- Regression to the mean is essential for season transitions
- Ensemble approaches can outperform individual systems
- Beating the market is hard—know your benchmarks


Key Equations Reference

Elo Expected Score: $$E_A = \frac{1}{1 + 10^{(R_B - R_A) / 400}}$$

Elo Update: $$R_{new} = R + K(S - E)$$

SRS Equation: $$Rating = AvgMargin + AvgOpponentRating$$

Spread to Probability: $$P(home) = \Phi\left(\frac{-spread}{\sigma}\right)$$

Where $\Phi$ is the standard normal CDF and $\sigma \approx 13.5$ for NFL.


Looking Ahead

Chapter 20 Preview: Next, we'll explore Machine Learning for NFL Prediction—building on these rating foundations with gradient boosting, neural networks, and feature engineering approaches that can capture complex patterns in NFL data.


Chapter Summary

Rating systems transform game results into team strength estimates. The Elo system's elegance—simple update rules, interpretable outputs, self-correction—makes it the foundation for many prediction systems. Adding margin of victory, proper season transitions, and calibration transforms basic Elo into a competitive prediction tool.

Alternative approaches like SRS and efficiency ratings offer different tradeoffs. SRS provides clean mathematical properties and built-in strength of schedule adjustment. Efficiency ratings maximize information extraction from play-by-play data at the cost of complexity.

The best practitioners often combine multiple approaches, using Elo's simplicity for public communication, SRS for schedule analysis, and efficiency metrics for detailed matchup breakdowns. Understanding each system's strengths and limitations enables better prediction and clearer communication of uncertainty.