Chapter 5: Statistical Foundations for Football Analytics

"Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital." — Aaron Levenstein

Learning Objectives

By the end of this chapter, you will be able to:

  1. Apply probability concepts to football decision-making
  2. Understand and interpret Expected Points Added (EPA)
  3. Calculate and use Win Probability models
  4. Conduct hypothesis testing for football questions
  5. Recognize and avoid common statistical pitfalls
  6. Apply regression techniques to football problems

5.1 Probability Foundations

Why Probability Matters in Football

Football is fundamentally a game of probabilities. Every decision—whether to go for it on fourth down, when to throw deep, how to allocate the salary cap—involves weighing uncertain outcomes.

Key Probability Concepts:

import numpy as np
from scipy import stats

# Basic probability: If a kicker makes 85% of field goals
fg_probability = 0.85

# Expected value: What's the expected points from a 40-yard attempt?
# Assume: Make = 3 points, Miss = 0 points (simplified)
expected_points = fg_probability * 3 + (1 - fg_probability) * 0
print(f"Expected points: {expected_points:.2f}")  # 2.55

Conditional Probability

The probability of event A given that event B has occurred:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Football Example: Third Down Conversion

def conditional_probability_example(pbp):
    """
    Calculate P(First Down | Third and Short).

    This is the probability of converting given we're in
    a third-and-short situation.
    """
    # Sample space: all third-down plays
    third_down_plays = len(pbp.query("down == 3"))

    # P(Short | Third Down)
    third_short = len(pbp.query("down == 3 and ydstogo <= 3"))
    p_third_short = third_short / third_down_plays

    # P(First Down AND Short | Third Down)
    first_down_short = len(pbp.query("down == 3 and ydstogo <= 3 and first_down == 1"))
    p_first_and_short = first_down_short / third_down_plays

    # P(First Down | Third and Short) = P(FD and Short) / P(Short)
    p_conversion_given_short = p_first_and_short / p_third_short

    return p_conversion_given_short
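
To run this against real data, one option is the nfl_data_py package (assuming it is installed), whose play-by-play tables follow the nflfastR column schema used throughout this chapter:

import nfl_data_py as nfl

# Load a season of play-by-play data (nflfastR schema)
pbp = nfl.import_pbp_data([2023])

p = conditional_probability_example(pbp)
print(f"P(conversion | third and short): {p:.3f}")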

Bayes' Theorem

Bayes' theorem allows us to update probabilities as new information arrives:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Application: Updating Win Probability

At kickoff, each team has approximately a 50% win probability (with a slight edge to the home team). As the game progresses, we update this probability based on:

  • Score differential
  • Time remaining
  • Field position
  • Down and distance
  • Timeouts remaining

def bayes_win_probability_update():
    """
    Conceptual example of Bayesian updating in football.

    Prior: Pre-game win probability = 50%
    Evidence: Team scores the opening touchdown
    Posterior: Updated win probability

    With the inputs below, Bayes' theorem gives
    P(Win | Score First) ≈ 58%.
    """
    prior_win_prob = 0.50
    p_score_first_given_win = 0.58  # Winning teams score first more often
    p_score_first_given_lose = 0.42
    p_score_first = 0.50  # Either team equally likely to score first

    # Bayes' theorem: P(Win | Score First)
    #   = P(Score First | Win) * P(Win) / P(Score First)
    posterior = (p_score_first_given_win * prior_win_prob) / p_score_first

    return posterior  # = 0.58

5.2 Expected Points Added (EPA)

The Foundation Metric

Expected Points Added (EPA) is the foundational metric in modern football analytics. It measures how much each play changes a team's expected point outcome.

The Concept:

Every down-and-distance situation has an expected point value. For example:

  • 1st and 10 at own 25 ≈ 1.0 expected points
  • 1st and 10 at opponent's 25 ≈ 4.0 expected points
  • 1st and Goal at the 1 ≈ 6.3 expected points

EPA measures the change: EPA = EP_after - EP_before. A 50-yard completion that moves the offense from its own 25 (≈ 1.0 EP) to the opponent's 25 (≈ 4.0 EP) is therefore worth roughly +3.0 EPA.

Building an Expected Points Model

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

class ExpectedPointsModel:
    """
    Simplified Expected Points model.

    The real nflfastR model uses:
    - Multinomial logistic regression
    - 7 outcome categories (TD, FG, Safety, etc. for each team)
    - Game state variables (down, distance, field position, etc.)
    """

    def __init__(self):
        self.model = None
        self.point_values = {
            'touchdown': 7,  # Including extra point
            'field_goal': 3,
            'safety': 2,
            'no_score': 0
        }

    def calculate_expected_points(self, yardline_100, down, ydstogo):
        """
        Calculate expected points for a situation.

        Parameters
        ----------
        yardline_100 : int
            Yards from opponent's goal (1-99)
        down : int
            Current down (1-4)
        ydstogo : int
            Yards needed for first down

        Returns
        -------
        float
            Expected points value
        """
        # Simplified model using historical averages
        # Real models use regression with many features

        # Base EP by field position (simplified)
        base_ep = (100 - yardline_100) * 0.06 - 1.5

        # Down adjustment
        down_adjustment = {
            1: 0.3,
            2: 0.0,
            3: -0.3,
            4: -0.6
        }
        base_ep += down_adjustment.get(down, 0)

        # Distance adjustment (longer = harder)
        if ydstogo > 10:
            base_ep -= 0.3
        elif ydstogo <= 3:
            base_ep += 0.2

        return base_ep

    def calculate_epa(self, play_data):
        """
        Calculate EPA for a play.

        Parameters
        ----------
        play_data : dict
            Contains before/after state

        Returns
        -------
        float
            Expected Points Added
        """
        ep_before = self.calculate_expected_points(
            play_data['yardline_100_before'],
            play_data['down_before'],
            play_data['ydstogo_before']
        )

        # Handle scoring plays and turnovers
        if play_data.get('touchdown'):
            ep_after = 7
        elif play_data.get('turnover'):
            ep_after = -self.calculate_expected_points(
                100 - play_data['yardline_100_after'],
                1,
                10
            )
        else:
            ep_after = self.calculate_expected_points(
                play_data['yardline_100_after'],
                play_data['down_after'],
                play_data['ydstogo_after']
            )

        return ep_after - ep_before
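
A quick sanity check of this toy model on the chapter's opening example, a 50-yard completion that moves the ball from a team's own 25 to the opponent's 25 (the result reflects this simplified model, not the nflfastR one):

model = ExpectedPointsModel()

play = {
    'yardline_100_before': 75,  # own 25
    'down_before': 1,
    'ydstogo_before': 10,
    'yardline_100_after': 25,   # opponent's 25
    'down_after': 1,
    'ydstogo_after': 10,
}

print(f"EPA: {model.calculate_epa(play):+.2f}")  # +3.00 with this toy model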

Interpreting EPA

| EPA Value | Interpretation |
|-----------|--------------------------------|
| +2.0 | Excellent play (long gain, TD) |
| +1.0 | Very good play |
| +0.5 | Good play, above average |
| 0.0 | Average play |
| -0.5 | Below average play |
| -2.0 | Bad play (turnover, sack) |
| -4.0 | Disaster (pick-six territory) |

def interpret_epa(epa):
    """Provide interpretation of EPA value."""
    if epa >= 2.0:
        return "Excellent play"
    elif epa >= 1.0:
        return "Very good play"
    elif epa >= 0.5:
        return "Good play"
    elif epa >= -0.5:
        return "Average play"
    elif epa >= -1.0:
        return "Below average"
    elif epa >= -2.0:
        return "Bad play"
    else:
        return "Disaster"

EPA Applications

1. Player Evaluation

def evaluate_quarterback(pbp, min_dropbacks=200):
    """Evaluate QBs using EPA."""
    qb_stats = (
        pbp
        .query("pass == 1")
        .groupby('passer_player_name')
        .agg(
            dropbacks=('pass', 'count'),
            total_epa=('epa', 'sum'),
            epa_per_play=('epa', 'mean'),
            success_rate=('success', 'mean')
        )
        .query(f"dropbacks >= {min_dropbacks}")
        .sort_values('epa_per_play', ascending=False)
    )
    return qb_stats

2. Play-Calling Analysis

def analyze_play_calling(pbp, team):
    """Analyze play-calling efficiency."""
    team_plays = pbp.query(f"posteam == '{team}'")

    analysis = (
        team_plays
        .groupby(['down', 'play_type'])
        .agg(
            plays=('play_id', 'count'),
            epa=('epa', 'mean')
        )
        .reset_index()
        .pivot(index='down', columns='play_type', values='epa')
    )
    return analysis

5.3 Win Probability

Understanding Win Probability

Win Probability (WP) estimates the likelihood of winning at any point during a game, based on the current game state.

Factors in Win Probability:

  1. Score differential
  2. Time remaining
  3. Down and distance
  4. Field position
  5. Timeouts (for each team)
  6. Whether team has the ball

Win Probability Model

class WinProbabilityModel:
    """
    Simplified Win Probability model.

    Real models use logistic regression on historical game data.
    """

    def calculate_wp(
        self,
        score_diff: int,
        seconds_remaining: int,
        has_ball: bool = True,
        yardline_100: int = 75,
        down: int = 1,
        ydstogo: int = 10,
        home: bool = True
    ) -> float:
        """
        Calculate win probability.

        Parameters
        ----------
        score_diff : int
            Score differential (positive = winning)
        seconds_remaining : int
            Seconds left in the game
        has_ball : bool
            Whether the team has possession
        yardline_100 : int
            Yards from opponent's goal
        down : int
            Current down
        ydstogo : int
            Yards to first down
        home : bool
            Whether team is home

        Returns
        -------
        float
            Win probability (0-1)
        """
        # Simplified model
        # Real model: Logistic regression on game outcomes

        # Base probability from score differential
        # Each point is worth roughly 3% of win probability, and a lead
        # matters more as the clock runs down (less time to come back)

        time_factor = seconds_remaining / 3600  # Fraction of game remaining

        # Score adjustment (grows as time runs out)
        score_impact = score_diff * 0.03 * (2 - time_factor)

        # Home advantage
        home_advantage = 0.03 if home else -0.03

        # Possession bonus (depends on field position and time)
        if has_ball:
            possession_bonus = 0.05 * (1 - yardline_100/100) * time_factor
        else:
            possession_bonus = 0

        # Combine factors
        base_wp = 0.5 + score_impact + home_advantage + possession_bonus

        # Clamp to valid range
        return max(0.01, min(0.99, base_wp))

    def calculate_wpa(self, wp_before: float, wp_after: float) -> float:
        """
        Calculate Win Probability Added.

        Parameters
        ----------
        wp_before : float
            Win probability before play
        wp_after : float
            Win probability after play

        Returns
        -------
        float
            Win Probability Added
        """
        return wp_after - wp_before
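
A usage sketch for the toy model above; the exact number is an artifact of the simplified formula, not a calibrated probability:

wp_model = WinProbabilityModel()

# Home team up 7 with 5:00 left, ball at its own 30
wp = wp_model.calculate_wp(
    score_diff=7,
    seconds_remaining=300,
    has_ball=True,
    yardline_100=70,
    home=True
)
print(f"Win probability: {wp:.2f}")  # about 0.93 with this toy model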

Win Probability Added (WPA)

WPA measures how much each play changed the team's chances of winning:

def calculate_game_wpa(pbp, game_id, team):
    """
    Calculate cumulative WPA through a game.

    Parameters
    ----------
    pbp : pd.DataFrame
        Play-by-play data
    game_id : str
        Game identifier
    team : str
        Team to calculate WPA for

    Returns
    -------
    pd.DataFrame
        Play-by-play with cumulative WPA
    """
    game = pbp.query(f"game_id == '{game_id}'").copy()

    # Calculate WPA for each play
    game['team_wpa'] = game.apply(
        lambda x: x['wpa'] if x['posteam'] == team else -x['wpa'],
        axis=1
    )

    # Cumulative WPA
    game['cumulative_wpa'] = game['team_wpa'].cumsum()

    return game[['play_id', 'desc', 'wpa', 'team_wpa', 'cumulative_wpa']]

Clutch Performance

WPA identifies clutch performers—players who contribute most when the stakes are highest:

def identify_clutch_performers(pbp, position='QB', min_plays=50):
    """
    Rank players by total WPA (average WPA is reported as well).

    Parameters
    ----------
    pbp : pd.DataFrame
        Play-by-play data
    position : str
        Position to analyze
    min_plays : int
        Minimum plays for inclusion

    Returns
    -------
    pd.DataFrame
        Players ranked by average WPA
    """
    if position == 'QB':
        player_col = 'passer_player_name'
        filter_expr = 'pass == 1'
    elif position == 'RB':
        player_col = 'rusher_player_name'
        filter_expr = 'rush == 1'
    else:
        player_col = 'receiver_player_name'
        filter_expr = 'pass == 1'

    clutch_stats = (
        pbp
        .query(filter_expr)
        .groupby(player_col)
        .agg(
            plays=('wpa', 'count'),
            total_wpa=('wpa', 'sum'),
            avg_wpa=('wpa', 'mean')
        )
        .query(f"plays >= {min_plays}")
        .sort_values('total_wpa', ascending=False)
    )

    return clutch_stats

5.4 Statistical Inference

Hypothesis Testing in Football

Hypothesis testing helps us determine whether observed differences are statistically significant or could be due to chance.

Common Football Questions:

  • Is home field advantage real?
  • Does this QB perform differently in cold weather?
  • Is the difference between two receivers' catch rates meaningful?

The t-Test

from scipy import stats

def test_home_field_advantage(pbp):
    """
    Test whether home teams have a significant EPA advantage.

    H0: No difference between home and away EPA
    H1: Home teams have higher EPA
    """
    # Separate home and away plays
    home_epa = pbp[pbp['posteam'] == pbp['home_team']]['epa'].dropna()
    away_epa = pbp[pbp['posteam'] == pbp['away_team']]['epa'].dropna()

    # Independent samples t-test
    t_stat, p_value = stats.ttest_ind(home_epa, away_epa)

    # Effect size (Cohen's d)
    pooled_std = np.sqrt((home_epa.std()**2 + away_epa.std()**2) / 2)
    cohens_d = (home_epa.mean() - away_epa.mean()) / pooled_std

    print(f"Home EPA: {home_epa.mean():.4f}")
    print(f"Away EPA: {away_epa.mean():.4f}")
    print(f"Difference: {home_epa.mean() - away_epa.mean():.4f}")
    print(f"t-statistic: {t_stat:.4f}")
    print(f"p-value: {p_value:.4e}")
    print(f"Cohen's d: {cohens_d:.4f}")

    return {
        'home_mean': home_epa.mean(),
        'away_mean': away_epa.mean(),
        't_stat': t_stat,
        'p_value': p_value,
        'cohens_d': cohens_d
    }

Confidence Intervals

def calculate_confidence_interval(data, confidence=0.95):
    """
    Calculate confidence interval for the mean.

    Parameters
    ----------
    data : array-like
        Sample data
    confidence : float
        Confidence level

    Returns
    -------
    tuple
        (lower bound, mean, upper bound)
    """
    data = np.array(data)
    n = len(data)
    mean = data.mean()
    se = stats.sem(data)  # Standard error of the mean

    # t-critical value
    t_crit = stats.t.ppf((1 + confidence) / 2, n - 1)

    margin = t_crit * se

    return (mean - margin, mean, mean + margin)
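
For example, a 95% interval around a small, made-up sample of per-play EPA values:

epa_sample = [0.8, -0.4, 1.2, 0.1, -0.6, 2.3, 0.0, -1.1, 0.5, 0.9]
low, mean, high = calculate_confidence_interval(epa_sample)
print(f"Mean EPA: {mean:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")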

Application: Is a QB's EPA Significantly Different from Average?

def test_qb_above_average(pbp, qb_name, min_plays=200):
    """
    Test if a QB's EPA is significantly above league average.

    Parameters
    ----------
    pbp : pd.DataFrame
        Play-by-play data
    qb_name : str
        QB name to test
    min_plays : int
        Minimum plays

    Returns
    -------
    dict
        Test results
    """
    # Get QB's passes
    qb_passes = pbp.query(f"pass == 1 and passer_player_name == '{qb_name}'")['epa'].dropna()

    if len(qb_passes) < min_plays:
        return {'error': f'Only {len(qb_passes)} plays, need {min_plays}'}

    # League average
    league_avg = pbp.query("pass == 1")['epa'].mean()

    # One-sample t-test against league average
    t_stat, p_value = stats.ttest_1samp(qb_passes, league_avg)

    # One-tailed (is QB above average?)
    p_value_one_tail = p_value / 2 if t_stat > 0 else 1 - p_value / 2

    # Confidence interval
    ci_low, mean, ci_high = calculate_confidence_interval(qb_passes)

    print(f"{qb_name} EPA Analysis:")
    print(f"  Mean EPA: {mean:.4f}")
    print(f"  95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
    print(f"  League Average: {league_avg:.4f}")
    print(f"  p-value (one-tailed): {p_value_one_tail:.4f}")

    significant = p_value_one_tail < 0.05 and mean > league_avg

    return {
        'qb_mean': mean,
        'league_avg': league_avg,
        'ci': (ci_low, ci_high),
        'p_value': p_value_one_tail,
        'significant': significant
    }

Multiple Comparisons Problem

When testing many hypotheses, some will appear significant by chance:

def adjust_p_values(p_values, method='bonferroni'):
    """
    Adjust p-values for multiple comparisons.

    Parameters
    ----------
    p_values : array-like
        Original p-values
    method : str
        'bonferroni' or 'fdr' (Benjamini-Hochberg)

    Returns
    -------
    array
        Adjusted p-values
    """
    p_values = np.array(p_values)
    n = len(p_values)

    if method == 'bonferroni':
        return np.minimum(p_values * n, 1.0)

    elif method == 'fdr':
        # Benjamini-Hochberg procedure
        sorted_idx = np.argsort(p_values)
        sorted_p = p_values[sorted_idx]

        # Raw BH adjustment in ascending p-value order
        adjusted_sorted = sorted_p * n / np.arange(1, n + 1)

        # Enforce monotonicity from the largest p-value downward
        adjusted_sorted = np.minimum.accumulate(adjusted_sorted[::-1])[::-1]
        adjusted_sorted = np.minimum(adjusted_sorted, 1.0)

        # Map back to the original order
        adjusted = np.empty(n)
        adjusted[sorted_idx] = adjusted_sorted

        return adjusted
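
A quick demonstration with made-up p-values from five hypothetical comparisons:

raw_p = [0.001, 0.01, 0.03, 0.04, 0.20]

print(adjust_p_values(raw_p, method='bonferroni'))
# [0.005 0.05  0.15  0.2   1.   ]
print(adjust_p_values(raw_p, method='fdr'))
# [0.005 0.025 0.05  0.05  0.2  ]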

5.5 Regression Analysis

Linear Regression for Football

from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

def regression_example(pbp):
    """
    Predict EPA using game situation variables.
    """
    # Prepare data
    plays = pbp.query("play_type.isin(['pass', 'run']) and epa.notna()").copy()

    # Features
    features = ['down', 'ydstogo', 'yardline_100', 'score_differential', 'pass']
    X = plays[features].dropna()
    y = plays.loc[X.index, 'epa']

    # Fit model
    X_const = sm.add_constant(X)
    model = sm.OLS(y, X_const).fit()

    print(model.summary())

    return model

Logistic Regression for Binary Outcomes

from sklearn.linear_model import LogisticRegression

def predict_success(pbp):
    """
    Build logistic regression to predict play success.
    """
    plays = pbp.query("play_type.isin(['pass', 'run']) and success.notna()").copy()

    # Features
    features = ['down', 'ydstogo', 'yardline_100', 'score_differential', 'pass']
    X = plays[features].dropna()
    y = plays.loc[X.index, 'success']

    # Fit logistic regression
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)

    # Coefficients
    coef_df = pd.DataFrame({
        'feature': features,
        'coefficient': model.coef_[0],
        'odds_ratio': np.exp(model.coef_[0])
    })

    print("Logistic Regression Coefficients:")
    print(coef_df)

    return model, coef_df

Regularization

To prevent overfitting when using many features:

from sklearn.linear_model import Ridge, Lasso

def regularized_regression(X, y, alpha=1.0, method='ridge'):
    """
    Fit regularized regression.

    Parameters
    ----------
    X : array-like
        Features
    y : array-like
        Target
    alpha : float
        Regularization strength
    method : str
        'ridge' (L2) or 'lasso' (L1)

    Returns
    -------
    model
        Fitted model
    """
    if method == 'ridge':
        model = Ridge(alpha=alpha)
    else:
        model = Lasso(alpha=alpha)

    model.fit(X, y)

    return model
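
In practice, alpha is usually chosen by cross-validation rather than set by hand. A sketch using scikit-learn's built-in CV estimators:

from sklearn.linear_model import RidgeCV, LassoCV
import numpy as np

def tune_regularization(X, y, method='ridge'):
    """Pick alpha by cross-validation over a logarithmic grid."""
    alphas = np.logspace(-3, 3, 13)

    if method == 'ridge':
        model = RidgeCV(alphas=alphas, cv=5)
    else:
        model = LassoCV(alphas=alphas, cv=5, max_iter=10000)

    model.fit(X, y)
    print(f"Best alpha: {model.alpha_:.4f}")
    return model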

5.6 Common Statistical Pitfalls

Sample Size Issues

def check_sample_size(n, effect_size=0.2, power=0.8, alpha=0.05):
    """
    Check if sample size is adequate for detecting an effect.

    Parameters
    ----------
    n : int
        Sample size
    effect_size : float
        Expected effect size (Cohen's d)
    power : float
        Desired statistical power
    alpha : float
        Significance level

    Returns
    -------
    dict
        Power analysis results
    """
    from scipy.stats import norm

    z_alpha = norm.ppf(1 - alpha/2)
    z_beta = norm.ppf(power)

    required_n = ((z_alpha + z_beta) / effect_size) ** 2 * 2

    actual_power = norm.cdf(
        effect_size * np.sqrt(n/2) - z_alpha
    )

    return {
        'current_n': n,
        'required_n': int(required_n),
        'adequate': n >= required_n,
        'actual_power': actual_power
    }
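
For instance, checking whether 100 plays per group is enough to detect a small (d = 0.2) EPA difference; the numbers follow directly from the formulas above:

result = check_sample_size(n=100, effect_size=0.2)
print(result)
# {'current_n': 100, 'required_n': 392, 'adequate': False,
#  'actual_power': 0.29...}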

Regression to the Mean

The Problem: Extreme performances tend to regress toward average.

def demonstrate_regression_to_mean(pbp):
    """
    Show regression to the mean in QB performance.

    QBs who perform extremely well in week 1 tend to
    regress toward average in subsequent weeks.
    """
    # Get QB performance by week
    qb_weekly = (
        pbp
        .query("pass == 1")
        .groupby(['passer_player_name', 'week'])
        .agg(
            passes=('pass', 'count'),
            epa=('epa', 'mean')
        )
        .query("passes >= 20")
        .reset_index()
    )

    # Find top performers in week 1
    week1_top = (
        qb_weekly
        .query("week == 1")
        .nlargest(5, 'epa')
        ['passer_player_name']
        .tolist()
    )

    # Track their performance in subsequent weeks
    tracking = (
        qb_weekly
        [qb_weekly['passer_player_name'].isin(week1_top)]
        .pivot(index='passer_player_name', columns='week', values='epa')
    )

    print("Week 1 Top Performers - EPA by Week:")
    print(tracking)
    print("\nNote how extreme week 1 performers tend to regress toward average.")

    return tracking

Survivorship Bias

Only looking at successful outcomes can lead to wrong conclusions:

def survivorship_bias_example():
    """
    Demonstrate survivorship bias in draft analysis.

    If we only analyze starters, we miss all the players
    who were drafted but never made an impact.
    """
    print("Survivorship Bias Warning:")
    print("-" * 50)
    print("When analyzing 'successful' draft picks:")
    print("- Don't forget about busts who are no longer in league")
    print("- Sample of 'starters' is biased toward success")
    print("- True hit rate requires counting all picks, not just successes")
    print()
    print("Example: '1st round QBs have 65% success rate'")
    print("Reality: This often only counts QBs who are still playing,")
    print("         not the ones who flamed out after 2 seasons.")

Selection Bias

def selection_bias_example():
    """
    Explain selection bias in football analytics.
    """
    print("Selection Bias Examples in Football:")
    print("-" * 50)
    print()
    print("1. Fourth Down Analysis")
    print("   - Teams only go for it when they think they can convert")
    print("   - Observed conversion rate is biased upward")
    print("   - Can't conclude 'teams should always go for it'")
    print()
    print("2. Deep Pass Analysis")
    print("   - QBs throw deep when receivers are open")
    print("   - Completion rate reflects receiver separation, not just QB skill")
    print()
    print("3. Blitz Analysis")
    print("   - Defenses blitz when they expect pass")
    print("   - Comparing EPA on blitz vs non-blitz is confounded")

5.7 Effect Sizes and Practical Significance

Beyond p-values

Statistical significance ≠ practical significance

def calculate_effect_size(group1, group2):
    """
    Calculate various effect size measures.

    Parameters
    ----------
    group1 : array-like
        First group
    group2 : array-like
        Second group

    Returns
    -------
    dict
        Effect size measures
    """
    # Cohen's d (pooled sample standard deviation, ddof=1)
    pooled_std = np.sqrt(
        (np.std(group1, ddof=1)**2 + np.std(group2, ddof=1)**2) / 2
    )
    cohens_d = (np.mean(group1) - np.mean(group2)) / pooled_std

    # Percent difference
    pct_diff = (np.mean(group1) - np.mean(group2)) / np.mean(group2) * 100

    # Effect size interpretation
    if abs(cohens_d) < 0.2:
        interpretation = "negligible"
    elif abs(cohens_d) < 0.5:
        interpretation = "small"
    elif abs(cohens_d) < 0.8:
        interpretation = "medium"
    else:
        interpretation = "large"

    return {
        'cohens_d': cohens_d,
        'pct_diff': pct_diff,
        'interpretation': interpretation
    }

Football-Specific Thresholds

| Effect Size | EPA Difference | Interpretation |
|-------------|----------------|----------------------------------|
| Negligible | < 0.02 | No practical difference |
| Small | 0.02-0.05 | Noticeable over many plays |
| Medium | 0.05-0.10 | Meaningful strategic difference |
| Large | > 0.10 | Major performance difference |
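
A small helper encoding these thresholds (the cutoffs are the ones proposed in the table above):

def interpret_epa_difference(epa_diff):
    """Classify an EPA-per-play difference using the thresholds above."""
    magnitude = abs(epa_diff)
    if magnitude < 0.02:
        return "negligible: no practical difference"
    elif magnitude < 0.05:
        return "small: noticeable over many plays"
    elif magnitude < 0.10:
        return "medium: meaningful strategic difference"
    else:
        return "large: major performance difference"

print(interpret_epa_difference(0.07))  # medium: meaningful strategic difference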

5.8 Chapter Summary

Statistical foundations enable rigorous football analysis:

  1. Probability provides the framework for decision-making under uncertainty
  2. EPA quantifies play value by measuring expected point changes
  3. Win Probability tracks game state and identifies crucial moments
  4. Hypothesis testing determines whether differences are meaningful
  5. Regression models relationships and makes predictions
  6. Avoiding pitfalls ensures conclusions are valid

These tools form the basis for all advanced football analytics covered in subsequent chapters.


Chapter Summary Checklist

After completing this chapter, you should be able to:

  • [ ] Calculate conditional probabilities for game situations
  • [ ] Explain how EPA is calculated and what it measures
  • [ ] Interpret Win Probability and WPA
  • [ ] Conduct and interpret hypothesis tests
  • [ ] Build regression models for football outcomes
  • [ ] Recognize and avoid common statistical pitfalls

Preview: Chapter 6

Part 2 begins with quarterback evaluation—applying these statistical foundations to measure passer performance beyond traditional stats.