Case Study: What Statistics Predict Winning?

"Winning isn't everything, but wanting to win is." — Vince Lombardi

Executive Summary

Every football fan has theories about what matters most for winning: rushing the ball, controlling time of possession, limiting turnovers. This case study uses correlation analysis to empirically test which statistics actually predict winning games.

Skills Applied:

  • Correlation analysis
  • Multiple variable comparison
  • Statistical significance
  • Building predictive frameworks


Background

The Question

Which game statistics correlate most strongly with winning? The candidates include:

  • Offensive production: Points, yards, first downs
  • Balance: Rushing vs passing split
  • Efficiency: Yards per play, third-down conversion
  • Ball security: Turnovers, fumbles, interceptions
  • Field position: Starting yard line, punting
  • Time of possession: Clock control

The Data

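We'll analyze 100 simulated games with detailed team statistics. Because the generator below ties each statistic to team quality by hand, the correlations we recover reflect those design choices rather than real NFL or college results: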

import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(42)

def generate_game_data(n_games: int = 100) -> pd.DataFrame:
    """
    Generate realistic game data with correlated statistics.

    Creates data where certain stats naturally correlate with
    winning to enable meaningful analysis.
    """
    games = []

    for game_id in range(1, n_games + 1):
        # Generate base team quality (affects multiple stats)
        team_quality = np.random.normal(0, 1)
        opponent_quality = np.random.normal(0, 1)
        quality_diff = team_quality - opponent_quality

        # Points (strongly tied to quality)
        points_for = max(3, int(28 + quality_diff * 8 + np.random.normal(0, 7)))
        points_against = max(3, int(28 - quality_diff * 8 + np.random.normal(0, 7)))

        # Determine win (a tie counts as a non-win)
        win = 1 if points_for > points_against else 0
        margin = points_for - points_against

        # Rushing (moderately correlated with winning)
        rush_yards = int(135 + quality_diff * 20 + np.random.normal(0, 35))
        rush_attempts = int(32 + quality_diff * 3 + np.random.normal(0, 5))

        # Passing (moderately correlated)
        pass_yards = int(235 + quality_diff * 25 + np.random.normal(0, 50))
        pass_attempts = int(32 - quality_diff * 2 + np.random.normal(0, 6))
        completions = int(pass_attempts * (0.62 + quality_diff * 0.05 + np.random.normal(0, 0.08)))
        completions = min(completions, pass_attempts)

        # Turnovers (strongly negatively correlated with winning)
        turnovers = max(0, int(1.5 - quality_diff * 0.8 + np.random.exponential(0.8)))
        opp_turnovers = max(0, int(1.5 + quality_diff * 0.8 + np.random.exponential(0.8)))

        # Time of possession (weakly correlated - often effect, not cause)
        top_seconds = int(1800 + quality_diff * 120 + np.random.normal(0, 180))
        top_seconds = max(1200, min(2400, top_seconds))

        # Third down (moderately correlated)
        third_downs = int(14 + np.random.normal(0, 3))
        third_converts = int(third_downs * (0.38 + quality_diff * 0.08 + np.random.normal(0, 0.1)))
        third_converts = max(0, min(third_converts, third_downs))

        # Penalties (weakly negatively correlated)
        penalties = max(2, int(6 - quality_diff * 0.5 + np.random.normal(0, 2)))
        penalty_yards = penalties * int(np.random.uniform(6, 12))

        # First downs
        first_downs = int(rush_yards / 10 + pass_yards / 15 + 3 + np.random.normal(0, 2))

        games.append({
            "game_id": game_id,
            "win": win,
            "margin": margin,
            "points_for": points_for,
            "points_against": points_against,
            "rush_yards": rush_yards,
            "rush_attempts": rush_attempts,
            "pass_yards": pass_yards,
            "pass_attempts": pass_attempts,
            "completions": completions,
            "total_yards": rush_yards + pass_yards,
            "first_downs": first_downs,
            "turnovers": turnovers,
            "turnovers_forced": opp_turnovers,
            "turnover_margin": opp_turnovers - turnovers,
            "top_seconds": top_seconds,
            "third_down_att": third_downs,
            "third_down_conv": third_converts,
            "penalties": penalties,
            "penalty_yards": penalty_yards
        })

    df = pd.DataFrame(games)

    # Add calculated metrics
    df["ypp"] = (df["total_yards"] / (df["rush_attempts"] + df["pass_attempts"])).round(2)
    df["completion_pct"] = (df["completions"] / df["pass_attempts"] * 100).round(1)
    df["third_down_pct"] = (df["third_down_conv"] / df["third_down_att"] * 100).round(1)
    df["rush_pct"] = (df["rush_yards"] / df["total_yards"] * 100).round(1)
    df["top_minutes"] = (df["top_seconds"] / 60).round(1)

    return df

# Generate data
games = generate_game_data(100)
print(f"Generated {len(games)} games")
print(f"Win rate: {games['win'].mean():.1%}")
print(games.head())

Phase 1: Initial Correlation Analysis

Correlation with Winning

def calculate_win_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate correlation of each statistic with winning.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    pd.DataFrame
        Correlations sorted by absolute value
    """
    # Statistics to analyze (named stat_names so the list does not shadow
    # the scipy.stats module imported at the top of the script)
    stat_names = [
        "points_for", "points_against", "margin",
        "rush_yards", "pass_yards", "total_yards",
        "turnovers", "turnovers_forced", "turnover_margin",
        "first_downs", "ypp", "completion_pct",
        "third_down_pct", "top_minutes",
        "penalties", "penalty_yards", "rush_pct"
    ]

    correlations = []
    for stat in stat_names:
        if stat in df.columns:
            # Pearson r against a 0/1 outcome is the point-biserial correlation
            r, p_value = stats.pearsonr(df["win"], df[stat])
            correlations.append({
                "statistic": stat,
                "correlation": round(r, 3),
                "p_value": round(p_value, 4),
                "significant": p_value < 0.05
            })

    result = pd.DataFrame(correlations)
    result["abs_corr"] = result["correlation"].abs()
    result = result.sort_values("abs_corr", ascending=False)

    return result[["statistic", "correlation", "p_value", "significant"]]

win_correlations = calculate_win_correlations(games)
print("\nCORRELATIONS WITH WINNING")
print("=" * 60)
print(win_correlations.to_string(index=False))
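Testing 17 statistics at the 0.05 level means a few can look "significant" by chance alone. One common safeguard is a Holm step-down adjustment of the p-values. The following is a minimal sketch, not part of the original analysis: the helper name `holm_adjust` and the example p-values are made up for illustration.

```python
import numpy as np

def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Multiply the k-th smallest p-value by (m - k), for k = 0..m-1
    scaled = p[order] * (m - np.arange(m))
    # Adjusted values must be non-decreasing and capped at 1
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

# Hypothetical p-values for illustration only (not results from this case study)
example_p = [0.001, 0.04, 0.03, 0.20]
adjusted = holm_adjust(example_p)
print(adjusted)
```

Applied to the table above, this would be `holm_adjust(win_correlations["p_value"])`. Note that the table stores p-values rounded to four decimals, so very small values appear as 0 and stay 0 after adjustment.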

Interpreting Results

def interpret_correlations(corr_df: pd.DataFrame) -> str:
    """
    Generate interpretation of correlation results.

    Parameters
    ----------
    corr_df : pd.DataFrame
        Correlation results

    Returns
    -------
    str
        Interpretation text
    """
    lines = []
    lines.append("\nINTERPRETATION")
    lines.append("=" * 60)

    # Strong correlations (|r| > 0.5)
    strong = corr_df[corr_df["correlation"].abs() > 0.5]
    if len(strong) > 0:
        lines.append("\nSTRONG PREDICTORS (|r| > 0.5):")
        for _, row in strong.iterrows():
            direction = "positive" if row["correlation"] > 0 else "negative"
            lines.append(f"  • {row['statistic']}: r = {row['correlation']:.3f} ({direction})")

    # Moderate correlations (0.3 < |r| <= 0.5)
    moderate = corr_df[(corr_df["correlation"].abs() > 0.3) &
                       (corr_df["correlation"].abs() <= 0.5)]
    if len(moderate) > 0:
        lines.append("\nMODERATE PREDICTORS (0.3 < |r| <= 0.5):")
        for _, row in moderate.iterrows():
            direction = "positive" if row["correlation"] > 0 else "negative"
            lines.append(f"  • {row['statistic']}: r = {row['correlation']:.3f} ({direction})")

    # Weak correlations
    weak = corr_df[(corr_df["correlation"].abs() <= 0.3) &
                   (corr_df["correlation"].abs() > 0.1)]
    if len(weak) > 0:
        lines.append("\nWEAK PREDICTORS (0.1 < |r| <= 0.3):")
        for _, row in weak.iterrows():
            lines.append(f"  • {row['statistic']}: r = {row['correlation']:.3f}")

    # Non-predictors
    non = corr_df[corr_df["correlation"].abs() <= 0.1]
    if len(non) > 0:
        lines.append("\nNON-PREDICTORS (|r| <= 0.1):")
        for _, row in non.iterrows():
            lines.append(f"  • {row['statistic']}: r = {row['correlation']:.3f}")

    return "\n".join(lines)

interpretation = interpret_correlations(win_correlations)
print(interpretation)
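The strong / moderate / weak buckets treat each r as exact, but with 100 games every estimate carries real uncertainty. A rough way to see this is a confidence interval from the Fisher z-transformation. This is a sketch under the usual assumption of approximately bivariate-normal data; `correlation_ci` is a hypothetical helper, and r = 0.30 is an illustrative value rather than a result from the table.

```python
import numpy as np
from scipy import stats

def correlation_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson r via Fisher's z."""
    z = np.arctanh(r)                      # Fisher z-transform
    se = 1.0 / np.sqrt(n - 3)              # standard error on the z scale
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    # Back-transform the interval endpoints to the r scale
    return float(np.tanh(z - z_crit * se)), float(np.tanh(z + z_crit * se))

# Illustrative value: r = 0.30 estimated from n = 100 games
low, high = correlation_ci(0.30, 100)
print(f"95% CI for r = 0.30, n = 100: ({low:.2f}, {high:.2f})")
```

For r = 0.30 and n = 100 the interval runs from roughly 0.11 to 0.47, so a statistic sitting on the weak/moderate boundary could plausibly belong to either bucket.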

Phase 2: Margin-Based Analysis

Correlating with Point Margin

Using point margin instead of win/loss provides more granularity:

def calculate_margin_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate correlations with point margin.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    pd.DataFrame
        Correlations sorted by strength
    """
    # Named stat_names so the list does not shadow the scipy.stats module
    stat_names = [
        "rush_yards", "pass_yards", "total_yards",
        "turnovers", "turnovers_forced", "turnover_margin",
        "first_downs", "ypp", "completion_pct",
        "third_down_pct", "top_minutes",
        "penalties", "penalty_yards", "rush_pct"
    ]

    correlations = []
    for stat in stat_names:
        if stat in df.columns:
            r, p_value = stats.pearsonr(df["margin"], df[stat])
            correlations.append({
                "statistic": stat,
                "correlation": round(r, 3),
                "p_value": round(p_value, 4)
            })

    result = pd.DataFrame(correlations)
    result = result.sort_values("correlation", key=abs, ascending=False)

    return result

margin_correlations = calculate_margin_correlations(games)
print("\nCORRELATIONS WITH POINT MARGIN")
print("=" * 60)
print(margin_correlations.to_string(index=False))

Win vs Margin Comparison

def compare_win_margin_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compare correlations with win vs margin.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    pd.DataFrame
        Side-by-side comparison
    """
    # Named stat_names to avoid shadowing the scipy.stats module
    stat_names = [
        "rush_yards", "pass_yards", "total_yards",
        "turnovers", "turnover_margin", "first_downs",
        "ypp", "third_down_pct", "top_minutes"
    ]

    comparisons = []
    for stat in stat_names:
        r_win = df["win"].corr(df[stat])
        r_margin = df["margin"].corr(df[stat])

        comparisons.append({
            "statistic": stat,
            "corr_with_win": round(r_win, 3),
            "corr_with_margin": round(r_margin, 3),
            "difference": round(abs(r_margin) - abs(r_win), 3)
        })

    return pd.DataFrame(comparisons).sort_values("corr_with_margin", key=abs, ascending=False)

comparison = compare_win_margin_correlations(games)
print("\nWIN vs MARGIN CORRELATION COMPARISON")
print("=" * 60)
print(comparison.to_string(index=False))
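Why should correlations with margin tend to be larger than correlations with win/loss? Collapsing a continuous outcome to 0/1 throws away information, which attenuates r. The simulation below is a toy illustration, separate from the case-study data: the variable names and the built-in correlation of 0.5 are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy predictor and a continuous outcome built to correlate at 0.5
x = rng.normal(size=n)
margin = 0.5 * x + np.sqrt(1 - 0.5**2) * rng.normal(size=n)

# Dichotomize the outcome the same way win/loss dichotomizes margin
win = (margin > 0).astype(int)

r_margin = np.corrcoef(x, margin)[0, 1]
r_win = np.corrcoef(x, win)[0, 1]

# For normal data split at the median, theory predicts shrinkage by
# a factor of sqrt(2 / pi), about 0.80
print(f"r with margin: {r_margin:.3f}")
print(f"r with win:    {r_win:.3f}")
print(f"ratio:         {r_win / r_margin:.3f}")
```

The same statistic carries the same information in both cases; only the outcome's resolution changes, so a lower r against win/loss does not mean the statistic matters less.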

Phase 3: Turnover Analysis Deep Dive

The Turnover Impact

Turnovers consistently emerge as critical. Let's analyze them in depth:

def analyze_turnovers(df: pd.DataFrame) -> dict:
    """
    Deep dive into turnover impact on winning.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    dict
        Turnover analysis results
    """
    analysis = {}

    # Win rate by turnover count
    turnover_wins = df.groupby("turnovers").agg({
        "win": ["count", "mean", "sum"]
    })
    turnover_wins.columns = ["games", "win_rate", "wins"]
    analysis["by_turnover_count"] = turnover_wins

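    # Win rate by turnover margin. pd.cut uses right-closed bins, so with
    # integer margins each bucket corresponds to the label shown.
    to_margin_bucket = pd.cut(
        df["turnover_margin"],
        bins=[-10, -2, -1, 0, 1, 2, 10],
        labels=["-2 or worse", "-1", "0", "+1", "+2", "+3 or better"]
    )
    # Group by the local Series so the caller's DataFrame is not modified
    margin_wins = df.groupby(to_margin_bucket, observed=False).agg({
        "win": ["count", "mean"]
    })
    margin_wins.columns = ["games", "win_rate"]
    analysis["by_margin_bucket"] = margin_wins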

    # Average margin by turnovers
    by_to = df.groupby("turnovers")["margin"].mean()
    analysis["avg_margin_by_to"] = by_to

    return analysis

turnover_analysis = analyze_turnovers(games)

print("\nTURNOVER ANALYSIS")
print("=" * 60)

print("\nWin Rate by Turnovers Committed:")
print(turnover_analysis["by_turnover_count"])

print("\nWin Rate by Turnover Margin:")
print(turnover_analysis["by_margin_bucket"])

print("\nAverage Point Margin by Turnovers:")
print(turnover_analysis["avg_margin_by_to"])
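With 100 games spread across several buckets, some win rates rest on only a handful of games. A Wilson score interval gives a sense of how wide the plausible range is. This is a minimal sketch; `wilson_interval` is a hypothetical helper and the 6-of-8 record is an invented example, not a bucket from the output above.

```python
import math

def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a win rate (z = 1.96 for about 95%)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return center - half, center + half

# Invented example: a bucket with 6 wins in 8 games (75% win rate)
low, high = wilson_interval(6, 8)
print(f"6 of 8: win rate 75%, interval ({low:.2f}, {high:.2f})")
```

A 75% win rate from eight games is compatible with a true rate anywhere from about 41% to 93%, so differences between small buckets deserve caution.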

Turnover Value Calculation

def calculate_turnover_value(df: pd.DataFrame) -> float:
    """
    Estimate the point value of a turnover.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    float
        Estimated points per turnover
    """
    # Regression approach: how much does margin change per turnover?
    # Simple linear regression: margin = a + b*turnovers

    from scipy.stats import linregress

    slope, intercept, r, p, se = linregress(df["turnovers"], df["margin"])

    print(f"Regression: margin = {intercept:.1f} + ({slope:.1f}) × turnovers")
    print(f"R-squared: {r**2:.3f}")
    print(f"Each turnover costs approximately {-slope:.1f} points")

    return -slope

turnover_value = calculate_turnover_value(games)
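The regression slope is a single point estimate. One way to gauge how far it can be trusted is to bootstrap the games and refit. The sketch below uses toy data with a built-in cost of 4 points per turnover; that value, the noise level, and the helper name `bootstrap_slope_ci` are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import linregress

def bootstrap_slope_ci(x, y, n_boot=1000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the slope of y regressed on x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample games with replacement
        slopes[b] = linregress(x[idx], y[idx]).slope
    alpha = 1 - confidence
    low, high = np.percentile(slopes, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(low), float(high)

# Toy data: each turnover costs 4 points on average (assumed for the demo)
rng = np.random.default_rng(1)
toy_turnovers = rng.poisson(1.5, size=100)
toy_margin = -4.0 * toy_turnovers + rng.normal(0, 6, size=100)

low, high = bootstrap_slope_ci(toy_turnovers, toy_margin)
print(f"95% bootstrap interval for the slope: ({low:.1f}, {high:.1f})")
```

On the case-study data the call would be `bootstrap_slope_ci(games["turnovers"], games["margin"])`. Keep in mind that the slope describes an association: in the simulated data both turnovers and margin are driven by team quality, so part of the "cost" is really a quality gap.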

Phase 4: Correlation Matrix

Full Correlation Matrix

def create_full_correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create correlation matrix for key statistics.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    pd.DataFrame
        Correlation matrix
    """
    key_stats = [
        "margin", "total_yards", "rush_yards", "pass_yards",
        "turnovers", "turnover_margin", "first_downs",
        "ypp", "third_down_pct", "top_minutes"
    ]

    return df[key_stats].corr().round(2)

corr_matrix = create_full_correlation_matrix(games)
print("\nCORRELATION MATRIX")
print("=" * 60)
print(corr_matrix)

Finding Redundant Statistics

def find_redundant_stats(corr_matrix: pd.DataFrame, threshold: float = 0.7) -> list:
    """
    Find highly correlated (redundant) statistics.

    Parameters
    ----------
    corr_matrix : pd.DataFrame
        Correlation matrix
    threshold : float
        Correlation threshold for redundancy

    Returns
    -------
    list
        Pairs of redundant statistics
    """
    redundant = []

    for i, col1 in enumerate(corr_matrix.columns):
        for col2 in corr_matrix.columns[i+1:]:
            r = abs(corr_matrix.loc[col1, col2])
            if r >= threshold:
                redundant.append({
                    "stat1": col1,
                    "stat2": col2,
                    "correlation": round(corr_matrix.loc[col1, col2], 2)
                })

    return redundant

redundant_pairs = find_redundant_stats(corr_matrix)
print("\nREDUNDANT STATISTICS (|r| >= 0.7)")
print("-" * 40)
for pair in redundant_pairs:
    print(f"  {pair['stat1']} ↔ {pair['stat2']}: r = {pair['correlation']}")
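Two statistics can also correlate simply because both follow a third variable. A partial correlation removes the part each statistic shares with that variable before correlating what remains. The sketch below is a toy illustration: `partial_correlation` is a hypothetical helper, and the `quality` variable is simulated here because the game DataFrame does not store team quality.

```python
import numpy as np

def partial_correlation(x, y, control):
    """Correlation of x and y after regressing each on the control variable."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c = np.asarray(control, dtype=float)
    design = np.column_stack([np.ones_like(c), c])   # intercept + control
    # Residuals of x and y after least-squares fits on the control
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

# Toy setup: possession time and margin are both driven only by quality
rng = np.random.default_rng(1)
n = 20_000
quality = rng.normal(size=n)
toy_top = quality + rng.normal(size=n)
toy_margin = quality + rng.normal(size=n)

raw = np.corrcoef(toy_top, toy_margin)[0, 1]
partial = partial_correlation(toy_top, toy_margin, quality)
print(f"raw r:     {raw:.3f}")
print(f"partial r: {partial:.3f}")
```

In this toy setup the raw correlation is about 0.5 while the partial correlation is near zero, which is the pattern to expect when a statistic has no link to the outcome beyond the shared driver.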

Phase 5: Building a Prediction Framework

Simple Win Probability Model

def build_simple_model(df: pd.DataFrame) -> dict:
    """
    Build simple model based on key statistics.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    dict
        Model parameters and accuracy
    """
    from scipy.stats import linregress

    # Use turnover margin and yards per play
    # Multiple regression would be better, but we'll keep it simple

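    # Split into "training" and "test" (copy the test slice so that adding
    # prediction columns does not trigger a SettingWithCopyWarning)
    train = df.iloc[:70]
    test = df.iloc[70:].copy()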

    # Fit on training data
    slope_to, intercept_to, r_to, _, _ = linregress(
        train["turnover_margin"], train["margin"]
    )

    slope_ypp, intercept_ypp, r_ypp, _, _ = linregress(
        train["ypp"], train["margin"]
    )

    # Simple combined prediction: average of both models
    test["pred_margin_to"] = intercept_to + slope_to * test["turnover_margin"]
    test["pred_margin_ypp"] = intercept_ypp + slope_ypp * test["ypp"]
    test["pred_margin"] = (test["pred_margin_to"] + test["pred_margin_ypp"]) / 2
    test["pred_win"] = (test["pred_margin"] > 0).astype(int)

    # Accuracy
    accuracy = (test["pred_win"] == test["win"]).mean()

    # Mean absolute error
    mae = (test["pred_margin"] - test["margin"]).abs().mean()

    return {
        "accuracy": round(accuracy, 3),
        "mae": round(mae, 1),
        "turnover_coefficient": round(slope_to, 2),
        "ypp_coefficient": round(slope_ypp, 2),
        "test_predictions": test[["win", "pred_win", "margin", "pred_margin"]]
    }

model_results = build_simple_model(games)
print("\nSIMPLE PREDICTION MODEL")
print("=" * 60)
print(f"Win prediction accuracy: {model_results['accuracy']:.1%}")
print(f"Margin MAE: {model_results['mae']} points")
print(f"\nKey coefficients:")
print(f"  Turnover margin: {model_results['turnover_coefficient']} points per turnover")
print(f"  Yards per play: {model_results['ypp_coefficient']} points per YPP")
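Averaging two single-variable fits is a shortcut. Fitting both predictors at once lets each coefficient account for the other. Here is a minimal multiple-regression sketch with NumPy least squares; the toy data, its built-in coefficients (4 points per turnover, 6 points per yard per play), and the helper names are assumptions for illustration and not the case study's results.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; returns [intercept, coef_1, ..., coef_k]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

def predict_ols(coefs, X):
    """Predict with coefficients returned by fit_ols."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(len(X)), X]) @ coefs

# Toy data with known coefficients (assumed for the demo)
rng = np.random.default_rng(2)
n = 500
to_margin = rng.integers(-3, 4, size=n).astype(float)
ypp = rng.normal(5.5, 0.8, size=n)
margin = 4.0 * to_margin + 6.0 * (ypp - 5.5) + rng.normal(0, 5, size=n)

X = np.column_stack([to_margin, ypp])
coefs = fit_ols(X[:350], margin[:350])          # fit on the first 70%
pred = predict_ols(coefs, X[350:])              # evaluate on the rest

mae = float(np.mean(np.abs(pred - margin[350:])))
accuracy = float(np.mean((pred > 0) == (margin[350:] > 0)))
print(f"coefficients (intercept, turnover margin, ypp): {np.round(coefs, 2)}")
print(f"test MAE: {mae:.1f} points, win accuracy: {accuracy:.1%}")
```

On the case-study data, `X` would be `games[["turnover_margin", "ypp"]]` and `y` would be `games["margin"]`, with the same 70/30 split used above.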

Phase 6: Conclusions

Summary of Findings

def generate_summary(games_df: pd.DataFrame) -> str:
    """
    Generate executive summary of findings.

    Parameters
    ----------
    games_df : pd.DataFrame
        Game data

    Returns
    -------
    str
        Summary text
    """
    win_corr = calculate_win_correlations(games_df)

    lines = []
    lines.append("=" * 60)
    lines.append("EXECUTIVE SUMMARY: WHAT PREDICTS WINNING?")
    lines.append("=" * 60)
    lines.append("")

    lines.append("TOP PREDICTORS OF WINNING:")
    lines.append("-" * 40)
    # Drop scoring stats first (they define the outcome), then take the top 5
    scoring = ["points_for", "points_against", "margin"]
    top_5 = win_corr[~win_corr["statistic"].isin(scoring)].head(5)
    for _, row in top_5.iterrows():
        lines.append(f"  {row['statistic']}: r = {row['correlation']}")

    lines.append("")
    lines.append("KEY INSIGHTS:")
    lines.append("-" * 40)
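    # Compute the turnover figures from the data instead of hard-coding them
    to_slope = stats.linregress(games_df["turnover_margin"], games_df["margin"]).slope
    to_win_rate = games_df.loc[games_df["turnover_margin"] > 0, "win"].mean()
    lines.append("1. TURNOVER MARGIN is the strongest non-scoring predictor")
    lines.append(f"   - Each turnover of differential is worth ~{to_slope:.1f} points")
    lines.append(f"   - Teams that win the turnover battle win {to_win_rate:.0%} of games")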
    lines.append("")
    lines.append("2. YARDS PER PLAY matters more than total yards")
    lines.append("   - Efficiency over volume")
    lines.append("   - YPP captures quality of plays")
    lines.append("")
    lines.append("3. TIME OF POSSESSION is a WEAK predictor")
    lines.append("   - Often a result of winning, not cause")
    lines.append("   - Teams with leads run more, gaining TOP")
    lines.append("")
    lines.append("4. THIRD DOWN EFFICIENCY is moderately predictive")
    lines.append("   - Sustaining drives leads to points")
    lines.append("   - But efficiency stats overlap with yards")
    lines.append("")

    lines.append("RECOMMENDATIONS:")
    lines.append("-" * 40)
    lines.append("• Prioritize ball security over anything else")
    lines.append("• Focus on yards per play, not total yards")
    lines.append("• Don't obsess over time of possession")
    lines.append("• Third down execution matters, but is downstream")

    return "\n".join(lines)

summary = generate_summary(games)
print(summary)

Discussion Questions

  1. Why might time of possession be weakly correlated with winning despite the common belief that "controlling the clock" wins games?

  2. How would you expect these correlations to differ between college and NFL football?

  3. What lurking variables might explain some of these correlations?

  4. If you could only track two statistics to predict game outcomes, which would you choose and why?

  5. How would sample size affect the reliability of these correlation estimates?


Your Turn: Extensions

Option A: Conference Comparison

Split the data by simulated conferences. Do the same statistics predict winning across different styles of play?

Option B: Situational Analysis

Create subsets for:

  • Home vs away games
  • Games against good vs bad teams
  • Close games vs blowouts

Do correlations change?

Option C: Causal Analysis

Attempt to distinguish:

  • What causes winning (turnovers, efficiency)
  • What results from winning (time of possession, rushing yards in 4th quarter)


Key Takeaways

  1. Turnovers dominate: Ball security is the most controllable predictor of winning

  2. Efficiency over volume: Yards per play outperforms total yards

  3. Beware conventional wisdom: Time of possession is largely a result, not a cause

  4. Correlation isn't causation: Many correlated stats share underlying causes

  5. Margin is more informative than win/loss: Binary outcomes hide valuable information