Case Study: What Statistics Predict Winning?
"Winning isn't everything, but wanting to win is." — Vince Lombardi
Executive Summary
Every football fan has theories about what matters most for winning: rushing the ball, controlling time of possession, limiting turnovers. This case study uses correlation analysis to empirically test which statistics actually predict winning games.
Skills Applied:
- Correlation analysis
- Multiple variable comparison
- Statistical significance
- Building predictive frameworks
Background
The Question
Which game statistics correlate most strongly with winning? The candidates include:
- Offensive production: Points, yards, first downs
- Balance: Rushing vs passing split
- Efficiency: Yards per play, third-down conversion
- Ball security: Turnovers, fumbles, interceptions
- Field position: Starting yard line, punting
- Time of possession: Clock control
The Data
We'll analyze 100 simulated games with detailed team statistics. Each statistic is generated from a shared team-quality gap plus random noise:
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(42)
def generate_game_data(n_games: int = 100) -> pd.DataFrame:
    """
    Generate realistic game data with correlated statistics.

    Creates data where certain stats naturally correlate with
    winning to enable meaningful analysis.
    """
    games = []

    for game_id in range(1, n_games + 1):
        # Generate base team quality (affects multiple stats)
        team_quality = np.random.normal(0, 1)
        opponent_quality = np.random.normal(0, 1)
        quality_diff = team_quality - opponent_quality

        # Points (strongly tied to quality)
        points_for = max(3, int(28 + quality_diff * 8 + np.random.normal(0, 7)))
        points_against = max(3, int(28 - quality_diff * 8 + np.random.normal(0, 7)))

        # Determine win
        win = 1 if points_for > points_against else 0
        margin = points_for - points_against

        # Rushing (moderately correlated with winning)
        rush_yards = int(135 + quality_diff * 20 + np.random.normal(0, 35))
        rush_attempts = int(32 + quality_diff * 3 + np.random.normal(0, 5))

        # Passing (moderately correlated)
        pass_yards = int(235 + quality_diff * 25 + np.random.normal(0, 50))
        pass_attempts = int(32 - quality_diff * 2 + np.random.normal(0, 6))
        completions = int(pass_attempts * (0.62 + quality_diff * 0.05 + np.random.normal(0, 0.08)))
        completions = min(completions, pass_attempts)

        # Turnovers (strongly negatively correlated with winning)
        turnovers = max(0, int(1.5 - quality_diff * 0.8 + np.random.exponential(0.8)))
        opp_turnovers = max(0, int(1.5 + quality_diff * 0.8 + np.random.exponential(0.8)))

        # Time of possession (weakly correlated - often effect, not cause)
        top_seconds = int(1800 + quality_diff * 120 + np.random.normal(0, 180))
        top_seconds = max(1200, min(2400, top_seconds))

        # Third down (moderately correlated)
        third_downs = int(14 + np.random.normal(0, 3))
        third_converts = int(third_downs * (0.38 + quality_diff * 0.08 + np.random.normal(0, 0.1)))
        third_converts = max(0, min(third_converts, third_downs))

        # Penalties (weakly negatively correlated)
        penalties = max(2, int(6 - quality_diff * 0.5 + np.random.normal(0, 2)))
        penalty_yards = penalties * int(np.random.uniform(6, 12))

        # First downs
        first_downs = int(rush_yards / 10 + pass_yards / 15 + 3 + np.random.normal(0, 2))

        games.append({
            "game_id": game_id,
            "win": win,
            "margin": margin,
            "points_for": points_for,
            "points_against": points_against,
            "rush_yards": rush_yards,
            "rush_attempts": rush_attempts,
            "pass_yards": pass_yards,
            "pass_attempts": pass_attempts,
            "completions": completions,
            "total_yards": rush_yards + pass_yards,
            "first_downs": first_downs,
            "turnovers": turnovers,
            "turnovers_forced": opp_turnovers,
            "turnover_margin": opp_turnovers - turnovers,
            "top_seconds": top_seconds,
            "third_down_att": third_downs,
            "third_down_conv": third_converts,
            "penalties": penalties,
            "penalty_yards": penalty_yards
        })

    df = pd.DataFrame(games)

    # Add calculated metrics
    df["ypp"] = (df["total_yards"] / (df["rush_attempts"] + df["pass_attempts"])).round(2)
    df["completion_pct"] = (df["completions"] / df["pass_attempts"] * 100).round(1)
    df["third_down_pct"] = (df["third_down_conv"] / df["third_down_att"] * 100).round(1)
    df["rush_pct"] = (df["rush_yards"] / df["total_yards"] * 100).round(1)
    df["top_minutes"] = (df["top_seconds"] / 60).round(1)

    return df

# Generate data
games = generate_game_data(100)
print(f"Generated {len(games)} games")
print(f"Win rate: {games['win'].mean():.1%}")
print(games.head())
Phase 1: Initial Correlation Analysis
Correlation with Winning
def calculate_win_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate correlation of each statistic with winning.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    pd.DataFrame
        Correlations sorted by absolute value
    """
    # Statistics to analyze. The list is named stat_names so it does not
    # shadow the scipy.stats module used for pearsonr below.
    stat_names = [
        "points_for", "points_against", "margin",
        "rush_yards", "pass_yards", "total_yards",
        "turnovers", "turnovers_forced", "turnover_margin",
        "first_downs", "ypp", "completion_pct",
        "third_down_pct", "top_minutes",
        "penalties", "penalty_yards", "rush_pct"
    ]

    correlations = []
    for stat in stat_names:
        if stat in df.columns:
            r, p_value = stats.pearsonr(df["win"], df[stat])
            correlations.append({
                "statistic": stat,
                "correlation": round(r, 3),
                "p_value": round(p_value, 4),
                "significant": p_value < 0.05
            })

    result = pd.DataFrame(correlations)
    result["abs_corr"] = result["correlation"].abs()
    result = result.sort_values("abs_corr", ascending=False)

    return result[["statistic", "correlation", "p_value", "significant"]]

win_correlations = calculate_win_correlations(games)
print("\nCORRELATIONS WITH WINNING")
print("=" * 60)
print(win_correlations.to_string(index=False))
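
The `significant` column flags each statistic at the 0.05 level on its own, but 17 statistics are tested at once, so a few could pass by chance. A minimal sketch of one common guard, the Bonferroni correction, is shown here. The p-values and variable names are hypothetical and are not output from this case study:

```python
import pandas as pd

# Hypothetical p-values for illustration only (not results from this study)
pvals = pd.Series(
    {"turnover_margin": 0.0001, "ypp": 0.004, "top_minutes": 0.03, "penalties": 0.20}
)

alpha = 0.05
n_tests = len(pvals)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
adjusted = (pvals * n_tests).clip(upper=1.0)

report = pd.DataFrame({
    "p_value": pvals,
    "p_bonferroni": adjusted,
    "significant_raw": pvals < alpha,
    "significant_bonferroni": adjusted < alpha,
})
print(report)
```

In this toy input, `top_minutes` passes the raw test but not the adjusted one. Bonferroni is conservative; it is used here only because it is the simplest correction to read.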
Interpreting Results
def interpret_correlations(corr_df: pd.DataFrame) -> str:
    """
    Generate interpretation of correlation results.

    Parameters
    ----------
    corr_df : pd.DataFrame
        Correlation results

    Returns
    -------
    str
        Interpretation text
    """
    lines = []
    lines.append("\nINTERPRETATION")
    lines.append("=" * 60)

    # Strong correlations (|r| > 0.5)
    strong = corr_df[corr_df["correlation"].abs() > 0.5]
    if len(strong) > 0:
        lines.append("\nSTRONG PREDICTORS (|r| > 0.5):")
        for _, row in strong.iterrows():
            direction = "positive" if row["correlation"] > 0 else "negative"
            lines.append(f"  • {row['statistic']}: r = {row['correlation']:.3f} ({direction})")

    # Moderate correlations (0.3 < |r| <= 0.5)
    moderate = corr_df[(corr_df["correlation"].abs() > 0.3) &
                       (corr_df["correlation"].abs() <= 0.5)]
    if len(moderate) > 0:
        lines.append("\nMODERATE PREDICTORS (0.3 < |r| ≤ 0.5):")
        for _, row in moderate.iterrows():
            direction = "positive" if row["correlation"] > 0 else "negative"
            lines.append(f"  • {row['statistic']}: r = {row['correlation']:.3f} ({direction})")

    # Weak correlations (0.1 < |r| <= 0.3)
    weak = corr_df[(corr_df["correlation"].abs() <= 0.3) &
                   (corr_df["correlation"].abs() > 0.1)]
    if len(weak) > 0:
        lines.append("\nWEAK PREDICTORS (0.1 < |r| ≤ 0.3):")
        for _, row in weak.iterrows():
            lines.append(f"  • {row['statistic']}: r = {row['correlation']:.3f}")

    # Non-predictors (|r| <= 0.1)
    non = corr_df[corr_df["correlation"].abs() <= 0.1]
    if len(non) > 0:
        lines.append("\nNON-PREDICTORS (|r| ≤ 0.1):")
        for _, row in non.iterrows():
            lines.append(f"  • {row['statistic']}: r = {row['correlation']:.3f}")

    return "\n".join(lines)

interpretation = interpret_correlations(win_correlations)
print(interpretation)
Phase 2: Margin-Based Analysis
Correlating with Point Margin
Using point margin instead of win/loss provides more granularity:
def calculate_margin_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate correlations with point margin.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    pd.DataFrame
        Correlations sorted by strength
    """
    # Named stat_names so the list does not shadow the scipy.stats module
    stat_names = [
        "rush_yards", "pass_yards", "total_yards",
        "turnovers", "turnovers_forced", "turnover_margin",
        "first_downs", "ypp", "completion_pct",
        "third_down_pct", "top_minutes",
        "penalties", "penalty_yards", "rush_pct"
    ]

    correlations = []
    for stat in stat_names:
        if stat in df.columns:
            r, p_value = stats.pearsonr(df["margin"], df[stat])
            correlations.append({
                "statistic": stat,
                "correlation": round(r, 3),
                "p_value": round(p_value, 4)
            })

    result = pd.DataFrame(correlations)
    result = result.sort_values("correlation", key=abs, ascending=False)

    return result

margin_correlations = calculate_margin_correlations(games)
print("\nCORRELATIONS WITH POINT MARGIN")
print("=" * 60)
print(margin_correlations.to_string(index=False))
Win vs Margin Comparison
def compare_win_margin_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compare correlations with win vs margin.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    pd.DataFrame
        Side-by-side comparison
    """
    # Named stat_names to avoid shadowing the scipy.stats module
    stat_names = [
        "rush_yards", "pass_yards", "total_yards",
        "turnovers", "turnover_margin", "first_downs",
        "ypp", "third_down_pct", "top_minutes"
    ]

    comparisons = []
    for stat in stat_names:
        r_win = df["win"].corr(df[stat])
        r_margin = df["margin"].corr(df[stat])
        comparisons.append({
            "statistic": stat,
            "corr_with_win": round(r_win, 3),
            "corr_with_margin": round(r_margin, 3),
            # Positive difference: the stat tracks margin more closely than win/loss
            "difference": round(abs(r_margin) - abs(r_win), 3)
        })

    return pd.DataFrame(comparisons).sort_values("corr_with_margin", key=abs, ascending=False)

comparison = compare_win_margin_correlations(games)
print("\nWIN vs MARGIN CORRELATION COMPARISON")
print("=" * 60)
print(comparison.to_string(index=False))
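
Collapsing a continuous outcome into win/loss discards information, which usually pulls correlations toward zero. The sketch below illustrates this on hypothetical data: the predictor `x`, the coefficients, and the noise level are assumptions chosen for the example, not values from the case study. For a normally distributed outcome split at its center, the expected shrinkage factor is roughly 0.8.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical predictor and a continuous outcome that depends on it
x = rng.normal(size=n)
margin = 5 * x + rng.normal(scale=10, size=n)

# Collapse the continuous outcome to a binary win/loss flag
win = (margin > 0).astype(int)

r_margin = np.corrcoef(x, margin)[0, 1]
r_win = np.corrcoef(x, win)[0, 1]

print(f"r with margin: {r_margin:.3f}")
print(f"r with win:    {r_win:.3f}")
print(f"ratio:         {r_win / r_margin:.2f}")
```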
Phase 3: Turnover Analysis Deep Dive
The Turnover Impact
Turnovers consistently emerge as critical. Let's analyze in depth:
def analyze_turnovers(df: pd.DataFrame) -> dict:
    """
    Deep dive into turnover impact on winning.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    dict
        Turnover analysis results
    """
    analysis = {}

    # Win rate by turnover count
    turnover_wins = df.groupby("turnovers").agg({
        "win": ["count", "mean", "sum"]
    })
    turnover_wins.columns = ["games", "win_rate", "wins"]
    analysis["by_turnover_count"] = turnover_wins

    # Win rate by turnover margin.
    # pd.cut bins are right-closed, so with integer margins the six bins are
    # <= -2, -1, 0, +1, +2 and >= +3; the labels say exactly that.
    # The buckets are kept in a local Series so the caller's DataFrame is not modified.
    to_margin_bucket = pd.cut(
        df["turnover_margin"],
        bins=[-10, -2, -1, 0, 1, 2, 10],
        labels=["-2 or worse", "-1", "0", "+1", "+2", "+3 or better"]
    )
    margin_wins = df.groupby(to_margin_bucket, observed=False).agg({
        "win": ["count", "mean"]
    })
    margin_wins.columns = ["games", "win_rate"]
    analysis["by_margin_bucket"] = margin_wins

    # Average margin by turnovers
    by_to = df.groupby("turnovers")["margin"].mean()
    analysis["avg_margin_by_to"] = by_to

    return analysis

turnover_analysis = analyze_turnovers(games)
print("\nTURNOVER ANALYSIS")
print("=" * 60)
print("\nWin Rate by Turnovers Committed:")
print(turnover_analysis["by_turnover_count"])
print("\nWin Rate by Turnover Margin:")
print(turnover_analysis["by_margin_bucket"])
print("\nAverage Point Margin by Turnovers:")
print(turnover_analysis["avg_margin_by_to"])
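
With only 100 games, some turnover buckets hold a handful of games, so their win rates are noisy. One way to show that uncertainty is a Wilson score interval around each rate. The following is a minimal sketch; the helper name `wilson_interval` and the bucket counts are hypothetical, chosen only to illustrate how the interval narrows as a bucket fills up.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, games: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a win rate."""
    if games == 0:
        return (0.0, 1.0)
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return (center - half, center + half)


# Hypothetical bucket counts (wins, games), for illustration only
for wins, n_games in [(3, 4), (15, 20), (75, 100)]:
    low, high = wilson_interval(wins, n_games)
    print(f"{wins}/{n_games} wins: rate = {wins / n_games:.0%}, "
          f"95% CI ≈ [{low:.2f}, {high:.2f}]")
```

All three buckets have the same 75% win rate, but the four-game bucket is compatible with anything from a losing record to near-certain wins.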
Turnover Value Calculation
def calculate_turnover_value(df: pd.DataFrame) -> float:
    """
    Estimate the point value of a turnover.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    float
        Estimated points per turnover
    """
    # Regression approach: how much does margin change per turnover?
    # Simple linear regression: margin = a + b*turnovers
    from scipy.stats import linregress

    slope, intercept, r, p, se = linregress(df["turnovers"], df["margin"])

    print(f"Regression: margin = {intercept:.1f} + ({slope:.1f}) × turnovers")
    print(f"R-squared: {r**2:.3f}")
    print(f"Each turnover costs approximately {-slope:.1f} points")

    return -slope

turnover_value = calculate_turnover_value(games)
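
A single slope estimate hides how uncertain it is at this sample size. The sketch below attaches a 95% confidence interval to a regression slope using the standard error that `linregress` already returns. The data are hypothetical (Poisson turnovers and an assumed cost of 4 points each) and are generated only so the example runs on its own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100

# Hypothetical data: turnovers committed and the resulting point margin
turnovers = rng.poisson(1.5, size=n)
margin = 6 - 4 * turnovers + rng.normal(scale=12, size=n)

fit = stats.linregress(turnovers, margin)

# 95% confidence interval for the slope, using a t distribution with n - 2 df
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_low = fit.slope - t_crit * fit.stderr
ci_high = fit.slope + t_crit * fit.stderr

print(f"slope = {fit.slope:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Note that a slope fitted this way describes association only. In the simulated data, turnovers and margin are both driven by team quality, so the slope overstates what one extra turnover would cost on its own.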
Phase 4: Correlation Matrix
Full Correlation Matrix
def create_full_correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create correlation matrix for key statistics.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    pd.DataFrame
        Correlation matrix
    """
    key_stats = [
        "margin", "total_yards", "rush_yards", "pass_yards",
        "turnovers", "turnover_margin", "first_downs",
        "ypp", "third_down_pct", "top_minutes"
    ]

    return df[key_stats].corr().round(2)

corr_matrix = create_full_correlation_matrix(games)
print("\nCORRELATION MATRIX")
print("=" * 60)
print(corr_matrix)
Finding Redundant Statistics
def find_redundant_stats(corr_matrix: pd.DataFrame, threshold: float = 0.7) -> list:
    """
    Find highly correlated (redundant) statistics.

    Parameters
    ----------
    corr_matrix : pd.DataFrame
        Correlation matrix
    threshold : float
        Correlation threshold for redundancy

    Returns
    -------
    list
        Pairs of redundant statistics
    """
    redundant = []

    # Walk the upper triangle so each pair is checked once
    for i, col1 in enumerate(corr_matrix.columns):
        for col2 in corr_matrix.columns[i+1:]:
            r = abs(corr_matrix.loc[col1, col2])
            if r >= threshold:
                redundant.append({
                    "stat1": col1,
                    "stat2": col2,
                    "correlation": round(corr_matrix.loc[col1, col2], 2)
                })

    return redundant

redundant_pairs = find_redundant_stats(corr_matrix)
print("\nREDUNDANT STATISTICS (|r| ≥ 0.7)")
print("-" * 40)
for pair in redundant_pairs:
    print(f" {pair['stat1']} ↔ {pair['stat2']}: r = {pair['correlation']}")
Phase 5: Building a Prediction Framework
Simple Win Probability Model
def build_simple_model(df: pd.DataFrame) -> dict:
    """
    Build simple model based on key statistics.

    Parameters
    ----------
    df : pd.DataFrame
        Game data

    Returns
    -------
    dict
        Model parameters and accuracy
    """
    from scipy.stats import linregress

    # Use turnover margin and yards per play
    # Multiple regression would be better, but we'll keep it simple

    # Split into "training" and "test" (first 70 games vs. the remaining games).
    # Copy the test slice so adding prediction columns does not write into a
    # view of df (avoids pandas' SettingWithCopyWarning).
    train = df.iloc[:70]
    test = df.iloc[70:].copy()

    # Fit on training data
    slope_to, intercept_to, r_to, _, _ = linregress(
        train["turnover_margin"], train["margin"]
    )
    slope_ypp, intercept_ypp, r_ypp, _, _ = linregress(
        train["ypp"], train["margin"]
    )

    # Simple combined prediction: average of both models
    test["pred_margin_to"] = intercept_to + slope_to * test["turnover_margin"]
    test["pred_margin_ypp"] = intercept_ypp + slope_ypp * test["ypp"]
    test["pred_margin"] = (test["pred_margin_to"] + test["pred_margin_ypp"]) / 2
    test["pred_win"] = (test["pred_margin"] > 0).astype(int)

    # Accuracy
    accuracy = (test["pred_win"] == test["win"]).mean()

    # Mean absolute error
    mae = (test["pred_margin"] - test["margin"]).abs().mean()

    return {
        "accuracy": round(accuracy, 3),
        "mae": round(mae, 1),
        "turnover_coefficient": round(slope_to, 2),
        "ypp_coefficient": round(slope_ypp, 2),
        "test_predictions": test[["win", "pred_win", "margin", "pred_margin"]]
    }

model_results = build_simple_model(games)
print("\nSIMPLE PREDICTION MODEL")
print("=" * 60)
print(f"Win prediction accuracy: {model_results['accuracy']:.1%}")
print(f"Margin MAE: {model_results['mae']} points")
print(f"\nKey coefficients:")
print(f" Turnover margin: {model_results['turnover_coefficient']} points per turnover")
print(f" Yards per play: {model_results['ypp_coefficient']} points per YPP")
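
The model above averages two single-variable fits, and its own comment notes that multiple regression would be better. A minimal sketch of that alternative fits both predictors jointly with ordinary least squares via `np.linalg.lstsq`. The data are hypothetical: the predictor distributions and the "true" coefficients of 4 and 6 are assumptions made only so the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Hypothetical predictors (illustrative distributions, not the case-study data)
turnover_margin = rng.integers(-3, 4, size=n).astype(float)
ypp = rng.normal(5.5, 0.8, size=n)

# Hypothetical "true" relationship used only to generate the example outcome
margin = 4.0 * turnover_margin + 6.0 * (ypp - 5.5) + rng.normal(scale=8, size=n)

# Design matrix with an intercept column; solve by ordinary least squares
X = np.column_stack([np.ones(n), turnover_margin, ypp])
coef, *_ = np.linalg.lstsq(X, margin, rcond=None)
intercept, b_to, b_ypp = coef

pred = X @ coef
r_squared = 1 - np.sum((margin - pred) ** 2) / np.sum((margin - margin.mean()) ** 2)

print(f"margin ≈ {intercept:.1f} + {b_to:.2f}·turnover_margin + {b_ypp:.2f}·ypp")
print(f"R-squared: {r_squared:.3f}")
```

Fitting jointly means each coefficient is estimated while holding the other predictor fixed, which averaging two separate fits does not do when the predictors are correlated.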
Phase 6: Conclusions
Summary of Findings
def generate_summary(games_df: pd.DataFrame) -> str:
    """
    Generate executive summary of findings.

    Parameters
    ----------
    games_df : pd.DataFrame
        Game data

    Returns
    -------
    str
        Summary text
    """
    from scipy.stats import linregress

    win_corr = calculate_win_correlations(games_df)

    # Figures quoted in the insights are computed from the data, not hard-coded
    to_slope = linregress(games_df["turnover_margin"], games_df["margin"]).slope
    to_battle_win_rate = games_df.loc[games_df["turnover_margin"] > 0, "win"].mean()

    lines = []
    lines.append("=" * 60)
    lines.append("EXECUTIVE SUMMARY: WHAT PREDICTS WINNING?")
    lines.append("=" * 60)
    lines.append("")

    lines.append("TOP PREDICTORS OF WINNING:")
    lines.append("-" * 40)
    # Drop the scoring stats first, then take the top five that remain
    scoring_stats = ["points_for", "points_against", "margin"]
    top_5 = win_corr[~win_corr["statistic"].isin(scoring_stats)].head(5)
    for _, row in top_5.iterrows():
        lines.append(f"  {row['statistic']}: r = {row['correlation']}")
    lines.append("")

    lines.append("KEY INSIGHTS:")
    lines.append("-" * 40)
    lines.append("1. TURNOVER MARGIN is the strongest non-scoring predictor")
    lines.append(f"   - Each turnover of differential is worth ~{to_slope:.1f} points")
    lines.append(f"   - Teams that won the turnover battle won {to_battle_win_rate:.0%} of games")
    lines.append("")
    lines.append("2. YARDS PER PLAY matters more than total yards")
    lines.append("   - Efficiency over volume")
    lines.append("   - YPP captures quality of plays")
    lines.append("")
    lines.append("3. TIME OF POSSESSION is a WEAK predictor")
    lines.append("   - Often a result of winning, not a cause")
    lines.append("   - Teams with leads run more, gaining TOP")
    lines.append("")
    lines.append("4. THIRD DOWN EFFICIENCY is moderately predictive")
    lines.append("   - Sustaining drives leads to points")
    lines.append("   - But efficiency stats overlap with yards")
    lines.append("")

    lines.append("RECOMMENDATIONS:")
    lines.append("-" * 40)
    lines.append("• Prioritize ball security over anything else")
    lines.append("• Focus on yards per play, not total yards")
    lines.append("• Don't obsess over time of possession")
    lines.append("• Third down execution matters, but is downstream")

    return "\n".join(lines)

summary = generate_summary(games)
print(summary)
Discussion Questions
- Why might time of possession be weakly correlated with winning despite the common belief that "controlling the clock" wins games?
- How would you expect these correlations to differ between college and NFL football?
- What lurking variables might explain some of these correlations?
- If you could only track two statistics to predict game outcomes, which would you choose and why?
- How would sample size affect the reliability of these correlation estimates?
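
As a starting point for the sample-size question, the Fisher z-transformation gives an approximate confidence interval for a Pearson correlation. This is a minimal sketch: the helper name `correlation_ci` and the example value r = 0.30 are illustrative assumptions, and the approximation assumes roughly bivariate-normal data.

```python
import numpy as np
from typing import Tuple


def correlation_ci(r: float, n: int, z_crit: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% CI for a Pearson r via the Fisher z-transformation."""
    z = np.arctanh(r)              # transform r to the z scale
    se = 1.0 / np.sqrt(n - 3)      # standard error on the z scale
    return float(np.tanh(z - z_crit * se)), float(np.tanh(z + z_crit * se))


for n in (30, 100, 1000):
    low, high = correlation_ci(0.30, n)
    print(f"n = {n:>4}: r = 0.30, 95% CI ≈ [{low:.2f}, {high:.2f}]")
```

At n = 100, the size of this case study's data set, an observed r of 0.30 is compatible with anything from a weak to a fairly strong relationship.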
Your Turn: Extensions
Option A: Conference Comparison
Split the data by simulated conferences. Do the same statistics predict winning across different styles of play?
Option B: Situational Analysis
Create subsets for:
- Home vs away games
- Games against good vs bad teams
- Close games vs blowouts
Do correlations change?
Option C: Causal Analysis
Attempt to distinguish:
- What causes winning (turnovers, efficiency)
- What results from winning (time of possession, rushing yards in 4th quarter)
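
One possible first step for this option is a partial correlation: correlate two statistics after removing what each shares with a common driver. The sketch below uses hypothetical data in which a latent `quality` variable drives both time of possession and margin; all coefficients are assumptions for illustration. In real data team quality is not observed directly, so a proxy (such as a preseason rating) would be needed.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Hypothetical latent team quality that drives both statistics
quality = rng.normal(size=n)
top_minutes = 30 + 2.0 * quality + rng.normal(scale=3.0, size=n)
margin = 8.0 * quality + rng.normal(scale=10.0, size=n)


def residuals(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Residuals of y after a simple linear fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (intercept + slope * x)


raw_r = np.corrcoef(top_minutes, margin)[0, 1]
partial_r = np.corrcoef(
    residuals(top_minutes, quality), residuals(margin, quality)
)[0, 1]

print(f"raw correlation:               {raw_r:.3f}")
print(f"partial (controlling quality): {partial_r:.3f}")
```

Here the raw correlation is clearly positive, yet it all but disappears once quality is controlled for, because in this toy setup time of possession has no effect of its own on margin.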
Key Takeaways
- Turnovers dominate: Ball security is the most controllable predictor of winning
- Efficiency over volume: Yards per play outperforms total yards
- Beware conventional wisdom: Time of possession is largely a result, not a cause
- Correlation isn't causation: Many correlated stats share underlying causes
- Margin is more informative than win/loss: Binary outcomes hide valuable information