> "The goal is to turn data into information, and information into insight." — Carly Fiorina
In This Chapter
- Learning Objectives
- 4.1 The EDA Mindset
- 4.2 Initial Data Inspection
- 4.3 Univariate Analysis
- 4.4 Bivariate Analysis
- 4.5 Multivariate Analysis
- 4.6 Visualization Best Practices
- 4.7 Common EDA Patterns in Football
- 4.8 Building an EDA Workflow
- 4.9 Communicating Insights
- 4.10 Chapter Summary
- Chapter Summary Checklist
- Preview: Chapter 5
Chapter 4: Exploratory Data Analysis for Football
"The goal is to turn data into information, and information into insight." — Carly Fiorina
Learning Objectives
By the end of this chapter, you will be able to:
- Apply the EDA mindset to football data problems
- Create effective visualizations for play-by-play data
- Identify patterns and anomalies in player and team performance
- Build a systematic EDA workflow for football analysis
- Communicate findings through visual storytelling
4.1 The EDA Mindset
What Is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is the art and science of examining data before modeling or hypothesis testing. Coined by statistician John Tukey in the 1970s, EDA emphasizes visual methods and open-ended questioning over formal statistical procedures.
For football analytics, EDA serves multiple purposes:
Understanding the Data Structure
- What columns exist in play-by-play data?
- How are plays categorized?
- What's missing or inconsistent?
Generating Hypotheses
- Why do some teams perform better in the red zone?
- What distinguishes elite receivers from average ones?
- How does play-calling change when trailing?
Validating Assumptions
- Is EPA normally distributed?
- Do home teams really have an advantage?
- Are there seasonal trends in passing offense?
Communicating Findings
- What story does the data tell?
- How can we make insights accessible to coaches and executives?
The EDA Process
A systematic EDA process follows this general pattern:
1. Load and Inspect → 2. Clean and Prepare → 3. Univariate Analysis →
4. Bivariate Analysis → 5. Multivariate Analysis → 6. Document Insights
Let's examine each stage.
4.2 Initial Data Inspection
First Look at the Data
When encountering a new dataset, start with these fundamental questions:
import pandas as pd
import numpy as np
import nfl_data_py as nfl
# Load data
pbp = nfl.import_pbp_data([2023])
# Basic dimensions
print(f"Shape: {pbp.shape}")
print(f"Rows: {pbp.shape[0]:,}")
print(f"Columns: {pbp.shape[1]}")
# Column overview
print("\nColumn names:")
print(pbp.columns.tolist())
# Data types
print("\nData types:")
print(pbp.dtypes.value_counts())
Understanding Column Categories
NFL play-by-play data contains hundreds of columns. Organizing them into categories helps navigation:
def categorize_columns(df: pd.DataFrame) -> dict:
"""Categorize columns by their apparent purpose."""
categories = {
'game_info': [],
'play_info': [],
'player_ids': [],
'player_names': [],
'metrics': [],
'flags': [],
'other': []
}
for col in df.columns:
col_lower = col.lower()
if 'game' in col_lower or 'week' in col_lower or 'season' in col_lower:
categories['game_info'].append(col)
elif col_lower.endswith('_id'):
categories['player_ids'].append(col)
elif col_lower.endswith('_name'):
categories['player_names'].append(col)
elif col_lower in ['epa', 'wpa', 'cpoe', 'air_yards', 'yac']:
categories['metrics'].append(col)
elif df[col].dtype == 'float64' and df[col].between(0, 1).mean() > 0.9:
categories['flags'].append(col)
else:
categories['other'].append(col)
return categories
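A quick sanity check of the result (a minimal usage sketch) is to count how many columns land in each bucket:
# Usage sketch: how many columns fall into each category?
column_categories = categorize_columns(pbp)
print({category: len(cols) for category, cols in column_categories.items()})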
Handling Missing Data
Missing data is pervasive in football analytics. Understanding why data is missing matters:
def analyze_missing_data(df: pd.DataFrame, threshold: float = 0.01) -> pd.DataFrame:
"""Analyze missing data patterns."""
missing = df.isnull().sum()
missing_pct = missing / len(df)
missing_analysis = pd.DataFrame({
'missing_count': missing,
'missing_pct': missing_pct,
'dtype': df.dtypes
})
# Filter to columns with missing data above threshold
missing_analysis = missing_analysis[missing_analysis['missing_pct'] > threshold]
missing_analysis = missing_analysis.sort_values('missing_pct', ascending=False)
return missing_analysis
Common Missing Data Patterns in NFL Data:
| Column | When Missing | Reason |
|---|---|---|
| passer_player_name | Non-pass plays | Only populated on passes |
| air_yards | Runs, passes without a targeted receiver | Recorded only on targeted pass attempts |
| epa | Certain special teams plays | Not calculated for all plays |
| cpoe | Spikes, throwaways | No completion expectation |
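These patterns are easy to verify directly. A minimal sketch using the analyze_missing_data helper above, then checking air_yards missingness conditional on play type (exact rates vary by season):
# Overall missingness report
missing_report = analyze_missing_data(pbp, threshold=0.05)
print(missing_report.head(10))
# Conditional missingness: air_yards should be absent on runs
print(pbp.groupby('play_type')['air_yards'].apply(lambda s: s.isnull().mean()).round(3))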
4.3 Univariate Analysis
Understanding Distributions
Before analyzing relationships, understand individual variables:
import matplotlib.pyplot as plt
import seaborn as sns
def analyze_distribution(series: pd.Series, name: str):
"""Comprehensive univariate analysis."""
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
# Histogram
axes[0].hist(series.dropna(), bins=50, edgecolor='black', alpha=0.7)
axes[0].axvline(series.mean(), color='red', linestyle='--', label=f'Mean: {series.mean():.3f}')
axes[0].axvline(series.median(), color='green', linestyle='--', label=f'Median: {series.median():.3f}')
axes[0].set_title(f'{name} Distribution')
axes[0].legend()
# Box plot
axes[1].boxplot(series.dropna())
axes[1].set_title(f'{name} Box Plot')
# Summary statistics text
stats_text = f"""
Count: {series.count():,}
Mean: {series.mean():.4f}
Std: {series.std():.4f}
Min: {series.min():.4f}
25%: {series.quantile(0.25):.4f}
50%: {series.median():.4f}
75%: {series.quantile(0.75):.4f}
Max: {series.max():.4f}
Skew: {series.skew():.4f}
"""
axes[2].text(0.1, 0.5, stats_text, fontsize=10, family='monospace',
verticalalignment='center')
axes[2].axis('off')
axes[2].set_title('Summary Statistics')
plt.tight_layout()
return fig
EPA Distribution Analysis
Expected Points Added (EPA) is the cornerstone metric in modern football analytics. Understanding its distribution is essential:
# Filter to relevant plays
plays = pbp.query("play_type.isin(['pass', 'run'])")
# EPA distribution
fig = analyze_distribution(plays['epa'], 'EPA')
plt.savefig('epa_distribution.png', dpi=150, bbox_inches='tight')
Key EPA Distribution Properties:
- Centered near zero: By construction, EPA averages to approximately zero across all plays
- Heavy tails: Extreme positive (big plays) and negative (turnovers) outcomes
- Slight right skew: More room for positive outcomes than negative
- Pass vs Run difference: Passes have higher variance than runs
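These claims are easy to check numerically before plotting. A quick sketch using the plays frame defined above:
# Numeric check of the distribution properties listed above
for pt in ['pass', 'run']:
    epa = plays.query("play_type == @pt")['epa'].dropna()
    print(f"{pt}: mean={epa.mean():.3f}, std={epa.std():.3f}, "
          f"skew={epa.skew():.3f}, kurtosis={epa.kurtosis():.3f}")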
# Compare pass vs run distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, play_type in zip(axes, ['pass', 'run']):
subset = plays.query(f"play_type == '{play_type}'")['epa']
ax.hist(subset.dropna(), bins=50, edgecolor='black', alpha=0.7)
ax.axvline(subset.mean(), color='red', linestyle='--')
ax.set_title(f'{play_type.title()} EPA Distribution\nMean: {subset.mean():.3f}')
ax.set_xlabel('EPA')
plt.tight_layout()
Categorical Variable Analysis
For categorical variables, frequency tables and bar charts reveal patterns:
def analyze_categorical(df: pd.DataFrame, col: str, top_n: int = 20):
"""Analyze categorical variable."""
counts = df[col].value_counts().head(top_n)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Bar chart
counts.plot(kind='barh', ax=axes[0])
axes[0].set_title(f'{col} Frequency')
axes[0].invert_yaxis()
# Percentage table
pcts = (counts / counts.sum() * 100).round(1)
table_data = pd.DataFrame({
'Count': counts.values,
'Percent': pcts.values
}, index=counts.index)
axes[1].axis('off')
table = axes[1].table(
cellText=table_data.values,
rowLabels=table_data.index,
colLabels=table_data.columns,
cellLoc='right',
loc='center'
)
table.auto_set_font_size(False)
table.set_fontsize(9)
plt.tight_layout()
return fig
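As a usage example, play_type is a natural first categorical variable to profile:
fig = analyze_categorical(pbp, 'play_type', top_n=10)
plt.savefig('play_type_frequency.png', dpi=150, bbox_inches='tight')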
4.4 Bivariate Analysis
Relationships Between Variables
Bivariate analysis examines how pairs of variables relate:
def analyze_bivariate(df: pd.DataFrame, x_col: str, y_col: str,
hue_col: str = None):
"""Bivariate analysis with optional grouping."""
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Scatter plot
if hue_col:
for group in df[hue_col].unique():
subset = df[df[hue_col] == group]
axes[0, 0].scatter(subset[x_col], subset[y_col],
alpha=0.5, label=group, s=10)
axes[0, 0].legend()
else:
axes[0, 0].scatter(df[x_col], df[y_col], alpha=0.3, s=10)
axes[0, 0].set_xlabel(x_col)
axes[0, 0].set_ylabel(y_col)
axes[0, 0].set_title(f'{x_col} vs {y_col}')
# Correlation
corr = df[[x_col, y_col]].dropna().corr().iloc[0, 1]
axes[0, 0].text(0.05, 0.95, f'r = {corr:.3f}',
transform=axes[0, 0].transAxes)
# Hexbin for density (drop missing values first so the plot extent is finite)
df_clean = df[[x_col, y_col]].dropna()
axes[0, 1].hexbin(df_clean[x_col], df_clean[y_col], gridsize=30, cmap='YlOrRd')
axes[0, 1].set_xlabel(x_col)
axes[0, 1].set_ylabel(y_col)
axes[0, 1].set_title('Density Plot')
# Binned means
df_clean['bin'] = pd.qcut(df_clean[x_col], 10, duplicates='drop')
binned = df_clean.groupby('bin')[y_col].agg(['mean', 'std', 'count'])
binned['x_mid'] = [interval.mid for interval in binned.index]
axes[1, 0].errorbar(binned['x_mid'], binned['mean'],
yerr=binned['std']/np.sqrt(binned['count']),
marker='o', capsize=3)
axes[1, 0].set_xlabel(x_col)
axes[1, 0].set_ylabel(f'Mean {y_col}')
axes[1, 0].set_title('Binned Means with Standard Errors')
# Residual plot (linear fit)
from scipy import stats
slope, intercept, r, p, se = stats.linregress(
df_clean[x_col], df_clean[y_col]
)
predicted = intercept + slope * df_clean[x_col]
residuals = df_clean[y_col] - predicted
axes[1, 1].scatter(predicted, residuals, alpha=0.3, s=10)
axes[1, 1].axhline(0, color='red', linestyle='--')
axes[1, 1].set_xlabel('Predicted')
axes[1, 1].set_ylabel('Residuals')
axes[1, 1].set_title('Residual Plot')
plt.tight_layout()
return fig
Key Bivariate Relationships in Football
Air Yards vs EPA
Air yards (how far the ball travels in the air) relates to EPA, but not linearly:
# Note: `pass` is a Python keyword, so it must be backtick-quoted inside query()
passes = pbp.query("`pass` == 1 and air_yards.notna() and epa.notna()")
fig = analyze_bivariate(passes, 'air_yards', 'epa')
plt.savefig('airyards_epa.png', dpi=150, bbox_inches='tight')
Key insight: Short passes have lower variance but also lower upside. Deep passes are high-risk, high-reward.
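To make that risk/reward tradeoff concrete, bucket passes by depth and compare both the mean and the spread of EPA. A sketch with illustrative depth cutoffs (the bin edges are arbitrary choices, not a standard):
# Bucket pass depth and compare EPA mean vs spread per bucket
passes = passes.copy()
passes['depth'] = pd.cut(
    passes['air_yards'],
    bins=[-10, 0, 10, 20, 60],
    labels=['Behind LOS', 'Short (1-10)', 'Medium (11-20)', 'Deep (21+)']
)
print(passes.groupby('depth', observed=True)['epa'].agg(['mean', 'std', 'count']).round(3))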
Yards to Go vs Play Selection
How does distance to first down affect play calling?
plays = pbp.query("down <= 3 and play_type.isin(['pass', 'run'])")
pass_rate = (
plays
.groupby('ydstogo')['pass']
.mean()
.reset_index()
.query("ydstogo <= 15")
)
plt.figure(figsize=(10, 6))
plt.plot(pass_rate['ydstogo'], pass_rate['pass'], marker='o')
plt.xlabel('Yards to Go')
plt.ylabel('Pass Rate')
plt.title('Pass Rate by Yards to Go')
plt.axhline(0.5, color='gray', linestyle='--', alpha=0.5)
plt.grid(True, alpha=0.3)
Score Differential vs Play Calling
Game situation dramatically affects strategy:
plays = pbp.query("play_type.isin(['pass', 'run'])").copy()
plays['score_diff_bucket'] = pd.cut(
    plays['score_differential'],
    bins=[-50, -14, -7, 0, 7, 14, 50],
    labels=['Down 14+', 'Down 7-13', 'Down 1-6/Tied', 'Up 1-7', 'Up 8-14', 'Up 15+']
)
pass_by_score = (
plays
.groupby('score_diff_bucket')
.agg(
pass_rate=('pass', 'mean'),
epa_per_play=('epa', 'mean'),
plays=('play_id', 'count')
)
.reset_index()
)
print(pass_by_score)
4.5 Multivariate Analysis
Beyond Pairs
Real football analysis requires examining multiple variables simultaneously:
def create_team_profile(pbp: pd.DataFrame, team: str) -> pd.DataFrame:
"""Create comprehensive team profile."""
team_plays = pbp.query(f"posteam == '{team}' and play_type.isin(['pass', 'run'])")
profile = {
'total_plays': len(team_plays),
'pass_rate': team_plays['pass'].mean(),
'epa_per_play': team_plays['epa'].mean(),
'success_rate': team_plays['success'].mean(),
'explosive_play_rate': (
((team_plays['pass'] == 1) & (team_plays['yards_gained'] >= 20)) |
((team_plays['rush'] == 1) & (team_plays['yards_gained'] >= 10))
).mean(),
'turnover_rate': (
team_plays['interception'].sum() + team_plays['fumble_lost'].sum()
) / len(team_plays),
'third_down_success': team_plays.query("down == 3")['success'].mean(),
'red_zone_td_rate': team_plays.query("yardline_100 <= 20")['touchdown'].mean()
}
return pd.Series(profile, name=team)
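Applying this across the league yields a team-by-metric table for comparison (a usage sketch; posteam is null on some administrative rows, hence the dropna):
teams = pbp['posteam'].dropna().unique()
team_profiles = pd.DataFrame([create_team_profile(pbp, team) for team in sorted(teams)])
print(team_profiles.sort_values('epa_per_play', ascending=False).head())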
Correlation Matrices
Examining correlations across multiple metrics:
def create_correlation_heatmap(df: pd.DataFrame, columns: list):
"""Create correlation heatmap for specified columns."""
corr_matrix = df[columns].corr()
fig, ax = plt.subplots(figsize=(10, 8))
# Create heatmap
im = ax.imshow(corr_matrix, cmap='RdBu_r', vmin=-1, vmax=1)
# Add colorbar
cbar = ax.figure.colorbar(im, ax=ax)
# Add labels
ax.set_xticks(range(len(columns)))
ax.set_yticks(range(len(columns)))
ax.set_xticklabels(columns, rotation=45, ha='right')
ax.set_yticklabels(columns)
# Add correlation values
for i in range(len(columns)):
for j in range(len(columns)):
text = ax.text(j, i, f'{corr_matrix.iloc[i, j]:.2f}',
ha='center', va='center', fontsize=8)
ax.set_title('Correlation Matrix')
plt.tight_layout()
return fig
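As a usage example on play-level data, a reasonable starting set of metrics might be the following (any numeric play-by-play columns work):
metric_cols = ['epa', 'wpa', 'air_yards', 'yards_gained', 'cpoe']
fig = create_correlation_heatmap(pbp.query("`pass` == 1"), metric_cols)
plt.savefig('metric_correlations.png', dpi=150, bbox_inches='tight')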
Principal Component Analysis for Team Profiles
When many metrics exist, dimensionality reduction reveals underlying structure:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
def pca_team_analysis(team_stats: pd.DataFrame, n_components: int = 2):
"""PCA analysis of team statistics."""
# Select numeric columns
numeric_cols = team_stats.select_dtypes(include=[np.number]).columns
# Standardize
scaler = StandardScaler()
scaled_data = scaler.fit_transform(team_stats[numeric_cols])
# PCA
pca = PCA(n_components=n_components)
components = pca.fit_transform(scaled_data)
# Create results DataFrame
results = pd.DataFrame(
components,
columns=[f'PC{i+1}' for i in range(n_components)],
index=team_stats.index
)
# Explained variance
print("Explained variance ratio:", pca.explained_variance_ratio_)
# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Scatter of teams in PC space
axes[0].scatter(results['PC1'], results['PC2'])
for idx, team in enumerate(results.index):
axes[0].annotate(team, (results.iloc[idx]['PC1'], results.iloc[idx]['PC2']))
axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
axes[0].set_title('Teams in Principal Component Space')
# Loadings
loadings = pd.DataFrame(
pca.components_.T,
columns=[f'PC{i+1}' for i in range(n_components)],
index=numeric_cols
)
loadings['PC1'].sort_values().plot(kind='barh', ax=axes[1])
axes[1].set_title('PC1 Loadings')
plt.tight_layout()
return results, loadings, fig
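With the team_profiles table built in Section 4.5 (or any teams-by-metrics frame), usage looks like this; PCA cannot handle missing values, so drop or impute them first:
# PCA on the team profiles; dropna(axis=1) removes any metric with missing values
results, loadings, fig = pca_team_analysis(team_profiles.dropna(axis=1), n_components=2)
print(loadings.round(2))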
4.6 Visualization Best Practices
Principles of Effective Football Visualization
1. Know Your Audience
Different audiences need different visualizations:
| Audience | Preferred Visuals | Key Considerations |
|---|---|---|
| Coaches | Play diagrams, simple charts | Quick interpretation, actionable |
| Front Office | Trend lines, comparisons | Context, historical data |
| Media | Infographics, rankings | Engaging, shareable |
| Analysts | Detailed charts, statistical output | Precision, methodology |
2. Choose the Right Chart Type
chart_recommendations = {
'distribution': ['histogram', 'density plot', 'box plot'],
'comparison': ['bar chart', 'dot plot', 'grouped bar'],
'relationship': ['scatter plot', 'line chart', 'hexbin'],
'composition': ['stacked bar', 'pie chart', 'treemap'],
'time_series': ['line chart', 'area chart'],
'ranking': ['horizontal bar', 'lollipop chart']
}
3. Color Usage
# NFL team colors (sample)
team_colors = {
'KC': {'primary': '#E31837', 'secondary': '#FFB612'},
'SF': {'primary': '#AA0000', 'secondary': '#B3995D'},
'BUF': {'primary': '#00338D', 'secondary': '#C60C30'},
'PHI': {'primary': '#004C54', 'secondary': '#A5ACAF'}
}
def get_team_color(team: str, color_type: str = 'primary') -> str:
"""Get team color for visualization."""
return team_colors.get(team, {}).get(color_type, '#333333')
Creating Publication-Quality Figures
def set_plot_style():
"""Set consistent plot style for the textbook."""
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({
'figure.figsize': (10, 6),
'figure.dpi': 150,
'axes.titlesize': 14,
'axes.labelsize': 12,
'xtick.labelsize': 10,
'ytick.labelsize': 10,
'legend.fontsize': 10,
'font.family': 'sans-serif',
'axes.spines.top': False,
'axes.spines.right': False
})
set_plot_style()
Interactive Visualizations
For exploratory work, interactive visualizations accelerate discovery:
import plotly.express as px
import plotly.graph_objects as go
def interactive_team_comparison(team_stats: pd.DataFrame):
"""Create interactive team comparison."""
fig = px.scatter(
team_stats,
x='epa_per_play',
y='success_rate',
size='total_plays',
color='pass_rate',
hover_name=team_stats.index,
title='Team Offensive Profile',
labels={
'epa_per_play': 'EPA per Play',
'success_rate': 'Success Rate',
'pass_rate': 'Pass Rate'
}
)
fig.update_layout(
width=800,
height=600,
coloraxis_colorbar_title='Pass Rate'
)
return fig
4.7 Common EDA Patterns in Football
Pattern 1: Year-Over-Year Comparisons
def year_over_year_analysis(seasons: list, metric_func):
"""Compare metrics across seasons."""
results = []
for season in seasons:
pbp = nfl.import_pbp_data([season])
metric = metric_func(pbp)
metric['season'] = season
results.append(metric)
return pd.concat(results, ignore_index=True)
# Example: League-wide passing EPA by season
def get_passing_epa(pbp):
return (
pbp
.query("pass == 1")
.groupby('posteam')
.agg(epa_per_pass=('epa', 'mean'))
.reset_index()
)
multi_year = year_over_year_analysis([2019, 2020, 2021, 2022, 2023], get_passing_epa)
Pattern 2: Split Analysis
Comparing performance across different conditions:
def split_analysis(
pbp: pd.DataFrame,
split_col: str,
player_col: str,
min_plays: int = 50
) -> pd.DataFrame:
"""Analyze player performance across splits."""
splits = (
pbp
.query(f"{player_col}.notna()")
.groupby([player_col, split_col])
.agg(
plays=('play_id', 'count'),
epa_per_play=('epa', 'mean'),
success_rate=('success', 'mean')
)
.reset_index()
.query(f"plays >= {min_plays}")
)
# Pivot for comparison
pivoted = splits.pivot(
index=player_col,
columns=split_col,
values='epa_per_play'
)
return pivoted
# Example: Home vs Away
pbp['location'] = np.where(pbp['posteam'] == pbp['home_team'], 'home', 'away')
home_away_splits = split_analysis(
pbp.query("pass == 1"),
'location',
'passer_player_name'
)
Pattern 3: Rolling Averages
Tracking performance over time:
def calculate_rolling_stats(
pbp: pd.DataFrame,
player_col: str,
player_name: str,
window: int = 100
) -> pd.DataFrame:
"""Calculate rolling statistics for a player."""
player_plays = (
pbp
.query(f"{player_col} == @player_name")
.sort_values(['season', 'week', 'play_id'])
.copy()
)
player_plays['rolling_epa'] = (
player_plays['epa']
.rolling(window, min_periods=20)
.mean()
)
player_plays['rolling_success'] = (
player_plays['success']
.rolling(window, min_periods=20)
.mean()
)
# Add play number for x-axis
player_plays['play_num'] = range(len(player_plays))
return player_plays
# Example usage
mahomes_rolling = calculate_rolling_stats(
pbp.query("pass == 1"),
'passer_player_name',
'P.Mahomes',
window=100
)
Pattern 4: Anomaly Detection
Finding outliers and unusual performances:
def detect_anomalies(
df: pd.DataFrame,
metric_col: str,
group_col: str,
z_threshold: float = 2.0
) -> pd.DataFrame:
"""Detect anomalous performances."""
# Calculate z-scores within groups
df = df.copy()
df['z_score'] = df.groupby(group_col)[metric_col].transform(
lambda x: (x - x.mean()) / x.std()
)
# Flag anomalies
df['is_anomaly'] = df['z_score'].abs() > z_threshold
df['anomaly_type'] = np.where(
df['z_score'] > z_threshold, 'high',
np.where(df['z_score'] < -z_threshold, 'low', 'normal')
)
return df[df['is_anomaly']]
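A natural application is flagging unusual single-game performances. A sketch at the passer-game level (the 15-dropback floor is an arbitrary stability filter):
# Aggregate to one row per passer per game, then flag outlier games
game_qb = (
    pbp.query("`pass` == 1 and passer_player_name.notna()")
    .groupby(['passer_player_name', 'game_id'])
    .agg(epa_per_play=('epa', 'mean'), plays=('play_id', 'count'))
    .reset_index()
    .query("plays >= 15")
)
anomalies = detect_anomalies(game_qb, 'epa_per_play', 'passer_player_name')
print(anomalies.sort_values('z_score').head())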
4.8 Building an EDA Workflow
Structured Approach
A repeatable EDA workflow ensures thorough analysis:
class FootballEDA:
"""
Structured EDA workflow for football data.
Example
-------
>>> eda = FootballEDA([2023])
>>> eda.run_full_analysis()
>>> eda.generate_report('output/')
"""
def __init__(self, seasons: list):
self.seasons = seasons
self.pbp = None
self.findings = []
def load_data(self):
"""Stage 1: Load and cache data."""
print("Loading data...")
self.pbp = nfl.import_pbp_data(self.seasons)
print(f"Loaded {len(self.pbp):,} plays")
return self
def inspect_data(self):
"""Stage 2: Initial inspection."""
print("\n" + "="*60)
print("DATA INSPECTION")
print("="*60)
print(f"\nShape: {self.pbp.shape}")
print(f"\nPlay type distribution:")
print(self.pbp['play_type'].value_counts())
print(f"\nMissing data (top 10):")
missing = self.pbp.isnull().sum().sort_values(ascending=False).head(10)
print(missing)
return self
def univariate_analysis(self):
"""Stage 3: Single variable analysis."""
print("\n" + "="*60)
print("UNIVARIATE ANALYSIS")
print("="*60)
# EPA distribution
plays = self.pbp.query("play_type.isin(['pass', 'run'])")
print(f"\nEPA Statistics:")
print(plays['epa'].describe())
# Pass vs Run
print(f"\nPass EPA: {plays.query('pass == 1')['epa'].mean():.4f}")
print(f"Run EPA: {plays.query('rush == 1')['epa'].mean():.4f}")
self.findings.append({
'stage': 'univariate',
'finding': f"Pass EPA ({plays.query('pass == 1')['epa'].mean():.3f}) vs Run EPA ({plays.query('rush == 1')['epa'].mean():.3f})"
})
return self
def bivariate_analysis(self):
"""Stage 4: Two variable relationships."""
print("\n" + "="*60)
print("BIVARIATE ANALYSIS")
print("="*60)
plays = self.pbp.query("play_type.isin(['pass', 'run'])")
# Down vs EPA
down_epa = plays.groupby('down')['epa'].mean()
print(f"\nEPA by Down:")
print(down_epa)
# Correlation between yards to go and EPA
corr = plays[['ydstogo', 'epa']].dropna().corr().iloc[0, 1]
print(f"\nCorrelation (ydstogo, epa): {corr:.4f}")
return self
def team_analysis(self):
"""Stage 5: Team-level analysis."""
print("\n" + "="*60)
print("TEAM ANALYSIS")
print("="*60)
plays = self.pbp.query("play_type.isin(['pass', 'run'])")
team_stats = (
plays
.groupby('posteam')
.agg(
plays=('play_id', 'count'),
epa_per_play=('epa', 'mean'),
pass_rate=('pass', 'mean'),
success_rate=('success', 'mean')
)
.reset_index()
.sort_values('epa_per_play', ascending=False)
)
print(f"\nTop 5 Teams by EPA/Play:")
print(team_stats.head())
print(f"\nBottom 5 Teams by EPA/Play:")
print(team_stats.tail())
return self
def run_full_analysis(self):
"""Run complete EDA pipeline."""
return (
self
.load_data()
.inspect_data()
.univariate_analysis()
.bivariate_analysis()
.team_analysis()
)
def generate_report(self, output_dir: str):
"""Generate EDA report."""
from pathlib import Path
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
# Summary report
with open(output_path / 'eda_summary.txt', 'w') as f:
f.write("EDA Summary Report\n")
f.write("="*60 + "\n\n")
f.write(f"Seasons analyzed: {self.seasons}\n")
f.write(f"Total plays: {len(self.pbp):,}\n\n")
f.write("Key Findings:\n")
for i, finding in enumerate(self.findings, 1):
f.write(f"{i}. [{finding['stage']}] {finding['finding']}\n")
print(f"\nReport saved to {output_path}")
4.9 Communicating Insights
The Analyst's Narrative
Data exploration is only valuable if insights are communicated effectively:
Structure for Presenting Findings:
- Lead with the insight: State the key finding clearly
- Show the evidence: Present supporting visualization
- Provide context: Compare to benchmarks or expectations
- Acknowledge limitations: Note sample size, data quality issues
- Suggest implications: What should be done with this information?
Example Write-Up:
Finding: Patrick Mahomes maintains elite performance under pressure
Analysis of 2023 passing data reveals that Mahomes' EPA per play remains above league average even when facing pressure (plays where the quarterback is hurried, hit, or sacked).
While most quarterbacks see a 40-60% decline in EPA under pressure, Mahomes' decline is only 25%, ranking him 2nd among qualified passers.
This suggests the Chiefs' passing game maintains value even when the offensive line struggles, making aggressive blitz strategies less effective against Kansas City.
Visualization Checklist
Before sharing any visualization, run through this checklist (a small helper for the source-note item follows the list):
- [ ] Title clearly states the insight
- [ ] Axes are labeled with units
- [ ] Legend is included if using colors/shapes
- [ ] Data source and date range noted
- [ ] Font sizes are readable
- [ ] Colors are colorblind-friendly
- [ ] Chart type matches the data and message
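Several checklist items can be automated. One example for the data-source note, a hypothetical helper rather than anything from matplotlib itself:
def add_source_note(fig, source: str = 'Data: nflfastR via nfl_data_py'):
    """Stamp a data-source note in the corner of a figure (hypothetical helper)."""
    # Figure-fraction coordinates, so this works regardless of axes layout
    fig.text(0.01, 0.01, source, fontsize=8, color='gray', ha='left', va='bottom')
    return fig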
4.10 Chapter Summary
Exploratory Data Analysis is the foundation of football analytics. Before building models or making recommendations, analysts must understand their data deeply. Key principles:
- Start with questions: Let curiosity guide exploration
- Visualize early and often: Patterns emerge through visual inspection
- Check assumptions: Data rarely behaves as expected
- Document findings: Insights are useless if not recorded
- Iterate: EDA is a cycle, not a linear process
The EDA skills developed in this chapter form the basis for all subsequent analysis. Whether evaluating players, building predictive models, or advising on game strategy, thorough data exploration ensures reliable conclusions.
Chapter Summary Checklist
After completing this chapter, you should be able to:
- [ ] Conduct systematic data inspection on NFL play-by-play data
- [ ] Create and interpret univariate distributions (histograms, box plots)
- [ ] Analyze bivariate relationships (scatter plots, correlation)
- [ ] Build multivariate team profiles
- [ ] Apply visualization best practices
- [ ] Structure an EDA workflow that can be repeated
- [ ] Communicate findings effectively to different audiences
Preview: Chapter 5
In Chapter 5, we'll build the statistical foundation needed for football analytics, covering probability, inference, and the core concepts that underpin metrics like EPA and win probability.