Chapter 4: Exploratory Data Analysis for Football

"The goal is to turn data into information, and information into insight." — Carly Fiorina

Learning Objectives

By the end of this chapter, you will be able to:

  1. Apply the EDA mindset to football data problems
  2. Create effective visualizations for play-by-play data
  3. Identify patterns and anomalies in player and team performance
  4. Build a systematic EDA workflow for football analysis
  5. Communicate findings through visual storytelling

4.1 The EDA Mindset

What Is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the art and science of examining data before modeling or hypothesis testing. Coined by statistician John Tukey in the 1970s, EDA emphasizes visual methods and open-ended questioning over formal statistical procedures.

For football analytics, EDA serves multiple purposes:

Understanding the Data Structure

  • What columns exist in play-by-play data?
  • How are plays categorized?
  • What's missing or inconsistent?

Generating Hypotheses

  • Why do some teams perform better in the red zone?
  • What distinguishes elite receivers from average ones?
  • How does play-calling change when trailing?

Validating Assumptions

  • Is EPA normally distributed?
  • Do home teams really have an advantage?
  • Are there seasonal trends in passing offense?

Communicating Findings

  • What story does the data tell?
  • How can we make insights accessible to coaches and executives?

The EDA Process

A systematic EDA process follows this general pattern:

1. Load and Inspect → 2. Clean and Prepare → 3. Univariate Analysis →
4. Bivariate Analysis → 5. Multivariate Analysis → 6. Document Insights

Let's examine each stage.


4.2 Initial Data Inspection

First Look at the Data

When encountering a new dataset, start with these fundamental questions:

import pandas as pd
import numpy as np
import nfl_data_py as nfl

# Load data
pbp = nfl.import_pbp_data([2023])

# Basic dimensions
print(f"Shape: {pbp.shape}")
print(f"Rows: {pbp.shape[0]:,}")
print(f"Columns: {pbp.shape[1]}")

# Column overview
print("\nColumn names:")
print(pbp.columns.tolist())

# Data types
print("\nData types:")
print(pbp.dtypes.value_counts())

Understanding Column Categories

NFL play-by-play data contains hundreds of columns. Organizing them into categories helps navigation:

def categorize_columns(df: pd.DataFrame) -> dict:
    """Categorize columns by their apparent purpose."""

    categories = {
        'game_info': [],
        'play_info': [],
        'player_ids': [],
        'player_names': [],
        'metrics': [],
        'flags': [],
        'other': []
    }

    for col in df.columns:
        col_lower = col.lower()

        if 'game' in col_lower or 'week' in col_lower or 'season' in col_lower:
            categories['game_info'].append(col)
        elif col_lower.endswith('_id'):
            categories['player_ids'].append(col)
        elif col_lower.endswith('_name'):
            categories['player_names'].append(col)
        elif col_lower in ['epa', 'wpa', 'cpoe', 'air_yards', 'yac']:
            categories['metrics'].append(col)
        elif df[col].dropna().isin([0, 1]).all() and df[col].notna().any():
            # Binary indicator columns (e.g., pass, rush, touchdown)
            categories['flags'].append(col)
        else:
            categories['other'].append(col)

    return categories
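
A quick pass with this helper shows how the columns break down (a usage sketch, reusing the pbp frame loaded above):

categories = categorize_columns(pbp)
for name, cols in categories.items():
    print(f"{name}: {len(cols)} columns")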

Handling Missing Data

Missing data is pervasive in football analytics. Understanding why data is missing matters:

def analyze_missing_data(df: pd.DataFrame, threshold: float = 0.01) -> pd.DataFrame:
    """Analyze missing data patterns."""

    missing = df.isnull().sum()
    missing_pct = missing / len(df)

    missing_analysis = pd.DataFrame({
        'missing_count': missing,
        'missing_pct': missing_pct,
        'dtype': df.dtypes
    })

    # Filter to columns with missing data above threshold
    missing_analysis = missing_analysis[missing_analysis['missing_pct'] > threshold]
    missing_analysis = missing_analysis.sort_values('missing_pct', ascending=False)

    return missing_analysis
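
Running the helper against the full frame surfaces the sparsest columns first (a usage sketch; the 5% threshold is an arbitrary choice):

missing_report = analyze_missing_data(pbp, threshold=0.05)
print(missing_report.head(15))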

Common Missing Data Patterns in NFL Data:

| Column | When Missing | Reason |
|---|---|---|
| passer_player_name | Non-pass plays | Only populated on pass plays |
| air_yards | Runs and untargeted dropbacks | Recorded only on pass attempts with a target |
| epa | Certain special teams plays | Not calculated for all play types |
| cpoe | Spikes, throwaways | No completion expectation modeled |

4.3 Univariate Analysis

Understanding Distributions

Before analyzing relationships, understand individual variables:

import matplotlib.pyplot as plt
import seaborn as sns

def analyze_distribution(series: pd.Series, name: str):
    """Comprehensive univariate analysis."""

    fig, axes = plt.subplots(1, 3, figsize=(14, 4))

    # Histogram
    axes[0].hist(series.dropna(), bins=50, edgecolor='black', alpha=0.7)
    axes[0].axvline(series.mean(), color='red', linestyle='--', label=f'Mean: {series.mean():.3f}')
    axes[0].axvline(series.median(), color='green', linestyle='--', label=f'Median: {series.median():.3f}')
    axes[0].set_title(f'{name} Distribution')
    axes[0].legend()

    # Box plot
    axes[1].boxplot(series.dropna())
    axes[1].set_title(f'{name} Box Plot')

    # Summary statistics text
    stats_text = f"""
    Count: {series.count():,}
    Mean: {series.mean():.4f}
    Std: {series.std():.4f}
    Min: {series.min():.4f}
    25%: {series.quantile(0.25):.4f}
    50%: {series.median():.4f}
    75%: {series.quantile(0.75):.4f}
    Max: {series.max():.4f}
    Skew: {series.skew():.4f}
    """
    axes[2].text(0.1, 0.5, stats_text, fontsize=10, family='monospace',
                 verticalalignment='center')
    axes[2].axis('off')
    axes[2].set_title('Summary Statistics')

    plt.tight_layout()
    return fig

EPA Distribution Analysis

Expected Points Added (EPA) is the cornerstone metric in modern football analytics. Understanding its distribution is essential:

# Filter to relevant plays
plays = pbp.query("play_type.isin(['pass', 'run'])")

# EPA distribution
fig = analyze_distribution(plays['epa'], 'EPA')
plt.savefig('epa_distribution.png', dpi=150, bbox_inches='tight')

Key EPA Distribution Properties:

  1. Centered near zero: By construction, EPA averages to approximately zero across all plays
  2. Heavy tails: Extreme positive (big plays) and negative (turnovers) outcomes
  3. Slight right skew: More room for positive outcomes than negative
  4. Pass vs Run difference: Passes have higher variance than runs

# Compare pass vs run distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for ax, play_type in zip(axes, ['pass', 'run']):
    subset = plays.query(f"play_type == '{play_type}'")['epa']
    ax.hist(subset.dropna(), bins=50, edgecolor='black', alpha=0.7)
    ax.axvline(subset.mean(), color='red', linestyle='--')
    ax.set_title(f'{play_type.title()} EPA Distribution\nMean: {subset.mean():.3f}')
    ax.set_xlabel('EPA')

plt.tight_layout()

Categorical Variable Analysis

For categorical variables, frequency tables and bar charts reveal patterns:

def analyze_categorical(df: pd.DataFrame, col: str, top_n: int = 20):
    """Analyze categorical variable."""

    counts = df[col].value_counts().head(top_n)

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Bar chart
    counts.plot(kind='barh', ax=axes[0])
    axes[0].set_title(f'{col} Frequency')
    axes[0].invert_yaxis()

    # Percentage table
    pcts = (counts / counts.sum() * 100).round(1)
    table_data = pd.DataFrame({
        'Count': counts.values,
        'Percent': pcts.values
    }, index=counts.index)

    axes[1].axis('off')
    table = axes[1].table(
        cellText=table_data.values,
        rowLabels=table_data.index,
        colLabels=table_data.columns,
        cellLoc='right',
        loc='center'
    )
    table.auto_set_font_size(False)
    table.set_fontsize(9)

    plt.tight_layout()
    return fig
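
For example, applying the helper to play types gives a quick sense of how a season's plays break down (a usage sketch; the output filename is illustrative):

fig = analyze_categorical(pbp, 'play_type')
plt.savefig('play_type_frequency.png', dpi=150, bbox_inches='tight')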

4.4 Bivariate Analysis

Relationships Between Variables

Bivariate analysis examines how pairs of variables relate:

def analyze_bivariate(df: pd.DataFrame, x_col: str, y_col: str,
                       hue_col: str = None):
    """Bivariate analysis with optional grouping."""

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # Scatter plot
    if hue_col:
        for group in df[hue_col].unique():
            subset = df[df[hue_col] == group]
            axes[0, 0].scatter(subset[x_col], subset[y_col],
                              alpha=0.5, label=group, s=10)
        axes[0, 0].legend()
    else:
        axes[0, 0].scatter(df[x_col], df[y_col], alpha=0.3, s=10)
    axes[0, 0].set_xlabel(x_col)
    axes[0, 0].set_ylabel(y_col)
    axes[0, 0].set_title(f'{x_col} vs {y_col}')

    # Correlation
    corr = df[[x_col, y_col]].dropna().corr().iloc[0, 1]
    axes[0, 0].text(0.05, 0.95, f'r = {corr:.3f}',
                    transform=axes[0, 0].transAxes)

    # Hexbin for density
    axes[0, 1].hexbin(df[x_col], df[y_col], gridsize=30, cmap='YlOrRd')
    axes[0, 1].set_xlabel(x_col)
    axes[0, 1].set_ylabel(y_col)
    axes[0, 1].set_title('Density Plot')

    # Binned means
    df_clean = df[[x_col, y_col]].dropna()
    df_clean['bin'] = pd.qcut(df_clean[x_col], 10, duplicates='drop')
    binned = df_clean.groupby('bin')[y_col].agg(['mean', 'std', 'count'])
    binned['x_mid'] = [interval.mid for interval in binned.index]

    axes[1, 0].errorbar(binned['x_mid'], binned['mean'],
                        yerr=binned['std']/np.sqrt(binned['count']),
                        marker='o', capsize=3)
    axes[1, 0].set_xlabel(x_col)
    axes[1, 0].set_ylabel(f'Mean {y_col}')
    axes[1, 0].set_title('Binned Means with Standard Errors')

    # Residual plot (linear fit)
    from scipy import stats
    slope, intercept, r, p, se = stats.linregress(
        df_clean[x_col], df_clean[y_col]
    )
    predicted = intercept + slope * df_clean[x_col]
    residuals = df_clean[y_col] - predicted

    axes[1, 1].scatter(predicted, residuals, alpha=0.3, s=10)
    axes[1, 1].axhline(0, color='red', linestyle='--')
    axes[1, 1].set_xlabel('Predicted')
    axes[1, 1].set_ylabel('Residuals')
    axes[1, 1].set_title('Residual Plot')

    plt.tight_layout()
    return fig

Key Bivariate Relationships in Football

Air Yards vs EPA

Air yards (the distance the ball travels downfield from the line of scrimmage to the target point) relates to EPA, but not linearly:

# Backticks are required in the query string because 'pass' is a Python keyword
passes = pbp.query("`pass` == 1 and air_yards.notna() and epa.notna()")

fig = analyze_bivariate(passes, 'air_yards', 'epa')
plt.savefig('airyards_epa.png', dpi=150, bbox_inches='tight')

Key insight: Short passes have lower variance but also lower upside. Deep passes are high-risk, high-reward.

Yards to Go vs Play Selection

How does distance to first down affect play calling?

plays = pbp.query("down <= 3 and play_type.isin(['pass', 'run'])")

pass_rate = (
    plays
    .groupby('ydstogo')['pass']
    .mean()
    .reset_index()
    .query("ydstogo <= 15")
)

plt.figure(figsize=(10, 6))
plt.plot(pass_rate['ydstogo'], pass_rate['pass'], marker='o')
plt.xlabel('Yards to Go')
plt.ylabel('Pass Rate')
plt.title('Pass Rate by Yards to Go')
plt.axhline(0.5, color='gray', linestyle='--', alpha=0.5)
plt.grid(True, alpha=0.3)

Score Differential vs Play Calling

Game situation dramatically affects strategy:

plays = pbp.query("play_type.isin(['pass', 'run'])").copy()  # copy before adding columns
plays['score_diff_bucket'] = pd.cut(
    plays['score_differential'],
    bins=[-50, -14, -7, 0, 7, 14, 50],
    labels=['Down 14+', 'Down 7-14', 'Down 1-7', 'Up 1-7', 'Up 7-14', 'Up 14+']
)

pass_by_score = (
    plays
    .groupby('score_diff_bucket')
    .agg(
        pass_rate=('pass', 'mean'),
        epa_per_play=('epa', 'mean'),
        plays=('play_id', 'count')
    )
    .reset_index()
)

print(pass_by_score)

4.5 Multivariate Analysis

Beyond Pairs

Real football analysis requires examining multiple variables simultaneously:

def create_team_profile(pbp: pd.DataFrame, team: str) -> pd.DataFrame:
    """Create comprehensive team profile."""

    team_plays = pbp.query(f"posteam == '{team}' and play_type.isin(['pass', 'run'])")

    profile = {
        'total_plays': len(team_plays),
        'pass_rate': team_plays['pass'].mean(),
        'epa_per_play': team_plays['epa'].mean(),
        'success_rate': team_plays['success'].mean(),
        'explosive_play_rate': (
            ((team_plays['pass'] == 1) & (team_plays['yards_gained'] >= 20)) |
            ((team_plays['rush'] == 1) & (team_plays['yards_gained'] >= 10))
        ).mean(),
        'turnover_rate': (
            team_plays['interception'].sum() + team_plays['fumble_lost'].sum()
        ) / len(team_plays),
        'third_down_success': team_plays.query("down == 3")['success'].mean(),
        'red_zone_td_rate': team_plays.query("yardline_100 <= 20")['touchdown'].mean()
    }

    return pd.Series(profile, name=team)
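
Stacking one profile per offense yields a team-by-metric table; later examples in this chapter reuse it as team_stats (a sketch; posteam is null on some administrative rows, hence the dropna):

teams = sorted(pbp['posteam'].dropna().unique())
team_stats = pd.DataFrame([create_team_profile(pbp, team) for team in teams])
print(team_stats.sort_values('epa_per_play', ascending=False).head())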

Correlation Matrices

Examining correlations across multiple metrics:

def create_correlation_heatmap(df: pd.DataFrame, columns: list):
    """Create correlation heatmap for specified columns."""

    corr_matrix = df[columns].corr()

    fig, ax = plt.subplots(figsize=(10, 8))

    # Create heatmap
    im = ax.imshow(corr_matrix, cmap='RdBu_r', vmin=-1, vmax=1)

    # Add colorbar
    cbar = ax.figure.colorbar(im, ax=ax)

    # Add labels
    ax.set_xticks(range(len(columns)))
    ax.set_yticks(range(len(columns)))
    ax.set_xticklabels(columns, rotation=45, ha='right')
    ax.set_yticklabels(columns)

    # Add correlation values
    for i in range(len(columns)):
        for j in range(len(columns)):
            text = ax.text(j, i, f'{corr_matrix.iloc[i, j]:.2f}',
                          ha='center', va='center', fontsize=8)

    ax.set_title('Correlation Matrix')
    plt.tight_layout()
    return fig
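
Applied to the team_stats profiles built above (the column list simply names a subset of create_team_profile's outputs):

metric_cols = ['pass_rate', 'epa_per_play', 'success_rate',
               'explosive_play_rate', 'turnover_rate']
fig = create_correlation_heatmap(team_stats, metric_cols)
plt.savefig('team_metric_correlations.png', dpi=150, bbox_inches='tight')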

Principal Component Analysis for Team Profiles

When many metrics exist, dimensionality reduction reveals underlying structure:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def pca_team_analysis(team_stats: pd.DataFrame, n_components: int = 2):
    """PCA analysis of team statistics."""

    # Select numeric columns
    numeric_cols = team_stats.select_dtypes(include=[np.number]).columns

    # Standardize
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(team_stats[numeric_cols])

    # PCA
    pca = PCA(n_components=n_components)
    components = pca.fit_transform(scaled_data)

    # Create results DataFrame
    results = pd.DataFrame(
        components,
        columns=[f'PC{i+1}' for i in range(n_components)],
        index=team_stats.index
    )

    # Explained variance
    print("Explained variance ratio:", pca.explained_variance_ratio_)

    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # Scatter of teams in PC space
    axes[0].scatter(results['PC1'], results['PC2'])
    for idx, team in enumerate(results.index):
        axes[0].annotate(team, (results.iloc[idx]['PC1'], results.iloc[idx]['PC2']))
    axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
    axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
    axes[0].set_title('Teams in Principal Component Space')

    # Loadings
    loadings = pd.DataFrame(
        pca.components_.T,
        columns=[f'PC{i+1}' for i in range(n_components)],
        index=numeric_cols
    )

    loadings['PC1'].sort_values().plot(kind='barh', ax=axes[1])
    axes[1].set_title('PC1 Loadings')

    plt.tight_layout()
    return results, loadings, fig
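
A minimal call, again reusing team_stats (all of its columns are numeric, so all of them enter the PCA):

results, loadings, fig = pca_team_analysis(team_stats)
print(loadings.sort_values('PC1'))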

4.6 Visualization Best Practices

Principles of Effective Football Visualization

1. Know Your Audience

Different audiences need different visualizations:

| Audience | Preferred Visuals | Key Considerations |
|---|---|---|
| Coaches | Play diagrams, simple charts | Quick interpretation, actionable |
| Front Office | Trend lines, comparisons | Context, historical data |
| Media | Infographics, rankings | Engaging, shareable |
| Analysts | Detailed charts, statistical graphics | Precision, methodology |

2. Choose the Right Chart Type

chart_recommendations = {
    'distribution': ['histogram', 'density plot', 'box plot'],
    'comparison': ['bar chart', 'dot plot', 'grouped bar'],
    'relationship': ['scatter plot', 'line chart', 'hexbin'],
    'composition': ['stacked bar', 'pie chart', 'treemap'],
    'time_series': ['line chart', 'area chart'],
    'ranking': ['horizontal bar', 'lollipop chart']
}

3. Color Usage

# NFL team colors (sample)
team_colors = {
    'KC': {'primary': '#E31837', 'secondary': '#FFB612'},
    'SF': {'primary': '#AA0000', 'secondary': '#B3995D'},
    'BUF': {'primary': '#00338D', 'secondary': '#C60C30'},
    'PHI': {'primary': '#004C54', 'secondary': '#A5ACAF'}
}

def get_team_color(team: str, color_type: str = 'primary') -> str:
    """Get team color for visualization."""
    return team_colors.get(team, {}).get(color_type, '#333333')
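
As a small sketch, the helper can color a comparison of the four sample teams, assuming the team_stats table from Section 4.5 is in scope:

sample = team_stats.loc[['KC', 'SF', 'BUF', 'PHI'], 'epa_per_play']
colors = [get_team_color(team) for team in sample.index]

plt.figure(figsize=(8, 5))
plt.bar(sample.index, sample.values, color=colors)
plt.ylabel('EPA per Play')
plt.title('Offensive EPA per Play, Selected Teams')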

Creating Publication-Quality Figures

def set_plot_style():
    """Set consistent plot style for the textbook."""

    plt.style.use('seaborn-v0_8-whitegrid')

    plt.rcParams.update({
        'figure.figsize': (10, 6),
        'figure.dpi': 150,
        'axes.titlesize': 14,
        'axes.labelsize': 12,
        'xtick.labelsize': 10,
        'ytick.labelsize': 10,
        'legend.fontsize': 10,
        'font.family': 'sans-serif',
        'axes.spines.top': False,
        'axes.spines.right': False
    })

set_plot_style()

Interactive Visualizations

For exploratory work, interactive visualizations accelerate discovery:

import plotly.express as px
import plotly.graph_objects as go

def interactive_team_comparison(team_stats: pd.DataFrame):
    """Create interactive team comparison."""

    fig = px.scatter(
        team_stats,
        x='epa_per_play',
        y='success_rate',
        size='total_plays',
        color='pass_rate',
        hover_name=team_stats.index,
        title='Team Offensive Profile',
        labels={
            'epa_per_play': 'EPA per Play',
            'success_rate': 'Success Rate',
            'pass_rate': 'Pass Rate'
        }
    )

    fig.update_layout(
        width=800,
        height=600,
        coloraxis_colorbar_title='Pass Rate'
    )

    return fig
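
Usage is a one-liner with the team_stats profiles from Section 4.5; fig.show() renders in a notebook or browser:

fig = interactive_team_comparison(team_stats)
fig.show()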

4.7 Common EDA Patterns in Football

Pattern 1: Year-Over-Year Comparisons

def year_over_year_analysis(seasons: list, metric_func):
    """Compare metrics across seasons."""

    results = []
    for season in seasons:
        pbp = nfl.import_pbp_data([season])
        metric = metric_func(pbp)
        metric['season'] = season
        results.append(metric)

    return pd.concat(results, ignore_index=True)

# Example: League-wide passing EPA by season
def get_passing_epa(pbp):
    return (
        pbp
        .query("pass == 1")
        .groupby('posteam')
        .agg(epa_per_pass=('epa', 'mean'))
        .reset_index()
    )

multi_year = year_over_year_analysis([2019, 2020, 2021, 2022, 2023], get_passing_epa)

Pattern 2: Split Analysis

Comparing performance across different conditions:

def split_analysis(
    pbp: pd.DataFrame,
    split_col: str,
    player_col: str,
    min_plays: int = 50
) -> pd.DataFrame:
    """Analyze player performance across splits."""

    splits = (
        pbp
        .query(f"{player_col}.notna()")
        .groupby([player_col, split_col])
        .agg(
            plays=('play_id', 'count'),
            epa_per_play=('epa', 'mean'),
            success_rate=('success', 'mean')
        )
        .reset_index()
        .query(f"plays >= {min_plays}")
    )

    # Pivot for comparison
    pivoted = splits.pivot(
        index=player_col,
        columns=split_col,
        values='epa_per_play'
    )

    return pivoted

# Example: Home vs Away
pbp['location'] = np.where(pbp['posteam'] == pbp['home_team'], 'home', 'away')
home_away_splits = split_analysis(
    pbp.query("pass == 1"),
    'location',
    'passer_player_name'
)

Pattern 3: Rolling Averages

Tracking performance over time:

def calculate_rolling_stats(
    pbp: pd.DataFrame,
    player_col: str,
    player_name: str,
    window: int = 100
) -> pd.DataFrame:
    """Calculate rolling statistics for a player."""

    player_plays = (
        pbp
        .query(f"{player_col} == @player_name")
        .sort_values(['season', 'week', 'play_id'])
        .copy()
    )

    player_plays['rolling_epa'] = (
        player_plays['epa']
        .rolling(window, min_periods=20)
        .mean()
    )

    player_plays['rolling_success'] = (
        player_plays['success']
        .rolling(window, min_periods=20)
        .mean()
    )

    # Add play number for x-axis
    player_plays['play_num'] = range(len(player_plays))

    return player_plays

# Example usage
mahomes_rolling = calculate_rolling_stats(
    pbp.query("pass == 1"),
    'passer_player_name',
    'P.Mahomes',
    window=100
)
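
Plotting the rolling series shows form over the season (a sketch; the axis labels assume the 100-play window above):

plt.figure(figsize=(10, 6))
plt.plot(mahomes_rolling['play_num'], mahomes_rolling['rolling_epa'])
plt.axhline(0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('Dropback Number')
plt.ylabel('Rolling EPA (100-play window)')
plt.title('P.Mahomes Rolling EPA, 2023')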

Pattern 4: Anomaly Detection

Finding outliers and unusual performances:

def detect_anomalies(
    df: pd.DataFrame,
    metric_col: str,
    group_col: str,
    z_threshold: float = 2.0
) -> pd.DataFrame:
    """Detect anomalous performances."""

    # Calculate z-scores within groups
    df = df.copy()
    df['z_score'] = df.groupby(group_col)[metric_col].transform(
        lambda x: (x - x.mean()) / x.std()
    )

    # Flag anomalies
    df['is_anomaly'] = df['z_score'].abs() > z_threshold
    df['anomaly_type'] = np.where(
        df['z_score'] > z_threshold, 'high',
        np.where(df['z_score'] < -z_threshold, 'low', 'normal')
    )

    return df[df['is_anomaly']]
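
For example, to flag quarterback weeks that deviate sharply from each passer's own baseline (a sketch; the weekly aggregation and the 200-dropback qualifying floor are illustrative choices, not from the function above):

weekly_qb = (
    pbp.query("`pass` == 1 and passer_player_name.notna()")
    .groupby(['passer_player_name', 'week'])
    .agg(epa_per_play=('epa', 'mean'), plays=('play_id', 'count'))
    .reset_index()
)

# Keep passers with a meaningful season-long sample
qualified = weekly_qb.groupby('passer_player_name')['plays'].transform('sum') >= 200
anomalies = detect_anomalies(weekly_qb[qualified], 'epa_per_play', 'passer_player_name')
print(anomalies.sort_values('z_score').head(10))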

4.8 Building an EDA Workflow

Structured Approach

A repeatable EDA workflow ensures thorough analysis:

class FootballEDA:
    """
    Structured EDA workflow for football data.

    Example
    -------
    >>> eda = FootballEDA([2023])
    >>> eda.run_full_analysis()
    >>> eda.generate_report('output/')
    """

    def __init__(self, seasons: list):
        self.seasons = seasons
        self.pbp = None
        self.findings = []

    def load_data(self):
        """Stage 1: Load and cache data."""
        print("Loading data...")
        self.pbp = nfl.import_pbp_data(self.seasons)
        print(f"Loaded {len(self.pbp):,} plays")
        return self

    def inspect_data(self):
        """Stage 2: Initial inspection."""
        print("\n" + "="*60)
        print("DATA INSPECTION")
        print("="*60)

        print(f"\nShape: {self.pbp.shape}")
        print(f"\nPlay type distribution:")
        print(self.pbp['play_type'].value_counts())

        print(f"\nMissing data (top 10):")
        missing = self.pbp.isnull().sum().sort_values(ascending=False).head(10)
        print(missing)

        return self

    def univariate_analysis(self):
        """Stage 3: Single variable analysis."""
        print("\n" + "="*60)
        print("UNIVARIATE ANALYSIS")
        print("="*60)

        # EPA distribution
        plays = self.pbp.query("play_type.isin(['pass', 'run'])")

        print(f"\nEPA Statistics:")
        print(plays['epa'].describe())

        # Pass vs Run
        print(f"\nPass EPA: {plays.query('pass == 1')['epa'].mean():.4f}")
        print(f"Run EPA: {plays.query('rush == 1')['epa'].mean():.4f}")

        self.findings.append({
            'stage': 'univariate',
            'finding': f"Pass EPA ({plays.query('pass == 1')['epa'].mean():.3f}) vs Run EPA ({plays.query('rush == 1')['epa'].mean():.3f})"
        })

        return self

    def bivariate_analysis(self):
        """Stage 4: Two variable relationships."""
        print("\n" + "="*60)
        print("BIVARIATE ANALYSIS")
        print("="*60)

        plays = self.pbp.query("play_type.isin(['pass', 'run'])")

        # Down vs EPA
        down_epa = plays.groupby('down')['epa'].mean()
        print(f"\nEPA by Down:")
        print(down_epa)

        # Correlation between yards to go and EPA
        corr = plays[['ydstogo', 'epa']].dropna().corr().iloc[0, 1]
        print(f"\nCorrelation (ydstogo, epa): {corr:.4f}")

        return self

    def team_analysis(self):
        """Stage 5: Team-level analysis."""
        print("\n" + "="*60)
        print("TEAM ANALYSIS")
        print("="*60)

        plays = self.pbp.query("play_type.isin(['pass', 'run'])")

        team_stats = (
            plays
            .groupby('posteam')
            .agg(
                plays=('play_id', 'count'),
                epa_per_play=('epa', 'mean'),
                pass_rate=('pass', 'mean'),
                success_rate=('success', 'mean')
            )
            .reset_index()
            .sort_values('epa_per_play', ascending=False)
        )

        print(f"\nTop 5 Teams by EPA/Play:")
        print(team_stats.head())

        print(f"\nBottom 5 Teams by EPA/Play:")
        print(team_stats.tail())

        return self

    def run_full_analysis(self):
        """Run complete EDA pipeline."""
        return (
            self
            .load_data()
            .inspect_data()
            .univariate_analysis()
            .bivariate_analysis()
            .team_analysis()
        )

    def generate_report(self, output_dir: str):
        """Generate EDA report."""
        from pathlib import Path

        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        # Summary report
        with open(output_path / 'eda_summary.txt', 'w') as f:
            f.write("EDA Summary Report\n")
            f.write("="*60 + "\n\n")

            f.write(f"Seasons analyzed: {self.seasons}\n")
            f.write(f"Total plays: {len(self.pbp):,}\n\n")

            f.write("Key Findings:\n")
            for i, finding in enumerate(self.findings, 1):
                f.write(f"{i}. [{finding['stage']}] {finding['finding']}\n")

        print(f"\nReport saved to {output_path}")

4.9 Communicating Insights

The Analyst's Narrative

Data exploration is only valuable if insights are communicated effectively:

Structure for Presenting Findings:

  1. Lead with the insight: State the key finding clearly
  2. Show the evidence: Present supporting visualization
  3. Provide context: Compare to benchmarks or expectations
  4. Acknowledge limitations: Note sample size, data quality issues
  5. Suggest implications: What should be done with this information?

Example Write-Up:

Finding: Patrick Mahomes maintains elite performance under pressure

Analysis of 2023 passing data reveals that Mahomes' EPA per play remains above league average even when facing pressure (proxied here by games in which the offense allowed a sack rate above 5%).

While most quarterbacks see a 40-60% decline in EPA under pressure, Mahomes' decline is only 25%, ranking him 2nd among qualified passers.

This suggests the Chiefs' passing game maintains value even when the offensive line struggles, making aggressive blitz strategies less effective against Kansas City.

Visualization Checklist

Before sharing any visualization:

  • [ ] Title clearly states the insight
  • [ ] Axes are labeled with units
  • [ ] Legend is included if using colors/shapes
  • [ ] Data source and date range noted
  • [ ] Font sizes are readable
  • [ ] Colors are colorblind-friendly
  • [ ] Chart type matches the data and message

4.10 Chapter Summary

Exploratory Data Analysis is the foundation of football analytics. Before building models or making recommendations, analysts must understand their data deeply. Key principles:

  1. Start with questions: Let curiosity guide exploration
  2. Visualize early and often: Patterns emerge through visual inspection
  3. Check assumptions: Data rarely behaves as expected
  4. Document findings: Insights are useless if not recorded
  5. Iterate: EDA is a cycle, not a linear process

The EDA skills developed in this chapter form the basis for all subsequent analysis. Whether evaluating players, building predictive models, or advising on game strategy, thorough data exploration ensures reliable conclusions.


Chapter Summary Checklist

After completing this chapter, you should be able to:

  • [ ] Conduct systematic data inspection on NFL play-by-play data
  • [ ] Create and interpret univariate distributions (histograms, box plots)
  • [ ] Analyze bivariate relationships (scatter plots, correlation)
  • [ ] Build multivariate team profiles
  • [ ] Apply visualization best practices
  • [ ] Structure an EDA workflow that can be repeated
  • [ ] Communicate findings effectively to different audiences

Preview: Chapter 5

In Chapter 5, we'll build the statistical foundation needed for football analytics, covering probability, inference, and the core concepts that underpin metrics like EPA and win probability.