Case Study: Building a Team Stats Dashboard

"The first step in any analysis is understanding what data you have and what it can tell you."

Executive Summary

This case study guides you through building a comprehensive team statistics dashboard using nfl_data_py. You'll practice data loading, transformation, aggregation, and quality validation—the foundational skills for any football analytics project.

Skills Applied: - Loading and caching play-by-play data - Filtering to relevant subsets - Aggregating to team-level statistics - Validating data quality - Creating summary visualizations

Data: nfl_data_py play-by-play data (2023 season)


Background

The Context

You've been hired as a junior analyst for an NFL team's analytics department. Your first assignment: create a team performance dashboard that the coaching staff can reference for game preparation.

The dashboard needs to show: - Offensive efficiency metrics (EPA, success rate) - Defensive efficiency metrics - Situational performance (red zone, third down) - Comparison to league averages

The Challenge

Business Question: "How does our team's performance compare to league average across key efficiency metrics, and where are our biggest strengths and weaknesses?"

Requirements

Metric Category Specific Metrics
Offensive Efficiency EPA/play, Success Rate, Explosive Play %
Passing Attack EPA/dropback, CPOE, Air Yards/Attempt
Rushing Attack EPA/carry, Success Rate, YPC
Defensive Performance EPA/play allowed, Success Rate allowed
Situational 3rd Down Conv %, Red Zone TD %

Analysis Approach

Phase 1: Data Loading and Exploration

"""
Case Study: Team Stats Dashboard
Part 1: Data Loading and Initial Exploration
"""

import nfl_data_py as nfl
import pandas as pd
import numpy as np
from pathlib import Path

# Configuration
SEASON = 2023
CACHE_DIR = Path('./data/cache')
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def load_pbp_cached(season: int) -> pd.DataFrame:
    """Load play-by-play data with local caching."""
    cache_path = CACHE_DIR / f'pbp_{season}.parquet'

    if cache_path.exists():
        print(f"Loading from cache: {cache_path}")
        return pd.read_parquet(cache_path)

    print(f"Downloading play-by-play data for {season}...")
    pbp = nfl.import_pbp_data([season])
    pbp.to_parquet(cache_path)
    print(f"Cached to: {cache_path}")
    return pbp


# Load data
pbp = load_pbp_cached(SEASON)

# Initial exploration
print("\n" + "=" * 50)
print("DATA EXPLORATION")
print("=" * 50)

print(f"\nDataset shape: {pbp.shape}")
print(f"Total plays: {len(pbp):,}")
print(f"Unique games: {pbp['game_id'].nunique()}")
print(f"Teams: {sorted(pbp['posteam'].dropna().unique())}")

print("\nPlay type distribution:")
print(pbp['play_type'].value_counts())

print("\nKey numeric columns:")
key_cols = ['yards_gained', 'epa', 'wpa', 'air_yards']
print(pbp[key_cols].describe().round(2))

Phase 2: Data Cleaning and Filtering

"""
Part 2: Cleaning and Filtering to Relevant Plays
"""

def filter_to_real_plays(df: pd.DataFrame) -> pd.DataFrame:
    """
    Filter to actual plays, excluding:
    - Timeouts
    - End of quarter/half/game
    - Two-minute warnings
    - Penalties without plays
    """
    # Keep only pass and rush plays (the core offensive plays)
    real_plays = df[
        (df['play_type'].isin(['pass', 'run'])) |
        ((df['pass'] == 1) | (df['rush'] == 1))
    ].copy()

    # Exclude plays with missing EPA
    real_plays = real_plays[real_plays['epa'].notna()]

    # Exclude plays with missing team
    real_plays = real_plays[real_plays['posteam'].notna()]

    return real_plays


def add_derived_fields(df: pd.DataFrame) -> pd.DataFrame:
    """Add commonly used derived fields."""
    df = df.copy()

    # Play type indicators
    df['is_pass'] = (df['pass'] == 1).astype(int)
    df['is_rush'] = (df['rush'] == 1).astype(int)

    # Success indicator (EPA > 0)
    df['is_success'] = (df['epa'] > 0).astype(int)

    # Explosive play (20+ yards pass, 10+ yards rush)
    df['is_explosive'] = (
        ((df['is_pass'] == 1) & (df['yards_gained'] >= 20)) |
        ((df['is_rush'] == 1) & (df['yards_gained'] >= 10))
    ).astype(int)

    # Situational flags
    df['is_third_down'] = (df['down'] == 3).astype(int)
    df['is_red_zone'] = (df['yardline_100'] <= 20).astype(int)

    # Score context
    df['score_diff'] = df['posteam_score'] - df['defteam_score']
    df['is_garbage_time'] = (
        (df['game_seconds_remaining'] < 300) &
        (df['score_diff'].abs() > 17)
    ).astype(int)

    return df


# Apply cleaning
plays = filter_to_real_plays(pbp)
plays = add_derived_fields(plays)

print("\n" + "=" * 50)
print("FILTERED DATA SUMMARY")
print("=" * 50)
print(f"Plays after filtering: {len(plays):,}")
print(f"Pass plays: {plays['is_pass'].sum():,}")
print(f"Rush plays: {plays['is_rush'].sum():,}")
print(f"Third down plays: {plays['is_third_down'].sum():,}")
print(f"Red zone plays: {plays['is_red_zone'].sum():,}")

Phase 3: Team-Level Aggregation

"""
Part 3: Aggregate to Team-Level Statistics
"""

def calculate_offensive_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Calculate offensive statistics by team."""

    # Overall offensive metrics
    overall = df.groupby('posteam').agg({
        'play_id': 'count',
        'epa': ['sum', 'mean'],
        'is_success': 'mean',
        'is_explosive': 'mean',
        'yards_gained': 'sum'
    }).reset_index()

    overall.columns = ['team', 'plays', 'total_epa', 'epa_per_play',
                       'success_rate', 'explosive_rate', 'total_yards']

    # Passing metrics
    pass_plays = df[df['is_pass'] == 1]
    passing = pass_plays.groupby('posteam').agg({
        'play_id': 'count',
        'epa': 'mean',
        'is_success': 'mean',
        'cpoe': 'mean',
        'air_yards': 'mean'
    }).reset_index()

    passing.columns = ['team', 'dropbacks', 'pass_epa_per_play',
                       'pass_success_rate', 'cpoe', 'air_yards_per_att']

    # Rushing metrics
    rush_plays = df[df['is_rush'] == 1]
    rushing = rush_plays.groupby('posteam').agg({
        'play_id': 'count',
        'epa': 'mean',
        'is_success': 'mean',
        'yards_gained': 'mean'
    }).reset_index()

    rushing.columns = ['team', 'carries', 'rush_epa_per_play',
                       'rush_success_rate', 'ypc']

    # Third down
    third_down = df[df['is_third_down'] == 1].groupby('posteam').agg({
        'play_id': 'count',
        'first_down': 'mean'
    }).reset_index()

    third_down.columns = ['team', 'third_down_att', 'third_down_conv_rate']

    # Red zone
    red_zone = df[(df['is_red_zone'] == 1)].groupby('posteam').agg({
        'play_id': 'count',
        'touchdown': 'mean'
    }).reset_index()

    red_zone.columns = ['team', 'red_zone_plays', 'red_zone_td_rate']

    # Merge all
    offense = overall.merge(passing, on='team')
    offense = offense.merge(rushing, on='team')
    offense = offense.merge(third_down, on='team')
    offense = offense.merge(red_zone, on='team')

    return offense


def calculate_defensive_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Calculate defensive statistics by team."""

    defense = df.groupby('defteam').agg({
        'play_id': 'count',
        'epa': ['sum', 'mean'],
        'is_success': 'mean',
        'yards_gained': 'sum'
    }).reset_index()

    defense.columns = ['team', 'plays_faced', 'total_epa_allowed',
                       'epa_per_play_allowed', 'opp_success_rate',
                       'total_yards_allowed']

    # For defense, lower is better, so we'll negate EPA for ranking
    defense['def_epa_per_play'] = -defense['epa_per_play_allowed']

    return defense


# Calculate stats
offense_stats = calculate_offensive_stats(plays)
defense_stats = calculate_defensive_stats(plays)

# Add league averages
league_avg = {
    'epa_per_play': plays['epa'].mean(),
    'success_rate': plays['is_success'].mean(),
    'pass_epa': plays[plays['is_pass'] == 1]['epa'].mean(),
    'rush_epa': plays[plays['is_rush'] == 1]['epa'].mean()
}

print("\n" + "=" * 50)
print("LEAGUE AVERAGES")
print("=" * 50)
for metric, value in league_avg.items():
    print(f"{metric}: {value:.3f}")

Phase 4: Dashboard Output

"""
Part 4: Create Dashboard Summary
"""

def create_team_dashboard(
    team: str,
    offense: pd.DataFrame,
    defense: pd.DataFrame,
    league_avg: dict
) -> str:
    """Generate text-based dashboard for a team."""

    off = offense[offense['team'] == team].iloc[0]
    def_ = defense[defense['team'] == team].iloc[0]

    # Calculate ranks (1 = best)
    off_rank = (offense['epa_per_play'].rank(ascending=False)
                [offense['team'] == team].values[0])
    def_rank = (defense['def_epa_per_play'].rank(ascending=False)
                [defense['team'] == team].values[0])

    lines = [
        "=" * 60,
        f"TEAM PERFORMANCE DASHBOARD: {team}",
        f"Season: {SEASON}",
        "=" * 60,
        "",
        "OFFENSIVE EFFICIENCY",
        "-" * 40,
        f"  EPA/Play:        {off['epa_per_play']:+.3f}  (Rank: {int(off_rank)}/32)",
        f"  vs League Avg:   {off['epa_per_play'] - league_avg['epa_per_play']:+.3f}",
        f"  Success Rate:    {off['success_rate']:.1%}",
        f"  Explosive Rate:  {off['explosive_rate']:.1%}",
        "",
        "PASSING ATTACK",
        "-" * 40,
        f"  EPA/Dropback:    {off['pass_epa_per_play']:+.3f}",
        f"  Success Rate:    {off['pass_success_rate']:.1%}",
        f"  CPOE:            {off['cpoe']:+.1f}%",
        f"  Air Yards/Att:   {off['air_yards_per_att']:.1f}",
        "",
        "RUSHING ATTACK",
        "-" * 40,
        f"  EPA/Carry:       {off['rush_epa_per_play']:+.3f}",
        f"  Success Rate:    {off['rush_success_rate']:.1%}",
        f"  Yards/Carry:     {off['ypc']:.1f}",
        "",
        "SITUATIONAL",
        "-" * 40,
        f"  3rd Down Conv:   {off['third_down_conv_rate']:.1%}",
        f"  Red Zone TD%:    {off['red_zone_td_rate']:.1%}",
        "",
        "DEFENSIVE EFFICIENCY",
        "-" * 40,
        f"  EPA/Play Allowed: {def_['epa_per_play_allowed']:+.3f}  (Rank: {int(def_rank)}/32)",
        f"  Opp Success Rate: {def_['opp_success_rate']:.1%}",
        "",
        "=" * 60
    ]

    return "\n".join(lines)


# Generate dashboards for select teams
for team in ['KC', 'SF', 'BAL', 'DET']:
    dashboard = create_team_dashboard(team, offense_stats, defense_stats, league_avg)
    print(dashboard)
    print()

Results Summary

Sample Output: Kansas City Chiefs

============================================================
TEAM PERFORMANCE DASHBOARD: KC
Season: 2023
============================================================

OFFENSIVE EFFICIENCY
----------------------------------------
  EPA/Play:        +0.152  (Rank: 5/32)
  vs League Avg:   +0.152
  Success Rate:    48.2%
  Explosive Rate:  12.1%

PASSING ATTACK
----------------------------------------
  EPA/Dropback:    +0.185
  Success Rate:    50.1%
  CPOE:            +2.3%
  Air Yards/Att:   8.2

RUSHING ATTACK
----------------------------------------
  EPA/Carry:       +0.089
  Success Rate:    45.2%
  Yards/Carry:     4.8

SITUATIONAL
----------------------------------------
  3rd Down Conv:   42.5%
  Red Zone TD%:    58.3%

DEFENSIVE EFFICIENCY
----------------------------------------
  EPA/Play Allowed: +0.045  (Rank: 12/32)
  Opp Success Rate: 46.8%

============================================================

Discussion Questions

  1. Why do we filter to "real plays" before calculating team statistics? What would happen if we included timeouts and penalties?

  2. How might garbage time plays affect our team efficiency metrics? Should we exclude them?

  3. The dashboard shows EPA per play, but some teams run more plays than others. How might total EPA be useful in addition to per-play EPA?

  4. Why is CPOE (Completion Percentage Over Expected) more informative than raw completion percentage?

  5. How would you extend this dashboard to include weekly trends instead of just season totals?


Your Turn: Extensions

Option A: Add week-by-week trending - Calculate rolling 3-week averages for key metrics - Create a line chart showing team performance over the season

Option B: Opponent-adjusted metrics - Calculate average opponent defensive EPA for games played - Adjust offensive EPA based on strength of schedule

Option C: Player-level drill-down - Add top 3 players by EPA for each team - Show player-level metrics within the team dashboard


Complete Code

See code/case-study-01-dashboard.py for the full implementation.