Case Study: Building a Team Stats Dashboard
"The first step in any analysis is understanding what data you have and what it can tell you."
Executive Summary
This case study guides you through building a comprehensive team statistics dashboard using nfl_data_py. You'll practice data loading, transformation, aggregation, and quality validation—the foundational skills for any football analytics project.
Skills Applied: - Loading and caching play-by-play data - Filtering to relevant subsets - Aggregating to team-level statistics - Validating data quality - Creating summary visualizations
Data: nfl_data_py play-by-play data (2023 season)
Background
The Context
You've been hired as a junior analyst for an NFL team's analytics department. Your first assignment: create a team performance dashboard that the coaching staff can reference for game preparation.
The dashboard needs to show: - Offensive efficiency metrics (EPA, success rate) - Defensive efficiency metrics - Situational performance (red zone, third down) - Comparison to league averages
The Challenge
Business Question: "How does our team's performance compare to league average across key efficiency metrics, and where are our biggest strengths and weaknesses?"
Requirements
| Metric Category | Specific Metrics |
|---|---|
| Offensive Efficiency | EPA/play, Success Rate, Explosive Play % |
| Passing Attack | EPA/dropback, CPOE, Air Yards/Attempt |
| Rushing Attack | EPA/carry, Success Rate, YPC |
| Defensive Performance | EPA/play allowed, Success Rate allowed |
| Situational | 3rd Down Conv %, Red Zone TD % |
Analysis Approach
Phase 1: Data Loading and Exploration
"""
Case Study: Team Stats Dashboard
Part 1: Data Loading and Initial Exploration
"""
import nfl_data_py as nfl
import pandas as pd
import numpy as np
from pathlib import Path
# Configuration
SEASON = 2023
CACHE_DIR = Path('./data/cache')
CACHE_DIR.mkdir(parents=True, exist_ok=True)
def load_pbp_cached(season: int) -> pd.DataFrame:
"""Load play-by-play data with local caching."""
cache_path = CACHE_DIR / f'pbp_{season}.parquet'
if cache_path.exists():
print(f"Loading from cache: {cache_path}")
return pd.read_parquet(cache_path)
print(f"Downloading play-by-play data for {season}...")
pbp = nfl.import_pbp_data([season])
pbp.to_parquet(cache_path)
print(f"Cached to: {cache_path}")
return pbp
# Load data
pbp = load_pbp_cached(SEASON)
# Initial exploration
print("\n" + "=" * 50)
print("DATA EXPLORATION")
print("=" * 50)
print(f"\nDataset shape: {pbp.shape}")
print(f"Total plays: {len(pbp):,}")
print(f"Unique games: {pbp['game_id'].nunique()}")
print(f"Teams: {sorted(pbp['posteam'].dropna().unique())}")
print("\nPlay type distribution:")
print(pbp['play_type'].value_counts())
print("\nKey numeric columns:")
key_cols = ['yards_gained', 'epa', 'wpa', 'air_yards']
print(pbp[key_cols].describe().round(2))
Phase 2: Data Cleaning and Filtering
"""
Part 2: Cleaning and Filtering to Relevant Plays
"""
def filter_to_real_plays(df: pd.DataFrame) -> pd.DataFrame:
"""
Filter to actual plays, excluding:
- Timeouts
- End of quarter/half/game
- Two-minute warnings
- Penalties without plays
"""
# Keep only pass and rush plays (the core offensive plays)
real_plays = df[
(df['play_type'].isin(['pass', 'run'])) |
((df['pass'] == 1) | (df['rush'] == 1))
].copy()
# Exclude plays with missing EPA
real_plays = real_plays[real_plays['epa'].notna()]
# Exclude plays with missing team
real_plays = real_plays[real_plays['posteam'].notna()]
return real_plays
def add_derived_fields(df: pd.DataFrame) -> pd.DataFrame:
"""Add commonly used derived fields."""
df = df.copy()
# Play type indicators
df['is_pass'] = (df['pass'] == 1).astype(int)
df['is_rush'] = (df['rush'] == 1).astype(int)
# Success indicator (EPA > 0)
df['is_success'] = (df['epa'] > 0).astype(int)
# Explosive play (20+ yards pass, 10+ yards rush)
df['is_explosive'] = (
((df['is_pass'] == 1) & (df['yards_gained'] >= 20)) |
((df['is_rush'] == 1) & (df['yards_gained'] >= 10))
).astype(int)
# Situational flags
df['is_third_down'] = (df['down'] == 3).astype(int)
df['is_red_zone'] = (df['yardline_100'] <= 20).astype(int)
# Score context
df['score_diff'] = df['posteam_score'] - df['defteam_score']
df['is_garbage_time'] = (
(df['game_seconds_remaining'] < 300) &
(df['score_diff'].abs() > 17)
).astype(int)
return df
# Apply cleaning
plays = filter_to_real_plays(pbp)
plays = add_derived_fields(plays)
print("\n" + "=" * 50)
print("FILTERED DATA SUMMARY")
print("=" * 50)
print(f"Plays after filtering: {len(plays):,}")
print(f"Pass plays: {plays['is_pass'].sum():,}")
print(f"Rush plays: {plays['is_rush'].sum():,}")
print(f"Third down plays: {plays['is_third_down'].sum():,}")
print(f"Red zone plays: {plays['is_red_zone'].sum():,}")
Phase 3: Team-Level Aggregation
"""
Part 3: Aggregate to Team-Level Statistics
"""
def calculate_offensive_stats(df: pd.DataFrame) -> pd.DataFrame:
"""Calculate offensive statistics by team."""
# Overall offensive metrics
overall = df.groupby('posteam').agg({
'play_id': 'count',
'epa': ['sum', 'mean'],
'is_success': 'mean',
'is_explosive': 'mean',
'yards_gained': 'sum'
}).reset_index()
overall.columns = ['team', 'plays', 'total_epa', 'epa_per_play',
'success_rate', 'explosive_rate', 'total_yards']
# Passing metrics
pass_plays = df[df['is_pass'] == 1]
passing = pass_plays.groupby('posteam').agg({
'play_id': 'count',
'epa': 'mean',
'is_success': 'mean',
'cpoe': 'mean',
'air_yards': 'mean'
}).reset_index()
passing.columns = ['team', 'dropbacks', 'pass_epa_per_play',
'pass_success_rate', 'cpoe', 'air_yards_per_att']
# Rushing metrics
rush_plays = df[df['is_rush'] == 1]
rushing = rush_plays.groupby('posteam').agg({
'play_id': 'count',
'epa': 'mean',
'is_success': 'mean',
'yards_gained': 'mean'
}).reset_index()
rushing.columns = ['team', 'carries', 'rush_epa_per_play',
'rush_success_rate', 'ypc']
# Third down
third_down = df[df['is_third_down'] == 1].groupby('posteam').agg({
'play_id': 'count',
'first_down': 'mean'
}).reset_index()
third_down.columns = ['team', 'third_down_att', 'third_down_conv_rate']
# Red zone
red_zone = df[(df['is_red_zone'] == 1)].groupby('posteam').agg({
'play_id': 'count',
'touchdown': 'mean'
}).reset_index()
red_zone.columns = ['team', 'red_zone_plays', 'red_zone_td_rate']
# Merge all
offense = overall.merge(passing, on='team')
offense = offense.merge(rushing, on='team')
offense = offense.merge(third_down, on='team')
offense = offense.merge(red_zone, on='team')
return offense
def calculate_defensive_stats(df: pd.DataFrame) -> pd.DataFrame:
"""Calculate defensive statistics by team."""
defense = df.groupby('defteam').agg({
'play_id': 'count',
'epa': ['sum', 'mean'],
'is_success': 'mean',
'yards_gained': 'sum'
}).reset_index()
defense.columns = ['team', 'plays_faced', 'total_epa_allowed',
'epa_per_play_allowed', 'opp_success_rate',
'total_yards_allowed']
# For defense, lower is better, so we'll negate EPA for ranking
defense['def_epa_per_play'] = -defense['epa_per_play_allowed']
return defense
# Calculate stats
offense_stats = calculate_offensive_stats(plays)
defense_stats = calculate_defensive_stats(plays)
# Add league averages
league_avg = {
'epa_per_play': plays['epa'].mean(),
'success_rate': plays['is_success'].mean(),
'pass_epa': plays[plays['is_pass'] == 1]['epa'].mean(),
'rush_epa': plays[plays['is_rush'] == 1]['epa'].mean()
}
print("\n" + "=" * 50)
print("LEAGUE AVERAGES")
print("=" * 50)
for metric, value in league_avg.items():
print(f"{metric}: {value:.3f}")
Phase 4: Dashboard Output
"""
Part 4: Create Dashboard Summary
"""
def create_team_dashboard(
team: str,
offense: pd.DataFrame,
defense: pd.DataFrame,
league_avg: dict
) -> str:
"""Generate text-based dashboard for a team."""
off = offense[offense['team'] == team].iloc[0]
def_ = defense[defense['team'] == team].iloc[0]
# Calculate ranks (1 = best)
off_rank = (offense['epa_per_play'].rank(ascending=False)
[offense['team'] == team].values[0])
def_rank = (defense['def_epa_per_play'].rank(ascending=False)
[defense['team'] == team].values[0])
lines = [
"=" * 60,
f"TEAM PERFORMANCE DASHBOARD: {team}",
f"Season: {SEASON}",
"=" * 60,
"",
"OFFENSIVE EFFICIENCY",
"-" * 40,
f" EPA/Play: {off['epa_per_play']:+.3f} (Rank: {int(off_rank)}/32)",
f" vs League Avg: {off['epa_per_play'] - league_avg['epa_per_play']:+.3f}",
f" Success Rate: {off['success_rate']:.1%}",
f" Explosive Rate: {off['explosive_rate']:.1%}",
"",
"PASSING ATTACK",
"-" * 40,
f" EPA/Dropback: {off['pass_epa_per_play']:+.3f}",
f" Success Rate: {off['pass_success_rate']:.1%}",
f" CPOE: {off['cpoe']:+.1f}%",
f" Air Yards/Att: {off['air_yards_per_att']:.1f}",
"",
"RUSHING ATTACK",
"-" * 40,
f" EPA/Carry: {off['rush_epa_per_play']:+.3f}",
f" Success Rate: {off['rush_success_rate']:.1%}",
f" Yards/Carry: {off['ypc']:.1f}",
"",
"SITUATIONAL",
"-" * 40,
f" 3rd Down Conv: {off['third_down_conv_rate']:.1%}",
f" Red Zone TD%: {off['red_zone_td_rate']:.1%}",
"",
"DEFENSIVE EFFICIENCY",
"-" * 40,
f" EPA/Play Allowed: {def_['epa_per_play_allowed']:+.3f} (Rank: {int(def_rank)}/32)",
f" Opp Success Rate: {def_['opp_success_rate']:.1%}",
"",
"=" * 60
]
return "\n".join(lines)
# Generate dashboards for select teams
for team in ['KC', 'SF', 'BAL', 'DET']:
dashboard = create_team_dashboard(team, offense_stats, defense_stats, league_avg)
print(dashboard)
print()
Results Summary
Sample Output: Kansas City Chiefs
============================================================
TEAM PERFORMANCE DASHBOARD: KC
Season: 2023
============================================================
OFFENSIVE EFFICIENCY
----------------------------------------
EPA/Play: +0.152 (Rank: 5/32)
vs League Avg: +0.152
Success Rate: 48.2%
Explosive Rate: 12.1%
PASSING ATTACK
----------------------------------------
EPA/Dropback: +0.185
Success Rate: 50.1%
CPOE: +2.3%
Air Yards/Att: 8.2
RUSHING ATTACK
----------------------------------------
EPA/Carry: +0.089
Success Rate: 45.2%
Yards/Carry: 4.8
SITUATIONAL
----------------------------------------
3rd Down Conv: 42.5%
Red Zone TD%: 58.3%
DEFENSIVE EFFICIENCY
----------------------------------------
EPA/Play Allowed: +0.045 (Rank: 12/32)
Opp Success Rate: 46.8%
============================================================
Discussion Questions
-
Why do we filter to "real plays" before calculating team statistics? What would happen if we included timeouts and penalties?
-
How might garbage time plays affect our team efficiency metrics? Should we exclude them?
-
The dashboard shows EPA per play, but some teams run more plays than others. How might total EPA be useful in addition to per-play EPA?
-
Why is CPOE (Completion Percentage Over Expected) more informative than raw completion percentage?
-
How would you extend this dashboard to include weekly trends instead of just season totals?
Your Turn: Extensions
Option A: Add week-by-week trending - Calculate rolling 3-week averages for key metrics - Create a line chart showing team performance over the season
Option B: Opponent-adjusted metrics - Calculate average opponent defensive EPA for games played - Adjust offensive EPA based on strength of schedule
Option C: Player-level drill-down - Add top 3 players by EPA for each team - Show player-level metrics within the team dashboard
Complete Code
See code/case-study-01-dashboard.py for the full implementation.