> "Code is read more often than it is written. Write for your future self."
Learning Objectives
- Configure a professional Python environment for football analytics
- Apply pandas operations efficiently for large play-by-play datasets
- Write reusable functions following software engineering best practices
- Use NumPy for vectorized numerical computations
- Debug common issues in football data processing
- Structure analytics projects for maintainability
In This Chapter
- Chapter Overview
- 3.1 Setting Up Your Environment
- 3.2 Pandas Essentials for Football Data
- 3.3 NumPy for Numerical Computing
- 3.4 Writing Reusable Functions
- 3.5 Working with nfl_data_py
- 3.6 Debugging and Profiling
- 3.7 Chapter Summary
- What's Next
- Chapter 3 Exercises → exercises.md
- Chapter 3 Quiz → quiz.md
- Case Study: Building a QB Comparison Tool → case-study-01.md
Chapter 3: Python for Football Analytics
"Code is read more often than it is written. Write for your future self." — Guido van Rossum (adapted)
Chapter Overview
In Chapter 2, we surveyed the NFL data ecosystem. Now we develop the programming skills to work with that data efficiently. This chapter isn't a general Python tutorial—it focuses specifically on the patterns and techniques that matter for football analytics.
Football data presents particular challenges: play-by-play datasets contain millions of rows, calculations must handle missing values gracefully, and analyses need to be reproducible across seasons. The techniques you learn here will save hours of debugging and enable analyses that would be impractical with naive approaches.
We emphasize three principles throughout:
1. Vectorization over loops: Let pandas and NumPy do the heavy lifting
2. Readability over cleverness: Code should explain itself
3. Reusability over repetition: Build functions you can trust
In this chapter, you will learn to:
- Set up a professional Python environment for analytics work
- Manipulate DataFrames efficiently using pandas best practices
- Write clean, documented, reusable functions
- Perform fast numerical computations with NumPy
- Debug and profile your code effectively
3.1 Setting Up Your Environment
3.1.1 Python Installation and Virtual Environments
A clean environment prevents dependency conflicts and ensures reproducibility.
Step 1: Install Python 3.9+
Download from python.org or use a package manager:
# macOS with Homebrew
brew install python@3.11
# Windows with winget
winget install Python.Python.3.11
# Linux (Ubuntu/Debian)
sudo apt update && sudo apt install python3.11 python3.11-venv
Step 2: Create a Virtual Environment
# Create project directory
mkdir football-analytics
cd football-analytics
# Create virtual environment
python -m venv venv
# Activate it
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Verify
python --version
which python # Should point to venv
Step 3: Install Core Packages
# Core data science stack
pip install numpy pandas scipy matplotlib seaborn
# Football-specific
pip install nfl-data-py
# Machine learning
pip install scikit-learn statsmodels xgboost
# Development tools
pip install jupyter black flake8 pytest
# Save dependencies
pip freeze > requirements.txt
3.1.2 IDE Configuration
For football analytics, we recommend VS Code or JupyterLab:
VS Code Setup:
1. Install the Python extension
2. Select your interpreter (the venv created above)
3. Enable format-on-save with Black
4. Configure linting with flake8
settings.json for VS Code:
{
"python.formatting.provider": "black",
"python.linting.enabled": true,
"python.linting.flake8Enabled": true,
"editor.formatOnSave": true,
"python.analysis.typeCheckingMode": "basic"
}
3.1.3 Project Structure
Organize your projects consistently:
football-analytics/
│
├── data/
│ ├── raw/ # Original downloaded data
│ ├── processed/ # Cleaned data
│ └── cache/ # Temporary cache files
│
├── notebooks/ # Jupyter notebooks for exploration
│ ├── 01_eda.ipynb
│ └── 02_modeling.ipynb
│
├── src/ # Reusable Python modules
│ ├── __init__.py
│ ├── data_loader.py
│ ├── features.py
│ └── visualization.py
│
├── tests/ # Unit tests
│ └── test_features.py
│
├── outputs/ # Generated figures, reports
│
├── requirements.txt # Dependencies
├── README.md # Project documentation
└── .gitignore # Files to exclude from git
"""
Example: Project Setup Verification
Run this script to verify your environment is configured correctly.
"""
import sys
from importlib import import_module
def check_environment():
"""Verify all required packages are installed."""
required_packages = [
('numpy', 'np'),
('pandas', 'pd'),
('matplotlib', 'plt'),
('seaborn', 'sns'),
('sklearn', None),
('nfl_data_py', 'nfl'),
]
print("Environment Check")
print("=" * 50)
print(f"Python version: {sys.version}")
print()
all_ok = True
for package, alias in required_packages:
try:
mod = import_module(package)
version = getattr(mod, '__version__', 'unknown')
print(f" [OK] {package}: {version}")
except ImportError:
print(f" [MISSING] {package}")
all_ok = False
print()
if all_ok:
print("All packages installed correctly!")
else:
print("Some packages are missing. Run: pip install -r requirements.txt")
return all_ok
if __name__ == "__main__":
check_environment()
3.2 Pandas Essentials for Football Data
3.2.1 Loading and Inspecting Data
"""
Pandas Fundamentals for Football Analytics
"""
import pandas as pd
import numpy as np
import nfl_data_py as nfl
# Load data
pbp = nfl.import_pbp_data([2023])
# Basic inspection
print(f"Shape: {pbp.shape}") # (rows, columns)
print(f"Columns: {len(pbp.columns)}") # Number of columns
print(f"Memory: {pbp.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# View structure
pbp.info() # Column types and non-null counts
pbp.head() # First 5 rows
pbp.describe() # Numeric column statistics
# Column exploration
pbp.columns.tolist() # All column names
pbp['play_type'].unique() # Unique values
pbp['play_type'].value_counts() # Value counts
3.2.2 Selecting and Filtering
Column Selection:
# Single column (returns Series)
epa_series = pbp['epa']
# Multiple columns (returns DataFrame)
subset = pbp[['game_id', 'play_id', 'posteam', 'epa']]
# Select columns by pattern
passing_cols = pbp.filter(like='pass') # Columns containing 'pass'
id_cols = pbp.filter(regex=r'_id$') # Columns ending with '_id'
Row Filtering:
# Boolean indexing
pass_plays = pbp[pbp['pass'] == 1]
high_epa = pbp[pbp['epa'] > 2]
# Multiple conditions (use & for AND, | for OR)
# IMPORTANT: Wrap each condition in parentheses
third_down_passes = pbp[(pbp['down'] == 3) & (pbp['pass'] == 1)]
chiefs_offense = pbp[(pbp['posteam'] == 'KC') & (pbp['epa'].notna())]
# Using query() for cleaner syntax
# Backticks are needed around `pass` because it is a Python keyword
third_down_passes = pbp.query("down == 3 and `pass` == 1")
chiefs_plays = pbp.query("posteam == 'KC' and epa == epa")  # epa == epa filters NaN
# isin() for multiple values
nfc_west = pbp[pbp['posteam'].isin(['SF', 'SEA', 'LAR', 'ARI'])]
early_downs = pbp[pbp['down'].isin([1, 2])]
Common Pitfall: Using and/or instead of &/| in pandas conditions. Python's and/or try to collapse each whole Series to a single True/False rather than comparing element-wise. Use & and |, and wrap each condition in parentheses, because the bitwise operators bind more tightly than comparisons.
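The failure is loud rather than subtle, which makes it easy to demonstrate:

# Python's `and` tries to reduce each Series to one True/False
try:
    bad = pbp[(pbp['down'] == 3) and (pbp['pass'] == 1)]
except ValueError as err:
    print(f"ValueError: {err}")  # "The truth value of a Series is ambiguous..."

# The element-wise & version works as intended
good = pbp[(pbp['down'] == 3) & (pbp['pass'] == 1)]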
3.2.3 Creating and Modifying Columns
# New column from calculation
pbp['total_air_yac'] = pbp['air_yards'] + pbp['yards_after_catch']
# Conditional column with np.where
pbp['is_success'] = np.where(pbp['epa'] > 0, 1, 0)
# Multiple conditions with np.select
conditions = [
pbp['epa'] > 1,
pbp['epa'] > 0,
pbp['epa'] > -1,
]
choices = ['great', 'good', 'neutral']
pbp['epa_category'] = np.select(conditions, choices, default='bad')
# Apply custom function
def categorize_distance(ydstogo):
if pd.isna(ydstogo):
return 'unknown'
elif ydstogo <= 3:
return 'short'
elif ydstogo <= 7:
return 'medium'
else:
return 'long'
pbp['distance_category'] = pbp['ydstogo'].apply(categorize_distance)
# Vectorized string operations
pbp['home_team_upper'] = pbp['home_team'].str.upper()
pbp['has_mahomes'] = pbp['desc'].str.contains('Mahomes', na=False)
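For pure binning like categorize_distance, pd.cut is a vectorized alternative worth knowing: it assigns every value to a labeled bin in one pass instead of calling a Python function per row. One behavioral difference in this sketch: pd.cut leaves missing ydstogo values as NaN rather than mapping them to 'unknown'.

# Vectorized binning with pd.cut (bins are right-inclusive by default,
# so these correspond to 1-3, 4-7, and 8+ yards to go)
pbp['distance_category'] = pd.cut(
    pbp['ydstogo'],
    bins=[0, 3, 7, np.inf],
    labels=['short', 'medium', 'long']
)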
3.2.4 Groupby Aggregations
Groupby is the workhorse of football analytics—calculating statistics by team, player, game, or situation.
# Basic groupby
team_epa = pbp.groupby('posteam')['epa'].mean()
# Multiple aggregations
team_stats = pbp.groupby('posteam').agg({
'epa': ['sum', 'mean', 'count'],
'yards_gained': 'sum',
'success': 'mean'
})
# Flatten multi-level columns
team_stats.columns = ['_'.join(col).strip() for col in team_stats.columns]
team_stats = team_stats.reset_index()
# Named aggregations (cleaner approach)
team_stats = pbp.groupby('posteam').agg(
total_epa=('epa', 'sum'),
epa_per_play=('epa', 'mean'),
plays=('epa', 'count'),
total_yards=('yards_gained', 'sum'),
success_rate=('success', 'mean')
).reset_index()
# Multiple groupby columns
qb_weekly = pbp.groupby(['week', 'passer_player_name']).agg(
dropbacks=('pass', 'sum'),
epa=('epa', 'sum'),
cpoe=('cpoe', 'mean')
).reset_index()
# Filter groups (e.g., QBs with 20+ dropbacks per week)
qb_weekly_qualified = qb_weekly[qb_weekly['dropbacks'] >= 20]
Advanced Groupby Patterns:
# Transform: apply function and return same-shaped result
# Useful for calculating within-group statistics
pbp['team_mean_epa'] = pbp.groupby('posteam')['epa'].transform('mean')
pbp['epa_vs_team_avg'] = pbp['epa'] - pbp['team_mean_epa']
# Cumulative operations
pbp['cumulative_epa'] = pbp.groupby('game_id')['epa'].cumsum()
# Rolling calculations
pbp['rolling_epa_5'] = (
pbp.sort_values(['game_id', 'play_id'])
.groupby('game_id')['epa']
.transform(lambda x: x.rolling(5, min_periods=1).mean())
)
# Rank within groups
pbp['epa_rank_in_game'] = pbp.groupby('game_id')['epa'].rank(ascending=False)
3.2.5 Merging DataFrames
Combining data from different sources is essential.
# Load additional data
rosters = nfl.import_rosters([2023])
schedules = nfl.import_schedules([2023])
# Inner join (only matching rows)
pbp_with_roster = pbp.merge(
rosters[['player_id', 'player_name', 'position', 'team']],
left_on='passer_player_id',
right_on='player_id',
how='inner'
)
# Left join (keep all rows from left DataFrame)
pbp_with_schedule = pbp.merge(
schedules[['game_id', 'home_team', 'away_team', 'home_score', 'away_score']],
on='game_id',
how='left'
)
# Handle duplicate column names with suffixes
combined = df1.merge(df2, on='key', suffixes=('_left', '_right'))
# Merge on multiple columns
player_game_stats = stats1.merge(
stats2,
on=['player_id', 'game_id'],
how='outer'
)
3.2.6 Method Chaining
Chain operations for readable, pipeline-style code:
# Without chaining (harder to read)
filtered = pbp[pbp['pass'] == 1]
filtered = filtered[filtered['epa'].notna()]
grouped = filtered.groupby('passer_player_name')['epa'].agg(['sum', 'mean', 'count'])
grouped = grouped.reset_index()
grouped.columns = ['player', 'total_epa', 'epa_per_play', 'dropbacks']
result = grouped[grouped['dropbacks'] >= 200]
result = result.sort_values('epa_per_play', ascending=False)
# With chaining (cleaner)
result = (
pbp
.query("pass == 1 and epa == epa") # Filter passes with valid EPA
.groupby('passer_player_name')
.agg(
total_epa=('epa', 'sum'),
epa_per_play=('epa', 'mean'),
dropbacks=('pass', 'count')
)
.reset_index()
.query("dropbacks >= 200")
.sort_values('epa_per_play', ascending=False)
.rename(columns={'passer_player_name': 'player'})
)
Best Practice: Method chaining improves readability when each step is clear. Break into multiple statements if chains become too long or complex.
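One way to break up a long chain without abandoning the pipeline style is .pipe(), which applies a named function to the DataFrame mid-chain. A minimal sketch, with illustrative helper names:

def keep_valid_passes(df):
    """Pass plays with non-missing EPA."""
    return df.query("`pass` == 1 and epa == epa")

def summarize_by_passer(df):
    """EPA summary per passer."""
    return (
        df.groupby('passer_player_name')
        .agg(epa_per_play=('epa', 'mean'), dropbacks=('pass', 'count'))
        .reset_index()
    )

result = (
    pbp
    .pipe(keep_valid_passes)       # each step is now named and testable
    .pipe(summarize_by_passer)
    .query("dropbacks >= 200")
)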
3.3 NumPy for Numerical Computing
3.3.1 Why NumPy?
NumPy provides fast, vectorized operations on arrays. Many pandas operations use NumPy under the hood.
import numpy as np
# Speed comparison
import time
data = list(range(1_000_000))
arr = np.array(data)
# Python loop
start = time.time()
result_loop = [x ** 2 for x in data]
loop_time = time.time() - start
# NumPy vectorized
start = time.time()
result_numpy = arr ** 2
numpy_time = time.time() - start
print(f"Loop: {loop_time:.3f}s, NumPy: {numpy_time:.3f}s")
print(f"NumPy is {loop_time / numpy_time:.0f}x faster")
3.3.2 Essential NumPy Operations
# Array creation
zeros = np.zeros(10)
ones = np.ones((3, 4)) # 3x4 matrix
range_arr = np.arange(0, 100, 5) # 0, 5, 10, ..., 95
linspace = np.linspace(0, 1, 11) # 11 points from 0 to 1
# From pandas
epa_array = pbp['epa'].values # Convert to numpy array
# Basic statistics
np.mean(epa_array)
np.std(epa_array)
np.median(epa_array)
np.percentile(epa_array, [25, 50, 75])
# Handle NaN values
np.nanmean(epa_array) # Ignores NaN
np.nanstd(epa_array)
# Boolean operations
positive_mask = epa_array > 0
count_positive = np.sum(positive_mask)
pct_positive = np.mean(positive_mask)
# Conditional selection
np.where(epa_array > 0, 'success', 'failure')
# Mathematical operations
np.sqrt(np.abs(epa_array))
np.log1p(np.maximum(epa_array, 0)) # log(1+x) for non-negative values
np.clip(epa_array, -5, 5) # Limit to range
3.3.3 Vectorized Calculations for Football
def calculate_expected_points_simple(
yardline: np.ndarray,
down: np.ndarray,
ydstogo: np.ndarray
) -> np.ndarray:
"""
Simplified EP calculation using vectorized operations.
Parameters
----------
yardline : np.ndarray
Yards from opponent's goal line
down : np.ndarray
Current down (1-4)
ydstogo : np.ndarray
Yards to first down
Returns
-------
np.ndarray
Expected points estimate
"""
# Base EP from field position (closer to goal = higher EP)
base_ep = (100 - yardline) * 0.06 - 1.5
# Down adjustment
down_adj = np.select(
[down == 1, down == 2, down == 3, down == 4],
[0.5, 0.2, -0.3, -1.0],
default=0
)
# Distance adjustment
dist_adj = np.where(ydstogo <= 3, 0.3, np.where(ydstogo >= 10, -0.3, 0))
return base_ep + down_adj + dist_adj
# Apply to DataFrame
pbp['ep_simple'] = calculate_expected_points_simple(
pbp['yardline_100'].values,
pbp['down'].values,
pbp['ydstogo'].values
)
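A quick sanity check helps confirm the vectorized logic behaves sensibly: under this toy model, EP should rise as the offense nears the goal line. The spot check below exercises the function at a few 1st-and-10 states (the numbers come from the simplified formula above, not a real EP model):

# Spot-check: EP should increase monotonically as yardline_100 shrinks
for yl in [80, 50, 20, 5]:
    ep = calculate_expected_points_simple(
        np.array([yl]), np.array([1]), np.array([10])
    )[0]
    print(f"1st-and-10, {yl} yards from the end zone: EP = {ep:+.2f}")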
3.4 Writing Reusable Functions
3.4.1 Function Design Principles
Good functions are:
- Focused: Do one thing well
- Documented: Clear docstring explaining purpose, parameters, returns
- Typed: Type hints clarify expected inputs/outputs
- Tested: Unit tests verify correctness
from typing import List, Optional, Union
import pandas as pd
import numpy as np
def calculate_success_rate(
plays: pd.DataFrame,
group_cols: Optional[List[str]] = None,
min_plays: int = 50
) -> pd.DataFrame:
"""
Calculate success rate (EPA > 0) by group.
Success rate measures the percentage of plays that generate
positive expected points added, indicating the offense improved
their scoring expectation.
Parameters
----------
plays : pd.DataFrame
Play-by-play data containing 'epa' column
group_cols : List[str], optional
Columns to group by. If None, returns overall success rate.
min_plays : int, default=50
Minimum plays required for a group to be included
Returns
-------
pd.DataFrame
Success rate and play count by group
Raises
------
ValueError
If 'epa' column is missing from plays
Examples
--------
>>> plays = load_pbp([2023])
>>> team_success = calculate_success_rate(plays, ['posteam'])
>>> team_success.head()
posteam success_rate plays
0 ARI 0.421 1045
1 ATL 0.456 1102
"""
if 'epa' not in plays.columns:
raise ValueError("DataFrame must contain 'epa' column")
# Filter to valid EPA values
valid_plays = plays[plays['epa'].notna()].copy()
valid_plays['success'] = (valid_plays['epa'] > 0).astype(int)
if group_cols is None:
# Overall success rate
return pd.DataFrame({
'success_rate': [valid_plays['success'].mean()],
'plays': [len(valid_plays)]
})
# Grouped success rate
result = (
valid_plays
.groupby(group_cols)
.agg(
success_rate=('success', 'mean'),
plays=('success', 'count')
)
.reset_index()
.query(f"plays >= {min_plays}")
.sort_values('success_rate', ascending=False)
)
return result
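Following the "Tested" principle, a minimal pytest suite for this function might look like the sketch below. It assumes the function lives in src/features.py, matching the project layout from Section 3.1.3, and builds tiny hand-made DataFrames so the expected answers are obvious.

import pandas as pd
import pytest

from src.features import calculate_success_rate  # assumed module path

def test_overall_success_rate():
    # 2 of the 4 valid plays have EPA > 0; the NaN row is ignored
    plays = pd.DataFrame({'epa': [1.2, -0.5, 0.3, -1.1, float('nan')]})
    result = calculate_success_rate(plays)
    assert result['success_rate'].iloc[0] == pytest.approx(0.5)
    assert result['plays'].iloc[0] == 4

def test_missing_epa_column_raises():
    with pytest.raises(ValueError):
        calculate_success_rate(pd.DataFrame({'yards_gained': [5]}))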
3.4.2 Common Football Analysis Functions
def filter_to_pass_plays(
pbp: pd.DataFrame,
include_scrambles: bool = True,
exclude_spikes: bool = True
) -> pd.DataFrame:
"""
Filter play-by-play to passing plays.
Parameters
----------
pbp : pd.DataFrame
Play-by-play data
include_scrambles : bool, default=True
Whether to include QB scrambles
exclude_spikes : bool, default=True
Whether to exclude spike plays
Returns
-------
pd.DataFrame
Filtered passing plays
"""
mask = pbp['pass'] == 1
if not include_scrambles:
mask &= pbp['qb_scramble'] != 1
if exclude_spikes:
mask &= pbp['qb_spike'] != 1
return pbp[mask].copy()
def calculate_epa_per_play(
plays: pd.DataFrame,
group_cols: List[str],
play_type: Optional[str] = None,
min_plays: int = 50
) -> pd.DataFrame:
"""
Calculate EPA per play by group.
Parameters
----------
plays : pd.DataFrame
Play-by-play data
group_cols : List[str]
Columns to group by
play_type : str, optional
Filter to 'pass' or 'rush'
min_plays : int, default=50
Minimum plays for inclusion
Returns
-------
pd.DataFrame
EPA per play statistics by group
"""
df = plays.copy()
if play_type == 'pass':
df = df[df['pass'] == 1]
elif play_type == 'rush':
df = df[df['rush'] == 1]
result = (
df[df['epa'].notna()]
.groupby(group_cols)
.agg(
epa_total=('epa', 'sum'),
epa_per_play=('epa', 'mean'),
plays=('epa', 'count')
)
.reset_index()
.query(f"plays >= {min_plays}")
.sort_values('epa_per_play', ascending=False)
)
return result
def add_game_context(pbp: pd.DataFrame) -> pd.DataFrame:
"""
Add game context columns to play-by-play data.
Adds:
- score_differential: posteam score minus defteam score
- is_trailing: whether offense is behind
- is_close: game within 8 points
- is_garbage_time: low leverage end-of-game situation
- quarter_seconds_remaining: time left in quarter
Parameters
----------
pbp : pd.DataFrame
Play-by-play data
Returns
-------
pd.DataFrame
Data with additional context columns
"""
df = pbp.copy()
# Score context
df['score_differential'] = df['posteam_score'] - df['defteam_score']
df['is_trailing'] = (df['score_differential'] < 0).astype(int)
df['is_leading'] = (df['score_differential'] > 0).astype(int)
df['is_close'] = (df['score_differential'].abs() <= 8).astype(int)
    # Garbage time (margin of 17+ points with under 5 minutes left)
    df['is_garbage_time'] = (
        (df['game_seconds_remaining'] < 300) &
        (df['score_differential'].abs() >= 17)
    ).astype(int)
    # Time context: seconds left in the current quarter
    # (nflfastR play-by-play already includes quarter_seconds_remaining;
    # this modulo version is illustrative and is ambiguous only at
    # exact quarter boundaries)
    df['quarter_seconds_remaining'] = df['game_seconds_remaining'] % 900
return df
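A typical use of these flags is stripping out low-leverage plays before computing efficiency numbers. Whether to exclude garbage time is an analytical choice, but mechanically it is a one-line filter:

# EPA per play in competitive situations only
pbp_ctx = add_game_context(pbp)
competitive = pbp_ctx[pbp_ctx['is_garbage_time'] == 0]
print(competitive.groupby('posteam')['epa'].mean().sort_values(ascending=False).head())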
3.4.3 Creating Analysis Pipelines
from typing import Callable, List
from functools import reduce
def compose(*functions: Callable) -> Callable:
"""
Compose multiple functions into a single function.
Functions are applied left to right.
Example
-------
>>> pipeline = compose(filter_passes, add_context, calculate_stats)
>>> result = pipeline(pbp)
"""
def apply(x):
return reduce(lambda v, f: f(v), functions, x)
return apply
class AnalysisPipeline:
"""
Builder for creating reproducible analysis pipelines.
Example
-------
>>> pipeline = (
... AnalysisPipeline(pbp)
... .filter(lambda df: df[df['pass'] == 1])
... .add_column('is_success', lambda df: (df['epa'] > 0).astype(int))
... .aggregate(['posteam'], {'epa': 'mean', 'is_success': 'mean'})
... .execute()
... )
"""
def __init__(self, data: pd.DataFrame):
self.data = data.copy()
self.steps: List[Callable] = []
def filter(self, condition: Callable) -> 'AnalysisPipeline':
"""Add a filter step."""
self.steps.append(lambda df: condition(df))
return self
def add_column(self, name: str, func: Callable) -> 'AnalysisPipeline':
"""Add a new column."""
def add_col(df):
df = df.copy()
df[name] = func(df)
return df
self.steps.append(add_col)
return self
def aggregate(
self,
group_cols: List[str],
agg_dict: dict
) -> 'AnalysisPipeline':
"""Add an aggregation step."""
def agg_step(df):
return df.groupby(group_cols).agg(agg_dict).reset_index()
self.steps.append(agg_step)
return self
def execute(self) -> pd.DataFrame:
"""Execute all pipeline steps."""
result = self.data
for step in self.steps:
result = step(result)
return result
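Putting the pieces together, compose() can chain the helper functions defined earlier in this section into a single reusable transformation; a sketch:

# Reusable preparation pipeline built from this chapter's helpers
prep_passes = compose(
    add_game_context,       # applied first
    filter_to_pass_plays,   # applied second
)
pass_plays = prep_passes(pbp)
team_pass_epa = calculate_epa_per_play(pass_plays, ['posteam'])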
3.5 Working with nfl_data_py
3.5.1 Complete Data Loading Reference
import nfl_data_py as nfl
from typing import List, Optional
import pandas as pd
class NFLData:
"""
Comprehensive wrapper for nfl_data_py functionality.
Provides caching, common filters, and convenient methods.
"""
def __init__(self, cache_enabled: bool = True):
self.cache_enabled = cache_enabled
self._cache = {}
def pbp(
self,
seasons: List[int],
columns: Optional[List[str]] = None
) -> pd.DataFrame:
"""Load play-by-play data."""
key = ('pbp', tuple(seasons))
if self.cache_enabled and key in self._cache:
df = self._cache[key]
else:
df = nfl.import_pbp_data(seasons)
if self.cache_enabled:
self._cache[key] = df
if columns:
return df[[c for c in columns if c in df.columns]]
return df
def rosters(self, seasons: List[int]) -> pd.DataFrame:
"""Load roster data."""
key = ('rosters', tuple(seasons))
if self.cache_enabled and key in self._cache:
return self._cache[key]
df = nfl.import_rosters(seasons)
if self.cache_enabled:
self._cache[key] = df
return df
def schedules(self, seasons: List[int]) -> pd.DataFrame:
"""Load schedule data."""
return nfl.import_schedules(seasons)
def weekly_stats(self, seasons: List[int]) -> pd.DataFrame:
"""Load weekly player statistics."""
return nfl.import_weekly_data(seasons)
def seasonal_stats(self, seasons: List[int]) -> pd.DataFrame:
"""Load seasonal player statistics."""
return nfl.import_seasonal_data(seasons)
def combine(self, seasons: List[int]) -> pd.DataFrame:
"""Load NFL Combine data."""
return nfl.import_combine_data(seasons)
def draft_picks(self, seasons: List[int]) -> pd.DataFrame:
"""Load draft pick data."""
return nfl.import_draft_picks(seasons)
def player_ids(self) -> pd.DataFrame:
"""Load player ID mapping table."""
return nfl.import_ids()
def clear_cache(self):
"""Clear all cached data."""
self._cache.clear()
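# Example usage: with caching enabled, repeated loads of the same
# seasons (common in notebooks) are served from memory after the
# first download
data = NFLData(cache_enabled=True)
pbp_2023 = data.pbp([2023])   # first call downloads
pbp_again = data.pbp([2023])  # second call hits the cache
epa_cols = data.pbp([2023], columns=['game_id', 'play_id', 'epa'])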
# Convenience functions
def load_passing_plays(seasons: List[int]) -> pd.DataFrame:
"""Load and filter to passing plays."""
pbp = nfl.import_pbp_data(seasons)
return pbp[pbp['pass'] == 1].copy()
def load_rushing_plays(seasons: List[int]) -> pd.DataFrame:
"""Load and filter to rushing plays."""
pbp = nfl.import_pbp_data(seasons)
return pbp[pbp['rush'] == 1].copy()
def load_qb_stats(seasons: List[int], min_dropbacks: int = 100) -> pd.DataFrame:
"""
Load quarterback statistics aggregated by season.
Parameters
----------
seasons : List[int]
Seasons to load
min_dropbacks : int
Minimum dropbacks for inclusion
Returns
-------
pd.DataFrame
QB statistics with key metrics
"""
pbp = nfl.import_pbp_data(seasons)
passing = pbp[pbp['pass'] == 1]
qb_stats = (
passing
.groupby(['season', 'passer_player_id', 'passer_player_name'])
.agg(
dropbacks=('pass', 'count'),
completions=('complete_pass', 'sum'),
attempts=('pass_attempt', 'sum'),
yards=('yards_gained', 'sum'),
tds=('pass_touchdown', 'sum'),
ints=('interception', 'sum'),
epa_total=('epa', 'sum'),
epa_per_play=('epa', 'mean'),
cpoe=('cpoe', 'mean'),
air_yards=('air_yards', 'sum'),
yac=('yards_after_catch', 'sum')
)
.reset_index()
.query(f"dropbacks >= {min_dropbacks}")
.sort_values('epa_total', ascending=False)
)
qb_stats['completion_pct'] = qb_stats['completions'] / qb_stats['attempts']
qb_stats['yards_per_attempt'] = qb_stats['yards'] / qb_stats['attempts']
return qb_stats
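Usage is a one-liner; raising min_dropbacks narrows the table from part-time passers to regular starters:

qbs = load_qb_stats([2023], min_dropbacks=300)
print(qbs[['passer_player_name', 'dropbacks', 'epa_per_play', 'completion_pct']].head(10))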
3.6 Debugging and Profiling
3.6.1 Common Errors and Solutions
"""
Common Pandas Errors and How to Fix Them
"""
import pandas as pd
import numpy as np
# ERROR 1: SettingWithCopyWarning
# BAD: Modifying a view instead of a copy
df_filtered = df[df['pass'] == 1]
df_filtered['new_col'] = 1 # Warning!
# GOOD: Explicitly copy
df_filtered = df[df['pass'] == 1].copy()
df_filtered['new_col'] = 1 # No warning
# ERROR 2: KeyError when column doesn't exist
# BAD: Assuming column exists
value = df['nonexistent_column'] # KeyError!
# GOOD: Check first or use get
if 'column' in df.columns:
value = df['column']
# Or use .get(), which returns a default instead of raising
value = df.get('column', default=None)
# ERROR 3: Boolean indexing silently drops NaN rows
# Comparisons with NaN evaluate to False, so rows with missing EPA
# are excluded without warning
df[df['epa'] > 0]  # Rows where epa is NaN are dropped
# GOOD: Decide explicitly what NaN should mean
df[(df['epa'] > 0) | (df['epa'].isna())]  # Keep the NaN rows
df[df['epa'].fillna(0) > 0]  # Treat NaN as 0
# ERROR 4: Merging creates duplicates
# BAD: Not checking for duplicates before merge
merged = df1.merge(df2, on='key') # May explode in size
# GOOD: Verify keys are unique
assert df2['key'].is_unique, "Duplicate keys in df2"
merged = df1.merge(df2, on='key')
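# ALTERNATIVE: let pandas enforce the relationship itself.
# merge(validate=...) raises MergeError if the keys violate the
# declared relationship ('1:1', '1:m', 'm:1', or 'm:m')
merged = df1.merge(df2, on='key', validate='m:1')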
# ERROR 5: Memory issues with large data
# BAD: Loading all columns
pbp = nfl.import_pbp_data([2020, 2021, 2022, 2023]) # ~2GB+
# GOOD: Select only needed columns
cols = ['game_id', 'play_id', 'epa', 'posteam']
pbp = nfl.import_pbp_data([2023], columns=cols) # Much smaller
3.6.2 Performance Profiling
import time
from functools import wraps
def timer(func):
"""Decorator to time function execution."""
@wraps(func)
def wrapper(*args, **kwargs):
start = time.perf_counter()
result = func(*args, **kwargs)
end = time.perf_counter()
print(f"{func.__name__}: {end - start:.3f} seconds")
return result
return wrapper
@timer
def slow_calculation(df):
"""Example slow function."""
return df.apply(lambda row: row['epa'] * 2, axis=1)
@timer
def fast_calculation(df):
"""Example fast function."""
return df['epa'] * 2
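# Quick demonstration on synthetic data (timings vary by machine;
# the vectorized version is typically orders of magnitude faster)
demo = pd.DataFrame({'epa': np.random.normal(0, 1, 100_000)})
slow_calculation(demo)
fast_calculation(demo)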
# Memory profiling
def memory_usage(df: pd.DataFrame) -> str:
"""Get human-readable memory usage."""
bytes_used = df.memory_usage(deep=True).sum()
if bytes_used > 1e9:
return f"{bytes_used / 1e9:.2f} GB"
elif bytes_used > 1e6:
return f"{bytes_used / 1e6:.2f} MB"
else:
return f"{bytes_used / 1e3:.2f} KB"
def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
"""
Reduce memory by optimizing data types.
Converts:
- float64 to float32 where possible
- int64 to smaller int types
- object to category for low-cardinality columns
"""
df = df.copy()
for col in df.columns:
col_type = df[col].dtype
if col_type == 'float64':
df[col] = df[col].astype('float32')
        elif col_type == 'int64':
            col_min, col_max = df[col].min(), df[col].max()
            if col_min >= 0 and col_max < 256:
                df[col] = df[col].astype('uint8')
            elif col_min >= -128 and col_max < 128:
                df[col] = df[col].astype('int8')
            elif col_min >= 0 and col_max < 65536:
                df[col] = df[col].astype('uint16')
            elif col_min >= -2**31 and col_max < 2**31:
                # Guard against overflow: very large values stay int64
                df[col] = df[col].astype('int32')
elif col_type == 'object':
n_unique = df[col].nunique()
n_total = len(df[col])
if n_unique / n_total < 0.5: # Low cardinality
df[col] = df[col].astype('category')
return df
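Combining the two helpers shows the payoff directly; on play-by-play data, converting low-cardinality object columns (teams, player names) to category typically drives most of the savings:

print(f"Before: {memory_usage(pbp)}")
pbp_small = optimize_dtypes(pbp)
print(f"After:  {memory_usage(pbp_small)}")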
3.7 Chapter Summary
Key Concepts
- Virtual environments isolate project dependencies and ensure reproducibility.
- Pandas operations should use vectorized constructs (column arithmetic, np.where(), np.select()) rather than row-by-row loops or .apply() over rows.
- Method chaining creates readable, pipeline-style code for data transformations.
- Groupby aggregations are the foundation of football statistics: calculating metrics by team, player, game, or situation.
- Well-designed functions have clear purposes, type hints, docstrings, and handle edge cases gracefully.
- NumPy vectorization provides 10-100x speedups compared to Python loops.
Key Code Patterns
# Pattern 1: Filter and aggregate
result = (
pbp
.query("pass == 1 and epa == epa")
.groupby('posteam')
.agg(epa_per_play=('epa', 'mean'))
.reset_index()
.sort_values('epa_per_play', ascending=False)
)
# Pattern 2: Conditional column creation
df['success'] = np.where(df['epa'] > 0, 1, 0)
# Pattern 3: Transform within groups
df['team_avg'] = df.groupby('posteam')['epa'].transform('mean')
df['vs_avg'] = df['epa'] - df['team_avg']
# Pattern 4: Safe column access
value = df.get('column', default=None)
What's Next
In Chapter 4: Exploratory Data Analysis for Football, we apply these programming skills to understand data before modeling it. You'll learn visualization techniques, pattern recognition, and the EDA mindset that separates good analysts from great ones.