Chapter 3: Python for Football Analytics

> "Code is read more often than it is written. Write for your future self." — Guido van Rossum (adapted)

Learning Objectives

  • Configure a professional Python environment for football analytics
  • Apply pandas operations efficiently for large play-by-play datasets
  • Write reusable functions following software engineering best practices
  • Use NumPy for vectorized numerical computations
  • Debug common issues in football data processing
  • Structure analytics projects for maintainability


Chapter Overview

In Chapter 2, we surveyed the NFL data ecosystem. Now we develop the programming skills to work with that data efficiently. This chapter isn't a general Python tutorial—it focuses specifically on the patterns and techniques that matter for football analytics.

Football data presents particular challenges: play-by-play datasets contain millions of rows, calculations must handle missing values gracefully, and analyses need to be reproducible across seasons. The techniques you learn here will save hours of debugging and enable analyses that would be impractical with naive approaches.

We emphasize three principles throughout:

  1. Vectorization over loops: Let pandas and NumPy do the heavy lifting
  2. Readability over cleverness: Code should explain itself
  3. Reusability over repetition: Build functions you can trust

In this chapter, you will learn to:

  • Set up a professional Python environment for analytics work
  • Manipulate DataFrames efficiently using pandas best practices
  • Write clean, documented, reusable functions
  • Perform fast numerical computations with NumPy
  • Debug and profile your code effectively


3.1 Setting Up Your Environment

3.1.1 Python Installation and Virtual Environments

A clean environment prevents dependency conflicts and ensures reproducibility.

Step 1: Install Python 3.9+

Download from python.org or use a package manager:

# macOS with Homebrew
brew install python@3.11

# Windows with winget
winget install Python.Python.3.11

# Linux (Ubuntu/Debian)
sudo apt update && sudo apt install python3.11 python3.11-venv

Step 2: Create a Virtual Environment

# Create project directory
mkdir football-analytics
cd football-analytics

# Create virtual environment
python -m venv venv

# Activate it
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Verify
python --version
which python  # Should point to venv

Step 3: Install Core Packages

# Core data science stack
pip install numpy pandas scipy matplotlib seaborn

# Football-specific
pip install nfl-data-py

# Machine learning
pip install scikit-learn statsmodels xgboost

# Development tools
pip install jupyter black flake8 pytest

# Save dependencies
pip freeze > requirements.txt

3.1.2 IDE Configuration

For football analytics, we recommend VS Code or JupyterLab:

VS Code Setup:

  1. Install the Python extension
  2. Select your interpreter (the project venv)
  3. Enable format-on-save with Black
  4. Configure linting with flake8

settings.json for VS Code:

{
    "python.formatting.provider": "black",
    "python.linting.enabled": true,
    "python.linting.flake8Enabled": true,
    "editor.formatOnSave": true,
    "python.analysis.typeCheckingMode": "basic"
}

3.1.3 Project Structure

Organize your projects consistently:

football-analytics/
│
├── data/
│   ├── raw/           # Original downloaded data
│   ├── processed/     # Cleaned data
│   └── cache/         # Temporary cache files
│
├── notebooks/         # Jupyter notebooks for exploration
│   ├── 01_eda.ipynb
│   └── 02_modeling.ipynb
│
├── src/               # Reusable Python modules
│   ├── __init__.py
│   ├── data_loader.py
│   ├── features.py
│   └── visualization.py
│
├── tests/             # Unit tests
│   └── test_features.py
│
├── outputs/           # Generated figures, reports
│
├── requirements.txt   # Dependencies
├── README.md          # Project documentation
└── .gitignore         # Files to exclude from git
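
If you prefer to script this layout, the snippet below creates the skeleton with pathlib (a minimal sketch; the scaffold_project name and the choice of files to touch are illustrative):

from pathlib import Path


def scaffold_project(root: str = "football-analytics") -> None:
    """Create the standard analytics project skeleton shown above."""
    base = Path(root)
    for subdir in [
        "data/raw", "data/processed", "data/cache",
        "notebooks", "src", "tests", "outputs",
    ]:
        (base / subdir).mkdir(parents=True, exist_ok=True)
    # Touch top-level files so the layout is complete from day one
    (base / "src" / "__init__.py").touch()
    for fname in ["requirements.txt", "README.md", ".gitignore"]:
        (base / fname).touch()


if __name__ == "__main__":
    scaffold_project()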
"""
Example: Project Setup Verification

Run this script to verify your environment is configured correctly.
"""

import sys
from importlib import import_module

def check_environment():
    """Verify all required packages are installed."""
    required_packages = [
        'numpy',
        'pandas',
        'matplotlib',
        'seaborn',
        'sklearn',
        'nfl_data_py',
    ]

    print("Environment Check")
    print("=" * 50)
    print(f"Python version: {sys.version}")
    print()

    all_ok = True
    for package in required_packages:
        try:
            mod = import_module(package)
            version = getattr(mod, '__version__', 'unknown')
            print(f"  [OK] {package}: {version}")
        except ImportError:
            print(f"  [MISSING] {package}")
            all_ok = False

    print()
    if all_ok:
        print("All packages installed correctly!")
    else:
        print("Some packages are missing. Run: pip install -r requirements.txt")

    return all_ok


if __name__ == "__main__":
    check_environment()

3.2 Pandas Essentials for Football Data

3.2.1 Loading and Inspecting Data

"""
Pandas Fundamentals for Football Analytics
"""

import pandas as pd
import numpy as np
import nfl_data_py as nfl

# Load data
pbp = nfl.import_pbp_data([2023])

# Basic inspection
print(f"Shape: {pbp.shape}")          # (rows, columns)
print(f"Columns: {len(pbp.columns)}") # Number of columns
print(f"Memory: {pbp.memory_usage(deep=True).sum() / 1e6:.1f} MB")

# View structure
pbp.info()                            # Column types and non-null counts
pbp.head()                            # First 5 rows
pbp.describe()                        # Numeric column statistics

# Column exploration
pbp.columns.tolist()                  # All column names
pbp['play_type'].unique()             # Unique values
pbp['play_type'].value_counts()       # Value counts

3.2.2 Selecting and Filtering

Column Selection:

# Single column (returns Series)
epa_series = pbp['epa']

# Multiple columns (returns DataFrame)
subset = pbp[['game_id', 'play_id', 'posteam', 'epa']]

# Select columns by pattern
passing_cols = pbp.filter(like='pass')  # Columns containing 'pass'
id_cols = pbp.filter(regex=r'_id$')     # Columns ending with '_id'

Row Filtering:

# Boolean indexing
pass_plays = pbp[pbp['pass'] == 1]
high_epa = pbp[pbp['epa'] > 2]

# Multiple conditions (use & for AND, | for OR)
# IMPORTANT: Wrap each condition in parentheses
third_down_passes = pbp[(pbp['down'] == 3) & (pbp['pass'] == 1)]

chiefs_offense = pbp[(pbp['posteam'] == 'KC') & (pbp['epa'].notna())]

# Using query() for cleaner syntax. Note that `pass` is a Python
# keyword, so it must be backtick-quoted inside query strings.
third_down_passes = pbp.query("down == 3 and `pass` == 1")
chiefs_plays = pbp.query("posteam == 'KC' and epa == epa")  # epa == epa drops NaN (NaN != NaN)

# isin() for multiple values
nfc_west = pbp[pbp['posteam'].isin(['SF', 'SEA', 'LAR', 'ARI'])]
early_downs = pbp[pbp['down'].isin([1, 2])]

Common Pitfall: Using and/or instead of &/| in pandas conditions. Python's and/or don't work element-wise on arrays.
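
The failure is immediate if you try it:

# Raises ValueError: "The truth value of a Series is ambiguous"
# bad = pbp[(pbp['down'] == 3) and (pbp['pass'] == 1)]

# Works: & combines the two boolean Series element-wise
good = pbp[(pbp['down'] == 3) & (pbp['pass'] == 1)]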

3.2.3 Creating and Modifying Columns

# New column from a calculation on existing columns
pbp['yards_from_own_goal'] = 100 - pbp['yardline_100']  # Flip the field position reference

# Conditional column with np.where
pbp['is_success'] = np.where(pbp['epa'] > 0, 1, 0)

# Multiple conditions with np.select
conditions = [
    pbp['epa'] > 1,
    pbp['epa'] > 0,
    pbp['epa'] > -1,
]
choices = ['great', 'good', 'neutral']
pbp['epa_category'] = np.select(conditions, choices, default='bad')

# Apply custom function
def categorize_distance(ydstogo):
    if pd.isna(ydstogo):
        return 'unknown'
    elif ydstogo <= 3:
        return 'short'
    elif ydstogo <= 7:
        return 'medium'
    else:
        return 'long'

pbp['distance_category'] = pbp['ydstogo'].apply(categorize_distance)

# Vectorized string operations
pbp['home_team_upper'] = pbp['home_team'].str.upper()
pbp['has_mahomes'] = pbp['desc'].str.contains('Mahomes', na=False)
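
For numeric binning like categorize_distance above, pd.cut is a vectorized alternative to .apply (a sketch; the bin edges mirror that function's thresholds):

# (0, 3] -> short, (3, 7] -> medium, (7, inf) -> long
pbp['distance_category_v2'] = (
    pd.cut(pbp['ydstogo'], bins=[0, 3, 7, np.inf],
           labels=['short', 'medium', 'long'])
    .cat.add_categories('unknown')   # NaN distances become 'unknown'
    .fillna('unknown')
)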

3.2.4 Groupby Aggregations

Groupby is the workhorse of football analytics—calculating statistics by team, player, game, or situation.

# Basic groupby
team_epa = pbp.groupby('posteam')['epa'].mean()

# Multiple aggregations
team_stats = pbp.groupby('posteam').agg({
    'epa': ['sum', 'mean', 'count'],
    'yards_gained': 'sum',
    'success': 'mean'
})

# Flatten multi-level columns
team_stats.columns = ['_'.join(col).strip() for col in team_stats.columns]
team_stats = team_stats.reset_index()

# Named aggregations (cleaner approach)
team_stats = pbp.groupby('posteam').agg(
    total_epa=('epa', 'sum'),
    epa_per_play=('epa', 'mean'),
    plays=('epa', 'count'),
    total_yards=('yards_gained', 'sum'),
    success_rate=('success', 'mean')
).reset_index()

# Multiple groupby columns
qb_weekly = pbp.groupby(['week', 'passer_player_name']).agg(
    dropbacks=('pass', 'sum'),
    epa=('epa', 'sum'),
    cpoe=('cpoe', 'mean')
).reset_index()

# Filter groups (e.g., QBs with 20+ dropbacks per week)
qb_weekly_qualified = qb_weekly[qb_weekly['dropbacks'] >= 20]

Advanced Groupby Patterns:

# Transform: apply function and return same-shaped result
# Useful for calculating within-group statistics
pbp['team_mean_epa'] = pbp.groupby('posteam')['epa'].transform('mean')
pbp['epa_vs_team_avg'] = pbp['epa'] - pbp['team_mean_epa']

# Cumulative operations
pbp['cumulative_epa'] = pbp.groupby('game_id')['epa'].cumsum()

# Rolling calculations
pbp['rolling_epa_5'] = (
    pbp.sort_values(['game_id', 'play_id'])
    .groupby('game_id')['epa']
    .transform(lambda x: x.rolling(5, min_periods=1).mean())
)

# Rank within groups
pbp['epa_rank_in_game'] = pbp.groupby('game_id')['epa'].rank(ascending=False)
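
# Lag values within groups to build previous-play features
pbp = pbp.sort_values(['game_id', 'play_id'])
pbp['prev_play_epa'] = pbp.groupby('game_id')['epa'].shift(1)  # NaN on each game's first play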

3.2.5 Merging DataFrames

Combining data from different sources is essential.

# Load additional data
rosters = nfl.import_rosters([2023])
schedules = nfl.import_schedules([2023])

# Inner join (only matching rows)
pbp_with_roster = pbp.merge(
    rosters[['player_id', 'player_name', 'position', 'team']],
    left_on='passer_player_id',
    right_on='player_id',
    how='inner'
)

# Left join (keep all rows from left DataFrame)
pbp_with_schedule = pbp.merge(
    schedules[['game_id', 'home_team', 'away_team', 'home_score', 'away_score']],
    on='game_id',
    how='left'
)

# Handle duplicate column names with suffixes
combined = df1.merge(df2, on='key', suffixes=('_left', '_right'))

# Merge on multiple columns
player_game_stats = stats1.merge(
    stats2,
    on=['player_id', 'game_id'],
    how='outer'
)

3.2.6 Method Chaining

Chain operations for readable, pipeline-style code:

# Without chaining (harder to read)
filtered = pbp[pbp['pass'] == 1]
filtered = filtered[filtered['epa'].notna()]
grouped = filtered.groupby('passer_player_name')['epa'].agg(['sum', 'mean', 'count'])
grouped = grouped.reset_index()
grouped.columns = ['player', 'total_epa', 'epa_per_play', 'dropbacks']
result = grouped[grouped['dropbacks'] >= 200]
result = result.sort_values('epa_per_play', ascending=False)

# With chaining (cleaner)
result = (
    pbp
    .query("pass == 1 and epa == epa")  # Filter passes with valid EPA
    .groupby('passer_player_name')
    .agg(
        total_epa=('epa', 'sum'),
        epa_per_play=('epa', 'mean'),
        dropbacks=('pass', 'count')
    )
    .reset_index()
    .query("dropbacks >= 200")
    .sort_values('epa_per_play', ascending=False)
    .rename(columns={'passer_player_name': 'player'})
)

Best Practice: Method chaining improves readability when each step is clear. Break into multiple statements if chains become too long or complex.
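
When a step has no built-in method, .pipe() keeps custom functions inside the chain. A short sketch (the add_success_flag helper is illustrative):

def add_success_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Add a binary success column (EPA > 0)."""
    out = df.copy()
    out['success_flag'] = (out['epa'] > 0).astype(int)
    return out


result = (
    pbp
    .query("`pass` == 1 and epa == epa")
    .pipe(add_success_flag)                  # Custom step inside the chain
    .groupby('posteam')['success_flag']
    .mean()
    .reset_index(name='success_rate')
)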


3.3 NumPy for Numerical Computing

3.3.1 Why NumPy?

NumPy provides fast, vectorized operations on arrays. Many pandas operations use NumPy under the hood.

import numpy as np

# Speed comparison
import time

data = list(range(1_000_000))
arr = np.array(data)

# Python loop
start = time.time()
result_loop = [x ** 2 for x in data]
loop_time = time.time() - start

# NumPy vectorized
start = time.time()
result_numpy = arr ** 2
numpy_time = time.time() - start

print(f"Loop: {loop_time:.3f}s, NumPy: {numpy_time:.3f}s")
print(f"NumPy is {loop_time / numpy_time:.0f}x faster")

3.3.2 Essential NumPy Operations

# Array creation
zeros = np.zeros(10)
ones = np.ones((3, 4))  # 3x4 matrix
range_arr = np.arange(0, 100, 5)  # 0, 5, 10, ..., 95
linspace = np.linspace(0, 1, 11)  # 11 points from 0 to 1

# From pandas
epa_array = pbp['epa'].values  # Convert to numpy array

# Basic statistics
np.mean(epa_array)
np.std(epa_array)
np.median(epa_array)
np.percentile(epa_array, [25, 50, 75])

# Handle NaN values
np.nanmean(epa_array)  # Ignores NaN
np.nanstd(epa_array)

# Boolean operations
positive_mask = epa_array > 0
count_positive = np.sum(positive_mask)
pct_positive = np.mean(positive_mask)

# Conditional selection
np.where(epa_array > 0, 'success', 'failure')

# Mathematical operations
np.sqrt(np.abs(epa_array))
np.log1p(np.maximum(epa_array, 0))  # log(1+x) for non-negative values
np.clip(epa_array, -5, 5)  # Limit to range

3.3.3 Vectorized Calculations for Football

def calculate_expected_points_simple(
    yardline: np.ndarray,
    down: np.ndarray,
    ydstogo: np.ndarray
) -> np.ndarray:
    """
    Simplified EP calculation using vectorized operations.

    Parameters
    ----------
    yardline : np.ndarray
        Yards from opponent's goal line
    down : np.ndarray
        Current down (1-4)
    ydstogo : np.ndarray
        Yards to first down

    Returns
    -------
    np.ndarray
        Expected points estimate
    """
    # Base EP from field position (closer to goal = higher EP)
    base_ep = (100 - yardline) * 0.06 - 1.5

    # Down adjustment
    down_adj = np.select(
        [down == 1, down == 2, down == 3, down == 4],
        [0.5, 0.2, -0.3, -1.0],
        default=0
    )

    # Distance adjustment
    dist_adj = np.where(ydstogo <= 3, 0.3, np.where(ydstogo >= 10, -0.3, 0))

    return base_ep + down_adj + dist_adj


# Apply to DataFrame
pbp['ep_simple'] = calculate_expected_points_simple(
    pbp['yardline_100'].values,
    pbp['down'].values,
    pbp['ydstogo'].values
)

3.4 Writing Reusable Functions

3.4.1 Function Design Principles

Good functions are:

  • Focused: Do one thing well
  • Documented: Clear docstring explaining purpose, parameters, returns
  • Typed: Type hints clarify expected inputs/outputs
  • Tested: Unit tests verify correctness

from typing import List, Optional, Union
import pandas as pd
import numpy as np


def calculate_success_rate(
    plays: pd.DataFrame,
    group_cols: Optional[List[str]] = None,
    min_plays: int = 50
) -> pd.DataFrame:
    """
    Calculate success rate (EPA > 0) by group.

    Success rate measures the percentage of plays that generate
    positive expected points added, indicating the offense improved
    their scoring expectation.

    Parameters
    ----------
    plays : pd.DataFrame
        Play-by-play data containing 'epa' column
    group_cols : List[str], optional
        Columns to group by. If None, returns overall success rate.
    min_plays : int, default=50
        Minimum plays required for a group to be included

    Returns
    -------
    pd.DataFrame
        Success rate and play count by group

    Raises
    ------
    ValueError
        If 'epa' column is missing from plays

    Examples
    --------
    >>> plays = load_pbp([2023])
    >>> team_success = calculate_success_rate(plays, ['posteam'])
    >>> team_success.head()
       posteam  success_rate  plays
    0      ARI         0.421   1045
    1      ATL         0.456   1102
    """
    if 'epa' not in plays.columns:
        raise ValueError("DataFrame must contain 'epa' column")

    # Filter to valid EPA values
    valid_plays = plays[plays['epa'].notna()].copy()
    valid_plays['success'] = (valid_plays['epa'] > 0).astype(int)

    if group_cols is None:
        # Overall success rate
        return pd.DataFrame({
            'success_rate': [valid_plays['success'].mean()],
            'plays': [len(valid_plays)]
        })

    # Grouped success rate
    result = (
        valid_plays
        .groupby(group_cols)
        .agg(
            success_rate=('success', 'mean'),
            plays=('success', 'count')
        )
        .reset_index()
        .query(f"plays >= {min_plays}")
        .sort_values('success_rate', ascending=False)
    )

    return result

3.4.2 Common Football Analysis Functions

def filter_to_pass_plays(
    pbp: pd.DataFrame,
    include_scrambles: bool = True,
    exclude_spikes: bool = True
) -> pd.DataFrame:
    """
    Filter play-by-play to passing plays.

    Parameters
    ----------
    pbp : pd.DataFrame
        Play-by-play data
    include_scrambles : bool, default=True
        Whether to include QB scrambles
    exclude_spikes : bool, default=True
        Whether to exclude spike plays

    Returns
    -------
    pd.DataFrame
        Filtered passing plays
    """
    mask = pbp['pass'] == 1

    if not include_scrambles:
        mask &= pbp['qb_scramble'] != 1

    if exclude_spikes:
        mask &= pbp['qb_spike'] != 1

    return pbp[mask].copy()


def calculate_epa_per_play(
    plays: pd.DataFrame,
    group_cols: List[str],
    play_type: Optional[str] = None,
    min_plays: int = 50
) -> pd.DataFrame:
    """
    Calculate EPA per play by group.

    Parameters
    ----------
    plays : pd.DataFrame
        Play-by-play data
    group_cols : List[str]
        Columns to group by
    play_type : str, optional
        Filter to 'pass' or 'rush'
    min_plays : int, default=50
        Minimum plays for inclusion

    Returns
    -------
    pd.DataFrame
        EPA per play statistics by group
    """
    df = plays.copy()

    if play_type == 'pass':
        df = df[df['pass'] == 1]
    elif play_type == 'rush':
        df = df[df['rush'] == 1]

    result = (
        df[df['epa'].notna()]
        .groupby(group_cols)
        .agg(
            epa_total=('epa', 'sum'),
            epa_per_play=('epa', 'mean'),
            plays=('epa', 'count')
        )
        .reset_index()
        .query(f"plays >= {min_plays}")
        .sort_values('epa_per_play', ascending=False)
    )

    return result


def add_game_context(pbp: pd.DataFrame) -> pd.DataFrame:
    """
    Add game context columns to play-by-play data.

    Adds:
    - score_differential: posteam score minus defteam score
    - is_trailing: whether offense is behind
    - is_close: game within 8 points
    - is_garbage_time: low leverage end-of-game situation
    - quarter_seconds_remaining: time left in quarter

    Parameters
    ----------
    pbp : pd.DataFrame
        Play-by-play data

    Returns
    -------
    pd.DataFrame
        Data with additional context columns
    """
    df = pbp.copy()

    # Score context
    df['score_differential'] = df['posteam_score'] - df['defteam_score']
    df['is_trailing'] = (df['score_differential'] < 0).astype(int)
    df['is_leading'] = (df['score_differential'] > 0).astype(int)
    df['is_close'] = (df['score_differential'].abs() <= 8).astype(int)

    # Garbage time (score margin of 17+ points with < 5 min left)
    df['is_garbage_time'] = (
        (df['game_seconds_remaining'] < 300) &
        (df['score_differential'].abs() >= 17)
    ).astype(int)

    # Time context (approximation: the modulo maps exact quarter
    # boundaries to 0; nflfastR data also includes a native
    # quarter_seconds_remaining column)
    df['quarter_seconds_remaining'] = df['game_seconds_remaining'] % 900

    return df

3.4.3 Creating Analysis Pipelines

from typing import Callable, List
from functools import reduce


def compose(*functions: Callable) -> Callable:
    """
    Compose multiple functions into a single function.

    Functions are applied left to right.

    Example
    -------
    >>> pipeline = compose(filter_passes, add_context, calculate_stats)
    >>> result = pipeline(pbp)
    """
    def apply(x):
        return reduce(lambda v, f: f(v), functions, x)
    return apply
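
# Wiring together functions defined in Section 3.4.2 (illustrative):
# applied left to right — filter passes first, then add game context
prep = compose(filter_to_pass_plays, add_game_context)
pass_plays_with_context = prep(pbp)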


class AnalysisPipeline:
    """
    Builder for creating reproducible analysis pipelines.

    Example
    -------
    >>> pipeline = (
    ...     AnalysisPipeline(pbp)
    ...     .filter(lambda df: df[df['pass'] == 1])
    ...     .add_column('is_success', lambda df: (df['epa'] > 0).astype(int))
    ...     .aggregate(['posteam'], {'epa': 'mean', 'is_success': 'mean'})
    ...     .execute()
    ... )
    """

    def __init__(self, data: pd.DataFrame):
        self.data = data.copy()
        self.steps: List[Callable] = []

    def filter(self, condition: Callable) -> 'AnalysisPipeline':
        """Add a filter step."""
        self.steps.append(lambda df: condition(df))
        return self

    def add_column(self, name: str, func: Callable) -> 'AnalysisPipeline':
        """Add a new column."""
        def add_col(df):
            df = df.copy()
            df[name] = func(df)
            return df
        self.steps.append(add_col)
        return self

    def aggregate(
        self,
        group_cols: List[str],
        agg_dict: dict
    ) -> 'AnalysisPipeline':
        """Add an aggregation step."""
        def agg_step(df):
            return df.groupby(group_cols).agg(agg_dict).reset_index()
        self.steps.append(agg_step)
        return self

    def execute(self) -> pd.DataFrame:
        """Execute all pipeline steps."""
        result = self.data
        for step in self.steps:
            result = step(result)
        return result

3.5 Working with nfl_data_py

3.5.1 Complete Data Loading Reference

import nfl_data_py as nfl
from typing import List, Optional
import pandas as pd


class NFLData:
    """
    Comprehensive wrapper for nfl_data_py functionality.

    Provides caching, common filters, and convenient methods.
    """

    def __init__(self, cache_enabled: bool = True):
        self.cache_enabled = cache_enabled
        self._cache = {}

    def pbp(
        self,
        seasons: List[int],
        columns: Optional[List[str]] = None
    ) -> pd.DataFrame:
        """Load play-by-play data."""
        key = ('pbp', tuple(seasons))

        if self.cache_enabled and key in self._cache:
            df = self._cache[key]
        else:
            df = nfl.import_pbp_data(seasons)
            if self.cache_enabled:
                self._cache[key] = df

        if columns:
            return df[[c for c in columns if c in df.columns]]
        return df

    def rosters(self, seasons: List[int]) -> pd.DataFrame:
        """Load roster data."""
        key = ('rosters', tuple(seasons))
        if self.cache_enabled and key in self._cache:
            return self._cache[key]

        df = nfl.import_rosters(seasons)
        if self.cache_enabled:
            self._cache[key] = df
        return df

    def schedules(self, seasons: List[int]) -> pd.DataFrame:
        """Load schedule data."""
        return nfl.import_schedules(seasons)

    def weekly_stats(self, seasons: List[int]) -> pd.DataFrame:
        """Load weekly player statistics."""
        return nfl.import_weekly_data(seasons)

    def seasonal_stats(self, seasons: List[int]) -> pd.DataFrame:
        """Load seasonal player statistics."""
        return nfl.import_seasonal_data(seasons)

    def combine(self, seasons: List[int]) -> pd.DataFrame:
        """Load NFL Combine data."""
        return nfl.import_combine_data(seasons)

    def draft_picks(self, seasons: List[int]) -> pd.DataFrame:
        """Load draft pick data."""
        return nfl.import_draft_picks(seasons)

    def player_ids(self) -> pd.DataFrame:
        """Load player ID mapping table."""
        return nfl.import_ids()

    def clear_cache(self):
        """Clear all cached data."""
        self._cache.clear()
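
# Typical usage of the wrapper (illustrative):
data = NFLData()
pbp_2023 = data.pbp([2023], columns=['game_id', 'play_id', 'posteam', 'epa'])
pbp_full = data.pbp([2023])  # Second call is served from the in-memory cache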


# Convenience functions
def load_passing_plays(seasons: List[int]) -> pd.DataFrame:
    """Load and filter to passing plays."""
    pbp = nfl.import_pbp_data(seasons)
    return pbp[pbp['pass'] == 1].copy()


def load_rushing_plays(seasons: List[int]) -> pd.DataFrame:
    """Load and filter to rushing plays."""
    pbp = nfl.import_pbp_data(seasons)
    return pbp[pbp['rush'] == 1].copy()


def load_qb_stats(seasons: List[int], min_dropbacks: int = 100) -> pd.DataFrame:
    """
    Load quarterback statistics aggregated by season.

    Parameters
    ----------
    seasons : List[int]
        Seasons to load
    min_dropbacks : int
        Minimum dropbacks for inclusion

    Returns
    -------
    pd.DataFrame
        QB statistics with key metrics
    """
    pbp = nfl.import_pbp_data(seasons)
    passing = pbp[pbp['pass'] == 1]

    qb_stats = (
        passing
        .groupby(['season', 'passer_player_id', 'passer_player_name'])
        .agg(
            dropbacks=('pass', 'count'),
            completions=('complete_pass', 'sum'),
            attempts=('pass_attempt', 'sum'),
            yards=('yards_gained', 'sum'),
            tds=('pass_touchdown', 'sum'),
            ints=('interception', 'sum'),
            epa_total=('epa', 'sum'),
            epa_per_play=('epa', 'mean'),
            cpoe=('cpoe', 'mean'),
            air_yards=('air_yards', 'sum'),
            yac=('yards_after_catch', 'sum')
        )
        .reset_index()
        .query(f"dropbacks >= {min_dropbacks}")
        .sort_values('epa_total', ascending=False)
    )

    qb_stats['completion_pct'] = qb_stats['completions'] / qb_stats['attempts']
    qb_stats['yards_per_attempt'] = qb_stats['yards'] / qb_stats['attempts']

    return qb_stats
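
A quick check of the loader (illustrative; exact output depends on the season data):

qbs = load_qb_stats([2023], min_dropbacks=200)
print(qbs[['passer_player_name', 'epa_total', 'epa_per_play']].head())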

3.6 Debugging and Profiling

3.6.1 Common Errors and Solutions

"""
Common Pandas Errors and How to Fix Them
"""

import pandas as pd
import numpy as np

# ERROR 1: SettingWithCopyWarning
# BAD: Modifying a view instead of a copy
df_filtered = df[df['pass'] == 1]
df_filtered['new_col'] = 1  # Warning!

# GOOD: Explicitly copy
df_filtered = df[df['pass'] == 1].copy()
df_filtered['new_col'] = 1  # No warning

# ERROR 2: KeyError when column doesn't exist
# BAD: Assuming column exists
value = df['nonexistent_column']  # KeyError!

# GOOD: Check first or use get
if 'column' in df.columns:
    value = df['column']
# Or use DataFrame.get(), which returns a default instead of raising
value = df.get('column', default=None)

# ERROR 3: Boolean indexing with NaN
# GOTCHA: Comparisons involving NaN evaluate to False, so rows
# with missing EPA are silently dropped
df[df['epa'] > 0]  # Excludes rows where epa is NaN — intended or not?

# GOOD: Make the NaN handling explicit
df[df['epa'].notna() & (df['epa'] > 0)]  # Explicitly require valid EPA
df[df['epa'].fillna(0) > 0]              # Or treat NaN as zero first

# ERROR 4: Merging creates duplicates
# BAD: Not checking for duplicates before merge
merged = df1.merge(df2, on='key')  # May explode in size

# GOOD: Verify keys are unique
assert df2['key'].is_unique, "Duplicate keys in df2"
merged = df1.merge(df2, on='key')
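
# Alternatively, let pandas enforce the relationship at merge time:
# validate raises MergeError if the keys violate the declared shape
merged = df1.merge(df2, on='key', validate='many_to_one')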

# ERROR 5: Memory issues with large data
# BAD: Loading all columns
pbp = nfl.import_pbp_data([2020, 2021, 2022, 2023])  # ~2GB+

# GOOD: Select only needed columns
cols = ['game_id', 'play_id', 'epa', 'posteam']
pbp = nfl.import_pbp_data([2023], columns=cols)  # Much smaller

3.6.2 Performance Profiling

import time
from functools import wraps


def timer(func):
    """Decorator to time function execution."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__}: {end - start:.3f} seconds")
        return result
    return wrapper


@timer
def slow_calculation(df):
    """Example slow function."""
    return df.apply(lambda row: row['epa'] * 2, axis=1)


@timer
def fast_calculation(df):
    """Example fast function."""
    return df['epa'] * 2
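
# Compare the two on real data (timings vary by machine; the vectorized
# version is typically orders of magnitude faster)
pbp_valid = pbp[pbp['epa'].notna()].copy()
slow_calculation(pbp_valid)
fast_calculation(pbp_valid)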


# Memory profiling
def memory_usage(df: pd.DataFrame) -> str:
    """Get human-readable memory usage."""
    bytes_used = df.memory_usage(deep=True).sum()
    if bytes_used > 1e9:
        return f"{bytes_used / 1e9:.2f} GB"
    elif bytes_used > 1e6:
        return f"{bytes_used / 1e6:.2f} MB"
    else:
        return f"{bytes_used / 1e3:.2f} KB"


def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """
    Reduce memory by optimizing data types.

    Converts:
    - float64 to float32 (halves memory at a small precision cost)
    - int64 to the smallest integer type that fits the data
    - object to category for low-cardinality columns
    """
    df = df.copy()

    for col in df.columns:
        col_type = df[col].dtype

        if col_type == 'float64':
            df[col] = df[col].astype('float32')

        elif col_type == 'int64':
            col_min, col_max = df[col].min(), df[col].max()
            if col_min >= 0 and col_max < 256:
                df[col] = df[col].astype('uint8')
            elif col_min >= -128 and col_max < 128:
                df[col] = df[col].astype('int8')
            elif col_min >= 0 and col_max < 65536:
                df[col] = df[col].astype('uint16')
            elif col_min >= -2**31 and col_max < 2**31:
                df[col] = df[col].astype('int32')
            # Otherwise keep int64 to avoid overflow

        elif col_type == 'object':
            n_unique = df[col].nunique()
            n_total = len(df[col])
            if n_unique / n_total < 0.5:  # Low cardinality
                df[col] = df[col].astype('category')

    return df
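
Putting the profiling helpers together (a sketch; the savings vary by dataset):

pbp = nfl.import_pbp_data([2023])
print(f"Before: {memory_usage(pbp)}")
pbp_small = optimize_dtypes(pbp)
print(f"After:  {memory_usage(pbp_small)}")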

3.7 Chapter Summary

Key Concepts

  1. Virtual environments isolate project dependencies and ensure reproducibility.

  2. Pandas operations should use vectorized constructs (column arithmetic, np.where(), np.select()) rather than row-by-row loops or .apply(axis=1).

  3. Method chaining creates readable, pipeline-style code for data transformations.

  4. Groupby aggregations are the foundation of football statistics—calculating metrics by team, player, game, or situation.

  5. Well-designed functions have clear purposes, type hints, docstrings, and handle edge cases gracefully.

  6. NumPy vectorization provides 10-100x speedups compared to Python loops.

Key Code Patterns

# Pattern 1: Filter and aggregate
result = (
    pbp
    .query("pass == 1 and epa == epa")
    .groupby('posteam')
    .agg(epa_per_play=('epa', 'mean'))
    .reset_index()
    .sort_values('epa_per_play', ascending=False)
)

# Pattern 2: Conditional column creation
df['success'] = np.where(df['epa'] > 0, 1, 0)

# Pattern 3: Transform within groups
df['team_avg'] = df.groupby('posteam')['epa'].transform('mean')
df['vs_avg'] = df['epa'] - df['team_avg']

# Pattern 4: Safe column access
value = df.get('column', pd.Series(dtype=float))

What's Next

In Chapter 4: Exploratory Data Analysis for Football, we apply these programming skills to understand data before modeling it. You'll learn visualization techniques, pattern recognition, and the EDA mindset that separates good analysts from great ones.


Chapter 3 Exercises → exercises.md

Chapter 3 Quiz → quiz.md

Case Study: Building a QB Comparison Tool → case-study-01.md