Key Takeaways: Python for Football Analytics

One-page reference for Chapter 3 concepts

Environment Setup

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

# Install packages
pip install pandas numpy matplotlib nfl-data-py scikit-learn

# Save dependencies
pip freeze > requirements.txt

Essential Pandas Patterns

Filtering

# Single condition
passes = pbp[pbp['pass'] == 1]

# Multiple conditions (use & | with parentheses!)
third_down_pass = pbp[(pbp['down'] == 3) & (pbp['pass'] == 1)]

# Query syntax (cleaner); `pass` is a Python keyword, so backtick-quote it
third_down_pass = pbp.query("down == 3 and `pass` == 1")
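
As a quick check that the two filtering styles select the same rows, here is a minimal sketch on a made-up four-row frame (the values are illustrative, not real play-by-play data). Because `pass` is a Python keyword, the query string wraps it in backticks.

```python
import pandas as pd

# Toy stand-in for play-by-play data (values are made up for illustration)
pbp = pd.DataFrame({
    "down": [1, 3, 3, 4],
    "pass": [1, 1, 0, 1],
    "epa": [0.2, -0.5, 1.1, 0.4],
})

# Boolean-mask version: each condition in parentheses, combined with &
mask_result = pbp[(pbp["down"] == 3) & (pbp["pass"] == 1)]

# query version: `pass` is backtick-quoted because it is a Python keyword
query_result = pbp.query("down == 3 and `pass` == 1")

# Both approaches should keep the same single row (index 1)
print(mask_result.equals(query_result))
```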

Column Creation

import numpy as np

# Simple calculation
df['epa_doubled'] = df['epa'] * 2

# Conditional (two outcomes)
df['success'] = np.where(df['epa'] > 0, 1, 0)

# Conditional (multiple outcomes)
conditions = [df['epa'] > 1, df['epa'] > 0]
choices = ['great', 'good']
df['grade'] = np.select(conditions, choices, default='bad')
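
A small sketch of how these conditionals behave, using four invented EPA values including one missing value. It shows that `np.select` takes the first matching condition and that NaN rows fall through to the "false" branch. The `success_nan` column is a hypothetical name for one possible way to keep missing values missing.

```python
import numpy as np
import pandas as pd

# Invented EPA values, including one missing play
df = pd.DataFrame({"epa": [1.5, 0.3, -0.2, np.nan]})

# np.select uses the first condition that matches, so list the strictest first
conditions = [df["epa"] > 1, df["epa"] > 0]
choices = ["great", "good"]
df["grade"] = np.select(conditions, choices, default="bad")

# Comparisons with NaN evaluate to False, so the NaN row is labelled 0 here
df["success"] = np.where(df["epa"] > 0, 1, 0)

# One option (an assumption, not the chapter's method): keep missing EPA as NaN
df["success_nan"] = (df["epa"] > 0).astype(float).where(df["epa"].notna())

print(df)
```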

Groupby Aggregation

team_stats = (
    pbp
    .groupby('posteam')
    .agg(
        total_epa=('epa', 'sum'),
        epa_per_play=('epa', 'mean'),
        plays=('epa', 'count')
    )
    .reset_index()
    .sort_values('epa_per_play', ascending=False)
)

Transform (broadcast to original size)

df['team_avg_epa'] = df.groupby('posteam')['epa'].transform('mean')
df['epa_vs_avg'] = df['epa'] - df['team_avg_epa']
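
To make the difference between `agg` and `transform` concrete, the sketch below runs both on a five-row toy frame (team labels and EPA values are invented). `agg` collapses to one row per team, while `transform` returns one value per original row.

```python
import pandas as pd

# Toy data: two teams, five plays (values are illustrative only)
pbp = pd.DataFrame({
    "posteam": ["KC", "KC", "BUF", "BUF", "BUF"],
    "epa": [0.5, -0.1, 0.3, 0.0, 0.6],
})

# agg: one row per team
team_stats = (
    pbp
    .groupby("posteam")
    .agg(epa_per_play=("epa", "mean"), plays=("epa", "count"))
    .reset_index()
)

# transform: one value per original row, aligned to pbp's index
pbp["team_avg_epa"] = pbp.groupby("posteam")["epa"].transform("mean")
pbp["epa_vs_avg"] = pbp["epa"] - pbp["team_avg_epa"]

print(len(team_stats), len(pbp))  # prints: 2 5
```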

Method Chaining Template

result = (
    pbp
    .query("`pass` == 1 and epa == epa")         # Filter (epa == epa drops NaN)
    .assign(success=lambda x: (x['epa'] > 0))    # Add column
    .groupby('posteam')                           # Group
    .agg(epa=('epa', 'mean'), n=('epa', 'count')) # Aggregate
    .reset_index()                                # Flatten
    .query("n >= 100")                            # Filter groups
    .sort_values('epa', ascending=False)          # Sort
)

Common Gotchas

Issue	Wrong	Right
Multiple conditions	`and`, `or`	`&`, `|` with `()`
Copy warning	`df[mask]['col'] = x`	`df.loc[mask, 'col'] = x`
NaN comparison	`df[df['col'] > 0]` (NaN rows silently dropped)	Handle NaN explicitly with `notna()` first
Slow loops	`.iterrows()`	Vectorized operations
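
A minimal sketch of three of these fixes on an invented four-row frame: writing through `.loc`, taking an explicit copy before modifying a subset, and replacing a row loop with a vectorized sum. The column names `third_down` and `epa_doubled` are illustrative.

```python
import numpy as np
import pandas as pd

# Invented data with one missing EPA value
df = pd.DataFrame({
    "down": [1, 3, 3, 4],
    "epa": [0.2, -0.5, np.nan, 0.4],
})

# Write through .loc with (row mask, column) so the assignment hits df itself
mask = df["down"] == 3
df["third_down"] = False
df.loc[mask, "third_down"] = True

# To modify a subset separately, take an explicit copy first
third = df[mask].copy()
third["epa_doubled"] = third["epa"] * 2  # df is left untouched

# Vectorized replacement for an iterrows() loop that sums positive EPA,
# with the NaN check made explicit
positive_epa = df.loc[df["epa"].notna() & (df["epa"] > 0), "epa"].sum()

print(df)
print(positive_epa)
```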

Function Template

from typing import List

import pandas as pd


def calculate_metric(
    plays: pd.DataFrame,
    group_cols: List[str],
    min_plays: int = 50
) -> pd.DataFrame:
    """
    Brief description of what function does.

    Parameters
    ----------
    plays : pd.DataFrame
        Play-by-play data
    group_cols : List[str]
        Columns to group by
    min_plays : int, default=50
        Minimum plays for inclusion

    Returns
    -------
    pd.DataFrame
        Aggregated statistics

    Examples
    --------
    >>> result = calculate_metric(pbp, ['posteam'])
    """
    return (
        plays
        .groupby(group_cols)
        .agg(...)  # fill in named aggregations; must create a 'plays' column
        .query(f"plays >= {min_plays}")
    )
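
One way the template might be filled in, shown as a sketch rather than the chapter's own function: the name `epa_by_group` and the two aggregations are hypothetical, chosen so that the `plays` filter has a column to work on. The toy frame is invented.

```python
from typing import List

import pandas as pd


def epa_by_group(
    plays: pd.DataFrame,
    group_cols: List[str],
    min_plays: int = 50,
) -> pd.DataFrame:
    """Mean EPA and play count per group, keeping groups with enough plays."""
    return (
        plays
        .groupby(group_cols)
        .agg(epa_per_play=("epa", "mean"), plays=("epa", "count"))
        .query(f"plays >= {min_plays}")  # drop small-sample groups
        .reset_index()
    )


# Toy input: KC has three plays, BUF has one
toy = pd.DataFrame({
    "posteam": ["KC", "KC", "KC", "BUF"],
    "epa": [0.5, -0.1, 0.2, 0.3],
})

# With min_plays=2, only KC should remain
result = epa_by_group(toy, ["posteam"], min_plays=2)
print(result)
```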

Performance Tips

Vectorize: Use NumPy/pandas operations, not loops
Filter early: Reduce data size before processing
Select columns: Load only needed columns
Cache data: Avoid repeated downloads
Use categories: Convert low-cardinality strings
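
A rough sketch of three of these tips on randomly generated data (team labels, sizes, and the seed are arbitrary choices for illustration). It filters and selects columns up front, compares memory for object versus category storage, and checks that a vectorized comparison matches a Python-level loop.

```python
import numpy as np
import pandas as pd

# Synthetic data; the seed and team labels are arbitrary
rng = np.random.default_rng(0)
n = 10_000
pbp = pd.DataFrame({
    "posteam": rng.choice(["KC", "BUF", "SF", "DAL"], size=n),
    "down": rng.integers(1, 5, size=n),
    "epa": rng.normal(0, 1, size=n),
})

# Filter early and keep only the columns needed downstream
third_down = pbp.loc[pbp["down"] == 3, ["posteam", "epa"]]

# Use categories: compare memory for the same low-cardinality column
as_object = pbp["posteam"].astype(object).memory_usage(deep=True)
as_category = pbp["posteam"].astype("category").memory_usage(deep=True)

# Vectorize: one array comparison instead of a Python-level loop
success_vectorized = (pbp["epa"] > 0).to_numpy()
success_loop = np.array([row.epa > 0 for row in pbp.itertuples()])

print(as_object, as_category)
print((success_vectorized == success_loop).all())
```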

Quick Self-Check

Can you:

- [ ] Set up a virtual environment?
- [ ] Filter DataFrames with multiple conditions?
- [ ] Use groupby with named aggregations?
- [ ] Write a function with type hints and docstring?
- [ ] Avoid SettingWithCopyWarning?

Preview: Chapter 4

Next: Exploratory Data Analysis for Football — visualization techniques, pattern recognition, and the EDA mindset.