Key Takeaways: Python for Football Analytics
One-page reference for Chapter 3 concepts
Environment Setup
# Create virtual environment
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
# Install packages
pip install pandas numpy matplotlib nfl-data-py scikit-learn
# Save dependencies
pip freeze > requirements.txt
Essential Pandas Patterns
Filtering
# Single condition
passes = pbp[pbp['pass'] == 1]
# Multiple conditions (use & | with parentheses!)
third_down_pass = pbp[(pbp['down'] == 3) & (pbp['pass'] == 1)]
# Query syntax (cleaner); `pass` is a Python keyword, so wrap it in backticks
third_down_pass = pbp.query("down == 3 and `pass` == 1")
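A minimal sketch of the two filtering styles side by side, run on a hypothetical four-row frame (the values are invented for illustration; real play-by-play data has many more columns). Because `pass` is a Python keyword, the `query` string wraps that column name in backticks.

```python
import pandas as pd

# Hypothetical toy data standing in for play-by-play rows
pbp = pd.DataFrame({
    "down": [1, 3, 3, 2],
    "pass": [1, 1, 0, 1],
    "epa": [0.5, -0.2, 1.1, 0.0],
})

# Boolean-mask version: each condition sits in its own parentheses
mask_result = pbp[(pbp["down"] == 3) & (pbp["pass"] == 1)]

# query version: backticks protect the keyword-named column
query_result = pbp.query("down == 3 and `pass` == 1")

# Both approaches select the same single row
print(mask_result.equals(query_result))
```

Either form is fine; the mask version is easier to build up programmatically, while `query` tends to read better inside a method chain.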
Column Creation
# Simple calculation
df['epa_doubled'] = df['epa'] * 2
# Conditional (two outcomes)
df['success'] = np.where(df['epa'] > 0, 1, 0)
# Conditional (multiple outcomes)
conditions = [df['epa'] > 1, df['epa'] > 0]
choices = ['great', 'good']
df['grade'] = np.select(conditions, choices, default='bad')
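`np.select` checks the conditions in order and the first match wins, which is why the stricter threshold comes first. The sketch below, on invented values, also shows one thing to watch for: a missing EPA compares as False against every threshold, so it lands in the default bucket unless it is masked out. The `grade_safe` column is a hypothetical name used only for this illustration.

```python
import numpy as np
import pandas as pd

# Invented EPA values, including one missing play
df = pd.DataFrame({"epa": [1.5, 0.4, -0.3, np.nan]})

# Order matters: 1.5 satisfies both conditions, and the first one wins
conditions = [df["epa"] > 1, df["epa"] > 0]
choices = ["great", "good"]
df["grade"] = np.select(conditions, choices, default="bad")

# NaN > 0 is False, so the missing play was graded 'bad' above.
# Keep it missing instead by blanking rows where EPA is null.
df["grade_safe"] = df["grade"].where(df["epa"].notna())

print(df)
```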
Groupby Aggregation
team_stats = (
    pbp
    .groupby('posteam')
    .agg(
        total_epa=('epa', 'sum'),
        epa_per_play=('epa', 'mean'),
        plays=('epa', 'count')
    )
    .reset_index()
    .sort_values('epa_per_play', ascending=False)
)
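A small runnable version of the same named-aggregation pattern, on invented data. One detail worth seeing in action: `'count'` counts only non-null EPA values, whereas `'size'` counts every row. The extra `rows` column is added here purely to show the difference and is not part of the pattern above.

```python
import numpy as np
import pandas as pd

# Invented plays; one KC play has no EPA recorded
pbp = pd.DataFrame({
    "posteam": ["KC", "KC", "KC", "BUF", "BUF"],
    "epa": [0.5, -0.1, np.nan, 0.3, 0.3],
})

team_stats = (
    pbp
    .groupby("posteam")
    .agg(
        total_epa=("epa", "sum"),
        epa_per_play=("epa", "mean"),
        plays=("epa", "count"),   # non-null EPA values only
        rows=("epa", "size"),     # every row, including NaN EPA
    )
    .reset_index()
    .sort_values("epa_per_play", ascending=False)
)

print(team_stats)
```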
Transform (broadcast to original size)
df['team_avg_epa'] = df.groupby('posteam')['epa'].transform('mean')
df['epa_vs_avg'] = df['epa'] - df['team_avg_epa']
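To make the difference from aggregation concrete, here is a sketch on invented values: the plain group mean collapses to one row per team, while `transform` returns one value per original row, so it can be assigned straight back as a column.

```python
import pandas as pd

# Invented plays for two teams
df = pd.DataFrame({
    "posteam": ["KC", "KC", "BUF", "BUF"],
    "epa": [0.6, 0.2, -0.1, 0.3],
})

# Aggregation: one value per team
per_team = df.groupby("posteam")["epa"].mean()

# Transform: one value per original row, aligned on the index
df["team_avg_epa"] = df.groupby("posteam")["epa"].transform("mean")
df["epa_vs_avg"] = df["epa"] - df["team_avg_epa"]

# Two teams, but still four rows in df
print(len(per_team), len(df))
```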
Method Chaining Template
result = (
    pbp
    .query("`pass` == 1 and epa == epa")            # Filter (epa == epa drops NaN, since NaN != NaN)
    .assign(success=lambda x: (x['epa'] > 0))       # Add column
    .groupby('posteam')                             # Group
    .agg(epa=('epa', 'mean'), n=('epa', 'count'))   # Aggregate
    .reset_index()                                  # Flatten
    .query("n >= 100")                              # Filter groups
    .sort_values('epa', ascending=False)            # Sort
)
Common Gotchas
| Issue | Wrong | Right |
|---|---|---|
| Multiple conditions | and, or | &, \| with () |
| Copy warning | df[mask]['col'] = x | df.loc[mask, 'col'] = x |
| NaN comparison | df[df['col'] > 0] (NaN rows drop out silently) | Check with notna() first |
| Slow loops | .iterrows() | Vectorized operations |
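The copy and NaN rows of the table can be sketched on a three-row invented frame. Assigning into a filtered copy leaves the original untouched, whereas `.loc` with a row mask and a column label writes to the frame itself. The `flag` column and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"epa": [0.5, -0.2, np.nan], "flag": [0, 0, 0]})
mask = df["epa"] > 0   # NaN > 0 evaluates to False

# Writing into a filtered copy changes the copy only
tmp = df[mask].copy()
tmp["flag"] = 1
flags_after_copy = df["flag"].tolist()   # original is unchanged

# .loc with a row mask and column label modifies df in place
df.loc[mask, "flag"] = 1
flags_after_loc = df["flag"].tolist()

# Spell out the NaN handling instead of relying on it comparing False
valid_non_positive = df[df["epa"].notna() & (df["epa"] <= 0)]

print(flags_after_copy, flags_after_loc, len(valid_non_positive))
```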
Function Template
from typing import List

import pandas as pd


def calculate_metric(
    plays: pd.DataFrame,
    group_cols: List[str],
    min_plays: int = 50
) -> pd.DataFrame:
    """
    Brief description of what function does.

    Parameters
    ----------
    plays : pd.DataFrame
        Play-by-play data
    group_cols : List[str]
        Columns to group by
    min_plays : int, default=50
        Minimum plays for inclusion

    Returns
    -------
    pd.DataFrame
        Aggregated statistics

    Examples
    --------
    >>> result = calculate_metric(pbp, ['posteam'])
    """
    return (
        plays
        .groupby(group_cols)
        .agg(...)  # placeholder: named aggregations go here and must create a 'plays' column
        .query(f"plays >= {min_plays}")
    )
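One possible way to fill in the template, shown as a sketch rather than the chapter's own implementation. The function name `epa_summary`, the choice of mean EPA as the metric, and the toy data are all assumptions made for illustration.

```python
from typing import List

import pandas as pd


def epa_summary(
    plays: pd.DataFrame,
    group_cols: List[str],
    min_plays: int = 50,
) -> pd.DataFrame:
    """Mean EPA and play count per group, keeping groups with enough plays."""
    return (
        plays
        .dropna(subset=["epa"])                  # ignore plays without EPA
        .groupby(group_cols)
        .agg(epa_per_play=("epa", "mean"), plays=("epa", "count"))
        .reset_index()
        .query("plays >= @min_plays")            # @ refers to the local variable
    )


# Invented data: KC has three plays, BUF only one
toy = pd.DataFrame({
    "posteam": ["KC", "KC", "KC", "BUF"],
    "epa": [0.4, 0.2, 0.0, 0.9],
})

result = epa_summary(toy, ["posteam"], min_plays=2)
print(result)
```

With `min_plays=2`, BUF is filtered out and only the KC row remains.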
Performance Tips
- Vectorize: Use NumPy/pandas operations, not loops
- Filter early: Reduce data size before processing
- Select columns: Load only needed columns
- Cache data: Avoid repeated downloads
- Use categories: Convert low-cardinality strings
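Two of these tips can be checked on synthetic data. The sketch below builds a random frame (not real play-by-play data), compares a row-by-row loop with the equivalent vectorized expression, and compares the memory footprint of a team column stored as plain strings versus as a category.

```python
import numpy as np
import pandas as pd

# Synthetic data: random teams and EPA values
rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "posteam": rng.choice(["KC", "BUF", "SF", "PHI"], size=n),
    "epa": rng.normal(size=n),
})

# Loop version: one Python-level iteration per row
loop_success = [1 if row["epa"] > 0 else 0 for _, row in df.iterrows()]

# Vectorized version: a single array comparison
vec_success = (df["epa"] > 0).astype(int)

# Category dtype stores each distinct team once plus small integer codes
as_object = df["posteam"].astype(object).memory_usage(deep=True)
as_category = df["posteam"].astype("category").memory_usage(deep=True)

print(loop_success == vec_success.tolist())   # same answer either way
print(as_category < as_object)                # category uses less memory
```

Timing the two success calculations (for example with `%timeit` in a notebook) is a quick way to see how much the loop costs on a full season of plays.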
Quick Self-Check
Can you:
- [ ] Set up a virtual environment?
- [ ] Filter DataFrames with multiple conditions?
- [ ] Use groupby with named aggregations?
- [ ] Write a function with type hints and docstring?
- [ ] Avoid SettingWithCopyWarning?
Preview: Chapter 4
Next: Exploratory Data Analysis for Football — visualization techniques, pattern recognition, and the EDA mindset.