Chapter 3: Python for Sports Analytics
"The best tool is the one you know how to use." — Traditional programmer wisdom
Learning Objectives
- Master pandas DataFrames for football data manipulation
- Apply NumPy operations to sports statistics
- Write efficient data processing pipelines
- Build reusable functions for common football analytics tasks
In This Chapter
- Introduction
- Setting Up Your Environment
- Part 1: The pandas DataFrame
- Part 2: Selecting and Filtering Data
- Part 3: Data Transformation
- Part 4: Grouping and Aggregation
- Part 5: Merging and Joining DataFrames
- Part 6: NumPy Essentials
- Part 7: Working with Football Data Types
- Part 8: Building Reusable Functions
- Part 9: Performance Optimization
- Putting It All Together
- Summary
- Key Terms
- References
Introduction
In the previous chapter, you learned where football data lives and how to access it. Now it's time to learn how to work with that data. Python has become the lingua franca of sports analytics, and for good reason: its ecosystem of data science libraries—pandas, NumPy, and matplotlib—provides everything you need to transform raw game data into actionable insights.
This chapter is your gateway to practical data manipulation for football analytics. We'll move beyond basic Python and into the specialized tools that professional analysts use daily. By the end, you'll be able to load, clean, transform, and analyze football datasets with confidence.
Why Python for Sports Analytics?
Python dominates sports analytics for several reasons:
- Rich ecosystem: Libraries like pandas and NumPy handle complex data operations with minimal code
- Readability: Python's syntax is clear and maintainable—crucial when sharing analysis with non-technical colleagues
- Community: Thousands of sports analysts share code, tutorials, and packages
- Integration: Python connects easily to databases, APIs, and visualization tools
- Career relevance: Python appears in nearly every sports analytics job listing
What You'll Learn
This chapter covers four essential skills:
- pandas fundamentals: DataFrames, Series, indexing, and filtering
- Data manipulation: Grouping, aggregation, merging, and reshaping
- NumPy essentials: Array operations and vectorized calculations
- Football-specific patterns: Common operations for sports data
Setting Up Your Environment
Before diving in, ensure your environment is ready:
# Required packages
# pip install pandas numpy
import pandas as pd
import numpy as np
# Display settings for better output
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 200)
# Verify versions
print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
Minimum recommended versions:
- pandas 1.5.0+
- NumPy 1.23.0+
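If you want scripts to fail fast on an older install, a minimal version guard might look like the sketch below (it assumes the packaging library, which is installed alongside pip):
from packaging.version import Version
# Stop immediately if the environment is older than this chapter assumes
assert Version(pd.__version__) >= Version("1.5.0"), "pandas 1.5.0+ required"
assert Version(np.__version__) >= Version("1.23.0"), "NumPy 1.23.0+ required"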
Part 1: The pandas DataFrame
The DataFrame is the workhorse of sports analytics. Think of it as a smart spreadsheet—rows are observations (games, plays, players), columns are variables (team, score, yards), and pandas provides hundreds of methods to manipulate them.
Creating DataFrames
There are several ways to create DataFrames. Let's start with the most common for football analytics:
# Method 1: From a dictionary
game_data = {
"game_id": [1, 2, 3, 4, 5],
"home_team": ["Alabama", "Georgia", "Ohio State", "Michigan", "Texas"],
"away_team": ["Auburn", "Florida", "Penn State", "Ohio State", "Oklahoma"],
"home_score": [35, 42, 28, 45, 31],
"away_score": [21, 38, 24, 23, 28]
}
games = pd.DataFrame(game_data)
print(games)
Output:
game_id home_team away_team home_score away_score
0 1 Alabama Auburn 35 21
1 2 Georgia Florida 42 38
2 3 Ohio State Penn State 28 24
3 4 Michigan Ohio State 45 23
4 5 Texas Oklahoma 31 28
# Method 2: From a list of dictionaries (common with API responses)
plays = [
{"play_id": 1, "type": "Pass", "yards": 12, "success": True},
{"play_id": 2, "type": "Rush", "yards": 3, "success": False},
{"play_id": 3, "type": "Pass", "yards": -5, "success": False},
{"play_id": 4, "type": "Rush", "yards": 8, "success": True},
{"play_id": 5, "type": "Pass", "yards": 25, "success": True}
]
plays_df = pd.DataFrame(plays)
print(plays_df)
# Method 3: From a CSV file
# games = pd.read_csv("path/to/games.csv")
# Method 4: From a Parquet file (recommended for large datasets)
# games = pd.read_parquet("path/to/games.parquet")
# Method 5: From SQL query
# games = pd.read_sql("SELECT * FROM games", connection)
Understanding DataFrame Structure
Every DataFrame has essential properties you should know:
# Basic information
print(f"Shape: {games.shape}") # (rows, columns)
print(f"Columns: {games.columns.tolist()}")
print(f"Data types:\n{games.dtypes}")
# Summary statistics
print(games.describe())
# First/last rows
print(games.head(3)) # First 3 rows
print(games.tail(2)) # Last 2 rows
# Memory usage
print(f"Memory: {games.memory_usage(deep=True).sum() / 1024:.2f} KB")
The Series: A Single Column
A single column is a Series—a one-dimensional array with labels:
# Extract a column as Series
home_scores = games["home_score"]
print(type(home_scores)) # <class 'pandas.core.series.Series'>
# Series operations
print(f"Mean: {home_scores.mean():.1f}")
print(f"Max: {home_scores.max()}")
print(f"Sum: {home_scores.sum()}")
Part 2: Selecting and Filtering Data
Selecting the right data is fundamental. pandas offers multiple approaches, each suited for different situations.
Column Selection
# Single column (returns Series)
teams = games["home_team"]
# Multiple columns (returns DataFrame)
scores = games[["home_team", "home_score", "away_score"]]
# Using .loc for explicit column selection
scores = games.loc[:, ["home_team", "home_score"]]
Row Selection by Position
Use .iloc for position-based indexing (like array indices):
# Single row (returns Series)
first_game = games.iloc[0]
# Multiple rows
first_three = games.iloc[0:3]
# Specific rows and columns
subset = games.iloc[0:3, 1:4] # rows 0-2, columns 1-3
# Last row
last_game = games.iloc[-1]
Row Selection by Label
Use .loc for label-based indexing:
# Set a meaningful index
games_indexed = games.set_index("game_id")
# Select by index value
game_3 = games_indexed.loc[3]
# Multiple index values
selected = games_indexed.loc[[1, 3, 5]]
# Range of labels (unlike .iloc, .loc slicing includes both endpoints: game_ids 2, 3, and 4)
range_games = games_indexed.loc[2:4]
Boolean Filtering
This is where pandas shines for analytics. Filter rows based on conditions:
# Simple condition
high_scoring_home = games[games["home_score"] > 35]
print(high_scoring_home)
# Multiple conditions (use & for AND, | for OR, with parentheses around each condition)
# Find games where home team scored 30+ AND away team scored 25+
high_scoring = games[(games["home_score"] >= 30) & (games["away_score"] >= 25)]
# Using query() for cleaner syntax
high_scoring = games.query("home_score >= 30 and away_score >= 25")
# String matching
alabama_games = games[games["home_team"] == "Alabama"]
# Contains (for partial matches)
ohio_games = games[games["home_team"].str.contains("Ohio")]
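To filter for several teams at once, .isin() is cleaner than chaining | conditions:
# Games hosted by any team in the list
sec_home = games[games["home_team"].isin(["Alabama", "Georgia"])]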
Practical Example: Finding Blowout Games
def find_blowouts(games_df: pd.DataFrame, margin: int = 20) -> pd.DataFrame:
"""
Find games where the winning margin exceeded a threshold.
Parameters
----------
games_df : pd.DataFrame
Games DataFrame with home_score and away_score columns
margin : int
Minimum point difference to qualify as a blowout
Returns
-------
pd.DataFrame
Filtered games that qualify as blowouts
"""
# Calculate margin (absolute difference)
games_df = games_df.copy()
games_df["margin"] = abs(games_df["home_score"] - games_df["away_score"])
# Filter for blowouts
blowouts = games_df[games_df["margin"] >= margin]
return blowouts.sort_values("margin", ascending=False)
# Example usage
blowouts = find_blowouts(games, margin=15)
print(blowouts)
Part 3: Data Transformation
Raw data rarely arrives analysis-ready. You'll spend significant time transforming data into useful forms.
Creating New Columns
# Simple calculation
games["total_points"] = games["home_score"] + games["away_score"]
games["home_margin"] = games["home_score"] - games["away_score"]
# Conditional assignment
games["home_win"] = games["home_score"] > games["away_score"]
# Using np.where for if-else logic
games["result"] = np.where(
games["home_score"] > games["away_score"],
"Home Win",
np.where(
games["home_score"] < games["away_score"],
"Away Win",
"Tie"
)
)
print(games[["home_team", "away_team", "home_score", "away_score", "result"]])
Applying Functions
For more complex transformations, use .apply():
def categorize_game(row):
"""Categorize game by scoring level."""
total = row["home_score"] + row["away_score"]
if total >= 70:
return "Shootout"
elif total >= 50:
return "High Scoring"
elif total >= 30:
return "Normal"
else:
return "Defensive Battle"
games["game_type"] = games.apply(categorize_game, axis=1)
print(games[["home_team", "away_team", "total_points", "game_type"]])
Performance tip: .apply() is flexible but slow. For simple operations, use vectorized approaches:
# Slow (apply)
games["total"] = games.apply(lambda x: x["home_score"] + x["away_score"], axis=1)
# Fast (vectorized)
games["total"] = games["home_score"] + games["away_score"]
Renaming and Reordering Columns
# Rename columns
games = games.rename(columns={
"home_score": "home_pts",
"away_score": "away_pts"
})
# Reorder columns (any column not listed, such as total_points, is dropped)
column_order = ["game_id", "home_team", "away_team", "home_pts", "away_pts", "result"]
games = games[column_order]
Handling Missing Data
Football data often has missing values. Here's how to handle them:
# Create sample data with missing values
incomplete_stats = pd.DataFrame({
"player": ["Smith", "Jones", "Williams", "Brown", "Davis"],
"rush_yards": [120, None, 85, 0, None],
"pass_yards": [None, 250, None, 180, 320],
"touchdowns": [1, 2, None, 0, 3]
})
# Check for missing values
print(incomplete_stats.isnull().sum())
# Drop rows with any missing values
clean = incomplete_stats.dropna()
# Drop rows with missing values in specific columns
clean = incomplete_stats.dropna(subset=["rush_yards"])
# Fill missing values
# With a constant
filled = incomplete_stats.fillna(0)
# With column mean
filled = incomplete_stats.fillna(incomplete_stats.mean(numeric_only=True))
# Forward fill (use previous valid value)
filled = incomplete_stats.ffill()  # fillna(method="ffill") is deprecated in pandas 2.x
Part 4: Grouping and Aggregation
Grouping is essential for comparative analysis—comparing teams, seasons, or any category.
Basic GroupBy
# Create extended game data
season_games = pd.DataFrame({
"week": [1, 1, 1, 2, 2, 2, 3, 3, 3],
"home_team": ["Alabama", "Georgia", "Texas", "Alabama", "Ohio State", "Texas",
"Georgia", "Alabama", "Ohio State"],
"away_team": ["Duke", "Clemson", "Rice", "Texas", "Indiana", "Oklahoma",
"Auburn", "LSU", "Penn State"],
"home_score": [52, 45, 38, 31, 42, 28, 35, 42, 38],
"away_score": [10, 28, 14, 28, 14, 24, 21, 17, 31]
})
# Group by and aggregate
by_week = season_games.groupby("week").agg({
"home_score": ["mean", "sum", "max"],
"away_score": ["mean", "sum"]
})
print(by_week)
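The dictionary form of .agg() produces a MultiIndex on the columns, such as ("home_score", "mean"). A common follow-up is flattening those into single-level names:
# Flatten ("home_score", "mean") into "home_score_mean"
by_week.columns = ["_".join(col) for col in by_week.columns]
print(by_week)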
Multiple Aggregations
# Named aggregations (pandas 0.25+)
summary = season_games.groupby("week").agg(
games_played=("home_team", "count"),
avg_home_score=("home_score", "mean"),
avg_away_score=("away_score", "mean"),
max_total=("home_score", lambda x: (x + season_games.loc[x.index, "away_score"]).max())
).round(1)
print(summary)
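The lambda above works because it can see season_games from the enclosing scope, but the intent is clearer if you compute the total as a column first and aggregate it directly:
# Equivalent, without the lambda
season_games["total_points"] = season_games["home_score"] + season_games["away_score"]
summary = season_games.groupby("week").agg(
    games_played=("home_team", "count"),
    avg_home_score=("home_score", "mean"),
    avg_away_score=("away_score", "mean"),
    max_total=("total_points", "max")
).round(1)
print(summary)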
Practical Example: Team Performance Summary
def calculate_team_summary(games_df: pd.DataFrame) -> pd.DataFrame:
"""
Calculate team performance summary from game data.
Parameters
----------
games_df : pd.DataFrame
DataFrame with home_team, away_team, home_score, away_score
Returns
-------
pd.DataFrame
Summary statistics per team
"""
# Process home games
home = games_df.copy()
home["team"] = home["home_team"]
home["points_for"] = home["home_score"]
home["points_against"] = home["away_score"]
home["is_win"] = home["home_score"] > home["away_score"]
home["is_home"] = True
# Process away games
away = games_df.copy()
away["team"] = away["away_team"]
away["points_for"] = away["away_score"]
away["points_against"] = away["home_score"]
away["is_win"] = away["away_score"] > away["home_score"]
away["is_home"] = False
# Combine
all_games = pd.concat([
home[["team", "points_for", "points_against", "is_win", "is_home"]],
away[["team", "points_for", "points_against", "is_win", "is_home"]]
])
# Aggregate by team
summary = all_games.groupby("team").agg(
games=("is_win", "count"),
wins=("is_win", "sum"),
total_points_for=("points_for", "sum"),
total_points_against=("points_against", "sum"),
avg_points_for=("points_for", "mean"),
avg_points_against=("points_against", "mean")
)
# Add calculated columns
summary["losses"] = summary["games"] - summary["wins"]
summary["win_pct"] = (summary["wins"] / summary["games"]).round(3)
summary["point_diff"] = summary["total_points_for"] - summary["total_points_against"]
return summary.sort_values("win_pct", ascending=False)
# Calculate summary
team_summary = calculate_team_summary(season_games)
print(team_summary)
Part 5: Merging and Joining DataFrames
Football analysis often requires combining data from multiple sources—games with team info, plays with game context, etc.
Understanding Join Types
# Create sample data
teams = pd.DataFrame({
"team_id": [1, 2, 3, 4],
"team_name": ["Alabama", "Georgia", "Ohio State", "Michigan"],
"conference": ["SEC", "SEC", "Big Ten", "Big Ten"]
})
rankings = pd.DataFrame({
"team_id": [1, 2, 3, 5], # Note: team 4 missing, team 5 not in teams
"ap_rank": [1, 2, 4, 8],
"cfp_rank": [1, 3, 4, 9]
})
# Inner join (only matching rows)
inner = pd.merge(teams, rankings, on="team_id", how="inner")
print("Inner join:")
print(inner)
# Left join (all rows from left table)
left = pd.merge(teams, rankings, on="team_id", how="left")
print("\nLeft join:")
print(left)
# Outer join (all rows from both tables)
outer = pd.merge(teams, rankings, on="team_id", how="outer")
print("\nOuter join:")
print(outer)
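When a join drops or duplicates rows unexpectedly, merge's indicator flag shows where each row came from:
# The _merge column reports "both", "left_only", or "right_only" for each row
audit = pd.merge(teams, rankings, on="team_id", how="outer", indicator=True)
print(audit[["team_id", "_merge"]])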
Merging on Different Column Names
games_with_ids = pd.DataFrame({
"game_id": [1, 2, 3],
"home_team_id": [1, 2, 3],
"away_team_id": [2, 3, 1],
"home_score": [28, 35, 21],
"away_score": [24, 28, 17]
})
# Merge to get home team names
result = pd.merge(
games_with_ids,
teams,
left_on="home_team_id",
right_on="team_id",
how="left"
).rename(columns={"team_name": "home_team"})
# Then merge for away team names
result = pd.merge(
result,
teams[["team_id", "team_name"]],
left_on="away_team_id",
right_on="team_id",
how="left"
).rename(columns={"team_name": "away_team"})
print(result[["game_id", "home_team", "away_team", "home_score", "away_score"]])
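Because left_on/right_on keeps both key columns, the two merges above leave redundant team_id_x and team_id_y columns behind; dropping them keeps the result tidy:
# Remove the join keys carried over from each merge
result = result.drop(columns=["team_id_x", "team_id_y"])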
Practical Example: Enriching Play Data
def enrich_play_data(plays_df: pd.DataFrame, games_df: pd.DataFrame) -> pd.DataFrame:
"""
Enrich play-by-play data with game context.
Parameters
----------
plays_df : pd.DataFrame
Play-by-play data with game_id column
games_df : pd.DataFrame
Game data with game_id, home_team, away_team
Returns
-------
pd.DataFrame
Plays with game context added
"""
# Select game columns to add
game_cols = ["game_id", "home_team", "away_team", "week"]
game_context = games_df[game_cols].copy()
# Merge
enriched = pd.merge(
plays_df,
game_context,
on="game_id",
how="left"
)
return enriched
# Example
plays = pd.DataFrame({
"play_id": [1, 2, 3, 4],
"game_id": [1, 1, 2, 2],
"play_type": ["Pass", "Rush", "Pass", "Rush"],
"yards": [12, 4, -3, 8]
})
game_info = pd.DataFrame({
"game_id": [1, 2],
"home_team": ["Alabama", "Georgia"],
"away_team": ["Auburn", "Florida"],
"week": [1, 1]
})
enriched_plays = enrich_play_data(plays, game_info)
print(enriched_plays)
Part 6: NumPy Essentials
NumPy provides the numerical foundation for pandas and enables efficient calculations.
Array Basics
# Create arrays
scores = np.array([28, 35, 42, 21, 17])
print(f"Array: {scores}")
print(f"Type: {scores.dtype}")
print(f"Shape: {scores.shape}")
# Basic statistics
print(f"Mean: {np.mean(scores):.1f}")
print(f"Std: {np.std(scores):.1f}")
print(f"Median: {np.median(scores)}")
print(f"Min/Max: {np.min(scores)}/{np.max(scores)}")
Vectorized Operations
NumPy's power comes from vectorized operations—applying functions to entire arrays at once:
# Element-wise operations
home_scores = np.array([35, 42, 28, 45, 31])
away_scores = np.array([21, 38, 24, 23, 28])
# Margins (no loops needed!)
margins = home_scores - away_scores
print(f"Margins: {margins}")
# Total points
totals = home_scores + away_scores
print(f"Totals: {totals}")
# Percentage of home points
home_pct = home_scores / totals * 100
print(f"Home %: {home_pct.round(1)}")
# Boolean operations
home_wins = home_scores > away_scores
print(f"Home wins: {home_wins}")
print(f"Win count: {np.sum(home_wins)}")
Statistical Functions
# Game yardage data
rush_yards = np.array([150, 180, 120, 200, 95, 165, 140])
pass_yards = np.array([280, 250, 310, 220, 340, 275, 300])
# Basic statistics
print(f"Rush yards - Mean: {np.mean(rush_yards):.1f}, Std: {np.std(rush_yards):.1f}")
print(f"Pass yards - Mean: {np.mean(pass_yards):.1f}, Std: {np.std(pass_yards):.1f}")
# Correlation
correlation = np.corrcoef(rush_yards, pass_yards)[0, 1]
print(f"Rush/Pass correlation: {correlation:.3f}")
# Percentiles
print(f"Rush yards 75th percentile: {np.percentile(rush_yards, 75):.1f}")
# Cumulative sum (running totals)
cumulative_rush = np.cumsum(rush_yards)
print(f"Cumulative rush yards: {cumulative_rush}")
Practical Example: EPA Calculation Simulation
Expected Points Added (EPA) calculations often use NumPy for efficiency:
def simulate_epa_calculation(plays_df: pd.DataFrame) -> pd.DataFrame:
"""
Simulate EPA calculation for plays.
This is a simplified demonstration. Real EPA uses complex
expected points models.
Parameters
----------
plays_df : pd.DataFrame
Play data with yard_line, down, distance, yards_gained
Returns
-------
pd.DataFrame
Plays with EPA column added
"""
df = plays_df.copy()
# Simplified expected points by yard line (real models are more complex)
# This uses a rough approximation
def expected_points(yard_line):
# yard_line is yards from opponent's end zone (1-99)
return np.where(
yard_line <= 0, 7, # Touchdown
np.where(
yard_line <= 10, 4.5,
np.where(
yard_line <= 30, 2.5,
np.where(
yard_line <= 50, 1.0,
np.where(
yard_line <= 70, 0.0,
-1.0
)
)
)
)
)
# Calculate pre-play and post-play expected points
df["ep_before"] = expected_points(df["yard_line"].values)
df["new_yard_line"] = df["yard_line"] - df["yards_gained"]
df["ep_after"] = expected_points(df["new_yard_line"].values)
# EPA = change in expected points
df["epa"] = df["ep_after"] - df["ep_before"]
return df
# Example
plays_data = pd.DataFrame({
"play_id": [1, 2, 3, 4, 5],
"yard_line": [75, 50, 25, 10, 5], # Yards from opponent's end zone
"down": [1, 2, 1, 3, 1],
"distance": [10, 7, 10, 3, 5],
"yards_gained": [8, 3, 15, 10, 5] # Last play is TD
})
with_epa = simulate_epa_calculation(plays_data)
print(with_epa[["play_id", "yard_line", "yards_gained", "ep_before", "ep_after", "epa"]])
Part 7: Working with Football Data Types
Football data has specific patterns that require specialized handling.
Time and Date Operations
# Create sample schedule
schedule = pd.DataFrame({
"game_id": [1, 2, 3, 4, 5],
"date": ["2023-09-02", "2023-09-09", "2023-09-16", "2023-09-23", "2023-09-30"],
"home_team": ["Alabama", "Georgia", "Texas", "Ohio State", "Michigan"],
"kickoff_time": ["3:30 PM", "12:00 PM", "7:00 PM", "12:00 PM", "3:30 PM"]
})
# Convert to datetime
schedule["date"] = pd.to_datetime(schedule["date"])
# Extract components
schedule["year"] = schedule["date"].dt.year
schedule["month"] = schedule["date"].dt.month
schedule["day_of_week"] = schedule["date"].dt.day_name()
print(schedule[["game_id", "home_team", "date", "day_of_week"]])
# Filter by date
september_games = schedule[schedule["date"].dt.month == 9]
saturday_games = schedule[schedule["day_of_week"] == "Saturday"]
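A common scheduling calculation is rest between games. With proper datetimes, .diff() gives the gap between consecutive rows; this sketch assumes the schedule is one team's games sorted by date:
# Days since the previous game (NaN for the first row)
schedule = schedule.sort_values("date")
schedule["days_since_prev"] = schedule["date"].diff().dt.days
print(schedule[["game_id", "date", "days_since_prev"]])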
Categorical Data
Many football variables are categorical (conference, position, play type):
# Create categorical column for memory efficiency
games = pd.DataFrame({
"team": ["Alabama", "Georgia", "Alabama", "Texas", "Georgia"] * 100,
"conference": ["SEC", "SEC", "SEC", "Big 12", "SEC"] * 100,
"result": ["W", "W", "L", "W", "L"] * 100
})
# Memory before
memory_before = games.memory_usage(deep=True).sum()
# Convert to categorical
games["conference"] = games["conference"].astype("category")
games["result"] = games["result"].astype("category")
# Memory after
memory_after = games.memory_usage(deep=True).sum()
print(f"Memory reduction: {(1 - memory_after/memory_before)*100:.1f}%")
# Categorical operations
print(games["conference"].value_counts())
Handling Play-by-Play Text
Play descriptions often need parsing:
# Sample play descriptions
plays = pd.DataFrame({
"play_id": [1, 2, 3, 4],
"description": [
"J.Smith pass complete to M.Jones for 15 yards",
"T.Brown rush up the middle for 4 yards",
"J.Smith pass incomplete intended for R.Wilson",
"PENALTY on Alabama, False Start, 5 yards"
]
})
# Extract play type
plays["is_pass"] = plays["description"].str.contains("pass", case=False)
plays["is_rush"] = plays["description"].str.contains("rush", case=False)
plays["is_penalty"] = plays["description"].str.contains("penalty", case=False)
# Extract yards using regex (expand=False returns a Series rather than a DataFrame)
# Note: this also captures penalty yardage, such as the 5 yards in play 4
plays["yards"] = plays["description"].str.extract(r"(\d+)\s*yard", expand=False).astype(float)
print(plays)
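The same technique extracts other fields. As a sketch that assumes the "J.Smith"-style naming shown above, a leading-name pattern pulls the passer or rusher from each description (penalty rows simply get NaN):
# First "X.Lastname" token at the start of the description, if any
plays["player"] = plays["description"].str.extract(r"^([A-Z]\.[A-Za-z]+)", expand=False)
print(plays[["play_id", "player"]])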
Part 8: Building Reusable Functions
Professional analysts build libraries of reusable functions. Here are patterns for football analytics:
Data Loading Function
import os

def load_season_data(
    year: int,
    data_dir: str = "data",
    data_types: list = None  # default filled in below (avoids a mutable default argument)
) -> dict:
"""
Load football data for a season.
Parameters
----------
year : int
Season year
data_dir : str
Directory containing data files
    data_types : list, optional
        Types of data to load; defaults to ["games", "plays"]
Returns
-------
dict
Dictionary of DataFrames
"""
    if data_types is None:
        data_types = ["games", "plays"]
    data = {}
for dtype in data_types:
filepath = os.path.join(data_dir, f"{dtype}_{year}.parquet")
try:
data[dtype] = pd.read_parquet(filepath)
print(f"Loaded {dtype}: {len(data[dtype])} rows")
except FileNotFoundError:
# Try CSV fallback
csv_path = filepath.replace(".parquet", ".csv")
try:
data[dtype] = pd.read_csv(csv_path)
print(f"Loaded {dtype} (CSV): {len(data[dtype])} rows")
except FileNotFoundError:
print(f"Warning: {dtype} not found")
data[dtype] = pd.DataFrame()
return data
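Usage is then a one-liner (the paths here are hypothetical; the files must exist on disk):
# season = load_season_data(2023)
# games_2023 = season["games"]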
Team Performance Calculator
def calculate_team_metrics(
    games_df: pd.DataFrame,
    team: str,
    metrics: list = None  # default filled in below (avoids a mutable default argument)
) -> dict:
"""
Calculate performance metrics for a team.
Parameters
----------
games_df : pd.DataFrame
Games with home_team, away_team, home_score, away_score
team : str
Team name
    metrics : list, optional
        Metrics to calculate; defaults to ["wins", "ppg", "ppg_allowed"]
Returns
-------
dict
Calculated metrics
"""
    if metrics is None:
        metrics = ["wins", "ppg", "ppg_allowed"]
    # Get all games for the team
home_games = games_df[games_df["home_team"] == team].copy()
away_games = games_df[games_df["away_team"] == team].copy()
# Standardize columns
home_games["points_for"] = home_games["home_score"]
home_games["points_against"] = home_games["away_score"]
home_games["is_win"] = home_games["home_score"] > home_games["away_score"]
away_games["points_for"] = away_games["away_score"]
away_games["points_against"] = away_games["home_score"]
away_games["is_win"] = away_games["away_score"] > away_games["home_score"]
all_games = pd.concat([home_games, away_games])
results = {"team": team, "games": len(all_games)}
if "wins" in metrics:
results["wins"] = all_games["is_win"].sum()
results["losses"] = len(all_games) - results["wins"]
results["win_pct"] = round(results["wins"] / len(all_games), 3)
if "ppg" in metrics:
results["ppg"] = round(all_games["points_for"].mean(), 1)
if "ppg_allowed" in metrics:
results["ppg_allowed"] = round(all_games["points_against"].mean(), 1)
if "ppg" in metrics and "ppg_allowed" in metrics:
results["point_diff_per_game"] = round(results["ppg"] - results["ppg_allowed"], 1)
return results
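For example, using the season_games DataFrame from Part 4:
team_metrics = calculate_team_metrics(season_games, "Alabama")
print(team_metrics)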
Play Success Calculator
def calculate_play_success(
plays_df: pd.DataFrame,
success_thresholds: dict = None
) -> pd.DataFrame:
"""
Add success indicator to plays based on down and yards gained.
Standard success thresholds:
- 1st down: 40% of yards to go
- 2nd down: 50% of yards to go
- 3rd/4th down: 100% of yards to go
Parameters
----------
plays_df : pd.DataFrame
Play data with down, distance, yards_gained
success_thresholds : dict, optional
Custom thresholds by down
Returns
-------
pd.DataFrame
Plays with success column added
"""
if success_thresholds is None:
success_thresholds = {1: 0.4, 2: 0.5, 3: 1.0, 4: 1.0}
df = plays_df.copy()
def is_successful(row):
if pd.isna(row["down"]) or pd.isna(row["distance"]):
return None
threshold_pct = success_thresholds.get(row["down"], 1.0)
yards_needed = row["distance"] * threshold_pct
return row["yards_gained"] >= yards_needed
df["success"] = df.apply(is_successful, axis=1)
return df
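Consistent with the performance advice in Part 3, the same logic vectorizes cleanly. One sketch maps each down to its threshold and compares whole columns at once; note that rows with a missing down come out False here rather than None:
def calculate_play_success_vectorized(plays_df: pd.DataFrame) -> pd.DataFrame:
    """Vectorized success calculation using the same standard thresholds."""
    df = plays_df.copy()
    # Map each down to its success threshold; unknown or missing downs become NaN
    thresholds = df["down"].map({1: 0.4, 2: 0.5, 3: 1.0, 4: 1.0})
    # Comparisons against NaN evaluate to False
    df["success"] = df["yards_gained"] >= df["distance"] * thresholds
    return df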
Part 9: Performance Optimization
When working with large datasets (seasons of play-by-play data), performance matters.
Memory Optimization
def optimize_dataframe_memory(df: pd.DataFrame) -> pd.DataFrame:
"""
Optimize DataFrame memory usage by downcasting types.
Parameters
----------
df : pd.DataFrame
DataFrame to optimize
Returns
-------
pd.DataFrame
Memory-optimized DataFrame
"""
result = df.copy()
for col in result.columns:
col_type = result[col].dtype
# Downcast integers
if col_type == 'int64':
result[col] = pd.to_numeric(result[col], downcast='integer')
# Downcast floats
elif col_type == 'float64':
result[col] = pd.to_numeric(result[col], downcast='float')
# Convert object columns with few unique values to categorical
elif col_type == 'object':
num_unique = result[col].nunique()
num_total = len(result)
if num_unique / num_total < 0.5: # Less than 50% unique
result[col] = result[col].astype('category')
return result
# Example
plays = pd.DataFrame({
"play_id": range(100000),
"game_id": np.random.randint(1, 1000, 100000),
"play_type": np.random.choice(["Pass", "Rush", "Punt"], 100000),
"yards": np.random.normal(5, 10, 100000)
})
memory_before = plays.memory_usage(deep=True).sum() / 1024**2
optimized = optimize_dataframe_memory(plays)
memory_after = optimized.memory_usage(deep=True).sum() / 1024**2
print(f"Before: {memory_before:.2f} MB")
print(f"After: {memory_after:.2f} MB")
print(f"Reduction: {(1 - memory_after/memory_before)*100:.1f}%")
Vectorization Over Loops
import time
# Sample data
n = 100000
plays = pd.DataFrame({
"home_score": np.random.randint(0, 50, n),
"away_score": np.random.randint(0, 50, n)
})
# SLOW: Using a loop
start = time.time()
results_slow = []
for idx, row in plays.iterrows():
if row["home_score"] > row["away_score"]:
results_slow.append("Home")
elif row["away_score"] > row["home_score"]:
results_slow.append("Away")
else:
results_slow.append("Tie")
loop_time = time.time() - start
# FAST: Vectorized with np.select
start = time.time()
conditions = [
plays["home_score"] > plays["away_score"],
plays["away_score"] > plays["home_score"]
]
choices = ["Home", "Away"]
results_fast = np.select(conditions, choices, default="Tie")
vector_time = time.time() - start
print(f"Loop time: {loop_time:.3f}s")
print(f"Vectorized time: {vector_time:.3f}s")
print(f"Speedup: {loop_time/vector_time:.0f}x")
Putting It All Together
Let's combine everything into a complete analysis workflow:
def complete_season_analysis(year: int, games_df: pd.DataFrame) -> dict:
"""
Perform complete season analysis.
Parameters
----------
year : int
Season year
games_df : pd.DataFrame
Season games data
Returns
-------
dict
Analysis results
"""
results = {"year": year}
# 1. Basic statistics
results["total_games"] = len(games_df)
results["avg_total_points"] = round(
(games_df["home_score"] + games_df["away_score"]).mean(), 1
)
# 2. Home field advantage
games_df["home_win"] = games_df["home_score"] > games_df["away_score"]
results["home_win_pct"] = round(games_df["home_win"].mean(), 3)
results["avg_home_margin"] = round(
(games_df["home_score"] - games_df["away_score"]).mean(), 1
)
# 3. Blowout analysis
games_df["margin"] = abs(games_df["home_score"] - games_df["away_score"])
results["blowout_pct"] = round(
(games_df["margin"] >= 21).mean(), 3
)
results["close_game_pct"] = round(
(games_df["margin"] <= 7).mean(), 3
)
# 4. Team summaries
team_summary = calculate_team_summary(games_df)
results["top_5_teams"] = team_summary.head(5).index.tolist()
# 5. Scoring distribution
results["scoring_percentiles"] = {
"25th": int((games_df["home_score"] + games_df["away_score"]).quantile(0.25)),
"50th": int((games_df["home_score"] + games_df["away_score"]).quantile(0.50)),
"75th": int((games_df["home_score"] + games_df["away_score"]).quantile(0.75))
}
return results
# Run analysis
sample_games = pd.DataFrame({
"game_id": range(100),
"home_team": np.random.choice(
["Alabama", "Georgia", "Ohio State", "Michigan", "Texas"], 100
),
"away_team": np.random.choice(
["Auburn", "Florida", "Penn State", "Wisconsin", "Oklahoma"], 100
),
"home_score": np.random.randint(14, 49, 100),
"away_score": np.random.randint(10, 45, 100)
})
analysis = complete_season_analysis(2023, sample_games)
print("Season Analysis Results:")
print("-" * 40)
for key, value in analysis.items():
print(f" {key}: {value}")
Summary
This chapter introduced the essential Python tools for football analytics:
pandas Fundamentals:
- DataFrames and Series for structured data
- Selection with .loc, .iloc, and boolean indexing
- GroupBy for comparative analysis
- Merging DataFrames for data enrichment
NumPy Essentials:
- Vectorized operations for speed
- Statistical functions
- Array-based calculations
Best Practices:
- Vectorize instead of loop
- Optimize memory for large datasets
- Build reusable functions
- Handle missing data appropriately
In the next chapter, we'll build on these skills to explore descriptive statistics specifically tailored for football analysis.
Key Terms
| Term | Definition |
|---|---|
| DataFrame | Two-dimensional labeled data structure in pandas |
| Series | One-dimensional labeled array in pandas |
| Vectorization | Applying operations to entire arrays without loops |
| Boolean indexing | Selecting rows based on True/False conditions |
| GroupBy | Splitting data into groups for aggregation |
| Merge/Join | Combining DataFrames based on common columns |
| Categorical | Data type for columns with limited unique values |
References
- McKinney, W. (2022). Python for Data Analysis, 3rd Edition. O'Reilly Media.
- pandas Documentation. https://pandas.pydata.org/docs/
- NumPy Documentation. https://numpy.org/doc/
- VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly Media.
- University of Michigan. "Applied Data Science with Python" Specialization. Coursera.