Learning Objectives

  • Master pandas DataFrames for football data manipulation
  • Apply NumPy operations to sports statistics
  • Write efficient data processing pipelines
  • Build reusable functions for common football analytics tasks

Chapter 3: Python for Sports Analytics

"The best tool is the one you know how to use." — Traditional programmer wisdom

Introduction

In the previous chapter, you learned where football data lives and how to access it. Now it's time to learn how to work with that data. Python has become the lingua franca of sports analytics, and for good reason: its ecosystem of data science libraries—pandas, NumPy, and matplotlib—provides everything you need to transform raw game data into actionable insights.

This chapter is your gateway to practical data manipulation for football analytics. We'll move beyond basic Python and into the specialized tools that professional analysts use daily. By the end, you'll be able to load, clean, transform, and analyze football datasets with confidence.

Why Python for Sports Analytics?

Python dominates sports analytics for several reasons:

  1. Rich ecosystem: Libraries like pandas and NumPy handle complex data operations with minimal code
  2. Readability: Python's syntax is clear and maintainable—crucial when sharing analysis with non-technical colleagues
  3. Community: Thousands of sports analysts share code, tutorials, and packages
  4. Integration: Python connects easily to databases, APIs, and visualization tools
  5. Career relevance: Every major sports analytics job listing includes Python

What You'll Learn

This chapter covers four essential skills:

  • pandas fundamentals: DataFrames, Series, indexing, and filtering
  • Data manipulation: Grouping, aggregation, merging, and reshaping
  • NumPy essentials: Array operations and vectorized calculations
  • Football-specific patterns: Common operations for sports data

Setting Up Your Environment

Before diving in, ensure your environment is ready:

# Required packages
# pip install pandas numpy

import pandas as pd
import numpy as np

# Display settings for better output
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 200)

# Verify versions
print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Minimum recommended versions:

  • pandas 1.5.0+
  • NumPy 1.23.0+
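If you want your session to fail fast on an older environment, a minimal version guard (a sketch; adjust the minimums to your needs) looks like this:

```python
import pandas as pd
import numpy as np

def version_tuple(v: str) -> tuple:
    """Convert a dotted version string like '2.1.4' to a comparable tuple."""
    return tuple(int(part) for part in v.split(".")[:3] if part.isdigit())

# Compare against the minimums recommended above
assert version_tuple(pd.__version__) >= (1, 5, 0), "pandas 1.5.0+ recommended"
assert version_tuple(np.__version__) >= (1, 23, 0), "NumPy 1.23.0+ recommended"
print("Environment OK")
```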


Part 1: The pandas DataFrame

The DataFrame is the workhorse of sports analytics. Think of it as a smart spreadsheet—rows are observations (games, plays, players), columns are variables (team, score, yards), and pandas provides hundreds of methods to manipulate them.

Creating DataFrames

There are several ways to create DataFrames. Let's start with the most common for football analytics:

# Method 1: From a dictionary
game_data = {
    "game_id": [1, 2, 3, 4, 5],
    "home_team": ["Alabama", "Georgia", "Ohio State", "Michigan", "Texas"],
    "away_team": ["Auburn", "Florida", "Penn State", "Ohio State", "Oklahoma"],
    "home_score": [35, 42, 28, 45, 31],
    "away_score": [21, 38, 24, 23, 28]
}

games = pd.DataFrame(game_data)
print(games)

Output:

   game_id   home_team   away_team  home_score  away_score
0        1     Alabama      Auburn          35          21
1        2     Georgia     Florida          42          38
2        3  Ohio State  Penn State          28          24
3        4    Michigan  Ohio State          45          23
4        5       Texas    Oklahoma          31          28

# Method 2: From a list of dictionaries (common with API responses)
plays = [
    {"play_id": 1, "type": "Pass", "yards": 12, "success": True},
    {"play_id": 2, "type": "Rush", "yards": 3, "success": False},
    {"play_id": 3, "type": "Pass", "yards": -5, "success": False},
    {"play_id": 4, "type": "Rush", "yards": 8, "success": True},
    {"play_id": 5, "type": "Pass", "yards": 25, "success": True}
]

plays_df = pd.DataFrame(plays)
print(plays_df)

# Method 3: From a CSV file
# games = pd.read_csv("path/to/games.csv")

# Method 4: From a Parquet file (recommended for large datasets)
# games = pd.read_parquet("path/to/games.parquet")

# Method 5: From SQL query
# games = pd.read_sql("SELECT * FROM games", connection)

Understanding DataFrame Structure

Every DataFrame has essential properties you should know:

# Basic information
print(f"Shape: {games.shape}")           # (rows, columns)
print(f"Columns: {games.columns.tolist()}")
print(f"Data types:\n{games.dtypes}")

# Summary statistics
print(games.describe())

# First/last rows
print(games.head(3))    # First 3 rows
print(games.tail(2))    # Last 2 rows

# Memory usage
print(f"Memory: {games.memory_usage(deep=True).sum() / 1024:.2f} KB")

The Series: A Single Column

A single column is a Series—a one-dimensional array with labels:

# Extract a column as Series
home_scores = games["home_score"]
print(type(home_scores))  # <class 'pandas.core.series.Series'>

# Series operations
print(f"Mean: {home_scores.mean():.1f}")
print(f"Max: {home_scores.max()}")
print(f"Sum: {home_scores.sum()}")

Part 2: Selecting and Filtering Data

Selecting the right data is fundamental. pandas offers multiple approaches, each suited for different situations.

Column Selection

# Single column (returns Series)
teams = games["home_team"]

# Multiple columns (returns DataFrame)
scores = games[["home_team", "home_score", "away_score"]]

# Using .loc for explicit column selection
scores = games.loc[:, ["home_team", "home_score"]]

Row Selection by Position

Use .iloc for position-based indexing (like array indices):

# Single row (returns Series)
first_game = games.iloc[0]

# Multiple rows
first_three = games.iloc[0:3]

# Specific rows and columns
subset = games.iloc[0:3, 1:4]  # rows 0-2, columns 1-3

# Last row
last_game = games.iloc[-1]

Row Selection by Label

Use .loc for label-based indexing:

# Set a meaningful index
games_indexed = games.set_index("game_id")

# Select by index value
game_3 = games_indexed.loc[3]

# Multiple index values
selected = games_indexed.loc[[1, 3, 5]]

# Range of labels
range_games = games_indexed.loc[2:4]
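One subtlety worth flagging: .loc slices include both endpoints, while .iloc follows Python's usual exclusive-stop convention. A small sketch using a game_id index like the one above:

```python
import pandas as pd

games_indexed = pd.DataFrame(
    {"home_score": [35, 42, 28, 45, 31]},
    index=pd.Index([1, 2, 3, 4, 5], name="game_id"),
)

# .iloc uses positions and EXCLUDES the stop, like Python lists
print(len(games_indexed.iloc[1:3]))   # 2 rows (positions 1 and 2)

# .loc uses labels and INCLUDES the stop
print(len(games_indexed.loc[2:4]))    # 3 rows (labels 2, 3, and 4)
```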

Boolean Filtering

This is where pandas shines for analytics. Filter rows based on conditions:

# Simple condition
high_scoring_home = games[games["home_score"] > 35]
print(high_scoring_home)

# Multiple conditions (use & for AND, | for OR)
# Find games where home team scored 30+ AND away team scored 25+
close_games = games[(games["home_score"] >= 30) & (games["away_score"] >= 25)]

# Using query() for cleaner syntax
close_games = games.query("home_score >= 30 and away_score >= 25")

# String matching
alabama_games = games[games["home_team"] == "Alabama"]

# Contains (for partial matches)
ohio_games = games[games["home_team"].str.contains("Ohio")]
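For membership tests against several values at once, .isin() is cleaner than chaining == comparisons with |. A small sketch (the team grouping here is illustrative, not a real conference table):

```python
import pandas as pd

games = pd.DataFrame({
    "home_team": ["Alabama", "Georgia", "Ohio State", "Michigan", "Texas"],
    "home_score": [35, 42, 28, 45, 31],
})

# .isin() checks membership against a list of values in one pass
sec_teams = ["Alabama", "Georgia", "Texas"]  # illustrative subset
sec_home_games = games[games["home_team"].isin(sec_teams)]
print(sec_home_games)

# ~ negates a boolean mask: every team NOT in the list
non_sec = games[~games["home_team"].isin(sec_teams)]
print(len(non_sec))
```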

Practical Example: Finding Blowout Games

def find_blowouts(games_df: pd.DataFrame, margin: int = 20) -> pd.DataFrame:
    """
    Find games where the winning margin exceeded a threshold.

    Parameters
    ----------
    games_df : pd.DataFrame
        Games DataFrame with home_score and away_score columns
    margin : int
        Minimum point difference to qualify as a blowout

    Returns
    -------
    pd.DataFrame
        Filtered games that qualify as blowouts
    """
    # Calculate margin (absolute difference)
    games_df = games_df.copy()
    games_df["margin"] = abs(games_df["home_score"] - games_df["away_score"])

    # Filter for blowouts
    blowouts = games_df[games_df["margin"] >= margin]

    return blowouts.sort_values("margin", ascending=False)

# Example usage
blowouts = find_blowouts(games, margin=15)
print(blowouts)

Part 3: Data Transformation

Raw data rarely arrives analysis-ready. You'll spend significant time transforming data into useful forms.

Creating New Columns

# Simple calculation
games["total_points"] = games["home_score"] + games["away_score"]
games["home_margin"] = games["home_score"] - games["away_score"]

# Conditional assignment
games["home_win"] = games["home_score"] > games["away_score"]

# Using np.where for if-else logic
games["result"] = np.where(
    games["home_score"] > games["away_score"],
    "Home Win",
    np.where(
        games["home_score"] < games["away_score"],
        "Away Win",
        "Tie"
    )
)

print(games[["home_team", "away_team", "home_score", "away_score", "result"]])

Applying Functions

For more complex transformations, use .apply():

def categorize_game(row):
    """Categorize game by scoring level."""
    total = row["home_score"] + row["away_score"]
    if total >= 70:
        return "Shootout"
    elif total >= 50:
        return "High Scoring"
    elif total >= 30:
        return "Normal"
    else:
        return "Defensive Battle"

games["game_type"] = games.apply(categorize_game, axis=1)
print(games[["home_team", "away_team", "total_points", "game_type"]])

Performance tip: .apply() is flexible but slow. For simple operations, use vectorized approaches:

# Slow (apply)
games["total"] = games.apply(lambda x: x["home_score"] + x["away_score"], axis=1)

# Fast (vectorized)
games["total"] = games["home_score"] + games["away_score"]

Renaming and Reordering Columns

# Rename columns
games = games.rename(columns={
    "home_score": "home_pts",
    "away_score": "away_pts"
})

# Reorder columns
column_order = ["game_id", "home_team", "away_team", "home_pts", "away_pts", "result"]
games = games[column_order]

Handling Missing Data

Football data often has missing values. Here's how to handle them:

# Create sample data with missing values
incomplete_stats = pd.DataFrame({
    "player": ["Smith", "Jones", "Williams", "Brown", "Davis"],
    "rush_yards": [120, None, 85, 0, None],
    "pass_yards": [None, 250, None, 180, 320],
    "touchdowns": [1, 2, None, 0, 3]
})

# Check for missing values
print(incomplete_stats.isnull().sum())

# Drop rows with any missing values
clean = incomplete_stats.dropna()

# Drop rows with missing values in specific columns
clean = incomplete_stats.dropna(subset=["rush_yards"])

# Fill missing values
# With a constant
filled = incomplete_stats.fillna(0)

# With column mean
filled = incomplete_stats.fillna(incomplete_stats.mean(numeric_only=True))

# Forward fill (use previous valid value)
filled = incomplete_stats.ffill()
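Filling with one constant everywhere is often too blunt; passing a dict to fillna() gives each column its own default. A short sketch on a smaller frame:

```python
import pandas as pd

incomplete_stats = pd.DataFrame({
    "player": ["Smith", "Jones", "Williams"],
    "rush_yards": [120, None, 85],
    "pass_yards": [None, 250, None],
})

# A dict fills each listed column with its own value,
# leaving any columns not listed untouched
filled = incomplete_stats.fillna({"rush_yards": 0, "pass_yards": 0})
print(filled)
```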

Part 4: Grouping and Aggregation

Grouping is essential for comparative analysis—comparing teams, seasons, or any category.

Basic GroupBy

# Create extended game data
season_games = pd.DataFrame({
    "week": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "home_team": ["Alabama", "Georgia", "Texas", "Alabama", "Ohio State", "Texas",
                  "Georgia", "Alabama", "Ohio State"],
    "away_team": ["Duke", "Clemson", "Rice", "Texas", "Indiana", "Oklahoma",
                  "Auburn", "LSU", "Penn State"],
    "home_score": [52, 45, 38, 31, 42, 28, 35, 42, 38],
    "away_score": [10, 28, 14, 28, 14, 24, 21, 17, 31]
})

# Group by and aggregate
by_week = season_games.groupby("week").agg({
    "home_score": ["mean", "sum", "max"],
    "away_score": ["mean", "sum"]
})
print(by_week)
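One wrinkle with the list-style aggregation above: it produces a two-level (MultiIndex) column header. A common follow-up step, sketched here on a small frame, is to flatten those levels into single names:

```python
import pandas as pd

season_games = pd.DataFrame({
    "week": [1, 1, 2, 2],
    "home_score": [52, 45, 31, 42],
    "away_score": [10, 28, 28, 14],
})

by_week = season_games.groupby("week").agg({
    "home_score": ["mean", "max"],
    "away_score": ["mean"],
})

# Join the two header levels into single names like "home_score_mean"
by_week.columns = ["_".join(col) for col in by_week.columns]
print(by_week.columns.tolist())
```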

Multiple Aggregations

# Named aggregations (pandas 0.25+)
# Computing the total up front keeps the aggregation simple
season_games["total_points"] = season_games["home_score"] + season_games["away_score"]

summary = season_games.groupby("week").agg(
    games_played=("home_team", "count"),
    avg_home_score=("home_score", "mean"),
    avg_away_score=("away_score", "mean"),
    max_total=("total_points", "max")
).round(1)

print(summary)

Practical Example: Team Performance Summary

def calculate_team_summary(games_df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate team performance summary from game data.

    Parameters
    ----------
    games_df : pd.DataFrame
        DataFrame with home_team, away_team, home_score, away_score

    Returns
    -------
    pd.DataFrame
        Summary statistics per team
    """
    # Process home games
    home = games_df.copy()
    home["team"] = home["home_team"]
    home["points_for"] = home["home_score"]
    home["points_against"] = home["away_score"]
    home["is_win"] = home["home_score"] > home["away_score"]
    home["is_home"] = True

    # Process away games
    away = games_df.copy()
    away["team"] = away["away_team"]
    away["points_for"] = away["away_score"]
    away["points_against"] = away["home_score"]
    away["is_win"] = away["away_score"] > away["home_score"]
    away["is_home"] = False

    # Combine
    all_games = pd.concat([
        home[["team", "points_for", "points_against", "is_win", "is_home"]],
        away[["team", "points_for", "points_against", "is_win", "is_home"]]
    ])

    # Aggregate by team
    summary = all_games.groupby("team").agg(
        games=("is_win", "count"),
        wins=("is_win", "sum"),
        total_points_for=("points_for", "sum"),
        total_points_against=("points_against", "sum"),
        avg_points_for=("points_for", "mean"),
        avg_points_against=("points_against", "mean")
    )

    # Add calculated columns
    summary["losses"] = summary["games"] - summary["wins"]
    summary["win_pct"] = (summary["wins"] / summary["games"]).round(3)
    summary["point_diff"] = summary["total_points_for"] - summary["total_points_against"]

    return summary.sort_values("win_pct", ascending=False)

# Calculate summary
team_summary = calculate_team_summary(season_games)
print(team_summary)

Part 5: Merging and Joining DataFrames

Football analysis often requires combining data from multiple sources—games with team info, plays with game context, etc.

Understanding Join Types

# Create sample data
teams = pd.DataFrame({
    "team_id": [1, 2, 3, 4],
    "team_name": ["Alabama", "Georgia", "Ohio State", "Michigan"],
    "conference": ["SEC", "SEC", "Big Ten", "Big Ten"]
})

rankings = pd.DataFrame({
    "team_id": [1, 2, 3, 5],  # Note: team 4 missing, team 5 not in teams
    "ap_rank": [1, 2, 4, 8],
    "cfp_rank": [1, 3, 4, 9]
})

# Inner join (only matching rows)
inner = pd.merge(teams, rankings, on="team_id", how="inner")
print("Inner join:")
print(inner)

# Left join (all rows from left table)
left = pd.merge(teams, rankings, on="team_id", how="left")
print("\nLeft join:")
print(left)

# Outer join (all rows from both tables)
outer = pd.merge(teams, rankings, on="team_id", how="outer")
print("\nOuter join:")
print(outer)
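When debugging joins like the outer join above, pandas' indicator=True flag adds a _merge column recording where each row originated, which makes mismatches easy to audit:

```python
import pandas as pd

teams = pd.DataFrame({
    "team_id": [1, 2, 3, 4],
    "team_name": ["Alabama", "Georgia", "Ohio State", "Michigan"],
})
rankings = pd.DataFrame({"team_id": [1, 2, 3, 5], "ap_rank": [1, 2, 4, 8]})

# _merge is 'both' for matched rows, 'left_only' for team 4 (unranked),
# and 'right_only' for team 5 (missing from the teams table)
audit = pd.merge(teams, rankings, on="team_id", how="outer", indicator=True)
print(audit["_merge"].value_counts())
```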

Merging on Different Column Names

games_with_ids = pd.DataFrame({
    "game_id": [1, 2, 3],
    "home_team_id": [1, 2, 3],
    "away_team_id": [2, 3, 1],
    "home_score": [28, 35, 21],
    "away_score": [24, 28, 17]
})

# Merge to get home team names
result = pd.merge(
    games_with_ids,
    teams,
    left_on="home_team_id",
    right_on="team_id",
    how="left"
).rename(columns={"team_name": "home_team"})

# Then merge for away team names
result = pd.merge(
    result,
    teams[["team_id", "team_name"]],
    left_on="away_team_id",
    right_on="team_id",
    how="left"
).rename(columns={"team_name": "away_team"})

print(result[["game_id", "home_team", "away_team", "home_score", "away_score"]])

Practical Example: Enriching Play Data

def enrich_play_data(plays_df: pd.DataFrame, games_df: pd.DataFrame) -> pd.DataFrame:
    """
    Enrich play-by-play data with game context.

    Parameters
    ----------
    plays_df : pd.DataFrame
        Play-by-play data with game_id column
    games_df : pd.DataFrame
        Game data with game_id, home_team, away_team, week

    Returns
    -------
    pd.DataFrame
        Plays with game context added
    """
    # Select game columns to add
    game_cols = ["game_id", "home_team", "away_team", "week"]
    game_context = games_df[game_cols].copy()

    # Merge
    enriched = pd.merge(
        plays_df,
        game_context,
        on="game_id",
        how="left"
    )

    return enriched

# Example
plays = pd.DataFrame({
    "play_id": [1, 2, 3, 4],
    "game_id": [1, 1, 2, 2],
    "play_type": ["Pass", "Rush", "Pass", "Rush"],
    "yards": [12, 4, -3, 8]
})

game_info = pd.DataFrame({
    "game_id": [1, 2],
    "home_team": ["Alabama", "Georgia"],
    "away_team": ["Auburn", "Florida"],
    "week": [1, 1]
})

enriched_plays = enrich_play_data(plays, game_info)
print(enriched_plays)

Part 6: NumPy Essentials

NumPy provides the numerical foundation for pandas and enables efficient calculations.

Array Basics

# Create arrays
scores = np.array([28, 35, 42, 21, 17])
print(f"Array: {scores}")
print(f"Type: {scores.dtype}")
print(f"Shape: {scores.shape}")

# Basic statistics
print(f"Mean: {np.mean(scores):.1f}")
print(f"Std: {np.std(scores):.1f}")
print(f"Median: {np.median(scores)}")
print(f"Min/Max: {np.min(scores)}/{np.max(scores)}")

Vectorized Operations

NumPy's power comes from vectorized operations—applying functions to entire arrays at once:

# Element-wise operations
home_scores = np.array([35, 42, 28, 45, 31])
away_scores = np.array([21, 38, 24, 23, 28])

# Margins (no loops needed!)
margins = home_scores - away_scores
print(f"Margins: {margins}")

# Total points
totals = home_scores + away_scores
print(f"Totals: {totals}")

# Percentage of home points
home_pct = home_scores / totals * 100
print(f"Home %: {home_pct.round(1)}")

# Boolean operations
home_wins = home_scores > away_scores
print(f"Home wins: {home_wins}")
print(f"Win count: {np.sum(home_wins)}")

Statistical Functions

# Game yardage data
rush_yards = np.array([150, 180, 120, 200, 95, 165, 140])
pass_yards = np.array([280, 250, 310, 220, 340, 275, 300])

# Basic statistics
print(f"Rush yards - Mean: {np.mean(rush_yards):.1f}, Std: {np.std(rush_yards):.1f}")
print(f"Pass yards - Mean: {np.mean(pass_yards):.1f}, Std: {np.std(pass_yards):.1f}")

# Correlation
correlation = np.corrcoef(rush_yards, pass_yards)[0, 1]
print(f"Rush/Pass correlation: {correlation:.3f}")

# Percentiles
print(f"Rush yards 75th percentile: {np.percentile(rush_yards, 75):.1f}")

# Cumulative sum (running totals)
cumulative_rush = np.cumsum(rush_yards)
print(f"Cumulative rush yards: {cumulative_rush}")
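Related to cumulative sums, a rolling average helps smooth game-to-game noise. One way to sketch a 3-game moving average in pure NumPy is np.convolve with a uniform window:

```python
import numpy as np

rush_yards = np.array([150, 180, 120, 200, 95, 165, 140])

# Convolve with a uniform window of weights summing to 1;
# "valid" mode keeps only positions fully covered by the window
window = 3
moving_avg = np.convolve(rush_yards, np.ones(window) / window, mode="valid")
print(moving_avg.round(1))  # one value per complete 3-game window
```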

Practical Example: EPA Calculation Simulation

Expected Points Added (EPA) calculations often use NumPy for efficiency:

def simulate_epa_calculation(plays_df: pd.DataFrame) -> pd.DataFrame:
    """
    Simulate EPA calculation for plays.

    This is a simplified demonstration. Real EPA uses complex
    expected points models.

    Parameters
    ----------
    plays_df : pd.DataFrame
        Play data with yard_line, down, distance, yards_gained

    Returns
    -------
    pd.DataFrame
        Plays with EPA column added
    """
    df = plays_df.copy()

    # Simplified expected points by yard line (real models are more complex)
    # This uses a rough approximation
    def expected_points(yard_line):
        # yard_line is yards from opponent's end zone (1-99)
        return np.where(
            yard_line <= 0, 7,  # Touchdown
            np.where(
                yard_line <= 10, 4.5,
                np.where(
                    yard_line <= 30, 2.5,
                    np.where(
                        yard_line <= 50, 1.0,
                        np.where(
                            yard_line <= 70, 0.0,
                            -1.0
                        )
                    )
                )
            )
        )

    # Calculate pre-play and post-play expected points
    df["ep_before"] = expected_points(df["yard_line"].values)
    df["new_yard_line"] = df["yard_line"] - df["yards_gained"]
    df["ep_after"] = expected_points(df["new_yard_line"].values)

    # EPA = change in expected points
    df["epa"] = df["ep_after"] - df["ep_before"]

    return df

# Example
plays_data = pd.DataFrame({
    "play_id": [1, 2, 3, 4, 5],
    "yard_line": [75, 50, 25, 10, 5],  # Yards from opponent's end zone
    "down": [1, 2, 1, 3, 1],
    "distance": [10, 7, 10, 3, 5],
    "yards_gained": [8, 3, 15, 10, 5]  # Last play is TD
})

with_epa = simulate_epa_calculation(plays_data)
print(with_epa[["play_id", "yard_line", "yards_gained", "ep_before", "ep_after", "epa"]])

Part 7: Working with Football Data Types

Football data has specific patterns that require specialized handling.

Time and Date Operations

# Create sample schedule
schedule = pd.DataFrame({
    "game_id": [1, 2, 3, 4, 5],
    "date": ["2023-09-02", "2023-09-09", "2023-09-16", "2023-09-23", "2023-09-30"],
    "home_team": ["Alabama", "Georgia", "Texas", "Ohio State", "Michigan"],
    "kickoff_time": ["3:30 PM", "12:00 PM", "7:00 PM", "12:00 PM", "3:30 PM"]
})

# Convert to datetime
schedule["date"] = pd.to_datetime(schedule["date"])

# Extract components
schedule["year"] = schedule["date"].dt.year
schedule["month"] = schedule["date"].dt.month
schedule["day_of_week"] = schedule["date"].dt.day_name()

print(schedule[["game_id", "home_team", "date", "day_of_week"]])

# Filter by date
september_games = schedule[schedule["date"].dt.month == 9]
saturday_games = schedule[schedule["day_of_week"] == "Saturday"]
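Boolean masks also make date-range filters straightforward; pandas coerces date strings on the right-hand side of the comparison:

```python
import pandas as pd

schedule = pd.DataFrame({
    "game_id": [1, 2, 3],
    "date": pd.to_datetime(["2023-09-02", "2023-10-14", "2023-11-25"]),
})

# Strings compare naturally against a datetime column
mid_season = schedule[
    (schedule["date"] >= "2023-10-01") & (schedule["date"] <= "2023-10-31")
]
print(len(mid_season))  # 1
```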

Categorical Data

Many football variables are categorical (conference, position, play type):

# Create categorical column for memory efficiency
games = pd.DataFrame({
    "team": ["Alabama", "Georgia", "Alabama", "Texas", "Georgia"] * 100,
    "conference": ["SEC", "SEC", "SEC", "Big 12", "SEC"] * 100,
    "result": ["W", "W", "L", "W", "L"] * 100
})

# Memory before
memory_before = games.memory_usage(deep=True).sum()

# Convert to categorical
games["conference"] = games["conference"].astype("category")
games["result"] = games["result"].astype("category")

# Memory after
memory_after = games.memory_usage(deep=True).sum()

print(f"Memory reduction: {(1 - memory_after/memory_before)*100:.1f}%")

# Categorical operations
print(games["conference"].value_counts())
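Categoricals can also carry a meaningful order. As a sketch using the game-type labels from earlier in the chapter, an ordered categorical makes sort_values follow scoring level rather than alphabetical order:

```python
import pandas as pd

results = pd.Series(["High Scoring", "Defensive Battle", "Shootout", "Normal"])
order = ["Defensive Battle", "Normal", "High Scoring", "Shootout"]
results_cat = results.astype(pd.CategoricalDtype(categories=order, ordered=True))

# sort_values now follows the declared order, not alphabetical order
print(results_cat.sort_values().tolist())
```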

Handling Play-by-Play Text

Play descriptions often need parsing:

# Sample play descriptions
plays = pd.DataFrame({
    "play_id": [1, 2, 3, 4],
    "description": [
        "J.Smith pass complete to M.Jones for 15 yards",
        "T.Brown rush up the middle for 4 yards",
        "J.Smith pass incomplete intended for R.Wilson",
        "PENALTY on Alabama, False Start, 5 yards"
    ]
})

# Extract play type
plays["is_pass"] = plays["description"].str.contains("pass", case=False)
plays["is_rush"] = plays["description"].str.contains("rush", case=False)
plays["is_penalty"] = plays["description"].str.contains("penalty", case=False)

# Extract yards using regex
plays["yards"] = plays["description"].str.extract(r"(\d+)\s*yard", expand=False).astype(float)

print(plays)
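Going a step further with the same descriptions, named capture groups let you pull the passer and receiver in one pass (assuming the "X.Smith pass complete to Y.Jones" phrasing shown above; real feeds vary):

```python
import pandas as pd

desc = pd.Series([
    "J.Smith pass complete to M.Jones for 15 yards",
    "T.Brown rush up the middle for 4 yards",
])

# Named groups become column names in the extracted DataFrame;
# rows that don't match the pattern come back as NaN
pattern = r"(?P<passer>[A-Z]\.\w+) pass complete to (?P<receiver>[A-Z]\.\w+)"
parsed = desc.str.extract(pattern)
print(parsed)
```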

Part 8: Building Reusable Functions

Professional analysts build libraries of reusable functions. Here are patterns for football analytics:

Data Loading Function

import os

def load_season_data(
    year: int,
    data_dir: str = "data",
    data_types: list = ["games", "plays"]
) -> dict:
    """
    Load football data for a season.

    Parameters
    ----------
    year : int
        Season year
    data_dir : str
        Directory containing data files
    data_types : list
        Types of data to load

    Returns
    -------
    dict
        Dictionary of DataFrames
    """
    data = {}

    for dtype in data_types:
        filepath = os.path.join(data_dir, f"{dtype}_{year}.parquet")

        try:
            data[dtype] = pd.read_parquet(filepath)
            print(f"Loaded {dtype}: {len(data[dtype])} rows")
        except FileNotFoundError:
            # Try CSV fallback
            csv_path = filepath.replace(".parquet", ".csv")
            try:
                data[dtype] = pd.read_csv(csv_path)
                print(f"Loaded {dtype} (CSV): {len(data[dtype])} rows")
            except FileNotFoundError:
                print(f"Warning: {dtype} not found")
                data[dtype] = pd.DataFrame()

    return data

Team Performance Calculator

def calculate_team_metrics(
    games_df: pd.DataFrame,
    team: str,
    metrics: list = ["wins", "ppg", "ppg_allowed"]
) -> dict:
    """
    Calculate performance metrics for a team.

    Parameters
    ----------
    games_df : pd.DataFrame
        Games with home_team, away_team, home_score, away_score
    team : str
        Team name
    metrics : list
        Metrics to calculate

    Returns
    -------
    dict
        Calculated metrics
    """
    # Get all games for team
    home_games = games_df[games_df["home_team"] == team].copy()
    away_games = games_df[games_df["away_team"] == team].copy()

    # Standardize columns
    home_games["points_for"] = home_games["home_score"]
    home_games["points_against"] = home_games["away_score"]
    home_games["is_win"] = home_games["home_score"] > home_games["away_score"]

    away_games["points_for"] = away_games["away_score"]
    away_games["points_against"] = away_games["home_score"]
    away_games["is_win"] = away_games["away_score"] > away_games["home_score"]

    all_games = pd.concat([home_games, away_games])

    results = {"team": team, "games": len(all_games)}

    if "wins" in metrics:
        results["wins"] = all_games["is_win"].sum()
        results["losses"] = len(all_games) - results["wins"]
        results["win_pct"] = round(results["wins"] / len(all_games), 3)

    if "ppg" in metrics:
        results["ppg"] = round(all_games["points_for"].mean(), 1)

    if "ppg_allowed" in metrics:
        results["ppg_allowed"] = round(all_games["points_against"].mean(), 1)

    if "ppg" in metrics and "ppg_allowed" in metrics:
        results["point_diff_per_game"] = round(results["ppg"] - results["ppg_allowed"], 1)

    return results

Play Success Calculator

def calculate_play_success(
    plays_df: pd.DataFrame,
    success_thresholds: dict = None
) -> pd.DataFrame:
    """
    Add success indicator to plays based on down and yards gained.

    Standard success thresholds:
    - 1st down: 40% of yards to go
    - 2nd down: 50% of yards to go
    - 3rd/4th down: 100% of yards to go

    Parameters
    ----------
    plays_df : pd.DataFrame
        Play data with down, distance, yards_gained
    success_thresholds : dict, optional
        Custom thresholds by down

    Returns
    -------
    pd.DataFrame
        Plays with success column added
    """
    if success_thresholds is None:
        success_thresholds = {1: 0.4, 2: 0.5, 3: 1.0, 4: 1.0}

    df = plays_df.copy()

    def is_successful(row):
        if pd.isna(row["down"]) or pd.isna(row["distance"]):
            return None

        threshold_pct = success_thresholds.get(row["down"], 1.0)
        yards_needed = row["distance"] * threshold_pct
        return row["yards_gained"] >= yards_needed

    df["success"] = df.apply(is_successful, axis=1)

    return df

Part 9: Performance Optimization

When working with large datasets (seasons of play-by-play data), performance matters.

Memory Optimization

def optimize_dataframe_memory(df: pd.DataFrame) -> pd.DataFrame:
    """
    Optimize DataFrame memory usage by downcasting types.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame to optimize

    Returns
    -------
    pd.DataFrame
        Memory-optimized DataFrame
    """
    result = df.copy()

    for col in result.columns:
        col_type = result[col].dtype

        # Downcast integers
        if col_type == 'int64':
            result[col] = pd.to_numeric(result[col], downcast='integer')

        # Downcast floats
        elif col_type == 'float64':
            result[col] = pd.to_numeric(result[col], downcast='float')

        # Convert object columns with few unique values to categorical
        elif col_type == 'object':
            num_unique = result[col].nunique()
            num_total = len(result)
            if num_unique / num_total < 0.5:  # Less than 50% unique
                result[col] = result[col].astype('category')

    return result

# Example
plays = pd.DataFrame({
    "play_id": range(100000),
    "game_id": np.random.randint(1, 1000, 100000),
    "play_type": np.random.choice(["Pass", "Rush", "Punt"], 100000),
    "yards": np.random.normal(5, 10, 100000)
})

memory_before = plays.memory_usage(deep=True).sum() / 1024**2
optimized = optimize_dataframe_memory(plays)
memory_after = optimized.memory_usage(deep=True).sum() / 1024**2

print(f"Before: {memory_before:.2f} MB")
print(f"After: {memory_after:.2f} MB")
print(f"Reduction: {(1 - memory_after/memory_before)*100:.1f}%")

Vectorization Over Loops

import time

# Sample data
n = 100000
plays = pd.DataFrame({
    "home_score": np.random.randint(0, 50, n),
    "away_score": np.random.randint(0, 50, n)
})

# SLOW: Using a loop
start = time.time()
results_slow = []
for idx, row in plays.iterrows():
    if row["home_score"] > row["away_score"]:
        results_slow.append("Home")
    elif row["away_score"] > row["home_score"]:
        results_slow.append("Away")
    else:
        results_slow.append("Tie")
loop_time = time.time() - start

# FAST: Vectorized with np.select
start = time.time()
conditions = [
    plays["home_score"] > plays["away_score"],
    plays["away_score"] > plays["home_score"]
]
choices = ["Home", "Away"]
results_fast = np.select(conditions, choices, default="Tie")
vector_time = time.time() - start

print(f"Loop time: {loop_time:.3f}s")
print(f"Vectorized time: {vector_time:.3f}s")
print(f"Speedup: {loop_time/vector_time:.0f}x")

Putting It All Together

Let's combine everything into a complete analysis workflow:

def complete_season_analysis(year: int, games_df: pd.DataFrame) -> dict:
    """
    Perform complete season analysis.

    Parameters
    ----------
    year : int
        Season year
    games_df : pd.DataFrame
        Season games data

    Returns
    -------
    dict
        Analysis results
    """
    results = {"year": year}

    # Work on a copy so the caller's DataFrame is not modified
    games_df = games_df.copy()

    # 1. Basic statistics
    results["total_games"] = len(games_df)
    results["avg_total_points"] = round(
        (games_df["home_score"] + games_df["away_score"]).mean(), 1
    )

    # 2. Home field advantage
    games_df["home_win"] = games_df["home_score"] > games_df["away_score"]
    results["home_win_pct"] = round(games_df["home_win"].mean(), 3)
    results["avg_home_margin"] = round(
        (games_df["home_score"] - games_df["away_score"]).mean(), 1
    )

    # 3. Blowout analysis
    games_df["margin"] = abs(games_df["home_score"] - games_df["away_score"])
    results["blowout_pct"] = round(
        (games_df["margin"] >= 21).mean(), 3
    )
    results["close_game_pct"] = round(
        (games_df["margin"] <= 7).mean(), 3
    )

    # 4. Team summaries
    team_summary = calculate_team_summary(games_df)
    results["top_5_teams"] = team_summary.head(5).index.tolist()

    # 5. Scoring distribution
    results["scoring_percentiles"] = {
        "25th": int((games_df["home_score"] + games_df["away_score"]).quantile(0.25)),
        "50th": int((games_df["home_score"] + games_df["away_score"]).quantile(0.50)),
        "75th": int((games_df["home_score"] + games_df["away_score"]).quantile(0.75))
    }

    return results

# Run analysis
sample_games = pd.DataFrame({
    "game_id": range(100),
    "home_team": np.random.choice(
        ["Alabama", "Georgia", "Ohio State", "Michigan", "Texas"], 100
    ),
    "away_team": np.random.choice(
        ["Auburn", "Florida", "Penn State", "Wisconsin", "Oklahoma"], 100
    ),
    "home_score": np.random.randint(14, 49, 100),
    "away_score": np.random.randint(10, 45, 100)
})

analysis = complete_season_analysis(2023, sample_games)

print("Season Analysis Results:")
print("-" * 40)
for key, value in analysis.items():
    print(f"  {key}: {value}")

Summary

This chapter introduced the essential Python tools for football analytics:

pandas Fundamentals:

  • DataFrames and Series for structured data
  • Selection with .loc, .iloc, and boolean indexing
  • GroupBy for comparative analysis
  • Merging DataFrames for data enrichment

NumPy Essentials:

  • Vectorized operations for speed
  • Statistical functions
  • Array-based calculations

Best Practices:

  • Vectorize instead of loop
  • Optimize memory for large datasets
  • Build reusable functions
  • Handle missing data appropriately

In the next chapter, we'll build on these skills to explore descriptive statistics specifically tailored for football analysis.


Key Terms

  • DataFrame: Two-dimensional labeled data structure in pandas
  • Series: One-dimensional labeled array in pandas
  • Vectorization: Applying operations to entire arrays without loops
  • Boolean indexing: Selecting rows based on True/False conditions
  • GroupBy: Splitting data into groups for aggregation
  • Merge/Join: Combining DataFrames based on common columns
  • Categorical: Data type for columns with limited unique values
