Key Takeaways: Python for Sports Analytics


One-Page Summary

Core Data Structures

PANDAS HIERARCHY

DataFrame (2D)
├── Multiple columns
├── Multiple rows
└── Each column is a Series

Series (1D)
├── Single column
├── Index + Values
└── Supports vectorized operations

Key insight: Work with DataFrames for data manipulation, extract Series when you need column-specific operations.


Essential Operations Cheat Sheet

Operation Code Returns
Select column df["col"] Series
Select columns df[["col1", "col2"]] DataFrame
Select by position df.iloc[0:5] DataFrame
Select by label df.loc["key"] Row as Series
Filter rows df[df["col"] > value] DataFrame
Add column df["new"] = values Modified DataFrame
Group & aggregate df.groupby("col").mean() DataFrame
Merge pd.merge(df1, df2, on="key") DataFrame

Boolean Filtering Pattern

# Single condition
filtered = df[df["score"] > 30]

# Multiple conditions (AND)
filtered = df[(df["score"] > 30) & (df["team"] == "Alabama")]

# Multiple conditions (OR)
filtered = df[(df["home_win"]) | (df["margin"] > 20)]

# Using query() for readability
filtered = df.query("score > 30 and team == 'Alabama'")

Remember: Use & not and, | not or. Parentheses required!


GroupBy Pattern

# Basic aggregation
df.groupby("team")["score"].mean()

# Multiple aggregations
df.groupby("team").agg(
    games=("score", "count"),
    avg_score=("score", "mean"),
    total_yards=("yards", "sum")
)

Merge Types

Type Keeps Use Case
inner Only matching rows Default, safest
left All from left table Keep all original data
right All from right table Keep all lookup data
outer All from both Full picture with gaps

NumPy Essentials

# Vectorized operations (FAST)
margins = home_scores - away_scores

# Conditional assignment
results = np.where(home_scores > away_scores, "Win", "Loss")

# Multiple conditions
categories = np.select(
    [margins >= 21, margins >= 7, margins > 0],
    ["Blowout", "Comfortable", "Close"],
    default="Loss"
)

Performance Rules

Approach Speed Use When
Vectorized (NumPy/pandas) Fastest Always try first
.apply() Medium Complex row logic
For loop Slowest Avoid for data ops

Rule of thumb: If you're writing a loop over DataFrame rows, there's probably a better way.


Data Type Optimization

# Downcast integers
df["col"] = pd.to_numeric(df["col"], downcast="integer")

# Convert to categorical (for repeated strings)
df["team"] = df["team"].astype("category")

# Expected savings:
# - Integer downcast: 50-75% reduction
# - Categorical strings: 80-95% reduction

Missing Data Handling

Method Effect
df.dropna() Remove rows with any NaN
df.dropna(subset=["col"]) Remove only if "col" is NaN
df.fillna(0) Replace NaN with 0
df.fillna(df.mean()) Replace with column mean
df.interpolate() Fill with interpolated values

Common Football Analytics Patterns

Pattern 1: Home/Away Normalization

# Standardize home and away into single team view
home = games[["home_team", "home_score"]].rename(
    columns={"home_team": "team", "home_score": "points"}
)
away = games[["away_team", "away_score"]].rename(
    columns={"away_team": "team", "away_score": "points"}
)
all_team_games = pd.concat([home, away])

Pattern 2: Success Rate Calculation

def is_successful(row):
    thresholds = {1: 0.4, 2: 0.5, 3: 1.0, 4: 1.0}
    return row["yards"] >= row["distance"] * thresholds.get(row["down"], 1)

Pattern 3: Rolling Calculations

# 3-game rolling average
df["rolling_avg"] = df.groupby("team")["points"].transform(
    lambda x: x.rolling(3).mean()
)

Key Terms Quick Reference

Term Definition
DataFrame 2D labeled data structure with rows and columns
Series 1D labeled array (single column)
Vectorization Applying operations to entire arrays at once
Boolean mask Array of True/False for filtering
GroupBy Split-apply-combine pattern for aggregation
Merge/Join Combining DataFrames on common keys
Categorical Efficient type for repeated string values
Downcast Reducing memory by using smaller numeric types

Code Snippet Library

Load Data with Error Handling

def load_data(filepath, fallback_csv=True):
    try:
        return pd.read_parquet(filepath)
    except:
        if fallback_csv:
            return pd.read_csv(filepath.replace(".parquet", ".csv"))
        raise

Calculate Team Record

def get_record(games, team):
    home = games[games["home_team"] == team]
    away = games[games["away_team"] == team]
    wins = ((home["home_score"] > home["away_score"]).sum() +
            (away["away_score"] > away["home_score"]).sum())
    losses = len(home) + len(away) - wins
    return wins, losses

Find Explosive Plays

explosive = plays[
    ((plays["play_type"] == "Pass") & (plays["yards"] >= 15)) |
    ((plays["play_type"] == "Rush") & (plays["yards"] >= 10))
]

Looking Ahead

Chapter 4 introduces Descriptive Statistics in Football: - Central tendency and spread for football metrics - Distribution analysis for player and team performance - Correlation and relationships between statistics - Building statistical profiles for comparison