Key Takeaways: Python for Sports Analytics

One-Page Summary

Core Data Structures

PANDAS HIERARCHY

DataFrame (2D)
├── Multiple columns
├── Multiple rows
└── Each column is a Series

Series (1D)
├── Single column
├── Index + Values
└── Supports vectorized operations

Key insight: Work with DataFrames for data manipulation; extract a Series when you need column-specific operations.
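
As a minimal sketch of that relationship, the example below builds a small DataFrame, pulls out one column as a Series, and applies a vectorized operation to two columns. The `games` table and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical game data, for illustration only
games = pd.DataFrame({
    "team": ["Alabama", "Georgia", "Ohio State"],
    "points_for": [35, 28, 42],
    "points_against": [21, 24, 17],
})

# Selecting a single column yields a Series
points = games["points_for"]

# Arithmetic between two Series is vectorized: no explicit loop needed
margin = games["points_for"] - games["points_against"]

print(type(games).__name__)   # DataFrame
print(type(points).__name__)  # Series
print(margin.tolist())        # [14, 4, 25]
```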

Essential Operations Cheat Sheet

Operation	Code	Returns
Select column	`df["col"]`	Series
Select columns	`df[["col1", "col2"]]`	DataFrame
Select by position	`df.iloc[0:5]`	DataFrame
Select by label	`df.loc["key"]`	Row as Series
Filter rows	`df[df["col"] > value]`	DataFrame
Add column	`df["new"] = values`	Nothing (modifies `df` in place)
Group & aggregate	`df.groupby("col").mean(numeric_only=True)`	DataFrame
Merge	`pd.merge(df1, df2, on="key")`	DataFrame

Boolean Filtering Pattern

# Single condition
filtered = df[df["score"] > 30]

# Multiple conditions (AND)
filtered = df[(df["score"] > 30) & (df["team"] == "Alabama")]

# Multiple conditions (OR)
filtered = df[(df["home_win"]) | (df["margin"] > 20)]

# Using query() for readability
filtered = df.query("score > 30 and team == 'Alabama'")

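Remember: Use & not and, | not or when combining boolean masks. Wrap each comparison in parentheses, because & and | bind more tightly than comparison operators such as > and ==. Inside a query() string, the keywords and / or are fine.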
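
One way to keep compound filters readable is to name each mask before combining it. The sketch below uses a small made-up `df` (the column names follow the examples above, the values are invented) and checks that the mask form and the `query()` form select the same rows.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "team": ["Alabama", "Alabama", "Georgia", "Georgia"],
    "score": [42, 24, 35, 17],
    "margin": [21, -3, 7, -10],
    "home_win": [True, False, True, False],
})

# Name each boolean mask, then combine with & (AND), | (OR), ~ (NOT)
high_scoring = df["score"] > 30
is_alabama = df["team"] == "Alabama"

both = df[high_scoring & is_alabama]
either = df[df["home_win"] | (df["margin"] > 20)]
not_alabama = df[~is_alabama]

# The query() form of the AND filter selects the same rows
via_query = df.query("score > 30 and team == 'Alabama'")

print(both.index.tolist())     # [0]
print(either.index.tolist())   # [0, 2]
print(both.equals(via_query))  # True
```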

GroupBy Pattern

# Basic aggregation
df.groupby("team")["score"].mean()

# Multiple aggregations
df.groupby("team").agg(
    games=("score", "count"),
    avg_score=("score", "mean"),
    total_yards=("yards", "sum")
)

Merge Types

Type	Keeps	Use Case
`inner`	Only matching rows	Default; silently drops unmatched rows from both tables
`left`	All from left table	Keep all original data
`right`	All from right table	Keep all lookup data
`outer`	All from both	Full picture with gaps
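
To make the differences concrete, the sketch below merges two tiny, made-up tables with each `how` option and counts the resulting rows. It also uses the `indicator=True` option of `pd.merge`, which adds a `_merge` column showing where each row came from; this is a convenient way to spot unmatched keys. The team and conference values are illustrative placeholders.

```python
import pandas as pd

# Hypothetical tables: one key appears only on the left, one only on the right
games = pd.DataFrame({
    "team": ["Alabama", "Georgia", "Tulane"],
    "points": [35, 28, 20],
})
conferences = pd.DataFrame({
    "team": ["Alabama", "Georgia", "Oregon"],
    "conference": ["SEC", "SEC", "Big Ten"],
})

# Count rows produced by each merge type
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(games, conferences, on="team", how=how)
    row_counts[how] = len(merged)

print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

# indicator=True labels each row as left_only, right_only, or both
audit = pd.merge(games, conferences, on="team", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(sorted(unmatched["team"]))  # ['Oregon', 'Tulane']
```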

NumPy Essentials

# Vectorized operations (FAST)
margins = home_scores - away_scores

# Conditional assignment (a tie is labeled "Loss" here)
results = np.where(home_scores > away_scores, "Win", "Loss")

# Multiple conditions: checked in order, first match wins
categories = np.select(
    [margins >= 21, margins >= 7, margins > 0],
    ["Blowout", "Comfortable", "Close"],
    default="Loss"  # covers margins <= 0, including ties
)

Performance Rules

Approach	Speed	Use When
Vectorized (NumPy/pandas)	Fastest	Always try first
`.apply()`	Medium	Complex row logic
For loop	Slowest	Avoid for data ops

Rule of thumb: If you're writing a loop over DataFrame rows, there's probably a better way.
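
The three approaches in the table can be compared side by side. The sketch below computes the same scoring margin with a row loop, with `.apply()`, and with a vectorized expression on a small made-up `games` table. All three give identical results; the speed difference only becomes noticeable on larger data.

```python
import pandas as pd

# Hypothetical scores, for illustration only
games = pd.DataFrame({
    "home_score": [35, 17, 28],
    "away_score": [21, 24, 28],
})

# Slowest: explicit Python loop over rows
loop_margins = []
for _, row in games.iterrows():
    loop_margins.append(row["home_score"] - row["away_score"])

# Medium: row-wise apply (still calls a Python function once per row)
apply_margins = games.apply(
    lambda row: row["home_score"] - row["away_score"], axis=1
)

# Fastest: vectorized column arithmetic
vector_margins = games["home_score"] - games["away_score"]

print(vector_margins.tolist())  # [14, -7, 0]
```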

Data Type Optimization

# Downcast integers
df["col"] = pd.to_numeric(df["col"], downcast="integer")

# Convert to categorical (for repeated strings)
df["team"] = df["team"].astype("category")

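# Typical savings (actual results depend on value range and cardinality):
# - Integer downcast: 50-75% reduction (int64 to int32 or int16)
# - Categorical strings: 80-95% reduction when a few unique values repeat often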
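
Savings are easy to measure rather than assume. The sketch below builds a made-up 10,000-row table, applies both conversions, and compares `memory_usage(deep=True)` before and after. The exact percentages will vary with your data and pandas version.

```python
import pandas as pd

# Hypothetical data: four repeated team names and small integer scores
n = 10_000
df = pd.DataFrame({
    "team": ["Alabama", "Georgia", "Ohio State", "Michigan"] * (n // 4),
    "score": [i % 70 for i in range(n)],
})

# Per-column memory in bytes; deep=True also counts string contents
before = df.memory_usage(deep=True)

df["score"] = pd.to_numeric(df["score"], downcast="integer")
df["team"] = df["team"].astype("category")

after = df.memory_usage(deep=True)

# Fraction of memory saved per column (the index entry is dropped)
savings = (1 - after / before).drop("Index")
print(df.dtypes)
print(savings.round(2))
```

Because the scores here fit in the range of an 8-bit integer, the downcast picks `int8`; wider value ranges will land on `int16` or `int32` and save less.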

Missing Data Handling

Method	Effect
`df.dropna()`	Remove rows with any NaN
`df.dropna(subset=["col"])`	Remove only if "col" is NaN
`df.fillna(0)`	Replace NaN with 0
`df.fillna(df.mean(numeric_only=True))`	Replace NaN in numeric columns with the column mean
`df.interpolate()`	Fill NaN by interpolation (linear by default)
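
The sketch below applies each method from the table to a tiny made-up DataFrame so the effects can be compared directly. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "yards": [10.0, np.nan, 30.0],
    "points": [7.0, 3.0, np.nan],
})

dropped = df.dropna()                       # keeps only fully populated rows
subset = df.dropna(subset=["yards"])        # drops rows where "yards" is NaN
zero_filled = df.fillna(0)                  # NaN becomes 0
mean_filled = df.fillna(df.mean(numeric_only=True))  # NaN becomes column mean
interpolated = df.interpolate()             # linear interpolation by default

print(len(dropped))                    # 1
print(subset.index.tolist())           # [0, 2]
print(mean_filled.loc[1, "yards"])     # 20.0
print(interpolated.loc[1, "yards"])    # 20.0
```

Filling with 0 is rarely neutral for football metrics (a missing yardage value is not the same as zero yards), so choose the method based on why the value is missing.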

Common Football Analytics Patterns

Pattern 1: Home/Away Normalization

# Standardize home and away into single team view
home = games[["home_team", "home_score"]].rename(
    columns={"home_team": "team", "home_score": "points"}
)
away = games[["away_team", "away_score"]].rename(
    columns={"away_team": "team", "away_score": "points"}
)
# ignore_index=True avoids duplicate row labels in the combined table
all_team_games = pd.concat([home, away], ignore_index=True)

Pattern 2: Success Rate Calculation

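def is_successful(row):
    # Fraction of the distance-to-go that must be gained on each down
    thresholds = {1: 0.4, 2: 0.5, 3: 1.0, 4: 1.0}
    # Unknown downs fall back to requiring the full distance
    return row["yards"] >= row["distance"] * thresholds.get(row["down"], 1)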
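
The same rule can also be written without a row-wise function, in line with the performance guidance earlier in this summary. The sketch below is one possible vectorized version, using `Series.map` to look up the threshold for each down. The `plays` table is made up for illustration, and the thresholds are copied from the function above.

```python
import pandas as pd

# Hypothetical plays, for illustration only
plays = pd.DataFrame({
    "down": [1, 2, 3, 4],
    "distance": [10, 10, 5, 1],
    "yards": [4, 4, 5, 0],
})

# Same thresholds as is_successful(); unknown downs require the full distance
thresholds = {1: 0.4, 2: 0.5, 3: 1.0, 4: 1.0}
required = plays["distance"] * plays["down"].map(thresholds).fillna(1.0)

# Vectorized comparison produces a boolean column
plays["success"] = plays["yards"] >= required

# The mean of a boolean column is the success rate
success_rate = plays["success"].mean()

print(plays["success"].tolist())  # [True, False, True, False]
print(success_rate)               # 0.5
```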

Pattern 3: Rolling Calculations

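# 3-game rolling average (assumes rows are sorted chronologically within each team;
# the first two games per team are NaN until the window fills)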
df["rolling_avg"] = df.groupby("team")["points"].transform(
    lambda x: x.rolling(3).mean()
)
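
A short worked example shows what the rolling pattern produces. The sketch below uses a made-up two-team schedule with a hypothetical `week` column for ordering, and adds a variant with `min_periods=1`, which averages whatever games are available instead of returning NaN for the first two.

```python
import pandas as pd

# Hypothetical schedule, for illustration only
df = pd.DataFrame({
    "team": ["Alabama"] * 4 + ["Georgia"] * 4,
    "week": [1, 2, 3, 4] * 2,
    "points": [21, 35, 28, 42, 14, 24, 31, 17],
})

# Rolling windows follow row order, so sort chronologically within team first
df = df.sort_values(["team", "week"]).reset_index(drop=True)

# Strict 3-game window: NaN until three games are available
df["rolling_avg"] = df.groupby("team")["points"].transform(
    lambda x: x.rolling(3).mean()
)

# Lenient window: average of up to three games, starting from the first
df["rolling_avg_partial"] = df.groupby("team")["points"].transform(
    lambda x: x.rolling(3, min_periods=1).mean()
)

print(df)
```

Note that both columns include the current game in its own average. If the value is meant to describe form going into a game, the window would need to be shifted by one game per team.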

Key Terms Quick Reference

Term	Definition
DataFrame	2D labeled data structure with rows and columns
Series	1D labeled array (single column)
Vectorization	Applying operations to entire arrays at once
Boolean mask	Array of True/False for filtering
GroupBy	Split-apply-combine pattern for aggregation
Merge/Join	Combining DataFrames on common keys
Categorical	Efficient type for repeated string values
Downcast	Reducing memory by using smaller numeric types

Code Snippet Library

Load Data with Error Handling

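def load_data(filepath, fallback_csv=True):
    """Read a Parquet file, optionally falling back to a same-named CSV."""
    try:
        return pd.read_parquet(filepath)
    except Exception:  # e.g. missing file or no Parquet engine installed
        if fallback_csv:
            return pd.read_csv(str(filepath).replace(".parquet", ".csv"))
        raise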

Calculate Team Record

def get_record(games, team):
    home = games[games["home_team"] == team]
    away = games[games["away_team"] == team]
    wins = ((home["home_score"] > home["away_score"]).sum() +
            (away["away_score"] > away["home_score"]).sum())
    # Any game that is not a win counts as a loss (assumes no ties)
    losses = len(home) + len(away) - wins
    return wins, losses

Find Explosive Plays

explosive = plays[
    ((plays["play_type"] == "Pass") & (plays["yards"] >= 15)) |
    ((plays["play_type"] == "Rush") & (plays["yards"] >= 10))
]

Looking Ahead

Chapter 4 introduces Descriptive Statistics in Football:

- Central tendency and spread for football metrics
- Distribution analysis for player and team performance
- Correlation and relationships between statistics
- Building statistical profiles for comparison