Key Takeaways: Python for Sports Analytics
One-Page Summary
Core Data Structures
PANDAS HIERARCHY
DataFrame (2D)
├── Multiple columns
├── Multiple rows
└── Each column is a Series
Series (1D)
├── Single column
├── Index + Values
└── Supports vectorized operations
Key insight: Work with DataFrames for data manipulation, extract Series when you need column-specific operations.
Essential Operations Cheat Sheet
| Operation | Code | Returns |
|---|---|---|
| Select column | df["col"] |
Series |
| Select columns | df[["col1", "col2"]] |
DataFrame |
| Select by position | df.iloc[0:5] |
DataFrame |
| Select by label | df.loc["key"] |
Row as Series |
| Filter rows | df[df["col"] > value] |
DataFrame |
| Add column | df["new"] = values |
Modified DataFrame |
| Group & aggregate | df.groupby("col").mean() |
DataFrame |
| Merge | pd.merge(df1, df2, on="key") |
DataFrame |
Boolean Filtering Pattern
# Single condition
filtered = df[df["score"] > 30]
# Multiple conditions (AND)
filtered = df[(df["score"] > 30) & (df["team"] == "Alabama")]
# Multiple conditions (OR)
filtered = df[(df["home_win"]) | (df["margin"] > 20)]
# Using query() for readability
filtered = df.query("score > 30 and team == 'Alabama'")
Remember: Use & not and, | not or. Parentheses required!
GroupBy Pattern
# Basic aggregation
df.groupby("team")["score"].mean()
# Multiple aggregations
df.groupby("team").agg(
games=("score", "count"),
avg_score=("score", "mean"),
total_yards=("yards", "sum")
)
Merge Types
| Type | Keeps | Use Case |
|---|---|---|
inner |
Only matching rows | Default, safest |
left |
All from left table | Keep all original data |
right |
All from right table | Keep all lookup data |
outer |
All from both | Full picture with gaps |
NumPy Essentials
# Vectorized operations (FAST)
margins = home_scores - away_scores
# Conditional assignment
results = np.where(home_scores > away_scores, "Win", "Loss")
# Multiple conditions
categories = np.select(
[margins >= 21, margins >= 7, margins > 0],
["Blowout", "Comfortable", "Close"],
default="Loss"
)
Performance Rules
| Approach | Speed | Use When |
|---|---|---|
| Vectorized (NumPy/pandas) | Fastest | Always try first |
.apply() |
Medium | Complex row logic |
| For loop | Slowest | Avoid for data ops |
Rule of thumb: If you're writing a loop over DataFrame rows, there's probably a better way.
Data Type Optimization
# Downcast integers
df["col"] = pd.to_numeric(df["col"], downcast="integer")
# Convert to categorical (for repeated strings)
df["team"] = df["team"].astype("category")
# Expected savings:
# - Integer downcast: 50-75% reduction
# - Categorical strings: 80-95% reduction
Missing Data Handling
| Method | Effect |
|---|---|
df.dropna() |
Remove rows with any NaN |
df.dropna(subset=["col"]) |
Remove only if "col" is NaN |
df.fillna(0) |
Replace NaN with 0 |
df.fillna(df.mean()) |
Replace with column mean |
df.interpolate() |
Fill with interpolated values |
Common Football Analytics Patterns
Pattern 1: Home/Away Normalization
# Standardize home and away into single team view
home = games[["home_team", "home_score"]].rename(
columns={"home_team": "team", "home_score": "points"}
)
away = games[["away_team", "away_score"]].rename(
columns={"away_team": "team", "away_score": "points"}
)
all_team_games = pd.concat([home, away])
Pattern 2: Success Rate Calculation
def is_successful(row):
thresholds = {1: 0.4, 2: 0.5, 3: 1.0, 4: 1.0}
return row["yards"] >= row["distance"] * thresholds.get(row["down"], 1)
Pattern 3: Rolling Calculations
# 3-game rolling average
df["rolling_avg"] = df.groupby("team")["points"].transform(
lambda x: x.rolling(3).mean()
)
Key Terms Quick Reference
| Term | Definition |
|---|---|
| DataFrame | 2D labeled data structure with rows and columns |
| Series | 1D labeled array (single column) |
| Vectorization | Applying operations to entire arrays at once |
| Boolean mask | Array of True/False for filtering |
| GroupBy | Split-apply-combine pattern for aggregation |
| Merge/Join | Combining DataFrames on common keys |
| Categorical | Efficient type for repeated string values |
| Downcast | Reducing memory by using smaller numeric types |
Code Snippet Library
Load Data with Error Handling
def load_data(filepath, fallback_csv=True):
try:
return pd.read_parquet(filepath)
except:
if fallback_csv:
return pd.read_csv(filepath.replace(".parquet", ".csv"))
raise
Calculate Team Record
def get_record(games, team):
home = games[games["home_team"] == team]
away = games[games["away_team"] == team]
wins = ((home["home_score"] > home["away_score"]).sum() +
(away["away_score"] > away["home_score"]).sum())
losses = len(home) + len(away) - wins
return wins, losses
Find Explosive Plays
explosive = plays[
((plays["play_type"] == "Pass") & (plays["yards"] >= 15)) |
((plays["play_type"] == "Rush") & (plays["yards"] >= 10))
]
Looking Ahead
Chapter 4 introduces Descriptive Statistics in Football: - Central tendency and spread for football metrics - Distribution analysis for player and team performance - Correlation and relationships between statistics - Building statistical profiles for comparison