Learning Objectives
- Design and implement domain-specific features for sports prediction including pace-adjusted, opponent-adjusted, and quality-of-competition metrics
- Construct temporal features using rolling averages, exponentially weighted means, lag features, and recency-weighted statistics
- Build interaction features that capture team matchup dynamics, style clashes, and player combination effects
- Apply PCA, t-SNE, and feature clustering to reduce dimensionality while preserving predictive signal
- Use automated feature engineering tools including featuretools and tsfresh to systematically generate and select features
Chapter 28: Feature Engineering for Sports Prediction
"The features you use are more important than the algorithm you choose." --- Pedro Domingos, A Few Useful Things to Know About Machine Learning (2012)
Chapter Overview
Machine learning models are only as good as the data they consume. In the context of sports prediction, this means that the raw box score statistics, play-by-play logs, and roster information sitting in your database are not yet ready for modeling. They must be transformed, combined, and distilled into features --- numerical representations that capture the underlying dynamics that drive outcomes. This transformation process, known as feature engineering, is widely regarded as the single most impactful step in any predictive modeling pipeline.
Feature engineering for sports prediction presents unique challenges. Sports data is inherently temporal: a team's performance in Week 1 may bear little resemblance to its performance in Week 14 after injuries, trades, and coaching adjustments. It is also deeply contextual: a basketball team that excels against slow-paced opponents may struggle against up-tempo teams, even if their aggregate statistics appear identical. And it is rich with categorical structure: matchups between specific teams, players, and venues create combinatorial complexity that simple tabular features cannot capture.
This chapter provides a systematic framework for turning raw sports data into powerful predictive features. We begin with domain-specific feature creation --- the art of encoding sports knowledge into numerical form. We then build temporal and rolling features that capture performance trajectories. We construct interaction features that model the matchup dynamics central to sports prediction. We apply dimensionality reduction techniques to manage feature proliferation. And we conclude with automated feature engineering tools that can systematically explore the feature space.
In this chapter, you will learn to:
- Transform raw statistics into pace-adjusted, opponent-adjusted, and quality-weighted features that isolate true team and player ability
- Build rolling, exponential, and lag features that capture the time-dependent nature of sports performance
- Encode matchup dynamics through interaction features that model how teams and playing styles interact
- Apply PCA and t-SNE to reduce high-dimensional feature spaces while preserving predictive information
- Use automated tools to generate, evaluate, and select features at scale
28.1 Domain-Specific Feature Creation
The Importance of Domain Knowledge
The most powerful features in sports prediction are not discovered by algorithms --- they are designed by humans who understand the sport. A neural network can learn complex nonlinear relationships, but it cannot invent a feature that does not exist in its input data. If you feed a model raw points scored and points allowed without adjusting for pace of play, the model will conflate scoring efficiency with game tempo. If you include opponent strength without adjustment, the model will confuse a team that beat weak opponents with a team that beat strong ones.
Domain-specific feature engineering encodes your understanding of how a sport works into numerical form. It is the bridge between sports expertise and statistical modeling.
Pace-Adjusted Statistics
In basketball, the number of possessions per game varies enormously between teams. A team averaging 115 points per game in 105 possessions is far more efficient than a team averaging 115 points in 115 possessions. Raw point totals obscure this distinction. Pace adjustment normalizes statistics to a common number of possessions, revealing the true underlying efficiency.
The fundamental units are offensive rating (points scored per 100 possessions) and defensive rating (points allowed per 100 possessions):
$$\text{ORtg} = \frac{\text{Points Scored}}{\text{Possessions}} \times 100$$
$$\text{DRtg} = \frac{\text{Points Allowed}}{\text{Possessions}} \times 100$$
Possessions can be estimated from box score data using:
$$\text{Poss} \approx \text{FGA} - \text{OREB} + \text{TOV} + 0.44 \times \text{FTA}$$
This formula, developed by Dean Oliver, accounts for field goal attempts, offensive rebounds (which do not start new possessions), turnovers, and free throw attempts (with the 0.44 coefficient approximating the fraction of free throw trips that end possessions).
The same logic applies in other sports. In football, adjusting for plays per game normalizes between teams that run 55 plays and teams that run 75. In hockey, adjusting for Corsi (shot attempts) per 60 minutes normalizes for time on ice.
import numpy as np
import pandas as pd
from typing import Optional
def estimate_possessions(
fga: pd.Series,
oreb: pd.Series,
tov: pd.Series,
fta: pd.Series,
) -> pd.Series:
"""
Estimate the number of possessions using the Dean Oliver formula.
This formula provides a close approximation of true possession counts
from standard box score statistics. The 0.44 coefficient on free throw
attempts accounts for the fact that not all free throws end possessions
(e.g., and-one opportunities, technical free throws, three-shot fouls).
Args:
fga: Field goal attempts.
oreb: Offensive rebounds.
tov: Turnovers.
fta: Free throw attempts.
Returns:
Estimated possession count for each game.
"""
return fga - oreb + tov + 0.44 * fta
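As a quick sanity check, the possession formula can be evaluated directly on made-up box score numbers (these values are illustrative, not from a real game):

```python
# Made-up single-game box score totals.
fga, oreb, tov, fta = 88, 10, 14, 25

# Dean Oliver estimate: FGA - OREB + TOV + 0.44 * FTA
possessions = fga - oreb + tov + 0.44 * fta
print(possessions)  # 103.0
```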
def pace_adjust_stats(
df: pd.DataFrame,
stat_columns: list[str],
possessions_col: str = "possessions",
per_n_possessions: int = 100,
) -> pd.DataFrame:
"""
Normalize raw counting statistics to a per-N-possessions basis.
This removes the effect of game tempo, allowing fair comparison
between teams and games with different paces of play.
Args:
df: DataFrame containing raw statistics and possession counts.
stat_columns: List of column names to pace-adjust.
possessions_col: Name of the column containing possession estimates.
per_n_possessions: The normalization base (default 100 possessions).
Returns:
DataFrame with new columns named '{original}_per{N}'.
Example:
>>> df = pd.DataFrame({
... 'points': [110, 98, 105],
... 'possessions': [100, 88, 95]
... })
>>> result = pace_adjust_stats(df, ['points'])
>>> result['points_per100'].round(2).tolist()
[110.0, 111.36, 110.53]
"""
result = df.copy()
for col in stat_columns:
new_col = f"{col}_per{per_n_possessions}"
result[new_col] = (
result[col] / result[possessions_col] * per_n_possessions
)
return result
Opponent-Adjusted Statistics
A team that plays in a weak conference will accumulate inflated raw statistics. Opponent adjustment corrects for the strength of competition faced, isolating a team's true ability from the quality of its opponents.
The simplest opponent adjustment compares a team's performance against each opponent to that opponent's season average. If an opponent allows 105 points per 100 possessions on average, and you score 112 against them, your opponent-adjusted offensive rating for that game is:
$$\text{OAdj}_i = \text{ORtg}_i - \text{Opp DRtg}_{\text{avg}} + \text{League DRtg}_{\text{avg}}$$
This centers the adjustment around the league average, so a team performing exactly at its opponent's usual level receives the league-average rating.
More sophisticated approaches use iterative methods. The adjusted efficiency margin used by systems like KenPom in college basketball iterates between offensive and defensive adjustments until convergence:
$$\text{AdjO}_t^{(k+1)} = \frac{\sum_i w_i \cdot \text{ORtg}_{t,i} \cdot \frac{\text{League DRtg}_{\text{avg}}}{\text{AdjD}_{o_i}^{(k)}}}{\sum_i w_i}$$
where $t$ indexes teams, $i$ indexes games, $o_i$ is the opponent in game $i$, $w_i$ are recency weights, and $k$ is the iteration number. The defensive adjustment $\text{AdjD}$ is computed symmetrically.
def simple_opponent_adjustment(
df: pd.DataFrame,
team_col: str,
opponent_col: str,
stat_col: str,
opponent_avg_col: str,
league_avg: float,
) -> pd.Series:
"""
Apply a simple additive opponent adjustment to a statistic.
Adjusts each observation by the difference between the opponent's
average and the league average. A team that scores 112 per 100
possessions against a defense that allows 105 (league avg 108)
gets an adjusted rating of 112 - 105 + 108 = 115.
Args:
df: DataFrame with game-level data.
team_col: Column identifying the team.
opponent_col: Column identifying the opponent.
stat_col: The raw statistic to adjust.
opponent_avg_col: The opponent's season average for the
corresponding defensive/offensive stat.
league_avg: The league-wide average for the statistic.
Returns:
Series of opponent-adjusted values.
"""
return df[stat_col] - df[opponent_avg_col] + league_avg
def iterative_opponent_adjustment(
game_data: pd.DataFrame,
team_col: str = "team",
opponent_col: str = "opponent",
ortg_col: str = "ortg",
drtg_col: str = "drtg",
max_iterations: int = 100,
tolerance: float = 1e-6,
) -> pd.DataFrame:
"""
Compute iterative opponent-adjusted offensive and defensive ratings.
Uses an iterative scheme similar to KenPom's adjusted efficiency
ratings. Alternates between adjusting offensive ratings using
opponents' defensive ratings and vice versa until convergence.
Args:
game_data: DataFrame with one row per team per game, containing
columns for team, opponent, offensive rating, and defensive
rating (all per 100 possessions).
team_col: Column name for the team identifier.
opponent_col: Column name for the opponent identifier.
ortg_col: Column name for offensive rating.
drtg_col: Column name for defensive rating.
max_iterations: Maximum number of iterations.
tolerance: Convergence threshold (max change in any rating).
Returns:
DataFrame indexed by team with columns 'adj_ortg' and 'adj_drtg'.
"""
teams = game_data[team_col].unique()
league_avg_ortg = game_data[ortg_col].mean()
league_avg_drtg = game_data[drtg_col].mean()
# Initialize adjusted ratings at team raw averages
adj_ortg = game_data.groupby(team_col)[ortg_col].mean().to_dict()
adj_drtg = game_data.groupby(team_col)[drtg_col].mean().to_dict()
for iteration in range(max_iterations):
prev_ortg = adj_ortg.copy()
prev_drtg = adj_drtg.copy()
# Update offensive ratings
for team in teams:
team_games = game_data[game_data[team_col] == team]
adjusted_values = []
for _, game in team_games.iterrows():
opp = game[opponent_col]
opp_adj_d = adj_drtg.get(opp, league_avg_drtg)
adjustment_factor = league_avg_drtg / opp_adj_d
adjusted_values.append(game[ortg_col] * adjustment_factor)
adj_ortg[team] = np.mean(adjusted_values)
# Update defensive ratings
for team in teams:
team_games = game_data[game_data[team_col] == team]
adjusted_values = []
for _, game in team_games.iterrows():
opp = game[opponent_col]
opp_adj_o = adj_ortg.get(opp, league_avg_ortg)
adjustment_factor = league_avg_ortg / opp_adj_o
adjusted_values.append(game[drtg_col] * adjustment_factor)
adj_drtg[team] = np.mean(adjusted_values)
# Check convergence
max_change = max(
max(abs(adj_ortg[t] - prev_ortg[t]) for t in teams),
max(abs(adj_drtg[t] - prev_drtg[t]) for t in teams),
)
if max_change < tolerance:
break
result = pd.DataFrame({
"adj_ortg": adj_ortg,
"adj_drtg": adj_drtg,
})
result["adj_margin"] = result["adj_ortg"] - result["adj_drtg"]
return result.sort_values("adj_margin", ascending=False)
Quality of Competition Metrics
Beyond simple opponent adjustment, quality of competition (QoC) metrics measure the aggregate difficulty of a team's or player's schedule. These are particularly valuable in sports with unbalanced schedules (college sports, international soccer) where teams face very different levels of opposition.
Common QoC metrics include:
- Strength of schedule (SOS): The average rating of opponents faced. Can be computed as the mean opponent adjusted efficiency margin.
- Strength of record (SOR): The expected win-loss record a typical team would achieve against the same schedule, given the margins of victory observed.
- Weighted quality of competition: SOS in which recent games receive higher weight, capturing the difficulty of recently faced opposition rather than the full-season schedule.
$$\text{SOS}_t = \frac{1}{n} \sum_{i=1}^{n} \text{Rating}_{o_i}$$
where $o_i$ is the opponent in game $i$ and $\text{Rating}$ is an adjusted team strength metric.
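A minimal sketch of the SOS computation, assuming a game-level schedule table and a precomputed ratings lookup (the team names, column names, and rating values here are illustrative):

```python
import pandas as pd

# Illustrative schedule: each row is one game from a team's perspective.
games = pd.DataFrame({
    "team":     ["A", "A", "A", "B", "B"],
    "opponent": ["B", "C", "C", "A", "C"],
})

# Illustrative adjusted strength ratings (e.g., adjusted efficiency margin).
ratings = {"A": 5.0, "B": -2.0, "C": 3.0}

# SOS = mean rating of the opponents actually faced.
games["opp_rating"] = games["opponent"].map(ratings)
sos = games.groupby("team")["opp_rating"].mean()
print(sos)
```

Team A faced B, C, C, so its SOS is the mean of (-2.0, 3.0, 3.0); unbalanced schedules show up directly as different SOS values between teams.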
Feature Brainstorming Framework
When approaching a new sport or prediction problem, use the following systematic framework to brainstorm features:
- Efficiency metrics: Points per possession, yards per play, goals per expected goals, runs per plate appearance. Always normalize by opportunity.
- Pace/tempo metrics: Possessions per game, plays per game, time of possession. These capture style independently of efficiency.
- Opponent adjustments: Apply to every efficiency metric. Raw stats are almost always inferior to adjusted stats.
- Situational splits: Home/away, rest days, back-to-back games, altitude, indoor/outdoor, day/night.
- Personnel factors: Injuries, lineup changes, minutes distribution, depth metrics.
- Recent form: Is the team trending up or down? Recent performance often dominates season averages.
- Matchup-specific: How does this team's style interact with the opponent's style? (Developed further in Section 28.3.)
Real-World Application: In NBA prediction, the single most important feature improvement most modelers can make is switching from raw team statistics to pace-adjusted, opponent-adjusted efficiency metrics. This one change --- which requires no sophisticated machine learning, only domain knowledge --- typically improves model accuracy by 2--5 percentage points in terms of game outcome prediction, which translates to substantial betting edge.
28.2 Temporal and Rolling Features
The Challenge of Time in Sports Data
Sports data is fundamentally a time series. A team's characteristics evolve continuously --- through injuries, trades, coaching changes, fatigue, and natural performance variance. Using season-long averages treats a team's October performance as equally relevant to predicting its March games, which is often badly wrong. Temporal features capture the dynamic, evolving nature of sports performance.
Rolling Averages
The simplest temporal feature is a rolling average (also called a moving average): the mean of the last $k$ observations. For a statistic $x$ at time $t$, the rolling average with window $k$ is:
$$\bar{x}_{t,k} = \frac{1}{k} \sum_{i=t-k}^{t-1} x_i$$
Note the critical detail: the window includes observations up to $t-1$, not $t$ itself. Including the current observation would constitute data leakage --- using information from the game you are trying to predict.
The choice of window size $k$ involves a bias-variance tradeoff:
- Small $k$ (e.g., 3--5 games): Responsive to recent changes but noisy. Captures hot streaks and immediate effects of roster changes.
- Large $k$ (e.g., 15--20 games): Stable but slow to adapt. Better estimates of true underlying ability but may miss regime changes.
- Multiple windows: Using several window sizes simultaneously (e.g., $k = 3, 10, 20$) lets the model learn which timescale is most informative for each context.
def compute_rolling_features(
df: pd.DataFrame,
group_col: str,
date_col: str,
stat_columns: list[str],
windows: list[int],
min_periods: int = 1,
) -> pd.DataFrame:
"""
Compute rolling averages for multiple statistics and window sizes.
For each team (group), sorts by date and computes rolling means
over the specified windows. Uses shift(1) to prevent data leakage
by excluding the current observation from the window.
Args:
df: DataFrame with game-level data.
group_col: Column to group by (typically 'team').
date_col: Column to sort by (typically 'date').
stat_columns: List of statistics to compute rolling averages for.
windows: List of window sizes (in games).
min_periods: Minimum number of observations required for a
non-null result.
Returns:
DataFrame with new columns named '{stat}_roll_{window}'.
Example:
>>> df = pd.DataFrame({
... 'team': ['A']*5,
... 'date': pd.date_range('2024-01-01', periods=5),
... 'points': [100, 110, 105, 95, 108],
... })
>>> result = compute_rolling_features(
... df, 'team', 'date', ['points'], [3]
... )
>>> result['points_roll_3'].round(2).tolist()
[nan, 100.0, 105.0, 105.0, 103.33]
"""
df = df.sort_values([group_col, date_col]).copy()
for stat in stat_columns:
for window in windows:
col_name = f"{stat}_roll_{window}"
df[col_name] = (
df.groupby(group_col)[stat]
.transform(
lambda s: s.shift(1).rolling(
window=window, min_periods=min_periods
).mean()
)
)
return df
Exponentially Weighted Means
Rolling averages treat all observations within the window equally and completely ignore observations outside it. Exponentially weighted means (EWMs) offer a smoother alternative that weights recent observations more heavily while retaining some influence from older data:
$$\text{EWM}_t = \alpha \cdot x_{t-1} + (1 - \alpha) \cdot \text{EWM}_{t-1}$$
where $\alpha \in (0, 1)$ is the smoothing factor (also called the decay parameter). Higher $\alpha$ places more weight on recent observations. The half-life $h$ --- the number of periods after which an observation's weight drops to 50% of a current observation's weight --- is related to $\alpha$ by:
$$\alpha = 1 - \exp\left(\frac{-\ln 2}{h}\right)$$
A half-life of 5 games means that a game played 5 games ago has half the influence of the most recent game. A half-life of 10 games produces a smoother, more conservative estimate.
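The half-life relation can be checked numerically against pandas: with toy values, a hand-rolled recursion using $\alpha = 1 - \exp(-\ln 2 / h)$ reproduces ewm(halflife=h) exactly (adjust=False gives the plain recursive form):

```python
import numpy as np
import pandas as pd

half_life = 5.0
alpha = 1 - np.exp(-np.log(2) / half_life)

x = pd.Series([100.0, 110.0, 105.0, 95.0, 108.0])

# Hand-rolled recursion: EWM_t = alpha * x_t + (1 - alpha) * EWM_{t-1}
ewm_manual = [x.iloc[0]]
for v in x.iloc[1:]:
    ewm_manual.append(alpha * v + (1 - alpha) * ewm_manual[-1])

# pandas equivalent; adjust=False uses the same recursion.
ewm_pandas = x.ewm(halflife=half_life, adjust=False).mean()

print(np.allclose(ewm_manual, ewm_pandas))  # True
```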
def compute_ewm_features(
df: pd.DataFrame,
group_col: str,
date_col: str,
stat_columns: list[str],
half_lives: list[float],
) -> pd.DataFrame:
"""
Compute exponentially weighted means with multiple half-lives.
EWMs provide a smooth, recency-weighted average that avoids the
sharp cutoff of rolling windows. The half-life parameter controls
how quickly old observations lose influence.
Args:
df: DataFrame with game-level data.
group_col: Column to group by (typically 'team').
date_col: Column to sort by (typically 'date').
stat_columns: Statistics to compute EWMs for.
half_lives: List of half-life values (in games).
Returns:
DataFrame with new columns named '{stat}_ewm_{halflife}'.
"""
df = df.sort_values([group_col, date_col]).copy()
for stat in stat_columns:
for hl in half_lives:
col_name = f"{stat}_ewm_{hl}"
df[col_name] = (
df.groupby(group_col)[stat]
.transform(
lambda s: s.shift(1).ewm(halflife=hl).mean()
)
)
return df
Lag Features
Lag features capture the raw value of a statistic at a specific number of games in the past. While rolling averages and EWMs aggregate over time, lag features preserve the exact values at specific time offsets:
$$\text{Lag}_k(x_t) = x_{t-k}$$
Lag features are useful for capturing:
- Momentum effects: Did the team win or lose their last game? By how much?
- Fatigue patterns: What was the team's workload 1, 2, and 3 games ago?
- Schedule effects: How many days of rest since the last game?
def compute_lag_features(
df: pd.DataFrame,
group_col: str,
date_col: str,
stat_columns: list[str],
lags: list[int],
) -> pd.DataFrame:
"""
Create lag features for specified statistics and lag periods.
Each lag feature captures the exact value of a statistic from
a specific number of games in the past. Lag 1 is the previous
game, lag 2 is two games ago, etc.
Args:
df: DataFrame with game-level data.
group_col: Column to group by.
date_col: Column to sort by.
stat_columns: Statistics to create lag features for.
lags: List of lag periods.
Returns:
DataFrame with new columns named '{stat}_lag_{lag}'.
"""
df = df.sort_values([group_col, date_col]).copy()
for stat in stat_columns:
for lag in lags:
col_name = f"{stat}_lag_{lag}"
df[col_name] = df.groupby(group_col)[stat].shift(lag)
return df
Season-to-Date Cumulative Statistics
Season-to-date (STD) features compute the running average or total from the start of the season through the previous game. These provide a stable, comprehensive baseline that is less noisy than short-window rolling averages:
$$\text{STD}_t = \frac{1}{t-1} \sum_{i=1}^{t-1} x_i$$
STD features are particularly useful early in the season when rolling windows do not have enough data, and for statistics where the true underlying rate changes slowly (e.g., free throw percentage, three-point percentage).
def compute_season_to_date(
df: pd.DataFrame,
group_col: str,
date_col: str,
stat_columns: list[str],
season_col: str = "season",
) -> pd.DataFrame:
"""
Compute expanding (season-to-date) averages through the previous game.
For each team-season, computes the cumulative mean of each statistic
through the game prior to the current one. This provides a stable
estimate of underlying team ability that incorporates all available
season data without data leakage.
Args:
df: DataFrame with game-level data.
group_col: Column to group by (typically 'team').
date_col: Column to sort by.
stat_columns: Statistics to compute STD averages for.
season_col: Column identifying the season.
Returns:
DataFrame with new columns named '{stat}_std'.
"""
df = df.sort_values([group_col, date_col]).copy()
group_keys = [group_col, season_col]
for stat in stat_columns:
col_name = f"{stat}_std"
df[col_name] = (
df.groupby(group_keys)[stat]
.transform(lambda s: s.shift(1).expanding().mean())
)
return df
Recency Weighting Schemes
Beyond exponential weighting, custom recency schemes can encode domain-specific beliefs about how quickly past performance loses relevance. For instance, you might believe that games played before a major trade are fundamentally different from games after it. Or you might weight games by the number of current starters who played in them.
A general recency-weighted average is:
$$\bar{x}_t^{(w)} = \frac{\sum_{i=1}^{t-1} w_i \cdot x_i}{\sum_{i=1}^{t-1} w_i}$$
where $w_i$ are arbitrary non-negative weights. Common choices include:
- Linear decay: $w_i = \max(0, 1 - \beta \cdot (t - 1 - i))$
- Power decay: $w_i = (t - i)^{-\gamma}$
- Binary cutoff with soft edges: Full weight for recent games, linearly declining weight for games in a transition period, zero weight for old games.
def recency_weighted_average(
values: np.ndarray,
scheme: str = "exponential",
**kwargs,
) -> float:
"""
Compute a recency-weighted average using various weighting schemes.
Args:
values: Array of historical values, ordered from oldest to newest.
The last element is the most recent observation.
scheme: Weighting scheme. One of 'exponential', 'linear', 'power'.
**kwargs: Scheme-specific parameters:
- exponential: 'alpha' (decay rate, default 0.9)
- linear: 'slope' (decay per period, default 0.05)
- power: 'gamma' (exponent, default 0.5)
Returns:
The recency-weighted average.
"""
n = len(values)
if n == 0:
return np.nan
if scheme == "exponential":
alpha = kwargs.get("alpha", 0.9)
weights = np.array([alpha ** (n - 1 - i) for i in range(n)])
elif scheme == "linear":
slope = kwargs.get("slope", 0.05)
weights = np.maximum(0, 1 - slope * np.arange(n - 1, -1, -1))
elif scheme == "power":
gamma = kwargs.get("gamma", 0.5)
recency = np.arange(n, 0, -1).astype(float)
weights = recency ** (-gamma)
else:
raise ValueError(f"Unknown scheme: {scheme}")
if weights.sum() == 0:
return np.nan
return np.average(values, weights=weights)
Common Pitfall: The most dangerous error in temporal feature engineering is data leakage --- using information from the game you are trying to predict (or from future games) in your features. Always use .shift(1) or equivalent logic to ensure that the current observation is excluded. This seems obvious in isolation, but becomes surprisingly easy to violate when building complex feature pipelines with multiple join operations. Test your pipeline by verifying that changing the outcome of game $t$ does not change any features available at time $t$.
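That verification can be automated. The sketch below uses toy data and a minimal shift(1)-based rolling feature (inlined so the snippet stands alone): it perturbs the final game's statistic and asserts that the feature row for that game is unchanged.

```python
import pandas as pd

def rolling_feature(df: pd.DataFrame) -> pd.Series:
    # 3-game rolling mean of 'points', excluding the current game via shift(1).
    return (
        df.sort_values("date")
          .groupby("team")["points"]
          .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
    )

df = pd.DataFrame({
    "team": ["A"] * 5,
    "date": pd.date_range("2024-01-01", periods=5),
    "points": [100.0, 110.0, 105.0, 95.0, 108.0],
})

baseline = rolling_feature(df)

# Perturb the final game's outcome; features *at* that game must not move.
perturbed_df = df.copy()
perturbed_df.loc[4, "points"] += 50
perturbed = rolling_feature(perturbed_df)

# Feature values for game 5 depend only on games 1-4, so nothing changes.
assert baseline.equals(perturbed)
print("no leakage detected")
```

A leaky pipeline (e.g., one that forgot the shift) would fail this assertion, because the perturbed final game would bleed into its own feature row.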
28.3 Interaction Features and Matchup Encoding
Why Matchups Matter
Sports prediction is fundamentally about matchups, not isolated team strength. A basketball team with an elite interior defense may be nearly unbeatable against post-heavy opponents but vulnerable against three-point shooting teams. A football team with a dominant pass rush may shut down pass-heavy offenses but struggle against strong rushing attacks. These interaction effects mean that the predictive value of a team's statistics depends on the opponent's characteristics.
Standard features capture each team's attributes independently. Interaction features capture how those attributes combine when two specific teams meet.
Team Matchup Features
The simplest interaction features are differences and ratios between corresponding statistics of two teams:
$$\Delta_{\text{ortg}} = \text{ORtg}_{\text{home}} - \text{DRtg}_{\text{away}}$$
This measures the expected offensive surplus when the home team's offense meets the away team's defense. Symmetrically:
$$\Delta_{\text{drtg}} = \text{DRtg}_{\text{home}} - \text{ORtg}_{\text{away}}$$
Beyond simple differences, products of complementary features capture nonlinear interactions:
$$\text{Pace Interaction} = \text{Pace}_{\text{home}} \times \text{Pace}_{\text{away}}$$
This captures the fact that two fast-paced teams create an especially high-scoring game (the product is large) while two slow-paced teams create an especially low-scoring game.
def create_matchup_features(
df: pd.DataFrame,
home_prefix: str = "home_",
away_prefix: str = "away_",
stat_pairs: Optional[list[tuple[str, str]]] = None,
) -> pd.DataFrame:
"""
Generate interaction features for head-to-head matchups.
Creates differences, ratios, and products between corresponding
home and away team statistics. These features capture matchup
dynamics that individual team features cannot represent.
Args:
df: DataFrame where each row is a game, with home and away team
statistics prefixed accordingly.
home_prefix: Prefix for home team stat columns.
away_prefix: Prefix for away team stat columns.
stat_pairs: List of (home_stat, away_stat) tuples to create
interactions for. If None, automatically pairs columns
with matching suffixes.
Returns:
DataFrame with new matchup feature columns.
Example:
>>> df = pd.DataFrame({
... 'home_ortg': [112.0], 'away_drtg': [108.0],
... 'home_pace': [100.0], 'away_pace': [95.0],
... })
>>> result = create_matchup_features(
... df, stat_pairs=[('home_ortg', 'away_drtg'),
... ('home_pace', 'away_pace')]
... )
"""
result = df.copy()
if stat_pairs is None:
# Auto-detect pairs by matching suffixes
home_cols = [c for c in df.columns if c.startswith(home_prefix)]
stat_pairs = []
for hc in home_cols:
suffix = hc[len(home_prefix):]
ac = away_prefix + suffix
if ac in df.columns:
stat_pairs.append((hc, ac))
for home_stat, away_stat in stat_pairs:
base_name = home_stat.replace(home_prefix, "")
# Difference: home advantage in this metric
result[f"matchup_diff_{base_name}"] = (
result[home_stat] - result[away_stat]
)
# Ratio: relative strength (with safety for zero division)
denominator = result[away_stat].replace(0, np.nan)
result[f"matchup_ratio_{base_name}"] = (
result[home_stat] / denominator
)
# Product: interaction magnitude
result[f"matchup_prod_{base_name}"] = (
result[home_stat] * result[away_stat]
)
return result
Style Clash Metrics
Beyond raw statistical matchups, style clash metrics quantify how teams' playing styles interact. These require defining style dimensions first, then measuring compatibility or conflict along each dimension.
In basketball, useful style dimensions include:
- Pace: Fast vs. slow tempo
- Three-point reliance: Perimeter-oriented vs. interior-oriented offense
- Defensive style: Switching defense vs. drop coverage, full-court press vs. half-court
- Lineup size: Small-ball lineups vs. traditional big lineups
A style clash score measures how unusual the matchup is along a specific dimension:
$$\text{Clash}_{d} = |S_{d,\text{home}} - S_{d,\text{away}}|$$
where $S_{d,t}$ is team $t$'s value on style dimension $d$. High clash scores indicate stylistic mismatches that may create unusual game dynamics.
def compute_style_clash_metrics(
df: pd.DataFrame,
style_dimensions: dict[str, tuple[str, str]],
) -> pd.DataFrame:
"""
Compute style clash metrics for each matchup along multiple dimensions.
For each style dimension, computes the absolute difference, the
product, and a z-scored clash metric indicating how extreme the
mismatch is relative to the league distribution.
Args:
df: DataFrame with one row per game, containing style metrics
for both home and away teams.
style_dimensions: Dictionary mapping dimension names to
(home_col, away_col) tuples.
Returns:
DataFrame with style clash features appended.
"""
result = df.copy()
for dim_name, (home_col, away_col) in style_dimensions.items():
# Raw clash (absolute difference)
clash = (result[home_col] - result[away_col]).abs()
result[f"clash_{dim_name}"] = clash
# Z-scored clash (how unusual is this mismatch?)
mean_clash = clash.mean()
std_clash = clash.std()
if std_clash > 0:
result[f"clash_{dim_name}_z"] = (clash - mean_clash) / std_clash
else:
result[f"clash_{dim_name}_z"] = 0.0
# Directional clash (positive = home team higher)
result[f"clash_{dim_name}_dir"] = (
result[home_col] - result[away_col]
)
return result
Player Combination Features
In sports with lineup data, the specific combination of players on the court or field matters enormously. Player combination features capture the synergies and conflicts between specific groups of players.
For basketball, lineup-level features aggregate the on-court performance of every combination of five players:
$$\text{Lineup ORtg}_{L} = \frac{\text{Points scored with lineup } L}{\text{Possessions with lineup } L} \times 100$$
For modeling purposes, pairwise player features are more practical than full lineup features (since the combinatorial space of five-player lineups is vast). The net rating of every two-player combination captures which pairs work well together:
$$\text{Pair NetRtg}_{i,j} = \text{Team NetRtg when } i \text{ and } j \text{ are both on court}$$
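A sketch of the pairwise computation from stint-level data. The schema is illustrative (one row per stint, with the players on court, possessions played, and the team's net points during the stint); real play-by-play data would need to be aggregated into this shape first.

```python
from itertools import combinations

import pandas as pd

# Illustrative stint-level data.
stints = pd.DataFrame({
    "players": [["p1", "p2", "p3"], ["p1", "p2", "p4"], ["p2", "p3", "p4"]],
    "possessions": [20, 30, 25],
    "net_points": [4, -3, 5],
})

# Accumulate possessions and net points for every on-court pair.
totals: dict[tuple[str, str], dict[str, float]] = {}
for _, stint in stints.iterrows():
    for pair in combinations(sorted(stint["players"]), 2):
        entry = totals.setdefault(pair, {"poss": 0.0, "net": 0.0})
        entry["poss"] += stint["possessions"]
        entry["net"] += stint["net_points"]

# Pair net rating = net points per 100 possessions with both players on court.
pair_netrtg = {pair: 100 * v["net"] / v["poss"] for pair, v in totals.items()}
print(pair_netrtg[("p1", "p2")])  # 2.0
```

Here p1 and p2 share the first two stints (50 possessions, +1 net), so their pair net rating is +2.0 per 100 possessions. In practice, pairs with very few shared possessions should be shrunk toward the team average before being used as features.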
Encoding Categorical Matchups
When teams, players, and venues are categorical variables, they must be encoded numerically. Several encoding strategies exist:
- One-hot encoding: Creates a binary indicator for each category. Simple but creates very high-dimensional, sparse features. Impractical for player-level encoding (thousands of players).
- Target encoding: Replaces each category with the average target value (e.g., win rate) for that category. Compact but prone to overfitting --- a player who has appeared in two games and won both gets an encoding of 1.0, which is unreliable.
- Leave-one-out target encoding: For each observation, the encoding uses the mean target value excluding that observation. Reduces overfitting compared to naive target encoding.
- Entity embeddings: Learned dense vector representations (covered in detail in Chapter 29). The most powerful approach for large categorical spaces.
from sklearn.base import BaseEstimator, TransformerMixin
class LeaveOneOutEncoder(BaseEstimator, TransformerMixin):
"""
Leave-one-out target encoder for categorical variables.
For each observation, encodes the categorical value with the mean
of the target variable computed over all OTHER observations with
the same category value. This prevents direct leakage of the
target into the feature.
During transform (on new data), uses the full training-set mean
for each category.
Attributes:
category_means_: Dict mapping category values to target means.
category_counts_: Dict mapping category values to observation counts.
category_sums_: Dict mapping category values to target sums.
global_mean_: The overall mean of the target in the training data.
smoothing: Bayesian smoothing parameter. Higher values shrink
category means toward the global mean.
"""
def __init__(self, smoothing: float = 10.0):
self.smoothing = smoothing
def fit(self, X: pd.Series, y: pd.Series):
"""Compute category-level statistics from training data."""
self.global_mean_ = y.mean()
df = pd.DataFrame({"cat": X, "target": y})
stats = df.groupby("cat")["target"].agg(["mean", "count", "sum"])
self.category_means_ = stats["mean"].to_dict()
self.category_counts_ = stats["count"].to_dict()
self.category_sums_ = stats["sum"].to_dict()
return self
def transform(self, X: pd.Series, y: Optional[pd.Series] = None):
"""
Encode categories using leave-one-out means (if y provided)
or full training means (if y is None, i.e., test data).
"""
if y is not None:
# Leave-one-out encoding for training data
result = []
for cat_val, target_val in zip(X, y):
n = self.category_counts_.get(cat_val, 0)
s = self.category_sums_.get(cat_val, 0)
if n > 1:
loo_mean = (s - target_val) / (n - 1)
else:
loo_mean = self.global_mean_
# Apply smoothing
smoothed = (
(n - 1) * loo_mean + self.smoothing * self.global_mean_
) / (n - 1 + self.smoothing)
result.append(smoothed)
return np.array(result)
else:
# Standard encoding for test data
result = []
for cat_val in X:
n = self.category_counts_.get(cat_val, 0)
mean = self.category_means_.get(cat_val, self.global_mean_)
smoothed = (
n * mean + self.smoothing * self.global_mean_
) / (n + self.smoothing)
result.append(smoothed)
return np.array(result)
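To make the smoothing concrete, here is the encoding for a single training observation worked out by hand, using the same formula as the encoder's transform step (the category values and targets are toy data):

```python
import pandas as pd

# Toy data: categorical 'team' with a binary win target (illustrative values)
X = pd.Series(["A", "A", "A", "B", "B", "C"])
y = pd.Series([1, 1, 0, 0, 1, 1])

global_mean = y.mean()   # 4/6 ~ 0.667
smoothing = 10.0

# Encode the FIRST observation (team A, target 1).
# The other two A observations have targets [1, 0]:
n = 3                             # count for team A
loo_mean = (2 - 1) / (n - 1)      # (sum_A - own target) / (n - 1) = 0.5

# Smoothing pulls the unreliable two-game mean toward the global mean
encoded = ((n - 1) * loo_mean + smoothing * global_mean) / (n - 1 + smoothing)
print(round(encoded, 3))  # -> 0.639
```

With only two other observations to learn from, the encoding lands much closer to the global mean (0.667) than to the raw leave-one-out mean (0.5), which is exactly the shrinkage behavior the smoothing parameter is meant to produce.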
Intuition: Think of interaction features as encoding the question "what happens when these two specific styles collide?" rather than "how good is each team independently?" A team with the league's best three-point defense gains little from that strength against an opponent that rarely shoots threes. Interaction features encode this conditional relationship, allowing the model to make matchup-specific predictions rather than generic ones.
28.4 Dimensionality Reduction
The Curse of Dimensionality in Sports Features
The feature engineering techniques from the previous sections can generate an enormous number of features. If you have 20 base statistics, 3 rolling windows, 3 exponential half-lives, 3 lags, season-to-date averages, and matchup interactions for each, you easily exceed 500 features per game. With a typical NBA season of ~1,230 games, you have far more features than observations --- a classic recipe for overfitting.
Dimensionality reduction techniques compress high-dimensional feature spaces into lower-dimensional representations that retain most of the predictive information. They serve two purposes: reducing overfitting by constraining model complexity, and revealing latent structure in the data.
Principal Component Analysis (PCA)
PCA finds the linear combinations of features (called principal components) that capture the maximum variance in the data. The first principal component is the direction of maximum variance; the second is the direction of maximum remaining variance orthogonal to the first; and so on.
Mathematically, PCA computes the eigendecomposition of the covariance matrix $\Sigma$ of the (centered) feature matrix $X$:
$$\Sigma = \frac{1}{n-1} X^T X = V \Lambda V^T$$
where $V$ is the matrix of eigenvectors (principal component directions) and $\Lambda$ is a diagonal matrix of eigenvalues (variance explained by each component). The transformed data is:
$$Z = X V_k$$
where $V_k$ contains only the first $k$ eigenvectors, projecting the data onto the top $k$ principal components.
The key question is how many components to retain. Common heuristics include:
- Explained variance threshold: Retain enough components to capture 90--95% of total variance.
- Scree plot: Plot eigenvalues and look for an "elbow" where the marginal variance drops sharply.
- Cross-validation: Choose $k$ that minimizes prediction error on a held-out set.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
def apply_pca_to_sports_features(
train_features: pd.DataFrame,
test_features: pd.DataFrame,
variance_threshold: float = 0.95,
plot_scree: bool = True,
) -> tuple[np.ndarray, np.ndarray, PCA, StandardScaler]:
"""
Apply PCA to sports feature matrices with proper train/test separation.
Standardizes features using training data statistics only (preventing
information leakage from test data), then fits PCA on training data
and transforms both sets.
Args:
train_features: Training set feature DataFrame.
test_features: Test set feature DataFrame (same columns).
variance_threshold: Retain components explaining this fraction
of total variance.
plot_scree: If True, display a scree plot of explained variance.
Returns:
Tuple of (train_pca, test_pca, fitted_pca, fitted_scaler).
"""
# Standardize using training statistics only
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_features)
test_scaled = scaler.transform(test_features)
# Fit PCA on training data
pca_full = PCA()
pca_full.fit(train_scaled)
# Determine number of components
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
n_components = np.argmax(cumulative_variance >= variance_threshold) + 1
if plot_scree:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Scree plot
ax1.bar(
range(1, len(pca_full.explained_variance_ratio_) + 1),
pca_full.explained_variance_ratio_,
alpha=0.7,
label="Individual",
)
ax1.plot(
range(1, len(cumulative_variance) + 1),
cumulative_variance,
"r-o",
label="Cumulative",
)
ax1.axhline(
y=variance_threshold, color="gray", linestyle="--",
label=f"{variance_threshold:.0%} threshold",
)
ax1.axvline(
x=n_components, color="green", linestyle="--",
label=f"{n_components} components",
)
ax1.set_xlabel("Principal Component")
ax1.set_ylabel("Explained Variance Ratio")
ax1.set_title("Scree Plot")
ax1.legend()
# Top component loadings
loadings = pd.DataFrame(
pca_full.components_[:3],
columns=train_features.columns,
index=["PC1", "PC2", "PC3"],
)
top_features_pc1 = loadings.loc["PC1"].abs().nlargest(10)
ax2.barh(range(len(top_features_pc1)), top_features_pc1.values)
ax2.set_yticks(range(len(top_features_pc1)))
ax2.set_yticklabels(top_features_pc1.index)
ax2.set_xlabel("Absolute Loading")
ax2.set_title("Top 10 Features in PC1")
plt.tight_layout()
plt.show()
# Refit PCA with selected number of components
pca = PCA(n_components=n_components)
train_pca = pca.fit_transform(train_scaled)
test_pca = pca.transform(test_scaled)
print(
f"Reduced {train_features.shape[1]} features to {n_components} "
f"components ({cumulative_variance[n_components - 1]:.1%} variance)"
)
return train_pca, test_pca, pca, scaler
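For quick experiments, the same train/test discipline fits in a few lines: scikit-learn's `PCA` accepts a float `n_components` and itself selects the number of components reaching that variance fraction. The synthetic data below (five latent factors behind fifty observed features) is purely illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic matrix: 200 "games", 50 features driven by 5 latent factors
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))
X_train, X_test = X[:150], X[150:]

# Fit the scaler and PCA on training rows only, then transform both sets
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.95)  # float: keep enough PCs for 95% of variance
Z_train = pca.fit_transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))
print(Z_train.shape, Z_test.shape)  # far fewer than 50 columns remain
```

Because the fifty features share only five latent sources, the 95% threshold collapses the matrix to a handful of components, mirroring what happens when many engineered sports features measure the same underlying team qualities.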
t-SNE for Visualization
t-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique designed for visualization. Unlike PCA, t-SNE does not produce a transformation that can be applied to new data --- it is a visualization tool, not a feature engineering tool.
t-SNE is valuable for exploring the structure of sports feature spaces. By projecting games or teams into two dimensions, you can visually identify clusters of similar teams, detect outliers, and verify that your features capture meaningful distinctions.
The key hyperparameter is perplexity, which roughly controls the number of effective nearest neighbors. Values between 5 and 50 are typical; higher perplexity produces more global structure, lower perplexity reveals more local detail.
from sklearn.manifold import TSNE
def visualize_teams_tsne(
features: pd.DataFrame,
team_labels: pd.Series,
perplexity: int = 30,
random_state: int = 42,
highlight_teams: Optional[list[str]] = None,
) -> None:
"""
Create a t-SNE visualization of teams in feature space.
Projects the high-dimensional feature vectors for each team into
2D space, preserving local neighborhood structure. Useful for
exploring which teams are statistically similar and which are
outliers.
Args:
features: Feature DataFrame (one row per team).
team_labels: Series of team names/identifiers.
perplexity: t-SNE perplexity parameter.
random_state: Random seed for reproducibility.
highlight_teams: Optional list of team names to highlight.
"""
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
tsne = TSNE(
n_components=2,
perplexity=min(perplexity, len(features) - 1),
random_state=random_state,
)
embedding = tsne.fit_transform(features_scaled)
fig, ax = plt.subplots(figsize=(12, 10))
if highlight_teams:
mask = team_labels.isin(highlight_teams)
ax.scatter(
embedding[~mask, 0], embedding[~mask, 1],
c="lightgray", s=50, alpha=0.5,
)
ax.scatter(
embedding[mask, 0], embedding[mask, 1],
c="red", s=100, zorder=5,
)
for i, team in enumerate(team_labels):
if team in highlight_teams:
ax.annotate(
team, (embedding[i, 0], embedding[i, 1]),
fontsize=9, fontweight="bold",
)
else:
ax.scatter(embedding[:, 0], embedding[:, 1], s=60, alpha=0.7)
for i, team in enumerate(team_labels):
ax.annotate(team, (embedding[i, 0], embedding[i, 1]), fontsize=7)
ax.set_title("t-SNE Visualization of Team Feature Space")
ax.set_xlabel("t-SNE Dimension 1")
ax.set_ylabel("t-SNE Dimension 2")
plt.tight_layout()
plt.show()
Feature Clustering
An alternative to PCA for dimensionality reduction is feature clustering --- grouping correlated features together and replacing each cluster with a single representative or aggregate. This is particularly useful when you have many features measuring similar constructs (e.g., multiple metrics of defensive quality).
Agglomerative clustering on the feature correlation matrix is a natural approach:
- Compute the absolute pairwise correlation matrix of all features.
- Convert correlations to distances: $d_{ij} = 1 - |r_{ij}|$.
- Perform hierarchical clustering on the distance matrix.
- Cut the dendrogram at a threshold to form clusters.
- Within each cluster, select the feature with the highest average correlation to the target, or compute the first principal component.
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
def cluster_correlated_features(
features: pd.DataFrame,
target: pd.Series,
correlation_threshold: float = 0.7,
method: str = "representative",
) -> tuple[pd.DataFrame, dict[int, list[str]]]:
"""
Cluster highly correlated features and reduce to one per cluster.
Groups features whose absolute pairwise correlation exceeds the
threshold. Within each cluster, selects either the feature most
correlated with the target ('representative') or computes the
first principal component ('pca').
Args:
features: Feature DataFrame.
target: Target variable Series.
correlation_threshold: Features with correlation above this
value (in absolute terms) are grouped together.
method: 'representative' to keep the best feature per cluster,
'pca' to replace each cluster with its first PC.
Returns:
Tuple of (reduced DataFrame, dict mapping cluster IDs to
feature name lists).
"""
corr_matrix = features.corr().abs()
distance_matrix = 1 - corr_matrix
np.fill_diagonal(distance_matrix.values, 0)
condensed = squareform(distance_matrix.values)
linkage_matrix = linkage(condensed, method="complete")
cluster_labels = fcluster(
linkage_matrix,
t=1 - correlation_threshold,
criterion="distance",
)
cluster_map = {}
for feat, label in zip(features.columns, cluster_labels):
cluster_map.setdefault(label, []).append(feat)
if method == "representative":
selected = []
for cluster_id, feat_names in cluster_map.items():
if len(feat_names) == 1:
selected.append(feat_names[0])
else:
correlations_with_target = {
f: abs(features[f].corr(target)) for f in feat_names
}
best = max(correlations_with_target, key=correlations_with_target.get)
selected.append(best)
reduced = features[selected]
elif method == "pca":
components = []
component_names = []
for cluster_id, feat_names in cluster_map.items():
if len(feat_names) == 1:
components.append(features[feat_names[0]].values)
component_names.append(feat_names[0])
else:
scaler = StandardScaler()
scaled = scaler.fit_transform(features[feat_names])
pca = PCA(n_components=1)
pc = pca.fit_transform(scaled).ravel()
components.append(pc)
component_names.append(f"cluster_{cluster_id}_pc1")
reduced = pd.DataFrame(
np.column_stack(components), columns=component_names
)
else:
raise ValueError(f"Unknown method: {method}")
print(
f"Reduced {features.shape[1]} features to {reduced.shape[1]} "
f"({len(cluster_map)} clusters)"
)
return reduced, cluster_map
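The clustering steps can also be verified on synthetic data where the ground-truth grouping is known. Below, two latent qualities generate four observed features (all names and values are illustrative), and the correlation-to-distance cut recovers the two groups:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n = 500
base_def = rng.normal(size=n)   # latent "defensive quality"
base_off = rng.normal(size=n)   # latent "offensive quality"
features = pd.DataFrame({
    "def_rtg": base_def + 0.2 * rng.normal(size=n),
    "opp_efg": base_def + 0.2 * rng.normal(size=n),
    "off_rtg": base_off + 0.2 * rng.normal(size=n),
    "efg_pct": base_off + 0.2 * rng.normal(size=n),
})

# Correlations -> distances -> hierarchical clustering -> cut, as in the text
dist = 1 - features.corr().abs()
np.fill_diagonal(dist.values, 0)
labels = fcluster(
    linkage(squareform(dist.values), method="complete"),
    t=1 - 0.7,            # cut so features with |r| > 0.7 share a cluster
    criterion="distance",
)
clusters = pd.Series(labels, index=features.columns)
print(clusters.to_dict())  # the two defensive and two offensive features pair up
```

Complete linkage guarantees that every pair of features within a cluster satisfies the correlation threshold, not just adjacent merges, which is why it is the natural choice here.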
When to Reduce Dimensions
Dimensionality reduction is not always beneficial. Consider the following guidelines:
| Situation | Recommendation |
|---|---|
| $n \gg p$ (many observations, few features) | Reduction unnecessary; may lose signal |
| $p > n$ (more features than observations) | Reduction essential to prevent overfitting |
| Many highly correlated features | Feature clustering or PCA very effective |
| Tree-based models (XGBoost, RF) | Often handle high dimensions well; reduction optional |
| Linear models, neural networks | More sensitive to dimensionality; reduction often helps |
| Interpretability is important | Feature clustering preserves interpretability better than PCA |
| Visualization needed | t-SNE for exploration; PCA for linear projections |
Intuition: Dimensionality reduction is a form of regularization. Just as L1 and L2 penalties constrain model weights to prevent overfitting, dimensionality reduction constrains the space of features the model can use. The tradeoff is the same: you sacrifice some ability to fit the training data in exchange for better generalization to new data.
28.5 Automated Feature Engineering
The Case for Automation
Manual feature engineering is powerful but labor-intensive. For a single sport and prediction target, a skilled domain expert might spend weeks designing, implementing, and testing features. Automated feature engineering tools systematically explore the space of possible feature transformations, generating candidates that a human might miss and evaluating them at a scale no human could match.
Automated approaches complement, but do not replace, domain knowledge. The best pipelines combine hand-crafted domain-specific features (Section 28.1) with automatically generated candidates, then apply systematic feature selection to identify the most valuable subset.
Featuretools for Deep Feature Synthesis
Featuretools is a Python library that performs Deep Feature Synthesis (DFS) --- automatically generating features by applying transformation and aggregation primitives to relational datasets. For sports data, this is particularly powerful because our data naturally has relational structure: games contain players, players belong to teams, teams belong to leagues, games occur at venues, etc.
# Note: This code requires the featuretools package
# pip install featuretools
import featuretools as ft
def build_sports_entityset(
games: pd.DataFrame,
player_stats: pd.DataFrame,
teams: pd.DataFrame,
) -> ft.EntitySet:
"""
Build a featuretools EntitySet from sports relational data.
Creates an entity set with three entities (games, player_stats,
teams) and the relationships between them. This relational
structure allows featuretools to automatically generate aggregated
and transformed features across entity boundaries.
Args:
games: Game-level DataFrame with columns ['game_id', 'date',
'home_team_id', 'away_team_id', 'home_score', 'away_score'].
player_stats: Player-game-level DataFrame with columns
['player_game_id', 'game_id', 'team_id', 'player_id',
'minutes', 'points', 'rebounds', 'assists', ...].
teams: Team DataFrame with columns ['team_id', 'team_name',
'conference', 'division'].
Returns:
Configured featuretools EntitySet.
"""
es = ft.EntitySet(id="sports_data")
es = es.add_dataframe(
dataframe_name="games",
dataframe=games,
index="game_id",
time_index="date",
)
es = es.add_dataframe(
dataframe_name="player_stats",
dataframe=player_stats,
index="player_game_id",
)
es = es.add_dataframe(
dataframe_name="teams",
dataframe=teams,
index="team_id",
)
# Define relationships
es = es.add_relationship("games", "game_id", "player_stats", "game_id")
es = es.add_relationship("teams", "team_id", "player_stats", "team_id")
return es
def run_deep_feature_synthesis(
es: ft.EntitySet,
target_entity: str = "games",
max_depth: int = 2,
primitives_agg: Optional[list[str]] = None,
primitives_trans: Optional[list[str]] = None,
) -> pd.DataFrame:
"""
Run Deep Feature Synthesis to automatically generate features.
DFS traverses the relational structure of the EntitySet, applying
aggregation primitives (e.g., mean, std, max) across relationships
and transformation primitives (e.g., absolute, percentile) to
individual columns. The max_depth parameter controls the complexity
of generated features.
Args:
es: Configured featuretools EntitySet.
target_entity: The entity to generate features for.
max_depth: Maximum depth of feature stacking. Depth 1 applies
single primitives; depth 2 applies primitives to the
results of depth-1 features.
primitives_agg: Aggregation primitives to use.
primitives_trans: Transformation primitives to use.
Returns:
DataFrame of automatically generated features for each
instance in the target entity.
"""
if primitives_agg is None:
primitives_agg = ["mean", "std", "max", "min", "skew", "count"]
if primitives_trans is None:
primitives_trans = ["absolute", "percentile"]
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name=target_entity,
max_depth=max_depth,
agg_primitives=primitives_agg,
trans_primitives=primitives_trans,
verbose=True,
)
print(f"Generated {len(feature_defs)} features")
print(f"Feature matrix shape: {feature_matrix.shape}")
return feature_matrix
Tsfresh for Time Series Feature Extraction
Tsfresh (Time Series Feature extraction based on Scalable Hypothesis tests) is specialized for extracting features from time series data --- exactly the kind of data we have in sequential game-by-game statistics.
Tsfresh automatically computes hundreds of features from each time series, including statistical moments, autocorrelation coefficients, entropy measures, and spectral features. It then applies statistical hypothesis tests to select only the features that are significantly associated with the target variable.
# Note: This code requires the tsfresh package
# pip install tsfresh
from tsfresh import extract_features, select_features
from tsfresh.feature_extraction import EfficientFCParameters
from tsfresh.utilities.dataframe_functions import impute
def extract_tsfresh_features(
game_data: pd.DataFrame,
team_col: str,
game_number_col: str,
stat_columns: list[str],
target: pd.Series,
fdr_level: float = 0.05,
) -> pd.DataFrame:
"""
Extract and select time series features using tsfresh.
For each team, treats the sequence of game statistics as a time
series and extracts hundreds of statistical features. Then applies
the Benjamini-Hochberg procedure to select only features with a
statistically significant relationship to the target.
Args:
game_data: DataFrame with game-level statistics, one row per
team per game.
team_col: Column identifying the team (used as the series ID).
game_number_col: Column indicating the temporal order of games
(used as the time index within each series).
stat_columns: List of statistics to extract features from.
target: Target variable for feature selection (one value per
team, e.g., season win percentage).
fdr_level: False discovery rate for the Benjamini-Hochberg
feature selection procedure.
Returns:
DataFrame of selected features, one row per team.
"""
# Prepare data in tsfresh format
ts_data = game_data[[team_col, game_number_col] + stat_columns].copy()
ts_data = ts_data.rename(columns={
team_col: "id",
game_number_col: "time",
})
# Extract features
    extracted = extract_features(
        ts_data,
        column_id="id",
        column_sort="time",
        # "Efficient" preset: skips the most computationally expensive calculators
        default_fc_parameters=EfficientFCParameters(),
        n_jobs=4,
    )
# Impute missing values
impute(extracted)
# Select significant features
selected = select_features(extracted, target, fdr_level=fdr_level)
print(
f"Extracted {extracted.shape[1]} features, "
f"selected {selected.shape[1]} significant features"
)
return selected
Target Encoding with Regularization
Target encoding (also called mean encoding) replaces categorical variables with the mean of the target variable for that category. This is covered briefly in Section 28.3, but in the context of automated pipelines, we need a production-ready version with proper regularization and cross-validation to prevent overfitting.
The regularized target encoding uses Bayesian smoothing:
$$\text{Encoding}(c) = \frac{n_c \cdot \bar{y}_c + m \cdot \bar{y}_{\text{global}}}{n_c + m}$$
where $n_c$ is the number of observations in category $c$, $\bar{y}_c$ is the category mean, $\bar{y}_{\text{global}}$ is the global mean, and $m$ is the smoothing parameter. When $n_c$ is small, the encoding shrinks toward the global mean; when $n_c$ is large, the encoding approaches the category-specific mean.
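Combining this smoothing with out-of-fold computation gives the cross-validated version the text calls for: each observation is encoded with category statistics computed on the other folds only. The helper below is an illustrative sketch under that scheme, not a production implementation:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def cv_target_encode(
    cats: pd.Series, y: pd.Series,
    m: float = 10.0, n_splits: int = 5, seed: int = 42,
) -> np.ndarray:
    """Out-of-fold smoothed target encoding.

    Each observation is encoded with category statistics computed on
    the OTHER folds only, so its own target never leaks into its feature.
    """
    encoded = np.empty(len(cats), dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cats):
        global_mean = y.iloc[fit_idx].mean()
        stats = y.iloc[fit_idx].groupby(cats.iloc[fit_idx]).agg(["sum", "count"])
        for i in enc_idx:
            cat = cats.iloc[i]
            if cat in stats.index:
                s, n_c = stats.loc[cat, "sum"], stats.loc[cat, "count"]
                # Bayesian smoothing: (n_c * ybar_c + m * ybar_global) / (n_c + m)
                encoded[i] = (s + m * global_mean) / (n_c + m)
            else:
                # Unseen category in this fold: fall back to the global mean
                encoded[i] = global_mean
    return encoded
```

The fold structure here mirrors the leave-one-out encoder from Section 28.3 but is cheaper for large datasets, since statistics are computed once per fold rather than once per observation.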
Feature Selection Pipelines
After generating hundreds or thousands of candidate features --- both hand-crafted and automated --- a systematic feature selection pipeline identifies the subset that maximizes predictive performance. A robust pipeline combines multiple selection methods:
- Variance threshold: Remove features with near-zero variance (they contain no information).
- Correlation filter: Remove features highly correlated with existing selections (they add redundancy, not information).
- Mutual information: Rank features by their mutual information with the target variable.
- Recursive feature elimination (RFE): Iteratively remove the least important features according to a model's feature importance scores.
- L1 regularization: Fit a model with L1 (Lasso) penalty; features with zero coefficients are eliminated.
from sklearn.feature_selection import (
VarianceThreshold,
mutual_info_classif,
SelectKBest,
)
from sklearn.ensemble import GradientBoostingClassifier
class FeatureSelectionPipeline:
"""
Multi-stage feature selection pipeline for sports prediction.
Combines variance filtering, correlation filtering, mutual information
ranking, and model-based importance into a single pipeline that can
be fitted on training data and applied to test data consistently.
Attributes:
variance_threshold: Minimum variance for a feature to be retained.
correlation_threshold: Maximum pairwise correlation allowed.
n_top_mi: Number of features to retain after MI ranking.
n_final: Final number of features to select via model importance.
"""
def __init__(
self,
variance_threshold: float = 0.01,
correlation_threshold: float = 0.95,
n_top_mi: int = 100,
n_final: int = 50,
):
self.variance_threshold = variance_threshold
self.correlation_threshold = correlation_threshold
self.n_top_mi = n_top_mi
self.n_final = n_final
self.selected_features_: list[str] = []
def fit(self, X: pd.DataFrame, y: pd.Series) -> "FeatureSelectionPipeline":
"""
Fit the feature selection pipeline on training data.
Applies each stage sequentially, recording the surviving
features at each step.
"""
current_features = X.columns.tolist()
print(f"Starting with {len(current_features)} features")
# Stage 1: Variance threshold
selector = VarianceThreshold(threshold=self.variance_threshold)
selector.fit(X[current_features])
current_features = [
f for f, keep in zip(current_features, selector.get_support())
if keep
]
print(f"After variance filter: {len(current_features)}")
# Stage 2: Correlation filter
corr_matrix = X[current_features].corr().abs()
to_drop = set()
for i in range(len(current_features)):
for j in range(i + 1, len(current_features)):
if corr_matrix.iloc[i, j] > self.correlation_threshold:
# Drop the feature with lower target correlation
corr_i = abs(X[current_features[i]].corr(y))
corr_j = abs(X[current_features[j]].corr(y))
drop = current_features[j] if corr_i >= corr_j else current_features[i]
to_drop.add(drop)
current_features = [f for f in current_features if f not in to_drop]
print(f"After correlation filter: {len(current_features)}")
# Stage 3: Mutual information ranking
n_mi = min(self.n_top_mi, len(current_features))
mi_scores = mutual_info_classif(X[current_features], y, random_state=42)
mi_ranking = sorted(
zip(current_features, mi_scores),
key=lambda x: x[1],
reverse=True,
)
current_features = [f for f, _ in mi_ranking[:n_mi]]
print(f"After MI ranking: {len(current_features)}")
# Stage 4: Model-based importance
n_final = min(self.n_final, len(current_features))
model = GradientBoostingClassifier(
n_estimators=100, max_depth=4, random_state=42,
)
model.fit(X[current_features], y)
importances = sorted(
zip(current_features, model.feature_importances_),
key=lambda x: x[1],
reverse=True,
)
self.selected_features_ = [f for f, _ in importances[:n_final]]
print(f"Final selection: {len(self.selected_features_)}")
return self
def transform(self, X: pd.DataFrame) -> pd.DataFrame:
"""Return only the selected features."""
return X[self.selected_features_]
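The pipeline above omits the L1-regularization route from the earlier list. A minimal sketch of that approach, on synthetic data where only the first three of twenty features carry signal (all values illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 400
X = rng.normal(size=(n, 20))
# Only the first three of twenty features drive the (synthetic) outcome
logits = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2]
y = (logits + 0.5 * rng.normal(size=n) > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)
# The L1 penalty zeroes out coefficients of uninformative features;
# smaller C means stronger regularization and a sparser model
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
lasso.fit(X_scaled, y)

kept = np.flatnonzero(np.abs(lasso.coef_[0]) > 1e-6)
print(kept)  # mostly the informative features survive
```

Standardizing before fitting matters: the L1 penalty treats all coefficients equally, so unscaled features with large variances would be unfairly favored.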
Common Pitfall: Automated feature engineering can generate thousands of candidate features, but most will be noise. Without rigorous feature selection --- including proper cross-validation to avoid selecting features that overfit to the training set --- automated features can degrade rather than improve model performance. Always evaluate automated features on a held-out set that was not used during the generation or selection process.
28.6 Chapter Summary
Key Concepts
- Feature engineering is the process of transforming raw data into numerical representations suitable for machine learning models. In sports prediction, it is widely considered the single most impactful step in the modeling pipeline.
- Pace adjustment normalizes counting statistics to a common rate (e.g., per 100 possessions), isolating efficiency from tempo. This is essential in any sport where game tempo varies between teams.
- Opponent adjustment corrects for the strength of competition, isolating true team ability from schedule difficulty. Iterative methods that alternate between offensive and defensive adjustments converge to stable ratings.
- Rolling averages, exponentially weighted means, and lag features capture the temporal dynamics of team performance. Multiple timescales (short and long windows) provide complementary information about recent form and stable ability.
- Interaction features --- differences, ratios, and products of corresponding team statistics --- capture matchup dynamics that individual team features cannot represent. Style clash metrics quantify how playing styles interact.
- Dimensionality reduction via PCA, t-SNE, or feature clustering manages the feature proliferation that results from aggressive feature engineering. PCA is useful for compression; t-SNE for visualization; clustering for interpretable reduction.
- Automated feature engineering with tools like featuretools and tsfresh systematically explores the feature space, generating candidates that complement hand-crafted features. Rigorous feature selection is essential to avoid retaining noise.
- Target encoding with Bayesian smoothing provides a compact encoding of high-cardinality categorical variables, but must be regularized and cross-validated to prevent overfitting.
Key Formulas
| Formula | Description |
|---|---|
| $\text{ORtg} = \frac{\text{Pts}}{\text{Poss}} \times 100$ | Offensive rating (per 100 possessions) |
| $\text{Poss} \approx \text{FGA} - \text{OREB} + \text{TOV} + 0.44 \times \text{FTA}$ | Possession estimate (Dean Oliver) |
| $\text{EWM}_t = \alpha \cdot x_{t-1} + (1-\alpha) \cdot \text{EWM}_{t-1}$ | Exponentially weighted mean |
| $\alpha = 1 - \exp(-\ln 2 / h)$ | EWM decay from half-life $h$ |
| $\Sigma = V \Lambda V^T$ | PCA eigendecomposition of covariance matrix |
| $\text{Enc}(c) = \frac{n_c \bar{y}_c + m \bar{y}_g}{n_c + m}$ | Bayesian target encoding |
Key Code Patterns
- Pace adjustment functions (`estimate_possessions`, `pace_adjust_stats`): Normalize raw counting stats to per-possession rates, removing tempo effects.
- Temporal feature builders (`compute_rolling_features`, `compute_ewm_features`, `compute_lag_features`): Systematic construction of time-aware features with `.shift(1)` to prevent data leakage.
- Matchup feature constructors (`create_matchup_features`, `compute_style_clash_metrics`): Generate interaction features that encode how two teams' characteristics combine.
- Dimensionality reduction (`apply_pca_to_sports_features`, `cluster_correlated_features`): Compress high-dimensional feature spaces while preserving predictive information, with proper train/test separation.
- Automated pipelines (`FeatureSelectionPipeline`): Multi-stage feature selection combining variance filtering, correlation filtering, mutual information, and model-based importance.
Decision Framework: Feature Engineering Workflow
START: You have raw sports data and a prediction target.
1. Create domain-specific features.
- Pace-adjust all counting statistics.
- Opponent-adjust all efficiency metrics.
- Compute quality of competition metrics.
2. Build temporal features.
- Rolling averages (3, 10, 20 game windows).
- Exponentially weighted means (half-lives of 5 and 15).
- Key lag features (last game result, rest days).
- Season-to-date averages.
3. Construct matchup features.
- Differences and ratios of corresponding team stats.
- Style clash metrics for key dimensions.
- Player combination features (if lineup data available).
- Encode categorical matchups (target encoding or embeddings).
4. Apply automated feature generation.
- Featuretools for relational features.
- Tsfresh for time series features.
- Merge with hand-crafted features.
5. Select features.
- Remove near-zero variance features.
- Filter high-correlation redundancies.
- Rank by mutual information and model importance.
- Validate on held-out data.
6. Optionally reduce dimensions.
- PCA if $p > n$ or for linear models.
- Feature clustering for interpretable reduction.
- t-SNE for visual exploration only.
END: Feature matrix ready for modeling.
What's Next
In Chapter 29: Neural Networks for Sports Prediction, we will take the features engineered in this chapter and feed them into deep learning models. We will begin with feedforward networks for tabular sports data, then advance to LSTM and sequence models that can learn directly from temporal game sequences --- potentially discovering temporal patterns that our hand-crafted rolling features miss. We will also explore entity embeddings, which learn dense vector representations of teams, players, and venues that serve as a learned form of the categorical encodings we built manually in this chapter. The neural network approach represents a shift from designing features by hand to learning representations from data, and understanding both paradigms --- as well as when to combine them --- is essential for the modern sports prediction practitioner.
This chapter is part of The Sports Betting Textbook, a comprehensive guide to quantitative sports betting. All code examples use Python 3.11+ and are available in the companion repository.