Learning Objectives

  • Design and implement domain-specific features for sports prediction including pace-adjusted, opponent-adjusted, and quality-of-competition metrics
  • Construct temporal features using rolling averages, exponentially weighted means, lag features, and recency-weighted statistics
  • Build interaction features that capture team matchup dynamics, style clashes, and player combination effects
  • Apply PCA, t-SNE, and feature clustering to reduce dimensionality while preserving predictive signal
  • Use automated feature engineering tools including featuretools and tsfresh to systematically generate and select features

Chapter 28: Feature Engineering for Sports Prediction

"The features you use are more important than the algorithm you choose." --- Pedro Domingos, A Few Useful Things to Know About Machine Learning (2012)

Chapter Overview

Machine learning models are only as good as the data they consume. In the context of sports prediction, this means that the raw box score statistics, play-by-play logs, and roster information sitting in your database are not yet ready for modeling. They must be transformed, combined, and distilled into features --- numerical representations that capture the underlying dynamics that drive outcomes. This transformation process, known as feature engineering, is widely regarded as the single most impactful step in any predictive modeling pipeline.

Feature engineering for sports prediction presents unique challenges. Sports data is inherently temporal: a team's performance in Week 1 may bear little resemblance to its performance in Week 14 after injuries, trades, and coaching adjustments. It is also deeply contextual: a basketball team that excels against slow-paced opponents may struggle against up-tempo teams, even if their aggregate statistics appear identical. And it is rich with categorical structure: matchups between specific teams, players, and venues create combinatorial complexity that simple tabular features cannot capture.

This chapter provides a systematic framework for turning raw sports data into powerful predictive features. We begin with domain-specific feature creation --- the art of encoding sports knowledge into numerical form. We then build temporal and rolling features that capture performance trajectories. We construct interaction features that model the matchup dynamics central to sports prediction. We apply dimensionality reduction techniques to manage feature proliferation. And we conclude with automated feature engineering tools that can systematically explore the feature space.

In this chapter, you will learn to:

  • Transform raw statistics into pace-adjusted, opponent-adjusted, and quality-weighted features that isolate true team and player ability
  • Build rolling, exponential, and lag features that capture the time-dependent nature of sports performance
  • Encode matchup dynamics through interaction features that model how teams and playing styles interact
  • Apply PCA and t-SNE to reduce high-dimensional feature spaces while preserving predictive information
  • Use automated tools to generate, evaluate, and select features at scale


28.1 Domain-Specific Feature Creation

The Importance of Domain Knowledge

The most powerful features in sports prediction are not discovered by algorithms --- they are designed by humans who understand the sport. A neural network can learn complex nonlinear relationships, but it cannot invent a feature that does not exist in its input data. If you feed a model raw points scored and points allowed without adjusting for pace of play, the model will conflate scoring efficiency with game tempo. If you include opponent strength without adjustment, the model will confuse a team that beat weak opponents with a team that beat strong ones.

Domain-specific feature engineering encodes your understanding of how a sport works into numerical form. It is the bridge between sports expertise and statistical modeling.

Pace-Adjusted Statistics

In basketball, the number of possessions per game varies enormously between teams. A team averaging 115 points per game in 105 possessions is far more efficient than a team averaging 115 points in 115 possessions. Raw point totals obscure this distinction. Pace adjustment normalizes statistics to a common number of possessions, revealing the true underlying efficiency.

The fundamental units are offensive rating (points scored per 100 possessions) and defensive rating (points allowed per 100 possessions):

$$\text{ORtg} = \frac{\text{Points Scored}}{\text{Possessions}} \times 100$$

$$\text{DRtg} = \frac{\text{Points Allowed}}{\text{Possessions}} \times 100$$

Possessions can be estimated from box score data using:

$$\text{Poss} \approx \text{FGA} - \text{OREB} + \text{TOV} + 0.44 \times \text{FTA}$$

This formula, developed by Dean Oliver, accounts for field goal attempts, offensive rebounds (which do not start new possessions), turnovers, and free throw attempts (with the 0.44 coefficient approximating the fraction of free throw trips that end possessions).

The same logic applies in other sports. In football, adjusting for plays per game normalizes between teams that run 55 plays and teams that run 75. In hockey, adjusting for Corsi (shot attempts) per 60 minutes normalizes for time on ice.

import numpy as np
import pandas as pd
from typing import Optional


def estimate_possessions(
    fga: pd.Series,
    oreb: pd.Series,
    tov: pd.Series,
    fta: pd.Series,
) -> pd.Series:
    """
    Estimate the number of possessions using the Dean Oliver formula.

    This formula provides a close approximation of true possession counts
    from standard box score statistics. The 0.44 coefficient on free throw
    attempts accounts for the fact that not all free throws end possessions
    (e.g., and-one opportunities, technical free throws, three-shot fouls).

    Args:
        fga: Field goal attempts.
        oreb: Offensive rebounds.
        tov: Turnovers.
        fta: Free throw attempts.

    Returns:
        Estimated possession count for each game.
    """
    return fga - oreb + tov + 0.44 * fta


def pace_adjust_stats(
    df: pd.DataFrame,
    stat_columns: list[str],
    possessions_col: str = "possessions",
    per_n_possessions: int = 100,
) -> pd.DataFrame:
    """
    Normalize raw counting statistics to a per-N-possessions basis.

    This removes the effect of game tempo, allowing fair comparison
    between teams and games with different paces of play.

    Args:
        df: DataFrame containing raw statistics and possession counts.
        stat_columns: List of column names to pace-adjust.
        possessions_col: Name of the column containing possession estimates.
        per_n_possessions: The normalization base (default 100 possessions).

    Returns:
        DataFrame with new columns named '{original}_per{N}'.

    Example:
        >>> df = pd.DataFrame({
        ...     'points': [110, 98, 105],
        ...     'possessions': [100, 88, 95]
        ... })
        >>> result = pace_adjust_stats(df, ['points'])
        >>> print(result['points_per100'].values)
        [110.  111.36  110.53]
    """
    result = df.copy()
    for col in stat_columns:
        new_col = f"{col}_per{per_n_possessions}"
        result[new_col] = (
            result[col] / result[possessions_col] * per_n_possessions
        )
    return result

Opponent-Adjusted Statistics

A team that plays in a weak conference will accumulate inflated raw statistics. Opponent adjustment corrects for the strength of competition faced, isolating a team's true ability from the quality of its opponents.

The simplest opponent adjustment compares a team's performance against each opponent to that opponent's season average. If an opponent allows 105 points per 100 possessions on average, and you score 112 against them, your opponent-adjusted offensive rating for that game is:

$$\text{OAdj}_i = \text{ORtg}_i - \text{Opp DRtg}_{\text{avg}} + \text{League DRtg}_{\text{avg}}$$

This centers the adjustment around the league average, so a team performing exactly at its opponent's usual level receives the league-average rating.

More sophisticated approaches use iterative methods. The adjusted efficiency margin used by systems like KenPom in college basketball iterates between offensive and defensive adjustments until convergence:

$$\text{AdjO}_t^{(k+1)} = \frac{\sum_i w_i \cdot \text{ORtg}_{t,i} \cdot \frac{\text{League DRtg}_{\text{avg}}}{\text{AdjD}_{o_i}^{(k)}}}{\sum_i w_i}$$

where $t$ indexes teams, $i$ indexes games, $o_i$ is the opponent in game $i$, $w_i$ are recency weights, and $k$ is the iteration number. The defensive adjustment $\text{AdjD}$ is computed symmetrically.

def simple_opponent_adjustment(
    df: pd.DataFrame,
    team_col: str,
    opponent_col: str,
    stat_col: str,
    opponent_avg_col: str,
    league_avg: float,
) -> pd.Series:
    """
    Apply a simple additive opponent adjustment to a statistic.

    Adjusts each observation by the difference between the opponent's
    average and the league average. A team that scores 112 per 100
    possessions against a defense that allows 105 (league avg 108)
    gets an adjusted rating of 112 - 105 + 108 = 115.

    Args:
        df: DataFrame with game-level data.
        team_col: Column identifying the team.
        opponent_col: Column identifying the opponent.
        stat_col: The raw statistic to adjust.
        opponent_avg_col: The opponent's season average for the
            corresponding defensive/offensive stat.
        league_avg: The league-wide average for the statistic.

    Returns:
        Series of opponent-adjusted values.
    """
    return df[stat_col] - df[opponent_avg_col] + league_avg


def iterative_opponent_adjustment(
    game_data: pd.DataFrame,
    team_col: str = "team",
    opponent_col: str = "opponent",
    ortg_col: str = "ortg",
    drtg_col: str = "drtg",
    max_iterations: int = 100,
    tolerance: float = 1e-6,
) -> pd.DataFrame:
    """
    Compute iterative opponent-adjusted offensive and defensive ratings.

    Uses an iterative scheme similar to KenPom's adjusted efficiency
    ratings. Alternates between adjusting offensive ratings using
    opponents' defensive ratings and vice versa until convergence.

    Args:
        game_data: DataFrame with one row per team per game, containing
            columns for team, opponent, offensive rating, and defensive
            rating (all per 100 possessions).
        team_col: Column name for the team identifier.
        opponent_col: Column name for the opponent identifier.
        ortg_col: Column name for offensive rating.
        drtg_col: Column name for defensive rating.
        max_iterations: Maximum number of iterations.
        tolerance: Convergence threshold (max change in any rating).

    Returns:
        DataFrame indexed by team with columns 'adj_ortg' and 'adj_drtg'.
    """
    teams = game_data[team_col].unique()
    league_avg_ortg = game_data[ortg_col].mean()
    league_avg_drtg = game_data[drtg_col].mean()

    # Initialize adjusted ratings at team raw averages
    adj_ortg = game_data.groupby(team_col)[ortg_col].mean().to_dict()
    adj_drtg = game_data.groupby(team_col)[drtg_col].mean().to_dict()

    for iteration in range(max_iterations):
        prev_ortg = adj_ortg.copy()
        prev_drtg = adj_drtg.copy()

        # Update offensive ratings
        for team in teams:
            team_games = game_data[game_data[team_col] == team]
            adjusted_values = []
            for _, game in team_games.iterrows():
                opp = game[opponent_col]
                opp_adj_d = adj_drtg.get(opp, league_avg_drtg)
                adjustment_factor = league_avg_drtg / opp_adj_d
                adjusted_values.append(game[ortg_col] * adjustment_factor)
            adj_ortg[team] = np.mean(adjusted_values)

        # Update defensive ratings
        for team in teams:
            team_games = game_data[game_data[team_col] == team]
            adjusted_values = []
            for _, game in team_games.iterrows():
                opp = game[opponent_col]
                opp_adj_o = adj_ortg.get(opp, league_avg_ortg)
                adjustment_factor = league_avg_ortg / opp_adj_o
                adjusted_values.append(game[drtg_col] * adjustment_factor)
            adj_drtg[team] = np.mean(adjusted_values)

        # Check convergence
        max_change = max(
            max(abs(adj_ortg[t] - prev_ortg[t]) for t in teams),
            max(abs(adj_drtg[t] - prev_drtg[t]) for t in teams),
        )
        if max_change < tolerance:
            break

    result = pd.DataFrame({
        "adj_ortg": adj_ortg,
        "adj_drtg": adj_drtg,
    })
    result["adj_margin"] = result["adj_ortg"] - result["adj_drtg"]
    return result.sort_values("adj_margin", ascending=False)

Quality of Competition Metrics

Beyond simple opponent adjustment, quality of competition (QoC) metrics measure the aggregate difficulty of a team's or player's schedule. These are particularly valuable in sports with unbalanced schedules (college sports, international soccer) where teams face very different levels of opposition.

Common QoC metrics include:

  • Strength of schedule (SOS): The average rating of the opponents faced. Can be computed as the mean of opponents' adjusted efficiency margins.
  • Strength of record (SOR): The expected win-loss record a typical team would achieve against the same schedule, given the margins of victory observed.
  • Weighted quality of competition: SOS in which recent games receive higher weight, reflecting the difficulty of the schedule the team has faced most recently.

$$\text{SOS}_t = \frac{1}{n} \sum_{i=1}^{n} \text{Rating}_{o_i}$$

where $o_i$ is the opponent in game $i$ and $\text{Rating}$ is an adjusted team strength metric.
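
A minimal sketch of the SOS computation, assuming every opponent has a precomputed rating; the function and column names are illustrative:

def strength_of_schedule(
    schedule: pd.DataFrame,
    team_ratings: dict[str, float],
    opponent_col: str = "opponent",
    weights: Optional[np.ndarray] = None,
) -> float:
    """Mean (optionally recency-weighted) rating of the opponents faced."""
    # Assumes every opponent appears in team_ratings
    opponent_ratings = schedule[opponent_col].map(team_ratings)
    if weights is None:
        return float(opponent_ratings.mean())
    return float(np.average(opponent_ratings, weights=weights))

The team_ratings dictionary can be built from an adjusted metric, for example iterative_opponent_adjustment(...)["adj_margin"].to_dict().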

Feature Brainstorming Framework

When approaching a new sport or prediction problem, use the following systematic framework to brainstorm features:

  1. Efficiency metrics: Points per possession, yards per play, goals per expected goals, runs per plate appearance. Always normalize by opportunity.
  2. Pace/tempo metrics: Possessions per game, plays per game, time of possession. These capture style independently of efficiency.
  3. Opponent adjustments: Apply to every efficiency metric. Raw stats are almost always inferior to adjusted stats.
  4. Situational splits: Home/away, rest days, back-to-back games, altitude, indoor/outdoor, day/night (see the rest-day sketch after this list).
  5. Personnel factors: Injuries, lineup changes, minutes distribution, depth metrics.
  6. Recent form: Is the team trending up or down? Recent performance often dominates season averages.
  7. Matchup-specific: How does this team's style interact with the opponent's style? (Developed further in Section 28.3.)
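
To make item 4 concrete, here is a minimal sketch that derives rest-day and back-to-back features from the schedule, using the same leakage-safe shift pattern applied to the temporal features in Section 28.2. Column names are illustrative, and the date column is assumed to be a datetime type:

def add_schedule_features(
    df: pd.DataFrame,
    team_col: str = "team",
    date_col: str = "date",
) -> pd.DataFrame:
    """Add rest-day and back-to-back indicators derived from game dates."""
    df = df.sort_values([team_col, date_col]).copy()
    prev_date = df.groupby(team_col)[date_col].shift(1)

    # Days since the previous game; NaN for each team's first game
    df["rest_days"] = (df[date_col] - prev_date).dt.days
    # Back-to-back indicator: the team also played the previous day
    df["back_to_back"] = (df["rest_days"] == 1).astype(int)
    return df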

Real-World Application: In NBA prediction, the single most important feature improvement most modelers can make is switching from raw team statistics to pace-adjusted, opponent-adjusted efficiency metrics. This one change --- which requires no sophisticated machine learning, only domain knowledge --- typically improves game-outcome prediction accuracy by 2--5 percentage points, which translates into a substantial betting edge.


28.2 Temporal and Rolling Features

The Challenge of Time in Sports Data

Sports data is fundamentally a time series. A team's characteristics evolve continuously --- through injuries, trades, coaching changes, fatigue, and natural performance variance. Using season-long averages treats a team's October performance as equally relevant to predicting its March games, which is often badly wrong. Temporal features capture the dynamic, evolving nature of sports performance.

Rolling Averages

The simplest temporal feature is a rolling average (also called a moving average): the mean of the last $k$ observations. For a statistic $x$ at time $t$, the rolling average with window $k$ is:

$$\bar{x}_{t,k} = \frac{1}{k} \sum_{i=t-k}^{t-1} x_i$$

Note the critical detail: the window includes observations up to $t-1$, not $t$ itself. Including the current observation would constitute data leakage --- using information from the game you are trying to predict.

The choice of window size $k$ involves a bias-variance tradeoff:

  • Small $k$ (e.g., 3--5 games): Responsive to recent changes but noisy. Captures hot streaks and immediate effects of roster changes.
  • Large $k$ (e.g., 15--20 games): Stable but slow to adapt. Better estimates of true underlying ability but may miss regime changes.
  • Multiple windows: Using several window sizes simultaneously (e.g., $k = 3, 10, 20$) lets the model learn which timescale is most informative for each context.

def compute_rolling_features(
    df: pd.DataFrame,
    group_col: str,
    date_col: str,
    stat_columns: list[str],
    windows: list[int],
    min_periods: int = 1,
) -> pd.DataFrame:
    """
    Compute rolling averages for multiple statistics and window sizes.

    For each team (group), sorts by date and computes rolling means
    over the specified windows. Uses shift(1) to prevent data leakage
    by excluding the current observation from the window.

    Args:
        df: DataFrame with game-level data.
        group_col: Column to group by (typically 'team').
        date_col: Column to sort by (typically 'date').
        stat_columns: List of statistics to compute rolling averages for.
        windows: List of window sizes (in games).
        min_periods: Minimum number of observations required for a
            non-null result.

    Returns:
        DataFrame with new columns named '{stat}_roll_{window}'.

    Example:
        >>> df = pd.DataFrame({
        ...     'team': ['A']*5,
        ...     'date': pd.date_range('2024-01-01', periods=5),
        ...     'points': [100, 110, 105, 95, 108],
        ... })
        >>> result = compute_rolling_features(
        ...     df, 'team', 'date', ['points'], [3]
        ... )
        >>> print(result['points_roll_3'].values)
        [NaN, 100.0, 105.0, 105.0, 103.33]
    """
    df = df.sort_values([group_col, date_col]).copy()

    for stat in stat_columns:
        for window in windows:
            col_name = f"{stat}_roll_{window}"
            df[col_name] = (
                df.groupby(group_col)[stat]
                .transform(
                    lambda s: s.shift(1).rolling(
                        window=window, min_periods=min_periods
                    ).mean()
                )
            )

    return df

Exponentially Weighted Means

Rolling averages treat all observations within the window equally and completely ignore observations outside it. Exponentially weighted means (EWMs) offer a smoother alternative that weights recent observations more heavily while retaining some influence from older data:

$$\text{EWM}_t = \alpha \cdot x_{t-1} + (1 - \alpha) \cdot \text{EWM}_{t-1}$$

where $\alpha \in (0, 1)$ is the smoothing factor (also called the decay parameter). Higher $\alpha$ places more weight on recent observations. (The recursion is written in terms of $x_{t-1}$ rather than $x_t$ so that the estimate available before game $t$ never includes game $t$ itself, mirroring the shift(1) used in the code below.) The half-life $h$ --- the number of periods after which an observation's weight drops to 50% of a current observation's weight --- is related to $\alpha$ by:

$$\alpha = 1 - \exp\left(\frac{-\ln 2}{h}\right)$$

A half-life of 5 games means that a game played 5 games ago has half the influence of the most recent game. A half-life of 10 games produces a smoother, more conservative estimate.
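
As a quick numerical check of the half-life relationship, the sketch below converts a half-life of 5 games to its smoothing factor and confirms that a game's weight halves after 5 games. Here halflife_to_alpha is a local helper for illustration, not a pandas function; pandas accepts the half-life directly via ewm(halflife=...):

def halflife_to_alpha(halflife: float) -> float:
    """Convert an EWM half-life (in games) into the smoothing factor alpha."""
    return 1.0 - np.exp(-np.log(2.0) / halflife)


alpha = halflife_to_alpha(5)                       # ~0.129
retention = 1.0 - alpha                            # weight multiplier per game
print(round(alpha, 3), round(retention ** 5, 3))   # 0.129 0.5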

def compute_ewm_features(
    df: pd.DataFrame,
    group_col: str,
    date_col: str,
    stat_columns: list[str],
    half_lives: list[float],
) -> pd.DataFrame:
    """
    Compute exponentially weighted means with multiple half-lives.

    EWMs provide a smooth, recency-weighted average that avoids the
    sharp cutoff of rolling windows. The half-life parameter controls
    how quickly old observations lose influence.

    Args:
        df: DataFrame with game-level data.
        group_col: Column to group by (typically 'team').
        date_col: Column to sort by (typically 'date').
        stat_columns: Statistics to compute EWMs for.
        half_lives: List of half-life values (in games).

    Returns:
        DataFrame with new columns named '{stat}_ewm_{halflife}'.
    """
    df = df.sort_values([group_col, date_col]).copy()

    for stat in stat_columns:
        for hl in half_lives:
            col_name = f"{stat}_ewm_{hl}"
            df[col_name] = (
                df.groupby(group_col)[stat]
                .transform(
                    lambda s: s.shift(1).ewm(halflife=hl).mean()
                )
            )

    return df

Lag Features

Lag features capture the raw value of a statistic at a specific number of games in the past. While rolling averages and EWMs aggregate over time, lag features preserve the exact values at specific time offsets:

$$\text{Lag}_k(x_t) = x_{t-k}$$

Lag features are useful for capturing:

  • Momentum effects: Did the team win or lose their last game? By how much?
  • Fatigue patterns: What was the team's workload 1, 2, and 3 games ago?
  • Schedule effects: How many days of rest since the last game?

def compute_lag_features(
    df: pd.DataFrame,
    group_col: str,
    date_col: str,
    stat_columns: list[str],
    lags: list[int],
) -> pd.DataFrame:
    """
    Create lag features for specified statistics and lag periods.

    Each lag feature captures the exact value of a statistic from
    a specific number of games in the past. Lag 1 is the previous
    game, lag 2 is two games ago, etc.

    Args:
        df: DataFrame with game-level data.
        group_col: Column to group by.
        date_col: Column to sort by.
        stat_columns: Statistics to create lag features for.
        lags: List of lag periods.

    Returns:
        DataFrame with new columns named '{stat}_lag_{lag}'.
    """
    df = df.sort_values([group_col, date_col]).copy()

    for stat in stat_columns:
        for lag in lags:
            col_name = f"{stat}_lag_{lag}"
            df[col_name] = df.groupby(group_col)[stat].shift(lag)

    return df

Season-to-Date Cumulative Statistics

Season-to-date (STD) features compute the running average or total from the start of the season through the previous game. These provide a stable, comprehensive baseline that is less noisy than short-window rolling averages:

$$\text{STD}_t = \frac{1}{t-1} \sum_{i=1}^{t-1} x_i$$

STD features are particularly useful early in the season when rolling windows do not have enough data, and for statistics where the true underlying rate changes slowly (e.g., free throw percentage, three-point percentage).

def compute_season_to_date(
    df: pd.DataFrame,
    group_col: str,
    date_col: str,
    stat_columns: list[str],
    season_col: str = "season",
) -> pd.DataFrame:
    """
    Compute expanding (season-to-date) averages through the previous game.

    For each team-season, computes the cumulative mean of each statistic
    through the game prior to the current one. This provides a stable
    estimate of underlying team ability that incorporates all available
    season data without data leakage.

    Args:
        df: DataFrame with game-level data.
        group_col: Column to group by (typically 'team').
        date_col: Column to sort by.
        stat_columns: Statistics to compute STD averages for.
        season_col: Column identifying the season.

    Returns:
        DataFrame with new columns named '{stat}_std'.
    """
    df = df.sort_values([group_col, date_col]).copy()
    group_keys = [group_col, season_col]

    for stat in stat_columns:
        col_name = f"{stat}_std"
        df[col_name] = (
            df.groupby(group_keys)[stat]
            .transform(lambda s: s.shift(1).expanding().mean())
        )

    return df

Recency Weighting Schemes

Beyond exponential weighting, custom recency schemes can encode domain-specific beliefs about how quickly past performance loses relevance. For instance, you might believe that games played before a major trade are fundamentally different from games after it. Or you might weight games by the number of current starters who played in them.

A general recency-weighted average is:

$$\bar{x}_t^{(w)} = \frac{\sum_{i=1}^{t-1} w_i \cdot x_i}{\sum_{i=1}^{t-1} w_i}$$

where $w_i$ are arbitrary non-negative weights. Common choices include:

  • Linear decay: $w_i = \max(0, 1 - \beta \cdot (t - 1 - i))$
  • Power decay: $w_i = (t - i)^{-\gamma}$
  • Binary cutoff with soft edges: Full weight for recent games, linearly declining weight for games in a transition period, zero weight for old games.

def recency_weighted_average(
    values: np.ndarray,
    scheme: str = "exponential",
    **kwargs,
) -> float:
    """
    Compute a recency-weighted average using various weighting schemes.

    Args:
        values: Array of historical values, ordered from oldest to newest.
            The last element is the most recent observation.
        scheme: Weighting scheme. One of 'exponential', 'linear', 'power'.
        **kwargs: Scheme-specific parameters:
            - exponential: 'alpha' (decay rate, default 0.9)
            - linear: 'slope' (decay per period, default 0.05)
            - power: 'gamma' (exponent, default 0.5)

    Returns:
        The recency-weighted average.
    """
    n = len(values)
    if n == 0:
        return np.nan

    if scheme == "exponential":
        alpha = kwargs.get("alpha", 0.9)
        weights = np.array([alpha ** (n - 1 - i) for i in range(n)])
    elif scheme == "linear":
        slope = kwargs.get("slope", 0.05)
        weights = np.maximum(0, 1 - slope * np.arange(n - 1, -1, -1))
    elif scheme == "power":
        gamma = kwargs.get("gamma", 0.5)
        recency = np.arange(n, 0, -1).astype(float)
        weights = recency ** (-gamma)
    else:
        raise ValueError(f"Unknown scheme: {scheme}")

    if weights.sum() == 0:
        return np.nan

    return np.average(values, weights=weights)

Common Pitfall: The most dangerous error in temporal feature engineering is data leakage --- using information from the game you are trying to predict (or from future games) in your features. Always use .shift(1) or equivalent logic to ensure that the current observation is excluded. This seems obvious in isolation, but becomes surprisingly easy to violate when building complex feature pipelines with multiple join operations. Test your pipeline by verifying that changing the outcome of game $t$ does not change any features available at time $t$.
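
One way to operationalize that test is a perturbation check: rebuild the features after artificially changing the outcome of a single game and confirm that the feature row for that game is unchanged. The sketch below assumes a hypothetical build_features callable (for example, a wrapper around the rolling, EWM, and lag functions above) and an illustrative outcome column named points:

def check_no_leakage(
    raw_games: pd.DataFrame,
    build_features,              # callable: raw game DataFrame -> feature DataFrame
    perturb_idx,                 # index label of the game to perturb
    outcome_col: str = "points",
) -> bool:
    """Return True if perturbing one game's outcome leaves its own feature row unchanged."""
    baseline = build_features(raw_games)

    perturbed_games = raw_games.copy()
    perturbed_games.loc[perturb_idx, outcome_col] += 50  # deliberately absurd change

    perturbed = build_features(perturbed_games)

    # Features available when predicting this game must not depend on its outcome.
    # A stricter version also compares every row dated at or before this game.
    return baseline.loc[[perturb_idx]].equals(perturbed.loc[[perturb_idx]])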


28.3 Interaction Features and Matchup Encoding

Why Matchups Matter

Sports prediction is fundamentally about matchups, not isolated team strength. A basketball team with an elite interior defense may be nearly unbeatable against post-heavy opponents but vulnerable against three-point shooting teams. A football team with a dominant pass rush may shut down pass-heavy offenses but struggle against strong rushing attacks. These interaction effects mean that the predictive value of a team's statistics depends on the opponent's characteristics.

Standard features capture each team's attributes independently. Interaction features capture how those attributes combine when two specific teams meet.

Team Matchup Features

The simplest interaction features are differences and ratios between corresponding statistics of two teams:

$$\Delta_{\text{ortg}} = \text{ORtg}_{\text{home}} - \text{DRtg}_{\text{away}}$$

This measures the expected offensive surplus when the home team's offense meets the away team's defense. Symmetrically:

$$\Delta_{\text{drtg}} = \text{DRtg}_{\text{home}} - \text{ORtg}_{\text{away}}$$

Beyond simple differences, products of complementary features capture nonlinear interactions:

$$\text{Pace Interaction} = \text{Pace}_{\text{home}} \times \text{Pace}_{\text{away}}$$

This captures the fact that two fast-paced teams create an especially high-scoring game (the product is large) while two slow-paced teams create an especially low-scoring game.

def create_matchup_features(
    df: pd.DataFrame,
    home_prefix: str = "home_",
    away_prefix: str = "away_",
    stat_pairs: Optional[list[tuple[str, str]]] = None,
) -> pd.DataFrame:
    """
    Generate interaction features for head-to-head matchups.

    Creates differences, ratios, and products between corresponding
    home and away team statistics. These features capture matchup
    dynamics that individual team features cannot represent.

    Args:
        df: DataFrame where each row is a game, with home and away team
            statistics prefixed accordingly.
        home_prefix: Prefix for home team stat columns.
        away_prefix: Prefix for away team stat columns.
        stat_pairs: List of (home_stat, away_stat) tuples to create
            interactions for. If None, automatically pairs columns
            with matching suffixes.

    Returns:
        DataFrame with new matchup feature columns.

    Example:
        >>> df = pd.DataFrame({
        ...     'home_ortg': [112.0], 'away_drtg': [108.0],
        ...     'home_pace': [100.0], 'away_pace': [95.0],
        ... })
        >>> result = create_matchup_features(
        ...     df, stat_pairs=[('home_ortg', 'away_drtg'),
        ...                     ('home_pace', 'away_pace')]
        ... )
    """
    result = df.copy()

    if stat_pairs is None:
        # Auto-detect pairs by matching suffixes
        home_cols = [c for c in df.columns if c.startswith(home_prefix)]
        stat_pairs = []
        for hc in home_cols:
            suffix = hc[len(home_prefix):]
            ac = away_prefix + suffix
            if ac in df.columns:
                stat_pairs.append((hc, ac))

    for home_stat, away_stat in stat_pairs:
        base_name = home_stat.replace(home_prefix, "")

        # Difference: home advantage in this metric
        result[f"matchup_diff_{base_name}"] = (
            result[home_stat] - result[away_stat]
        )

        # Ratio: relative strength (with safety for zero division)
        denominator = result[away_stat].replace(0, np.nan)
        result[f"matchup_ratio_{base_name}"] = (
            result[home_stat] / denominator
        )

        # Product: interaction magnitude
        result[f"matchup_prod_{base_name}"] = (
            result[home_stat] * result[away_stat]
        )

    return result

Style Clash Metrics

Beyond raw statistical matchups, style clash metrics quantify how teams' playing styles interact. These require defining style dimensions first, then measuring compatibility or conflict along each dimension.

In basketball, useful style dimensions include:

  • Pace: Fast vs. slow tempo
  • Three-point reliance: Perimeter-oriented vs. interior-oriented offense
  • Defensive style: Switching defense vs. drop coverage, full-court press vs. half-court
  • Lineup size: Small-ball lineups vs. traditional big lineups

A style clash score measures how unusual the matchup is along a specific dimension:

$$\text{Clash}_{d} = |S_{d,\text{home}} - S_{d,\text{away}}|$$

where $S_{d,t}$ is team $t$'s value on style dimension $d$. High clash scores indicate stylistic mismatches that may create unusual game dynamics.

def compute_style_clash_metrics(
    df: pd.DataFrame,
    style_dimensions: dict[str, tuple[str, str]],
) -> pd.DataFrame:
    """
    Compute style clash metrics for each matchup along multiple dimensions.

    For each style dimension, computes the absolute difference, the
    product, and a z-scored clash metric indicating how extreme the
    mismatch is relative to the league distribution.

    Args:
        df: DataFrame with one row per game, containing style metrics
            for both home and away teams.
        style_dimensions: Dictionary mapping dimension names to
            (home_col, away_col) tuples.

    Returns:
        DataFrame with style clash features appended.
    """
    result = df.copy()

    for dim_name, (home_col, away_col) in style_dimensions.items():
        # Raw clash (absolute difference)
        clash = (result[home_col] - result[away_col]).abs()
        result[f"clash_{dim_name}"] = clash

        # Z-scored clash (how unusual is this mismatch?)
        mean_clash = clash.mean()
        std_clash = clash.std()
        if std_clash > 0:
            result[f"clash_{dim_name}_z"] = (clash - mean_clash) / std_clash
        else:
            result[f"clash_{dim_name}_z"] = 0.0

        # Directional clash (positive = home team higher)
        result[f"clash_{dim_name}_dir"] = (
            result[home_col] - result[away_col]
        )

    return result

Player Combination Features

In sports with lineup data, the specific combination of players on the court or field matters enormously. Player combination features capture the synergies and conflicts between specific groups of players.

For basketball, lineup-level features aggregate the on-court performance of every combination of five players:

$$\text{Lineup ORtg}_{L} = \frac{\text{Points scored with lineup } L}{\text{Possessions with lineup } L} \times 100$$

For modeling purposes, pairwise player features are more practical than full lineup features (since the combinatorial space of five-player lineups is vast). The net rating of every two-player combination captures which pairs work well together:

$$\text{Pair NetRtg}_{i,j} = \text{Team NetRtg when } i \text{ and } j \text{ are both on court}$$
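
A minimal sketch of the pairwise computation, assuming stint-level input with hypothetical columns 'players' (the list of one team's players on court), 'possessions', 'points_for', and 'points_against':

from itertools import combinations
from collections import defaultdict


def pairwise_net_ratings(stints: pd.DataFrame) -> pd.DataFrame:
    """Aggregate net rating per 100 possessions for every two-player pair."""
    totals = defaultdict(lambda: {"poss": 0.0, "pf": 0.0, "pa": 0.0})

    for _, stint in stints.iterrows():
        # Credit every pair that shared the floor during this stint
        for pair in combinations(sorted(stint["players"]), 2):
            totals[pair]["poss"] += stint["possessions"]
            totals[pair]["pf"] += stint["points_for"]
            totals[pair]["pa"] += stint["points_against"]

    rows = [
        {
            "player_a": a,
            "player_b": b,
            "possessions": v["poss"],
            "net_rtg": (v["pf"] - v["pa"]) / v["poss"] * 100 if v["poss"] else np.nan,
        }
        for (a, b), v in totals.items()
    ]
    return pd.DataFrame(rows)

Pairs with very few shared possessions are noisy and should be shrunk toward zero or filtered before being used as features.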

Encoding Categorical Matchups

When teams, players, and venues are categorical variables, they must be encoded numerically. Several encoding strategies exist:

  1. One-hot encoding: Creates a binary indicator for each category. Simple but creates very high-dimensional, sparse features. Impractical for player-level encoding (thousands of players).

  2. Target encoding: Replaces each category with the average target value (e.g., win rate) for that category. Compact but prone to overfitting --- a player who has appeared in two games and won both gets an encoding of 1.0, which is unreliable.

  3. Leave-one-out target encoding: For each observation, the encoding uses the mean target value excluding that observation. Reduces overfitting compared to naive target encoding.

  4. Entity embeddings: Learned dense vector representations (covered in detail in Chapter 29). The most powerful approach for large categorical spaces.

from sklearn.base import BaseEstimator, TransformerMixin


class LeaveOneOutEncoder(BaseEstimator, TransformerMixin):
    """
    Leave-one-out target encoder for categorical variables.

    For each observation, encodes the categorical value with the mean
    of the target variable computed over all OTHER observations with
    the same category value. This prevents direct leakage of the
    target into the feature.

    During transform (on new data), uses the full training-set mean
    for each category.

    Attributes:
        category_means_: Dict mapping category values to target means.
        category_counts_: Dict mapping category values to observation counts.
        category_sums_: Dict mapping category values to target sums.
        global_mean_: The overall mean of the target in the training data.
        smoothing: Bayesian smoothing parameter. Higher values shrink
            category means toward the global mean.
    """

    def __init__(self, smoothing: float = 10.0):
        self.smoothing = smoothing

    def fit(self, X: pd.Series, y: pd.Series):
        """Compute category-level statistics from training data."""
        self.global_mean_ = y.mean()
        df = pd.DataFrame({"cat": X, "target": y})
        stats = df.groupby("cat")["target"].agg(["mean", "count", "sum"])
        self.category_means_ = stats["mean"].to_dict()
        self.category_counts_ = stats["count"].to_dict()
        self.category_sums_ = stats["sum"].to_dict()
        return self

    def transform(self, X: pd.Series, y: Optional[pd.Series] = None):
        """
        Encode categories using leave-one-out means (if y provided)
        or full training means (if y is None, i.e., test data).
        """
        if y is not None:
            # Leave-one-out encoding for training data
            result = []
            for cat_val, target_val in zip(X, y):
                n = self.category_counts_.get(cat_val, 0)
                s = self.category_sums_.get(cat_val, 0)
                if n > 1:
                    loo_mean = (s - target_val) / (n - 1)
                else:
                    loo_mean = self.global_mean_
                # Apply smoothing
                smoothed = (
                    (n - 1) * loo_mean + self.smoothing * self.global_mean_
                ) / (n - 1 + self.smoothing)
                result.append(smoothed)
            return np.array(result)
        else:
            # Standard encoding for test data
            result = []
            for cat_val in X:
                n = self.category_counts_.get(cat_val, 0)
                mean = self.category_means_.get(cat_val, self.global_mean_)
                smoothed = (
                    n * mean + self.smoothing * self.global_mean_
                ) / (n + self.smoothing)
                result.append(smoothed)
            return np.array(result)

Intuition: Think of interaction features as encoding the question "what happens when these two specific styles collide?" rather than "how good is each team independently?" A team with the league's best three-point defense gains little from that strength against an opponent that rarely shoots threes. Interaction features encode this conditional relationship, allowing the model to make matchup-specific predictions rather than generic ones.


28.4 Dimensionality Reduction

The Curse of Dimensionality in Sports Features

The feature engineering techniques from the previous sections can generate an enormous number of features. If you have 20 base statistics, 3 rolling windows, 3 exponential half-lives, 3 lags, season-to-date averages, and matchup interactions for each, you easily exceed 500 features per game. With a typical NBA season of ~1,230 games, the number of features is of the same order as the number of observations --- a classic recipe for overfitting.

Dimensionality reduction techniques compress high-dimensional feature spaces into lower-dimensional representations that retain most of the predictive information. They serve two purposes: reducing overfitting by constraining model complexity, and revealing latent structure in the data.

Principal Component Analysis (PCA)

PCA finds the linear combinations of features (called principal components) that capture the maximum variance in the data. The first principal component is the direction of maximum variance; the second is the direction of maximum remaining variance orthogonal to the first; and so on.

Mathematically, PCA computes the eigendecomposition of the covariance matrix $\Sigma$ of the (centered) feature matrix $X$:

$$\Sigma = \frac{1}{n-1} X^T X = V \Lambda V^T$$

where $V$ is the matrix of eigenvectors (principal component directions) and $\Lambda$ is a diagonal matrix of eigenvalues (variance explained by each component). The transformed data is:

$$Z = X V_k$$

where $V_k$ contains only the first $k$ eigenvectors, projecting the data onto the top $k$ principal components.
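
To connect these formulas to code, the short sketch below performs the eigendecomposition and projection by hand on toy random data and cross-checks the result against scikit-learn's PCA (individual components can differ in sign, so the comparison uses absolute values):

from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))            # toy feature matrix
Xc = X - X.mean(axis=0)                  # center each feature

# Eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # returned in ascending order
order = np.argsort(eigvals)[::-1]
V_k = eigvecs[:, order[:2]]              # top-2 principal directions

Z = Xc @ V_k                             # projected data, shape (n, 2)

Z_sklearn = PCA(n_components=2).fit_transform(Xc)
print(np.allclose(np.abs(Z), np.abs(Z_sklearn), atol=1e-6))  # True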

The key question is: how many components to retain? Common heuristics include:

  • Explained variance threshold: Retain enough components to capture 90--95% of total variance.
  • Scree plot: Plot eigenvalues and look for an "elbow" where the marginal variance drops sharply.
  • Cross-validation: Choose $k$ that minimizes prediction error in a held-out set.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


def apply_pca_to_sports_features(
    train_features: pd.DataFrame,
    test_features: pd.DataFrame,
    variance_threshold: float = 0.95,
    plot_scree: bool = True,
) -> tuple[np.ndarray, np.ndarray, PCA, StandardScaler]:
    """
    Apply PCA to sports feature matrices with proper train/test separation.

    Standardizes features using training data statistics only (preventing
    information leakage from test data), then fits PCA on training data
    and transforms both sets.

    Args:
        train_features: Training set feature DataFrame.
        test_features: Test set feature DataFrame (same columns).
        variance_threshold: Retain components explaining this fraction
            of total variance.
        plot_scree: If True, display a scree plot of explained variance.

    Returns:
        Tuple of (train_pca, test_pca, fitted_pca, fitted_scaler).
    """
    # Standardize using training statistics only
    scaler = StandardScaler()
    train_scaled = scaler.fit_transform(train_features)
    test_scaled = scaler.transform(test_features)

    # Fit PCA on training data
    pca_full = PCA()
    pca_full.fit(train_scaled)

    # Determine number of components
    cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
    n_components = np.argmax(cumulative_variance >= variance_threshold) + 1

    if plot_scree:
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

        # Scree plot
        ax1.bar(
            range(1, len(pca_full.explained_variance_ratio_) + 1),
            pca_full.explained_variance_ratio_,
            alpha=0.7,
            label="Individual",
        )
        ax1.plot(
            range(1, len(cumulative_variance) + 1),
            cumulative_variance,
            "r-o",
            label="Cumulative",
        )
        ax1.axhline(
            y=variance_threshold, color="gray", linestyle="--",
            label=f"{variance_threshold:.0%} threshold",
        )
        ax1.axvline(
            x=n_components, color="green", linestyle="--",
            label=f"{n_components} components",
        )
        ax1.set_xlabel("Principal Component")
        ax1.set_ylabel("Explained Variance Ratio")
        ax1.set_title("Scree Plot")
        ax1.legend()

        # Top component loadings
        loadings = pd.DataFrame(
            pca_full.components_[:3],
            columns=train_features.columns,
            index=["PC1", "PC2", "PC3"],
        )
        top_features_pc1 = loadings.loc["PC1"].abs().nlargest(10)
        ax2.barh(range(len(top_features_pc1)), top_features_pc1.values)
        ax2.set_yticks(range(len(top_features_pc1)))
        ax2.set_yticklabels(top_features_pc1.index)
        ax2.set_xlabel("Absolute Loading")
        ax2.set_title("Top 10 Features in PC1")

        plt.tight_layout()
        plt.show()

    # Refit PCA with selected number of components
    pca = PCA(n_components=n_components)
    train_pca = pca.fit_transform(train_scaled)
    test_pca = pca.transform(test_scaled)

    print(
        f"Reduced {train_features.shape[1]} features to {n_components} "
        f"components ({cumulative_variance[n_components - 1]:.1%} variance)"
    )

    return train_pca, test_pca, pca, scaler

t-SNE for Visualization

t-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique designed for visualization. Unlike PCA, t-SNE does not produce a transformation that can be applied to new data --- it is a visualization tool, not a feature engineering tool.

t-SNE is valuable for exploring the structure of sports feature spaces. By projecting games or teams into two dimensions, you can visually identify clusters of similar teams, detect outliers, and verify that your features capture meaningful distinctions.

The key hyperparameter is perplexity, which roughly controls the number of effective nearest neighbors. Values between 5 and 50 are typical; higher perplexity produces more global structure, lower perplexity reveals more local detail.

from sklearn.manifold import TSNE


def visualize_teams_tsne(
    features: pd.DataFrame,
    team_labels: pd.Series,
    perplexity: int = 30,
    random_state: int = 42,
    highlight_teams: Optional[list[str]] = None,
) -> None:
    """
    Create a t-SNE visualization of teams in feature space.

    Projects the high-dimensional feature vectors for each team into
    2D space, preserving local neighborhood structure. Useful for
    exploring which teams are statistically similar and which are
    outliers.

    Args:
        features: Feature DataFrame (one row per team).
        team_labels: Series of team names/identifiers.
        perplexity: t-SNE perplexity parameter.
        random_state: Random seed for reproducibility.
        highlight_teams: Optional list of team names to highlight.
    """
    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(features)

    tsne = TSNE(
        n_components=2,
        perplexity=min(perplexity, len(features) - 1),
        random_state=random_state,
        n_iter=1000,
    )
    embedding = tsne.fit_transform(features_scaled)

    fig, ax = plt.subplots(figsize=(12, 10))

    if highlight_teams:
        mask = team_labels.isin(highlight_teams)
        ax.scatter(
            embedding[~mask, 0], embedding[~mask, 1],
            c="lightgray", s=50, alpha=0.5,
        )
        ax.scatter(
            embedding[mask, 0], embedding[mask, 1],
            c="red", s=100, zorder=5,
        )
        for i, team in enumerate(team_labels):
            if team in highlight_teams:
                ax.annotate(
                    team, (embedding[i, 0], embedding[i, 1]),
                    fontsize=9, fontweight="bold",
                )
    else:
        ax.scatter(embedding[:, 0], embedding[:, 1], s=60, alpha=0.7)
        for i, team in enumerate(team_labels):
            ax.annotate(team, (embedding[i, 0], embedding[i, 1]), fontsize=7)

    ax.set_title("t-SNE Visualization of Team Feature Space")
    ax.set_xlabel("t-SNE Dimension 1")
    ax.set_ylabel("t-SNE Dimension 2")
    plt.tight_layout()
    plt.show()

Feature Clustering

An alternative to PCA for dimensionality reduction is feature clustering --- grouping correlated features together and replacing each cluster with a single representative or aggregate. This is particularly useful when you have many features measuring similar constructs (e.g., multiple metrics of defensive quality).

Agglomerative clustering on the feature correlation matrix is a natural approach:

  1. Compute the absolute pairwise correlation matrix of all features.
  2. Convert correlations to distances: $d_{ij} = 1 - |r_{ij}|$.
  3. Perform hierarchical clustering on the distance matrix.
  4. Cut the dendrogram at a threshold to form clusters.
  5. Within each cluster, select the feature with the highest average correlation to the target, or compute the first principal component.

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def cluster_correlated_features(
    features: pd.DataFrame,
    target: pd.Series,
    correlation_threshold: float = 0.7,
    method: str = "representative",
) -> tuple[pd.DataFrame, dict[int, list[str]]]:
    """
    Cluster highly correlated features and reduce to one per cluster.

    Groups features whose absolute pairwise correlation exceeds the
    threshold. Within each cluster, selects either the feature most
    correlated with the target ('representative') or computes the
    first principal component ('pca').

    Args:
        features: Feature DataFrame.
        target: Target variable Series.
        correlation_threshold: Features with correlation above this
            value (in absolute terms) are grouped together.
        method: 'representative' to keep the best feature per cluster,
            'pca' to replace each cluster with its first PC.

    Returns:
        Tuple of (reduced DataFrame, dict mapping cluster IDs to
        feature name lists).
    """
    corr_matrix = features.corr().abs()
    distance_matrix = 1 - corr_matrix
    np.fill_diagonal(distance_matrix.values, 0)

    condensed = squareform(distance_matrix.values)
    linkage_matrix = linkage(condensed, method="complete")

    cluster_labels = fcluster(
        linkage_matrix,
        t=1 - correlation_threshold,
        criterion="distance",
    )

    cluster_map = {}
    for feat, label in zip(features.columns, cluster_labels):
        cluster_map.setdefault(label, []).append(feat)

    if method == "representative":
        selected = []
        for cluster_id, feat_names in cluster_map.items():
            if len(feat_names) == 1:
                selected.append(feat_names[0])
            else:
                correlations_with_target = {
                    f: abs(features[f].corr(target)) for f in feat_names
                }
                best = max(correlations_with_target, key=correlations_with_target.get)
                selected.append(best)
        reduced = features[selected]
    elif method == "pca":
        components = []
        component_names = []
        for cluster_id, feat_names in cluster_map.items():
            if len(feat_names) == 1:
                components.append(features[feat_names[0]].values)
                component_names.append(feat_names[0])
            else:
                scaler = StandardScaler()
                scaled = scaler.fit_transform(features[feat_names])
                pca = PCA(n_components=1)
                pc = pca.fit_transform(scaled).ravel()
                components.append(pc)
                component_names.append(f"cluster_{cluster_id}_pc1")
        reduced = pd.DataFrame(
            np.column_stack(components), columns=component_names
        )
    else:
        raise ValueError(f"Unknown method: {method}")

    print(
        f"Reduced {features.shape[1]} features to {reduced.shape[1]} "
        f"({len(cluster_map)} clusters)"
    )

    return reduced, cluster_map

When to Reduce Dimensions

Dimensionality reduction is not always beneficial. Consider the following guidelines:

| Situation | Recommendation |
|---|---|
| $n \gg p$ (many observations, few features) | Reduction unnecessary; may lose signal |
| $p > n$ (more features than observations) | Reduction essential to prevent overfitting |
| Many highly correlated features | Feature clustering or PCA very effective |
| Tree-based models (XGBoost, RF) | Often handle high dimensions well; reduction optional |
| Linear models, neural networks | More sensitive to dimensionality; reduction often helps |
| Interpretability is important | Feature clustering preserves interpretability better than PCA |
| Visualization needed | t-SNE for exploration; PCA for linear projections |

Intuition: Dimensionality reduction is a form of regularization. Just as L1 and L2 penalties constrain model weights to prevent overfitting, dimensionality reduction constrains the space of features the model can use. The tradeoff is the same: you sacrifice some ability to fit the training data in exchange for better generalization to new data.


28.5 Automated Feature Engineering

The Case for Automation

Manual feature engineering is powerful but labor-intensive. For a single sport and prediction target, a skilled domain expert might spend weeks designing, implementing, and testing features. Automated feature engineering tools systematically explore the space of possible feature transformations, generating candidates that a human might miss and evaluating them at a scale no human could match.

Automated approaches complement, but do not replace, domain knowledge. The best pipelines combine hand-crafted domain-specific features (Section 28.1) with automatically generated candidates, then apply systematic feature selection to identify the most valuable subset.

Featuretools for Deep Feature Synthesis

Featuretools is a Python library that performs Deep Feature Synthesis (DFS) --- automatically generating features by applying transformation and aggregation primitives to relational datasets. For sports data, this is particularly powerful because our data naturally has relational structure: games contain players, players belong to teams, teams belong to leagues, games occur at venues, etc.

# Note: This code requires the featuretools package
# pip install featuretools

import featuretools as ft


def build_sports_entityset(
    games: pd.DataFrame,
    player_stats: pd.DataFrame,
    teams: pd.DataFrame,
) -> ft.EntitySet:
    """
    Build a featuretools EntitySet from sports relational data.

    Creates an entity set with three entities (games, player_stats,
    teams) and the relationships between them. This relational
    structure allows featuretools to automatically generate aggregated
    and transformed features across entity boundaries.

    Args:
        games: Game-level DataFrame with columns ['game_id', 'date',
            'home_team_id', 'away_team_id', 'home_score', 'away_score'].
        player_stats: Player-game-level DataFrame with columns
            ['player_game_id', 'game_id', 'team_id', 'player_id',
            'minutes', 'points', 'rebounds', 'assists', ...].
        teams: Team DataFrame with columns ['team_id', 'team_name',
            'conference', 'division'].

    Returns:
        Configured featuretools EntitySet.
    """
    es = ft.EntitySet(id="sports_data")

    es = es.add_dataframe(
        dataframe_name="games",
        dataframe=games,
        index="game_id",
        time_index="date",
    )

    es = es.add_dataframe(
        dataframe_name="player_stats",
        dataframe=player_stats,
        index="player_game_id",
    )

    es = es.add_dataframe(
        dataframe_name="teams",
        dataframe=teams,
        index="team_id",
    )

    # Define relationships
    es = es.add_relationship("games", "game_id", "player_stats", "game_id")
    es = es.add_relationship("teams", "team_id", "player_stats", "team_id")

    return es


def run_deep_feature_synthesis(
    es: ft.EntitySet,
    target_entity: str = "games",
    max_depth: int = 2,
    primitives_agg: Optional[list[str]] = None,
    primitives_trans: Optional[list[str]] = None,
) -> pd.DataFrame:
    """
    Run Deep Feature Synthesis to automatically generate features.

    DFS traverses the relational structure of the EntitySet, applying
    aggregation primitives (e.g., mean, std, max) across relationships
    and transformation primitives (e.g., absolute, percentile) to
    individual columns. The max_depth parameter controls the complexity
    of generated features.

    Args:
        es: Configured featuretools EntitySet.
        target_entity: The entity to generate features for.
        max_depth: Maximum depth of feature stacking. Depth 1 applies
            single primitives; depth 2 applies primitives to the
            results of depth-1 features.
        primitives_agg: Aggregation primitives to use.
        primitives_trans: Transformation primitives to use.

    Returns:
        DataFrame of automatically generated features for each
        instance in the target entity.
    """
    if primitives_agg is None:
        primitives_agg = ["mean", "std", "max", "min", "skew", "count"]
    if primitives_trans is None:
        primitives_trans = ["absolute", "percentile"]

    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_dataframe_name=target_entity,
        max_depth=max_depth,
        agg_primitives=primitives_agg,
        trans_primitives=primitives_trans,
        verbose=True,
    )

    print(f"Generated {len(feature_defs)} features")
    print(f"Feature matrix shape: {feature_matrix.shape}")

    return feature_matrix

Tsfresh for Time Series Feature Extraction

Tsfresh (Time Series Feature extraction based on Scalable Hypothesis tests) is specialized for extracting features from time series data --- exactly the kind of data we have in sequential game-by-game statistics.

Tsfresh automatically computes hundreds of features from each time series, including statistical moments, autocorrelation coefficients, entropy measures, and spectral features. It then applies statistical hypothesis tests to select only the features that are significantly associated with the target variable.

# Note: This code requires the tsfresh package
# pip install tsfresh

from tsfresh import extract_features, select_features
from tsfresh.feature_extraction import EfficientFCParameters
from tsfresh.utilities.dataframe_functions import impute


def extract_tsfresh_features(
    game_data: pd.DataFrame,
    team_col: str,
    game_number_col: str,
    stat_columns: list[str],
    target: pd.Series,
    fdr_level: float = 0.05,
) -> pd.DataFrame:
    """
    Extract and select time series features using tsfresh.

    For each team, treats the sequence of game statistics as a time
    series and extracts hundreds of statistical features. Then applies
    the Benjamini-Hochberg procedure to select only features with a
    statistically significant relationship to the target.

    Args:
        game_data: DataFrame with game-level statistics, one row per
            team per game.
        team_col: Column identifying the team (used as the series ID).
        game_number_col: Column indicating the temporal order of games
            (used as the time index within each series).
        stat_columns: List of statistics to extract features from.
        target: Target variable for feature selection (one value per
            team, e.g., season win percentage).
        fdr_level: Target false discovery rate for tsfresh's
            hypothesis-test-based feature selection.

    Returns:
        DataFrame of selected features, one row per team.
    """
    # Prepare data in tsfresh format
    ts_data = game_data[[team_col, game_number_col] + stat_columns].copy()
    ts_data = ts_data.rename(columns={
        team_col: "id",
        game_number_col: "time",
    })

    # Extract features
    extracted = extract_features(
        ts_data,
        column_id="id",
        column_sort="time",
        default_fc_parameters=EfficientFCParameters(),
        n_jobs=4,
    )

    # Impute missing values
    impute(extracted)

    # Select significant features
    selected = select_features(extracted, target, fdr_level=fdr_level)

    print(
        f"Extracted {extracted.shape[1]} features, "
        f"selected {selected.shape[1]} significant features"
    )

    return selected
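
A usage sketch, with illustrative names (game_log, team_id, game_number, points, turnovers, pace) and a per-team target season_win_pct. For select_features to align observations correctly, the target Series must be indexed by the same team identifiers used as the series id.

# Usage sketch: game_log has one row per team per game (column names
# illustrative); season_win_pct is a Series indexed by team_id.
selected_ts_features = extract_tsfresh_features(
    game_data=game_log,
    team_col="team_id",
    game_number_col="game_number",
    stat_columns=["points", "turnovers", "pace"],
    target=season_win_pct,
    fdr_level=0.05,
)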

Target Encoding with Regularization

Target encoding (also called mean encoding) replaces categorical variables with the mean of the target variable for that category. This is covered briefly in Section 28.3, but in the context of automated pipelines, we need a production-ready version with proper regularization and cross-validation to prevent overfitting.

The regularized target encoding uses Bayesian smoothing:

$$\text{Encoding}(c) = \frac{n_c \cdot \bar{y}_c + m \cdot \bar{y}_{\text{global}}}{n_c + m}$$

where $n_c$ is the number of observations in category $c$, $\bar{y}_c$ is the category mean, $\bar{y}_{\text{global}}$ is the global mean, and $m$ is the smoothing parameter. When $n_c$ is small, the encoding shrinks toward the global mean; when $n_c$ is large, the encoding approaches the category-specific mean.
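
A minimal sketch of this encoding with out-of-fold fitting is shown below; the function name smoothed_target_encode and its arguments are illustrative, not a library API. Each fold's encodings are computed only from the other folds, so no row's own target value leaks into its encoded feature.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def smoothed_target_encode(
    train: pd.DataFrame,
    category_col: str,
    target_col: str,
    m: float = 10.0,
    n_splits: int = 5,
) -> pd.Series:
    """Out-of-fold Bayesian-smoothed target encoding for one categorical column."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

    for fit_idx, enc_idx in kf.split(train):
        fit_fold = train.iloc[fit_idx]
        fold_mean = fit_fold[target_col].mean()

        # Per-category counts and means from the fitting fold only
        stats = fit_fold.groupby(category_col)[target_col].agg(["count", "mean"])
        smoothed = (stats["count"] * stats["mean"] + m * fold_mean) / (stats["count"] + m)

        # Encode the held-out fold; categories unseen in the fitting fold
        # fall back to the fold mean
        enc_values = train.iloc[enc_idx][category_col].map(smoothed)
        encoded.iloc[enc_idx] = enc_values.fillna(fold_mean).to_numpy()

    return encoded

At prediction time, the test-set encoding is computed from the full training data with the same smoothing, so test labels are never touched.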

Feature Selection Pipelines

After generating hundreds or thousands of candidate features --- both hand-crafted and automated --- a systematic feature selection pipeline identifies the subset that maximizes predictive performance. A robust pipeline combines multiple selection methods:

  1. Variance threshold: Remove features with near-zero variance (they contain no information).
  2. Correlation filter: Remove features highly correlated with existing selections (they add redundancy, not information).
  3. Mutual information: Rank features by their mutual information with the target variable.
  4. Recursive feature elimination (RFE): Iteratively remove the least important features according to a model's feature importance scores.
  5. L1 regularization: Fit a model with an L1 (Lasso) penalty; features with zero coefficients are eliminated. (Minimal sketches of the RFE and L1 stages follow the pipeline class below.)

from sklearn.feature_selection import (
    VarianceThreshold,
    mutual_info_classif,
)
from sklearn.ensemble import GradientBoostingClassifier


class FeatureSelectionPipeline:
    """
    Multi-stage feature selection pipeline for sports prediction.

    Combines variance filtering, correlation filtering, mutual information
    ranking, and model-based importance into a single pipeline that can
    be fitted on training data and applied to test data consistently.

    Attributes:
        variance_threshold: Minimum variance for a feature to be retained.
        correlation_threshold: Maximum pairwise correlation allowed.
        n_top_mi: Number of features to retain after MI ranking.
        n_final: Final number of features to select via model importance.
    """

    def __init__(
        self,
        variance_threshold: float = 0.01,
        correlation_threshold: float = 0.95,
        n_top_mi: int = 100,
        n_final: int = 50,
    ):
        self.variance_threshold = variance_threshold
        self.correlation_threshold = correlation_threshold
        self.n_top_mi = n_top_mi
        self.n_final = n_final
        self.selected_features_: list[str] = []

    def fit(self, X: pd.DataFrame, y: pd.Series) -> "FeatureSelectionPipeline":
        """
        Fit the feature selection pipeline on training data.

        Applies each stage sequentially, recording the surviving
        features at each step.
        """
        current_features = X.columns.tolist()
        print(f"Starting with {len(current_features)} features")

        # Stage 1: Variance threshold
        selector = VarianceThreshold(threshold=self.variance_threshold)
        selector.fit(X[current_features])
        current_features = [
            f for f, keep in zip(current_features, selector.get_support())
            if keep
        ]
        print(f"After variance filter: {len(current_features)}")

        # Stage 2: Correlation filter
        corr_matrix = X[current_features].corr().abs()
        to_drop = set()
        for i in range(len(current_features)):
            for j in range(i + 1, len(current_features)):
                if corr_matrix.iloc[i, j] > self.correlation_threshold:
                    # Drop the feature with lower target correlation
                    corr_i = abs(X[current_features[i]].corr(y))
                    corr_j = abs(X[current_features[j]].corr(y))
                    drop = current_features[j] if corr_i >= corr_j else current_features[i]
                    to_drop.add(drop)
        current_features = [f for f in current_features if f not in to_drop]
        print(f"After correlation filter: {len(current_features)}")

        # Stage 3: Mutual information ranking
        n_mi = min(self.n_top_mi, len(current_features))
        mi_scores = mutual_info_classif(X[current_features], y, random_state=42)
        mi_ranking = sorted(
            zip(current_features, mi_scores),
            key=lambda x: x[1],
            reverse=True,
        )
        current_features = [f for f, _ in mi_ranking[:n_mi]]
        print(f"After MI ranking: {len(current_features)}")

        # Stage 4: Model-based importance
        n_final = min(self.n_final, len(current_features))
        model = GradientBoostingClassifier(
            n_estimators=100, max_depth=4, random_state=42,
        )
        model.fit(X[current_features], y)
        importances = sorted(
            zip(current_features, model.feature_importances_),
            key=lambda x: x[1],
            reverse=True,
        )
        self.selected_features_ = [f for f, _ in importances[:n_final]]
        print(f"Final selection: {len(self.selected_features_)}")

        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Return only the selected features."""
        return X[self.selected_features_]
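
The class above covers the variance, correlation, mutual information, and model-importance stages; the RFE and L1 stages from the numbered list can be added as optional final passes. The sketch below uses scikit-learn's RFE and SelectFromModel on an already-reduced matrix X_sel (an illustrative name), and assumes features are on roughly comparable scales before the L1-penalized logistic regression.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression


def rfe_and_l1_selection(
    X_sel: pd.DataFrame,
    y: pd.Series,
    n_rfe: int = 30,
) -> tuple[list[str], list[str]]:
    """Optional RFE and L1 passes on an already-reduced feature matrix."""
    # Recursive feature elimination driven by gradient boosting importances
    rfe = RFE(
        estimator=GradientBoostingClassifier(n_estimators=100, random_state=42),
        n_features_to_select=n_rfe,
    )
    rfe.fit(X_sel, y)
    rfe_features = X_sel.columns[rfe.support_].tolist()

    # L1-penalized logistic regression; zero-coefficient features are dropped
    l1_selector = SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=1000)
    )
    l1_selector.fit(X_sel, y)
    l1_features = X_sel.columns[l1_selector.get_support()].tolist()

    return rfe_features, l1_features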

Common Pitfall: Automated feature engineering can generate thousands of candidate features, but most will be noise. Without rigorous feature selection --- including proper cross-validation to avoid selecting features that overfit to the training set --- automated features can degrade rather than improve model performance. Always evaluate automated features on a held-out set that was not used during the generation or selection process.
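
A usage sketch that follows this advice, with illustrative names X and y for the full candidate matrix and target, assuming rows are sorted chronologically. For automated features, the generation step (DFS or tsfresh selection) should likewise be run on the training split only.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Chronological split: the later 25% of games form the held-out set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False,
)

# Fit selection on the training split only, then apply to both splits
selector = FeatureSelectionPipeline(n_top_mi=100, n_final=50)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Evaluate on held-out games that played no role in feature selection
model = GradientBoostingClassifier(n_estimators=200, random_state=42)
model.fit(X_train_sel, y_train)
probs = model.predict_proba(X_test_sel)
print(f"Held-out log loss: {log_loss(y_test, probs):.4f}")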


28.6 Chapter Summary

Key Concepts

  1. Feature engineering is the process of transforming raw data into numerical representations suitable for machine learning models. In sports prediction, it is widely considered the single most impactful step in the modeling pipeline.

  2. Pace adjustment normalizes counting statistics to a common rate (e.g., per 100 possessions), isolating efficiency from tempo. This is essential in any sport where game tempo varies between teams.

  3. Opponent adjustment corrects for the strength of competition, isolating true team ability from schedule difficulty. Iterative methods that alternate between offensive and defensive adjustments converge to stable ratings.

  4. Rolling averages, exponentially weighted means, and lag features capture the temporal dynamics of team performance. Multiple timescales (short and long windows) provide complementary information about recent form and stable ability.

  5. Interaction features --- differences, ratios, and products of corresponding team statistics --- capture matchup dynamics that individual team features cannot represent. Style clash metrics quantify how playing styles interact.

  6. Dimensionality reduction via PCA, t-SNE, or feature clustering manages the feature proliferation that results from aggressive feature engineering. PCA is useful for compression; t-SNE for visualization; clustering for interpretable reduction.

  7. Automated feature engineering with tools like featuretools and tsfresh systematically explores the feature space, generating candidates that complement hand-crafted features. Rigorous feature selection is essential to avoid retaining noise.

  8. Target encoding with Bayesian smoothing provides a compact encoding of high-cardinality categorical variables, but must be regularized and cross-validated to prevent overfitting.

Key Formulas

| Formula | Description |
|---|---|
| $\text{ORtg} = \frac{\text{Pts}}{\text{Poss}} \times 100$ | Offensive rating (per 100 possessions) |
| $\text{Poss} \approx \text{FGA} - \text{OREB} + \text{TOV} + 0.44 \times \text{FTA}$ | Possession estimate (Dean Oliver) |
| $\text{EWM}_t = \alpha \cdot x_{t-1} + (1-\alpha) \cdot \text{EWM}_{t-1}$ | Exponentially weighted mean |
| $\alpha = 1 - \exp(-\ln 2 / h)$ | EWM decay from half-life $h$ |
| $\Sigma = V \Lambda V^T$ | PCA eigendecomposition of covariance matrix |
| $\text{Enc}(c) = \frac{n_c \bar{y}_c + m \bar{y}_g}{n_c + m}$ | Bayesian target encoding |

Key Code Patterns

  1. Pace adjustment functions (estimate_possessions, pace_adjust_stats): Normalize raw counting stats to per-possession rates, removing tempo effects.

  2. Temporal feature builders (compute_rolling_features, compute_ewm_features, compute_lag_features): Systematic construction of time-aware features with .shift(1) to prevent data leakage.

  3. Matchup feature constructors (create_matchup_features, compute_style_clash_metrics): Generate interaction features that encode how two teams' characteristics combine.

  4. Dimensionality reduction (apply_pca_to_sports_features, cluster_correlated_features): Compress high-dimensional feature spaces while preserving predictive information, with proper train/test separation.

  5. Automated pipelines (FeatureSelectionPipeline): Multi-stage feature selection combining variance filtering, correlation filtering, mutual information, and model-based importance.

Decision Framework: Feature Engineering Workflow

START: You have raw sports data and a prediction target.

1. Create domain-specific features.
   - Pace-adjust all counting statistics.
   - Opponent-adjust all efficiency metrics.
   - Compute quality of competition metrics.

2. Build temporal features.
   - Rolling averages (3, 10, 20 game windows).
   - Exponentially weighted means (half-lives of 5 and 15 games).
   - Key lag features (last game result, rest days).
   - Season-to-date averages.

3. Construct matchup features.
   - Differences and ratios of corresponding team stats.
   - Style clash metrics for key dimensions.
   - Player combination features (if lineup data available).
   - Encode categorical matchups (target encoding or embeddings).

4. Apply automated feature generation.
   - Featuretools for relational features.
   - Tsfresh for time series features.
   - Merge with hand-crafted features.

5. Select features.
   - Remove near-zero variance features.
   - Filter high-correlation redundancies.
   - Rank by mutual information and model importance.
   - Validate on held-out data.

6. Optionally reduce dimensions.
   - PCA if $p > n$ or for linear models.
   - Feature clustering for interpretable reduction.
   - t-SNE for visual exploration only.

END: Feature matrix ready for modeling.

What's Next

In Chapter 29: Neural Networks for Sports Prediction, we will take the features engineered in this chapter and feed them into deep learning models. We will begin with feedforward networks for tabular sports data, then advance to LSTM and sequence models that can learn directly from temporal game sequences --- potentially discovering temporal patterns that our hand-crafted rolling features miss. We will also explore entity embeddings, which learn dense vector representations of teams, players, and venues that serve as a learned form of the categorical encodings we built manually in this chapter. The neural network approach represents a shift from designing features by hand to learning representations from data, and understanding both paradigms --- as well as when to combine them --- is essential for the modern sports prediction practitioner.


This chapter is part of The Sports Betting Textbook, a comprehensive guide to quantitative sports betting. All code examples use Python 3.11+ and are available in the companion repository.