

Learning Objectives

  • Design and implement feedforward neural networks for tabular sports data using PyTorch
  • Build LSTM and sequence models that capture temporal dependencies in team and player performance
  • Create entity embeddings for teams, players, and venues that learn dense vector representations from data
  • Implement complete PyTorch training pipelines including data loaders, GPU training, and model persistence
  • Apply hyperparameter tuning with Optuna and architecture search techniques to optimize neural network performance

Chapter 29: Neural Networks for Sports Prediction

"All models are wrong, but some are useful." --- George E. P. Box, Robustness in the Strategy of Scientific Model Building (1979)

Chapter Overview

Neural networks have transformed fields from computer vision to natural language processing, achieving superhuman performance on tasks that were considered intractable a decade ago. Their application to sports prediction is more nuanced. Sports data is typically tabular --- rows of games with columns of features --- a domain where tree-based models like XGBoost have traditionally dominated. The datasets are small by deep learning standards: a decade of NBA games yields roughly 12,000 observations, compared to the millions or billions of examples used to train image classifiers or language models. And the signal-to-noise ratio is low: sports outcomes are inherently uncertain, and even a perfect model cannot predict them with high accuracy.

Yet neural networks offer capabilities that tree-based models lack. They can learn entity embeddings --- dense vector representations of teams, players, and venues --- that capture similarities and relationships in ways that one-hot encoding cannot. They can model sequential dependencies through recurrent architectures like LSTMs, learning temporal patterns directly from game sequences rather than relying on hand-crafted rolling features. And they can incorporate multiple data modalities --- combining tabular features, sequences, and categorical embeddings in a single end-to-end architecture.

This chapter provides a practical, code-heavy introduction to neural networks for sports prediction. We assume familiarity with the feature engineering concepts from Chapter 28 and the general machine learning framework from Chapter 27. We use PyTorch as our deep learning framework throughout, chosen for its flexibility, Pythonic API, and widespread adoption in both research and production.

In this chapter, you will learn to:

  • Build feedforward networks for tabular sports data and understand when they outperform tree-based models
  • Design LSTM architectures that model team performance trajectories and player career arcs
  • Create entity embeddings that learn meaningful representations of teams, players, and venues
  • Implement complete PyTorch training pipelines with proper data loading, GPU acceleration, and model persistence
  • Tune hyperparameters and search over architectures using Optuna


29.1 Feedforward Networks for Tabular Sports Data

Architecture Fundamentals

A feedforward neural network (also called a multilayer perceptron, or MLP) consists of layers of neurons connected in sequence. Each neuron computes a weighted sum of its inputs, adds a bias term, and passes the result through a nonlinear activation function:

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$$ $$a^{(l)} = \sigma(z^{(l)})$$

where $W^{(l)}$ is the weight matrix for layer $l$, $b^{(l)}$ is the bias vector, $a^{(l-1)}$ is the output (activation) of the previous layer, and $\sigma$ is the activation function. The input layer $a^{(0)}$ is simply the feature vector $x$.

For a network with $L$ weight layers (the hidden layers plus the output layer), the full computation is:

$$\hat{y} = f(x) = \sigma_L(W^{(L)} \sigma_{L-1}(\cdots \sigma_1(W^{(1)} x + b^{(1)}) \cdots) + b^{(L)})$$
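These layer equations map one-to-one onto PyTorch's `nn.Linear`. A minimal sketch (the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(3)         # a^(0): a 3-feature input vector
layer = nn.Linear(3, 4)    # stores W^(1) (shape 4x3) and b^(1) (shape 4)

z = layer(x)               # z^(1) = W^(1) a^(0) + b^(1)
a = torch.relu(z)          # a^(1) = sigma(z^(1)), here with ReLU

# The module output matches the explicit matrix form of the equation.
z_manual = layer.weight @ x + layer.bias
assert torch.allclose(z, z_manual)
```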

Activation Functions

The choice of activation function affects the network's ability to learn complex patterns and its training dynamics:

ReLU (Rectified Linear Unit): The most commonly used activation for hidden layers. Simple, computationally efficient, and avoids the vanishing gradient problem that plagued earlier architectures:

$$\text{ReLU}(z) = \max(0, z)$$

Leaky ReLU: Addresses the "dying ReLU" problem (neurons that output zero for all inputs and stop learning) by allowing a small gradient for negative inputs:

$$\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$

where $\alpha$ is typically 0.01.

GELU (Gaussian Error Linear Unit): A smooth approximation of ReLU that has become standard in transformer architectures. It can provide marginal improvements over ReLU in some settings:

$$\text{GELU}(z) = z \cdot \Phi(z)$$

where $\Phi$ is the CDF of the standard normal distribution.

Sigmoid: Maps inputs to the range $(0, 1)$. Used primarily for binary classification output layers:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
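The four activations can be compared directly on a few sample inputs; a quick sketch using `torch.nn.functional`:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu_out = F.relu(z)                              # negatives clipped to 0
leaky_out = F.leaky_relu(z, negative_slope=0.01)  # small slope below 0
gelu_out = F.gelu(z)                              # smooth transition near 0
sigmoid_out = torch.sigmoid(z)                    # squashed into (0, 1)

print(relu_out, leaky_out, gelu_out, sigmoid_out, sep="\n")
```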

Backpropagation Intuition

Training a neural network means finding weight values that minimize a loss function --- a measure of prediction error. For binary classification (win/loss prediction), the standard loss is binary cross-entropy:

$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

Backpropagation computes the gradient of this loss with respect to every weight in the network, using the chain rule of calculus. For a weight $w_{jk}^{(l)}$ in layer $l$:

$$\frac{\partial \mathcal{L}}{\partial w_{jk}^{(l)}} = \frac{\partial \mathcal{L}}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{jk}^{(l)}} = \delta_j^{(l)} \cdot a_k^{(l-1)}$$

where $\delta_j^{(l)} = \frac{\partial \mathcal{L}}{\partial z_j^{(l)}}$ is the "error signal" that propagates backward through the network. The key insight is that this error signal can be computed recursively from the output layer back to the input:

$$\delta^{(l)} = (W^{(l+1)})^T \delta^{(l+1)} \odot \sigma'(z^{(l)})$$

where $\odot$ denotes element-wise multiplication and $\sigma'$ is the derivative of the activation function. This recursive structure makes backpropagation computationally efficient: computing all gradients costs roughly twice the computation of a single forward pass.

Gradient descent then updates each weight in the direction that reduces the loss:

$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial W^{(l)}}$$

where $\eta$ is the learning rate. In practice, we use stochastic gradient descent (SGD) variants like Adam that adapt the learning rate for each parameter and incorporate momentum.
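Autograd makes this loop concrete. A toy sketch of one gradient-descent step on a single logistic neuron (the input values are chosen by hand for illustration):

```python
import torch

# One logistic "neuron": y_hat = sigmoid(w . x + b), trained on one example.
w = torch.zeros(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
x = torch.tensor([1.0, -0.5, 2.0])
y = torch.tensor([1.0])

y_hat = torch.sigmoid(x @ w + b)         # forward pass
loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()
loss.backward()                          # backpropagation fills w.grad, b.grad

eta = 0.1                                # learning rate
with torch.no_grad():                    # gradient-descent update
    w -= eta * w.grad
    b -= eta * b.grad

# With zero weights, y_hat = 0.5 and the loss starts at log(2) ~ 0.693.
```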

When Neural Networks Beat Tree Models

For tabular sports data, tree-based models (XGBoost, LightGBM, Random Forest) are strong baselines that are often hard to beat. Neural networks tend to outperform tree models when:

  1. Entity embeddings are valuable: If your data contains high-cardinality categorical variables (hundreds of players, dozens of teams), embeddings can capture latent structure that tree models miss.
  2. Sequential patterns matter: When the order of games matters and you want to model trajectories rather than static snapshots.
  3. Multi-task learning is beneficial: Neural networks can jointly predict multiple related targets (win probability, total points, margin) with shared representations.
  4. Very large datasets: With 100,000+ observations, neural networks can learn complex nonlinear relationships that simpler models cannot represent.

For small tabular datasets with well-engineered features, XGBoost often wins. The practical recommendation is to use tree models as a strong baseline and invest in neural networks when the specific advantages above are relevant.

PyTorch Implementation

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
from typing import Optional


class SportsTabularDataset(Dataset):
    """
    PyTorch Dataset for tabular sports prediction data.

    Converts pandas DataFrames to PyTorch tensors for efficient
    batched training. Handles both features and targets, with
    optional support for sample weights.

    Attributes:
        features: Tensor of shape (n_samples, n_features).
        targets: Tensor of shape (n_samples,).
        weights: Optional tensor of sample weights.
    """

    def __init__(
        self,
        features: pd.DataFrame,
        targets: pd.Series,
        weights: Optional[pd.Series] = None,
    ):
        self.features = torch.FloatTensor(features.values)
        self.targets = torch.FloatTensor(targets.values)
        self.weights = (
            torch.FloatTensor(weights.values)
            if weights is not None
            else None
        )

    def __len__(self) -> int:
        return len(self.targets)

    def __getitem__(self, idx: int):
        if self.weights is not None:
            return self.features[idx], self.targets[idx], self.weights[idx]
        return self.features[idx], self.targets[idx]


class SportsFeedforwardNet(nn.Module):
    """
    Feedforward neural network for tabular sports prediction.

    Architecture: Input -> [Linear -> BatchNorm -> ReLU -> Dropout] x N
    -> Linear -> Sigmoid (for binary classification).

    Batch normalization stabilizes training by normalizing layer inputs.
    Dropout regularizes by randomly zeroing neurons during training,
    preventing co-adaptation and reducing overfitting.

    Args:
        input_dim: Number of input features.
        hidden_dims: List of hidden layer sizes. Each entry creates
            one hidden layer with that many neurons.
        dropout_rate: Probability of dropping each neuron during training.
        output_dim: Number of output neurons (1 for binary classification).
    """

    def __init__(
        self,
        input_dim: int,
        hidden_dims: list[int] = [256, 128, 64],
        dropout_rate: float = 0.3,
        output_dim: int = 1,
    ):
        super().__init__()

        layers = []
        prev_dim = input_dim

        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout_rate),
            ])
            prev_dim = hidden_dim

        layers.append(nn.Linear(prev_dim, output_dim))
        layers.append(nn.Sigmoid())

        self.network = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass through the network."""
        return self.network(x).squeeze(-1)


def train_feedforward_model(
    model: nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    learning_rate: float = 1e-3,
    weight_decay: float = 1e-4,
    n_epochs: int = 100,
    patience: int = 15,
    device: str = "auto",
) -> dict:
    """
    Train a feedforward model with early stopping and validation monitoring.

    Uses the Adam optimizer with weight decay (L2 regularization) and
    a ReduceLROnPlateau scheduler that decreases the learning rate when
    validation loss plateaus. Training stops early if validation loss
    does not improve for 'patience' consecutive epochs.

    Args:
        model: The neural network to train.
        train_loader: DataLoader for training data.
        val_loader: DataLoader for validation data.
        learning_rate: Initial learning rate for Adam optimizer.
        weight_decay: L2 regularization strength.
        n_epochs: Maximum number of training epochs.
        patience: Number of epochs without improvement before stopping.
        device: 'cuda', 'cpu', or 'auto' (auto-detect GPU).

    Returns:
        Dictionary containing training history with keys:
            'train_losses', 'val_losses', 'best_epoch', 'best_val_loss'.
    """
    if device == "auto":
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    criterion = nn.BCELoss()
    optimizer = optim.Adam(
        model.parameters(), lr=learning_rate, weight_decay=weight_decay,
    )
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=5,
    )

    best_val_loss = float("inf")
    best_model_state = None
    epochs_without_improvement = 0
    history = {"train_losses": [], "val_losses": []}

    for epoch in range(n_epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        train_batches = 0

        for batch in train_loader:
            features, targets = batch[0].to(device), batch[1].to(device)

            optimizer.zero_grad()
            predictions = model(features)
            loss = criterion(predictions, targets)
            loss.backward()

            # Gradient clipping prevents exploding gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            optimizer.step()

            train_loss += loss.item()
            train_batches += 1

        avg_train_loss = train_loss / train_batches

        # Validation phase
        model.eval()
        val_loss = 0.0
        val_batches = 0

        with torch.no_grad():
            for batch in val_loader:
                features, targets = batch[0].to(device), batch[1].to(device)
                predictions = model(features)
                loss = criterion(predictions, targets)
                val_loss += loss.item()
                val_batches += 1

        avg_val_loss = val_loss / val_batches

        history["train_losses"].append(avg_train_loss)
        history["val_losses"].append(avg_val_loss)

        scheduler.step(avg_val_loss)

        # Early stopping check
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            # Clone each tensor: state_dict() returns references to the live
            # parameters, so a shallow .copy() would track later weight updates.
            best_model_state = {
                k: v.detach().clone() for k, v in model.state_dict().items()
            }
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1

        if (epoch + 1) % 10 == 0:
            print(
                f"Epoch {epoch + 1}/{n_epochs} | "
                f"Train Loss: {avg_train_loss:.4f} | "
                f"Val Loss: {avg_val_loss:.4f} | "
                f"LR: {optimizer.param_groups[0]['lr']:.6f}"
            )

        if epochs_without_improvement >= patience:
            print(f"Early stopping at epoch {epoch + 1}")
            break

    # Restore best model
    if best_model_state is not None:
        model.load_state_dict(best_model_state)

    history["best_epoch"] = np.argmin(history["val_losses"]) + 1
    history["best_val_loss"] = best_val_loss

    return history

Worked Example: NBA Game Prediction with a Feedforward Network

# Complete example: predicting NBA game outcomes
# Assumes features from Chapter 28 are already computed

def run_nba_feedforward_example():
    """
    End-to-end example of training a feedforward network for NBA
    game outcome prediction.

    This example demonstrates:
    1. Loading and splitting time-series data properly
    2. Creating PyTorch datasets and data loaders
    3. Building and training a feedforward model
    4. Evaluating predictions with Brier score

    Note: This uses synthetic data for illustration. Replace with
    real NBA data for actual predictions.
    """
    # Generate synthetic data matching real NBA feature structure
    np.random.seed(42)
    n_games = 5000

    features = pd.DataFrame({
        "home_adj_ortg": np.random.normal(110, 5, n_games),
        "home_adj_drtg": np.random.normal(110, 5, n_games),
        "away_adj_ortg": np.random.normal(110, 5, n_games),
        "away_adj_drtg": np.random.normal(110, 5, n_games),
        "home_pace": np.random.normal(100, 3, n_games),
        "away_pace": np.random.normal(100, 3, n_games),
        "home_ortg_roll_10": np.random.normal(110, 4, n_games),
        "away_ortg_roll_10": np.random.normal(110, 4, n_games),
        "rest_diff": np.random.choice([-2, -1, 0, 1, 2], n_games),
        "home_win_pct_std": np.random.uniform(0.2, 0.8, n_games),
        "away_win_pct_std": np.random.uniform(0.2, 0.8, n_games),
    })

    # Simulate target with realistic structure
    margin = (
        0.3 * (features["home_adj_ortg"] - features["away_adj_drtg"])
        - 0.3 * (features["away_adj_ortg"] - features["home_adj_drtg"])
        + 3.0  # home court advantage
        + np.random.normal(0, 10, n_games)  # noise
    )
    targets = (margin > 0).astype(float)
    targets = pd.Series(targets, name="home_win")

    # Time-based split (first 80% train, next 10% val, last 10% test)
    n_train = int(0.8 * n_games)
    n_val = int(0.1 * n_games)

    X_train = features.iloc[:n_train]
    y_train = targets.iloc[:n_train]
    X_val = features.iloc[n_train:n_train + n_val]
    y_val = targets.iloc[n_train:n_train + n_val]
    X_test = features.iloc[n_train + n_val:]
    y_test = targets.iloc[n_train + n_val:]

    # Standardize features
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_train_scaled = pd.DataFrame(
        scaler.fit_transform(X_train), columns=X_train.columns,
    )
    X_val_scaled = pd.DataFrame(
        scaler.transform(X_val), columns=X_val.columns,
    )
    X_test_scaled = pd.DataFrame(
        scaler.transform(X_test), columns=X_test.columns,
    )

    # Create datasets and loaders
    train_dataset = SportsTabularDataset(X_train_scaled, y_train)
    val_dataset = SportsTabularDataset(X_val_scaled, y_val)
    test_dataset = SportsTabularDataset(X_test_scaled, y_test)

    train_loader = DataLoader(
        train_dataset, batch_size=64, shuffle=True,
    )
    val_loader = DataLoader(val_dataset, batch_size=256, shuffle=False)
    test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False)

    # Build and train model
    model = SportsFeedforwardNet(
        input_dim=X_train.shape[1],
        hidden_dims=[128, 64, 32],
        dropout_rate=0.3,
    )

    print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
    print(model)

    history = train_feedforward_model(
        model, train_loader, val_loader,
        learning_rate=1e-3, n_epochs=100, patience=15,
    )

    # Evaluate on test set
    model.eval()
    device = next(model.parameters()).device
    all_preds = []
    all_targets = []

    with torch.no_grad():
        for features_batch, targets_batch in test_loader:
            preds = model(features_batch.to(device))
            all_preds.extend(preds.cpu().numpy())
            all_targets.extend(targets_batch.numpy())

    predictions = np.array(all_preds)
    actuals = np.array(all_targets)

    # Brier score
    brier_score = np.mean((predictions - actuals) ** 2)
    print(f"\nTest Brier Score: {brier_score:.4f}")
    print(f"Best validation loss at epoch: {history['best_epoch']}")

    return model, history

Real-World Application: In production NBA prediction systems, feedforward networks typically achieve Brier scores between 0.20 and 0.22 on game outcome prediction, compared to 0.21--0.23 for logistic regression and 0.19--0.21 for well-tuned XGBoost models. The neural network advantage emerges primarily when entity embeddings are incorporated (Section 29.3), allowing the model to learn team-specific representations that capture information beyond the engineered features.


29.2 LSTM and Sequence Models for Time-Dependent Data

The Limitations of Static Features

The rolling averages and exponentially weighted means from Chapter 28 capture some temporal dynamics, but they are fundamentally hand-crafted summaries that impose a fixed functional form on how the past influences the present. A 10-game rolling average assumes all 10 games contribute equally and that only the last 10 games matter. An exponentially weighted mean assumes a specific decay pattern.

Recurrent neural networks (RNNs) learn their own temporal aggregation functions from data. Given a sequence of game-by-game observations, an RNN can learn to weight recent games more heavily, to detect momentum patterns, to recognize regime changes (e.g., before and after a major trade), and to incorporate any temporal pattern that is predictive of the target --- without being told what patterns to look for.
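The fixed functional forms of the hand-crafted summaries are easy to see in code. A brief sketch with pandas (the point totals are made up for illustration):

```python
import pandas as pd

# Points scored by one team over twelve games (illustrative values).
pts = pd.Series([102, 95, 110, 99, 105, 98, 112, 101, 96, 108, 115, 90])

# Rolling mean: every game in the window gets weight 1/10; game 11 ago gets 0.
roll_10 = pts.rolling(window=10).mean()

# Exponentially weighted mean: weights decay at one fixed, pre-chosen rate.
ewm_5 = pts.ewm(halflife=5).mean()

# An RNN, by contrast, learns its own weighting of past games from data.
```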

Recurrent Neural Networks and the Vanishing Gradient Problem

A basic RNN processes a sequence $x_1, x_2, \ldots, x_T$ by maintaining a hidden state $h_t$ that evolves over time:

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$

The hidden state at time $T$ encodes information from the entire sequence and can be used for prediction. However, basic RNNs suffer from the vanishing gradient problem: during backpropagation through time, gradients must be multiplied across $T$ time steps, and they tend to shrink exponentially, making it nearly impossible to learn long-range dependencies.
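The shrinkage can be observed directly: backpropagate from the final hidden state of a plain `tanh` RNN and inspect the gradient reaching each input step. A sketch (the layer sizes and sequence length are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=16, batch_first=True)  # basic tanh RNN

T = 60                                     # sequence length
x = torch.randn(1, T, 4, requires_grad=True)
out, _ = rnn(x)

# Gradient of the final-step output with respect to every input step.
out[:, -1].sum().backward()
per_step = x.grad.norm(dim=2).squeeze(0)   # one gradient norm per time step

# With typical initializations, early steps receive far smaller gradients
# than recent ones -- the signal vanishes as it propagates back in time.
print(per_step[0].item(), per_step[T // 2].item(), per_step[-1].item())
```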

LSTM Cells

Long Short-Term Memory (LSTM) cells solve the vanishing gradient problem by introducing a cell state $c_t$ and a system of gates that control information flow. The gates --- forget gate, input gate, and output gate --- learn when to retain, when to update, and when to output information:

Forget gate: Decides what information to discard from the cell state: $$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

Input gate: Decides what new information to store: $$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$ $$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$

Cell state update: Combines forgetting and new input: $$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

Output gate: Decides what to output from the cell state: $$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$ $$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication. The cell state $c_t$ acts as a "conveyor belt" that can carry information across many time steps with minimal gradient degradation, because the cell state update involves addition rather than multiplication.
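The gate equations above are exactly what `nn.LSTMCell` computes; the only wrinkle is that PyTorch stacks the four gates' weights into single matrices, in the order (input, forget, candidate, output). A sketch that reproduces one cell update by hand:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=3, hidden_size=2)

x = torch.randn(1, 3)               # one time step of game features
h0 = torch.zeros(1, 2)              # previous hidden state h_{t-1}
c0 = torch.zeros(1, 2)              # previous cell state c_{t-1}
h1, c1 = cell(x, (h0, c0))

# Recompute the update from the gate equations. PyTorch stores the four
# gates' weights stacked in the order (i, f, g, o).
gates = x @ cell.weight_ih.T + cell.bias_ih + h0 @ cell.weight_hh.T + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)
c_manual = torch.sigmoid(f) * c0 + torch.sigmoid(i) * torch.tanh(g)  # cell update
h_manual = torch.sigmoid(o) * torch.tanh(c_manual)                   # output gate

assert torch.allclose(c1, c_manual, atol=1e-6)
assert torch.allclose(h1, h_manual, atol=1e-6)
```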

Modeling Team Performance Trajectories

For sports prediction, we model each team's season as a sequence of game-level feature vectors. The LSTM processes this sequence game by game, building an internal representation of the team's current state that integrates all past observations.

class SportsSequenceDataset(Dataset):
    """
    Dataset for sequential sports data (LSTM input).

    For each prediction target (a game), constructs a sequence of the
    team's previous N games as input. This creates a 3D tensor of shape
    (n_samples, sequence_length, n_features) suitable for LSTM processing.

    Args:
        game_data: DataFrame with one row per team per game, sorted by
            date within each team.
        team_col: Column identifying the team.
        feature_cols: List of feature column names.
        target_col: Name of the target column.
        sequence_length: Number of previous games to include in each
            sequence. Games with fewer than sequence_length prior games
            are padded with zeros.
    """

    def __init__(
        self,
        game_data: pd.DataFrame,
        team_col: str,
        feature_cols: list[str],
        target_col: str,
        sequence_length: int = 10,
    ):
        self.sequence_length = sequence_length
        self.sequences = []
        self.targets = []

        for team, team_data in game_data.groupby(team_col):
            team_data = team_data.sort_index()
            features_array = team_data[feature_cols].values
            targets_array = team_data[target_col].values

            for i in range(len(team_data)):
                # Get sequence of previous games
                start_idx = max(0, i - sequence_length)
                seq = features_array[start_idx:i]

                # Pad if insufficient history
                if len(seq) < sequence_length:
                    padding = np.zeros(
                        (sequence_length - len(seq), len(feature_cols))
                    )
                    seq = np.vstack([padding, seq]) if len(seq) > 0 else padding

                self.sequences.append(seq)
                self.targets.append(targets_array[i])

        self.sequences = torch.FloatTensor(np.array(self.sequences))
        self.targets = torch.FloatTensor(np.array(self.targets))

    def __len__(self) -> int:
        return len(self.targets)

    def __getitem__(self, idx: int):
        return self.sequences[idx], self.targets[idx]


class SportsLSTM(nn.Module):
    """
    LSTM model for sequential sports prediction.

    Processes a sequence of game-level feature vectors through an LSTM
    layer, then passes the final hidden state through feedforward layers
    for prediction. Optionally uses bidirectional LSTM and multi-layer
    stacking for increased capacity.

    Architecture:
        Input sequence -> LSTM -> Dropout -> Linear -> ReLU ->
        Linear -> Sigmoid

    Args:
        input_dim: Number of features per time step.
        hidden_dim: Number of LSTM hidden units.
        num_layers: Number of stacked LSTM layers.
        dropout: Dropout rate between LSTM layers and in the
            feedforward head.
        bidirectional: If True, uses bidirectional LSTM (processes
            sequence in both directions).
        fc_dims: List of feedforward layer sizes after the LSTM.
    """

    def __init__(
        self,
        input_dim: int,
        hidden_dim: int = 64,
        num_layers: int = 2,
        dropout: float = 0.3,
        bidirectional: bool = False,
        fc_dims: list[int] = [64, 32],
    ):
        super().__init__()

        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.bidirectional = bidirectional
        self.num_directions = 2 if bidirectional else 1

        self.lstm = nn.LSTM(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=bidirectional,
        )

        lstm_output_dim = hidden_dim * self.num_directions

        # Feedforward head
        fc_layers = []
        prev_dim = lstm_output_dim
        for fc_dim in fc_dims:
            fc_layers.extend([
                nn.Linear(prev_dim, fc_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
            ])
            prev_dim = fc_dim
        fc_layers.append(nn.Linear(prev_dim, 1))
        fc_layers.append(nn.Sigmoid())

        self.fc_head = nn.Sequential(*fc_layers)

    def forward(
        self,
        x: torch.Tensor,
        hidden: Optional[tuple] = None,
    ) -> torch.Tensor:
        """
        Forward pass through LSTM and feedforward head.

        Args:
            x: Input tensor of shape (batch_size, seq_len, input_dim).
            hidden: Optional initial hidden state. If None, initialized
                to zeros.

        Returns:
            Predictions of shape (batch_size,).
        """
        # LSTM forward pass
        lstm_out, (h_n, c_n) = self.lstm(x, hidden)

        # Use the final hidden state from the last layer
        if self.bidirectional:
            # Concatenate forward and backward final hidden states
            final_hidden = torch.cat(
                (h_n[-2], h_n[-1]), dim=1
            )
        else:
            final_hidden = h_n[-1]

        # Feedforward prediction head
        output = self.fc_head(final_hidden).squeeze(-1)
        return output

Modeling Player Career Arcs

The same LSTM architecture can model individual player performance trajectories. By processing a player's game-by-game statistics as a sequence, the LSTM can learn career arc patterns --- improvement curves for young players, prime-year plateaus, and decline patterns for aging players. This is particularly valuable for projecting player performance in upcoming seasons.

class PlayerCareerLSTM(nn.Module):
    """
    LSTM for modeling player career trajectories.

    Processes a player's season-by-season (or game-by-game) statistics
    to predict future performance. Includes an attention mechanism that
    allows the model to focus on the most relevant past seasons.

    The attention-weighted approach is superior to using only the final
    hidden state, because a player's development trajectory (the pattern
    of improvement or decline across seasons) is as informative as their
    most recent performance.

    Args:
        input_dim: Number of features per time step (season or game).
        hidden_dim: LSTM hidden dimension.
        num_layers: Number of stacked LSTM layers.
        dropout: Dropout rate.
        use_attention: If True, applies attention over LSTM outputs
            instead of using only the final hidden state.
    """

    def __init__(
        self,
        input_dim: int,
        hidden_dim: int = 32,
        num_layers: int = 1,
        dropout: float = 0.2,
        use_attention: bool = True,
    ):
        super().__init__()
        self.use_attention = use_attention
        self.hidden_dim = hidden_dim

        self.lstm = nn.LSTM(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
        )

        if use_attention:
            self.attention_weights = nn.Linear(hidden_dim, 1)

        self.output_head = nn.Sequential(
            nn.Linear(hidden_dim, 16),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(16, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass with optional attention mechanism.

        Args:
            x: Shape (batch_size, seq_len, input_dim).

        Returns:
            Predictions of shape (batch_size,).
        """
        lstm_out, _ = self.lstm(x)  # (batch, seq_len, hidden_dim)

        if self.use_attention:
            # Compute attention scores
            attn_scores = self.attention_weights(lstm_out)  # (batch, seq, 1)
            attn_weights = torch.softmax(attn_scores, dim=1)
            # Weighted sum of LSTM outputs
            context = (lstm_out * attn_weights).sum(dim=1)  # (batch, hidden)
        else:
            context = lstm_out[:, -1, :]  # Use final hidden state

        return self.output_head(context).squeeze(-1)

Intuition: Think of an LSTM as a team scout with perfect memory. As it watches each game in sequence, it decides what to remember (input gate), what to forget (forget gate), and what to report (output gate). A skilled scout does not simply average the last 10 games --- they recognize that a blowout loss in a back-to-back road game against a top team is less informative than a close home loss to a mediocre opponent. The LSTM learns these contextual judgments from data.


29.3 Embeddings for Categorical Sports Data

The Problem with One-Hot Encoding

Sports data contains many categorical variables: team identifiers, player IDs, venue names, referee assignments, and so on. One-hot encoding converts each category to a binary vector, but this approach has severe limitations for sports data:

  1. High dimensionality: With 450+ active NBA players, one-hot encoding creates 450+ sparse features, each providing minimal signal.
  2. No similarity structure: One-hot vectors treat every pair of categories as equally dissimilar. But teams and players have rich similarity structures --- the Golden State Warriors and the Sacramento Kings play in the same division, in overlapping regional markets, and often with similar styles.
  3. No generalization: If a model learns that Player A scores efficiently, it cannot transfer that knowledge to the statistically similar Player B.

Entity Embeddings

Entity embeddings address these limitations by mapping each categorical value to a dense, low-dimensional vector that is learned during training. Instead of a 450-dimensional one-hot vector, each player is represented by, say, a 16-dimensional dense vector. These learned representations capture meaningful similarities: players with similar playing styles will have similar embedding vectors, and teams with similar characteristics will cluster in embedding space.

Formally, an embedding is a learned lookup table $E \in \mathbb{R}^{|C| \times d}$ where $|C|$ is the number of categories and $d$ is the embedding dimension. For category index $i$, the embedding vector is the $i$-th row of $E$:

$$e_i = E[i] \in \mathbb{R}^d$$

The embedding matrix $E$ is initialized randomly and trained via backpropagation alongside the rest of the network. The gradients from the prediction loss update the embedding vectors, pushing similar entities toward similar representations.
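In PyTorch, this lookup table is exactly `nn.Embedding`. A minimal sketch (the team count, dimension, and IDs are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes: 30 teams, 8-dimensional embeddings.
team_embedding = nn.Embedding(num_embeddings=30, embedding_dim=8)

# Look up vectors for a batch of integer-encoded team IDs.
team_ids = torch.tensor([0, 14, 29])
vectors = team_embedding(team_ids)  # shape: (3, 8), one row per team

# After training, similarity between entities can be measured directly:
sim = nn.functional.cosine_similarity(
    vectors[0].unsqueeze(0), vectors[1].unsqueeze(0)
)
```

Before training, these vectors are random; backpropagation gradually moves similar teams toward nearby points in the 8-dimensional space.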

Choosing Embedding Dimensions

A common heuristic for embedding dimension is:

$$d = \min\left(50, \left\lceil \frac{|C|}{2} \right\rceil\right)$$

For 30 NBA teams, this gives $d = 15$. For 450 players, $\lceil 450/2 \rceil = 225$ exceeds the cap, so $d = 50$. In practice, the optimal dimension should be tuned as a hyperparameter, but this heuristic provides a reasonable starting point.
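The heuristic is a one-liner; a sketch (the function name is our own):

```python
import math

def embedding_dim(n_categories: int, cap: int = 50) -> int:
    """Heuristic embedding dimension: min(cap, ceil(n_categories / 2))."""
    return min(cap, math.ceil(n_categories / 2))

print(embedding_dim(30))   # 15: NBA teams
print(embedding_dim(450))  # 50: NBA players (the cap applies)
```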

Implementation

class SportsEmbeddingNet(nn.Module):
    """
    Neural network with entity embeddings for categorical sports variables.

    Combines learned embeddings for teams, players, and venues with
    continuous features in a single feedforward architecture. The
    embeddings are concatenated with the continuous features and passed
    through hidden layers for prediction.

    This architecture is inspired by the entity embedding approach
    described by Guo and Berkhahn (2016) and popularized by the
    winning entries in several Kaggle competitions.

    Args:
        n_continuous: Number of continuous input features.
        embedding_specs: List of (n_categories, embedding_dim) tuples,
            one for each categorical variable. n_categories should
            include +1 for an unknown/padding category.
        hidden_dims: Sizes of hidden layers.
        dropout_rate: Dropout probability.
    """

    def __init__(
        self,
        n_continuous: int,
        embedding_specs: list[tuple[int, int]],
        hidden_dims: list[int] = [256, 128, 64],
        dropout_rate: float = 0.3,
    ):
        super().__init__()

        # Create embedding layers
        self.embeddings = nn.ModuleList([
            nn.Embedding(n_cats, emb_dim)
            for n_cats, emb_dim in embedding_specs
        ])

        # Total input dimension
        total_emb_dim = sum(emb_dim for _, emb_dim in embedding_specs)
        total_input_dim = n_continuous + total_emb_dim

        # Feedforward layers
        layers = []
        prev_dim = total_input_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout_rate),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, 1))
        layers.append(nn.Sigmoid())

        self.fc = nn.Sequential(*layers)

        # Initialize embeddings with small random values
        for emb in self.embeddings:
            nn.init.normal_(emb.weight, mean=0, std=0.01)

    def forward(
        self,
        x_continuous: torch.Tensor,
        x_categorical: list[torch.Tensor],
    ) -> torch.Tensor:
        """
        Forward pass combining embeddings and continuous features.

        Args:
            x_continuous: Continuous features, shape (batch, n_continuous).
            x_categorical: List of categorical index tensors, each of
                shape (batch,). One tensor per categorical variable.

        Returns:
            Predictions of shape (batch,).
        """
        # Look up embeddings
        embedded = [
            emb(cat_tensor)
            for emb, cat_tensor in zip(self.embeddings, x_categorical)
        ]

        # Concatenate embeddings and continuous features
        combined = torch.cat([x_continuous] + embedded, dim=1)

        return self.fc(combined).squeeze(-1)


class SportsEmbeddingDataset(Dataset):
    """
    Dataset combining continuous and categorical features for embedding models.

    Handles the separation of continuous features (as float tensors)
    and categorical features (as long tensors for embedding lookup).

    Args:
        continuous_features: DataFrame of continuous features.
        categorical_features: DataFrame of categorical features (integer-encoded).
        targets: Series of target values.
    """

    def __init__(
        self,
        continuous_features: pd.DataFrame,
        categorical_features: pd.DataFrame,
        targets: pd.Series,
    ):
        self.continuous = torch.FloatTensor(continuous_features.values)
        self.categorical = [
            torch.LongTensor(categorical_features[col].values)
            for col in categorical_features.columns
        ]
        self.targets = torch.FloatTensor(targets.values)

    def __len__(self) -> int:
        return len(self.targets)

    def __getitem__(self, idx: int):
        return (
            self.continuous[idx],
            [cat[idx] for cat in self.categorical],
            self.targets[idx],
        )

Transfer Learning Across Seasons

One of the most powerful applications of entity embeddings in sports is transfer learning across seasons. Team and player embeddings learned from one season capture latent representations of playing style, ability, and role that are partially transferable to the next season.

The transfer process works as follows:

  1. Train the model on Season $N$ data, learning embeddings for all teams and players.
  2. Initialize Season $N+1$ embeddings with the learned Season $N$ values for returning teams and players.
  3. Initialize new entities (expansion teams, rookies, traded players on new teams) with the average embedding or the embedding of a similar entity.
  4. Fine-tune the model on Season $N+1$ data with a reduced learning rate, allowing embeddings to adapt to new circumstances while retaining prior knowledge.

def transfer_embeddings(
    old_model: SportsEmbeddingNet,
    new_model: SportsEmbeddingNet,
    entity_mapping: dict[int, int],
    embedding_index: int = 0,
) -> None:
    """
    Transfer learned embeddings from an old model to a new model.

    For entities that exist in both seasons (returning teams/players),
    copies the learned embedding vectors. For new entities, initializes
    with the mean of all old embeddings (a reasonable prior for an
    unknown entity).

    Args:
        old_model: Model trained on previous season.
        new_model: Model for the new season (with potentially different
            number of entities).
        entity_mapping: Dict mapping old entity indices to new entity
            indices. Entities present in the new model but absent from
            this mapping are initialized with the old model's mean
            embedding.
        embedding_index: Index into the model's embedding list
            (0 for the first categorical variable, etc.).
    """
    old_emb = old_model.embeddings[embedding_index]
    new_emb = new_model.embeddings[embedding_index]

    # Compute mean of old embeddings as default for new entities
    mean_embedding = old_emb.weight.data.mean(dim=0)

    with torch.no_grad():
        # Initialize all new embeddings with the old mean
        new_emb.weight.data[:] = mean_embedding.unsqueeze(0).expand_as(
            new_emb.weight.data
        )

        # Copy embeddings for returning entities
        for old_idx, new_idx in entity_mapping.items():
            if old_idx < old_emb.num_embeddings and new_idx < new_emb.num_embeddings:
                new_emb.weight.data[new_idx] = old_emb.weight.data[old_idx]

    transferred = len(entity_mapping)
    total = new_emb.num_embeddings
    print(
        f"Transferred {transferred}/{total} embeddings "
        f"({transferred / total:.1%})"
    )

Common Pitfall: Entity embeddings require sufficient data to learn meaningful representations. If a player appears in only 5 games, their embedding will be poorly determined and may overfit. Apply minimum appearance thresholds (e.g., 20+ games) and group rare entities into a shared "other" category. Alternatively, use regularization (weight decay on the embedding layer) to shrink low-data embeddings toward zero, effectively defaulting to the continuous features alone.
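The threshold-and-group step might look like the following sketch; `encode_with_threshold` and the toy data are illustrative, not part of the chapter's pipeline:

```python
import pandas as pd

def encode_with_threshold(
    ids: pd.Series, min_count: int = 20
) -> tuple[pd.Series, dict]:
    """Integer-encode a categorical column, grouping rare values.

    Values appearing fewer than `min_count` times map to a shared
    index 0 ("other"); frequent values get indices 1..K. Index 0
    then serves as the unknown/padding category in nn.Embedding.
    """
    counts = ids.value_counts()
    frequent = counts[counts >= min_count].index
    mapping = {val: i + 1 for i, val in enumerate(frequent)}  # 0 = other
    encoded = ids.map(mapping).fillna(0).astype(int)
    return encoded, mapping

# Toy example: player "C" appears only twice and is grouped as "other".
players = pd.Series(["A"] * 25 + ["B"] * 30 + ["C"] * 2)
encoded, mapping = encode_with_threshold(players, min_count=20)
```

Fit the mapping on training data only, and route unseen test-time entities to index 0 as well.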


29.4 Practical PyTorch Implementation

Complete Training Pipeline

This section assembles all the components from previous sections into a complete, production-ready PyTorch pipeline for sports prediction. The pipeline handles data loading, model construction, training with logging, evaluation, and model persistence.

import json
from datetime import datetime
from pathlib import Path
from typing import Optional


class SportsModelTrainer:
    """
    Complete training pipeline for sports prediction neural networks.

    Handles data preparation, model training, evaluation, and
    persistence in a single, configurable class. Supports both
    feedforward and LSTM architectures.

    Args:
        model: The PyTorch model to train.
        model_dir: Directory for saving model checkpoints and logs.
        device: Device to train on ('cuda', 'cpu', or 'auto').
    """

    def __init__(
        self,
        model: nn.Module,
        model_dir: str = "./models",
        device: str = "auto",
    ):
        if device == "auto":
            self.device = torch.device(
                "cuda" if torch.cuda.is_available() else "cpu"
            )
        else:
            self.device = torch.device(device)

        self.model = model.to(self.device)
        self.model_dir = Path(model_dir)
        self.model_dir.mkdir(parents=True, exist_ok=True)
        self.training_log: list[dict] = []

    def create_data_loaders(
        self,
        train_dataset: Dataset,
        val_dataset: Dataset,
        batch_size: int = 64,
        num_workers: int = 0,
    ) -> tuple[DataLoader, DataLoader]:
        """
        Create training and validation DataLoaders with proper settings.

        Training data is shuffled; validation data is not. Pin memory
        is enabled when using GPU for faster data transfer.
        """
        pin_memory = self.device.type == "cuda"

        train_loader = DataLoader(
            train_dataset,
            batch_size=batch_size,
            shuffle=True,
            num_workers=num_workers,
            pin_memory=pin_memory,
            drop_last=True,  # Avoids batch norm issues with batch_size=1
        )

        val_loader = DataLoader(
            val_dataset,
            batch_size=batch_size * 4,  # Larger batches for faster eval
            shuffle=False,
            num_workers=num_workers,
            pin_memory=pin_memory,
        )

        return train_loader, val_loader

    def train(
        self,
        train_loader: DataLoader,
        val_loader: DataLoader,
        n_epochs: int = 100,
        learning_rate: float = 1e-3,
        weight_decay: float = 1e-4,
        patience: int = 15,
        scheduler_patience: int = 5,
        max_grad_norm: float = 1.0,
    ) -> dict:
        """
        Full training loop with early stopping, LR scheduling, and logging.

        Returns:
            Training history dictionary.
        """
        criterion = nn.BCELoss()
        optimizer = optim.Adam(
            self.model.parameters(),
            lr=learning_rate,
            weight_decay=weight_decay,
        )
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode="min", factor=0.5,
            patience=scheduler_patience,
        )

        best_val_loss = float("inf")
        best_state = None
        epochs_no_improve = 0
        history = {
            "train_loss": [], "val_loss": [], "val_brier": [],
            "learning_rates": [],
        }

        start_time = datetime.now()

        for epoch in range(n_epochs):
            # ---- Training ----
            self.model.train()
            train_loss = 0.0
            n_train = 0

            for batch in train_loader:
                if len(batch) == 2:
                    features, targets = batch
                    features = features.to(self.device)
                    targets = targets.to(self.device)
                    preds = self.model(features)
                elif len(batch) == 3:
                    continuous, categorical, targets = batch
                    continuous = continuous.to(self.device)
                    categorical = [c.to(self.device) for c in categorical]
                    targets = targets.to(self.device)
                    preds = self.model(continuous, categorical)
                else:
                    raise ValueError(f"Unexpected batch structure: {len(batch)} elements")

                loss = criterion(preds, targets)

                optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(
                    self.model.parameters(), max_grad_norm,
                )
                optimizer.step()

                train_loss += loss.item() * len(targets)
                n_train += len(targets)

            avg_train_loss = train_loss / n_train

            # ---- Validation ----
            val_loss, val_brier = self._evaluate(val_loader, criterion)

            # Record history
            current_lr = optimizer.param_groups[0]["lr"]
            history["train_loss"].append(avg_train_loss)
            history["val_loss"].append(val_loss)
            history["val_brier"].append(val_brier)
            history["learning_rates"].append(current_lr)

            scheduler.step(val_loss)

            # Early stopping
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_state = {
                    k: v.cpu().clone() for k, v in self.model.state_dict().items()
                }
                epochs_no_improve = 0
            else:
                epochs_no_improve += 1

            if (epoch + 1) % 10 == 0:
                elapsed = (datetime.now() - start_time).total_seconds()
                print(
                    f"Epoch {epoch + 1:3d} | "
                    f"Train: {avg_train_loss:.4f} | "
                    f"Val: {val_loss:.4f} | "
                    f"Brier: {val_brier:.4f} | "
                    f"LR: {current_lr:.2e} | "
                    f"Time: {elapsed:.0f}s"
                )

            if epochs_no_improve >= patience:
                print(f"Early stopping at epoch {epoch + 1}")
                break

        # Restore best model
        if best_state is not None:
            self.model.load_state_dict(best_state)
            self.model = self.model.to(self.device)

        history["best_epoch"] = int(np.argmin(history["val_loss"])) + 1
        history["best_val_loss"] = float(best_val_loss)
        history["total_epochs"] = epoch + 1

        return history

    def _evaluate(
        self, loader: DataLoader, criterion: nn.Module,
    ) -> tuple[float, float]:
        """Evaluate model on a data loader, returning loss and Brier score."""
        self.model.eval()
        total_loss = 0.0
        all_preds = []
        all_targets = []

        with torch.no_grad():
            for batch in loader:
                if len(batch) == 2:
                    features, targets = batch
                    features = features.to(self.device)
                    targets = targets.to(self.device)
                    preds = self.model(features)
                elif len(batch) == 3:
                    continuous, categorical, targets = batch
                    continuous = continuous.to(self.device)
                    categorical = [c.to(self.device) for c in categorical]
                    targets = targets.to(self.device)
                    preds = self.model(continuous, categorical)
                else:
                    raise ValueError(f"Unexpected batch structure: {len(batch)} elements")

                loss = criterion(preds, targets)
                total_loss += loss.item() * len(targets)
                all_preds.extend(preds.cpu().numpy())
                all_targets.extend(targets.cpu().numpy())

        avg_loss = total_loss / len(all_preds)
        brier = float(np.mean(
            (np.array(all_preds) - np.array(all_targets)) ** 2
        ))

        return avg_loss, brier

    def save_model(self, name: str, metadata: Optional[dict] = None) -> str:
        """
        Save model checkpoint with metadata.

        Saves the model state dict, architecture info, and any
        additional metadata to a timestamped directory.
        """
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        save_dir = self.model_dir / f"{name}_{timestamp}"
        save_dir.mkdir(parents=True, exist_ok=True)

        # Save model weights
        model_path = save_dir / "model.pt"
        torch.save(self.model.state_dict(), model_path)

        # Save metadata
        meta = {
            "name": name,
            "timestamp": timestamp,
            "model_class": self.model.__class__.__name__,
            "n_parameters": sum(p.numel() for p in self.model.parameters()),
            "device": str(self.device),
        }
        if metadata:
            meta.update(metadata)

        with open(save_dir / "metadata.json", "w") as f:
            json.dump(meta, f, indent=2)

        print(f"Model saved to {save_dir}")
        return str(save_dir)

    @staticmethod
    def load_model(
        model: nn.Module,
        checkpoint_dir: str,
        device: str = "auto",
    ) -> nn.Module:
        """
        Load a model from a checkpoint directory.

        Args:
            model: An uninitialized model with the same architecture.
            checkpoint_dir: Path to the checkpoint directory.
            device: Device to load the model onto.

        Returns:
            The model with loaded weights.
        """
        if device == "auto":
            device = "cuda" if torch.cuda.is_available() else "cpu"

        model_path = Path(checkpoint_dir) / "model.pt"
        state_dict = torch.load(model_path, map_location=device)
        model.load_state_dict(state_dict)
        model = model.to(device)
        model.eval()

        return model

GPU Training Considerations

Modern sports prediction datasets are small enough to train on a CPU in minutes. However, when you scale to larger datasets (play-by-play data, player tracking data) or use complex architectures (deep LSTMs, attention models), GPU acceleration becomes valuable.

Key considerations for GPU training:

  1. Data transfer overhead: Moving data between CPU and GPU takes time. Use pin_memory=True in DataLoaders and transfer entire batches at once.
  2. Batch size: Larger batches utilize GPU parallelism better. Increase batch size until GPU memory is fully utilized.
  3. Mixed precision: For larger models, torch.cuda.amp enables half-precision (FP16) training, roughly doubling throughput at minimal accuracy cost.
  4. Multi-GPU: For very large experiments (e.g., hyperparameter sweeps), torch.nn.DataParallel or torch.distributed enables multi-GPU training.

def setup_gpu_training(model: nn.Module) -> tuple[nn.Module, torch.device]:
    """
    Configure model for GPU training with automatic device detection.

    Handles CUDA availability checking, model transfer to GPU, and
    optional DataParallel wrapping for multi-GPU systems.

    Returns:
        Tuple of (configured model, device).
    """
    if torch.cuda.is_available():
        device = torch.device("cuda")
        n_gpus = torch.cuda.device_count()
        print(f"Using GPU: {torch.cuda.get_device_name(0)}")
        print(f"Available GPUs: {n_gpus}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

        if n_gpus > 1:
            model = nn.DataParallel(model)
            print(f"Using DataParallel across {n_gpus} GPUs")

        model = model.to(device)
    else:
        device = torch.device("cpu")
        print("No GPU available. Training on CPU.")
        model = model.to(device)

    return model, device
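Item 3 in the list above mentions mixed precision; a minimal sketch of an AMP training epoch follows. Note that nn.BCELoss is unsafe under autocast, so this sketch assumes the model outputs raw logits (no final Sigmoid) and pairs them with BCEWithLogitsLoss instead of the Sigmoid + BCELoss used elsewhere in the chapter:

```python
import torch
import torch.nn as nn

def amp_train_epoch(model, loader, optimizer, device):
    """One training epoch with automatic mixed precision (AMP).

    AMP is active only on CUDA; on CPU the autocast context and the
    GradScaler are disabled and this reduces to a plain FP32 loop.
    """
    use_amp = device.type == "cuda"
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    criterion = nn.BCEWithLogitsLoss()  # autocast-safe, expects logits
    model.train()
    total_loss, n = 0.0, 0
    for features, targets in loader:
        features = features.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=use_amp):
            logits = model(features).squeeze(-1)  # FP16 forward on GPU
            loss = criterion(logits, targets)
        scaler.scale(loss).backward()  # scale loss to avoid FP16 underflow
        scaler.step(optimizer)         # unscales gradients, then steps
        scaler.update()                # adapts the scale factor
        total_loss += loss.item() * len(targets)
        n += len(targets)
    return total_loss / n

# Smoke run on CPU (AMP disabled there), with toy data:
torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = [(torch.randn(8, 4), torch.randint(0, 2, (8,)).float())]
epoch_loss = amp_train_epoch(model, loader, optimizer, torch.device("cpu"))
```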

Real-World Application: For a typical NBA season prediction model with 12,000 games and 50 features, a feedforward network trains in under 30 seconds on a modern CPU. An LSTM processing 10-game sequences trains in 2--5 minutes. GPU acceleration becomes important when you scale to play-by-play data (millions of rows), when you add entity embeddings for hundreds of players, or when you run hyperparameter searches with hundreds of trials.


29.5 Hyperparameter Tuning and Architecture Search

The Hyperparameter Challenge

Neural networks have many hyperparameters that significantly affect performance: learning rate, batch size, hidden layer sizes, number of layers, dropout rate, weight decay, and more. Unlike model weights, hyperparameters are not learned during training --- they must be set before training begins. Finding good hyperparameter values is critical, and doing it manually is both tedious and unreliable.

Grid search evaluates every combination of hyperparameter values from predefined grids. For example, if you test 3 learning rates, 3 dropout rates, and 3 hidden sizes, grid search trains 27 models. This is exhaustive but scales exponentially with the number of hyperparameters.

Random search samples hyperparameter combinations randomly from specified distributions. Research by Bergstra and Bengio (2012) showed that random search is more efficient than grid search because it explores the search space more uniformly, avoiding the redundancy of grid search when some hyperparameters are more important than others.

Optuna for Bayesian Optimization

Optuna is a modern hyperparameter optimization framework that uses Bayesian optimization --- specifically, the Tree-structured Parzen Estimator (TPE) --- to intelligently explore the hyperparameter space. Unlike grid or random search, Optuna uses the results of previous trials to decide which configurations to try next, focusing on promising regions of the search space.

import optuna
from optuna.trial import Trial


def create_optuna_objective(
    train_dataset: Dataset,
    val_dataset: Dataset,
    input_dim: int,
    n_epochs: int = 50,
    device: str = "auto",
):
    """
    Create an Optuna objective function for neural architecture search.

    The objective function is called by Optuna for each trial. It
    samples hyperparameters, constructs a model, trains it, and
    returns the validation loss. Optuna uses this information to
    decide which hyperparameter combinations to try next.

    Args:
        train_dataset: Training dataset.
        val_dataset: Validation dataset.
        input_dim: Number of input features.
        n_epochs: Maximum training epochs per trial.
        device: Training device.

    Returns:
        An objective function compatible with optuna.study.optimize().
    """
    if device == "auto":
        device = "cuda" if torch.cuda.is_available() else "cpu"

    def objective(trial: Trial) -> float:
        # Sample architecture hyperparameters
        n_layers = trial.suggest_int("n_layers", 1, 4)
        hidden_dims = []
        for i in range(n_layers):
            dim = trial.suggest_categorical(
                f"hidden_dim_{i}", [32, 64, 128, 256, 512]
            )
            hidden_dims.append(dim)

        dropout = trial.suggest_float("dropout", 0.1, 0.5)
        learning_rate = trial.suggest_float(
            "learning_rate", 1e-5, 1e-2, log=True,
        )
        weight_decay = trial.suggest_float(
            "weight_decay", 1e-6, 1e-2, log=True,
        )
        batch_size = trial.suggest_categorical(
            "batch_size", [32, 64, 128, 256]
        )

        # Build model
        model = SportsFeedforwardNet(
            input_dim=input_dim,
            hidden_dims=hidden_dims,
            dropout_rate=dropout,
        ).to(device)

        # Create data loaders
        train_loader = DataLoader(
            train_dataset, batch_size=batch_size,
            shuffle=True, drop_last=True,
        )
        val_loader = DataLoader(
            val_dataset, batch_size=256, shuffle=False,
        )

        # Train
        criterion = nn.BCELoss()
        optimizer = optim.Adam(
            model.parameters(), lr=learning_rate,
            weight_decay=weight_decay,
        )

        best_val_loss = float("inf")
        patience_counter = 0
        patience = 10

        for epoch in range(n_epochs):
            # Training
            model.train()
            for features, targets in train_loader:
                features = features.to(device)
                targets = targets.to(device)

                optimizer.zero_grad()
                preds = model(features)
                loss = criterion(preds, targets)
                loss.backward()
                torch.nn.utils.clip_grad_norm_(
                    model.parameters(), 1.0,
                )
                optimizer.step()

            # Validation
            model.eval()
            val_losses = []
            with torch.no_grad():
                for features, targets in val_loader:
                    features = features.to(device)
                    targets = targets.to(device)
                    preds = model(features)
                    loss = criterion(preds, targets)
                    val_losses.append(loss.item())

            val_loss = np.mean(val_losses)

            # Report intermediate value for pruning
            trial.report(val_loss, epoch)
            if trial.should_prune():
                raise optuna.TrialPruned()

            # Early stopping within trial
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= patience:
                    break

        return best_val_loss

    return objective


def run_hyperparameter_search(
    train_dataset: Dataset,
    val_dataset: Dataset,
    input_dim: int,
    n_trials: int = 100,
    n_epochs: int = 50,
    study_name: str = "sports_nn_optimization",
) -> optuna.Study:
    """
    Run a complete Optuna hyperparameter search.

    Uses the Median Pruner to terminate unpromising trials early
    (if a trial's intermediate value is worse than the median of
    previous trials at the same step, it is pruned).

    Args:
        train_dataset: Training dataset.
        val_dataset: Validation dataset.
        input_dim: Number of input features.
        n_trials: Number of hyperparameter combinations to try.
        n_epochs: Maximum epochs per trial.
        study_name: Name for the Optuna study (for logging).

    Returns:
        The completed Optuna study with all trial results.
    """
    objective = create_optuna_objective(
        train_dataset, val_dataset, input_dim, n_epochs,
    )

    study = optuna.create_study(
        study_name=study_name,
        direction="minimize",
        pruner=optuna.pruners.MedianPruner(
            n_startup_trials=10,
            n_warmup_steps=5,
        ),
    )

    study.optimize(objective, n_trials=n_trials, show_progress_bar=True)

    # Report results
    print(f"\nBest trial:")
    print(f"  Value (val loss): {study.best_trial.value:.4f}")
    print(f"  Params:")
    for key, value in study.best_trial.params.items():
        print(f"    {key}: {value}")

    return study

Early Stopping and Regularization

Neural networks for small sports datasets are highly prone to overfitting. Multiple regularization techniques should be used in combination:

  1. Early stopping: Monitor validation loss and stop training when it begins to increase. This is the single most important regularization technique.

  2. Dropout: Randomly zero a fraction of neuron outputs during training. Standard rates for sports models: 0.2--0.5.

  3. Weight decay (L2 regularization): Penalize large weights by adding $\lambda \|W\|^2$ to the loss. Typical values: $10^{-4}$ to $10^{-2}$.

  4. Batch normalization: Normalizes layer inputs, which acts as an implicit regularizer and stabilizes training.

  5. Data augmentation: In sports, this might mean adding noise to features, or training on bootstrapped samples.
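The noise-injection variant of item 5 is simple to implement; a sketch (the function name and noise level are our own, chosen for standardized inputs):

```python
import torch

def add_feature_noise(
    features: torch.Tensor, noise_std: float = 0.05
) -> torch.Tensor:
    """Gaussian-noise augmentation for standardized tabular features.

    Because inputs are standardized to unit variance, a small fixed
    noise_std perturbs each feature by a few percent of its typical
    scale. Apply only during training, never at evaluation time.
    """
    return features + torch.randn_like(features) * noise_std

torch.manual_seed(0)
x = torch.zeros(100, 5)       # stands in for a standardized batch
noisy = add_feature_noise(x)  # in a real loop: call before the forward pass
```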

Practical Tips for Sports Neural Networks

| Aspect | Recommendation |
|---|---|
| Architecture | Start simple: 2--3 hidden layers, 64--256 neurons each |
| Activation | ReLU for hidden layers, Sigmoid for binary output |
| Optimizer | Adam with default betas (0.9, 0.999) |
| Learning rate | Start at $10^{-3}$; use scheduler to reduce on plateau |
| Batch size | 32--128 for typical sports datasets |
| Regularization | Dropout (0.3) + weight decay ($10^{-4}$) + early stopping |
| Feature scaling | Always standardize continuous inputs (zero mean, unit variance) |
| Initialization | Default PyTorch initialization (Kaiming) is usually sufficient |
| Embeddings | Use for categorical variables with 5+ categories |
| LSTM sequence length | 5--15 games; longer sequences rarely help and slow training |
| Ensemble | Train 5--10 models with different random seeds; average predictions |

Common Pitfall: A common mistake is to over-invest in architecture complexity before establishing strong baselines. A simple 2-layer feedforward network with well-engineered features from Chapter 28 will outperform a complex LSTM with poor features in nearly every case. The correct order of operations is: (1) engineer excellent features, (2) establish a tree-model baseline, (3) build a simple neural network, (4) incrementally add complexity (embeddings, LSTM, attention) only when each addition demonstrably improves validation performance.


29.6 Chapter Summary

Key Concepts

  1. Feedforward neural networks apply layers of weighted linear transformations followed by nonlinear activations. For tabular sports data, they serve as flexible function approximators that can capture complex feature interactions.

  2. Backpropagation computes gradients of the loss with respect to all network weights using the chain rule, enabling efficient gradient-based optimization. The Adam optimizer adapts learning rates per-parameter and is the standard choice for sports prediction networks.

  3. LSTM networks process sequential data (game-by-game statistics) through gated cells that learn what to remember, forget, and output at each time step. They can discover temporal patterns that hand-crafted rolling features may miss.

  4. Entity embeddings learn dense, low-dimensional vector representations of categorical variables (teams, players, venues). Similar entities cluster in embedding space, enabling generalization across categories and transfer learning across seasons.

  5. The training pipeline includes data loading (with proper batching and shuffling), forward pass, loss computation, backpropagation, gradient clipping, optimizer step, validation evaluation, learning rate scheduling, and early stopping.

  6. Hyperparameter tuning with Optuna uses Bayesian optimization to efficiently search over architectures and training configurations, with pruning to terminate unpromising trials early.

  7. Regularization through dropout, weight decay, batch normalization, and early stopping is essential for sports datasets, which are typically small by deep learning standards. Multiple regularization techniques should be combined.

  8. Neural networks complement rather than replace tree-based models for sports prediction. Their primary advantages are entity embeddings, sequential modeling, and multi-task learning. For simple tabular prediction with well-engineered features, tree models remain competitive.

Key Formulas

| Formula | Description |
|---|---|
| $a^{(l)} = \sigma(W^{(l)} a^{(l-1)} + b^{(l)})$ | Feedforward layer computation |
| $\text{ReLU}(z) = \max(0, z)$ | Rectified linear unit activation |
| $\mathcal{L} = -\frac{1}{n}\sum[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ | Binary cross-entropy loss |
| $f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$ | LSTM forget gate |
| $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ | LSTM cell state update |
| $e_i = E[i] \in \mathbb{R}^d$ | Entity embedding lookup |
| $d \approx \min(50, \lceil \lvert C \rvert / 2 \rceil)$ | Embedding dimension heuristic |

Key Code Patterns

  1. SportsTabularDataset / SportsSequenceDataset: PyTorch Dataset subclasses that convert pandas DataFrames to tensors, handle feature/target separation, and support batched loading through DataLoader.

  2. SportsFeedforwardNet: Configurable feedforward architecture with batch normalization and dropout, using nn.Sequential for clean layer stacking.

  3. SportsLSTM / PlayerCareerLSTM: LSTM architectures for sequential sports data, with optional attention mechanism and bidirectional processing.

  4. SportsEmbeddingNet: Combined embedding + feedforward architecture that handles both continuous and categorical inputs.

  5. SportsModelTrainer: Complete training pipeline with early stopping, learning rate scheduling, gradient clipping, and model persistence.

  6. Optuna integration: Objective functions that sample architectures and hyperparameters, with intermediate reporting for trial pruning.
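Patterns 1 and 5 above can be condensed into a short sketch: a minimal `Dataset` that converts a DataFrame to tensors, plus one gradient step with weight decay and gradient clipping. The column names and synthetic data are assumptions for illustration, and this stripped-down class omits the feature scaling and error handling a real pipeline needs.

```python
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

class SportsTabularDataset(Dataset):
    """Minimal sketch of pattern 1: DataFrame -> feature/target tensors."""
    def __init__(self, df: pd.DataFrame, target_col: str):
        self.X = torch.tensor(df.drop(columns=[target_col]).values,
                              dtype=torch.float32)
        self.y = torch.tensor(df[target_col].values, dtype=torch.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Synthetic stand-in for engineered game features.
df = pd.DataFrame({"off_rtg_diff": [2.1, -0.5, 1.3, -3.0],
                   "rest_diff": [1.0, 0.0, -1.0, 2.0],
                   "home_win": [1.0, 0.0, 1.0, 0.0]})
loader = DataLoader(SportsTabularDataset(df, "home_win"),
                    batch_size=2, shuffle=True)

# One epoch of the training-loop pattern (5): forward pass, loss,
# backprop, gradient clipping, optimizer step.
model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.BCEWithLogitsLoss()
for X, y in loader:
    opt.zero_grad()
    loss = loss_fn(model(X).squeeze(-1), y)
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
```

The full trainer in the chapter wraps this loop with validation evaluation, learning rate scheduling, early stopping, and checkpointing, but the inner step is exactly this shape.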

Decision Framework: Choosing a Neural Network Architecture

START: You have engineered features and a prediction target.

1. Is your dataset large (>10,000 observations)?
   - NO  --> Use tree-based models (XGBoost). NNs likely to overfit.
   - YES --> Proceed to step 2.

2. Do you have high-cardinality categorical variables (teams, players)?
   - YES --> Use entity embeddings (SportsEmbeddingNet).
   - NO  --> Simple feedforward network (SportsFeedforwardNet).

3. Is temporal sequence important beyond rolling averages?
   - YES --> Add LSTM component (SportsLSTM).
   - NO  --> Feedforward or embedding network is sufficient.

4. Do you need to predict multiple related targets?
   - YES --> Multi-task network with shared hidden layers.
   - NO  --> Single-output architecture.

5. Tune hyperparameters with Optuna (50-200 trials).
6. Ensemble 5-10 models with different seeds.
7. Compare against XGBoost baseline on held-out data.

END: Deploy the best model (or ensemble).
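The flowchart above can also be expressed as a small helper function, useful when scanning many prediction tasks at once. The function name and return labels are hypothetical, not part of the chapter's codebase:

```python
def choose_architecture(n_obs: int, has_high_card_cats: bool,
                        needs_sequence: bool, multi_task: bool) -> str:
    """Hypothetical helper encoding the decision framework above."""
    # Step 1: small datasets -> tree-based baseline, not a neural net.
    if n_obs <= 10_000:
        return "xgboost-baseline"
    # Step 2: embeddings for high-cardinality categoricals.
    parts = ["embedding" if has_high_card_cats else "feedforward"]
    # Step 3: add an LSTM component when sequence matters.
    if needs_sequence:
        parts.append("lstm")
    # Step 4: shared hidden layers for multiple related targets.
    if multi_task:
        parts.append("multitask")
    return "+".join(parts)

print(choose_architecture(12_000, True, True, False))  # embedding+lstm
```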

What's Next

In Chapter 30: Model Evaluation and Selection, we will learn how to rigorously evaluate and compare the models built in this chapter and previous ones. Raw accuracy is insufficient for betting models --- we need proper scoring rules (Brier score, log loss) that evaluate the quality of probabilistic predictions. We will build backtesting frameworks that simulate realistic betting scenarios with transaction costs. We will implement walk-forward validation that respects the temporal structure of sports data. And we will use calibration plots and statistical tests to determine whether one model genuinely outperforms another or whether observed differences are within the range of statistical noise. These evaluation techniques are the final link in the chain from data to profitable betting decisions.


This chapter is part of The Sports Betting Textbook, a comprehensive guide to quantitative sports betting. All code examples use Python 3.11+ with PyTorch 2.0+ and are available in the companion repository.