Chapter 29: Neural Networks for Sports Prediction

"All models are wrong, but some are useful." --- George E. P. Box, Robustness in the Strategy of Scientific Model Building (1979)

Learning Objectives
- Design and implement feedforward neural networks for tabular sports data using PyTorch
- Build LSTM and sequence models that capture temporal dependencies in team and player performance
- Create entity embeddings for teams, players, and venues that learn dense vector representations from data
- Implement complete PyTorch training pipelines including data loaders, GPU training, and model persistence
- Apply hyperparameter tuning with Optuna and architecture search techniques to optimize neural network performance

Chapter Overview
Neural networks have transformed fields from computer vision to natural language processing, achieving superhuman performance on tasks that were considered intractable a decade ago. Their application to sports prediction is more nuanced. Sports data is typically tabular --- rows of games with columns of features --- a domain where tree-based models like XGBoost have traditionally dominated. The datasets are small by deep learning standards: a decade of NBA games yields roughly 12,000 observations, compared to the millions or billions of examples used to train image classifiers or language models. And the signal-to-noise ratio is low: sports outcomes are inherently uncertain, and even a perfect model cannot predict them with high accuracy.
Yet neural networks offer capabilities that tree-based models lack. They can learn entity embeddings --- dense vector representations of teams, players, and venues --- that capture similarities and relationships in ways that one-hot encoding cannot. They can model sequential dependencies through recurrent architectures like LSTMs, learning temporal patterns directly from game sequences rather than relying on hand-crafted rolling features. And they can incorporate multiple data modalities --- combining tabular features, sequences, and categorical embeddings in a single end-to-end architecture.
This chapter provides a practical, code-heavy introduction to neural networks for sports prediction. We assume familiarity with the feature engineering concepts from Chapter 28 and the general machine learning framework from Chapter 27. We use PyTorch as our deep learning framework throughout, chosen for its flexibility, Pythonic API, and widespread adoption in both research and production.
In this chapter, you will learn to:

- Build feedforward networks for tabular sports data and understand when they outperform tree-based models
- Design LSTM architectures that model team performance trajectories and player career arcs
- Create entity embeddings that learn meaningful representations of teams, players, and venues
- Implement complete PyTorch training pipelines with proper data loading, GPU acceleration, and model persistence
- Tune hyperparameters and search over architectures using Optuna
29.1 Feedforward Networks for Tabular Sports Data
Architecture Fundamentals
A feedforward neural network (also called a multilayer perceptron, or MLP) consists of layers of neurons connected in sequence. Each neuron computes a weighted sum of its inputs, adds a bias term, and passes the result through a nonlinear activation function:
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$$ $$a^{(l)} = \sigma(z^{(l)})$$
where $W^{(l)}$ is the weight matrix for layer $l$, $b^{(l)}$ is the bias vector, $a^{(l-1)}$ is the output (activation) of the previous layer, and $\sigma$ is the activation function. The input layer $a^{(0)}$ is simply the feature vector $x$.
For a network with $L$ hidden layers, the full computation is:
$$\hat{y} = f(x) = \sigma_L(W^{(L)} \sigma_{L-1}(\cdots \sigma_1(W^{(1)} x + b^{(1)}) \cdots) + b^{(L)})$$
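This layered computation is nothing more than alternating matrix products and nonlinearities. A minimal NumPy sketch with made-up toy dimensions (4 features, one hidden layer of 3 units):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy network: 4 input features -> 3 hidden units -> 1 output
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)  # layer 1 weights and bias
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)  # layer 2 weights and bias

x = rng.normal(size=4)            # input feature vector a^(0)
a1 = relu(W1 @ x + b1)            # hidden activation a^(1) = sigma(W^(1) x + b^(1))
y_hat = sigmoid(W2 @ a1 + b2)     # output y-hat in (0, 1)
print(float(y_hat))
```

The same structure, with learned rather than random weights, is what `nn.Sequential` builds later in this section.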
Activation Functions
The choice of activation function affects the network's ability to learn complex patterns and its training dynamics:
ReLU (Rectified Linear Unit): The most commonly used activation for hidden layers. Simple, computationally efficient, and avoids the vanishing gradient problem that plagued earlier architectures:
$$\text{ReLU}(z) = \max(0, z)$$
Leaky ReLU: Addresses the "dying ReLU" problem (neurons that output zero for all inputs and stop learning) by allowing a small gradient for negative inputs:
$$\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$
where $\alpha$ is typically 0.01.
GELU (Gaussian Error Linear Unit): A smooth approximation of ReLU that has become standard in transformer architectures. It can provide marginal improvements over ReLU in some settings:
$$\text{GELU}(z) = z \cdot \Phi(z)$$
where $\Phi$ is the CDF of the standard normal distribution.
Sigmoid: Maps inputs to the range $(0, 1)$. Used primarily for binary classification output layers:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
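To make the shapes of these four functions concrete, a quick check with PyTorch's built-in activations on a few sample inputs (values rounded for display):

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu = F.relu(z)                               # zeroes all negative inputs
leaky = F.leaky_relu(z, negative_slope=0.01)   # small slope alpha = 0.01 for z <= 0
gelu = F.gelu(z)                               # smooth: z * Phi(z)
sig = torch.sigmoid(z)                         # squashes into (0, 1)

print(relu.tolist())                           # [0.0, 0.0, 0.0, 0.5, 2.0]
print([round(v, 3) for v in leaky.tolist()])   # [-0.02, -0.005, 0.0, 0.5, 2.0]
```

Note that GELU and sigmoid both pass through well-defined values at zero: GELU(0) = 0 and sigmoid(0) = 0.5.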
Backpropagation Intuition
Training a neural network means finding weight values that minimize a loss function --- a measure of prediction error. For binary classification (win/loss prediction), the standard loss is binary cross-entropy:
$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$
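The formula can be verified directly against PyTorch's built-in loss; a small sketch with made-up labels and predicted probabilities:

```python
import torch
import torch.nn as nn

y = torch.tensor([1.0, 0.0, 1.0, 0.0])      # true labels (home win = 1)
y_hat = torch.tensor([0.8, 0.3, 0.6, 0.1])  # predicted win probabilities

# Direct implementation of the binary cross-entropy formula
manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()

# PyTorch's built-in equivalent
builtin = nn.BCELoss()(y_hat, y)

print(torch.allclose(manual, builtin))  # True
```

The built-in version also clamps the log arguments for numerical stability, which matters when a model outputs probabilities very close to 0 or 1.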
Backpropagation computes the gradient of this loss with respect to every weight in the network, using the chain rule of calculus. For a weight $w_{jk}^{(l)}$ in layer $l$:
$$\frac{\partial \mathcal{L}}{\partial w_{jk}^{(l)}} = \frac{\partial \mathcal{L}}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{jk}^{(l)}} = \delta_j^{(l)} \cdot a_k^{(l-1)}$$
where $\delta_j^{(l)} = \frac{\partial \mathcal{L}}{\partial z_j^{(l)}}$ is the "error signal" that propagates backward through the network. The key insight is that this error signal can be computed recursively from the output layer back to the input:
$$\delta^{(l)} = (W^{(l+1)})^T \delta^{(l+1)} \odot \sigma'(z^{(l)})$$
where $\odot$ denotes element-wise multiplication and $\sigma'$ is the derivative of the activation function. This recursive structure makes backpropagation computationally efficient: computing all gradients costs roughly twice the computation of a single forward pass.
Gradient descent then updates each weight in the direction that reduces the loss:
$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial W^{(l)}}$$
where $\eta$ is the learning rate. In practice, we use stochastic gradient descent (SGD) variants like Adam that adapt the learning rate for each parameter and incorporate momentum.
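The update rule above, executed once by hand on a toy single-weight model with autograd supplying the gradient (the numbers are illustrative; a real pipeline would use an optimizer object such as Adam):

```python
import torch

# One logistic "neuron" with a single weight, trained on one example
w = torch.tensor(0.5, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(1.0)
eta = 0.1  # learning rate

y_hat = torch.sigmoid(w * x)
loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))

loss.backward()           # backpropagation: compute dL/dw via the chain rule
with torch.no_grad():
    w -= eta * w.grad     # gradient descent: step against the gradient

# Here dL/dw = (y_hat - y) * x is negative (the true label is 1),
# so the update increases w.
print(round(w.item(), 3))
```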
When Neural Networks Beat Tree Models
For tabular sports data, tree-based models (XGBoost, LightGBM, Random Forest) are strong baselines that are often hard to beat. Neural networks tend to outperform tree models when:
- Entity embeddings are valuable: If your data contains high-cardinality categorical variables (hundreds of players, dozens of teams), embeddings can capture latent structure that tree models miss.
- Sequential patterns matter: When the order of games matters and you want to model trajectories rather than static snapshots.
- Multi-task learning is beneficial: Neural networks can jointly predict multiple related targets (win probability, total points, margin) with shared representations.
- Very large datasets: With 100,000+ observations, neural networks can learn complex nonlinear relationships that simpler models cannot represent.
For small tabular datasets with well-engineered features, XGBoost often wins. The practical recommendation is to use tree models as a strong baseline and invest in neural networks when the specific advantages above are relevant.
PyTorch Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
from typing import Optional
class SportsTabularDataset(Dataset):
"""
PyTorch Dataset for tabular sports prediction data.
Converts pandas DataFrames to PyTorch tensors for efficient
batched training. Handles both features and targets, with
optional support for sample weights.
Attributes:
features: Tensor of shape (n_samples, n_features).
targets: Tensor of shape (n_samples,).
weights: Optional tensor of sample weights.
"""
def __init__(
self,
features: pd.DataFrame,
targets: pd.Series,
weights: Optional[pd.Series] = None,
):
self.features = torch.FloatTensor(features.values)
self.targets = torch.FloatTensor(targets.values)
self.weights = (
torch.FloatTensor(weights.values)
if weights is not None
else None
)
def __len__(self) -> int:
return len(self.targets)
def __getitem__(self, idx: int):
if self.weights is not None:
return self.features[idx], self.targets[idx], self.weights[idx]
return self.features[idx], self.targets[idx]
class SportsFeedforwardNet(nn.Module):
"""
Feedforward neural network for tabular sports prediction.
Architecture: Input -> [Linear -> BatchNorm -> ReLU -> Dropout] x N
-> Linear -> Sigmoid (for binary classification).
Batch normalization stabilizes training by normalizing layer inputs.
Dropout regularizes by randomly zeroing neurons during training,
preventing co-adaptation and reducing overfitting.
Args:
input_dim: Number of input features.
hidden_dims: List of hidden layer sizes. Each entry creates
one hidden layer with that many neurons.
dropout_rate: Probability of dropping each neuron during training.
output_dim: Number of output neurons (1 for binary classification).
"""
def __init__(
self,
input_dim: int,
hidden_dims: list[int] = [256, 128, 64],
dropout_rate: float = 0.3,
output_dim: int = 1,
):
super().__init__()
layers = []
prev_dim = input_dim
for hidden_dim in hidden_dims:
layers.extend([
nn.Linear(prev_dim, hidden_dim),
nn.BatchNorm1d(hidden_dim),
nn.ReLU(),
nn.Dropout(dropout_rate),
])
prev_dim = hidden_dim
layers.append(nn.Linear(prev_dim, output_dim))
layers.append(nn.Sigmoid())
self.network = nn.Sequential(*layers)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass through the network."""
return self.network(x).squeeze(-1)
def train_feedforward_model(
model: nn.Module,
train_loader: DataLoader,
val_loader: DataLoader,
learning_rate: float = 1e-3,
weight_decay: float = 1e-4,
n_epochs: int = 100,
patience: int = 15,
device: str = "auto",
) -> dict:
"""
Train a feedforward model with early stopping and validation monitoring.
Uses the Adam optimizer with weight decay (L2 regularization) and
a ReduceLROnPlateau scheduler that decreases the learning rate when
validation loss plateaus. Training stops early if validation loss
does not improve for 'patience' consecutive epochs.
Args:
model: The neural network to train.
train_loader: DataLoader for training data.
val_loader: DataLoader for validation data.
learning_rate: Initial learning rate for Adam optimizer.
weight_decay: L2 regularization strength.
n_epochs: Maximum number of training epochs.
patience: Number of epochs without improvement before stopping.
device: 'cuda', 'cpu', or 'auto' (auto-detect GPU).
Returns:
Dictionary containing training history with keys:
'train_losses', 'val_losses', 'best_epoch', 'best_val_loss'.
"""
if device == "auto":
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
criterion = nn.BCELoss()
optimizer = optim.Adam(
model.parameters(), lr=learning_rate, weight_decay=weight_decay,
)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=5,
)
best_val_loss = float("inf")
best_model_state = None
epochs_without_improvement = 0
history = {"train_losses": [], "val_losses": []}
for epoch in range(n_epochs):
# Training phase
model.train()
train_loss = 0.0
train_batches = 0
for batch in train_loader:
features, targets = batch[0].to(device), batch[1].to(device)
optimizer.zero_grad()
predictions = model(features)
loss = criterion(predictions, targets)
loss.backward()
# Gradient clipping prevents exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
train_loss += loss.item()
train_batches += 1
avg_train_loss = train_loss / train_batches
# Validation phase
model.eval()
val_loss = 0.0
val_batches = 0
with torch.no_grad():
for batch in val_loader:
features, targets = batch[0].to(device), batch[1].to(device)
predictions = model(features)
loss = criterion(predictions, targets)
val_loss += loss.item()
val_batches += 1
avg_val_loss = val_loss / val_batches
history["train_losses"].append(avg_train_loss)
history["val_losses"].append(avg_val_loss)
scheduler.step(avg_val_loss)
# Early stopping check
if avg_val_loss < best_val_loss:
best_val_loss = avg_val_loss
            # state_dict() holds references to the live tensors; clone them
            # so later training steps don't overwrite the saved snapshot
            best_model_state = {
                k: v.detach().clone() for k, v in model.state_dict().items()
            }
epochs_without_improvement = 0
else:
epochs_without_improvement += 1
if (epoch + 1) % 10 == 0:
print(
f"Epoch {epoch + 1}/{n_epochs} | "
f"Train Loss: {avg_train_loss:.4f} | "
f"Val Loss: {avg_val_loss:.4f} | "
f"LR: {optimizer.param_groups[0]['lr']:.6f}"
)
if epochs_without_improvement >= patience:
print(f"Early stopping at epoch {epoch + 1}")
break
# Restore best model
if best_model_state is not None:
model.load_state_dict(best_model_state)
history["best_epoch"] = np.argmin(history["val_losses"]) + 1
history["best_val_loss"] = best_val_loss
return history
Worked Example: NBA Game Prediction with a Feedforward Network
# Complete example: predicting NBA game outcomes
# Assumes features from Chapter 28 are already computed
def run_nba_feedforward_example():
"""
End-to-end example of training a feedforward network for NBA
game outcome prediction.
This example demonstrates:
1. Loading and splitting time-series data properly
2. Creating PyTorch datasets and data loaders
3. Building and training a feedforward model
4. Evaluating predictions with Brier score
Note: This uses synthetic data for illustration. Replace with
real NBA data for actual predictions.
"""
# Generate synthetic data matching real NBA feature structure
np.random.seed(42)
n_games = 5000
features = pd.DataFrame({
"home_adj_ortg": np.random.normal(110, 5, n_games),
"home_adj_drtg": np.random.normal(110, 5, n_games),
"away_adj_ortg": np.random.normal(110, 5, n_games),
"away_adj_drtg": np.random.normal(110, 5, n_games),
"home_pace": np.random.normal(100, 3, n_games),
"away_pace": np.random.normal(100, 3, n_games),
"home_ortg_roll_10": np.random.normal(110, 4, n_games),
"away_ortg_roll_10": np.random.normal(110, 4, n_games),
"rest_diff": np.random.choice([-2, -1, 0, 1, 2], n_games),
"home_win_pct_std": np.random.uniform(0.2, 0.8, n_games),
"away_win_pct_std": np.random.uniform(0.2, 0.8, n_games),
})
# Simulate target with realistic structure
margin = (
0.3 * (features["home_adj_ortg"] - features["away_adj_drtg"])
- 0.3 * (features["away_adj_ortg"] - features["home_adj_drtg"])
+ 3.0 # home court advantage
+ np.random.normal(0, 10, n_games) # noise
)
targets = (margin > 0).astype(float)
targets = pd.Series(targets, name="home_win")
# Time-based split (first 80% train, next 10% val, last 10% test)
n_train = int(0.8 * n_games)
n_val = int(0.1 * n_games)
X_train = features.iloc[:n_train]
y_train = targets.iloc[:n_train]
X_val = features.iloc[n_train:n_train + n_val]
y_val = targets.iloc[n_train:n_train + n_val]
X_test = features.iloc[n_train + n_val:]
y_test = targets.iloc[n_train + n_val:]
# Standardize features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
scaler.fit_transform(X_train), columns=X_train.columns,
)
X_val_scaled = pd.DataFrame(
scaler.transform(X_val), columns=X_val.columns,
)
X_test_scaled = pd.DataFrame(
scaler.transform(X_test), columns=X_test.columns,
)
# Create datasets and loaders
train_dataset = SportsTabularDataset(X_train_scaled, y_train)
val_dataset = SportsTabularDataset(X_val_scaled, y_val)
test_dataset = SportsTabularDataset(X_test_scaled, y_test)
train_loader = DataLoader(
train_dataset, batch_size=64, shuffle=True,
)
val_loader = DataLoader(val_dataset, batch_size=256, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False)
# Build and train model
model = SportsFeedforwardNet(
input_dim=X_train.shape[1],
hidden_dims=[128, 64, 32],
dropout_rate=0.3,
)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(model)
history = train_feedforward_model(
model, train_loader, val_loader,
learning_rate=1e-3, n_epochs=100, patience=15,
)
# Evaluate on test set
model.eval()
device = next(model.parameters()).device
all_preds = []
all_targets = []
with torch.no_grad():
for features_batch, targets_batch in test_loader:
preds = model(features_batch.to(device))
all_preds.extend(preds.cpu().numpy())
all_targets.extend(targets_batch.numpy())
predictions = np.array(all_preds)
actuals = np.array(all_targets)
# Brier score
brier_score = np.mean((predictions - actuals) ** 2)
print(f"\nTest Brier Score: {brier_score:.4f}")
print(f"Best validation loss at epoch: {history['best_epoch']}")
return model, history
Real-World Application: In production NBA prediction systems, feedforward networks typically achieve Brier scores between 0.20 and 0.22 on game outcome prediction, compared to 0.21--0.23 for logistic regression and 0.19--0.21 for well-tuned XGBoost models. The neural network advantage emerges primarily when entity embeddings are incorporated (Section 29.3), allowing the model to learn team-specific representations that capture information beyond the engineered features.
29.2 LSTM and Sequence Models for Time-Dependent Data
The Limitations of Static Features
The rolling averages and exponentially weighted means from Chapter 28 capture some temporal dynamics, but they are fundamentally hand-crafted summaries that impose a fixed functional form on how the past influences the present. A 10-game rolling average assumes all 10 games contribute equally and that only the last 10 games matter. An exponentially weighted mean assumes a specific decay pattern.
Recurrent neural networks (RNNs) learn their own temporal aggregation functions from data. Given a sequence of game-by-game observations, an RNN can learn to weight recent games more heavily, to detect momentum patterns, to recognize regime changes (e.g., before and after a major trade), and to incorporate any temporal pattern that is predictive of the target --- without being told what patterns to look for.
Recurrent Neural Networks and the Vanishing Gradient Problem
A basic RNN processes a sequence $x_1, x_2, \ldots, x_T$ by maintaining a hidden state $h_t$ that evolves over time:
$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$
The hidden state at time $T$ encodes information from the entire sequence and can be used for prediction. However, basic RNNs suffer from the vanishing gradient problem: during backpropagation through time, gradients must be multiplied across $T$ time steps, and they tend to shrink exponentially, making it nearly impossible to learn long-range dependencies.
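Written out, one step of this recurrence is just two matrix-vector products and a tanh. A minimal sketch with made-up dimensions (6 game-level stats, 4 hidden units, a 10-game sequence):

```python
import torch

torch.manual_seed(0)
n_features, hidden_size = 6, 4  # e.g., 6 game-level stats, 4 hidden units

W_h = torch.randn(hidden_size, hidden_size) * 0.1
W_x = torch.randn(hidden_size, n_features) * 0.1
b = torch.zeros(hidden_size)

h = torch.zeros(hidden_size)            # h_0: empty memory
sequence = torch.randn(10, n_features)  # 10 games of features

for x_t in sequence:
    h = torch.tanh(W_h @ h + W_x @ x_t + b)  # h_t from h_{t-1} and x_t

print(h.shape)  # torch.Size([4])
```

The final `h` summarizes the whole sequence; the vanishing-gradient problem arises because backpropagating through this loop multiplies gradients by `W_h` (and the tanh derivative) once per time step.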
LSTM Cells
Long Short-Term Memory (LSTM) cells solve the vanishing gradient problem by introducing a cell state $c_t$ and a system of gates that control information flow. The gates --- forget gate, input gate, and output gate --- learn when to retain, when to update, and when to output information:
Forget gate: Decides what information to discard from the cell state: $$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
Input gate: Decides what new information to store: $$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$ $$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
Cell state update: Combines forgetting and new input: $$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
Output gate: Decides what to output from the cell state: $$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$ $$h_t = o_t \odot \tanh(c_t)$$
where $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication. The cell state $c_t$ acts as a "conveyor belt" that can carry information across many time steps with minimal gradient degradation, because the cell state update involves addition rather than multiplication.
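The gate equations can be checked against PyTorch's `nn.LSTMCell`. Note that PyTorch keeps separate input and hidden weight matrices rather than the concatenated $[h_{t-1}, x_t]$ form used above (the two are mathematically equivalent), and stacks the gates in the order input, forget, cell candidate, output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_dim, hidden_dim = 5, 3
cell = nn.LSTMCell(input_dim, hidden_dim)

x = torch.randn(1, input_dim)
h0 = torch.zeros(1, hidden_dim)
c0 = torch.zeros(1, hidden_dim)

# Manual gate computation using the cell's own weights (gate order: i, f, g, o)
gates = x @ cell.weight_ih.T + cell.bias_ih + h0 @ cell.weight_hh.T + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)
c1 = torch.sigmoid(f) * c0 + torch.sigmoid(i) * torch.tanh(g)  # cell state update
h1 = torch.sigmoid(o) * torch.tanh(c1)                         # output gate

h_ref, c_ref = cell(x, (h0, c0))
print(torch.allclose(h1, h_ref, atol=1e-6))  # True
```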
Modeling Team Performance Trajectories
For sports prediction, we model each team's season as a sequence of game-level feature vectors. The LSTM processes this sequence game by game, building an internal representation of the team's current state that integrates all past observations.
class SportsSequenceDataset(Dataset):
"""
Dataset for sequential sports data (LSTM input).
For each prediction target (a game), constructs a sequence of the
team's previous N games as input. This creates a 3D tensor of shape
(n_samples, sequence_length, n_features) suitable for LSTM processing.
Args:
game_data: DataFrame with one row per team per game, sorted by
date within each team.
team_col: Column identifying the team.
feature_cols: List of feature column names.
target_col: Name of the target column.
sequence_length: Number of previous games to include in each
sequence. Games with fewer than sequence_length prior games
are padded with zeros.
"""
def __init__(
self,
game_data: pd.DataFrame,
team_col: str,
feature_cols: list[str],
target_col: str,
sequence_length: int = 10,
):
self.sequence_length = sequence_length
self.sequences = []
self.targets = []
for team, team_data in game_data.groupby(team_col):
team_data = team_data.sort_index()
features_array = team_data[feature_cols].values
targets_array = team_data[target_col].values
for i in range(len(team_data)):
# Get sequence of previous games
start_idx = max(0, i - sequence_length)
seq = features_array[start_idx:i]
# Pad if insufficient history
if len(seq) < sequence_length:
padding = np.zeros(
(sequence_length - len(seq), len(feature_cols))
)
seq = np.vstack([padding, seq]) if len(seq) > 0 else padding
self.sequences.append(seq)
self.targets.append(targets_array[i])
self.sequences = torch.FloatTensor(np.array(self.sequences))
self.targets = torch.FloatTensor(np.array(self.targets))
def __len__(self) -> int:
return len(self.targets)
def __getitem__(self, idx: int):
return self.sequences[idx], self.targets[idx]
class SportsLSTM(nn.Module):
"""
LSTM model for sequential sports prediction.
Processes a sequence of game-level feature vectors through an LSTM
layer, then passes the final hidden state through feedforward layers
for prediction. Optionally uses bidirectional LSTM and multi-layer
stacking for increased capacity.
Architecture:
Input sequence -> LSTM -> Dropout -> Linear -> ReLU ->
Linear -> Sigmoid
Args:
input_dim: Number of features per time step.
hidden_dim: Number of LSTM hidden units.
num_layers: Number of stacked LSTM layers.
dropout: Dropout rate between LSTM layers and in the
feedforward head.
bidirectional: If True, uses bidirectional LSTM (processes
sequence in both directions).
fc_dims: List of feedforward layer sizes after the LSTM.
"""
def __init__(
self,
input_dim: int,
hidden_dim: int = 64,
num_layers: int = 2,
dropout: float = 0.3,
bidirectional: bool = False,
fc_dims: list[int] = [64, 32],
):
super().__init__()
self.hidden_dim = hidden_dim
self.num_layers = num_layers
self.bidirectional = bidirectional
self.num_directions = 2 if bidirectional else 1
self.lstm = nn.LSTM(
input_size=input_dim,
hidden_size=hidden_dim,
num_layers=num_layers,
batch_first=True,
dropout=dropout if num_layers > 1 else 0,
bidirectional=bidirectional,
)
lstm_output_dim = hidden_dim * self.num_directions
# Feedforward head
fc_layers = []
prev_dim = lstm_output_dim
for fc_dim in fc_dims:
fc_layers.extend([
nn.Linear(prev_dim, fc_dim),
nn.ReLU(),
nn.Dropout(dropout),
])
prev_dim = fc_dim
fc_layers.append(nn.Linear(prev_dim, 1))
fc_layers.append(nn.Sigmoid())
self.fc_head = nn.Sequential(*fc_layers)
def forward(
self,
x: torch.Tensor,
hidden: Optional[tuple] = None,
) -> torch.Tensor:
"""
Forward pass through LSTM and feedforward head.
Args:
x: Input tensor of shape (batch_size, seq_len, input_dim).
hidden: Optional initial hidden state. If None, initialized
to zeros.
Returns:
Predictions of shape (batch_size,).
"""
# LSTM forward pass
lstm_out, (h_n, c_n) = self.lstm(x, hidden)
# Use the final hidden state from the last layer
if self.bidirectional:
# Concatenate forward and backward final hidden states
final_hidden = torch.cat(
(h_n[-2], h_n[-1]), dim=1
)
else:
final_hidden = h_n[-1]
# Feedforward prediction head
output = self.fc_head(final_hidden).squeeze(-1)
return output
Modeling Player Career Arcs
The same LSTM architecture can model individual player performance trajectories. By processing a player's game-by-game statistics as a sequence, the LSTM can learn career arc patterns --- improvement curves for young players, prime-year plateaus, and decline patterns for aging players. This is particularly valuable for projecting player performance in upcoming seasons.
class PlayerCareerLSTM(nn.Module):
"""
LSTM for modeling player career trajectories.
Processes a player's season-by-season (or game-by-game) statistics
to predict future performance. Includes an attention mechanism that
allows the model to focus on the most relevant past seasons.
The attention-weighted approach is superior to using only the final
hidden state, because a player's development trajectory (the pattern
of improvement or decline across seasons) is as informative as their
most recent performance.
Args:
input_dim: Number of features per time step (season or game).
hidden_dim: LSTM hidden dimension.
num_layers: Number of stacked LSTM layers.
dropout: Dropout rate.
use_attention: If True, applies attention over LSTM outputs
instead of using only the final hidden state.
"""
def __init__(
self,
input_dim: int,
hidden_dim: int = 32,
num_layers: int = 1,
dropout: float = 0.2,
use_attention: bool = True,
):
super().__init__()
self.use_attention = use_attention
self.hidden_dim = hidden_dim
self.lstm = nn.LSTM(
input_size=input_dim,
hidden_size=hidden_dim,
num_layers=num_layers,
batch_first=True,
dropout=dropout if num_layers > 1 else 0,
)
if use_attention:
self.attention_weights = nn.Linear(hidden_dim, 1)
self.output_head = nn.Sequential(
nn.Linear(hidden_dim, 16),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(16, 1),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Forward pass with optional attention mechanism.
Args:
x: Shape (batch_size, seq_len, input_dim).
Returns:
Predictions of shape (batch_size,).
"""
lstm_out, _ = self.lstm(x) # (batch, seq_len, hidden_dim)
if self.use_attention:
# Compute attention scores
attn_scores = self.attention_weights(lstm_out) # (batch, seq, 1)
attn_weights = torch.softmax(attn_scores, dim=1)
# Weighted sum of LSTM outputs
context = (lstm_out * attn_weights).sum(dim=1) # (batch, hidden)
else:
context = lstm_out[:, -1, :] # Use final hidden state
return self.output_head(context).squeeze(-1)
Intuition: Think of an LSTM as a team scout with perfect memory. As it watches each game in sequence, it decides what to remember (input gate), what to forget (forget gate), and what to report (output gate). A skilled scout does not simply average the last 10 games --- they recognize that a blowout loss in a back-to-back road game against a top team is less informative than a close home loss to a mediocre opponent. The LSTM learns these contextual judgments from data.
29.3 Embeddings for Categorical Sports Data
The Problem with One-Hot Encoding
Sports data contains many categorical variables: team identifiers, player IDs, venue names, referee assignments, and so on. One-hot encoding converts each category to a binary vector, but this approach has severe limitations for sports data:
- High dimensionality: With 450+ active NBA players, one-hot encoding creates 450+ sparse features, each providing minimal signal.
- No similarity structure: One-hot vectors treat every pair of categories as equally dissimilar. But teams and players have rich similarity structures --- the Golden State Warriors and the Sacramento Kings play in the same division, share a regional market, and face largely overlapping schedules.
- No generalization: If a model learns that Player A scores efficiently, it cannot transfer that knowledge to the statistically similar Player B.
Entity Embeddings
Entity embeddings address these limitations by mapping each categorical value to a dense, low-dimensional vector that is learned during training. Instead of a 450-dimensional one-hot vector, each player is represented by, say, a 16-dimensional dense vector. These learned representations capture meaningful similarities: players with similar playing styles will have similar embedding vectors, and teams with similar characteristics will cluster in embedding space.
Formally, an embedding is a learned lookup table $E \in \mathbb{R}^{|C| \times d}$ where $|C|$ is the number of categories and $d$ is the embedding dimension. For category index $i$, the embedding vector is the $i$-th row of $E$:
$$e_i = E[i] \in \mathbb{R}^d$$
The embedding matrix $E$ is initialized randomly and trained via backpropagation alongside the rest of the network. The gradients from the prediction loss update the embedding vectors, pushing similar entities toward similar representations.
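The lookup-table view is literally how `nn.Embedding` works: indexing row $i$ of $E$ is equivalent to multiplying a one-hot vector by $E$, just without materializing the sparse vector:

```python
import torch
import torch.nn as nn

n_teams, d = 30, 8              # 30 NBA teams, 8-dimensional embeddings
emb = nn.Embedding(n_teams, d)  # the learned lookup table E

team_idx = torch.tensor([0, 17, 29])  # integer-encoded team IDs
vectors = emb(team_idx)               # row lookup: shape (3, 8)

# Equivalent (but wasteful) one-hot formulation: one_hot(i) @ E
one_hot = torch.nn.functional.one_hot(team_idx, n_teams).float()
print(torch.allclose(vectors, one_hot @ emb.weight))  # True
```

Because the lookup is differentiable with respect to `emb.weight`, gradients from the prediction loss flow into exactly the rows that were used in each batch.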
Choosing Embedding Dimensions
A common heuristic for embedding dimension is:
$$d = \min\left(50, \left\lceil \frac{|C|}{2} \right\rceil\right)$$
For 30 NBA teams, this gives $d = 15$. For 450 players, $d = 50$. In practice, the optimal dimension should be tuned as a hyperparameter, but this heuristic provides a reasonable starting point.
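As a one-line helper (the function name is ours, not a library API):

```python
import math

def embedding_dim(n_categories: int, cap: int = 50) -> int:
    """Heuristic embedding dimension: min(50, ceil(|C| / 2))."""
    return min(cap, math.ceil(n_categories / 2))

print(embedding_dim(30))   # 15  (NBA teams)
print(embedding_dim(450))  # 50  (active players, capped at 50)
```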
Implementation
class SportsEmbeddingNet(nn.Module):
"""
Neural network with entity embeddings for categorical sports variables.
Combines learned embeddings for teams, players, and venues with
continuous features in a single feedforward architecture. The
embeddings are concatenated with the continuous features and passed
through hidden layers for prediction.
This architecture is inspired by the entity embedding approach
described by Guo and Berkhahn (2016) and popularized by the
winning entries in several Kaggle competitions.
Args:
n_continuous: Number of continuous input features.
embedding_specs: List of (n_categories, embedding_dim) tuples,
one for each categorical variable. n_categories should
include +1 for an unknown/padding category.
hidden_dims: Sizes of hidden layers.
dropout_rate: Dropout probability.
"""
def __init__(
self,
n_continuous: int,
embedding_specs: list[tuple[int, int]],
hidden_dims: list[int] = [256, 128, 64],
dropout_rate: float = 0.3,
):
super().__init__()
# Create embedding layers
self.embeddings = nn.ModuleList([
nn.Embedding(n_cats, emb_dim)
for n_cats, emb_dim in embedding_specs
])
# Total input dimension
total_emb_dim = sum(emb_dim for _, emb_dim in embedding_specs)
total_input_dim = n_continuous + total_emb_dim
# Feedforward layers
layers = []
prev_dim = total_input_dim
for hidden_dim in hidden_dims:
layers.extend([
nn.Linear(prev_dim, hidden_dim),
nn.BatchNorm1d(hidden_dim),
nn.ReLU(),
nn.Dropout(dropout_rate),
])
prev_dim = hidden_dim
layers.append(nn.Linear(prev_dim, 1))
layers.append(nn.Sigmoid())
self.fc = nn.Sequential(*layers)
# Initialize embeddings with small random values
for emb in self.embeddings:
nn.init.normal_(emb.weight, mean=0, std=0.01)
def forward(
self,
x_continuous: torch.Tensor,
x_categorical: list[torch.Tensor],
) -> torch.Tensor:
"""
Forward pass combining embeddings and continuous features.
Args:
x_continuous: Continuous features, shape (batch, n_continuous).
x_categorical: List of categorical index tensors, each of
shape (batch,). One tensor per categorical variable.
Returns:
Predictions of shape (batch,).
"""
# Look up embeddings
embedded = [
emb(cat_tensor)
for emb, cat_tensor in zip(self.embeddings, x_categorical)
]
# Concatenate embeddings and continuous features
combined = torch.cat([x_continuous] + embedded, dim=1)
return self.fc(combined).squeeze(-1)
class SportsEmbeddingDataset(Dataset):
"""
Dataset combining continuous and categorical features for embedding models.
Handles the separation of continuous features (as float tensors)
and categorical features (as long tensors for embedding lookup).
Args:
continuous_features: DataFrame of continuous features.
categorical_features: DataFrame of categorical features (integer-encoded).
targets: Series of target values.
"""
def __init__(
self,
continuous_features: pd.DataFrame,
categorical_features: pd.DataFrame,
targets: pd.Series,
):
self.continuous = torch.FloatTensor(continuous_features.values)
self.categorical = [
torch.LongTensor(categorical_features[col].values)
for col in categorical_features.columns
]
self.targets = torch.FloatTensor(targets.values)
def __len__(self) -> int:
return len(self.targets)
def __getitem__(self, idx: int):
return (
self.continuous[idx],
[cat[idx] for cat in self.categorical],
self.targets[idx],
)
Transfer Learning Across Seasons
One of the most powerful applications of entity embeddings in sports is transfer learning across seasons. Team and player embeddings learned from one season capture latent representations of playing style, ability, and role that are partially transferable to the next season.
The transfer process works as follows:
- Train the model on Season $N$ data, learning embeddings for all teams and players.
- Initialize Season $N+1$ embeddings with the learned Season $N$ values for returning teams and players.
- Initialize new entities (expansion teams, rookies, traded players on new teams) with the average embedding or the embedding of a similar entity.
- Fine-tune the model on Season $N+1$ data with a reduced learning rate, allowing embeddings to adapt to new circumstances while retaining prior knowledge.
def transfer_embeddings(
old_model: SportsEmbeddingNet,
new_model: SportsEmbeddingNet,
entity_mapping: dict[int, int],
embedding_index: int = 0,
) -> None:
"""
Transfer learned embeddings from an old model to a new model.
For entities that exist in both seasons (returning teams/players),
copies the learned embedding vectors. For new entities, initializes
with the mean of all old embeddings (a reasonable prior for an
unknown entity).
Args:
old_model: Model trained on previous season.
new_model: Model for the new season (with potentially different
number of entities).
entity_mapping: Dict mapping old entity indices to new entity
indices. Entities present in the new model but absent from
this mapping are initialized with the old model's mean
embedding.
embedding_index: Index into the model's embedding list
(0 for the first categorical variable, etc.).
"""
old_emb = old_model.embeddings[embedding_index]
new_emb = new_model.embeddings[embedding_index]
# Compute mean of old embeddings as default for new entities
mean_embedding = old_emb.weight.data.mean(dim=0)
with torch.no_grad():
# Initialize all new embeddings with the old mean
new_emb.weight.data[:] = mean_embedding.unsqueeze(0).expand_as(
new_emb.weight.data
)
# Copy embeddings for returning entities
for old_idx, new_idx in entity_mapping.items():
if old_idx < old_emb.num_embeddings and new_idx < new_emb.num_embeddings:
new_emb.weight.data[new_idx] = old_emb.weight.data[old_idx]
transferred = len(entity_mapping)
total = new_emb.num_embeddings
print(
f"Transferred {transferred}/{total} embeddings "
f"({transferred / total:.1%})"
)
Common Pitfall: Entity embeddings require sufficient data to learn meaningful representations. If a player appears in only 5 games, their embedding will be poorly determined and may overfit. Apply minimum appearance thresholds (e.g., 20+ games) and group rare entities into a shared "other" category. Alternatively, use regularization (weight decay on the embedding layer) to shrink low-data embeddings toward zero, effectively defaulting to the continuous features alone.
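The "other" category approach can be sketched as follows. This is a minimal illustration (the helper name `encode_with_rare_grouping` and the threshold are our own, not part of the chapter's pipeline): categories below the appearance threshold collapse to a shared index 0, which also serves as the unknown/padding slot the embedding layer reserves.

```python
import pandas as pd

def encode_with_rare_grouping(
    values: pd.Series, min_count: int = 20
) -> tuple[pd.Series, dict]:
    """Integer-encode a categorical column, mapping rare categories
    (fewer than min_count appearances) to a shared index 0.

    Returns the encoded series and the category -> index mapping
    for frequent categories.
    """
    counts = values.value_counts()
    frequent = counts[counts >= min_count].index
    # Index 0 is reserved for the grouped "other"/unknown category
    mapping = {cat: i + 1 for i, cat in enumerate(frequent)}
    encoded = values.map(mapping).fillna(0).astype(int)
    return encoded, mapping

# Toy example: a player with fewer than 3 appearances collapses to index 0
players = pd.Series(
    ["james", "james", "james", "curry", "curry", "curry", "rookie"]
)
encoded, mapping = encode_with_rare_grouping(players, min_count=3)
```

At embedding-model time, the grouped index shares one embedding vector across all rare entities, so no single low-data player gets an overfit representation of their own.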
29.4 Practical PyTorch Implementation
Complete Training Pipeline
This section assembles all the components from previous sections into a complete, production-ready PyTorch pipeline for sports prediction. The pipeline handles data loading, model construction, training with logging, evaluation, and model persistence.
import json
import os
from pathlib import Path
from datetime import datetime
class SportsModelTrainer:
"""
Complete training pipeline for sports prediction neural networks.
Handles data preparation, model training, evaluation, and
persistence in a single, configurable class. Supports both
feedforward and LSTM architectures.
Args:
model: The PyTorch model to train.
model_dir: Directory for saving model checkpoints and logs.
device: Device to train on ('cuda', 'cpu', or 'auto').
"""
def __init__(
self,
model: nn.Module,
model_dir: str = "./models",
device: str = "auto",
):
if device == "auto":
self.device = torch.device(
"cuda" if torch.cuda.is_available() else "cpu"
)
else:
self.device = torch.device(device)
self.model = model.to(self.device)
self.model_dir = Path(model_dir)
self.model_dir.mkdir(parents=True, exist_ok=True)
self.training_log: list[dict] = []
def create_data_loaders(
self,
train_dataset: Dataset,
val_dataset: Dataset,
batch_size: int = 64,
num_workers: int = 0,
) -> tuple[DataLoader, DataLoader]:
"""
Create training and validation DataLoaders with proper settings.
Training data is shuffled; validation data is not. Pin memory
is enabled when using GPU for faster data transfer.
"""
pin_memory = self.device.type == "cuda"
train_loader = DataLoader(
train_dataset,
batch_size=batch_size,
shuffle=True,
num_workers=num_workers,
pin_memory=pin_memory,
drop_last=True, # Avoids batch norm issues with batch_size=1
)
val_loader = DataLoader(
val_dataset,
batch_size=batch_size * 4, # Larger batches for faster eval
shuffle=False,
num_workers=num_workers,
pin_memory=pin_memory,
)
return train_loader, val_loader
def train(
self,
train_loader: DataLoader,
val_loader: DataLoader,
n_epochs: int = 100,
learning_rate: float = 1e-3,
weight_decay: float = 1e-4,
patience: int = 15,
scheduler_patience: int = 5,
max_grad_norm: float = 1.0,
) -> dict:
"""
Full training loop with early stopping, LR scheduling, and logging.
Returns:
Training history dictionary.
"""
criterion = nn.BCELoss()
optimizer = optim.Adam(
self.model.parameters(),
lr=learning_rate,
weight_decay=weight_decay,
)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
optimizer, mode="min", factor=0.5,
patience=scheduler_patience,
)
best_val_loss = float("inf")
best_state = None
epochs_no_improve = 0
history = {
"train_loss": [], "val_loss": [], "val_brier": [],
"learning_rates": [],
}
start_time = datetime.now()
for epoch in range(n_epochs):
# ---- Training ----
self.model.train()
train_loss = 0.0
n_train = 0
for batch in train_loader:
if len(batch) == 2:
features, targets = batch
features = features.to(self.device)
targets = targets.to(self.device)
preds = self.model(features)
elif len(batch) == 3:
continuous, categorical, targets = batch
continuous = continuous.to(self.device)
categorical = [c.to(self.device) for c in categorical]
targets = targets.to(self.device)
preds = self.model(continuous, categorical)
else:
raise ValueError(f"Unexpected batch structure: {len(batch)} elements")
loss = criterion(preds, targets)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(
self.model.parameters(), max_grad_norm,
)
optimizer.step()
train_loss += loss.item() * len(targets)
n_train += len(targets)
avg_train_loss = train_loss / n_train
# ---- Validation ----
val_loss, val_brier = self._evaluate(val_loader, criterion)
# Record history
current_lr = optimizer.param_groups[0]["lr"]
history["train_loss"].append(avg_train_loss)
history["val_loss"].append(val_loss)
history["val_brier"].append(val_brier)
history["learning_rates"].append(current_lr)
scheduler.step(val_loss)
# Early stopping
if val_loss < best_val_loss:
best_val_loss = val_loss
best_state = {
k: v.cpu().clone() for k, v in self.model.state_dict().items()
}
epochs_no_improve = 0
else:
epochs_no_improve += 1
if (epoch + 1) % 10 == 0:
elapsed = (datetime.now() - start_time).total_seconds()
print(
f"Epoch {epoch + 1:3d} | "
f"Train: {avg_train_loss:.4f} | "
f"Val: {val_loss:.4f} | "
f"Brier: {val_brier:.4f} | "
f"LR: {current_lr:.2e} | "
f"Time: {elapsed:.0f}s"
)
if epochs_no_improve >= patience:
print(f"Early stopping at epoch {epoch + 1}")
break
# Restore best model
if best_state is not None:
self.model.load_state_dict(best_state)
self.model = self.model.to(self.device)
history["best_epoch"] = int(np.argmin(history["val_loss"])) + 1
history["best_val_loss"] = float(best_val_loss)
history["total_epochs"] = epoch + 1
return history
def _evaluate(
self, loader: DataLoader, criterion: nn.Module,
) -> tuple[float, float]:
"""Evaluate model on a data loader, returning loss and Brier score."""
self.model.eval()
total_loss = 0.0
all_preds = []
all_targets = []
with torch.no_grad():
for batch in loader:
if len(batch) == 2:
features, targets = batch
features = features.to(self.device)
targets = targets.to(self.device)
preds = self.model(features)
elif len(batch) == 3:
continuous, categorical, targets = batch
continuous = continuous.to(self.device)
categorical = [c.to(self.device) for c in categorical]
targets = targets.to(self.device)
preds = self.model(continuous, categorical)
else:
raise ValueError(f"Unexpected batch structure: {len(batch)} elements")
loss = criterion(preds, targets)
total_loss += loss.item() * len(targets)
all_preds.extend(preds.cpu().numpy())
all_targets.extend(targets.cpu().numpy())
avg_loss = total_loss / len(all_preds)
brier = float(np.mean(
(np.array(all_preds) - np.array(all_targets)) ** 2
))
return avg_loss, brier
def save_model(self, name: str, metadata: dict | None = None) -> str:
"""
Save model checkpoint with metadata.
Saves the model state dict, architecture info, and any
additional metadata to a timestamped directory.
"""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
save_dir = self.model_dir / f"{name}_{timestamp}"
save_dir.mkdir(parents=True, exist_ok=True)
# Save model weights
model_path = save_dir / "model.pt"
torch.save(self.model.state_dict(), model_path)
# Save metadata
meta = {
"name": name,
"timestamp": timestamp,
"model_class": self.model.__class__.__name__,
"n_parameters": sum(p.numel() for p in self.model.parameters()),
"device": str(self.device),
}
if metadata:
meta.update(metadata)
with open(save_dir / "metadata.json", "w") as f:
json.dump(meta, f, indent=2)
print(f"Model saved to {save_dir}")
return str(save_dir)
@staticmethod
def load_model(
model: nn.Module,
checkpoint_dir: str,
device: str = "auto",
) -> nn.Module:
"""
Load a model from a checkpoint directory.
Args:
model: An uninitialized model with the same architecture.
checkpoint_dir: Path to the checkpoint directory.
device: Device to load the model onto.
Returns:
The model with loaded weights.
"""
if device == "auto":
device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = Path(checkpoint_dir) / "model.pt"
state_dict = torch.load(model_path, map_location=device)
model.load_state_dict(state_dict)
model = model.to(device)
model.eval()
return model
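The save/load contract above is worth verifying in isolation: a restored model must reproduce the original's predictions exactly. This minimal sketch uses a tiny stand-in `nn.Sequential` network rather than the chapter's classes, but the same round trip applies to any `nn.Module`.

```python
import tempfile
from pathlib import Path

import torch
import torch.nn as nn

torch.manual_seed(0)
# A tiny stand-in network; any nn.Module works the same way
model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid()
)

checkpoint_dir = Path(tempfile.mkdtemp())
torch.save(model.state_dict(), checkpoint_dir / "model.pt")

# Rebuild the same architecture, then load the saved weights into it
restored = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid()
)
restored.load_state_dict(
    torch.load(checkpoint_dir / "model.pt", map_location="cpu")
)
model.eval()
restored.eval()

# state_dict round trips are bit-exact: identical outputs on any input
x = torch.randn(5, 4)
with torch.no_grad():
    assert torch.equal(model(x), restored(x))
```

Note that `load_state_dict` requires an already-constructed model with matching architecture, which is why `load_model` above takes a model instance as its first argument.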
GPU Training Considerations
Modern sports prediction datasets are small enough to train on a CPU in minutes. However, when you scale to larger datasets (play-by-play data, player tracking data) or use complex architectures (deep LSTMs, attention models), GPU acceleration becomes valuable.
Key considerations for GPU training:
- Data transfer overhead: Moving data between CPU and GPU takes time. Use `pin_memory=True` in DataLoaders and transfer entire batches at once.
- Batch size: Larger batches utilize GPU parallelism better. Increase batch size until GPU memory is fully utilized.
- Mixed precision: For larger models, `torch.cuda.amp` enables half-precision (FP16) training, roughly doubling throughput at minimal accuracy cost.
- Multi-GPU: For very large experiments (e.g., hyperparameter sweeps), `torch.nn.DataParallel` or `torch.distributed` enables multi-GPU training.
def setup_gpu_training(model: nn.Module) -> tuple[nn.Module, torch.device]:
"""
Configure model for GPU training with automatic device detection.
Handles CUDA availability checking, model transfer to GPU, and
optional DataParallel wrapping for multi-GPU systems.
Returns:
Tuple of (configured model, device).
"""
if torch.cuda.is_available():
device = torch.device("cuda")
n_gpus = torch.cuda.device_count()
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
print(f"Available GPUs: {n_gpus}")
print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
if n_gpus > 1:
model = nn.DataParallel(model)
print(f"Using DataParallel across {n_gpus} GPUs")
model = model.to(device)
else:
device = torch.device("cpu")
print("No GPU available. Training on CPU.")
model = model.to(device)
return model, device
Real-World Application: For a typical NBA season prediction model with 12,000 games and 50 features, a feedforward network trains in under 30 seconds on a modern CPU. An LSTM processing 10-game sequences trains in 2--5 minutes. GPU acceleration becomes important when you scale to play-by-play data (millions of rows), when you add entity embeddings for hundreds of players, or when you run hyperparameter searches with hundreds of trials.
29.5 Hyperparameter Tuning and Architecture Search
The Hyperparameter Challenge
Neural networks have many hyperparameters that significantly affect performance: learning rate, batch size, hidden layer sizes, number of layers, dropout rate, weight decay, and more. Unlike model weights, hyperparameters are not learned during training --- they must be set before training begins. Finding good hyperparameter values is critical, and doing it manually is both tedious and unreliable.
Grid Search and Random Search
Grid search evaluates every combination of hyperparameter values from predefined grids. For example, if you test 3 learning rates, 3 dropout rates, and 3 hidden sizes, grid search trains 27 models. This is exhaustive but scales exponentially with the number of hyperparameters.
Random search samples hyperparameter combinations randomly from specified distributions. Research by Bergstra and Bengio (2012) showed that random search is more efficient than grid search because it explores the search space more uniformly, avoiding the redundancy of grid search when some hyperparameters are more important than others.
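To make the contrast concrete, here is a minimal random-search sampler (the function name and ranges are illustrative, not from the chapter). With the same budget of 27 trials that a 3x3x3 grid spends on three hyperparameters, random search explores five, each at 27 distinct values rather than 3. Learning rate and weight decay are drawn log-uniformly, since their effect is roughly multiplicative.

```python
import random

random.seed(42)

def sample_config() -> dict:
    """Draw one hyperparameter configuration at random."""
    return {
        # log-uniform in [1e-5, 1e-2]
        "learning_rate": 10 ** random.uniform(-5, -2),
        "weight_decay": 10 ** random.uniform(-6, -2),
        "dropout": random.uniform(0.1, 0.5),
        "hidden_dim": random.choice([32, 64, 128, 256, 512]),
        "batch_size": random.choice([32, 64, 128, 256]),
    }

# Same 27-trial budget as a 3x3x3 grid, but covering 5 hyperparameters
configs = [sample_config() for _ in range(27)]
```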
Optuna for Bayesian Optimization
Optuna is a modern hyperparameter optimization framework that uses Bayesian optimization --- specifically, the Tree-structured Parzen Estimator (TPE) --- to intelligently explore the hyperparameter space. Unlike grid or random search, Optuna uses the results of previous trials to decide which configurations to try next, focusing on promising regions of the search space.
import optuna
from optuna.trial import Trial
def create_optuna_objective(
train_dataset: Dataset,
val_dataset: Dataset,
input_dim: int,
n_epochs: int = 50,
device: str = "auto",
):
"""
Create an Optuna objective function for neural architecture search.
The objective function is called by Optuna for each trial. It
samples hyperparameters, constructs a model, trains it, and
returns the validation loss. Optuna uses this information to
decide which hyperparameter combinations to try next.
Args:
train_dataset: Training dataset.
val_dataset: Validation dataset.
input_dim: Number of input features.
n_epochs: Maximum training epochs per trial.
device: Training device.
Returns:
An objective function compatible with optuna.study.optimize().
"""
if device == "auto":
device = "cuda" if torch.cuda.is_available() else "cpu"
def objective(trial: Trial) -> float:
# Sample architecture hyperparameters
n_layers = trial.suggest_int("n_layers", 1, 4)
hidden_dims = []
for i in range(n_layers):
dim = trial.suggest_categorical(
f"hidden_dim_{i}", [32, 64, 128, 256, 512]
)
hidden_dims.append(dim)
dropout = trial.suggest_float("dropout", 0.1, 0.5)
learning_rate = trial.suggest_float(
"learning_rate", 1e-5, 1e-2, log=True,
)
weight_decay = trial.suggest_float(
"weight_decay", 1e-6, 1e-2, log=True,
)
batch_size = trial.suggest_categorical(
"batch_size", [32, 64, 128, 256]
)
# Build model
model = SportsFeedforwardNet(
input_dim=input_dim,
hidden_dims=hidden_dims,
dropout_rate=dropout,
).to(device)
# Create data loaders
train_loader = DataLoader(
train_dataset, batch_size=batch_size,
shuffle=True, drop_last=True,
)
val_loader = DataLoader(
val_dataset, batch_size=256, shuffle=False,
)
# Train
criterion = nn.BCELoss()
optimizer = optim.Adam(
model.parameters(), lr=learning_rate,
weight_decay=weight_decay,
)
best_val_loss = float("inf")
patience_counter = 0
patience = 10
for epoch in range(n_epochs):
# Training
model.train()
for features, targets in train_loader:
features = features.to(device)
targets = targets.to(device)
optimizer.zero_grad()
preds = model(features)
loss = criterion(preds, targets)
loss.backward()
torch.nn.utils.clip_grad_norm_(
model.parameters(), 1.0,
)
optimizer.step()
# Validation
model.eval()
val_losses = []
with torch.no_grad():
for features, targets in val_loader:
features = features.to(device)
targets = targets.to(device)
preds = model(features)
loss = criterion(preds, targets)
val_losses.append(loss.item())
val_loss = np.mean(val_losses)
# Report intermediate value for pruning
trial.report(val_loss, epoch)
if trial.should_prune():
raise optuna.TrialPruned()
# Early stopping within trial
if val_loss < best_val_loss:
best_val_loss = val_loss
patience_counter = 0
else:
patience_counter += 1
if patience_counter >= patience:
break
return best_val_loss
return objective
def run_hyperparameter_search(
train_dataset: Dataset,
val_dataset: Dataset,
input_dim: int,
n_trials: int = 100,
n_epochs: int = 50,
study_name: str = "sports_nn_optimization",
) -> optuna.Study:
"""
Run a complete Optuna hyperparameter search.
Uses the Median Pruner to terminate unpromising trials early
(if a trial's intermediate value is worse than the median of
previous trials at the same step, it is pruned).
Args:
train_dataset: Training dataset.
val_dataset: Validation dataset.
input_dim: Number of input features.
n_trials: Number of hyperparameter combinations to try.
n_epochs: Maximum epochs per trial.
study_name: Name for the Optuna study (for logging).
Returns:
The completed Optuna study with all trial results.
"""
objective = create_optuna_objective(
train_dataset, val_dataset, input_dim, n_epochs,
)
study = optuna.create_study(
study_name=study_name,
direction="minimize",
pruner=optuna.pruners.MedianPruner(
n_startup_trials=10,
n_warmup_steps=5,
),
)
study.optimize(objective, n_trials=n_trials, show_progress_bar=True)
# Report results
print(f"\nBest trial:")
print(f" Value (val loss): {study.best_trial.value:.4f}")
print(f" Params:")
for key, value in study.best_trial.params.items():
print(f" {key}: {value}")
return study
Early Stopping and Regularization
Neural networks for small sports datasets are highly prone to overfitting. Multiple regularization techniques should be used in combination:
- Early stopping: Monitor validation loss and stop training when it begins to increase. This is the single most important regularization technique.
- Dropout: Randomly zero a fraction of neuron outputs during training. Standard rates for sports models: 0.2--0.5.
- Weight decay (L2 regularization): Penalize large weights by adding $\lambda \|W\|^2$ to the loss. Typical values: $10^{-4}$ to $10^{-2}$.
- Batch normalization: Normalizes layer inputs, which acts as an implicit regularizer and stabilizes training.
- Data augmentation: In sports, this might mean adding noise to features, or training on bootstrapped samples.
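The noise-injection form of data augmentation can be sketched in a few lines (the helper name `augment_with_noise` and the 0.05 noise level are illustrative). Because the inputs are standardized to unit variance, `noise_std` is directly interpretable as a fraction of each feature's spread; applying a fresh draw each epoch means the network never sees exactly the same inputs twice.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_with_noise(X: np.ndarray, noise_std: float = 0.05) -> np.ndarray:
    """Return a copy of standardized features with Gaussian noise added."""
    return X + rng.normal(0.0, noise_std, size=X.shape)

# Standardized feature matrix: 1000 games x 20 features
X_train = rng.standard_normal((1000, 20))
# Call once per epoch so each pass sees a different perturbation
X_noisy = augment_with_noise(X_train, noise_std=0.05)
```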
Practical Tips for Sports Neural Networks
| Aspect | Recommendation |
|---|---|
| Architecture | Start simple: 2--3 hidden layers, 64--256 neurons each |
| Activation | ReLU for hidden layers, Sigmoid for binary output |
| Optimizer | Adam with default betas (0.9, 0.999) |
| Learning rate | Start at $10^{-3}$; use scheduler to reduce on plateau |
| Batch size | 32--128 for typical sports datasets |
| Regularization | Dropout (0.3) + weight decay ($10^{-4}$) + early stopping |
| Feature scaling | Always standardize continuous inputs (zero mean, unit variance) |
| Initialization | Default PyTorch initialization (Kaiming) is usually sufficient |
| Embeddings | Use for categorical variables with 5+ categories |
| LSTM sequence length | 5--15 games; longer sequences rarely help and slow training |
| Ensemble | Train 5--10 models with different random seeds; average predictions |
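The seed-ensemble recommendation from the table can be sketched as follows. This is a toy-scale illustration (the helper name, tiny network, and synthetic data are our own): each seed produces a different initialization, and averaging the members' sigmoid probabilities reduces the variance that random initialization and minibatch ordering introduce.

```python
import torch
import torch.nn as nn

def seed_ensemble_predict(X, y, seeds=(0, 1, 2, 3, 4), epochs=20):
    """Train one small network per seed and average their probabilities."""
    all_preds = []
    for seed in seeds:
        torch.manual_seed(seed)  # different initialization per member
        model = nn.Sequential(
            nn.Linear(X.shape[1], 16), nn.ReLU(), nn.Linear(16, 1)
        )
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        criterion = nn.BCEWithLogitsLoss()
        for _ in range(epochs):  # toy full-batch training loop
            opt.zero_grad()
            loss = criterion(model(X).squeeze(-1), y)
            loss.backward()
            opt.step()
        with torch.no_grad():
            all_preds.append(torch.sigmoid(model(X).squeeze(-1)))
    # Average probabilities (not logits) across ensemble members
    return torch.stack(all_preds).mean(dim=0)

torch.manual_seed(123)
X = torch.randn(200, 8)
y = (X[:, 0] > 0).float()  # synthetic binary target
ensemble_probs = seed_ensemble_predict(X, y)
```

In practice each member would be trained with the full pipeline above (early stopping, scheduling) on the real training set, with averaging applied to held-out predictions.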
Common Pitfall: A common mistake is to over-invest in architecture complexity before establishing strong baselines. A simple 2-layer feedforward network with well-engineered features from Chapter 28 will outperform a complex LSTM with poor features in nearly every case. The correct order of operations is: (1) engineer excellent features, (2) establish a tree-model baseline, (3) build a simple neural network, (4) incrementally add complexity (embeddings, LSTM, attention) only when each addition demonstrably improves validation performance.
29.6 Chapter Summary
Key Concepts
- Feedforward neural networks apply layers of weighted linear transformations followed by nonlinear activations. For tabular sports data, they serve as flexible function approximators that can capture complex feature interactions.
- Backpropagation computes gradients of the loss with respect to all network weights using the chain rule, enabling efficient gradient-based optimization. The Adam optimizer adapts learning rates per-parameter and is the standard choice for sports prediction networks.
- LSTM networks process sequential data (game-by-game statistics) through gated cells that learn what to remember, forget, and output at each time step. They can discover temporal patterns that hand-crafted rolling features may miss.
- Entity embeddings learn dense, low-dimensional vector representations of categorical variables (teams, players, venues). Similar entities cluster in embedding space, enabling generalization across categories and transfer learning across seasons.
- The training pipeline includes data loading (with proper batching and shuffling), forward pass, loss computation, backpropagation, gradient clipping, optimizer step, validation evaluation, learning rate scheduling, and early stopping.
- Hyperparameter tuning with Optuna uses Bayesian optimization to efficiently search over architectures and training configurations, with pruning to terminate unpromising trials early.
- Regularization through dropout, weight decay, batch normalization, and early stopping is essential for sports datasets, which are typically small by deep learning standards. Multiple regularization techniques should be combined.
- Neural networks complement rather than replace tree-based models for sports prediction. Their primary advantages are entity embeddings, sequential modeling, and multi-task learning. For simple tabular prediction with well-engineered features, tree models remain competitive.
Key Formulas
| Formula | Description |
|---|---|
| $a^{(l)} = \sigma(W^{(l)} a^{(l-1)} + b^{(l)})$ | Feedforward layer computation |
| $\text{ReLU}(z) = \max(0, z)$ | Rectified linear unit activation |
| $\mathcal{L} = -\frac{1}{n}\sum[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ | Binary cross-entropy loss |
| $f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$ | LSTM forget gate |
| $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ | LSTM cell state update |
| $e_i = E[i] \in \mathbb{R}^d$ | Entity embedding lookup |
| $d \approx \min(50, \lceil \lvert C \rvert / 2 \rceil)$ | Embedding dimension heuristic |
Key Code Patterns
- `SportsTabularDataset` / `SportsSequenceDataset`: PyTorch `Dataset` subclasses that convert pandas DataFrames to tensors, handle feature/target separation, and support batched loading through `DataLoader`.
- `SportsFeedforwardNet`: Configurable feedforward architecture with batch normalization and dropout, using `nn.Sequential` for clean layer stacking.
- `SportsLSTM` / `PlayerCareerLSTM`: LSTM architectures for sequential sports data, with optional attention mechanism and bidirectional processing.
- `SportsEmbeddingNet`: Combined embedding + feedforward architecture that handles both continuous and categorical inputs.
- `SportsModelTrainer`: Complete training pipeline with early stopping, learning rate scheduling, gradient clipping, and model persistence.
- Optuna integration: Objective functions that sample architectures and hyperparameters, with intermediate reporting for trial pruning.
Decision Framework: Choosing a Neural Network Architecture
START: You have engineered features and a prediction target.
1. Is your dataset large (>10,000 observations)?
- NO --> Use tree-based models (XGBoost). NNs likely to overfit.
- YES --> Proceed to step 2.
2. Do you have high-cardinality categorical variables (teams, players)?
- YES --> Use entity embeddings (SportsEmbeddingNet).
- NO --> Simple feedforward network (SportsFeedforwardNet).
3. Is temporal sequence important beyond rolling averages?
- YES --> Add LSTM component (SportsLSTM).
- NO --> Feedforward or embedding network is sufficient.
4. Do you need to predict multiple related targets?
- YES --> Multi-task network with shared hidden layers.
- NO --> Single-output architecture.
5. Tune hyperparameters with Optuna (50-200 trials).
6. Ensemble 5-10 models with different seeds.
7. Compare against XGBoost baseline on held-out data.
END: Deploy the best model (or ensemble).
What's Next
In Chapter 30: Model Evaluation and Selection, we will learn how to rigorously evaluate and compare the models built in this chapter and previous ones. Raw accuracy is insufficient for betting models --- we need proper scoring rules (Brier score, log loss) that evaluate the quality of probabilistic predictions. We will build backtesting frameworks that simulate realistic betting scenarios with transaction costs. We will implement walk-forward validation that respects the temporal structure of sports data. And we will use calibration plots and statistical tests to determine whether one model genuinely outperforms another or whether observed differences are within the range of statistical noise. These evaluation techniques are the final link in the chain from data to profitable betting decisions.
This chapter is part of The Sports Betting Textbook, a comprehensive guide to quantitative sports betting. All code examples use Python 3.11+ with PyTorch 2.0+ and are available in the companion repository.