Chapter 29 Exercises: Neural Networks for Sports Prediction

Part A: Foundational Concepts (Exercises 1-6)

Exercise 1. Define a feedforward neural network and describe its core components: neurons, weights, biases, activation functions, and layers. Explain how a feedforward network differs from a recurrent network. For a sports prediction task with 15 input features and a binary output, sketch an architecture with two hidden layers and specify reasonable dimensions for each layer.

Exercise 2. Explain the backpropagation algorithm in your own words, without using matrix notation. Describe the forward pass and backward pass for a two-layer network predicting win probability. Explain why the chain rule is essential and what "error signal" means in this context. What happens to the gradients if the activation function is the identity function everywhere?

Exercise 3. Compare and contrast three activation functions (ReLU, Leaky ReLU, and Sigmoid) in terms of their mathematical definitions, output ranges, gradient behavior, and suitability for hidden layers versus output layers. Explain the "dying ReLU" problem and why it is particularly relevant in sports prediction models with small datasets.

Exercise 4. Describe three regularization techniques used in neural networks (dropout, weight decay, and batch normalization). For each, explain the mathematical mechanism, how it prevents overfitting, and typical hyperparameter values for a sports prediction model with 5,000 training observations. Why is combining multiple regularization techniques standard practice?

Exercise 5. Explain the concept of early stopping as a regularization strategy. A neural network trained on NBA game data shows the following validation loss trajectory over 50 epochs: the loss decreases steadily from epoch 1 to epoch 22, fluctuates slightly from epoch 22 to epoch 35, and then increases from epoch 35 onward. At what epoch would you stop training, and what patience value would produce this result? What is the relationship between early stopping and the bias-variance tradeoff?
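As a reference point for the patience mechanics in this exercise, a minimal tracker might look like the following sketch (the class name and interface are illustrative, not taken from the chapter):

```python
# Minimal early-stopping tracker (illustrative sketch, not the chapter's implementation).
class EarlyStopper:
    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience              # epochs to wait after the last improvement
        self.min_delta = min_delta            # minimum decrease that counts as an improvement
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss: float) -> bool:
        """Return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

With this logic, training halts a fixed number of epochs (the patience) after the last epoch at which validation loss improved, which is the relationship the question asks you to reason about.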

Exercise 6. Compare feedforward neural networks and gradient-boosted tree models (e.g., XGBoost) for tabular sports prediction. List three scenarios where neural networks tend to outperform trees and three scenarios where trees tend to outperform neural networks. For a dataset of 8,000 NFL games with 25 engineered features and no categorical variables, which approach would you recommend and why?

Part B: Architecture Design (Exercises 7-12)

Exercise 7. Write a PyTorch nn.Module class for a feedforward network that accepts an input dimension, a list of hidden layer sizes, a dropout rate, and an activation function as constructor arguments. The network should apply batch normalization after each hidden layer and support both Sigmoid and linear output heads (configurable via a parameter). Include complete type hints and docstrings.
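A possible starting skeleton is sketched below; the class name, argument names, and layer ordering (Linear, BatchNorm, activation, Dropout) are assumptions, and the full type hints and docstrings required by the exercise are left for you to complete.

```python
import torch
import torch.nn as nn


class ConfigurableFeedforward(nn.Module):
    """Skeleton of a configurable feedforward network (illustrative sketch)."""

    def __init__(self, input_dim: int, hidden_dims: list[int],
                 dropout: float = 0.3, activation: type[nn.Module] = nn.ReLU,
                 sigmoid_output: bool = True) -> None:
        super().__init__()
        layers: list[nn.Module] = []
        prev_dim = input_dim
        for h in hidden_dims:
            # Linear -> BatchNorm -> activation -> Dropout for each hidden layer.
            layers += [nn.Linear(prev_dim, h), nn.BatchNorm1d(h), activation(), nn.Dropout(dropout)]
            prev_dim = h
        layers.append(nn.Linear(prev_dim, 1))
        if sigmoid_output:
            layers.append(nn.Sigmoid())       # probability head; omit for a linear head
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```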

Exercise 8. Explain the vanishing gradient problem in recurrent neural networks. Using the LSTM cell equations from the chapter, describe how the forget gate, input gate, and cell state address this problem. Why does the cell state update use addition rather than multiplication, and what effect does this have on gradient flow during backpropagation through time?

Exercise 9. Design an LSTM architecture for modeling a basketball team's performance trajectory over a 10-game window. Specify the input features per time step (list at least 8 game-level statistics), the sequence length, the LSTM hidden dimension, the number of LSTM layers, and the feedforward head architecture. Justify each design choice with reference to the size of a typical NBA dataset.

Exercise 10. Explain entity embeddings and their advantages over one-hot encoding for representing NBA teams (30 categories) and NBA players (approximately 450 categories). For each, compute the recommended embedding dimension using the heuristic from the chapter. Describe how entity embeddings enable transfer learning across seasons.
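Use the chapter's heuristic for the actual answer. Purely as an illustration of the kind of calculation involved, one widely cited rule of thumb (an assumption here, not necessarily the chapter's formula) caps the embedding dimension at 50:

```python
# A commonly cited rule of thumb (an assumption, not necessarily the chapter's heuristic).
def embedding_dim(n_categories: int) -> int:
    return min(50, (n_categories + 1) // 2)

print(embedding_dim(30))    # teams   -> 15
print(embedding_dim(450))   # players -> 50 (capped)
```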

Exercise 11. A researcher builds a neural network with the following architecture for predicting NBA game outcomes: 3 hidden layers of sizes [512, 512, 512], dropout rate 0.1, no weight decay, batch size 16, trained for 200 epochs on 4,000 games. Identify at least four problems with this architecture and training configuration and propose specific fixes for each.

Exercise 12. Compare unidirectional and bidirectional LSTMs for sports prediction. Explain why bidirectional LSTMs, despite their theoretical advantages, are often inappropriate for predicting future game outcomes. Describe one sports prediction scenario where a bidirectional LSTM is appropriate and one where it is not.

Part C: Implementation and Training (Exercises 13-18)

Exercise 13. Implement a PyTorch Dataset class for sequential sports data that: (a) takes a DataFrame of game-level features sorted by date within each team, (b) constructs sequences of the previous N games as input, (c) pads sequences with zeros when fewer than N prior games are available, and (d) returns a tuple of (sequence_tensor, target_tensor). Include proper handling of the team grouping and temporal ordering.
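One way to structure the class is sketched below; the column names, zero-padding scheme, and eager construction of all samples in memory are assumptions you may revisit.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset


class SequenceDataset(Dataset):
    """Sketch of a sequence dataset; column names and padding scheme are assumptions."""

    def __init__(self, df: pd.DataFrame, feature_cols: list[str], target_col: str,
                 team_col: str = "team", date_col: str = "date", seq_len: int = 10):
        self.samples = []
        # Group by team and keep games in chronological order within each team.
        for _, team_df in df.sort_values(date_col).groupby(team_col):
            feats = team_df[feature_cols].to_numpy(dtype=np.float32)
            targets = team_df[target_col].to_numpy(dtype=np.float32)
            for i in range(len(team_df)):
                window = feats[max(0, i - seq_len):i]        # previous games only, never game i
                if len(window) < seq_len:                    # zero-pad short histories
                    pad = np.zeros((seq_len - len(window), len(feature_cols)), dtype=np.float32)
                    window = np.vstack([pad, window])
                self.samples.append((torch.from_numpy(window), torch.tensor(targets[i])))

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int):
        return self.samples[idx]
```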

Exercise 14. Write a complete training loop in PyTorch for a feedforward network that includes: (a) Adam optimizer with configurable learning rate and weight decay, (b) binary cross-entropy loss, (c) ReduceLROnPlateau scheduler, (d) gradient clipping, (e) early stopping with patience, (f) validation evaluation every epoch, and (g) restoration of the best model weights. The function should return a dictionary containing training history.
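A condensed sketch covering the required components follows; it assumes the model outputs logits (so the numerically stable BCEWithLogitsLoss stands in for plain binary cross-entropy), and the hyperparameter defaults are illustrative.

```python
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train_model(model: nn.Module, train_loader: DataLoader, val_loader: DataLoader,
                lr: float = 1e-3, weight_decay: float = 1e-4, max_epochs: int = 100,
                patience: int = 10, clip_norm: float = 1.0, device: str = "cpu") -> dict:
    model.to(device)
    criterion = nn.BCEWithLogitsLoss()        # assumes the model outputs logits
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                           factor=0.5, patience=3)
    history = {"train_loss": [], "val_loss": []}
    best_val, best_state, epochs_since_best = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        train_loss = 0.0
        for X, y in train_loader:
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(X).squeeze(-1), y.float())
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)   # gradient clipping
            optimizer.step()
            train_loss += loss.item() * len(y)
        train_loss /= len(train_loader.dataset)

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for X, y in val_loader:
                X, y = X.to(device), y.to(device)
                val_loss += criterion(model(X).squeeze(-1), y.float()).item() * len(y)
        val_loss /= len(val_loader.dataset)

        scheduler.step(val_loss)              # ReduceLROnPlateau watches validation loss
        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)

        if val_loss < best_val:
            best_val, best_state, epochs_since_best = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # early stopping
                break

    if best_state is not None:
        model.load_state_dict(best_state)      # restore best weights
    return history
```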

Exercise 15. Explain the role of the learning rate in neural network training. Given a training run where the loss oscillates wildly and never converges, diagnose the problem and propose two solutions. Given a second training run where the loss decreases extremely slowly and stalls at a mediocre value, diagnose the problem and propose two solutions. How does the ReduceLROnPlateau scheduler address both issues?

Exercise 16. Implement a function that computes and prints a summary of a PyTorch model including: the total number of trainable parameters, the number of parameters per layer, the input and output shapes of each layer, and the estimated memory footprint (assuming 32-bit floats). Apply it to a SportsFeedforwardNet with input_dim=20 and hidden_dims=[128, 64, 32].
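A minimal sketch of the parameter-count and memory portion is shown below; capturing per-layer input and output shapes requires registering forward hooks and running a dummy batch, which is left for you to add.

```python
import torch.nn as nn


def summarize(model: nn.Module) -> None:
    """Print per-module trainable parameter counts and an estimated fp32 memory footprint."""
    total = 0
    for name, module in model.named_modules():
        # Count only parameters owned directly by this module (not its children).
        n = sum(p.numel() for p in module.parameters(recurse=False) if p.requires_grad)
        if n:
            print(f"{name:30s} {type(module).__name__:15s} {n:>10,d} params")
            total += n
    print(f"Total trainable parameters: {total:,d}")
    print(f"Approximate memory (weights only, fp32): {total * 4 / 1024**2:.2f} MB")
```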

Exercise 17. Write code to create a PyTorch DataLoader pipeline for an NBA prediction task. The pipeline should: (a) split data temporally (first 80% train, next 10% validation, last 10% test), (b) standardize features using only training data statistics, (c) create PyTorch datasets, (d) wrap them in DataLoaders with appropriate batch sizes, shuffling, and pin_memory settings for GPU training. Include assertions that verify no data leakage between splits.
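A sketch of the split, scaling, and loader construction follows; the column names and batch size are assumptions, and the standardization statistics come from the training slice only.

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loaders(df: pd.DataFrame, feature_cols: list[str], target_col: str,
                 date_col: str = "game_date", batch_size: int = 64):
    df = df.sort_values(date_col).reset_index(drop=True)
    n = len(df)
    i_train, i_val = int(0.8 * n), int(0.9 * n)
    train, val, test = df.iloc[:i_train], df.iloc[i_train:i_val], df.iloc[i_val:]

    # Guard against temporal leakage between splits.
    assert train[date_col].max() <= val[date_col].min()
    assert val[date_col].max() <= test[date_col].min()

    mean = train[feature_cols].mean()                       # training statistics only
    std = train[feature_cols].std().replace(0, 1.0)

    def to_dataset(part: pd.DataFrame) -> TensorDataset:
        X = torch.tensor(((part[feature_cols] - mean) / std).to_numpy(), dtype=torch.float32)
        y = torch.tensor(part[target_col].to_numpy(), dtype=torch.float32)
        return TensorDataset(X, y)

    pin = torch.cuda.is_available()                         # pin host memory only when a GPU exists
    return (DataLoader(to_dataset(train), batch_size=batch_size, shuffle=True, pin_memory=pin),
            DataLoader(to_dataset(val), batch_size=batch_size, shuffle=False, pin_memory=pin),
            DataLoader(to_dataset(test), batch_size=batch_size, shuffle=False, pin_memory=pin))
```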

Exercise 18. Implement model checkpointing in PyTorch. Write a ModelCheckpointer class that: (a) saves the model state dict and optimizer state dict to disk whenever validation loss improves, (b) saves metadata (epoch number, validation loss, training loss) in a JSON file alongside the checkpoint, (c) keeps only the top K checkpoints (deleting older ones), and (d) provides a load_best method that restores the best model.
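One possible shape for the class is sketched below; the file layout, JSON sidecar format, and in-memory bookkeeping of checkpoints are assumptions rather than the chapter's design.

```python
import json
from pathlib import Path
import torch


class ModelCheckpointer:
    """Sketch of top-K checkpointing; file layout and interface are assumptions."""

    def __init__(self, directory: str, top_k: int = 3):
        self.dir = Path(directory)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.top_k = top_k
        self.records = []                                   # list of (val_loss, checkpoint_path)

    def save(self, model, optimizer, epoch: int, val_loss: float, train_loss: float) -> None:
        if self.records and val_loss >= self.records[0][0]:
            return                                          # only save when validation loss improves
        path = self.dir / f"epoch_{epoch:03d}.pt"
        torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()}, path)
        path.with_suffix(".json").write_text(json.dumps(
            {"epoch": epoch, "val_loss": val_loss, "train_loss": train_loss}))
        self.records.append((val_loss, path))
        self.records.sort(key=lambda r: r[0])               # best (lowest loss) first
        for _, old in self.records[self.top_k:]:            # delete checkpoints beyond the top K
            old.unlink(missing_ok=True)
            old.with_suffix(".json").unlink(missing_ok=True)
        self.records = self.records[:self.top_k]

    def load_best(self, model, optimizer=None):
        state = torch.load(self.records[0][1])
        model.load_state_dict(state["model"])
        if optimizer is not None:
            optimizer.load_state_dict(state["optimizer"])
        return model
```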

Part D: Entity Embeddings and Transfer Learning (Exercises 19-24)

Exercise 19. Implement a SportsEmbeddingNet that combines entity embeddings for home team, away team, and venue with 12 continuous features. The network should: (a) look up embeddings for each categorical variable, (b) concatenate embeddings with continuous features, (c) pass through two hidden layers with batch norm and dropout, and (d) output a win probability. Compute the total number of parameters for 30 teams, 25 venues, and embedding dimensions of 8 for teams and 6 for venues.
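A sketch of the lookup-and-concatenate structure follows. It assumes home and away share a single team table, so the embedding tables contribute 30 x 8 = 240 (teams) plus 25 x 6 = 150 (venues) parameters (separate home/away tables would double the team portion); the hidden sizes are assumptions, and the full parameter count depends on the sizes you choose.

```python
import torch
import torch.nn as nn


class SportsEmbeddingNet(nn.Module):
    """Sketch of the embedding lookup and concatenation; hidden sizes are assumptions."""

    def __init__(self, n_teams: int = 30, n_venues: int = 25,
                 team_dim: int = 8, venue_dim: int = 6, n_continuous: int = 12):
        super().__init__()
        self.team_emb = nn.Embedding(n_teams, team_dim)     # shared by home and away lookups
        self.venue_emb = nn.Embedding(n_venues, venue_dim)
        in_dim = 2 * team_dim + venue_dim + n_continuous     # 8 + 8 + 6 + 12 = 34
        self.head = nn.Sequential(
            nn.Linear(in_dim, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, home_idx, away_idx, venue_idx, continuous):
        x = torch.cat([self.team_emb(home_idx), self.team_emb(away_idx),
                       self.venue_emb(venue_idx), continuous], dim=1)
        return self.head(x)
```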

Exercise 20. Describe the transfer learning process for entity embeddings across NBA seasons. A model trained on the 2022-23 season contains embeddings for 450 players. In the 2023-24 season, 380 players return and 70 are new (rookies and players arriving from other leagues). Write Python code that: (a) transfers embeddings for returning players, (b) initializes rookies with the mean embedding, (c) initializes players arriving from other leagues with a weighted average of their new team's embedding and the position-average embedding.
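A starting sketch for parts (a) and (b) is shown below; the index dictionaries that map player names to embedding rows are assumptions about how the players are encoded, and the weighted-average initialization in part (c) is left for you to add.

```python
import torch
import torch.nn as nn


def transfer_player_embeddings(old_emb: nn.Embedding, new_emb: nn.Embedding,
                               old_index: dict[str, int], new_index: dict[str, int]) -> None:
    """Copy embeddings for returning players; give everyone else the old mean embedding.

    Illustrative sketch: old_index / new_index map player names to row positions and
    are assumptions about the encoding, not the chapter's API.
    """
    with torch.no_grad():
        mean_vec = old_emb.weight.mean(dim=0)
        new_emb.weight.copy_(mean_vec.expand_as(new_emb.weight))   # default for all new players
        for player, new_row in new_index.items():
            if player in old_index:                                # returning player: reuse row
                new_emb.weight[new_row] = old_emb.weight[old_index[player]]
```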

Exercise 21. Explain the cold-start problem for entity embeddings in sports. A new expansion team joins the league, and your model has no embedding for this team. Propose three strategies for initializing the expansion team's embedding, evaluate the tradeoffs of each, and implement the best strategy in code.

Exercise 22. Build a custom PyTorch Dataset that handles both continuous and categorical features for an embedding model. The dataset should: (a) separate continuous features (float tensors) from categorical features (long tensors), (b) support integer encoding of categorical variables with a mapping dictionary, (c) handle unknown categories at inference time by mapping them to an "unknown" index, and (d) return a tuple of (continuous_tensor, categorical_tensor_list, target_tensor).

Exercise 23. Implement a visualization function that takes trained entity embeddings and produces a 2D scatter plot using PCA or t-SNE. Apply it to team embeddings and interpret the spatial relationships. Specifically, the function should: (a) extract the embedding weight matrix, (b) reduce to 2D, (c) plot each team as a labeled point, (d) optionally color teams by a grouping variable (e.g., conference or division). Write complete code using matplotlib.
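A compact PCA-based sketch is shown below; the function name, the choice of PCA over t-SNE, and the label placement are assumptions you can adapt.

```python
import matplotlib.pyplot as plt
import torch.nn as nn
from sklearn.decomposition import PCA


def plot_embeddings(embedding: nn.Embedding, labels, groups=None):
    """Project an embedding table to 2D with PCA and plot labeled points (illustrative sketch)."""
    weights = embedding.weight.detach().cpu().numpy()       # (a) extract the weight matrix
    coords = PCA(n_components=2).fit_transform(weights)     # (b) reduce to 2D

    fig, ax = plt.subplots(figsize=(8, 6))
    colors = None
    if groups is not None:                                   # (d) optional coloring by group
        color_map = {g: i for i, g in enumerate(sorted(set(groups)))}
        colors = [color_map[g] for g in groups]
    ax.scatter(coords[:, 0], coords[:, 1], c=colors, cmap="tab10")
    for (x, y), label in zip(coords, labels):                # (c) label each point
        ax.annotate(label, (x, y), fontsize=8, xytext=(3, 3), textcoords="offset points")
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.set_title("Team embeddings (PCA projection)")
    plt.tight_layout()
    plt.show()
```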

Exercise 24. Design an experiment to measure whether entity embeddings improve NBA game prediction compared to one-hot encoding. Specify: (a) the control model (one-hot encoded teams, feedforward network), (b) the treatment model (embedding-based, same feedforward architecture after the embedding layer), (c) the evaluation protocol (walk-forward validation, Brier score), (d) the statistical test (Diebold-Mariano), and (e) the minimum sample size needed to detect a 0.005 Brier score improvement with 80% power.

Part E: Advanced Applications (Exercises 25-30)

Exercise 25. Implement a complete end-to-end neural network pipeline for NBA game prediction. The pipeline should: (a) generate or load game-level features, (b) split data temporally, (c) standardize continuous features, (d) build a feedforward network with entity embeddings for home and away teams, (e) train with early stopping, (f) evaluate on test data with Brier score and calibration analysis, and (g) save the trained model. Write production-quality code with error handling.

Exercise 26. Build an LSTM model that processes a team's last 10 games of statistics to predict the next game's outcome. Implement the complete pipeline from sequence construction through prediction, including: (a) the SportsSequenceDataset, (b) the SportsLSTM with attention, (c) a modified training loop that handles 3D input tensors, and (d) evaluation comparing the LSTM to a feedforward baseline on the same data. Report the Brier score difference and interpret the result.

Exercise 27. Implement Optuna hyperparameter tuning for a sports prediction neural network. The search space should include: number of hidden layers (1-4), hidden dimensions per layer (32-512), dropout rate (0.1-0.5), learning rate (1e-5 to 1e-2, log scale), weight decay (1e-6 to 1e-2, log scale), and batch size (32, 64, 128, 256). Use MedianPruner for early trial termination. Run 50 trials and report the best configuration. Write code that extracts the best hyperparameters and constructs the final model.
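A sketch of the search-space definition follows; build_and_train_model is a hypothetical placeholder standing in for your own training pipeline (for example, the loop from Exercise 14), and it should return the validation loss and report per-epoch values to the trial so the pruner can act.

```python
import optuna


def objective(trial: optuna.Trial) -> float:
    # Search space from the exercise.
    n_layers = trial.suggest_int("n_layers", 1, 4)
    hidden_dims = [trial.suggest_int(f"hidden_dim_{i}", 32, 512) for i in range(n_layers)]
    params = {
        "hidden_dims": hidden_dims,
        "dropout": trial.suggest_float("dropout", 0.1, 0.5),
        "lr": trial.suggest_float("lr", 1e-5, 1e-2, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [32, 64, 128, 256]),
    }
    # build_and_train_model is a hypothetical placeholder for your training pipeline; it should
    # call trial.report(val_loss, epoch) and check trial.should_prune() so MedianPruner can stop
    # weak trials early, then return the best validation loss.
    return build_and_train_model(params, trial)


study = optuna.create_study(direction="minimize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=50)
print("Best configuration:", study.best_params)
```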

Exercise 28. Build a multi-task neural network that jointly predicts three targets from the same input features: (a) home team win probability, (b) whether the total score will be over the line, and (c) the predicted point spread. The network should share hidden layers for all three tasks and have separate output heads. Explain why multi-task learning can improve predictions compared to three separate models. Implement the architecture and the multi-component loss function.
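A sketch of the shared-trunk architecture and a weighted multi-component loss follows; the hidden sizes and the loss weights are assumptions to be tuned, not prescribed values.

```python
import torch
import torch.nn as nn


class MultiTaskNet(nn.Module):
    """Shared trunk with three task-specific heads (illustrative sketch; sizes are assumptions)."""

    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.3),
        )
        self.win_head = nn.Linear(hidden_dim, 1)      # logit for home win
        self.over_head = nn.Linear(hidden_dim, 1)     # logit for total over the line
        self.spread_head = nn.Linear(hidden_dim, 1)   # regression head for point spread

    def forward(self, x):
        h = self.trunk(x)
        return self.win_head(h), self.over_head(h), self.spread_head(h)


def multi_task_loss(outputs, targets, weights=(1.0, 1.0, 0.1)):
    """Weighted sum of two BCE terms and one MSE term; the weights are tuning assumptions."""
    win_logit, over_logit, spread = outputs
    win_y, over_y, spread_y = targets
    bce = nn.functional.binary_cross_entropy_with_logits
    return (weights[0] * bce(win_logit.squeeze(-1), win_y)
            + weights[1] * bce(over_logit.squeeze(-1), over_y)
            + weights[2] * nn.functional.mse_loss(spread.squeeze(-1), spread_y))
```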

Exercise 29. Implement an ensemble of neural networks for sports prediction. Train 5 feedforward networks with different random seeds on the same data. For each test game, average the 5 predictions. Compare the ensemble Brier score to the average individual model Brier score. Implement both simple averaging and inverse-Brier-score weighting. Explain mathematically why ensembling reduces prediction variance.
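The combination step can be as simple as the sketch below (the array layout is an assumption); recall that averaging K independent predictions with equal variance divides the variance by K, and correlated network errors weaken but do not eliminate that reduction.

```python
import numpy as np


def ensemble_predictions(member_probs: np.ndarray, member_brier=None) -> np.ndarray:
    """Combine ensemble predictions (illustrative sketch).

    member_probs: shape (n_models, n_games), each row a model's win probabilities.
    member_brier: optional per-model validation Brier scores for inverse-Brier weighting.
    """
    if member_brier is None:
        return member_probs.mean(axis=0)                 # simple average
    weights = 1.0 / np.asarray(member_brier)             # lower Brier score -> higher weight
    weights = weights / weights.sum()
    return weights @ member_probs                        # weighted average per game
```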

Exercise 30. Design and implement a neural network that combines all three architectures from this chapter into a single model: (a) entity embeddings for team and player identifiers, (b) an LSTM that processes the home team's last 10 games, (c) a parallel LSTM for the away team's last 10 games, and (d) a feedforward head that concatenates embedding outputs, LSTM final hidden states, and continuous matchup features. Train the combined model on synthetic NBA data and compare to a feedforward-only baseline. Discuss when this level of architectural complexity is justified versus when simpler models suffice.