Chapter 24 Exercises: Deep Learning in Soccer Analytics
Part A: Conceptual Foundations (Problems 1--8)
These exercises test your understanding of neural network fundamentals and their application to soccer data.
Problem 1 (*) Explain why a single-layer neural network (logistic regression) cannot learn the XOR function. How does this limitation relate to xG modeling? Give a concrete example of a nonlinear interaction between shot features that a single-layer model would miss but a two-layer model could capture.
Problem 2 (*) Consider a feedforward neural network for xG prediction with the following architecture: input layer (8 features), hidden layer 1 (64 neurons, ReLU), hidden layer 2 (32 neurons, ReLU), output layer (1 neuron, sigmoid).
(a) Calculate the total number of trainable parameters (weights + biases). (b) If you have 5,000 labeled shots in your training set, compute the ratio of parameters to training examples. Is overfitting a concern? Why or why not? (c) Propose three specific regularization strategies and explain how each would help.
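A quick way to check your hand count for part (a) is to let a framework count the parameters. A minimal sketch, assuming PyTorch is available:

    import torch.nn as nn

    # The Problem 2 architecture: 8 -> 64 -> 32 -> 1
    model = nn.Sequential(
        nn.Linear(8, 64),   # 8*64 weights + 64 biases
        nn.ReLU(),
        nn.Linear(64, 32),  # 64*32 weights + 32 biases
        nn.ReLU(),
        nn.Linear(32, 1),   # 32*1 weights + 1 bias
        nn.Sigmoid(),
    )

    print(sum(p.numel() for p in model.parameters()))  # compare with your hand count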
Problem 3 (**) Derive the gradient of the binary cross-entropy loss with respect to the pre-activation output $z$ of the final sigmoid layer. That is, show that:
$$\frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y$$
where $\hat{y} = \sigma(z)$ and $\mathcal{L} = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]$.
Explain why this simple form is computationally convenient for backpropagation.
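You can sanity-check your derivation numerically with a central finite difference; a minimal NumPy sketch:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def bce(z, y):
        y_hat = sigmoid(z)  # the loss as defined above
        return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    z, y, eps = 0.7, 1.0, 1e-6
    numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)  # central difference
    analytic = sigmoid(z) - y                                  # the claimed dL/dz
    print(numeric, analytic)  # should agree to ~6 decimal places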
Problem 4 (**) A colleague proposes using a neural network with 10 hidden layers (each with 256 neurons) for an xG model trained on 8,000 shots from a single league season.
(a) Identify at least three problems with this approach. (b) Suggest a more appropriate architecture, justifying your choices. (c) Estimate the minimum dataset size that would make the 10-layer architecture reasonable, using the rough heuristic that you need at least 5--10 training examples per parameter.
Problem 5 (**) Compare and contrast the following activation functions for hidden layers in a soccer analytics neural network: sigmoid, tanh, ReLU, and Leaky ReLU.
(a) For each, state the output range and compute the derivative. (b) Explain the "dying ReLU" problem and when it might occur during training. (c) Which would you recommend for an xG model with 3 hidden layers, and why?
Problem 6 (**) The Adam optimizer maintains exponential moving averages of the first and second moments of the gradient. Explain in plain language:
(a) Why is tracking the first moment (mean gradient) useful? (b) Why is tracking the second moment (mean squared gradient) useful? (c) Why is bias correction necessary in the early steps of training? (d) What is the typical default learning rate for Adam, and how does it compare to the typical default for vanilla SGD?
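For parts (a)--(c) it helps to see the update written out. A minimal single-step sketch using the default hyperparameters from Kingma and Ba (2015):

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad      # first moment: running mean of gradients
        v = beta2 * v + (1 - beta2) * grad**2   # second moment: running mean of squared gradients
        m_hat = m / (1 - beta1**t)              # bias correction: m and v start at zero
        v_hat = v / (1 - beta2**t)              # and are biased toward zero early in training
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

Evaluating the step at t = 1 and comparing m_hat with m makes the role of bias correction concrete.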
Problem 7 (***) You are building an xG model for set pieces (corners, free kicks, penalties) and open-play shots. The proposed design uses a single model to cover all shot types.
(a) Argue for and against using a single model vs. separate models for each shot type. (b) Design a neural network architecture that can handle all shot types with shared lower layers and type-specific upper layers (a multi-task or multi-head architecture). Draw the architecture diagram. (c) How would you handle the class imbalance between penalty xG (very high conversion rate) and open-play xG (much lower)?
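One way to prototype the shared-trunk design in part (b); a minimal sketch, assuming PyTorch and an illustrative feature dimension of 12:

    import torch
    import torch.nn as nn

    class MultiHeadXG(nn.Module):
        """Shared lower layers with one output head per shot type."""
        def __init__(self, n_features=12,
                     shot_types=("open_play", "corner", "free_kick", "penalty")):
            super().__init__()
            self.trunk = nn.Sequential(              # shared representation
                nn.Linear(n_features, 64), nn.ReLU(),
                nn.Linear(64, 32), nn.ReLU(),
            )
            self.heads = nn.ModuleDict(              # type-specific upper layers
                {t: nn.Linear(32, 1) for t in shot_types}
            )

        def forward(self, x, shot_type):
            return torch.sigmoid(self.heads[shot_type](self.trunk(x)))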
Problem 8 (***) Batch normalization normalizes activations within each mini-batch. Explain:
(a) The mathematical operation performed during training. (b) How it differs during inference (evaluation mode). (c) Why small batch sizes (e.g., batch_size=4) can cause problems with batch normalization. (d) What alternative normalization technique would you use for sequence models, and why?
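The train/eval distinction in parts (a) and (b) can be observed directly; a minimal sketch, assuming PyTorch:

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm1d(4)            # 4 features
    x = torch.randn(8, 4) * 3 + 5     # a mini-batch with nonzero mean and variance

    bn.train()
    y_train = bn(x)   # normalizes with this batch's statistics, updates running stats

    bn.eval()
    y_eval = bn(x)    # normalizes with the running statistics instead
    print(y_train.mean().item(), y_eval.mean().item())  # differ in general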
Part B: Sequence and Graph Models (Problems 9--18)
Problem 9 (**) Consider a possession sequence of 8 events. Each event is represented by a 12-dimensional feature vector. An LSTM with hidden dimension 64 processes this sequence.
(a) What is the dimension of the hidden state $\mathbf{h}_t$ at each time step? (b) How many parameters are in a single LSTM cell (considering all four gates)? (c) If you use a bidirectional LSTM, how does the output dimension change?
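You can check part (b) against a framework implementation, with one caveat: PyTorch stores two bias vectors per cell (bias_ih and bias_hh), so its count exceeds the textbook single-bias formula by 4 x 64. A minimal sketch:

    import torch.nn as nn

    lstm = nn.LSTM(input_size=12, hidden_size=64)
    # weight_ih: 4*64*12, weight_hh: 4*64*64, biases: 2*(4*64)
    print(sum(p.numel() for p in lstm.parameters()))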
Problem 10 (**) The forget gate $\mathbf{f}_t$ in an LSTM controls what information to discard from the cell state. In the context of possession sequences:
(a) Give a specific example of a situation where the forget gate should output values close to 0 (forget most information). (b) Give a specific example where it should output values close to 1 (retain most information). (c) How does the forget gate bias initialization affect training? What is a common initialization strategy and why?
Problem 11 (**) Compare an LSTM and a Transformer for modeling possession sequences.
(a) What is the computational complexity of processing a sequence of length $T$ for each architecture? (b) Which architecture can better capture long-range dependencies? Explain. (c) For a dataset of 50,000 possessions with average length 12, which architecture would you recommend and why? (d) How would you incorporate temporal information (seconds elapsed between events) into each architecture?
Problem 12 (***) Design a sequence-to-sequence model that takes a defensive pressing sequence (sequence of defender positions over time) and predicts the ball carrier's next action (pass, dribble, shoot, turnover).
(a) Define the input representation for each time step. (b) Specify the encoder and decoder architectures. (c) How would you handle the variable-length nature of pressing sequences? (d) What loss function would you use?
Problem 13 (**) For a graph neural network applied to a single frame of tracking data:
(a) Define the node set $V$, edge set $E$, node features $\mathbf{X}$, and edge features for a match with all 22 players (including goalkeepers) and the ball. (b) Write the adjacency matrix $\mathbf{A}$ for a graph where edges connect all teammate pairs, plus any pair of opponents within 10 meters of each other. (c) After one GCN layer, what information does each node's representation encode? (d) After two GCN layers?
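For part (b), the edge rule translates directly into a few lines of NumPy; a minimal sketch with made-up inputs:

    import numpy as np

    def build_adjacency(pos, team, radius=10.0):
        """pos: (N, 2) positions in meters; team: (N,) team labels (0 or 1)."""
        dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)  # pairwise distances
        same_team = team[:, None] == team[None, :]
        A = (same_team | (dist <= radius)).astype(float)  # teammates, or opponents within radius
        np.fill_diagonal(A, 0.0)                          # no self-loops
        return A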
Problem 14 (**) In a Graph Attention Network (GAT), attention coefficients $\alpha_{ij}$ weight the importance of neighbor $j$ to node $i$.
(a) Explain why attention is preferable to uniform aggregation for soccer tactical analysis. (b) Give a concrete example where the attention weight between a defender and an attacker should be high. (c) How can attention weights be used as an interpretability tool for coaches?
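For reference, the original GAT formulation (Veličković et al., 2018) computes the attention coefficients as
$$\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}\left(\mathbf{a}^\top [\mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_j]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(\text{LeakyReLU}\left(\mathbf{a}^\top [\mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_k]\right)\right)}$$
where $\mathbf{W}$ is a shared weight matrix, $\mathbf{a}$ is a learned attention vector, and $\|$ denotes concatenation.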
Problem 15 (***) You want to classify team formations (4-4-2, 4-3-3, 3-5-2, 4-2-3-1, 5-3-2) from a single frame of tracking data using a GNN.
(a) Design the graph construction procedure (what are nodes, edges, features?). (b) Specify the GNN architecture including the readout/pooling layer that produces a graph-level representation. (c) How would you handle the fact that player ordering (jersey numbers) should not affect the classification? (d) What data augmentation strategies are appropriate for this task?
Problem 16 (***) Temporal graph networks combine GNNs with sequence models. Design a model that processes a 10-second clip of tracking data (250 frames at 25 Hz) to predict whether a goal-scoring opportunity will arise in the next 5 seconds.
(a) Explain why processing every frame is computationally expensive and propose a subsampling strategy. (b) Design the architecture combining spatial (GNN) and temporal (LSTM) processing. (c) How would you define the positive class label (what constitutes a "goal-scoring opportunity")?
Problem 17 (***) Implement (on paper or in pseudocode) the message-passing step of a GNN for a soccer graph where edge features include the Euclidean distance between players. The message function should be distance-weighted:
$$\mathbf{m}_{ij} = \frac{1}{d_{ij} + \epsilon} \cdot \text{MLP}([\mathbf{h}_i \| \mathbf{h}_j])$$
Show the aggregation and update steps.
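If you prefer runnable code to pseudocode, one possible reference sketch (assuming PyTorch; sum aggregation and a concatenation-based update are choices here, not requirements):

    import torch
    import torch.nn as nn

    class DistanceWeightedMP(nn.Module):
        """One message-passing step with 1/(d_ij + eps) weighting, as in the formula above."""
        def __init__(self, dim, eps=1e-3):
            super().__init__()
            self.msg_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
            self.update_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
            self.eps = eps

        def forward(self, h, edges, dist):
            # h: (N, dim) node states; edges: iterable of (i, j); dist: dict (i, j) -> d_ij
            agg = torch.zeros_like(h)
            for i, j in edges:
                m_ij = self.msg_mlp(torch.cat([h[i], h[j]])) / (dist[(i, j)] + self.eps)
                agg[i] = agg[i] + m_ij                           # aggregation: sum over neighbors
            return self.update_mlp(torch.cat([h, agg], dim=-1))  # update from [h_i || messages]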
Problem 18 (****) Propose a novel architecture that combines a GNN (for spatial relationships) with a Transformer (for temporal sequences) to model full-match tracking data. Address:
(a) How spatial and temporal information are fused. (b) The computational complexity and how to make it tractable for 90-minute matches. (c) What pre-training tasks could be used (analogous to masked language modeling in NLP). (d) How the model could be fine-tuned for multiple downstream tasks (xG, pass prediction, formation recognition).
Part C: Applied Deep Learning (Problems 19--26)
Problem 19 (**) You have pitch control surfaces represented as $104 \times 68$ grids with 4 channels (home density, away density, home velocity magnitude, away velocity magnitude). Design a CNN to predict whether the attacking team will complete a pass to a target zone.
(a) Specify the full architecture (layers, filter sizes, activations, pooling). (b) How would you encode the target zone information? (c) Calculate the total number of parameters. (d) What data augmentation is appropriate?
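Before designing the full network for part (a), it helps to pin down the input tensor and one convolutional block; a minimal sketch, assuming PyTorch (the filter count is a placeholder, not a recommended answer):

    import torch
    import torch.nn as nn

    x = torch.randn(16, 4, 104, 68)  # (batch, channels, height, width) as described above

    block = nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=3, padding=1),  # 'same' padding keeps the 104 x 68 grid
        nn.ReLU(),
        nn.MaxPool2d(2),                             # halves spatial dims -> 52 x 34
    )
    print(block(x).shape)  # torch.Size([16, 32, 52, 34])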
Problem 20 (**) Explain why a U-Net architecture is more appropriate than a standard classification CNN when the output is a full pitch control surface (a per-pixel prediction) rather than a single global label.
(a) Draw the encoder-decoder structure with skip connections. (b) What loss function would you use for pixel-wise probability prediction? (c) How do skip connections help preserve spatial detail?
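The mechanics of a skip connection in part (c) reduce to concatenating an encoder feature map with the upsampled decoder feature map at the same resolution; a minimal sketch, assuming PyTorch and made-up shapes:

    import torch

    enc_feat = torch.randn(1, 64, 52, 34)  # saved from the encoder before downsampling
    dec_feat = torch.randn(1, 64, 52, 34)  # decoder output after upsampling

    fused = torch.cat([enc_feat, dec_feat], dim=1)  # (1, 128, 52, 34); a conv then fuses them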
Problem 21 (***) Formulate the problem of optimal pass selection as a reinforcement learning problem.
(a) Define the state space, action space, transition dynamics, and reward function. (b) Explain why the reward is sparse and how you would address this with reward shaping. (c) Why is off-policy learning necessary (you cannot interact with the environment)? (d) Describe how you would evaluate a learned passing policy against human decisions.
Problem 22 (***) The VAEP framework estimates $P(\text{scoring}|a_{t-2}, a_{t-1}, a_t)$ using the last three actions. Extend this to a deep RL framework:
(a) Replace the fixed 3-action window with a full state representation using an LSTM that processes the entire possession. (b) Define the action-value function $Q(s, a)$ and how it relates to VAEP values. (c) Discuss the tradeoffs between the VAEP approach and the full RL approach.
Problem 23 (***) Design a VAE for generating synthetic corner kick routines. The data consists of the $(x, y)$ positions of all 22 players over the 3 seconds from the moment of delivery (75 frames at 25 Hz).
(a) What is the input dimensionality? (b) Design the encoder and decoder architectures. (c) What latent space dimension would you choose and why? (d) How would you make the generation conditional on the desired outcome (goal, clearance, recycled possession)?
Problem 24 (***) A GAN is trained to generate realistic player trajectories. After training, you notice:
(a) Mode collapse: The generator produces only straight-line runs. Diagnose the problem and propose solutions. (b) Physically implausible trajectories: Players exceed 40 km/h. How would you incorporate physical constraints? (c) Lack of team coordination: Individual trajectories look realistic but players overlap or ignore the ball. How would you enforce team-level coherence?
Problem 25 (***) You are deploying a deep learning xG model for live match analysis on a broadcast. Inference must complete within 50 milliseconds per shot event.
(a) Profile the model: a 3-layer MLP with dimensions [64, 128, 64, 1] takes 2ms. An LSTM processing the last 20 events with hidden dimension 128 takes 45ms. A GNN over 22 players with 2 layers takes 80ms. Which components can you include? (b) Propose a model distillation strategy to speed up the GNN component. (c) How would you handle the tradeoff between model complexity and latency for different use cases (live broadcast vs. post-match report)?
Problem 26 (****) Design a complete deep learning pipeline for a professional soccer club's analytics department. The pipeline should:
(a) Ingest event data (from a provider like StatsBomb) and tracking data (from a provider like SkillCorner). (b) Train and evaluate models for: xG, xT, pass success probability, formation recognition, and player action valuation. (c) Include data versioning, experiment tracking, model registry, and automated retraining. (d) Serve predictions via a REST API for integration with the club's existing tools. (e) Address computational infrastructure requirements and estimated costs.
Part D: Research and Open Problems (Problems 27--32)
Problem 27 (****) Foundation models (large pre-trained models fine-tuned for downstream tasks) have revolutionized NLP and computer vision. Propose a "foundation model for soccer":
(a) What pre-training data would you use (event data, tracking data, video)? (b) What self-supervised pre-training objectives would be appropriate? (c) What downstream tasks could the foundation model be fine-tuned for? (d) Estimate the computational cost of pre-training and compare to NLP foundation models.
Problem 28 (****) Causal inference with deep learning: A club wants to know the causal effect of a tactical change (switching from 4-3-3 to 3-4-3) on expected points.
(a) Explain why naive deep learning prediction (train a model to predict points from formation) does not answer this causal question. (b) Propose an approach combining deep learning with causal inference (e.g., propensity score weighting, instrumental variables, or double ML). (c) What assumptions are required, and how plausible are they in this context?
Problem 29 (****) Multi-agent reinforcement learning models each player as an independent agent optimizing individual and team objectives. Design a MARL framework for modeling a full 11-player team:
(a) Define the observation space, action space, and reward function for each agent. (b) How do you handle the cooperative nature of teammates vs. the competitive nature of opponents? (c) What communication mechanism would you implement between agents? (d) How would you train this system given that you only have observational data?
Problem 30 (****) Propose a deep learning approach for automated scouting that identifies "hidden gems"---players in lower divisions whose playing style would translate well to a higher division.
(a) How would you represent playing style as a learned embedding? (b) How would you account for the different quality of opposition and teammates? (c) What transfer learning approach would bridge the domain gap between divisions? (d) How would you validate the model's recommendations?
Problem 31 (****) Fairness in soccer analytics: An xG model consistently undervalues shots by left-footed players.
(a) How would you diagnose this bias? (b) What architectural or data-level interventions could mitigate it? (c) More broadly, how should soccer analytics practitioners think about algorithmic fairness when models are used for player evaluation and contract negotiations?
Problem 32 (****) Write a 2-page research proposal for a novel deep learning application in soccer analytics. Your proposal should include:
(a) A clear problem statement and motivation. (b) A review of related work (cite at least 5 relevant papers). (c) A detailed methodology section describing the proposed architecture, data, and training procedure. (d) An evaluation plan with appropriate baselines and metrics. (e) Expected challenges and mitigation strategies.
Solutions
Selected solutions are available in code/exercise-solutions.py. Full solutions for all problems are provided in the instructor's companion materials.