Chapter 24 Quiz: Deep Learning in Soccer Analytics
Test your understanding of deep learning concepts as applied to soccer analytics. Each question has one best answer unless otherwise indicated.
Question 1. Which activation function is most commonly used in the hidden layers of modern feedforward neural networks?
(a) Sigmoid (b) Tanh (c) ReLU (d) Softmax
Answer
**(c) ReLU.** The Rectified Linear Unit ($\max(0, z)$) is the default activation for hidden layers because it avoids the vanishing gradient problem that plagues sigmoid and tanh, is computationally efficient, and produces sparse activations. Sigmoid is used for binary output layers; softmax for multi-class output layers.

Question 2. A neural network for xG prediction has input dimension 10, one hidden layer with 64 neurons, and one output neuron. How many trainable parameters does it have?
(a) 704 (b) 705 (c) 769 (d) 775
Answer
**(c) 769.** Layer 1: $10 \times 64 + 64 = 704$ (weights + biases). Output layer: $64 \times 1 + 1 = 65$. Total: $704 + 65 = 769$.
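The count is easy to verify empirically; a minimal PyTorch sketch (the layer sizes come from the question, the use of `nn.Sequential` is just one way to express them):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64),   # 10 * 64 weights + 64 biases = 704
    nn.ReLU(),           # no trainable parameters
    nn.Linear(64, 1),    # 64 * 1 weights + 1 bias = 65
)
print(sum(p.numel() for p in model.parameters()))  # 769
```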
Question 3. What is the primary purpose of dropout regularization?

(a) To speed up training by reducing the number of parameters (b) To prevent co-adaptation of neurons by randomly zeroing activations during training (c) To normalize the distribution of activations across layers (d) To reduce the learning rate over time
Answer
**(b) To prevent co-adaptation of neurons by randomly zeroing activations during training.** Dropout forces the network to learn redundant representations, as any neuron may be absent during a given forward pass. This has an ensemble-like effect that reduces overfitting. Dropout is applied only during training, not during inference.
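A small PyTorch illustration of the training/inference asymmetry (the dropout rate of 0.5 is arbitrary):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()     # training mode: roughly half the activations are zeroed,
print(drop(x))   # and survivors are scaled by 1/(1 - p) = 2.0

drop.eval()      # inference mode: dropout is a no-op
print(drop(x))   # all ones
```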
Question 4. Why do vanilla RNNs struggle with long possession sequences (e.g., 40+ events)?

(a) They have too many parameters for long sequences (b) The vanishing gradient problem causes early events to have negligible influence on the final output (c) They cannot process variable-length sequences (d) They are inherently limited to sequences of length 10 or fewer
Answer
**(b) The vanishing gradient problem causes early events to have negligible influence on the final output.** Gradients are multiplied by the recurrent weight matrix at each time step during backpropagation. If the spectral radius of this matrix is less than 1, gradients shrink exponentially, making it impossible to learn dependencies between distant events. LSTMs and GRUs address this with gating mechanisms.
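A toy numeric illustration of the mechanism, assuming a random recurrent matrix with spectral radius below 1 (the activation-function Jacobians, which only shrink gradients further, are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(32, 32))     # recurrent weight matrix
print(np.abs(np.linalg.eigvals(W)).max())    # spectral radius, well below 1 here

grad = np.ones(32)                           # gradient arriving at the final event
for _ in range(40):                          # backpropagate through 40 events
    grad = W.T @ grad                        # one multiplication per time step
print(np.linalg.norm(grad))                  # vanishingly small after 40 steps
```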
Question 5. In an LSTM, which gate controls what new information is stored in the cell state?

(a) Forget gate (b) Input gate (c) Output gate (d) Update gate
Answer
**(b) Input gate.** The input gate $\mathbf{i}_t = \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{e}_t] + \mathbf{b}_i)$ modulates the candidate cell state $\tilde{\mathbf{c}}_t$, determining which new information is written to the cell state. The forget gate determines what to discard; the output gate determines what to expose as the hidden state.

Question 6. What advantage does a GRU have over an LSTM?
(a) It always achieves higher accuracy (b) It has fewer parameters due to combining the forget and input gates (c) It can process longer sequences (d) It does not require backpropagation through time
Answer
**(b) It has fewer parameters due to combining the forget and input gates.** The GRU uses a single update gate $\mathbf{z}_t$ instead of separate forget and input gates, and merges the cell state and hidden state. This results in fewer parameters, which can be advantageous when training data is limited. Performance differences between LSTMs and GRUs are often marginal in practice.

Question 7. In the attention mechanism $\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}})\mathbf{V}$, what is the purpose of dividing by $\sqrt{d_k}$?
(a) To normalize the output to unit length (b) To prevent the dot products from growing too large, which would push softmax into regions of very small gradients (c) To account for the batch size (d) To make the computation differentiable
Answer
**(b) To prevent the dot products from growing too large, which would push softmax into regions of very small gradients.** When $d_k$ is large, the dot products $\mathbf{Q}\mathbf{K}^\top$ can have large magnitudes, causing softmax to produce near-one-hot distributions with very small gradients. Dividing by $\sqrt{d_k}$ keeps the variance of the dot products at approximately 1, maintaining healthy gradient flow.
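A minimal sketch of the formula, using shapes one might see for a possession sequence (20 events, $d_k = 64$; both numbers are illustrative):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5  # the scaling in question
    return F.softmax(scores, dim=-1) @ V

Q = torch.randn(20, 64)   # 20 events in a possession, d_k = 64
K = torch.randn(20, 64)
V = torch.randn(20, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([20, 64])
```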
Question 8. A graph neural network for soccer tactical analysis represents each player as a node. What is the most natural choice for edge features between two players?

(a) The sum of their jersey numbers (b) Their Euclidean distance, relative angle, and whether they are on the same team (c) The total number of goals they have scored in the season (d) The average possession length of their team
Answer
**(b) Their Euclidean distance, relative angle, and whether they are on the same team.** Edge features should capture the spatial and relational context between connected players. Distance and angle encode the geometry of the relationship, while the team indicator captures the cooperative vs. competitive nature of the interaction. These are instantaneous, frame-level features aligned with the graph structure.

Question 9. After two layers of a GCN, each node's representation encodes information from:
(a) Only itself (b) Itself and its immediate neighbors (c) Itself, its immediate neighbors, and its neighbors' neighbors (d) All nodes in the graph regardless of distance
Answer
**(c) Itself, its immediate neighbors, and its neighbors' neighbors.** Each GCN layer allows information to propagate one hop through the graph. After $k$ layers, each node's representation incorporates information from its $k$-hop neighborhood. Two layers therefore capture 2-hop neighborhoods, meaning a player's representation includes information about teammates and the opponents those teammates are interacting with.
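A small demonstration on a toy path graph, using the symmetric-normalized propagation of a GCN with the learned weights and nonlinearities omitted so that only the receptive-field pattern is visible:

```python
import numpy as np

# Path graph 0-1-2-3: node 0's 2-hop neighborhood is {0, 1, 2}; node 3 is 3 hops away.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                            # add self-loops
D_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
P = D_inv_sqrt @ A_hat @ D_inv_sqrt              # normalized propagation matrix

X = np.eye(4)                                    # one-hot features, one per node
H2 = P @ (P @ X)                                 # two propagation steps
print(np.round(H2[0], 3))                        # entry for node 3 is exactly 0
```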
Question 10. Which graph pooling strategy would you use to produce a single fixed-size vector representing the entire team (graph) for formation classification?

(a) Max pooling over spatial dimensions (b) Mean or sum over all node embeddings (c) Selecting the embedding of the goalkeeper only (d) Concatenating all node embeddings in jersey number order
Answer
**(b) Mean or sum over all node embeddings.** Global mean or sum pooling aggregates all node representations into a single graph-level vector in a permutation-invariant manner. Option (d) would produce a fixed-size vector but destroys permutation invariance (the formation should not depend on jersey number ordering). Option (c) discards most information. More sophisticated approaches (hierarchical pooling, attention-weighted pooling) also work.
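A minimal sketch of the readout, assuming 22 node embeddings of dimension 16 coming out of the GNN layers:

```python
import torch

node_embeddings = torch.randn(22, 16)        # 22 players, 16-dim GNN embeddings

graph_vector = node_embeddings.mean(dim=0)   # permutation-invariant readout
print(graph_vector.shape)                    # torch.Size([16]), fixed size

perm = torch.randperm(22)                    # reordering players changes nothing
assert torch.allclose(graph_vector, node_embeddings[perm].mean(dim=0), atol=1e-6)
```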
Question 11. In a CNN for spatial soccer data, what does the property of "translation equivariance" mean for pitch analysis?

(a) The model produces the same output regardless of the input (b) A pattern learned in one region of the pitch can be recognized in other regions (c) The model can handle pitches of different sizes (d) The model processes the pitch left-to-right
Answer
**(b) A pattern learned in one region of the pitch can be recognized in other regions.** Translation equivariance means that if the input shifts, the output shifts correspondingly. For soccer, a CNN that learns to detect a 2v1 overload on the left wing will also detect it on the right wing, near the center circle, or anywhere else on the pitch, because the same convolutional filters are applied at every spatial location.
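A quick check of this property with a randomly initialized convolution (the shift amount and pitch-grid size are arbitrary; edge cells are excluded because zero padding and the circular shift disagree near the boundary):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
x = torch.randn(1, 1, 20, 30)                 # a coarse pitch grid
x_shifted = torch.roll(x, shifts=4, dims=-1)  # slide the pattern 4 cells along

y = conv(x)
y_shifted = conv(x_shifted)
# Shifting the input shifts the output by the same amount (interior cells only).
print(torch.allclose(torch.roll(y, 4, dims=-1)[..., 5:-5],
                     y_shifted[..., 5:-5], atol=1e-5))  # True
```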
Question 12. A U-Net architecture is preferable to a standard classification CNN when:

(a) The output is a single class label (b) The output is a dense spatial map (e.g., pitch control surface) (c) The dataset is very small (d) The input is a 1D sequence
Answer
**(b) The output is a dense spatial map (e.g., pitch control surface).** U-Nets have an encoder-decoder structure with skip connections that preserves spatial resolution. The encoder captures context through downsampling; the decoder upsamples back to the original resolution. Skip connections transfer fine-grained spatial details from the encoder to the decoder, enabling accurate per-pixel predictions.

Question 13. In reinforcement learning for soccer, what does the value function $V(s)$ represent?
(a) The salary value of the player in state $s$ (b) The expected cumulative future reward (e.g., probability of scoring) from state $s$ (c) The number of passes completed in state $s$ (d) The xG of the most recent shot
Answer
**(b) The expected cumulative future reward (e.g., probability of scoring) from state $s$.** The value function estimates the expected discounted sum of future rewards starting from state $s$ and following the policy thereafter. In soccer, with a goal-scoring reward, $V(s)$ represents how "dangerous" the current game state is: essentially a context-aware expected threat value.

Question 14. The advantage function $A(s, a) = Q(s, a) - V(s)$ measures:
(a) The absolute quality of action $a$ in state $s$ (b) How much better or worse action $a$ is compared to the average action available in state $s$ (c) The probability that action $a$ will succeed (d) The advantage of the home team over the away team
Answer
**(b) How much better or worse action $a$ is compared to the average action available in state $s$.** A positive advantage means the chosen action has higher expected value than the policy's expected value at that state, indicating an above-average decision. This is directly useful for player evaluation: consistently positive advantages indicate superior decision-making.

Question 15. Why is off-policy evaluation essential for reinforcement learning in soccer?
(a) Because soccer data is collected in real time (b) Because we cannot conduct experiments by forcing players to execute specific actions (c) Because the reward function is always positive (d) Because soccer has a continuous action space
Answer
**(b) Because we cannot conduct experiments by forcing players to execute specific actions.** Unlike video games or simulations, we cannot have a soccer player repeatedly execute different actions from the same state to observe outcomes. We must evaluate counterfactual policies ("what if the player had passed instead of dribbled?") from observational data only, which requires off-policy evaluation techniques such as importance sampling.

Question 16. In a Variational Autoencoder (VAE), the KL divergence term in the ELBO serves to:
(a) Maximize reconstruction accuracy (b) Regularize the latent space to be close to a standard normal distribution (c) Increase the dimensionality of the latent space (d) Prevent the decoder from overfitting
Answer
**(b) Regularize the latent space to be close to a standard normal distribution.** The $D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))$ term penalizes the encoder's posterior for deviating too far from the prior $p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$. This ensures the latent space is smooth and continuous, enabling meaningful interpolation and sampling of new data points.
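For a diagonal Gaussian posterior this KL term has a closed form; a minimal sketch (the helper name and batch shapes are ours):

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1)

mu = torch.zeros(4, 8)        # batch of 4, latent dimension 8
logvar = torch.zeros(4, 8)    # posterior identical to the prior...
print(kl_to_standard_normal(mu, logvar))  # ...so the penalty is zero
```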
Question 17. In a GAN, what happens if the discriminator becomes too strong relative to the generator?

(a) The generator produces more diverse outputs (b) Training becomes unstable because the generator receives near-zero gradients (c) The discriminator starts generating data itself (d) The model converges faster
Answer
**(b) Training becomes unstable because the generator receives near-zero gradients.** If the discriminator perfectly distinguishes real from fake, the loss $\log(1 - D(G(z)))$ saturates near $\log(1) = 0$, providing negligible gradient signal to the generator. This is a key challenge in GAN training, addressed by techniques like Wasserstein loss, spectral normalization, and progressive training.

Question 18. Which data split strategy is most appropriate for temporal soccer data?
(a) Random 80/10/10 split of all events (b) Stratified split by event type (c) Temporal split where training data precedes validation data, which precedes test data (d) Leave-one-player-out cross-validation
Answer
**(c) Temporal split where training data precedes validation data, which precedes test data.** Random splitting of temporal data causes data leakage: the model trains on future events and is evaluated on past events. A proper temporal split (e.g., train on the first 30 matchdays, validate on matchdays 31--34, test on matchdays 35--38) simulates real-world deployment where the model must predict future events from past observations.
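A minimal pandas sketch of such a split, assuming a hypothetical event table with a `matchday` column:

```python
import pandas as pd

# Hypothetical event table; only the matchday column matters for the split.
events = pd.DataFrame({"matchday": range(1, 39), "xg_features": range(38)})

train = events[events["matchday"] <= 30]              # matchdays 1-30
val = events[events["matchday"].between(31, 34)]      # matchdays 31-34
test = events[events["matchday"] >= 35]               # matchdays 35-38
```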
Question 19. What is the main advantage of learned embeddings over one-hot encoding for player IDs?

(a) Embeddings use less memory for small player pools (b) Embeddings capture similarity between players in a dense, low-dimensional space (c) Embeddings are always more accurate (d) Embeddings do not require gradient computation
Answer
**(b) Embeddings capture similarity between players in a dense, low-dimensional space.** A 32-dimensional embedding represents each player as a dense vector, where similar players (by playing style, position, or role) end up nearby in the embedding space. One-hot encoding treats all players as equally dissimilar, requires $N$ dimensions for $N$ players, and does not generalize to unseen players.
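A minimal PyTorch sketch (the pool of 600 players and the 32-dimensional embedding are illustrative choices):

```python
import torch
import torch.nn as nn

player_embedding = nn.Embedding(600, 32)   # 600 players, 32-dim dense vectors

player_ids = torch.tensor([17, 254, 17])   # integer IDs replace one-hot vectors
vectors = player_embedding(player_ids)
print(vectors.shape)                       # torch.Size([3, 32])

# After training, nearby vectors indicate players the model treats as similar.
similarity = nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
```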
Question 20. Grad-CAM produces spatial heatmaps for CNN predictions. In a spatial xG model, a Grad-CAM heatmap would show:

(a) The locations of all players on the pitch (b) Which regions of the pitch most influenced the model's xG prediction (c) The trajectory of the ball (d) The optimal passing lanes
Answer
**(b) Which regions of the pitch most influenced the model's xG prediction.** Grad-CAM computes the gradient of the output with respect to the feature maps of a specified convolutional layer, then weights the feature maps by these gradients. The resulting heatmap highlights pitch regions that contributed most to the prediction, providing spatial interpretability (e.g., showing that the gap between center-backs was the key factor).
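A compact sketch of the computation on a tiny toy CNN over a 68 x 105 pitch raster (the architecture and tensor shapes are placeholders, and `retain_grad` stands in for the forward/backward hooks a production implementation would use):

```python
import torch
import torch.nn as nn

# Toy spatial xG model: conv feature extractor + pooled regression head.
features = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1), nn.ReLU(),
)
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))

pitch = torch.randn(1, 1, 68, 105)    # one raster of the pitch
maps = features(pitch)                # feature maps of the last conv layer
maps.retain_grad()                    # keep their gradient after backward
head(maps).sum().backward()           # gradient of the scalar xG logit

alpha = maps.grad.mean(dim=(2, 3), keepdim=True)  # channel weights (GAP of grads)
cam = torch.relu((alpha * maps).sum(dim=1))       # (1, 68, 105) heatmap
```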
Question 21. Knowledge distillation involves:

(a) Removing unnecessary layers from a trained network (b) Training a small "student" model to mimic the outputs of a larger "teacher" model (c) Converting a neural network to a decision tree for interpretability (d) Transferring weights from one layer to another
Answer
**(b) Training a small "student" model to mimic the outputs of a larger "teacher" model.** The student is trained on soft labels (probability distributions) produced by the teacher, which contain more information than hard labels alone. The soft labels encode the teacher's uncertainty and inter-class similarities. This allows the student to achieve performance closer to the teacher despite having far fewer parameters, enabling efficient deployment.
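A minimal sketch of a temperature-scaled distillation loss (the temperature $T = 4$ and the five-class output are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * T**2

student_logits = torch.randn(16, 5, requires_grad=True)  # small student
teacher_logits = torch.randn(16, 5)                      # frozen teacher outputs
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```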
Question 22. Which of the following is NOT a valid concern about deep learning in soccer analytics?

(a) Models may encode biases present in the training data (b) Overfitting to small datasets is a persistent challenge (c) Deep learning models are always less accurate than logistic regression (d) Interpretability is important for coaching staff adoption
Answer
**(c) Deep learning models are always less accurate than logistic regression.** This is false. While logistic regression can be competitive (or even preferable) for small datasets or simple relationships, deep learning models generally outperform logistic regression when sufficient data is available and the underlying relationships are complex and nonlinear. The other three options are legitimate and important concerns.

Question 23. In the context of generative models for soccer, "mode collapse" refers to:
(a) The model generating data points from only a small subset of the true data distribution (b) The model failing to converge during training (c) The model generating data that is too similar to the training set (d) The model producing outputs with incorrect dimensions
Answer
**(a) The model generating data points from only a small subset of the true data distribution.** In mode collapse, the generator learns to produce a limited variety of outputs that fool the discriminator, ignoring large regions of the data distribution. For soccer, this might manifest as generating only one type of tactical pattern or player trajectory, rather than the full diversity observed in real matches.

Question 24. When processing possession sequences, why is masking important for padded sequences?
(a) To encrypt sensitive player data (b) To prevent the model from treating padding tokens as real events, which would corrupt the learned representations (c) To reduce computational cost (d) To increase the sequence length
Answer
**(b) To prevent the model from treating padding tokens as real events, which would corrupt the learned representations.** Without masking, the model would aggregate information from zero-valued padding positions as if they were meaningful events. This is particularly problematic for attention mechanisms (which would attend to padding tokens) and for loss computation (which should ignore predictions at padded positions).
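A minimal sketch of both uses of a padding mask, assuming a batch of three possessions padded to length 12:

```python
import torch

lengths = torch.tensor([7, 12, 4])    # true lengths of three possessions
max_len = 12
# (batch, seq) boolean mask: True marks a real event, False marks padding.
mask = torch.arange(max_len)[None, :] < lengths[:, None]

# Attention: force padded positions to -inf so softmax gives them zero weight.
scores = torch.randn(3, max_len)
weights = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1)

# Loss: average per-event losses over real positions only.
per_event_loss = torch.randn(3, max_len).abs()   # placeholder values
loss = (per_event_loss * mask).sum() / mask.sum()
```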
Question 25. A club's analytics department has event data for 5,000 matches and tracking data for 500 matches. They want to build a deep learning xG model. What strategy would you recommend?

(a) Use only the 500 matches with tracking data since it is richer (b) Use only the 5,000 matches with event data since there is more of it (c) Pre-train on the 5,000 event-data matches, then fine-tune on the 500 tracking-data matches (d) Combine both datasets by generating synthetic tracking data for the event-only matches
Answer
**(c) Pre-train on the 5,000 event-data matches, then fine-tune on the 500 tracking-data matches.** This transfer learning approach leverages the larger event dataset to learn general shot-outcome relationships, then fine-tunes on the smaller but richer tracking dataset to capture spatial dynamics not present in event data. Pre-training provides a better initialization for the tracking-data model than random initialization, reducing overfitting risk. Option (d) is also viable but introduces synthetic data noise.
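A skeletal sketch of the two-stage setup, with every architectural choice (layer sizes, the 44-dimensional tracking input for 22 $(x, y)$ positions, the per-group learning rates) being illustrative rather than prescriptive:

```python
import torch.nn as nn
import torch.optim as optim

# Stage 1: pre-train a trunk + head on the 5,000 event-data matches.
trunk = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
event_head = nn.Linear(32, 1)
# ... standard supervised training loop on event features ...

# Stage 2: fine-tune on the 500 tracking-data matches, reusing the trunk.
tracking_branch = nn.Sequential(nn.Linear(44, 32), nn.ReLU())  # 22 (x, y) positions
fusion_head = nn.Linear(32 + 32, 1)                            # fused xG output

optimizer = optim.Adam([
    {"params": trunk.parameters(), "lr": 1e-5},          # small LR preserves pre-training
    {"params": tracking_branch.parameters(), "lr": 1e-3},
    {"params": fusion_head.parameters(), "lr": 1e-3},
])
```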
Scoring Guide

| Score | Level | Recommendation |
|---|---|---|
| 23--25 | Excellent | Ready for research-level applications |
| 18--22 | Good | Solid understanding; review missed topics |
| 13--17 | Fair | Re-read sections 24.1--24.4 before proceeding |
| 0--12 | Needs work | Review Chapter 16 (ML Fundamentals) first, then re-read this chapter |