Chapter 19 Quiz

Multiple Choice

Question 1. Which of the following is the most appropriate train/test splitting strategy for a soccer xG model trained on event data spanning multiple seasons?

(a) Random 80/20 split across all seasons (b) Stratified random split preserving the goal/no-goal ratio (c) Temporal split where training data precedes test data chronologically (d) Leave-one-out cross-validation on individual shots


Question 2. A logistic regression xG model achieves an AUC-ROC of 0.79. What does this mean?

(a) The model correctly classifies 79% of shots. (b) There is a 79% probability that a randomly chosen goal has a higher predicted probability than a randomly chosen non-goal. (c) 79% of predicted goals actually became goals. (d) The model explains 79% of the variance in goal-scoring.


Question 3. In the context of K-means clustering for player roles, what does the elbow method help determine?

(a) The optimal set of features for clustering (b) The appropriate number of clusters $k$ (c) Whether K-means is better than hierarchical clustering (d) The best distance metric to use


Question 4. Which ensemble method builds models sequentially, where each new model focuses on the errors of the previous ones?

(a) Bagging (b) Random Forest (c) Gradient Boosting (d) Stacking


Question 5. For a player market value regression model, which regularization technique performs automatic feature selection by driving irrelevant coefficients to zero?

(a) Ridge regression ($L_2$ penalty) (b) Lasso regression ($L_1$ penalty) (c) Elastic Net (d) Dropout


Question 6. What is the primary advantage of Gaussian Mixture Models over K-means for player role clustering?

(a) GMMs are faster to train (b) GMMs require fewer hyperparameters (c) GMMs provide probabilistic (soft) cluster assignments (d) GMMs guarantee globally optimal solutions


Question 7. When deploying a soccer ML model, Population Stability Index (PSI) is used to detect:

(a) Model accuracy degradation (b) Feature importance changes (c) Distributional shifts in input features (d) Labeling errors in new data


Question 8. In a stacking ensemble, what is the role of the meta-learner?

(a) To train each base model independently (b) To combine the predictions of base models into a final prediction (c) To select the best individual base model (d) To tune the hyperparameters of each base model


Question 9. Which evaluation metric is most appropriate for assessing the calibration of an xG model?

(a) AUC-ROC (b) Accuracy (c) Brier score (d) F1 score


Question 10. Why is target encoding risky for high-cardinality categorical features like player names?

(a) It creates too many dummy variables (b) It can cause target leakage if not properly regularized (c) It only works with numerical targets (d) It requires too much computational memory


Question 11. A random forest feature importance ranking shows that distance_to_goal is the most important feature in an xG model. Which of the following is a known limitation of this importance measure?

(a) It is biased toward features with more categories or higher cardinality (b) It only works for classification, not regression (c) It requires the features to be normally distributed (d) It cannot detect non-linear relationships


Question 12. When performing time-series cross-validation for soccer data, which statement is TRUE?

(a) The validation set can contain matches from before the training period (b) The training set size remains constant across folds (c) The training set grows with each successive fold (d) Random shuffling within each fold is recommended


Question 13. Which of the following is NOT a valid strategy for handling class imbalance in a shot-to-goal prediction model?

(a) Using class weights inversely proportional to class frequency (b) Evaluating with log-loss instead of accuracy (c) Removing all non-goal shots to balance the dataset (d) Using stratified cross-validation splits


Question 14. In the bias-variance trade-off, adding more features to a model with limited training data is most likely to increase:

(a) Bias (b) Variance (c) Irreducible noise (d) Training speed


Question 15. For a deployed xG model, which type of drift describes a change in the relationship between features and the target variable (e.g., due to VAR introduction)?

(a) Data drift (b) Concept drift (c) Label drift (d) Feature drift


True or False

Question 16. True or False: A model with a higher AUC-ROC is always better calibrated than a model with a lower AUC-ROC.


Question 17. True or False: In K-means clustering, the algorithm is guaranteed to converge to the global optimum regardless of the initial centroid placement.


Question 18. True or False: Lasso regression with a sufficiently large regularization parameter $\lambda$ will set all coefficients to zero.


Question 19. True or False: When using gradient boosting, a lower learning rate generally requires more trees to achieve optimal performance.


Question 20. True or False: Permutation feature importance is model-agnostic and can be applied to any trained model regardless of the algorithm used.


Short Answer

Question 21. Explain why random forests use random subsets of features at each split rather than considering all features. How does this relate to the concept of decorrelation among ensemble members?


Question 22. A soccer analytics company retrains its xG model at the start of each season using an expanding window of all available historical data. Describe one advantage and one disadvantage of this approach compared to a sliding window that only uses the most recent three seasons.


Question 23. You discover that your xG model systematically assigns free-kick shots higher probabilities than their observed conversion rate, while remaining well calibrated for open-play shots. Suggest two approaches to fix this issue.


Question 24. Explain why SHAP values are preferred over simple feature importance for interpreting individual predictions in a scouting context. Give a concrete example involving a player evaluation.


Question 25. Describe the difference between one-hot encoding and target encoding for the team_name feature in a match outcome prediction model. Under what circumstances would you prefer each approach?


Answer Key

1. (c) --- Temporal splits prevent future data leakage and mirror real deployment.
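
A minimal sketch of a season-based temporal split, assuming a shots DataFrame with a season column (all column names here are illustrative):

```python
import pandas as pd

# Illustrative shot data; the column names are assumptions.
shots = pd.DataFrame({
    "season": [2019, 2019, 2020, 2020, 2021, 2021],
    "distance_to_goal": [12.0, 25.5, 8.3, 18.1, 30.2, 6.7],
    "is_goal": [1, 0, 1, 0, 0, 1],
})

# Train strictly on earlier seasons, test on the most recent one,
# so no future information leaks into training.
train = shots[shots["season"] < 2021]
test = shots[shots["season"] == 2021]
```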

2. (b) --- AUC-ROC measures the probability that a positive instance is ranked higher than a negative instance.
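
A quick illustration of the pairwise-ranking interpretation with scikit-learn:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 1, 0, 1]          # 1 = goal
y_prob = [0.2, 0.7, 0.4, 0.9]  # predicted probabilities

# Every goal outranks every non-goal here, so AUC is 1.0: it equals the
# fraction of (goal, non-goal) pairs the model ranks correctly.
print(roc_auc_score(y_true, y_prob))
```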

3. (b) --- The elbow method plots inertia vs. $k$ to find the "elbow" where adding more clusters yields diminishing returns.
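
A compact sketch of the elbow method, using synthetic blobs in place of real per-90 player statistics:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for a player-feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Inertia (within-cluster sum of squares) for each k; the "elbow" is where
# the curve flattens and extra clusters stop paying for themselves.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)}
print(inertias)
```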

4. (c) --- Gradient boosting builds models sequentially, fitting each to the residuals of the previous ensemble.
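
A minimal scikit-learn sketch; note the interplay between learning_rate and the number of trees (compare Question 19):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each new tree is fit to the pseudo-residuals of the current ensemble;
# a lower learning_rate shrinks each tree's contribution, so more trees
# are needed to reach the same training loss.
gbm = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, random_state=0
).fit(X, y)
```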

5. (b) --- Lasso's $L_1$ penalty induces sparsity by driving coefficients to exactly zero.
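
A small demonstration of the induced sparsity on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)  # only 2 informative features

# With a moderate penalty, the eight irrelevant coefficients should land
# at exactly 0.0, while the two informative ones survive (shrunken).
model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)
```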

6. (c) --- GMMs assign a probability distribution over clusters for each data point, unlike K-means' hard assignments.
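
A minimal illustration of soft assignments:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for player feature vectors.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Each row is a probability distribution over the 3 roles, so a hybrid
# player can be, say, 0.6 "box-to-box" and 0.4 "deep playmaker".
print(gmm.predict_proba(X[:5]).round(3))
```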

7. (c) --- PSI detects changes in the distribution of input features between a reference and new dataset.
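
A sketch of a conventional PSI implementation (the quantile binning and the thresholds in the docstring are common conventions, not a library API):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """PSI between a reference sample and a new sample of one feature.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
    """
    # Bin edges come from quantiles of the reference distribution.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty bins to avoid log(0) and division by zero.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```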

8. (b) --- The meta-learner takes base model predictions as input and produces the final combined prediction.
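
A minimal stacking sketch in scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# final_estimator is the meta-learner: it is trained on the base models'
# out-of-fold predictions rather than on the raw features.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
).fit(X, y)
```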

9. (c) --- The Brier score is the mean squared error between predicted probabilities and binary outcomes; it penalizes both miscalibration and poor discrimination.
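
For example:

```python
from sklearn.metrics import brier_score_loss

y_true = [0, 1, 0, 0, 1]             # observed outcomes (1 = goal)
y_prob = [0.1, 0.8, 0.3, 0.05, 0.6]  # predicted xG values

# Mean squared error of the probabilities; 0 is perfect, lower is better.
print(brier_score_loss(y_true, y_prob))
```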

10. (b) --- Target encoding uses the target variable to encode categories, creating a direct path for information leakage.

11. (a) --- Mean Decrease Impurity (MDI) importance is biased toward high-cardinality and continuous features.

12. (c) --- In time-series CV, the training window expands with each fold while validation always follows training chronologically.
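
scikit-learn's TimeSeriesSplit implements this expanding-window behavior by default:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 matches in chronological order

for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    # The training window expands fold by fold; validation always follows it.
    print(train_idx, val_idx)
```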

13. (c) --- Removing all non-goal shots leaves a single-class dataset with no negative examples, so there is nothing left for a classifier to learn; the other options are all standard imbalance-handling strategies.

14. (b) --- More features relative to samples increases model complexity and the risk of overfitting (higher variance).

15. (b) --- Concept drift refers to changes in the underlying relationship $P(y | X)$, such as rule changes affecting conversion rates.

16. False --- AUC measures discrimination (ranking ability), not calibration. A model can rank well but have poorly calibrated probabilities.

17. False --- K-means converges to a local optimum that depends on initialization. Multiple runs with different initializations are recommended.

18. True --- As $\lambda \to \infty$, all Lasso coefficients are shrunk to exactly zero, leaving an intercept-only (null) model.

19. True --- A lower learning rate means each tree contributes less, requiring more trees to reach the same level of training loss reduction.

20. True --- Permutation importance measures the drop in performance when a feature's values are shuffled, which works for any model.
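
A minimal example with scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and record the drop in score; this needs
# only a fitted model and a scorer, so it works for any algorithm.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```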

21. Random subsets of features decorrelate individual trees. If all trees used the same dominant feature (e.g., distance_to_goal), their predictions would be highly correlated, reducing the variance-reduction benefit of averaging. By forcing each tree to sometimes ignore strong features, the ensemble explores diverse decision boundaries.
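
In scikit-learn, the size of this random subset is the max_features parameter:

```python
from sklearn.ensemble import RandomForestClassifier

# Only a random subset of sqrt(n_features) candidates is considered at each
# split, so a dominant feature like distance_to_goal cannot drive every tree.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
```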

22. Advantage of expanding window: More training data generally improves model performance, especially for rare events like long-range goals. Disadvantage: Older data may not reflect current tactical trends, diluting the signal from recent seasons and potentially introducing concept drift.

23. (1) Add a is_free_kick feature or interaction terms specific to free kicks so the model can learn different calibration for this shot type. (2) Apply post-hoc calibration (e.g., Platt scaling or isotonic regression) separately for free kicks and open-play shots.
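
A sketch of approach (2) with scikit-learn's CalibratedClassifierCV; in practice you would fit one calibrator per shot type, and the data below is a synthetic stand-in for the free-kick subset:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in for free-kick shots.
X_fk, y_fk = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# Isotonic regression learns a monotone map from raw scores to calibrated
# probabilities; fitting on held-out folds (cv=5) avoids overfitting it.
calibrated = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="isotonic", cv=5
).fit(X_fk, y_fk)
```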

24. SHAP values decompose an individual prediction into per-feature contributions with a consistent theoretical basis (Shapley values from game theory). For example, when evaluating a striker, SHAP can show that "this player's xG over-performance contributes +EUR 3M to their predicted value, while their age contributes -EUR 1.5M." Simple feature importance only shows global rankings, not how features affect a specific player.
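
A minimal sketch with the third-party shap package, using synthetic data in place of real scouting features:

```python
import shap  # third-party: pip install shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# Local attributions: one player's SHAP values plus the explainer's expected
# value sum to that player's individual prediction.
explainer = shap.TreeExplainer(model)
print(explainer.shap_values(X[:1]))
```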

25. One-hot encoding creates a binary column for each team (e.g., 20 columns for a 20-team league), treating each team as independent. Target encoding replaces each team name with the mean target value (e.g., average points per match), compressing to a single column. Prefer one-hot encoding when cardinality is low and sample size is large. Prefer target encoding (with proper regularization such as leave-one-out or additive smoothing) when cardinality is high (e.g., encoding player names across multiple leagues) and you want to reduce dimensionality.
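
A compact sketch of both encodings, with a leakage-safe leave-one-out variant (team names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "team_name": ["Ajax", "PSV", "Ajax", "Feyenoord", "PSV", "Feyenoord"],
    "points":    [3, 1, 3, 0, 3, 1],
})

# One-hot: one binary column per team.
one_hot = pd.get_dummies(df["team_name"], prefix="team")

# Leave-one-out target encoding: each row's own target is excluded from the
# mean it receives, closing the most direct leakage path. Singleton teams
# would still need smoothing toward the global mean.
totals = df.groupby("team_name")["points"].transform("sum")
counts = df.groupby("team_name")["points"].transform("count")
df["team_encoded"] = (totals - df["points"]) / (counts - 1)
```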