Chapter 19 Exercises

Section 19.1 --- ML Fundamentals for Soccer Applications

Exercise 19.1 (Foundational)

Explain the difference between supervised and unsupervised learning in the context of soccer analytics. Provide two examples of each paradigm applied to soccer data.


Exercise 19.2 (Foundational)

A data scientist uses a random 80/20 train/test split on five seasons of Premier League shot data to build an xG model. The test set contains shots from all five seasons.

(a) Explain why this approach is problematic. (b) Propose a better splitting strategy and justify your choice.


Exercise 19.3 (Intermediate)

You are building a model to predict whether a tackle will result in a foul. In your dataset of 50,000 tackles, 13,500 resulted in fouls.

(a) What is the class distribution? Is this a balanced or imbalanced problem? (b) If you use accuracy as your evaluation metric with a model that always predicts "no foul," what accuracy would you achieve? (c) Why is accuracy a poor metric here? Suggest two better alternatives.


Exercise 19.4 (Foundational)

Define the bias-variance trade-off. For each of the following scenarios, state whether the primary problem is high bias or high variance:

(a) A linear regression model predicting match outcomes from only possession percentage achieves 42% accuracy on both training and test data. (b) A decision tree with no depth limit achieves 98% accuracy on training data but only 55% on test data. (c) A logistic regression model for pass success prediction achieves 71% accuracy on training data and 70% on test data, but domain experts believe 80%+ should be achievable.


Exercise 19.5 (Intermediate)

Implement time-series cross-validation for a soccer dataset spanning seasons 2015/16 through 2022/23. Write Python code that creates the following folds (a starter sketch follows the list):

  • Fold 1: Train on 2015/16--2017/18, validate on 2018/19
  • Fold 2: Train on 2015/16--2018/19, validate on 2019/20
  • Fold 3: Train on 2015/16--2019/20, validate on 2020/21
  • Fold 4: Train on 2015/16--2020/21, validate on 2021/22
  • Fold 5: Train on 2015/16--2021/22, validate on 2022/23
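
One minimal starter sketch, assuming the shot data lives in a pandas DataFrame with a season column holding labels such as "2015/16":

    import pandas as pd

    def expanding_season_folds(df, season_col="season", first_val_season="2018/19"):
        """Yield (train_index, val_index) pairs for an expanding-window split."""
        seasons = sorted(df[season_col].unique())
        start = seasons.index(first_val_season)
        for val_season in seasons[start:]:
            # "YYYY/YY" labels sort correctly as strings, so < is chronological.
            train_mask = df[season_col] < val_season
            val_mask = df[season_col] == val_season
            yield df.index[train_mask], df.index[val_mask]

    # Dummy frame: three rows per season from 2015/16 to 2022/23.
    df = pd.DataFrame({
        "season": [f"{y}/{str(y + 1)[2:]}" for y in range(2015, 2023) for _ in range(3)]
    })
    for i, (tr, va) in enumerate(expanding_season_folds(df), start=1):
        print(f"Fold {i}: {df.loc[tr, 'season'].nunique()} training seasons, "
              f"validate on {df.loc[va, 'season'].iloc[0]}")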

Section 19.2 --- Classification Problems in Soccer

Exercise 19.6 (Foundational)

For a binary classification xG model, explain what each of the following metrics measures and when you would prefer one over the others:

(a) AUC-ROC (b) Log-loss (c) Brier score (d) Precision and recall


Exercise 19.7 (Intermediate)

Build a logistic regression model for goal prediction using the following features: distance_to_goal, angle_to_goal, is_header.

(a) Write the mathematical formulation of the model. (b) If the fitted coefficients are $w_{\text{dist}} = -0.08$, $w_{\text{angle}} = 0.03$, $w_{\text{header}} = -0.45$, and $b = 1.2$, calculate the predicted probability for a header from 12 meters with an angle of 0.35 radians. (c) Interpret each coefficient in soccer terms.
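
A quick self-check for part (b), assuming the standard logistic (sigmoid) link:

    import math

    # Fitted coefficients from part (b).
    w_dist, w_angle, w_header, b = -0.08, 0.03, -0.45, 1.2

    # Header from 12 meters at an angle of 0.35 radians.
    z = b + w_dist * 12 + w_angle * 0.35 + w_header * 1
    p = 1 / (1 + math.exp(-z))  # sigmoid
    print(f"linear score z = {z:.4f}, P(goal) = {p:.3f}")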


Exercise 19.8 (Intermediate)

You are building a pass success prediction model. Design a feature set of at least 10 features, categorized into spatial, temporal, and contextual groups. For each feature, explain why it would be predictive of pass success.


Exercise 19.9 (Advanced)

A match outcome prediction model produces the following average predicted probabilities and observed outcome frequencies over 1,000 matches in which the predicted probability of a home win was between 0.55 and 0.65:

  Outcome     Predicted (avg)   Observed Frequency
  Home Win    0.60              0.52
  Draw        0.22              0.28
  Away Win    0.18              0.20

(a) Is this model well-calibrated? Explain. (b) What specific calibration issues do you see? (c) Suggest a method to improve calibration without retraining the model.
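
For part (c), one option is post-hoc calibration, e.g. isotonic regression fitted on held-out predictions, which leaves the underlying model untouched. A sketch on fabricated data:

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.default_rng(0)
    # Fabricated held-out data: raw model probabilities and actual outcomes,
    # with the model deliberately overconfident.
    raw_probs = rng.uniform(0.05, 0.95, 2000)
    outcomes = rng.binomial(1, raw_probs * 0.85)

    # Fit a monotone map from raw probability to observed frequency; the
    # underlying match model is never retrained.
    calibrator = IsotonicRegression(out_of_bounds="clip").fit(raw_probs, outcomes)
    print(f"raw 0.60 -> calibrated {calibrator.predict([0.60])[0]:.3f}")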


Exercise 19.10 (Advanced)

Implement a complete binary classification pipeline for predicting whether a shot is on target. Your implementation should include:

(a) Synthetic data generation with at least 8 features. (b) A preprocessing pipeline with scaling and encoding. (c) Comparison of logistic regression, random forest, and gradient boosting. (d) Evaluation using AUC-ROC, log-loss, and a calibration plot. (e) A confusion matrix visualization for the best model.
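
One possible skeleton covering parts (a)-(c) on crude synthetic data; the feature names and the on-target probability formula are assumptions, and parts (d)-(e) are left to complete:

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss, roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    rng = np.random.default_rng(42)
    n = 5000
    df = pd.DataFrame({
        "distance": rng.uniform(5, 35, n),
        "angle": rng.uniform(0.1, 1.4, n),
        "shot_speed": rng.uniform(0, 8, n),
        "defenders_in_cone": rng.integers(0, 5, n),
        "gk_distance": rng.uniform(0, 10, n),
        "passes_in_possession": rng.integers(1, 15, n),
        "minute": rng.integers(1, 95, n),
        "body_part": rng.choice(["foot", "head", "other"], n),
    })
    # Crude assumption: on-target odds fall with distance and rise with angle.
    p = 1 / (1 + np.exp(-(1.5 - 0.08 * df["distance"] + 1.2 * df["angle"])))
    y = rng.binomial(1, p)

    numeric = [c for c in df.columns if c != "body_part"]
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["body_part"]),
    ])

    X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.2, random_state=0)
    for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                        ("gbm", GradientBoostingClassifier(random_state=0))]:
        pipe = Pipeline([("pre", preprocess), ("clf", model)]).fit(X_tr, y_tr)
        probs = pipe.predict_proba(X_te)[:, 1]
        print(f"{name}: AUC={roc_auc_score(y_te, probs):.3f}, "
              f"log-loss={log_loss(y_te, probs):.3f}")
    # Calibration plot (d) and confusion matrix (e) are left to complete.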


Section 19.3 --- Regression Applications

Exercise 19.11 (Foundational)

Explain the difference between Ridge and Lasso regression. In what scenario would Lasso be preferred for a player valuation model with 50 features?


Exercise 19.12 (Intermediate)

A player market value model produces the following test-set metrics:

  • RMSE: EUR 4.2 million
  • MAE: EUR 2.1 million
  • $R^2$: 0.78

(a) Interpret each metric in the context of player valuation. (b) The large gap between RMSE and MAE suggests something about the error distribution. What is it? (c) Would you trust this model for valuing a player worth EUR 500,000? Why or why not?


Exercise 19.13 (Intermediate)

Build a regression model to predict a player's per-90 xG contribution based on: position, age, minutes played, shots per 90, average shot distance, and percentage of shots that are headers.

(a) Write Python code to generate synthetic data for 500 players. (b) Fit both a linear regression and a Ridge regression model. (c) Compare their performance using 5-fold cross-validation with RMSE. (d) Plot the Ridge coefficient paths as a function of the regularization parameter $\alpha$.
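
For part (d), one way to trace the coefficient paths is to refit Ridge over a log-spaced grid of $\alpha$ values; the data and feature names below are placeholders:

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    X = StandardScaler().fit_transform(rng.normal(size=(500, 6)))
    # Fabricated target: xG/90 driven mostly by shots per 90 (feature 3).
    y = 0.05 + 0.3 * X[:, 3] - 0.1 * X[:, 4] + rng.normal(0, 0.1, 500)

    alphas = np.logspace(-2, 4, 50)
    coefs = [Ridge(alpha=a).fit(X, y).coef_ for a in alphas]

    feature_names = ["position_enc", "age", "minutes", "shots_p90",
                     "avg_shot_dist", "header_share"]
    plt.figure(figsize=(7, 4))
    for i, name in enumerate(feature_names):
        plt.plot(alphas, [c[i] for c in coefs], label=name)
    plt.xscale("log")
    plt.xlabel("alpha (log scale)")
    plt.ylabel("coefficient value")
    plt.legend(fontsize=8)
    plt.title("Ridge coefficient paths")
    plt.tight_layout()
    plt.show()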


Exercise 19.14 (Advanced)

The relationship between a player's age and market value is non-linear.

(a) Generate synthetic data where market value follows an inverted-U shape peaking at age 27, with added noise. (b) Fit a linear regression, polynomial regression (degree 2 and 3), and a gradient boosting regressor. (c) Plot the fitted curves and compare $R^2$ values. (d) Discuss which model you would deploy and why.
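
One possible generator for part (a); the peak height, curvature, and noise level are arbitrary choices:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(7)
    age = rng.uniform(17, 37, 400)
    # Inverted-U peaking at age 27; values fall off quadratically on either side.
    value = 20 - 0.15 * (age - 27) ** 2 + rng.normal(0, 2, 400)  # EUR millions
    df = pd.DataFrame({"age": age, "market_value_m": value.clip(min=0.1)})
    print(df.describe().round(2))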


Section 19.4 --- Clustering for Player Roles

Exercise 19.15 (Foundational)

Explain why per-90-minute normalization is important when clustering players. What could go wrong if you cluster on raw season totals?


Exercise 19.16 (Foundational)

Describe three methods for choosing the number of clusters $k$ in K-means clustering. Which method do you consider most appropriate for player role discovery, and why?


Exercise 19.17 (Intermediate)

Generate synthetic data for 200 outfield players with the following features: goals/90, assists/90, tackles/90, interceptions/90, progressive passes/90, and dribbles/90.

(a) Standardize the features using z-scores. (b) Apply K-means clustering with $k = 4, 5, 6, 7, 8$. (c) Plot the elbow curve and silhouette scores. (d) Select the optimal $k$ and visualize the clusters using PCA.
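
A compact sketch of parts (b)-(c), with random placeholder data standing in for the 200-player feature matrix:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(3)
    # Placeholder for the 200 x 6 standardized per-90 feature matrix.
    X = StandardScaler().fit_transform(rng.normal(size=(200, 6)))

    for k in range(4, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(f"k={k}: inertia={km.inertia_:.1f}, "
              f"silhouette={silhouette_score(X, km.labels_):.3f}")
    # Plot inertia vs k (elbow) and silhouette vs k; PCA projection for part (d).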


Exercise 19.18 (Intermediate)

Compare K-means and hierarchical clustering (Ward linkage) on the same player dataset from Exercise 19.17.

(a) Generate a dendrogram from the hierarchical clustering. (b) Cut the dendrogram at the same $k$ you selected in Exercise 19.17. (c) Compute the adjusted Rand index between the K-means and hierarchical cluster assignments. (d) Discuss which method produces more interpretable clusters and why.
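
For part (c), scipy's fcluster cuts the linkage tree at a fixed number of clusters, and adjusted_rand_score compares the two label vectors; the sketch below assumes $k = 5$ and placeholder data:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(12)
    X = rng.normal(size=(200, 6))  # stand-in for the standardized player features

    Z = linkage(X, method="ward")  # feed Z to scipy's dendrogram() for part (a)
    hier_labels = fcluster(Z, t=5, criterion="maxclust")  # (b) cut at the chosen k
    km_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
    print(f"adjusted Rand index: {adjusted_rand_score(km_labels, hier_labels):.3f}")  # (c)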


Exercise 19.19 (Advanced)

Implement a Gaussian Mixture Model for player clustering and compare it with K-means.

(a) Fit a GMM with the same $k$ as in Exercise 19.17. (b) For each player, report the probability of belonging to each cluster. (c) Identify players with high uncertainty (no single cluster probability exceeds 0.6). What types of players might these be in soccer terms? (d) Use BIC to select the optimal number of components and compare with the K-means elbow method result.
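
Parts (a)-(c) reduce to reading posterior membership probabilities from a fitted mixture; a sketch with placeholder data and an assumed $k = 5$:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(5)
    X = rng.normal(size=(200, 6))  # stand-in for the standardized player features

    gmm = GaussianMixture(n_components=5, random_state=0).fit(X)
    resp = gmm.predict_proba(X)  # per-player cluster membership probabilities

    # (c) Players whose strongest membership is below 0.6 are hybrid profiles.
    uncertain = np.where(resp.max(axis=1) < 0.6)[0]
    print(f"{len(uncertain)} players lack a dominant cluster")

    # (d) BIC across component counts; lower is better.
    for k in range(2, 9):
        bic = GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        print(f"k={k}: BIC={bic:.0f}")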


Exercise 19.20 (Advanced)

Create radar charts (spider plots) for the centroid of each cluster from Exercise 19.17. Label each cluster with an interpretive name based on the centroid values (e.g., "Ball-Winning Midfielder," "Goal-Scoring Forward").
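
One way to draw a single radar chart with matplotlib's polar axes; the centroid values and label are invented for illustration, and looping over all centroids is then mechanical:

    import matplotlib.pyplot as plt
    import numpy as np

    labels = ["goals/90", "assists/90", "tackles/90",
              "interceptions/90", "prog. passes/90", "dribbles/90"]
    centroid = np.array([1.2, 0.8, -0.5, -0.6, 0.4, 1.0])  # invented z-scored centroid

    angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
    # Repeat the first point so the polygon closes.
    values = np.concatenate([centroid, centroid[:1]])
    angles_closed = np.concatenate([angles, angles[:1]])

    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    ax.plot(angles_closed, values, linewidth=2)
    ax.fill(angles_closed, values, alpha=0.25)
    ax.set_xticks(angles)
    ax.set_xticklabels(labels, fontsize=8)
    ax.set_title("Hypothetical centroid: 'Goal-Scoring Forward'")
    plt.tight_layout()
    plt.show()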


Section 19.5 --- Ensemble Methods and Model Stacking

Exercise 19.21 (Foundational)

Explain the difference between bagging and boosting. Why does bagging reduce variance while boosting reduces bias?


Exercise 19.22 (Intermediate)

Train a random forest classifier for goal prediction (synthetic data) and analyze feature importance.

(a) Generate synthetic shot data with 10 features. (b) Train a random forest with 500 trees. (c) Plot the feature importances (MDI-based). (d) Compare with permutation importance. Do the rankings differ? Why might they?
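
For parts (c)-(d), scikit-learn exposes both importance types directly; the sketch below uses placeholder data in which only the first two features carry signal:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(11)
    X = rng.normal(size=(3000, 10))
    # Only features 0 and 1 drive this fabricated goal label.
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 3000) > 0).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

    mdi = rf.feature_importances_  # impurity-based, computed on training data
    perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
    for i in range(10):
        print(f"f{i}: MDI={mdi[i]:.3f}, permutation={perm.importances_mean[i]:.3f}")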


Exercise 19.23 (Intermediate)

Implement gradient boosting for an xG model and analyze the effect of the learning rate:

(a) Train models with learning rates of 0.01, 0.05, 0.1, and 0.3, each with 1,000 estimators. (b) Plot the training and validation log-loss curves for each learning rate. (c) Which learning rate achieves the best validation performance? At how many estimators? (d) Explain the relationship between learning rate and optimal number of trees.
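
For part (b), staged_predict_proba traces validation loss as trees are added, so each learning rate needs only one fit. A sketch on synthetic data:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(2)
    X = rng.normal(size=(4000, 6))
    y = (X[:, 0] - 0.7 * X[:, 1] + rng.normal(0, 1, 4000) > 0).astype(int)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

    for lr in [0.01, 0.05, 0.1, 0.3]:
        gbm = GradientBoostingClassifier(learning_rate=lr, n_estimators=1000,
                                         random_state=0).fit(X_tr, y_tr)
        # Validation log-loss after each successive tree.
        losses = [log_loss(y_va, p) for p in gbm.staged_predict_proba(X_va)]
        best = int(np.argmin(losses))
        print(f"lr={lr}: best log-loss {losses[best]:.4f} at {best + 1} trees")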


Exercise 19.24 (Advanced)

Build a stacking ensemble for match outcome prediction (home win / draw / away win):

(a) Use logistic regression, random forest, gradient boosting, and SVM as base models. (b) Use logistic regression as the meta-learner. (c) Compare the stacking ensemble's performance against each individual base model. (d) Analyze whether the improvement justifies the added complexity.


Exercise 19.25 (Advanced)

Implement a custom stacking pipeline without using scikit-learn's StackingClassifier:

(a) Generate out-of-fold predictions for each base model using 5-fold cross-validation. (b) Stack these predictions as features for the meta-learner. (c) Train and evaluate the meta-learner on a held-out test set. (d) Verify that your results are consistent with StackingClassifier.
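
cross_val_predict generates the out-of-fold predictions needed in part (a). A minimal sketch with two base models and placeholder data, extensible to the full set:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import cross_val_predict, train_test_split

    rng = np.random.default_rng(4)
    X = rng.normal(size=(3000, 8))
    y = rng.integers(0, 3, 3000)  # placeholder 3-class outcome (home/draw/away)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    bases = [LogisticRegression(max_iter=1000),
             RandomForestClassifier(n_estimators=200, random_state=0)]

    # (a) Out-of-fold class probabilities for each base model.
    oof = np.hstack([cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")
                     for m in bases])

    # (b)-(c) Meta-learner on the stacked OOF features; base models are then
    # refit on all of X_tr to produce meta-features for the test set.
    meta = LogisticRegression(max_iter=1000).fit(oof, y_tr)
    test_meta = np.hstack([m.fit(X_tr, y_tr).predict_proba(X_te) for m in bases])
    print(f"stacked test accuracy: {accuracy_score(y_te, meta.predict(test_meta)):.3f}")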


Section 19.6 --- Feature Selection and Engineering

Exercise 19.26 (Foundational)

Categorize the following features as spatial, temporal, sequential, or contextual, and explain your reasoning:

(a) Distance from shot location to the goal center. (b) Number of passes in the possession chain before the shot. (c) Score differential at the time of the shot. (d) Minutes remaining in the match. (e) Whether the previous action was a cross. (f) The pitch zone (defensive, middle, attacking third).


Exercise 19.27 (Intermediate)

Apply mutual information feature selection to identify the top 10 features for a goal prediction model from a set of 20 candidate features.

(a) Generate synthetic data with 20 features, where only 8 are truly informative. (b) Compute the mutual information score for each feature. (c) Compare the selected features with the ground truth. (d) Train a model on all 20 features vs. the top 10. Which performs better on the test set?
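
For parts (a)-(b), make_classification can plant a known set of informative features, and mutual_info_classif scores them in one call:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import mutual_info_classif

    # 20 candidate features, 8 informative, matching part (a); with
    # shuffle=False the informative features are columns 0-7.
    X, y = make_classification(n_samples=3000, n_features=20, n_informative=8,
                               n_redundant=0, shuffle=False, random_state=0)
    mi = mutual_info_classif(X, y, random_state=0)
    top10 = np.argsort(mi)[::-1][:10]
    print("selected features:", sorted(top10.tolist()))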


Exercise 19.28 (Intermediate)

Design and implement interaction features for an xG model:

(a) Create the interaction distance_to_goal * is_header. (b) Create the interaction angle_to_goal * num_defenders. (c) Add these interaction features to a logistic regression model. (d) Test whether the interactions improve model performance via cross-validated log-loss.
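
Parts (a), (b), and (d) in miniature, on synthetic data with a built-in distance-header interaction; the column names follow the exercise:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(9)
    n = 4000
    df = pd.DataFrame({
        "distance_to_goal": rng.uniform(5, 30, n),
        "angle_to_goal": rng.uniform(0.1, 1.4, n),
        "is_header": rng.integers(0, 2, n),
        "num_defenders": rng.integers(0, 5, n),
    })
    # Fabricated label in which headers suffer more with distance,
    # i.e. a genuine distance-header interaction.
    z = 1.0 - 0.06 * df["distance_to_goal"] - 0.08 * df["distance_to_goal"] * df["is_header"]
    y = rng.binomial(1, 1 / (1 + np.exp(-z)))

    df["dist_x_header"] = df["distance_to_goal"] * df["is_header"]        # (a)
    df["angle_x_defenders"] = df["angle_to_goal"] * df["num_defenders"]   # (b)

    base = ["distance_to_goal", "angle_to_goal", "is_header", "num_defenders"]
    for cols in [base, base + ["dist_x_header", "angle_x_defenders"]]:
        score = cross_val_score(LogisticRegression(max_iter=1000), df[cols], y,
                                cv=5, scoring="neg_log_loss").mean()
        print(f"{len(cols)} features: CV log-loss = {-score:.4f}")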


Exercise 19.29 (Advanced)

Implement a complete feature engineering pipeline using sklearn.pipeline.Pipeline and ColumnTransformer that handles:

(a) Numerical features: imputation (median) + standardization. (b) Categorical features: imputation (most frequent) + one-hot encoding. (c) Feature selection: mutual information, keeping top 15 features. (d) Model: gradient boosting classifier.

Test the pipeline end-to-end on synthetic shot data with deliberate missing values.
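
A sketch wiring the four stages together; the column names and data are placeholders, and k is reduced to 10 because this toy frame has fewer than 15 transformed columns:

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    rng = np.random.default_rng(6)
    n = 2000
    df = pd.DataFrame(rng.normal(size=(n, 8)),
                      columns=[f"num_{i}" for i in range(8)])
    df["body_part"] = rng.choice(["foot", "head", "other"], n)
    df.loc[rng.random(n) < 0.1, "num_0"] = np.nan  # deliberate missing values
    y = (df["num_1"] + rng.normal(0, 1, n) > 0).astype(int)

    pre = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]),
         [c for c in df.columns if c.startswith("num_")]),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
         ["body_part"]),
    ])
    pipe = Pipeline([
        ("pre", pre),
        ("select", SelectKBest(mutual_info_classif, k=10)),
        ("clf", GradientBoostingClassifier(random_state=0)),
    ])

    X_tr, X_te, y_tr, y_te = train_test_split(df, y, random_state=0)
    print(f"test accuracy: {pipe.fit(X_tr, y_tr).score(X_te, y_te):.3f}")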


Section 19.7 --- Model Deployment and Monitoring

Exercise 19.30 (Foundational)

List five components of a model registry entry for a production xG model. Explain why each component is important for model governance.


Exercise 19.31 (Intermediate)

Calculate the Population Stability Index (PSI) for the following feature distributions:

  Bin         Reference Distribution   New Distribution
  [0, 10)     0.15                     0.12
  [10, 20)    0.25                     0.22
  [20, 30)    0.30                     0.28
  [30, 40)    0.20                     0.23
  [40, 50]    0.10                     0.15

(a) Compute the PSI value. (b) Interpret the result: is there significant drift? (c) Write a Python function that computes PSI given two distributions.
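
For part (c), a straightforward implementation you can use to check parts (a)-(b):

    import numpy as np

    def psi(reference, new, eps=1e-6):
        """Population Stability Index between two binned distributions.

        Both inputs are arrays of bin proportions that sum to 1; eps guards
        against empty bins. Common rules of thumb: < 0.1 stable,
        0.1-0.25 moderate drift, > 0.25 significant drift.
        """
        ref = np.clip(np.asarray(reference, dtype=float), eps, None)
        cur = np.clip(np.asarray(new, dtype=float), eps, None)
        return float(np.sum((cur - ref) * np.log(cur / ref)))

    print(f"PSI = {psi([0.15, 0.25, 0.30, 0.20, 0.10],
                       [0.12, 0.22, 0.28, 0.23, 0.15]):.4f}")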


Exercise 19.32 (Advanced)

Design a complete monitoring dashboard for a deployed xG model. Your design should specify:

(a) Which metrics to track (at least 5). (b) The frequency of each metric computation. (c) Alert thresholds for each metric. (d) The retraining trigger logic. (e) A visualization mockup (describe or sketch 4+ panels).

Write Python code that simulates 12 months of monitoring data and generates alerts when drift is detected.
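
A minimal sketch of the simulation in the last part, tracking a single metric (monthly mean predicted xG) against an assumed alert threshold; a full solution would track several metrics:

    import numpy as np

    rng = np.random.default_rng(8)
    baseline = 0.11  # assumed long-run mean predicted xG per shot
    # Simulate 12 monthly means with an upward drift from month 7 onward.
    monthly = baseline + rng.normal(0, 0.004, 12) + np.where(np.arange(12) >= 6, 0.02, 0)

    THRESHOLD = 0.015  # assumed alert threshold on absolute deviation
    for month, value in enumerate(monthly, start=1):
        status = "ALERT: drift" if abs(value - baseline) > THRESHOLD else "OK"
        print(f"month {month:2d}: mean predicted xG {value:.3f} -> {status}")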


Bonus Challenges

Challenge A: End-to-End xG Pipeline

Build a complete xG model from synthetic event data through deployment:

  1. Generate 50,000 synthetic shots with realistic feature distributions.
  2. Engineer at least 15 features.
  3. Compare 4+ algorithms with proper temporal cross-validation.
  4. Select the best model and analyze it with SHAP values.
  5. Serialize the model and write a prediction function.
  6. Implement basic drift monitoring.

Challenge B: Player Similarity Search Engine

Build a system that, given a target player's statistical profile, finds the $k$ most similar players from a database:

  1. Define a feature set of 12+ per-90 metrics.
  2. Implement multiple distance metrics (Euclidean, cosine, Mahalanobis).
  3. Compare the similarity rankings across metrics.
  4. Build a simple command-line interface that takes a player name and returns the top 5 most similar players.