Chapter 20 Exercises: Predictive Modeling

Part A: Conceptual Foundations (Problems 1-8)

Problem 1 (Difficulty: *) Explain the difference between aleatoric and epistemic uncertainty in the context of predicting the number of goals in a soccer match. Give one concrete example of each type. Which type can be reduced by collecting more data?

Problem 2 (Difficulty: *) A match prediction model assigns the following probabilities to a match: Home Win = 0.45, Draw = 0.30, Away Win = 0.25. The match ends in a draw.

(a) Calculate the log-loss for this single prediction. (b) Calculate the Brier score for this single prediction. (c) If we evaluated this model using accuracy (most likely outcome = prediction), what would the "prediction" be? Would we count this as correct or incorrect? (d) Explain why log-loss is preferred over accuracy for evaluating probabilistic match predictions.
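One way to check the hand calculations in (a) and (b) is the minimal sketch below. It assumes the natural logarithm for log-loss and the multi-class Brier score summed over the three outcomes (some texts divide by the number of classes instead). The function names are illustrative and not taken from the chapter.

```python
import math

def log_loss_single(probs, outcome):
    """Negative natural log of the probability assigned to the observed outcome."""
    return -math.log(probs[outcome])

def brier_single(probs, outcome):
    """Multi-class Brier score: squared error summed over all outcomes (assumed convention)."""
    return sum((p - (1.0 if k == outcome else 0.0)) ** 2 for k, p in probs.items())

probs = {"home": 0.45, "draw": 0.30, "away": 0.25}
ll = log_loss_single(probs, "draw")
bs = brier_single(probs, "draw")
print(f"log-loss = {ll:.4f}, Brier = {bs:.4f}")
```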

Problem 3 (Difficulty: **) In the Dixon-Coles model, the attack and defense parameters require a sum-to-zero constraint for identifiability. Explain why this is necessary using a simple example with just two teams. Show that without the constraint, multiple parameter vectors yield identical predictions.

Problem 4 (Difficulty: **) A scout reports that a 22-year-old striker scored 18 goals in 30 Eredivisie appearances last season. His underlying xG was 14.5 for the same period.

(a) Calculate his goals per 90 (assume 90 minutes per appearance for simplicity). (b) Calculate his xG per 90. (c) Explain why xG per 90 is likely a better predictor of future goal scoring than actual goals per 90. (d) If the year-over-year correlation for goals per 90 is 0.40 and for xG per 90 is 0.60, and the league average xG per 90 for strikers is 0.35, compute the regressed estimate of this player's true xG per 90 talent level.
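The arithmetic in (a), (b), and (d) can be sketched as follows. The sketch assumes that the year-over-year correlation is used directly as the shrinkage weight toward the league mean, which is one common simplification. The helper names are hypothetical.

```python
def per_90(total, minutes):
    """Convert a season total into a per-90-minute rate."""
    return total / minutes * 90.0

def regress_to_mean(observed, league_mean, reliability):
    """Shrink an observed rate toward the league mean (reliability = assumed shrinkage weight)."""
    return league_mean + reliability * (observed - league_mean)

minutes = 30 * 90                      # 30 appearances at an assumed 90 minutes each
goals_p90 = per_90(18, minutes)
xg_p90 = per_90(14.5, minutes)
regressed_xg_p90 = regress_to_mean(xg_p90, league_mean=0.35, reliability=0.60)
print(f"goals/90 = {goals_p90:.3f}, xG/90 = {xg_p90:.3f}, regressed xG/90 = {regressed_xg_p90:.3f}")
```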

Problem 5 (Difficulty: **) Describe three distinct ways to define "transfer success." For each definition, explain one advantage and one disadvantage. Then explain why using a composite measure may be preferable.

Problem 6 (Difficulty: *) What is the acute-chronic workload ratio (ACWR), and why is it used in injury risk prediction? Sketch a hypothetical graph showing the relationship between ACWR and injury risk, labeling the "sweet spot" and "danger zone."
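To make the ratio concrete, here is a minimal sketch of one common way to compute ACWR. It assumes a 7-day acute window, a 28-day chronic window, and simple rolling averages; other window lengths and weighting schemes are also in use. The load values are hypothetical.

```python
def acwr(daily_loads, acute_days=7, chronic_days=28):
    """Ratio of the mean load over the acute window to the mean over the chronic window."""
    if len(daily_loads) < chronic_days:
        raise ValueError("need at least chronic_days of load history")
    acute = sum(daily_loads[-acute_days:]) / acute_days
    chronic = sum(daily_loads[-chronic_days:]) / chronic_days
    if chronic == 0:
        raise ValueError("chronic load is zero; ratio undefined")
    return acute / chronic

steady = [500.0] * 28                  # hypothetical constant daily load
spike = [500.0] * 21 + [900.0] * 7     # hypothetical sudden increase in the last week
steady_ratio = acwr(steady)
spike_ratio = acwr(spike)
print(steady_ratio, spike_ratio)
```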

Problem 7 (Difficulty: ***) A Bayesian match prediction model places priors $a_i \sim \mathcal{N}(0, 0.3^2)$ on team attack parameters.

(a) Interpret this prior: what does it say about our beliefs regarding team quality before seeing data? (b) After observing half a season, the posterior for team $i$'s attack parameter is $a_i \mid \text{data} \sim \mathcal{N}(0.25, 0.12^2)$. How has our belief changed? (c) If we now predict a match for team $i$, explain qualitatively how the posterior predictive distribution differs from a point-estimate prediction. (d) After a full season, the posterior becomes $a_i \mid \text{data} \sim \mathcal{N}(0.30, 0.08^2)$. Compare with the half-season posterior and explain the changes.

Problem 8 (Difficulty: **) Explain the concept of calibration in the context of match prediction. A model predicts "Home Win" with probability 0.60 for 200 matches. In reality, the home team wins in 135 of those matches.

(a) Is this model well-calibrated for this probability bin? (b) Calculate the calibration error for this bin. (c) Suggest one method for improving calibration of an already-trained model.

Part B: Mathematical Derivations (Problems 9-16)

Problem 9 (Difficulty: **) Derive the log-likelihood function for the basic Poisson regression model for match outcomes. Start from the Poisson probability mass function and show all steps to arrive at:

$$ \ell(\theta) = \sum_{k=1}^{N} \left[ y_k \log(\lambda_k) - \lambda_k - \log(y_k!) \right] $$

Problem 10 (Difficulty: ***) In the Dixon-Coles model, the correction factor for the (0,0) scoreline is $\tau(0,0,\lambda,\mu,\rho) = 1 - \rho\lambda\mu$.

(a) Show that for this correction to yield valid probabilities (i.e., $\tau > 0$), we need $\rho < \frac{1}{\lambda\mu}$. (b) If $\lambda = 1.5$ and $\mu = 1.1$, what is the upper bound on $\rho$? (c) Show that the marginal distribution of $X$ remains exactly Poisson even when $\rho \neq 0$: using the remaining correction factors $\tau(0,1,\lambda,\mu,\rho) = 1 + \lambda\rho$ and $\tau = 1$ whenever either score is at least 2, compute $P(X=0) = P(X=0, Y=0) + P(X=0, Y=1) + P(X=0, Y \geq 2)$ and compare with $e^{-\lambda}$.
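The bound in (b) and both sides of the comparison in (c) can be evaluated numerically with the sketch below. Only $\tau(0,0)$ is given in this problem, so the other correction factors are taken from the standard Dixon-Coles specification and are assumed to match the chapter's. The value $\rho = 0.1$ is an arbitrary illustration.

```python
import math

def poisson_pmf(k, rate):
    return math.exp(-rate) * rate ** k / math.factorial(k)

def tau(x, y, lam, mu, rho):
    """Low-score correction factors (standard Dixon-Coles form, assumed here)."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0

def joint(x, y, lam, mu, rho):
    """Corrected joint probability of the scoreline (x, y)."""
    return tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)

lam, mu, rho = 1.5, 1.1, 0.1
rho_max = 1.0 / (lam * mu)                                          # bound from part (a)
marginal_x0 = sum(joint(0, y, lam, mu, rho) for y in range(60))     # truncated sum over y
poisson_x0 = math.exp(-lam)
print(f"rho bound = {rho_max:.4f}")
print(f"P(X=0) under corrected joint = {marginal_x0:.10f}, exp(-lambda) = {poisson_x0:.10f}")
```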

Problem 11 (Difficulty: **) For the exponential time-decay weighting $\phi(t) = e^{-\xi(T-t)}$, where $T$ is the current time and $t$ is the match time:

(a) Show that the half-life of the decay is $t_{1/2} = \frac{\ln 2}{\xi}$. (b) If we want a half-life of 180 days, what value of $\xi$ should we use? (c) If the most recent match was yesterday ($T - t = 1$) and the oldest match was 365 days ago, calculate the ratio of their weights for the $\xi$ found in (b).
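A short numerical check of (b) and (c), assuming time is measured in days. The function names are illustrative.

```python
import math

def decay_rate_from_half_life(half_life_days):
    """xi such that the weight halves every half_life_days."""
    return math.log(2) / half_life_days

def weight(days_ago, xi):
    """Exponential time-decay weight phi = exp(-xi * (T - t))."""
    return math.exp(-xi * days_ago)

xi = decay_rate_from_half_life(180.0)
ratio = weight(1, xi) / weight(365, xi)   # newest match relative to oldest
print(f"xi = {xi:.6f} per day, weight ratio = {ratio:.3f}")
```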

Problem 12 (Difficulty: ***) Consider a mixed-effects aging curve model:

$$ y_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i}) \cdot \text{age}_{ij} + (\beta_2 + b_{2i}) \cdot \text{age}_{ij}^2 + \epsilon_{ij} $$

(a) Derive the peak age for the population average trajectory. (b) Derive the peak age for player $i$ as a function of the random effects. (c) If $\beta_1 = 0.08$, $\beta_2 = -0.0015$, $b_{1i} = 0.01$, and $b_{2i} = -0.0003$, calculate both the population peak age and player $i$'s peak age. (d) Interpret the difference between these two peak ages in footballing terms.
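The numbers in (c) can be checked with a small sketch. It relies on the vertex of a downward-opening quadratic, $-\text{slope} / (2 \cdot \text{curvature})$, which is what parts (a) and (b) ask you to derive. The function name is hypothetical.

```python
def peak_age(linear_coef, quadratic_coef):
    """Age at which a quadratic trajectory peaks (requires negative curvature)."""
    if quadratic_coef >= 0:
        raise ValueError("no interior maximum unless the quadratic coefficient is negative")
    return -linear_coef / (2.0 * quadratic_coef)

beta1, beta2 = 0.08, -0.0015      # population fixed effects
b1i, b2i = 0.01, -0.0003          # player i's random effects

population_peak = peak_age(beta1, beta2)
player_peak = peak_age(beta1 + b1i, beta2 + b2i)
print(f"population peak = {population_peak:.2f}, player i peak = {player_peak:.2f}")
```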

Problem 13 (Difficulty: ***) For a Cox proportional hazards model with two covariates (workload $x_1$ and age $x_2$):

$$ h(t \mid x_1, x_2) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2) $$

(a) Show that the hazard ratio for a one-unit increase in $x_1$, holding $x_2$ constant, is $\exp(\beta_1)$. (b) If $\beta_1 = 0.05$ (workload in km per week) and $\beta_2 = 0.03$ (age in years), calculate and interpret the hazard ratio, relative to an average player, for a player who runs 5 km more per week than average and is 3 years older than average. (c) Explain why the proportional hazards assumption might be violated for age in the context of injury prediction.
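For part (b), the baseline hazard cancels in the ratio, so only the covariate differences matter. A minimal sketch of the calculation, with an illustrative function name:

```python
import math

def hazard_ratio(coefs, deltas):
    """Hazard ratio between two covariate profiles that differ by `deltas`."""
    return math.exp(sum(b * d for b, d in zip(coefs, deltas)))

# +5 km per week of workload and +3 years of age relative to the average player
hr = hazard_ratio(coefs=[0.05, 0.03], deltas=[5.0, 3.0])
print(f"hazard ratio = {hr:.3f}")
```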

Problem 14 (Difficulty: **) Show that the prediction interval for a linear regression model is always wider than the confidence interval. Start from the variance decomposition:

$$ \text{Var}(Y_{\text{new}} - \hat{Y}_{\text{new}}) = \text{Var}(\epsilon) + \text{Var}(\hat{Y}_{\text{new}}) $$

and derive both interval formulas.

Problem 15 (Difficulty: ***) In Bayesian Model Averaging, the posterior model weight for model $M_k$ is:

$$ w_k = \frac{P(\text{data} \mid M_k) P(M_k)}{\sum_{j} P(\text{data} \mid M_j) P(M_j)} $$

(a) If we have three models with equal prior probabilities and marginal likelihoods $P(\text{data} \mid M_k)$ proportional to 10, 5, and 2, compute the posterior model weights. (b) Show that the BMA predictive variance is:

$$ \text{Var}_{\text{BMA}}(\hat{y}) = \sum_k w_k \left[\text{Var}_k(\hat{y}) + \hat{y}_k^2\right] - \left(\sum_k w_k \hat{y}_k\right)^2 $$

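and explain why this is at least as large as the weight-averaged variance of the individual models, with the excess reflecting disagreement between the models' predictions.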
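The weights in (a) and the variance formula in (b) can be explored with the sketch below. The per-model predictive means and variances are hypothetical numbers chosen only for illustration. The decomposition into a within-model and a between-model part follows from the law of total variance.

```python
def posterior_weights(marginal_likelihoods, priors=None):
    """Normalize prior-weighted marginal likelihoods into posterior model weights."""
    if priors is None:
        priors = [1.0 / len(marginal_likelihoods)] * len(marginal_likelihoods)
    unnormalized = [m * p for m, p in zip(marginal_likelihoods, priors)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def bma_mean_and_variance(weights, means, variances):
    """Mixture mean and variance via the second-moment formula from part (b)."""
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(w * (v + m ** 2) for w, m, v in zip(weights, means, variances))
    return mean, second_moment - mean ** 2

weights = posterior_weights([10.0, 5.0, 2.0])
means = [1.4, 1.6, 1.2]          # hypothetical per-model predictions
variances = [0.04, 0.05, 0.06]   # hypothetical per-model predictive variances

bma_mean, bma_var = bma_mean_and_variance(weights, means, variances)
within = sum(w * v for w, v in zip(weights, variances))
between = bma_var - within       # spread of the model means around the BMA mean
print(weights, bma_mean, bma_var, within, between)
```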

Problem 16 (Difficulty: **) The Expected Calibration Error (ECE) is defined as:

$$ \text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} |\bar{y}_b - \bar{p}_b| $$

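where $n_b$ is the number of predictions in bin $b$, $N$ is the total number of predictions, $\bar{y}_b$ is the observed outcome frequency in bin $b$, and $\bar{p}_b$ is the mean predicted probability in bin $b$.

(a) Prove that a perfectly calibrated model has ECE = 0. (b) Construct a simple example with 10 predictions where the model has ECE = 0 but is not a useful predictor. (c) What additional metric would you pair with ECE to identify this deficiency? Explain.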
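A minimal sketch of the ECE formula, assuming equal-width probability bins (the binning scheme is a design choice that the definition leaves open). The example data reuse the single-bin scenario from Problem 8: 200 predictions of 0.60 with 135 observed home wins.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE with equal-width bins; outcomes are 0/1 indicators of the predicted event."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_y - mean_p)
    return ece

probs = [0.60] * 200
outcomes = [1] * 135 + [0] * 65
ece = expected_calibration_error(probs, outcomes)
print(f"ECE = {ece:.4f}")
```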

Part C: Computational Problems (Problems 17-26)

Problem 17 (Difficulty: **) Implement a Dixon-Coles model from scratch for a single season of match data.

(a) Generate synthetic data for a 20-team league with 380 matches, using known attack/defense parameters. (b) Fit the model using maximum likelihood estimation with sum-to-zero constraints. (c) Compare the estimated parameters with the true values. Plot the comparison. (d) Use the model to predict match outcomes for 10 hypothetical fixture sets and evaluate with log-loss.
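As a possible starting point for part (a), the sketch below generates a double round-robin season from known parameters using only the standard library. Several choices are assumptions rather than the chapter's specification: the baseline and home-advantage values, the normal spread of team strengths, the sign convention in which a larger defense value means more goals conceded, and the omission of the low-score correction $\rho$.

```python
import math
import random

def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for small rates such as goal counts."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def centered(values):
    """Shift values so they sum to zero (the identifiability constraint)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def simulate_season(n_teams=20, base=0.1, home_adv=0.25, seed=0):
    rng = random.Random(seed)
    attack = centered([rng.gauss(0.0, 0.3) for _ in range(n_teams)])
    defense = centered([rng.gauss(0.0, 0.3) for _ in range(n_teams)])
    matches = []
    for home in range(n_teams):
        for away in range(n_teams):
            if home == away:
                continue
            lam = math.exp(base + home_adv + attack[home] + defense[away])  # home scoring rate
            mu = math.exp(base + attack[away] + defense[home])              # away scoring rate
            matches.append((home, away, sample_poisson(lam, rng), sample_poisson(mu, rng)))
    return attack, defense, matches

attack, defense, matches = simulate_season()
print(len(matches), matches[:3])
```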

Problem 18 (Difficulty: **) Build a match outcome predictor using gradient boosted trees.

(a) Engineer at least 10 features from raw match data (form, home advantage, team ratings, etc.). (b) Use TimeSeriesSplit for cross-validation (explain why standard k-fold is inappropriate). (c) Compare log-loss with a baseline Poisson model. (d) Plot feature importance and discuss which features are most predictive.

Problem 19 (Difficulty: ***) Implement a complete xG model pipeline.

(a) Use synthetic shot data with features: distance, angle, body part, assist type, game state. (b) Train a logistic regression model and a gradient boosted model. (c) Create calibration plots for both models. (d) Apply Platt scaling to improve calibration of the gradient boosted model. (e) Compare the models using AUC-ROC, log-loss, and Brier score.

Problem 20 (Difficulty: **) Forecast a player's xG per 90 for the next season using multiple methods.

(a) Generate a synthetic career trajectory (10 seasons) with age effects and noise. (b) Apply exponential smoothing, ARIMA(1,0,1), and a regression model with age terms. (c) Compute prediction intervals for each method. (d) Compare forecast accuracy using mean absolute error over the final 3 "held-out" seasons.

Problem 21 (Difficulty: ***) Build an injury risk prediction model.

(a) Generate synthetic player-day data with features: ACWR, total distance, high-speed running, age, injury history, and a binary injury outcome (~2% base rate). (b) Address class imbalance using SMOTE or class weighting. (c) Train a random forest and evaluate using precision-recall curves. (d) Compute the optimal probability threshold for maximizing F1 score. (e) Discuss the costs of false positives (unnecessary rest) vs. false negatives (missed injury prediction) and how this affects threshold selection.

Problem 22 (Difficulty: ***) Implement an aging curve analysis.

(a) Generate synthetic career data for 500 players with position-dependent peak ages and random effects. (b) Estimate the aging curve using the delta method. (c) Fit a mixed-effects model using a spline basis for age. (d) Compare the delta method curve with the mixed-effects curve and explain any differences. (e) Use the fitted model to project a 24-year-old midfielder's performance over the next 8 seasons with confidence bands.

Problem 23 (Difficulty: **) Implement league adjustment factors for transfer analysis.

(a) Generate synthetic transfer data with known league quality differences. (b) Estimate conversion factors from observed transfers. (c) Compute bootstrap confidence intervals for each conversion factor. (d) Demonstrate that the estimated factors recover the true values (with uncertainty) as the sample size increases.

Problem 24 (Difficulty: ***) Build a Bayesian match prediction model.

(a) Use conjugate priors for a simplified Poisson model (Gamma prior on rates). (b) Update the posterior after each matchday of a simulated season. (c) Plot the evolution of team strength estimates and their uncertainty bands over the season. (d) Compare posterior predictive distributions at matchday 5, 15, and 30.
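The conjugate update in (a) and (b) can be sketched for a single team as follows. The sketch assumes a shape-rate Gamma parameterization and one observation per matchday. The prior values and the goal sequence are hypothetical.

```python
import math

def update_gamma(alpha, beta, goals):
    """Gamma(alpha, beta) prior + one Poisson observation -> Gamma(alpha + goals, beta + 1)."""
    return alpha + goals, beta + 1.0

def gamma_mean_sd(alpha, beta):
    return alpha / beta, math.sqrt(alpha) / beta

alpha, beta = 2.8, 2.0                       # hypothetical prior with mean 1.4 goals per match
prior_mean, prior_sd = gamma_mean_sd(alpha, beta)

history = []
for goals in [2, 0, 1, 3, 1]:                # hypothetical goals scored on five matchdays
    alpha, beta = update_gamma(alpha, beta, goals)
    history.append(gamma_mean_sd(alpha, beta))

final_mean, final_sd = history[-1]
print(f"prior: mean={prior_mean:.3f}, sd={prior_sd:.3f}")
print(f"after 5 matchdays: mean={final_mean:.3f}, sd={final_sd:.3f}")
```

Plotting the mean with a band of one or two posterior standard deviations per matchday gives the kind of figure part (c) asks for.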

Problem 25 (Difficulty: **) Create a calibration analysis toolkit.

(a) Implement reliability diagrams from scratch (no library calibration plot functions). (b) Compute ECE for a set of match predictions. (c) Implement isotonic regression for post-hoc calibration. (d) Show before-and-after calibration plots demonstrating improvement.

Problem 26 (Difficulty: ***) Build an ensemble prediction system.

(a) Implement three different match prediction models (Poisson, logistic regression, gradient boosting). (b) Combine them using simple averaging, weighted averaging (weights from validation performance), and stacking. (c) Compare ensemble performance with individual models on held-out data. (d) Analyze when the ensemble improves most over individual models (e.g., close matches vs. one-sided matches).

Part D: Open-Ended Research Problems (Problems 27-32)

Problem 27 (Difficulty: *) Dynamic Elo with mean reversion. Implement an Elo rating system for a soccer league with the following enhancements:

(a) Goal-difference-sensitive K-factor (larger updates for unexpected large margins). (b) Season-to-season mean reversion (regression toward the mean by 33% at the start of each season). (c) Promoted/relegated team initialization. (d) Evaluate against the standard Dixon-Coles model on 5 seasons of data. Under what circumstances does each model perform better?
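One possible shape for parts (a) and (b) is sketched below. The square-root margin multiplier, the K-factor of 20, the home advantage of 60 rating points, and the mean of 1500 are all assumptions made for illustration. The multiplier here scales with margin alone, so making it sensitive to how unexpected the margin was is left as part of the exercise.

```python
import math

def expected_score(r_home, r_away, home_adv=60.0):
    """Standard logistic Elo expectation for the home side."""
    return 1.0 / (1.0 + 10 ** (-(r_home + home_adv - r_away) / 400.0))

def margin_multiplier(goal_diff):
    """Hypothetical multiplier: 1 for draws and one-goal games, growing with the margin."""
    return math.sqrt(max(abs(goal_diff), 1))

def update(r_home, r_away, home_goals, away_goals, k=20.0):
    """Zero-sum rating update after one match."""
    if home_goals > away_goals:
        score = 1.0
    elif home_goals == away_goals:
        score = 0.5
    else:
        score = 0.0
    delta = k * margin_multiplier(home_goals - away_goals) * (score - expected_score(r_home, r_away))
    return r_home + delta, r_away - delta

def revert_to_mean(ratings, mean=1500.0, fraction=0.33):
    """Pull every rating toward the mean by `fraction` at the start of a season."""
    return {team: r + fraction * (mean - r) for team, r in ratings.items()}

new_home, new_away = update(1500.0, 1500.0, 3, 0)
reverted = revert_to_mean({"A": 1600.0, "B": 1400.0})
print(new_home, new_away, reverted)
```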

Problem 28 (Difficulty: *) Transfer success prediction with interpretability. Build a transfer success classifier and apply SHAP (SHapley Additive exPlanations) to make it interpretable.

(a) Define success as "played >1500 minutes in first season at new club." (b) Engineer features including performance metrics, age, league gap, and playing style compatibility. (c) Train a gradient boosted model and compute SHAP values for each feature. (d) Create a "transfer risk report" that shows, for a hypothetical transfer target, which factors increase and decrease the probability of success. (e) Discuss the ethical implications of using such a model in transfer decisions.

Problem 29 (Difficulty: *) Career trajectory clustering. Use latent class trajectory analysis to identify distinct career arc types.

(a) Generate synthetic data with 4 latent classes: early bloomer, late developer, steady performer, and flash-in-the-pan. (b) Fit a mixture of polynomial trajectories using EM algorithm. (c) Determine the optimal number of classes using BIC. (d) Classify a set of active players into trajectory types and project their future performance. (e) Discuss how class membership probability changes with the amount of observed career data.

Problem 30 (Difficulty: *) In-match win probability model. Build a model that updates win/draw/loss probabilities in real time during a match.

(a) Model the scoring process as an inhomogeneous Poisson process with a time-varying rate. (b) Condition on the current score, time remaining, red cards, and team strengths. (c) Use simulation (10,000 match completions from the current state) to estimate outcome probabilities. (d) Create a "win probability added" (WPA) metric that assigns credit to individual events. (e) Validate the model's calibration: at each minute, do the predicted probabilities match observed frequencies?
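Parts (a) to (c) can be prototyped with the sketch below. It uses the fact that, for an inhomogeneous Poisson process, the number of goals in an interval is Poisson with mean equal to the integrated rate. The linearly rising per-minute rate, the 90-minute match length, and the team rates are hypothetical, and red cards are not modeled.

```python
import math
import random

def sample_poisson(rate, rng):
    """Knuth's multiplication method; fine for small expected goal counts."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def minute_rate(minute, goals_per_match):
    """Hypothetical per-minute scoring rate that rises linearly through the match."""
    return goals_per_match / 90.0 * (0.8 + 0.4 * minute / 90.0)

def outcome_probabilities(home_goals, away_goals, minute, home_strength, away_strength,
                          n_sims=10_000, seed=0):
    rng = random.Random(seed)
    # Expected remaining goals: sum the per-minute rate over the minutes still to play.
    home_mean = sum(minute_rate(m, home_strength) for m in range(minute, 90))
    away_mean = sum(minute_rate(m, away_strength) for m in range(minute, 90))
    counts = {"home": 0, "draw": 0, "away": 0}
    for _ in range(n_sims):
        h = home_goals + sample_poisson(home_mean, rng)
        a = away_goals + sample_poisson(away_mean, rng)
        counts["home" if h > a else "away" if a > h else "draw"] += 1
    return {k: v / n_sims for k, v in counts.items()}

live = outcome_probabilities(1, 0, minute=60, home_strength=1.5, away_strength=1.1)
final = outcome_probabilities(1, 0, minute=90, home_strength=1.5, away_strength=1.1)
print(live, final)
```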

Problem 31 (Difficulty: *) Injury prediction with survival analysis. Implement a full survival analysis pipeline for muscle injury prediction.

(a) Generate realistic time-to-event data with censoring (players who don't get injured during the observation period). (b) Fit a Cox proportional hazards model and a parametric Weibull model. (c) Assess the proportional hazards assumption using Schoenfeld residuals. (d) Implement a competing risks model with separate hazards for hamstring, quadriceps, and calf injuries. (e) Evaluate discrimination using the concordance index (C-statistic).

Problem 32 (Difficulty: *) Full prediction pipeline. Build an end-to-end predictive system that combines multiple models.

(a) Match outcome model (Dixon-Coles or ML-based). (b) Player performance forecasting (time series + aging curves). (c) Injury risk model (survival analysis). (d) Transfer valuation model (similarity-based). (e) Integrate all four into a dashboard-style report for a hypothetical sporting director, including uncertainty quantification for every prediction. (f) Write a 1-page executive summary explaining the key findings and their confidence levels in non-technical language.