Chapter 20 Quiz: Predictive Modeling
Test your understanding of the predictive modeling concepts covered in this chapter. Each question has one best answer unless otherwise noted. Answers with explanations are provided in expandable sections.
Question 1. In the Dixon-Coles match prediction model, what do the parameters $a_i$ and $d_i$ represent?
(a) Attack and defense ratings on a 0-100 scale (b) Log-scale attack strength and defensive strength for team $i$ (c) The expected goals scored and conceded by team $i$ (d) The probability of team $i$ winning and losing
Answer
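To see how log-scale ratings turn into a goal rate, the sketch below evaluates the chapter's log-linear form $\log(\lambda_{ij}) = \alpha + \beta_{\text{home}} + a_i - d_j$ for a home team $i$ against an away team $j$. Every numeric value and the helper name `home_goal_rate` are hypothetical and chosen only for illustration; they are not fitted estimates.

```python
import math

# Hypothetical parameter values, chosen only for illustration.
alpha = 0.10       # baseline log goal rate
beta_home = 0.25   # home-advantage term
attack = {"A": 0.30, "B": -0.10}   # a_i on the log scale
defense = {"A": 0.20, "B": -0.05}  # d_i on the log scale (higher = stronger defense)


def home_goal_rate(home: str, away: str) -> float:
    """Expected goals for the home team: exp(alpha + beta_home + a_i - d_j)."""
    return math.exp(alpha + beta_home + attack[home] - defense[away])


rate = home_goal_rate("A", "B")
print(f"Expected home goals for A vs B: {rate:.3f}")
```

Negative ratings are perfectly valid inputs here: exponentiating always gives a positive rate, which is why the parameters can be unbounded.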
**(b)** In the Dixon-Coles formulation, $a_i$ and $d_i$ are log-scale parameters representing the attack and defense strength of team $i$. They enter the model through the log-linear equation: $\log(\lambda_{ij}) = \alpha + \beta_{\text{home}} + a_i - d_j$. They are not on a probability scale or a 0-100 scale; they are unbounded real-valued parameters on the log-rate scale.
Question 2. Why does the Dixon-Coles model introduce the correction factor $\tau$?
(a) To account for extra time and penalty shootouts (b) To correct for the independence assumption being violated at low scorelines (c) To adjust for home advantage (d) To handle promoted and relegated teams
Answer
**(b)** The bivariate independent Poisson model underestimates certain low-scoring outcomes (like 0-0 and 1-1 draws) and overestimates others. The $\tau$ correction factor adjusts the joint probabilities for scorelines involving 0 or 1 goals, where the independence assumption is most clearly violated. Home advantage is handled by the $\beta_{\text{home}}$ parameter, not by $\tau$.
Question 3. A match prediction model outputs Home Win = 0.50, Draw = 0.25, Away Win = 0.25. The away team wins. What is the log-loss for this prediction?
(a) $-\ln(0.50) \approx 0.693$ (b) $-\ln(0.25) \approx 1.386$ (c) $-\ln(0.75) \approx 0.288$ (d) $-(0.50 \ln(0.50) + 0.25 \ln(0.25) + 0.25 \ln(0.25)) \approx 1.040$
Answer
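The candidate values can be checked numerically. This short sketch uses the probabilities from the question and computes both the single-observation log-loss and the entropy of the forecast, so the two quantities can be compared side by side; the variable names are illustrative.

```python
import math

# Forecast and outcome taken from the question.
probs = {"home": 0.50, "draw": 0.25, "away": 0.25}
actual = "away"

# Log-loss for one observation: negative log of the probability given to what happened.
log_loss = -math.log(probs[actual])

# Entropy of the forecast itself: does not depend on the actual result.
entropy = -sum(p * math.log(p) for p in probs.values())

print(f"log-loss = {log_loss:.3f}")  # 1.386
print(f"entropy  = {entropy:.3f}")   # 1.040
```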
**(b)** Log-loss for a single observation is $-\ln(\hat{p}_{\text{true class}})$. Since the true outcome is an away win and the predicted probability for away win was 0.25, the log-loss is $-\ln(0.25) \approx 1.386$. Answer (d) computes the entropy of the predicted distribution, which is a different quantity.
Question 4. What is the primary advantage of time-weighted estimation in the Dixon-Coles model?
(a) It makes the model faster to compute (b) It gives more influence to recent matches, capturing changes in team strength (c) It removes the need for the $\tau$ correction (d) It ensures all teams have equal weight in the estimation
Answer
**(b)** Time-weighted estimation uses an exponential decay function to downweight older matches, giving more influence to recent results. This allows the model to adapt to changes in team strength over the course of a season (e.g., due to transfers, injuries, or managerial changes).
Question 5. Which evaluation metric accounts for the ordinal nature of match outcomes (Home Win > Draw > Away Win)?
(a) Log-loss (b) Brier score (c) Ranked Probability Score (RPS) (d) AUC-ROC
Answer
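The role of outcome ordering is easiest to see with two forecasts that give the true result the same probability. The sketch below assumes the common three-outcome definition of the Ranked Probability Score (mean squared difference of cumulative distributions over the first two categories, ordered home, draw, away); the function name `rps` and both forecasts are made up for illustration.

```python
def rps(probs, outcome_index):
    """Ranked Probability Score for one match (lower is better).

    probs: probabilities ordered [home, draw, away].
    outcome_index: position of the observed outcome in that ordering.
    """
    n = len(probs)
    cum_pred = 0.0
    cum_obs = 0.0
    total = 0.0
    # The last cumulative term is always 1 - 1 = 0, so it is skipped.
    for k in range(n - 1):
        cum_pred += probs[k]
        cum_obs += 1.0 if k == outcome_index else 0.0
        total += (cum_pred - cum_obs) ** 2
    return total / (n - 1)


away_win = 2
leans_home = [0.6, 0.2, 0.2]  # most mass far from the truth
leans_draw = [0.2, 0.6, 0.2]  # most mass adjacent to the truth

print(rps(leans_home, away_win), rps(leans_draw, away_win))
```

Both forecasts assign 0.2 to the away win, so their log-loss is identical, yet the forecast leaning toward a home win receives the larger RPS.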
**(c)** The Ranked Probability Score considers the cumulative distribution and penalizes predictions that are "further away" from the truth on the ordinal scale. For example, when the actual result is an away win, placing a high probability on a home win is penalized more than placing the same probability on a draw. Log-loss and the Brier score ignore this ordering: they score those two forecasts identically.
Question 6. A player's xG per 90 last season was 0.55 in the Eredivisie. The league conversion factor from Eredivisie to Premier League is estimated at 0.72. What is the adjusted xG per 90 projection?
(a) 0.55 (b) 0.40 (c) 0.76 (d) 0.55 / 0.72 = 0.76
Answer
**(b)** The adjusted projection is $0.55 \times 0.72 = 0.396 \approx 0.40$. The conversion factor represents the expected fraction of the source-league performance that will be maintained in the target league. Moving from a lower-quality league to a higher-quality league typically yields a factor less than 1.0.
Question 7. Which type of uncertainty can be reduced by collecting more data?
(a) Aleatoric uncertainty (b) Epistemic uncertainty (c) Both (d) Neither
Answer
**(b)** Epistemic uncertainty arises from limited knowledge (insufficient data or model misspecification) and can be reduced by collecting more data or improving the model. Aleatoric uncertainty is inherent randomness in the system (e.g., the unpredictable bounce of a ball) and cannot be reduced regardless of how much data is collected.
Question 8. What is the key difference between a confidence interval and a prediction interval?
(a) Confidence intervals are Bayesian; prediction intervals are frequentist (b) Prediction intervals include aleatoric uncertainty; confidence intervals do not (c) Confidence intervals are always wider than prediction intervals (d) There is no meaningful difference
Answer
**(b)** A confidence interval quantifies uncertainty about the mean response (epistemic uncertainty only), while a prediction interval quantifies uncertainty about a future individual observation, which includes both epistemic and aleatoric uncertainty. As a result, at the same confidence level and for the same inputs, a prediction interval is wider than the corresponding confidence interval. Both can be constructed in frequentist or Bayesian frameworks.
Question 9. In the context of player performance forecasting, what does "regression to the mean" imply?
(a) All players eventually become average (b) Extreme performance in one season is partly due to luck and will partially revert (c) Players should be evaluated relative to the league mean (d) Regression models should always include the league average as a feature
Answer
**(b)** Regression to the mean is a statistical phenomenon where extreme observations tend to be followed by observations closer to the mean. In soccer, a player who vastly outperforms their expected metrics (e.g., goals >> xG) is benefiting partly from luck, and this luck component is unlikely to repeat. The player does not become "average" -- they regress toward their true talent level, which is between the observed extreme and the population average.
Question 10. Which aging curve shape is most typical for professional soccer players?
(a) Linear decline from age 18 (b) Inverted U-shape with position-dependent peak (c) Flat until age 30, then sudden decline (d) Exponential growth followed by exponential decline
Answer
**(b)** Performance typically follows an inverted-U shape: players improve through their early 20s as they gain experience and physical maturity, reach a peak in their mid-to-late 20s, and then gradually decline. The peak age varies by position -- full-backs and wingers (who rely on pace) tend to peak earlier than goalkeepers and central midfielders (who rely more on positioning and experience).
Question 11. The acute-chronic workload ratio (ACWR) is calculated as:
(a) Total season workload / matches played (b) 7-day rolling average workload / 28-day rolling average workload (c) Training load / match load (d) Current week load / previous week load
Answer
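The ratio can be computed directly from a daily load series. The sketch below assumes one load value per day, uses the 7-day and 28-day windows named in the options, and lets the chronic window include the acute days; the function name `acwr` and the load values are hypothetical.

```python
def acwr(daily_loads, acute_days=7, chronic_days=28):
    """Acute-chronic workload ratio from the most recent daily loads.

    Assumes daily_loads is ordered oldest to newest, one value per day.
    """
    if len(daily_loads) < chronic_days:
        raise ValueError("need at least chronic_days of load data")
    acute = sum(daily_loads[-acute_days:]) / acute_days
    chronic = sum(daily_loads[-chronic_days:]) / chronic_days
    if chronic == 0:
        raise ValueError("chronic load is zero; ratio is undefined")
    return acute / chronic


steady = [400] * 28              # constant load for four weeks
spike = [400] * 21 + [700] * 7   # three normal weeks, then a heavy week

print(f"steady: {acwr(steady):.2f}")  # 1.00
print(f"spike:  {acwr(spike):.2f}")   # 1.47
```

A constant load gives a ratio of exactly 1, while the heavy final week pushes the ratio well above 1 even though the chronic average has already started to rise.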
**(b)** The ACWR divides the acute workload (typically a 7-day rolling average) by the chronic workload (typically a 28-day rolling average). This ratio captures whether a player's recent workload is higher or lower than what they are accustomed to. Values above roughly 1.5 are commonly read as a workload spike associated with elevated injury risk.
Question 12. Why is accuracy a poor metric for evaluating injury prediction models?
(a) Injuries are too easy to predict (b) Injuries are rare events, so a model that always predicts "no injury" achieves ~98% accuracy (c) Accuracy does not apply to binary classification problems (d) Injury prediction is a regression problem, not classification
Answer
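A small synthetic example makes the problem with accuracy concrete. The data below are invented: 10,000 player-days with a 2% injury rate (matching the ~98% figure in the options) and a trivial model that never predicts an injury.

```python
# Synthetic labels: 1 = injury on that player-day, 0 = no injury.
n_days = 10_000
n_injuries = 200  # 2% of player-days (illustrative assumption)
y_true = [1] * n_injuries + [0] * (n_days - n_injuries)

# Trivial model: always predict "no injury".
y_pred = [0] * n_days

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_days
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_injuries

print(f"accuracy = {accuracy:.2f}")  # 0.98
print(f"recall   = {recall:.2f}")    # 0.00
```

The model scores 98% accuracy while detecting none of the 200 injuries, which is why recall-oriented and calibration-based metrics are more informative here.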
**(b)** Injuries occur on roughly 2% of player-days, creating severe class imbalance. A trivial model that always predicts "no injury" achieves approximately 98% accuracy but is completely useless. Precision-recall curves, F1 score, and calibrated probability estimates are much more informative metrics for evaluating rare-event prediction models.
Question 13. In a Cox proportional hazards model, if a covariate has a hazard ratio of 1.25, this means:
(a) The event occurs 25% sooner (b) A one-unit increase in the covariate increases the hazard by 25% (c) The survival probability decreases by 25% (d) The covariate explains 25% of the variance
Answer
**(b)** A hazard ratio of 1.25 means that a one-unit increase in the covariate is associated with a 25% increase in the instantaneous rate (hazard) of the event occurring, holding all other covariates constant. This does not directly translate to a 25% change in survival probability or event timing, as the relationship between hazard and survival is nonlinear.
Question 14. What is the purpose of mixed-effects models in career trajectory analysis?
(a) To handle missing data in longitudinal studies (b) To separate population-average patterns from individual deviations (c) To model interaction effects between position and age (d) To correct for selection bias in player samples
Answer
**(b)** Mixed-effects models include both fixed effects (which describe the population-average trajectory) and random effects (which allow each player to deviate from this average). This is essential in career trajectory analysis because players have different talent levels (intercept), different development rates (slope), and different peak characteristics (curvature). The model estimates both the average pattern and how much individual players vary around it.
Question 15. A latent class trajectory model identifies four career arc types. Which type describes a player who performs well in their early 20s but declines rapidly?
(a) Late developer (b) Steady performer (c) Early bloomer (d) Flash in the pan
Answer
**(d)** A "flash in the pan" player shows a rapid rise in performance followed by an equally rapid decline, often without an extended peak period. An "early bloomer" peaks early but typically maintains a reasonable level for longer before declining. The distinction is in the sustainability of the early peak.
Question 16. When adjusting player performance metrics across leagues, which approach is most commonly used?
(a) Dividing by league GDP (b) Using conversion factors estimated from players who moved between the leagues (c) Multiplying by the UEFA coefficient ratio (d) No adjustment is needed if using per-90 metrics
Answer
**(b)** League adjustment factors are estimated by examining the performance of players who have played in both the source and target leagues. By comparing their per-90 metrics before and after the move, we can estimate the multiplicative factor needed to project performance from one league to another. UEFA coefficients are too coarse, and per-90 metrics still need league adjustment.
Question 17. In Bayesian Model Averaging, what determines the weight assigned to each model?
(a) The number of parameters in the model (b) The model's training accuracy (c) The marginal likelihood (model evidence) multiplied by the prior model probability (d) Cross-validation performance
Answer
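Posterior model weights are usually computed on the log scale, because marginal likelihoods are often too small to represent directly. The sketch below normalizes log evidence plus log prior with a max-shift for numerical stability; the function name `bma_weights` and the log-evidence values are hypothetical.

```python
import math


def bma_weights(log_evidences, priors):
    """Posterior model weights, proportional to evidence times prior.

    log_evidences: log marginal likelihood of the data under each model.
    priors: prior model probabilities (positive, same length).
    """
    log_post = [le + math.log(p) for le, p in zip(log_evidences, priors)]
    shift = max(log_post)  # subtract the max so exp() cannot underflow to all zeros
    unnormalized = [math.exp(lp - shift) for lp in log_post]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Illustrative log evidences for three candidate models, equal prior probabilities.
weights = bma_weights([-1050.0, -1052.0, -1049.0], [1 / 3, 1 / 3, 1 / 3])
print([round(w, 3) for w in weights])
```

With equal priors the weights depend only on differences in log evidence, so a gap of a few log units is enough for one model to dominate the average.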
**(c)** In BMA, the posterior model weight is proportional to $P(\text{data} \mid M_k) \times P(M_k)$, where $P(\text{data} \mid M_k)$ is the marginal likelihood (model evidence) and $P(M_k)$ is the prior probability of the model. The marginal likelihood naturally penalizes model complexity (Occam's razor), so more complex models are only favored if they fit the data substantially better.
Question 18. What is the Expected Calibration Error (ECE)?
(a) The average absolute difference between predicted probabilities and observed frequencies across bins (b) The mean squared error of probability predictions (c) The maximum deviation between predicted and observed probabilities (d) The entropy of the predicted probability distribution
Answer
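A binned calibration error can be computed in a few lines. The sketch below assumes equal-width probability bins and binary outcomes; the function name `expected_calibration_error`, the bin count, and the toy data are illustrative choices rather than a fixed standard.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean absolute gap between predicted probability and observed frequency.

    probs: predicted probabilities in [0, 1].
    outcomes: observed binary outcomes (0 or 1), same length as probs.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_y - mean_p)
    return ece


# Calibrated toy forecasts: 20% events happen 1 time in 5, 80% events 4 times in 5.
calibrated = expected_calibration_error([0.2] * 5 + [0.8] * 5,
                                        [1, 0, 0, 0, 0, 1, 1, 1, 1, 0])
# Overconfident toy forecasts: 90% predicted, only half occur.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
print(calibrated, overconfident)
```

The calibrated forecasts score approximately zero, while the overconfident ones score about 0.4, the gap between the 90% prediction and the 50% observed frequency.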
**(a)** The ECE is a weighted average of the absolute differences between predicted probabilities and observed frequencies across probability bins: $\text{ECE} = \sum_b \frac{n_b}{N} |\bar{y}_b - \bar{p}_b|$. It measures how well the predicted probabilities correspond to actual outcome frequencies. A perfectly calibrated model has ECE = 0.
Question 19. Which statement about the Poisson distribution as a model for goals scored is TRUE?
(a) It always perfectly fits soccer goal data (b) It constrains the mean to equal the variance (c) It can model negative goal counts (d) It naturally handles overdispersion
Answer
**(b)** The Poisson distribution has the property that its mean equals its variance ($E[X] = \text{Var}(X) = \lambda$). This is both a strength (parsimony) and a limitation, as real soccer goal data sometimes exhibits overdispersion (variance > mean), which the Negative Binomial distribution can accommodate.
Question 20. Why is TimeSeriesSplit preferred over standard k-fold cross-validation for match prediction?
(a) It runs faster (b) It prevents data leakage by ensuring training data always precedes validation data (c) It produces more folds (d) It handles missing values better
Answer
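The idea behind a chronological split can be shown without any library. The sketch below is a hand-rolled expanding-window splitter in plain Python, written only to illustrate the principle; it is not scikit-learn's implementation, and the function name `expanding_window_splits` is hypothetical. It assumes matches are already sorted by date.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, validation_indices) pairs in chronological order.

    Each validation block follows its training block, and the training
    window grows from one split to the next.
    """
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for i in range(1, n_splits + 1):
        train_end = n_samples - (n_splits - i + 1) * fold
        yield list(range(train_end)), list(range(train_end, train_end + fold))


# Eight matches in date order, three splits.
splits = list(expanding_window_splits(8, 3))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

In every split the largest training index is smaller than the smallest validation index, so no future match can leak into training.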
**(b)** In match prediction, using future data to predict past results constitutes data leakage. Standard k-fold cross-validation randomly assigns matches to folds, which means a model might be trained on matchday 30 data and evaluated on matchday 10. `TimeSeriesSplit` ensures that training data always comes chronologically before validation data, mimicking the real-world scenario where we can only use past information.
Question 21. A player similarity model uses Mahalanobis distance rather than Euclidean distance because:
(a) Mahalanobis distance is faster to compute (b) Mahalanobis distance accounts for correlations between features and differences in scale (c) Euclidean distance cannot handle more than 3 dimensions (d) Mahalanobis distance always gives smaller values
Answer
**(b)** Mahalanobis distance uses the covariance matrix to account for both the variance of each feature and the correlations between features. In player profiling, metrics like xG and shots per 90 are correlated, and different features have different scales. Euclidean distance would be dominated by whichever feature has the largest numerical range and would ignore correlations.
Question 22. In the context of injury prediction, what is a "competing risks" model?
(a) A model that predicts which team will get more injuries (b) A model where multiple injury types compete, and experiencing one censors observation of the others (c) A model that compares different prediction algorithms (d) A model that adjusts for the competitive level of matches
Answer
**(b)** A competing risks model recognizes that a player faces multiple possible injury types simultaneously (e.g., hamstring, ACL, ankle). Experiencing a hamstring injury removes the player from the "at-risk" pool for other injury types during recovery. Standard survival analysis treats all non-events as censored, but competing risks models estimate separate hazard functions for each event type while accounting for this censoring structure.
Question 23. The year-over-year correlation for assists per 90 in soccer is approximately 0.25-0.35. This implies:
(a) Assists are mostly determined by skill (b) Assists are highly repeatable from season to season (c) A large component of assists is due to factors other than the player's stable skill (d) Assists should not be used in player evaluation
Answer
**(c)** A low year-over-year correlation means that a player's assists per 90 in one season are only weakly predictive of their assists per 90 in the next season. This suggests that assists depend heavily on factors beyond the player's stable skill level -- teammate quality, chance variation, playing time patterns, and tactical systems. This does not mean assists are useless for evaluation, but they should be heavily regressed toward the mean when forecasting.
Question 24. What is the main advantage of Bayesian inference over frequentist methods for uncertainty quantification in soccer analytics?
(a) Bayesian methods are always more accurate (b) Bayesian methods produce full posterior distributions rather than point estimates with intervals (c) Bayesian methods do not require assumptions (d) Bayesian methods are computationally faster
Answer
**(b)** Bayesian inference produces full posterior distributions over parameters and predictions, which provide a complete characterization of uncertainty. This is particularly valuable in soccer analytics where decisions depend on the full range of possible outcomes (e.g., "what is the probability this player scores more than 15 goals?" requires the full distribution, not just a point estimate and interval). Bayesian methods do require prior specifications and are typically more computationally expensive than frequentist alternatives.
Question 25. When communicating prediction uncertainty to a non-technical sporting director, which approach is recommended?
(a) Report exact posterior distributions with mathematical notation (b) Use natural frequencies and scenario-based presentations (c) Omit uncertainty information to avoid confusion (d) Report only the best-case scenario to maintain confidence