Chapter 29 Exercises: Comprehensive Case Studies
These exercises are designed as end-to-end projects that integrate multiple techniques from across the textbook. Each exercise requires you to work through the full analytics workflow: problem definition, data preparation, modeling, evaluation, and communication of results.
Section A: xG Pipeline Projects (Exercises 1-6)
Exercise 1: Basic xG Model from Scratch
Build a complete xG model using only shot location (x, y coordinates) and body part as features. Your pipeline should include:
a) Data loading and validation (handle missing values, out-of-range coordinates).
b) Feature engineering: compute distance to goal, angle to goal, and at least two interaction terms.
c) Train a logistic regression model using 5-fold stratified cross-validation.
d) Report log-loss, Brier score, and ROC-AUC for each fold, plus the mean and standard deviation across folds.
e) Plot a calibration curve with 10 bins and compute the Expected Calibration Error (ECE).
Deliverable: A single Python script that runs end-to-end and produces all outputs.
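The geometric features in (b) can be derived directly from raw coordinates. A minimal sketch, assuming a 105 x 68 m pitch with the goal centre at (105, 34) and a 7.32 m goal mouth; adapt the constants to your data provider's coordinate system:

```python
import numpy as np

# Assumed pitch convention: 105 x 68 m, attacking left-to-right,
# goal centre at (105, 34), posts at y = 30.34 and y = 37.66.
GOAL_X, POST_Y1, POST_Y2 = 105.0, 30.34, 37.66

def shot_features(x, y):
    """Distance to the goal centre and the angle subtended by the goal mouth."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dist = np.hypot(GOAL_X - x, (POST_Y1 + POST_Y2) / 2 - y)
    # Law of cosines on the triangle (shot, near post, far post).
    a = np.hypot(GOAL_X - x, POST_Y1 - y)
    b = np.hypot(GOAL_X - x, POST_Y2 - y)
    c = POST_Y2 - POST_Y1
    cos_theta = np.clip((a**2 + b**2 - c**2) / (2 * a * b), -1.0, 1.0)
    return dist, np.arccos(cos_theta)
```

A shot from the penalty spot (94, 34) should give a distance of 11 m and the widest angle for that distance; products such as `dist * angle` are natural interaction terms for (b).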
Exercise 2: Feature Importance Analysis
Starting from the xG model in Exercise 1, extend it to include at least 8 additional contextual features (e.g., play pattern, number of defenders, shot under pressure, fast break indicator). Then:
a) Train both a logistic regression model and a gradient boosting model.
b) For the logistic regression, extract and plot the coefficients with 95% confidence intervals.
c) For the gradient boosting model, compute and plot permutation feature importances.
d) Compare the top-5 features from each method and discuss why they might differ.
Exercise 3: xG Model Calibration
Using a gradient boosting xG model:
a) Plot the raw calibration curve (before any post-hoc calibration).
b) Apply Platt scaling (logistic calibration) and plot the resulting calibration curve.
c) Apply isotonic regression calibration and plot the resulting calibration curve.
d) Compute the ECE for all three versions (raw, Platt, isotonic).
e) Discuss the trade-offs between the calibration methods. Under what conditions might isotonic regression overfit?
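The ECE needed in (d) is the bin-weight-averaged gap between observed frequency and mean predicted probability. A minimal sketch with equal-width bins (quantile bins are a common alternative):

```python
import numpy as np

def ece(y_true, y_prob, n_bins=10):
    """Expected Calibration Error: weighted mean |observed - predicted| per bin."""
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so that y_prob == 1.0 is included.
        mask = (y_prob >= lo) & (y_prob < hi) if hi < 1.0 else (y_prob >= lo)
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            total += mask.mean() * gap
    return total
```

Running the same function on the raw, Platt-scaled, and isotonic predictions makes the comparison in (d) a one-liner per model.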
Exercise 4: Game-State-Aware xG
Standard xG models treat all shots equally regardless of the game state. Build an extended model that:
a) Includes game-state features: score differential, minute of the match, and a home/away indicator.
b) Tests whether teams trailing by one goal take lower-quality shots on average (a selection effect).
c) Implements a separate xG model for penalty kicks using only historical conversion rates.
d) Combines the open-play and penalty models into a unified prediction pipeline.
Exercise 5: xG Timeline Visualization
Build a visualization tool that creates an xG timeline chart for any given match:
a) Plot cumulative xG for both teams across the 90 minutes.
b) Mark actual goals with distinct markers.
c) Add a shaded confidence band showing the 10th-90th percentile of simulated scorelines.
d) Include an xG flow chart (the difference in xG over rolling 10-minute windows).
Exercise 6: Model Deployment and Monitoring
Design a model monitoring system for a deployed xG model:
a) Implement a function that computes weekly calibration metrics on new incoming data.
b) Create a drift-detection mechanism that flags significant shifts in the feature distributions (use the Kolmogorov-Smirnov test).
c) Build an alerting system that triggers model retraining when the calibration error exceeds a threshold.
d) Write unit tests for your scoring function that validate output ranges and edge cases.
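The two-sample KS test in (b) compares each feature's incoming values against a reference window. A sketch assuming SciPy and a hypothetical feature-name-to-array layout; in production you would also track the flagged features over time rather than alerting on a single week:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference, incoming, alpha=0.01):
    """Return {feature: (ks_statistic, p_value)} for features that drifted.

    `reference` and `incoming` map feature name -> 1-D array of values
    (assumed layout; adapt to your feature store). A small alpha limits
    false alarms when many features are tested every week.
    """
    flagged = {}
    for name, ref in reference.items():
        stat, p = ks_2samp(ref, incoming[name])
        if p < alpha:
            flagged[name] = (stat, p)
    return flagged
```

With many features, consider a multiple-testing correction (e.g. Bonferroni on `alpha`) before wiring this into the alerting system of (c).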
Section B: Scouting Analytics Projects (Exercises 7-12)
Exercise 7: Striker Scouting Database
Build a comprehensive scouting database for strikers across Europe's top 5 leagues:
a) Define at least 12 per-90 metrics relevant to striker evaluation.
b) Implement per-90 normalization with a minimum-minutes threshold.
c) Apply z-score standardization within each league to account for league-level differences.
d) Create a composite scoring function with customizable weights.
e) Produce a ranked top-20 list showing both raw and standardized metrics.
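Steps (b) and (c) combine naturally in one pandas transform. A minimal sketch assuming hypothetical columns `player`, `league`, `minutes`, and raw season totals for each metric:

```python
import pandas as pd

def per90_zscores(df, metric_cols, min_minutes=900):
    """Per-90 normalisation plus within-league z-scores.

    Assumed columns: 'player', 'league', 'minutes', and season totals
    in `metric_cols`. Players below `min_minutes` are dropped first so
    small samples do not distort the league means.
    """
    out = df[df["minutes"] >= min_minutes].copy()
    for col in metric_cols:
        out[col + "_p90"] = out[col] / out["minutes"] * 90
        grp = out.groupby("league")[col + "_p90"]
        out[col + "_z"] = (out[col + "_p90"] - grp.transform("mean")) / grp.transform("std")
    return out
```

The composite score in (d) is then a weighted sum over the `_z` columns, which keeps the weights comparable across metrics with different raw scales.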
Exercise 8: Player Similarity Engine
Implement a player similarity engine that supports multiple distance metrics:
a) Implement cosine similarity, Euclidean distance, and Mahalanobis distance.
b) For a given target player, return the top-10 most similar players under each metric.
c) Analyze the overlap between the three similarity lists. Which metric produces the most "intuitive" results?
d) Add position awareness: only compare players within the same positional cluster.
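All three metrics in (a) can share one ranking function. A sketch assuming a standardised player-by-feature matrix; the Mahalanobis branch uses a pseudo-inverse so it degrades gracefully when features are collinear:

```python
import numpy as np

def top_similar(X, names, target, metric="cosine", k=10):
    """Top-k players most similar to `target` under the chosen metric.

    X: (n_players, n_features) matrix of standardised metrics, with
    `names[i]` labelling row i (assumed layout).
    """
    X = np.asarray(X, float)
    t = X[names.index(target)]
    if metric == "cosine":
        sims = X @ t / (np.linalg.norm(X, axis=1) * np.linalg.norm(t))
        order = np.argsort(-sims)
    elif metric == "euclidean":
        order = np.argsort(np.linalg.norm(X - t, axis=1))
    elif metric == "mahalanobis":
        VI = np.linalg.pinv(np.cov(X, rowvar=False))
        d = np.einsum("ij,jk,ik->i", X - t, VI, X - t)  # squared distances
        order = np.argsort(d)
    else:
        raise ValueError(metric)
    return [names[i] for i in order if names[i] != target][:k]
```

For (c), the overlap between lists is simply the intersection of the three returned name sets.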
Exercise 9: Archetype Discovery
Using clustering techniques, discover natural player archetypes among forwards:
a) Apply $k$-means clustering with $k = 4, 5, 6, 7, 8$ and plot the elbow curve and silhouette scores.
b) Select the optimal $k$ and name each archetype based on the cluster centroids.
c) Visualize the archetypes using PCA (a 2D projection) with cluster coloring.
d) For a given departing player, identify their archetype and find the best replacement candidates within it.
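The sweep in (a) reduces to collecting inertia (for the elbow curve) and silhouette score per candidate $k$. A sketch assuming scikit-learn and standardised features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sweep_k(X, ks=(4, 5, 6, 7, 8), seed=0):
    """Return {k: (inertia, silhouette)} for the elbow and silhouette plots."""
    results = {}
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        results[k] = (km.inertia_, silhouette_score(X, km.labels_))
    return results
```

Inertia always decreases with $k$, so look for the elbow; the silhouette score, by contrast, typically peaks at the natural number of archetypes.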
Exercise 10: Age-Value Curve Analysis
Build an age-performance-value analysis for scouting:
a) Plot average performance metrics by age for forwards (ages 18-35).
b) Fit a quadratic curve to model the peak age for different metrics.
c) Compute a "value score" that accounts for remaining peak years: $V = \text{performance} \times (1 - \text{age decay factor})$.
d) Identify players who are "pre-peak" (currently improving, with peak years ahead) and rank them by projected peak performance.
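For (b), the peak age of a fitted quadratic $v = a\,\text{age}^2 + b\,\text{age} + c$ is the vertex $-b/2a$, valid only when the curve opens downwards. A minimal sketch:

```python
import numpy as np

def peak_age(ages, values):
    """Fit value = a*age^2 + b*age + c and return the vertex (peak) age."""
    a, b, c = np.polyfit(ages, values, deg=2)
    if a >= 0:
        raise ValueError("no interior maximum: fitted curve opens upwards")
    return -b / (2 * a)
```

Players in (d) are "pre-peak" when their current age is below the fitted vertex for the metrics that matter for their role.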
Exercise 11: League Adjustment Factors
One challenge in cross-league scouting is that metrics are not directly comparable across leagues:
a) Compute league adjustment factors by analyzing players who transferred between leagues.
b) Use a paired analysis: for each player who moved, compare their per-90 metrics before and after the transfer.
c) Build a regression model that predicts post-transfer performance from pre-transfer metrics and the league pair.
d) Apply these adjustment factors to standardize your scouting database.
Exercise 12: Scouting Report Generation
Build an automated scouting report generator:
a) For each shortlisted player, generate a one-page profile including a radar chart, statistical summary, strengths and weaknesses, similar players, and contract/market-value information.
b) Include a "fit score" that quantifies how well the player matches the defined target profile.
c) Generate comparison visualizations showing the shortlisted player against the departing player and the league average.
d) Output the report as a structured HTML or PDF document.
Section C: Tactical Analysis Projects (Exercises 13-18)
Exercise 13: Formation Detection System
Build a robust formation detection system:
a) Implement the clustering-based formation detection algorithm from Section 29.3.
b) Test it on at least 3 different known formations (e.g., 4-3-3, 4-4-2, 3-5-2).
c) Handle edge cases: asymmetric formations and formation changes mid-match.
d) Compute the team's "formation stability index": how frequently the detected formation changes across a match.
Exercise 14: Passing Network Analysis
Conduct a full passing network analysis for a team's season:
a) Build match-level passing networks for all 38 league matches.
b) Compute betweenness centrality, eigenvector centrality, and the clustering coefficient for each player.
c) Identify the team's "most important passer" (highest betweenness centrality) and most connected sub-group (highest clustering coefficient).
d) Analyze how the network structure changes between wins and losses.
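The centralities in (b) follow directly once the passes are aggregated into a weighted directed graph. A sketch assuming NetworkX and a hypothetical (passer, receiver, count) edge list; note that for shortest-path-based betweenness, frequent links should count as short distances, so the pass counts are inverted:

```python
import networkx as nx

def passing_network_stats(passes):
    """Centrality measures from an aggregated (passer, receiver, count) edge list."""
    G = nx.DiGraph()
    G.add_weighted_edges_from(passes)
    # Invert counts so that frequent passing links act as *short* distances.
    nx.set_edge_attributes(
        G, {(u, v): 1.0 / d["weight"] for u, v, d in G.edges(data=True)}, "distance"
    )
    betweenness = nx.betweenness_centrality(G, weight="distance")
    eigenvector = nx.eigenvector_centrality(G, weight="weight", max_iter=1000)
    clustering = nx.clustering(G.to_undirected(), weight="weight")
    return betweenness, eigenvector, clustering
```

Running this per match and stacking the results by player gives the win/loss comparison needed in (d).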
Exercise 15: Pressing Profile Analysis
Create a comprehensive pressing profile:
a) Compute PPDA for each match and plot the season-long trend.
b) Segment pressing intensity by game state (leading, drawing, trailing).
c) Calculate the counterpressing recovery rate (turnovers won within 5 seconds of losing possession in the attacking third).
d) Build a pressing-effectiveness metric that combines PPDA with the high-turnover conversion rate.
Exercise 16: Expected Points Table
Build a full expected points table for a league season:
a) For each match, compute the expected points for both teams using the Poisson model from Section 29.3.
b) Construct the expected points table and compare it to the actual table.
c) Identify the biggest over-performers and under-performers.
d) Compute 95% confidence intervals for each team's expected point total using Monte Carlo simulation (10,000 simulated seasons).
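Step (a) amounts to summing a truncated joint Poisson scoreline grid. A sketch assuming independent home and away goal rates (the chapter's model may add a correlation adjustment):

```python
import numpy as np
from scipy.stats import poisson

def expected_points(lam_home, lam_away, max_goals=10):
    """Expected points (home, away) from independent Poisson goal rates."""
    g = np.arange(max_goals + 1)
    joint = np.outer(poisson.pmf(g, lam_home), poisson.pmf(g, lam_away))
    p_home_win = np.tril(joint, -1).sum()   # home goals > away goals
    p_draw = np.trace(joint)
    p_away_win = np.triu(joint, 1).sum()
    return 3 * p_home_win + p_draw, 3 * p_away_win + p_draw
```

Summing these per-match values over the fixture list yields the expected points table in (b); the same joint grid can drive the Monte Carlo draws in (d).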
Exercise 17: Tactical Phase Detection
Implement a change-point detection algorithm to identify tactical phases:
a) Use the PELT (Pruned Exact Linear Time) algorithm or a rolling-variance method.
b) Apply it to a vector of tactical metrics: PPDA, possession percentage, field tilt, and progressive passes per 90.
c) For each detected phase, compute summary statistics and the team's points per game.
d) Visualize the phases on a color-coded timeline.
Exercise 18: Opponent Tendency Model
Build a predictive model of opponent tactical tendencies:
a) For each opponent in the league, compute their tactical fingerprint (a vector of 10+ metrics).
b) Cluster opponents into tactical groups (e.g., "high press," "low block," "possession-based").
c) Analyze how your team performs against each tactical group.
d) For an upcoming match, predict the opponent's likely tactical approach and recommend counter-strategies.
Section D: Injury Prevention Projects (Exercises 19-23)
Exercise 19: Workload Monitoring Dashboard
Build a complete workload monitoring system:
a) Compute daily ACWR for each player using GPS training data.
b) Implement traffic-light classification (green/amber/red) based on ACWR thresholds.
c) Add weekly load monotony and strain calculations:
   - Monotony: $\frac{\text{mean daily load}}{\text{std daily load}}$
   - Strain: $\text{weekly load} \times \text{monotony}$
d) Create a dashboard visualization showing the squad's risk profile.
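Parts (a) and (c) share one trailing load window. A minimal sketch using the simple rolling-average (coupled) ACWR, where the acute week is included in the 28-day chronic average; adapt if your pipeline uses EWMA loads instead:

```python
import numpy as np

def workload_summary(daily_loads):
    """ACWR plus weekly monotony and strain from a trailing load series.

    `daily_loads`: the player's last 28 daily loads, oldest first
    (assumed layout).
    """
    loads = np.asarray(daily_loads, float)
    acute = loads[-7:].mean()               # last 7 days
    chronic = loads.mean()                  # 28-day average (coupled)
    acwr = acute / chronic if chronic else float("nan")
    week = loads[-7:]
    sd = week.std(ddof=1)
    monotony = week.mean() / sd if sd else float("inf")
    strain = week.sum() * monotony
    return {"acwr": acwr, "monotony": monotony, "strain": strain}
```

The traffic-light classification in (b) then thresholds `acwr` (e.g. amber outside roughly 0.8-1.3, red well outside it, with exact cutoffs taken from your club's policy).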
Exercise 20: Injury Risk Prediction Model
Build and evaluate a multi-factor injury risk model:
a) Engineer at least 10 features from training load, match load, and player characteristics.
b) Train a logistic regression model with class weighting to handle the imbalanced target.
c) Evaluate using precision-recall curves (not ROC curves, due to the class imbalance).
d) Perform a cost-benefit analysis: if the cost of a missed injury is 10x the cost of an unnecessary load reduction, what probability threshold should you use?
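For (d), with a well-calibrated model the cost-minimising rule is to intervene when the expected cost of inaction exceeds that of intervening, i.e. when $p \cdot C_{FN} > (1-p) \cdot C_{FP}$, which gives the threshold $t^* = C_{FP}/(C_{FP}+C_{FN})$. A sketch of the derivation plus a cost check on held-out data:

```python
import numpy as np

def optimal_threshold(cost_fn=10.0, cost_fp=1.0):
    """Intervene when p * cost_fn > (1 - p) * cost_fp."""
    return cost_fp / (cost_fp + cost_fn)

def expected_cost(y_true, y_prob, threshold, cost_fn=10.0, cost_fp=1.0):
    """Total realised cost of applying a threshold to held-out predictions."""
    y = np.asarray(y_true, bool)
    flag = np.asarray(y_prob, float) >= threshold
    return np.sum(y & ~flag) * cost_fn + np.sum(~y & flag) * cost_fp
```

Note that this threshold only minimises cost if the probabilities are calibrated; otherwise, sweep thresholds against `expected_cost` on a validation set.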
Exercise 21: Survival Analysis for Injury Duration
Implement survival analysis for return-to-play estimation:
a) Build Kaplan-Meier curves for the 5 most common injury types.
b) Fit a Cox proportional hazards model with covariates: age, injury severity, previous injury history, and position.
c) For a player who has been injured for 14 days, compute the conditional probability of returning within the next 7 days.
d) Validate the model using the concordance index (C-statistic).
Exercise 22: Congestion and Fatigue Analysis
Analyze the relationship between match congestion and injury incidence:
a) Compute a "congestion index" for each player-week: a weighted sum of minutes played in the previous 28 days, with more recent matches weighted more heavily.
b) Build a visualization showing the injury incidence rate by congestion-index quartile.
c) Test whether congestion significantly predicts injury risk using logistic regression, controlling for age and position.
d) Propose a rotation policy that minimizes injury risk while maintaining competitive performance.
Exercise 23: Pre-Season Load Prescription
Design an optimal pre-season loading program:
a) Analyze historical data to determine the relationship between pre-season training load and in-season injury rates.
b) Model the "U-shaped" relationship between chronic load and injury risk (both under-training and over-training increase risk).
c) For each player, compute the optimal target chronic load based on their physical characteristics and injury history.
d) Generate a 6-week pre-season loading plan that gradually ramps each player up to their target chronic load.
Section E: Match Preparation Projects (Exercises 24-28)
Exercise 24: Automated Opponent Report
Build a complete automated match preparation report:
a) Implement all analysis modules from Section 29.5: build-up analysis, set-piece analysis, and key-threat assessment.
b) Add a defensive-vulnerability analysis: where does the opponent concede the most dangerous chances?
c) Generate visualizations: pitch maps showing defensive weaknesses, pass-flow diagrams, and set-piece heat maps.
d) Assemble everything into a structured report with an executive summary, detailed analysis, and appendix.
Exercise 25: Set-Piece Design Tool
Build a tool for analyzing and designing set pieces:
a) Analyze the opponent's defensive set-piece structure (zonal vs. man-marking, near-post coverage).
b) From your own team's data, identify the most effective set-piece routines.
c) Compute the xG for each set-piece variant against each defensive structure.
d) Recommend the optimal set-piece strategy for the upcoming match.
Exercise 26: In-Game Decision Support
Design a system that provides real-time tactical suggestions during a match:
a) Track the cumulative xG difference and identify when the game state deviates significantly from pre-match expectations.
b) Implement a substitution recommendation engine that considers player fatigue, tactical fit, game state, and the historical impact of substitutions.
c) Compute the optimal substitution timing based on historical data (when do substitutions have the highest marginal impact?).
d) Present recommendations in a simple, glanceable format suitable for the coaching staff.
Exercise 27: Post-Match Analysis Automation
Build an automated post-match analysis system:
a) Compute all standard match statistics and compare them to pre-match expectations.
b) Identify the 3 most significant tactical moments (based on xG swing).
c) Generate a "what-if" analysis: how would the expected result change if specific key events had gone differently?
d) Produce a structured post-match report within 30 minutes of the final whistle (measure your pipeline's execution time).
Exercise 28: Season Planning Tool
Build a strategic season planning tool:
a) Using the fixture list and opponent-strength estimates, project the expected points trajectory for the season.
b) Identify "six-pointer" matches where the opponent is closest in quality.
c) Recommend squad-rotation patterns based on fixture congestion and opponent difficulty.
d) Run Monte Carlo simulations to estimate the probability of achieving various season objectives (e.g., finishing in the top 4, avoiding relegation).
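The simulation in (d) only needs per-match outcome probabilities. A sketch assuming hypothetical win/draw probability inputs; in practice these would come from the opponent-strength model in (a):

```python
import numpy as np

def simulate_objectives(win_probs, draw_probs, current_points=0,
                        targets=(40, 70), n_sims=10_000, seed=0):
    """P(final points >= target) via Monte Carlo over remaining fixtures.

    `win_probs` and `draw_probs` are per-match probabilities for the team
    of interest (hypothetical inputs; losses take the remaining mass).
    """
    rng = np.random.default_rng(seed)
    win_probs = np.asarray(win_probs, float)
    draw_probs = np.asarray(draw_probs, float)
    u = rng.random((n_sims, len(win_probs)))
    # 3 points for a win, 1 for a draw, 0 otherwise, sampled per fixture.
    points = np.where(u < win_probs, 3,
                      np.where(u < win_probs + draw_probs, 1, 0))
    totals = current_points + points.sum(axis=1)
    return {t: float((totals >= t).mean()) for t in targets}
```

This treats match outcomes as independent; correlated strength uncertainty can be layered on by resampling the ratings themselves in an outer loop.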
Section F: Player Development Projects (Exercises 29-32)
Exercise 29: Academy Player Tracking System
Build a comprehensive player development tracking system:
a) Define metrics across all four pillars: technical, physical, tactical, and mental.
b) Implement percentile ranking against age-group peers.
c) Create radar charts that show a player's profile at multiple time points (overlay 3 snapshots: 6 months ago, 3 months ago, and current).
d) Generate an automated "development trajectory" classification: accelerating, on-track, plateauing, or declining.
Exercise 30: Growth Curve Projection
Implement growth curve modeling for player development:
a) Fit polynomial growth curves (degrees 2 and 3) for each metric.
b) Add confidence bands using bootstrapped regression.
c) Project each player's metrics 12 months into the future.
d) Identify players whose projected metrics would place them in the top quartile of the age group above.
Exercise 31: Development Benchmarking
Create a benchmarking system for player development:
a) Build development trajectories for established first-team players, working backwards from their current level.
b) For each academy player, compute the similarity between their development trajectory and those of established players at the same age.
c) Identify the "development template" for each academy prospect (the first-team player whose trajectory they most closely resemble).
d) Estimate the probability of each academy player reaching first-team level based on trajectory similarity.
Exercise 32: Integrated Development Report
Produce a comprehensive quarterly development report for a single player:
a) Combine all quantitative metrics with qualitative coaching assessments.
b) Track development across all four pillars with trend indicators.
c) Compare the player to their "development template" (from Exercise 31) at the same age.
d) Provide specific, measurable development objectives for the next quarter.
e) Include a "readiness assessment" for promotion to the next age group or for first-team inclusion.
Deliverable: A complete report in HTML format with embedded visualizations.