Chapter 7: Exercises
Overview
These exercises reinforce the concepts from Chapter 7 on Expected Goals (xG) models. They progress from fundamental understanding through practical implementation to advanced analysis. Complete solutions are available in the code/exercise-solutions.py file.
Part A: Conceptual Understanding (Exercises 1-6)
Exercise 1: xG Fundamentals
Difficulty: Basic
Explain in your own words:
a) What problem does xG solve that traditional shot statistics cannot?
b) Why is a shot from 8 meters with a narrow angle different from one at 12 meters with a wide angle, even if their conversion rates are similar?
c) What does an xG value of 0.25 mean in practical terms?
Exercise 2: Feature Importance Ranking
Difficulty: Basic
Rank the following xG model features from most to least important, based on typical feature importance scores. Justify your ranking:
- Shot body part (foot vs. head)
- Distance to goal
- Angle to goal
- Time remaining in match
- Current score differential
- Assist type (through ball, cross, etc.)
Exercise 3: Interpreting Match xG
Difficulty: Basic
Consider a match with the following statistics:
- Team A: 1.8 xG, 3 goals scored
- Team B: 2.2 xG, 1 goal scored
a) Which team created better chances?
b) Which team was more efficient at converting chances?
c) If this match were replayed 1000 times with the same chances, approximately what percentage of the replays would Team A win?
Exercise 4: Model Evaluation Metrics
Difficulty: Intermediate
Explain:
a) The difference between log loss and accuracy for xG models
b) The difference between ROC AUC and precision-recall AUC
c) Why a model might have excellent ROC AUC but poor calibration
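As a starting point for part (a), the following minimal sketch computes accuracy and log loss by hand. The outcomes and probabilities are made-up illustrative values, not real shot data, and the 0.5 threshold is an assumption.

```python
import math

def accuracy(outcomes, probs, threshold=0.5):
    """Share of shots whose thresholded prediction matches the outcome."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(outcomes, probs))
    return hits / len(outcomes)

def log_loss(outcomes, probs, eps=1e-15):
    """Mean negative log-likelihood of the observed outcomes."""
    total = 0.0
    for y, p in zip(outcomes, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(outcomes)

# Made-up example: 1 = goal, 0 = no goal
outcomes = [0, 0, 0, 1, 0]
model_a = [0.05, 0.10, 0.08, 0.40, 0.12]  # graded probabilities
model_b = [0.01, 0.01, 0.01, 0.01, 0.01]  # "almost never a goal"

for name, probs in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(outcomes, probs), round(log_loss(outcomes, probs), 3))
```

Neither model ever predicts above 0.5, so both reach the same accuracy, yet their log losses differ. Consider what that implies for evaluating xG models.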
Exercise 5: xG Limitations
Difficulty: Intermediate
For each scenario, explain why standard xG models might produce misleading conclusions:
a) A penalty shootout specialist takes 30 penalties in a season
b) A striker who only plays against weak opposition
c) A team that scores 90% of their goals from set pieces
d) A goalkeeper who faces primarily long-range shots
Exercise 6: Descriptive vs. Predictive xG
Difficulty: Intermediate
Explain the tension between using xG for:
- Describing what happened in a match
- Predicting future performance
When would you prioritize one interpretation over the other? Give specific examples.
Part B: Distance and Angle Calculations (Exercises 7-12)
Exercise 7: Distance Calculation
Difficulty: Basic
Calculate the distance to the goal center (x=120, y=40 on StatsBomb coordinates) for shots taken from:
a) (108, 40)
b) (115, 35)
c) (100, 55)
d) (112, 40)
Show your work using the Euclidean distance formula.
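To check hand calculations, a small helper along these lines may be useful. It is a sketch that assumes the goal center given above. The function name `shot_distance` is illustrative, and the result is in the same units as the input coordinates.

```python
import math

GOAL_CENTER = (120.0, 40.0)  # goal center as given in the exercise

def shot_distance(x, y, goal=GOAL_CENTER):
    """Euclidean distance from the shot location to the goal center."""
    return math.hypot(goal[0] - x, goal[1] - y)

# Illustrative point (not one of the exercise shots): a 3-4-5 triangle
print(shot_distance(117, 44))  # 5.0
```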
Exercise 8: Angle Calculation
Difficulty: Intermediate
Using the goal post coordinates (left post at y=36.34, right post at y=43.66 for a 7.32 m goal centered at y=40), calculate the shot angle in degrees for:
a) A shot from (110, 40) - directly in front of goal
b) A shot from (110, 50) - to the right of center
c) A shot from (105, 35) - to the left at a wider position
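One common way to define the shot angle is as the angle the two posts subtend at the shot location. The sketch below takes that definition and assumes both posts sit on the goal line at x=120; the function name is illustrative.

```python
import math

LEFT_POST = (120.0, 36.34)   # post coordinates as given in the exercise
RIGHT_POST = (120.0, 43.66)  # x=120 is assumed from the goal center in Exercise 7

def shot_angle_deg(x, y):
    """Angle in degrees subtended by the two posts at the shot location."""
    to_left = math.atan2(LEFT_POST[1] - y, LEFT_POST[0] - x)
    to_right = math.atan2(RIGHT_POST[1] - y, RIGHT_POST[0] - x)
    return math.degrees(abs(to_right - to_left))

# Sanity check at a point that is not part of the exercise: 3.66 units out,
# directly in front of goal, each post lies 45 degrees off the center line.
print(shot_angle_deg(116.34, 40.0))  # approximately 90.0
```

Moving the same shot sideways along a line parallel to the goal line should shrink the angle, which is a useful check on your answers.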
Exercise 9: Distance-Angle Relationship
Difficulty: Intermediate
Write a function that, given a distance and angle, determines whether the shot is:
- "Central close" (distance < 12m, angle > 25°)
- "Central medium" (12m ≤ distance < 20m, angle > 20°)
- "Wide close" (distance < 12m, angle ≤ 25°)
- "Wide medium" (12m ≤ distance < 20m, angle ≤ 20°)
- "Long range" (distance ≥ 20m)
Test your function on the shots from Exercises 7-8.
Exercise 10: Visualizing Shot Zones
Difficulty: Intermediate
Create a visualization that:
a) Divides the attacking third into zones based on distance (0-6m, 6-12m, 12-18m, 18-25m, 25m+)
b) Colors each zone according to typical conversion rate
c) Overlays the goal posts for reference
Exercise 11: Expected Goals by Zone
Difficulty: Intermediate
Using StatsBomb open data from the 2018 World Cup:
a) Calculate the average xG for shots in each zone from Exercise 10
b) Calculate the actual conversion rate in each zone
c) Compare your zone-based estimates to StatsBomb's xG values
Exercise 12: Optimal Shooting Position
Difficulty: Advanced
Given the trade-off between distance (closer is better) and angle (wider is better), find the position(s) on the pitch that maximize expected goals by:
a) Writing a function that estimates xG from position using distance and angle
b) Creating a heatmap showing estimated xG across the attacking third
c) Identifying the optimal shooting position (highest xG) at different distances from goal
Part C: Building xG Models (Exercises 13-18)
Exercise 13: Simple Logistic Regression
Difficulty: Intermediate
Using World Cup 2018 data:
a) Build a logistic regression model using only distance as a feature
b) Report the coefficient and intercept
c) Calculate xG for shots at distances of 5m, 10m, 15m, and 25m
d) Plot the xG curve from 0-40m
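Once the model is fitted, part (c) reduces to evaluating the logistic function at each distance. The sketch below shows that step only. The intercept and coefficient are placeholder values chosen for illustration, not fitted results, so replace them with the values from your own model.

```python
import math

def xg_from_distance(distance, intercept, coef):
    """Logistic model: P(goal) = 1 / (1 + exp(-(intercept + coef * distance)))."""
    z = intercept + coef * distance
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder coefficients for illustration only; replace with fitted values.
INTERCEPT, COEF = 0.0, -0.1

for d in (5, 10, 15, 25):
    print(d, round(xg_from_distance(d, INTERCEPT, COEF), 3))
```

A negative distance coefficient makes the curve fall monotonically with distance, which is the shape to expect in part (d).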
Exercise 14: Multi-Feature Model
Difficulty: Intermediate
Extend Exercise 13 by:
a) Adding angle and body part as features
b) Comparing log loss between the simple and extended model
c) Interpreting the coefficients: which features have the strongest effect?
Exercise 15: Feature Engineering
Difficulty: Intermediate
Create the following derived features and test whether they improve model performance:
a) log_distance: Natural log of distance
b) distance_squared: Square of distance
c) angle_distance_interaction: Angle × Distance
d) is_header: Binary indicator for headers
e) is_close_range: Binary indicator for shots within 10m
Report the improvement in log loss and ROC AUC.
Exercise 16: Gradient Boosting Implementation
Difficulty: Advanced
Build a gradient boosting xG model:
a) Use at least 6 features (distance, angle, body part, shot type, x-coordinate, y-coordinate)
b) Tune hyperparameters using 5-fold cross-validation
c) Compare performance to logistic regression
d) Analyze feature importances
Exercise 17: Model Calibration
Difficulty: Advanced
For your gradient boosting model:
a) Create a calibration curve comparing predicted probabilities to actual outcomes
b) Identify any regions of miscalibration
c) Apply Platt scaling or isotonic regression to improve calibration
d) Report the change in Brier score after calibration
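The quantities behind parts (a) and (d) can be computed by hand before reaching for a library. The sketch below uses equal-width probability bins, which is one possible binning choice and an assumption here; the function names are illustrative.

```python
def brier_score(outcomes, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, probs)) / len(outcomes)

def calibration_table(outcomes, probs, n_bins=10):
    """Return (mean predicted, observed rate, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(outcomes, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            obs_rate = sum(y for y, _ in members) / len(members)
            rows.append((mean_pred, obs_rate, len(members)))
    return rows
```

Plotting mean predicted probability against observed rate for each row gives the calibration curve; a well-calibrated model lies close to the diagonal.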
Exercise 18: Cross-Competition Validation
Difficulty: Advanced
Test model generalization:
a) Train an xG model on World Cup 2018 data
b) Evaluate on Women's World Cup 2019 data (or another available competition)
c) Compare performance metrics between in-sample and out-of-sample
d) Discuss reasons for any performance degradation
Part D: Applying xG Analysis (Exercises 19-24)
Exercise 19: Player Finishing Analysis
Difficulty: Intermediate
Using World Cup 2018 data:
a) Identify all players with at least 10 shots
b) Calculate goals, xG, and goals-minus-xG for each
c) Rank players by "finishing skill" (goals/xG ratio)
d) Discuss the reliability of these rankings given sample sizes
Exercise 20: Team Shot Profile
Difficulty: Intermediate
For France and Croatia in the 2018 World Cup:
a) Calculate total shots, total xG, and xG per shot
b) Create shot maps showing location and xG for each team
c) Compare their shot profiles: which team took higher quality chances?
d) Analyze differences in shot zones (inside box vs. outside, central vs. wide)
Exercise 21: Match Analysis Deep Dive
Difficulty: Intermediate
For the 2018 World Cup Final (France 4-2 Croatia):
a) Plot a timeline of xG accumulation for both teams
b) Identify the highest xG chance for each team
c) Calculate the probability France would win given the xG created (using Poisson simulation)
d) Discuss whether the actual scoreline was "deserved"
Exercise 22: Goalkeeper Evaluation
Difficulty: Advanced
For a selected goalkeeper with at least 20 shots faced:
a) Calculate total xG conceded and actual goals conceded
b) Compute "goals saved above expected" (xG - Goals)
c) Break down by shot zone to identify strengths/weaknesses
d) Discuss limitations of this analysis without post-shot xG
Exercise 23: Chance Creation Analysis
Difficulty: Advanced
Analyze which players create the best chances for teammates:
a) Identify all passes that immediately precede shots
b) Sum the xG of shots created by each passer (Expected Assists / xA)
c) Compare xA to actual assists
d) Identify the top 5 chance creators by xA
Exercise 24: xG Rolling Average
Difficulty: Intermediate
For a team of your choice with multiple matches:
a) Calculate xG created and xG conceded per match
b) Compute 3-match rolling averages
c) Identify trends in chance creation/prevention over the tournament
d) Visualize the rolling xG with actual goals overlaid
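For the rolling average, one convention is a trailing window that stays undefined until enough matches have been played. The sketch below follows that convention, which is an assumption, since a centered window or partial windows are also defensible. The input values are made up.

```python
def rolling_mean(values, window=3):
    """Trailing rolling mean; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out

# Made-up per-match xG values
print(rolling_mean([1.0, 2.0, 3.0, 4.0]))  # [None, None, 2.0, 3.0]
```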
Part E: Model Comparison and Evaluation (Exercises 25-28)
Exercise 25: Benchmark Comparison
Difficulty: Intermediate
Compare three xG estimation approaches:
a) Simple distance-only logistic regression
b) Multi-feature gradient boosting (your model)
c) StatsBomb xG values (provided in data)
Report log loss, ROC AUC, and Brier score for each. Which performs best?
Exercise 26: Lift Analysis
Difficulty: Intermediate
For your best model:
a) Divide predictions into deciles (10 groups by xG)
b) Calculate actual conversion rate in each decile
c) Compute lift (actual rate / baseline rate) for each decile
d) Create a lift chart visualization
The top decile should have lift > 2.0 for a good model.
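The lift computation can be sketched as follows. This version ranks shots from highest to lowest predicted xG and splits them into equal-sized groups. It assumes at least as many shots as groups and at least one goal in the data, and the function name is illustrative.

```python
def decile_lift(outcomes, probs, n_groups=10):
    """Lift per group, with groups ordered from highest to lowest predicted xG."""
    baseline = sum(outcomes) / len(outcomes)  # overall conversion rate
    ranked = sorted(zip(probs, outcomes), key=lambda t: t[0], reverse=True)
    n = len(ranked)
    lifts = []
    for g in range(n_groups):
        group = ranked[g * n // n_groups : (g + 1) * n // n_groups]
        rate = sum(y for _, y in group) / len(group)
        lifts.append(rate / baseline)
    return lifts

# Made-up example: ten shots, the two highest-rated ones are goals
probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(decile_lift(outcomes, probs))
```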
Exercise 27: Residual Analysis
Difficulty: Advanced
Examine model residuals:
a) Calculate residuals (actual outcome - predicted probability) for all shots
b) Group residuals by distance, angle, and body part
c) Identify any systematic patterns suggesting missing features
d) Propose additional features that might address the patterns
Exercise 28: Confidence Intervals
Difficulty: Advanced
Quantify uncertainty in xG predictions:
a) Use bootstrap sampling (1000 iterations) to estimate model uncertainty
b) For a sample of shots, compute 95% confidence intervals for xG
c) Visualize how confidence interval width varies with predicted xG
d) Discuss implications for communicating xG to non-technical audiences
Part F: Simulation and Prediction (Exercises 29-32)
Exercise 29: Basic Match Simulation
Difficulty: Intermediate
Using Poisson distributions:
a) Write a function that simulates a match outcome given home and away xG
b) Run 10,000 simulations for a match with home xG = 1.5, away xG = 1.2
c) Calculate probabilities of home win, draw, and away win
d) Generate the distribution of most likely scorelines
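A standard-library sketch of the simulation loop is shown below. It treats each team's goals as an independent Poisson draw with mean equal to its xG, which is a modeling assumption, and it samples with Knuth's multiplication method so that no external library is needed. In practice a library sampler such as NumPy's would usually replace `poisson_sample`.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def simulate_match(home_xg, away_xg, n_sims=10_000, seed=0):
    """Estimate (home win, draw, away win) probabilities by simulation."""
    rng = random.Random(seed)  # fixed seed for reproducible results
    home = draw = away = 0
    for _ in range(n_sims):
        h = poisson_sample(home_xg, rng)
        a = poisson_sample(away_xg, rng)
        if h > a:
            home += 1
        elif h == a:
            draw += 1
        else:
            away += 1
    return home / n_sims, draw / n_sims, away / n_sims

print(simulate_match(1.5, 1.2))
```

Recording each simulated `(h, a)` pair in a counter, rather than only the outcome, is one way to extend this toward part (d).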
Exercise 30: Season Points Simulation
Difficulty: Intermediate
For a hypothetical team with 1.7 xG/match and 1.2 xGA/match:
a) Simulate 1000 full 38-match seasons
b) Calculate mean, standard deviation, and percentiles of total points
c) Estimate the probability of finishing with 70+ points (Champions League)
d) Estimate the probability of finishing below 35 points (relegation)
Exercise 31: Monte Carlo Match Prediction
Difficulty: Advanced
Create a full match prediction system:
a) Take historical xG/xGA averages for two teams
b) Apply home advantage adjustment (+10% xG for home team)
c) Generate scoreline probability matrix using Poisson
d) Calculate implied decimal odds (1/probability) for each outcome
e) Compare to actual bookmaker odds for a recent match
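The scoreline matrix and the implied odds can be computed analytically rather than by simulation. The sketch below assumes the two teams' goal counts are independent Poisson variables and truncates the matrix at 10 goals per side, renormalizing the outcome probabilities to account for the truncation. The team averages in the usage lines are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals under a Poisson(lam) model."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def scoreline_matrix(home_xg, away_xg, max_goals=10):
    """matrix[h][a] = P(home scores h and away scores a), assuming independence."""
    return [[poisson_pmf(h, home_xg) * poisson_pmf(a, away_xg)
             for a in range(max_goals + 1)]
            for h in range(max_goals + 1)]

def outcome_probabilities(matrix):
    """Return (home win, draw, away win), renormalized for truncation."""
    home = sum(p for h, row in enumerate(matrix) for a, p in enumerate(row) if h > a)
    draw = sum(matrix[i][i] for i in range(len(matrix)))
    away = sum(p for h, row in enumerate(matrix) for a, p in enumerate(row) if h < a)
    total = home + draw + away
    return home / total, draw / total, away / total

def implied_decimal_odds(prob):
    """Fair decimal odds for an outcome, with no bookmaker margin."""
    return 1.0 / prob

# Hypothetical averages: home team 1.6 xG, away team 1.3 xG, +10% home adjustment
matrix = scoreline_matrix(1.6 * 1.10, 1.3)
for label, prob in zip(("home", "draw", "away"), outcome_probabilities(matrix)):
    print(label, round(prob, 3), round(implied_decimal_odds(prob), 2))
```

Bookmaker odds include a margin, so their implied probabilities sum to more than 1. Account for that before comparing in part (e).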
Exercise 32: Tournament Simulation
Difficulty: Advanced
Simulate the World Cup knockout rounds:
a) Use average xG/xGA from group stage for each team
b) Simulate each knockout match using your Poisson model
c) Run 10,000 tournament simulations
d) Calculate probability of winning the tournament for each team
e) Compare your predicted winner probabilities to pre-tournament favorites
Part G: Advanced Topics (Exercises 33-35)
Exercise 33: Expected Threat (xT) Grid
Difficulty: Advanced
Create a simplified Expected Threat model:
a) Divide the pitch into a 12×8 grid (96 zones)
b) For each zone, calculate the probability of a goal being scored from actions starting there
c) Visualize the xT grid as a heatmap
d) Compare your xT values in attacking zones to typical xG values
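The first step in part (a) is mapping each event location to a grid cell. The sketch below assumes a 120 by 80 pitch, consistent with the goal center at (120, 40) used in Exercise 7, and clamps locations on the far boundary into the last cell. The function name is illustrative.

```python
PITCH_LENGTH, PITCH_WIDTH = 120.0, 80.0  # assumed pitch dimensions
N_COLS, N_ROWS = 12, 8                   # 12 x 8 grid = 96 zones

def zone_index(x, y):
    """Map a pitch location to a (column, row) cell of the grid."""
    col = min(int(x / PITCH_LENGTH * N_COLS), N_COLS - 1)
    row = min(int(y / PITCH_WIDTH * N_ROWS), N_ROWS - 1)
    return max(col, 0), max(row, 0)  # guard against slightly negative inputs

print(zone_index(60, 40))  # (6, 4)
```

With these dimensions each cell is 10 by 10 units, so counting actions and subsequent goals per `(column, row)` key gives the raw material for part (b).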
Exercise 34: Post-Shot xG Approximation
Difficulty: Advanced
Without actual shot placement data, approximate post-shot xG:
a) For goals, assign high post-shot xG (0.7-0.9) based on being difficult to save
b) For saves, estimate based on shot xG and whether it required a "great save"
c) Analyze how post-shot xG differs from pre-shot xG
d) Discuss what data would be needed for a proper PSxG model
Exercise 35: Neural Network xG Model
Difficulty: Expert
Build a neural network xG model:
a) Design an architecture with 2-3 hidden layers
b) Include appropriate regularization (dropout, early stopping)
c) Train on World Cup data with validation split
d) Compare performance to gradient boosting
e) Discuss trade-offs between neural networks and tree-based models for xG
Submission Guidelines
For programming exercises:
- Include well-commented code with docstrings
- Generate all requested visualizations
- Report numerical results to 2-3 decimal places
- Include brief interpretations of results
For conceptual questions:
- Provide clear, structured answers
- Reference specific examples where appropriate
- Acknowledge limitations and uncertainty
Grading Rubric
| Category | Weight | Criteria |
|---|---|---|
| Conceptual Understanding | 20% | Accurate explanations, addresses nuances |
| Technical Implementation | 35% | Correct code, appropriate methods |
| Analysis Quality | 25% | Meaningful insights, proper interpretation |
| Visualization | 10% | Clear, informative, properly labeled |
| Communication | 10% | Well-structured, concise, professional |