Chapter 7: Key Takeaways
Quick Reference Card
What is xG?
Expected Goals (xG) is the probability that a shot results in a goal, based on historical conversion rates for similar shots.
- Value range: 0.00 to 1.00 (0% to 100% probability)
- Interpretation: an xG of 0.15 means similar shots have historically been scored 15% of the time
- Aggregation: Sum shot xG for team/player totals
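The aggregation rule is just a sum of per-shot probabilities; the shot values below are invented for illustration:

```python
# Team/player xG is the sum of per-shot goal probabilities.
# These shot values are illustrative, not real data.
shot_xgs = [0.76, 0.31, 0.12, 0.05, 0.03]
total_xg = sum(shot_xgs)
print(f"Expected goals from these shots: {total_xg:.2f}")  # 1.27
```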
Core Formulas
Distance to Goal
distance = √[(goal_x - shot_x)² + (goal_y - shot_y)²]
StatsBomb coordinates: goal at (120, 40)
Angle to Goal
angle = |arctan2(right_post_y - shot_y, goal_x - shot_x) - arctan2(left_post_y - shot_y, goal_x - shot_x)|
Goal posts: y = 36.34 (left), y = 43.66 (right)
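A quick numeric check of both formulas, for a hypothetical central shot 12 units from the goal line:

```python
import math

shot_x, shot_y = 108.0, 40.0           # hypothetical central shot
goal_x = 120.0                          # StatsBomb goal line
left_post, right_post = 36.34, 43.66    # post y-coordinates

distance = math.hypot(goal_x - shot_x, 40.0 - shot_y)
angle = abs(
    math.atan2(right_post - shot_y, goal_x - shot_x)
    - math.atan2(left_post - shot_y, goal_x - shot_x)
)
print(f"distance = {distance:.1f}, angle = {math.degrees(angle):.1f} deg")
# distance = 12.0, angle ≈ 34.0 deg
```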
Log Loss (Model Evaluation)
Log Loss = -1/N × Σ[yᵢ × log(pᵢ) + (1-yᵢ) × log(1-pᵢ)]
Lower is better. Baseline ~0.35, good model ~0.26-0.28
Brier Score
Brier = 1/N × Σ(pᵢ - yᵢ)²
Lower is better. Typical values: 0.07-0.09
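Both metrics are easy to verify by hand on a toy set of four shots (outcomes and probabilities invented for illustration):

```python
import numpy as np

y = np.array([0, 1, 0, 1])            # outcomes (goal = 1)
p = np.array([0.1, 0.8, 0.3, 0.6])    # model probabilities

# Direct implementations of the two formulas above
log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
brier = np.mean((p - y) ** 2)
print(f"log loss = {log_loss:.3f}, Brier = {brier:.3f}")
# log loss ≈ 0.299, Brier = 0.075
```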
Feature Importance Hierarchy
| Rank | Feature | Typical Importance |
|---|---|---|
| 1 | Distance to goal | 30-40% |
| 2 | Angle to goal | 15-25% |
| 3 | Shot body part | 10-15% |
| 4 | X/Y coordinates | 10-15% |
| 5 | Shot type | 5-10% |
| 6 | Game context | 5-10% |
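Importances like those above are typically read from a fitted model's `feature_importances_` attribute; the synthetic data and feature names here are placeholders, not real shot features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for shot data; names are placeholders
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
names = ["distance", "angle", "body_part", "shot_type"]

model = GradientBoostingClassifier(random_state=42).fit(X, y)
# Importances are normalized to sum to 1; sort descending
for name, imp in sorted(zip(names, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.1%}")
```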
Typical Conversion Rates
| Shot Type | Conversion Rate | Typical xG |
|---|---|---|
| Penalty | 76% | 0.76 |
| 6-yard box | 40-50% | 0.35-0.50 |
| Penalty area (central) | 15-25% | 0.15-0.25 |
| Penalty area (wide) | 8-15% | 0.08-0.15 |
| Outside box | 3-6% | 0.03-0.06 |
| Long range (25m+) | 1-3% | 0.01-0.03 |
Model Building Checklist
Data Preparation
- [ ] Extract x, y coordinates from location
- [ ] Calculate distance and angle features
- [ ] Create transformed features (log_distance, etc.)
- [ ] Encode categorical variables (body part, shot type)
- [ ] Remove penalties (they have fixed xG)
- [ ] Split data: train/test (80/20) with stratification
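The checklist can be sketched as a small pipeline; the column names (`location`, `body_part`, `shot_type`, `goal`) are assumptions and may differ from your event schema:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def prepare_shots(shots: pd.DataFrame) -> pd.DataFrame:
    """Run the checklist; column names are assumptions."""
    df = shots.copy()
    # 1. Extract x, y from the [x, y] location pair
    df[["x", "y"]] = pd.DataFrame(df["location"].tolist(), index=df.index)
    # 2. Distance and angle (StatsBomb goal at (120, 40))
    df["distance"] = np.sqrt((120 - df["x"]) ** 2 + (40 - df["y"]) ** 2)
    df["angle"] = np.abs(
        np.arctan2(43.66 - df["y"], 120 - df["x"])
        - np.arctan2(36.34 - df["y"], 120 - df["x"])
    )
    # 3. Transformed feature
    df["log_distance"] = np.log1p(df["distance"])
    # 5. Drop penalties before encoding (they have fixed xG)
    df = df[df["shot_type"] != "Penalty"]
    # 4. One-hot encode categoricals
    return pd.get_dummies(df, columns=["body_part", "shot_type"])

def split_shots(df: pd.DataFrame, target: str = "goal"):
    # 6. Stratified 80/20 split so both halves share the goal rate
    features = df.drop(columns=[target, "location"])
    return train_test_split(
        features, df[target], test_size=0.2,
        stratify=df[target], random_state=42,
    )
```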
Model Training
```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=4,
    min_samples_leaf=20,
    random_state=42,
)
model.fit(X_train, y_train)
```
Evaluation
```python
from sklearn.metrics import log_loss, roc_auc_score, brier_score_loss

y_pred = model.predict_proba(X_test)[:, 1]
print(f"Log Loss: {log_loss(y_test, y_pred):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_pred):.4f}")
print(f"Brier: {brier_score_loss(y_test, y_pred):.4f}")
```
Interpretation Guidelines
Shot Level
| xG Value | Interpretation |
|---|---|
| > 0.40 | Big chance (scored ~40%+ of the time) |
| 0.20 - 0.40 | Good opportunity |
| 0.10 - 0.20 | Moderate chance |
| 0.05 - 0.10 | Difficult shot |
| < 0.05 | Low-quality attempt |
Match Level
- xG margin > 1.0: Clear dominance in chance creation
- xG margin 0.5-1.0: Slight advantage
- xG margin < 0.5: Competitive match
- Goals >> xG: Clinical finishing or luck
- Goals << xG: Poor finishing or unlucky
Season Level
- Teams rarely deviate from xG by more than ±10%
- Large deviations flag likely regression candidates
- Use xG table for "true" performance ranking
Match Simulation (Poisson)
```python
import numpy as np

def simulate_match(home_xg, away_xg, n_sims=10000):
    # Model each team's goal count as Poisson with mean equal to its xG
    home_goals = np.random.poisson(home_xg, n_sims)
    away_goals = np.random.poisson(away_xg, n_sims)
    home_win = (home_goals > away_goals).mean()
    draw = (home_goals == away_goals).mean()
    away_win = (home_goals < away_goals).mean()
    return home_win, draw, away_win
```
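The simulated frequencies can also be cross-checked analytically: summing the joint Poisson pmf over a truncated goal grid gives exact outcome probabilities (the function name and truncation limit below are my own choices):

```python
from scipy.stats import poisson
import numpy as np

def match_probs_exact(home_xg, away_xg, max_goals=10):
    # P(home scores i) and P(away scores j), truncated at max_goals,
    # which covers virtually all mass for typical xG values
    hp = poisson.pmf(np.arange(max_goals + 1), home_xg)
    ap = poisson.pmf(np.arange(max_goals + 1), away_xg)
    grid = np.outer(hp, ap)             # joint P(home=i, away=j)
    home_win = np.tril(grid, -1).sum()  # i > j
    draw = np.trace(grid)               # i == j
    away_win = np.triu(grid, 1).sum()   # i < j
    return home_win, draw, away_win
```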
Common Pitfalls
1. Small Sample Sizes
- Problem: Judging finishing skill from 20 shots
- Solution: Require 100+ shots for reliable conclusions
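To see why small samples mislead, a back-of-envelope sketch (assuming shots are independent Bernoulli trials with success probability equal to their xG) shows how slowly the noise in Goals − xG shrinks:

```python
import math

def goals_minus_xg_std(n_shots, avg_xg=0.10):
    # Variance of a Bernoulli(p) outcome is p*(1-p); sum over shots
    return math.sqrt(n_shots * avg_xg * (1 - avg_xg))

# 20 shots at 0.10 xG: 2.0 expected goals, noise ~1.3 goals (~67%);
# 200 shots: 20 expected goals, noise ~4.2 goals (~21%)
print(goals_minus_xg_std(20), goals_minus_xg_std(200))
```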
2. Ignoring Calibration
- Problem: Model predicts wrong probabilities
- Solution: Check calibration curves; apply Platt scaling if needed
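One way to apply Platt scaling is scikit-learn's `CalibratedClassifierCV` with `method="sigmoid"`; the synthetic dataset below is a stand-in for real shot features:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data roughly mimicking a ~10% goal rate
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
base = GradientBoostingClassifier(random_state=42)
# method="sigmoid" fits a logistic (Platt) mapping on held-out folds
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)
p = calibrated.predict_proba(X_test)[:, 1]
```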
3. Comparing Different xG Models
- Problem: StatsBomb xG ≠ Opta xG ≠ Understat xG
- Solution: Use consistent sources for comparisons
4. Overconfident Predictions
- Problem: "France had 2.37 xG"
- Solution: Round to 2.4 xG; acknowledge uncertainty
5. Causation from Correlation
- Problem: "High xG caused the win"
- Solution: xG measures chance quality, not outcomes
Essential Python Code Snippets
Load StatsBomb Data
```python
from statsbombpy import sb

matches = sb.matches(competition_id=43, season_id=3)
events = sb.events(match_id=7298)
shots = events[events['type'] == 'Shot']
```
Calculate Distance and Angle
```python
import numpy as np

def shot_features(x, y, goal_x=120, goal_y=40):
    distance = np.sqrt((goal_x - x)**2 + (goal_y - y)**2)
    left_post, right_post = 36.34, 43.66
    angle = np.abs(
        np.arctan2(right_post - y, goal_x - x) -
        np.arctan2(left_post - y, goal_x - x)
    )
    return distance, np.degrees(angle)
```
Quick xG Calculation
```python
# Using StatsBomb's built-in xG
shots['xg'] = shots['shot_statsbomb_xg']
team_xg = shots.groupby('team')['xg'].sum()
```
Calibration Curve
```python
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

prob_true, prob_pred = calibration_curve(y_true, y_pred, n_bins=10)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(prob_pred, prob_true, 'bo-')
plt.xlabel('Predicted Probability')
plt.ylabel('Actual Frequency')
```
Applications Summary
| Application | Key Metric | Insight |
|---|---|---|
| Player finishing | Goals - xG | Finishing skill (with caveats) |
| Chance creation | xA (assisted xG) | Creativity independent of conversion |
| Goalkeeper | xG faced - Goals Conceded | Shot-stopping ability |
| Team offense | Total xG | Chance creation quality |
| Team defense | xGA | Chance prevention |
| Match prediction | xG → Poisson | Win/draw/loss probabilities |
Quick Evaluation Benchmarks
| Metric | Baseline | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Log Loss | 0.35 | 0.30 | 0.28 | 0.25 |
| ROC AUC | 0.50 | 0.72 | 0.78 | 0.82 |
| Brier Score | 0.09 | 0.085 | 0.08 | 0.075 |
One-Page Summary
xG = P(Goal | Shot Characteristics)
- Distance and angle dominate predictions (60-70% of importance)
- Logistic regression works; gradient boosting works better
- Calibration matters for aggregating probabilities
- Single shots are noisy; season totals are reliable
- Over/underperformance tends to regress toward xG
- Different providers give different xG values
- Track match xG for performance assessment
- Simulate with Poisson for outcome probabilities