Case Study 2: Building an xG Model with Gradient Boosting: From Features to Deployment
Background
Expected goals (xG) is the foundational metric in modern soccer analytics. Every major data provider, broadcaster, and club maintains an xG model. This case study walks through the complete lifecycle of building a production-quality xG model using gradient boosting, from raw event data through feature engineering, training, evaluation, interpretability analysis, and deployment.
We simulate a scenario where a data analytics company is building an xG model for a broadcasting client. The model must be:
- Accurate: State-of-the-art predictive performance.
- Well-calibrated: Predicted probabilities must match observed frequencies.
- Interpretable: Broadcasters need to explain predictions to viewers.
- Deployable: The model must run in real-time during live broadcasts.
Data Description
We use synthetic shot data modeled on realistic distributions from top-five European leagues. The dataset contains 50,000 shots with the following fields:
| Field | Type | Description |
|---|---|---|
| `match_id` | int | Unique match identifier |
| `season` | str | Season (e.g., "2019/20") |
| `minute` | int | Match minute |
| `x` | float | Shot x-coordinate (0--120, attacking direction) |
| `y` | float | Shot y-coordinate (0--80, left to right) |
| `body_part` | str | "right_foot", "left_foot", "head", "other" |
| `shot_type` | str | "open_play", "free_kick", "corner", "penalty" |
| `prev_action` | str | "pass", "cross", "dribble", "set_piece", "rebound" |
| `num_defenders` | int | Defenders between shooter and goal |
| `gk_x` | float | Goalkeeper x-coordinate |
| `gk_y` | float | Goalkeeper y-coordinate |
| `is_home` | bool | Whether the shooting team is playing at home |
| `score_diff` | int | Goal difference at time of shot (shooter's perspective) |
| `is_goal` | int | Target variable: 1 if goal, 0 otherwise |
Step 1: Feature Engineering
Derived Spatial Features
The raw $(x, y)$ coordinates are less informative than derived spatial features:
import numpy as np
import pandas as pd

# Goal center coordinates (assuming a 120x80 pitch)
GOAL_X = 120.0
GOAL_Y = 40.0
POST_LEFT_Y = 36.34   # 7.32m goal width centered at y=40
POST_RIGHT_Y = 43.66
def engineer_spatial_features(df: pd.DataFrame) -> pd.DataFrame:
"""Compute spatial features from shot and goalkeeper coordinates."""
df = df.copy()
# Distance to goal center
df["distance_to_goal"] = np.sqrt(
(GOAL_X - df["x"]) ** 2 + (GOAL_Y - df["y"]) ** 2
)
# Angle to goal (radians) using the two posts
a = np.sqrt((GOAL_X - df["x"]) ** 2 + (POST_LEFT_Y - df["y"]) ** 2)
b = np.sqrt((GOAL_X - df["x"]) ** 2 + (POST_RIGHT_Y - df["y"]) ** 2)
c = POST_RIGHT_Y - POST_LEFT_Y # goal width
df["angle_to_goal"] = np.arccos(
np.clip((a ** 2 + b ** 2 - c ** 2) / (2 * a * b), -1, 1)
)
# Goalkeeper distance from goal line
df["gk_distance_from_line"] = GOAL_X - df["gk_x"]
# Goalkeeper angle (how far off-center)
df["gk_offset"] = np.abs(df["gk_y"] - GOAL_Y)
# Is the shot from a central position?
df["is_central"] = ((df["y"] >= 30) & (df["y"] <= 50)).astype(int)
# Distance to nearest post
dist_left = np.sqrt((GOAL_X - df["x"]) ** 2 + (POST_LEFT_Y - df["y"]) ** 2)
dist_right = np.sqrt((GOAL_X - df["x"]) ** 2 + (POST_RIGHT_Y - df["y"]) ** 2)
df["distance_to_near_post"] = np.minimum(dist_left, dist_right)
return df
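The `angle_to_goal` computation above is the visible goal angle: the angle subtended at the shot location by the two posts. With $a$ and $b$ the distances from the shot to each post and $c$ the goal width, the law of cosines gives

$$\theta = \arccos\left(\frac{a^2 + b^2 - c^2}{2ab}\right),$$

and the `np.clip` guards against floating-point values falling just outside $[-1, 1]$.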
Contextual and Interaction Features
def engineer_context_features(df: pd.DataFrame) -> pd.DataFrame:
"""Compute contextual and interaction features."""
df = df.copy()
# Is the team trailing?
df["is_trailing"] = (df["score_diff"] < 0).astype(int)
# Late-game indicator (after 75th minute)
df["is_late_game"] = (df["minute"] >= 75).astype(int)
# Interaction: distance x body_part (header from distance is very different)
df["distance_x_is_header"] = df["distance_to_goal"] * (
df["body_part"] == "head"
).astype(int)
# Interaction: angle x defenders
df["angle_x_defenders"] = df["angle_to_goal"] * df["num_defenders"]
# Is penalty
df["is_penalty"] = (df["shot_type"] == "penalty").astype(int)
return df
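Both helpers are applied in sequence to the raw shot table before any splitting or model fitting:

df = engineer_spatial_features(df)
df = engineer_context_features(df)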
Final Feature Set
After engineering, our feature matrix contains 18 features:
| # | Feature | Type | Source |
|---|---|---|---|
| 1 | `distance_to_goal` | Continuous | Derived |
| 2 | `angle_to_goal` | Continuous | Derived |
| 3 | `gk_distance_from_line` | Continuous | Derived |
| 4 | `gk_offset` | Continuous | Derived |
| 5 | `is_central` | Binary | Derived |
| 6 | `distance_to_near_post` | Continuous | Derived |
| 7 | `num_defenders` | Integer | Raw |
| 8 | `is_home` | Binary | Raw |
| 9 | `score_diff` | Integer | Raw |
| 10 | `minute` | Integer | Raw |
| 11 | `is_trailing` | Binary | Derived |
| 12 | `is_late_game` | Binary | Derived |
| 13 | `distance_x_is_header` | Continuous | Interaction |
| 14 | `angle_x_defenders` | Continuous | Interaction |
| 15 | `is_penalty` | Binary | Derived |
| 16 | `body_part` | Categorical | Raw (encoded) |
| 17 | `shot_type` | Categorical | Raw (encoded) |
| 18 | `prev_action` | Categorical | Raw (encoded) |
Step 2: Train/Test Split
We split by season to prevent temporal leakage:
train_seasons = ["2017/18", "2018/19", "2019/20", "2020/21"]
val_season = "2021/22"
test_season = "2022/23"
train_df = df[df["season"].isin(train_seasons)]
val_df = df[df["season"] == val_season]
test_df = df[df["season"] == test_season]
print(f"Train: {len(train_df)} shots ({train_df['is_goal'].mean():.3f} goal rate)")
print(f"Val: {len(val_df)} shots ({val_df['is_goal'].mean():.3f} goal rate)")
print(f"Test: {len(test_df)} shots ({test_df['is_goal'].mean():.3f} goal rate)")
Step 3: Preprocessing Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
numerical_features = [
"distance_to_goal", "angle_to_goal", "gk_distance_from_line",
"gk_offset", "distance_to_near_post", "num_defenders",
"score_diff", "minute", "distance_x_is_header", "angle_x_defenders"
]
binary_features = [
"is_central", "is_home", "is_trailing", "is_late_game", "is_penalty"
]
categorical_features = ["body_part", "shot_type", "prev_action"]
preprocessor = ColumnTransformer([
("num", Pipeline([
("imputer", SimpleImputer(strategy="median")),
("scaler", StandardScaler())
]), numerical_features),
("bin", "passthrough", binary_features),
("cat", OneHotEncoder(drop="first", sparse_output=False,
handle_unknown="infrequent_if_exist"),
categorical_features)
])
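The modeling code below references `X_train`, `y_train`, and their validation and test counterparts. A minimal sketch of how they are assembled from the season splits (assuming both feature engineering functions have already been applied to `df`):

feature_cols = numerical_features + binary_features + categorical_features
X_train, y_train = train_df[feature_cols], train_df["is_goal"]
X_val, y_val = val_df[feature_cols], val_df["is_goal"]
X_test, y_test = test_df[feature_cols], test_df["is_goal"]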
Step 4: Baseline Model
We start with logistic regression as a baseline:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss, log_loss
baseline_pipeline = Pipeline([
("preprocessor", preprocessor),
("classifier", LogisticRegression(max_iter=1000, random_state=42))
])
baseline_pipeline.fit(X_train, y_train)
y_prob_baseline = baseline_pipeline.predict_proba(X_test)[:, 1]
print(f"Baseline AUC: {roc_auc_score(y_test, y_prob_baseline):.4f}")
print(f"Baseline Log-Loss: {log_loss(y_test, y_prob_baseline):.4f}")
print(f"Baseline Brier: {brier_score_loss(y_test, y_prob_baseline):.4f}")
Typical baseline results: AUC ~ 0.77, Log-loss ~ 0.32, Brier ~ 0.085.
Step 5: Gradient Boosting Model
Initial Model
from sklearn.ensemble import GradientBoostingClassifier
gb_pipeline = Pipeline([
("preprocessor", preprocessor),
("classifier", GradientBoostingClassifier(
n_estimators=500,
learning_rate=0.05,
max_depth=5,
subsample=0.8,
min_samples_leaf=20,
random_state=42
))
])
gb_pipeline.fit(X_train, y_train)
y_prob_gb = gb_pipeline.predict_proba(X_test)[:, 1]
print(f"GB AUC: {roc_auc_score(y_test, y_prob_gb):.4f}")
print(f"GB Log-Loss: {log_loss(y_test, y_prob_gb):.4f}")
print(f"GB Brier: {brier_score_loss(y_test, y_prob_gb):.4f}")
Expected improvement: AUC ~ 0.81, Log-loss ~ 0.29, Brier ~ 0.078.
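The initial model trains a fixed 500 trees. As an aside, scikit-learn's built-in early stopping can choose the tree count automatically; a sketch (the patience and holdout fraction are illustrative, not tuned values from this study):

gb_early = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound; early stopping picks the effective count
    learning_rate=0.05,
    max_depth=5,
    subsample=0.8,
    min_samples_leaf=20,
    validation_fraction=0.1,  # internal holdout monitored during fitting
    n_iter_no_change=20,      # stop after 20 rounds without improvement
    random_state=42,
)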
Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from scipy.stats import uniform, randint
param_dist = {
"classifier__n_estimators": randint(200, 800),
"classifier__learning_rate": uniform(0.01, 0.14),
"classifier__max_depth": randint(3, 8),
"classifier__subsample": uniform(0.6, 0.4),
"classifier__min_samples_leaf": randint(10, 50),
}
search = RandomizedSearchCV(
gb_pipeline,
param_distributions=param_dist,
n_iter=40,
scoring="neg_log_loss",
cv=TimeSeriesSplit(n_splits=4),
random_state=42,
n_jobs=-1,
verbose=1
)
search.fit(X_train, y_train)
print(f"Best Log-Loss: {-search.best_score_:.4f}")
print(f"Best Params: {search.best_params_}")
best_model = search.best_estimator_
Step 6: Calibration Analysis
A well-calibrated xG model is essential. We create calibration curves:
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
y_prob_best = best_model.predict_proba(X_test)[:, 1]
prob_true, prob_pred = calibration_curve(y_test, y_prob_best, n_bins=10,
strategy="quantile")
plt.figure(figsize=(8, 8))
plt.plot(prob_pred, prob_true, "s-", label="Gradient Boosting")
plt.plot([0, 1], [0, 1], "k--", label="Perfect Calibration")
plt.xlabel("Predicted Probability")
plt.ylabel("Observed Frequency")
plt.title("xG Model Calibration Curve")
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig("calibration_curve.png", dpi=150)
plt.show()
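The curve can be summarized with a single number. A sketch of a quantile-binned expected calibration error; `expected_calibration_error` is a hypothetical helper, not part of scikit-learn:

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Mean |observed frequency - mean predicted probability|, weighted by bin size."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.quantile(y_prob, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, y_prob, side="right") - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

print(f"ECE: {expected_calibration_error(y_test, y_prob_best):.4f}")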
If the model shows systematic over- or under-confidence, we apply isotonic regression calibration:
from sklearn.calibration import CalibratedClassifierCV
calibrated_model = CalibratedClassifierCV(
best_model, method="isotonic", cv=5
)
calibrated_model.fit(X_train, y_train)
y_prob_cal = calibrated_model.predict_proba(X_test)[:, 1]
print(f"Calibrated Brier: {brier_score_loss(y_test, y_prob_cal):.4f}")
Step 7: Model Interpretability with SHAP
Broadcasters need to explain why a particular shot had a high or low xG value. SHAP (SHapley Additive exPlanations) provides per-prediction feature contributions:
import shap
# Extract the gradient boosting model from the pipeline
gb_model = best_model.named_steps["classifier"]
X_test_transformed = best_model.named_steps["preprocessor"].transform(X_test)
# Get feature names after preprocessing
feature_names = (
numerical_features +
binary_features +
list(best_model.named_steps["preprocessor"]
.named_transformers_["cat"]
.get_feature_names_out(categorical_features))
)
explainer = shap.TreeExplainer(gb_model)
shap_values = explainer.shap_values(X_test_transformed)
# Global feature importance
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values, X_test_transformed,
feature_names=feature_names, show=False)
plt.tight_layout()
plt.savefig("shap_summary.png", dpi=150, bbox_inches="tight")
plt.show()
Individual Prediction Explanation
For a specific high-xG shot, we can show which features drove the prediction:
# Explain a single shot (e.g., a close-range header from a cross)
shot_idx = 42
shap.force_plot(
explainer.expected_value,
shap_values[shot_idx, :],
X_test_transformed[shot_idx, :],
feature_names=feature_names,
matplotlib=True
)
This might reveal, for example: "Distance to goal (4.2m) pushed xG up by +0.18, angle to goal pushed it up by +0.12, but it being a header pushed it down by -0.05, for a final xG of 0.42." Note that TreeExplainer returns contributions in log-odds space for GradientBoostingClassifier by default, so raw SHAP values must be mapped through the logistic function before being quoted as probability changes like these.
Step 8: Model Performance Summary
| Model | AUC-ROC | Log-Loss | Brier Score |
|---|---|---|---|
| Logistic Regression (baseline) | 0.770 | 0.320 | 0.085 |
| Gradient Boosting (initial) | 0.810 | 0.290 | 0.078 |
| Gradient Boosting (tuned) | 0.818 | 0.283 | 0.075 |
| Gradient Boosting (calibrated) | 0.816 | 0.280 | 0.073 |
The calibrated model sacrifices a tiny amount of AUC for significantly better calibration, which is the right trade-off for a broadcast-facing xG model.
Step 9: Deployment
Model Serialization
import joblib
# Save the full calibrated pipeline
joblib.dump(calibrated_model, "xg_model_v1.0.joblib")
# Save the preprocessing pipeline separately for validation
joblib.dump(best_model.named_steps["preprocessor"],
"xg_preprocessor_v1.0.joblib")
# Model metadata
metadata = {
"model_name": "xg_model",
"version": "1.0",
"training_seasons": train_seasons,
"test_season": test_season,
"n_features": len(feature_names),
"test_auc": 0.816,
"test_brier": 0.073,
"algorithm": "GradientBoostingClassifier + Isotonic Calibration",
}
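One way to persist the metadata alongside the serialized model (a minimal sketch assuming a local file):

import json

with open("xg_model_v1.0_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)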
Prediction Function
def predict_xg(shot_data: pd.DataFrame,
model_path: str = "xg_model_v1.0.joblib") -> np.ndarray:
"""Predict xG for new shot data.
Args:
shot_data: DataFrame with required shot features.
model_path: Path to the serialized model.
Returns:
Array of xG probabilities.
"""
    model = joblib.load(model_path)  # for live broadcast use, load once at startup and reuse
# Apply feature engineering
shot_data = engineer_spatial_features(shot_data)
shot_data = engineer_context_features(shot_data)
return model.predict_proba(shot_data)[:, 1]
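A usage sketch (the shot below is illustrative, with made-up values for a close-range shot following a cross):

shot = pd.DataFrame([{
    "x": 108.0, "y": 42.0, "minute": 67,
    "body_part": "right_foot", "shot_type": "open_play", "prev_action": "cross",
    "num_defenders": 2, "gk_x": 118.0, "gk_y": 39.0,
    "is_home": True, "score_diff": 0,
}])
print(f"xG: {predict_xg(shot)[0]:.3f}")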
Step 10: Monitoring Plan
Once deployed, we monitor:
- Weekly Brier score on new matches vs. the baseline of 0.073.
- Monthly calibration curves to detect systematic drift.
- Feature distribution PSI for key features (`distance_to_goal`, `angle_to_goal`) to detect data drift; see the sketch after this list.
- Alert threshold: Retrain if the weekly Brier score exceeds 0.085 for three consecutive weeks.
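PSI here is the population stability index. A minimal sketch of how it could be computed for a single feature (the quantile binning and the 10-bin choice are assumptions, and `new_week_df` is a hypothetical frame of newly collected shots):

def psi(expected, actual, n_bins=10):
    """Population stability index between a reference sample and a new sample."""
    expected, actual = np.asarray(expected), np.asarray(actual)
    # Bin edges from the reference (training) distribution
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))

    def bin_fractions(x):
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
        return np.bincount(idx, minlength=n_bins) / len(x)

    e = np.clip(bin_fractions(expected), 1e-6, None)  # avoid log(0)
    a = np.clip(bin_fractions(actual), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

# psi(train_df["distance_to_goal"], new_week_df["distance_to_goal"])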
Lessons Learned
- Feature engineering provided the largest performance gains. Moving from raw $(x, y)$ coordinates to derived spatial features improved AUC by 0.05.
- Gradient boosting outperformed logistic regression, but the gap narrowed once the logistic regression was given the same well-engineered features.
- Calibration is essential for broadcast use. An uncalibrated model produces xG values that do not sum to reasonable match totals.
- SHAP explanations are invaluable for building trust with non-technical stakeholders. Broadcasters can now say "this shot had a high xG because the shooter was close to goal with no defenders blocking."
- Temporal splits are non-negotiable. A random split inflated AUC by 0.03 relative to the temporal split, a misleading overestimate.
Code Reference
The complete code for this case study is available in `code/case-study-code.py` (Section 2).