Case Study 2: Building an xG Model with Gradient Boosting: From Features to Deployment

Background

Expected goals (xG) is the foundational metric in modern soccer analytics. Every major data provider, broadcaster, and club maintains an xG model. This case study walks through the complete lifecycle of building a production-quality xG model using gradient boosting, from raw event data through feature engineering, training, evaluation, interpretability analysis, and deployment.

We simulate a scenario where a data analytics company is building an xG model for a broadcasting client. The model must be:

  1. Accurate: State-of-the-art predictive performance.
  2. Well-calibrated: Predicted probabilities must match observed frequencies.
  3. Interpretable: Broadcasters need to explain predictions to viewers.
  4. Deployable: The model must run in real-time during live broadcasts.

Data Description

We use synthetic shot data modeled on realistic distributions from top-five European leagues. The dataset contains 50,000 shots with the following fields:

Field Type Description
match_id int Unique match identifier
season str Season (e.g., "2019/20")
minute int Match minute
x float Shot x-coordinate (0--120, attacking direction)
y float Shot y-coordinate (0--80, left to right)
body_part str "right_foot", "left_foot", "head", "other"
shot_type str "open_play", "free_kick", "corner", "penalty"
prev_action str "pass", "cross", "dribble", "set_piece", "rebound"
num_defenders int Defenders between shooter and goal
gk_x float Goalkeeper x-coordinate
gk_y float Goalkeeper y-coordinate
is_home bool Whether the shooting team is playing at home
score_diff int Goal difference at time of shot (shooter's perspective)
is_goal int Target variable: 1 if goal, 0 otherwise

Step 1: Feature Engineering

Derived Spatial Features

The raw $(x, y)$ coordinates are less informative than derived spatial features:

import numpy as np
import pandas as pd

# Goal center coordinates (assuming a 120x80 pitch)
GOAL_X = 120.0
GOAL_Y = 40.0
POST_LEFT_Y = 36.34   # 7.32-unit (7.32 m / 8 yd) goal width centered at y = 40
POST_RIGHT_Y = 43.66

def engineer_spatial_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute spatial features from shot and goalkeeper coordinates."""
    df = df.copy()

    # Distance to goal center
    df["distance_to_goal"] = np.sqrt(
        (GOAL_X - df["x"]) ** 2 + (GOAL_Y - df["y"]) ** 2
    )

    # Angle to goal (radians) using the two posts
    a = np.sqrt((GOAL_X - df["x"]) ** 2 + (POST_LEFT_Y - df["y"]) ** 2)
    b = np.sqrt((GOAL_X - df["x"]) ** 2 + (POST_RIGHT_Y - df["y"]) ** 2)
    c = POST_RIGHT_Y - POST_LEFT_Y  # goal width
    df["angle_to_goal"] = np.arccos(
        np.clip((a ** 2 + b ** 2 - c ** 2) / (2 * a * b), -1, 1)
    )

    # Goalkeeper distance from goal line
    df["gk_distance_from_line"] = GOAL_X - df["gk_x"]

    # Goalkeeper angle (how far off-center)
    df["gk_offset"] = np.abs(df["gk_y"] - GOAL_Y)

    # Is the shot from a central position?
    df["is_central"] = ((df["y"] >= 30) & (df["y"] <= 50)).astype(int)

    # Distance to nearest post
    dist_left = np.sqrt((GOAL_X - df["x"]) ** 2 + (POST_LEFT_Y - df["y"]) ** 2)
    dist_right = np.sqrt((GOAL_X - df["x"]) ** 2 + (POST_RIGHT_Y - df["y"]) ** 2)
    df["distance_to_near_post"] = np.minimum(dist_left, dist_right)

    return df
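As a quick sanity check on the law-of-cosines angle: for a shot on the centre line at distance d from goal, the visible angle reduces to 2 * arctan((goal width / 2) / d). From the penalty spot (12 units out), the two formulas agree:

```python
import numpy as np

GOAL_X, GOAL_Y = 120.0, 40.0
POST_LEFT_Y, POST_RIGHT_Y = 36.34, 43.66

x, y = 108.0, 40.0  # penalty spot, on the centre line

# Law of cosines, as in engineer_spatial_features
a = np.hypot(GOAL_X - x, POST_LEFT_Y - y)
b = np.hypot(GOAL_X - x, POST_RIGHT_Y - y)
c = POST_RIGHT_Y - POST_LEFT_Y
angle_lawcos = np.arccos(np.clip((a**2 + b**2 - c**2) / (2 * a * b), -1, 1))

# Closed form for a shot on the centre line
angle_central = 2 * np.arctan((c / 2) / (GOAL_X - x))

print(f"{angle_lawcos:.6f} rad vs {angle_central:.6f} rad")
```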

Contextual and Interaction Features

def engineer_context_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute contextual and interaction features."""
    df = df.copy()

    # Is the team trailing?
    df["is_trailing"] = (df["score_diff"] < 0).astype(int)

    # Late-game indicator (after 75th minute)
    df["is_late_game"] = (df["minute"] >= 75).astype(int)

    # Interaction: distance x body_part (header from distance is very different)
    df["distance_x_is_header"] = df["distance_to_goal"] * (
        df["body_part"] == "head"
    ).astype(int)

    # Interaction: angle x defenders
    df["angle_x_defenders"] = df["angle_to_goal"] * df["num_defenders"]

    # Is penalty
    df["is_penalty"] = (df["shot_type"] == "penalty").astype(int)

    return df

Final Feature Set

After engineering, our feature matrix contains 18 features:

# Feature Type Source
1 distance_to_goal Continuous Derived
2 angle_to_goal Continuous Derived
3 gk_distance_from_line Continuous Derived
4 gk_offset Continuous Derived
5 is_central Binary Derived
6 distance_to_near_post Continuous Derived
7 num_defenders Integer Raw
8 is_home Binary Raw
9 score_diff Integer Raw
10 minute Integer Raw
11 is_trailing Binary Derived
12 is_late_game Binary Derived
13 distance_x_is_header Continuous Interaction
14 angle_x_defenders Continuous Interaction
15 is_penalty Binary Derived
16 body_part Categorical Raw (encoded)
17 shot_type Categorical Raw (encoded)
18 prev_action Categorical Raw (encoded)

Step 2: Train/Test Split

We split by season to prevent temporal leakage:

train_seasons = ["2017/18", "2018/19", "2019/20", "2020/21"]
val_season = "2021/22"
test_season = "2022/23"

train_df = df[df["season"].isin(train_seasons)]
val_df = df[df["season"] == val_season]
test_df = df[df["season"] == test_season]

print(f"Train: {len(train_df)} shots ({train_df['is_goal'].mean():.3f} goal rate)")
print(f"Val:   {len(val_df)} shots ({val_df['is_goal'].mean():.3f} goal rate)")
print(f"Test:  {len(test_df)} shots ({test_df['is_goal'].mean():.3f} goal rate)")

Step 3: Preprocessing Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

numerical_features = [
    "distance_to_goal", "angle_to_goal", "gk_distance_from_line",
    "gk_offset", "distance_to_near_post", "num_defenders",
    "score_diff", "minute", "distance_x_is_header", "angle_x_defenders"
]
binary_features = [
    "is_central", "is_home", "is_trailing", "is_late_game", "is_penalty"
]
categorical_features = ["body_part", "shot_type", "prev_action"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), numerical_features),
    ("bin", "passthrough", binary_features),
    ("cat", OneHotEncoder(drop="first", sparse_output=False,
                          handle_unknown="infrequent_if_exist"),
     categorical_features)
])

Step 4: Baseline Model

We start with logistic regression as a baseline:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss, log_loss

# Assemble feature matrices (after applying the Step 1 engineering functions)
feature_cols = numerical_features + binary_features + categorical_features
X_train, y_train = train_df[feature_cols], train_df["is_goal"]
X_test, y_test = test_df[feature_cols], test_df["is_goal"]

baseline_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000, random_state=42))
])
baseline_pipeline.fit(X_train, y_train)

y_prob_baseline = baseline_pipeline.predict_proba(X_test)[:, 1]
print(f"Baseline AUC:        {roc_auc_score(y_test, y_prob_baseline):.4f}")
print(f"Baseline Log-Loss:   {log_loss(y_test, y_prob_baseline):.4f}")
print(f"Baseline Brier:      {brier_score_loss(y_test, y_prob_baseline):.4f}")

Typical baseline results: AUC ~ 0.77, Log-loss ~ 0.32, Brier ~ 0.085.

Step 5: Gradient Boosting Model

Initial Model

from sklearn.ensemble import GradientBoostingClassifier

gb_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(
        n_estimators=500,
        learning_rate=0.05,
        max_depth=5,
        subsample=0.8,
        min_samples_leaf=20,
        random_state=42
    ))
])
gb_pipeline.fit(X_train, y_train)

y_prob_gb = gb_pipeline.predict_proba(X_test)[:, 1]
print(f"GB AUC:        {roc_auc_score(y_test, y_prob_gb):.4f}")
print(f"GB Log-Loss:   {log_loss(y_test, y_prob_gb):.4f}")
print(f"GB Brier:      {brier_score_loss(y_test, y_prob_gb):.4f}")

Expected improvement: AUC ~ 0.81, Log-loss ~ 0.29, Brier ~ 0.078.
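Fixing n_estimators=500 up front risks overfitting in late boosting rounds. GradientBoostingClassifier's staged_predict_proba yields predictions after each round, so the round count can instead be chosen against held-out data. A sketch, with synthetic imbalanced data standing in for the shot dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~10% positives, roughly mirroring goal rates
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          random_state=42)

gb = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                max_depth=3, random_state=42)
gb.fit(X_tr, y_tr)

# Validation log-loss after each boosting round
val_losses = [log_loss(y_va, proba[:, 1])
              for proba in gb.staged_predict_proba(X_va)]
best_n = int(np.argmin(val_losses)) + 1
print(f"Best n_estimators by validation log-loss: {best_n}")
```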

Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from scipy.stats import uniform, randint

param_dist = {
    "classifier__n_estimators": randint(200, 800),
    "classifier__learning_rate": uniform(0.01, 0.14),
    "classifier__max_depth": randint(3, 8),
    "classifier__subsample": uniform(0.6, 0.4),
    "classifier__min_samples_leaf": randint(10, 50),
}

search = RandomizedSearchCV(
    gb_pipeline,
    param_distributions=param_dist,
    n_iter=40,
    scoring="neg_log_loss",
    # TimeSeriesSplit assumes X_train rows are in chronological order
    cv=TimeSeriesSplit(n_splits=4),
    random_state=42,
    n_jobs=-1,
    verbose=1
)
search.fit(X_train, y_train)

print(f"Best Log-Loss: {-search.best_score_:.4f}")
print(f"Best Params: {search.best_params_}")

best_model = search.best_estimator_

Step 6: Calibration Analysis

A well-calibrated xG model is essential. We create calibration curves:

from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

y_prob_best = best_model.predict_proba(X_test)[:, 1]

prob_true, prob_pred = calibration_curve(y_test, y_prob_best, n_bins=10,
                                          strategy="quantile")

plt.figure(figsize=(8, 8))
plt.plot(prob_pred, prob_true, "s-", label="Gradient Boosting")
plt.plot([0, 1], [0, 1], "k--", label="Perfect Calibration")
plt.xlabel("Predicted Probability")
plt.ylabel("Observed Frequency")
plt.title("xG Model Calibration Curve")
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig("calibration_curve.png", dpi=150)
plt.show()
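The curve can also be summarized in a single monitoring-friendly number, the expected calibration error (ECE): the bin-size-weighted mean gap between predicted probability and observed frequency. A minimal sketch on synthetic predictions (the helper name is ours, not sklearn's):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-size-weighted mean |observed frequency - mean predicted prob|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# Perfectly calibrated synthetic predictions: outcomes drawn from y_prob itself
rng = np.random.default_rng(42)
p = rng.uniform(0.02, 0.6, size=20000)
y = (rng.uniform(size=20000) < p).astype(int)
print(f"ECE (calibrated):    {expected_calibration_error(y, p):.4f}")
# Overconfident variant: same outcomes, probabilities inflated by 1.5x
print(f"ECE (overconfident): "
      f"{expected_calibration_error(y, np.clip(p * 1.5, 0, 1)):.4f}")
```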

If the model shows systematic over- or under-confidence, we apply isotonic regression calibration:

from sklearn.calibration import CalibratedClassifierCV

calibrated_model = CalibratedClassifierCV(
    best_model, method="isotonic", cv=5  # clones and refits best_model per fold
)
calibrated_model.fit(X_train, y_train)

y_prob_cal = calibrated_model.predict_proba(X_test)[:, 1]
print(f"Calibrated Brier: {brier_score_loss(y_test, y_prob_cal):.4f}")

Step 7: Model Interpretability with SHAP

Broadcasters need to explain why a particular shot had a high or low xG value. SHAP (SHapley Additive exPlanations) provides per-prediction feature contributions:

import shap

# Extract the gradient boosting model from the pipeline
gb_model = best_model.named_steps["classifier"]
X_test_transformed = best_model.named_steps["preprocessor"].transform(X_test)

# Get feature names after preprocessing
feature_names = (
    numerical_features +
    binary_features +
    list(best_model.named_steps["preprocessor"]
         .named_transformers_["cat"]
         .get_feature_names_out(categorical_features))
)

explainer = shap.TreeExplainer(gb_model)
shap_values = explainer.shap_values(X_test_transformed)

# Global feature importance
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values, X_test_transformed,
                  feature_names=feature_names, show=False)
plt.tight_layout()
plt.savefig("shap_summary.png", dpi=150, bbox_inches="tight")
plt.show()

Individual Prediction Explanation

For a specific high-xG shot, we can show which features drove the prediction:

# Explain a single shot (e.g., a close-range header from a cross)
shot_idx = 42
shap.force_plot(
    explainer.expected_value,
    shap_values[shot_idx, :],
    X_test_transformed[shot_idx, :],
    feature_names=feature_names,
    matplotlib=True
)

This might reveal: "distance to goal (4.2 units) pushed xG up by +0.18 and the open angle added +0.12, while the header body part pulled it down by 0.05, for a final xG of 0.42." (Note that TreeExplainer returns contributions in log-odds space for gradient boosting; for broadcast they are translated into approximate probability terms.)

Step 8: Model Performance Summary

Model AUC-ROC Log-Loss Brier Score
Logistic Regression (baseline) 0.770 0.320 0.085
Gradient Boosting (initial) 0.810 0.290 0.078
Gradient Boosting (tuned) 0.818 0.283 0.075
Gradient Boosting (calibrated) 0.816 0.280 0.073

The calibrated model sacrifices a tiny amount of AUC for significantly better calibration, which is the right trade-off for a broadcast-facing xG model.

Step 9: Deployment

Model Serialization

import joblib

# Save the full calibrated pipeline
joblib.dump(calibrated_model, "xg_model_v1.0.joblib")

# Save the preprocessing pipeline separately for validation
joblib.dump(best_model.named_steps["preprocessor"],
            "xg_preprocessor_v1.0.joblib")

# Model metadata
metadata = {
    "model_name": "xg_model",
    "version": "1.0",
    "training_seasons": train_seasons,
    "test_season": test_season,
    "n_features": len(feature_names),
    "test_auc": 0.816,
    "test_brier": 0.073,
    "algorithm": "GradientBoostingClassifier + Isotonic Calibration",
}
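Persisting this metadata as JSON next to the model artifact keeps deployments auditable; a minimal sketch (writing to a temp directory here, whereas in practice it would sit beside xg_model_v1.0.joblib):

```python
import json
import os
import tempfile

metadata = {
    "model_name": "xg_model",
    "version": "1.0",
    "training_seasons": ["2017/18", "2018/19", "2019/20", "2020/21"],
    "test_season": "2022/23",
    "n_features": 18,
    "test_auc": 0.816,
    "test_brier": 0.073,
    "algorithm": "GradientBoostingClassifier + Isotonic Calibration",
}

# Write the metadata alongside the model artifact (tempdir for this sketch)
meta_path = os.path.join(tempfile.gettempdir(), "xg_model_v1.0.json")
with open(meta_path, "w") as f:
    json.dump(metadata, f, indent=2)
```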

Prediction Function

def predict_xg(shot_data: pd.DataFrame,
               model_path: str = "xg_model_v1.0.joblib") -> np.ndarray:
    """Predict xG for new shot data.

    Args:
        shot_data: DataFrame with required shot features.
        model_path: Path to the serialized model.

    Returns:
        Array of xG probabilities.
    """
    model = joblib.load(model_path)  # for live use, load once at startup and cache

    # Apply feature engineering
    shot_data = engineer_spatial_features(shot_data)
    shot_data = engineer_context_features(shot_data)

    return model.predict_proba(shot_data)[:, 1]
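Live event feeds occasionally arrive with fields missing, so a pre-flight schema check before calling predict_xg is cheap insurance. A sketch, with the required-column list taken from the raw schema above (the helper name is ours, not part of the pipeline):

```python
import pandas as pd

# Raw fields the feature-engineering steps expect (from the data description)
REQUIRED_RAW_COLUMNS = [
    "minute", "x", "y", "body_part", "shot_type", "prev_action",
    "num_defenders", "gk_x", "gk_y", "is_home", "score_diff",
]

def validate_shot_payload(df: pd.DataFrame) -> None:
    """Raise a descriptive error if any required columns are missing."""
    missing = [c for c in REQUIRED_RAW_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Shot payload missing columns: {missing}")

good = pd.DataFrame([{c: 0 for c in REQUIRED_RAW_COLUMNS}])
validate_shot_payload(good)  # passes silently

bad = good.drop(columns=["gk_x"])
try:
    validate_shot_payload(bad)
except ValueError as e:
    print(e)
```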

Step 10: Monitoring Plan

Once deployed, we monitor:

  1. Weekly Brier score on new matches vs. the baseline of 0.073.
  2. Monthly calibration curves to detect systematic drift.
  3. Feature distribution PSI for key features (distance_to_goal, angle_to_goal) to detect data drift.
  4. Alert threshold: Retrain if weekly Brier score exceeds 0.085 for three consecutive weeks.
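Item 3 relies on PSI (Population Stability Index), defined over bins i as the sum of (p_i - q_i) * ln(p_i / q_i), where p are the reference proportions and q the current ones. A sketch that bins on reference quantiles (the function name is ours):

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample.

    Bin edges come from reference quantiles; a small epsilon guards against
    empty bins. Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift.
    """
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # capture out-of-range values
    p = np.histogram(reference, bins=edges)[0] / len(reference)
    q = np.histogram(current, bins=edges)[0] / len(current)
    eps = 1e-6
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
ref = rng.normal(16, 6, size=5000)      # e.g. distance_to_goal, reference window
same = rng.normal(16, 6, size=5000)     # new window, same distribution
shifted = rng.normal(20, 6, size=5000)  # new window, drifted mean
print(f"PSI (no drift): {psi(ref, same):.4f}")
print(f"PSI (shifted):  {psi(ref, shifted):.4f}")
```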

Lessons Learned

  1. Feature engineering provided the largest performance gains. Moving from raw $(x, y)$ to derived spatial features improved AUC by 0.05.

  2. Gradient boosting outperformed logistic regression, but the gap narrowed once logistic regression was given the same well-engineered features.

  3. Calibration is essential for broadcast use. An uncalibrated model produces xG values that do not sum to reasonable match totals.

  4. SHAP explanations are invaluable for building trust with non-technical stakeholders. Broadcasters can now say "this shot had a high xG because the shooter was close to goal with no defenders blocking."

  5. Temporal splits are non-negotiable. A random split inflated the AUC by 0.03 compared to the temporal split --- a misleading overestimate.

Code Reference

The complete code for this case study is available in code/case-study-code.py (Section 2).