Case Study 2: Building an xG Model with Gradient Boosting: From Features to Deployment

Background

Expected goals (xG) is the foundational metric in modern soccer analytics. Every major data provider, broadcaster, and club maintains an xG model. This case study walks through the complete lifecycle of building a production-quality xG model using gradient boosting, from raw event data through feature engineering, training, evaluation, interpretability analysis, and deployment.

We simulate a scenario where a data analytics company is building an xG model for a broadcasting client. The model must be:

  1. Accurate: State-of-the-art predictive performance.
  2. Well-calibrated: Predicted probabilities must match observed frequencies.
  3. Interpretable: Broadcasters need to explain predictions to viewers.
  4. Deployable: The model must run in real-time during live broadcasts.

Data Description

We use synthetic shot data modeled on realistic distributions from top-five European leagues. The dataset contains 50,000 shots with the following fields:

Field Type Description
match_id int Unique match identifier
season str Season (e.g., "2019/20")
minute int Match minute
x float Shot x-coordinate (0--120, attacking direction)
y float Shot y-coordinate (0--80, left to right)
body_part str "right_foot", "left_foot", "head", "other"
shot_type str "open_play", "free_kick", "corner", "penalty"
prev_action str "pass", "cross", "dribble", "set_piece", "rebound"
num_defenders int Defenders between shooter and goal
gk_x float Goalkeeper x-coordinate
gk_y float Goalkeeper y-coordinate
is_home bool Whether the shooting team is playing at home
score_diff int Goal difference at time of shot (shooter's perspective)
is_goal int Target variable: 1 if goal, 0 otherwise

Step 1: Feature Engineering

Derived Spatial Features

The raw $(x, y)$ coordinates are less informative than derived spatial features:

import numpy as np
import pandas as pd

# Goal center coordinates (assuming a 120x80 pitch)
GOAL_X = 120.0
GOAL_Y = 40.0
POST_LEFT_Y = 36.34   # 7.32-unit (7.32 m / 8 yd) goal width centered at y = 40
POST_RIGHT_Y = 43.66

def engineer_spatial_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute spatial features from shot and goalkeeper coordinates."""
    df = df.copy()

    # Distance to goal center
    df["distance_to_goal"] = np.sqrt(
        (GOAL_X - df["x"]) ** 2 + (GOAL_Y - df["y"]) ** 2
    )

    # Angle to goal (radians) using the two posts
    a = np.sqrt((GOAL_X - df["x"]) ** 2 + (POST_LEFT_Y - df["y"]) ** 2)
    b = np.sqrt((GOAL_X - df["x"]) ** 2 + (POST_RIGHT_Y - df["y"]) ** 2)
    c = POST_RIGHT_Y - POST_LEFT_Y  # goal width
    df["angle_to_goal"] = np.arccos(
        np.clip((a ** 2 + b ** 2 - c ** 2) / (2 * a * b), -1, 1)
    )

    # Goalkeeper distance from goal line
    df["gk_distance_from_line"] = GOAL_X - df["gk_x"]

    # Goalkeeper angle (how far off-center)
    df["gk_offset"] = np.abs(df["gk_y"] - GOAL_Y)

    # Is the shot from a central position?
    df["is_central"] = ((df["y"] >= 30) & (df["y"] <= 50)).astype(int)

    # Distance to nearest post
    dist_left = np.sqrt((GOAL_X - df["x"]) ** 2 + (POST_LEFT_Y - df["y"]) ** 2)
    dist_right = np.sqrt((GOAL_X - df["x"]) ** 2 + (POST_RIGHT_Y - df["y"]) ** 2)
    df["distance_to_near_post"] = np.minimum(dist_left, dist_right)

    return df
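As a quick sanity check on the law-of-cosines angle: for a shot on the centre line at distance d from goal, the visible angle reduces to 2 * arctan((goal width / 2) / d). From the penalty spot (12 units out), the two formulas agree:

```python
import numpy as np

GOAL_X, GOAL_Y = 120.0, 40.0
POST_LEFT_Y, POST_RIGHT_Y = 36.34, 43.66

x, y = 108.0, 40.0  # penalty spot, on the centre line

# Law of cosines, as in engineer_spatial_features
a = np.hypot(GOAL_X - x, POST_LEFT_Y - y)
b = np.hypot(GOAL_X - x, POST_RIGHT_Y - y)
c = POST_RIGHT_Y - POST_LEFT_Y
angle_lawcos = np.arccos(np.clip((a**2 + b**2 - c**2) / (2 * a * b), -1, 1))

# Closed form for a shot on the centre line
angle_central = 2 * np.arctan((c / 2) / (GOAL_X - x))

print(f"{angle_lawcos:.6f} rad vs {angle_central:.6f} rad")
```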

Contextual and Interaction Features

def engineer_context_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute contextual and interaction features."""
    df = df.copy()

    # Is the team trailing?
    df["is_trailing"] = (df["score_diff"] < 0).astype(int)

    # Late-game indicator (after 75th minute)
    df["is_late_game"] = (df["minute"] >= 75).astype(int)

    # Interaction: distance x body_part (header from distance is very different)
    df["distance_x_is_header"] = df["distance_to_goal"] * (
        df["body_part"] == "head"
    ).astype(int)

    # Interaction: angle x defenders
    df["angle_x_defenders"] = df["angle_to_goal"] * df["num_defenders"]

    # Is penalty
    df["is_penalty"] = (df["shot_type"] == "penalty").astype(int)

    return df

Final Feature Set

After engineering, our feature matrix contains 18 features:

# Feature Type Source
1 distance_to_goal Continuous Derived
2 angle_to_goal Continuous Derived
3 gk_distance_from_line Continuous Derived
4 gk_offset Continuous Derived
5 is_central Binary Derived
6 distance_to_near_post Continuous Derived
7 num_defenders Integer Raw
8 is_home Binary Raw
9 score_diff Integer Raw
10 minute Integer Raw
11 is_trailing Binary Derived
12 is_late_game Binary Derived
13 distance_x_is_header Continuous Interaction
14 angle_x_defenders Continuous Interaction
15 is_penalty Binary Derived
16 body_part Categorical Raw (encoded)
17 shot_type Categorical Raw (encoded)
18 prev_action Categorical Raw (encoded)

Step 2: Train/Test Split

We split by season to prevent temporal leakage:

train_seasons = ["2017/18", "2018/19", "2019/20", "2020/21"]
val_season = "2021/22"
test_season = "2022/23"

train_df = df[df["season"].isin(train_seasons)]
val_df = df[df["season"] == val_season]
test_df = df[df["season"] == test_season]

print(f"Train: {len(train_df)} shots ({train_df['is_goal'].mean():.3f} goal rate)")
print(f"Val:   {len(val_df)} shots ({val_df['is_goal'].mean():.3f} goal rate)")
print(f"Test:  {len(test_df)} shots ({test_df['is_goal'].mean():.3f} goal rate)")

Step 3: Preprocessing Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

numerical_features = [
    "distance_to_goal", "angle_to_goal", "gk_distance_from_line",
    "gk_offset", "distance_to_near_post", "num_defenders",
    "score_diff", "minute", "distance_x_is_header", "angle_x_defenders"
]
binary_features = [
    "is_central", "is_home", "is_trailing", "is_late_game", "is_penalty"
]
categorical_features = ["body_part", "shot_type", "prev_action"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), numerical_features),
    ("bin", "passthrough", binary_features),
    ("cat", OneHotEncoder(drop="first", sparse_output=False,
                          handle_unknown="infrequent_if_exist"),
     categorical_features)
])

Step 4: Baseline Model

We start with logistic regression as a baseline:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss, log_loss

# Assemble feature matrices (after applying the Step 1 engineering functions)
feature_cols = numerical_features + binary_features + categorical_features
X_train, y_train = train_df[feature_cols], train_df["is_goal"]
X_test, y_test = test_df[feature_cols], test_df["is_goal"]

baseline_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000, random_state=42))
])
baseline_pipeline.fit(X_train, y_train)

y_prob_baseline = baseline_pipeline.predict_proba(X_test)[:, 1]
print(f"Baseline AUC:        {roc_auc_score(y_test, y_prob_baseline):.4f}")
print(f"Baseline Log-Loss:   {log_loss(y_test, y_prob_baseline):.4f}")
print(f"Baseline Brier:      {brier_score_loss(y_test, y_prob_baseline):.4f}")

Typical baseline results: AUC ~ 0.77, Log-loss ~ 0.32, Brier ~ 0.085.

Step 5: Gradient Boosting Model

Initial Model

from sklearn.ensemble import GradientBoostingClassifier

gb_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(
        n_estimators=500,
        learning_rate=0.05,
        max_depth=5,
        subsample=0.8,
        min_samples_leaf=20,
        random_state=42
    ))
])
gb_pipeline.fit(X_train, y_train)

y_prob_gb = gb_pipeline.predict_proba(X_test)[:, 1]
print(f"GB AUC:        {roc_auc_score(y_test, y_prob_gb):.4f}")
print(f"GB Log-Loss:   {log_loss(y_test, y_prob_gb):.4f}")
print(f"GB Brier:      {brier_score_loss(y_test, y_prob_gb):.4f}")

Expected improvement: AUC ~ 0.81, Log-loss ~ 0.29, Brier ~ 0.078.
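Fixing n_estimators=500 up front risks overfitting in late boosting rounds. GradientBoostingClassifier's staged_predict_proba yields predictions after each round, so the round count can instead be chosen against held-out data. A sketch, with synthetic imbalanced data standing in for the shot dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~10% positives, roughly mirroring goal rates
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          random_state=42)

gb = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                max_depth=3, random_state=42)
gb.fit(X_tr, y_tr)

# Validation log-loss after each boosting round
val_losses = [log_loss(y_va, proba[:, 1])
              for proba in gb.staged_predict_proba(X_va)]
best_n = int(np.argmin(val_losses)) + 1
print(f"Best n_estimators by validation log-loss: {best_n}")
```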

Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from scipy.stats import uniform, randint

param_dist = {
    "classifier__n_estimators": randint(200, 800),
    "classifier__learning_rate": uniform(0.01, 0.14),
    "classifier__max_depth": randint(3, 8),
    "classifier__subsample": uniform(0.6, 0.4),
    "classifier__min_samples_leaf": randint(10, 50),
}

search = RandomizedSearchCV(
    gb_pipeline,
    param_distributions=param_dist,
    n_iter=40,
    scoring="neg_log_loss",
    # TimeSeriesSplit assumes X_train rows are in chronological order
    cv=TimeSeriesSplit(n_splits=4),
    random_state=42,
    n_jobs=-1,
    verbose=1
)
search.fit(X_train, y_train)

print(f"Best Log-Loss: {-search.best_score_:.4f}")
print(f"Best Params: {search.best_params_}")

best_model = search.best_estimator_

Step 6: Calibration Analysis

A well-calibrated xG model is essential. We create calibration curves:

from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

y_prob_best = best_model.predict_proba(X_test)[:, 1]

prob_true, prob_pred = calibration_curve(y_test, y_prob_best, n_bins=10,
                                          strategy="quantile")

plt.figure(figsize=(8, 8))
plt.plot(prob_pred, prob_true, "s-", label="Gradient Boosting")
plt.plot([0, 1], [0, 1], "k--", label="Perfect Calibration")
plt.xlabel("Predicted Probability")
plt.ylabel("Observed Frequency")
plt.title("xG Model Calibration Curve")
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig("calibration_curve.png", dpi=150)
plt.show()
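The curve can also be summarized in a single monitoring-friendly number, the expected calibration error (ECE): the bin-size-weighted mean gap between predicted probability and observed frequency. A minimal sketch on synthetic predictions (the helper name is ours, not sklearn's):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-size-weighted mean |observed frequency - mean predicted prob|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# Perfectly calibrated synthetic predictions: outcomes drawn from y_prob itself
rng = np.random.default_rng(42)
p = rng.uniform(0.02, 0.6, size=20000)
y = (rng.uniform(size=20000) < p).astype(int)
print(f"ECE (calibrated):    {expected_calibration_error(y, p):.4f}")
# Overconfident variant: same outcomes, probabilities inflated by 1.5x
print(f"ECE (overconfident): "
      f"{expected_calibration_error(y, np.clip(p * 1.5, 0, 1)):.4f}")
```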

If the model shows systematic over- or under-confidence, we apply isotonic regression calibration:

from sklearn.calibration import CalibratedClassifierCV

calibrated_model = CalibratedClassifierCV(
    best_model, method="isotonic", cv=5  # clones and refits best_model per fold
)
calibrated_model.fit(X_train, y_train)

y_prob_cal = calibrated_model.predict_proba(X_test)[:, 1]
print(f"Calibrated Brier: {brier_score_loss(y_test, y_prob_cal):.4f}")

Step 7: Model Interpretability with SHAP

Broadcasters need to explain why a particular shot had a high or low xG value. SHAP (SHapley Additive exPlanations) provides per-prediction feature contributions:

import shap

# Extract the gradient boosting model from the pipeline
gb_model = best_model.named_steps["classifier"]
X_test_transformed = best_model.named_steps["preprocessor"].transform(X_test)

# Get feature names after preprocessing
feature_names = (
    numerical_features +
    binary_features +
    list(best_model.named_steps["preprocessor"]
         .named_transformers_["cat"]
         .get_feature_names_out(categorical_features))
)

explainer = shap.TreeExplainer(gb_model)
shap_values = explainer.shap_values(X_test_transformed)

# Global feature importance
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values, X_test_transformed,
                  feature_names=feature_names, show=False)
plt.tight_layout()
plt.savefig("shap_summary.png", dpi=150, bbox_inches="tight")
plt.show()

Individual Prediction Explanation

For a specific high-xG shot, we can show which features drove the prediction:

# Explain a single shot (e.g., a close-range header from a cross)
shot_idx = 42
shap.force_plot(
    explainer.expected_value,
    shap_values[shot_idx, :],
    X_test_transformed[shot_idx, :],
    feature_names=feature_names,
    matplotlib=True
)

This might reveal: "distance to goal (4.2 units) pushed xG up by +0.18 and the open angle added +0.12, while the header body part pulled it down by 0.05, for a final xG of 0.42." (Note that TreeExplainer returns contributions in log-odds space for gradient boosting; for broadcast they are translated into approximate probability terms.)

Step 8: Model Performance Summary

Model AUC-ROC Log-Loss Brier Score
Logistic Regression (baseline) 0.770 0.320 0.085
Gradient Boosting (initial) 0.810 0.290 0.078
Gradient Boosting (tuned) 0.818 0.283 0.075
Gradient Boosting (calibrated) 0.816 0.280 0.073

The calibrated model sacrifices a tiny amount of AUC for significantly better calibration, which is the right trade-off for a broadcast-facing xG model.

Step 9: Deployment

Model Serialization

import joblib

# Save the full calibrated pipeline
joblib.dump(calibrated_model, "xg_model_v1.0.joblib")

# Save the preprocessing pipeline separately for validation
joblib.dump(best_model.named_steps["preprocessor"],
            "xg_preprocessor_v1.0.joblib")

# Model metadata
metadata = {
    "model_name": "xg_model",
    "version": "1.0",
    "training_seasons": train_seasons,
    "test_season": test_season,
    "n_features": len(feature_names),
    "test_auc": 0.816,
    "test_brier": 0.073,
    "algorithm": "GradientBoostingClassifier + Isotonic Calibration",
}
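Persisting this metadata as JSON next to the model artifact keeps deployments auditable; a minimal sketch (writing to a temp directory here, whereas in practice it would sit beside xg_model_v1.0.joblib):

```python
import json
import os
import tempfile

metadata = {
    "model_name": "xg_model",
    "version": "1.0",
    "training_seasons": ["2017/18", "2018/19", "2019/20", "2020/21"],
    "test_season": "2022/23",
    "n_features": 18,
    "test_auc": 0.816,
    "test_brier": 0.073,
    "algorithm": "GradientBoostingClassifier + Isotonic Calibration",
}

# Write the metadata alongside the model artifact (tempdir for this sketch)
meta_path = os.path.join(tempfile.gettempdir(), "xg_model_v1.0.json")
with open(meta_path, "w") as f:
    json.dump(metadata, f, indent=2)
```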

Prediction Function

def predict_xg(shot_data: pd.DataFrame,
               model_path: str = "xg_model_v1.0.joblib") -> np.ndarray:
    """Predict xG for new shot data.

    Args:
        shot_data: DataFrame with required shot features.
        model_path: Path to the serialized model.

    Returns:
        Array of xG probabilities.
    """
    model = joblib.load(model_path)  # for live use, load once at startup and cache

    # Apply feature engineering
    shot_data = engineer_spatial_features(shot_data)
    shot_data = engineer_context_features(shot_data)

    return model.predict_proba(shot_data)[:, 1]
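Live event feeds occasionally arrive with fields missing, so a pre-flight schema check before calling predict_xg is cheap insurance. A sketch, with the required-column list taken from the raw schema above (the helper name is ours, not part of the pipeline):

```python
import pandas as pd

# Raw fields the feature-engineering steps expect (from the data description)
REQUIRED_RAW_COLUMNS = [
    "minute", "x", "y", "body_part", "shot_type", "prev_action",
    "num_defenders", "gk_x", "gk_y", "is_home", "score_diff",
]

def validate_shot_payload(df: pd.DataFrame) -> None:
    """Raise a descriptive error if any required columns are missing."""
    missing = [c for c in REQUIRED_RAW_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Shot payload missing columns: {missing}")

good = pd.DataFrame([{c: 0 for c in REQUIRED_RAW_COLUMNS}])
validate_shot_payload(good)  # passes silently

bad = good.drop(columns=["gk_x"])
try:
    validate_shot_payload(bad)
except ValueError as e:
    print(e)
```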

Step 10: Monitoring Plan

Once deployed, we monitor:

  1. Weekly Brier score on new matches vs. the baseline of 0.073.
  2. Monthly calibration curves to detect systematic drift.
  3. Feature distribution PSI for key features (distance_to_goal, angle_to_goal) to detect data drift.
  4. Alert threshold: Retrain if weekly Brier score exceeds 0.085 for three consecutive weeks.
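Item 3 relies on PSI (Population Stability Index), defined over bins i as the sum of (p_i - q_i) * ln(p_i / q_i), where p are the reference proportions and q the current ones. A sketch that bins on reference quantiles (the function name is ours):

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample.

    Bin edges come from reference quantiles; a small epsilon guards against
    empty bins. Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift.
    """
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # capture out-of-range values
    p = np.histogram(reference, bins=edges)[0] / len(reference)
    q = np.histogram(current, bins=edges)[0] / len(current)
    eps = 1e-6
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
ref = rng.normal(16, 6, size=5000)      # e.g. distance_to_goal, reference window
same = rng.normal(16, 6, size=5000)     # new window, same distribution
shifted = rng.normal(20, 6, size=5000)  # new window, drifted mean
print(f"PSI (no drift): {psi(ref, same):.4f}")
print(f"PSI (shifted):  {psi(ref, shifted):.4f}")
```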

Lessons Learned

  1. Feature engineering provided the largest performance gains. Moving from raw $(x, y)$ to derived spatial features improved AUC by 0.05.

  2. Gradient boosting outperformed logistic regression, but the gap narrowed once logistic regression was given the same well-engineered features.

  3. Calibration is essential for broadcast use. An uncalibrated model produces xG values that do not sum to reasonable match totals.

  4. SHAP explanations are invaluable for building trust with non-technical stakeholders. Broadcasters can now say "this shot had a high xG because the shooter was close to goal with no defenders blocking."

  5. Temporal splits are non-negotiable. A random split inflated the AUC by 0.03 compared to the temporal split --- a misleading overestimate.

Code Reference

The complete code for this case study is available in code/case-study-code.py (Section 2).