Build Your Own xG Model in Python: A Step-by-Step Tutorial
Expected goals (xG) has transformed how we understand soccer. Instead of asking "how many goals did a team score?" we can ask "how many goals should they have scored based on the quality of their chances?" This single shift unlocks deeper analysis of player finishing, goalkeeper performance, team dominance, and match luck.
In this tutorial, you will build your own xG model in Python from scratch. We will start with free open data, engineer meaningful features, train a logistic regression model, evaluate its performance, and visualize the results. By the end, you will have a working xG model and the foundation to improve it further.
What xG Measures and Why It Matters
Expected goals (xG) assigns a probability between 0 and 1 to every shot, representing the likelihood that an average player would score from that situation. A penalty kick typically has an xG of about 0.76. A header from 15 yards out after a cross might have an xG of 0.05. A one-on-one breakaway from 8 yards might be 0.40.
By summing xG values across all shots in a match, we get a team's expected goals total — a measure of how many goals they "deserved" based on chance quality. When a team's actual goals consistently exceed their xG, they are either exceptionally clinical or getting lucky. When they consistently underperform their xG, they are either wasteful or unlucky.
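The per-match aggregation described above is a one-line groupby. A minimal sketch with made-up shot values (the team names and numbers are illustrative, not real data):

```python
import pandas as pd

# Hypothetical shots from one match: shooting team, model xG, outcome
shots = pd.DataFrame({
    "team": ["France", "France", "France", "Croatia", "Croatia"],
    "xg":   [0.76, 0.08, 0.31, 0.12, 0.44],
    "goal": [1, 0, 1, 0, 1],
})

# Summing xG per team gives each side's expected goals total,
# alongside the goals they actually scored
totals = shots.groupby("team").agg(xg_total=("xg", "sum"), goals=("goal", "sum"))
print(totals)
```

Comparing the `xg_total` and `goals` columns per team is exactly the over/underperformance question discussed above.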
xG is valuable because:
- It separates chance creation from finishing quality
- It identifies teams and players whose results are likely to regress toward the mean
- It provides a fairer assessment of goalkeeper performance (saves from high-xG chances are more impressive)
- It powers more sophisticated metrics like xG chain, xG buildup, and expected threat (xT)
Prerequisites
Before we start, make sure you have:
- Python 3.8+ installed
- The following libraries: pandas, numpy, scikit-learn, matplotlib, seaborn
- Basic familiarity with Python and pandas
Install the required libraries if needed:
pip install pandas numpy scikit-learn matplotlib seaborn statsbombpy
Step 1: Getting the Data
We will use StatsBomb's free open data, which provides detailed event-level data for several competitions including the 2018 FIFA World Cup, select Champions League seasons, FA Women's Super League seasons, and more.
from statsbombpy import sb
import pandas as pd
import numpy as np
# List available free competitions
competitions = sb.competitions()
print(competitions[['competition_name', 'season_name', 'competition_id', 'season_id']])
# Load all matches from the 2018 World Cup
matches = sb.matches(competition_id=43, season_id=3)
print(f"Number of matches: {len(matches)}")
Now we collect all shot events across the tournament:
all_shots = []
for match_id in matches['match_id']:
    events = sb.events(match_id=match_id)
    shots = events[events['type'] == 'Shot'].copy()
    shots['match_id'] = match_id
    all_shots.append(shots)
shots_df = pd.concat(all_shots, ignore_index=True)
print(f"Total shots collected: {len(shots_df)}")
print(shots_df.columns.tolist())
Step 2: Understanding the Raw Data
StatsBomb provides rich detail for every shot. Let us examine what we have:
# Look at key columns
print(shots_df[['player', 'shot_statsbomb_xg', 'shot_outcome',
                'shot_body_part', 'shot_technique', 'shot_type',
                'location', 'shot_end_location']].head(10))
# Check shot outcomes
print(shots_df['shot_outcome'].value_counts())
The shot_outcome column tells us whether the shot resulted in a goal, was saved, blocked, went off target, hit the post, or was a wayward miss. The location column provides the [x, y] coordinates of the shot on the pitch.
StatsBomb uses a coordinate system where the pitch is 120 yards long (x-axis) and 80 yards wide (y-axis). The goal being attacked is at x = 120, centered at y = 40.
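As a quick sanity check on this coordinate system: the penalty spot sits 12 yards from the goal line, centered on goal, so it lands at roughly (108, 40) and should be exactly 12 yards from the goal center:

```python
import numpy as np

goal_x, goal_y = 120, 40   # centre of the goal being attacked
penalty_spot = (108, 40)   # 12 yards from the goal line, centred

dist = np.hypot(goal_x - penalty_spot[0], goal_y - penalty_spot[1])
print(dist)  # 12.0
```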
Step 3: Feature Engineering
Feature engineering is where your xG model is made or broken. The quality of your features determines the quality of your predictions. We will create features that capture the most important factors influencing whether a shot becomes a goal.
# Extract x and y coordinates from the location column
shots_df['x'] = shots_df['location'].apply(lambda loc: loc[0])
shots_df['y'] = shots_df['location'].apply(lambda loc: loc[1])
# --- Feature 1: Distance to goal ---
# Goal center is at (120, 40) in StatsBomb coordinates
goal_x, goal_y = 120, 40
shots_df['distance'] = np.sqrt(
    (shots_df['x'] - goal_x)**2 + (shots_df['y'] - goal_y)**2
)
# --- Feature 2: Angle to goal ---
# Calculate the angle in degrees between the shot location and the goal posts
# Goal posts are at (120, 36) and (120, 44) — 8 yards wide
goal_post_left = np.array([120, 36])
goal_post_right = np.array([120, 44])
def calculate_angle(x, y):
    """Calculate the angle subtended by the goal from the shot location."""
    dx_left = goal_post_left[0] - x
    dy_left = goal_post_left[1] - y
    dx_right = goal_post_right[0] - x
    dy_right = goal_post_right[1] - y
    angle_left = np.arctan2(dy_left, dx_left)
    angle_right = np.arctan2(dy_right, dx_right)
    angle = abs(angle_left - angle_right)
    return np.degrees(angle)

shots_df['angle'] = shots_df.apply(
    lambda row: calculate_angle(row['x'], row['y']), axis=1
)
# --- Feature 3: Body part (one-hot encoded) ---
shots_df['is_header'] = (shots_df['shot_body_part'] == 'Head').astype(int)
shots_df['is_right_foot'] = (shots_df['shot_body_part'] == 'Right Foot').astype(int)
shots_df['is_left_foot'] = (shots_df['shot_body_part'] == 'Left Foot').astype(int)
# --- Feature 4: Shot type ---
shots_df['is_open_play'] = (shots_df['shot_type'] == 'Open Play').astype(int)
shots_df['is_set_piece'] = shots_df['shot_type'].isin(
    ['Free Kick', 'Corner']
).astype(int)
shots_df['is_penalty'] = (shots_df['shot_type'] == 'Penalty').astype(int)
# --- Feature 5: Shot technique ---
shots_df['is_volley'] = (shots_df['shot_technique'] == 'Volley').astype(int)
shots_df['is_half_volley'] = (shots_df['shot_technique'] == 'Half Volley').astype(int)
# --- Target variable ---
shots_df['is_goal'] = (shots_df['shot_outcome'] == 'Goal').astype(int)
print(f"\nGoal rate: {shots_df['is_goal'].mean():.3f}")
print(f"Total shots: {len(shots_df)}, Goals: {shots_df['is_goal'].sum()}")
Why These Features Matter
- Distance: The farther from goal, the less likely a shot scores. This is the single most predictive feature in any xG model.
- Angle: Shots from tight angles (near the touchline) have a small visible goal to aim at. Central shots have more goal to work with.
- Body part: Headers are converted at a significantly lower rate than shots with the feet. This is one of the largest categorical effects.
- Shot type: Penalties have a fixed, high conversion rate. Free kicks and corners have their own characteristics.
- Technique: Volleys and half-volleys are harder to control, typically reducing conversion rates.
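You can verify claims like the body-part effect directly on your own shots_df with a groupby on conversion rate. Here is the pattern on a toy frame (the outcomes below are made up purely to show the mechanics):

```python
import pandas as pd

# Toy sample standing in for shots_df; in real data, headers convert
# at a noticeably lower rate than footed shots
sample = pd.DataFrame({
    "shot_body_part": ["Head", "Head", "Head", "Head",
                       "Right Foot", "Right Foot", "Right Foot", "Left Foot"],
    "is_goal":        [0, 0, 0, 1, 0, 1, 1, 0],
})

# Mean of a 0/1 column grouped by category = conversion rate per category
rates = sample.groupby("shot_body_part")["is_goal"].mean()
print(rates)
```

Running the same one-liner on the real shots_df (after Step 3) lets you check each claim in the list above against the data.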
Step 4: Building the Logistic Regression Model
Logistic regression is an excellent starting point for an xG model because it outputs calibrated probabilities, is interpretable, and performs surprisingly well on this type of binary classification problem.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Define feature columns
feature_cols = [
    'distance', 'angle',
    'is_header', 'is_right_foot', 'is_left_foot',
    'is_open_play', 'is_set_piece', 'is_penalty',
    'is_volley', 'is_half_volley'
]
# Remove rows with missing values in our features
model_data = shots_df[feature_cols + ['is_goal']].dropna()
X = model_data[feature_cols]
y = model_data['is_goal']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {len(X_train)} shots, {y_train.sum()} goals")
print(f"Test set: {len(X_test)} shots, {y_test.sum()} goals")
# Scale features (important for logistic regression convergence)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train the model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)
# Examine coefficients
coef_df = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': model.coef_[0]
}).sort_values('Coefficient', key=abs, ascending=False)
print("\nModel Coefficients:")
print(coef_df.to_string(index=False))
Interpreting the Coefficients
In logistic regression, coefficients indicate the direction and relative strength of each feature's influence on the log-odds of a goal:
- Distance should have a strong negative coefficient (farther = less likely to score)
- Angle should have a positive coefficient (wider angle = more goal visible = more likely to score)
- is_header should be negative (headers convert at lower rates)
- is_penalty should be strongly positive (penalties convert at high rates)
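Because the features were standardized, exponentiating a coefficient gives the multiplicative change in goal odds per one standard deviation of that feature. A sketch with hypothetical coefficient values (not the fitted ones, which depend on your data):

```python
import numpy as np
import pandas as pd

# Hypothetical standardized coefficients for illustration only
coef = pd.Series({"distance": -1.2, "angle": 0.8, "is_header": -0.4})

# exp(beta) = multiplicative change in the odds of a goal per one
# standard deviation increase in the (standardized) feature
odds_ratios = np.exp(coef)
print(odds_ratios.round(3))
```

A value of 0.30 for distance, say, would read as "each extra standard deviation of distance multiplies the goal odds by 0.30". The same transformation applies directly to `coef_df['Coefficient']` from the fitted model.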
Step 5: Evaluating Model Performance
A good xG model needs to be both discriminative (it ranks higher-quality chances above lower-quality ones) and calibrated (when it says a shot has 0.15 xG, approximately 15% of such shots should actually be goals).
from sklearn.metrics import brier_score_loss, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import seaborn as sns
# Generate predictions
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
# --- Metric 1: Brier Score ---
brier = brier_score_loss(y_test, y_pred_proba)
print(f"Brier Score: {brier:.4f}")
print("(Lower is better. Baseline 'always predict mean': "
f"{brier_score_loss(y_test, [y_test.mean()] * len(y_test)):.4f})")
# --- Metric 2: ROC AUC ---
auc = roc_auc_score(y_test, y_pred_proba)
print(f"\nROC AUC: {auc:.4f}")
print("(1.0 is perfect, 0.5 is random)")
# --- Metric 3: Log Loss ---
ll = log_loss(y_test, y_pred_proba)
print(f"\nLog Loss: {ll:.4f}")
Calibration Plot
The calibration plot is arguably the most important evaluation for an xG model. It shows whether your predicted probabilities match observed goal rates.
from sklearn.calibration import calibration_curve
# Create calibration plot
prob_true, prob_pred = calibration_curve(y_test, y_pred_proba, n_bins=10)
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
ax.plot(prob_pred, prob_true, marker='o', linewidth=2, label='Our xG Model')
ax.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Perfect Calibration')
ax.set_xlabel('Predicted xG', fontsize=12)
ax.set_ylabel('Observed Goal Rate', fontsize=12)
ax.set_title('xG Model Calibration Plot', fontsize=14)
ax.legend(fontsize=11)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
plt.tight_layout()
plt.savefig('xg_calibration_plot.png', dpi=150)
plt.show()
A well-calibrated model will have points that fall close to the diagonal line. If the points consistently fall above the line, your model underestimates goal probability. Below the line means overestimation.
Step 6: Visualizing Shot Maps with xG
One of the most compelling applications of xG is creating shot maps where each shot is sized and colored by its expected goals value.
# Generate xG for all shots
X_all_scaled = scaler.transform(model_data[feature_cols])
model_data = model_data.copy()
model_data['our_xg'] = model.predict_proba(X_all_scaled)[:, 1]
# Create the shot map
fig, ax = plt.subplots(figsize=(12, 8))
# Draw a simplified pitch (attacking half)
ax.set_xlim(60, 121)
ax.set_ylim(0, 80)
# Pitch markings
ax.plot([60, 120], [0, 0], color='black')
ax.plot([60, 120], [80, 80], color='black')
ax.plot([120, 120], [0, 80], color='black')
ax.plot([60, 60], [0, 80], color='black')
# Penalty area
ax.plot([102, 102], [18, 62], color='black')
ax.plot([102, 120], [18, 18], color='black')
ax.plot([102, 120], [62, 62], color='black')
# Six-yard box
ax.plot([114, 114], [30, 50], color='black')
ax.plot([114, 120], [30, 30], color='black')
ax.plot([114, 120], [50, 50], color='black')
# Goal
ax.plot([120, 120], [36, 44], color='red', linewidth=3)
# Plot shots
goals = model_data[model_data['is_goal'] == 1]
non_goals = model_data[model_data['is_goal'] == 0]
# Non-goals: open circles
ax.scatter(
    non_goals['x'], non_goals['y'],
    s=non_goals['our_xg'] * 500 + 20,
    c=non_goals['our_xg'],
    cmap='RdYlGn', alpha=0.5, edgecolors='gray',
    linewidth=0.5, vmin=0, vmax=0.8,
    label='No Goal'
)
# Goals: filled stars
ax.scatter(
    goals['x'], goals['y'],
    s=goals['our_xg'] * 500 + 50,
    c=goals['our_xg'],
    cmap='RdYlGn', marker='*', edgecolors='black',
    linewidth=0.8, vmin=0, vmax=0.8,
    label='Goal'
)
ax.set_title('Shot Map with xG Values', fontsize=14)
ax.set_xlabel('Pitch Length (yards)')
ax.set_ylabel('Pitch Width (yards)')
ax.legend(loc='upper left', fontsize=10)
# Add colorbar
sm = plt.cm.ScalarMappable(cmap='RdYlGn', norm=plt.Normalize(0, 0.8))
sm.set_array([])
cbar = plt.colorbar(sm, ax=ax, shrink=0.6)
cbar.set_label('xG Value', fontsize=11)
plt.tight_layout()
plt.savefig('xg_shot_map.png', dpi=150)
plt.show()
Step 7: Comparing Your Model to StatsBomb's xG
StatsBomb includes their own xG values in the data (shot_statsbomb_xg). Let us see how our simple model compares to their professional model.
# Merge our predictions back with StatsBomb's xG
comparison = shots_df[['shot_statsbomb_xg', 'is_goal']].copy()
comparison = comparison.dropna()
# Our model predictions for the same shots
X_compare = shots_df.loc[comparison.index, feature_cols].dropna()
comparison = comparison.loc[X_compare.index]
X_compare_scaled = scaler.transform(X_compare)
comparison['our_xg'] = model.predict_proba(X_compare_scaled)[:, 1]
# Compare Brier scores
brier_ours = brier_score_loss(comparison['is_goal'], comparison['our_xg'])
brier_sb = brier_score_loss(comparison['is_goal'], comparison['shot_statsbomb_xg'])
print(f"Our model Brier Score: {brier_ours:.4f}")
print(f"StatsBomb model Brier Score: {brier_sb:.4f}")
# Scatter plot: our xG vs StatsBomb xG
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(
    comparison['shot_statsbomb_xg'], comparison['our_xg'],
    alpha=0.3, s=20, c='steelblue'
)
ax.plot([0, 1], [0, 1], linestyle='--', color='gray')
ax.set_xlabel('StatsBomb xG', fontsize=12)
ax.set_ylabel('Our xG', fontsize=12)
ax.set_title('Our Model vs StatsBomb xG', fontsize=14)
plt.tight_layout()
plt.savefig('xg_comparison.png', dpi=150)
plt.show()
StatsBomb's model will almost certainly outperform ours because it uses many more features we have not included: defensive positioning of all players at the moment of the shot, goalkeeper position, whether the shot was a one-on-one, passage of play leading to the shot, and more. But the correlation between our simple model and theirs should be strong, demonstrating that distance and angle alone capture most of the signal.
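One way to quantify that agreement is a Pearson correlation between the two xG columns. Here is the pattern on synthetic values standing in for `comparison['shot_statsbomb_xg']` and `comparison['our_xg']` (the noise level is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for StatsBomb's xG values on 200 shots
sb_xg = rng.uniform(0.01, 0.6, size=200)
# Our model tracks theirs with some noise, clipped to valid probabilities
our_xg = np.clip(sb_xg + rng.normal(0, 0.05, size=200), 0, 1)

r = np.corrcoef(sb_xg, our_xg)[0, 1]
print(f"Pearson r: {r:.3f}")
```

On the real comparison frame, `comparison['shot_statsbomb_xg'].corr(comparison['our_xg'])` does the same job in one call.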
Step 8: Next Steps for Improving Your Model
Your basic logistic regression xG model is a solid foundation. Here are concrete next steps to improve it.
Add More Features
The biggest improvements will come from richer features:
- Number of defenders between shooter and goal (available in StatsBomb freeze-frame data)
- Goalkeeper position (also in freeze-frame data)
- Previous action (was the shot preceded by a cross, dribble, through ball, or a set piece?)
- Game state (goal difference at time of shot — players shoot differently when desperate)
- Speed of play (time elapsed since the team gained possession)
- Shot under pressure (whether a defender was applying pressure)
# Example: extracting whether the shot was preceded by a cross
# (you would need to look at the event preceding each shot)
shots_df['previous_action_cross'] = 0 # placeholder
# Logic to determine if previous event was a cross would go here
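Filling in that placeholder comes down to looking one event back. Note that in StatsBomb data a cross is not its own event type: it is a Pass event with `pass_cross == True`. A sketch of the shift-based pattern on a toy frame, assuming the events are sorted in match order by StatsBomb's `index` column:

```python
import pandas as pd

# Toy events frame in match order; in StatsBomb data a cross is a
# Pass event whose pass_cross flag is True
events = pd.DataFrame({
    "index": [1, 2, 3, 4],
    "type": ["Pass", "Shot", "Pass", "Shot"],
    "pass_cross": [False, None, True, None],
})
events = events.sort_values("index")

# Look one event back: was it a pass flagged as a cross?
events["prev_was_cross"] = (
    (events["type"].shift(1) == "Pass")
    & (events["pass_cross"].shift(1) == True)
).astype(int)
print(events.loc[events["type"] == "Shot", ["index", "prev_was_cross"]])
```

On real data you would apply this per match (e.g. inside a `groupby('match_id')`) so a shift never crosses a match boundary.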
Try More Powerful Algorithms
Once you have enough features, gradient boosting models typically outperform logistic regression for xG:
from sklearn.ensemble import GradientBoostingClassifier
gb_model = GradientBoostingClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    random_state=42
)
# Tree-based models do not require feature scaling, but reusing the
# scaled matrices keeps the comparison with logistic regression direct
gb_model.fit(X_train_scaled, y_train)
y_pred_gb = gb_model.predict_proba(X_test_scaled)[:, 1]
brier_gb = brier_score_loss(y_test, y_pred_gb)
print(f"Gradient Boosting Brier Score: {brier_gb:.4f}")
XGBoost and LightGBM are also excellent choices used by professional analytics providers.
Use More Data
The 2018 World Cup gives us on the order of 1,700 shots (the print statement in Step 1 shows the exact count) — still a small dataset for training an xG model. For a more robust model, combine multiple competitions from the StatsBomb open data. You can also explore Understat (understat.com), which provides shot-level data with xG for several top European leagues, or Wyscout for professional-grade data.
Add Post-Shot xG (PSxG)
Post-shot xG incorporates where the shot was aimed — a shot placed in the top corner has a higher probability of going in than one aimed at the keeper. This requires shot end-location data, which StatsBomb provides.
# Example: extract the end-location coordinates for placement features
# shot_end_location is [x, y] or [x, y, z]; the z (height) component is
# only present when the ball leaves the ground
shots_df['shot_end_y'] = shots_df['shot_end_location'].apply(
    lambda loc: loc[1] if isinstance(loc, list) and len(loc) >= 2 else np.nan
)
shots_df['shot_end_z'] = shots_df['shot_end_location'].apply(
    lambda loc: loc[2] if isinstance(loc, list) and len(loc) >= 3 else np.nan
)
Key Takeaways
Building an expected goals model in Python teaches you several important lessons:
1. Distance and angle are king. These two features alone explain the majority of variance in shot outcomes. Everything else is refinement.
2. Logistic regression is a strong baseline. You do not need deep learning or exotic algorithms to build a useful xG model. Start simple, and let the data tell you when you need more complexity.
3. Calibration matters more than accuracy. A model that says "0.15" and is right 15% of the time is more useful than one with slightly better classification accuracy but poorly calibrated probabilities.
4. Feature engineering is where value lives. The difference between amateur and professional xG models is not the algorithm — it is the features. Defensive positioning, goalkeeper location, and passage-of-play data are what separate a good model from a great one.
5. Open data makes this accessible to everyone. StatsBomb's open data initiative has democratized soccer analytics. You can build a real xG model with zero budget.
For a comprehensive treatment of soccer analytics — including xG, expected threat, passing models, possession value frameworks, and how professional clubs use these tools — see our complete guide in Professional Soccer Analytics.