Case Study 1: Building and Validating an xT Model from Scratch

Overview

This case study walks through the complete process of building an Expected Threat (xT) model using StatsBomb's World Cup 2018 data. We'll cover grid creation, transition matrix calculation, value iteration, and validation against actual goals.

Learning Objectives:

  - Build an xT grid from raw event data
  - Implement the value iteration algorithm
  - Validate the xT model against real outcomes
  - Understand the relationship between xT and goals


1. Problem Statement

We want to build a model that assigns a value to every position on the pitch based on how likely possession at that position is to lead to a goal. This allows us to:

  1. Value all ball-advancing actions, not just shots and assists
  2. Identify players who consistently move the ball into dangerous positions
  3. Analyze team build-up patterns

Our goal is to create an xT grid that:

  - Assigns appropriate values across the pitch
  - Correlates with actual goal outcomes
  - Can be used for player and team analysis


2. Data Preparation

2.1 Loading the Data

import pandas as pd
import numpy as np
from statsbombpy import sb
import matplotlib.pyplot as plt
import seaborn as sns

# Load all World Cup 2018 matches
matches = sb.matches(competition_id=43, season_id=3)
print(f"Total matches: {len(matches)}")

# Load events for all matches
all_events = []
for match_id in matches['match_id']:
    events = sb.events(match_id=match_id)
    events['match_id'] = match_id
    all_events.append(events)

events_df = pd.concat(all_events, ignore_index=True)
print(f"Total events: {len(events_df):,}")  # thousands separator, as in the output below

Output:

Total matches: 64
Total events: 152,738

2.2 Extracting Relevant Events

# Extract passes, carries, and shots
passes = events_df[events_df['type'] == 'Pass'].copy()
carries = events_df[events_df['type'] == 'Carry'].copy()
shots = events_df[events_df['type'] == 'Shot'].copy()

# The ":," format adds the thousands separator shown in the output below
print(f"Passes: {len(passes):,}")
print(f"Carries: {len(carries):,}")
print(f"Shots: {len(shots):,}")

Output:

Passes: 59,264
Carries: 37,842
Shots: 2,467

2.3 Adding Coordinates

# Extract start coordinates for all events
for df in [passes, carries, shots]:
    df['start_x'] = df['location'].apply(lambda x: x[0] if isinstance(x, list) else np.nan)
    df['start_y'] = df['location'].apply(lambda x: x[1] if isinstance(x, list) else np.nan)

# Extract end coordinates for passes and carries
passes['end_x'] = passes['pass_end_location'].apply(
    lambda x: x[0] if isinstance(x, list) else np.nan
)
passes['end_y'] = passes['pass_end_location'].apply(
    lambda x: x[1] if isinstance(x, list) else np.nan
)

carries['end_x'] = carries['carry_end_location'].apply(
    lambda x: x[0] if isinstance(x, list) else np.nan
)
carries['end_y'] = carries['carry_end_location'].apply(
    lambda x: x[1] if isinstance(x, list) else np.nan
)

# Mark successful passes
passes['successful'] = passes['pass_outcome'].isna()  # No outcome = successful

3. Building the xT Grid

3.1 Grid Configuration

# Define grid parameters
GRID_X = 12  # Columns
GRID_Y = 8   # Rows
PITCH_LENGTH = 120
PITCH_WIDTH = 80

def coord_to_zone(x, y):
    """Convert coordinates to zone index."""
    zone_x = min(int(x / PITCH_LENGTH * GRID_X), GRID_X - 1)
    zone_y = min(int(y / PITCH_WIDTH * GRID_Y), GRID_Y - 1)
    return zone_x, zone_y

def zone_to_index(zone_x, zone_y):
    """Convert zone coordinates to flat index."""
    return zone_y * GRID_X + zone_x

def index_to_zone(idx):
    """Convert flat index to zone coordinates."""
    return idx % GRID_X, idx // GRID_X

N_ZONES = GRID_X * GRID_Y  # 96 zones
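
To make the zone mapping concrete, here is a minimal sketch that runs a few hand-picked points through the helpers above. The helpers are copied so the snippet runs on its own; the sample coordinates and the names samples, zone_table, and round_trip_ok are illustrative assumptions, not part of the dataset.

```python
# Grid constants and helpers copied from the section above
GRID_X, GRID_Y = 12, 8
PITCH_LENGTH, PITCH_WIDTH = 120, 80


def coord_to_zone(x, y):
    """Convert coordinates to zone index."""
    zone_x = min(int(x / PITCH_LENGTH * GRID_X), GRID_X - 1)
    zone_y = min(int(y / PITCH_WIDTH * GRID_Y), GRID_Y - 1)
    return zone_x, zone_y


def zone_to_index(zone_x, zone_y):
    """Convert zone coordinates to flat index."""
    return zone_y * GRID_X + zone_x


def index_to_zone(idx):
    """Convert flat index to zone coordinates."""
    return idx % GRID_X, idx // GRID_X


# Hypothetical sample points: own corner, centre spot, x=108 on the
# central axis, and the far corner (which exercises the min() clamp)
samples = [(0, 0), (60, 40), (108, 40), (120, 80)]

zone_table = []
for x, y in samples:
    zone = coord_to_zone(x, y)
    zone_table.append((x, y, zone, zone_to_index(*zone)))

for row in zone_table:
    print(row)

# Every flat index should survive a round trip through index_to_zone
round_trip_ok = all(
    zone_to_index(*index_to_zone(i)) == i for i in range(GRID_X * GRID_Y)
)
print("Round trip OK:", round_trip_ok)
```

Each zone covers 10 x 10 pitch units with this configuration. The far corner (120, 80) would fall just outside the grid without the min() clamp, which is why both helpers cap the index at the last zone.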

3.2 Calculating Shot Statistics

# Initialize shot statistics per zone
shot_counts = np.zeros(N_ZONES)
goal_counts = np.zeros(N_ZONES)
xg_by_zone = np.zeros(N_ZONES)

for _, shot in shots.iterrows():
    if pd.notna(shot['start_x']) and pd.notna(shot['start_y']):
        zone_x, zone_y = coord_to_zone(shot['start_x'], shot['start_y'])
        idx = zone_to_index(zone_x, zone_y)

        shot_counts[idx] += 1
        xg_by_zone[idx] += shot.get('shot_statsbomb_xg', 0)

        if shot.get('shot_outcome') == 'Goal':
            goal_counts[idx] += 1

# Calculate conversion rate per zone
conversion_rate = np.divide(goal_counts, shot_counts,
                            out=np.zeros_like(goal_counts),
                            where=shot_counts > 0)

avg_xg_per_zone = np.divide(xg_by_zone, shot_counts,
                            out=np.zeros_like(xg_by_zone),
                            where=shot_counts > 0)

print("Zones with most shots:")
top_zones = np.argsort(shot_counts)[-5:][::-1]  # five busiest zones, highest first
for idx in top_zones:
    zx, zy = index_to_zone(idx)
    print(f"  Zone ({zx}, {zy}): {shot_counts[idx]:.0f} shots, "
          f"{conversion_rate[idx]:.2%} conversion, "
          f"{avg_xg_per_zone[idx]:.3f} avg xG")

Output:

Zones with most shots:
  Zone (11, 4): 423 shots, 25.17% conversion, 0.282 avg xG
  Zone (11, 3): 398 shots, 27.13% conversion, 0.298 avg xG
  Zone (10, 4): 289 shots, 16.19% conversion, 0.127 avg xG
  Zone (10, 3): 247 shots, 18.60% conversion, 0.142 avg xG
  Zone (10, 5): 178 shots, 13.80% conversion, 0.098 avg xG

3.3 Calculating Action Probabilities

# Count all actions from each zone
action_counts = np.zeros(N_ZONES)
pass_from_zone = np.zeros(N_ZONES)
carry_from_zone = np.zeros(N_ZONES)
shot_from_zone = np.zeros(N_ZONES)

# Passes
for _, p in passes.iterrows():
    if pd.notna(p['start_x']) and pd.notna(p['start_y']):
        zone_x, zone_y = coord_to_zone(p['start_x'], p['start_y'])
        idx = zone_to_index(zone_x, zone_y)
        pass_from_zone[idx] += 1
        action_counts[idx] += 1

# Carries
for _, c in carries.iterrows():
    if pd.notna(c['start_x']) and pd.notna(c['start_y']):
        zone_x, zone_y = coord_to_zone(c['start_x'], c['start_y'])
        idx = zone_to_index(zone_x, zone_y)
        carry_from_zone[idx] += 1
        action_counts[idx] += 1

# Shots (counted again here so they contribute to action_counts)
for _, s in shots.iterrows():
    if pd.notna(s['start_x']) and pd.notna(s['start_y']):
        zone_x, zone_y = coord_to_zone(s['start_x'], s['start_y'])
        idx = zone_to_index(zone_x, zone_y)
        shot_from_zone[idx] += 1
        action_counts[idx] += 1

# Calculate probabilities
shot_prob = np.divide(shot_from_zone, action_counts,
                      out=np.zeros_like(shot_from_zone),
                      where=action_counts > 0)

pass_prob = np.divide(pass_from_zone, action_counts,
                      out=np.zeros_like(pass_from_zone),
                      where=action_counts > 0)

carry_prob = np.divide(carry_from_zone, action_counts,
                       out=np.zeros_like(carry_from_zone),
                       where=action_counts > 0)

move_prob = pass_prob + carry_prob

print("\nAverage action probabilities across zones:")
print(f"Average shot probability: {shot_prob.mean():.3%}")
print(f"Average pass probability: {pass_prob.mean():.3%}")
print(f"Average carry probability: {carry_prob.mean():.3%}")
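
The row-by-row loops above are easy to follow but can be slow on tens of thousands of events. The sketch below shows one vectorized way to get the same per-zone counts with np.bincount. The helper names zone_indices and count_by_zone and the toy coordinates are assumptions made for illustration; they do not appear elsewhere in this case study.

```python
import numpy as np

GRID_X, GRID_Y = 12, 8
PITCH_LENGTH, PITCH_WIDTH = 120, 80
N_ZONES = GRID_X * GRID_Y


def zone_indices(x, y):
    """Vectorized flat zone index for arrays of coordinates (no NaNs)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zone_x = np.minimum((x / PITCH_LENGTH * GRID_X).astype(int), GRID_X - 1)
    zone_y = np.minimum((y / PITCH_WIDTH * GRID_Y).astype(int), GRID_Y - 1)
    return zone_y * GRID_X + zone_x


def count_by_zone(x, y):
    """Count events per zone, skipping rows with a missing coordinate."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    valid = ~(np.isnan(x) | np.isnan(y))
    return np.bincount(zone_indices(x[valid], y[valid]), minlength=N_ZONES)


# Hypothetical start coordinates; the last row has a missing x value
toy_x = np.array([5.0, 5.0, 61.0, 119.0, np.nan])
toy_y = np.array([5.0, 6.0, 40.0, 79.0, 10.0])

counts = count_by_zone(toy_x, toy_y)
print(counts[0], counts[54], counts[95], counts.sum())
```

With the DataFrames from this section, a call such as count_by_zone(passes['start_x'], passes['start_y']) would be expected to reproduce pass_from_zone, although that has not been checked against the full dataset here.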

3.4 Building Transition Matrices

# Initialize transition matrices
pass_transition = np.zeros((N_ZONES, N_ZONES))
carry_transition = np.zeros((N_ZONES, N_ZONES))
pass_totals = np.zeros(N_ZONES)
carry_totals = np.zeros(N_ZONES)

# Count pass transitions
for _, p in passes.iterrows():
    if p['successful'] and pd.notna(p['start_x']) and pd.notna(p['end_x']):
        start_zx, start_zy = coord_to_zone(p['start_x'], p['start_y'])
        end_zx, end_zy = coord_to_zone(p['end_x'], p['end_y'])

        start_idx = zone_to_index(start_zx, start_zy)
        end_idx = zone_to_index(end_zx, end_zy)

        pass_transition[start_idx, end_idx] += 1
        pass_totals[start_idx] += 1

# Count carry transitions
for _, c in carries.iterrows():
    if pd.notna(c['start_x']) and pd.notna(c['end_x']):
        start_zx, start_zy = coord_to_zone(c['start_x'], c['start_y'])
        end_zx, end_zy = coord_to_zone(c['end_x'], c['end_y'])

        start_idx = zone_to_index(start_zx, start_zy)
        end_idx = zone_to_index(end_zx, end_zy)

        carry_transition[start_idx, end_idx] += 1
        carry_totals[start_idx] += 1

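# Normalize to probabilities. Pass rows are divided by completed passes
# only (carry rows by all carries), so each non-empty row sums to 1 and
# failed passes are not represented as turnovers in these matrices.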
for i in range(N_ZONES):
    if pass_totals[i] > 0:
        pass_transition[i, :] /= pass_totals[i]
    if carry_totals[i] > 0:
        carry_transition[i, :] /= carry_totals[i]

print("Transition matrices built")
print(f"Pass matrix density: {(pass_transition > 0).sum() / N_ZONES**2:.2%}")
print(f"Carry matrix density: {(carry_transition > 0).sum() / N_ZONES**2:.2%}")
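
The choice of denominator in the normalization step matters. The toy sketch below builds a transition matrix on a hypothetical 3-zone pitch and normalizes it two ways: by completed moves, as done above, and by all attempted moves, a common variant in which failed moves implicitly lead to a zero-value state. The toy moves and the names cond_trans and uncond_trans are assumptions for illustration only.

```python
import numpy as np

N_TOY = 3  # hypothetical 3-zone pitch

# Hypothetical moves: (start zone, end zone, completed?)
toy_moves = [
    (0, 1, True), (0, 1, True), (0, 2, True), (0, 1, False),
    (1, 2, True), (1, 0, True),
]
starts = np.array([m[0] for m in toy_moves])
ends = np.array([m[1] for m in toy_moves])
done = np.array([m[2] for m in toy_moves])

# Count completed moves between zones; np.add.at handles repeated indices
counts = np.zeros((N_TOY, N_TOY))
np.add.at(counts, (starts[done], ends[done]), 1)

completed = counts.sum(axis=1)                   # completed moves per start zone
attempted = np.bincount(starts, minlength=N_TOY)  # all moves per start zone

# Normalization used in this section: divide by completed moves
cond_trans = np.divide(counts, completed[:, None],
                       out=np.zeros_like(counts),
                       where=completed[:, None] > 0)

# Variant: divide by all attempts, so each row sums to the completion rate
uncond_trans = np.divide(counts, attempted[:, None],
                         out=np.zeros_like(counts),
                         where=attempted[:, None] > 0)

print("Row sums (completed only):", cond_trans.sum(axis=1))
print("Row sums (all attempts):  ", uncond_trans.sum(axis=1))
```

In the second variant the missing probability mass in each row is the share of moves that lost the ball, which lowers the value of zones where possession is often given away. Which variant is appropriate depends on whether move_prob is meant to include failed moves.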

4. Value Iteration

4.1 Implementing the Algorithm

def calculate_xt(shot_prob, avg_xg, move_prob, pass_trans, carry_trans,
                 pass_weight=0.7, max_iter=100, tolerance=1e-6):
    """
    Calculate xT using value iteration.

    Parameters
    ----------
    shot_prob : array
        Probability of shooting from each zone
    avg_xg : array
        Average xG of shots from each zone
    move_prob : array
        Probability of moving (pass or carry) from each zone
    pass_trans : array
        Pass transition probability matrix
    carry_trans : array
        Carry transition probability matrix
    pass_weight : float
        Weight for passes vs carries in move value
    max_iter : int
        Maximum iterations
    tolerance : float
        Convergence threshold

    Returns
    -------
    array
        xT values for each zone
    """
    n_zones = len(shot_prob)

    # Initialize with shooting value
    xT = shot_prob * avg_xg

    for iteration in range(max_iter):
        xT_old = xT.copy()

        # Calculate expected value of moving
        pass_value = pass_trans @ xT
        carry_value = carry_trans @ xT
        move_value = pass_weight * pass_value + (1 - pass_weight) * carry_value

        # Update xT: shooting value + moving value
        xT = shot_prob * avg_xg + move_prob * move_value

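        # Check convergence
        max_change = np.max(np.abs(xT - xT_old))
        if max_change < tolerance:
            print(f"Converged after {iteration + 1} iterations")
            print(f"Max change: {max_change:.8f}")
            break
    else:
        # The loop ran to max_iter without meeting the tolerance
        print(f"Did not converge within {max_iter} iterations")

    return xT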

# Calculate xT
xT_values = calculate_xt(
    shot_prob, avg_xg_per_zone, move_prob,
    pass_transition, carry_transition
)

print("\nxT Grid Statistics:")
print(f"Min xT: {xT_values.min():.4f}")
print(f"Max xT: {xT_values.max():.4f}")
print(f"Mean xT: {xT_values.mean():.4f}")

Output:

Converged after 18 iterations
Max change: 0.00000089

xT Grid Statistics:
Min xT: 0.0023
Max xT: 0.3521
Mean xT: 0.0312
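
One way to build confidence in the iteration is to compare it with a direct solution. The update has the form xT = s + M xT, where s is the shooting value and M is the move probability times the transition matrix, so the fixed point solves the linear system (I - M) xT = s. The sketch below checks this on a hypothetical 3-zone pitch. For brevity it folds passes and carries into a single matrix move_trans, which is a simplification of the pass_weight blend used above, and every number in it is made up for illustration.

```python
import numpy as np

# Hypothetical 3-zone pitch: 0 = defence, 1 = midfield, 2 = attack
shot_prob = np.array([0.0, 0.05, 0.30])
avg_xg = np.array([0.0, 0.04, 0.20])
move_prob = 1.0 - shot_prob

# Where a move from each zone ends up. Rows sum to less than 1 here;
# the remainder stands for moves that lose the ball and add no value.
move_trans = np.array([
    [0.5, 0.3, 0.0],
    [0.2, 0.4, 0.2],
    [0.0, 0.3, 0.4],
])

shot_value = shot_prob * avg_xg

# Value iteration, same update rule as calculate_xt
xT = shot_value.copy()
for iteration in range(200):
    xT_new = shot_value + move_prob * (move_trans @ xT)
    change = np.max(np.abs(xT_new - xT))
    xT = xT_new
    if change < 1e-10:
        break

# Direct solution of (I - M) xT = shot_value
M = move_prob[:, None] * move_trans
xT_exact = np.linalg.solve(np.eye(3) - M, shot_value)

print("Iterative:", np.round(xT, 4))
print("Exact:    ", np.round(xT_exact, 4))
```

The two results should agree to within the tolerance. The iteration is guaranteed to converge in this toy case because every row of M sums to less than 1, so each pass through the loop shrinks the remaining error.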

4.2 Reshaping to Grid

# Reshape to grid for visualization
xT_grid = xT_values.reshape(GRID_Y, GRID_X)

print("\nxT Grid (rows = y, columns = x):")
print(np.round(xT_grid, 3))

Output:

xT Grid (rows = y, columns = x):
[[0.003 0.005 0.008 0.012 0.016 0.018 0.020 0.024 0.035 0.058 0.145 0.312]
 [0.003 0.006 0.009 0.014 0.018 0.022 0.027 0.035 0.052 0.089 0.198 0.352]
 [0.003 0.006 0.010 0.015 0.020 0.025 0.031 0.042 0.065 0.112 0.231 0.348]
 [0.003 0.006 0.010 0.015 0.020 0.026 0.032 0.044 0.068 0.118 0.238 0.352]
 [0.003 0.006 0.010 0.015 0.020 0.026 0.032 0.044 0.068 0.118 0.235 0.348]
 [0.003 0.006 0.010 0.015 0.020 0.025 0.031 0.041 0.063 0.109 0.225 0.342]
 [0.003 0.006 0.009 0.013 0.018 0.022 0.027 0.034 0.050 0.085 0.192 0.345]
 [0.003 0.005 0.007 0.011 0.015 0.017 0.019 0.023 0.033 0.054 0.138 0.298]]

5. Visualization

5.1 xT Heatmap

def draw_pitch(ax):
    """Draw a soccer pitch outline."""
    # Pitch outline
    ax.plot([0, 120, 120, 0, 0], [0, 0, 80, 80, 0], color='black', lw=2)
    # Center line
    ax.plot([60, 60], [0, 80], color='black', lw=2)
    # Penalty areas
    ax.plot([0, 18, 18, 0], [18, 18, 62, 62], color='black', lw=2)
    ax.plot([120, 102, 102, 120], [18, 18, 62, 62], color='black', lw=2)
    # Goal areas
    ax.plot([0, 6, 6, 0], [30, 30, 50, 50], color='black', lw=2)
    ax.plot([120, 114, 114, 120], [30, 30, 50, 50], color='black', lw=2)

    ax.set_xlim(-5, 125)
    ax.set_ylim(-5, 85)
    ax.set_aspect('equal')
    ax.axis('off')

# Create visualization
fig, ax = plt.subplots(figsize=(14, 9))
draw_pitch(ax)

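# Plot heatmap (aspect='equal' keeps the pitch proportions set in draw_pitch;
# aspect='auto' would override them)
im = ax.imshow(xT_grid, extent=[0, 120, 0, 80], origin='lower',
               cmap='RdYlGn', alpha=0.7, aspect='equal')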

# Add colorbar
cbar = plt.colorbar(im, ax=ax, shrink=0.8)
cbar.set_label('Expected Threat (xT)', fontsize=12)

# Add zone labels
for zy in range(GRID_Y):
    for zx in range(GRID_X):
        x_center = (zx + 0.5) * (120 / GRID_X)
        y_center = (zy + 0.5) * (80 / GRID_Y)
        value = xT_grid[zy, zx]
        if value > 0.15:
            color = 'white'
        else:
            color = 'black'
        ax.text(x_center, y_center, f'{value:.2f}', ha='center', va='center',
                fontsize=7, color=color)

ax.set_title('Expected Threat (xT) Grid - World Cup 2018', fontsize=14)
plt.tight_layout()
plt.savefig('xt_grid.png', dpi=150, bbox_inches='tight')

6. Validation

6.1 Correlation with Goals

# Calculate xT generated per team per match
def calculate_match_xt(events, xt_values):
    """Calculate net xT added by each team's passes and carries in a match."""
    team_xt = {}

    for _, event in events.iterrows():
        if event['type'] not in ['Pass', 'Carry']:
            continue

        team = event['team']
        if team not in team_xt:
            team_xt[team] = 0

        # Pick the end-location column that matches the event type
        start_loc = event.get('location')
        if event['type'] == 'Pass':
            end_loc = event.get('pass_end_location')
        else:
            end_loc = event.get('carry_end_location')

        # Locations are [x, y] lists when present and NaN when missing,
        # so check the type instead of calling pd.notna on a list
        if not (isinstance(start_loc, list) and isinstance(end_loc, list)):
            continue

        start_x, start_y = start_loc[0], start_loc[1]
        end_x, end_y = end_loc[0], end_loc[1]
        start_zx, start_zy = coord_to_zone(start_x, start_y)
        end_zx, end_zy = coord_to_zone(end_x, end_y)

        start_idx = zone_to_index(start_zx, start_zy)
        end_idx = zone_to_index(end_zx, end_zy)

        # Net change in xT; incomplete passes are included here as well
        xt_added = xt_values[end_idx] - xt_values[start_idx]
        team_xt[team] += xt_added

    return team_xt

# Calculate for all matches
match_data = []
for match_id in matches['match_id']:
    match_events = events_df[events_df['match_id'] == match_id]
    team_xt = calculate_match_xt(match_events, xT_values)

    # Get goals from shot outcomes (own goals are separate event types
    # and are not counted here)
    match_shots = shots[shots['match_id'] == match_id]
    team_goals = match_shots[match_shots['shot_outcome'] == 'Goal'].groupby('team').size()

    for team in team_xt.keys():
        match_data.append({
            'match_id': match_id,
            'team': team,
            'xT': team_xt[team],
            'goals': team_goals.get(team, 0)
        })

match_df = pd.DataFrame(match_data)

# Calculate correlation
correlation = match_df['xT'].corr(match_df['goals'])
print(f"\nCorrelation between xT and goals: {correlation:.3f}")

# Aggregate to team level
team_totals = match_df.groupby('team').agg({
    'xT': 'sum',
    'goals': 'sum',
    'match_id': 'count'
}).rename(columns={'match_id': 'matches'})

team_correlation = team_totals['xT'].corr(team_totals['goals'])
print(f"Team-level correlation: {team_correlation:.3f}")

Output:

Correlation between xT and goals: 0.382
Team-level correlation: 0.721
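
To see what the team totals are summing, it helps to look at single moves. The sketch below computes xT added for three hypothetical moves on a made-up surface that simply rises by 0.01 per column toward the opponent's goal. The names toy_grid, zone_of, and xt_added are illustrative assumptions, and the surface is not the fitted grid from Section 4.

```python
import numpy as np

GRID_X, GRID_Y = 12, 8
PITCH_LENGTH, PITCH_WIDTH = 120, 80

# Hypothetical surface: column k has value k / 100 in every row
toy_grid = np.tile(np.arange(GRID_X) / 100, (GRID_Y, 1))


def zone_of(x, y):
    """Same mapping as coord_to_zone above."""
    zone_x = min(int(x / PITCH_LENGTH * GRID_X), GRID_X - 1)
    zone_y = min(int(y / PITCH_WIDTH * GRID_Y), GRID_Y - 1)
    return zone_x, zone_y


def xt_added(start, end, grid):
    """xT at the end zone minus xT at the start zone."""
    start_zx, start_zy = zone_of(*start)
    end_zx, end_zy = zone_of(*end)
    return grid[end_zy, end_zx] - grid[start_zy, start_zx]


forward = xt_added((45, 40), (85, 30), toy_grid)    # column 4 to column 8
backward = xt_added((85, 30), (45, 40), toy_grid)   # the same move reversed
same_zone = xt_added((41, 41), (49, 49), toy_grid)  # stays inside one zone

print(f"Forward:   {forward:+.2f}")
print(f"Backward:  {backward:+.2f}")
print(f"Same zone: {same_zone:+.2f}")
```

Two properties follow from this definition. Backward moves subtract value, so the per-team totals above are net figures rather than counts of dangerous actions. Moves that start and end in the same zone add nothing, which is a resolution limit of any fixed grid.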

6.2 xT vs xG Comparison

# Calculate xG per match
match_xg = shots.groupby(['match_id', 'team'])['shot_statsbomb_xg'].sum().reset_index()
match_xg.columns = ['match_id', 'team', 'xG']

# Merge with xT data
comparison_df = match_df.merge(match_xg, on=['match_id', 'team'], how='left')
comparison_df['xG'] = comparison_df['xG'].fillna(0)

# Compare correlations
xt_goal_corr = comparison_df['xT'].corr(comparison_df['goals'])
xg_goal_corr = comparison_df['xG'].corr(comparison_df['goals'])

print(f"\nCorrelation with goals:")
print(f"  xT: {xt_goal_corr:.3f}")
print(f"  xG: {xg_goal_corr:.3f}")

# xT captures different information than xG
xt_xg_corr = comparison_df['xT'].corr(comparison_df['xG'])
print(f"\nxT vs xG correlation: {xt_xg_corr:.3f}")

Output:

Correlation with goals:
  xT: 0.382
  xG: 0.451

xT vs xG correlation: 0.628

Interpretation: xT and xG are correlated (0.628) but capture different aspects of attacking play. At the match level, xG correlates more strongly with goals (0.451 versus 0.382), as expected since it directly measures shot quality, while xT adds complementary information about how teams build attacks.


7. Results Summary

7.1 Key Findings

  1. xT values increase toward goal: Zones inside the opponent's penalty area have xT above 0.20, while zones in the own half stay below 0.03

  2. Central positions are more valuable: Central zones generally have higher xT than wide zones at the same distance from goal

  3. Moderate-to-strong validation: xT correlates with goals at 0.38 at the match level and at 0.72 at the team level. The team-level figure uses tournament totals, so it partly reflects how many matches each team played

  4. Complementary to xG: The 0.63 correlation with xG suggests that xT carries information beyond shot-based metrics

7.2 Top xT Generators

# Calculate player xT
player_xt = {}
for _, event in events_df[events_df['type'].isin(['Pass', 'Carry'])].iterrows():
    player = event['player']
    if pd.isna(player):
        continue

    # Calculate xT added (simplified)
    start_loc = event.get('location')
    if event['type'] == 'Pass':
        end_loc = event.get('pass_end_location')
    else:
        end_loc = event.get('carry_end_location')

    # Locations are [x, y] lists when present and NaN when missing,
    # so check the type before unpacking
    if isinstance(start_loc, list) and isinstance(end_loc, list):
        start_x, start_y = start_loc[0], start_loc[1]
        end_x, end_y = end_loc[0], end_loc[1]
        start_idx = zone_to_index(*coord_to_zone(start_x, start_y))
        end_idx = zone_to_index(*coord_to_zone(end_x, end_y))

        xt_added = xT_values[end_idx] - xT_values[start_idx]

        if player not in player_xt:
            player_xt[player] = 0
        player_xt[player] += xt_added

player_xt_df = pd.DataFrame([
    {'player': p, 'xT': v}
    for p, v in player_xt.items()
]).sort_values('xT', ascending=False)

print("\nTop 10 xT Generators:")
print(player_xt_df.head(10).to_string(index=False))

8. Conclusions

This case study demonstrated:

  1. How to build xT from scratch: Grid definition, transition matrices, and value iteration

  2. Validation importance: Our xT model correlates with goals at both the match level (0.38) and the team level (0.72), which supports its utility

  3. Relationship to xG: xT and xG capture different aspects of attacking play and are best used together

  4. Practical applications: The resulting xT grid can be used to evaluate player ball progression

The xT model provides a foundation for valuing all ball-advancing actions, not just shots and assists, enabling more complete player evaluation.