Chapter 4: Python Tools for Soccer Analytics
Learning Objectives
By the end of this chapter, you will be able to:
- Configure a professional Python environment for soccer analytics
- Use pandas for efficient data manipulation of match and player data
- Apply NumPy for numerical computations in analytics workflows
- Create publication-quality visualizations with matplotlib and seaborn
- Build reusable analysis functions and classes
- Work with JSON and CSV match data from common providers
- Apply error handling and debugging techniques to analytics code
- Implement best practices for reproducible analytics code
4.1 Introduction
Soccer analytics requires processing large volumes of data---millions of events, thousands of players, hundreds of matches. Python has emerged as the dominant language for this work, offering a powerful ecosystem of libraries specifically designed for data analysis, statistical modeling, and visualization.
This chapter provides a comprehensive guide to the Python tools you'll use throughout this textbook and your analytics career. Rather than a general Python tutorial, we focus specifically on patterns and techniques relevant to soccer data. Every code example uses soccer data, and every design pattern addresses a challenge that soccer analysts face in practice.
Why Python for Soccer Analytics?
Python dominates professional sports analytics for several reasons:
- Rich ecosystem: pandas, NumPy, scikit-learn, and visualization libraries form a complete analytics toolkit
- Community support: Active communities have built soccer-specific tools like statsbombpy, mplsoccer, and socceraction
- Industry adoption: Most professional clubs and analytics companies use Python
- Accessibility: Clear syntax makes code readable and maintainable
- Integration: Easy connection to databases, APIs, and web services
Intuition: While R remains popular in academic sports research, Python has become the dominant language in professional club analytics departments. Learning Python for soccer analytics gives you skills that transfer directly to industry roles. Nearly every major data provider (StatsBomb, Opta, Wyscout) offers Python SDKs or APIs, and the vast majority of job listings for soccer analyst positions list Python as a required skill.
Chapter Overview
We'll cover four core areas:
- Environment Setup: Configuring a professional development environment
- Data Manipulation: pandas for wrangling soccer data
- Numerical Computing: NumPy for efficient calculations
- Visualization: matplotlib and seaborn for soccer graphics
Each section progresses from fundamental concepts to soccer-specific patterns. By the end, you will have a working toolkit sufficient to tackle the analyses in every subsequent chapter of this book.
4.2 Environment Setup
A well-configured environment prevents countless headaches later. This section establishes professional practices from the start.
4.2.1 Python Installation and Virtual Environments
Recommended Setup:
# Install Python 3.10+ (via python.org, Anaconda, or pyenv)
# Create a virtual environment for soccer analytics
python -m venv soccer-analytics-env
# Activate (Windows)
soccer-analytics-env\Scripts\activate
# Activate (macOS/Linux)
source soccer-analytics-env/bin/activate
# Install core packages
pip install pandas numpy matplotlib seaborn scipy scikit-learn
pip install statsbombpy mplsoccer jupyter
Why Virtual Environments?
Each project should have its own isolated environment to:
- Avoid version conflicts between projects
- Ensure reproducibility
- Make deployment easier
- Allow clean dependency tracking
Best Practice: When starting a new soccer analytics project, always create a fresh virtual environment and install packages incrementally as you need them. Then run pip freeze > requirements.txt to capture your exact dependencies. This small discipline saves enormous headaches when sharing projects with colleagues or deploying analyses to production servers at a club. A colleague should be able to run pip install -r requirements.txt and reproduce your entire environment.
4.2.2 Project Structure
Organize your analytics projects consistently:
soccer-project/
├── data/
│ ├── raw/ # Original, immutable data
│ ├── processed/ # Cleaned, transformed data
│ └── external/ # Data from external sources
├── notebooks/ # Jupyter notebooks for exploration
├── src/
│ ├── __init__.py
│ ├── data/ # Data loading and processing
│ ├── features/ # Feature engineering
│ ├── models/ # Statistical models
│ └── visualization/ # Plotting functions
├── tests/ # Unit tests
├── outputs/
│ ├── figures/ # Generated visualizations
│ └── reports/ # Analysis reports
├── requirements.txt # Dependencies
├── README.md
└── config.py # Configuration settings
The key principle is separation of concerns: raw data is kept separate from processed data, source code is separate from notebooks, and outputs are separate from inputs. The data/raw/ directory should be treated as immutable---never modify original data files. Instead, write processing scripts that read from raw/ and write to processed/.
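To make this concrete, here is a minimal sketch of a processing script that follows the convention; the file names and cleaning steps are hypothetical placeholders for your own project.
"""Example: src/data/clean_matches.py --- hypothetical processing script."""
from pathlib import Path
import pandas as pd

RAW_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")

def main() -> None:
    # Read only from data/raw/ --- the original file is never modified
    matches = pd.read_csv(RAW_DIR / "matches_raw.csv", parse_dates=["match_date"])
    # Illustrative cleaning steps
    matches = matches.drop_duplicates(subset="match_id")
    matches = matches.dropna(subset=["home_team", "away_team"])
    # Write the cleaned result to data/processed/, never back to raw/
    PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
    matches.to_csv(PROCESSED_DIR / "matches_clean.csv", index=False)

if __name__ == "__main__":
    main()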
4.2.3 Configuration Management
Create a config.py for project settings:
"""Project configuration settings."""
from pathlib import Path
# Paths
PROJECT_ROOT = Path(__file__).parent
DATA_DIR = PROJECT_ROOT / "data"
RAW_DATA_DIR = DATA_DIR / "raw"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
OUTPUT_DIR = PROJECT_ROOT / "outputs"
FIGURES_DIR = OUTPUT_DIR / "figures"
# Data settings
STATSBOMB_COMPETITION_ID = 43 # World Cup
STATSBOMB_SEASON_ID = 3 # 2018
# Visualization settings
FIGURE_DPI = 150
FIGURE_FORMAT = "png"
# Soccer pitch dimensions (StatsBomb)
PITCH_LENGTH = 120
PITCH_WIDTH = 80
# Ensure directories exist
for dir_path in [RAW_DATA_DIR, PROCESSED_DATA_DIR, FIGURES_DIR]:
dir_path.mkdir(parents=True, exist_ok=True)
Centralizing configuration in a single file means that when you switch from analyzing the 2018 World Cup to the 2022-23 Premier League, you change one file rather than hunting through dozens of scripts for hard-coded values. Every script in the project imports from config.py:
import pandas as pd
import matplotlib.pyplot as plt
from config import RAW_DATA_DIR, FIGURE_DPI
# Now use these constants throughout your analysis
data = pd.read_csv(RAW_DATA_DIR / "matches.csv")
plt.savefig("my_plot.png", dpi=FIGURE_DPI)
4.2.4 Jupyter Notebooks Best Practices
Notebooks are excellent for exploration but can become messy. Follow these guidelines:
Good Practices:
- Use clear, descriptive cell headers with Markdown
- Keep cells focused on single tasks
- Move reusable code to .py modules as soon as it stabilizes
- Restart kernel and run all before sharing
- Clear output before committing to version control
Anti-Patterns to Avoid:
- Running cells out of order (creates hidden state bugs)
- Putting all code in one massive notebook
- Leaving commented-out experimental code everywhere
- Defining the same function in multiple notebooks
Example Notebook Structure:
# Cell 1: Imports and setup
"""
# Match Analysis Notebook
Analyzing passing patterns in World Cup 2018 matches.
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsbombpy import sb
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
# Cell 2: Load data
matches = sb.matches(competition_id=43, season_id=3)
print(f"Loaded {len(matches)} matches")
# Cell 3: Analysis
# ... focused analysis code ...
# Cell 4: Visualization
# ... plotting code ...
Common Pitfall: Notebooks that work on your machine may fail on a colleague's because of hidden state---cells run in a different order, variables defined in deleted cells, or reliance on global variables. Before sharing a notebook, always do "Restart Kernel and Run All" to verify it executes cleanly from top to bottom.
4.3 Pandas for Soccer Data
pandas is the cornerstone of soccer analytics. This section covers essential operations with soccer-specific examples.
4.3.1 DataFrames and Series
A DataFrame is a two-dimensional labeled data structure---think of it as a spreadsheet or SQL table in Python. A Series is a single column of a DataFrame.
Creating DataFrames from Match Data:
import pandas as pd
# From a list of dictionaries (common format when receiving data from APIs)
match_data = [
{'match_id': 1, 'home_team': 'France', 'away_team': 'Croatia',
'home_goals': 4, 'away_goals': 2, 'home_xg': 2.35, 'away_xg': 1.78},
{'match_id': 2, 'home_team': 'Belgium', 'away_team': 'England',
'home_goals': 2, 'away_goals': 0, 'home_xg': 1.65, 'away_xg': 0.92},
]
# pd.DataFrame() converts the list of dictionaries into a tabular structure.
# Each dictionary becomes a row; keys become column names.
df = pd.DataFrame(match_data)
print(df)
Key DataFrame Attributes:
# Shape tells you (number_of_rows, number_of_columns)
print(f"Shape: {df.shape}")
# columns lists all column names as an Index object
print(f"Columns: {df.columns.tolist()}")
# dtypes shows the data type of each column
# Watch for 'object' type --- it often means strings or mixed types
print(f"Data types:\n{df.dtypes}")
# describe() computes summary statistics for all numeric columns
print(df.describe())
# memory_usage(deep=True) shows actual memory consumption
# deep=True is needed for accurate measurement of string columns
print(f"Memory: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")
4.3.2 Loading Soccer Data
From CSV Files:
# Load match data with explicit data types and date parsing.
# Specifying dtypes upfront prevents pandas from guessing incorrectly
# and saves memory for large files.
matches = pd.read_csv(
'data/matches.csv',
parse_dates=['match_date'], # Convert string dates to datetime objects
dtype={'match_id': 'int64', 'home_team': 'category'} # category saves memory
)
# Load event data (large files --- be selective about columns)
# usecols avoids loading columns you do not need, saving memory and time.
events = pd.read_csv(
'data/events.csv',
usecols=['event_id', 'match_id', 'type', 'player', 'team'],
dtype={'match_id': 'int32'} # int32 uses half the memory of int64
)
From StatsBomb API:
from statsbombpy import sb
# List available competitions
competitions = sb.competitions()
print(competitions[['competition_id', 'competition_name', 'season_name']])
# Load World Cup 2018 matches
matches = sb.matches(competition_id=43, season_id=3)
# Load events for a specific match
events = sb.events(match_id=7298) # World Cup Final
# Convert to more efficient dtypes after loading
events['minute'] = events['minute'].astype('int16')
events['second'] = events['second'].fillna(0).astype('int16')
Common Pitfall: When loading event data from StatsBomb or other providers, always inspect the data types and handle missing values before performing analysis. Many columns contain nested structures (lists, dictionaries) that pandas stores as Python objects. Extracting coordinates from the location column, for example, requires explicit parsing. Failing to handle NaN values in pass outcomes or shot details will cause silent errors in aggregation calculations.
Working with JSON Match Data:
Many data providers deliver match data in JSON format. Understanding how to parse nested JSON into flat DataFrames is an essential skill.
import json
def load_json_events(filepath: str) -> pd.DataFrame:
"""
Load event data from a JSON file and flatten nested structures.
Parameters
----------
filepath : str
Path to the JSON events file.
Returns
-------
pd.DataFrame
Flattened event data with one row per event.
"""
with open(filepath, 'r', encoding='utf-8') as f:
raw_events = json.load(f)
# pd.json_normalize flattens nested dictionaries into columns.
# For example, {'shot': {'xg': 0.35}} becomes a column 'shot.xg'.
df = pd.json_normalize(raw_events, sep='_')
return df
# Usage
events = load_json_events('data/raw/events_7298.json')
print(f"Loaded {len(events)} events with {len(events.columns)} columns")
print(events.columns[:20].tolist()) # Inspect first 20 column names
Best Practice: When working with JSON data, use pd.json_normalize() instead of manually parsing dictionaries. It handles nested structures automatically, creating column names from the nested keys separated by a delimiter. For deeply nested data (e.g., StatsBomb's freeze frame data), you may need multiple normalization passes.
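As an illustration of a second pass, json_normalize can expand a nested list of records into one row per record. The structure below is a simplified, hypothetical freeze-frame layout rather than an exact provider schema.
# One shot carrying a nested list of player positions (illustrative structure)
raw_shot = {
    'id': 'shot-1',
    'shot': {
        'statsbomb_xg': 0.12,
        'freeze_frame': [
            {'location': [115, 38], 'teammate': False},
            {'location': [110, 42], 'teammate': True},
        ]
    }
}
# record_path walks into the nested list; meta keeps identifying columns
frame = pd.json_normalize(
    raw_shot, record_path=['shot', 'freeze_frame'], meta=['id'], sep='_'
)
print(frame)  # one row per player in the freeze frame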
4.3.3 Selecting and Filtering Data
Column Selection:
# Single column (returns Series --- a one-dimensional labeled array)
goals = df['home_goals']
# Multiple columns (returns DataFrame --- preserves two-dimensional structure)
scores = df[['home_goals', 'away_goals']]
# Using loc for both rows and columns
# loc uses label-based indexing: loc[row_labels, column_labels]
subset = df.loc[:, ['match_id', 'home_team', 'home_goals']]
Row Filtering (Boolean Indexing):
# Simple condition: the expression inside [] creates a boolean Series
high_scoring = df[df['home_goals'] + df['away_goals'] >= 4]
# Multiple conditions: use & for AND, | for OR
# IMPORTANT: each condition must be wrapped in parentheses
france_wins = df[(df['home_team'] == 'France') & (df['home_goals'] > df['away_goals'])]
# Using query (cleaner syntax for complex conditions)
# Variables from the local scope can be referenced with @
result = df.query('home_goals > away_goals and home_xg < away_xg')
# Filter events by type
passes = events[events['type'] == 'Pass']
shots = events.query("type == 'Shot'")
Practical Example: Finding Specific Events
def get_player_shots(events: pd.DataFrame, player_name: str) -> pd.DataFrame:
"""
Get all shots by a specific player.
Parameters
----------
events : pd.DataFrame
Event data with 'type' and 'player' columns
player_name : str
Name of the player to filter for
Returns
-------
pd.DataFrame
Filtered DataFrame containing only shots by the specified player.
Returns a copy to avoid SettingWithCopyWarning.
"""
return events.query(
"type == 'Shot' and player == @player_name"
).copy()
# Usage
mbappe_shots = get_player_shots(events, 'Kylian Mbappé')
print(f"Mbappé shots: {len(mbappe_shots)}")
4.3.4 Data Transformation
Adding Calculated Columns:
# Goal difference: simple arithmetic on two columns
df['goal_diff'] = df['home_goals'] - df['away_goals']
# xG difference
df['xg_diff'] = df['home_xg'] - df['away_xg']
# Points using apply() --- works but is slow for large DataFrames
# apply() calls a Python function once per row, which is not vectorized.
df['home_points'] = df['goal_diff'].apply(
lambda x: 3 if x > 0 else (1 if x == 0 else 0)
)
# More efficient with np.where --- fully vectorized, runs in C
import numpy as np
df['home_points'] = np.where(
df['goal_diff'] > 0, 3,
np.where(df['goal_diff'] == 0, 1, 0)
)
# Using np.select for multiple conditions --- cleanest syntax for 3+ cases
conditions = [
df['goal_diff'] > 0, # Home win
df['goal_diff'] == 0, # Draw
df['goal_diff'] < 0 # Away win
]
choices = [3, 1, 0]
df['home_points'] = np.select(conditions, choices)
Best Practice: Avoid .apply() with lambda functions whenever possible. It processes rows one at a time in Python, which is orders of magnitude slower than vectorized NumPy operations. For the points calculation above, np.select is about 50-100x faster than apply on a typical season dataset.
Working with Coordinates:
def extract_coordinates(events: pd.DataFrame) -> pd.DataFrame:
"""
Extract x, y coordinates from the location column.
StatsBomb stores locations as Python lists [x, y] inside a column.
This function unpacks those lists into separate numeric columns.
"""
df = events.copy()
# isinstance check prevents errors when location is NaN (e.g., for
# events like half-start that have no spatial position).
df['x'] = df['location'].apply(
lambda loc: loc[0] if isinstance(loc, list) and len(loc) >= 2 else None
)
df['y'] = df['location'].apply(
lambda loc: loc[1] if isinstance(loc, list) and len(loc) >= 2 else None
)
# Convert to float for numerical operations
df['x'] = pd.to_numeric(df['x'], errors='coerce')
df['y'] = pd.to_numeric(df['y'], errors='coerce')
return df
# For end location (passes, carries)
def extract_end_coordinates(events: pd.DataFrame) -> pd.DataFrame:
"""Extract end location for passes and carries."""
df = events.copy()
df['end_x'] = df['pass_end_location'].apply(
lambda loc: loc[0] if isinstance(loc, list) else None
)
df['end_y'] = df['pass_end_location'].apply(
lambda loc: loc[1] if isinstance(loc, list) else None
)
return df
4.3.5 Grouping and Aggregation
Aggregation is fundamental to soccer analytics---calculating per-player, per-team, or per-match statistics.
Intuition: The groupby operation is the single most important pandas pattern for soccer analytics. Nearly every meaningful statistic---passes per 90 minutes, team shot conversion rates, player progressive carry distances---requires grouping data by some category (player, team, match) and then aggregating. Mastering groupby with agg, transform, and apply will unlock the vast majority of analyses you need to perform.
Basic Groupby:
# Goals per team: group by team name, sum goals column
team_goals = df.groupby('home_team')['home_goals'].sum()
# Multiple aggregations on multiple columns using agg()
# The dict maps column names to lists of aggregation functions.
team_stats = df.groupby('home_team').agg({
'home_goals': ['sum', 'mean'],
'home_xg': ['sum', 'mean'],
'match_id': 'count'
}).round(2)
# agg() with multiple functions creates a MultiIndex on columns.
# Flatten it by joining the two levels with an underscore.
team_stats.columns = ['_'.join(col) for col in team_stats.columns]
team_stats = team_stats.rename(columns={'match_id_count': 'matches'})
Per-90-Minute Statistics:
One of the most common normalizations in soccer analytics is converting raw counts to "per 90 minutes" rates. This allows fair comparison between players who have played different amounts of time.
def per_90(count: pd.Series, minutes: pd.Series) -> pd.Series:
"""
Convert raw counts to per-90-minute rates.
Parameters
----------
count : pd.Series
Raw event counts (e.g., passes, shots, tackles)
minutes : pd.Series
Minutes played by each player
Returns
-------
pd.Series
Per-90 rates. Returns NaN for players with fewer than 90 minutes.
"""
# Avoid division by zero and unreliable small-sample rates
return np.where(minutes >= 90, count / minutes * 90, np.nan)
# Example usage
player_stats['passes_per_90'] = per_90(player_stats['passes'], player_stats['minutes'])
player_stats['shots_per_90'] = per_90(player_stats['shots'], player_stats['minutes'])
Player Statistics from Events:
def calculate_player_stats(events: pd.DataFrame) -> pd.DataFrame:
"""
Calculate per-player statistics from event data.
Parameters
----------
events : pd.DataFrame
Event data with columns: player, team, type, etc.
Returns
-------
pd.DataFrame
Aggregated player statistics with one row per player.
"""
# Filter out events without a player (e.g., ball receipts)
player_events = events[events['player'].notna()].copy()
    # Count by event type using unstack to pivot type into columns.
    # Group by player (not player and team) so the index aligns with the
    # per-player Series used to build the stats DataFrame below.
    event_counts = player_events.groupby(
        ['player', 'type']
    ).size().unstack(fill_value=0)
# Build a clean stats DataFrame
stats = pd.DataFrame({
'team': player_events.groupby('player')['team'].first(),
'passes': event_counts.get('Pass', 0),
'shots': event_counts.get('Shot', 0),
'tackles': event_counts.get('Tackle', 0),
'carries': event_counts.get('Carry', 0),
})
# Add minutes played (approximate from last event timestamp)
if 'minute' in events.columns:
stats['minutes'] = player_events.groupby('player')['minute'].max()
return stats.reset_index()
# Calculate and display top passers
player_stats = calculate_player_stats(events)
print(player_stats.nlargest(10, 'passes'))
Using transform() for Within-Group Calculations:
The transform() method returns a Series with the same index as the input, making it ideal for adding group-level statistics back to the original DataFrame.
# Add team average xG to each row
events['team_avg_xg'] = events.groupby('team')['shot_statsbomb_xg'].transform('mean')
# Add player rank within team by passes
player_stats['pass_rank'] = player_stats.groupby('team')['passes'].rank(
ascending=False, method='min'
)
Team Match Statistics:
def calculate_match_team_stats(events: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate team-level statistics for each match.
    Returns one row per team per match.
    """
    # Boolean helper columns make the aggregation explicit and avoid the
    # deprecated dict-of-renamers agg syntax, which errors in modern pandas.
    stats = events.assign(
        is_pass=events['type'] == 'Pass',
        is_shot=events['type'] == 'Shot'
    ).groupby(['match_id', 'team']).agg(
        passes=('is_pass', 'sum'),
        shots=('is_shot', 'sum'),
        minutes=('minute', 'max')
    )
    return stats.reset_index()
4.3.6 Merging and Joining DataFrames
Soccer analysis often requires combining multiple data sources.
Types of Merges:
# Inner merge: only rows with matching keys in BOTH DataFrames
merged = pd.merge(events, matches, on='match_id', how='inner')
# Left merge: ALL rows from left DataFrame, matching rows from right.
# Non-matching right rows are filled with NaN.
events_with_match_info = pd.merge(
events,
matches[['match_id', 'home_team', 'away_team', 'match_date']],
on='match_id',
how='left'
)
# Merge player stats with biographical info
player_bio = pd.DataFrame({
'player': ['Kylian Mbappé', 'Lionel Messi'],
'birth_year': [1998, 1987],
'position': ['Forward', 'Forward']
})
player_full = pd.merge(
player_stats,
player_bio,
on='player',
how='left' # Keep all players, even those without bio info
)
Common Pitfall: When merging, always check for duplicate keys. If both DataFrames have multiple rows with the same key, the merge produces a Cartesian product (every combination), which can explode the row count unexpectedly. After any merge, verify the result shape:
print(f"Before: {len(df1)}, After: {len(merged)}").
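One way to catch the problem early, sketched below, is to check for duplicate keys before merging and to pass the validate argument to pd.merge, which raises a MergeError when the key relationship is not what you expect.
# Guard against an accidental Cartesian product before merging
dupes = matches['match_id'].duplicated().sum()
if dupes > 0:
    print(f"Warning: {dupes} duplicate match_id values in matches")

# validate='many_to_one' asserts each event matches at most one row in matches
events_checked = pd.merge(
    events,
    matches[['match_id', 'home_team', 'away_team']],
    on='match_id',
    how='left',
    validate='many_to_one'
)
print(f"Before: {len(events)}, After: {len(events_checked)}")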
Combining Home and Away Data:
A common data transformation task is converting match-level data (one row per match) into team-match data (two rows per match---one from each team's perspective).
def create_team_match_data(matches: pd.DataFrame) -> pd.DataFrame:
"""
Convert match data to team-match format.
Each match becomes two rows: one for home team, one for away team.
This format is essential for calculating team-level season statistics.
"""
# Home team perspective
home = matches[['match_id', 'home_team', 'away_team',
'home_goals', 'away_goals', 'home_xg', 'away_xg']].copy()
home.columns = ['match_id', 'team', 'opponent', 'goals_for',
'goals_against', 'xg_for', 'xg_against']
home['venue'] = 'home'
# Away team perspective: same columns but swapped
away = matches[['match_id', 'away_team', 'home_team',
'away_goals', 'home_goals', 'away_xg', 'home_xg']].copy()
away.columns = ['match_id', 'team', 'opponent', 'goals_for',
'goals_against', 'xg_for', 'xg_against']
away['venue'] = 'away'
# Combine both perspectives
team_matches = pd.concat([home, away], ignore_index=True)
# Add derived columns
team_matches['goal_diff'] = team_matches['goals_for'] - team_matches['goals_against']
team_matches['xg_diff'] = team_matches['xg_for'] - team_matches['xg_against']
return team_matches.sort_values(['team', 'match_id'])
4.3.7 Handling Missing Data in Soccer Datasets
Missing data is pervasive in soccer datasets. Not every event has every attribute---a pass has no shot_xg, a shot has no pass_end_location, and some events lack coordinate data entirely.
def audit_missing_data(df: pd.DataFrame) -> pd.DataFrame:
"""
Create a summary of missing data in each column.
Returns a DataFrame sorted by percentage missing, descending.
"""
missing = df.isnull().sum()
percent = (missing / len(df)) * 100
summary = pd.DataFrame({
'missing_count': missing,
'percent_missing': percent.round(1),
'dtype': df.dtypes
})
return summary[summary['missing_count'] > 0].sort_values(
'percent_missing', ascending=False
)
# Common strategies for handling missing soccer data:
# 1. Fill with a default value (appropriate for known-absent data)
events['shot_statsbomb_xg'] = events['shot_statsbomb_xg'].fillna(0)
# 2. Forward-fill for sequential data (e.g., game state)
events['score_home'] = events['score_home'].ffill()
# 3. Drop rows where a critical column is missing
shots = events[events['type'] == 'Shot'].dropna(subset=['location'])
# 4. Interpolate for continuous variables
player_tracking['speed'] = player_tracking['speed'].interpolate(method='linear')
Best Practice: Before dropping or filling missing data, always understand why it is missing. In StatsBomb data, a missing pass_outcome means the pass was successful (they only record failed outcomes). Filling it with "Unknown" or dropping it would be a serious error. Read the data documentation carefully.
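Applied to pass completion, that convention looks like this (assuming a StatsBomb-style events DataFrame with a pass_outcome column):
passes = events[events['type'] == 'Pass']
# A missing pass_outcome means the pass was completed (StatsBomb convention)
completed = passes['pass_outcome'].isna()
print(f"Pass completion: {completed.mean():.1%} ({completed.sum()}/{len(passes)})")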
4.3.8 Time Series Operations
Match data is inherently temporal. pandas provides powerful time series tools.
# Parse dates from strings to datetime objects
matches['match_date'] = pd.to_datetime(matches['match_date'])
# Set date as index for time series operations
matches_ts = matches.set_index('match_date').sort_index()
# Resample to weekly totals --- useful for plotting trends
weekly_goals = matches_ts.resample('W')['home_goals'].sum()
# Rolling averages (e.g., 5-match moving average of xG)
# transform() keeps the result aligned with the original DataFrame
matches['rolling_xg'] = matches.groupby('home_team')['home_xg'].transform(
lambda x: x.rolling(5, min_periods=1).mean()
)
# Cumulative statistics across a season
matches['cumulative_goals'] = matches.groupby('home_team')['home_goals'].cumsum()
# Exponentially weighted moving average (more weight on recent matches)
matches['ewm_xg'] = matches.groupby('home_team')['home_xg'].transform(
lambda x: x.ewm(span=5).mean()
)
4.4 NumPy for Numerical Computing
NumPy provides the numerical foundation for all Python data science. Understanding NumPy operations enables efficient analytics code.
4.4.1 Array Basics
import numpy as np
# Create arrays from soccer data
goals = np.array([2, 1, 0, 3, 1, 2, 0, 1])
xG = np.array([1.5, 1.2, 0.8, 2.1, 0.9, 1.8, 0.5, 1.1])
# Basic operations are element-wise: each element is processed independently
goal_diff_from_xG = goals - xG
print(f"Over/underperformance: {goal_diff_from_xG}")
# Aggregations collapse the array to a single value
print(f"Total goals: {goals.sum()}")
print(f"Mean xG: {xG.mean():.2f}")
print(f"Total overperformance: {goal_diff_from_xG.sum():.2f}")
Understanding Array Shapes:
# 1-D array: a single list of values (e.g., one player's match xG)
player_xg = np.array([0.3, 0.5, 0.1, 0.8])
print(f"Shape: {player_xg.shape}") # (4,)
# 2-D array: a matrix (e.g., xG for multiple players across matches)
team_xg = np.array([
[0.3, 0.5, 0.1, 0.8], # Player 1
[0.2, 0.4, 0.3, 0.6], # Player 2
[0.1, 0.0, 0.5, 0.2], # Player 3
])
print(f"Shape: {team_xg.shape}") # (3, 4) --- 3 players, 4 matches
# Sum across axis 1 (columns) to get each player's total xG
print(f"Player totals: {team_xg.sum(axis=1)}") # [1.7, 1.5, 0.8]
# Sum across axis 0 (rows) to get each match's total xG
print(f"Match totals: {team_xg.sum(axis=0)}") # [0.6, 0.9, 0.9, 1.6]
4.4.2 Vectorized Operations
NumPy operations are much faster than Python loops because they execute in compiled C code rather than interpreted Python.
Intuition: The speed difference between Python loops and NumPy vectorized operations is not a minor optimization---it is often the difference between an analysis that runs in milliseconds and one that takes minutes. When processing a full season of event data (hundreds of thousands of rows), a distance calculation implemented as a Python loop can take 30+ seconds. The same calculation vectorized with NumPy completes in under 50 milliseconds. Always think in terms of array operations rather than element-by-element processing.
# Slow: Python loop --- processes one element at a time
def calculate_points_loop(goal_diffs):
points = []
for gd in goal_diffs:
if gd > 0:
points.append(3)
elif gd == 0:
points.append(1)
else:
points.append(0)
return points
# Fast: NumPy vectorized --- processes the entire array at once
def calculate_points_numpy(goal_diffs):
return np.where(goal_diffs > 0, 3, np.where(goal_diffs == 0, 1, 0))
# Timing comparison
import time
goal_diffs = np.random.randint(-5, 6, 100000)
start = time.time()
_ = calculate_points_loop(goal_diffs)
loop_time = time.time() - start
start = time.time()
_ = calculate_points_numpy(goal_diffs)
numpy_time = time.time() - start
print(f"Loop: {loop_time:.4f}s, NumPy: {numpy_time:.4f}s")
print(f"NumPy is {loop_time/numpy_time:.0f}x faster")
4.4.3 Statistical Functions
# Sample data: conversion rates for 8 strikers
conversion_rates = np.array([0.12, 0.18, 0.15, 0.22, 0.10, 0.14, 0.16, 0.19])
# Central tendency
print(f"Mean: {np.mean(conversion_rates):.3f}")
print(f"Median: {np.median(conversion_rates):.3f}")
# Spread --- ddof=1 gives the sample standard deviation (Bessel's correction)
print(f"Std Dev: {np.std(conversion_rates, ddof=1):.3f}")
print(f"Variance: {np.var(conversion_rates, ddof=1):.4f}")
# Percentiles
print(f"25th percentile: {np.percentile(conversion_rates, 25):.3f}")
print(f"75th percentile: {np.percentile(conversion_rates, 75):.3f}")
# Correlation matrix between xG and goals for 8 teams
xG = np.array([55, 48, 72, 61, 53, 68, 45, 58])
goals = np.array([52, 45, 75, 58, 50, 70, 42, 55])
# np.corrcoef returns a 2x2 matrix; [0, 1] is the cross-correlation
print(f"Correlation: {np.corrcoef(xG, goals)[0, 1]:.3f}")
4.4.4 Distance and Spatial Calculations
Soccer analytics frequently involves spatial calculations. These functions form the foundation of xG models and positional analysis.
def euclidean_distance(x1, y1, x2, y2):
"""
Calculate Euclidean distance between two points.
Works with both scalar values and NumPy arrays (vectorized).
"""
return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)
def distance_to_goal(x, y, goal_x=120, goal_y=40):
"""
Calculate distance from event to center of goal.
Uses StatsBomb coordinate system where the goal is at x=120, y=40.
"""
return euclidean_distance(x, y, goal_x, goal_y)
def shot_angle(x, y, goal_y=40, post_width=7.32):
"""
Calculate the visible angle to goal from a shot position.
This is one of the most important features in xG models.
A wider angle means more of the goal is visible to the shooter.
Parameters
----------
x : float or np.ndarray
X coordinate (0-120 in StatsBomb system)
y : float or np.ndarray
Y coordinate (0-80 in StatsBomb system)
goal_y : float
Y coordinate of goal center
post_width : float
Width of the goal (7.32 meters)
Returns
-------
float or np.ndarray
Angle in radians
"""
goal_line = 120
left_post = goal_y - post_width / 2 # 36.34
right_post = goal_y + post_width / 2 # 43.66
    # Vector components from the shot position to each post, computed
    # element-wise so the function works for scalars and arrays alike
    dx = goal_line - x
    dy_left = left_post - y
    dy_right = right_post - y
    # Angle between the two vectors via the dot product formula:
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = dx * dx + dy_left * dy_right
    norm_left = np.sqrt(dx**2 + dy_left**2)
    norm_right = np.sqrt(dx**2 + dy_right**2)
    cos_angle = dot / (norm_left * norm_right)
    # Clip to [-1, 1] to avoid numerical errors with arccos
    return np.arccos(np.clip(cos_angle, -1, 1))
# Example: angle from penalty spot (12 yards = ~11m from goal, centered)
penalty_angle = shot_angle(108, 40)
print(f"Penalty spot angle: {np.degrees(penalty_angle):.1f}°")
# Example: angle from a tight angle on the wing
wing_angle = shot_angle(110, 70)
print(f"Tight wing angle: {np.degrees(wing_angle):.1f}°")
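Because these helpers are vectorized, they can be applied to whole columns at once. The sketch below assumes a shots DataFrame with numeric x and y columns, for example produced by extract_coordinates from Section 4.3.4:
# Add distance and angle features for every shot in one pass --- no loops
shots = extract_coordinates(events[events['type'] == 'Shot'])
shots['distance_to_goal'] = distance_to_goal(shots['x'].values, shots['y'].values)
shots['shot_angle'] = shot_angle(shots['x'].values, shots['y'].values)
print(shots[['distance_to_goal', 'shot_angle']].describe())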
4.4.5 Random Number Generation for Simulations
Monte Carlo simulation is a core technique in soccer analytics, used for everything from match outcome probabilities to season projections.
# Set seed for reproducibility --- always do this in published analyses
rng = np.random.default_rng(42)
# Simulate goals from xG using Poisson distribution
xG = 1.8
simulated_goals = rng.poisson(xG, size=10000)
print(f"Mean simulated goals: {simulated_goals.mean():.2f}")
print(f"P(score 2+): {(simulated_goals >= 2).mean():.3f}")
# Simulate match outcomes
def simulate_match(home_xg, away_xg, n_simulations=10000):
"""
Simulate match outcomes using independent Poisson model.
Each team's goals are drawn from a Poisson distribution with
their xG as the rate parameter. This is a simple but effective
model for converting pre-match xG into win/draw/loss probabilities.
Parameters
----------
home_xg : float
Home team expected goals
away_xg : float
Away team expected goals
n_simulations : int
Number of Monte Carlo simulations
Returns
-------
dict
Probabilities for home win, draw, away win
"""
rng = np.random.default_rng()
home_goals = rng.poisson(home_xg, n_simulations)
away_goals = rng.poisson(away_xg, n_simulations)
return {
'home_win': (home_goals > away_goals).mean(),
'draw': (home_goals == away_goals).mean(),
'away_win': (home_goals < away_goals).mean()
}
result = simulate_match(1.8, 1.2)
print(f"Home win: {result['home_win']:.1%}")
print(f"Draw: {result['draw']:.1%}")
print(f"Away win: {result['away_win']:.1%}")
4.4.6 Boolean Masking and Fancy Indexing
Boolean masks are essential for filtering arrays based on conditions without loops.
# Shot data arrays
distances = np.array([8.5, 22.3, 5.1, 31.0, 12.4, 18.7, 6.2, 25.8])
xg_values = np.array([0.25, 0.04, 0.45, 0.02, 0.12, 0.06, 0.38, 0.03])
outcomes = np.array([0, 0, 1, 0, 1, 0, 1, 0]) # 1 = goal, 0 = no goal
# Boolean mask: shots inside the box (roughly < 18 meters)
inside_box = distances < 18
print(f"Shots inside box: {inside_box.sum()}")
print(f"Avg xG inside box: {xg_values[inside_box].mean():.3f}")
print(f"Conversion inside box: {outcomes[inside_box].mean():.1%}")
# Combine masks
high_quality = (xg_values > 0.10) & (distances < 20)
print(f"High-quality chances: {high_quality.sum()}")
4.5 Visualization with Matplotlib and Seaborn
Effective visualization is essential for communicating soccer analytics insights. A well-crafted chart can convey patterns that would take paragraphs of text to describe. This section covers the most important chart types for soccer analytics, with production-ready code examples.
4.5.1 Matplotlib Fundamentals
Matplotlib uses a two-level API. The Figure is the entire image; Axes are individual plots within the figure. Always use the object-oriented API (fig, ax = plt.subplots()) rather than the stateful plt.plot() interface for production code.
import numpy as np
import matplotlib.pyplot as plt
# Create a figure with one axes
fig, ax = plt.subplots(figsize=(10, 6))
# Simulate season xG progression
matches = range(1, 39)
home_xg = np.random.normal(1.5, 0.4, 38)
cumulative_xg = np.cumsum(home_xg)
# Plot actual xG and expected trend
ax.plot(matches, cumulative_xg, 'b-', linewidth=2, label='Actual xG')
ax.plot(matches, np.arange(1, 39) * 1.5, 'r--', alpha=0.5, label='Expected (1.5/match)')
# Labels and formatting
ax.set_xlabel('Match Number', fontsize=12)
ax.set_ylabel('Cumulative xG', fontsize=12)
ax.set_title('Season xG Progression', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('xg_progression.png', dpi=150)
plt.show()
Multi-Panel Figures:
# Create a 2x2 grid of subplots for a comprehensive match report
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Top-left: xG timeline
axes[0, 0].set_title('xG Timeline')
# Top-right: Shot map (placeholder)
axes[0, 1].set_title('Shot Map')
# Bottom-left: Pass network (placeholder)
axes[1, 0].set_title('Pass Network')
# Bottom-right: Player radar (placeholder)
axes[1, 1].set_title('Player Radar')
fig.suptitle('Match Report: France vs Croatia', fontsize=16, y=1.02)
plt.tight_layout()
4.5.2 Bar Charts for Comparisons
def plot_team_comparison(teams, metric1, metric2, label1, label2, title):
"""Create a grouped bar chart comparing two metrics across teams."""
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(teams))
width = 0.35
bars1 = ax.bar(x - width/2, metric1, width, label=label1, color='steelblue')
bars2 = ax.bar(x + width/2, metric2, width, label=label2, color='coral')
ax.set_xlabel('Team')
ax.set_ylabel('Value')
ax.set_title(title)
ax.set_xticks(x)
ax.set_xticklabels(teams, rotation=45, ha='right')
ax.legend()
# Add value labels on top of each bar
for bar in bars1:
ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
f'{bar.get_height():.1f}', ha='center', va='bottom', fontsize=8)
plt.tight_layout()
return fig, ax
# Example usage
teams = ['Man City', 'Liverpool', 'Chelsea', 'Arsenal', 'Tottenham']
xg = [82.5, 78.3, 65.2, 68.9, 61.4]
goals = [85, 82, 63, 70, 58]
fig, ax = plot_team_comparison(teams, xg, goals, 'xG', 'Goals', 'xG vs Actual Goals')
plt.savefig('team_comparison.png', dpi=150)
4.5.3 Scatter Plots with Regression
Scatter plots with regression lines are the workhorse visualization for exploring relationships between metrics.
from scipy import stats
def plot_correlation(x, y, xlabel, ylabel, title, team_labels=None):
"""
Create scatter plot with regression line and statistics.
Parameters
----------
x, y : array-like
Data to plot
xlabel, ylabel : str
Axis labels
title : str
Plot title
team_labels : list of str, optional
Labels for each point (e.g., team names)
"""
fig, ax = plt.subplots(figsize=(8, 6))
# Scatter plot
ax.scatter(x, y, alpha=0.6, s=50)
# Regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
x_line = np.linspace(x.min(), x.max(), 100)
y_line = slope * x_line + intercept
ax.plot(x_line, y_line, 'r-', linewidth=2,
label=f'y = {intercept:.1f} + {slope:.2f}x')
# Add statistics annotation
ax.annotate(
f'r = {r_value:.3f}\nR\u00b2 = {r_value**2:.3f}\np = {p_value:.4f}',
xy=(0.05, 0.95), xycoords='axes fraction',
fontsize=10, verticalalignment='top',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5)
)
# Optional: label each point with team name
if team_labels is not None:
for i, label in enumerate(team_labels):
ax.annotate(label, (x[i], y[i]), fontsize=7,
xytext=(5, 5), textcoords='offset points')
ax.set_xlabel(xlabel, fontsize=12)
ax.set_ylabel(ylabel, fontsize=12)
ax.set_title(title, fontsize=14)
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)
plt.tight_layout()
return fig, ax
# Example: xG vs Goals
xg = np.array([55, 62, 48, 71, 58, 65, 52, 68, 45, 75])
goals = np.array([52, 65, 45, 68, 60, 62, 50, 70, 42, 78])
fig, ax = plot_correlation(xg, goals, 'Expected Goals (xG)', 'Actual Goals',
'xG vs Goals Correlation')
plt.savefig('xg_correlation.png', dpi=150)
4.5.4 Seaborn for Statistical Visualization
Seaborn is built on matplotlib and provides higher-level functions for common statistical plots.
import seaborn as sns
# Set seaborn style globally
sns.set_style("whitegrid")
sns.set_palette("husl")
# Distribution plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Left panel: Goals per match distribution
goals_data = np.random.poisson(1.4, 380)
sns.histplot(goals_data, discrete=True, ax=axes[0], color='steelblue')
axes[0].set_xlabel('Goals')
axes[0].set_title('Goals per Match Distribution')
# Right panel: xG distribution by match outcome
np.random.seed(42)
outcome_data = pd.DataFrame({
'xG': np.concatenate([
np.random.normal(2.0, 0.4, 100), # Wins
np.random.normal(1.4, 0.3, 80), # Draws
np.random.normal(1.0, 0.4, 70) # Losses
]),
'Result': ['Win']*100 + ['Draw']*80 + ['Loss']*70
})
sns.boxplot(data=outcome_data, x='Result', y='xG', ax=axes[1],
order=['Loss', 'Draw', 'Win'])
axes[1].set_ylabel('Expected Goals (xG)')
axes[1].set_title('xG by Match Result')
plt.tight_layout()
plt.savefig('seaborn_example.png', dpi=150)
Violin Plots for Distribution Comparison:
# Violin plots show the full distribution shape, not just quartiles
fig, ax = plt.subplots(figsize=(10, 6))
sns.violinplot(data=outcome_data, x='Result', y='xG',
order=['Loss', 'Draw', 'Win'], inner='quartile', ax=ax)
ax.set_title('xG Distribution by Match Outcome')
ax.set_ylabel('Expected Goals (xG)')
plt.tight_layout()
4.5.5 Heatmaps
def plot_correlation_heatmap(df, columns, title='Correlation Matrix'):
"""Create a correlation heatmap for selected columns."""
corr_matrix = df[columns].corr()
fig, ax = plt.subplots(figsize=(10, 8))
# annot=True prints the correlation value in each cell
# cmap='RdBu_r' uses red for positive, blue for negative
# center=0 ensures white is at zero correlation
sns.heatmap(
corr_matrix,
annot=True,
fmt='.2f',
cmap='RdBu_r',
center=0,
vmin=-1,
vmax=1,
square=True,
ax=ax
)
ax.set_title(title, fontsize=14)
plt.tight_layout()
return fig, ax
# Example with simulated team data
np.random.seed(42)
n = 100
team_data = pd.DataFrame({
'Goals': np.random.poisson(50, n),
'xG': np.random.normal(50, 10, n),
'Possession': np.random.normal(50, 5, n),
'Pass Accuracy': np.random.normal(82, 4, n),
'Shots': np.random.poisson(400, n)
})
# Add realistic correlation
team_data['Goals'] = team_data['xG'] * 0.9 + np.random.normal(0, 5, n)
fig, ax = plot_correlation_heatmap(
team_data,
['Goals', 'xG', 'Possession', 'Pass Accuracy', 'Shots'],
'Team Statistics Correlation Matrix'
)
plt.savefig('correlation_heatmap.png', dpi=150)
4.5.6 Soccer Pitch Visualization
Intuition: The spatial calculations covered here---distance to goal, shot angle, Euclidean distance---connect directly to the coordinate system concepts in Chapter 6 and form the foundation of the xG models developed in Chapter 7. Make sure you are comfortable converting between pitch coordinates and real-world distances before moving on.
For soccer-specific visualizations, use the mplsoccer library:
from mplsoccer import Pitch
def plot_shot_map(shots_df, title='Shot Map'):
"""
Create a shot map visualization.
Parameters
----------
shots_df : pd.DataFrame
DataFrame with columns: x, y, xG, outcome
title : str
Plot title
"""
# pitch_type='statsbomb' matches StatsBomb's coordinate system (120x80)
pitch = Pitch(pitch_type='statsbomb', pitch_color='grass',
line_color='white', goal_type='box')
fig, ax = pitch.draw(figsize=(12, 8))
# Separate goals and non-goals for different styling
goals = shots_df[shots_df['outcome'] == 'Goal']
non_goals = shots_df[shots_df['outcome'] != 'Goal']
# Plot non-goals as white circles, size proportional to xG
pitch.scatter(non_goals['x'], non_goals['y'],
s=non_goals['xG'] * 500,
c='white', edgecolors='black',
alpha=0.6, ax=ax, label='No Goal')
# Plot goals as red circles
pitch.scatter(goals['x'], goals['y'],
s=goals['xG'] * 500,
c='red', edgecolors='black',
alpha=0.8, ax=ax, label='Goal')
ax.set_title(title, fontsize=16)
ax.legend(loc='upper left')
return fig, ax
# Example usage with simulated data
np.random.seed(42)
shots = pd.DataFrame({
'x': np.random.uniform(100, 120, 50),
'y': np.random.uniform(20, 60, 50),
'xG': np.random.uniform(0.05, 0.5, 50),
'outcome': np.random.choice(['Goal', 'Saved', 'Blocked', 'Off Target'],
50, p=[0.12, 0.4, 0.25, 0.23])
})
fig, ax = plot_shot_map(shots, 'Team Shot Map - Season 2023/24')
plt.savefig('shot_map.png', dpi=150, facecolor='#1a1a1a')
Heatmap on a Pitch:
def plot_action_heatmap(events_df, title='Action Heatmap'):
"""
Create a heatmap of player actions on the pitch.
Uses mplsoccer's binning to create a spatial density plot.
"""
pitch = Pitch(pitch_type='statsbomb', line_zorder=2)
fig, ax = pitch.draw(figsize=(12, 8))
# bin_statistic divides the pitch into a grid and counts events
bin_statistic = pitch.bin_statistic(
events_df['x'], events_df['y'],
statistic='count', bins=(12, 8)
)
# Normalize and plot as a heatmap
pitch.heatmap(bin_statistic, ax=ax, cmap='hot', edgecolors='#22312b')
ax.set_title(title, fontsize=16)
return fig, ax
4.5.7 Passing Networks
from mplsoccer import Pitch
import matplotlib.patches as mpatches
def plot_passing_network(pass_df, min_passes=3):
"""
Create a passing network visualization.
Parameters
----------
pass_df : pd.DataFrame
Pass data with passer, receiver, start/end locations
min_passes : int
Minimum passes to show connection
"""
pitch = Pitch(pitch_type='statsbomb', pitch_color='#22312b',
line_color='white')
fig, ax = pitch.draw(figsize=(12, 8))
# Calculate average positions
avg_positions = pass_df.groupby('passer').agg({
'x': 'mean',
'y': 'mean'
}).reset_index()
# Calculate pass combinations
pass_counts = pass_df.groupby(['passer', 'receiver']).size().reset_index(name='count')
pass_counts = pass_counts[pass_counts['count'] >= min_passes]
# Plot connections --- line width proportional to pass frequency
for _, row in pass_counts.iterrows():
passer_pos = avg_positions[avg_positions['passer'] == row['passer']]
receiver_pos = avg_positions[avg_positions['passer'] == row['receiver']]
if len(passer_pos) > 0 and len(receiver_pos) > 0:
pitch.lines(
passer_pos['x'].values[0], passer_pos['y'].values[0],
receiver_pos['x'].values[0], receiver_pos['y'].values[0],
lw=row['count'] / 2,
color='white',
alpha=0.5,
ax=ax
)
# Plot player positions --- node size proportional to total passes
total_passes = pass_df.groupby('passer').size()
for _, row in avg_positions.iterrows():
size = total_passes.get(row['passer'], 1) * 5
pitch.scatter(row['x'], row['y'], s=size,
c='#d63333', edgecolors='white',
linewidths=2, ax=ax)
ax.set_title('Passing Network', fontsize=16, color='white')
return fig, ax
4.5.8 Styling for Publication
Professional soccer analytics publications follow consistent styling conventions. Here is a reusable style configuration.
def set_analytics_style():
"""Apply a consistent, publication-ready style to all plots."""
plt.rcParams.update({
'figure.facecolor': 'white',
'axes.facecolor': 'white',
'axes.grid': True,
'grid.alpha': 0.3,
'font.family': 'sans-serif',
'font.size': 11,
'axes.titlesize': 14,
'axes.labelsize': 12,
'legend.fontsize': 10,
'figure.dpi': 150,
'savefig.bbox': 'tight',
'savefig.dpi': 150,
})
# Call once at the top of your script or notebook
set_analytics_style()
4.6 Building Reusable Code
Professional analytics requires well-organized, reusable code.
Intuition: In professional club environments, analysts rarely write one-off scripts. Code is organized into reusable modules that can be applied to any match, any player, any competition. Investing time in writing well-documented functions with clear parameter types and return values pays dividends as your analysis library grows. The functions you write in this section will serve as templates for every analysis chapter that follows.
4.6.1 Function Design Principles
Good analytics functions follow several principles: - Single responsibility: Each function does one thing well. - Clear parameters: Use type hints and default values. - Documentation: Include docstrings with parameter descriptions and examples. - Defensive coding: Validate inputs and handle edge cases.
def calculate_xg_performance(
goals: int,
xg: float,
matches: int,
confidence_level: float = 0.95
) -> dict:
"""
Calculate xG performance metrics with confidence intervals.
Parameters
----------
goals : int
Actual goals scored
xg : float
Expected goals
matches : int
Number of matches
confidence_level : float, optional
Confidence level for interval (default 0.95)
Returns
-------
dict
Dictionary containing:
- goals_per_match: Goals per match
- xg_per_match: xG per match
- overperformance: Total goals minus xG
- overperformance_per_match: Per-match overperformance
- ci_lower: Lower bound of CI
- ci_upper: Upper bound of CI
Examples
--------
>>> result = calculate_xg_performance(25, 22.5, 38)
>>> print(f"Overperformance: {result['overperformance']:.1f}")
Overperformance: 2.5
"""
from scipy import stats
# Input validation
if matches <= 0:
raise ValueError(f"matches must be positive, got {matches}")
if goals < 0:
raise ValueError(f"goals must be non-negative, got {goals}")
goals_per_match = goals / matches
xg_per_match = xg / matches
overperformance = goals - xg
overperformance_per_match = overperformance / matches
# Confidence interval for goals per match (Poisson)
alpha = 1 - confidence_level
ci_lower = stats.poisson.ppf(alpha/2, goals) / matches
ci_upper = stats.poisson.ppf(1 - alpha/2, goals) / matches
return {
'goals_per_match': goals_per_match,
'xg_per_match': xg_per_match,
'overperformance': overperformance,
'overperformance_per_match': overperformance_per_match,
'ci_lower': ci_lower,
'ci_upper': ci_upper
}
4.6.2 Classes for Complex Analysis
When an analysis involves multiple related calculations on the same data, a class provides better organization than a collection of loose functions.
class PlayerAnalyzer:
"""
Analyze individual player performance from event data.
This class encapsulates all analysis for a single player,
providing a clean interface for accessing statistics.
Parameters
----------
events : pd.DataFrame
Event data containing player actions
player_name : str
Name of the player to analyze
Attributes
----------
player_events : pd.DataFrame
Filtered events for the specified player
stats : dict
Calculated statistics
"""
def __init__(self, events: pd.DataFrame, player_name: str):
self.player_name = player_name
self.player_events = events[events['player'] == player_name].copy()
if len(self.player_events) == 0:
raise ValueError(f"No events found for player: {player_name}")
self.stats = self._calculate_stats()
def _calculate_stats(self) -> dict:
"""Calculate basic statistics for the player."""
events = self.player_events
return {
'total_events': len(events),
'passes': len(events[events['type'] == 'Pass']),
'shots': len(events[events['type'] == 'Shot']),
'goals': len(events[
(events['type'] == 'Shot') &
(events.get('shot_outcome') == 'Goal')
]),
'matches': events['match_id'].nunique()
}
    def get_passing_stats(self) -> dict:
        """Calculate detailed passing statistics."""
        passes = self.player_events[self.player_events['type'] == 'Pass']
        if len(passes) == 0:
            return {'total': 0, 'successful': 0, 'accuracy': None, 'progressive': 0}
        # In StatsBomb data a missing pass_outcome means the pass was completed
        if 'pass_outcome' in passes.columns:
            successful = passes[passes['pass_outcome'].isna()]
        else:
            successful = passes
        progressive = 0
        if 'pass_progressive' in passes.columns:
            progressive = int(passes['pass_progressive'].fillna(False).sum())
        return {
            'total': len(passes),
            'successful': len(successful),
            'accuracy': len(successful) / len(passes),
            'progressive': progressive
        }
def get_shooting_stats(self) -> dict:
"""Calculate detailed shooting statistics."""
shots = self.player_events[self.player_events['type'] == 'Shot']
if len(shots) == 0:
return {'total': 0, 'goals': 0, 'xg': 0}
return {
'total': len(shots),
'goals': len(shots[shots.get('shot_outcome') == 'Goal']),
'xg': shots['shot_statsbomb_xg'].sum() if 'shot_statsbomb_xg' in shots else 0,
'conversion_rate': self.stats['goals'] / len(shots) if len(shots) > 0 else 0
}
def summary(self) -> pd.DataFrame:
"""Return a summary DataFrame of all statistics."""
passing = self.get_passing_stats()
shooting = self.get_shooting_stats()
return pd.DataFrame([{
'Player': self.player_name,
'Matches': self.stats['matches'],
'Passes': passing['total'],
'Pass Accuracy': f"{passing['accuracy']:.1%}" if passing['accuracy'] else 'N/A',
'Shots': shooting['total'],
'Goals': shooting['goals'],
'xG': f"{shooting['xg']:.2f}",
'Conversion': f"{shooting['conversion_rate']:.1%}" if shooting['total'] > 0 else 'N/A'
}])
4.6.3 Error Handling and Debugging
Robust error handling is essential for analytics code that will be used repeatedly across different datasets. Data formats change, API responses vary, and edge cases are inevitable.
import logging
# Configure logging --- this should be done once at the top of your module.
# INFO level logs routine operations; WARNING and ERROR for problems.
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
def safe_load_data(filepath: str) -> pd.DataFrame:
"""
Safely load data with comprehensive error handling.
Supports CSV, Parquet, and Excel formats. Logs success or
failure for debugging.
Parameters
----------
filepath : str
Path to data file
Returns
-------
pd.DataFrame
Loaded data
Raises
------
FileNotFoundError
If file doesn't exist
ValueError
If file format is not supported
"""
from pathlib import Path
path = Path(filepath)
if not path.exists():
logger.error(f"File not found: {filepath}")
raise FileNotFoundError(f"File not found: {filepath}")
suffix = path.suffix.lower()
try:
if suffix == '.csv':
df = pd.read_csv(filepath)
elif suffix == '.parquet':
df = pd.read_parquet(filepath)
elif suffix in ['.xlsx', '.xls']:
df = pd.read_excel(filepath)
elif suffix == '.json':
df = pd.read_json(filepath)
else:
raise ValueError(f"Unsupported file format: {suffix}")
logger.info(f"Loaded {len(df)} rows from {filepath}")
return df
except Exception as e:
logger.error(f"Error loading {filepath}: {e}")
raise
Common Debugging Techniques:
# 1. Check data at intermediate steps
def debug_pipeline(events):
"""Example of adding debug checks to a data pipeline."""
print(f"Step 0: {len(events)} events")
# Step 1: Filter to shots
shots = events[events['type'] == 'Shot']
print(f"Step 1 (shots): {len(shots)} rows")
# Step 2: Extract coordinates
shots = extract_coordinates(shots)
print(f"Step 2 (with coords): {shots['x'].notna().sum()} have valid x")
# Step 3: Calculate distance
shots['distance'] = distance_to_goal(shots['x'], shots['y'])
print(f"Step 3 (with distance): {shots['distance'].describe()}")
return shots
# 2. Use assertions for data validation
def validate_events(events: pd.DataFrame) -> None:
"""Validate that event data has the expected structure."""
required_columns = ['match_id', 'type', 'player', 'team']
missing = [col for col in required_columns if col not in events.columns]
assert len(missing) == 0, f"Missing columns: {missing}"
assert len(events) > 0, "Events DataFrame is empty"
assert events['match_id'].notna().all(), "Some events have null match_id"
Best Practice: When building a data pipeline, add logging at each major step. When something goes wrong (and it will), the logs tell you exactly where the pipeline broke. Use logger.info() for routine operations, logger.warning() for recoverable issues, and logger.error() for failures.
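As a sketch of what that looks like in practice, here is a logging-based version of the earlier debug pipeline. It reuses the logger configured above and the extract_coordinates and distance_to_goal helpers from earlier sections.
def run_shot_pipeline(events: pd.DataFrame) -> pd.DataFrame:
    """Process shots with logging at each major step."""
    logger.info("Pipeline start: %d events", len(events))
    shots = events[events['type'] == 'Shot']
    logger.info("Filtered to %d shots", len(shots))
    shots = extract_coordinates(shots)
    missing_coords = shots['x'].isna().sum()
    if missing_coords > 0:
        logger.warning("%d shots have no coordinates", missing_coords)
    shots['distance'] = distance_to_goal(shots['x'], shots['y'])
    logger.info("Pipeline complete: %d shots with distance", shots['distance'].notna().sum())
    return shots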
4.7 Version Control with Git
Reproducible analytics requires more than clean code and virtual environments -- it requires systematic version control. Git is the industry-standard tool for tracking changes to code, collaborating with teammates, and maintaining a reliable history of your analytical work. Every professional soccer analytics department uses version control, and fluency with Git is a non-negotiable skill for analysts working in club environments.
4.7.1 Git Basics
Git tracks changes to files in a repository, allowing you to record snapshots of your project at any point and return to previous states if needed.
Core Commands:
# Initialize a new repository in your project directory
git init
# Check the status of your working directory
git status
# Stage specific files for the next commit
git add src/data_loader.py src/metrics.py
# Commit staged changes with a descriptive message
git commit -m "Add data loader and metrics modules for match analysis"
# Push commits to a remote repository (e.g., GitHub, GitLab)
git push origin main
# Pull the latest changes from the remote repository
git pull origin main
Each commit should represent a logical unit of work -- a completed function, a bug fix, or a new analysis pipeline. Write commit messages that describe the purpose of the change rather than listing files modified. A message like "Build xG model for set piece shots" is far more useful than "Updated model.py".
4.7.2 Repository Structure for Soccer Analytics
Organizing your repository with a clear structure makes it easy for collaborators to navigate and for your future self to understand past work.
soccer-analytics/
├── .gitignore # Files Git should not track
├── README.md # Project overview and setup instructions
├── requirements.txt # Python dependencies
├── config.py # Configuration settings
├── data/
│ ├── raw/ # Original data (often git-ignored)
│ └── processed/ # Cleaned data (often git-ignored)
├── notebooks/ # Jupyter notebooks for exploration
├── src/
│ ├── __init__.py
│ ├── data/ # Data loading and cleaning modules
│ ├── features/ # Feature engineering
│ ├── models/ # Statistical and ML models
│ └── visualization/ # Plotting utilities
├── tests/ # Unit tests for src modules
└── outputs/
├── figures/ # Generated plots (git-ignored)
└── reports/ # Analysis reports
Keep source code (src/) under version control at all times. Notebooks should be committed but with outputs cleared to avoid large diffs and potential data leaks. Data files and generated outputs are typically excluded via .gitignore.
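Clearing outputs can be automated. Below is a minimal sketch using the nbformat library (assumed to be available alongside Jupyter); the notebook path is only an example.
import nbformat

def clear_notebook_outputs(path: str) -> None:
    """Strip cell outputs and execution counts so notebook diffs stay small."""
    nb = nbformat.read(path, as_version=4)
    for cell in nb.cells:
        if cell.cell_type == 'code':
            cell.outputs = []
            cell.execution_count = None
    nbformat.write(nb, path)

clear_notebook_outputs('notebooks/match_analysis.ipynb')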
4.7.3 .gitignore for Data Files and Notebooks
A well-crafted .gitignore file prevents large data files, sensitive credentials, and generated outputs from being accidentally committed to the repository.
# Data files -- too large for Git, store separately
data/raw/
data/processed/
*.csv
*.parquet
*.xlsx
*.json.gz
# Jupyter notebook checkpoints
.ipynb_checkpoints/
# Python environment
venv/
*.pyc
__pycache__/
# Generated outputs
outputs/figures/
outputs/reports/
*.png
*.pdf
# IDE and OS files
.vscode/
.idea/
.DS_Store
Thumbs.db
# Credentials and secrets
.env
credentials.json
api_keys.py
For large datasets, use separate storage solutions such as cloud storage buckets, a shared network drive, or Git Large File Storage (Git LFS). Document in your README.md where collaborators can obtain the required data files and how to place them in the expected directory structure.
4.7.4 Collaborative Workflows with Branches
When working on a team -- or even on your own across multiple analyses -- branches let you develop new features or experiments without disrupting stable code.
# Create and switch to a new branch for a specific analysis
git checkout -b feature/xg-model-v2
# Work on your changes, committing as you go
git add src/models/xg_model.py
git commit -m "Implement gradient boosting xG model with set piece features"
# When finished, switch back to main and merge
git checkout main
git merge feature/xg-model-v2
# Delete the branch after merging
git branch -d feature/xg-model-v2
Branching strategies for analytics teams:
- Feature branches: One branch per analysis task (e.g., feature/corner-kick-analysis, feature/player-recruitment-report). Merge into main when complete and reviewed.
- Experimentation branches: Use branches to test alternative modeling approaches without committing unfinished work to the shared codebase.
- Release branches: For production dashboards or recurring reports, maintain a stable main branch that always produces correct outputs.
When multiple analysts work on the same repository, use pull requests (on GitHub or GitLab) to review each other's code before merging. Code review catches errors, spreads knowledge across the team, and maintains consistent code quality.
4.7.5 Best Practices for Versioning Data Pipelines
Data pipelines in soccer analytics evolve over time as new data sources become available, models are refined, and reporting requirements change. Git helps manage this evolution, but only if you follow disciplined practices.
Pipeline versioning guidelines:
- Pin your dependencies: Always commit requirements.txt with exact version numbers (pandas==2.1.0, not pandas>=2.0). A model trained with one version of scikit-learn may produce different results with another.
- Tag significant milestones: Use Git tags to mark important versions of your pipeline, such as the model deployed for a particular transfer window or the analysis delivered for a board presentation:
  git tag -a v1.0-summer-window -m "xG model used for summer 2024 recruitment"
- Separate code from configuration: Store model hyperparameters, competition IDs, and season identifiers in configuration files rather than hard-coding them. This lets you rerun the same pipeline on different data without modifying source code (see the sketch after this list).
- Document data lineage: Record in your commit messages or a changelog which data sources were used, any manual cleaning steps applied, and the date the data was downloaded. When a match event provider retroactively corrects data, you need to know which analyses may be affected.
- Automate with scripts: Create shell scripts or Makefile targets that reproduce your entire pipeline from raw data to final outputs. A collaborator should be able to clone your repository, install dependencies, and run a single command to regenerate all results.
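To illustrate the separation of code from configuration, here is a minimal sketch that reads competition and model settings from a JSON file; the file path and keys are hypothetical, chosen only to show the pattern.

```python
import json

# Hypothetical config/pipeline.json (keys are illustrative):
# {
#   "competition_id": 43,
#   "season_id": 106,
#   "xg_model": {"n_estimators": 500, "learning_rate": 0.05}
# }
with open('config/pipeline.json') as f:
    config = json.load(f)

# Downstream code reads identifiers and hyperparameters from the config,
# so rerunning on a different season means editing JSON, not source code
print(f"Competition {config['competition_id']}, season {config['season_id']}")
print(f"xG model hyperparameters: {config['xg_model']}")
```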
Following these practices transforms your analytics work from a collection of ad-hoc scripts into a professional, auditable, and reproducible system -- the standard expected in any serious club analytics operation.
4.8 Performance Optimization
Large soccer datasets require efficient code. Here are key optimization strategies.
Best Practice: Memory optimization is not premature---it is essential. A full season of tracking data at 25 frames per second for 22 players generates over 100 million rows. Even event data for a multi-season analysis can exceed available RAM if data types are not managed carefully. Start every project by checking df.info() and df.memory_usage(deep=True) to understand your memory footprint, then apply the dtype optimization techniques shown below.
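For example, a quick audit of the events DataFrame used throughout this chapter might look like this before any optimization is applied:

```python
# Overall footprint, including the true size of object (string) columns
events.info(memory_usage='deep')

# Per-column usage in megabytes, largest columns first
per_column_mb = events.memory_usage(deep=True) / 1024**2
print(per_column_mb.sort_values(ascending=False).head(10))
```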
4.8.1 Memory Optimization
def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
"""
Optimize DataFrame memory usage by downcasting dtypes.
This function reduces memory by:
- Converting low-cardinality string columns to category type
- Downcasting integers to the smallest type that fits the data
- Downcasting floats from float64 to float32
Parameters
----------
df : pd.DataFrame
Input DataFrame
Returns
-------
pd.DataFrame
Memory-optimized DataFrame
"""
result = df.copy()
for col in result.columns:
col_type = result[col].dtype
if col_type == 'object':
# Category type uses integer codes + lookup table.
# Effective when unique values are < 50% of total rows.
if result[col].nunique() / len(result) < 0.5:
result[col] = result[col].astype('category')
elif col_type == 'int64':
# int8: -128 to 127; int16: -32768 to 32767;
# int32: -2B to 2B. Most soccer data fits in int16 or int32.
result[col] = pd.to_numeric(result[col], downcast='integer')
elif col_type == 'float64':
# float32 has ~7 decimal digits of precision.
# More than enough for xG, coordinates, etc.
result[col] = pd.to_numeric(result[col], downcast='float')
return result
# Example
original_memory = events.memory_usage(deep=True).sum() / 1024**2
optimized_events = optimize_dtypes(events)
optimized_memory = optimized_events.memory_usage(deep=True).sum() / 1024**2
print(f"Original: {original_memory:.2f} MB")
print(f"Optimized: {optimized_memory:.2f} MB")
print(f"Reduction: {(1 - optimized_memory/original_memory)*100:.1f}%")
4.8.2 Efficient Aggregation
Different aggregation approaches vary enormously in performance. Here is a comparison from slowest to fastest.
# SLOW: iterating over groups with a Python loop
# This is the pattern beginners default to. Avoid it.
def slow_aggregation(df):
results = []
for player in df['player'].unique():
player_df = df[df['player'] == player]
results.append({
'player': player,
'passes': len(player_df[player_df['type'] == 'Pass']),
'shots': len(player_df[player_df['type'] == 'Shot'])
})
return pd.DataFrame(results)
# MEDIUM: groupby with apply --- better, but still calls Python per group
def fast_aggregation(df):
return df.groupby('player').apply(
lambda x: pd.Series({
'passes': (x['type'] == 'Pass').sum(),
'shots': (x['type'] == 'Shot').sum()
})
).reset_index()
# FAST: pivot approach --- fully vectorized, no Python loop
def fastest_aggregation(df):
counts = df.groupby(['player', 'type']).size().unstack(fill_value=0)
return counts[['Pass', 'Shot']].rename(
columns={'Pass': 'passes', 'Shot': 'shots'}
).reset_index()
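To verify the difference on your own data, a rough wall-clock comparison can be run as below; in a notebook, %timeit is more convenient, and absolute times depend heavily on dataset size.

```python
import time

def mean_runtime(func, df, n_runs=5):
    """Average wall-clock seconds for func(df) over n_runs executions."""
    start = time.perf_counter()
    for _ in range(n_runs):
        func(df)
    return (time.perf_counter() - start) / n_runs

for name, func in [('slow loop', slow_aggregation),
                   ('groupby + apply', fast_aggregation),
                   ('pivot (vectorized)', fastest_aggregation)]:
    print(f"{name:>20}: {mean_runtime(func, events) * 1000:.1f} ms per run")
```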
4.8.3 Using Parquet for Faster I/O
For datasets you load repeatedly, consider converting CSV files to Parquet format. Parquet is a columnar storage format that offers dramatically faster read times and smaller file sizes.
# Convert CSV to Parquet (do this once)
events = pd.read_csv('data/raw/events.csv')
events.to_parquet('data/processed/events.parquet', index=False)
# Load from Parquet (do this every time --- much faster)
events = pd.read_parquet('data/processed/events.parquet')
# Parquet also supports reading only specific columns
shots = pd.read_parquet(
'data/processed/events.parquet',
columns=['match_id', 'type', 'player', 'location', 'shot_statsbomb_xg']
)
Advanced: For very large datasets (multiple seasons of tracking data), consider using pyarrow directly or dask for out-of-core computation. Dask provides a pandas-like API that operates on datasets larger than memory by processing them in chunks.
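As a minimal sketch (assuming dask is installed and event data is stored as per-season Parquet files matching the hypothetical pattern events_*.parquet):

```python
import dask.dataframe as dd

# Lazily reference many Parquet files; nothing is loaded into memory yet
events = dd.read_parquet('data/processed/events_*.parquet')

# Operations build a task graph; .compute() executes it chunk by chunk
xg_by_player = (
    events[events['type'] == 'Shot']
    .groupby('player')['shot_statsbomb_xg']
    .sum()
    .compute()
    .sort_values(ascending=False)
)
print(xg_by_player.head(10))
```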
4.9 Summary
This chapter established the Python foundation for soccer analytics:
Environment Setup:
- Virtual environments ensure project isolation and reproducibility
- Consistent project structure improves maintainability
- Configuration files centralize settings and prevent hard-coded values

pandas Essentials:
- DataFrames efficiently store and manipulate tabular soccer data
- Boolean indexing and query() filter data precisely
- Groupby operations aggregate statistics at any level (player, team, match)
- Merges combine multiple data sources (events, matches, player bio)
- Time series operations support rolling averages and cumulative statistics
- Missing data handling requires understanding why data is absent

NumPy Fundamentals:
- Vectorized operations dramatically outperform Python loops
- Statistical functions enable quick exploratory analysis
- Spatial calculations (distance, angle) support position-based analytics
- Random number generation powers Monte Carlo simulations

Visualization:
- matplotlib provides complete plotting control via the object-oriented API
- seaborn simplifies statistical graphics (distributions, correlations)
- mplsoccer creates professional soccer-specific visualizations (shot maps, passing networks, heatmaps)

Best Practices:
- Write functions with clear parameters, type hints, and documentation
- Use classes for complex, stateful analysis
- Handle errors gracefully with logging and input validation
- Optimize for large datasets with dtype downcasting and Parquet files
The tools covered in this chapter form the backbone of every analysis in subsequent chapters. Mastery of these fundamentals enables you to focus on the analytics questions rather than implementation details.
4.10 Exercises
See exercises.md for hands-on practice problems covering all topics in this chapter.
4.11 Further Reading
See further-reading.md for recommended resources on Python data science.