Chapter 4: Python Tools for Soccer Analytics

Learning Objectives

By the end of this chapter, you will be able to:

  1. Configure a professional Python environment for soccer analytics
  2. Use pandas for efficient data manipulation of match and player data
  3. Apply NumPy for numerical computations in analytics workflows
  4. Create publication-quality visualizations with matplotlib, seaborn, and mplsoccer
  5. Build reusable analysis functions and classes
  6. Work with JSON and CSV match data from common providers
  7. Apply error handling and debugging techniques to analytics code
  8. Implement best practices for reproducible analytics code

4.1 Introduction

Soccer analytics requires processing large volumes of data---millions of events, thousands of players, hundreds of matches. Python has emerged as the dominant language for this work, offering a powerful ecosystem of libraries specifically designed for data analysis, statistical modeling, and visualization.

This chapter provides a comprehensive guide to the Python tools you'll use throughout this textbook and your analytics career. Rather than a general Python tutorial, we focus specifically on patterns and techniques relevant to soccer data. Every code example uses soccer data, and every design pattern addresses a challenge that soccer analysts face in practice.

Why Python for Soccer Analytics?

Python dominates professional sports analytics for several reasons:

  1. Rich ecosystem: pandas, NumPy, scikit-learn, and visualization libraries form a complete analytics toolkit
  2. Community support: Active communities have built soccer-specific tools like statsbombpy, mplsoccer, and socceraction
  3. Industry adoption: Most professional clubs and analytics companies use Python
  4. Accessibility: Clear syntax makes code readable and maintainable
  5. Integration: Easy connection to databases, APIs, and web services

Intuition: While R remains popular in academic sports research, Python has become the dominant language in professional club analytics departments. Learning Python for soccer analytics gives you skills that transfer directly to industry roles. Nearly every major data provider (StatsBomb, Opta, Wyscout) offers Python SDKs or APIs, and the vast majority of job listings for soccer analyst positions list Python as a required skill.

Chapter Overview

We'll cover four core areas:

  • Environment Setup: Configuring a professional development environment
  • Data Manipulation: pandas for wrangling soccer data
  • Numerical Computing: NumPy for efficient calculations
  • Visualization: matplotlib and seaborn for soccer graphics

Each section progresses from fundamental concepts to soccer-specific patterns. By the end, you will have a working toolkit sufficient to tackle the analyses in every subsequent chapter of this book.

4.2 Environment Setup

A well-configured environment prevents countless headaches later. This section establishes professional practices from the start.

4.2.1 Python Installation and Virtual Environments

Recommended Setup:

# Install Python 3.10+ (via python.org, Anaconda, or pyenv)

# Create a virtual environment for soccer analytics
python -m venv soccer-analytics-env

# Activate (Windows)
soccer-analytics-env\Scripts\activate

# Activate (macOS/Linux)
source soccer-analytics-env/bin/activate

# Install core packages
pip install pandas numpy matplotlib seaborn scipy scikit-learn
pip install statsbombpy mplsoccer jupyter

Why Virtual Environments?

Each project should have its own isolated environment to:

  • Avoid version conflicts between projects
  • Ensure reproducibility
  • Make deployment easier
  • Allow clean dependency tracking

Best Practice: When starting a new soccer analytics project, always create a fresh virtual environment and install packages incrementally as you need them. Then run pip freeze > requirements.txt to capture your exact dependencies. This small discipline saves enormous headaches when sharing projects with colleagues or deploying analyses to production servers at a club. A colleague should be able to run pip install -r requirements.txt and reproduce your entire environment.
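For example, the two commands that make this workflow reproducible are:

# Capture exact dependencies once your environment works
pip freeze > requirements.txt

# A colleague recreates the same environment inside their own virtual environment
pip install -r requirements.txt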

4.2.2 Project Structure

Organize your analytics projects consistently:

soccer-project/
├── data/
│   ├── raw/              # Original, immutable data
│   ├── processed/        # Cleaned, transformed data
│   └── external/         # Data from external sources
├── notebooks/            # Jupyter notebooks for exploration
├── src/
│   ├── __init__.py
│   ├── data/             # Data loading and processing
│   ├── features/         # Feature engineering
│   ├── models/           # Statistical models
│   └── visualization/    # Plotting functions
├── tests/                # Unit tests
├── outputs/
│   ├── figures/          # Generated visualizations
│   └── reports/          # Analysis reports
├── requirements.txt      # Dependencies
├── README.md
└── config.py             # Configuration settings

The key principle is separation of concerns: raw data is kept separate from processed data, source code is separate from notebooks, and outputs are separate from inputs. The data/raw/ directory should be treated as immutable---never modify original data files. Instead, write processing scripts that read from raw/ and write to processed/.
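As a minimal sketch of this pattern, a processing script might look like the following (the file names are illustrative):

"""Example processing script: reads from data/raw/, writes to data/processed/."""
import pandas as pd

# Read the original, immutable file
raw_matches = pd.read_csv("data/raw/matches.csv", parse_dates=["match_date"])

# All cleaning happens in memory --- the raw file is never modified
cleaned = raw_matches.dropna(subset=["home_team", "away_team"]).drop_duplicates("match_id")

# Write the cleaned version to processed/
cleaned.to_csv("data/processed/matches_clean.csv", index=False)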

4.2.3 Configuration Management

Create a config.py for project settings:

"""Project configuration settings."""

from pathlib import Path

# Paths
PROJECT_ROOT = Path(__file__).parent
DATA_DIR = PROJECT_ROOT / "data"
RAW_DATA_DIR = DATA_DIR / "raw"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
OUTPUT_DIR = PROJECT_ROOT / "outputs"
FIGURES_DIR = OUTPUT_DIR / "figures"

# Data settings
STATSBOMB_COMPETITION_ID = 43  # World Cup
STATSBOMB_SEASON_ID = 3        # 2018

# Visualization settings
FIGURE_DPI = 150
FIGURE_FORMAT = "png"

# Soccer pitch dimensions (StatsBomb)
PITCH_LENGTH = 120
PITCH_WIDTH = 80

# Ensure directories exist
for dir_path in [RAW_DATA_DIR, PROCESSED_DATA_DIR, FIGURES_DIR]:
    dir_path.mkdir(parents=True, exist_ok=True)

Centralizing configuration in a single file means that when you switch from analyzing the 2018 World Cup to the 2022-23 Premier League, you change one file rather than hunting through dozens of scripts for hard-coded values. Every script in the project imports from config.py:

from config import RAW_DATA_DIR, FIGURE_DPI

# Now use these constants throughout your analysis
data = pd.read_csv(RAW_DATA_DIR / "matches.csv")
plt.savefig("my_plot.png", dpi=FIGURE_DPI)

4.2.4 Jupyter Notebooks Best Practices

Notebooks are excellent for exploration but can become messy. Follow these guidelines:

Good Practices:

  • Use clear, descriptive cell headers with Markdown
  • Keep cells focused on single tasks
  • Move reusable code to .py modules as soon as it stabilizes
  • Restart kernel and run all before sharing
  • Clear output before committing to version control

Anti-Patterns to Avoid:

  • Running cells out of order (creates hidden state bugs)
  • Putting all code in one massive notebook
  • Leaving commented-out experimental code everywhere
  • Defining the same function in multiple notebooks

Example Notebook Structure:

# Cell 1: Imports and setup
"""
# Match Analysis Notebook
Analyzing passing patterns in World Cup 2018 matches.
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsbombpy import sb

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

# Cell 2: Load data
matches = sb.matches(competition_id=43, season_id=3)
print(f"Loaded {len(matches)} matches")

# Cell 3: Analysis
# ... focused analysis code ...

# Cell 4: Visualization
# ... plotting code ...

Common Pitfall: Notebooks that work on your machine may fail on a colleague's because of hidden state---cells run in a different order, variables defined in deleted cells, or reliance on global variables. Before sharing a notebook, always do "Restart Kernel and Run All" to verify it executes cleanly from top to bottom.
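If you want to automate that check, nbconvert can execute a notebook top to bottom from the command line (the notebook name here is illustrative):

# Execute the notebook from a clean state; the command fails if any cell errors
jupyter nbconvert --to notebook --execute match_analysis.ipynb --output match_analysis_checked.ipynb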

4.3 Pandas for Soccer Data

pandas is the cornerstone of soccer analytics. This section covers essential operations with soccer-specific examples.

4.3.1 DataFrames and Series

A DataFrame is a two-dimensional labeled data structure---think of it as a spreadsheet or SQL table in Python. A Series is a single column of a DataFrame.

Creating DataFrames from Match Data:

import pandas as pd

# From a list of dictionaries (common format when receiving data from APIs)
match_data = [
    {'match_id': 1, 'home_team': 'France', 'away_team': 'Croatia',
     'home_goals': 4, 'away_goals': 2, 'home_xg': 2.35, 'away_xg': 1.78},
    {'match_id': 2, 'home_team': 'Belgium', 'away_team': 'England',
     'home_goals': 2, 'away_goals': 0, 'home_xg': 1.65, 'away_xg': 0.92},
]

# pd.DataFrame() converts the list of dictionaries into a tabular structure.
# Each dictionary becomes a row; keys become column names.
df = pd.DataFrame(match_data)
print(df)

Key DataFrame Attributes:

# Shape tells you (number_of_rows, number_of_columns)
print(f"Shape: {df.shape}")

# columns lists all column names as an Index object
print(f"Columns: {df.columns.tolist()}")

# dtypes shows the data type of each column
# Watch for 'object' type --- it often means strings or mixed types
print(f"Data types:\n{df.dtypes}")

# describe() computes summary statistics for all numeric columns
print(df.describe())

# memory_usage(deep=True) shows actual memory consumption
# deep=True is needed for accurate measurement of string columns
print(f"Memory: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

4.3.2 Loading Soccer Data

From CSV Files:

# Load match data with explicit data types and date parsing.
# Specifying dtypes upfront prevents pandas from guessing incorrectly
# and saves memory for large files.
matches = pd.read_csv(
    'data/matches.csv',
    parse_dates=['match_date'],  # Convert string dates to datetime objects
    dtype={'match_id': 'int64', 'home_team': 'category'}  # category saves memory
)

# Load event data (large files --- be selective about columns)
# usecols avoids loading columns you do not need, saving memory and time.
events = pd.read_csv(
    'data/events.csv',
    usecols=['event_id', 'match_id', 'type', 'player', 'team'],
    dtype={'match_id': 'int32'}  # int32 uses half the memory of int64
)

From StatsBomb API:

from statsbombpy import sb

# List available competitions
competitions = sb.competitions()
print(competitions[['competition_id', 'competition_name', 'season_name']])

# Load World Cup 2018 matches
matches = sb.matches(competition_id=43, season_id=3)

# Load events for a specific match
events = sb.events(match_id=7298)  # World Cup Final

# Convert to more efficient dtypes after loading
events['minute'] = events['minute'].astype('int16')
events['second'] = events['second'].fillna(0).astype('int16')

Common Pitfall: When loading event data from StatsBomb or other providers, always inspect the data types and handle missing values before performing analysis. Many columns contain nested structures (lists, dictionaries) that pandas stores as Python objects. Extracting coordinates from the location column, for example, requires explicit parsing. Failing to handle NaN values in pass outcomes or shot details will cause silent errors in aggregation calculations.
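A few quick checks run right after loading catch most of these issues. This sketch assumes the StatsBomb events frame loaded above:

# How many columns are plain Python objects (strings, lists, dicts)?
print(events.dtypes.value_counts())

# location is stored as a Python list, not as two numeric columns
print(type(events['location'].dropna().iloc[0]))

# Proportion of passes with no recorded outcome (see Section 4.3.7 for why this matters)
passes = events[events['type'] == 'Pass']
print(f"Passes without an outcome value: {passes['pass_outcome'].isna().mean():.1%}")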

Working with JSON Match Data:

Many data providers deliver match data in JSON format. Understanding how to parse nested JSON into flat DataFrames is an essential skill.

import json

def load_json_events(filepath: str) -> pd.DataFrame:
    """
    Load event data from a JSON file and flatten nested structures.

    Parameters
    ----------
    filepath : str
        Path to the JSON events file.

    Returns
    -------
    pd.DataFrame
        Flattened event data with one row per event.
    """
    with open(filepath, 'r', encoding='utf-8') as f:
        raw_events = json.load(f)

    # pd.json_normalize flattens nested dictionaries into columns.
    # For example, {'shot': {'xg': 0.35}} becomes a column 'shot.xg'.
    df = pd.json_normalize(raw_events, sep='_')

    return df

# Usage
events = load_json_events('data/raw/events_7298.json')
print(f"Loaded {len(events)} events with {len(events.columns)} columns")
print(events.columns[:20].tolist())  # Inspect first 20 column names

Best Practice: When working with JSON data, use pd.json_normalize() instead of manually parsing dictionaries. It handles nested structures automatically, creating column names from the nested keys separated by a delimiter. For deeply nested data (e.g., StatsBomb's freeze frame data), you may need multiple normalization passes.
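A sketch of such a second normalization pass, using a hypothetical nested structure in the StatsBomb style (the field names are illustrative):

# One event with a nested list of freeze-frame entries (illustrative structure)
raw = [{
    'id': 'event-1',
    'shot': {
        'statsbomb_xg': 0.12,
        'freeze_frame': [
            {'location': [100.0, 35.0], 'teammate': True},
            {'location': [110.0, 42.0], 'teammate': False},
        ],
    },
}]

# First pass: one row per event, nested dicts flattened into columns
events_flat = pd.json_normalize(raw, sep='_')

# Second pass: one row per freeze-frame entry, keeping the parent event id
freeze_frames = pd.json_normalize(raw, record_path=['shot', 'freeze_frame'], meta=['id'], sep='_')
print(freeze_frames)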

4.3.3 Selecting and Filtering Data

Column Selection:

# Single column (returns Series --- a one-dimensional labeled array)
goals = df['home_goals']

# Multiple columns (returns DataFrame --- preserves two-dimensional structure)
scores = df[['home_goals', 'away_goals']]

# Using loc for both rows and columns
# loc uses label-based indexing: loc[row_labels, column_labels]
subset = df.loc[:, ['match_id', 'home_team', 'home_goals']]

Row Filtering (Boolean Indexing):

# Simple condition: the expression inside [] creates a boolean Series
high_scoring = df[df['home_goals'] + df['away_goals'] >= 4]

# Multiple conditions: use & for AND, | for OR
# IMPORTANT: each condition must be wrapped in parentheses
france_wins = df[(df['home_team'] == 'France') & (df['home_goals'] > df['away_goals'])]

# Using query (cleaner syntax for complex conditions)
# Variables from the local scope can be referenced with @
result = df.query('home_goals > away_goals and home_xg < away_xg')

# Filter events by type
passes = events[events['type'] == 'Pass']
shots = events.query("type == 'Shot'")

Practical Example: Finding Specific Events

def get_player_shots(events: pd.DataFrame, player_name: str) -> pd.DataFrame:
    """
    Get all shots by a specific player.

    Parameters
    ----------
    events : pd.DataFrame
        Event data with 'type' and 'player' columns
    player_name : str
        Name of the player to filter for

    Returns
    -------
    pd.DataFrame
        Filtered DataFrame containing only shots by the specified player.
        Returns a copy to avoid SettingWithCopyWarning.
    """
    return events.query(
        "type == 'Shot' and player == @player_name"
    ).copy()

# Usage
mbappe_shots = get_player_shots(events, 'Kylian Mbappé')
print(f"Mbappé shots: {len(mbappe_shots)}")

4.3.4 Data Transformation

Adding Calculated Columns:

# Goal difference: simple arithmetic on two columns
df['goal_diff'] = df['home_goals'] - df['away_goals']

# xG difference
df['xg_diff'] = df['home_xg'] - df['away_xg']

# Points using apply() --- works but is slow for large DataFrames
# apply() calls a Python function once per row, which is not vectorized.
df['home_points'] = df['goal_diff'].apply(
    lambda x: 3 if x > 0 else (1 if x == 0 else 0)
)

# More efficient with np.where --- fully vectorized, runs in C
import numpy as np
df['home_points'] = np.where(
    df['goal_diff'] > 0, 3,
    np.where(df['goal_diff'] == 0, 1, 0)
)

# Using np.select for multiple conditions --- cleanest syntax for 3+ cases
conditions = [
    df['goal_diff'] > 0,   # Home win
    df['goal_diff'] == 0,   # Draw
    df['goal_diff'] < 0     # Away win
]
choices = [3, 1, 0]
df['home_points'] = np.select(conditions, choices)

Best Practice: Avoid .apply() with lambda functions whenever possible. It processes rows one at a time in Python, which is orders of magnitude slower than vectorized NumPy operations. For the points calculation above, np.select is about 50-100x faster than apply on a typical season dataset.

Working with Coordinates:

def extract_coordinates(events: pd.DataFrame) -> pd.DataFrame:
    """
    Extract x, y coordinates from the location column.

    StatsBomb stores locations as Python lists [x, y] inside a column.
    This function unpacks those lists into separate numeric columns.
    """
    df = events.copy()

    # isinstance check prevents errors when location is NaN (e.g., for
    # events like half-start that have no spatial position).
    df['x'] = df['location'].apply(
        lambda loc: loc[0] if isinstance(loc, list) and len(loc) >= 2 else None
    )
    df['y'] = df['location'].apply(
        lambda loc: loc[1] if isinstance(loc, list) and len(loc) >= 2 else None
    )

    # Convert to float for numerical operations
    df['x'] = pd.to_numeric(df['x'], errors='coerce')
    df['y'] = pd.to_numeric(df['y'], errors='coerce')

    return df

# For end location (passes, carries)
def extract_end_coordinates(events: pd.DataFrame) -> pd.DataFrame:
    """Extract end location for passes and carries."""
    df = events.copy()

    df['end_x'] = df['pass_end_location'].apply(
        lambda loc: loc[0] if isinstance(loc, list) else None
    )
    df['end_y'] = df['pass_end_location'].apply(
        lambda loc: loc[1] if isinstance(loc, list) else None
    )

    return df

4.3.5 Grouping and Aggregation

Aggregation is fundamental to soccer analytics---calculating per-player, per-team, or per-match statistics.

Intuition: The groupby operation is the single most important pandas pattern for soccer analytics. Nearly every meaningful statistic---passes per 90 minutes, team shot conversion rates, player progressive carry distances---requires grouping data by some category (player, team, match) and then aggregating. Mastering groupby with agg, transform, and apply will unlock the vast majority of analyses you need to perform.

Basic Groupby:

# Goals per team: group by team name, sum goals column
team_goals = df.groupby('home_team')['home_goals'].sum()

# Multiple aggregations on multiple columns using agg()
# The dict maps column names to lists of aggregation functions.
team_stats = df.groupby('home_team').agg({
    'home_goals': ['sum', 'mean'],
    'home_xg': ['sum', 'mean'],
    'match_id': 'count'
}).round(2)

# agg() with multiple functions creates a MultiIndex on columns.
# Flatten it by joining the two levels with an underscore.
team_stats.columns = ['_'.join(col) for col in team_stats.columns]
team_stats = team_stats.rename(columns={'match_id_count': 'matches'})

Per-90-Minute Statistics:

One of the most common normalizations in soccer analytics is converting raw counts to "per 90 minutes" rates. This allows fair comparison between players who have played different amounts of time.

def per_90(count: pd.Series, minutes: pd.Series) -> pd.Series:
    """
    Convert raw counts to per-90-minute rates.

    Parameters
    ----------
    count : pd.Series
        Raw event counts (e.g., passes, shots, tackles)
    minutes : pd.Series
        Minutes played by each player

    Returns
    -------
    pd.Series
        Per-90 rates. Returns NaN for players with fewer than 90 minutes.
    """
    # Avoid division by zero and unreliable small-sample rates; wrap the
    # NumPy result in a Series so the original index is preserved
    return pd.Series(
        np.where(minutes >= 90, count / minutes * 90, np.nan),
        index=count.index
    )

# Example usage
player_stats['passes_per_90'] = per_90(player_stats['passes'], player_stats['minutes'])
player_stats['shots_per_90'] = per_90(player_stats['shots'], player_stats['minutes'])

Player Statistics from Events:

def calculate_player_stats(events: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate per-player statistics from event data.

    Parameters
    ----------
    events : pd.DataFrame
        Event data with columns: player, team, type, etc.

    Returns
    -------
    pd.DataFrame
        Aggregated player statistics with one row per player.
    """
    # Filter out events without a player (e.g., ball receipts)
    player_events = events[events['player'].notna()].copy()

    # Count by event type using unstack to pivot type into columns
    event_counts = player_events.groupby(
        ['player', 'team', 'type']
    ).size().unstack(fill_value=0)

    # Build a clean stats DataFrame
    stats = pd.DataFrame({
        'team': player_events.groupby('player')['team'].first(),
        'passes': event_counts.get('Pass', 0),
        'shots': event_counts.get('Shot', 0),
        'tackles': event_counts.get('Tackle', 0),
        'carries': event_counts.get('Carry', 0),
    })

    # Add minutes played (approximate from last event timestamp)
    if 'minute' in events.columns:
        stats['minutes'] = player_events.groupby('player')['minute'].max()

    return stats.reset_index()

# Calculate and display top passers
player_stats = calculate_player_stats(events)
print(player_stats.nlargest(10, 'passes'))

Using transform() for Within-Group Calculations:

The transform() method returns a Series with the same index as the input, making it ideal for adding group-level statistics back to the original DataFrame.

# Add team average xG to each row
events['team_avg_xg'] = events.groupby('team')['shot_statsbomb_xg'].transform('mean')

# Add player rank within team by passes
player_stats['pass_rank'] = player_stats.groupby('team')['passes'].rank(
    ascending=False, method='min'
)

Team Match Statistics:

def calculate_match_team_stats(events: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate team-level statistics for each match.

    Returns one row per team per match.
    """
    # Named aggregation keeps output column names flat. (The older style of
    # passing a dict of (name, function) tuples raises a SpecificationError
    # in modern pandas.) The goals count assumes StatsBomb's shot_outcome column.
    stats = events.groupby(['match_id', 'team']).agg(
        passes=('type', lambda x: (x == 'Pass').sum()),
        shots=('type', lambda x: (x == 'Shot').sum()),
        goals=('shot_outcome', lambda x: (x == 'Goal').sum()),
        minutes=('minute', 'max'),
    )

    return stats.reset_index()

4.3.6 Merging and Joining DataFrames

Soccer analysis often requires combining multiple data sources.

Types of Merges:

# Inner merge: only rows with matching keys in BOTH DataFrames
merged = pd.merge(events, matches, on='match_id', how='inner')

# Left merge: ALL rows from left DataFrame, matching rows from right.
# Non-matching right rows are filled with NaN.
events_with_match_info = pd.merge(
    events,
    matches[['match_id', 'home_team', 'away_team', 'match_date']],
    on='match_id',
    how='left'
)

# Merge player stats with biographical info
player_bio = pd.DataFrame({
    'player': ['Kylian Mbappé', 'Lionel Messi'],
    'birth_year': [1998, 1987],
    'position': ['Forward', 'Forward']
})

player_full = pd.merge(
    player_stats,
    player_bio,
    on='player',
    how='left'  # Keep all players, even those without bio info
)

Common Pitfall: When merging, always check for duplicate keys. If both DataFrames have multiple rows with the same key, the merge produces a Cartesian product (every combination), which can explode the row count unexpectedly. After any merge, verify the result shape: print(f"Before: {len(df1)}, After: {len(merged)}").
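pandas can enforce this expectation for you through the validate argument. A sketch using the frames above:

# Raise an error instead of silently exploding rows if matches has duplicate match_ids
merged = pd.merge(
    events,
    matches[['match_id', 'home_team', 'away_team']],
    on='match_id',
    how='left',
    validate='many_to_one'  # many events per match, at most one match row per match_id
)
print(f"Before: {len(events)}, After: {len(merged)}")  # equal for a valid many-to-one left merge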

Combining Home and Away Data:

A common data transformation task is converting match-level data (one row per match) into team-match data (two rows per match---one from each team's perspective).

def create_team_match_data(matches: pd.DataFrame) -> pd.DataFrame:
    """
    Convert match data to team-match format.

    Each match becomes two rows: one for home team, one for away team.
    This format is essential for calculating team-level season statistics.
    """
    # Home team perspective
    home = matches[['match_id', 'home_team', 'away_team',
                    'home_goals', 'away_goals', 'home_xg', 'away_xg']].copy()
    home.columns = ['match_id', 'team', 'opponent', 'goals_for',
                    'goals_against', 'xg_for', 'xg_against']
    home['venue'] = 'home'

    # Away team perspective: same columns but swapped
    away = matches[['match_id', 'away_team', 'home_team',
                    'away_goals', 'home_goals', 'away_xg', 'home_xg']].copy()
    away.columns = ['match_id', 'team', 'opponent', 'goals_for',
                    'goals_against', 'xg_for', 'xg_against']
    away['venue'] = 'away'

    # Combine both perspectives
    team_matches = pd.concat([home, away], ignore_index=True)

    # Add derived columns
    team_matches['goal_diff'] = team_matches['goals_for'] - team_matches['goals_against']
    team_matches['xg_diff'] = team_matches['xg_for'] - team_matches['xg_against']

    return team_matches.sort_values(['team', 'match_id'])

4.3.7 Handling Missing Data in Soccer Datasets

Missing data is pervasive in soccer datasets. Not every event has every attribute---a pass has no shot_xg, a shot has no pass_end_location, and some events lack coordinate data entirely.

def audit_missing_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create a summary of missing data in each column.

    Returns a DataFrame sorted by percentage missing, descending.
    """
    missing = df.isnull().sum()
    percent = (missing / len(df)) * 100
    summary = pd.DataFrame({
        'missing_count': missing,
        'percent_missing': percent.round(1),
        'dtype': df.dtypes
    })
    return summary[summary['missing_count'] > 0].sort_values(
        'percent_missing', ascending=False
    )

# Common strategies for handling missing soccer data:

# 1. Fill with a default value (appropriate for known-absent data)
events['shot_statsbomb_xg'] = events['shot_statsbomb_xg'].fillna(0)

# 2. Forward-fill for sequential data (e.g., game state)
events['score_home'] = events['score_home'].ffill()

# 3. Drop rows where a critical column is missing
shots = events[events['type'] == 'Shot'].dropna(subset=['location'])

# 4. Interpolate for continuous variables
player_tracking['speed'] = player_tracking['speed'].interpolate(method='linear')

Best Practice: Before dropping or filling missing data, always understand why it is missing. In StatsBomb data, a missing pass_outcome means the pass was successful (they only record failed outcomes). Filling it with "Unknown" or dropping it would be a serious error. Read the data documentation carefully.
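For example, the correct way to compute pass completion from StatsBomb events is to treat a missing outcome as a completed pass:

# Completed passes are those with NO recorded outcome
passes = events[events['type'] == 'Pass']
completion_rate = passes['pass_outcome'].isna().mean()
print(f"Pass completion: {completion_rate:.1%} of {len(passes)} passes")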

4.3.8 Time Series Operations

Match data is inherently temporal. pandas provides powerful time series tools.

# Parse dates from strings to datetime objects
matches['match_date'] = pd.to_datetime(matches['match_date'])

# Set date as index for time series operations
matches_ts = matches.set_index('match_date').sort_index()

# Resample to weekly totals --- useful for plotting trends
weekly_goals = matches_ts.resample('W')['home_goals'].sum()

# Rolling averages (e.g., 5-match moving average of xG)
# transform() keeps the result aligned with the original DataFrame
matches['rolling_xg'] = matches.groupby('home_team')['home_xg'].transform(
    lambda x: x.rolling(5, min_periods=1).mean()
)

# Cumulative statistics across a season
matches['cumulative_goals'] = matches.groupby('home_team')['home_goals'].cumsum()

# Exponentially weighted moving average (more weight on recent matches)
matches['ewm_xg'] = matches.groupby('home_team')['home_xg'].transform(
    lambda x: x.ewm(span=5).mean()
)

4.4 NumPy for Numerical Computing

NumPy provides the numerical foundation for all Python data science. Understanding NumPy operations enables efficient analytics code.

4.4.1 Array Basics

import numpy as np

# Create arrays from soccer data
goals = np.array([2, 1, 0, 3, 1, 2, 0, 1])
xG = np.array([1.5, 1.2, 0.8, 2.1, 0.9, 1.8, 0.5, 1.1])

# Basic operations are element-wise: each element is processed independently
goal_diff_from_xG = goals - xG
print(f"Over/underperformance: {goal_diff_from_xG}")

# Aggregations collapse the array to a single value
print(f"Total goals: {goals.sum()}")
print(f"Mean xG: {xG.mean():.2f}")
print(f"Total overperformance: {goal_diff_from_xG.sum():.2f}")

Understanding Array Shapes:

# 1-D array: a single list of values (e.g., one player's match xG)
player_xg = np.array([0.3, 0.5, 0.1, 0.8])
print(f"Shape: {player_xg.shape}")  # (4,)

# 2-D array: a matrix (e.g., xG for multiple players across matches)
team_xg = np.array([
    [0.3, 0.5, 0.1, 0.8],  # Player 1
    [0.2, 0.4, 0.3, 0.6],  # Player 2
    [0.1, 0.0, 0.5, 0.2],  # Player 3
])
print(f"Shape: {team_xg.shape}")  # (3, 4) --- 3 players, 4 matches

# Sum across axis 1 (columns) to get each player's total xG
print(f"Player totals: {team_xg.sum(axis=1)}")  # [1.7, 1.5, 0.8]

# Sum across axis 0 (rows) to get each match's total xG
print(f"Match totals: {team_xg.sum(axis=0)}")   # [0.6, 0.9, 0.9, 1.6]

4.4.2 Vectorized Operations

NumPy operations are much faster than Python loops because they execute in compiled C code rather than interpreted Python.

Intuition: The speed difference between Python loops and NumPy vectorized operations is not a minor optimization---it is often the difference between an analysis that runs in milliseconds and one that takes minutes. When processing a full season of event data (hundreds of thousands of rows), a distance calculation implemented as a Python loop can take 30+ seconds. The same calculation vectorized with NumPy completes in under 50 milliseconds. Always think in terms of array operations rather than element-by-element processing.

# Slow: Python loop --- processes one element at a time
def calculate_points_loop(goal_diffs):
    points = []
    for gd in goal_diffs:
        if gd > 0:
            points.append(3)
        elif gd == 0:
            points.append(1)
        else:
            points.append(0)
    return points

# Fast: NumPy vectorized --- processes the entire array at once
def calculate_points_numpy(goal_diffs):
    return np.where(goal_diffs > 0, 3, np.where(goal_diffs == 0, 1, 0))

# Timing comparison
import time

goal_diffs = np.random.randint(-5, 6, 100000)

start = time.time()
_ = calculate_points_loop(goal_diffs)
loop_time = time.time() - start

start = time.time()
_ = calculate_points_numpy(goal_diffs)
numpy_time = time.time() - start

print(f"Loop: {loop_time:.4f}s, NumPy: {numpy_time:.4f}s")
print(f"NumPy is {loop_time/numpy_time:.0f}x faster")

4.4.3 Statistical Functions

# Sample data: conversion rates for 8 strikers
conversion_rates = np.array([0.12, 0.18, 0.15, 0.22, 0.10, 0.14, 0.16, 0.19])

# Central tendency
print(f"Mean: {np.mean(conversion_rates):.3f}")
print(f"Median: {np.median(conversion_rates):.3f}")

# Spread --- ddof=1 gives the sample standard deviation (Bessel's correction)
print(f"Std Dev: {np.std(conversion_rates, ddof=1):.3f}")
print(f"Variance: {np.var(conversion_rates, ddof=1):.4f}")

# Percentiles
print(f"25th percentile: {np.percentile(conversion_rates, 25):.3f}")
print(f"75th percentile: {np.percentile(conversion_rates, 75):.3f}")

# Correlation matrix between xG and goals for 8 teams
xG = np.array([55, 48, 72, 61, 53, 68, 45, 58])
goals = np.array([52, 45, 75, 58, 50, 70, 42, 55])
# np.corrcoef returns a 2x2 matrix; [0, 1] is the cross-correlation
print(f"Correlation: {np.corrcoef(xG, goals)[0, 1]:.3f}")

4.4.4 Distance and Spatial Calculations

Soccer analytics frequently involves spatial calculations. These functions form the foundation of xG models and positional analysis.

def euclidean_distance(x1, y1, x2, y2):
    """
    Calculate Euclidean distance between two points.

    Works with both scalar values and NumPy arrays (vectorized).
    """
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)

def distance_to_goal(x, y, goal_x=120, goal_y=40):
    """
    Calculate distance from event to center of goal.

    Uses StatsBomb coordinate system where the goal is at x=120, y=40.
    """
    return euclidean_distance(x, y, goal_x, goal_y)

def shot_angle(x, y, goal_y=40, post_width=7.32):
    """
    Calculate the visible angle to goal from a shot position.

    This is one of the most important features in xG models.
    A wider angle means more of the goal is visible to the shooter.

    Parameters
    ----------
    x : float or np.ndarray
        X coordinate (0-120 in StatsBomb system)
    y : float or np.ndarray
        Y coordinate (0-80 in StatsBomb system)
    goal_y : float
        Y coordinate of goal center
    post_width : float
        Width of the goal (7.32 meters)

    Returns
    -------
    float or np.ndarray
        Angle in radians
    """
    goal_line = 120
    left_post = goal_y - post_width / 2    # 36.34
    right_post = goal_y + post_width / 2   # 43.66

    # Components of the vectors from the shot position to each post,
    # written element-wise so the function works for scalars and arrays alike
    dx = goal_line - x
    dy_left = left_post - y
    dy_right = right_post - y

    # Angle between the two vectors using the dot product formula:
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = dx * dx + dy_left * dy_right
    norm_left = np.sqrt(dx**2 + dy_left**2)
    norm_right = np.sqrt(dx**2 + dy_right**2)
    cos_angle = dot / (norm_left * norm_right)

    # Clip to [-1, 1] to avoid numerical errors with arccos
    return np.arccos(np.clip(cos_angle, -1, 1))

# Example: angle from penalty spot (12 yards = ~11m from goal, centered)
penalty_angle = shot_angle(108, 40)
print(f"Penalty spot angle: {np.degrees(penalty_angle):.1f}°")

# Example: angle from a tight angle on the wing
wing_angle = shot_angle(110, 70)
print(f"Tight wing angle: {np.degrees(wing_angle):.1f}°")

4.4.5 Random Number Generation for Simulations

Monte Carlo simulation is a core technique in soccer analytics, used for everything from match outcome probabilities to season projections.

# Set seed for reproducibility --- always do this in published analyses
rng = np.random.default_rng(42)

# Simulate goals from xG using Poisson distribution
xG = 1.8
simulated_goals = rng.poisson(xG, size=10000)
print(f"Mean simulated goals: {simulated_goals.mean():.2f}")
print(f"P(score 2+): {(simulated_goals >= 2).mean():.3f}")

# Simulate match outcomes
def simulate_match(home_xg, away_xg, n_simulations=10000):
    """
    Simulate match outcomes using independent Poisson model.

    Each team's goals are drawn from a Poisson distribution with
    their xG as the rate parameter. This is a simple but effective
    model for converting pre-match xG into win/draw/loss probabilities.

    Parameters
    ----------
    home_xg : float
        Home team expected goals
    away_xg : float
        Away team expected goals
    n_simulations : int
        Number of Monte Carlo simulations

    Returns
    -------
    dict
        Probabilities for home win, draw, away win
    """
    rng = np.random.default_rng()

    home_goals = rng.poisson(home_xg, n_simulations)
    away_goals = rng.poisson(away_xg, n_simulations)

    return {
        'home_win': (home_goals > away_goals).mean(),
        'draw': (home_goals == away_goals).mean(),
        'away_win': (home_goals < away_goals).mean()
    }

result = simulate_match(1.8, 1.2)
print(f"Home win: {result['home_win']:.1%}")
print(f"Draw: {result['draw']:.1%}")
print(f"Away win: {result['away_win']:.1%}")

4.4.6 Boolean Masking and Fancy Indexing

Boolean masks are essential for filtering arrays based on conditions without loops.

# Shot data arrays
distances = np.array([8.5, 22.3, 5.1, 31.0, 12.4, 18.7, 6.2, 25.8])
xg_values = np.array([0.25, 0.04, 0.45, 0.02, 0.12, 0.06, 0.38, 0.03])
outcomes = np.array([0, 0, 1, 0, 1, 0, 1, 0])  # 1 = goal, 0 = no goal

# Boolean mask: shots inside the box (roughly < 18 meters)
inside_box = distances < 18
print(f"Shots inside box: {inside_box.sum()}")
print(f"Avg xG inside box: {xg_values[inside_box].mean():.3f}")
print(f"Conversion inside box: {outcomes[inside_box].mean():.1%}")

# Combine masks
high_quality = (xg_values > 0.10) & (distances < 20)
print(f"High-quality chances: {high_quality.sum()}")

4.5 Visualization with Matplotlib and Seaborn

Effective visualization is essential for communicating soccer analytics insights. A well-crafted chart can convey patterns that would take paragraphs of text to describe. This section covers the most important chart types for soccer analytics, with production-ready code examples.

4.5.1 Matplotlib Fundamentals

Matplotlib uses a two-level API. The Figure is the entire image; Axes are individual plots within the figure. Always use the object-oriented API (fig, ax = plt.subplots()) rather than the stateful plt.plot() interface for production code.

import matplotlib.pyplot as plt

# Create a figure with one axes
fig, ax = plt.subplots(figsize=(10, 6))

# Simulate season xG progression
matches = range(1, 39)
home_xg = np.random.normal(1.5, 0.4, 38)
cumulative_xg = np.cumsum(home_xg)

# Plot actual xG and expected trend
ax.plot(matches, cumulative_xg, 'b-', linewidth=2, label='Actual xG')
ax.plot(matches, np.arange(1, 39) * 1.5, 'r--', alpha=0.5, label='Expected (1.5/match)')

# Labels and formatting
ax.set_xlabel('Match Number', fontsize=12)
ax.set_ylabel('Cumulative xG', fontsize=12)
ax.set_title('Season xG Progression', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('xg_progression.png', dpi=150)
plt.show()

Multi-Panel Figures:

# Create a 2x2 grid of subplots for a comprehensive match report
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Top-left: xG timeline
axes[0, 0].set_title('xG Timeline')

# Top-right: Shot map (placeholder)
axes[0, 1].set_title('Shot Map')

# Bottom-left: Pass network (placeholder)
axes[1, 0].set_title('Pass Network')

# Bottom-right: Player radar (placeholder)
axes[1, 1].set_title('Player Radar')

fig.suptitle('Match Report: France vs Croatia', fontsize=16, y=1.02)
plt.tight_layout()

4.5.2 Bar Charts for Comparisons

def plot_team_comparison(teams, metric1, metric2, label1, label2, title):
    """Create a grouped bar chart comparing two metrics across teams."""
    fig, ax = plt.subplots(figsize=(12, 6))

    x = np.arange(len(teams))
    width = 0.35

    bars1 = ax.bar(x - width/2, metric1, width, label=label1, color='steelblue')
    bars2 = ax.bar(x + width/2, metric2, width, label=label2, color='coral')

    ax.set_xlabel('Team')
    ax.set_ylabel('Value')
    ax.set_title(title)
    ax.set_xticks(x)
    ax.set_xticklabels(teams, rotation=45, ha='right')
    ax.legend()

    # Add value labels on top of each bar in both groups
    for bar in list(bars1) + list(bars2):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                f'{bar.get_height():.1f}', ha='center', va='bottom', fontsize=8)

    plt.tight_layout()
    return fig, ax

# Example usage
teams = ['Man City', 'Liverpool', 'Chelsea', 'Arsenal', 'Tottenham']
xg = [82.5, 78.3, 65.2, 68.9, 61.4]
goals = [85, 82, 63, 70, 58]

fig, ax = plot_team_comparison(teams, xg, goals, 'xG', 'Goals', 'xG vs Actual Goals')
plt.savefig('team_comparison.png', dpi=150)

4.5.3 Scatter Plots with Regression

Scatter plots with regression lines are the workhorse visualization for exploring relationships between metrics.

from scipy import stats

def plot_correlation(x, y, xlabel, ylabel, title, team_labels=None):
    """
    Create scatter plot with regression line and statistics.

    Parameters
    ----------
    x, y : array-like
        Data to plot
    xlabel, ylabel : str
        Axis labels
    title : str
        Plot title
    team_labels : list of str, optional
        Labels for each point (e.g., team names)
    """
    fig, ax = plt.subplots(figsize=(8, 6))

    # Scatter plot
    ax.scatter(x, y, alpha=0.6, s=50)

    # Regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    x_line = np.linspace(x.min(), x.max(), 100)
    y_line = slope * x_line + intercept
    ax.plot(x_line, y_line, 'r-', linewidth=2,
            label=f'y = {intercept:.1f} + {slope:.2f}x')

    # Add statistics annotation
    ax.annotate(
        f'r = {r_value:.3f}\nR\u00b2 = {r_value**2:.3f}\np = {p_value:.4f}',
        xy=(0.05, 0.95), xycoords='axes fraction',
        fontsize=10, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5)
    )

    # Optional: label each point with team name
    if team_labels is not None:
        for i, label in enumerate(team_labels):
            ax.annotate(label, (x[i], y[i]), fontsize=7,
                       xytext=(5, 5), textcoords='offset points')

    ax.set_xlabel(xlabel, fontsize=12)
    ax.set_ylabel(ylabel, fontsize=12)
    ax.set_title(title, fontsize=14)
    ax.legend(loc='lower right')
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    return fig, ax

# Example: xG vs Goals
xg = np.array([55, 62, 48, 71, 58, 65, 52, 68, 45, 75])
goals = np.array([52, 65, 45, 68, 60, 62, 50, 70, 42, 78])

fig, ax = plot_correlation(xg, goals, 'Expected Goals (xG)', 'Actual Goals',
                          'xG vs Goals Correlation')
plt.savefig('xg_correlation.png', dpi=150)

4.5.4 Seaborn for Statistical Visualization

Seaborn is built on matplotlib and provides higher-level functions for common statistical plots.

import seaborn as sns

# Set seaborn style globally
sns.set_style("whitegrid")
sns.set_palette("husl")

# Distribution plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Left panel: Goals per match distribution
goals_data = np.random.poisson(1.4, 380)
sns.histplot(goals_data, discrete=True, ax=axes[0], color='steelblue')
axes[0].set_xlabel('Goals')
axes[0].set_title('Goals per Match Distribution')

# Right panel: xG distribution by match outcome
np.random.seed(42)
outcome_data = pd.DataFrame({
    'xG': np.concatenate([
        np.random.normal(2.0, 0.4, 100),  # Wins
        np.random.normal(1.4, 0.3, 80),   # Draws
        np.random.normal(1.0, 0.4, 70)    # Losses
    ]),
    'Result': ['Win']*100 + ['Draw']*80 + ['Loss']*70
})

sns.boxplot(data=outcome_data, x='Result', y='xG', ax=axes[1],
            order=['Loss', 'Draw', 'Win'])
axes[1].set_ylabel('Expected Goals (xG)')
axes[1].set_title('xG by Match Result')

plt.tight_layout()
plt.savefig('seaborn_example.png', dpi=150)

Violin Plots for Distribution Comparison:

# Violin plots show the full distribution shape, not just quartiles
fig, ax = plt.subplots(figsize=(10, 6))
sns.violinplot(data=outcome_data, x='Result', y='xG',
               order=['Loss', 'Draw', 'Win'], inner='quartile', ax=ax)
ax.set_title('xG Distribution by Match Outcome')
ax.set_ylabel('Expected Goals (xG)')
plt.tight_layout()

4.5.5 Heatmaps

def plot_correlation_heatmap(df, columns, title='Correlation Matrix'):
    """Create a correlation heatmap for selected columns."""
    corr_matrix = df[columns].corr()

    fig, ax = plt.subplots(figsize=(10, 8))

    # annot=True prints the correlation value in each cell
    # cmap='RdBu_r' uses red for positive, blue for negative
    # center=0 ensures white is at zero correlation
    sns.heatmap(
        corr_matrix,
        annot=True,
        fmt='.2f',
        cmap='RdBu_r',
        center=0,
        vmin=-1,
        vmax=1,
        square=True,
        ax=ax
    )

    ax.set_title(title, fontsize=14)
    plt.tight_layout()

    return fig, ax

# Example with simulated team data
np.random.seed(42)
n = 100
team_data = pd.DataFrame({
    'Goals': np.random.poisson(50, n),
    'xG': np.random.normal(50, 10, n),
    'Possession': np.random.normal(50, 5, n),
    'Pass Accuracy': np.random.normal(82, 4, n),
    'Shots': np.random.poisson(400, n)
})

# Add realistic correlation
team_data['Goals'] = team_data['xG'] * 0.9 + np.random.normal(0, 5, n)

fig, ax = plot_correlation_heatmap(
    team_data,
    ['Goals', 'xG', 'Possession', 'Pass Accuracy', 'Shots'],
    'Team Statistics Correlation Matrix'
)
plt.savefig('correlation_heatmap.png', dpi=150)

4.5.6 Soccer Pitch Visualization

Intuition: The spatial calculations covered here---distance to goal, shot angle, Euclidean distance---connect directly to the coordinate system concepts in Chapter 6 and form the foundation of the xG models developed in Chapter 7. Make sure you are comfortable converting between pitch coordinates and real-world distances before moving on.
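As a quick illustration, converting StatsBomb pitch units to approximate real-world metres is a simple scaling. The 105 x 68 m pitch size below is an assumption --- actual pitch dimensions vary:

def statsbomb_to_metres(x, y, pitch_length_m=105, pitch_width_m=68):
    """Scale StatsBomb coordinates (120 x 80 units) to metres on an assumed pitch size."""
    return x * pitch_length_m / 120, y * pitch_width_m / 80

# Example: the penalty spot at (108, 40) in StatsBomb units
print(statsbomb_to_metres(108, 40))  # roughly (94.5, 34.0)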

For soccer-specific visualizations, use the mplsoccer library:

from mplsoccer import Pitch

def plot_shot_map(shots_df, title='Shot Map'):
    """
    Create a shot map visualization.

    Parameters
    ----------
    shots_df : pd.DataFrame
        DataFrame with columns: x, y, xG, outcome
    title : str
        Plot title
    """
    # pitch_type='statsbomb' matches StatsBomb's coordinate system (120x80)
    pitch = Pitch(pitch_type='statsbomb', pitch_color='grass',
                  line_color='white', goal_type='box')

    fig, ax = pitch.draw(figsize=(12, 8))

    # Separate goals and non-goals for different styling
    goals = shots_df[shots_df['outcome'] == 'Goal']
    non_goals = shots_df[shots_df['outcome'] != 'Goal']

    # Plot non-goals as white circles, size proportional to xG
    pitch.scatter(non_goals['x'], non_goals['y'],
                  s=non_goals['xG'] * 500,
                  c='white', edgecolors='black',
                  alpha=0.6, ax=ax, label='No Goal')

    # Plot goals as red circles
    pitch.scatter(goals['x'], goals['y'],
                  s=goals['xG'] * 500,
                  c='red', edgecolors='black',
                  alpha=0.8, ax=ax, label='Goal')

    ax.set_title(title, fontsize=16)
    ax.legend(loc='upper left')

    return fig, ax

# Example usage with simulated data
np.random.seed(42)
shots = pd.DataFrame({
    'x': np.random.uniform(100, 120, 50),
    'y': np.random.uniform(20, 60, 50),
    'xG': np.random.uniform(0.05, 0.5, 50),
    'outcome': np.random.choice(['Goal', 'Saved', 'Blocked', 'Off Target'],
                                 50, p=[0.12, 0.4, 0.25, 0.23])
})

fig, ax = plot_shot_map(shots, 'Team Shot Map - Season 2023/24')
plt.savefig('shot_map.png', dpi=150, facecolor='#1a1a1a')

Heatmap on a Pitch:

def plot_action_heatmap(events_df, title='Action Heatmap'):
    """
    Create a heatmap of player actions on the pitch.

    Uses mplsoccer's binning to create a spatial density plot.
    """
    pitch = Pitch(pitch_type='statsbomb', line_zorder=2)
    fig, ax = pitch.draw(figsize=(12, 8))

    # bin_statistic divides the pitch into a grid and counts events
    bin_statistic = pitch.bin_statistic(
        events_df['x'], events_df['y'],
        statistic='count', bins=(12, 8)
    )

    # Normalize and plot as a heatmap
    pitch.heatmap(bin_statistic, ax=ax, cmap='hot', edgecolors='#22312b')

    ax.set_title(title, fontsize=16)
    return fig, ax

4.5.7 Passing Networks

from mplsoccer import Pitch
import matplotlib.patches as mpatches

def plot_passing_network(pass_df, min_passes=3):
    """
    Create a passing network visualization.

    Parameters
    ----------
    pass_df : pd.DataFrame
        Pass data with passer, receiver, start/end locations
    min_passes : int
        Minimum passes to show connection
    """
    pitch = Pitch(pitch_type='statsbomb', pitch_color='#22312b',
                  line_color='white')
    fig, ax = pitch.draw(figsize=(12, 8))

    # Calculate average positions
    avg_positions = pass_df.groupby('passer').agg({
        'x': 'mean',
        'y': 'mean'
    }).reset_index()

    # Calculate pass combinations
    pass_counts = pass_df.groupby(['passer', 'receiver']).size().reset_index(name='count')
    pass_counts = pass_counts[pass_counts['count'] >= min_passes]

    # Plot connections --- line width proportional to pass frequency
    for _, row in pass_counts.iterrows():
        passer_pos = avg_positions[avg_positions['passer'] == row['passer']]
        receiver_pos = avg_positions[avg_positions['passer'] == row['receiver']]

        if len(passer_pos) > 0 and len(receiver_pos) > 0:
            pitch.lines(
                passer_pos['x'].values[0], passer_pos['y'].values[0],
                receiver_pos['x'].values[0], receiver_pos['y'].values[0],
                lw=row['count'] / 2,
                color='white',
                alpha=0.5,
                ax=ax
            )

    # Plot player positions --- node size proportional to total passes
    total_passes = pass_df.groupby('passer').size()
    for _, row in avg_positions.iterrows():
        size = total_passes.get(row['passer'], 1) * 5
        pitch.scatter(row['x'], row['y'], s=size,
                      c='#d63333', edgecolors='white',
                      linewidths=2, ax=ax)

    ax.set_title('Passing Network', fontsize=16, color='white')

    return fig, ax

4.5.8 Styling for Publication

Professional soccer analytics publications follow consistent styling conventions. Here is a reusable style configuration.

def set_analytics_style():
    """Apply a consistent, publication-ready style to all plots."""
    plt.rcParams.update({
        'figure.facecolor': 'white',
        'axes.facecolor': 'white',
        'axes.grid': True,
        'grid.alpha': 0.3,
        'font.family': 'sans-serif',
        'font.size': 11,
        'axes.titlesize': 14,
        'axes.labelsize': 12,
        'legend.fontsize': 10,
        'figure.dpi': 150,
        'savefig.bbox': 'tight',
        'savefig.dpi': 150,
    })

# Call once at the top of your script or notebook
set_analytics_style()

4.6 Building Reusable Code

Professional analytics requires well-organized, reusable code.

Intuition: In professional club environments, analysts rarely write one-off scripts. Code is organized into reusable modules that can be applied to any match, any player, any competition. Investing time in writing well-documented functions with clear parameter types and return values pays dividends as your analysis library grows. The functions you write in this section will serve as templates for every analysis chapter that follows.

4.6.1 Function Design Principles

Good analytics functions follow several principles:

  • Single responsibility: Each function does one thing well.
  • Clear parameters: Use type hints and default values.
  • Documentation: Include docstrings with parameter descriptions and examples.
  • Defensive coding: Validate inputs and handle edge cases.

def calculate_xg_performance(
    goals: int,
    xg: float,
    matches: int,
    confidence_level: float = 0.95
) -> dict:
    """
    Calculate xG performance metrics with confidence intervals.

    Parameters
    ----------
    goals : int
        Actual goals scored
    xg : float
        Expected goals
    matches : int
        Number of matches
    confidence_level : float, optional
        Confidence level for interval (default 0.95)

    Returns
    -------
    dict
        Dictionary containing:
        - goals_per_match: Goals per match
        - xg_per_match: xG per match
        - overperformance: Total goals minus xG
        - overperformance_per_match: Per-match overperformance
        - ci_lower: Lower bound of CI
        - ci_upper: Upper bound of CI

    Examples
    --------
    >>> result = calculate_xg_performance(25, 22.5, 38)
    >>> print(f"Overperformance: {result['overperformance']:.1f}")
    Overperformance: 2.5
    """
    from scipy import stats

    # Input validation
    if matches <= 0:
        raise ValueError(f"matches must be positive, got {matches}")
    if goals < 0:
        raise ValueError(f"goals must be non-negative, got {goals}")

    goals_per_match = goals / matches
    xg_per_match = xg / matches
    overperformance = goals - xg
    overperformance_per_match = overperformance / matches

    # Confidence interval for goals per match (Poisson)
    alpha = 1 - confidence_level
    ci_lower = stats.poisson.ppf(alpha/2, goals) / matches
    ci_upper = stats.poisson.ppf(1 - alpha/2, goals) / matches

    return {
        'goals_per_match': goals_per_match,
        'xg_per_match': xg_per_match,
        'overperformance': overperformance,
        'overperformance_per_match': overperformance_per_match,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper
    }

4.6.2 Classes for Complex Analysis

When an analysis involves multiple related calculations on the same data, a class provides better organization than a collection of loose functions.

class PlayerAnalyzer:
    """
    Analyze individual player performance from event data.

    This class encapsulates all analysis for a single player,
    providing a clean interface for accessing statistics.

    Parameters
    ----------
    events : pd.DataFrame
        Event data containing player actions
    player_name : str
        Name of the player to analyze

    Attributes
    ----------
    player_events : pd.DataFrame
        Filtered events for the specified player
    stats : dict
        Calculated statistics
    """

    def __init__(self, events: pd.DataFrame, player_name: str):
        self.player_name = player_name
        self.player_events = events[events['player'] == player_name].copy()

        if len(self.player_events) == 0:
            raise ValueError(f"No events found for player: {player_name}")

        self.stats = self._calculate_stats()

    def _calculate_stats(self) -> dict:
        """Calculate basic statistics for the player."""
        events = self.player_events

        return {
            'total_events': len(events),
            'passes': len(events[events['type'] == 'Pass']),
            'shots': len(events[events['type'] == 'Shot']),
            'goals': len(events[
                (events['type'] == 'Shot') &
                (events.get('shot_outcome') == 'Goal')
            ]),
            'matches': events['match_id'].nunique()
        }

    def get_passing_stats(self) -> dict:
        """Calculate detailed passing statistics."""
        passes = self.player_events[self.player_events['type'] == 'Pass']

        if len(passes) == 0:
            return {'total': 0, 'accuracy': None}

        # In StatsBomb data, a missing pass_outcome means the pass was complete
        if 'pass_outcome' in passes.columns:
            successful = passes[passes['pass_outcome'].isna()]
        else:
            successful = passes

        if 'pass_progressive' in passes.columns:
            progressive = int(passes['pass_progressive'].fillna(False).sum())
        else:
            progressive = 0

        return {
            'total': len(passes),
            'successful': len(successful),
            'accuracy': len(successful) / len(passes),
            'progressive': progressive
        }

    def get_shooting_stats(self) -> dict:
        """Calculate detailed shooting statistics."""
        shots = self.player_events[self.player_events['type'] == 'Shot']

        if len(shots) == 0:
            return {'total': 0, 'goals': 0, 'xg': 0}

        return {
            'total': len(shots),
            'goals': len(shots[shots.get('shot_outcome') == 'Goal']),
            'xg': shots['shot_statsbomb_xg'].sum() if 'shot_statsbomb_xg' in shots else 0,
            'conversion_rate': self.stats['goals'] / len(shots) if len(shots) > 0 else 0
        }

    def summary(self) -> pd.DataFrame:
        """Return a summary DataFrame of all statistics."""
        passing = self.get_passing_stats()
        shooting = self.get_shooting_stats()

        return pd.DataFrame([{
            'Player': self.player_name,
            'Matches': self.stats['matches'],
            'Passes': passing['total'],
            'Pass Accuracy': f"{passing['accuracy']:.1%}" if passing['accuracy'] else 'N/A',
            'Shots': shooting['total'],
            'Goals': shooting['goals'],
            'xG': f"{shooting['xg']:.2f}",
            'Conversion': f"{shooting['conversion_rate']:.1%}" if shooting['total'] > 0 else 'N/A'
        }])

4.6.3 Error Handling and Debugging

Robust error handling is essential for analytics code that will be used repeatedly across different datasets. Data formats change, API responses vary, and edge cases are inevitable.

import logging

# Configure logging --- this should be done once at the top of your module.
# INFO level logs routine operations; WARNING and ERROR for problems.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

def safe_load_data(filepath: str) -> pd.DataFrame:
    """
    Safely load data with comprehensive error handling.

    Supports CSV, Parquet, and Excel formats. Logs success or
    failure for debugging.

    Parameters
    ----------
    filepath : str
        Path to data file

    Returns
    -------
    pd.DataFrame
        Loaded data

    Raises
    ------
    FileNotFoundError
        If file doesn't exist
    ValueError
        If file format is not supported
    """
    from pathlib import Path

    path = Path(filepath)

    if not path.exists():
        logger.error(f"File not found: {filepath}")
        raise FileNotFoundError(f"File not found: {filepath}")

    suffix = path.suffix.lower()

    try:
        if suffix == '.csv':
            df = pd.read_csv(filepath)
        elif suffix == '.parquet':
            df = pd.read_parquet(filepath)
        elif suffix in ['.xlsx', '.xls']:
            df = pd.read_excel(filepath)
        elif suffix == '.json':
            df = pd.read_json(filepath)
        else:
            raise ValueError(f"Unsupported file format: {suffix}")

        logger.info(f"Loaded {len(df)} rows from {filepath}")
        return df

    except Exception as e:
        logger.error(f"Error loading {filepath}: {e}")
        raise

Common Debugging Techniques:

# 1. Check data at intermediate steps
def debug_pipeline(events):
    """Example of adding debug checks to a data pipeline."""
    print(f"Step 0: {len(events)} events")

    # Step 1: Filter to shots
    shots = events[events['type'] == 'Shot']
    print(f"Step 1 (shots): {len(shots)} rows")

    # Step 2: Extract coordinates
    shots = extract_coordinates(shots)
    print(f"Step 2 (with coords): {shots['x'].notna().sum()} have valid x")

    # Step 3: Calculate distance
    shots['distance'] = distance_to_goal(shots['x'], shots['y'])
    print(f"Step 3 (with distance): {shots['distance'].describe()}")

    return shots

# 2. Use assertions for data validation
def validate_events(events: pd.DataFrame) -> None:
    """Validate that event data has the expected structure."""
    required_columns = ['match_id', 'type', 'player', 'team']
    missing = [col for col in required_columns if col not in events.columns]
    assert len(missing) == 0, f"Missing columns: {missing}"
    assert len(events) > 0, "Events DataFrame is empty"
    assert events['match_id'].notna().all(), "Some events have null match_id"
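
Used together, the loader and validator give a pipeline a safe entry point (the file path here is illustrative):

events = safe_load_data('data/raw/events.csv')
validate_events(events)  # raises AssertionError with a clear message if the structure is wrong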

Best Practice: When building a data pipeline, add logging at each major step. When something goes wrong (and it will), the logs tell you exactly where the pipeline broke. Use logger.info() for routine operations, logger.warning() for recoverable issues, and logger.error() for failures.
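
As a concrete sketch, a single pipeline step might use all three levels. The function below is illustrative --- it reuses the logger configured above and the StatsBomb-style column names from earlier examples:

def build_shot_table(events: pd.DataFrame) -> pd.DataFrame:
    """Illustrative pipeline step showing the three logging levels."""
    shots = events[events['type'] == 'Shot'].copy()
    logger.info(f"Extracted {len(shots)} shots from {len(events)} events")

    if 'shot_statsbomb_xg' not in shots.columns:
        logger.error("No xG column found --- cannot build shot table")
        raise KeyError('shot_statsbomb_xg')

    missing_xg = shots['shot_statsbomb_xg'].isna().sum()
    if missing_xg > 0:
        logger.warning(f"Dropping {missing_xg} shots with missing xG values")
        shots = shots[shots['shot_statsbomb_xg'].notna()]

    return shots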

4.7 Version Control with Git

Reproducible analytics requires more than clean code and virtual environments -- it requires systematic version control. Git is the industry-standard tool for tracking changes to code, collaborating with teammates, and maintaining a reliable history of your analytical work. Every professional soccer analytics department uses version control, and fluency with Git is a non-negotiable skill for analysts working in club environments.

4.7.1 Git Basics

Git tracks changes to files in a repository, allowing you to record snapshots of your project at any point and return to previous states if needed.

Core Commands:

# Initialize a new repository in your project directory
git init

# Check the status of your working directory
git status

# Stage specific files for the next commit
git add src/data_loader.py src/metrics.py

# Commit staged changes with a descriptive message
git commit -m "Add data loader and metrics modules for match analysis"

# Push commits to a remote repository (e.g., GitHub, GitLab)
git push origin main

# Pull the latest changes from the remote repository
git pull origin main

Each commit should represent a logical unit of work -- a completed function, a bug fix, or a new analysis pipeline. Write commit messages that describe the purpose of the change rather than listing files modified. A message like "Build xG model for set piece shots" is far more useful than "Updated model.py".

4.7.2 Repository Structure for Soccer Analytics

Organizing your repository with a clear structure makes it easy for collaborators to navigate and for your future self to understand past work.

soccer-analytics/
├── .gitignore            # Files Git should not track
├── README.md             # Project overview and setup instructions
├── requirements.txt      # Python dependencies
├── config.py             # Configuration settings
├── data/
│   ├── raw/              # Original data (often git-ignored)
│   └── processed/        # Cleaned data (often git-ignored)
├── notebooks/            # Jupyter notebooks for exploration
├── src/
│   ├── __init__.py
│   ├── data/             # Data loading and cleaning modules
│   ├── features/         # Feature engineering
│   ├── models/           # Statistical and ML models
│   └── visualization/    # Plotting utilities
├── tests/                # Unit tests for src modules
└── outputs/
    ├── figures/          # Generated plots (git-ignored)
    └── reports/          # Analysis reports

Keep source code (src/) under version control at all times. Notebooks should be committed but with outputs cleared to avoid large diffs and potential data leaks. Data files and generated outputs are typically excluded via .gitignore.
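
One way to clear outputs before committing is nbconvert (assuming it is installed in your environment); this strips outputs in place so the committed notebook contains only code and markdown:

jupyter nbconvert --clear-output --inplace notebooks/*.ipynb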

4.7.3 .gitignore for Data Files and Notebooks

A well-crafted .gitignore file prevents large data files, sensitive credentials, and generated outputs from being accidentally committed to the repository.

# Data files -- too large for Git, store separately
data/raw/
data/processed/
*.csv
*.parquet
*.xlsx
*.json.gz

# Jupyter notebook checkpoints
.ipynb_checkpoints/

# Python environment
venv/
*.pyc
__pycache__/

# Generated outputs
outputs/figures/
outputs/reports/
*.png
*.pdf

# IDE and OS files
.vscode/
.idea/
.DS_Store
Thumbs.db

# Credentials and secrets
.env
credentials.json
api_keys.py

For large datasets, use separate storage solutions such as cloud storage buckets, a shared network drive, or Git Large File Storage (Git LFS). Document in your README.md where collaborators can obtain the required data files and how to place them in the expected directory structure.
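
If you choose Git LFS, the setup is a handful of commands; the tracked pattern below is just an example:

# One-time setup per machine
git lfs install

# Tell LFS which file patterns to manage; the patterns are recorded in .gitattributes
git lfs track "*.parquet"
git add .gitattributes
git commit -m "Track Parquet files with Git LFS"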

4.7.4 Collaborative Workflows with Branches

When working on a team -- or even on your own across multiple analyses -- branches let you develop new features or experiments without disrupting stable code.

# Create and switch to a new branch for a specific analysis
git checkout -b feature/xg-model-v2

# Work on your changes, committing as you go
git add src/models/xg_model.py
git commit -m "Implement gradient boosting xG model with set piece features"

# When finished, switch back to main and merge
git checkout main
git merge feature/xg-model-v2

# Delete the branch after merging
git branch -d feature/xg-model-v2

Branching strategies for analytics teams:

  • Feature branches: One branch per analysis task (e.g., feature/corner-kick-analysis, feature/player-recruitment-report). Merge into main when complete and reviewed.
  • Experimentation branches: Use branches to test alternative modeling approaches without committing unfinished work to the shared codebase.
  • Release branches: For production dashboards or recurring reports, maintain a stable main branch that always produces correct outputs.

When multiple analysts work on the same repository, use pull requests (on GitHub or GitLab) to review each other's code before merging. Code review catches errors, spreads knowledge across the team, and maintains consistent code quality.
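
In practice that means publishing your branch to the remote and opening the pull request on the hosting platform; with the GitHub CLI installed, the second step can also be done from the terminal (the title below is illustrative):

# Publish the branch so teammates can review it
git push -u origin feature/xg-model-v2

# Optional: open a pull request without leaving the terminal (requires the GitHub CLI)
gh pr create --base main --title "Gradient boosting xG model with set piece features"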

4.7.5 Best Practices for Versioning Data Pipelines

Data pipelines in soccer analytics evolve over time as new data sources become available, models are refined, and reporting requirements change. Git helps manage this evolution, but only if you follow disciplined practices.

Pipeline versioning guidelines:

  1. Pin your dependencies: Always commit requirements.txt with exact version numbers (pandas==2.1.0, not pandas>=2.0). A model trained with one version of scikit-learn may produce different results with another.

  2. Tag significant milestones: Use Git tags to mark important versions of your pipeline, such as the model deployed for a particular transfer window or the analysis delivered for a board presentation.

     git tag -a v1.0-summer-window -m "xG model used for summer 2024 recruitment"

  3. Separate code from configuration: Store model hyperparameters, competition IDs, and season identifiers in configuration files rather than hard-coding them. This lets you rerun the same pipeline on different data without modifying source code (see the sketch after this list).

  4. Document data lineage: Record in your commit messages or a changelog which data sources were used, any manual cleaning steps applied, and the date the data was downloaded. When a match event provider retroactively corrects data, you need to know which analyses may be affected.

  5. Automate with scripts: Create shell scripts or Makefile targets that reproduce your entire pipeline from raw data to final outputs. A collaborator should be able to clone your repository, install dependencies, and run a single command to regenerate all results.
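
For instance, a minimal configuration module keeps provider identifiers and model settings out of the pipeline code. The names and values below are illustrative assumptions, not fixed requirements:

# config.py --- central settings for the pipeline (values are illustrative)
COMPETITION_ID = 43            # provider identifier for the competition analyzed
SEASON_ID = 106                # provider identifier for the season
RAW_DATA_DIR = 'data/raw'
PROCESSED_DATA_DIR = 'data/processed'

XG_MODEL_PARAMS = {
    'n_estimators': 500,
    'learning_rate': 0.05,
    'max_depth': 4,
}

Pipeline modules then import these settings (from config import COMPETITION_ID) instead of embedding them, so rerunning the analysis for a new competition or season only requires editing the configuration file.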

Following these practices transforms your analytics work from a collection of ad-hoc scripts into a professional, auditable, and reproducible system -- the standard expected in any serious club analytics operation.


4.8 Performance Optimization

Large soccer datasets require efficient code. Here are key optimization strategies.

Best Practice: Memory optimization is not premature---it is essential. A full season of tracking data at 25 frames per second for 22 players generates over 100 million rows. Even event data for a multi-season analysis can exceed available RAM if data types are not managed carefully. Start every project by checking df.info() and df.memory_usage(deep=True) to understand your memory footprint, then apply the dtype optimization techniques shown below.
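
A quick way to see where the memory is going before optimizing (events is any event DataFrame loaded earlier):

# Overall footprint, including the true cost of object (string) columns
events.info(memory_usage='deep')

# Per-column usage in MB, largest first
per_column_mb = events.memory_usage(deep=True).sort_values(ascending=False) / 1024**2
print(per_column_mb.head(10))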

4.8.1 Memory Optimization

def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """
    Optimize DataFrame memory usage by downcasting dtypes.

    This function reduces memory by:
    - Converting low-cardinality string columns to category type
    - Downcasting integers to the smallest type that fits the data
    - Downcasting floats from float64 to float32

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame

    Returns
    -------
    pd.DataFrame
        Memory-optimized DataFrame
    """
    result = df.copy()

    for col in result.columns:
        col_type = result[col].dtype

        if col_type == 'object':
            # Category type uses integer codes + lookup table.
            # Effective when unique values are < 50% of total rows.
            if result[col].nunique() / len(result) < 0.5:
                result[col] = result[col].astype('category')

        elif col_type == 'int64':
            # int8: -128 to 127; int16: -32768 to 32767;
            # int32: -2B to 2B. Most soccer data fits in int16 or int32.
            result[col] = pd.to_numeric(result[col], downcast='integer')

        elif col_type == 'float64':
            # float32 has ~7 decimal digits of precision.
            # More than enough for xG, coordinates, etc.
            result[col] = pd.to_numeric(result[col], downcast='float')

    return result

# Example
original_memory = events.memory_usage(deep=True).sum() / 1024**2
optimized_events = optimize_dtypes(events)
optimized_memory = optimized_events.memory_usage(deep=True).sum() / 1024**2

print(f"Original: {original_memory:.2f} MB")
print(f"Optimized: {optimized_memory:.2f} MB")
print(f"Reduction: {(1 - optimized_memory/original_memory)*100:.1f}%")

4.8.2 Efficient Aggregation

Different aggregation approaches vary enormously in performance. Here is a comparison from slowest to fastest.

# SLOW: iterating over groups with a Python loop
# This is the pattern beginners default to. Avoid it.
def slow_aggregation(df):
    results = []
    for player in df['player'].unique():
        player_df = df[df['player'] == player]
        results.append({
            'player': player,
            'passes': len(player_df[player_df['type'] == 'Pass']),
            'shots': len(player_df[player_df['type'] == 'Shot'])
        })
    return pd.DataFrame(results)

# MEDIUM: groupby with apply --- better, but still calls Python per group
def fast_aggregation(df):
    return df.groupby('player').apply(
        lambda x: pd.Series({
            'passes': (x['type'] == 'Pass').sum(),
            'shots': (x['type'] == 'Shot').sum()
        })
    ).reset_index()

# FAST: pivot approach --- fully vectorized, no Python loop
def fastest_aggregation(df):
    counts = df.groupby(['player', 'type']).size().unstack(fill_value=0)
    return counts[['Pass', 'Shot']].rename(
        columns={'Pass': 'passes', 'Shot': 'shots'}
    ).reset_index()
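
To see the difference on your own data, a minimal timing harness like the one below works; the absolute numbers will vary with dataset size and hardware:

import time

for fn in (slow_aggregation, fast_aggregation, fastest_aggregation):
    start = time.perf_counter()
    fn(events)
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__}: {elapsed:.3f}s")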

4.8.3 Using Parquet for Faster I/O

For datasets you load repeatedly, consider converting CSV files to Parquet format. Parquet is a columnar storage format that offers dramatically faster read times and smaller file sizes.

# Convert CSV to Parquet (do this once --- requires the pyarrow or fastparquet package)
events = pd.read_csv('data/raw/events.csv')
events.to_parquet('data/processed/events.parquet', index=False)

# Load from Parquet (do this every time --- much faster)
events = pd.read_parquet('data/processed/events.parquet')

# Parquet also supports reading only specific columns
shots = pd.read_parquet(
    'data/processed/events.parquet',
    columns=['match_id', 'type', 'player', 'location', 'shot_statsbomb_xg']
)

Advanced: For very large datasets (multiple seasons of tracking data), consider using pyarrow directly or dask for out-of-core computation. Dask provides a pandas-like API that operates on datasets larger than memory by processing them in chunks.
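
A minimal dask sketch, assuming the events have been saved as partitioned Parquet files matching the glob below:

import dask.dataframe as dd

# Lazily reference all event files; nothing is loaded until .compute()
events = dd.read_parquet('data/processed/events_*.parquet')

shots_per_player = (
    events[events['type'] == 'Shot']
    .groupby('player')
    .size()
    .compute()  # executes the computation chunk by chunk across partitions
)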

4.9 Summary

This chapter established the Python foundation for soccer analytics:

Environment Setup:
  • Virtual environments ensure project isolation and reproducibility
  • Consistent project structure improves maintainability
  • Configuration files centralize settings and prevent hard-coded values

pandas Essentials:
  • DataFrames efficiently store and manipulate tabular soccer data
  • Boolean indexing and query() filter data precisely
  • Groupby operations aggregate statistics at any level (player, team, match)
  • Merges combine multiple data sources (events, matches, player bio)
  • Time series operations support rolling averages and cumulative statistics
  • Missing data handling requires understanding why data is absent

NumPy Fundamentals:
  • Vectorized operations dramatically outperform Python loops
  • Statistical functions enable quick exploratory analysis
  • Spatial calculations (distance, angle) support position-based analytics
  • Random number generation powers Monte Carlo simulations

Visualization:
  • matplotlib provides complete plotting control via the object-oriented API
  • seaborn simplifies statistical graphics (distributions, correlations)
  • mplsoccer creates professional soccer-specific visualizations (shot maps, passing networks, heatmaps)

Best Practices:
  • Write functions with clear parameters, type hints, and documentation
  • Use classes for complex, stateful analysis
  • Handle errors gracefully with logging and input validation
  • Optimize for large datasets with dtype downcasting and Parquet files

The tools covered in this chapter form the backbone of every analysis in subsequent chapters. Mastery of these fundamentals enables you to focus on the analytics questions rather than implementation details.

4.10 Exercises

See exercises.md for hands-on practice problems covering all topics in this chapter.

4.11 Further Reading

See further-reading.md for recommended resources on Python data science.