Key Takeaways: Python Tools for Soccer Analytics

Environment Setup

Virtual Environments

  • Always use virtual environments for project isolation
  • Create with: python -m venv env_name
  • Export dependencies: pip freeze > requirements.txt
  • Reproduce elsewhere: pip install -r requirements.txt

Project Structure

project/
├── data/raw/          # Original, immutable data
├── data/processed/    # Cleaned data
├── src/               # Source code modules
├── notebooks/         # Exploration notebooks
├── outputs/           # Generated files
└── config.py          # Centralized settings
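
One possible shape for config.py is sketched below. The constant names and the DPI value are illustrative assumptions, not part of the layout above; the point is only that paths are defined once and imported everywhere else.

```python
# config.py: a minimal sketch; every name below is illustrative
from pathlib import Path

# Resolve paths relative to this file so scripts work from any directory.
# The fallback covers interactive sessions where __file__ is undefined.
try:
    PROJECT_ROOT = Path(__file__).resolve().parent
except NameError:
    PROJECT_ROOT = Path.cwd()

RAW_DATA_DIR = PROJECT_ROOT / "data" / "raw"
PROCESSED_DATA_DIR = PROJECT_ROOT / "data" / "processed"
OUTPUT_DIR = PROJECT_ROOT / "outputs"

# Shared plotting default (hypothetical value)
FIGURE_DPI = 150
```

Other modules can then write `from config import RAW_DATA_DIR` instead of hard-coding path strings.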

pandas Essentials

DataFrame Operations Cheat Sheet

Task                  Code
Create from dict      pd.DataFrame(data)
Read CSV              pd.read_csv('file.csv')
Select column         df['col'] or df.col
Select multiple       df[['col1', 'col2']]
Filter rows           df[df['col'] > value]
Multiple conditions   df[(cond1) & (cond2)]
Query syntax          df.query("col > @value")
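
The filtering rows of the cheat sheet can be compared side by side on a small example. The shot data and the min_xg threshold below are made up for illustration.

```python
import pandas as pd

# Hypothetical shot data, invented for this example
shots = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "team": ["Home", "Home", "Away", "Away"],
    "xg": [0.05, 0.40, 0.12, 0.76],
    "is_goal": [False, True, False, True],
})

min_xg = 0.10

# Boolean indexing: wrap each condition in parentheses and combine with &
mask_result = shots[(shots["xg"] > min_xg) & (shots["team"] == "Away")]

# The same filter with query(); @ refers to a local Python variable
query_result = shots.query("xg > @min_xg and team == 'Away'")

print(mask_result)
```

Both forms select the same rows; which one reads better is largely a matter of taste.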

Groupby Pattern

# Standard aggregation
df.groupby('team').agg({
    'goals': 'sum',
    'xg': ['mean', 'std'],
    'match_id': 'count'
})

# Transform (keeps original shape)
df['team_avg'] = df.groupby('team')['goals'].transform('mean')
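
A small worked example may make the difference between agg and transform concrete. The data is hypothetical, and the sketch uses pandas named aggregation, a variant of the dict form above that yields flat column names.

```python
import pandas as pd

# Hypothetical match-level data
df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "match_id": [1, 2, 1, 2],
    "goals": [2, 0, 1, 3],
    "xg": [1.5, 0.7, 1.1, 2.3],
})

# agg collapses each group to one row
summary = df.groupby("team").agg(
    total_goals=("goals", "sum"),
    mean_xg=("xg", "mean"),
    matches=("match_id", "count"),
)

# transform returns one value per original row, aligned to df's index
df["team_avg"] = df.groupby("team")["goals"].transform("mean")
df["goals_vs_team_avg"] = df["goals"] - df["team_avg"]

print(summary)
print(df)
```

Here summary has one row per team, while df keeps all four rows and gains the group mean as a new column.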

Merging Data

# Inner: only matching rows
pd.merge(df1, df2, on='key', how='inner')

# Left: all from left, matching from right
pd.merge(df1, df2, on='key', how='left')

# Concatenate DataFrames
pd.concat([df1, df2], ignore_index=True)
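
To see what each join keeps, here is a sketch with two tiny, invented tables. The indicator=True argument is a standard pd.merge option that records where each row came from, which helps when checking for unmatched keys.

```python
import pandas as pd

# Hypothetical lookup and stats tables sharing a player_id key
players = pd.DataFrame({"player_id": [1, 2, 3], "name": ["Ann", "Bea", "Cat"]})
goals = pd.DataFrame({"player_id": [1, 3, 4], "goals": [5, 2, 7]})

# Inner: only ids present in both tables (1 and 3)
inner = pd.merge(players, goals, on="player_id", how="inner")

# Left: every player is kept; missing goals become NaN
left = pd.merge(players, goals, on="player_id", how="left", indicator=True)

# Rows that found no partner in the right-hand table
unmatched = left[left["_merge"] == "left_only"]

print(inner)
print(unmatched)
```

Checking the unmatched rows after a left merge is a cheap way to catch key mismatches before they turn into silent NaN values downstream.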

NumPy Essentials

Vectorized Operations

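# Prefer vectorized operations over Python loops (data is a NumPy array)
# Slow: element-by-element loop
result = []
for i in range(len(data)):
    result.append(data[i] * 2)

# Fast: one vectorized operation (often 10-100x faster on large arrays)
result = data * 2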
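
The speed-up depends on array size and hardware, so it is worth measuring rather than assuming. The sketch below checks that both approaches agree and times them with the standard-library timeit module; the function names are illustrative.

```python
import timeit

import numpy as np

data = np.arange(100_000, dtype=np.float64)


def double_with_loop(values):
    """Double each element one at a time in a Python loop."""
    result = []
    for i in range(len(values)):
        result.append(values[i] * 2)
    return np.array(result)


def double_vectorized(values):
    """Double every element in a single NumPy operation."""
    return values * 2


# Both approaches must give the same answer before timing means anything
same_result = np.array_equal(double_with_loop(data), double_vectorized(data))

loop_time = timeit.timeit(lambda: double_with_loop(data), number=3)
vec_time = timeit.timeit(lambda: double_vectorized(data), number=3)
print(f"same result: {same_result}")
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```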

Key Functions

Function                            Purpose
np.mean(x)                          Average
np.std(x, ddof=1)                   Sample standard deviation
np.corrcoef(x, y)                   Correlation matrix
np.where(cond, if_true, if_false)   Conditional selection
np.random.poisson(lam, size)        Random goals simulation
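
As one way to combine the Poisson and np.where entries, the sketch below simulates a single fixture many times. The scoring rates are made up, goals for the two sides are assumed independent (a simplification), and the code uses NumPy's Generator API (np.random.default_rng) rather than the legacy np.random.poisson call listed above.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded so runs are reproducible

# Hypothetical expected goals per match for each side
home_rate, away_rate = 1.6, 1.1
n_sims = 10_000

# One simulated scoreline per entry; the sides are drawn independently
home_goals = rng.poisson(lam=home_rate, size=n_sims)
away_goals = rng.poisson(lam=away_rate, size=n_sims)

# Nested np.where labels each simulated match
outcome = np.where(
    home_goals > away_goals, "home",
    np.where(home_goals < away_goals, "away", "draw"),
)

# The mean of a boolean array is the fraction of True values
home_win_prob = np.mean(outcome == "home")
draw_prob = np.mean(outcome == "draw")
away_win_prob = np.mean(outcome == "away")
print(f"home {home_win_prob:.3f}, draw {draw_prob:.3f}, away {away_win_prob:.3f}")
```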

Distance Calculations

# Euclidean distance
distance = np.sqrt((x2 - x1)**2 + (y2 - y1)**2)

# Vectorized for arrays
distances = np.sqrt((end_x - start_x)**2 + (end_y - start_y)**2)
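
A quick numerical check of the formula, using invented pass coordinates chosen so the answers are easy to verify by hand. np.hypot computes the same square root of summed squares; the goal location assumes StatsBomb's 120 x 80 coordinate system.

```python
import numpy as np

# Hypothetical pass start and end coordinates
start_x = np.array([10.0, 50.0, 80.0])
start_y = np.array([20.0, 40.0, 30.0])
end_x = np.array([13.0, 50.0, 92.0])
end_y = np.array([24.0, 52.0, 35.0])

# np.hypot(dx, dy) is equivalent to np.sqrt(dx**2 + dy**2)
distances = np.hypot(end_x - start_x, end_y - start_y)

# Distance from each pass end point to a fixed location.
# Assumption: StatsBomb coordinates, goal centre at (120, 40).
goal_x, goal_y = 120.0, 40.0
dist_to_goal = np.hypot(goal_x - end_x, goal_y - end_y)

print(distances)
```

The three passes are 3-4-5, 0-12-12, and 12-5-13 triangles, so the distances come out as 5, 12, and 13.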

Visualization

matplotlib Basics

import matplotlib.pyplot as plt

# Standard pattern (x and y are equal-length sequences)
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y, 'b-', label='Line')
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_title('Title')
ax.legend()
plt.tight_layout()
plt.savefig('plot.png', dpi=150)

seaborn for Statistical Plots

import seaborn as sns

# Distribution
sns.histplot(data=df, x='xg', bins=20)

# Comparison
sns.boxplot(data=df, x='team', y='goals')

# Relationship
sns.scatterplot(data=df, x='xg', y='goals', hue='team')

mplsoccer for Pitches

from mplsoccer import Pitch

pitch = Pitch(pitch_type='statsbomb')
fig, ax = pitch.draw(figsize=(12, 8))
pitch.scatter(x, y, ax=ax)

Code Quality

Function Design

def calculate_xg_performance(
    goals: int,
    xg: float,
    matches: int
) -> dict:
    """
    Calculate xG performance metrics.

    Parameters
    ----------
    goals : int
        Actual goals scored
    xg : float
        Expected goals
    matches : int
        Number of matches

    Returns
    -------
    dict
        Performance metrics
    """
    if matches <= 0:
        raise ValueError("matches must be a positive integer")
    return {
        'goals_per_match': goals / matches,
        'xg_per_match': xg / matches,
        'overperformance': goals - xg
    }

Class Design

  • Use classes for stateful analysis
  • Implement clear __init__ with validation
  • Provide methods for different analyses
  • Include docstrings with examples
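
The four guidelines above could look like this in practice. TeamSeason and its methods are hypothetical names invented for this sketch, not part of any library.

```python
class TeamSeason:
    """Accumulate match results for one team (illustrative class).

    Examples
    --------
    >>> season = TeamSeason("Example FC")
    >>> season.add_match(goals=2, xg=1.4)
    >>> season.add_match(goals=0, xg=0.9)
    >>> season.goals_per_match()
    1.0
    """

    def __init__(self, team_name):
        # Validate on construction so bad input fails early
        if not team_name or not team_name.strip():
            raise ValueError("team_name must be a non-empty string")
        self.team_name = team_name
        self._goals = []  # state shared by all analysis methods
        self._xg = []

    def add_match(self, goals, xg):
        """Record one match; negative values are rejected."""
        if goals < 0 or xg < 0:
            raise ValueError("goals and xg must be non-negative")
        self._goals.append(goals)
        self._xg.append(xg)

    @property
    def matches(self):
        """Number of matches recorded so far."""
        return len(self._goals)

    def goals_per_match(self):
        """Average goals per recorded match."""
        if self.matches == 0:
            raise ValueError("no matches recorded")
        return sum(self._goals) / self.matches

    def overperformance(self):
        """Total goals minus total xG."""
        return sum(self._goals) - sum(self._xg)


season = TeamSeason("Example FC")
season.add_match(goals=2, xg=1.4)
season.add_match(goals=0, xg=0.9)
print(season.goals_per_match(), season.overperformance())
```

Keeping the match lists private and exposing small methods means each analysis reads the same validated state.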

Error Handling

import logging

logger = logging.getLogger(__name__)

def safe_operation(data):
    try:
        result = risky_operation(data)
        logger.info("Operation successful")
        return result
    except ValueError as e:
        logger.error(f"Invalid data: {e}")
        raise
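
The pattern above logs and re-raises. A common variation, sketched here with hypothetical helper names, is to log the bad record and carry on, so that one malformed row does not stop a batch job.

```python
import logging

logger = logging.getLogger(__name__)


def parse_scoreline(text):
    """Parse a scoreline such as '2-1' into (home_goals, away_goals)."""
    home, away = text.split("-")  # ValueError unless there is exactly one '-'
    return int(home), int(away)   # ValueError if either part is not an integer


def parse_scorelines(raw_scores):
    """Parse what can be parsed; log and collect everything else."""
    parsed, rejected = [], []
    for raw in raw_scores:
        try:
            parsed.append(parse_scoreline(raw))
        except ValueError as exc:
            # Lazy %-style arguments avoid formatting when the level is disabled
            logger.warning("Skipping scoreline %r: %s", raw, exc)
            rejected.append(raw)
    return parsed, rejected


parsed, rejected = parse_scorelines(["2-1", "0-0", "abandoned", "3-x"])
print(parsed, rejected)
```

Returning the rejected inputs alongside the parsed ones keeps the failures visible to the caller instead of burying them in the log.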

Performance Tips

Memory Optimization

# Use appropriate dtypes
df['goals'] = df['goals'].astype('int16')  # vs int64
df['team'] = df['team'].astype('category')  # for repeated strings

# Load only needed columns
df = pd.read_csv('big_file.csv', usecols=['col1', 'col2'])
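
To check whether a dtype change is worth it, measure memory before and after. The sketch below uses synthetic data and verifies the value range before downcasting, since values outside the int16 range would not survive the conversion intact.

```python
import numpy as np
import pandas as pd

n = 10_000
# Synthetic data: small integers and a heavily repeated string column
df = pd.DataFrame({
    "goals": np.tile(np.array([0, 1, 2, 3], dtype="int64"), n // 4),
    "team": np.tile(["Home FC", "Away United"], n // 2),
})

before = df.memory_usage(deep=True).sum()

# Confirm every value fits in int16 (-32768..32767) before downcasting
limits = np.iinfo(np.int16)
fits_int16 = df["goals"].between(limits.min, limits.max).all()

optimized = df.assign(
    goals=df["goals"].astype("int16"),
    team=df["team"].astype("category"),
)
after = optimized.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```

deep=True makes memory_usage count the string contents themselves rather than only the references to them.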

Speed Optimization

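  1. Avoid loops: Use vectorized operations
  2. Consider query(): It can be faster than boolean indexing on large DataFrames (it can use numexpr when installed); on small ones the parsing overhead usually dominates
  3. Pre-allocate: Create arrays of the correct size upfront
  4. Be wary of inplace=True: It rarely avoids a copy; reassigning the result (df = df.method()) is generally clearer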
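
The pre-allocation tip is easiest to see next to the pattern it replaces. This sketch fills an array of simulated goal counts three ways; the rate of 1.4 goals per match is an arbitrary illustrative value.

```python
import numpy as np

n_matches = 1_000

# Growing with np.append allocates and copies a new array on every call
rng = np.random.default_rng(0)
grown = np.array([], dtype=np.int64)
for _ in range(n_matches):
    grown = np.append(grown, rng.poisson(1.4))

# Pre-allocating reserves the memory once, then writes in place
rng = np.random.default_rng(0)
preallocated = np.empty(n_matches, dtype=np.int64)
for i in range(n_matches):
    preallocated[i] = rng.poisson(1.4)

# When no step depends on the previous one, drop the loop entirely
rng = np.random.default_rng(0)
vectorized = rng.poisson(1.4, size=n_matches)

print(np.array_equal(grown, preallocated))
```

Pre-allocation matters most when each iteration depends on the previous result; otherwise the fully vectorized call is usually the simplest option.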

Quick Reference Commands

# Load StatsBomb data
from statsbombpy import sb
matches = sb.matches(competition_id=43, season_id=3)
events = sb.events(match_id=7298)

# Basic event filtering
shots = events[events['type'] == 'Shot']
goals = shots[shots['shot_outcome'] == 'Goal']

# Calculate team stats (named aggregation gives flat column names)
team_stats = events.groupby('team')['type'].agg(
    passes=lambda x: (x == 'Pass').sum(),
    shots=lambda x: (x == 'Shot').sum()
)

# Save visualization
fig.savefig('output.png', dpi=150, bbox_inches='tight')

Common Pitfalls to Avoid

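  1. Modifying a filtered subset: Call .copy() after filtering so later changes cannot affect, or raise warnings about, the original DataFrame
  2. Chained assignment warning: Assign with a single .loc[rows, col] call instead of df[mask][col] = value
  3. Memory issues: Load only needed columns
  4. String comparison: Use ==, not is, to compare values
  5. Forgetting ddof=1: np.std defaults to ddof=0 (population); pass ddof=1 for the sample standard deviation (pandas .std() already defaults to ddof=1)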
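
Pitfalls 1, 2, and 5 can be demonstrated on a three-row DataFrame. The data is invented and chosen so the sample standard deviation comes out to exactly 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B"],
    "goals": [2, 0, 1],
    "xg": [1.5, 0.7, 1.1],
})

# Pitfall 1: copy the filtered subset, then modify the copy freely
team_a = df[df["team"] == "A"].copy()
team_a.loc[team_a["goals"] == 0, "goals"] = 1  # df itself is untouched

# Pitfall 2: to change the original, pick rows and column in one .loc call
df.loc[df["team"] == "B", "xg"] = 1.2

# Pitfall 5: NumPy and pandas use different ddof defaults
goals_array = df["goals"].to_numpy()
population_sd = np.std(goals_array)         # ddof=0
sample_sd_numpy = np.std(goals_array, ddof=1)
sample_sd_pandas = df["goals"].std()        # ddof=1 by default

print(population_sd, sample_sd_numpy, sample_sd_pandas)
```

The two sample values agree with each other, while the population value is smaller because it divides by n rather than n - 1.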

Summary

Python tools for soccer analytics center on:

  • pandas for data manipulation
  • NumPy for numerical operations
  • matplotlib/seaborn for visualization
  • mplsoccer for soccer-specific graphics

Master the patterns in this chapter, and implementing most analyses becomes far more straightforward. The tools handle the mechanics; you focus on the insights.