Key Takeaways: Python Tools for Soccer Analytics
Environment Setup
Virtual Environments
- Always use virtual environments for project isolation
- Create with: python -m venv env_name
- Export dependencies: pip freeze > requirements.txt
- Reproduce elsewhere: pip install -r requirements.txt
Project Structure
project/
├── data/raw/ # Original, immutable data
├── data/processed/ # Cleaned data
├── src/ # Source code modules
├── notebooks/ # Exploration notebooks
├── outputs/ # Generated files
└── config.py # Centralized settings
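A centralized settings module for this layout might look like the following minimal sketch. The function names `build_paths` and `ensure_dirs` and the `PATHS` constant are hypothetical, chosen only for illustration; the chapter does not prescribe the contents of config.py.

```python
from pathlib import Path


def build_paths(project_root: Path) -> dict[str, Path]:
    """Return the standard project directories under project_root.

    Hypothetical helper: the keys mirror the directory tree above.
    """
    return {
        "raw": project_root / "data" / "raw",
        "processed": project_root / "data" / "processed",
        "notebooks": project_root / "notebooks",
        "outputs": project_root / "outputs",
    }


def ensure_dirs(paths: dict[str, Path]) -> None:
    """Create any missing directories; existing ones are left untouched."""
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)


# Other modules can import PATHS instead of hard-coding path strings.
PATHS = build_paths(Path("project"))
```

Keeping paths in one place means a renamed folder requires a change in a single file rather than in every notebook and script.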
pandas Essentials
DataFrame Operations Cheat Sheet
| Task | Code |
|---|---|
| Create from dict | pd.DataFrame(data) |
| Read CSV | pd.read_csv('file.csv') |
| Select column | df['col'] or df.col |
| Select multiple | df[['col1', 'col2']] |
| Filter rows | df[df['col'] > value] |
| Multiple conditions | df[(cond1) & (cond2)] |
| Query syntax | df.query("col > @value") |
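As a small worked example of the filtering rows in the table, the sketch below applies the same two conditions with boolean indexing and with query(). The DataFrame contents and the `min_xg` threshold are made-up illustration values.

```python
import pandas as pd

# Made-up shot data for illustration
shots = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "team": ["Home", "Home", "Away", "Away"],
    "xg": [0.05, 0.45, 0.30, 0.10],
})

min_xg = 0.25

# Boolean indexing: wrap each condition in parentheses and combine with &
mask_result = shots[(shots["xg"] > min_xg) & (shots["team"] == "Home")]

# query(): the @ prefix refers to a Python variable in the calling scope
query_result = shots.query("xg > @min_xg and team == 'Home'")
```

Both approaches select the same rows; the choice is mostly a matter of readability.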
Groupby Pattern
# Standard aggregation
df.groupby('team').agg({
    'goals': 'sum',
    'xg': ['mean', 'std'],
    'match_id': 'count'
})
# Transform (keeps original shape)
df['team_avg'] = df.groupby('team')['goals'].transform('mean')
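The difference between aggregating and transforming is easiest to see on a tiny frame. The following sketch uses made-up goals data; the `vs_team_avg` column is a hypothetical name added for illustration.

```python
import pandas as pd

# Made-up per-match goals for two teams
df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "goals": [1, 3, 0, 2, 1],
})

# agg collapses the data to one row per team
per_team = df.groupby("team")["goals"].agg(["sum", "mean"])

# transform returns one value per original row, aligned to df's index
df["team_avg"] = df.groupby("team")["goals"].transform("mean")

# Row-level comparison against the group mean needs the transform result
df["vs_team_avg"] = df["goals"] - df["team_avg"]
```

Here `per_team` has two rows while `df` keeps all five, which is why transform is the right tool for per-row comparisons against a group statistic.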
Merging Data
# Inner: only matching rows
pd.merge(df1, df2, on='key', how='inner')
# Left: all from left, matching from right
pd.merge(df1, df2, on='key', how='left')
# Concatenate DataFrames
pd.concat([df1, df2], ignore_index=True)
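To illustrate how the join type changes the result, here is a minimal sketch with made-up player and goal tables. It also uses the `indicator=True` option of pd.merge, which adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Made-up tables: player 2 has no goals row, player 4 has no player row
players = pd.DataFrame({"player_id": [1, 2, 3], "name": ["Ann", "Bea", "Cy"]})
goals = pd.DataFrame({"player_id": [1, 3, 4], "goals": [5, 2, 7]})

# Inner join keeps only player_ids present in both tables
inner = pd.merge(players, goals, on="player_id", how="inner")

# Left join keeps every player; missing goals become NaN
left = pd.merge(players, goals, on="player_id", how="left", indicator=True)

# Rows that found no match on the right side
unmatched = left[left["_merge"] == "left_only"]
```

Checking `unmatched` after a left join is a quick way to spot keys that silently failed to match.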
NumPy Essentials
Vectorized Operations
# Prefer vectorized operations over Python loops
# Slow: element-by-element loop
result = []
for i in range(len(data)):
    result.append(data[i] * 2)
# Fast: vectorized (often 10-100x faster on large NumPy arrays)
result = data * 2
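The following sketch checks that the loop and the vectorized expression produce identical results on a random array. The array size and seed are arbitrary; actual speedups depend on the data size and machine, so no timing figures are asserted here.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = rng.random(100_000)

# Loop version: one Python-level multiplication per element
looped = []
for value in data:
    looped.append(value * 2)
looped = np.array(looped)

# Vectorized version: a single operation over the whole array
vectorized = data * 2

# Both approaches give exactly the same numbers
same = np.array_equal(looped, vectorized)
```

Wrapping each version in `timeit` is a simple way to measure the difference on your own data.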
Key Functions
| Function | Purpose |
|---|---|
| np.mean(x) | Average |
| np.std(x, ddof=1) | Sample standard deviation |
| np.corrcoef(x, y) | Correlation matrix |
| np.where(cond, if_true, if_false) | Conditional selection |
| np.random.poisson(lam, size) | Random goals simulation |
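As a sketch of how the Poisson and np.where entries combine, the example below simulates match outcomes from two goal rates. The rates 1.6 and 1.1 are made-up values, treating the two teams' goals as independent Poisson draws is a simplifying assumption, and the code uses NumPy's newer `default_rng` generator rather than the legacy `np.random.poisson` function listed in the table.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_sims = 10_000

# Hypothetical expected goals per match for each side
home_rate, away_rate = 1.6, 1.1

# Assumption: goals for each team are independent Poisson draws
home_goals = rng.poisson(lam=home_rate, size=n_sims)
away_goals = rng.poisson(lam=away_rate, size=n_sims)

# Nested np.where labels each simulated match
outcome = np.where(
    home_goals > away_goals, "home",
    np.where(home_goals < away_goals, "away", "draw"),
)

# Share of simulations ending in each result
probs = {
    label: float(np.mean(outcome == label))
    for label in ("home", "draw", "away")
}
```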
Distance Calculations
# Euclidean distance
distance = np.sqrt((x2 - x1)**2 + (y2 - y1)**2)
# Vectorized for arrays
distances = np.sqrt((end_x - start_x)**2 + (end_y - start_y)**2)
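A common application is shot distance to goal. The sketch below assumes StatsBomb coordinates (a 120 x 80 pitch with the attacking goal centred at x=120, y=40) and uses made-up shot locations; `np.hypot` computes the same Euclidean distance as the formula above.

```python
import numpy as np

# Centre of the attacking goal in StatsBomb pitch coordinates
GOAL_X, GOAL_Y = 120.0, 40.0

# Made-up shot locations
shot_x = np.array([108.0, 114.0, 90.0])
shot_y = np.array([40.0, 32.0, 40.0])

# np.hypot(dx, dy) is equivalent to np.sqrt(dx**2 + dy**2)
distance = np.hypot(GOAL_X - shot_x, GOAL_Y - shot_y)
```

The result is in pitch units, so it needs rescaling if a provider uses a different coordinate system.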
Visualization
matplotlib Basics
# Standard pattern
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y, 'b-', label='Line')
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_title('Title')
ax.legend()
plt.tight_layout()
plt.savefig('plot.png', dpi=150)
seaborn for Statistical Plots
# Distribution
sns.histplot(data=df, x='xg', bins=20)
# Comparison
sns.boxplot(data=df, x='team', y='goals')
# Relationship
sns.scatterplot(data=df, x='xg', y='goals', hue='team')
mplsoccer for Pitches
from mplsoccer import Pitch
pitch = Pitch(pitch_type='statsbomb')
fig, ax = pitch.draw(figsize=(12, 8))
pitch.scatter(x, y, ax=ax)
Code Quality
Function Design
def calculate_xg_performance(
    goals: int,
    xg: float,
    matches: int
) -> dict:
    """
    Calculate xG performance metrics.

    Parameters
    ----------
    goals : int
        Actual goals scored
    xg : float
        Expected goals
    matches : int
        Number of matches

    Returns
    -------
    dict
        Performance metrics
    """
    return {
        'goals_per_match': goals / matches,
        'xg_per_match': xg / matches,
        'overperformance': goals - xg
    }
Class Design
- Use classes for stateful analysis
- Implement a clear __init__ with validation
- Provide methods for different analyses
- Include docstrings with examples
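These guidelines can be sketched as a small stateful class. `TeamXGTracker` and its methods are hypothetical names invented for this illustration, not part of any library.

```python
class TeamXGTracker:
    """Accumulate goals and xG for one team across matches.

    Hypothetical example class for illustration.

    Examples
    --------
    >>> tracker = TeamXGTracker("Example FC")
    >>> tracker.add_match(goals=2, xg=1.5)
    >>> tracker.matches
    1
    """

    def __init__(self, team: str) -> None:
        # Validate inputs up front so bad state never enters the object
        if not team or not team.strip():
            raise ValueError("team must be a non-empty string")
        self.team = team
        self._goals: list[int] = []
        self._xg: list[float] = []

    def add_match(self, goals: int, xg: float) -> None:
        """Record one match; both values must be non-negative."""
        if goals < 0 or xg < 0:
            raise ValueError("goals and xg must be non-negative")
        self._goals.append(goals)
        self._xg.append(xg)

    @property
    def matches(self) -> int:
        """Number of matches recorded so far."""
        return len(self._goals)

    def overperformance(self) -> float:
        """Total goals minus total xG across all recorded matches."""
        return sum(self._goals) - sum(self._xg)
```

The class holds the running state, while each method answers one question about it.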
Error Handling
import logging
logger = logging.getLogger(__name__)
def safe_operation(data):
    try:
        result = risky_operation(data)
        logger.info("Operation successful")
        return result
    except ValueError as e:
        logger.error(f"Invalid data: {e}")
        raise
Performance Tips
Memory Optimization
# Use appropriate dtypes
df['goals'] = df['goals'].astype('int16') # vs int64
df['team'] = df['team'].astype('category') # for repeated strings
# Load only needed columns
df = pd.read_csv('big_file.csv', usecols=['col1', 'col2'])
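To see the effect of dtype choices, the sketch below compares memory usage before and after conversion on a made-up frame. The team names and row count are arbitrary, and the savings on real data depend on how many distinct values a column holds.

```python
import numpy as np
import pandas as pd

n = 10_000

# Made-up data: small integers and a handful of repeated team names
df = pd.DataFrame({
    "goals": np.tile(np.arange(5, dtype="int64"), n // 5),
    "team": np.tile(["Arsenal", "Chelsea", "Liverpool", "Everton"], n // 4),
})

# deep=True counts the actual string storage, not just the pointers
before = df.memory_usage(deep=True).sum()

small = df.copy()
small["goals"] = small["goals"].astype("int16")    # 2 bytes instead of 8
small["team"] = small["team"].astype("category")   # small codes + 4 labels

after = small.memory_usage(deep=True).sum()
```

Make sure the narrower integer type can hold every value in the column before converting, since out-of-range values will not be represented correctly.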
Speed Optimization
- Avoid loops: Use vectorized operations
- Use query(): Can be faster than boolean indexing on large DataFrames
- Pre-allocate: Create arrays of the correct size upfront
- Be cautious with inplace=True: It rarely avoids a copy in pandas, so reassigning the result is usually just as efficient
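Pre-allocation matters when a loop cannot be avoided. The sketch below fills a pre-sized array with a running points total from made-up match results, then notes the vectorized equivalent; the variable names are illustrative.

```python
import numpy as np

n_matches = 1_000
rng = np.random.default_rng(seed=1)

# Made-up points per match: loss, draw, or win
points = rng.choice([0, 1, 3], size=n_matches)

# Pre-allocate the full result once instead of growing it with np.append
cumulative_points = np.empty(n_matches, dtype=np.int64)

running = 0
for i in range(n_matches):
    running += points[i]
    cumulative_points[i] = running

# When a vectorized equivalent exists, prefer it over the loop
vectorized = np.cumsum(points)
```

Growing an array with `np.append` inside a loop copies the whole array on each call, which is what pre-allocation avoids.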
Quick Reference Commands
# Load StatsBomb data
from statsbombpy import sb
matches = sb.matches(competition_id=43, season_id=3)
events = sb.events(match_id=7298)
# Basic event filtering
shots = events[events['type'] == 'Shot']
goals = shots[shots['shot_outcome'] == 'Goal']
# Calculate team stats
team_stats = events.groupby('team').agg({
    'type': [
        ('passes', lambda x: (x == 'Pass').sum()),
        ('shots', lambda x: (x == 'Shot').sum())
    ]
})
# Save visualization
fig.savefig('output.png', dpi=150, bbox_inches='tight')
Common Pitfalls to Avoid
- Modifying while iterating: Use .copy() when filtering
- Chained assignment warning: Use .loc[] for assignment
- Memory issues: Load only needed columns
- String comparison: Use == not is for values
- Forgetting ddof=1: For sample standard deviation
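Three of these pitfalls can be shown in a few lines. The sketch below uses a made-up events frame; the `big_chance` column and the 0.3 threshold are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up event data
events = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "type": ["Shot", "Pass", "Shot", "Shot"],
    "xg": [0.2, np.nan, 0.4, 0.6],
})

# .copy() makes the filtered frame independent of the original
shots = events[events["type"] == "Shot"].copy()

# .loc assigns in a single step and avoids the chained assignment warning
shots["big_chance"] = False
shots.loc[shots["xg"] > 0.3, "big_chance"] = True

# np.std defaults to ddof=0 (population); pass ddof=1 for the sample value
xg_values = shots["xg"].to_numpy()
population_std = np.std(xg_values)
sample_std = np.std(xg_values, ddof=1)
```

Note that pandas `Series.std()` already defaults to ddof=1, so the NumPy and pandas results differ unless ddof is set explicitly.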
Summary
Python tools for soccer analytics center on:
- pandas for data manipulation
- NumPy for numerical operations
- matplotlib/seaborn for visualization
- mplsoccer for soccer-specific graphics
Master the patterns in this chapter, and implementation of any analysis becomes straightforward. The tools handle the complexity; you focus on the insights.