Exercises: Python Tools for Soccer Analytics
These exercises build practical Python skills through soccer-specific applications. Complete them in order, as later exercises build on earlier concepts.
Difficulty Levels:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Project (40+ min each)
Part A: pandas Fundamentals ⭐
A.1. Create a DataFrame from the following match data:
matches = [
    {'match_id': 1, 'home': 'Arsenal', 'away': 'Chelsea', 'home_goals': 2, 'away_goals': 1},
    {'match_id': 2, 'home': 'Liverpool', 'away': 'Man City', 'home_goals': 1, 'away_goals': 1},
    {'match_id': 3, 'home': 'Tottenham', 'away': 'Arsenal', 'home_goals': 0, 'away_goals': 3},
    {'match_id': 4, 'home': 'Chelsea', 'away': 'Liverpool', 'home_goals': 2, 'away_goals': 2},
    {'match_id': 5, 'home': 'Man City', 'away': 'Tottenham', 'home_goals': 4, 'away_goals': 0},
]
a) Print the shape and column names
b) Calculate the total goals in each match (new column)
c) Filter to only high-scoring matches (4+ total goals)
A.2. Using the DataFrame from A.1:
a) Add columns for home points and away points (3 for win, 1 for draw, 0 for loss)
b) Find all matches where Arsenal played (home or away)
c) Calculate the average goals per match
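Hint for A.2(a): one way to assign points without a loop is `np.select`, which picks a value per row from an ordered list of conditions. The sketch below uses a small made-up frame whose column names (`gf`, `ga`) are placeholders rather than the A.1 columns, so the exercise itself is left to you.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): goals for / goals against from one team's view.
toy = pd.DataFrame({"gf": [2, 1, 0], "ga": [1, 1, 3]})

# Conditions are checked in order; rows matching none get the default (0).
conditions = [toy["gf"] > toy["ga"], toy["gf"] == toy["ga"]]
toy["points"] = np.select(conditions, [3, 1], default=0)

print(toy)
```

The same pattern applies once per side when a frame holds both home and away goals.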
A.3. Load the StatsBomb World Cup 2018 matches:
from statsbombpy import sb
matches = sb.matches(competition_id=43, season_id=3)
a) How many matches are in the dataset?
b) What columns are available?
c) Which team scored the most total goals as home team?
d) Find all matches that went to extra time or penalties
A.4. Using the matches from A.3:
a) Create a new DataFrame with only: match_id, home_team, away_team, home_score, away_score
b) Add a 'result' column: 'H' for home win, 'D' for draw, 'A' for away win
c) What percentage of matches were home wins, draws, and away wins?
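Hint for A.4(c): `Series.value_counts(normalize=True)` returns shares rather than counts, which is usually the shortest route to percentages. A minimal sketch on a made-up result column (not the World Cup data):

```python
import pandas as pd

# Hypothetical results for five matches.
results = pd.Series(["H", "D", "A", "H", "H"], name="result")

# normalize=True gives fractions that sum to 1; scale to percentages.
pct = results.value_counts(normalize=True) * 100

print(pct.round(1))
```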
Part B: Data Filtering and Selection ⭐
B.1. Load events for the World Cup Final (France vs Croatia; confirm the match_id in the matches DataFrame from A.3, where the final is listed as 8658):
events = sb.events(match_id=8658)
a) How many events are in the match?
b) What unique event types exist?
c) Filter to only shots. How many were there?
d) Which player had the most passes?
B.2. Using the events from B.1:
a) Get all events by French players
b) Get all events in the second half (period == 2)
c) Get all passes that were successful (pass_outcome is null or 'Complete')
d) Get shots that resulted in goals
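Hint for B.2(c): boolean masks combine with `&`, and missing values are tested with `.isna()` rather than `== None`. The sketch below assumes the flattened event layout in which a completed pass simply has no `pass_outcome`; the toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy events (hypothetical): a missing pass_outcome stands for a completed pass.
toy_events = pd.DataFrame({
    "type": ["Pass", "Pass", "Shot", "Pass"],
    "pass_outcome": [np.nan, "Incomplete", np.nan, np.nan],
})

# Restrict to passes first; shots also have a missing pass_outcome.
is_pass = toy_events["type"] == "Pass"
completed = toy_events[is_pass & toy_events["pass_outcome"].isna()]

print(len(completed))
```

Filtering on the event type first matters here, because non-pass events also carry a null pass outcome.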
B.3. Write a function that takes an events DataFrame and returns shots:
import pandas as pd

def get_shots(events: pd.DataFrame) -> pd.DataFrame:
    """Return all shot events from the events DataFrame."""
    # Your code here
    pass
Test with the World Cup Final events.
B.4. Write a function that filters events by player and event type:
def filter_player_events(
    events: pd.DataFrame,
    player_name: str,
    event_types: list = None
) -> pd.DataFrame:
    """
    Filter events for a specific player.

    Parameters
    ----------
    events : pd.DataFrame
        All events
    player_name : str
        Name of player
    event_types : list, optional
        List of event types to include. If None, return all types.

    Returns
    -------
    pd.DataFrame
        Filtered events
    """
    # Your code here
    pass
Part C: Grouping and Aggregation ⭐⭐
C.1. Load events from multiple World Cup matches:
import pandas as pd
from statsbombpy import sb

matches = sb.matches(competition_id=43, season_id=3)
match_ids = matches['match_id'].head(10).tolist()

# Load each match's events and tag them with the match they came from
all_events = []
for mid in match_ids:
    events = sb.events(match_id=mid)
    events['match_id'] = mid
    all_events.append(events)

events_df = pd.concat(all_events, ignore_index=True)
Calculate:
a) Total events per match
b) Total passes per team
c) Shots per match for each team
C.2. Using the events from C.1, create a player summary showing:
- Total passes
- Total shots
- Total goals (shots with outcome 'Goal')
- Minutes played (approximated by the max minute value in each match the player appears in)
Sort by passes descending.
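Hint for C.2: named aggregation lets one `groupby` produce several summary columns at once. The sketch below is one possible pattern on invented data; the column names `is_pass`, `is_shot` and `last_minute` are placeholders, not part of the StatsBomb schema.

```python
import pandas as pd

# Toy events (hypothetical) for two players in a single match.
toy = pd.DataFrame({
    "player": ["A", "A", "B", "B", "B"],
    "type": ["Pass", "Shot", "Pass", "Pass", "Shot"],
    "minute": [3, 40, 10, 55, 88],
})

summary = (
    # Boolean helper columns sum to counts inside the aggregation.
    toy.assign(is_pass=toy["type"].eq("Pass"), is_shot=toy["type"].eq("Shot"))
       .groupby("player")
       .agg(
           passes=("is_pass", "sum"),
           shots=("is_shot", "sum"),
           last_minute=("minute", "max"),
       )
       .sort_values("passes", ascending=False)
)

print(summary)
```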
C.3. Create a function to generate match statistics:
def match_summary(events: pd.DataFrame) -> pd.DataFrame:
    """
    Generate summary statistics for each team in a match.

    Returns
    -------
    pd.DataFrame
        One row per team with columns: team, passes, shots, goals,
        possession_events, corners, fouls
    """
    # Your code here
    pass
C.4. Calculate rolling statistics:
Using match data, calculate a 5-match rolling average of goals scored for each team. Handle the first few matches where fewer than 5 games exist.
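Hint for C.4: `rolling(window=5, min_periods=1)` averages over however many matches are available until five exist, which avoids leading NaN values. A minimal sketch on made-up data, assuming rows can be ordered by a per-team match counter (`match_no` is a hypothetical column):

```python
import pandas as pd

# Toy data (hypothetical): six matches for team X, three for team Y.
toy = pd.DataFrame({
    "team": ["X"] * 6 + ["Y"] * 3,
    "match_no": [1, 2, 3, 4, 5, 6, 1, 2, 3],
    "goals": [1, 3, 0, 2, 4, 1, 2, 2, 5],
}).sort_values(["team", "match_no"])

# transform keeps the original row order and length, so the result
# can be assigned straight back as a new column.
toy["goals_roll5"] = (
    toy.groupby("team")["goals"]
       .transform(lambda s: s.rolling(window=5, min_periods=1).mean())
)

print(toy)
```

Sorting before the rolling step matters: the window follows row order, not calendar order.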
Part D: Merging and Reshaping ⭐⭐
D.1. Create two DataFrames:
# Player stats
stats = pd.DataFrame({
    'player_id': [1, 2, 3, 4, 5],
    'name': ['Kane', 'Salah', 'De Bruyne', 'Mbappe', 'Haaland'],
    'goals': [15, 18, 8, 22, 25],
    'assists': [2, 8, 12, 5, 3]
})

# Player info
info = pd.DataFrame({
    'player_id': [1, 2, 3, 6, 7],
    'nationality': ['England', 'Egypt', 'Belgium', 'Germany', 'Spain'],
    'position': ['FW', 'FW', 'MF', 'MF', 'DF']
})
a) Perform an inner merge on player_id
b) Perform a left merge (keeping all stats)
c) Perform an outer merge (keeping all players)
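Hint for D.1: passing `indicator=True` to `merge` adds a `_merge` column that records whether each row came from the left frame, the right frame, or both, which makes the difference between join types easy to inspect. A sketch on two invented frames:

```python
import pandas as pd

# Toy frames (hypothetical) that share keys 2 and 3 only.
left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [10, 20, 30]})

inner = left.merge(right, on="key", how="inner")
outer = left.merge(right, on="key", how="outer", indicator=True)

print(outer)
```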
D.2. Convert match data to team-match format:
Starting with a matches DataFrame (home_team, away_team, home_goals, away_goals), create a new DataFrame where each match becomes two rows—one for each team's perspective with columns: team, opponent, goals_for, goals_against, venue.
D.3. Reshape player performance data:
# Weekly performance (wide format)
wide_data = pd.DataFrame({
    'player': ['Kane', 'Salah', 'Haaland'],
    'week_1': [1, 2, 1],
    'week_2': [0, 1, 2],
    'week_3': [2, 0, 3],
    'week_4': [1, 1, 0]
})
a) Convert to long format (player, week, goals)
b) Convert back to wide format
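Hint for D.3: `melt` goes from wide to long and `pivot` goes back. The sketch below shows the round trip on a different, made-up frame (goals per half rather than per week), so the column names are illustrative only.

```python
import pandas as pd

# Toy wide data (hypothetical): goals scored in each half.
wide = pd.DataFrame({"team": ["X", "Y"], "h1": [1, 0], "h2": [2, 1]})

# Wide -> long: one row per (team, half).
long = wide.melt(id_vars="team", var_name="half", value_name="goals")

# Long -> wide: pivot back, then tidy the index and column labels.
back = long.pivot(index="team", columns="half", values="goals").reset_index()
back.columns.name = None

print(long)
print(back)
```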
Part E: NumPy Operations ⭐⭐
E.1. Create NumPy arrays from the following data:
shots_x = [105, 108, 112, 98, 115, 110, 103, 118] # x coordinates
shots_y = [35, 42, 38, 45, 40, 32, 48, 36] # y coordinates
xG = [0.12, 0.25, 0.35, 0.08, 0.52, 0.18, 0.15, 0.68]
a) Calculate the mean xG
b) Calculate the standard deviation of xG
c) Find the maximum xG value and its index
d) Calculate total xG
E.2. Using the arrays from E.1:
a) Calculate the distance from each shot to the goal center (x=120, y=40)
b) Calculate the distance from each shot to the post at (x=120, y=36); StatsBomb coordinates are in yards, so the posts sit at y=36 and y=44
c) Find the shot with the smallest distance to goal
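Hint for E.2: subtracting a scalar from an array broadcasts across every element, so a whole column of distances needs no loop. A minimal sketch with invented points and a target at the origin (not the shot data above):

```python
import numpy as np

# Toy points (hypothetical) and a target location.
px = np.array([0.0, 3.0, 6.0])
py = np.array([0.0, 4.0, 8.0])
target_x, target_y = 0.0, 0.0

# np.hypot computes sqrt(dx**2 + dy**2) element-wise.
dist = np.hypot(px - target_x, py - target_y)

# argmin returns the position of the smallest distance.
closest = int(np.argmin(dist))

print(dist, closest)
```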
E.3. Implement vectorized point calculation:
import numpy as np

def calculate_points(goal_diff: np.ndarray) -> np.ndarray:
    """
    Calculate points from goal differences.

    3 points for win (goal_diff > 0)
    1 point for draw (goal_diff == 0)
    0 points for loss (goal_diff < 0)

    Use np.where for vectorized operations.
    """
    # Your code here
    pass
E.4. Simulate a season using NumPy:
def simulate_season(team_xg: float, opponent_xg: float, n_matches: int = 38) -> dict:
    """
    Simulate a full season using Poisson distribution.

    Parameters
    ----------
    team_xg : float
        Average xG per match for the team
    opponent_xg : float
        Average xG against per match
    n_matches : int
        Number of matches

    Returns
    -------
    dict
        Contains: goals_for, goals_against, points, wins, draws, losses
    """
    # Your code here
    pass
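Hint for E.4: NumPy's `Generator.poisson` draws a whole array of goal counts in one call, and comparing two such arrays gives win, draw and loss masks directly. The sketch below only illustrates the sampling step with made-up scoring rates; assembling the season dictionary is left to the exercise.

```python
import numpy as np

# A seeded generator makes the simulation repeatable.
rng = np.random.default_rng(seed=42)

n = 10_000                                  # number of simulated matches
goals_a = rng.poisson(lam=1.6, size=n)      # hypothetical rate for side A
goals_b = rng.poisson(lam=1.1, size=n)      # hypothetical rate for side B

# The mean of a boolean mask is the share of matches where it is True.
win_share = np.mean(goals_a > goals_b)
draw_share = np.mean(goals_a == goals_b)
loss_share = np.mean(goals_a < goals_b)

print(win_share, draw_share, loss_share)
```

Treating goals as independent Poisson draws is a simplifying assumption; it ignores game state and any correlation between the two sides' scoring.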
Part F: Visualization ⭐⭐
F.1. Create a bar chart comparing top scorers:
players = ['Haaland', 'Kane', 'Salah', 'Mbappe', 'Bellingham']
goals = [25, 18, 15, 22, 12]
xG = [22.5, 15.2, 14.8, 18.5, 10.2]
Create a grouped bar chart showing goals vs xG for each player.
F.2. Create a scatter plot with regression line:
Using simulated xG vs Goals data for 20 teams, create a scatter plot with:
- Regression line
- R² annotation
- Proper axis labels and title
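Hint for F.2: the regression line and R² can be computed with NumPy alone before any plotting. A sketch under the assumption of a straight-line fit, using invented data (the noise level is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(30, 80, size=20)        # made-up season xG totals
y = x + rng.normal(0, 5, size=20)       # made-up goals: xG plus noise

# Degree-1 polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, deg=1)
pred = slope * x + intercept

# R² = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(round(slope, 3), round(intercept, 3), round(r_squared, 3))
```

For a single predictor with an intercept, this R² equals the squared Pearson correlation, which is a convenient cross-check.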
F.3. Create a histogram of shot xG values:
Generate 500 random xG values (exponentially distributed with mean 0.1) and create a histogram with:
- 20 bins
- Vertical line at the mean
- Density on y-axis
F.4. Create a line plot showing cumulative xG over a season:
Simulate 38 match xG values and plot:
- Cumulative actual xG
- Expected cumulative xG (constant rate)
- Confidence band (±1.5 standard deviations)
Part G: Soccer-Specific Visualization ⭐⭐⭐
G.1. Create a shot map using mplsoccer:
import numpy as np
import pandas as pd
from mplsoccer import Pitch

# Simulated shot data
shots = pd.DataFrame({
    'x': np.random.uniform(100, 120, 30),
    'y': np.random.uniform(18, 62, 30),
    'xG': np.random.exponential(0.12, 30),
    'outcome': np.random.choice(['Goal', 'Saved', 'Blocked', 'Off Target'], 30)
})
Create a shot map where:
- Circle size represents xG
- Goals are colored differently
- Include a legend
G.2. Create a pass map for a specific player:
Load event data and create a visualization showing:
- Player's average position
- All passes (arrows)
- Color passes by success/failure
G.3. Create a heatmap showing where a team's events occurred:
Using event data, create a pitch heatmap showing event density across the field.
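Hint for G.3: a heatmap is a grid of counts, so the binning can be worked out separately from the drawing. The sketch below is library-agnostic and uses `np.histogram2d` on random locations; the 6×4 grid and the 120×80 pitch dimensions are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up event locations on an assumed 120 x 80 pitch.
x = rng.uniform(0, 120, size=500)
y = rng.uniform(0, 80, size=500)

# Count events in a 6 x 4 grid covering the whole pitch.
counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=(6, 4), range=[[0, 120], [0, 80]]
)

print(counts)
```

The `counts` array has one row per x-bin, so it may need transposing before being passed to an image-style plotting call.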
Part H: Functions and Classes ⭐⭐⭐
H.1. Create a well-documented function for xG analysis:
def analyze_xg_performance(
    goals: int,
    xg: float,
    matches: int,
    league_avg_conversion: float = 1.0
) -> dict:
    """
    Analyze xG performance for a player or team.

    Parameters
    ----------
    goals : int
        Actual goals scored
    xg : float
        Total expected goals
    matches : int
        Number of matches
    league_avg_conversion : float
        League average goals/xG ratio

    Returns
    -------
    dict
        Performance metrics including:
        - goals_per_match
        - xg_per_match
        - conversion_ratio (goals/xG)
        - vs_league_average (difference from league conversion)
        - overperformance (goals - xG)
    """
    # Your code here
    pass
H.2. Create a class for team season analysis:
class TeamSeason:
    """
    Analyze a team's season performance.

    Parameters
    ----------
    team_name : str
        Name of the team
    matches : pd.DataFrame
        Match data with columns: opponent, goals_for, goals_against, xG_for, xG_against

    Methods
    -------
    summary() : dict
        Return season summary statistics
    form(n_matches: int) : pd.DataFrame
        Return last n matches form
    plot_xg_timeline() : matplotlib figure
        Plot xG vs goals over the season
    """

    def __init__(self, team_name: str, matches: pd.DataFrame):
        # Your code here
        pass

    def summary(self) -> dict:
        # Your code here
        pass

    def form(self, n_matches: int = 5) -> pd.DataFrame:
        # Your code here
        pass

    def plot_xg_timeline(self):
        # Your code here
        pass
H.3. Create a data pipeline class:
class MatchDataPipeline:
    """
    Pipeline for loading and processing match data.

    Methods
    -------
    load(source: str) : self
        Load data from source
    clean() : self
        Clean and validate data
    transform() : self
        Apply transformations
    get_data() : pd.DataFrame
        Return processed DataFrame
    """

    def __init__(self):
        self.data = None

    def load(self, source: str):
        # Your code here
        return self

    def clean(self):
        # Your code here
        return self

    def transform(self):
        # Your code here
        return self

    def get_data(self) -> pd.DataFrame:
        return self.data
Part I: Performance Optimization ⭐⭐⭐
I.1. Optimize memory usage:
Create a function that optimizes a DataFrame's memory usage by:
- Converting object columns with few unique values to category
- Downcasting numeric columns
Compare memory before and after.
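Hint for I.1: `memory_usage(deep=True)` reports the real footprint of string columns, `astype("category")` stores repeated labels once, and `pd.to_numeric(..., downcast=...)` picks the smallest numeric type that fits. The sketch below applies both steps by hand to a made-up frame; wrapping them in a general function is the exercise.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): a repetitive label column and small integers.
toy = pd.DataFrame({
    "team": ["X", "Y"] * 5000,
    "minute": np.arange(10000) % 90,
})
before = toy.memory_usage(deep=True).sum()

slim = toy.copy()
slim["team"] = slim["team"].astype("category")                       # 2 unique labels
slim["minute"] = pd.to_numeric(slim["minute"], downcast="integer")   # values 0..89
after = slim.memory_usage(deep=True).sum()

print(before, after)
```

Category conversion pays off only when the number of unique values is small relative to the number of rows.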
I.2. Compare loop vs vectorized performance:
Write two versions of a function that calculates shot distance to goal:
- One using a for loop
- One using NumPy vectorization
Time both on 100,000 shots and compare.
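Hint for I.2: the standard-library `timeit` module runs a callable several times and returns the total elapsed seconds, which is enough for a rough comparison. The sketch below times a deliberately simple operation (a square root, not the shot-distance function) so the timing pattern is visible without giving away the exercise.

```python
import timeit

import numpy as np

rng = np.random.default_rng(1)
values = rng.uniform(0, 100, size=100_000)

def loop_sqrt(arr):
    """Element-by-element square root using a Python loop."""
    out = np.empty(len(arr))
    for i in range(len(arr)):
        out[i] = arr[i] ** 0.5
    return out

def vector_sqrt(arr):
    """The same result in a single vectorized call."""
    return np.sqrt(arr)

# number=3 repeats each call three times and returns the total time.
t_loop = timeit.timeit(lambda: loop_sqrt(values), number=3)
t_vec = timeit.timeit(lambda: vector_sqrt(values), number=3)

print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```

Exact timings vary by machine, so compare the ratio rather than the absolute numbers.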
I.3. Optimize groupby operations:
Given a large events DataFrame, compare different approaches to counting events per player:
- Using groupby with apply
- Using groupby with agg
- Using value_counts
- Using pivot_table
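Hint for I.3: before timing the approaches, it helps to confirm they return the same counts. A minimal sketch comparing two of them on a tiny invented frame; the remaining approaches and the timing are left to the exercise.

```python
import pandas as pd

# Toy events (hypothetical): one row per event.
toy = pd.DataFrame({"player": ["A", "B", "A", "C", "A", "B"]})

# Sort both results by player so they can be compared position by position.
by_size = toy.groupby("player").size().sort_index()
by_counts = toy["player"].value_counts().sort_index()

print(by_size)
print(by_counts)
```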
Part J: Integration Project ⭐⭐⭐⭐
J.1. Build a complete match analysis module:
Create a Python module (match_analysis.py) that includes:
- load_match(match_id) - Load match events from StatsBomb
- calculate_team_stats(events) - Calculate comprehensive team statistics
- calculate_player_stats(events) - Calculate player-level statistics
- create_match_report(events) - Generate a visual match report
The match report should include:
- Shot maps for both teams
- Passing statistics
- Key events timeline
J.2. Build a season analysis dashboard:
Create functions that:
1. Load all matches from a competition/season
2. Calculate league table (points, goals, xG)
3. Identify over/underperformers (xG vs actual)
4. Create visualizations for:
   - League table
   - xG vs Goals scatter
   - Team form over time
J.3. Create a player comparison tool:
Build a class that:
1. Takes two player names
2. Loads their event data from multiple matches
3. Calculates comparative statistics
4. Generates a radar chart comparison
5. Provides a written summary of strengths/weaknesses
Solutions
Selected solutions are available in:
- code/exercise-solutions.py (programming problems)
- appendices/g-answers-to-selected-exercises.md (odd-numbered problems)
Reflection Questions
- Which pandas operation do you find most useful for soccer analysis?
- When would you use NumPy arrays vs pandas DataFrames?
- What visualization types best communicate soccer insights?
- How would you structure code for a production analytics pipeline?