Exercises: Python Tools for Soccer Analytics

These exercises build practical Python skills through soccer-specific applications. Complete them in order, as later exercises build on earlier concepts.

Difficulty Levels:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Project (40+ min each)

Part A: pandas Fundamentals ⭐

A.1. Create a DataFrame from the following match data:

matches = [
    {'match_id': 1, 'home': 'Arsenal', 'away': 'Chelsea', 'home_goals': 2, 'away_goals': 1},
    {'match_id': 2, 'home': 'Liverpool', 'away': 'Man City', 'home_goals': 1, 'away_goals': 1},
    {'match_id': 3, 'home': 'Tottenham', 'away': 'Arsenal', 'home_goals': 0, 'away_goals': 3},
    {'match_id': 4, 'home': 'Chelsea', 'away': 'Liverpool', 'home_goals': 2, 'away_goals': 2},
    {'match_id': 5, 'home': 'Man City', 'away': 'Tottenham', 'home_goals': 4, 'away_goals': 0},
]

a) Print the shape and column names
b) Calculate the total goals in each match (new column)
c) Filter to only high-scoring matches (4+ total goals)

A.2. Using the DataFrame from A.1:

a) Add columns for home points and away points (3 for win, 1 for draw, 0 for loss)
b) Find all matches where Arsenal played (home or away)
c) Calculate the average goals per match
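As a hint for part (a), one possible approach is `np.select`, which picks a value per row from an ordered list of conditions. The sketch below uses a small hypothetical frame named `toy` rather than the A.1 data, so the column logic is visible without giving away the full exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, not the A.1 data
toy = pd.DataFrame({"home_goals": [2, 1, 0], "away_goals": [1, 1, 3]})

home_win = toy["home_goals"] > toy["away_goals"]
draw = toy["home_goals"] == toy["away_goals"]
away_win = ~home_win & ~draw

# Conditions are evaluated in order; `default` covers the remaining case (a loss)
toy["home_points"] = np.select([home_win, draw], [3, 1], default=0)
toy["away_points"] = np.select([away_win, draw], [3, 1], default=0)

print(toy)
```

Every match should hand out either 3 points in total (a win and a loss) or 2 (a draw), which makes a quick sanity check on the result.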

A.3. Load the StatsBomb World Cup 2018 matches:

from statsbombpy import sb
matches = sb.matches(competition_id=43, season_id=3)

a) How many matches are in the dataset?
b) What columns are available?
c) Which team scored the most total goals as home team?
d) Find all matches that went to extra time or penalties

A.4. Using the matches from A.3:

a) Create a new DataFrame with only: match_id, home_team, away_team, home_score, away_score
b) Add a 'result' column: 'H' for home win, 'D' for draw, 'A' for away win
c) What percentage of matches were home wins, draws, and away wins?

Part B: Data Filtering and Selection ⭐

B.1. Load events for the World Cup Final (match_id=8658):

events = sb.events(match_id=8658)

a) How many events are in the match?
b) What unique event types exist?
c) Filter to only shots. How many were there?
d) Which player had the most passes?

B.2. Using the events from B.1:

a) Get all events by French players
b) Get all events in the second half (period == 2)
c) Get all passes that were successful (in StatsBomb data, a completed pass has a null pass_outcome)
d) Get shots that resulted in goals
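The filters above all follow the same boolean-mask pattern. The sketch below illustrates it on a hypothetical `toy_events` frame; the column names (`type`, `team`, `period`, `pass_outcome`, `shot_outcome`) are assumed to match the flattened columns that statsbombpy returns, and completed passes are assumed to carry a missing `pass_outcome`.

```python
import numpy as np
import pandas as pd

# Hypothetical events; real data from sb.events() has many more columns
toy_events = pd.DataFrame({
    "type": ["Pass", "Pass", "Shot", "Pass", "Shot"],
    "team": ["France", "Croatia", "France", "France", "Croatia"],
    "period": [1, 1, 1, 2, 2],
    "pass_outcome": [np.nan, "Incomplete", np.nan, np.nan, np.nan],
    "shot_outcome": [np.nan, np.nan, "Goal", np.nan, "Saved"],
})

is_pass = toy_events["type"] == "Pass"
is_shot = toy_events["type"] == "Shot"

# Combine masks with & and wrap each comparison in parentheses
completed = toy_events[is_pass & toy_events["pass_outcome"].isna()]
second_half = toy_events[toy_events["period"] == 2]
goals = toy_events[is_shot & (toy_events["shot_outcome"] == "Goal")]

print(len(completed), len(second_half), len(goals))
```

Note that the `is_pass` mask matters: non-pass events also have a missing `pass_outcome`, so checking `isna()` alone would count shots as completed passes.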

B.3. Write a function that takes an events DataFrame and returns shots:

def get_shots(events: pd.DataFrame) -> pd.DataFrame:
    """Return all shot events from the events DataFrame."""
    # Your code here
    pass

Test with the World Cup Final events.

B.4. Write a function that filters events by player and event type:

def filter_player_events(
    events: pd.DataFrame,
    player_name: str,
    event_types: list = None
) -> pd.DataFrame:
    """
    Filter events for a specific player.

    Parameters
    ----------
    events : pd.DataFrame
        All events
    player_name : str
        Name of player
    event_types : list, optional
        List of event types to include. If None, return all types.

    Returns
    -------
    pd.DataFrame
        Filtered events
    """
    # Your code here
    pass

Part C: Grouping and Aggregation ⭐⭐

C.1. Load events from multiple World Cup matches:

import pandas as pd
from statsbombpy import sb

matches = sb.matches(competition_id=43, season_id=3)
match_ids = matches['match_id'].head(10).tolist()

all_events = []
for mid in match_ids:
    events = sb.events(match_id=mid)
    events['match_id'] = mid
    all_events.append(events)

events_df = pd.concat(all_events, ignore_index=True)

Calculate:
a) Total events per match
b) Total passes per team
c) Shots per match for each team

C.2. Using the events from C.1, create a player summary showing:
- Total passes
- Total shots
- Total goals (shots with outcome 'Goal')
- Minutes played (max minute value)

Sort by passes descending.

C.3. Create a function to generate match statistics:

def match_summary(events: pd.DataFrame) -> pd.DataFrame:
    """
    Generate summary statistics for each team in a match.

    Returns
    -------
    pd.DataFrame
        One row per team with columns: team, passes, shots, goals,
        possession_events, corners, fouls
    """
    # Your code here
    pass

C.4. Calculate rolling statistics:

Using match data, calculate a 5-match rolling average of goals scored for each team. Handle the first few matches where fewer than 5 games exist.
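One way to handle the early matches is the `min_periods` argument of `rolling`, which lets the window produce a value before it is full. The sketch below assumes a hypothetical team-match frame with columns `team`, `match_number`, and `goals_for`; these names are illustrative only.

```python
import pandas as pd

# Hypothetical data: team A has 7 matches, team B only 3
team_matches = pd.DataFrame({
    "team": ["A"] * 7 + ["B"] * 3,
    "match_number": list(range(1, 8)) + [1, 2, 3],
    "goals_for": [1, 2, 0, 3, 1, 2, 4, 0, 1, 2],
})

# Rolling windows depend on row order, so sort within each team first
team_matches = team_matches.sort_values(["team", "match_number"])

# min_periods=1 averages over however many matches exist so far (up to 5)
team_matches["goals_rolling_5"] = (
    team_matches.groupby("team")["goals_for"]
    .transform(lambda s: s.rolling(window=5, min_periods=1).mean())
)

print(team_matches)
```

An alternative design is to leave `min_periods` at its default, which yields NaN until five matches are available; which behaviour is preferable depends on whether a partial average is meaningful for the analysis.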

Part D: Merging and Reshaping ⭐⭐

D.1. Create two DataFrames:

# Player stats
stats = pd.DataFrame({
    'player_id': [1, 2, 3, 4, 5],
    'name': ['Kane', 'Salah', 'De Bruyne', 'Mbappe', 'Haaland'],
    'goals': [15, 18, 8, 22, 25],
    'assists': [2, 8, 12, 5, 3]
})

# Player info
info = pd.DataFrame({
    'player_id': [1, 2, 3, 6, 7],
    'nationality': ['England', 'Egypt', 'Belgium', 'Germany', 'Spain'],
    'position': ['FW', 'FW', 'MF', 'MF', 'DF']
})

a) Perform an inner merge on player_id
b) Perform a left merge (keeping all stats)
c) Perform an outer merge (keeping all players)

D.2. Convert match data to team-match format:

Starting with a matches DataFrame (home_team, away_team, home_goals, away_goals), create a new DataFrame where each match becomes two rows—one for each team's perspective with columns: team, opponent, goals_for, goals_against, venue.
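A common pattern for this reshape is to build one frame from the home side's perspective, one from the away side's, and stack them. The following is a minimal sketch on a hypothetical two-match frame, not a definitive solution.

```python
import pandas as pd

# Hypothetical matches
toy_matches = pd.DataFrame({
    "home_team": ["A", "B"],
    "away_team": ["B", "C"],
    "home_goals": [2, 0],
    "away_goals": [1, 0],
})

# Home perspective: the home side is "team"
home = toy_matches.rename(columns={
    "home_team": "team", "away_team": "opponent",
    "home_goals": "goals_for", "away_goals": "goals_against",
}).assign(venue="home")

# Away perspective: swap the roles of the four columns
away = toy_matches.rename(columns={
    "away_team": "team", "home_team": "opponent",
    "away_goals": "goals_for", "home_goals": "goals_against",
}).assign(venue="away")

cols = ["team", "opponent", "goals_for", "goals_against", "venue"]
team_matches = pd.concat([home[cols], away[cols]], ignore_index=True)

print(team_matches)
```

The result should have exactly twice as many rows as the input, and total `goals_for` should equal total `goals_against`.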

D.3. Reshape player performance data:

# Weekly performance (wide format)
wide_data = pd.DataFrame({
    'player': ['Kane', 'Salah', 'Haaland'],
    'week_1': [1, 2, 1],
    'week_2': [0, 1, 2],
    'week_3': [2, 0, 3],
    'week_4': [1, 1, 0]
})

a) Convert to long format (player, week, goals)
b) Convert back to wide format
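For reference, `melt` and `pivot` are the usual pandas pair for this round trip. The sketch below uses a smaller hypothetical frame (`wide`) so the exercise data is left for you to work through.

```python
import pandas as pd

# Hypothetical wide data with two players and two weeks
wide = pd.DataFrame({
    "player": ["P1", "P2"],
    "week_1": [1, 0],
    "week_2": [2, 1],
})

# Wide -> long: one row per (player, week) pair
long = wide.melt(id_vars="player", var_name="week", value_name="goals")

# Long -> wide: weeks become columns again
back = long.pivot(index="player", columns="week", values="goals").reset_index()
back.columns.name = None  # drop the leftover "week" label on the column index

print(long)
print(back)
```

`pivot` raises an error if a (player, week) pair appears more than once; `pivot_table` with an aggregation function is the usual alternative in that case.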

Part E: NumPy Operations ⭐⭐

E.1. Create NumPy arrays from the following data:

shots_x = [105, 108, 112, 98, 115, 110, 103, 118]  # x coordinates
shots_y = [35, 42, 38, 45, 40, 32, 48, 36]         # y coordinates
xG = [0.12, 0.25, 0.35, 0.08, 0.52, 0.18, 0.15, 0.68]

a) Calculate the mean xG
b) Calculate the standard deviation of xG
c) Find the maximum xG value and its index
d) Calculate total xG

E.2. Using the arrays from E.1:

a) Calculate the distance from each shot to the goal center (x=120, y=40)
b) Calculate the distance from each shot to the near post (x=120, y=36; in StatsBomb coordinates the 8-yard goal spans y=36 to y=44)
c) Find the shot with the smallest distance to goal
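The distance calculation is the Euclidean formula applied element-wise. The sketch below shows the idea on three hypothetical shots (not the E.1 data), using `np.hypot` to avoid writing the square root by hand.

```python
import numpy as np

# Hypothetical shot coordinates on a 120 x 80 pitch
x = np.array([108.0, 114.0, 120.0])
y = np.array([40.0, 32.0, 45.0])
goal_x, goal_y = 120.0, 40.0

# np.hypot(a, b) computes sqrt(a**2 + b**2) for each pair of elements
dist = np.hypot(goal_x - x, goal_y - y)

# argmin returns the index of the smallest distance
closest = int(np.argmin(dist))

print(dist, closest)
```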

E.3. Implement vectorized point calculation:

def calculate_points(goal_diff: np.ndarray) -> np.ndarray:
    """
    Calculate points from goal differences.

    3 points for win (goal_diff > 0)
    1 point for draw (goal_diff == 0)
    0 points for loss (goal_diff < 0)

    Use np.where for vectorized operations.
    """
    # Your code here
    pass

E.4. Simulate a season using NumPy:

def simulate_season(team_xg: float, opponent_xg: float, n_matches: int = 38) -> dict:
    """
    Simulate a full season using Poisson distribution.

    Parameters
    ----------
    team_xg : float
        Average xG per match for the team
    opponent_xg : float
        Average xG against per match
    n_matches : int
        Number of matches

    Returns
    -------
    dict
        Contains: goals_for, goals_against, points, wins, draws, losses
    """
    # Your code here
    pass
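As a starting point, the core sampling step can be sketched as below. It treats the xG averages as Poisson rates and samples the two scorelines independently; both are simplifying modelling assumptions, and the rates 1.6 and 1.1 are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the run is reproducible
n_matches = 38

# One Poisson draw per match for each side
goals_for = rng.poisson(lam=1.6, size=n_matches)
goals_against = rng.poisson(lam=1.1, size=n_matches)

diff = goals_for - goals_against
wins = int((diff > 0).sum())
draws = int((diff == 0).sum())
losses = int((diff < 0).sum())
points = 3 * wins + draws

print(wins, draws, losses, points)
```

Wrapping these lines in `simulate_season` and returning the totals as a dict is left as the exercise; running the function many times gives a distribution of season outcomes rather than a single number.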

Part F: Visualization ⭐⭐

F.1. Create a bar chart comparing top scorers:

players = ['Haaland', 'Kane', 'Salah', 'Mbappe', 'Bellingham']
goals = [25, 18, 15, 22, 12]
xG = [22.5, 15.2, 14.8, 18.5, 10.2]

Create a grouped bar chart showing goals vs xG for each player.

F.2. Create a scatter plot with regression line:

Using simulated xG vs Goals data for 20 teams, create a scatter plot with:
- Regression line
- R² annotation
- Proper axis labels and title
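The numbers behind the regression line and the R² annotation can be computed with NumPy alone before any plotting. The sketch below simulates 20 teams under an assumed noise model (goals equal xG plus normal noise with standard deviation 6, chosen arbitrarily) and fits a straight line with `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated season totals for 20 hypothetical teams
team_xg = rng.uniform(30, 80, size=20)
team_goals = team_xg + rng.normal(0, 6, size=20)

# Degree-1 polyfit returns (slope, intercept) of the least-squares line
slope, intercept = np.polyfit(team_xg, team_goals, deg=1)
predicted = slope * team_xg + intercept

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((team_goals - predicted) ** 2)
ss_tot = np.sum((team_goals - team_goals.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R²={r_squared:.3f}")
```

For the plot itself, `predicted` can be drawn against `team_xg` as the regression line and the formatted R² string passed to a text annotation.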

F.3. Create a histogram of shot xG values:

Generate 500 random xG values (exponentially distributed with mean 0.1) and create a histogram with:
- 20 bins
- Vertical line at the mean
- Density on y-axis

F.4. Create a line plot showing cumulative xG over a season:

Simulate 38 match xG values and plot:
- Cumulative actual xG
- Expected cumulative xG (constant rate)
- Confidence band (±1.5 standard deviations)

Part G: Soccer-Specific Visualization ⭐⭐⭐

G.1. Create a shot map using mplsoccer:

from mplsoccer import Pitch

# Simulated shot data
shots = pd.DataFrame({
    'x': np.random.uniform(100, 120, 30),
    'y': np.random.uniform(18, 62, 30),
    'xG': np.random.exponential(0.12, 30),
    'outcome': np.random.choice(['Goal', 'Saved', 'Blocked', 'Off Target'], 30)
})

Create a shot map where:
- Circle size represents xG
- Goals are colored differently
- Include a legend

G.2. Create a pass map for a specific player:

Load event data and create a visualization showing:
- Player's average position
- All passes (arrows)
- Color passes by success/failure

G.3. Create a heatmap showing where a team's events occurred:

Using event data, create a pitch heatmap showing event density across the field.

Part H: Functions and Classes ⭐⭐⭐

H.1. Create a well-documented function for xG analysis:

def analyze_xg_performance(
    goals: int,
    xg: float,
    matches: int,
    league_avg_conversion: float = 1.0
) -> dict:
    """
    Analyze xG performance for a player or team.

    Parameters
    ----------
    goals : int
        Actual goals scored
    xg : float
        Total expected goals
    matches : int
        Number of matches
    league_avg_conversion : float
        League average goals/xG ratio

    Returns
    -------
    dict
        Performance metrics including:
        - goals_per_match
        - xg_per_match
        - conversion_ratio (goals/xG)
        - vs_league_average (difference from league conversion)
        - overperformance (goals - xG)
    """
    # Your code here
    pass

H.2. Create a class for team season analysis:

class TeamSeason:
    """
    Analyze a team's season performance.

    Parameters
    ----------
    team_name : str
        Name of the team
    matches : pd.DataFrame
        Match data with columns: opponent, goals_for, goals_against, xG_for, xG_against

    Methods
    -------
    summary() : dict
        Return season summary statistics
    form(n_matches: int) : pd.DataFrame
        Return last n matches form
    plot_xg_timeline() : matplotlib figure
        Plot xG vs goals over the season
    """

    def __init__(self, team_name: str, matches: pd.DataFrame):
        # Your code here
        pass

    def summary(self) -> dict:
        # Your code here
        pass

    def form(self, n_matches: int = 5) -> pd.DataFrame:
        # Your code here
        pass

    def plot_xg_timeline(self):
        # Your code here
        pass

H.3. Create a data pipeline class:

class MatchDataPipeline:
    """
    Pipeline for loading and processing match data.

    Methods
    -------
    load(source: str) : self
        Load data from source
    clean() : self
        Clean and validate data
    transform() : self
        Apply transformations
    get_data() : pd.DataFrame
        Return processed DataFrame
    """

    def __init__(self):
        self.data = None

    def load(self, source: str):
        # Your code here
        return self

    def clean(self):
        # Your code here
        return self

    def transform(self):
        # Your code here
        return self

    def get_data(self) -> pd.DataFrame:
        return self.data

Part I: Performance Optimization ⭐⭐⭐

I.1. Optimize memory usage:

Create a function that optimizes a DataFrame's memory usage by:
- Converting object columns with few unique values to category
- Downcasting numeric columns

Compare memory before and after.
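A minimal sketch of such a function is shown below. The name `optimize_memory` and the `max_unique_ratio` threshold are hypothetical choices for illustration, and the test frame is randomly generated rather than real event data.

```python
import numpy as np
import pandas as pd

def optimize_memory(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy with smaller dtypes where that is safe to do."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            out[col] = pd.to_numeric(s, downcast="integer")
        elif pd.api.types.is_float_dtype(s):
            # float32 keeps about 7 significant digits; check this is enough
            out[col] = pd.to_numeric(s, downcast="float")
        elif pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s):
            # Category only pays off when values repeat often
            if s.nunique() / max(len(s), 1) < max_unique_ratio:
                out[col] = s.astype("category")
    return out

rng = np.random.default_rng(0)
n = 10_000
big = pd.DataFrame({
    "team": rng.choice(["A", "B", "C", "D"], size=n),
    "minute": rng.integers(0, 120, size=n),
    "xg": rng.exponential(0.1, size=n),
})

small = optimize_memory(big)
before = big.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
```

`memory_usage(deep=True)` is needed for an honest comparison, because without it pandas does not count the memory held by the Python strings in text columns.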

I.2. Compare loop vs vectorized performance:

Write two versions of a function that calculates shot distance to goal:
- One using a for loop
- One using NumPy vectorization

Time both on 100,000 shots and compare.
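One possible shape for the comparison is sketched below, using `time.perf_counter` for timing. The function names and the random shot coordinates are illustrative assumptions; exact timings vary by machine, so none are quoted here.

```python
import time
import numpy as np

def distance_loop(xs, ys, gx=120.0, gy=40.0):
    """Compute each distance one shot at a time in pure Python."""
    out = []
    for x, y in zip(xs, ys):
        out.append(((gx - x) ** 2 + (gy - y) ** 2) ** 0.5)
    return out

def distance_vectorized(xs, ys, gx=120.0, gy=40.0):
    """Compute all distances in one NumPy call."""
    return np.hypot(gx - xs, gy - ys)

rng = np.random.default_rng(1)
xs = rng.uniform(60, 120, 100_000)
ys = rng.uniform(0, 80, 100_000)

start = time.perf_counter()
loop_result = distance_loop(xs, ys)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vec_result = distance_vectorized(xs, ys)
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vec_seconds:.4f}s")
```

A single timed run is noisy; the standard-library `timeit` module, which repeats the call many times, gives a more reliable comparison. It is also worth confirming that both versions return the same values before comparing their speed.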

I.3. Optimize groupby operations:

Given a large events DataFrame, compare different approaches to counting events per player:
- Using groupby with apply
- Using groupby with agg
- Using value_counts
- Using pivot_table
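Before timing the four approaches, it helps to confirm that they agree. The sketch below does so on a tiny hypothetical frame with `player` and `type` columns; on real data the same calls can be wrapped in a timer.

```python
import pandas as pd

# Hypothetical events
toy = pd.DataFrame({
    "player": ["P1", "P2", "P1", "P3", "P1", "P2"],
    "type": ["Pass", "Pass", "Shot", "Pass", "Pass", "Shot"],
})

# apply runs a Python function per group, which is usually the slowest option
by_apply = toy.groupby("player")["type"].apply(len)

# agg("count") uses a built-in aggregation; note that it skips missing values
by_agg = toy.groupby("player")["type"].agg("count")

# value_counts sorts by frequency, so re-sort by player to compare
by_value_counts = toy["player"].value_counts().sort_index()

# pivot_table returns a DataFrame; select the counted column
by_pivot = toy.pivot_table(index="player", values="type", aggfunc="count")["type"]

print(by_apply.tolist(), by_agg.tolist(), by_value_counts.tolist(), by_pivot.tolist())
```

The results match here because `type` has no missing values; with NaN present, `count` and `len` would differ, which is worth keeping in mind when comparing approaches.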

Part J: Integration Project ⭐⭐⭐⭐

J.1. Build a complete match analysis module:

Create a Python module (match_analysis.py) that includes:

load_match(match_id) - Load match events from StatsBomb
calculate_team_stats(events) - Calculate comprehensive team statistics
calculate_player_stats(events) - Calculate player-level statistics
create_match_report(events) - Generate a visual match report

The match report should include:
- Shot maps for both teams
- Passing statistics
- Key events timeline

J.2. Build a season analysis dashboard:

Create functions that:
1. Load all matches from a competition/season
2. Calculate league table (points, goals, xG)
3. Identify over/underperformers (xG vs actual)
4. Create visualizations for:
   - League table
   - xG vs Goals scatter
   - Team form over time

J.3. Create a player comparison tool:

Build a class that:
1. Takes two player names
2. Loads their event data from multiple matches
3. Calculates comparative statistics
4. Generates a radar chart comparison
5. Provides a written summary of strengths/weaknesses

Solutions

Selected solutions are available in:
- code/exercise-solutions.py (programming problems)
- appendices/g-answers-to-selected-exercises.md (odd-numbered problems)

Reflection Questions

Which pandas operation do you find most useful for soccer analysis?
When would you use NumPy arrays vs pandas DataFrames?
What visualization types best communicate soccer insights?
How would you structure code for a production analytics pipeline?