Appendix D: Data Sources and Tools

This appendix catalogs the data sources, APIs, and tools referenced throughout the book. The landscape of soccer data evolves rapidly; URLs and availability are current as of the time of writing but may change. Check the book's companion website for updated links.


D.1 Free and Public Data Sources

D.1.1 StatsBomb Open Data

URL: https://github.com/statsbomb/open-data

StatsBomb provides a substantial free dataset covering select competitions with full event-level data. This is the primary dataset used in many chapters of this book.

Available Data:
- Event data (passes, shots, carries, pressures, duels, etc.)
- Lineup data
- Match metadata
- 360 freeze-frame data (for select matches)

Competitions Included (selection):
- FIFA World Cup (Men's and Women's, multiple editions)
- UEFA Euro (multiple editions)
- FA Women's Super League (multiple seasons)
- La Liga (select seasons)
- UEFA Champions League (select seasons)
- National Women's Soccer League (NWSL)
- Indian Super League
- Various international tournaments

Format: JSON files organized by competition, season, and match.
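The file organization can be made concrete with a small URL helper. This is a minimal sketch, assuming the directory names used in the open-data repository at the time of writing (`competitions.json`, `matches/<competition_id>/<season_id>.json`, `events/<match_id>.json`, `lineups/<match_id>.json`, `three-sixty/<match_id>.json`) and the `master` branch. The function name `open_data_url` is hypothetical, and the layout may change.

```python
# Sketch: build raw-file URLs for the StatsBomb open-data repository.
# The base URL and directory names are assumptions about the current layout.
BASE_URL = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"


def open_data_url(kind, match_id=None, competition_id=None, season_id=None):
    """Return the URL of one JSON file in the open-data repository."""
    if kind == "competitions":
        return f"{BASE_URL}/competitions.json"
    if kind == "matches":
        if competition_id is None or season_id is None:
            raise ValueError("matches need competition_id and season_id")
        return f"{BASE_URL}/matches/{competition_id}/{season_id}.json"
    if kind in ("events", "lineups", "three-sixty"):
        if match_id is None:
            raise ValueError(f"{kind} need match_id")
        return f"{BASE_URL}/{kind}/{match_id}.json"
    raise ValueError(f"unknown kind: {kind!r}")


print(open_data_url("events", match_id=3869685))
```

The returned URL can be passed to any JSON reader; the statsbombpy library shown next wraps the same files behind a DataFrame interface.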

Access via Python:

# Using the statsbombpy library
# Install first from a shell: pip install statsbombpy
from statsbombpy import sb

competitions = sb.competitions()      # available competitions and seasons
events = sb.events(match_id=3869685)  # all events for one match

Coordinate System: 120 x 80 yards, origin at top-left (y increases toward the bottom of the pitch), attacking direction left to right.

License: Free for non-commercial use with attribution. See the StatsBomb data use agreement for details.

D.1.2 FBref

URL: https://fbref.com

FBref, powered by Sports Reference, provides comprehensive aggregated statistics for professional soccer leagues worldwide.

Available Data:
- Player-level statistics (standard, shooting, passing, defensive, possession, etc.)
- Team-level statistics
- Match reports with advanced metrics
- Expected goals and expected assists (via StatsBomb/Opta)
- Scouting reports with percentile rankings

Coverage:
- Top 5 European leagues (England, Spain, Germany, Italy, France)
- Major secondary leagues (Netherlands, Portugal, Belgium, etc.)
- UEFA Champions League and Europa League
- Major international tournaments
- Historical data going back several decades (with varying detail)

Access: Web scraping (respect robots.txt and rate limits). No official API.

# Example scraping pattern (use responsibly)
import pandas as pd
import time

url = "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
tables = pd.read_html(url)
player_stats = tables[0]  # First table is typically standard stats
time.sleep(5)  # Respect rate limits

Notes: FBref's xG figures for top leagues come from a third-party provider (StatsBomb originally, Opta since 2022), so check the site for the current source. Statistics are per-90 normalized where indicated. Scraping policies may change.
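Per-90 normalization scales a counting statistic to the rate a player would post over a full 90 minutes: total x 90 / minutes played. The sketch below is a minimal illustration with made-up numbers; the function name `per_90` and the sample records are hypothetical, not FBref output.

```python
def per_90(total, minutes):
    """Scale a counting stat to a per-90-minute rate.

    Returns None when no minutes were played, since the rate is undefined.
    """
    if minutes <= 0:
        return None
    return total * 90 / minutes


# Illustrative records only; not real player data.
players = [
    {"player": "Player A", "minutes": 2700, "goals": 15},
    {"player": "Player B", "minutes": 450, "goals": 4},
    {"player": "Player C", "minutes": 0, "goals": 0},
]

for row in players:
    row["goals_per_90"] = per_90(row["goals"], row["minutes"])
    print(row["player"], row["goals_per_90"])
```

Rates built on few minutes are noisy, so analyses commonly apply a minimum-minutes filter before ranking players on per-90 figures.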

D.1.3 Understat

URL: https://understat.com

Understat provides expected goals data with shot-level detail for six major European leagues.

Available Data:
- Shot-level xG values
- Shot maps
- Player and team xG summaries
- xG timelines per match
- Situation breakdowns (open play, set piece, counter-attack, etc.)

Coverage: Premier League, La Liga, Bundesliga, Serie A, Ligue 1, Russian Premier League, from 2014-15 onward.

Access: The data is loaded dynamically via JavaScript. Scraping requires parsing embedded JSON from page source.

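import json
import re

import requests
from bs4 import BeautifulSoup

url = "https://understat.com/league/EPL/2023"
response = requests.get(url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')

# Data is embedded in script tags as JSON.parse('...') string literals.
# The pattern assumes that layout; the site may change it without notice.
pattern = re.compile(r"var\s+(\w+)\s*=\s*JSON\.parse\('(.*?)'\)")
data = {}
for script in soup.find_all('script'):
    for name, payload in pattern.findall(script.string or ''):
        # Undo the \xNN escaping, then parse the JSON text
        data[name] = json.loads(payload.encode('utf-8').decode('unicode_escape'))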

D.1.4 WhoScored

URL: https://www.whoscored.com

WhoScored provides match-level statistics, player ratings, and tactical analysis powered by Opta data.

Available Data:
- Match statistics (shots, possession, passes, etc.)
- Player ratings (0-10 scale)
- Heatmaps and touch maps
- Chalkboard event visualizations
- Tactical lineups and formations

Coverage: Major leagues worldwide, international tournaments.

Notes: WhoScored's proprietary player rating system is derived from Opta event data. The site is designed for web consumption, so its data is not easily machine-readable.

D.1.5 Transfermarkt

URL: https://www.transfermarkt.com

Transfermarkt is the leading source for transfer values, market data, and biographical information.

Available Data:
- Estimated market values for players
- Transfer history and fees
- Contract details
- Player biographical data (age, nationality, position, height, foot)
- Injury history
- Appearance and goal records

Access: Web scraping or community-maintained packages.

# Community package (not official)
# pip install transfermarkt-api
# Note: Check current availability and terms of use
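Scraped market values usually arrive as display strings rather than numbers. The sketch below converts such strings to euros, assuming the formats "€25.00m", "€500k", and "€1.20bn" and treating anything else (such as "-") as missing. Both the formats and the function name `parse_market_value` are assumptions for illustration; verify them against the pages or package you actually use.

```python
import re

# Assumed suffixes on displayed values; adjust if the site uses others.
_MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "bn": 1_000_000_000}
_VALUE_RE = re.compile(r"^€\s*([0-9]+(?:\.[0-9]+)?)\s*(k|m|bn)?$", re.IGNORECASE)


def parse_market_value(text):
    """Convert a display string such as '€25.00m' to whole euros.

    Returns None when the string does not match the assumed format.
    """
    match = _VALUE_RE.match(text.strip())
    if match is None:
        return None
    number, suffix = match.groups()
    multiplier = _MULTIPLIERS[suffix.lower()] if suffix else 1
    return round(float(number) * multiplier)


for raw in ["€25.00m", "€500k", "€1.20bn", "-"]:
    print(raw, parse_market_value(raw))
```

Returning None for unrecognized strings, rather than raising, keeps a bulk conversion running and leaves the gaps visible for later inspection.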

D.1.6 Other Free Sources

| Source | URL | Data Type | Notes |
| --- | --- | --- | --- |
| European Soccer Database (Kaggle) | kaggle.com/datasets | Match results, betting odds | SQLite database, good for learning |
| football-data.co.uk | football-data.co.uk | Match results, betting odds | CSV files, extensive historical coverage |
| Club Elo | clubelo.com | Elo ratings | Historical Elo ratings for clubs worldwide |
| 538 Soccer Predictions | projects.fivethirtyeight.com | Match predictions, SPI ratings | FiveThirtyEight's Soccer Power Index (no longer updated) |
| Open Football | github.com/openfootball | Match results | Community-maintained, structured text format |
| Soccerway | soccerway.com | Match results, lineups | Web-based, broad international coverage |
| worldfootballR | github.com/JaseZiv/worldfootballR | R package | Aggregates data from FBref, Transfermarkt, Understat, Fotmob |
| Pappalardo et al. Dataset | figshare.com | Event data | Academic dataset covering multiple leagues |

D.2 Commercial Data Providers

D.2.1 Stats Perform (Opta)

Website: statsperform.com

Opta, now part of Stats Perform, is the most widely used event data provider in professional soccer.

Products:
- Event Data (F24/F9): Detailed event-level data, roughly 2,000 events per match (passes, shots, tackles, fouls, etc.), each with x,y coordinates.
- Match Data: Pre-match, live, and post-match statistics.
- Player Data: Seasonal aggregates, biographical information.
- Advanced Metrics: Expected goals (xG), Expected Threat (xT), possession value frameworks.
- Content Feeds: Real-time text commentary, graphics-ready data.

Coverage: 500+ competitions worldwide, including all major European leagues, lower divisions, international tournaments, and youth football.

Coordinate System: 100 x 100 (percentage-based), origin at bottom-left.

Typical Clients: Media organizations, broadcasters, betting companies, professional clubs.

D.2.2 StatsBomb

Website: statsbomb.com

StatsBomb provides event data with finer granularity than most competitors, including pressure events and advanced tagging.

Products:
- Event Data: Highly detailed events including pressure tracking, shot freeze-frames (player positions at the moment of each shot), and ball receipt types.
- StatsBomb 360: Freeze-frame data capturing all visible player and ball positions at the moment of key events.
- IQ Platform: Web-based analytics and visualization platform.
- Data Lab: Custom analysis and consultancy.

Unique Features:
- Pressure events tracked
- Goalkeeper positioning data
- Carry events (not just dribbles)
- Shot freeze-frame data (positions of all nearby players)

Coverage: Major European leagues, international tournaments, select lower-tier leagues.

D.2.3 Second Spectrum

Website: secondspectrum.com

Second Spectrum provides optical tracking data using computer vision applied to broadcast or stadium camera feeds.

Products:
- Tracking Data: 25 Hz positional data for all players and the ball.
- Physical Metrics: Speed, distance, acceleration profiles.
- Contextual Intelligence: Automated tactical analysis, off-ball movement detection.
- Coaching Tools: Video synchronization with data overlays.

Coverage: Official tracking partner of the Premier League, La Liga, MLS, and other leagues.

Data Format: Typically frame-by-frame JSON or CSV with x, y coordinates and timestamps.
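Physical metrics such as speed and distance follow directly from frame-by-frame positions: at 25 Hz, consecutive frames are 0.04 seconds apart, so speed is the distance between frames multiplied by the frame rate. The sketch below assumes positions already expressed in meters and a constant frame rate. The function names are hypothetical, and real feeds usually need smoothing and handling of missing frames first.

```python
import math


def frame_speeds(positions, frame_rate_hz=25.0):
    """Return speeds in m/s between consecutive (x, y) positions in meters."""
    return [
        math.hypot(x1 - x0, y1 - y0) * frame_rate_hz
        for (x0, y0), (x1, y1) in zip(positions, positions[1:])
    ]


def total_distance(positions):
    """Return the path length in meters covered across all frames."""
    return sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(positions, positions[1:])
    )


# Three illustrative frames for one player (not real tracking data).
frames = [(0.0, 0.0), (0.2, 0.0), (0.2, 0.3)]
print(frame_speeds(frames))
print(total_distance(frames))
```

Raw frame-to-frame speeds amplify positional jitter, which is why providers typically smooth trajectories before reporting peak speeds or accelerations.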

D.2.4 Wyscout

Website: wyscout.com (now part of Hudl)

Wyscout is a widely used scouting and analysis platform, particularly popular with clubs and agents.

Products:
- Event Data: Similar scope to Opta, with passes, shots, duels, etc.
- Video Platform: Full match video with event tagging and clipping.
- Scouting Tools: Player search, comparison, and shortlisting.
- Advanced Data: xG, progressive passes, smart passes.

Coverage: 200+ competitions, extensive lower-league and youth coverage.

Coordinate System: 100 x 100 (percentage-based), origin at top-left (y-axis inverted relative to bottom-left-origin systems such as Opta's).
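Percentage-based coordinates from different providers can be put on a common footing by scaling to meters and, for top-left origins, flipping the y-axis. The sketch below assumes a 105 x 68 m pitch (real pitch dimensions vary) and a bottom-left target origin; the function name `percent_to_meters` is hypothetical.

```python
def percent_to_meters(x_pct, y_pct, origin="bottom-left",
                      pitch_length=105.0, pitch_width=68.0):
    """Convert 0-100 percentage coordinates to meters, origin at bottom-left.

    origin is the convention of the input data: "bottom-left" (as described
    for Opta) or "top-left" (as described for Wyscout).
    """
    if origin not in ("bottom-left", "top-left"):
        raise ValueError(f"unsupported origin: {origin!r}")
    if origin == "top-left":
        # Flip so that y increases upward, matching the target convention
        y_pct = 100.0 - y_pct
    return x_pct / 100.0 * pitch_length, y_pct / 100.0 * pitch_width


# The same corner of the pitch expressed in each convention
print(percent_to_meters(100, 100, origin="bottom-left"))
print(percent_to_meters(100, 0, origin="top-left"))
```

For production work, the kloppy library covered in D.3.4 performs this kind of conversion across providers, including orientation handling that this sketch omits.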

D.2.5 InStat

Website: instatsport.com

InStat provides event and video analysis primarily in Eastern European, South American, and Asian markets.

Products:
- Event data with video synchronization
- Player indices (composite ratings)
- Team reports and opposition analysis

D.2.6 SkillCorner

Website: skillcorner.com

SkillCorner provides broadcast-derived tracking data using computer vision on publicly available TV footage.

Products:
- Broadcast Tracking: Physical and tactical metrics derived from TV broadcasts.
- Off-ball Metrics: Running intensity, pressing behavior, positioning.
- Open Data: Limited free dataset available for academic and personal use.

Coverage: Major leagues worldwide, based on availability of broadcast footage.

D.2.7 Comparison of Providers

| Provider | Data Type | Coverage | Coordinate System | Typical Cost |
| --- | --- | --- | --- | --- |
| Opta / Stats Perform | Event | 500+ comps | 100 x 100 | $$$$$ |
| StatsBomb | Event + 360 | 30+ comps | 120 x 80 | $$$$ |
| Second Spectrum | Tracking | Select leagues | Meters (real coords) | $$$$$ |
| Wyscout / Hudl | Event + Video | 200+ comps | 100 x 100 | $$$ |
| InStat | Event + Video | 100+ comps | Varies | $$ |
| SkillCorner | Tracking (broadcast) | Major leagues | Meters (real coords) | $$$ |

D.3 APIs and Tools

D.3.1 Python Libraries

| Library | Purpose | Install Command |
| --- | --- | --- |
| statsbombpy | Access StatsBomb data | pip install statsbombpy |
| mplsoccer | Soccer pitch plotting and visualization | pip install mplsoccer |
| socceraction | SPADL, VAEP, xT implementations | pip install socceraction |
| kloppy | Standardize data across providers | pip install kloppy |
| codeball | Expected possession value (EPV) | pip install codeball |
| pandas | Data manipulation | pip install pandas |
| scikit-learn | Machine learning | pip install scikit-learn |
| statsmodels | Statistical modeling | pip install statsmodels |
| xgboost | Gradient boosting | pip install xgboost |
| lightgbm | Gradient boosting (fast) | pip install lightgbm |
| torch / pytorch | Deep learning | pip install torch |
| networkx | Network/graph analysis | pip install networkx |

D.3.2 R Libraries

| Library | Purpose |
| --- | --- |
| worldfootballR | Scrape data from FBref, Transfermarkt, Understat, Fotmob |
| StatsBombR | Access StatsBomb open data |
| ggsoccer | Soccer pitch visualization in ggplot2 |
| ggplot2 | General data visualization |
| tidyverse | Data wrangling suite |
| brms | Bayesian regression modeling |
| rstanarm | Applied Bayesian regression |

D.3.3 Visualization Tools

| Tool | Type | Best For |
| --- | --- | --- |
| mplsoccer (Python) | Library | Pitch plots, heatmaps, shot maps, pass networks |
| ggsoccer (R) | Library | Pitch plots in ggplot2 framework |
| Tableau | Software | Interactive dashboards, no-code exploration |
| D3.js | JavaScript | Custom web-based interactive visualizations |
| Flourish | Web app | Animated and interactive visualizations |
| Figma / Sketch | Design | Publication-quality static graphics |

D.3.4 Data Standardization

Different data providers use different event definitions, coordinate systems, and naming conventions. The kloppy library standardizes data into a common format:

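from kloppy import statsbomb
from socceraction.data.statsbomb import StatsBombLoader
import socceraction.spadl as spadl

# Load StatsBomb data into standardized format
dataset = statsbomb.load_open_data(
    match_id=3869685,
    coordinates="statsbomb"
)

# Convert to SPADL format using socceraction.
# socceraction reads events through its own loader; the competition and
# season IDs below are assumed to be the ones this match belongs to.
loader = StatsBombLoader(getter="remote")
games = loader.games(competition_id=43, season_id=106)
home_team_id = games.loc[games["game_id"] == 3869685, "home_team_id"].iloc[0]
events = loader.events(game_id=3869685)
actions = spadl.statsbomb.convert_to_actions(events, home_team_id)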

SPADL (Soccer Player Action Description Language): A standardized representation introduced by Decroos et al. (2019) that converts all event data into a consistent format with 21 action types. Used as the foundation for VAEP (Valuing Actions by Estimating Probabilities). See Chapter 9.


D.4 Sample Datasets Included with This Book

The companion repository for this book includes several curated datasets for reproducing the analyses in each chapter.

D.4.1 Dataset Inventory

| File | Size | Description | Chapters |
| --- | --- | --- | --- |
| sample_matches.csv | 500 KB | Match results from top 5 leagues (2018-2024) | 1, 3, 5 |
| sample_shots.parquet | 12 MB | 50,000 shots with xG values, locations, outcomes | 6, 7, 8 |
| sample_events.parquet | 45 MB | Full event data for 100 matches | 10, 11, 12 |
| sample_tracking.parquet | 200 MB | Tracking data (25 Hz) for 5 matches | 17, 18, 19 |
| player_seasons.csv | 3 MB | Player-season aggregated stats (5 leagues, 6 seasons) | 15, 20, 21 |
| team_seasons.csv | 200 KB | Team-season aggregated stats | 9, 16 |
| transfer_values.csv | 1 MB | Transfer values and fees (2015-2024) | 21, 25 |
| sample_lineups.json | 500 KB | Lineup and formation data for 100 matches | 10, 22 |
| xg_model_training.parquet | 8 MB | Pre-processed shot data for xG model training | 6, 7 |
| league_tables.csv | 100 KB | Final league standings (top 5 leagues, 10 seasons) | 4, 5, 16 |

D.4.2 Loading the Sample Data

import pandas as pd
import os

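# Set base path (adjust as needed). __file__ is only defined when running
# a script; in a notebook, set DATA_DIR to an explicit path such as '../data'.
DATA_DIR = os.path.join(os.path.dirname(__file__), '..', 'data')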

# Load match data
matches = pd.read_csv(os.path.join(DATA_DIR, 'sample_matches.csv'),
                       parse_dates=['date'])

# Load shot data
shots = pd.read_parquet(os.path.join(DATA_DIR, 'sample_shots.parquet'))

# Load tracking data
tracking = pd.read_parquet(os.path.join(DATA_DIR, 'sample_tracking.parquet'))
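Before running a chapter's analysis, it can help to confirm that the files listed in D.4.1 are where the loading code expects them. The sketch below is one way to do that with the standard library; the function name `missing_files` is hypothetical, and the file list is copied from the inventory above.

```python
from pathlib import Path

# File names taken from the dataset inventory in D.4.1
EXPECTED_FILES = [
    "sample_matches.csv",
    "sample_shots.parquet",
    "sample_events.parquet",
    "sample_tracking.parquet",
    "player_seasons.csv",
    "team_seasons.csv",
    "transfer_values.csv",
    "sample_lineups.json",
    "xg_model_training.parquet",
    "league_tables.csv",
]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in expected if not (base / name).is_file()]


# Example: report anything missing from a local 'data' directory
for name in missing_files("data"):
    print(f"missing: {name}")
```

An empty result means every inventory file was found; any names printed point to an incomplete clone or a mis-set data directory.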

D.4.3 Data Dictionaries

sample_shots.parquet columns:

| Column | Type | Description |
| --- | --- | --- |
| shot_id | int | Unique shot identifier |
| match_id | int | Match identifier |
| team | str | Shooting team |
| player | str | Shooter name |
| minute | int | Match minute |
| x | float | X coordinate (StatsBomb: 0-120) |
| y | float | Y coordinate (StatsBomb: 0-80) |
| end_x | float | Shot end X coordinate |
| end_y | float | Shot end Y coordinate |
| body_part | str | Right Foot, Left Foot, Head, Other |
| technique | str | Normal, Volley, Half Volley, etc. |
| situation | str | Open Play, Set Piece, Corner, Free Kick, Penalty |
| first_time | bool | Whether the shot was first-time |
| distance | float | Distance to goal center (yards) |
| angle | float | Angle to goal (radians) |
| num_defenders | int | Defenders between shooter and goal |
| gk_x | float | Goalkeeper X position |
| gk_y | float | Goalkeeper Y position |
| xG | float | Expected goals value |
| outcome | str | Goal, Saved, Blocked, Off Target, Post |
| is_goal | int | 1 if goal, 0 otherwise |
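The distance and angle columns can be derived from x and y. The sketch below shows one common construction, assuming the goal center sits at (120, 40) in StatsBomb coordinates, the posts are 8 yards apart, and "angle" means the opening angle between the two posts as seen from the shot location. The dataset may define angle differently, so treat this as an illustration rather than the method used to build the file; the function name `shot_geometry` is hypothetical.

```python
import math

GOAL_X = 120.0        # goal line on the 0-120 x-axis
GOAL_CENTER_Y = 40.0  # midpoint of the 0-80 y-axis
GOAL_WIDTH = 8.0      # yards between the posts


def shot_geometry(x, y):
    """Return (distance, angle) for a shot taken from (x, y).

    distance: yards from the shot location to the goal center.
    angle: opening angle in radians subtended by the two posts.
    """
    dx = GOAL_X - x
    dy = y - GOAL_CENTER_Y
    distance = math.hypot(dx, dy)
    half_width = GOAL_WIDTH / 2
    # tan(angle) = (goal width * dx) / (dx^2 + dy^2 - half_width^2);
    # atan2 keeps the result in [0, pi] when the denominator is negative.
    angle = math.atan2(GOAL_WIDTH * dx, dx * dx + dy * dy - half_width ** 2)
    return distance, angle


# Penalty spot: 12 yards out, level with the goal center
distance, angle = shot_geometry(108.0, 40.0)
print(round(distance, 2), round(angle, 4))
```

The opening angle shrinks both with distance and as the shooter moves wide, which is why it tends to be a stronger xG feature than distance alone.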

D.4.4 Accessing the Companion Repository

The complete companion code and datasets are available at:

Repository: github.com/[publisher]/professional-soccer-analytics

git clone https://github.com/[publisher]/professional-soccer-analytics.git
cd professional-soccer-analytics
pip install -r requirements.txt

The repository is organized as follows:

professional-soccer-analytics/
    data/                  # Sample datasets
    notebooks/             # Jupyter notebooks by chapter
    src/                   # Reusable Python modules
        pitch.py           # Pitch drawing utilities
        xg.py              # xG model implementations
        tracking.py        # Tracking data processing
        simulation.py      # Match simulation
    tests/                 # Unit tests
    requirements.txt       # Python dependencies

For the mathematical foundations of the methods applied to these datasets, see Appendix A. For Python code patterns, see Appendix C. For term definitions, see Appendix E.