Appendix D: Data Sources and Tools
This appendix catalogs the data sources, APIs, and tools referenced throughout the book. The landscape of soccer data evolves rapidly; URLs and availability are current as of the time of writing but may change. Check the book's companion website for updated links.
D.1 Free and Public Data Sources
D.1.1 StatsBomb Open Data
URL: https://github.com/statsbomb/open-data
StatsBomb provides a substantial free dataset covering select competitions with full event-level data. This is the primary dataset used in many chapters of this book.
Available Data:
- Event data (passes, shots, carries, pressures, duels, etc.)
- Lineup data
- Match metadata
- 360 freeze-frame data (for select matches)

Competitions Included (selection):
- FIFA World Cup (Men's and Women's, multiple editions)
- UEFA Euro (multiple editions)
- FA Women's Super League (multiple seasons)
- La Liga (select seasons)
- UEFA Champions League (select seasons)
- National Women's Soccer League (NWSL)
- Indian Super League
- Various international tournaments
Format: JSON files organized by competition, season, and match.
Access via Python:
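# Using the statsbombpy library (install first with: pip install statsbombpy)
from statsbombpy import sb

# List the competitions and seasons available in the open data
competitions = sb.competitions()

# Load all events for one match as a pandas DataFrame
events = sb.events(match_id=3869685)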
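If you prefer to read the raw JSON files directly rather than go through statsbombpy, the files follow a predictable layout inside the repository. The sketch below builds raw-file URLs. The base URL and directory names reflect the repository layout as understood at the time of writing and are assumptions to verify before use; `open_data_url` is a hypothetical helper, not part of any library.

```python
# Hypothetical helper: builds raw-file URLs for the StatsBomb open-data repo.
# ASSUMPTION: files live under data/<kind>/ on the master branch; verify first.
BASE_URL = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"


def open_data_url(kind, match_id=None, competition_id=None, season_id=None):
    """Return the URL of one JSON file in the open-data repository."""
    if kind == "competitions":
        return f"{BASE_URL}/competitions.json"
    if kind == "matches":
        if competition_id is None or season_id is None:
            raise ValueError("matches need competition_id and season_id")
        return f"{BASE_URL}/matches/{competition_id}/{season_id}.json"
    if kind in ("events", "lineups", "three-sixty"):
        if match_id is None:
            raise ValueError(f"{kind} need match_id")
        return f"{BASE_URL}/{kind}/{match_id}.json"
    raise ValueError(f"unknown kind: {kind!r}")


print(open_data_url("events", match_id=3869685))
```

The returned URL can be passed to any HTTP client or to `pandas.read_json`; keeping URL construction separate from downloading makes the layout assumption easy to change in one place.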
Coordinate System: 120 x 80 yards, origin at top-left (y increases downward), attacking direction left to right.
License: Free for non-commercial use with attribution. See the StatsBomb data use agreement for details.
D.1.2 FBref
URL: https://fbref.com
FBref, powered by Sports Reference, provides comprehensive aggregated statistics for professional soccer leagues worldwide.
Available Data:
- Player-level statistics (standard, shooting, passing, defensive, possession, etc.)
- Team-level statistics
- Match reports with advanced metrics
- Expected goals and expected assists (via StatsBomb/Opta)
- Scouting reports with percentile rankings

Coverage:
- Top 5 European leagues (England, Spain, Germany, Italy, France)
- Major secondary leagues (Netherlands, Portugal, Belgium, etc.)
- UEFA Champions League and Europa League
- Major international tournaments
- Historical data going back several decades (with varying detail)
Access: Web scraping (respect robots.txt and rate limits). No official API.
# Example scraping pattern (use responsibly)
import pandas as pd
import time
url = "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
tables = pd.read_html(url)
player_stats = tables[0] # First table is typically standard stats
time.sleep(5) # Respect rate limits
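Before scraping any site, it helps to check programmatically whether a path is permitted by its robots.txt. The following is a minimal sketch using only the standard library; `is_allowed` is a hypothetical helper, and the robots.txt content shown is made up for illustration rather than copied from FBref.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content for illustration only (not any real site's file)
EXAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS, "https://example.com/en/comps/9/stats"))
print(is_allowed(EXAMPLE_ROBOTS, "https://example.com/private/data"))
```

In practice you would download the site's actual robots.txt once, pass its text to this function, and skip any URL for which it returns False.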
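Notes: FBref's advanced metrics, including xG, were originally supplied by StatsBomb and have come from Opta since late 2022. Statistics are normalized per 90 minutes where indicated. Scraping policies may change, so check the site's terms before collecting data.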
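Per-90 normalization simply rescales a total by minutes played. A minimal sketch follows; `per_90` is a hypothetical helper and the numbers are made up.

```python
def per_90(total, minutes_played):
    """Scale a season total to a per-90-minutes rate."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return total * 90 / minutes_played


# Made-up numbers: 12 goals in 2,160 minutes (the equivalent of 24 full matches)
print(per_90(12, 2160))  # 0.5
```

Per-90 rates are unstable for players with few minutes, so a minimum-minutes filter is usually applied before comparing players.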
D.1.3 Understat
URL: https://understat.com
Understat provides expected goals data with shot-level detail for six major European leagues.
Available Data:
- Shot-level xG values
- Shot maps
- Player and team xG summaries
- xG timelines per match
- Situation breakdowns (open play, set piece, counter-attack, etc.)
Coverage: Premier League, La Liga, Bundesliga, Serie A, Ligue 1, Russian Premier League, from 2014-15 onward.
Access: The data is loaded dynamically via JavaScript. Scraping requires parsing embedded JSON from page source.
import requests
import json
from bs4 import BeautifulSoup
url = "https://understat.com/league/EPL/2023"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# Data is embedded in script tags as JSON
scripts = soup.find_all('script')
# Parse the relevant script tag to extract data
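The parsing step can be sketched as follows. This assumes the page assigns its data with a statement of the form `var NAME = JSON.parse('...')` whose payload is hex-escaped, which is how the site has commonly been observed to embed data; the variable name `shotsData` and the sample fragment are illustrative assumptions, and the site's markup may change.

```python
import json
import re


def extract_embedded_json(html, var_name):
    """Decode the JSON assigned to var_name via JSON.parse('...') in html."""
    pattern = rf"var\s+{re.escape(var_name)}\s*=\s*JSON\.parse\('(.*?)'\)"
    match = re.search(pattern, html, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"{var_name} not found in page source")
    # The payload is hex-escaped (e.g. \x22 for a double quote); undo that first
    decoded = match.group(1).encode("utf-8").decode("unicode_escape")
    return json.loads(decoded)


# Synthetic page fragment for illustration (not real Understat output)
sample_html = r"<script>var shotsData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x221\x22\x7D\x5D');</script>"
print(extract_embedded_json(sample_html, "shotsData"))  # [{'id': '1'}]
```

With a live page you would pass `response.text` instead of `sample_html`. Note that the `unicode_escape` step can mangle literal non-ASCII characters, so check player names after decoding.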
D.1.4 WhoScored
URL: https://www.whoscored.com
WhoScored provides match-level statistics, player ratings, and tactical analysis powered by Opta data.
Available Data:
- Match statistics (shots, possession, passes, etc.)
- Player ratings (0-10 scale)
- Heatmaps and touch maps
- Chalkboard event visualizations
- Tactical lineups and formations
Coverage: Major leagues worldwide, international tournaments.
Notes: WhoScored's proprietary player rating system is derived from Opta event data. The data is designed for web consumption and is not easily machine-readable.
D.1.5 Transfermarkt
URL: https://www.transfermarkt.com
Transfermarkt is the leading source for transfer values, market data, and biographical information.
Available Data:
- Estimated market values for players
- Transfer history and fees
- Contract details
- Player biographical data (age, nationality, position, height, foot)
- Injury history
- Appearance and goal records
Access: Web scraping or community-maintained packages.
# Community package (not official)
# pip install transfermarkt-api
# Note: Check current availability and terms of use
D.1.6 Other Free Sources
| Source | URL | Data Type | Notes |
|---|---|---|---|
| European Soccer Database (Kaggle) | kaggle.com/datasets | Match results, betting odds | SQLite database, good for learning |
| football-data.co.uk | football-data.co.uk | Match results, betting odds | CSV files, extensive historical coverage |
| Club Elo | clubelo.com | Elo ratings | Historical Elo ratings for clubs worldwide |
| 538 Soccer Predictions | projects.fivethirtyeight.com | Match predictions, SPI ratings | FiveThirtyEight's Soccer Power Index |
| Open Football | github.com/openfootball | Match results | Community-maintained, structured text format |
| Soccerway | soccerway.com | Match results, lineups | Web-based, broad international coverage |
| worldfootballR | github.com/JaseZiv/worldfootballR | R package | Aggregates data from FBref, Transfermarkt, Understat, Fotmob |
| Pappalardo et al. Dataset | figshare.com | Event data | Academic dataset covering multiple leagues |
D.2 Commercial Data Providers
D.2.1 Stats Perform (Opta)
Website: statsperform.com
Opta, now part of Stats Perform, is the most widely used event data provider in professional soccer.
Products:
- Event Data (F24/F9): Detailed event-level data with roughly 2,000 or more events per match (passes, shots, tackles, fouls, etc.) with x,y coordinates.
- Match Data: Pre-match, live, and post-match statistics.
- Player Data: Seasonal aggregates, biographical information.
- Advanced Metrics: Expected goals (xG), Expected Threat (xT), possession value frameworks.
- Content Feeds: Real-time text commentary, graphics-ready data.
Coverage: 500+ competitions worldwide, including all major European leagues, lower divisions, international tournaments, and youth football.
Coordinate System: 100 x 100 (percentage-based), origin at bottom-left.
Typical Clients: Media organizations, broadcasters, betting companies, professional clubs.
D.2.2 StatsBomb
Website: statsbomb.com
StatsBomb provides event data with finer granularity than most competitors, including pressure events and advanced tagging.
Products:
- Event Data: Highly detailed events including pressure tracking, shot freeze-frames (positions of the players in frame at the moment of each shot), and ball receipt types.
- StatsBomb 360: Freeze-frame data capturing all visible player and ball positions at the moment of key events.
- IQ Platform: Web-based analytics and visualization platform.
- Data Lab: Custom analysis and consultancy.

Unique Features:
- Pressure events tracked
- Goalkeeper positioning data
- Carry events (not just dribbles)
- Shot freeze-frame data (positions of all nearby players)
Coverage: Major European leagues, international tournaments, select lower-tier leagues.
D.2.3 Second Spectrum
Website: secondspectrum.com
Second Spectrum provides optical tracking data using computer vision applied to broadcast or stadium camera feeds.
Products:
- Tracking Data: 25 Hz positional data for all players and the ball.
- Physical Metrics: Speed, distance, acceleration profiles.
- Contextual Intelligence: Automated tactical analysis, off-ball movement detection.
- Coaching Tools: Video synchronization with data overlays.
Coverage: Official tracking partner of the Premier League, La Liga, MLS, and other leagues.
Data Format: Typically frame-by-frame JSON or CSV with x, y coordinates and timestamps.
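To illustrate how physical metrics follow from frame-by-frame positions, the sketch below derives speed from consecutive frames at the 25 Hz rate quoted above. It assumes positions are already in meters; `speeds_m_per_s` is a hypothetical helper and the track is made up.

```python
import math

FRAME_RATE_HZ = 25  # sampling rate quoted in the text


def speeds_m_per_s(positions, frame_rate_hz=FRAME_RATE_HZ):
    """Return speeds between consecutive (x, y) positions given in meters."""
    dt = 1.0 / frame_rate_hz  # seconds between frames
    return [
        math.hypot(x1 - x0, y1 - y0) / dt
        for (x0, y0), (x1, y1) in zip(positions, positions[1:])
    ]


# Made-up track: a player moving 0.2 m per frame along x, roughly 5 m/s
track = [(10.0, 30.0), (10.2, 30.0), (10.4, 30.0)]
print(speeds_m_per_s(track))
```

Raw frame-to-frame speeds are noisy in real tracking data, so providers and analysts typically smooth positions or speeds before computing sprints and accelerations.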
D.2.4 Wyscout
Website: wyscout.com (now part of Hudl)
Wyscout is a widely used scouting and analysis platform, particularly popular with clubs and agents.
Products:
- Event Data: Similar scope to Opta, with passes, shots, duels, etc.
- Video Platform: Full match video with event tagging and clipping.
- Scouting Tools: Player search, comparison, and shortlisting.
- Advanced Data: xG, progressive passes, smart passes.
Coverage: 200+ competitions, extensive lower-league and youth coverage.
Coordinate System: 100 x 100 (percentage-based), origin at top-left (y-axis inverted compared to most other providers).
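Because the y-axis is inverted, Wyscout coordinates need a flip before they can be combined with bottom-left-origin data such as Opta's. A minimal sketch is shown below; the function names are hypothetical, and the 105 x 68 m pitch is a nominal assumption since real pitch dimensions vary.

```python
PITCH_LENGTH_M = 105.0  # ASSUMPTION: nominal pitch size; real pitches vary
PITCH_WIDTH_M = 68.0


def wyscout_to_opta(x, y):
    """Flip the y-axis: top-left origin (Wyscout) to bottom-left origin (Opta)."""
    return x, 100.0 - y


def percent_to_meters(x, y, length=PITCH_LENGTH_M, width=PITCH_WIDTH_M):
    """Scale 0-100 percentage coordinates to meters."""
    return x / 100.0 * length, y / 100.0 * width


print(wyscout_to_opta(50.0, 20.0))    # (50.0, 80.0)
print(percent_to_meters(50.0, 50.0))  # (52.5, 34.0)
```

For production work, kloppy (Section D.3.4) handles these transformations across providers; the sketch is only meant to make the axis convention concrete.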
D.2.5 InStat
Website: instatsport.com
InStat provides event and video analysis primarily in Eastern European, South American, and Asian markets.
Products:
- Event data with video synchronization
- Player indices (composite ratings)
- Team reports and opposition analysis
D.2.6 SkillCorner
Website: skillcorner.com
SkillCorner provides broadcast-derived tracking data using computer vision on publicly available TV footage.
Products:
- Broadcast Tracking: Physical and tactical metrics derived from TV broadcasts.
- Off-ball Metrics: Running intensity, pressing behavior, positioning.
- Open Data: Limited free dataset available for academic and personal use.
Coverage: Major leagues worldwide, based on availability of broadcast footage.
D.2.7 Comparison of Providers
| Provider | Data Type | Coverage | Coordinate System | Typical Cost |
|---|---|---|---|---|
| Opta / Stats Perform | Event | 500+ comps | 100 x 100 | $$$$$ |
| StatsBomb | Event + 360 | 30+ comps | 120 x 80 | $$$$ |
| Second Spectrum | Tracking | Select leagues | Meters (real coords) | $$$$$ |
| Wyscout / Hudl | Event + Video | 200+ comps | 100 x 100 | $$$ |
| InStat | Event + Video | 100+ comps | Varies | $$ |
| SkillCorner | Tracking (broadcast) | Major leagues | Meters (real coords) | $$$ |
D.3 APIs and Tools
D.3.1 Python Libraries
| Library | Purpose | Install Command |
|---|---|---|
| statsbombpy | Access StatsBomb data | pip install statsbombpy |
| mplsoccer | Soccer pitch plotting and visualization | pip install mplsoccer |
| socceraction | SPADL, VAEP, xT implementations | pip install socceraction |
| kloppy | Standardize data across providers | pip install kloppy |
| codeball | Expected possession value (EPV) | pip install codeball |
| pandas | Data manipulation | pip install pandas |
| scikit-learn | Machine learning | pip install scikit-learn |
| statsmodels | Statistical modeling | pip install statsmodels |
| xgboost | Gradient boosting | pip install xgboost |
| lightgbm | Gradient boosting (fast) | pip install lightgbm |
| torch / pytorch | Deep learning | pip install torch |
| networkx | Network/graph analysis | pip install networkx |
D.3.2 R Libraries
| Library | Purpose |
|---|---|
| worldfootballR | Scrape data from FBref, Transfermarkt, Understat, Fotmob |
| StatsBombR | Access StatsBomb open data |
| ggsoccer | Soccer pitch visualization in ggplot2 |
| ggplot2 | General data visualization |
| tidyverse | Data wrangling suite |
| brms | Bayesian regression modeling |
| rstanarm | Applied Bayesian regression |
D.3.3 Visualization Tools
| Tool | Type | Best For |
|---|---|---|
| mplsoccer (Python) | Library | Pitch plots, heatmaps, shot maps, pass networks |
| ggsoccer (R) | Library | Pitch plots in ggplot2 framework |
| Tableau | Software | Interactive dashboards, no-code exploration |
| D3.js | JavaScript | Custom web-based interactive visualizations |
| Flourish | Web app | Animated and interactive visualizations |
| Figma / Sketch | Design | Publication-quality static graphics |
D.3.4 Data Standardization
Different data providers use different event definitions, coordinate systems, and naming conventions. The kloppy library standardizes data into a common format:
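from kloppy import statsbomb  # opta and wyscout modules are also available

# Load StatsBomb open data into kloppy's standardized event model
dataset = statsbomb.load_open_data(
    match_id=3869685,
    coordinates="statsbomb",
)

# Convert to SPADL format using socceraction. The converter expects events
# loaded with socceraction's own StatsBomb loader, plus the home team's ID.
import socceraction.spadl as spadl
from socceraction.data.statsbomb import StatsBombLoader
loader = StatsBombLoader(getter="remote")  # open data; requires statsbombpy
events = loader.events(game_id=3869685)
# Competition and season IDs below are assumed for this match; verify them
games = loader.games(competition_id=43, season_id=106)
home_team_id = games.set_index("game_id").at[3869685, "home_team_id"]
actions = spadl.statsbomb.convert_to_actions(events, home_team_id)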
SPADL (Soccer Player Action Description Language): A standardized representation introduced by Decroos et al. (2019) that converts all event data into a consistent format with 21 action types. Used as the foundation for VAEP (Valuing Actions by Estimating Probabilities). See Chapter 9.
D.4 Sample Datasets Included with This Book
The companion repository for this book includes several curated datasets for reproducing the analyses in each chapter.
D.4.1 Dataset Inventory
| Dataset | Size | Description | Chapters |
|---|---|---|---|
| sample_matches.csv | 500 KB | Match results from top 5 leagues (2018-2024) | 1, 3, 5 |
| sample_shots.parquet | 12 MB | 50,000 shots with xG values, locations, outcomes | 6, 7, 8 |
| sample_events.parquet | 45 MB | Full event data for 100 matches | 10, 11, 12 |
| sample_tracking.parquet | 200 MB | Tracking data (25 Hz) for 5 matches | 17, 18, 19 |
| player_seasons.csv | 3 MB | Player-season aggregated stats (5 leagues, 6 seasons) | 15, 20, 21 |
| team_seasons.csv | 200 KB | Team-season aggregated stats | 9, 16 |
| transfer_values.csv | 1 MB | Transfer values and fees (2015-2024) | 21, 25 |
| sample_lineups.json | 500 KB | Lineup and formation data for 100 matches | 10, 22 |
| xg_model_training.parquet | 8 MB | Pre-processed shot data for xG model training | 6, 7 |
| league_tables.csv | 100 KB | Final league standings (top 5 leagues, 10 seasons) | 4, 5, 16 |
D.4.2 Loading the Sample Data
import os
import pandas as pd

# Set base path (adjust as needed). __file__ is undefined in notebooks,
# so set DATA_DIR to an explicit path there instead.
DATA_DIR = os.path.join(os.path.dirname(__file__), '..', 'data')

# Load match data
matches = pd.read_csv(
    os.path.join(DATA_DIR, 'sample_matches.csv'),
    parse_dates=['date'],
)

# Load shot data (read_parquet needs pyarrow or fastparquet installed)
shots = pd.read_parquet(os.path.join(DATA_DIR, 'sample_shots.parquet'))

# Load tracking data
tracking = pd.read_parquet(os.path.join(DATA_DIR, 'sample_tracking.parquet'))
D.4.3 Data Dictionaries
sample_shots.parquet columns:
| Column | Type | Description |
|---|---|---|
| shot_id | int | Unique shot identifier |
| match_id | int | Match identifier |
| team | str | Shooting team |
| player | str | Shooter name |
| minute | int | Match minute |
| x | float | X coordinate (StatsBomb: 0-120) |
| y | float | Y coordinate (StatsBomb: 0-80) |
| end_x | float | Shot end X coordinate |
| end_y | float | Shot end Y coordinate |
| body_part | str | Right Foot, Left Foot, Head, Other |
| technique | str | Normal, Volley, Half Volley, etc. |
| situation | str | Open Play, Set Piece, Corner, Free Kick, Penalty |
| first_time | bool | Whether the shot was first-time |
| distance | float | Distance to goal center (yards) |
| angle | float | Angle to goal (radians) |
| num_defenders | int | Defenders between shooter and goal |
| gk_x | float | Goalkeeper X position |
| gk_y | float | Goalkeeper Y position |
| xG | float | Expected goals value |
| outcome | str | Goal, Saved, Blocked, Off Target, Post |
| is_goal | int | 1 if goal, 0 otherwise |
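The `distance` and `angle` columns can be derived from `x` and `y`. The sketch below shows one plausible derivation on a 120 x 80 pitch with the goal centered at (120, 40) and 8 yards wide. Treating `angle` as the angle subtended by the two posts is an assumption; the dataset may use a different definition, and the function names are hypothetical.

```python
import math

# 120 x 80 pitch, attacking left to right; goal centered on the right goal line
GOAL_X = 120.0
GOAL_Y = 40.0
GOAL_WIDTH = 8.0  # yards between the posts


def shot_distance(x, y):
    """Distance in yards from the shot location to the center of the goal."""
    return math.hypot(GOAL_X - x, GOAL_Y - y)


def shot_angle(x, y):
    """Angle in radians subtended by the two posts at (x, y). ASSUMED definition."""
    dx = GOAL_X - x
    to_far_post = math.atan2(GOAL_Y + GOAL_WIDTH / 2 - y, dx)
    to_near_post = math.atan2(GOAL_Y - GOAL_WIDTH / 2 - y, dx)
    return abs(to_far_post - to_near_post)


# Penalty spot: 12 yards out, directly in front of goal
print(shot_distance(108.0, 40.0))  # 12.0
print(shot_angle(108.0, 40.0))
```

Comparing these values against the stored columns for a few rows is a quick way to confirm which definition the dataset actually uses.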
D.4.4 Accessing the Companion Repository
The complete companion code and datasets are available at:
Repository: github.com/[publisher]/professional-soccer-analytics
git clone https://github.com/[publisher]/professional-soccer-analytics.git
cd professional-soccer-analytics
pip install -r requirements.txt
The repository is organized as follows:
professional-soccer-analytics/
    data/                # Sample datasets
    notebooks/           # Jupyter notebooks by chapter
    src/                 # Reusable Python modules
        pitch.py         # Pitch drawing utilities
        xg.py            # xG model implementations
        tracking.py      # Tracking data processing
        simulation.py    # Match simulation
    tests/               # Unit tests
    requirements.txt     # Python dependencies
For the mathematical foundations of the methods applied to these datasets, see Appendix A. For Python code patterns, see Appendix C. For term definitions, see Appendix E.