Soccer Data Landscape
Navigating Soccer Data Sources
Soccer analytics requires quality data. This guide explores the major data providers, from free open datasets to premium professional services, helping you choose the right sources for your projects.
Free & Open Data Sources
StatsBomb Open Data
Best for: Learning, research, portfolio projects
What's Included:
- 7+ competitions (World Cups, Champions League, NWSL, etc.)
- Detailed event data, with 360-degree freeze frames for selected competitions
- 3,000+ matches spanning multiple seasons
- Player pressure events and freeze frames
- Complete match lineups and formations
Data Quality:
High-quality, densely annotated event data: StatsBomb advertises more than 3,400 events recorded per match. Each event includes:
- Precise x,y coordinates
- Timestamp (minute, second, millisecond)
- Player and team information
- Event outcome and technique
- Contextual information (pressure, body part, etc.)
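As a rough illustration of these fields, the sketch below reads one event record shaped like the raw StatsBomb open-data JSON. The sample values are invented for illustration, and `period_seconds` and `summarize_event` are hypothetical helpers, not part of any library.

```python
from datetime import timedelta

# Illustrative record shaped like a raw StatsBomb event; the values are invented.
sample_event = {
    "period": 1,
    "timestamp": "00:12:34.567",   # time elapsed within the period
    "minute": 12,
    "second": 34,
    "type": {"id": 30, "name": "Pass"},
    "team": {"id": 1, "name": "Example FC"},
    "player": {"id": 10, "name": "Example Player"},
    "location": [61.5, 40.2],      # x, y on StatsBomb's 120 x 80 pitch
    "under_pressure": True,
    "pass": {"body_part": {"id": 40, "name": "Right Foot"}},
}


def period_seconds(timestamp: str) -> float:
    """Convert an 'HH:MM:SS.mmm' timestamp into seconds since the period began."""
    hours, minutes, seconds = timestamp.split(":")
    elapsed = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    return elapsed.total_seconds()


def summarize_event(event: dict) -> dict:
    """Flatten the nested fields most analyses need into one small dict."""
    x, y = event.get("location", (None, None))  # some events carry no location
    return {
        "type": event["type"]["name"],
        "team": event["team"]["name"],
        "player": event.get("player", {}).get("name"),
        "x": x,
        "y": y,
        "seconds_into_period": period_seconds(event["timestamp"]),
        "under_pressure": event.get("under_pressure", False),
    }


print(summarize_event(sample_event))
```

Libraries such as statsbombpy perform this flattening for you; the sketch only shows where each field sits in the raw record.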
Python: Loading StatsBomb Data
from statsbombpy import sb
import pandas as pd
# List available competitions
competitions = sb.competitions()
print("Available Competitions:")
print(competitions[['competition_name', 'season_name']])
# Get matches from a competition
# FIFA World Cup 2018
matches = sb.matches(competition_id=43, season_id=3)
print(f"\nFound {len(matches)} matches")
# Load events from a specific match
# Example: France vs Croatia Final
match_id = 8658 # World Cup 2018 Final
events = sb.events(match_id=match_id)
print(f"\nLoaded {len(events)} events")
print("\nEvent types:")
print(events['type'].value_counts())
# Get all passes from the match
passes = events[events['type'] == 'Pass'].copy()
print(f"\nTotal passes: {len(passes)}")
# Analyze pass completion by team
pass_summary = passes.groupby('team').agg({
    # StatsBomb leaves pass_outcome empty for completed passes
    'pass_outcome': lambda x: (x.isna().sum() / len(x) * 100),  # % complete
    'id': 'count'
}).round(2)
pass_summary.columns = ['Completion %', 'Total Passes']
print("\nPass Summary:")
print(pass_summary)
# Get match lineups (the full matchday squad for each team, not only the starters)
lineup = sb.lineups(match_id=match_id)
for team in lineup:
    print(f"\n{team} squad:")
    print(lineup[team][['player_name', 'jersey_number']])
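The completion calculation above relies on a StatsBomb convention: a pass with no recorded outcome was completed, while an outcome such as "Incomplete" or "Out" marks a failed pass. A minimal pure-Python sketch of the same logic follows, using invented rows and a hypothetical `completion_by_team` helper. In a pandas DataFrame the missing value appears as NaN rather than None.

```python
from collections import defaultdict


def completion_by_team(passes):
    """Return the pass completion percentage per team.

    A pass counts as completed when its 'pass_outcome' is missing (None).
    """
    totals = defaultdict(lambda: [0, 0])  # team -> [completed, attempted]
    for p in passes:
        counts = totals[p["team"]]
        counts[1] += 1
        if p.get("pass_outcome") is None:
            counts[0] += 1
    return {
        team: round(100 * completed / attempted, 2)
        for team, (completed, attempted) in totals.items()
    }


# Invented rows for illustration only.
toy_passes = [
    {"team": "A", "pass_outcome": None},
    {"team": "A", "pass_outcome": "Incomplete"},
    {"team": "A", "pass_outcome": None},
    {"team": "B", "pass_outcome": "Out"},
    {"team": "B", "pass_outcome": None},
]

print(completion_by_team(toy_passes))
```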
R: Working with StatsBomb Data
library(StatsBombR)
library(dplyr)
library(ggplot2)
library(ggsoccer)  # provides annotate_pitch(), theme_pitch(), and pitch_statsbomb
# Get available competitions
competitions <- FreeCompetitions()
print(head(competitions))
# Get matches from World Cup 2018
# FreeMatches() expects a data frame of competitions rather than bare IDs
world_cup_2018 <- competitions %>%
  filter(competition_id == 43, season_id == 3)
matches <- FreeMatches(world_cup_2018)
cat(sprintf("Found %d matches\n", nrow(matches)))
# Load events from France vs Croatia final
events <- get.matchFree(matches[matches$match_id == 8658, ])
events_clean <- allclean(events)
cat(sprintf("Loaded %d events\n", nrow(events_clean)))
# Analyze shot locations
shots <- events_clean %>%
  filter(type.name == "Shot")
# Create shot map (draw the pitch first so it does not cover the points)
ggplot(shots, aes(x = location.x, y = location.y, color = shot.outcome.name)) +
  annotate_pitch(dimensions = pitch_statsbomb) +
  geom_point(size = 3, alpha = 0.6) +
  # Outcomes not listed here (e.g. "Post", "Wayward") fall back to the NA colour
  scale_color_manual(values = c(
    "Goal" = "#28a745",
    "Saved" = "#ffc107",
    "Off T" = "#dc3545",
    "Blocked" = "#6c757d"
  )) +
  scale_y_reverse() +  # StatsBomb's y axis runs from top to bottom
  labs(
    title = "World Cup 2018 Final - Shot Map",
    subtitle = "France vs Croatia",
    color = "Outcome"
  ) +
  theme_pitch() +
  coord_fixed(ratio = 1)
# Summarise shots, xG, and goals by team
shots_summary <- shots %>%
  group_by(team.name) %>%
  summarise(
    total_shots = n(),
    total_xG = sum(shot.statsbomb_xg, na.rm = TRUE),
    goals = sum(shot.outcome.name == "Goal", na.rm = TRUE)
  )
print(shots_summary)
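Shot locations in StatsBomb data use a 120 x 80 unit pitch with the origin in the top-left corner, so distances are not in metres and the y axis points down. The sketch below rescales a location to metres and flips the y axis. The 105 x 68 m pitch size is an assumption (real pitches vary), and `statsbomb_to_metres` is a hypothetical helper.

```python
STATSBOMB_LENGTH, STATSBOMB_WIDTH = 120.0, 80.0
PITCH_LENGTH_M, PITCH_WIDTH_M = 105.0, 68.0  # assumed pitch size in metres


def statsbomb_to_metres(x, y, flip_y=True):
    """Rescale a StatsBomb (x, y) location to metres.

    With flip_y=True the y axis is inverted so that it increases upwards,
    which matches the usual plotting convention.
    """
    x_m = x * PITCH_LENGTH_M / STATSBOMB_LENGTH
    y_units = (STATSBOMB_WIDTH - y) if flip_y else y
    y_m = y_units * PITCH_WIDTH_M / STATSBOMB_WIDTH
    return x_m, y_m


# The attacking penalty spot sits at (108, 40) in StatsBomb units.
print(statsbomb_to_metres(108.0, 40.0))
```

Rescaling each axis independently keeps pitch landmarks aligned but slightly distorts distances, because the two axes use different scale factors.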
FBref (Football Reference)
Best for: Season-long statistics, player comparisons
Coverage:
- 30+ leagues worldwide
- Historical data from multiple seasons
- Advanced data supplied by Opta (StatsBomb before the 2022 switch)
- Advanced metrics (xG, xA, progressive passes)
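R: Scraping FBref Data with worldfootballR
# worldfootballR is an R package, so this example uses R rather than Python
library(worldfootballR)
library(dplyr)
# Team-level shooting stats for the 2023-24 Premier League season
team_shooting <- fb_season_team_stats(
  country = "ENG",
  gender = "M",
  season_end_year = 2024,
  tier = "1st",
  stat_type = "shooting"
)
print(head(team_shooting))
# Player-level shooting stats across the Big 5 European leagues
# Available leagues: Premier League, La Liga, Bundesliga, Serie A, Ligue 1
player_shooting <- fb_big5_advanced_season_stats(
  season_end_year = 2024,
  stat_type = "shooting",
  team_or_player = "player"
)
# Analyze top scorers
# Column names follow worldfootballR's output and may change between versions
top_scorers <- player_shooting %>%
  slice_max(Gls_Standard, n = 10)
cat("Top 10 Scorers:\n")
print(top_scorers[, c("Player", "Squad", "Gls_Standard", "xG_Expected", "Sh_Standard")])
# Get player scouting report
player_stats <- fb_player_scouting_report(
  player_url = "https://fbref.com/en/players/1f44ac21/Erling-Haaland",
  pos_versus = "primary"
)
cat("\nPlayer Scouting Report:\n")
print(player_stats)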
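Scrapers that request many FBref pages in a row should pace themselves, since the site limits automated traffic. The sketch below spaces calls a fixed interval apart. The figure of 10 requests per minute is an assumption that should be checked against the site's current policy, and `Throttle` is a hypothetical helper rather than part of any scraping library.

```python
import time


class Throttle:
    """Block so that successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum interval, then record the call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Assumed budget: 10 requests per minute, i.e. one request every 6 seconds.
fbref_throttle = Throttle(min_interval=6.0)
```

Calling `fbref_throttle.wait()` immediately before each page request keeps a loop over many players or seasons inside the assumed budget.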
Wyscout API (Limited Free Access)
Best for: Academic research (free academic licenses available)
Features:
- Extensive coverage across 30+ leagues
- Detailed event data with tags
- Player attributes and market values
- Video clips linked to events
Premium Data Providers
Opta Sports
Used by: Premier League, ESPN, major broadcasters
Capabilities:
- Real-time match data collection
- 40+ years of historical data
- 150+ competitions worldwide
- Advanced metrics and expected goals models
- Custom API access
Pricing: Enterprise pricing (contact sales)
Best for: Professional clubs, media companies, betting operators
StatsBomb (Professional)
Used by: Top European clubs, national teams
Unique Features:
- 360-degree freeze frames for every event
- Pressure and defensive action context
- Industry-leading xG model
- IQ platform for visual analysis
- Custom metrics and KPIs
Second Spectrum (Tracking Data)
Used by: Leagues such as the Premier League and MLS
Technology:
- Optical tracking system sampling positions roughly 25 times per second
- Player and ball x,y coordinates
- Speed, acceleration, distance metrics
- Space occupation and team shape analysis
- Machine learning-powered insights
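To make the speed and distance metrics concrete, the sketch below derives them from consecutive positions of one player. It assumes positions in metres sampled at 25 frames per second; the `track` values are invented, and the function names are hypothetical.

```python
import math

FRAME_RATE_HZ = 25  # assumed sampling rate of the tracking feed


def speeds_mps(positions, frame_rate=FRAME_RATE_HZ):
    """Speed in metres per second between each pair of consecutive frames."""
    dt = 1.0 / frame_rate
    return [math.dist(p0, p1) / dt for p0, p1 in zip(positions, positions[1:])]


def total_distance_m(positions):
    """Total path length in metres covered across all frames."""
    return sum(math.dist(p0, p1) for p0, p1 in zip(positions, positions[1:]))


# Invented positions (x, y) in metres for four consecutive frames.
track = [(0.0, 0.0), (0.3, 0.0), (0.6, 0.0), (0.6, 0.4)]

print(speeds_mps(track))
print(total_distance_m(track))
```

Frame-to-frame differences amplify measurement noise, so positions are normally smoothed before speeds or accelerations are reported.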
Comparison Table
| Provider | Cost | Data Type | Coverage | Best For |
|---|---|---|---|---|
| StatsBomb Open | Free | Event | 7+ competitions | Learning, portfolios |
| FBref | Free | Aggregated | 30+ leagues | Season statistics |
| Understat | Free | xG Stats | Top 6 leagues | xG analysis |
| Opta | $$$$$ | Event | 150+ comps | Professionals |
| StatsBomb Pro | $$$$ | Event | 50+ comps | Clubs, analysts |
| Wyscout | $$$$ | Event + Video | 30+ leagues | Scouting, research |
| Second Spectrum | $$$$$ | Tracking | Select leagues | Elite clubs |
API Libraries and Tools
Python Libraries
statsbombpy
Official StatsBomb Python library
pip install statsbombpy
mplsoccer
Soccer pitch visualization
pip install mplsoccer
socceraction
Convert event data to the SPADL format and value actions with VAEP or xT
pip install socceraction
kloppy
Standardize event and tracking data
pip install kloppy
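After installing, a quick check confirms which of the four libraries are importable in the current environment. This is a minimal sketch using only the standard library; `missing_libraries` is a hypothetical helper.

```python
from importlib.util import find_spec

SOCCER_LIBRARIES = ["statsbombpy", "mplsoccer", "socceraction", "kloppy"]


def missing_libraries(names=SOCCER_LIBRARIES):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_libraries()
if missing:
    print("Missing:", ", ".join(missing))
    print("Install with: pip install " + " ".join(missing))
else:
    print("All soccer analytics libraries are installed.")
```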
R Packages
Installing R Soccer Packages
# StatsBomb data access (not on CRAN; install from GitHub)
install.packages("devtools")
devtools::install_github("statsbomb/StatsBombR")
# World football data scraping
install.packages("worldfootballR")
# Soccer pitch plotting
install.packages("ggsoccer")
# Advanced plotting
install.packages("ggplot2")
# Data manipulation
install.packages("dplyr")
# Load libraries
library(StatsBombR)
library(worldfootballR)
library(ggsoccer)
library(ggplot2)
library(dplyr)
Choosing the Right Data Source
Decision Guide:
- Learning soccer analytics? → Start with StatsBomb Open Data
- Building a portfolio project? → StatsBomb Open + FBref
- Academic research? → StatsBomb Open or Wyscout academic license
- Professional club analysis? → Opta or StatsBomb Professional
- Broadcasting/media? → Opta with real-time feeds
- Advanced tactical analysis? → Second Spectrum tracking data
Data Quality Considerations
- Consistency: Event definitions vary between providers
- Coverage: Not all providers cover all leagues equally
- Timeliness: Free data may have delays; professional feeds are real-time
- Granularity: Tracking data records every player's position many times per second, while event data logs only on-ball actions
- Cost vs. Value: Premium data is expensive but offers significant advantages
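The consistency point is easiest to see in code. The sketch below maps provider-specific event labels onto one shared vocabulary before any cross-provider comparison. The labels in the table are illustrative and should be checked against each provider's specification, and `normalize_event_type` is a hypothetical helper; a library such as kloppy handles this standardization far more thoroughly.

```python
# Illustrative label tables; verify against each provider's data specification.
COMMON_EVENT_NAMES = {
    "statsbomb": {"Pass": "pass", "Shot": "shot", "Foul Committed": "foul"},
    "wyscout": {"Pass": "pass", "Shot": "shot", "Foul": "foul"},
}


def normalize_event_type(provider, raw_name):
    """Translate a provider's event label into the shared vocabulary.

    Unknown providers or labels map to 'other' so they stay visible
    instead of being silently dropped.
    """
    try:
        return COMMON_EVENT_NAMES[provider][raw_name]
    except KeyError:
        return "other"


print(normalize_event_type("statsbomb", "Foul Committed"))
print(normalize_event_type("wyscout", "Foul"))
```

Matching labels is only the first step: two providers can use the same name for an event and still apply different rules for when it is recorded.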
Getting API Access
StatsBomb Open Data Setup
# Python
pip install statsbombpy
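# R (StatsBombR is not on CRAN; install from GitHub)
install.packages("devtools")
devtools::install_github("statsbomb/StatsBombR")
The open data requires no API key: install a package and start loading matches.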
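Because no key is involved, the raw JSON files can also be read straight from the public statsbomb/open-data GitHub repository. The sketch below builds the file URLs and fetches them with the standard library. It assumes the repository keeps its current directory layout, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"


def competitions_url():
    """URL of the file listing every open competition and season."""
    return f"{BASE_URL}/competitions.json"


def matches_url(competition_id, season_id):
    """URL of the match list for one competition season."""
    return f"{BASE_URL}/matches/{competition_id}/{season_id}.json"


def events_url(match_id):
    """URL of the event file for one match."""
    return f"{BASE_URL}/events/{match_id}.json"


def fetch_json(url, timeout=30):
    """Download and decode one JSON file (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(matches_url(43, 3))   # FIFA World Cup 2018
print(events_url(8658))     # World Cup 2018 Final
```

For example, `fetch_json(events_url(8658))` would return the final's events as a list of dictionaries, the same records that statsbombpy flattens into a DataFrame.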
Wyscout Academic Access
Steps:
- Visit Wyscout's academic program page
- Provide proof of academic affiliation (.edu email)
- Describe your research project
- Receive API credentials within 2-4 weeks
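Once credentials arrive, keep them out of source code. The sketch below reads them from environment variables and builds an HTTP Basic authorization header. The variable names `WYSCOUT_USERNAME` and `WYSCOUT_PASSWORD` are hypothetical, and Basic authentication is an assumption; follow whatever scheme the documentation supplied with your credentials describes.

```python
import base64
import os


def load_credentials(env=os.environ):
    """Read the username and password from (hypothetical) environment variables."""
    try:
        return env["WYSCOUT_USERNAME"], env["WYSCOUT_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable first"
        ) from None


def basic_auth_header(username, password):
    """Build an HTTP Basic authorization header (assumed auth scheme)."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

A typical use would be `headers = basic_auth_header(*load_credentials())`, passed along with each API request.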
Ready to Start?
Now that you know the data landscape, proceed to:
- Set up your development environment
- Load your first dataset
- Conduct basic analysis