Case Study 2: Open Data and the Democratization of Soccer Analytics
Background
In 2018, StatsBomb made an unprecedented decision: they released a substantial portion of their professional-grade event data as open data, freely available to anyone. This included complete event data for the 2018 FIFA World Cup, multiple seasons of the FA Women's Super League, select Champions League matches, and data from La Liga, the NWSL, and several international competitions.
This case study traces the ripple effects of this decision and the broader open data movement on the soccer analytics ecosystem, examining how it has lowered barriers to entry, accelerated research, and reshaped the talent pipeline for the industry.
The State of Affairs Before Open Data
Prior to the open data movement, soccer analytics was characterized by severe information asymmetry:
- Professional data (event data, tracking data) cost tens of thousands to hundreds of thousands of dollars per year, affordable only for professional clubs and well-funded media organizations.
- Academic researchers had to negotiate individual agreements with data providers, often with restrictive terms that limited reproducibility.
- Aspiring analysts had no way to develop skills on real data without a professional affiliation.
- The analytics community relied on scraped, unstructured data of inconsistent quality, making it difficult to produce reliable analysis.
This created a catch-22: to get hired in soccer analytics, you needed to demonstrate skills with professional data, but to access professional data, you needed to already be hired.
The Open Data Ecosystem
StatsBomb Open Data
StatsBomb's open data release included:
- Event-level data: Every on-ball event (passes, shots, tackles, carries, etc.) with precise x,y coordinates, outcome labels, and rich contextual attributes.
- Lineup data: Starting lineups, substitutions, formations.
- Match metadata: Competition, season, date, venue.
- 360 freeze frames: For select matches, the positions of all visible players at the moment of key events --- a bridge between event data and tracking data.
The data was released on GitHub under a permissive license, with clear documentation and a Python access library (statsbombpy).
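To make the event structure concrete, the sketch below parses a few hand-written records shaped like StatsBomb event JSON. The field names (`type.name`, `team.name`, `location`, `shot.statsbomb_xg`, `shot.outcome.name`) follow the open-data format as we understand it; the records themselves are invented for illustration, and `summarize_shots` is a hypothetical helper, not part of statsbombpy.

```python
import json
from typing import Any, Dict, List

# Hand-written sample shaped like StatsBomb event records (all values invented).
SAMPLE_EVENTS_JSON = """
[
  {"type": {"name": "Pass"}, "team": {"name": "Team A"},
   "location": [60.0, 40.0], "pass": {"end_location": [75.0, 30.0]}},
  {"type": {"name": "Shot"}, "team": {"name": "Team A"},
   "location": [108.0, 40.0],
   "shot": {"statsbomb_xg": 0.76, "outcome": {"name": "Goal"}}},
  {"type": {"name": "Shot"}, "team": {"name": "Team B"},
   "location": [95.0, 25.0],
   "shot": {"statsbomb_xg": 0.04, "outcome": {"name": "Off T"}}}
]
"""


def summarize_shots(events: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Aggregate shots, goals, and summed xG per team (hypothetical helper)."""
    summary: Dict[str, Dict[str, float]] = {}
    for event in events:
        # Only shot events carry the nested "shot" attributes.
        if event["type"]["name"] != "Shot":
            continue
        team = event["team"]["name"]
        entry = summary.setdefault(team, {"shots": 0, "goals": 0, "xg": 0.0})
        entry["shots"] += 1
        entry["xg"] += event["shot"]["statsbomb_xg"]
        if event["shot"]["outcome"]["name"] == "Goal":
            entry["goals"] += 1
    return summary


events = json.loads(SAMPLE_EVENTS_JSON)
shot_summary = summarize_shots(events)
```

With real data, the same loop could run over events read from the GitHub JSON files or loaded through statsbombpy.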
Other Open Data Initiatives
The StatsBomb release catalyzed a broader movement:
- Metrica Sports released sample tracking data (positional coordinates for all 22 players and the ball at 25 fps) for two complete matches, enabling tracking data research without a professional license.
- Pappalardo et al. published event data from multiple European leagues in Scientific Data, a Nature Portfolio journal, establishing a citable, peer-reviewed open dataset.
- Wyscout launched academic access programs, providing research-grade data to university groups.
- Friends of Tracking (YouTube series by Laurie Shaw, David Sumpter, and others) combined open data with educational content, creating a complete self-study curriculum.
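As an illustration of what tracking data makes possible, the following sketch derives a player's speed from consecutive positional frames. It assumes a 25 fps frame rate (as in the Metrica sample), positions normalized to the 0-1 range, and a 105 m x 68 m pitch; the pitch dimensions and the sample track are assumptions made for this example.

```python
import math
from typing import List, Tuple

FRAME_RATE_HZ = 25.0    # frames per second, as in the Metrica sample
PITCH_LENGTH_M = 105.0  # assumed pitch length in metres
PITCH_WIDTH_M = 68.0    # assumed pitch width in metres


def speeds_from_positions(positions: List[Tuple[float, float]]) -> List[float]:
    """Return speed in m/s between consecutive frames of normalized positions."""
    dt = 1.0 / FRAME_RATE_HZ
    speeds = []
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        # Scale normalized displacement to metres before measuring distance.
        dx = (x1 - x0) * PITCH_LENGTH_M
        dy = (y1 - y0) * PITCH_WIDTH_M
        speeds.append(math.hypot(dx, dy) / dt)
    return speeds


# Invented track: a player moving 0.2 m per frame along the pitch length.
track = [(0.5 + i * 0.2 / PITCH_LENGTH_M, 0.5) for i in range(4)]
player_speeds = speeds_from_positions(track)
```

Frame-to-frame speeds computed this way are noisy on real tracking data, so analyses usually smooth them before use.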
The Community Response
The impact was immediate and substantial. The plotting code below charts illustrative, not measured, indicators of community activity:
```python
import matplotlib.pyplot as plt
from typing import List


def visualize_community_growth(
    years: List[int],
    github_repos: List[int],
    blog_posts: List[int],
    conference_submissions: List[int],
) -> plt.Figure:
    """Visualize the growth of the soccer analytics community.

    Creates a multi-line chart showing growth across different
    indicators of community activity.

    Args:
        years: List of years for the x-axis.
        github_repos: Count of soccer analytics GitHub repos per year.
        blog_posts: Count of public analytical blog posts per year.
        conference_submissions: Count of conference paper submissions.

    Returns:
        Matplotlib figure with growth visualization.
    """
    fig, ax1 = plt.subplots(figsize=(10, 6))

    color_repos = "#2E86AB"
    color_blogs = "#A23B72"
    color_conf = "#F18F01"

    # One line per indicator, all sharing a single count axis.
    ax1.plot(years, github_repos, color=color_repos, marker="o",
             linewidth=2, label="GitHub Repos")
    ax1.plot(years, blog_posts, color=color_blogs, marker="s",
             linewidth=2, label="Blog Posts")
    ax1.plot(years, conference_submissions, color=color_conf, marker="^",
             linewidth=2, label="Conference Submissions")

    ax1.set_xlabel("Year", fontsize=12)
    ax1.set_ylabel("Count", fontsize=12)
    ax1.set_title("Growth of the Soccer Analytics Community", fontsize=14)
    ax1.legend(loc="upper left", fontsize=10)
    ax1.grid(True, alpha=0.3)

    plt.tight_layout()
    return fig


# Illustrative data (directional, not precise counts)
years = [2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025]
github_repos = [15, 25, 60, 150, 280, 420, 550, 700, 850, 1000]
blog_posts = [50, 80, 200, 500, 800, 1100, 1400, 1700, 2000, 2300]
conference_subs = [30, 45, 80, 140, 200, 280, 350, 400, 450, 500]

# Build and display the chart from the illustrative series.
fig = visualize_community_growth(years, github_repos, blog_posts, conference_subs)
plt.show()
```
Five Stories of Impact
Story 1: The Self-Taught Analyst
Maria, a computer science student in Colombia, had never worked in professional football. Using StatsBomb open data and tutorials from Friends of Tracking, she built an expected goals model, wrote a series of blog posts analyzing the Colombian women's national team, and published her code on GitHub. Her work was noticed by a data analyst at a South American club, who recommended her for an internship. Within two years, she was working as a full-time analyst for a first-division club.
Maria's story illustrates how open data broke the catch-22: she could demonstrate professional-quality work without professional access, creating an on-ramp that did not exist before.
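An expected goals model of the kind described here typically starts from shot geometry. The sketch below is a minimal illustration, not Maria's model: it assumes StatsBomb's 120 x 80 pitch coordinates with the goal centred at (120, 40) and posts at y = 36 and y = 44, and it uses invented, unfitted logistic coefficients purely to show the shape of the calculation.

```python
import math
from typing import Tuple

GOAL_X = 120.0                            # goal line in StatsBomb coordinates
LEFT_POST_Y, RIGHT_POST_Y = 36.0, 44.0    # posts of an 8-yard-wide goal
GOAL_CENTRE_Y = 40.0

# Hypothetical coefficients chosen for illustration; a real model would
# estimate these by fitting a logistic regression to shot outcomes.
INTERCEPT, DISTANCE_COEF, ANGLE_COEF = -1.0, -0.1, 1.5


def shot_features(x: float, y: float) -> Tuple[float, float]:
    """Return (distance to goal centre, angle subtended by the goal in radians)."""
    distance = math.hypot(GOAL_X - x, GOAL_CENTRE_Y - y)
    angle = (math.atan2(RIGHT_POST_Y - y, GOAL_X - x)
             - math.atan2(LEFT_POST_Y - y, GOAL_X - x))
    return distance, abs(angle)


def toy_xg(x: float, y: float) -> float:
    """Map shot geometry to a probability with a logistic function."""
    distance, angle = shot_features(x, y)
    logit = INTERCEPT + DISTANCE_COEF * distance + ANGLE_COEF * angle
    return 1.0 / (1.0 + math.exp(-logit))


penalty_spot_xg = toy_xg(108.0, 40.0)   # central, 12 yards out
long_range_xg = toy_xg(90.0, 40.0)      # central, 30 yards out
```

Even with made-up coefficients, the geometry behaves as expected: the closer, wider-angle shot receives the higher probability.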
Story 2: The Academic Breakthrough
A research group at a European university used the Pappalardo dataset to publish a seminal paper on action valuation in soccer (the VAEP framework). Because the data was open, other researchers could replicate and extend their results, leading to a productive chain of follow-up papers. The VAEP framework eventually influenced how several professional clubs evaluated player performance.
Without open data, this research would have been either impossible (no data access) or non-reproducible (proprietary data that others could not verify against). Open data made the research both possible and trustworthy.
Story 3: The Women's Soccer Catalyst
StatsBomb's deliberate inclusion of FA Women's Super League data in their open release had an outsized impact. Before this, analytical coverage of women's soccer was extremely sparse. The open data enabled:
- The first large-scale statistical analyses of women's soccer tactics
- Direct comparisons of tactical patterns between men's and women's leagues
- Community-driven development of women's soccer analytics content
- Increased interest from data providers in covering women's competitions
This demonstrates how open data can have an equalizing effect, directing analytical attention to historically underserved areas of the sport.
Story 4: The Grassroots Application
A youth development director at a small English academy used concepts and code from open-source soccer analytics projects (built on open data) to create a simple player development tracking system. Using a smartphone camera and open-source pose estimation models, they built a basic system to track young players' technical development across training sessions.
The system was far less sophisticated than professional-grade tools, but it was free, and it was enough --- enough to identify development trends, flag players who might benefit from additional coaching attention, and bring a data-informed approach to an academy that could never have afforded commercial analytics tools.
Story 5: The Journalistic Innovation
A sports data journalist used StatsBomb open data to create an interactive visualization of World Cup 2018 tactical patterns that was published by a major newspaper. The piece reached millions of readers and was credited with increasing public interest in analytical coverage of soccer. It demonstrated that sophisticated analysis could be communicated to a mainstream audience when paired with strong visual storytelling.
Challenges and Limitations of Open Data
The open data movement has not been without challenges:
1. Selection Bias
Open datasets are not random samples. They tend to cover high-profile competitions and may not be representative of the global soccer ecosystem. Analysis based solely on open data may produce insights that do not generalize to lower leagues or different football cultures.
2. Sustainability
Providing professional-grade data for free is expensive. The long-term sustainability of open data initiatives depends on the business model of the providing companies and the continued goodwill of stakeholders.
3. Quality Maintenance
Open datasets may not be maintained with the same rigor as commercial products. Documentation gaps, schema changes, and uncorrected errors can affect the reliability of research built on open data.
4. The Tracking Data Gap
While event data has become relatively accessible, tracking data (positional coordinates at high frequency) remains largely proprietary. This is the most significant remaining barrier, as many advanced analytical techniques (off-ball movement analysis, spatial control models, pressing intensity metrics) require tracking data.
5. Misuse and Misinterpretation
Lower barriers to entry also mean more people producing analysis without adequate statistical training. The community has seen an increase in misleading analyses based on open data --- cherry-picked samples, misinterpreted metrics, and overconfident conclusions from small datasets.
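One way to see why small samples invite overconfident conclusions is to put an interval around a simple rate. The sketch below computes a Wilson score interval for a shot conversion rate; the shot and goal counts are invented for illustration, and the choice of the Wilson interval is ours rather than a method prescribed by the text.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half_width, centre + half_width


# Same observed conversion rate (30%), very different sample sizes.
small_sample = wilson_interval(3, 10)
large_sample = wilson_interval(300, 1000)
```

The ten-shot interval spans a range so wide that it is compatible with both a poor and an elite finisher, while the thousand-shot interval is much tighter around 30%.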
Measuring the Impact
Quantifying the impact of open data is methodologically challenging, but several indicators suggest a transformative effect:
| Indicator | Pre-Open Data (2017) | Post-Open Data (2024) | Change |
|---|---|---|---|
| Soccer analytics GitHub repositories | ~25 | ~850 | 34x |
| Analytics-focused conference submissions | ~45 | ~450 | 10x |
| University courses incorporating soccer analytics | ~5 | ~60 | 12x |
| Full-time analytics roles at professional clubs | ~100 | ~500 | 5x |
| Countries with clubs employing analysts | ~15 | ~50 | 3.3x |
Note: These figures are directional estimates based on publicly available information and industry surveys. Precise counts are difficult to establish.
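The "Change" column is simply the ratio of the later estimate to the earlier one. A short sketch that recomputes it from the table's own directional figures:

```python
from typing import Dict, Tuple

# (pre-open-data estimate, post-open-data estimate), copied from the table above.
indicators: Dict[str, Tuple[int, int]] = {
    "Soccer analytics GitHub repositories": (25, 850),
    "Analytics-focused conference submissions": (45, 450),
    "University courses incorporating soccer analytics": (5, 60),
    "Full-time analytics roles at professional clubs": (100, 500),
    "Countries with clubs employing analysts": (15, 50),
}


def fold_change(before: float, after: float) -> float:
    """Ratio of the later value to the earlier value."""
    return after / before


changes = {name: round(fold_change(b, a), 1) for name, (b, a) in indicators.items()}

for name, change in changes.items():
    print(f"{name}: {change}x")
```

Because the inputs are rough estimates, the ratios should be read as orders of magnitude rather than precise multiples.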
The Open Data Manifesto
Based on the experiences documented in this case study, we propose principles for the continued development of open soccer data:
- Breadth over depth: Broad coverage of diverse competitions (including women's leagues, lower divisions, and non-European football) is more valuable than deep coverage of a few elite leagues.
- Documentation as a first-class concern: Open data without clear documentation is of limited value. Every dataset should include comprehensive data dictionaries, known limitations, and example code.
- Versioning and stability: Datasets should use semantic versioning, with clear changelogs and backward compatibility guarantees where possible.
- Ethical collection: Open data should be collected with appropriate consent and in compliance with relevant regulations. The fact that data is free does not exempt it from ethical standards.
- Community stewardship: The long-term maintenance of open datasets should be a shared community responsibility, not dependent on a single organization.
- Attribution and citation: Users of open data should consistently credit the providers, creating a positive feedback loop that incentivizes further data release.
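The versioning principle can be made concrete with a small check. The sketch below assumes a dataset tagged with MAJOR.MINOR.PATCH version strings, where a change in the major number signals a breaking schema change; the version numbers and helper names are hypothetical, and no existing open dataset is claimed to follow this scheme.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_backward_compatible(pinned: str, released: str) -> bool:
    """True if code written against `pinned` should still work on `released`.

    Under semantic versioning this holds when the major version is unchanged
    and the release is not older than the pinned version.
    """
    pinned_v = parse_version(pinned)
    released_v = parse_version(released)
    return released_v[0] == pinned_v[0] and released_v >= pinned_v


# Hypothetical dataset versions.
minor_update_ok = is_backward_compatible("1.2.0", "1.3.1")
major_update_ok = is_backward_compatible("1.2.0", "2.0.0")
```

An analysis pipeline could run such a check before loading data and refuse to proceed, or warn, when the major version has moved.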
The Road Ahead
The democratization of soccer analytics through open data is an ongoing process. Key developments to watch:
- Open tracking data: The release of larger tracking data samples (beyond Metrica's two-match set) would dramatically accelerate research in spatial analysis and off-ball movement.
- Standardization efforts: Projects like kloppy are working to create standard data formats that bridge different providers, making it easier to combine data sources.
- Federated learning: Techniques that allow models to be trained across multiple clubs' private data without sharing the raw data could combine the benefits of open science with the privacy requirements of professional football.
- Institutional support: FIFA, UEFA, and other governing bodies could play a role in mandating minimum data sharing standards, particularly for research and development purposes.
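Much of the standardization problem comes down to reconciling coordinate systems. The sketch below converts a location from StatsBomb's 120 x 80 pitch units to normalized 0-1 coordinates and then to metres. It assumes both systems place the origin in the same corner of the pitch and that the target pitch is 105 m x 68 m; the function names are illustrative and are not kloppy's API.

```python
from typing import Tuple

STATSBOMB_LENGTH = 120.0  # StatsBomb pitch length in its own units
STATSBOMB_WIDTH = 80.0    # StatsBomb pitch width in its own units


def statsbomb_to_normalized(x: float, y: float) -> Tuple[float, float]:
    """Scale StatsBomb coordinates to the 0-1 range on both axes."""
    return x / STATSBOMB_LENGTH, y / STATSBOMB_WIDTH


def normalized_to_metres(
    nx: float, ny: float, length_m: float = 105.0, width_m: float = 68.0
) -> Tuple[float, float]:
    """Scale normalized coordinates to metres on an assumed pitch size."""
    return nx * length_m, ny * width_m


# The centre spot should map to the centre of any pitch representation.
centre_normalized = statsbomb_to_normalized(60.0, 40.0)
centre_metres = normalized_to_metres(*centre_normalized)
```

Real providers also differ in axis orientation, attacking direction, and period handling, which is why a shared library is preferable to ad hoc conversions like this one.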
Discussion Questions
- StatsBomb's open data release was a business decision as much as an altruistic one --- it generated goodwill, brand awareness, and a pool of trained analysts familiar with their data format. Is this alignment of commercial and community interests sustainable, or does it create risks?
- The "misuse and misinterpretation" challenge is inherent to any democratization effort. How should the community balance lowering barriers to entry with maintaining analytical quality standards?
- The tracking data gap remains the most significant barrier. What mechanisms (commercial, regulatory, or community-driven) might lead to greater access to tracking data?
- Maria's story is inspiring, but it is also a survivorship bias example --- we do not hear about the many aspiring analysts who used open data and did not find a career path. How should we honestly represent career prospects in soccer analytics?
- Open data has primarily benefited the analytical community in English-speaking countries and Western Europe. What specific initiatives could extend these benefits to the global soccer community?
Connection to Chapter Themes
This case study connects to multiple Chapter 30 themes:
- Democratization (Section 30.3): Open data is the single most important enabler of democratization in soccer analytics.
- Ethics (Section 30.2): Open data raises its own ethical questions about player consent and data governance.
- Career guidance (Section 30.6): Open data has fundamentally changed how aspiring analysts build portfolios and demonstrate competence.
- The human element (Section 30.4): The community that formed around open data --- sharing code, providing feedback, mentoring newcomers --- demonstrates that technology adoption is ultimately a social phenomenon.