Learning Objectives

  • Distinguish between event data, tracking data, and other data types used in soccer analytics
  • Explain how event data and tracking data are collected and processed
  • Evaluate major commercial data providers and understand their offerings
  • Access and work with free public soccer data sources
  • Assess data quality and identify common data issues
  • Design a basic data pipeline for soccer analytics projects

Chapter 2: Data Sources and Collection in Soccer

"Data is the new oil. But like oil, it's valuable only when refined." — Clive Humby (adapted)

Chapter Overview

On a typical Premier League matchday, approximately 2,000 individual events occur on the pitch: passes, shots, tackles, duels, and more. Modern tracking systems record the position of every player 25 times per second, generating over 4 million data points per match. Broadcast cameras capture footage from dozens of angles, while GPS vests worn by players measure acceleration, speed, and distance covered.

This flood of data is the raw material of soccer analytics. But before you can analyze it, you need to understand what data exists, where it comes from, how to access it, and—crucially—what its limitations are.

This chapter takes you inside the world of soccer data: how it's created, who provides it, and how you can get your hands on it for your own analysis. We will examine each major data type in detail, survey the commercial and free data landscape, address the critical issues of data quality that can undermine even the most sophisticated analysis, and provide practical guidance for building data pipelines that support reproducible analytical work.

In this chapter, you will learn to: - Understand the different types of soccer data and their uses - Know where data comes from and how it's collected - Navigate the landscape of commercial and free data sources - Access public data for learning and portfolio projects - Evaluate data quality and handle common issues - Design and implement basic data pipelines


2.1 Types of Soccer Data

2.1.1 The Data Taxonomy

Soccer data comes in several distinct forms, each with different characteristics, collection methods, and analytical applications. Understanding these categories is fundamental to effective analysis. The type of data you use determines what questions you can answer, what methods you can apply, and what conclusions you can draw. Choosing the wrong data type for a given question is one of the most common mistakes in soccer analytics.

                        SOCCER DATA TYPES
    ┌────────────────────────────────────────────────────────┐
    │                                                        │
    │   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐ │
    │   │   EVENT     │   │  TRACKING   │   │   VIDEO/    │ │
    │   │    DATA     │   │    DATA     │   │   BROADCAST │ │
    │   │             │   │             │   │             │ │
    │   │ - Discrete  │   │ - Continuous│   │ - Raw       │ │
    │   │   actions   │   │   positions │   │   footage   │ │
    │   │ - Tagged by │   │ - All 22    │   │ - Requires  │ │
    │   │   humans    │   │   players   │   │   processing│ │
    │   │ - Universal │   │ - High      │   │ - Context   │ │
    │   │   coverage  │   │   frequency │   │   rich      │ │
    │   └─────────────┘   └─────────────┘   └─────────────┘ │
    │                                                        │
    │   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐ │
    │   │  PHYSICAL   │   │  REFERENCE  │   │  CONTEXTUAL │ │
    │   │    DATA     │   │    DATA     │   │    DATA     │ │
    │   │             │   │             │   │             │ │
    │   │ - GPS/IMU   │   │ - Players   │   │ - Weather   │ │
    │   │ - Heart rate│   │ - Teams     │   │ - Injuries  │ │
    │   │ - Load      │   │ - Matches   │   │ - Transfers │ │
    │   │   metrics   │   │ - Fixtures  │   │ - Finances  │ │
    │   └─────────────┘   └─────────────┘   └─────────────┘ │
    │                                                        │
    └────────────────────────────────────────────────────────┘

Let us examine each of these data types in detail.

2.1.2 Event Data

Event data is the most widely used type of soccer data. It records discrete actions that occur during a match—every pass, shot, tackle, foul, and more—along with metadata describing each action. Event data is to soccer analytics what box scores are to baseball: the fundamental quantitative record of what happened in a match.

What Event Data Contains:

A typical event data record includes:

| Field | Description | Example |
|---|---|---|
| event_id | Unique identifier | 12345 |
| match_id | Match identifier | 98765 |
| timestamp | When it occurred | 23:45.2 |
| event_type | Type of action | Pass |
| player_id | Who performed it | 12345 |
| team_id | Which team | 100 |
| x, y | Location (start) | 35.2, 48.1 |
| end_x, end_y | Location (end) | 65.8, 52.3 |
| outcome | Success/failure | Successful |
| qualifiers | Additional tags | Ground, Forward, Progressive |

Example Event Sequence:

23:45.2  Pass        Player A → Player B    (35, 48) → (55, 52)  Successful
23:47.1  Receipt     Player B               (55, 52)
23:48.3  Pass        Player B → Player C    (55, 52) → (78, 35)  Successful
23:50.8  Dribble     Player C               (78, 35) → (85, 30)  Successful
23:52.4  Shot        Player C               (85, 30) → Goal      Goal

This sequence tells a clear story: a three-pass move culminating in a goal. But notice what it does not tell you: where the other 19 outfield players were positioned, what runs were being made off the ball, how the defensive line shifted, or whether the goalkeeper was caught out of position. Event data captures the story of the ball; it is largely silent about everything else.
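
To make the schema concrete, the sequence above can be sketched as structured records. This is purely illustrative—the field names follow the generic schema in the table rather than any specific provider's format:

# Purely illustrative: the event sequence above as generic records
from math import hypot

events = [
    {"time": "23:45.2", "type": "Pass",    "player": "A", "x": 35, "y": 48, "end_x": 55, "end_y": 52},
    {"time": "23:48.3", "type": "Pass",    "player": "B", "x": 55, "y": 52, "end_x": 78, "end_y": 35},
    {"time": "23:50.8", "type": "Dribble", "player": "C", "x": 78, "y": 35, "end_x": 85, "end_y": 30},
    {"time": "23:52.4", "type": "Shot",    "player": "C", "x": 85, "y": 30, "end_x": None, "end_y": None},
]

# Derive a simple feature: how far each on-ball action moved the ball
for e in events:
    if e["end_x"] is not None:
        dist = hypot(e["end_x"] - e["x"], e["end_y"] - e["y"])
        print(f'{e["type"]} by {e["player"]}: {dist:.1f} units')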

The Event Data Schema in Detail:

Different providers structure their event data differently, but most follow a similar conceptual schema. Understanding the schema is essential for working with the data effectively.

Event types form a hierarchical taxonomy. At the top level, events are categorized as passes, shots, tackles, duels, fouls, goalkeeper actions, and so on. Within each category, subtypes provide finer granularity. A pass might be a ground pass, a lofted ball, a through ball, or a cross. A shot might be a volley, a header, or a placed shot. A tackle might be won or lost, standing or sliding.

Qualifiers (sometimes called "tags" or "attributes") add additional context to each event. Opta, for example, uses a system of qualifier codes that describe aspects like body part (left foot, right foot, head), technique (volley, half-volley, lob), and situational context (counter-attack, set piece, under pressure). StatsBomb uses a richer qualifier system that includes information about the preceding action, the defensive context, and the player's body orientation.

Coordinates describe where on the pitch an event occurred. Most providers represent the pitch as a rectangle with standardized dimensions—Opta uses a 100 by 100 grid, StatsBomb uses a 120 by 80 pitch (in roughly yard-sized units), and tracking providers often work in actual pitch dimensions (typically 105 by 68 meters). The origin point and axis orientation also vary between providers, which is a common source of confusion and bugs.

Common Pitfall: Different providers use different coordinate systems. Opta uses a 100x100 grid, StatsBomb uses a 120x80 pitch, and other providers may use entirely different scales. Always check the documentation and verify coordinates visually before performing any spatial analysis. Failing to account for coordinate system differences is one of the most common sources of error in soccer analytics.
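
When combining sources, a sensible first step is to convert everything into a single frame of reference. Below is a minimal sketch that rescales Opta-style (0-100) and StatsBomb-style (120x80) coordinates to metres on a 105x68 pitch; the origin and axis-direction assumptions are illustrative and must be verified against each provider's documentation:

def opta_to_metres(x, y, pitch_length=105.0, pitch_width=68.0):
    """Rescale Opta-style 0-100 coordinates to metres.

    Assumes the origin is the same corner in both systems and that the y-axis
    points the same way -- some feeds flip the y-axis or normalise per playing
    direction, so always check the provider's documentation first.
    """
    return x / 100.0 * pitch_length, y / 100.0 * pitch_width


def statsbomb_to_metres(x, y, pitch_length=105.0, pitch_width=68.0):
    """Rescale StatsBomb-style 120x80 coordinates to metres (same caveats as above)."""
    return x / 120.0 * pitch_length, y / 80.0 * pitch_width


print(opta_to_metres(50, 50))       # roughly the centre spot
print(statsbomb_to_metres(60, 40))  # roughly the centre spot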

Strengths of Event Data: - Widely available for most professional matches across dozens of leagues - Standardized formats allow cross-competition analysis - Captures the "story" of a match in structured form - Relatively affordable compared to tracking data - Sufficient for many common analytical tasks (xG, passing analysis, shot analysis) - Historical data available going back over a decade for major leagues

Limitations of Event Data: - Only records discrete actions, missing continuous play - Doesn't capture off-ball movement—where the 20 players without the ball are positioned - Quality varies between providers and between leagues within the same provider - Some subjective classification (what counts as a "key pass"? When does a "duel" begin and end?) - Spatial accuracy is limited (typically plus or minus 1-2 meters) - Cannot capture pressing intensity, space creation, or other continuous phenomena

Intuition: Event data providers tag 2,000-3,000 events per match, but this represents only a fraction of what actually happens. Everything that occurs between events—positioning, movement, space creation, pressing, recovery runs—is invisible. If a match were a novel, event data would capture the dialogue but miss the descriptions, the characters' inner thoughts, and the setting. It tells you what was said, but not the full story.

2.1.3 Tracking Data

Tracking data captures the continuous position of every player and the ball throughout a match, typically at 25 frames per second. This creates a vastly richer picture of what happens on the pitch—one that includes the 95% of match activity that event data misses.

What Tracking Data Contains:

For each frame (25 times per second):

| Field | Description |
|---|---|
| frame_id | Frame identifier |
| timestamp | Precise time |
| ball_x, ball_y, ball_z | Ball position (3D) |
| player_1_x, player_1_y | Player 1 position |
| player_1_vx, player_1_vy | Player 1 velocity |
| ... | (repeated for all 22 players) |

The Numbers: - 25 frames/second x 90 minutes x 60 seconds = 135,000 frames per match - 22 players x 2 coordinates = 44 positional values per frame - Plus ball position (3 coordinates including height), velocities, accelerations - Total: approximately 4-6 million data points per match

A single season of Premier League tracking data—380 matches—thus contains on the order of 1.5 to 2.3 billion individual data points. This scale presents significant engineering challenges for storage, processing, and analysis.
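
These volume figures are easy to reproduce as back-of-the-envelope arithmetic:

# Back-of-the-envelope tracking data volumes (illustrative; real feeds vary in what they include)
fps = 25
frames_per_match = fps * 90 * 60          # 135,000 frames
values_per_frame = 22 * 2                 # player (x, y) pairs; the ball and derived velocities add more
values_per_match = frames_per_match * values_per_frame
values_per_season = values_per_match * 380  # a 380-match league season

print(f"{frames_per_match:,} frames per match")
print(f"{values_per_match:,} positional values per match")
print(f"{values_per_season:,} positional values per season")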

The Raw Format:

Tracking data is typically delivered as frame-by-frame records, often in CSV or JSON format. A single frame might look like this:

frame_id: 54321
timestamp: 00:23:45.200
period: 1
ball: {x: 52.3, y: 34.1, z: 0.5}
home_team:
  - {player_id: 101, x: 45.2, y: 30.1, speed: 5.2}
  - {player_id: 102, x: 38.7, y: 42.5, speed: 2.1}
  - {player_id: 103, x: 50.1, y: 35.8, speed: 7.8}
  ...
away_team:
  - {player_id: 201, x: 55.6, y: 33.2, speed: 6.1}
  ...

Working with tracking data requires fundamentally different approaches than event data. Instead of querying discrete records, analysts must process time series of positions and derive meaningful features: distances covered, speeds reached, formations maintained, spaces created, and pressing actions executed.
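
As a simple illustration of this feature-derivation step, the sketch below computes a player's frame-to-frame speed and total distance from positional samples. The column names and values are invented for the example; real feeds use provider-specific schemas, and in practice the raw positions are noisy enough that speeds should be smoothed before use:

import numpy as np
import pandas as pd

# Hypothetical tracking extract for one player: positions in metres, sampled at 25 Hz
frames = pd.DataFrame({
    "timestamp": np.arange(50) / 25.0,        # 50 frames = 2 seconds
    "x": np.linspace(40.0, 52.0, 50),
    "y": np.linspace(30.0, 33.0, 50),
})

dt = frames["timestamp"].diff()
step = np.hypot(frames["x"].diff(), frames["y"].diff())   # metres moved between frames
frames["speed"] = step / dt                                # metres per second

print(f"Distance covered: {step.sum():.1f} m")
print(f"Peak speed: {frames['speed'].max():.1f} m/s")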

Collection Methods:

  1. Optical Tracking:
     • Multiple cameras positioned around the stadium (typically 12-20+)
     • Computer vision algorithms identify and track players and the ball
     • Providers: Second Spectrum, TRACAB (ChyronHego), Hawk-Eye
     • Most accurate method, with positional precision of approximately 10-30 centimeters
     • Requires permanent camera installation in stadiums, limiting coverage to equipped venues
     • Can track all 22 players plus the ball simultaneously
     • Ball tracking is particularly challenging because the ball is small, moves quickly, and can be occluded by players

  2. GPS/Wearable Tracking:
     • Players wear vests with GPS and accelerometer sensors
     • Provides position plus physical metrics (acceleration, impacts, heart rate)
     • Providers: Catapult, STATSports, Playertek
     • Lower positional accuracy than optical systems (approximately 1-5 meters for GPS)
     • Richer physical data including accelerometer readings at 100-1000 Hz
     • Can be used in training as well as matches
     • Only captures data for the team that owns the equipment—opponent data is not available

  3. Broadcast-Derived Tracking:
     • Computer vision applied to standard broadcast television footage
     • No stadium hardware required—works with any televised match
     • Provider: SkillCorner (the leading provider in this category)
     • Lower accuracy than dedicated optical systems (estimated 1-2 meters)
     • Coverage is dramatically broader—any televised match can potentially be tracked
     • Cannot track players who are off-screen (broadcast cameras follow the ball)
     • Sampling rate may be lower than dedicated systems (typically around 10 Hz from broadcast footage versus 25 Hz from dedicated optical systems)

Real-World Application: SkillCorner's broadcast-derived tracking data has been transformative for cross-league scouting. Before its availability, a club comparing a Bundesliga player to a Ligue 1 player could use event data but had no tracking data for both leagues from a single consistent source. SkillCorner's technology, by deriving tracking data from broadcast footage available for both leagues, enables physical and spatial comparisons across competitions. Several Premier League clubs reportedly use SkillCorner data as a complement to their primary tracking data to evaluate transfer targets in leagues where dedicated tracking systems are not installed.

Strengths of Tracking Data: - Captures everything that happens on the pitch, including off-ball movement - Enables analysis of pressing patterns, space creation, defensive shape, and other continuous phenomena - Allows sophisticated spatial models (pitch control, pressing intensity, expected threat) - Physical metrics support load management and injury prevention - Provides the foundation for the most advanced analytical techniques in the field

Limitations of Tracking Data: - Expensive and not universally available - Large data volumes require significant computational infrastructure - Processing and analysis more complex than event data - Still being explored—best practices and standard methods are evolving - Synchronization between tracking data and event data can be imperfect - Quality varies between systems and conditions (e.g., rain can affect optical tracking)

Intuition: If event data is like a play-by-play transcript of a match, tracking data is like the complete video recording. The transcript tells you what happened; the video shows you everything, including what was happening away from the ball. And just as video analysis requires different skills than reading a transcript, tracking data analysis requires different tools and techniques than event data analysis.

2.1.4 Freeze Frames

Freeze frames are snapshots of all player positions at specific moments—typically the moment of key events like shots or passes. They represent a middle ground between event data and full tracking data: more spatial context than standard event data, but far less data volume than continuous tracking.

What Freeze Frames Contain:

For each key event: - All 22 player positions at that exact moment - Ball position - Associated event metadata (who shot, from where, what happened)

The Concept:

Imagine pausing a video at the exact moment a player takes a shot. A freeze frame captures a digital photograph of where every player is standing at that instant. This tells you, for example, how many defenders were between the shooter and the goal, how far the goalkeeper was from the center of the goal, and whether any teammates were in offside positions.

Providers: - StatsBomb includes freeze frames with their event data, making them the most widely accessible source of freeze frame data - Some providers offer "enhanced" event data with positional context for certain event types

Advantages: - Much smaller data volume than full tracking (thousands of frames per match versus millions of data points) - Enables spatial analysis of key moments without the infrastructure requirements of full tracking - Often included with event data subscriptions at no additional cost - Sufficient for many analytical purposes, particularly shot analysis and set piece analysis

Limitations: - Only captures specific moments, not continuous play - Missing movement trajectories—you see where players are but not where they are going - Limited coverage (not all providers offer this) - Selection of which events include freeze frames varies

Best Practice: When building xG models, freeze frame data significantly improves accuracy compared to models built on event data alone. Knowing the positions of defenders and the goalkeeper at the moment of a shot provides crucial context about shot difficulty that is absent from simple location-based features. If freeze frame data is available, always incorporate it into shot-quality models.
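
As a concrete illustration using the StatsBomb Open Data (introduced in section 2.4.2), the freeze frame attached to each shot can be turned into simple context features such as the number of opponents ahead of the shooter and the goalkeeper's distance from goal. This is a rough sketch, assuming statsbombpy's flattened format in which shot_freeze_frame is a list of entries with location, player, position, and teammate fields, and the goal centre sits at roughly (120, 40) in StatsBomb coordinates:

import numpy as np
import pandas as pd
from statsbombpy import sb

events = sb.events(match_id=7298)   # same match id as the access examples in section 2.4.2
shots = events[events["type"] == "Shot"].dropna(subset=["shot_freeze_frame"])

GOAL = np.array([120.0, 40.0])      # approximate goal centre in StatsBomb coordinates

def shot_context(row):
    """Crude freeze-frame features: opponents ahead of the shooter, keeper distance to goal."""
    shooter = np.array(row["location"])
    opponents = [p for p in row["shot_freeze_frame"] if not p["teammate"]]

    ahead = sum(1 for p in opponents if p["location"][0] > shooter[0])  # crude proxy for blockers
    keeper = next((p for p in opponents if p["position"]["name"] == "Goalkeeper"), None)
    gk_dist = np.linalg.norm(np.array(keeper["location"]) - GOAL) if keeper else np.nan
    return pd.Series({"opponents_ahead": ahead, "gk_dist_to_goal": gk_dist})

features = shots.apply(shot_context, axis=1)
print(features.describe())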

2.1.5 Video and Broadcast Data

Video remains essential for soccer analysis, both as raw material and as a source for derived data. Despite the growth of structured data, many coaches and analysts still consider video their primary analytical tool. The relationship between video and data is complementary rather than competitive: data identifies patterns and highlights anomalies; video provides the context needed to understand them.

Uses of Video Data:

  1. Manual Analysis:
     • Coaches and analysts watch video to understand tactics, player behaviors, and game situations
     • Tagging systems (Hudl, Wyscout, InStat) allow annotation of video with custom labels
     • Essential for qualitative understanding that data alone cannot provide
     • Pre-match opposition analysis typically involves extensive video review, guided by data

  2. Computer Vision Source:
     • Optical tracking systems derive positions from video feeds
     • Emerging AI systems automate event detection from video (companies like Metrica Sports and others are developing these capabilities)
     • Pose estimation technology extracts body positioning and movement patterns
     • Action recognition systems can identify specific tactical patterns

  3. Context Validation:
     • Verify questionable data points by watching the original video
     • Understand context behind unusual statistics (why did a player have zero passes in a ten-minute period? The video might reveal they were receiving treatment for an injury)
     • Communicate findings to non-technical stakeholders who understand video more readily than charts

Video Analysis Platforms:
  • Hudl: Leading platform for team video analysis, offering tagging, clipping, sharing, and presentation tools. Used by thousands of professional and amateur clubs worldwide. Hudl acquired Wyscout in 2019, consolidating the two largest video platforms under one company.
  • Wyscout: Comprehensive video archive with tagging, accessible as both a web platform and an API. Particularly popular for scouting, offering video coverage of over 200 competitions. Now part of the Hudl family.
  • InStat: Russian-origin video analysis platform with synchronized data. Offers detailed tactical analysis tools and is particularly popular in Eastern European and Central Asian markets.
  • Dartfish: Biomechanical analysis focus, used more for individual player development than match analysis. Offers tools for slow-motion analysis, angle measurement, and movement comparison.
  • Sportscode (now Hudl Sportscode): Widely used in professional clubs for in-house video coding and analysis. Allows clubs to create custom coding frameworks tailored to their tactical philosophy.

2.1.6 Physical and Biometric Data

Modern players are increasingly monitored for physical performance and health. The volume and granularity of physical data has exploded over the past decade, driven by improvements in wearable sensor technology and growing recognition that physical preparation is a critical determinant of performance and injury risk.

Types of Physical Data:

| Category | Metrics | Source |
|---|---|---|
| Load | Total distance, high-speed distance (>5.5 m/s), sprint distance (>7 m/s), sprint count | GPS/Optical |
| Intensity | Accelerations (>3 m/s²), decelerations, high-intensity efforts, metabolic power | IMU/Accelerometer |
| Physiological | Heart rate, heart rate variability, heart rate recovery | Wearable monitors |
| Biomechanical | Running gait, asymmetry, joint angles, ground contact time | Specialized sensors |

Key Physical Metrics Explained:

  • Total distance: The total distance covered by a player during a match or training session. A typical outfield player covers 10-13 km per match, with midfielders generally covering the most distance.
  • High-speed running distance: Distance covered above a threshold speed (typically 5.5 m/s or approximately 20 km/h). This metric is more indicative of physical effort than total distance, as much of a player's total distance is covered at low speeds.
  • Sprint distance and count: Distance covered and number of efforts above sprint speed threshold (typically 7 m/s or approximately 25 km/h). Sprints are the most physically demanding actions and are closely linked to injury risk when accumulated excessively.
  • Accelerations and decelerations: Rapid changes in speed place high neuromuscular demands on the body. Tracking the number and intensity of accelerations and decelerations provides insight into the mechanical load on muscles and joints.
  • Metabolic power: An estimate of the instantaneous energy expenditure based on speed and acceleration, intended to capture the overall physiological cost of activity more comprehensively than speed-based metrics alone.
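
Using the thresholds above, a per-frame speed trace (for example, one derived from tracking data as in section 2.1.3) can be summarised into these load metrics. A minimal sketch with invented numbers:

import numpy as np

def load_summary(speeds_mps, hz=25, hsr_threshold=5.5, sprint_threshold=7.0):
    """Summarise a per-frame speed trace (m/s) into common physical-load metrics."""
    speeds = np.asarray(speeds_mps, dtype=float)
    step = speeds / hz                                    # distance covered in each frame

    total = step.sum()
    hsr = step[speeds > hsr_threshold].sum()
    sprint = step[speeds > sprint_threshold].sum()

    # A sprint "effort" = a contiguous run of frames above the sprint threshold
    above = speeds > sprint_threshold
    efforts = int(above[0]) + int(np.sum(above[1:] & ~above[:-1]))

    return {"total_m": round(total, 1), "hsr_m": round(hsr, 1),
            "sprint_m": round(sprint, 1), "sprint_count": efforts}

# Toy example: 55 seconds of jogging followed by a 5-second sprint burst
speeds = np.concatenate([np.full(25 * 55, 2.0), np.full(25 * 5, 7.5)])
print(load_summary(speeds))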

Applications: - Training load management and periodization - Injury risk prediction (players whose acute load significantly exceeds their chronic load are at elevated injury risk—the "acute:chronic workload ratio") - Return-to-play monitoring (comparing current physical output to pre-injury baselines) - Performance optimization (identifying physical parameters that limit performance and targeting them in training) - Tactical analysis (how does pressing intensity, measured through physical metrics, relate to tactical success?)

Real-World Application: The acute-to-chronic workload ratio (ACWR) has become one of the most widely used injury risk metrics in professional soccer. The concept, popularized by researcher Tim Gabbett, compares a player's recent workload (typically the past week) to their longer-term average (typically the past four weeks). When the ratio exceeds approximately 1.5—meaning the player's recent workload is 50% higher than their average—injury risk increases significantly. Clubs use this metric to make decisions about training intensity, player rotation, and match availability, though recent research has questioned some aspects of the methodology and suggested more nuanced approaches.
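
The ratio itself is straightforward to compute from a daily load series. Below is a minimal sketch using the common rolling-average formulation on made-up data; clubs differ in exactly how they define the acute and chronic windows and which load measure they feed in:

import numpy as np
import pandas as pd

# Hypothetical daily training load for one player (arbitrary units)
rng = np.random.default_rng(0)
daily_load = pd.Series(rng.normal(400, 80, size=60).clip(min=0),
                       index=pd.date_range("2024-01-01", periods=60))

acute = daily_load.rolling(7).mean()      # past week
chronic = daily_load.rolling(28).mean()   # past four weeks
acwr = acute / chronic

print(acwr.tail())
print(f"Days with ACWR above 1.5: {(acwr > 1.5).sum()}")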

2.1.7 Reference and Contextual Data

Beyond match data, analysis requires contextual information that provides the backdrop against which on-pitch data is interpreted. A player's statistics are meaningless without knowing what team they played for, what competition they played in, what position they played, and what circumstances surrounded their performance.

Reference Data: - Player profiles: age, height, weight, preferred foot, position(s), nationality, date of birth - Team rosters and formations - Match metadata: date, venue, competition, round, referee, attendance - League tables and standings - Competition structures (group stages, knockout rounds, promotion/relegation)

Contextual Data: - Weather conditions (temperature, precipitation, wind speed, humidity) - Injuries and suspensions (who was available for selection) - Transfer history and valuations - Managerial changes (new manager appointments often cause short-term performance fluctuations) - Historical results - Stadium dimensions (pitch size varies between venues, particularly in lower leagues)

Sources: - Transfermarkt: The largest publicly accessible source of transfer data, player valuations, injury histories, and squad information. Despite being a website rather than a formal data provider, Transfermarkt's community-maintained data is widely used in professional and academic analytics. - National associations: Official fixture lists, results, and competition regulations - Club websites: Official rosters, injury updates, and match reports - Weather APIs: Historical and forecast weather data for match venues - Capology and other salary databases: Player salary information (availability varies by league) - FIFA and UEFA: Official rankings, regulations, and competition data


2.2 How Soccer Data Is Collected

Understanding how data is collected is essential for understanding its strengths and limitations. Data collection is not a neutral process—the methods used to collect data determine what is captured, what is missed, and how accurate the result is. An analyst who treats data as an objective record of reality, without understanding the collection process, risks drawing conclusions that reflect data artifacts rather than actual patterns.

2.2.1 Event Data Collection

Event data is created through a combination of technology and human judgment. Despite advances in automation, the process remains fundamentally human-dependent, with trained operators making real-time decisions about how to classify the actions they observe.

The Collection Process:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Live Video │ →  │   Human     │ →  │   Quality   │ →  │   Final     │
│   Feed      │    │   Coders    │    │   Control   │    │   Data      │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                         │                   │
                         ▼                   ▼
                   ┌───────────┐      ┌───────────┐
                   │  Coding   │      │  Review   │
                   │  Software │      │  Process  │
                   └───────────┘      └───────────┘

Step 1: Live Coding

Trained operators watch live broadcast feeds and log events in real-time using specialized software. A typical match requires 2-3 operators working simultaneously: - One focusing on primary events (passes, shots, ball receipts) - Others handling secondary events (duels, fouls, pressures) - A supervisor monitoring accuracy and resolving disputes

The coding software presents the operator with a virtual pitch and a set of event type buttons. When a pass occurs, the operator clicks the pass button, then clicks the starting location and ending location on the virtual pitch, then selects relevant qualifiers (ground pass, forward, successful). This entire process must happen within seconds, as the next event is already occurring on screen. The skill and attentiveness required are considerable; a single coder might log 1,000 or more events per match.

For major providers like Opta and StatsBomb, live coding is performed at dedicated coding centers where multiple matches can be monitored simultaneously. Opta, for example, operates coding centers in multiple countries, employing hundreds of trained coders who collectively cover thousands of matches per week during the peak season.

Step 2: Event Classification

Each action is classified according to a detailed taxonomy: - Type: What happened (pass, shot, tackle, duel, foul, clearance, interception) - Outcome: Result (successful, unsuccessful, blocked, saved, off target) - Qualifiers: Additional context (headed, volleyed, under pressure, counter-attack, from corner) - Location: Coordinates on the pitch (start and end locations where applicable) - Timing: Timestamp within the match, synchronized to match clock

The taxonomy used by different providers varies significantly. StatsBomb, for example, classifies over 30 distinct event types with hundreds of qualifier combinations, while simpler systems might use fewer than 20 event types. These differences in classification granularity have practical implications: analyses built on one provider's data may not translate directly to another's.

Intuition: Think of event data coding like court reporting in a legal proceeding. The court reporter must capture everything that is said, in real time, with high accuracy. But unlike a verbatim transcript, event data coding also requires judgment: the coder must decide not just what happened (a pass was made) but how to classify it (was it a ground pass or a lofted ball? Was it progressive? Was it under pressure?). These classification judgments introduce subjectivity into what appears to be an objective dataset.

Step 3: Quality Control

After initial coding, additional review occurs: - Automated validation checks (impossible coordinates, missing events, logical inconsistencies like a goal without a preceding shot) - Human review of edge cases where the initial coder's classification is uncertain - Cross-checking against video for flagged events - Statistical anomaly detection (a match with an unusually low number of passes might indicate coding errors)
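
Several of these automated checks are straightforward to express in code. The sketch below assumes a generic flattened events table with x/y on a 0-100 scale and simple type/outcome columns—adapt the column names and ranges to whichever provider you use:

import pandas as pd

def validate_events(events):
    """Run simple sanity checks on an event table; returns a list of human-readable issues."""
    issues = []

    # Range check: coordinates must lie on the pitch (0-100 scale assumed here)
    off_pitch = events[(events["x"] < 0) | (events["x"] > 100) |
                       (events["y"] < 0) | (events["y"] > 100)]
    if len(off_pitch):
        issues.append(f"{len(off_pitch)} events with impossible coordinates")

    # Logical consistency: a goal should be recorded on a shot event
    goals = events[events["outcome"] == "Goal"]
    if len(goals) and not (goals["type"] == "Shot").all():
        issues.append("goal outcome recorded on a non-shot event")

    # Anomaly check: an implausibly low event count often signals missing data
    if len(events) < 1000:
        issues.append(f"only {len(events)} events for the match -- possible gaps in coding")

    return issues

# Example with a toy table containing one impossible coordinate
toy = pd.DataFrame({"type": ["Pass", "Shot"], "outcome": ["Successful", "Goal"],
                    "x": [35.0, 112.0], "y": [48.0, 40.0]})
print(validate_events(toy))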

StatsBomb, in particular, has built a reputation for rigorous quality control. Their data undergoes multiple rounds of review, and they regularly update historical data when errors are identified—a practice that improves long-term data quality but can complicate analyses that span multiple data versions.

Step 4: Post-Match Enhancement

Some providers perform a second, more detailed coding pass after the match is complete. This post-match review can add: - More precise coordinates based on multi-angle video review - Additional qualifiers that are difficult to code in real time - Corrections to errors identified in the live coding - Freeze frame data (StatsBomb adds player positions at the moment of key events during post-match processing)

Accuracy Considerations:

Event data accuracy is generally high but not perfect: - Pass locations: plus or minus 1-2 meters typical accuracy - Event classification: approximately 95-98% agreement between coders for objective events (pass, shot) but lower for subjective events (key pass, tackle won) - Subjective events (key pass, tackle won versus tackle attempted): Lower consistency, sometimes as low as 80-85% inter-coder agreement - Shot xG features: High accuracy for critical variables like shot location and body part

Research by academic groups has quantified inter-coder reliability for various event types. A 2019 study found that while basic events like passes and shots showed over 95% agreement between independent coders, more complex classifications like "under pressure" or "progressive pass" showed agreement rates closer to 85%. This means that analyses heavily dependent on subjective qualifiers should be interpreted with appropriate caution.

Common Pitfall: Assuming event data is perfectly accurate. All event data contains some errors and inconsistencies. Robust analysis should be somewhat tolerant of data quality issues. When a finding depends critically on a small number of events, always verify by watching the video. When working with aggregated data, small errors tend to cancel out, but for individual match or player analyses, data quality issues can be significant.

2.2.2 Tracking Data Collection

Tracking data collection is more technologically intensive and expensive, relying on sophisticated hardware and software systems rather than human coders.

Optical Tracking Systems:

Modern optical tracking uses multiple high-definition cameras positioned around the stadium. The system operates automatically, with minimal human intervention during matches, though human oversight is required for quality assurance.

  1. Camera Setup:
     • 12-20+ cameras providing overlapping coverage of the entire pitch
     • Positioned high in stands for optimal viewing angles (typically at the top of the stands or on the stadium roof)
     • Calibrated to stadium dimensions using known reference points (pitch markings, stadium features)
     • Cameras operate at 25-50 frames per second, with resolution sufficient to distinguish individual players

  2. Image Processing:
     • Real-time video processing at 25-50 fps using GPU-accelerated computing
     • Player detection using machine learning models trained on millions of labeled examples
     • Jersey number recognition for identification (challenging when jerseys are obscured by sweat, rain, or physical contact)
     • Ball tracking using dedicated algorithms (the ball is small, moves fast, and is frequently occluded by players, making this the most technically challenging aspect)
     • Background subtraction to isolate moving objects from the static stadium environment

  3. Position Calculation:
     • Triangulation from multiple camera views to determine 3D positions
     • Kalman filtering for smooth trajectories that handle measurement noise
     • Handling occlusions (when players block each other from camera view, the system must predict positions based on recent trajectory)
     • Identity management (maintaining correct player identification when players cross paths or cluster together)

  4. Data Output:
     • Raw positional data at 25 fps with positional accuracy of approximately 10-30 centimeters
     • Derived velocities and accelerations (calculated from position changes between frames)
     • Synchronized with match clock and typically with event data feeds
     • Delivered as structured data files (CSV, JSON, or proprietary formats)

Major Optical Tracking Providers:

  • TRACAB (ChyronHego): One of the longest-established tracking systems, installed in stadiums across Europe including all Bundesliga venues. TRACAB uses a system of cameras mounted at the top of the main stand to provide overlapping coverage of the entire pitch.

  • Second Spectrum: A Los Angeles-based company that has become one of the dominant tracking data providers, with installations in the Premier League, La Liga, and MLS among others. Second Spectrum differentiates through its advanced analytics layer, providing not just raw tracking data but derived metrics and tactical insights built on top of the positional data.

  • Hawk-Eye: Best known for goal-line technology (determining whether the ball has fully crossed the goal line), Hawk-Eye also provides tracking data. Their systems are installed in every Premier League stadium, providing tracking data that is available to all clubs in the league.

  • Kinexon: A German company that has developed a local positioning system (LPS) using ultra-wideband radio technology. Players wear small transponders, and sensors around the stadium triangulate their positions. This approach offers very high accuracy (claimed sub-centimeter precision) but requires dedicated hardware.

GPS/Wearable Systems:

Player-worn devices provide an alternative or complement to optical tracking:

  1. Device Components:
     • GPS receiver: determines position at approximately 10-18 Hz (depending on the device)
     • Accelerometer: measures acceleration forces at 100-1000 Hz
     • Gyroscope: measures rotational movement
     • Magnetometer: provides heading/orientation
     • Heart rate monitor (in some devices)
     • Wireless transmission capability for real-time monitoring

  2. Data Processing:
     • Position smoothing and correction (raw GPS data is noisy and requires filtering)
     • Physical metric calculation (distances, speeds, accelerations derived from raw sensor data)
     • Team-level synchronization (ensuring all devices are reporting on the same time base)
     • Integration with video systems for contextual analysis

  3. Key Providers:
     • Catapult: The largest provider of GPS/wearable tracking in professional sports. Catapult devices are used by over 3,000 teams worldwide across multiple sports. Their platform provides both raw data and derived metrics for physical performance monitoring.
     • STATSports: An Irish company whose Apex system is used by numerous Premier League clubs and national teams. FIFA selected STATSports as their preferred wearable technology partner.
     • Playertek (now part of Catapult): Originally targeted at sub-elite and grassroots teams, offering more affordable wearable tracking.

  4. Limitations:
     • GPS less accurate than optical tracking: approximately 1-5 meters for standard GPS, improved to approximately 0.3-1 meter with differential GPS (DGPS)
     • Cannot capture ball position
     • Requires player cooperation (players must wear the device)
     • Not available for opponents (only the team that owns the equipment gets data)
     • GPS signal can degrade in stadiums with large overhanging roofs
     • Not permitted in competitive matches in some leagues (though FIFA regulations have generally allowed approved devices since 2015)

2.2.3 The Data Pipeline

From raw collection to usable analysis, data passes through several stages. Understanding this pipeline is essential for anyone who will work with soccer data professionally, as problems at any stage can compromise the quality of downstream analysis.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         SOCCER DATA PIPELINE                                │
│                                                                             │
│   Collection      Validation      Storage      Processing      Analysis    │
│       │               │              │              │              │        │
│       ▼               ▼              ▼              ▼              ▼        │
│   ┌───────┐      ┌───────┐      ┌───────┐      ┌───────┐      ┌───────┐   │
│   │ Raw   │  →   │Quality│  →   │Database│ →   │ Clean │  →   │Insights│   │
│   │ Data  │      │Checks │      │Storage │     │Transform│     │Reports │   │
│   └───────┘      └───────┘      └───────┘      └───────┘      └───────┘   │
│                                                                             │
│   - Coding        - Schema        - SQL/NoSQL    - Feature      - Models   │
│   - Tracking      - Range checks  - Data lake    - Aggregation  - Viz      │
│   - Sensors       - Consistency   - Versioning   - Derivation   - Dashboards│
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Each stage of this pipeline introduces opportunities for error and requires careful attention. We will examine data quality and pipeline design in more detail in sections 2.5 and 2.6.


2.3 Major Data Providers

2.3.1 The Commercial Landscape

The soccer data industry is dominated by a few major providers, each with different strengths, coverage, and pricing models. Understanding the provider landscape is important for practical reasons: which data you can access determines what analyses you can perform. It is also important conceptually: because different providers collect and classify data differently, results can vary depending on which provider's data you use.

The industry has undergone significant consolidation in recent years. Stats Perform acquired Opta in 2019 (through its parent company, which merged Perform Group with STATS LLC). Hudl acquired Wyscout in 2019. These consolidations have reduced the number of independent providers while potentially improving integration between complementary products.

2.3.2 Opta / Stats Perform

Overview: Opta, now part of Stats Perform, is the longest-established major provider of soccer event data. Founded in 1996 by Aidan Cooney in London, Opta began by providing statistical content to media companies and has grown into the industry's dominant provider of event data. Their data powers most mainstream media coverage of soccer statistics, including the statistics shown on broadcast television, news websites, and social media.

Coverage: - Event data for 40+ competitions worldwide, including all major European leagues - Historical data back to 2006-2008 for major leagues, with some data extending further - Near-universal coverage of top European leagues (Premier League, La Liga, Bundesliga, Serie A, Ligue 1) - Extensive coverage of second-tier leagues, cup competitions, and international tournaments

Data Products: - Opta F24: The core event data feed, providing detailed event records for each match. The F24 format includes events with coordinates, qualifiers, timestamps, and player/team identifiers. - Opta F1/F9: Match and season-level aggregated statistics - Opta F40: Advanced possession and passing data - Stats Perform AI: Machine learning-derived metrics including expected goals, expected assists, and player performance ratings

Strengths: - Extensive historical archive enabling longitudinal analysis - Consistent definitions across seasons (though definitions do evolve, changes are documented) - Well-documented data dictionary with detailed qualifier definitions - Industry standard for many metrics—when media refer to "official" stats, they usually mean Opta data - Broad coverage enabling cross-league comparison

Limitations: - Limited spatial detail in the standard product (no freeze frames) - Some metrics are proprietary and not transparently documented - Primarily event data; limited tracking data offering - Premium pricing that puts it out of reach for most individuals and small organizations - Some subjective qualifiers (e.g., "big chance") that are not consistently defined

Users: Media organizations (BBC, Sky Sports, ESPN), betting companies, Premier League and La Liga clubs, academic researchers

Intuition: Think of Opta as the "standard reference" of soccer data—like a well-established encyclopedia. It may not have the newest or most detailed information on every topic, but it has broad, reliable coverage and a long track record. When you hear a statistic cited on television or in a newspaper, it is very likely sourced from Opta.

2.3.3 StatsBomb

Overview: Founded in 2017 by Ted Knutson, a former poker professional turned soccer analytics consultant, StatsBomb differentiates through data quality, analytical depth, and community engagement. Knutson had previously built a reputation through his analytical writing (at StatsBomb.com, originally a blog and media site) and through consulting work for clubs and media companies.

StatsBomb was explicitly founded on the premise that the quality and granularity of existing event data could be significantly improved. Their data collection process is more labor-intensive than competitors', involving multiple coding passes and rigorous quality control, resulting in what many analysts consider the highest-quality event data available.

Coverage: - Event data for major European leagues (Premier League, La Liga, Bundesliga, Serie A, Ligue 1) plus selected other competitions (MLS, A-League, Indian Super League, and others) - Freeze frame data with every shot and many other key events - Growing historical archive - Women's competitions including the FA Women's Super League and NWSL

Key Differentiators: - 360 Freeze Frames: StatsBomb includes positional data for all visible players at the moment of key events. This enables spatial analysis of shots, passes, and other actions without requiring full tracking data. - Pressure events: StatsBomb codes "pressure" events—moments when a player closes down an opponent—that are not systematically recorded by all providers. This enables pressing analysis from event data alone. - Detailed carry data: StatsBomb records ball carries as explicit events, with start and end positions. Most other providers infer carries from the gap between consecutive events, which is less reliable. - Open Data initiative: Free public data for educational use (discussed in detail in section 2.4.2)
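
Because pressures are coded explicitly, pressing activity can be summarised directly from event data. A short sketch using statsbombpy and the free Open Data (see section 2.4.2); the match id is the one used in the access examples later in this chapter, and the binning into thirds is an arbitrary illustrative choice:

import pandas as pd
from statsbombpy import sb

# Count pressure events per team, split by pitch third, for one Open Data match
events = sb.events(match_id=7298)
pressures = events[events["type"] == "Pressure"].copy()

# StatsBomb's x axis runs 0-120 towards the opponent's goal; bin pressure locations into thirds
pressures["x"] = pressures["location"].str[0]
pressures["third"] = pd.cut(pressures["x"], bins=[0, 40, 80, 120],
                            labels=["defensive", "middle", "attacking"])

print(pressures.groupby(["team", "third"]).size().unstack(fill_value=0))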

Strengths: - Superior data quality and granularity - Freeze frames enable spatial analysis without tracking data - Transparent methodology with published documentation - Strong analytics community engagement (public research, blog posts, conference presentations) - Open Data initiative that has become the standard educational resource

Limitations: - Less extensive historical coverage than Opta (data collection began more recently) - Smaller total competition coverage (fewer leagues and cups) - Premium pricing reflecting the higher data quality - Relatively newer in the market, so some users have shorter time series

Users: Clubs (Premier League, La Liga, Bundesliga), media companies (The Athletic), betting companies, academic researchers, public analysts (via Open Data)

2.3.4 Wyscout (Hudl)

Overview: Wyscout, an Italian company founded in 2004, built its reputation as an integrated video and data platform. It is particularly popular among scouts, offering video coverage of over 200 competitions worldwide—far more than any other single platform. Hudl, the American sports video company, acquired Wyscout in 2019, combining Wyscout's professional scouting platform with Hudl's team-level video analysis tools.

Coverage: - Video and data for 200+ competitions worldwide, including many lower leagues and youth competitions - Extensive coverage of South American leagues (often the best available source for these competitions) - Good coverage of African, Asian, and other non-European leagues - Youth tournament coverage that is valuable for clubs scouting young talent

Data Products: - Integrated video platform with synchronized event data - Player comparison tools - Team statistics and tactical analysis - Scouting workflow tools (shortlisting, reporting, sharing) - API access for data extraction

Strengths: - Unmatched breadth of competition coverage, especially for lower leagues and non-European competitions - Integrated video platform makes it easy to move between data and video - User-friendly scouting interface designed for non-technical users - Relatively accessible pricing for individuals and small organizations - The go-to platform for scouts working across multiple markets

Limitations: - Data quality inconsistent across competitions (top leagues are well-covered; lower leagues may have less rigorous coding) - Less analytical depth than StatsBomb (fewer qualifiers, no freeze frames) - Video quality variable, especially for lower-league matches - Event definitions may not perfectly align with other providers

Users: Scouts, player agents, smaller clubs, journalists, individual analysts

2.3.5 InStat

Overview: InStat is a Russian-founded sports analytics company that provides video analysis and statistical data across multiple sports, including soccer. Founded in 2007, InStat has built a strong presence in Eastern European and Central Asian markets and has expanded into Western European and other global markets.

Coverage: - Over 100 soccer competitions, with particularly strong coverage of Russian, Ukrainian, and Eastern European leagues - Video analysis with synchronized data - Growing coverage of Western European leagues

Strengths: - Strong Eastern European coverage that other providers may not match - Integrated video and data platform - Detailed individual player analysis tools - Affordable pricing compared to premium Western providers

Limitations: - Less recognized in Western European markets - Data quality and definitions may differ from Western standards - Smaller user community means fewer shared tools and resources

Users: Clubs in Eastern Europe and Central Asia, scouts, agents

2.3.6 Tracking Data Providers

Second Spectrum: - Optical tracking systems installed in MLS (all stadiums), La Liga (all stadiums), and the Premier League - Advanced analytics layer on top of tracking data, including proprietary metrics and tactical classification - Powers broadcast graphics (the "augmented reality" overlays showing tactical patterns during broadcasts) - Acquired by Genius Sports in 2021 - High accuracy (approximately 10-15 cm positional precision), comprehensive coverage within partner leagues - Works closely with leagues rather than individual clubs, providing league-wide data access

ChyronHego (TRACAB): - One of the earliest optical tracking providers, with systems installed in many major European stadiums - The official tracking data provider for the Bundesliga (since 2011) and Serie A, among others - Long-established technology with extensive validation - Hardware installed in stadiums using a combination of cameras and other sensors - Position data at 25 Hz with accuracy comparable to other optical systems

SkillCorner: - Tracking data derived from broadcast video using computer vision, requiring no stadium hardware - Founded in 2017, SkillCorner has rapidly grown to become a significant player in the tracking data market - Growing coverage: any televised match can potentially be processed - Democratizing tracking data access by dramatically reducing the cost and infrastructure requirements - Lower accuracy than dedicated optical systems (estimated 1-2 meters versus 10-30 centimeters), and limited to players visible in the broadcast frame - Provides physical performance metrics (speed, distance, acceleration) derived from the tracking data - Particularly valuable for cross-league scouting, where consistent tracking data across competitions is needed

Catapult / STATSports: - GPS/wearable solutions for teams - Physical performance focus with detailed biomechanical metrics - Used in both training and matches - Team must own and deploy equipment (cost: approximately 200,000-500,000 USD for a full team setup) - Data is proprietary to the team—not shared with leagues or other organizations - Increasingly integrated with video analysis and tactical tools

Real-World Application: The complementarity of different tracking data sources is increasingly recognized by clubs. A Premier League club might use Hawk-Eye optical tracking data (available league-wide) for match analysis, SkillCorner broadcast-derived data for scouting targets in foreign leagues, and Catapult GPS data for training load monitoring. Each source has different strengths, and the most sophisticated clubs integrate multiple sources into a unified picture.

2.3.7 Provider Comparison

| Provider | Event Data | Tracking | Video | Freeze Frames | Free Tier |
|---|---|---|---|---|---|
| Stats Perform/Opta | Excellent | Limited | No | No | No |
| StatsBomb | Excellent | No | No | Excellent | Yes (Open Data) |
| Wyscout/Hudl | Good | No | Excellent | No | No |
| InStat | Good | No | Good | No | No |
| Second Spectrum | Limited | Excellent | No | N/A | No |
| SkillCorner | No | Good | No | N/A | Limited |
| TRACAB | Limited | Excellent | No | N/A | No |

2.4 Accessing Free Data

2.4.1 The Democratization of Soccer Data

Until recently, quality soccer data was accessible only to those who could afford expensive subscriptions. A professional-grade event data subscription from a major provider can cost tens of thousands of dollars per year—well beyond the reach of students, independent researchers, and aspiring analysts. Several initiatives have changed this landscape, making learning and portfolio-building possible for everyone with an internet connection and the willingness to learn.

The importance of free data cannot be overstated for the development of the field. Many of today's professional analysts built their skills and portfolios using free data, demonstrating their capabilities publicly before being hired by clubs or media companies. The open data movement has democratized entry into soccer analytics in a way that benefits the entire ecosystem.

2.4.2 StatsBomb Open Data

Overview: StatsBomb releases detailed event data from select competitions for free educational use. This is the highest-quality free soccer data available, and it has become the standard dataset for soccer analytics education, tutorials, and introductory research.

Available Data (as of the most recent release): - FIFA World Cup 2018 (Men's) - complete event data for all 64 matches - FIFA World Cup 2022 (Men's) - complete event data - UEFA Euro 2020 (2021) - complete event data - UEFA Euro 2024 - complete event data - FA Women's Super League (multiple seasons) - NWSL (multiple seasons) - La Liga (selected seasons featuring Lionel Messi's career at Barcelona) - UEFA Champions League finals (selected) - FIFA Women's World Cup (multiple tournaments) - Other select matches and competitions added periodically

How to Access:

# Using statsbombpy library
from statsbombpy import sb

# List available competitions
competitions = sb.competitions()
print(competitions)

# Get matches for a competition
matches = sb.matches(competition_id=43, season_id=3)  # World Cup 2018
print(matches)

# Get events for a specific match
events = sb.events(match_id=7298)  # France vs Croatia final
print(events.head())

# Get lineups for a match
lineups = sb.lineups(match_id=7298)
print(lineups)

The statsbombpy library handles authentication and data retrieval automatically, making access straightforward. The data is also available as raw JSON files from StatsBomb's GitHub repository, which can be useful for understanding the underlying data structure.

Data Structure: - Competitions: Metadata about available competitions and seasons - Matches: Match-level information including teams, scores, managers, stadium, and referee - Events: Detailed event records with all qualifiers, coordinates, and metadata - Lineups: Player information and tactical positions for each match - Freeze frames: Player positions at the moment of key events (available for shots in most competitions)

Working with StatsBomb Data—A Practical Example:

import pandas as pd
from statsbombpy import sb

# Fetch all events from the 2018 World Cup Final
events = sb.events(match_id=7298)

# Filter for shots
shots = events[events['type'] == 'Shot']

# Examine shot details
print(f"Total shots: {len(shots)}")
print(f"Goals: {len(shots[shots['shot_outcome'] == 'Goal'])}")
print(shots[['player', 'shot_outcome', 'shot_statsbomb_xg', 'location']].head(10))

Terms of Use: - Free for educational, personal, and non-commercial use - Attribution to StatsBomb is required in any publication or presentation - Not for commercial products without a separate commercial license - Data should not be redistributed outside the terms of use

Real-World Application: The StatsBomb Open Data has enabled hundreds of public analyses, tutorials, and learning resources. It is the standard dataset referenced in soccer analytics blog posts, YouTube tutorials, and university courses. Many hiring managers in soccer analytics expect candidates to be familiar with StatsBomb data and to have worked with it in portfolio projects.

2.4.3 FBref

Overview: FBref (Football Reference) provides free access to aggregated statistics, originally powered by StatsBomb data and, since late 2022, by Opta (Stats Perform). Part of the Sports Reference family of websites (which also includes Baseball-Reference, Basketball-Reference, and others), FBref has become the most widely used free source of aggregated soccer statistics.

Available Data: - Player statistics: standard (goals, assists, minutes) and advanced (xG, xA, progressive passes, carries) - Team statistics: offensive, defensive, possession, and passing metrics - Match reports with shot maps and event summaries - Historical data back to 2017-18 for advanced stats, with basic stats extending further - Scouting reports with percentile rankings for individual players

How to Access: - Web interface at fbref.com with comprehensive tables and filters - No official API, but the site is structured to facilitate data extraction - Download tables directly from pages using the "Share & Export" option - Web scraping using Python libraries (tolerated within reasonable limits)

# Example: Scraping player data from FBref
import pandas as pd

url = "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
tables = pd.read_html(url)
player_stats = tables[0]  # First table on page

# Clean up multi-level headers if present
# (FBref tables sometimes have multi-level column headers)
print(player_stats.head())

Limitations: - Aggregated statistics only—no event-level or match-level granularity - Web scraping can be fragile: page structure may change without notice - Rate limiting may apply: excessive automated requests may be blocked - Data is delayed compared to live feeds (typically updated within 24 hours of match completion)

Best Practice: When scraping FBref or other websites, always respect rate limits and terms of service. Insert delays between requests (at least 3-5 seconds per page), cache responses locally so you don't need to re-scrape, and never scrape more data than you actually need. Building a local cache of scraped data also protects your analysis from website changes.
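
One simple way to follow this practice is to route every page download through a helper that caches responses on disk and pauses between live requests. The function below is an illustrative sketch; the cache location, delay, and User-Agent string are arbitrary choices:

import time
from pathlib import Path

import requests

CACHE_DIR = Path("fbref_cache")
CACHE_DIR.mkdir(exist_ok=True)

def fetch_cached(url, delay=4.0):
    """Return a page's HTML, using a local on-disk cache and pausing between live requests."""
    cache_file = CACHE_DIR / (url.split("//", 1)[-1].replace("/", "_") + ".html")
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    response = requests.get(url, headers={"User-Agent": "personal research project"})
    response.raise_for_status()
    cache_file.write_text(response.text, encoding="utf-8")
    time.sleep(delay)   # be polite: wait before the next live request
    return response.text

# The cached HTML can then be parsed with pandas.read_html or BeautifulSoup
html = fetch_cached("https://fbref.com/en/comps/9/stats/Premier-League-Stats")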

2.4.4 Understat

Overview: Understat is a website providing xG-based statistics for six major European leagues (Premier League, La Liga, Bundesliga, Serie A, Ligue 1, and the Russian Premier League). The site uses its own proprietary xG model, distinct from StatsBomb's or Opta's, and provides shot-level data that enables detailed shooting analysis.

Available Data: - Shot data with xG values for every shot in covered leagues - Player and team xG statistics aggregated by season - Historical data back several seasons (coverage varies by league) - Shot maps and basic visualizations

How to Access: - Web interface at understat.com - Undocumented API that returns JSON data (can be accessed programmatically) - Several Python packages exist for data retrieval

# Example using the understat package (the client takes an aiohttp session)
import asyncio
import aiohttp
from understat import Understat

async def get_player_shots():
    async with aiohttp.ClientSession() as session:
        understat = Understat(session)
        shots = await understat.get_player_shots(
            player_id=1234,   # placeholder ID
            season='2023'
        )
        return shots

shots = asyncio.run(get_player_shots())

Strengths:
- Shot-level data with xG values is valuable for shooting and chance creation analysis
- Clean, accessible data format
- Good historical coverage for major leagues

Limitations:
- Only covers six leagues
- Only shot data (not full event data)
- Proprietary xG model with limited documentation of methodology
- The undocumented API could change without notice

2.4.5 Other Free Sources

Transfermarkt:
- Market values and transfer history for players worldwide
- Squad information including contract details, agent information, and historical clubs
- Injury records (dates, types, duration)
- Web scraping or unofficial APIs (the transfermarkt-api package provides structured access)
- Extremely comprehensive coverage of the global soccer market
- Market values are crowd-sourced estimates, not official figures, but are widely used as reference points

Kaggle Datasets:
- Various soccer datasets uploaded by community members
- Quality and documentation vary significantly
- Good for specific projects or competitions
- Notable datasets include historical match results, player attributes from FIFA video games, and European match data

Football-Data.co.uk:
- Historical match results and betting odds for dozens of leagues
- Covers many leagues back to the 1990s
- CSV downloads available with consistent formatting
- Particularly useful for match prediction and betting market analysis
- Updated regularly during the season
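
Because Football-Data.co.uk serves plain CSV files, loading a season is a single pandas call. A minimal sketch; the URL and column names below are illustrative and should be copied from the site's download page and notes file rather than from this example:

# Load one season of results from Football-Data.co.uk.
# The URL is illustrative; copy the exact CSV link for your
# league and season from the site's download page.
import pandas as pd

url = "https://www.football-data.co.uk/mmz4281/2324/E0.csv"
matches = pd.read_csv(url)

# Column names (FTHG = full-time home goals, etc.) are documented
# in the site's notes file
print(matches[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG']].head())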

WhoScored:
- Aggregated player ratings and statistics powered by Opta data
- Requires web scraping for data extraction
- Less detailed than FBref for advanced metrics
- Player ratings (on a 1-10 scale) are widely referenced in media and fan discussion

Open Football Data:
- A community-maintained repository of open soccer data on GitHub
- Includes fixtures, results, and league tables in structured formats
- Good for basic historical analysis

2.4.6 Web Scraping Considerations

Many free data sources do not provide official APIs, meaning that accessing the data programmatically requires web scraping—extracting data from web pages by parsing their HTML structure. Web scraping is a valuable skill for soccer analysts, but it comes with important ethical, legal, and technical considerations.

Ethical Considerations:
- Respect the website's terms of service. Some sites explicitly prohibit scraping; others tolerate it within limits.
- Do not overload servers with excessive requests. Insert delays between requests and scrape during off-peak hours.
- Attribute the data source in any publication or presentation.
- Do not scrape and redistribute data in ways that undermine the source's business model.

Legal Considerations:
- The legal status of web scraping varies by jurisdiction. In the United States, the Computer Fraud and Abuse Act (CFAA) has been interpreted in various ways regarding scraping. In the European Union, the Database Directive provides legal protection for databases.
- Scraping publicly available data for personal or research use is generally considered lower-risk than scraping for commercial purposes.
- When in doubt, contact the website operator and ask for permission.

Technical Considerations:
- Web scraping is inherently fragile: changes to a website's HTML structure can break your scraper without warning. Build scrapers that fail gracefully and log errors clearly.
- Use libraries like BeautifulSoup (for HTML parsing) and requests (for HTTP requests) in Python, or Selenium for JavaScript-rendered pages.
- Cache scraped data locally so you do not need to re-scrape unnecessarily.
- Handle rate limiting gracefully: if a server returns a 429 (Too Many Requests) status, back off and retry after a delay.

# Example: Responsible web scraping pattern
import requests
import time
from bs4 import BeautifulSoup

def scrape_with_respect(url, delay=5):
    """Scrape a URL with appropriate delay and error handling."""
    try:
        response = requests.get(url, headers={'User-Agent': 'Soccer Analytics Student Project'})
        response.raise_for_status()
        time.sleep(delay)  # Respect the server
        return BeautifulSoup(response.text, 'html.parser')
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
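
The function above waits a fixed delay but does not react when a server actually signals rate limiting. A minimal retry-with-backoff sketch for the 429 case mentioned earlier; the retry count and delays are arbitrary illustrative values, not requirements of any particular site:

# Retry with exponential backoff on HTTP 429 (Too Many Requests)
import time
import requests

def get_with_backoff(url, max_retries=4, base_delay=5):
    """Fetch a URL, backing off and retrying when rate limited."""
    for attempt in range(max_retries):
        response = requests.get(
            url, headers={'User-Agent': 'Soccer Analytics Student Project'}
        )
        if response.status_code == 429:
            wait = base_delay * (2 ** attempt)  # 5s, 10s, 20s, 40s
            print(f"Rate limited; retrying in {wait}s")
            time.sleep(wait)
            continue
        response.raise_for_status()
        return response
    raise RuntimeError(f"Gave up on {url} after {max_retries} attempts")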

Common Pitfall: Building an analysis pipeline that depends entirely on web scraping without caching. If the source website changes its structure, your entire pipeline breaks. Always save scraped data locally and version it. Your analysis code should read from local files, not from live web scraping. Scraping should be a separate, infrequent step that populates your local data store.
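
One way to enforce that separation is to write each page to disk the first time it is fetched and have all analysis code read only from the local copy. A minimal sketch of the fetch-once, read-many pattern; the cache directory and file-naming scheme are arbitrary choices:

# Scraping populates a local cache; analysis reads only cached files
from pathlib import Path
import requests

CACHE_DIR = Path("data/raw/html_cache")  # arbitrary location
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def fetch_page(url: str) -> str:
    """Return page HTML, downloading only if it is not already cached."""
    filename = url.replace("https://", "").replace("/", "_") + ".html"
    cache_file = CACHE_DIR / filename
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    response = requests.get(url, headers={"User-Agent": "Soccer Analytics Student Project"})
    response.raise_for_status()
    cache_file.write_text(response.text, encoding="utf-8")
    return response.text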

2.4.7 Building Your Data Library

For learning and portfolio projects, we recommend assembling data from multiple sources into a well-organized local library:

your_project/
├── data/
│   ├── raw/                    # Original data, never modified
│   │   ├── statsbomb/          # Event data via API
│   │   │   ├── competitions.json
│   │   │   ├── matches/
│   │   │   └── events/
│   │   ├── fbref/              # Scraped aggregates
│   │   │   ├── player_stats.csv
│   │   │   └── team_stats.csv
│   │   └── transfermarkt/      # Reference data
│   │       ├── players.csv
│   │       └── transfers.csv
│   ├── processed/              # Cleaned and transformed data
│   │   ├── shots_with_xg.parquet
│   │   └── player_season_stats.parquet
│   └── features/               # Analysis-ready feature tables
│       └── xg_model_features.parquet
├── notebooks/                  # Jupyter notebooks for exploration
├── src/                        # Python source code
│   ├── data/                   # Data loading and processing
│   ├── models/                 # Analytical models
│   └── viz/                    # Visualization code
├── tests/                      # Unit tests
└── README.md

Best Practice: Maintain a strict separation between raw and processed data. Raw data should never be modified in place—always read from raw, process in code, and write to a separate processed directory. This ensures reproducibility: anyone can re-run your processing pipeline from the raw data and get the same results.
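
In practice this convention means every processing script has the same shape: read from data/raw, transform in memory, write to data/processed. A minimal sketch; the file names and the transformation are placeholders for whatever your project actually contains:

# Raw in, processed out: the raw file is never modified in place
from pathlib import Path
import pandas as pd

RAW = Path("data/raw")
PROCESSED = Path("data/processed")
PROCESSED.mkdir(parents=True, exist_ok=True)

stats = pd.read_csv(RAW / "fbref" / "player_stats.csv")   # placeholder input
stats = stats.dropna(how="all")                           # placeholder transformation
stats.to_parquet(PROCESSED / "player_stats_clean.parquet")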


2.5 Data Quality and Validation

2.5.1 Why Data Quality Matters

The adage "garbage in, garbage out" applies forcefully to soccer analytics. Even sophisticated models produce meaningless results if fed bad data. Understanding data quality—and developing practices to ensure it—is essential for any analyst.

Data quality issues in soccer analytics are ubiquitous. They range from minor (a pass coordinate off by two meters) to major (an entire match missing from a dataset). Some are systematic (a provider consistently undercounts aerial duels) while others are random (an occasional coding error). The analyst's job is not to achieve perfect data—that is impossible—but to understand the types and magnitudes of imperfections and to ensure that conclusions are robust to them.

Consider a concrete example: you are building an xG model and discover that 2% of shots in your training data have coordinates that place them outside the pitch boundaries. If you simply exclude these shots, you are removing data that might be biased in some way (perhaps shots from extreme angles are more likely to have coordinate errors). If you include them, your model is trained on impossible locations. If you correct them by projecting onto the nearest valid coordinate, you are introducing assumptions. There is no perfect answer, but the worst approach is to not notice the problem at all.
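
The three options translate directly into a few lines of pandas. A minimal sketch, assuming a shots DataFrame with StatsBomb-style 120 x 80 coordinates in columns named x and y:

# Three ways to handle shots recorded outside the pitch boundaries
out_of_bounds = ((shots['x'] < 0) | (shots['x'] > 120) |
                 (shots['y'] < 0) | (shots['y'] > 80))

# Option 1: exclude (may bias the sample if errors are not random)
shots_excluded = shots[~out_of_bounds]

# Option 2: keep as-is (the model is trained on impossible locations)
shots_included = shots

# Option 3: project onto the nearest valid coordinate
shots_clipped = shots.copy()
shots_clipped['x'] = shots_clipped['x'].clip(0, 120)
shots_clipped['y'] = shots_clipped['y'].clip(0, 80)

print(f"Affected shots: {out_of_bounds.mean():.1%}")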

2.5.2 Common Data Quality Issues

1. Missing Data

Events or records may be absent:
- Entire matches missing from a dataset (the league had 380 matches but your data contains only 378)
- Specific event types not recorded for certain matches or seasons
- Player IDs not matched to names or matched incorrectly
- Metadata fields (referee, venue, attendance) missing for some matches

Detection:

# Check for missing matches
expected_matches = 380  # e.g., Premier League season
actual_matches = df['match_id'].nunique()
print(f"Missing matches: {expected_matches - actual_matches}")

# Check for missing values across all columns
print(df.isnull().sum())

# Check for matches with suspiciously few events
events_per_match = df.groupby('match_id').size()
print(f"Matches with fewer than 1000 events: {(events_per_match < 1000).sum()}")

Real-World Application: In 2020, a widely cited public analysis of pressing metrics in the Premier League was later found to be affected by missing "pressure" events in certain matches. The provider had not coded pressure events consistently for all matches in the early part of the season, meaning that teams who played more matches before the coding improvement appeared to press less than they actually did. The analysis was retracted and corrected, but it illustrates how missing data can lead to incorrect conclusions even in work by experienced analysts.

2. Coordinate Errors

Location data may be inaccurate:
- Events placed outside pitch boundaries
- Systematic biases (coordinates consistently shifted in one direction, or different coordinate systems used for home and away teams)
- Coordinate system inconsistencies between providers or between seasons

Detection:

# Check for out-of-bounds coordinates
pitch_length, pitch_width = 120, 80
invalid = df[(df['x'] < 0) | (df['x'] > pitch_length) |
             (df['y'] < 0) | (df['y'] > pitch_width)]
print(f"Invalid coordinates: {len(invalid)} ({100*len(invalid)/len(df):.2f}%)")

# Visualize coordinate distributions to spot systematic issues
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(df['x'], bins=50)
axes[0].set_title('X coordinate distribution')
axes[1].hist(df['y'], bins=50)
axes[1].set_title('Y coordinate distribution')
plt.show()

3. Inconsistent Classification

Event types may be classified differently:
- Between providers (Opta and StatsBomb may classify the same action differently)
- Between seasons (a provider may change their definitions from one season to the next)
- Between coders (different human coders may classify edge cases differently)

This is particularly problematic for subjective event types like "key pass," "big chance," and "under pressure." These classifications involve human judgment, and different coders may apply different thresholds. When analyzing trends across seasons, always check whether the provider changed their definitions.

Detection:

# Check event type distributions across seasons
event_counts = df.groupby(['season', 'event_type']).size().unstack(fill_value=0)
print(event_counts)
# Look for sudden changes that might indicate definition changes

4. Temporal Issues

Timing may be incorrect:
- Events out of sequence (a pass completion recorded before the pass itself)
- Duplicate timestamps (two events recorded at the exact same time)
- Missing periods (no events recorded during injury time)
- Inconsistent handling of stoppage time across providers

Detection:

# Check event sequence: with events kept in their recorded order,
# timestamps within a match should never decrease
negative_time = (df.groupby('match_id')['timestamp'].diff() < 0).sum()
print(f"Out-of-sequence events: {negative_time}")

# Check for duplicate events
duplicates = df.duplicated(subset=['match_id', 'timestamp', 'event_type', 'player_id'])
print(f"Duplicate events: {duplicates.sum()}")

5. Entity Resolution

Players and teams may be inconsistently identified:
- Same player with multiple IDs (common when players change clubs or when merging data from multiple providers)
- Name spelling variations (Mohamed Salah vs. Mohamed Salah Ghaly vs. M. Salah)
- Teams changing names (especially after ownership changes or mergers)
- Different providers using different ID systems with no official mapping

Entity resolution is a particularly thorny problem when merging data from multiple sources. FBref, Transfermarkt, StatsBomb, and Opta all use different player ID systems, and there is no universally maintained mapping between them. Community-maintained mapping tables exist (for example, the soccerdata Python package provides some cross-provider ID mappings), but they are incomplete and may contain errors.

Common Pitfall: Merging datasets from different providers by player name is unreliable. Names may be spelled differently, transliterated differently from non-Latin scripts, or abbreviated differently. Always use unique identifiers where possible, and when merging by name, use fuzzy matching with manual verification of edge cases.
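
When a merge by name is unavoidable, the standard library's difflib can propose near-matches for manual review. A minimal sketch with illustrative names; the 0.8 cutoff is an arbitrary starting point, not a recommended threshold:

# Propose cross-provider name matches for manual review
from difflib import get_close_matches

provider_a = ["Mohamed Salah", "Son Heung-min", "Gabriel Martinelli"]
provider_b = ["Mohamed Salah Ghaly", "Heung-Min Son", "Gabriel Martinelli Silva"]

for name in provider_a:
    candidates = get_close_matches(name, provider_b, n=1, cutoff=0.8)
    match = candidates[0] if candidates else "NO MATCH - review manually"
    print(f"{name:25s} -> {match}")

Note that word-order variants (the Son Heung-min example) fall below the similarity cutoff entirely; these are exactly the edge cases that need a manually verified mapping.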

2.5.3 Validation Framework

Implement systematic validation for every dataset you work with. A good validation framework catches problems early, before they propagate through your analysis pipeline:

def validate_event_data(df: pd.DataFrame) -> dict:
    """
    Validate event data quality.

    Parameters
    ----------
    df : pd.DataFrame
        Event data to validate

    Returns
    -------
    dict
        Validation report with quality metrics
    """
    report = {}

    # Completeness checks
    report['total_events'] = len(df)
    report['null_counts'] = df.isnull().sum().to_dict()

    # Coordinate validity (assuming 120x80 pitch)
    valid_coords = (
        (df['x'] >= 0) & (df['x'] <= 120) &
        (df['y'] >= 0) & (df['y'] <= 80)
    )
    report['invalid_coordinates'] = (~valid_coords).sum()
    report['invalid_coordinate_pct'] = round(100 * (~valid_coords).sum() / len(df), 2)

    # Temporal consistency: with events kept in their recorded order,
    # timestamps within a match should never decrease
    report['out_of_sequence'] = int(
        df.groupby('match_id')['timestamp'].diff().lt(0).sum()
    )

    # Event distribution
    report['events_per_match'] = df.groupby('match_id').size().describe().to_dict()

    # Duplicate check
    report['duplicates'] = df.duplicated(
        subset=['match_id', 'timestamp', 'event_type', 'player_id']
    ).sum()

    # Event type distribution (check for unusual patterns)
    report['event_type_counts'] = df['event_type'].value_counts().to_dict()

    # Matches with suspiciously few events
    events_per_match = df.groupby('match_id').size()
    report['low_event_matches'] = int((events_per_match < 1000).sum())

    return report

# Usage
report = validate_event_data(events_df)
for key, value in report.items():
    print(f"{key}: {value}")

2.5.4 Data Cleaning Best Practices

1. Document Everything: Keep records of all cleaning steps for reproducibility. Your future self (and anyone else who works with your data) will thank you.

# Example cleaning log
cleaning_log = []

# Remove invalid coordinates
invalid_mask = (df['x'] < 0) | (df['x'] > 120)
cleaning_log.append(f"Removed {invalid_mask.sum()} events with invalid x coordinates")
df = df[~invalid_mask]

# Log all cleaning steps
for step in cleaning_log:
    print(step)

2. Prefer Filtering Over Imputation: For analytical work, it's often better to exclude problematic data than to guess values. Imputation (filling in missing or incorrect values with estimates) introduces assumptions that may not be valid. If you must impute, document your method and test the sensitivity of your results to different imputation approaches.
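
One cheap sensitivity check is to compute your headline number under both strategies and see whether the conclusion changes. A minimal sketch, assuming a shots DataFrame with an xg column that contains some missing values:

# Does the headline number depend on how missing xG is handled?
mean_if_dropped = shots['xg'].dropna().mean()
mean_if_imputed = shots['xg'].fillna(shots['xg'].median()).mean()

print(f"Mean xG, missing shots excluded: {mean_if_dropped:.3f}")
print(f"Mean xG, missing shots imputed:  {mean_if_imputed:.3f}")
# If the two disagree enough to change your conclusion,
# the missingness itself needs investigating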

3. Validate Downstream: Check that cleaning doesn't introduce new problems. For example, if you remove all events with missing coordinates, check whether the remaining dataset is still representative (are certain event types disproportionately affected?).

4. Version Your Data: Maintain versions of datasets at different processing stages. Use clear naming conventions (e.g., events_raw.parquet, events_cleaned.parquet, events_features.parquet) so you can always return to an earlier stage if needed.

5. Test Your Cleaning Pipeline: Write automated tests that verify your cleaning pipeline produces expected outputs. For example:

def test_no_invalid_coordinates(df):
    """Verify all coordinates are within pitch boundaries."""
    assert (df['x'] >= 0).all() and (df['x'] <= 120).all(), "Invalid x coordinates found"
    assert (df['y'] >= 0).all() and (df['y'] <= 80).all(), "Invalid y coordinates found"

def test_no_duplicate_events(df):
    """Verify no duplicate events exist."""
    dupes = df.duplicated(subset=['match_id', 'timestamp', 'event_type', 'player_id'])
    assert dupes.sum() == 0, f"Found {dupes.sum()} duplicate events"

2.6 Building Your Data Pipeline

2.6.1 From Source to Analysis

A well-designed data pipeline makes analysis efficient and reproducible. Without a pipeline, analysts spend a disproportionate amount of time on data wrangling—loading, cleaning, transforming, and merging data—leaving less time for actual analysis. A good pipeline automates the repetitive parts of data preparation so that analysts can focus on the work that requires human judgment.

2.6.2 Pipeline Architecture

"""
Basic Soccer Data Pipeline Structure

This example shows how to structure a reusable data pipeline
for soccer analytics projects.
"""

from pathlib import Path
import pandas as pd
from typing import Optional
from statsbombpy import sb


class SoccerDataPipeline:
    """Pipeline for collecting and processing soccer data."""

    def __init__(self, data_dir: str = "data"):
        """Initialize pipeline with data directory."""
        self.data_dir = Path(data_dir)
        self.raw_dir = self.data_dir / "raw"
        self.processed_dir = self.data_dir / "processed"

        # Create directories if they don't exist
        self.raw_dir.mkdir(parents=True, exist_ok=True)
        self.processed_dir.mkdir(parents=True, exist_ok=True)

    def fetch_competition_data(
        self,
        competition_id: int,
        season_id: int
    ) -> pd.DataFrame:
        """
        Fetch all events for a competition-season.

        Parameters
        ----------
        competition_id : int
            StatsBomb competition ID
        season_id : int
            StatsBomb season ID

        Returns
        -------
        pd.DataFrame
            All events for the competition-season
        """
        # Get matches
        matches = sb.matches(
            competition_id=competition_id,
            season_id=season_id
        )

        # Collect events for each match
        all_events = []
        for match_id in matches['match_id']:
            try:
                events = sb.events(match_id=match_id)
                events['match_id'] = match_id
                all_events.append(events)
            except Exception as e:
                print(f"Error fetching match {match_id}: {e}")

        return pd.concat(all_events, ignore_index=True)

    def validate_data(self, df: pd.DataFrame) -> dict:
        """Run validation checks on data."""
        return {
            'total_rows': len(df),
            'null_counts': df.isnull().sum().sum(),
            'unique_matches': df['match_id'].nunique()
        }

    def process_events(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Process raw events into analysis-ready format.

        Parameters
        ----------
        df : pd.DataFrame
            Raw event data

        Returns
        -------
        pd.DataFrame
            Processed event data
        """
        # Make a copy to avoid modifying original
        processed = df.copy()

        # Standardize column names
        processed.columns = processed.columns.str.lower().str.replace(' ', '_')

        # Extract location coordinates
        if 'location' in processed.columns:
            processed['x'] = processed['location'].apply(
                lambda loc: loc[0] if isinstance(loc, list) else None
            )
            processed['y'] = processed['location'].apply(
                lambda loc: loc[1] if isinstance(loc, list) else None
            )

        # Add derived features (guard against datasets that contain no shots)
        if 'shot_outcome' in processed.columns:
            processed['is_goal'] = processed['shot_outcome'].eq('Goal')
        else:
            processed['is_goal'] = False

        return processed

    def save_data(self, df: pd.DataFrame, name: str, processed: bool = True):
        """Save DataFrame to appropriate directory."""
        directory = self.processed_dir if processed else self.raw_dir
        filepath = directory / f"{name}.parquet"
        df.to_parquet(filepath)
        print(f"Saved {len(df)} rows to {filepath}")

    def load_data(self, name: str, processed: bool = True) -> pd.DataFrame:
        """Load DataFrame from storage."""
        directory = self.processed_dir if processed else self.raw_dir
        filepath = directory / f"{name}.parquet"
        return pd.read_parquet(filepath)


# Usage example
if __name__ == "__main__":
    pipeline = SoccerDataPipeline()

    # Fetch World Cup 2018 data
    events = pipeline.fetch_competition_data(
        competition_id=43,
        season_id=3
    )

    # Validate
    validation = pipeline.validate_data(events)
    print(f"Validation: {validation}")

    # Process
    processed = pipeline.process_events(events)

    # Save
    pipeline.save_data(processed, "world_cup_2018")

2.6.3 Storage Formats

Choose appropriate formats for different use cases. The choice of storage format has significant implications for performance, storage size, and interoperability:

| Format | Best For | Pros | Cons |
|--------|----------|------|------|
| CSV | Small datasets, sharing | Human readable, universal support | Slow to read/write, large files, no type information |
| Parquet | Large analytical datasets | Fast columnar reads, compressed, preserves types | Requires libraries (pyarrow/fastparquet), not human readable |
| SQLite | Multi-table data, complex queries | SQL interface, relational integrity, no server needed | Setup overhead, slower than Parquet for full-table scans |
| JSON | Hierarchical data, APIs | Flexible structure, web-native | Verbose, slow, large files |
| Feather | Fast interchange between Python and R | Very fast read/write, minimal overhead | Less compressed than Parquet, not ideal for long-term storage |
| HDF5 | Very large numerical datasets, tracking data | Handles huge files, partial reads | Complex API, not widely supported outside scientific computing |

Best Practice: For most soccer analytics projects, use Parquet as your primary storage format. It provides excellent read performance for analytical queries, strong compression (typically 5-10x smaller than CSV), preserves column types, and is well-supported in Python (via pandas and pyarrow), R (via arrow), and many other tools. Save CSV copies only when you need to share data with non-technical collaborators or import into tools that do not support Parquet.
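
The switch is a one-line change in pandas, and it is easy to measure the difference on your own data. A minimal sketch, assuming events_df is a DataFrame already in memory and that pyarrow or fastparquet is installed:

# Compare on-disk size of the same DataFrame as CSV and Parquet
from pathlib import Path
import pandas as pd

events_df.to_csv("events.csv", index=False)
events_df.to_parquet("events.parquet")

csv_mb = Path("events.csv").stat().st_size / 1e6
parquet_mb = Path("events.parquet").stat().st_size / 1e6
print(f"CSV: {csv_mb:.1f} MB  Parquet: {parquet_mb:.1f} MB")

# Reading back from Parquet also preserves column dtypes, unlike CSV
events_back = pd.read_parquet("events.parquet")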

2.6.4 Data Storage Solutions: SQL vs NoSQL vs Flat Files

For larger projects or club-level operations, the choice of data storage solution becomes more consequential:

Relational Databases (PostgreSQL, MySQL, SQLite):
- Best for structured data with well-defined schemas (event data, player reference data, match metadata)
- SQL queries enable complex filtering, joining, and aggregation
- Referential integrity ensures data consistency
- PostgreSQL with the PostGIS extension is particularly useful for spatial queries on coordinate data
- SQLite is excellent for single-user analytics projects: no server setup required

NoSQL Databases (MongoDB, DynamoDB):
- Best for semi-structured or schema-flexible data (event data with varying qualifier structures, tracking data)
- MongoDB's document model maps naturally to the nested JSON structure of event data
- Better horizontal scalability for very large datasets
- Less suitable for complex joins across multiple data types

Flat Files (Parquet, CSV, JSON):
- Best for individual analysis projects and data sharing
- No infrastructure overhead
- Easy to version control and share
- Limited querying capability compared to databases
- Fine for most individual and small-team analytics work

Cloud Solutions:
- Cloud data warehouses (BigQuery, Snowflake, Redshift) are increasingly used by clubs and data providers for large-scale analytics
- Object storage (S3, GCS) for raw data archival
- Managed databases for production applications

For most readers of this textbook, flat files (particularly Parquet) stored in well-organized directory structures will be sufficient. As projects grow in scale and complexity, consider migrating to a database solution.
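
If flat files start to feel limiting but a database server is overkill, SQLite sits in between: pandas can write to and query a single local file via the standard library. A minimal sketch; the file name, table name, and column names are placeholders:

# Store processed tables in one SQLite file and query them with SQL
import sqlite3
import pandas as pd

conn = sqlite3.connect("soccer.db")  # single local file, no server

# Assumes player_stats is a DataFrame already in memory
player_stats.to_sql("player_stats", conn, if_exists="replace", index=False)

# Query it back with SQL (column names are placeholders)
top_scorers = pd.read_sql_query(
    "SELECT player, goals FROM player_stats ORDER BY goals DESC LIMIT 10",
    conn
)
conn.close()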

2.6.5 API Usage Patterns

Many data sources are accessed through APIs (Application Programming Interfaces). Understanding common API patterns is essential for building reliable data pipelines:

"""
Common patterns for working with soccer data APIs.
"""

import hashlib
import json
import requests
import time
from pathlib import Path


class APIClient:
    """Generic API client with caching and rate limiting."""

    def __init__(self, base_url: str, cache_dir: str = "cache"):
        self.base_url = base_url
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
        self.session = requests.Session()

    def get(self, endpoint: str, params: dict = None, cache: bool = True) -> dict:
        """
        Make a GET request with optional caching.

        Parameters
        ----------
        endpoint : str
            API endpoint path
        params : dict, optional
            Query parameters
        cache : bool
            Whether to cache the response

        Returns
        -------
        dict
            Parsed JSON response
        """
        # Check cache first. Build a deterministic key: the built-in hash()
        # is randomized between Python runs, and endpoint paths may contain
        # '/' characters that would break the file name.
        key_source = f"{endpoint}_{json.dumps(params, sort_keys=True)}"
        cache_key = hashlib.md5(key_source.encode()).hexdigest() + ".json"
        cache_path = self.cache_dir / cache_key

        if cache and cache_path.exists():
            with open(cache_path, 'r') as f:
                return json.load(f)

        # Make request with rate limiting
        url = f"{self.base_url}/{endpoint}"
        response = self.session.get(url, params=params)
        response.raise_for_status()

        data = response.json()

        # Cache response
        if cache:
            with open(cache_path, 'w') as f:
                json.dump(data, f)

        # Rate limiting: wait between requests
        time.sleep(1)

        return data

2.6.6 Pipeline Best Practices

  1. Separate concerns: Raw data acquisition, validation, processing, and analysis should be distinct steps in your pipeline. This makes debugging easier and ensures that problems in one stage do not silently propagate to others.

  2. Make it reproducible: Anyone should be able to re-run your pipeline from scratch and get the same results. This means pinning library versions, documenting external data sources, and avoiding manual steps.

  3. Version control: Track changes to pipeline code using git. This allows you to understand when and why processing logic changed, and to revert to earlier versions if needed.

  4. Document assumptions: Write down what you expect from the data, such as the coordinate system, the event types, and the time range covered. These assumptions should be checked programmatically at the start of your pipeline (see the sketch after this list).

  5. Handle errors gracefully: Don't let one bad match crash your entire pipeline. Use try/except blocks, log errors, and continue processing. Review errors afterward to determine whether they indicate data quality issues that need to be addressed.

  6. Log everything: Maintain logs of when data was fetched, how many records were processed, what validation checks passed or failed, and what cleaning steps were applied. These logs are invaluable for debugging and for documenting your analytical process.

  7. Test your pipeline: Write automated tests that verify your pipeline produces expected outputs from known inputs. This catches regressions when you modify pipeline code.
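
As a sketch of item 4, assumption checks can be plain assertions that run before any processing. The specific expectations below (column names, a 120 x 80 coordinate system) are examples; encode whatever your analysis actually relies on:

# Assumption checks run at the start of the pipeline
def check_assumptions(df):
    """Fail fast if the data does not match documented expectations."""
    expected_cols = {'match_id', 'event_type', 'x', 'y'}  # example expectation
    assert expected_cols.issubset(df.columns), "Expected event columns are missing"
    assert df['x'].dropna().between(0, 120).all(), "x outside assumed 120x80 pitch"
    assert df['y'].dropna().between(0, 80).all(), "y outside assumed 120x80 pitch"
    assert df['match_id'].nunique() > 0, "No matches found in data"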


2.7 Chapter Summary

Key Concepts

  1. Soccer data comes in multiple forms: Event data captures discrete actions; tracking data records continuous positions; video provides context; physical data monitors player loads; reference and contextual data provide the backdrop for interpretation.

  2. Event data is the most accessible: It's widely available, relatively affordable, and sufficient for many analyses—but it misses off-ball movement, continuous play, and the spatial context that tracking data provides.

  3. Tracking data is the frontier: It enables sophisticated spatial analysis including pitch control, pressing analysis, and off-ball movement evaluation, but remains expensive and analytically challenging.

  4. Major providers include: Stats Perform/Opta (breadth and history), StatsBomb (quality, depth, and open data), Wyscout/Hudl (video integration and scouting), Second Spectrum and TRACAB (tracking), and SkillCorner (broadcast-derived tracking).

  5. Free data is available: StatsBomb Open Data, FBref, Understat, and other sources make learning and portfolio-building accessible to everyone with an internet connection.

  6. Data quality requires vigilance: Systematic validation and cleaning are essential for reliable analysis. Common issues include missing data, coordinate errors, inconsistent classification, temporal problems, and entity resolution challenges.

  7. Good data pipelines are essential: Separating data acquisition, validation, processing, and analysis into distinct stages with clear documentation ensures reproducibility and makes debugging tractable.

Key Terminology

| Term | Definition |
|------|------------|
| Event data | Records of discrete actions (passes, shots, etc.) during a match, with coordinates and metadata |
| Tracking data | Continuous position data for all 22 players and the ball, typically at 25 frames per second |
| Freeze frame | Snapshot of all player positions at the moment of key events |
| Broadcast tracking | Tracking data derived from broadcast television footage using computer vision |
| Data provider | Company that collects, processes, and sells soccer data |
| Data pipeline | System for collecting, validating, processing, and storing data |
| Validation | Process of checking data quality against expected standards |
| API | Application Programming Interface: a structured way to request data from a service |
| Entity resolution | The process of matching records that refer to the same real-world entity (e.g., the same player) across different data sources |

Decision Framework

When selecting data for a project:

├── What question am I trying to answer?
│   ├── Requires off-ball analysis → Need tracking data
│   ├── Requires spatial context for shots/passes → Need freeze frames
│   ├── On-ball statistics and metrics → Event data sufficient
│   └── Historical trends or cross-league comparison → Aggregated data (FBref) may suffice
├── What resources are available?
│   ├── Learning project → Use free data (StatsBomb Open, FBref)
│   ├── Academic research → StatsBomb Open Data or academic data sharing agreements
│   ├── Commercial project → Budget for appropriate provider
│   └── Club work → Use internal data sources, supplemented by commercial providers
├── What quality is required?
│   ├── High-stakes decisions → Premium providers, thorough validation
│   ├── Exploratory analysis → Public data acceptable
│   └── Model training → Maximum volume; quality issues addressed in preprocessing
└── What infrastructure do I have?
    ├── Individual laptop → Flat files (Parquet/CSV), SQLite
    ├── Team with shared servers → PostgreSQL, shared file systems
    └── Enterprise/club → Cloud solutions, data warehouses

What's Next

In Chapter 3: Statistical Foundations for Soccer Analysis, we'll build the statistical toolkit needed to analyze the data you've now learned to access. You'll learn to apply descriptive statistics, probability, and inference to soccer problems, establishing the mathematical foundations for the modeling work in later chapters.

Before moving on, complete the exercises and quiz to solidify your understanding of soccer data sources.


Chapter 2 Exercises → exercises.md

Chapter 2 Quiz → quiz.md

Case Study: Building a Multi-Source Data Pipeline → case-study-01.md

Case Study: Comparing Data Provider Quality → case-study-02.md


Chapter 2 Complete