Chapter 2: Key Takeaways

Data Sources and Collection in Soccer


Data Types at a Glance

Data Type | What It Captures | Collection Method | Typical Volume
Event Data | Discrete actions (passes, shots) | Human coders | 2,000-3,000 events/match
Tracking Data | Continuous positions (25 fps) | Cameras or GPS | 4-6 million points/match
Freeze Frames | Positions at key moments | Derived from tracking | 50-200 snapshots/match
Physical Data | Load, speed, acceleration | Wearable sensors | Varies by system
Video | Raw match footage | Broadcast/cameras | Hours of footage

The Critical Distinction

Event Data: Tells you WHAT happened (the story)

Tracking Data: Shows you EVERYTHING that happened (the full picture)

Event data misses:

  • Off-ball player movement
  • Space creation and exploitation
  • Pressing patterns
  • Defensive positioning
  • Continuous context between actions


Major Data Providers

Provider | Primary Product | Key Strength | Typical User
Stats Perform/Opta | Event data | Historical depth, coverage | Media, betting
StatsBomb | Event data + freeze frames | Quality, analytics depth | Clubs, researchers
Wyscout | Video + data | Video integration, breadth | Scouts, clubs
Second Spectrum | Tracking data | Advanced analytics | Top clubs
SkillCorner | Broadcast-derived tracking | No hardware needed | Various

Free Data Sources

StatsBomb Open Data

  • Best for: Learning event data analysis
  • Includes: World Cups, select leagues, freeze frames
  • Access: statsbombpy Python package or GitHub

FBref

  • Best for: Aggregated player/team statistics
  • Includes: Major leagues since 2017-18
  • Access: Web scraping or manual download

Other Sources

  • Understat: xG data and shot maps
  • Transfermarkt: Market values, transfers
  • Football-Data.co.uk: Historical results

Data Quality Checklist

Before any analysis, check for:

Issue | Detection Method
Missing data | Count nulls, compare expected vs actual records
Invalid coordinates | Check against the provider's pitch boundaries (StatsBomb: x 0-120, y 0-80; other providers differ)
Out-of-sequence events | Check timestamp differences
Duplicates | Check unique constraints
Inconsistent classification | Compare distributions across sources
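
The first four checks in the table can be sketched with pandas. The column names (`id`, `timestamp`, `x`, `y`), the hand-made rows, and the 120 x 80 pitch bounds are illustrative assumptions, not any provider's actual schema. Comparing classifications across sources is left out because it needs a second dataset.

```python
import pandas as pd

# Hypothetical toy event table; real provider schemas differ.
events = pd.DataFrame({
    "id":        [1, 2, 2, 3, 4],
    "timestamp": [0.0, 4.2, 4.2, 3.1, 9.8],   # seconds from kickoff
    "x":         [60.0, 75.5, 75.5, 130.0, None],
    "y":         [40.0, 22.0, 22.0, 35.0, 41.0],
})

def quality_report(df, x_max=120.0, y_max=80.0):
    """Count the rows (or cells) affected by each checklist issue."""
    has_coords = df["x"].notna() & df["y"].notna()
    in_bounds = df["x"].between(0, x_max) & df["y"].between(0, y_max)
    return {
        # Total number of null cells across all columns
        "missing_values": int(df.isna().sum().sum()),
        # Rows with both coordinates present but outside the pitch
        "invalid_coordinates": int((has_coords & ~in_bounds).sum()),
        # Rows whose timestamp is earlier than the previous row's
        "out_of_sequence": int((df["timestamp"].diff() < 0).sum()),
        # Rows repeating an id that already appeared
        "duplicate_ids": int(df["id"].duplicated().sum()),
    }

report = quality_report(events)
print(report)
```

Each count is a prompt to inspect the affected rows, not a verdict: a repeated id, for example, may be a genuine duplicate or a sign that the chosen key is not actually unique.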

Data Pipeline Stages

Raw Data → Validation → Cleaning → Storage → Processing → Analysis

Best Practices:

  1. Document every cleaning step
  2. Version your data
  3. Prefer filtering over imputation
  4. Validate downstream effects
  5. Make pipelines reproducible
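
As a minimal sketch of the cleaning stage, the snippet below applies two filtering steps in a fixed order and logs how many rows each one removes, which covers "document every cleaning step" and "prefer filtering over imputation" in a small way. The record layout, the step functions, and `run_pipeline` are all hypothetical names chosen for illustration.

```python
# Hypothetical raw records; None marks a missing coordinate.
raw = [
    {"id": 1, "x": 60.0, "y": 40.0},
    {"id": 2, "x": None, "y": 35.0},
    {"id": 3, "x": 130.0, "y": 20.0},   # outside an assumed 120 x 80 pitch
    {"id": 4, "x": 101.5, "y": 44.0},
]

def drop_missing(rows):
    """Filter out rows with a missing coordinate instead of imputing one."""
    return [r for r in rows if r["x"] is not None and r["y"] is not None]

def drop_out_of_bounds(rows, x_max=120.0, y_max=80.0):
    """Filter out rows whose coordinates fall outside the pitch."""
    return [r for r in rows if 0 <= r["x"] <= x_max and 0 <= r["y"] <= y_max]

def run_pipeline(rows, steps):
    """Apply each cleaning step in order and log how many rows it removed."""
    log = []
    for step in steps:
        before = len(rows)
        rows = step(rows)
        log.append((step.__name__, before - len(rows)))
    return rows, log

clean, log = run_pipeline(raw, [drop_missing, drop_out_of_bounds])
print(log)   # one (step name, rows removed) entry per cleaning step
```

Because the steps are plain functions applied in a fixed order, rerunning the pipeline on the same raw data gives the same result, and the log doubles as a record of what was removed and why.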


Quick Reference: Getting Started

For Learning Projects

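# Install the libraries (run in a terminal, not in Python)
pip install statsbombpy pandas

# Access free data (Python; each call downloads from StatsBomb Open Data)
from statsbombpy import sb

competitions = sb.competitions()  # available competitions and seasons
matches = sb.matches(competition_id=43, season_id=3)  # World Cup 2018
match_id = int(matches["match_id"].iloc[0])  # pick a match from that competition
events = sb.events(match_id=match_id)  # one row per event in the match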

For Data Storage

  • Small datasets: CSV (readable, portable)
  • Large datasets: Parquet (fast, compressed)
  • Multi-table data: SQLite (relational queries)
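
To make the trade-off concrete, here is a minimal sketch that writes the same rows to CSV and to SQLite using only the Python standard library. The `shots` rows, table name, and file names are made up for illustration. Parquet is omitted because it needs an extra dependency such as pyarrow.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical rows standing in for a small shots table.
shots = [
    {"match_id": 1, "player": "A", "xg": 0.12},
    {"match_id": 1, "player": "B", "xg": 0.45},
    {"match_id": 2, "player": "A", "xg": 0.08},
]

workdir = Path(tempfile.mkdtemp())

# CSV: human-readable and portable, but every value is read back as a string.
csv_path = workdir / "shots.csv"
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["match_id", "player", "xg"])
    writer.writeheader()
    writer.writerows(shots)

with open(csv_path, newline="") as f:
    csv_rows = list(csv.DictReader(f))

# SQLite: typed columns and relational queries in a single file.
conn = sqlite3.connect(workdir / "soccer.db")
conn.execute("CREATE TABLE shots (match_id INTEGER, player TEXT, xg REAL)")
conn.executemany("INSERT INTO shots VALUES (:match_id, :player, :xg)", shots)
conn.commit()
xg_by_player = dict(conn.execute(
    "SELECT player, ROUND(SUM(xg), 2) FROM shots GROUP BY player ORDER BY player"
))
conn.close()

print(csv_rows[0])     # values come back as strings
print(xg_by_player)    # total xG per player, computed inside the database
```

The CSV round trip loses type information, so numeric columns have to be converted again on load, while SQLite keeps the declared column types and lets the aggregation run as a query.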

Decision Framework

What data do I need?

├── Analyzing specific actions (shots, passes)?
│   └── Event data is sufficient
├── Analyzing off-ball movement or space?
│   └── Need tracking data or freeze frames
├── Analyzing physical performance?
│   └── Need GPS/wearable data
└── Need video context?
    └── Add video platform access

Common Mistakes to Avoid

Mistake | Better Approach
Assuming data is perfect | Always validate before analysis
Using wrong coordinate system | Check provider documentation
Ignoring collection methodology | Understand how data was created
Mixing providers without adjustment | Be aware of definitional differences
Overcomplicating initial projects | Start with free data, simple questions
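
As one example of adjusting for coordinate systems before mixing providers, the sketch below rescales a point from one pitch coordinate system to another. The `rescale` helper is hypothetical, and the default sizes (a 0-100 by 0-100 source and a 120 by 80 target) are assumptions for illustration. Confirm the axis ranges, origin, and y-axis direction in each provider's documentation before relying on a conversion like this.

```python
def rescale(x, y, src=(100.0, 100.0), dst=(120.0, 80.0), flip_y=False):
    """Map (x, y) from a src-sized pitch to a dst-sized pitch.

    src and dst are (length, width) of each coordinate system.
    flip_y=True handles a source whose y axis runs in the opposite direction.
    """
    new_x = x / src[0] * dst[0]
    new_y = y / src[1] * dst[1]
    if flip_y:
        new_y = dst[1] - new_y
    return new_x, new_y

centre = rescale(50.0, 50.0)                 # centre spot stays the centre spot
corner = rescale(100.0, 0.0, flip_y=True)    # far corner with the y axis flipped
print(centre, corner)
```

Rescaling only aligns positions. Definitional differences between providers, such as what counts as a pass or a duel, still need separate handling.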

Key Terminology

Term | Definition
Event data | Records of discrete match actions
Tracking data | Continuous positional data for all players
Freeze frame | Position snapshot at specific moments
Optical tracking | Camera-based position detection
GPS tracking | Wearable device-based tracking
Data validation | Quality checking process
Data pipeline | System for processing data from source to analysis

Looking Ahead

Chapter 3: Statistical Foundations for Soccer Analysis will teach you the statistical techniques needed to analyze the data you've now learned to access:

  • Descriptive statistics for soccer contexts
  • Probability concepts for match outcomes
  • Inference and hypothesis testing
  • Introduction to regression


Keep this summary card handy as a reference while working through later chapters.