Chapter 2: Key Takeaways
Data Sources and Collection in Soccer
Data Types at a Glance
| Data Type | What It Captures | Collection Method | Typical Volume |
|---|---|---|---|
| Event Data | Discrete actions (passes, shots) | Human coders | 2,000-3,000 events/match |
| Tracking Data | Continuous positions (25 fps) | Cameras or GPS | 4-6 million points/match |
| Freeze Frames | Positions at key moments | Derived from tracking | 50-200 snapshots/match |
| Physical Data | Load, speed, acceleration | Wearable sensors | Varies by system |
| Video | Raw match footage | Broadcast/cameras | Hours of footage |
The Critical Distinction
Event Data: Tells you WHAT happened (the story)
Tracking Data: Shows you EVERYTHING that happened (the full picture)
Event data misses:
- Off-ball player movement
- Space creation and exploitation
- Pressing patterns
- Defensive positioning
- Continuous context between actions
Major Data Providers
| Provider | Primary Product | Key Strength | Typical User |
|---|---|---|---|
| Stats Perform/Opta | Event data | Historical depth, coverage | Media, betting |
| StatsBomb | Event data + freeze frames | Quality, analytics depth | Clubs, researchers |
| Wyscout | Video + data | Video integration, breadth | Scouts, clubs |
| Second Spectrum | Tracking data | Advanced analytics | Top clubs |
| SkillCorner | Broadcast-derived tracking | No hardware needed | Various |
Free Data Sources
StatsBomb Open Data
- Best for: Learning event data analysis
- Includes: World Cups, select leagues, freeze frames
- Access: statsbombpy Python package or GitHub
FBref
- Best for: Aggregated player/team statistics
- Includes: Major leagues since 2017-18
- Access: Web scraping or manual download
Other Sources
- Understat: xG data and shot maps
- Transfermarkt: Market values, transfers
- Football-Data.co.uk: Historical results
Data Quality Checklist
Before any analysis, check for:
| Issue | Detection Method |
|---|---|
| Missing data | Count nulls, compare expected vs actual records |
| Invalid coordinates | Check against the provider's pitch boundaries (e.g., StatsBomb: x 0-120, y 0-80) |
| Out-of-sequence events | Check timestamp differences |
| Duplicates | Check unique constraints |
| Inconsistent classification | Compare distributions across sources |
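The first four checks in the table can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a provider's validation routine: the field names (`id`, `timestamp`, `x`, `y`), the sample records, and the `quality_report` helper are all hypothetical, and the default bounds assume a StatsBomb-style 120 x 80 pitch. Comparing classification across sources is not covered because it needs data from more than one provider.

```python
from collections import Counter

# Hypothetical sample events; real provider schemas differ.
events = [
    {"id": "a1", "timestamp": 1.0, "x": 60.0, "y": 40.0},
    {"id": "a2", "timestamp": 3.5, "x": 130.0, "y": 40.0},  # x outside the pitch
    {"id": "a3", "timestamp": 2.0, "x": 70.0, "y": None},   # missing y, earlier than previous row
    {"id": "a3", "timestamp": 4.0, "x": 75.0, "y": 30.0},   # repeated id
]

def quality_report(events, x_max=120.0, y_max=80.0):
    """Count basic quality issues in a list of event dicts."""
    # Missing data: any record without a full coordinate pair.
    missing = sum(1 for e in events if e["x"] is None or e["y"] is None)
    # Invalid coordinates: complete pairs that fall outside the pitch.
    invalid = sum(
        1 for e in events
        if e["x"] is not None and e["y"] is not None
        and not (0 <= e["x"] <= x_max and 0 <= e["y"] <= y_max)
    )
    # Out-of-sequence events: a timestamp lower than the one before it.
    out_of_sequence = sum(
        1 for prev, cur in zip(events, events[1:])
        if cur["timestamp"] < prev["timestamp"]
    )
    # Duplicates: extra occurrences of an id that should be unique.
    id_counts = Counter(e["id"] for e in events)
    duplicates = sum(n - 1 for n in id_counts.values() if n > 1)
    return {
        "missing": missing,
        "invalid_coordinates": invalid,
        "out_of_sequence": out_of_sequence,
        "duplicates": duplicates,
    }

report = quality_report(events)
print(report)
```

With a DataFrame, the same checks map onto null counts, boolean range filters, timestamp differences, and duplicate detection on the key column.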
Data Pipeline Stages
Raw Data → Validation → Cleaning → Storage → Processing → Analysis
Best Practices:
1. Document every cleaning step
2. Version your data
3. Prefer filtering over imputation
4. Validate downstream effects
5. Make pipelines reproducible
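As a rough sketch of two of these practices, the snippet below filters out records with missing coordinates instead of imputing them, and writes each cleaning step to a log so it is documented. The `validate` and `clean` functions, the record fields, and the log format are hypothetical names chosen for illustration only.

```python
def validate(records):
    """Flag each record as valid or not, without altering its values."""
    return [
        dict(r, valid=r["x"] is not None and r["y"] is not None)
        for r in records
    ]

def clean(records, log):
    """Filter out invalid records and document the step in the log."""
    kept = [r for r in records if r["valid"]]
    dropped = len(records) - len(kept)
    log.append(f"dropped {dropped} of {len(records)} records with missing coordinates")
    return kept

# Hypothetical raw records; the second one has no x coordinate.
raw = [
    {"x": 10.0, "y": 20.0},
    {"x": None, "y": 35.0},
    {"x": 55.0, "y": 40.0},
]

cleaning_log = []
cleaned = clean(validate(raw), cleaning_log)
print(len(cleaned), cleaning_log)
```

Keeping validation and cleaning as separate functions makes it easier to rerun the pipeline and to see how many records each step removed.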
Quick Reference: Getting Started
For Learning Projects
# Install the libraries (run in a shell)
pip install statsbombpy pandas
# Access free data (run in Python)
from statsbombpy import sb
competitions = sb.competitions()  # available competitions and seasons
matches = sb.matches(competition_id=43, season_id=3)  # World Cup 2018
events = sb.events(match_id=7298)  # events for a single match, by match ID
Recommended Storage Format
- Small datasets: CSV (readable, portable)
- Large datasets: Parquet (fast, compressed)
- Multi-table data: SQLite (relational queries)
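The trade-off between CSV and SQLite can be illustrated with the standard library alone. The sketch below is an assumption-laden example rather than a recommended schema: the `shots` records, file names, and table layout are hypothetical. Parquet is left out because it needs a third-party engine (for example pandas with pyarrow).

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical shot records used only for illustration.
shots = [
    {"match_id": 1, "player": "A", "xg": 0.12},
    {"match_id": 1, "player": "B", "xg": 0.45},
    {"match_id": 2, "player": "A", "xg": 0.08},
]

workdir = Path(tempfile.mkdtemp())

# CSV: readable and portable, but every value is read back as a string.
csv_path = workdir / "shots.csv"
with csv_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["match_id", "player", "xg"])
    writer.writeheader()
    writer.writerows(shots)

with csv_path.open(newline="") as f:
    csv_rows = list(csv.DictReader(f))

# SQLite: keeps column types and supports relational queries.
conn = sqlite3.connect(workdir / "soccer.db")
conn.execute("CREATE TABLE shots (match_id INTEGER, player TEXT, xg REAL)")
conn.executemany("INSERT INTO shots VALUES (:match_id, :player, :xg)", shots)
conn.commit()
xg_by_player = dict(
    conn.execute("SELECT player, SUM(xg) FROM shots GROUP BY player")
)
conn.close()

print(csv_rows[0])
print(xg_by_player)
```

The CSV round trip loses type information, so numeric columns have to be converted again on load, while the SQLite table can be aggregated directly with SQL.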
Decision Framework
What data do I need?
├── Analyzing specific actions (shots, passes)?
│ └── Event data is sufficient
├── Analyzing off-ball movement or space?
│ └── Need tracking data or freeze frames
├── Analyzing physical performance?
│ └── Need GPS/wearable data
└── Need video context?
└── Add video platform access
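The tree above can also be written as a small lookup, which is handy when a project plan covers several questions at once. The keys in `DATA_NEEDS` and the `data_requirements` helper are hypothetical labels for the four branches, not terms from any provider.

```python
# Hypothetical labels for the four branches of the decision tree.
DATA_NEEDS = {
    "specific_actions": "event data",
    "off_ball_movement": "tracking data or freeze frames",
    "physical_performance": "GPS/wearable data",
    "video_context": "video platform access",
}

def data_requirements(focuses):
    """Return the data types needed for the given focuses, in order, without repeats."""
    needed = []
    for focus in focuses:
        requirement = DATA_NEEDS[focus]
        if requirement not in needed:
            needed.append(requirement)
    return needed

plan = data_requirements(["specific_actions", "off_ball_movement"])
print(plan)
```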
Common Mistakes to Avoid
| Mistake | Better Approach |
|---|---|
| Assuming data is perfect | Always validate before analysis |
| Using wrong coordinate system | Check provider documentation |
| Ignoring collection methodology | Understand how data was created |
| Mixing providers without adjustment | Be aware of definitional differences |
| Overcomplicating initial projects | Start with free data, simple questions |
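To illustrate the coordinate-system row, the sketch below rescales a point from a 0-100 percentage grid to a 120 x 80 grid. The function name `to_120x80` and the `flip_y` flag are hypothetical. Whether the vertical axis needs flipping depends on where each provider places its origin, so treat that as an assumption to confirm in the provider documentation.

```python
def to_120x80(x_pct, y_pct, flip_y=False):
    """Rescale a point from a 0-100 grid to a 120 x 80 grid.

    Set flip_y=True when the two systems measure y from opposite touchlines.
    """
    x = x_pct / 100.0 * 120.0
    if flip_y:
        y_pct = 100.0 - y_pct
    y = y_pct / 100.0 * 80.0
    return x, y

centre = to_120x80(50.0, 50.0)
corner = to_120x80(100.0, 0.0, flip_y=True)
print(centre, corner)
```

The centre spot is a useful sanity check because it maps to the middle of the pitch in either orientation, while a corner reveals whether the flip was applied as intended.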
Key Terminology
| Term | Definition |
|---|---|
| Event data | Records of discrete match actions |
| Tracking data | Continuous positional data for all players |
| Freeze frame | Position snapshot at specific moments |
| Optical tracking | Camera-based position detection |
| GPS tracking | Wearable device-based tracking |
| Data validation | Quality checking process |
| Data pipeline | System for processing data from source to analysis |
Looking Ahead
Chapter 3: Statistical Foundations for Soccer Analysis will teach you the statistical techniques needed to analyze the data you've now learned to access:
- Descriptive statistics for soccer contexts
- Probability concepts for match outcomes
- Inference and hypothesis testing
- Introduction to regression
Keep this summary card handy as a reference while working through later chapters.