Exercises: Data Sources and Collection in Soccer
These exercises build practical skills in working with soccer data sources, from conceptual understanding to hands-on data acquisition and validation.
Scoring Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)
Part A: Conceptual Understanding ⭐
Test your understanding of core concepts. No coding required.
A.1. What is the fundamental difference between event data and tracking data? For each, describe:
- What information is captured
- How it is collected
- One analysis that requires this data type
A.2. A colleague says: "We have all the event data, so we know everything that happened in the match." Explain why this statement is misleading. What does event data miss?
A.3. Explain what "freeze frame" data is. Why might freeze frames be a useful middle ground between event-only data and full tracking data?
A.4. Describe three potential sources of error in event data. For each, explain how the error might arise and how you might detect it.
A.5. What is the difference between optical tracking and GPS/wearable tracking? List two advantages and two disadvantages of each approach.
A.6. True or False (with explanation): "More data is always better for soccer analysis." Provide a scenario where having less, higher-quality data might be preferable to having more, lower-quality data.
A.7. List three pieces of contextual data that might be important for a soccer analysis but are NOT typically included in event or tracking data. For each, explain why it matters and where you might find it.
Part B: Provider Analysis ⭐⭐
Evaluate and compare data providers.
B.1. Provider Comparison Table
Create a detailed comparison table of three data providers (choose from: Stats Perform/Opta, StatsBomb, Wyscout, Second Spectrum, SkillCorner). Your table should include:
- Data types offered
- Coverage (leagues, historical depth)
- Unique features
- Primary use cases
- Approximate pricing tier (free/affordable/premium/enterprise)
B.2. Use Case Matching
For each of the following use cases, identify which data provider(s) would be most appropriate and explain why:
a) A newspaper journalist writing weekly analysis columns
b) A club scout evaluating young players in South American leagues
c) A PhD researcher studying pressing patterns using spatial data
d) A mid-table Championship club building basic analytics capabilities
e) A fan building a personal analytics blog/portfolio
B.3. Data Quality Investigation
Compare the same metric (e.g., total passes for a specific player in a specific match) across two different public sources (FBref, Understat, WhoScored, etc.). Document:
- What values each source reports
- Any discrepancies you find
- Possible reasons for differences
B.4. Provider Evolution
Research how one major data provider (your choice) has evolved over the past 5-10 years. Consider:
- Changes in data types offered
- Improvements in quality or coverage
- New features or products
- Competitive positioning
Part C: Practical Data Access ⭐⭐
Hands-on exercises working with real data.
C.1. StatsBomb API Exploration
Using the statsbombpy library, write code to:
# Your code should:
# a) List all available competitions
# b) Find all matches from FIFA World Cup 2018
# c) Retrieve events for a specific match
# d) Count the total number of passes in that match
# e) Find the player with the most shots
Test your code and report the results.
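As a starting point for parts (d) and (e), the sketch below counts events by type on records shaped like the raw StatsBomb Open Data JSON, where each event carries a nested `type` object and, usually, a nested `player` object. The helper names `count_events_of_type` and `top_player_for_type` and the sample events are illustrative assumptions, not part of any library; fetching the real events is left to you.

```python
from collections import Counter


def count_events_of_type(events, type_name):
    """Count events whose nested type name equals type_name."""
    return sum(1 for e in events if e.get("type", {}).get("name") == type_name)


def top_player_for_type(events, type_name):
    """Return (player_name, count) for the most frequent player, or None."""
    counts = Counter(
        e["player"]["name"]
        for e in events
        if e.get("type", {}).get("name") == type_name and "player" in e
    )
    return counts.most_common(1)[0] if counts else None


# Hand-made sample events (not real match data).
sample_events = [
    {"type": {"name": "Pass"}, "player": {"name": "Player A"}},
    {"type": {"name": "Pass"}, "player": {"name": "Player B"}},
    {"type": {"name": "Shot"}, "player": {"name": "Player B"}},
    {"type": {"name": "Shot"}, "player": {"name": "Player B"}},
    {"type": {"name": "Shot"}, "player": {"name": "Player A"}},
    {"type": {"name": "Half Start"}},  # some events have no player
]

print(count_events_of_type(sample_events, "Pass"))
print(top_player_for_type(sample_events, "Shot"))
```

If you load events through statsbombpy instead, expect a DataFrame rather than nested dicts, so inspect its columns before adapting this logic.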
C.2. FBref Data Extraction
Write code to scrape player statistics from FBref for a specific league season:
# Your code should:
# a) Load the main player stats table for a league
# b) Clean column names and data types
# c) Handle any merged header rows
# d) Save to CSV
# e) Print summary statistics
Document any challenges you encounter and how you solved them.
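For part (c), one possible approach is to collapse a two-level header into single column names. The sketch below assumes the header arrives as (group, name) pairs, as happens when an HTML table parser such as `pandas.read_html` reads a table with grouped headers and labels the empty groups with names starting with "Unnamed". The function name `flatten_header` and the sample header are illustrative assumptions; check the actual page structure before relying on it.

```python
def flatten_header(columns):
    """Collapse (group, name) header pairs into single column names."""
    flat = []
    for group, name in columns:
        group, name = str(group).strip(), str(name).strip()
        if not group or group.startswith("Unnamed"):
            # No meaningful group label: keep the bare column name.
            flat.append(name)
        else:
            # Prefix with the group so repeated names stay distinct.
            flat.append(f"{group}_{name}".replace(" ", "_"))
    return flat


# Illustrative header, not copied from a real page.
header = [
    ("Unnamed: 0_level_0", "Player"),
    ("Unnamed: 1_level_0", "Squad"),
    ("Performance", "Gls"),
    ("Per 90 Minutes", "Gls"),
]

print(flatten_header(header))
```

Prefixing with the group name matters because the same short name (here `Gls`) can appear under several groups.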
C.3. Data Merging Challenge
Using data from two different sources:
a) Retrieve player statistics from StatsBomb Open Data
b) Retrieve market values from Transfermarkt (or another source)
c) Merge the datasets by player
d) Document any matching challenges (name variations, missing players)
e) Calculate a "value per goal contribution" metric
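Name variations are usually the hardest part of this merge. The sketch below shows one possible way to normalize names and pick the closest candidate using only the standard library. The function names and the 0.85 similarity threshold are assumptions chosen for illustration, and the threshold should be tuned against your own data.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase, strip combining accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def best_match(name, candidates, threshold=0.85):
    """Return the candidate most similar to name, or None if below threshold."""
    target = normalize_name(name)
    scored = [
        (SequenceMatcher(None, target, normalize_name(c)).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored, default=(0.0, None))
    return candidate if score >= threshold else None


print(normalize_name("Luka  Modrić"))
print(best_match("Luka Modric", ["Luka Modrić", "Ivan Rakitić"]))
```

Accent stripping via NFKD does not cover letters such as "ø" or "ł", which have no decomposition, so those need an explicit replacement table. Wherever the sources expose a stable player ID, matching on it is more reliable than matching on names.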
C.4. Building a Local Database
Design and implement a simple SQLite database to store soccer data:
# Your database should include tables for:
# - competitions
# - teams
# - players
# - matches
# - events (basic schema)
# Include proper primary/foreign keys
# Write functions to insert and query data
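One possible starting schema is sketched below with the standard-library `sqlite3` module. The table and column names, the text-typed `event_id`, and the helper functions `create_database` and `count_events` are assumptions for illustration; extend the columns to match the data you actually store.

```python
import sqlite3

SCHEMA = """
CREATE TABLE competitions (
    competition_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE teams (
    team_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE players (
    player_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE matches (
    match_id INTEGER PRIMARY KEY,
    competition_id INTEGER NOT NULL REFERENCES competitions(competition_id),
    home_team_id INTEGER NOT NULL REFERENCES teams(team_id),
    away_team_id INTEGER NOT NULL REFERENCES teams(team_id),
    match_date TEXT
);
CREATE TABLE events (
    event_id TEXT PRIMARY KEY,
    match_id INTEGER NOT NULL REFERENCES matches(match_id),
    team_id INTEGER REFERENCES teams(team_id),
    player_id INTEGER REFERENCES players(player_id),
    event_type TEXT NOT NULL,
    minute INTEGER,
    x REAL,
    y REAL
);
"""


def create_database(path=":memory:"):
    """Create the schema and return an open connection."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def count_events(conn, match_id, event_type):
    """Count events of one type in one match."""
    row = conn.execute(
        "SELECT COUNT(*) FROM events WHERE match_id = ? AND event_type = ?",
        (match_id, event_type),
    ).fetchone()
    return row[0]


# Made-up rows to exercise the schema.
conn = create_database()
conn.execute("INSERT INTO competitions VALUES (1, 'Example Cup')")
conn.executemany("INSERT INTO teams VALUES (?, ?)", [(1, "Home FC"), (2, "Away FC")])
conn.execute("INSERT INTO players VALUES (10, 'Player A')")
conn.execute("INSERT INTO matches VALUES (100, 1, 1, 2, '2024-01-01')")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("e1", 100, 1, 10, "Pass", 1, 60.0, 40.0),
        ("e2", 100, 1, 10, "Shot", 2, 110.0, 38.0),
    ],
)
print(count_events(conn, 100, "Pass"))
```

Parameterized queries (the `?` placeholders) keep inserts safe when names contain quotes or other special characters.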
Part D: Data Quality ⭐⭐⭐
Investigate and address data quality issues.
D.1. Coordinate Validation
Using StatsBomb Open Data, analyze the distribution of event coordinates:
a) Plot histograms of x and y coordinates for all events
b) Check for events outside expected pitch boundaries
c) Compare coordinate distributions for home vs. away events
d) Identify any systematic biases or issues
e) Propose and implement cleaning rules
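For part (b), a boundary check might look like the sketch below. It assumes events shaped like the raw StatsBomb JSON, with an optional `location` list of `[x, y]`, and the StatsBomb pitch dimensions of 120 by 80 units. The `tolerance` parameter and the sample events are illustrative assumptions.

```python
PITCH_LENGTH = 120.0  # expected x range: 0 to 120
PITCH_WIDTH = 80.0    # expected y range: 0 to 80


def out_of_bounds(events, tolerance=0.0):
    """Return events whose location lies outside the pitch plus a tolerance."""
    flagged = []
    for event in events:
        location = event.get("location")
        if not location:
            continue  # many event types carry no location at all
        x, y = location[0], location[1]
        inside = (
            -tolerance <= x <= PITCH_LENGTH + tolerance
            and -tolerance <= y <= PITCH_WIDTH + tolerance
        )
        if not inside:
            flagged.append(event)
    return flagged


# Hand-made sample events (not real match data).
sample_events = [
    {"id": "a", "location": [60.0, 40.0]},
    {"id": "b", "location": [121.5, 40.0]},
    {"id": "c", "location": [10.0, -0.2]},
    {"id": "d"},
]

print([e["id"] for e in out_of_bounds(sample_events)])
print([e["id"] for e in out_of_bounds(sample_events, tolerance=0.5)])
```

A small tolerance separates marginal overshoots near the touchline from clearly invalid coordinates, which may call for different cleaning rules.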
D.2. Temporal Consistency
Examine the temporal structure of event data:
a) For a single match, plot the event timestamps over time
b) Identify any gaps or clusters that seem suspicious
c) Check if events ever appear out of sequence
d) Analyze typical events per minute across match phases (first half, second half, injury time)
e) Document any quality concerns
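Parts (c) and (d) can be approached with two small helpers, sketched below. The sketch assumes each event has integer `period`, `minute`, and `second` fields, as in the raw StatsBomb JSON, and that the list is in the order delivered by the source. The function names and sample events are illustrative assumptions.

```python
from collections import Counter


def event_seconds(event):
    """Match clock in seconds, built from the minute and second fields."""
    return event["minute"] * 60 + event["second"]


def out_of_sequence(events):
    """Return index pairs (i, i + 1) where the clock runs backwards in a period."""
    problems = []
    for i in range(len(events) - 1):
        current, following = events[i], events[i + 1]
        same_period = current["period"] == following["period"]
        if same_period and event_seconds(following) < event_seconds(current):
            problems.append((i, i + 1))
    return problems


def events_per_minute(events):
    """Count events in each (period, minute) bucket."""
    return Counter((e["period"], e["minute"]) for e in events)


# Hand-made sample events (not real match data).
sample_events = [
    {"period": 1, "minute": 0, "second": 5},
    {"period": 1, "minute": 0, "second": 30},
    {"period": 1, "minute": 0, "second": 12},  # clock goes backwards here
    {"period": 1, "minute": 1, "second": 2},
    {"period": 2, "minute": 45, "second": 0},
]

print(out_of_sequence(sample_events))
print(events_per_minute(sample_events))
```

Comparing only within a period avoids false alarms at half-time, where the clock convention may reset or jump.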
D.3. Cross-Source Validation
Select a single match covered by both StatsBomb Open Data and FBref. Compare:
a) Total shots and shots on target
b) Total passes
c) Possession percentage
d) Any player-level statistics
Document discrepancies and hypothesize reasons for differences.
D.4. Missing Data Analysis
For a complete season of data, analyze missingness:
a) Identify which fields have missing values and at what rates
b) Determine if missingness is random or systematic
c) Propose appropriate handling strategies for each case
d) Implement a cleaning function that handles missing data appropriately
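For parts (a) and (b), a minimal sketch: compute the missing rate for each field, then break one field's rate down by a grouping key to see whether the gaps are concentrated in particular groups. The flat field names (`type`, `location`, `xg`) and the sample records are hypothetical, and real event data will need flattening first.

```python
def missing_rates(records, fields):
    """Fraction of records in which each field is absent or None."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if r.get(field) is None) / total
        for field in fields
    }


def missing_rates_by_group(records, field, group_key):
    """Missing rate of one field, computed separately per group value."""
    groups = {}
    for record in records:
        groups.setdefault(record.get(group_key), []).append(record)
    return {
        group: missing_rates(members, [field])[field]
        for group, members in groups.items()
    }


# Hand-made sample records (not real match data).
sample_records = [
    {"type": "Pass", "location": [1.0, 2.0]},
    {"type": "Pass", "location": None},
    {"type": "Shot", "location": [100.0, 40.0], "xg": 0.1},
    {"type": "Shot", "location": [110.0, 30.0], "xg": 0.3},
]

print(missing_rates(sample_records, ["location", "xg"]))
print(missing_rates_by_group(sample_records, "xg", "type"))
```

In this toy data `xg` is missing for every pass and no shot. That pattern is systematic (the field only applies to shots) rather than random, so imputation would be the wrong fix.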
Part E: Pipeline Development ⭐⭐⭐
Build reusable data pipelines.
E.1. Competition Pipeline
Build a pipeline class that implements the methods listed in the skeleton below:
class CompetitionPipeline:
    """
    Pipeline for processing an entire competition's data.

    Methods to implement:
    - fetch_all_matches(): Get all matches for a competition
    - fetch_all_events(): Get events for all matches
    - validate_data(): Run quality checks
    - calculate_aggregates(): Compute player/team aggregates
    - export_results(): Save to specified format
    """
Test your pipeline on a StatsBomb Open Data competition.
E.2. Incremental Updates
Extend your pipeline to handle incremental updates:
a) Track which matches have already been processed
b) Only fetch new matches when run again
c) Handle matches that might have been corrected/updated
d) Log all operations
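One way to cover parts (a) to (d) is to store a fingerprint of each processed match record and reprocess a match whenever its fingerprint changes. The sketch below is a minimal version under that assumption. The function names, the JSON state file format, and the sample match records are all hypothetical.

```python
import hashlib
import json
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")


def fingerprint(match_record):
    """Stable hash of a match record, used to detect corrected data."""
    payload = json.dumps(match_record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def load_state(path):
    """Load the match_id -> fingerprint mapping, or start empty."""
    path = Path(path)
    return json.loads(path.read_text()) if path.exists() else {}


def save_state(path, state):
    """Persist the mapping so the next run can skip unchanged matches."""
    Path(path).write_text(json.dumps(state, indent=2, sort_keys=True))


def matches_to_process(matches, state):
    """Return matches that are new or changed since the last run."""
    pending = []
    for match in matches:
        key = str(match["match_id"])
        if state.get(key) != fingerprint(match):
            logger.info("match %s is %s", key, "updated" if key in state else "new")
            pending.append(match)
    return pending


def mark_processed(matches, state):
    """Record the current fingerprint of each processed match."""
    for match in matches:
        state[str(match["match_id"])] = fingerprint(match)


# In-memory demonstration with made-up match records.
state = {}
matches = [{"match_id": 1, "home_score": 0}, {"match_id": 2, "home_score": 1}]
first_run = matches_to_process(matches, state)
mark_processed(first_run, state)
matches[1]["home_score"] = 2  # simulate a correction from the provider
second_run = matches_to_process(matches, state)
print(len(first_run), [m["match_id"] for m in second_run])
```

If the source exposes a last-updated timestamp per match, comparing that timestamp is a cheaper alternative to hashing the whole record.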
E.3. Multi-Source Pipeline
Build a pipeline that combines data from multiple sources:
a) Event data from StatsBomb
b) Aggregated statistics from FBref
c) Player metadata from a reference source
Your pipeline should:
- Fetch from all sources
- Handle entity matching
- Merge into unified tables
- Validate consistency
Part F: Research and Extension ⭐⭐⭐⭐
Open-ended problems requiring deeper investigation.
F.1. Data Provider Landscape Report
Write a 1000-word report on the current soccer data provider landscape:
- Major players and their positions
- Recent trends and developments
- Emerging technologies (computer vision, AI-derived data)
- Predictions for the next 5 years
Include citations from recent industry news and analysis.
F.2. Open Data Quality Assessment
Conduct a systematic quality assessment of StatsBomb Open Data:
a) Define quality dimensions (completeness, accuracy, consistency, timeliness)
b) Develop metrics for each dimension
c) Assess multiple competitions
d) Compare quality across competitions
e) Write a summary report with recommendations for users
F.3. Data Collection Simulation
Design a prototype for a simplified event data collection system:
a) Define an event taxonomy (types, qualifiers)
b) Design a data entry interface (can be paper mockup or simple UI)
c) Test the interface by logging events from a match video
d) Analyze inter-coder reliability if possible
e) Reflect on challenges faced by professional coders
F.4. Tracking Data Reconstruction
Research approaches for reconstructing tracking data from event data:
a) Review academic literature on the topic
b) Identify key challenges
c) Propose a simple methodology
d) Implement a basic prototype (if feasible)
e) Evaluate limitations and potential improvements
Solutions
Selected solutions are available in:
- code/exercise-solutions.py (programming problems)
- appendices/g-answers-to-selected-exercises.md (odd-numbered problems)
Full solutions are available to instructors upon request.
Reflection Questions
After completing these exercises, consider:
- What surprised you about the availability (or lack thereof) of soccer data?
- What data quality issues do you think are most problematic for analysis?
- How would your analysis workflow change with unlimited access to tracking data?
- What new data types do you think will emerge in the next decade?
Write brief notes to guide your continued learning.