Case Study: Comparing Data Provider Quality
"Not everything that can be counted counts, and not everything that counts can be counted." — William Bruce Cameron
Executive Summary
This case study investigates data quality differences across soccer data sources. By comparing the same matches from multiple sources, we'll identify systematic differences, understand their causes, and develop strategies for handling multi-source analysis.
Skills Applied:
- Cross-source data validation
- Understanding provider methodologies
- Statistical comparison techniques
- Quality assessment frameworks
Background
The Problem
A club's analytics team has noticed discrepancies between statistics from different providers. For the same match, one provider records 550 passes while another records 520. Shot counts differ by 2-3 per team. The team needs to understand these differences before it can produce reliable analyses.
Why This Matters
- Scouting: Different sources may make players look better or worse
- Benchmarking: League averages depend on counting methodology
- Research: Published findings may not replicate across providers
- Communication: Stakeholders may have data from different sources
Business Question: "How do we understand and account for systematic differences between data providers?"
The Data Quality Framework
Dimensions of Data Quality
We'll assess data quality across five dimensions:
| Dimension | Definition | Metrics |
|---|---|---|
| Completeness | Are all expected records present? | Missing event rate, coverage |
| Accuracy | Are values correct? | Comparison to ground truth |
| Consistency | Are similar things measured the same way? | Cross-match variance |
| Timeliness | Is data available when needed? | Delay from match end |
| Precision | How granular is the data? | Coordinate precision, qualifiers |
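To make this framework concrete, it helps to encode it directly. Below is a minimal sketch in which the dimension names come from the table above, while the weights are illustrative assumptions for demonstration, not an industry standard:

# quality_framework.py - minimal sketch of the five-dimension framework.
# Dimension names follow the table above; the weights are illustrative
# assumptions, not a standard.
QUALITY_DIMENSIONS = {
    'completeness': 0.25,
    'accuracy': 0.30,
    'consistency': 0.20,
    'timeliness': 0.10,
    'precision': 0.15,
}

def weighted_quality_score(scores: dict) -> float:
    """Combine per-dimension scores (0-5 scale) into one weighted score."""
    missing = set(QUALITY_DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"Missing dimension scores: {missing}")
    return sum(QUALITY_DIMENSIONS[dim] * scores[dim] for dim in QUALITY_DIMENSIONS)

# Example using the scores from the scorecard later in this case study
print(weighted_quality_score({
    'completeness': 4.5, 'accuracy': 4.0, 'consistency': 3.5,
    'timeliness': 4.0, 'precision': 4.5,
}))  # 4.1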
Phase 1: Data Collection
Identifying Comparable Data
For this analysis, we'll compare StatsBomb Open Data against a second source covering the same matches. To keep the example reproducible, the second source is simulated in the code below, with realistic systematic variations of the kind you would see from a provider such as FBref.
"""
case_study_02_02.py - Data Provider Quality Comparison
Comparing data quality across soccer data sources.
"""
import pandas as pd
import numpy as np
from typing import Dict, List, Tuple
from statsbombpy import sb
import matplotlib.pyplot as plt
class DataQualityAnalyzer:
"""Analyze and compare data quality across sources."""
    def __init__(self):
        self.comparison_results = []
        # Seed one generator here (not per call) so simulated differences
        # vary from match to match while staying reproducible
        self.rng = np.random.default_rng(42)
def load_statsbomb_match(self, match_id: int) -> Dict:
"""
Load and summarize a match from StatsBomb.
Parameters
----------
match_id : int
StatsBomb match ID
Returns
-------
Dict
Match summary statistics
"""
events = sb.events(match_id=match_id)
summary = {
'source': 'StatsBomb',
'match_id': match_id,
'total_events': len(events),
}
        # Count by team. Note: treating the first event's team as the home
        # team is a heuristic; in production, resolve home/away via sb.matches()
        for team in events['team'].unique():
            team_events = events[events['team'] == team]
            prefix = 'home_' if team == events['team'].iloc[0] else 'away_'
summary[f'{prefix}team'] = team
summary[f'{prefix}passes'] = len(team_events[team_events['type'] == 'Pass'])
summary[f'{prefix}shots'] = len(team_events[team_events['type'] == 'Shot'])
summary[f'{prefix}goals'] = len(team_events[
(team_events['type'] == 'Shot') &
(team_events['shot_outcome'] == 'Goal')
])
summary[f'{prefix}fouls'] = len(team_events[team_events['type'] == 'Foul Committed'])
return summary
def create_simulated_comparison(self, sb_summary: Dict) -> Dict:
"""
Create simulated alternative source data for comparison.
In practice, you would load from actual second source.
Here we simulate realistic variations.
Parameters
----------
sb_summary : Dict
StatsBomb summary
Returns
-------
Dict
Simulated alternative source summary
"""
alt_summary = {
'source': 'Alternative',
'match_id': sb_summary['match_id'],
}
# Simulate systematic differences
# Alternative source typically counts:
# - Slightly fewer passes (different threshold for "pass")
# - Similar shots (clear definition)
# - Same goals (unambiguous)
# - Slightly different foul counts
for prefix in ['home_', 'away_']:
if f'{prefix}team' in sb_summary:
alt_summary[f'{prefix}team'] = sb_summary[f'{prefix}team']
                # Passes: 5-15 fewer (methodological difference)
                alt_summary[f'{prefix}passes'] = max(
                    0, sb_summary[f'{prefix}passes'] - int(self.rng.integers(5, 16))
                )
                # Shots: usually the same, or ±1
                alt_summary[f'{prefix}shots'] = (
                    sb_summary[f'{prefix}shots'] + int(self.rng.choice([-1, 0, 0, 0, 1]))
                )
                # Goals: always the same (unambiguous)
                alt_summary[f'{prefix}goals'] = sb_summary[f'{prefix}goals']
                # Fouls: can vary by ±2
                alt_summary[f'{prefix}fouls'] = max(
                    0, sb_summary[f'{prefix}fouls'] + int(self.rng.integers(-2, 3))
                )
return alt_summary
class ComparisonReport:
"""Generate comparison reports between data sources."""
def __init__(self, matches: List[Tuple[Dict, Dict]]):
"""
Initialize with matched data from two sources.
Parameters
----------
matches : List[Tuple[Dict, Dict]]
List of (source1_summary, source2_summary) tuples
"""
self.matches = matches
self.comparison_df = self._build_comparison_df()
def _build_comparison_df(self) -> pd.DataFrame:
"""Build comparison dataframe."""
records = []
for sb_data, alt_data in self.matches:
for prefix in ['home_', 'away_']:
if f'{prefix}passes' in sb_data:
record = {
'match_id': sb_data['match_id'],
'team': sb_data.get(f'{prefix}team', 'Unknown'),
'is_home': prefix == 'home_',
'passes_sb': sb_data[f'{prefix}passes'],
'passes_alt': alt_data[f'{prefix}passes'],
'shots_sb': sb_data[f'{prefix}shots'],
'shots_alt': alt_data[f'{prefix}shots'],
'goals_sb': sb_data[f'{prefix}goals'],
'goals_alt': alt_data[f'{prefix}goals'],
'fouls_sb': sb_data[f'{prefix}fouls'],
'fouls_alt': alt_data[f'{prefix}fouls'],
}
records.append(record)
return pd.DataFrame(records)
def calculate_differences(self) -> pd.DataFrame:
"""Calculate differences between sources."""
df = self.comparison_df.copy()
        for metric in ['passes', 'shots', 'goals', 'fouls']:
            df[f'{metric}_diff'] = df[f'{metric}_sb'] - df[f'{metric}_alt']
            # Guard against division by zero (e.g. a team with zero goals)
            df[f'{metric}_pct_diff'] = (
                (df[f'{metric}_sb'] - df[f'{metric}_alt']) /
                df[f'{metric}_sb'].replace(0, np.nan) * 100
            ).round(2)
return df
def summary_statistics(self) -> Dict:
"""Generate summary of differences."""
diff_df = self.calculate_differences()
summary = {}
for metric in ['passes', 'shots', 'goals', 'fouls']:
diff_col = f'{metric}_diff'
summary[metric] = {
'mean_difference': diff_df[diff_col].mean(),
'std_difference': diff_df[diff_col].std(),
'max_difference': diff_df[diff_col].abs().max(),
'exact_matches': (diff_df[diff_col] == 0).mean() * 100,
}
return summary
def print_report(self):
"""Print formatted comparison report."""
print("\n" + "=" * 60)
print("DATA PROVIDER COMPARISON REPORT")
print("=" * 60)
summary = self.summary_statistics()
print(f"\nMatches compared: {len(self.matches)}")
print(f"Team-match records: {len(self.comparison_df)}")
print("\n" + "-" * 60)
print("SUMMARY OF DIFFERENCES (StatsBomb - Alternative)")
print("-" * 60)
for metric, stats in summary.items():
print(f"\n{metric.upper()}:")
print(f" Mean difference: {stats['mean_difference']:+.2f}")
print(f" Std deviation: {stats['std_difference']:.2f}")
print(f" Max absolute diff: {stats['max_difference']:.0f}")
print(f" Exact match rate: {stats['exact_matches']:.1f}%")
def plot_differences(self):
"""Visualize differences between sources."""
diff_df = self.calculate_differences()
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
metrics = ['passes', 'shots', 'goals', 'fouls']
for ax, metric in zip(axes.flatten(), metrics):
diff_col = f'{metric}_diff'
ax.hist(diff_df[diff_col], bins=20, edgecolor='black', alpha=0.7)
ax.axvline(x=0, color='red', linestyle='--', label='No difference')
ax.axvline(x=diff_df[diff_col].mean(), color='blue',
linestyle='-', label=f'Mean: {diff_df[diff_col].mean():.1f}')
ax.set_xlabel(f'{metric.title()} Difference (SB - Alt)')
ax.set_ylabel('Frequency')
ax.set_title(f'{metric.title()} Count Differences')
ax.legend()
plt.suptitle('Distribution of Differences Between Data Sources', fontsize=14)
plt.tight_layout()
plt.show()
# Run the comparison analysis
if __name__ == "__main__":
analyzer = DataQualityAnalyzer()
# Load multiple World Cup 2018 matches
print("Loading matches from StatsBomb...")
matches = sb.matches(competition_id=43, season_id=3)
# Analyze first 10 matches
match_comparisons = []
for match_id in matches['match_id'].head(10):
try:
sb_summary = analyzer.load_statsbomb_match(match_id)
alt_summary = analyzer.create_simulated_comparison(sb_summary)
match_comparisons.append((sb_summary, alt_summary))
print(f" Loaded match {match_id}")
except Exception as e:
print(f" Error loading match {match_id}: {e}")
# Generate report
report = ComparisonReport(match_comparisons)
report.print_report()
print("\nGenerating visualizations...")
report.plot_differences()
Phase 2: Analysis of Differences
Pass Counting Methodology
The most significant differences typically appear in pass counts:
| Aspect | Provider A | Provider B |
|---|---|---|
| Minimum distance | 2 meters | 5 meters |
| GK distributions | Included | Sometimes excluded |
| Throw-ins | Counted as passes | Separate category |
| Set pieces | Included | Sometimes separate |
Implications:
- Provider A will report 5-10% more passes
- Pass completion rates may differ
- Possession calculations affected
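If you must translate pass counts between providers, a crude scaling factor estimated from overlapping matches is a reasonable first step. The sketch below is illustrative only: the 0.93 factor is a placeholder consistent with the 5-10% gap described above, and a real factor should be fit to paired match data.

# Rough cross-provider pass adjustment - a sketch, not a calibrated model.
# PASS_SCALE_A_TO_B is a placeholder consistent with the gap described
# above; estimate the real factor from matches covered by both providers.
PASS_SCALE_A_TO_B = 0.93  # assumption: Provider B counts ~7% fewer passes

def estimate_provider_b_passes(provider_a_passes: int) -> int:
    """Estimate Provider B's pass count from Provider A's count."""
    return round(provider_a_passes * PASS_SCALE_A_TO_B)

print(estimate_provider_b_passes(550))  # ~512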
Shot Definitions
Shot counting is more consistent but not identical:
| Aspect | Potential Difference |
|---|---|
| Blocked shots | Some count as shots, others don't |
| Off-target distance | How far off to count as "shot" |
| Headed clearances | Occasionally miscoded |
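You can see the blocked-shot discrepancy directly in StatsBomb data: filtering out shots with a 'Blocked' outcome approximates a provider that excludes them. A minimal sketch, assuming `events` is an events DataFrame loaded as in Phase 1:

# Emulate a provider that excludes blocked shots. Assumes `events` is a
# StatsBomb events DataFrame (e.g. from sb.events(match_id=...)); 'Blocked'
# is one of the shot_outcome values in StatsBomb data.
shots = events[events['type'] == 'Shot']
shots_excl_blocked = shots[shots['shot_outcome'] != 'Blocked']
print(f"All shots: {len(shots)}, excluding blocked: {len(shots_excl_blocked)}")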
Foul Subjectivity
Foul counts can vary based on:
- Whether an advantage played still counts as a foul
- How simulation is classified
- Incidents where the whistle wasn't blown but should have been
Phase 3: Implications and Recommendations
For Analysts
class DataSourceRecommendations:
"""Recommendations for working with multiple data sources."""
@staticmethod
def print_recommendations():
recommendations = """
RECOMMENDATIONS FOR MULTI-SOURCE ANALYSIS
1. DOCUMENT YOUR SOURCE
- Always record which provider supplied each statistic
- Note the date/version of data extraction
- Keep raw data separate from processed
2. DON'T MIX SOURCES FOR COMPARISONS
- Compare Player A to Player B using SAME source
- League averages must come from same source as player data
- Time series should use consistent source
3. UNDERSTAND SYSTEMATIC BIASES
- Provider A reports ~7% more passes than Provider B
- Account for this when interpreting
- Consider using percentiles rather than absolutes
4. USE ROBUST METRICS
- Shots and goals are more consistent than passes
- Rates and percentages less affected than totals
- Expected metrics (xG) may vary by model, not just data
5. VALIDATE UNUSUAL VALUES
- If a statistic looks unusual, check alternative source
- Consider video review for critical decisions
- Build in sanity checks
6. PREFER ORIGINAL SOURCES
- Use provider's official API/platform when possible
- Aggregator sites may introduce additional variation
- Historical data may have been revised
"""
print(recommendations)
# Print recommendations
DataSourceRecommendations.print_recommendations()
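Recommendation 3 suggests percentiles over absolutes. The sketch below shows why this helps: ranking each statistic within its own source removes the systematic level difference between providers. The player data here is hypothetical, chosen only to illustrate the technique.

import pandas as pd

def to_within_source_percentiles(df: pd.DataFrame, stat: str) -> pd.Series:
    """Rank a statistic as percentiles within each source separately."""
    return df.groupby('source')[stat].rank(pct=True) * 100

# Hypothetical example: the same three players measured by two providers
players = pd.DataFrame({
    'player': ['A', 'B', 'C', 'A', 'B', 'C'],
    'source': ['SB', 'SB', 'SB', 'Alt', 'Alt', 'Alt'],
    'passes': [550, 480, 510, 512, 449, 475],
})
players['passes_pctile'] = to_within_source_percentiles(players, 'passes')
print(players)  # player A tops both sources despite different raw counts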
Quality Assessment Scorecard
def assess_source_quality(source_name: str, metrics: Dict) -> pd.DataFrame:
"""
Generate quality scorecard for a data source.
Parameters
----------
source_name : str
Name of data source
metrics : Dict
Quality metrics
Returns
-------
pd.DataFrame
Quality scorecard
"""
    # Example scorecard with static, illustrative scores; a full
    # implementation would derive these from the `metrics` argument
scorecard = pd.DataFrame({
'Dimension': ['Completeness', 'Accuracy', 'Consistency', 'Timeliness', 'Precision'],
'Score': [4.5, 4.0, 3.5, 4.0, 4.5], # Out of 5
'Notes': [
'All major leagues covered',
'High agreement with video review',
'Some variation in edge cases',
'Data available within 24 hours',
'Includes detailed qualifiers and coordinates'
]
})
scorecard['Source'] = source_name
return scorecard
# Example usage
print("\nSample Quality Scorecard:")
print(assess_source_quality("StatsBomb", {}))
Results Summary
Key Findings
- Pass counts vary by 5-10% between providers due to different minimum thresholds
- Shot and goal counts are consistent (<2% variation) due to clearer definitions
- Foul counts can vary by 10-15% due to subjective classification
- Coordinate precision differs: some providers use 1-meter grids, others sub-meter
- Qualifier richness varies significantly (StatsBomb provides most detail)
Practical Guidelines
| Statistic | Cross-Source Reliability | Recommendation |
|---|---|---|
| Goals | Very High | Safe to compare |
| Shots | High | Generally comparable |
| xG | Medium | Use same provider's model |
| Passes | Medium-Low | Don't mix sources |
| Fouls | Low | Use with caution |
| Duels | Low | Provider-specific |
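This table can be encoded as a lookup so analysis code warns before mixing sources on a low-reliability statistic. A minimal sketch (the dictionary mirrors the table above, with lowercased keys):

# Reliability lookup derived from the practical guidelines table
CROSS_SOURCE_RELIABILITY = {
    'goals': 'very high',
    'shots': 'high',
    'xg': 'medium',
    'passes': 'medium-low',
    'fouls': 'low',
    'duels': 'low',
}

def warn_if_mixing_sources(stat: str) -> None:
    """Print a warning when a statistic is risky to compare across sources."""
    reliability = CROSS_SOURCE_RELIABILITY.get(stat.lower(), 'unknown')
    if reliability not in ('very high', 'high'):
        print(f"Warning: '{stat}' has {reliability} cross-source reliability; "
              "prefer a single provider for this comparison.")

warn_if_mixing_sources('passes')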
Discussion Questions
- If you're scouting a player and only have data from Provider A, but your club uses Provider B, how would you adjust your analysis?
- Should the industry standardize data definitions? What are the pros and cons?
- How would you communicate uncertainty due to data source to a non-technical stakeholder?
- What additional validation could you perform if you had access to video?
Your Turn: Mini-Project
Option A: Real Cross-Source Comparison. Compare statistics for a specific player across FBref, Understat, and WhoScored. Document all differences and hypothesize causes.
Option B: Quality Scoring System. Design a comprehensive quality scoring system for soccer data, with weighted dimensions and objective criteria.
Option C: Adjustment Model. Build a statistical model that predicts Provider B statistics from Provider A statistics, enabling cross-source comparisons (a starter sketch follows below).
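As a starting point for Option C, a simple per-metric linear fit on matches covered by both providers goes a long way. The sketch below assumes the `report` object built in Phase 1 is available; with real paired data you would fit one model per metric and validate on held-out matches.

import numpy as np

# Fit passes_alt ~ passes_sb on the paired comparison data from Phase 1
df = report.comparison_df
slope, intercept = np.polyfit(df['passes_sb'], df['passes_alt'], deg=1)
print(f"passes_alt ~= {slope:.3f} * passes_sb + {intercept:.1f}")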
Case Study Complete