Case Study: Comparing Data Provider Quality

"Not everything that can be counted counts, and not everything that counts can be counted." — William Bruce Cameron

Executive Summary

This case study investigates data quality differences across soccer data sources. By comparing the same matches from multiple sources, we'll identify systematic differences, understand their causes, and develop strategies for handling multi-source analysis.

Skills Applied:

  • Cross-source data validation
  • Understanding provider methodologies
  • Statistical comparison techniques
  • Quality assessment frameworks


Background

The Problem

A club's analytics team has noticed discrepancies between statistics from different providers. For the same match, one provider records 550 passes while another records 520, and shot counts differ by two or three per team. The team needs to understand these differences before it can produce reliable analyses.

Why This Matters

  • Scouting: Different sources may make players look better or worse
  • Benchmarking: League averages depend on counting methodology
  • Research: Published findings may not replicate across providers
  • Communication: Stakeholders may have data from different sources

Business Question: "How do we understand and account for systematic differences between data providers?"


The Data Quality Framework

Dimensions of Data Quality

We'll assess data quality across five dimensions:

Dimension     Definition                                 Metrics
------------  -----------------------------------------  --------------------------------
Completeness  Are all expected records present?          Missing event rate, coverage
Accuracy      Are values correct?                        Comparison to ground truth
Consistency   Are similar things measured the same way?  Cross-match variance
Timeliness    Is data available when needed?             Delay from match end
Precision     How granular is the data?                  Coordinate precision, qualifiers
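
As a concrete example, the Completeness dimension can be checked directly: the sketch below computes a missing-value rate per column for a StatsBomb events dataframe. It is a minimal illustration; the 5% flag threshold and the choice of columns are assumptions, not provider standards.

import pandas as pd
from statsbombpy import sb

def completeness_report(events: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Missing-value rate per column (one metric for the Completeness dimension)."""
    rates = events[columns].isna().mean()
    return pd.DataFrame({
        'column': rates.index,
        'missing_rate': rates.values.round(3),
        'flag': rates.values > 0.05,  # assumed threshold; tune per use case
    })

# World Cup 2018 open data, as used later in this case study
matches = sb.matches(competition_id=43, season_id=3)
events = sb.events(match_id=matches['match_id'].iloc[0])
print(completeness_report(events, ['team', 'type', 'player', 'location']))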

Phase 1: Data Collection

Identifying Comparable Data

For this analysis, we'll compare StatsBomb Open Data with FBref-style statistics for the same matches. Because matched data from a second provider isn't bundled with this case study, the code simulates the alternative source with realistic, documented variations (see create_simulated_comparison below).

"""
case_study_02_02.py - Data Provider Quality Comparison

Comparing data quality across soccer data sources.
"""

import pandas as pd
import numpy as np
from typing import Dict, List, Tuple
from statsbombpy import sb
import matplotlib.pyplot as plt
import seaborn as sns


class DataQualityAnalyzer:
    """Analyze and compare data quality across sources."""

    def __init__(self):
        self.comparison_results = []
        # Seed a generator once so simulated offsets vary between matches
        # while the run as a whole stays reproducible.
        self.rng = np.random.default_rng(42)

    def load_statsbomb_match(self, match_id: int) -> Dict:
        """
        Load and summarize a match from StatsBomb.

        Parameters
        ----------
        match_id : int
            StatsBomb match ID

        Returns
        -------
        Dict
            Match summary statistics
        """
        events = sb.events(match_id=match_id)

        summary = {
            'source': 'StatsBomb',
            'match_id': match_id,
            'total_events': len(events),
        }

        # Count by team. Heuristic: the first event row's team is treated
        # as home; for reliable labels, join against sb.matches() instead.
        for team in events['team'].unique():
            team_events = events[events['team'] == team]
            prefix = 'home_' if team == events['team'].iloc[0] else 'away_'

            summary[f'{prefix}team'] = team
            summary[f'{prefix}passes'] = len(team_events[team_events['type'] == 'Pass'])
            summary[f'{prefix}shots'] = len(team_events[team_events['type'] == 'Shot'])
            summary[f'{prefix}goals'] = len(team_events[
                (team_events['type'] == 'Shot') &
                (team_events['shot_outcome'] == 'Goal')
            ])
            summary[f'{prefix}fouls'] = len(team_events[team_events['type'] == 'Foul Committed'])

        return summary

    def create_simulated_comparison(self, sb_summary: Dict) -> Dict:
        """
        Create simulated alternative source data for comparison.

        In practice, you would load data from an actual second source;
        here we simulate realistic variations.

        Parameters
        ----------
        sb_summary : Dict
            StatsBomb summary

        Returns
        -------
        Dict
            Simulated alternative source summary
        """
        # Uses self.rng (seeded once in __init__) so that each match
        # receives different simulated offsets.

        alt_summary = {
            'source': 'Alternative',
            'match_id': sb_summary['match_id'],
        }

        # Simulate systematic differences
        # Alternative source typically counts:
        # - Slightly fewer passes (different threshold for "pass")
        # - Similar shots (clear definition)
        # - Same goals (unambiguous)
        # - Slightly different foul counts

        for prefix in ['home_', 'away_']:
            if f'{prefix}team' in sb_summary:
                alt_summary[f'{prefix}team'] = sb_summary[f'{prefix}team']

                # Passes: 5-15 fewer (methodological difference)
                alt_summary[f'{prefix}passes'] = max(0, sb_summary[f'{prefix}passes'] - int(self.rng.integers(5, 16)))

                # Shots: usually same or ±1
                alt_summary[f'{prefix}shots'] = sb_summary[f'{prefix}shots'] + int(self.rng.choice([-1, 0, 0, 0, 1]))

                # Goals: always same
                alt_summary[f'{prefix}goals'] = sb_summary[f'{prefix}goals']

                # Fouls: can vary by ±2
                alt_summary[f'{prefix}fouls'] = max(0, sb_summary[f'{prefix}fouls'] + int(self.rng.integers(-2, 3)))

        return alt_summary


class ComparisonReport:
    """Generate comparison reports between data sources."""

    def __init__(self, matches: List[Tuple[Dict, Dict]]):
        """
        Initialize with matched data from two sources.

        Parameters
        ----------
        matches : List[Tuple[Dict, Dict]]
            List of (source1_summary, source2_summary) tuples
        """
        self.matches = matches
        self.comparison_df = self._build_comparison_df()

    def _build_comparison_df(self) -> pd.DataFrame:
        """Build comparison dataframe."""
        records = []

        for sb_data, alt_data in self.matches:
            for prefix in ['home_', 'away_']:
                if f'{prefix}passes' in sb_data:
                    record = {
                        'match_id': sb_data['match_id'],
                        'team': sb_data.get(f'{prefix}team', 'Unknown'),
                        'is_home': prefix == 'home_',
                        'passes_sb': sb_data[f'{prefix}passes'],
                        'passes_alt': alt_data[f'{prefix}passes'],
                        'shots_sb': sb_data[f'{prefix}shots'],
                        'shots_alt': alt_data[f'{prefix}shots'],
                        'goals_sb': sb_data[f'{prefix}goals'],
                        'goals_alt': alt_data[f'{prefix}goals'],
                        'fouls_sb': sb_data[f'{prefix}fouls'],
                        'fouls_alt': alt_data[f'{prefix}fouls'],
                    }
                    records.append(record)

        return pd.DataFrame(records)

    def calculate_differences(self) -> pd.DataFrame:
        """Calculate differences between sources."""
        df = self.comparison_df.copy()

        for metric in ['passes', 'shots', 'goals', 'fouls']:
            df[f'{metric}_diff'] = df[f'{metric}_sb'] - df[f'{metric}_alt']
            # Avoid division by zero (e.g. a team recorded with 0 goals)
            denominator = df[f'{metric}_sb'].replace(0, np.nan)
            df[f'{metric}_pct_diff'] = (df[f'{metric}_diff'] / denominator * 100).round(2)

        return df

    def summary_statistics(self) -> Dict:
        """Generate summary of differences."""
        diff_df = self.calculate_differences()

        summary = {}
        for metric in ['passes', 'shots', 'goals', 'fouls']:
            diff_col = f'{metric}_diff'
            summary[metric] = {
                'mean_difference': diff_df[diff_col].mean(),
                'std_difference': diff_df[diff_col].std(),
                'max_difference': diff_df[diff_col].abs().max(),
                'exact_matches': (diff_df[diff_col] == 0).mean() * 100,
            }

        return summary

    def print_report(self):
        """Print formatted comparison report."""
        print("\n" + "=" * 60)
        print("DATA PROVIDER COMPARISON REPORT")
        print("=" * 60)

        summary = self.summary_statistics()

        print(f"\nMatches compared: {len(self.matches)}")
        print(f"Team-match records: {len(self.comparison_df)}")

        print("\n" + "-" * 60)
        print("SUMMARY OF DIFFERENCES (StatsBomb - Alternative)")
        print("-" * 60)

        for metric, stats in summary.items():
            print(f"\n{metric.upper()}:")
            print(f"  Mean difference: {stats['mean_difference']:+.2f}")
            print(f"  Std deviation: {stats['std_difference']:.2f}")
            print(f"  Max absolute diff: {stats['max_difference']:.0f}")
            print(f"  Exact match rate: {stats['exact_matches']:.1f}%")

    def plot_differences(self):
        """Visualize differences between sources."""
        diff_df = self.calculate_differences()

        fig, axes = plt.subplots(2, 2, figsize=(12, 10))

        metrics = ['passes', 'shots', 'goals', 'fouls']

        for ax, metric in zip(axes.flatten(), metrics):
            diff_col = f'{metric}_diff'

            ax.hist(diff_df[diff_col], bins=20, edgecolor='black', alpha=0.7)
            ax.axvline(x=0, color='red', linestyle='--', label='No difference')
            ax.axvline(x=diff_df[diff_col].mean(), color='blue',
                       linestyle='-', label=f'Mean: {diff_df[diff_col].mean():.1f}')

            ax.set_xlabel(f'{metric.title()} Difference (SB - Alt)')
            ax.set_ylabel('Frequency')
            ax.set_title(f'{metric.title()} Count Differences')
            ax.legend()

        plt.suptitle('Distribution of Differences Between Data Sources', fontsize=14)
        plt.tight_layout()
        plt.show()


# Run the comparison analysis
if __name__ == "__main__":
    analyzer = DataQualityAnalyzer()

    # Load multiple World Cup 2018 matches
    print("Loading matches from StatsBomb...")
    matches = sb.matches(competition_id=43, season_id=3)

    # Analyze first 10 matches
    match_comparisons = []

    for match_id in matches['match_id'].head(10):
        try:
            sb_summary = analyzer.load_statsbomb_match(match_id)
            alt_summary = analyzer.create_simulated_comparison(sb_summary)
            match_comparisons.append((sb_summary, alt_summary))
            print(f"  Loaded match {match_id}")
        except Exception as e:
            print(f"  Error loading match {match_id}: {e}")

    # Generate report
    report = ComparisonReport(match_comparisons)
    report.print_report()

    print("\nGenerating visualizations...")
    report.plot_differences()

Phase 2: Analysis of Differences

Pass Counting Methodology

The most significant differences typically appear in pass counts:

Aspect            Provider A         Provider B
----------------  -----------------  -------------------
Minimum distance  2 meters           5 meters
GK distributions  Included           Sometimes excluded
Throw-ins         Counted as passes  Separate category
Set pieces        Included           Sometimes separate

Implications:

  • Provider A will report 5-10% more passes
  • Pass completion rates may differ
  • Possession calculations are affected
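
One pragmatic response is to rescale pass totals before comparing providers. The sketch below estimates a scaling factor from matches covered by both sources; the paired dataframe and its column names are hypothetical, and the resulting ~0.93 ratio is only an illustration consistent with the table above.

import pandas as pd

def estimate_pass_ratio(paired: pd.DataFrame) -> float:
    """Median ratio of Provider B to Provider A pass counts.

    `paired` is assumed to hold one row per team-match covered by both
    providers, with columns 'passes_a' and 'passes_b'.
    """
    return (paired['passes_b'] / paired['passes_a']).median()

def adjust_passes(passes_a: pd.Series, ratio: float) -> pd.Series:
    """Rescale Provider A pass counts onto Provider B's scale."""
    return (passes_a * ratio).round().astype(int)

# Hypothetical paired sample, for illustration only
paired = pd.DataFrame({'passes_a': [550, 480, 610], 'passes_b': [515, 452, 566]})
ratio = estimate_pass_ratio(paired)
print(adjust_passes(paired['passes_a'], ratio))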

Shot Definitions

Shot counting is more consistent but not identical:

Aspect               Potential Difference
-------------------  --------------------------------------------------
Blocked shots        Some providers count them as shots; others don't
Off-target distance  How far off target still counts as a "shot"
Headed clearances    Occasionally miscoded as shots
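
To see how the blocked-shot decision alone moves totals, the snippet below counts one match's shots with and without blocked attempts, using the same StatsBomb open data and column names ('type', 'shot_outcome') as the Phase 1 code. It is a sketch for a single match, not a definitive audit.

from statsbombpy import sb

# Load one World Cup 2018 match, as in Phase 1
matches = sb.matches(competition_id=43, season_id=3)
events = sb.events(match_id=matches['match_id'].iloc[0])

shots = events[events['type'] == 'Shot']
print(f"Shots incl. blocked: {len(shots)}")
print(f"Shots excl. blocked: {len(shots[shots['shot_outcome'] != 'Blocked'])}")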

Foul Subjectivity

Foul counts can vary based on:

  • Whether an infringement counts as a foul when advantage is played
  • How simulation (diving) is classified
  • Incidents where the whistle wasn't blown but arguably should have been


Phase 3: Implications and Recommendations

For Analysts

class DataSourceRecommendations:
    """Recommendations for working with multiple data sources."""

    @staticmethod
    def print_recommendations():
        recommendations = """
        RECOMMENDATIONS FOR MULTI-SOURCE ANALYSIS

        1. DOCUMENT YOUR SOURCE
           - Always record which provider supplied each statistic
           - Note the date/version of data extraction
           - Keep raw data separate from processed

        2. DON'T MIX SOURCES FOR COMPARISONS
           - Compare Player A to Player B using SAME source
           - League averages must come from same source as player data
           - Time series should use consistent source

        3. UNDERSTAND SYSTEMATIC BIASES
           - Provider A reports ~7% more passes than Provider B
           - Account for this when interpreting
           - Consider using percentiles rather than absolutes

        4. USE ROBUST METRICS
           - Shots and goals are more consistent than passes
           - Rates and percentages less affected than totals
           - Expected metrics (xG) may vary by model, not just data

        5. VALIDATE UNUSUAL VALUES
           - If a statistic looks unusual, check alternative source
           - Consider video review for critical decisions
           - Build in sanity checks

        6. PREFER ORIGINAL SOURCES
           - Use provider's official API/platform when possible
           - Aggregator sites may introduce additional variation
           - Historical data may have been revised
        """
        print(recommendations)


# Print recommendations
DataSourceRecommendations.print_recommendations()
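
Recommendation 3 can be made concrete: converting each source's raw totals to within-source percentile ranks lets a systematic ~7% offset largely cancel out. The sketch below uses illustrative column names and made-up numbers.

import pandas as pd

def to_percentiles(df: pd.DataFrame, metric_cols: list) -> pd.DataFrame:
    """Replace raw totals with within-source percentile ranks (0-100)."""
    out = df.copy()
    for col in metric_cols:
        out[col] = df[col].rank(pct=True) * 100
    return out

# Illustrative: the same players measured by two providers with a systematic offset
players = pd.DataFrame({
    'player': ['A', 'B', 'C', 'D'],
    'passes_provider_a': [620, 540, 480, 700],
    'passes_provider_b': [577, 502, 446, 651],  # ~7% fewer across the board
})
print(to_percentiles(players, ['passes_provider_a', 'passes_provider_b']))
# Percentile ranks agree between providers even though raw totals do not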

Quality Assessment Scorecard

def assess_source_quality(source_name: str, metrics: Dict) -> pd.DataFrame:
    """
    Generate quality scorecard for a data source.

    Parameters
    ----------
    source_name : str
        Name of data source
    metrics : Dict
        Quality metrics

    Returns
    -------
    pd.DataFrame
        Quality scorecard
    """
    # Example scorecard with illustrative scores; in practice, derive these
    # from the `metrics` argument rather than hard-coding them.
    scorecard = pd.DataFrame({
        'Dimension': ['Completeness', 'Accuracy', 'Consistency', 'Timeliness', 'Precision'],
        'Score': [4.5, 4.0, 3.5, 4.0, 4.5],  # Out of 5
        'Notes': [
            'All major leagues covered',
            'High agreement with video review',
            'Some variation in edge cases',
            'Data available within 24 hours',
            'Includes detailed qualifiers and coordinates'
        ]
    })

    scorecard['Source'] = source_name

    return scorecard


# Example usage
print("\nSample Quality Scorecard:")
print(assess_source_quality("StatsBomb", {}))

Results Summary

Key Findings

  1. Pass counts vary by 5-10% between providers due to different minimum thresholds

  2. Shot and goal counts are consistent (<2% variation) due to clearer definitions

  3. Foul counts can vary by 10-15% due to subjective classification

  4. Coordinate precision differs: some providers use 1-meter grids, others sub-meter

  5. Qualifier richness varies significantly (StatsBomb provides most detail)

Practical Guidelines

Statistic  Cross-Source Reliability  Recommendation
---------  ------------------------  -------------------------
Goals      Very High                 Safe to compare
Shots      High                      Generally comparable
xG         Medium                    Use same provider's model
Passes     Medium-Low                Don't mix sources
Fouls      Low                       Use with caution
Duels      Low                       Provider-specific
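
These guidelines can be encoded as a simple lookup so a pipeline warns automatically when a low-reliability statistic is compared across sources. The mapping below simply restates the table above; it is a sketch, not an industry standard.

# Reliability lookup derived from the table above
CROSS_SOURCE_RELIABILITY = {
    'goals': 'very high',
    'shots': 'high',
    'xg': 'medium',
    'passes': 'medium-low',
    'fouls': 'low',
    'duels': 'low',
}

def warn_if_mixing_sources(statistic: str, source_a: str, source_b: str) -> None:
    """Print a warning when a statistic is compared across two different sources."""
    if source_a == source_b:
        return
    reliability = CROSS_SOURCE_RELIABILITY.get(statistic.lower(), 'unknown')
    if reliability not in ('very high', 'high'):
        print(f"WARNING: '{statistic}' has {reliability} cross-source reliability; "
              f"comparing {source_a} vs {source_b} may be misleading.")

warn_if_mixing_sources('passes', 'StatsBomb', 'FBref')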

Discussion Questions

  1. If you're scouting a player and only have data from Provider A, but your club uses Provider B, how would you adjust your analysis?

  2. Should the industry standardize data definitions? What are the pros and cons?

  3. How would you communicate uncertainty due to data source to a non-technical stakeholder?

  4. What additional validation could you perform if you had access to video?


Your Turn: Mini-Project

Option A: Real Cross-Source Comparison Compare statistics for a specific player across FBref, Understat, and WhoScored. Document all differences and hypothesize causes.

Option B: Quality Scoring System Design a comprehensive quality scoring system for soccer data, with weighted dimensions and objective criteria.

Option C: Adjustment Model Build a statistical model that predicts Provider B statistics from Provider A statistics, enabling cross-source comparisons.


Case Study Complete