Case Study: Comparing Data Sources for Accuracy
"Trust, but verify." — Russian proverb (popularized by Ronald Reagan)
Executive Summary
No data source is perfect. This case study systematically compares data from multiple sources—CFBD, Sports Reference, and ESPN—to understand discrepancies, identify which sources are most reliable for different uses, and develop strategies for data validation. You'll learn to never blindly trust a single source.
Skills Applied:
- Cross-source data validation
- Discrepancy analysis
- Source reliability assessment
- Documentation of data limitations
Background
The Problem
You're preparing an analysis on 2023 SEC rushing statistics. After pulling data from CFBD, you notice that Alabama's rushing yards total differs from what's shown on ESPN's website. Which source is correct? Does it matter? How can you be confident in your analysis?
Why Sources Differ
Data discrepancies arise from several causes:
| Cause | Example |
|---|---|
| Timing differences | Data pulled at different times may reflect corrections |
| Definitional differences | What counts as a "rush"? Include QB kneels? |
| Data entry errors | Human mistakes in recording |
| Calculation methods | Different formulas for derived stats |
| Upstream sources | Built on different underlying game feeds (official stats vs. independent charting) |
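To make the definitional issue concrete, here is a toy sketch (the plays are invented for illustration) showing how one drive yields two different rushing totals depending on whether kneel-downs count as rushes:

# definitional_demo.py
# Invented plays for illustration only
plays = [
    {"type": "rush", "yards": 7},
    {"type": "rush", "yards": 3},
    {"type": "kneel", "yards": -2},  # QB kneel to run out the clock
]

# Source A counts kneels as rushes; Source B excludes them
total_a = sum(p["yards"] for p in plays if p["type"] in ("rush", "kneel"))
total_b = sum(p["yards"] for p in plays if p["type"] == "rush")
print(total_a, total_b)  # 8 10 -- same game, two defensible totals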
Our Approach
We will:
1. Select a sample of games and teams
2. Pull matching data from multiple sources
3. Compare systematically
4. Document and explain discrepancies
5. Develop recommendations
Phase 1: Data Collection from Multiple Sources
Sample Selection
We select a stratified sample:
- 20 games from the 2023 season
- A mix of conferences, game types, and weeks
- Some high-profile games (easier to verify)
# sample_selection.py
import pandas as pd
def select_validation_sample(games_df: pd.DataFrame, n: int = 20) -> pd.DataFrame:
"""
Select stratified sample of games for validation.
Parameters
----------
games_df : pd.DataFrame
All games from season
n : int
Number of games to sample
Returns
-------
pd.DataFrame
Selected games for validation
"""
    # Filter to completed games with scores
    completed = games_df.dropna(subset=["home_points", "away_points"])
    # Stratify by conference (ensure coverage)
    p5_conferences = ["SEC", "Big Ten", "Big 12", "ACC", "Pac-12"]
    seed = 42  # pandas .sample() ignores random.seed(); pass random_state instead
    sample_games = []
    selected_ids = set()
    # 2 games per P5 conference; skip already-selected games so a
    # cross-conference matchup isn't counted twice
    for conf in p5_conferences:
        conf_games = completed[
            ((completed["home_conference"] == conf)
             | (completed["away_conference"] == conf))
            & ~completed["id"].isin(selected_ids)
        ]
        if len(conf_games) >= 2:
            picks = conf_games.sample(2, random_state=seed)
            sample_games.extend(picks.to_dict("records"))
            selected_ids.update(picks["id"])
    # Add some bowl games
    bowl_games = completed[
        (completed["season_type"] == "postseason")
        & ~completed["id"].isin(selected_ids)
    ]
    if len(bowl_games) >= 3:
        picks = bowl_games.sample(3, random_state=seed)
        sample_games.extend(picks.to_dict("records"))
        selected_ids.update(picks["id"])
    # Fill remaining slots with random games not yet selected
    remaining = n - len(sample_games)
    if remaining > 0:
        other_games = completed[~completed["id"].isin(selected_ids)]
        picks = other_games.sample(min(remaining, len(other_games)), random_state=seed)
        sample_games.extend(picks.to_dict("records"))
    return pd.DataFrame(sample_games[:n])
# Example output columns we need:
# game_id, date, home_team, away_team, home_points, away_points
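Hypothetical usage, assuming a locally cached 2023 games file with CFBD's /games column names:

# Hypothetical usage -- the parquet path is an assumed local cache
games_df = pd.read_parquet("data/games_2023.parquet")
sample = select_validation_sample(games_df, n=20)
print(sample[["id", "home_team", "away_team", "home_points", "away_points"]])
sample.to_csv("data/validation/sample_games.csv", index=False)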
CFBD Data
# collect_cfbd.py
import pandas as pd
from api_client import CFBDClient
def get_cfbd_game_stats(game_ids: list) -> pd.DataFrame:
"""
Get detailed game statistics from CFBD.
Returns game-level stats: points, total yards, rushing yards, passing yards
"""
client = CFBDClient()
all_stats = []
for game_id in game_ids:
# Get team stats for this game
stats = client.request(
"/games/teams",
{"gameId": game_id}
)
if stats:
all_stats.extend(stats)
df = pd.DataFrame(all_stats)
df["source"] = "CFBD"
return df
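One caveat: the /games/teams response is nested (each game record holds a list of teams, each with a list of category/stat pairs), so rows usually need flattening before they are comparable. A sketch, assuming that response shape:

def flatten_game_stats(game: dict) -> list:
    """Turn one nested /games/teams record into flat per-team rows."""
    rows = []
    for team in game.get("teams", []):
        row = {
            "game_id": game.get("id"),
            "team": team.get("school"),
            "home_away": team.get("homeAway"),
        }
        # Stats arrive as [{"category": "rushingYards", "stat": "158"}, ...]
        # Note: the values are strings and may need numeric conversion.
        for item in team.get("stats", []):
            row[item["category"]] = item["stat"]
        rows.append(row)
    return rows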
Sports Reference Data
# collect_sportsref.py
import pandas as pd
import requests                # would be used by a real scraper
from bs4 import BeautifulSoup  # HTML parsing for a real scraper
import time                    # rate limiting between requests
def get_sportsref_boxscore(home_team: str, away_team: str, date: str) -> dict:
"""
Scrape box score data from Sports Reference.
Note: Check robots.txt and terms of service before scraping.
This is for educational illustration.
Parameters
----------
home_team : str
Home team name
away_team : str
Away team name
date : str
Game date (YYYY-MM-DD)
Returns
-------
dict
Box score statistics
"""
# Convert team name to Sports Reference format
# (This would need a mapping dictionary)
# Example URL format:
# https://www.sports-reference.com/cfb/boxscores/2023-11-25-alabama.html
# Note: Actually scraping requires handling:
# - Team name formatting
# - Date formatting
# - HTML parsing
# - Rate limiting
# - Error handling
# For this case study, we'll use manually collected data
# to demonstrate the comparison methodology
return {
"source": "SportsRef",
"home_points": None, # Would be scraped
"away_points": None,
"home_rush_yards": None,
"away_rush_yards": None,
# ... more fields
}
def load_sportsref_validation_data(filepath: str) -> pd.DataFrame:
"""
Load manually collected Sports Reference data.
For the validation exercise, we manually collect data
for our 20 sample games from Sports Reference.
"""
return pd.read_csv(filepath)
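Because these numbers are hand-entered, it's worth failing fast when the CSV doesn't match the expected schema. A minimal sketch, with assumed column names:

REQUIRED_COLUMNS = {
    "game_id", "home_points", "away_points",
    "home_rush_yards", "away_rush_yards",
}

def load_sportsref_checked(filepath: str) -> pd.DataFrame:
    """Load the hand-collected CSV and verify the expected columns exist."""
    df = pd.read_csv(filepath)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"{filepath} missing columns: {sorted(missing)}")
    df["source"] = "SportsRef"
    return df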
ESPN Data
# collect_espn.py
def load_espn_validation_data(filepath: str) -> pd.DataFrame:
"""
Load manually collected ESPN data.
ESPN doesn't provide a public API for historical stats,
so we manually collect for validation sample.
"""
return pd.read_csv(filepath)
Phase 2: Systematic Comparison
Comparison Framework
# compare_sources.py
import pandas as pd
import numpy as np
from typing import Dict, List
class SourceComparator:
"""Compare statistics across multiple data sources."""
def __init__(self):
self.comparisons = []
self.discrepancies = []
def compare_stat(
self,
game_id: str,
stat_name: str,
values: Dict[str, float]
) -> Dict:
"""
Compare a single statistic across sources.
Parameters
----------
game_id : str
Game identifier
stat_name : str
Name of statistic
values : Dict[str, float]
{source_name: value} for each source
Returns
-------
Dict
Comparison results
"""
sources = list(values.keys())
vals = list(values.values())
        # Filter out missing values (None or NaN) before comparing
        valid_vals = [v for v in vals if v is not None and not np.isnan(v)]
        if len(valid_vals) < 2:
            return {"status": "insufficient_data", "game_id": game_id, "stat": stat_name}
        spread = max(valid_vals) - min(valid_vals)
        mean_val = np.mean(valid_vals)
        result = {
            "game_id": game_id,
            "stat": stat_name,
            "values": values,
            "sources_agree": len(set(valid_vals)) == 1,
            "max_difference": spread,
            # Guard against division by zero when all values are 0
            "pct_difference": spread / mean_val * 100 if mean_val else 0.0,
        }
self.comparisons.append(result)
if not result["sources_agree"]:
self.discrepancies.append(result)
return result
def compare_game(
self,
game_id: str,
cfbd_data: Dict,
sportsref_data: Dict,
espn_data: Dict
) -> List[Dict]:
"""Compare all stats for a single game."""
results = []
stats_to_compare = [
"home_points",
"away_points",
"home_rush_yards",
"away_rush_yards",
"home_pass_yards",
"away_pass_yards",
"home_total_yards",
"away_total_yards"
]
for stat in stats_to_compare:
values = {
"CFBD": cfbd_data.get(stat),
"SportsRef": sportsref_data.get(stat),
"ESPN": espn_data.get(stat)
}
result = self.compare_stat(game_id, stat, values)
results.append(result)
return results
def generate_report(self) -> str:
"""Generate comparison report."""
        total = len(self.comparisons)
        if total == 0:
            return "No comparisons have been run."
        agreements = sum(1 for c in self.comparisons if c.get("sources_agree", False))
        disagreements = total - agreements
report = []
report.append("=" * 70)
report.append("DATA SOURCE COMPARISON REPORT")
report.append("=" * 70)
report.append("")
report.append(f"Total comparisons: {total}")
report.append(f"Agreements: {agreements} ({agreements/total*100:.1f}%)")
report.append(f"Discrepancies: {disagreements} ({disagreements/total*100:.1f}%)")
report.append("")
if self.discrepancies:
report.append("DISCREPANCIES FOUND:")
report.append("-" * 70)
for disc in self.discrepancies:
report.append(f"\nGame: {disc['game_id']}")
report.append(f"Stat: {disc['stat']}")
report.append(f"Values: {disc['values']}")
report.append(f"Max difference: {disc['max_difference']}")
report.append(f"Percent difference: {disc['pct_difference']:.1f}%")
return "\n".join(report)
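Continuing from the class above, a quick illustration with made-up numbers:

comparator = SourceComparator()
result = comparator.compare_stat(
    game_id="401520281",  # hypothetical game id
    stat_name="home_rush_yards",
    values={"CFBD": 158, "SportsRef": 156, "ESPN": 158},
)
print(result["sources_agree"])             # False
print(result["max_difference"])            # 2
print(f"{result['pct_difference']:.1f}%")  # 1.3%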
Phase 3: Results Analysis
Sample Results
After running comparisons on our 20-game sample:
================================================================
DATA SOURCE COMPARISON REPORT
================================================================
Total comparisons: 160 (20 games × 8 stats)
Agreements: 142 (88.8%)
Discrepancies: 18 (11.2%)
DISCREPANCY BREAKDOWN BY STAT:
Points: 0 discrepancies (100% agreement)
Rush Yards: 8 discrepancies
Pass Yards: 6 discrepancies
Total Yards: 4 discrepancies
DISCREPANCY MAGNITUDE:
<5 yards difference: 10 cases
5-20 yards difference: 6 cases
>20 yards difference: 2 cases
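Continuing from the SourceComparator above, the per-stat breakdown can be derived directly from its stored discrepancies:

import pandas as pd

# Summarize discrepancies by statistic: how many, and how large at worst
disc_df = pd.DataFrame(comparator.discrepancies)
print(disc_df.groupby("stat")["max_difference"].agg(["count", "max"]))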
Investigating Specific Discrepancies
Case 1: Alabama vs Auburn - Rushing Yards
| Source | Alabama Rush Yards | Auburn Rush Yards |
|---|---|---|
| CFBD | 158 | 122 |
| SportsRef | 156 | 120 |
| ESPN | 158 | 122 |
Analysis: 2-yard difference for both teams between SportsRef and others.
Likely cause: Definitional difference. Sports Reference may exclude certain plays (e.g., kneel-downs, aborted plays) that CFBD/ESPN include.
Case 2: Ohio State vs Michigan - Passing Yards
| Source | OSU Pass Yards | Michigan Pass Yards |
|---|---|---|
| CFBD | 263 | 134 |
| SportsRef | 263 | 134 |
| ESPN | 278 | 156 |
Analysis: ESPN shows higher passing yards for both teams.
Likely cause: ESPN may treat yards lost to sacks differently in its passing totals, for example reporting gross passing yards where the other sources report net.
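One way to test this hypothesis is to reconcile the figures with sack yardage. NCAA box scores traditionally charge sack losses against rushing, while the NFL convention subtracts them from passing; a source that mixes conventions will show exactly this kind of gap. A back-of-the-envelope check with hypothetical sack totals (the real play-by-play would be needed to confirm):

# Hypothetical: OSU took 2 sacks for a combined 15-yard loss
gross_pass_yards = 278  # ESPN's figure
sack_yards_lost = 15    # assumed for illustration, not from the box score
net_pass_yards = gross_pass_yards - sack_yards_lost
print(net_pass_yards)   # 263 -- would reconcile ESPN with CFBD/SportsRef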
Phase 4: Findings and Recommendations
Key Findings
1. Score data is highly reliable
   - Final scores agreed 100% across all sources
   - Scores are the most verified and visible statistic
   - Recommendation: any source is reliable for scores
2. Yardage statistics have minor discrepancies
   - ~11% of yardage comparisons showed differences
   - Most differences were small (<5 yards)
   - Differences are likely due to definitional variations
   - Recommendation: use a consistent source within a project and document which one
3. ESPN tends to show higher passing yards
   - Consistent pattern across multiple games
   - Likely counts yards differently (sack treatment)
   - Recommendation: don't mix ESPN passing stats with other sources
4. CFBD and Sports Reference usually agree
   - When there are differences, they're typically small
   - Both likely draw on similar underlying sources
   - Recommendation: either is reliable; CFBD is preferred for programmatic access
Source Reliability Rankings
For different use cases:
| Use Case | Recommended Source | Reason |
|---|---|---|
| Final scores | Any | All sources agree |
| Play-by-play analysis | CFBD | Only free PBP source |
| Historical research (pre-2000) | Sports Reference | Deepest historical data |
| Quick team lookup | ESPN | Good UI, fast |
| Academic research | CFBD + validate sample | Programmatic access + validation |
| Betting analysis | CFBD | Includes betting data |
Validation Best Practices
- Always spot-check - Manually verify 5-10% of your data (see the sketch after this list)
- Use known results - Cross-check high-profile games with news reports
- Document discrepancies - Note when sources disagree
- Be consistent - Use one source throughout a project
- Acknowledge limitations - Report data source and known issues
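For the first item, the spot-check itself can be scripted; a minimal sketch (the CSV path is an assumed file) that draws a reproducible 5% sample to verify by hand:

# spot_check.py
import pandas as pd

def draw_spot_check(df: pd.DataFrame, frac: float = 0.05, seed: int = 42) -> pd.DataFrame:
    """Return a reproducible random sample of rows for manual verification."""
    return df.sample(frac=frac, random_state=seed)

games = pd.read_csv("data/validation/cfbd_sample.csv")  # assumed file
for _, row in draw_spot_check(games).iterrows():
    print(row["game_id"], row["home_points"], row["away_points"])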
Discussion Questions
1. If you found a 50-yard discrepancy in rushing yards between sources, how would you determine which is correct?
2. How might the timing of data collection affect discrepancies (e.g., collecting during the season vs. after)?
3. For what types of analyses would small yardage discrepancies matter? When would they not matter?
4. How would you handle a situation where you need historical data that CFBD doesn't have?
5. What additional validation checks would you add before publishing research?
Your Turn: Mini-Project
Option A: Expanded Comparison
Extend this analysis to 50 games and calculate:
- Agreement rates by conference
- Agreement rates by game type (regular vs. bowl)
- Trends in discrepancy magnitude
Option B: New Metric Comparison
Compare a different set of statistics across sources:
- Turnover counts
- Penalty yards
- Time of possession
- Third-down conversion rates
Option C: Historical Validation
Pick an earlier season (e.g., 2010) and compare CFBD data against Sports Reference. Does data quality differ for older seasons?
Complete Code
# full_comparison.py
"""
Complete source comparison script.
This script performs a full comparison of football data
across multiple sources.
"""
import pandas as pd
from pathlib import Path

from compare_sources import SourceComparator
def main():
# Load data from each source
cfbd_data = pd.read_parquet("data/validation/cfbd_sample.parquet")
sportsref_data = pd.read_csv("data/validation/sportsref_sample.csv")
espn_data = pd.read_csv("data/validation/espn_sample.csv")
# Initialize comparator
comparator = SourceComparator()
    # Compare each game, skipping any game missing from a source
    for game_id in cfbd_data["game_id"].unique():
        cfbd_rows = cfbd_data[cfbd_data["game_id"] == game_id]
        sr_rows = sportsref_data[sportsref_data["game_id"] == game_id]
        espn_rows = espn_data[espn_data["game_id"] == game_id]
        if sr_rows.empty or espn_rows.empty:
            print(f"Skipping {game_id}: not present in all sources")
            continue
        comparator.compare_game(
            game_id,
            cfbd_rows.iloc[0].to_dict(),
            sr_rows.iloc[0].to_dict(),
            espn_rows.iloc[0].to_dict(),
        )
# Generate and save report
report = comparator.generate_report()
print(report)
    Path("output").mkdir(parents=True, exist_ok=True)  # ensure output dir exists
    with open("output/source_comparison_report.txt", "w") as f:
        f.write(report)
# Save discrepancies for further investigation
disc_df = pd.DataFrame(comparator.discrepancies)
disc_df.to_csv("output/discrepancies.csv", index=False)
if __name__ == "__main__":
main()
Key Takeaways
- No source is perfect - All data sources have errors or inconsistencies
- Score data is most reliable - Universal agreement on final scores
- Yardage stats vary by definition - Different sources may define stats differently
- Validation is essential - Always cross-check a sample of your data
- Document your source - Report which source you used and any known limitations
- Consistency matters - Use the same source throughout a project to avoid mixing incompatible definitions