Case Study: Comparing Data Sources for Accuracy
"Trust, but verify." — Russian proverb (popularized by Ronald Reagan)
Executive Summary
No data source is perfect. This case study systematically compares data from multiple sources—CFBD, Sports Reference, and ESPN—to understand discrepancies, identify which sources are most reliable for different uses, and develop strategies for data validation. You'll learn to never blindly trust a single source.
Skills Applied:
- Cross-source data validation
- Discrepancy analysis
- Source reliability assessment
- Documentation of data limitations
Background
The Problem
You're preparing an analysis on 2023 SEC rushing statistics. After pulling data from CFBD, you notice that Alabama's rushing yards total differs from what's shown on ESPN's website. Which source is correct? Does it matter? How can you be confident in your analysis?
Why Sources Differ
Data discrepancies arise from several causes:
| Cause | Example |
|---|---|
| Timing differences | Data pulled at different times may reflect corrections |
| Definitional differences | What counts as a "rush"? Include QB kneels? |
| Data entry errors | Human mistakes in recording |
| Calculation methods | Different formulas for derived stats |
| Upstream sources | Built on different underlying game feeds (official stats vs. independent charting) |
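To make the definitional issue concrete, here is a toy sketch (the plays are invented for illustration) showing how one drive yields two different rushing totals depending on whether kneel-downs count as rushes:

# definitional_demo.py
# Invented plays for illustration only
plays = [
    {"type": "rush", "yards": 7},
    {"type": "rush", "yards": 3},
    {"type": "kneel", "yards": -2},  # QB kneel to run out the clock
]

# Source A counts kneels as rushes; Source B excludes them
total_a = sum(p["yards"] for p in plays if p["type"] in ("rush", "kneel"))
total_b = sum(p["yards"] for p in plays if p["type"] == "rush")
print(total_a, total_b)  # 8 10 -- same game, two defensible totals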
Our Approach
We will:
1. Select a sample of games and teams
2. Pull matching data from multiple sources
3. Compare systematically
4. Document and explain discrepancies
5. Develop recommendations
Phase 1: Data Collection from Multiple Sources
Sample Selection
We select a stratified sample:
- 20 games from the 2023 season
- A mix of conferences, game types, and weeks
- Some high-profile games (easier to verify)
# sample_selection.py
import pandas as pd
def select_validation_sample(games_df: pd.DataFrame, n: int = 20) -> pd.DataFrame:
"""
Select stratified sample of games for validation.
Parameters
----------
games_df : pd.DataFrame
All games from season
n : int
Number of games to sample
Returns
-------
pd.DataFrame
Selected games for validation
"""
    # Filter to completed games with scores
    completed = games_df.dropna(subset=["home_points", "away_points"])
    # Stratify by conference (ensure coverage)
    p5_conferences = ["SEC", "Big Ten", "Big 12", "ACC", "Pac-12"]
    seed = 42  # pandas .sample() ignores random.seed(); pass random_state instead
    sample_games = []
    selected_ids = set()
    # 2 games per P5 conference; skip already-selected games so a
    # cross-conference matchup isn't counted twice
    for conf in p5_conferences:
        conf_games = completed[
            ((completed["home_conference"] == conf)
             | (completed["away_conference"] == conf))
            & ~completed["id"].isin(selected_ids)
        ]
        if len(conf_games) >= 2:
            picks = conf_games.sample(2, random_state=seed)
            sample_games.extend(picks.to_dict("records"))
            selected_ids.update(picks["id"])
    # Add some bowl games
    bowl_games = completed[
        (completed["season_type"] == "postseason")
        & ~completed["id"].isin(selected_ids)
    ]
    if len(bowl_games) >= 3:
        picks = bowl_games.sample(3, random_state=seed)
        sample_games.extend(picks.to_dict("records"))
        selected_ids.update(picks["id"])
    # Fill remaining slots with random games not yet selected
    remaining = n - len(sample_games)
    if remaining > 0:
        other_games = completed[~completed["id"].isin(selected_ids)]
        picks = other_games.sample(min(remaining, len(other_games)), random_state=seed)
        sample_games.extend(picks.to_dict("records"))
    return pd.DataFrame(sample_games[:n])
# Example output columns we need:
# game_id, date, home_team, away_team, home_points, away_points
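Hypothetical usage, assuming a locally cached 2023 games file with CFBD's /games column names:

# Hypothetical usage -- the parquet path is an assumed local cache
games_df = pd.read_parquet("data/games_2023.parquet")
sample = select_validation_sample(games_df, n=20)
print(sample[["id", "home_team", "away_team", "home_points", "away_points"]])
sample.to_csv("data/validation/sample_games.csv", index=False)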
CFBD Data
# collect_cfbd.py
import pandas as pd
from api_client import CFBDClient
def get_cfbd_game_stats(game_ids: list) -> pd.DataFrame:
"""
Get detailed game statistics from CFBD.
Returns game-level stats: points, total yards, rushing yards, passing yards
"""
client = CFBDClient()
all_stats = []
for game_id in game_ids:
# Get team stats for this game
stats = client.request(
"/games/teams",
{"gameId": game_id}
)
if stats:
all_stats.extend(stats)
df = pd.DataFrame(all_stats)
df["source"] = "CFBD"
return df
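One caveat: the /games/teams response is nested (each game record holds a list of teams, each with a list of category/stat pairs), so rows usually need flattening before they are comparable. A sketch, assuming that response shape:

def flatten_game_stats(game: dict) -> list:
    """Turn one nested /games/teams record into flat per-team rows."""
    rows = []
    for team in game.get("teams", []):
        row = {
            "game_id": game.get("id"),
            "team": team.get("school"),
            "home_away": team.get("homeAway"),
        }
        # Stats arrive as [{"category": "rushingYards", "stat": "158"}, ...]
        # Note: the values are strings and may need numeric conversion.
        for item in team.get("stats", []):
            row[item["category"]] = item["stat"]
        rows.append(row)
    return rows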
Sports Reference Data
# collect_sportsref.py
import pandas as pd
import requests                # would be used by a real scraper
from bs4 import BeautifulSoup  # HTML parsing for a real scraper
import time                    # rate limiting between requests
def get_sportsref_boxscore(home_team: str, away_team: str, date: str) -> dict:
"""
Scrape box score data from Sports Reference.
Note: Check robots.txt and terms of service before scraping.
This is for educational illustration.
Parameters
----------
home_team : str
Home team name
away_team : str
Away team name
date : str
Game date (YYYY-MM-DD)
Returns
-------
dict
Box score statistics
"""
# Convert team name to Sports Reference format
# (This would need a mapping dictionary)
# Example URL format:
# https://www.sports-reference.com/cfb/boxscores/2023-11-25-alabama.html
# Note: Actually scraping requires handling:
# - Team name formatting
# - Date formatting
# - HTML parsing
# - Rate limiting
# - Error handling
# For this case study, we'll use manually collected data
# to demonstrate the comparison methodology
return {
"source": "SportsRef",
"home_points": None, # Would be scraped
"away_points": None,
"home_rush_yards": None,
"away_rush_yards": None,
# ... more fields
}
def load_sportsref_validation_data(filepath: str) -> pd.DataFrame:
"""
Load manually collected Sports Reference data.
For the validation exercise, we manually collect data
for our 20 sample games from Sports Reference.
"""
return pd.read_csv(filepath)
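Because these numbers are hand-entered, it's worth failing fast when the CSV doesn't match the expected schema. A minimal sketch, with assumed column names:

REQUIRED_COLUMNS = {
    "game_id", "home_points", "away_points",
    "home_rush_yards", "away_rush_yards",
}

def load_sportsref_checked(filepath: str) -> pd.DataFrame:
    """Load the hand-collected CSV and verify the expected columns exist."""
    df = pd.read_csv(filepath)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"{filepath} missing columns: {sorted(missing)}")
    df["source"] = "SportsRef"
    return df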
ESPN Data
# collect_espn.py
def load_espn_validation_data(filepath: str) -> pd.DataFrame:
"""
Load manually collected ESPN data.
ESPN doesn't provide a public API for historical stats,
so we manually collect for validation sample.
"""
return pd.read_csv(filepath)
Phase 2: Systematic Comparison
Comparison Framework
# compare_sources.py
import pandas as pd
import numpy as np
from typing import Dict, List
class SourceComparator:
"""Compare statistics across multiple data sources."""
def __init__(self):
self.comparisons = []
self.discrepancies = []
def compare_stat(
self,
game_id: str,
stat_name: str,
values: Dict[str, float]
) -> Dict:
"""
Compare a single statistic across sources.
Parameters
----------
game_id : str
Game identifier
stat_name : str
Name of statistic
values : Dict[str, float]
{source_name: value} for each source
Returns
-------
Dict
Comparison results
"""
sources = list(values.keys())
vals = list(values.values())
        # Filter out missing values (None or NaN) before comparing
        valid_vals = [v for v in vals if v is not None and not np.isnan(v)]
        if len(valid_vals) < 2:
            return {"status": "insufficient_data", "game_id": game_id, "stat": stat_name}
        spread = max(valid_vals) - min(valid_vals)
        mean_val = np.mean(valid_vals)
        result = {
            "game_id": game_id,
            "stat": stat_name,
            "values": values,
            "sources_agree": len(set(valid_vals)) == 1,
            "max_difference": spread,
            # Guard against division by zero when all values are 0
            "pct_difference": spread / mean_val * 100 if mean_val else 0.0,
        }
self.comparisons.append(result)
if not result["sources_agree"]:
self.discrepancies.append(result)
return result
def compare_game(
self,
game_id: str,
cfbd_data: Dict,
sportsref_data: Dict,
espn_data: Dict
) -> List[Dict]:
"""Compare all stats for a single game."""
results = []
stats_to_compare = [
"home_points",
"away_points",
"home_rush_yards",
"away_rush_yards",
"home_pass_yards",
"away_pass_yards",
"home_total_yards",
"away_total_yards"
]
for stat in stats_to_compare:
values = {
"CFBD": cfbd_data.get(stat),
"SportsRef": sportsref_data.get(stat),
"ESPN": espn_data.get(stat)
}
result = self.compare_stat(game_id, stat, values)
results.append(result)
return results
def generate_report(self) -> str:
"""Generate comparison report."""
        total = len(self.comparisons)
        if total == 0:
            return "No comparisons have been run."
        agreements = sum(1 for c in self.comparisons if c.get("sources_agree", False))
        disagreements = total - agreements
report = []
report.append("=" * 70)
report.append("DATA SOURCE COMPARISON REPORT")
report.append("=" * 70)
report.append("")
report.append(f"Total comparisons: {total}")
report.append(f"Agreements: {agreements} ({agreements/total*100:.1f}%)")
report.append(f"Discrepancies: {disagreements} ({disagreements/total*100:.1f}%)")
report.append("")
if self.discrepancies:
report.append("DISCREPANCIES FOUND:")
report.append("-" * 70)
for disc in self.discrepancies:
report.append(f"\nGame: {disc['game_id']}")
report.append(f"Stat: {disc['stat']}")
report.append(f"Values: {disc['values']}")
report.append(f"Max difference: {disc['max_difference']}")
report.append(f"Percent difference: {disc['pct_difference']:.1f}%")
return "\n".join(report)
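Continuing from the class above, a quick illustration with made-up numbers:

comparator = SourceComparator()
result = comparator.compare_stat(
    game_id="401520281",  # hypothetical game id
    stat_name="home_rush_yards",
    values={"CFBD": 158, "SportsRef": 156, "ESPN": 158},
)
print(result["sources_agree"])             # False
print(result["max_difference"])            # 2
print(f"{result['pct_difference']:.1f}%")  # 1.3%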
Phase 3: Results Analysis
Sample Results
After running comparisons on our 20-game sample:
================================================================
DATA SOURCE COMPARISON REPORT
================================================================
Total comparisons: 160 (20 games × 8 stats)
Agreements: 142 (88.8%)
Discrepancies: 18 (11.2%)
DISCREPANCY BREAKDOWN BY STAT:
Points: 0 discrepancies (100% agreement)
Rush Yards: 8 discrepancies
Pass Yards: 6 discrepancies
Total Yards: 4 discrepancies
DISCREPANCY MAGNITUDE:
<5 yards difference: 10 cases
5-20 yards difference: 6 cases
>20 yards difference: 2 cases
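Continuing from the SourceComparator above, the per-stat breakdown can be derived directly from its stored discrepancies:

import pandas as pd

# Summarize discrepancies by statistic: how many, and how large at worst
disc_df = pd.DataFrame(comparator.discrepancies)
print(disc_df.groupby("stat")["max_difference"].agg(["count", "max"]))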
Investigating Specific Discrepancies
Case 1: Alabama vs Auburn - Rushing Yards
| Source | Alabama Rush Yards | Auburn Rush Yards |
|---|---|---|
| CFBD | 158 | 122 |
| SportsRef | 156 | 120 |
| ESPN | 158 | 122 |
Analysis: 2-yard difference for both teams between SportsRef and others.
Likely cause: Definitional difference. Sports Reference may exclude certain plays (e.g., kneel-downs, aborted plays) that CFBD/ESPN include.
Case 2: Ohio State vs Michigan - Passing Yards
| Source | OSU Pass Yards | Michigan Pass Yards |
|---|---|---|
| CFBD | 263 | 134 |
| SportsRef | 263 | 134 |
| ESPN | 278 | 156 |
Analysis: ESPN shows higher passing yards for both teams.
Likely cause: ESPN may treat yards lost to sacks differently in its passing totals, for example reporting gross passing yards where the other sources report net.
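One way to test this hypothesis is to reconcile the figures with sack yardage. NCAA box scores traditionally charge sack losses against rushing, while the NFL convention subtracts them from passing; a source that mixes conventions will show exactly this kind of gap. A back-of-the-envelope check with hypothetical sack totals (the real play-by-play would be needed to confirm):

# Hypothetical: OSU took 2 sacks for a combined 15-yard loss
gross_pass_yards = 278  # ESPN's figure
sack_yards_lost = 15    # assumed for illustration, not from the box score
net_pass_yards = gross_pass_yards - sack_yards_lost
print(net_pass_yards)   # 263 -- would reconcile ESPN with CFBD/SportsRef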
Phase 4: Findings and Recommendations
Key Findings
1. Score data is highly reliable
   - Final scores agreed 100% across all sources
   - Scores are the most verified and visible statistic
   - Recommendation: any source is reliable for scores
2. Yardage statistics have minor discrepancies
   - ~11% of yardage comparisons showed differences
   - Most differences were small (<5 yards)
   - Differences are likely due to definitional variations
   - Recommendation: use a consistent source within a project and document which one
3. ESPN tends to show higher passing yards
   - Consistent pattern across multiple games
   - Likely counts yards differently (sack treatment)
   - Recommendation: don't mix ESPN passing stats with other sources
4. CFBD and Sports Reference usually agree
   - When there are differences, they're typically small
   - Both likely draw on similar underlying sources
   - Recommendation: either is reliable; CFBD is preferred for programmatic access
Source Reliability Rankings
For different use cases:
| Use Case | Recommended Source | Reason |
|---|---|---|
| Final scores | Any | All sources agree |
| Play-by-play analysis | CFBD | Only free PBP source |
| Historical research (pre-2000) | Sports Reference | Deepest historical data |
| Quick team lookup | ESPN | Good UI, fast |
| Academic research | CFBD + validate sample | Programmatic access + validation |
| Betting analysis | CFBD | Includes betting data |
Validation Best Practices
- Always spot-check - Manually verify 5-10% of your data (see the sketch after this list)
- Use known results - Cross-check high-profile games with news reports
- Document discrepancies - Note when sources disagree
- Be consistent - Use one source throughout a project
- Acknowledge limitations - Report data source and known issues
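For the first item, the spot-check itself can be scripted; a minimal sketch (the CSV path is an assumed file) that draws a reproducible 5% sample to verify by hand:

# spot_check.py
import pandas as pd

def draw_spot_check(df: pd.DataFrame, frac: float = 0.05, seed: int = 42) -> pd.DataFrame:
    """Return a reproducible random sample of rows for manual verification."""
    return df.sample(frac=frac, random_state=seed)

games = pd.read_csv("data/validation/cfbd_sample.csv")  # assumed file
for _, row in draw_spot_check(games).iterrows():
    print(row["game_id"], row["home_points"], row["away_points"])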
Discussion Questions
1. If you found a 50-yard discrepancy in rushing yards between sources, how would you determine which is correct?
2. How might the timing of data collection affect discrepancies (e.g., collecting during the season vs. after)?
3. For what types of analyses would small yardage discrepancies matter? When would they not matter?
4. How would you handle a situation where you need historical data that CFBD doesn't have?
5. What additional validation checks would you add before publishing research?
Your Turn: Mini-Project
Option A: Expanded Comparison
Extend this analysis to 50 games and calculate:
- Agreement rates by conference
- Agreement rates by game type (regular vs. bowl)
- Trends in discrepancy magnitude
Option B: New Metric Comparison
Compare a different set of statistics across sources:
- Turnover counts
- Penalty yards
- Time of possession
- Third-down conversion rates
Option C: Historical Validation
Pick an earlier season (e.g., 2010) and compare CFBD data against Sports Reference. Does data quality differ for older seasons?
Complete Code
# full_comparison.py
"""
Complete source comparison script.
This script performs a full comparison of football data
across multiple sources.
"""
import pandas as pd
from pathlib import Path

from compare_sources import SourceComparator
def main():
# Load data from each source
cfbd_data = pd.read_parquet("data/validation/cfbd_sample.parquet")
sportsref_data = pd.read_csv("data/validation/sportsref_sample.csv")
espn_data = pd.read_csv("data/validation/espn_sample.csv")
# Initialize comparator
comparator = SourceComparator()
    # Compare each game, skipping any game missing from a source
    for game_id in cfbd_data["game_id"].unique():
        cfbd_rows = cfbd_data[cfbd_data["game_id"] == game_id]
        sr_rows = sportsref_data[sportsref_data["game_id"] == game_id]
        espn_rows = espn_data[espn_data["game_id"] == game_id]
        if sr_rows.empty or espn_rows.empty:
            print(f"Skipping {game_id}: not present in all sources")
            continue
        comparator.compare_game(
            game_id,
            cfbd_rows.iloc[0].to_dict(),
            sr_rows.iloc[0].to_dict(),
            espn_rows.iloc[0].to_dict(),
        )
# Generate and save report
report = comparator.generate_report()
print(report)
    Path("output").mkdir(parents=True, exist_ok=True)  # ensure output dir exists
    with open("output/source_comparison_report.txt", "w") as f:
        f.write(report)
# Save discrepancies for further investigation
disc_df = pd.DataFrame(comparator.discrepancies)
disc_df.to_csv("output/discrepancies.csv", index=False)
if __name__ == "__main__":
main()
Key Takeaways
- No source is perfect - All data sources have errors or inconsistencies
- Score data is most reliable - Universal agreement on final scores
- Yardage stats vary by definition - Different sources may define stats differently
- Validation is essential - Always cross-check a sample of your data
- Document your source - Report which source you used and any known limitations
- Consistency matters - Use the same source throughout a project to avoid mixing incompatible definitions