Case Study 2: Investigating Data Quality Issues in Historical NBA Statistics
Overview
Scenario: A sports media company is producing a documentary comparing NBA legends across different eras. They need to reconcile statistical discrepancies between sources and understand data limitations when making cross-era comparisons.
- Duration: 2-3 hours
- Difficulty: Advanced
- Prerequisites: Chapter 2 concepts, understanding of data quality principles
Background
The production team has discovered troubling inconsistencies while researching historical statistics:
- Basketball-Reference and the NBA's official records show different career totals for some Hall of Fame players
- Play-by-play data is incomplete or missing for games before the 1996-97 season
- Certain statistics (blocks, steals) were not officially recorded until the 1973-74 season
- Definition changes for assists and rebounds complicate historical comparisons
Your task is to investigate these discrepancies, document their causes, and develop guidelines for responsible use of historical NBA data.
Part 1: The Investigation
1.1 The Mystery of the Missing Points
The research team found that Wilt Chamberlain's career points total differs between sources:
| Source | Career Points |
|---|---|
| NBA.com | 31,419 |
| Basketball-Reference | 31,419 |
| Historical Archives | 31,420 |
Investigation Task: Determine why this 1-point discrepancy exists and document its origin.
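A reasonable first step is to identify which source departs from the others. The sketch below uses only the three totals from the table. The helper name `find_outlier_sources` is hypothetical, and treating the most common value as the reference is an assumption: agreement between sources does not prove they are correct, since one may simply copy another.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_outlier_sources(totals: Dict[str, int]) -> Tuple[int, List[str]]:
    """Return the most common value and the sources that deviate from it."""
    # most_common(1) gives [(value, count)] for the value seen most often
    consensus, _ = Counter(totals.values()).most_common(1)[0]
    outliers = [name for name, value in totals.items() if value != consensus]
    return consensus, outliers


# Career point totals copied from the table above
career_points = {
    "NBA.com": 31419,
    "Basketball-Reference": 31419,
    "Historical Archives": 31420,
}

consensus, outliers = find_outlier_sources(career_points)
print(f"Consensus total: {consensus}; sources to investigate: {outliers}")
```

Once the outlying source is known, the season-by-season comparison in section 1.2 can narrow the difference down to a single season.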
1.2 Data Collection Script for Investigation
"""
Historical Data Quality Investigation
Case Study 2 - Chapter 2: Data Sources and Collection
"""
import pandas as pd
import numpy as np
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass
from enum import Enum
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
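class DataEra(Enum):
    """NBA statistical eras with different data availability.

    Values give the season-ending years each era covers, matching the
    boundaries used in get_era_for_season().
    """
    # Boundary is the 1973-74 season, when steals and blocks were first
    # recorded, not the 1976 merger itself.
    PRE_ABA_MERGER = "pre_1974"
    PRE_THREE_POINT = "1974_1979"
    EARLY_MODERN = "1980_1996"
    PLAY_BY_PLAY = "1997_2013"
    TRACKING = "2014_present"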
@dataclass
class StatAvailability:
    """Tracks availability of statistics across eras."""
    points: bool = True
    rebounds: bool = True
    assists: bool = True
    steals: bool = False
    blocks: bool = False
    three_pointers: bool = False
    turnovers: bool = False
    play_by_play: bool = False
    tracking: bool = False
ERA_CONFIGURATIONS: Dict[DataEra, StatAvailability] = {
    DataEra.PRE_ABA_MERGER: StatAvailability(
        steals=False, blocks=False, three_pointers=False,
        turnovers=False, play_by_play=False, tracking=False
    ),
    # Turnovers were not consistently recorded until 1977-78, so
    # turnover coverage within this era is partial.
    DataEra.PRE_THREE_POINT: StatAvailability(
        steals=True, blocks=True, three_pointers=False,
        turnovers=True, play_by_play=False, tracking=False
    ),
    DataEra.EARLY_MODERN: StatAvailability(
        steals=True, blocks=True, three_pointers=True,
        turnovers=True, play_by_play=False, tracking=False
    ),
    DataEra.PLAY_BY_PLAY: StatAvailability(
        steals=True, blocks=True, three_pointers=True,
        turnovers=True, play_by_play=True, tracking=False
    ),
    DataEra.TRACKING: StatAvailability(
        steals=True, blocks=True, three_pointers=True,
        turnovers=True, play_by_play=True, tracking=True
    ),
}
def get_era_for_season(season_end_year: int) -> DataEra:
    """Determine the data era for a given season (by its ending year)."""
    if season_end_year < 1974:
        return DataEra.PRE_ABA_MERGER
    elif season_end_year < 1980:
        return DataEra.PRE_THREE_POINT
    elif season_end_year < 1997:
        return DataEra.EARLY_MODERN
    elif season_end_year < 2014:
        return DataEra.PLAY_BY_PLAY
    else:
        return DataEra.TRACKING
class HistoricalDataInvestigator:
    """
    Investigates data quality issues in historical NBA statistics.

    This class provides tools for comparing sources, identifying
    discrepancies, and documenting data quality issues.
    """

    def __init__(self):
        """Initialize the investigator."""
        self.discrepancies_log = []
        self.investigation_notes = []
    def compare_career_totals(
        self,
        player_name: str,
        source_a: Dict[str, float],
        source_b: Dict[str, float],
        source_a_name: str = "Source A",
        source_b_name: str = "Source B"
    ) -> Dict:
        """
        Compare career statistics between two sources.

        Args:
            player_name: Name of the player
            source_a: Statistics from first source
            source_b: Statistics from second source
            source_a_name: Name identifier for first source
            source_b_name: Name identifier for second source

        Returns:
            Dictionary with comparison results
        """
        comparison = {
            "player": player_name,
            "sources": [source_a_name, source_b_name],
            "statistics": [],
            "discrepancies_found": False
        }
        # Sort so the output order is deterministic across runs
        all_stats = sorted(set(source_a) | set(source_b))
        for stat in all_stats:
            val_a = source_a.get(stat)
            val_b = source_b.get(stat)
            stat_comparison = {
                "statistic": stat,
                source_a_name: val_a,
                source_b_name: val_b,
                "match": val_a == val_b
            }
            if val_a is not None and val_b is not None:
                stat_comparison["difference"] = abs(val_a - val_b)
                if val_a != val_b:
                    # Use magnitudes so negative-valued statistics cannot
                    # produce a zero or negative denominator
                    denominator = max(abs(val_a), abs(val_b))
                    stat_comparison["pct_difference"] = (
                        abs(val_a - val_b) / denominator * 100
                    )
                    comparison["discrepancies_found"] = True
            comparison["statistics"].append(stat_comparison)
        if comparison["discrepancies_found"]:
            self.discrepancies_log.append(comparison)
        return comparison
    def investigate_season_by_season(
        self,
        player_seasons_source_a: pd.DataFrame,
        player_seasons_source_b: pd.DataFrame,
        stat_column: str
    ) -> pd.DataFrame:
        """
        Compare statistics season-by-season to find discrepancy origins.

        Args:
            player_seasons_source_a: Season data from first source
            player_seasons_source_b: Season data from second source
            stat_column: Name of statistic column to compare

        Returns:
            DataFrame showing season-by-season comparison
        """
        comparison_rows = []
        # Align by season; the outer join keeps seasons found in only one source
        merged = player_seasons_source_a.merge(
            player_seasons_source_b,
            on='Season',
            suffixes=('_A', '_B'),
            how='outer'
        )
        for _, row in merged.iterrows():
            val_a = row.get(f'{stat_column}_A')
            val_b = row.get(f'{stat_column}_B')
            # Only compare when both sources report a value
            both_present = pd.notna(val_a) and pd.notna(val_b)
            comparison_rows.append({
                'Season': row['Season'],
                'Source_A': val_a,
                'Source_B': val_b,
                'Difference': val_a - val_b if both_present else None,
                'Match': val_a == val_b if both_present else None
            })
        return pd.DataFrame(comparison_rows)
    def document_data_limitation(
        self,
        era: DataEra,
        limitation_type: str,
        description: str,
        affected_statistics: List[str],
        recommended_handling: str
    ) -> Dict:
        """
        Document a known data limitation for an era.

        Args:
            era: The NBA data era
            limitation_type: Category of limitation
            description: Detailed description
            affected_statistics: List of affected stat names
            recommended_handling: How to handle this limitation

        Returns:
            Dictionary documenting the limitation
        """
        limitation = {
            "era": era.value,
            "type": limitation_type,
            "description": description,
            "affected_statistics": affected_statistics,
            "recommended_handling": recommended_handling
        }
        self.investigation_notes.append(limitation)
        return limitation
    def analyze_missing_data_patterns(
        self,
        df: pd.DataFrame,
        stat_columns: List[str]
    ) -> Dict:
        """
        Analyze patterns in missing data.

        Args:
            df: DataFrame to analyze
            stat_columns: Columns to check for missing values

        Returns:
            Dictionary with missing data analysis
        """
        total = len(df)
        analysis = {
            "total_records": total,
            "statistics": {}
        }
        present_columns = [col for col in stat_columns if col in df.columns]
        for col in present_columns:
            missing = int(df[col].isna().sum())
            analysis["statistics"][col] = {
                "missing_count": missing,
                # Avoid dividing by zero on an empty DataFrame
                "missing_pct": round(missing / total * 100, 2) if total else 0.0,
                "first_non_null": df[df[col].notna()].index.min() if missing < total else None
            }
        # Analyze by era if season information available
        if 'Season' in df.columns:
            analysis["by_era"] = {}
            for _, row in df.iterrows():
                # '1973-74' -> 1974 (season-ending year)
                season_year = int(str(row['Season'])[:4]) + 1
                era = get_era_for_season(season_year)
                era_stats = analysis["by_era"].setdefault(
                    era.value, {"seasons": 0, "missing": {}}
                )
                era_stats["seasons"] += 1
                # Tally missing values per statistic within the era
                for col in present_columns:
                    if pd.isna(row[col]):
                        era_stats["missing"][col] = era_stats["missing"].get(col, 0) + 1
        return analysis
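The era lookup depends on turning a season label into its ending year. The following is a minimal sketch of a stricter parser than slicing the first four characters, assuming labels follow the `YYYY-YY` pattern (for example `1973-74`). The function name `season_end_year` and the validation rules are illustrative assumptions, not part of the module above.

```python
import re

# Assumed label format: four-digit start year, hyphen, two-digit end year
_SEASON_PATTERN = re.compile(r"^(\d{4})-(\d{2})$")


def season_end_year(label: str) -> int:
    """Convert a label such as '1973-74' or '1999-00' to its ending year."""
    match = _SEASON_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"Unrecognized season label: {label!r}")
    end_year = int(match.group(1)) + 1
    # The two-digit suffix must agree with the start year plus one
    if end_year % 100 != int(match.group(2)):
        raise ValueError(f"Inconsistent season label: {label!r}")
    return end_year


for label in ("1961-62", "1973-74", "1999-00"):
    print(label, "->", season_end_year(label))
```

Rejecting malformed labels early keeps a digitization error such as `1973-75` from being silently assigned to an era.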
Part 2: Known Historical Data Issues
2.1 The Steals and Blocks Gap (Pre-1973-74)
def investigate_blocks_steals_gap():
    """
    Document the missing steals and blocks data before 1973-74.

    Historical Context:
    - Steals and blocks were not officially tracked until 1973-74
    - Some historical box scores exist but are incomplete
    - Players like Wilt Chamberlain and Bill Russell have no official
      steals/blocks statistics despite clearly accumulating them
    """
    investigator = HistoricalDataInvestigator()
    investigator.document_data_limitation(
        era=DataEra.PRE_ABA_MERGER,
        limitation_type="MISSING_STATISTIC",
        description=(
            "Steals and blocks were not officially recorded by the NBA "
            "until the 1973-74 season. Players who competed primarily "
            "before this date have incomplete defensive statistics."
        ),
        affected_statistics=["STL", "BLK"],
        recommended_handling=(
            "1. Do not compare steals/blocks across eras directly. "
            "2. Use per-possession defensive metrics for post-1973 only. "
            "3. Consider noting career totals as 'incomplete' for affected players. "
            "4. Research third-party historical reconstruction projects for estimates."
        )
    )
    # Players most affected. seasons_missing counts seasons played before
    # 1973-74; Chamberlain's 14 seasons (1959-60 through 1972-73) all qualify.
    affected_players = [
        {"name": "Wilt Chamberlain", "career": "1959-1973", "seasons_missing": 14},
        {"name": "Bill Russell", "career": "1956-1969", "seasons_missing": 13},
        {"name": "Oscar Robertson", "career": "1960-1974", "seasons_missing": 13},
        {"name": "Jerry West", "career": "1960-1974", "seasons_missing": 13},
        {"name": "Elgin Baylor", "career": "1958-1972", "seasons_missing": 14},
    ]
    return investigator.investigation_notes, affected_players
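The `seasons_missing` values can be sanity-checked from the career spans rather than entered by hand. The sketch below assumes each span runs from the first season's starting year to the last season's ending year, and that the player appeared in every season in between. The helper name `seasons_without_steals_blocks` is hypothetical.

```python
# The 1973-74 season (ending year 1974) is the first with official steals/blocks
FIRST_TRACKED_END_YEAR = 1974


def seasons_without_steals_blocks(first_start_year: int, last_end_year: int) -> int:
    """Count seasons in a career span that ended before 1973-74."""
    last_untracked_end = min(last_end_year, FIRST_TRACKED_END_YEAR - 1)
    return max(0, last_untracked_end - first_start_year)


# Career spans as listed in the case study
careers = {
    "Wilt Chamberlain": (1959, 1973),
    "Bill Russell": (1956, 1969),
    "Oscar Robertson": (1960, 1974),
    "Jerry West": (1960, 1974),
    "Elgin Baylor": (1958, 1972),
}

for name, (start, end) in careers.items():
    print(name, seasons_without_steals_blocks(start, end))
```

Deriving the count this way makes the assumption explicit: a career ending in 1973-74 loses one fewer season than its length, because that final season was tracked.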
2.2 The Three-Point Revolution (Pre-1979-80)
def investigate_three_point_gap():
    """
    Document the absence of three-point shooting before 1979-80.

    Historical Context:
    - The three-point line was introduced in 1979-80
    - Players before this era never had the opportunity to shoot threes
    - Comparing career three-point totals across eras is inherently flawed
    """
    investigator = HistoricalDataInvestigator()
    investigator.document_data_limitation(
        era=DataEra.PRE_THREE_POINT,
        limitation_type="RULE_CHANGE",
        description=(
            "The three-point line did not exist in the NBA until 1979-80. "
            "All long-range shots were counted as two-pointers. This makes "
            "direct comparison of shooting statistics problematic across eras."
        ),
        affected_statistics=["3PM", "3PA", "3P%", "eFG%", "TS%"],
        recommended_handling=(
            "1. Use era-adjusted efficiency metrics. "
            "2. Calculate 'hypothetical' three-point impact for historical players. "
            "3. Present efficiency stats with era context. "
            "4. Focus on league-relative performance rather than raw numbers."
        )
    )
    # Impact analysis
    impact_notes = {
        "true_shooting": (
            "TS% can be computed for pre-1980 players, but it reflects a "
            "game with no three-point shot, so values are not directly "
            "comparable with later eras."
        ),
        "effective_fg": (
            "eFG% equals FG% for pre-1980 players because they have no "
            "three-point makes to weight."
        ),
        "comparison_approach": (
            "Compare players to their era's league average rather than "
            "using raw statistics across eras."
        )
    }
    return investigator.investigation_notes, impact_notes
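To see how the missing three-point line affects the efficiency metrics, it helps to write out the standard formulas: eFG% = (FGM + 0.5 × 3PM) / FGA and TS% = PTS / (2 × (FGA + 0.44 × FTA)). The sketch below applies them to a hypothetical stat line (the numbers are invented for illustration, not a real player's season) and shows that with no three-pointers eFG% reduces to plain FG%.

```python
def effective_fg_pct(fgm: float, fg3m: float, fga: float) -> float:
    """eFG% = (FGM + 0.5 * 3PM) / FGA; returns 0.0 when there are no attempts."""
    if fga == 0:
        return 0.0
    return (fgm + 0.5 * fg3m) / fga


def true_shooting_pct(pts: float, fga: float, fta: float) -> float:
    """TS% = PTS / (2 * (FGA + 0.44 * FTA)); returns 0.0 when undefined."""
    denominator = 2 * (fga + 0.44 * fta)
    if denominator == 0:
        return 0.0
    return pts / denominator


# Hypothetical pre-1980 season: 800 made field goals on 1600 attempts,
# 400 made free throws on 500 attempts, no three-pointers available
fgm, fga, ftm, fta = 800, 1600, 400, 500
pts = 2 * fgm + ftm

fg_pct = fgm / fga
efg = effective_fg_pct(fgm, 0, fga)
ts = true_shooting_pct(pts, fga, fta)
print(f"FG% = {fg_pct:.3f}, eFG% = {efg:.3f}, TS% = {ts:.3f}")
```

Both metrics remain computable for early players. The comparability problem comes from the different shot mix available in each era, which is why league-relative versions are preferred.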
2.3 Box Score Discrepancies
def investigate_box_score_discrepancies():
    """
    Investigate known discrepancies in historical box scores.

    Common Issues:
    - Scorer errors in original games
    - Different counting methods for rebounds (offensive vs defensive)
    - Retroactive corrections not consistently applied
    - Transcription errors in digitization
    """
    # Example: The Bill Russell rebounds debate
    bill_russell_rebounds = {
        "official_nba": 21620,
        "basketball_reference": 21620,
        "some_archives": 21621,
        "note": (
            "A 1-rebound discrepancy exists due to a disputed game "
            "from the 1963-64 season where the original scorer's "
            "count differs from the newspaper report."
        )
    }
    # Common discrepancy patterns
    discrepancy_patterns = [
        {
            "type": "Scorer errors",
            "frequency": "Occasional",
            "typical_magnitude": "1-5 per game",
            "resolution": "Usually resolved post-game, but historical games may not be corrected"
        },
        {
            "type": "Digitization errors",
            "frequency": "Rare",
            "typical_magnitude": "Variable",
            "resolution": "Cross-reference multiple sources"
        },
        {
            "type": "Retroactive stat changes",
            "frequency": "Occasional",
            "typical_magnitude": "Can be significant for individual games",
            "resolution": "Check for official NBA corrections"
        },
        {
            "type": "Definition changes",
            "frequency": "Era-dependent",
            "typical_magnitude": "Systematic",
            "resolution": "Document era-specific definitions"
        }
    ]
    return bill_russell_rebounds, discrepancy_patterns
Part 3: Building a Cross-Era Comparison Framework
3.1 Era-Adjusted Statistics
class EraAdjustedStatistics:
    """
    Calculate era-adjusted statistics for fair cross-era comparisons.
    """

    def __init__(self, league_averages: pd.DataFrame):
        """
        Initialize with historical league average data.

        Args:
            league_averages: DataFrame with season-by-season league averages
        """
        self.league_averages = league_averages
    def calculate_relative_performance(
        self,
        player_stats: pd.DataFrame,
        stat_column: str
    ) -> pd.DataFrame:
        """
        Calculate performance relative to league average.

        Args:
            player_stats: Player's season-by-season statistics
            stat_column: Statistic to analyze

        Returns:
            DataFrame with relative performance metrics
        """
        player_stats = player_stats.copy()
        # Merge with league averages; the shared stat column is suffixed
        merged = player_stats.merge(
            self.league_averages[['Season', stat_column]],
            on='Season',
            suffixes=('_player', '_league')
        )
        # Calculate relative metrics
        player_col = f'{stat_column}_player'
        league_col = f'{stat_column}_league'
        merged['relative_to_league'] = (
            merged[player_col] / merged[league_col]
        )
        merged['above_league_avg'] = (
            merged[player_col] - merged[league_col]
        )
        merged['pct_above_league'] = (
            (merged[player_col] - merged[league_col]) /
            merged[league_col] * 100
        )
        return merged
    def calculate_z_score_by_era(
        self,
        player_value: float,
        era_mean: float,
        era_std: float
    ) -> float:
        """
        Calculate z-score relative to era statistics.

        Args:
            player_value: Player's statistic value
            era_mean: Mean for that statistic in the era
            era_std: Standard deviation in the era

        Returns:
            Z-score indicating how exceptional the performance was
        """
        # With no spread in the era, every value sits at the mean
        if era_std == 0:
            return 0.0
        return (player_value - era_mean) / era_std
    def compare_across_eras(
        self,
        player_a_stats: Dict,
        player_a_era_context: Dict,
        player_b_stats: Dict,
        player_b_era_context: Dict
    ) -> Dict:
        """
        Compare two players from different eras using standardized metrics.

        Args:
            player_a_stats: Statistics for player A
            player_a_era_context: Era context (mean, std) for player A
            player_b_stats: Statistics for player B
            player_b_era_context: Era context (mean, std) for player B

        Returns:
            Dictionary with comparison results
        """
        comparison = {
            "method": "Era-Adjusted Z-Score Comparison",
            "statistics": {}
        }
        # Only statistics recorded for both players can be compared
        common_stats = set(player_a_stats.keys()) & set(player_b_stats.keys())
        for stat in common_stats:
            z_a = self.calculate_z_score_by_era(
                player_a_stats[stat],
                player_a_era_context[f'{stat}_mean'],
                player_a_era_context[f'{stat}_std']
            )
            z_b = self.calculate_z_score_by_era(
                player_b_stats[stat],
                player_b_era_context[f'{stat}_mean'],
                player_b_era_context[f'{stat}_std']
            )
            comparison["statistics"][stat] = {
                "player_a_raw": player_a_stats[stat],
                "player_a_z": round(z_a, 2),
                "player_b_raw": player_b_stats[stat],
                "player_b_z": round(z_b, 2),
                "more_exceptional": "A" if z_a > z_b else "B" if z_b > z_a else "Equal"
            }
        return comparison
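A small worked example shows why the z-score comparison can reverse a raw-number ranking. The sketch below uses only the standard library and two invented sets of per-game scoring values standing in for two eras. The numbers are hypothetical, and using the population standard deviation is an assumption; a real analysis would also have to decide which players belong in each era's sample.

```python
from statistics import mean, pstdev
from typing import Sequence


def z_score(value: float, era_values: Sequence[float]) -> float:
    """Standardize a value against the mean and spread of its era."""
    era_std = pstdev(era_values)
    if era_std == 0:
        return 0.0
    return (value - mean(era_values)) / era_std


# Hypothetical per-game scoring samples for two eras
era_a_values = [10.0, 12.0, 14.0, 16.0, 18.0]   # lower-scoring era
era_b_values = [16.0, 18.0, 20.0, 22.0, 24.0]   # higher-scoring era

player_a_raw = 22.0   # played in era A
player_b_raw = 26.0   # played in era B

z_a = z_score(player_a_raw, era_a_values)
z_b = z_score(player_b_raw, era_b_values)
print(f"Player A: raw={player_a_raw}, z={z_a:.2f}")
print(f"Player B: raw={player_b_raw}, z={z_b:.2f}")
```

Player B has the higher raw value, but Player A stands further above the era mean, which is the situation the `more_exceptional` field is meant to surface.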
Part 4: Documentation Best Practices
4.1 Creating a Data Dictionary
def create_historical_data_dictionary() -> Dict:
    """
    Create a comprehensive data dictionary for historical NBA statistics.

    Returns:
        Dictionary documenting all statistics and their historical context
    """
    data_dictionary = {
        "statistics": {
            "PTS": {
                "name": "Points",
                "available_from": "1946-47",
                "definition": "Total points scored",
                "historical_notes": "Consistent definition throughout NBA history"
            },
            "REB": {
                "name": "Rebounds",
                "available_from": "1950-51",
                "definition": "Total rebounds (offensive + defensive)",
                "historical_notes": (
                    "Offensive/defensive split not recorded until 1973-74. "
                    "Early years may have inconsistent counting methods."
                )
            },
            "AST": {
                "name": "Assists",
                "available_from": "1946-47",
                "definition": "Passes leading directly to made baskets",
                "historical_notes": (
                    "Scoring standards have varied by era and by official "
                    "scorer, so assist totals are not strictly comparable. "
                    "Early eras are often regarded as stricter in crediting "
                    "assists than modern practice."
                )
            },
            "STL": {
                "name": "Steals",
                "available_from": "1973-74",
                "definition": "Taking the ball from an opponent",
                "historical_notes": "Not recorded before 1973-74 season"
            },
            "BLK": {
                "name": "Blocks",
                "available_from": "1973-74",
                "definition": "Deflecting an opponent's shot attempt",
                "historical_notes": "Not recorded before 1973-74 season"
            },
            "TOV": {
                "name": "Turnovers",
                "available_from": "1977-78",
                "definition": "Losing possession before a shot attempt",
                "historical_notes": (
                    "Not consistently recorded until 1977-78. "
                    "Some earlier seasons have partial data."
                )
            },
            "3PM": {
                "name": "Three-Pointers Made",
                "available_from": "1979-80",
                "definition": "Made field goals from beyond the arc",
                "historical_notes": (
                    "Three-point line introduced 1979-80. "
                    "Line shortened for the 1994-95 through 1996-97 seasons."
                )
            }
        },
        "eras": {
            "BAA": {"years": "1946-1949", "notes": "Before NBA existed"},
            "Early_NBA": {"years": "1949-1954", "notes": "No shot clock until 1954"},
            "Pre_Expansion": {"years": "1954-1967", "notes": "Relatively stable league"},
            "Expansion": {"years": "1967-1976", "notes": "ABA competition, roster changes"},
            "Merger": {"years": "1976-1980", "notes": "ABA-NBA merger integration"},
            "Modern": {"years": "1980-1996", "notes": "Pre-internet record keeping"},
            "Digital": {"years": "1996-2014", "notes": "Play-by-play data available"},
            "Analytics": {"years": "2014-present", "notes": "Tracking data available"}
        }
    }
    return data_dictionary
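A data dictionary is most useful when code can query it. The sketch below copies the `available_from` seasons listed above into a small lookup and reports which statistics two seasons have in common. The names `AVAILABLE_FROM`, `is_stat_available`, and `comparable_stats` are hypothetical, and the check assumes `YYYY-YY` season labels.

```python
from typing import Dict, List

# First season each statistic is available, as listed in the data dictionary
AVAILABLE_FROM: Dict[str, str] = {
    "PTS": "1946-47",
    "REB": "1950-51",
    "AST": "1946-47",
    "STL": "1973-74",
    "BLK": "1973-74",
    "TOV": "1977-78",
    "3PM": "1979-80",
}


def is_stat_available(stat: str, season: str) -> bool:
    """Return True if the statistic was recorded in the given season."""
    if stat not in AVAILABLE_FROM:
        raise KeyError(f"Unknown statistic: {stat}")
    # Compare the four-digit starting years of the two season labels
    return int(season[:4]) >= int(AVAILABLE_FROM[stat][:4])


def comparable_stats(season_a: str, season_b: str) -> List[str]:
    """List statistics recorded in both seasons, in alphabetical order."""
    return sorted(
        stat for stat in AVAILABLE_FROM
        if is_stat_available(stat, season_a) and is_stat_available(stat, season_b)
    )


print(comparable_stats("1961-62", "2015-16"))
```

Availability is only a first filter: a statistic recorded in both seasons may still have been defined or scored differently, which the `historical_notes` field documents.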
4.2 Generating Investigation Reports
def generate_investigation_report(
    investigator: HistoricalDataInvestigator,
    output_format: str = "markdown"
) -> str:
    """
    Generate a formatted report of the data quality investigation.

    Args:
        investigator: HistoricalDataInvestigator with findings
        output_format: Output format; only "markdown" is implemented

    Returns:
        Formatted report string

    Raises:
        ValueError: If an unsupported output format is requested
    """
    if output_format != "markdown":
        raise ValueError(
            f"Unsupported output_format {output_format!r}; "
            "only 'markdown' is implemented"
        )
    report_lines = [
        "# Historical NBA Data Quality Investigation Report",
        "",
        f"**Generated:** {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M')}",
        "",
        "## Executive Summary",
        "",
        f"This investigation identified {len(investigator.discrepancies_log)} "
        f"discrepancies between data sources and documented "
        f"{len(investigator.investigation_notes)} data limitations.",
        "",
        "## Discrepancies Found",
        ""
    ]
    for discrepancy in investigator.discrepancies_log:
        report_lines.append(f"### {discrepancy['player']}")
        report_lines.append("")
        # List only the statistics on which the two sources disagree
        for stat in discrepancy['statistics']:
            if not stat.get('match', True):
                report_lines.append(
                    f"- **{stat['statistic']}**: "
                    f"{discrepancy['sources'][0]}={stat.get(discrepancy['sources'][0])} vs "
                    f"{discrepancy['sources'][1]}={stat.get(discrepancy['sources'][1])} "
                    f"(diff: {stat.get('difference', 'N/A')})"
                )
        report_lines.append("")
    report_lines.extend([
        "## Documented Limitations",
        ""
    ])
    for limitation in investigator.investigation_notes:
        report_lines.extend([
            f"### {limitation['type']} ({limitation['era']})",
            "",
            limitation['description'],
            "",
            f"**Affected Statistics:** {', '.join(limitation['affected_statistics'])}",
            "",
            f"**Recommended Handling:** {limitation['recommended_handling']}",
            ""
        ])
    report_lines.extend([
        "## Recommendations",
        "",
        "1. Always document the data source and access date",
        "2. Note era-specific limitations in any analysis",
        "3. Use era-adjusted statistics for cross-era comparisons",
        "4. Cross-reference multiple sources for critical statistics",
        "5. Clearly communicate data limitations to stakeholders",
        ""
    ])
    return "\n".join(report_lines)
Part 5: Discussion Questions
Question 1: Ethical Considerations
When presenting historical comparisons in a documentary, how should the limitations of the data be communicated to a general audience without overwhelming them with caveats?
Question 2: Source Priority
When two reputable sources disagree on a historical statistic, what criteria should determine which source to trust?
Question 3: Modern Applications
How do historical data quality issues inform best practices for collecting and storing current NBA data?
Question 4: Reconstruction Projects
Several researchers have attempted to estimate steals and blocks for pre-1973 players using film study. How should these estimates be handled in a database versus official statistics?
Question 5: Accessibility vs. Accuracy
Is it better to present incomplete data with appropriate caveats, or to exclude statistics entirely when data quality cannot be assured?
Deliverables
By completing this case study, you should produce:
- Investigation Code: Python module for comparing and validating historical data
- Data Dictionary: Comprehensive documentation of NBA statistics and their historical context
- Investigation Report: Formatted report documenting all findings
- Guidelines Document: Best practices for using historical NBA data
- Presentation Materials: Key findings suitable for non-technical stakeholders
Key Takeaways
- Historical data is inherently incomplete - Statistics we take for granted today were not always recorded
- Sources may disagree - Even official sources can have discrepancies requiring investigation
- Context is essential - Raw numbers without era context are misleading
- Documentation enables reproducibility - Future researchers benefit from understanding data limitations
- Standardization enables comparison - Era-adjusted metrics allow fairer cross-era analysis
This case study demonstrates the critical importance of understanding data provenance and limitations, skills that are essential for any serious basketball analytics work.