Case Study 2: Investigating Data Quality Issues in Historical NBA Statistics

Overview

Scenario: A sports media company is producing a documentary comparing NBA legends across different eras. They need to reconcile statistical discrepancies between sources and understand data limitations when making cross-era comparisons.

Duration: 2-3 hours Difficulty: Advanced Prerequisites: Chapter 2 concepts, understanding of data quality principles

Background

The production team has discovered troubling inconsistencies while researching historical statistics:

Basketball-Reference and the NBA's official records show different career totals for some Hall of Fame players
Play-by-play data is incomplete or missing for games before 1996
Certain statistics (blocks, steals) weren't recorded until 1973-74
Definition changes for assists and rebounds complicate historical comparisons

Your task is to investigate these discrepancies, document their causes, and develop guidelines for responsible use of historical NBA data.

Part 1: The Investigation

1.1 The Mystery of the Missing Points

The research team found that Wilt Chamberlain's career points total differs between sources:

Source	Career Points
NBA.com	31,419
Basketball-Reference	31,419
Historical Archives	31,420

Investigation Task: Determine why this 1-point discrepancy exists and document its origin.

1.2 Data Collection Script for Investigation

"""
Historical Data Quality Investigation
Case Study 2 - Chapter 2: Data Sources and Collection
"""

import pandas as pd
import numpy as np
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass
from enum import Enum
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class DataEra(Enum):
    """NBA statistical eras with different data availability."""
    PRE_ABA_MERGER = "pre_1976"
    PRE_THREE_POINT = "1976_1979"
    EARLY_MODERN = "1980_1996"
    PLAY_BY_PLAY = "1997_2013"
    TRACKING = "2014_present"


@dataclass
class StatAvailability:
    """Tracks availability of statistics across eras."""
    points: bool = True
    rebounds: bool = True
    assists: bool = True
    steals: bool = False
    blocks: bool = False
    three_pointers: bool = False
    turnovers: bool = False
    play_by_play: bool = False
    tracking: bool = False


ERA_CONFIGURATIONS: Dict[DataEra, StatAvailability] = {
    DataEra.PRE_ABA_MERGER: StatAvailability(
        steals=False, blocks=False, three_pointers=False,
        turnovers=False, play_by_play=False, tracking=False
    ),
    DataEra.PRE_THREE_POINT: StatAvailability(
        steals=True, blocks=True, three_pointers=False,
        turnovers=True, play_by_play=False, tracking=False
    ),
    DataEra.EARLY_MODERN: StatAvailability(
        steals=True, blocks=True, three_pointers=True,
        turnovers=True, play_by_play=False, tracking=False
    ),
    DataEra.PLAY_BY_PLAY: StatAvailability(
        steals=True, blocks=True, three_pointers=True,
        turnovers=True, play_by_play=True, tracking=False
    ),
    DataEra.TRACKING: StatAvailability(
        steals=True, blocks=True, three_pointers=True,
        turnovers=True, play_by_play=True, tracking=True
    ),
}


def get_era_for_season(season_end_year: int) -> DataEra:
    """Determine the data era for a given season."""
    if season_end_year < 1974:
        return DataEra.PRE_ABA_MERGER
    elif season_end_year < 1980:
        return DataEra.PRE_THREE_POINT
    elif season_end_year < 1997:
        return DataEra.EARLY_MODERN
    elif season_end_year < 2014:
        return DataEra.PLAY_BY_PLAY
    else:
        return DataEra.TRACKING


class HistoricalDataInvestigator:
    """
    Investigates data quality issues in historical NBA statistics.

    This class provides tools for comparing sources, identifying
    discrepancies, and documenting data quality issues.
    """

    def __init__(self):
        """Initialize the investigator."""
        self.discrepancies_log = []
        self.investigation_notes = []

    def compare_career_totals(
        self,
        player_name: str,
        source_a: Dict[str, float],
        source_b: Dict[str, float],
        source_a_name: str = "Source A",
        source_b_name: str = "Source B"
    ) -> Dict:
        """
        Compare career statistics between two sources.

        Args:
            player_name: Name of the player
            source_a: Statistics from first source
            source_b: Statistics from second source
            source_a_name: Name identifier for first source
            source_b_name: Name identifier for second source

        Returns:
            Dictionary with comparison results
        """
        comparison = {
            "player": player_name,
            "sources": [source_a_name, source_b_name],
            "statistics": [],
            "discrepancies_found": False
        }

        all_stats = set(source_a.keys()) | set(source_b.keys())

        for stat in all_stats:
            val_a = source_a.get(stat)
            val_b = source_b.get(stat)

            stat_comparison = {
                "statistic": stat,
                source_a_name: val_a,
                source_b_name: val_b,
                "match": val_a == val_b
            }

            if val_a is not None and val_b is not None:
                stat_comparison["difference"] = abs(val_a - val_b)
                if val_a != val_b:
                    stat_comparison["pct_difference"] = (
                        abs(val_a - val_b) / max(val_a, val_b) * 100
                    )
                    comparison["discrepancies_found"] = True

            comparison["statistics"].append(stat_comparison)

        if comparison["discrepancies_found"]:
            self.discrepancies_log.append(comparison)

        return comparison

    def investigate_season_by_season(
        self,
        player_seasons_source_a: pd.DataFrame,
        player_seasons_source_b: pd.DataFrame,
        stat_column: str
    ) -> pd.DataFrame:
        """
        Compare statistics season-by-season to find discrepancy origins.

        Args:
            player_seasons_source_a: Season data from first source
            player_seasons_source_b: Season data from second source
            stat_column: Name of statistic column to compare

        Returns:
            DataFrame showing season-by-season comparison
        """
        comparison_rows = []

        # Align by season
        merged = player_seasons_source_a.merge(
            player_seasons_source_b,
            on='Season',
            suffixes=('_A', '_B'),
            how='outer'
        )

        for _, row in merged.iterrows():
            val_a = row.get(f'{stat_column}_A')
            val_b = row.get(f'{stat_column}_B')

            comparison_rows.append({
                'Season': row['Season'],
                'Source_A': val_a,
                'Source_B': val_b,
                'Difference': val_a - val_b if pd.notna(val_a) and pd.notna(val_b) else None,
                'Match': val_a == val_b if pd.notna(val_a) and pd.notna(val_b) else None
            })

        return pd.DataFrame(comparison_rows)

    def document_data_limitation(
        self,
        era: DataEra,
        limitation_type: str,
        description: str,
        affected_statistics: List[str],
        recommended_handling: str
    ) -> Dict:
        """
        Document a known data limitation for an era.

        Args:
            era: The NBA data era
            limitation_type: Category of limitation
            description: Detailed description
            affected_statistics: List of affected stat names
            recommended_handling: How to handle this limitation

        Returns:
            Dictionary documenting the limitation
        """
        limitation = {
            "era": era.value,
            "type": limitation_type,
            "description": description,
            "affected_statistics": affected_statistics,
            "recommended_handling": recommended_handling
        }

        self.investigation_notes.append(limitation)
        return limitation

    def analyze_missing_data_patterns(
        self,
        df: pd.DataFrame,
        stat_columns: List[str]
    ) -> Dict:
        """
        Analyze patterns in missing data.

        Args:
            df: DataFrame to analyze
            stat_columns: Columns to check for missing values

        Returns:
            Dictionary with missing data analysis
        """
        analysis = {
            "total_records": len(df),
            "statistics": {}
        }

        for col in stat_columns:
            if col in df.columns:
                missing = df[col].isna().sum()
                analysis["statistics"][col] = {
                    "missing_count": missing,
                    "missing_pct": round(missing / len(df) * 100, 2),
                    "first_non_null": df[df[col].notna()].index.min() if missing < len(df) else None
                }

        # Analyze by era if season information available
        if 'Season' in df.columns:
            analysis["by_era"] = {}
            for _, row in df.iterrows():
                season_year = int(str(row['Season'])[:4]) + 1
                era = get_era_for_season(season_year)
                if era.value not in analysis["by_era"]:
                    analysis["by_era"][era.value] = {"seasons": 0, "missing": {}}
                analysis["by_era"][era.value]["seasons"] += 1

        return analysis

Part 2: Known Historical Data Issues

2.1 The Steals and Blocks Gap (Pre-1973-74)

def investigate_blocks_steals_gap():
    """
    Document the missing steals and blocks data before 1973-74.

    Historical Context:
    - Steals and blocks were not officially tracked until 1973-74
    - Some historical box scores exist but are incomplete
    - Players like Wilt Chamberlain and Bill Russell have no official
      steals/blocks statistics despite clearly accumulating them
    """

    investigator = HistoricalDataInvestigator()

    investigator.document_data_limitation(
        era=DataEra.PRE_ABA_MERGER,
        limitation_type="MISSING_STATISTIC",
        description=(
            "Steals and blocks were not officially recorded by the NBA "
            "until the 1973-74 season. Players who competed primarily "
            "before this date have incomplete defensive statistics."
        ),
        affected_statistics=["STL", "BLK"],
        recommended_handling=(
            "1. Do not compare steals/blocks across eras directly. "
            "2. Use per-possession defensive metrics for post-1973 only. "
            "3. Consider noting career totals as 'incomplete' for affected players. "
            "4. Research third-party historical reconstruction projects for estimates."
        )
    )

    # Players most affected
    affected_players = [
        {"name": "Wilt Chamberlain", "career": "1959-1973", "seasons_missing": 13},
        {"name": "Bill Russell", "career": "1956-1969", "seasons_missing": 13},
        {"name": "Oscar Robertson", "career": "1960-1974", "seasons_missing": 13},
        {"name": "Jerry West", "career": "1960-1974", "seasons_missing": 13},
        {"name": "Elgin Baylor", "career": "1958-1972", "seasons_missing": 14},
    ]

    return investigator.investigation_notes, affected_players

2.2 The Three-Point Revolution (Pre-1979-80)

def investigate_three_point_gap():
    """
    Document the absence of three-point shooting before 1979-80.

    Historical Context:
    - The three-point line was introduced in 1979-80
    - Players before this era never had the opportunity to shoot threes
    - Comparing career three-point totals across eras is inherently flawed
    """

    investigator = HistoricalDataInvestigator()

    investigator.document_data_limitation(
        era=DataEra.PRE_THREE_POINT,
        limitation_type="RULE_CHANGE",
        description=(
            "The three-point line did not exist in the NBA until 1979-80. "
            "All long-range shots were counted as two-pointers. This makes "
            "direct comparison of shooting statistics problematic across eras."
        ),
        affected_statistics=["3PM", "3PA", "3P%", "eFG%", "TS%"],
        recommended_handling=(
            "1. Use era-adjusted efficiency metrics. "
            "2. Calculate 'hypothetical' three-point impact for historical players. "
            "3. Present efficiency stats with era context. "
            "4. Focus on league-relative performance rather than raw numbers."
        )
    )

    # Impact analysis
    impact_notes = {
        "true_shooting": (
            "TS% calculations for pre-1980 players are lower than they "
            "would be if those players had access to the three-point line."
        ),
        "effective_fg": (
            "eFG% is meaningless for pre-1980 players as it weights "
            "three-pointers, which didn't exist."
        ),
        "comparison_approach": (
            "Compare players to their era's league average rather than "
            "using raw statistics across eras."
        )
    }

    return investigator.investigation_notes, impact_notes

2.3 Box Score Discrepancies

def investigate_box_score_discrepancies():
    """
    Investigate known discrepancies in historical box scores.

    Common Issues:
    - Scorer errors in original games
    - Different counting methods for rebounds (offensive vs defensive)
    - Retroactive corrections not consistently applied
    - Transcription errors in digitization
    """

    # Example: The Bill Russell rebounds debate
    bill_russell_rebounds = {
        "official_nba": 21620,
        "basketball_reference": 21620,
        "some_archives": 21621,
        "note": (
            "A 1-rebound discrepancy exists due to a disputed game "
            "from the 1963-64 season where the original scorer's "
            "count differs from the newspaper report."
        )
    }

    # Common discrepancy patterns
    discrepancy_patterns = [
        {
            "type": "Scorer errors",
            "frequency": "Occasional",
            "typical_magnitude": "1-5 per game",
            "resolution": "Usually resolved post-game, but historical games may not be corrected"
        },
        {
            "type": "Digitization errors",
            "frequency": "Rare",
            "typical_magnitude": "Variable",
            "resolution": "Cross-reference multiple sources"
        },
        {
            "type": "Retroactive stat changes",
            "frequency": "Occasional",
            "typical_magnitude": "Can be significant for individual games",
            "resolution": "Check for official NBA corrections"
        },
        {
            "type": "Definition changes",
            "frequency": "Era-dependent",
            "typical_magnitude": "Systematic",
            "resolution": "Document era-specific definitions"
        }
    ]

    return bill_russell_rebounds, discrepancy_patterns

Part 3: Building a Cross-Era Comparison Framework

3.1 Era-Adjusted Statistics

class EraAdjustedStatistics:
    """
    Calculate era-adjusted statistics for fair cross-era comparisons.
    """

    def __init__(self, league_averages: pd.DataFrame):
        """
        Initialize with historical league average data.

        Args:
            league_averages: DataFrame with season-by-season league averages
        """
        self.league_averages = league_averages

    def calculate_relative_performance(
        self,
        player_stats: pd.DataFrame,
        stat_column: str
    ) -> pd.DataFrame:
        """
        Calculate performance relative to league average.

        Args:
            player_stats: Player's season-by-season statistics
            stat_column: Statistic to analyze

        Returns:
            DataFrame with relative performance metrics
        """
        player_stats = player_stats.copy()

        # Merge with league averages
        merged = player_stats.merge(
            self.league_averages[['Season', stat_column]],
            on='Season',
            suffixes=('_player', '_league')
        )

        # Calculate relative metrics
        player_col = f'{stat_column}_player'
        league_col = f'{stat_column}_league'

        merged['relative_to_league'] = (
            merged[player_col] / merged[league_col]
        )
        merged['above_league_avg'] = (
            merged[player_col] - merged[league_col]
        )
        merged['pct_above_league'] = (
            (merged[player_col] - merged[league_col]) /
            merged[league_col] * 100
        )

        return merged

    def calculate_z_score_by_era(
        self,
        player_value: float,
        era_mean: float,
        era_std: float
    ) -> float:
        """
        Calculate z-score relative to era statistics.

        Args:
            player_value: Player's statistic value
            era_mean: Mean for that statistic in the era
            era_std: Standard deviation in the era

        Returns:
            Z-score indicating how exceptional the performance was
        """
        if era_std == 0:
            return 0.0
        return (player_value - era_mean) / era_std

    def compare_across_eras(
        self,
        player_a_stats: Dict,
        player_a_era_context: Dict,
        player_b_stats: Dict,
        player_b_era_context: Dict
    ) -> Dict:
        """
        Compare two players from different eras using standardized metrics.

        Args:
            player_a_stats: Statistics for player A
            player_a_era_context: Era context (mean, std) for player A
            player_b_stats: Statistics for player B
            player_b_era_context: Era context (mean, std) for player B

        Returns:
            Dictionary with comparison results
        """
        comparison = {
            "method": "Era-Adjusted Z-Score Comparison",
            "statistics": {}
        }

        common_stats = set(player_a_stats.keys()) & set(player_b_stats.keys())

        for stat in common_stats:
            z_a = self.calculate_z_score_by_era(
                player_a_stats[stat],
                player_a_era_context[f'{stat}_mean'],
                player_a_era_context[f'{stat}_std']
            )
            z_b = self.calculate_z_score_by_era(
                player_b_stats[stat],
                player_b_era_context[f'{stat}_mean'],
                player_b_era_context[f'{stat}_std']
            )

            comparison["statistics"][stat] = {
                "player_a_raw": player_a_stats[stat],
                "player_a_z": round(z_a, 2),
                "player_b_raw": player_b_stats[stat],
                "player_b_z": round(z_b, 2),
                "more_exceptional": "A" if z_a > z_b else "B" if z_b > z_a else "Equal"
            }

        return comparison

Part 4: Documentation Best Practices

4.1 Creating a Data Dictionary

def create_historical_data_dictionary() -> Dict:
    """
    Create a comprehensive data dictionary for historical NBA statistics.

    Returns:
        Dictionary documenting all statistics and their historical context
    """

    data_dictionary = {
        "statistics": {
            "PTS": {
                "name": "Points",
                "available_from": "1946-47",
                "definition": "Total points scored",
                "historical_notes": "Consistent definition throughout NBA history"
            },
            "REB": {
                "name": "Rebounds",
                "available_from": "1950-51",
                "definition": "Total rebounds (offensive + defensive)",
                "historical_notes": (
                    "Offensive/defensive split not recorded until 1973-74. "
                    "Early years may have inconsistent counting methods."
                )
            },
            "AST": {
                "name": "Assists",
                "available_from": "1946-47",
                "definition": "Passes leading directly to made baskets",
                "historical_notes": (
                    "Definition has evolved. Early eras were more generous "
                    "in crediting assists. Modern definition requires more "
                    "direct contribution to the basket."
                )
            },
            "STL": {
                "name": "Steals",
                "available_from": "1973-74",
                "definition": "Taking the ball from an opponent",
                "historical_notes": "Not recorded before 1973-74 season"
            },
            "BLK": {
                "name": "Blocks",
                "available_from": "1973-74",
                "definition": "Deflecting an opponent's shot attempt",
                "historical_notes": "Not recorded before 1973-74 season"
            },
            "TOV": {
                "name": "Turnovers",
                "available_from": "1977-78",
                "definition": "Losing possession before a shot attempt",
                "historical_notes": (
                    "Not consistently recorded until 1977-78. "
                    "Some earlier seasons have partial data."
                )
            },
            "3PM": {
                "name": "Three-Pointers Made",
                "available_from": "1979-80",
                "definition": "Made field goals from beyond the arc",
                "historical_notes": (
                    "Three-point line introduced 1979-80. "
                    "Line distance changed briefly in 1994-97."
                )
            }
        },
        "eras": {
            "BAA": {"years": "1946-1949", "notes": "Before NBA existed"},
            "Early_NBA": {"years": "1949-1954", "notes": "No shot clock until 1954"},
            "Pre_Expansion": {"years": "1954-1967", "notes": "Relatively stable league"},
            "Expansion": {"years": "1967-1976", "notes": "ABA competition, roster changes"},
            "Merger": {"years": "1976-1980", "notes": "ABA-NBA merger integration"},
            "Modern": {"years": "1980-1996", "notes": "Pre-internet record keeping"},
            "Digital": {"years": "1996-2014", "notes": "Play-by-play data available"},
            "Analytics": {"years": "2014-present", "notes": "Tracking data available"}
        }
    }

    return data_dictionary

4.2 Generating Investigation Reports

def generate_investigation_report(
    investigator: HistoricalDataInvestigator,
    output_format: str = "markdown"
) -> str:
    """
    Generate a formatted report of the data quality investigation.

    Args:
        investigator: HistoricalDataInvestigator with findings
        output_format: Output format (markdown, html, json)

    Returns:
        Formatted report string
    """

    report_lines = [
        "# Historical NBA Data Quality Investigation Report",
        "",
        f"**Generated:** {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M')}",
        "",
        "## Executive Summary",
        "",
        f"This investigation identified {len(investigator.discrepancies_log)} "
        f"discrepancies between data sources and documented "
        f"{len(investigator.investigation_notes)} data limitations.",
        "",
        "## Discrepancies Found",
        ""
    ]

    for discrepancy in investigator.discrepancies_log:
        report_lines.append(f"### {discrepancy['player']}")
        report_lines.append("")
        for stat in discrepancy['statistics']:
            if not stat.get('match', True):
                report_lines.append(
                    f"- **{stat['statistic']}**: "
                    f"{discrepancy['sources'][0]}={stat.get(discrepancy['sources'][0])} vs "
                    f"{discrepancy['sources'][1]}={stat.get(discrepancy['sources'][1])} "
                    f"(diff: {stat.get('difference', 'N/A')})"
                )
        report_lines.append("")

    report_lines.extend([
        "## Documented Limitations",
        ""
    ])

    for limitation in investigator.investigation_notes:
        report_lines.extend([
            f"### {limitation['type']} ({limitation['era']})",
            "",
            limitation['description'],
            "",
            f"**Affected Statistics:** {', '.join(limitation['affected_statistics'])}",
            "",
            f"**Recommended Handling:** {limitation['recommended_handling']}",
            ""
        ])

    report_lines.extend([
        "## Recommendations",
        "",
        "1. Always document the data source and access date",
        "2. Note era-specific limitations in any analysis",
        "3. Use era-adjusted statistics for cross-era comparisons",
        "4. Cross-reference multiple sources for critical statistics",
        "5. Clearly communicate data limitations to stakeholders",
        ""
    ])

    return "\n".join(report_lines)

Part 5: Discussion Questions

Question 1: Ethical Considerations

When presenting historical comparisons in a documentary, how should the limitations of the data be communicated to a general audience without overwhelming them with caveats?

Question 2: Source Priority

When two reputable sources disagree on a historical statistic, what criteria should determine which source to trust?

Question 3: Modern Applications

How do historical data quality issues inform best practices for collecting and storing current NBA data?

Question 4: Reconstruction Projects

Several researchers have attempted to estimate steals and blocks for pre-1973 players using film study. How should these estimates be handled in a database versus official statistics?

Question 5: Accessibility vs. Accuracy

Is it better to present incomplete data with appropriate caveats, or to exclude statistics entirely when data quality cannot be assured?

Deliverables

By completing this case study, you should produce:

Investigation Code: Python module for comparing and validating historical data
Data Dictionary: Comprehensive documentation of NBA statistics and their historical context
Investigation Report: Formatted report documenting all findings
Guidelines Document: Best practices for using historical NBA data
Presentation Materials: Key findings suitable for non-technical stakeholders

Key Takeaways

Historical data is inherently incomplete - Statistics we take for granted today were not always recorded
Sources may disagree - Even official sources can have discrepancies requiring investigation
Context is essential - Raw numbers without era context are misleading
Documentation enables reproducibility - Future researchers benefit from understanding data limitations
Standardization enables comparison - Era-adjusted metrics allow fairer cross-era analysis

This case study demonstrates the critical importance of understanding data provenance and limitations, skills that are essential for any serious basketball analytics work.