Case Study 1: Building a Season Database from Multiple Sources

Overview

In this case study, you'll combine play-by-play data, game metadata, team statistics, and weather information from multiple sources into a unified, analysis-ready database. This scenario mirrors real-world sports analytics work where data must be collected, cleaned, and integrated from various providers.


The Scenario

You're a data analyst for a college football media company preparing for the upcoming season. Your task is to create a comprehensive database of the previous season's games that can support various analytical needs:

  1. Writer queries for game recaps and historical comparisons
  2. Statistical analysis for predictions and rankings
  3. Visualization dashboards for on-air graphics

You have data from four sources, each with its own quirks:

  • Play-by-Play Data: detailed play records from a third-party API
  • Game Information: schedule and results from the conference website
  • Team Reference: team metadata, including conference affiliations
  • Weather Data: historical weather conditions from a weather API


Part 1: Data Exploration and Assessment

Understanding the Raw Data

Let's examine each data source:

import pandas as pd
import numpy as np
from typing import Dict, List, Tuple

# Source 1: Play-by-Play Data
plays_raw = pd.DataFrame({
    'gameId': ['2023010001', '2023010001', '2023010001', '2023010001',
               '2023010002', '2023010002', '2023010002'],
    'playId': [1, 2, 3, 4, 1, 2, 3],
    'offense': ['ALABAMA', 'Alabama', 'BAMA', 'Georgia',
                'ohio state', 'Ohio St.', 'Michigan'],
    'defense': ['Georgia', 'georgia', 'UGA', 'Alabama',
                'Michigan', 'michigan', 'Ohio State'],
    'down': ['1', '2', '3', '1', '1', '2', '3'],
    'distance': [10, 7, 3, 10, 10, 8, 5],
    'yardsGained': ['5', '-2', 'INC', '8', '12', None, '6'],
    'playType': ['run', 'pass', 'pass', 'run', 'run', 'pass', 'run'],
    'quarter': [1, 1, 1, 1, 1, 1, 1],
    'clock': ['15:00', '14:25', '13:58', '13:12', '15:00', '14:32', '14:05']
})

print("=" * 60)
print("PLAY-BY-PLAY DATA ISSUES")
print("=" * 60)
print(f"Shape: {plays_raw.shape}")
print(f"\nColumn types:")
print(plays_raw.dtypes)
print(f"\nSample:")
print(plays_raw.head())
print(f"\nUnique offense values: {plays_raw['offense'].unique()}")

Issues Identified:

  • gameId is stored as a string, while the games table stores game_id as an integer, so the key types must be reconciled before joining
  • down is a string (should be an integer)
  • yardsGained contains "INC" for incomplete passes and has missing values
  • Team names are inconsistent across rows (ALABAMA, Alabama, BAMA)
  • No explicit game date or team identifiers

# Source 2: Game Information
games_raw = pd.DataFrame({
    'game_id': [2023010001, 2023010002, 2023010003],
    'date': ['2023-09-02', '2023-09-02', '2023-09-09'],
    'home': ['Alabama Crimson Tide', 'Ohio State Buckeyes', 'LSU Tigers'],
    'away': ['Georgia Bulldogs', 'Michigan Wolverines', 'Florida State Seminoles'],
    'home_score': [28, 42, 35],
    'away_score': [24, 27, 38],
    'venue': ['Bryant-Denny Stadium', 'Ohio Stadium', 'Tiger Stadium'],
    'attendance': ['101,821', '104,944', '102,321']
})

print("\n" + "=" * 60)
print("GAME INFORMATION ISSUES")
print("=" * 60)
print(f"Shape: {games_raw.shape}")
print(f"\nColumn types:")
print(games_raw.dtypes)
print(f"\nSample home teams: {games_raw['home'].unique()}")

Issues Identified:

  • Team names include mascots (a different format from the play-by-play data)
  • Attendance contains commas, so it is a string rather than numeric
  • Date is a string (should be datetime)

# Source 3: Team Reference Data
teams_raw = pd.DataFrame({
    'team_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'team_name': ['Alabama', 'Georgia', 'Ohio State', 'Michigan',
                  'LSU', 'Florida State', 'Texas', 'Oregon'],
    'conference': ['SEC', 'SEC', 'Big Ten', 'Big Ten',
                   'SEC', 'ACC', 'Big 12', 'Pac-12'],
    'abbreviation': ['ALA', 'UGA', 'OSU', 'MICH', 'LSU', 'FSU', 'TEX', 'ORE'],
    'mascot': ['Crimson Tide', 'Bulldogs', 'Buckeyes', 'Wolverines',
               'Tigers', 'Seminoles', 'Longhorns', 'Ducks']
})

print("\n" + "=" * 60)
print("TEAM REFERENCE DATA")
print("=" * 60)
print(teams_raw)

# Source 4: Weather Data
weather_raw = pd.DataFrame({
    'date': ['2023-09-02', '2023-09-02', '2023-09-09'],
    'location': ['Tuscaloosa, AL', 'Columbus, OH', 'Baton Rouge, LA'],
    'temp_f': [92, 78, 88],
    'humidity': [65, 45, 72],
    'wind_mph': [8, 12, 5],
    'conditions': ['Partly Cloudy', 'Clear', 'Scattered Storms']
})

print("\n" + "=" * 60)
print("WEATHER DATA")
print("=" * 60)
print(weather_raw)

Challenges:

  • Weather is keyed by date and location, not game_id
  • Venues must be mapped to weather locations before joining
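One normalization worth doing at load time (a small sketch, separate from the case-study pipeline): parse the weather date strings to datetime immediately, so later joins against the datetime-typed games table compare like with like.

```python
import pandas as pd

# Parse date strings up front; string dates will never match datetime keys
weather = pd.DataFrame({'date': ['2023-09-02', '2023-09-09'],
                        'temp_f': [92, 88]})
weather['date'] = pd.to_datetime(weather['date'])
print(weather.dtypes)
```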


Part 2: Building the Cleaning Pipeline

Step 1: Team Name Standardization

class TeamNameStandardizer:
    """Standardize team names across all data sources."""

    def __init__(self):
        # Build comprehensive mapping
        self.name_to_standard = {
            # Alabama variants
            'alabama': 'Alabama',
            'bama': 'Alabama',
            'alabama crimson tide': 'Alabama',
            'ala': 'Alabama',

            # Georgia variants
            'georgia': 'Georgia',
            'uga': 'Georgia',
            'georgia bulldogs': 'Georgia',

            # Ohio State variants
            'ohio state': 'Ohio State',
            'ohio st': 'Ohio State',
            'ohio st.': 'Ohio State',
            'osu': 'Ohio State',
            'ohio state buckeyes': 'Ohio State',

            # Michigan variants
            'michigan': 'Michigan',
            'mich': 'Michigan',
            'michigan wolverines': 'Michigan',

            # LSU variants
            'lsu': 'LSU',
            'louisiana state': 'LSU',
            'lsu tigers': 'LSU',

            # Florida State variants
            'florida state': 'Florida State',
            'fsu': 'Florida State',
            'florida state seminoles': 'Florida State',
        }

    def standardize(self, name: str) -> str:
        """Convert any team name variant to standard form."""
        if pd.isna(name):
            return name

        cleaned = str(name).lower().strip()

        if cleaned in self.name_to_standard:
            return self.name_to_standard[cleaned]

        return name.strip().title()

    def standardize_column(self, df: pd.DataFrame, column: str) -> pd.DataFrame:
        """Standardize all team names in a column."""
        df_result = df.copy()
        original_unique = df_result[column].nunique()

        df_result[column] = df_result[column].apply(self.standardize)

        new_unique = df_result[column].nunique()
        print(f"  {column}: {original_unique} → {new_unique} unique values")

        return df_result


# Apply standardization
standardizer = TeamNameStandardizer()

print("Standardizing play-by-play team names...")
plays_clean = standardizer.standardize_column(plays_raw.copy(), 'offense')
plays_clean = standardizer.standardize_column(plays_clean, 'defense')

print("\nStandardizing game team names...")
games_clean = standardizer.standardize_column(games_raw.copy(), 'home')
games_clean = standardizer.standardize_column(games_clean, 'away')
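The hand-built mapping only covers variants we anticipated. As a possible fallback (a sketch, not part of the pipeline above), difflib from the standard library can suggest near-misses for manual review; the cutoff value here is an assumption to tune against your data.

```python
import difflib

# Known standard names, mirroring the team reference table
KNOWN_STANDARD = ['Alabama', 'Georgia', 'Ohio State', 'Michigan',
                  'LSU', 'Florida State', 'Texas', 'Oregon']

def suggest_standard(name: str, cutoff: float = 0.6):
    """Return the closest known team name, or None if nothing is close."""
    matches = difflib.get_close_matches(name.title(), KNOWN_STANDARD,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(suggest_standard('Ohio Stat'))  # a close misspelling
print(suggest_standard('Gonzaga'))    # not in the reference table
```

Suggestions like these should feed a review queue rather than be applied automatically: a fuzzy match that silently maps one school to another is worse than a loud gap in the mapping.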

Step 2: Data Type Corrections

def clean_plays_types(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and convert data types in play-by-play data."""
    df_clean = df.copy()

    # Convert gameId to string for consistent joining
    df_clean['gameId'] = df_clean['gameId'].astype(str)

    # Convert down to integer (handling errors)
    df_clean['down'] = pd.to_numeric(df_clean['down'], errors='coerce').astype('Int64')

    # Convert yardsGained - handle special values
    def parse_yards(value):
        if pd.isna(value):
            return np.nan
        if isinstance(value, str):
            if value.upper() in ['INC', 'N/A', 'NULL', '']:
                return 0.0  # treat incompletions and placeholder values as 0 yards
            try:
                return float(value)
            except ValueError:
                return np.nan
        return float(value)

    df_clean['yardsGained'] = df_clean['yardsGained'].apply(parse_yards)

    print("Data types after cleaning:")
    print(df_clean.dtypes)

    return df_clean


def clean_games_types(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and convert data types in game data."""
    df_clean = df.copy()

    # Convert game_id to string
    df_clean['game_id'] = df_clean['game_id'].astype(str)

    # Convert date to datetime
    df_clean['date'] = pd.to_datetime(df_clean['date'])

    # Clean attendance (remove commas, convert to int)
    df_clean['attendance'] = df_clean['attendance'].str.replace(',', '').astype(int)

    print("Game data types after cleaning:")
    print(df_clean.dtypes)

    return df_clean


plays_clean = clean_plays_types(plays_clean)
games_clean = clean_games_types(games_clean)
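errors='coerce' is convenient but silent. A small audit sketch (on hypothetical values, not the pipeline's data) shows how to list exactly which raw values a coercion would turn into NaN before committing to it:

```python
import pandas as pd

raw = pd.Series(['5', '-2', 'INC', '8', None, '6'])
converted = pd.to_numeric(raw, errors='coerce')

# Values present in the raw data but null after conversion are the
# ones 'coerce' silently discarded
coerced = raw[raw.notna() & converted.isna()]
print(coerced.tolist())  # ['INC']
```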

Step 3: Handling Missing Values

def handle_plays_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Handle missing values in play data."""
    df_clean = df.copy()

    # Report missing
    print("\nMissing values in plays:")
    missing = df_clean.isnull().sum()
    for col, count in missing[missing > 0].items():
        print(f"  {col}: {count} ({count/len(df_clean)*100:.1f}%)")

    # Strategy for yardsGained: already converted INC to 0
    # For remaining nulls (if any), use 0 as plays without gain
    df_clean['yardsGained'] = df_clean['yardsGained'].fillna(0)

    # Validate no critical columns have missing values
    critical_cols = ['gameId', 'playId', 'offense', 'defense']
    for col in critical_cols:
        null_count = df_clean[col].isnull().sum()
        if null_count > 0:
            print(f"  WARNING: {null_count} null values in critical column {col}")

    return df_clean


plays_clean = handle_plays_missing(plays_clean)
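One refinement worth considering (an addition to, not part of, the chapter's pipeline): record a flag column before filling, so downstream analysis can still distinguish true zero-yard plays from imputed ones.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'yardsGained': [5.0, np.nan, 0.0, 8.0]})

# Flag first, then fill -- once filled, the distinction is gone
df['yards_imputed'] = df['yardsGained'].isna().astype(int)
df['yardsGained'] = df['yardsGained'].fillna(0)
print(df)
```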

Part 3: Data Integration

Merging Plays with Games

def merge_plays_games(plays: pd.DataFrame,
                       games: pd.DataFrame) -> pd.DataFrame:
    """Merge play-by-play data with game metadata."""

    # Check key compatibility
    plays_ids = set(plays['gameId'].unique())
    games_ids = set(games['game_id'].unique())

    print(f"\nUnique game IDs in plays: {len(plays_ids)}")
    print(f"Unique game IDs in games: {len(games_ids)}")
    print(f"Overlap: {len(plays_ids & games_ids)}")

    # Perform merge
    merged = pd.merge(
        plays,
        games[['game_id', 'date', 'venue', 'attendance']],
        left_on='gameId',
        right_on='game_id',
        how='left',
        validate='many_to_one'
    )

    # Check merge success
    unmatched = merged[merged['game_id'].isna()]
    if len(unmatched) > 0:
        print(f"WARNING: {len(unmatched)} plays couldn't be matched to games")
        print(f"  Unmatched game IDs: {unmatched['gameId'].unique()}")
    else:
        print("All plays successfully matched to games!")

    # Drop redundant column
    merged = merged.drop(columns=['game_id'])

    return merged


plays_with_games = merge_plays_games(plays_clean, games_clean)
print(f"\nMerged shape: {plays_with_games.shape}")
print(plays_with_games.head())
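pandas can also report match status directly through merge's indicator=True parameter, a lightweight alternative to checking for nulls after the join. A toy sketch with made-up IDs:

```python
import pandas as pd

plays = pd.DataFrame({'gameId': ['g1', 'g1', 'g9']})
games = pd.DataFrame({'game_id': ['g1', 'g2'], 'venue': ['A', 'B']})

# The _merge column is 'both' for matched rows, 'left_only' for orphans
merged = plays.merge(games, left_on='gameId', right_on='game_id',
                     how='left', indicator=True)
print(merged['_merge'].value_counts())
```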

Adding Team Metadata

def add_team_metadata(plays: pd.DataFrame,
                       teams: pd.DataFrame) -> pd.DataFrame:
    """Add conference and other team info to plays."""

    # Add offensive team's conference
    plays_enriched = pd.merge(
        plays,
        teams[['team_name', 'conference']],
        left_on='offense',
        right_on='team_name',
        how='left'
    )
    plays_enriched = plays_enriched.rename(columns={'conference': 'offense_conf'})
    plays_enriched = plays_enriched.drop(columns=['team_name'])

    # Add defensive team's conference
    plays_enriched = pd.merge(
        plays_enriched,
        teams[['team_name', 'conference']],
        left_on='defense',
        right_on='team_name',
        how='left'
    )
    plays_enriched = plays_enriched.rename(columns={'conference': 'defense_conf'})
    plays_enriched = plays_enriched.drop(columns=['team_name'])

    # Check for unmatched teams
    unmatched_off = plays_enriched[plays_enriched['offense_conf'].isna()]['offense'].unique()
    unmatched_def = plays_enriched[plays_enriched['defense_conf'].isna()]['defense'].unique()

    if len(unmatched_off) > 0:
        print(f"Unmatched offensive teams: {unmatched_off}")
    if len(unmatched_def) > 0:
        print(f"Unmatched defensive teams: {unmatched_def}")

    return plays_enriched


plays_enriched = add_team_metadata(plays_with_games, teams_raw)
print(f"\nEnriched shape: {plays_enriched.shape}")
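Merging twice works, but for a single-column lookup a dict-based map is a simpler alternative that avoids the rename/drop bookkeeping. A sketch, equivalent for this case (though unlike a left merge, unmapped names simply become NaN without a warning):

```python
import pandas as pd

teams = pd.DataFrame({'team_name': ['Alabama', 'Georgia'],
                      'conference': ['SEC', 'SEC']})
plays = pd.DataFrame({'offense': ['Alabama', 'Georgia'],
                      'defense': ['Georgia', 'Alabama']})

# Build the lookup once, then map both columns against it
conf_map = teams.set_index('team_name')['conference'].to_dict()
plays['offense_conf'] = plays['offense'].map(conf_map)
plays['defense_conf'] = plays['defense'].map(conf_map)
print(plays)
```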

Integrating Weather Data

def create_venue_weather_mapping() -> Dict[str, str]:
    """Map venue names to weather locations."""
    return {
        'Bryant-Denny Stadium': 'Tuscaloosa, AL',
        'Ohio Stadium': 'Columbus, OH',
        'Tiger Stadium': 'Baton Rouge, LA',
        'Sanford Stadium': 'Athens, GA',
        'Michigan Stadium': 'Ann Arbor, MI',
        'Beaver Stadium': 'University Park, PA'
    }


def add_weather(plays: pd.DataFrame,
                weather: pd.DataFrame,
                games: pd.DataFrame) -> pd.DataFrame:
    """Add weather data to plays using venue mapping."""

    # Create venue to weather location mapping
    venue_map = create_venue_weather_mapping()

    # Add weather location to games
    games_weather = games.copy()
    games_weather['weather_loc'] = games_weather['venue'].map(venue_map)

    # Weather dates are strings; parse them so they match the datetime-typed
    # games table, then merge on date and location
    weather_clean = weather.copy()
    weather_clean['date'] = pd.to_datetime(weather_clean['date'])

    games_weather = pd.merge(
        games_weather,
        weather_clean,
        left_on=['date', 'weather_loc'],
        right_on=['date', 'location'],
        how='left',
        suffixes=('', '_weather')
    )

    # Now merge weather info into plays
    plays_weather = pd.merge(
        plays,
        games_weather[['game_id', 'temp_f', 'humidity', 'wind_mph', 'conditions']],
        left_on='gameId',
        right_on='game_id',
        how='left'
    )
    plays_weather = plays_weather.drop(columns=['game_id'], errors='ignore')

    # Report weather coverage
    has_weather = plays_weather['temp_f'].notna().sum()
    print(f"Plays with weather data: {has_weather}/{len(plays_weather)} ({has_weather/len(plays_weather)*100:.1f}%)")

    return plays_weather


# Note: this is a simplified example; real weather integration is more
# complex (time zones, kickoff time vs. daily summaries, missing stations)
print("\nAdding weather data...")
plays_weather = add_weather(plays_enriched, weather_raw, games_clean)

Part 4: Feature Engineering

def create_analysis_features(df: pd.DataFrame) -> pd.DataFrame:
    """Create derived features for analysis."""
    df_features = df.copy()

    # Distance category
    df_features['distance_cat'] = pd.cut(
        df_features['distance'],
        bins=[0, 3, 7, 15, 100],
        labels=['short', 'medium', 'long', 'very_long']
    )

    # Passing down indicator
    df_features['passing_down'] = (
        ((df_features['down'] == 2) & (df_features['distance'] >= 8)) |
        ((df_features['down'] == 3) & (df_features['distance'] >= 5)) |
        (df_features['down'] == 4)
    ).astype(int)

    # Play success (gained 40%+ of needed yards on 1st, 50%+ on 2nd, 100% on 3rd/4th)
    def calculate_success(row):
        if pd.isna(row['down']) or pd.isna(row['yardsGained']):
            return np.nan
        if row['down'] == 1:
            return 1 if row['yardsGained'] >= row['distance'] * 0.4 else 0
        elif row['down'] == 2:
            return 1 if row['yardsGained'] >= row['distance'] * 0.5 else 0
        else:  # 3rd and 4th down
            return 1 if row['yardsGained'] >= row['distance'] else 0

    df_features['play_success'] = df_features.apply(calculate_success, axis=1)

    # Conference game indicator
    df_features['conf_game'] = (
        df_features['offense_conf'] == df_features['defense_conf']
    ).astype(int)

    print("Features created:")
    print(f"  distance_cat: {df_features['distance_cat'].value_counts().to_dict()}")
    print(f"  passing_down: {df_features['passing_down'].value_counts().to_dict()}")
    print(f"  play_success: {df_features['play_success'].value_counts().to_dict()}")
    print(f"  conf_game: {df_features['conf_game'].value_counts().to_dict()}")

    return df_features


plays_final = create_analysis_features(plays_enriched)
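The row-wise apply in calculate_success is fine at this scale but slow over a full season of plays. The same rule can be vectorized with np.select; a sketch on toy data, with thresholds matching the rule stated above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'down': [1, 2, 3, 4],
                   'distance': [10, 10, 5, 1],
                   'yardsGained': [4.0, 4.0, 5.0, 0.0]})

# Yards needed for a "successful" play: 40% of distance on 1st down,
# 50% on 2nd, the full distance on 3rd and 4th
needed = np.select(
    [df['down'] == 1, df['down'] == 2, df['down'] >= 3],
    [df['distance'] * 0.4, df['distance'] * 0.5, df['distance']],
    default=np.nan
)
df['play_success'] = (df['yardsGained'] >= needed).astype(int)
print(df['play_success'].tolist())  # [1, 0, 1, 0]
```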

Part 5: Validation and Export

def validate_final_dataset(df: pd.DataFrame) -> Dict:
    """Comprehensive validation of the final dataset."""
    print("\n" + "=" * 60)
    print("FINAL DATASET VALIDATION")
    print("=" * 60)

    validation = {
        'passed': True,
        'checks': [],
        'warnings': []
    }

    # Check 1: No duplicates
    dups = df.duplicated(subset=['gameId', 'playId']).sum()
    if dups > 0:
        validation['passed'] = False
        validation['checks'].append(f"FAIL: {dups} duplicate plays found")
    else:
        validation['checks'].append("PASS: No duplicate plays")

    # Check 2: Valid down values
    invalid_downs = df['down'].notna() & ~df['down'].isin([1, 2, 3, 4])
    if invalid_downs.sum() > 0:
        validation['passed'] = False
        validation['checks'].append(f"FAIL: {invalid_downs.sum()} invalid down values")
    else:
        validation['checks'].append("PASS: All down values valid (1-4)")

    # Check 3: Valid yards range
    extreme_yards = (df['yardsGained'] < -20) | (df['yardsGained'] > 100)
    if extreme_yards.sum() > 0:
        validation['warnings'].append(f"WARNING: {extreme_yards.sum()} extreme yards values")
    else:
        validation['checks'].append("PASS: All yards in reasonable range")

    # Check 4: Team names standardized
    unique_teams = set(df['offense'].unique()) | set(df['defense'].unique())
    if any(t.isupper() or t.islower() for t in unique_teams if isinstance(t, str)):
        validation['warnings'].append("WARNING: Some team names may not be standardized")
    else:
        validation['checks'].append("PASS: Team names appear standardized")

    # Check 5: No critical nulls
    critical_cols = ['gameId', 'playId', 'offense', 'defense', 'yardsGained']
    for col in critical_cols:
        nulls = df[col].isnull().sum()
        if nulls > 0:
            validation['passed'] = False
            validation['checks'].append(f"FAIL: {nulls} nulls in {col}")
        else:
            validation['checks'].append(f"PASS: No nulls in {col}")

    # Print results
    for check in validation['checks']:
        print(f"  {check}")
    for warning in validation['warnings']:
        print(f"  {warning}")

    print(f"\nOverall: {'PASSED' if validation['passed'] else 'FAILED'}")

    return validation


validation_result = validate_final_dataset(plays_final)


def export_dataset(df: pd.DataFrame, path: str):
    """Export the final dataset with documentation."""

    # Add metadata
    metadata = {
        'created': pd.Timestamp.now().isoformat(),
        'rows': len(df),
        'columns': list(df.columns),
        'sources': ['play_by_play_api', 'game_schedule', 'team_reference', 'weather_api']
    }

    print(f"\nExporting {len(df)} rows to {path}")
    # In practice: df.to_parquet(path) or df.to_csv(path)

    return metadata


metadata = export_dataset(plays_final, 'season_plays_clean.parquet')
print(f"Export metadata: {metadata}")
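export_dataset returns the metadata but never persists it. One option (a sketch; the sidecar-file convention is this example's assumption, not the chapter's) is to serialize it as JSON alongside the data file:

```python
import json
import pandas as pd

df = pd.DataFrame({'gameId': ['2023010001'], 'playId': [1]})
metadata = {
    'created': pd.Timestamp.now().isoformat(),
    'rows': len(df),
    'columns': list(df.columns),
}

# In practice this would be written next to the parquet file,
# e.g. as season_plays_clean.meta.json
sidecar = json.dumps(metadata, indent=2)
print(sidecar)
```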

Summary and Key Learnings

Pipeline Steps Completed

  1. Data Exploration: Identified issues in each source
  2. Standardization: Created consistent team naming
  3. Type Conversion: Fixed data types for analysis
  4. Missing Values: Handled with domain-appropriate strategies
  5. Integration: Merged four data sources
  6. Feature Engineering: Created analysis-ready derived columns
  7. Validation: Verified data quality
  8. Export: Saved with documentation

Best Practices Demonstrated

  • Build reusable standardization classes
  • Validate at each pipeline stage
  • Document transformations applied
  • Handle edge cases explicitly
  • Create features that support downstream analysis

Common Pitfalls Avoided

  • Losing data in merges (used left joins with validation)
  • Type conversion errors (used coerce option)
  • Inconsistent naming (systematic standardization)
  • Undocumented transformations (logged each step)

Exercises

  1. Extend the team standardizer to handle 30 more teams with at least 3 variants each.

  2. Add data lineage tracking that records which source each value came from.

  3. Create a weather imputation function that fills missing weather using nearby game data.

  4. Build a pipeline class that wraps all these steps with proper error handling.