Case Study 1: Building a Season Database from Multiple Sources
Overview
In this case study, you'll combine play-by-play data, game metadata, team statistics, and weather information from multiple sources into a unified, analysis-ready database. This scenario mirrors real-world sports analytics work where data must be collected, cleaned, and integrated from various providers.
The Scenario
You're a data analyst for a college football media company preparing for the upcoming season. Your task is to create a comprehensive database of the previous season's games that can support various analytical needs:
- Writer queries for game recaps and historical comparisons
- Statistical analysis for predictions and rankings
- Visualization dashboards for on-air graphics
You have data from four sources, each with its own quirks:
- Play-by-Play Data: Detailed play records from a third-party API
- Game Information: Schedule and results from the conference website
- Team Reference: Team metadata including conference affiliations
- Weather Data: Historical weather conditions from a weather API
Part 1: Data Exploration and Assessment
Understanding the Raw Data
Let's examine each data source:
import pandas as pd
import numpy as np
from typing import Dict, List, Tuple
# Source 1: Play-by-Play Data
plays_raw = pd.DataFrame({
    'gameId': ['2023010001', '2023010001', '2023010001', '2023010001',
               '2023010002', '2023010002', '2023010002'],
    'playId': [1, 2, 3, 4, 1, 2, 3],
    'offense': ['ALABAMA', 'Alabama', 'BAMA', 'Georgia',
                'ohio state', 'Ohio St.', 'Michigan'],
    'defense': ['Georgia', 'georgia', 'UGA', 'Alabama',
                'Michigan', 'michigan', 'Ohio State'],
    'down': ['1', '2', '3', '1', '1', '2', '3'],
    'distance': [10, 7, 3, 10, 10, 8, 5],
    'yardsGained': ['5', '-2', 'INC', '8', '12', None, '6'],
    'playType': ['run', 'pass', 'pass', 'run', 'run', 'pass', 'run'],
    'quarter': [1, 1, 1, 1, 1, 1, 1],
    'clock': ['15:00', '14:25', '13:58', '13:12', '15:00', '14:32', '14:05']
})
print("=" * 60)
print("PLAY-BY-PLAY DATA ISSUES")
print("=" * 60)
print(f"Shape: {plays_raw.shape}")
print(f"\nColumn types:")
print(plays_raw.dtypes)
print(f"\nSample:")
print(plays_raw.head())
print(f"\nUnique offense values: {plays_raw['offense'].unique()}")
Issues Identified:
- gameId is stored as a string here, while the games table stores game_id as an integer, so the key types must be aligned before joining
- down is a string (should be integer)
- yardsGained contains "INC" for incomplete passes and has missing values
- Team names are inconsistent across rows
- No explicit game date or team identifiers
# Source 2: Game Information
games_raw = pd.DataFrame({
    'game_id': [2023010001, 2023010002, 2023010003],
    'date': ['2023-09-02', '2023-09-02', '2023-09-09'],
    'home': ['Alabama Crimson Tide', 'Ohio State Buckeyes', 'LSU Tigers'],
    'away': ['Georgia Bulldogs', 'Michigan Wolverines', 'Florida State Seminoles'],
    'home_score': [28, 42, 35],
    'away_score': [24, 27, 38],
    'venue': ['Bryant-Denny Stadium', 'Ohio Stadium', 'Tiger Stadium'],
    'attendance': ['101,821', '104,944', '102,321']
})
print("\n" + "=" * 60)
print("GAME INFORMATION ISSUES")
print("=" * 60)
print(f"Shape: {games_raw.shape}")
print(f"\nColumn types:")
print(games_raw.dtypes)
print(f"\nSample home teams: {games_raw['home'].unique()}")
Issues Identified:
- Team names include mascots (a different format from the play-by-play data)
- Attendance has commas (string, not numeric)
- Date is a string (should be datetime)
# Source 3: Team Reference Data
teams_raw = pd.DataFrame({
    'team_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'team_name': ['Alabama', 'Georgia', 'Ohio State', 'Michigan',
                  'LSU', 'Florida State', 'Texas', 'Oregon'],
    'conference': ['SEC', 'SEC', 'Big Ten', 'Big Ten',
                   'SEC', 'ACC', 'Big 12', 'Pac-12'],
    'abbreviation': ['ALA', 'UGA', 'OSU', 'MICH', 'LSU', 'FSU', 'TEX', 'ORE'],
    'mascot': ['Crimson Tide', 'Bulldogs', 'Buckeyes', 'Wolverines',
               'Tigers', 'Seminoles', 'Longhorns', 'Ducks']
})
print("\n" + "=" * 60)
print("TEAM REFERENCE DATA")
print("=" * 60)
print(teams_raw)
# Source 4: Weather Data
weather_raw = pd.DataFrame({
    'date': ['2023-09-02', '2023-09-02', '2023-09-09'],
    'location': ['Tuscaloosa, AL', 'Columbus, OH', 'Baton Rouge, LA'],
    'temp_f': [92, 78, 88],
    'humidity': [65, 45, 72],
    'wind_mph': [8, 12, 5],
    'conditions': ['Partly Cloudy', 'Clear', 'Scattered Storms']
})
print("\n" + "=" * 60)
print("WEATHER DATA")
print("=" * 60)
print(weather_raw)
Challenge:
- Weather is keyed by location, not game_id
- Venues need to be mapped to weather locations
Part 2: Building the Cleaning Pipeline
Step 1: Team Name Standardization
class TeamNameStandardizer:
    """Standardize team names across all data sources."""

    def __init__(self):
        # Build comprehensive mapping
        self.name_to_standard = {
            # Alabama variants
            'alabama': 'Alabama',
            'bama': 'Alabama',
            'alabama crimson tide': 'Alabama',
            'ala': 'Alabama',
            # Georgia variants
            'georgia': 'Georgia',
            'uga': 'Georgia',
            'georgia bulldogs': 'Georgia',
            # Ohio State variants
            'ohio state': 'Ohio State',
            'ohio st': 'Ohio State',
            'ohio st.': 'Ohio State',
            'osu': 'Ohio State',
            'ohio state buckeyes': 'Ohio State',
            # Michigan variants
            'michigan': 'Michigan',
            'mich': 'Michigan',
            'michigan wolverines': 'Michigan',
            # LSU variants
            'lsu': 'LSU',
            'louisiana state': 'LSU',
            'lsu tigers': 'LSU',
            # Florida State variants
            'florida state': 'Florida State',
            'fsu': 'Florida State',
            'florida state seminoles': 'Florida State',
        }

    def standardize(self, name: str) -> str:
        """Convert any team name variant to standard form."""
        if pd.isna(name):
            return name
        cleaned = str(name).lower().strip()
        if cleaned in self.name_to_standard:
            return self.name_to_standard[cleaned]
        # Fall back to title case for unmapped variants (this can mask
        # new variants, so watch the unique-value counts below)
        return cleaned.title()

    def standardize_column(self, df: pd.DataFrame, column: str) -> pd.DataFrame:
        """Standardize all team names in a column."""
        df_result = df.copy()
        original_unique = df_result[column].nunique()
        df_result[column] = df_result[column].apply(self.standardize)
        new_unique = df_result[column].nunique()
        print(f"  {column}: {original_unique} → {new_unique} unique values")
        return df_result
# Apply standardization
standardizer = TeamNameStandardizer()
print("Standardizing play-by-play team names...")
plays_clean = standardizer.standardize_column(plays_raw.copy(), 'offense')
plays_clean = standardizer.standardize_column(plays_clean, 'defense')
print("\nStandardizing game team names...")
games_clean = standardizer.standardize_column(games_raw.copy(), 'home')
games_clean = standardizer.standardize_column(games_clean, 'away')
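A hand-built mapping will inevitably miss variants as sources change. One option is a fuzzy fallback built on the standard library's difflib: if a cleaned name is not in the mapping, accept a sufficiently close match against the known variants. The sketch below is illustrative, not part of the pipeline above; the fuzzy_standardize name and the 0.8 cutoff are assumptions to tune against your data, and accepted fuzzy matches should be logged for review.

import difflib

def fuzzy_standardize(name: str, standardizer: TeamNameStandardizer,
                      cutoff: float = 0.8) -> str:
    """Exact lookup first, then a fuzzy match against known variants."""
    cleaned = str(name).lower().strip()
    if cleaned in standardizer.name_to_standard:
        return standardizer.name_to_standard[cleaned]
    # Fall back to the closest known variant, if any is similar enough
    candidates = difflib.get_close_matches(
        cleaned, standardizer.name_to_standard.keys(), n=1, cutoff=cutoff)
    if candidates:
        return standardizer.name_to_standard[candidates[0]]
    return cleaned.title()

print(fuzzy_standardize('Ohio Stat', standardizer))  # likely 'Ohio State'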
Step 2: Data Type Corrections
def clean_plays_types(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and convert data types in play-by-play data."""
    df_clean = df.copy()
    # Convert gameId to string for consistent joining
    df_clean['gameId'] = df_clean['gameId'].astype(str)
    # Convert down to integer (coercing bad values to missing)
    df_clean['down'] = pd.to_numeric(df_clean['down'], errors='coerce').astype('Int64')

    # Convert yardsGained - handle special values
    def parse_yards(value):
        if pd.isna(value):
            return np.nan
        if isinstance(value, str):
            value = value.strip().upper()
            if value == 'INC':
                return 0.0  # Incomplete passes gain 0 yards
            if value in ['N/A', 'NULL', '']:
                return np.nan  # Genuinely missing, not a 0-yard play
            try:
                return float(value)
            except ValueError:
                return np.nan
        return float(value)

    df_clean['yardsGained'] = df_clean['yardsGained'].apply(parse_yards)
    print("Data types after cleaning:")
    print(df_clean.dtypes)
    return df_clean
def clean_games_types(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and convert data types in game data."""
    df_clean = df.copy()
    # Convert game_id to string
    df_clean['game_id'] = df_clean['game_id'].astype(str)
    # Convert date to datetime
    df_clean['date'] = pd.to_datetime(df_clean['date'])
    # Clean attendance (remove commas, convert to int)
    df_clean['attendance'] = df_clean['attendance'].str.replace(',', '').astype(int)
    print("Game data types after cleaning:")
    print(df_clean.dtypes)
    return df_clean
plays_clean = clean_plays_types(plays_clean)
games_clean = clean_games_types(games_clean)
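Before moving on, it is worth pinning the cleaned types down with a few assertions so a regression in either parser fails loudly rather than corrupting a later merge. A minimal sketch, assuming the conventions adopted above (string IDs, nullable-integer downs):

from pandas.api import types as ptypes

# Lightweight regression checks on the cleaned types
assert plays_clean['gameId'].dtype == object            # string game IDs
assert str(plays_clean['down'].dtype) == 'Int64'        # nullable integer
assert ptypes.is_float_dtype(plays_clean['yardsGained'])
assert ptypes.is_datetime64_any_dtype(games_clean['date'])
assert ptypes.is_integer_dtype(games_clean['attendance'])
print("Type checks passed")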
Step 3: Handling Missing Values
def handle_plays_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Handle missing values in play data."""
    df_clean = df.copy()
    # Report missing
    print("\nMissing values in plays:")
    missing = df_clean.isnull().sum()
    for col, count in missing[missing > 0].items():
        print(f"  {col}: {count} ({count/len(df_clean)*100:.1f}%)")
    # Strategy for yardsGained: 'INC' was already converted to 0;
    # treat the remaining nulls as plays with no gain
    df_clean['yardsGained'] = df_clean['yardsGained'].fillna(0)
    # Validate no critical columns have missing values
    critical_cols = ['gameId', 'playId', 'offense', 'defense']
    for col in critical_cols:
        null_count = df_clean[col].isnull().sum()
        if null_count > 0:
            print(f"  WARNING: {null_count} null values in critical column {col}")
    return df_clean
plays_clean = handle_plays_missing(plays_clean)
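Filling nulls with 0 erases the distinction between a genuine no-gain play and an imputed value. If downstream analysis cares about that difference, record an indicator column before filling; a small sketch (the _imputed suffix is an illustrative convention, not part of the pipeline above):

def fill_with_indicator(df: pd.DataFrame, column: str,
                        fill_value=0) -> pd.DataFrame:
    """Fill missing values but keep a flag marking which rows were imputed."""
    df_out = df.copy()
    df_out[f'{column}_imputed'] = df_out[column].isna().astype(int)
    df_out[column] = df_out[column].fillna(fill_value)
    return df_out

# Example: plays_clean = fill_with_indicator(plays_clean, 'yardsGained')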
Part 3: Data Integration
Merging Plays with Games
def merge_plays_games(plays: pd.DataFrame,
                      games: pd.DataFrame) -> pd.DataFrame:
    """Merge play-by-play data with game metadata."""
    # Check key compatibility
    plays_ids = set(plays['gameId'].unique())
    games_ids = set(games['game_id'].unique())
    print(f"\nUnique game IDs in plays: {len(plays_ids)}")
    print(f"Unique game IDs in games: {len(games_ids)}")
    print(f"Overlap: {len(plays_ids & games_ids)}")
    # Perform merge
    merged = pd.merge(
        plays,
        games[['game_id', 'date', 'venue', 'attendance']],
        left_on='gameId',
        right_on='game_id',
        how='left',
        validate='many_to_one'
    )
    # Check merge success
    unmatched = merged[merged['game_id'].isna()]
    if len(unmatched) > 0:
        print(f"WARNING: {len(unmatched)} plays couldn't be matched to games")
        print(f"  Unmatched game IDs: {unmatched['gameId'].unique()}")
    else:
        print("All plays successfully matched to games!")
    # Drop redundant column
    merged = merged.drop(columns=['game_id'])
    return merged
plays_with_games = merge_plays_games(plays_clean, games_clean)
print(f"\nMerged shape: {plays_with_games.shape}")
print(plays_with_games.head())
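If the overlap check ever shows missing IDs, pandas' indicator option turns the diagnosis into an explicit anti-join. A minimal sketch (find_orphan_plays is a hypothetical helper, not part of the pipeline above):

def find_orphan_plays(plays: pd.DataFrame, games: pd.DataFrame) -> pd.DataFrame:
    """Return plays whose gameId has no counterpart in the games table."""
    probe = pd.merge(
        plays[['gameId', 'playId']],
        games[['game_id']],
        left_on='gameId', right_on='game_id',
        how='left', indicator=True
    )
    return probe[probe['_merge'] == 'left_only'][['gameId', 'playId']]

orphans = find_orphan_plays(plays_clean, games_clean)
print(f"Orphan plays: {len(orphans)}")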
Adding Team Metadata
def add_team_metadata(plays: pd.DataFrame,
                      teams: pd.DataFrame) -> pd.DataFrame:
    """Add conference and other team info to plays."""
    # Add offensive team's conference
    plays_enriched = pd.merge(
        plays,
        teams[['team_name', 'conference']],
        left_on='offense',
        right_on='team_name',
        how='left'
    )
    plays_enriched = plays_enriched.rename(columns={'conference': 'offense_conf'})
    plays_enriched = plays_enriched.drop(columns=['team_name'])
    # Add defensive team's conference
    plays_enriched = pd.merge(
        plays_enriched,
        teams[['team_name', 'conference']],
        left_on='defense',
        right_on='team_name',
        how='left'
    )
    plays_enriched = plays_enriched.rename(columns={'conference': 'defense_conf'})
    plays_enriched = plays_enriched.drop(columns=['team_name'])
    # Check for unmatched teams
    unmatched_off = plays_enriched[plays_enriched['offense_conf'].isna()]['offense'].unique()
    unmatched_def = plays_enriched[plays_enriched['defense_conf'].isna()]['defense'].unique()
    if len(unmatched_off) > 0:
        print(f"Unmatched offensive teams: {unmatched_off}")
    if len(unmatched_def) > 0:
        print(f"Unmatched defensive teams: {unmatched_def}")
    return plays_enriched
plays_enriched = add_team_metadata(plays_with_games, teams_raw)
print(f"\nEnriched shape: {plays_enriched.shape}")
Integrating Weather Data
def create_venue_weather_mapping() -> Dict[str, str]:
    """Map venue names to weather locations."""
    return {
        'Bryant-Denny Stadium': 'Tuscaloosa, AL',
        'Ohio Stadium': 'Columbus, OH',
        'Tiger Stadium': 'Baton Rouge, LA',
        'Sanford Stadium': 'Athens, GA',
        'Michigan Stadium': 'Ann Arbor, MI',
        'Beaver Stadium': 'University Park, PA'
    }
def add_weather(plays: pd.DataFrame,
                weather: pd.DataFrame,
                games: pd.DataFrame) -> pd.DataFrame:
    """Add weather data to plays using venue mapping."""
    # Create venue to weather location mapping
    venue_map = create_venue_weather_mapping()
    # Add weather location to games
    games_weather = games.copy()
    games_weather['weather_loc'] = games_weather['venue'].map(venue_map)
    # Merge weather with games on date and location: convert the weather
    # date to datetime first so the key types match, and rename location
    # so both frames share the same key names
    weather_keyed = weather.copy()
    weather_keyed['date'] = pd.to_datetime(weather_keyed['date'])
    weather_keyed = weather_keyed.rename(columns={'location': 'weather_loc'})
    games_weather = pd.merge(
        games_weather,
        weather_keyed,
        on=['date', 'weather_loc'],
        how='left'
    )
    # Now merge weather info into plays
    plays_weather = pd.merge(
        plays,
        games_weather[['game_id', 'temp_f', 'humidity', 'wind_mph', 'conditions']],
        left_on='gameId',
        right_on='game_id',
        how='left'
    )
    plays_weather = plays_weather.drop(columns=['game_id'], errors='ignore')
    # Report weather coverage
    has_weather = plays_weather['temp_f'].notna().sum()
    print(f"Plays with weather data: {has_weather}/{len(plays_weather)} ({has_weather/len(plays_weather)*100:.1f}%)")
    return plays_weather
# Note: this is a simplified example; real weather integration
# (time zones, kickoff-time observations, station matching) is more involved
print("\nAdding weather data...")
plays_enriched = add_weather(plays_enriched, weather_raw, games_clean)
Part 4: Feature Engineering
def create_analysis_features(df: pd.DataFrame) -> pd.DataFrame:
    """Create derived features for analysis."""
    df_features = df.copy()
    # Distance category
    df_features['distance_cat'] = pd.cut(
        df_features['distance'],
        bins=[0, 3, 7, 15, 100],
        labels=['short', 'medium', 'long', 'very_long']
    )
    # Passing down indicator
    df_features['passing_down'] = (
        ((df_features['down'] == 2) & (df_features['distance'] >= 8)) |
        ((df_features['down'] == 3) & (df_features['distance'] >= 5)) |
        (df_features['down'] == 4)
    ).astype(int)

    # Play success (gained 40%+ of needed yards on 1st down,
    # 50%+ on 2nd, 100% on 3rd/4th)
    def calculate_success(row):
        if pd.isna(row['down']) or pd.isna(row['yardsGained']):
            return np.nan
        if row['down'] == 1:
            return 1 if row['yardsGained'] >= row['distance'] * 0.4 else 0
        elif row['down'] == 2:
            return 1 if row['yardsGained'] >= row['distance'] * 0.5 else 0
        else:  # 3rd and 4th down
            return 1 if row['yardsGained'] >= row['distance'] else 0

    df_features['play_success'] = df_features.apply(calculate_success, axis=1)
    # Conference game indicator
    df_features['conf_game'] = (
        df_features['offense_conf'] == df_features['defense_conf']
    ).astype(int)
    print("Features created:")
    print(f"  distance_cat: {df_features['distance_cat'].value_counts().to_dict()}")
    print(f"  passing_down: {df_features['passing_down'].value_counts().to_dict()}")
    print(f"  play_success: {df_features['play_success'].value_counts().to_dict()}")
    print(f"  conf_game: {df_features['conf_game'].value_counts().to_dict()}")
    return df_features
plays_final = create_analysis_features(plays_enriched)
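The row-wise apply above is fine at this scale; for a full season of plays the same success logic vectorizes with np.select. A sketch of the equivalent computation (the play_success_vec column name is illustrative):

# Vectorized equivalent of calculate_success
down = plays_final['down'].astype(float)  # pd.NA becomes NaN for comparisons
needed = np.select(
    [down == 1, down == 2, down >= 3],
    [plays_final['distance'] * 0.4,
     plays_final['distance'] * 0.5,
     plays_final['distance'] * 1.0],
    default=np.nan,
)
success = (plays_final['yardsGained'] >= needed).astype(float)
# Preserve missingness instead of treating unknown downs as failures
success[np.isnan(needed) | plays_final['yardsGained'].isna()] = np.nan
plays_final['play_success_vec'] = success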
Part 5: Validation and Export
def validate_final_dataset(df: pd.DataFrame) -> Dict:
    """Comprehensive validation of the final dataset."""
    print("\n" + "=" * 60)
    print("FINAL DATASET VALIDATION")
    print("=" * 60)
    validation = {
        'passed': True,
        'checks': [],
        'warnings': []
    }
    # Check 1: No duplicates
    dups = df.duplicated(subset=['gameId', 'playId']).sum()
    if dups > 0:
        validation['passed'] = False
        validation['checks'].append(f"FAIL: {dups} duplicate plays found")
    else:
        validation['checks'].append("PASS: No duplicate plays")
    # Check 2: Valid down values
    invalid_downs = df['down'].notna() & ~df['down'].isin([1, 2, 3, 4])
    if invalid_downs.sum() > 0:
        validation['passed'] = False
        validation['checks'].append(f"FAIL: {invalid_downs.sum()} invalid down values")
    else:
        validation['checks'].append("PASS: All down values valid (1-4)")
    # Check 3: Valid yards range
    extreme_yards = (df['yardsGained'] < -20) | (df['yardsGained'] > 100)
    if extreme_yards.sum() > 0:
        validation['warnings'].append(f"WARNING: {extreme_yards.sum()} extreme yards values")
    else:
        validation['checks'].append("PASS: All yards in reasonable range")
    # Check 4: Team names standardized (all-caps or all-lowercase names are
    # suspicious, but legitimate acronyms like 'LSU' are allowed)
    known_acronyms = {'LSU'}
    unique_teams = set(df['offense'].unique()) | set(df['defense'].unique())
    suspicious = [t for t in unique_teams if isinstance(t, str)
                  and t not in known_acronyms and (t.isupper() or t.islower())]
    if suspicious:
        validation['warnings'].append(f"WARNING: Team names may not be standardized: {suspicious}")
    else:
        validation['checks'].append("PASS: Team names appear standardized")
    # Check 5: No critical nulls
    critical_cols = ['gameId', 'playId', 'offense', 'defense', 'yardsGained']
    for col in critical_cols:
        nulls = df[col].isnull().sum()
        if nulls > 0:
            validation['passed'] = False
            validation['checks'].append(f"FAIL: {nulls} nulls in {col}")
        else:
            validation['checks'].append(f"PASS: No nulls in {col}")
    # Print results
    for check in validation['checks']:
        print(f"  {check}")
    for warning in validation['warnings']:
        print(f"  {warning}")
    print(f"\nOverall: {'PASSED' if validation['passed'] else 'FAILED'}")
    return validation
validation_result = validate_final_dataset(plays_final)
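In a scheduled pipeline, a failed validation should halt the run rather than just print. One lightweight pattern, sketched with a hypothetical require_valid helper:

def require_valid(validation: Dict) -> None:
    """Raise if any validation check failed, so a scheduler can halt the run."""
    if not validation['passed']:
        failed = [c for c in validation['checks'] if c.startswith('FAIL')]
        raise ValueError(f"Dataset validation failed: {failed}")

require_valid(validation_result)  # no-op when everything passed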
def export_dataset(df: pd.DataFrame, path: str):
    """Export the final dataset with documentation."""
    # Add metadata
    metadata = {
        'created': pd.Timestamp.now().isoformat(),
        'rows': len(df),
        'columns': list(df.columns),
        'sources': ['play_by_play_api', 'game_schedule', 'team_reference', 'weather_api']
    }
    print(f"\nExporting {len(df)} rows to {path}")
    # In practice: df.to_parquet(path) or df.to_csv(path)
    return metadata
metadata = export_dataset(plays_final, 'season_plays_clean.parquet')
print(f"Export metadata: {metadata}")
Summary and Key Learnings
Pipeline Steps Completed
- Data Exploration: Identified issues in each source
- Standardization: Created consistent team naming
- Type Conversion: Fixed data types for analysis
- Missing Values: Handled with domain-appropriate strategies
- Integration: Merged four data sources
- Feature Engineering: Created analysis-ready derived columns
- Validation: Verified data quality
- Export: Saved with documentation
Best Practices Demonstrated
- Build reusable standardization classes
- Validate at each pipeline stage
- Document transformations applied
- Handle edge cases explicitly
- Create features that support downstream analysis
Common Pitfalls Avoided
- Losing data in merges (used left joins with validation; see the sketch after this list)
- Type conversion errors (used coerce option)
- Inconsistent naming (systematic standardization)
- Undocumented transformations (logged each step)
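As a concrete illustration of the first pitfall, pandas can enforce merge cardinality for you; a tiny sketch with hypothetical frames shows the failure mode being caught:

# A duplicate key on the left violates the declared one_to_one relationship
left = pd.DataFrame({'k': [1, 1], 'v': ['a', 'b']})
right = pd.DataFrame({'k': [1], 'w': ['x']})
try:
    pd.merge(left, right, on='k', validate='one_to_one')
except pd.errors.MergeError as exc:
    print(f"Merge validation caught a cardinality problem: {exc}")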
Exercises
- Extend the team standardizer to handle 30 more teams with at least 3 variants each.
- Add data lineage tracking that records which source each value came from.
- Create a weather imputation function that fills missing weather using nearby game data.
- Build a pipeline class that wraps all these steps with proper error handling.