Case Study 1: Building a Season Database from Multiple Sources
Overview
In this case study, you'll combine play-by-play data, game metadata, team statistics, and weather information from multiple sources into a unified, analysis-ready database. This scenario mirrors real-world sports analytics work where data must be collected, cleaned, and integrated from various providers.
The Scenario
You're a data analyst for a college football media company preparing for the upcoming season. Your task is to create a comprehensive database of the previous season's games that can support various analytical needs:
- Writer queries for game recaps and historical comparisons
- Statistical analysis for predictions and rankings
- Visualization dashboards for on-air graphics
You have data from four sources, each with its own quirks:
- Play-by-Play Data: Detailed play records from a third-party API
- Game Information: Schedule and results from the conference website
- Team Reference: Team metadata including conference affiliations
- Weather Data: Historical weather conditions from a weather API
Part 1: Data Exploration and Assessment
Understanding the Raw Data
Let's examine each data source:
import pandas as pd
import numpy as np
from typing import Dict, List, Tuple
# Source 1: Play-by-Play Data
plays_raw = pd.DataFrame({
    'gameId': ['2023010001', '2023010001', '2023010001', '2023010001',
               '2023010002', '2023010002', '2023010002'],
    'playId': [1, 2, 3, 4, 1, 2, 3],
    'offense': ['ALABAMA', 'Alabama', 'BAMA', 'Georgia',
                'ohio state', 'Ohio St.', 'Michigan'],
    'defense': ['Georgia', 'georgia', 'UGA', 'Alabama',
                'Michigan', 'michigan', 'Ohio State'],
    'down': ['1', '2', '3', '1', '1', '2', '3'],
    'distance': [10, 7, 3, 10, 10, 8, 5],
    'yardsGained': ['5', '-2', 'INC', '8', '12', None, '6'],
    'playType': ['run', 'pass', 'pass', 'run', 'run', 'pass', 'run'],
    'quarter': [1, 1, 1, 1, 1, 1, 1],
    'clock': ['15:00', '14:25', '13:58', '13:12', '15:00', '14:32', '14:05']
})
print("=" * 60)
print("PLAY-BY-PLAY DATA ISSUES")
print("=" * 60)
print(f"Shape: {plays_raw.shape}")
print(f"\nColumn types:")
print(plays_raw.dtypes)
print(f"\nSample:")
print(plays_raw.head())
print(f"\nUnique offense values: {plays_raw['offense'].unique()}")
Issues Identified:
- gameId is stored as a string here, while the games table stores game_id as an integer, so the key types must be aligned before joining
- down is a string (should be integer)
- yardsGained contains "INC" for incomplete passes and has missing values
- Team names are inconsistent across rows
- No explicit game date or team identifiers
# Source 2: Game Information
games_raw = pd.DataFrame({
    'game_id': [2023010001, 2023010002, 2023010003],
    'date': ['2023-09-02', '2023-09-02', '2023-09-09'],
    'home': ['Alabama Crimson Tide', 'Ohio State Buckeyes', 'LSU Tigers'],
    'away': ['Georgia Bulldogs', 'Michigan Wolverines', 'Florida State Seminoles'],
    'home_score': [28, 42, 35],
    'away_score': [24, 27, 38],
    'venue': ['Bryant-Denny Stadium', 'Ohio Stadium', 'Tiger Stadium'],
    'attendance': ['101,821', '104,944', '102,321']
})
print("\n" + "=" * 60)
print("GAME INFORMATION ISSUES")
print("=" * 60)
print(f"Shape: {games_raw.shape}")
print(f"\nColumn types:")
print(games_raw.dtypes)
print(f"\nSample home teams: {games_raw['home'].unique()}")
Issues Identified:
- Team names include mascots (a different format from the play-by-play data)
- Attendance has commas (string, not numeric)
- Date is a string (should be datetime)
# Source 3: Team Reference Data
teams_raw = pd.DataFrame({
    'team_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'team_name': ['Alabama', 'Georgia', 'Ohio State', 'Michigan',
                  'LSU', 'Florida State', 'Texas', 'Oregon'],
    'conference': ['SEC', 'SEC', 'Big Ten', 'Big Ten',
                   'SEC', 'ACC', 'Big 12', 'Pac-12'],
    'abbreviation': ['ALA', 'UGA', 'OSU', 'MICH', 'LSU', 'FSU', 'TEX', 'ORE'],
    'mascot': ['Crimson Tide', 'Bulldogs', 'Buckeyes', 'Wolverines',
               'Tigers', 'Seminoles', 'Longhorns', 'Ducks']
})
print("\n" + "=" * 60)
print("TEAM REFERENCE DATA")
print("=" * 60)
print(teams_raw)
# Source 4: Weather Data
weather_raw = pd.DataFrame({
    'date': ['2023-09-02', '2023-09-02', '2023-09-09'],
    'location': ['Tuscaloosa, AL', 'Columbus, OH', 'Baton Rouge, LA'],
    'temp_f': [92, 78, 88],
    'humidity': [65, 45, 72],
    'wind_mph': [8, 12, 5],
    'conditions': ['Partly Cloudy', 'Clear', 'Scattered Storms']
})
print("\n" + "=" * 60)
print("WEATHER DATA")
print("=" * 60)
print(weather_raw)
Challenge:
- Weather is keyed by location, not game_id
- Venues need to be mapped to weather locations
Part 2: Building the Cleaning Pipeline
Step 1: Team Name Standardization
class TeamNameStandardizer:
    """Standardize team names across all data sources."""

    def __init__(self):
        # Build comprehensive mapping
        self.name_to_standard = {
            # Alabama variants
            'alabama': 'Alabama',
            'bama': 'Alabama',
            'alabama crimson tide': 'Alabama',
            'ala': 'Alabama',
            # Georgia variants
            'georgia': 'Georgia',
            'uga': 'Georgia',
            'georgia bulldogs': 'Georgia',
            # Ohio State variants
            'ohio state': 'Ohio State',
            'ohio st': 'Ohio State',
            'ohio st.': 'Ohio State',
            'osu': 'Ohio State',
            'ohio state buckeyes': 'Ohio State',
            # Michigan variants
            'michigan': 'Michigan',
            'mich': 'Michigan',
            'michigan wolverines': 'Michigan',
            # LSU variants
            'lsu': 'LSU',
            'louisiana state': 'LSU',
            'lsu tigers': 'LSU',
            # Florida State variants
            'florida state': 'Florida State',
            'fsu': 'Florida State',
            'florida state seminoles': 'Florida State',
        }

    def standardize(self, name: str) -> str:
        """Convert any team name variant to standard form."""
        if pd.isna(name):
            return name
        cleaned = str(name).lower().strip()
        if cleaned in self.name_to_standard:
            return self.name_to_standard[cleaned]
        # Fall back to title case for unmapped variants (this can mask
        # new variants, so watch the unique-value counts below)
        return cleaned.title()

    def standardize_column(self, df: pd.DataFrame, column: str) -> pd.DataFrame:
        """Standardize all team names in a column."""
        df_result = df.copy()
        original_unique = df_result[column].nunique()
        df_result[column] = df_result[column].apply(self.standardize)
        new_unique = df_result[column].nunique()
        print(f"  {column}: {original_unique} → {new_unique} unique values")
        return df_result
# Apply standardization
standardizer = TeamNameStandardizer()
print("Standardizing play-by-play team names...")
plays_clean = standardizer.standardize_column(plays_raw.copy(), 'offense')
plays_clean = standardizer.standardize_column(plays_clean, 'defense')
print("\nStandardizing game team names...")
games_clean = standardizer.standardize_column(games_raw.copy(), 'home')
games_clean = standardizer.standardize_column(games_clean, 'away')
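A hand-built mapping will inevitably miss variants as sources change. One option is a fuzzy fallback built on the standard library's difflib: if a cleaned name is not in the mapping, accept a sufficiently close match against the known variants. The sketch below is illustrative, not part of the pipeline above; the fuzzy_standardize name and the 0.8 cutoff are assumptions to tune against your data, and accepted fuzzy matches should be logged for review.

import difflib

def fuzzy_standardize(name: str, standardizer: TeamNameStandardizer,
                      cutoff: float = 0.8) -> str:
    """Exact lookup first, then a fuzzy match against known variants."""
    cleaned = str(name).lower().strip()
    if cleaned in standardizer.name_to_standard:
        return standardizer.name_to_standard[cleaned]
    # Fall back to the closest known variant, if any is similar enough
    candidates = difflib.get_close_matches(
        cleaned, standardizer.name_to_standard.keys(), n=1, cutoff=cutoff)
    if candidates:
        return standardizer.name_to_standard[candidates[0]]
    return cleaned.title()

print(fuzzy_standardize('Ohio Stat', standardizer))  # likely 'Ohio State'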
Step 2: Data Type Corrections
def clean_plays_types(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and convert data types in play-by-play data."""
    df_clean = df.copy()
    # Convert gameId to string for consistent joining
    df_clean['gameId'] = df_clean['gameId'].astype(str)
    # Convert down to integer (coercing bad values to missing)
    df_clean['down'] = pd.to_numeric(df_clean['down'], errors='coerce').astype('Int64')

    # Convert yardsGained - handle special values
    def parse_yards(value):
        if pd.isna(value):
            return np.nan
        if isinstance(value, str):
            value = value.strip().upper()
            if value == 'INC':
                return 0.0  # Incomplete passes gain 0 yards
            if value in ['N/A', 'NULL', '']:
                return np.nan  # Genuinely missing, not a 0-yard play
            try:
                return float(value)
            except ValueError:
                return np.nan
        return float(value)

    df_clean['yardsGained'] = df_clean['yardsGained'].apply(parse_yards)
    print("Data types after cleaning:")
    print(df_clean.dtypes)
    return df_clean
def clean_games_types(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and convert data types in game data."""
    df_clean = df.copy()
    # Convert game_id to string
    df_clean['game_id'] = df_clean['game_id'].astype(str)
    # Convert date to datetime
    df_clean['date'] = pd.to_datetime(df_clean['date'])
    # Clean attendance (remove commas, convert to int)
    df_clean['attendance'] = df_clean['attendance'].str.replace(',', '').astype(int)
    print("Game data types after cleaning:")
    print(df_clean.dtypes)
    return df_clean
plays_clean = clean_plays_types(plays_clean)
games_clean = clean_games_types(games_clean)
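Before moving on, it is worth pinning the cleaned types down with a few assertions so a regression in either parser fails loudly rather than corrupting a later merge. A minimal sketch, assuming the conventions adopted above (string IDs, nullable-integer downs):

from pandas.api import types as ptypes

# Lightweight regression checks on the cleaned types
assert plays_clean['gameId'].dtype == object            # string game IDs
assert str(plays_clean['down'].dtype) == 'Int64'        # nullable integer
assert ptypes.is_float_dtype(plays_clean['yardsGained'])
assert ptypes.is_datetime64_any_dtype(games_clean['date'])
assert ptypes.is_integer_dtype(games_clean['attendance'])
print("Type checks passed")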
Step 3: Handling Missing Values
def handle_plays_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Handle missing values in play data."""
    df_clean = df.copy()
    # Report missing
    print("\nMissing values in plays:")
    missing = df_clean.isnull().sum()
    for col, count in missing[missing > 0].items():
        print(f"  {col}: {count} ({count/len(df_clean)*100:.1f}%)")
    # Strategy for yardsGained: 'INC' was already converted to 0;
    # treat the remaining nulls as plays with no gain
    df_clean['yardsGained'] = df_clean['yardsGained'].fillna(0)
    # Validate no critical columns have missing values
    critical_cols = ['gameId', 'playId', 'offense', 'defense']
    for col in critical_cols:
        null_count = df_clean[col].isnull().sum()
        if null_count > 0:
            print(f"  WARNING: {null_count} null values in critical column {col}")
    return df_clean
plays_clean = handle_plays_missing(plays_clean)
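Filling nulls with 0 erases the distinction between a genuine no-gain play and an imputed value. If downstream analysis cares about that difference, record an indicator column before filling; a small sketch (the _imputed suffix is an illustrative convention, not part of the pipeline above):

def fill_with_indicator(df: pd.DataFrame, column: str,
                        fill_value=0) -> pd.DataFrame:
    """Fill missing values but keep a flag marking which rows were imputed."""
    df_out = df.copy()
    df_out[f'{column}_imputed'] = df_out[column].isna().astype(int)
    df_out[column] = df_out[column].fillna(fill_value)
    return df_out

# Example: plays_clean = fill_with_indicator(plays_clean, 'yardsGained')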
Part 3: Data Integration
Merging Plays with Games
def merge_plays_games(plays: pd.DataFrame,
                      games: pd.DataFrame) -> pd.DataFrame:
    """Merge play-by-play data with game metadata."""
    # Check key compatibility
    plays_ids = set(plays['gameId'].unique())
    games_ids = set(games['game_id'].unique())
    print(f"\nUnique game IDs in plays: {len(plays_ids)}")
    print(f"Unique game IDs in games: {len(games_ids)}")
    print(f"Overlap: {len(plays_ids & games_ids)}")
    # Perform merge
    merged = pd.merge(
        plays,
        games[['game_id', 'date', 'venue', 'attendance']],
        left_on='gameId',
        right_on='game_id',
        how='left',
        validate='many_to_one'
    )
    # Check merge success
    unmatched = merged[merged['game_id'].isna()]
    if len(unmatched) > 0:
        print(f"WARNING: {len(unmatched)} plays couldn't be matched to games")
        print(f"  Unmatched game IDs: {unmatched['gameId'].unique()}")
    else:
        print("All plays successfully matched to games!")
    # Drop redundant column
    merged = merged.drop(columns=['game_id'])
    return merged
plays_with_games = merge_plays_games(plays_clean, games_clean)
print(f"\nMerged shape: {plays_with_games.shape}")
print(plays_with_games.head())
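If the overlap check ever shows missing IDs, pandas' indicator option turns the diagnosis into an explicit anti-join. A minimal sketch (find_orphan_plays is a hypothetical helper, not part of the pipeline above):

def find_orphan_plays(plays: pd.DataFrame, games: pd.DataFrame) -> pd.DataFrame:
    """Return plays whose gameId has no counterpart in the games table."""
    probe = pd.merge(
        plays[['gameId', 'playId']],
        games[['game_id']],
        left_on='gameId', right_on='game_id',
        how='left', indicator=True
    )
    return probe[probe['_merge'] == 'left_only'][['gameId', 'playId']]

orphans = find_orphan_plays(plays_clean, games_clean)
print(f"Orphan plays: {len(orphans)}")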
Adding Team Metadata
def add_team_metadata(plays: pd.DataFrame,
                      teams: pd.DataFrame) -> pd.DataFrame:
    """Add conference and other team info to plays."""
    # Add offensive team's conference
    plays_enriched = pd.merge(
        plays,
        teams[['team_name', 'conference']],
        left_on='offense',
        right_on='team_name',
        how='left'
    )
    plays_enriched = plays_enriched.rename(columns={'conference': 'offense_conf'})
    plays_enriched = plays_enriched.drop(columns=['team_name'])
    # Add defensive team's conference
    plays_enriched = pd.merge(
        plays_enriched,
        teams[['team_name', 'conference']],
        left_on='defense',
        right_on='team_name',
        how='left'
    )
    plays_enriched = plays_enriched.rename(columns={'conference': 'defense_conf'})
    plays_enriched = plays_enriched.drop(columns=['team_name'])
    # Check for unmatched teams
    unmatched_off = plays_enriched[plays_enriched['offense_conf'].isna()]['offense'].unique()
    unmatched_def = plays_enriched[plays_enriched['defense_conf'].isna()]['defense'].unique()
    if len(unmatched_off) > 0:
        print(f"Unmatched offensive teams: {unmatched_off}")
    if len(unmatched_def) > 0:
        print(f"Unmatched defensive teams: {unmatched_def}")
    return plays_enriched
plays_enriched = add_team_metadata(plays_with_games, teams_raw)
print(f"\nEnriched shape: {plays_enriched.shape}")
Integrating Weather Data
def create_venue_weather_mapping() -> Dict[str, str]:
    """Map venue names to weather locations."""
    return {
        'Bryant-Denny Stadium': 'Tuscaloosa, AL',
        'Ohio Stadium': 'Columbus, OH',
        'Tiger Stadium': 'Baton Rouge, LA',
        'Sanford Stadium': 'Athens, GA',
        'Michigan Stadium': 'Ann Arbor, MI',
        'Beaver Stadium': 'University Park, PA'
    }
def add_weather(plays: pd.DataFrame,
                weather: pd.DataFrame,
                games: pd.DataFrame) -> pd.DataFrame:
    """Add weather data to plays using venue mapping."""
    # Create venue to weather location mapping
    venue_map = create_venue_weather_mapping()
    # Add weather location to games
    games_weather = games.copy()
    games_weather['weather_loc'] = games_weather['venue'].map(venue_map)
    # Merge weather with games on date and location: convert the weather
    # date to datetime first so the key types match, and rename location
    # so both frames share the same key names
    weather_keyed = weather.copy()
    weather_keyed['date'] = pd.to_datetime(weather_keyed['date'])
    weather_keyed = weather_keyed.rename(columns={'location': 'weather_loc'})
    games_weather = pd.merge(
        games_weather,
        weather_keyed,
        on=['date', 'weather_loc'],
        how='left'
    )
    # Now merge weather info into plays
    plays_weather = pd.merge(
        plays,
        games_weather[['game_id', 'temp_f', 'humidity', 'wind_mph', 'conditions']],
        left_on='gameId',
        right_on='game_id',
        how='left'
    )
    plays_weather = plays_weather.drop(columns=['game_id'], errors='ignore')
    # Report weather coverage
    has_weather = plays_weather['temp_f'].notna().sum()
    print(f"Plays with weather data: {has_weather}/{len(plays_weather)} ({has_weather/len(plays_weather)*100:.1f}%)")
    return plays_weather
# Note: this is a simplified example; real weather integration
# (time zones, kickoff-time observations, station matching) is more involved
print("\nAdding weather data...")
plays_enriched = add_weather(plays_enriched, weather_raw, games_clean)
Part 4: Feature Engineering
def create_analysis_features(df: pd.DataFrame) -> pd.DataFrame:
    """Create derived features for analysis."""
    df_features = df.copy()
    # Distance category
    df_features['distance_cat'] = pd.cut(
        df_features['distance'],
        bins=[0, 3, 7, 15, 100],
        labels=['short', 'medium', 'long', 'very_long']
    )
    # Passing down indicator
    df_features['passing_down'] = (
        ((df_features['down'] == 2) & (df_features['distance'] >= 8)) |
        ((df_features['down'] == 3) & (df_features['distance'] >= 5)) |
        (df_features['down'] == 4)
    ).astype(int)

    # Play success (gained 40%+ of needed yards on 1st down,
    # 50%+ on 2nd, 100% on 3rd/4th)
    def calculate_success(row):
        if pd.isna(row['down']) or pd.isna(row['yardsGained']):
            return np.nan
        if row['down'] == 1:
            return 1 if row['yardsGained'] >= row['distance'] * 0.4 else 0
        elif row['down'] == 2:
            return 1 if row['yardsGained'] >= row['distance'] * 0.5 else 0
        else:  # 3rd and 4th down
            return 1 if row['yardsGained'] >= row['distance'] else 0

    df_features['play_success'] = df_features.apply(calculate_success, axis=1)
    # Conference game indicator
    df_features['conf_game'] = (
        df_features['offense_conf'] == df_features['defense_conf']
    ).astype(int)
    print("Features created:")
    print(f"  distance_cat: {df_features['distance_cat'].value_counts().to_dict()}")
    print(f"  passing_down: {df_features['passing_down'].value_counts().to_dict()}")
    print(f"  play_success: {df_features['play_success'].value_counts().to_dict()}")
    print(f"  conf_game: {df_features['conf_game'].value_counts().to_dict()}")
    return df_features
plays_final = create_analysis_features(plays_enriched)
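The row-wise apply above is fine at this scale; for a full season of plays the same success logic vectorizes with np.select. A sketch of the equivalent computation (the play_success_vec column name is illustrative):

# Vectorized equivalent of calculate_success
down = plays_final['down'].astype(float)  # pd.NA becomes NaN for comparisons
needed = np.select(
    [down == 1, down == 2, down >= 3],
    [plays_final['distance'] * 0.4,
     plays_final['distance'] * 0.5,
     plays_final['distance'] * 1.0],
    default=np.nan,
)
success = (plays_final['yardsGained'] >= needed).astype(float)
# Preserve missingness instead of treating unknown downs as failures
success[np.isnan(needed) | plays_final['yardsGained'].isna()] = np.nan
plays_final['play_success_vec'] = success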
Part 5: Validation and Export
def validate_final_dataset(df: pd.DataFrame) -> Dict:
    """Comprehensive validation of the final dataset."""
    print("\n" + "=" * 60)
    print("FINAL DATASET VALIDATION")
    print("=" * 60)
    validation = {
        'passed': True,
        'checks': [],
        'warnings': []
    }
    # Check 1: No duplicates
    dups = df.duplicated(subset=['gameId', 'playId']).sum()
    if dups > 0:
        validation['passed'] = False
        validation['checks'].append(f"FAIL: {dups} duplicate plays found")
    else:
        validation['checks'].append("PASS: No duplicate plays")
    # Check 2: Valid down values
    invalid_downs = df['down'].notna() & ~df['down'].isin([1, 2, 3, 4])
    if invalid_downs.sum() > 0:
        validation['passed'] = False
        validation['checks'].append(f"FAIL: {invalid_downs.sum()} invalid down values")
    else:
        validation['checks'].append("PASS: All down values valid (1-4)")
    # Check 3: Valid yards range
    extreme_yards = (df['yardsGained'] < -20) | (df['yardsGained'] > 100)
    if extreme_yards.sum() > 0:
        validation['warnings'].append(f"WARNING: {extreme_yards.sum()} extreme yards values")
    else:
        validation['checks'].append("PASS: All yards in reasonable range")
    # Check 4: Team names standardized (all-caps or all-lowercase names are
    # suspicious, but legitimate acronyms like 'LSU' are allowed)
    known_acronyms = {'LSU'}
    unique_teams = set(df['offense'].unique()) | set(df['defense'].unique())
    suspicious = [t for t in unique_teams if isinstance(t, str)
                  and t not in known_acronyms and (t.isupper() or t.islower())]
    if suspicious:
        validation['warnings'].append(f"WARNING: Team names may not be standardized: {suspicious}")
    else:
        validation['checks'].append("PASS: Team names appear standardized")
    # Check 5: No critical nulls
    critical_cols = ['gameId', 'playId', 'offense', 'defense', 'yardsGained']
    for col in critical_cols:
        nulls = df[col].isnull().sum()
        if nulls > 0:
            validation['passed'] = False
            validation['checks'].append(f"FAIL: {nulls} nulls in {col}")
        else:
            validation['checks'].append(f"PASS: No nulls in {col}")
    # Print results
    for check in validation['checks']:
        print(f"  {check}")
    for warning in validation['warnings']:
        print(f"  {warning}")
    print(f"\nOverall: {'PASSED' if validation['passed'] else 'FAILED'}")
    return validation
validation_result = validate_final_dataset(plays_final)
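In a scheduled pipeline, a failed validation should halt the run rather than just print. One lightweight pattern, sketched with a hypothetical require_valid helper:

def require_valid(validation: Dict) -> None:
    """Raise if any validation check failed, so a scheduler can halt the run."""
    if not validation['passed']:
        failed = [c for c in validation['checks'] if c.startswith('FAIL')]
        raise ValueError(f"Dataset validation failed: {failed}")

require_valid(validation_result)  # no-op when everything passed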
def export_dataset(df: pd.DataFrame, path: str):
    """Export the final dataset with documentation."""
    # Add metadata
    metadata = {
        'created': pd.Timestamp.now().isoformat(),
        'rows': len(df),
        'columns': list(df.columns),
        'sources': ['play_by_play_api', 'game_schedule', 'team_reference', 'weather_api']
    }
    print(f"\nExporting {len(df)} rows to {path}")
    # In practice: df.to_parquet(path) or df.to_csv(path)
    return metadata
metadata = export_dataset(plays_final, 'season_plays_clean.parquet')
print(f"Export metadata: {metadata}")
Summary and Key Learnings
Pipeline Steps Completed
- Data Exploration: Identified issues in each source
- Standardization: Created consistent team naming
- Type Conversion: Fixed data types for analysis
- Missing Values: Handled with domain-appropriate strategies
- Integration: Merged four data sources
- Feature Engineering: Created analysis-ready derived columns
- Validation: Verified data quality
- Export: Saved with documentation
Best Practices Demonstrated
- Build reusable standardization classes
- Validate at each pipeline stage
- Document transformations applied
- Handle edge cases explicitly
- Create features that support downstream analysis
Common Pitfalls Avoided
- Losing data in merges (used left joins with validation; see the sketch after this list)
- Type conversion errors (used coerce option)
- Inconsistent naming (systematic standardization)
- Undocumented transformations (logged each step)
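As a concrete illustration of the first pitfall, pandas can enforce merge cardinality for you; a tiny sketch with hypothetical frames shows the failure mode being caught:

# A duplicate key on the left violates the declared one_to_one relationship
left = pd.DataFrame({'k': [1, 1], 'v': ['a', 'b']})
right = pd.DataFrame({'k': [1], 'w': ['x']})
try:
    pd.merge(left, right, on='k', validate='one_to_one')
except pd.errors.MergeError as exc:
    print(f"Merge validation caught a cardinality problem: {exc}")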
Exercises
- Extend the team standardizer to handle 30 more teams with at least 3 variants each.
- Add data lineage tracking that records which source each value came from.
- Create a weather imputation function that fills missing weather using nearby game data.
- Build a pipeline class that wraps all these steps with proper error handling.