Exercises: Data Cleaning and Preparation
Practice cleaning, transforming, and validating football data with these exercises organized by difficulty level.
Level 1: Foundation (Exercises 1.1-1.8)
Exercise 1.1: Exploring Missing Data
Given the following play-by-play dataset, identify columns with missing values and calculate the percentage missing for each.
import pandas as pd
import numpy as np
plays = pd.DataFrame({
'game_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
'play_id': range(1, 11),
'down': [1, 2, 3, 1, 2, None, 1, 2, 3, 4],
'distance': [10, 7, 3, 10, None, None, 10, 8, 5, 2],
'yards_gained': [3, 4, 2, 8, None, 5, -2, 6, 4, 3],
'play_type': ['run', 'pass', 'pass', 'run', 'pass', None, 'run', 'run', 'pass', 'run']
})
Tasks:
1. Calculate the number of missing values per column
2. Calculate the percentage of missing values per column
3. Identify which columns have no missing values
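One possible starting point, using pandas built-ins (the variable names are only placeholders):
missing_counts = plays.isna().sum()
missing_pct = plays.isna().mean() * 100
complete_cols = missing_counts[missing_counts == 0].index.tolist()
print(missing_counts)
print(missing_pct.round(1))
print(complete_cols)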
Exercise 1.2: Basic Type Conversion
Convert the following columns to appropriate data types:
data = pd.DataFrame({
'game_id': ['001', '002', '003', '004'],
'score': ['28', '35', '24', '42'],
'is_home': ['true', 'false', 'true', 'true'],
'date': ['2023-09-01', '2023-09-08', '2023-09-15', '2023-09-22']
})
Tasks:
1. Convert game_id to integer
2. Convert score to integer
3. Convert is_home to boolean
4. Convert date to datetime
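A minimal sketch of one approach (the boolean mapping assumes the strings are exactly 'true'/'false' in lowercase):
data['game_id'] = data['game_id'].astype(int)
data['score'] = data['score'].astype(int)
data['is_home'] = data['is_home'].map({'true': True, 'false': False})
data['date'] = pd.to_datetime(data['date'])
print(data.dtypes)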
Exercise 1.3: Removing Duplicates
Identify and remove duplicate rows from this dataset:
games = pd.DataFrame({
'game_id': [1, 2, 2, 3, 3, 3, 4],
'home_team': ['Alabama', 'Georgia', 'Georgia', 'Ohio State', 'Ohio State', 'Ohio State', 'Michigan'],
'away_team': ['Georgia', 'LSU', 'LSU', 'Michigan', 'Michigan', 'Michigan', 'Penn State'],
'home_score': [28, 35, 35, 42, 42, 42, 24]
})
Tasks:
1. Identify how many duplicate rows exist
2. Remove duplicates, keeping the first occurrence
3. Verify the cleaned dataset
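One way to start (keeping the first occurrence is pandas' default behavior):
n_dupes = games.duplicated().sum()
games_clean = games.drop_duplicates(keep='first')
print(f"Removed {n_dupes} duplicate rows; {len(games_clean)} rows remain")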
Exercise 1.4: Simple Imputation
Fill missing values in this quarterback statistics dataset:
qb_stats = pd.DataFrame({
'player': ['Smith', 'Jones', 'Williams', 'Brown', 'Davis'],
'games': [12, 11, 12, 10, None],
'passing_yards': [2850, None, 3100, 2650, 2900],
'touchdowns': [24, 18, None, 20, 22],
'interceptions': [8, 12, 10, None, 6]
})
Tasks:
1. Fill missing games with the median value
2. Fill missing passing_yards with the mean value
3. Fill missing touchdowns with the median value
4. Fill missing interceptions with the mode value
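A short sketch of one approach; note that mode() can return several values, so taking the first is a deliberate (and debatable) choice:
qb_stats['games'] = qb_stats['games'].fillna(qb_stats['games'].median())
qb_stats['passing_yards'] = qb_stats['passing_yards'].fillna(qb_stats['passing_yards'].mean())
qb_stats['touchdowns'] = qb_stats['touchdowns'].fillna(qb_stats['touchdowns'].median())
qb_stats['interceptions'] = qb_stats['interceptions'].fillna(qb_stats['interceptions'].mode()[0])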
Exercise 1.5: Range Validation
Check if values in this dataset fall within valid ranges for football data:
plays = pd.DataFrame({
'down': [1, 2, 5, 0, 3, 4],
'distance': [10, -3, 15, 8, 55, 7],
'quarter': [1, 2, 3, 6, 4, 5],
'yards_gained': [-15, 75, 8, 150, 12, -25]
})
Valid ranges:
- down: 1-4
- distance: 1-50
- quarter: 1-5 (including OT)
- yards_gained: -20 to 100
Tasks:
1. Identify invalid values in each column
2. Count the number of invalid values per column
3. Replace invalid values with NaN
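A possible sketch using Series.between (the dictionary simply restates the ranges above):
valid_ranges = {'down': (1, 4), 'distance': (1, 50), 'quarter': (1, 5), 'yards_gained': (-20, 100)}
for col, (low, high) in valid_ranges.items():
    invalid = ~plays[col].between(low, high)
    print(f"{col}: {invalid.sum()} invalid value(s)")
    plays.loc[invalid, col] = np.nan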
Exercise 1.6: Case Normalization
Standardize team names to proper case:
teams = pd.DataFrame({
'home': ['ALABAMA', 'georgia', 'Ohio state', 'MICHIGAN', 'texas A&M'],
'away': ['GEORGIA', 'lsu', 'Penn STATE', 'ohio STATE', 'ARKANSAS']
})
Tasks:
1. Convert all team names to title case
2. Handle special cases (A&M, LSU)
3. Verify the standardized output
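One possible approach: title-case everything, then patch the acronyms that title case breaks (the replacement dictionary below is illustrative, not exhaustive):
teams_clean = teams.apply(lambda col: col.str.title())
teams_clean = teams_clean.replace({'Lsu': 'LSU'})  # 'texas A&M'.title() already yields 'Texas A&M'
print(teams_clean)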
Exercise 1.7: Column Renaming
Rename columns to follow consistent naming conventions:
data = pd.DataFrame({
'Game ID': [1, 2, 3],
'Home Team': ['Alabama', 'Georgia', 'Ohio State'],
'Away Team': ['Georgia', 'LSU', 'Michigan'],
'Home Points': [28, 35, 42],
'Away Points': [24, 31, 27]
})
Tasks:
1. Convert all column names to snake_case (lowercase with underscores)
2. Ensure no spaces or special characters in column names
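A compact sketch using the string accessor on the column index:
data.columns = (
    data.columns
    .str.strip()
    .str.lower()
    .str.replace(r'[^\w]+', '_', regex=True)
)
print(data.columns.tolist())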
Exercise 1.8: Basic Merge
Merge these two datasets to add team conference information:
games = pd.DataFrame({
'game_id': [1, 2, 3, 4],
'team': ['Alabama', 'Ohio State', 'Oregon', 'Texas'],
'opponent': ['Georgia', 'Michigan', 'Washington', 'Oklahoma'],
'points': [28, 42, 35, 31]
})
conferences = pd.DataFrame({
'team': ['Alabama', 'Georgia', 'Ohio State', 'Michigan', 'Oregon', 'Washington', 'Texas', 'Oklahoma'],
'conference': ['SEC', 'SEC', 'Big Ten', 'Big Ten', 'Big Ten', 'Big Ten', 'SEC', 'SEC']
})
Tasks:
1. Merge games with conferences based on the 'team' column
2. Add the opponent's conference using a second merge
3. Verify no data was lost in the merge
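One way to approach both merges (renaming the lookup table before the second merge avoids column-name collisions):
merged = games.merge(conferences, on='team', how='left')
merged = merged.merge(
    conferences.rename(columns={'team': 'opponent', 'conference': 'opp_conference'}),
    on='opponent', how='left'
)
assert len(merged) == len(games)  # left merges on a unique key should not add or drop rows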
Level 2: Application (Exercises 2.1-2.6)
Exercise 2.1: Missing Data Pattern Analysis
Analyze the pattern of missing data in this weather dataset:
import numpy as np
np.random.seed(42)
weather = pd.DataFrame({
'game_id': range(1, 51),
'venue': np.random.choice(['outdoor', 'dome'], 50, p=[0.6, 0.4]),
'temperature': np.where(
np.random.rand(50) > 0.4,
np.random.normal(65, 15, 50).round(0),
np.nan
),
'wind': np.where(
np.random.rand(50) > 0.4,
np.random.normal(10, 5, 50).round(0).clip(0),
np.nan
),
'attendance': np.random.normal(70000, 15000, 50).round(0)
})
Tasks:
1. Calculate the missing rate for each column
2. Analyze whether missing temperature values are related to venue type
3. Check if temperature and wind tend to be missing together
4. Determine the most appropriate imputation strategy
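A few lines that may help get started (they compute rates only; interpreting them is the exercise):
missing_rate = weather.isna().mean()
temp_missing_by_venue = weather.groupby('venue')['temperature'].apply(lambda s: s.isna().mean())
both_missing = (weather['temperature'].isna() & weather['wind'].isna()).mean()
print(missing_rate)
print(temp_missing_by_venue)
print(f"Temperature and wind both missing: {both_missing:.1%} of games")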
Exercise 2.2: Team Name Standardization System
Build a function to standardize these team name variants:
team_names = [
'ohio state', 'Ohio St.', 'OSU', 'Ohio State Buckeyes',
'michigan', 'MICHIGAN', 'U of M', 'Michigan Wolverines',
'penn state', 'Penn St', 'PSU', 'Penn State Nittany Lions',
'alabama', 'BAMA', 'Roll Tide', 'Alabama Crimson Tide'
]
Tasks:
1. Create a mapping dictionary for each variant to a standard name
2. Write a function that takes any variant and returns the standard name
3. Handle unknown teams gracefully (return a cleaned version of the input)
4. Test your function on all the example names
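A minimal sketch of the mapping approach (the dictionary covers only two teams; extending it is part of the exercise):
STANDARD_NAMES = {
    'ohio state': 'Ohio State', 'ohio st.': 'Ohio State', 'osu': 'Ohio State',
    'ohio state buckeyes': 'Ohio State',
    'michigan': 'Michigan', 'u of m': 'Michigan', 'michigan wolverines': 'Michigan',
}

def standardize_team(name: str) -> str:
    key = name.strip().lower()
    return STANDARD_NAMES.get(key, name.strip().title())  # fall back to a cleaned version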
Exercise 2.3: Outlier Detection and Handling
Detect outliers in this rushing yards dataset:
np.random.seed(42)
rushing = pd.DataFrame({
'play_id': range(1, 101),
'yards': np.concatenate([
np.random.normal(4, 3, 85), # Normal runs
np.random.normal(25, 10, 10), # Big plays
np.array([75, -12, 85, -15, 92]) # Potential outliers/errors
])
})
Tasks:
1. Detect outliers using the IQR method
2. Detect outliers using the z-score method (|z| > 3)
3. Determine which outliers are likely data errors vs. legitimate big plays
4. Create a flag column indicating potential errors
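One possible skeleton for the two detection methods (the error heuristic on the last line is only an example):
q1, q3 = rushing['yards'].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outlier = (rushing['yards'] < q1 - 1.5 * iqr) | (rushing['yards'] > q3 + 1.5 * iqr)

z_scores = (rushing['yards'] - rushing['yards'].mean()) / rushing['yards'].std()
z_outlier = z_scores.abs() > 3

rushing['possible_error'] = iqr_outlier & (rushing['yards'] < -10)  # example heuristic only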
Exercise 2.4: Feature Engineering - Down and Distance
Create situational features from play-by-play data:
plays = pd.DataFrame({
'play_id': range(1, 21),
'down': [1, 2, 3, 4, 1, 2, 3, 1, 2, 3,
1, 2, 3, 4, 1, 2, 3, 1, 2, 3],
'distance': [10, 7, 3, 1, 10, 8, 15, 10, 4, 2,
10, 12, 18, 8, 10, 6, 4, 10, 9, 11],
'yard_line': [25, 28, 35, 37, 20, 30, 38, 80, 84, 88,
50, 47, 35, 27, 15, 22, 28, 65, 74, 83]
})
Tasks:
1. Create a distance_category column: 'short' (1-3), 'medium' (4-7), 'long' (8+)
2. Create a passing_down indicator (2nd & 8+, 3rd & 5+, 4th down)
3. Create a field_zone column based on yard line (own 20, own territory, midfield, opponent territory, red zone)
4. Create an expected_pass column based on down, distance, and field position
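A sketch for the first three features, assuming yard_line is measured from the offense's own goal line; the zone boundaries are illustrative choices:
plays['distance_category'] = pd.cut(
    plays['distance'], bins=[0, 3, 7, np.inf], labels=['short', 'medium', 'long']
)
plays['passing_down'] = (
    ((plays['down'] == 2) & (plays['distance'] >= 8)) |
    ((plays['down'] == 3) & (plays['distance'] >= 5)) |
    (plays['down'] == 4)
)
plays['field_zone'] = pd.cut(
    plays['yard_line'], bins=[0, 20, 50, 60, 80, 100],
    labels=['own_20', 'own_territory', 'midfield', 'opp_territory', 'red_zone']
)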
Exercise 2.5: Multi-Source Data Merge
Combine data from three sources:
# Play-by-play data
plays = pd.DataFrame({
'game_id': [1, 1, 1, 2, 2],
'play_id': [1, 2, 3, 1, 2],
'offense': ['Alabama', 'Alabama', 'Georgia', 'Ohio State', 'Michigan'],
'yards': [5, 12, -2, 8, 15]
})
# Game metadata
games = pd.DataFrame({
'game_id': [1, 2, 3],
'date': ['2023-09-01', '2023-09-08', '2023-09-15'],
'venue': ['Bryant-Denny', 'Ohio Stadium', 'Tiger Stadium'],
'attendance': [101821, 104944, 102321]
})
# Team statistics
team_stats = pd.DataFrame({
'team': ['Alabama', 'Georgia', 'Ohio State', 'Michigan'],
'ranking': [4, 1, 2, 3],
'conference': ['SEC', 'SEC', 'Big Ten', 'Big Ten']
})
Tasks:
1. Merge plays with games to add game metadata
2. Merge the result with team_stats for the offense
3. Handle any unmatched records appropriately
4. Verify the final dataset has no unexpected duplicates
Exercise 2.6: Data Validation Framework
Create validation checks for this dataset:
season_stats = pd.DataFrame({
'team': ['Alabama', 'Georgia', 'Ohio State', 'Michigan', 'Penn State'],
'games': [13, 14, 13, 15, 13],
'wins': [11, 14, 12, 15, 10],
'points_per_game': [38.2, 41.5, 35.8, 32.1, 28.4],
'yards_per_game': [445.3, 478.2, 412.5, -25.0, 385.6],
'turnovers': [12, 8, 15, 11, 18]
})
Tasks:
1. Validate that wins ≤ games
2. Validate that points_per_game is positive
3. Validate that yards_per_game is positive
4. Report all validation failures with row details
Level 3: Intermediate (Exercises 3.1-3.5)
Exercise 3.1: Complex Missing Data Strategy
Design and implement a missing data strategy for this player statistics dataset:
np.random.seed(42)
positions = np.random.choice(['QB', 'RB', 'WR', 'TE'], 50)
players = pd.DataFrame({
    'player_id': range(1, 51),
    'position': positions,
    'games': np.random.randint(8, 14, 50),
    'passing_yards': np.where(
        positions == 'QB',
        np.random.normal(2500, 500, 50).round(0),
        np.nan
    ),
    'rushing_yards': np.where(
        np.random.rand(50) > 0.2,
        np.random.normal(300, 200, 50).round(0),
        np.nan
    ),
    'receiving_yards': np.where(
        np.isin(positions, ['WR', 'TE', 'RB']),
        np.random.normal(400, 200, 50).round(0),
        np.nan
    )
})
Tasks:
1. Analyze the missing data pattern and explain why values are missing
2. Determine appropriate handling for each column:
   - passing_yards (missing for non-QBs)
   - rushing_yards (randomly missing)
   - receiving_yards (position-dependent)
3. Implement the appropriate strategy for each
4. Validate that the imputed values make sense
Exercise 3.2: Fuzzy Name Matching
Handle approximate matches in team names:
source_a = pd.DataFrame({
'team_a': ['Alabama', 'Georgia Tech', 'Miami (FL)', 'Texas Christian', 'Brigham Young'],
'wins_a': [11, 7, 9, 10, 8]
})
source_b = pd.DataFrame({
'team_b': ['Alabama Crimson Tide', 'Ga. Tech', 'Miami Florida', 'TCU', 'BYU'],
'ranking': [4, 25, 15, 8, 12]
})
Tasks:
1. Create a mapping between the two naming conventions
2. Write a function using string similarity to suggest matches
3. Merge the two datasets using your mapping
4. Handle any names that couldn't be matched
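difflib from the standard library is one way to suggest matches; the similarity cutoff and the manual overrides below are assumptions you will likely need to tune:
from difflib import get_close_matches

manual_map = {'Texas Christian': 'TCU', 'Brigham Young': 'BYU'}

def suggest_match(name, candidates, cutoff=0.5):
    if name in manual_map:
        return manual_map[name]
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

source_a['team_b'] = source_a['team_a'].apply(
    lambda t: suggest_match(t, source_b['team_b'].tolist())
)
merged = source_a.merge(source_b, on='team_b', how='left')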
Exercise 3.3: Time-Series Data Preparation
Prepare game-by-game data for time series analysis:
np.random.seed(42)
games = pd.DataFrame({
'team': ['Alabama'] * 12 + ['Georgia'] * 12,
'week': list(range(1, 13)) * 2,
'opponent': ['Utah State', 'Texas', 'South Florida', 'Ole Miss', 'Vanderbilt',
'Arkansas', 'Texas A&M', 'Tennessee', 'LSU', 'Kentucky',
'Auburn', 'Georgia'] + \
['Oregon', 'Ball State', 'South Carolina', 'UAB', 'Auburn',
'Kentucky', 'Vanderbilt', 'Florida', 'Missouri', 'Ole Miss',
'Tennessee', 'Alabama'],
'points_scored': np.random.randint(20, 50, 24),
'points_allowed': np.random.randint(10, 35, 24)
})
Tasks:
1. Sort data by team and week
2. Create lagged features (previous game points scored/allowed)
3. Create rolling averages (3-game and 5-game)
4. Create cumulative season statistics
5. Handle the first few games where lags aren't available
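A possible starting sketch for the lag and rolling features (shift(1) inside the rolling window keeps the current game out of its own average; min_periods is one way to handle the early weeks):
games = games.sort_values(['team', 'week']).reset_index(drop=True)
games['prev_scored'] = games.groupby('team')['points_scored'].shift(1)
games['prev_allowed'] = games.groupby('team')['points_allowed'].shift(1)
games['scored_roll3'] = (
    games.groupby('team')['points_scored']
    .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
games['cum_scored'] = games.groupby('team')['points_scored'].cumsum()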
Exercise 3.4: Data Quality Report Generator
Build a function that generates a comprehensive data quality report:
def generate_quality_report(df: pd.DataFrame, name: str = "Dataset") -> dict:
    """
    Generate a comprehensive data quality report.

    Should include:
    - Basic statistics (rows, columns)
    - Missing value analysis
    - Duplicate detection
    - Data type summary
    - Outlier detection for numeric columns
    - Unique value counts for categorical columns
    """
    pass  # Implement this
Test with:
np.random.seed(42)
test_data = pd.DataFrame({
'game_id': [1, 1, 2, 3, 4, 5, 5], # Duplicates
'team': ['Alabama', 'alabama', 'GEORGIA', 'Ohio State', None, 'Texas', 'Texas'],
'points': [28, 28, 35, 42, 31, -5, 45], # Invalid negative
'yards': [350, 350, None, 425, 380, 290, 850], # Missing and outlier
'date': ['2023-09-01', '2023-09-01', '2023-09-08', '2023-09-15', '2023-09-22', '2023-09-29', '2023-09-29']
})
report = generate_quality_report(test_data, "Test Games")
Exercise 3.5: Building a Cleaning Function
Create a reusable cleaning function for play-by-play data:
def clean_play_by_play(df: pd.DataFrame,
                       team_mapping: dict = None,
                       validate: bool = True) -> pd.DataFrame:
    """
    Clean play-by-play data following standard steps.

    Parameters
    ----------
    df : pd.DataFrame
        Raw play-by-play data
    team_mapping : dict
        Optional team name mapping
    validate : bool
        Whether to run validation checks

    Returns
    -------
    pd.DataFrame
        Cleaned data

    Steps:
    1. Remove duplicates
    2. Standardize team names
    3. Fix data types
    4. Handle missing values
    5. Validate ranges
    6. Create basic features
    """
    pass  # Implement this
Test with messy data:
raw_plays = pd.DataFrame({
'game_id': ['1', '1', '1', '1', '2', '2'],
'play_id': [1, 2, 2, 3, 1, 2], # Duplicate play
'offense': ['ALABAMA', 'alabama', 'Bama', 'Georgia', 'ohio st', 'Michigan'],
'down': [1, 2, 2, 3, 1, 5], # Invalid down
'distance': [10, 'seven', 7, 3, 10, 8], # String instead of int
'yards_gained': [5, None, None, 8, -2, 150] # Missing and outlier
})
clean_plays = clean_play_by_play(raw_plays)
Level 4: Advanced (Exercises 4.1-4.4)
Exercise 4.1: Complete Pipeline with Logging
Build a data cleaning pipeline with comprehensive logging and error handling:
import logging
from datetime import datetime
class FootballDataPipeline:
    """
    Production-quality data cleaning pipeline with logging,
    error handling, and checkpointing.

    Requirements:
    - Log each step with timing
    - Handle errors gracefully
    - Save intermediate results
    - Generate summary report
    - Support rollback on failure
    """

    def __init__(self, name: str, output_dir: str = './pipeline_output'):
        pass  # Implement initialization

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        pass  # Implement main pipeline

    def save_checkpoint(self, df: pd.DataFrame, step: str):
        pass  # Save intermediate state

    def generate_report(self) -> dict:
        pass  # Generate summary of all steps
Exercise 4.2: Handling Schema Changes
Design a system to handle data sources that change format between seasons:
# 2020 format
data_2020 = pd.DataFrame({
'game_id': [1, 2],
'home_team': ['Alabama', 'Georgia'],
'home_score': [42, 35],
'away_team': ['Georgia', 'Florida'],
'away_score': [24, 28]
})
# 2023 format (different column names)
data_2023 = pd.DataFrame({
'gameId': [1, 2],
'homeTeam': ['Alabama', 'Georgia'],
'homePoints': [38, 41],
'awayTeam': ['Texas', 'Auburn'],
'awayPoints': [31, 17],
'attendance': [101821, 92746] # New column
})
Tasks:
1. Create a schema definition for the target format
2. Write adapters for each source format
3. Combine data from multiple formats
4. Handle missing columns appropriately
Exercise 4.3: Automated Data Validation Rules
Create a rules-based validation system:
class ValidationRule:
    """Base class for validation rules."""

    def __init__(self, name: str, column: str):
        self.name = name
        self.column = column

    def check(self, df: pd.DataFrame) -> dict:
        raise NotImplementedError

class RangeRule(ValidationRule):
    """Value must be within specified range."""
    pass

class NotNullRule(ValidationRule):
    """Column cannot have null values."""
    pass

class UniqueRule(ValidationRule):
    """Column values must be unique."""
    pass

class ReferentialRule(ValidationRule):
    """Values must exist in reference set."""
    pass

class CustomRule(ValidationRule):
    """Custom validation logic."""
    pass
# Define rules for play-by-play data
play_rules = [
RangeRule('valid_down', 'down', min_val=1, max_val=4),
RangeRule('valid_distance', 'distance', min_val=1, max_val=50),
RangeRule('valid_yards', 'yards_gained', min_val=-20, max_val=100),
NotNullRule('game_id_required', 'game_id'),
UniqueRule('unique_play', ['game_id', 'play_id']),
ReferentialRule('valid_team', 'offense', reference=valid_teams),  # assumes valid_teams is a predefined set of team names
]
def validate_with_rules(df: pd.DataFrame, rules: list) -> dict:
    """Run all validation rules and return report."""
    pass  # Implement this
Exercise 4.4: Performance-Optimized Cleaning
Optimize a cleaning pipeline for a large dataset (1M+ rows):
def generate_large_dataset(n_rows: int = 1_000_000) -> pd.DataFrame:
    """Generate a large play-by-play dataset for testing."""
    np.random.seed(42)
    return pd.DataFrame({
        'game_id': np.random.randint(1, 5000, n_rows),
        'play_id': np.arange(n_rows),
        'offense': np.random.choice(['Alabama', 'Georgia', 'Ohio State', 'Michigan'] * 25, n_rows),
        'down': np.random.choice([1, 2, 3, 4, np.nan], n_rows, p=[0.3, 0.25, 0.25, 0.15, 0.05]),
        'distance': np.random.exponential(7, n_rows).clip(1, 50).round(0),
        'yards_gained': np.random.normal(4, 8, n_rows).round(0)
    })
# Time the basic approach
import time
large_df = generate_large_dataset()

start = time.time()
# Basic cleaning (your implementation)
# ...
basic_time = time.time() - start

start = time.time()  # reset the timer so the optimized run is measured on its own
# Optimized cleaning (your implementation)
# ...
optimized_time = time.time() - start

print(f"Basic: {basic_time:.2f}s")
print(f"Optimized: {optimized_time:.2f}s")
print(f"Speedup: {basic_time/optimized_time:.1f}x")
Optimization targets:
1. Use vectorized operations instead of loops
2. Use categorical types for low-cardinality columns
3. Process in chunks if necessary
4. Minimize intermediate DataFrame copies
Level 5: Mastery (Exercises 5.1-5.3)
Exercise 5.1: End-to-End Analysis Pipeline
Build a complete pipeline from raw data to analysis-ready dataset:
class AnalysisPipeline:
    """
    Complete pipeline from raw data to analysis-ready format.

    Stages:
    1. Ingestion - Load from multiple formats (CSV, JSON, API)
    2. Cleaning - Handle quality issues
    3. Integration - Merge multiple sources
    4. Enrichment - Add derived features
    5. Validation - Ensure data quality
    6. Export - Save in analysis-ready format

    Requirements:
    - Configuration-driven (YAML/JSON config file)
    - Extensible (easy to add new steps)
    - Resumable (can restart from any stage)
    - Auditable (complete processing log)
    """

    def __init__(self, config_path: str):
        pass

    def run_stage(self, stage: str):
        pass

    def run_all(self):
        pass

    def get_audit_log(self) -> pd.DataFrame:
        pass
Exercise 5.2: Data Reconciliation System
Build a system to reconcile data from multiple sources that may conflict:
class DataReconciler:
    """
    Reconcile conflicting data from multiple sources.

    When sources disagree:
    - Use rules to determine which source to trust
    - Flag irreconcilable conflicts for manual review
    - Track provenance of final values
    """

    def __init__(self, source_priority: dict):
        """
        source_priority: {'source_name': priority_int}
        Lower number = higher priority
        """
        pass

    def add_source(self, name: str, df: pd.DataFrame, key_cols: list):
        pass

    def reconcile(self, tolerance: dict = None) -> pd.DataFrame:
        """
        tolerance: {'column': max_difference}
        Values within tolerance are not considered conflicts
        """
        pass

    def get_conflicts(self) -> pd.DataFrame:
        """Return all unresolved conflicts."""
        pass
Test case:
# ESPN data
espn = pd.DataFrame({
'game_id': [1, 2],
'home_team': ['Alabama', 'Georgia'],
'home_score': [28, 35]
})
# CBS data (has an error)
cbs = pd.DataFrame({
'game_id': [1, 2],
'home_team': ['Alabama', 'Georgia'],
'home_score': [28, 36] # Off by one
})
# Official NCAA data
ncaa = pd.DataFrame({
'game_id': [1, 2],
'home_team': ['Alabama', 'Georgia'],
'home_score': [28, 35]
})
reconciler = DataReconciler(source_priority={'ncaa': 1, 'espn': 2, 'cbs': 3})
# Add sources and reconcile
Exercise 5.3: ML-Ready Feature Pipeline
Build a pipeline that prepares data specifically for machine learning models:
class MLFeaturePipeline:
    """
    Prepare football data for machine learning.

    Steps:
    1. Split data into train/test by time
    2. Fit transformations on training data only
    3. Create features avoiding data leakage
    4. Handle categorical variables
    5. Scale numeric features
    6. Document all transformations for inference
    """

    def __init__(self, target_col: str, time_col: str):
        self.target_col = target_col
        self.time_col = time_col
        self.fitted_transformers = {}

    def fit(self, train_df: pd.DataFrame):
        """Fit all transformers on training data."""
        pass

    def transform(self, df: pd.DataFrame) -> tuple:
        """Transform data and return (X, y)."""
        pass

    def fit_transform(self, df: pd.DataFrame) -> tuple:
        """Fit and transform training data."""
        pass

    def save(self, path: str):
        """Save fitted pipeline for inference."""
        pass

    def load(self, path: str):
        """Load fitted pipeline."""
        pass
Requirements:
1. Rolling features must use only past data (no future leakage)
2. Categorical encodings fit on train data only
3. Scaling parameters from train data only
4. Document feature definitions for reproducibility
5. Handle new categories at inference time
Submission Guidelines
For each exercise:
- Code: Well-documented, following PEP 8 style
- Output: Show the results of running your code
- Explanation: Brief description of your approach
- Edge Cases: Note how you handled special situations
- Validation: Verify your solution works correctly
Example submission format:
# Exercise X.Y: Title
# Approach: Brief description of strategy
# Code
def my_solution(df):
    """
    Description of what this function does.

    Parameters
    ----------
    df : pd.DataFrame
        Input description

    Returns
    -------
    pd.DataFrame
        Output description
    """
    # Implementation
    pass
# Test
result = my_solution(test_data)
print(result)
# Validation
assert len(result) == expected_length
print("Validation passed!")