Exercises: Data Cleaning and Preparation
Practice cleaning, transforming, and validating football data with these exercises organized by difficulty level.
Level 1: Foundation (Exercises 1.1-1.8)
Exercise 1.1: Exploring Missing Data
Given the following play-by-play dataset, identify columns with missing values and calculate the percentage missing for each.
import pandas as pd
import numpy as np
plays = pd.DataFrame({
'game_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
'play_id': range(1, 11),
'down': [1, 2, 3, 1, 2, None, 1, 2, 3, 4],
'distance': [10, 7, 3, 10, None, None, 10, 8, 5, 2],
'yards_gained': [3, 4, 2, 8, None, 5, -2, 6, 4, 3],
'play_type': ['run', 'pass', 'pass', 'run', 'pass', None, 'run', 'run', 'pass', 'run']
})
Tasks:
1. Calculate the number of missing values per column
2. Calculate the percentage of missing values per column
3. Identify which columns have no missing values
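One possible starting point, using pandas built-ins (the variable names are only placeholders):
missing_counts = plays.isna().sum()
missing_pct = plays.isna().mean() * 100
complete_cols = missing_counts[missing_counts == 0].index.tolist()
print(missing_counts)
print(missing_pct.round(1))
print(complete_cols)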
Exercise 1.2: Basic Type Conversion
Convert the following columns to appropriate data types:
data = pd.DataFrame({
'game_id': ['001', '002', '003', '004'],
'score': ['28', '35', '24', '42'],
'is_home': ['true', 'false', 'true', 'true'],
'date': ['2023-09-01', '2023-09-08', '2023-09-15', '2023-09-22']
})
Tasks:
1. Convert game_id to integer
2. Convert score to integer
3. Convert is_home to boolean
4. Convert date to datetime
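A minimal sketch of one approach (the boolean mapping assumes the strings are exactly 'true'/'false' in lowercase):
data['game_id'] = data['game_id'].astype(int)
data['score'] = data['score'].astype(int)
data['is_home'] = data['is_home'].map({'true': True, 'false': False})
data['date'] = pd.to_datetime(data['date'])
print(data.dtypes)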
Exercise 1.3: Removing Duplicates
Identify and remove duplicate rows from this dataset:
games = pd.DataFrame({
'game_id': [1, 2, 2, 3, 3, 3, 4],
'home_team': ['Alabama', 'Georgia', 'Georgia', 'Ohio State', 'Ohio State', 'Ohio State', 'Michigan'],
'away_team': ['Georgia', 'LSU', 'LSU', 'Michigan', 'Michigan', 'Michigan', 'Penn State'],
'home_score': [28, 35, 35, 42, 42, 42, 24]
})
Tasks:
1. Identify how many duplicate rows exist
2. Remove duplicates, keeping the first occurrence
3. Verify the cleaned dataset
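One way to start (keeping the first occurrence is pandas' default behavior):
n_dupes = games.duplicated().sum()
games_clean = games.drop_duplicates(keep='first')
print(f"Removed {n_dupes} duplicate rows; {len(games_clean)} rows remain")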
Exercise 1.4: Simple Imputation
Fill missing values in this quarterback statistics dataset:
qb_stats = pd.DataFrame({
'player': ['Smith', 'Jones', 'Williams', 'Brown', 'Davis'],
'games': [12, 11, 12, 10, None],
'passing_yards': [2850, None, 3100, 2650, 2900],
'touchdowns': [24, 18, None, 20, 22],
'interceptions': [8, 12, 10, None, 6]
})
Tasks:
1. Fill missing games with the median value
2. Fill missing passing_yards with the mean value
3. Fill missing touchdowns with the median value
4. Fill missing interceptions with the mode value
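A short sketch of one approach; note that mode() can return several values, so taking the first is a deliberate (and debatable) choice:
qb_stats['games'] = qb_stats['games'].fillna(qb_stats['games'].median())
qb_stats['passing_yards'] = qb_stats['passing_yards'].fillna(qb_stats['passing_yards'].mean())
qb_stats['touchdowns'] = qb_stats['touchdowns'].fillna(qb_stats['touchdowns'].median())
qb_stats['interceptions'] = qb_stats['interceptions'].fillna(qb_stats['interceptions'].mode()[0])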
Exercise 1.5: Range Validation
Check if values in this dataset fall within valid ranges for football data:
plays = pd.DataFrame({
'down': [1, 2, 5, 0, 3, 4],
'distance': [10, -3, 15, 8, 55, 7],
'quarter': [1, 2, 3, 6, 4, 5],
'yards_gained': [-15, 75, 8, 150, 12, -25]
})
Valid ranges:
- down: 1-4
- distance: 1-50
- quarter: 1-5 (including OT)
- yards_gained: -20 to 100
Tasks:
1. Identify invalid values in each column
2. Count the number of invalid values per column
3. Replace invalid values with NaN
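A possible sketch using Series.between (the dictionary simply restates the ranges above):
valid_ranges = {'down': (1, 4), 'distance': (1, 50), 'quarter': (1, 5), 'yards_gained': (-20, 100)}
for col, (low, high) in valid_ranges.items():
    invalid = ~plays[col].between(low, high)
    print(f"{col}: {invalid.sum()} invalid value(s)")
    plays.loc[invalid, col] = np.nan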
Exercise 1.6: Case Normalization
Standardize team names to proper case:
teams = pd.DataFrame({
'home': ['ALABAMA', 'georgia', 'Ohio state', 'MICHIGAN', 'texas A&M'],
'away': ['GEORGIA', 'lsu', 'Penn STATE', 'ohio STATE', 'ARKANSAS']
})
Tasks:
1. Convert all team names to title case
2. Handle special cases (A&M, LSU)
3. Verify the standardized output
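One possible approach: title-case everything, then patch the acronyms that title case breaks (the replacement dictionary below is illustrative, not exhaustive):
teams_clean = teams.apply(lambda col: col.str.title())
teams_clean = teams_clean.replace({'Lsu': 'LSU'})  # 'texas A&M'.title() already yields 'Texas A&M'
print(teams_clean)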
Exercise 1.7: Column Renaming
Rename columns to follow consistent naming conventions:
data = pd.DataFrame({
'Game ID': [1, 2, 3],
'Home Team': ['Alabama', 'Georgia', 'Ohio State'],
'Away Team': ['Georgia', 'LSU', 'Michigan'],
'Home Points': [28, 35, 42],
'Away Points': [24, 31, 27]
})
Tasks:
1. Convert all column names to snake_case (lowercase with underscores)
2. Ensure no spaces or special characters in column names
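A compact sketch using the string accessor on the column index:
data.columns = (
    data.columns
    .str.strip()
    .str.lower()
    .str.replace(r'[^\w]+', '_', regex=True)
)
print(data.columns.tolist())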
Exercise 1.8: Basic Merge
Merge these two datasets to add team conference information:
games = pd.DataFrame({
'game_id': [1, 2, 3, 4],
'team': ['Alabama', 'Ohio State', 'Oregon', 'Texas'],
'opponent': ['Georgia', 'Michigan', 'Washington', 'Oklahoma'],
'points': [28, 42, 35, 31]
})
conferences = pd.DataFrame({
'team': ['Alabama', 'Georgia', 'Ohio State', 'Michigan', 'Oregon', 'Washington', 'Texas', 'Oklahoma'],
'conference': ['SEC', 'SEC', 'Big Ten', 'Big Ten', 'Big Ten', 'Big Ten', 'SEC', 'SEC']
})
Tasks:
1. Merge games with conferences based on the 'team' column
2. Add the opponent's conference using a second merge
3. Verify no data was lost in the merge
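One way to approach both merges (renaming the lookup table before the second merge avoids column-name collisions):
merged = games.merge(conferences, on='team', how='left')
merged = merged.merge(
    conferences.rename(columns={'team': 'opponent', 'conference': 'opp_conference'}),
    on='opponent', how='left'
)
assert len(merged) == len(games)  # left merges on a unique key should not add or drop rows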
Level 2: Application (Exercises 2.1-2.6)
Exercise 2.1: Missing Data Pattern Analysis
Analyze the pattern of missing data in this weather dataset:
import numpy as np
np.random.seed(42)
weather = pd.DataFrame({
'game_id': range(1, 51),
'venue': np.random.choice(['outdoor', 'dome'], 50, p=[0.6, 0.4]),
'temperature': np.where(
np.random.rand(50) > 0.4,
np.random.normal(65, 15, 50).round(0),
np.nan
),
'wind': np.where(
np.random.rand(50) > 0.4,
np.random.normal(10, 5, 50).round(0).clip(0),
np.nan
),
'attendance': np.random.normal(70000, 15000, 50).round(0)
})
Tasks:
1. Calculate the missing rate for each column
2. Analyze whether missing temperature values are related to venue type
3. Check if temperature and wind tend to be missing together
4. Determine the most appropriate imputation strategy
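A few lines that may help get started (they compute rates only; interpreting them is the exercise):
missing_rate = weather.isna().mean()
temp_missing_by_venue = weather.groupby('venue')['temperature'].apply(lambda s: s.isna().mean())
both_missing = (weather['temperature'].isna() & weather['wind'].isna()).mean()
print(missing_rate)
print(temp_missing_by_venue)
print(f"Temperature and wind both missing: {both_missing:.1%} of games")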
Exercise 2.2: Team Name Standardization System
Build a function to standardize these team name variants:
team_names = [
'ohio state', 'Ohio St.', 'OSU', 'Ohio State Buckeyes',
'michigan', 'MICHIGAN', 'U of M', 'Michigan Wolverines',
'penn state', 'Penn St', 'PSU', 'Penn State Nittany Lions',
'alabama', 'BAMA', 'Roll Tide', 'Alabama Crimson Tide'
]
Tasks:
1. Create a mapping dictionary for each variant to a standard name
2. Write a function that takes any variant and returns the standard name
3. Handle unknown teams gracefully (return a cleaned version of the input)
4. Test your function on all the example names
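A minimal sketch of the mapping approach (the dictionary covers only two teams; extending it is part of the exercise):
STANDARD_NAMES = {
    'ohio state': 'Ohio State', 'ohio st.': 'Ohio State', 'osu': 'Ohio State',
    'ohio state buckeyes': 'Ohio State',
    'michigan': 'Michigan', 'u of m': 'Michigan', 'michigan wolverines': 'Michigan',
}

def standardize_team(name: str) -> str:
    key = name.strip().lower()
    return STANDARD_NAMES.get(key, name.strip().title())  # fall back to a cleaned version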
Exercise 2.3: Outlier Detection and Handling
Detect outliers in this rushing yards dataset:
np.random.seed(42)
rushing = pd.DataFrame({
'play_id': range(1, 101),
'yards': np.concatenate([
np.random.normal(4, 3, 85), # Normal runs
np.random.normal(25, 10, 10), # Big plays
np.array([75, -12, 85, -15, 92]) # Potential outliers/errors
])
})
Tasks:
1. Detect outliers using the IQR method
2. Detect outliers using the z-score method (|z| > 3)
3. Determine which outliers are likely data errors vs. legitimate big plays
4. Create a flag column indicating potential errors
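One possible skeleton for the two detection methods (the error heuristic on the last line is only an example):
q1, q3 = rushing['yards'].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outlier = (rushing['yards'] < q1 - 1.5 * iqr) | (rushing['yards'] > q3 + 1.5 * iqr)

z_scores = (rushing['yards'] - rushing['yards'].mean()) / rushing['yards'].std()
z_outlier = z_scores.abs() > 3

rushing['possible_error'] = iqr_outlier & (rushing['yards'] < -10)  # example heuristic only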
Exercise 2.4: Feature Engineering - Down and Distance
Create situational features from play-by-play data:
plays = pd.DataFrame({
'play_id': range(1, 21),
'down': [1, 2, 3, 4, 1, 2, 3, 1, 2, 3,
1, 2, 3, 4, 1, 2, 3, 1, 2, 3],
'distance': [10, 7, 3, 1, 10, 8, 15, 10, 4, 2,
10, 12, 18, 8, 10, 6, 4, 10, 9, 11],
'yard_line': [25, 28, 35, 37, 20, 30, 38, 80, 84, 88,
50, 47, 35, 27, 15, 22, 28, 65, 74, 83]
})
Tasks:
1. Create a distance_category column: 'short' (1-3), 'medium' (4-7), 'long' (8+)
2. Create a passing_down indicator (2nd & 8+, 3rd & 5+, 4th down)
3. Create a field_zone column based on yard line (own 20, own territory, midfield, opponent territory, red zone)
4. Create an expected_pass column based on down, distance, and field position
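A sketch for the first three features, assuming yard_line is measured from the offense's own goal line; the zone boundaries are illustrative choices:
plays['distance_category'] = pd.cut(
    plays['distance'], bins=[0, 3, 7, np.inf], labels=['short', 'medium', 'long']
)
plays['passing_down'] = (
    ((plays['down'] == 2) & (plays['distance'] >= 8)) |
    ((plays['down'] == 3) & (plays['distance'] >= 5)) |
    (plays['down'] == 4)
)
plays['field_zone'] = pd.cut(
    plays['yard_line'], bins=[0, 20, 50, 60, 80, 100],
    labels=['own_20', 'own_territory', 'midfield', 'opp_territory', 'red_zone']
)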
Exercise 2.5: Multi-Source Data Merge
Combine data from three sources:
# Play-by-play data
plays = pd.DataFrame({
'game_id': [1, 1, 1, 2, 2],
'play_id': [1, 2, 3, 1, 2],
'offense': ['Alabama', 'Alabama', 'Georgia', 'Ohio State', 'Michigan'],
'yards': [5, 12, -2, 8, 15]
})
# Game metadata
games = pd.DataFrame({
'game_id': [1, 2, 3],
'date': ['2023-09-01', '2023-09-08', '2023-09-15'],
'venue': ['Bryant-Denny', 'Ohio Stadium', 'Tiger Stadium'],
'attendance': [101821, 104944, 102321]
})
# Team statistics
team_stats = pd.DataFrame({
'team': ['Alabama', 'Georgia', 'Ohio State', 'Michigan'],
'ranking': [4, 1, 2, 3],
'conference': ['SEC', 'SEC', 'Big Ten', 'Big Ten']
})
Tasks:
1. Merge plays with games to add game metadata
2. Merge the result with team_stats for the offense
3. Handle any unmatched records appropriately
4. Verify the final dataset has no unexpected duplicates
Exercise 2.6: Data Validation Framework
Create validation checks for this dataset:
season_stats = pd.DataFrame({
'team': ['Alabama', 'Georgia', 'Ohio State', 'Michigan', 'Penn State'],
'games': [13, 14, 13, 15, 13],
'wins': [11, 14, 12, 15, 10],
'points_per_game': [38.2, 41.5, 35.8, 32.1, 28.4],
'yards_per_game': [445.3, 478.2, 412.5, -25.0, 385.6],
'turnovers': [12, 8, 15, 11, 18]
})
Tasks:
1. Validate that wins ≤ games
2. Validate that points_per_game is positive
3. Validate that yards_per_game is positive
4. Report all validation failures with row details
Level 3: Intermediate (Exercises 3.1-3.5)
Exercise 3.1: Complex Missing Data Strategy
Design and implement a missing data strategy for this player statistics dataset:
np.random.seed(42)
positions = np.random.choice(['QB', 'RB', 'WR', 'TE'], 50)
players = pd.DataFrame({
    'player_id': range(1, 51),
    'position': positions,
    'games': np.random.randint(8, 14, 50),
    'passing_yards': np.where(
        positions == 'QB',
        np.random.normal(2500, 500, 50).round(0),
        np.nan
    ),
    'rushing_yards': np.where(
        np.random.rand(50) > 0.2,
        np.random.normal(300, 200, 50).round(0),
        np.nan
    ),
    'receiving_yards': np.where(
        np.isin(positions, ['WR', 'TE', 'RB']),
        np.random.normal(400, 200, 50).round(0),
        np.nan
    )
})
Tasks:
1. Analyze the missing data pattern and explain why values are missing
2. Determine appropriate handling for each column:
   - passing_yards (missing for non-QBs)
   - rushing_yards (randomly missing)
   - receiving_yards (position-dependent)
3. Implement the appropriate strategy for each
4. Validate that the imputed values make sense
Exercise 3.2: Fuzzy Name Matching
Handle approximate matches in team names:
source_a = pd.DataFrame({
'team_a': ['Alabama', 'Georgia Tech', 'Miami (FL)', 'Texas Christian', 'Brigham Young'],
'wins_a': [11, 7, 9, 10, 8]
})
source_b = pd.DataFrame({
'team_b': ['Alabama Crimson Tide', 'Ga. Tech', 'Miami Florida', 'TCU', 'BYU'],
'ranking': [4, 25, 15, 8, 12]
})
Tasks:
1. Create a mapping between the two naming conventions
2. Write a function using string similarity to suggest matches
3. Merge the two datasets using your mapping
4. Handle any names that couldn't be matched
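difflib from the standard library is one way to suggest matches; the similarity cutoff and the manual overrides below are assumptions you will likely need to tune:
from difflib import get_close_matches

manual_map = {'Texas Christian': 'TCU', 'Brigham Young': 'BYU'}

def suggest_match(name, candidates, cutoff=0.5):
    if name in manual_map:
        return manual_map[name]
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

source_a['team_b'] = source_a['team_a'].apply(
    lambda t: suggest_match(t, source_b['team_b'].tolist())
)
merged = source_a.merge(source_b, on='team_b', how='left')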
Exercise 3.3: Time-Series Data Preparation
Prepare game-by-game data for time series analysis:
np.random.seed(42)
games = pd.DataFrame({
'team': ['Alabama'] * 12 + ['Georgia'] * 12,
'week': list(range(1, 13)) * 2,
'opponent': ['Utah State', 'Texas', 'South Florida', 'Ole Miss', 'Vanderbilt',
'Arkansas', 'Texas A&M', 'Tennessee', 'LSU', 'Kentucky',
'Auburn', 'Georgia'] + \
['Oregon', 'Ball State', 'South Carolina', 'UAB', 'Auburn',
'Kentucky', 'Vanderbilt', 'Florida', 'Missouri', 'Ole Miss',
'Tennessee', 'Alabama'],
'points_scored': np.random.randint(20, 50, 24),
'points_allowed': np.random.randint(10, 35, 24)
})
Tasks:
1. Sort data by team and week
2. Create lagged features (previous game points scored/allowed)
3. Create rolling averages (3-game and 5-game)
4. Create cumulative season statistics
5. Handle the first few games where lags aren't available
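A possible starting sketch for the lag and rolling features (shift(1) inside the rolling window keeps the current game out of its own average; min_periods is one way to handle the early weeks):
games = games.sort_values(['team', 'week']).reset_index(drop=True)
games['prev_scored'] = games.groupby('team')['points_scored'].shift(1)
games['prev_allowed'] = games.groupby('team')['points_allowed'].shift(1)
games['scored_roll3'] = (
    games.groupby('team')['points_scored']
    .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
games['cum_scored'] = games.groupby('team')['points_scored'].cumsum()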
Exercise 3.4: Data Quality Report Generator
Build a function that generates a comprehensive data quality report:
def generate_quality_report(df: pd.DataFrame, name: str = "Dataset") -> dict:
    """
    Generate a comprehensive data quality report.

    Should include:
    - Basic statistics (rows, columns)
    - Missing value analysis
    - Duplicate detection
    - Data type summary
    - Outlier detection for numeric columns
    - Unique value counts for categorical columns
    """
    pass  # Implement this
Test with:
np.random.seed(42)
test_data = pd.DataFrame({
'game_id': [1, 1, 2, 3, 4, 5, 5], # Duplicates
'team': ['Alabama', 'alabama', 'GEORGIA', 'Ohio State', None, 'Texas', 'Texas'],
'points': [28, 28, 35, 42, 31, -5, 45], # Invalid negative
'yards': [350, 350, None, 425, 380, 290, 850], # Missing and outlier
'date': ['2023-09-01', '2023-09-01', '2023-09-08', '2023-09-15', '2023-09-22', '2023-09-29', '2023-09-29']
})
report = generate_quality_report(test_data, "Test Games")
Exercise 3.5: Building a Cleaning Function
Create a reusable cleaning function for play-by-play data:
def clean_play_by_play(df: pd.DataFrame,
                       team_mapping: dict = None,
                       validate: bool = True) -> pd.DataFrame:
    """
    Clean play-by-play data following standard steps.

    Parameters
    ----------
    df : pd.DataFrame
        Raw play-by-play data
    team_mapping : dict
        Optional team name mapping
    validate : bool
        Whether to run validation checks

    Returns
    -------
    pd.DataFrame
        Cleaned data

    Steps:
    1. Remove duplicates
    2. Standardize team names
    3. Fix data types
    4. Handle missing values
    5. Validate ranges
    6. Create basic features
    """
    pass  # Implement this
Test with messy data:
raw_plays = pd.DataFrame({
'game_id': ['1', '1', '1', '1', '2', '2'],
'play_id': [1, 2, 2, 3, 1, 2], # Duplicate play
'offense': ['ALABAMA', 'alabama', 'Bama', 'Georgia', 'ohio st', 'Michigan'],
'down': [1, 2, 2, 3, 1, 5], # Invalid down
'distance': [10, 'seven', 7, 3, 10, 8], # String instead of int
'yards_gained': [5, None, None, 8, -2, 150] # Missing and outlier
})
clean_plays = clean_play_by_play(raw_plays)
Level 4: Advanced (Exercises 4.1-4.4)
Exercise 4.1: Complete Pipeline with Logging
Build a data cleaning pipeline with comprehensive logging and error handling:
import logging
from datetime import datetime
class FootballDataPipeline:
    """
    Production-quality data cleaning pipeline with logging,
    error handling, and checkpointing.

    Requirements:
    - Log each step with timing
    - Handle errors gracefully
    - Save intermediate results
    - Generate summary report
    - Support rollback on failure
    """

    def __init__(self, name: str, output_dir: str = './pipeline_output'):
        pass  # Implement initialization

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        pass  # Implement main pipeline

    def save_checkpoint(self, df: pd.DataFrame, step: str):
        pass  # Save intermediate state

    def generate_report(self) -> dict:
        pass  # Generate summary of all steps
Exercise 4.2: Handling Schema Changes
Design a system to handle data sources that change format between seasons:
# 2020 format
data_2020 = pd.DataFrame({
'game_id': [1, 2],
'home_team': ['Alabama', 'Georgia'],
'home_score': [42, 35],
'away_team': ['Georgia', 'Florida'],
'away_score': [24, 28]
})
# 2023 format (different column names)
data_2023 = pd.DataFrame({
'gameId': [1, 2],
'homeTeam': ['Alabama', 'Georgia'],
'homePoints': [38, 41],
'awayTeam': ['Texas', 'Auburn'],
'awayPoints': [31, 17],
'attendance': [101821, 92746] # New column
})
Tasks:
1. Create a schema definition for the target format
2. Write adapters for each source format
3. Combine data from multiple formats
4. Handle missing columns appropriately
Exercise 4.3: Automated Data Validation Rules
Create a rules-based validation system:
class ValidationRule:
    """Base class for validation rules."""

    def __init__(self, name: str, column: str):
        self.name = name
        self.column = column

    def check(self, df: pd.DataFrame) -> dict:
        raise NotImplementedError

class RangeRule(ValidationRule):
    """Value must be within specified range."""
    pass

class NotNullRule(ValidationRule):
    """Column cannot have null values."""
    pass

class UniqueRule(ValidationRule):
    """Column values must be unique."""
    pass

class ReferentialRule(ValidationRule):
    """Values must exist in reference set."""
    pass

class CustomRule(ValidationRule):
    """Custom validation logic."""
    pass
# Define rules for play-by-play data
play_rules = [
RangeRule('valid_down', 'down', min_val=1, max_val=4),
RangeRule('valid_distance', 'distance', min_val=1, max_val=50),
RangeRule('valid_yards', 'yards_gained', min_val=-20, max_val=100),
NotNullRule('game_id_required', 'game_id'),
UniqueRule('unique_play', ['game_id', 'play_id']),
ReferentialRule('valid_team', 'offense', reference=valid_teams),  # assumes valid_teams is a predefined set of team names
]
def validate_with_rules(df: pd.DataFrame, rules: list) -> dict:
    """Run all validation rules and return report."""
    pass  # Implement this
Exercise 4.4: Performance-Optimized Cleaning
Optimize a cleaning pipeline for a large dataset (1M+ rows):
def generate_large_dataset(n_rows: int = 1_000_000) -> pd.DataFrame:
    """Generate a large play-by-play dataset for testing."""
    np.random.seed(42)
    return pd.DataFrame({
        'game_id': np.random.randint(1, 5000, n_rows),
        'play_id': np.arange(n_rows),
        'offense': np.random.choice(['Alabama', 'Georgia', 'Ohio State', 'Michigan'] * 25, n_rows),
        'down': np.random.choice([1, 2, 3, 4, np.nan], n_rows, p=[0.3, 0.25, 0.25, 0.15, 0.05]),
        'distance': np.random.exponential(7, n_rows).clip(1, 50).round(0),
        'yards_gained': np.random.normal(4, 8, n_rows).round(0)
    })
# Time the basic approach
import time
large_df = generate_large_dataset()

start = time.time()
# Basic cleaning (your implementation)
# ...
basic_time = time.time() - start

start = time.time()  # reset the timer so the optimized run is measured on its own
# Optimized cleaning (your implementation)
# ...
optimized_time = time.time() - start

print(f"Basic: {basic_time:.2f}s")
print(f"Optimized: {optimized_time:.2f}s")
print(f"Speedup: {basic_time/optimized_time:.1f}x")
Optimization targets:
1. Use vectorized operations instead of loops
2. Use categorical types for low-cardinality columns
3. Process in chunks if necessary
4. Minimize intermediate DataFrame copies
Level 5: Mastery (Exercises 5.1-5.3)
Exercise 5.1: End-to-End Analysis Pipeline
Build a complete pipeline from raw data to analysis-ready dataset:
class AnalysisPipeline:
    """
    Complete pipeline from raw data to analysis-ready format.

    Stages:
    1. Ingestion - Load from multiple formats (CSV, JSON, API)
    2. Cleaning - Handle quality issues
    3. Integration - Merge multiple sources
    4. Enrichment - Add derived features
    5. Validation - Ensure data quality
    6. Export - Save in analysis-ready format

    Requirements:
    - Configuration-driven (YAML/JSON config file)
    - Extensible (easy to add new steps)
    - Resumable (can restart from any stage)
    - Auditable (complete processing log)
    """

    def __init__(self, config_path: str):
        pass

    def run_stage(self, stage: str):
        pass

    def run_all(self):
        pass

    def get_audit_log(self) -> pd.DataFrame:
        pass
Exercise 5.2: Data Reconciliation System
Build a system to reconcile data from multiple sources that may conflict:
class DataReconciler:
    """
    Reconcile conflicting data from multiple sources.

    When sources disagree:
    - Use rules to determine which source to trust
    - Flag irreconcilable conflicts for manual review
    - Track provenance of final values
    """

    def __init__(self, source_priority: dict):
        """
        source_priority: {'source_name': priority_int}
        Lower number = higher priority
        """
        pass

    def add_source(self, name: str, df: pd.DataFrame, key_cols: list):
        pass

    def reconcile(self, tolerance: dict = None) -> pd.DataFrame:
        """
        tolerance: {'column': max_difference}
        Values within tolerance are not considered conflicts
        """
        pass

    def get_conflicts(self) -> pd.DataFrame:
        """Return all unresolved conflicts."""
        pass
Test case:
# ESPN data
espn = pd.DataFrame({
'game_id': [1, 2],
'home_team': ['Alabama', 'Georgia'],
'home_score': [28, 35]
})
# CBS data (has an error)
cbs = pd.DataFrame({
'game_id': [1, 2],
'home_team': ['Alabama', 'Georgia'],
'home_score': [28, 36] # Off by one
})
# Official NCAA data
ncaa = pd.DataFrame({
'game_id': [1, 2],
'home_team': ['Alabama', 'Georgia'],
'home_score': [28, 35]
})
reconciler = DataReconciler(source_priority={'ncaa': 1, 'espn': 2, 'cbs': 3})
# Add sources and reconcile
Exercise 5.3: ML-Ready Feature Pipeline
Build a pipeline that prepares data specifically for machine learning models:
class MLFeaturePipeline:
    """
    Prepare football data for machine learning.

    Steps:
    1. Split data into train/test by time
    2. Fit transformations on training data only
    3. Create features avoiding data leakage
    4. Handle categorical variables
    5. Scale numeric features
    6. Document all transformations for inference
    """

    def __init__(self, target_col: str, time_col: str):
        self.target_col = target_col
        self.time_col = time_col
        self.fitted_transformers = {}

    def fit(self, train_df: pd.DataFrame):
        """Fit all transformers on training data."""
        pass

    def transform(self, df: pd.DataFrame) -> tuple:
        """Transform data and return (X, y)."""
        pass

    def fit_transform(self, df: pd.DataFrame) -> tuple:
        """Fit and transform training data."""
        pass

    def save(self, path: str):
        """Save fitted pipeline for inference."""
        pass

    def load(self, path: str):
        """Load fitted pipeline."""
        pass
Requirements:
1. Rolling features must use only past data (no future leakage)
2. Categorical encodings fit on train data only
3. Scaling parameters from train data only
4. Document feature definitions for reproducibility
5. Handle new categories at inference time
Submission Guidelines
For each exercise:
- Code: Well-documented, following PEP 8 style
- Output: Show the results of running your code
- Explanation: Brief description of your approach
- Edge Cases: Note how you handled special situations
- Validation: Verify your solution works correctly
Example submission format:
# Exercise X.Y: Title
# Approach: Brief description of strategy
# Code
def my_solution(df):
    """
    Description of what this function does.

    Parameters
    ----------
    df : pd.DataFrame
        Input description

    Returns
    -------
    pd.DataFrame
        Output description
    """
    # Implementation
    pass
# Test
result = my_solution(test_data)
print(result)
# Validation
assert len(result) == expected_length
print("Validation passed!")