Chapter 5: Data Cleaning and Preparation

"Garbage in, garbage out." — George Fuechsel

The most sophisticated analysis is worthless if built on flawed data. In professional sports analytics, data scientists typically spend 60-80% of their time on data cleaning and preparation—the unglamorous but essential foundation of every insight. This chapter teaches you to transform raw, messy football data into analysis-ready datasets.


Learning Objectives

By the end of this chapter, you will be able to:

  1. Identify and handle missing data using appropriate strategies
  2. Standardize team and player names across different data sources
  3. Detect and handle outliers and data quality issues
  4. Merge datasets from multiple sources correctly
  5. Create derived features for football analysis
  6. Build reproducible data cleaning pipelines
  7. Validate data quality at each stage of processing

5.1 The Data Cleaning Imperative

Why Data Quality Matters

Consider a simple example: you're comparing rushing efficiency across teams, and your dataset shows Team A averaging 8.5 yards per carry, an extraordinarily high figure. Investigation reveals a data entry error: the digits of a 75-yard touchdown were shifted one column, so the play was recorded as 750 yards. Without proper cleaning, your analysis would be fundamentally flawed.
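
A simple range check catches this kind of error before it contaminates any averages. The sketch below (a minimal example with hypothetical play IDs and values) flags any single play outside the physically plausible range:

import pandas as pd

# Hypothetical plays, including the mis-keyed 750-yard entry
suspect_plays = pd.DataFrame({
    'play_id': [101, 102, 103],
    'yards_gained': [4, 750, 12]
})

# No play from scrimmage gains more than 99 yards or loses more than ~20,
# so anything outside that band deserves manual review before analysis
flagged = suspect_plays[(suspect_plays['yards_gained'] > 99) |
                        (suspect_plays['yards_gained'] < -20)]
print(flagged)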

Football data presents unique challenges:

  1. Multiple Naming Conventions: "Ohio State," "Ohio St.," "OSU," "THE Ohio State University"
  2. Missing Data Patterns: Weather data missing for dome games, snap counts unavailable for older seasons
  3. Schema Changes: Data providers modify their formats between seasons
  4. Human Entry Errors: Misrecorded jersey numbers, transposed statistics
  5. Definitional Inconsistencies: What counts as a "pressure"? A "broken tackle"?

The Cost of Poor Data

| Issue | Potential Impact |
| --- | --- |
| Duplicate records | Inflated statistics, misleading aggregations |
| Missing values | Biased analysis, failed calculations |
| Inconsistent naming | Failed merges, split records |
| Data type errors | Calculation failures, incorrect sorting |
| Outlier contamination | Skewed means, misleading conclusions |

5.2 Understanding Your Data

Before cleaning, you must understand what you have. This reconnaissance phase is essential.

Initial Data Exploration

import pandas as pd
import numpy as np
from typing import Dict, List, Tuple, Union

def explore_dataset(df: pd.DataFrame, name: str = "Dataset") -> Dict:
    """
    Comprehensive initial exploration of a dataset.

    Parameters
    ----------
    df : pd.DataFrame
        Dataset to explore
    name : str
        Name for reporting

    Returns
    -------
    dict : Exploration summary
    """
    print(f"{'='*60}")
    print(f"DATASET EXPLORATION: {name}")
    print(f"{'='*60}")

    # Basic dimensions
    print(f"\nShape: {df.shape[0]:,} rows × {df.shape[1]} columns")

    # Data types
    print(f"\nData Types:")
    for dtype, count in df.dtypes.value_counts().items():
        print(f"  {dtype}: {count} columns")

    # Memory usage
    memory_mb = df.memory_usage(deep=True).sum() / 1024**2
    print(f"\nMemory Usage: {memory_mb:.2f} MB")

    # Missing values
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    missing_cols = missing[missing > 0].sort_values(ascending=False)

    print(f"\nMissing Values:")
    if len(missing_cols) > 0:
        for col, count in missing_cols.head(10).items():
            print(f"  {col}: {count:,} ({missing_pct[col]:.1f}%)")
    else:
        print("  No missing values found")

    # Duplicates
    dup_rows = df.duplicated().sum()
    print(f"\nDuplicate Rows: {dup_rows:,} ({dup_rows/len(df)*100:.2f}%)")

    return {
        'shape': df.shape,
        'dtypes': df.dtypes.value_counts().to_dict(),
        'memory_mb': memory_mb,
        'missing_cols': missing_cols.to_dict(),
        'duplicate_rows': dup_rows
    }


# Example usage with play-by-play data
plays = pd.DataFrame({
    'game_id': [1, 1, 1, 2, 2, 2],
    'play_id': [1, 2, 3, 1, 2, 3],
    'offense': ['Alabama', 'Alabama', 'ALABAMA', 'Georgia', 'georgia', 'Georgia'],
    'yards_gained': [5, -2, None, 12, 8, None],
    'down': [1, 2, 3, 1, 2, 3],
    'distance': [10, 12, 14, 10, 8, 10]
})

summary = explore_dataset(plays, "Sample Play-by-Play")

Data Type Assessment

Understanding and correcting data types is fundamental:

def assess_column_types(df: pd.DataFrame) -> pd.DataFrame:
    """
    Assess actual vs. appropriate data types for each column.

    Returns a DataFrame with recommendations.
    """
    assessments = []

    for col in df.columns:
        current_type = str(df[col].dtype)
        sample_values = df[col].dropna().head(3).tolist()

        # Assess appropriate type
        if df[col].dtype == 'object':
            # Check if should be numeric
            numeric_test = pd.to_numeric(df[col], errors='coerce')
            if numeric_test.notna().sum() > 0.9 * df[col].notna().sum():
                recommended = 'numeric (currently string)'
            # Check if should be categorical
            elif df[col].nunique() < len(df) * 0.05:
                recommended = 'category (few unique values)'
            else:
                recommended = 'string (OK)'
        elif np.issubdtype(df[col].dtype, np.integer):
            # Check if should be boolean
            if set(df[col].dropna().unique()).issubset({0, 1}):
                recommended = 'boolean (0/1 only)'
            else:
                recommended = 'integer (OK)'
        elif np.issubdtype(df[col].dtype, np.floating):
            recommended = 'float (OK)'
        else:
            recommended = str(df[col].dtype)

        assessments.append({
            'column': col,
            'current_type': current_type,
            'unique_values': df[col].nunique(),
            'recommendation': recommended,
            'sample': str(sample_values)[:50]
        })

    return pd.DataFrame(assessments)
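
Running the assessment on the sample play-by-play frame built above gives a quick read on which columns need attention:

type_report = assess_column_types(plays)
print(type_report[['column', 'current_type', 'recommendation']])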

5.3 Handling Missing Data

Missing data is inevitable in football datasets. The key is handling it appropriately.

Types of Missing Data

Understanding why data is missing determines how to handle it:

| Type | Description | Football Example | Strategy |
| --- | --- | --- | --- |
| MCAR | Missing Completely at Random | Random data entry omissions | Safe to drop or impute |
| MAR | Missing at Random | Weather missing for dome games (related to venue, not the weather itself) | Impute carefully |
| MNAR | Missing Not at Random | Snap counts missing because the player was injured | Must address the missingness mechanism |

Diagnosing Missing Data Patterns

def analyze_missing_patterns(df: pd.DataFrame) -> Dict:
    """
    Analyze patterns in missing data.

    Identifies which columns have missing values and whether
    they're correlated with other variables.
    """
    # Create missingness indicator matrix
    missing_matrix = df.isnull()

    # Summary by column
    missing_summary = pd.DataFrame({
        'missing_count': missing_matrix.sum(),
        'missing_pct': (missing_matrix.sum() / len(df)) * 100,
        'dtype': df.dtypes
    }).sort_values('missing_pct', ascending=False)

    # Look for patterns: do columns tend to be missing together?
    missing_cols = missing_summary[missing_summary['missing_count'] > 0].index.tolist()

    if len(missing_cols) > 1:
        # Correlation of missingness
        missing_corr = missing_matrix[missing_cols].corr()
    else:
        missing_corr = None

    return {
        'summary': missing_summary,
        'missing_columns': missing_cols,
        'missing_correlation': missing_corr
    }


# Example: Analyzing missing weather data
games_data = pd.DataFrame({
    'game_id': range(1, 21),
    'venue_type': ['outdoor'] * 12 + ['dome'] * 8,
    'temperature': [72, 65, 58, 45, 38, 52, 78, 82, 55, 48, 42, 68,
                    None, None, None, None, None, None, None, None],
    'wind': [8, 12, 5, 15, 22, 10, 6, 4, 18, 20, 25, 9,
             None, None, None, None, None, None, None, None],
    'points_scored': [28, 35, 24, 21, 17, 31, 42, 38, 27, 24, 14, 33,
                      35, 28, 31, 42, 38, 27, 45, 33]
})

patterns = analyze_missing_patterns(games_data)
print("Missing Data Analysis:")
print(patterns['summary'][patterns['summary']['missing_count'] > 0])
# Notice: Temperature and wind missing together for dome games (MAR)

Missing Data Strategies

Strategy 1: Deletion

def handle_missing_deletion(df: pd.DataFrame,
                             threshold: float = 0.5) -> pd.DataFrame:
    """
    Handle missing data through deletion.

    - Drop columns with too many missing values
    - Drop rows where critical columns are missing

    Parameters
    ----------
    df : pd.DataFrame
        Data to clean
    threshold : float
        Maximum acceptable proportion of missing values

    Returns
    -------
    pd.DataFrame : Cleaned data
    """
    df_clean = df.copy()

    # Drop columns with excessive missing
    missing_pct = df_clean.isnull().sum() / len(df_clean)
    cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()

    if cols_to_drop:
        print(f"Dropping columns (>{threshold:.0%} missing): {cols_to_drop}")
        df_clean = df_clean.drop(columns=cols_to_drop)

    # Drop rows where essential columns are missing
    # (Define critical columns based on analysis needs)

    return df_clean

Strategy 2: Imputation

def impute_missing_values(df: pd.DataFrame,
                           strategy_map: Dict[str, str]) -> pd.DataFrame:
    """
    Impute missing values using specified strategies.

    Parameters
    ----------
    df : pd.DataFrame
        Data to impute
    strategy_map : dict
        {column: strategy} where strategy is one of:
        'mean', 'median', 'mode', 'zero', 'forward', 'group_mean'

    Returns
    -------
    pd.DataFrame : Imputed data
    """
    df_imputed = df.copy()

    for col, strategy in strategy_map.items():
        if col not in df_imputed.columns:
            continue

        missing_count = df_imputed[col].isnull().sum()
        if missing_count == 0:
            continue

        if strategy == 'mean':
            fill_value = df_imputed[col].mean()
            df_imputed[col] = df_imputed[col].fillna(fill_value)

        elif strategy == 'median':
            fill_value = df_imputed[col].median()
            df_imputed[col] = df_imputed[col].fillna(fill_value)

        elif strategy == 'mode':
            fill_value = df_imputed[col].mode().iloc[0]
            df_imputed[col] = df_imputed[col].fillna(fill_value)

        elif strategy == 'zero':
            df_imputed[col] = df_imputed[col].fillna(0)

        elif strategy == 'forward':
            df_imputed[col] = df_imputed[col].ffill()

        print(f"  {col}: Imputed {missing_count} values using {strategy}")

    return df_imputed


# Example: Domain-appropriate imputation
imputation_strategies = {
    'yards_gained': 'median',      # Use median (robust to outliers)
    'temperature': 'median',        # Use median for dome games
    'wind': 'zero',                 # Assume 0 wind for dome games
    'attendance': 'median',         # Use typical attendance
}
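
Applying these strategies to the dome-game weather example from earlier shows the mechanics; columns listed in the map but absent from the frame are simply skipped:

games_imputed = impute_missing_values(games_data, imputation_strategies)
print(games_imputed[['venue_type', 'temperature', 'wind']].tail())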

Strategy 3: Indicator Variables

Sometimes missingness itself is informative:

def add_missing_indicators(df: pd.DataFrame,
                            columns: List[str]) -> pd.DataFrame:
    """
    Add indicator columns for missing values.

    Useful when missingness is informative (MNAR).

    Parameters
    ----------
    df : pd.DataFrame
        Original data
    columns : list
        Columns to create indicators for

    Returns
    -------
    pd.DataFrame : Data with indicator columns
    """
    df_with_indicators = df.copy()

    for col in columns:
        if col in df.columns:
            indicator_name = f'{col}_missing'
            df_with_indicators[indicator_name] = df[col].isnull().astype(int)

    return df_with_indicators


# Example: Weather data missing for dome games is informative
games_with_indicators = add_missing_indicators(games_data, ['temperature', 'wind'])
# Now 'temperature_missing' = 1 indicates a dome game

5.4 Standardizing Names and Identifiers

One of the biggest challenges in football data is inconsistent naming.

The Team Name Problem

# The same team can appear many ways:
ohio_state_variants = [
    'Ohio State',
    'Ohio St.',
    'Ohio St',
    'OSU',
    'THE Ohio State University',
    'Ohio State Buckeyes',
    'OHIO STATE',
    'ohio state'
]

Building a Name Standardization System

class TeamNameStandardizer:
    """
    Standardize team names across data sources.

    Maintains a mapping of variants to canonical names.
    """

    def __init__(self):
        # Core mapping of variants to standard names
        self.team_mapping = {
            # SEC
            'alabama': 'Alabama',
            'bama': 'Alabama',
            'roll tide': 'Alabama',

            'georgia': 'Georgia',
            'uga': 'Georgia',
            'dawgs': 'Georgia',

            'lsu': 'LSU',
            'louisiana state': 'LSU',

            'texas a&m': 'Texas A&M',
            'texas am': 'Texas A&M',
            'tamu': 'Texas A&M',

            # Big Ten
            'ohio state': 'Ohio State',
            'ohio st': 'Ohio State',
            'ohio st.': 'Ohio State',
            'osu': 'Ohio State',  # Context-dependent, could be Oklahoma State

            'michigan': 'Michigan',
            'um': 'Michigan',
            'wolverines': 'Michigan',

            'penn state': 'Penn State',
            'penn st': 'Penn State',
            'penn st.': 'Penn State',
            'psu': 'Penn State',

            # Additional mappings...
        }

        # Common suffixes to remove
        self.suffixes_to_remove = [
            ' Crimson Tide', ' Bulldogs', ' Tigers', ' Aggies',
            ' Buckeyes', ' Wolverines', ' Nittany Lions',
            ' Fighting Irish', ' Longhorns', ' Sooners'
        ]

    def clean_name(self, name: str) -> str:
        """
        Clean a team name for matching.

        Lowercase, remove punctuation, strip whitespace.
        """
        if pd.isna(name):
            return name

        cleaned = str(name).lower().strip()

        # Remove common suffixes
        for suffix in self.suffixes_to_remove:
            cleaned = cleaned.replace(suffix.lower(), '')

        # Remove punctuation
        cleaned = cleaned.replace('.', '').replace("'", '')

        return cleaned.strip()

    def standardize(self, name: str) -> str:
        """
        Convert a team name to its standard form.

        Parameters
        ----------
        name : str
            Raw team name

        Returns
        -------
        str : Standardized name
        """
        if pd.isna(name):
            return name

        cleaned = self.clean_name(name)

        if cleaned in self.team_mapping:
            return self.team_mapping[cleaned]

        # If no exact match, fall back to the original name in title case
        return name.strip().title()

    def standardize_column(self, df: pd.DataFrame,
                            column: str) -> pd.DataFrame:
        """
        Standardize team names in a DataFrame column.

        Parameters
        ----------
        df : pd.DataFrame
            Data containing team names
        column : str
            Column name to standardize

        Returns
        -------
        pd.DataFrame : Data with standardized names
        """
        df_clean = df.copy()
        original_unique = df_clean[column].nunique()

        df_clean[column] = df_clean[column].apply(self.standardize)

        new_unique = df_clean[column].nunique()
        print(f"Standardized '{column}': {original_unique} → {new_unique} unique values")

        return df_clean


# Usage
standardizer = TeamNameStandardizer()

raw_data = pd.DataFrame({
    'home_team': ['Ohio State', 'MICHIGAN', 'Penn St.', 'georgia', 'LSU Tigers'],
    'away_team': ['michigan', 'osu', 'Ohio State Buckeyes', 'Alabama', 'texas am']
})

clean_data = standardizer.standardize_column(raw_data, 'home_team')
clean_data = standardizer.standardize_column(clean_data, 'away_team')

Player Name Standardization

class PlayerNameStandardizer:
    """
    Standardize player names for matching across sources.
    """

    def __init__(self):
        # Common name variations
        self.first_name_map = {
            'mike': 'Michael',
            'will': 'William',
            'bill': 'William',
            'bob': 'Robert',
            'rob': 'Robert',
            'joe': 'Joseph',
            'tom': 'Thomas',
            'jim': 'James',
            'jimmy': 'James',
            'chris': 'Christopher',
            'matt': 'Matthew',
            'dan': 'Daniel',
            'dave': 'David',
            'steve': 'Stephen',
            'tony': 'Anthony',
            'nick': 'Nicholas',
            'josh': 'Joshua',
        }

        # Suffix handling
        self.suffixes = ['Jr.', 'Jr', 'Sr.', 'Sr', 'II', 'III', 'IV', 'V']

    def standardize(self, name: str) -> str:
        """
        Standardize a player name.

        Handles:
        - Case normalization
        - Nickname to formal name conversion
        - Suffix normalization
        - Extra whitespace
        """
        if pd.isna(name) or name == '':
            return name

        # Clean up whitespace
        parts = str(name).strip().split()

        if len(parts) < 2:
            return name.strip().title()

        # Handle first name
        first = parts[0].lower()
        if first in self.first_name_map:
            parts[0] = self.first_name_map[first]
        else:
            parts[0] = parts[0].title()

        # Handle rest of name
        for i in range(1, len(parts)):
            part = parts[i]

            # Check for suffix
            is_suffix = any(part.lower().replace('.', '') ==
                          s.lower().replace('.', '') for s in self.suffixes)

            if is_suffix:
                # Normalize suffix format
                parts[i] = part.upper().replace('.', '') + '.'
            else:
                parts[i] = part.title()

        return ' '.join(parts)

    def create_matching_key(self, name: str) -> str:
        """
        Create a simplified key for fuzzy matching.

        Returns: "lastname_firstinitial"
        """
        if pd.isna(name) or name == '':
            return ''

        standardized = self.standardize(name)
        parts = standardized.split()

        if len(parts) < 2:
            return name.lower().replace(' ', '_')

        # Remove suffix
        parts = [p for p in parts if not any(
            p.lower().replace('.', '') == s.lower().replace('.', '')
            for s in self.suffixes
        )]

        if len(parts) < 2:
            return name.lower().replace(' ', '_')

        # Create key: lastname_firstinitial
        first_initial = parts[0][0].lower()
        last_name = parts[-1].lower()

        return f"{last_name}_{first_initial}"

5.5 Data Type Conversions

Proper data types ensure correct operations and efficient memory usage.

Common Type Conversions

def convert_data_types(df: pd.DataFrame,
                        type_spec: Dict[str, str]) -> pd.DataFrame:
    """
    Convert columns to specified data types.

    Parameters
    ----------
    df : pd.DataFrame
        Data to convert
    type_spec : dict
        {column: type} where type is 'int', 'float', 'bool',
        'category', 'datetime', 'string'

    Returns
    -------
    pd.DataFrame : Data with converted types
    """
    df_converted = df.copy()

    for col, dtype in type_spec.items():
        if col not in df_converted.columns:
            print(f"Warning: Column '{col}' not found")
            continue

        try:
            if dtype == 'int':
                # Handle NaN by converting to nullable integer
                df_converted[col] = pd.to_numeric(
                    df_converted[col], errors='coerce'
                ).astype('Int64')

            elif dtype == 'float':
                df_converted[col] = pd.to_numeric(
                    df_converted[col], errors='coerce'
                ).astype('float64')

            elif dtype == 'bool':
                df_converted[col] = df_converted[col].astype(bool)

            elif dtype == 'category':
                df_converted[col] = df_converted[col].astype('category')

            elif dtype == 'datetime':
                df_converted[col] = pd.to_datetime(
                    df_converted[col], errors='coerce'
                )

            elif dtype == 'string':
                df_converted[col] = df_converted[col].astype(str)

            print(f"  Converted '{col}' to {dtype}")

        except Exception as e:
            print(f"  Error converting '{col}': {e}")

    return df_converted


# Example: Play-by-play type corrections
play_type_spec = {
    'game_id': 'int',
    'play_id': 'int',
    'yards_gained': 'float',  # Can have NaN
    'down': 'int',
    'distance': 'int',
    'quarter': 'int',
    'game_date': 'datetime',
    'play_type': 'category',
    'offense': 'string',
    'is_scoring_play': 'bool'
}
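
Applying the specification to the small plays sample from Section 5.2 illustrates the behavior; columns in the spec that the frame lacks simply produce a warning:

plays_typed = convert_data_types(plays, play_type_spec)
print(plays_typed.dtypes)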

Memory Optimization

Large datasets benefit from memory-efficient types:

def optimize_memory(df: pd.DataFrame,
                     verbose: bool = True) -> pd.DataFrame:
    """
    Optimize DataFrame memory usage by downcasting types.

    Parameters
    ----------
    df : pd.DataFrame
        Data to optimize
    verbose : bool
        Print memory savings

    Returns
    -------
    pd.DataFrame : Memory-optimized data
    """
    initial_memory = df.memory_usage(deep=True).sum() / 1024**2

    df_optimized = df.copy()

    for col in df_optimized.columns:
        col_type = df_optimized[col].dtype

        if col_type == 'int64':
            # Downcast integers
            c_min = df_optimized[col].min()
            c_max = df_optimized[col].max()

            if c_min >= 0:
                if c_max < 255:
                    df_optimized[col] = df_optimized[col].astype('uint8')
                elif c_max < 65535:
                    df_optimized[col] = df_optimized[col].astype('uint16')
                elif c_max < 4294967295:
                    df_optimized[col] = df_optimized[col].astype('uint32')
            else:
                if c_min > -128 and c_max < 127:
                    df_optimized[col] = df_optimized[col].astype('int8')
                elif c_min > -32768 and c_max < 32767:
                    df_optimized[col] = df_optimized[col].astype('int16')
                elif c_min > -2147483648 and c_max < 2147483647:
                    df_optimized[col] = df_optimized[col].astype('int32')

        elif col_type == 'float64':
            # Downcast floats
            df_optimized[col] = pd.to_numeric(
                df_optimized[col], downcast='float'
            )

        elif col_type == 'object':
            # Convert low-cardinality strings to category
            num_unique = df_optimized[col].nunique()
            num_total = len(df_optimized[col])
            if num_unique / num_total < 0.5:
                df_optimized[col] = df_optimized[col].astype('category')

    final_memory = df_optimized.memory_usage(deep=True).sum() / 1024**2

    if verbose:
        print(f"Memory: {initial_memory:.2f} MB → {final_memory:.2f} MB")
        print(f"Reduced by {(1 - final_memory/initial_memory)*100:.1f}%")

    return df_optimized
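
On the tiny samples in this chapter the savings are trivial, but the same call can shrink multi-season play-by-play files considerably:

plays_optimized = optimize_memory(plays)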

5.6 Detecting and Handling Outliers

Outliers require careful consideration in football data—some are errors, others are legitimate exceptional performances.

Outlier Detection Methods

from scipy import stats

def detect_outliers(df: pd.DataFrame,
                     column: str,
                     method: str = 'iqr') -> pd.DataFrame:
    """
    Detect outliers in a numeric column.

    Parameters
    ----------
    df : pd.DataFrame
        Data to analyze
    column : str
        Numeric column to check
    method : str
        Detection method: 'iqr', 'zscore', 'modified_zscore'

    Returns
    -------
    pd.DataFrame : Rows containing outliers
    """
    data = df[column].dropna()

    if method == 'iqr':
        q1 = data.quantile(0.25)
        q3 = data.quantile(0.75)
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        outlier_mask = (df[column] < lower) | (df[column] > upper)

    elif method == 'zscore':
        z_scores = np.abs(stats.zscore(data))
        threshold = 3
        outlier_values = data[z_scores > threshold]
        outlier_mask = df[column].isin(outlier_values)

    elif method == 'modified_zscore':
        # More robust to outliers than standard z-score
        median = data.median()
        mad = np.median(np.abs(data - median))
        modified_z = 0.6745 * (data - median) / mad
        outlier_values = data[np.abs(modified_z) > 3.5]
        outlier_mask = df[column].isin(outlier_values)

    else:
        raise ValueError(f"Unknown method: {method}")

    outliers = df[outlier_mask].copy()
    outliers['outlier_value'] = df.loc[outlier_mask, column]

    return outliers


def analyze_potential_outliers(df: pd.DataFrame,
                                column: str) -> Dict:
    """
    Comprehensive outlier analysis for a column.

    Returns information to help decide if outliers are
    errors or legitimate exceptional values.
    """
    outliers_iqr = detect_outliers(df, column, 'iqr')
    outliers_z = detect_outliers(df, column, 'zscore')

    analysis = {
        'column': column,
        'n_total': len(df),
        'n_outliers_iqr': len(outliers_iqr),
        'n_outliers_zscore': len(outliers_z),
        'outlier_pct': len(outliers_iqr) / len(df) * 100,
        'outlier_values': outliers_iqr[column].tolist(),
        'mean': df[column].mean(),
        'median': df[column].median(),
        'std': df[column].std(),
        'min': df[column].min(),
        'max': df[column].max()
    }

    return analysis
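
For example, a season of per-game rushing averages with one mis-keyed value (hypothetical numbers) shows the analysis in action:

rushing = pd.DataFrame({
    'team': ['Team A'] * 12,
    'yards_per_carry': [4.1, 3.8, 5.2, 4.5, 3.9, 4.8, 4.2, 5.0,
                        3.6, 4.4, 4.7, 75.0]
})

analysis = analyze_potential_outliers(rushing, 'yards_per_carry')
print(f"IQR outliers: {analysis['n_outliers_iqr']}, values: {analysis['outlier_values']}")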

Contextual Outlier Handling in Football

def handle_football_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """
    Handle outliers with football domain knowledge.

    Some outliers are real (big plays), others are errors.
    """
    df_clean = df.copy()

    # Yards gained: Can legitimately range from ~-10 to 99
    # Values outside this are likely errors
    if 'yards_gained' in df_clean.columns:
        invalid_yards = (df_clean['yards_gained'] < -15) | \
                        (df_clean['yards_gained'] > 100)
        df_clean.loc[invalid_yards, 'yards_gained'] = np.nan
        print(f"Flagged {invalid_yards.sum()} invalid yard values")

    # Down: Must be 1-4
    if 'down' in df_clean.columns:
        invalid_down = ~df_clean['down'].isin([1, 2, 3, 4])
        df_clean.loc[invalid_down, 'down'] = np.nan
        print(f"Flagged {invalid_down.sum()} invalid down values")

    # Distance: Typically 1-25, but can be higher
    if 'distance' in df_clean.columns:
        invalid_distance = (df_clean['distance'] < 1) | \
                           (df_clean['distance'] > 50)
        df_clean.loc[invalid_distance, 'distance'] = np.nan
        print(f"Flagged {invalid_distance.sum()} invalid distance values")

    # Quarter: Must be 1-4 (or 5 for OT)
    if 'quarter' in df_clean.columns:
        invalid_quarter = ~df_clean['quarter'].isin([1, 2, 3, 4, 5])
        df_clean.loc[invalid_quarter, 'quarter'] = np.nan
        print(f"Flagged {invalid_quarter.sum()} invalid quarter values")

    return df_clean


def flag_statistical_outliers(df: pd.DataFrame,
                               column: str,
                               context_column: str = None) -> pd.DataFrame:
    """
    Flag outliers while preserving them for review.

    Instead of removing, adds a flag column for manual review.

    Parameters
    ----------
    df : pd.DataFrame
        Data to analyze
    column : str
        Column to check for outliers
    context_column : str
        Group by this column for context-aware detection

    Returns
    -------
    pd.DataFrame : Data with outlier flags
    """
    df_flagged = df.copy()
    flag_col = f'{column}_outlier'
    df_flagged[flag_col] = False

    if context_column and context_column in df.columns:
        # Context-aware: outliers within each group
        for group_value in df[context_column].unique():
            group_mask = df[context_column] == group_value
            group_data = df.loc[group_mask, column]

            if len(group_data) < 10:
                continue

            q1 = group_data.quantile(0.25)
            q3 = group_data.quantile(0.75)
            iqr = q3 - q1

            outlier_mask = group_mask & (
                (df[column] < q1 - 1.5 * iqr) |
                (df[column] > q3 + 1.5 * iqr)
            )
            df_flagged.loc[outlier_mask, flag_col] = True
    else:
        # Global outlier detection
        outliers = detect_outliers(df, column, 'iqr')
        df_flagged.loc[outliers.index, flag_col] = True

    print(f"Flagged {df_flagged[flag_col].sum()} outliers in '{column}'")
    return df_flagged
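
The same rushing sample from above can be flagged in place rather than summarized; with no context column the function falls back to global IQR detection:

rushing_flagged = flag_statistical_outliers(rushing, 'yards_per_carry')
print(rushing_flagged[rushing_flagged['yards_per_carry_outlier']])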

5.7 Merging and Joining Datasets

Combining data from multiple sources is essential for comprehensive analysis.

Common Join Types

def demonstrate_join_types():
    """
    Illustrate different types of joins with football data.
    """
    # Game-level data
    games = pd.DataFrame({
        'game_id': [1, 2, 3, 4],
        'home_team': ['Alabama', 'Georgia', 'Ohio State', 'Texas'],
        'away_team': ['Georgia', 'LSU', 'Michigan', 'Oklahoma'],
        'home_score': [28, 35, 42, 31]
    })

    # Weather data (missing for some games)
    weather = pd.DataFrame({
        'game_id': [1, 2, 4],
        'temperature': [72, 65, 85],
        'conditions': ['clear', 'cloudy', 'sunny']
    })

    print("Games (left):")
    print(games)
    print("\nWeather (right):")
    print(weather)

    # Inner join: Only games with weather
    inner = pd.merge(games, weather, on='game_id', how='inner')
    print("\nINNER JOIN (only matching rows):")
    print(inner)

    # Left join: All games, weather where available
    left = pd.merge(games, weather, on='game_id', how='left')
    print("\nLEFT JOIN (all games, weather if available):")
    print(left)

    # Outer join: All records from both
    outer = pd.merge(games, weather, on='game_id', how='outer')
    print("\nOUTER JOIN (all records):")
    print(outer)


demonstrate_join_types()

Safe Merging Practices

def safe_merge(left: pd.DataFrame,
               right: pd.DataFrame,
               on: Union[str, List[str]],
               how: str = 'left',
               validate: str = None,
               indicator: bool = True) -> pd.DataFrame:
    """
    Perform a merge with validation and diagnostics.

    Parameters
    ----------
    left : pd.DataFrame
        Left DataFrame
    right : pd.DataFrame
        Right DataFrame
    on : str or list
        Column(s) to merge on
    how : str
        Join type: 'left', 'right', 'inner', 'outer'
    validate : str
        Validation: 'one_to_one', 'one_to_many', 'many_to_one', 'many_to_many'
    indicator : bool
        Add merge indicator column

    Returns
    -------
    pd.DataFrame : Merged data
    """
    # Pre-merge diagnostics
    print("="*50)
    print("MERGE DIAGNOSTICS")
    print("="*50)

    on_cols = [on] if isinstance(on, str) else on

    print(f"\nLeft shape: {left.shape}")
    print(f"Right shape: {right.shape}")

    for col in on_cols:
        left_unique = left[col].nunique()
        right_unique = right[col].nunique()
        print(f"\n'{col}':")
        print(f"  Left unique: {left_unique}")
        print(f"  Right unique: {right_unique}")

        # Check for values that won't match
        left_only = set(left[col]) - set(right[col])
        right_only = set(right[col]) - set(left[col])
        print(f"  Left-only values: {len(left_only)}")
        print(f"  Right-only values: {len(right_only)}")

    # Perform merge
    try:
        merged = pd.merge(
            left, right,
            on=on,
            how=how,
            validate=validate,
            indicator=indicator
        )

        if indicator:
            merge_counts = merged['_merge'].value_counts()
            print(f"\nMerge results:")
            for status, count in merge_counts.items():
                print(f"  {status}: {count}")

        print(f"\nFinal shape: {merged.shape}")

        return merged

    except Exception as e:
        print(f"\nMerge failed: {e}")
        raise


# Example: Merging play-by-play with team data
plays = pd.DataFrame({
    'play_id': range(1, 11),
    'team': ['Alabama', 'Alabama', 'Georgia', 'Georgia', 'Ohio State',
             'Ohio St', 'Michigan', 'michigan', 'LSU', 'Texas A&M'],
    'yards': [5, 12, -2, 8, 15, 6, 3, 9, 22, 4]
})

teams = pd.DataFrame({
    'team_name': ['Alabama', 'Georgia', 'Ohio State', 'Michigan', 'LSU', 'Texas A&M'],
    'conference': ['SEC', 'SEC', 'Big Ten', 'Big Ten', 'SEC', 'SEC']
})

# Note: a direct merge would leave rows like 'Ohio St' and 'michigan' unmatched
# due to naming inconsistencies. Standardize the names first, as shown below.
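
One workable sequence, reusing the TeamNameStandardizer from Section 5.4 and renaming the key so both frames share a 'team' column (a sketch, not the only option):

plays_std = standardizer.standardize_column(plays, 'team')
teams_std = teams.rename(columns={'team_name': 'team'})

enriched = safe_merge(plays_std, teams_std, on='team', how='left',
                      validate='many_to_one')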

Handling Multiple Key Merges

def merge_play_game_data(plays: pd.DataFrame,
                          games: pd.DataFrame) -> pd.DataFrame:
    """
    Merge play-level data with game-level data.

    Handles the common case of enriching plays with game context.

    Parameters
    ----------
    plays : pd.DataFrame
        Play-by-play data (requires game_id)
    games : pd.DataFrame
        Game-level data (requires game_id)

    Returns
    -------
    pd.DataFrame : Enriched play data
    """
    # Validate key exists
    if 'game_id' not in plays.columns:
        raise ValueError("plays must contain 'game_id'")
    if 'game_id' not in games.columns:
        raise ValueError("games must contain 'game_id'")

    # Identify columns to bring from games (avoid duplicates)
    game_cols = ['game_id'] + [c for c in games.columns
                                if c not in plays.columns and c != 'game_id']

    # Merge
    merged = pd.merge(
        plays,
        games[game_cols],
        on='game_id',
        how='left',
        validate='many_to_one'  # Many plays per game
    )

    # Check for unmatched plays
    unmatched = merged[merged[game_cols[1]].isna()] if len(game_cols) > 1 else pd.DataFrame()
    if len(unmatched) > 0:
        print(f"Warning: {len(unmatched)} plays couldn't be matched to games")

    return merged
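
A minimal sketch with hypothetical game context; a play whose game_id has no match triggers the warning:

plays_small = pd.DataFrame({
    'game_id': [1, 1, 2, 3],
    'play_id': [1, 2, 1, 1],
    'yards_gained': [5, 12, -2, 8]
})

games_small = pd.DataFrame({
    'game_id': [1, 2],
    'week': [1, 1],
    'venue_type': ['outdoor', 'dome']
})

enriched_plays = merge_play_game_data(plays_small, games_small)
# The play from game 3 has no match, so the unmatched-plays warning prints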

5.8 Feature Engineering for Football

Creating derived features unlocks analytical potential.

Common Football Features

def create_situational_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create situational features from play-by-play data.

    Parameters
    ----------
    df : pd.DataFrame
        Play-by-play data with standard columns

    Returns
    -------
    pd.DataFrame : Data with new features
    """
    df_features = df.copy()

    # Down and distance categories
    if 'down' in df.columns and 'distance' in df.columns:
        # Short, medium, long distance
        df_features['distance_category'] = pd.cut(
            df_features['distance'],
            bins=[0, 3, 7, 15, 100],
            labels=['short', 'medium', 'long', 'very_long']
        )

        # Standard downs (1st, 2nd & manageable) vs. passing downs (2nd/3rd & long)
        df_features['passing_down'] = (
            ((df_features['down'] == 2) & (df_features['distance'] >= 8)) |
            ((df_features['down'] == 3) & (df_features['distance'] >= 5)) |
            (df_features['down'] == 4)
        ).astype(int)

    # Field position categories
    if 'yard_line' in df.columns:
        df_features['field_zone'] = pd.cut(
            df_features['yard_line'],
            bins=[0, 20, 40, 60, 80, 100],
            labels=['backed_up', 'own_territory', 'midfield',
                    'opponent_territory', 'red_zone']
        )

    # Score differential
    if 'home_score' in df.columns and 'away_score' in df.columns:
        if 'is_home' in df.columns:
            df_features['score_diff'] = np.where(
                df_features['is_home'],
                df_features['home_score'] - df_features['away_score'],
                df_features['away_score'] - df_features['home_score']
            )
        else:
            df_features['score_diff'] = (
                df_features['home_score'] - df_features['away_score']
            )

        # Game state categories
        df_features['game_state'] = pd.cut(
            df_features['score_diff'],
            bins=[-100, -17, -9, -1, 1, 9, 17, 100],
            labels=['blowout_behind', 'big_deficit', 'small_deficit',
                    'close', 'small_lead', 'big_lead', 'blowout_ahead']
        )

    return df_features


def create_efficiency_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create efficiency-based features.

    Parameters
    ----------
    df : pd.DataFrame
        Data with performance statistics

    Returns
    -------
    pd.DataFrame : Data with efficiency features
    """
    df_features = df.copy()

    # Yards per attempt (passing)
    if 'passing_yards' in df.columns and 'pass_attempts' in df.columns:
        df_features['yards_per_attempt'] = (
            df_features['passing_yards'] / df_features['pass_attempts']
        ).replace([np.inf, -np.inf], np.nan)

    # Yards per carry (rushing)
    if 'rushing_yards' in df.columns and 'rush_attempts' in df.columns:
        df_features['yards_per_carry'] = (
            df_features['rushing_yards'] / df_features['rush_attempts']
        ).replace([np.inf, -np.inf], np.nan)

    # Completion percentage
    if 'completions' in df.columns and 'pass_attempts' in df.columns:
        df_features['completion_pct'] = (
            df_features['completions'] / df_features['pass_attempts'] * 100
        ).replace([np.inf, -np.inf], np.nan)

    # Points per play
    if 'points' in df.columns and 'total_plays' in df.columns:
        df_features['points_per_play'] = (
            df_features['points'] / df_features['total_plays']
        ).replace([np.inf, -np.inf], np.nan)

    # Turnover rate
    if 'turnovers' in df.columns and 'total_plays' in df.columns:
        df_features['turnover_rate'] = (
            df_features['turnovers'] / df_features['total_plays'] * 100
        ).replace([np.inf, -np.inf], np.nan)

    return df_features
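
A quick check with a hypothetical two-team stat line; only the features whose input columns are present get created:

team_stats = pd.DataFrame({
    'team': ['Alabama', 'Georgia'],
    'passing_yards': [312, 245],
    'pass_attempts': [34, 28],
    'completions': [24, 19],
    'rushing_yards': [185, 142],
    'rush_attempts': [38, 33]
})

team_stats = create_efficiency_features(team_stats)
print(team_stats[['team', 'yards_per_attempt', 'yards_per_carry', 'completion_pct']])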


def create_rolling_features(df: pd.DataFrame,
                             group_col: str,
                             value_col: str,
                             windows: List[int] = [3, 5]) -> pd.DataFrame:
    """
    Create rolling average features.

    Parameters
    ----------
    df : pd.DataFrame
        Data sorted by time/game order
    group_col : str
        Column to group by (e.g., 'team')
    value_col : str
        Column to calculate rolling stats for
    windows : list
        Window sizes for rolling calculations

    Returns
    -------
    pd.DataFrame : Data with rolling features
    """
    df_features = df.copy()

    for window in windows:
        # Rolling mean
        col_name = f'{value_col}_rolling_{window}g'
        df_features[col_name] = df_features.groupby(group_col)[value_col].transform(
            lambda x: x.rolling(window, min_periods=1).mean().shift(1)
        )

        # Rolling std
        std_name = f'{value_col}_rolling_std_{window}g'
        df_features[std_name] = df_features.groupby(group_col)[value_col].transform(
            lambda x: x.rolling(window, min_periods=1).std().shift(1)
        )

    return df_features
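
A sketch with a hypothetical game-by-game scoring series; the data must already be sorted in game order within each team, and the shift(1) inside the function ensures each row only sees games played before it:

team_points = pd.DataFrame({
    'team': ['Alabama'] * 5 + ['Georgia'] * 5,
    'week': list(range(1, 6)) * 2,
    'points': [35, 28, 42, 17, 31, 24, 38, 27, 45, 33]
})

team_points = create_rolling_features(team_points, 'team', 'points', windows=[3])
print(team_points[['team', 'week', 'points', 'points_rolling_3g']])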

5.9 Data Validation

Validation ensures data meets expectations before analysis.

Building Validation Functions

class DataValidator:
    """
    Validate football data against defined rules.
    """

    def __init__(self):
        self.errors = []
        self.warnings = []

    def validate_range(self, df: pd.DataFrame, column: str,
                        min_val: float = None, max_val: float = None) -> bool:
        """Check if values fall within expected range."""
        if column not in df.columns:
            self.errors.append(f"Column '{column}' not found")
            return False

        values = df[column].dropna()

        if min_val is not None:
            below_min = values[values < min_val]
            if len(below_min) > 0:
                self.errors.append(
                    f"{column}: {len(below_min)} values below {min_val}"
                )
                return False

        if max_val is not None:
            above_max = values[values > max_val]
            if len(above_max) > 0:
                self.errors.append(
                    f"{column}: {len(above_max)} values above {max_val}"
                )
                return False

        return True

    def validate_not_null(self, df: pd.DataFrame, columns: List[str]) -> bool:
        """Check that columns have no null values."""
        valid = True
        for col in columns:
            if col not in df.columns:
                self.errors.append(f"Column '{col}' not found")
                valid = False
                continue

            null_count = df[col].isnull().sum()
            if null_count > 0:
                self.errors.append(f"{col}: {null_count} null values")
                valid = False

        return valid

    def validate_unique(self, df: pd.DataFrame, columns: List[str]) -> bool:
        """Check that column combination is unique (no duplicates)."""
        duplicates = df.duplicated(subset=columns, keep=False)
        if duplicates.any():
            self.errors.append(
                f"Duplicate rows found for {columns}: {duplicates.sum()}"
            )
            return False
        return True

    def validate_referential(self, df: pd.DataFrame, column: str,
                              reference: set) -> bool:
        """Check that values exist in reference set."""
        if column not in df.columns:
            self.errors.append(f"Column '{column}' not found")
            return False

        values = set(df[column].dropna().unique())
        invalid = values - reference

        if invalid:
            self.warnings.append(
                f"{column}: {len(invalid)} values not in reference set"
            )
            return False

        return True

    def validate_play_by_play(self, df: pd.DataFrame) -> Dict:
        """
        Comprehensive validation for play-by-play data.

        Returns validation report.
        """
        self.errors = []
        self.warnings = []

        # Required columns
        required = ['game_id', 'play_id', 'down', 'distance', 'yards_gained']
        for col in required:
            if col not in df.columns:
                self.errors.append(f"Missing required column: {col}")

        # Validate ranges
        self.validate_range(df, 'down', 1, 4)
        self.validate_range(df, 'distance', 1, 50)
        self.validate_range(df, 'yards_gained', -20, 100)
        self.validate_range(df, 'quarter', 1, 5)

        # Check for duplicates
        if 'game_id' in df.columns and 'play_id' in df.columns:
            self.validate_unique(df, ['game_id', 'play_id'])

        # Generate report
        report = {
            'valid': len(self.errors) == 0,
            'errors': self.errors,
            'warnings': self.warnings,
            'row_count': len(df),
            'columns': list(df.columns)
        }

        return report


# Usage
validator = DataValidator()
report = validator.validate_play_by_play(plays)
print(f"Validation passed: {report['valid']}")
if report['errors']:
    print("Errors:")
    for error in report['errors']:
        print(f"  - {error}")

5.10 Building Reproducible Pipelines

Professional data cleaning requires reproducible, documented pipelines.

Complete Cleaning Pipeline

class FootballDataPipeline:
    """
    Complete, reproducible data cleaning pipeline for football data.
    """

    def __init__(self, verbose: bool = True):
        self.verbose = verbose
        self.steps_completed = []
        self.team_standardizer = TeamNameStandardizer()
        self.validator = DataValidator()

    def log(self, message: str):
        """Log processing step."""
        if self.verbose:
            print(f"[Pipeline] {message}")

    def clean_play_by_play(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Complete cleaning pipeline for play-by-play data.

        Parameters
        ----------
        df : pd.DataFrame
            Raw play-by-play data

        Returns
        -------
        pd.DataFrame : Cleaned, analysis-ready data
        """
        self.log(f"Starting pipeline with {len(df):,} rows")
        df_clean = df.copy()

        # Step 1: Remove complete duplicates
        initial_rows = len(df_clean)
        df_clean = df_clean.drop_duplicates()
        removed = initial_rows - len(df_clean)
        self.log(f"Step 1: Removed {removed} duplicate rows")
        self.steps_completed.append('remove_duplicates')

        # Step 2: Standardize team names
        for col in ['offense', 'defense', 'home_team', 'away_team']:
            if col in df_clean.columns:
                df_clean = self.team_standardizer.standardize_column(df_clean, col)
        self.log("Step 2: Standardized team names")
        self.steps_completed.append('standardize_names')

        # Step 3: Convert data types
        type_spec = {
            'game_id': 'int',
            'play_id': 'int',
            'yards_gained': 'float',
            'down': 'int',
            'distance': 'int',
            'quarter': 'int'
        }
        df_clean = convert_data_types(df_clean, type_spec)
        self.log("Step 3: Converted data types")
        self.steps_completed.append('convert_types')

        # Step 4: Handle invalid values
        df_clean = handle_football_outliers(df_clean)
        self.log("Step 4: Handled invalid values")
        self.steps_completed.append('handle_outliers')

        # Step 5: Handle missing values
        imputation_map = {
            'yards_gained': 'median',
            'down': 'mode',
            'distance': 'median'
        }
        df_clean = impute_missing_values(df_clean, imputation_map)
        self.log("Step 5: Imputed missing values")
        self.steps_completed.append('impute_missing')

        # Step 6: Create features
        df_clean = create_situational_features(df_clean)
        self.log("Step 6: Created situational features")
        self.steps_completed.append('create_features')

        # Step 7: Optimize memory
        df_clean = optimize_memory(df_clean, verbose=False)
        self.log("Step 7: Optimized memory usage")
        self.steps_completed.append('optimize_memory')

        # Step 8: Validate
        report = self.validator.validate_play_by_play(df_clean)
        self.log(f"Step 8: Validation {'passed' if report['valid'] else 'failed'}")
        self.steps_completed.append('validate')

        self.log(f"Pipeline complete: {len(df_clean):,} rows")

        return df_clean

    def get_pipeline_report(self) -> Dict:
        """Generate report of pipeline execution."""
        return {
            'steps_completed': self.steps_completed,
            'n_steps': len(self.steps_completed)
        }


# Example usage
pipeline = FootballDataPipeline(verbose=True)

# Load raw data (example)
raw_plays = pd.DataFrame({
    'game_id': [1, 1, 1, 2, 2],
    'play_id': [1, 2, 3, 1, 2],
    'offense': ['ALABAMA', 'alabama', 'Bama', 'Georgia', 'UGA'],
    'defense': ['Georgia', 'Georgia', 'georgia', 'Alabama', 'Alabama'],
    'down': [1, 2, 3, 1, 2],
    'distance': [10, 8, 5, 10, 7],
    'yards_gained': [3, 5, None, 12, -2],
    'quarter': [1, 1, 1, 1, 1]
})

clean_plays = pipeline.clean_play_by_play(raw_plays)

5.11 Best Practices Summary

Data Cleaning Checklist

## Pre-Cleaning
- [ ] Document raw data source and date obtained
- [ ] Create backup of raw data
- [ ] Explore data structure (shape, types, missing patterns)
- [ ] Identify key columns and their expected values

## Cleaning Steps
- [ ] Remove exact duplicates
- [ ] Standardize string columns (names, categories)
- [ ] Convert data types appropriately
- [ ] Handle missing values (document strategy)
- [ ] Detect and handle outliers/errors
- [ ] Validate value ranges and constraints

## Integration
- [ ] Merge related datasets carefully
- [ ] Verify join keys match
- [ ] Check for unmatched records
- [ ] Avoid creating duplicates through joins

## Feature Engineering
- [ ] Create derived columns needed for analysis
- [ ] Document feature definitions
- [ ] Validate new features

## Quality Assurance
- [ ] Run validation checks
- [ ] Verify row/column counts
- [ ] Sample and manually review records
- [ ] Document known data limitations

## Documentation
- [ ] Record all transformations applied
- [ ] Note any assumptions made
- [ ] Save cleaning code for reproducibility
- [ ] Version control the pipeline

Common Mistakes to Avoid

  1. Cleaning data in Excel: Not reproducible, error-prone
  2. Not documenting transformations: Future you won't remember
  3. Dropping too much data: Investigate before deleting
  4. Imputing without understanding: Know why data is missing
  5. Joining without validation: Check match rates
  6. Not version controlling: Changes should be trackable
  7. Manual string corrections: Build systematic mappings
  8. Ignoring warnings: They often indicate real problems

Chapter Summary

Data cleaning is the foundation of every analysis. In this chapter, you learned to:

  1. Explore data systematically before making changes
  2. Handle missing data using appropriate strategies based on why it's missing
  3. Standardize names using reusable mapping systems
  4. Convert data types for correctness and efficiency
  5. Detect outliers while respecting domain knowledge
  6. Merge datasets safely with validation
  7. Engineer features that unlock analytical insights
  8. Validate data at each pipeline stage
  9. Build reproducible pipelines that document every transformation

The effort invested in data preparation pays dividends throughout your analysis. Clean data enables accurate insights; messy data guarantees flawed conclusions.


Key Terms

  • MCAR/MAR/MNAR: Missing data mechanisms (completely at random, at random, not at random)
  • Imputation: Filling in missing values using statistical methods
  • Cardinality: Number of unique values in a column
  • Schema: Structure and types of data columns
  • Pipeline: Sequence of processing steps applied to data
  • Feature Engineering: Creating new variables from existing data
  • Data Validation: Checking data against expected rules and constraints

References

  1. Van Buuren, S. (2018). Flexible Imputation of Missing Data. CRC Press.
  2. McKinney, W. (2017). Python for Data Analysis (2nd ed.). O'Reilly Media.
  3. Wickham, H. (2014). "Tidy Data." Journal of Statistical Software, 59(10).
  4. CollegeFootballData.com Documentation. https://collegefootballdata.com/