In This Chapter
- Learning Objectives
- 5.1 The Data Cleaning Imperative
- 5.2 Understanding Your Data
- 5.3 Handling Missing Data
- 5.4 Standardizing Names and Identifiers
- 5.5 Data Type Conversions
- 5.6 Detecting and Handling Outliers
- 5.7 Merging and Joining Datasets
- 5.8 Feature Engineering for Football
- 5.9 Data Validation
- 5.10 Building Reproducible Pipelines
- 5.11 Best Practices Summary
- Chapter Summary
- Key Terms
- References
Chapter 5: Data Cleaning and Preparation
"Garbage in, garbage out." — George Fuechsel
The most sophisticated analysis is worthless if built on flawed data. In professional sports analytics, data scientists typically spend 60-80% of their time on data cleaning and preparation—the unglamorous but essential foundation of every insight. This chapter teaches you to transform raw, messy football data into analysis-ready datasets.
Learning Objectives
By the end of this chapter, you will be able to:
- Identify and handle missing data using appropriate strategies
- Standardize team and player names across different data sources
- Detect and handle outliers and data quality issues
- Merge datasets from multiple sources correctly
- Create derived features for football analysis
- Build reproducible data cleaning pipelines
- Validate data quality at each stage of processing
5.1 The Data Cleaning Imperative
Why Data Quality Matters
Consider a simple example: You're comparing rushing efficiency across teams. Your dataset shows Team A averaging 8.5 yards per carry—extraordinarily high. Investigation reveals a data-entry error: a 75-yard touchdown run was recorded with an extra digit as 750 yards for that play. Without proper cleaning, your analysis would be fundamentally flawed.
Football data presents unique challenges:
- Multiple Naming Conventions: "Ohio State," "Ohio St.," "OSU," "THE Ohio State University"
- Missing Data Patterns: Weather data missing for dome games, snap counts unavailable for older seasons
- Schema Changes: Data providers modify their formats between seasons
- Human Entry Errors: Misrecorded jersey numbers, transposed statistics
- Definitional Inconsistencies: What counts as a "pressure"? A "broken tackle"?
The Cost of Poor Data
| Issue | Potential Impact |
|---|---|
| Duplicate records | Inflated statistics, misleading aggregations |
| Missing values | Biased analysis, failed calculations |
| Inconsistent naming | Failed merges, split records |
| Data type errors | Calculation failures, incorrect sorting |
| Outlier contamination | Skewed means, misleading conclusions |
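To make the first row of that table concrete, here is a minimal, hypothetical illustration of how a single duplicated record inflates an aggregate:
import pandas as pd
# Hypothetical example: the same 15-yard run logged twice inflates the season total.
rushes = pd.DataFrame({'team': ['Alabama', 'Alabama', 'Alabama'],
                       'yards': [15, 15, 4]})  # rows 0 and 1 are the same play
print(rushes['yards'].sum())                    # 34 -- inflated by the duplicate
print(rushes.drop_duplicates()['yards'].sum())  # 19 -- correct after deduplication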
5.2 Understanding Your Data
Before cleaning, you must understand what you have. This reconnaissance phase is essential.
Initial Data Exploration
import pandas as pd
import numpy as np
from typing import Dict, List, Tuple, Union
def explore_dataset(df: pd.DataFrame, name: str = "Dataset") -> Dict:
"""
Comprehensive initial exploration of a dataset.
Parameters
----------
df : pd.DataFrame
Dataset to explore
name : str
Name for reporting
Returns
-------
dict : Exploration summary
"""
print(f"{'='*60}")
print(f"DATASET EXPLORATION: {name}")
print(f"{'='*60}")
# Basic dimensions
print(f"\nShape: {df.shape[0]:,} rows × {df.shape[1]} columns")
# Data types
print(f"\nData Types:")
for dtype, count in df.dtypes.value_counts().items():
print(f" {dtype}: {count} columns")
# Memory usage
memory_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"\nMemory Usage: {memory_mb:.2f} MB")
# Missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_cols = missing[missing > 0].sort_values(ascending=False)
print(f"\nMissing Values:")
if len(missing_cols) > 0:
for col, count in missing_cols.head(10).items():
print(f" {col}: {count:,} ({missing_pct[col]:.1f}%)")
else:
print(" No missing values found")
# Duplicates
dup_rows = df.duplicated().sum()
print(f"\nDuplicate Rows: {dup_rows:,} ({dup_rows/len(df)*100:.2f}%)")
return {
'shape': df.shape,
'dtypes': df.dtypes.value_counts().to_dict(),
'memory_mb': memory_mb,
'missing_cols': missing_cols.to_dict(),
'duplicate_rows': dup_rows
}
# Example usage with play-by-play data
plays = pd.DataFrame({
'game_id': [1, 1, 1, 2, 2, 2],
'play_id': [1, 2, 3, 1, 2, 3],
'offense': ['Alabama', 'Alabama', 'ALABAMA', 'Georgia', 'georgia', 'Georgia'],
'yards_gained': [5, -2, None, 12, 8, None],
'down': [1, 2, 3, 1, 2, 3],
'distance': [10, 12, 14, 10, 8, 10]
})
summary = explore_dataset(plays, "Sample Play-by-Play")
Data Type Assessment
Understanding and correcting data types is fundamental:
def assess_column_types(df: pd.DataFrame) -> pd.DataFrame:
"""
Assess actual vs. appropriate data types for each column.
Returns a DataFrame with recommendations.
"""
assessments = []
for col in df.columns:
current_type = str(df[col].dtype)
sample_values = df[col].dropna().head(3).tolist()
# Assess appropriate type
if df[col].dtype == 'object':
# Check if should be numeric
numeric_test = pd.to_numeric(df[col], errors='coerce')
if numeric_test.notna().sum() > 0.9 * df[col].notna().sum():
recommended = 'numeric (currently string)'
# Check if should be categorical
elif df[col].nunique() < len(df) * 0.05:
recommended = 'category (few unique values)'
else:
recommended = 'string (OK)'
elif np.issubdtype(df[col].dtype, np.integer):
# Check if should be boolean
if set(df[col].dropna().unique()).issubset({0, 1}):
recommended = 'boolean (0/1 only)'
else:
recommended = 'integer (OK)'
elif np.issubdtype(df[col].dtype, np.floating):
recommended = 'float (OK)'
else:
recommended = str(df[col].dtype)
assessments.append({
'column': col,
'current_type': current_type,
'unique_values': df[col].nunique(),
'recommendation': recommended,
'sample': str(sample_values)[:50]
})
return pd.DataFrame(assessments)
5.3 Handling Missing Data
Missing data is inevitable in football datasets. The key is handling it appropriately.
Types of Missing Data
Understanding why data is missing determines how to handle it:
| Type | Description | Football Example | Strategy |
|---|---|---|---|
| MCAR | Missing Completely at Random | Random data entry omissions | Safe to drop or impute |
| MAR | Missing at Random | Weather missing for dome games (related to venue, not weather itself) | Impute carefully |
| MNAR | Missing Not at Random | Snap counts missing because player was injured | Must address mechanism |
Diagnosing Missing Data Patterns
def analyze_missing_patterns(df: pd.DataFrame) -> Dict:
"""
Analyze patterns in missing data.
Identifies which columns have missing values and whether
they're correlated with other variables.
"""
# Create missingness indicator matrix
missing_matrix = df.isnull()
# Summary by column
missing_summary = pd.DataFrame({
'missing_count': missing_matrix.sum(),
'missing_pct': (missing_matrix.sum() / len(df)) * 100,
'dtype': df.dtypes
}).sort_values('missing_pct', ascending=False)
# Look for patterns: do columns tend to be missing together?
missing_cols = missing_summary[missing_summary['missing_count'] > 0].index.tolist()
if len(missing_cols) > 1:
# Correlation of missingness
missing_corr = missing_matrix[missing_cols].corr()
else:
missing_corr = None
return {
'summary': missing_summary,
'missing_columns': missing_cols,
'missing_correlation': missing_corr
}
# Example: Analyzing missing weather data
games_data = pd.DataFrame({
'game_id': range(1, 21),
'venue_type': ['outdoor'] * 12 + ['dome'] * 8,
'temperature': [72, 65, 58, 45, 38, 52, 78, 82, 55, 48, 42, 68,
None, None, None, None, None, None, None, None],
'wind': [8, 12, 5, 15, 22, 10, 6, 4, 18, 20, 25, 9,
None, None, None, None, None, None, None, None],
'points_scored': [28, 35, 24, 21, 17, 31, 42, 38, 27, 24, 14, 33,
35, 28, 31, 42, 38, 27, 45, 33]
})
patterns = analyze_missing_patterns(games_data)
print("Missing Data Analysis:")
print(patterns['summary'][patterns['summary']['missing_count'] > 0])
# Notice: Temperature and wind missing together for dome games (MAR)
Missing Data Strategies
Strategy 1: Deletion
def handle_missing_deletion(df: pd.DataFrame,
threshold: float = 0.05) -> pd.DataFrame:
"""
Handle missing data through deletion.
- Drop columns with too many missing values
- Drop rows where critical columns are missing
Parameters
----------
df : pd.DataFrame
Data to clean
threshold : float
Maximum acceptable proportion of missing values
Returns
-------
pd.DataFrame : Cleaned data
"""
df_clean = df.copy()
# Drop columns with excessive missing
missing_pct = df_clean.isnull().sum() / len(df_clean)
cols_to_drop = missing_pct[missing_pct > 0.5].index.tolist()
if cols_to_drop:
print(f"Dropping columns (>50% missing): {cols_to_drop}")
df_clean = df_clean.drop(columns=cols_to_drop)
# Drop rows where essential columns are missing
# (Define critical columns based on analysis needs)
return df_clean
Strategy 2: Imputation
def impute_missing_values(df: pd.DataFrame,
strategy_map: Dict[str, str]) -> pd.DataFrame:
"""
Impute missing values using specified strategies.
Parameters
----------
df : pd.DataFrame
Data to impute
strategy_map : dict
{column: strategy} where strategy is one of:
'mean', 'median', 'mode', 'zero', 'forward'
Returns
-------
pd.DataFrame : Imputed data
"""
df_imputed = df.copy()
for col, strategy in strategy_map.items():
if col not in df_imputed.columns:
continue
missing_count = df_imputed[col].isnull().sum()
if missing_count == 0:
continue
if strategy == 'mean':
fill_value = df_imputed[col].mean()
df_imputed[col] = df_imputed[col].fillna(fill_value)
elif strategy == 'median':
fill_value = df_imputed[col].median()
df_imputed[col] = df_imputed[col].fillna(fill_value)
elif strategy == 'mode':
fill_value = df_imputed[col].mode().iloc[0]
df_imputed[col] = df_imputed[col].fillna(fill_value)
elif strategy == 'zero':
df_imputed[col] = df_imputed[col].fillna(0)
elif strategy == 'forward':
df_imputed[col] = df_imputed[col].ffill()
print(f" {col}: Imputed {missing_count} values using {strategy}")
return df_imputed
# Example: Domain-appropriate imputation
imputation_strategies = {
'yards_gained': 'median', # Use median (robust to outliers)
'temperature': 'median', # Use median for dome games
'wind': 'zero', # Assume 0 wind for dome games
'attendance': 'median', # Use typical attendance
}
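Applied to the sample games_data from earlier, the call might look like this (a sketch; strategies for columns not in the frame, such as yards_gained and attendance, are simply skipped):
# Sketch: fill the dome-game weather gaps using the strategy map above.
games_imputed = impute_missing_values(games_data, imputation_strategies)
print(games_imputed[['venue_type', 'temperature', 'wind']].tail())
# Dome games now carry the median outdoor temperature and a wind of 0.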
Strategy 3: Indicator Variables
Sometimes missingness itself is informative:
def add_missing_indicators(df: pd.DataFrame,
columns: List[str]) -> pd.DataFrame:
"""
Add indicator columns for missing values.
Useful when missingness is informative (MNAR).
Parameters
----------
df : pd.DataFrame
Original data
columns : list
Columns to create indicators for
Returns
-------
pd.DataFrame : Data with indicator columns
"""
df_with_indicators = df.copy()
for col in columns:
if col in df.columns:
indicator_name = f'{col}_missing'
df_with_indicators[indicator_name] = df[col].isnull().astype(int)
return df_with_indicators
# Example: Weather data missing for dome games is informative
games_with_indicators = add_missing_indicators(games_data, ['temperature', 'wind'])
# Now 'temperature_missing' = 1 indicates a dome game
5.4 Standardizing Names and Identifiers
One of the biggest challenges in football data is inconsistent naming.
The Team Name Problem
# The same team can appear many ways:
ohio_state_variants = [
'Ohio State',
'Ohio St.',
'Ohio St',
'OSU',
'THE Ohio State University',
'Ohio State Buckeyes',
'OHIO STATE',
'ohio state'
]
Building a Name Standardization System
class TeamNameStandardizer:
"""
Standardize team names across data sources.
Maintains a mapping of variants to canonical names.
"""
def __init__(self):
# Core mapping of variants to standard names
self.team_mapping = {
# SEC
'alabama': 'Alabama',
'bama': 'Alabama',
'roll tide': 'Alabama',
'georgia': 'Georgia',
'uga': 'Georgia',
'dawgs': 'Georgia',
'lsu': 'LSU',
'louisiana state': 'LSU',
'texas a&m': 'Texas A&M',
'texas am': 'Texas A&M',
'tamu': 'Texas A&M',
# Big Ten
'ohio state': 'Ohio State',
'ohio st': 'Ohio State',
'ohio st.': 'Ohio State',
'osu': 'Ohio State', # Context-dependent, could be Oklahoma State
'michigan': 'Michigan',
'um': 'Michigan',
'wolverines': 'Michigan',
'penn state': 'Penn State',
'penn st': 'Penn State',
'penn st.': 'Penn State',
'psu': 'Penn State',
# Additional mappings...
}
# Common suffixes to remove
self.suffixes_to_remove = [
' Crimson Tide', ' Bulldogs', ' Tigers', ' Aggies',
' Buckeyes', ' Wolverines', ' Nittany Lions',
' Fighting Irish', ' Longhorns', ' Sooners'
]
def clean_name(self, name: str) -> str:
"""
Clean a team name for matching.
Lowercase, remove punctuation, strip whitespace.
"""
if pd.isna(name):
return name
cleaned = str(name).lower().strip()
# Remove common suffixes
for suffix in self.suffixes_to_remove:
cleaned = cleaned.replace(suffix.lower(), '')
# Remove punctuation
cleaned = cleaned.replace('.', '').replace("'", '')
return cleaned.strip()
def standardize(self, name: str) -> str:
"""
Convert a team name to its standard form.
Parameters
----------
name : str
Raw team name
Returns
-------
str : Standardized name
"""
if pd.isna(name):
return name
cleaned = self.clean_name(name)
if cleaned in self.team_mapping:
return self.team_mapping[cleaned]
# If no exact match, fall back to the original name in title case
return name.strip().title()
def standardize_column(self, df: pd.DataFrame,
column: str) -> pd.DataFrame:
"""
Standardize team names in a DataFrame column.
Parameters
----------
df : pd.DataFrame
Data containing team names
column : str
Column name to standardize
Returns
-------
pd.DataFrame : Data with standardized names
"""
df_clean = df.copy()
original_unique = df_clean[column].nunique()
df_clean[column] = df_clean[column].apply(self.standardize)
new_unique = df_clean[column].nunique()
print(f"Standardized '{column}': {original_unique} → {new_unique} unique values")
return df_clean
# Usage
standardizer = TeamNameStandardizer()
raw_data = pd.DataFrame({
'home_team': ['Ohio State', 'MICHIGAN', 'Penn St.', 'georgia', 'LSU Tigers'],
'away_team': ['michigan', 'osu', 'Ohio State Buckeyes', 'Alabama', 'texas am']
})
clean_data = standardizer.standardize_column(raw_data, 'home_team')
clean_data = standardizer.standardize_column(clean_data, 'away_team')
Player Name Standardization
class PlayerNameStandardizer:
"""
Standardize player names for matching across sources.
"""
def __init__(self):
# Common name variations
self.first_name_map = {
'mike': 'Michael',
'will': 'William',
'bill': 'William',
'bob': 'Robert',
'rob': 'Robert',
'joe': 'Joseph',
'tom': 'Thomas',
'jim': 'James',
'jimmy': 'James',
'chris': 'Christopher',
'matt': 'Matthew',
'dan': 'Daniel',
'dave': 'David',
'steve': 'Stephen',
'tony': 'Anthony',
'nick': 'Nicholas',
'josh': 'Joshua',
}
# Suffix handling
self.suffixes = ['Jr.', 'Jr', 'Sr.', 'Sr', 'II', 'III', 'IV', 'V']
def standardize(self, name: str) -> str:
"""
Standardize a player name.
Handles:
- Case normalization
- Nickname to formal name conversion
- Suffix normalization
- Extra whitespace
"""
if pd.isna(name) or name == '':
return name
# Clean up whitespace
parts = str(name).strip().split()
if len(parts) < 2:
return name.strip().title()
# Handle first name
first = parts[0].lower()
if first in self.first_name_map:
parts[0] = self.first_name_map[first]
else:
parts[0] = parts[0].title()
# Handle rest of name
for i in range(1, len(parts)):
part = parts[i]
# Check for suffix
is_suffix = any(part.lower().replace('.', '') ==
s.lower().replace('.', '') for s in self.suffixes)
if is_suffix:
# Normalize suffix format
parts[i] = part.upper().replace('.', '') + '.'
else:
parts[i] = part.title()
return ' '.join(parts)
def create_matching_key(self, name: str) -> str:
"""
Create a simplified key for fuzzy matching.
Returns: "lastname_firstinitial"
"""
if pd.isna(name) or name == '':
return ''
standardized = self.standardize(name)
parts = standardized.split()
if len(parts) < 2:
return name.lower().replace(' ', '_')
# Remove suffix
parts = [p for p in parts if not any(
p.lower().replace('.', '') == s.lower().replace('.', '')
for s in self.suffixes
)]
if len(parts) < 2:
return name.lower().replace(' ', '_')
# Create key: lastname_firstinitial
first_initial = parts[0][0].lower()
last_name = parts[-1].lower()
return f"{last_name}_{first_initial}"
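A brief usage sketch (the names are hypothetical; the printed results reflect the class's formatting choices):
# Usage
player_standardizer = PlayerNameStandardizer()
print(player_standardizer.standardize('mike williams jr'))           # Michael Williams JR.
print(player_standardizer.standardize('  CHRIS  jones  '))           # Christopher Jones
print(player_standardizer.create_matching_key('Mike Williams Jr.'))  # williams_m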
5.5 Data Type Conversions
Proper data types ensure correct operations and efficient memory usage.
Common Type Conversions
def convert_data_types(df: pd.DataFrame,
type_spec: Dict[str, str]) -> pd.DataFrame:
"""
Convert columns to specified data types.
Parameters
----------
df : pd.DataFrame
Data to convert
type_spec : dict
{column: type} where type is 'int', 'float', 'bool',
'category', 'datetime', 'string'
Returns
-------
pd.DataFrame : Data with converted types
"""
df_converted = df.copy()
for col, dtype in type_spec.items():
if col not in df_converted.columns:
print(f"Warning: Column '{col}' not found")
continue
try:
if dtype == 'int':
# Handle NaN by converting to nullable integer
df_converted[col] = pd.to_numeric(
df_converted[col], errors='coerce'
).astype('Int64')
elif dtype == 'float':
df_converted[col] = pd.to_numeric(
df_converted[col], errors='coerce'
).astype('float64')
elif dtype == 'bool':
df_converted[col] = df_converted[col].astype(bool)
elif dtype == 'category':
df_converted[col] = df_converted[col].astype('category')
elif dtype == 'datetime':
df_converted[col] = pd.to_datetime(
df_converted[col], errors='coerce'
)
elif dtype == 'string':
df_converted[col] = df_converted[col].astype(str)
print(f" Converted '{col}' to {dtype}")
except Exception as e:
print(f" Error converting '{col}': {e}")
return df_converted
# Example: Play-by-play type corrections
play_type_spec = {
'game_id': 'int',
'play_id': 'int',
'yards_gained': 'float', # Can have NaN
'down': 'int',
'distance': 'int',
'quarter': 'int',
'game_date': 'datetime',
'play_type': 'category',
'offense': 'string',
'is_scoring_play': 'bool'
}
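As a quick sketch, applying this spec to the small plays sample from Section 5.2 converts the columns that exist; columns in the spec but not in the frame (quarter, game_date, and so on) just produce a warning and are skipped:
plays_typed = convert_data_types(plays, play_type_spec)
print(plays_typed.dtypes)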
Memory Optimization
Large datasets benefit from memory-efficient types:
def optimize_memory(df: pd.DataFrame,
verbose: bool = True) -> pd.DataFrame:
"""
Optimize DataFrame memory usage by downcasting types.
Parameters
----------
df : pd.DataFrame
Data to optimize
verbose : bool
Print memory savings
Returns
-------
pd.DataFrame : Memory-optimized data
"""
initial_memory = df.memory_usage(deep=True).sum() / 1024**2
df_optimized = df.copy()
for col in df_optimized.columns:
col_type = df_optimized[col].dtype
if col_type == 'int64':
# Downcast integers
c_min = df_optimized[col].min()
c_max = df_optimized[col].max()
if c_min >= 0:
if c_max < 255:
df_optimized[col] = df_optimized[col].astype('uint8')
elif c_max < 65535:
df_optimized[col] = df_optimized[col].astype('uint16')
elif c_max < 4294967295:
df_optimized[col] = df_optimized[col].astype('uint32')
else:
if c_min > -128 and c_max < 127:
df_optimized[col] = df_optimized[col].astype('int8')
elif c_min > -32768 and c_max < 32767:
df_optimized[col] = df_optimized[col].astype('int16')
elif c_min > -2147483648 and c_max < 2147483647:
df_optimized[col] = df_optimized[col].astype('int32')
elif col_type == 'float64':
# Downcast floats
df_optimized[col] = pd.to_numeric(
df_optimized[col], downcast='float'
)
elif col_type == 'object':
# Convert low-cardinality strings to category
num_unique = df_optimized[col].nunique()
num_total = len(df_optimized[col])
if num_unique / num_total < 0.5:
df_optimized[col] = df_optimized[col].astype('category')
final_memory = df_optimized.memory_usage(deep=True).sum() / 1024**2
if verbose:
print(f"Memory: {initial_memory:.2f} MB → {final_memory:.2f} MB")
print(f"Reduced by {(1 - final_memory/initial_memory)*100:.1f}%")
return df_optimized
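On the tiny sample frames used in this chapter the savings are negligible, but the call pattern is the same for full seasons of play-by-play data:
# Example usage
plays_optimized = optimize_memory(plays)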
5.6 Detecting and Handling Outliers
Outliers require careful consideration in football data—some are errors, others are legitimate exceptional performances.
Outlier Detection Methods
from scipy import stats
def detect_outliers(df: pd.DataFrame,
column: str,
method: str = 'iqr') -> pd.DataFrame:
"""
Detect outliers in a numeric column.
Parameters
----------
df : pd.DataFrame
Data to analyze
column : str
Numeric column to check
method : str
Detection method: 'iqr', 'zscore', 'modified_zscore'
Returns
-------
pd.DataFrame : Rows containing outliers
"""
data = df[column].dropna()
if method == 'iqr':
q1 = data.quantile(0.25)
q3 = data.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outlier_mask = (df[column] < lower) | (df[column] > upper)
elif method == 'zscore':
z_scores = np.abs(stats.zscore(data))
threshold = 3
outlier_values = data[z_scores > threshold]
outlier_mask = df[column].isin(outlier_values)
elif method == 'modified_zscore':
# More robust to outliers than standard z-score
median = data.median()
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
outlier_values = data[np.abs(modified_z) > 3.5]
outlier_mask = df[column].isin(outlier_values)
else:
raise ValueError(f"Unknown method: {method}")
outliers = df[outlier_mask].copy()
outliers['outlier_value'] = df.loc[outlier_mask, column]
return outliers
def analyze_potential_outliers(df: pd.DataFrame,
column: str) -> Dict:
"""
Comprehensive outlier analysis for a column.
Returns information to help decide if outliers are
errors or legitimate exceptional values.
"""
outliers_iqr = detect_outliers(df, column, 'iqr')
outliers_z = detect_outliers(df, column, 'zscore')
analysis = {
'column': column,
'n_total': len(df),
'n_outliers_iqr': len(outliers_iqr),
'n_outliers_zscore': len(outliers_z),
'outlier_pct': len(outliers_iqr) / len(df) * 100,
'outlier_values': outliers_iqr[column].tolist(),
'mean': df[column].mean(),
'median': df[column].median(),
'std': df[column].std(),
'min': df[column].min(),
'max': df[column].max()
}
return analysis
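A quick sketch on the sample plays frame (on real data you would review each flagged value before deciding whether it is an error or a legitimate big play):
yard_analysis = analyze_potential_outliers(plays, 'yards_gained')
print(f"IQR outliers: {yard_analysis['n_outliers_iqr']} "
      f"({yard_analysis['outlier_pct']:.1f}% of plays)")
print(f"Flagged values: {yard_analysis['outlier_values']}")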
Contextual Outlier Handling in Football
def handle_football_outliers(df: pd.DataFrame) -> pd.DataFrame:
"""
Handle outliers with football domain knowledge.
Some outliers are real (big plays), others are errors.
"""
df_clean = df.copy()
# Yards gained: Can legitimately range from ~-10 to 99
# Values outside this are likely errors
if 'yards_gained' in df_clean.columns:
invalid_yards = (df_clean['yards_gained'] < -15) | \
(df_clean['yards_gained'] > 100)
df_clean.loc[invalid_yards, 'yards_gained'] = np.nan
print(f"Flagged {invalid_yards.sum()} invalid yard values")
# Down: Must be 1-4
if 'down' in df_clean.columns:
invalid_down = ~df_clean['down'].isin([1, 2, 3, 4])
df_clean.loc[invalid_down, 'down'] = np.nan
print(f"Flagged {invalid_down.sum()} invalid down values")
# Distance: Typically 1-25, but can be higher
if 'distance' in df_clean.columns:
invalid_distance = (df_clean['distance'] < 1) | \
(df_clean['distance'] > 50)
df_clean.loc[invalid_distance, 'distance'] = np.nan
print(f"Flagged {invalid_distance.sum()} invalid distance values")
# Quarter: Must be 1-4 (or 5 for OT)
if 'quarter' in df_clean.columns:
invalid_quarter = ~df_clean['quarter'].isin([1, 2, 3, 4, 5])
df_clean.loc[invalid_quarter, 'quarter'] = np.nan
print(f"Flagged {invalid_quarter.sum()} invalid quarter values")
return df_clean
def flag_statistical_outliers(df: pd.DataFrame,
column: str,
context_column: str = None) -> pd.DataFrame:
"""
Flag outliers while preserving them for review.
Instead of removing, adds a flag column for manual review.
Parameters
----------
df : pd.DataFrame
Data to analyze
column : str
Column to check for outliers
context_column : str
Group by this column for context-aware detection
Returns
-------
pd.DataFrame : Data with outlier flags
"""
df_flagged = df.copy()
flag_col = f'{column}_outlier'
df_flagged[flag_col] = False
if context_column and context_column in df.columns:
# Context-aware: outliers within each group
for group_value in df[context_column].unique():
group_mask = df[context_column] == group_value
group_data = df.loc[group_mask, column]
if len(group_data) < 10:
continue
q1 = group_data.quantile(0.25)
q3 = group_data.quantile(0.75)
iqr = q3 - q1
outlier_mask = group_mask & (
(df[column] < q1 - 1.5 * iqr) |
(df[column] > q3 + 1.5 * iqr)
)
df_flagged.loc[outlier_mask, flag_col] = True
else:
# Global outlier detection
outliers = detect_outliers(df, column, 'iqr')
df_flagged.loc[outliers.index, flag_col] = True
print(f"Flagged {df_flagged[flag_col].sum()} outliers in '{column}'")
return df_flagged
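Usage follows the same pattern (on the small sample nothing is likely to be flagged; the point is that flagged rows stay in the data for review rather than being dropped):
plays_flagged = flag_statistical_outliers(plays, 'yards_gained')
print(plays_flagged[plays_flagged['yards_gained_outlier']])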
5.7 Merging and Joining Datasets
Combining data from multiple sources is essential for comprehensive analysis.
Common Join Types
def demonstrate_join_types():
"""
Illustrate different types of joins with football data.
"""
# Game-level data
games = pd.DataFrame({
'game_id': [1, 2, 3, 4],
'home_team': ['Alabama', 'Georgia', 'Ohio State', 'Texas'],
'away_team': ['Georgia', 'LSU', 'Michigan', 'Oklahoma'],
'home_score': [28, 35, 42, 31]
})
# Weather data (missing for some games)
weather = pd.DataFrame({
'game_id': [1, 2, 4],
'temperature': [72, 65, 85],
'conditions': ['clear', 'cloudy', 'sunny']
})
print("Games (left):")
print(games)
print("\nWeather (right):")
print(weather)
# Inner join: Only games with weather
inner = pd.merge(games, weather, on='game_id', how='inner')
print("\nINNER JOIN (only matching rows):")
print(inner)
# Left join: All games, weather where available
left = pd.merge(games, weather, on='game_id', how='left')
print("\nLEFT JOIN (all games, weather if available):")
print(left)
# Outer join: All records from both
outer = pd.merge(games, weather, on='game_id', how='outer')
print("\nOUTER JOIN (all records):")
print(outer)
demonstrate_join_types()
Safe Merging Practices
def safe_merge(left: pd.DataFrame,
right: pd.DataFrame,
on: Union[str, List[str]],
how: str = 'left',
validate: str = None,
indicator: bool = True) -> pd.DataFrame:
"""
Perform a merge with validation and diagnostics.
Parameters
----------
left : pd.DataFrame
Left DataFrame
right : pd.DataFrame
Right DataFrame
on : str or list
Column(s) to merge on
how : str
Join type: 'left', 'right', 'inner', 'outer'
validate : str
Validation: 'one_to_one', 'one_to_many', 'many_to_one', 'many_to_many'
indicator : bool
Add merge indicator column
Returns
-------
pd.DataFrame : Merged data
"""
# Pre-merge diagnostics
print("="*50)
print("MERGE DIAGNOSTICS")
print("="*50)
on_cols = [on] if isinstance(on, str) else on
print(f"\nLeft shape: {left.shape}")
print(f"Right shape: {right.shape}")
for col in on_cols:
left_unique = left[col].nunique()
right_unique = right[col].nunique()
print(f"\n'{col}':")
print(f" Left unique: {left_unique}")
print(f" Right unique: {right_unique}")
# Check for values that won't match
left_only = set(left[col]) - set(right[col])
right_only = set(right[col]) - set(left[col])
print(f" Left-only values: {len(left_only)}")
print(f" Right-only values: {len(right_only)}")
# Perform merge
try:
merged = pd.merge(
left, right,
on=on,
how=how,
validate=validate,
indicator=indicator
)
if indicator:
merge_counts = merged['_merge'].value_counts()
print(f"\nMerge results:")
for status, count in merge_counts.items():
print(f" {status}: {count}")
print(f"\nFinal shape: {merged.shape}")
return merged
except Exception as e:
print(f"\nMerge failed: {e}")
raise
# Example: Merging play-by-play with team data
plays = pd.DataFrame({
'play_id': range(1, 11),
'team': ['Alabama', 'Alabama', 'Georgia', 'Georgia', 'Ohio State',
'Ohio St', 'Michigan', 'michigan', 'LSU', 'Texas A&M'],
'yards': [5, 12, -2, 8, 15, 6, 3, 9, 22, 4]
})
teams = pd.DataFrame({
'team_name': ['Alabama', 'Georgia', 'Ohio State', 'Michigan', 'LSU', 'Texas A&M'],
'conference': ['SEC', 'SEC', 'Big Ten', 'Big Ten', 'SEC', 'SEC']
})
# Note: A direct merge won't work here -- the key columns have different names,
# and the team strings are inconsistent ('Ohio St', 'michigan'). Standardize first!
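One way to repair this, sketched below by reusing the TeamNameStandardizer class from Section 5.4 together with safe_merge, is to put both frames on a shared, standardized key before merging:
standardizer = TeamNameStandardizer()
plays_std = standardizer.standardize_column(plays, 'team')
teams_std = teams.rename(columns={'team_name': 'team'})
teams_std = standardizer.standardize_column(teams_std, 'team')
enriched = safe_merge(plays_std, teams_std, on='team',
                      how='left', validate='many_to_one')
print(enriched[['team', 'conference']].drop_duplicates())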
Handling Multiple Key Merges
def merge_play_game_data(plays: pd.DataFrame,
games: pd.DataFrame) -> pd.DataFrame:
"""
Merge play-level data with game-level data.
Handles the common case of enriching plays with game context.
Parameters
----------
plays : pd.DataFrame
Play-by-play data (requires game_id)
games : pd.DataFrame
Game-level data (requires game_id)
Returns
-------
pd.DataFrame : Enriched play data
"""
# Validate key exists
if 'game_id' not in plays.columns:
raise ValueError("plays must contain 'game_id'")
if 'game_id' not in games.columns:
raise ValueError("games must contain 'game_id'")
# Identify columns to bring from games (avoid duplicates)
game_cols = ['game_id'] + [c for c in games.columns
if c not in plays.columns and c != 'game_id']
# Merge
merged = pd.merge(
plays,
games[game_cols],
on='game_id',
how='left',
validate='many_to_one' # Many plays per game
)
# Check for unmatched plays
unmatched = merged[merged[game_cols[1]].isna()] if len(game_cols) > 1 else pd.DataFrame()
if len(unmatched) > 0:
print(f"Warning: {len(unmatched)} plays couldn't be matched to games")
return merged
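A brief, hypothetical usage sketch (the frames here are stand-ins; real play-by-play and game tables would share the same game_id key):
pbp = pd.DataFrame({'game_id': [1, 1, 2, 2], 'play_id': [1, 2, 1, 2],
                    'yards_gained': [5, 8, -2, 12]})
game_info = pd.DataFrame({'game_id': [1, 2], 'week': [1, 1],
                          'venue_type': ['outdoor', 'dome']})
pbp_enriched = merge_play_game_data(pbp, game_info)
print(pbp_enriched.head())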
5.8 Feature Engineering for Football
Creating derived features unlocks analytical potential.
Common Football Features
def create_situational_features(df: pd.DataFrame) -> pd.DataFrame:
"""
Create situational features from play-by-play data.
Parameters
----------
df : pd.DataFrame
Play-by-play data with standard columns
Returns
-------
pd.DataFrame : Data with new features
"""
df_features = df.copy()
# Down and distance categories
if 'down' in df.columns and 'distance' in df.columns:
# Short, medium, long distance
df_features['distance_category'] = pd.cut(
df_features['distance'],
bins=[0, 3, 7, 15, 100],
labels=['short', 'medium', 'long', 'very_long']
)
# Standard downs (1st, 2nd & manageable) vs. passing downs (2nd/3rd & long)
df_features['passing_down'] = (
((df_features['down'] == 2) & (df_features['distance'] >= 8)) |
((df_features['down'] == 3) & (df_features['distance'] >= 5)) |
(df_features['down'] == 4)
).astype(int)
# Field position categories
if 'yard_line' in df.columns:
df_features['field_zone'] = pd.cut(
df_features['yard_line'],
bins=[0, 20, 40, 60, 80, 100],
labels=['backed_up', 'own_territory', 'midfield',
'opponent_territory', 'red_zone']
)
# Score differential
if 'home_score' in df.columns and 'away_score' in df.columns:
if 'is_home' in df.columns:
df_features['score_diff'] = np.where(
df_features['is_home'],
df_features['home_score'] - df_features['away_score'],
df_features['away_score'] - df_features['home_score']
)
else:
df_features['score_diff'] = (
df_features['home_score'] - df_features['away_score']
)
# Game state categories
df_features['game_state'] = pd.cut(
df_features['score_diff'],
bins=[-100, -17, -9, -1, 1, 9, 17, 100],
labels=['blowout_behind', 'big_deficit', 'small_deficit',
'close', 'small_lead', 'big_lead', 'blowout_ahead']
)
return df_features
def create_efficiency_features(df: pd.DataFrame) -> pd.DataFrame:
"""
Create efficiency-based features.
Parameters
----------
df : pd.DataFrame
Data with performance statistics
Returns
-------
pd.DataFrame : Data with efficiency features
"""
df_features = df.copy()
# Yards per attempt (passing)
if 'passing_yards' in df.columns and 'pass_attempts' in df.columns:
df_features['yards_per_attempt'] = (
df_features['passing_yards'] / df_features['pass_attempts']
).replace([np.inf, -np.inf], np.nan)
# Yards per carry (rushing)
if 'rushing_yards' in df.columns and 'rush_attempts' in df.columns:
df_features['yards_per_carry'] = (
df_features['rushing_yards'] / df_features['rush_attempts']
).replace([np.inf, -np.inf], np.nan)
# Completion percentage
if 'completions' in df.columns and 'pass_attempts' in df.columns:
df_features['completion_pct'] = (
df_features['completions'] / df_features['pass_attempts'] * 100
).replace([np.inf, -np.inf], np.nan)
# Points per play
if 'points' in df.columns and 'total_plays' in df.columns:
df_features['points_per_play'] = (
df_features['points'] / df_features['total_plays']
).replace([np.inf, -np.inf], np.nan)
# Turnover rate
if 'turnovers' in df.columns and 'total_plays' in df.columns:
df_features['turnover_rate'] = (
df_features['turnovers'] / df_features['total_plays'] * 100
).replace([np.inf, -np.inf], np.nan)
return df_features
def create_rolling_features(df: pd.DataFrame,
group_col: str,
value_col: str,
windows: List[int] = [3, 5]) -> pd.DataFrame:
"""
Create rolling average features.
Parameters
----------
df : pd.DataFrame
Data sorted by time/game order
group_col : str
Column to group by (e.g., 'team')
value_col : str
Column to calculate rolling stats for
windows : list
Window sizes for rolling calculations
Returns
-------
pd.DataFrame : Data with rolling features
"""
df_features = df.copy()
for window in windows:
# Rolling mean
col_name = f'{value_col}_rolling_{window}g'
df_features[col_name] = df_features.groupby(group_col)[value_col].transform(
lambda x: x.rolling(window, min_periods=1).mean().shift(1)
)
# Rolling std
std_name = f'{value_col}_rolling_std_{window}g'
df_features[std_name] = df_features.groupby(group_col)[value_col].transform(
lambda x: x.rolling(window, min_periods=1).std().shift(1)
)
return df_features
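A short sketch of the rolling features on a hypothetical team-game log (the frame and column names are illustrative; note that the shift(1) inside the function means each row sees only prior games, avoiding leakage):
team_games = pd.DataFrame({
    'team': ['Alabama'] * 4 + ['Georgia'] * 4,
    'week': [1, 2, 3, 4] * 2,
    'points': [35, 28, 42, 31, 24, 38, 27, 45]
}).sort_values(['team', 'week'])
team_games = create_rolling_features(team_games, 'team', 'points', windows=[3])
print(team_games[['team', 'week', 'points', 'points_rolling_3g']])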
5.9 Data Validation
Validation ensures data meets expectations before analysis.
Building Validation Functions
class DataValidator:
"""
Validate football data against defined rules.
"""
def __init__(self):
self.errors = []
self.warnings = []
def validate_range(self, df: pd.DataFrame, column: str,
min_val: float = None, max_val: float = None) -> bool:
"""Check if values fall within expected range."""
if column not in df.columns:
self.errors.append(f"Column '{column}' not found")
return False
values = df[column].dropna()
if min_val is not None:
below_min = values[values < min_val]
if len(below_min) > 0:
self.errors.append(
f"{column}: {len(below_min)} values below {min_val}"
)
return False
if max_val is not None:
above_max = values[values > max_val]
if len(above_max) > 0:
self.errors.append(
f"{column}: {len(above_max)} values above {max_val}"
)
return False
return True
def validate_not_null(self, df: pd.DataFrame, columns: List[str]) -> bool:
"""Check that columns have no null values."""
valid = True
for col in columns:
if col not in df.columns:
self.errors.append(f"Column '{col}' not found")
valid = False
continue
null_count = df[col].isnull().sum()
if null_count > 0:
self.errors.append(f"{col}: {null_count} null values")
valid = False
return valid
def validate_unique(self, df: pd.DataFrame, columns: List[str]) -> bool:
"""Check that column combination is unique (no duplicates)."""
duplicates = df.duplicated(subset=columns, keep=False)
if duplicates.any():
self.errors.append(
f"Duplicate rows found for {columns}: {duplicates.sum()}"
)
return False
return True
def validate_referential(self, df: pd.DataFrame, column: str,
reference: set) -> bool:
"""Check that values exist in reference set."""
if column not in df.columns:
self.errors.append(f"Column '{column}' not found")
return False
values = set(df[column].dropna().unique())
invalid = values - reference
if invalid:
self.warnings.append(
f"{column}: {len(invalid)} values not in reference set"
)
return False
return True
def validate_play_by_play(self, df: pd.DataFrame) -> Dict:
"""
Comprehensive validation for play-by-play data.
Returns validation report.
"""
self.errors = []
self.warnings = []
# Required columns
required = ['game_id', 'play_id', 'down', 'distance', 'yards_gained']
for col in required:
if col not in df.columns:
self.errors.append(f"Missing required column: {col}")
# Validate ranges
self.validate_range(df, 'down', 1, 4)
self.validate_range(df, 'distance', 1, 50)
self.validate_range(df, 'yards_gained', -20, 100)
self.validate_range(df, 'quarter', 1, 5)
# Check for duplicates
if 'game_id' in df.columns and 'play_id' in df.columns:
self.validate_unique(df, ['game_id', 'play_id'])
# Generate report
report = {
'valid': len(self.errors) == 0,
'errors': self.errors,
'warnings': self.warnings,
'row_count': len(df),
'columns': list(df.columns)
}
return report
# Usage
validator = DataValidator()
report = validator.validate_play_by_play(plays)
print(f"Validation passed: {report['valid']}")
if report['errors']:
print("Errors:")
for error in report['errors']:
print(f" - {error}")
5.10 Building Reproducible Pipelines
Professional data cleaning requires reproducible, documented pipelines.
Complete Cleaning Pipeline
class FootballDataPipeline:
"""
Complete, reproducible data cleaning pipeline for football data.
"""
def __init__(self, verbose: bool = True):
self.verbose = verbose
self.steps_completed = []
self.team_standardizer = TeamNameStandardizer()
self.validator = DataValidator()
def log(self, message: str):
"""Log processing step."""
if self.verbose:
print(f"[Pipeline] {message}")
def clean_play_by_play(self, df: pd.DataFrame) -> pd.DataFrame:
"""
Complete cleaning pipeline for play-by-play data.
Parameters
----------
df : pd.DataFrame
Raw play-by-play data
Returns
-------
pd.DataFrame : Cleaned, analysis-ready data
"""
self.log(f"Starting pipeline with {len(df):,} rows")
df_clean = df.copy()
# Step 1: Remove complete duplicates
initial_rows = len(df_clean)
df_clean = df_clean.drop_duplicates()
removed = initial_rows - len(df_clean)
self.log(f"Step 1: Removed {removed} duplicate rows")
self.steps_completed.append('remove_duplicates')
# Step 2: Standardize team names
for col in ['offense', 'defense', 'home_team', 'away_team']:
if col in df_clean.columns:
df_clean = self.team_standardizer.standardize_column(df_clean, col)
self.log("Step 2: Standardized team names")
self.steps_completed.append('standardize_names')
# Step 3: Convert data types
type_spec = {
'game_id': 'int',
'play_id': 'int',
'yards_gained': 'float',
'down': 'int',
'distance': 'int',
'quarter': 'int'
}
df_clean = convert_data_types(df_clean, type_spec)
self.log("Step 3: Converted data types")
self.steps_completed.append('convert_types')
# Step 4: Handle invalid values
df_clean = handle_football_outliers(df_clean)
self.log("Step 4: Handled invalid values")
self.steps_completed.append('handle_outliers')
# Step 5: Handle missing values
imputation_map = {
'yards_gained': 'median',
'down': 'mode',
'distance': 'median'
}
df_clean = impute_missing_values(df_clean, imputation_map)
self.log("Step 5: Imputed missing values")
self.steps_completed.append('impute_missing')
# Step 6: Create features
df_clean = create_situational_features(df_clean)
self.log("Step 6: Created situational features")
self.steps_completed.append('create_features')
# Step 7: Optimize memory
df_clean = optimize_memory(df_clean, verbose=False)
self.log("Step 7: Optimized memory usage")
self.steps_completed.append('optimize_memory')
# Step 8: Validate
report = self.validator.validate_play_by_play(df_clean)
self.log(f"Step 8: Validation {'passed' if report['valid'] else 'failed'}")
self.steps_completed.append('validate')
self.log(f"Pipeline complete: {len(df_clean):,} rows")
return df_clean
def get_pipeline_report(self) -> Dict:
"""Generate report of pipeline execution."""
return {
'steps_completed': self.steps_completed,
'n_steps': len(self.steps_completed)
}
# Example usage
pipeline = FootballDataPipeline(verbose=True)
# Load raw data (example)
raw_plays = pd.DataFrame({
'game_id': [1, 1, 1, 2, 2],
'play_id': [1, 2, 3, 1, 2],
'offense': ['ALABAMA', 'alabama', 'Bama', 'Georgia', 'UGA'],
'defense': ['Georgia', 'Georgia', 'georgia', 'Alabama', 'Alabama'],
'down': [1, 2, 3, 1, 2],
'distance': [10, 8, 5, 10, 7],
'yards_gained': [3, 5, None, 12, -2],
'quarter': [1, 1, 1, 1, 1]
})
clean_plays = pipeline.clean_play_by_play(raw_plays)
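The pipeline also records which steps ran, which serves as a lightweight audit trail:
print(pipeline.get_pipeline_report())
# {'steps_completed': ['remove_duplicates', 'standardize_names', ...], 'n_steps': 8}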
5.11 Best Practices Summary
Data Cleaning Checklist
## Pre-Cleaning
- [ ] Document raw data source and date obtained
- [ ] Create backup of raw data
- [ ] Explore data structure (shape, types, missing patterns)
- [ ] Identify key columns and their expected values
## Cleaning Steps
- [ ] Remove exact duplicates
- [ ] Standardize string columns (names, categories)
- [ ] Convert data types appropriately
- [ ] Handle missing values (document strategy)
- [ ] Detect and handle outliers/errors
- [ ] Validate value ranges and constraints
## Integration
- [ ] Merge related datasets carefully
- [ ] Verify join keys match
- [ ] Check for unmatched records
- [ ] Avoid creating duplicates through joins
## Feature Engineering
- [ ] Create derived columns needed for analysis
- [ ] Document feature definitions
- [ ] Validate new features
## Quality Assurance
- [ ] Run validation checks
- [ ] Verify row/column counts
- [ ] Sample and manually review records
- [ ] Document known data limitations
## Documentation
- [ ] Record all transformations applied
- [ ] Note any assumptions made
- [ ] Save cleaning code for reproducibility
- [ ] Version control the pipeline
Common Mistakes to Avoid
- Cleaning data in Excel: Not reproducible, error-prone
- Not documenting transformations: Future you won't remember
- Dropping too much data: Investigate before deleting
- Imputing without understanding: Know why data is missing
- Joining without validation: Check match rates
- Not version controlling: Changes should be trackable
- Manual string corrections: Build systematic mappings
- Ignoring warnings: They often indicate real problems
Chapter Summary
Data cleaning is the foundation of every analysis. In this chapter, you learned to:
- Explore data systematically before making changes
- Handle missing data using appropriate strategies based on why it's missing
- Standardize names using reusable mapping systems
- Convert data types for correctness and efficiency
- Detect outliers while respecting domain knowledge
- Merge datasets safely with validation
- Engineer features that unlock analytical insights
- Validate data at each pipeline stage
- Build reproducible pipelines that document every transformation
The effort invested in data preparation pays dividends throughout your analysis. Clean data enables accurate insights; messy data guarantees flawed conclusions.
Key Terms
- MCAR/MAR/MNAR: Missing data mechanisms (completely at random, at random, not at random)
- Imputation: Filling in missing values using statistical methods
- Cardinality: Number of unique values in a column
- Schema: Structure and types of data columns
- Pipeline: Sequence of processing steps applied to data
- Feature Engineering: Creating new variables from existing data
- Data Validation: Checking data against expected rules and constraints
References
- Van Buuren, S. (2018). Flexible Imputation of Missing Data. CRC Press.
- McKinney, W. (2017). Python for Data Analysis (2nd ed.). O'Reilly Media.
- Wickham, H. (2014). "Tidy Data." Journal of Statistical Software, 59(10).
- CollegeFootballData.com Documentation. https://collegefootballdata.com/