Key Takeaways: Data Cleaning and Preparation

The Data Cleaning Imperative

"Garbage in, garbage out" - No analysis can overcome fundamentally flawed data.

Professional analysts spend 60-80% of their time on data cleaning. This isn't inefficiency—it's essential due diligence.


Missing Data Decision Tree

Is data missing?
     │
     ├── NO → Proceed with analysis
     │
     └── YES → Why is it missing?
               │
               ├── MCAR (Random) → Safe to delete or impute
               │
               ├── MAR (Related to other columns)
               │     → Impute using related variables
               │     → Example: Weather missing for dome games
               │
               └── MNAR (Related to the value itself)
                     → Requires careful domain consideration
                     → Example: Snap counts missing when injured
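The three branches can be sketched in pandas. The columns here (`stadium_type`, `temperature`, `snap_count`) are hypothetical stand-ins for the MAR and MNAR examples above:

```python
import pandas as pd
import numpy as np

# Toy frame; 'stadium_type', 'temperature', 'snap_count' are hypothetical columns
df = pd.DataFrame({
    'stadium_type': ['dome', 'dome', 'outdoor', 'outdoor'],
    'temperature':  [np.nan, np.nan, 55.0, 61.0],   # MAR: missing for dome games
    'snap_count':   [64.0, np.nan, 70.0, 58.0],     # MNAR: missing when injured
})

# MAR: impute from the related variable -- dome games get a fixed indoor value
dome = df['stadium_type'] == 'dome'
df.loc[dome, 'temperature'] = df.loc[dome, 'temperature'].fillna(72.0)

# MNAR: don't silently impute; preserve the signal with an indicator column
df['snap_count_missing'] = df['snap_count'].isna()
```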

Imputation Strategy Quick Reference

Situation                 Strategy           Code
Numeric, symmetric        Mean               df['col'].fillna(df['col'].mean())
Numeric, skewed           Median             df['col'].fillna(df['col'].median())
Categorical               Mode               df['col'].fillna(df['col'].mode()[0])
Rate/percentage           0 or group mean    Context-dependent
Informative missingness   Create indicator   df['col_missing'] = df['col'].isna()

Name Standardization Pattern

class NameStandardizer:
    def __init__(self):
        self.mapping = {
            'variant1': 'Standard',
            'variant2': 'Standard',
            # ...
        }

    def standardize(self, name: str) -> str:
        cleaned = str(name).lower().strip()
        return self.mapping.get(cleaned, name.title())

    def standardize_column(self, df, col):
        df[col] = df[col].apply(self.standardize)
        return df

Key principles:

  1. Build a comprehensive mapping dictionary
  2. Clean input before matching (lowercase, strip)
  3. Handle unknown values gracefully
  4. Make it reusable across projects
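A usage sketch, repeating the class with a small hypothetical team-name mapping so it runs standalone:

```python
import pandas as pd

class NameStandardizer:
    def __init__(self):
        # Hypothetical variants that all collapse to one canonical name
        self.mapping = {
            'la rams': 'Los Angeles Rams',
            'l.a. rams': 'Los Angeles Rams',
            'los angeles rams': 'Los Angeles Rams',
        }

    def standardize(self, name: str) -> str:
        cleaned = str(name).lower().strip()
        return self.mapping.get(cleaned, name.title())

    def standardize_column(self, df, col):
        df[col] = df[col].apply(self.standardize)
        return df

df = pd.DataFrame({'team': ['LA Rams ', 'l.a. rams', 'Chicago bears']})
df = NameStandardizer().standardize_column(df, 'team')
# Unknown values ('Chicago bears') fall back to title case instead of failing
```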


Data Type Conversion Reference

Conversion          Target dtype   Method
String → Integer    Int64          pd.to_numeric(df['col'], errors='coerce').astype('Int64')
String → Float      float64        pd.to_numeric(df['col'], errors='coerce')
String → DateTime   datetime64     pd.to_datetime(df['col'], errors='coerce')
String → Boolean    bool           df['col'].str.lower().map({'true': True, 'false': False})
String → Category   category       df['col'].astype('category')

Always use errors='coerce' to convert unparseable values to NaN rather than raising exceptions.
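A quick illustration of coercion behavior on a toy series:

```python
import pandas as pd

s = pd.Series(['12', '7', 'DNP', ''])      # 'DNP' and '' are unparseable
nums = pd.to_numeric(s, errors='coerce')   # bad values become NaN, not exceptions
```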


Outlier Handling Guidelines

Detection Methods

# IQR Method
q1, q3 = df['col'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['col'] < q1 - 1.5*iqr) | (df['col'] > q3 + 1.5*iqr)]

# Z-Score Method
import numpy as np
from scipy import stats

z_scores = np.abs(stats.zscore(df['col'], nan_policy='omit'))
outliers = df[z_scores > 3]

Football-Specific Outlier Logic

Statistic      Legitimate Range   Likely Error Range
Yards gained   -15 to 99          < -20 or > 100
Down           1-4                0 or > 4
Quarter        1-5 (OT)           0 or > 5
Distance       1-50               < 1 or > 50

Golden rule: Flag for review, don't auto-delete. Verify against authoritative sources.
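A minimal flag-don't-delete sketch using the ranges above, on toy play data:

```python
import pandas as pd

# Toy plays with three suspect values (down=0, down=5, yards=102)
df = pd.DataFrame({'down': [1, 3, 0, 5], 'yards_gained': [4, 102, -3, 8]})

# Flag rather than delete: every row survives, suspects get marked for review
df['flag_down'] = ~df['down'].between(1, 4)
df['flag_yards'] = ~df['yards_gained'].between(-20, 100)
review = df[df[['flag_down', 'flag_yards']].any(axis=1)]
```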


Safe Merging Checklist

Before merging:
□ Key columns have same data type
□ Key values match (after standardization)
□ No unexpected duplicates in join keys
□ Understand relationship (1:1, 1:N, N:N)

After merging:
□ Row count is as expected
□ No unintended duplicates created
□ Check _merge indicator for unmatched rows
□ Verify sample records manually
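The "before merging" items can be checked in code; `left` and `right` here are toy frames joined on a hypothetical `team_id` key:

```python
import pandas as pd

left = pd.DataFrame({'team_id': [1, 2, 3], 'week': [1, 1, 1]})
right = pd.DataFrame({'team_id': [1, 2, 2], 'name': ['A', 'B', 'B2']})

# Same dtype on both sides of the join?
types_match = left['team_id'].dtype == right['team_id'].dtype

# Duplicate keys on the lookup side? (> 0 would fan out a left join)
dup_keys = right['team_id'].duplicated().sum()
```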

Merge Validation Pattern

merged = pd.merge(
    left, right,
    on='key_col',
    how='left',
    validate='many_to_one',  # Catches unexpected duplicates
    indicator=True           # Shows match status
)

# Check results
print(merged['_merge'].value_counts())

Feature Engineering Essentials

Common Football Features

# Distance category
df['dist_cat'] = pd.cut(df['distance'],
    bins=[0, 3, 7, 15, 100],
    labels=['short', 'medium', 'long', 'very_long'])

# Passing down indicator
df['passing_down'] = (
    ((df['down'] == 2) & (df['distance'] >= 8)) |
    ((df['down'] == 3) & (df['distance'] >= 5)) |
    (df['down'] == 4)
).astype(int)

# Field position zone
df['field_zone'] = pd.cut(df['yard_line'],
    bins=[0, 20, 40, 60, 80, 100],
    labels=['backed_up', 'own_terr', 'mid', 'opp_terr', 'red_zone'])

# Score differential
df['score_diff'] = df['team_score'] - df['opp_score']

Data Validation Framework

class Validator:
    def validate_range(self, df, col, min_val, max_val):
        invalid = (df[col] < min_val) | (df[col] > max_val)
        return df[~invalid], df[invalid]

    def validate_not_null(self, df, columns):
        return df[columns].isnull().sum()

    def validate_unique(self, df, key_cols):
        return df.duplicated(subset=key_cols).sum()

    def validate_referential(self, df, col, valid_set):
        invalid = ~df[col].isin(valid_set)
        return df[invalid]
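A usage sketch of two of the validators, repeating their definitions so the example runs standalone:

```python
import pandas as pd

class Validator:
    def validate_range(self, df, col, min_val, max_val):
        invalid = (df[col] < min_val) | (df[col] > max_val)
        return df[~invalid], df[invalid]

    def validate_unique(self, df, key_cols):
        return df.duplicated(subset=key_cols).sum()

# Toy data: one impossible down, one duplicated game_id
df = pd.DataFrame({'game_id': [1, 1, 2], 'down': [2, 6, 3]})
v = Validator()

valid, invalid = v.validate_range(df, 'down', 1, 4)
dup_count = v.validate_unique(df, ['game_id'])
```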

Pipeline Best Practices

Structure

class DataPipeline:
    def __init__(self):
        self.steps_completed = []

    def clean(self, df):
        df = self.remove_duplicates(df)
        df = self.standardize_names(df)
        df = self.fix_types(df)
        df = self.handle_missing(df)
        df = self.validate(df)
        return df

    def log(self, step, before_count, after_count):
        self.steps_completed.append({
            'step': step,
            'rows_before': before_count,
            'rows_after': after_count,
            'timestamp': pd.Timestamp.now()
        })

Principles

  1. Reproducible: Same input → same output
  2. Documented: Log every transformation
  3. Validated: Check quality at each step
  4. Modular: Easy to modify individual steps
  5. Auditable: Track data provenance

Quick Reference Card

Cleaning Workflow

1. EXPLORE    → df.info(), df.describe(), df.isnull().sum()
2. DUPLICATES → df.drop_duplicates()
3. TYPES      → pd.to_numeric(), pd.to_datetime()
4. NAMES      → NameStandardizer().standardize_column()
5. MISSING    → df.fillna() or df.dropna()
6. OUTLIERS   → Flag, investigate, then handle
7. VALIDATE   → Range checks, referential integrity
8. MERGE      → validate='many_to_one', indicator=True
9. FEATURES   → Create derived columns
10. EXPORT    → Document and save

Red Flags

Symptom                             Likely Cause
Unexpected row count after merge    Duplicate keys
String where number expected        Data entry issues
Impossible values (down = 5)        Validation failure
Team splits into multiple entries   Naming inconsistency
Mean far from median                Outliers or skew
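The mean-versus-median red flag is easy to screen for; the 0.5 threshold below is an arbitrary illustration, not a standard cutoff:

```python
import pandas as pd

s = pd.Series([3, 4, 5, 4, 80])   # one extreme value drags the mean up

# Relative gap between mean and median as a quick skew/outlier screen
gap = abs(s.mean() - s.median()) / s.median()
suspicious = gap > 0.5            # threshold chosen purely for illustration
```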

Memory Optimization Cheat Sheet

# Downcast integers
df['col'] = pd.to_numeric(df['col'], downcast='integer')

# Downcast floats
df['col'] = pd.to_numeric(df['col'], downcast='float')

# Convert strings to category (if low cardinality)
if df['col'].nunique() < len(df) * 0.5:
    df['col'] = df['col'].astype('category')

# Check memory usage
print(df.memory_usage(deep=True).sum() / 1024**2, 'MB')

Chapter Summary

Data cleaning is foundational, not optional. The time invested in proper data preparation:

  • Prevents incorrect analysis conclusions
  • Enables complex joins and aggregations
  • Makes your work reproducible
  • Builds trust in your results

Always clean before analyzing. Always document what you did. Always validate the results.