Key Takeaways: Data Cleaning and Preparation

The Data Cleaning Imperative

"Garbage in, garbage out" - No analysis can overcome fundamentally flawed data.

Professional analysts spend 60-80% of their time on data cleaning. This isn't inefficiency—it's essential due diligence.


Missing Data Decision Tree

Is data missing?
     │
     ├── NO → Proceed with analysis
     │
     └── YES → Why is it missing?
               │
               ├── MCAR (Random) → Safe to delete or impute
               │
               ├── MAR (Related to other columns)
               │     → Impute using related variables
               │     → Example: Weather missing for dome games
               │
               └── MNAR (Related to the value itself)
                     → Requires careful domain consideration
                     → Example: Snap counts missing when injured
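The three branches can be sketched in pandas. The columns here (`stadium_type`, `temperature`, `snap_count`) are hypothetical stand-ins for the MAR and MNAR examples above:

```python
import pandas as pd
import numpy as np

# Toy frame; 'stadium_type', 'temperature', 'snap_count' are hypothetical columns
df = pd.DataFrame({
    'stadium_type': ['dome', 'dome', 'outdoor', 'outdoor'],
    'temperature':  [np.nan, np.nan, 55.0, 61.0],   # MAR: missing for dome games
    'snap_count':   [64.0, np.nan, 70.0, 58.0],     # MNAR: missing when injured
})

# MAR: impute from the related variable -- dome games get a fixed indoor value
dome = df['stadium_type'] == 'dome'
df.loc[dome, 'temperature'] = df.loc[dome, 'temperature'].fillna(72.0)

# MNAR: don't silently impute; preserve the signal with an indicator column
df['snap_count_missing'] = df['snap_count'].isna()
```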

Imputation Strategy Quick Reference

Situation                 Strategy           Code
Numeric, symmetric        Mean               df['col'].fillna(df['col'].mean())
Numeric, skewed           Median             df['col'].fillna(df['col'].median())
Categorical               Mode               df['col'].fillna(df['col'].mode()[0])
Rate/percentage           0 or group mean    Context-dependent
Informative missingness   Create indicator   df['col_missing'] = df['col'].isna()

Name Standardization Pattern

class NameStandardizer:
    def __init__(self):
        self.mapping = {
            'variant1': 'Standard',
            'variant2': 'Standard',
            # ...
        }

    def standardize(self, name: str) -> str:
        cleaned = str(name).lower().strip()
        return self.mapping.get(cleaned, name.title())

    def standardize_column(self, df, col):
        df[col] = df[col].apply(self.standardize)
        return df

Key principles:

  1. Build a comprehensive mapping dictionary
  2. Clean input before matching (lowercase, strip)
  3. Handle unknown values gracefully
  4. Make it reusable across projects
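A usage sketch, repeating the class with a small hypothetical team-name mapping so it runs standalone:

```python
import pandas as pd

class NameStandardizer:
    def __init__(self):
        # Hypothetical variants that all collapse to one canonical name
        self.mapping = {
            'la rams': 'Los Angeles Rams',
            'l.a. rams': 'Los Angeles Rams',
            'los angeles rams': 'Los Angeles Rams',
        }

    def standardize(self, name: str) -> str:
        cleaned = str(name).lower().strip()
        return self.mapping.get(cleaned, name.title())

    def standardize_column(self, df, col):
        df[col] = df[col].apply(self.standardize)
        return df

df = pd.DataFrame({'team': ['LA Rams ', 'l.a. rams', 'Chicago bears']})
df = NameStandardizer().standardize_column(df, 'team')
# Unknown values ('Chicago bears') fall back to title case instead of failing
```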


Data Type Conversion Reference

Conversion          Target dtype   Method
String → Integer    Int64          pd.to_numeric(df['col'], errors='coerce').astype('Int64')
String → Float      float64        pd.to_numeric(df['col'], errors='coerce')
String → DateTime   datetime64     pd.to_datetime(df['col'], errors='coerce')
String → Boolean    bool           df['col'].str.lower().map({'true': True, 'false': False})
String → Category   category       df['col'].astype('category')

Always use errors='coerce' to convert unparseable values to NaN rather than raising exceptions.
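A quick illustration of coercion behavior on a toy series:

```python
import pandas as pd

s = pd.Series(['12', '7', 'DNP', ''])      # 'DNP' and '' are unparseable
nums = pd.to_numeric(s, errors='coerce')   # bad values become NaN, not exceptions
```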


Outlier Handling Guidelines

Detection Methods

# IQR Method
q1, q3 = df['col'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['col'] < q1 - 1.5*iqr) | (df['col'] > q3 + 1.5*iqr)]

# Z-Score Method
import numpy as np
from scipy import stats

z_scores = np.abs(stats.zscore(df['col'], nan_policy='omit'))
outliers = df[z_scores > 3]

Football-Specific Outlier Logic

Statistic      Legitimate Range   Likely Error Range
Yards gained   -15 to 99          < -20 or > 100
Down           1-4                0 or > 4
Quarter        1-5 (OT)           0 or > 5
Distance       1-50               < 1 or > 50

Golden rule: Flag for review, don't auto-delete. Verify against authoritative sources.
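A minimal flag-don't-delete sketch using the ranges above, on toy play data:

```python
import pandas as pd

# Toy plays with three suspect values (down=0, down=5, yards=102)
df = pd.DataFrame({'down': [1, 3, 0, 5], 'yards_gained': [4, 102, -3, 8]})

# Flag rather than delete: every row survives, suspects get marked for review
df['flag_down'] = ~df['down'].between(1, 4)
df['flag_yards'] = ~df['yards_gained'].between(-20, 100)
review = df[df[['flag_down', 'flag_yards']].any(axis=1)]
```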


Safe Merging Checklist

Before merging:
□ Key columns have same data type
□ Key values match (after standardization)
□ No unexpected duplicates in join keys
□ Understand relationship (1:1, 1:N, N:N)

After merging:
□ Row count is as expected
□ No unintended duplicates created
□ Check _merge indicator for unmatched rows
□ Verify sample records manually
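The "before merging" items can be checked in code; `left` and `right` here are toy frames joined on a hypothetical `team_id` key:

```python
import pandas as pd

left = pd.DataFrame({'team_id': [1, 2, 3], 'week': [1, 1, 1]})
right = pd.DataFrame({'team_id': [1, 2, 2], 'name': ['A', 'B', 'B2']})

# Same dtype on both sides of the join?
types_match = left['team_id'].dtype == right['team_id'].dtype

# Duplicate keys on the lookup side? (> 0 would fan out a left join)
dup_keys = right['team_id'].duplicated().sum()
```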

Merge Validation Pattern

merged = pd.merge(
    left, right,
    on='key_col',
    how='left',
    validate='many_to_one',  # Catches unexpected duplicates
    indicator=True           # Shows match status
)

# Check results
print(merged['_merge'].value_counts())

Feature Engineering Essentials

Common Football Features

# Distance category
df['dist_cat'] = pd.cut(df['distance'],
    bins=[0, 3, 7, 15, 100],
    labels=['short', 'medium', 'long', 'very_long'])

# Passing down indicator
df['passing_down'] = (
    ((df['down'] == 2) & (df['distance'] >= 8)) |
    ((df['down'] == 3) & (df['distance'] >= 5)) |
    (df['down'] == 4)
).astype(int)

# Field position zone
df['field_zone'] = pd.cut(df['yard_line'],
    bins=[0, 20, 40, 60, 80, 100],
    labels=['backed_up', 'own_terr', 'mid', 'opp_terr', 'red_zone'])

# Score differential
df['score_diff'] = df['team_score'] - df['opp_score']

Data Validation Framework

class Validator:
    def validate_range(self, df, col, min_val, max_val):
        invalid = (df[col] < min_val) | (df[col] > max_val)
        return df[~invalid], df[invalid]

    def validate_not_null(self, df, columns):
        return df[columns].isnull().sum()

    def validate_unique(self, df, key_cols):
        return df.duplicated(subset=key_cols).sum()

    def validate_referential(self, df, col, valid_set):
        invalid = ~df[col].isin(valid_set)
        return df[invalid]
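A usage sketch of two of the validators, repeating their definitions so the example runs standalone:

```python
import pandas as pd

class Validator:
    def validate_range(self, df, col, min_val, max_val):
        invalid = (df[col] < min_val) | (df[col] > max_val)
        return df[~invalid], df[invalid]

    def validate_unique(self, df, key_cols):
        return df.duplicated(subset=key_cols).sum()

# Toy data: one impossible down, one duplicated game_id
df = pd.DataFrame({'game_id': [1, 1, 2], 'down': [2, 6, 3]})
v = Validator()

valid, invalid = v.validate_range(df, 'down', 1, 4)
dup_count = v.validate_unique(df, ['game_id'])
```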

Pipeline Best Practices

Structure

class DataPipeline:
    def __init__(self):
        self.steps_completed = []

    def clean(self, df):
        df = self.remove_duplicates(df)
        df = self.standardize_names(df)
        df = self.fix_types(df)
        df = self.handle_missing(df)
        df = self.validate(df)
        return df

    def log(self, step, before_count, after_count):
        self.steps_completed.append({
            'step': step,
            'rows_before': before_count,
            'rows_after': after_count,
            'timestamp': pd.Timestamp.now()
        })

Principles

  1. Reproducible: Same input → same output
  2. Documented: Log every transformation
  3. Validated: Check quality at each step
  4. Modular: Easy to modify individual steps
  5. Auditable: Track data provenance

Quick Reference Card

Cleaning Workflow

1. EXPLORE    → df.info(), df.describe(), df.isnull().sum()
2. DUPLICATES → df.drop_duplicates()
3. TYPES      → pd.to_numeric(), pd.to_datetime()
4. NAMES      → NameStandardizer().standardize_column()
5. MISSING    → df.fillna() or df.dropna()
6. OUTLIERS   → Flag, investigate, then handle
7. VALIDATE   → Range checks, referential integrity
8. MERGE      → validate='many_to_one', indicator=True
9. FEATURES   → Create derived columns
10. EXPORT    → Document and save

Red Flags

Symptom                             Likely Cause
Unexpected row count after merge    Duplicate keys
String where number expected        Data entry issues
Impossible values (down = 5)        Validation failure
Team splits into multiple entries   Naming inconsistency
Mean far from median                Outliers or skew
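The mean-versus-median red flag is easy to screen for; the 0.5 threshold below is an arbitrary illustration, not a standard cutoff:

```python
import pandas as pd

s = pd.Series([3, 4, 5, 4, 80])   # one extreme value drags the mean up

# Relative gap between mean and median as a quick skew/outlier screen
gap = abs(s.mean() - s.median()) / s.median()
suspicious = gap > 0.5            # threshold chosen purely for illustration
```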

Memory Optimization Cheat Sheet

# Downcast integers
df['col'] = pd.to_numeric(df['col'], downcast='integer')

# Downcast floats
df['col'] = pd.to_numeric(df['col'], downcast='float')

# Convert strings to category (if low cardinality)
if df['col'].nunique() < len(df) * 0.5:
    df['col'] = df['col'].astype('category')

# Check memory usage
print(df.memory_usage(deep=True).sum() / 1024**2, 'MB')

Chapter Summary

Data cleaning is foundational, not optional. The time invested in proper data preparation:

  • Prevents incorrect analysis conclusions
  • Enables complex joins and aggregations
  • Makes your work reproducible
  • Builds trust in your results

Always clean before analyzing. Always document what you did. Always validate the results.