Key Takeaways: Data Cleaning and Preparation
The Data Cleaning Imperative
"Garbage in, garbage out" - No analysis can overcome fundamentally flawed data.
Professional analysts spend 60-80% of their time on data cleaning. This isn't inefficiency—it's essential due diligence.
Missing Data Decision Tree
Is data missing?
│
├── NO → Proceed with analysis
│
└── YES → Why is it missing?
    │
    ├── MCAR (Missing Completely At Random) → Safe to delete or impute
    │
    ├── MAR (Missing At Random: related to other observed columns)
    │     → Impute using related variables
    │     → Example: Weather missing for dome games
    │
    └── MNAR (Missing Not At Random: related to the value itself)
          → Requires careful domain consideration
          → Example: Snap counts missing when injured
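For the MAR case, a minimal imputation sketch, assuming hypothetical temp and roof columns where dome games lack weather readings:

# Fill each missing temperature with the mean of games under the same roof type
# ('temp' and 'roof' are illustrative column names, not from a specific dataset)
df['temp'] = df.groupby('roof')['temp'].transform(lambda s: s.fillna(s.mean()))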
Imputation Strategy Quick Reference
| Situation | Strategy | Code |
|---|---|---|
| Numeric, symmetric | Mean | df['col'].fillna(df['col'].mean()) |
| Numeric, skewed | Median | df['col'].fillna(df['col'].median()) |
| Categorical | Mode | df['col'].fillna(df['col'].mode()[0]) |
| Rate/percentage | 0 or group mean | Context-dependent |
| Informative missingness | Create indicator | df['col_missing'] = df['col'].isna() |
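For the last row of the table, the indicator should be recorded before filling. A sketch with a hypothetical snap_count column:

# Preserve the missingness signal before imputing, so downstream analysis can use it
df['snap_count_missing'] = df['snap_count'].isna().astype(int)
df['snap_count'] = df['snap_count'].fillna(0)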
Name Standardization Pattern
class NameStandardizer:
    def __init__(self):
        # Map every known variant (lowercased, stripped) to its canonical form
        self.mapping = {
            'variant1': 'Standard',
            'variant2': 'Standard',
            # ...
        }

    def standardize(self, name: str) -> str:
        # Normalize before lookup so ' JAX ' and 'jax' both match
        cleaned = str(name).lower().strip()
        # Unknown values fall back to title case instead of raising;
        # title-casing the cleaned string also handles non-string input safely
        return self.mapping.get(cleaned, cleaned.title())

    def standardize_column(self, df, col):
        df[col] = df[col].apply(self.standardize)
        return df
Key principles:
1. Build a comprehensive mapping dictionary
2. Clean input before matching (lowercase, strip)
3. Handle unknown values gracefully
4. Make it reusable across projects
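A quick usage sketch (the Jacksonville variants are illustrative):

standardizer = NameStandardizer()
standardizer.mapping.update({
    'jax': 'Jacksonville Jaguars',          # illustrative variants
    'jacksonville': 'Jacksonville Jaguars',
})
df = standardizer.standardize_column(df, 'team')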
Data Type Conversion Reference
| Original Type | Target Type | Method |
|---|---|---|
| String → Integer | Int64 | pd.to_numeric(df['col'], errors='coerce').astype('Int64') |
| String → Float | float64 | pd.to_numeric(df['col'], errors='coerce') |
| String → DateTime | datetime64 | pd.to_datetime(df['col'], errors='coerce') |
| String → Boolean | bool | df['col'].str.lower().map({'true': True, 'false': False}) |
| String → Category | category | df['col'].astype('category') |
Always use errors='coerce' to convert unparseable values to NaN rather than raising exceptions.
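Because errors='coerce' fails silently, it pays to count what was lost. A minimal check, using a hypothetical string column 'yards':

before = df['yards'].notna().sum()
df['yards'] = pd.to_numeric(df['yards'], errors='coerce')
print(f"{before - df['yards'].notna().sum()} values failed to parse and became NaN")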
Outlier Handling Guidelines
Detection Methods
# IQR method: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = df['col'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['col'] < q1 - 1.5 * iqr) | (df['col'] > q3 + 1.5 * iqr)]

# Z-score method: flag values more than 3 standard deviations from the mean
import numpy as np
from scipy import stats
z_scores = np.abs(stats.zscore(df['col'], nan_policy='omit'))  # NaNs stay NaN
outliers = df[z_scores > 3]
Football-Specific Outlier Logic
| Statistic | Legitimate Range | Likely Error Range |
|---|---|---|
| Yards gained | -15 to 99 | < -20 or > 100 |
| Down | 1-4 | 0 or > 4 |
| Quarter | 1-5 (OT) | 0 or > 5 |
| Distance | 1-50 | < 1 or > 50 |
Golden rule: Flag for review, don't auto-delete. Verify against authoritative sources.
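A sketch of the flag-for-review approach, using the down range from the table above:

# NaN downs also fail between() and get flagged, which is usually what you want
df['suspect_down'] = ~df['down'].between(1, 4)
print(f"{df['suspect_down'].sum()} plays flagged for manual review")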
Safe Merging Checklist
Before merging:
□ Key columns have same data type
□ Key values match (after standardization)
□ No unexpected duplicates in join keys
□ Understand relationship (1:1, 1:N, N:N)
After merging:
□ Row count is as expected
□ No unintended duplicates created
□ Check _merge indicator for unmatched rows
□ Verify sample records manually
Merge Validation Pattern
merged = pd.merge(
left, right,
on='key_col',
how='left',
validate='many_to_one', # Catches unexpected duplicates
indicator=True # Shows match status
)
# Check results
print(merged['_merge'].value_counts())
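The left-only rows are usually the ones worth a closer look:

# Inspect rows that found no partner on the right side of the join
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['key_col'].drop_duplicates().head())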
Feature Engineering Essentials
Common Football Features
# Distance category
df['dist_cat'] = pd.cut(df['distance'],
bins=[0, 3, 7, 15, 100],
labels=['short', 'medium', 'long', 'very_long'])
# Passing down indicator
df['passing_down'] = (
((df['down'] == 2) & (df['distance'] >= 8)) |
((df['down'] == 3) & (df['distance'] >= 5)) |
(df['down'] == 4)
).astype(int)
# Field position zone
df['field_zone'] = pd.cut(df['yard_line'],
bins=[0, 20, 40, 60, 80, 100],
labels=['backed_up', 'own_terr', 'mid', 'opp_terr', 'red_zone'])
# Score differential
df['score_diff'] = df['team_score'] - df['opp_score']
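New features deserve the same scrutiny as raw columns; a quick distribution check catches binning mistakes early:

# NaN counts here reveal values that fell outside the bin edges
print(df['dist_cat'].value_counts(dropna=False))
print(df['field_zone'].value_counts(dropna=False))
print(df['passing_down'].mean())  # share of plays flagged as passing downs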
Data Validation Framework
class Validator:
    def validate_range(self, df, col, min_val, max_val):
        # Split into (in-range, out-of-range) rows
        invalid = (df[col] < min_val) | (df[col] > max_val)
        return df[~invalid], df[invalid]

    def validate_not_null(self, df, columns):
        # Null count per column; anything above zero needs attention
        return df[columns].isnull().sum()

    def validate_unique(self, df, key_cols):
        # Count of rows duplicating an earlier row on the key columns
        return df.duplicated(subset=key_cols).sum()

    def validate_referential(self, df, col, valid_set):
        # Rows whose value is not in the set of known-good values
        invalid = ~df[col].isin(valid_set)
        return df[invalid]
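A usage sketch with hypothetical play-by-play columns and an abbreviated team set:

v = Validator()
clean_rows, bad_rows = v.validate_range(df, 'down', 1, 4)
print(v.validate_not_null(df, ['game_id', 'play_id']))   # null counts per column
print(v.validate_unique(df, ['game_id', 'play_id']))     # duplicated key rows
bad_teams = v.validate_referential(df, 'team', {'KC', 'BUF', 'PHI'})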
Pipeline Best Practices
Structure
class DataPipeline:
    def __init__(self):
        self.steps_completed = []

    def clean(self, df):
        # Each step method is implemented separately; order matters,
        # e.g. dedupe first so duplicate rows don't distort imputation means
        df = self.remove_duplicates(df)
        df = self.standardize_names(df)
        df = self.fix_types(df)
        df = self.handle_missing(df)
        df = self.validate(df)
        return df

    def log(self, step, before_count, after_count):
        # Append an audit record for each transformation
        self.steps_completed.append({
            'step': step,
            'rows_before': before_count,
            'rows_after': after_count,
            'timestamp': pd.Timestamp.now()
        })
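One way a step might wire into the log method (a sketch, not the chapter's exact implementation):

    def remove_duplicates(self, df):
        before = len(df)
        df = df.drop_duplicates()
        self.log('remove_duplicates', before, len(df))
        return df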
Principles
- Reproducible: Same input → same output
- Documented: Log every transformation
- Validated: Check quality at each step
- Modular: Easy to modify individual steps
- Auditable: Track data provenance
Quick Reference Card
Cleaning Workflow
1. EXPLORE → df.info(), df.describe(), df.isnull().sum()
2. DUPLICATES → df.drop_duplicates()
3. TYPES → pd.to_numeric(), pd.to_datetime()
4. NAMES → NameStandardizer().standardize_column()
5. MISSING → df.fillna() or df.dropna()
6. OUTLIERS → Flag, investigate, then handle
7. VALIDATE → Range checks, referential integrity
8. MERGE → validate='many_to_one', indicator=True
9. FEATURES → Create derived columns
10. EXPORT → Document and save
Red Flags
| Symptom | Likely Cause |
|---|---|
| Unexpected row count after merge | Duplicate keys |
| String where number expected | Data entry issues |
| Impossible values (down=5) | Validation failure |
| Team splits into multiple entries | Naming inconsistency |
| Mean much different from median | Outliers or skew |
Memory Optimization Cheat Sheet
# Downcast integers
df['col'] = pd.to_numeric(df['col'], downcast='integer')
# Downcast floats
df['col'] = pd.to_numeric(df['col'], downcast='float')
# Convert strings to category (if low cardinality)
if df['col'].nunique() < len(df) * 0.5:
df['col'] = df['col'].astype('category')
# Check memory usage
print(df.memory_usage(deep=True).sum() / 1024**2, 'MB')
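The three tricks combine into one reusable pass (a sketch; the 0.5 cardinality threshold is a judgment call):

def optimize_memory(df):
    for col in df.columns:
        if df[col].dtype == 'int64':
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif df[col].dtype == 'float64':
            df[col] = pd.to_numeric(df[col], downcast='float')
        elif df[col].dtype == 'object' and df[col].nunique() < len(df) * 0.5:
            df[col] = df[col].astype('category')
    return df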
Chapter Summary
Data cleaning is foundational, not optional. The time invested in proper data preparation:
- Prevents incorrect analysis conclusions
- Enables complex joins and aggregations
- Makes your work reproducible
- Builds trust in your results
Always clean before analyzing. Always document what you did. Always validate the results.