In This Chapter
- Beyond df.dropna() --- Imputation, Indicators, and When to Walk Away
- The War on Missing Values --- and Why You Are Losing It
- Why Missing Data Matters More Than You Think
- The Three Mechanisms of Missingness
- Visualizing Missingness Patterns
- Simple Imputation: When It Works and When It Lies
- Advanced Imputation: Using Structure to Fill Gaps
- Missing Indicators: When the Gap IS the Signal
- The Decision Framework: Drop, Impute, or Indicator?
- Putting It All Together: The Imputation Pipeline
- Little's MCAR Test: A Formal Check
- Common Pitfalls and How to Avoid Them
- Progressive Project M3 (Part 2): Missing Data in StreamFlow
- Chapter Summary
Chapter 8: Missing Data Strategies
Beyond df.dropna() --- Imputation, Indicators, and When to Walk Away
Learning Objectives
By the end of this chapter, you will be able to:
- Classify missing data mechanisms (MCAR, MAR, MNAR) and explain why it matters
- Implement simple imputation (mean, median, mode) and understand when it fails
- Apply advanced imputation (KNN, iterative/MICE, model-based)
- Use missing indicators as features
- Decide when to drop rows, drop columns, or impute
The War on Missing Values --- and Why You Are Losing It
War Story --- A predictive maintenance team at a manufacturing firm had sensor data from 1,200 wind turbines. Vibration sensors, temperature probes, pressure gauges --- 847 measurements per turbine, streaming every 10 seconds. When they built their first bearing-failure model, the data engineer noticed that roughly 40% of the rows had at least one missing sensor reading. "Dirty data," she said. She ran df.dropna() and moved on. The model trained beautifully on the remaining 60% of the data and achieved an AUC of 0.72 on the test set. Respectable, but not what they had hoped.
Six months later, a domain expert from the mechanical engineering team joined a review meeting. He looked at the feature importance plot, looked at the missing data report, and turned pale. "You dropped the rows where the sensors went offline?" he asked. "Those sensors go offline because they are mounted on equipment that is vibrating itself apart. The missingness IS the failure signal. You threw away the answer."
They rebuilt the model with missing indicators --- binary flags marking whether each sensor reading was absent. Three of those indicators landed in the top ten features. The AUC jumped to 0.89. The missing data was not noise. It was the most predictive signal in the entire dataset.
That story should reframe everything you think you know about missing data. Most introductory courses treat missingness as a nuisance --- something to clean up before the "real" analysis begins. df.dropna(). df.fillna(df.mean()). Move on.
This chapter argues that missing data is not a preprocessing annoyance. It is a modeling decision with consequences that propagate through every downstream result. The wrong decision can bias your coefficients, destroy your rarest signals, and produce a model that works beautifully in development and fails catastrophically in production.
df.dropna() is the rm -rf of data science --- it solves the immediate problem by destroying the evidence.
We are going to do better.
Why Missing Data Matters More Than You Think
Before we touch any code, let us establish the stakes.
Consider the StreamFlow churn prediction dataset from previous chapters. You have 2.4 million subscriber-months of data with 25 features. Here is the missingness profile:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
# Simulate a realistic StreamFlow missingness profile
n = 50000 # Subscriber-months
# Feature missingness rates (realistic for a SaaS product)
missingness_profile = {
    'tenure_months': 0.0,            # Always known
    'plan_type': 0.0,                # Always known
    'monthly_charge': 0.0,           # Always known (billing system)
    'total_hours_last_7d': 0.12,     # 12% missing
    'total_hours_last_30d': 0.08,    # 8% missing
    'sessions_last_7d': 0.12,        # Same missingness as hours_7d
    'sessions_last_30d': 0.08,       # Same missingness as hours_30d
    'avg_session_duration': 0.15,    # 15% missing
    'devices_used': 0.02,            # 2% missing
    'genre_diversity_score': 0.18,   # 18% missing (requires enough history)
    'support_tickets_last_90d': 0.0, # Always known
    'days_since_last_login': 0.05,   # 5% missing
    'email_open_rate': 0.22,         # 22% missing (email tracking not universal)
    'nps_score': 0.65,               # 65% missing (most users never respond)
}
print("StreamFlow Feature Missingness Profile")
print("=" * 50)
for feature, rate in sorted(missingness_profile.items(), key=lambda x: x[1], reverse=True):
    bar = "#" * int(rate * 40)
    print(f"{feature:30s} {rate:5.1%} {bar}")
StreamFlow Feature Missingness Profile
==================================================
nps_score                      65.0% ##########################
email_open_rate                22.0% ########
genre_diversity_score          18.0% #######
avg_session_duration           15.0% ######
total_hours_last_7d            12.0% ####
sessions_last_7d               12.0% ####
total_hours_last_30d            8.0% ###
sessions_last_30d               8.0% ###
days_since_last_login           5.0% ##
devices_used                    2.0%
tenure_months                   0.0%
plan_type                       0.0%
monthly_charge                  0.0%
support_tickets_last_90d        0.0%
If you run df.dropna() on this dataset, you lose every row that has at least one missing value in any feature. With these missingness rates, that means losing roughly three quarters of your data. You started with 50,000 subscriber-months; you are left with roughly 13,000. And the rows you kept are systematically different from the rows you dropped --- they are the subscribers who are most engaged (they have usage data), most responsive (they answered the NPS survey), and most trackable (email opens registered). In other words, you kept the healthy patients and threw out the sick ones.
Your churn model, trained on this biased sample, will learn to predict churn among engaged users. It will have no idea what to do with the disengaged users who are actually most likely to churn --- because it has never seen them.
This is not a hypothetical. This is the default behavior of df.dropna().
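A back-of-the-envelope calculation shows why the loss is so severe. Under the simplifying (and usually false) assumption that features go missing independently, the fraction of fully complete rows is the product of the per-feature retention rates. A quick sketch using the rates from the profile above:

```python
# Per-feature missingness rates from the profile above
# (only the features that are ever missing).
missing_rates = [0.12, 0.08, 0.12, 0.08, 0.15, 0.02, 0.18, 0.05, 0.22, 0.65]

retention = 1.0
for rate in missing_rates:
    retention *= (1 - rate)  # probability this feature is present in a row

print(f"Complete-row fraction if missingness were independent: {retention:.1%}")
```

This works out to roughly 12%. Because missingness in practice is positively correlated (the 7-day hours and sessions features fail together, for example), the same missing values concentrate in fewer rows, which pushes the real retained fraction up toward the 20-25% range --- still a catastrophic loss.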
The Three Mechanisms of Missingness
Not all missing data is created equal. The statistical framework for understanding missingness, developed by Donald Rubin in the 1970s, identifies three mechanisms. These matter because each one requires a different response, and the wrong response can introduce bias that no algorithm can fix.
MCAR: Missing Completely at Random
Definition: The probability that a value is missing is unrelated to the value itself or to any other observed variable. Missingness is purely random.
StreamFlow example: A data pipeline bug causes 3% of usage records to be dropped at random during nightly ETL processing. The bug affects all subscribers equally, regardless of their usage level, plan type, or churn status. The records that survive are a random sample of all records.
Why it matters: MCAR is the only mechanism under which dropping rows introduces no bias. Your remaining data is a smaller but representative sample of the full dataset. Means, variances, and correlations are all unbiased estimates of the true population values.
How to detect it: If data is MCAR, the rows with missing values should look statistically indistinguishable from the rows without missing values on all observed features. You can test this informally by comparing group means, or formally using Little's MCAR test.
def informal_mcar_check(df, feature_with_missing, comparison_features):
    """
    Compare observed features between rows where the target feature
    is present vs. missing. Large differences suggest non-MCAR.
    """
    is_missing = df[feature_with_missing].isna()
    print(f"Checking MCAR for: {feature_with_missing}")
    print(f"  Missing: {is_missing.sum()} ({is_missing.mean():.1%})")
    print(f"  Present: {(~is_missing).sum()} ({(~is_missing).mean():.1%})")
    print()
    for feat in comparison_features:
        if df[feat].dtype in ['float64', 'int64']:
            mean_present = df.loc[~is_missing, feat].mean()
            mean_missing = df.loc[is_missing, feat].mean()
            diff_pct = (mean_missing - mean_present) / mean_present * 100
            flag = " *** SUSPICIOUS" if abs(diff_pct) > 10 else ""
            print(f"  {feat:30s} present={mean_present:.2f} "
                  f"missing={mean_missing:.2f} diff={diff_pct:+.1f}%{flag}")
    print()
Production Tip --- MCAR is rare in practice. Most missing data in production systems is not random. It is the result of a process --- a user decision, a system limitation, a business rule --- and that process is almost always correlated with something you care about. Assume your data is NOT MCAR until proven otherwise.
MAR: Missing at Random
Definition: The probability that a value is missing depends on other observed variables but not on the missing value itself. Missingness can be explained by data you have.
StreamFlow example: The total_hours_last_7d feature is missing for 12% of subscribers. But it is not missing at random. Users on the free plan have a simpler tracking system that occasionally fails to log sessions. When you condition on plan_type, the missingness rate is 2% for paid subscribers and 28% for free-tier users. The missingness is related to plan type (observed) but not to the actual hours watched (the missing value itself).
Why it matters: Under MAR, dropping rows introduces bias because the remaining sample overrepresents certain groups (paid subscribers, in this case). But the bias can be corrected by conditioning on the variables that explain the missingness. Imputation methods that use the observed features (KNN, MICE) can produce unbiased estimates if the MAR assumption holds.
How to detect it: Check whether missingness correlates with observed features. If it does, the data is at least MAR (it could also be MNAR --- you cannot distinguish between MAR and MNAR from the data alone).
def check_missingness_correlations(df, target_feature):
    """
    Compute correlation between missingness of target_feature
    and all other features. High correlations suggest MAR.
    """
    is_missing = df[target_feature].isna().astype(int)
    correlations = []
    for col in df.columns:
        if col == target_feature:
            continue
        if df[col].dtype in ['float64', 'int64'] and df[col].notna().sum() > 100:
            corr = df[col].corr(is_missing)
            correlations.append((col, corr))
    correlations.sort(key=lambda x: abs(x[1]), reverse=True)
    print(f"Missingness correlations for: {target_feature}")
    print(f"{'Feature':30s} {'Correlation':>12s}")
    print("-" * 44)
    for feat, corr in correlations[:10]:
        flag = " ***" if abs(corr) > 0.15 else ""
        print(f"{feat:30s} {corr:+12.3f}{flag}")
MNAR: Missing Not at Random
Definition: The probability that a value is missing depends on the value itself. The reason the data is missing is related to what the data would have shown.
StreamFlow example: The total_hours_last_7d feature is missing because the user did not use the product at all in the last 7 days. Users who watched zero hours show up as NULL in the usage table because no usage events were logged. The missing value would be zero (or near-zero), and the missingness is caused by the value being zero. This is MNAR.
TurbineTech example: A vibration sensor reading is missing because the sensor itself has failed due to excessive vibration. The missing value would have been extremely high (indicating imminent bearing failure), and the missingness is caused by the very condition the value would have captured. This is MNAR --- and it is the most dangerous kind, because the missingness is not just informative, it is the single most predictive feature in the dataset.
Why it matters: Under MNAR, no amount of imputation from observed data will recover the truth, because the reason for missingness is precisely the thing you cannot observe. Dropping rows is biased. Simple imputation is biased. Even sophisticated imputation is biased unless you model the missingness mechanism itself. The only reliable approach is to incorporate the missingness pattern directly into your model.
Common Mistake --- The most dangerous assumption in data science is treating MNAR data as if it were MCAR. When a SaaS user's usage data is missing because they stopped using the product, filling in the mean usage (from active users) tells the model "this user is average." They are not average. They are gone. The imputed value actively misleads the model.
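A toy simulation (synthetic numbers, not StreamFlow data) makes this trap concrete. Suppose weekly watch hours are exponentially distributed, and any user below one hour generates no usage events and therefore logs as NaN --- textbook MNAR:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic weekly watch hours for 10,000 subscribers
true_hours = rng.exponential(scale=4.0, size=10_000)

# MNAR: near-zero users log no usage events, so their value is missing
observed = pd.Series(np.where(true_hours < 1.0, np.nan, true_hours))

# "Fix" it with mean imputation --- the mean of the *observed* values
mean_filled = observed.fillna(observed.mean())

print(f"True mean:               {true_hours.mean():.2f}")
print(f"Mean of observed values: {observed.mean():.2f}")
print(f"Mean after imputation:   {mean_filled.mean():.2f}")
```

The observed mean sits well above the true mean because every low value is hidden, so imputation assigns the least active users --- the ones most likely to churn --- an average activity level. No choice of summary statistic fixes this: the bias lives in what you cannot see.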
Visualizing Missingness Patterns
Before deciding how to handle missing data, you need to see the patterns. A missingness heatmap reveals structure that summary statistics hide.
import matplotlib.pyplot as plt
import seaborn as sns
def plot_missingness_heatmap(df, figsize=(14, 6)):
    """
    Visualize the pattern of missing values across the dataset.
    Each row is a sample, each column is a feature.
    White = present, dark = missing.
    """
    # Sort columns by missingness rate for readability
    missing_rates = df.isnull().mean().sort_values(ascending=False)
    cols_sorted = missing_rates.index.tolist()
    fig, axes = plt.subplots(1, 2, figsize=figsize)

    # Left: missingness heatmap (sample of rows for readability)
    sample_idx = np.random.choice(len(df), min(500, len(df)), replace=False)
    sample_idx.sort()
    ax = axes[0]
    ax.imshow(
        df.iloc[sample_idx][cols_sorted].isnull().values.astype(int),
        aspect='auto', cmap='Greys', interpolation='none'
    )
    ax.set_xlabel('Features')
    ax.set_ylabel('Samples (random 500)')
    ax.set_title('Missingness Pattern (black = missing)')
    ax.set_xticks(range(len(cols_sorted)))
    ax.set_xticklabels(cols_sorted, rotation=90, fontsize=8)

    # Right: missingness co-occurrence (which features go missing together?)
    missing_corr = df[cols_sorted].isnull().corr()
    sns.heatmap(missing_corr, ax=axes[1], cmap='RdBu_r', center=0,
                annot=True, fmt='.2f', square=True, cbar_kws={'shrink': 0.8},
                xticklabels=True, yticklabels=True)
    axes[1].set_title('Missingness Co-occurrence')
    plt.tight_layout()
    plt.savefig('missingness_heatmap.png', dpi=150, bbox_inches='tight')
    plt.show()
The co-occurrence matrix on the right is often more informative than the heatmap on the left. If total_hours_last_7d and sessions_last_7d always go missing together, they share a cause (the same tracking pipeline). If nps_score missingness is uncorrelated with everything else, it might be MCAR. If genre_diversity_score missingness correlates strongly with tenure_months, new users lack the history to compute it (MAR).
Look for these patterns:
- Monotone missingness: If feature A is missing, features B and C are also always missing. This often indicates a data source that was unavailable for a subset of records.
- Block missingness: A rectangular block of missingness, indicating a time period or user segment where an entire data source was offline.
- Correlated missingness: Two features that go missing together, suggesting a shared upstream cause.
- Independent missingness: Features that go missing independently of each other, suggesting different causes for each.
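These patterns can be checked programmatically as well as visually. Below is a small helper (the function name is mine, not a library API) that flags near-perfect co-occurrence, which covers both the monotone and the correlated cases:

```python
import pandas as pd

def find_cooccurring_missingness(df, threshold=0.95):
    """Return (a, b, rate) tuples where, among rows missing feature a,
    feature b is also missing at least `threshold` of the time."""
    miss = df.isnull()
    cols = [c for c in df.columns if miss[c].any()]
    pairs = []
    for a in cols:
        for b in cols:
            if a == b:
                continue
            rate = miss.loc[miss[a], b].mean()  # P(b missing | a missing)
            if rate >= threshold:
                pairs.append((a, b, rate))
    return pairs

# Tiny illustration: two features fed by the same tracking pipeline
df = pd.DataFrame({
    'hours_7d':    [1.0, None, 2.0, None, 3.0],
    'sessions_7d': [2.0, None, 4.0, None, 6.0],
    'nps_score':   [None, 9.0, None, 8.0, 7.0],
})
print(find_cooccurring_missingness(df))
```

Here the two usage features go missing in exactly the same rows, so the helper reports them in both directions, while nps_score missingness is unrelated and stays out of the list.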
Simple Imputation: When It Works and When It Lies
Simple imputation replaces missing values with a single summary statistic computed from the observed data. It is fast, easy to implement, and often good enough. It is also often quietly wrong.
Mean Imputation
from sklearn.impute import SimpleImputer
# Mean imputation
mean_imputer = SimpleImputer(strategy='mean')
# Fit on training data only
X_train_imputed = pd.DataFrame(
    mean_imputer.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)
# Transform test data using training statistics
X_test_imputed = pd.DataFrame(
    mean_imputer.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)
When it works: The feature is MCAR (or close to it), the feature distribution is roughly symmetric, and the missingness rate is low (under 5%). Under these conditions, the mean of the observed values is a reasonable estimate of the missing values, and the bias is minimal.
When it fails: Consider what mean imputation does to the distribution. If avg_session_duration has a right-skewed distribution with mean 23 minutes and median 14 minutes, mean imputation replaces every missing value with 23 minutes. This:
- Shrinks the variance. Every missing value becomes exactly the mean, pulling the distribution toward the center. Your model sees less spread than actually exists.
- Distorts correlations. If avg_session_duration is correlated with total_hours_last_30d, mean imputation breaks that correlation for every imputed row. The imputed value of 23 minutes is the same regardless of whether the user watched 2 hours or 200 hours last month.
- Creates a spike. The imputed distribution has an artificial spike at the mean value that does not exist in the real data.
def demonstrate_mean_imputation_distortion(series, missing_rate=0.20):
    """Show how mean imputation distorts the distribution."""
    np.random.seed(42)
    # Original distribution: right-skewed (typical of session durations)
    original = series.dropna().copy()
    # Create missingness (for demonstration)
    mask = np.random.random(len(original)) < missing_rate
    with_missing = original.copy()
    with_missing.iloc[mask] = np.nan
    # Mean imputation
    imputed_mean = with_missing.fillna(with_missing.mean())

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    axes[0].hist(original, bins=50, edgecolor='black', alpha=0.7)
    axes[0].set_title('Original Distribution')
    axes[0].axvline(original.mean(), color='red', linestyle='--',
                    label=f'Mean: {original.mean():.1f}')
    axes[0].legend()
    axes[1].hist(with_missing.dropna(), bins=50, edgecolor='black', alpha=0.7, color='orange')
    axes[1].set_title(f'After Dropping {missing_rate:.0%} Missing')
    axes[1].axvline(with_missing.dropna().mean(), color='red', linestyle='--')
    axes[2].hist(imputed_mean, bins=50, edgecolor='black', alpha=0.7, color='green')
    axes[2].set_title('After Mean Imputation')
    axes[2].axvline(imputed_mean.mean(), color='red', linestyle='--')
    plt.tight_layout()
    plt.savefig('mean_imputation_distortion.png', dpi=150, bbox_inches='tight')
    plt.show()

    print(f"Original - Mean: {original.mean():.2f}, Std: {original.std():.2f}")
    print(f"Imputed  - Mean: {imputed_mean.mean():.2f}, Std: {imputed_mean.std():.2f}")
    print(f"Variance reduction: {(1 - imputed_mean.std()/original.std())*100:.1f}%")
Median Imputation
# Median imputation
median_imputer = SimpleImputer(strategy='median')
X_train_imputed = pd.DataFrame(
    median_imputer.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)
Median imputation has the same structural problems as mean imputation (variance shrinkage, correlation distortion, artificial spike) but is less sensitive to outliers. For skewed distributions --- which are common in behavioral data --- the median is a better central value than the mean. When avg_session_duration has a mean of 23 and a median of 14, imputing 14 puts the filled values in the middle of where most users actually are, rather than pulling them toward the tail.
Production Tip --- Use median imputation as your default for numeric features. It is almost never the best imputation strategy, but it is rarely the worst. It is the "logistic regression of imputation" --- a solid baseline that you should always try first and improve upon only when you have evidence that something better is needed.
Mode Imputation (Categorical Features)
# Mode imputation for categoricals
mode_imputer = SimpleImputer(strategy='most_frequent')
Mode imputation replaces missing categorical values with the most common category. It has an additional problem beyond variance shrinkage: if the most frequent category is "Basic Plan" (60% of users), every user with a missing plan type gets assigned "Basic Plan." If those users are actually disproportionately on the free tier (which has worse tracking, causing the missingness), you have introduced systematic misclassification.
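For categoricals there is often a better option than the mode: promote missingness to its own category. This is the categorical cousin of the missing indicators covered later in this chapter --- the model gets to treat "unknown" as its own group instead of silently folding it into the majority class. A minimal sketch (the column values are illustrative):

```python
import pandas as pd

plans = pd.Series(['Basic', 'Premium', None, 'Basic', None], name='plan_type')

# Mode imputation folds the unknowns into the majority class...
mode_filled = plans.fillna(plans.mode()[0])

# ...while an explicit category preserves the fact that they differ
explicit = plans.fillna('Missing')

print(mode_filled.value_counts().to_dict())
print(explicit.value_counts().to_dict())
```

With mode imputation the two unknown users become indistinguishable from genuine Basic-plan users; with the explicit category they remain a visible group the model can learn from.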
The Fundamental Problem with Simple Imputation
All simple imputation methods share a fatal flaw: they replace a missing value with the same constant for every row, regardless of the context provided by other features. A user who has watched 200 hours in the last month and has a missing avg_session_duration is fundamentally different from a user who has watched 2 hours. Simple imputation assigns them the same value.
This is not a subtle statistical issue. It is a modeling error that can propagate through your entire pipeline.
Advanced Imputation: Using Structure to Fill Gaps
Advanced imputation methods use the relationships between features to produce more realistic imputed values. Instead of replacing every missing avg_session_duration with the same constant, they ask: "Given what we know about this subscriber's other features, what is their session duration likely to be?"
KNN Imputation
KNN imputation finds the K most similar rows (based on non-missing features) and uses their values to fill in the gap. If a subscriber with missing session duration is most similar to 5 neighbors who have session durations of 8, 12, 10, 15, and 11 minutes, the imputed value is 11.2 minutes (the mean of the neighbors).
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# KNN imputation requires scaled features (distance-based)
knn_imputer = Pipeline([
    ('scaler', StandardScaler()),
    ('imputer', KNNImputer(n_neighbors=5, weights='distance')),
])

# Fit on training data, transform both
X_train_knn = pd.DataFrame(
    knn_imputer.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)
X_test_knn = pd.DataFrame(
    knn_imputer.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)
Advantages: - Preserves local structure. Imputed values reflect the feature patterns of similar observations. - Preserves correlations between features better than simple imputation. - Works well when missingness is MAR and the feature relationships are smooth.
Disadvantages: - Computationally expensive. For each missing value, the algorithm must compute distances to all complete rows. With 50,000 rows and 25 features, this can take minutes rather than milliseconds. - Sensitive to the distance metric and the value of K. - All features must be numeric (or pre-encoded). The scaler is required because KNN is distance-based --- a feature with range [0, 100000] will dominate a feature with range [0, 1]. - Does not handle MNAR well. If the reason a value is missing is related to the value itself, the neighbors have systematically different values than the missing observation would.
Try It --- Run KNN imputation with K=1, K=5, and K=20 on the StreamFlow dataset. Compare the distribution of imputed values for avg_session_duration at each K. You should see that K=1 preserves the variance but introduces noise, K=20 smooths the variance but regresses toward local means, and K=5 is a reasonable compromise. This is a bias-variance tradeoff within imputation itself.
Iterative Imputation (MICE)
Multiple Imputation by Chained Equations (MICE) is the most sophisticated imputation method available in scikit-learn. The idea is elegant: treat each feature with missing values as a prediction target, and use the other features as predictors. Iterate until the imputed values stabilize.
The algorithm:
- Initialize all missing values with simple imputation (e.g., mean).
- For each feature with missing values, in sequence: a. Fit a regression model using all other features as predictors. b. Replace the initially imputed values with the model's predictions.
- Repeat steps 2a-2b for multiple iterations until convergence.
from sklearn.experimental import enable_iterative_imputer # Required
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
# Iterative imputation with a Random Forest estimator
mice_imputer = IterativeImputer(
    estimator=RandomForestRegressor(
        n_estimators=100, max_depth=5, random_state=42
    ),
    max_iter=10,
    random_state=42,
    verbose=0
)
X_train_mice = pd.DataFrame(
    mice_imputer.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)
X_test_mice = pd.DataFrame(
    mice_imputer.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)
Common Mistake --- The IterativeImputer is still marked as experimental in scikit-learn, which is why you need the enable_iterative_imputer import. This does not mean it is unreliable --- it has been stable for years --- but the API could change in future versions. Pin your scikit-learn version in production.
Advantages: - Produces the most realistic imputed values by modeling feature relationships. - Handles arbitrary patterns of missingness (not just monotone). - Flexible: you can use any scikit-learn estimator as the imputation model.
Disadvantages: - Slowest imputation method by far. With a Random Forest estimator, 25 features, and 50,000 rows, expect minutes of computation per fit. - Can introduce subtle biases if the imputation model is misspecified. - The order in which features are imputed can affect results (though convergence usually eliminates this). - Still cannot handle MNAR. If the value is missing because of the value, predicting the value from other features will systematically miss the truth.
Comparing Imputation Methods
Let us put this head-to-head with a controlled experiment.
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
def compare_imputation_strategies(X_train, y_train, feature_names):
    """
    Compare imputation methods using downstream model performance.
    The best imputation is the one that produces the best predictions.
    """
    strategies = {
        'Drop rows (listwise)': None,  # Special case
        'Mean imputation': SimpleImputer(strategy='mean'),
        'Median imputation': SimpleImputer(strategy='median'),
        'KNN (k=5)': KNNImputer(n_neighbors=5),
        'Iterative (BayesianRidge)': IterativeImputer(
            max_iter=10, random_state=42
        ),
    }
    results = []
    for name, imputer in strategies.items():
        if imputer is None:
            # Listwise deletion
            mask = X_train.notna().all(axis=1)
            X_complete = X_train[mask]
            y_complete = y_train[mask]
            n_rows = len(X_complete)
            pipe = Pipeline([
                ('scaler', StandardScaler()),
                ('model', GradientBoostingClassifier(
                    n_estimators=200, random_state=42
                ))
            ])
            scores = cross_val_score(
                pipe, X_complete, y_complete, cv=5, scoring='roc_auc'
            )
        else:
            n_rows = len(X_train)
            pipe = Pipeline([
                ('imputer', imputer),
                ('scaler', StandardScaler()),
                ('model', GradientBoostingClassifier(
                    n_estimators=200, random_state=42
                ))
            ])
            scores = cross_val_score(
                pipe, X_train, y_train, cv=5, scoring='roc_auc'
            )
        results.append({
            'Strategy': name,
            'AUC (mean)': scores.mean(),
            'AUC (std)': scores.std(),
            'Rows used': n_rows,
            'Data retained': n_rows / len(X_train)
        })
    results_df = pd.DataFrame(results)
    print(results_df.to_string(index=False))
    return results_df
Strategy AUC (mean) AUC (std) Rows used Data retained
Drop rows (listwise) 0.7640 0.0110 12840 0.257
Mean imputation 0.8020 0.0085 50000 1.000
Median imputation 0.8035 0.0082 50000 1.000
KNN (k=5) 0.8190 0.0078 50000 1.000
Iterative (BayesianRidge) 0.8230 0.0075 50000 1.000
The pattern is consistent across most real-world datasets:
- Listwise deletion is worst --- not because the imputed values are bad (they do not exist), but because you have lost 74% of your data. Less data means less signal, worse generalization, and higher variance.
- Simple imputation is better than dropping --- even crude mean/median imputation recovers most of the performance lost from dropping rows.
- Advanced imputation is best --- but the marginal improvement over simple imputation is often modest. Going from median to iterative gains about 2 AUC points; going from dropping rows to median gains about 4.
The implication: do not agonize over imputation method choice before you have established that you are not accidentally dropping data. The biggest gain comes from keeping your data, not from sophisticated fill values.
Missing Indicators: When the Gap IS the Signal
This is the most important section of this chapter. Everything before it is standard statistical practice. What follows is the insight that separates textbook data scientists from practitioners.
The Concept
A missing indicator is a binary feature that equals 1 when the original feature is missing and 0 when it is present. You add it alongside the imputed feature, so the model has access to both the filled-in value and the fact that the value was originally absent.
def add_missing_indicators(df, features_to_track):
    """
    Add binary missing indicators for specified features.
    Convention: {feature_name}_missing = 1 if original value was NaN.
    """
    df_out = df.copy()
    for feat in features_to_track:
        indicator_name = f"{feat}_missing"
        df_out[indicator_name] = df_out[feat].isna().astype(int)
    return df_out
# Add indicators before imputation
features_to_track = [
    'total_hours_last_7d',
    'sessions_last_7d',
    'avg_session_duration',
    'genre_diversity_score',
    'email_open_rate',
    'nps_score',
    'days_since_last_login',
]
X_train_with_indicators = add_missing_indicators(X_train, features_to_track)
X_test_with_indicators = add_missing_indicators(X_test, features_to_track)

# Now impute the original features (the indicators are already complete)
imputer = SimpleImputer(strategy='median')

# Separate indicator columns (already complete) from features to impute
indicator_cols = [f"{f}_missing" for f in features_to_track]
feature_cols = [c for c in X_train_with_indicators.columns if c not in indicator_cols]

X_train_with_indicators[feature_cols] = imputer.fit_transform(
    X_train_with_indicators[feature_cols]
)
X_test_with_indicators[feature_cols] = imputer.transform(
    X_test_with_indicators[feature_cols]
)

print(f"Original features:  {len(feature_cols)}")
print(f"Missing indicators: {len(indicator_cols)}")
print(f"Total features:     {len(X_train_with_indicators.columns)}")
Why Missing Indicators Work
The magic of missing indicators is that they let the model learn different relationships for present and absent values.
Without a missing indicator, after mean imputation:
- Subscriber A: total_hours_last_7d = 15.3 (actual value)
- Subscriber B: total_hours_last_7d = 15.3 (imputed --- actually was NaN)
The model cannot distinguish these two subscribers. They look identical.
With a missing indicator:
- Subscriber A: total_hours_last_7d = 15.3, total_hours_last_7d_missing = 0
- Subscriber B: total_hours_last_7d = 15.3, total_hours_last_7d_missing = 1
Now the model can learn: "When usage data is present and equals 15.3, churn probability is X. When usage data is missing (and imputed to 15.3), churn probability is Y." And Y is typically much higher than X, because the user with missing usage data may have simply stopped using the product.
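If you would rather not maintain the indicator logic yourself, scikit-learn builds it into the imputers: SimpleImputer(add_indicator=True) appends one binary column per feature that contained missing values during fit (there is also a standalone sklearn.impute.MissingIndicator transformer). A minimal sketch with illustrative column names:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({
    'hours_7d': [15.3, np.nan, 4.0, np.nan],
    'tenure':   [12.0, 3.0, 7.0, 1.0],
})

imputer = SimpleImputer(strategy='median', add_indicator=True)
out = imputer.fit_transform(X)

# Columns: imputed hours_7d, tenure, then an indicator for hours_7d
# (tenure gets no indicator because it had no missing values at fit time)
print(out.shape)   # (4, 3)
print(out[:, -1])  # [0. 1. 0. 1.]
```

One caveat for production pipelines: the number of output columns depends on which features happened to be missing in the training data, so the explicit add_missing_indicators helper above, driven by a fixed feature list, is often easier to keep stable across retrains.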
The StreamFlow Evidence
Let us measure the predictive power of missingness in the StreamFlow dataset.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
def evaluate_missing_indicators(X_train, y_train, features_to_track):
    """
    Compare model performance with and without missing indicators.
    """
    # Strategy 1: Median imputation only
    X_imputed = X_train.copy()
    imputer = SimpleImputer(strategy='median')
    X_imputed = pd.DataFrame(
        imputer.fit_transform(X_imputed),
        columns=X_imputed.columns,
        index=X_imputed.index
    )
    pipe_no_indicators = Pipeline([
        ('scaler', StandardScaler()),
        ('model', GradientBoostingClassifier(
            n_estimators=200, random_state=42
        ))
    ])
    scores_no = cross_val_score(
        pipe_no_indicators, X_imputed, y_train, cv=5, scoring='roc_auc'
    )

    # Strategy 2: Median imputation + missing indicators
    X_with_ind = add_missing_indicators(X_train.copy(), features_to_track)
    indicator_cols = [f"{f}_missing" for f in features_to_track]
    feature_cols = [c for c in X_with_ind.columns if c not in indicator_cols]
    X_with_ind[feature_cols] = imputer.fit_transform(X_with_ind[feature_cols])
    pipe_with_indicators = Pipeline([
        ('scaler', StandardScaler()),
        ('model', GradientBoostingClassifier(
            n_estimators=200, random_state=42
        ))
    ])
    scores_with = cross_val_score(
        pipe_with_indicators, X_with_ind, y_train, cv=5, scoring='roc_auc'
    )

    print("Impact of Missing Indicators")
    print("=" * 50)
    print(f"Without indicators: AUC = {scores_no.mean():.4f} "
          f"(+/- {scores_no.std():.4f})")
    print(f"With indicators:    AUC = {scores_with.mean():.4f} "
          f"(+/- {scores_with.std():.4f})")
    print(f"Improvement:        AUC = {scores_with.mean() - scores_no.mean():+.4f}")
Impact of Missing Indicators
==================================================
Without indicators: AUC = 0.8035 (+/- 0.0082)
With indicators: AUC = 0.8410 (+/- 0.0068)
Improvement: AUC = +0.0375
A +3.75 AUC point improvement from simply adding binary flags. No new data collected. No algorithm change. Just telling the model which values were originally missing.
And when you look at the feature importance rankings, the reason becomes clear:
Feature Importance (top 15):
1. days_since_last_login 0.142
2. total_hours_last_7d 0.098
3. total_hours_last_7d_missing 0.087 *** Missing indicator
4. sessions_last_7d_missing 0.072 *** Missing indicator
5. tenure_months 0.068
6. support_tickets_last_90d 0.062
7. avg_session_duration 0.055
8. avg_session_duration_missing 0.048 *** Missing indicator
9. monthly_charge 0.044
10. sessions_last_7d 0.041
11. genre_diversity_score 0.038
12. plan_type_encoded 0.035
13. days_since_last_login_missing 0.031 *** Missing indicator
14. email_open_rate 0.028
15. devices_used 0.025
Four of the top 15 features are missing indicators. total_hours_last_7d_missing is the third most important feature in the entire model. The model learned what the domain expert knew all along: a subscriber whose usage data is missing in the last 7 days is a subscriber who has disengaged. The missingness IS the churn signal.
Production Tip --- Add missing indicators for any feature where missingness is plausibly informative. The cost is negligible (one binary feature per tracked feature), and the potential upside is significant. Even if the indicator turns out to be uninformative, tree-based models will simply ignore it --- no harm done. This is one of the highest-ROI practices in applied ML.
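scikit-learn can generate these flags for you: SimpleImputer(add_indicator=True) appends one binary column per feature that contained NaN at fit time. A sketch (the column names are illustrative; note that a feature which happens to be complete in training gets no indicator, which can cause a train/serve mismatch):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({
    'total_hours_last_7d': [12.0, np.nan, 3.5, np.nan],
    'tenure_months':       [24,   6,      18,  2],
})

# add_indicator=True appends a 0/1 column for every feature that
# contained NaN during fit -- here only total_hours_last_7d.
imputer = SimpleImputer(strategy='median', add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out.shape)  # (4, 3): 2 original columns + 1 indicator
```

The indicator columns are appended after the imputed features, so the last column here flags the rows where total_hours_last_7d was NaN.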
The TurbineTech Pattern: Missingness as Prediction
The StreamFlow case is striking. The TurbineTech case is dramatic.
In predictive maintenance, sensor dropout patterns are among the strongest failure predictors. Sensors are physical devices mounted on physical equipment. When equipment begins to fail:
- Vibration increases beyond the sensor's operating range, causing intermittent readings.
- Extreme temperatures degrade the sensor's electronics, causing dropouts.
- Corrosion from leaking fluids damages wiring connections, causing progressive signal loss.
The pattern: sensors begin producing intermittent missing values days or weeks before the equipment fails catastrophically. The missingness is not noise --- it is an early warning system.
def create_sensor_missingness_features(df, sensor_cols, window_days=7):
"""
Create missingness-based features for sensor data.
These are often among the top predictors in maintenance models.
"""
features = {}
for sensor in sensor_cols:
# Binary: is the latest reading missing?
features[f'{sensor}_missing_now'] = df[sensor].isna().astype(int)
# Count: how many missing readings in the last window?
# (Assumes df is sorted by timestamp with one row per reading)
features[f'{sensor}_missing_count_{window_days}d'] = (
df[sensor]
.isna()
.rolling(window=window_days, min_periods=1)
.sum()
)
# Rate: what fraction of readings are missing in the window?
features[f'{sensor}_missing_rate_{window_days}d'] = (
df[sensor]
.isna()
.rolling(window=window_days, min_periods=1)
.mean()
)
        # Trend: is the missingness rate increasing?
        # Reuse the short-window rate computed above and compare it
        # to the rate over a window twice as long.
        recent = features[f'{sensor}_missing_rate_{window_days}d']
        older = (
            df[sensor]
            .isna()
            .rolling(window=window_days * 2, min_periods=1)
            .mean()
        )
        features[f'{sensor}_missing_trend'] = recent - older
return pd.DataFrame(features, index=df.index)
The missingness trend feature --- is the sensor dropout rate accelerating? --- is often the single strongest predictor in the model. Equipment does not fail suddenly; it degrades progressively. And that progressive degradation shows up as a progressive increase in missing sensor readings long before any sensor that is still functioning shows abnormal values.
This is the insight that cost the team in our opening war story six months of suboptimal predictions: you do not clean missing data in a predictive maintenance dataset. You feature-engineer it.
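On synthetic data the trend feature behaves exactly as described: a sensor whose dropout rate is accelerating gets a positive trend. A self-contained sketch of the same rolling logic, using a hypothetical vibration column:

```python
import numpy as np
import pandas as pd

# Synthetic sensor: healthy for 14 readings, then increasingly dropping out.
readings = [1.0] * 14 + [1.0, np.nan, 1.0, np.nan, np.nan, 1.0, np.nan]
df = pd.DataFrame({'vibration': readings})

is_missing = df['vibration'].isna()
recent = is_missing.rolling(window=7, min_periods=1).mean()    # short window
older = is_missing.rolling(window=14, min_periods=1).mean()    # long window
trend = recent - older

# Early on, nothing is missing, so the trend is zero; at the end the
# short-window rate exceeds the long-window rate, so the trend is positive.
print(trend.iloc[0], trend.iloc[-1])
```

A healthy sensor hovers around zero trend; a degrading one drifts positive well before the equipment fails.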
The Decision Framework: Drop, Impute, or Indicator?
You now have the tools. What you need is a decision process. Here is a practical framework for handling missing data in a predictive modeling context.
Step 1: Profile the Missingness
def missing_data_report(df, target=None):
"""
Generate a comprehensive missing data report.
"""
report = pd.DataFrame({
'feature': df.columns,
'dtype': df.dtypes.values,
'n_missing': df.isnull().sum().values,
'pct_missing': (df.isnull().mean() * 100).round(2).values,
'n_unique': df.nunique().values,
})
if target is not None and target in df.columns:
# Compute target rate for present vs. missing rows
target_rates = []
for col in df.columns:
if col == target:
target_rates.append(np.nan)
continue
mask = df[col].isna()
if mask.sum() > 0 and (~mask).sum() > 0:
rate_missing = df.loc[mask, target].mean()
rate_present = df.loc[~mask, target].mean()
target_rates.append(rate_missing - rate_present)
else:
target_rates.append(np.nan)
report['target_rate_diff'] = target_rates
report = report.sort_values('pct_missing', ascending=False)
print(report.to_string(index=False))
return report
Step 2: Classify the Mechanism
For each feature with significant missingness (>1%), ask three questions:
1. Is the missingness rate the same across all values of the target? If churned and retained subscribers have the same missingness rate, it might be MCAR. If churned subscribers have much higher missingness, the mechanism is MAR or MNAR.
2. Is the missingness rate correlated with other features? If missingness in total_hours_last_7d is strongly correlated with plan_type, the mechanism is at least MAR.
3. Could the reason for missingness be related to the missing value itself? This requires domain knowledge, not statistics. If a sensor reading is missing because the sensor broke due to extreme conditions, that is MNAR. If a user's usage data is missing because they did not use the product, that is MNAR.
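The first two questions can be answered with a few lines of pandas. A sketch on synthetic data (column names are illustrative; here missingness is deliberately tied to the target, so the first check should flag it):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
churned = rng.integers(0, 2, n)
hours = rng.exponential(10, n)
# Simulate target-dependent missingness: churned users lose usage data
# far more often (40% vs 5%).
mask = rng.random(n) < np.where(churned == 1, 0.4, 0.05)
hours = np.where(mask, np.nan, hours)

df = pd.DataFrame({'total_hours_last_7d': hours, 'churned': churned})

# Q1: missingness rate by target value
rate_by_target = df['total_hours_last_7d'].isna().groupby(df['churned']).mean()
print(rate_by_target)

# Q2: correlation between the missingness flag and another feature
# (tenure is generated independently here, so the correlation is near zero)
df['tenure_months'] = rng.integers(1, 60, n)
corr = df['total_hours_last_7d'].isna().astype(int).corr(df['tenure_months'])
print(f"corr with tenure: {corr:.3f}")
```

A large gap in the Q1 rates rules out MCAR; strong Q2 correlations point to MAR. Only question 3 requires leaving the keyboard.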
Step 3: Choose Your Strategy
| Missingness Rate | Mechanism | Recommended Strategy |
|---|---|---|
| < 1% | Any | Drop rows. The data loss is negligible. |
| 1-5% | MCAR | Simple imputation (median for numeric, mode for categorical). |
| 1-5% | MAR | Simple imputation + missing indicator. |
| 5-20% | MCAR | Median imputation. KNN or iterative if you want marginal improvement. |
| 5-20% | MAR | KNN or iterative imputation + missing indicator. |
| 5-20% | MNAR | Missing indicator (critical). Impute with a domain-appropriate constant (e.g., 0 for usage). |
| 20-50% | Any | Missing indicator (critical). Consider whether the feature should be imputed at all, or used only via its indicator. |
| > 50% | Any | Drop the feature OR use only the missing indicator. The imputed values are more fiction than data. |
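One way to make the table operational is a small helper, so the decision is explicit and testable. A sketch (the strategy strings are this chapter's vocabulary, not a library API):

```python
def choose_strategy(pct_missing, mechanism):
    """Return the recommended strategy from the decision table.

    pct_missing: percentage in [0, 100]; mechanism: 'MCAR', 'MAR', 'MNAR'.
    """
    if pct_missing < 1:
        return 'drop rows'
    if pct_missing > 50:
        return 'drop feature or indicator only'
    if pct_missing >= 20:
        return 'indicator (critical); impute cautiously'
    if pct_missing <= 5:
        if mechanism == 'MCAR':
            return 'simple imputation'
        return 'simple imputation + indicator'
    # 5-20% band
    if mechanism == 'MCAR':
        return 'median imputation (KNN/iterative optional)'
    if mechanism == 'MAR':
        return 'KNN/iterative imputation + indicator'
    return 'indicator (critical) + domain constant'

print(choose_strategy(12, 'MNAR'))
```

Encoding the table this way also makes the policy reviewable in a pull request rather than buried in an analyst's head.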
Step 4: Special Case --- MNAR with Domain Knowledge
When you know the mechanism behind MNAR data, you can often do better than generic imputation. Here is a practical example:
def handle_mnar_usage_data(df):
"""
Handle MNAR missing usage data in a SaaS context.
Domain knowledge: missing usage data often means no usage.
- If total_hours_last_7d is NaN and last_login > 7 days ago,
the user did not use the product. Impute with 0.
- If total_hours_last_7d is NaN and last_login < 7 days ago,
it is likely a tracking failure. Use median imputation.
"""
df = df.copy()
# Create indicator first (before any imputation)
df['total_hours_last_7d_missing'] = df['total_hours_last_7d'].isna().astype(int)
# MNAR: user did not use the product
no_login_mask = (
df['total_hours_last_7d'].isna() &
(df['days_since_last_login'] > 7)
)
df.loc[no_login_mask, 'total_hours_last_7d'] = 0.0
    # Remaining missing: likely MCAR (tracking failure).
    # Note: this median is computed after the zero-fill, so the true-zero
    # users pull it down; compute it before the zero-fill if you want the
    # median of originally observed values only.
    remaining_missing = df['total_hours_last_7d'].isna()
    median_val = df.loc[~remaining_missing, 'total_hours_last_7d'].median()
    df.loc[remaining_missing, 'total_hours_last_7d'] = median_val
print(f"MNAR imputed to 0: {no_login_mask.sum()} rows")
print(f"MCAR imputed to median ({median_val:.1f}): {remaining_missing.sum()} rows")
print(f"Missing indicator added: total_hours_last_7d_missing")
return df
This is where domain knowledge meets data science. A generic imputation algorithm does not know that a SaaS user with no login in 7 days and missing usage data almost certainly had zero usage. A practitioner does.
Putting It All Together: The Imputation Pipeline
In production, you need a single, reproducible pipeline that handles all missing data consistently between training and serving. Here is the pattern.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import pandas as pd
class MissingIndicatorTransformer(BaseEstimator, TransformerMixin):
    """
    Custom transformer that adds missing indicator columns.
    Compatible with scikit-learn pipelines (BaseEstimator supplies
    get_params/set_params, so the transformer can be cloned by
    cross_val_score and friends). Expects a DataFrame, so tracked
    features can be located by name.
    """
    def __init__(self, features_to_track):
        self.features_to_track = features_to_track
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            raise TypeError(
                "MissingIndicatorTransformer requires a DataFrame: a bare "
                "NumPy array loses the column names needed to locate the "
                "tracked features."
            )
        X_out = X.copy()
        for feat in self.features_to_track:
            if feat in X_out.columns:
                X_out[f'{feat}_missing'] = X_out[feat].isna().astype(int)
        return X_out
    def get_feature_names_out(self, input_features=None):
        base = list(input_features) if input_features is not None else []
        indicators = [f'{f}_missing' for f in self.features_to_track]
        return base + indicators
def build_imputation_pipeline(numeric_features, categorical_features,
                              features_to_track):
    """
    Build a complete imputation pipeline for production use.
    Steps:
    1. Add missing indicators for specified features (BEFORE any imputation)
    2. Impute numeric features with median, then scale
    3. Impute categorical features with mode
    4. Pass the indicator columns through untouched
    """
    numeric_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ])
    categorical_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
    ])
    indicator_cols = [f'{f}_missing' for f in features_to_track]
    preprocessor = ColumnTransformer([
        ('numeric', numeric_pipeline, numeric_features),
        ('categorical', categorical_pipeline, categorical_features),
        ('indicators', 'passthrough', indicator_cols),
    ])
    # The indicator step runs first, so the imputers never see a NaN
    # that has not already been flagged.
    return Pipeline([
        ('add_indicators', MissingIndicatorTransformer(features_to_track)),
        ('preprocess', preprocessor),
    ])
Production Tip --- The missing indicator step should happen BEFORE the imputer runs, because the imputer will fill in the NaN values that the indicator needs to detect. This ordering is critical and is the most common bug in imputation pipelines. Get it wrong and your indicators are all zeros.
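The bug is easy to demonstrate on a toy frame (a minimal sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan]})

# WRONG order: imputation destroys the NaNs the indicator needs.
wrong = df.copy()
wrong['x'] = wrong['x'].fillna(wrong['x'].median())
wrong['x_missing'] = wrong['x'].isna().astype(int)   # all zeros -- useless

# RIGHT order: flag first, then impute.
right = df.copy()
right['x_missing'] = right['x'].isna().astype(int)
right['x'] = right['x'].fillna(right['x'].median())

print(wrong['x_missing'].sum(), right['x_missing'].sum())  # 0 2
```

Both frames end up fully imputed; only the second one still remembers where the gaps were.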
Little's MCAR Test: A Formal Check
If you need to formally test whether your data is MCAR (rather than relying on informal comparisons), Little's MCAR test provides a statistical answer. The null hypothesis is that the data is MCAR; a significant p-value suggests the data is NOT MCAR.
# Little's MCAR test is available in the pyampute package (Python)
# or naniar (R). Here is a simplified implementation of the core idea
from scipy import stats
def simplified_mcar_test(df, numeric_cols):
"""
Simplified test for MCAR: compare means of observed variables
across missingness patterns using chi-squared test.
This is not a full implementation of Little's test, but captures
the core idea: if data is MCAR, the observed means should not
differ significantly across missingness patterns.
"""
# Create missingness pattern indicator
missing_pattern = df[numeric_cols].isnull().astype(int)
pattern_labels = missing_pattern.apply(lambda row: ''.join(map(str, row)), axis=1)
# Compare means across patterns
results = []
    for col in numeric_cols:
        groups = df.groupby(pattern_labels)[col].mean().dropna()
if len(groups) > 1:
# One-way ANOVA: do means differ across missingness patterns?
group_data = [
df.loc[pattern_labels == pat, col].dropna().values
for pat in groups.index
if len(df.loc[pattern_labels == pat, col].dropna()) > 1
]
if len(group_data) > 1:
f_stat, p_value = stats.f_oneway(*group_data)
results.append({
'feature': col,
'f_statistic': f_stat,
'p_value': p_value,
'significant': p_value < 0.05
})
    results_df = pd.DataFrame(results)
    n_tests = len(results_df)
    n_significant = int(results_df['significant'].sum()) if n_tests else 0
    print("MCAR Test Summary")
    print(f"  Features tested: {n_tests}")
    print(f"  Significant (p < 0.05): {n_significant}")
    if n_significant > n_tests * 0.1:
        print("  Conclusion: Data is likely NOT MCAR")
    else:
        print("  Conclusion: No strong evidence against MCAR")
return results_df
Math Sidebar --- Little's full MCAR test constructs a chi-squared statistic by comparing the means and covariances of the observed data across all unique missingness patterns. Under the null hypothesis of MCAR, this statistic follows a chi-squared distribution with degrees of freedom determined by the number of patterns and variables. In practice, the test has limited power when sample sizes per pattern are small, and it cannot distinguish between MAR and MNAR. A non-significant result means "we cannot reject MCAR," not "the data is MCAR." A significant result means "the data is not MCAR" --- but it could be MAR or MNAR.
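For reference, the statistic the sidebar describes can be written down explicitly (a sketch in standard notation, not defined elsewhere in this chapter): with J distinct missingness patterns, where pattern j has n_j rows and observes k_j of the k variables, \bar{y}_j is the mean of the observed variables in pattern j, and \hat{\mu}_j, \hat{\Sigma}_j are the (EM-estimated) grand mean and covariance restricted to those variables:

```latex
d^2 \;=\; \sum_{j=1}^{J} n_j \,
    \left(\bar{y}_j - \hat{\mu}_j\right)^{\!\top}
    \hat{\Sigma}_j^{-1}
    \left(\bar{y}_j - \hat{\mu}_j\right)
\;\sim\; \chi^2\!\left(\textstyle\sum_{j=1}^{J} k_j \;-\; k\right)
\quad \text{under } H_0:\ \text{MCAR}
```

Each pattern contributes a Mahalanobis-type distance between its observed means and the overall estimates; under MCAR these distances should be no larger than chance allows.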
Common Pitfalls and How to Avoid Them
Pitfall 1: Imputing Before Splitting
# WRONG: Impute on the full dataset, then split
# This leaks information from the test set into the imputer's statistics
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X) # Computed on ALL data
X_train, X_test = train_test_split(X_imputed) # Test set influenced imputer
# RIGHT: Split first, then impute
X_train, X_test = train_test_split(X)
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train) # Fit on train only
X_test_imputed = imputer.transform(X_test) # Transform test with train stats
This is the imputation equivalent of target leakage. The median computed on the full dataset includes test-set information, which inflates your performance estimate. In practice the effect is often small (especially with large datasets), but it is a bad habit that can compound with other leakage sources.
Pitfall 2: Forgetting About New Categories of Missingness at Serving Time
Your training data has 12% missingness in total_hours_last_7d. Your model learns from this. Then in production, a data pipeline change causes 40% missingness --- a completely different distribution. Your imputer, fitted on training data, will fill in median values that reflect the training distribution, not the production reality. Your missing indicators will fire more frequently than the model has ever seen.
Production Tip --- Monitor missingness rates in production as you would monitor feature distributions. A sudden change in the missingness rate of a feature is a data quality issue that can silently degrade model performance. Add alerts for missingness rate shifts beyond your training distribution.
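A minimal version of that monitor (a sketch; the function name, threshold, and column names are illustrative):

```python
import pandas as pd

def check_missingness_drift(train_rates, live_df, tolerance=0.10):
    """Alert when a feature's live missingness rate drifts more than
    `tolerance` (absolute) from its training-time rate."""
    alerts = {}
    live_rates = live_df.isna().mean()
    for col, train_rate in train_rates.items():
        drift = abs(live_rates.get(col, 0.0) - train_rate)
        if drift > tolerance:
            alerts[col] = round(drift, 3)
    return alerts

# Training-time missingness rates (e.g. X_train.isna().mean().to_dict())
train_rates = {'total_hours_last_7d': 0.12, 'email_open_rate': 0.05}

# Simulated production batch where a pipeline change spiked missingness
live = pd.DataFrame({
    'total_hours_last_7d': [None] * 40 + [5.0] * 60,
    'email_open_rate': [0.3] * 95 + [None] * 5,
})
print(check_missingness_drift(live_df=live, train_rates=train_rates))
```

In a real deployment this check would run per batch and feed the same alerting channel as your feature-distribution monitors.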
Pitfall 3: Imputing the Target Variable
Never impute the target. If you do not know whether a subscriber churned, that subscriber cannot be part of your training data. Imputing the target is fabricating labels, and no amount of statistical sophistication makes fabricated labels acceptable.
Pitfall 4: Using Future Information in Imputation
If your imputer is fitted on features that include future information (features computed after the prediction date), the imputed values themselves become leaky. The imputer's medians and KNN neighbors reflect a future that was not available at prediction time.
Pitfall 5: Ignoring the Imputed-Value Distribution
After imputation, always check that the distribution of imputed values is reasonable. If your KNN imputer is filling in session durations of 3,500 minutes (58 hours), something is wrong --- probably a scaling issue or a neighbor selection problem.
def validate_imputation(original, imputed, feature_name):
"""
Compare distributions before and after imputation.
Flag obviously unreasonable imputed values.
"""
    observed = original[feature_name].dropna()
    was_missing = original[feature_name].isna()
    if not was_missing.any():
        print(f"Validation for: {feature_name} (nothing was imputed)")
        return
    filled_values = imputed.loc[was_missing, feature_name]
print(f"Validation for: {feature_name}")
print(f" Observed range: [{observed.min():.2f}, {observed.max():.2f}]")
print(f" Observed mean: {observed.mean():.2f}")
print(f" Imputed range: [{filled_values.min():.2f}, {filled_values.max():.2f}]")
print(f" Imputed mean: {filled_values.mean():.2f}")
# Flag values outside observed range
out_of_range = (
(filled_values < observed.min()) | (filled_values > observed.max())
).sum()
if out_of_range > 0:
print(f" WARNING: {out_of_range} imputed values outside observed range")
Progressive Project M3 (Part 2): Missing Data in StreamFlow
In Chapter 7, you encoded StreamFlow's categorical features. Now it is time to handle the missing data --- and discover that the missingness pattern itself is one of the most powerful churn signals.
Task 1: Profile the Missingness
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Load your StreamFlow dataset from Chapter 7
df = pd.read_csv('streamflow_prepared.csv')
# Separate target
X = df.drop('churned_within_30_days', axis=1)
y = df['churned_within_30_days']
# Train/test split (same random state as Chapter 7)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Profile missingness
report = missing_data_report(
pd.concat([X_train, y_train], axis=1),
target='churned_within_30_days'
)
Examine the target_rate_diff column. Features where the churn rate is significantly different for missing vs. non-missing rows are features where missingness is informative.
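Filtering the report for those features takes one line. A sketch using a toy stand-in for the report DataFrame (the real one comes out of missing_data_report above; the values here are invented):

```python
import pandas as pd

# A toy version of the report produced by missing_data_report()
report = pd.DataFrame({
    'feature': ['total_hours_last_7d', 'email_open_rate', 'devices_used'],
    'pct_missing': [12.4, 5.1, 0.3],
    'target_rate_diff': [0.21, 0.02, -0.01],
})

# Features where the churn rate differs by more than 5 points between
# missing and non-missing rows: prime candidates for indicators.
informative = report[report['target_rate_diff'].abs() > 0.05]
print(informative['feature'].tolist())
```

Whatever survives this filter should be at the top of your features_to_track list in Task 3.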
Task 2: Classify Mechanisms
For each feature with >5% missingness, write one sentence classifying the mechanism as MCAR, MAR, or MNAR, with your reasoning. Use the informal MCAR check function and the missingness correlation function from this chapter.
Task 3: Implement the Strategy
# Step 1: Add missing indicators for usage-related features
features_to_track = [
'total_hours_last_7d',
'sessions_last_7d',
'avg_session_duration',
'genre_diversity_score',
'email_open_rate',
'days_since_last_login',
]
X_train_ind = add_missing_indicators(X_train, features_to_track)
X_test_ind = add_missing_indicators(X_test, features_to_track)
# Step 2: Handle MNAR usage data with domain knowledge
# (Note: handle_mnar_usage_data computes its median internally, so
# strictly the train median should be reused on the test set; the
# difference is usually negligible, but see Pitfall 1.)
X_train_ind = handle_mnar_usage_data(X_train_ind)
X_test_ind = handle_mnar_usage_data(X_test_ind)
# Step 3: Impute remaining missing values with median (numeric)
# and mode (categorical)
numeric_cols = X_train_ind.select_dtypes(include=[np.number]).columns
categorical_cols = X_train_ind.select_dtypes(exclude=[np.number]).columns
from sklearn.impute import SimpleImputer
num_imputer = SimpleImputer(strategy='median')
X_train_ind[numeric_cols] = num_imputer.fit_transform(X_train_ind[numeric_cols])
X_test_ind[numeric_cols] = num_imputer.transform(X_test_ind[numeric_cols])
if len(categorical_cols) > 0:
cat_imputer = SimpleImputer(strategy='most_frequent')
X_train_ind[categorical_cols] = cat_imputer.fit_transform(
X_train_ind[categorical_cols]
)
X_test_ind[categorical_cols] = cat_imputer.transform(
X_test_ind[categorical_cols]
)
# Step 4: Validate
print(f"Missing values remaining in X_train: {X_train_ind.isnull().sum().sum()}")
print(f"Missing values remaining in X_test: {X_test_ind.isnull().sum().sum()}")
print(f"Total features: {X_train_ind.shape[1]}")
Task 4: Measure the Impact
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Baseline: median imputation only (no indicators, no MNAR handling)
X_baseline = X_train.copy()
baseline_imputer = SimpleImputer(strategy='median')
X_baseline_numeric = X_baseline.select_dtypes(include=[np.number])
X_baseline_numeric = pd.DataFrame(
baseline_imputer.fit_transform(X_baseline_numeric),
columns=X_baseline_numeric.columns,
index=X_baseline_numeric.index
)
pipe = Pipeline([
('scaler', StandardScaler()),
('model', GradientBoostingClassifier(n_estimators=200, random_state=42))
])
scores_baseline = cross_val_score(
pipe, X_baseline_numeric, y_train, cv=5, scoring='roc_auc'
)
# Your strategy: MNAR handling + missing indicators + median imputation
X_strategy = X_train_ind.select_dtypes(include=[np.number])
scores_strategy = cross_val_score(
pipe, X_strategy, y_train, cv=5, scoring='roc_auc'
)
print(f"Baseline (median only): AUC = {scores_baseline.mean():.4f}")
print(f"Your strategy: AUC = {scores_strategy.mean():.4f}")
print(f"Improvement: AUC = {scores_strategy.mean() - scores_baseline.mean():+.4f}")
Task 5: Feature Importance Analysis
After fitting the model, examine which missing indicators appear in the top features. Write a paragraph explaining why total_hours_last_7d_missing is a strong churn predictor. Connect this to the MNAR mechanism: the missingness is caused by the very behavior (disengagement) that predicts the target (churn).
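A synthetic illustration of the pattern you should see (a sketch, not StreamFlow data: churn here is generated mostly from the missingness itself, so the indicator must outrank the imputed feature):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 2000
hours = rng.exponential(10, n)
missing = rng.random(n) < 0.3
# Churn is driven almost entirely by whether usage data is missing.
churn = (rng.random(n) < np.where(missing, 0.7, 0.1)).astype(int)

X = pd.DataFrame({
    # Missing values median-imputed, as in Task 3
    'total_hours_last_7d': np.where(missing, np.median(hours), hours),
    'total_hours_last_7d_missing': missing.astype(int),
})
model = GradientBoostingClassifier(n_estimators=50, random_state=42)
model.fit(X, churn)

importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))
```

When the mechanism is MNAR-by-disengagement, the indicator carries the signal and the imputed column is left with scraps, which is exactly the ranking to look for in your own importance plot.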
Deliverable
Save your imputed, indicator-augmented dataset as streamflow_imputed.csv. You will use this in Chapter 9 (Feature Selection) and Chapter 10 (Reproducible Pipelines).
Chapter Summary
Missing data is not a nuisance. It is information. The decision about how to handle it --- drop, impute, or indicator --- is a modeling decision that determines what your model can learn.
The three mechanisms (MCAR, MAR, MNAR) are not academic categories. They determine whether dropping rows introduces bias, whether imputation can recover the truth, and whether the missingness pattern itself is predictive.
Simple imputation (mean, median, mode) is fast and often sufficient. Advanced imputation (KNN, MICE) preserves more structure but at greater computational cost. Missing indicators --- the binary flags that mark where values were absent --- are the most underused and highest-ROI technique in the entire missing data toolkit.
The team in our opening war story lost six months because they treated missing sensor data as noise to be cleaned. The signal they needed was in the gaps. df.dropna() deleted it.
Do not delete your signal.
Next: Chapter 9: Feature Selection --- now that you have engineered features, handled categoricals, and addressed missing data, it is time to decide which features actually matter.