Exercises: Chapter 8

Missing Data Strategies


Exercise 1: Classify the Mechanism (Conceptual)

For each of the following scenarios, classify the missing data mechanism as MCAR, MAR, or MNAR. Explain your reasoning in one to two sentences.

a) A hospital survey asks patients to rate their pain level on a 1-10 scale. Patients with the highest pain levels are less likely to complete the survey because they are in too much distress.

b) A research assistant accidentally spills coffee on a stack of completed questionnaires, destroying 15 of the 200 collected forms. The damaged forms were on top of a random pile.

c) In a customer satisfaction dataset, income is missing for 30% of respondents. Analysis shows that younger respondents and those with lower education levels are less likely to report their income, regardless of their actual income level.

d) A fitness tracking app records daily step counts. Users who are sedentary for extended periods often uninstall the app, causing their step data to disappear from the dataset.

e) In a manufacturing quality dataset, a particular measurement instrument is unavailable on weekends. All weekend readings are missing, but the production process does not change based on day of week.

f) StreamFlow's total_hours_last_30d feature is missing for subscribers who signed up less than 30 days ago. There is no 30-day history to compute.


Exercise 2: The Cost of Dropping (Applied)

Generate the following synthetic dataset and measure the consequences of df.dropna():

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

np.random.seed(42)
n = 10000

# Generate features
tenure = np.random.exponential(18, n).clip(1, 72).astype(int)
monthly_charge = np.round(np.random.uniform(10, 100, n), 2)
support_tickets = np.random.poisson(1.5, n)

# Generate usage feature with MNAR missingness
# Users with low engagement have missing usage data
true_usage = np.random.exponential(15, n)
engagement_score = true_usage / true_usage.max()
missing_prob = 0.4 * (1 - engagement_score)  # Lower usage -> higher missing rate
usage_missing = np.random.random(n) < missing_prob
usage_observed = true_usage.copy()
usage_observed[usage_missing] = np.nan

# Generate target (churn) correlated with engagement
churn_logit = (
    -1.5
    - 0.04 * tenure
    + 0.01 * monthly_charge
    + 0.3 * support_tickets
    - 0.08 * true_usage
    + np.random.normal(0, 0.5, n)
)
churned = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

df = pd.DataFrame({
    'tenure': tenure,
    'monthly_charge': monthly_charge,
    'support_tickets': support_tickets,
    'usage_hours': usage_observed,
    'churned': churned,
})

print(f"Total rows: {len(df)}")
print(f"Missing usage: {usage_missing.sum()} ({usage_missing.mean():.1%})")
print(f"Churn rate (overall): {churned.mean():.1%}")
print(f"Churn rate (usage present): {churned[~usage_missing].mean():.1%}")
print(f"Churn rate (usage missing): {churned[usage_missing].mean():.1%}")

Complete the following tasks:

a) What percentage of rows will df.dropna() remove? What is the churn rate in the remaining rows compared to the full dataset?

b) Train a GradientBoostingClassifier (200 trees, random_state=42) on three versions of the data:

  • V1: df.dropna() (listwise deletion)
  • V2: Median imputation of usage_hours
  • V3: Median imputation + a usage_hours_missing indicator

Use an 80/20 train/test split with random_state=42 and stratify=y. Report the AUC for each version.

c) For V1, compute the churn rate in the dropped rows. Compare this to the churn rate in the training data. Explain why this discrepancy would cause problems in production.

d) For V3, after training the model, extract the feature importance for usage_hours_missing. Where does it rank among the features?
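A sketch of the V3 construction for part (b): the small frame below is a stand-in for the df generated above (not the full comparison), and the piece to carry over is the indicator logic, with the median computed on the training split only so no test information leaks into the imputation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
usage = rng.exponential(15, n)
df = pd.DataFrame({
    'tenure': rng.integers(1, 72, n),
    'usage_hours': np.where(rng.random(n) < 0.3, np.nan, usage),
    'churned': (usage < 8).astype(int),  # toy target tied to true usage
})

X = df.drop(columns='churned')
y = df['churned']
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_tr, X_te = X_tr.copy(), X_te.copy()

# V3: median imputation plus an explicit missingness indicator.
# The median comes from the TRAINING split only, to avoid leakage.
med = X_tr['usage_hours'].median()
for part in (X_tr, X_te):
    part['usage_hours_missing'] = part['usage_hours'].isna().astype(int)
    part['usage_hours'] = part['usage_hours'].fillna(med)

model = GradientBoostingClassifier(n_estimators=200, random_state=42)
model.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"V3 AUC: {auc:.3f}")
```

V1 and V2 follow the same split-then-transform pattern; for V1 drop the NaN rows instead of imputing, and for V2 omit the indicator column.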


Exercise 3: Missingness Heatmap Interpretation (Applied)

Run the following code and answer the questions below:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(42)
n = 5000

# Simulate a dataset with structured missingness patterns
df = pd.DataFrame({
    'age': np.random.normal(45, 15, n).clip(18, 90),
    'income': np.random.lognormal(10.5, 0.8, n),
    'credit_score': np.random.normal(680, 80, n).clip(300, 850),
    'loan_amount': np.random.lognormal(10, 1, n),
    'employment_years': np.random.exponential(8, n).clip(0, 40),
    'debt_to_income': np.random.uniform(0.05, 0.65, n),
    'num_accounts': np.random.poisson(5, n),
    'recent_inquiries': np.random.poisson(1, n),
})

# Block missingness: income and employment_years from same source
source_missing = np.random.random(n) < 0.15
df.loc[source_missing, 'income'] = np.nan
df.loc[source_missing, 'employment_years'] = np.nan

# MAR: credit score missing for younger applicants
young_mask = df['age'] < 25
df.loc[young_mask & (np.random.random(n) < 0.40), 'credit_score'] = np.nan

# MNAR: high debt_to_income applicants less likely to report
high_dti = df['debt_to_income'] > 0.45
df.loc[high_dti & (np.random.random(n) < 0.35), 'debt_to_income'] = np.nan

# Random: recent_inquiries (pure MCAR)
df.loc[np.random.random(n) < 0.05, 'recent_inquiries'] = np.nan

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Missingness matrix
missing_sorted = df.isnull().mean().sort_values(ascending=False)
cols = missing_sorted.index.tolist()
sample = df[cols].iloc[np.random.choice(n, 500, replace=False)]
axes[0].imshow(sample.isnull().values.astype(int), aspect='auto',
               cmap='Greys', interpolation='none')
axes[0].set_xticks(range(len(cols)))
axes[0].set_xticklabels(cols, rotation=45, ha='right')
axes[0].set_title('Missingness Pattern')

# Missingness correlation
sns.heatmap(df[cols].isnull().corr(), ax=axes[1], annot=True,
            fmt='.2f', cmap='RdBu_r', center=0)
axes[1].set_title('Missingness Correlation')

plt.tight_layout()
plt.show()

a) Which two features exhibit block missingness (they tend to go missing together)? What does the high correlation in the missingness correlation matrix tell you about their data source?

b) Which feature's missingness is most likely MCAR? How can you tell from the correlation matrix?

c) The credit_score missingness is designed to be MAR (conditional on age). Describe how you would verify this from the data without knowing the data-generating process.

d) The debt_to_income missingness is designed to be MNAR. Explain why this is impossible to distinguish from MAR using only the observed data.
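One way to probe part (c) without knowing the data-generating process: compare the distribution of each observed feature between rows where credit_score is missing and rows where it is present. The frame below is a tiny stand-in for the loan data that reproduces the age-dependent mechanism, just to show the mechanics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({'age': rng.normal(45, 15, n).clip(18, 90)})
score = rng.normal(680, 80, n)
# missingness depends on an OBSERVED feature (age) -> MAR
score[(df['age'] < 25) & (rng.random(n) < 0.4)] = np.nan
df['credit_score'] = score

miss = df['credit_score'].isna()
# If missingness is MAR given age, the group means should differ clearly
print(df.loc[miss, 'age'].mean(), df.loc[~miss, 'age'].mean())
```

A sharper version of the same idea regresses the missingness indicator on all observed features and inspects which coefficients are significant.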


Exercise 4: Imputation Method Comparison (Coding)

Using the loan dataset from Exercise 3, compare imputation methods on a downstream prediction task:

from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Create a binary target: high-risk loan (debt_to_income > median)
# Use the TRUE debt_to_income (before we made it missing) for the target
# In practice you would not have this --- this is for evaluation only

a) Create a binary target variable. Split into train/test with random_state=42.

b) Build pipelines for four imputation strategies: (1) listwise deletion, (2) median imputation, (3) KNN imputation (k=5), and (4) iterative imputation. For each, train a GradientBoostingClassifier with 200 trees and evaluate using 5-fold cross-validated AUC.

c) Add missing indicators for all features with >5% missingness and repeat the comparison. Does the ranking of imputation methods change?

d) Time each pipeline's fit step using %%timeit or time.time(). Create a table showing AUC vs. computation time. Is the most accurate method worth the additional compute?
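A sketch of strategy (2), median imputation, wired into a Pipeline so the imputer is re-fit inside each CV fold and no fold-to-fold leakage occurs. The toy frame and target stand in for the Exercise 3 data; the KNN and iterative variants just swap the imputer step.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
n = 1000
dti = rng.uniform(0.05, 0.65, n)
X = pd.DataFrame({
    'income': np.where(rng.random(n) < 0.15, np.nan,
                       rng.lognormal(10.5, 0.8, n)),
    'credit_score': rng.normal(680, 80, n),
})
y = (dti > np.median(dti)).astype(int)  # target from the TRUE debt_to_income

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # fit per fold -> no leakage
    ('model', GradientBoostingClassifier(n_estimators=200, random_state=42)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
print(f"median imputation, 5-fold AUC: {scores.mean():.3f}")
```

For strategy (3) replace the imputer with KNNImputer(n_neighbors=5), and for (4) with IterativeImputer (after the enable_iterative_imputer import shown above); listwise deletion cannot live inside a Pipeline, so drop the NaN rows before splitting.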


Exercise 5: MNAR in the Wild (Conceptual)

For each of the following real-world features, explain why the missing data is likely MNAR and propose a domain-aware imputation strategy:

a) Employee salary in an HR dataset. Senior employees are less likely to report salary in internal surveys.

b) Blood pressure in an outpatient clinic dataset. Patients whose blood pressure was not measured were those who appeared healthy and did not need the measurement.

c) Product review score in an e-commerce dataset. Only customers who felt strongly (positively or negatively) bother to leave reviews.

d) Exam score in an educational dataset. Students who did not take the exam often dropped the course due to poor performance.

e) Sensor temperature in an industrial IoT dataset. The sensor stops transmitting when the temperature exceeds its 150C operating limit.


Exercise 6: Building a Missing Data Report (Coding)

Write a function comprehensive_missing_report(df, target_col) that produces the following output for each feature:

  1. Number and percentage of missing values
  2. Data type
  3. Whether missingness correlates with the target (compare target mean for present vs. missing rows)
  4. Whether missingness correlates with any other feature (top 3 correlations)
  5. A classification recommendation (MCAR, MAR, or "needs domain review")

Test your function on both the StreamFlow dataset (from the progressive project) and the loan dataset (from Exercise 3).

def comprehensive_missing_report(df, target_col):
    """
    Generate a comprehensive missing data report with mechanism
    classification recommendations.

    Parameters
    ----------
    df : pd.DataFrame
    target_col : str, name of the target variable

    Returns
    -------
    pd.DataFrame with one row per feature
    """
    # Your implementation here
    pass
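A possible starting point for item 3 of the report: compare the target mean between rows where a feature is present and rows where it is missing. A large gap suggests the missingness is informative (MAR or MNAR) rather than MCAR. The helper name is hypothetical; the toy frame just demonstrates the calculation.

```python
import numpy as np
import pandas as pd

def target_mean_by_missingness(df, feature, target_col):
    """Return (target mean where feature present, target mean where missing)."""
    miss = df[feature].isna()
    return (float(df.loc[~miss, target_col].mean()),
            float(df.loc[miss, target_col].mean()))

# toy check: target is 1 exactly where x is missing
df = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan], 'y': [0, 1, 0, 1]})
print(target_mean_by_missingness(df, 'x', 'y'))  # -> (0.0, 1.0)
```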

Exercise 7: The Imputation Pipeline (Coding)

Build a scikit-learn-compatible imputation pipeline that:

  1. Accepts a DataFrame with mixed types (numeric and categorical)
  2. Adds missing indicators for features with >5% missingness
  3. Imputes numeric features with median
  4. Imputes categorical features with mode
  5. Is compatible with cross_val_score (i.e., no data leakage between folds)

Use ColumnTransformer and Pipeline. Test your pipeline on the StreamFlow dataset and verify that:

a) No missing values remain after transformation

b) The missing indicator columns are correctly computed

c) The pipeline can be serialized with joblib.dump and deserialized with joblib.load

d) The deserialized pipeline produces identical results on the test set

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import joblib

# Your implementation here

Exercise 8: When to Walk Away (Conceptual)

You are a data scientist at a health insurance company. You are building a model to predict which members are at risk of developing Type 2 diabetes within the next 5 years. Your dataset has the following missingness profile:

  Feature              Missing Rate   Notes
  age                   0%            Always available from enrollment
  bmi                  62%            Only measured at annual physical (many members skip)
  fasting_glucose      58%            Only measured if doctor orders blood work
  family_history       45%            Self-reported on intake form
  exercise_frequency   71%            From optional wellness survey
  blood_pressure       55%            Only at in-person visits
  smoking_status       38%            Self-reported, often outdated
  hba1c                75%            Only measured if diabetes suspected

a) Which features would you drop entirely? Justify your decision.

b) Which features would you keep but use only through their missing indicator? Why?

c) For the features you decide to impute, which imputation method would you use and why?

d) The hba1c feature (hemoglobin A1c, a diabetes marker) is 75% missing. It is also the single most predictive feature for diabetes when it is present. A colleague suggests imputing the missing values using KNN imputation. Explain why this is a bad idea. (Hint: why was it measured for the 25% who have it?)

e) Propose a feature engineering strategy that captures the informative missingness of hba1c without imputing the value itself.


Exercise 9: Imputation and Fairness (Applied)

Consider a credit scoring dataset where income is missing for 25% of applicants. Investigation reveals:

  • Income is missing for 35% of minority applicants vs. 18% of non-minority applicants
  • Among applicants with reported income, minority applicants have lower average income
  • The median imputation fills missing income values with the overall median ($52,000)

a) Explain how median imputation introduces bias against minority applicants. (Hint: what happens to the income distribution for minority applicants after imputation?)

b) Propose an imputation strategy that mitigates this bias.

c) Should you use a missing indicator for income in this context? What are the fairness implications of including an indicator that is correlated with protected group membership?

d) A colleague suggests group-conditional imputation: impute missing income separately for minority and non-minority applicants using each group's median. Is this approach fair? What are its advantages and risks?


Exercise 10: The Full Pipeline (Synthesis)

Return to your StreamFlow progressive project dataset. Implement the complete missing data strategy described in this chapter:

  1. Profile the missingness for all features
  2. Classify the mechanism for each feature with >5% missingness
  3. Add missing indicators for usage-related features
  4. Implement domain-aware MNAR imputation for usage data
  5. Apply median imputation for remaining numeric features
  6. Apply mode imputation for categorical features
  7. Validate that no missing values remain
  8. Compare model performance (AUC) before and after your strategy

Document your decisions in a markdown cell in your Jupyter notebook. For each feature, record: the missingness rate, the classified mechanism, the chosen strategy, and your reasoning.

Save the final imputed dataset as streamflow_imputed.csv for use in Chapter 9.
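Step 1 reduces to a one-line profile plus a threshold check; the frame below is a stand-in for the actual StreamFlow data, which you would load from CSV instead.

```python
import numpy as np
import pandas as pd

# stand-in frame; the real run loads the StreamFlow dataset here
df = pd.DataFrame({
    'total_hours_last_30d': [12.0, np.nan, 3.5, np.nan],
    'plan_type': ['basic', 'pro', np.nan, 'basic'],
    'churned': [0, 1, 0, 1],
})
profile = (df.drop(columns='churned')
             .isnull().mean().sort_values(ascending=False))
print(profile)
# step 2 classifies the mechanism only for features above the 5% threshold
flagged = profile[profile > 0.05].index.tolist()
print(flagged)
```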


Challenge Exercise: Multiple Imputation for Inference (Advanced)

The imputation methods in this chapter produce a single imputed dataset. But for statistical inference (confidence intervals, hypothesis tests), a single imputation understates the uncertainty by treating imputed values as if they were known. Multiple imputation generates M imputed datasets, runs the analysis on each, and pools the results.

Implement a simplified multiple imputation workflow:

def multiple_imputation_analysis(X, y, n_imputations=5, random_state=42):
    """
    Run multiple imputation and pool the results.

    Steps:
    1. Create n_imputations imputed datasets (using IterativeImputer
       with different random seeds)
    2. Fit a logistic regression on each
    3. Pool the coefficients using Rubin's rules:
       - Pooled coefficient = mean of coefficients across imputations
       - Total variance = within-imputation variance
         + (1 + 1/M) * between-imputation variance
       - Standard error = sqrt(total variance)
    """
    # Your implementation here
    pass
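The pooling in step 3 can be sketched independently of the imputation loop. pool_rubin below is a hypothetical helper that takes the per-imputation coefficient estimates and their estimated variances (both shape (M, p)) and applies Rubin's rules, including the (1 + 1/M) correction on the between-imputation variance.

```python
import numpy as np

def pool_rubin(coefs, variances):
    """Pool M sets of coefficients via Rubin's rules.

    coefs, variances : arrays of shape (M, p).
    Returns (pooled coefficients, pooled standard errors).
    """
    M = coefs.shape[0]
    pooled = coefs.mean(axis=0)             # pooled point estimate
    within = variances.mean(axis=0)         # W: mean within-imputation variance
    between = coefs.var(axis=0, ddof=1)     # B: between-imputation variance
    total = within + (1 + 1 / M) * between  # Rubin's total variance
    return pooled, np.sqrt(total)

# toy input: three imputations of a single coefficient
coefs = np.array([[0.50], [0.55], [0.45]])
variances = np.array([[0.01], [0.01], [0.01]])
est, se = pool_rubin(coefs, variances)
print(est, se)
```

Note how the pooled standard error exceeds any single imputation's sqrt(0.01) = 0.1: the disagreement between imputations adds honest uncertainty.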

a) Run the analysis with M=5 and M=20 imputations. How do the pooled standard errors compare?

b) Compare the confidence intervals from multiple imputation to those from a single median imputation. Which are wider? Why is this more honest?

c) In what situations would multiple imputation matter most for practical decision-making? When can you get away with single imputation?


These exercises support Chapter 8: Missing Data Strategies. Return to the chapter for reference.