Quiz: Chapter 8
Missing Data Strategies
Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.
Question 1 (Multiple Choice)
A dataset has 10,000 rows and 20 features. Three features each have 15% missing values, and the missingness patterns are independent. If you run df.dropna(), approximately what percentage of rows will remain?
- A) 55%
- B) 61%
- C) 70%
- D) 85%
Answer: B) 61%. With three independent features each 15% missing, the probability of a row being complete across all three is (0.85)^3 = 0.614, or about 61%. This demonstrates why df.dropna() is so destructive --- even moderate missingness rates across multiple features compound to eliminate a large fraction of your data.
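The compounding effect is easy to verify empirically. Below is a quick simulation of the scenario in the question; the column names (f1, f2, f3) and the random seed are illustrative, not from the chapter:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Three features; each is independently missing in 15% of rows.
df = pd.DataFrame({f"f{i}": rng.normal(size=n) for i in range(1, 4)})
for col in df.columns:
    df.loc[rng.random(n) < 0.15, col] = np.nan

retained = len(df.dropna()) / n
print(f"rows remaining after dropna: {retained:.1%}")  # close to 0.85**3, about 61%
```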
Question 2 (Multiple Choice)
A SaaS company's total_hours_last_7d feature is missing for users who did not log in during the past 7 days. The missing value would have been 0 (zero hours of usage). This is an example of:
- A) Missing Completely at Random (MCAR)
- B) Missing at Random (MAR)
- C) Missing Not at Random (MNAR)
- D) Not missing --- the value should be coded as 0
Answer: C) Missing Not at Random (MNAR). The probability of the value being missing depends on the value itself (zero usage leads to no logged events, which leads to a NULL in the database). Option D is tempting and represents the correct imputation strategy, but the question asks about the mechanism --- and the mechanism is MNAR because the value's absence is caused by what the value would have been.
Question 3 (Short Answer)
Explain why mean imputation shrinks the variance of the imputed feature. How does this affect a downstream model?
Answer: Mean imputation replaces every missing value with the same constant (the feature mean). These imputed values cluster at the center of the distribution, adding data points with zero deviation from the mean. This reduces the overall variance because the spread of the data is artificially compressed. For a downstream model, this means the feature appears to have less predictive spread than it actually does, correlations with other features are weakened (because the imputed rows have the mean regardless of their other feature values), and the model may underweight the feature relative to its true importance.
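The variance shrinkage can be demonstrated directly with a small NumPy experiment; the distribution parameters and 30% missingness rate below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=50, scale=10, size=1_000)

# Knock out 30% of the values completely at random.
mask = rng.random(x.size) < 0.30
observed = x[~mask]

# Mean-impute: every missing entry becomes the observed mean.
imputed = x.copy()
imputed[mask] = observed.mean()

print(f"std of observed values:    {observed.std():.2f}")
print(f"std after mean imputation: {imputed.std():.2f}")  # noticeably smaller
```

With 30% of entries replaced by a constant, the standard deviation drops by roughly a factor of sqrt(0.7), matching the compression described above.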
Question 4 (Multiple Choice)
You add a missing indicator feature (usage_missing = 1 if usage is NaN, else 0) alongside the imputed usage feature. After training a gradient boosted model, the missing indicator ranks as the 3rd most important feature. What does this tell you?
- A) The imputation was performed incorrectly
- B) The missingness pattern is informative --- knowing that usage is missing is predictive of the target
- C) The model is overfitting to the missing indicator
- D) You should remove the missing indicator because it is redundant with the imputed feature
Answer: B) The missingness pattern is informative --- knowing that usage is missing is predictive of the target. A high-importance missing indicator means the model has discovered that the fact of missingness (independent of the imputed value) contains signal about the target variable. This is a strong indication that the missing data mechanism is MAR or MNAR, and that the missingness itself is a valuable feature.
Question 5 (Multiple Choice)
Which of the following imputation methods is most appropriate when the missing data mechanism is MNAR?
- A) Mean imputation
- B) KNN imputation
- C) Iterative imputation (MICE)
- D) A missing indicator combined with domain-specific imputation
Answer: D) A missing indicator combined with domain-specific imputation. Under MNAR, the probability of missingness depends on the unobserved value itself, so no statistical imputation method (A, B, or C) can recover the true value from observed data alone. The missing indicator captures the informative missingness, and domain knowledge provides the appropriate fill value (e.g., imputing 0 for a usage feature when the user did not use the product).
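A minimal sketch of this combined strategy, reusing the SaaS usage feature from Question 2 (the indicator column name and toy values are illustrative):

```python
import numpy as np
import pandas as pd

# NaN here means the user never logged in during the window,
# so the domain-correct fill value is 0 hours.
df = pd.DataFrame({"total_hours_last_7d": [3.5, np.nan, 12.0, np.nan, 0.5]})

# 1) Capture the informative missingness as a binary feature.
df["usage_missing"] = df["total_hours_last_7d"].isna().astype(int)
# 2) Apply the domain-specific fill value.
df["total_hours_last_7d"] = df["total_hours_last_7d"].fillna(0.0)

print(df)
```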
Question 6 (Short Answer)
A colleague says: "I ran df.dropna() and my dataset went from 50,000 rows to 35,000 rows. That is only 30% loss, so it is fine." Explain why this reasoning may be flawed even if the 30% loss seems modest.
Answer: The 30% data loss is only the first problem. The more serious issue is whether the dropped rows are representative of the full dataset. If the missingness is not MCAR, the remaining 35,000 rows are a biased sample --- they may overrepresent certain subgroups (e.g., more engaged users, higher-income customers, healthier patients). The model trained on this biased sample will learn patterns that describe the retained population, not the full population, and will perform poorly on the very observations it was designed to predict. Additionally, the dropped rows may disproportionately contain the minority class (e.g., churners, fraud cases), exacerbating class imbalance in the training data.
Question 7 (Multiple Choice)
You fit a SimpleImputer(strategy='median') on the full dataset before splitting into train and test sets. What problem does this create?
- A) The imputed values will be less accurate
- B) Data leakage: the imputer's median includes information from the test set
- C) The imputer will fail on new data at serving time
- D) The model will underfit because the imputation is too simple
Answer: B) Data leakage: the imputer's median includes information from the test set. The median computed on the full dataset includes test set observations, meaning the imputed training values are influenced by data the model should not have access to during training. This inflates the apparent performance on the test set. The correct approach is to fit the imputer on the training set only and use transform (not fit_transform) on the test set.
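The leak-free pattern looks like this in scikit-learn (the data here is synthetic, for illustration only):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.2] = np.nan  # inject 20% missingness

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Fit on the training split only: the medians are computed without
# ever seeing the test set.
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)  # transform, NOT fit_transform
```

The same rule applies at serving time: production requests are imputed with the training-set medians stored in the fitted imputer, never refit on live data.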
Question 8 (Multiple Choice)
In the TurbineTech predictive maintenance case, sensor vibration readings become progressively more missing as a bearing degrades. This is because:
- A) The sensor has a known intermittent hardware bug (MCAR)
- B) The data pipeline drops readings above a certain vibration threshold (MAR)
- C) The physical vibration damages the sensor itself, causing it to fail (MNAR)
- D) The maintenance team manually removes sensors before failures (not a data problem)
Answer: C) The physical vibration damages the sensor itself, causing it to fail (MNAR). The sensor dropout is caused by the very condition (excessive vibration from bearing degradation) that the sensor is trying to measure. The missing value would have been an extremely high vibration reading, and the missingness is directly caused by the magnitude of that value. This is the most dangerous form of MNAR because dropping these rows removes the most predictive observations from the dataset.
Question 9 (Short Answer)
Explain the difference between KNN imputation and iterative imputation (MICE). When would you choose one over the other?
Answer: KNN imputation finds the K most similar rows (by Euclidean distance on non-missing features) and fills missing values with the mean or weighted mean of those neighbors' values. Iterative imputation (MICE) treats each feature with missing values as a prediction target, fits a regression model using all other features, and iterates until convergence. Choose KNN when you have a small to medium dataset and believe that similar observations should have similar values (local structure matters). Choose MICE when you have complex, nonlinear relationships between features that a regression model can capture, or when you need the most accurate imputation for downstream inference. In practice, MICE is usually more accurate but significantly slower, while KNN is a good middle ground between simple imputation and MICE.
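Both methods are available in scikit-learn; note that IterativeImputer is still marked experimental and must be enabled explicitly. The correlated toy data below is synthetic, chosen so the imputers have real structure to exploit:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

rng = np.random.default_rng(1)
# Two strongly correlated features, so neighbors and regressions
# both have signal to work with.
a = rng.normal(size=300)
X = np.column_stack([a, 2 * a + rng.normal(scale=0.1, size=300)])
X[rng.random(X.shape) < 0.15] = np.nan

X_knn = KNNImputer(n_neighbors=5).fit_transform(X)
X_mice = IterativeImputer(random_state=0).fit_transform(X)
```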
Question 10 (Multiple Choice)
A feature has 65% missing values. Which strategy is most appropriate?
- A) KNN imputation to fill in the missing 65%
- B) Drop the feature entirely
- C) Use only the missing indicator (binary: was this feature observed?) and drop the raw values
- D) Both B and C are reasonable; the choice depends on whether the missingness is informative
Answer: D) Both B and C are reasonable; the choice depends on whether the missingness is informative. When 65% of values are missing, any imputed values are heavily influenced by the imputation model rather than the actual data, making them unreliable (ruling out A as the primary strategy). If the missingness is uninformative, dropping the feature (B) is appropriate. If the missingness itself is predictive of the target (as with NPS scores in the StreamFlow example), the missing indicator alone (C) can be more valuable than the imputed values.
Question 11 (Multiple Choice)
Little's MCAR test returns a p-value of 0.003. What can you conclude?
- A) The data is MCAR
- B) The data is MAR
- C) The data is MNAR
- D) The data is not MCAR; it could be MAR or MNAR
Answer: D) The data is not MCAR; it could be MAR or MNAR. A significant p-value (< 0.05) rejects the null hypothesis that the data is MCAR, meaning the missingness is related to observed or unobserved variables. However, the test cannot distinguish between MAR and MNAR --- both would produce a significant result. Additional domain knowledge is needed to determine whether the missingness depends on observed variables (MAR) or unobserved values (MNAR).
Question 12 (Short Answer)
You are building a model to predict hospital readmission. The feature lives_alone is missing for 40% of patients. The data team says: "We should impute it --- living alone is an important risk factor." The clinical team says: "The missingness IS the signal --- patients whose social situation is unknown did not receive a proper social work assessment, and that itself predicts readmission." Who is right, and what would you do?
Answer: Both are partially right, and the best approach combines their insights. The clinical team's observation is critical: the missingness likely reflects a process failure (incomplete social work assessment), and patients who do not receive that assessment may indeed face higher readmission risk. The data team is correct that the variable's observed values are also predictive. The solution: add a missing indicator (lives_alone_missing) to capture the informative missingness, AND impute the raw feature (e.g., with mode or a model-based method) so the model can use the observed values when they are available. This gives the model access to both signals.
Question 13 (Multiple Choice --- Select Two)
Which TWO of the following are valid reasons to add a missing indicator feature?
- A) To fill in missing values with a binary flag instead of imputation
- B) To let the model learn different relationships for present vs. absent values
- C) To ensure the imputed values are always zero
- D) To capture informative missingness that may be predictive of the target
- E) To replace the need for any imputation at all
Answer: B) and D). Missing indicators let the model distinguish between rows where the value was observed and rows where it was imputed (B), and they capture the predictive signal in the missingness pattern itself (D). Options A, C, and E are misconceptions: missing indicators do not replace imputation (A, E) --- they complement it. The imputed feature handles the value, the indicator handles the fact of missingness. And the indicator is binary (0/1 for present/missing), not a substitute for the fill value (C).
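scikit-learn can generate the indicator columns automatically: SimpleImputer's add_indicator flag appends one binary column per feature that had missing values, alongside the imputed values. A small sketch with toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [np.nan, 3.0],
              [5.0, 6.0]])

# Output has 4 columns: the 2 mean-imputed features, then one
# 0/1 indicator column for each feature that contained NaN.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out)
```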
Question 14 (Short Answer)
Explain why you should never impute the target variable. Give an example of what could go wrong if you did.
Answer: Imputing the target variable means fabricating labels for observations whose true outcome is unknown. This introduces noise at best and systematic bias at worst. For example, if you are predicting churn and 10% of subscribers have unknown churn status (they are mid-contract), imputing their churn label with the mode (0, retained) would undercount churn in the training data and train the model on false negatives. If you impute with a model, you are using a model's predictions as training labels for another model --- a circular process that amplifies whatever biases exist in the imputation model. The only correct approach is to exclude observations with missing targets from the training set.
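In pandas, the safe pattern is to drop only the rows whose label is missing, while leaving feature imputation to the pipeline. The toy churn data below is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "usage":   [3.1, np.nan, 7.4, 2.0],
    "churned": [0, 1, np.nan, 0],  # NaN: mid-contract, outcome unknown
})

# Impute features later if needed, but exclude rows with a missing label.
train = df.dropna(subset=["churned"])
print(len(train))  # 3
```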
Question 15 (Multiple Choice)
In a production ML system, the missingness rate for a key feature suddenly jumps from 8% (in training data) to 35% (in production). What should happen?
- A) Nothing --- the imputer will handle it by filling in more values
- B) The model will automatically adapt to the new missingness rate
- C) An alert should fire because this represents a data distribution shift that may degrade model performance
- D) The missing indicators will compensate for the increased missingness
Answer: C) An alert should fire because this represents a data distribution shift that may degrade model performance. A sudden change in the missingness rate is a strong signal that something has changed in the data-generating process --- a broken pipeline, a product change, a policy change. The model was trained on data with 8% missingness and may not generalize well to 35%. The imputer will mechanically fill in values, as option A notes, but those values will be the same medians/modes from training, which may no longer be appropriate. Likewise, the missing indicators will fire more often, as option D notes, but the model has never seen this frequency of indicator activation. Monitoring and alerting is the correct response.
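One way to operationalize this monitoring is a simple threshold check on the live missingness rate; the function name and the 5-percentage-point tolerance below are illustrative assumptions, not values from the chapter:

```python
def missingness_alert(train_rate: float, live_rate: float,
                      tolerance: float = 0.05) -> bool:
    """Return True when the live missingness rate drifts more than
    `tolerance` (absolute) away from the rate seen in training data."""
    return abs(live_rate - train_rate) > tolerance

print(missingness_alert(0.08, 0.35))  # True: a 27-point jump
print(missingness_alert(0.08, 0.10))  # False: within tolerance
```

In practice this check would run on each scoring batch, comparing the per-feature NaN rate against the rate recorded when the model was trained.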
This quiz supports Chapter 8: Missing Data Strategies. Return to the chapter to review concepts.