Key Takeaways: Chapter 8
Missing Data Strategies
- `df.dropna()` is not a data cleaning step. It is a modeling decision --- and usually a bad one. Dropping rows with missing values removes data, reduces statistical power, and introduces bias whenever the missingness is not completely random. The rows you drop are often the most informative rows in your dataset: the disengaged users, the failing sensors, the hard-to-reach patients.
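A minimal sketch of how this plays out, using a hypothetical churn frame (the column names are illustrative, not from the chapter's datasets). Because usage hours are missing precisely for the disengaged users, `dropna()` discards every churner:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: usage hours are missing exactly for the
# disengaged users -- the rows dropna() would silently discard.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "hours_last_7d": [12.0, np.nan, 8.5, np.nan, 3.0],
    "churned": [0, 1, 0, 1, 0],
})

dropped = df.dropna()
print(len(df), len(dropped))       # 5 rows -> 3 rows
print(df["churned"].mean())        # 0.4: true churn rate
print(dropped["churned"].mean())   # 0.0: every churner was dropped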
- Not all missingness is created equal. The mechanism determines the remedy. MCAR (missing completely at random) is harmless to drop but rare in practice. MAR (missing at random, conditional on observed data) can be corrected with imputation. MNAR (missing not at random, where the missingness depends on the missing value itself) is the most dangerous and the most informative --- and no amount of imputation from observed data will fix it.
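A quick simulation of the MNAR trap, under an assumed scenario (self-reported income where high earners decline to report). The observed rows are systematically unrepresentative, so any statistic or imputer fit on them inherits the bias:

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.normal(50_000, 10_000, 100_000)

# MNAR: the probability of being missing depends on the missing value
# itself -- here, incomes above 60k are never reported.
observed = income[income < 60_000]

# The observed mean is biased low; no imputation built from the
# observed rows alone can recover the true mean.
print(income.mean())    # ~50,000
print(observed.mean())  # materially lower (~47,000)
```

This is why the chapter calls MNAR both the most dangerous and the most informative mechanism: the bias is invisible from inside the observed data.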
- Simple imputation is a solid baseline, not a final answer. Mean and median imputation are fast, easy, and surprisingly effective for many use cases. But they shrink variance, distort correlations, and create artificial spikes in the distribution. Use them as a starting point, not an endpoint. Median is almost always preferable to mean for skewed features.
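A small sketch of median imputation with scikit-learn's `SimpleImputer`, showing why median beats mean on a skewed feature (the data is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One skewed feature: a single large value drags the mean far
# from the typical observation.
X = np.array([[1.0], [2.0], [100.0], [np.nan]])

imputer = SimpleImputer(strategy="median")
X_imp = imputer.fit_transform(X)

# Median of [1, 2, 100] is 2.0 -- robust to the outlier.
# strategy="mean" would have filled about 34.3 instead.
print(X_imp[-1, 0])  # 2.0
```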
- Advanced imputation (KNN, MICE) preserves structure at the cost of speed. KNN imputation respects local feature relationships. Iterative imputation (MICE) models the full multivariate structure of the data. Both produce more realistic imputed values than simple methods, but the marginal improvement over median imputation is often smaller than the improvement from simply keeping your data instead of dropping it.
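A sketch of both methods on synthetic correlated features, to show what "preserves structure" means in practice. Because `x2 ≈ 2 * x1`, both imputers fill the missing `x2` values using `x1`, rather than collapsing them to a single constant the way median imputation would:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)  # strongly correlated
X = np.column_stack([x1, x2])
X[:20, 1] = np.nan  # knock out some x2 values

knn = KNNImputer(n_neighbors=5).fit_transform(X)
mice = IterativeImputer(random_state=0).fit_transform(X)

# Both track the true relationship 2 * x1 instead of a flat constant.
print(np.abs(knn[:20, 1] - 2 * x1[:20]).mean())   # small
print(np.abs(mice[:20, 1] - 2 * x1[:20]).mean())  # small
```

Note that `IterativeImputer` is still behind scikit-learn's experimental flag, hence the `enable_iterative_imputer` import.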
- Missing indicators are the single highest-ROI technique for handling missing data in predictive modeling. A binary flag that says "this value was originally missing" costs nothing to compute and can be one of the most predictive features in your model. In the StreamFlow case, `total_hours_last_7d_missing` ranked as the third most important feature. In the TurbineTech case, sensor dropout indicators accounted for six of the top fifteen features. Add them. Always.
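One way to get these flags for free is `SimpleImputer(add_indicator=True)`, which appends a binary "was originally missing" column per affected feature (the data here is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[12.0], [np.nan], [8.5], [np.nan]])

# add_indicator=True appends one 0/1 column per feature that had
# missing values -- the "this value was originally missing" flag.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out.shape)  # (4, 2): imputed column + indicator column
print(X_out[:, 1])  # [0. 1. 0. 1.]
```

In pandas, the equivalent is a one-liner per column: `df["x_missing"] = df["x"].isna().astype(int)` before imputing.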
- Missingness is often the signal, not the noise. When a SaaS user's usage data is missing because they stopped using the product, the missingness is a churn signal. When a vibration sensor goes offline because the equipment is shaking itself apart, the missingness is a failure signal. Dropping these rows does not clean the data --- it destroys the answer.
- Domain knowledge is the best imputation method. No statistical algorithm can tell you that missing usage data means zero usage, or that sensor dropout correlates with equipment degradation. These insights come from understanding the data-generating process --- how events are logged, why gaps occur, and what user or system behavior produces NULLs in your database.
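A sketch of a domain-driven fill, under the assumed semantics that a missing usage value means the user generated no events (column names are hypothetical). The fill value comes from how the data is generated, not from a statistic:

```python
import numpy as np
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 2, 3],
    "hours_last_7d": [4.5, np.nan, 1.2],
})

# Domain knowledge, not statistics: in this hypothetical product, a
# user with no logged events simply did not use the product, so the
# honest fill value is 0 -- not the population median (~2.85 here).
events["hours_last_7d"] = events["hours_last_7d"].fillna(0.0)
print(events["hours_last_7d"].tolist())  # [4.5, 0.0, 1.2]
```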
- Impute after splitting, never before. Fitting an imputer on the full dataset before the train/test split leaks test-set information into the training data. The imputer must be fit on training data only and applied to both splits with `transform`; never call `fit_transform` on the test set.
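The leak-free pattern, sketched with scikit-learn on synthetic data: split first, fit the imputer on the training split only, then `transform` both splits with the statistics learned from training data alone.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% missing at random

# Split BEFORE any imputation.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit on the training split only; the test set sees only transform().
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)  # never fit_transform here
```

Wrapping the imputer in a scikit-learn `Pipeline` enforces this discipline automatically, including inside cross-validation.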
- Features with very high missingness (>50%) are rarely worth imputing. The imputed values are more fiction than data. Either drop the feature or use only its missing indicator. An exception: if the observed values are highly predictive when present, keep both the indicator and the (sparsely) imputed feature and let the model decide.
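A sketch of the drop-but-keep-the-indicator rule, using an illustrative 50% threshold on a tiny frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],        # 25% missing: keep and impute
    "b": [np.nan, np.nan, np.nan, 7.0],  # 75% missing: indicator only
})

miss_rate = df.isna().mean()
too_sparse = miss_rate[miss_rate > 0.5].index

# Keep the signal that the value was missing; drop the fictional values.
for col in too_sparse:
    df[f"{col}_missing"] = df[col].isna().astype("int8")
df = df.drop(columns=too_sparse)

print(list(df.columns))  # ['a', 'b_missing']
```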
- Monitor missingness rates in production as you monitor feature distributions. A sudden change in the missingness rate of a feature signals a data quality issue, a pipeline failure, or a change in the data-generating process. Any of these can silently degrade model performance. Treat missingness rate shifts as a first-class monitoring metric alongside feature drift and prediction drift.
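One way such a check might look: a small helper (the function name and 10-point threshold are illustrative, not from the chapter) that compares per-feature missingness between the training data and a production batch:

```python
import numpy as np
import pandas as pd

def missingness_shift(train: pd.DataFrame, batch: pd.DataFrame,
                      threshold: float = 0.10) -> pd.Series:
    """Flag features whose missingness rate moved by more than
    `threshold` (absolute) between training data and a production batch."""
    delta = (batch.isna().mean() - train.isna().mean()).abs()
    return delta[delta > threshold]

train = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0]})        # 25% missing
batch = pd.DataFrame({"x": [np.nan, np.nan, np.nan, 4.0]})  # 75% missing

print(missingness_shift(train, batch))  # 'x' flagged with a 0.5 shift
```

In practice this runs alongside the usual feature-drift checks on every scoring batch.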
If You Remember One Thing
The most valuable data is often hiding in the gaps. The team that dropped 40% of their manufacturing data because of missing sensor values threw away the strongest failure signal in the dataset. The model that treated missing usage data as noise missed the 34% of churners whose silence was the loudest warning. Missing data is not a problem to be cleaned. It is information to be modeled. `df.dropna()` deletes it. Missing indicators preserve it. Your job is to tell the difference.
These takeaways summarize Chapter 8: Missing Data Strategies. Return to the chapter for full context.