Further Reading: Chapter 8

Missing Data Strategies


Foundational Papers and Books

1. "Statistical Analysis with Missing Data" --- Roderick J.A. Little and Donald B. Rubin (3rd edition, 2019) The definitive reference on missing data theory. Little and Rubin formalized the MCAR/MAR/MNAR framework that underpins most modern missing data practice. Chapters 1-4 cover the theory at a graduate level. Chapter 6 covers maximum likelihood estimation under MAR. Chapter 10 covers multiple imputation. This is not light reading, but if you need to defend your imputation choices to a statistician or a regulatory body, this is the book you cite.

2. "Multiple Imputation by Chained Equations: What Is It and How Does It Work?" --- Melissa J. Azur, Elizabeth A. Stuart, Constantine Frangakis, Philip J. Leaf (2011) International Journal of Methods in Psychiatric Research, 20(1), 40-49. The best accessible introduction to MICE (multiple imputation by chained equations). Explains the algorithm, its assumptions, and its practical application in clear, non-mathematical language. Essential reading before using IterativeImputer in scikit-learn. Available freely through PubMed Central.

3. "A Test of Missing Completely at Random for Multivariate Data with Missing Values" --- Roderick J.A. Little (1988) Journal of the American Statistical Association, 83(404), 1198-1202. The original paper introducing Little's MCAR test. Short, precise, and foundational. Understanding this test --- its logic, its limitations, and when it does and does not apply --- is part of the missing data practitioner's toolkit.


Practical Guides

4. "Flexible Imputation of Missing Data" --- Stef van Buuren (2nd edition, 2018) The most practical and accessible book on missing data imputation. Van Buuren is the lead developer of the mice package in R and one of the main developers of the MICE approach. The book covers the theory, the practice, and the pitfalls of multiple imputation with clear examples and honest advice about what works and what does not. Chapter 2 (Multiple Imputation, which includes the section "Concepts in Incomplete Data") and Chapter 3 (Univariate Missing Data) are the best introductions to the subject at the practitioner level. Available freely at stefvanbuuren.name/fimd/.

5. scikit-learn --- Imputation of Missing Values (User Guide) The official scikit-learn documentation on SimpleImputer, KNNImputer, and IterativeImputer. Covers API usage, parameters, and common patterns. Pay particular attention to the IterativeImputer section, which describes how to choose the estimator and how single imputation differs from multiple imputation. The documentation also warns that IterativeImputer is still experimental (it must be enabled with from sklearn.experimental import enable_iterative_imputer before it can be imported from sklearn.impute) and gives guidance on when to use each method. Available at scikit-learn.org.

6. "Missing Data Conundrum" --- Karen Grace-Martin (The Analysis Factor) A series of accessible blog posts that explain MCAR, MAR, and MNAR with real-world examples. Particularly strong on the intuition behind why MAR is the pivotal assumption in most analyses and why MNAR is so difficult to handle. Recommended for readers who found the Little and Rubin book too theoretical. Available at theanalysisfactor.com.
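
The three imputers covered in the scikit-learn user guide (item 5) share the same fit/transform interface, so they can be compared side by side. The sketch below uses a made-up six-row array and arbitrary settings (n_neighbors=2, max_iter=10, random_state=0); these are assumptions for illustration, not recommended defaults. Note the explicit experimental import that has to come before IterativeImputer is imported.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (must precede the next import)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Hypothetical toy data: the second column is roughly twice the first.
X = np.array([
    [1.0, 2.1],
    [2.0, 3.9],
    [3.0, np.nan],
    [4.0, 8.2],
    [np.nan, 10.1],
    [6.0, 11.8],
])

# Illustrative settings only; tune them for real data.
imputers = {
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=2),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}

# Each imputer returns a dense array with the NaN cells filled in.
filled = {name: imp.fit_transform(X) for name, imp in imputers.items()}

for name, X_filled in filled.items():
    # Show the two cells that were missing in the input.
    print(name, round(float(X_filled[2, 1]), 2), round(float(X_filled[4, 0]), 2))
```

With its default settings, IterativeImputer returns a single imputed dataset; it does not by itself perform the pooling step that multiple imputation requires.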


Research on Missingness as a Feature

7. "Missing Data as Part of the Social Indicator" --- Various authors in social science literature This is not a single paper but a research pattern: multiple papers in epidemiology and social science document that missingness patterns can themselves be predictive of outcomes. In healthcare, missed appointments can predict readmission. In education, missing survey responses can predict dropout. In finance, unreported income can predict default. Search Google Scholar for "informative missingness" or "missing indicator method" to find domain-specific examples.

8. "The Missing Indicator Method" --- Groenwold et al. (2012) Journal of Clinical Epidemiology, 65(10), 1014-1022. A systematic evaluation of the missing indicator method (adding binary flags for missingness) in epidemiological studies. The paper shows that the method can introduce bias for causal inference but is often beneficial for prediction. This distinction is critical: missing indicators are a powerful predictive tool but should not be used uncritically when the goal is to estimate causal effects.
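
The missing indicator method evaluated in item 8 amounts to adding one binary flag per feature that has missing values. A minimal sketch, assuming scikit-learn is available and using a hypothetical two-column array (age, income): the add_indicator=True option of SimpleImputer fills the gaps and appends the flags in one step.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: columns are [age, income]; income is sometimes unreported.
X = np.array([
    [34.0, 52000.0],
    [51.0, np.nan],
    [29.0, 61000.0],
    [45.0, np.nan],
])

# add_indicator=True appends one 0/1 column for each feature
# that contained missing values when the imputer was fitted.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out)                         # columns: age, imputed income, income-was-missing flag
print(imputer.indicator_.features_)  # indices of the features that received a flag
```

The flag column lets a predictive model learn from the fact that income was unreported, separately from the imputed value.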


Visualization and Diagnostics

9. "missingno" --- Python library by Aleksey Bilogur A lightweight Python library for visualizing missing data patterns. Generates missingness matrices, bar charts, heatmaps, and dendrograms with a single function call. The dendrogram view is particularly useful for identifying clusters of features that go missing together. Install with pip install missingno. Available on GitHub.

10. "naniar" --- R package by Nicholas Tierney If you work in R (or want to see best-in-class missing data visualization), naniar is the gold standard. Its gg_miss_upset plot shows the intersection of missingness patterns across features --- the most informative single visualization for understanding missing data structure. The Python ecosystem does not yet have a direct equivalent, making this worth a look even for Python-primary practitioners. Available on CRAN.


Predictive Maintenance Context

11. "Machine Learning for Predictive Maintenance: A Multiple Classifier Approach" --- Susto et al. (2015) IEEE Transactions on Industrial Informatics, 11(3), 812-820. One of the foundational papers on using ML for predictive maintenance in manufacturing. Discusses how sensor data quality issues (including missing readings) affect model performance and proposes strategies for handling sensor dropout. Relevant to the TurbineTech case study in this chapter.

12. "A Survey on Data-Driven Predictive Maintenance" --- Zonta et al. (2020) Computers in Industry, 117, 103180. A comprehensive survey covering data preprocessing for predictive maintenance, including sections on handling missing sensor data, feature engineering from sensor streams, and the challenge of label scarcity. Section 4.2 specifically addresses missing data strategies in the industrial IoT context.


Advanced Topics

13. "Rubin's Rules" --- Donald Rubin (1987) The pooling rules for combining results from multiple imputed datasets. If you need to produce valid confidence intervals or hypothesis tests from imputed data, you need Rubin's rules. The original treatment is in "Multiple Imputation for Nonresponse in Surveys" (Wiley, 1987). For a more accessible treatment, see van Buuren (item 4, Chapter 5).

14. "Pattern Submodels" --- Various authors An advanced technique where separate models are trained for different missingness patterns. Instead of imputing missing values and training one model, you train a model for "all features present," another for "feature A missing," another for "features A and B missing," etc. This avoids imputation entirely at the cost of model complexity. Useful in production systems with stable, well-understood missingness patterns.
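
Rubin's rules (item 13) are short enough to sketch directly. The snippet below is a minimal illustration for a single scalar estimate, not a replacement for a tested library; the function name pool_rubin and the example numbers are hypothetical. It averages the per-imputation estimates, then adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m).

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one scalar estimate across m imputed datasets (hypothetical helper)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need two equal-length lists with at least two imputations")
    q_bar = mean(estimates)       # pooled point estimate
    u_bar = mean(variances)       # average within-imputation variance
    b = variance(estimates)       # between-imputation variance (divides by m - 1)
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total


# Made-up results from m = 3 imputed datasets: an estimate and its squared standard error.
estimates = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]

q_bar, total_var = pool_rubin(estimates, variances)
print(q_bar, total_var)
```

Valid confidence intervals also need Rubin's degrees-of-freedom adjustment for the t reference distribution, which this sketch omits; the sources in item 13 cover it.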

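The pattern submodel idea (item 14) can be sketched as follows. This is an illustrative toy under stated assumptions, not a published reference implementation: ordinary least squares stands in for whatever model you would actually use, and the function names fit_pattern_submodels and predict_pattern_submodels, along with the data, are made up for the example. Rows are grouped by which features are missing, one model is fit per group on the observed columns only, and each new row is routed to the model for its pattern.

```python
import numpy as np


def fit_pattern_submodels(X, y):
    """Fit one least-squares model per missingness pattern (hypothetical helper)."""
    missing = np.isnan(X)
    models = {}
    for pattern in set(map(tuple, missing.tolist())):
        pattern_arr = np.array(pattern)
        rows = (missing == pattern_arr).all(axis=1)   # rows sharing this pattern
        observed_cols = ~pattern_arr                  # columns present under it
        # Design matrix: intercept plus the observed columns only (no imputation).
        A = np.column_stack([np.ones(rows.sum()), X[rows][:, observed_cols]])
        coef, *_ = np.linalg.lstsq(A, y[rows], rcond=None)
        models[pattern] = coef
    return models


def predict_pattern_submodels(models, X):
    """Route each row to the submodel trained on its missingness pattern."""
    missing = np.isnan(X)
    preds = np.empty(len(X))
    for i in range(len(X)):
        coef = models[tuple(missing[i].tolist())]  # KeyError if the pattern is unseen
        preds[i] = coef[0] + X[i, ~missing[i]] @ coef[1:]
    return preds


# Made-up training data: rows with both features follow y = 1 + 2*x1 + 3*x2,
# rows where x2 is missing follow y = 10 + 2*x1.
X_train = np.array([
    [1.0, 1.0], [2.0, 0.0], [3.0, 2.0], [0.0, 1.0],
    [1.0, np.nan], [2.0, np.nan], [3.0, np.nan],
])
y_train = np.array([6.0, 5.0, 13.0, 4.0, 12.0, 14.0, 16.0])

models = fit_pattern_submodels(X_train, y_train)
X_new = np.array([[2.0, 1.0], [4.0, np.nan]])
preds = predict_pattern_submodels(models, X_new)
print(preds)
```

A pattern that never appeared in training raises a KeyError here; a production system would need an explicit fallback, which is part of the complexity cost of this approach.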

How to Use This List

If you read nothing else, read van Buuren (item 4), Chapters 2-3. It is the single best resource for developing practical intuition about missing data, and it is free.

If you are implementing imputation in production, start with the scikit-learn documentation (item 5) and the missingno library (item 9). The documentation gives you the API; missingno gives you the diagnostics.

If you need to defend your missing data strategy to stakeholders or reviewers, cite Little and Rubin (item 1) for the theoretical framework and Groenwold et al. (item 8) for evidence that the missing indicator method is often beneficial for prediction tasks, while noting its potential for bias when the goal is to estimate causal effects.

If you work in predictive maintenance or IoT, Zonta et al. (item 12) provides the most relevant survey of domain-specific missing data challenges.


This reading list supports Chapter 8: Missing Data Strategies. Return to the chapter to review concepts before diving in.