Further Reading: Data Wrangling — Cleaning and Preparing Real Data

Books (Start Here)

McKinney, W. (2022). Python for Data Analysis (3rd ed.). O'Reilly Media. Wes McKinney created pandas, and this is the definitive guide. Chapters 7 (Data Cleaning and Preparation) and 8 (Data Wrangling) cover everything from Section 7.5 onward in far more depth. The book includes detailed examples of merging datasets, handling time series data, and reshaping operations. If you're serious about data wrangling in Python, this is the reference you'll return to repeatedly. The third edition is updated for modern pandas.

Wickham, H., & Grolemund, G. (2023). R for Data Science (2nd ed.). O'Reilly Media. While this book uses R rather than Python, Hadley Wickham's treatment of tidy data is unmatched — he coined the concept. Chapters on data tidying, transformation, and importing are conceptually valuable even if you don't read R code. The principles of tidy data are language-independent, and Wickham explains them with exceptional clarity. Freely available at r4ds.hadley.nz.

VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly Media. Chapter 3 (Data Manipulation with Pandas) provides an excellent alternative explanation of data cleaning operations. VanderPlas covers hierarchical indexing, combining datasets, and handling missing data with a focus on practical examples. Freely available at jakevdp.github.io/PythonDataScienceHandbook/. If McKinney's style doesn't click for you, try VanderPlas.

Spiegelhalter, D. (2019). The Art of Statistics: How to Learn from Data. Basic Books. Spiegelhalter's chapter on "What Causes What?" includes a thoughtful discussion of how data quality affects causal claims. His section on missing data in medical studies — where the missing mechanism directly affects treatment conclusions — connects Chapter 7's technical skills to Chapter 4's study design principles. Accessible and insightful.

Osborne, J. W. (2012). Best Practices in Data Cleaning. SAGE Publications. The only book-length treatment of data cleaning as a topic in its own right. Osborne covers outlier detection, missing data, normality testing, and data transformations, all from a social science research perspective. More academic than the other recommendations here, but invaluable if you want a deep dive into the theory behind cleaning decisions. The chapters on missing data mechanisms (MCAR/MAR/MNAR) are particularly thorough.
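The outlier-detection material Osborne covers can be previewed with the common 1.5 × IQR fence. A minimal sketch, with the rule being standard practice and the numbers entirely hypothetical:

```python
import pandas as pd

# Six ordinary readings and one suspicious value (all invented).
s = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Flag anything beyond 1.5 interquartile ranges from the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
```

Flagging, not deleting, is the point: the boolean mask records which values deserve a closer look before any cleaning decision is made.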

Articles and Papers

Wickham, H. (2014). "Tidy Data." Journal of Statistical Software, 59(10), 1-23. The foundational paper on tidy data. Wickham defines the three rules (each variable a column, each observation a row, each value a cell), catalogs five common patterns of messy data, and shows how tidying operations transform each pattern. This paper changed how a generation of data scientists think about data organization. It's readable, well-illustrated, and freely available at doi.org/10.18637/jss.v059.i10. Required reading for anyone who works with data.
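Wickham's three rules translate directly to pandas: `melt` turns a messy wide layout into one tidy row per observation. A minimal sketch with hypothetical country-by-year counts:

```python
import pandas as pd

# Messy "wide" layout: one row per country, years spread across columns
# (values are invented for illustration).
wide = pd.DataFrame({
    "country": ["Norway", "Chile"],
    "1999": [100, 58],
    "2000": [104, 61],
})

# Tidy it: each variable a column, each observation a row, each value a cell.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
tidy["year"] = tidy["year"].astype(int)
```

The wide table stores a variable (year) in its column names; after melting, year is an ordinary column you can filter, group, and plot on.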

Rubin, D. B. (1976). "Inference and Missing Data." Biometrika, 63(3), 581-592. The original paper that defined MCAR, MAR, and MNAR (though Rubin used slightly different terminology). This is a technical statistics paper, not a textbook chapter, so it's challenging reading. But if you want to understand the theoretical foundation behind Section 7.2's intuitive explanations, this is the primary source. Rubin's missing data framework is one of the most cited ideas in all of statistics.
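Rubin's distinctions can be made concrete with a toy example. Here missingness depends on the unobserved value itself (MNAR): high earners decline to answer, so summaries of the observed data are biased. All numbers are invented:

```python
import pandas as pd

# Hypothetical incomes (in thousands); respondents over 100 refuse to answer.
full = pd.Series([30.0, 40.0, 50.0, 60.0, 200.0, 250.0])
observed = full.where(full < 100)  # values >= 100 become NaN

# Under MNAR, the observed mean badly understates the truth.
true_mean = full.mean()          # 105.0
observed_mean = observed.mean()  # 45.0
```

Under MCAR, by contrast, deleting the missing rows would merely shrink the sample without biasing the mean, which is why the mechanism, not just the amount, of missingness matters.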

Tierney, N. J., & Cook, D. H. (2023). "Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations." Journal of Statistical Software, 105(7), 1-31. A modern paper that extends the tidy data concept to include explicit tools for exploring and visualizing missing data patterns. The authors introduce the idea of "shadow matrices" that track missingness alongside the data — essentially a formal version of the missing-data flag approach from Section 7.3. Accessible and practical, with R code examples (but the concepts transfer directly to Python).
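Although the paper's tooling is in R, the shadow-matrix idea is easy to sketch in pandas: mirror the data frame with a boolean matrix that records only missingness. A minimal version, on hypothetical data:

```python
import numpy as np
import pandas as pd

# Toy dataset with scattered missing values (values are illustrative).
df = pd.DataFrame({"age": [25, np.nan, 41], "income": [50.0, 62.0, np.nan]})

# The "shadow matrix" has the same shape as the data but tracks only
# missingness -- a formal version of per-column missing-data flags.
shadow = df.isna().add_suffix("_missing")
augmented = pd.concat([df, shadow], axis=1)
```

Keeping the shadow alongside the data means any later imputation can still be distinguished from genuinely observed values.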

Van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). Chapman and Hall/CRC. The definitive textbook on missing data imputation. Far beyond the scope of this chapter, but if you want to understand advanced techniques like multiple imputation, predictive mean matching, and sensitivity analysis, this is where to go. The first two chapters are accessible to introductory students and provide a rigorous version of the MCAR/MAR/MNAR framework from Section 7.2. Freely available at stefvanbuuren.name/fimd/.

Videos

StatQuest with Josh Starmer — "Dealing with Missing Data" (YouTube, ~15 min) Josh Starmer explains missing data types and imputation strategies with his signature whiteboard-and-animation style. He covers listwise deletion, mean imputation, regression imputation, and multiple imputation — building from simple to complex. Great for reinforcing Section 7.3 and previewing more advanced methods.

Corey Schafer — "Python Pandas Tutorial: Cleaning Data" (YouTube, ~30 min) A practical, hands-on Python tutorial covering many of the same pandas operations from Sections 7.2-7.5: .isna(), .fillna(), .dropna(), .replace(), and more. Schafer's teaching style is clear and methodical. Watch this if you want to see someone actually type the code and explain each step.
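The operations listed above look roughly like this in practice. The table, sentinel string, and fill strategy here are all illustrative:

```python
import numpy as np
import pandas as pd

# A small messy table (hypothetical values).
df = pd.DataFrame({
    "city": ["Boston", "boston", "N/A", "Chicago"],
    "temp": [60.0, np.nan, 55.0, np.nan],
})

df["city"] = df["city"].replace("N/A", np.nan)   # turn sentinel strings into real NaN
n_missing = df.isna().sum().sum()                # count every missing cell
filled = df.fillna({"temp": df["temp"].mean()})  # impute temp with its mean (57.5)
dropped = df.dropna()                            # or delete incomplete rows instead
```

Note that `filled` and `dropped` embody opposite trade-offs: one preserves every row at the cost of invented values, the other keeps only fully observed rows at the cost of sample size.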

Kaggle — "Data Cleaning Mini-Course" (kaggle.com/learn, ~4 hours) A free, hands-on mini-course that covers handling missing values, scaling and normalization, parsing dates, character encodings, and inconsistent data entry. Each lesson runs in a Kaggle notebook — you don't need to install anything. The exercises use real-world datasets, making it an excellent companion to the project checkpoint.

3Blue1Brown — "But What Is a Convolution?" (YouTube, ~24 min) Not directly about data cleaning, but relevant if you want to understand why the mathematical operations behind imputation matter. Grant Sanderson's visualizations of how combining distributions works help explain why mean imputation narrows the distribution's spread — a concept from Section 7.3.
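The narrowing effect itself is easy to verify directly: filling gaps with the mean adds values that deviate from the mean by zero, so the standard deviation shrinks. A sketch on invented numbers:

```python
import pandas as pd

# Hypothetical measurements with two gaps.
s = pd.Series([10.0, 20.0, None, 30.0, None, 40.0])

spread_before = s.std()       # std of the four observed values (~12.9)
imputed = s.fillna(s.mean())  # fill both gaps with the mean (25.0)
spread_after = imputed.std()  # 10.0 -- the spread has narrowed
```

The mean is unchanged, but the imputed series looks more tightly clustered than the underlying data ever was, which is exactly the distortion the video's distribution-combining intuition helps explain.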

Calling Bullshit — "Selection Bias" (callingbullshit.org, Lecture 6) Carl Bergstrom and Jevin West's excellent free lecture series includes a module on selection bias that connects directly to the Spaced Review from Chapter 4 and the ethical dimensions of data cleaning. Their examples of how missing data creates selection bias in social science research are vivid and memorable.

Interactive and Online Resources

pandas Documentation: "Working with Missing Data" (pandas.pydata.org) The official pandas documentation on missing data handling is surprisingly readable. It covers everything from detecting missing values to advanced filling strategies, with executable examples. Bookmark this page — you'll reference it every time you clean data in Python. Search for "Working with missing data" on the pandas documentation site.
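Beyond `fillna` and `dropna`, that page documents order-aware strategies such as forward-fill and interpolation, which suit time-ordered data. A small sketch on hypothetical sensor readings:

```python
import numpy as np
import pandas as pd

# An ordered series with gaps (invented sensor readings).
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

detected = s.isna()      # boolean mask of missing entries
forward = s.ffill()      # carry the last valid reading forward
linear = s.interpolate() # linear interpolation between neighbors
```

Forward-fill assumes the value held steady since the last reading; interpolation assumes a smooth trend between neighbors. Neither assumption is free, which is why the fill choice belongs in a cleaning log.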

Kaggle Datasets (kaggle.com/datasets) The best way to practice data cleaning is on real, messy datasets. Kaggle hosts thousands of free datasets, many of which are genuinely messy (missing values, duplicates, inconsistencies). Search for "dirty data" or "data cleaning" to find datasets that are specifically designed for cleaning practice. Try the "Melbourne Housing Market" dataset for a challenging but manageable cleaning exercise.

Quartz Guide to Bad Data (github.com/Quartz/bad-data-guide) A curated list of common data quality problems, maintained by the data journalism team at Quartz. Organized as a checklist — "Is the data too good?", "Is the data internally inconsistent?", "Has the data been manually manipulated?" — it's a practical field guide for catching problems that .describe() alone won't reveal. Especially useful for journalists and social scientists working with government or organizational data.
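A few of the guide's checklist questions, duplicate records and internal inconsistency in particular, can be automated in pandas. A sketch on an invented roster:

```python
import pandas as pd

# Hypothetical roster with problems .describe() won't surface.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "start": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-03-01", "2020-06-01"]),
    "end": pd.to_datetime(["2020-02-01", "2020-04-01", "2020-04-01", "2020-05-01"]),
})

dup_rows = df.duplicated().sum()                # exact duplicate records
dup_ids = df["id"].duplicated().sum()           # repeated identifiers
inconsistent = (df["end"] < df["start"]).sum()  # rows that contradict themselves
```

Each check answers one checklist question with a count rather than a summary statistic, which is why they catch problems that column-by-column descriptions miss.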

Data Quality Campaign Resources (dataqualitycampaign.org) While focused on education data, this organization provides excellent resources on why data quality matters for policy decisions. Their case studies illustrate how cleaning decisions in administrative data (student records, test scores, demographic information) affect the conclusions that policymakers draw — directly connecting Section 7.10's cleaning log principles to real-world consequences.

OpenRefine (openrefine.org) A free, open-source tool specifically designed for data cleaning. OpenRefine is especially powerful for standardizing text data — it can cluster similar values (like "New York", "new york", "NYC", "N.Y.C.") and let you merge them interactively. It also provides full undo/redo history, which is essentially a built-in cleaning log. If you work with messy text data regularly, OpenRefine is worth learning alongside pandas.
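OpenRefine's interactive clustering can be roughly approximated in pandas by normalizing the text and then mapping surviving variants to one canonical label. The mapping below is chosen by hand and entirely illustrative:

```python
import pandas as pd

# Variant spellings of the same city (the OpenRefine example above).
s = pd.Series(["New York", "new york", "NYC", "N.Y.C."])

# Normalize case and punctuation, then merge the remaining variants manually.
normalized = (s.str.lower()
               .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation
               .str.strip())
canonical = normalized.replace({"nyc": "new york"})
```

The two-step shape mirrors OpenRefine's workflow: mechanical normalization collapses most variants, and a human supplies the final judgment call ("nyc" means "new york") that no string rule can make safely.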

Podcasts

Not So Standard Deviations — "Data Cleaning Is Analysis" (nssdeviations.com) Roger Peng and Hilary Parker discuss data cleaning as an integral part of analysis, not a preliminary step. Their conversation touches on how cleaning decisions are often the most consequential choices in a data project — reinforcing the chapter's central theme. About 45 minutes, with a mix of practical advice and philosophical reflection.

Data Skeptic — "Missing Data" (dataskeptic.com) A concise (~20 min) episode explaining MCAR, MAR, and MNAR with accessible examples. The hosts discuss why the missing data mechanism matters for inference and cover strategies from simple deletion to multiple imputation. A good audio companion to Section 7.2.

Casual Inference — "The Missing Data Problem in Causal Inference" (casualinfer.libsyn.com) For students who want to understand how missing data affects the causal claims from Chapter 4. The hosts (epidemiologists) explain how non-random missingness can invalidate study conclusions and discuss practical strategies for sensitivity analysis — the approach recommended in Section 7.10.

Looking Ahead

The concepts in this chapter are foundations for everything that follows:

  • Chapter 8 (Probability): Clean data is essential for calculating meaningful probabilities. Missing values and duplicates would distort probability estimates from frequency tables.
  • Chapter 10 (Normal Distribution): The distribution shapes you used to choose between mean and median imputation (Section 7.3) become formal probability models.
  • Chapter 11 (Central Limit Theorem): The sample size reductions from listwise deletion (Section 7.3) directly affect the precision of sampling distributions.
  • Chapter 13 (Hypothesis Testing): Your cleaning decisions determine the dataset on which every test is run. Different cleaning choices can lead to different p-values — which is why the cleaning log from Section 7.10 is essential.
  • Chapter 17 (Power and Effect Sizes): Reducing sample size through deletion reduces statistical power. The trade-off between deletion and imputation has direct consequences for your ability to detect real effects.
  • Chapter 22 (Regression): Feature engineering (Section 7.8) is the art of creating variables that improve regression models. The new variables you create during cleaning often become the best predictors.
  • Chapter 25 (Communicating with Data): Your cleaning log becomes part of the "methods" section of any report. Transparency about cleaning decisions is a hallmark of trustworthy data communication.
  • Chapter 27 (Ethical Data Practice): The ethical themes from this chapter — missing data means missing people, cleaning decisions shape conclusions — return as central topics.
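The mean-versus-median imputation choice mentioned above comes down to skew: a long right tail inflates the mean but barely moves the median. A sketch on invented incomes:

```python
import pandas as pd

# Right-skewed incomes (hypothetical, in thousands); one value is missing.
s = pd.Series([30.0, 32.0, 35.0, 38.0, None, 400.0])

mean_fill = s.fillna(s.mean())      # mean = 107.0, dragged up by the outlier
median_fill = s.fillna(s.median())  # median = 35.0, robust to the tail
```

For a roughly symmetric distribution the two fills would nearly agree; the gap between them here (107 versus 35) is itself a diagnostic that the distribution is skewed.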