Further Reading: Cleaning Messy Data

Data cleaning doesn't get the glamorous book deals that machine learning does. But the resources below are genuinely useful — they'll deepen your understanding of why cleaning matters, how to do it well, and what happens when it goes wrong.


Tier 1: Verified Sources

Wes McKinney, Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter (O'Reilly, 3rd edition, 2022). The definitive pandas reference, written by the creator of pandas. Chapters 7 (Data Cleaning and Preparation) and 8 (Data Wrangling: Join, Combine, and Reshape) cover everything in our chapter and more, with detailed examples. This is the book to keep on your desk when you're stuck on a specific cleaning operation. It's more reference than tutorial, so it complements rather than replaces the narrative approach of our textbook.

Hadley Wickham, "Tidy Data," Journal of Statistical Software, Volume 59, Issue 10, 2014. The foundational paper on what it means for data to be "clean" (or more precisely, "tidy"). Wickham defines tidy data as data where each variable is a column, each observation is a row, and each type of observational unit is a table. While the paper uses R examples, the principles are universal and deeply relevant to pandas work. The paper is freely available online through the journal's website.
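
To make Wickham's definition concrete in pandas terms, here is a minimal sketch. The table, its column names, and its values are invented for illustration and do not come from the paper; the point is only that a variable hidden in column headers (here, the year) becomes its own column once the data is tidy.

```python
import pandas as pd

# Hypothetical "wide" table: one column per year, so the variable
# "year" lives in the column headers instead of in a column.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [10, 20],
    "2020": [12, 25],
})

# melt() moves the year headers into a "year" column, giving one row
# per (country, year) observation and one column per variable.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
tidy["year"] = tidy["year"].astype(int)
tidy = tidy.sort_values(["country", "year"]).reset_index(drop=True)

print(tidy)
```

After reshaping, every row is a single observation, so filtering by year or grouping by country needs no special handling of column names.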

Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan, "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science, Volume 366, Issue 6464, pages 447-453, 2019. The landmark study referenced in Case Study 2. The researchers showed that a widely used healthcare algorithm systematically underestimated the needs of Black patients because it used healthcare spending as a proxy for healthcare need: at the same risk score, Black patients were considerably sicker than white patients. Essential reading for understanding how data processing decisions (not just cleaning, but also feature selection and proxy choice) can encode systemic bias.

David Spiegelhalter, The Art of Statistics: How to Learn from Data (Basic Books, 2019). While not specifically about data cleaning, Spiegelhalter's book devotes significant attention to data quality, missing data, and the dangers of drawing conclusions from imperfect information. His chapter on "What is wrong with numbers?" is particularly relevant. Highly accessible — no advanced math required.

Cathy O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (Crown, 2016). O'Neil examines how data-driven systems can discriminate against vulnerable populations. Several of her examples involve data quality issues — not missing values or type errors, but the deeper question of what the data represents and whose experiences are captured. A powerful companion to our discussion of the ethical dimensions of cleaning.


Tier 2: Attributed Resources

Roderick J. A. Little and Donald B. Rubin, Statistical Analysis with Missing Data (Wiley, 3rd edition, 2019). The authoritative academic treatment of missing data theory. The MCAR/MAR/MNAR framework traces back to Rubin's 1976 Biometrika paper "Inference and Missing Data," and this book is where it is developed in full. It's a graduate-level statistics textbook and not light reading, but if you want to understand the mathematical foundations of why missing data handling matters, this is the source. Recommended for students planning to pursue statistics or biostatistics.
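
If the three mechanisms still feel abstract, a small simulation can help. The sketch below is not taken from Little and Rubin; the variables (age, income), the coefficients, and the missingness probabilities are all made up for illustration. It removes values from the same column in three different ways and compares the mean of what is left.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical population: income rises with age, plus noise.
age = rng.integers(20, 70, size=n)
income = 20_000 + 800 * age + rng.normal(0, 10_000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3
# MAR: missingness depends only on an observed column (age).
mar_mask = rng.random(n) < np.where(df["age"] > 45, 0.5, 0.1)
# MNAR: missingness depends on the value that goes missing
# (here, higher incomes are more likely to be withheld).
high_income = df["income"] > df["income"].median()
mnar_mask = rng.random(n) < np.where(high_income, 0.5, 0.1)

# Series.mask() replaces values with NaN where the mask is True;
# mean() then skips those NaNs.
observed_means = {
    "full": df["income"].mean(),
    "MCAR": df["income"].mask(mcar_mask).mean(),
    "MAR": df["income"].mask(mar_mask).mean(),
    "MNAR": df["income"].mask(mnar_mask).mean(),
}

for name, value in observed_means.items():
    print(f"{name:>4}: {value:,.0f}")
```

Under these assumptions the MCAR mean stays close to the full-data mean, while the MAR and MNAR means are pulled downward. The difference between the last two is that the MAR bias can in principle be corrected using the observed age column, whereas the MNAR bias cannot be corrected from the observed data alone.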

Quartz, "The Quartz Guide to Bad Data." A well-known reference guide created by the data journalism team at Quartz (the business news publication). It catalogues dozens of common data quality problems with clear explanations of what causes them and how to detect them. Search for "Quartz guide to bad data GitHub" to find it — it's maintained as an open-source resource.

Tidy Tuesday (R for Data Science Community). A weekly social data project organized by the R for Data Science online learning community. Each week, a new messy, real-world dataset is released for community members to clean, explore, and visualize. While the community focuses on R, the datasets themselves are CSV files that work perfectly in pandas, and browsing the community's cleaning approaches is educational regardless of your preferred language. Search for "TidyTuesday GitHub."

pandas official documentation: "Working with missing data." The pandas documentation has an excellent section specifically about missing data handling, including detailed explanations of how NaN propagates through operations, the difference between None and NaN, and advanced topics like nullable integer types. Search the pandas documentation site for "working with missing data."
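
As a quick preview of the behaviors that section documents, here is a minimal sketch with made-up values. It assumes a reasonably recent pandas release in which the nullable "Int64" dtype is available.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# NaN propagates through element-wise arithmetic...
doubled = s * 2                       # 2.0, NaN, 6.0
# ...but reductions such as sum() skip it unless told otherwise.
total_skipping = s.sum()              # 4.0
total_strict = s.sum(skipna=False)    # NaN

# In a numeric Series, None is converted to NaN on construction.
from_none = pd.Series([1.0, None, 3.0])

# The default integer dtype cannot hold NaN, so a missing value
# silently turns the whole Series into float64.
as_float = pd.Series([1, None, 3])
# The nullable "Int64" dtype keeps integers and marks the gap with pd.NA.
as_nullable = pd.Series([1, None, 3], dtype="Int64")

print(as_float.dtype, as_nullable.dtype)
print(as_nullable.isna().tolist())
```

The silent switch to float64 is worth remembering when an ID or count column unexpectedly shows values like 3.0 after loading a file with gaps.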


Recommended Next Steps

  • If you want more practice with messy data: Find real-world datasets on Kaggle, Data.gov, or TidyTuesday and practice your cleaning pipeline. The more datasets you clean, the faster you'll recognize common patterns.

  • If the ethics discussion resonated with you: Read Obermeyer et al. (2019) and O'Neil's Weapons of Math Destruction. Both will deepen your understanding of how data processing decisions affect real people. We'll return to these themes extensively in Chapter 32 (Ethics in Data Science).

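  • If you want to understand missing data theory more deeply: Start with the pandas "Working with missing data" documentation, which is accessible and practical, and with Spiegelhalter's discussion of missing data in The Art of Statistics. If you want more rigor, consult Little and Rubin, but be prepared for mathematical notation. (Wickham's "Tidy Data" paper is the place to go for data structure rather than missingness.)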

  • If you're ready to move on: Chapter 9 (Reshaping and Transforming) builds directly on clean data. Chapters 10 (Text Data) and 11 (Dates and Times) handle specialized cleaning for those data types. All three assume you're comfortable with the skills from this chapter.