Further Reading: Your First Data Analysis
You've just completed your first real data exploration — a milestone worth celebrating. If you want to deepen your understanding before moving to Part II, here are resources organized by what caught your attention.
Tier 1: Verified Sources
These are published books with full bibliographic details.
John W. Tukey, Exploratory Data Analysis (Addison-Wesley, 1977). This is the book that coined the term "exploratory data analysis." Tukey was a legendary statistician at Princeton and Bell Labs who argued that data analysis should begin with open-ended exploration rather than jumping straight to hypothesis testing. The book is dated in its technology (stem-and-leaf plots, done by hand) but timeless in its philosophy. If the "EDA as conversation" threshold concept resonated with you, Tukey is the originator of that mindset. A genuinely foundational work in the history of data science.
Wes McKinney, Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter (O'Reilly, 3rd edition, 2022). You don't need this yet — we're introducing pandas in Chapter 7 — but if the "limits of pure Python" section left you eager to see what's coming, McKinney's book is the definitive reference. McKinney created pandas, and his book walks through everything from loading CSV files to complex reshaping operations. Having it on hand when you start Part II will be invaluable. The third edition was updated for modern pandas and includes Jupyter-based examples throughout.
Joel Grus, Data Science from Scratch: First Principles with Python (O'Reilly, 2nd edition, 2019). Grus takes the same "build it yourself first" approach we used in this chapter, but extends it across the entire data science pipeline — from statistics to machine learning, all implemented in pure Python before introducing libraries. If you enjoyed computing summary statistics by hand and want to see how far that approach can go, this is your book.
Roger D. Peng and Elizabeth Matsui, The Art of Data Science (Leanpub, 2016). A short, focused book about the thinking behind data analysis, not the coding. Peng (a biostatistician at Johns Hopkins and co-creator of the Johns Hopkins Data Science Specialization on Coursera) and Matsui walk through how experienced analysts approach a dataset: forming questions, iterating through analysis, and arriving at insights. This is one of the best resources for developing the analytical mindset that makes EDA productive.
David Spiegelhalter, The Art of Statistics: How to Learn from Data (Basic Books, 2019). We recommended this in Chapter 1, and it becomes even more relevant now that you've actually worked with data. Spiegelhalter explains statistical concepts through real-world stories — crime data, medical trials, survival analysis — without requiring mathematical background. The chapters on summarizing data and understanding variation connect directly to the summary statistics you computed in this chapter.
Tier 2: Attributed Resources
These are articles, talks, and online resources that are well-known in the data science community. We provide enough detail to find them, but not URLs (because links rot).
Hadley Wickham, "Tidy Data" (2014). Published in the Journal of Statistical Software (Volume 59, Issue 10), this paper introduces the concept of "tidy data" — a consistent way of organizing data that makes analysis easier. We'll explore tidy data formally in Chapter 7, but reading this paper now will give you a preview of why data structure matters so much. Wickham's writing is exceptionally clear for an academic paper.
World Health Organization, WHO/UNICEF Estimates of National Immunization Coverage (WUENIC). The real version of the dataset we used in this chapter. You can find it by searching for "WHO immunization data" — the WHO maintains an open data portal with downloadable CSV files of vaccination coverage estimates by country and year. Exploring the actual data (which is messier and larger than our simplified version) is an excellent next exercise.
Roger Peng's "Exploratory Data Analysis with R" course (Johns Hopkins, via Coursera). While this course uses R rather than Python, Peng's teaching of EDA principles is language-agnostic and excellent. The lectures on plotting systems, clustering, and dimension reduction will preview topics from later chapters in our book. Search for "Roger Peng EDA Coursera" to find it.
The Python Standard Library documentation for the csv module. Python's official documentation (docs.python.org) covers the csv module comprehensively, including DictReader, DictWriter, and the various dialect options. If you want to understand the parameters we didn't cover (custom delimiters, quoting rules, handling different line endings), this is the authoritative reference.
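As a quick reminder of what those dialect options look like in practice, here is a minimal sketch of DictReader with a non-default delimiter. The data and column names are illustrative (not the WHO dataset); io.StringIO stands in for an open file.

```python
import csv
import io

# A small in-memory CSV standing in for a downloaded file.
# Column names and values here are illustrative only.
raw = "country;year;coverage\nNorway;2020;97\nKenya;2020;89\n"

# DictReader maps each row to a dict keyed by the header row.
# delimiter is one of the dialect options the docs describe.
rows = list(csv.DictReader(io.StringIO(raw), delimiter=";"))

print(rows[0]["country"])        # Norway
print(int(rows[1]["coverage"]))  # 89 -- values arrive as strings
```

Note that every value comes back as a string, which is exactly the type-conversion chore you met in this chapter.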
Recommended Next Steps
- If you want more practice with EDA: Download a real dataset from a public data source (Kaggle, data.gov, the WHO data portal, or your city's open data site) and run the full EDA workflow from this chapter on it. Applying the same checklist to different datasets is how the workflow becomes automatic.
- If you're eager for pandas: Jump straight into Chapter 7. Everything you struggled with in this chapter — type conversion, grouped statistics, filtering — becomes dramatically easier. Having done it manually first, you'll appreciate pandas at a visceral level.
- If you want to understand the statistics more deeply: Read Spiegelhalter's Art of Statistics or preview our Chapter 19 (Descriptive Statistics), which will formalize the summary statistics you computed here and introduce additional measures like standard deviation, percentiles, and the five-number summary.
- If the data quality section fascinated you: Preview Chapter 8 (Cleaning Messy Data), which covers professional techniques for handling missing values, fixing type errors, and standardizing inconsistent data — all the things we identified as problems in this chapter but didn't fully resolve.
- If the notebook narrative concept resonated: Look at the Jupyter notebooks shared by data scientists on GitHub (search "jupyter notebook data analysis examples"). Seeing how professionals structure their notebooks — with clear sections, rich Markdown, and interpretive text — will give you models to emulate.
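If the statistics path appeals to you, you can preview the five-number summary and standard deviation right now with Python's built-in statistics module; a minimal sketch with made-up coverage values:

```python
import statistics

# Illustrative numbers, not real WHO data.
values = [78, 85, 88, 91, 93, 95, 96, 97, 99]

# quantiles(n=4) returns the three quartile cut points (Q1, median, Q3).
q1, median, q3 = statistics.quantiles(values, n=4)

# Five-number summary: min, Q1, median, Q3, max.
summary = (min(values), q1, median, q3, max(values))
print(summary)  # (78, 86.5, 93.0, 96.5, 99)

# Sample standard deviation: how spread out the values are.
sd = statistics.stdev(values)
print(round(sd, 2))
```

Chapter 19 will explain what these measures mean and when to use each one.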
Happy reading — and welcome to the other side. You're a practitioner now, not just a student. Part II is where things start to fly.