Chapter 5 Further Reading: Exploratory Data Analysis

DataField.Dev

Foundational Texts

1. Exploratory Data Analysis — John W. Tukey (1977)

The book that coined the term and launched the field. Tukey's approach — emphasizing visual inspection and open-ended investigation over hypothesis testing — remains the intellectual foundation of modern EDA. The prose is dense and the examples are pre-computer, but the philosophy is timeless. Essential reading for anyone who wants to understand why we explore data, not just how.

2. The Visual Display of Quantitative Information — Edward R. Tufte (2001, 2nd edition)

The most influential book on data visualization ever written. Tufte introduces the data-ink ratio, chartjunk, the lie factor, and small multiples — concepts that appear throughout this chapter and throughout the field. The book is also a masterpiece of physical design (Tufte self-published it to maintain control over the layout). Every chart you create for the rest of your career should be measured against Tufte's standards.
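
Of the concepts listed above, the lie factor reduces most directly to arithmetic: the relative change shown in the graphic divided by the relative change in the data. The sketch below is a minimal illustration, not code from the book. The function names are invented here, and the sample numbers are those usually quoted from Tufte's fuel-economy example (a change from 18 to 27.5 mpg drawn as lines 0.6 and 5.3 inches long), so verify them against the book before reusing them.

```python
def relative_change(first: float, second: float) -> float:
    """Size of an effect as a fraction of its starting value."""
    return abs(second - first) / abs(first)


def lie_factor(data_first, data_second, graphic_first, graphic_second):
    """Effect shown in the graphic divided by the effect in the data.

    A value near 1 means the graphic is proportionate to the data.
    """
    return (relative_change(graphic_first, graphic_second)
            / relative_change(data_first, data_second))


# Data: 18 -> 27.5 mpg. Graphic: 0.6 -> 5.3 inch lines (assumed figures).
factor = lie_factor(18.0, 27.5, 0.6, 5.3)
print(f"Lie factor: {factor:.1f}")  # prints "Lie factor: 14.8"
```

A result this far above 1 means the drawing exaggerates the change roughly fifteenfold relative to the underlying numbers.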

3. Visual Explanations: Images and Quantities, Evidence and Narrative — Edward R. Tufte (1997)

Contains Tufte's famous analysis of the Challenger O-ring data (the basis for Case Study 1 in this chapter). Extends the principles from Visual Display into the domain of causality, explanation, and decision-making. The Challenger analysis alone makes this book essential.

4. Envisioning Information — Edward R. Tufte (1990)

Focuses on representing complex, multivariate information in two-dimensional spaces. Particularly relevant for business dashboards and executive reporting where multiple variables must be compared simultaneously. The chapter on "small multiples" is directly applicable to the multi-panel matplotlib figures introduced in Section 5.4.
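
As a rough illustration of how small multiples translate into a multi-panel matplotlib figure, the sketch below draws one panel per category on shared axes. The region names and sales numbers are hypothetical, and the layout choices are one reasonable option rather than the approach used in Section 5.4.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical monthly sales (USD thousands) for four regions.
months = list(range(1, 7))
sales = {
    "North": [12, 14, 13, 17, 19, 22],
    "South": [20, 19, 21, 20, 22, 21],
    "East": [8, 9, 12, 15, 14, 18],
    "West": [15, 13, 12, 11, 12, 10],
}

# One small panel per region; shared axes keep the panels comparable.
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 6))
for ax, (region, values) in zip(axes.flat, sales.items()):
    ax.plot(months, values, color="steelblue")
    ax.set_title(region)

fig.supxlabel("Month")
fig.supylabel("Sales (USD thousands)")
fig.tight_layout()
```

Sharing both axes is the design choice that matters here: without it, each panel rescales independently and the visual comparison across regions is lost. Call fig.savefig("small_multiples.png") to write the figure to disk.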

Data Visualization and Storytelling

5. Storytelling with Data: A Data Visualization Guide for Business Professionals — Cole Nussbaumer Knaflic (2015)

The most practical guide to turning data analysis into business communication. Knaflic, a former Google People Analytics manager, focuses on the specific challenge that NK Adeyemi faces in this chapter: how to make data insights land with a business audience. Covers decluttering, audience analysis, and the narrative arc of a data presentation. Highly recommended as a companion to this chapter.

6. Storytelling with Data: Let's Practice! — Cole Nussbaumer Knaflic (2019)

A workbook companion to the original, filled with before-and-after visualization makeovers and hands-on exercises. Excellent for building the chart-critique skills practiced in this chapter's quiz and exercises.

7. The Truthful Art: Data, Charts, and Maps for Communication — Alberto Cairo (2016)

Cairo, a journalist and visualization professor, bridges the gap between design and statistics. His framework of "truthful, functional, beautiful, insightful, and enlightening" provides a checklist for evaluating visualizations. The chapters on distribution shapes and correlation interpretation complement the statistical material in Sections 5.6 and 5.7.

8. Good Charts: The HBR Guide to Making Smarter, More Persuasive Data Visualizations — Scott Berinato (2016)

From Harvard Business Review, this book is aimed squarely at the MBA audience. Berinato's 2x2 framework (conceptual vs. data-driven, declarative vs. exploratory) yields four chart purposes: idea illustration, idea generation, visual discovery, and everyday dataviz. It helps determine what type of chart is appropriate for different communication goals. Practical and concise.

9. The Big Book of Dashboards: Visualizing Your Data Using Real-World Business Scenarios — Steve Wexler, Jeffrey Shaffer, Andy Cotgreave (2017)

Focuses specifically on enterprise data visualization — dashboards, executive reports, and operational displays. The examples come from business contexts (sales, marketing, finance, operations) rather than academic or journalistic ones, making it particularly relevant for MBA students.

Python Visualization Libraries

10. Python Data Science Handbook — Jake VanderPlas (2016, updated online)

Chapter 4 provides the most thorough treatment of matplotlib available in a general data science textbook. VanderPlas explains the figure-axes model, customization options, and advanced techniques clearly and with excellent code examples. Freely available online at jakevdp.github.io/PythonDataScienceHandbook.
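
The figure-axes model mentioned above can be sketched in a few lines: a Figure is the whole canvas, an Axes is one plot on it, and customization happens through methods on the Axes. The data below is hypothetical and the example is an illustration of the idea, not an excerpt from the Handbook.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical order values, for illustration only.
order_values = [23, 25, 31, 35, 38, 41, 44, 47, 52, 58, 63, 71, 95]

# The Figure is the whole canvas; each Axes is one plot drawn on it.
fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(order_values, bins=6, color="steelblue", edgecolor="white")

# Customization goes through methods on the Axes object, not global state.
ax.set_title("Distribution of order values")
ax.set_xlabel("Order value (USD)")
ax.set_ylabel("Number of orders")
ax.spines[["top", "right"]].set_visible(False)  # drop non-data ink
```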

11. matplotlib Official Documentation and Tutorials

URL: matplotlib.org/stable/tutorials

The official tutorials cover everything from basic plots to advanced customization. The separate "Gallery" section (matplotlib.org/stable/gallery) is particularly useful: it contains hundreds of chart examples with source code. When you need to create a specific type of chart, start there.

12. seaborn Official Documentation

URL: seaborn.pydata.org

seaborn's documentation is unusually well-organized, with a tutorial section that explains the library's design philosophy (data-centric API, integration with pandas DataFrames, statistical estimation built in). The function reference includes worked examples for the plot types covered in Section 5.5.
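
A short sketch of what the data-centric API and built-in statistical estimation look like in practice. The DataFrame contents and column names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical customer data, for illustration only.
df = pd.DataFrame({
    "segment": ["New", "New", "New", "Returning", "Returning", "Returning"],
    "order_value": [20.0, 35.0, 28.0, 55.0, 62.0, 48.0],
})

# Data-centric API: pass the DataFrame and name the columns.
# barplot aggregates each segment (mean by default) and draws error bars.
ax = sns.barplot(data=df, x="segment", y="order_value")
ax.set_ylabel("Mean order value (USD)")
```

Note that no aggregation was written by hand: the grouping and the mean come from seaborn, which is what "statistical estimation built in" refers to.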

13. Python for Data Analysis — Wes McKinney (2022, 3rd edition)

Written by the creator of pandas, this book is the definitive reference for data manipulation in Python. Chapter 9 covers plotting and visualization with both matplotlib and seaborn. The real value, however, is in the data loading, cleaning, and wrangling chapters (6 through 8), which complement the EDA workflow described in Section 5.1.

Statistics for Business Audiences

14. Naked Statistics: Stripping the Dread from the Data — Charles Wheelan (2013)

An accessible, entertaining introduction to statistical concepts for non-statisticians. Wheelan explains mean, median, standard deviation, correlation, and regression with real-world examples and humor. Ideal for business professionals who need to build statistical intuition without mathematical formality. Covers all the descriptive statistics concepts from Section 5.2 at an accessible level.

15. The Art of Statistics: How to Learn from Data — David Spiegelhalter (2019)

Spiegelhalter, a Cambridge professor and former president of the Royal Statistical Society, writes for a general audience about how to think statistically. His treatment of correlation, causation, and misleading data visualizations is particularly relevant to Sections 5.7 and 5.3. The chapter on risk communication connects directly to the Challenger case study.

16. How to Lie with Statistics — Darrell Huff (1954)

A classic that remains relevant seventy years later. Huff catalogs the ways that statistics can be (and are) used to mislead — truncated axes, misleading averages, biased samples, and visual distortion. Short, entertaining, and useful as a checklist of pitfalls to avoid in your own work and to spot in others'.
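
The "misleading averages" pitfall is easy to reproduce. In the sketch below, with hypothetical salary figures, a single outlier pulls the mean far from what a typical employee earns, while the median barely notices it.

```python
from statistics import mean, median

# Hypothetical annual salaries (USD thousands): five staff and one executive.
salaries = [40, 42, 45, 48, 50, 400]

print(f"Mean:   {mean(salaries):.1f}")    # prints "Mean:   104.2"
print(f"Median: {median(salaries):.1f}")  # prints "Median: 46.5"
```

Both numbers are a correct "average" salary; which one gets reported is exactly the kind of choice Huff warns about.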

Missing Data

17. Statistical Analysis with Missing Data — Roderick J.A. Little and Donald B. Rubin (2019, 3rd edition)

The authoritative academic reference on missing data theory. Little and Rubin developed the MCAR/MAR/MNAR taxonomy used in Section 5.8. This is a technical statistics textbook — not light reading — but the first three chapters are accessible and provide the theoretical foundation for understanding why missingness matters.
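
To see why the mechanism behind missingness matters, the following simulation compares the naive mean of the observed values under each of the three mechanisms. It is a toy sketch under invented assumptions (the variables, coefficients, and missingness rates are all hypothetical), not an example from Little and Rubin.

```python
import random
from statistics import mean

random.seed(0)
n = 10_000

# Hypothetical data: age is always observed, income may go missing.
age = [random.uniform(20, 65) for _ in range(n)]
income = [30 + 0.8 * a + random.gauss(0, 10) for a in age]


def observed_mean(values, is_missing):
    """Mean of the values that were not flagged as missing."""
    return mean(v for v, m in zip(values, is_missing) if not m)


# MCAR: every value has the same 30% chance of being missing.
mcar = [random.random() < 0.3 for _ in range(n)]
# MAR: missingness depends only on an observed variable (age).
mar = [random.random() < (0.6 if a > 45 else 0.1) for a in age]
# MNAR: missingness depends on the unobserved value itself.
mnar = [random.random() < (0.6 if y > 70 else 0.1) for y in income]

true_mean = mean(income)
mcar_mean = observed_mean(income, mcar)
mar_mean = observed_mean(income, mar)
mnar_mean = observed_mean(income, mnar)

print(f"True mean:      {true_mean:.1f}")
print(f"MCAR observed:  {mcar_mean:.1f}")
print(f"MAR observed:   {mar_mean:.1f}")
print(f"MNAR observed:  {mnar_mean:.1f}")
```

Under MCAR the observed mean stays close to the true mean. Under MAR and MNAR it drifts downward, because higher incomes are more likely to be missing. The practical difference is that the MAR bias can in principle be corrected using the observed variable (age), whereas the MNAR bias cannot be corrected from the observed data alone.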

18. Flexible Imputation of Missing Data — Stef van Buuren (2018, 2nd edition)

A more practical guide to handling missing data, focused on the mice (Multiple Imputation by Chained Equations) framework. While the examples use R rather than Python, the conceptual material on when and how to impute missing values is language-agnostic and directly applicable to the business imputation decisions discussed in Section 5.8. Freely available online at stefvanbuuren.name/fimd.

Data Quality and Profiling

19. Bad Data Handbook: Cleaning Up the Data So You Can Get Back to Work — Q. Ethan McCallum, editor (2012)

A collection of essays by data practitioners about the messy reality of working with imperfect data. Covers common data quality issues (duplicates, inconsistent formats, missing values, encoding errors) with practical solutions. More pragmatic than academic — the tone matches the "EDA as detective work" mindset of this chapter.

20. pandas-profiling / ydata-profiling

URL: github.com/ydataai/ydata-profiling

An open-source Python library that generates an automated EDA report from a pandas DataFrame. It is similar in spirit to the EDAReport class built in Section 5.11, but with more features (interaction analysis, duplicate detection, correlation types beyond Pearson). Install with pip install ydata-profiling and generate a report with two lines of code. Excellent for understanding what a production-grade EDA tool looks like.
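
The two lines in question are the construction of a report object and the call that writes it out. A minimal sketch, wrapped in a helper whose name, default path, and report title are illustrative choices:

```python
def write_profile_report(df, path="eda_report.html"):
    """Write an automated EDA report for a pandas DataFrame to an HTML file."""
    # Imported here so the dependency is only needed when a report is built.
    from ydata_profiling import ProfileReport

    report = ProfileReport(df, title="EDA Report")
    report.to_file(path)
```

Calling write_profile_report(df) on any DataFrame produces a standalone HTML file that can be opened in a browser. Report generation can be slow on wide or large datasets, so it is worth trying on a sample first.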

Advanced Visualization

21. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures — Claus O. Wilke (2019)

A modern, comprehensive guide to choosing and creating effective visualizations. Wilke covers every chart type discussed in this chapter (histograms, scatter plots, box plots, heatmaps) plus advanced types (density plots, ridgeline plots, mosaic plots). Freely available online at clauswilke.com/dataviz.

22. plotly Documentation

URL: plotly.com/python

plotly creates interactive visualizations: charts that users can hover over, zoom into, and filter. While this chapter focused on static matplotlib/seaborn charts, interactive visualization is increasingly important for dashboards and web-based reporting. plotly's high-level interface, Plotly Express, takes a pandas DataFrame and column names much as seaborn does, which makes the transition from the tools in this chapter fairly straightforward.

23. Information Dashboard Design: Displaying Data for At-a-Glance Monitoring — Stephen Few (2013, 2nd edition)

Few specializes in the design of operational and analytical dashboards — the kind of persistent displays that business teams use daily. His principles for reducing visual noise and highlighting signal are directly applicable to the "design for the decision-maker" philosophy advocated by Professor Okonkwo.

The Challenger Case

24. Report of the Presidential Commission on the Space Shuttle Challenger Accident (Rogers Commission Report, 1986)

URL: history.nasa.gov/rogersrep/genindex.htm

The primary source document for Case Study 1. Chapter 6 contains the engineering analysis of the O-ring failure, including reproductions of the charts the Thiokol engineers presented. Reading the original charts alongside Tufte's critique makes the visualization failure viscerally clear.

25. "The Decision to Launch the Space Shuttle Challenger" — Roger Boisjoly (1987)

An account by the Morton Thiokol engineer who most strenuously argued against the launch. Boisjoly's perspective adds organizational and ethical dimensions to the data visualization failure — demonstrating how even correct data, presented by credible experts, can be overridden by institutional momentum.

Reading Path Recommendation

If you read three things: Start with Knaflic's Storytelling with Data (practical visualization for business), then Tufte's The Visual Display of Quantitative Information (foundational principles), then VanderPlas's Python Data Science Handbook Chapter 4 (matplotlib implementation).

If you read one thing: Knaflic's Storytelling with Data. It is the most directly applicable book to the business communication challenges that define the gap between a data analyst and a data leader.

If you want to go deep on Python visualization: Work through the matplotlib and seaborn official tutorials systematically, then install ydata-profiling and compare its output to the EDAReport class from this chapter.