Chapter 22 Further Reading: Data Analysis and Visualization

DataField.Dev

On AI and Data Analysis Productivity

"The State of AI" McKinsey Global Survey (Annual) McKinsey publishes an annual survey of AI adoption and business impact. The data on analyst productivity freed by AI tools is drawn from these surveys. Available free at mckinsey.com. Search for the most recent edition.

"GitHub Copilot Impact on Developer Productivity" (GitHub Research) GitHub has published multiple studies on how Copilot affects developer workflows, including data analysis tasks. Available at resources.github.com. Relevant for Tier 2 (code-assisted) data analysis users.

"Evaluating AI-Generated Code for Data Analysis" Multiple academic papers from 2023-2024 examine error rates in AI-generated data analysis code. Search Google Scholar for "LLM code generation accuracy data analysis" for current literature. Key finding: error rates are high enough to require review; errors are frequently silent (producing wrong results without failures).

Data Analysis Foundations

"Python for Data Analysis" by Wes McKinney O'Reilly Media, 3rd edition, 2022. The definitive guide to pandas, written by the library's creator. Essential reading for Tier 2 users who want to understand the code that AI generates. Chapter coverage of GroupBy operations and time series is particularly relevant for reviewing AI-generated analysis code.

"Storytelling with Data" by Cole Nussbaumer Knaflic Wiley, 2015. The best practical guide to data visualization design for business contexts. Establishes the standards for what good visualization looks like — making it invaluable for evaluating AI-generated charts. Covers chart type selection, axis design, labeling, and the narrative framing of data visuals.

"The Visual Display of Quantitative Information" by Edward Tufte Graphics Press, 2nd edition, 2001. The foundational work on data visualization design. Tufte's concept of "data-ink ratio" — the principle that every element of a chart should carry information — is the theoretical basis for critiquing AI's often over-decorated default charts.

"Statistics" by David Freedman, Robert Pisani, and Roger Purves W. W. Norton, 4th edition, 2007. A conceptual introduction to statistics that does not require calculus. Particularly strong on the distinction between correlation and causation, and on understanding what statistical claims can and cannot support. Useful background for the interpretation layer discussions in this chapter.

Data Visualization Tools and References

"Fundamentals of Data Visualization" by Claus Wilke O'Reilly Media, 2019. Available free online at clauswilke.com/dataviz. Comprehensive guide to when to use which chart type, and why. Directly applicable to both evaluating AI-generated visualization choices and prompting for better visualizations.

Matplotlib Documentation. matplotlib.org. The official documentation for the Python visualization library used in this chapter's code examples. The gallery section (matplotlib.org/stable/gallery) provides example code for every chart type, a useful reference when iterating on AI-generated visualization code.

Seaborn Documentation. seaborn.pydata.org. Documentation and gallery for seaborn, a statistical visualization library built on matplotlib. Seaborn is often easier to use for statistical plots (distributions, correlation matrices, regression plots) and is well supported by AI code-generation tools.

Data Privacy and Governance

"The Privacy Engineer's Manifesto" by Michelle Dennedy, Jonathan Fox, and Thomas Finneran Apress, 2014. A comprehensive treatment of privacy engineering principles. Relevant for understanding the technical and policy dimensions of data anonymization before using AI tools.

NIST Privacy Framework. Available free at nist.gov/privacy-framework. The US National Institute of Standards and Technology's framework for managing privacy risk. Useful for organizations building data governance policies that include AI tool use.

General Data Protection Regulation (GDPR), Article 25 (Data Protection by Design). Relevant for professionals working in EU contexts. The principle that privacy should be built into data processing by default, not added as an afterthought, applies directly to decisions about which data to use with AI analysis tools.
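As a rough illustration of preparing data before it goes to an AI tool, the sketch below replaces a direct identifier with a salted hash token. Everything in it is an assumption made for illustration: the `pseudonymize` helper, the column names, and the sample rows are hypothetical and do not come from the sources above. This is pseudonymization, not full anonymization: anyone holding the salt can recreate the tokens, the remaining columns may still identify people, and pseudonymized data is generally still treated as personal data under GDPR.

```python
import hashlib

import pandas as pd


def pseudonymize(df, columns, salt):
    """Return a copy of df with the given columns replaced by salted hash tokens.

    Hypothetical helper: the same input value always maps to the same token,
    so grouping and counting still work on the masked column.
    """
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype(str).map(
            lambda value: hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]
        )
    return out


# Hypothetical customer data containing a direct identifier.
customers = pd.DataFrame({
    "name": ["Ana Ruiz", "Ben Cho", "Ana Ruiz"],
    "spend": [120, 80, 45],
})

# Keep the real salt out of source control; this value is a placeholder.
safe = pseudonymize(customers, ["name"], salt="replace-with-a-secret")
print(safe)
```

Because the mapping is consistent, per-customer totals computed on `safe` match those computed on the original table, while the names themselves never leave your machine.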

Tools Referenced in This Chapter

ChatGPT Advanced Data Analysis: chat.openai.com. Requires a ChatGPT Plus subscription. Accepts file uploads (CSV, Excel), runs Python code in a sandboxed environment, generates charts, and provides written interpretation. The primary Tier 1 tool discussed in this chapter.

Claude: claude.ai. Used in this chapter for interpreting pasted statistics, synthesizing qualitative themes (working from user-provided notes), and generating code. Its extended context window is useful for analyzing large datasets represented as summaries.

Anthropic Python SDK: pypi.org/project/anthropic. The official SDK for the Anthropic API, used in the Python code examples. Requires an API key (available at console.anthropic.com).

Gemini in Google Sheets: workspace.google.com/products/gemini. Available with the Google Workspace Gemini add-on. Integrated directly into Google Sheets for natural language analysis, formula assistance, and chart generation.

Microsoft Copilot in Excel: microsoft.com/en-us/microsoft-365/copilot. Available with a Microsoft 365 Copilot license. Integrated into Excel for natural language queries, chart generation, and data summarization.

Pandas: pandas.pydata.org. The Python data analysis library used in all code examples. Version 2.0 and later change some DataFrame behaviors from earlier versions, so make sure AI-generated code uses syntax compatible with your installed version.
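One concrete case of the version issue: `DataFrame.append` was removed in pandas 2.0, yet generated code still sometimes calls it. The sketch below shows one way to check the installed version and to add a row with `pd.concat`, which works on both 1.x and 2.x. The helper names `pandas_major_version` and `append_row` are hypothetical and introduced only for illustration.

```python
import pandas as pd


def pandas_major_version(version_string=pd.__version__):
    """Return the major version number from a string such as '2.1.4'."""
    return int(version_string.split(".")[0])


def append_row(df, row):
    """Add one row (given as a dict) without using the removed DataFrame.append."""
    return pd.concat([df, pd.DataFrame([row])], ignore_index=True)


# Hypothetical example data.
sales = pd.DataFrame({"month": ["Jan"], "revenue": [1000]})
sales = append_row(sales, {"month": "Feb", "revenue": 1200})

print("pandas major version:", pandas_major_version())
print(sales)
```

Telling the AI tool which pandas version you have installed, or pasting the output of `pd.__version__` into the prompt, is a simple way to reduce this kind of mismatch.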