Chapter 22 Exercises: Data Analysis and Visualization

These exercises build practical skills in AI-assisted data analysis across the three tiers — chat-based, code-assisted, and interpretive. Complete exercises at the tier appropriate to your technical background; all professionals should complete Part A.


Part A: Chat-Based Analysis (All Skill Levels)

Exercise 1: Your First Advanced Data Analysis Session

If you have not used ChatGPT Advanced Data Analysis before, complete this exercise to build baseline familiarity.

Find a dataset you can use: your own work data (appropriately de-identified), a public dataset from data.gov or Kaggle, or any CSV or Excel file with at least 50 rows and 5 columns.

Upload it to ChatGPT Advanced Data Analysis with this prompt:

"I've uploaded a dataset. Please conduct a complete exploratory data analysis. Cover: the structure and shape of the data, basic summary statistics for all numerical columns, any missing values or data quality issues, and the three most interesting patterns you notice."

After receiving the output:

  1. Check three of the specific numbers against the original data.
  2. Note one thing the EDA flagged that you had not noticed.
  3. Note one thing the EDA missed or got wrong.

Exercise 2: Visualization Request Iteration

Using the dataset from Exercise 1 (or another one), practice the visualization iteration workflow:

  1. Ask for a visualization using a vague request: "Show me a chart of the data."
  2. Critique what you received and ask for a specific improvement.
  3. Ask for a completely different chart type for the same data.
  4. Ask AI to recommend the best chart type for a specific analytical question you have.

Reflect: how much did the specificity of the initial request affect the quality of the first visualization?

Exercise 3: Statistical Summary to Plain Language

Find any dataset with numerical data (or use the one from Exercise 1). Run a statistical summary and then ask AI to explain the results in plain language for three different audiences:

  1. A senior executive who is not quantitatively focused
  2. A team member who works with this data every day
  3. An external stakeholder unfamiliar with your domain

Compare the three explanations. What did AI include in each version? What did it leave out? Are the simplifications accurate?

Exercise 4: Data Quality Investigation

Create a small dataset with intentional data quality issues (or find one with quality problems in your own work data): missing values, outliers, inconsistent formatting, impossible values (negative quantities, dates in the future, etc.).
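
If you prefer to generate the flawed dataset programmatically, a minimal sketch is below. The column names and the specific mix of issues are illustrative, not prescribed by the exercise:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
n = 60

df = pd.DataFrame({
    "order_id": range(1, n + 1),
    "quantity": rng.integers(1, 20, n).astype(float),
    "order_date": pd.date_range("2024-01-01", periods=n).astype(str),
    "region": rng.choice(["North", "north", "NORTH", "East"], n),  # inconsistent formatting
})

# Intentional data quality issues for the audit to find
df.loc[3, "quantity"] = -5                              # impossible negative quantity
df.loc[7, "order_date"] = "2031-06-15"                  # date in the future
df.loc[[10, 11, 12], "quantity"] = np.nan               # missing values
df = pd.concat([df, df.iloc[[5]]], ignore_index=True)   # duplicate row

df.to_csv("flawed_sample.csv", index=False)
```

Keep your own list of the issues you planted so you can score the audit against it.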

Upload it to ChatGPT Advanced Data Analysis and ask for a data quality audit. Evaluate:

  - Did AI find all the issues you intentionally introduced?
  - Did it find any issues you had not noticed?
  - Did it misidentify anything as a quality issue that is actually correct data?

Exercise 5: Cross-Channel or Multi-Metric Analysis

Find or create a dataset with two or more related metrics (e.g., marketing spend and leads generated, website visits and conversions, study hours and test scores). Ask for:

  1. A correlation analysis between the two metrics
  2. A scatter plot visualizing the relationship
  3. An interpretation of what the correlation means
  4. A list of alternative explanations for the pattern that do not assume causation

Evaluate the interpretation carefully: does AI appropriately distinguish correlation from causation? Does it note the sample size?
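
To spot-check the correlation figure AI reports, a few lines of pandas suffice. The metric names and values below are illustrative; substitute your own columns:

```python
import pandas as pd

# Hypothetical paired metrics: monthly marketing spend and leads generated
df = pd.DataFrame({
    "spend": [1000, 1500, 2000, 2500, 3000, 3500],
    "leads": [40, 55, 62, 80, 88, 105],
})

# Pearson correlation coefficient between the two metrics
r = df["spend"].corr(df["leads"])
print(round(r, 3))
```

A high `r` here still says nothing about causation, and with only six data points the estimate is fragile, which is exactly the point of step 4 above.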


Part B: Interpretation Skills (All Skill Levels)

Exercise 6: Descriptive Stats to Business Insight

Paste the following summary statistics into Claude and ask for interpretation:

Monthly website visitors:
  Jan: 12,400 | Feb: 11,800 | Mar: 13,200 | Apr: 15,600 | May: 18,900 | Jun: 17,200

Conversion rate (visitors to trial signups):
  Jan: 2.1% | Feb: 2.0% | Mar: 2.3% | Apr: 2.4% | May: 1.9% | Jun: 2.0%

Trial to paid conversion:
  Jan: 18% | Feb: 19% | Mar: 21% | Apr: 22% | May: 23% | Jun: 24%

Ask for: (1) the most important pattern, (2) any concerns, (3) what additional data would help interpret this.

Then ask AI: "What alternative explanations exist for the May traffic spike combined with the conversion rate dip?" Evaluate the hypotheses. Which are most plausible?

Exercise 7: Challenging AI Interpretation

Take any AI-generated interpretation of data (from a previous exercise or new analysis). Ask AI:

  1. "What alternative interpretation of this data would tell a completely different story?"
  2. "What would need to be true in the underlying business for this interpretation to be wrong?"
  3. "Is this pattern practically significant? What effect size would matter for a business decision?"

How does the AI respond to being challenged? Does it maintain its original interpretation or revise it? What does this reveal about AI interpretation confidence?

Exercise 8: Causal Claim Audit

Find three AI-generated data interpretations (from your exercises or external sources). For each one:

  1. Identify any causal language ("X causes Y," "X leads to Y," "X results in Y").
  2. Evaluate whether the data supports a causal claim or only a correlational one.
  3. Rewrite the interpretation to be accurate about what the data actually supports.

Reflect: how often does AI use causal language when the data only supports correlation?


Part C: Python Code-Assisted Analysis (Technical Readers)

Exercise 9: AI-Generated Pandas Pipeline Review

Submit this request to an AI model:

"Write Python code using pandas to: (1) load a CSV file, (2) perform data cleaning — handle missing values, remove duplicates, and parse date columns, (3) calculate monthly aggregates for a 'revenue' column, (4) calculate month-over-month growth rates, and (5) generate a summary DataFrame with month, total_revenue, mom_growth, and a 3-month rolling average."

Read the code before running it. Identify:

  - How are missing values handled? Is the strategy appropriate?
  - Are the date parsing arguments explicit?
  - Is the rolling average calculated correctly (check the window parameter)?
  - Are there any edge cases the code does not handle?

Run the code with a test dataset. Verify two or three of the calculated values manually.
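
For reference while reviewing, here is one plausible shape of the requested pipeline. The `date` and `revenue` column names and the CSV layout are assumptions; your AI-generated version will differ:

```python
import pandas as pd

def monthly_revenue_summary(csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(csv_path)

    # Cleaning: remove duplicates, parse dates explicitly, drop rows missing key fields
    df = df.drop_duplicates()
    df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
    df = df.dropna(subset=["date", "revenue"])

    # Monthly aggregates
    monthly = (
        df.assign(month=df["date"].dt.to_period("M").astype(str))
          .groupby("month", as_index=False)["revenue"].sum()
          .rename(columns={"revenue": "total_revenue"})
    )

    # Month-over-month growth and 3-month rolling average
    monthly["mom_growth"] = monthly["total_revenue"].pct_change()
    monthly["rolling_3mo"] = monthly["total_revenue"].rolling(window=3).mean()

    return monthly
```

Note the review points baked in: dropping rows with missing revenue is only one of several defensible strategies, and `rolling(window=3)` produces `NaN` for the first two months, an edge case worth confirming the AI's version handles the same way.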

Exercise 10: Visualization Code Iteration

Ask AI to write matplotlib code for a specific chart you need for your work. Then:

  1. Run the code and examine the output.
  2. Identify three specific improvements (colors, fonts, labels, scale, etc.).
  3. Ask AI to update the code for each improvement.
  4. Practice reading the matplotlib code to understand what each element controls.
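
As a reading exercise, a minimal annotated example showing which line controls which visual element (the data here reuses the visitor figures from Exercise 6; the color and sizes are arbitrary choices):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
visitors = [12400, 11800, 13200, 15600, 18900, 17200]

fig, ax = plt.subplots(figsize=(8, 4))          # figure size in inches
ax.bar(months, visitors, color="#4C72B0")       # bar color

ax.set_title("Monthly Website Visitors", fontsize=14)
ax.set_ylabel("Visitors")
ax.yaxis.set_major_formatter(lambda x, _: f"{x:,.0f}")  # thousands separators
ax.spines[["top", "right"]].set_visible(False)          # remove chart borders

fig.tight_layout()
fig.savefig("visitors.png", dpi=150)
```

Each improvement you request in step 3 should map to an identifiable line like these.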

Exercise 11: Building the API Analysis Function

Using the analyze_dataset function from Section 4 of the chapter as a starting point, extend it with one of the following capabilities:

  - A function that analyzes the dataset and returns a prioritized list of columns to investigate further
  - A function that takes a natural language question about the dataset and returns both the answer and the code that would compute it
  - A function that compares two time periods of a dataset and returns a structured comparison of key metrics

Test your extension with real data. Verify that the AI-generated analysis in the API response is consistent with what you see in the data directly.
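
The first extension can be prototyped locally before wiring it into the API. One heuristic sketch follows; the scoring logic is an assumption for illustration, not the chapter's analyze_dataset implementation:

```python
import pandas as pd

def prioritize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Score columns by how much they likely merit further investigation."""
    rows = []
    for col in df.columns:
        s = df[col]
        missing = s.isna().mean()
        if pd.api.types.is_numeric_dtype(s):
            # Skewed numeric columns often hide outliers worth a closer look
            skew = abs(s.skew()) if s.notna().sum() > 2 else 0.0
            skew = 0.0 if pd.isna(skew) else float(skew)
            score = missing * 2 + min(skew, 5) / 5
        else:
            # High-cardinality text columns may be messy identifiers or free text
            uniqueness = s.nunique(dropna=True) / max(len(s), 1)
            score = missing * 2 + uniqueness
        rows.append({"column": col, "pct_missing": missing, "score": score})
    return pd.DataFrame(rows).sort_values("score", ascending=False, ignore_index=True)
```

Comparing this deterministic ranking against the AI's prioritization is itself a useful consistency check for the exercise's verification step.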

Exercise 12: Log Analysis Pipeline

Following the Raj scenario from Section 10, build a simplified log analysis pipeline. You can either:

  - Use server log data from your own work
  - Generate synthetic log data using AI: "Generate a Python script that creates synthetic web server logs in JSON format for 7 days, including occasional anomalies"

Ask AI to write code that:

  1. Loads and parses the log data
  2. Calculates response time percentiles (P50, P95, P99) by hour
  3. Identifies time windows where P95 exceeds a threshold you define
  4. Produces a visualization of the percentile trends over time

Read and understand the code before running. Verify the percentile calculations with a manual check on a small subset.
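
For steps 1 to 3, a compact reference implementation you can check the AI's version against. The field names `timestamp` and `response_ms` and the JSON-lines layout are assumptions about your log format:

```python
import json
import pandas as pd

def hourly_percentiles(log_path: str, p95_threshold_ms: float) -> pd.DataFrame:
    """Compute hourly response-time percentiles and flag threshold breaches."""
    with open(log_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    df = pd.DataFrame(records)

    # Bucket each record into its hour
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df["hour"] = df["timestamp"].dt.strftime("%Y-%m-%d %H:00")

    # P50/P95/P99 per hour, then flag hours where P95 exceeds the threshold
    pct = df.groupby("hour")["response_ms"].quantile([0.50, 0.95, 0.99]).unstack()
    pct.columns = ["p50", "p95", "p99"]
    pct["breach"] = pct["p95"] > p95_threshold_ms
    return pct.reset_index()
```

For the manual check, sort one hour's response times by hand and confirm the interpolated percentile matches; pandas uses linear interpolation by default, and a different `interpolation` choice in the AI's code will produce slightly different numbers.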


Part D: Data Privacy and Ethics

Exercise 13: Data Privacy Classification

Review the following data types and classify each as: (a) safe to upload to consumer AI tools, (b) requires organizational approval, or (c) do not upload without anonymization:

  1. Monthly sales totals by product category (no customer-level data)
  2. Employee names and satisfaction survey scores
  3. Website session data with anonymized user IDs
  4. Customer email addresses and purchase history
  5. Aggregated survey results showing department-level averages
  6. Medical device performance data with patient identifiers removed
  7. Publicly available economic data from a government website
  8. Internal financial projections for an unannounced product

For each classification, explain your reasoning. Check your classifications against your organization's data governance policy.

Exercise 14: Anonymization Practice

Take a dataset that contains some personal or sensitive information (create a synthetic one if needed). Before "uploading" it (you do not need to actually upload it for this exercise):

  1. Identify every field that could be used to identify individuals
  2. Remove or hash identifiers
  3. Aggregate fields that create re-identification risk at low row counts
  4. Document the anonymization steps you took

Evaluate: is the resulting dataset safe to use with an external AI tool? What risk remains?
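
Steps 2 and 3 can be scripted. A sketch follows, assuming hypothetical `email`, `name`, and `age` columns; adapt the field list to your own data, and note that salted hashing and banding reduce but do not eliminate re-identification risk:

```python
import hashlib
import pandas as pd

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Replace direct identifiers with salted hashes; keep the salt secret
    salt = "replace-with-a-secret-salt"
    out["customer_id"] = out["email"].map(
        lambda e: hashlib.sha256((salt + e).encode()).hexdigest()[:12]
    )
    out = out.drop(columns=["email", "name"])

    # Coarsen quasi-identifiers that enable re-identification at low row counts
    out["age_band"] = pd.cut(out["age"], bins=[0, 30, 50, 120],
                             labels=["<30", "30-49", "50+"])
    out = out.drop(columns=["age"])
    return out
```

The residual risk question in the evaluation still applies: rare combinations of remaining fields (e.g., one person in a small department within an age band) can re-identify individuals even after this transformation.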


Part E: Synthesis and Application

Exercise 15: End-to-End Analysis Project

Choose a business question you genuinely want to answer using data available to you. Complete a full AI-assisted analysis:

  1. Define the question precisely before touching any tools
  2. Identify the data you need and obtain it (or use a proxy dataset)
  3. Conduct EDA and data quality assessment
  4. Run your specific analysis
  5. Generate appropriate visualizations
  6. Write an interpretation (human-authored, with AI assistance)
  7. Identify what questions the analysis raises but cannot answer

Document the tools and AI interactions at each step. Write a one-paragraph reflection on where AI added the most value and where it added the most risk.

Exercise 16: Verification Audit

Take any AI-generated data analysis output (from your exercises or elsewhere). For every specific number in the output:

  1. Can you trace it to the underlying data?
  2. Does a manual calculation confirm it?
  3. Is the context (time period, population, units) correctly stated?

Document the error rate: of all specific numbers, how many are exactly correct, approximately correct, or incorrect? What does this imply for your verification workflow?
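
If you are checking many numbers, a small helper can standardize the exact/approximate/incorrect classification. This is an illustrative convention, not a standard; the 1% tolerance is an arbitrary choice you should set per context:

```python
def verify_claim(df, claimed_value, compute, tolerance=0.01):
    """Recompute a claimed figure from raw data and classify the match."""
    actual = compute(df)
    if actual == claimed_value:
        return "exact"
    if abs(actual - claimed_value) <= tolerance * max(abs(actual), 1e-12):
        return "approximate"
    return "incorrect"
```

Usage: `verify_claim(df, 18900, lambda d: d["visitors"].sum())` recomputes the claimed total directly from the data and tells you which bucket it falls into.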

Exercise 17: Non-Technical Audience Communication

Take a data analysis result (your own or from an exercise). Create two versions of a data summary:

  1. A version for a technical audience (full statistics, methodology notes, confidence intervals)
  2. A version for a non-technical executive audience (key finding, one chart, plain language interpretation)

Use AI to assist with both versions. Evaluate: does AI appropriately adjust technical language and detail level between versions? What did you have to correct or add?


Instructors: Exercise 1 (First Advanced Data Analysis Session) should be completed early in the course to establish baseline familiarity with the tool. Exercise 15 (End-to-End Analysis Project) is suitable as a major graded assignment — require students to document their AI interactions and submit a reflection on verification findings.