Chapter 5 Exercises: Exploratory Data Analysis
Section A: Statistics Interpretation (Exercises 1-6)
Exercise 1: Reading Summary Statistics
A retail analytics team presents the following summary statistics for daily revenue across 365 days:
| Statistic | Value |
|---|---|
| Mean | $47,200 |
| Median | $31,800 |
| Std Dev | $38,500 |
| Min | $4,100 |
| Q1 (25th) | $18,900 |
| Q3 (75th) | $62,400 |
| Max | $312,000 |
| Skewness | 2.41 |
(a) What does the gap between mean and median tell you about the distribution shape? Is this typical for revenue data? Why?
(b) Calculate the IQR. Using the 1.5x IQR rule, what are the lower and upper bounds for outlier detection? How many "extreme" revenue days might you expect?
(c) A new analyst reports to the CFO that "our average daily revenue is $47,200." Is this a fair representation? What would you report instead, and why?
(d) The skewness value is 2.41. In plain English, what does this tell a business audience?
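For part (b), the bound arithmetic can be checked with a few lines of Python. The numbers come straight from the table above; the interpretation is still yours to write:

```python
# Quartiles from the summary table
q1, q3 = 18_900, 62_400

iqr = q3 - q1              # interquartile range
lower = q1 - 1.5 * iqr     # lower outlier bound (note: this can go negative)
upper = q3 + 1.5 * iqr     # upper outlier bound

print(f"IQR = ${iqr:,}; bounds = ${lower:,.0f} to ${upper:,.0f}")
```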
Exercise 2: Comparing Two Distributions
Two product lines at Athena Retail Group have the following order value statistics:
| Metric | Electronics | Apparel |
|---|---|---|
| Mean | $285 | $62 |
| Median | $210 | $55 |
| Std Dev | $195 | $28 |
| Skewness | 1.8 | 0.4 |
| Kurtosis | 4.2 | -0.3 |
(a) Which product line has more predictable order values? How do you know?
(b) The electronics category has kurtosis of 4.2. What does this mean in practical terms for inventory planning or financial forecasting?
(c) If the company sets a "high-value order" threshold at the 90th percentile for each category, which category's threshold would you expect to be farther from its mean? Why?
(d) Write a two-sentence executive summary comparing these distributions for a VP of Merchandising.
Exercise 3: Percentile Analysis
A SaaS company tracks time-to-resolution for customer support tickets (in hours):
| Percentile | Value |
|---|---|
| P10 | 0.5 |
| P25 | 1.2 |
| P50 (Median) | 3.8 |
| P75 | 8.5 |
| P90 | 24.0 |
| P95 | 48.0 |
| P99 | 168.0 |
(a) The company's SLA guarantees resolution within 24 hours. What percentage of tickets meet this SLA?
(b) What does the jump from P90 (24 hours) to P99 (168 hours = 7 days) tell you about the data? What business question would you ask next?
(c) Management is considering changing the SLA metric from "mean resolution time" to "P90 resolution time." Argue for or against this change.
Exercise 4: Correlation Interpretation
A marketing analyst presents the following correlation matrix excerpt:
| | Ad Spend | Impressions | Click Rate | Conversions | Revenue |
|---|---|---|---|---|---|
| Ad Spend | 1.00 | 0.92 | -0.15 | 0.61 | 0.58 |
| Impressions | 0.92 | 1.00 | -0.22 | 0.55 | 0.52 |
| Click Rate | -0.15 | -0.22 | 1.00 | 0.38 | 0.42 |
| Conversions | 0.61 | 0.55 | 0.38 | 1.00 | 0.89 |
| Revenue | 0.58 | 0.52 | 0.42 | 0.89 | 1.00 |
(a) Why is the correlation between Ad Spend and Impressions so high (0.92)? Is this surprising? Does it suggest causation?
(b) Click Rate has a negative correlation with both Ad Spend (-0.15) and Impressions (-0.22). Propose a business explanation for this pattern.
(c) Conversions and Revenue have a correlation of 0.89. Does this mean that maximizing conversions will maximize revenue? What might be missing from this analysis?
(d) If you could add one more variable to this matrix to improve strategic insight, what would it be and why?
Exercise 5: Missing Data Diagnosis
A hospital dataset has the following missing data pattern:
| Column | % Missing | Notes |
|---|---|---|
| Patient Age | 0.2% | Randomly scattered |
| Blood Pressure | 3.1% | Higher in emergency admissions |
| Income | 22.0% | Higher for uninsured patients |
| Follow-up Visit | 35.0% | Higher for patients with severe diagnoses |
| Satisfaction Score | 41.0% | Higher for patients who left AMA |
(a) Classify each column's missingness as MCAR, MAR, or MNAR. Justify each classification.
(b) For the "Follow-up Visit" column (35% missing, likely MNAR), explain why simply dropping rows with missing values would introduce bias. What population would be underrepresented?
(c) Propose a strategy for handling the missing "Income" data. Could you use other variables to impute it? Which ones?
(d) The "Satisfaction Score" is 41% missing, and missingness correlates with patients who left against medical advice. A hospital administrator wants to report "average patient satisfaction: 4.2/5." What warning would you give?
Exercise 6: Chart Critique
Consider a chart with the following properties:
- 3D pie chart showing market share of 8 competitors
- Each slice has a gradient fill and a shadow
- The smallest slice (2.1%) is labeled but nearly invisible
- The chart title reads "Market Share"
- No data source or time period is indicated
(a) List at least four specific problems with this visualization, referencing Tufte's principles.
(b) Redesign this chart. What chart type would you use? What would the title be? What information would you add?
(c) Apply the "So What?" test. Write a chart title that communicates an insight, not just a description.
Section B: Visualization Creation (Exercises 7-12)
Exercise 7: Basic Distribution Plot
Write Python code to create a histogram of the employee salary data below. Requirements:
- Use matplotlib, not seaborn
- Include vertical lines for mean and median with a legend
- Remove top and right spines
- Title should state an insight, not just describe the data
- Save the figure to a PNG file
```python
import numpy as np

np.random.seed(42)
salaries = np.random.lognormal(mean=11.0, sigma=0.5, size=500)
```
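If you get stuck, here is one possible skeleton that covers the mechanical requirements; the insight-driven title is the part you must supply yourself:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

np.random.seed(42)
salaries = np.random.lognormal(mean=11.0, sigma=0.5, size=500)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(salaries, bins=40, color="steelblue", edgecolor="white")

# Mean and median reference lines; expect them to diverge for lognormal data
ax.axvline(salaries.mean(), color="darkred", linestyle="--",
           label=f"Mean: ${salaries.mean():,.0f}")
ax.axvline(np.median(salaries), color="black",
           label=f"Median: ${np.median(salaries):,.0f}")
ax.legend()

# Remove top and right spines
for spine in ("top", "right"):
    ax.spines[spine].set_visible(False)

ax.set_title("Write your insight here, not 'Salary Distribution'")
fig.savefig("salary_histogram.png", dpi=150, bbox_inches="tight")  # example path
```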
Exercise 8: Grouped Bar Chart
Using the following data, create a horizontal grouped bar chart comparing Q1 and Q2 performance across four regions:
```python
import pandas as pd

data = {
    'Region': ['North', 'South', 'East', 'West'],
    'Q1_Revenue': [4200000, 5800000, 3100000, 4600000],
    'Q2_Revenue': [4500000, 5200000, 3900000, 4800000]
}
df_revenue = pd.DataFrame(data)
```
Requirements:
- Use distinct but non-garish colors
- Include value labels on each bar (formatted as $X.XM)
- Indicate which regions grew and which declined (hint: use annotations or color coding)
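The `$X.XM` label format can be handled by a small helper function, sketched here:

```python
def millions(value):
    """Format a revenue figure as $X.XM for use as a bar label."""
    return f"${value / 1_000_000:.1f}M"

print(millions(4_200_000))  # $4.2M
print(millions(5_800_000))  # $5.8M
```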
Exercise 9: Correlation Heatmap
Write Python code using seaborn to create a correlation heatmap for a dataset of your choice (you may use the sample retail dataset from the chapter or create your own). Your heatmap must:
- Show only the lower triangle (mask the upper)
- Use a diverging color palette centered at zero
- Include correlation values as annotations
- Have a descriptive title
- Print the three strongest correlations below the heatmap (as text output)
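The lower-triangle mask is the least obvious requirement: the standard idiom uses `numpy.triu` to hide the upper triangle (including the diagonal). A minimal sketch on synthetic data:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Small synthetic frame purely to demonstrate the masking idiom
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("ABCD"))
corr = df.corr()

# True on and above the diagonal -> those cells are hidden by the mask
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, cmap="coolwarm", center=0, annot=True, fmt=".2f")
plt.title("Replace with a sentence about the strongest relationship")
```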
Exercise 10: Box Plot Comparison
Create a figure with four side-by-side box plots comparing days_since_last_purchase across four customer segments: "New" (0-90 days as customer), "Growing" (91-365), "Established" (366-730), "Veteran" (731+).
```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 800
customer_tenure_days = np.random.exponential(400, n).astype(int)
days_since_purchase = np.random.exponential(45, n).astype(int)
# Add correlation: longer-tenure customers have slightly longer gaps
days_since_purchase = days_since_purchase + (customer_tenure_days * 0.02).astype(int)

df_box = pd.DataFrame({
    'tenure_days': customer_tenure_days,
    'days_since_purchase': days_since_purchase
})

# Create segments
df_box['segment'] = pd.cut(
    df_box['tenure_days'],
    bins=[0, 90, 365, 730, 99999],
    labels=['New', 'Growing', 'Established', 'Veteran']
)
```
Requirements:
- Use seaborn's boxplot or violinplot
- Add a horizontal reference line at 60 days (the "at risk" threshold)
- Annotate the median value above each box
- Write a one-sentence insight as the chart title
Exercise 11: Small Multiples
Create a 2x2 grid of scatter plots showing the relationship between two variables across four regions. Use the sample retail dataset from the chapter. Each panel should show one region, and all panels should use the same axis scales for comparison.
Exercise 12: Missing Data Visualization
Using the df_missing dataset created in Section 5.8 of the chapter, create a two-panel figure:
- Left panel: A bar chart showing the percentage of missing values per column (only columns with missing data)
- Right panel: A heatmap showing co-occurrence of missingness (do missing values in column A tend to co-occur with missing values in column B?)
Hint for the right panel: compute the correlation matrix of the binary "is missing" flags.
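The hint boils down to one line of pandas. A minimal sketch with toy data in which `email` and `phone` missingness co-occurs by construction:

```python
import numpy as np
import pandas as pd

# Toy frame where missingness in 'email' and 'phone' co-occurs by construction
rng = np.random.default_rng(7)
contact_gap = rng.random(500) < 0.2
df = pd.DataFrame({
    "email": np.where(contact_gap, np.nan, 1.0),
    "phone": np.where(contact_gap, np.nan, 1.0),
    "age":   np.where(rng.random(500) < 0.1, np.nan, 1.0),
})

# Binary is-missing flags, then their correlation matrix (the right-panel input)
miss_corr = df.isna().astype(int).corr()
print(miss_corr.round(2))
```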
Section C: EDA Workflow (Exercises 13-18)
Exercise 13: Full EDA — Employee Attrition
You are given an HR dataset with the following columns:
- employee_id, age, department, tenure_years, salary, performance_rating (1-5), satisfaction_score (1-10), overtime_hours_monthly, is_attrited (0/1)
Without writing code, outline a complete EDA workflow:
(a) What are the first five things you would check?
(b) What distributions would you expect to be skewed, and in which direction?
(c) Which correlations would you investigate first, and why?
(d) What missing data patterns would concern you most?
(e) Write three hypotheses that the EDA should test (or at least inform).
Exercise 14: Business Question to EDA Plan
A VP of Sales asks: "Why did our close rate drop from 32% to 24% last quarter?"
(a) List at least five data fields you would want in your dataset to investigate this question.
(b) Describe three specific visualizations you would create, explaining what each would reveal.
(c) What would a "smoking gun" finding look like? What would an ambiguous finding look like?
(d) Draft the SCQA structure (Situation, Complication, Question, Answer) you might use to present your findings, filling in plausible details.
Exercise 15: EDA on Real-World Data
Download a public dataset from one of the following sources:
- Kaggle (search for "customer churn," "sales data," or "employee attrition")
- UCI Machine Learning Repository
- data.gov
Run the EDAReport class from this chapter on the dataset. Then answer:
(a) What was the most surprising finding?
(b) Which columns had the most missing data? Was the missingness likely MCAR, MAR, or MNAR?
(c) What is the strongest correlation you found? Does it make business sense?
(d) Write a three-paragraph executive summary of your findings.
Exercise 16: Before-and-After EDA
You're told that a previous data science team built a customer lifetime value (CLV) model with the following features: age, income, tenure, total_purchases, avg_order_value, num_returns, support_tickets.
The model has an R-squared of 0.45 (explaining 45% of variance in CLV).
(a) What EDA steps would you perform on the existing features to understand why the model isn't performing better?
(b) Propose three new features you might engineer based on EDA insights (e.g., ratios, interactions, binned variables).
(c) What would you look for in a scatter plot of predicted CLV vs. actual CLV? What patterns would indicate specific problems?
Exercise 17: EDA for Anomaly Detection
An e-commerce company notices that last Tuesday's revenue was unusually high ($890,000 vs. the typical $120,000-$180,000 range).
(a) Design an EDA investigation to determine whether this is a genuine spike or a data error. What five things would you check, in order?
(b) For each check, describe what "error" would look like vs. what "genuine spike" would look like.
(c) If the spike is genuine, what follow-up EDA would help understand its cause?
Exercise 18: Cross-Functional EDA
You are on a cross-functional team with members from Marketing, Finance, Operations, and Product. Each team member has a different question about the same customer dataset.
- Marketing: "Which customer segments are most engaged?"
- Finance: "What's our revenue concentration risk?"
- Operations: "Where are the fulfillment bottlenecks?"
- Product: "Which features drive repeat purchases?"
(a) For each question, identify which EDA technique (descriptive stats, distribution analysis, correlation, segmented comparison, or missing data analysis) is most relevant.
(b) Design a single EDA report that addresses all four questions. What visualizations would you include?
(c) How would you structure a 15-minute presentation of your findings to this cross-functional audience?
Section D: Data Storytelling (Exercises 19-22)
Exercise 19: Chart Title Rewrite
Rewrite the following chart titles to communicate insights rather than descriptions:
- "Revenue by Month, 2024"
- "Customer Count by Segment"
- "Average Response Time by Agent"
- "Churn Rate by Channel"
- "Product Returns by Category"
For each, write both a descriptive title and an insight-driven title (the insight can be hypothetical).
Exercise 20: SCQA Narrative
Using the Athena Retail Group findings from this chapter (online customers: high value, high churn; returns driven by category, not customer; satisfaction scores don't predict behavior), write a complete SCQA narrative as if presenting to Athena's CEO. Your narrative should:
- Be 300-500 words
- Include at least one specific metric in each section
- End with a clear recommendation
- Anticipate one likely objection and address it
Exercise 21: Audience Adaptation
You have discovered that a company's mobile app users spend 2.3x more per transaction than web users, but the app has a 4.1-star rating vs. 4.6 for the web experience.
Write three versions of this finding:
(a) For a board of directors (one sentence, maximum impact)
(b) For a product team (one paragraph, actionable detail)
(c) For a data science team (technical detail, including what to investigate next)
Exercise 22: Executive Summary from Raw Output
Given the following EDAReport output excerpt, write a 200-word executive summary suitable for a non-technical audience:
```
Dataset: Customer_Transactions_2024.csv
Rows: 45,823
Columns: 14 (9 numeric, 5 categorical)
Missing Data:
  email                   12.3% (5,637 values)
  phone                    8.1% (3,712 values)
  loyalty_tier             0.0%
  purchase_amount          0.0%
Skewed Columns:
  purchase_amount          (right-skewed, skew=3.2)
  days_between_purchases   (right-skewed, skew=2.8)
Top Correlations:
  purchase_amount x loyalty_tier_numeric      r = +0.62
  days_between_purchases x is_churned         r = +0.48
  customer_age x total_lifetime_purchases     r = +0.31
```
Section E: EDAReport Extensions (Exercises 23-30)
Exercise 23: Add a Data Type Summary Method
Add a method called type_summary() to the EDAReport class that returns a DataFrame with one row per column, showing: column name, data type, number of unique values, number of missing values, and a sample value. This is often the very first thing an analyst looks at.
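As a starting point, here is a standalone sketch of that logic; you would adapt it into a method that reads `self.df`:

```python
import pandas as pd

def type_summary(df):
    """One row per column: name, dtype, unique count, missing count, sample value."""
    rows = []
    for col in df.columns:
        non_null = df[col].dropna()
        rows.append({
            "column": col,
            "dtype": str(df[col].dtype),
            "n_unique": df[col].nunique(),
            "n_missing": int(df[col].isna().sum()),
            "sample": non_null.iloc[0] if len(non_null) else None,
        })
    return pd.DataFrame(rows)

demo = pd.DataFrame({"age": [34, None, 51], "dept": ["HR", "Eng", "Eng"]})
print(type_summary(demo))
```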
Exercise 24: Add Automated Chart Saving
Modify the plot_distributions() and plot_correlations() methods to accept an optional save_path parameter. When provided, the figures should be saved to that path as PNG files with descriptive filenames (e.g., distributions_customer_analysis.png).
Exercise 25: Add a Target Variable Analysis
Add a method called target_analysis(target_col) that, given a binary target column (like is_churned):
- Computes the base rate (what % of records are positive)
- For each numeric column, compares the mean, median, and standard deviation between the two groups
- For each categorical column, computes the target rate within each category
- Flags columns where the difference between groups exceeds a user-specified threshold
This method turns the EDAReport into a feature-screening tool — a bridge to supervised learning in Chapter 7.
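The group-comparison step at the heart of `target_analysis()` can be sketched with a plain `groupby`; the column names here are illustrative:

```python
import numpy as np
import pandas as pd

# Toy data with a binary target; the real column names will differ
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "is_churned": rng.integers(0, 2, 400),
    "tenure_years": rng.exponential(3, 400),
})
df.loc[df["is_churned"] == 1, "tenure_years"] *= 0.5  # churners leave early

base_rate = df["is_churned"].mean()
print(f"Base rate: {base_rate:.1%}")

# Mean/median/std of a numeric column, split by the two target groups
comparison = df.groupby("is_churned")["tenure_years"].agg(["mean", "median", "std"])
print(comparison)
```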
Exercise 26: Add a Duplicate Detection Method
Add a method called duplicate_report() that:
- Counts exact duplicate rows
- Counts near-duplicate rows (matching on a user-specified subset of columns)
- Prints examples of duplicates found
- Returns the deduplicated DataFrame
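The pandas primitives this method needs already exist; a minimal demonstration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ana", "Ben", "Ben"],
    "city": ["NYC", "NYC", "LA", "SF"],
})

exact = df.duplicated().sum()                # full-row duplicates (keeps first)
near = df.duplicated(subset=["name"]).sum()  # duplicates on a column subset
deduped = df.drop_duplicates()

print(exact, near, len(deduped))  # 1 2 3
```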
Exercise 27: Add Automated Visualization Recommendations
Add a method called suggest_charts() that, based on the data types and distributions detected, prints a recommended visualization plan. For example:
- Right-skewed numeric columns: "Consider log-scale histogram for annual_income"
- High-cardinality categorical columns: "Use top-N bar chart for product_id (5,000+ unique values)"
- Strong correlations: "Create scatter plot for total_purchases vs. avg_order_value (r = 0.65)"
Exercise 28: HTML Report Generation
Extend the EDAReport class with a to_html() method that generates a self-contained HTML report including:
- All text-based summaries formatted in HTML tables
- Embedded matplotlib figures (saved as base64-encoded images within the HTML)
- A table of contents with anchor links
Hint: Use base64 and io.BytesIO to embed images without saving separate files.
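The embedding trick from the hint, shown in isolation:

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])

# Render the figure into an in-memory buffer instead of a file on disk
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)

# Encode as base64 and wrap in an <img> tag the HTML report embeds directly
encoded = base64.b64encode(buf.getvalue()).decode("ascii")
img_tag = f'<img src="data:image/png;base64,{encoded}" alt="chart"/>'
print(img_tag[:60])
```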
Exercise 29: Time Series EDA Extension
Add a method called time_series_summary(date_col, value_col) that:
- Detects the time granularity (daily, weekly, monthly)
- Plots the time series with a rolling average
- Identifies trend, seasonality, and any structural breaks
- Reports the period with highest and lowest values
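The rolling-average and extreme-period pieces are one-liners in pandas; a minimal sketch with a synthetic daily series:

```python
import numpy as np
import pandas as pd

# Synthetic daily series; the real method would take date_col/value_col instead
idx = pd.date_range("2024-01-01", periods=90, freq="D")
rng = np.random.default_rng(5)
ts = pd.Series(100 + rng.normal(0, 10, 90).cumsum(), index=idx)

rolling = ts.rolling(window=7).mean()   # 7-day rolling average for the plot
print("Highest:", ts.idxmax().date(), "Lowest:", ts.idxmin().date())
```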
Exercise 30: Integration Challenge
Combine the EDAReport class with the pandas workflow from Chapter 3. Write a script that:
1. Loads a CSV file (use a public dataset or the sample from the chapter)
2. Performs data cleaning (handle missing values, correct types)
3. Runs the full EDAReport
4. Saves the text report and all figures
5. Prints three actionable recommendations based on the findings
This exercise is intentionally open-ended — it mirrors what you'd actually do on the first day of a data project.
Submission Guidelines
- For coding exercises, submit well-commented Python scripts or Jupyter notebooks
- For interpretation exercises, write clear, concise answers (aim for business clarity, not academic length)
- For the SCQA narrative (Exercise 20), treat this as a professional deliverable — formatting, grammar, and persuasiveness matter
- For EDAReport extensions, include both the code and a demonstration of the method working on sample data