Chapter 5 Exercises: Exploratory Data Analysis
Section A: Statistics Interpretation (Exercises 1-6)
Exercise 1: Reading Summary Statistics
A retail analytics team presents the following summary statistics for daily revenue across 365 days:
| Statistic | Value |
|---|---|
| Mean | $47,200 |
| Median | $31,800 |
| Std Dev | $38,500 |
| Min | $4,100 |
| Q1 (25th) | $18,900 |
| Q3 (75th) | $62,400 |
| Max | $312,000 |
| Skewness | 2.41 |
(a) What does the gap between mean and median tell you about the distribution shape? Is this typical for revenue data? Why?
(b) Calculate the IQR. Using the 1.5x IQR rule, what are the lower and upper bounds for outlier detection? How many "extreme" revenue days might you expect?
(c) A new analyst reports to the CFO that "our average daily revenue is $47,200." Is this a fair representation? What would you report instead, and why?
(d) The skewness value is 2.41. In plain English, what does this tell a business audience?
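For part (b), the bound arithmetic can be checked with a few lines of Python. The numbers come straight from the table above; the interpretation is still yours to write:

```python
# Quartiles from the summary table
q1, q3 = 18_900, 62_400

iqr = q3 - q1              # interquartile range
lower = q1 - 1.5 * iqr     # lower outlier bound (note: this can go negative)
upper = q3 + 1.5 * iqr     # upper outlier bound

print(f"IQR = ${iqr:,}; bounds = ${lower:,.0f} to ${upper:,.0f}")
```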
Exercise 2: Comparing Two Distributions
Two product lines at Athena Retail Group have the following order value statistics:
| Metric | Electronics | Apparel |
|---|---|---|
| Mean | $285 | $62 |
| Median | $210 | $55 |
| Std Dev | $195 | $28 |
| Skewness | 1.8 | 0.4 |
| Kurtosis | 4.2 | -0.3 |
(a) Which product line has more predictable order values? How do you know?
(b) The electronics category has kurtosis of 4.2. What does this mean in practical terms for inventory planning or financial forecasting?
(c) If the company sets a "high-value order" threshold at the 90th percentile for each category, which category's threshold would you expect to be farther from its mean? Why?
(d) Write a two-sentence executive summary comparing these distributions for a VP of Merchandising.
Exercise 3: Percentile Analysis
A SaaS company tracks time-to-resolution for customer support tickets (in hours):
| Percentile | Value |
|---|---|
| P10 | 0.5 |
| P25 | 1.2 |
| P50 (Median) | 3.8 |
| P75 | 8.5 |
| P90 | 24.0 |
| P95 | 48.0 |
| P99 | 168.0 |
(a) The company's SLA guarantees resolution within 24 hours. What percentage of tickets meet this SLA?
(b) What does the jump from P90 (24 hours) to P99 (168 hours = 7 days) tell you about the data? What business question would you ask next?
(c) Management is considering changing the SLA metric from "mean resolution time" to "P90 resolution time." Argue for or against this change.
Exercise 4: Correlation Interpretation
A marketing analyst presents the following correlation matrix excerpt:
| | Ad Spend | Impressions | Click Rate | Conversions | Revenue |
|---|---|---|---|---|---|
| Ad Spend | 1.00 | 0.92 | -0.15 | 0.61 | 0.58 |
| Impressions | 0.92 | 1.00 | -0.22 | 0.55 | 0.52 |
| Click Rate | -0.15 | -0.22 | 1.00 | 0.38 | 0.42 |
| Conversions | 0.61 | 0.55 | 0.38 | 1.00 | 0.89 |
| Revenue | 0.58 | 0.52 | 0.42 | 0.89 | 1.00 |
(a) Why is the correlation between Ad Spend and Impressions so high (0.92)? Is this surprising? Does it suggest causation?
(b) Click Rate has a negative correlation with both Ad Spend (-0.15) and Impressions (-0.22). Propose a business explanation for this pattern.
(c) Conversions and Revenue have a correlation of 0.89. Does this mean that maximizing conversions will maximize revenue? What might be missing from this analysis?
(d) If you could add one more variable to this matrix to improve strategic insight, what would it be and why?
Exercise 5: Missing Data Diagnosis
A hospital dataset has the following missing data pattern:
| Column | % Missing | Notes |
|---|---|---|
| Patient Age | 0.2% | Randomly scattered |
| Blood Pressure | 3.1% | Higher in emergency admissions |
| Income | 22.0% | Higher for uninsured patients |
| Follow-up Visit | 35.0% | Higher for patients with severe diagnoses |
| Satisfaction Score | 41.0% | Higher for patients who left AMA |
(a) Classify each column's missingness as MCAR, MAR, or MNAR. Justify each classification.
(b) For the "Follow-up Visit" column (35% missing, likely MNAR), explain why simply dropping rows with missing values would introduce bias. What population would be underrepresented?
(c) Propose a strategy for handling the missing "Income" data. Could you use other variables to impute it? Which ones?
(d) The "Satisfaction Score" is 41% missing, and missingness correlates with patients who left against medical advice. A hospital administrator wants to report "average patient satisfaction: 4.2/5." What warning would you give?
Exercise 6: Chart Critique
Consider a chart with the following properties:
- 3D pie chart showing market share of 8 competitors
- Each slice has a gradient fill and a shadow
- The smallest slice (2.1%) is labeled but nearly invisible
- The chart title reads "Market Share"
- No data source or time period is indicated
(a) List at least four specific problems with this visualization, referencing Tufte's principles.
(b) Redesign this chart. What chart type would you use? What would the title be? What information would you add?
(c) Apply the "So What?" test. Write a chart title that communicates an insight, not just a description.
Section B: Visualization Creation (Exercises 7-12)
Exercise 7: Basic Distribution Plot
Write Python code to create a histogram of the employee salary data below. Requirements:
- Use matplotlib, not seaborn
- Include vertical lines for mean and median with a legend
- Remove top and right spines
- Title should state an insight, not just describe the data
- Save the figure to a PNG file
```python
import numpy as np

np.random.seed(42)
salaries = np.random.lognormal(mean=11.0, sigma=0.5, size=500)
```
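If you get stuck, here is one possible skeleton that covers the mechanical requirements; the insight-driven title is the part you must supply yourself:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

np.random.seed(42)
salaries = np.random.lognormal(mean=11.0, sigma=0.5, size=500)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(salaries, bins=40, color="steelblue", edgecolor="white")

# Mean and median reference lines; expect them to diverge for lognormal data
ax.axvline(salaries.mean(), color="darkred", linestyle="--",
           label=f"Mean: ${salaries.mean():,.0f}")
ax.axvline(np.median(salaries), color="black",
           label=f"Median: ${np.median(salaries):,.0f}")
ax.legend()

# Remove top and right spines
for spine in ("top", "right"):
    ax.spines[spine].set_visible(False)

ax.set_title("Write your insight here, not 'Salary Distribution'")
fig.savefig("salary_histogram.png", dpi=150, bbox_inches="tight")  # example path
```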
Exercise 8: Grouped Bar Chart
Using the following data, create a horizontal grouped bar chart comparing Q1 and Q2 performance across four regions:
```python
import pandas as pd

data = {
    'Region': ['North', 'South', 'East', 'West'],
    'Q1_Revenue': [4200000, 5800000, 3100000, 4600000],
    'Q2_Revenue': [4500000, 5200000, 3900000, 4800000]
}
df_revenue = pd.DataFrame(data)
```
Requirements:
- Use distinct but non-garish colors
- Include value labels on each bar (formatted as $X.XM)
- Indicate which regions grew and which declined (hint: use annotations or color coding)
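The `$X.XM` label format can be handled by a small helper function, sketched here:

```python
def millions(value):
    """Format a revenue figure as $X.XM for use as a bar label."""
    return f"${value / 1_000_000:.1f}M"

print(millions(4_200_000))  # $4.2M
print(millions(5_800_000))  # $5.8M
```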
Exercise 9: Correlation Heatmap
Write Python code using seaborn to create a correlation heatmap for a dataset of your choice (you may use the sample retail dataset from the chapter or create your own). Your heatmap must:
- Show only the lower triangle (mask the upper)
- Use a diverging color palette centered at zero
- Include correlation values as annotations
- Have a descriptive title
- Print the three strongest correlations below the heatmap (as text output)
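The lower-triangle mask is the least obvious requirement: the standard idiom uses `numpy.triu` to hide the upper triangle (including the diagonal). A minimal sketch on synthetic data:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Small synthetic frame purely to demonstrate the masking idiom
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("ABCD"))
corr = df.corr()

# True on and above the diagonal -> those cells are hidden by the mask
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, cmap="coolwarm", center=0, annot=True, fmt=".2f")
plt.title("Replace with a sentence about the strongest relationship")
```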
Exercise 10: Box Plot Comparison
Create a figure with four side-by-side box plots comparing days_since_last_purchase across four customer segments: "New" (0-90 days as customer), "Growing" (91-365), "Established" (366-730), "Veteran" (731+).
```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 800
customer_tenure_days = np.random.exponential(400, n).astype(int)
days_since_purchase = np.random.exponential(45, n).astype(int)
# Add correlation: longer-tenure customers have slightly longer gaps
days_since_purchase = days_since_purchase + (customer_tenure_days * 0.02).astype(int)

df_box = pd.DataFrame({
    'tenure_days': customer_tenure_days,
    'days_since_purchase': days_since_purchase
})

# Create segments
df_box['segment'] = pd.cut(
    df_box['tenure_days'],
    bins=[0, 90, 365, 730, 99999],
    labels=['New', 'Growing', 'Established', 'Veteran']
)
```
Requirements:
- Use seaborn's boxplot or violinplot
- Add a horizontal reference line at 60 days (the "at risk" threshold)
- Annotate the median value above each box
- Write a one-sentence insight as the chart title
Exercise 11: Small Multiples
Create a 2x2 grid of scatter plots showing the relationship between two variables across four regions. Use the sample retail dataset from the chapter. Each panel should show one region, and all panels should use the same axis scales for comparison.
Exercise 12: Missing Data Visualization
Using the df_missing dataset created in Section 5.8 of the chapter, create a two-panel figure:
- Left panel: A bar chart showing the percentage of missing values per column (only columns with missing data)
- Right panel: A heatmap showing co-occurrence of missingness (do missing values in column A tend to co-occur with missing values in column B?)
Hint for the right panel: compute the correlation matrix of the binary "is missing" flags.
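The hint boils down to one line of pandas. A minimal sketch with toy data in which `email` and `phone` missingness co-occurs by construction:

```python
import numpy as np
import pandas as pd

# Toy frame where missingness in 'email' and 'phone' co-occurs by construction
rng = np.random.default_rng(7)
contact_gap = rng.random(500) < 0.2
df = pd.DataFrame({
    "email": np.where(contact_gap, np.nan, 1.0),
    "phone": np.where(contact_gap, np.nan, 1.0),
    "age":   np.where(rng.random(500) < 0.1, np.nan, 1.0),
})

# Binary is-missing flags, then their correlation matrix (the right-panel input)
miss_corr = df.isna().astype(int).corr()
print(miss_corr.round(2))
```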
Section C: EDA Workflow (Exercises 13-18)
Exercise 13: Full EDA — Employee Attrition
You are given an HR dataset with the following columns:
- employee_id, age, department, tenure_years, salary, performance_rating (1-5), satisfaction_score (1-10), overtime_hours_monthly, is_attrited (0/1)
Without writing code, outline a complete EDA workflow:
(a) What are the first five things you would check?
(b) What distributions would you expect to be skewed, and in which direction?
(c) Which correlations would you investigate first, and why?
(d) What missing data patterns would concern you most?
(e) Write three hypotheses that the EDA should test (or at least inform).
Exercise 14: Business Question to EDA Plan
A VP of Sales asks: "Why did our close rate drop from 32% to 24% last quarter?"
(a) List at least five data fields you would want in your dataset to investigate this question.
(b) Describe three specific visualizations you would create, explaining what each would reveal.
(c) What would a "smoking gun" finding look like? What would an ambiguous finding look like?
(d) Draft the SCQA structure (Situation, Complication, Question, Answer) you might use to present your findings, filling in plausible details.
Exercise 15: EDA on Real-World Data
Download a public dataset from one of the following sources:
- Kaggle (search for "customer churn," "sales data," or "employee attrition")
- UCI Machine Learning Repository
- data.gov
Run the EDAReport class from this chapter on the dataset. Then answer:
(a) What was the most surprising finding?
(b) Which columns had the most missing data? Was the missingness likely MCAR, MAR, or MNAR?
(c) What is the strongest correlation you found? Does it make business sense?
(d) Write a three-paragraph executive summary of your findings.
Exercise 16: Before-and-After EDA
You're told that a previous data science team built a customer lifetime value (CLV) model with the following features: age, income, tenure, total_purchases, avg_order_value, num_returns, support_tickets.
The model has an R-squared of 0.45 (explaining 45% of variance in CLV).
(a) What EDA steps would you perform on the existing features to understand why the model isn't performing better?
(b) Propose three new features you might engineer based on EDA insights (e.g., ratios, interactions, binned variables).
(c) What would you look for in a scatter plot of predicted CLV vs. actual CLV? What patterns would indicate specific problems?
Exercise 17: EDA for Anomaly Detection
An e-commerce company notices that last Tuesday's revenue was unusually high ($890,000 vs. the typical $120,000-$180,000 range).
(a) Design an EDA investigation to determine whether this is a genuine spike or a data error. What five things would you check, in order?
(b) For each check, describe what "error" would look like vs. what "genuine spike" would look like.
(c) If the spike is genuine, what follow-up EDA would help understand its cause?
Exercise 18: Cross-Functional EDA
You are on a cross-functional team with members from Marketing, Finance, Operations, and Product. Each team member has a different question about the same customer dataset.
- Marketing: "Which customer segments are most engaged?"
- Finance: "What's our revenue concentration risk?"
- Operations: "Where are the fulfillment bottlenecks?"
- Product: "Which features drive repeat purchases?"
(a) For each question, identify which EDA technique (descriptive stats, distribution analysis, correlation, segmented comparison, or missing data analysis) is most relevant.
(b) Design a single EDA report that addresses all four questions. What visualizations would you include?
(c) How would you structure a 15-minute presentation of your findings to this cross-functional audience?
Section D: Data Storytelling (Exercises 19-22)
Exercise 19: Chart Title Rewrite
Rewrite the following chart titles to communicate insights rather than descriptions:
- "Revenue by Month, 2024"
- "Customer Count by Segment"
- "Average Response Time by Agent"
- "Churn Rate by Channel"
- "Product Returns by Category"
For each, write both a descriptive title and an insight-driven title (the insight can be hypothetical).
Exercise 20: SCQA Narrative
Using the Athena Retail Group findings from this chapter (online customers: high value, high churn; returns driven by category, not customer; satisfaction scores don't predict behavior), write a complete SCQA narrative as if presenting to Athena's CEO. Your narrative should:
- Be 300-500 words
- Include at least one specific metric in each section
- End with a clear recommendation
- Anticipate one likely objection and address it
Exercise 21: Audience Adaptation
You have discovered that a company's mobile app users spend 2.3x more per transaction than web users, but the app has a 4.1-star rating vs. 4.6 for the web experience.
Write three versions of this finding:
(a) For a board of directors (one sentence, maximum impact)
(b) For a product team (one paragraph, actionable detail)
(c) For a data science team (technical detail, including what to investigate next)
Exercise 22: Executive Summary from Raw Output
Given the following EDAReport output excerpt, write a 200-word executive summary suitable for a non-technical audience:
```
Dataset: Customer_Transactions_2024.csv
Rows: 45,823
Columns: 14 (9 numeric, 5 categorical)
Missing Data:
  email                   12.3% (5,637 values)
  phone                    8.1% (3,712 values)
  loyalty_tier             0.0%
  purchase_amount          0.0%
Skewed Columns:
  purchase_amount          (right-skewed, skew=3.2)
  days_between_purchases   (right-skewed, skew=2.8)
Top Correlations:
  purchase_amount x loyalty_tier_numeric      r = +0.62
  days_between_purchases x is_churned         r = +0.48
  customer_age x total_lifetime_purchases     r = +0.31
```
Section E: EDAReport Extensions (Exercises 23-30)
Exercise 23: Add a Data Type Summary Method
Add a method called type_summary() to the EDAReport class that returns a DataFrame with one row per column, showing: column name, data type, number of unique values, number of missing values, and a sample value. This is often the very first thing an analyst looks at.
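As a starting point, here is a standalone sketch of that logic; you would adapt it into a method that reads `self.df`:

```python
import pandas as pd

def type_summary(df):
    """One row per column: name, dtype, unique count, missing count, sample value."""
    rows = []
    for col in df.columns:
        non_null = df[col].dropna()
        rows.append({
            "column": col,
            "dtype": str(df[col].dtype),
            "n_unique": df[col].nunique(),
            "n_missing": int(df[col].isna().sum()),
            "sample": non_null.iloc[0] if len(non_null) else None,
        })
    return pd.DataFrame(rows)

demo = pd.DataFrame({"age": [34, None, 51], "dept": ["HR", "Eng", "Eng"]})
print(type_summary(demo))
```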
Exercise 24: Add Automated Chart Saving
Modify the plot_distributions() and plot_correlations() methods to accept an optional save_path parameter. When provided, the figures should be saved to that path as PNG files with descriptive filenames (e.g., distributions_customer_analysis.png).
Exercise 25: Add a Target Variable Analysis
Add a method called target_analysis(target_col) that, given a binary target column (like is_churned):
- Computes the base rate (what % of records are positive)
- For each numeric column, compares the mean, median, and standard deviation between the two groups
- For each categorical column, computes the target rate within each category
- Flags columns where the difference between groups exceeds a user-specified threshold
This method turns the EDAReport into a feature-screening tool — a bridge to supervised learning in Chapter 7.
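The group-comparison step at the heart of `target_analysis()` can be sketched with a plain `groupby`; the column names here are illustrative:

```python
import numpy as np
import pandas as pd

# Toy data with a binary target; the real column names will differ
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "is_churned": rng.integers(0, 2, 400),
    "tenure_years": rng.exponential(3, 400),
})
df.loc[df["is_churned"] == 1, "tenure_years"] *= 0.5  # churners leave early

base_rate = df["is_churned"].mean()
print(f"Base rate: {base_rate:.1%}")

# Mean/median/std of a numeric column, split by the two target groups
comparison = df.groupby("is_churned")["tenure_years"].agg(["mean", "median", "std"])
print(comparison)
```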
Exercise 26: Add a Duplicate Detection Method
Add a method called duplicate_report() that:
- Counts exact duplicate rows
- Counts near-duplicate rows (matching on a user-specified subset of columns)
- Prints examples of duplicates found
- Returns the deduplicated DataFrame
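The pandas primitives this method needs already exist; a minimal demonstration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ana", "Ben", "Ben"],
    "city": ["NYC", "NYC", "LA", "SF"],
})

exact = df.duplicated().sum()                # full-row duplicates (keeps first)
near = df.duplicated(subset=["name"]).sum()  # duplicates on a column subset
deduped = df.drop_duplicates()

print(exact, near, len(deduped))  # 1 2 3
```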
Exercise 27: Add Automated Visualization Recommendations
Add a method called suggest_charts() that, based on the data types and distributions detected, prints a recommended visualization plan. For example:
- Right-skewed numeric columns: "Consider log-scale histogram for annual_income"
- High-cardinality categorical columns: "Use top-N bar chart for product_id (5,000+ unique values)"
- Strong correlations: "Create scatter plot for total_purchases vs. avg_order_value (r = 0.65)"
Exercise 28: HTML Report Generation
Extend the EDAReport class with a to_html() method that generates a self-contained HTML report including:
- All text-based summaries formatted in HTML tables
- Embedded matplotlib figures (saved as base64-encoded images within the HTML)
- A table of contents with anchor links
Hint: Use base64 and io.BytesIO to embed images without saving separate files.
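The embedding trick from the hint, shown in isolation:

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])

# Render the figure into an in-memory buffer instead of a file on disk
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)

# Encode as base64 and wrap in an <img> tag the HTML report embeds directly
encoded = base64.b64encode(buf.getvalue()).decode("ascii")
img_tag = f'<img src="data:image/png;base64,{encoded}" alt="chart"/>'
print(img_tag[:60])
```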
Exercise 29: Time Series EDA Extension
Add a method called time_series_summary(date_col, value_col) that:
- Detects the time granularity (daily, weekly, monthly)
- Plots the time series with a rolling average
- Identifies trend, seasonality, and any structural breaks
- Reports the period with highest and lowest values
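The rolling-average and extreme-period pieces are one-liners in pandas; a minimal sketch with a synthetic daily series:

```python
import numpy as np
import pandas as pd

# Synthetic daily series; the real method would take date_col/value_col instead
idx = pd.date_range("2024-01-01", periods=90, freq="D")
rng = np.random.default_rng(5)
ts = pd.Series(100 + rng.normal(0, 10, 90).cumsum(), index=idx)

rolling = ts.rolling(window=7).mean()   # 7-day rolling average for the plot
print("Highest:", ts.idxmax().date(), "Lowest:", ts.idxmin().date())
```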
Exercise 30: Integration Challenge
Combine the EDAReport class with the pandas workflow from Chapter 3. Write a script that:
1. Loads a CSV file (use a public dataset or the sample from the chapter)
2. Performs data cleaning (handle missing values, correct types)
3. Runs the full EDAReport
4. Saves the text report and all figures
5. Prints three actionable recommendations based on the findings
This exercise is intentionally open-ended — it mirrors what you'd actually do on the first day of a data project.
Submission Guidelines
- For coding exercises, submit well-commented Python scripts or Jupyter notebooks
- For interpretation exercises, write clear, concise answers (aim for business clarity, not academic length)
- For the SCQA narrative (Exercise 20), treat this as a professional deliverable — formatting, grammar, and persuasiveness matter
- For EDAReport extensions, include both the code and a demonstration of the method working on sample data