In This Chapter
- Same Data, Different Story
- 5.1 The EDA Mindset
- 5.2 Descriptive Statistics for Managers
- 5.3 Data Visualization Best Practices
- 5.4 matplotlib Fundamentals
- 5.5 seaborn for Statistical Visualization
- 5.6 Distribution Analysis
- 5.7 Correlation Analysis
- 5.8 Missing Data Analysis
- 5.9 Feature Distributions by Category
- 5.10 Telling Stories with Data
- 5.11 Building the EDAReport Class
- 5.12 Extending the EDAReport
- 5.13 From EDA to Action: Connecting the Dots
- Chapter Summary
Chapter 5: Exploratory Data Analysis
Same Data, Different Story
Professor Okonkwo clicked to the first slide without a word. The projector filled the screen with a chart — if you could call it that. A three-dimensional clustered bar chart erupted from the screen in six neon colors, each bar casting a shadow over the one behind it. The legend was crammed into the bottom-right corner in 8-point font. Two y-axes competed for attention. A title read, in all caps: "CUSTOMER METRICS BY REGION AND CATEGORY Q1-Q4 2024." Gridlines sliced through everything like a chain-link fence.
She let the class absorb the visual assault for a full ten seconds.
Then she clicked again. A clean, single-axis bar chart appeared. Muted blue bars. One variable. A clear title: "Southeast Region Drives 62% of Premium Category Growth." Beneath it, a single annotation arrow pointing to the tallest bar, with a note: "Up 18 pp YoY — recommend doubling Southeast digital spend."
"Same data," Okonkwo said. "Same analyst. Same Tuesday afternoon." She paused. "The first chart was attached to a 40-page report that no one on the executive team finished reading. The second one changed a $10 million product strategy within a week." She turned to face the class. "The difference is not the data. The difference is the thinking behind the chart."
NK Adeyemi stared at the two slides, something clicking into place. She'd spent years in marketing watching beautifully designed reports get ignored — and she'd always assumed the problem was the audience. Maybe the problem was the report.
Tom Kowalski, sitting two rows back, was less impressed. "I mean, sure, pretty charts are nice," he murmured to his neighbor. "But the real work is the model. EDA is just... poking around."
Okonkwo, who had a disconcerting ability to hear murmurs from across a lecture hall, smiled. "Mr. Kowalski, I'm glad you said that. Because today we're going to talk about what happens when you skip the poking around." She pulled up a third slide — a confusion matrix from a real project, all red. "This was a churn prediction model built by a consulting team that billed at $400 an hour. Accuracy: 94%. Sounds great, right?" She let that sit. "The dataset was 94% non-churners. The model learned to predict 'will not churn' for everyone. A $200,000 engagement that produced a model no better than a coin flip for the thing it was supposed to predict. And the reason it happened is that nobody — nobody — ran an exploratory data analysis first."
The room was quiet.
"Today," she said, "we learn to look before we leap."
5.1 The EDA Mindset
Exploratory Data Analysis is not a step you check off on the way to the interesting work. It is the interesting work — or at least, it is the work that determines whether everything that follows will be interesting or catastrophic.
The term was coined by the American statistician John Tukey in his landmark 1977 book Exploratory Data Analysis. Tukey drew a sharp distinction between two modes of statistical thinking:
- Confirmatory Data Analysis (CDA): You start with a hypothesis and test it. This is classical statistics — null hypotheses, p-values, confidence intervals.
- Exploratory Data Analysis (EDA): You start with no hypothesis and let the data speak. You look for patterns, anomalies, relationships, and surprises. You ask, "What is this data trying to tell me?"
Definition — Exploratory Data Analysis (EDA): An approach to analyzing datasets that emphasizes visualization, summary statistics, and pattern detection before any formal modeling or hypothesis testing. The goal is to develop an understanding of the data's structure, quality, and potential — and to surface questions you didn't know to ask.
Tukey was fond of saying, "The greatest value of a picture is when it forces us to notice what we never expected to see." In business, this translates to a practical truth: the most expensive mistakes in analytics come not from bad models but from bad assumptions about the data that feeds them.
Why Business Professionals Must Do EDA
If you are an MBA student reading this, you might be tempted to think of EDA as a technical exercise best left to data engineers. That would be a mistake, for three reasons:
1. EDA surfaces business questions, not just data questions. When you discover that your customer data has a bimodal distribution — two distinct humps instead of one bell curve — that is not a statistics fact. That is a market segmentation insight. The shape of your data often is the strategy.
2. EDA catches problems that models hide. Machine learning algorithms are optimization machines. Give them garbage data and they will optimize garbage, often with impressive-sounding accuracy numbers (as Tom's $200,000 churn model demonstrated). EDA is your quality control layer.
3. EDA builds intuition. A model might tell you that Feature X correlates with revenue at r = 0.73. But until you have looked at the scatter plot, you don't know whether that correlation is driven by a clean linear trend, a few extreme outliers, or a nonlinear curve that a linear model will miss entirely.
Business Insight: At McKinsey, Bain, and BCG, the first week of most analytics engagements is pure EDA. Partners call it "getting to know the data." It is billable. It is expected. And it often yields the insight that defines the entire project.
The EDA Workflow
While EDA is inherently open-ended — that's the point — experienced analysts follow a loose workflow:
- Understand the data dictionary. What are the columns? What do they represent? What are the units?
- Check shape and types. How many rows? How many columns? What data types are present?
- Examine summary statistics. What are the central tendencies? The spreads? The extremes?
- Assess data quality. How much is missing? Are there duplicates? Do values make sense?
- Visualize distributions. What shape does each variable take? Are there outliers?
- Explore relationships. How do variables relate to each other? What correlates with what?
- Form hypotheses. Based on what you've seen, what questions are worth testing?
- Document findings. Write down what you learned — not just what you did.
We will work through each of these steps in this chapter, building up a Python toolkit as we go.
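Steps 2 through 4 of the workflow map directly onto a handful of pandas calls. Here is a minimal first-look sketch; the tiny DataFrame is a hypothetical stand-in, not the chapter's retail dataset:

```python
import pandas as pd
import numpy as np

# A tiny stand-in DataFrame; in practice this is your loaded dataset
df = pd.DataFrame({
    'age': [34, 41, np.nan, 29],
    'spend': [120.0, 80.5, 95.0, 80.5],
    'channel': ['Online', 'In-Store', 'Online', 'Online'],
})

# Step 2: shape and types
print(df.shape)    # (rows, columns)
print(df.dtypes)

# Step 3: summary statistics for numeric columns
print(df.describe())

# Step 4: data quality, missing values and duplicates
print(df.isna().sum())        # missing count per column
print(df.duplicated().sum())  # number of fully duplicated rows
```

Running these four checks before anything else takes under a minute and catches the majority of "how did nobody notice this?" problems.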
5.2 Descriptive Statistics for Managers
Before we write any code, let's make sure the foundational concepts are clear. If you studied statistics in undergrad, this will be review — but review with a business lens that may reframe what you thought you knew.
Measures of Central Tendency
Mean (Average): The sum of all values divided by the count. The mean is the balance point of your data — if you placed all your data points on a seesaw, the mean is where you'd put the fulcrum.
Business use: Average order value, mean customer age, average handle time in a call center.
Danger: The mean is sensitive to outliers. If nine customers spend $50 and one spends $50,000, the mean spend is $5,045 — a number that describes nobody in the dataset. When a CEO asks "What's our average deal size?" and the answer is distorted by three whale accounts, the mean is lying to you politely.
Median: The middle value when all values are sorted. Half the data falls above, half below.
Business use: Median household income (far more informative than mean income), median time-to-close for sales deals.
Advantage: The median is robust to outliers. Those nine $50 customers and one $50,000 customer? The median spend is $50 — which actually describes the typical customer.
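The whale-account example from above is easy to verify in code, using nine $50 customers and one $50,000 customer:

```python
import pandas as pd

spend = pd.Series([50] * 9 + [50_000])

print(f"Mean:   ${spend.mean():,.0f}")    # $5,045 (describes nobody)
print(f"Median: ${spend.median():,.0f}")  # $50 (the typical customer)
```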
Mode: The most frequently occurring value.
Business use: Most common product purchased, most frequent support ticket category, modal price point.
When it matters most: Categorical data. You can't compute a mean for "product category," but you can find the mode.
Business Insight: When reporting to executives, always present both mean and median for skewed distributions. If they differ substantially, that gap is the story. A company with a mean customer lifetime value of $800 and a median of $120 has a small number of high-value customers subsidizing the overall picture. That's not a number — that's a strategy conversation.
Measures of Spread
Standard Deviation (SD): How far, on average, data points sit from the mean. A small SD means data clusters tightly; a large SD means data is spread out.
Business intuition: Think of standard deviation as predictability. A call center with a mean handle time of 6 minutes and an SD of 1 minute is predictable — you can staff for it. A call center with the same mean but an SD of 8 minutes is a scheduling nightmare.
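A quick simulation makes the staffing point concrete. The 6-minute mean and the two SDs are the hypothetical call-center figures from the paragraph above, and a normal distribution is a simplification (real handle times are right-skewed and non-negative):

```python
import numpy as np

rng = np.random.default_rng(0)

predictable = rng.normal(loc=6, scale=1, size=10_000)  # mean 6 min, SD 1
chaotic = rng.normal(loc=6, scale=8, size=10_000)      # same mean, SD 8

# Same average handle time...
print(round(predictable.mean(), 1), round(chaotic.mean(), 1))

# ...but very different staffing risk: how often does a call run past 10 minutes?
print((predictable > 10).mean())  # essentially never
print((chaotic > 10).mean())      # a large fraction of calls
```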
Variance: The square of the standard deviation. Mathematically useful but harder to interpret because the units are squared (minutes-squared doesn't mean much). You will rarely report variance to a business audience.
Range: Maximum minus minimum. Simple but easily distorted by a single extreme value.
Interquartile Range (IQR): The range of the middle 50% of data (75th percentile minus 25th percentile). Robust to outliers.
Business use: "The middle 50% of our customers spend between $40 and $120 per visit" is a more useful statement than reporting the range of $5 to $47,000.
Percentiles
The Nth percentile is the value below which N% of the data falls. The median is the 50th percentile.
Business use cases you'll encounter constantly:
- P95 (95th percentile): Used in SLAs. "95% of API calls complete in under 200 milliseconds." The P95 latency is a standard performance metric in tech.
- P90 and P10: A common trick for understanding spread. The gap between the 90th and 10th percentile tells you how wide the bulk of your data is, without being distorted by extremes.
- Quartiles (P25, P50, P75): The backbone of box plots, which we'll build shortly.
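All of these come out of one pandas method, `quantile()`. A sketch using a hypothetical API latency series (the exponential shape mimics real latency data):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
latency_ms = pd.Series(rng.exponential(scale=80, size=10_000))  # right-skewed

p95 = latency_ms.quantile(0.95)              # the SLA number
p10, p90 = latency_ms.quantile([0.10, 0.90])
q1, median, q3 = latency_ms.quantile([0.25, 0.50, 0.75])

print(f"P95 latency: {p95:.0f} ms")
print(f"P10-P90 spread: {p90 - p10:.0f} ms")
print(f"IQR (P75 - P25): {q3 - q1:.0f} ms")
```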
Skewness and Kurtosis
Skewness measures how asymmetric a distribution is:
- Positive skew (right-skewed): A long tail to the right. Most values cluster low, with a few extreme highs. Example: income distributions, website session durations, purchase amounts.
- Negative skew (left-skewed): A long tail to the left. Most values cluster high, with a few extreme lows. Example: exam scores when the test is easy, age at retirement.
- Zero skew: Symmetric. The classic bell curve.
Business rule of thumb: If your data is right-skewed (and business data almost always is), the mean will be higher than the median. If someone reports only the mean, they're painting a rosier picture than typical experience warrants.
Kurtosis measures how heavy the tails are — how likely extreme values are:
- High kurtosis (leptokurtic): Fat tails, sharp peak. Extreme events are more common than a normal distribution would predict. Example: financial returns — the 2008 crash was a "25-standard-deviation event" under normal assumptions, which means the assumptions were wrong.
- Low kurtosis (platykurtic): Thin tails, flat peak. Extreme events are rare.
Caution
Kurtosis is frequently misunderstood. It is not about how "peaked" a distribution looks (a common textbook myth). It is about the tails. A distribution with high kurtosis has a higher probability of producing outliers. In risk management, this is the difference between a model that says "a 10% daily loss is virtually impossible" and a model that says "a 10% daily loss happens about once a decade." They can't both be right, and the difference is kurtosis.
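You can check the tails-not-peak point numerically. pandas' `.kurt()` reports excess kurtosis (0 for a normal distribution). In this sketch a Laplace distribution stands in for a fat-tailed process — an illustrative assumption, not the financial-returns data discussed above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

normal = pd.Series(rng.normal(size=100_000))
fat_tailed = pd.Series(rng.laplace(size=100_000))  # excess kurtosis of 3 in theory

print(f"Normal excess kurtosis:     {normal.kurt():.2f}")      # near 0
print(f"Fat-tailed excess kurtosis: {fat_tailed.kurt():.2f}")  # clearly positive

# The practical difference: how often does a 4-sigma event occur?
print((normal.abs() / normal.std() > 4).mean())
print((fat_tailed.abs() / fat_tailed.std() > 4).mean())
```

Both series have the same mean and a bell-ish shape, yet the fat-tailed one produces 4-sigma events dozens of times more often. That gap is what a risk model built on normal assumptions misses.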
Putting It Into Practice with pandas
Let's see these concepts come alive. If you completed Chapter 3, you have pandas installed and ready to go. We'll use a sample retail dataset throughout this chapter.
import pandas as pd
import numpy as np
# Create a sample retail dataset (in practice, you'd load from CSV)
np.random.seed(42)
n_customers = 1000
data = {
'customer_id': range(1, n_customers + 1),
'age': np.random.normal(42, 12, n_customers).astype(int),
'annual_income': np.random.lognormal(10.8, 0.7, n_customers).astype(int),
'total_purchases': np.random.poisson(15, n_customers),
'avg_order_value': np.random.gamma(5, 20, n_customers).round(2),
'days_since_last_purchase': np.random.exponential(60, n_customers).astype(int),
'channel': np.random.choice(
['Online', 'In-Store', 'Mobile App'], n_customers,
p=[0.45, 0.30, 0.25]
),
'region': np.random.choice(
['Northeast', 'Southeast', 'Midwest', 'West'], n_customers,
p=[0.25, 0.35, 0.20, 0.20]
),
'is_churned': np.random.choice([0, 1], n_customers, p=[0.82, 0.18])
}
df = pd.DataFrame(data)
print(f"Dataset shape: {df.shape}")
print(f"\nFirst 5 rows:")
print(df.head())
Code Explanation: We use `np.random.seed(42)` for reproducibility — every student will get the same "random" data. Notice how different columns use different distributions: `normal` for age (bell curve), `lognormal` for income (right-skewed, as real incomes are), `poisson` for purchase counts (discrete, always non-negative), `exponential` for days since last purchase (right-skewed, many recent, few long-lapsed). We're simulating the patterns you'd find in real business data.
Now let's compute descriptive statistics:
# Quick summary statistics
print(df.describe())
The output of describe() gives you count, mean, standard deviation, min, 25th percentile, median (50th), 75th percentile, and max for every numeric column — all the measures we just discussed, in one line of code.
# Compare mean vs. median for income (expect right skew)
print(f"\nIncome — Mean: ${df['annual_income'].mean():,.0f}")
print(f"Income — Median: ${df['annual_income'].median():,.0f}")
print(f"Income — Skewness: {df['annual_income'].skew():.2f}")
# The gap tells the story
gap_pct = ((df['annual_income'].mean() - df['annual_income'].median())
/ df['annual_income'].median() * 100)
print(f"Mean is {gap_pct:.1f}% higher than median — right-skewed distribution")
Try It: Run this code and look at the gap between mean and median income. Because we used a lognormal distribution (which mimics real income data), the mean will be substantially higher than the median. This is the pattern you will see in almost every revenue, income, or transaction-value dataset you encounter in your career.
5.3 Data Visualization Best Practices
Before we start building charts, we need to talk about why most business charts fail. This is not an aesthetics conversation — it's a communication conversation.
Edward Tufte's Principles
Edward Tufte, the Yale professor and information design legend, articulated principles that every data-literate business professional should internalize:
1. Data-Ink Ratio. Of all the ink on a chart, what fraction represents actual data? Tufte argues this ratio should be as close to 1.0 as possible. Every gridline, border, shadow, and decorative element that doesn't encode data is noise. The 3D Excel chart from Okonkwo's opening slide had a data-ink ratio near zero — most of its ink was shadows, depth effects, and redundant borders.
2. Chartjunk. Tufte's term for visual elements that do not inform. This includes 3D effects, gradient fills, unnecessary legends (when there's only one data series), redundant axis labels, and decorative graphics. Chartjunk doesn't just waste space — it actively interferes with comprehension.
3. Lie Factor. The size of an effect shown in the graphic divided by the size of the effect in the data. A chart where a 10% increase is depicted with a bar that's 50% taller has a lie factor of 5.0. This happens more often than you'd think, usually through truncated y-axes or area-based comparisons (doubling a radius quadruples the area of a circle, making a 2x difference look like 4x).
4. Small Multiples. Instead of cramming six data series onto one cluttered chart, create six small, identical charts with one series each. Same scales, same axes, easy comparison. This technique is underused in business presentations and ubiquitous in good data journalism.
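Tufte's lie factor is simple arithmetic, which makes a suspect chart easy to audit. A sketch of the calculation (the `lie_factor` helper is a hypothetical convenience function, not a Tufte-defined API):

```python
def lie_factor(effect_in_graphic: float, effect_in_data: float) -> float:
    """Tufte's lie factor: visual effect size / actual effect size. 1.0 is honest."""
    return effect_in_graphic / effect_in_data

# A 10% increase in the data drawn as a bar that is 50% taller:
print(lie_factor(effect_in_graphic=0.50, effect_in_data=0.10))  # 5.0

# A truncated y-axis: the data rises from 100 to 110 (a 10% increase), but the
# axis starts at 95, so the bar grows from 5 units to 15 units (a 200% increase)
print(lie_factor(effect_in_graphic=(15 - 5) / 5, effect_in_data=(110 - 100) / 100))
```

The second case is the common one in practice: nobody sets out to exaggerate, but a y-axis that starts just below the data turns a modest change into a dramatic one.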
The "So What?" Test
Professor Okonkwo teaches a test she learned at McKinsey: every chart should pass the "So What?" test. After looking at the chart, can you immediately articulate:
- What the chart is showing (should be obvious from the title)
- So what — why does this matter? What's the insight?
- Now what — what action does this suggest?
If a chart doesn't suggest an action, it is decoration.
Business Insight: At Amazon, the standard for charts in internal documents (the famous "six-page memo" format) is that every chart must have a title that states the insight, not just describes the data. Not "Revenue by Quarter" but "Q3 Revenue Declined 12%, Driven Entirely by European Markets." The title does the work so the reader doesn't have to.
Choosing the Right Chart Type
| Question | Chart Type | When to Use |
|---|---|---|
| How is one variable distributed? | Histogram, density plot | Always start here for numerical columns |
| How do categories compare? | Bar chart (horizontal preferred) | Comparing discrete groups |
| How does a value change over time? | Line chart | Time series data |
| How do two variables relate? | Scatter plot | Exploring correlation |
| What's the spread within groups? | Box plot, violin plot | Comparing distributions across categories |
| What are the correlations among many variables? | Heatmap | Multivariate relationships |
| What's the composition of a whole? | Stacked bar, treemap | Part-to-whole relationships |
Caution
Pie charts are almost never the right choice. Humans are poor at comparing angles and areas. A simple bar chart conveys the same information more accurately. If your CEO loves pie charts, make them a bar chart and call it a "category comparison." They'll get the information faster.
5.4 matplotlib Fundamentals
matplotlib is Python's foundational visualization library. It's not the prettiest out of the box, but it's the most flexible and the one that everything else is built on. Understanding matplotlib's mental model will make every other visualization library easier to learn.
The Figure-Axes Model
matplotlib thinks in two layers:
- Figure: The entire canvas — think of it as the blank page.
- Axes: A single plot area within the figure. A figure can contain multiple axes (subplots).
import matplotlib.pyplot as plt
# Create a figure with one set of axes
fig, ax = plt.subplots(figsize=(10, 6))
# Plot on the axes
ax.hist(df['annual_income'], bins=40, color='steelblue', edgecolor='white')
ax.set_title('Distribution of Customer Annual Income', fontsize=14, fontweight='bold')
ax.set_xlabel('Annual Income ($)', fontsize=12)
ax.set_ylabel('Number of Customers', fontsize=12)
# Remove top and right spines (Tufte would approve)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.savefig('income_distribution.png', dpi=150)
plt.show()
Code Explanation: `plt.subplots()` returns both the figure and axes objects. We use `ax.hist()` rather than `plt.hist()` because the axes-based approach gives us more control and scales to multi-panel figures. The `edgecolor='white'` adds thin white borders between bars for clarity. Removing the top and right spines is a simple trick that dramatically reduces chartjunk.
What you'd see: A right-skewed histogram. Most customers cluster in the $20,000-$80,000 range, with a long tail stretching toward $200,000 and beyond. The shape immediately tells you that "average income" would be misleading — the typical customer earns much less than the mean.
Essential Plot Types
Bar Charts — Comparing Categories
fig, ax = plt.subplots(figsize=(8, 5))
# Count customers by channel
channel_counts = df['channel'].value_counts()
bars = ax.barh(channel_counts.index, channel_counts.values, color='steelblue')
# Add value labels on the bars
for bar in bars:
width = bar.get_width()
ax.text(width + 5, bar.get_y() + bar.get_height() / 2,
f'{int(width)}', va='center', fontsize=11)
ax.set_title('Customer Count by Channel', fontsize=14, fontweight='bold')
ax.set_xlabel('Number of Customers', fontsize=12)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.show()
Business Insight: Notice we used horizontal bars (`barh`), not vertical. When your category labels are text (especially longer text like "Mobile App" or "In-Store"), horizontal bars are easier to read because the labels sit naturally on the y-axis. This is a small choice that dramatically improves readability. Also note the value labels — never make your reader guess the exact number from gridlines.
Scatter Plots — Exploring Relationships
fig, ax = plt.subplots(figsize=(10, 6))
# Color-code by churn status
colors = df['is_churned'].map({0: 'steelblue', 1: 'tomato'})
ax.scatter(df['total_purchases'], df['avg_order_value'],
c=colors, alpha=0.5, s=30)
ax.set_title('Purchases vs. Order Value by Churn Status',
fontsize=14, fontweight='bold')
ax.set_xlabel('Total Purchases', fontsize=12)
ax.set_ylabel('Average Order Value ($)', fontsize=12)
# Custom legend
from matplotlib.lines import Line2D
legend_elements = [
Line2D([0], [0], marker='o', color='w', markerfacecolor='steelblue',
markersize=10, label='Active'),
Line2D([0], [0], marker='o', color='w', markerfacecolor='tomato',
markersize=10, label='Churned')
]
ax.legend(handles=legend_elements, loc='upper right')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.show()
What you'd see: A cloud of blue and red dots. You might notice that churned customers (red) tend to cluster in the lower-left — fewer purchases, lower order values — but there's significant overlap. This suggests that churn is predictable to some degree from purchasing behavior, but it's not a clean separation. This is exactly the kind of preliminary insight that shapes how you'd build a predictive model in Chapter 7.
Multi-Panel Figures (Small Multiples)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Panel 1: Age distribution
axes[0, 0].hist(df['age'], bins=30, color='steelblue', edgecolor='white')
axes[0, 0].set_title('Age Distribution')
axes[0, 0].set_xlabel('Age')
# Panel 2: Income distribution
axes[0, 1].hist(df['annual_income'], bins=40, color='darkorange', edgecolor='white')
axes[0, 1].set_title('Income Distribution')
axes[0, 1].set_xlabel('Annual Income ($)')
# Panel 3: Purchase count distribution
axes[1, 0].hist(df['total_purchases'], bins=20, color='seagreen', edgecolor='white')
axes[1, 0].set_title('Total Purchases Distribution')
axes[1, 0].set_xlabel('Number of Purchases')
# Panel 4: Days since last purchase
axes[1, 1].hist(df['days_since_last_purchase'], bins=40,
color='mediumpurple', edgecolor='white')
axes[1, 1].set_title('Days Since Last Purchase')
axes[1, 1].set_xlabel('Days')
# Clean up all panels
for ax_row in axes:
for ax in ax_row:
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylabel('Count')
fig.suptitle('Customer Dataset — Distribution Overview',
fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
What you'd see: Four histograms arranged in a 2x2 grid. Age is roughly bell-shaped (normal). Income is right-skewed with a long tail. Purchase counts form a classic Poisson shape — clustered around 15 with a tail. Days since last purchase is sharply right-skewed (exponential) — most customers purchased recently, but a long tail of dormant customers stretches far to the right.
This single figure tells you more about your customer base in ten seconds than a page of summary statistics.
5.5 seaborn for Statistical Visualization
seaborn is built on top of matplotlib and provides higher-level statistical visualizations with better default styling. Think of matplotlib as the engine and seaborn as the luxury body kit.
import seaborn as sns
# Set the default style
sns.set_style('whitegrid')
sns.set_palette('muted')
Distribution Plots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# KDE (Kernel Density Estimate) — a smoothed histogram
sns.kdeplot(data=df, x='annual_income', hue='is_churned',
fill=True, alpha=0.4, ax=axes[0])
axes[0].set_title('Income Distribution by Churn Status')
axes[0].set_xlabel('Annual Income ($)')
# Box plot — the five-number summary visualized
sns.boxplot(data=df, x='channel', y='avg_order_value', ax=axes[1],
palette='muted')
axes[1].set_title('Order Value by Channel')
axes[1].set_xlabel('Channel')
axes[1].set_ylabel('Average Order Value ($)')
plt.tight_layout()
plt.show()
Code Explanation: The KDE plot smooths a histogram into a continuous curve, making it easier to compare two distributions on the same axes. The `hue='is_churned'` parameter automatically splits the data into two curves (churned vs. active) and colors them differently. The box plot shows the median (center line), IQR (box), whiskers (1.5x IQR), and outliers (diamonds) — the full five-number summary in one visual.
What you'd see on the KDE: Two overlapping curves. The active customers (blue) and churned customers (orange) have broadly similar income distributions, but the churned distribution is slightly shifted left — churned customers tend to have somewhat lower incomes. The overlap is substantial, meaning income alone is a weak predictor.
What you'd see on the box plot: Three box-and-whisker plots side by side. If Mobile App customers have a notably higher median order value, that's an insight worth investigating. Are mobile users different people, or does the mobile experience encourage higher-value purchases?
Heatmaps for Correlation
fig, ax = plt.subplots(figsize=(10, 8))
# Select only numeric columns
numeric_df = df.select_dtypes(include=[np.number])
# Compute correlation matrix
corr_matrix = numeric_df.corr()
# Create heatmap
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='RdBu_r',
center=0, square=True, linewidths=0.5,
vmin=-1, vmax=1, ax=ax)
ax.set_title('Correlation Matrix — Customer Dataset',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
Code Explanation: `annot=True` prints the correlation coefficient inside each cell. `fmt='.2f'` formats to two decimal places. `cmap='RdBu_r'` uses a red-blue color scale (red for positive, blue for negative — a standard convention). `center=0` ensures that zero correlation appears as white. `vmin=-1, vmax=1` fixes the color scale to the full possible range of correlations.
What you'd see: A grid of colored squares, each labeled with a correlation coefficient. The diagonal is all 1.00 (every variable correlates perfectly with itself). Look for off-diagonal values far from zero. If total_purchases and avg_order_value show r = 0.05, they're essentially unrelated — a customer who buys frequently doesn't necessarily buy expensive items. If days_since_last_purchase and is_churned show r = 0.3, that's a meaningful signal: customers who haven't bought in a while are more likely to churn.
Pair Plots — The Shotgun Approach
# Select a subset of columns (pair plots get unwieldy with many variables)
subset_cols = ['age', 'annual_income', 'total_purchases',
'avg_order_value', 'is_churned']
sns.pairplot(df[subset_cols], hue='is_churned',
diag_kind='kde', plot_kws={'alpha': 0.4, 's': 20},
palette={0: 'steelblue', 1: 'tomato'})
plt.suptitle('Pair Plot — Key Customer Variables', y=1.02, fontsize=16)
plt.show()
What you'd see: A matrix of scatter plots, one for every pair of variables, with KDE plots on the diagonal. Each point is colored by churn status. This is the "show me everything" visualization — great for early-stage EDA when you don't know which relationships matter yet. It's too busy for a presentation, but perfect for your own analysis notebook.
Business Insight: NK Adeyemi, working on this in class, noticed something interesting in the pair plot: the relationship between age and total purchases was different for churned vs. active customers. Older active customers tended to have more purchases (loyalty over time), but older churned customers had fewer purchases than younger churned customers. "That's weird," she said. "It's like the older customers who churn are the ones who never really engaged in the first place." Okonkwo smiled. "That's an EDA insight. You just generated a hypothesis from a picture."
5.6 Distribution Analysis
Understanding the shape of your data is not an academic exercise. It determines which statistical methods are valid, which visualizations are appropriate, and which business conclusions are defensible.
Common Distribution Shapes in Business Data
Normal (Gaussian): The bell curve. Symmetric, most values near the mean, tails fall off quickly.
- Where you find it: Human physical measurements (height, weight), manufacturing tolerances, test scores (by design), measurement errors.
- Business implication: Mean and standard deviation fully describe the data. Confidence intervals and z-scores work as expected.

Right-Skewed (Lognormal, Exponential, Pareto): Long tail to the right. A few extreme high values.
- Where you find it: Income, revenue, website traffic, purchase amounts, company sizes, city populations. This is the dominant shape in business data.
- Business implication: Mean is misleading. Use median. A small number of observations (customers, products, regions) drive a disproportionate share of the total. This is the Pareto principle made visible.

Bimodal: Two distinct peaks.
- Where you find it: Mixed populations. A dataset containing both casual users and power users. A product sold to both consumers and businesses.
- Business implication: You're probably looking at two different populations jammed into one dataset. Segment them.

Uniform: Every value is equally likely.
- Where you find it: Random IDs, lottery numbers, occasionally pricing data in a narrow range.
- Business implication: There's no central tendency. Mean and median are technically correct but uninformative.
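These shapes are easy to generate and fingerprint numerically. A sketch simulating each one and checking its skewness with pandas (the parameters and the casual/power-user labels are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 50_000

shapes = {
    'normal': pd.Series(rng.normal(50, 10, n)),
    'right-skewed (lognormal)': pd.Series(rng.lognormal(3, 0.8, n)),
    'bimodal (two segments)': pd.Series(np.concatenate([
        rng.normal(30, 5, n // 2),  # e.g. casual users
        rng.normal(70, 5, n // 2),  # e.g. power users
    ])),
    'uniform': pd.Series(rng.uniform(0, 100, n)),
}

for name, s in shapes.items():
    print(f"{name:26s} skew = {s.skew():+.2f}  "
          f"mean = {s.mean():6.1f}  median = {s.median():6.1f}")
```

Note what the numbers can and cannot tell you: the lognormal's large positive skew jumps out, but the bimodal mixture is symmetric, so its skewness is near zero. Summary statistics alone won't reveal two humps — that's why you always plot the histogram.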
Identifying Outliers
An outlier is a data point that is "unusually far" from the rest. But how far is unusual?
The IQR Method: A point is an outlier if it falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. This is the rule that box plots use for their whiskers.
def detect_outliers_iqr(series, multiplier=1.5):
"""Detect outliers using the IQR method."""
Q1 = series.quantile(0.25)
Q3 = series.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - multiplier * IQR
upper_bound = Q3 + multiplier * IQR
outliers = series[(series < lower_bound) | (series > upper_bound)]
print(f"Column: {series.name}")
print(f" IQR: {IQR:.2f}")
print(f" Lower bound: {lower_bound:.2f}")
print(f" Upper bound: {upper_bound:.2f}")
print(f" Number of outliers: {len(outliers)} ({len(outliers)/len(series)*100:.1f}%)")
return outliers
# Check for outliers in key columns
for col in ['annual_income', 'avg_order_value', 'days_since_last_purchase']:
detect_outliers_iqr(df[col])
print()
Caution
Not every outlier is an error. A customer with $500,000 in annual income might be perfectly real — they're just rare. Before removing outliers, ask: "Is this a data quality problem (someone typed an extra zero) or a genuine extreme value?" The answer determines your response. Errors should be corrected. Genuine extremes should be understood, not discarded.
Tom Kowalski learned this lesson the hard way during the in-class exercise. He wrote a script that automatically removed all outliers beyond 2 standard deviations, then built a quick linear regression. "Look, high R-squared," he said, showing his screen. Okonkwo walked over, looked at his code, and said: "You just removed your 50 highest-value customers. Your model now perfectly predicts the behavior of people who don't matter to the business. The $50,000-a-year customer that you deleted? That's who the VP of Sales cares about." Tom stared at his screen for a long moment. "I guess I should have looked at who those outliers were before deleting them."
"You're building on quicksand," Okonkwo said. "EDA first. Always."
5.7 Correlation Analysis
Correlation measures the strength and direction of a linear relationship between two variables. The correlation coefficient, r, ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 indicating no linear relationship.
Interpreting Correlations
| |r| Value | Interpretation | Business Example |
|-----------|----------------|------------------|
| 0.00 - 0.19 | Negligible | Shoe size and customer satisfaction |
| 0.20 - 0.39 | Weak | Weather and online sales |
| 0.40 - 0.59 | Moderate | Customer tenure and lifetime value |
| 0.60 - 0.79 | Strong | Employee engagement and retention |
| 0.80 - 1.00 | Very strong | Ad spend and impressions |
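The bands in this table are easy to encode as a small helper for labeling correlations in automated reports. The function below is a convenience sketch that simply mirrors the table's thresholds; it is not a statistical standard:

```python
def interpret_r(r):
    """Map a correlation coefficient to the bands from the table above."""
    strength = abs(r)
    if strength >= 0.80:
        return "Very strong"
    if strength >= 0.60:
        return "Strong"
    if strength >= 0.40:
        return "Moderate"
    if strength >= 0.20:
        return "Weak"
    return "Negligible"

print(interpret_r(-0.72))  # Strong (direction doesn't change strength)
print(interpret_r(0.05))   # Negligible
```

Note that the sign is deliberately ignored: r = -0.72 is just as strong a relationship as r = +0.72, only in the opposite direction.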
Building a Correlation Analysis
# Focus on business-relevant correlations
numeric_cols = ['age', 'annual_income', 'total_purchases',
'avg_order_value', 'days_since_last_purchase', 'is_churned']
corr = df[numeric_cols].corr()
# Find the strongest correlations (excluding self-correlations)
strong_corrs = []
for i in range(len(corr.columns)):
for j in range(i + 1, len(corr.columns)):
col1 = corr.columns[i]
col2 = corr.columns[j]
r = corr.iloc[i, j]
strong_corrs.append((col1, col2, r))
# Sort by absolute correlation strength
strong_corrs.sort(key=lambda x: abs(x[2]), reverse=True)
print("Correlation Ranking (strongest to weakest):")
print("-" * 55)
for col1, col2, r in strong_corrs:
strength = "Strong" if abs(r) > 0.5 else "Moderate" if abs(r) > 0.3 else "Weak"
print(f" {col1:30s} vs {col2:30s}: r = {r:+.3f} ({strength})")
Code Explanation: This code walks the correlation matrix, pulling out every unique pair of variables together with its correlation coefficient, and ranks the pairs from strongest to weakest. This is far more actionable than staring at a full correlation matrix — it tells you immediately which relationships are worth investigating.
The Three Deadly Sins of Correlation Analysis
Sin 1: Confusing Correlation with Causation. The classic error. Ice cream sales and drowning deaths are correlated (both increase in summer). That doesn't mean ice cream causes drowning. In business, you might find that customers who use your mobile app have higher lifetime value. Does the app cause higher spending, or do high-spending customers self-select into using the app? The correlation alone can't tell you.
Sin 2: Ignoring Nonlinear Relationships. Correlation measures linear relationships. If your data follows a U-shape or an inverted-U (which is common — think of the relationship between price and demand), the correlation coefficient might be near zero even though there's a strong, meaningful relationship. Always look at the scatter plot.
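Sin 2 is easy to demonstrate. In the sketch below, y follows a clean inverted-U in x (plus noise), yet the Pearson coefficient comes out near zero:

```python
import numpy as np

np.random.seed(0)
# An inverted-U: think demand rising and then falling with price
x = np.linspace(-10, 10, 200)
y = -(x ** 2) + np.random.normal(0, 5, 200)  # strong, but nonlinear

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r: {r:.3f}")  # near zero despite the obvious pattern
```

A scatter plot of x against y would reveal the relationship instantly, which is exactly why the coefficient should never be reported without one.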
Sin 3: Being Fooled by Outliers. A single extreme point can create a high correlation in a small dataset, or destroy a real correlation in a larger one.
# Demonstration: outlier-driven correlation
np.random.seed(99)
x = np.random.normal(50, 10, 30)
y = np.random.normal(50, 10, 30)
# No real relationship
print(f"Correlation without outlier: {np.corrcoef(x, y)[0, 1]:.3f}")
# Add one extreme outlier
x_with_outlier = np.append(x, 200)
y_with_outlier = np.append(y, 250)
print(f"Correlation with one outlier: {np.corrcoef(x_with_outlier, y_with_outlier)[0, 1]:.3f}")
What you'd see: The correlation jumps from near-zero to something that looks significant — all because of one data point. This is why you never report a correlation without first looking at the scatter plot.
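One common safeguard, not shown in the snippet above, is to also compute a rank-based correlation such as Spearman's, which a single extreme point cannot dominate. A sketch re-using the same simulated data (pandas computes Spearman by correlating ranks):

```python
import numpy as np
import pandas as pd

np.random.seed(99)
# Same setup as above: 30 unrelated points plus one extreme outlier
x = pd.Series(np.append(np.random.normal(50, 10, 30), 200.0))
y = pd.Series(np.append(np.random.normal(50, 10, 30), 250.0))

pearson = x.corr(y)                      # dominated by the single outlier
spearman = x.corr(y, method='spearman')  # rank-based: the outlier is just "rank 31"
print(f"Pearson:  {pearson:+.3f}")
print(f"Spearman: {spearman:+.3f}")
```

A large gap between the two coefficients is itself a diagnostic: it suggests the Pearson value is being driven by a few extreme points rather than a broad linear trend.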
Business Insight: NK was reviewing a correlation matrix during the class exercise when she found that annual_income and total_purchases had a very low correlation. "That's weird," she said. "Shouldn't richer people buy more?" Professor Okonkwo nodded approvingly. "Great question. What could explain it?" NK thought about it. "Maybe high-income customers buy fewer times but spend more per purchase. Or maybe income doesn't predict buying behavior as much as we assume." "Both are plausible," Okonkwo said. "And now you have two hypotheses to test. That is what EDA is for — not answers, but better questions."
5.8 Missing Data Analysis
Missing data is not just a technical nuisance — it's a business signal. Why data is missing often tells you as much as the data itself.
Types of Missingness
Statisticians categorize missing data into three types, and the distinction matters enormously:
MCAR (Missing Completely at Random): The probability of a value being missing is unrelated to any variable in the dataset. A survey response is missing because the respondent's pen ran out of ink. This is the most benign type — you can safely drop or impute these values without introducing bias.
MAR (Missing at Random): The probability of a value being missing is related to other observed variables but not to the missing value itself. Income data is more likely to be missing for younger respondents (age is observed, and it predicts missingness). You can account for this using information from other columns.
MNAR (Missing Not at Random): The probability of a value being missing is related to the missing value itself. High-income individuals are more likely to leave income blank on a survey because their income is high. This is the most dangerous type — no amount of clever imputation can fully correct for it, because the missingness pattern is inherently linked to the information you're trying to recover.
Business Insight: In practice, MNAR is everywhere. Customers who are about to churn stop filling out surveys. Patients who are sickest miss follow-up appointments. Products that are failing get pulled from shelves before sales data accumulates. When you see a pattern of missingness in your data, ask: "Could the reason this data is missing be related to the value I'd expect to see?" If the answer is yes, proceed with extreme caution.
Visualizing Missingness
# Introduce some realistic missing values for demonstration
df_missing = df.copy()
np.random.seed(42)
# MCAR: random 5% missing in age
mask_age = np.random.random(len(df_missing)) < 0.05
df_missing.loc[mask_age, 'age'] = np.nan
# MAR: income more likely missing for younger customers
mask_income = (np.random.random(len(df_missing)) < 0.15) & (df_missing['age'] < 35)
df_missing.loc[mask_income, 'annual_income'] = np.nan
# MNAR: high-value customers less likely to report channel
mask_channel = (np.random.random(len(df_missing)) < 0.10) & (df_missing['avg_order_value'] > 150)
df_missing.loc[mask_channel, 'channel'] = np.nan
# Missing value summary
def missing_summary(dataframe):
"""Create a summary of missing values."""
missing = dataframe.isnull().sum()
percent = (missing / len(dataframe)) * 100
summary = pd.DataFrame({
'Missing Count': missing,
'Missing %': percent.round(2),
'Data Type': dataframe.dtypes
})
# Only show columns with missing values
summary = summary[summary['Missing Count'] > 0].sort_values(
'Missing %', ascending=False
)
return summary
print("Missing Value Report")
print("=" * 50)
print(missing_summary(df_missing))
# Visualize missing data patterns
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Bar chart of missing percentages
missing_pct = (df_missing.isnull().sum() / len(df_missing) * 100)
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=True)
axes[0].barh(missing_pct.index, missing_pct.values, color='tomato')
axes[0].set_title('Missing Data by Column', fontsize=14, fontweight='bold')
axes[0].set_xlabel('% Missing')
for i, (val, name) in enumerate(zip(missing_pct.values, missing_pct.index)):
axes[0].text(val + 0.2, i, f'{val:.1f}%', va='center')
# Matrix visualization of missingness
cols_with_missing = df_missing.columns[df_missing.isnull().any()].tolist()
missing_matrix = df_missing[cols_with_missing].isnull().astype(int)
# Show first 100 rows as a heatmap
sns.heatmap(missing_matrix.head(100).T, cbar=False, cmap='YlOrRd',
yticklabels=True, ax=axes[1])
axes[1].set_title('Missing Data Pattern (First 100 Rows)',
fontsize=14, fontweight='bold')
axes[1].set_xlabel('Row Index')
plt.tight_layout()
plt.show()
What you'd see: The left panel shows horizontal bars indicating the percentage of missing values in each column — a quick way to identify which columns are most affected. The right panel shows a matrix where each row is a variable and each column is a data record: yellow cells indicate present data, red cells indicate missing data. If missing values in two columns tend to co-occur (they form vertical red lines across the same rows), that suggests a systematic pattern — perhaps those records come from a data source that doesn't capture those fields.
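The co-occurrence question can also be answered numerically, by correlating the missingness indicators themselves. A small sketch with synthetic data (the column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1_000
df = pd.DataFrame({
    'age': rng.normal(40, 10, n),
    'annual_income': rng.normal(60_000, 15_000, n),
    'channel': rng.choice(['Online', 'In-Store'], n),
})

# Knock out age and income together for the same 10% of rows,
# and channel independently for a different 5%
shared = rng.random(n) < 0.10
df.loc[shared, ['age', 'annual_income']] = np.nan
df.loc[rng.random(n) < 0.05, 'channel'] = np.nan

# Correlate the 0/1 missingness indicators: values near +1 mean
# "these columns go missing on the same rows"
miss_corr = df.isnull().astype(int).corr()
print(miss_corr.round(2))
```

Here the age/annual_income indicator correlation comes out near 1.0 (they were removed by the same mask), while channel's missingness is uncorrelated with either — the numeric analogue of the vertical red lines in the heatmap.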
Business Implications of Missing Data
The decision about what to do with missing data is a business decision, not just a technical one:
| Strategy | When to Use | Risk |
|---|---|---|
| Drop rows | Very few missing values (<5%), MCAR | Lose data; may bias if not MCAR |
| Drop columns | Column mostly missing (>50%) | Lose potentially useful features |
| Impute with mean/median | Numerical, MCAR or MAR, moderate missingness | Reduces variance; distorts distribution |
| Impute with mode | Categorical, MCAR or MAR | Overrepresents most common category |
| Impute with model | MAR with good predictors | Complex; can propagate errors |
| Create a "missing" indicator | When missingness is informative (MNAR) | Adds a column; the model can learn the pattern |
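The last row of the table deserves a sketch, since the impute-plus-indicator pattern is the usual answer when missingness might be informative. The column names here are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'annual_income': [52_000, np.nan, 61_000, np.nan, 75_000]})

# 1. Record WHERE the values were missing, before touching them
df['income_was_missing'] = df['annual_income'].isnull().astype(int)

# 2. Impute with the median of the observed values
median_income = df['annual_income'].median()
df['annual_income'] = df['annual_income'].fillna(median_income)

print(df)
```

A downstream model can now learn from the missingness pattern itself, and the imputation is documented in the data rather than silently lost.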
Caution
Never impute missing values and then forget that you imputed them. If you fill in missing income values with the median income, you've just told your model that a bunch of customers earn exactly the median — which is unlikely to be true. Always document which columns were imputed and how.
5.9 Feature Distributions by Category
One of the most powerful EDA techniques is comparing distributions across business-relevant groups. This is where statistical exploration meets market segmentation.
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# 1. Income distribution by channel
sns.boxplot(data=df, x='channel', y='annual_income', ax=axes[0, 0],
palette='muted')
axes[0, 0].set_title('Income by Channel', fontweight='bold')
# 2. Order value distribution by region
sns.violinplot(data=df, x='region', y='avg_order_value', ax=axes[0, 1],
palette='muted', inner='quartile')
axes[0, 1].set_title('Order Value by Region', fontweight='bold')
# 3. Churn rate by channel
churn_by_channel = df.groupby('channel')['is_churned'].mean() * 100
axes[1, 0].bar(churn_by_channel.index, churn_by_channel.values,
color=['steelblue', 'darkorange', 'seagreen'])
axes[1, 0].set_title('Churn Rate by Channel', fontweight='bold')
axes[1, 0].set_ylabel('Churn Rate (%)')
for i, (channel, rate) in enumerate(churn_by_channel.items()):
axes[1, 0].text(i, rate + 0.5, f'{rate:.1f}%', ha='center', fontweight='bold')
# 4. Purchase frequency by region and channel
pivot = df.pivot_table(values='total_purchases', index='region',
columns='channel', aggfunc='mean')
pivot.plot(kind='bar', ax=axes[1, 1], rot=0)
axes[1, 1].set_title('Avg Purchases by Region & Channel', fontweight='bold')
axes[1, 1].set_ylabel('Average Purchases')
axes[1, 1].legend(title='Channel')
plt.tight_layout()
plt.show()
What you'd see: Four panels, each telling a different part of the segmentation story.
The violin plots in the upper-right are particularly revealing — they show not just the median and quartiles (like a box plot) but the full shape of the distribution. If the Southeast region's violin has a bulge at a higher order value, that confirms the pattern Okonkwo hinted at in the opening.
The churn rate comparison (lower-left) is where business strategy meets data. If Online customers churn at 22% while In-Store customers churn at 14%, that's a strategic challenge: your fastest-growing channel is also your leakiest.
Athena Update: When Ravi Mehta's data team ran this exact analysis on Athena Retail Group's unified customer dataset (the one they painstakingly integrated in Chapter 4), the results reshaped their entire AI roadmap. Three findings stood out:
Online-only customers had 40% higher lifetime value — but 60% higher churn rate. They spend more per transaction and buy more frequently, but they also disappear more abruptly. The executive team had been celebrating online growth without realizing they were filling a leaky bucket.
Return rate correlated strongly with product category, not customer segment. The assumption had been that "serial returners" were a customer type that could be predicted and managed. The data showed that returns were driven by specific product categories (primarily apparel sizing issues), not customer behavior. The fix wasn't a churn model — it was a sizing guide.
The correlation heatmap revealed almost zero correlation between customer satisfaction scores and actual purchasing behavior. Customers who rated their experience 5 stars were no more likely to buy again than those who rated it 3 stars. The satisfaction survey, which consumed significant operational resources, was measuring sentiment but not predicting behavior.
These insights reshuffled Athena's AI project priorities. Churn prediction for online customers moved to the top of the list — a project we'll begin building in Chapter 7. The sizing recommendation system moved from "someday" to "next quarter." And the customer satisfaction survey was redesigned from scratch.
"This is why we do EDA," Ravi told his team. "We just saved six months of building the wrong models."
5.10 Telling Stories with Data
Technical skill gets you the analysis. Storytelling gets you the impact.
The difference between a data analyst and a data leader is not the sophistication of their models — it's their ability to translate quantitative findings into narratives that drive decisions. Every visualization, every summary statistic, every correlation you've computed in this chapter is raw material. The finished product is a story.
The Analytical Narrative Structure
Professor Okonkwo teaches a four-part structure she calls the SCQA framework (adapted from Barbara Minto's The Pyramid Principle, a McKinsey staple):
1. Situation: What is the current state? What does the audience already know?
   - "Athena Retail Group has grown online revenue 35% year-over-year for three consecutive years."
2. Complication: What's the problem or surprise? What does the data reveal that the audience doesn't know?
   - "However, our EDA reveals that online customers churn at 60% higher rates than in-store customers. Net customer growth is actually flat."
3. Question: What must we decide?
   - "Should we invest in acquiring new online customers, or in retaining the ones we have?"
4. Answer: What does the data suggest?
   - "Retention. A churn prediction model focused on online customers could save an estimated $4.2M annually in lost lifetime value."
Business Insight: Notice that the SCQA framework doesn't start with the data. It starts with the audience's existing understanding and then introduces the data as a plot twist. This is not manipulation — it's communication. Humans process information narratively. A finding that contradicts an existing belief is far more memorable and actionable than one that confirms it.
Writing Executive Summaries from EDA
An executive summary distills your EDA into three to five findings, each structured as:
Finding: [One sentence stating what the data shows]
Evidence: [The specific metric or visualization that supports it]
Implication: [What this means for the business]
Recommendation: [What action to take]
Finding: Online customers are significantly more valuable but significantly
less loyal than in-store customers.
Evidence: Online customer LTV averages $1,240 vs. $890 in-store (+39%),
but online annual churn rate is 28% vs. 17% in-store (+65%).
Implication: The online channel is a leaky bucket. Without intervention,
growing online acquisition will produce diminishing returns as churn
compounds.
Recommendation: Prioritize building a churn prediction model for online
customers. Early-warning signals (declining purchase frequency, reduced
session time) can trigger retention interventions.
NK was the first to complete this exercise in class. Her marketing background had given her years of practice writing narratives from data — she just hadn't realized that was a transferable skill. Her executive summary was so well-structured that Okonkwo used it as an example. "This," Okonkwo said, holding up NK's writeup, "is what makes data useful. The analysis is not complete when you've run the code. It's complete when someone who has never seen the data understands what it means and knows what to do about it."
Tom, watching from across the room, realized his technical skills weren't enough. He could build the model, but NK could sell it. They'd need both.
5.11 Building the EDAReport Class
Throughout this chapter, we've explored data using individual code snippets. Now let's assemble everything into a reusable Python class — the EDAReport — that automates the core EDA workflow. This is the kind of tool that working data scientists build for themselves and share with their teams.
We'll build it incrementally, then show the complete class at the end.
Step 1: The Foundation
class EDAReport:
"""
Automated Exploratory Data Analysis report generator.
Takes a pandas DataFrame and produces a comprehensive EDA
including summary statistics, missing data analysis,
distributions, correlations, and an executive summary.
Usage:
report = EDAReport(df, title="Customer Analysis")
report.run() # Print full report to console
report.plot_distributions() # Show distribution plots
report.plot_correlations() # Show correlation heatmap
"""
def __init__(self, df, title="EDA Report"):
"""
Initialize the EDA report.
Parameters
----------
df : pd.DataFrame
The dataset to analyze.
title : str
A descriptive title for the report.
"""
self.df = df.copy()
self.title = title
self.numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
self.categorical_cols = df.select_dtypes(
include=['object', 'category']
).columns.tolist()
Code Explanation: The constructor stores a copy of the DataFrame (so the original is never modified), identifies numeric and categorical columns automatically, and stores a title for report headers. This separation of numeric vs. categorical columns will drive different analysis paths throughout the class.
Step 2: Shape and Type Summary
def shape_summary(self):
"""Print basic shape and data type information."""
print(f"\n{'=' * 60}")
print(f" {self.title}")
print(f"{'=' * 60}")
print(f"\n Rows: {self.df.shape[0]:,}")
print(f" Columns: {self.df.shape[1]:,}")
print(f"\n Numeric columns ({len(self.numeric_cols)}):")
for col in self.numeric_cols:
print(f" - {col} ({self.df[col].dtype})")
print(f"\n Categorical columns ({len(self.categorical_cols)}):")
for col in self.categorical_cols:
print(f" - {col} ({self.df[col].nunique()} unique values)")
Step 3: Missing Data Report
def missing_report(self):
"""Analyze and report missing values."""
missing = self.df.isnull().sum()
total_missing = missing.sum()
print(f"\n{'─' * 60}")
print(" MISSING DATA REPORT")
print(f"{'─' * 60}")
if total_missing == 0:
print(" No missing values detected. ✓")
return
pct = (missing / len(self.df) * 100).round(2)
report = pd.DataFrame({
'Missing': missing,
'Percent': pct
})
report = report[report['Missing'] > 0].sort_values(
'Percent', ascending=False
)
for col, row in report.iterrows():
bar_len = int(row['Percent'] / 2)
bar = '█' * bar_len + '░' * (50 - bar_len)
print(f" {col:30s} {bar} {row['Percent']:5.1f}% "
f"({int(row['Missing']):,} values)")
print(f"\n Total cells missing: {total_missing:,} "
f"({total_missing / self.df.size * 100:.2f}% of all data)")
Code Explanation: The method creates a visual bar chart using Unicode block characters — a simple but effective way to display proportions in a text-based report. Each column with missing data gets a progress bar showing the percentage missing, plus the raw count. This gives you an instant visual sense of the missingness pattern without needing a plot.
Step 4: Descriptive Statistics
def descriptive_stats(self):
"""Generate descriptive statistics for numeric and categorical columns."""
print(f"\n{'─' * 60}")
print(" DESCRIPTIVE STATISTICS — NUMERIC")
print(f"{'─' * 60}")
if self.numeric_cols:
stats = self.df[self.numeric_cols].describe().T
stats['skew'] = self.df[self.numeric_cols].skew()
stats['kurtosis'] = self.df[self.numeric_cols].kurtosis()
stats['missing'] = self.df[self.numeric_cols].isnull().sum()
# Reorder columns for readability
stats = stats[['count', 'missing', 'mean', '50%', 'std',
'min', '25%', '75%', 'max', 'skew', 'kurtosis']]
stats.columns = ['Count', 'Missing', 'Mean', 'Median', 'Std Dev',
'Min', 'Q1', 'Q3', 'Max', 'Skewness', 'Kurtosis']
print(stats.to_string())
else:
print(" No numeric columns found.")
print(f"\n{'─' * 60}")
print(" DESCRIPTIVE STATISTICS — CATEGORICAL")
print(f"{'─' * 60}")
if self.categorical_cols:
for col in self.categorical_cols:
print(f"\n {col} ({self.df[col].nunique()} unique values):")
value_counts = self.df[col].value_counts()
total = len(self.df[col].dropna())
for val, count in value_counts.head(10).items():
pct = count / total * 100
print(f" {val:30s} {count:6,} ({pct:5.1f}%)")
if self.df[col].nunique() > 10:
print(f" ... and {self.df[col].nunique() - 10} more")
else:
print(" No categorical columns found.")
Code Explanation: For numeric columns, we extend pandas' built-in describe() with skewness and kurtosis — two metrics that describe() doesn't include by default but that are essential for understanding distribution shape. We rename the columns from pandas' default labels (like "50%") to business-friendly names (like "Median"). For categorical columns, we show the top 10 values with their counts and percentages, which immediately tells you the dominant categories and how concentrated or dispersed the values are.
Step 5: Distribution Plots
def plot_distributions(self, max_cols=12):
"""
Plot distributions for all numeric columns.
Parameters
----------
max_cols : int
Maximum number of columns to plot (to avoid overwhelming output).
"""
cols_to_plot = self.numeric_cols[:max_cols]
n_cols = len(cols_to_plot)
if n_cols == 0:
print("No numeric columns to plot.")
return
n_rows = (n_cols + 2) // 3 # Ceiling division for 3 columns per row
fig, axes = plt.subplots(n_rows, 3, figsize=(15, 4 * n_rows))
axes = axes.flatten()  # subplots(n_rows, 3) always returns an ndarray here
for i, col in enumerate(cols_to_plot):
ax = axes[i]
self.df[col].dropna().hist(bins=30, ax=ax, color='steelblue',
edgecolor='white', alpha=0.8)
ax.axvline(self.df[col].mean(), color='tomato', linestyle='--',
linewidth=1.5, label=f'Mean: {self.df[col].mean():.1f}')
ax.axvline(self.df[col].median(), color='orange', linestyle='-',
linewidth=1.5, label=f'Median: {self.df[col].median():.1f}')
ax.set_title(col, fontsize=11, fontweight='bold')
ax.legend(fontsize=8)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Hide unused subplots
for j in range(i + 1, len(axes)):
axes[j].set_visible(False)
fig.suptitle(f'{self.title} — Distributions',
fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
Code Explanation: Each histogram includes vertical dashed lines for the mean (red) and median (orange). When these two lines are close together, the distribution is roughly symmetric. When they diverge, the distribution is skewed — and the gap between them is visually obvious. The max_cols parameter prevents the method from trying to plot 50 histograms for a wide dataset. The layout logic (n_rows = (n_cols + 2) // 3) uses ceiling division to arrange subplots in a 3-column grid.
Step 6: Correlation Heatmap
def plot_correlations(self):
"""Plot a correlation heatmap for all numeric columns."""
if len(self.numeric_cols) < 2:
print("Need at least 2 numeric columns for correlation analysis.")
return
corr = self.df[self.numeric_cols].corr()
fig, ax = plt.subplots(figsize=(max(8, len(self.numeric_cols)),
max(6, len(self.numeric_cols) * 0.8)))
mask = np.triu(np.ones_like(corr, dtype=bool)) # Show only lower triangle
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
center=0, square=True, linewidths=0.5,
vmin=-1, vmax=1, ax=ax)
ax.set_title(f'{self.title} — Correlation Matrix',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Print top correlations
print(f"\n{'─' * 60}")
print(" TOP CORRELATIONS (by absolute value)")
print(f"{'─' * 60}")
pairs = []
for i in range(len(corr.columns)):
for j in range(i + 1, len(corr.columns)):
pairs.append((
corr.columns[i], corr.columns[j],
corr.iloc[i, j]
))
pairs.sort(key=lambda x: abs(x[2]), reverse=True)
for col1, col2, r in pairs[:10]:
direction = "↑" if r > 0 else "↓"
print(f" {col1:25s} × {col2:25s} r = {r:+.3f} {direction}")
Code Explanation: We use np.triu to mask the upper triangle of the correlation matrix, since it's a mirror image of the lower triangle. This reduces visual clutter by half. The figure size scales with the number of variables to prevent cramped labels. After the plot, we print the top correlations as a ranked list — a text-based complement to the visual heatmap.
Step 7: Executive Summary Generator
def executive_summary(self):
"""Generate an automated executive summary paragraph."""
n_rows, n_cols = self.df.shape
total_missing = self.df.isnull().sum().sum()
pct_missing = total_missing / self.df.size * 100
print(f"\n{'═' * 60}")
print(" EXECUTIVE SUMMARY")
print(f"{'═' * 60}")
summary_parts = []
# Dataset overview
summary_parts.append(
f"This dataset contains {n_rows:,} records across "
f"{n_cols} variables ({len(self.numeric_cols)} numeric, "
f"{len(self.categorical_cols)} categorical)."
)
# Data quality
if total_missing == 0:
summary_parts.append("Data quality is excellent — no missing values detected.")
elif pct_missing < 5:
cols_missing = self.df.columns[self.df.isnull().any()].tolist()
summary_parts.append(
f"Data quality is generally good, with {pct_missing:.1f}% of "
f"values missing across {len(cols_missing)} column(s): "
f"{', '.join(cols_missing)}."
)
else:
worst_col = self.df.isnull().sum().idxmax()
worst_pct = self.df[worst_col].isnull().sum() / len(self.df) * 100
summary_parts.append(
f"Data quality requires attention: {pct_missing:.1f}% of values "
f"are missing overall. The most affected column is '{worst_col}' "
f"({worst_pct:.1f}% missing)."
)
# Skewness flags
skewed_cols = []
for col in self.numeric_cols:
skew_val = self.df[col].skew()
if abs(skew_val) > 1.5:
direction = "right" if skew_val > 0 else "left"
skewed_cols.append(f"{col} ({direction}-skewed, skew={skew_val:.1f})")
if skewed_cols:
summary_parts.append(
f"Notable distribution skew detected in: {'; '.join(skewed_cols)}. "
f"Consider using median rather than mean for these variables in "
f"reporting and review for outlier impact before modeling."
)
# Strongest correlation
if len(self.numeric_cols) >= 2:
corr = self.df[self.numeric_cols].corr()
max_corr = 0
max_pair = ("", "")
for i in range(len(corr.columns)):
for j in range(i + 1, len(corr.columns)):
if abs(corr.iloc[i, j]) > abs(max_corr):
max_corr = corr.iloc[i, j]
max_pair = (corr.columns[i], corr.columns[j])
if abs(max_corr) > 0.3:
direction = "positive" if max_corr > 0 else "negative"
summary_parts.append(
f"The strongest observed correlation is between "
f"'{max_pair[0]}' and '{max_pair[1]}' "
f"(r = {max_corr:+.3f}, {direction}). This relationship "
f"warrants further investigation."
)
else:
summary_parts.append(
f"No strong linear correlations (|r| > 0.3) were detected "
f"among numeric variables. This may indicate independent "
f"features or nonlinear relationships worth exploring."
)
# Print formatted summary
for i, part in enumerate(summary_parts):
print(f"\n {i + 1}. {part}")
print(f"\n{'═' * 60}")
Code Explanation: The executive summary method synthesizes the entire EDA into a structured paragraph that a non-technical reader can understand. It automatically flags data quality issues, distribution anomalies, and noteworthy correlations. This is the "So what?" layer — translating statistical findings into business-relevant observations.
Step 8: The Complete Class
Here is the EDAReport class assembled into a single, complete, copy-paste-ready block:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
class EDAReport:
"""
Automated Exploratory Data Analysis report generator.
Takes a pandas DataFrame and produces a comprehensive EDA
including summary statistics, missing data analysis,
distributions, correlations, and an executive summary.
Usage:
report = EDAReport(df, title="Customer Analysis")
report.run() # Print full report to console
report.plot_distributions() # Show distribution plots
report.plot_correlations() # Show correlation heatmap
Parameters
----------
df : pd.DataFrame
The dataset to analyze.
title : str
A descriptive title for the report (default: "EDA Report").
"""
def __init__(self, df, title="EDA Report"):
self.df = df.copy()
self.title = title
self.numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
self.categorical_cols = df.select_dtypes(
include=['object', 'category']
).columns.tolist()
def shape_summary(self):
"""Print basic shape and data type information."""
print(f"\n{'=' * 60}")
print(f" {self.title}")
print(f"{'=' * 60}")
print(f"\n Rows: {self.df.shape[0]:,}")
print(f" Columns: {self.df.shape[1]:,}")
print(f"\n Numeric columns ({len(self.numeric_cols)}):")
for col in self.numeric_cols:
print(f" - {col} ({self.df[col].dtype})")
print(f"\n Categorical columns ({len(self.categorical_cols)}):")
for col in self.categorical_cols:
print(f" - {col} ({self.df[col].nunique()} unique values)")
def missing_report(self):
"""Analyze and report missing values."""
missing = self.df.isnull().sum()
total_missing = missing.sum()
print(f"\n{'─' * 60}")
print(" MISSING DATA REPORT")
print(f"{'─' * 60}")
if total_missing == 0:
print(" No missing values detected.")
return
pct = (missing / len(self.df) * 100).round(2)
report = pd.DataFrame({
'Missing': missing,
'Percent': pct
})
report = report[report['Missing'] > 0].sort_values(
'Percent', ascending=False
)
for col, row in report.iterrows():
bar_len = int(row['Percent'] / 2)
bar = '█' * bar_len + '░' * (50 - bar_len)
print(f" {col:30s} {bar} {row['Percent']:5.1f}% "
f"({int(row['Missing']):,} values)")
print(f"\n Total cells missing: {total_missing:,} "
f"({total_missing / self.df.size * 100:.2f}% of all data)")
def descriptive_stats(self):
"""Generate descriptive statistics for numeric and categorical columns."""
print(f"\n{'─' * 60}")
print(" DESCRIPTIVE STATISTICS — NUMERIC")
print(f"{'─' * 60}")
if self.numeric_cols:
stats = self.df[self.numeric_cols].describe().T
stats['skew'] = self.df[self.numeric_cols].skew()
stats['kurtosis'] = self.df[self.numeric_cols].kurtosis()
stats['missing'] = self.df[self.numeric_cols].isnull().sum()
stats = stats[['count', 'missing', 'mean', '50%', 'std',
'min', '25%', '75%', 'max', 'skew', 'kurtosis']]
stats.columns = ['Count', 'Missing', 'Mean', 'Median', 'Std Dev',
'Min', 'Q1', 'Q3', 'Max', 'Skewness', 'Kurtosis']
print(stats.to_string())
else:
print(" No numeric columns found.")
print(f"\n{'─' * 60}")
print(" DESCRIPTIVE STATISTICS — CATEGORICAL")
print(f"{'─' * 60}")
if self.categorical_cols:
for col in self.categorical_cols:
print(f"\n {col} ({self.df[col].nunique()} unique values):")
value_counts = self.df[col].value_counts()
total = len(self.df[col].dropna())
for val, count in value_counts.head(10).items():
pct_val = count / total * 100
print(f" {val:30s} {count:6,} ({pct_val:5.1f}%)")
if self.df[col].nunique() > 10:
print(f" ... and {self.df[col].nunique() - 10} more")
else:
print(" No categorical columns found.")
def plot_distributions(self, max_cols=12):
"""
Plot distributions for all numeric columns.
Parameters
----------
max_cols : int
Maximum number of columns to plot.
"""
cols_to_plot = self.numeric_cols[:max_cols]
n_cols = len(cols_to_plot)
if n_cols == 0:
print("No numeric columns to plot.")
return
n_rows = (n_cols + 2) // 3
fig, axes = plt.subplots(n_rows, 3, figsize=(15, 4 * n_rows))
if n_rows == 1 and n_cols <= 3:
axes = np.array(axes).reshape(1, -1)
axes_flat = axes.flatten()
for i, col in enumerate(cols_to_plot):
ax = axes_flat[i]
self.df[col].dropna().hist(bins=30, ax=ax, color='steelblue',
edgecolor='white', alpha=0.8)
ax.axvline(self.df[col].mean(), color='tomato', linestyle='--',
linewidth=1.5, label=f'Mean: {self.df[col].mean():.1f}')
ax.axvline(self.df[col].median(), color='orange', linestyle='-',
linewidth=1.5,
label=f'Median: {self.df[col].median():.1f}')
ax.set_title(col, fontsize=11, fontweight='bold')
ax.legend(fontsize=8)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
for j in range(i + 1, len(axes_flat)):
axes_flat[j].set_visible(False)
fig.suptitle(f'{self.title} — Distributions',
fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
def plot_correlations(self):
"""Plot a correlation heatmap for all numeric columns."""
if len(self.numeric_cols) < 2:
print("Need at least 2 numeric columns for correlation analysis.")
return
corr = self.df[self.numeric_cols].corr()
fig_size = max(8, len(self.numeric_cols))
fig, ax = plt.subplots(figsize=(fig_size, fig_size * 0.8))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
center=0, square=True, linewidths=0.5,
vmin=-1, vmax=1, ax=ax)
ax.set_title(f'{self.title} — Correlation Matrix',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Print top correlations
print(f"\n{'─' * 60}")
print(" TOP CORRELATIONS (by absolute value)")
print(f"{'─' * 60}")
pairs = []
for i in range(len(corr.columns)):
for j in range(i + 1, len(corr.columns)):
pairs.append((
corr.columns[i], corr.columns[j],
corr.iloc[i, j]
))
pairs.sort(key=lambda x: abs(x[2]), reverse=True)
for col1, col2, r in pairs[:10]:
direction = "pos" if r > 0 else "neg"
print(f" {col1:25s} x {col2:25s} r = {r:+.3f} ({direction})")
def executive_summary(self):
"""Generate an automated executive summary paragraph."""
n_rows, n_cols_total = self.df.shape
total_missing = self.df.isnull().sum().sum()
pct_missing = total_missing / self.df.size * 100
print(f"\n{'=' * 60}")
print(" EXECUTIVE SUMMARY")
print(f"{'=' * 60}")
summary_parts = []
# Dataset overview
summary_parts.append(
f"This dataset contains {n_rows:,} records across "
f"{n_cols_total} variables ({len(self.numeric_cols)} numeric, "
f"{len(self.categorical_cols)} categorical)."
)
# Data quality
if total_missing == 0:
summary_parts.append(
"Data quality is excellent — no missing values detected."
)
elif pct_missing < 5:
cols_missing = self.df.columns[self.df.isnull().any()].tolist()
summary_parts.append(
f"Data quality is generally good, with {pct_missing:.1f}% of "
f"values missing across {len(cols_missing)} column(s): "
f"{', '.join(cols_missing)}."
)
else:
worst_col = self.df.isnull().sum().idxmax()
worst_pct = (self.df[worst_col].isnull().sum()
/ len(self.df) * 100)
summary_parts.append(
f"Data quality requires attention: {pct_missing:.1f}% of "
f"values are missing overall. The most affected column is "
f"'{worst_col}' ({worst_pct:.1f}% missing)."
)
# Skewness flags
skewed_cols = []
for col in self.numeric_cols:
skew_val = self.df[col].skew()
if abs(skew_val) > 1.5:
direction = "right" if skew_val > 0 else "left"
skewed_cols.append(
f"{col} ({direction}-skewed, skew={skew_val:.1f})"
)
if skewed_cols:
summary_parts.append(
f"Notable distribution skew detected in: "
f"{'; '.join(skewed_cols)}. Consider using median rather "
f"than mean for these variables in reporting and review "
f"for outlier impact before modeling."
)
# Strongest correlation
if len(self.numeric_cols) >= 2:
corr = self.df[self.numeric_cols].corr()
max_corr = 0
max_pair = ("", "")
for i in range(len(corr.columns)):
for j in range(i + 1, len(corr.columns)):
if abs(corr.iloc[i, j]) > abs(max_corr):
max_corr = corr.iloc[i, j]
max_pair = (corr.columns[i], corr.columns[j])
if abs(max_corr) > 0.3:
direction = "positive" if max_corr > 0 else "negative"
summary_parts.append(
f"The strongest observed correlation is between "
f"'{max_pair[0]}' and '{max_pair[1]}' "
f"(r = {max_corr:+.3f}, {direction}). This relationship "
f"warrants further investigation."
)
else:
summary_parts.append(
f"No strong linear correlations (|r| > 0.3) were "
f"detected among numeric variables. This may indicate "
f"independent features or nonlinear relationships "
f"worth exploring."
)
for i, part in enumerate(summary_parts):
print(f"\n {i + 1}. {part}")
print(f"\n{'=' * 60}")
def run(self):
"""Execute the full EDA report."""
self.shape_summary()
self.missing_report()
self.descriptive_stats()
self.executive_summary()
print("\n Call .plot_distributions() and .plot_correlations()")
print(" for visual analysis.")
Using the EDAReport
# Create and run the report
report = EDAReport(df, title="Athena Retail — Customer EDA")
report.run()
What you'd see: A clean, structured text report with the dataset shape, missing values (none in this clean dataset), descriptive statistics for every column (including skewness and kurtosis), top value counts for categorical columns, and an executive summary that automatically flags right-skewed distributions and noteworthy correlations.
# Generate the visual components
report.plot_distributions()
report.plot_correlations()
What you'd see: First, a grid of histograms with mean and median lines for every numeric column. Then, a correlation heatmap showing the lower triangle of the correlation matrix, followed by a ranked list of the strongest variable pairs.
Try It: Create an EDAReport on the dataset with missing values (df_missing from Section 5.8) and compare the output to the clean dataset report. Notice how the missing data report section comes alive, and how the executive summary automatically adjusts its data quality assessment.
# Compare: EDA on the dataset with missing values
report_missing = EDAReport(df_missing, title="Athena Retail — With Missing Data")
report_missing.run()
5.12 Extending the EDAReport
The EDAReport class is designed to be extended. Here are three enhancements that make excellent practice exercises (and that you'll find genuinely useful in your career):
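Each extension below is written as a standalone method (note the self parameter). One way to experiment without editing the original class source is to attach a method after the class is defined. A minimal sketch, using a hypothetical stand-in for EDAReport and an invented row_count method:

```python
import pandas as pd

class EDAReport:  # minimal stand-in for the full class from Section 5.11
    def __init__(self, df, title="EDA Report"):
        self.df = df
        self.title = title

# A method defined outside the class body...
def row_count(self):
    """Return the number of records in the dataset."""
    return len(self.df)

# ...can be attached afterwards; all instances then have it.
EDAReport.row_count = row_count

report = EDAReport(pd.DataFrame({'spend': [120.0, 80.0, 40.0]}))
print(report.row_count())  # → 3
```

The same pattern works for the three extension methods that follow: paste the def at module level, then assign it onto the class (or simply add it inside the class body directly).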
Extension 1: Outlier Detection Method
def outlier_report(self, method='iqr', multiplier=1.5):
"""Detect and report outliers for all numeric columns."""
print(f"\n{'─' * 60}")
print(f" OUTLIER REPORT (method: {method}, threshold: {multiplier}x IQR)")
print(f"{'─' * 60}")
for col in self.numeric_cols:
Q1 = self.df[col].quantile(0.25)
Q3 = self.df[col].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - multiplier * IQR
upper = Q3 + multiplier * IQR
outliers = self.df[(self.df[col] < lower) | (self.df[col] > upper)]
n_outliers = len(outliers)
pct = n_outliers / len(self.df) * 100
if n_outliers > 0:
print(f"\n {col}:")
print(f" Bounds: [{lower:.2f}, {upper:.2f}]")
print(f" Outliers: {n_outliers} ({pct:.1f}%)")
print(f" Range of outliers: "
f"[{self.df.loc[outliers.index, col].min():.2f}, "
f"{self.df.loc[outliers.index, col].max():.2f}]")
Extension 2: Group Comparison Method
def compare_groups(self, group_col, target_col):
"""Compare a numeric variable across categories."""
print(f"\n{'─' * 60}")
print(f" GROUP COMPARISON: {target_col} by {group_col}")
print(f"{'─' * 60}")
grouped = self.df.groupby(group_col)[target_col].agg(
['count', 'mean', 'median', 'std', 'min', 'max']
)
print(grouped.to_string())
fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(data=self.df, x=group_col, y=target_col,
palette='muted', ax=ax)
ax.set_title(f'{target_col} by {group_col}', fontweight='bold')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.show()
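The groupby/agg pattern at the heart of compare_groups() can be seen on a toy frame (column names and values here are invented for illustration):

```python
import pandas as pd

# Hypothetical mini-dataset: customer spend by sales channel
df = pd.DataFrame({
    'channel': ['online', 'online', 'store', 'store', 'store'],
    'spend':   [120.0, 80.0, 40.0, 60.0, 50.0],
})

# One row per group, one column per summary statistic
grouped = df.groupby('channel')['spend'].agg(['count', 'mean', 'median'])
print(grouped)
```

The resulting table (online: mean 100, store: mean 50) is exactly the kind of side-by-side comparison the boxplot then shows visually.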
Extension 3: Save Report to File
import sys
from io import StringIO
def save_report(self, filepath):
"""Save the text report to a file."""
# Capture print output
old_stdout = sys.stdout
sys.stdout = StringIO()
self.shape_summary()
self.missing_report()
self.descriptive_stats()
self.executive_summary()
report_text = sys.stdout.getvalue()
sys.stdout = old_stdout
with open(filepath, 'w') as f:
f.write(report_text)
print(f"Report saved to {filepath}")
Business Insight: The save_report() method is more useful than it looks. In a production data pipeline, you might run the EDAReport automatically whenever new data arrives and save the output to a shared drive. If something changes — a new column appears, missing values spike, distributions shift — the report catches it before it breaks a model downstream.
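One caveat about swapping sys.stdout by hand: if any report method raises an exception midway, stdout stays redirected and later print() calls silently disappear. The standard library's contextlib.redirect_stdout restores stdout automatically, even on errors. A sketch of the same capture idea (demo_report here is a hypothetical stand-in for the report methods):

```python
import io
from contextlib import redirect_stdout

def demo_report():
    # Stand-in for shape_summary(), missing_report(), etc.
    print("DATASET: 5,000 rows x 12 columns")

buf = io.StringIO()
with redirect_stdout(buf):  # stdout is restored when the block exits,
    demo_report()           # even if demo_report() raises

report_text = buf.getvalue()
print(f"Captured {len(report_text)} characters")
```

Swapping io.StringIO for an open file handle writes the report directly to disk with no intermediate string at all.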
5.13 From EDA to Action: Connecting the Dots
As the class wrapped up, Professor Okonkwo returned to where she had started.
"You've learned to compute statistics, build visualizations, detect outliers, analyze correlations, and assess missing data," she said. "But none of that matters if it doesn't change a decision."
She pulled up the Athena Retail case one more time — the three findings from Ravi's team's EDA:
- Online customers: high value, high churn.
- Returns driven by product category, not customer type.
- Customer satisfaction scores don't predict behavior.
"Each of these findings killed an assumption," she said. "The assumption that online growth was unqualified good news. The assumption that returns were a customer problem. The assumption that satisfaction surveys measured something real. EDA didn't just describe the data — it challenged the story the company was telling itself."
NK raised her hand. "So EDA is really about... testing the narrative?"
"Exactly," Okonkwo said. "Every company has a story it tells about its data. EDA is how you check whether the story is true."
Tom, who had spent the first half of class wanting to skip ahead to machine learning, closed his laptop thoughtfully. His automated outlier-removal disaster had cost him a class exercise, but it had taught him something more valuable: the model is only as good as the understanding behind it.
"Next week," Okonkwo said, "we move from exploration to explanation — from understanding our data to building models that predict from it. But you'll be coming back to EDA constantly. Every new dataset. Every new feature. Every time your model starts behaving strangely. The first question is always the same: What does the data actually look like?"
She paused. "And the second question is: What was I assuming that the data just proved wrong?"
Chapter Summary
This chapter introduced Exploratory Data Analysis as both a technical practice and a business discipline. We covered:
- The EDA mindset: Exploration before confirmation, following John Tukey's philosophy that the most valuable insights come from looking at data without preconceptions.
- Descriptive statistics: Mean, median, mode, standard deviation, percentiles, skewness, and kurtosis — each explained with business intuition rather than formulas.
- Visualization principles: Tufte's data-ink ratio, the dangers of chartjunk, and the "So what?" test that every chart must pass.
- matplotlib and seaborn: The Python visualization stack, from basic histograms and bar charts to correlation heatmaps and pair plots.
- Distribution analysis: Recognizing normal, skewed, bimodal, and uniform distributions, and understanding what each shape means for business decisions.
- Correlation analysis: Interpreting correlation matrices, avoiding the three deadly sins (causation, nonlinearity, outlier effects), and ranking variable relationships.
- Missing data: The MCAR/MAR/MNAR framework, visualization of missingness patterns, and the business implications of different imputation strategies.
- Data storytelling: The SCQA framework, executive summary writing, and the critical difference between analysis and communication.
- The EDAReport class: A reusable Python tool that automates shape summaries, missing data reports, descriptive statistics, distribution plots, correlation heatmaps, and executive summaries.
In Chapter 6, we examine the business of machine learning — the organizational, economic, and strategic context that determines which models get built and which gather dust. And in Chapter 7, we'll apply everything from this chapter directly: the EDA insights from Athena's customer data will feed into our first supervised learning model, a churn classifier where the patterns we discovered here become the features that power prediction.
The data has spoken. Now it's time to listen — and then to act.