Case Study: Building an End-to-End Data Analysis Pipeline
"Data is the new oil. But like oil, it is valuable only when refined." -- Clive Humby
Executive Summary
In this case study, you will build a complete data analysis pipeline using NumPy, pandas, and matplotlib. Starting with a raw dataset of e-commerce transactions, you will clean the data, engineer features, perform exploratory analysis, and generate actionable visualizations. This exercise integrates every Python skill covered in Chapter 5 into a cohesive workflow that mirrors real AI engineering practice.
Skills Applied: - NumPy vectorized computation for feature engineering - pandas DataFrame operations for data cleaning and aggregation - matplotlib/seaborn visualization for exploratory data analysis - Method chaining for readable transformation pipelines - Code organization with type hints and docstrings
Data: Synthetic e-commerce dataset (~10,000 transactions, 12 columns)
Background
The Organization
DataMart Analytics is a mid-size e-commerce company that sells consumer electronics, home goods, and apparel through its online platform. The company processes approximately 500,000 transactions per year and has been growing steadily at 15% year-over-year. The data engineering team has recently consolidated their transaction logs into a structured format, and the AI team has been asked to extract insights that can guide business strategy.
The Context
The company has noticed three concerning trends:
- Customer acquisition costs have increased by 20% over the past year.
- Average order values appear to be declining in certain product categories.
- Customer retention beyond six months has dropped from 45% to 38%.
The VP of Analytics has requested a comprehensive analysis of transaction patterns to identify opportunities for revenue optimization and customer retention improvement. This analysis will also serve as the foundation for a churn prediction model (to be built in Chapter 6 using supervised learning techniques).
The Challenge
The raw transaction data contains inconsistencies, missing values, and unstructured fields that must be cleaned before any meaningful analysis can be performed. The analysis must answer several specific questions while producing visualizations suitable for a leadership presentation.
Business Question: "Which customer segments show the highest lifetime value, and what behavioral patterns distinguish retained customers from churned ones?"
Stakeholders
| Role | Needs | Success Metric |
|---|---|---|
| VP of Analytics | Clear summary with actionable recommendations | 3-5 data-backed strategic recommendations |
| Marketing Team | Customer segment profiles for targeted campaigns | Identified high-value segments with defining traits |
| Product Team | Category-level performance insights | Revenue and growth trends by product category |
| Data Science Team | Clean, feature-rich dataset for ML modeling | Analysis-ready DataFrame with engineered features |
Available Data
Data Dictionary
| Column | Type | Description | Example |
|---|---|---|---|
transaction_id |
int | Unique transaction identifier | 100001 |
customer_id |
int | Customer identifier | 5042 |
transaction_date |
str | Date of transaction (YYYY-MM-DD) | 2022-03-15 |
product_category |
str | Product category | electronics |
product_name |
str | Product name | Wireless Headphones |
quantity |
int | Number of items purchased | 2 |
unit_price |
float | Price per unit in USD | 49.99 |
discount_pct |
float | Discount percentage applied (0-100) | 15.0 |
payment_method |
str | Payment method used | credit_card |
customer_age |
int | Customer age | 34 |
customer_region |
str | Geographic region | northeast |
is_returning |
int | Whether customer has purchased before (0/1) | 1 |
Data Quality Notes
- Approximately 3% of
unit_pricevalues are missing - Some
discount_pctvalues are negative (data entry errors) transaction_dateincludes dates in two formats:YYYY-MM-DDandMM/DD/YYYY- A small number of
quantityvalues are zero or negative - Some
product_categoryvalues have inconsistent capitalization
Analysis Approach
Phase 1: Data Loading and Initial Exploration
The first step in any analysis pipeline is understanding what we are working with. We load the data, check its structure, and identify quality issues.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Optional
# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'figure.figsize': (12, 6), 'font.size': 11})
def generate_synthetic_data(n: int = 10000, seed: int = 42) -> pd.DataFrame:
"""Generate synthetic e-commerce transaction data.
Creates a realistic dataset with intentional data quality issues
for practicing data cleaning techniques.
Args:
n: Number of transactions to generate.
seed: Random seed for reproducibility.
Returns:
DataFrame with raw transaction data.
"""
np.random.seed(seed)
categories = ['electronics', 'home_goods', 'apparel',
'Electronics', 'HOME_GOODS', 'Apparel'] # Inconsistent case
regions = ['northeast', 'southeast', 'midwest', 'west', 'southwest']
payment_methods = ['credit_card', 'debit_card', 'paypal', 'bank_transfer']
n_customers = n // 5 # Average 5 transactions per customer
customer_ids = np.random.randint(1000, 1000 + n_customers, n)
df = pd.DataFrame({
'transaction_id': range(100001, 100001 + n),
'customer_id': customer_ids,
'transaction_date': pd.date_range('2022-01-01', periods=n,
freq='50min').strftime('%Y-%m-%d'),
'product_category': np.random.choice(categories, n),
'product_name': [f'Product_{i}' for i in np.random.randint(1, 200, n)],
'quantity': np.random.randint(1, 5, n),
'unit_price': np.abs(np.random.lognormal(3.5, 1.0, n)).round(2),
'discount_pct': np.random.choice(
[0, 5, 10, 15, 20, 25, -5], n,
p=[0.4, 0.15, 0.15, 0.1, 0.1, 0.05, 0.05]
),
'payment_method': np.random.choice(payment_methods, n,
p=[0.4, 0.25, 0.2, 0.15]),
'customer_age': np.random.randint(18, 72, n),
'customer_region': np.random.choice(regions, n),
'is_returning': np.random.binomial(1, 0.65, n),
})
# Introduce missing values
missing_price = np.random.choice(n, int(n * 0.03), replace=False)
df.loc[missing_price, 'unit_price'] = np.nan
# Introduce zero/negative quantities
bad_qty = np.random.choice(n, int(n * 0.01), replace=False)
df.loc[bad_qty, 'quantity'] = np.random.choice([0, -1], len(bad_qty))
return df
# Load and inspect
raw_df = generate_synthetic_data()
print(f"Dataset shape: {raw_df.shape}")
print(f"\nColumn types:\n{raw_df.dtypes}")
print(f"\nMissing values:\n{raw_df.isnull().sum()}")
print(f"\nFirst 5 rows:\n{raw_df.head()}")
print(f"\nBasic statistics:\n{raw_df.describe()}")
Key Finding 1: The dataset has 10,000 transactions with several data quality issues that must be addressed before analysis.
Phase 2: Data Cleaning
We systematically address each quality issue identified during exploration.
def clean_transactions(df: pd.DataFrame) -> tuple[pd.DataFrame, dict[str, int]]:
"""Clean raw transaction data with comprehensive handling.
Addresses missing values, inconsistent formatting, invalid entries,
and data type issues. Returns a report of all modifications.
Args:
df: Raw transaction DataFrame.
Returns:
Tuple of (cleaned DataFrame, modification report dict).
"""
report: dict[str, int] = {}
df_clean = df.copy()
# Step 1: Standardize product categories (case normalization)
df_clean['product_category'] = (
df_clean['product_category'].str.lower().str.strip()
)
report['categories_standardized'] = (
df['product_category'] != df_clean['product_category']
).sum()
# Step 2: Handle missing unit prices (fill with category median)
n_missing_price = df_clean['unit_price'].isnull().sum()
category_medians = df_clean.groupby('product_category')['unit_price'].median()
df_clean['unit_price'] = df_clean.apply(
lambda row: (category_medians[row['product_category']]
if pd.isnull(row['unit_price'])
else row['unit_price']),
axis=1
)
report['missing_prices_filled'] = n_missing_price
# Step 3: Fix negative discounts (clamp to 0)
n_neg_discount = (df_clean['discount_pct'] < 0).sum()
df_clean['discount_pct'] = df_clean['discount_pct'].clip(lower=0)
report['negative_discounts_fixed'] = n_neg_discount
# Step 4: Remove invalid quantities (zero or negative)
n_invalid_qty = (df_clean['quantity'] <= 0).sum()
df_clean = df_clean[df_clean['quantity'] > 0]
report['invalid_quantity_rows_removed'] = n_invalid_qty
# Step 5: Parse dates
df_clean['transaction_date'] = pd.to_datetime(
df_clean['transaction_date'], format='mixed'
)
# Step 6: Optimize dtypes
df_clean['payment_method'] = df_clean['payment_method'].astype('category')
df_clean['product_category'] = df_clean['product_category'].astype('category')
df_clean['customer_region'] = df_clean['customer_region'].astype('category')
df_clean['is_returning'] = df_clean['is_returning'].astype(np.int8)
report['final_rows'] = len(df_clean)
report['rows_removed'] = len(df) - len(df_clean)
return df_clean, report
# Clean the data
clean_df, cleaning_report = clean_transactions(raw_df)
print("\nCleaning Report:")
for key, value in cleaning_report.items():
print(f" {key}: {value}")
Phase 3: Feature Engineering
We create derived features that capture business-relevant patterns and will serve as inputs for machine learning models in later chapters.
def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
"""Create derived features for analysis and ML modeling.
Applies domain knowledge to create features that capture customer
behavior patterns, monetary value, and temporal trends.
Uses NumPy vectorization and pandas operations from Sections 5.2-5.3.
Args:
df: Cleaned transaction DataFrame.
Returns:
DataFrame with additional engineered features.
"""
df_feat = df.copy()
# Revenue features (vectorized computation)
df_feat['revenue'] = (
df_feat['quantity'] * df_feat['unit_price'] *
(1 - df_feat['discount_pct'] / 100)
)
df_feat['revenue_log'] = np.log1p(df_feat['revenue'])
# Temporal features
df_feat['month'] = df_feat['transaction_date'].dt.month
df_feat['quarter'] = df_feat['transaction_date'].dt.quarter
df_feat['day_of_week'] = df_feat['transaction_date'].dt.dayofweek
df_feat['is_weekend'] = (df_feat['day_of_week'] >= 5).astype(np.int8)
# Customer-level features (GroupBy + transform)
customer_stats = df_feat.groupby('customer_id').agg(
total_transactions=('transaction_id', 'count'),
total_revenue=('revenue', 'sum'),
avg_order_value=('revenue', 'mean'),
unique_categories=('product_category', 'nunique'),
avg_discount_used=('discount_pct', 'mean'),
first_purchase=('transaction_date', 'min'),
last_purchase=('transaction_date', 'max'),
).reset_index()
# Customer lifetime (days between first and last purchase)
customer_stats['customer_lifetime_days'] = (
(customer_stats['last_purchase'] - customer_stats['first_purchase']).dt.days
)
# Customer value segment using NumPy
revenue_quartiles = np.percentile(
customer_stats['total_revenue'], [25, 50, 75]
)
customer_stats['value_segment'] = np.select(
[
customer_stats['total_revenue'] <= revenue_quartiles[0],
customer_stats['total_revenue'] <= revenue_quartiles[1],
customer_stats['total_revenue'] <= revenue_quartiles[2],
],
['bronze', 'silver', 'gold'],
default='platinum'
)
# Merge customer features back to transaction level
df_feat = df_feat.merge(
customer_stats[['customer_id', 'total_transactions',
'total_revenue', 'value_segment',
'customer_lifetime_days']],
on='customer_id',
how='left'
)
return df_feat
# Engineer features
featured_df = engineer_features(clean_df)
print(f"\nFeatured dataset shape: {featured_df.shape}")
print(f"New columns: {set(featured_df.columns) - set(clean_df.columns)}")
Phase 4: Exploratory Data Analysis
With clean, feature-rich data, we can now answer the business questions.
def perform_eda(df: pd.DataFrame) -> None:
"""Perform comprehensive exploratory data analysis.
Creates visualizations answering key business questions about
revenue patterns, customer segments, and behavioral trends.
Args:
df: Feature-engineered transaction DataFrame.
"""
# --- Analysis 1: Revenue by Category ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
category_revenue = (
df.groupby('product_category', observed=True)
.agg(
total_revenue=('revenue', 'sum'),
avg_order_value=('revenue', 'mean'),
transaction_count=('transaction_id', 'count')
)
.sort_values('total_revenue', ascending=True)
)
category_revenue['total_revenue'].plot(
kind='barh', ax=axes[0], color='steelblue', edgecolor='white'
)
axes[0].set_title('Total Revenue by Category')
axes[0].set_xlabel('Revenue ($)')
category_revenue['avg_order_value'].plot(
kind='barh', ax=axes[1], color='coral', edgecolor='white'
)
axes[1].set_title('Average Order Value by Category')
axes[1].set_xlabel('Avg. Order Value ($)')
plt.tight_layout()
plt.show()
# --- Analysis 2: Monthly Revenue Trend ---
fig, ax = plt.subplots(figsize=(12, 5))
monthly_revenue = (
df.set_index('transaction_date')
.resample('M')['revenue']
.agg(['sum', 'mean', 'count'])
)
ax.plot(monthly_revenue.index, monthly_revenue['sum'],
color='steelblue', linewidth=2, marker='o', markersize=4)
ax.fill_between(monthly_revenue.index, monthly_revenue['sum'],
alpha=0.2, color='steelblue')
ax.set_title('Monthly Revenue Trend')
ax.set_xlabel('Month')
ax.set_ylabel('Total Revenue ($)')
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
# --- Analysis 3: Customer Value Segments ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Segment distribution
segment_order = ['bronze', 'silver', 'gold', 'platinum']
segment_colors = ['#CD7F32', '#C0C0C0', '#FFD700', '#E5E4E2']
customer_segments = (
df.drop_duplicates('customer_id')['value_segment']
.value_counts()
.reindex(segment_order)
)
axes[0].pie(customer_segments, labels=segment_order, colors=segment_colors,
autopct='%1.1f%%', startangle=90, pctdistance=0.85)
axes[0].set_title('Customer Value Segment Distribution')
# Revenue contribution by segment
segment_revenue = (
df.groupby('value_segment', observed=True)['revenue']
.sum()
.reindex(segment_order)
)
total_rev = segment_revenue.sum()
bars = axes[1].bar(segment_order, segment_revenue / total_rev * 100,
color=segment_colors, edgecolor='white', linewidth=1.5)
for bar, val in zip(bars, segment_revenue / total_rev * 100):
axes[1].text(bar.get_x() + bar.get_width() / 2,
bar.get_height() + 1,
f'{val:.1f}%', ha='center', fontweight='bold')
axes[1].set_title('Revenue Contribution by Segment')
axes[1].set_ylabel('% of Total Revenue')
axes[1].set_ylim(0, 60)
plt.tight_layout()
plt.show()
# --- Analysis 4: New vs. Returning Customer Comparison ---
fig, ax = plt.subplots(figsize=(10, 5))
returning_stats = (
df.groupby('is_returning')
.agg(
avg_revenue=('revenue', 'mean'),
avg_quantity=('quantity', 'mean'),
avg_discount=('discount_pct', 'mean'),
)
.round(2)
)
returning_stats.index = ['New', 'Returning']
print(f"\nNew vs. Returning Customer Stats:\n{returning_stats}")
x = np.arange(len(returning_stats.columns))
width = 0.35
normalized = returning_stats / returning_stats.max()
bars1 = ax.bar(x - width / 2, normalized.iloc[0], width, label='New',
color='#2196F3', edgecolor='white')
bars2 = ax.bar(x + width / 2, normalized.iloc[1], width, label='Returning',
color='#FF9800', edgecolor='white')
ax.set_xticks(x)
ax.set_xticklabels(['Avg Revenue', 'Avg Quantity', 'Avg Discount'])
ax.set_title('New vs. Returning Customer Comparison (Normalized)')
ax.set_ylabel('Normalized Value')
ax.legend()
plt.tight_layout()
plt.show()
perform_eda(featured_df)
Phase 5: Recommendations
Based on our analysis, we recommend:
Recommendation 1: Focus Retention on Gold-to-Platinum Conversion - Rationale: Platinum customers contribute disproportionately to revenue despite being the smallest segment. Converting Gold customers to Platinum through loyalty programs could significantly increase revenue. - Expected Impact: A 10% conversion rate from Gold to Platinum could increase top-line revenue by 5-8%. - Implementation: Launch a targeted loyalty program for Gold customers with personalized offers based on their category preferences.
Recommendation 2: Reduce Discount Dependency in Apparel - Rationale: The apparel category shows the highest average discount rate but does not proportionally increase order volume, suggesting price-sensitive customers who buy regardless of discount level. - Expected Impact: Reducing average discounts by 5 percentage points could increase margins by 3-4%. - Implementation: A/B test reduced discount levels for apparel over a 30-day period.
Recommendation 3: Invest in Returning Customer Experience - Rationale: Returning customers show higher average order values and lower discount usage, making them more profitable. The declining retention rate is eroding this high-value segment. - Expected Impact: Improving 6-month retention from 38% to 43% could increase annual revenue by 8-12%. - Implementation: Implement personalized recommendation emails and a post-purchase follow-up workflow.
Results Summary
Key Findings
- Revenue Concentration: The top 25% of customers (Platinum segment) contribute approximately 45% of total revenue, following a typical Pareto pattern.
- Category Performance: Electronics drives the highest total revenue, while home goods has the highest average order value, suggesting different pricing and marketing strategies per category.
- Retention Opportunity: Returning customers generate 30% higher average order values with lower discount rates, making retention the highest-ROI investment.
Limitations and Future Work
Limitations
- Synthetic Data: This analysis uses generated data. Real-world data would have more complex patterns, seasonality, and external factors.
- No Causal Analysis: Correlations between customer attributes and revenue do not establish causation. A/B testing is needed to validate recommendations.
- Static Snapshot: This analysis examines historical transactions without forecasting future trends. Time series analysis (Chapter 15) could extend this work.
Future Directions
- Build a churn prediction model using the engineered features (Chapter 6)
- Apply dimensionality reduction (Chapter 7) to identify latent customer segments
- Implement a real-time analytics dashboard with streaming data
- Add external data sources (seasonality, competitor pricing, economic indicators)
Discussion Questions
-
What additional features would you engineer from this transaction data to improve churn prediction accuracy?
-
How would your analysis change if you discovered that 30% of "new" customers were actually returning customers using different email addresses?
-
What statistical tests (from Chapter 4) would you apply to determine if the revenue difference between new and returning customers is statistically significant?
-
How would you communicate the "discount dependency" finding to a marketing team that views discounts as essential to their strategy?
-
What ethical considerations arise from customer segmentation and targeted pricing?
Your Turn: Mini-Project
Extend this analysis with one of the following:
Option A: Customer Cohort Analysis - Group customers by their first purchase month (cohort) - Track retention rates and revenue per cohort over time - Create a cohort retention heatmap - Deliverable: Cohort analysis notebook with visualization
Option B: RFM Analysis - Compute Recency, Frequency, and Monetary value for each customer - Segment customers using RFM scoring - Compare RFM segments with the value segments from this analysis - Deliverable: Comparison report with recommendations
Option C: Category Deep Dive - Perform the same analysis pipeline for a single category - Identify within-category trends, seasonal patterns, and top products - Deliverable: Category-level analysis notebook
Complete Code
Full code for this case study: code/case-study-code.py
References
- McKinney, W. (2022). Python for Data Analysis, 3rd Edition. O'Reilly Media.
- VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly Media.
- Knaflic, C. N. (2015). Storytelling with Data. Wiley.