Case Study: Building an End-to-End Data Analysis Pipeline

"Data is the new oil. But like oil, it is valuable only when refined." -- Clive Humby

Executive Summary

In this case study, you will build a complete data analysis pipeline using NumPy, pandas, and matplotlib. Starting with a raw dataset of e-commerce transactions, you will clean the data, engineer features, perform exploratory analysis, and generate actionable visualizations. This exercise integrates every Python skill covered in Chapter 5 into a cohesive workflow that mirrors real AI engineering practice.

Skills Applied:

  • NumPy vectorized computation for feature engineering
  • pandas DataFrame operations for data cleaning and aggregation
  • matplotlib/seaborn visualization for exploratory data analysis
  • Method chaining for readable transformation pipelines
  • Code organization with type hints and docstrings

Data: Synthetic e-commerce dataset (~10,000 transactions, 12 columns)


Background

The Organization

DataMart Analytics is a mid-size e-commerce company that sells consumer electronics, home goods, and apparel through its online platform. The company processes approximately 500,000 transactions per year and has been growing steadily at 15% year-over-year. The data engineering team has recently consolidated their transaction logs into a structured format, and the AI team has been asked to extract insights that can guide business strategy.

The Context

The company has noticed three concerning trends:

  • Customer acquisition costs have increased by 20% over the past year.
  • Average order values appear to be declining in certain product categories.
  • Customer retention beyond six months has dropped from 45% to 38%.

The VP of Analytics has requested a comprehensive analysis of transaction patterns to identify opportunities for revenue optimization and customer retention improvement. This analysis will also serve as the foundation for a churn prediction model (to be built in Chapter 6 using supervised learning techniques).

The Challenge

The raw transaction data contains inconsistencies, missing values, and unstructured fields that must be cleaned before any meaningful analysis can be performed. The analysis must answer several specific questions while producing visualizations suitable for a leadership presentation.

Business Question: "Which customer segments show the highest lifetime value, and what behavioral patterns distinguish retained customers from churned ones?"

Stakeholders

| Role | Needs | Success Metric |
|------|-------|----------------|
| VP of Analytics | Clear summary with actionable recommendations | 3-5 data-backed strategic recommendations |
| Marketing Team | Customer segment profiles for targeted campaigns | Identified high-value segments with defining traits |
| Product Team | Category-level performance insights | Revenue and growth trends by product category |
| Data Science Team | Clean, feature-rich dataset for ML modeling | Analysis-ready DataFrame with engineered features |

Available Data

Data Dictionary

| Column | Type | Description | Example |
|--------|------|-------------|---------|
| transaction_id | int | Unique transaction identifier | 100001 |
| customer_id | int | Customer identifier | 5042 |
| transaction_date | str | Date of transaction (YYYY-MM-DD) | 2022-03-15 |
| product_category | str | Product category | electronics |
| product_name | str | Product name | Wireless Headphones |
| quantity | int | Number of items purchased | 2 |
| unit_price | float | Price per unit in USD | 49.99 |
| discount_pct | float | Discount percentage applied (0-100) | 15.0 |
| payment_method | str | Payment method used | credit_card |
| customer_age | int | Customer age | 34 |
| customer_region | str | Geographic region | northeast |
| is_returning | int | Whether customer has purchased before (0/1) | 1 |

Data Quality Notes

  • Approximately 3% of unit_price values are missing
  • Some discount_pct values are negative (data entry errors)
  • transaction_date includes dates in two formats: YYYY-MM-DD and MM/DD/YYYY
  • A small number of quantity values are zero or negative
  • Some product_category values have inconsistent capitalization

Analysis Approach

Phase 1: Data Loading and Initial Exploration

The first step in any analysis pipeline is understanding what we are working with. We load the data, check its structure, and identify quality issues.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Optional

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'figure.figsize': (12, 6), 'font.size': 11})

def generate_synthetic_data(n: int = 10000, seed: int = 42) -> pd.DataFrame:
    """Generate synthetic e-commerce transaction data.

    Creates a realistic dataset with intentional data quality issues
    for practicing data cleaning techniques.

    Args:
        n: Number of transactions to generate.
        seed: Random seed for reproducibility.

    Returns:
        DataFrame with raw transaction data.
    """
    np.random.seed(seed)

    categories = ['electronics', 'home_goods', 'apparel',
                   'Electronics', 'HOME_GOODS', 'Apparel']  # Inconsistent case
    regions = ['northeast', 'southeast', 'midwest', 'west', 'southwest']
    payment_methods = ['credit_card', 'debit_card', 'paypal', 'bank_transfer']

    n_customers = n // 5  # Average 5 transactions per customer
    customer_ids = np.random.randint(1000, 1000 + n_customers, n)

    df = pd.DataFrame({
        'transaction_id': range(100001, 100001 + n),
        'customer_id': customer_ids,
        'transaction_date': pd.date_range('2022-01-01', periods=n,
                                           freq='50min').strftime('%Y-%m-%d'),
        'product_category': np.random.choice(categories, n),
        'product_name': [f'Product_{i}' for i in np.random.randint(1, 200, n)],
        'quantity': np.random.randint(1, 5, n),
        'unit_price': np.abs(np.random.lognormal(3.5, 1.0, n)).round(2),
        'discount_pct': np.random.choice(
            [0, 5, 10, 15, 20, 25, -5], n,
            p=[0.4, 0.15, 0.15, 0.1, 0.1, 0.05, 0.05]
        ),
        'payment_method': np.random.choice(payment_methods, n,
                                            p=[0.4, 0.25, 0.2, 0.15]),
        'customer_age': np.random.randint(18, 72, n),
        'customer_region': np.random.choice(regions, n),
        'is_returning': np.random.binomial(1, 0.65, n),
    })

    # Introduce missing values
    missing_price = np.random.choice(n, int(n * 0.03), replace=False)
    df.loc[missing_price, 'unit_price'] = np.nan

    # Introduce zero/negative quantities
    bad_qty = np.random.choice(n, int(n * 0.01), replace=False)
    df.loc[bad_qty, 'quantity'] = np.random.choice([0, -1], len(bad_qty))

    # Introduce a second date format (MM/DD/YYYY) for a subset of rows,
    # matching the mixed formats described in the Data Quality Notes
    mixed_dates = np.random.choice(n, int(n * 0.05), replace=False)
    df.loc[mixed_dates, 'transaction_date'] = pd.to_datetime(
        df.loc[mixed_dates, 'transaction_date']
    ).dt.strftime('%m/%d/%Y')

    return df


# Load and inspect
raw_df = generate_synthetic_data()
print(f"Dataset shape: {raw_df.shape}")
print(f"\nColumn types:\n{raw_df.dtypes}")
print(f"\nMissing values:\n{raw_df.isnull().sum()}")
print(f"\nFirst 5 rows:\n{raw_df.head()}")
print(f"\nBasic statistics:\n{raw_df.describe()}")

Key Finding 1: The dataset has 10,000 transactions with several data quality issues that must be addressed before analysis.
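
To make "several data quality issues" concrete, a short audit quantifies each issue listed in the Data Quality Notes. This is a sketch, assuming the raw_df created above; the regular expression simply flags dates that are not in YYYY-MM-DD form, and the dictionary keys are illustrative labels rather than dataset columns.

# Quantify the quality issues called out in the Data Quality Notes
issues = {
    'missing_unit_price': raw_df['unit_price'].isnull().sum(),
    'negative_discounts': (raw_df['discount_pct'] < 0).sum(),
    'non_positive_quantities': (raw_df['quantity'] <= 0).sum(),
    'extra_category_spellings': (raw_df['product_category'].nunique()
                                 - raw_df['product_category'].str.lower().nunique()),
    'non_iso_dates': (~raw_df['transaction_date']
                      .str.match(r'\d{4}-\d{2}-\d{2}')).sum(),
}
for issue, count in issues.items():
    print(f"  {issue}: {count}")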


Phase 2: Data Cleaning

We systematically address each quality issue identified during exploration.

def clean_transactions(df: pd.DataFrame) -> tuple[pd.DataFrame, dict[str, int]]:
    """Clean raw transaction data with comprehensive handling.

    Addresses missing values, inconsistent formatting, invalid entries,
    and data type issues. Returns a report of all modifications.

    Args:
        df: Raw transaction DataFrame.

    Returns:
        Tuple of (cleaned DataFrame, modification report dict).
    """
    report: dict[str, int] = {}
    df_clean = df.copy()

    # Step 1: Standardize product categories (case normalization)
    df_clean['product_category'] = (
        df_clean['product_category'].str.lower().str.strip()
    )
    report['categories_standardized'] = (
        df['product_category'] != df_clean['product_category']
    ).sum()

    # Step 2: Handle missing unit prices (fill with category median)
    n_missing_price = df_clean['unit_price'].isnull().sum()
    df_clean['unit_price'] = df_clean['unit_price'].fillna(
        df_clean.groupby('product_category')['unit_price'].transform('median')
    )
    report['missing_prices_filled'] = n_missing_price

    # Step 3: Fix negative discounts (clamp to 0)
    n_neg_discount = (df_clean['discount_pct'] < 0).sum()
    df_clean['discount_pct'] = df_clean['discount_pct'].clip(lower=0)
    report['negative_discounts_fixed'] = n_neg_discount

    # Step 4: Remove invalid quantities (zero or negative)
    n_invalid_qty = (df_clean['quantity'] <= 0).sum()
    df_clean = df_clean[df_clean['quantity'] > 0]
    report['invalid_quantity_rows_removed'] = n_invalid_qty

    # Step 5: Parse dates
    df_clean['transaction_date'] = pd.to_datetime(
        df_clean['transaction_date'], format='mixed'
    )

    # Step 6: Optimize dtypes
    df_clean['payment_method'] = df_clean['payment_method'].astype('category')
    df_clean['product_category'] = df_clean['product_category'].astype('category')
    df_clean['customer_region'] = df_clean['customer_region'].astype('category')
    df_clean['is_returning'] = df_clean['is_returning'].astype(np.int8)

    report['final_rows'] = len(df_clean)
    report['rows_removed'] = len(df) - len(df_clean)

    return df_clean, report


# Clean the data
clean_df, cleaning_report = clean_transactions(raw_df)
print("\nCleaning Report:")
for key, value in cleaning_report.items():
    print(f"  {key}: {value}")

Phase 3: Feature Engineering

We create derived features that capture business-relevant patterns and will serve as inputs for machine learning models in later chapters.

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Create derived features for analysis and ML modeling.

    Applies domain knowledge to create features that capture customer
    behavior patterns, monetary value, and temporal trends.

    Uses NumPy vectorization and pandas operations from Sections 5.2-5.3.

    Args:
        df: Cleaned transaction DataFrame.

    Returns:
        DataFrame with additional engineered features.
    """
    df_feat = df.copy()

    # Revenue features (vectorized computation)
    df_feat['revenue'] = (
        df_feat['quantity'] * df_feat['unit_price'] *
        (1 - df_feat['discount_pct'] / 100)
    )
    df_feat['revenue_log'] = np.log1p(df_feat['revenue'])

    # Temporal features
    df_feat['month'] = df_feat['transaction_date'].dt.month
    df_feat['quarter'] = df_feat['transaction_date'].dt.quarter
    df_feat['day_of_week'] = df_feat['transaction_date'].dt.dayofweek
    df_feat['is_weekend'] = (df_feat['day_of_week'] >= 5).astype(np.int8)

    # Customer-level features (GroupBy + transform)
    customer_stats = df_feat.groupby('customer_id').agg(
        total_transactions=('transaction_id', 'count'),
        total_revenue=('revenue', 'sum'),
        avg_order_value=('revenue', 'mean'),
        unique_categories=('product_category', 'nunique'),
        avg_discount_used=('discount_pct', 'mean'),
        first_purchase=('transaction_date', 'min'),
        last_purchase=('transaction_date', 'max'),
    ).reset_index()

    # Customer lifetime (days between first and last purchase)
    customer_stats['customer_lifetime_days'] = (
        (customer_stats['last_purchase'] - customer_stats['first_purchase']).dt.days
    )

    # Customer value segment using NumPy
    revenue_quartiles = np.percentile(
        customer_stats['total_revenue'], [25, 50, 75]
    )
    customer_stats['value_segment'] = np.select(
        [
            customer_stats['total_revenue'] <= revenue_quartiles[0],
            customer_stats['total_revenue'] <= revenue_quartiles[1],
            customer_stats['total_revenue'] <= revenue_quartiles[2],
        ],
        ['bronze', 'silver', 'gold'],
        default='platinum'
    )

    # Merge customer features back to transaction level
    df_feat = df_feat.merge(
        customer_stats[['customer_id', 'total_transactions',
                        'total_revenue', 'value_segment',
                        'customer_lifetime_days']],
        on='customer_id',
        how='left'
    )

    return df_feat


# Engineer features
featured_df = engineer_features(clean_df)
print(f"\nFeatured dataset shape: {featured_df.shape}")
print(f"New columns: {set(featured_df.columns) - set(clean_df.columns)}")

Phase 4: Exploratory Data Analysis

With clean, feature-rich data, we can now answer the business questions.

def perform_eda(df: pd.DataFrame) -> None:
    """Perform comprehensive exploratory data analysis.

    Creates visualizations answering key business questions about
    revenue patterns, customer segments, and behavioral trends.

    Args:
        df: Feature-engineered transaction DataFrame.
    """
    # --- Analysis 1: Revenue by Category ---
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    category_revenue = (
        df.groupby('product_category', observed=True)
        .agg(
            total_revenue=('revenue', 'sum'),
            avg_order_value=('revenue', 'mean'),
            transaction_count=('transaction_id', 'count')
        )
        .sort_values('total_revenue', ascending=True)
    )

    category_revenue['total_revenue'].plot(
        kind='barh', ax=axes[0], color='steelblue', edgecolor='white'
    )
    axes[0].set_title('Total Revenue by Category')
    axes[0].set_xlabel('Revenue ($)')

    category_revenue['avg_order_value'].plot(
        kind='barh', ax=axes[1], color='coral', edgecolor='white'
    )
    axes[1].set_title('Average Order Value by Category')
    axes[1].set_xlabel('Avg. Order Value ($)')

    plt.tight_layout()
    plt.show()

    # --- Analysis 2: Monthly Revenue Trend ---
    fig, ax = plt.subplots(figsize=(12, 5))

    monthly_revenue = (
        df.set_index('transaction_date')
        .resample('M')['revenue']
        .agg(['sum', 'mean', 'count'])
    )

    ax.plot(monthly_revenue.index, monthly_revenue['sum'],
            color='steelblue', linewidth=2, marker='o', markersize=4)
    ax.fill_between(monthly_revenue.index, monthly_revenue['sum'],
                     alpha=0.2, color='steelblue')
    ax.set_title('Monthly Revenue Trend')
    ax.set_xlabel('Month')
    ax.set_ylabel('Total Revenue ($)')
    ax.tick_params(axis='x', rotation=45)
    plt.tight_layout()
    plt.show()

    # --- Analysis 3: Customer Value Segments ---
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Segment distribution
    segment_order = ['bronze', 'silver', 'gold', 'platinum']
    segment_colors = ['#CD7F32', '#C0C0C0', '#FFD700', '#E5E4E2']
    customer_segments = (
        df.drop_duplicates('customer_id')['value_segment']
        .value_counts()
        .reindex(segment_order)
    )
    axes[0].pie(customer_segments, labels=segment_order, colors=segment_colors,
                autopct='%1.1f%%', startangle=90, pctdistance=0.85)
    axes[0].set_title('Customer Value Segment Distribution')

    # Revenue contribution by segment
    segment_revenue = (
        df.groupby('value_segment', observed=True)['revenue']
        .sum()
        .reindex(segment_order)
    )
    total_rev = segment_revenue.sum()
    bars = axes[1].bar(segment_order, segment_revenue / total_rev * 100,
                        color=segment_colors, edgecolor='white', linewidth=1.5)
    for bar, val in zip(bars, segment_revenue / total_rev * 100):
        axes[1].text(bar.get_x() + bar.get_width() / 2,
                      bar.get_height() + 1,
                      f'{val:.1f}%', ha='center', fontweight='bold')
    axes[1].set_title('Revenue Contribution by Segment')
    axes[1].set_ylabel('% of Total Revenue')
    axes[1].set_ylim(0, 60)

    plt.tight_layout()
    plt.show()

    # --- Analysis 4: New vs. Returning Customer Comparison ---
    fig, ax = plt.subplots(figsize=(10, 5))

    returning_stats = (
        df.groupby('is_returning')
        .agg(
            avg_revenue=('revenue', 'mean'),
            avg_quantity=('quantity', 'mean'),
            avg_discount=('discount_pct', 'mean'),
        )
        .round(2)
    )
    returning_stats.index = ['New', 'Returning']
    print(f"\nNew vs. Returning Customer Stats:\n{returning_stats}")

    x = np.arange(len(returning_stats.columns))
    width = 0.35

    normalized = returning_stats / returning_stats.max()
    bars1 = ax.bar(x - width / 2, normalized.iloc[0], width, label='New',
                    color='#2196F3', edgecolor='white')
    bars2 = ax.bar(x + width / 2, normalized.iloc[1], width, label='Returning',
                    color='#FF9800', edgecolor='white')

    ax.set_xticks(x)
    ax.set_xticklabels(['Avg Revenue', 'Avg Quantity', 'Avg Discount'])
    ax.set_title('New vs. Returning Customer Comparison (Normalized)')
    ax.set_ylabel('Normalized Value')
    ax.legend()

    plt.tight_layout()
    plt.show()


perform_eda(featured_df)

Phase 5: Recommendations

Based on our analysis, we recommend:

Recommendation 1: Focus Retention on Gold-to-Platinum Conversion

  • Rationale: Platinum customers contribute disproportionately to revenue despite being the smallest segment. Converting Gold customers to Platinum through loyalty programs could significantly increase revenue.
  • Expected Impact: A 10% conversion rate from Gold to Platinum could increase top-line revenue by 5-8%.
  • Implementation: Launch a targeted loyalty program for Gold customers with personalized offers based on their category preferences.

Recommendation 2: Reduce Discount Dependency in Apparel

  • Rationale: The apparel category shows the highest average discount rate without a proportional increase in order volume, suggesting these customers buy regardless of discount level.
  • Expected Impact: Reducing average discounts by 5 percentage points could increase margins by 3-4%.
  • Implementation: A/B test reduced discount levels for apparel over a 30-day period.

Recommendation 3: Invest in Returning Customer Experience

  • Rationale: Returning customers show higher average order values and lower discount usage, making them more profitable. The declining retention rate is eroding this high-value segment.
  • Expected Impact: Improving 6-month retention from 38% to 43% could increase annual revenue by 8-12%.
  • Implementation: Implement personalized recommendation emails and a post-purchase follow-up workflow.
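
The Expected Impact figures above are rough estimates rather than model outputs. A back-of-envelope sketch such as the following shows how the Recommendation 1 estimate can be derived and stress-tested. It assumes featured_df from Phase 3, a hypothetical 10% conversion rate, and the simplifying assumption that converted customers reach the average Platinum spend.

# What-if: 10% of Gold customers start spending like the average Platinum customer
customers = featured_df.drop_duplicates('customer_id')
segment_avg = customers.groupby('value_segment')['total_revenue'].mean()

n_gold = (customers['value_segment'] == 'gold').sum()
converted = int(n_gold * 0.10)  # hypothetical conversion rate
uplift = converted * (segment_avg['platinum'] - segment_avg['gold'])

total_revenue = featured_df['revenue'].sum()
print(f"Estimated uplift: ${uplift:,.0f} "
      f"({uplift / total_revenue:.1%} of current revenue)")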


Results Summary

Key Findings

  1. Revenue Concentration: The top 25% of customers (Platinum segment) contribute approximately 45% of total revenue, following a typical Pareto pattern.
  2. Category Performance: Electronics drives the highest total revenue, while home goods has the highest average order value, suggesting different pricing and marketing strategies per category.
  3. Retention Opportunity: Returning customers generate 30% higher average order values with lower discount rates, making retention the highest-ROI investment.
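
These figures come from the Phase 4 analysis; a compact way to reproduce the headline metrics is sketched below, assuming featured_df from Phase 3 (exact values depend on the data).

# Finding 1: revenue share of the Platinum segment
platinum_share = (
    featured_df.loc[featured_df['value_segment'] == 'platinum', 'revenue'].sum()
    / featured_df['revenue'].sum()
)

# Finding 3: average order value, new vs. returning customers
aov = featured_df.groupby('is_returning')['revenue'].mean()
aov_lift = aov.loc[1] / aov.loc[0] - 1

print(f"Platinum revenue share: {platinum_share:.1%}")
print(f"Returning-customer AOV lift: {aov_lift:+.1%}")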

Limitations and Future Work

Limitations

  1. Synthetic Data: This analysis uses generated data. Real-world data would have more complex patterns, seasonality, and external factors.
  2. No Causal Analysis: Correlations between customer attributes and revenue do not establish causation. A/B testing is needed to validate recommendations.
  3. Static Snapshot: This analysis examines historical transactions without forecasting future trends. Time series analysis (Chapter 15) could extend this work.

Future Directions

  • Build a churn prediction model using the engineered features (Chapter 6)
  • Apply dimensionality reduction (Chapter 7) to identify latent customer segments
  • Implement a real-time analytics dashboard with streaming data
  • Add external data sources (seasonality, competitor pricing, economic indicators)

Discussion Questions

  1. What additional features would you engineer from this transaction data to improve churn prediction accuracy?

  2. How would your analysis change if you discovered that 30% of "new" customers were actually returning customers using different email addresses?

  3. What statistical tests (from Chapter 4) would you apply to determine if the revenue difference between new and returning customers is statistically significant?

  4. How would you communicate the "discount dependency" finding to a marketing team that views discounts as essential to their strategy?

  5. What ethical considerations arise from customer segmentation and targeted pricing?


Your Turn: Mini-Project

Extend this analysis with one of the following:

Option A: Customer Cohort Analysis

  • Group customers by their first purchase month (cohort)
  • Track retention rates and revenue per cohort over time
  • Create a cohort retention heatmap
  • Deliverable: Cohort analysis notebook with visualization (a starter sketch follows Option C below)

Option B: RFM Analysis

  • Compute Recency, Frequency, and Monetary value for each customer
  • Segment customers using RFM scoring
  • Compare RFM segments with the value segments from this analysis
  • Deliverable: Comparison report with recommendations

Option C: Category Deep Dive

  • Perform the same analysis pipeline for a single category
  • Identify within-category trends, seasonal patterns, and top products
  • Deliverable: Category-level analysis notebook
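
As a starting point for Option A, here is a minimal cohort skeleton. It is a sketch, assuming featured_df from Phase 3 and defining retention as "any purchase in a later month"; you will still need to build the heatmap and revenue views.

# Cohort skeleton: label each customer by first-purchase month
cohorts = featured_df.copy()
cohorts['order_month'] = cohorts['transaction_date'].dt.to_period('M')
cohorts['cohort_month'] = (
    cohorts.groupby('customer_id')['order_month'].transform('min')
)
cohorts['months_since_first'] = (
    (cohorts['order_month'] - cohorts['cohort_month']).apply(lambda offset: offset.n)
)

# Unique active customers per cohort and month, normalized by cohort size
retention = (
    cohorts.groupby(['cohort_month', 'months_since_first'])['customer_id']
    .nunique()
    .unstack(fill_value=0)
)
retention_rate = retention.div(retention[0], axis=0)
print(retention_rate.round(2).head())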


Complete Code

Full code for this case study: code/case-study-code.py

