Case Study 2: Finding the Subtle Bug

Overview

  • Project: Customer data migration pipeline for SaaSMetrics Inc.
  • Bug Type: Off-by-one error in date range filtering
  • Detection: Discovered 19 days after deployment via a customer support escalation
  • Impact: Systematic data loss; the first record of every batch was silently dropped
  • Root Cause: AI-generated comparison used > instead of >= in a date boundary check

Background

SaaSMetrics Inc. provides business intelligence dashboards for SaaS companies. Their platform ingests customer subscription data from multiple sources, processes it through a transformation pipeline, and produces analytics reports including Monthly Recurring Revenue (MRR), churn rates, and cohort analyses.

The engineering team was migrating from their legacy data pipeline to a new system. Developer Priya Sharma was assigned to build the core data transformation module. Given the tight timeline — the migration had to be completed within a sprint — Priya used an AI coding assistant extensively.

The Development

The Requirement

The pipeline needed to:

  1. Ingest CSV exports from customer billing systems
  2. Parse and validate subscription records
  3. Filter records by date range (processing window)
  4. Aggregate records by customer and subscription tier
  5. Calculate MRR metrics
  6. Output structured reports

The AI-Generated Pipeline

Priya prompted the AI assistant in stages, building the pipeline incrementally. The critical module was the data processor. Here is what the AI generated:

"""
Data processing pipeline for subscription analytics.
Processes raw subscription records into aggregated metrics.
"""

import csv
from dataclasses import dataclass, field
from datetime import datetime, date
from decimal import Decimal
from pathlib import Path
from typing import Optional


@dataclass
class SubscriptionRecord:
    """A single subscription event record."""
    customer_id: str
    subscription_id: str
    event_type: str  # 'new', 'upgrade', 'downgrade', 'churn'
    event_date: date
    tier: str
    amount: Decimal
    currency: str = "USD"


@dataclass
class CustomerMetrics:
    """Aggregated metrics for a single customer."""
    customer_id: str
    current_tier: str
    mrr: Decimal
    events: list[SubscriptionRecord] = field(default_factory=list)
    first_event_date: Optional[date] = None
    last_event_date: Optional[date] = None


class DataProcessor:
    """Process subscription records into customer metrics."""

    def __init__(self, start_date: date, end_date: date):
        """
        Initialize processor with a date range filter.

        Args:
            start_date: Beginning of processing window (inclusive)
            end_date: End of processing window (inclusive)
        """
        self.start_date = start_date
        self.end_date = end_date
        self.records: list[SubscriptionRecord] = []
        self.metrics: dict[str, CustomerMetrics] = {}
        self.processing_stats = {
            "total_read": 0,
            "filtered_out": 0,
            "processed": 0,
            "errors": 0,
        }

    def load_csv(self, filepath: Path) -> list[SubscriptionRecord]:
        """Load subscription records from a CSV file."""
        records = []
        with open(filepath, "r", newline="") as f:
            reader = csv.DictReader(f)
            for row in reader:
                self.processing_stats["total_read"] += 1
                try:
                    record = SubscriptionRecord(
                        customer_id=row["customer_id"],
                        subscription_id=row["subscription_id"],
                        event_type=row["event_type"],
                        event_date=datetime.strptime(
                            row["event_date"], "%Y-%m-%d"
                        ).date(),
                        tier=row["tier"],
                        amount=Decimal(row["amount"]),
                        currency=row.get("currency", "USD"),
                    )
                    records.append(record)
                except (KeyError, ValueError) as e:
                    self.processing_stats["errors"] += 1
                    print(f"Warning: Skipping invalid row: {e}")
        return records

    def filter_by_date_range(
        self, records: list[SubscriptionRecord]
    ) -> list[SubscriptionRecord]:
        """
        Filter records to only include those within the processing window.

        The window is defined by start_date (inclusive) and end_date (inclusive).
        """
        filtered = []
        for record in records:
            # BUG: Uses > instead of >= for start_date comparison
            # This excludes records that fall exactly on start_date
            if record.event_date > self.start_date and record.event_date <= self.end_date:
                filtered.append(record)
            else:
                self.processing_stats["filtered_out"] += 1
        return filtered

    def process_records(
        self, records: list[SubscriptionRecord]
    ) -> dict[str, CustomerMetrics]:
        """Process filtered records into customer metrics."""
        for record in records:
            self.processing_stats["processed"] += 1

            if record.customer_id not in self.metrics:
                self.metrics[record.customer_id] = CustomerMetrics(
                    customer_id=record.customer_id,
                    current_tier=record.tier,
                    mrr=Decimal("0"),
                )

            customer = self.metrics[record.customer_id]
            customer.events.append(record)

            # Update tier based on most recent event
            if (
                customer.last_event_date is None
                or record.event_date >= customer.last_event_date
            ):
                customer.current_tier = record.tier
                customer.last_event_date = record.event_date

            # Update first event date
            if (
                customer.first_event_date is None
                or record.event_date < customer.first_event_date
            ):
                customer.first_event_date = record.event_date

            # Calculate MRR impact
            if record.event_type == "new":
                customer.mrr += record.amount
            elif record.event_type == "upgrade":
                customer.mrr += record.amount
            elif record.event_type == "downgrade":
                customer.mrr -= record.amount
            elif record.event_type == "churn":
                customer.mrr = Decimal("0")

        return self.metrics

    def run_pipeline(self, filepath: Path) -> dict[str, CustomerMetrics]:
        """Execute the full processing pipeline."""
        raw_records = self.load_csv(filepath)
        filtered_records = self.filter_by_date_range(raw_records)
        self.process_records(filtered_records)
        return self.metrics

    def get_summary(self) -> dict:
        """Return a summary of processing statistics."""
        total_mrr = sum(m.mrr for m in self.metrics.values())
        active_customers = sum(
            1 for m in self.metrics.values() if m.mrr > 0
        )
        return {
            "processing_stats": self.processing_stats,
            "total_customers": len(self.metrics),
            "active_customers": active_customers,
            "total_mrr": float(total_mrr),
        }

The Review

The code went through review. The reviewer noted the clean structure, proper use of Decimal for financial calculations (a common best practice the AI got right), comprehensive type hints, and processing statistics for observability. The reviewer checked the MRR calculation logic carefully because financial calculations are high-stakes. The reviewer did not notice the > vs >= issue in filter_by_date_range.

Why did the reviewer miss it? Several reasons:

  1. The docstring said "inclusive." The method docstring stated "start_date (inclusive) and end_date (inclusive)," so the reviewer read the intent and skimmed the implementation.

  2. The > and >= operators look nearly identical when scanning code quickly. The difference between "greater than" and "greater than or equal to" is a single character.

  3. The end_date comparison was correct (<=), which created a pattern that the reviewer's eye followed. The reviewer likely expected both comparisons to be consistent and did not notice the asymmetry.

  4. The bug is only visible in the comparison, not in the output. For most test data, losing one record is invisible in a batch of hundreds or thousands.

The Tests

Priya wrote tests for the pipeline with help from the AI assistant:

def test_filter_by_date_range():
    """Test that records are filtered to the correct date range."""
    processor = DataProcessor(
        start_date=date(2024, 1, 1),
        end_date=date(2024, 1, 31),
    )
    records = [
        make_record(event_date=date(2023, 12, 31)),  # Before range
        make_record(event_date=date(2024, 1, 15)),    # In range
        make_record(event_date=date(2024, 1, 31)),    # End of range
        make_record(event_date=date(2024, 2, 1)),     # After range
    ]

    filtered = processor.filter_by_date_range(records)
    assert len(filtered) == 2  # Jan 15 and Jan 31

The test passed. But notice what is missing: there is no test case for a record dated exactly 2024-01-01 (the start of the range). The test data jumped from December 31 (before the range) to January 15 (mid-range). The AI generated test data that avoided the exact boundary where the bug lives.

This is a common pattern: AI-generated tests often avoid the exact edge cases where AI-generated code has bugs, because both the code and the tests emerge from the same statistical patterns.

The Discovery: Day 19

The Customer Report

Nineteen days after deployment, a customer named Acme Corp submitted a support ticket:

"Our January MRR report is showing $142,800 but our billing system shows $149,300 for the same period. We've reconciled and there's a consistent pattern — it looks like the first subscription event of each processing batch is missing."

The Investigation

Support escalated the ticket to engineering. Priya investigated by comparing the raw CSV data against the processed output. The findings:

  • The January processing window was set to start_date=date(2024, 1, 1) and end_date=date(2024, 1, 31)
  • Records dated exactly January 1, 2024 were missing from the processed output
  • This was consistent across all customers
  • The total amount of "missing" MRR corresponded exactly to the sum of all January 1st subscription events

Tracing the Bug

Priya traced through the filter_by_date_range method with a concrete example:

Record: event_date = 2024-01-01
Check: 2024-01-01 > 2024-01-01  →  False (EXCLUDED!)
Expected: 2024-01-01 >= 2024-01-01  →  True (INCLUDED)
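The trace can be checked directly in Python, since date objects compare chronologically:

```python
from datetime import date

boundary = date(2024, 1, 1)

# The buggy check: a strict > comparison excludes the record
# that falls exactly on the start date
assert (boundary > boundary) is False   # record silently dropped

# The intended check: >= keeps it
assert (boundary >= boundary) is True   # record included
```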

The > operator excluded records on the exact start date. The fix was a single character:

# Before (buggy):
if record.event_date > self.start_date and record.event_date <= self.end_date:

# After (fixed):
if record.event_date >= self.start_date and record.event_date <= self.end_date:

Why It Took 19 Days

The bug was discovered slowly for several reasons:

  1. The data loss was systematic but small. Only records on the exact first day of each processing window were affected. For most customers, this was a handful of records out of hundreds.

  2. No automated reconciliation. There was no process to compare pipeline output against source data totals.

  3. MRR is a derived metric. Customers typically monitor MRR trends, not exact values. A small discrepancy was initially attributed to timing differences between systems.

  4. The bug was consistent. Because it affected every batch the same way, there was no random variation to trigger anomaly detection. The error was constant, not noisy.

  5. The docstring was misleading. Anyone reading the code would see the docstring say "start_date (inclusive)" and trust it without verifying the implementation.

The Deeper Analysis

Quantifying the Impact

The investigation revealed:

  • Affected processing batches: 23 (daily batches over 19 business days, plus initial backfill batches)
  • Total records dropped: 1,847
  • Affected customers: 312 (out of 4,200 total)
  • Total MRR underreported: $47,200 across all customers
  • Largest single-customer impact: $6,500 MRR underreported (Acme Corp)

Why the AI Made This Mistake

The bug is a classic off-by-one error, the most common logic error in AI-generated code. The AI likely generated > instead of >= because:

  1. Both patterns are common in training data. Date range filtering code in training data uses both exclusive and inclusive bounds, depending on the convention. The model has no strong preference.

  2. The end-date comparison was correct by coincidence. The AI used <= for the end date and > for the start date. This asymmetry is unusual, but neither pattern is rare enough to stand out.

  3. The bug is in the boundary, not the bulk. The function works correctly for 99% of records (those not on the exact start date). Statistical pattern matching favors code that works for the common case.

Why the Tests Did Not Catch It

The AI-generated test had a subtle gap: it tested a record before the range and a record in the range, but not a record exactly on the start boundary. This gap exists because:

  1. AI generates "representative" test data, not "adversarial" test data. December 31 represents "before the range" and January 15 represents "in the range." The AI did not think to test the exact boundary.

  2. The boundary test looked like a duplicate. From the AI's perspective, the December 31 test already covered the start boundary and the January 15 test covered the in-range case. Recognizing the exact start date as a distinct case requires understanding the difference between > and >=, which is exactly the distinction the buggy code blurred.

  3. Boundary value analysis is a testing technique, not a pattern. AI generates code and tests based on patterns. Boundary value analysis — a technique where you specifically test the exact values at boundaries — is a deliberate methodology that the AI does not consistently apply.
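Boundary value analysis can be made mechanical rather than left to a reviewer's judgment: enumerate the values at and immediately adjacent to each boundary, and assert the expected inclusion for each one. A minimal sketch of this technique (the boundary_cases and in_window helpers are illustrative, not part of the pipeline):

```python
from datetime import date, timedelta

def boundary_cases(start, end):
    """Yield (date, should_be_included) pairs for every boundary-adjacent value."""
    one_day = timedelta(days=1)
    return [
        (start - one_day, False),  # just before the window
        (start, True),             # exact start boundary (where the bug lived)
        (start + one_day, True),   # just inside the start
        (end - one_day, True),     # just inside the end
        (end, True),               # exact end boundary
        (end + one_day, False),    # just after the window
    ]

def in_window(d, start, end):
    # The fixed, inclusive check
    return start <= d <= end

start, end = date(2024, 1, 1), date(2024, 1, 31)
for d, expected in boundary_cases(start, end):
    assert in_window(d, start, end) == expected, d
```

Applied to the original filter, this enumeration would have failed immediately on the (start, True) case.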

The Fix and Prevention

Immediate Fix

def filter_by_date_range(
    self, records: list[SubscriptionRecord]
) -> list[SubscriptionRecord]:
    """
    Filter records to only include those within the processing window.

    The window is defined by start_date (inclusive) and end_date (inclusive).
    """
    filtered = []
    for record in records:
        if self.start_date <= record.event_date <= self.end_date:
            filtered.append(record)
        else:
            self.processing_stats["filtered_out"] += 1
    return filtered

The rewrite uses Python's chained comparison (start <= value <= end), which is more readable and makes the inclusive nature of both bounds visually obvious. This style is harder to get wrong than two separate comparison operators.
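A quick check confirms that the chained form is equivalent to the two-operator form, while making an accidental strict bound easier to spot:

```python
from datetime import date

start, end = date(2024, 1, 1), date(2024, 1, 31)
x = date(2024, 1, 1)  # the boundary value the bug dropped

# Python evaluates a chained comparison as the conjunction of its parts
assert (start <= x <= end) == (start <= x and x <= end)

# The inclusive form keeps the start-date record...
assert start <= x <= end

# ...while the buggy exclusive form drops it
assert not (start < x <= end)
```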

Data Remediation

The team reprocessed all affected batches with the fixed code:

# Reprocessing script for affected date ranges
affected_dates = get_affected_processing_dates()
for start_date, end_date in affected_dates:
    processor = DataProcessor(start_date=start_date, end_date=end_date)
    processor.run_pipeline(get_source_file(start_date))
    verify_totals(processor.get_summary(), get_expected_totals(start_date))

Additional Tests

def test_filter_boundary_exact_start_date():
    """Records on exact start_date must be included (inclusive)."""
    processor = DataProcessor(
        start_date=date(2024, 1, 1),
        end_date=date(2024, 1, 31),
    )
    record = make_record(event_date=date(2024, 1, 1))
    filtered = processor.filter_by_date_range([record])
    assert len(filtered) == 1, "Start date should be inclusive"


def test_filter_boundary_exact_end_date():
    """Records on exact end_date must be included (inclusive)."""
    processor = DataProcessor(
        start_date=date(2024, 1, 1),
        end_date=date(2024, 1, 31),
    )
    record = make_record(event_date=date(2024, 1, 31))
    filtered = processor.filter_by_date_range([record])
    assert len(filtered) == 1, "End date should be inclusive"


def test_filter_boundary_one_before_start():
    """Records one day before start_date must be excluded."""
    processor = DataProcessor(
        start_date=date(2024, 1, 1),
        end_date=date(2024, 1, 31),
    )
    record = make_record(event_date=date(2023, 12, 31))
    filtered = processor.filter_by_date_range([record])
    assert len(filtered) == 0, "Day before start should be excluded"


def test_filter_boundary_one_after_end():
    """Records one day after end_date must be excluded."""
    processor = DataProcessor(
        start_date=date(2024, 1, 1),
        end_date=date(2024, 1, 31),
    )
    record = make_record(event_date=date(2024, 2, 1))
    filtered = processor.filter_by_date_range([record])
    assert len(filtered) == 0, "Day after end should be excluded"


def test_filter_single_day_range():
    """When start_date == end_date, exactly that date should be included."""
    processor = DataProcessor(
        start_date=date(2024, 1, 15),
        end_date=date(2024, 1, 15),
    )
    records = [
        make_record(event_date=date(2024, 1, 14)),
        make_record(event_date=date(2024, 1, 15)),
        make_record(event_date=date(2024, 1, 16)),
    ]
    filtered = processor.filter_by_date_range(records)
    assert len(filtered) == 1
    assert filtered[0].event_date == date(2024, 1, 15)

Process Improvements

  1. Mandatory boundary value tests. Any function that involves range filtering, pagination, or boundary conditions must include tests for exact boundary values.

  2. Automated reconciliation. Pipeline output totals are automatically compared against source data totals. Any discrepancy above 0.01% triggers an alert.

  3. Prefer chained comparisons. The team adopted a coding standard to use Python's chained comparisons (a <= x <= b) for range checks, as they are more readable and less error-prone.

  4. AI-generated test review. When AI generates both code and tests, a reviewer must specifically check that tests cover the exact boundaries of any range operations. AI-generated tests have a known tendency to skip boundary cases.
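Improvement 2 can be sketched as a simple relative-difference check. The reconcile helper below is illustrative only, assuming source and pipeline totals are available as Decimal values:

```python
from decimal import Decimal

def reconcile(source_total: Decimal, pipeline_total: Decimal,
              tolerance: Decimal = Decimal("0.0001")) -> bool:
    """Compare pipeline output against the source-of-truth total.

    Returns False (i.e. raise an alert) when the relative discrepancy
    exceeds the tolerance (0.01% by default, matching the policy above).
    """
    if source_total == 0:
        return pipeline_total == 0
    discrepancy = abs(source_total - pipeline_total) / abs(source_total)
    return discrepancy <= tolerance

# Acme Corp's January figures: a $6,500 gap against a $149,300 source
# total is a ~4.4% discrepancy, far above 0.01%, so this would alert.
assert not reconcile(Decimal("149300"), Decimal("142800"))
assert reconcile(Decimal("149300"), Decimal("149300"))
```

With this check wired into the pipeline, the January 1 drop would have been flagged on the first batch rather than on day 19.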


Lessons Learned

  1. Off-by-one errors are AI's most common logic bug. Always verify comparison operators (> vs >=, < vs <=) in any code involving ranges, pagination, or boundaries.

  2. AI-generated tests have the same blind spots as AI-generated code. Do not assume AI-generated tests are comprehensive. Specifically check for boundary value coverage.

  3. Docstrings can be misleading. The AI wrote "inclusive" in the docstring but used > (exclusive) in the code. Always verify implementation matches documentation.

  4. Small, systematic errors are harder to detect than large, random errors. A function that drops 1% of records consistently is harder to catch than one that crashes intermittently.

  5. Financial calculations need reconciliation. Any pipeline that processes financial data should have automated reconciliation against source totals.

  6. Chained comparisons are defensive coding. start <= x <= end is more readable and less error-prone than x > start and x <= end.


Discussion Questions

  1. How would you design a test generation strategy that specifically targets boundary conditions in AI-generated code?

  2. The bug persisted for 19 days. What monitoring or alerting could have caught it sooner?

  3. The code review missed the > vs >= discrepancy. What review techniques might help catch single-character bugs like this?

  4. If the AI had been prompted differently, could the bug have been prevented? Write an alternative prompt that might produce correct boundary handling.

  5. The AI-generated test data conveniently avoided the boundary where the bug lived. Is this a coincidence, or is there a systematic reason why AI-generated tests fail to test AI-generated code effectively?


This case study illustrates concepts from Section 14.3 (Subtle Logic Errors), Section 14.7 (The Confidence Problem), and Section 14.8 (Debugging AI-Generated Code Systematically). For a catalog of similar bugs and their fixes, see code/example-01-bug-catalog.py.