# Case Study 2: Finding the Subtle Bug

## Overview
| Field | Detail |
|---|---|
| Project | Customer data migration pipeline for SaaSMetrics Inc. |
| Bug Type | Off-by-one error in date range filtering |
| Detection | Discovered 19 days after deployment by a customer support escalation |
| Impact | Systematic data loss — the first record of every batch was silently dropped |
| Root Cause | AI-generated comparison used `>` instead of `>=` in a date boundary check |
## Background
SaaSMetrics Inc. provides business intelligence dashboards for SaaS companies. Their platform ingests customer subscription data from multiple sources, processes it through a transformation pipeline, and produces analytics reports including Monthly Recurring Revenue (MRR), churn rates, and cohort analyses.
The engineering team was migrating from their legacy data pipeline to a new system. Developer Priya Sharma was assigned to build the core data transformation module. Given the tight timeline — the migration had to be completed within a sprint — Priya used an AI coding assistant extensively.
## The Development

### The Requirement

The pipeline needed to:

1. Ingest CSV exports from customer billing systems
2. Parse and validate subscription records
3. Filter records by date range (processing window)
4. Aggregate records by customer and subscription tier
5. Calculate MRR metrics
6. Output structured reports
### The AI-Generated Pipeline

Priya prompted the AI assistant in stages, building the pipeline incrementally. The critical module was the data processor. Here is what the AI generated:
"""
Data processing pipeline for subscription analytics.
Processes raw subscription records into aggregated metrics.
"""
import csv
from dataclasses import dataclass, field
from datetime import datetime, date
from decimal import Decimal
from pathlib import Path
from typing import Optional
@dataclass
class SubscriptionRecord:
"""A single subscription event record."""
customer_id: str
subscription_id: str
event_type: str # 'new', 'upgrade', 'downgrade', 'churn'
event_date: date
tier: str
amount: Decimal
currency: str = "USD"
@dataclass
class CustomerMetrics:
"""Aggregated metrics for a single customer."""
customer_id: str
current_tier: str
mrr: Decimal
events: list[SubscriptionRecord] = field(default_factory=list)
first_event_date: Optional[date] = None
last_event_date: Optional[date] = None
class DataProcessor:
"""Process subscription records into customer metrics."""
def __init__(self, start_date: date, end_date: date):
"""
Initialize processor with a date range filter.
Args:
start_date: Beginning of processing window (inclusive)
end_date: End of processing window (inclusive)
"""
self.start_date = start_date
self.end_date = end_date
self.records: list[SubscriptionRecord] = []
self.metrics: dict[str, CustomerMetrics] = {}
self.processing_stats = {
"total_read": 0,
"filtered_out": 0,
"processed": 0,
"errors": 0,
}
def load_csv(self, filepath: Path) -> list[SubscriptionRecord]:
"""Load subscription records from a CSV file."""
records = []
with open(filepath, "r", newline="") as f:
reader = csv.DictReader(f)
for row in reader:
self.processing_stats["total_read"] += 1
try:
record = SubscriptionRecord(
customer_id=row["customer_id"],
subscription_id=row["subscription_id"],
event_type=row["event_type"],
event_date=datetime.strptime(
row["event_date"], "%Y-%m-%d"
).date(),
tier=row["tier"],
amount=Decimal(row["amount"]),
currency=row.get("currency", "USD"),
)
records.append(record)
except (KeyError, ValueError) as e:
self.processing_stats["errors"] += 1
print(f"Warning: Skipping invalid row: {e}")
return records
def filter_by_date_range(
self, records: list[SubscriptionRecord]
) -> list[SubscriptionRecord]:
"""
Filter records to only include those within the processing window.
The window is defined by start_date (inclusive) and end_date (inclusive).
"""
filtered = []
for record in records:
# BUG: Uses > instead of >= for start_date comparison
# This excludes records that fall exactly on start_date
if record.event_date > self.start_date and record.event_date <= self.end_date:
filtered.append(record)
else:
self.processing_stats["filtered_out"] += 1
return filtered
def process_records(
self, records: list[SubscriptionRecord]
) -> dict[str, CustomerMetrics]:
"""Process filtered records into customer metrics."""
for record in records:
self.processing_stats["processed"] += 1
if record.customer_id not in self.metrics:
self.metrics[record.customer_id] = CustomerMetrics(
customer_id=record.customer_id,
current_tier=record.tier,
mrr=Decimal("0"),
)
customer = self.metrics[record.customer_id]
customer.events.append(record)
# Update tier based on most recent event
if (
customer.last_event_date is None
or record.event_date >= customer.last_event_date
):
customer.current_tier = record.tier
customer.last_event_date = record.event_date
# Update first event date
if (
customer.first_event_date is None
or record.event_date < customer.first_event_date
):
customer.first_event_date = record.event_date
# Calculate MRR impact
if record.event_type == "new":
customer.mrr += record.amount
elif record.event_type == "upgrade":
customer.mrr += record.amount
elif record.event_type == "downgrade":
customer.mrr -= record.amount
elif record.event_type == "churn":
customer.mrr = Decimal("0")
return self.metrics
def run_pipeline(self, filepath: Path) -> dict[str, CustomerMetrics]:
"""Execute the full processing pipeline."""
raw_records = self.load_csv(filepath)
filtered_records = self.filter_by_date_range(raw_records)
self.process_records(filtered_records)
return self.metrics
def get_summary(self) -> dict:
"""Return a summary of processing statistics."""
total_mrr = sum(m.mrr for m in self.metrics.values())
active_customers = sum(
1 for m in self.metrics.values() if m.mrr > 0
)
return {
"processing_stats": self.processing_stats,
"total_customers": len(self.metrics),
"active_customers": active_customers,
"total_mrr": float(total_mrr),
}
## The Review

The code went through review. The reviewer noted the clean structure, proper use of `Decimal` for financial calculations (a common best practice the AI got right), comprehensive type hints, and processing statistics for observability. The reviewer checked the MRR calculation logic carefully because financial calculations are high-stakes. The reviewer did not notice the `>` vs `>=` issue in `filter_by_date_range`.

Why did the reviewer miss it? Several reasons:

- **The docstring said "inclusive."** The method docstring stated "start_date (inclusive) and end_date (inclusive)," so the reviewer read the intent and skimmed the implementation.
- **The `>` and `>=` operators look nearly identical when scanning code quickly.** The difference between "greater than" and "greater than or equal to" is a single character.
- **The end-date comparison was correct (`<=`), which created a pattern that the reviewer's eye followed.** The reviewer likely expected both comparisons to be consistent and did not notice the asymmetry.
- **The bug is only visible in the comparison, not in the output.** For most test data, losing one record is invisible in a batch of hundreds or thousands.
## The Tests

Priya wrote tests for the pipeline with help from the AI assistant:
```python
def test_filter_by_date_range():
    """Test that records are filtered to the correct date range."""
    processor = DataProcessor(
        start_date=date(2024, 1, 1),
        end_date=date(2024, 1, 31),
    )
    records = [
        make_record(event_date=date(2023, 12, 31)),  # Before range
        make_record(event_date=date(2024, 1, 15)),   # In range
        make_record(event_date=date(2024, 1, 31)),   # End of range
        make_record(event_date=date(2024, 2, 1)),    # After range
    ]
    filtered = processor.filter_by_date_range(records)
    assert len(filtered) == 2  # Jan 15 and Jan 31
```
The test passed. But notice what is missing: there is no test case for a record dated exactly 2024-01-01 (the start of the range). The test data jumped from December 31 (before the range) to January 15 (mid-range). The AI generated test data that avoided the exact boundary where the bug lives.
This is a common pattern: AI-generated tests often avoid the exact edge cases where AI-generated code has bugs, because both the code and the tests emerge from the same statistical patterns.
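One countermeasure is to sweep every date across and around the window rather than hand-picking "representative" values. The sketch below is a standalone reimplementation of the buggy predicate, not the team's actual test suite, checked against the documented inclusive contract:

```python
from datetime import date, timedelta

def in_window(d: date, start: date, end: date) -> bool:
    """The predicate as the AI generated it: exclusive start, inclusive end."""
    return d > start and d <= end

start, end = date(2024, 1, 1), date(2024, 1, 31)

# Sweep every day from one day before the window to one day after it.
misclassified = []
day = start - timedelta(days=1)
while day <= end + timedelta(days=1):
    expected = start <= day <= end  # the documented (inclusive) contract
    if in_window(day, start, end) != expected:
        misclassified.append(day)
    day += timedelta(days=1)

print(misclassified)  # → [datetime.date(2024, 1, 1)]
```

An exhaustive sweep over 33 candidate dates is cheap, and it flags exactly the start-boundary date that the hand-picked test data skipped.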
## The Discovery: Day 19

### The Customer Report

Nineteen days after deployment, a customer named Acme Corp submitted a support ticket:

> "Our January MRR report is showing $142,800 but our billing system shows $149,300 for the same period. We've reconciled and there's a consistent pattern — it looks like the first subscription event of each processing batch is missing."
### The Investigation

Support escalated the ticket to engineering. Priya investigated by comparing the raw CSV data against the processed output. The findings:

- The January processing window was set to `start_date=date(2024, 1, 1)` and `end_date=date(2024, 1, 31)`
- Records dated exactly January 1, 2024 were missing from the processed output
- This was consistent across all customers
- The total amount of "missing" MRR corresponded exactly to the sum of all January 1st subscription events
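The comparison Priya ran can be distilled to a set difference between the dates present in the source data and the dates that survived the filter. The data below is illustrative, not the customer's actual records:

```python
from datetime import date

# Event dates present in the raw January CSV (illustrative sample)
source_dates = {
    date(2024, 1, 1),
    date(2024, 1, 2),
    date(2024, 1, 15),
    date(2024, 1, 31),
}

# Dates that survive the buggy predicate (exclusive start bound)
processed_dates = {
    d for d in source_dates
    if d > date(2024, 1, 1) and d <= date(2024, 1, 31)
}

missing = sorted(source_dates - processed_dates)
print(missing)  # → [datetime.date(2024, 1, 1)]
```

Every missing record lands on exactly one date, the start of the window, which is the signature of a boundary bug rather than random data loss.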
### Tracing the Bug

Priya traced through the `filter_by_date_range` method with a concrete example:

```
Record:   event_date = 2024-01-01
Check:    2024-01-01 >  2024-01-01 → False (EXCLUDED!)
Expected: 2024-01-01 >= 2024-01-01 → True  (INCLUDED)
```
The `>` operator excluded records on the exact start date. The fix was a single character:

```python
# Before (buggy):
if record.event_date > self.start_date and record.event_date <= self.end_date:

# After (fixed):
if record.event_date >= self.start_date and record.event_date <= self.end_date:
```
### Why It Took 19 Days

The bug was discovered slowly for several reasons:

- **The data loss was systematic but small.** Only records on the exact first day of each processing window were affected. For most customers, this was a handful of records out of hundreds.
- **No automated reconciliation.** There was no process to compare pipeline output against source data totals.
- **MRR is a derived metric.** Customers typically monitor MRR trends, not exact values. A small discrepancy was initially attributed to timing differences between systems.
- **The bug was consistent.** Because it affected every batch the same way, there was no random variation to trigger anomaly detection. The error was constant, not noisy.
- **The docstring was misleading.** Anyone reading the code would see the docstring say "start_date (inclusive)" and trust it without verifying the implementation.
## The Deeper Analysis

### Quantifying the Impact

The investigation revealed:
| Metric | Value |
|---|---|
| Affected processing batches | 23 (daily batches over 19 business days, plus initial backfill batches) |
| Total records dropped | 1,847 |
| Affected customers | 312 (out of 4,200 total) |
| Total MRR underreported | $47,200 across all customers |
| Largest single-customer impact | $6,500 MRR underreported (Acme Corp) |
### Why the AI Made This Mistake

The bug is a classic off-by-one error, the most common logic error in AI-generated code. The AI likely generated `>` instead of `>=` because:

- **Both patterns are common in training data.** Date range filtering code in training data uses both exclusive and inclusive bounds, depending on the convention. The model has no strong preference.
- **The end-date comparison was correct by coincidence.** The AI used `<=` for the end date and `>` for the start date. This asymmetry is unusual, but neither pattern is rare enough to stand out.
- **The bug is in the boundary, not the bulk.** The function works correctly for 99% of records (those not on the exact start date). Statistical pattern matching favors code that works for the common case.
### Why the Tests Did Not Catch It

The AI-generated test had a subtle gap: it tested a record before the range and a record in the range, but not a record exactly on the start boundary. This gap exists because:

- **AI generates "representative" test data, not "adversarial" test data.** December 31 represents "before the range" and January 15 represents "in the range." The AI did not think to test the exact boundary.
- **The boundary test would have been a duplicate concern.** From the AI's perspective, the December 31 test already covers "start boundary" and the January 15 test covers "in range." The exact start date is a conceptual gap that requires understanding the difference between `>` and `>=`.
- **Boundary value analysis is a testing technique, not a pattern.** AI generates code and tests based on patterns. Boundary value analysis — a technique where you specifically test the exact values at boundaries — is a deliberate methodology that the AI does not consistently apply.
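Boundary value analysis can be mechanized: given an inclusive range, derive the probe values a test suite should always contain. The helper below is a generic sketch, not code from the case study:

```python
from datetime import date, timedelta

def boundary_probes(start: date, end: date) -> dict[str, date]:
    """Canonical probe dates for an inclusive [start, end] window."""
    return {
        "one_before_start": start - timedelta(days=1),  # must be excluded
        "exact_start": start,                           # must be included
        "exact_end": end,                               # must be included
        "one_after_end": end + timedelta(days=1),       # must be excluded
    }

probes = boundary_probes(date(2024, 1, 1), date(2024, 1, 31))

# Run the buggy predicate against each probe: it fails exactly one of them.
results = {
    name: (d > date(2024, 1, 1) and d <= date(2024, 1, 31))
    for name, d in probes.items()
}
print(results["exact_start"])  # → False, revealing the off-by-one
```

Feeding these four probes into any range-filtering test makes the `>` vs `>=` distinction impossible to miss, regardless of who (or what) wrote the code.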
## The Fix and Prevention

### Immediate Fix

```python
def filter_by_date_range(
    self, records: list[SubscriptionRecord]
) -> list[SubscriptionRecord]:
    """
    Filter records to only include those within the processing window.

    The window is defined by start_date (inclusive) and end_date (inclusive).
    """
    filtered = []
    for record in records:
        if self.start_date <= record.event_date <= self.end_date:
            filtered.append(record)
        else:
            self.processing_stats["filtered_out"] += 1
    return filtered
```
The rewrite uses Python's chained comparison (start <= value <= end), which is more readable and makes the inclusive nature of both bounds visually obvious. This style is harder to get wrong than two separate comparison operators.
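The difference at the boundary can be checked directly in a REPL:

```python
from datetime import date

start, end = date(2024, 1, 1), date(2024, 1, 31)
d = date(2024, 1, 1)  # a record on the exact start date

buggy = d > start and d <= end  # exclusive start: drops the record
fixed = start <= d <= end       # chained, inclusive: keeps it
print(buggy, fixed)  # → False True
```

The chained form also reads like the interval notation in the docstring, so a mismatch between documentation and implementation becomes visually obvious.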
### Data Remediation

The team reprocessed all affected batches with the fixed code:

```python
# Reprocessing script for affected date ranges
affected_dates = get_affected_processing_dates()
for start_date, end_date in affected_dates:
    processor = DataProcessor(start_date=start_date, end_date=end_date)
    processor.run_pipeline(get_source_file(start_date))
    verify_totals(processor.get_summary(), get_expected_totals(start_date))
```
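The `verify_totals` helper is not shown in the case study; one plausible sketch compares the pipeline's reported MRR against the expected source total within a small relative tolerance. The signature, field names, and threshold here are assumptions:

```python
def verify_totals(summary: dict, expected_mrr: float, tolerance: float = 0.0001) -> None:
    """Fail loudly if reprocessed MRR still disagrees with the source total.

    Hypothetical helper: assumes summary carries a 'total_mrr' key and that
    the expected total is a plain float from the billing system.
    """
    actual = summary["total_mrr"]
    if expected_mrr == 0:
        discrepancy = abs(actual)  # avoid dividing by zero for empty batches
    else:
        discrepancy = abs(actual - expected_mrr) / abs(expected_mrr)
    if discrepancy > tolerance:
        raise ValueError(
            f"Reconciliation failed: pipeline MRR {actual} vs source {expected_mrr}"
        )

# A reprocessed batch whose total now matches the source passes silently.
verify_totals({"total_mrr": 149300.0}, expected_mrr=149300.0)
```

Raising an exception (rather than logging a warning) is deliberate: a remediation run that still disagrees with the source should halt, not quietly continue.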
### Additional Tests

```python
def test_filter_boundary_exact_start_date():
    """Records on exact start_date must be included (inclusive)."""
    processor = DataProcessor(
        start_date=date(2024, 1, 1),
        end_date=date(2024, 1, 31),
    )
    record = make_record(event_date=date(2024, 1, 1))
    filtered = processor.filter_by_date_range([record])
    assert len(filtered) == 1, "Start date should be inclusive"


def test_filter_boundary_exact_end_date():
    """Records on exact end_date must be included (inclusive)."""
    processor = DataProcessor(
        start_date=date(2024, 1, 1),
        end_date=date(2024, 1, 31),
    )
    record = make_record(event_date=date(2024, 1, 31))
    filtered = processor.filter_by_date_range([record])
    assert len(filtered) == 1, "End date should be inclusive"


def test_filter_boundary_one_before_start():
    """Records one day before start_date must be excluded."""
    processor = DataProcessor(
        start_date=date(2024, 1, 1),
        end_date=date(2024, 1, 31),
    )
    record = make_record(event_date=date(2023, 12, 31))
    filtered = processor.filter_by_date_range([record])
    assert len(filtered) == 0, "Day before start should be excluded"


def test_filter_boundary_one_after_end():
    """Records one day after end_date must be excluded."""
    processor = DataProcessor(
        start_date=date(2024, 1, 1),
        end_date=date(2024, 1, 31),
    )
    record = make_record(event_date=date(2024, 2, 1))
    filtered = processor.filter_by_date_range([record])
    assert len(filtered) == 0, "Day after end should be excluded"


def test_filter_single_day_range():
    """When start_date == end_date, exactly that date should be included."""
    processor = DataProcessor(
        start_date=date(2024, 1, 15),
        end_date=date(2024, 1, 15),
    )
    records = [
        make_record(event_date=date(2024, 1, 14)),
        make_record(event_date=date(2024, 1, 15)),
        make_record(event_date=date(2024, 1, 16)),
    ]
    filtered = processor.filter_by_date_range(records)
    assert len(filtered) == 1
    assert filtered[0].event_date == date(2024, 1, 15)
```
### Process Improvements

- **Mandatory boundary value tests.** Any function that involves range filtering, pagination, or boundary conditions must include tests for exact boundary values.
- **Automated reconciliation.** Pipeline output totals are automatically compared against source data totals. Any discrepancy above 0.01% triggers an alert.
- **Prefer chained comparisons.** The team adopted a coding standard to use Python's chained comparisons (`a <= x <= b`) for range checks, as they are more readable and less error-prone.
- **AI-generated test review.** When AI generates both code and tests, a reviewer must specifically check that tests cover the exact boundaries of any range operations. AI-generated tests have a known tendency to skip boundary cases.
## Lessons Learned

- **Off-by-one errors are AI's most common logic bug.** Always verify comparison operators (`>` vs `>=`, `<` vs `<=`) in any code involving ranges, pagination, or boundaries.
- **AI-generated tests have the same blind spots as AI-generated code.** Do not assume AI-generated tests are comprehensive. Specifically check for boundary value coverage.
- **Docstrings can be misleading.** The AI wrote "inclusive" in the docstring but used `>` (exclusive) in the code. Always verify that the implementation matches the documentation.
- **Small, systematic errors are harder to detect than large, random errors.** A function that drops 1% of records consistently is harder to catch than one that crashes intermittently.
- **Financial calculations need reconciliation.** Any pipeline that processes financial data should have automated reconciliation against source totals.
- **Chained comparisons are defensive coding.** `start <= x <= end` is more readable and less error-prone than `x > start and x <= end`.
## Discussion Questions

1. How would you design a test generation strategy that specifically targets boundary conditions in AI-generated code?
2. The bug persisted for 19 days. What monitoring or alerting could have caught it sooner?
3. The code review missed the `>` vs `>=` discrepancy. What review techniques might help catch single-character bugs like this?
4. If the AI had been prompted differently, could the bug have been prevented? Write an alternative prompt that might produce correct boundary handling.
5. The AI-generated test data conveniently avoided the boundary where the bug lived. Is this a coincidence, or is there a systematic reason why AI-generated tests fail to test AI-generated code effectively?
This case study illustrates concepts from Section 14.3 (Subtle Logic Errors), Section 14.7 (The Confidence Problem), and Section 14.8 (Debugging AI-Generated Code Systematically). For a catalog of similar bugs and their fixes, see `code/example-01-bug-catalog.py`.