Case Study 2: The Mysterious Memory Leak — AI-Assisted Debugging Under Pressure

The Incident

At 11:47 PM on a Thursday, Raj gets a PagerDuty alert. The payment reconciliation service — a Python service that runs nightly batch jobs to reconcile payment records against bank settlements — has been consuming increasing memory for six hours and is now at 94% of its allocated container memory. The automatic restart has been triggered twice, and the batch job is only 70% complete.

The payment reconciliation job runs every night. It has been running without issues for eight months. It is running on unchanged code. Nothing was deployed today.

This is a production incident. Financial reporting systems depend on the reconciliation completing before 6 AM when the finance team arrives. Raj has six hours.

Initial Assessment (30 Minutes)

Raj pulls up metrics. The memory growth pattern is clear in the Grafana dashboard: smooth, linear memory growth over the past six hours at approximately 15MB per minute. This is not a spike — it is a steady leak that has been accumulating all night. The two restarts have not fixed it because the job resumes where it left off, and the leak continues.

He looks at what changed. No deployments. He checks the input data size: tonight's reconciliation job is processing 2.3 million records, compared to the usual 1.8-1.9 million. Volume is up approximately 25% due to month-end transaction spike. This is the first hint at the problem's structure.

He submits a first diagnostic prompt:

"I have a Python service with a steady memory leak — 15MB/minute growth over 6 hours in a batch processing job. The job processes financial records from a database in batches. It has been running unchanged for 8 months. Tonight it is processing 25% more records than usual, which is the first time this specific volume has been reached. What are the most common causes of this pattern (steady linear growth, not a spike, triggered by higher volume)?"

AI's response surfaces five hypotheses:

1. Accumulated references in a list or dictionary that is never cleared between batches
2. Database query result sets not being properly closed or garbage collected
3. An in-memory cache growing without bound
4. Event listeners or callback registrations that accumulate over the loop lifetime
5. Circular references preventing garbage collection

The "25% higher volume" clue is particularly important for Raj: linear growth that only becomes problematic at a volume threshold suggests the leak exists at smaller volumes too but was slow enough not to cause problems within the job's normal runtime.

Code Investigation (45 Minutes)

Raj pulls the relevant service code. The reconciliation job is a loop: fetch a batch of records, match them against bank settlement data, write results, repeat. He submits the core loop to Claude with his diagnostic context:

"Here is the core processing loop for the batch reconciliation job. Given that the memory leak is linear and volume-triggered, what specific patterns in this code would cause steady accumulation? I'm particularly interested in anything that grows with iteration count rather than with batch size."

The code review surfaces a specific concern: the reconcile_batch() function appends to a module-level list called _reconciliation_errors that is defined at import time and is never cleared between batches. Each failed reconciliation appends a structured error object. For 1.9 million records with a typical 3% error rate, that is approximately 57,000 error objects. For tonight's 2.3 million records, it would be 69,000 objects — but that is not enough to explain 5.4GB of memory growth.

Raj looks at the error objects. Each one stores the full bank settlement record and the full transaction record for comparison purposes, presumably to aid in debugging. The bank settlement records contain detailed payment metadata. He calculates: if each error object is approximately 8KB (two records of ~4KB each), 69,000 objects would be approximately 550MB. That is substantial accumulation, but it needs to be checked against the observed growth.

But wait — the growth is 15MB/minute and has been running for six hours. That is 5.4GB. Even with tonight's higher error rate, the object count math does not produce 5.4GB. Something else is happening.
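The back-of-envelope check is simple enough to script (numbers taken from the figures above):

```python
# Back-of-envelope check of the numbers in the investigation.
errors = int(2_300_000 * 0.03)     # ~69,000 failed reconciliations at a 3% error rate
estimated = errors * 8_000         # at ~8 KB per error object
observed = 15_000_000 * 6 * 60     # 15 MB/min sustained for 6 hours

print(estimated / 1e9)             # ~0.55 GB
print(observed / 1e9)              # 5.4 GB
print(observed / estimated)        # roughly a factor of 10 unexplained
```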

Hypothesis Revision (20 Minutes)

Raj shares his calculation with Claude:

"The module-level error list could accumulate approximately 550MB based on my math (69,000 objects x 8KB average). But the actual memory growth is approximately 5.4GB. The math doesn't match. What am I missing? Could the error objects be much larger than I estimated? Or is there a different leak mechanism I haven't found?"

This is the conversation that breaks the case open. Claude suggests: "Are the error objects storing any references to the database cursor or query result object? If each error object holds a reference to an open database result, the result set may prevent garbage collection of the underlying database connection resources."

Raj checks the error object construction. Yes — the error objects store a reference to batch_record, which is a SQLAlchemy model instance. The model instance holds a reference to the database session that loaded it. That session is part of a connection pool. The connection objects are significantly larger than the row data — they include connection state, pending transactions, and result buffers.

More importantly: the batch processing code creates its sessions with expire_on_commit=False (a legitimate optimization that keeps attribute values loaded rather than expiring them, so they can be read without re-querying after the session closes). This means the model instances hold strong references to in-memory copies of all their attributes even after the session closes. With 69,000 error objects each holding a fully loaded SQLAlchemy model instance, the actual memory per object is likely 60-80KB, not 8KB.

That produces approximately 4.8GB — much closer to the observed 5.4GB, with the remaining discrepancy likely attributable to Python object overhead and garbage collector fragmentation.
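The scale of the difference is easy to sanity-check with a stand-in object. The sketch below simulates the retained attribute state in plain Python; it is not SQLAlchemy code, and the 70KB figure is the estimate from the investigation, not a measured value:

```python
import tracemalloc


class LoadedRow:
    """Stand-in for a fully loaded ORM instance that keeps strong
    references to all of its attribute values (~70 KB of state)."""
    def __init__(self):
        self.state = bytes(70 * 1024)


tracemalloc.start()
held = [LoadedRow() for _ in range(1_000)]       # 1,000 retained "rows"
current, _peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# 1,000 retained instances hold ~70 MB; scaled to 69,000, that is ~4.8 GB.
print(f"{current / 1e6:.0f} MB retained")
```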

The Fix Under Pressure (30 Minutes)

Raj knows the root cause. He needs a fix that can be deployed without restarting the entire batch job from scratch — the job is 70% complete and restarting from zero would miss the 6 AM deadline.

He submits a targeted implementation request:

"I've identified the root cause: a module-level error list accumulates SQLAlchemy model instances that hold database session references. The fix needs to: (1) change the error list to store lightweight error summaries (just the IDs and error codes, not full model instances), (2) be deployable without restarting the job from scratch, and (3) not lose the errors already accumulated. Write the minimal change to implement this fix."

The generated fix is a function that converts the accumulated error objects to lightweight summaries (transaction_id, settlement_id, error_code, timestamp), clears the original list, and returns the summaries for writing to the error log database. Raj applies it as a hotfix via a feature flag that his deployment infrastructure supports.
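A sketch of what such a drain-and-clear hotfix could look like (the error-object field names here are assumed from the description above, not taken from the real service):

```python
from datetime import datetime, timezone


def drain_errors(errors):
    """Convert accumulated heavyweight error objects into lightweight
    summaries, then clear the accumulator in place so every existing
    reference to the list sees it emptied and the old objects become
    garbage-collectable."""
    summaries = [
        {
            "transaction_id": e["transaction"]["txn_id"],
            "settlement_id": e["settlement"]["settlement_id"] if e["settlement"] else None,
            "error_code": e["error_code"],
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        for e in errors
    ]
    errors.clear()  # in-place clear, not rebinding: module-level references still work
    return summaries
```

The in-place clear() matters: rebinding the name to a new list would leave the old list, and all its heavyweight objects, alive behind any other reference to it.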

He also asks for the comprehensive fix — the structural change to the reconciliation loop that prevents recurrence:

"Write the corrected version of the reconciliation loop that: stores only lightweight error summaries (not model instances) in the error accumulator, clears the accumulator after each batch's errors are written, and uses a dataclass rather than a dict for the error summary structure."

He reviews this code carefully and makes one change of his own: the error accumulator becomes a local variable passed as a parameter rather than a module-level variable, because module-level mutable state is the architectural pattern that allowed this bug to hide for eight months.
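The structural version, with a per-batch local accumulator and a dataclass summary, might look like the following (a sketch under the same assumed record shapes as above, not the production code):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ErrorSummary:
    """Lightweight record of a failed reconciliation: IDs and codes
    only, no model instances and no session references."""
    transaction_id: str
    settlement_id: Optional[str]
    error_code: str


def run_reconciliation(batches, settlements, write_errors):
    for batch in batches:
        errors = []  # local accumulator: scoped to one batch, not the module
        for record in batch:
            match = settlements.get(record["txn_id"])
            if match is None:
                errors.append(ErrorSummary(record["txn_id"], None, "NO_SETTLEMENT"))
            elif match["amount"] != record["amount"]:
                errors.append(ErrorSummary(record["txn_id"], match["settlement_id"], "MISMATCH"))
        if errors:
            write_errors(errors)  # flushed every batch; nothing outlives the loop body
```

Because the accumulator is created inside the loop body, nothing can grow across batches, and the leak class is eliminated rather than merely shrunk.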

The Fix Deployment

At 1:15 AM, Raj deploys the hotfix via feature flag. The memory growth stops within five minutes of the flag being enabled. Memory actually decreases slightly as the now-lightweight error accumulator stops growing. The batch job completes at 5:23 AM — 37 minutes ahead of the 6 AM deadline.

The comprehensive structural fix is tested, merged via PR, and deployed the following afternoon.

The Trust-But-Verify Moments

Raj's post-incident review identifies four moments where he applied active verification rather than accepting AI outputs directly.

Moment 1: The first hypothesis list. AI provided five hypotheses for the memory leak pattern. Raj did not accept the first one (accumulated references in a list) as the answer; it was a hypothesis that directed investigation, not a conclusion. The investigation confirmed the list hypothesis was approximately right, but the per-object size turned out to be far larger than initially estimated.

Moment 2: The memory math. When AI's first explanation produced memory estimates (550MB) that did not match the observed growth (5.4GB), Raj did the math and identified the discrepancy rather than accepting the partial explanation. This discrepancy directed the investigation toward the SQLAlchemy session reference mechanism, which was the real root cause.

Moment 3: The hotfix code review. Even under pressure at 1 AM, Raj read the hotfix code before deploying it. The code was correct as generated — but the review took five minutes that Raj considers well-spent. A hotfix that introduced a new bug at 1 AM would have been much harder to recover from than five minutes of code review.

Moment 4: The architectural change. The module-level variable anti-pattern was the architectural vulnerability that allowed the bug to hide for eight months. AI identified the immediate fix; Raj recognized that the comprehensive fix required addressing the architectural pattern. AI generated the code for both; Raj decided which was which.

The Root Cause Report

In his post-incident report, Raj documents the timeline, the diagnosis process, and the lessons:

Lesson 1: Linear memory growth that is volume-triggered has likely existed for a long time at smaller scale. The bug had been present for eight months. At smaller nightly volumes, the accumulated memory stayed within limits for the duration of the job. Volume growth (month-end spike plus organic growth) finally pushed it over the threshold. This class of bugs, always present but only a problem at scale, is particularly difficult to catch with standard testing because tests rarely run at production scale.

Lesson 2: Memory debugging requires understanding the full object reference chain, not just the surface-level data size. The initial estimate of 8KB per error object was wrong by a factor of 10 because it did not account for the SQLAlchemy session reference chain. Understanding object graphs in Python requires knowing how the ORM manages session state — domain knowledge that AI helped surface but that Raj needed to validate.

Lesson 3: Module-level mutable state is a code smell that creates exactly this class of bug. The pattern _errors = [] at module level, mutated by functions over thousands of iterations, is an architectural vulnerability. The post-incident PR converts all such patterns in the service to parameter-passed accumulators.

Lesson 4: AI-assisted debugging works best as a dialogue, not a one-shot answer. The case was broken when Raj challenged AI's first explanation with the memory math discrepancy. The second conversation — "the math doesn't match, what am I missing?" — produced the SQLAlchemy session insight. AI's first answer was not wrong; it was incomplete. The dialogue surfaced the rest.