Case Study 01: The Mystery Memory Leak

Debugging a Memory Leak in a Web Application Using AI-Assisted Analysis

Background

Priya Sharma was a backend developer at DataFlow Analytics, a mid-sized SaaS company that provided real-time data dashboards to enterprise clients. Their core application was a Flask-based web service that ingested data from client APIs, processed it through various transformation pipelines, and served the results through a REST API consumed by their React frontend.

The application had been running reliably for over a year. Then, one Monday morning, the operations team reported that the production servers were being automatically restarted every 6-8 hours by their container orchestration system. The containers were being killed because they exceeded their 2 GB memory limit. This had not happened before the previous Friday's deployment.

Priya was assigned to investigate. She had experience with Python debugging but had never tracked down a memory leak before. She decided to use AI as a debugging partner throughout the process.

Phase 1: Reproducing and Documenting the Problem

Priya started by gathering data. She pulled the container metrics from their monitoring dashboard and observed a clear pattern: memory usage started at approximately 400 MB after each restart and grew linearly, reaching the 2 GB limit after about 6-8 hours depending on traffic volume.

She captured the key observations:

  • Memory grew linearly over time at approximately 200 MB per hour
  • The growth rate correlated with request volume
  • CPU usage remained stable
  • The problem started after Friday's deployment
  • The deployment included 14 commits across 6 pull requests

Her first AI prompt was structured to provide this context:

I'm investigating a memory leak in a Flask application running in Docker
containers. The containers are being OOM-killed every 6-8 hours.

Key observations:
- Memory starts at ~400 MB and grows linearly at ~200 MB/hour
- Growth rate correlates with request volume (more requests = faster growth)
- Started after a deployment on Friday (14 commits, 6 PRs)
- Python 3.11, Flask 2.3, SQLAlchemy 2.0, Celery 5.3
- Running in Docker with 2 GB memory limit

The application processes data from client APIs, transforms it, and
serves results through REST endpoints. We handle about 500 requests
per minute during peak hours.

What are the most common causes of memory leaks in Flask/SQLAlchemy
applications, and what diagnostic steps should I take first?

The AI responded with a prioritized list of common causes:

  1. SQLAlchemy session not being closed properly — sessions holding references to objects
  2. Caching without eviction — in-memory caches growing unbounded
  3. Global data structures accumulating entries — dictionaries or lists at module level
  4. Event listener accumulation — registering event handlers repeatedly
  5. Large file or response objects not being garbage collected — circular references

The AI also suggested specific diagnostic tools: tracemalloc for tracking memory allocations, objgraph for visualizing object references, and memory_profiler for line-by-line memory analysis.

Phase 2: Initial Investigation

Priya started with tracemalloc because it was built into Python and did not require additional dependencies. She added instrumentation to their application:

import tracemalloc
import logging

logger = logging.getLogger(__name__)

tracemalloc.start()

def log_memory_snapshot():
    """Log the top memory-consuming allocations."""
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')
    logger.info("=== Memory Snapshot ===")
    for stat in top_stats[:10]:
        logger.info(f"  {stat}")
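
A simple way to run such snapshots on a schedule, using only the standard library, is a re-arming timer. This sketch is illustrative; the case study does not show how DataFlow actually wired up the scheduling:

```python
import threading
import tracemalloc

tracemalloc.start()

def log_memory_snapshot():
    """Print the top memory-consuming allocations."""
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics('lineno')[:10]:
        print(stat)

def schedule_snapshots(interval_seconds=300):
    """Log a snapshot now, then re-arm a timer to do it again."""
    log_memory_snapshot()
    timer = threading.Timer(interval_seconds, schedule_snapshots,
                            args=(interval_seconds,))
    timer.daemon = True  # do not keep the process alive just for the timer
    timer.start()
```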

She configured this to run every 5 minutes and deployed it to a staging environment with the same traffic patterns. After 30 minutes, the logs revealed a clear trend:

=== Memory Snapshot (T+5min) ===
  app/transforms.py:45: size=12.3 MiB, count=15234
  app/transforms.py:47: size=8.7 MiB, count=10456
  app/cache.py:23: size=5.2 MiB, count=3201

=== Memory Snapshot (T+30min) ===
  app/transforms.py:45: size=78.9 MiB, count=98754
  app/transforms.py:47: size=56.2 MiB, count=67892
  app/cache.py:23: size=34.5 MiB, count=21345
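
Beyond per-snapshot statistics, tracemalloc can diff two snapshots directly with compare_to, which reports per-line growth between two points in time. A minimal sketch, where the list comprehension stands in for the leaking workload:

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# Simulate per-request accumulation, like the leak under investigation
leaked = [list(range(100)) for _ in range(1000)]

current = tracemalloc.take_snapshot()
# Deltas are reported largest-first; a positive size_diff means growth
for stat in current.compare_to(baseline, 'lineno')[:5]:
    print(stat)
```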

She shared these results with AI:

The tracemalloc output shows two files growing rapidly:
- app/transforms.py lines 45 and 47 (from 12 MiB to 79 MiB in 25 min)
- app/cache.py line 23 (from 5 MiB to 35 MiB in 25 min)

Here's the relevant code from transforms.py:

class DataTransformer:
    _transformation_history = {}  # Line 43

    def transform(self, data, transform_id):
        result = self._apply_transforms(data)
        self._transformation_history[transform_id] = {  # Line 45
            'input': data,                               # Line 46
            'output': result,                            # Line 47
            'timestamp': datetime.utcnow()
        }
        return result

And from cache.py:

class ResultCache:
    def __init__(self):
        self._cache = {}  # Line 22

    def store(self, key, value):
        self._cache[key] = value  # Line 23

Both are instantiated as module-level singletons.

Phase 3: AI-Assisted Root Cause Analysis

The AI immediately identified the problem. The _transformation_history dictionary was a class variable (shared across all instances) that stored the complete input and output data for every transformation, and it was never cleared. Similarly, the ResultCache had no eviction policy — entries were added but never removed.
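
The class-variable behavior at the heart of the leak can be reduced to a few lines (Transformer here is a hypothetical stand-in for the real class):

```python
class Transformer:
    # Class-level mutable: one dict shared by every instance for the
    # lifetime of the process, unless something explicitly clears it
    _history = {}

t1 = Transformer()
t2 = Transformer()
t1._history['req-1'] = {'input': 'payload'}

# Both instances see the same dict, so entries accumulate process-wide
assert t2._history is t1._history
assert 'req-1' in t2._history
```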

The AI's analysis was precise:

"You have two unbounded data structures growing with every request. The _transformation_history is a class-level dictionary that stores full copies of input and output data for every transformation. With 500 requests per minute, each storing input/output data, this will consume memory linearly with request volume — exactly matching your observed pattern of ~200 MB/hour.

The ResultCache has the same issue: entries are added but never evicted. This is a classic 'cache without eviction' memory leak.

The fact that this started after Friday's deployment suggests that either _transformation_history was added in that deployment, or the cache eviction logic was removed/broken."

Priya checked the git log and confirmed: one of Friday's pull requests had added the _transformation_history dictionary for a new audit logging feature. The developer who wrote it intended to periodically flush the history to a database, but that flushing logic was planned for a follow-up PR that had not been merged yet.

Phase 4: Developing the Fix

Priya asked the AI for the best approach to fix both issues:

I need to fix two unbounded data structures:

1. _transformation_history: Stores audit data that should eventually
   go to the database. For now, I need it to not leak memory. The
   audit feature is required by our enterprise clients.

2. ResultCache: Needs an eviction policy. Entries should expire after
   15 minutes or when the cache reaches 1000 entries.

What's the best approach for each?

The AI suggested several approaches:

For the transformation history:

  • Use a bounded deque (collections.deque(maxlen=1000)) as an immediate fix
  • Implement periodic flushing to the database using a background task
  • Add a context manager pattern to ensure cleanup after each request

For the result cache:

  • Use functools.lru_cache for simple cases
  • Use cachetools.TTLCache for time-based expiration with a maximum size
  • Implement a custom LRU cache with both time and size limits
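
The last option, a hand-rolled cache bounded by both size and age, can be sketched with only the standard library (class and method names here are illustrative):

```python
import time
from collections import OrderedDict

class BoundedTTLCache:
    """Cache that evicts by size (maxsize entries) and age (ttl seconds)."""

    def __init__(self, maxsize=1000, ttl=900):
        self._data = OrderedDict()  # key -> (expires_at, value)
        self.maxsize = maxsize
        self.ttl = ttl

    def store(self, key, value):
        if key in self._data:
            del self._data[key]  # re-insert so the key moves to the end
        self._data[key] = (time.monotonic() + self.ttl, value)
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the oldest insertion

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if time.monotonic() >= expires_at:
            del self._data[key]  # lazily expire stale entries on read
            return default
        return value
```

cachetools.TTLCache implements essentially this behavior with more careful ordering and locking, which is why the final fix relies on it rather than a custom class.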

Priya chose the following approach:

from collections import deque
from datetime import datetime
from cachetools import TTLCache
import threading

class DataTransformer:
    # Bounded history with periodic flush
    _transformation_history = deque(maxlen=1000)
    _flush_lock = threading.Lock()

    def transform(self, data, transform_id):
        result = self._apply_transforms(data)
        self._transformation_history.append({
            'transform_id': transform_id,
            'input_hash': hash(str(data)),  # Store a hash, not full data (note: hash() is salted per process, so values differ across restarts)
            'output_size': len(str(result)),
            'timestamp': datetime.utcnow()
        })
        if len(self._transformation_history) >= 900:
            self._flush_history()
        return result

    def _flush_history(self):
        with self._flush_lock:
            items = list(self._transformation_history)
            self._transformation_history.clear()
            # Flush to database in background
            flush_to_db.delay(items)  # Celery task


class ResultCache:
    def __init__(self, maxsize=1000, ttl=900):
        self._cache = TTLCache(maxsize=maxsize, ttl=ttl)

    def store(self, key, value):
        self._cache[key] = value

    def get(self, key, default=None):
        return self._cache.get(key, default)

Phase 5: Verification and Learning

After deploying the fix to staging, Priya ran the same monitoring for 24 hours. Memory usage stabilized at approximately 500 MB, fluctuating within a 50 MB range but never growing beyond that. The fix was confirmed.

She then had a learning conversation with AI:

The fix worked — memory is stable now. I want to prevent this class
of bug in the future. What patterns should I watch for in code reviews,
and are there automated tools that can detect unbounded data structure
growth?

The AI recommended several preventive measures:

  1. Code review checklist item: Any module-level or class-level mutable data structure should have a documented eviction strategy
  2. Memory monitoring: Set up alerts when memory usage exceeds 70% of the container limit
  3. Testing: Write a stress test that runs 10,000 requests and asserts memory stays below a threshold
  4. Static analysis: Use pylint custom rules to flag class-level mutable data structures without size bounds
  5. Architecture pattern: Use Redis or Memcached for caching instead of in-process dictionaries, so cache state survives restarts and is shared across instances
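
The stress test in item 3 can be sketched with tracemalloc and unittest. The request count and the 10 MiB threshold here are illustrative and would need tuning against a real service's baseline:

```python
import tracemalloc
import unittest
from collections import deque

class BoundedHistory:
    """Stand-in for the fixed transformer: bounded per-request state."""

    def __init__(self):
        self.history = deque(maxlen=1000)

    def handle_request(self, payload):
        self.history.append({'size': len(payload)})

class MemoryGrowthTest(unittest.TestCase):
    def test_memory_stays_bounded(self):
        tracemalloc.start()
        service = BoundedHistory()
        for _ in range(10_000):
            service.handle_request('x' * 100)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        # Threshold is illustrative; pick one above your service's baseline
        self.assertLess(peak, 10 * 1024 * 1024)
```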

Key Lessons

This case study illustrates several important debugging principles:

Start with observation, not assumption. Priya gathered metrics before forming hypotheses. The linear memory growth pattern and correlation with request volume immediately pointed toward a per-request accumulation, which narrowed the search significantly.

Use the right diagnostic tool. tracemalloc pinpointed the exact lines of code responsible for the memory growth. Without it, Priya might have spent hours reading through the 14 commits manually.

AI excels at pattern recognition. The AI immediately recognized the "unbounded data structure" pattern and the "cache without eviction" anti-pattern. These are well-documented patterns that AI has been trained on extensively.

Fix the root cause, not just the symptom. Rather than simply adding del statements or calling gc.collect(), the fix addressed the architectural issue: data structures need explicit lifecycle management.

Prevent recurrence. The monitoring alerts, stress tests, and code review checklist items ensure that similar issues will be caught before they reach production.

The Role of AI Throughout

AI served different roles at each phase of this investigation:

  • Phase 1: Provided a prioritized list of common causes, helping Priya focus her investigation
  • Phase 2: Suggested tracemalloc as the right diagnostic tool
  • Phase 3: Identified the root cause from the code and tracemalloc output
  • Phase 4: Proposed multiple fix strategies with tradeoffs
  • Phase 5: Suggested preventive measures for the future

At no point did AI replace Priya's engineering judgment. She chose which diagnostic approach to use, evaluated the fix suggestions against her system's requirements, and made the architectural decisions. AI accelerated her investigation by providing expertise in memory debugging patterns that she had not previously encountered.

The total time from initial report to deployed fix was approximately 4 hours. Priya estimated that without AI assistance, the investigation would have taken 1-2 days, primarily because she would have needed to research memory profiling tools and common Flask memory leak patterns from scratch. The AI compressed that research phase from hours to minutes.