Case Study 01: The Mystery Memory Leak
Debugging a Memory Leak in a Web Application Using AI-Assisted Analysis
Background
Priya Sharma was a backend developer at DataFlow Analytics, a mid-sized SaaS company that provided real-time data dashboards to enterprise clients. Their core application was a Flask-based web service that ingested data from client APIs, processed it through various transformation pipelines, and served the results through a REST API consumed by their React frontend.
The application had been running reliably for over a year. Then, one Monday morning, the operations team reported that the production servers were being automatically restarted every 6-8 hours by their container orchestration system. The containers were being killed because they exceeded their 2 GB memory limit. This had not happened before the previous Friday's deployment.
Priya was assigned to investigate. She had experience with Python debugging but had never tracked down a memory leak before. She decided to use AI as a debugging partner throughout the process.
Phase 1: Reproducing and Documenting the Problem
Priya started by gathering data. She pulled the container metrics from their monitoring dashboard and observed a clear pattern: memory usage started at approximately 400 MB after each restart and grew linearly, reaching the 2 GB limit after about 6-8 hours depending on traffic volume.
She captured the key observations:
- Memory grew linearly over time at approximately 200 MB per hour
- The growth rate correlated with request volume
- CPU usage remained stable
- The problem started after Friday's deployment
- The deployment included 14 commits across 6 pull requests
Her first AI prompt was structured to provide this context:
I'm investigating a memory leak in a Flask application running in Docker
containers. The containers are being OOM-killed every 6-8 hours.
Key observations:
- Memory starts at ~400 MB and grows linearly at ~200 MB/hour
- Growth rate correlates with request volume (more requests = faster growth)
- Started after a deployment on Friday (14 commits, 6 PRs)
- Python 3.11, Flask 2.3, SQLAlchemy 2.0, Celery 5.3
- Running in Docker with 2 GB memory limit
The application processes data from client APIs, transforms it, and
serves results through REST endpoints. We handle about 500 requests
per minute during peak hours.
What are the most common causes of memory leaks in Flask/SQLAlchemy
applications, and what diagnostic steps should I take first?
The AI responded with a prioritized list of common causes:
- SQLAlchemy session not being closed properly — sessions holding references to objects
- Caching without eviction — in-memory caches growing unbounded
- Global data structures accumulating entries — dictionaries or lists at module level
- Event listener accumulation — registering event handlers repeatedly
- Large file or response objects not being garbage collected — circular references
The AI also suggested specific diagnostic tools: tracemalloc for tracking memory allocations, objgraph for visualizing object references, and memory_profiler for line-by-line memory analysis.
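Beyond single snapshots, tracemalloc can diff two snapshots to isolate exactly which allocations grew between them, which is the usual workflow for leak hunting. A minimal sketch (the list `store` is a hypothetical stand-in for a leaking structure):

```python
import tracemalloc

tracemalloc.start()

store = []
before = tracemalloc.take_snapshot()

# Simulate unbounded per-request accumulation
for i in range(10000):
    store.append("x" * 100)

after = tracemalloc.take_snapshot()

# StatisticDiff entries are ordered by the size of the change,
# so the leaking line surfaces at the top
stats = after.compare_to(before, 'lineno')
print(stats[0].size_diff > 0)  # → True
```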
Phase 2: Initial Investigation
Priya started with tracemalloc because it was built into Python and did not require additional dependencies. She added instrumentation to their application:
import tracemalloc
import logging
logger = logging.getLogger(__name__)
tracemalloc.start()
def log_memory_snapshot():
    """Log the top memory-consuming allocations."""
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')
    logger.info("=== Memory Snapshot ===")
    for stat in top_stats[:10]:
        logger.info(f"  {stat}")
She configured this to run every 5 minutes and deployed it to a staging environment with the same traffic patterns. After 30 minutes, the logs revealed a clear trend:
=== Memory Snapshot (T+5min) ===
app/transforms.py:45: size=12.3 MiB, count=15234
app/transforms.py:47: size=8.7 MiB, count=10456
app/cache.py:23: size=5.2 MiB, count=3201
=== Memory Snapshot (T+30min) ===
app/transforms.py:45: size=78.9 MiB, count=98754
app/transforms.py:47: size=56.2 MiB, count=67892
app/cache.py:23: size=34.5 MiB, count=21345
She shared these results with the AI:
The tracemalloc output shows two files growing rapidly:
- app/transforms.py lines 45 and 47 (from 12 MiB to 79 MiB in 25 min)
- app/cache.py line 23 (from 5 MiB to 35 MiB in 25 min)
Here's the relevant code from transforms.py:
class DataTransformer:
    _transformation_history = {}  # Line 43

    def transform(self, data, transform_id):
        result = self._apply_transforms(data)
        self._transformation_history[transform_id] = {  # Line 45
            'input': data,                              # Line 46
            'output': result,                           # Line 47
            'timestamp': datetime.utcnow()
        }
        return result
And from cache.py:
class ResultCache:
    def __init__(self):
        self._cache = {}  # Line 22

    def store(self, key, value):
        self._cache[key] = value  # Line 23
Both are instantiated as module-level singletons.
Phase 3: AI-Assisted Root Cause Analysis
The AI immediately identified the problem. The _transformation_history dictionary was a class variable (shared across all instances) that stored the complete input and output data for every transformation, and it was never cleared. Similarly, the ResultCache had no eviction policy — entries were added but never removed.
The AI's analysis was precise:
"You have two unbounded data structures growing with every request. The _transformation_history is a class-level dictionary that stores full copies of input and output data for every transformation. With 500 requests per minute, each storing input/output data, this will consume memory linearly with request volume — exactly matching your observed pattern of ~200 MB/hour. The ResultCache has the same issue: entries are added but never evicted. This is a classic 'cache without eviction' memory leak. The fact that this started after Friday's deployment suggests that either _transformation_history was added in that deployment, or the cache eviction logic was removed/broken."
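The AI's claim that the pattern matched is easy to sanity-check with the figures from Priya's observations (peak traffic and growth rate are taken from Phase 1; the per-request figure is derived, not measured):

```python
requests_per_hour = 500 * 60                # ~500 requests/minute at peak
growth_bytes_per_hour = 200 * 1024 * 1024   # ~200 MB/hour observed growth

# Implied retained data per request, if the leak is per-request accumulation
per_request_kb = growth_bytes_per_hour / requests_per_hour / 1024
print(round(per_request_kb, 1))  # → 6.8
```

Roughly 7 KB of retained data per request is entirely plausible for storing full input and output payloads, which is consistent with the per-request accumulation hypothesis.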
Priya checked the git log and confirmed: one of Friday's pull requests had added the _transformation_history dictionary for a new audit logging feature. The developer who wrote it intended to periodically flush the history to a database, but that flushing logic was planned for a follow-up PR that had not been merged yet.
Phase 4: Developing the Fix
Priya asked the AI for the best approach to fix both issues:
I need to fix two unbounded data structures:
1. _transformation_history: Stores audit data that should eventually
go to the database. For now, I need it to not leak memory. The
audit feature is required by our enterprise clients.
2. ResultCache: Needs an eviction policy. Entries should expire after
15 minutes or when the cache reaches 1000 entries.
What's the best approach for each?
The AI suggested several approaches:
For the transformation history:
- Use a bounded deque (collections.deque(maxlen=1000)) as an immediate fix
- Implement periodic flushing to the database using a background task
- Add a context manager pattern to ensure cleanup after each request
For the result cache:
- Use functools.lru_cache for simple cases
- Use cachetools.TTLCache for time-based expiration with a maximum size
- Implement a custom LRU cache with both time and size limits
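Two of the suggested building blocks can be sketched in a few lines (illustrative sizes, not the production values):

```python
from collections import deque
from functools import lru_cache

# A bounded deque silently discards the oldest entries once maxlen is reached,
# so it can never grow without bound
history = deque(maxlen=3)
for i in range(5):
    history.append(i)
print(list(history))  # → [2, 3, 4]

# lru_cache bounds a cache by entry count; note it has no
# time-based expiry on its own
@lru_cache(maxsize=1000)
def transform_cached(key):
    return key.upper()  # placeholder for an expensive transformation

transform_cached("a")
transform_cached("a")
print(transform_cached.cache_info().hits)  # → 1
```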
Priya chose the following approach:
from collections import deque
from datetime import datetime
import threading

from cachetools import TTLCache

class DataTransformer:
    # Bounded history with periodic flush
    _transformation_history = deque(maxlen=1000)
    _flush_lock = threading.Lock()

    def transform(self, data, transform_id):
        result = self._apply_transforms(data)
        self._transformation_history.append({
            'transform_id': transform_id,
            'input_hash': hash(str(data)),    # Store hash, not full data
            'output_size': len(str(result)),
            'timestamp': datetime.utcnow()
        })
        if len(self._transformation_history) >= 900:
            self._flush_history()
        return result

    def _flush_history(self):
        with self._flush_lock:
            items = list(self._transformation_history)
            self._transformation_history.clear()
            # Flush to database in background
            flush_to_db.delay(items)  # Celery task

class ResultCache:
    def __init__(self, maxsize=1000, ttl=900):
        self._cache = TTLCache(maxsize=maxsize, ttl=ttl)

    def store(self, key, value):
        self._cache[key] = value

    def get(self, key, default=None):
        return self._cache.get(key, default)
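The size bound of the new cache is easy to verify in isolation (a minimal sketch, assuming cachetools is installed; the tiny `maxsize` is only for demonstration):

```python
from cachetools import TTLCache

# TTLCache evicts entries after `ttl` seconds AND caps total entries at `maxsize`
cache = TTLCache(maxsize=3, ttl=900)
for i in range(5):
    cache[i] = i * i  # inserting past maxsize evicts the oldest entries

print(len(cache))  # → 3
```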
Phase 5: Verification and Learning
After deploying the fix to staging, Priya ran the same monitoring for 24 hours. Memory usage stabilized at approximately 500 MB, fluctuating within a 50 MB range but never growing beyond that. The fix was confirmed.
She then had a learning conversation with AI:
The fix worked — memory is stable now. I want to prevent this class
of bug in the future. What patterns should I watch for in code reviews,
and are there automated tools that can detect unbounded data structure
growth?
The AI recommended several preventive measures:
- Code review checklist item: Any module-level or class-level mutable data structure should have a documented eviction strategy
- Memory monitoring: Set up alerts when memory usage exceeds 70% of the container limit
- Testing: Write a stress test that runs 10,000 requests and asserts memory stays below a threshold
- Static analysis: Use pylint custom rules to flag class-level mutable data structures without size bounds
- Architecture pattern: Use Redis or Memcached for caching instead of in-process dictionaries, so cache state survives restarts and is shared across instances
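A stress test along the lines the AI suggested might look like the sketch below. It uses tracemalloc rather than container metrics, and `handle_request` is a hypothetical stand-in for exercising a real endpoint; the 5 MB threshold is illustrative:

```python
import tracemalloc
from collections import deque

history = deque(maxlen=1000)  # bounded, as in the fix

def handle_request(i):
    """Stand-in for a request handler that records per-request data."""
    history.append({'id': i, 'payload': 'x' * 100})

tracemalloc.start()
for i in range(10000):
    handle_request(i)

# Assert memory stays below a threshold after many requests
current, peak = tracemalloc.get_traced_memory()
assert current < 5 * 1024 * 1024

print(len(history))  # → 1000
```

With the old unbounded dictionary in place of the deque, the same test would show memory growing with the request count instead of plateauing.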
Key Lessons
This case study illustrates several important debugging principles:
Start with observation, not assumption. Priya gathered metrics before forming hypotheses. The linear memory growth pattern and correlation with request volume immediately pointed toward a per-request accumulation, which narrowed the search significantly.
Use the right diagnostic tool. tracemalloc pinpointed the exact lines of code responsible for the memory growth. Without it, Priya might have spent hours reading through the 14 commits manually.
AI excels at pattern recognition. The AI immediately recognized the "unbounded data structure" pattern and the "cache without eviction" anti-pattern. These are well-documented patterns that AI has been trained on extensively.
Fix the root cause, not just the symptom. Rather than simply adding del statements or calling gc.collect(), the fix addressed the architectural issue: data structures need explicit lifecycle management.
Prevent recurrence. The monitoring alerts, stress tests, and code review checklist items ensure that similar issues will be caught before they reach production.
The Role of AI Throughout
AI served different roles at each phase of this investigation:
- Phase 1: Provided a prioritized list of common causes, helping Priya focus her investigation
- Phase 2: Suggested tracemalloc as the right diagnostic tool
- Phase 3: Identified the root cause from the code and tracemalloc output
- Phase 4: Proposed multiple fix strategies with tradeoffs
- Phase 5: Suggested preventive measures for the future
At no point did AI replace Priya's engineering judgment. She chose which diagnostic approach to use, evaluated the fix suggestions against her system's requirements, and made the architectural decisions. AI accelerated her investigation by providing expertise in memory debugging patterns that she had not previously encountered.
The total time from initial report to deployed fix was approximately 4 hours. Priya estimated that without AI assistance, the investigation would have taken 1-2 days, primarily because she would have needed to research memory profiling tools and common Flask memory leak patterns from scratch. The AI compressed that research phase from hours to minutes.