Case Study 02: Monolith to Microservices Migration
Overview
This case study follows a mid-sized company as they plan and execute a migration from a monolithic Python application to a microservices architecture, using AI assistance throughout the process. The case study illustrates when migration is justified, how to plan the decomposition, the Strangler Fig pattern for incremental migration, and the pitfalls that arise when theory meets practice.
The Starting Point
Company: DataPulse, a B2B analytics platform that helps e-commerce companies understand their customer behavior.
The monolith: A 3-year-old Django application with approximately 80,000 lines of Python code. The application handles:
- User management and authentication
- Data ingestion (receiving event data from customer websites)
- Data processing (aggregating and transforming raw events into metrics)
- Dashboard generation (real-time and historical analytics dashboards)
- Report generation (scheduled PDF/CSV reports emailed to customers)
- Billing and subscription management
- API for programmatic access
The team: 18 developers organized into 3 feature teams, plus a platform team of 4.
Current infrastructure:
- 8 application servers behind an AWS Application Load Balancer
- PostgreSQL primary with 2 read replicas
- Redis for caching and Celery task queue
- Single deployment pipeline deploying the entire application
Why They Decided to Migrate
The team did not wake up one morning and decide microservices sounded fun. They were experiencing concrete pain points:
Pain Point 1: Deployment conflicts. With 18 developers working in the same codebase, merge conflicts were a daily occurrence. A deployment that included changes from three teams meant that a bug in one team's code blocked all three teams' releases.
Pain Point 2: Scaling inefficiency. The data ingestion module received 100x more traffic than the dashboard module, but they could not scale them independently. Adding servers scaled everything equally, wasting resources.
Pain Point 3: Reliability coupling. A memory leak in the report generation module caused application server restarts, which interrupted real-time dashboard users. Completely unrelated features were taking each other down.
Pain Point 4: Technology constraints. The data processing team wanted to use Apache Spark for heavy computations, but the Django monolith could not accommodate a JVM-based tool in its deployment pipeline.
Pain Point 5: Slow test suite. The full test suite took 45 minutes. Developers ran only their own tests locally, leading to integration failures discovered late in the pipeline.
Phase 1: Architecture Assessment with AI
The tech lead initiated a design conversation with an AI assistant:
```
We have a Django monolith (80,000 lines) serving as a B2B analytics
platform. We're experiencing deployment conflicts, scaling inefficiency,
and reliability coupling between modules.

Team: 18 developers in 3 feature teams + 4 platform engineers

Current modules in the monolith:
1. User management and auth
2. Data ingestion (high volume: 10,000 events/second peak)
3. Data processing (CPU-heavy aggregations)
4. Dashboard serving (real-time queries, latency-sensitive)
5. Report generation (periodic, resource-intensive)
6. Billing (Stripe integration)
7. API gateway (rate limiting, authentication)

Questions:
1. Which modules should we extract first, and why?
2. What migration pattern should we use?
3. How do we handle the shared database during migration?
4. What's a realistic timeline for a team our size?
```
The AI's Recommendation
The AI recommended a phased migration using the Strangler Fig pattern, starting with the modules that would provide the most immediate value:
Phase 1 -- Extract Data Ingestion (Months 1-3): This module has the highest volume, the most distinct scaling requirements, and the clearest boundary (it receives external events and writes to the database). Extracting it provides immediate scaling benefits and reduces load on the monolith.
Phase 2 -- Extract Report Generation (Months 3-5): This module is a clear candidate because it runs on schedules, does not serve real-time requests, and its failures should never affect dashboard users. It communicates with the rest of the system through a well-defined data interface.
Phase 3 -- Extract Dashboard Serving (Months 5-8): The read-heavy dashboard module can be optimized independently once separated, potentially using a dedicated read model (CQRS pattern).
Phase 4 -- Extract Data Processing (Months 8-12): This enables the team to adopt Apache Spark or other processing frameworks without affecting the rest of the system.
Modules to leave in the monolith: User management, billing, and the API gateway should remain in the monolith. They are relatively stable, low-volume, and tightly integrated. Extracting them would add complexity without proportional benefit.
The AI emphasized a critical principle: "Do not extract a service unless you can clearly articulate the specific operational benefit. 'It feels like a separate concern' is not sufficient justification."
Phase 2: Planning the Data Ingestion Extraction
The team chose data ingestion as the first extraction target and asked the AI for help with detailed planning.
Defining the Service Boundary
The data ingestion service would be responsible for:
- Receiving HTTP event payloads from customer websites
- Validating event structure and authentication
- Writing raw events to a data store
- Publishing "event received" notifications for downstream processing
It would NOT be responsible for:
- Processing or aggregating events (data processing module)
- User authentication (the ingestion endpoint uses API keys, not user sessions)
- Any read operations (ingestion is write-only)
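This boundary implies a small, strict event contract at the door. A minimal sketch of the validation step is below; the real service used FastAPI/Pydantic, and the field names here are illustrative assumptions, not DataPulse's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative event shape -- field names are assumptions.
REQUIRED_FIELDS = {"tenant_id", "event_type", "payload"}


@dataclass
class Event:
    tenant_id: str
    event_type: str
    payload: dict
    received_at: str  # ISO-8601 timestamp, stamped at ingestion time


def validate_event(raw: dict) -> Event:
    """Reject malformed payloads before they ever reach the event store."""
    missing = REQUIRED_FIELDS - raw.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(raw["payload"], dict):
        raise ValueError("payload must be a JSON object")
    return Event(
        tenant_id=raw["tenant_id"],
        event_type=raw["event_type"],
        payload=raw["payload"],
        received_at=datetime.now(timezone.utc).isoformat(),
    )
```

Keeping validation this strict at the boundary means downstream consumers never have to defend against half-formed events.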
The Strangler Fig Pattern
Instead of a Big Bang cutover, the team used the Strangler Fig pattern:
```
BEFORE (all traffic goes to the monolith):

    Customer Website --> Load Balancer --> Django Monolith
                                                |
                                                v
                                           PostgreSQL

STEP 1 (router sends ingestion traffic to the new service):

    Customer Website --> API Router --|
                                      |--> Ingestion Service (new)
                                      |         |
                                      |         v
                                      |    Event Store (Kafka)
                                      |
                                      |--> Django Monolith (everything else)
                                                |
                                                v
                                           PostgreSQL

STEP 2 (monolith's ingestion code is removed):

    Customer Website --> API Router --> Ingestion Service
                                              |
                                              v
                                        Event Store (Kafka)
                                              |
                                              v
                        Django Monolith (reads from Kafka for processing)
```
The Shared Database Problem
The most challenging aspect of the migration was the shared PostgreSQL database. The monolith's ingestion code wrote directly to the raw_events table, and the processing code read from it.
The AI proposed a three-step approach:
1. Introduce Kafka as an intermediate layer. The new ingestion service writes events to Kafka instead of directly to PostgreSQL. A consumer in the monolith reads from Kafka and writes to PostgreSQL, maintaining backward compatibility.

2. Migrate processing to read from Kafka. Once the data processing module is updated to consume from Kafka directly, the monolith's Kafka-to-PostgreSQL bridge can be removed.

3. Eventually, the ingestion service owns its data store. The raw events table moves to a dedicated data store owned by the ingestion service. The monolith no longer has access to it.
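Step 1 can be sketched as follows: the ingestion service serializes each validated event and hands it to a Kafka producer. Here the producer is stubbed with an in-memory fake so the shape of the interaction is visible without a broker; the topic name and interface are assumptions, not DataPulse's actual code:

```python
import json


class EventPublisher:
    """Writes validated events to Kafka instead of PostgreSQL.
    `producer` is anything with a produce(topic, value) method --
    a real Kafka client in production, an in-memory stub here."""

    def __init__(self, producer, topic="raw-events"):
        self.producer = producer
        self.topic = topic

    def publish(self, event: dict) -> None:
        self.producer.produce(self.topic, json.dumps(event).encode("utf-8"))


class FakeProducer:
    """In-memory stand-in for a Kafka producer, for illustration and tests."""

    def __init__(self):
        self.messages = []

    def produce(self, topic, value):
        self.messages.append((topic, value))
```

Decoupling the publisher from a concrete client also keeps the service's unit tests free of broker dependencies.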
Phase 3: Implementation
The New Ingestion Service
The team built the ingestion service using FastAPI (for high-performance async handling) with the following structure:
```
ingestion-service/
    app/
        main.py                 # FastAPI application entry
        routes/
            events.py           # POST /events endpoint
            health.py           # GET /health endpoint
        services/
            event_validator.py  # Schema validation
            event_publisher.py  # Kafka producer
        models/
            event.py            # Event data model
        auth/
            api_key.py          # API key validation
    tests/
        test_events.py
        test_validator.py
    Dockerfile
    docker-compose.yml
    pyproject.toml
```
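The core check inside a module like `auth/api_key.py` can be sketched like this; the constant-time comparison avoids leaking key prefixes through response timing. The in-memory key store and function names are simplifying assumptions:

```python
import hmac

# Illustrative in-memory key store; the real service would look keys up
# in a database or cache, keyed by tenant.
API_KEYS = {"tenant-1": "s3cr3t-key"}


def valid_api_key(tenant_id: str, presented_key: str) -> bool:
    """Validate an ingestion API key using a constant-time comparison,
    so attackers can't learn key prefixes from timing differences."""
    expected = API_KEYS.get(tenant_id)
    if expected is None:
        return False
    return hmac.compare_digest(expected, presented_key)
```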
Routing with Feature Flags
The team used an API gateway with feature flags to control traffic routing:
```python
# Simplified routing logic in the API gateway
async def route_request(request):
    if request.path.startswith("/api/events"):
        if feature_flag("use_new_ingestion_service"):
            return await forward_to_ingestion_service(request)
    return await forward_to_monolith(request)
```
This enabled a gradual rollout:
- Week 1: 1% of ingestion traffic to the new service (canary)
- Week 2: 10% of traffic (with monitoring)
- Week 3: 50% of traffic
- Week 4: 100% of traffic
- Week 5: Remove old ingestion code from the monolith
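Percentage-based rollout like this is typically implemented by hashing a stable identifier into a bucket, so a given customer stays consistently on one path instead of flapping between old and new on every request. A stdlib sketch of that bucketing (how the flag service stored the percentage is an assumption):

```python
import hashlib


def in_rollout(tenant_id: str, percentage: int) -> bool:
    """Deterministically map a tenant to a bucket in [0, 100).
    The same tenant always gets the same answer for a given percentage,
    so its traffic doesn't bounce between the old and new paths."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage
```

Raising the flag from 1% to 100% then only changes the threshold, never the bucketing, which keeps rollout and rollback predictable per customer.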
The Bridge Consumer
During the transition period, a bridge consumer kept the monolith working:
```python
import json

# Bridge: reads from Kafka and writes to PostgreSQL.
# This temporary component maintained backward compatibility
# during the migration period.


class IngestionBridgeConsumer:
    """Consumes events from Kafka and writes them to the
    monolith's raw_events table for backward compatibility."""

    def __init__(self, kafka_consumer, db_session):
        self.consumer = kafka_consumer
        self.db = db_session

    def run(self):
        for message in self.consumer:
            event_data = json.loads(message.value)
            self.db.execute(
                "INSERT INTO raw_events (tenant_id, event_type, payload, received_at) "
                "VALUES (:tenant_id, :event_type, :payload, :received_at)",
                event_data,
            )
            self.db.commit()
```
Phase 4: What Went Wrong
Despite careful planning, the migration encountered several challenges:
Problem 1: Duplicate Events During Cutover
During the period when traffic was split between the old and new ingestion paths, some events were processed twice. The old path wrote directly to PostgreSQL; the new path wrote to Kafka and then to PostgreSQL via the bridge. For a brief window, some requests were processed by both paths.
Solution: The team added idempotency keys to events. Each event included a unique event_id that was checked before insertion. Duplicate events were silently dropped.
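The idempotency check can be pushed into the database itself with a unique constraint, so duplicates are dropped atomically rather than through a check-then-insert race. A sketch using SQLite for illustration (DataPulse used PostgreSQL, where `ON CONFLICT ... DO NOTHING` behaves the same way; the table columns are simplified):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE raw_events (
        event_id TEXT PRIMARY KEY,  -- idempotency key supplied by the producer
        tenant_id TEXT NOT NULL,
        payload TEXT NOT NULL
    )
    """
)


def insert_event(event_id: str, tenant_id: str, payload: str) -> None:
    """Insert an event; duplicate event_ids are silently dropped
    by the unique constraint instead of raising an error."""
    db.execute(
        "INSERT INTO raw_events (event_id, tenant_id, payload) "
        "VALUES (?, ?, ?) ON CONFLICT (event_id) DO NOTHING",
        (event_id, tenant_id, payload),
    )
    db.commit()


insert_event("evt-1", "t1", "{}")
insert_event("evt-1", "t1", "{}")  # duplicate from the second path: dropped
```

Because the constraint lives in the database, both the old direct-write path and the new bridge path deduplicate consistently without coordinating with each other.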
Problem 2: Kafka Operational Complexity
The team underestimated the operational burden of running Kafka. Configuration tuning, partition management, consumer group coordination, and monitoring required more expertise than the team had.
Solution: They migrated to AWS Managed Streaming for Apache Kafka (MSK), trading cost for operational simplicity. This was documented in ADR-007.
Problem 3: Latency Increase
The new ingestion path added 15-30ms of latency due to the Kafka write and the bridge consumer's processing delay. While acceptable for most customers, one Enterprise customer had monitoring that alerted on ingestion latency above 50ms.
Solution: The team communicated the change proactively to Enterprise customers and optimized the Kafka producer configuration (reducing batch size and linger time for lower latency at the cost of slightly reduced throughput).
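In Kafka producer terms, this trade-off is governed mainly by `linger.ms` (how long the producer waits to fill a batch) and `batch.size`. A sketch of a latency-biased configuration using librdkafka/confluent-kafka-style keys; the specific values are illustrative, not DataPulse's actual settings:

```python
# Latency-biased Kafka producer settings (librdkafka-style keys).
# Values here are illustrative assumptions, not DataPulse's config.
producer_config = {
    "bootstrap.servers": "kafka:9092",
    "linger.ms": 0,       # send immediately instead of waiting to fill a batch
    "batch.size": 16384,  # smaller batches: lower latency, lower throughput
    "acks": "all",        # keep full durability; latency is saved elsewhere
    "compression.type": "lz4",
}
```

The symmetric tuning (large `linger.ms`, large `batch.size`) is how you would bias the same producer toward throughput instead.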
Problem 4: Testing Complexity
Integration tests that previously ran against a single application now needed to coordinate between the ingestion service, Kafka, and the monolith. The test setup became significantly more complex.
Solution: The team invested in a Docker Compose-based integration test environment that spun up all services. They also adopted contract testing (using Pact) to verify that the ingestion service and the monolith agreed on the event format.
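The idea behind the contract tests can be illustrated without Pact: both sides test against one shared description of the event format, so a unilateral format change fails a build instead of failing in production. A simplified stdlib sketch under that assumption (the real setup used Pact; the field names are illustrative):

```python
# Shared contract: the event fields the monolith's consumer relies on.
EVENT_CONTRACT = {
    "event_id": str,
    "tenant_id": str,
    "event_type": str,
    "payload": dict,
}


def conforms(message: dict) -> bool:
    """True if the message carries every contracted field with the right type."""
    return all(
        field in message and isinstance(message[field], expected_type)
        for field, expected_type in EVENT_CONTRACT.items()
    )


# Producer-side check: whatever the ingestion service emits must conform.
produced = {
    "event_id": "e1",
    "tenant_id": "t1",
    "event_type": "page_view",
    "payload": {"url": "/"},
}
assert conforms(produced)
```

Pact adds versioning, broker storage, and provider verification on top of this core idea, but the failure mode it guards against is the same.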
Phase 5: Results and Metrics
After completing the data ingestion extraction (which took 4 months instead of the planned 3), the team measured the impact:
| Metric | Before | After |
|---|---|---|
| Ingestion latency (p99) | 45ms | 60ms (includes Kafka hop) |
| Ingestion throughput capacity | 10,000 events/sec | 50,000 events/sec |
| Deployment frequency (ingestion) | 2x/week (with full monolith) | 5x/week (independent) |
| Deployment risk (ingestion) | High (deploys entire app) | Low (isolated service) |
| Monolith server count | 8 | 6 (reduced load) |
| Ingestion service instances | N/A | 3 (auto-scaled) |
| Mean time to recovery (ingestion failures) | 15 min (restart monolith) | 2 min (restart service) |
Team Sentiment
A survey of the development team revealed:
- Data ingestion team: Very positive. They could deploy independently, use different testing strategies, and iterate faster.
- Other teams: Cautiously optimistic. They saw the benefits for the ingestion team but were concerned about the complexity of the Kafka integration and the new testing requirements.
- Platform team: Overwhelmed. They now had two systems to manage, monitor, and deploy. They requested an additional platform engineer (which was approved).
Phase 6: Decision to Pause
After the data ingestion extraction, the team planned to proceed immediately to extracting report generation. However, the tech lead made a deliberate decision to pause for two months.
The reasoning, captured in ADR-009:
```markdown
# ADR-009: Pause Microservice Extraction for Operational Stabilization

## Status
Accepted

## Context
We successfully extracted the data ingestion service. However,
the team is experiencing operational learning-curve effects:
Kafka monitoring, multi-service debugging, and container management
are consuming more time than expected. The platform team is at
capacity.

## Decision
We will pause further service extractions for 8 weeks to:
1. Stabilize the ingestion service in production
2. Build comprehensive monitoring and alerting
3. Create runbooks for common failure scenarios
4. Train all developers on the new operational procedures
5. Hire an additional platform engineer

## Consequences
- The report generation extraction is delayed by 2 months
- The team will be more confident and capable for the next extraction
- We reduce the risk of operational incidents during migration
```
This decision was validated when, during the stabilization period, the team discovered and fixed three production issues that would have been significantly harder to diagnose if they had been simultaneously extracting another service.
Lessons Learned
The DataPulse team distilled their migration experience into these lessons:
1. Migrate for concrete reasons, not theoretical benefits. Every extracted service must solve a specific, measurable problem. "It should be a separate service" is not sufficient justification.

2. The Strangler Fig pattern works, but the "fig" grows slowly. Incremental migration is safer than Big Bang, but it means running two systems simultaneously for months. Budget for this operational overhead.

3. The shared database is the hardest problem. Data coupling between the monolith and new services is where most migration pain lives. Invest heavily in planning the data separation strategy.

4. Operational readiness is a prerequisite, not an afterthought. Before extracting a service, ensure your team can deploy, monitor, debug, and roll back the new service independently. If you cannot operate it, do not build it.

5. Feature flags are essential for safe migrations. The ability to route traffic between old and new implementations, gradually increase the percentage, and instantly roll back is what makes incremental migration safe.

6. AI assistants are invaluable for migration planning. The AI surfaced the Kafka bridge pattern, the idempotency key requirement, and the contract testing approach -- all of which the team had not initially considered. The AI's breadth of experience with migration patterns complemented the team's deep knowledge of their own system.

7. Know when to pause. The temptation to maintain momentum is strong, but pausing to stabilize after a major architectural change is often the fastest path to long-term velocity. The team that pauses to learn operates faster over the next twelve months than the team that rushes through all extractions.

8. Document everything. The team wrote 12 ADRs during the migration. Each one saved hours of re-discussion when questions arose later. New team members could read the ADR log and understand not just the current architecture, but the journey that led to it.