Case Study 02: The Deployment That Went Wrong
A Production Incident, Rollback, and Post-Mortem
Background
Marcus Chen is a backend developer at a mid-sized e-commerce startup called ShopStream. The company sells artisanal goods from independent makers, processing about 2,000 orders per day through a platform built with Python (Django), React, PostgreSQL, and Elasticsearch. The engineering team of eight developers practices continuous delivery: code merged to main is automatically tested, built into Docker images, and deployed to a staging environment. Production deployments happen via a single-click approval in their GitLab CI pipeline.
Marcus had been using AI coding assistants extensively for the past six months. His productivity had roughly doubled — he was shipping features faster than ever. But on a Tuesday in February, speed would become the problem.
The Feature
ShopStream's product team wanted to revamp the checkout flow. The existing checkout was a multi-page process: cart review, shipping address, payment, confirmation. Customer data showed a 35% abandonment rate, and the product team believed a single-page checkout would improve conversion.
Marcus was assigned the task. Using his AI assistant, he rewrote the checkout backend in three days — roughly half the estimated time. The new checkout endpoint consolidated four API calls into one, reduced database round trips from 12 to 4, and introduced an optimistic inventory reservation system to prevent the dreaded "out of stock" error at the payment step.
The AI had been tremendously helpful with the database query optimization and the inventory reservation logic. Marcus reviewed the generated code, wrote tests (also with AI assistance), and was confident in the implementation.
The Deployment: Tuesday, 2:00 PM
The code had been in staging for two days. The QA team tested the new checkout flow and signed off. Marcus's pull request had been reviewed by two colleagues. All CI checks were green: 247 unit tests passed, 18 integration tests passed, linting clean, type checking clean.
At 2:00 PM, Marcus clicked "Deploy to Production" in GitLab CI. The pipeline built the Docker image, pushed it to the container registry, and triggered a rolling deployment across the three production application servers. The deployment took 4 minutes. Health checks passed on all three instances.
Marcus watched the Grafana dashboard for five minutes. Request rate normal, error rate at 0.2% (baseline), p95 latency at 180ms (normal). He announced in Slack: "Checkout v2 is live! Monitor for any issues."
The First Signs: Tuesday, 2:25 PM
Twenty-five minutes after deployment, the customer support team posted in the #engineering Slack channel:
"Three customers in the last 10 minutes reporting 'Something went wrong' on checkout. Can someone look?"
Marcus checked Grafana. The error rate had climbed from 0.2% to 1.8%. Not catastrophic, but clearly elevated. He opened Sentry and found a new error cluster: IntegrityError: duplicate key value violates unique constraint "orders_order_number_key".
The error occurred in the new checkout endpoint. Marcus's optimistic inventory reservation created an order record at the beginning of the checkout process and updated it when payment was confirmed. The order number was generated by application code at that first step, and under concurrent load the generation logic occasionally produced duplicates due to a race condition in the new code.
Escalation: Tuesday, 2:35 PM
Marcus examined the error pattern. It was not every checkout — roughly 8% of checkout attempts were failing. The issue manifested only when two users initiated checkout within the same database transaction window. In staging, the QA team had tested sequentially, never triggering the race condition.
The error rate was now at 3.2% and climbing as afternoon traffic increased. Marcus made the call to roll back.
The Rollback Attempt: Tuesday, 2:40 PM
Marcus navigated to GitLab CI and triggered a redeployment of the previous Docker image tag. The rollback pipeline started. It would take approximately 4 minutes for the rolling deployment to complete.
Then he remembered: the deployment included a database migration.
The new checkout code had added three columns to the orders table and created a new inventory_reservations table. The old code did not know about these columns and would not use them. But did any of its queries run SELECT * against the orders table? Would the extra columns cause issues?
Marcus checked the old code quickly. The Django ORM models explicitly defined fields, so SELECT * was not used. The extra columns would be ignored. The rollback should be safe from the application side.
But there was another problem. During the 25 minutes the new code had been running, it had created records in the inventory_reservations table. These reservations had locked inventory. If Marcus rolled back the code but left the reservations table in place, that inventory would remain locked indefinitely — customers would see items as "out of stock" that actually had available inventory.
The Decision: Tuesday, 2:45 PM
Marcus faced a critical decision with three options:
Option A: Full rollback — Revert the code and manually clear the inventory_reservations table. Risk: might release inventory for orders that were genuinely in progress.
Option B: Hotfix forward — Fix the race condition in the new code and deploy the fix. Risk: takes time; error rate continues meanwhile.
Option C: Partial rollback — Revert the code, keep the database changes, write a script to carefully reconcile the reservations. Risk: complexity under pressure.
Marcus chose Option A with a safeguard. He would roll back the code immediately to stop the bleeding, then write a reconciliation script to handle the reservations correctly rather than blindly clearing them.
The Recovery: Tuesday, 2:50 PM – 3:30 PM
2:50 PM — The rolling deployment of the old code completed. Health checks passed. Marcus watched the error rate drop from 3.2% back to 0.2% within two minutes.
2:55 PM — Marcus posted a status update in Slack: "Checkout v2 has been rolled back. Checkout is working normally on the old flow. I'm now working on cleaning up inventory reservations."
3:00 PM — Using his AI assistant, Marcus wrote a reconciliation script:
"Write a Python script that: queries all records from inventory_reservations created in the last hour, checks if each reservation has a corresponding completed order in the orders table, releases (deletes) any reservations that do NOT have a completed order, and logs every action taken. This is for a Django application."
The AI generated a script with proper database transaction handling and comprehensive logging. Marcus reviewed it line by line. He tested it against a copy of the production data in his local environment. It correctly identified 47 orphaned reservations and would release them.
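The actual script is not reproduced in the incident record, but a reconciliation of this shape can be sketched with the standard library, using an in-memory SQLite database as a stand-in for production PostgreSQL. The table and column names (inventory_reservations, orders, status, created_at) are assumptions for illustration, not ShopStream's real schema:

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("reconcile")

def release_orphaned_reservations(conn, since):
    """Release reservations created since `since` with no completed order."""
    cur = conn.cursor()
    cur.execute(
        """
        SELECT r.id, r.order_id
        FROM inventory_reservations r
        LEFT JOIN orders o
          ON o.id = r.order_id AND o.status = 'completed'
        WHERE r.created_at >= ? AND o.id IS NULL
        """,
        (since,),
    )
    orphans = cur.fetchall()
    for res_id, order_id in orphans:
        cur.execute("DELETE FROM inventory_reservations WHERE id = ?", (res_id,))
        log.info("released reservation %s (order %s not completed)", res_id, order_id)
    conn.commit()  # one transaction: all releases commit together
    return len(orphans)

# Demo against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE inventory_reservations (
        id INTEGER PRIMARY KEY, order_id INTEGER, created_at TEXT);
    INSERT INTO orders VALUES (1, 'completed'), (2, 'pending');
    INSERT INTO inventory_reservations VALUES
        (10, 1, '2025-02-18 14:00'),    -- backed by a completed order: keep
        (11, 2, '2025-02-18 14:05'),    -- order never completed: release
        (12, NULL, '2025-02-18 14:10'); -- no order at all: release
    """
)
released = release_orphaned_reservations(conn, "2025-02-18 13:00")
remaining = conn.execute("SELECT COUNT(*) FROM inventory_reservations").fetchone()[0]
print(released, remaining)  # 2 released, 1 kept
```

The key property, matching Marcus's requirement, is that a reservation is released only when no completed order claims it, and every release is logged.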
3:15 PM — Marcus ran the reconciliation script in production. 47 reservations released. He verified that the affected products now showed correct inventory counts on the website.
3:30 PM — Full recovery confirmed. Marcus posted the final status update: "All clear. Checkout is working normally. Inventory counts have been reconciled. No orders were lost or duplicated."
Impact Assessment
The incident lasted approximately 65 minutes from first customer report to full recovery. During that time:
- 312 checkout attempts were made
- 26 checkouts failed with the IntegrityError (8.3% failure rate)
- 19 of those customers retried and succeeded
- 7 customers abandoned their carts
- Estimated revenue impact: $840 in potentially lost sales
- 47 inventory items were incorrectly reserved and then released
- Zero data corruption — no orders were lost or incorrectly charged
- Customer support handled 11 tickets related to the incident
The Root Cause Analysis
Marcus and his team held a blameless post-mortem the next day. They identified multiple contributing factors:
Primary cause: Race condition in order number generation. The new code used a pattern where it generated an order number, created a record, and then processed payment. Under concurrent load, two transactions could generate the same order number before either committed. The AI-generated code used SELECT MAX(order_number) + 1 instead of a proper database sequence, and Marcus had not caught this during review.
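The failure mode can be reproduced without a database or threads. In this sketch (hypothetical values; the pattern is the MAX-plus-one approach described above), two interleaved "transactions" both read the current maximum before either writes, so both compute the same number:

```python
# Deterministic illustration of the SELECT MAX(order_number) + 1 race.
orders = [100001, 100002]  # existing order numbers (hypothetical)

def read_max_plus_one(table):
    """What MAX(order_number) + 1 returns at read time."""
    return max(table) + 1

# Interleaving: both transactions read before either commits.
txn_a = read_max_plus_one(orders)  # 100003
txn_b = read_max_plus_one(orders)  # 100003 -- same value!

orders.append(txn_a)
orders.append(txn_b)

duplicates = len(orders) != len(set(orders))
print(f"txn_a={txn_a}, txn_b={txn_b}, duplicates={duplicates}")
# A database sequence hands out a distinct value per nextval() call,
# even across concurrent transactions, so this interleaving cannot collide.
```

The unique constraint on order_number then rejects the second insert, which is exactly the IntegrityError Sentry reported.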
Contributing factor 1: Insufficient test coverage for concurrency. The test suite tested the checkout flow sequentially. There were no tests that simulated concurrent checkout attempts. This is a common blind spot — concurrent behavior is hard to test and easy to overlook.
Contributing factor 2: Staging environment did not simulate production load. QA testing was manual and sequential. There was no load testing step in the deployment pipeline. The race condition could only be triggered under concurrent load.
Contributing factor 3: Deployment timing. The deployment happened at 2:00 PM — peak traffic time. A deployment during low-traffic hours (early morning) would have given the team more time to detect and respond before many customers were affected.
Contributing factor 4: Coupled deployment and migration. The database migration and code deployment were a single atomic operation. This made rollback more complex because the database changes could not be easily reversed independently of the code changes.
Action Items
The team identified concrete improvements across four categories:
Testing improvements:
1. Add concurrent load testing to the CI pipeline using Locust or k6 (Owner: Marcus, Due: 2 weeks)
2. Write specific concurrency tests for any code that generates unique identifiers or modifies shared state (Owner: all developers, Ongoing)
3. Add a "chaos testing" step to staging deployments that simulates concurrent users (Owner: DevOps lead, Due: 1 month)
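The concurrent load testing in item 1 would use a dedicated tool like Locust or k6, but the core idea can be sketched with only the standard library: stand up a server, fire overlapping requests at it, and fail the build if any request errors. The handler here is a stand-in for the real checkout endpoint:

```python
import http.server
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

class CheckoutHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in for the checkout endpoint under test."""
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), CheckoutHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/"

def hit():
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status

# 50 overlapping requests from 10 workers: the kind of concurrency
# that sequential QA testing never produced.
with ThreadPoolExecutor(max_workers=10) as pool:
    statuses = [f.result() for f in as_completed([pool.submit(hit) for _ in range(50)])]

server.shutdown()
failures = [s for s in statuses if s != 200]
print(f"{len(statuses)} requests, {len(failures)} failures")
assert not failures, "load test found failing requests"
```

A real CI step would point this at the staging checkout endpoint and also assert on response bodies and latency percentiles, not just status codes.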
Deployment process improvements:
4. Implement blue-green deployment to enable instant rollback without rolling deployment delay (Owner: DevOps lead, Due: 3 weeks)
5. Separate database migrations from code deployments — run migrations first, verify, then deploy code (Owner: tech lead, Due: 2 weeks)
6. Establish a deployment window policy: no deployments between 1 PM and 5 PM (peak traffic) unless urgent (Owner: engineering manager, Immediate)
Monitoring improvements:
7. Add an alert for checkout failure rate exceeding 2% (Owner: Marcus, Due: 3 days)
8. Create a deployment-specific Grafana dashboard that automatically displays key metrics during and after each deployment (Owner: DevOps lead, Due: 2 weeks)
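If ShopStream's Grafana is backed by Prometheus, the alert in item 7 could be expressed as an alerting rule along these lines. This is a sketch: the metric name checkout_attempts_total and its result label are assumptions about the instrumentation, not actual ShopStream metrics:

```yaml
groups:
  - name: checkout
    rules:
      - alert: CheckoutFailureRateHigh
        # Fraction of checkout attempts failing over the last 5 minutes.
        expr: |
          sum(rate(checkout_attempts_total{result="failure"}[5m]))
            /
          sum(rate(checkout_attempts_total[5m])) > 0.02
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Checkout failure rate above 2% for 5 minutes"
```

The `for: 5m` clause avoids paging on a single noisy scrape; in this incident, a rule like this would have fired well before the error rate reached 3.2%.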
AI-assisted development process improvements:
9. Add a review checklist item: "Does this code handle concurrent access correctly?" specifically for AI-generated code (Owner: tech lead, Due: 1 week)
10. When AI generates database-related code, always verify that it uses proper database primitives (sequences, advisory locks, SELECT FOR UPDATE) rather than application-level workarounds (Owner: all developers, Ongoing)
Lessons Learned
Lesson 1: AI-generated code has systematic blind spots. The AI generated functionally correct code that worked perfectly in isolation. It did not consider concurrent access because concurrency was not mentioned in the prompt. AI assistants optimize for the happy path unless explicitly prompted otherwise. Marcus's refined process now includes asking: "What are the concurrency, failure, and edge case scenarios for this code?"
Lesson 2: The test suite is the safety net, not code review. Two experienced developers reviewed the code and missed the race condition. This is not a failure of code review — race conditions are notoriously hard to spot in static code. Automated concurrency testing would have caught this. The lesson is that certain classes of bugs require specific types of testing, and those testing requirements should be part of the definition of done.
Lesson 3: Database migrations dramatically complicate rollbacks. A code-only rollback is trivial: deploy the old image. A rollback that must also reverse a database migration is complex and risky. The solution is to decouple migrations from deployments and use the expand-contract pattern for schema changes.
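The expand-contract pattern mentioned above splits a schema change into backward-compatible steps. For the reservations feature it might have looked like this sketch (illustrative DDL; the column names are assumptions, not the actual migrations):

```sql
-- Step 1, expand (deployed alone, before any code change):
-- purely additive, so the old code keeps working unchanged.
CREATE TABLE inventory_reservations (
    id         BIGSERIAL PRIMARY KEY,
    order_id   BIGINT REFERENCES orders(id),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
ALTER TABLE orders ADD COLUMN reservation_id BIGINT NULL;

-- Step 2: deploy the code that writes the new structures.
-- A code-only rollback is now trivial; the extra schema sits unused.

-- Step 3, contract (later, once the new path is proven):
-- tighten constraints and drop anything only the old path needed.
ALTER TABLE orders ALTER COLUMN reservation_id SET NOT NULL;
```

Had the migration shipped as step 1 on Monday, Tuesday's rollback would have been a pure code revert with no reconciliation question to answer under pressure.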
Lesson 4: Deploy during low-traffic periods. This seems obvious in hindsight, but many teams deploy whenever code is ready. A simple policy of avoiding peak-traffic deployments would have reduced the blast radius of this incident by 60-70%.
Lesson 5: Have a rollback runbook before you need one. Marcus made correct decisions under pressure, but he was improvising. A pre-written runbook with decision trees ("If the deployment includes a migration, follow these steps...") would have made the response faster and less stressful.
Lesson 6: Blameless post-mortems produce better outcomes. The post-mortem focused on systemic improvements, not on who made the mistake. Marcus was not blamed for the race condition — instead, the team identified gaps in their process that allowed the bug to reach production. This approach encouraged honesty and produced actionable improvements.
The Fix
Marcus fixed the race condition by replacing the application-level order number generation with a PostgreSQL sequence:
```sql
CREATE SEQUENCE order_number_seq START WITH 100000;
```

```python
# Before (race condition):
# max_num = Order.objects.aggregate(Max('order_number'))['order_number__max']
# new_order_number = (max_num or 99999) + 1

# After (correct):
from django.db import connection

def generate_order_number() -> int:
    with connection.cursor() as cursor:
        cursor.execute("SELECT nextval('order_number_seq')")
        return cursor.fetchone()[0]
```
He also added a concurrent checkout test:
```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def test_concurrent_checkout_no_duplicate_order_numbers():
    """Verify that concurrent checkouts produce unique order numbers."""
    # `client` and `checkout_data` come from the surrounding test setup.
    results = []
    errors = []

    def attempt_checkout():
        try:
            response = client.post("/api/checkout/", checkout_data)
            results.append(response.json()["order_number"])
        except Exception as e:
            errors.append(str(e))

    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(attempt_checkout) for _ in range(50)]
        for future in as_completed(futures):
            future.result()

    assert len(errors) == 0, f"Checkout errors: {errors}"
    assert len(results) == len(set(results)), "Duplicate order numbers detected!"
```
The fix was deployed the following Monday during a low-traffic window (7:00 AM). The database migration was applied 30 minutes before the code deployment. The new checkout flow has been running without incident for three months, with a checkout failure rate below 0.1%.
Epilogue
Three months after the incident, Marcus reflected on what he had learned. The experience fundamentally changed how he used AI coding assistants for critical code paths. He now follows a personal rule: any AI-generated code that touches financial transactions, inventory, or user data gets an additional review pass specifically for concurrency, failure modes, and edge cases.
His prompt engineering also evolved. Before the incident, he might ask: "Write a checkout endpoint that creates an order and processes payment." After the incident, he asks: "Write a checkout endpoint that creates an order and processes payment. Consider: concurrent checkouts for the same inventory, network failures during payment processing, database connection timeouts, and partial failures where payment succeeds but order creation fails. Use database-level constraints to prevent duplicates."
The difference in AI output quality between these two prompts is substantial. The second prompt produces code that handles real-world conditions. The first produces code that works in a demo.
DevOps practices — monitoring, automated rollbacks, deployment policies — are the safety net that catches the gap between demo-quality code and production-quality code. For vibe coders, these practices are not optional overhead. They are the difference between moving fast and moving fast without breaking things.