Case Study 1: Optimizing a Slow API Endpoint

From 2-Second Response to 50ms Through Systematic Profiling and Optimization


Background

DataVault is a mid-stage startup building an analytics platform for e-commerce merchants. Their dashboard API endpoint, /api/dashboard, aggregates sales data, customer metrics, and inventory alerts for a merchant's store. After launching with 50 merchants, the endpoint performed adequately — responses returned in about 300ms. But as the platform grew to 500 merchants with larger product catalogs and more transaction history, the endpoint's response time crept up to 2.1 seconds at the P50 level, with P95 reaching 4.8 seconds.

The product team received complaints from merchants: "The dashboard takes forever to load." The engineering team of three developers needed to fix it quickly. They had no dedicated performance engineer, so they decided to use a systematic measurement-driven approach augmented by AI-assisted analysis.

The Application Architecture

The application was a Flask API with a PostgreSQL database, deployed with Gunicorn (4 workers) behind an Nginx reverse proxy. The dashboard endpoint looked like this:

@app.route("/api/dashboard/<int:merchant_id>")
def get_dashboard(merchant_id: int):
    """Generate merchant dashboard data."""
    merchant = Merchant.query.get_or_404(merchant_id)

    # Sales metrics
    orders = Order.query.filter_by(merchant_id=merchant_id).all()
    total_revenue = sum(o.total_amount for o in orders)
    orders_today = [o for o in orders if o.created_at.date() == date.today()]
    revenue_today = sum(o.total_amount for o in orders_today)

    # Product performance
    products = Product.query.filter_by(merchant_id=merchant_id).all()
    product_metrics = []
    for product in products:
        reviews = Review.query.filter_by(product_id=product.id).all()
        order_items = OrderItem.query.filter_by(product_id=product.id).all()
        product_metrics.append({
            "id": product.id,
            "name": product.name,
            "avg_rating": (sum(r.rating for r in reviews) / len(reviews))
                         if reviews else 0,
            "total_sold": sum(oi.quantity for oi in order_items),
            "revenue": sum(oi.quantity * oi.unit_price for oi in order_items),
        })

    # Customer overview
    customers = Customer.query.filter_by(merchant_id=merchant_id).all()
    customer_data = []
    for customer in customers:
        customer_orders = Order.query.filter_by(
            customer_id=customer.id, merchant_id=merchant_id
        ).all()
        customer_data.append({
            "id": customer.id,
            "name": customer.name,
            "order_count": len(customer_orders),
            "total_spent": sum(o.total_amount for o in customer_orders),
        })

    # Low inventory alerts
    low_stock = [p for p in products if p.stock_quantity < p.reorder_threshold]

    return jsonify({
        "merchant": merchant.name,
        "total_revenue": total_revenue,
        "revenue_today": revenue_today,
        "orders_today": len(orders_today),
        "products": sorted(product_metrics,
                          key=lambda p: p["revenue"], reverse=True)[:20],
        "top_customers": sorted(customer_data,
                               key=lambda c: c["total_spent"],
                               reverse=True)[:10],
        "low_stock_alerts": [{"name": p.name, "stock": p.stock_quantity}
                            for p in low_stock],
    })

The database had these approximate row counts for a typical mid-size merchant:

  • Orders: 15,000 rows
  • Products: 200 rows
  • Reviews: 8,000 rows
  • OrderItems: 45,000 rows
  • Customers: 3,000 rows

Phase 1: Measurement and Profiling

Step 1: Baseline Measurement

The team first established a reliable baseline. They used a simple timing decorator and ran 50 requests against a staging environment with production-like data:

import time
import statistics

import requests  # BASE_URL below points at the staging host

response_times = []
for _ in range(50):
    start = time.perf_counter()
    response = requests.get(f"{BASE_URL}/api/dashboard/42")
    elapsed = (time.perf_counter() - start) * 1000
    response_times.append(elapsed)

sorted_times = sorted(response_times)
print(f"P50: {statistics.median(response_times):.0f}ms")
print(f"P95: {sorted_times[47]:.0f}ms")  # 48th of the 50 sorted samples
print(f"P99: {sorted_times[49]:.0f}ms")  # the slowest of the 50 samples
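The hardcoded indices only hold for exactly 50 samples (and the last index is simply the maximum). A sketch of a sample-size-independent version using the stdlib, shown here with illustrative timings rather than the team's real data:

```python
import statistics

def percentile(samples: list[float], pct: int) -> float:
    """Interpolated percentile via the stdlib (Python 3.8+)."""
    # quantiles(n=100) returns the 99 cut points P1..P99
    return statistics.quantiles(samples, n=100)[pct - 1]

times = [100.0, 200.0, 300.0, 400.0, 500.0]
p50 = percentile(times, 50)  # 300.0 for this sample
```

This interpolates between samples instead of picking a fixed slot, so the same code works whether the run collects 50 or 5,000 requests.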

Results:

  • P50: 2,134ms
  • P95: 4,823ms
  • P99: 6,102ms

Step 2: cProfile Analysis

The team wrapped the endpoint handler in cProfile:

import cProfile
import pstats
from io import StringIO

def profile_dashboard():
    profiler = cProfile.Profile()
    profiler.enable()
    with app.test_request_context():
        get_dashboard(42)
    profiler.disable()

    stream = StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative")
    stats.print_stats(25)
    return stream.getvalue()

The top lines of the profiling output revealed the bottleneck structure:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.008    0.008    2.147    2.147 views.py:12(get_dashboard)
     3201    0.045    0.000    1.423    0.000 query.py:298(all)
     3201    1.156    0.000    1.312    0.000 cursor.py:178(execute)
      200    0.002    0.000    0.534    0.003 query.py:298(all)  [reviews]
      200    0.001    0.000    0.489    0.002 query.py:298(all)  [order_items]
    15000    0.089    0.000    0.089    0.000 {sum}
     3000    0.003    0.000    0.378    0.000 query.py:298(all)  [cust orders]

The team shared this profiling output with their AI assistant and asked: "Identify the performance bottlenecks and their root causes."

Step 3: AI-Assisted Analysis

The AI assistant identified three critical issues:

  1. N+1 queries on products (534ms + 489ms = 1,023ms): For each of 200 products, the code issued separate queries for reviews and order items. That was 400 extra queries.

  2. N+1 queries on customers (378ms): For each of 3,000 customers, the code issued a query for their orders. That was 3,000 extra queries.

  3. Loading all orders into Python (~200ms for query + ~89ms for in-memory aggregation): All 15,000 orders were loaded into memory just to sum amounts and filter by date. SQL aggregation would be far more efficient.

Total identified waste: approximately 1,690ms out of 2,147ms, representing 79% of execution time.
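As a quick sanity check, the 79% figure follows directly from the three numbers above:

```python
# Identified waste, in ms, taken from the profile analysis above
product_n_plus_1 = 534 + 489   # per-product review + order-item queries
customer_n_plus_1 = 378        # per-customer order queries
order_loading = 289            # ~200ms query + ~89ms in-memory summing

total_waste = product_n_plus_1 + customer_n_plus_1 + order_loading  # 1690
share = round(total_waste / 2147 * 100)                             # 79 (%)
```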

Phase 2: Systematic Optimization

Optimization 1: Replace N+1 Product Queries with JOINs

Instead of querying reviews and order items for each product individually, the team used SQL aggregation:

review_stats = db.session.query(
    Review.product_id,
    func.avg(Review.rating).label("avg_rating"),
).group_by(Review.product_id).subquery()

sales_stats = db.session.query(
    OrderItem.product_id,
    func.sum(OrderItem.quantity).label("total_sold"),
    func.sum(OrderItem.quantity * OrderItem.unit_price).label("revenue"),
).group_by(OrderItem.product_id).subquery()

product_metrics = db.session.query(
    Product.id,
    Product.name,
    func.coalesce(review_stats.c.avg_rating, 0).label("avg_rating"),
    func.coalesce(sales_stats.c.total_sold, 0).label("total_sold"),
    func.coalesce(sales_stats.c.revenue, 0).label("revenue"),
).outerjoin(
    review_stats, review_stats.c.product_id == Product.id
).outerjoin(
    sales_stats, sales_stats.c.product_id == Product.id
).filter(
    Product.merchant_id == merchant_id
).order_by(
    desc("revenue")
).limit(20).all()

Result: 400 queries replaced with 1. Product metrics section dropped from 1,023ms to 45ms.
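One caveat worth flagging: outer-joining the reviews and order-items tables directly in a single statement would produce a cross product per product (every review row paired with every order-item row), inflating the SUM aggregates while leaving the AVG untouched, which is why per-table aggregation is needed before the join. A minimal pure-Python illustration with hypothetical numbers:

```python
# Hypothetical product: 2 reviews, 3 order items
reviews = [5, 4]                          # ratings
items = [(2, 10.0), (1, 5.0), (3, 2.0)]   # (quantity, unit_price)

# A naive double JOIN yields the cross product: 2 * 3 = 6 rows
rows = [(r, q, p) for r in reviews for (q, p) in items]

inflated_sold = sum(q for _, q, _ in rows)  # 12: doubled by the 2 reviews
correct_sold = sum(q for q, _ in items)     # 6: the true total
avg_rating = sum(r for r, _, _ in rows) / len(rows)  # 4.5: AVG survives,
# because every rating is duplicated by the same factor
```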

Optimization 2: Replace N+1 Customer Queries with Aggregation

top_customers = db.session.query(
    Customer.id,
    Customer.name,
    func.count(Order.id).label("order_count"),
    func.coalesce(func.sum(Order.total_amount), 0).label("total_spent"),
).outerjoin(
    Order, and_(
        Order.customer_id == Customer.id,
        Order.merchant_id == merchant_id,
    )
).filter(
    Customer.merchant_id == merchant_id
).group_by(
    Customer.id, Customer.name
).order_by(
    desc("total_spent")
).limit(10).all()

Result: 3,000 queries replaced with 1. Customer section dropped from 378ms to 22ms.

Optimization 3: SQL Aggregation for Revenue Metrics

Instead of loading all 15,000 orders into Python:

revenue_stats = db.session.query(
    func.sum(Order.total_amount).label("total_revenue"),
    func.sum(
        case(
            (func.date(Order.created_at) == date.today(), Order.total_amount),
            else_=0,
        )
    ).label("revenue_today"),
    func.sum(
        case(
            (func.date(Order.created_at) == date.today(), 1),
            else_=0,
        )
    ).label("orders_today"),
).filter(
    Order.merchant_id == merchant_id
).first()

Result: Loading 15,000 rows replaced with a single aggregate query. Revenue section dropped from 289ms to 8ms.
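The CASE expressions perform conditional aggregation: a row contributes its amount (or a count of 1) only when its date matches today. A pure-Python mirror of what the query computes, on illustrative sample data:

```python
from datetime import date, timedelta

today = date.today()
# Hypothetical orders: (created_date, total_amount)
orders = [
    (today, 120.0),
    (today, 80.0),
    (today - timedelta(days=1), 50.0),
]

total_revenue = sum(amount for _, amount in orders)                # 250.0
revenue_today = sum(amount for d, amount in orders if d == today)  # 200.0
orders_today = sum(1 for d, _ in orders if d == today)             # 2
```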

Intermediate Measurement

After the three query optimizations, the team re-measured:

  • P50: 187ms (down from 2,134ms — 11.4x improvement)
  • P95: 312ms (down from 4,823ms)
  • P99: 498ms (down from 6,102ms)

This was already a huge improvement, but the team wanted to push further.

Optimization 4: Add Database Indexes

Running EXPLAIN ANALYZE on the new queries revealed sequential scans on the unindexed foreign-key columns, most notably orders.merchant_id and order_items.product_id. The team added an index for each filter and join column:

CREATE INDEX idx_orders_merchant_id ON orders(merchant_id);
CREATE INDEX idx_orders_customer_merchant ON orders(customer_id, merchant_id);
CREATE INDEX idx_order_items_product_id ON order_items(product_id);
CREATE INDEX idx_reviews_product_id ON reviews(product_id);
CREATE INDEX idx_products_merchant_id ON products(merchant_id);
CREATE INDEX idx_customers_merchant_id ON customers(merchant_id);

Result: Query execution time dropped by approximately 60%. P50 went from 187ms to 78ms.

Optimization 5: Add Response Caching

Dashboard data does not need to be real-time — a few minutes of staleness is acceptable. The team added Redis caching:

import json
import redis

redis_client = redis.Redis(host="localhost", port=6379, db=0)
DASHBOARD_CACHE_TTL = 120  # 2 minutes

@app.route("/api/dashboard/<int:merchant_id>")
def get_dashboard(merchant_id: int):
    cache_key = f"dashboard:{merchant_id}"

    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return jsonify(json.loads(cached))

    # Generate dashboard (the optimized version)
    dashboard_data = generate_dashboard_data(merchant_id)

    # Cache the result
    redis_client.setex(cache_key, DASHBOARD_CACHE_TTL,
                       json.dumps(dashboard_data, default=str))

    return jsonify(dashboard_data)

Result: Cache hits serve in 2-5ms. Cache misses take 78ms. With a 2-minute TTL and merchants checking their dashboard every few minutes, the cache hit rate was approximately 85%.

Phase 3: Final Results and Validation

Final Performance Numbers

After all five optimizations, the team ran a comprehensive load test with 100 concurrent users:

Metric                 Before      After (Cache Miss)   After (Cache Hit)
P50                    2,134ms     78ms                 3ms
P95                    4,823ms     142ms                8ms
P99                    6,102ms     234ms                15ms
Queries per request    ~3,600      5                    0
DB load (CPU)          89%         12%                  <1%

Effective P50 considering 85% cache hit rate: approximately 14ms (0.85 * 3 + 0.15 * 78).
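Spelled out, that effective figure is just the hit-rate-weighted average of the two paths:

```python
hit_rate = 0.85
hit_ms, miss_ms = 3, 78  # cache-hit and cache-miss P50

effective_p50 = hit_rate * hit_ms + (1 - hit_rate) * miss_ms
# 0.85 * 3 + 0.15 * 78 = 14.25, i.e. approximately 14ms
```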

Query Count Breakdown

Component              Before                  After
Merchant lookup        1                       1
Revenue aggregation    1 (loading 15K rows)    1 (aggregate query)
Product metrics        401 (1 + 200 * 2)       1 (single query with JOINs)
Customer metrics       3,001 (1 + 3,000)       1 (single query with JOINs)
Low stock check        ~200                    1 (filter in product query)
Total                  ~3,604                  5

Lessons Learned

1. Measurement Was Essential

The team's initial assumption was that "the database is slow." Profiling revealed that the database was fast per-query — the problem was making 3,600 queries instead of 5. Without profiling, they might have spent time optimizing database server configuration instead of fixing the actual bottleneck.

2. The Biggest Wins Came from the Simplest Changes

Replacing N+1 queries with JOINs was a conceptual change, not a technically complex one. Adding indexes required no application code changes at all. These two optimizations — taking perhaps 2 hours of work — delivered a 27x improvement.

3. Caching Was the Final Polish, Not the First Solution

If the team had started with caching, they would have masked the underlying problem. Cache misses would still take 2+ seconds, causing poor experience for the first visitor after cache expiration. By optimizing the queries first, cache misses became fast (78ms), and caching provided additional benefit on top of an already-fast endpoint.

4. AI Analysis Accelerated the Process

Sharing the cProfile output with an AI assistant immediately identified the N+1 pattern and suggested specific solutions. This saved the team from spending hours manually tracing through the profiling data and researching optimization strategies. The AI prompt was specific — it included the profiling data, the data volumes, and the desired outcome — which produced actionable suggestions.

5. Optimization Is Iterative

Each optimization shifted the bottleneck. After fixing queries, indexes became the next target. After indexes, caching was the next opportunity. The team ran the full profiling loop five times. Each iteration was informed by fresh measurements, not assumptions carried over from the previous iteration.

Code: The Final Optimized Endpoint

@app.route("/api/dashboard/<int:merchant_id>")
def get_dashboard(merchant_id: int):
    """Generate merchant dashboard with caching and optimized queries."""
    cache_key = f"dashboard:{merchant_id}"
    cached = redis_client.get(cache_key)
    if cached:
        return jsonify(json.loads(cached))

    merchant = Merchant.query.get_or_404(merchant_id)
    dashboard_data = {
        "merchant": merchant.name,
        **get_revenue_metrics(merchant_id),
        "products": get_top_products(merchant_id),
        "top_customers": get_top_customers(merchant_id),
        "low_stock_alerts": get_low_stock_alerts(merchant_id),
    }

    redis_client.setex(cache_key, 120, json.dumps(dashboard_data, default=str))
    return jsonify(dashboard_data)

Each helper function executes a single optimized query, and the entire endpoint completes in 5 queries on a cache miss. The response time dropped from 2,134ms to an effective 14ms — a 152x improvement — through systematic profiling, targeted optimization, and strategic caching.