Case Study 1 — The Page That Fired 5,000 Queries

The N+1 problem, in the wild. An admin dashboard that looked fine in development became unusable in production — not because of slow queries, but because of thousands of fast ones. The fix was one line of eager loading, and the lesson was "watch the SQL your ORM generates."

Background

An admin page listed customers, and for each showed their order count and total spend. The developer wrote clean, readable ORM code:

customers = session.scalars(select(Customer)).all()
for c in customers:
    order_count = len(c.orders)                         # access relationship
    total = sum(item.quantity * item.unit_price          # access nested relationship
                for o in c.orders for item in o.items)
    rows.append((c.first_name, c.last_name, order_count, total))

In development (a handful of test customers) it rendered instantly. In production — thousands of customers, each with orders, each order with items — the page took 30+ seconds and hammered the database. Curiously, the database's slow query log was empty: no single query was slow. Yet the page was crawling.

Diagnosis

The developer enabled SQL logging (create_engine(..., echo=True)) and watched the console as the page loaded. The output was damning:

SELECT * FROM customers;                                   -- 1 query
SELECT * FROM orders WHERE customer_id = 1;                -- then one PER customer...
SELECT * FROM orders WHERE customer_id = 2;
SELECT * FROM orders WHERE customer_id = 3;
... (×3,000)
SELECT * FROM order_items WHERE order_id = 1;              -- and one per ORDER...
SELECT * FROM order_items WHERE order_id = 2;
... (×many thousands)

There it was: the N+1 problem, doubled. The first query loaded all customers (1). Then c.orders in the loop lazily fired one query per customer (N). Then o.items fired one query per order (another, larger N). For thousands of customers and tens of thousands of orders, the page issued tens of thousands of queries — each individually fast (so nothing showed in the slow-query log), but collectively catastrophic. The bottleneck wasn't query speed; it was query count and the per-query round-trip overhead.

This is exactly why the slow-query log was empty: N+1 is death by a thousand (fast) cuts, not one slow query. You can only see it by watching the number of queries — which is why echo=True (or an APM tool) is the diagnostic.

The fix: eager loading

Tell the ORM to load the relationships up front in a fixed number of queries instead of lazily one-at-a-time:

from sqlalchemy.orm import selectinload

customers = session.scalars(
    select(Customer).options(
        selectinload(Customer.orders).selectinload(Order.items)   # load orders, then their items
    )
).all()
for c in customers:
    order_count = len(c.orders)                  # already loaded — no query
    total = sum(item.quantity * item.unit_price
                for o in c.orders for item in o.items)   # already loaded — no query
    rows.append((c.first_name, c.last_name, order_count, total))

Now the ORM issues 3 queries total, regardless of customer count: one for customers, one for all their orders (WHERE customer_id IN (...)), one for all those orders' items. The page dropped from 30+ seconds and tens of thousands of queries to well under a second and 3 queries. The Python code barely changed — only the loading strategy.

(For this aggregate — counts and totals — an even better fix is to compute it in SQL directly: a GROUP BY query, or a raw-SQL report (Case Study 2). Eager loading fixes the N+1; pushing the aggregation into the database avoids materializing all the objects at all. The team used eager loading for the object list and a separate aggregate query for the totals.)

The analysis

  1. N+1 is the ORM's signature performance bug. Convenient relationship access (c.orders) in a loop secretly fires a query per row. The code reads beautifully and performs terribly. It's the most common ORM mistake, period.

  2. It's invisible at small scale and in the slow-query log. 5 test customers = 6 fast queries (fine). 5,000 = 5,001 (disaster). And because each query is fast, the slow-query log shows nothing — the problem is query count, not query duration. You must watch the number of queries.

  3. echo=True (or APM) is the diagnostic. You cannot fix what you cannot see. Logging the generated SQL revealed the per-row queries instantly. Watching the SQL an ORM generates is a core skill — the ORM hides the SQL, so you must deliberately surface it.

  4. Eager loading is the fix. selectinload/joinedload load related data in a fixed number of queries. Choose it whenever you'll access a relationship for many parent rows. (selectinload = a second IN query; joinedload = a JOIN — pick by data shape.)

  5. For aggregates, prefer SQL over loading objects. Counting/summing by loading all objects into Python is wasteful even with eager loading; a GROUP BY (or raw SQL) computes it in the database. Don't materialize thousands of objects just to count them.

Discussion questions

  1. Why was the page slow even though no individual query was slow (and the slow-query log was empty)?
  2. Count the queries the original code issues for C customers with O orders each. Why is that catastrophic?
  3. How did echo=True reveal the problem? Why is watching generated SQL essential with an ORM?
  4. How does selectinload reduce thousands of queries to a constant few?
  5. ⭐ For the order-count/total specifically, why might a GROUP BY query be even better than eager-loading all the objects?