Case Study 2 — The KPI That Counted Line Items as Orders
Chapter 6 showed how a one-to-many join multiplies rows (fan-out). This case shows the aggregation-side consequence: a
COUNT(*)over a fanned-out join that quietly counted line items as orders, and inflated a board-level KPI for months.
Background
An e-commerce team reported "total orders" and "orders per day" on an executive dashboard. The underlying query joined orders to order_items (to also compute revenue in the same query) and counted:
-- "Orders per day" ❌
SELECT o.order_date::date AS day,
COUNT(*) AS orders,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.order_date::date
ORDER BY day;
The revenue numbers were fine. The "orders" numbers were inflated — by the average number of line items per order (about 2.6×). The dashboard proudly showed roughly 2.6 times more orders than the company actually received. Growth looked spectacular; targets were set against the inflated baseline; and a quarter later, when the numbers were reconciled against the order-management system, the correction was awkward and public.
What went wrong
The join orders ⋈ order_items produces one row per line item, not per order. An order with three items contributes three rows. COUNT(*) then counts those rows — so it counts line items, which it labels "orders":
order 42 has 3 items → 3 joined rows → COUNT(*) adds 3, not 1
The revenue SUM was correct precisely because it operates at line-item grain (each line item genuinely contributes its own revenue). But COUNT(*) of orders does not belong at line-item grain — it needs to count each order once. Mixing an order-grain metric (COUNT of orders) and a line-item-grain metric (SUM of revenue) in one fanned-out query is the root error.
The fixes
1. Count distinct orders:
SELECT o.order_date::date AS day,
COUNT(DISTINCT o.order_id) AS orders, -- each order once
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.order_date::date
ORDER BY day;
COUNT(DISTINCT o.order_id) counts each order a single time regardless of how many line items it has. Quick, correct, and a one-word change.
2. Aggregate at separate grains (cleaner for more metrics). Compute order-grain and item-grain metrics in separate subqueries, then combine:
WITH daily_orders AS (
SELECT order_date::date AS day, COUNT(*) AS orders
FROM orders GROUP BY order_date::date
),
daily_revenue AS (
SELECT o.order_date::date AS day, SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.order_date::date
)
SELECT d.day, d.orders, COALESCE(r.revenue, 0) AS revenue
FROM daily_orders d
LEFT JOIN daily_revenue r ON r.day = d.day
ORDER BY d.day;
Here orders is counted from the orders table at its natural grain (no fan-out at all), and revenue is summed separately. This scales cleanly when you add more metrics at different grains.
The analysis
-
COUNT(*)counts rows at the current grain. After a one-to-many join, the grain is the "many" side. If your metric is about the "one" side (orders, customers, products),COUNT(*)over the joined result over-counts it. UseCOUNT(DISTINCT key)or count from the un-joined table. -
Be suspicious of mixing grains in one query. Revenue (per line item) and order count (per order) live at different grains. Computing both over a single fanned-out join makes at least one of them wrong. Either use
DISTINCTfor the coarser metric or separate the aggregations. -
Correct-looking halves hide wrong halves. As in Chapter 6's revenue case, the right revenue number lent false credibility to the wrong order count. A partially-correct report is more dangerous than an obviously broken one.
-
Validate against a join-free baseline.
SELECT COUNT(*) FROM orders;gives the true order count in one line. Comparing the dashboard's total to that baseline would have caught the 2.6× inflation on day one. Always cross-check a grouped count against the simple count. -
Define KPIs with their grain. "Orders per day" should be specified as "distinct orders with
order_datein the day." Writing the grain into the metric definition makes the correct query obvious and the wrong one easy to reject in review.
Discussion questions
- Explain precisely why
COUNT(*)overorders ⋈ order_itemscounts line items, not orders. - Why was the revenue
SUMcorrect in the same query that got the order count wrong? - Compare the two fixes. When is the separate-aggregations (CTE) approach worth the extra code?
- Write the one-line baseline query that would have caught the inflation immediately.
- ⭐ Generalize the rule: before adding any
COUNT(*)orSUM(...)to a query with joins, what two questions should you ask? (Recall Chapter 6's "grain" and "no-match rows.")