Case Study 1 — The Report That Triple-Counted Revenue

DataField.Dev

Case Study 1 — The Report That Triple-Counted Revenue

The most insidious join bug isn't a crash or a Cartesian explosion — it's fan-out: a one-to-many join quietly multiplies rows, and a later SUM counts the same value several times. The total looks plausible. It's wrong. This is the single most common analytics error in SQL, and every data professional must learn to see it.

Background

A finance analyst built a "revenue and shipping by order" report. Each order has a shipping_fee stored once on the orders row, and each order has several line items in order_items. She wanted, per order, the line-item revenue and the shipping fee, so she joined the two tables and summed both:

SELECT o.order_id,
       SUM(oi.quantity * oi.unit_price) AS revenue,
       SUM(o.shipping_fee)              AS shipping     -- ❌ trouble
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.order_id;

The revenue column was correct. The shipping column was inflated — by exactly the number of line items in each order. An order with three line items reported three times its actual shipping fee. Summed across the company, total "shipping revenue" was overstated by a factor of roughly the average basket size — and the inflated number went into a quarterly report before anyone caught it.

What went wrong: fan-out

The join orders ⋈ order_items is one-to-many. A single order row is duplicated once for each of its line items:

   orders                 order_items                 joined result
   order 1, ship=$10      (1, productA)               1, productA, ship=$10
                          (1, productB)        →       1, productB, ship=$10   ← $10 again
                          (1, productC)               1, productC, ship=$10   ← and again

order_items-level values (quantity, unit_price) are correct on each row — there's genuinely one per line item. But orders-level values (shipping_fee) are now repeated across the duplicated rows. SUM(o.shipping_fee) therefore adds $10 three times for a single $10 fee. The grain of the joined result is "one row per line item," and summing an order-level column at line-item grain over-counts it.

This is fan-out (also called the "join multiplication" or "fan trap"): when you join tables at different grains and then aggregate, values from the coarser-grained table are multiplied by the fan-out factor.

The fixes

There are three correct approaches, depending on what you need.

1. Aggregate each table to a common grain before joining (the cleanest):

WITH item_rev AS (
    SELECT order_id, SUM(quantity * unit_price) AS revenue
    FROM order_items
    GROUP BY order_id
)
SELECT o.order_id, ir.revenue, o.shipping_fee AS shipping
FROM orders o
JOIN item_rev ir ON ir.order_id = o.order_id;

Now order_items is collapsed to one row per order before the join, so the join is one-to-one and nothing is multiplied. (This uses a CTE — Chapter 11 — but the idea is clear: pre-aggregate to avoid fan-out.)

2. Use a conditional that counts each order-level value once. Because the order's shipping fee is identical across the duplicated rows, you can take it once per group:

SELECT o.order_id,
       SUM(oi.quantity * oi.unit_price) AS revenue,
       MAX(o.shipping_fee)              AS shipping    -- MAX/MIN reads it once per order
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.order_id;

MAX(o.shipping_fee) works because every row in the group has the same shipping fee, so MAX (or MIN) returns that single value rather than summing the duplicates. (It's a slightly hacky read, but a common and valid one.)

3. Don't mix grains in one query at all — report line-item revenue and order-level shipping in two separate queries, or join the pre-aggregated revenue as in fix #1.

The analysis

Every join has a grain — know it. After a join, ask: "what does one row represent now?" Here it's "one line item," not "one order." Aggregating a coarser-grained column (shipping_fee, which is per-order) at a finer grain (per-line-item) over-counts it.
Fan-out is silent. No error, no warning — just a number that's too big. The revenue column being correct made it more deceptive: half the report was right, which lent false credibility to the wrong half.
DISTINCT is not the fix. A tempting "fix" is SUM(DISTINCT o.shipping_fee) — but that's wrong too (two different orders could legitimately have the same fee, and DISTINCT would collapse them). The real fix is to aggregate at the correct grain, usually by pre-aggregating before the join.
Sanity-check totals against a simple baseline. SELECT SUM(shipping_fee) FROM orders; gives the true total in one line, with no join. Comparing the report's shipping total to that baseline would have exposed the inflation instantly. Cross-checking an aggregate against a join-free baseline is a habit worth building.
This connects to design and to aggregation. You'll meet GROUP BY formally in Chapter 7, and pre-aggregating CTEs in Chapter 11. The deeper point spans both: think about grain before you aggregate across a join.

Discussion questions

Explain "fan-out" in your own words. Why was revenue correct but shipping inflated?
After the orders ⋈ order_items join, what does one row represent? Why does that grain break SUM(shipping_fee)?
Compare the three fixes. When is the pre-aggregation (CTE) approach clearly best?
Why is SUM(DISTINCT shipping_fee) not a safe fix?
⭐ Write a one-line baseline query that would have caught the bug, and explain how you'd build "compare report to baseline" into a review habit.