Case Study 1 — The UNION That Ate Real Transactions

UNION silently removes duplicate rows. When two genuinely distinct events happen to produce identical-looking rows, UNION merges them into one — and a total comes out low. The fix is one word: ALL.

Background

A payments team built a daily "all settlements" report by stacking two sources: settlements from the card processor and settlements from the bank-transfer system. Each source produced rows of (settlement_date, amount, currency). The report combined them:

-- Daily settlements from both sources   ❌ UNION drops duplicate rows
SELECT settlement_date, amount, currency FROM card_settlements
UNION
SELECT settlement_date, amount, currency FROM bank_settlements;

For months the totals were slightly low — never dramatically, just enough that the report never quite matched the finance ledger. Reconciliation was a recurring headache, and people assumed it was rounding or timing. It wasn't.

What went wrong

UNION removes duplicate rows. But two different settlements can legitimately produce an identical row: two customers each paying exactly $49.99 USD on the same day generates two rows of (2024-06-15, 49.99, USD) — which are genuinely two settlements, not a duplicate. UNION saw them as the same row and collapsed them into one. Every time two real settlements happened to share a date, amount, and currency, the report lost one of them.

The error was small and data-dependent (it only struck when collisions occurred), which is exactly why it was so hard to pin down: the discrepancy varied day to day with no obvious pattern. The rows the report dropped were real money.

The fix is to keep every row with UNION ALL:

-- Keep every settlement, even identical-looking ones
SELECT settlement_date, amount, currency FROM card_settlements
UNION ALL
SELECT settlement_date, amount, currency FROM bank_settlements;

Now identical-looking-but-distinct settlements are both retained, and the totals match the ledger. Even better, the team added a unique identifier to each source row so rows are distinguishable, removing any ambiguity:

SELECT 'card' AS source, settlement_id, settlement_date, amount, currency FROM card_settlements
UNION ALL
SELECT 'bank' AS source, settlement_id, settlement_date, amount, currency FROM bank_settlements;

With a source tag and an id, every row is unique by construction — and you could even safely use UNION now without loss, though UNION ALL remains correct and faster.

The analysis

  1. UNION deduplicates on the entire row, not on identity. If your rows don't include something that distinguishes genuinely different records, UNION will merge real, distinct data. It can't know that two identical rows represent two different settlements.

  2. UNION ALL is the correct default for "combine these lists." When you're concatenating sources and want every record, use UNION ALL. Use UNION only when you specifically want to remove duplicates and "duplicate row" truly means "the same thing."

  3. Carry an identifier when combining sources. Including a primary key (and often a source tag) makes rows distinguishable, eliminates accidental collisions, and makes the combined data traceable back to its origin. This is good practice whenever you stack heterogeneous sources.

  4. Data-dependent bugs are the hardest to catch. The error only appeared when two settlements collided on all columns, so the size of the discrepancy was effectively random. "Sometimes a little off" is a classic signature of an accidental dedup — keep it on your list of suspects.

  5. Reconcile against an independent total. SELECT SUM(amount) FROM card_settlements + SELECT SUM(amount) FROM bank_settlements should equal the report's total. The mismatch was visible all along in that comparison; making it an automated check would have caught the leak immediately.

Discussion questions

  1. Explain how two distinct settlements could be collapsed by UNION. What exactly does UNION deduplicate on?
  2. Why does UNION ALL fix the total? What does adding a source tag and id additionally buy you?
  3. Why was the discrepancy small and irregular rather than a fixed amount?
  4. State the rule: when do you want UNION, and when UNION ALL?
  5. ⭐ Describe a case where UNION (dedup) is the correct choice, and why the rows there can't have the collision problem.