Case Study 1 — The Anti-Join That Silently Returned Nothing

The NOT IN-with-NULL trap, in production. A nightly job that found "inactive accounts" worked for two years, then one day quietly stopped finding any — because a single NULL slipped into the subquery's result.

Background

A SaaS company ran a nightly job to identify accounts with no activity, so it could send re-engagement emails and eventually flag them for archival. The query found accounts that had no row in the activity_log:

-- "Accounts with no logged activity"   ❌ NULL-fragile
SELECT account_id, email
FROM accounts
WHERE account_id NOT IN (SELECT account_id FROM activity_log);

For two years it worked. Then, after a release, the job started returning zero inactive accounts every night. No error, no alert — the job "succeeded" and simply emailed no one. The re-engagement program silently went dark for six weeks before anyone noticed the email volume had flatlined.

What went wrong

The release added a new code path that occasionally inserted an activity_log row with a NULL account_id (an anonymous, pre-login event). That single NULL in the subquery's result was enough to break NOT IN completely.

Here's the mechanism (three-valued logic, Chapters 3 and 5). account_id NOT IN (1, 2, NULL) is evaluated as:

   account_id <> 1  AND  account_id <> 2  AND  account_id <> NULL
                                                └──────────────┘
                                              this is always NULL (unknown)

Because account_id <> NULL is NULL (never true), the whole AND chain can never be true for any account — at best it's NULL, which WHERE treats as "exclude." So the moment the subquery contains a single NULL, NOT IN returns no rows at all, regardless of the data. The "find inactive accounts" query became "find nothing."

The fix

Switch to NOT EXISTS, which is NULL-safe:

-- NULL-safe anti-join
SELECT a.account_id, a.email
FROM accounts a
WHERE NOT EXISTS (
    SELECT 1 FROM activity_log l WHERE l.account_id = a.account_id
);

NOT EXISTS asks, per account, "does any activity row match this account_id?" A row with a NULL account_id simply doesn't match any real account, so it's harmlessly ignored — it can't poison the entire result the way it does inside NOT IN. The job immediately started finding inactive accounts again.

Two further hardening steps:

  • Fix the data at the source. A NULL account_id in activity_log was itself a bug; the column was made NOT NULL (with a sentinel "anonymous" account for pre-login events), so the situation can't recur.
  • Add a sanity assertion. The job now logs the count of inactive accounts and alerts if it drops to zero (or changes by more than, say, 90% night-over-night) — so a silent "found nothing" can never again go unnoticed for weeks.

The analysis

  1. NOT IN + a NULL in the subquery = no rows. This is the single most dangerous beginner-and-expert SQL trap, because the query is correct for years until a NULL appears, then fails silently and totally. Treat NOT IN (subquery) as a code smell.

  2. Prefer NOT EXISTS for anti-joins. It's NULL-safe, it short-circuits, and it usually optimizes well. Make it your default for every "X with no matching Y."

  3. NULL-safety is a property you must design for. The deeper fix wasn't just the query — it was making the column NOT NULL so the bad value couldn't exist. Constraints (Chapter 3) prevent the data states that break queries.

  4. Silent "success" is worse than failure. The job didn't crash; it ran clean and did nothing. Jobs that produce no output on a normally-nonempty task should assert against that — "found zero" is often a bug, not a result. Build that check in.

  5. Test queries against NULL-containing data. The bug would have surfaced instantly in a test where the subquery's source contained a NULL. Anti-join queries deserve a "what if there's a NULL in here?" test case.

Discussion questions

  1. Walk through why one NULL in the subquery makes NOT IN return zero rows for every account.
  2. Why is NOT EXISTS immune to this problem?
  3. The team also made account_id NOT NULL. Why fix the data and the query, rather than just one?
  4. The job ran "successfully" for six weeks while doing nothing. What monitoring would catch this class of silent failure?
  5. ⭐ Are there cases where NOT IN is fine? State the precise condition under which NOT IN (subquery) is safe.