Case Study 2 — The Org Chart That Never Finished

Recursion's great power comes with a sharp edge: if the data contains a cycle, a recursive CTE runs forever. A reporting job that worked for years hung indefinitely one night — because a single bad manager_id update turned the org chart into a loop.

Background

A company ran a nightly recursive query to build its employee org chart (everyone under the CEO, with reporting levels) for an internal dashboard. The query was the textbook shape:

WITH RECURSIVE org AS (
    SELECT employee_id, manager_id, 1 AS level
    FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.manager_id, org.level + 1
    FROM employees e
    JOIN org ON e.manager_id = org.employee_id
)
SELECT * FROM org;

It ran in under a second every night — until the night it didn't. The job hung, consumed CPU, and eventually had to be killed. The dashboard went stale, and on-call spent hours figuring out why a query that had "never been touched" suddenly looped forever.

What went wrong: a cycle in the data

That afternoon, an HR admin had reorganized a department and, through a series of edits, accidentally created a cycle: employee A's manager was set to B, and B's manager was set to A (directly, or through a longer chain A→B→C→A). The org "tree" was no longer a tree — it had a loop.

A recursive CTE assumes the structure terminates: each pass adds the next level, and eventually a pass adds nothing, so it stops. But with a cycle, the recursion keeps finding "new" rows forever — A produces B, B produces A, A produces B again — incrementing level endlessly. With UNION ALL (no dedup), nothing ever signals "we've been here before," so the query never terminates. It wasn't slow; it was non-terminating.

The fixes

1. Immediate: cycle detection in the query. PostgreSQL's CYCLE clause stops the recursion when it revisits a node:

WITH RECURSIVE org AS (
    SELECT employee_id, manager_id, 1 AS level
    FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.manager_id, org.level + 1
    FROM employees e
    JOIN org ON e.manager_id = org.employee_id
) CYCLE employee_id SET is_cycle USING path
SELECT * FROM org WHERE NOT is_cycle;

CYCLE employee_id SET is_cycle USING path tracks the path of visited employee_ids; when a node would repeat, it marks is_cycle = true and stops descending that branch. The query now terminates even on bad data, and you can even report which rows are part of a cycle (a useful data-quality signal). On older PostgreSQL, the same is done manually by carrying an array path and adding WHERE NOT e.employee_id = ANY(path).

2. A depth-limit safety net. Independently, a WHERE level < 100 guard ensures no recursion can run away, regardless of cause — cheap insurance for any recursive query over real-world data.

3. The real fix: prevent the cycle in the data. A cycle in an org chart is invalid by definition — nobody can be their own (transitive) manager. The robust solution is to stop the bad state from being stored at all. The team added validation (a trigger / application-level check, since a simple CHECK constraint can't see other rows) that rejects any manager_id update which would create a cycle. Now the data can't loop, and the query is safe by construction.

The analysis

  1. Recursion assumes termination; cycles break that assumption. A recursive CTE over a structure that's supposed to be acyclic (a tree/DAG) will loop forever if the data accidentally forms a cycle. The query was correct; the data violated an invariant the query relied on.

  2. UNION ALL won't save you from cycles. Because it doesn't deduplicate, it can't notice that it's revisiting a node. You need explicit cycle detection (the CYCLE clause or a path array) — or a depth cap.

  3. Defend at two layers. Guard the query (CYCLE clause / depth limit) so a bad row can't hang your job, and guard the data (validation preventing cycles) so the bad state never exists. Query guards keep you running; data guards keep you correct. (This is the recurring theme: integrity belongs in the database/design, not just in each query.)

  4. A "never touched" query can still break. The query hadn't changed in years — but it depended on a data invariant that a routine HR edit silently violated. Code is only as safe as the assumptions it makes about its data; making those assumptions enforced is what prevents the 2 a.m. page.

  5. Cycle detection is also a data-quality tool. Selecting the rows where is_cycle is true tells you exactly which records form the loop — turning "the job hangs" into "employees 14 and 27 reference each other; fix the data."

Discussion questions

  1. Explain precisely why a cycle makes the recursive CTE non-terminating, and why UNION ALL doesn't help.
  2. What does the CYCLE clause do? How does the manual path-array approach achieve the same thing?
  3. Why add both a query-level guard and a data-level validation? What does each protect against?
  4. A simple CHECK constraint can't prevent this cycle. Why not, and what mechanism can?
  5. ⭐ How could you turn cycle detection into a proactive data-quality report that runs before the org-chart job?