Case Study 1 — The Closure Table That Made the Org Chart Instant

DataField.Dev

Case Study 1 — The Closure Table That Made the Org Chart Instant

A real trade-off between hierarchy models. A company's deeply-nested org chart made "everyone under this manager" queries slow with the adjacency-list + recursive-CTE approach. A closure table — a deliberate denormalization for hierarchies — made subtree reads instant, at a manageable write cost.

Background

An HR system modeled its org chart as an adjacency list — employees.manager_id referencing employees (Mercado's model, Chapter 11). For most operations this was perfect: simple, one update to move someone, and recursive CTEs answered hierarchy questions.

But the org was large (tens of thousands of employees) and deep (up to 12 levels), and one operation ran constantly and got slow: "give me everyone in this VP's organization" (the full subtree under a node), used on every manager's dashboard, in permission checks ("can this manager see this employee?"), and in reports. Each call ran a recursive CTE descending up to 12 levels over tens of thousands of rows — and these calls happened thousands of times a minute. The recursion, repeated at that volume, became a bottleneck.

The decision

The team weighed the four hierarchy models (this chapter):

Adjacency list (current): great writes, but repeated deep recursive reads were the problem.
Nested set: fast reads, but moving an employee (reorgs happen weekly) would renumber huge swaths of the table — unacceptable write cost for a frequently-edited tree.
Materialized path: fast subtree reads via LIKE 'path/%', but path maintenance on every move, and depth limits.
Closure table: a table of all ancestor–descendant pairs. Fast subtree reads via a simple join; moderate write cost (insert/delete pairs on change); extra storage.

Because reads vastly dominated writes and the tree changed in bounded ways (individual moves, not mass renumbering), a closure table was the right denormalization (Chapter 20) — keep the adjacency list as the source of truth, and maintain a closure table for fast reads.

The implementation

CREATE TABLE employee_tree (        -- all ancestor→descendant pairs (incl. self, depth 0)
    ancestor_id   integer NOT NULL REFERENCES employees(employee_id),
    descendant_id integer NOT NULL REFERENCES employees(employee_id),
    depth         integer NOT NULL,
    PRIMARY KEY (ancestor_id, descendant_id)
);

For each employee, employee_tree holds a row for every ancestor above them (and themselves at depth 0). Now the killer query — "everyone under VP #42" — is a trivial, indexed join, no recursion:

SELECT e.* FROM employee_tree t
JOIN employees e ON e.employee_id = t.descendant_id
WHERE t.ancestor_id = 42;          -- the entire subtree, instantly

And "is employee X under manager Y?" (a permission check) is a single-row lookup:

SELECT EXISTS (SELECT 1 FROM employee_tree WHERE ancestor_id = :y AND descendant_id = :x);

The recursive CTE went from a per-request cost to a one-time cost paid only when the tree changes: maintaining the closure table on a move (delete the moved subtree's old ancestor pairs, insert the new ones) — handled by a trigger or the reorg procedure. Since moves are far rarer than reads, the trade was overwhelmingly favorable. Dashboard and permission-check latency dropped dramatically.

The analysis

Hierarchy model is a read/write trade-off. Adjacency list is the simplest default and writes cheaply, but repeated deep subtree reads can be slow. When those reads dominate, a closure table (or materialized path) trades write cost and storage for fast reads — exactly the denormalization calculus of Chapter 20, applied to trees.
Keep the simple model as the source of truth. They didn't abandon the adjacency list — manager_id remains the canonical structure (simple to reason about and edit). The closure table is a derived read-optimization maintained from it. (Like a materialized view, but for graph reachability.)
Closure tables make both subtree reads and ancestry checks trivial joins. "Everyone under X" and "is X under Y?" — the two hot queries — became single indexed lookups, no recursion.
The cost is paid where it's cheap. Reads happen thousands of times a minute; moves happen weekly. Moving the cost from every read to every (rare) write is the whole win. Always locate the cost where your workload can afford it.
Rejecting nested sets was as important as choosing closure. Nested sets read even faster but would have made the frequent reorgs agonizing (renumbering). Matching the model to the write pattern (frequent moves) ruled it out. The right answer depends on both read and write patterns, not reads alone.

Discussion questions

Why did the adjacency-list + recursive-CTE approach become a bottleneck here specifically?
Why is a closure table considered a denormalization, and what does it cost?
Why keep manager_id as the source of truth instead of replacing it?
Why were nested sets rejected despite their fast reads?
⭐ Sketch how you'd maintain the closure table when an employee (and their subtree) is moved under a new manager.