Case Study: Analyzing a Build System's Dependency Graph with Closures

DataField.Dev

Case Study: Analyzing a Build System's Dependency Graph with Closures

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — Leslie Lamport

Executive Summary

A continuous-integration pipeline is taking longer and longer, and twice this month a build has hung. You are handed the project's dependency manifest — a list of "module X directly depends on module Y" pairs — and asked three questions a senior engineer needs answered before lunch: Which modules does the release artifact actually pull in (directly or transitively)? Is there a circular dependency that could explain the hang? Which modules are dead weight that nothing depends on? Every one of those is a relation question, and the chapter's machinery answers them exactly. You will model the manifest as a relation, compute its transitive closure to get reachability, read cycles straight off the closure's diagonal, and use projections to find unused modules — without writing a single graph algorithm from scratch.

This case study is analysis-heavy: you are handed a system and you interrogate it. (Case Study 2 is the opposite — you'll build a feature.) The point is that once you see the manifest as a relation, the answers are computations you already know how to do.

Skills applied

Translating a real artifact (a dependency manifest) into a relation on a set (§12.1).
Computing the transitive closure to answer reachability (§12.5).
Detecting cycles by inspecting the closure's self-loops — a property test (§12.3, §12.5).
Using projection and set difference to find unused modules (§12.6, Ch. 8).
Hand-deriving expected output and reconciling it with the math.

Background

The scenario

The team's build tool reads a manifest like this (each line is "this module needs that module first"):

release  -> auth
release  -> api
api      -> auth
api      -> db
auth     -> crypto
db       -> crypto
crypto   -> logging
metrics  -> logging

There is also a module legacy_cache that appears nowhere in the manifest — it ships in the repo but no line mentions it. Eight "needs" lines, and a feeling that something is wrong. Stripped of formatting, the manifest is a relation $D$ on the set of modules: $(X, Y) \in D$ means "$X$ directly depends on $Y$." We will treat the manifest as the source of truth and ask the math to summarize it.

🔗 Connection. This is §12.1 in the wild: a flat list of pairs is a relation, full stop. The build tool's manifest, a database's foreign-key list, an import graph — all the same shape. Recognizing the shape is what lets us reuse one toolbox. (Chapter 27 will call this exact object a directed graph; here we work with it purely as a relation.)

Why this matters

Build correctness is a relation problem dressed in operational clothing. "Will the release build pull in a vulnerable version of logging?" is a reachability question. "Why does the build never terminate?" is often a cycle question — a circular dependency that sends a naive builder into an infinite loop. "Can we delete legacy_cache?" is a question about whether anything projects onto it. Getting these wrong costs real time and real outages; getting them right is a ten-minute analysis if you have the chapter's tools.

Phase 1: Model the manifest as a relation

First, capture the manifest as a Python set of pairs and pin down the underlying set of modules.

# The dependency relation D: (X, Y) means "X directly depends on Y".
D = {
    ("release", "auth"), ("release", "api"),
    ("api", "auth"),     ("api", "db"),
    ("auth", "crypto"),  ("db", "crypto"),
    ("crypto", "logging"), ("metrics", "logging"),
}

# The underlying set of modules includes legacy_cache, which appears in no pair.
modules = {"release", "api", "auth", "db", "crypto", "logging", "metrics", "legacy_cache"}

print(len(D), "direct-dependency pairs")
print(len(modules), "modules total")
# Expected output:
# 8 direct-dependency pairs
# 8 modules total

Note carefully: the relation $D$ has 8 pairs, and the set modules has 8 elements — but those two 8s are unrelated coincidences. legacy_cache is in modules yet in no pair of $D$, which is exactly the situation that lets a module be "dead weight." Keeping the domain set separate from the relation is the same discipline the chapter insisted on for reflexivity testing (you cannot check "every element relates to itself" without knowing what "every element" means).

🔄 Check Your Understanding Before computing anything: name every module that release depends on directly (one step in $D$), and every module that crypto depends on directly.

Answer

release depends directly on auth and api (the two pairs with release first). crypto depends directly only on logging. Everything else is indirect — which is precisely what the transitive closure will surface.

Phase 2: Compute reachability via transitive closure

"What does the release artifact actually pull in?" means: which modules are reachable from release by following dependencies any number of steps? That is the transitive closure of $D$, restricted to pairs starting at release. We reuse the chapter's closure_transitive verbatim.

def closure_transitive(R):
    """Smallest transitive relation containing R (reachability via paths)."""
    S = set(R)
    while True:
        new = {(a, c) for (a, b) in S for (x, c) in S if b == x}
        if new <= S:                 # a full pass added nothing -> stable
            return S
        S |= new

reach = closure_transitive(D)
pulled_in = {y for (x, y) in reach if x == "release"}
print(sorted(pulled_in))
# Expected output:
# ['api', 'auth', 'crypto', 'db', 'logging']

Hand-derive it to be sure. Start from release's direct dependencies and walk:

Direct: release -> auth, release -> api.
From auth: auth -> crypto, so release reaches crypto.
From api: api -> auth (already have), api -> db, so release reaches db.
From crypto: crypto -> logging, so release reaches logging.
From db: db -> crypto (already have).

So release transitively reaches exactly $\{$auth, api, crypto, db, logging$\}$ — five modules, matching the printed set. Notice metrics is not in the list: nothing on the release path depends on metrics, even though metrics exists and depends on logging. Reachability is directional, exactly as §12.5 warned — the closure follows arrows, it does not run them backward.

💡 Intuition. The transitive closure is the manifest you wish you had: not just "needs directly" but "needs, eventually." A security tool asking "does the release ship logging ≤ v1.2?" should query the closure, not the raw manifest — a vulnerable logging two hops away is just as shipped as one hop away.

⚠️ Common Pitfall. It is tempting to answer "what does release pull in?" by listing $D$'s direct pairs from release and stopping. That misses crypto, db, and logging entirely — three of the five real dependencies. The whole reason transitive closure exists is that "depends on" is not transitive as written; you must close it.

Phase 3: Detect the circular dependency

A circular dependency is a cycle: a module that, by following dependencies, eventually depends on itself. In relation terms, that is a pair $(X, X)$ in the transitive closure — a self-loop that $D$ itself did not contain. So cycle detection is a one-liner once we have reach: look for any $(X, X) \in$ reach.

Suppose code review reveals one more manifest line that slipped in: logging -> auth (someone made the logger call back into auth for an audit trail). Add it and recompute.

D2 = D | {("logging", "auth")}          # the accidental back-edge
reach2 = closure_transitive(D2)
cycles = {x for (x, y) in reach2 if x == y}     # self-loops in the closure
print(sorted(cycles))
# Expected output:
# ['auth', 'crypto', 'logging']

Hand-derive the cycle. Follow the new edge: auth -> crypto -> logging -> auth. That is a closed loop of length 3. Every module on the loop can reach itself:

auth -> crypto -> logging -> auth, so $(\text{auth}, \text{auth}) \in$ closure.
crypto -> logging -> auth -> crypto, so $(\text{crypto}, \text{crypto}) \in$ closure.
logging -> auth -> crypto -> logging, so $(\text{logging}, \text{logging}) \in$ closure.

Exactly three self-loops appear, naming exactly the three modules on the cycle — matching the output. release, api, and db can reach the cycle but are not on it, so they get no self-loop. That distinction — "feeds into the cycle" versus "is in the cycle" — falls out for free from looking at which elements gained a diagonal entry.

🚪 Threshold idea in miniature. A property of the whole structure ("there is a circular dependency") became a local membership test ("is any $(X,X)$ in the closure?"). This is the recurring superpower of the chapter: once you compute the right closure, structural questions collapse into set-membership questions. The same self-loop-on-the-diagonal trick detects cycles in any "depends on," "calls," or "imports" relation.

🔗 Connection. A dependency relation with no cycle — no self-loop in its transitive closure — is exactly the precondition for a topological sort, the build-order algorithm Chapter 13 develops (and Chapter 28 re-derives via DFS). A cycle is precisely why a topological order fails to exist: you cannot linearize "must come before" if something must come before itself.

Phase 4: Find the dead weight with a projection

The last question — "which modules can we delete because nothing depends on them?" — is about the second coordinate of $D$. A module $Y$ is depended upon iff $Y$ appears as the second element of some pair, i.e. $Y$ is in the projection of $D$ onto its second column (§12.6). The deletable modules are those in modules but not in that projection — a set difference (Chapter 8). We must also exclude release itself, which is the intended top-level target (nothing should depend on it; that does not make it dead).

depended_upon = {y for (x, y) in D}          # projection onto the 2nd column
top_level = {"release"}                        # intended root(s), not dead weight
dead_weight = modules - depended_upon - top_level
print(sorted(dead_weight))
# Expected output:
# ['legacy_cache', 'metrics']

Hand-derive it. The second coordinates appearing in $D$ are $\{$auth, api, db, crypto, logging$\}$ — those are "depended upon." Subtract them from the eight modules, then also remove release. What remains is legacy_cache (in no pair at all) and metrics (it depends on logging, so it is a first coordinate, but nothing depends on metrics, so it is never a second coordinate). Both are leaves of the "is depended on" view: candidates for deletion or for being flagged as separate top-level targets. The analysis correctly distinguishes them from release, which is unused-as-a-dependency by design.

💡 Intuition. "Who depends on $Y$?" reads the relation backward; "what does $X$ depend on?" reads it forward. The projection onto the second column is the set of all $Y$ that someone depends on, so its complement (minus intended roots) is exactly the set nobody needs. Projection plus set difference — two operations from this chapter and the last — answer a real cleanup question with no graph traversal at all.

Discussion Questions

In Phase 2 we computed reachability from release. How would you compute, for a vulnerable module like logging, the set of all modules that would be affected if logging broke — i.e., everything that transitively depends on logging? Which coordinate do you fix, and does the relation need to be reversed first?
The cycle test in Phase 3 reports the modules on a cycle. Modify the reasoning to also report modules that merely reach a cycle (and are therefore at risk of an infinite build). Which set operation relates "on a cycle" to "reaches a cycle"?
We treated the manifest as a relation and never built a graph data structure. For a manifest with 100,000 modules, the pass-until-stable closure is expensive. Which algorithm from the chapter's "Connection" notes (and Chapter 29's neighborhood) would you reach for, and what is its asymptotic cost?
The dead-weight analysis excluded release by hand as an intended root. How would you generalize "intended roots" so the analysis needs no manual exception — e.g., for a project with several top-level release artifacts?

Your Turn: Extensions

Option A (impact analysis). Write affected_by(R, target) returning every module that transitively depends on target. Hint: reverse the relation ({(b, a) for (a, b) in R}) and take reachability from target. Test it for target = "crypto" on the original $D$ and hand-derive the expected set.
Option B (minimal cycle report). When a cycle exists, the self-loop test names the modules involved but not the cycle itself. Write code that, given a module m with $(m, m)$ in the closure, recovers one actual cycle (a path m -> ... -> m). State the invariant your search maintains.
Option C (closure as equivalence check). "Mutually depend on each other" — $X$ reaches $Y$ and $Y$ reaches $X$ — partitions the modules into groups (the strongly connected components). Define this relation from the transitive closure, prove it is an equivalence relation, and compute its classes for $D2$ (the version with the back-edge). Which class has more than one module, and why?

Key Takeaways

A manifest is a relation; treat it like one. Modeling the flat list of "depends on" pairs as a set $D \subseteq \text{modules} \times \text{modules}$ turns three vague operational questions into three precise relation computations.
Reachability is transitive closure. "What does release actually pull in?" is answered by closing the relation, not by reading the raw pairs — direct dependencies undercount the real footprint.
Cycles are self-loops in the closure. A circular dependency shows up as a pair $(X, X)$ that $D$ never contained; one membership test over the closure's diagonal both detects the cycle and names every module on it.
Projection plus set difference finds dead weight. "Nobody depends on it" is "not in the second-column projection," and the deletable set is a set difference — Chapter 8 and Chapter 12 composing to answer a maintenance question with no traversal.