Case Study: Building a Dependency Graph for a Build System

"Within a computer, natural language is unnatural." — Alan Perlis, Epigrams on Programming

Executive Summary

Every build tool — make, Bazel, cargo, npm, a CI pipeline — is, underneath, a directed graph. The modules are vertices; "A must be built before B" is a directed edge $A \to B$. In this build-focused case study you will construct a small DiGraph data structure from scratch (the directed cousin of the Project Checkpoint's Graph), model a real set of module dependencies as a digraph, compute each module's in-degree and out-degree to find the foundational and the leaf modules, and discover — the way a real build tool does — that a cycle in this graph is not a curiosity but a bug: a circular dependency that makes any build order impossible. You will detect that cycle by hand and in code. The directed-graph-is-a-relation connection (§27.3, Chapter 12) is the spine of the whole thing.

Where Case Study 1 audited a given undirected social graph, this one builds a directed graph and a data structure to hold it — a different kind of work on a different kind of graph.

Skills applied

  • Modeling a real one-way relationship as a directed graph (the §27.6 four-question method).
  • Building a directed-graph data structure (DiGraph) on an adjacency dictionary, from scratch.
  • Computing in-degree and out-degree, and reading them as "how foundational / how dependent."
  • Recognizing a directed graph as a relation (Chapter 12) and a cycle as a circular dependency bug.
  • Sanity-checking the structure with a directed analogue of the handshaking lemma.

Background

The scenario

You are writing a small build tool for a Python project. The package has these modules, and these "depends-on" relationships (read "config is imported by — i.e. needed before — db"):

config  ->  db
config  ->  logging
db      ->  models
logging ->  models
models  ->  api
api     ->  web

Six modules, six dependencies. Your tool must eventually answer "in what order do I build these?" — but that (a topological sort) is Chapter 28's job. Chapter 27's job, and yours today, is to build the graph correctly and to validate that a build order can exist at all. Because if someone later adds a dependency that creates a cycle, no order exists, and your tool must say so instead of looping forever.

💡 Intuition: A dependency is inherently one-way: config must precede db, but db does not precede config. One-way relationships are the textbook signal (§27.3) to reach for a directed graph — ordered-pair edges $(u, v)$, drawn $u \to v$ — not the unordered edges of a friendship graph.

Why this matters

Build systems, package managers, spreadsheet recalculation, course schedulers, and task runners are all the same directed-graph problem. Getting the model right — what is a vertex, what direction does an edge point, what does a cycle mean — is the entire game. A build tool that models dependencies as undirected could not tell "A before B" from "B before A"; one that ignores cycles will hang. So we build deliberately and validate as we go (Theme 2: correctness before cleverness).

Phase 1: Choose the model (the four questions)

Run the §27.6 modeling checklist before writing any code:

  1. What is a vertex? A module (config, db, …).
  2. What is an edge? A dependency "must be built before."
  3. Directed or undirected? Directed — dependency is one-way. Edge $(\text{config}, \text{db})$ means config before db, drawn config → db.
  4. Weighted or unweighted? Unweighted for now — we care only about order, not cost. (A build tool that tracked compile time per module would add weights; that is Chapter 29's world.)

So the model is a directed, unweighted simple graph: $$V = \{\text{config, db, logging, models, api, web}\},$$ $$E = \{(\text{config},\text{db}),(\text{config},\text{logging}),(\text{db},\text{models}),(\text{logging},\text{models}),(\text{models},\text{api}),(\text{api},\text{web})\}.$$

🔗 Connection. This $E$ is a relation on $V$ (Chapter 12) — "depends on," drawn as a digraph. Everything Chapter 12 says about relations applies: we can ask if it is reflexive (no — nothing depends on itself, and it had better not!), and we will care a great deal about its reachability (the transitive closure), because "does building web ultimately require config?" is exactly "is config reachable from … " — careful with direction — "can we reach web from config?"

Phase 2: Build the DiGraph data structure

Now build the directed cousin of the Project Checkpoint's Graph. The single structural difference: add_edge(u, v) records the arc in one direction only (u → v), and each vertex tracks its out-neighbors. We add in_degree and out_degree because a digraph has two degrees per vertex (§27.3).

class DiGraph:
    """A directed simple graph backed by an adjacency dict (vertex -> set of heads)."""

    def __init__(self):
        self.adj = {}                       # vertex -> set of OUT-neighbors (heads)

    def add_vertex(self, v):
        self.adj.setdefault(v, set())

    def add_edge(self, u, v):               # arc u -> v (ORDERED, one direction)
        if u == v:
            raise ValueError("no self-dependency allowed")
        self.add_vertex(u)
        self.add_vertex(v)
        self.adj[u].add(v)                  # note: NOT adj[v].add(u)

    def out_degree(self, v):
        return len(self.adj[v])             # arcs leaving v

    def in_degree(self, v):                 # arcs entering v: scan all out-sets
        return sum(1 for heads in self.adj.values() if v in heads)

    def vertices(self):
        return set(self.adj)

    def num_arcs(self):
        return sum(len(heads) for heads in self.adj.values())

Contrast with Graph.add_edge, which did adj[u].add(v) and adj[v].add(u). Dropping the second line is the whole difference between undirected and directed — and it is precisely "unordered pair vs. ordered pair" from §27.1 and §27.3, made into code.

⚠️ Common Pitfall: The most common directed-graph bug is accidentally inserting both directions (copy-pasting from undirected code), silently turning every dependency into a mutual one — which creates a 2-cycle on every edge and makes every build "impossible." If your cycle checker suddenly reports that everything is circular, suspect a stray adj[v].add(u).

Phase 3: Populate and inspect (in-degree, out-degree)

Build the graph and read off the two degrees of every module.

deps = [("config","db"), ("config","logging"), ("db","models"),
        ("logging","models"), ("models","api"), ("api","web")]

g = DiGraph()
for u, v in deps:
    g.add_edge(u, v)

for v in sorted(g.vertices()):
    print(f"{v:8} in={g.in_degree(v)} out={g.out_degree(v)}")
print("total arcs:", g.num_arcs())
# Expected output:
# api      in=1 out=1
# config   in=0 out=2
# db       in=1 out=1
# logging  in=1 out=1
# models   in=2 out=1
# web      in=1 out=0
# total arcs: 6

Hand-derive the table from the dependency list:

Module In-degree (depends on it) Out-degree (it depends on)
config 0 2 (db, logging)
db 1 (config) 1 (models)
logging 1 (config) 1 (models)
models 2 (db, logging) 1 (api)
api 1 (models) 1 (web)
web 1 (api) 0

Reading the degrees tells a story you can act on:

  • config has in-degree 0 — nothing must precede it. It is a source: a valid build starts here. (A directed graph with no cycles always has at least one in-degree-0 vertex; that is the seed of topological sort, Chapter 28.)
  • web has out-degree 0 — it depends on things but nothing depends on it. It is a sink: a valid build ends here.
  • models has the highest in-degree (2) — it is the most depended-upon module. Changing models forces rebuilding everything downstream (api, then web), so it is the riskiest file to touch.

🔄 Check Your Understanding Why does the directed handshaking identity read $\sum_v \deg^-(v) = \sum_v \deg^+(v) = \lvert E\rvert$ (note: not $2\lvert E\rvert$)? Verify it on the table above.

Answer

Each arc $(u,v)$ contributes exactly once to an out-degree (at its tail $u$) and exactly once to an in-degree (at its head $v$). So summing out-degrees counts every arc once → $\lvert E\rvert$; summing in-degrees likewise → $\lvert E\rvert$. Unlike the undirected lemma (where each edge hits two degree counts, giving $2\lvert E\rvert$), a directed arc splits its two contributions across two different sums. Check (table order config, db, logging, models, api, web): in-degrees $0+1+1+2+1+1 = 6$; out-degrees $2+1+1+1+1+0 = 6$; arcs $= 6$. ✓

Phase 4: The cycle is a bug — detect it

Now the payoff. A teammate refactors and makes config read a value from web at import time. That adds a new dependency — config now needs web first — i.e. the arc web → config (read "web must be built before config"):

web -> config        # the new, dangerous arc

Add it and look at what it does to the graph:

g.add_edge("web", "config")     # the breaking change
print("config in-degree now:", g.in_degree("config"))
# Expected output:
# config in-degree now: 1

config's in-degree jumped from 0 to 1 — there is no longer any source! Trace the consequence by hand. Follow the arcs: $$\text{config} \to \text{db} \to \text{models} \to \text{api} \to \text{web} \to \text{config}.$$ That is a directed cycle (§27.2): a closed directed walk that returns to its start. In dependency terms it is a circular dependency — to build config you must first build web, but to build web you must first build config, … forever. No build order exists. A naive build tool would loop infinitely; a correct one must detect the cycle and refuse.

Here is a from-scratch cycle detector — a depth-first walk that flags an arc back into the current path (the "gray" set). It is informal (Chapter 28 formalizes DFS), but it is enough to catch the bug.

def has_cycle(g):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in g.vertices()}

    def visit(u):
        color[u] = GRAY                       # u is on the current path
        for w in g.adj[u]:
            if color[w] == GRAY:              # arc back into the path => cycle
                return True
            if color[w] == WHITE and visit(w):
                return True
        color[u] = BLACK                      # u fully explored, off the path
        return False

    return any(color[v] == WHITE and visit(v) for v in g.vertices())

print("cycle present?", has_cycle(g))
# Expected output:
# cycle present? True

Hand-trace from config: mark config GRAY, descend config→db→models→api→web, each marked GRAY in turn; from web the only arc is web→config, and config is GRAY (still on the path) — so the detector returns True. The cycle is caught. Remove the bad arc and it returns to False:

g.adj["web"].discard("config")   # undo the breaking change
print("cycle present?", has_cycle(g))
# Expected output:
# cycle present? False

🔗 Connection. A directed graph with no cycles is a DAG (directed acyclic graph), and a DAG is exactly a graph that has a valid build order — a topological sort (Chapter 28, building on the posets of Chapter 13). "Can these modules be built?" is precisely "is this digraph a DAG?", and you just wrote the test for it. Your build tool's job is: refuse to build a non-DAG; otherwise, topologically sort and go.

⚠️ Common Pitfall: A cycle in a dependency graph is always a bug, but a cycle in other directed graphs is perfectly normal (a "follows" graph is full of mutual-follow 2-cycles; a state machine cycles by design). The meaning of "cycle" depends on what the graph models (§27.6) — direction and domain together tell you whether a cycle is a feature or a fault.

Phase 5: Validate the build (putting it together)

Wrap the pieces into the check your tool actually runs before every build: build the graph, assert the directed degree identity as an integrity check, and confirm acyclicity.

def can_build(deps):
    g = DiGraph()
    for u, v in deps:
        g.add_edge(u, v)
    # integrity: sum of in-degrees == sum of out-degrees == number of arcs
    sin = sum(g.in_degree(v) for v in g.vertices())
    sout = sum(g.out_degree(v) for v in g.vertices())
    assert sin == sout == g.num_arcs(), "arc accounting is inconsistent"
    return not has_cycle(g)          # buildable iff acyclic (a DAG)

good = [("config","db"),("config","logging"),("db","models"),
        ("logging","models"),("models","api"),("api","web")]
bad  = good + [("web","config")]
print("good project buildable?", can_build(good))
print("bad  project buildable?", can_build(bad))
# Expected output:
# good project buildable? True
# bad  project buildable? False

Hand-derive: the good list is the original DAG — no cycle — so can_build returns True. The bad list adds web→config, closing the config→db→models→api→web→config cycle, so has_cycle is True and can_build returns False. The assert passes in both cases because arc accounting (Phase 3's identity) is always consistent for a correctly built DiGraph — if it ever failed, it would mean a bug in add_edge, not in the data.

Discussion Questions

  1. The DiGraph.in_degree method scans every vertex's out-set, costing $O(\lvert V\rvert + \lvert E\rvert)$ per call. How would you change the data structure so `in_degree` is $O(1)$, and what does that cost you on add_edge? (This is the adjacency-representation trade-off Chapter 28 studies.)
  2. In Phase 1 you chose unweighted. Sketch the model change if the tool needed to estimate total build time. Which §27.3 graph type appears, and which later algorithm (and chapter) would compute the longest/critical path?
  3. The cycle detector uses three colors (white/gray/black). Why is a single "visited" flag not enough to detect a cycle in a directed graph? Give a 3-vertex digraph where a plain visited-set would falsely report a cycle (or miss one).
  4. A "follows" social graph and a "depends-on" build graph are both directed. A cycle is a bug in one and normal in the other. Explain, in terms of what each graph models (§27.6), why the same structural feature carries opposite meanings.
  5. Phase 5 asserts $\sum \deg^- = \sum \deg^+ = \lvert E\rvert$. If that assertion ever fired, what specific kind of bug in add_edge would you suspect, and why would it break exactly that identity?

Your Turn: Extensions

  • Option A (find the source/sink). Add methods sources(self) (in-degree 0) and sinks(self) (out-degree 0) to DiGraph. Run them on the good project; confirm config is the lone source and web the lone sink. (Sources are where a topological sort begins — Chapter 28.)
  • Option B (report the cycle). Extend has_cycle to return the actual cycle (the list of modules in the circular dependency), not just True. A build tool that says "circular dependency: config → db → models → api → web → config" is far more useful than one that says "build failed."
  • Option C (reverse the graph). Write reverse(g) returning a new DiGraph with every arc flipped. What real question does the reversed dependency graph answer? (Hint: "what breaks if I delete models?" — i.e. everything reachable from models in the original.)
  • Option D (mixed graph). Real road maps mix one-way and two-way streets. Extend DiGraph so a two-way connection adds both arcs, and discuss how this relates a symmetric relation (Chapter 12) to an undirected graph (§27.3).

Key Takeaways

  1. One-way relationship → directed graph. The §27.6 four-question method selected a directed, unweighted model; the only code difference from the undirected Graph is inserting one arc instead of two — exactly "ordered vs. unordered pair."
  2. A digraph is a relation; build the structure to match. In-degree and out-degree are the two degrees of §27.3, and the directed identity $\sum\deg^- = \sum\deg^+ = \lvert E\rvert$ is your integrity check (each arc lands in one in-sum and one out-sum).
  3. A cycle's meaning is domain-dependent. In a dependency graph a directed cycle is a circular dependency bug with no valid build order; a correct tool detects it and refuses rather than looping forever.
  4. Buildable = acyclic = DAG. "Can these modules be built?" is "is this digraph a DAG?", and a topological sort (Chapter 28) produces the order once acyclicity is confirmed.
  5. This complements the social-graph anchor. Case Study 1 analyzed a given undirected graph; here you built a directed one and a data structure for it — two faces of the same Chapter 27 vocabulary.