Case Study: Building a Dependency-Aware Build Scheduler with Topological Sort

DataField.Dev

Case Study: Building a Dependency-Aware Build Scheduler with Topological Sort

"A build system's only real job is to do things in an order that never breaks a promise."

Executive Summary

In this build-focused case study you will construct, from scratch, a miniature build scheduler — the beating heart of tools like make, Bazel, and every CI pipeline. Given a set of build targets and their "must come before" dependencies, the scheduler must output an order in which each target is built only after everything it depends on. That is exactly the topological sort of §28.5, recovered as a depth-first-search post-order. But a real scheduler must also refuse impossible inputs: if the dependencies contain a cycle ("A needs B, B needs A"), no valid order exists, and the tool must say so rather than emit garbage. So you will also build the directed cycle detector that the chapter gestures at — the three-state DFS that watches its own recursion stack.

You will build in four increments, each a working program: (1) parse dependencies into a directed graph, (2) topologically sort a known-good DAG, (3) detect a cycle and report it, and (4) fold both into one robust schedule() that either returns a build order or raises a precise error. Where Case Study 1 was about reading a graph, this one is about building a tool on top of a traversal.

Skills applied

Translating "X must be built before Y" into a directed graph and an adjacency list (§28.1, Chapter 27).
Implementing topological sort as a DFS post-order, reversed (§28.5).
Implementing directed cycle detection with the unvisited / on-stack / finished state machine (§28.5, contrasted with the undirected test of §28.4).
Recognizing that topological order exists iff the dependency graph is a DAG (§28.5, Chapter 13).
Composing the two into a production-shaped API that fails loudly on bad input.

Background

The scenario

You are writing the dependency engine for a small build tool. A project's build is described by targets and dependency edges. We adopt the convention from §28.5: an edge $a \to b$ means "$a$ must be built before $b$" (so the arrow points from prerequisite to dependent). A valid build order is then a listing of all targets in which, for every edge $a \to b$, target $a$ appears before target $b$ — precisely a topological order.

🔗 Connection. You met this problem twice already. In Chapter 13 a topological sort was a linear extension of a partial order — flattening a poset of "must precede" relations into a sequence — and make was the motivating example. In §28.5 we re-derived the same ordering from depth-first search. This case study turns that re-derivation into a working tool, and — the new piece — handles the input a theory chapter can assume away: a dependency graph that is not a DAG.

The dependencies

Here is a realistic (if small) C-project build. Read each edge as "left must be built before right."

config  ->  utils      (utils includes config.h)
config  ->  db
utils   ->  db
utils   ->  api
db      ->  api
api     ->  web
api     ->  cli
logger  ->  utils
logger  ->  api

Targets with no outgoing edges (web, cli) are final artifacts; targets with no incoming edges (config, logger) are leaves of the dependency tree that can be built first. Our scheduler must discover an order — some order — respecting all nine edges.

💡 Intuition: Think of each edge as a promise: "I will not build db until config and utils are done." A topological sort is a single global ordering that keeps every promise simultaneously. DFS finds one by the §28.5 trick: a target finishes (in DFS) only after everything reachable from it finishes, so listing targets in reverse finish order puts every prerequisite ahead of its dependents.

Phase 1: Parse dependencies into a directed graph

First, build the adjacency list from a flat list of edges. Every target must appear as a key — even ones with no dependencies — so the traversal knows it exists (the §28.1 lesson: an isolated vertex still needs a key, here an empty neighbor list).

def build_graph(edges):
    """edges: list of (prereq, dependent) pairs meaning prereq -> dependent.
    Returns an adjacency list with EVERY mentioned target as a key."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, [])
        adj.setdefault(b, [])     # ensure dependents are keys too
        adj[a].append(b)
    return adj

edges = [
    ("config", "utils"), ("config", "db"),
    ("utils", "db"), ("utils", "api"),
    ("db", "api"),
    ("api", "web"), ("api", "cli"),
    ("logger", "utils"), ("logger", "api"),
]
adj = build_graph(edges)
for k in adj:
    print(k, "->", adj[k])
# Expected output:
# config -> ['utils', 'db']
# utils -> ['db', 'api']
# db -> ['api']
# api -> ['web', 'cli']
# web -> []
# cli -> []
# logger -> ['utils', 'api']

The keys appear in insertion order (Python 3.7+ dicts preserve it): config, utils, db, api, web, cli, logger. That order matters — it is the order the topological sort's outer loop will consider targets, and hence (together with neighbor order) determines which valid topological order we get among the many that exist.

🔄 Check Your Understanding Why do web and cli appear as keys with empty lists, even though they are never on the left of an edge?

Answer

They are still targets — they must be built, and the scheduler must place them in the output. The setdefault(b, []) line guarantees every dependent becomes a key. A target absent from the keys would simply be missing from the build order, exactly the §28.1 reason an isolated vertex still gets an empty neighbor list.

Phase 2: Topological sort of the (good) DAG

Now the core algorithm, taken verbatim from §28.5: DFS, append each vertex when it finishes, then reverse.

def topological_sort(adj):
    """Topological order of a DAG via DFS post-order (reversed)."""
    visited = set()
    finish = []                       # vertices in order of FINISHING
    def visit(v):
        visited.add(v)
        for w in adj[v]:
            if w not in visited:
                visit(w)
        finish.append(v)              # record AFTER all descendants finish
    for v in adj:
        if v not in visited:
            visit(v)
    finish.reverse()
    return finish

order = topological_sort(adj)
print(order)
# Expected output:
# ['logger', 'config', 'utils', 'db', 'api', 'cli', 'web']

This output is load-bearing, so let us hand-derive it with a finish-order trace, exactly like the §28.5 worked example. The outer loop visits keys in insertion order: config, utils, db, api, web, cli, logger. It starts at config.

Event	Detail	`finish` (append order)
`visit(config)`	dive to first neighbor `utils`	—
`visit(utils)`	dive to first neighbor `db`	—
`visit(db)`	dive to its only neighbor `api`	—
`visit(api)`	dive to `web`	—
finish `web`	leaf (no neighbors)	`[web]`
back in `api`	next neighbor `cli`	—
finish `cli`	leaf	`[web, cli]`
finish `api`	both neighbors done	`[web, cli, api]`
finish `db`	its neighbor `api` done	`[web, cli, api, db]`
back in `utils`	next neighbor `api` already visited	—
finish `utils`	done	`[web, cli, api, db, utils]`
back in `config`	next neighbor `db` already visited	—
finish `config`	done	`[web, cli, api, db, utils, config]`
outer loop	`utils…cli` all visited; reach `logger`	—
`visit(logger)`	neighbors `utils`, `api` already visited	—
finish `logger`	nothing new	`[web, cli, api, db, utils, config, logger]`

Finish order is [web, cli, api, db, utils, config, logger]; reversed, the topological order is ['logger', 'config', 'utils', 'db', 'api', 'cli', 'web']. Now verify the promises — spot-check a few edges against the output:

def respects(order, edges):
    pos = {t: i for i, t in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in edges)

print("all dependencies respected?", respects(order, edges))
print("config before db?", order.index("config") < order.index("db"))
print("logger before api?", order.index("logger") < order.index("api"))
# Expected output:
# all dependencies respected? True
# config before db? True
# logger before api? True

Every edge points "forward" in the list — config (index 1) precedes db (index 3); api (index 4) precedes both cli (5) and web (6); logger (index 0) precedes everything it feeds. That is the whole promise of a topological order.

⚠️ Common Pitfall: This output is one valid order, not the order. ['config', 'logger', ...] would be equally valid — config and logger are incomparable (neither must precede the other), so a build system is free to choose, and could even build them in parallel. The §28.5 algorithm produces a specific one determined by key-insertion order and neighbor order; do not write a test that demands an exact list unless you have pinned both. Test the property (respects(order, edges)), not the exact permutation.

🐛 Find the Error. A teammate "simplifies" the sort by recording each target at discovery (pre-order) and skipping the finish.reverse(): def visit(v): visited.add(v); order.append(v); for w .... On this graph it happens to emit ['config', 'utils', 'db', 'api', 'web', 'cli', 'logger']. Spot the bug and the failing promise.

Answer

Recording at discovery has no ordering guarantee with respect to edges (§28.5). On this graph the emitted list ends with logger, but logger -> utils and logger -> api require logger to come before utils and api — and it comes after both. The promise logger before utils is broken. Pre-order passes on lucky inputs and fails on the rest; only the reversed finish order is correct. This is exactly theme two of the book: passing on one input is not correctness.

Phase 3: Detect a cycle (so we can reject impossible builds)

Phase 2 assumed the input was a DAG. Real configs contain mistakes. Suppose someone adds one bad edge — api -> config ("api must be built before config") — to the existing dependencies. Now there is a directed cycle: config -> utils -> api -> config. No build order can satisfy it (config would have to come before and after itself), so the scheduler must detect it.

Here is the §28.4 trap, made concrete: the undirected cycle test does not work for directed graphs. We need the directed version — a DFS that tracks three states per target:

unvisited — not yet reached;
on the recursion stack ("gray") — started but not finished, i.e. an ancestor of the current call;
finished ("black") — fully explored.

A directed cycle exists exactly when DFS follows an edge to a target that is currently on the stack — a back edge to an ancestor.

def find_cycle(adj):
    """Return a target that lies on a directed cycle, or None if the graph is a DAG."""
    UNVISITED, ON_STACK, DONE = 0, 1, 2
    state = {v: UNVISITED for v in adj}
    def visit(v):
        state[v] = ON_STACK
        for w in adj[v]:
            if state[w] == ON_STACK:        # edge to an ancestor -> cycle!
                return w
            if state[w] == UNVISITED:
                hit = visit(w)
                if hit is not None:
                    return hit
        state[v] = DONE
        return None
    for v in adj:
        if state[v] == UNVISITED:
            hit = visit(v)
            if hit is not None:
                return hit
    return None

good = build_graph(edges)
bad = build_graph(edges + [("api", "config")])   # inject the cycle
print("cycle in good graph:", find_cycle(good))
print("cycle in bad graph:", find_cycle(bad))
# Expected output:
# cycle in good graph: None
# cycle in bad graph: config

Trace the bad graph to see config get caught. DFS starts at config, marks it ON_STACK, dives config -> utils -> db -> api (each marked ON_STACK as it is entered). Now inside visit(api) the neighbors are web, cli, config (the injected edge appended last). It finishes web and cli (marking them DONE), then examines the edge api -> config: state["config"] is ON_STACK — config is an ancestor still open on the recursion stack — so DFS returns config. The cycle config -> utils -> db -> api -> config is exposed. On the good graph, no edge ever points to an ON_STACK target (every back-pointing edge lands on a DONE one), so find_cycle returns None.

💡 Intuition: "ON_STACK" is the set of targets on the current path from the root of the DFS tree — your unbroken chain of open recursive calls. An edge to one of those is a shortcut back up your own path: a loop. An edge to a DONE target is fine — it points into a region you already finished, off your current path, so it closes no loop. The whole subtlety of directed cycle detection is distinguishing "ancestor on my path" (ON_STACK) from "already finished elsewhere" (DONE) — a distinction the undirected test of §28.4 never needed.

🔄 Check Your Understanding Why is the undirected has_cycle from §28.4 — "report a cycle on any edge to a visited non-parent vertex" — wrong here? Give the smallest DAG it would misjudge.

Answer

Ignoring direction conflates two questions (§28.5). The diamond a -> b, a -> c, b -> d, c -> d is a perfectly good DAG with no directed cycle, yet its underlying undirected graph has the 4-cycle $a\!-\!b\!-\!d\!-\!c\!-\!a$, which the undirected test would flag. Directed detection must respect direction by tracking the recursion stack, reporting a cycle only on a back edge to an ancestor (ON_STACK), not to any finished vertex.

Phase 4: Compose — a robust `schedule()` that fails loudly

A production scheduler should return a build order or refuse with a precise error — never silently emit a broken order. Compose Phases 2 and 3: check for a cycle first; if clean, return the topological sort.

class CyclicDependencyError(Exception):
    pass

def schedule(edges):
    """Return a valid build order, or raise CyclicDependencyError if impossible."""
    adj = build_graph(edges)
    bad = find_cycle(adj)
    if bad is not None:
        raise CyclicDependencyError(
            f"dependency cycle detected (involves '{bad}'); no build order exists"
        )
    return topological_sort(adj)

# Good input -> an order:
print(schedule(edges))

# Bad input -> a precise failure:
try:
    schedule(edges + [("api", "config")])
except CyclicDependencyError as e:
    print("ERROR:", e)
# Expected output:
# ['logger', 'config', 'utils', 'db', 'api', 'cli', 'web']
# ERROR: dependency cycle detected (involves 'config'); no build order exists

This is the §28.5 theorem made operational: a topological order exists if and only if the dependency graph is a DAG. schedule enforces both halves — it produces an order when one exists, and proves none exists (by exhibiting a cycle) when it does not.

⚠️ Common Pitfall: Running topological_sort on a graph with a cycle does not raise an error on its own — the §28.5 implementation will happily return some list (because its visited set treats on-stack and finished vertices identically), and that list will silently violate a dependency. The integrity comes from checking find_cycle first. A robust real-world implementation often fuses the two passes — a single DFS that builds the post-order and raises the instant it meets an ON_STACK vertex — but the two-pass version here is easier to read and prove correct, and asymptotically identical at $\Theta(V + E)$.

🔄 Check Your Understanding schedule runs find_cycle (one DFS) and then topological_sort (another DFS). What is its overall running time, and why does running two traversals not change the asymptotic class?

Answer

Each DFS is $\Theta(V + E)$ on the adjacency list (§28.6). Two of them in sequence add: $\Theta(V+E) + \Theta(V+E) = \Theta(V + E)$ — a constant factor of 2, which big-Theta absorbs. The scheduler is linear in the size of the dependency graph.

Phase 5: Counting the valid orders (a structural aside)

How many valid build orders does the good graph admit? Each one is a distinct topological order, and the count depends on how much freedom the partial order leaves. config and logger are incomparable, and once their prerequisites are met, web and cli (both depending only on api) can come in either order. A by-hand enumeration shows the orders factor as "interleave the two independent leaves at the front" × "order the two independent artifacts at the end," giving several valid permutations — all of which respects(order, edges) would accept. The single order schedule returns is just the one DFS happens to produce. For a build tool the practical consequence is profound: incomparable targets can be built in parallel, and the count of topological orders measures the available parallelism.

🔗 Connection. This ties back to Chapter 13: the number of linear extensions of a poset is a classic (and computationally hard in general) quantity. A build tool does not enumerate them; it picks one cheaply with DFS and exploits the incomparability to schedule independent work concurrently.

Discussion Questions

We chose the convention $a \to b$ = "$a$ before $b$" and used DFS post-order reversed. If you flip the convention so $a \to b$ means "$a$ depends on $b$" (arrow from dependent to prerequisite), what changes in topological_sort — do you still reverse, or not? Work it out on a two-node example.
schedule does two DFS passes. Sketch how to fuse them into one DFS that returns the post-order and raises CyclicDependencyError the instant it follows an edge to an ON_STACK vertex. Which single data structure replaces both visited and the post-order list?
The error message names one target on the cycle (config). How would you modify find_cycle to report the entire cycle as a path (e.g. config -> utils -> db -> api -> config)? (Hint: keep the recursion stack as an explicit list and slice it when you hit the ON_STACK vertex.)
Real build tools (make, Bazel) also want to rebuild only what changed. Given the graph and a set of "dirty" targets, how would you use a traversal to compute exactly the set of targets that must be rebuilt — and in what order? Which traversal, and from which sources?
Topological sort says some order exists; it does not minimize build time. If each target had a known compile duration and unlimited parallel workers, which classic quantity over the DAG gives the shortest possible total build time, and why is a plain topological order not enough to find it?

Your Turn: Extensions

Option A (fuse the passes). Implement schedule_one_pass(edges): a single DFS that appends to the post-order and raises CyclicDependencyError on a back edge to an ON_STACK vertex. Verify it returns the same order as schedule on the good graph and raises on the bad one.
Option B (report the full cycle). Extend find_cycle to return the cycle as a list of targets, by maintaining an explicit path stack and slicing it when an ON_STACK vertex is met. Hand-derive the cycle list for the injected api -> config edge: it should be ['config', 'utils', 'db', 'api'] (or a rotation of it).
Option C (incremental rebuild). Given a set dirty of changed targets, compute the set of targets that transitively depend on any dirty one (a traversal forward along the edges from each dirty target), then topologically sort just that subset. Test it with dirty = {"utils"} and hand-derive which targets must rebuild.
Option D (Kahn's algorithm). Implement the other standard topological sort — the BFS-flavored Kahn's algorithm: repeatedly remove a target with in-degree $0$, decrementing its dependents' in-degrees. It both sorts and detects cycles (if any targets remain with nonzero in-degree, a cycle exists). Compare its output and its cycle-handling to the DFS version on both graphs.

Key Takeaways

Topological sort = DFS post-order, reversed. Record each target when it finishes, reverse the list, and every prerequisite precedes its dependents — the §28.5 algorithm, now a working scheduler.
A real tool must reject impossible inputs. A topological order exists iff the graph is a DAG; a scheduler must detect cycles and fail loudly rather than emit a broken order (§28.5, Chapter 13).
Directed cycle detection needs three states. The undirected §28.4 test is wrong for directed graphs; you must distinguish on the recursion stack (ancestor) from finished, reporting a cycle only on a back edge to an ON_STACK vertex.
Sequencing adds. Two DFS passes are still $\Theta(V + E)$ — linear in the dependency graph (§28.6). Composing traversals does not change the asymptotic class.
Many orders, one chosen. The output is one valid topological order among possibly many; test the property (respects), not the exact permutation. Incomparable targets are the parallelism a build tool can exploit (Chapter 13).

Case Study: Building a Dependency-Aware Build Scheduler with Topological Sort

Executive Summary

Background

The scenario

The dependencies

Phase 1: Parse dependencies into a directed graph

Phase 2: Topological sort of the (good) DAG

Phase 3: Detect a cycle (so we can reject impossible builds)

Phase 4: Compose — a robust schedule() that fails loudly

Phase 5: Counting the valid orders (a structural aside)

Discussion Questions

Your Turn: Extensions

Key Takeaways

Phase 4: Compose — a robust `schedule()` that fails loudly