Case Study: Diagnosing a Slow CI Build as a Partial Order

DataField.Dev

Case Study: Diagnosing a Slow CI Build as a Partial Order

"Order and simplification are the first steps toward the mastery of a subject." — Thomas Mann

Executive Summary

A continuous-integration (CI) pipeline that used to finish in four minutes now takes nine, and nobody changed the machines. In this analysis-focused case study you'll treat the pipeline as a poset of build steps ordered by "must finish before," and use the chapter's tools — Hasse diagrams, chains, antichains, topological sort, and the minimal-element lemma — to find out why it slowed down and what the best achievable time is. You will not build a build system; you will diagnose one, the way a performance engineer actually does: by reading the dependency structure as the mathematical object it is.

The investigation turns on three measurements that are all chapter concepts in disguise. The longest chain is the critical path — the floor on completion time no amount of hardware can beat. The largest antichain is the maximum parallelism the work can absorb. And a failed topological sort is the signature of a circular dependency, the bug that turns a build from slow into impossible. By the end you'll have explained the slowdown, found the one edge that caused it, and stated the speed limit the pipeline can never cross.

Skills applied

Modeling a real dependency graph as a poset and drawing its Hasse diagram (transitive reduction).
Computing the longest chain (critical path) and reading it as a lower bound on runtime.
Computing the largest antichain (maximum parallel width) and reasoning about scheduling.
Running topological sort to produce a legal order and to detect a cycle.
Distinguishing "comparable" (a true dependency) from "incomparable" (parallelizable) work.

Background

The scenario

The team's CI pipeline runs these jobs. Each job, once its prerequisites finish, takes (for simplicity) one unit of time — call it one minute — to run; jobs with no unmet prerequisite may run in parallel if a worker is free.

Job	Must run after	Plain-English reason
`checkout`	(nothing)	clone the repo
`install`	`checkout`	download dependencies
`lint`	`install`	static checks need deps installed
`unit`	`install`	unit tests need deps
`build`	`install`	compile the app
`integration`	`build`, `unit`	needs a built app and passing unit tests
`e2e`	`build`	end-to-end needs the built app
`package`	`integration`, `e2e`, `lint`	bundle only after everything green
`deploy`	`package`	ship the bundle

"Must run after" is a strict order: if $X$ must run after $Y$, then $Y \prec X$. The relation is reflexive once we add $X \preceq X$, antisymmetric (you can't have $X$ after $Y$ and $Y$ after $X$ in a sane pipeline), and transitive (if lint is after install and install is after checkout, then lint is after checkout). So the pipeline is a poset — and equivalently a DAG, with an edge $Y \to X$ whenever $Y$ must finish before $X$.

🔗 Connection. This is exactly the structure of §13.3: a set of tasks with "must-come-before" constraints. The math under make, under GitHub Actions needs:, under every task scheduler, is the linear extension of a poset. We are about to use that math as a diagnostic instrument, not just a way to pick a build order.

Why this matters

CI time is developer time multiplied by the team. A pipeline that wastes five minutes per run, fifty runs a day, burns four hours of waiting daily. But you cannot fix what you cannot measure, and the right measurements here are structural: the critical path tells you the best case, the parallel width tells you how many workers can help, and the presence of a cycle tells you whether the pipeline can run at all. Each is a quantity from this chapter.

Phase 1: Build the poset and its Hasse diagram

First, write the dependencies as a DAG {job: [successors]} — job $u$ points to the jobs that must run after it.

pipeline = {
    'checkout':    ['install'],
    'install':     ['lint', 'unit', 'build'],
    'lint':        ['package'],
    'unit':        ['integration'],
    'build':       ['integration', 'e2e'],
    'integration': ['package'],
    'e2e':         ['package'],
    'package':     ['deploy'],
    'deploy':      [],
}

These are the direct edges — the covering relation. The Hasse diagram draws exactly these and lets us infer the rest (e.g. checkout $\prec$ deploy by walking up). Drawn with time increasing upward:

                deploy
                  |
                package
              /    |     \
     integration  e2e    lint
        /    \     |       |
     unit   build (build)  |
        \     |   _________|
         \    |  /
          install
             |
          checkout

Read it cleanly as layers (each layer is a set of jobs that could run together):

Layer 5:                 deploy
Layer 4:                package
Layer 3:     integration   e2e
Layer 2:   lint   unit   build
Layer 1:           install
Layer 0:          checkout

with covering edges checkout–install; install–lint, install–unit, install–build; unit–integration, build–integration, build–e2e; lint–package, integration–package, e2e–package; package–deploy.

💡 Intuition: A horizontal slice of this diagram is an antichain — jobs none of which waits on another, so all can run at once. A path traced strictly upward is a chain — a sequence each of which waits on the previous, so they must run one after another. The whole performance question is a tug-of-war between the longest chain (which you can't parallelize away) and the widest antichain (which says how much parallelism would even help).

🔄 Check Your Understanding Is lint comparable to unit in this poset? What does your answer say about running them?

Answer

They are incomparable: neither must run before the other (both depend only on install, and neither depends on the other). So $\{\text{lint}, \text{unit}\}$ is an antichain, and the two jobs can run in parallel.

Phase 2: The critical path (longest chain)

The fastest the pipeline can ever finish — even with infinitely many workers — is the length of its longest chain, because every job on that chain must wait for the one below it. We compute it with the dynamic-programming recurrence from §13.2: the longest chain ending at a job is $1 +$ the max over its direct predecessors.

def critical_path_length(dag):
    """Longest chain in a DAG given as {node: [successors]}.
    Returns (length, one_witnessing_chain)."""
    preds = {u: [] for u in dag}
    for u in dag:
        for v in dag[u]:
            preds[v].append(u)
    best, choice = {}, {}
    def depth(u):
        if u in best:
            return best[u]
        if not preds[u]:
            best[u], choice[u] = 1, None
        else:
            p = max(preds[u], key=depth)
            best[u], choice[u] = 1 + depth(p), p
        return best[u]
    end = max(dag, key=depth)
    chain = []
    u = end
    while u is not None:
        chain.append(u); u = choice[u]
    return best[end], list(reversed(chain))

print(critical_path_length(pipeline))
# Expected output:
# (6, ['checkout', 'install', 'unit', 'integration', 'package', 'deploy'])

Hand-trace the depths. checkout has no predecessor, so depth 1. install depends on checkout: depth 2. lint, unit, build each depend only on install: depth 3. e2e depends on build: depth 4. integration depends on unit (depth 3) and build (depth 3), so depth 4. package depends on lint (3), integration (4), e2e (4): depth $1 + 4 = 5$. deploy depends on package: depth $1 + 5 = 6$ — and deploy is the deepest node, so it is the top of the longest chain.

⚠️ Common Pitfall: A natural off-by-one is to read the length off package (depth 5) instead of the actual top, deploy (depth 6). depth counts jobs on the chain, and the chain ends at deploy. The function returns best[end] with end = deploy, so the length is 6. The §13.1 lesson: identify which element is greatest before reading a number off a chain.

The witnessing chain has six jobs. Note the tie-break: integration has two depth-3 predecessors, unit and build; max(..., key=depth) returns the first of equal-depth candidates, which is unit here (it was appended to integration's predecessor list before build). So the function reports the chain through unit — but checkout→install→build→integration→package→deploy is an equally long critical path; longest chains, like topological sorts, need not be unique.

Interpretation. Six jobs lie on the longest chain, each waiting for the previous, so at one minute each the pipeline cannot finish in less than 6 minutes regardless of how many workers you buy. That is the critical path — the speed limit. Knowing it is worth real money: if your build takes 9 minutes and the critical path is 6, then 3 minutes are pure scheduling waste (too few workers), fixable by adding parallelism; but the 6 minutes are structural and can only be reduced by changing the dependencies (e.g. splitting build or integration).

⚠️ Common Pitfall: The critical path is a property of the poset, not of the machine. Adding workers can never beat it; only re-architecting the dependency graph can. Teams routinely throw hardware at a build whose problem is a long chain, and wonder why nothing improves. First measure the longest chain; then decide whether your problem is the chain or the workers.

Phase 3: Maximum parallelism (largest antichain)

How many workers could the pipeline ever keep busy at once? At most the size of the largest antichain — the widest set of mutually independent jobs. From the layered diagram, Layer 2 $\{\text{lint}, \text{unit}, \text{build}\}$ is an antichain of size 3, and no four jobs are pairwise incomparable here.

def widest_layer(dag):
    """Group nodes by longest-chain depth; the largest group is an
    antichain (all same depth => pairwise incomparable). Returns its size."""
    preds = {u: [] for u in dag}
    for u in dag:
        for v in dag[u]:
            preds[v].append(u)
    depth = {}
    def d(u):
        if u in depth: return depth[u]
        depth[u] = 1 + max((d(p) for p in preds[u]), default=0)
        return depth[u]
    layers = {}
    for u in dag:
        layers.setdefault(d(u), []).append(u)
    return max(len(g) for g in layers.values()), {k: sorted(v) for k, v in layers.items()}

print(widest_layer(pipeline))
# Expected output:
# (3, {1: ['checkout'], 2: ['install'], 3: ['build', 'lint', 'unit'],
#      4: ['e2e', 'integration'], 5: ['package'], 6: ['deploy']})

💡 Intuition: Grouping by depth is a layered topological sort: every job in layer $k$ depends only on jobs in earlier layers, so a whole layer is an antichain you can run simultaneously. This is precisely the "minimum number of antichains needed to cover the poset" idea behind Mirsky's theorem from §13.2 — here it takes 6 antichains (one per layer), and 6 is exactly the longest-chain length. Mirsky's theorem says that equality is no accident.

Interpretation. The widest layer has 3 jobs, so a fourth CI worker would sit idle — the work simply never has more than three independent pieces at any moment. This is the answer to "should we pay for more parallel runners?": no more than three help. Combined with Phase 2, the diagnosis sharpens: with 3 workers the pipeline should finish in 6 minutes (the critical path), so the observed 9 minutes means the scheduler is not achieving the available parallelism — the bug is in how jobs are dispatched, not in a shortage of machines.

🔄 Check Your Understanding Mirsky's theorem says the longest chain (6) equals the minimum number of antichains covering the poset. We found a cover by 6 layers. Could there be a cover by fewer than 6 antichains?

Answer

No. Any single antichain can contain at most one element of a chain (chain elements are pairwise comparable, antichain elements pairwise incomparable). The longest chain has 6 elements, so covering it alone needs at least 6 antichains — hence the whole poset needs $\ge 6$, and our 6-layer cover is optimal. That is Mirsky's theorem in action.

Phase 4: A legal order — and catching the regression

To actually run the pipeline, the scheduler needs one concrete sequence respecting all constraints: a topological sort. We use Kahn's algorithm — the topo_sort you built in this chapter's Project Checkpoint.

from dmtoolkit.relations import topo_sort

print(topo_sort(pipeline))
# Expected output:
# ['checkout', 'install', 'build', 'e2e', 'lint', 'unit', 'integration', 'package', 'deploy']

The exact order among same-layer jobs depends on Kahn's tie-break — our topo_sort keeps the ready queue sorted alphabetically and pops the front, so once install frees build, lint, unit, it takes build first, and after build frees e2e the queue becomes ['e2e', 'lint', 'unit'], so e2e goes next. Every output, whatever the tie-break, places each job after all its prerequisites. The schedule exists, so the pipeline is runnable. Good.

Now the regression. A teammate, trying to "speed things up," added a dependency so that install would reuse a cache produced by build — making install depend on build. They did not notice that build already depends on install. The new graph adds the edge build → install:

broken = dict(pipeline)
broken['build'] = ['integration', 'e2e', 'install']   # build now also feeds install
print(topo_sort(broken))
# Expected output:
# ValueError: graph has a cycle; no topological order exists

Hand-trace why it fails. With both install → build and build → install, neither ever reaches in-degree 0: install waits on build, which waits on install. Kahn's algorithm finds no minimal element among them, the ready queue empties before all nodes are output, and the length check trips the ValueError. That exception is the diagnosis: the pipeline now contains a circular dependency, and the math of §13.3 (antisymmetry forbids cycles; topological sort exists iff acyclic) guarantees no legal order can exist until the offending edge is removed.

🚪 Threshold Concept. The same algorithm did two completely different jobs in this case study: on a valid poset it produced a schedule; on an invalid one it proved no schedule exists and pointed at the bug. That duality — "the constructive proof is the algorithm, and its failure is a theorem about your input" — is why §13.3 insisted that topological sort requires acyclicity. A build tool that refuses to run is not malfunctioning; it is reporting a mathematical fact about your dependency graph.

⚠️ Common Pitfall: It is tempting to "fix" the cycle error by catching the exception and falling back to some order. That hides a real bug — a build that ignores a circular dependency will build something against a stale or missing artifact. The error is load-bearing; the correct fix is to delete the edge that created the cycle (here, revert install's dependence on build), not to suppress the message.

Discussion Questions

We assumed every job takes exactly one minute. If jobs have different durations, the "critical path" becomes the longest weighted path, not the longest chain by count. Sketch how you'd modify critical_path_length to sum durations along the path instead of counting jobs. Which line changes?
The largest antichain bounded the useful worker count at 3. But the antichain ${\text{lint}, \text{unit}, \text{build}}$ is only reachable after install. Explain why "3 workers" is an upper bound on simultaneous useful work, not a guarantee that 3 workers are busy for the whole run.
Mirsky's theorem gave a 6-antichain cover matching the 6-chain. Dilworth's theorem (§13.2) is the dual: the largest antichain equals the minimum number of chains covering the poset. What would a minimum chain cover of this pipeline represent operationally? (Hint: think "assign each job to a worker as a sequential lane.")
The cycle was introduced by someone who didn't realize build already (transitively) preceded install. How does the Hasse diagram — which hides transitive edges — make such a mistake easier to commit, and what tooling would catch it before it merged?

Your Turn: Extensions

Option A (weighted critical path). Give each job a duration (e.g. build = 3, e2e = 4, others = 1). Rewrite Phase 2 to compute the longest weighted chain, and recompute the true minimum runtime. Which job is now the most valuable to speed up, and why is it not necessarily the longest-running one?
Option B (schedule simulation). Using the layered output from Phase 3 and exactly 2 workers (not 3), hand-simulate the run minute by minute and compute the actual finish time. Compare it to the 6-minute critical path: where does the 2-worker limit cost you, and which layer is the bottleneck?
Option C (cycle localization). Extend topo_sort so that when it detects a cycle, it also returns the set of nodes still having positive in-degree — the jobs trapped in (or downstream of) the cycle. Argue that this set always contains every node of the actual cycle, and test it on broken.

Key Takeaways

A dependency pipeline is a poset; its structure answers performance questions. "How fast can this possibly run?" is the longest chain; "how many workers help?" is the largest antichain — both are chapter concepts, not vague engineering intuitions.
The critical path is a hardware-proof speed limit. Only re-architecting the dependency graph beats it; adding workers cannot. Measure the longest chain before buying machines.
Mirsky's theorem makes the layered schedule optimal. The minimum number of parallel rounds equals the longest chain, so the depth-layering is the best round structure available.
A failed topological sort is a diagnosis, not a glitch. It proves a circular dependency exists; suppressing the error hides a real bug. Antisymmetry forbids cycles, and "sortable iff acyclic" is the theorem your build tool is enforcing.