Case Study: Diagnosing a Capacity Bottleneck in a Data-Center Scheduler

DataField.Dev

Case Study: Diagnosing a Capacity Bottleneck in a Data-Center Scheduler

"The bottleneck is always somewhere. The only question is whether you know where." — folklore of systems engineering

Executive Summary

A batch scheduler keeps reporting that it cannot place all of tonight's jobs onto the available machines, but it offers no explanation beyond a red INFEASIBLE. In this analysis-focused case study you will model the scheduler as a flow network, compute the maximum number of placeable jobs as a max flow, and — crucially — use the minimum cut to turn "infeasible" into an actionable diagnosis: exactly which group of jobs is over-subscribed for the machines that can run them. You will hand-derive every flow value and every cut, then read the bottleneck straight off the min cut. This is the §34.6 lesson made concrete: a max flow that fails to saturate the source is not a dead end — its min cut tells you what to fix.

This study is analysis-heavy: you are handed a system and a failure, and your job is to explain it rigorously. (Case Study 2 is the opposite — you will build a reusable solver from scratch.)

Skills applied

Modeling a capacitated assignment problem as a flow network with a source, a sink, and edge capacities (§34.1).
Running Ford–Fulkerson / Edmonds–Karp by hand, including a back-edge reroute (§34.3).
Constructing the minimum cut as the residual-reachable set of the source (§34.4).
Reading the min cut as a deficiency diagnosis à la Hall's theorem (§34.5).
Translating a mathematical certificate into an engineering recommendation.

Background

The scenario

You operate a small overnight batch cluster. Tonight there are five jobs to place — J1…J5 — and four machines — M1…M4. Not every job can run on every machine: a job has CPU, memory, and GPU requirements, and a machine either meets them or it does not. The compatibility list (produced by the scheduler's eligibility checker) is:

Job	Eligible machines
`J1`	`M1`, `M2`
`J2`	`M1`
`J3`	`M2`, `M3`
`J4`	`M3`, `M4`
`J5`	`M3`

Each machine can run one job tonight (they are large single-tenant jobs), and each job needs one machine. The scheduler reports it can place only 4 of the 5 jobs and flags the run INFEASIBLE for full placement. Management wants to know: is the scheduler buggy, or is the cluster genuinely under-provisioned — and if so, what is the minimal fix?

💡 Intuition: "Assign jobs to compatible machines, one each, maximize the number placed" is exactly bipartite matching (§34.5): jobs on the left, machines on the right, an edge for each eligibility. The question "can all 5 be placed?" is "does a matching saturate the job side?" And when it cannot, the reason is a Hall-style deficiency — a set of jobs whose combined eligible machines are too few. The min cut will hand us that set.

Why this matters

"Infeasible" with no explanation is the least useful possible error message. An operator staring at it cannot tell whether to retry, to file a bug, or to buy hardware. The max-flow min-cut theorem upgrades the scheduler's output from a binary verdict to a certificate: not just "4 of 5," but "these specific jobs compete for these specific machines, and here is the one change that fixes it." That is the difference between an alert and an answer.

Phase 1: Build the flow network

Apply the §34.5 reduction. Add a source $s$ and a sink $t$. For each job $J_i$, add $s\to J_i$ with capacity 1 (each job is placed at most once). For each eligibility $J_i$–$M_k$, add a middle edge $J_i\to M_k$ with capacity 1. For each machine $M_k$, add $M_k\to t$ with capacity 1 (each machine runs at most one job). A maximum integer flow's value is the number of jobs placeable, and the middle edges carrying flow are the placement.

Here is the network as a capacity dictionary — exactly the form the chapter's max_flow_value consumes.

cap = {
    # source -> each job (capacity 1)
    ("s","J1"):1, ("s","J2"):1, ("s","J3"):1, ("s","J4"):1, ("s","J5"):1,
    # job -> eligible machine (the "middle" edges, capacity 1)
    ("J1","M1"):1, ("J1","M2"):1,
    ("J2","M1"):1,
    ("J3","M2"):1, ("J3","M3"):1,
    ("J4","M3"):1, ("J4","M4"):1,
    ("J5","M3"):1,
    # each machine -> sink (capacity 1)
    ("M1","t"):1, ("M2","t"):1, ("M3","t"):1, ("M4","t"):1,
}

🔄 Check Your Understanding Before running anything: what is the absolute upper bound on the max flow here, just from the source edges? From the sink edges? Which is the tighter ceiling?

Answer

The five $s\to J_i$ edges have total capacity 5, so the flow can be at most 5 (all jobs placed). The four $M_k\to t$ edges have total capacity 4, so the flow can be at most 4. The sink side is the tighter ceiling: with only 4 machines, at most 4 jobs can run regardless of eligibility. So the best conceivable outcome is 4 — already the scheduler's number. The interesting question is which job gets left out and why, which the min cut answers.

Phase 2: Run Edmonds–Karp by hand

We trace Ford–Fulkerson with shortest (BFS) augmenting paths, which is what max_flow_value does. Every capacity is 1, so every bottleneck on a simple $s\to J\to M\to t$ path is 1, and "saturated" means "carries its single unit." BFS explores jobs in the order they appear and picks the lexicographically first machine; we follow that to make the trace reproducible.

Iteration 1. Shortest path $s\to J1\to M1\to t$, bottleneck 1. Place J1→M1. Value 1. Saturated: $s\to J1$, $J1\to M1$, $M1\to t$.

Iteration 2. Next path $s\to J2\to M1\to t$? The edge $M1\to t$ is now saturated, so $M1$ cannot reach $t$ forward. But $M1$ has a residual back edge $M1\to J1$ (capacity 1, the flow we just placed). BFS from $s$: reach $J2$ (via $s\to J2$), then $M1$ (via $J2\to M1$), then — $M1\to t$ is full — take the back edge $M1\to J1$ to reach $J1$, then $J1\to M2$ (residual 1) to reach $M2$, then $M2\to t$ (residual 1) to reach $t$. Augmenting path: $$s \to J2 \to M1 \xrightarrow{\text{back}} J1 \to M2 \to t, \qquad \text{bottleneck} = 1.$$ Apply it: push 1 on $s\to J2$ and $J2\to M1$; the back edge $M1\to J1$ cancels the unit on $J1\to M1$ (so $J1$ is no longer placed on $M1$); push 1 on $J1\to M2$ and $M2\to t$. Net effect: J1 was rerouted from M1 to M2, freeing M1 for J2. Value 2. Placements now: J2→M1, J1→M2.

⚠️ Common Pitfall: Iteration 2 is the heart of the chapter. The reason BFS could even reach $t$ a second time was the back edge $M1\to J1$ — without it (naive forward-only greed) the algorithm would have stalled at value 1, falsely concluding J2 is unplaceable. The back edge is what lets the scheduler discover the reroute "move J1 to M2 so J2 can take M1" automatically.

Iteration 3. BFS for the next path. Available forward room: $s\to J3,J4,J5$; among machines, $M3,M4$ still reach $t$ (their sink edges are unused). Shortest path $s\to J3\to M3\to t$, bottleneck 1. Place J3→M3. Value 3. (M2 is now full via J1; M3 now full via J3.)

Iteration 4. BFS. Forward room out of $s$: $J4, J5$. Try $s\to J4\to M4\to t$ — $M4\to t$ is unused, bottleneck 1. Place J4→M4. Value 4. Now all four machines' sink edges are saturated.

Iteration 5 (the stall). BFS from $s$. The only job with source room left is J5 (the others are saturated except where rerouting could help). Walk it: $s\to J5$ (residual 1) reaches J5; J5's only middle edge is $J5\to M3$, residual 1, reaching M3. From M3: forward $M3\to t$ is saturated; the back edge $M3\to J3$ has residual 1, reaching J3; from J3 the other middle edge $J3\to M2$ has residual 1, reaching M2; from M2: forward $M2\to t$ is saturated; back edge $M2\to J1$ reaches J1; from J1 the other middle edge $J1\to M1$ has residual 1, reaching M1; from M1: forward $M1\to t$ saturated; back edge $M1\to J2$ reaches J2; from J2 the only middle edge is $J2\to M1$ — already visited. Dead end. No residual path reaches $t$. Stop. Max flow $=\mathbf{4}$.

The strategy first. Iteration 5 is worth slowing down: BFS tried every reroute chain — bouncing across back edges through M3→J3→M2→J1→M1→J2 — and still could not reach $t$. That exhaustive failure is precisely condition (ii) of the max-flow min-cut theorem ("no augmenting path remains"), which certifies that 4 is the maximum, not merely a number the algorithm gave up on.

The placement read off the flow-carrying middle edges: J1→M2, J2→M1, J3→M3, J4→M4. Job J5 is unplaced. We can confirm the value 4 directly with the chapter's solver.

from collections import deque

def max_flow_value(cap, s, t):
    res = {}
    for (u, v), c in cap.items():
        res.setdefault(u, {}); res.setdefault(v, {})
        res[u][v] = res[u].get(v, 0) + c
        res[v].setdefault(u, 0)
    total = 0
    while True:
        parent, q = {s: None}, deque([s])
        while q:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u; q.append(v)
        if t not in parent:
            return total
        v, bottleneck = t, float("inf")
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v]); v = parent[v]
        v = t
        while parent[v] is not None:
            res[parent[v]][v] -= bottleneck
            res[v][parent[v]] += bottleneck
            v = parent[v]
        total += bottleneck

print(max_flow_value(cap, "s", "t"))
# Expected output:
# 4

Phase 3: Construct the minimum cut — the diagnosis

A value of 4 tells us that one job is unplaced; it does not yet tell us why. For that we build the minimum cut, which by §34.4 is $S$ = the set of vertices reachable from $s$ in the final residual network. We already computed that reachable set in Iteration 5's stalled BFS — that is the residual reachability from $s$ at the max flow. Collecting the vertices the BFS touched:

$$S = \{\,s,\ J5,\ M3,\ J3,\ M2,\ J1,\ M1,\ J2\,\}, \qquad T = V\setminus S = \{\,J4,\ M4,\ t\,\}.$$

Let us sanity-check this against the theorem by computing the cut capacity (only $S\to T$ edges count):

$J3\to ?$: $J3$'s machines are M2 (in $S$) and M3 (in $S$) — no crossing edge.
$J5\to M3$: M3 $\in S$ — not crossing.
$J1\to M1$, $J1\to M2$: both machines in $S$ — not crossing.
$J2\to M1$: in $S$ — not crossing.
Where does $S$ touch $T=\{J4,M4,t\}$? The job J4 is in $T$, reached from $s$ only via $s\to J4$, but $s\in S$ and $J4\in T$, so $s\to J4$ crosses (capacity 1). Also M4$\in T$: the only edge into M4 is $J4\to M4$ with $J4\in T$ (not crossing from $S$). And the sink edges $M1\to t, M2\to t, M3\to t$ have tails in $S$ and head $t\in T$: each crosses (capacity 1 each).

So the crossing $S\to T$ edges are $\{\,s\to J4,\ M1\to t,\ M2\to t,\ M3\to t\,\}$, total capacity $1+1+1+1 = 4$. This equals the max flow, confirming $(S,T)$ is a minimum cut and 4 is provably optimal.

🔗 Connection: Notice that $M4\to t$ is not in the cut, even though it crosses to $t$ — because its tail M4 is in $T$, not $S$. Only $S\to T$ edges count (§34.4). This is the rule students most often trip on: a saturated sink edge is on the cut only if its machine is on the source side.

Now translate the cut into a deficiency, Hall-style (§34.5). Let $W$ be the jobs on the source side that are forced to compete: read off the jobs in $S$ whose every eligible machine is also in $S$ and whose sink edges are saturated. Concretely, look at the jobs $\{J1,J2,J3,J5\}\subseteq S$ and their combined eligible machines:

$$N(\{J1,J2,J3,J5\}) = \{M1,M2\}\cup\{M1\}\cup\{M2,M3\}\cup\{M3\} = \{M1,M2,M3\}.$$

That is four jobs sharing only three machines: $\lvert W\rvert = 4 > 3 = \lvert N(W)\rvert$. Hall's condition fails for $W$, so no matching can place all four — and since they are disjoint from J4/M4, the global deficiency is exactly one job. This is the diagnosis: jobs J1, J2, J3, J5 collectively require machines from $\{M1,M2,M3\}$ only, and there are four of them for three slots. One must wait.

💡 Intuition: The min cut "explains itself." The cut edges $\{M1\to t,\ M2\to t,\ M3\to t\}$ are the three saturated machine slots that the competing jobs fight over; the cut edge $s\to J4$ says J4 is on the far side, comfortably served by M4, and is irrelevant to the shortage. The bottleneck is the trio of machines $\{M1,M2,M3\}$.

Cross-checking the deficiency

A careful analyst does not trust a single derivation; they corroborate it from a second angle. We found the deficient set $W=\{J1,J2,J3,J5\}$ by reading the min cut. Let us independently confirm it satisfies Hall's failing condition and that no smaller sub-group already fails — because the smallest violator is the most informative diagnosis.

Enumerate the small candidate sub-groups and their neighborhoods:

$W$	$N(W)$	$\lvert W\rvert$ vs. $\lvert N(W)\rvert$	Deficient?
$\{J2\}$	$\{M1\}$	$1$ vs. $1$	no (tight)
$\{J2, J5\}$	$\{M1, M3\}$	$2$ vs. $2$	no (tight)
$\{J3, J5\}$	$\{M2, M3\}$	$2$ vs. $2$	no (tight)
$\{J2, J3, J5\}$	$\{M1, M2, M3\}$	$3$ vs. $3$	no (tight)
$\{J1, J2, J3, J5\}$	$\{M1, M2, M3\}$	$4$ vs. $3$	yes

The smallest deficient set is the full $W=\{J1,J2,J3,J5\}$: adding J1 to the already-tight trio $\{J2,J3,J5\}$ tips it over, because J1's machines (M1,M2) are already inside the neighborhood $\{M1,M2,M3\}$ — J1 brings demand but no new supply. This pinpoints J1 as the "straw that breaks the camel's back," even though any of the four could be the one ultimately left waiting. The min cut and the Hall enumeration agree, which is the corroboration we wanted.

🔗 Connection: The "tight" rows ($\lvert W\rvert = \lvert N(W)\rvert$) are exactly the sub-groups that are fully matched but with zero slack — they sit on the knife-edge of feasibility. Chapter 32's cut property had the same flavor: the binding constraint is where supply exactly meets demand. Recognizing "tight = binding" transfers across every optimization problem in this book.

A sensitivity question: which single upgrade helps?

Before recommending a fix, an analyst asks what-if. Suppose we may add exactly one seat to exactly one machine. Which machine yields a higher max flow? The min cut answers this without re-running the algorithm five times. The cut edges are $\{M1\to t,\ M2\to t,\ M3\to t\}$ (plus $s\to J4$). Adding a seat to a machine on the cut raises that cut edge's capacity by 1, so the min cut rises to 5 and the flow can follow — provided a residual path then exists to use it.

Upgrade M1, M2, or M3 (on the cut): the cut capacity becomes 5; an augmenting path appears (e.g., with M3 at capacity 2, the stalled chain $s\to J5\to M3\to t$ now has residual at $M3\to t$), flow rises to 5. ✓
Upgrade M4 (off the cut): M4's sink edge was not in the min cut, so raising it does not raise the min cut; the flow stays at 4. The extra seat sits idle because no competing job can reach M4. ✗

This is point 3 after the max-flow min-cut proof (§34.4) operationalized: only widening a min-cut edge helps. A manager who "adds a server" by upgrading the least-loaded machine (M4) would spend money and see no improvement — the analysis prevents that mistake.

Phase 4: From certificate to recommendation

The min cut points at exactly the constraint to relax. Two principled fixes, each justified by the cut:

Add capacity where the cut is. The cut's machine edges are $M1\to t, M2\to t, M3\to t$. Make one of M1, M2, or M3 able to run a second job tonight — i.e., raise its sink edge to capacity 2. That adds 1 to the cut and lets the flow rise to 5. (Adding capacity to M4, which is not on the cut, changes nothing — point 3 after the theorem in §34.4.)
Broaden eligibility into $T$. Make some competing job eligible for M4 (the idle machine on the $T$ side). For instance, if J5 could also run on M4, add the middle edge $J5\to M4$. Then the stalled BFS in Iteration 5 would reach M4 (which still reaches $t$), an augmenting path $s\to J5\to M4\to t$ would appear, and the flow would climb to 5.
Defer the cheapest job. If neither hardware nor eligibility can change tonight, accept the verdict and choose which of the four competing jobs to defer to the next run — a policy decision the cut has now scoped to exactly four candidates ($J1,J2,J3,J5$), not all five. The deferral cost (priority, SLA, deadline) is a business input; the math has narrowed the choice set.

Notice the analysis does more than say "add a machine." It tells you which machines are even worth touching (those on the cut), rules out the tempting-but-useless upgrade (idle M4), and — if no upgrade is possible — shrinks the deferral decision from five jobs to four. Each option is anchored to a specific structural feature of the min cut: option 1 widens a cut edge, option 2 routes flow into the unsaturated $T$ side, option 3 reads the cut as a deferral scope. A recommendation that names which lever and why is worth far more than a generic "buy more hardware."

Either way, the answer to management is no longer "infeasible." It is: "Four of five jobs placed; the shortage is that jobs J1, J2, J3, J5 can only use machines M1, M2, M3 — three slots for four jobs. Cheapest fixes: let one of M1/M2/M3 take a second job tonight, or qualify one of those jobs to also run on the idle M4; otherwise defer one of those four (not all five)." That is a certificate plus a remedy, derived entirely from the min cut.

🐛 Find the Error. An on-call engineer argues: "The flow stalled at 4, and I found a cut of capacity 5 (the source cut $S=\{s\}$ has the five edges $s\to J_i$, total 5). Since 4 < 5, we should be able to push more — the scheduler has a bug." Where is the reasoning wrong?

Answer

Weak duality only says max flow $\le$ every cut, so a flow being below some cut (here the slack source cut of capacity 5) proves nothing about optimality. Optimality is certified by a cut of equal capacity — the min cut of capacity 4 we constructed — not by any larger cut. The "gap" to the source cut is real but irrelevant; the binding cut is the one through the three machine slots. No bug.

Discussion Questions

We argued the sink side (4 machines) was the tighter ceiling before running anything. In a different night with 6 machines and 5 jobs, which side would bound the flow, and how would that change your first instinct about where a shortage could live?
The deficient set we found was $W=\{J1,J2,J3,J5\}$ with $N(W)=\{M1,M2,M3\}$. Is $W$ the only deficient set, or could a smaller sub-group already violate Hall's condition? Check $\{J2,J5\}$ and explain.
Iteration 2 rerouted J1 from M1 to M2 via a back edge. If the scheduler logged its placements during the run, why would a naive log be misleading, and what should it log instead to report a stable final assignment?
The min cut gave two fix options (add machine capacity vs. broaden eligibility). In a real cluster, which is usually cheaper, and how does the cut's structure (which edges it contains) tell you which levers even exist?
Suppose every machine could run two jobs (sink edges capacity 2) but jobs still need one machine each. Re-state the model and predict the new max flow without re-running the full trace, using the sink-side ceiling and the eligibility structure.

Your Turn: Extensions

Option A (re-run the diagnosis). Add the edge $J5\to M4$ (capacity 1) to cap, hand-trace the one new augmenting path, and confirm with max_flow_value that the max flow becomes 5. Then construct the new min cut and verify it has capacity 5.
Option B (capacitated machines). Change ("M3","t") to capacity 2 (so M3 can run two jobs). Predict the new max flow by hand, then verify. Did J5 get placed? Explain via the min cut.
Option C (the diagnosis tool). Using min_cut_side from Exercise 34.25, write a function placement_report(cap, s, t) that returns (the max-flow value, the deficient job set $W$, its neighborhood $N(W)$). Hand-derive its output on tonight's cap and confirm it matches Phase 3.
Option D (stress the model). Construct a 6-job / 5-machine night where the source side is the binding constraint, find the min cut, and write the one-sentence operator-facing diagnosis.

Key Takeaways

Capacitated assignment is max flow. Jobs-to-machines with one-each limits is the §34.5 reduction; the max-flow value is the number placeable, and the flow-carrying middle edges are the placement.
"Infeasible" is never the end — the min cut is the diagnosis. When the flow fails to saturate the source, the residual-reachable set $S$ yields a min cut whose edges name the exact bottleneck (here: three machine slots fought over by four jobs).
A back edge can reroute a placement automatically. Iteration 2's $M1\to J1$ back edge moved J1 to free M1 for J2 — the undo that naive greed lacks, and the reason the flow reached its true maximum.
The cut tells you what to fix, and what not to. Upgrading an edge on the min cut raises the flow; upgrading anything else (like idle M4's slot) changes nothing — optimization with a target, not guesswork.