Case Study: Building a Complexity Fingerprinter

DataField.Dev

Case Study: Building a Complexity Fingerprinter

"An algorithm must be seen to be believed." — Donald E. Knuth, The Art of Computer Programming, Vol. 1

Executive Summary

In this build-focused case study you will construct a small, reusable tool — a complexity fingerprinter — that takes any algorithm, runs it at a ladder of growing input sizes, counts the operations, and reports its likely growth class from the doubling ratios. Where Case Study 1 diagnosed one endpoint, here you build the instrument that does the diagnosing automatically, then turn it loose on a battery of algorithms (constant, logarithmic, linear, quadratic, cubic) and watch each one announce its $\Theta$-class through its ratio signature. You will assemble the tool in layers — an op counter, a size ladder, a ratio reporter, and finally a classifier that maps a ratio to a label — and at each layer you will hand-derive the expected output so you trust it without ever running it. The result extends the chapter's Project Checkpoint (complexity.py) into a genuinely useful developer tool, and it makes the §14.5 hierarchy tangible: you will see the numbers $1, 2, 4, 8$ fall out of the code.

Skills applied

Building an instrument incrementally, hand-deriving each layer's output (no code execution).
Counting operations for several algorithms and predicting their doubling ratios (§14.4, §14.5).
Mapping the empirical ratio $\approx 2^k$ back to the growth class $\Theta(n^k)$ (Project Checkpoint).
Distinguishing what measurement can establish from what only a proof settles (theme four).

Background

The scenario

Your team keeps shipping "accidentally quadratic" endpoints (the failure mode of Case Study 1). You decide to build a guardrail: a tiny library, fingerprint, that any engineer can point at a function to get a quick read on its growth class before it reaches production. The design constraint, straight from §14.2, is that it must be machine-independent — it must not use wall-clock timing, because timing is noisy and changes with hardware. Instead it counts operations, exactly as the RAM model prescribes, and reports the doubling-ratio fingerprint from §14.5: double $n$, and the work multiplies by $2^k$ for a $\Theta(n^k)$ algorithm.

💡 Intuition: You can read an algorithm's growth class off a ratio without ever knowing the hidden constant $c_2$. If $T(n) = c_2 n^k$, then $\frac{T(2n)}{T(n)} = \frac{c_2 (2n)^k}{c_2 n^k} = 2^k$ — the $c_2$ cancels. That cancellation is why the fingerprinter works on any machine: it never needs to know how fast the machine is, only how the work scales.

Why this matters

A team that can fingerprint a function in thirty seconds catches the $\Theta(n^2)$ surprise of Case Study 1 in code review instead of in an outage. And building the tool, rather than just using one, forces you to internalize the §14.4–14.5 machinery: every layer you write is a piece of the chapter made executable. The tool is also the Project Checkpoint's complexity.py grown up — you will reuse count_ops and doubling_ratios and add a classifier on top.

Phase 1: The op counter — instrument one algorithm

The foundation is a run function: given an input, it returns the number of primitive operations the algorithm performs (you instrument the algorithm yourself, as §14.2 requires — the RAM model counts operations, so we count them explicitly). Start with the simplest case, a linear scan.

def linear_run(data):
    """Instrumented O(n) scan: one operation per element. Returns the op count."""
    ops = 0
    for _x in data:
        ops += 1            # count visiting each element
    return ops

print(linear_run([10, 20, 30, 40]))
# Hand trace: 4 elements, one op each.
# Expected output:
# 4

On a list of length $n$, linear_run returns exactly $n$. That is the operation count $T(n) = n$, the ground truth we will fingerprint. The discipline here is the §14.2 one: a "run" function reports what the machine actually does, not how many lines you typed.

🔄 Check Your Understanding Write the one-line predicted operation count $T(n)$ for an instrumented quadratic run (two nested range(n) loops, one op in the body) and for an instrumented cubic run (three nested loops).

Answer

Quadratic: $T(n) = n^2$ (the body runs $n \cdot n$ times). Cubic: $T(n) = n^3$ (it runs $n \cdot n \cdot n$ times). These are the §14.4 nested-loop counts.

Phase 2: The size ladder and ratio reporter

Next, run a given algorithm at a ladder of sizes and report the ratio of successive op counts. This is the chapter's count_ops from complexity.py; we reproduce it so the tool is self-contained, then build on it.

def count_ops(make_input, run, sizes):
    """For each n in sizes: build an input of that size and record the op count.
    Returns a list of (n, ops, ratio_to_previous)."""
    results, prev = [], None
    for n in sizes:
        ops = run(make_input(n))
        ratio = None if prev is None else ops / prev
        results.append((n, ops, ratio))
        prev = ops
    return results

# Fingerprint the linear algorithm on a doubling ladder.
print(count_ops(lambda n: list(range(n)), linear_run, [10, 20, 40]))
# Hand trace: op counts are n itself -> 10, 20, 40.
#   ratios: first is None; 20/10 = 2.0; 40/20 = 2.0.
# Expected output:
# [(10, 10, None), (20, 20, 2.0), (40, 40, 2.0)]

The make_input argument is what lets the tool fingerprint any algorithm: you supply a function that builds a valid input of size $n$ (here, list(range(n))). For a linear algorithm the ratios sit at exactly $2.0$ — doubling the input doubled the work, the §14.5 signature of $\Theta(n)$.

⚠️ Common Pitfall: The make_input must build an input whose size is $n$ in the sense the algorithm cares about. For a sorting routine, size is the list length; for an integer routine, size is the bit length $\log_2 N$, not the value (§14.2). Feeding the wrong notion of size produces a fingerprint of the wrong function — a classic §14.2 trap.

Phase 3: From ratios to a verdict — the classifier

Raw ratios are informative, but the guardrail should print a label. Add a thin classifier that maps an observed ratio to its nearest growth class. The map is the §14.5 fingerprint table read backwards: a ratio near $1$ means logarithmic, near $2$ linear, near $4$ quadratic, near $8$ cubic.

def doubling_ratios(make_input, run, start=100, doublings=4):
    """Run count_ops on start, 2*start, 4*start, ...; return just the ratios."""
    sizes = [start * (2 ** i) for i in range(doublings + 1)]
    return [r for (_, _, r) in count_ops(make_input, run, sizes) if r is not None]

def classify(ratio):
    """Map an observed doubling ratio to its nearest growth-class label."""
    table = [(1, "Theta(log n)"), (2, "Theta(n)"),
             (4, "Theta(n^2)"), (8, "Theta(n^3)")]
    best = min(table, key=lambda kv: abs(ratio - kv[0]))  # nearest target
    return best[1]

print(classify(3.97), classify(2.01), classify(1.05))
# Hand trace: 3.97 nearest to 4 -> n^2; 2.01 nearest to 2 -> n; 1.05 nearest to 1 -> log n.
# Expected output:
# Theta(n^2) Theta(n) Theta(log n)

classify simply finds which of the target ratios $\{1, 2, 4, 8\}$ the observed ratio is closest to and returns the matching label. It is deliberately crude — a ratio of $3$ is genuinely ambiguous (between linear and quadratic) and the tool will guess — but for the clean power-law algorithms it diagnoses, it is exactly right. We will see its limits in Phase 5.

🔗 Connection. The mapping ratio $\mapsto 2^k \mapsto \Theta(n^k)$ is the §14.5 claim "doubling $n$ multiplies $\Theta(n^k)$ work by $2^k$" turned into a lookup. Building the lookup is how you prove to yourself you understand the claim.

Phase 4: Turn it loose — fingerprint the whole hierarchy

Now assemble the tool against a battery of instrumented algorithms and watch each announce its class. We write one run-function per growth class and fingerprint them all.

def constant_run(data):                     # Theta(1): work ignores input size
    return 1

def quadratic_run(data):                    # Theta(n^2): full n x n
    n, ops = len(data), 0
    for _i in range(n):
        for _j in range(n):
            ops += 1
    return ops

def cubic_run(data):                        # Theta(n^3): three nested loops
    n, ops = len(data), 0
    for _i in range(n):
        for _j in range(n):
            for _k in range(n):
                ops += 1
    return ops

mk = lambda n: list(range(n))
for name, run in [("linear", linear_run), ("quadratic", quadratic_run),
                  ("cubic", cubic_run)]:
    ratios = doubling_ratios(mk, run, start=10, doublings=2)
    print(name, ratios, classify(ratios[-1]))
# Hand trace (start=10, sizes 10,20,40):
#   linear:    ops 10,20,40   -> ratios [2.0, 2.0]   -> nearest 2  -> Theta(n)
#   quadratic: ops 100,400,1600 -> ratios [4.0, 4.0] -> nearest 4  -> Theta(n^2)
#   cubic:     ops 1000,8000,64000 -> ratios [8.0, 8.0] -> nearest 8 -> Theta(n^3)
# Expected output:
# linear [2.0, 2.0] Theta(n)
# quadratic [4.0, 4.0] Theta(n^2)
# cubic [8.0, 8.0] Theta(n^3)

Let us hand-verify the cubic line, since it is the most error-prone. cubic_run on size $n$ returns $n^3$. The sizes $10, 20, 40$ give op counts $1000, 8000, 64000$. The ratios are $8000/1000 = 8.0$ and $64000/8000 = 8.0$ — pinned at $8 = 2^3$, exactly the §14.5 prediction for $\Theta(n^3)$, and classify reports it. The tool works: each algorithm's growth class falls out of the doubling numbers, no timing and no hidden constants required.

🚪 Threshold Concept. You have just built a machine that measures growth class — an abstract, machine-independent property — using nothing but counting and division. This is the §14.3 reframing made concrete: you stopped asking "how many milliseconds?" and built a tool that asks "how does cost scale?", and that question has a clean numerical answer ($1, 2, 4, 8, \dots$) that no hardware change can disturb. Seeing the hierarchy emerge from your own code is the moment asymptotics stops being notation and becomes a lens.

Phase 5: Where the instrument lies — and why you still need proof

A good engineer knows their tool's failure modes. The fingerprinter measures; it does not prove. Three honest limitations, each a chapter lesson:

It can be fooled by small inputs. At $n = 10$ a $\Theta(n \log n)$ algorithm and a $\Theta(n)$ algorithm produce nearly identical ratios; the $\log n$ factor only separates at large $n$ (§14.5: the hierarchy is about eventual growth). Run the ladder too low and classify mislabels.
It cannot distinguish $\Theta(n)$ from $\Theta(n \log n)$ cleanly at all. Doubling $n$ multiplies $n \log n$ work by $2 \cdot \frac{\log 2n}{\log n} = 2(1 + \frac{1}{\log n})$, which approaches $2$ from above. The ratio for linearithmic drifts toward $2.0$ but never quite the flat $2.0$ of linear — a difference the crude classify rounds away.
A passing fingerprint is evidence, not a guarantee (theme four). The tool sampled a handful of sizes; the algorithm might change behavior at sizes you did not test, or on inputs your make_input never builds. Only a proof from the §14.3 definition settles the growth for all $n$.

def linearithmic_run(data):                 # Theta(n log n): n passes of a halving inner loop
    import math
    n = len(data)
    if n <= 1:
        return n
    inner = max(1, int(math.log2(n)))       # ~ log2(n) inner steps per element
    return n * inner

print(doubling_ratios(lambda n: list(range(n)), linearithmic_run, start=16, doublings=2))
# Hand trace (sizes 16,32,64): inner = log2 -> 4,5,6.
#   ops: 16*4=64, 32*5=160, 64*6=384.  ratios: 160/64=2.5, 384/160=2.4.
# Expected output:
# [2.5, 2.4]

The ratios $2.5, 2.4$ are between the linear $2.0$ and the quadratic $4.0$, and they are creeping downward toward $2.0$ as $n$ grows — the signature of an extra $\log n$ factor. A human reading "ratios a bit above 2, slowly falling" correctly infers $\Theta(n \log n)$; the crude classify would round $2.4$ to "$\Theta(n)$" and be wrong. The measurement suggests; the §14.3 proof decides.

⚠️ Common Pitfall: Do not trust a single ratio. One ratio of $2.4$ is ambiguous; the trend across several doublings (is it flat? rising toward $4$? falling toward $2$?) is what carries the signal. This is the empirical analogue of the §14.5 rule that growth is about behavior as $n \to \infty$, not at one input.

Discussion Questions

The fingerprinter cancels the hidden constant $c_2$ by taking a ratio. Show algebraically that for $T(n) = c_2 n^k$, the ratio $T(2n)/T(n)$ is exactly $2^k$ regardless of $c_2$. Why is this the property that makes the tool machine-independent (§14.2)?
classify snaps to the nearest of $\{1, 2, 4, 8\}$. What target would you add to detect $\Theta(\sqrt{n})$, and what ratio should it look for? (Hint: doubling $n$ multiplies $\sqrt{n}$ by what factor?)
Phase 5 showed the tool cannot cleanly separate $\Theta(n)$ from $\Theta(n \log n)$. Propose a change to the experiment (not the algorithm) that would make the distinction visible. (Hint: §14.5 says the gap only opens up for large $n$.)
Suppose the fingerprinter reports flat ratios of $4.0$ for a function the author claims is $\Theta(n \log n)$. Which of the two — the measurement or the claim — is more likely wrong, and how would you settle it definitively (theme four)?
The tool counts operations you instrument by hand. What goes wrong if the algorithm contains a hidden $\Theta(n)$ operation the author forgot to count — say a list + [x] inside the loop (§14.2)? How would the fingerprint expose the oversight even though the source "looks" linear?

Your Turn: Extensions

Option A (add $\Theta(\sqrt{n})$). Extend classify's table with a $\sqrt{n}$ entry (target ratio $\sqrt{2} \approx 1.41$) and write an instrumented sqrt_run whose op count is $\lfloor\sqrt{n}\rfloor$. Hand-derive the ratios on sizes $100, 400, 1600$ (note these quadruple, so the size step is $\times 4$, and $\sqrt{\cdot}$ then doubles) and confirm the new label fires.
Option B (a confidence flag). Make classify return "ambiguous" when the observed ratio is more than, say, $0.4$ away from every target. Hand-trace it on the linearithmic ratios $2.5, 2.4$ from Phase 5 and show it now honestly refuses to guess.
Option C (fingerprint Case Study 1). Point the finished tool at the mutuals_run instrumentation from Case Study 1 (start=10, doublings=2). Predict the output, hand-derive it, and confirm the tool independently re-discovers the $\Theta(n^2)$ bug — then write one sentence explaining why you would still prove the bound from the §14.3 definition before reporting it to the team.

Key Takeaways

Build the instrument and you internalize the theory. Each layer — op counter, size ladder, ratio reporter, classifier — is a piece of §14.4–14.5 turned executable; assembling them is how the doubling-ratio fingerprint becomes second nature.
Ratios cancel constants, which is why they are machine-independent. $T(2n)/T(n) = 2^k$ for $\Theta(n^k)$ regardless of the hidden $c_2$ — the tool never needs to know how fast the machine is, only how the work scales (§14.2).
The hierarchy is visible in the numbers $1, 2, 4, 8$. Logarithmic, linear, quadratic, and cubic algorithms announce themselves through their ratio signatures — the §14.5 hierarchy made tangible.
Measurement suggests; proof decides. The fingerprinter is fooled by small inputs and cannot cleanly separate $n$ from $n \log n$; only a proof from the §14.3 definition settles the growth class for all $n$. Use both — that is theme four (§14.5, Project Checkpoint).