Case Study: Why No Hash Function Is Collision-Free — A Cardinality Audit

DataField.Dev

Case Study: Why No Hash Function Is Collision-Free — A Cardinality Audit

"The infinite! No other question has ever moved so profoundly the spirit of man." — David Hilbert

Executive Summary

A vendor is pitching your team a content-addressable storage system. The headline claim on the slide is seductive: "Every file gets a unique 256-bit fingerprint. No two different files ever collide — store once, deduplicate forever, with mathematically guaranteed integrity." Your job is not to run their benchmark. Your job is to decide whether the claim is even possible before a single byte is written.

This is a pure cardinality audit, and it is exactly the kind of reasoning §10.4 was built for: you compare the size of the set of inputs against the size of the set of outputs and let the arithmetic of infinity deliver the verdict. You will find that the strong claim ("no two files ever collide") is not merely unlikely — it is mathematically impossible, because a countably infinite (in fact, even just a large finite) set of inputs cannot inject into a finite set of fingerprints. Then you will refine the audit: you will see which weaker claim is true and provable, why cryptographic hashes are still worth deploying, and where the genuinely uncountable sets ($\mathbb{R}$, infinite streams) change the analysis entirely. The deliverable is a one-page memo your team can defend in a design review.

Skills applied

Translating a product claim into a precise statement about an injection between two sets (§10.1).
Distinguishing finite, countably infinite, and uncountable input spaces — and why the distinction changes the conclusion (§10.2–10.3).
Applying "more inputs than outputs ⇒ a collision must exist," the finite shadow of §10.4's "more problems than programs."
Separating impossible (no collisions ever) from infeasible-to-find (collisions exist but are hard to exhibit) — a distinction that mirrors "exists" vs. "here is one" from §10.4.

Background

The scenario

Content-addressable storage (CAS) is real and widely deployed: Git names every object by the hash of its contents; Docker layers, IPFS, and most backup-deduplication systems do the same. The principle is simple and powerful — if two blobs hash to the same fingerprint, treat them as the same blob and store it once. The vendor's system, call it HashVault, uses a 256-bit fingerprint, so there are exactly $2^{256}$ possible fingerprints. Their marketing makes three claims, and your team needs to sort them:

Claim 1 (the strong one). No two distinct files ever produce the same fingerprint.

Claim 2 (the operational one). In practice you will never observe a collision.

Claim 3 (the integrity one). An attacker cannot feasibly construct two distinct files with the same fingerprint.

These sound like restatements of one another. They are not. One is provably false, one is true but needs care, and one is a hardness assumption that lives outside this chapter. Cardinality is the tool that cleanly separates the first from the rest.

💡 Intuition: A hash function is a machine that takes something from a huge set (all possible files) and stamps it with a label from a fixed, smaller set (all 256-bit strings). The whole audit is one question from §10.1 in disguise: can a copy of the input set fit inside the output set? If the input set is bigger, the answer is no, and "bigger" is decided by cardinality, not by how the machine is built.

Why this matters

If you believe Claim 1, you will design systems that treat a fingerprint match as proof of content identity — and you will be wrong in a way that can corrupt data silently or be weaponized. The 2017 SHAttered result, which produced two distinct PDFs with the same SHA-1 hash, broke exactly this assumption for a hash the industry had trusted for two decades. The mathematics that guaranteed such a collision existed long before anyone found one — and that guarantee is a one-line counting argument you are about to reconstruct. Knowing why it must exist tells you what to design around.

Phase 1: Model the inputs and outputs as sets

Before any argument, name the two sets precisely.

Outputs. A fingerprint is a 256-bit string. The set of outputs is $$\mathrm{Out} = \{0,1\}^{256}, \qquad \lvert \mathrm{Out}\rvert = 2^{256}.$$ This is a finite set — astronomically large (about $1.16 \times 10^{77}$, comparable to the number of atoms in the observable universe), but finite.
Inputs. A "file" is a finite sequence of bytes — a finite string over the 256-symbol byte alphabet. The set of all possible files is $$\mathrm{In} = \{\text{finite byte strings}\} = \bigcup_{n=0}^{\infty} \{0,\dots,255\}^{n}.$$ By §10.4, Step 1, this is a countably infinite set: list files by length, then in byte order, and every file lands at a finite position. So $\lvert \mathrm{In}\rvert = \aleph_0$.

A hash function is a total function $h\colon \mathrm{In} \to \mathrm{Out}$. The vendor's Claim 1 is the assertion that $h$ is injective (distinct inputs, distinct outputs).

🔗 Connection. "List files by length, then alphabetically, and each gets a finite position" is verbatim the §10.4 enumeration of programs — a program and a file are both just finite strings over a finite alphabet. The countability of all files is the same theorem as the countability of all programs.

🔄 Check Your Understanding Is the set of all possible files finite, countably infinite, or uncountable? What about the set of 256-bit fingerprints?

Answer

All possible files: countably infinite ($\aleph_0$) — finite strings over a finite alphabet. The fingerprints: finite ($2^{256}$ of them). The audit is therefore "can a countably infinite set inject into a finite one?"

Phase 2: The strong claim is impossible (the counting argument)

Now run the audit. We want to know whether an injection $h\colon \mathrm{In} \to \mathrm{Out}$ can exist when $\lvert \mathrm{In}\rvert = \aleph_0$ and $\lvert \mathrm{Out}\rvert = 2^{256}$.

The strategy first. This is the finite shadow of §10.4. There we had countably many programs and uncountably many problems, so no surjection from programs onto problems could exist. Here we have a countably infinite input set and a finite output set, so no injection from inputs into outputs can exist — there are simply too few labels. The mechanism is the pigeonhole principle, which is itself a statement about cardinality: you cannot fit more pigeons than holes without doubling up.

Claim. No function $h\colon \mathrm{In} \to \mathrm{Out}$ is injective. Equivalently, every hash function with a finite output range collides on some pair of distinct inputs.

Proof. Suppose, for contradiction, that $h$ is injective. An injection $h\colon \mathrm{In} \to \mathrm{Out}$ is a bijection from $\mathrm{In}$ onto its image $h(\mathrm{In}) \subseteq \mathrm{Out}$. Therefore $\lvert \mathrm{In}\rvert = \lvert h(\mathrm{In})\rvert \le \lvert \mathrm{Out}\rvert = 2^{256}$. But $\lvert \mathrm{In}\rvert = \aleph_0$, and $\aleph_0 \le 2^{256}$ is false: no infinite set injects into a finite one (a finite set has a finite listing $o_0, \dots, o_{2^{256}-1}$ that runs out, while $\mathrm{In}$'s listing never does). Contradiction. So $h$ is not injective: there exist distinct files $f_1 \ne f_2$ with $h(f_1) = h(f_2)$. $\blacksquare$

That settles Claim 1: it is false for every hash function, HashVault's included, no matter how cleverly designed. You did not need to know anything about the hash's internals — only the sizes of two sets.

🚪 Threshold Concept. A finite output range and an infinite (or merely larger) input domain force a collision. This is the same shape as §10.4's "more problems than programs," scaled down from "countable vs. uncountable" to "infinite vs. finite." Once you see size-comparison as the engine, a whole family of impossibility results becomes obvious on sight: no perfect hash over an unbounded domain, no lossless compressor that shrinks every input, no injective fingerprint of arbitrary data. They are all the same theorem: a copy of the bigger set cannot fit inside the smaller one.

We can make the collision's inevitability concrete with the finite pigeonhole, which is what a working engineer actually invokes. We do not need infinitely many files — just $2^{256} + 1$ of them.

# We can't enumerate 2**256 + 1 files, but the LOGIC is a tiny pigeonhole check.
# Model: a toy hash with a 3-BIT output (8 buckets), fed 9 distinct inputs.
def toy_hash(file_id):
    return file_id % 8                      # output range = {0,...,7}, size 2**3 = 8

inputs = list(range(9))                      # 9 distinct "files" -- one more than 8 buckets
buckets = {}
collision = None
for f in inputs:
    h = toy_hash(f)
    if h in buckets:
        collision = (buckets[h], f, h)       # two distinct files, same fingerprint
        break
    buckets[h] = f

print(len(inputs), "inputs into", 8, "buckets")
print("collision:", collision)
# Expected output:
# 9 inputs into 8 buckets
# collision: (0, 8, 0)

Hand-derive the output: toy_hash sends $0,1,\dots,7$ to buckets $0,1,\dots,7$ (all distinct), then the ninth file, $8$, hashes to $8 \bmod 8 = 0$ — the bucket already holding file $0$. The loop reports the pair $(0, 8)$ colliding in bucket $0$. With $2^{256}+1$ distinct files and $2^{256}$ fingerprints, the identical logic forces a collision; the only reason we used 8 buckets is that $2^{256}+1$ files will not fit in memory. The proof is the program, run on a domain small enough to see.

⚠️ Common Pitfall: "But $2^{256}$ is so huge it might as well be infinite." It is not. The gap between "astronomically large finite" and "infinite" is the entire content of this audit. A finite range, however vast, must collide once the input count exceeds it — and the inputs (all finite files) are not just larger, they are infinite. Treat $2^{256}$ as finite, because it is, and the impossibility of Claim 1 is immediate. (Compare §10.5: $\aleph_0$ is not "a very big number" either — the categories finite and infinite are different in kind, not degree.)

Phase 3: Refine the audit — which claim survives?

So Claim 1 is dead. Are HashVault's other two claims also doomed? No — and seeing why is the analytic payoff. The same cardinality lens that killed Claim 1 clears Claims 2 and 3, by showing they assert something completely different.

Claim 2 ("you'll never observe a collision") is about probability over the inputs you actually use. Cardinality says collisions exist in the abstract input set; it says nothing about whether the particular files your business stores will be among a colliding pair. With $2^{256}$ buckets and, say, $10^{12}$ stored objects, the chance any two collide is governed by the birthday bound and is vanishingly small (roughly $\binom{10^{12}}{2} / 2^{256} \approx 10^{-53}$). So Claim 2 is operationally true — but notice it is a statistical statement about a finite sample, not the absolute statement of Claim 1. The cardinality audit is what forces you to downgrade "never collides" to "won't collide on your data," and that downgrade is the whole point of the memo.

Claim 3 ("an attacker can't feasibly build a collision") is a hardness assumption, not a counting fact. Collisions provably exist (Phase 2). The security question is whether anyone can exhibit one with realistic computation. This is precisely the distinction §10.4 drew between "uncomputable problems exist" (a counting fact) and "here is a specific one" (which needed the explicit halting-problem construction in Chapter 36). For hashes, "exists" is free; "here is one" is what cryptographers spend years attacking. SHA-1 fell (collisions were exhibited in 2017); SHA-256 has not. Cardinality cannot decide Claim 3 — it can only tell you that the target the attacker seeks is guaranteed to be there.

🔗 Connection. "Exists vs. here-is-one" is the central rhythm of the theory-of-computation arc. §10.4 proves uncomputable problems exist by counting; Chapter 36 names one (the halting problem) by diagonalization. Phase 2 proves hash collisions exist by counting; cryptanalysis names one by enormous targeted search. The audit teaches you to always ask which of the two claims a vendor is really making.

🔄 Check Your Understanding The vendor revises the pitch to: "For any 10 trillion files you actually store, no collision will occur." Is this revised claim refuted by the Phase 2 counting argument? Why or why not?

Answer

No — Phase 2 only refutes the absolute claim that collisions never exist anywhere in the (infinite) input set. The revised claim is about a specific finite sample of $10^{13}$ files, and is a probability statement (Claim 2's territory), which the birthday bound makes overwhelmingly likely to hold. Cardinality kills "never, ever, for any inputs"; it is silent about "not among these particular ones."

Phase 3.5: Quantifying "you'll never observe one" — the birthday threshold

Claim 2 deserves a number, because the gap between "collisions exist" (Phase 2) and "you will hit one" is where engineering judgment actually lives. The relevant fact is the birthday bound, and it is itself a small cardinality-flavored counting argument. Suppose you store $k$ distinct files into a hash with $N = 2^{256}$ equally likely fingerprints. The probability that all $k$ are distinct is the fraction of the $N^k$ equally likely fingerprint-assignments that use $k$ different buckets: $$P(\text{no collision}) = \frac{N}{N}\cdot\frac{N-1}{N}\cdots\frac{N-k+1}{N} = \prod_{t=0}^{k-1}\Big(1 - \frac{t}{N}\Big).$$ Each factor is just "the next file must avoid the buckets already taken." Using $1 - x \le e^{-x}$, the collision probability is bounded by $$P(\text{some collision}) \le \frac{k(k-1)}{2N} \approx \frac{k^2}{2N}.$$

Read that approximation carefully: collisions become likely — probability around $\tfrac12$ — not when $k$ approaches $N$, but when $k$ approaches $\sqrt{N} = \sqrt{2^{256}} = 2^{128}$. This square-root threshold is the entire reason cryptographers size a hash at $256$ bits to get only $128$ bits of collision resistance: you must outrun the birthday bound, not the bucket count.

# Order-of-magnitude collision probability k^2 / (2N) for N = 2**256 buckets.
def collision_prob_upper(k, bits=256):
    N = 2 ** bits
    return (k * k) / (2 * N)          # upper bound ~ k^2 / 2N

for k in (10**6, 10**12, 2**128):
    print(f"k=2^{round(k.bit_length()-1):3d}-ish  P(collision) <~ {collision_prob_upper(k):.2e}")
# Expected output:
# k=2^ 19-ish  P(collision) <~ 4.32e-66
# k=2^ 39-ish  P(collision) <~ 4.32e-54
# k=2^127-ish  P(collision) <~ 5.00e-01

Hand-derive the shape (exact mantissas aside): for $k = 10^{12} \approx 2^{40}$, the bound is $k^2/2N \approx 2^{80} / 2^{257} = 2^{-177}$ — utterly negligible, vindicating Claim 2 for any realistic store. But at $k = 2^{128}$ the bound is $2^{256}/2^{257} = 2^{-1} = 0.5$: collisions are now a coin flip. The number makes the qualitative point quantitative — Claim 2 is true in the regime you operate in ($k \ll 2^{128}$) and false past the birthday wall — and it does so with the same "count the assignments, compare the sizes" reasoning as the rest of the audit.

🔗 Connection. The birthday bound is theme four made numeric: Phase 2 proves collisions exist (the all-cases guarantee), while Phase 3.5 estimates how many files you can store before one is likely (the computational, quantitative view). Proof tells you it must happen; the bound tells you when — and you need both to make a deployment decision.

Phase 4: When the inputs are uncountable — the analysis changes

The audit so far assumed inputs were files: finite byte strings, hence countable. But some systems fingerprint infinite streams — a sensor feed, a never-ending log, an arbitrary real-valued signal. Here the input set jumps from $\aleph_0$ to uncountable, and §10.3 sharpens the conclusion.

Consider fingerprinting arbitrary infinite bit-streams (functions $\mathbb{N} \to \{0,1\}$) with any finite-or-countable description — even allowing infinitely long but describable fingerprints. By §10.4, the set of streams is uncountable while the set of finite/countable descriptions is countable. The collision is now not merely guaranteed — it is massive: uncountably many distinct streams must share each fingerprint, because you are cramming an uncountable set into a countable one.

# Two DISTINCT infinite bit-streams that any FINITE-PREFIX fingerprint cannot tell apart.
# A "length-L prefix hash" only inspects the first L bits, so streams agreeing on bits
# 0..L-1 but differing later are forced to collide -- an uncountable family of them.
def prefix_hash(stream_bits, L):
    return tuple(stream_bits(k) for k in range(L))   # only sees the first L bits

L = 4
a = lambda k: 0                       # 0,0,0,0,0,0,...
b = lambda k: 0 if k < L else 1       # 0,0,0,0,1,1,...  (differs from a only past bit L-1)

print(prefix_hash(a, L) == prefix_hash(b, L), "but streams differ at bit", L)
# Expected output:
# True but streams differ at bit 4

Hand-derive it: both a and b emit 0 for bits $0,1,2,3$, so prefix_hash(a, 4) and prefix_hash(b, 4) are both (0, 0, 0, 0) — equal — yet a and b differ at bit 4 (and b keeps differing forever). For every finite $L$ there are streams that agree up to $L$ and diverge after, and in fact there are uncountably many such streams per fingerprint. The uncountability of the input space (§10.3) turns "collisions exist" into "each fingerprint is shared by an uncountable crowd." That is why no fingerprinting scheme can certify the identity of arbitrary infinite objects — a limit you can now prove rather than merely suspect.

💡 Intuition: Phase 2 and Phase 4 are the same audit at two altitudes. Finite output vs. countable input ⇒ collisions exist (pigeonhole). Countable output vs. uncountable input ⇒ collisions are uncountably abundant (diagonalization). The output side never grows fast enough; the only question is how badly it loses.

Discussion Questions

The Phase 2 proof never used a single property of the hash function $h$ — not the bit-length, not the algorithm, nothing but the size of its range. List two other impossibility claims you could refute with the identical argument (e.g., about compression or unique IDs). What is the common pattern?
Claim 2 is a probability statement and Claim 3 is a hardness assumption, yet both were cleared by the same audit that killed Claim 1. Explain precisely what cardinality does and does not decide, and why a careful engineer must state which claim a vendor is making.
Git used SHA-1 for years knowing collisions must exist. Was that irrational? Frame your answer using the "exists vs. feasible-to-find" distinction and the §10.4 analogy to "uncomputable problems exist vs. naming one."
In Phase 4, we said uncountably many streams share each countable fingerprint. Could a fingerprint that is itself an arbitrary real number (uncountably many possible fingerprints) injectively tag every infinite bit-stream? Use $\lvert \mathbb{R}\rvert = \lvert \{0,1\}^{\mathbb{N}}\rvert$ from §10.5 to answer.
The birthday bound made Claim 2 hold for $10^{12}$ files among $2^{256}$ buckets. At roughly what number of stored files would the collision probability stop being negligible? (You do not need a precise figure — reason about $\sqrt{2^{256}} = 2^{128}$ and explain why that square-root threshold is the relevant one.)

Your Turn: Extensions

Option A (the memo). Write the actual one-page design-review memo: restate the three claims, give the Phase 2 counting argument in three sentences, and recommend exactly which guarantees the team may rely on. This is the deliverable a senior engineer would expect.
Option B (perfect hashing, honestly). A perfect hash can be collision-free — but only over a fixed, finite, known key set. Reconcile this with Phase 2 by stating precisely how the input set differs (finite and bounded vs. the infinite set of all possible files). Then explain why a perfect hash silently breaks the moment one new key arrives.
Option C (toolkit tie-in). Using the Toolkit's cantor_pair from cardinality.py, build an injective fingerprint for the set of pairs of files into a single $\mathbb{N}$ index — and then explain why this does not contradict Phase 2 (hint: the target is $\mathbb{N}$, an infinite set, not a finite range). State the cardinality of the input and output sets in your construction.
Option D (uncountable inputs). Make Phase 4 rigorous: prove that the set of infinite bit-streams agreeing with a fixed stream on their first $L$ bits is itself uncountable, by exhibiting a bijection between it and the set of all infinite bit-streams (which §10.4 proved uncountable).

Key Takeaways

A cardinality audit decides possibility before implementation. Compare $\lvert \text{inputs}\rvert$ to $\lvert \text{outputs}\rvert$; if the inputs are larger, no injective map exists — no perfect hash, no perfect fingerprint, no perfect compressor. The hash's internals are irrelevant.
Finite range + infinite (or merely larger) domain ⇒ a collision must exist. This is §10.4's "more problems than programs," scaled to "more files than fingerprints." Pigeonhole is a cardinality statement.
"Exists" and "feasible to find" are different claims. Cardinality guarantees collisions exist; it says nothing about exhibiting one. That gap — the same as "uncomputable problems exist" vs. "here is one" (§10.4, Chapter 36) — is exactly where cryptographic security lives.
Uncountable inputs make the loss catastrophic. Fingerprinting arbitrary infinite streams forces uncountably many inputs per output (§10.3), which is why the identity of arbitrary infinite objects can never be certified by a countable tag.
Always ask which claim a vendor is making. "Never collides" is provably false; "won't collide on your data" is a defensible probability bound; "an attacker can't build a collision" is a hardness assumption. One audit, three verdicts.