38 min read

> "Every bit that crosses a wire is a bit that might arrive wrong."

Prerequisites

  • 23
  • 17

Learning Objectives

  • Design a modular hash function and explain how it maps a large key space into a small table.
  • Use the pigeonhole principle to prove that collisions are unavoidable, and quantify load factor.
  • Compute parity bits, weighted-mod checksums, and CRC remainders by hand and in code.
  • Distinguish error detection from error correction, and state the guarantee each one provides.
  • Explain what an additional property — collision resistance — a cryptographic hash must have, and why ordinary hashes lack it.

Chapter 26: Hashing, Checksums, and Error Detection

"Every bit that crosses a wire is a bit that might arrive wrong." — a paraphrase of the engineer's working assumption behind every checksum (a hypothetical attribution)

Overview

Open your browser's developer tools, download a file, and somewhere in the network trace is a number nobody designed for you to read: a checksum. Look at how a Python dict finds a value in roughly constant time regardless of how many keys it holds: a hash function did that. Run git log and every commit is named by a 40-character string that is, in effect, a fingerprint of its entire contents: a cryptographic hash produced it. These three mechanisms — hash tables, error-detecting codes, and cryptographic fingerprints — look unrelated. They are the same idea wearing three coats.

The idea is this: take something large and turn it into something small, on purpose, in a way that is fast to compute and useful to compare. A hash table turns an arbitrary key into a small array index so lookup is fast. A checksum turns a whole message into a few bits so corruption is visible. A cryptographic hash turns a file into a fixed fingerprint so tampering is detectable. In every case the machinery is number theory you already own — the modular arithmetic of Chapter 23 — and the limits are governed by a counting argument you already proved — the pigeonhole principle of Chapter 17.

This is also where two threads of the book finally meet. The pigeonhole principle told you, abstractly, that a function from a big set to a small set cannot be injective. Here that abstract fact becomes a concrete, unavoidable engineering reality: collisions are not a bug you can fix; they are a theorem you must design around. Understanding that distinction — what the math guarantees versus what you merely hope — is the difference between a hash table that degrades gracefully and one that turns $O(1)$ lookups into $O(n)$ disasters in production.

In this chapter, you will learn to:

  • Build a hash function from modular arithmetic and reason about how it distributes keys.
  • Use the pigeonhole principle to prove collisions are inevitable, and use load factor to predict when they bite.
  • Compute the three classic detection schemes — parity bits, weighted-modular checksums (ISBN/UPC style), and cyclic redundancy checks (CRC) — and state exactly which errors each one catches.
  • Separate the two distinct jobs of detecting an error and correcting it, and see why correction costs more.
  • Explain the extra property that promotes an ordinary hash function into a cryptographic one, and why git, password storage, and digital signatures depend on it.

Learning Paths

🏃 Fast Track: If you have used hash tables and just need the math, read §26.2 (why collisions are guaranteed) and §26.5 (CRC), then do the Project Checkpoint. Those are the two places where the discrete math does real work.

📖 Standard Path: Read in order. Work every 🔄 Check Your Understanding and hand-trace the CRC division in §26.5 yourself before reading ours — polynomial-over-$\mathbb{Z}_2$ arithmetic only sticks if your own pencil does it once.

🔬 Deep Dive: After the chapter, read the §26.3 note on universal hashing and prove the expected bound it states; then look ahead to Chapter 38, where the parity idea of §26.4 becomes a full error-correcting code over the finite fields of Chapter 24.


26.1 Hash functions and hash tables

Suppose you are building a phone book for a company with a million employees, and you need to look up "what is Dana Okoro's extension?" thousands of times per second. A list forces you to scan, on average, half a million entries per query. A sorted array with binary search costs $O(\log n)$ — better, but it makes inserting a new employee expensive. What you want is to jump directly to Dana's record in one step, the way you index into an array: extensions[i]. The obstacle is that array indices are small integers $0, 1, 2, \dots, m-1$, and "Dana Okoro" is not a small integer.

A hash function bridges that gap. It takes a key from some enormous (often infinite) universe — all possible names, all possible 64-bit integers, all possible files — and deterministically produces an index into a small array.

Definition (hash function). A hash function is a function $h\colon U \to {0, 1, \dots, m-1}$ from a (large) key universe $U$ to a finite set of $m$ slots, the table indices. It must be deterministic (the same key always hashes to the same slot) and fast to compute. The array of $m$ slots it indexes into is a hash table.

🔗 Connection. A hash function is exactly a function in the Chapter 9 sense — a rule assigning every element of the domain $U$ a single element of the codomain $\{0,\dots,m-1\}$. Chapter 9 foreshadowed this application by name. What makes hashing interesting is precisely that $U$ is far bigger than the codomain, so (jumping ahead to §26.2) the function cannot be injective.

The simplest useful hash for integer keys is the one you already met as the mod operator in Chapter 23.

💡 Intuition: "Hashing" is just deliberately throwing away information in a structured way. A key carries far more bits than a table index needs; the hash keeps just enough to point at a slot and discards the rest. The art is discarding the right bits so that different keys spread out evenly instead of piling up.

Worked example: a modular hash

Take a table of size $m = 11$ and hash integer keys with $h(k) = k \bmod 11$.

def modular_hash(key, m=11):
    """Map an integer key into a table of size m."""
    return key % m

for k in (1234, 5678, 91011):
    print(k, "->", modular_hash(k))
# Expected output:
# 1234 -> 2
# 5678 -> 2
# 91011 -> 8

To check the first by hand: $11 \cdot 112 = 1232$, so $1234 = 1232 + 2$ and $1234 \bmod 11 = 2$. Likewise $11 \cdot 516 = 5676$, so $5678 \bmod 11 = 2$, and $11 \cdot 8273 = 91003$, so $91011 \bmod 11 = 8$. Notice already that $1234$ and $5678$ landed in the same slot. Hold that thought — it is the subject of the next section.

Strings are handled by first turning them into an integer. A common scheme treats the characters as digits in some base $b$ (often a small prime like 31), a polynomial hash:

$$h(c_0 c_1 \cdots c_{L-1}) = \left( \sum_{i=0}^{L-1} c_i \, b^{\,L-1-i} \right) \bmod m,$$

where $c_i$ is the numeric code of the $i$-th character. The sum is evaluated with Horner's rule and reduced mod $m$ as you go, so the intermediate values never grow large — an application of the modular arithmetic rules from Chapter 23, where reducing at each step gives the same answer as reducing at the end.

⚠️ Common Pitfall: a bad table size. If you choose $m = 2^p$ (a power of two), then $k \bmod m$ keeps only the low $p$ bits of $k$ and throws the rest away. For keys whose low bits are correlated — say, pointers that are all 8-byte aligned, so their low 3 bits are always 0 — this clusters everything into a few slots. Choosing $m$ to be a prime not close to a power of two mixes all the bits of the key into the result. (This is why §26.3's modular hashing prefers primes.)

A hash table uses $h$ to decide where each key-value pair lives: to store (key, value), compute h(key) and put the pair in that slot; to look it up, compute h(key) again and go straight there. When two keys want the same slot, the table needs a collision-resolution policy (chaining the colliding entries in a list, or probing for the next free slot) — and to understand why that policy is mandatory rather than optional, we need the next section.

🔄 Check Your Understanding 1. Why must a hash function be deterministic? What breaks if h("Dana") returns 4 on insert and 7 on lookup? 2. Compute $h(k) = k \bmod 7$ for the keys 50 and 100. Do they collide? 3. In one sentence, what is the difference between a hash function and a hash table?

Answers

  1. Lookup recomputes the slot from the key; if the function isn't deterministic, you store the value in slot 4 but later look in slot 7 and never find it. 2. $50 \bmod 7 = 1$ (since $49 = 7\cdot 7$) and $100 \bmod 7 = 2$ (since $98 = 7\cdot 14$). They do not collide. 3. The hash function is the rule that maps a key to a slot index; the hash table is the array of slots it indexes into (plus a policy for handling collisions).

26.2 Collisions and the pigeonhole principle

Here is the question that separates people who use hash tables from people who understand them: can a clever enough hash function avoid collisions entirely? If we just pick the right function, can we guarantee that no two distinct keys ever land in the same slot?

For any realistic table, the answer is a flat no, and it is not a matter of cleverness. It is a theorem.

A collision is the event we have been circling.

Definition (collision). A collision occurs when two distinct keys hash to the same slot: $k_1 \ne k_2$ but $h(k_1) = h(k_2)$.

Now recall the pigeonhole principle from Chapter 17: if $n$ objects are placed into $m$ boxes and $n > m$, then some box holds at least two objects; equivalently, a function from a set of size $n$ to a set of size $m < n$ cannot be injective. A hash function is exactly such a function, and the key universe is exactly that oversized set.

🔗 Connection. Chapter 17 introduced the pigeonhole principle and pointed forward to this very section, promising that it would explain why hash collisions are unavoidable. The keys are the pigeons; the table slots are the holes. As soon as you have more possible keys than slots — which is always, since the universe of keys is huge and the table is small — at least two keys must share a slot.

Theorem (collisions are unavoidable). Let $h\colon U \to \{0, 1, \dots, m-1\}$ be any hash function. If $\lvert U\rvert > m$, then there exist distinct keys $k_1, k_2 \in U$ with $h(k_1) = h(k_2)$.

The strategy first. This is the pigeonhole principle restated in hashing's vocabulary, so the proof is the pigeonhole argument: assume there were no collision, conclude $h$ is injective, and derive a size contradiction. We argue by contradiction because "no two keys collide" is exactly the statement "$h$ is injective," and injectivity of a function into a smaller set is what pigeonhole forbids.

Proof. Suppose, for contradiction, that no collision exists. Then for all distinct $k_1, k_2 \in U$ we have $h(k_1) \ne h(k_2)$ — which is precisely the statement that $h$ is injective. An injective function maps distinct inputs to distinct outputs, so its image contains $\lvert U\rvert$ distinct slot values. But the codomain has only $m$ slots, so the image has at most $m$ elements. Therefore $\lvert U\rvert \le m$, contradicting the hypothesis $\lvert U\rvert > m$. Hence our supposition was false: some pair of distinct keys must collide. $\blacksquare$

This is theme two of the book in its purest form: "it passed the tests" is not the same as "it is correct." You could hash a million random keys and observe no collision and feel safe — but the theorem says that safety is luck, not structure. Designing for collisions (with chaining or probing) is not defensive paranoia; it is the only correct response to a proof.

🚪 Threshold Concept: a hash function cannot be injective, ever. Many beginners spend real effort hunting for the "perfect" hash function that never collides on arbitrary input, the way one might hunt for a faster sort. The pigeonhole principle says that search is provably hopeless for any fixed function on a universe larger than the table. This reframes the entire engineering problem: you stop asking "how do I avoid collisions?" and start asking "how do I make collisions rare and cheap?" Every hash table design — chaining, open addressing, resizing — is an answer to the second question, because the first has no answer. Once you internalize that a many-to-one map must lose information, you will recognize the same constraint behind lossy compression, dimensionality reduction, and the uncomputability results waiting in Part VI.

Quantifying the squeeze: load factor

The theorem says collisions must happen eventually. But a table with a million slots holding only ten keys collides rarely; a table with ten slots holding a million keys is a catastrophe. The single number that captures the difference is the load factor.

Definition (load factor). For a hash table with $n$ stored keys and $m$ slots, the load factor is $\alpha = n / m$ — the average number of keys per slot.

Load factor is the dial that controls performance. With chaining (each slot holds a list of its keys), an unsuccessful lookup must scan the whole chain in its slot, and under the idealizing assumption that keys spread uniformly, the expected chain length is exactly $\alpha$. So the expected lookup cost is $O(1 + \alpha)$: constant work plus a term that grows with how crowded the table is. Keep $\alpha$ bounded (real implementations resize the table when $\alpha$ exceeds a threshold like $0.75$) and you keep expected lookups constant. Let $\alpha$ drift upward unbounded and your $O(1)$ table quietly becomes an $O(n)$ list.

💡 Intuition: The generalized pigeonhole principle (Chapter 17) sharpens this: distribute $n$ keys into $m$ slots and some slot is forced to hold at least $\lceil n/m \rceil = \lceil \alpha \rceil$ keys. So $\alpha$ is not just the average chain length — it is a guaranteed lower bound on the worst chain. You cannot beat it with any hash function; you can only keep $\alpha$ small.

🧩 Productive Struggle. Before reading on: you hash $n$ keys into $m = n$ slots (load factor exactly 1). The average chain has length 1, which sounds ideal. Yet the longest chain is, with high probability, around $\Theta(\log n / \log\log n)$ — not 1. Why does an average of 1 still leave some slot with many keys? (Hint: think about the difference between an average and a maximum, and recall the birthday-style counting from probability.)

🔄 Check Your Understanding 1. State the collision theorem in pigeonhole language: what are the pigeons, what are the holes, and what is the inequality that forces a collision? 2. A hash table has $m = 1000$ slots and stores $n = 1500$ keys. What is the load factor, and what does the generalized pigeonhole principle guarantee about the most crowded slot? 3. True or false: a sufficiently clever hash function can guarantee zero collisions on an arbitrary stream of 64-bit integer keys into a 1000-slot table. Justify.

Answers

  1. Pigeons = the keys; holes = the $m$ table slots; the inequality $\lvert U\rvert > m$ (more possible keys than slots) forces at least two keys into one slot. 2. $\alpha = 1500/1000 = 1.5$; the generalized pigeonhole principle guarantees some slot holds at least $\lceil 1500/1000 \rceil = 2$ keys. 3. False. There are $2^{64}$ possible keys and only 1000 slots; since $2^{64} > 1000$, the theorem guarantees two distinct keys with the same hash — no function avoids it.

26.3 Modular hashing and a glimpse of universal hashing

We have a hash function ($k \bmod m$) and a proof that collisions are inevitable. The remaining design question is practical: among all the functions we could use, which spread keys out well, and can we say anything rigorous about "well"?

Why a prime modulus

Modular hashing is $h(k) = k \bmod m$, and the choice of $m$ matters more than it looks. We saw that a power-of-two modulus keeps only low-order bits. The deeper issue is patterned keys. Suppose your keys happen to be multiples of some number $d$ that shares a factor with $m$. Concretely, let $m = 12$ and suppose every key is even. Then $k \bmod 12$ is always even, so the six odd slots $1,3,5,7,9,11$ are never used — half the table is dead, doubling the effective load factor.

The fix is theme three, abstraction applied: choose $m$ to be prime. If $m$ is prime, then for any key step $d$ that is not a multiple of $m$, the values $k, k+d, k+2d, \dots$ cycle through all $m$ residues before repeating, because $\gcd(d, m) = 1$ (a fact from the Euclidean-algorithm groundwork of Chapter 22, since a prime shares no factor with anything smaller). A prime modulus refuses to let arithmetic patterns in the keys collapse into a few slots.

⚠️ Common Pitfall: Reaching for $m = 2^p$ "because the mod is a fast bitmask." It is fast, but it only looks at the low bits. If you must use a power-of-two table (common in real libraries for speed), first run the key through a mixing step that scrambles its high bits into its low bits; do not feed raw keys to a power-of-two mod.

Universal hashing (intro)

There is a subtler vulnerability. For any single fixed hash function $h$, an adversary who knows $h$ can hand-pick keys that all collide — by the collision theorem there are infinitely many such keys, and the adversary just enumerates them. This is not hypothetical: it is the basis of real "algorithmic complexity" denial-of-service attacks that force a server's hash table into its $O(n)$ worst case.

The defense is to not commit to one function. Universal hashing picks a hash function at random, at table-creation time, from a carefully designed family $H$ of functions.

💡 Intuition (universal family). A family $H$ of hash functions is called universal if, for any two distinct keys $k_1 \ne k_2$, the probability — taken over a random choice of $h$ from $H$ — that they collide is at most $1/m$: $$\Pr_{h \in H}[\,h(k_1) = h(k_2)\,] \le \frac{1}{m}.$$ That is exactly the collision rate you would get from an idealized "random" assignment of keys to slots. The point is subtle but powerful: collisions still happen, but no adversary can force them, because they cannot predict which $h$ you drew. The randomness is yours, not theirs.

A classic universal family for integer keys, due to Carter and Wegman, uses modular arithmetic directly: pick a prime $p$ larger than every key, then choose random $a \in \{1, \dots, p-1\}$ and $b \in {0, \dots, p-1}$ and set $$h_{a,b}(k) = \big((a k + b) \bmod p\big) \bmod m.$$ Different random $(a, b)$ give different functions; the analysis (which leans on the modular-inverse machinery of Chapter 23 — for fixed distinct $k_1, k_2$, the map from $(a,b)$ to the residue pair is a bijection, so collisions are rare over the random choice) shows the family meets the $1/m$ bound. We will not prove it here, but notice the shape: randomness converts a worst case you cannot avoid into an expected case you can predict — the same computation-vs-proof complementarity (theme four) that randomized algorithms exploit throughout the book.

🔗 Connection. The Carter–Wegman family is built from the linear-congruence arithmetic of Chapter 23. The same $ax + b \pmod p$ form that scrambles keys here also drives linear congruential random-number generators — modular arithmetic is the engine of both.

🔄 Check Your Understanding 1. Why does a prime modulus help when keys share a common factor? 2. In one sentence, what problem does universal hashing solve that a single fixed hash function cannot? 3. In the universal family $h_{a,b}(k) = ((ak+b) \bmod p) \bmod m$, what is randomized — the key, or the choice of $a$ and $b$?

Answers

  1. If $m$ is prime it shares no factor with the keys' common step $d$ (so $\gcd(d,m)=1$), and the keys cycle through all $m$ residues instead of collapsing into the few slots compatible with the shared factor. 2. It prevents an adversary from deliberately choosing keys that all collide, by hiding which hash function is in use behind a random choice. 3. The choice of $a$ and $b$ (made once, at table creation); the keys are whatever the user supplies.

26.4 Checksums and parity: detecting that something changed

Switch settings. We are no longer building a fast lookup; we are sending a message across a noisy channel — a network cable, a radio link, a scratched disk — and bits sometimes flip in transit. We want the receiver to notice when the message arrived corrupted. The tool is the same kind of small fingerprint as a hash, but used for a different purpose.

Definition (checksum). A checksum is a small, fixed-size value computed from a block of data and sent (or stored) alongside it, so that the receiver can recompute it and compare: if the recomputed checksum differs from the received one, the data was corrupted. A checksum detects errors; on its own it does not say where the error is or how to fix it.

The simplest possible checksum is a single bit.

Definition (parity bit). A parity bit is one extra bit appended to a block of bits, chosen so that the total number of 1-bits is even (even parity) or odd (odd parity). The receiver counts the 1-bits in the whole block-plus-parity; if the count has the wrong parity, an error is detected.

Worked example: even parity

Take the 7-bit data block $1010001$. It has three 1-bits (an odd count). Under even parity we append the bit that makes the total even — here a $1$ — producing the 8-bit transmitted block $10100011$, which has four 1-bits.

def parity_bit(bits):
    """Return the even-parity bit for a list of 0/1 bits."""
    return sum(bits) % 2

data = [1, 0, 1, 0, 0, 0, 1]
p = parity_bit(data)
print("data:", data, "parity:", p)
print("transmitted:", data + [p])
# Expected output:
# data: [1, 0, 1, 0, 0, 0, 1] parity: 1
# transmitted: [1, 0, 1, 0, 0, 0, 1, 1]

If exactly one bit flips in transit, the count of 1-bits changes by one, flipping its parity, and the receiver's check fails — the error is caught. But parity has a sharp limit that you can state precisely:

Claim. A single parity bit detects any odd number of bit errors and misses any even number of bit errors.

The strategy first. Parity is a statement about the count of 1-bits modulo 2. Each flipped bit toggles that count's parity once. So the question "does the receiver notice?" is just "did the parity change?", which is "was the number of flips odd?" The proof is a one-line parity (mod 2) argument.

Proof. Let the transmitted block have an even number of 1-bits (even parity, by construction). Suppose $e$ bits flip in transit. Each flip changes a 1 to a 0 or a 0 to a 1, and in either case changes the total count of 1-bits by exactly $\pm 1$, hence toggles the count modulo 2. After $e$ flips the count's parity has toggled $e$ times, so the final parity equals the original parity if and only if $e$ is even. The receiver detects an error exactly when the parity differs from even — i.e., exactly when $e$ is odd. Therefore odd $e$ is always detected and even $e \ge 2$ is never detected (the parity looks correct again). $\blacksquare$

⚠️ Common Pitfall: Treating "passed the parity check" as "the data is correct." A two-bit error sails through. Parity is a one-bit fingerprint and can only carry one bit of "is something wrong" information; it cannot distinguish two errors from zero. This is the detection-strength-versus-cost tradeoff that the rest of the chapter (and Chapter 38) is about.

Weighted checksums: ISBN and the power of a prime modulus

A single parity bit is weak partly because it weights every position the same — so swapping two digits, or any even disturbance, hides. Real-world identifier checksums fix this by giving each position a different weight and working modulo a well-chosen number. The ISBN-10 book number is the classic example: its tenth digit is a check digit chosen so that

$$\sum_{i=1}^{10} i \cdot d_i \equiv 0 \pmod{11},$$

where $d_i$ is the $i$-th digit. (The check digit can be $10$, written as the symbol X, which is why some ISBNs end in X.)

def isbn10_check_digit(first9):
    """Given the first 9 ISBN-10 digits, return the check digit (10 -> 'X')."""
    s = sum((i + 1) * d for i, d in enumerate(first9))
    c = (-s) % 11
    return "X" if c == 10 else str(c)

print(isbn10_check_digit([1, 2, 3, 4, 5, 6, 7, 8, 9]))
# Expected output:
# X

By hand: the weighted sum of the first nine digits is $1\cdot1 + 2\cdot2 + \cdots + 9\cdot9 = 1 + 4 + 9 + 16 + 25 + 36 + 49 + 64 + 81 = 285$. We need $285 + 10\, d_{10} \equiv 0 \pmod{11}$. Since $285 = 11\cdot25 + 10$, we have $285 \equiv 10 \pmod{11}$, so we need $10 + 10\,d_{10} \equiv 0$, i.e. $10(1 + d_{10}) \equiv 0 \pmod{11}$. Because $11$ is prime, $10$ has a multiplicative inverse mod $11$ (Chapter 23), so we may cancel it: $1 + d_{10} \equiv 0$, giving $d_{10} = 10 = $ X.

The prime modulus is doing real work, and it shows what weighting buys you:

💡 Intuition: Because $11$ is prime, every single-digit error and every transposition of two adjacent digits changes the weighted sum modulo 11 — so both are always detected. A single wrong digit $d_i \to d_i'$ shifts the sum by $i\,(d_i' - d_i)$, which is nonzero mod 11 because $i \in {1,\dots,10}$ and the digit change are both nonzero and smaller than the prime. Swapping adjacent digits shifts the sum by $\pm(d_i - d_{i+1})$, again nonzero mod 11 unless the digits were equal (in which case the swap changes nothing anyway). A non-prime modulus like 10 would let some of these errors slip through — the two most common human typing mistakes. This is the same reason §26.3 wanted a prime: a prime modulus has no nontrivial factors for an error to hide behind.

🔄 Check Your Understanding 1. The block $11010$ is sent with even parity. What parity bit is appended? If the receiver gets $110100$ (with parity), how many bit-flips could have occurred, and would the check catch a single flip? 2. Why can a single parity bit never tell you which bit was corrupted? 3. ISBN-10 uses modulus 11, not 10. Name one specific kind of error that mod 11 catches but mod 10 would miss.

Answers

  1. $11010$ has three 1-bits (odd), so even parity appends $1$, giving $110101$. The received $110100$ has parity bit $0$ but three 1-bits in the data — recomputing even parity gives $1 \ne 0$, so an error is detected; a single flip (odd) is always caught. 2. A parity bit carries one bit of information — "is the count even or odd" — which is enough to say whether something is wrong but not enough to locate it among the many positions. 3. The transposition of two adjacent digits (a common typo): mod 11 with positional weights detects every such swap, whereas a plain mod-10 sum of digits is unchanged by a transposition and misses it.

26.5 Cyclic redundancy checks (CRC)

Parity catches single-bit errors; weighted checksums catch single-digit and transposition errors. But real channels produce bursts — a scratch on a disk or a moment of radio interference flips a run of consecutive bits. The detection scheme that dominates networking and storage precisely because it crushes burst errors is the cyclic redundancy check, and it is the most number-theoretic of the three. The whole trick is to treat bit strings as polynomials and do division.

Bit strings as polynomials over $\mathbb{Z}_2$

Read a bit string $b_{n-1} b_{n-2} \cdots b_1 b_0$ as the polynomial $$b_{n-1} x^{n-1} + b_{n-2} x^{n-2} + \cdots + b_1 x + b_0,$$ with coefficients in $\mathbb{Z}_2 = \{0, 1\}$ — the integers mod 2 you met in Chapter 23. So $1011$ is $x^3 + x + 1$. In $\mathbb{Z}_2$, addition is XOR ($1 + 1 = 0$, no carries), and subtraction is the same as addition — both are XOR. That single fact is what makes CRC hardware-cheap: polynomial division over $\mathbb{Z}_2$ is just repeated XOR and shifting, no borrows, no carries.

Definition (cyclic redundancy check, CRC). A CRC is an error-detecting checksum computed as the remainder when the message polynomial (shifted up to make room) is divided, modulo 2, by a fixed generator polynomial $G(x)$ of degree $r$. The $r$-bit remainder is appended to the message; the receiver divides the received frame by the same $G(x)$ and flags an error if the remainder is nonzero.

The procedure, given a message $M$ (as bits) and a generator $G$ of degree $r$:

  1. Append $r$ zero bits to $M$ (this is multiplying $M(x)$ by $x^r$ — shifting it up to leave room for the remainder).
  2. Divide the augmented bits by $G$ using XOR-subtraction; keep the $r$-bit remainder $R$.
  3. Transmit $M$ followed by $R$. (As a polynomial, $M(x)\,x^r - R(x)$, which is exactly divisible by $G(x)$ — and since subtraction is XOR, $M$-with-$R$-appended is a multiple of $G$.)

The receiver divides the whole received frame by $G$. A zero remainder means "no detected error"; a nonzero remainder means the frame was corrupted.

Worked example: CRC by hand

Let the message be $M = 1101$ and the generator be $G = 1011$ (degree $r = 3$, i.e. $x^3 + x + 1$).

Step 1 — augment. Append $r = 3$ zeros: $1101000$.

Step 2 — divide mod 2. Long division is "if the leading bit under the cursor is 1, XOR the generator; then move on." Aligning $G = 1011$ under each leading 1:

    1101000          <- augmented message
    1011             <- XOR (G aligned at the leading 1)
    ----
    0110000
     1011            <- XOR (G aligned at the next leading 1)
     ----
    0011100
      1011           <- XOR
      ----
    0001010
       1011          <- XOR
       ----
    0000001

After the last XOR the leftmost 1 sits in a position too low to align another copy of $G$ (it would run off the right end), so division stops. The remainder is the low $r = 3$ bits: $R = 001$.

Step 3 — transmit. Send $M$ followed by $R$: $1101\,001 = 1101001$.

def crc_remainder(message, generator):
    """Return the r-bit CRC remainder (r = len(generator) - 1), all str of 0/1."""
    r = len(generator) - 1
    bits = list(message) + ["0"] * r            # augment
    for i in range(len(message)):               # one XOR per message bit
        if bits[i] == "1":
            for j in range(len(generator)):     # XOR generator in place
                bits[i + j] = str(int(bits[i + j]) ^ int(generator[j]))
    return "".join(bits[len(message):])         # last r bits

print(crc_remainder("1101", "1011"))
# Expected output:
# 001

Verification. The receiver divides the transmitted $1101001$ by $1011$:

    1101001
    1011
    ----
    0110001
     1011
     ----
    0011101
      1011
      ----
    0001011
       1011
       ----
    0000000          <- remainder 0: no error detected

The remainder is $000$, so the receiver accepts the frame. If any bit had flipped in transit, the division would (in the cases the generator is designed to catch) leave a nonzero remainder.

Why CRC is powerful, in one theorem-sized idea

The reason engineers reach for CRC instead of a sum-based checksum is that polynomial division detects whole classes of errors, and the number theory tells you exactly which. Model the error as an error polynomial $E(x)$: the received frame is $T(x) + E(x)$ where $T(x)$ is the (divisible) transmitted frame. Since $G \mid T$, the receiver's remainder is the remainder of $E(x)$ modulo $G(x)$. An error is undetected if and only if $G(x)$ divides $E(x)$. From this single observation:

  • All single-bit errors are detected if $G(x)$ has at least two nonzero terms: a single-bit error is $E(x) = x^i$, and a $G$ with two or more terms cannot divide a single power of $x$.
  • All burst errors shorter than $G$ (bursts of length $\le r$) are detected, because such an error polynomial has degree less than $\deg G$ and a nonzero polynomial of smaller degree can never be divisible by $G$.
  • All odd numbers of bit errors are detected if $G(x)$ has $x + 1$ as a factor — the polynomial version of the parity argument from §26.4.

💡 Intuition: A CRC is a hash function whose codomain is the set of $r$-bit remainders, but chosen with the algebra of $\mathbb{Z}_2[x]$ so that the kinds of changes a noisy channel actually makes — single flips, short bursts — land in different slots from the original with certainty, not just with high probability. Where a generic checksum says "probably different if corrupted," a well-chosen CRC generator guarantees "different" for an entire catalogued family of errors. That guarantee is why Ethernet, ZIP, PNG, and disk controllers all use CRCs.

🔗 Connection. Treating bit strings as polynomials over $\mathbb{Z}_2$ is your first taste of coding theory. Chapter 38 develops this into full error-correcting codes — including codes built on the finite fields of Chapter 24 — where the algebra does not just detect corruption but pinpoints and repairs it.

🔄 Check Your Understanding 1. In CRC, why do we append $r$ zeros to the message before dividing? What does that do to the message polynomial? 2. A generator $G$ has degree 16. What is the longest burst error length the CRC is guaranteed to detect, and why? 3. Compute the CRC remainder of message $M = 1010$ with generator $G = 11$ (degree 1). (Hint: $G = x + 1$ — this CRC is just a parity check.)

Answers

  1. Appending $r$ zeros multiplies $M(x)$ by $x^r$, shifting it up by $r$ positions to leave room for the $r$-bit remainder; the transmitted frame is then $M(x)x^r$ minus the remainder, an exact multiple of $G$. 2. Any burst of length $\le 16$ is guaranteed detected: such an error polynomial has degree $< 16 = \deg G$, and a nonzero polynomial of degree less than $\deg G$ cannot be divisible by $G$.
  2. Augment $1010$ with one zero: $10100$. Dividing by $11$ (XOR) leaves remainder $0$ (the message has an even number of 1-bits), so $R = 0$ — confirming $G = x+1$ acts as a parity check.

26.6 Cryptographic hashing: when an adversary is trying to collide you

Everything so far assumed errors are accidental — random noise, cosmic rays, a worn disk. But some "errors" are deliberate. When you download an installer and check its published hash, you are not worried about a flipped bit; you are worried about an attacker who replaced the whole file with malware and wants it to pass your check anyway. Ordinary hashes and CRCs are useless against that adversary, because they were designed to be fast and predictable — and an attacker can exploit predictability to forge a collision on purpose. Defending against a thinking opponent needs a stronger object.

Definition (cryptographic hash function). A cryptographic hash function maps an arbitrary-length input to a fixed-length output (a digest) and additionally satisfies, as a practical computational matter: - Preimage resistance: given a digest $y$, it is infeasible to find any input $x$ with $h(x) = y$ (you cannot reverse the fingerprint). - Second-preimage resistance: given an input $x_1$, it is infeasible to find a different $x_2$ with $h(x_1) = h(x_2)$ (you cannot forge a substitute for a specific file). - Collision resistance: it is infeasible to find any pair $x_1 \ne x_2$ with $h(x_1) = h(x_2)$ (you cannot produce any colliding pair).

Notice what is not on this list: "no collisions." By the collision theorem of §26.2, a function from arbitrary-length inputs to a fixed 256-bit digest must have infinitely many collisions — the universe of inputs dwarfs the $2^{256}$ possible digests. Cryptographic hashing does not deny the theorem; it makes the guaranteed collisions infeasible to find. The pigeonhole principle says they exist; the design ensures no adversary with realistic computing power can locate one.

🚪 Threshold Concept: hard-to-find is as good as nonexistent. This is a genuinely new way to think, and it underlies all of modern cryptography (the same shift that made RSA possible in Chapter 25). A collision exists — pigeonhole guarantees it — but if finding one would take longer than the age of the universe, it might as well not exist for any practical purpose. Security is not built on impossibility (the math forbids that) but on computational infeasibility: the gap between "a solution exists" and "you can compute it." Internalizing that gap is the key to everything in Chapter 25 and a recurring theme when we reach complexity theory in Chapter 37, where the same "easy to check, hard to find" asymmetry defines the class NP.

⚠️ Common Pitfall: using the wrong tool for the threat. A CRC is not a cryptographic hash. CRCs are linear over $\mathbb{Z}_2$, so an attacker can compute exactly which bits to flip to keep the CRC unchanged while altering the message — trivial forgery. Conversely, a cryptographic hash is overkill (and slower) for catching random line noise. Match the tool to the adversary: random faults want a CRC; a malicious tamperer wants a cryptographic hash like SHA-256.

🔗 Connection. The 40-character name on every git commit is a cryptographic hash of the commit's contents (historically SHA-1, now migrating to SHA-256). It is why you cannot quietly rewrite history in a shared repository without every downstream hash changing — collision resistance is what makes a content-addressed system trustworthy. The same property underlies password storage (store $h(\text{password})$, not the password), digital signatures (sign the hash, not the document), and blockchain integrity.

🐛 Find the Error. A developer protects file downloads by publishing a CRC-32 of each file alongside it, reasoning: "CRC-32 detects corruption, and tampering is just corruption an attacker caused, so CRC-32 will catch tampering too." Where is the reasoning wrong?

Answer

The conflation of accidental corruption with adversarial tampering is the error. CRC-32 is designed to detect random faults a noisy channel produces, and it does so with provable guarantees for that model. But CRC is linear and public: an attacker who alters the file can recompute the matching CRC-32 (or even craft changes that leave the original CRC intact, exploiting the linearity), so the published checksum still matches the malicious file. CRC has no preimage or collision resistance. Detecting deliberate tampering requires a cryptographic hash (SHA-256) — or, better, a keyed construction (an HMAC) so the attacker cannot recompute the digest without the secret key.

🔄 Check Your Understanding 1. Why does the existence of collisions (guaranteed by §26.2) not break a cryptographic hash function? 2. Which of the three security properties is violated if an attacker, given the hash of your password, recovers the password itself? 3. Give one reason a CRC is unsuitable for verifying that a downloaded file has not been tampered with.

Answers

  1. Cryptographic hashing never claims collisions don't exist — pigeonhole guarantees they do; it claims they are computationally infeasible to find. Security rests on the cost of finding a collision, not on their nonexistence. 2. Preimage resistance — recovering an input from its digest is exactly a preimage attack. 3. CRC is linear and uses a public, keyless generator, so an attacker who changes the file can recompute (or deliberately preserve) the matching CRC; it offers no resistance to a chosen, malicious modification.

Project Checkpoint: coding.py — Hamming distance and a single-error-correcting code

Theme one of this book is that discrete math is the language of CS, and nowhere is that clearer than in error control: the same XOR-over-$\mathbb{Z}_2$ arithmetic behind CRC, pushed one step further, lets us not just detect a single-bit error but correct it. This chapter's Toolkit module is coding.py, which Chapter 38 will expand into full coding theory. We add three functions: a distance measure and a matched encode/decode pair for the Hamming(7,4) code — 4 data bits protected by 3 parity bits, placed at the power-of-two positions $1, 2, 4$ so that the receiver can compute a 3-bit syndrome that either reads $0$ (no error) or names the exact position to flip.

Create dmtoolkit/coding.py and add:

def hamming_distance(a, b):
    """Number of positions where equal-length bit lists a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def hamming_encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (parity at 1,2,4)."""
    d1, d2, d3, d4 = data                      # go to positions 3,5,6,7
    p1 = d1 ^ d2 ^ d4                           # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                           # covers positions 2,3,6,7
    p4 = d2 ^ d3 ^ d4                           # covers positions 4,5,6,7
    return [p1, p2, d1, p4, d2, d3, d4]        # positions 1..7

def hamming_decode(code):
    """Correct up to one bit error; return (data_bits, error_position)."""
    c = code[:]                                 # 1-indexed below via c[i-1]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]             # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]             # parity over positions 2,3,6,7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]             # parity over positions 4,5,6,7
    syndrome = s4 * 4 + s2 * 2 + s1            # binary s4 s2 s1 = bad position
    if syndrome:
        c[syndrome - 1] ^= 1                    # flip the indicated bit
    return [c[2], c[4], c[5], c[6]], syndrome   # data at positions 3,5,6,7

Hand-trace hamming_encode([1, 0, 1, 1]): data go to positions $3,5,6,7$, so $d_1{=}1, d_2{=}0, d_3{=}1, d_4{=}1$. Then $p_1 = 1 \oplus 0 \oplus 1 = 0$, $p_2 = 1 \oplus 1 \oplus 1 = 1$, $p_4 = 0 \oplus 1 \oplus 1 = 0$, giving the codeword $[0,1,1,0,0,1,1]$. Now flip position 5 (a transmission error) to get $[0,1,1,0,1,1,1]$ and decode: $s_1 = 0\oplus1\oplus1\oplus1 = 1$, $s_2 = 1\oplus1\oplus1\oplus1 = 0$, $s_4 = 0\oplus1\oplus1\oplus1 = 1$, so the syndrome is $101_2 = 5$ — it names position 5 exactly. The decoder flips position 5 back and recovers the original data.

code = hamming_encode([1, 0, 1, 1])
print("codeword:", code)
received = code[:]; received[4] ^= 1            # corrupt position 5
print("received:", received)
print("decoded :", hamming_decode(received))
# Expected output:
# codeword: [0, 1, 1, 0, 0, 1, 1]
# received: [0, 1, 1, 0, 1, 1, 1]
# decoded : ([1, 0, 1, 1], 5)

How this builds toward the capstone. hamming_distance is the ruler all of coding theory measures with — Chapter 38 proves that a code correcting $t$ errors needs minimum distance $2t+1$, and Hamming(7,4) has minimum distance 3, hence corrects $t = 1$. The encode/decode pair is the first error-correcting member of the Toolkit, the natural sequel to this chapter's error-detecting checksums. The capstone's Track D (an error-correcting code) is built directly on these three functions.

Toolkit so far: through numbertheory.py and crypto.py (Chapters 22–25), and now coding.py begins. With detection (parity, CRC) understood and correction (Hamming) implemented, you have the complete error-control toolkit that Chapter 38 generalizes with finite fields.


Summary

Hashing and error detection are one idea — map something large to something small, fast, for comparison — applied to three jobs. The reference facts:

Mechanism Maps Purpose Math behind it Key guarantee / limit
Hash table key → slot index fast lookup modular arithmetic (Ch. 23) collisions guaranteed (pigeonhole)
Parity bit block → 1 bit detect noise count mod 2 catches odd # of flips; misses even
Weighted checksum digits → check digit detect typos weighted sum mod prime catches single-digit + transposition
CRC message → $r$-bit remainder detect bursts polynomial division over $\mathbb{Z}_2$ catches all bursts of length $\le r$
Cryptographic hash data → fixed digest detect tampering one-way + collision-resistant collisions infeasible to find

Core results to carry forward:

  • Collision theorem. Any $h\colon U \to \{0,\dots,m-1\}$ with $\lvert U\rvert > m$ has a collision (pigeonhole). Collisions are a theorem, not a bug — design around them.
  • Load factor $\alpha = n/m$ governs hash-table performance; expected chained-lookup cost is $O(1+\alpha)$, and the generalized pigeonhole principle forces some chain of length $\ge \lceil\alpha\rceil$.
  • Prime modulus (hashing and ISBN-style checksums) prevents arithmetic patterns and the most common errors from hiding; powers of two keep only low bits.
  • Parity detects exactly the odd error counts (a mod-2 argument).
  • CRC: an error $E(x)$ is undetected iff $G(x) \mid E(x)$; choosing $G$ with $\ge 2$ terms catches all single-bit errors, $\deg G = r$ catches all bursts $\le r$, and the factor $(x+1)$ catches all odd error counts.
  • Detection vs. correction. Detection says something is wrong (parity, checksum, CRC); correction says what is wrong and fixes it (Hamming(7,4), via a syndrome that names the bad bit) and costs more redundancy.
  • Cryptographic hashes add preimage, second-preimage, and collision resistance: the guaranteed collisions are made computationally infeasible to find. A CRC is not a cryptographic hash.

Toolkit additions (coding.py): hamming_distance(a, b), hamming_encode(bits), hamming_decode(bits).


Spaced Review

Retrieval keeps knowledge available. Answer from memory before checking.

  1. (Ch. 23) The modular hash $h(k) = k \bmod m$ and the CRC both rely on a modulus/divisor. State the rule from Chapter 23 that lets you reduce $k$ modulo $m$ during a Horner-rule string hash instead of only at the end, and say why the answer is the same.
  2. (Ch. 23) Why does the ISBN-10 argument need $11$ to be prime when it cancels the factor of $10$ to solve $10(1 + d_{10}) \equiv 0 \pmod{11}$? Which Chapter 23 object does primality guarantee exists?
  3. (Ch. 17) State the pigeonhole principle, then the generalized pigeonhole principle, and map each onto a precise statement about a hash table with $n$ keys and $m$ slots.
  4. (Ch. 17) Chapter 17 proved collisions are unavoidable; this chapter restated it as the collision theorem via a proof by contradiction. What single property of $h$ does "no collisions" assert, and which set-size fact does pigeonhole use to contradict it?

Answers

  1. The rule $(a + b) \bmod m = ((a \bmod m) + (b \bmod m)) \bmod m$ and likewise for products: reduction commutes with addition and multiplication mod $m$, so reducing each partial Horner result mod $m$ yields the same final residue as computing the whole sum and reducing once — and keeps the intermediate values from overflowing. 2. Primality of $11$ guarantees that $10$ has a multiplicative inverse mod $11$ (every nonzero residue mod a prime is invertible), so the factor $10$ can be cancelled legitimately; without invertibility you could not conclude $1 + d_{10} \equiv 0$. 3. Pigeonhole: if $n > m$ then some slot holds $\ge 2$ keys. Generalized: some slot holds $\ge \lceil n/m \rceil = \lceil\alpha\rceil$ keys. The first guarantees *a* collision once you exceed $m$ keys; the second bounds the worst-case chain length below. 4. "No collisions" asserts $h$ is injective; pigeonhole contradicts it because an injection from $U$ into a set of size $m$ forces $\lvert U\rvert \le m$, impossible when $\lvert U\rvert > m$.

What's Next

You can now detect corruption — parity for a stray flip, weighted checksums for human typos, CRC for bursts — and you have implemented your first scheme that corrects it, Hamming(7,4), by reading a syndrome that points at the broken bit. But why does that code work, exactly how many errors can a code of a given size correct, and how do we build codes far more powerful than Hamming's — the codes that let a scratched CD play, a QR code survive a coffee stain, and a spacecraft phone home across the solar system? Answering that needs the algebra of finite fields from Chapter 24 married to the Hamming distance you just defined. That marriage is coding theory, and it is Chapter 38.