Exercises: Hashing, Checksums, and Error Detection

DataField.Dev

Exercises: Hashing, Checksums, and Error Detection

These build from mechanical warm-ups (compute a hash, an ISBN check digit, a CRC remainder) to proofs, code, modeling, and a conjecture you'll test before settling. Difficulty: ⭐ foundational, ⭐⭐ intermediate, ⭐⭐⭐ challenging. Worked solutions to the starred-with-a-dagger (†) and odd-numbered problems are in the appendix answers-to-selected.md — try each one before you peek. Throughout, "the collision theorem" means the §26.2 result that any $h\colon U \to \{0,\dots,m-1\}$ with $\lvert U\rvert

m$ has a collision.

Part A — Warm-ups (⭐)

26.1 † Compute $h(k) = k \bmod 13$ for the keys $k = 100,\ 200,\ 300$. Show the multiple of 13 you subtract in each case, the way the chapter checked $1234 \bmod 11$ by hand. Do any two of the three collide?

26.2 A hash table uses $m = 16$ slots and the function $h(k) = k \bmod 16$. The keys $32, 48, 64, 80$ are inserted. Which slot does each land in? In one sentence, connect what you observe to the chapter's "⚠️ a bad table size" pitfall about powers of two.

26.3 † A hash table has $m = 250$ slots and stores $n = 1000$ keys. State (a) the load factor $\alpha$, and (b) the smallest number of keys that the generalized pigeonhole principle guarantees in some slot.

26.4 Take the 7-bit block $1100110$ and append an even parity bit. Then append an odd parity bit to the same block. Give both 8-bit transmitted blocks.

26.5 † Read the bit string $11010$ as a polynomial over $\mathbb{Z}_2$ in the chapter's convention ($b_{n-1}x^{n-1} + \cdots + b_0$). Write the polynomial, then write the polynomial for the generator $G = 101$.

26.6 ISBN-10 check digits range over $\{0,1,\dots,10\}$, with $10$ written as the symbol X. Using the chapter's definition $\sum_{i=1}^{10} i\,d_i \equiv 0 \pmod{11}$, explain in one or two sentences why the check digit needs eleven possible values, not ten — and why mod 10 would not have this issue but would lose detection power instead.

Part B — Computation (⭐⭐)

26.7 † Compute the CRC remainder of the message $M = 10110$ with generator $G = 1011$ (degree $r = 3$). Show the XOR long division step by step in the chapter's text-block layout, then give the $r$-bit remainder $R$ and the transmitted frame $M$-followed-by-$R$.

26.8 Compute the ISBN-10 check digit for the first nine digits $0,3,0,6,4,0,6,1,5$. Show the weighted sum, reduce it mod 11, and report the check digit (using X if it is 10), following the worked example in §26.4.

26.9 † A polynomial (string) hash uses base $b = 31$ and modulus $m = 101$, with character codes $\texttt{a}=1, \texttt{b}=2, \dots$ Using Horner's rule with reduction mod 101 at each step (the Chapter 23 rule), compute the hash of the three-character string $\texttt{bad}$ ($c_0=2, c_1=1, c_2=4$). Show each intermediate value.

26.10 Verify the receiver's side of Exercise 26.7: divide the transmitted frame you found by the same $G = 1011$ and confirm the remainder is $000$. Then flip the third bit of the frame and redivide; report the new (nonzero) remainder, demonstrating that the single-bit error is detected.

26.11 † A generator $G$ has degree $r = 12$. State, with a one-line justification each: (a) the longest burst error length the CRC is guaranteed to detect; (b) whether all single-bit errors are detected, given that $G$ has at least two nonzero terms; (c) what additional factor $G$ would need for all odd numbers of bit errors to be detected.

Part C — Prove it (⭐⭐, one scaffolded)

26.12 † (Scaffolded — fill in the missing steps.) Prove the §26.4 Claim: a single even-parity bit detects every odd number of bit errors and misses every even number. Fill the blanks.

Setup. By construction the transmitted block has an ______ number of 1-bits, so its count mod 2 is ______.
Effect of one flip. Flipping any single bit changes a $1$ to a $0$ or a $0$ to a $1$; in either case the total count of 1-bits changes by exactly ______, so the count mod 2 ______ (toggles / is unchanged).
Effect of $e$ flips. After $e$ flips the parity has toggled ______ times, so the final parity equals the original parity if and only if $e$ is ______.
Conclusion. The receiver detects an error exactly when the parity ______ the expected value, i.e. exactly when $e$ is ______; an even number of flips $\ge 2$ is therefore ______. ∎

26.13 Prove the collision theorem directly from the pigeonhole principle, without the proof-by-contradiction framing of §26.2: let $h\colon U \to \{0,\dots,m-1\}$ with $\lvert U \rvert > m$, treat the $m$ slots as the boxes and the keys of $U$ as the objects, and apply the pigeonhole principle verbatim to conclude that some slot receives two distinct keys.

26.14 † (Prove it.) Prove the §26.4 intuition that a prime modulus catches every single-digit error in an ISBN-style scheme. Precisely: in $\sum_{i=1}^{10} i\,d_i \pmod{11}$, suppose one digit changes, $d_i \to d_i'$ with $d_i \ne d_i'$ and $1 \le i \le 10$. Show the weighted sum changes by a nonzero amount mod 11, so the error is detected. (Hint: the change is $i\,(d_i' - d_i)$; bound both factors and use that 11 is prime.)

26.15 Prove that a CRC with generator $G(x)$ detects every single-bit error if and only if $G(x)$ has at least two nonzero terms. (A single-bit error is $E(x) = x^i$ for some $i \ge 0$; recall an error is undetected iff $G \mid E$. Argue both directions of the "if and only if.")

26.16 † Prove that if $G(x) = (x+1)\,Q(x)$ for some polynomial $Q$ over $\mathbb{Z}_2$, then the CRC detects every error that flips an odd number of bits. (Hint: evaluate the error polynomial $E(x)$ at $x = 1$ in $\mathbb{Z}_2$; relate $E(1)$ to the number of flipped bits, and use that $x+1 \mid E$ forces $E(1) = 0$.)

Part D — Implement it (⭐⭐)

Write Python for each. Do not run it on the grader's machine — hand-trace and include an # Expected output: comment, matching the book's convention.

26.17 † Write verify_isbn10(digits) that takes a list of ten integers (the tenth may be $10$ to stand for X) and returns True exactly when $\sum_{i=1}^{10} i\,d_i \equiv 0 \pmod{11}$. Test it on a valid ISBN and on the same ISBN with two adjacent digits swapped; state what the two outputs are and why the swap is caught.

26.18 Write crc_check(frame, generator) that returns True when the received frame (a 0/1 string that already includes the appended remainder) divides evenly by generator (remainder all zeros), reusing the division logic from the chapter's crc_remainder. Show it returning True on a clean frame and False on a frame with one flipped bit.

26.19 † Write chain_lengths(keys, m) that hashes each integer in keys with $k \bmod m$ and returns a list counts of length m, where counts[s] is how many keys landed in slot s. Run it (on paper) for keys = [10, 21, 32, 43, 7] and m = 11; give the resulting counts list and identify the load factor and the longest chain.

26.20 Write hamming_distance(a, b) from scratch (you may reuse the chapter's one-liner) and use it to verify a claim from the Project Checkpoint: compute the Hamming distance between the two codewords hamming_encode([0,0,0,0]) and hamming_encode([1,0,0,0]). State the distance you get and what it implies about the code's ability to detect or correct errors.

Part E — Find the error (⭐⭐)

Each argument below is wrong. State precisely which step fails and why.

26.21 † Claim: "With a good enough hash function and a 64-slot table, you can store any 50 distinct 64-bit keys with zero collisions, because $50 < 64$." "Argument": there are fewer keys than slots, so by the pigeonhole principle no two keys need share a slot; pick the function that assigns each key its own slot. — Find the flaw. (Is the function allowed to depend on which 50 keys arrive?)

26.22 Claim: "A two-bit error in a parity-protected block will be caught, because two errors are more corruption than one and parity catches one." "Argument": parity detects a single flipped bit; two flipped bits is strictly worse, so detection only gets easier. — Find the flaw, and name the exact error count parity misses.

26.23 † Claim: "CRC-32 is a 32-bit fingerprint and SHA-256 is a 256-bit fingerprint, so CRC-32 is just a weaker version of the same thing — fine for tamper-detection on small files where a 32-bit fingerprint suffices." "Argument": both map data to a fixed-size value and flag a mismatch; the only difference is digest length, and 32 bits is plenty for a small file. — Find the flaw. (Which property from §26.6 is being ignored, and why does file size not rescue the argument?)

26.24 Claim: "To prove a CRC catches all bursts of length $\le r$, note that a burst error polynomial $E(x)$ has degree exactly $r$, and a degree-$r$ polynomial can't be divisible by the degree-$r$ generator $G$ unless they're equal." — The conclusion (bursts $\le r$ are caught) is right, but this argument is sloppy. Identify the two imprecise claims and state the correct reasoning. (Hint: what is the degree of a burst of length exactly $\ell \le r$, and is "can't be divisible unless equal" true?)

Part F — Model it & Conjecture and test (⭐⭐⭐)

26.25 † (Model it.) A music-streaming service deduplicates uploaded audio files: when a user uploads a track, the server computes a fingerprint of the file's bytes and, if a file with that fingerprint is already stored, links to the existing copy instead of storing a duplicate. Translate this system into the discrete-math objects of this chapter: identify the key universe $U$, the function and its codomain, what a collision would mean here (two different songs deduplicated to one), and which §26.6 property the fingerprint must have so that a malicious user cannot deliberately make a junk file alias a copyrighted one. State clearly which guarantee the pigeonhole principle gives and which one the system actually relies on.

26.26 (Model it.) A warehouse uses 12-digit UPC barcodes whose final digit is a checksum: $$\Big(3\,(d_1 + d_3 + d_5 + d_7 + d_9 + d_{11}) + (d_2 + d_4 + d_6 + d_8 + d_{10} + d_{12})\Big) \equiv 0 \pmod{10}.$$ Model this as a weighted checksum in the §26.4 sense: what are the weights, and what is the modulus? Then explain — using the chapter's reasoning about why ISBN uses a prime — one specific kind of error this mod-10 scheme fails to detect that the prime-modulus ISBN scheme catches. (Hint: think about a transposition of two adjacent digits whose values differ by 5.)

26.27 † (Conjecture and test, then prove.) You hash the consecutive integer keys $0, 1, 2, \dots, n-1$ into a table of size $m$ with $h(k) = k \bmod m$. - (a) Conjecture. By hand or with chain_lengths from Exercise 26.19, compute the slot counts for $(n, m) = (10, 5),\ (12, 5),\ (13, 5)$. Conjecture a formula for the longest chain length in terms of $n$ and $m$ when the keys are exactly $0,\dots,n-1$. - (b) Test. Describe the output you'd expect from running chain_lengths on a few more $(n,m)$ pairs that would support your conjecture, and say what single output would refute it. - (c) Prove. Prove your conjecture. (Hint: with consecutive keys, the chain lengths are as even as possible — exactly $\lceil n/m\rceil$ or $\lfloor n/m\rfloor$. This is the best case, the opposite of the adversarial worst case of §26.3.)

26.28 (Conjecture and test.) Consider the "XOR checksum": the check value of a list of equal-length bit blocks is their bitwise XOR. Conjecture which of these error patterns it detects and which it misses: (i) a single bit flipped in one block; (ii) the same bit position flipped in two blocks; (iii) two adjacent blocks swapped. Test each conjecture on a tiny hand-example (three 4-bit blocks), then state the pattern. Which §26.4 weakness does case (iii) reveal that positional weighting would fix?

Part G — Interleaved & Deep Dive

Mixing techniques from earlier chapters keeps them sharp.

26.29 † (Ch. 23 + 26.) The Carter–Wegman universal family is $h_{a,b}(k) = ((ak+b) \bmod p) \bmod m$. Take $p = 17$ (prime), $m = 5$, $a = 3$, $b = 4$. (a) Compute $h_{3,4}(k)$ for $k = 2$ and $k = 9$. (b) Using the Chapter 23 fact that multiplication by a unit mod a prime is a bijection, explain in one sentence why $((ak+b)\bmod p)$ takes each residue mod $p$ exactly once as $k$ ranges over $0,\dots,16$.

26.30 (Ch. 17 + 26.) Restate the collision theorem purely in the generalized-pigeonhole language of Chapter 17, then sharpen it: for a table of $m$ slots holding $n$ keys, give the Chapter 17 guarantee on the most crowded slot, and explain why this is a statement about every hash function, not just bad ones.

26.31 † (Ch. 22 + 26.) §26.3 claims that if $m$ is prime and the key step $d$ is not a multiple of $m$, then $k, k+d, k+2d, \dots$ cycle through all $m$ residues before repeating. Prove this using the Chapter 22 fact $\gcd(d, m) = 1$ (and the Bézout/Euclid groundwork): show that $jd \equiv 0 \pmod m$ for $0 < j < m$ is impossible, so the $m$ values $0\cdot d, 1\cdot d, \dots, (m-1)d$ are distinct mod $m$.

26.32 (Ch. 9 + 26.) §26.1 notes a hash function is a function in the Chapter 9 sense, and §26.2 that it cannot be injective when $\lvert U \rvert > m$. Using the Chapter 9 definitions, answer: can a hash function $h\colon U \to \{0,\dots,m-1\}$ ever be surjective? Can it be bijective? For each, give the exact condition on $\lvert U\rvert$ and $m$, and connect "not injective" to why collisions exist.

26.33 † (Deep Dive — Ch. 25 + 26.) §26.6 says cryptographic hashing rests on computational infeasibility, the same shift that made RSA possible. Write one paragraph contrasting the guarantee behind a CRC ("all bursts of length $\le r$ are detected — a theorem") with the assumption behind a cryptographic hash ("no adversary can find a collision — a belief about computational cost"). Which one could, in principle, be broken by a faster algorithm without any new mathematics, and which one cannot?

26.34 (Deep Dive.) The Project Checkpoint's Hamming(7,4) code places parity bits at positions $1, 2, 4$ so the 3-bit syndrome reads either $0$ or the exact position to flip. Explain why those positions: show that position $j$ (for $1 \le j \le 7$) is checked by parity bit $p_{2^t}$ exactly when bit $t$ of $j$'s binary representation is 1, so the syndrome bits spell out the binary index of the erroneous position. (This previews Chapter 38; you may reason from the encode/decode code, not prove a general theorem.)

Solutions to † and odd-numbered exercises: see appendices/answers-to-selected.md. Reference code for the "implement it" problems (Part D) is in code/exercise-solutions.py. For each proof, the rubric rewards: a precisely stated claim, the right object identified (the error polynomial, the weighted sum, the function and its codomain), explicit use of the named earlier result (pigeonhole, $\gcd$, modular-inverse), and a clean conclusion.