Case Study: Building a Huffman File Compressor from Scratch

DataField.Dev

Case Study: Building a Huffman File Compressor from Scratch

"The fundamental problem of communication is that of reproducing at one point ... a message selected at another point." — Claude E. Shannon, A Mathematical Theory of Communication (1948)

Executive Summary

In Case Study 1 you analyzed an existing evaluator. Here you will build something end-to-end: a working Huffman compressor that takes a string, counts its symbol frequencies, builds the optimal prefix-code tree of §31.6, encodes the string to a bitstring, and decodes it back — losslessly. Along the way you will use a binary heap (§31.5) as the priority queue that drives the greedy merge, hand-verify that the encode/decode round-trip is exactly the inverse of itself, and measure the real compression ratio against fixed-length encoding. This is the build-it counterpart to the analysis-heavy Case Study 1, and it is the chapter's two big ideas — heaps as priority queues and trees as optimal codes — working together on a real task.

By the end you will have a small program you can defend line by line, and a concrete answer to "how much do we actually save?" that you derived by hand, not by running anything.

Skills applied

Building a Huffman tree from frequencies using a binary heap / priority queue (§31.5–31.6).
Reading codewords off root-to-leaf paths and constructing the encode table.
Encoding a string and decoding it by walking the tree (the prefix-code decode procedure of §31.6).
Proving the round-trip is lossless (decode ∘ encode = identity), using the prefix-free property.
Computing the compression ratio and comparing to a fixed-length baseline.

Background

The scenario

You are adding compression to a logging system that writes lots of short status strings made of a small alphabet (think DNA-like data, or status codes, or a constrained character set). Fixed-length encoding wastes bits when some symbols dominate. You decide to build a Huffman compressor so common symbols get short codes and rare ones get long codes — provably the best any prefix code can do (Theorem 31.4). You will build it from scratch so you understand and can certify every step, then hand it off with a proof that decompression always exactly recovers the input.

We will compress the string

"ABRACADABRA"

an 11-character string whose letters are far from uniform — perfect for showing Huffman's benefit.

Why this matters

Huffman coding is not a toy: it is a component of real formats (DEFLATE, used in ZIP, gzip, and PNG, combines Huffman coding with a dictionary stage; JPEG uses Huffman tables). More importantly for this book, it is where three threads of Chapter 31 converge: the heap of §31.5 is the engine, the prefix-code tree of §31.6 is the product, and the traversal idea of §31.3 is how you read codewords and decode. Building it cements all three.

🔗 Connection: Huffman coding is the chapter's bridge to Chapter 38 (coding theory), which develops optimal and error-correcting codes in full. There, the cost $\sum_s f(s)\,d(s)$ you minimize here is related to Shannon's entropy bound — the theoretical floor on bits per symbol. Build the compressor now; meet its theoretical limit later.

This case study is deliberately build-heavy: we write every stage from raw counting to a verified round-trip, in contrast to Case Study 1, which dissected an evaluator someone else had written. The payoff of building is that you can certify the result — the losslessness proof in Phase 4 is something you could only write because you control every line of encode and decode.

Phase 1: Count frequencies

Compression starts with statistics: how often does each symbol occur? For "ABRACADABRA" we tally by hand.

from collections import Counter

def frequencies(text):
    """Return a dict mapping each character to its count in text."""
    return dict(Counter(text))

freq = frequencies("ABRACADABRA")
print(sorted(freq.items()))
# Expected output:
# [('A', 5), ('B', 2), ('C', 1), ('D', 1), ('R', 2)]

Counting the letters of ABRACADABRA by hand: A appears at positions 1, 4, 6, 8, 11 → 5 times; B at positions 2, 9 → 2; R at positions 3, 10 → 2; C at position 5 → 1; D at position 7 → 1. Total $5 + 2 + 2 + 1 + 1 = 11$, matching the string length. Good.

🔄 Check Your Understanding Before building the tree, predict: which symbol should get the shortest codeword, and which two should get the longest? State the §31.6 principle behind your prediction.

Answer

A (frequency 5) should get the shortest code; C and D (frequency 1 each, the rarest) should get the longest. The principle (§31.6 Intuition): the rarest symbols should sit deepest, because depth costs the fewest total bits when multiplied by a small frequency. Huffman merges the two smallest frequencies first, burying them at the bottom.

Phase 2: Build the Huffman tree with a heap

Now the heart of the algorithm. Following the §31.6 definition: make each symbol a single-node tree tagged with its frequency; repeatedly remove the two smallest-frequency trees and merge them under a new node whose frequency is their sum; stop when one tree remains.

The phrase "repeatedly remove the two smallest" is exactly what a min-heap / priority queue (§31.5) does in $O(\log n)$ per operation. We use Python's heapq. One wrinkle: heap entries are compared by their first field; if two trees tie on frequency, Python would try to compare the tree objects next and fail, so we add a unique counter as a tiebreaker.

import heapq, itertools

def build_huffman_tree(freq):
    """Build a Huffman tree. Leaves are ('leaf', symbol); internal nodes are
    ('node', left, right). Returns the root. Uses a min-heap as priority queue."""
    counter = itertools.count()                       # unique tiebreaker (Ch. 31 §31.5 heap)
    heap = [(f, next(counter), ('leaf', s)) for s, f in freq.items()]
    heapq.heapify(heap)                               # O(n) build of the priority queue
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)              # two smallest frequencies
        f2, _, t2 = heapq.heappop(heap)
        merged = ('node', t1, t2)                     # t1 on the left (0), t2 on the right (1)
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]                                 # the single remaining tree

Let us hand-trace the merges on freq = {A:5, B:2, R:2, C:1, D:1}, breaking ties by taking the smaller counter (insertion order), so the trace is deterministic. The heap initially holds five single-node trees.

The two smallest frequencies are C:1 and D:1. Pop both; merge into a node {C,D} of frequency $1 + 1 = 2$, with C on the left and D on the right. Push it back.
The queue now holds B:2, R:2, {C,D}:2, A:5. The two smallest are B:2 and R:2 (the two that were inserted earliest among the freq-2 items). Merge into {B,R} of frequency $2 + 2 = 4$. Push it back.
The queue now holds {C,D}:2, A:5, {B,R}:4. The two smallest are {C,D}:2 and {B,R}:4. Merge into {C,D,B,R} of frequency $2 + 4 = 6$. Push it back.
The queue now holds A:5, {C,D,B,R}:6. Merge into the root of frequency $5 + 6 = 11$. One tree remains — done.

The resulting tree (left edge = 0, right edge = 1):

                 (11)
                /    \
             A(5)    (6)
                    /    \
                 (2)      (4)
                /  \     /   \
             C(1) D(1) B(2)  R(2)

⚠️ Common Pitfall: Tie-breaking changes the tree but not its optimality. Different valid tie-break rules (or a different heap implementation) can produce a different Huffman tree with the same total cost — Huffman trees are optimal but not unique. Do not be alarmed if your library's tree differs from this one in which symbol is on the left versus right, or in how equal-frequency subtrees pair up; verify optimality (the cost $\sum f(s)\,d(s)$), not a specific shape. (Compare this chapter's code/example-05-huffman.py, whose tie-breaking puts A on the 1 branch instead.)

Phase 3: Read off the codewords and build the encode table

Codewords are root-to-leaf paths: left = 0, right = 1. Walk the tree once (a traversal!) and record the path to each leaf.

def build_codes(tree):
    """Return a dict mapping each symbol to its binary codeword string."""
    codes = {}
    def walk(node, prefix):
        kind = node[0]
        if kind == 'leaf':
            codes[node[1]] = prefix or "0"           # single-symbol edge case
        else:
            walk(node[1], prefix + "0")              # left child: append 0
            walk(node[2], prefix + "1")              # right child: append 1
    walk(tree, "")
    return codes

Tracing the paths on the tree from Phase 2:

A is the left child of the root → path 0. So A = 0.
C is root→right→left→left → 100. So C = 100.
D is root→right→left→right → 101. So D = 101.
B is root→right→right→left → 110. So B = 110.
R is root→right→right→right → 111. So R = 111.

freq = {'A': 5, 'B': 2, 'R': 2, 'C': 1, 'D': 1}
tree = build_huffman_tree(freq)
codes = build_codes(tree)
print(sorted(codes.items()))
# Expected output (for the hand-traced tie-breaking above):
# [('A', '0'), ('B', '110'), ('C', '100'), ('D', '101'), ('R', '111')]

The common symbol A got a 1-bit code; the rarest symbols C, D got 3-bit codes — exactly the prediction from Phase 1. And no codeword is a prefix of another (e.g., 0 is not a prefix of 100, 101, 110, or 111), because every symbol sits at a leaf — the defining property of a prefix code (§31.6).

🔄 Check Your Understanding Verify the prefix-free property directly: is 0 (the code for A) a prefix of any other codeword? Why does the tree structure guarantee this automatically, so you never have to check it by hand?

Answer

0 is not a prefix of 100, 101, 110, or 111 — all of those start with 1. The tree guarantees prefix-freeness automatically because each symbol is at a leaf: a codeword $u$ is a prefix of a codeword $w$ exactly when the leaf for $u$ is an ancestor of the leaf for $w$ — but leaves have no descendants, so no leaf can be an ancestor of another. Symbols-at-leaves ⟺ prefix-free.

Phase 4: Encode, then decode — and prove it round-trips

Encoding replaces each symbol by its codeword and concatenates.

def encode(text, codes):
    """Concatenate the codeword of each character into one bitstring."""
    return "".join(codes[ch] for ch in text)

bits = encode("ABRACADABRA", codes)
print(bits)
# Expected output:
# 011011101000101011011100

Let us assemble that by hand, symbol by symbol, using A=0, B=110, R=111, C=100, D=101:

char	A	B	R	A	C	A	D	A	B	R	A
code	0	110	111	0	100	0	101	0	110	111	0

Concatenating: 0 110 111 0 100 0 101 0 110 111 0 = 011011101000101011011100. Counting the bits: $1+3+3+1+3+1+3+1+3+3+1 = 23$ bits.

Decoding walks the tree from the root, taking the left edge on 0 and the right edge on 1; when it reaches a leaf it emits that symbol and returns to the root. This is the prefix-code decode procedure of §31.6 — and it needs no delimiters between codewords precisely because the code is prefix-free.

def decode(bits, tree):
    """Walk the tree per bit; emit a symbol at each leaf, then restart at the root."""
    out, node = [], tree
    for b in bits:
        node = node[1] if b == '0' else node[2]      # left on 0, right on 1
        if node[0] == 'leaf':                        # reached a complete codeword
            out.append(node[1])
            node = tree                              # back to the root for the next symbol
    return "".join(out)

print(decode(bits, tree))
# Expected output:
# ABRACADABRA

Hand-trace the first few bits of 011011101... against the Phase-2 tree:

0 → go left from root → leaf A. Emit A, return to root.
1 → right; 1 → right; 0 → left → leaf B. Emit B, return to root. (The codeword 110 was consumed.)
1 → right; 1 → right; 1 → right → leaf R. Emit R, return to root.
0 → leaf A. Emit A … and so on, reproducing ABRACADABRA.

The strategy first (why the round-trip is lossless). We want: decode(encode(text)) == text for every text over the alphabet. The key is the prefix-free property. Encoding writes the codewords back-to-back with no separators; the danger would be that the decoder misreads where one codeword ends and the next begins. Prefix-freeness rules that out: as the decoder walks down from the root, it reaches a leaf exactly when it has consumed one complete codeword, and — because no codeword is a prefix of another — there is never an ambiguity about whether to stop or keep going. So the decoder recovers the symbols in order.

Claim. For any string text over the alphabet, decode(encode(text, codes), tree) == text.

Proof sketch (induction on the number of symbols in text). Base case: the empty string encodes to the empty bitstring, which the decoder turns back into the empty string. Inductive step: suppose the claim holds for strings of length $k$, and let text have length $k+1$ with first symbol $s$. Then encode(text) is $c_s \,\Vert\, \text{encode(rest)}$, where $c_s$ is $s$'s codeword and $\Vert$ is concatenation. Starting at the root, the decoder walks the bits of $c_s$ and reaches the leaf for $s$ exactly at the end of $c_s$ — not before (that would mean a proper prefix of $c_s$ is itself a codeword, contradicting prefix-freeness) and not after (the leaf is reached as soon as the full path $c_s$ is traced). It emits $s$, returns to the root, and is now positioned to decode encode(rest), which by the inductive hypothesis yields rest. So the output is $s \,\Vert\, \text{rest} = \text{text}$. $\blacksquare$

This is lossless compression: the original is recovered bit-for-bit. No amount of testing on sample strings could establish that for every possible input — but the prefix-free argument does, in one finite proof (theme two: a passing test is not a proof).

🐛 Find the Error: A teammate "optimizes" decode by returning to the root only after two leaf hits instead of one, reasoning "most symbols come in pairs." Without re-running, explain why this breaks decoding, and what the output would look like for the first six bits 011011 of our bitstring.

Answer

Returning to the root only every second leaf hit means the decoder keeps walking past the first leaf instead of restarting — but a leaf has no children, so node[1]/node[2] would be undefined (an index error / None dereference) the moment it tries to take an edge from a leaf. Even setting that aside, the logic is wrong: after emitting A on the first 0, the decoder must restart at the root to read the next codeword 110 (→ B). Skipping the restart corrupts everything after the first symbol. The decode procedure's correctness depends on returning to the root after every leaf.

Phase 5: Measure the compression

How much did we save? Two numbers.

Huffman cost. The encoded string is 23 bits (counted in Phase 4). Equivalently, via the §31.6 cost formula $\sum_s f(s)\,d(s)$ with depths $d(A)=1$, $d(B)=d(R)=3$, $d(C)=d(D)=3$:

$$\text{cost} = 5(1) + 2(3) + 2(3) + 1(3) + 1(3) = 5 + 6 + 6 + 3 + 3 = 23 \text{ bits.}$$

(The two counts agree, as they must — the formula is the total bit length.)

Fixed-length baseline. There are 5 distinct symbols, so a fixed-length code needs $\lceil \log_2 5 \rceil = 3$ bits per symbol. For 11 symbols that is $11 \times 3 = 33$ bits.

import math

def fixed_length_bits(text):
    distinct = len(set(text))
    width = max(1, math.ceil(math.log2(distinct)))
    return len(text) * width

def huffman_bits(text, codes):
    return sum(len(codes[ch]) for ch in text)

text = "ABRACADABRA"
print(fixed_length_bits(text), "bits fixed-length")
print(huffman_bits(text, codes), "bits Huffman")
# Expected output:
# 33 bits fixed-length
# 23 bits Huffman

Huffman uses 23 bits versus 33 for fixed-length — a saving of 10 bits, about 30%, on this short example. The saving grows with how skewed the frequencies are: the more one or two symbols dominate, the shorter their codes and the bigger the win. And by Theorem 31.4, 23 bits is the best any prefix code can do for these frequencies — you cannot beat it without changing the model (e.g., coding pairs of symbols, or using arithmetic coding, both topics for Chapter 38).

💡 Intuition: Huffman pays off in proportion to imbalance. If all five symbols occurred equally often, Huffman would essentially reproduce the fixed-length code (every leaf at nearly the same depth) and save little. It is the spread between A's frequency 5 and C/D's frequency 1 that lets the common symbol's short code dominate the average. Real text and logs are highly skewed, which is why Huffman (inside gzip and friends) is everywhere.

Phase 6: Stress the design — edge cases a build must survive

A compressor that works on ABRACADABRA but crashes on real inputs is not finished. A disciplined build hunts the corner cases before shipping. Three matter here, and each is a small lesson in tree thinking.

Edge case 1 — a single distinct symbol. What does the compressor do on "AAAA"? The frequency table is {A: 4}, so build_huffman_tree starts with a heap of one tree and the while len(heap) > 1 loop never runs — it returns that lone leaf as the "tree." Then build_codes hits a leaf at the root with an empty prefix. Our code guards this with prefix or "0", assigning A the codeword 0. Without that guard, A would get the empty codeword "", and encode("AAAA") would produce the empty bitstring — from which decode could never recover four symbols. The lesson: a Huffman tree with one leaf is a degenerate tree of height 0, and degenerate trees are exactly where off-by-one assumptions break (the same moral as §31.4's stick).

freq1 = frequencies("AAAA")          # {'A': 4}
codes1 = build_codes(build_huffman_tree(freq1))
print(codes1)
print(decode(encode("AAAA", codes1), build_huffman_tree(freq1)))
# Expected output:
# {'A': '0'}
# AAAA

Edge case 2 — the empty string. frequencies("") is {}, so the heap is empty and heap[0] would raise an IndexError. A production compress must special-case empty input (emit zero bits and an empty table). The math is fine — an empty file needs zero bits — but the code must not assume a non-empty heap. Flagging this is itself the deliverable; the fix is a one-line guard.

Edge case 3 — all frequencies equal. Compress "ABCD" (each symbol once). Huffman pairs them into two frequency-2 subtrees and then a root, producing a balanced tree of height 2 in which every symbol gets a 2-bit code. The cost is $4 \times 2 = 8$ bits — identical to the fixed-length code ($\lceil\log_2 4\rceil = 2$ bits each). This is the §31.6 intuition made concrete: when frequencies are uniform, Huffman cannot beat fixed-length, because there is no imbalance to exploit. The win is strictly a function of skew.

freqU = frequencies("ABCD")          # all 1
codesU = build_codes(build_huffman_tree(freqU))
print(sorted((s, len(c)) for s, c in codesU.items()))
print(huffman_bits("ABCD", codesU), fixed_length_bits("ABCD"))
# Expected output:
# [('A', 2), ('B', 2), ('C', 2), ('D', 2)]
# 8 8

🚪 Threshold Concept. A build is not done when it produces the right answer on the example you chose; it is done when you have characterized its behavior on the whole input space — including the degenerate ends (one symbol, no symbols) and the no-win middle (uniform frequencies). The same shift in mindset separates "my function passed the test" from "I know what my function does on every input," which is theme two of this book. For trees specifically, the degenerate cases are almost always the height-0 tree and the stick; check them first.

Discussion Questions

We broke frequency ties by insertion order, producing one specific tree. Describe a different valid tie-break that yields a different tree, and argue (using Theorem 31.4) that its total bit cost must be the same 23 bits even though the shapes differ.
The decoder returns to the root after every leaf. Restate this rule as a small state machine (what are the states, and what does each bit do?). How does this connect tree-walking to the finite-state machines you will see in Part VII?
Huffman needs the frequencies before it can build the tree, so a basic compressor must read the input twice (once to count, once to encode) or ship the frequency table with the compressed data. What is the overhead of shipping the table, and for what input sizes does it outweigh the savings?
We computed the fixed-length baseline as $\lceil \log_2 5 \rceil = 3$ bits/symbol. Why the ceiling? What goes wrong if you try to use a fractional number of bits per symbol in a fixed-length scheme, and how does Huffman effectively "use fractional bits on average"?
Theorem 31.4 says Huffman is optimal among prefix codes. Name one assumption of that optimality (hint: it codes one symbol at a time, independently). Sketch why coding pairs of symbols could sometimes do better, previewing the entropy discussion of Chapter 38.

Your Turn: Extensions

Option A (decode robustness). Add error handling to decode so that a bitstring that ends in the middle of a codeword (a truncated stream) raises a clear error instead of silently dropping a symbol. State the tree condition that detects this (hint: the walk ends on a non-root, non-leaf node).
Option B (canonical Huffman). Real formats ship a canonical Huffman code: store only each symbol's codeword length, then reconstruct codewords by a fixed rule, saving the cost of shipping the whole tree. Compute the codeword lengths for ABRACADABRA (you have them: 1, 3, 3, 3, 3) and explain why lengths alone, plus the alphabet, suffice to rebuild a valid prefix code.
Option C (compare to entropy). Compute Shannon's entropy $H = -\sum_s p(s)\log_2 p(s)$ (with $p(s) = f(s)/11$) for ABRACADABRA by hand to two decimals, and compare $11 \times H$ to the 23 bits Huffman used. You should find Huffman is at most one bit per symbol above the entropy floor — the theoretical limit Chapter 38 proves.
Option D (full compressor). Wrap frequencies, build_huffman_tree, build_codes, encode, and decode into two functions compress(text) (returns the bits and the tree) and decompress(bits, tree) (returns the text). Hand-test the round-trip on a different string of your own, e.g. "MISSISSIPPI", deriving its frequency table and at least the first symbol's codeword by hand.

Key Takeaways

Huffman is a heap in action. The greedy "repeatedly merge the two smallest frequencies" is literally a priority-queue loop; the binary heap of §31.5 makes each merge $O(\log n)$, so building the tree for $n$ symbols costs $O(n \log n)$.
Codewords are root-to-leaf paths; decoding is a tree walk. Encoding reads paths off the tree (left = 0, right = 1); decoding walks the tree per bit and emits a symbol at each leaf, returning to the root. Both are traversals (§31.3).
Prefix-freeness makes the round-trip lossless — provably. Because every symbol is at a leaf, no codeword is a prefix of another, so the decoder is never ambiguous about where a codeword ends. decode(encode(text)) == text for every input — a one-shot induction proof, not a test result.
The win scales with imbalance, and it is optimal. On ABRACADABRA, Huffman used 23 bits versus 33 for fixed-length (~30% saving), and by Theorem 31.4 that is the best any prefix code can do. Skewed data (real logs, text) compresses far more.
Build it to certify it. Writing each stage from scratch — count, build, read codes, encode, decode, measure — is what lets you prove the compressor lossless rather than merely hoping the tests covered every case.