Case Study: Building a Reversible Encoder as a Bijection

DataField.Dev

Case Study: Building a Reversible Encoder as a Bijection

"To invert, bijize."

Executive Summary

In this build-focused case study you will construct an encoder that turns structured data into a compact integer and back again — a tiny serialization codec — and you will design it, from the first line, to be a bijection so that decoding is guaranteed to be its inverse. The thread of the whole build is the §9.4 threshold theorem: a function can be reversed if and only if it is bijective, with surjectivity supplying existence of a decode and injectivity supplying uniqueness. You will build in three stages — a single bounded field, a pair of fields, then a small record — proving at each stage that encode is a bijection and that decode ∘ encode = id. Where case study 1 audited a function that was deliberately non-injective, here you engineer injectivity in on purpose, because a codec that loses information cannot be decoded at all.

Skills applied

Constructing a function $\text{encode}\colon A \to \mathbb{N}$ and proving it injective and surjective onto its range (§9.3).
Building the inverse $\text{decode} = \text{encode}^{-1}$ and proving $\text{decode} \circ \text{encode} = \operatorname{id}$ and $\text{encode} \circ \text{decode} = \operatorname{id}$ (§9.4).
Using composition to combine field encoders, and the theorem that a composition of bijections is a bijection (§9.4).
Using floor, mod, and integer arithmetic as the §9.5 toolkit for mixed-radix encoding.
Recognizing the codec as the lossless-compression / reversible-encoding pattern of §9.6.

Background

The scenario

You're building a multiplayer game server. Each game event is a small record — a player slot, an action code, and a board column — and the network layer wants to send each event as one integer to keep packets tiny, then reconstruct the record on the other side. The constraints:

player is one of $4$ slots: $0,1,2,3$.
action is one of $6$ codes: $0,1,2,3,4,5$.
column is one of $7$ board columns: $0,1,2,3,4,5,6$.

The wire format must be reversible: every record encodes to an integer, and every integer the server might receive decodes back to exactly the record that produced it. If two different events ever encode to the same integer, the receiver cannot tell them apart — the game desyncs. So the requirement, stated in this chapter's words, is: encode must be a bijection onto its set of codes.

💡 Intuition: Think of a car odometer with wheels of different sizes — one wheel with 4 positions, one with 6, one with 7. Each combination of wheel positions shows a unique number, and from the number you can read each wheel back off. That "mixed-radix odometer" is exactly the encoder we'll build: distinct records ↔ distinct numbers (injective), and every number in range corresponds to some record (surjective). The odometer is reversible because it's a bijection.

Why this matters

Every time you serialize — JSON to bytes, a struct to a database row, a move to a game packet, a file to a compressed blob — you are relying on a function being reversible. §9.6 made the rule explicit: lossless encoding must be injective, because decompression is the inverse, and by §9.4 an inverse exists only for a bijection. Building a codec is therefore an exercise in engineering bijectivity, and the payoff is a correctness guarantee you can prove rather than test: not "decode undid encode on the cases I tried," but "decode undoes encode on every record there is."

Phase 1: One bounded field — the simplest bijection

Start with a single field: a player in $\{0,1,2,3\}$. The "encoder" is almost trivial — the identity on a $4$-element set — but it lets us nail the proof template we'll reuse.

def encode_player(p):
    """Encode a player slot 0..3 as an integer 0..3."""
    assert 0 <= p < 4
    return p

def decode_player(n):
    """Decode an integer 0..3 back to a player slot."""
    assert 0 <= n < 4
    return n

print(encode_player(2), decode_player(encode_player(2)))
# Expected output:
# 2 2

Model it: $\text{encode\_player}\colon \{0,1,2,3\} \to \{0,1,2,3\}$ is the identity function $\operatorname{id}$ (§9.4). The identity is the canonical bijection — it is injective ($a_1 = a_2 \Rightarrow a_1 = a_2$ trivially) and surjective (every $b$ is its own preimage). By the §9.4 invertibility theorem it is invertible, and its inverse is itself. So decode_player is correct, and decode_player(encode_player(p)) = p for all four inputs.

🔄 Check Your Understanding The identity feels too trivial to be worth a proof. Why is it nonetheless the right base case for building a codec out of composed pieces?

Answer

Because we will build larger encoders by composing and combining bijections, and §9.4's theorem "a composition of bijections is a bijection" needs bijective pieces to start from. The identity is the simplest certified-bijective brick; everything else is built by combining bricks whose bijectivity we already have.

Phase 2: Two fields — packing with a mixed radix

Now combine player ($0..3$) and action ($0..5$) into one integer. The standard trick is positional, like reading a two-digit number where the "tens" digit has base $6$: $$\text{encode}(p, a) = a \cdot 4 + p.$$ Here $p$ is the low part (radix $4$) and $a$ is the high part. The integer ranges over $0,1,\dots,4\cdot 6 - 1 = 23$ — exactly $24$ values, one per $(p,a)$ pair. Decoding peels the integer apart with the §9.5 mod and floor: $$p = n \bmod 4, \qquad a = \lfloor n / 4 \rfloor.$$

def encode_pa(p, a):
    """Encode (player 0..3, action 0..5) as an integer 0..23."""
    assert 0 <= p < 4 and 0 <= a < 6
    return a * 4 + p

def decode_pa(n):
    """Decode an integer 0..23 back to (player, action)."""
    assert 0 <= n < 24
    return (n % 4, n // 4)

print(encode_pa(3, 5))                       # 5*4 + 3 = 23
print(decode_pa(encode_pa(3, 5)))            # should be (3, 5)
print(decode_pa(encode_pa(1, 4)))            # should be (1, 4)
# Expected output:
# 23
# (3, 5)
# (1, 4)

We claim $\text{encode\_pa}\colon \{0,1,2,3\} \times \{0,\dots,5\} \to \{0,1,\dots,23\}$ is a bijection, and decode_pa is its inverse. Prove it with the §9.3–9.4 templates.

Claim. $\text{encode\_pa}$ is injective.

Proof. Assume $\text{encode\_pa}(p_1, a_1) = \text{encode\_pa}(p_2, a_2)$, i.e. $a_1\cdot 4 + p_1 = a_2 \cdot 4 + p_2$. Reduce both sides mod $4$: since $0 \le p_i < 4$, we have $p_i = (a_i \cdot 4 + p_i) \bmod 4$, so $p_1 = p_2$. Subtracting the (now equal) $p$ from both sides gives $a_1 \cdot 4 = a_2 \cdot 4$, hence $a_1 = a_2$. So the inputs were equal, and $\text{encode_pa}$ is injective. $\blacksquare$

Claim. $\text{encode\_pa}$ is surjective onto $\{0,\dots,23\}$.

Proof. Let $n \in \{0,\dots,23\}$ be arbitrary (the §9.3 surjectivity template: start from a target). Set $p = n \bmod 4$ and $a = \lfloor n/4 \rfloor$. The division algorithm (the defining property of $\bmod$ and $\lfloor \cdot \rfloor$, §9.5) gives $n = 4\lfloor n/4 \rfloor + (n \bmod 4) = 4a + p$ with $0 \le p < 4$, and since $0 \le n \le 23$ we get $0 \le a = \lfloor n/4 \rfloor \le \lfloor 23/4 \rfloor = 5$, so $(p, a)$ is a legal record. Then $\text{encode_pa}(p, a) = a\cdot 4 + p = n$. Every target has a preimage, so $\text{encode\_pa}$ is surjective. $\blacksquare$

Being injective and surjective, $\text{encode\_pa}$ is a bijection, so by the §9.4 invertibility theorem it has a unique inverse. The surjectivity proof built that inverse explicitly — the map $n \mapsto (n \bmod 4, \lfloor n/4 \rfloor)$ — which is exactly decode_pa. Notice the §9.4 bookkeeping live in the proof: surjectivity gave existence of $(p,a)$ for each $n$, injectivity gave uniqueness, and "existence + uniqueness" is precisely "decode is a well-defined function."

⚠️ Common Pitfall: the radix must match the field's size, not be guessed. If you wrote $\text{encode}(p,a) = a\cdot 3 + p$ (radix $3$ for a field that ranges $0..3$), the map would collide: $\text{encode}(3, 0) = 3 = \text{encode}(0, 1)$. The multiplier on the high field must be the count of the low field ($4$ here), or injectivity fails. The proof's "reduce mod $4$" step is exactly what pins this down — it only recovers $p$ because $p < 4 =$ the radix.

🔄 Check Your Understanding With the wrong radix $\text{encode}(p,a) = a\cdot 3 + p$, compute $\text{encode}(3,0)$ and $\text{encode}(0,1)$ and say which property of §9.3 fails.

Answer

$\text{encode}(3,0) = 0\cdot 3 + 3 = 3$ and $\text{encode}(0,1) = 1\cdot 3 + 0 = 3$. Two distinct records map to $3$, so injectivity fails — and a non-injective encoder cannot be decoded unambiguously (§9.4). The radix $3$ is too small for a field that reaches $3$.

Phase 3: A full record — composition of bijections

Add the third field, column ($0..6$), to get the full event. Extend the mixed radix one more place: the running multiplier is the product of the sizes already packed. With sizes $4, 6, 7$: $$\text{encode}(p, a, c) = c \cdot (4 \cdot 6) + a \cdot 4 + p = 24c + 4a + p,$$ ranging over $0,\dots,4\cdot 6\cdot 7 - 1 = 167$ — exactly $168$ codes, one per record. Decode peels from the low end: $$p = n \bmod 4, \quad a = \lfloor n/4 \rfloor \bmod 6, \quad c = \lfloor n / 24 \rfloor.$$

SIZES = (4, 6, 7)   # player, action, column

def encode_event(p, a, c):
    """Encode (player, action, column) as an integer 0..167."""
    assert 0 <= p < 4 and 0 <= a < 6 and 0 <= c < 7
    return c * 24 + a * 4 + p

def decode_event(n):
    """Decode an integer 0..167 back to (player, action, column)."""
    assert 0 <= n < 168
    p = n % 4
    a = (n // 4) % 6
    c = n // 24
    return (p, a, c)

print(encode_event(3, 5, 6))                 # 6*24 + 5*4 + 3 = 167
print(decode_event(encode_event(3, 5, 6)))   # should be (3, 5, 6)
print(decode_event(encode_event(2, 0, 4)))   # should be (2, 0, 4)
# Expected output:
# 167
# (3, 5, 6)
# (2, 0, 4)

Rather than redo the injectivity/surjectivity algebra from scratch, reuse Phase 2 by composition — this is the §9.4 payoff. View the full encoder as: first pack $(p, a)$ into a value $m = 4a + p \in {0,\dots,23}$ (Phase 2's bijection $\text{encode_pa}$), then pack $(m, c)$ into $24c + m \in {0,\dots,167}$ by the *same two-field construction*, now with low-field count $24$.

The strategy first. We already proved the two-field pack is a bijection (Phase 2). The three-field pack is another two-field pack — combining the bijectively-encoded $(p,a)$ with $c$ — so it is a bijection by the same proof, with $4$ replaced by $24$. Then the whole encoder is a composition of two bijections, and §9.4's theorem ("a composition of bijections is a bijection") gives the result with no new algebra. This is abstraction as the master tool (theme three): prove the pattern once, instantiate it.

Claim. $\text{encode_event}\colon {0,1,2,3}\times{0,\dots,5}\times{0,\dots,6} \to {0,\dots,167}$ is a bijection, and decode_event is its inverse.

Proof. Let $B_k(x, y) = y\cdot k + x$ denote the two-field pack with low-field count $k$ (low field $0 \le x < k$). Phase 2 proved $B_k$ is a bijection from $\{0,\dots,k-1\}\times\{0,\dots,N-1\}$ onto $\{0,\dots,kN-1\}$, with inverse $z \mapsto (z \bmod k,\ \lfloor z/k\rfloor)$ — the proof there used only $0 \le x < k$ and the division algorithm, so it applies for any $k$. Now $$\text{encode\_event}(p,a,c) = B_{24}\big(B_4(p, a),\ c\big).$$ $B_4(p,a) = 4a + p$ is a bijection onto $\{0,\dots,23\}$ (Phase 2 with $k=4, N=6$). Feeding its output as the low field of $B_{24}$ (with $0 \le B_4(p,a) < 24$, $0 \le c < 7$) is a bijection onto ${0,\dots,24 \cdot 7 - 1} = {0,\dots,167}$ (Phase 2 with $k=24, N=7$). A composition of two bijections is a bijection (§9.4 theorem), so $\text{encode\_event}$ is a bijection. Its inverse is the composition of the inverses in the opposite order: first invert $B_{24}$ to recover $(B_4(p,a),\ c) = (n \bmod 24,\ \lfloor n/24\rfloor)$, then invert $B_4$ to recover $(p, a) = (m \bmod 4,\ \lfloor m/4\rfloor)$ from $m = n \bmod 24$. Composing, $$p = n \bmod 4,\quad a = \Big\lfloor \tfrac{n \bmod 24}{4}\Big\rfloor = \Big\lfloor \tfrac{n}{4}\Big\rfloor \bmod 6,\quad c = \Big\lfloor \tfrac{n}{24}\Big\rfloor,$$ which is exactly decode_event. Therefore $\text{decode_event} \circ \text{encode_event} = \operatorname{id}$ and $\text{encode_event}\circ\text{decode_event} = \operatorname{id}$. $\blacksquare$

The step $a = \lfloor (n \bmod 24)/4\rfloor = \lfloor n/4\rfloor \bmod 6$ is a small §9.5 identity worth checking once: reducing mod $24$ then dividing by $4$ is the same as dividing by $4$ then reducing mod $6$ (because $24 = 4\cdot 6$). Both forms appear in real codecs; they compute the same digit.

🔗 Connection: The "inverse of a composition is the composition of inverses, reversed" move — $(g \circ f)^{-1} = f^{-1} \circ g^{-1}$ — is the §9.4 inverse idea applied twice. You unpack a nested box in the reverse of the order you packed it: last in, first out. It is the same right-to-left discipline as reading $g \circ f$ as "do $f$ first."

Phase 4: Stress the bijection — the round-trip test

A bijection's defining promise is the round trip, $\text{decode}(\text{encode}(r)) = r$ for every record $r$. Because the domain is finite and small ($4\cdot 6\cdot 7 = 168$ records), we can — in principle — check every record, turning the proof's guarantee into an exhaustive verification. Theme four: computation and proof are partners; here they can even cover the same (finite) ground.

def round_trip_ok():
    """Check decode(encode(r)) == r for every record, and that
    encode hits every integer in 0..167 exactly once (bijection onto its range)."""
    seen = set()
    for p in range(4):
        for a in range(6):
            for c in range(7):
                n = encode_event(p, a, c)
                if decode_event(n) != (p, a, c):
                    return f"round-trip FAILED at {(p, a, c)} -> {n}"
                seen.add(n)
    return "all 168 round-trips OK; codes used: " + str(len(seen))

print(round_trip_ok())
# Expected output:
# all 168 round-trips OK; codes used: 168

Two facts are checked at once. Every record survives the round trip (so decode really inverts encode), and the encoder uses $168$ distinct codes for $168$ records — which is injectivity in its counting form (§9.3: distinct inputs give distinct outputs, so the number of distinct codes equals the number of records) and surjectivity onto $\{0,\dots,167\}$ (all $168$ targets used). The finite theorem of §9.3 even lets us shortcut: since the domain and the codomain $\{0,\dots,167\}$ both have $168$ elements, injective alone would have forced surjective, so checking "$168$ distinct codes" already implies the encoder is a bijection. Here the exhaustive test and the proof agree perfectly — but note the proof of Phase 3 covered the pattern for any field sizes, where no finite test could.

⚠️ Common Pitfall: Forgetting the bounds and silently breaking surjectivity. If a fourth field were added but the multiplier left at $24$ (not updated to $24\cdot 7 = 168$), two records would map to the same code and the round-trip test would fail immediately — exactly the radix mistake of Phase 2, one level up. The running multiplier must always equal the product of all lower field sizes. The proof's requirement "$0 \le \text{low part} < k$" is what enforces this.

Discussion Questions

The encoder is a bijection onto $\{0,\dots,167\}$, not onto all of $\mathbb{N}$. Why is "bijection onto its range" the right notion for a codec, and what would go wrong if the receiver got an integer like $200$ that is outside the range? Phrase your answer using domain/codomain/range (§9.2).
Phase 3 proved the three-field encoder a bijection by composition rather than by re-deriving the algebra. State the general theorem you relied on and explain why it let you avoid new work — and what you'd have had to prove without it.
Suppose the game later allows action to be any nonnegative integer (unbounded). Show that the mixed-radix scheme breaks, and explain — using injectivity and the finite theorem — why no fixed-width integer code can losslessly encode an unbounded field. What would you need instead (variable-length encoding)?
The decode formula uses both $\bmod$ and $\lfloor \cdot \rfloor$. Explain why these two §9.5 functions are the natural "unpackers" for positional encoding, referencing the division algorithm $n = 4a + p$.
Compare this case study to the hash-function audit (case study 1). One built injectivity in on purpose; the other found it missing. State the single §9.4 principle that explains why the codec had to be injective but the hash was allowed not to be.

Your Turn: Extensions

Option A (general mixed radix). Generalize to encode(fields, sizes) for a tuple of fields and their sizes, computing the running multiplier as the product of lower sizes; write the matching decode(n, sizes). State and sketch (by the Phase 3 composition argument) why it is a bijection onto $\{0,\dots,\prod \text{sizes} - 1\}$, and hand-trace one encode/decode pair.
Option B (the inverse refuses bad input). Following the Toolkit's inverse (which raises on a non-injective function), make decode_event raise on any $n \notin \{0,\dots,167\}$ and argue that this guard is the §9.4 statement "an inverse is defined only on the range" enforced in code. Why is silently returning a garbage record worse?
Option C (a different bijection). Replace mixed-radix packing with a Cantor pairing-style bijection $\mathbb{N}\times\mathbb{N}\to\mathbb{N}$ for two unbounded fields, $\pi(x,y) = \frac{(x+y)(x+y+1)}{2} + y$. Hand-compute $\pi$ on a few pairs, confirm no collisions among them, and note that this is the infinite bijection Chapter 10 uses to prove $\mathbb{N}\times\mathbb{N}$ has the same size as $\mathbb{N}$. (You are building the bridge to the next chapter.)

Key Takeaways

A codec is a bijection by design. Lossless encoding must be injective (so decode is unambiguous) and surjective onto its range (so every code decodes); §9.4 says exactly that means invertible.
Mixed-radix packing is positional arithmetic. Encode multiplies each field by the product of lower field sizes; decode peels it back with $\bmod$ and $\lfloor \cdot \rfloor$. The multiplier must be the count of the lower field, or injectivity dies.
Build big bijections by composing small ones. The three-field encoder reused the two-field proof via "a composition of bijections is a bijection," and its inverse is the inverse composition reversed, $(g\circ f)^{-1} = f^{-1}\circ g^{-1}$.
For a finite codec, proof and exhaustive test can meet. With $168$ records the round trip is fully checkable, and the finite theorem even lets "$168$ distinct codes" imply bijectivity — but the compositional proof generalizes to field sizes no test could enumerate.