Case Study: Building a Residue Number System with the CRT

DataField.Dev

Case Study: Building a Residue Number System with the CRT

"Divide each difficulty into as many parts as is feasible and necessary to resolve it." — René Descartes, Discourse on the Method

Executive Summary

In this build-focused case study you will construct a small but complete Residue Number System (RNS) calculator: a number representation in which a single integer is stored as a tuple of remainders against several coprime moduli, and in which addition, subtraction, and multiplication happen component-by-component, in parallel, with no carries. The engine that converts a tuple of remainders back into an ordinary integer is the Chinese Remainder Theorem of §23.4 — used here not as a puzzle-solver but as a working decoder. You will build the encoder, the parallel arithmetic, and the CRT decoder, hand-trace every operation, and watch a multiplication of large numbers reduce to three multiplications of tiny ones. By the end you will understand why RNS underlies fast hardware arithmetic, why it is the standard speed-up inside real RSA implementations, and exactly what "a number is its remainders" (the §23.4 threshold concept) means once you can compute with it.

This is a build-heavy counterpart to the affine-cipher analysis: there we dismantled a system with hand-derivation; here we assemble one, function by function, each with its expected output reasoned out by hand.

Skills applied

Encoding an integer into a residue tuple against coprime moduli (§23.1).
Implementing carry-free component-wise modular arithmetic, justified by Theorem 23.2.
Decoding a residue tuple back to an integer with the CRT construction (§23.4, the chapter's crt).
Reasoning about the legal range of an RNS and where it "overflows."
Connecting the RNS isomorphism $\mathbb{Z}_{N} \cong \prod_i \mathbb{Z}_{n_i}$ to the §23.4 threshold concept and the RSA speed-up.

Background

The idea: a number as a tuple of remainders

Pick a set of pairwise-coprime moduli — we will use $n_1 = 7,\ n_2 = 11,\ n_3 = 13$. Their product is $$N = 7 \cdot 11 \cdot 13 = 1001.$$ The CRT (§23.4) says every integer in $\{0, 1, \dots, 1000\}$ corresponds to exactly one triple of remainders $(x \bmod 7,\ x \bmod 11,\ x \bmod 13)$, and that the correspondence is reversible. So we can represent any number in range by its triple. For example, $32$ becomes $(4, 10, 6)$ because $32 = 4 \cdot 7 + 4$, $32 = 2 \cdot 11 + 10$, $32 = 2 \cdot 13 + 6$.

🚪 Threshold Concept revisited. §23.4 introduced the slogan a number is its remainders and called the dictionary between $\mathbb{Z}_{1001}$ and $\mathbb{Z}7 \times \mathbb{Z} \times \mathbb{Z}_{13}$ an isomorphism. This case study makes that slogan operational: we will store numbers as remainder-triples and compute with them in that form, only translating back at the very end. The isomorphism is not decoration — it is the data structure.

Why anyone would do this

Three reasons, all of which matter in real systems:

No carries, full parallelism. Adding two ordinary integers requires carry propagation from the low digits to the high digits — inherently sequential. In an RNS, each component is added independently: three (or thirty) adders run at the same time, never waiting on each other. This is why digital-signal-processing hardware and some cryptographic accelerators use RNS.
Big multiplications become small ones. Multiplying two large numbers directly is expensive; in an RNS each component multiplication is a tiny fixed-width multiply (§23.2's "reduce early" keeps every product below $n_i^2$).
The RSA speed-up (Chapter 25). Real RSA decrypts modulo $p$ and modulo $q$ separately — a two-modulus RNS — then reassembles with the CRT, roughly quartering the work. The calculator you build here is that optimization in miniature.

🔗 Connection. The component-wise arithmetic is licensed by Theorem 23.2: because addition, subtraction, and multiplication are well-defined on residue classes, doing them per-component and reassembling gives the same answer as doing them on the full integer. Theorem 23.2 is why the RNS is correct; the CRT is how we read the answer back out.

Phase 1: The encoder

Encoding is direct: reduce the integer against each modulus.

MODULI = [7, 11, 13]          # pairwise coprime; product N = 1001

def encode(x, moduli=MODULI):
    """Represent integer x as its tuple of residues against the moduli."""
    return tuple(x % n for n in moduli)

print(encode(32))     # 32 = 4*7+4, 2*11+10, 2*13+6
print(encode(123))    # 123 = 17*7+4, 11*11+2, 9*13+6
print(encode(456))    # 456 = 65*7+1, 41*11+5, 35*13+1
# Expected output:
# (4, 10, 6)
# (4, 2, 6)
# (1, 5, 1)

Each line is three reductions, computed by hand in the comments. Note that encode accepts any integer — even one larger than $N = 1001$ — but the representation only round-trips uniquely for $x$ in $\{0, \dots, 1000\}$; outside that range two different inputs collide. We return to this in Phase 4.

🔄 Check Your Understanding Encode $1000$ and $1001$ by hand against $(7, 11, 13)$. What do you notice about $\texttt{encode}(0)$ versus $\texttt{encode}(1001)$?

Answer

$1000 = 142\cdot7 + 6$, $90\cdot11 + 10$, $76\cdot13 + 12$, so $\texttt{encode}(1000) = (6, 10, 12)$. And $1001 = N$ is $\equiv 0$ modulo each of $7, 11, 13$, so $\texttt{encode}(1001) = (0, 0, 0) = \texttt{encode}(0)$. The two collide — the representation *wraps* at $N$, exactly like a finite integer type. Uniqueness holds only on $\{0, \dots, N-1\}$ (CRT, §23.4).

Phase 2: Carry-free parallel arithmetic

Here is the payoff. To add, subtract, or multiply two RNS numbers, operate on each component independently, reducing modulo that component's modulus.

def rns_add(u, v, moduli=MODULI):
    return tuple((a + b) % n for a, b, n in zip(u, v, moduli))

def rns_sub(u, v, moduli=MODULI):
    return tuple((a - b) % n for a, b, n in zip(u, v, moduli))

def rns_mul(u, v, moduli=MODULI):
    return tuple((a * b) % n for a, b, n in zip(u, v, moduli))

u = encode(123)       # (4, 2, 6)
v = encode(456)       # (1, 5, 1)
print(rns_mul(u, v))  # componentwise: (4*1, 2*5, 6*1) reduced = (4, 10, 6)
print(rns_add(encode(800), encode(700)))  # (2+0, 8+7, 7+11) reduced
# Expected output:
# (4, 10, 6)
# (2, 4, 5)

Trace the multiplication by hand. $\texttt{encode}(123) = (4, 2, 6)$ and $\texttt{encode}(456) = (1, 5, 1)$. Component-wise:

modulus $n_i$	$u_i$	$v_i$	$u_i v_i$	$u_i v_i \bmod n_i$
7	4	1	4	4
11	2	5	10	10
13	6	1	6	6

So rns_mul(u, v) = (4, 10, 6). Every product was a one-digit-times-one-digit multiply — never the full $123 \times 456 = 56088$. That is the whole point: the big multiplication never happens. (Notice the result triple $(4, 10, 6)$ is exactly $\texttt{encode}(32)$ — we will confirm in Phase 3 that $123 \times 456 \equiv 32 \pmod{1001}$.)

For the addition, $\texttt{encode}(800) = (2, 8, 7)$ and $\texttt{encode}(700) = (0, 7, 11)$:

modulus $n_i$	$u_i$	$v_i$	$u_i + v_i$	reduced
7	2	0	2	2
11	8	7	15	4
13	7	11	18	5

giving $(2, 4, 5)$ — and no carry ever crossed from the mod-$7$ lane into the mod-$11$ lane. The lanes are independent. That independence is what hardware exploits to run them in parallel.

⚠️ Common Pitfall. A reader sometimes expects the components to "carry into each other" the way decimal digits do, and tries to fix up a component that exceeded its modulus by adding to the next one. Do not. Each component is reduced only by its own modulus and never interacts with the others — that non-interaction is precisely Theorem 23.2 (each congruence holds independently). Carrying between lanes would destroy correctness.

Phase 3: The CRT decoder

The components are useless until we can read a number back out. That is the CRT construction of §23.4, which we implement exactly as the chapter's crt. We will build the decoder's machinery once, by hand, so the code is not a black box.

For moduli $(7, 11, 13)$ with $N = 1001$, precompute for each $i$ the cofactor $N_i = N / n_i$ and its inverse $M_i = N_i^{-1} \bmod n_i$ (Theorem 23.3 guarantees the inverse exists because the moduli are coprime):

$i$	$n_i$	$N_i = 1001/n_i$	$N_i \bmod n_i$	$M_i = N_i^{-1} \bmod n_i$
1	7	143	$143 \equiv 3 \pmod 7$	$3^{-1} \equiv 5 \pmod 7$
2	11	91	$91 \equiv 3 \pmod{11}$	$3^{-1} \equiv 4 \pmod{11}$
3	13	77	$77 \equiv 12 \equiv -1 \pmod{13}$	$(-1)^{-1} \equiv 12 \pmod{13}$

(Check one: $3 \cdot 5 = 15 \equiv 1 \pmod 7$ ✓; $3 \cdot 4 = 12 \equiv 1 \pmod{11}$ ✓; $12 \cdot 12 = 144 = 11\cdot13 + 1 \equiv 1 \pmod{13}$ ✓.)

The decoder forms $x = \sum_i r_i\, N_i\, M_i \bmod N$:

def ext_gcd(a, b):
    if b == 0:
        return (a, 1, 0)
    g, s1, t1 = ext_gcd(b, a % b)
    return (g, t1, s1 - (a // b) * t1)

def mod_inverse(a, m):
    g, s, _ = ext_gcd(a % m, m)
    return s % m if g == 1 else None

def decode(residues, moduli=MODULI):
    """CRT: reconstruct the unique x in [0, N) with x ≡ residues[i] (mod moduli[i])."""
    N = 1
    for n in moduli:
        N *= n
    x = 0
    for r_i, n_i in zip(residues, moduli):
        N_i = N // n_i
        x += r_i * N_i * mod_inverse(N_i, n_i)
    return x % N

print(decode((4, 10, 6)))     # the product result -> should be 32
print(decode((2, 4, 5)))      # the sum result     -> should be 499
print(decode(encode(123)))    # round-trip          -> 123
# Expected output:
# 32
# 499
# 123

Hand-derive the first decode, $(4, 10, 6)$, using the precomputed table: $$x = 4 \cdot 143 \cdot 5 \;+\; 10 \cdot 91 \cdot 4 \;+\; 6 \cdot 77 \cdot 12 = 2860 + 3640 + 5544 = 12044.$$ Reduce modulo $1001$: $12044 - 12 \cdot 1001 = 12044 - 12012 = 32$. So decode((4, 10, 6)) = 32 — confirming $123 \times 456 \equiv 32 \pmod{1001}$, the multiplication we did entirely in tiny components. (And indeed $123 \cdot 456 = 56088 = 56 \cdot 1001 + 32$ ✓.)

For the sum result $(2, 4, 5)$: $$x = 2 \cdot 143 \cdot 5 + 4 \cdot 91 \cdot 4 + 5 \cdot 77 \cdot 12 = 1430 + 1456 + 4620 = 7506, \qquad 7506 \bmod 1001 = 7506 - 7\cdot1001 = 499.$$ So decode((2, 4, 5)) = 499, i.e. $800 + 700 = 1500 \equiv 499 \pmod{1001}$ — correct, since $1500 - 1001 = 499$. The whole pipeline — encode, add in parallel, decode — reproduced ordinary modular addition.

💡 Intuition. Look at what just happened: we multiplied two three-digit numbers using only single-digit multiplies, and added two numbers with no carry, then paid a one-time reconstruction cost at the end. If you do many operations before decoding (as RNS hardware does), the cheap component arithmetic dominates and the one CRT reconstruction is amortized to almost nothing. The CRT isomorphism lets you live in the cheap world and only visit the expensive one to read a final answer.

Phase 4: Range, overflow, and correctness boundaries

An RNS over moduli with product $N$ behaves exactly like arithmetic modulo $N$. That is its strength and its catch.

Strength: within $\{0, \dots, N-1\}$, encode/decode is a perfect bijection (CRT uniqueness, §23.4), and component arithmetic agrees with integer arithmetic taken modulo $N$ (Theorem 23.2).
Catch: if a true result would exceed $N - 1$, the RNS gives the result modulo $N$, not the true integer. Our multiplication $123 \times 456 = 56088$ did not fit in $[0, 1000]$, so the RNS returned $56088 \bmod 1001 = 32$ — correct as a modular result, but not the full product. To hold the full product you would need moduli whose product exceeds $56088$.

This is not a defect; it is the same finiteness as a fixed-width integer type (§23.1's note that a CPU computes modulo $2^{64}$). If you want an RNS that never overflows for inputs up to some bound $B$, choose coprime moduli whose product exceeds the largest result you will compute.

def big_enough(moduli, B):
    """True iff the RNS over these moduli can represent every integer in [0, B] uniquely."""
    N = 1
    for n in moduli:
        N *= n
    return N > B

print(big_enough([7, 11, 13], 56088))      # N = 1001, too small for the full product
print(big_enough([7, 11, 13, 17, 19], 56088))  # N = 323323, plenty
# Expected output:
# False
# True

For the second call, $7 \cdot 11 \cdot 13 \cdot 17 \cdot 19 = 1001 \cdot 323 = 323323 > 56088$, so the five-modulus RNS represents the full product exactly. (With those five moduli, decode(rns_mul(...)) would return $56088$ itself, not its residue.)

🔄 Check Your Understanding You want an RNS that can hold any product of two numbers each up to $1000$. What is the largest result you must represent, and does the modulus set $\{7, 11, 13\}$ suffice? Name one way to extend it.

Answer

The largest product is $1000 \times 1000 = 10^6$, so you need $N > 10^6$. The set $\{7, 11, 13\}$ gives $N = 1001$ — nowhere near enough. Add more coprime moduli until the product exceeds $10^6$; e.g. $\{7, 11, 13, 17, 19, 23\}$ gives $N = 323323 \cdot 23 = 7{,}436{,}429 > 10^6$. (Any pairwise-coprime set with product $> 10^6$ works — primes are the easy choice since distinct primes are automatically coprime, Chapter 22.)

Discussion Questions

The correctness of component-wise rns_mul rests entirely on one theorem from this chapter. Name it, and explain precisely which clause ("$ac \equiv bd$") justifies the per-lane multiplication.
Why must the moduli be pairwise coprime, not merely "have no common factor across all of them at once"? Give a triple of moduli that share no global common factor yet are not pairwise coprime, and explain how CRT decoding would fail for them.
Decoding does real work (the CRT sum), while encoding and per-lane arithmetic are cheap. Argue that RNS pays off only when you perform many arithmetic operations between an encode and a decode. Roughly how many operations must you do before the decode cost is negligible?
RSA decryption uses a two-modulus RNS: the secret primes $p$ and $q$. Using §23.4's intuition, explain why decrypting modulo $p$ and modulo $q$ separately and then CRT-combining is faster than one exponentiation modulo $n = pq$ — and connect this to the cost model of fast exponentiation (§23.5).
Our decode recomputes $N$, the cofactors $N_i$, and the inverses $M_i$ on every call. In a real RNS those are fixed once the moduli are chosen. Describe how you would precompute them (a constructor storing the $N_i M_i$ products) and why that turns each decode into a single dot product.

Your Turn: Extensions

Option A (precomputed decoder). Refactor the RNS into a class RNS(moduli) whose constructor precomputes $N$ and the list of weights $w_i = N_i M_i \bmod N$, so that decode(residues) becomes sum(r_i * w_i for ...) % N — one dot product. Hand-verify the weights for $(7, 11, 13)$ and confirm they decode $(4, 10, 6)$ to $32$.
Option B (signed RNS). Extend the system to represent negative numbers by interpreting residues in the symmetric range $(-N/2, N/2]$ instead of $[0, N)$. Write decode_signed and explain, using §23.1's residue convention, how a single threshold check converts the unsigned decode into a signed one. Test it on $\texttt{encode}(-5)$.
Option C (RNS exponentiation). Combine this with §23.5: implement rns_pow(u, e) that raises an RNS number to a power by applying mod_pow in each lane independently, then decodes once at the end. Trace $\texttt{encode}(5)^3$ over $(7, 11, 13)$ and confirm it decodes to $125$. (This is the kernel of the RSA CRT speed-up you will meet in Chapter 25.)

Key Takeaways

An RNS stores a number as its remainder-tuple and computes lane-by-lane. Theorem 23.2 makes the carry-free, parallel arithmetic correct; the lanes never interact.
The CRT is the decoder. Reconstructing an integer from its residues is the §23.4 construction $x = \sum_i r_i N_i M_i \bmod N$ — the chapter's crt, doing concrete engineering work.
Big operations become small ones. A three-digit multiply became three single-digit multiplies; the full product never formed. This is why RNS appears in DSP hardware and RSA accelerators.
An RNS is arithmetic modulo $N$. It is a perfect bijection on $\{0, \dots, N-1\}$ and wraps beyond it — pick moduli whose product exceeds your largest result to avoid overflow.
"A number is its remainders" is a data structure, not a slogan. Once you can encode, compute, and decode, the §23.4 isomorphism becomes a tool you reach for whenever component arithmetic is cheaper than the whole — from this calculator to the RSA speed-up of Chapter 25.