Case Study 9.1: Big Integer Arithmetic — 256-bit Addition in Assembly
Motivation
The OpenSSL elliptic curve cryptography (ECC) implementation uses 256-bit arithmetic for NIST P-256 operations. On x86-64, the hardware provides 64-bit arithmetic, so 256-bit operations must be synthesized from 64-bit pieces. This case study implements 256-bit addition and subtraction using the ADC/SBB chain technique.
The Problem
A 256-bit unsigned integer fits in four 64-bit limbs stored in little-endian order:
A = a[3] * 2^192 + a[2] * 2^128 + a[1] * 2^64 + a[0]
Where a[0] is the least significant 64-bit limb. To add two 256-bit integers:
1. Add the least significant limbs: a[0] + b[0]. This may produce a carry.
2. Add the next limbs plus the carry: a[1] + b[1] + CF.
3. Continue through all four limbs.
4. If there is a carry after the last limb, 256-bit overflow occurred.
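The four steps above can be modeled in C before dropping to assembly. This is a sketch using GCC/Clang's unsigned __int128 so the carry falls out of a wide accumulator; the uint256_t and u256_add names are ours, not from any library:

```c
#include <stdint.h>

typedef struct { uint64_t limb[4]; } uint256_t;  /* limb[0] = least significant */

/* Returns the carry out of limb 3 (1 on 256-bit overflow), mirroring CF
   after the final ADC in the assembly version. */
static int u256_add(uint256_t *r, const uint256_t *a, const uint256_t *b) {
    unsigned __int128 acc = 0;           /* wide accumulator: limb sum + carry */
    for (int i = 0; i < 4; i++) {
        acc += (unsigned __int128)a->limb[i] + b->limb[i];
        r->limb[i] = (uint64_t)acc;      /* low 64 bits become the result limb */
        acc >>= 64;                      /* high part (0 or 1) is the carry */
    }
    return (int)acc;
}
```

Each loop iteration corresponds to one ADD/ADC in the assembly: the shift by 64 plays the role of the carry flag threading into the next limb.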
Data Layout
; In memory (little-endian limb order):
; ┌──────────────────┐ ← address
; │ limb[0] (LSL) │ 8 bytes (bits 0..63)
; ├──────────────────┤ ← address + 8
; │ limb[1] │ 8 bytes (bits 64..127)
; ├──────────────────┤ ← address + 16
; │ limb[2] │ 8 bytes (bits 128..191)
; ├──────────────────┤ ← address + 24
; │ limb[3] (MSL) │ 8 bytes (bits 192..255)
; └──────────────────┘ ← address + 32
struc uint256
.limb0: resq 1 ; offset 0
.limb1: resq 1 ; offset 8
.limb2: resq 1 ; offset 16
.limb3: resq 1 ; offset 24
endstruc
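For mixed C/assembly projects, the same layout can be mirrored on the C side and checked at compile time. A sketch (the type name matches the struc; the _Static_assert lines verify the offsets agree):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t limb0;  /* bits   0..63,  offset  0 */
    uint64_t limb1;  /* bits  64..127, offset  8 */
    uint64_t limb2;  /* bits 128..191, offset 16 */
    uint64_t limb3;  /* bits 192..255, offset 24 */
} uint256;

/* Fail the build if the C layout ever drifts from the NASM struc. */
_Static_assert(offsetof(uint256, limb1) == 8,  "limb1 offset");
_Static_assert(offsetof(uint256, limb2) == 16, "limb2 offset");
_Static_assert(offsetof(uint256, limb3) == 24, "limb3 offset");
_Static_assert(sizeof(uint256) == 32, "uint256 is four 64-bit limbs");
```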
Implementation: 256-bit Addition
; uint256_add(uint256 *result, const uint256 *a, const uint256 *b)
; RDI = result pointer
; RSI = a pointer
; RDX = b pointer
; Returns: CF=1 if overflow, CF=0 if no overflow
;
; Clobbers: RAX only (caller-saved; no callee-saved registers are touched)
section .text
global uint256_add
uint256_add:
; Load a[0], add b[0], store result[0]
mov rax, [rsi] ; rax = a[0]
add rax, [rdx] ; rax = a[0] + b[0], sets CF
mov [rdi], rax ; result[0] = low sum
; Add a[1] + b[1] + CF from previous
mov rax, [rsi + 8] ; rax = a[1]
adc rax, [rdx + 8] ; rax = a[1] + b[1] + CF, sets CF
mov [rdi + 8], rax ; result[1]
; Add a[2] + b[2] + CF
mov rax, [rsi + 16]
adc rax, [rdx + 16]
mov [rdi + 16], rax
; Add a[3] + b[3] + CF
mov rax, [rsi + 24]
adc rax, [rdx + 24]
mov [rdi + 24], rax
; CF is still set from the last ADC
; Caller can check JC for overflow
ret
Why We Cannot Use Memory Operands with ADC Directly
You might wonder: why not adc [rdi+8], [rsi+8]? Because no x86-64 instruction allows two memory operands — at most one operand per instruction can be in memory. We must load one operand into a register, add, then store.
An alternative approach trades the interleaved load/add/store pattern for three phases — load all limbs, run the ADC chain, store all results — using the extra registers R8–R11:
; Alternative: load all limbs first, then ADC chain
uint256_add_v2:
mov r8, [rsi] ; r8 = a[0]
mov r9, [rsi + 8] ; r9 = a[1]
mov r10, [rsi + 16] ; r10 = a[2]
mov r11, [rsi + 24] ; r11 = a[3]
add r8, [rdx] ; r8 += b[0], CF set
adc r9, [rdx + 8] ; r9 += b[1] + CF
adc r10, [rdx + 16] ; r10 += b[2] + CF
adc r11, [rdx + 24] ; r11 += b[3] + CF
mov [rdi], r8 ; store results
mov [rdi + 8], r9
mov [rdi + 16], r10
mov [rdi + 24], r11
ret
This second version has the same total instruction count but a different memory access pattern. On modern out-of-order processors the two perform similarly: the loads can be issued early and reordered in either version, while the ADC chain serializes the arithmetic in both.
Implementation: 256-bit Subtraction
Subtraction uses SBB (Subtract with Borrow) instead of ADC:
; uint256_sub(uint256 *result, const uint256 *a, const uint256 *b)
; result = a - b
; Returns: CF=1 if borrow (a < b), CF=0 if no borrow (a >= b)
uint256_sub:
mov r8, [rsi]
mov r9, [rsi + 8]
mov r10, [rsi + 16]
mov r11, [rsi + 24]
sub r8, [rdx] ; r8 = a[0] - b[0], CF set if borrow
sbb r9, [rdx + 8] ; r9 = a[1] - b[1] - CF
sbb r10, [rdx + 16]
sbb r11, [rdx + 24]
mov [rdi], r8
mov [rdi + 8], r9
mov [rdi + 16], r10
mov [rdi + 24], r11
ret
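The borrow chain can likewise be modeled in C. A sketch mirroring SBB's semantics, where each limb subtracts the incoming borrow and reports its own (the names are ours):

```c
#include <stdint.h>

typedef struct { uint64_t limb[4]; } uint256_t;  /* limb[0] = least significant */

/* Returns the final borrow (1 when a < b), mirroring CF after the SBB chain. */
static int u256_sub(uint256_t *r, const uint256_t *a, const uint256_t *b) {
    unsigned __int128 borrow = 0;
    for (int i = 0; i < 4; i++) {
        unsigned __int128 d = (unsigned __int128)a->limb[i]
                            - b->limb[i] - (uint64_t)borrow;
        r->limb[i] = (uint64_t)d;   /* low 64 bits, wrapped exactly like SUB/SBB */
        borrow = (d >> 64) & 1;     /* a nonzero high half means we borrowed */
    }
    return (int)borrow;
}
```

As in the assembly, a final borrow of 1 signals a < b, and the stored result is then a - b + 2^256 (the two's-complement wrap).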
Testing with a Concrete Example
section .data
; A = 2^255 - 1 (the largest value representable in 255 bits)
A: dq 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF, \
0xFFFFFFFFFFFFFFFF, 0x7FFFFFFFFFFFFFFF
; B = 1
B: dq 1, 0, 0, 0
; Expected result: A + B = 2^255 (bit 255 set, all others clear)
; result[0] = 0x0000000000000000 (overflow from 0xFFFF...+1)
; result[1] = 0x0000000000000000 (carry propagated)
; result[2] = 0x0000000000000000
; result[3] = 0x8000000000000000 (carry into bit 255)
; No 256-bit overflow (CF = 0)
result: times 4 dq 0
Trace through the addition:
- a[0] + b[0] = 0xFFFF...FFFF + 1 = 0x0000000000000000, CF = 1
- a[1] + b[1] + CF = 0xFFFF...FFFF + 0 + 1 = 0x0000000000000000, CF = 1
- a[2] + b[2] + CF = 0xFFFF...FFFF + 0 + 1 = 0x0000000000000000, CF = 1
- a[3] + b[3] + CF = 0x7FFF...FFFF + 0 + 1 = 0x8000000000000000, CF = 0
Final CF = 0: no 256-bit overflow. Result = [0, 0, 0, 0x8000000000000000] = 2^255.
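This trace can be checked mechanically in C. The sketch below uses GCC/Clang's __builtin_add_overflow as a stand-in for CF, reporting the carry out of each 64-bit add (the helper name is ours):

```c
#include <stdint.h>
#include <stdbool.h>

/* Per-limb checked adds: __builtin_add_overflow (GCC/Clang) reports the
   carry out of each 64-bit addition, playing the role of CF. */
static unsigned add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]) {
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t t;
        bool c1 = __builtin_add_overflow(a[i], b[i], &t);            /* a[i]+b[i] */
        bool c2 = __builtin_add_overflow(t, (uint64_t)carry, &r[i]); /* ... + CF  */
        carry = c1 | c2;  /* at most one of the two partial adds can carry */
    }
    return carry;
}
```

Running this on the A and B above reproduces the trace: carries propagate through limbs 0–2, limb 3 becomes 0x8000000000000000, and the final carry is 0.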
Modular Reduction (Partial)
In ECC arithmetic, 256-bit addition is often followed by conditional subtraction of the prime modulus to keep the result in the field. This uses the CF from the addition:
; After uint256_add:
; If CF = 1, or if result >= P (the field prime), subtract P
; This simplified version only handles the CF = 1 case:
uint256_add_mod:
call uint256_add ; compute sum, CF = overflow bit
jnc .no_overflow ; if no overflow, done (may still need reduction)
; Overflow: the true sum is 2^256 + (stored result). With a and b already
; reduced below P, the sum is < 2P, so one subtraction of P suffices; the
; borrow in the 256-bit subtract re-absorbs the dropped carry.
lea rsi, [rdi] ; a = result
lea rdx, [rel modulus] ; b = P
call uint256_sub ; result -= P
.no_overflow:
ret
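The full reduction rule — subtract P when CF = 1 or when the result is still >= P — can be sketched in C against the actual P-256 prime. The helper names are ours, and a production implementation would make the comparison and conditional subtraction branch-free for constant-time behavior:

```c
#include <stdint.h>

typedef struct { uint64_t limb[4]; } uint256_t;  /* little-endian limbs */

/* NIST P-256 prime: p = 2^256 - 2^224 + 2^192 + 2^96 - 1. */
static const uint256_t P256 = {{0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFFULL,
                                0x0000000000000000ULL, 0xFFFFFFFF00000001ULL}};

static int u256_add(uint256_t *r, const uint256_t *a, const uint256_t *b) {
    unsigned __int128 acc = 0;
    for (int i = 0; i < 4; i++) {
        acc += (unsigned __int128)a->limb[i] + b->limb[i];
        r->limb[i] = (uint64_t)acc;
        acc >>= 64;
    }
    return (int)acc;                       /* CF: carry out of limb 3 */
}

static void u256_sub(uint256_t *r, const uint256_t *a, const uint256_t *b) {
    unsigned __int128 borrow = 0;
    for (int i = 0; i < 4; i++) {
        unsigned __int128 d = (unsigned __int128)a->limb[i]
                            - b->limb[i] - (uint64_t)borrow;
        r->limb[i] = (uint64_t)d;
        borrow = (d >> 64) & 1;
    }
}

static int u256_gte(const uint256_t *a, const uint256_t *b) {
    for (int i = 3; i >= 0; i--)           /* compare from most significant limb */
        if (a->limb[i] != b->limb[i]) return a->limb[i] > b->limb[i];
    return 1;                              /* equal */
}

/* r = (a + b) mod p, assuming a, b < p: subtract p when the sum overflowed
   2^256 (the dropped carry is re-absorbed by the wrapping subtraction) or
   when the 256-bit result is still >= p. */
static void u256_add_mod_p(uint256_t *r, const uint256_t *a, const uint256_t *b) {
    int cf = u256_add(r, a, b);
    if (cf || u256_gte(r, &P256))
        u256_sub(r, r, &P256);
}
```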
Performance Analysis
Approximate figures on a modern Intel (Ice Lake) core:
| Implementation | Throughput | Latency | Notes |
|---|---|---|---|
| uint256_add_v1 (interleaved load/add/store) | ~4 cycles/call | ~8 cycles | Memory traffic |
| uint256_add_v2 (load all then add) | ~4 cycles/call | ~6 cycles | Better OOO opportunity |
| Hypothetical SIMD version | ~2 cycles/call | ~3 cycles | Speculative: AVX2 has no cross-limb carry propagation |
The ADC chain itself (4 instructions) has an inherent serial dependency: each ADC depends on CF from the previous. This limits throughput for the pure arithmetic — but the loads and stores can execute in parallel on modern superscalar processors.
Connection to Real Cryptographic Code
OpenSSL's crypto/bn/asm/x86_64-gcc.c contains essentially this pattern for its BIGNUM add. The production implementation also handles aliased operands (the same pointer passed as input and output), adds loop variants for arbitrary precision, and includes MULX-based variants for multiplication. The core insight is identical to what we have implemented here: threading the carry flag through an ADC/SBB chain is how x86-64 does multi-precision arithmetic.