Case Study 9.1: Big Integer Arithmetic — 256-bit Addition in Assembly

Motivation

The OpenSSL elliptic curve cryptography (ECC) implementation uses 256-bit arithmetic for NIST P-256 operations. On x86-64, the hardware provides 64-bit arithmetic, so 256-bit operations must be synthesized from 64-bit pieces. This case study implements 256-bit addition and subtraction using the ADC/SBB chain technique.

The Problem

A 256-bit unsigned integer fits in four 64-bit limbs stored in little-endian order:

A = a[3] * 2^192 + a[2] * 2^128 + a[1] * 2^64 + a[0]

Where a[0] is the least significant 64-bit limb. To add two 256-bit integers:

1. Add the least significant limbs: a[0] + b[0]. This may produce a carry.
2. Add the next limbs plus the carry from step 1: a[1] + b[1] + CF.
3. Continue through all four limbs.
4. If there is a carry out of the last limb, 256-bit overflow occurred.
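The four steps can be modeled in portable C before dropping to assembly. This sketch (the name add256 is ours, not from any library) performs exactly what each ADC does: add two limbs plus the incoming carry, and detect the carry-out by checking for wraparound.

```c
#include <stdint.h>

/* C model of the four-step limb addition above. Returns the final
 * carry-out, which corresponds to CF after the last ADC. */
static unsigned add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
{
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t sum = a[i] + b[i];     /* may wrap: the wrap is the carry */
        unsigned c1 = sum < a[i];       /* carry out of a[i] + b[i] */
        r[i] = sum + carry;
        unsigned c2 = r[i] < sum;       /* carry out of adding the old carry */
        carry = c1 | c2;                /* both cannot occur at once */
    }
    return carry;                       /* 1 => 256-bit overflow */
}
```

Note that c1 and c2 can never both be 1: if a[i] + b[i] wrapped, the wrapped sum is at most 2^64 - 2, so adding a carry of 1 cannot wrap again.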

Data Layout

; In memory (little-endian limb order):
; ┌──────────────────┐ ← address
; │   limb[0] (LSL)  │ 8 bytes  (bits 0..63)
; ├──────────────────┤ ← address + 8
; │   limb[1]        │ 8 bytes  (bits 64..127)
; ├──────────────────┤ ← address + 16
; │   limb[2]        │ 8 bytes  (bits 128..191)
; ├──────────────────┤ ← address + 24
; │   limb[3] (MSL)  │ 8 bytes  (bits 192..255)
; └──────────────────┘ ← address + 32

struc uint256
    .limb0: resq 1   ; offset 0
    .limb1: resq 1   ; offset 8
    .limb2: resq 1   ; offset 16
    .limb3: resq 1   ; offset 24
endstruc
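The same layout can be expressed in C (the type name uint256_t is our own, not from any library). The static assertion confirms the 32-byte size the diagram shows; the limb offsets 0/8/16/24 match the NASM struc.

```c
#include <stddef.h>
#include <stdint.h>

/* C view of the uint256 layout: four little-endian 64-bit limbs. */
typedef struct {
    uint64_t limb[4];   /* limb[0] = bits 0..63, limb[3] = bits 192..255 */
} uint256_t;

_Static_assert(sizeof(uint256_t) == 32, "uint256 must be exactly 32 bytes");
```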

Implementation: 256-bit Addition

; uint256_add(uint256 *result, const uint256 *a, const uint256 *b)
;   RDI = result pointer
;   RSI = a pointer
;   RDX = b pointer
;   Returns: CF=1 if overflow, CF=0 if no overflow
;
; Clobbers: RAX (caller-saved under the System V ABI; nothing is preserved)

section .text
global uint256_add

uint256_add:
    ; Load a[0], add b[0], store result[0]
    mov rax, [rsi]            ; rax = a[0]
    add rax, [rdx]            ; rax = a[0] + b[0], sets CF
    mov [rdi], rax            ; result[0] = low sum

    ; Add a[1] + b[1] + CF from previous
    mov rax, [rsi + 8]        ; rax = a[1]
    adc rax, [rdx + 8]        ; rax = a[1] + b[1] + CF, sets CF
    mov [rdi + 8], rax        ; result[1]

    ; Add a[2] + b[2] + CF
    mov rax, [rsi + 16]
    adc rax, [rdx + 16]
    mov [rdi + 16], rax

    ; Add a[3] + b[3] + CF
    mov rax, [rsi + 24]
    adc rax, [rdx + 24]
    mov [rdi + 24], rax

    ; CF is still set from the last ADC
    ; Caller can check JC for overflow
    ret

Why We Cannot Use Memory Operands with ADC Directly

You might wonder: why not adc [rdi+8], [rsi+8]? Because no x86-64 instruction accepts two memory operands — at most one operand of ADC may be memory. We must load one operand into a register, add, then store.

An alternative approach that avoids the intermediate register store-then-reload pattern uses register pairs:

; Alternative: load all limbs first, then ADC chain
uint256_add_v2:
    mov r8,  [rsi]            ; r8  = a[0]
    mov r9,  [rsi + 8]        ; r9  = a[1]
    mov r10, [rsi + 16]       ; r10 = a[2]
    mov r11, [rsi + 24]       ; r11 = a[3]

    add r8,  [rdx]            ; r8  += b[0], CF set
    adc r9,  [rdx + 8]        ; r9  += b[1] + CF
    adc r10, [rdx + 16]       ; r10 += b[2] + CF
    adc r11, [rdx + 24]       ; r11 += b[3] + CF

    mov [rdi],      r8        ; store results
    mov [rdi + 8],  r9
    mov [rdi + 16], r10
    mov [rdi + 24], r11
    ret

This second version has the same instruction count (twelve plus the return) but a different memory access pattern. On modern out-of-order processors the two perform similarly: the loads can be reordered and issued early, and the serial ADC chain dominates either way.

Implementation: 256-bit Subtraction

Subtraction uses SBB (Subtract with Borrow) instead of ADC:

; uint256_sub(uint256 *result, const uint256 *a, const uint256 *b)
; result = a - b
; Returns: CF=1 if borrow (a < b), CF=0 if no borrow (a >= b)

uint256_sub:
    mov r8,  [rsi]
    mov r9,  [rsi + 8]
    mov r10, [rsi + 16]
    mov r11, [rsi + 24]

    sub r8,  [rdx]            ; r8  = a[0] - b[0], CF set if borrow
    sbb r9,  [rdx + 8]        ; r9  = a[1] - b[1] - CF
    sbb r10, [rdx + 16]
    sbb r11, [rdx + 24]

    mov [rdi],      r8
    mov [rdi + 8],  r9
    mov [rdi + 16], r10
    mov [rdi + 24], r11
    ret
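The borrow chain mirrors the carry chain, and can likewise be modeled in C (the name sub256 is ours). Each step subtracts a limb and the incoming borrow; wraparound detection replaces CF.

```c
#include <stdint.h>

/* C model of the SBB chain. Returns the final borrow, which
 * corresponds to CF after the last SBB: 1 iff a < b. */
static unsigned sub256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
{
    unsigned borrow = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t d = a[i] - b[i];       /* wraps if a[i] < b[i] */
        unsigned b1 = a[i] < b[i];
        r[i] = d - borrow;
        unsigned b2 = d < borrow;       /* borrow from subtracting old borrow */
        borrow = b1 | b2;
    }
    return borrow;
}
```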

Testing with a Concrete Example

section .data
    ; A = 2^255 - 1 (the largest 255-bit number)
    A: dq 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF, \
          0xFFFFFFFFFFFFFFFF, 0x7FFFFFFFFFFFFFFF

    ; B = 1
    B: dq 1, 0, 0, 0

    ; Expected result: A + B = 2^255 (bit 255 set, all others clear)
    ; result[0] = 0x0000000000000000 (overflow from 0xFFFF...+1)
    ; result[1] = 0x0000000000000000 (carry propagated)
    ; result[2] = 0x0000000000000000
    ; result[3] = 0x8000000000000000 (carry into bit 255)
    ; No 256-bit overflow (CF = 0)

    result: times 4 dq 0

Trace through the addition:

- a[0] + b[0] = 0xFFFF...FFFF + 1 = 0x0000000000000000, CF = 1
- a[1] + b[1] + CF = 0xFFFF...FFFF + 0 + 1 = 0x0000000000000000, CF = 1
- a[2] + b[2] + CF = 0xFFFF...FFFF + 0 + 1 = 0x0000000000000000, CF = 1
- a[3] + b[3] + CF = 0x7FFF...FFFF + 0 + 1 = 0x8000000000000000, CF = 0

Final CF = 0: no 256-bit overflow. Result = [0, 0, 0, 0x8000000000000000] = 2^255.
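The trace can be replayed mechanically with 128-bit intermediates (GCC/Clang __int128; the function name trace_add is ours). Each step's low 64 bits are the result limb, and bit 64 is the carry into the next step — a compact alternative to detecting carries by wraparound comparison.

```c
#include <stdint.h>

/* Replay the limb-by-limb trace with a 128-bit accumulator. */
static unsigned trace_add(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
{
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        unsigned __int128 t = (unsigned __int128)a[i] + b[i] + carry;
        r[i]  = (uint64_t)t;            /* low 64 bits: the result limb */
        carry = (unsigned)(t >> 64);    /* 0 or 1: CF into the next step */
    }
    return carry;
}
```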

Modular Reduction (Partial)

In ECC arithmetic, 256-bit addition is often followed by conditional subtraction of the prime modulus to keep the result in the field. This uses the CF from the addition:

; After uint256_add:
; If CF = 1, or if result >= P (the field prime), subtract P
; This simplified version only handles the CF = 1 case:

uint256_add_mod:
    call uint256_add          ; compute sum, CF = overflow bit
    jnc  .no_overflow         ; if no overflow, done (may still need reduction)
    ; Overflow: the true sum is 2^256 + r. Subtracting P in wrapping
    ; 256-bit arithmetic yields (2^256 + r) - P, the reduced value
    ; (since both inputs were < P, the true sum is < 2P):
    mov rsi, rdi              ; a = result (RDI preserved by uint256_add)
    lea rdx, [rel modulus]    ; b = P
    call uint256_sub          ; result -= P
.no_overflow:
    ret
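The CF = 1 path can be sketched in C (function and constant names are ours). The constant is the NIST P-256 prime, 2^256 - 2^224 + 2^192 + 2^96 - 1, in little-endian limb order; the final borrow of the subtraction cancels against the incoming carry of 2^256, so the wrapped difference is the reduced value.

```c
#include <stdint.h>

/* The NIST P-256 prime, least significant limb first. */
static const uint64_t P256[4] = {
    0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFFULL,
    0x0000000000000000ULL, 0xFFFFFFFF00000001ULL
};

/* If the 256-bit add carried out (cf = 1), subtract P in wrapping
 * 256-bit arithmetic: (2^256 + r) - P. Like the assembly version,
 * this sketch does not handle the cf = 0, r >= P case. */
static void reduce_on_carry(uint64_t r[4], unsigned cf)
{
    if (!cf) return;
    unsigned borrow = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t d = r[i] - P256[i];
        unsigned b1 = r[i] < P256[i];
        r[i] = d - borrow;
        borrow = b1 | (d < borrow);
    }
    /* final borrow cancels against the incoming carry (2^256) */
}
```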

Performance Analysis

On a modern Intel (Ice Lake) processor:

Implementation                        Throughput       Latency    Notes
uint256_add    (store-then-reload)    ~4 cycles/call   ~8 cycles  Memory traffic
uint256_add_v2 (load all, then add)   ~4 cycles/call   ~6 cycles  Better OOO opportunity
Hypothetical SIMD version             ~2 cycles/call   ~3 cycles  Using 256-bit SIMD integers

The ADC chain itself (4 instructions) has an inherent serial dependency: each ADC depends on CF from the previous. This limits throughput for the pure arithmetic — but the loads and stores can execute in parallel on modern superscalar processors.

Connection to Real Cryptographic Code

OpenSSL's crypto/bn/asm/x86_64-gcc.c contains essentially this pattern for its BIGNUM add. The production implementation also handles aliased operands (result and inputs may be the same pointer), adds loop variants for arbitrary precision, and includes MULX-based variants for multiplication. The core insight is identical to what we have implemented here: an ADC/SBB chain, threading the carry flag from limb to limb, is how x86-64 does multi-precision arithmetic.