Case Study 19-1: Porting a Crypto Library — XOR Cipher in x86-64 and ARM64
Objective
Implement the same XOR encryption routine in both x86-64 and ARM64 assembly, comparing instruction count, register usage, loop structure, and performance. This builds directly on the XOR→AES-NI anchor example started in Chapter 13 and demonstrates practical cross-architecture porting.
Background: XOR Cipher
XOR cipher is the foundation of stream ciphers and block cipher modes like CTR and OFB. While XOR alone is a weak cipher (trivially broken with known plaintext), the XOR operation itself is used in every serious cryptographic algorithm.
The operation: output[i] = input[i] XOR key[i % key_len]
For simplicity, we'll use a key length that's a multiple of the block size (avoiding the modulo in the inner loop). Specifically: the scalar versions use an 8-byte key and process 8 bytes per iteration; the SIMD versions use a 16-byte key and process 16 bytes per iteration.
C Reference Implementation
```c
// xor_cipher.c
#include <stddef.h>
#include <stdint.h>

void xor_encrypt(uint8_t *output, const uint8_t *input, size_t n,
                 const uint8_t *key, size_t key_len) {
    // Assumes n is a multiple of key_len, key_len is a multiple of 8
    for (size_t i = 0; i < n; i++) {
        output[i] = input[i] ^ key[i % key_len];
    }
}

// Simplified version with key_len = 8:
void xor_encrypt_8(uint8_t *output, const uint8_t *input, size_t n,
                   uint64_t key_8bytes) {
    size_t i;
    // Note: the casts assume the buffers are suitably aligned for uint64_t
    for (i = 0; i + 8 <= n; i += 8) {
        *(uint64_t *)(output + i) = *(const uint64_t *)(input + i) ^ key_8bytes;
    }
    // tail: handle remaining bytes (simplified: assume n % 8 == 0)
}
```
x86-64 Implementation
Scalar (8 bytes per iteration)
```nasm
; xor_encrypt_x86: encrypt n bytes with an 8-byte key
; RDI = output, RSI = input, RDX = n, RCX = key_8bytes
; Assumes n is multiple of 8
global xor_encrypt_x86
xor_encrypt_x86:
    test    rdx, rdx            ; n == 0?
    jz      .done
    ; key is already in RCX (4th argument register)
    xor     rax, rax            ; i = 0
.loop:
    mov     r8, [rsi + rax]     ; r8 = input[i..i+7]
    xor     r8, rcx             ; r8 ^= key
    mov     [rdi + rax], r8     ; output[i..i+7] = r8
    add     rax, 8              ; i += 8
    cmp     rax, rdx            ; i < n?
    jb      .loop
.done:
    ret
```
Per-iteration instruction count: MOV + XOR + MOV + ADD + CMP + JB = 6 instructions, 8 bytes processed.
SIMD (16 bytes per iteration with SSE2)
```nasm
; xor_encrypt_sse2: encrypt n bytes with a 16-byte key
; RDI = output, RSI = input, RDX = n (multiple of 16)
; RCX = key pointer
global xor_encrypt_sse2
xor_encrypt_sse2:
    test    rdx, rdx
    jz      .done
    movdqu  xmm7, [rcx]         ; xmm7 = 16-byte key (load once, reuse)
    xor     rax, rax            ; i = 0
.loop:
    movdqu  xmm0, [rsi + rax]   ; xmm0 = input[i..i+15]
    pxor    xmm0, xmm7          ; xmm0 ^= key (16 bytes XOR!)
    movdqu  [rdi + rax], xmm0   ; output[i..i+15] = xmm0
    add     rax, 16
    cmp     rax, rdx
    jb      .loop
.done:
    ret
```
Per-iteration: MOVDQU + PXOR + MOVDQU + ADD + CMP + JB = 6 instructions, 16 bytes processed. 2× the scalar throughput with the same instruction count.
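The same loop is often written with compiler intrinsics rather than hand assembly. A sketch under the chapter's assumptions (n a multiple of 16; the function name `xor_encrypt_sse2_c` is ours). SSE2 is baseline on x86-64, so no special compiler flags are needed:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

void xor_encrypt_sse2_c(uint8_t *output, const uint8_t *input, size_t n,
                        const uint8_t *key16) {
    /* Load the 16-byte key once, outside the loop (movdqu xmm7, [rcx]). */
    const __m128i key = _mm_loadu_si128((const __m128i *)key16);
    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(input + i)); /* MOVDQU */
        v = _mm_xor_si128(v, key);                                 /* PXOR   */
        _mm_storeu_si128((__m128i *)(output + i), v);              /* MOVDQU */
    }
}
```

With optimization enabled, the generated loop is essentially the six-instruction body shown above.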
ARM64 Implementation
Scalar (8 bytes per iteration)
```asm
// xor_encrypt_arm64: X0 = output, X1 = input, X2 = n, X3 = key_8bytes
// Assumes n is multiple of 8
.global xor_encrypt_arm64
xor_encrypt_arm64:
    CBZ     X2, .a64_done       // if (n == 0) return
    MOV     X4, XZR             // i = 0
.a64_loop:
    LDR     X5, [X1, X4]        // X5 = input[i..i+7]
    EOR     X5, X5, X3          // X5 ^= key
    STR     X5, [X0, X4]        // output[i..i+7] = X5
    ADD     X4, X4, #8          // i += 8
    CMP     X4, X2              // i < n?
    B.LO    .a64_loop           // B.LO = unsigned lower (equivalent to JB)
.a64_done:
    RET
```
Per-iteration: LDR + EOR + STR + ADD + CMP + B.LO = 6 instructions, 8 bytes processed. Same as x86-64 scalar.
Instruction comparison:
| Step | x86-64 | ARM64 |
|---|---|---|
| Load | MOV r8, [rsi+rax] | LDR X5, [X1, X4] |
| XOR | XOR r8, rcx | EOR X5, X5, X3 |
| Store | MOV [rdi+rax], r8 | STR X5, [X0, X4] |
| Increment | ADD rax, 8 | ADD X4, X4, #8 |
| Compare | CMP rax, rdx | CMP X4, X2 |
| Branch | JB .loop | B.LO .loop |
Nearly identical structure. Notably, the supposed x86-64 advantage of base+index addressing (MOV r8, [rsi+rax] as one instruction) is matched exactly: ARM64's LDR X5, [X1, X4] uses the same base+index form. ARM64's addressing modes are richer than just "base+immediate".
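A portability note on the C reference: the uint64_t casts in xor_encrypt_8 assume 8-byte alignment and technically violate strict aliasing. The idiomatic fix is memcpy, which GCC and Clang fold into the same single 8-byte load and store seen in both scalar loops above. A sketch (the _portable suffix is ours):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

void xor_encrypt_8_portable(uint8_t *output, const uint8_t *input, size_t n,
                            uint64_t key_8bytes) {
    for (size_t i = 0; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, input + i, 8);   /* folds into one 8-byte load (MOV / LDR) */
        w ^= key_8bytes;            /* XOR r8, rcx  /  EOR X5, X5, X3 */
        memcpy(output + i, &w, 8);  /* one 8-byte store (MOV / STR) */
    }
}
```

Same six-instruction loop body on both architectures, but now well-defined C for any buffer alignment.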
NEON (16 bytes per iteration)
```asm
// xor_encrypt_neon: X0 = output, X1 = input, X2 = n (mult. of 16)
// X3 = key pointer (16-byte key)
.global xor_encrypt_neon
xor_encrypt_neon:
    CBZ     X2, .neon_done
    LDR     Q7, [X3]            // V7 = 16-byte key (load once)
    MOV     X4, XZR             // i = 0
.neon_loop:
    LDR     Q0, [X1, X4]        // V0 = input[i..i+15] (128 bits)
    EOR     V0.16B, V0.16B, V7.16B  // V0 ^= key (16 bytes at once)
    STR     Q0, [X0, X4]        // output[i..i+15] = V0
    ADD     X4, X4, #16         // i += 16
    CMP     X4, X2
    B.LO    .neon_loop
.neon_done:
    RET
```
Per-iteration: LDR Q + EOR V.16B + STR Q + ADD + CMP + B.LO = 6 instructions, 16 bytes processed.
Compare to x86-64 SSE2:
- x86-64: MOVDQU ("move double quadword unaligned") + PXOR ("packed XOR") + MOVDQU
- ARM64: LDR Q ("load quadword") + EOR V.16B ("EOR 16 bytes") + STR Q ("store quadword")
The ARM64 NEON syntax is more regular: the same EOR instruction that XORs scalar registers also XORs 16-byte vectors — just with a .16B suffix. x86-64 uses a completely different mnemonic (PXOR for packed XOR vs. XOR for scalar).
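In C, the NEON loop maps onto the arm_neon.h intrinsics vld1q_u8, veorq_u8, and vst1q_u8. The sketch below (the function name xor_encrypt_16 is ours) guards the NEON path behind the standard __ARM_NEON macro and falls back to two 8-byte words per iteration elsewhere, so the same source builds on both architectures:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* n must be a multiple of 16; key16 points to a 16-byte key. */
void xor_encrypt_16(uint8_t *output, const uint8_t *input, size_t n,
                    const uint8_t *key16) {
#if defined(__ARM_NEON)
    uint8x16_t key = vld1q_u8(key16);            /* LDR Q7, [X3]   */
    for (size_t i = 0; i + 16 <= n; i += 16) {
        uint8x16_t v = vld1q_u8(input + i);      /* LDR Q0         */
        v = veorq_u8(v, key);                    /* EOR V0.16B     */
        vst1q_u8(output + i, v);                 /* STR Q0         */
    }
#else
    /* Portable fallback: two 8-byte words per iteration. */
    uint64_t k0, k1;
    memcpy(&k0, key16, 8);
    memcpy(&k1, key16 + 8, 8);
    for (size_t i = 0; i + 16 <= n; i += 16) {
        uint64_t a, b;
        memcpy(&a, input + i, 8);
        memcpy(&b, input + i + 8, 8);
        a ^= k0;
        b ^= k1;
        memcpy(output + i, &a, 8);
        memcpy(output + i + 8, &b, 8);
    }
#endif
}
```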
Side-by-Side Comparison
XOR Cipher Implementation Comparison

| | x86-64 Scalar | x86-64 SSE2 | ARM64 Scalar | ARM64 NEON |
|---|---|---|---|---|
| Bytes/iteration | 8 | 16 | 8 | 16 |
| Instructions/iter | 6 | 6 | 6 | 6 |
| Key register | RCX (GP) | XMM7 (SIMD) | X3 (GP) | V7 (NEON) |
| XOR instruction | XOR rN, rM | PXOR xmm, xmm | EOR Xn, Xn, Xm | EOR V.16B |
| Load instruction | MOV rN, [base] | MOVDQU | LDR Xn, [base] | LDR Qn |
| Store instruction | MOV [base], rN | MOVDQU | STR Xn, [base] | STR Qn |
| Loop overhead | ADD+CMP+JB | ADD+CMP+JB | ADD+CMP+B.LO | ADD+CMP+B.LO |
| Alignment required | No (any) | No (MOVDQU) | No (any) | No (LDR Q any) |
The instruction structures are remarkably parallel. Both architectures achieve the same logical operations in the same number of instructions; only the mnemonics and encoding differ.
Performance Benchmarking
For a 64KB buffer on modern hardware (approximate):
Performance: XOR encrypt 65536 bytes

| Implementation | Throughput (GB/s) | Cycles/byte |
|---|---|---|
| x86-64 scalar | ~8-10 | ~0.4 |
| x86-64 SSE2 | ~16-20 | ~0.2 |
| x86-64 AVX2 (256-bit) | ~32-40 | ~0.1 |
| ARM64 scalar | ~6-8 | ~0.5 |
| ARM64 NEON (128-bit) | ~12-16 | ~0.25 |
| Apple M4 NEON | ~25-30 | ~0.12 |
The bottleneck for large buffers is memory bandwidth, not instruction throughput. Both scalar and NEON versions become memory-bound beyond ~1MB.
Porting Checklist: x86-64 → ARM64
When porting crypto code between architectures:
- Replace XMM/YMM registers with V (NEON) registers: PXOR xmm0, xmm1 → EOR V0.16B, V0.16B, V1.16B
- Replace MOVDQU/MOVDQA with LDR Q/STR Q. ARM64 Q-register loads handle both aligned and unaligned addresses.
- Replace PAND/POR with AND/ORR plus vector suffixes: PAND xmm0, xmm1 → AND V0.16B, V0.16B, V1.16B
- Replace PCMPEQB with CMEQ: PCMPEQB xmm0, xmm1 → CMEQ V0.16B, V0.16B, V1.16B
- Replace PSLLW/PSLLD with the SHL vector variants: PSLLW xmm0, 1 → SHL V0.8H, V0.8H, #1
- Keep scalar XOR as EOR: XOR rax, rbx → EOR X0, X0, X1
The deeper crypto operations (AES-NI on x86-64 vs. AES hardware extensions on ARM64) are covered in the XOR→AES-NI anchor example in Chapter 35.
Summary
The XOR cipher port demonstrates that equivalent algorithms produce structurally equivalent assembly on x86-64 and ARM64. The mnemonic names differ (XOR vs. EOR, PXOR vs. EOR V.16B), but the logical structure — load, operate, store — is identical.
The SIMD extensions (SSE2 on x86-64, NEON on ARM64) both achieve 2× the scalar throughput with the same instruction count by processing 16 bytes per iteration instead of 8. The key difference: ARM64's NEON syntax is more regular (same EOR mnemonic with a vector suffix) while x86-64 uses entirely separate SIMD mnemonics (PXOR, PAND, etc.).
For a crypto developer porting between platforms, the main challenge is learning the new mnemonic names and verifying alignment behavior. The algorithmic logic translates directly.