Case Study 19-1: Porting a Crypto Library — XOR Cipher in x86-64 and ARM64
Objective
Implement the same XOR encryption routine in both x86-64 and ARM64 assembly, comparing instruction count, register usage, loop structure, and performance. This builds directly on the XOR→AES-NI anchor example started in Chapter 13 and demonstrates practical cross-architecture porting.
Background: XOR Cipher
XOR cipher is the foundation of stream ciphers and block cipher modes like CTR and OFB. While XOR alone is a weak cipher (trivially broken with known plaintext), the XOR operation itself is used in every serious cryptographic algorithm.
The operation: output[i] = input[i] XOR key[i % key_len]
For simplicity, we'll use a key length that's a multiple of the block size (avoiding the modulo in the inner loop). Specifically: the scalar versions use an 8-byte key and process 8 bytes per iteration; the SIMD versions use a 16-byte key and process 16 bytes per iteration.
C Reference Implementation
```c
// xor_cipher.c
#include <stddef.h>
#include <stdint.h>

void xor_encrypt(uint8_t *output, const uint8_t *input, size_t n,
                 const uint8_t *key, size_t key_len) {
    // Assumes n is a multiple of key_len, key_len is a multiple of 8
    for (size_t i = 0; i < n; i++) {
        output[i] = input[i] ^ key[i % key_len];
    }
}

// Simplified version with key_len = 8:
void xor_encrypt_8(uint8_t *output, const uint8_t *input, size_t n,
                   uint64_t key_8bytes) {
    size_t i;
    // Note: the casts assume the buffers are suitably aligned for uint64_t
    for (i = 0; i + 8 <= n; i += 8) {
        *(uint64_t *)(output + i) = *(const uint64_t *)(input + i) ^ key_8bytes;
    }
    // tail: handle remaining bytes (simplified: assume n % 8 == 0)
}
```
x86-64 Implementation
Scalar (8 bytes per iteration)
```nasm
; xor_encrypt_x86: encrypt n bytes with an 8-byte key
; RDI = output, RSI = input, RDX = n, RCX = key_8bytes
; Assumes n is multiple of 8
global xor_encrypt_x86
xor_encrypt_x86:
    test    rdx, rdx            ; n == 0?
    jz      .done
    ; key is already in RCX (4th argument register)
    xor     rax, rax            ; i = 0
.loop:
    mov     r8, [rsi + rax]     ; r8 = input[i..i+7]
    xor     r8, rcx             ; r8 ^= key
    mov     [rdi + rax], r8     ; output[i..i+7] = r8
    add     rax, 8              ; i += 8
    cmp     rax, rdx            ; i < n?
    jb      .loop
.done:
    ret
```
Per-iteration instruction count: MOV + XOR + MOV + ADD + CMP + JB = 6 instructions, 8 bytes processed.
SIMD (16 bytes per iteration with SSE2)
```nasm
; xor_encrypt_sse2: encrypt n bytes with a 16-byte key
; RDI = output, RSI = input, RDX = n (multiple of 16)
; RCX = key pointer
global xor_encrypt_sse2
xor_encrypt_sse2:
    test    rdx, rdx
    jz      .done
    movdqu  xmm7, [rcx]         ; xmm7 = 16-byte key (load once, reuse)
    xor     rax, rax            ; i = 0
.loop:
    movdqu  xmm0, [rsi + rax]   ; xmm0 = input[i..i+15]
    pxor    xmm0, xmm7          ; xmm0 ^= key (16 bytes XOR!)
    movdqu  [rdi + rax], xmm0   ; output[i..i+15] = xmm0
    add     rax, 16
    cmp     rax, rdx
    jb      .loop
.done:
    ret
```
Per-iteration: MOVDQU + PXOR + MOVDQU + ADD + CMP + JB = 6 instructions, 16 bytes processed. 2× the scalar throughput with the same instruction count.
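The same loop is often written with compiler intrinsics rather than hand assembly. A sketch under the chapter's assumptions (n a multiple of 16; the function name `xor_encrypt_sse2_c` is ours). SSE2 is baseline on x86-64, so no special compiler flags are needed:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

void xor_encrypt_sse2_c(uint8_t *output, const uint8_t *input, size_t n,
                        const uint8_t *key16) {
    /* Load the 16-byte key once, outside the loop (movdqu xmm7, [rcx]). */
    const __m128i key = _mm_loadu_si128((const __m128i *)key16);
    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(input + i)); /* MOVDQU */
        v = _mm_xor_si128(v, key);                                 /* PXOR   */
        _mm_storeu_si128((__m128i *)(output + i), v);              /* MOVDQU */
    }
}
```

With optimization enabled, the generated loop is essentially the six-instruction body shown above.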
ARM64 Implementation
Scalar (8 bytes per iteration)
```asm
// xor_encrypt_arm64: X0 = output, X1 = input, X2 = n, X3 = key_8bytes
// Assumes n is multiple of 8
.global xor_encrypt_arm64
xor_encrypt_arm64:
    CBZ     X2, .a64_done       // if (n == 0) return
    MOV     X4, XZR             // i = 0
.a64_loop:
    LDR     X5, [X1, X4]        // X5 = input[i..i+7]
    EOR     X5, X5, X3          // X5 ^= key
    STR     X5, [X0, X4]        // output[i..i+7] = X5
    ADD     X4, X4, #8          // i += 8
    CMP     X4, X2              // i < n?
    B.LO    .a64_loop           // B.LO = unsigned lower (equivalent to JB)
.a64_done:
    RET
```
Per-iteration: LDR + EOR + STR + ADD + CMP + B.LO = 6 instructions, 8 bytes processed. Same as x86-64 scalar.
Instruction comparison:
| Step | x86-64 | ARM64 |
|---|---|---|
| Load | MOV r8, [rsi+rax] | LDR X5, [X1, X4] |
| XOR | XOR r8, rcx | EOR X5, X5, X3 |
| Store | MOV [rdi+rax], r8 | STR X5, [X0, X4] |
| Increment | ADD rax, 8 | ADD X4, X4, #8 |
| Compare | CMP rax, rdx | CMP X4, X2 |
| Branch | JB .loop | B.LO .loop |
Nearly identical structure. Notably, the supposed x86-64 advantage of base+index addressing (MOV r8, [rsi+rax] as one instruction) is matched exactly: ARM64's LDR X5, [X1, X4] uses the same base+index form. ARM64's addressing modes are richer than just "base+immediate".
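A portability note on the C reference: the uint64_t casts in xor_encrypt_8 assume 8-byte alignment and technically violate strict aliasing. The idiomatic fix is memcpy, which GCC and Clang fold into the same single 8-byte load and store seen in both scalar loops above. A sketch (the _portable suffix is ours):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

void xor_encrypt_8_portable(uint8_t *output, const uint8_t *input, size_t n,
                            uint64_t key_8bytes) {
    for (size_t i = 0; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, input + i, 8);   /* folds into one 8-byte load (MOV / LDR) */
        w ^= key_8bytes;            /* XOR r8, rcx  /  EOR X5, X5, X3 */
        memcpy(output + i, &w, 8);  /* one 8-byte store (MOV / STR) */
    }
}
```

Same six-instruction loop body on both architectures, but now well-defined C for any buffer alignment.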
NEON (16 bytes per iteration)
```asm
// xor_encrypt_neon: X0 = output, X1 = input, X2 = n (mult. of 16)
// X3 = key pointer (16-byte key)
.global xor_encrypt_neon
xor_encrypt_neon:
    CBZ     X2, .neon_done
    LDR     Q7, [X3]            // V7 = 16-byte key (load once)
    MOV     X4, XZR             // i = 0
.neon_loop:
    LDR     Q0, [X1, X4]        // V0 = input[i..i+15] (128 bits)
    EOR     V0.16B, V0.16B, V7.16B  // V0 ^= key (16 bytes at once)
    STR     Q0, [X0, X4]        // output[i..i+15] = V0
    ADD     X4, X4, #16         // i += 16
    CMP     X4, X2
    B.LO    .neon_loop
.neon_done:
    RET
```
Per-iteration: LDR Q + EOR V.16B + STR Q + ADD + CMP + B.LO = 6 instructions, 16 bytes processed.
Compare to x86-64 SSE2:
- x86-64: MOVDQU ("move double quadword unaligned") + PXOR ("packed XOR") + MOVDQU
- ARM64: LDR Q ("load quadword") + EOR V.16B ("EOR 16 bytes") + STR Q ("store quadword")
The ARM64 NEON syntax is more regular: the same EOR instruction that XORs scalar registers also XORs 16-byte vectors — just with a .16B suffix. x86-64 uses a completely different mnemonic (PXOR for packed XOR vs. XOR for scalar).
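In C, the NEON loop maps onto the arm_neon.h intrinsics vld1q_u8, veorq_u8, and vst1q_u8. The sketch below (the function name xor_encrypt_16 is ours) guards the NEON path behind the standard __ARM_NEON macro and falls back to two 8-byte words per iteration elsewhere, so the same source builds on both architectures:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* n must be a multiple of 16; key16 points to a 16-byte key. */
void xor_encrypt_16(uint8_t *output, const uint8_t *input, size_t n,
                    const uint8_t *key16) {
#if defined(__ARM_NEON)
    uint8x16_t key = vld1q_u8(key16);            /* LDR Q7, [X3]   */
    for (size_t i = 0; i + 16 <= n; i += 16) {
        uint8x16_t v = vld1q_u8(input + i);      /* LDR Q0         */
        v = veorq_u8(v, key);                    /* EOR V0.16B     */
        vst1q_u8(output + i, v);                 /* STR Q0         */
    }
#else
    /* Portable fallback: two 8-byte words per iteration. */
    uint64_t k0, k1;
    memcpy(&k0, key16, 8);
    memcpy(&k1, key16 + 8, 8);
    for (size_t i = 0; i + 16 <= n; i += 16) {
        uint64_t a, b;
        memcpy(&a, input + i, 8);
        memcpy(&b, input + i + 8, 8);
        a ^= k0;
        b ^= k1;
        memcpy(output + i, &a, 8);
        memcpy(output + i + 8, &b, 8);
    }
#endif
}
```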
Side-by-Side Comparison
XOR Cipher Implementation Comparison

| | x86-64 Scalar | x86-64 SSE2 | ARM64 Scalar | ARM64 NEON |
|---|---|---|---|---|
| Bytes/iteration | 8 | 16 | 8 | 16 |
| Instructions/iter | 6 | 6 | 6 | 6 |
| Key register | RCX (GP) | XMM7 (SIMD) | X3 (GP) | V7 (NEON) |
| XOR instruction | XOR rN, rM | PXOR xmm, xmm | EOR Xn, Xn, Xm | EOR V.16B |
| Load instruction | MOV rN, [base] | MOVDQU | LDR Xn, [base] | LDR Qn |
| Store instruction | MOV [base], rN | MOVDQU | STR Xn, [base] | STR Qn |
| Loop overhead | ADD+CMP+JB | ADD+CMP+JB | ADD+CMP+B.LO | ADD+CMP+B.LO |
| Alignment required | No (any) | No (MOVDQU) | No (any) | No (LDR Q any) |
The instruction structures are remarkably parallel. Both architectures achieve the same logical operations in the same number of instructions; only the mnemonics and encoding differ.
Performance Benchmarking
For a 64KB buffer on modern hardware (approximate):
Performance: XOR encrypt 65536 bytes

| Implementation | Throughput (GB/s) | Cycles/byte |
|---|---|---|
| x86-64 scalar | ~8-10 | ~0.4 |
| x86-64 SSE2 | ~16-20 | ~0.2 |
| x86-64 AVX2 (256-bit) | ~32-40 | ~0.1 |
| ARM64 scalar | ~6-8 | ~0.5 |
| ARM64 NEON (128-bit) | ~12-16 | ~0.25 |
| Apple M4 NEON | ~25-30 | ~0.12 |
The bottleneck for large buffers is memory bandwidth, not instruction throughput. Both scalar and NEON versions become memory-bound beyond ~1MB.
Porting Checklist: x86-64 → ARM64
When porting crypto code between architectures:
- Replace XMM/YMM registers with V (NEON) registers: PXOR xmm0, xmm1 → EOR V0.16B, V0.16B, V1.16B
- Replace MOVDQU/MOVDQA with LDR Q/STR Q. ARM64 Q-register loads handle both aligned and unaligned addresses.
- Replace PAND/POR with AND/ORR plus vector suffixes: PAND xmm0, xmm1 → AND V0.16B, V0.16B, V1.16B
- Replace PCMPEQB with CMEQ: PCMPEQB xmm0, xmm1 → CMEQ V0.16B, V0.16B, V1.16B
- Replace PSLLW/PSLLD with the SHL vector variants: PSLLW xmm0, 1 → SHL V0.8H, V0.8H, #1
- Keep scalar XOR as EOR: XOR rax, rbx → EOR X0, X0, X1
The deeper crypto operations (AES-NI on x86-64 vs. AES hardware extensions on ARM64) are covered in the XOR→AES-NI anchor example in Chapter 35.
Summary
The XOR cipher port demonstrates that equivalent algorithms produce structurally equivalent assembly on x86-64 and ARM64. The mnemonic names differ (XOR vs. EOR, PXOR vs. EOR V.16B), but the logical structure — load, operate, store — is identical.
The SIMD extensions (SSE2 on x86-64, NEON on ARM64) both achieve 2× the scalar throughput with the same instruction count by processing 16 bytes per iteration instead of 8. The key difference: ARM64's NEON syntax is more regular (same EOR mnemonic with a vector suffix) while x86-64 uses entirely separate SIMD mnemonics (PXOR, PAND, etc.).
For a crypto developer porting between platforms, the main challenge is learning the new mnemonic names and verifying alignment behavior. The algorithmic logic translates directly.