In This Chapter
- One Instruction, Eight Results
- The XMM, YMM, and ZMM Register Files
- SSE2 Packed Operations
- AVX/AVX2: 256-bit Operations
- Shuffle and Permute Instructions
- Alignment: Performance and Correctness
- Vectorizing a Loop: Sum of Array
- Performance Comparison: Array Sum
- AES-NI: Hardware-Accelerated Encryption
- Performance: AES-NI vs. Software AES
- Auto-Vectorization: What GCC Does with -O3
- Complete Example: SIMD Grayscale Conversion Preview
- The XOR → AES-NI Encryption Tool: Complete
- Summary
Chapter 15: SIMD Programming
One Instruction, Eight Results
SIMD stands for Single Instruction, Multiple Data. The concept: instead of adding two numbers in one instruction, add eight pairs of numbers in one instruction. The hardware applies the same operation in parallel to multiple data elements packed into a wide register. For code that processes arrays, images, audio samples, or cryptographic blocks, SIMD delivers the most dramatic performance improvements available in software.
x86-64 has accumulated three generations of SIMD registers:
- XMM (128 bits): introduced with SSE in 1999; holds 4 floats, 2 doubles, or various integer packings
- YMM (256 bits): introduced with AVX in 2011; holds 8 floats, 4 doubles, or wider integer packings
- ZMM (512 bits): introduced with AVX-512 in 2017; available on server-class processors but not universal
This chapter covers SSE2 and AVX2 — the baseline that all modern x86-64 processors support — plus the AES-NI instructions that complete the encryption tool anchor example.
The XMM, YMM, and ZMM Register Files
The three register files are hierarchically related:
ZMM0 (512 bits):
┌───────────────────────────────────────────────────────────────────┐
│                            ZMM0 [511:0]                           │
├─────────────────────────────────┬─────────────────────────────────┤
│           YMM0 [255:0]          │      upper half of ZMM0         │
├────────────────┬────────────────┤                                 │
│  XMM0 [127:0]  │ YMM0 [255:128] │                                 │
└────────────────┴────────────────┴─────────────────────────────────┘
(low-order bits drawn on the left)
A VEX-encoded instruction that writes XMM0 zeroes the upper 128 bits of YMM0 (and therefore ZMM0 as well) — 128-bit results are zero-extended into the full register. Legacy (non-VEX) SSE instructions instead leave YMM0's upper 128 bits unchanged; this preserved-but-stale upper state is the root cause of the AVX-SSE transition penalty discussed later in this chapter. Likewise, an AVX instruction writing YMM0 zeroes the upper 256 bits of ZMM0.
In 64-bit mode there are 16 registers: XMM0-XMM15 (SSE) and YMM0-YMM15 (AVX). AVX-512 both widens the registers and extends the file to 32: ZMM0-ZMM31.
Data Packing
A 128-bit XMM register can hold:
| Type | Count | Example instruction suffix |
|---|---|---|
| float (32-bit) | 4 | PS (Packed Single) |
| double (64-bit) | 2 | PD (Packed Double) |
| int8_t / uint8_t | 16 | B |
| int16_t / uint16_t | 8 | W |
| int32_t / uint32_t | 4 | D |
| int64_t / uint64_t | 2 | Q |
A 256-bit YMM register holds twice as many elements (8 floats, 4 doubles, 32 bytes, etc.).
SSE2 Packed Operations
Packed Floating-Point
; Load 4 floats from memory into XMM0:
movaps xmm0, [rdi] ; aligned load (address must be 16-byte aligned)
movups xmm0, [rdi] ; unaligned load (any address, slightly slower on old CPUs)
; Packed float arithmetic (4 operations simultaneously):
addps xmm0, xmm1 ; xmm0[i] = xmm0[i] + xmm1[i] for i = 0..3
mulps xmm0, xmm1 ; element-wise multiply
subps xmm0, xmm1 ; element-wise subtract
divps xmm0, xmm1 ; element-wise divide
sqrtps xmm0, xmm1 ; element-wise square root (4 sqrts at once!)
maxps xmm0, xmm1 ; element-wise max
minps xmm0, xmm1 ; element-wise min
; Packed double arithmetic (2 doubles per register):
addpd xmm0, xmm1 ; xmm0[i] = xmm0[i] + xmm1[i] for i = 0..1
mulpd xmm0, xmm1
Packed Integer Arithmetic
; Packed 32-bit integer add (4 int32 operations):
paddd xmm0, xmm1 ; xmm0[i] += xmm1[i] for i = 0..3 (no carry between elements)
; Other sizes:
paddb xmm0, xmm1 ; packed byte add (16 bytes)
paddw xmm0, xmm1 ; packed word add (8 words)
paddq xmm0, xmm1 ; packed qword add (2 qwords)
; Saturating arithmetic (no overflow wraparound):
paddsb xmm0, xmm1 ; signed byte add with saturation (capped at 127/-128)
paddusb xmm0, xmm1 ; unsigned byte add with saturation (capped at 255)
; Packed multiply:
pmulld xmm0, xmm1 ; packed 32-bit multiply, low 32 bits of product (SSE4.1, not SSE2)
pmullw xmm0, xmm1 ; packed 16-bit multiply, low 16 bits
; Packed compare:
pcmpeqd xmm0, xmm1 ; xmm0[i] = (xmm0[i] == xmm1[i]) ? 0xFFFFFFFF : 0
pcmpgtd xmm0, xmm1 ; xmm0[i] = (xmm0[i] > xmm1[i]) ? 0xFFFFFFFF : 0
AVX/AVX2: 256-bit Operations
AVX uses the VEX encoding prefix and non-destructive three-operand syntax:
; AVX 256-bit float operations (8 floats at once):
vmovaps ymm0, [rdi] ; aligned load 8 floats
vaddps ymm0, ymm1, ymm2 ; ymm0 = ymm1 + ymm2 (three-operand, non-destructive)
vmulps ymm0, ymm1, ymm2 ; 8 multiplies simultaneously
vsqrtps ymm0, ymm1 ; 8 square roots simultaneously
; Fused Multiply-Add (FMA3, available with AVX2/Haswell+):
vfmadd213ps ymm0, ymm1, ymm2 ; ymm0 = ymm0 * ymm1 + ymm2 (a*b + c in one instruction)
vfmadd231ps ymm0, ymm1, ymm2 ; ymm0 = ymm1 * ymm2 + ymm0 (a*b + c with different register assignment)
; AVX2 integer operations (256-bit integers):
vpaddd ymm0, ymm1, ymm2 ; packed 32-bit add (8 elements)
vmovdqu ymm0, [rdi] ; load 32 bytes (unaligned)
⚠️ Common Mistake: Mixing AVX and SSE. Using legacy (non-VEX) SSE instructions after VEX-encoded AVX instructions incurs a penalty on the order of tens to a hundred cycles on some processors (the "AVX-SSE transition penalty") due to register state transitions. In mixed code, either use only VEX-encoded instructions throughout, or insert VZEROUPPER when transitioning from AVX back to SSE:
; After AVX code, before calling a legacy SSE function:
vzeroupper ; zero upper 128 bits of all YMM registers
; eliminates the transition penalty
Shuffle and Permute Instructions
SIMD shuffle instructions rearrange elements within or between registers. They are essential for many algorithms.
SHUFPS: Shuffle Floats
; SHUFPS dst, src, imm8
; Takes two 4-float registers, selects 4 elements to compose dst
; dst[0..1] come from dst, dst[2..3] come from src
; Immediate: 2 bits per element, encoding which element (0-3) to select
shufps xmm0, xmm1, 0b00_01_10_11 ; dst = [dst[3], dst[2], src[1], src[0]]
; Reading left to right: dst[0]=dst[3], dst[1]=dst[2], dst[2]=src[1], dst[3]=src[0]
; Broadcast element 0 to all positions:
shufps xmm0, xmm0, 0b00_00_00_00 ; xmm0 = [xmm0[0], xmm0[0], xmm0[0], xmm0[0]]
; Reverse element order:
shufps xmm0, xmm0, 0b00_01_10_11 ; [a, b, c, d] → [d, c, b, a]
PSHUFD: Shuffle Dwords in 128-bit Register
; Rearrange 4 int32 elements based on immediate (2 bits per element):
pshufd xmm0, xmm1, 0b11_10_01_00 ; identity permutation (no change)
pshufd xmm0, xmm1, 0b00_00_00_00 ; broadcast element 0 to all positions
pshufd xmm0, xmm1, 0b01_00_11_10 ; swap high and low pairs
VPERMILPS (AVX): Per-Element Permute
; Permute 4 floats within each 128-bit lane of YMM:
vpermilps ymm0, ymm1, imm8 ; each group of 4 floats permuted by imm
; Or with a control vector:
vpermilps ymm0, ymm1, ymm2 ; ymm2 specifies permutation per element
Alignment: Performance and Correctness
Why Alignment Matters
; 16-byte aligned load (fast, required for some instructions):
movaps xmm0, [rdi] ; SIGSEGV if rdi is not 16-byte aligned
; Unaligned load (always works, slight penalty on old hardware):
movups xmm0, [rdi] ; works at any address
; 32-byte aligned load for AVX:
vmovaps ymm0, [rdi] ; SIGSEGV if rdi is not 32-byte aligned
vmovdqu ymm0, [rdi] ; unaligned 256-bit load
On modern Intel hardware (Nehalem and later), unaligned loads that don't cross a cache line boundary have zero penalty. The performance difference between MOVAPS and MOVUPS is only meaningful on pre-2008 processors. For AVX, the guidance is the same.
Aligning Your Data
; In the .data section:
section .data
align 32 ; 32-byte alignment for AVX
my_float_array: times 8 dd 1.0 ; 8 floats (32 bytes)
; Dynamically allocated (align with posix_memalign):
; void *ptr;
; posix_memalign(&ptr, 32, size); // 32-byte aligned allocation
In NASM:
section .bss
align 32
buffer: resb 256 ; 256 bytes, 32-byte aligned
Vectorizing a Loop: Sum of Array
The fundamental SIMD optimization: a scalar loop over N elements becomes N/LANES iterations.
Scalar Version
; float array_sum_scalar(float *arr, int n)
; RDI = arr, ESI = n
array_sum_scalar:
xorps xmm0, xmm0 ; sum = 0.0
xor ecx, ecx
.loop:
cmp ecx, esi
jge .done
addss xmm0, [rdi + rcx*4] ; sum += arr[i]
inc ecx
jmp .loop
.done:
ret ; result in xmm0
SSE2 Vector Version (4 floats per iteration)
; float array_sum_sse(float *arr, int n)
array_sum_sse:
xorps xmm0, xmm0 ; accumulator = [0,0,0,0]
xor ecx, ecx
mov eax, esi
and eax, ~3 ; round down to multiple of 4 (floor(n/4)*4)
; Main loop: 4 elements per iteration
.vec_loop:
cmp ecx, eax
jge .scalar_tail
addps xmm0, [rdi + rcx*4] ; add 4 floats at once (memory operand must be 16-byte aligned; use MOVUPS + ADDPS for unaligned arr)
add ecx, 4
jmp .vec_loop
.scalar_tail:
; Handle remaining 0-3 elements
cmp ecx, esi
jge .reduce
addss xmm0, [rdi + rcx*4]
inc ecx
jmp .scalar_tail
.reduce:
; Horizontal sum: add the 4 lanes together
; xmm0 = [a, b, c, d] → we need a+b+c+d
movaps xmm1, xmm0
shufps xmm1, xmm0, 0b01_00_11_10 ; xmm1 = [c, d, a, b]
addps xmm0, xmm1 ; xmm0 = [a+c, b+d, c+a, d+b]
movaps xmm1, xmm0
shufps xmm1, xmm0, 0b10_11_00_01 ; xmm1 = [b+d, a+c, d+b, c+a]
addps xmm0, xmm1 ; xmm0[0] = a+b+c+d (and other lanes)
; Result is in xmm0[31:0]
ret
AVX2 Version (8 floats per iteration)
; float array_sum_avx(float *arr, int n)
array_sum_avx:
vxorps ymm0, ymm0, ymm0 ; accumulator = [0,0,0,0,0,0,0,0]
xor ecx, ecx
mov eax, esi
and eax, ~7 ; round down to multiple of 8
.vec_loop:
cmp ecx, eax
jge .tail
vaddps ymm0, ymm0, [rdi + rcx*4] ; add 8 floats
add ecx, 8
jmp .vec_loop
.tail:
; Reduce YMM to XMM first (a scalar VADDSS would zero ymm0's upper lanes):
vextractf128 xmm1, ymm0, 1 ; xmm1 = upper 128 bits of ymm0
vaddps xmm0, xmm0, xmm1 ; add lower and upper 4-float halves
; Horizontal sum of XMM (same shuffle/add sequence as the SSE version):
; ... (same as above)
; Then handle the remaining 0-7 elements:
.scalar_tail:
cmp ecx, esi
jge .done
vaddss xmm0, xmm0, [rdi + rcx*4]
inc ecx
jmp .scalar_tail
.done:
vzeroupper ; avoid AVX-SSE transition penalty
ret
Performance Comparison: Array Sum
| Implementation | 1000 elements | 10^6 elements | Speedup vs. scalar |
|---|---|---|---|
| Scalar | 1000 cy | 1.0M cy | 1× |
| SSE2 (4-wide) | 260 cy | 250K cy | ~4× |
| AVX2 (8-wide) | 135 cy | 128K cy | ~8× |
| AVX2 + FMA | 130 cy | 125K cy | ~8× (marginal gain for pure sum) |
The theoretical speedup for 8-wide SIMD is 8×. Real-world speedup is ~8× for large arrays where memory bandwidth is not the bottleneck. For small arrays, overhead (setup, tail handling) reduces the benefit.
AES-NI: Hardware-Accelerated Encryption
AES-NI is a set of SSE instructions (using XMM registers) that implement the AES (Advanced Encryption Standard) cipher rounds in hardware. Each instruction performs one AES round — 4 operations (SubBytes, ShiftRows, MixColumns, AddRoundKey) — in approximately 7 clock cycles.
The AES-NI Instructions
; AESENC xmm_state, xmm_roundkey
; One AES encryption round (SubBytes + ShiftRows + MixColumns + AddRoundKey)
aesenc xmm0, xmm1 ; state (xmm0) ← AES_round(state, round_key)
; AESENCLAST xmm_state, xmm_roundkey
; Final AES encryption round (SubBytes + ShiftRows + AddRoundKey, no MixColumns)
aesenclast xmm0, xmm1
; AESDEC xmm_state, xmm_roundkey
; One AES decryption round
aesdec xmm0, xmm1
; AESDECLAST xmm_state, xmm_roundkey
; Final AES decryption round
aesdeclast xmm0, xmm1
; AESKEYGENASSIST xmm_dst, xmm_src, imm8
; Key schedule computation (generates round key material)
aeskeygenassist xmm1, xmm0, 0x01
AES-128 Key Schedule Generation
; Expand a 128-bit AES key into 11 round keys (10 rounds + initial)
; Input: 16-byte key at [rdi]
; Output: 176-byte expanded key (11 × 16 bytes) at [rsi]
section .text
global aes128_key_expand
; Key schedule helper macro:
; Each round of key expansion uses AESKEYGENASSIST + XOR + permutation
%macro AES_KEY_EXPAND_128 2 ; args: round_const (imm8), store_offset
aeskeygenassist xmm2, xmm0, %1 ; xmm2 = KeyGenAssist(prev_key, rcon)
; xmm2[127:96] now contains the needed words; need to broadcast and XOR
; The actual expansion (conceptual - simplified):
; 1. Shuffle xmm2 to get the right word in all positions
pshufd xmm2, xmm2, 0xFF ; broadcast dword 3 to all positions
; 2. XOR with shifted version of current key
movdqa xmm3, xmm0 ; copy current key
pslldq xmm3, 4 ; shift left by 4 bytes (pure SSE2 — avoids mixing VEX and legacy encodings)
pxor xmm0, xmm3
pslldq xmm3, 4
pxor xmm0, xmm3
pslldq xmm3, 4
pxor xmm0, xmm3
pxor xmm0, xmm2 ; xmm0 = new round key
movdqu [rsi + %2], xmm0 ; store round key
%endmacro
aes128_key_expand:
movdqu xmm0, [rdi] ; load initial key
movdqu [rsi], xmm0 ; store as round key 0
AES_KEY_EXPAND_128 0x01, 16
AES_KEY_EXPAND_128 0x02, 32
AES_KEY_EXPAND_128 0x04, 48
AES_KEY_EXPAND_128 0x08, 64
AES_KEY_EXPAND_128 0x10, 80
AES_KEY_EXPAND_128 0x20, 96
AES_KEY_EXPAND_128 0x40, 112
AES_KEY_EXPAND_128 0x80, 128
AES_KEY_EXPAND_128 0x1B, 144
AES_KEY_EXPAND_128 0x36, 160
ret
AES-128 Block Encryption
; aes128_encrypt_block(uint8_t *block, const uint8_t *expanded_key)
; Encrypts one 16-byte block in-place
; RDI = block (in/out), RSI = expanded key (176 bytes)
section .text
global aes128_encrypt_block
aes128_encrypt_block:
movdqu xmm0, [rdi] ; load plaintext block
; Initial round key XOR:
movdqu xmm1, [rsi]
pxor xmm0, xmm1 ; AddRoundKey with round key 0
; Rounds 1-9:
%assign round 1
%rep 9
movdqu xmm1, [rsi + round*16]
aesenc xmm0, xmm1
%assign round round+1
%endrep
; Final round (round 10):
movdqu xmm1, [rsi + 160]
aesenclast xmm0, xmm1
movdqu [rdi], xmm0 ; store ciphertext
ret
AES-128 CTR Mode (Stream Encryption)
Counter mode converts AES block cipher into a stream cipher — suitable for encrypting arbitrary-length messages:
; aes128_ctr_encrypt(uint8_t *buf, size_t len, const uint8_t *expanded_key,
; uint8_t *nonce_counter)
; Encrypts/decrypts buf in-place using AES-128-CTR
; RDI = buf, RSI = len, RDX = expanded_key, RCX = nonce_counter (16 bytes, modified in-place)
section .text
global aes128_ctr_encrypt
aes128_ctr_encrypt:
push rbp
mov rbp, rsp
push rbx
push r12
push r13
push r14
sub rsp, 32 ; local stack space + alignment
mov rbx, rdi ; buf
mov r12, rsi ; len
mov r13, rdx ; expanded_key
mov r14, rcx ; nonce_counter
xor ecx, ecx ; byte offset = 0
.block_loop:
cmp rcx, r12 ; processed all bytes?
jge .done
; Load counter block and encrypt it to get keystream block:
movdqu xmm0, [r14] ; xmm0 = nonce || counter
movdqu xmm1, [r13] ; round key 0
pxor xmm0, xmm1 ; initial AddRoundKey
%assign round 1
%rep 9
movdqu xmm1, [r13 + round*16]
aesenc xmm0, xmm1
%assign round round+1
%endrep
movdqu xmm1, [r13 + 160]
aesenclast xmm0, xmm1 ; xmm0 = AES(key, counter) = keystream block
; Determine how many bytes of this keystream block to use:
mov rax, r12
sub rax, rcx ; remaining bytes
cmp rax, 16
jl .partial_block
; Full block: XOR 16 bytes at once:
movdqu xmm2, [rbx + rcx] ; load 16 bytes of ciphertext/plaintext
pxor xmm2, xmm0 ; XOR with keystream
movdqu [rbx + rcx], xmm2 ; store result
add rcx, 16
jmp .increment_counter
.partial_block:
; Handle remaining < 16 bytes one by one:
movdqu [rsp], xmm0 ; spill keystream block to stack
.tail_loop:
cmp rcx, r12
jge .increment_counter
mov r8, rcx
and r8, 15 ; r8 = rcx mod 16 = offset within keystream block
mov al, [rbx + rcx] ; load buf byte
xor al, [rsp + r8] ; XOR with keystream byte
mov [rbx + rcx], al
inc rcx
jmp .tail_loop
.increment_counter:
; Increment the 64-bit counter (low 8 bytes of nonce_counter):
add qword [r14 + 8], 1 ; increment counter (little-endian in low 8 bytes)
cmp rcx, r12
jl .block_loop
.done:
add rsp, 32
pop r14
pop r13
pop r12
pop rbx
pop rbp
ret
Performance: AES-NI vs. Software AES
| Implementation | Cycles per byte | MB/s (at 3GHz) |
|---|---|---|
| Software AES (C, -O2) | ~15-20 cy/byte | ~150-200 MB/s |
| AES-NI single block | ~4-5 cy/byte | ~600-750 MB/s |
| AES-NI pipelined (4 blocks) | ~1 cy/byte | ~3000 MB/s |
| AES-NI + AVX-512 | ~0.5 cy/byte | ~6000 MB/s |
The pipelined version processes 4 blocks simultaneously (4 independent states in XMM0-XMM3), hiding the 7-cycle AESENC latency by having independent operations in flight.
🔐 Security Note: AES-NI is constant-time by design — the hardware takes the same number of cycles regardless of the key or plaintext value. Software AES implementations that use lookup tables can be vulnerable to cache timing attacks (the table access pattern reveals information about the key). AES-NI eliminates this entire class of vulnerability. Always prefer AES-NI over software AES for security-sensitive code.
Auto-Vectorization: What GCC Does with -O3
GCC and Clang can auto-vectorize loops that meet certain conditions:
void add_arrays(float *a, float *b, float *c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
With gcc -O3 -march=native (or -mavx2):
; GCC -O3 -mavx2 output (simplified):
add_arrays:
; Main loop: 8 floats per iteration (AVX2)
.vec_loop:
vmovups ymm0, [rdi + ...] ; load 8 floats from a
vaddps ymm0, ymm0, [rsi + ...]; add 8 floats from b
vmovups [rdx + ...], ymm0 ; store 8 floats to c
; (counter and bounds check omitted for brevity)
; Tail loop: remaining < 8 elements (scalar)
; ...
vzeroupper
ret
The auto-vectorizer works when:
- The loop has no cross-iteration dependencies (each iteration is independent)
- The loop bounds are known or analyzable
- The data types match the SIMD lane width
- There is no aliasing (compiler must prove a, b, c don't overlap, or use restrict)
For loops the compiler cannot auto-vectorize, writing SIMD assembly (or using compiler intrinsics) is the way forward.
Complete Example: SIMD Grayscale Conversion Preview
Converting RGB to grayscale: L = 0.299*R + 0.587*G + 0.114*B, implemented for 16 pixels at a time (described in detail in Case Study 15.1).
The XOR → AES-NI Encryption Tool: Complete
The encryption tool anchor example is now complete:
- Chapter 13: XOR cipher — basic symmetric encryption using XOR, processing 8 bytes at a time
- Chapter 15: AES-NI CTR mode — hardware-accelerated AES, 16 bytes per AESENC instruction, 1 byte/cycle with pipelining
The production version of the tool would add:
- AES-GCM (Galois/Counter Mode) for authenticated encryption (protects against tampering)
- Proper nonce management (random 96-bit nonce per message)
- No padding logic — CTR is a stream mode (block modes like CBC would need PKCS#7 padding)
- Key derivation from a password using Argon2 or scrypt
All of these are standard constructions that combine AES-NI for the cipher with PCLMULQDQ (carry-less multiply) for the Galois-field multiplication in GCM. The assembly patterns — XMM register manipulation, AESENC in a loop, pipelining multiple blocks — are exactly what production tools use.
Summary
SIMD programming processes multiple data elements per instruction, achieving 4× (SSE2), 8× (AVX2), or 16× (AVX-512) speedup for data-parallel code. The key concepts: XMM (128-bit), YMM (256-bit), and ZMM (512-bit) registers; packed operations (PS for 4 floats, PD for 2 doubles, various integer widths); alignment requirements; shuffle operations for data rearrangement; the AVX-SSE transition penalty; and horizontal reduction (summing all lanes).
AES-NI completes the encryption tool: AESENC/AESENCLAST perform one AES round per instruction, achieving 4-5 cycles per byte (vs. 15-20 for software AES) and eliminating cache timing side-channels. AES-128 CTR mode combines block cipher encryption with counter-based keystream generation to produce a stream cipher for arbitrary-length messages.
Part II is complete. With the instruction set from Chapters 8-15, you have the vocabulary to read and write any x86-64 assembly code. Part III applies this vocabulary to the system-level topics: operating system interfaces, memory management, interrupts, device drivers, and the performance analysis tools that let you measure the cost of everything you have learned.