Chapter 15 Key Takeaways: SIMD Programming
- x86-64 has three SIMD register widths: XMM (128-bit, SSE/SSE2), YMM (256-bit, AVX/AVX2), and ZMM (512-bit, AVX-512). XMM registers are the low 128 bits of YMM registers; YMM registers are the low 256 bits of ZMM. What a 128-bit instruction does to the upper bits of YMM depends on its encoding: legacy SSE encodings leave the upper bits unchanged, while VEX-encoded (`V`-prefixed) 128-bit instructions zero them.
- Packed operations apply the same operation independently to all lanes. `ADDPS xmm0, xmm1` adds four float pairs: xmm0[i] += xmm1[i] for i in {0,1,2,3}. There is no communication between lanes during a packed operation (unlike a horizontal reduction, which requires explicit shuffle-and-add steps).
- `MOVAPS` requires 16-byte alignment; `MOVUPS` does not. On modern Intel processors (Nehalem and later), `MOVUPS` with aligned data has essentially no performance penalty over `MOVAPS`. Use `MOVAPS` for guaranteed-aligned data as documentation, `MOVUPS` when alignment cannot be guaranteed. For AVX: `VMOVAPS`/`VMOVUPS` (32-byte alignment for YMM).
- `VZEROUPPER` is mandatory before transitioning from AVX code to legacy SSE code. AVX instructions that write YMM leave the upper 128 bits "dirty"; legacy SSE instructions preserve those upper bits, so the processor must carry them through every SSE operation, causing a state-transition penalty (up to ~70 cycles on Sandy Bridge/Ivy Bridge) or false dependencies on later microarchitectures. Always emit `VZEROUPPER` at the end of any function that uses YMM registers.
- FMA (`VFMADD213PS` and friends) fuses a multiply and add into a single instruction with a single rounding operation. This improves both performance (1 instruction instead of 2) and accuracy (one rounding instead of two). The three-digit suffix names operand positions (1 = dst, 2 = src1, 3 = src2): the first two digits are multiplied and the third is added, so `VFMADD213PS` computes dst = src1*dst + src2. Memorize: 132 = dst*src2 + src1; 213 = src1*dst + src2; 231 = src1*src2 + dst.
- AVX2 shuffle instructions (`VPSHUFB`, `VPSHUFD`) operate within each 128-bit lane independently — they do NOT cross the 128-bit boundary. This is the most common AVX2 pitfall. A 256-bit shuffle produces two independent 128-bit shuffles. Use `VPERMQ` (which operates on 64-bit quadwords and can cross lanes) to consolidate data across the lane boundary when needed.
- Horizontal reduction of a SIMD register (summing all lanes) requires a series of shuffle-and-add steps. For 4-float XMM: 2 rounds of `SHUFPS` + `ADDPS`. For 8-float YMM: extract the upper lane with `VEXTRACTF128`, add it to the lower lane, then complete the 4-element reduction. The horizontal reduction is the most expensive step in many vectorized loops; minimize it by accumulating vertically through the loop and reducing once at the end.
- AES-NI instructions (`AESENC`, `AESENCLAST`, `AESDEC`, `AESDECLAST`, `AESKEYGENASSIST`, `AESIMC`) perform complete AES rounds in hardware. AES-128 encryption requires: 1 `PXOR` (whitening) + 9 `AESENC` + 1 `AESENCLAST`. The key schedule for AES-128 requires 10 applications of `AESKEYGENASSIST` (one per round key derivation), combined with `PSHUFD`, `PSLLDQ`, and `PXOR` to implement the complete key expansion.
- AES-NI is constant-time: execution time does not depend on key or data values. Software AES using T-tables accesses memory addresses derived from key material, leaking information via cache-timing side channels. AES-NI implements SubBytes as a combinational hardware circuit with data-independent timing. This is the primary security motivation for using AES-NI in cryptographic code.
- CTR mode makes AES a stream cipher by encrypting successive counter values and XORing with plaintext. Counter blocks are independent, enabling parallel encryption of multiple blocks. Pipelining 4 blocks simultaneously quadruples throughput (from ~2.5 cycles/byte to ~0.6 cycles/byte) because the 4-cycle latency of `AESENC` is hidden behind parallel execution of independent blocks.
- The `PADDQ` instruction increments the 64-bit counter in AES-CTR mode. `PADDQ xmm, [ctr_increment]` where `ctr_increment = {1, 0}` adds 1 to the low 64 bits, leaving the high 64 bits (nonce) unchanged. For cryptographic correctness, the counter must not repeat within a key lifetime — a 64-bit counter allows 2^64 blocks (2^68 bytes, at 16 bytes per block) before wraparound.
- Vectorizing a loop requires eliminating data dependencies between iterations. Auto-vectorization with GCC `-O3 -mavx2` generates AVX2 code for simple loops, but requires no aliasing between input and output pointers (`__restrict__`), no loop-carried dependencies, and a stride-1 access pattern. Add `__builtin_assume_aligned()` hints where alignment is known, and use `-fopt-info-vec` to see why a loop was or was not vectorized.
- SIMD image processing works best with Structure-of-Arrays (SoA) layout. Array-of-Structures (AoS) requires shuffle instructions to extract individual channels before SIMD computation. SoA stores all R values together, all G values together, etc., enabling direct SIMD loads of 16/32 channel values at once. For existing AoS data, use `PSHUFB` (SSSE3) to extract channels efficiently.
- `PMADDUBSW` (SSSE3) computes 8 dot products of 8-bit unsigned × 8-bit signed pairs, summing adjacent pairs to 16-bit results. This single instruction replaces separate multiply and add instructions for weighted sums like the grayscale formula Y = (77*R + 150*G + 29*B) / 256, making it a key instruction for image processing, neural network inference, and audio processing. -
- AVX-512 (512-bit ZMM registers) doubles the SIMD width again, adding mask registers (k0-k7) for predicated execution. Masked operations like `VADDPS zmm0{k1}, zmm1, zmm2` only update lanes where the corresponding bit in k1 is set. This eliminates many scalar loop tails for handling non-multiple-of-vector-length inputs. AVX-512 requires Skylake-X or Ice Lake and later; check CPU support with CPUID before using it.