Chapter 15 Key Takeaways: SIMD Programming

  1. x86-64 has three SIMD register widths: XMM (128-bit, SSE/SSE2), YMM (256-bit, AVX/AVX2), and ZMM (512-bit, AVX-512). XMM registers are the low 128 bits of YMM registers; YMM registers are the low 256 bits of ZMM. Whether a 128-bit instruction writing XMM leaves the upper YMM bits unchanged or zeroes them depends on the encoding: legacy SSE encodings leave them unchanged, while the VEX-encoded 128-bit forms (e.g., VADDPS xmm, xmm, xmm) zero them.

  2. Packed operations apply the same operation independently to all lanes. ADDPS xmm0, xmm1 adds four float pairs: xmm0[i] += xmm1[i] for i in {0,1,2,3}. There is no communication between lanes during a packed operation (unlike a horizontal reduction, which requires explicit shuffle-and-add steps).

  3. MOVAPS requires 16-byte alignment; MOVUPS does not. On modern Intel processors (Nehalem and later), MOVUPS with aligned data has essentially no performance penalty over MOVAPS. Use MOVAPS for guaranteed-aligned data as documentation, MOVUPS when alignment cannot be guaranteed. For AVX: VMOVAPS/VMOVUPS (32-byte alignment for YMM).

  4. VZEROUPPER is mandatory before transitioning from AVX code to legacy SSE code. Legacy SSE instructions must preserve the upper YMM bits, so once 256-bit AVX code leaves those bits dirty, every subsequent SSE instruction carries a false dependency on the full YMM register — a state-transition penalty of up to ~70 cycles on Sandy Bridge/Ivy Bridge. Always emit VZEROUPPER at the end of any function that uses YMM registers.

  5. FMA (VFMADD213PS and friends) fuses a multiply and add into a single instruction with a single rounding operation. This improves both performance (1 instruction instead of 2) and accuracy (one rounding instead of two). The three-digit suffix encodes operand order: number the operands 1=dst, 2=src1, 3=src2; the first two digits name the multiplicands and the third names the addend. So 213 → dst = src1 × dst + src2. Memorize: 132 = dst × src2 + src1; 213 = src1 × dst + src2; 231 = src1 × src2 + dst.

  6. AVX2 shuffle instructions (VPSHUFB, VPSHUFD) operate within each 128-bit lane independently — they do NOT cross the 128-bit boundary. This is the most common AVX2 pitfall. A 256-bit shuffle produces two independent 128-bit shuffles. Use VPERMQ (which operates on 64-bit quadwords and can cross lanes) to consolidate data across the lane boundary when needed.

  7. Horizontal reduction of a SIMD register (summing all lanes) requires a series of shuffle-and-add steps. For 4-float XMM: 2 rounds of SHUFPS+ADDPS. For 8-float YMM: extract upper lane with VEXTRACTF128, add to lower lane, then complete the 4-element reduction. The horizontal reduction is the most expensive step in many vectorized loops; minimize it by accumulating vertically through the loop and reducing once at the end.

  8. AES-NI instructions (AESENC, AESENCLAST, AESDEC, AESDECLAST, AESKEYGENASSIST, AESIMC) perform complete AES rounds in hardware. AES-128 encryption requires: 1 PXOR (whitening) + 9 AESENC + 1 AESENCLAST. The key schedule for AES-128 requires 10 applications of AESKEYGENASSIST (one per round key derivation), combined with PSHUFD, PSLLDQ, and PXOR to implement the complete key expansion.

  9. AES-NI is constant-time: execution time does not depend on key or data values. Software AES using T-tables accesses memory addresses derived from key material, leaking information via cache-timing side channels. AES-NI implements SubBytes as a combinational hardware circuit with data-independent timing. This is the primary security motivation for using AES-NI in cryptographic code.

  10. CTR mode makes AES a stream cipher by encrypting successive counter values and XORing with plaintext. Counter blocks are independent, enabling parallel encryption of multiple blocks. Pipelining 4 blocks simultaneously quadruples throughput (from ~2.5 cycles/byte to ~0.6 cycles/byte) because the 4-cycle latency of AESENC is hidden behind parallel execution of independent blocks.

  11. The PADDQ instruction increments the 64-bit counter in AES-CTR mode. PADDQ xmm, [ctr_increment] where ctr_increment = {1, 0} adds 1 to the low 64 bits, leaving the high 64 bits (nonce) unchanged. For cryptographic correctness, the counter must not repeat within a key lifetime — a 64-bit counter allows 2^64 blocks of 16 bytes (2^68 bytes) before wraparound.

  12. Vectorizing a loop requires eliminating data dependencies between iterations. Auto-vectorization with GCC -O3 -mavx2 generates AVX2 code for simple loops, but requires: no aliasing between input and output pointers (__restrict__), no loop-carried dependencies, and a stride-1 access pattern. Add __builtin_assume_aligned() hints, and use -fopt-info-vec-all to see which loops were vectorized and why the others were not.

  13. SIMD image processing works best with Structure-of-Arrays (SoA) layout. Array-of-Structures (AoS) requires shuffle instructions to extract individual channels before SIMD computation. SoA stores all R values together, all G values together, etc., enabling direct SIMD loads of 16/32 channel values at once. For existing AoS data, use PSHUFB (SSSE3) to extract channels efficiently.

  14. PMADDUBSW (SSSE3) computes 8 dot products of 8-bit unsigned × 8-bit signed pairs, summing adjacent pairs to 16-bit results with signed saturation. This single instruction replaces separate multiply and add instructions for weighted sums like the grayscale formula (Y = (77*R + 150*G + 29*B) / 256), making it a key instruction for image processing, neural network inference, and audio processing.

  15. AVX-512 (512-bit ZMM registers) doubles the SIMD width again, adding mask registers (k0-k7) for predicated execution. Masked operations like VADDPS zmm0{k1}, zmm1, zmm2 only update lanes where the corresponding bit in k1 is set. This eliminates many scalar loop tails for handling non-multiple-of-vector-length inputs. AVX-512 requires Skylake-X or Ice Lake and later; check CPU support with CPUID before using it.