Chapter 15 Further Reading: SIMD Programming
Intel Documentation
Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1, Chapter 12: Programming with SSE, SSE2, and AVX
The authoritative reference for SSE2 and AVX instruction semantics, data type encodings, and programming model. Section 12.3 covers the MXCSR control register for packed float operations. Section 12.15 covers the VZEROUPPER and VZEROALL instructions and the conditions under which the AVX-SSE transition penalty applies.
Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2: AESENC, AESENCLAST, AESDEC, AESDECLAST, AESKEYGENASSIST, AESIMC entries
The formal instruction specifications. Each entry includes the exact computation performed (in pseudocode using the AES specification's round function definitions), the effects on EFLAGS (none for AES-NI instructions), and exceptions. The pseudocode directly references the FIPS 197 AES standard for the SubBytes, ShiftRows, MixColumns, and AddRoundKey transformations.
Intel Intrinsics Guide — SSE2, SSSE3, AVX, AVX2, AES sections
software.intel.com/sites/landingpage/IntrinsicsGuide/
Maps each C intrinsic to the assembly instruction it corresponds to: _mm256_fmadd_ps → VFMADD213PS (the compiler may select the 132 or 231 form instead); _mm_aesenc_si128 → AESENC; _mm256_shuffle_epi8 → VPSHUFB. Essential when reading performance-critical library code that uses intrinsics rather than raw assembly.
SIMD Performance
Agner Fog, "Instruction Tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs"
agner.org/optimize/instruction_tables.pdf
Per-microarchitecture throughput and latency for all SIMD instructions: ADDPS, MULPS, VFMADD213PS, AESENC, SHUFPS, PSHUFB, etc. The table shows that on Skylake, AESENC has 4-cycle latency and 1-cycle reciprocal throughput. Latency divided by reciprocal throughput gives 4, confirming that four independent AESENC chains in flight (4-way pipelining) saturate the execution unit.
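The latency and throughput reasoning can be sketched as a small model. This is a deliberately simplified illustration, not a measurement: it assumes a single execution unit, ignores loads, stores, and the key schedule, and the function names are hypothetical. It only shows why interleaving more than latency / reciprocal-throughput independent chains stops helping.

```c
/* Simplified pipelining model (an assumption, not a benchmark).
 * With k independent dependency chains, one AES round across all k
 * chains cannot finish faster than the instruction latency (each chain
 * must wait for its own previous result) nor faster than k issue slots. */
double cycles_per_block_round(int k, int latency, int recip_throughput)
{
    int issue_bound  = k * recip_throughput;
    int round_cycles = latency > issue_bound ? latency : issue_bound;
    return (double)round_cycles / k;
}

/* Smallest number of chains at which the unit is kept busy:
 * ceil(latency / recip_throughput). */
int chains_to_saturate(int latency, int recip_throughput)
{
    return (latency + recip_throughput - 1) / recip_throughput;
}
```

With the Skylake figures quoted above (latency 4, reciprocal throughput 1), the model gives 4 cycles per block per round for one chain, 2 for two chains, and 1 for four or more chains, so going from 4-way to 8-way interleaving buys nothing in this model.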
Agner Fog, "Optimizing software in C++: An optimization guide for Windows, Linux, and Mac platforms"
agner.org/optimize/optimizing_cpp.pdf
Chapter 12 covers SIMD optimization strategies: data layout, alignment, loop structure, horizontal reductions, and the specific costs of shuffle instructions. The discussion of AoS (array of structures) vs. SoA (structure of arrays) layouts and of lane-crossing operations is one of the most practical treatments of the topic available.
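As a reminder of what the AoS vs. SoA distinction looks like in code, here is a minimal sketch. The type names, function names, and the fixed capacity N_POINTS are hypothetical and chosen only for illustration; they do not come from Fog's guide.

```c
#include <stddef.h>

/* Array of Structures: the x, y, z of one point are adjacent, so
 * consecutive x values sit 12 bytes apart (strided access). A vector
 * load of 8 floats picks up x0,y0,z0,x1,y1,z1,x2,y2 and needs shuffles
 * before the x values can be processed together. */
struct point_aos { float x, y, z; };

/* Structure of Arrays: all x values are contiguous, so a vector load
 * of 8 floats picks up 8 useful x values (unit stride). */
#define N_POINTS 1024   /* hypothetical fixed capacity */
struct points_soa {
    float x[N_POINTS];
    float y[N_POINTS];
    float z[N_POINTS];
};

/* Scale only the x coordinate, AoS layout. */
void scale_x_aos(struct point_aos *p, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        p[i].x *= s;
}

/* Same operation, SoA layout: a plain unit-stride loop. */
void scale_x_soa(struct points_soa *p, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        p->x[i] *= s;
}
```

Both functions compute the same result; the difference is only in how easily the loop maps onto packed loads and stores.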
AES and Cryptography
FIPS 197: Advanced Encryption Standard (AES)
csrc.nist.gov/publications/detail/fips/197/final
The AES standard itself. Appendix B provides the complete worked example (key schedule + block encryption) that can be used to verify any AES implementation. The pseudocode in Section 5 directly maps to what AESENC computes; reading it alongside the Intel instruction description gives a complete picture of what the hardware is doing.
"Intel Advanced Encryption Standard (AES) New Instructions Set", Shay Gueron, Intel Whitepaper
intel.com/content/dam/doc/white-paper/advanced-encryption-standard-new-instructions-set-paper.pdf
The definitive reference for AES-NI programming. Covers the complete key schedule implementation for AES-128/192/256, CTR mode, GCM (authenticated encryption), and performance measurements across Intel microarchitectures. The key expansion NASM macros in this chapter are adapted from this document. Also presents the argument that AES-NI resists timing attacks: the rounds are computed by a hardware circuit with data-independent latency rather than by table lookups that leave a cache footprint.
Auto-Vectorization
GCC Manual, Section 6.60 — "Auto-vectorization"
gcc.gnu.org/onlinedocs/gcc/Auto-vectorization.html
Documents the -fopt-info-vec and -fopt-info-vec-missed flags that report why each loop was or was not vectorized. The section on vectorization hints (__builtin_assume_aligned, __builtin_expect, __restrict__) explains how to guide the compiler toward better vectorization decisions.
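A minimal sketch of how two of these hints look in practice. The function name saxpy_hinted and the 32-byte alignment are assumptions made for illustration; the alignment promise must actually hold for the caller's buffers, otherwise the behavior is undefined.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]
 * __restrict__ promises the compiler that x and y do not overlap, which
 * removes the need for a runtime aliasing check.
 * __builtin_assume_aligned promises 32-byte alignment (an assumption of
 * this sketch), which lets the compiler use aligned vector accesses. */
void saxpy_hinted(float *__restrict__ y, const float *__restrict__ x,
                  float a, size_t n)
{
    float *yy       = __builtin_assume_aligned(y, 32);
    const float *xx = __builtin_assume_aligned(x, 32);

    for (size_t i = 0; i < n; i++)
        yy[i] = a * xx[i] + yy[i];
}
```

Compiling with something like gcc -O3 -mavx2 -fopt-info-vec -fopt-info-vec-missed should report whether the loop was vectorized. Callers can satisfy the alignment promise with aligned_alloc or with __attribute__((aligned(32))) on the array declaration.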
"Vectorization in LLVM" — LLVM documentation llvm.org/docs/Vectorizers.html Explains how Clang's vectorizer works (loop vectorizer and SLP/straight-line vectorizer) and the specific conditions that prevent vectorization. Even when targeting GCC assembly output, understanding the compiler's vectorization model helps write C code that vectorizes well or hand-written SIMD that matches what a compiler would generate.
Image Processing Reference
"Computer Vision: Algorithms and Applications" by Richard Szeliski (2nd edition), Springer, 2022
Chapter 3 covers image processing fundamentals including color space conversions (RGB→YCbCr, which is the full version of the grayscale formula), Gaussian blur, and convolution. The SIMD implementation of 2D convolution (a cornerstone of CNN inference) extends directly from the techniques in this chapter.