Chapter 12 Further Reading: Arrays, Strings, and Data Structures
Intel Documentation
Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2: String Operations
The reference for the MOVS, STOS, SCAS, CMPS, and LODS instruction families. Section 4.2 covers the REP/REPE/REPNE prefixes and their interaction with the Direction Flag (DF). The discussion of the "fast string" optimization (Section 4.2.1.5 in some editions) describes the microarchitectural acceleration of REP MOVSB/STOSB on processors with the ERMSB (Enhanced REP MOVSB/STOSB) feature, which is reported by CPUID leaf 07H (ECX = 0) in EBX bit 9.
Intel 64 and IA-32 Architectures Optimization Reference Manual, Section 3.7.6: String and Memory Operations
Covers the performance characteristics of REP instructions on different microarchitectures. The table showing ERMSB performance thresholds (the size above which REP MOVSB matches or exceeds custom copy loops) is directly relevant to the memcpy case study.
String Library Implementation
glibc string functions source code
sourceware.org/glibc/
The string/ directory contains the C reference implementations. The sysdeps/x86_64/ directory contains the optimized assembly versions: memcpy.S, memset.S, strlen.S, and strcmp.S. Reading these shows the full complexity of a production string function: multiple size thresholds, CPUID-based dispatch between AVX2 and SSE2 code paths, and alignment handling.
"Why memcpy() is Better Than You Think" — blog post, cloudflare.com
Discusses the non-temporal store optimization (MOVNTQ) for large copies, reporting a 40-60% improvement on multi-GB copies because the stores bypass the cache and avoid polluting it. Includes assembly code examples.
Data Structure Layout
"Data Structures in the Linux Kernel" — various kernel documentation
kernel.org/doc/html/latest/
The Linux kernel's include/linux/list.h implements intrusive doubly linked lists: the next and prev pointers are embedded in the struct being linked rather than stored in a separately allocated node. Reading this implementation shows how real-world code manipulates linked lists without separate node allocations, a technique that carries over directly from C to assembly.
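To make the intrusive idea concrete before reading the kernel source, here is a minimal sketch in C. It is not the kernel's code: the names `list_node`, `CONTAINER_OF`, `list_init`, `list_add_tail`, `list_remove`, `struct task`, and `sum_task_ids` are all hypothetical, chosen for illustration, and the macro is a simplified version of the usual "subtract the member offset" trick.

```c
#include <assert.h>
#include <stddef.h>

/* Link fields only: no payload, and no separate node allocation. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* Recover the enclosing struct from a pointer to its embedded node
 * by subtracting the member's offset (simplified, hypothetical macro). */
#define CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* An empty circular list is a head that points at itself. */
static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert node just before head, which is the tail of a circular list. */
static void list_add_tail(struct list_node *node, struct list_node *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Unlink node from its list and leave it pointing at itself. */
static void list_remove(struct list_node *node)
{
    assert(node->next != NULL && node->prev != NULL);
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->next = node;
    node->prev = node;
}

/* Hypothetical payload type with the links embedded in it. */
struct task {
    int id;
    struct list_node link;
};

/* Walk the list and add up the id of every task on it. */
static int sum_task_ids(const struct list_node *head)
{
    int total = 0;
    for (const struct list_node *p = head->next; p != head; p = p->next)
        total += CONTAINER_OF(p, struct task, link)->id;
    return total;
}
```

Because the links live inside `struct task`, adding a task to a list is four pointer stores and no call to an allocator. In assembly the same operations become loads and stores at fixed offsets from the node address.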
"Structure Layout Optimization", in Ulrich Drepper, "What Every Programmer Should Know About Memory"
lwn.net/Articles/250967/
Section 6 covers struct layout optimization for cache performance. The AoS vs. SoA analysis (Section 6.2) includes assembly-level examples of how different layouts affect SIMD vectorization.
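As a reminder of what the AoS (array of structures) and SoA (structure of arrays) layouts look like, the following C sketch sums one field under each layout. The types `particle_aos` and `particles_soa`, the capacity `N`, and both function names are hypothetical and are not taken from Drepper's text.

```c
#include <assert.h>
#include <stddef.h>

#define N 256  /* hypothetical fixed capacity for the SoA arrays */

/* Array of Structures: x, y, z of one particle sit next to each other. */
struct particle_aos {
    float x, y, z;
};

/* Structure of Arrays: all x values are contiguous, then all y, then all z. */
struct particles_soa {
    float x[N];
    float y[N];
    float z[N];
};

/* Sum the x field in AoS layout: consecutive x values are
 * sizeof(struct particle_aos) bytes apart, so y and z are pulled
 * into cache even though they are never used. */
float sum_x_aos(const struct particle_aos *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

/* Sum the x field in SoA layout: consecutive x values are adjacent
 * floats, the access pattern that packed SIMD loads expect. */
float sum_x_soa(const struct particles_soa *p, size_t n)
{
    assert(n <= N);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];
    return s;
}
```

Both functions compute the same result; only the memory stride differs. Whether a compiler actually vectorizes either loop depends on the compiler, its flags, and the target, so treat this as a layout illustration rather than a benchmark.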
NASM struc Directive
NASM Manual, Section 4.11: struc and endstruc
nasm.us/doc/nasmdoc4.html
Documents the struc/endstruc directives for defining struct layouts in NASM assembly. The istruc/iend directives allow initializing struct instances in the .data section. These are the NASM equivalents of C's struct and offsetof.
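For comparison, this is roughly what the C side of that equivalence looks like. The sketch below uses a hypothetical `struct record` and hypothetical constant names; it shows `offsetof` producing the same kind of named field offsets that a NASM struc definition provides as symbols. One difference to keep in mind: the C compiler inserts alignment padding on its own, while in NASM the layout is exactly what the reserved sizes add up to unless alignment is requested explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record type used only for illustration. */
struct record {
    int64_t id;        /* bytes 0..7   */
    char    name[16];  /* bytes 8..23  */
    int32_t score;     /* bytes 24..27 */
};

/* Named field offsets, analogous to the symbols a struc block defines. */
enum {
    RECORD_ID    = offsetof(struct record, id),
    RECORD_NAME  = offsetof(struct record, name),
    RECORD_SCORE = offsetof(struct record, score),
    RECORD_SIZE  = sizeof(struct record)  /* includes any tail padding */
};

/* Read the score field through base + offset, the way an assembly
 * routine would address it from a pointer register. */
int32_t load_score(const void *base)
{
    assert(base != NULL);
    int32_t v;
    memcpy(&v, (const char *)base + RECORD_SCORE, sizeof v);
    return v;
}
```

When a struct is shared between C and assembly, printing these constants from C and comparing them with the assembler's values is a cheap way to catch a layout mismatch.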
Performance Analysis
Agner Fog's "Optimizing Subroutines in Assembly Language", Chapter 17: Optimizing Memory Access
agner.org/optimize/optimizing_assembly.pdf
Covers cache behavior, alignment, prefetching, and the performance characteristics of REP instructions versus loop-based alternatives. Table 17.2 shows the breakeven size where REP MOVSQ overtakes an explicit qword loop on various processors.