Chapter 12 Key Takeaways: Arrays, Strings, and Data Structures
- Arrays are contiguous memory accessed with scaled indexing: `[base + index*element_size]`. The four valid scale factors (1, 2, 4, 8) correspond directly to the byte sizes of the standard C integer types. For element sizes that do not match one of these scales, compute the byte offset manually.
- Bounds checking is your responsibility in assembly: the hardware does not check array bounds. An out-of-bounds access silently reads or writes whatever is at the computed address (or faults if that address is unmapped). The idiomatic bounds check is `cmp rcx, n; jae .error`. The unsigned comparison catches both `rcx >= n` (too large) and negative indices, which appear as huge unsigned numbers.
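
The unsigned-compare bounds check described above can be sketched in C. This is a minimal illustration, not code from the chapter: `index_in_bounds` and `checked_read` are hypothetical names, and the cast to `uint64_t` plays the role of the unsigned `jae` branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One unsigned compare covers both failure cases: a negative index
   reinterpreted as unsigned becomes a huge value that is never < n. */
bool index_in_bounds(int64_t index, uint64_t n)
{
    return (uint64_t)index < n;   /* mirrors: cmp rcx, n / jae .error */
}

/* Hypothetical checked read: reports failure instead of touching
   memory outside the array. */
bool checked_read(const int32_t *array, uint64_t n, int64_t index, int32_t *out)
{
    assert(array != NULL && out != NULL);
    if (!index_in_bounds(index, n))
        return false;             /* the .error path */
    *out = array[index];
    return true;
}
```

Calling `checked_read` with an index of `-1` takes the failure path without a separate "less than zero" test, which is the point of the idiom.
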
- Multi-dimensional arrays use row-major layout: `matrix[r][c]` is at offset `(r * num_cols + c) * element_size`. The scale field can hold only 1, 2, 4, or 8, so `r * num_cols` generally cannot be folded into the addressing mode: compute the linear index first (for example with `IMUL` and `ADD`), then scale it by `element_size` or use an explicit byte offset.
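
The row-major offset formula above can be checked with a small C sketch. The function names `row_major_offset` and `matrix_get` are assumptions made for illustration; the two statements in `row_major_offset` correspond to the "compute the linear index, then scale" steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offset of matrix[r][c] in a row-major matrix. */
size_t row_major_offset(size_t r, size_t c, size_t num_cols, size_t element_size)
{
    assert(c < num_cols);
    size_t linear_index = r * num_cols + c;   /* multiply and add first */
    return linear_index * element_size;       /* then scale to bytes */
}

/* Reads element (r, c) of a flat, row-major int32_t buffer using
   explicit byte addressing, the way the assembly version would. */
int32_t matrix_get(const int32_t *base, size_t num_cols, size_t r, size_t c)
{
    const unsigned char *bytes = (const unsigned char *)base;
    int32_t value;
    memcpy(&value, bytes + row_major_offset(r, c, num_cols, sizeof value),
           sizeof value);
    return value;
}
```
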
- The Direction Flag (DF) controls whether REP string instructions advance or retreat. CLD clears DF (forward, the normal case). STD sets DF (backward, for overlapping copies). The System V ABI requires DF = 0 at function boundaries: if you set it, clear it before returning or calling any library function.
- The REP-prefixed string instructions implement the C string library:
  - `REP MOVSB/Q`: `memcpy`
  - `REP STOSB/Q`: `memset`
  - `REPNE SCASB`: `strlen`, `strchr`
  - `REPE CMPSB`: `memcmp`, `strcmp`
- After `REPNE SCASB` (the strlen idiom), RDI points one past the matched byte, not at it. With RCX initialized to -1, the `NOT RCX; DEC RCX` sequence converts the remaining count to the string length.
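
To see why `NOT RCX; DEC RCX` yields the length, the register arithmetic can be modeled in C. This is a sketch under the assumption that RCX starts at -1 and AL is 0; `scan_state`, `repne_scasb`, and `strlen_via_scasb` are hypothetical names that only imitate the instruction's effect on RDI and RCX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register state after the scan (model only). */
typedef struct {
    const char *rdi;
    uint64_t rcx;
} scan_state;

/* Models REPNE SCASB with DF = 0: compare AL with [RDI], advance RDI,
   decrement RCX, stop on a match or when RCX reaches zero. */
scan_state repne_scasb(const char *rdi, uint64_t rcx, char al)
{
    assert(rdi != NULL);
    while (rcx != 0) {
        char byte = *rdi;
        rdi++;
        rcx--;
        if (byte == al)
            break;
    }
    return (scan_state){rdi, rcx};
}

size_t strlen_via_scasb(const char *s)
{
    scan_state st = repne_scasb(s, UINT64_MAX, '\0'); /* mov rcx, -1 ; xor eax, eax */
    uint64_t rcx = ~st.rcx;  /* NOT RCX: bytes scanned, terminator included */
    rcx -= 1;                /* DEC RCX: drop the terminator from the count */
    return (size_t)rcx;
}
```

In this model the returned `rdi` is the address just past the terminator, matching the "one past the matched byte" rule.
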
- `REP MOVSQ` is generally faster than `REP MOVSB` for large aligned copies because it moves 8 bytes per iteration instead of 1. Use the widest variant that your data alignment allows. For the tail (the bytes remaining after the quadword portion), use `REP MOVSB`.
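
The "quadwords first, bytes for the tail" split can be sketched in C as follows. `copy_qwords_then_tail` is a hypothetical helper for non-overlapping buffers; the fixed-size `memcpy` calls stand in for the 8-byte load and store that each `MOVSQ` iteration performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Forward copy for non-overlapping buffers: 8 bytes at a time,
   then the 0 to 7 leftover bytes one at a time. */
void *copy_qwords_then_tail(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t qwords = n / 8;   /* count for the REP MOVSQ part */
    size_t tail = n % 8;     /* count for the REP MOVSB part */

    for (size_t i = 0; i < qwords; i++) {
        uint64_t q;
        memcpy(&q, s, sizeof q);   /* 8-byte load */
        memcpy(d, &q, sizeof q);   /* 8-byte store */
        s += 8;
        d += 8;
    }
    for (size_t i = 0; i < tail; i++)
        *d++ = *s++;
    return dst;
}
```
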
- REP instructions have a startup overhead of several cycles that makes them slower than explicit MOVs for very small operations (under roughly 8 to 16 bytes). GCC inlines small fixed-size `memcpy` calls as a series of explicit MOV instructions, avoiding the REP overhead entirely.
- `LODSB` loads `[RSI]` into AL and increments RSI; `STOSB` stores AL to `[RDI]` and increments RDI (both with DF = 0). These combine load/store with pointer advance in one instruction, which is useful for string processing loops.
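
A typical `LODSB`/`STOSB` processing loop can be modeled in C with post-increment pointers. The sketch below assumes DF = 0; `upcase_copy` and the ASCII upper-casing step are illustrative choices, not taken from the chapter.

```c
#include <assert.h>
#include <stddef.h>

/* Copies n bytes from src to dst, upper-casing ASCII letters.
   Variable names follow the registers the assembly loop would use. */
void upcase_copy(char *dst, const char *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    const char *rsi = src;
    char *rdi = dst;
    for (size_t rcx = n; rcx != 0; rcx--) {
        char al = *rsi++;                  /* LODSB: AL = [RSI], RSI += 1 */
        if (al >= 'a' && al <= 'z')
            al = (char)(al - ('a' - 'A')); /* per-byte processing step */
        *rdi++ = al;                       /* STOSB: [RDI] = AL, RDI += 1 */
    }
}
```
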
- Struct field access in assembly uses `[pointer + compile_time_offset]`. The offsets are determined by C's alignment rules and must be verified, not guessed. Padding bytes inserted between fields for alignment are real and must be accounted for in the offsets.
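
One way to verify offsets rather than guess them is to let the C compiler report them with `offsetof`. The sketch below uses a hypothetical `struct record`; the constants are what an assembly file would need to mirror, and the `static_assert` documents that padding sits between `tag` and `value`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: the 1-byte tag is followed by padding so that
   the 8-byte value lands on its required alignment. */
struct record {
    uint8_t  tag;
    uint64_t value;
    uint16_t flags;
};

/* Offsets taken from the compiler instead of counted by hand. */
enum {
    RECORD_TAG_OFFSET   = offsetof(struct record, tag),
    RECORD_VALUE_OFFSET = offsetof(struct record, value),
    RECORD_FLAGS_OFFSET = offsetof(struct record, flags),
    RECORD_SIZE         = sizeof(struct record)
};

static_assert(RECORD_VALUE_OFFSET > RECORD_TAG_OFFSET + 1,
              "padding expected between tag and value");

/* Field read written the way assembly sees it: [pointer + offset]. */
uint64_t record_value_by_offset(const struct record *r)
{
    const unsigned char *p = (const unsigned char *)r;
    uint64_t v;
    memcpy(&v, p + RECORD_VALUE_OFFSET, sizeof v);
    return v;
}
```

Printing these constants, or emitting them into a generated include file, keeps the assembly in step with the C definition when fields change.
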
- Linked list traversal in assembly: `mov rdi, [rdi + next_offset]` follows the next pointer. Always check for NULL before dereferencing. The null check is `test rdi, rdi; jz .done`.
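
The traversal pattern above corresponds to the following C loop. `struct node` and `list_sum` are hypothetical; the comments mark which C statement each assembly instruction implements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
    int64_t value;
    struct node *next;
};

/* The next_offset constant the assembly version would use. */
enum { NODE_NEXT_OFFSET = offsetof(struct node, next) };
static_assert(NODE_NEXT_OFFSET == 8, "next follows the 8-byte value");

int64_t list_sum(const struct node *head)
{
    int64_t sum = 0;
    const struct node *cur = head;
    while (cur != NULL) {       /* test rdi, rdi / jz .done */
        sum += cur->value;
        cur = cur->next;        /* mov rdi, [rdi + next_offset] */
    }
    return sum;
}
```
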
- When traversing or modifying a linked list with calls to `malloc`/`free`, all working pointers must be kept in callee-saved registers (RBX, RBP, R12-R15) or saved on the stack, because `malloc`/`free` may clobber every caller-saved register (RAX, RCX, RDX, RSI, RDI, R8-R11).
- AoS (Array of Structs) has poor SIMD performance because the values of one field are interleaved with the other fields; SoA (Struct of Arrays) has excellent SIMD performance because each field is contiguous. AoS is the standard C struct layout; SoA requires manual transformation. When performance matters and the data can be restructured, SoA enables processing multiple elements per SIMD instruction.
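
The difference between the two layouts can be made concrete in C. The particle types, `PARTICLE_COUNT`, and both functions below are hypothetical; the sketch only shows the layouts and a unit-stride loop, and leaves actual vectorization to the compiler or to hand-written SIMD.

```c
#include <assert.h>
#include <stddef.h>

#define PARTICLE_COUNT 8   /* hypothetical fixed size for the sketch */

/* AoS: x, y, z of one particle are adjacent, so consecutive x values
   are sizeof(struct particle_aos) bytes apart. */
struct particle_aos {
    float x, y, z;
};

/* SoA: all x values are contiguous, so one vector load fetches several. */
struct particles_soa {
    float x[PARTICLE_COUNT];
    float y[PARTICLE_COUNT];
    float z[PARTICLE_COUNT];
};

/* The manual transformation from the standard C layout. */
void aos_to_soa(const struct particle_aos *in, struct particles_soa *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < PARTICLE_COUNT; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

float sum_x_soa(const struct particles_soa *p)
{
    float sum = 0.0f;
    for (size_t i = 0; i < PARTICLE_COUNT; i++)
        sum += p->x[i];   /* unit-stride access over one field */
    return sum;
}
```
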
- `MOVSB`/`MOVSQ` with DF = 0 cannot correctly handle overlapping copies where `dst > src`: the forward copy overwrites source data that has not been read yet. Use `memmove`'s strategy: detect the overlap direction and copy backward when needed.
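
The overlap-direction strategy can be sketched in C as below. This is an illustration of the idea, not the actual implementation of any library's `memmove`; `copy_overlap_safe` is a hypothetical name, and the addresses are compared as integers to keep the comparison well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

void *copy_overlap_safe(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    unsigned char *d = dst;
    const unsigned char *s = src;
    uintptr_t da = (uintptr_t)d;
    uintptr_t sa = (uintptr_t)s;

    if (da > sa && da - sa < n) {
        /* dst starts inside the source region: copy from the last byte
           down, the direction STD followed by REP MOVSB would take. */
        for (size_t i = n; i != 0; i--)
            d[i - 1] = s[i - 1];
    } else {
        /* No harmful overlap: plain forward copy (CLD / REP MOVSB). */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dst;
}
```
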
- For large memory operations that exceed the caches, the bottleneck is DRAM bandwidth, not the instruction choice. A byte loop and an AVX2 256-bit loop both wait on the same memory bus. Non-temporal stores (`MOVNTQ`) bypass the cache for large write-only copies, reducing cache pollution and potentially improving throughput.