Chapter 12 Key Takeaways: Arrays, Strings, and Data Structures

  1. Arrays are contiguous memory accessed with scaled indexing: [base + index*element_size]. The four valid scale factors (1, 2, 4, 8) correspond directly to the byte sizes of the standard C integer types. For element sizes not matching these scales, compute a byte offset manually.

  2. Bounds checking is your responsibility in assembly — the hardware performs no array bounds checks. An out-of-bounds access silently reads or writes whatever is at the computed address. The idiomatic bounds check is cmp rcx, n; jae .error — the single unsigned comparison catches both rcx >= n (too large) and negative indices (which wrap around to huge unsigned values).

  3. Multi-dimensional arrays use row-major layout: matrix[r][c] is at offset (r * num_cols + c) * element_size. When num_cols is not 1, 2, 4, or 8, you cannot use the scaled index form directly — compute the linear index first, then use byte addressing.

  4. The Direction Flag (DF) controls whether REP string instructions advance or retreat. CLD clears DF (forward, the normal case). STD sets DF (backward, for overlapping copies). The System V ABI requires DF = 0 at function boundaries — if you set it, clear it before returning or calling any library function.

  5. The REP-prefixed string instructions implement the core of the C string library:
     - REP MOVSB/Q: memcpy
     - REP STOSB/Q: memset
     - REPNE SCASB: strlen, strchr
     - REPE CMPSB: memcmp, strcmp

  6. After REPNE SCASB (strlen idiom), RDI points one past the matched byte, not at it. The NOT RCX; DEC RCX sequence converts the remaining count to the string length.

  7. REP MOVSQ is faster than REP MOVSB for aligned large copies because it moves 8 bytes per hardware iteration instead of 1. Always use the widest variant that your data alignment allows. For the tail (remaining bytes after quadword-aligned portion), use REP MOVSB.

  8. REP instructions have a startup overhead of several cycles that makes them slower than explicit MOVs for very small operations (< 8-16 bytes). GCC inlines small memcpy calls as a series of explicit MOV instructions, avoiding the REP overhead entirely.

  9. LODSB loads [RSI] into AL and increments RSI; STOSB stores AL to [RDI] and increments RDI. These combine load/store with pointer advance in one instruction, useful for string processing loops.

  10. Struct field access in assembly uses [pointer + compile_time_offset]. The offsets are determined by C's alignment rules and must be verified, not guessed. Padding bytes between fields for alignment are real and must be accounted for in the offsets.

  11. Linked list traversal in assembly: mov rdi, [rdi + next_offset] follows the next pointer. Always check for NULL before dereferencing. The null check is test rdi, rdi; jz .done.

  12. When traversing or modifying a linked list with calls to malloc/free, keep all working pointers in callee-saved registers (RBX, R12-R15), because malloc/free may clobber any caller-saved register (RAX, RCX, RDX, RSI, RDI, R8-R11).

  13. AoS (Array of Structs) has poor SIMD performance; SoA (Struct of Arrays) has excellent SIMD performance. AoS is standard C struct layout; SoA requires manual transformation. When performance matters and the data can be restructured, SoA enables processing multiple elements per SIMD instruction.

  14. MOVSB/MOVSQ with DF clear cannot correctly handle overlapping copies where dst > src: the forward copy overwrites source data before it has been read. Use memmove's strategy: detect the overlap direction and copy backward when needed.

  15. For large memory operations (> 1KB), the bottleneck is DRAM bandwidth, not instruction choice. A byte loop and an AVX2 256-bit loop both wait on the same memory bus. Non-temporal stores (MOVNTI, MOVNTDQ) bypass the cache for write-only large copies, reducing cache pollution and potentially improving throughput.