Chapter 32 Key Takeaways: The Memory Hierarchy
-
The memory hierarchy bridges the speed gap between CPUs and DRAM. L1 cache is ~4 cycles, L2 ~12, L3 ~40, and DRAM ~200 cycles. Each step down the hierarchy costs roughly 3–5× in latency and buys roughly 10–100× in capacity. Keeping working data high in the hierarchy is the central objective of memory optimization.
-
Caches transfer data in 64-byte cache lines. When one byte is requested from an uncached address, 64 bytes are transferred. This makes spatial locality free: accessing adjacent elements costs no additional misses after the first. It also makes false sharing possible: two threads writing different variables on the same cache line cause coherence traffic as if they were writing the same variable.
-
Cache misses have three root causes. Compulsory (cold) misses are unavoidable on first access — prefetching hides the latency but cannot eliminate the miss. Capacity misses occur when the working set exceeds cache size — tiling/blocking is the primary fix. Conflict misses occur when addresses map to the same cache set — data layout adjustments (padding, alignment) help.
-
The MESI protocol maintains cache coherence across cores. A cache line in M (Modified) state belongs exclusively to one core. Writing a line in S (Shared) state requires sending invalidations to all other cores. False sharing occurs when two cores repeatedly invalidate each other's copies of a line containing unrelated data — each write causes a cache miss for the other core despite no actual data dependency.
-
Cache-friendly access patterns follow stride-1 (sequential) order. Stride-1 access exploits spatial locality: each cache line loaded is fully consumed before eviction. Column-major access to row-major matrices, linked list pointer chasing, and tree traversal are stride-hostile patterns that defeat the hardware prefetcher and can cost a cache miss for every element accessed.
-
Cache blocking (tiling) reduces capacity misses by reducing the active working set. Instead of sweeping through entire matrices, tiling divides the computation into tile-sized chunks that fit in L1 or L2. For matrix multiplication, tiling converts O(N³) cache misses to O(N³/T) misses, where T is the tile size. The 100× speedup demonstrated in Case Study 32-1 comes entirely from better cache utilization.
-
Structure of Arrays (SoA) beats Array of Structures (AoS) for SIMD hot loops. When a loop only uses a subset of an object's fields, AoS wastes cache space loading unused fields. SoA groups each field in its own contiguous array, achieving 100% cache utilization for the accessed field and enabling straightforward SIMD processing (8 floats from one field fill a YMM register perfectly).
-
Non-temporal stores bypass the cache for write-only, large-buffer operations. `MOVNTPS`, `VMOVNTPD`, and `MOVNTI` write through write-combining buffers directly to DRAM, avoiding the read-for-ownership that normal stores incur. They must be followed by `SFENCE` to ensure visibility. Non-temporal stores are most effective for buffers that exceed L3 cache, or for data that will not be read again soon.
-
Software prefetch instructions (`PREFETCHT0`, `PREFETCHNTA`) hide memory latency for irregular access patterns. The hardware prefetcher handles sequential and constant-stride patterns automatically. For linked list traversal, tree descent, or hash table probes where the next address is data-dependent, a `PREFETCHT0 [node->next]` issued while processing the current node overlaps the memory latency with computation.
-
False sharing collapses parallel scalability even when threads write different variables. Two threads each writing their own counter on a shared 64-byte cache line can be 20× slower than the single-threaded version (as demonstrated: 4.71 s at 8 threads vs. 0.23 s single-threaded). The fix is padding: `align 64` with explicit padding bytes ensures each thread's writable data occupies its own cache line.
-
DDR5 peak bandwidth (~80 GB/s) is the ceiling for bandwidth-bound code. At 4 bytes per float, an 80 GB/s DRAM can supply 20 billion floats per second. Any computation that requires more than that is fundamentally limited by memory bandwidth, not CPU throughput. SIMD and multi-threading both compete for the same bandwidth ceiling — adding more SIMD width or cores does not increase bandwidth beyond the DRAM limit.
-
`perf stat` cache events and `valgrind --tool=cachegrind` are the primary diagnostic tools. `L1-dcache-load-misses`, `LLC-loads`, and `LLC-load-misses` measured with `perf stat` reveal which cache level is the bottleneck. `cachegrind` provides per-source-line miss attribution to identify exactly which data structure or loop is responsible.