Chapter 32 Key Takeaways: The Memory Hierarchy

  • The memory hierarchy bridges the speed gap between CPUs and DRAM. L1 cache is ~4 cycles, L2 ~12, L3 ~40, DRAM ~200 cycles. Each step down the hierarchy costs roughly 3–5× in latency and gains roughly 10–100× in capacity. Keeping working data high in the hierarchy is the central objective of memory optimization.

  • Caches transfer data in 64-byte cache lines. When one byte is requested from an uncached address, 64 bytes are transferred. This makes spatial locality free: accessing adjacent elements costs no additional misses after the first. It also makes false sharing possible: two threads writing different variables on the same cache line cause coherence traffic as if they were writing the same variable.

  • Cache misses have three root causes. Compulsory (cold) misses are unavoidable on first access — prefetching hides the latency but cannot eliminate the miss. Capacity misses occur when the working set exceeds cache size — tiling/blocking is the primary fix. Conflict misses occur when addresses map to the same cache set — data layout adjustments (padding, alignment) help.

  • The MESI protocol maintains cache coherence across cores. A cache line in M (Modified) state belongs exclusively to one core. Writing a line in S (Shared) state requires sending invalidations to all other cores. False sharing occurs when two cores repeatedly invalidate each other's copies of a line containing unrelated data — each write causes a cache miss for the other core despite no actual data dependency.

  • Cache-friendly access patterns follow stride-1 (sequential) order. Stride-1 access exploits spatial locality: each cache line loaded is fully consumed before eviction. Column-major access to row-major matrices, linked list pointer chasing, and tree traversal are stride-hostile patterns that defeat the hardware prefetcher and cause a cache miss for every element accessed.

  • Cache blocking (tiling) reduces capacity misses by reducing the active working set. Instead of sweeping through entire matrices, tiling divides the computation into tile-sized chunks that fit in L1 or L2. For matrix multiplication, tiling converts O(N³) cache misses to O(N³/T) misses, where T is the tile size. The 100× speedup demonstrated in Case Study 32-1 comes entirely from better cache utilization.

  • Structure of Arrays (SoA) beats Array of Structures (AoS) for SIMD hot loops. When a loop only uses a subset of an object's fields, AoS wastes cache space loading unused fields. SoA groups each field in its own contiguous array, achieving 100% cache utilization for the accessed field and enabling straightforward SIMD processing (8 floats from one field fill a YMM register perfectly).

  • Non-temporal stores bypass the cache for write-only, large-buffer operations. MOVNTPS, VMOVNTPD, and MOVNTI write through write-combining buffers directly to DRAM, avoiding the read-for-ownership that normal stores incur. A non-temporal store sequence must be followed by SFENCE to ensure the writes are globally visible. They are most effective for buffers that exceed the L3 cache, or for data that will not be read again soon.

  • Software prefetch instructions (PREFETCHT0, PREFETCHNTA) hide memory latency for irregular access patterns. The hardware prefetcher handles sequential and constant-stride patterns automatically. For linked list traversal, tree descent, or hash table probes where the next address is data-dependent, PREFETCHT0 [node->next] issued while processing the current node overlaps the memory latency with computation.

  • False sharing collapses parallel scalability even when threads write different variables. Two threads each writing their own counter on a shared 64-byte cache line can be 20× slower than the single-threaded version (as demonstrated: 4.71 s at 8 threads vs. 0.23 s single-threaded). The fix is padding: align each thread's writable data to 64 bytes, or insert explicit padding bytes, so that it occupies its own cache line.

  • DDR5 peak bandwidth (~80 GB/s) is the ceiling for bandwidth-bound code. At 4 bytes per float, an 80 GB/s DRAM can supply 20 billion floats per second. Any computation that requires more than that is fundamentally limited by memory bandwidth, not CPU throughput. SIMD and multi-threading both compete for the same bandwidth ceiling — adding more SIMD width or cores does not increase bandwidth beyond the DRAM limit.

  • perf stat cache events and valgrind --tool=cachegrind are the primary diagnostic tools. L1-dcache-load-misses, LLC-loads, LLC-load-misses measured with perf stat reveal which cache level is the bottleneck. cachegrind provides per-source-line miss attribution to identify exactly which data structure or loop is responsible.