Chapter 32 Further Reading: The Memory Hierarchy

Primary References

"What Every Programmer Should Know About Memory" — Ulrich Drepper https://www.akkadia.org/drepper/cpumemory.pdf The definitive programmer-oriented treatment of the memory hierarchy. Covers cache organization, TLBs, NUMA, prefetching, and memory performance optimization in 114 pages with benchmark data. Written in 2007, but the principles remain accurate: the specific numbers have scaled while the architecture is unchanged. Section 3 (CPU Caches) and Section 6 (What Programmers Can Do, which covers prefetching) are required reading for any performance-critical programmer.
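
Drepper's central point about spatial locality fits in a few lines: two loops sum the same matrix, but only the row-major one consumes whole cache lines. A minimal sketch, assuming 4-byte ints and 64-byte lines:

```c
#include <stddef.h>

#define ROWS 512
#define COLS 512

/* Row-major traversal: consecutive iterations touch adjacent bytes, so
 * every 64-byte line fetched into cache is fully consumed. */
long sum_row_major(int m[ROWS][COLS]) {
    long s = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            s += m[r][c];
    return s;
}

/* Column-major traversal of the same array: successive accesses are
 * COLS * sizeof(int) bytes apart, so only 4 of each line's 64 bytes are
 * used before the next line must be fetched. Same result, far more misses. */
long sum_col_major(int m[ROWS][COLS]) {
    long s = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            s += m[r][c];
    return s;
}
```

Both functions return the same sum; timing them on a matrix larger than the last-level cache shows the gap Drepper measures in Section 3.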

Agner Fog — "Optimizing Software in C++" and "Optimizing Subroutines in Assembly Language" https://agner.org/optimize/ The memory-optimization material covers cache line alignment, false sharing, prefetching, and non-temporal stores with concrete assembly examples. Agner's instruction tables provide latency and throughput figures for load and store instructions on every microarchitecture, essential for understanding when a load hits L1 versus stalling the pipeline.
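
The cache-line-alignment advice can be sketched at the type level. A minimal C11 example, assuming a 64-byte line; the struct and field names are illustrative, not from Agner's manuals:

```c
#include <stdalign.h>
#include <stddef.h>

/* 64 bytes is assumed here; it is the line size on current x86-64 and most
 * ARM cores, but check the microarchitecture manual for your CPU. */
#define CACHE_LINE 64

/* Aligning a hot structure to a line boundary guarantees it never straddles
 * two cache lines, so a single line fill brings in all of its fields. */
typedef struct {
    alignas(CACHE_LINE) long head;            /* frequently accessed together */
    long tail;
    char pad[CACHE_LINE - 2 * sizeof(long)];  /* round the size up to one line */
} ring_indices;
```

The padding also means an array of these structs places each element on its own line, which matters once multiple threads own different elements.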

Intel 64 and IA-32 Architectures Optimization Reference Manual https://www.intel.com/content/www/us/en/developer/articles/technical/intel-64-ia-32-architectures-optimization-reference-manual.html The chapter on optimizing cache usage (Chapter 7 in many revisions; numbering shifts between editions) covers prefetching guidelines, non-temporal stores, write-combining, and hardware prefetcher behavior. The prefetch distance recommendations and the write-combining buffer counts are documented here with authoritative numbers.
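
Software prefetching can be sketched with the GCC/Clang __builtin_prefetch builtin. The distance below is a placeholder to be tuned against measured latency, not Intel's recommendation:

```c
#include <stddef.h>

/* PF_DIST is an assumed distance. The manual's guidance amounts to roughly
 * (memory latency) / (time per loop iteration); tune it with measured
 * numbers for your machine rather than trusting this constant. */
enum { PF_DIST = 64 };

long sum_with_prefetch(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        /* Request the line PF_DIST elements ahead: second argument 0 = read,
         * third argument 3 = high temporal locality (keep in all levels).
         * On a simple sequential pattern like this the hardware prefetcher
         * usually wins anyway -- always measure before keeping the hint. */
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0, 3);
        s += a[i];
    }
    return s;
}
```

Software prefetch pays off mainly on irregular access patterns the hardware prefetcher cannot predict, such as index-array gathers.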

Deep Dives

"Gallery of Processor Cache Effects" — Igor Ostrovsky http://igoro.com/archive/gallery-of-processor-cache-effects/ Seven short benchmarks demonstrating cache effects: cache line size, L1/L2/L3 size thresholds, cache associativity and conflict misses, false sharing, instruction cache effects, and hardware prefetching. Each benchmark is a small, reproducible program (the post's examples are in C#) with charts. Excellent for building intuition.
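
Ostrovsky's first experiment, ported to C as a sketch: updating every element and updating every 16th element take roughly the same time on a large array, because both loops touch every 64-byte line (16 four-byte ints per line) and line fetches dominate the arithmetic:

```c
#include <stddef.h>

/* Multiply every stride-th element of a[0..n) by 3. With stride 16 the loop
 * does 1/16th of the multiplications of stride 1, yet still touches every
 * 64-byte cache line of the array, so on a memory-resident array the two
 * strides run in roughly the same wall time. */
void scale_stride(int *a, size_t n, size_t stride) {
    for (size_t i = 0; i < n; i += stride)
        a[i] *= 3;
}
```

Time scale_stride(a, n, 1) against scale_stride(a, n, 16) on an array well beyond last-level-cache size to reproduce the effect.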

"Understanding False Sharing" — Intel Developer Zone https://www.intel.com/content/www/us/en/developer/articles/technical/avoiding-and-identifying-false-sharing-among-threads.html Intel's analysis of false sharing with a VTune Profiler-based detection workflow. Shows how to use hardware performance counters to identify false sharing in real applications and how to verify that padding fixes it.
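
The padding fix the article verifies can be sketched at the type level (64-byte line assumed; the struct names are illustrative):

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size */

/* Naive layout: both per-thread counters share one 64-byte line, so two
 * threads incrementing "their own" counter still bounce the line between
 * cores via the coherence protocol. */
struct counters_naive {
    long a;
    long b;
};

/* Padded layout: each counter owns a full line, so the threads no longer
 * invalidate each other's cached copy. */
struct counters_padded {
    long a;
    char pad_a[CACHE_LINE - sizeof(long)];
    long b;
    char pad_b[CACHE_LINE - sizeof(long)];
};
```

The cost is memory: the padded struct is 128 bytes instead of 16, which is why padding is reserved for genuinely hot, thread-private fields.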

"Cache-Oblivious Algorithms" — Frigo, Leiserson, Prokop, Ramachandran (1999) https://supertech.csail.mit.edu/papers/FrigoLeiPersonRamachandran1999.pdf The original paper introducing cache-oblivious algorithm design — writing algorithms that are optimal for any cache size without knowing the cache parameters. Worked examples include matrix transpose, FFT, sorting, and matrix multiplication. Conceptually important for understanding why tiling works and how to generalize it.
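
The core idea applied to matrix transpose can be sketched as follows (not the paper's exact code; N and the base-case cutoff of 8 are arbitrary choices for the sketch):

```c
#include <stddef.h>

#define N 64   /* matrix dimension and row stride, fixed for the sketch */

/* Cache-oblivious out-of-place transpose: recursively halve the larger
 * dimension until blocks are tiny. At some recursion depth every block fits
 * in cache -- whatever size that cache is -- so the algorithm tiles itself
 * without ever being told the cache parameters. */
void transpose_rec(const double *src, double *dst,
                   size_t r0, size_t r1, size_t c0, size_t c1) {
    size_t rows = r1 - r0, cols = c1 - c0;
    if (rows <= 8 && cols <= 8) {               /* base case: copy directly */
        for (size_t r = r0; r < r1; r++)
            for (size_t c = c0; c < c1; c++)
                dst[c * N + r] = src[r * N + c];
    } else if (rows >= cols) {                  /* split the taller dimension */
        size_t rm = r0 + rows / 2;
        transpose_rec(src, dst, r0, rm, c0, c1);
        transpose_rec(src, dst, rm, r1, c0, c1);
    } else {                                    /* split the wider dimension */
        size_t cm = c0 + cols / 2;
        transpose_rec(src, dst, r0, r1, c0, cm);
        transpose_rec(src, dst, r0, r1, cm, c1);
    }
}
```

Unlike explicit tiling, no tile size appears anywhere, which is exactly the property the paper proves makes the algorithm optimal across every level of the hierarchy at once.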

Tools

Valgrind Cachegrind https://valgrind.org/docs/manual/cg-manual.html Simulates the first-level instruction and data caches (I1, D1) and the last-level cache (LL), counting cache misses per instruction and per source line. Works with any compiled program (C, C++, assembly). The cg_annotate tool produces per-function and per-source-line miss attribution. Essential for identifying which specific data structure or loop is causing cache pressure.

Intel Memory Latency Checker (MLC) https://www.intel.com/content/www/us/en/download/736633/intel-memory-latency-checker-intel-mlc.html Measures actual memory latency (at various load levels) and bandwidth for your specific hardware. Critical for calibrating the latency numbers used in software prefetch distance calculations. Also measures NUMA memory latency and bandwidth matrices for multi-socket systems.
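
MLC's core latency measurement is a dependent-load pointer chase: each load's address comes from the previous load, so neither out-of-order execution nor the prefetcher can hide the latency. A miniature sketch, assuming 64-byte lines; the helper names are invented:

```c
#include <stdlib.h>
#include <stddef.h>

/* One cache-line-sized node per 64 bytes (assumed line size). */
typedef struct node {
    struct node *next;
    char pad[64 - sizeof(struct node *)];
} node;

/* Link the pool's n nodes into a single random cycle. Randomizing the order
 * defeats the hardware prefetcher, so the time per hop approximates the
 * average load latency at this memory footprint. */
node *build_chain(node *pool, size_t n, unsigned seed) {
    size_t *idx = malloc(n * sizeof *idx);
    if (!idx) return NULL;
    for (size_t i = 0; i < n; i++) idx[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)              /* close the cycle */
        pool[idx[i]].next = &pool[idx[(i + 1) % n]];
    node *head = &pool[idx[0]];
    free(idx);
    return head;
}

/* Time this loop and divide by hops to estimate ns per dependent load. */
node *chase(node *p, size_t hops) {
    for (size_t i = 0; i < hops; i++)
        p = p->next;
    return p;
}
```

Sweeping the pool size from kilobytes to gigabytes and plotting ns/hop reveals the L1, L2, L3, and DRAM plateaus, which is essentially MLC's idle-latency curve.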

STREAM Benchmark https://www.cs.virginia.edu/stream/ The industry-standard benchmark for sustainable memory bandwidth. Measures Copy, Scale, Add, and Triad operations. Run it to establish the actual DRAM bandwidth ceiling on your hardware before claiming a program is bandwidth-bound.
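
The Triad kernel itself is tiny; a sketch follows (the real benchmark adds timing, repetition over many trials, and arrays far larger than any cache):

```c
#include <stddef.h>

/* STREAM's Triad kernel: a[i] = b[i] + q * c[i]. STREAM counts 24 bytes of
 * traffic per iteration (two 8-byte reads, one 8-byte write; write-allocate
 * can add more in practice), so sustained bandwidth in bytes/s is
 * 24.0 * n / elapsed_seconds. */
void triad(double *a, const double *b, const double *c, double q, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}
```

To match STREAM's methodology, run the kernel on arrays several times larger than the last-level cache, repeat it roughly ten times, and report the best trial.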

Linux perf — cache events https://perf.wiki.kernel.org/index.php/Tutorial The cache-specific events (L1-dcache-loads, L1-dcache-load-misses, LLC-loads, LLC-load-misses) provide the fastest way to verify cache behavior. Running perf record -e LLC-load-misses ./program followed by perf report gives per-function attribution, and perf annotate attributes miss counts to individual instructions.