In This Chapter
- Why Memory Is the Bottleneck
- Cache Lines: The Unit of Transfer
- Cache Organization: Sets, Ways, and Tags
- The MESI Protocol: Cache Coherence
- Memory Access Patterns: Stride and Locality
- Matrix Multiplication: Three Implementations
- Prefetching
- Non-Temporal Stores: Bypassing the Cache
- Array of Structures vs. Structure of Arrays
- Measuring Cache Performance
- Memory Bandwidth: DDR5 and Its Limits
- Putting It Together: The Memory Hierarchy Mindset
- Summary
Chapter 32: The Memory Hierarchy
Why Memory Is the Bottleneck
Modern CPUs can execute 4–6 instructions per clock cycle. At 4 GHz, that is 16–24 billion instructions per second. Yet a single cache miss to DRAM stalls the CPU for 100+ nanoseconds — 400+ cycles, long enough to have executed over a thousand instructions. Memory speed has not kept pace with CPU speed, creating a widening gap that the memory hierarchy exists to bridge.
The hierarchy is a speed/capacity tradeoff:
Level Size Latency Bandwidth
─────────────────────────────────────────────────────────────
Registers ~2 KB 0 cycles N/A (direct)
L1 cache 32–64 KB 4 cycles ~1000 GB/s
L2 cache 256 KB–1MB 12 cycles ~400 GB/s
L3 cache 4–32 MB 40 cycles ~200 GB/s
DRAM 8–128 GB ~100 ns ~80 GB/s (DDR5)
NVMe SSD 1–8 TB ~100 µs ~7 GB/s
SATA SSD 256GB–4TB ~500 µs ~550 MB/s
HDD 1–20 TB ~10 ms ~150 MB/s
Network ∞ ~100 ms variable
─────────────────────────────────────────────────────────────
Each level is roughly 10× slower and 10–100× larger than the previous. The programmer's goal is to keep the working set — the data actively needed — as high in the hierarchy as possible.
📊 Latency in context: At 4 GHz, 1 nanosecond = 4 cycles. An L3 hit at 40 cycles = 10 ns; DRAM at 100 ns = 400 cycles. The CPU sits idle for 400 cycles waiting for one memory word. In that time, it could have completed ~1600 register operations.
Cache Lines: The Unit of Transfer
Caches do not transfer individual bytes. They transfer cache lines — aligned 64-byte blocks. When a single byte is accessed that is not in cache:
- The CPU issues a cache-line fill request
- A 64-byte aligned block containing the requested byte is fetched from the next level
- The entire 64 bytes is placed in cache
- The requested byte is delivered
This has two consequences:
Spatial locality is free: If you access array[0], then array[0] through array[15] (16 int32s, 64 bytes) are all brought into cache in one fill. Subsequent accesses to those elements hit the cache.
False sharing is expensive: Two threads writing different variables that share a cache line create contention. Each write invalidates the other core's copy, forcing cache traffic even though no data is actually shared. (See Chapter 30.)
Cache line layout for a 64-byte line:
┌──────────────────────────────────────────────────────────────┐
│ 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 │
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
│ i0 i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15│
└──────────────────────────────────────────────────────────────┘
accessing array[0] loads all 16 int32s
Cache Organization: Sets, Ways, and Tags
A direct-mapped cache would be simple but suffers from conflict misses. Modern caches are N-way set-associative:
8-way set-associative L1 cache (32 KB):
────────────────────────────────────────────────────────────
Total size: 32,768 bytes
Line size: 64 bytes
Total lines: 32,768 / 64 = 512 lines
Ways: 8
Sets: 512 / 8 = 64 sets
Address breakdown (64-bit):
[63:12] = tag (identifies which memory block)
[11:6] = set index (which of the 64 sets)
[5:0] = offset within cache line (0-63)
────────────────────────────────────────────────────────────
When accessing an address:
1. Extract the set index bits → select the set
2. Compare the address tag against all 8 tags in the set simultaneously
3. If any tag matches (and the valid bit is set): cache hit
4. If no tag matches: cache miss — one of the 8 ways is evicted (LRU policy) and the new line is loaded
Replacement policy: Most CPUs use pseudo-LRU (tree-based approximation of least-recently-used). Understanding this matters for conflict miss analysis.
Three Types of Cache Misses
| Miss Type | Cause | Remedy |
|---|---|---|
| Compulsory (cold) | First access to a line, ever | Prefetching |
| Capacity | Working set exceeds cache size | Reduce working set, blocking/tiling |
| Conflict | Two lines map to the same set, evict each other | Adjust data layout, padding |
The MESI Protocol: Cache Coherence
With multiple cores, each core has its own L1/L2 cache. A cache line can exist in multiple caches simultaneously. The MESI (Modified/Exclusive/Shared/Invalid) protocol maintains coherence:
MESI States:
────────────────────────────────────────────────────────────
M (Modified) Line is dirty. Only this core has it.
Must write back to memory before eviction.
E (Exclusive) Line is clean. Only this core has it.
No other core is caching this line.
S (Shared) Line is clean. Multiple cores may have it.
Cannot write without transitioning.
I (Invalid) Line is not present (or stale).
Any access = cache miss.
────────────────────────────────────────────────────────────
State transitions (simplified):
Core 0 reads line X (not cached anywhere):
→ X enters E state in Core 0's cache
Core 1 reads line X (Core 0 has it in E):
→ X transitions to S in both caches
Core 0 writes line X (in S state):
→ Core 0 sends "invalidate" to all other cores
→ Core 1's copy → I state
→ Core 0's copy → M state
→ Next read by Core 1: cache miss, must fetch from Core 0
(Core 0's cache "supplies" the line, not DRAM)
This is why false sharing is expensive: two cores repeatedly invalidating each other's copies of a cache line that contains different data. Every write generates an invalidation message on the coherence bus.
; FALSE SHARING EXAMPLE
; Thread 0 and Thread 1 writing adjacent variables
; Assume counter0 and counter1 are 4 bytes each, adjacent in memory
section .data
align 64
counter0: dd 0 ; Thread 0 owns this
counter1: dd 0 ; Thread 1 owns this
; Both counter0 and counter1 are on the SAME cache line!
; Every write by Thread 0 invalidates Thread 1's copy and vice versa.
; FIX: Pad to separate cache lines
section .data
align 64
counter0: dd 0
times 15 dd 0 ; 60 bytes of padding
counter1: dd 0 ; Now on a different cache line
Memory Access Patterns: Stride and Locality
Sequential (Stride-1) Access — Optimal
; Sum an array of int32s — stride-1 access
; rsi = array base, rcx = count
; Returns sum in eax
sum_array:
xor eax, eax ; sum = 0
xor edx, edx ; index = 0
.loop:
add eax, [rsi + rdx*4] ; stride-1: each access is 4 bytes forward
inc rdx
cmp rdx, rcx
jb .loop
ret
; Access pattern: [0][4][8][12][16]...[60] → cache line loaded once
; After first miss: all 16 int32s are in cache. Zero additional misses.
; Cache miss rate: 1/16 = 6.25% (one miss per 16 accesses)
Stride-N Access — Increasingly Worse
; Sum every 16th element — stride-64 (one cache line per element)
; This accesses one element per cache line loaded
sum_stride16:
xor eax, eax
xor rdx, rdx
.loop:
mov edx, [rsi + rcx*64 - 64] ; stride-64: each access misses L1
add eax, edx
dec rcx
jnz .loop
ret
; Cache miss rate: 100% (every access is a new cache line)
; L1 miss → L2 → L3 → DRAM for every element
; Performance: ~16× slower than stride-1 for large arrays
The Penalty Scales with Array Size
For small arrays (fits in L1/L2): stride doesn't matter much — all data is cached after the first pass.
For large arrays (exceeding L3, ~32 MB+): stride-1 saturates memory bandwidth while reading each byte exactly once; the 64-byte-stride version pulls 16× more data through the memory system to reach the same number of elements.
Matrix Multiplication: Three Implementations
Matrix multiplication is the canonical memory hierarchy case study because the naive implementation is memory-hostile, and optimizing it teaches every important lesson.
Naive ijk order:
// C[i][j] += A[i][k] * B[k][j]
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++)
C[i][j] += A[i][k] * B[k][j];
Memory access pattern (N=1024, row-major storage):
A[i][k]: stride-1 (sequential across row i) — GOOD
B[k][j]: stride-N (column access) — BAD: 1024×4 = 4096 byte stride
C[i][j]: constant during inner k loop — GOOD
B column access: every B[k][j] for fixed j is 4096 bytes apart.
N=1024: B is 4 MB. L3 cache is ~8 MB.
But accessing B column-by-column → each column pass touches N = 1024
distinct cache lines (64 KB fetched for 4 KB of useful data), cycling in and out.
Result: frequent L3 misses, ~DRAM bandwidth limited.
NASM naive implementation:
; matmul_naive: N×N float matrix multiply
; rdi = C, rsi = A, rdx = B, ecx = N
; All matrices stored row-major as float32
matmul_naive:
push rbx
push r12
push r13
push r14
push r15
mov r15d, ecx ; r15 = N
xor r12d, r12d ; i = 0
.loop_i:
xor r13d, r13d ; j = 0
.loop_j:
xorps xmm0, xmm0 ; accumulator = 0.0
xor r14d, r14d ; k = 0
.loop_k:
; A[i][k] → rsi + (i*N + k) * 4
mov eax, r12d
imul eax, r15d
add eax, r14d
movss xmm1, [rsi + rax*4]
; B[k][j] → rdx + (k*N + j) * 4
mov eax, r14d
imul eax, r15d
add eax, r13d
mulss xmm1, [rdx + rax*4] ; B[k][j] — potential cache miss
addss xmm0, xmm1
inc r14d
cmp r14d, r15d
jl .loop_k
; C[i][j] += accumulator
mov eax, r12d
imul eax, r15d
add eax, r13d
addss xmm0, [rdi + rax*4]
movss [rdi + rax*4], xmm0
inc r13d
cmp r13d, r15d
jl .loop_j
inc r12d
cmp r12d, r15d
jl .loop_i
pop r15
pop r14
pop r13
pop r12
pop rbx
ret
Transposed B (ikj order): Transpose B before multiplying so B is accessed stride-1:
; Transpose B into B_T first, then use B_T row-wise
; B_T[j][k] = B[k][j] → now B_T[k] access is stride-1 for fixed j loop
; ikj loop order: for i, for k, for j
; A[i][k]: constant in j loop — load once, keep in register
; B_T[k][j]: stride-1 — sequential — GOOD
; C[i][j]: stride-1 — GOOD
matmul_transposed:
; (B transposition code omitted for brevity)
; Inner loop accesses B_T[k][j] sequentially:
; rax = A[i][k] (scalar, broadcast to xmm1)
movss xmm1, [rsi + rax*4]
shufps xmm1, xmm1, 0 ; broadcast to all 4 lanes
; Process 4 j values at once using SIMD:
; xmm2 = B_T[k][j:j+3]
movups xmm2, [rdx + rbx*4] ; 4 consecutive B_T values
mulps xmm2, xmm1 ; A[i][k] * B_T[k][j:j+3]
addps xmm0, xmm2 ; accumulate into C[i][j:j+3]
Cache-blocked (tiled) implementation: Divide the matrices into T×T tiles. Process one tile of A, one of B, one of C at a time. If T is chosen so that three tiles (one each of A, B, and C) fit in L2 cache, every element is accessed from L2 rather than DRAM.
Tiling visualization (N=1024, tile T=64):
Without tiling: entire B column accessed each i iteration
Working set per i-row: N×N floats = 4 MB → L3 thrashing
With tiling: block A[i:i+T][k:k+T] and B[k:k+T][j:j+T]
Tile size: T×T×4 bytes = 64×64×4 = 16 KB per tile
Three tiles: 48 KB → fits in L2 cache (256 KB)
Each tile element reused T times before eviction
Reuse ratio: T = 64 reuses per element
; Tiled matrix multiply — outer loops over tiles
; TILE_SIZE = 64 (define as constant)
%define TILE 64
matmul_tiled:
; Outer three loops iterate over tiles
; inner three loops iterate within tiles
; For each tile (ii, jj, kk):
; For i = ii to min(ii+TILE, N):
; For j = jj to min(jj+TILE, N):
; For k = kk to min(kk+TILE, N):
; C[i][j] += A[i][k] * B[k][j]
;
; Key: the (A_tile, B_tile, C_tile) working set stays in L2
; throughout the inner triple loop
; [Full NASM implementation omitted for space — see exercises]
; The critical optimization is the loop ordering and TILE constant choice.
Prefetching
The CPU has a hardware prefetcher that detects stride-1 access patterns and automatically fetches the next cache lines before they are needed. For sequential array traversal, the hardware prefetcher works well.
For irregular access patterns (trees, hash tables, linked lists), the hardware prefetcher fails — it cannot predict which cache line to fetch next. Software prefetching issues explicit prefetch hints ahead of computation:
; Software prefetch instructions:
; PREFETCHT0 — prefetch to L1 (and L2, L3)
; PREFETCHT1 — prefetch to L2 (and L3)
; PREFETCHT2 — prefetch to L3 only
; PREFETCHNTA — non-temporal prefetch (into L1 with minimal pollution of L2/L3)
; Use for data accessed only once (avoids polluting L2/L3)
; Example: sum an array with explicit prefetch
; Prefetch 256 bytes (4 cache lines) ahead of the current position
sum_prefetch:
xor eax, eax
mov rcx, rsi ; rcx = current ptr
mov rdx, rsi
add rdx, r8 ; rdx = end ptr (r8 = byte count)
.loop:
prefetcht0 [rcx + 256] ; prefetch 4 lines ahead
; Process current 64-byte cache line (16 int32s)
mov r9d, [rcx]
add eax, r9d
mov r9d, [rcx + 4]
add eax, r9d
; ... (process all 16 elements in this cache line)
add rcx, 64 ; advance to next cache line
cmp rcx, rdx
jb .loop ; unsigned compare — addresses are unsigned
ret
; Prefetch distance: choose so the prefetch finishes before the data is needed.
; Rule of thumb: distance = latency_of_prefetch / cycles_per_cache_line_processed
; DRAM latency ~200 cycles, 64-byte line takes ~16 cycles to sum:
; Distance = 200 / 16 ≈ 12 cache lines ≈ 768 bytes ahead
; 256 bytes (4 lines) is conservative; 512-1024 bytes often better
When to use software prefetch:
- Linked list traversal: prefetch node->next->data while processing node->data
- Tree traversal: prefetch children while processing parent
- Hash table lookups: prefetch the bucket while computing the hash
- Scatter/gather operations: prefetch target addresses before writing
When NOT to use software prefetch:
- Sequential access: hardware prefetcher handles this
- Data that fits in L1/L2: already cached, no benefit
- High-bandwidth SIMD loops: bandwidth is the bottleneck, not latency
Non-Temporal Stores: Bypassing the Cache
When writing large buffers that will not be read again soon (video frames, large memset operations, output buffers), loading the cache line before writing it is wasteful. A store to a line not currently in cache normally follows three steps:
1. Load the cache line from DRAM (read-for-ownership)
2. Modify the line
3. Write back to DRAM on eviction
Non-temporal stores skip step 1 — they write directly to write-combining buffers and flush to DRAM, bypassing the cache entirely:
; Normal store to a line not in cache — triggers read-for-ownership:
mov [rdi], rax ; loads cache line first, then writes
; Non-temporal store — writes directly, no read:
movnti [rdi], rax ; 64-bit non-temporal store
movntq [rdi], mm0 ; 64-bit MMX non-temporal store
movntps [rdi], xmm0 ; 128-bit SSE non-temporal store
movntpd [rdi], xmm0 ; 128-bit double non-temporal store
vmovntps [rdi], ymm0 ; 256-bit AVX non-temporal store
vmovntpd [rdi], ymm0 ; 256-bit AVX non-temporal store
; IMPORTANT: Non-temporal stores are WEAKLY ORDERED.
; Must use SFENCE to ensure stores are visible to other processors:
sfence ; flush write-combining buffers
Write-combining buffers: The CPU maintains 4–12 write-combining (WC) buffers. Non-temporal stores to the same cache line accumulate in a WC buffer until it is full (64 bytes), then flush to DRAM in one burst. For maximum efficiency, fill each buffer completely before moving to the next:
; GOOD: Fill one 64-byte region completely (fills one WC buffer)
vmovntps [rdi], ymm0 ; bytes 0-31
vmovntps [rdi + 32], ymm1 ; bytes 32-63
; WC buffer is full → flushed as single 64-byte write
; BAD: Interleave writes to different regions
vmovntps [rdi], ymm0 ; WC buffer A: bytes 0-31
vmovntps [rdi + 4096], ymm1 ; WC buffer B: bytes 4096-4127
vmovntps [rdi + 32], ymm2 ; WC buffer A: bytes 32-63 (buffer A flushes)
vmovntps [rdi + 4128], ymm3 ; WC buffer B: bytes 4128-4159
; Interleaving may evict WC buffers before they are full, reducing efficiency
memset with non-temporal stores:
; Zero a large buffer with VMOVNTPS — faster than REP STOSD for large buffers
; because it avoids read-for-ownership and maximizes write bandwidth
; rdi = destination (must be 32-byte aligned for VMOVNTPS)
; rcx = byte count (must be multiple of 32)
fast_memzero:
vxorps ymm0, ymm0, ymm0 ; ymm0 = 0.0 (all zeros)
shr rcx, 5 ; rcx = count / 32 (number of YMM stores)
.loop:
vmovntps [rdi], ymm0 ; 32-byte non-temporal store
add rdi, 32
dec rcx
jnz .loop
sfence ; ensure stores are globally visible
vzeroupper ; clear YMM upper halves (AVX/SSE transition)
ret
; For maximum bandwidth, process 4 stores per iteration (256 bytes):
fast_memzero_unrolled:
vxorps ymm0, ymm0, ymm0
shr rcx, 8 ; rcx = count / 256
.loop:
vmovntps [rdi], ymm0
vmovntps [rdi + 32], ymm0
vmovntps [rdi + 64], ymm0
vmovntps [rdi + 96], ymm0
vmovntps [rdi + 128], ymm0
vmovntps [rdi + 160], ymm0
vmovntps [rdi + 192], ymm0
vmovntps [rdi + 224], ymm0
add rdi, 256
dec rcx
jnz .loop
sfence
vzeroupper
ret
Array of Structures vs. Structure of Arrays
Data layout is one of the highest-impact optimizations for cache performance.
Array of Structures (AoS): Each object's fields are adjacent:
; AoS layout: struct Particle { float x, y, z, vx, vy, vz, mass, charge; }
; sizeof(Particle) = 32 bytes
; Memory layout:
; [x0][y0][z0][vx0][vy0][vz0][m0][c0] [x1][y1][z1][vx1][vy1][vz1][m1][c1] ...
; ──────────── 32 bytes ──────────── ──────────── 32 bytes ────────────
; If you only need x, y, z positions (gravity simulation):
; Must load 32 bytes per particle but use only 12 bytes
; 12/32 = 37.5% cache utilization — 62.5% wasted cache space
section .data
align 64
particles_aos:
; x0, y0, z0, vx0, vy0, vz0, mass0, charge0
dd 1.0, 2.0, 3.0, 0.1, 0.2, 0.3, 1.0, 0.5
; x1, y1, z1 ...
dd 4.0, 5.0, 6.0, 0.4, 0.5, 0.6, 2.0, -0.5
; ... N particles
Structure of Arrays (SoA): Each field is a separate array:
; SoA layout: separate arrays for each field
; Memory layout:
; x: [x0][x1][x2]...[xN]
; y: [y0][y1][y2]...[yN]
; z: [z0][z1][z2]...[zN]
; vx: [vx0][vx1]...
; ...
; If you only need x, y, z positions:
; Load x array: 100% cache utilization for x values
; Load y array: 100% for y values
; Load z array: 100% for z values
; SIMD processes 8 floats at once from contiguous memory
section .bss
align 64
px: resd MAX_PARTICLES ; all X positions
py: resd MAX_PARTICLES ; all Y positions
pz: resd MAX_PARTICLES ; all Z positions
pvx: resd MAX_PARTICLES
pvy: resd MAX_PARTICLES
pvz: resd MAX_PARTICLES
pmass: resd MAX_PARTICLES
pcharge: resd MAX_PARTICLES
; Update all X positions using SIMD:
update_positions_x:
; rsi = px (x array), rdx = pvx (vx array)
; rcx = particle count / 8 (8 floats per YMM register)
.loop:
vmovaps ymm0, [rsi] ; load 8 x positions
vmovaps ymm1, [rdx] ; load 8 vx velocities
vaddps ymm0, ymm0, ymm1 ; x += vx (8 particles simultaneously)
vmovaps [rsi], ymm0 ; store 8 updated positions
add rsi, 32
add rdx, 32
dec rcx
jnz .loop
vzeroupper
ret
; This loop achieves ~95% cache utilization and uses SIMD on contiguous data
When to use AoS vs SoA:
| Pattern | Use |
|---|---|
| Frequently access all fields of one object | AoS |
| Frequently process one field across many objects | SoA |
| SIMD processing of single field | SoA required |
| Object-oriented code, few objects | AoS (simplicity) |
| Hot loop over millions of objects, few fields needed | SoA (cache) |
A common hybrid is AoSoA (Array of Structures of Arrays): group 8 particles together so each group's fields are contiguous, balancing SIMD width with access patterns.
Measuring Cache Performance
perf cache events
# Count L1 data cache misses during execution
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./matrix_multiply
# Sample output:
# 1,024,000,000 L1-dcache-loads
# 52,428,800 L1-dcache-load-misses # 5.12% miss rate
# L2 and L3 cache statistics:
perf stat -e \
L1-dcache-loads,L1-dcache-load-misses,\
l2_rqsts.references,l2_rqsts.miss,\
LLC-loads,LLC-load-misses \
./matrix_multiply
# Cache miss hierarchy:
# L1 miss → L2: 12 cycles extra
# L2 miss → L3: 28 cycles extra
# L3 miss → DRAM: ~200 cycles extra
# Specific events (Intel microarchitecture):
perf stat -e cache-misses,cache-references ./program
perf stat -e mem_load_retired.l3_miss ./program # LLC misses
# Memory bandwidth measurement:
perf stat -e \
offcore_response.all_reads.llc_miss.any_response \
./bandwidth_benchmark
Valgrind Cachegrind
# Run with cachegrind — simulates cache with configurable parameters
valgrind --tool=cachegrind \
  --I1=32768,8,64 \
  --D1=32768,8,64 \
  --LL=8388608,16,64 \
  ./matrix_multiply
# --I1: 32 KB 8-way L1 instruction cache, 64-byte lines
# --D1: 32 KB 8-way L1 data cache
# --LL: 8 MB 16-way last-level cache
# Produces cachegrind.out.<pid>
# Analyze with:
cg_annotate cachegrind.out.12345
# Output example:
# I refs: 2,048,000,000
# I1 misses: 32,768
# D refs: 1,024,000,000 (768M rd + 256M wr)
# D1 misses: 52,428,800 (51M rd + 1.4M wr)
# LL misses: 12,582,912 (12M rd + 512K wr)
# D1 miss rate: 5.1%
# LL miss rate: 1.2%
# Per-source-line annotation:
cg_annotate --auto=yes cachegrind.out.12345 matrix.c
RDTSC Timing for Cache Experiments
; Measure latency of array access at various strides
; to identify cache level thresholds
section .data
align 4096
test_array: times 16777216 db 0 ; 16 MB array (L3-sized)
section .text
; bench_stride: measure cycles per element at a given stride
; rdi = array base, rsi = stride (bytes), rdx = iterations
; Returns total elapsed cycles in rax — divide by iterations for cycles/element
bench_stride:
mov rcx, rdx ; iteration count
; Warm up or flush the array here, depending on the experiment
; RDTSC before — note that RDTSC clobbers rax and rdx
lfence
rdtsc
shl rdx, 32
or rdx, rax
mov r8, rdx ; r8 = start timestamp
xor rax, rax ; index = 0 (zeroed AFTER rdtsc, which clobbered rax)
.loop:
movzx r9, byte [rdi + rax] ; load; the array is zeroed, so r9 = 0
add rax, r9 ; adds 0 — but makes the next address depend on this load
add rax, rsi ; advance by stride
and rax, 16777215 ; wrap within the 16 MB array (mask = size - 1)
dec rcx
jnz .loop
; RDTSC after
lfence
rdtsc
shl rdx, 32
or rax, rdx
sub rax, r8 ; rax = total elapsed cycles
ret
; Running with strides 1, 64, 4096, 65536, and 1048576 reveals:
; stride 1: ~4 cycles (L1 hit)
; stride 64: ~4 cycles (new line each access, but the hardware prefetcher streams them)
; stride 4096: ~12 cycles (every access maps to the same L1 set → conflicts → L2)
; stride 65536: ~40 cycles (L3 hit)
; stride 1048576: ~200 cycles (DRAM latency)
Memory Bandwidth: DDR5 and Its Limits
DDR5 (2022+) provides roughly 38 GB/s per channel at DDR5-4800; a typical dual-channel desktop system tops out near 77 GB/s. This sounds like a lot. It is not.
Memory bandwidth budget (Intel Core i9-13900K, DDR5-4800):
Theoretical peak (2 channels × DDR5-4800): 76.8 GB/s
Practical sustained: ~65 GB/s
A loop that uses 1 byte per element, 1 billion elements, each on its own cache line:
Data volume: 1 billion × 64 bytes/cache-line = 64 GB (full lines load regardless)
At 65 GB/s: 64 / 65 ≈ 1 second minimum
A YMM (AVX2) SIMD loop, 8 floats per iteration:
Data rate: 8 × 4 = 32 bytes per iteration
At 65 GB/s: can process 65 GB / 32 bytes = ~2 billion iterations/second
= ~500 ms for 1 billion iterations
IPC is irrelevant — bandwidth is the bottleneck
Measuring Memory Bandwidth
# STREAM benchmark — industry standard for memory bandwidth
# Download from: https://www.cs.virginia.edu/stream/
# Compile with: gcc -O3 -march=native -fopenmp stream.c -o stream
./stream
# Sample output:
# STREAM version $Revision: 5.10 $
# Array size = 10,000,000 (elements), Offset = 0 (elements)
# Function Best Rate MB/s Avg time Min time Max time
# Copy: 61,234.5 0.00327 0.00327 0.00328
# Scale: 58,891.2 0.00340 0.00340 0.00341
# Add: 62,108.7 0.00483 0.00483 0.00484
# Triad: 61,987.4 0.00484 0.00484 0.00485
# Intel Memory Latency Checker (MLC):
# https://www.intel.com/content/www/us/en/download/736633/
mlc --bandwidth_matrix # bandwidth between NUMA nodes
mlc --latency_matrix # latency between NUMA nodes
# perf bandwidth estimate:
perf stat -e \
uncore_imc_0/cas_count_read/,\
uncore_imc_0/cas_count_write/ \
-a sleep 1
# cas_count * 64 bytes / 1 second = bandwidth in GB/s
Bandwidth-Bound vs. Latency-Bound
Two distinct memory performance regimes require different approaches:
Latency-bound: Sequential computation where each load depends on the previous (pointer chasing, tree traversal). Bandwidth does not help — must reduce the number of cache misses or use prefetching.
Bandwidth-bound: Parallel computation over large data (memcpy, matrix operations, image processing). Latency is hidden by multiple in-flight requests. Bandwidth ceiling is the limit; prefetching and SIMD help by keeping the bus fully utilized.
# Determine if code is latency-bound or bandwidth-bound:
perf stat -e \
cycle_activity.stalls_mem_any,\ # stalls waiting for ANY memory
cycle_activity.stalls_l3_miss \ # stalls waiting for L3 miss (DRAM)
./program
# If stalls_mem_any >> stalls_l3_miss:
# → mostly L1/L2 misses → latency-bound (smaller working set or prefetch)
# If stalls_l3_miss / cycles > 0.2:
# → DRAM-bound → bandwidth ceiling or reduce working set
Putting It Together: The Memory Hierarchy Mindset
Every performance-critical program should be analyzed at each cache level:
L1 analysis: Does the inner loop's working set fit in 32 KB? Are accesses stride-1? Are there dependency chains between loads and stores?
L2 analysis: Does the working set of the hot function fit in 256 KB? Can blocking/tiling reduce the effective working set?
L3 analysis: Does the entire dataset fit in the L3? If so, DRAM bandwidth is not a concern. If not, what is the reuse ratio — how many operations per cache line loaded?
DRAM analysis: What is the effective bandwidth utilization? Are non-temporal stores appropriate for write-only passes? Is NUMA a factor (multiple CPU sockets)?
Decision tree for memory optimization:
┌─────────────────┐
│ Profile cache │
│ miss rate │
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │
< 1% miss > 5% miss
│ │
Not memory-bound ┌───────────┴──────────┐
(optimize elsewhere) │ │
Compulsory misses Capacity/conflict
(first access) misses
│ │
Prefetch ahead Reduce working
Software prefetch set: blocking,
PREFETCHT0/NTA SoA layout,
eliminate unused
fields from hot
structs
⚡ Performance Note: The single highest-impact optimization for memory-bound code is usually data layout: converting AoS to SoA, removing unused fields from hot structures, ensuring stride-1 access patterns. These changes require no new algorithms — just reorganizing how data is stored and accessed. They commonly produce 2–10× speedups on memory-bound workloads.
Summary
The memory hierarchy is not a detail — it is the central factor determining performance for data-intensive code. The key numbers to internalize: L1 is 4 cycles, DRAM is 200 cycles, and cache lines are 64 bytes. Everything else follows from understanding what data is where, how much of each cache line is used, and whether the access pattern allows the hardware prefetcher to do its job. Profile first with perf stat cache events, then apply the appropriate technique — blocking for capacity misses, SoA layout for poor spatial locality, prefetching for irregular access, non-temporal stores for write-only passes.