Chapter 32: The Memory Hierarchy

Why Memory Is the Bottleneck

Modern CPUs can execute 4–6 instructions per clock cycle. At 4 GHz, that is 16–24 billion instructions per second. Yet a single cache miss to DRAM stalls the CPU for 100+ nanoseconds — roughly 400 cycles, long enough to have executed over a thousand instructions. Memory speed has not kept pace with CPU speed, creating a widening gap that the memory hierarchy exists to bridge.

The hierarchy is a speed/capacity tradeoff:

Level          Size        Latency      Bandwidth
─────────────────────────────────────────────────────────────
Registers      ~2 KB       0 cycles     N/A (direct)
L1 cache       32–64 KB    4 cycles     ~1000 GB/s
L2 cache       256 KB–1 MB 12 cycles    ~400 GB/s
L3 cache       4–32 MB     40 cycles    ~200 GB/s
DRAM           8–128 GB    ~100 ns      ~80 GB/s (DDR5)
NVMe SSD       1–8 TB      ~100 µs      ~7 GB/s
SATA SSD       256 GB–4 TB ~500 µs      ~550 MB/s
HDD            1–20 TB     ~10 ms       ~150 MB/s
Network        ∞           ~100 ms      variable
─────────────────────────────────────────────────────────────

Each level is roughly 10× slower and 10–100× larger than the previous. The programmer's goal is to keep the working set — the data actively needed — as high in the hierarchy as possible.

📊 Latency in context: At 4 GHz, 1 nanosecond = 4 cycles. An L3 hit at 40 cycles = 10 ns; DRAM at 100 ns = 400 cycles. The CPU sits idle for 400 cycles waiting for one memory word. In that time, it could have completed ~1600 register operations.


Cache Lines: The Unit of Transfer

Caches do not transfer individual bytes. They transfer cache lines — aligned 64-byte blocks. When a single byte is accessed that is not in cache:

  1. The CPU issues a cache-line fill request
  2. A 64-byte aligned block containing the requested byte is fetched from the next level
  3. The entire 64 bytes is placed in cache
  4. The requested byte is delivered

This has two consequences:

Spatial locality is free: If you access array[0] of a 64-byte-aligned int32 array, array[0] through array[15] — 16 ints, 64 bytes — are all brought into cache together. Subsequent accesses to those elements hit the cache.

False sharing is expensive: Two threads writing different variables that share a cache line create contention. Each write invalidates the other core's copy, forcing cache traffic even though no data is actually shared. (See Chapter 30.)

Cache line layout for a 64-byte line:
┌──────────────────────────────────────────────────────────────┐
│  0   4   8  12  16  20  24  28  32  36  40  44  48  52  56  60 │
│  │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │  │
│ i0  i1  i2  i3  i4  i5  i6  i7  i8  i9  i10 i11 i12 i13 i14 i15│
└──────────────────────────────────────────────────────────────┘
             accessing array[0] loads all 16 int32s

Cache Organization: Sets, Ways, and Tags

A direct-mapped cache would be simple but suffers from conflict misses. Modern caches are N-way set-associative:

8-way set-associative L1 cache (32 KB):
────────────────────────────────────────────────────────────
Total size:     32,768 bytes
Line size:      64 bytes
Total lines:    32,768 / 64 = 512 lines
Ways:           8
Sets:           512 / 8 = 64 sets

Address breakdown (64-bit):
  [63:12] = tag    (identifies which memory block)
  [11:6]  = set index  (which of the 64 sets)
  [5:0]   = offset within cache line (0-63)
────────────────────────────────────────────────────────────

When accessing an address:

  1. Extract the set index bits → select the set
  2. Compare the address tag against all 8 tags in the set simultaneously
  3. If any tag matches (and the valid bit is set): cache hit
  4. If no tag matches: cache miss — one of the 8 ways is evicted (LRU policy) and the new line is loaded
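
A minimal C sketch of this lookup arithmetic for the 32 KB, 8-way geometry above (the function and constant names are illustrative, and real hardware performs the tag compares in parallel):

#include <stdint.h>
#include <stdio.h>

#define LINE_BITS 6     /* 64-byte lines -> offset bits [5:0]  */
#define SET_BITS  6     /* 64 sets       -> index bits  [11:6] */

/* Split an address into offset-within-line, set index, and tag. */
static void decompose(uint64_t addr, uint64_t *offset, uint64_t *set, uint64_t *tag)
{
    *offset = addr & ((1u << LINE_BITS) - 1);
    *set    = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
    *tag    = addr >> (LINE_BITS + SET_BITS);
}

int main(void)
{
    uint64_t off, set, tag;
    decompose(0x7ffd12345678ULL, &off, &set, &tag);
    printf("offset=%llu  set=%llu  tag=0x%llx\n",
           (unsigned long long)off, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}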

Replacement policy: Most CPUs use pseudo-LRU (tree-based approximation of least-recently-used). Understanding this matters for conflict miss analysis.

Three Types of Cache Misses

Miss Type           Cause                                    Remedy
─────────────────────────────────────────────────────────────────────────────
Compulsory (cold)   First access to a line, ever             Prefetching
Capacity            Working set exceeds cache size           Reduce working set, blocking/tiling
Conflict            Two lines map to the same set,           Adjust data layout, padding
                    evicting each other
─────────────────────────────────────────────────────────────────────────────

The MESI Protocol: Cache Coherence

With multiple cores, each core has its own L1/L2 cache. A cache line can exist in multiple caches simultaneously. The MESI (Modified/Exclusive/Shared/Invalid) protocol maintains coherence:

MESI States:
────────────────────────────────────────────────────────────
M (Modified)   Line is dirty. Only this core has it.
               Must write back to memory before eviction.

E (Exclusive)  Line is clean. Only this core has it.
               No other core is caching this line.

S (Shared)     Line is clean. Multiple cores may have it.
               Cannot write without transitioning.

I (Invalid)    Line is not present (or stale).
               Any access = cache miss.
────────────────────────────────────────────────────────────

State transitions (simplified):

Core 0 reads line X (not cached anywhere):
  → X enters E state in Core 0's cache

Core 1 reads line X (Core 0 has it in E):
  → X transitions to S in both caches

Core 0 writes line X (in S state):
  → Core 0 sends "invalidate" to all other cores
  → Core 1's copy → I state
  → Core 0's copy → M state
  → Next read by Core 1: cache miss, must fetch from Core 0
    (Core 0's cache "supplies" the line, not DRAM)

This is why false sharing is expensive: two cores repeatedly invalidating each other's copies of a cache line that contains different data. Every write generates an invalidation message on the coherence bus.

; FALSE SHARING EXAMPLE
; Thread 0 and Thread 1 writing adjacent variables
; Assume counter0 and counter1 are 4 bytes each, adjacent in memory

section .data
align 64
counter0:   dd 0    ; Thread 0 owns this
counter1:   dd 0    ; Thread 1 owns this
; Both counter0 and counter1 are on the SAME cache line!
; Every write by Thread 0 invalidates Thread 1's copy and vice versa.

; FIX: Pad to separate cache lines
section .data
align 64
counter0:   dd 0
            times 15 dd 0   ; 60 bytes of padding
counter1:   dd 0            ; Now on a different cache line
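
The same fix expressed in C, as a sketch using C11 alignment (the struct and field names are illustrative):

#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* BAD: adjacent counters land on the same 64-byte cache line. */
struct counters_shared {
    atomic_long counter0;       /* written by thread 0 */
    atomic_long counter1;       /* written by thread 1: same line, false sharing */
};

/* FIX: force each counter onto its own 64-byte-aligned line. */
struct counters_padded {
    alignas(64) atomic_long counter0;
    alignas(64) atomic_long counter1;   /* starts on the next cache line */
};

_Static_assert(offsetof(struct counters_padded, counter1) % 64 == 0,
               "counter1 must begin on its own cache line");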

Memory Access Patterns: Stride and Locality

Sequential (Stride-1) Access — Optimal

; Sum an array of int32s — stride-1 access
; rsi = array base, rcx = count
; Returns sum in eax

sum_array:
    xor     eax, eax                 ; sum = 0
    test    rcx, rcx
    jz      .done                    ; empty array → return 0
    xor     edx, edx                 ; element index = 0
.loop:
    add     eax, [rsi + rdx*4]       ; stride-1: each access is 4 bytes after the last
    inc     rdx
    cmp     rdx, rcx
    jb      .loop
.done:
    ret

; Access pattern:  [0][4][8][12][16]...[60] → cache line loaded once
; After first miss: all 16 int32s are in cache. Zero additional misses.
; Cache miss rate: 1/16 = 6.25% (one miss per 16 accesses)

Stride-N Access — Increasingly Worse

; Sum every 16th int32 — a 64-byte stride, so each element sits on its own cache line
; Only 4 of the 64 bytes fetched per line are actually used

sum_stride16:
    xor     eax, eax                 ; sum = 0
    test    rcx, rcx
    jz      .done
    xor     edx, edx                 ; byte offset = 0
.loop:
    add     eax, [rsi + rdx]         ; one int32 per cache line → misses L1 every time
    add     rdx, 64                  ; advance a full cache line
    dec     rcx
    jnz     .loop
.done:
    ret

; Cache miss rate: 100% (every access is a new cache line)
; L1 miss → L2 → L3 → DRAM for every element
; Performance: ~16× slower than stride-1 for large arrays

The Penalty Scales with Array Size

For small arrays (those that fit in L1/L2): stride doesn't matter much — all data is cached after the first pass.

For large arrays (exceeding L3, ~32 MB+): stride-1 saturates memory bandwidth while reading each byte exactly once; a 64-byte stride pulls 16× more data over the memory bus to deliver the same number of elements.
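
A rough C harness to observe the penalty directly (the sizes and timing are assumptions: a 256 MB array, stride-1 versus a 16-element stride, compiled with -O2; exact ratios vary by machine):

#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64u * 1024 * 1024)          /* 64M int32s = 256 MB, far larger than L3 */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Sum every `stride`-th element of a[0..n). */
static long sum_stride(const int32_t *a, size_t n, size_t stride)
{
    long s = 0;
    for (size_t i = 0; i < n; i += stride)
        s += a[i];
    return s;
}

int main(void)
{
    int32_t *a = malloc((size_t)N * sizeof *a);
    for (size_t i = 0; i < N; i++) a[i] = 1;

    double t0 = seconds();
    long s1 = sum_stride(a, N, 1);      /* one miss per 16 elements */
    double t1 = seconds();
    long s2 = sum_stride(a, N, 16);     /* one miss per element (64-byte stride) */
    double t2 = seconds();

    printf("stride-1:  %.2f ns/element (sum %ld)\n", (t1 - t0) / N * 1e9, s1);
    printf("stride-16: %.2f ns/element (sum %ld)\n",
           (t2 - t1) / (N / 16.0) * 1e9, s2);
    free(a);
    return 0;
}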


Matrix Multiplication: Three Implementations

Matrix multiplication is the canonical memory hierarchy case study because the naive implementation is memory-hostile, and optimizing it teaches every important lesson.

Naive ijk order:

// C[i][j] += A[i][k] * B[k][j]
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];

Memory access pattern (N=1024, row-major storage):
  A[i][k]: stride-1 (sequential across row i) — GOOD
  B[k][j]: stride-N (column access) — BAD: 1024×4 = 4096 byte stride
  C[i][j]: constant during inner k loop — GOOD

  B column access: every B[k][j] for fixed j is 4096 bytes apart.
  Walking one column touches 1024 distinct cache lines (64 KB), so
  each fetched line is evicted from L1/L2 long before its neighboring
  elements are reused. With A, B, and C together (N=1024: 12 MB)
  exceeding a ~8 MB L3, the loop suffers frequent L3 misses and runs
  at roughly DRAM speed.

NASM naive implementation:

; matmul_naive: N×N float matrix multiply
; rdi = C, rsi = A, rdx = B, ecx = N
; All matrices stored row-major as float32

matmul_naive:
    push    rbx
    push    r12
    push    r13
    push    r14
    push    r15

    mov     r15d, ecx           ; r15 = N
    xor     r12d, r12d          ; i = 0
.loop_i:
    xor     r13d, r13d          ; j = 0
.loop_j:
    xorps   xmm0, xmm0          ; accumulator = 0.0
    xor     r14d, r14d          ; k = 0
.loop_k:
    ; A[i][k] → rsi + (i*N + k) * 4
    mov     eax, r12d
    imul    eax, r15d
    add     eax, r14d
    movss   xmm1, [rsi + rax*4]

    ; B[k][j] → rdx + (k*N + j) * 4
    mov     eax, r14d
    imul    eax, r15d
    add     eax, r13d
    mulss   xmm1, [rdx + rax*4] ; B[k][j] — potential cache miss
    addss   xmm0, xmm1

    inc     r14d
    cmp     r14d, r15d
    jl      .loop_k

    ; C[i][j] += accumulator
    mov     eax, r12d
    imul    eax, r15d
    add     eax, r13d
    addss   xmm0, [rdi + rax*4]
    movss   [rdi + rax*4], xmm0

    inc     r13d
    cmp     r13d, r15d
    jl      .loop_j

    inc     r12d
    cmp     r12d, r15d
    jl      .loop_i

    pop     r15
    pop     r14
    pop     r13
    pop     r12
    pop     rbx
    ret

Loop reordering (ikj order): Make j the innermost loop so that B and C are both accessed stride-1 while A[i][k] is held in a register. (Transposing B first, so that B_T[j][k] = B[k][j], is an alternative fix that keeps the ijk order, since both inner-loop operands are then read row-wise.)

; ikj loop order: for i, for k, for j
; A[i][k]: constant in the j loop — load once, broadcast to a register
; B[k][j]: stride-1 — sequential — GOOD
; C[i][j]: stride-1 — GOOD

matmul_ikj:
    ; (outer i and k loops omitted for brevity)
    ; Inner loop processes 4 consecutive j values at a time:

    ; Broadcast the scalar A[i][k] to all 4 SSE lanes (rax = i*N + k)
    movss   xmm1, [rsi + rax*4]
    shufps  xmm1, xmm1, 0       ; broadcast to all 4 lanes

    ; Process 4 j values at once using SIMD (rbx = k*N + j):
    ; xmm2 = B[k][j:j+3]
    movups  xmm2, [rdx + rbx*4] ; 4 consecutive elements of row k of B
    mulps   xmm2, xmm1           ; A[i][k] * B[k][j:j+3]
    addps   xmm0, xmm2           ; accumulate into C[i][j:j+3]
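
For reference, the same reordering in C, as a sketch (modern compilers typically vectorize this inner loop much as the fragment above does by hand):

/* ikj order: the inner loop walks B[k][j] and C[i][j] stride-1,
 * while a = A[i][k] stays in a register for the whole j loop. */
void matmul_ikj(float *C, const float *A, const float *B, int n)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            float a = A[i * n + k];            /* loop-invariant scalar */
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}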

Cache-blocked (tiled) implementation: Divide the matrices into T×T tiles and process one tile of A, one of B, and one of C at a time. If T is chosen so that three tiles (an A tile, a B tile, and a C tile) fit in the L2 cache, every element is accessed from L2 rather than DRAM. A C sketch follows the NASM skeleton below.

Tiling visualization (N=1024, T=64):

Without tiling: the entire B matrix is streamed through for every row of A
  Working set per i-row: N×N floats = 4 MB → L3 thrashing

With tiling: block A[i:i+T][k:k+T] and B[k:k+T][j:j+T]
  Tile size: T×T×4 bytes = 64×64×4 = 16 KB per tile
  Three tiles: 48 KB → fits in L2 cache (256 KB)
  Each tile element is reused T times before eviction
  Reuse ratio: T = 64 reuses per element

; Tiled matrix multiply — outer loops over tiles
; TILE_SIZE = 64 (define as constant)
%define TILE 64

matmul_tiled:
    ; Outer three loops iterate over tiles
    ; inner three loops iterate within tiles

    ; For each tile (ii, jj, kk):
    ;   For i = ii to min(ii+TILE, N):
    ;     For j = jj to min(jj+TILE, N):
    ;       For k = kk to min(kk+TILE, N):
    ;         C[i][j] += A[i][k] * B[k][j]
    ;
    ; Key: the (A_tile, B_tile, C_tile) working set stays in L2
    ;      throughout the inner triple loop

    ; [Full NASM implementation omitted for space — see exercises]
    ; The critical optimization is the loop ordering and TILE constant choice.
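
A C sketch of the same tiled loop nest, under the same TILE = 64 assumption (the inner loops use the ikj order from the previous subsection; the bounds handling covers N values that are not a multiple of the tile size):

#define TILE 64

/* Cache-blocked multiply: each (ii, kk, jj) step works on three
 * TILE x TILE sub-blocks of A, B, and C that together fit in L2. */
void matmul_tiled(float *C, const float *A, const float *B, int n)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE) {
                int i_max = ii + TILE < n ? ii + TILE : n;
                int k_max = kk + TILE < n ? kk + TILE : n;
                int j_max = jj + TILE < n ? jj + TILE : n;
                for (int i = ii; i < i_max; i++)
                    for (int k = kk; k < k_max; k++) {
                        float a = A[i * n + k];          /* loop-invariant scalar */
                        for (int j = jj; j < j_max; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}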

Prefetching

The CPU has a hardware prefetcher that detects stride-1 access patterns and automatically fetches the next cache lines before they are needed. For sequential array traversal, the hardware prefetcher works well.

For irregular access patterns (trees, hash tables, linked lists), the hardware prefetcher fails — it cannot predict which cache line to fetch next. Software prefetching issues explicit prefetch hints ahead of computation:

; Software prefetch instructions:
; PREFETCHT0  — prefetch to L1 (and L2, L3)
; PREFETCHT1  — prefetch to L2 (and L3)
; PREFETCHT2  — prefetch to L3 only
; PREFETCHNTA — non-temporal prefetch (bypasses L2/L3, goes to L1)
;               Use for data accessed only once (avoids polluting L2/L3)

; Example: sum an array with explicit prefetch
; Prefetch 256 bytes (4 cache lines) ahead of the current position

sum_prefetch:
    xor     eax, eax
    mov     rcx, rsi            ; rcx = current ptr
    mov     rdx, rsi
    add     rdx, r8             ; rdx = end ptr (r8 = byte count)

.loop:
    prefetcht0 [rcx + 256]      ; prefetch 4 lines ahead

    ; Process current 64-byte cache line (16 int32s)
    mov     r9d,  [rcx]
    add     eax,  r9d
    mov     r9d,  [rcx + 4]
    add     eax,  r9d
    ; ... (process all 16 elements in this cache line)

    add     rcx, 64             ; advance to next cache line
    cmp     rcx, rdx
    jb      .loop               ; unsigned compare — addresses are unsigned
    ret

; Prefetch distance: choose so the prefetch finishes before the data is needed.
; Rule of thumb: distance = latency_of_prefetch / cycles_per_cache_line_processed
; DRAM latency ~200 cycles, 64-byte line takes ~16 cycles to sum:
; Distance = 200 / 16 ≈ 12 cache lines ≈ 768 bytes ahead
; 256 bytes (4 lines) is conservative; 512-1024 bytes often better

When to use software prefetch (a sketch of the linked-list case follows this list):

  - Linked list traversal: prefetch node->next->data while processing node->data
  - Tree traversal: prefetch children while processing the parent
  - Hash table lookups: prefetch the bucket while computing the hash
  - Scatter/gather operations: prefetch target addresses before writing
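
A minimal C sketch of the linked-list case using GCC/Clang's __builtin_prefetch (the node layout and the summing loop are illustrative, not from the text above):

#include <stddef.h>

struct node {
    struct node *next;
    long payload[7];                /* 8 + 56 bytes: one 64-byte cache line per node */
};

/* Walk the list, prefetching the next node while the current one is summed. */
long sum_list(const struct node *n)
{
    long total = 0;
    while (n) {
        if (n->next)
            __builtin_prefetch(n->next, 0, 3);   /* read, keep in all cache levels */
        for (int i = 0; i < 7; i++)
            total += n->payload[i];
        n = n->next;
    }
    return total;
}

One node of lookahead hides only part of the miss latency; hiding more requires restructuring, for example keeping an array of upcoming node pointers so several nodes ahead can be prefetched.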

When NOT to use software prefetch:

  - Sequential access: the hardware prefetcher handles this
  - Data that fits in L1/L2: already cached, no benefit
  - High-bandwidth SIMD loops: bandwidth is the bottleneck, not latency


Non-Temporal Stores: Bypassing the Cache

When writing large buffers that will not be read again soon (video frames, large memset operations, output buffers), loading each cache line before overwriting it is wasteful. A store to a line that is not currently cached normally follows three steps:

  1. Load the cache line from DRAM (read-for-ownership)
  2. Modify the line
  3. Write it back to DRAM on eviction

Non-temporal stores skip step 1 — they write directly to write-combining buffers and flush to DRAM, bypassing the cache entirely:

; Normal store to a line not in cache — triggers a read-for-ownership:
mov     [rdi], rax          ; the line is loaded first, then modified

; Non-temporal store — writes directly, no read:
movnti  [rdi], rax          ; 64-bit non-temporal store
movntq  [rdi], mm0          ; 64-bit MMX non-temporal store
movntps [rdi], xmm0         ; 128-bit SSE non-temporal store
movntpd [rdi], xmm0         ; 128-bit double non-temporal store
vmovntps [rdi], ymm0        ; 256-bit AVX non-temporal store
vmovntpd [rdi], ymm0        ; 256-bit AVX non-temporal store

; IMPORTANT: Non-temporal stores are WEAKLY ORDERED.
; Must use SFENCE to ensure stores are visible to other processors:
sfence                       ; flush write-combining buffers

Write-combining buffers: The CPU maintains 4–12 write-combining (WC) buffers. Non-temporal stores to the same cache line accumulate in a WC buffer until it is full (64 bytes), then flush to DRAM in one burst. For maximum efficiency, fill each buffer completely before moving to the next:

; GOOD: Fill one 64-byte region completely (fills one WC buffer)
vmovntps [rdi],      ymm0   ; bytes 0-31
vmovntps [rdi + 32], ymm1   ; bytes 32-63
; WC buffer is full → flushed as single 64-byte write

; BAD: Interleave writes to different regions
vmovntps [rdi],        ymm0  ; WC buffer A: bytes 0-31
vmovntps [rdi + 4096], ymm1  ; WC buffer B: bytes 4096-4127
vmovntps [rdi + 32],   ymm2  ; WC buffer A: bytes 32-63 (buffer A flushes)
vmovntps [rdi + 4128], ymm3  ; WC buffer B: bytes 4128-4159
; Interleaving may evict WC buffers before they are full, reducing efficiency

memset with non-temporal stores:

; Zero a large buffer with VMOVNTPS — faster than REP STOSD for large buffers
; because it avoids read-for-ownership and maximizes write bandwidth

; rdi = destination (must be 32-byte aligned for VMOVNTPS)
; rcx = byte count (must be multiple of 32)

fast_memzero:
    vxorps  ymm0, ymm0, ymm0    ; ymm0 = 0.0 (all zeros)
    shr     rcx, 5               ; rcx = count / 32 (number of YMM stores)
.loop:
    vmovntps [rdi], ymm0         ; 32-byte non-temporal store
    add     rdi, 32
    dec     rcx
    jnz     .loop
    sfence                        ; ensure stores are globally visible
    vzeroupper                    ; clear YMM upper halves (AVX/SSE transition)
    ret

; For maximum bandwidth, process 4 stores per iteration (256 bytes):
fast_memzero_unrolled:
    vxorps  ymm0, ymm0, ymm0
    shr     rcx, 8               ; rcx = count / 256
.loop:
    vmovntps [rdi],       ymm0
    vmovntps [rdi + 32],  ymm0
    vmovntps [rdi + 64],  ymm0
    vmovntps [rdi + 96],  ymm0
    vmovntps [rdi + 128], ymm0
    vmovntps [rdi + 160], ymm0
    vmovntps [rdi + 192], ymm0
    vmovntps [rdi + 224], ymm0
    add     rdi, 256
    dec     rcx
    jnz     .loop
    sfence
    vzeroupper
    ret
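
The same technique via compiler intrinsics, as a sketch (assumes AVX, a 32-byte-aligned destination, and a float count that is a multiple of 8; compile with -mavx; the function name is illustrative):

#include <immintrin.h>
#include <stddef.h>

/* Zero a large buffer with streaming stores (VMOVNTPS under the hood). */
void stream_memzero(float *dst, size_t count)
{
    __m256 zero = _mm256_setzero_ps();
    for (size_t i = 0; i < count; i += 8)
        _mm256_stream_ps(dst + i, zero);   /* 32-byte non-temporal store */
    _mm_sfence();                          /* make the WC buffers globally visible */
}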

Array of Structures vs. Structure of Arrays

Data layout is one of the highest-impact optimizations for cache performance.

Array of Structures (AoS): Each object's fields are adjacent:

; AoS layout: struct Particle { float x, y, z, vx, vy, vz, mass, charge; }
; sizeof(Particle) = 32 bytes

; Memory layout:
; [x0][y0][z0][vx0][vy0][vz0][m0][c0] [x1][y1][z1][vx1][vy1][vz1][m1][c1] ...
;  ──────────── 32 bytes ────────────   ──────────── 32 bytes ────────────

; If you only need x, y, z positions (gravity simulation):
; Must load 32 bytes per particle but use only 12 bytes
; 12/32 = 37.5% cache utilization — 62.5% wasted cache space

section .data
align 64
particles_aos:
    ; x0, y0, z0, vx0, vy0, vz0, mass0, charge0
    dd 1.0, 2.0, 3.0, 0.1, 0.2, 0.3, 1.0, 0.5
    ; x1, y1, z1 ...
    dd 4.0, 5.0, 6.0, 0.4, 0.5, 0.6, 2.0, -0.5
    ; ... N particles

Structure of Arrays (SoA): Each field is a separate array:

; SoA layout: separate arrays for each field

; Memory layout:
; x:  [x0][x1][x2]...[xN]
; y:  [y0][y1][y2]...[yN]
; z:  [z0][z1][z2]...[zN]
; vx: [vx0][vx1]...
; ...

; If you only need x, y, z positions:
; Load x array: 100% cache utilization for x values
; Load y array: 100% for y values
; Load z array: 100% for z values
; SIMD processes 8 floats at once from contiguous memory

section .bss
align 64
px: resd MAX_PARTICLES      ; all X positions
py: resd MAX_PARTICLES      ; all Y positions
pz: resd MAX_PARTICLES      ; all Z positions
pvx: resd MAX_PARTICLES
pvy: resd MAX_PARTICLES
pvz: resd MAX_PARTICLES
pmass: resd MAX_PARTICLES
pcharge: resd MAX_PARTICLES

; Update all X positions using SIMD:
update_positions_x:
    ; rsi = px (x array), rdx = pvx (vx array)
    ; rcx = particle count / 8 (8 floats per YMM register)
.loop:
    vmovaps ymm0, [rsi]          ; load 8 x positions
    vmovaps ymm1, [rdx]          ; load 8 vx velocities
    vaddps  ymm0, ymm0, ymm1     ; x += vx (8 particles simultaneously)
    vmovaps [rsi], ymm0          ; store 8 updated positions
    add     rsi, 32
    add     rdx, 32
    dec     rcx
    jnz     .loop
    vzeroupper
    ret
; This loop achieves ~95% cache utilization and uses SIMD on contiguous data

When to use AoS vs SoA:

Pattern                                                  Use
──────────────────────────────────────────────────────────────────────────
Frequently access all fields of one object               AoS
Frequently process one field across many objects         SoA
SIMD processing of a single field                        SoA (required)
Object-oriented code, few objects                        AoS (simplicity)
Hot loop over millions of objects, few fields needed     SoA (cache)
──────────────────────────────────────────────────────────────────────────

A common hybrid is AoSoA (Array of Structures of Arrays): group 8 particles together so each group's fields are contiguous, balancing SIMD width with access patterns.
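
A sketch of the AoSoA idea in C, grouping 8 particles per block to match one 8-wide YMM register (the struct and field names are illustrative):

#include <stddef.h>

/* One block holds 8 particles; each field is a contiguous 32-byte mini-array,
 * so one 8-wide SIMD load covers that field for the whole block. */
struct particle_block {
    float x[8], y[8], z[8];
    float vx[8], vy[8], vz[8];
    float mass[8], charge[8];
};

/* Position update walks the blocks; the fixed-width inner loop vectorizes cleanly. */
void update_positions(struct particle_block *blocks, size_t nblocks)
{
    for (size_t b = 0; b < nblocks; b++)
        for (int i = 0; i < 8; i++) {
            blocks[b].x[i] += blocks[b].vx[i];
            blocks[b].y[i] += blocks[b].vy[i];
            blocks[b].z[i] += blocks[b].vz[i];
        }
}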


Measuring Cache Performance

perf cache events

# Count L1 data cache misses during execution
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./matrix_multiply

# Sample output:
#     1,024,000,000      L1-dcache-loads
#        52,428,800      L1-dcache-load-misses     # 5.12% miss rate

# L2 and L3 cache statistics (separate -e flags keep each event list intact):
perf stat -e L1-dcache-loads,L1-dcache-load-misses \
          -e l2_rqsts.references,l2_rqsts.miss \
          -e LLC-loads,LLC-load-misses \
          ./matrix_multiply

# Cache miss hierarchy:
# L1 miss → L2: 12 cycles extra
# L2 miss → L3: 28 cycles extra
# L3 miss → DRAM: ~200 cycles extra

# Specific events (Intel microarchitecture):
perf stat -e cache-misses,cache-references ./program
perf stat -e mem_load_retired.l3_miss ./program   # LLC misses

# Memory bandwidth measurement:
perf stat -e \
  offcore_response.all_reads.llc_miss.any_response \
  ./bandwidth_benchmark

Valgrind Cachegrind

# Run with cachegrind — simulates a cache hierarchy with configurable parameters:
#   --I1: 32 KB, 8-way L1 instruction cache, 64-byte lines
#   --D1: 32 KB, 8-way L1 data cache
#   --LL: 8 MB, 16-way last-level cache
valgrind --tool=cachegrind \
         --I1=32768,8,64 \
         --D1=32768,8,64 \
         --LL=8388608,16,64 \
         ./matrix_multiply

# Produces cachegrind.out.<pid>
# Analyze with:
cg_annotate cachegrind.out.12345

# Output example:
# I   refs:       2,048,000,000
# I1  misses:            32,768
# D   refs:       1,024,000,000  (768M rd + 256M wr)
# D1  misses:       52,428,800  (51M rd + 1.4M wr)
# LL  misses:       12,582,912  (12M rd + 512K wr)
# D1  miss rate:          5.1%
# LL  miss rate:          1.2%

# Per-source-line annotation:
cg_annotate --auto=yes cachegrind.out.12345 matrix.c

RDTSC Timing for Cache Experiments

; Measure latency of array access at various strides
; to identify cache level thresholds

section .data
align 4096
test_array: times 16777216 db 0   ; 16 MB array (L3-sized)

section .text

; bench_stride: measure cycles per element at a given stride
; rdi = array base, rsi = stride (bytes), rdx = iterations
; Returns total elapsed cycles in rax (caller divides by the iteration count)

bench_stride:
    mov     rcx, rdx            ; rcx = iteration count
    xor     r9, r9              ; r9 = current byte offset into the array

    ; Warm up or flush the array beforehand, depending on the experiment
    ; (cold start or evicted as desired)

    ; RDTSC before — RDTSC clobbers rax/rdx, so loop state lives in r9/rcx
    lfence
    rdtsc
    shl     rdx, 32
    or      rax, rdx
    mov     r8, rax             ; r8 = start timestamp

.loop:
    movzx   eax, byte [rdi + r9]    ; load one byte at the current offset
    add     r9, rax                 ; array is zeroed, so this adds 0 — but it makes the
                                    ;   next address depend on the load (serializes misses)
    add     r9, rsi                 ; advance by the stride
    and     r9, 16777215            ; wrap within the 16 MB test_array (adjust for other sizes)
    dec     rcx
    jnz     .loop

    ; RDTSC after
    lfence
    rdtsc
    shl     rdx, 32
    or      rax, rdx
    sub     rax, r8             ; rax = elapsed cycles (divide by iterations in the caller)
    ret

; Running with strides of 1, 64, 4096, 65536, and 1048576 bytes reveals the
; cache-level thresholds (approximate; exact values vary by CPU):
; stride 1:       ~4 cycles   (L1 hit)
; stride 64:      ~4 cycles   (new line each access, largely hidden by the prefetcher)
; stride 4096:    ~12 cycles  (L2 latency)
; stride 65536:   ~40 cycles  (L3 latency)
; stride 1048576: ~200 cycles (DRAM latency)

Memory Bandwidth: DDR5 and Its Limits

DDR5 (2022+) provides roughly 80 GB/s of memory bandwidth on a typical dual-channel desktop system. This sounds like a lot. It is not.

Memory bandwidth budget (Intel Core i9-13900K, DDR5-4800):
  Theoretical peak (2 channels × DDR5-4800): 76.8 GB/s
  Practical sustained:                        ~65 GB/s

A loop that touches 1 byte on each of 1 billion distinct cache lines:
  Data volume: 1 billion × 64 bytes/cache-line = 64 GB (full lines are loaded)
  At 65 GB/s: 64 / 65 ≈ 1 second minimum

A YMM (AVX2) SIMD loop, 8 floats per iteration:
  Data rate: 8 × 4 = 32 bytes per iteration
  At 65 GB/s: can process 65 GB / 32 bytes = ~2 billion iterations/second
  = ~500 ms for 1 billion iterations
  IPC is irrelevant — bandwidth is the bottleneck

Measuring Memory Bandwidth

# STREAM benchmark — industry standard for memory bandwidth
# Download from: https://www.cs.virginia.edu/stream/
# Compile with: gcc -O3 -march=native -fopenmp stream.c -o stream
./stream

# Sample output:
# STREAM version $Revision: 5.10 $
# Array size = 10,000,000 (elements), Offset = 0 (elements)
# Function    Best Rate MB/s  Avg time     Min time     Max time
# Copy:           61,234.5     0.00327      0.00327      0.00328
# Scale:          58,891.2     0.00340      0.00340      0.00341
# Add:            62,108.7     0.00483      0.00483      0.00484
# Triad:          61,987.4     0.00484      0.00484      0.00485

# Intel Memory Latency Checker (MLC):
# https://www.intel.com/content/www/us/en/download/736633/
mlc --bandwidth_matrix    # bandwidth between NUMA nodes
mlc --latency_matrix      # latency between NUMA nodes

# perf bandwidth estimate:
perf stat -e uncore_imc_0/cas_count_read/ \
          -e uncore_imc_0/cas_count_write/ \
          -a sleep 1
# cas_count * 64 bytes / 1 second = bandwidth in GB/s

Bandwidth-Bound vs. Latency-Bound

Two distinct memory performance regimes require different approaches:

Latency-bound: Sequential computation where each load depends on the previous (pointer chasing, tree traversal). Bandwidth does not help — must reduce the number of cache misses or use prefetching.

Bandwidth-bound: Parallel computation over large data (memcpy, matrix operations, image processing). Latency is hidden by multiple in-flight requests. Bandwidth ceiling is the limit; prefetching and SIMD help by keeping the bus fully utilized.

# Determine if code is latency-bound or bandwidth-bound:
# cycle_activity.stalls_mem_any:  cycles stalled waiting for ANY memory level
# cycle_activity.stalls_l3_miss:  cycles stalled waiting on an L3 miss (DRAM)
perf stat -e cycle_activity.stalls_mem_any \
          -e cycle_activity.stalls_l3_miss \
          ./program

# If stalls_mem_any >> stalls_l3_miss:
#   → mostly L1/L2 misses → latency-bound (smaller working set or prefetch)
# If stalls_l3_miss / cycles > 0.2:
#   → DRAM-bound → bandwidth ceiling or reduce working set

Putting It Together: The Memory Hierarchy Mindset

Every performance-critical program should be analyzed at each cache level:

L1 analysis: Does the inner loop's working set fit in 32 KB? Are accesses stride-1? Are there dependency chains between loads and stores?

L2 analysis: Does the working set of the hot function fit in 256 KB? Can blocking/tiling reduce the effective working set?

L3 analysis: Does the entire dataset fit in the L3? If so, DRAM bandwidth is not a concern. If not, what is the reuse ratio — how many operations per cache line loaded?

DRAM analysis: What is the effective bandwidth utilization? Are non-temporal stores appropriate for write-only passes? Is NUMA a factor (multiple CPU sockets)?

Decision tree for memory optimization:
                    ┌─────────────────┐
                    │ Profile cache   │
                    │ miss rate       │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
         < 1% miss                     > 5% miss
              │                             │
     Not memory-bound           ┌───────────┴──────────┐
     (optimize elsewhere)       │                       │
                          Compulsory misses        Capacity/conflict
                          (first access)                misses
                                │                       │
                          Prefetch ahead           Reduce working
                          Software prefetch        set: blocking,
                          PREFETCHT0/NTA           SoA layout,
                                                   eliminate unused
                                                   fields from hot
                                                   structs

⚡ Performance Note: The single highest-impact optimization for memory-bound code is usually data layout: converting AoS to SoA, removing unused fields from hot structures, ensuring stride-1 access patterns. These changes require no new algorithms — just reorganizing how data is stored and accessed. They commonly produce 2–10× speedups on memory-bound workloads.


Summary

The memory hierarchy is not a detail — it is the central factor determining performance for data-intensive code. The key numbers to internalize: L1 is ~4 cycles, DRAM is a few hundred cycles, and cache lines are 64 bytes. Everything else follows from understanding what data is where, how much of each cache line is used, and whether the access pattern allows the hardware prefetcher to do its job. Profile first with perf stat cache events, then apply the appropriate technique — blocking for capacity misses, SoA layout for poor spatial locality, prefetching for irregular access, non-temporal stores for write-only passes.