Case Study 32-1: Matrix Multiplication — Three Implementations, Three Cache Regimes

The Same Algorithm, Three Very Different Realities

Matrix multiplication is the workload that built the GPU industry. It is also the textbook example of how the same O(N³) algorithm can have wildly different performance depending entirely on memory access patterns. This case study benchmarks three implementations on a 1024×1024 float32 matrix multiply and explains every performance difference through the lens of cache behavior.

All three compute the same result: C = A × B where A, B, C are 1024×1024 matrices of float32.

Hardware: Intel Core i7-12700K (Alder Lake), L1D=48KB, L2=1.25MB, L3=25MB, DDR5-4800.

Matrix size: 1024×1024 × 4 bytes = 4 MB per matrix. Three matrices = 12 MB total. This exceeds L2 (1.25 MB) and fits in L3 (25 MB).


Implementation 1: Naive ijk

; matmul_naive_ijk:
; C[i][j] += sum over k of A[i][k] * B[k][j]
; rdi = C, rsi = A, rdx = B, ecx = N (=1024)
;
; Access pattern:
;   A[i][k]: row-major → stride-1 ✓
;   B[k][j]: column access → stride-N → 4096 bytes per step ✗

matmul_naive_ijk:
    push    rbp
    push    rbx
    push    r12
    push    r13
    push    r14
    push    r15

    mov     r15d, ecx           ; r15 = N = 1024
    xor     r12d, r12d          ; i = 0

.outer_i:
    xor     r13d, r13d          ; j = 0

.middle_j:
    xorps   xmm0, xmm0          ; accumulator = 0.0f
    xor     r14d, r14d          ; k = 0

.inner_k:
    ; A[i][k] = *(A + (i*N + k) * 4)
    mov     eax, r12d           ; eax = i
    imul    eax, r15d           ; eax = i*N
    add     eax, r14d           ; eax = i*N + k
    movss   xmm1, [rsi + rax*4] ; load A[i][k]

    ; B[k][j] = *(B + (k*N + j) * 4)
    mov     eax, r14d           ; eax = k
    imul    eax, r15d           ; eax = k*N
    add     eax, r13d           ; eax = k*N + j
    mulss   xmm1, [rdx + rax*4] ; load B[k][j], multiply

    addss   xmm0, xmm1          ; accumulate

    inc     r14d
    cmp     r14d, r15d
    jl      .inner_k

    ; Store C[i][j]
    mov     eax, r12d
    imul    eax, r15d
    add     eax, r13d
    addss   xmm0, [rdi + rax*4]
    movss   [rdi + rax*4], xmm0

    inc     r13d
    cmp     r13d, r15d
    jl      .middle_j

    inc     r12d
    cmp     r12d, r15d
    jl      .outer_i

    pop     r15 ; ... (restore registers)
    ret

What the cache sees: For each (i, j) pair, the inner k loop scans B column j — accessing B[0][j], B[1][j], ..., B[1023][j]. Each access is 4096 bytes (one row) apart. With 1024 rows, the B column access touches 1024 separate cache lines from 1024 different 4096-byte offsets. At 48 KB L1D with 4096-byte spacing, only ~12 of these can coexist in L1. The other ~1012 must be fetched from L2/L3 on every (i, j) iteration.

Benchmark results (N=1024):

Wall time:        58.3 seconds
GFLOPs/sec:       0.037
L1 miss rate:     78.3%
L2 miss rate:     61.2%
LLC miss rate:    8.7%
Memory traffic:   247 GB (vs. theoretical 12 MB of actual data)

perf stat output:
    L1-dcache-loads:       2,684,354,560
    L1-dcache-load-misses: 2,101,941,171  (78.3%)
    LLC-loads:               232,783,914
    LLC-load-misses:          20,252,200

The program reads 247 GB of memory to perform a computation on 12 MB of data — a 20× redundancy caused by repeatedly re-reading the same B column data from cache after it has been evicted.


Implementation 2: ikj Order with Transposed B

The insight: if we transpose B before multiplying, then accessing B_T[j][k] (where B_T[j][k] = B[k][j]) is stride-1 for the k loop.

; Step 1: Transpose B into B_T
; B_T[j][k] = B[k][j]
; Now the inner k loop accesses B_T[j][k] sequentially

transpose_b:
    ; rdi = B_T (output), rsi = B (input), rdx = N
    ; B_T[j*N + k] = B[k*N + j]
    xor     r8d, r8d            ; j = 0
.tj:
    xor     r9d, r9d            ; k = 0
.tk:
    ; src = B[k][j] = B[k*N + j]
    mov     eax, r9d
    imul    eax, edx
    add     eax, r8d
    movss   xmm0, [rsi + rax*4]

    ; dst = B_T[j][k] = B_T[j*N + k]
    mov     eax, r8d
    imul    eax, edx
    add     eax, r9d
    movss   [rdi + rax*4], xmm0

    inc     r9d
    cmp     r9d, edx
    jl      .tk
    inc     r8d
    cmp     r8d, edx
    jl      .tj
    ret

; Step 2: Multiply using B_T with stride-1 access
; C[i][j] += sum_k A[i][k] * B_T[j][k]
; ikj loop: for i, for k (A row), for j (dot product with B_T rows)
;
; Inner j loop: A[i][k] (scalar), B_T[j][k:k+7] (8 consecutive floats), C[i][j:j+7]
; — fully stride-1 on all three arrays

matmul_transposed:
    ; [transposition code above runs first, then:]
    ; Inner k loop (fixed A[i][k], iterate j with SIMD):
.inner_j:
    ; xmm1 = broadcast A[i][k] to all lanes
    movss   xmm1, [rsi + rax*4]
    shufps  xmm1, xmm1, 0b00000000

    ; Load 4 B_T[j:j+3][k] values (stride-1 in j for fixed k)
    movups  xmm2, [rdx + rbx*4]    ; B_T[j][k], B_T[j+1][k]...
    mulps   xmm2, xmm1
    addps   xmm0, xmm2              ; accumulate 4 C[i][j] values

    add     rbx, 4
    cmp     rbx, r15d               ; j < N?
    jl      .inner_j

The transposition itself costs O(N²) work. For N=1024, the 4 MB transposition takes a negligible ~1 ms compared to the multiply time.

Benchmark results (N=1024):

Wall time:        4.1 seconds
GFLOPs/sec:       0.524
Speedup vs naive: 14.2×
L1 miss rate:     6.2%
L2 miss rate:     3.1%
LLC miss rate:    0.8%
Memory traffic:   15 GB (vs. 247 GB naive)

perf stat output:
    L1-dcache-loads:       2,684,354,560
    L1-dcache-load-misses:   166,429,982  (6.2%)
    LLC-loads:                52,428,800
    LLC-load-misses:             419,430

The 14.2× speedup is entirely from eliminating cache misses. The number of floating-point operations is identical. The transposition converts 78% L1 miss rate to 6% by making B access stride-1.


Implementation 3: Cache-Blocked (Tiled)

Tiling goes further: it ensures that the active working set for any tile computation fits in L2 cache, minimizing even L2 misses.

Tile size calculation: - L2 = 1.25 MB = 1,310,720 bytes - Three tiles needed simultaneously (A tile, B tile, C tile accumulator) - Tile size T×T float32 = T² × 4 bytes per tile - 3 × T² × 4 ≤ 1,310,720 → T² ≤ 109,226 → T ≤ 330 - Choose T = 256 (power of 2, conservative): 3 × 256² × 4 = 786,432 bytes = 768 KB (fits in L2)

%define T 256                   ; tile size

matmul_tiled:
    ; Six nested loops: ii, jj, kk (tile), i, j, k (within tile)
    ; Pseudocode:
    ;   for ii = 0 to N step T:
    ;     for jj = 0 to N step T:
    ;       for kk = 0 to N step T:
    ;         for i = ii to min(ii+T, N):
    ;           for k = kk to min(kk+T, N):
    ;             a = A[i][k]               ; loaded once per (i,k) pair
    ;             for j = jj to min(jj+T, N):
    ;               C[i][j] += a * B[k][j]  ; B[k][j] stride-1, C[i][j] stride-1
    ;
    ; Key: A tile T×T, B tile T×T, C tile T×T all fit in L2.
    ; The inner (i,k,j) loops access each tile element T times before moving on.
    ; Reuse ratio: T = 256 accesses per element loaded.

    ; Outer three tile loops:
    xor     r8d, r8d            ; ii = 0
.tile_i:
    xor     r9d, r9d            ; jj = 0
.tile_j:
    xor     r10d, r10d          ; kk = 0
.tile_k:
    ; Inner three loops: compute C[ii:ii+T][jj:jj+T] += A[ii:ii+T][kk:kk+T] * B[kk:kk+T][jj:jj+T]
    mov     r11d, r8d           ; i = ii
.inner_i:
    mov     r12d, r10d          ; k = kk
.inner_k:
    ; Load A[i][k] into scalar xmm0 — reused for all j in this tile
    mov     eax, r11d
    imul    eax, r15d
    add     eax, r12d
    movss   xmm0, [rsi + rax*4]
    shufps  xmm0, xmm0, 0       ; broadcast to 4 lanes

    mov     r13d, r9d           ; j = jj
.inner_j:
    ; B[k][j] = B[k*N + j] — stride-1 for this inner j loop
    mov     eax, r12d
    imul    eax, r15d
    add     eax, r13d
    movups  xmm1, [rdx + rax*4] ; load 4 B values (stride-1)
    mulps   xmm1, xmm0           ; * A[i][k]

    ; C[i][j] += above
    mov     eax, r11d
    imul    eax, r15d
    add     eax, r13d
    movups  xmm2, [rdi + rax*4]
    addps   xmm2, xmm1
    movups  [rdi + rax*4], xmm2

    add     r13d, 4             ; j += 4 (4 floats per MOVUPS)
    ; Check j < min(jj+T, N)
    mov     eax, r9d
    add     eax, T
    cmp     eax, r15d
    cmovg   eax, r15d           ; eax = min(jj+T, N)
    cmp     r13d, eax
    jl      .inner_j

    inc     r12d
    ; [similar bounds check for kk+T, N] ...
    jl      .inner_k

    inc     r11d
    jl      .inner_i

    add     r10d, T
    cmp     r10d, r15d
    jl      .tile_k
    add     r9d, T
    cmp     r9d, r15d
    jl      .tile_j
    add     r8d, T
    cmp     r8d, r15d
    jl      .tile_i
    ret

Benchmark results (N=1024, T=256):

Wall time:        0.58 seconds
GFLOPs/sec:       3.70
Speedup vs naive: 100.5×
Speedup vs transposed: 7.1×
L1 miss rate:     2.1%
L2 miss rate:     0.7%
LLC miss rate:    0.1%
Memory traffic:   0.38 GB (vs. 247 GB naive)

perf stat output:
    L1-dcache-loads:       2,684,354,560
    L1-dcache-load-misses:    56,371,446  (2.1%)
    LLC-loads:                18,874,368
    LLC-load-misses:             188,743

Memory traffic reduced from 247 GB to 0.38 GB — a 650× reduction. The computation itself is identical (same FLOPs), but the tiled version reuses every loaded byte ~256 times within L2 cache before moving on.


Summary Table

Implementation Time (s) GFLOPs/s L1 Miss% LLC Miss% Mem Traffic
Naive ijk 58.3 0.037 78.3% 8.7% 247 GB
Transposed B 4.1 0.524 6.2% 0.8% 15 GB
Tiled (T=256) 0.58 3.70 2.1% 0.1% 0.38 GB
cuBLAS (GPU ref) ~0.01 ~200

The Lesson

These three implementations compute exactly the same result. They differ only in the order in which they access memory. That order changes performance by 100×. The GPU reference (shown for perspective) achieves 54× more GFLOPs/s than the tiled CPU implementation — not because GPUs are faster per-clock, but because GPUs have 10× more memory bandwidth and 1000× more thread-level parallelism to hide latency.

The takeaway for assembly programmers: the machine's raw throughput is irrelevant if the data is not where the computation expects it. Cache-conscious design is not an optimization — it is the baseline competency for writing fast code.