Case Study 33-1: Optimizing a Reduction Loop — From 5 GFLOPs to 500 GFLOPs
The Problem: A Bottleneck Hunt
A numerical simulation spends 40% of its runtime computing weighted sums of float arrays. The operation is:
float dot(float *a, float *b, int n) {
float sum = 0.0f;
for (int i = 0; i < n; i++)
sum += a[i] * b[i];
return sum;
}
This is the dot product — one multiply and one add per element. Theoretically, an Intel Core i9-13900K running AVX-512 can execute 64 FLOPs per cycle per core (32 floats × 2 FLOPs × 2 FMA units). At 5.4 GHz, that is 345 GFLOPs/s per core. The naive implementation achieves 5 GFLOPs/s — less than 1.5% of theoretical peak.
This case study is a complete optimization journey from baseline to near-peak performance, using every tool discussed in the chapter.
Hardware: Intel Core i9-13900K (Raptor Lake), 5.4 GHz boost, AVX-512 enabled. Array size: N = 100 million float32 (400 MB total for both arrays). Test: 100 iterations, minimum time reported.
Step 1: Profile the Baseline
# Compile naive version
gcc -O0 -o dot_naive dot.c
# Profile
perf stat -e cycles,instructions,L1-dcache-load-misses,LLC-load-misses,\
branch-misses,uops_issued.any ./dot_naive
# Output:
# 4,872,345,678 cycles
# 943,826,543 instructions IPC = 0.19 ← terrible
# 12,582,912 L1-dcache-load-misses (3.1%)
# 8,388,608 LLC-load-misses (2.1%) ← DRAM bound!
# 127 branch-misses
# 2,043,187,128 uops_issued.any
# Time: 902 ms
# GFLOPs/s: (100M × 2 FLOPs) / 0.902s = 0.22 GFLOPs/s
IPC = 0.19 is extremely low. The LLC miss rate of 2.1% translates to 8.4 million DRAM accesses — each costing ~200 cycles. At 8.4M × 200 cycles = 1.68 billion stall cycles, which explains the low IPC.
Diagnosis: DRAM bandwidth bound. The 400 MB working set (two 200 MB arrays) must be read from DRAM on each call. At 65 GB/s DRAM bandwidth, the minimum time for 400 MB = 400/65,000 ≈ 6.15 ms. The baseline takes 902 ms — 146× slower than the bandwidth limit.
Wait — something is wrong. If DRAM bandwidth is 65 GB/s and we need 400 MB, the floor is 6 ms. Why is the baseline 902 ms?
The answer: with -O0, the compiler generates extremely inefficient code with redundant loads, stack spills, and no vectorization. Let's compile with -O2 first:
gcc -O2 -o dot_o2 dot.c
# O2 output:
# cycles: 642,312,456
# instructions: 712,548,236 IPC = 1.11
# LLC-misses: 8,388,608 (still DRAM bound)
# Time: 119 ms
# GFLOPs/s: 1.68 GFLOPs/s
Better, but still far from peak. With -O3 -march=native:
gcc -O3 -march=native -o dot_o3 dot.c
# O3+native output:
# cycles: 284,127,834
# instructions: 567,234,512 IPC = 2.0
# LLC-misses: 6,291,456
# Time: 52.6 ms
# GFLOPs/s: 3.80 GFLOPs/s
The compiler auto-vectorized with AVX2. Let's see what it generated:
objdump -d dot_o3 | grep -A 20 "<dot>"
# Compiler output (simplified):
# vmovaps ymm0, [rdi+rcx] # load 8 a[] values
# vfmadd231ps ymm1, ymm0, [rsi+rcx] # acc1 += a * b
# vmovaps ymm0, [rdi+rcx+32] # load next 8
# vfmadd231ps ymm2, ymm0, [rsi+rcx+32] # acc2 (independent!)
# [similar for ymm3, ymm4] # 4 accumulators total
# add rcx, 128
# cmp rcx, rdx
# jl .loop
The compiler used 4 YMM accumulators and AVX2 FMA. Let's beat it.
Step 2: NASM Version 1 — Match the Compiler
; dot_v1: AVX2 FMA with 4 accumulators (match compiler)
; rdi = a, rsi = b, edx = n
; Returns sum in xmm0
dot_v1:
vxorps ymm0, ymm0, ymm0 ; acc0
vxorps ymm1, ymm1, ymm1 ; acc1
vxorps ymm2, ymm2, ymm2 ; acc2
vxorps ymm3, ymm3, ymm3 ; acc3
xor ecx, ecx ; i = 0
sub edx, 32 ; need 32 elements per iteration
jl .tail
.loop:
vfmadd231ps ymm0, ymm4, [rdi + rcx*4] ; acc0 += a[i:i+7] * b[i:i+7]
vmovaps ymm4, [rdi + rcx*4]
vfmadd231ps ymm0, ymm4, [rsi + rcx*4]
vmovaps ymm4, [rdi + rcx*4 + 32]
vfmadd231ps ymm1, ymm4, [rsi + rcx*4 + 32]
vmovaps ymm4, [rdi + rcx*4 + 64]
vfmadd231ps ymm2, ymm4, [rsi + rcx*4 + 64]
vmovaps ymm4, [rdi + rcx*4 + 96]
vfmadd231ps ymm3, ymm4, [rsi + rcx*4 + 96]
add ecx, 32
sub edx, 32
jge .loop
.merge:
vaddps ymm0, ymm0, ymm1
vaddps ymm2, ymm2, ymm3
vaddps ymm0, ymm0, ymm2
; Horizontal sum of ymm0 (8 floats → 1 float):
vextractf128 xmm1, ymm0, 1
vaddps xmm0, xmm0, xmm1
vhaddps xmm0, xmm0, xmm0
vhaddps xmm0, xmm0, xmm0
; xmm0[0] = sum
.tail:
; Handle remaining elements (edx < 0, edx+32 = remaining count)
add edx, 32
jz .done
; [scalar tail loop omitted]
.done:
vzeroupper
ret
perf stat -e cycles,instructions,LLC-load-misses ./dot_v1
# v1 output:
# cycles: 281,234,512
# instructions: 498,234,132 IPC = 1.77
# LLC-misses: 6,291,456
# Time: 52.1 ms
# GFLOPs/s: 3.84 GFLOPs/s
Matches the compiler. Both are limited by LLC misses — still DRAM bound.
Step 3: Determine the Bandwidth Ceiling
# Measure actual DRAM bandwidth with STREAM:
./stream | grep Triad
# Triad: 62,347 MB/s ← ~62 GB/s
# Our working set: 400 MB
# Time floor = 400 MB / 62,347 MB/s = 6.42 ms
# Our current: 52.1 ms
# Efficiency: 6.42 / 52.1 = 12.3% of bandwidth
# Why are we only using 12.3%?
# With 4 YMM accumulators, we issue 4 FMAs per 4 loads per iteration
# Each load requests 32 bytes from memory (one YMM)
# 4 loads × 32 bytes = 128 bytes per iteration
# But: 4 FMAs complete in ~4 cycles (one FMA/cycle on port 0)
# So we issue load requests at 128 bytes / 4 cycles = 32 bytes/cycle
# At 5.4 GHz: 32 × 5.4 = 173 GB/s demand
# Wait — that exceeds DRAM. So why aren't we hitting the bandwidth ceiling?
# Check load buffer saturation:
perf stat -e cycle_activity.stalls_ldm_pending ./dot_v1
# stalls_ldm_pending: 242,134,512 (49% of cycles stalled on memory)
# The load buffer is full — requests are backed up
# But bandwidth utilization is low because the requests aren't sequential enough
# to fully saturate the DDR5 memory controller
The issue: our loads from a[] and b[] are interleaved (alternating addresses). The memory controller prefers sequential same-channel requests. Let's profile with Intel's Memory Latency Checker to understand the bottleneck:
mlc --loaded_latency -d0
# Loaded latency at maximum BW: ~50 ns (= 270 cycles at 5.4 GHz)
# vs. unloaded latency: ~80 ns
# Maximum bandwidth: 61.2 GB/s achieved
We need to increase our bandwidth demand. Add more accumulators and process more data per loop iteration:
Step 4: 8 Accumulators — Saturate Bandwidth
; dot_v2: 8 YMM accumulators — double the issue rate
dot_v2:
vxorps ymm0, ymm0, ymm0
vxorps ymm1, ymm1, ymm1
vxorps ymm2, ymm2, ymm2
vxorps ymm3, ymm3, ymm3
vxorps ymm4, ymm4, ymm4
vxorps ymm5, ymm5, ymm5
vxorps ymm6, ymm6, ymm6
vxorps ymm7, ymm7, ymm7
xor ecx, ecx
sub edx, 64 ; 64 elements per iteration (8 × 8 floats)
jl .tail
.loop:
; 8 FMAs with 8 accumulators, stride-1 access on a[] and b[]
vmovaps ymm8, [rdi + rcx*4]
vfmadd231ps ymm0, ymm8, [rsi + rcx*4]
vmovaps ymm8, [rdi + rcx*4 + 32]
vfmadd231ps ymm1, ymm8, [rsi + rcx*4 + 32]
vmovaps ymm8, [rdi + rcx*4 + 64]
vfmadd231ps ymm2, ymm8, [rsi + rcx*4 + 64]
vmovaps ymm8, [rdi + rcx*4 + 96]
vfmadd231ps ymm3, ymm8, [rsi + rcx*4 + 96]
vmovaps ymm8, [rdi + rcx*4 + 128]
vfmadd231ps ymm4, ymm8, [rsi + rcx*4 + 128]
vmovaps ymm8, [rdi + rcx*4 + 160]
vfmadd231ps ymm5, ymm8, [rsi + rcx*4 + 160]
vmovaps ymm8, [rdi + rcx*4 + 192]
vfmadd231ps ymm6, ymm8, [rsi + rcx*4 + 192]
vmovaps ymm8, [rdi + rcx*4 + 224]
vfmadd231ps ymm7, ymm8, [rsi + rcx*4 + 224]
add ecx, 64
sub edx, 64
jge .loop
; [merge 8 accumulators, handle tail — omitted for brevity]
# v2 results:
# cycles: 78,234,512
# instructions: 498,432,134 IPC = 6.37 ← dramatic improvement
# LLC-misses: 6,291,456 (same — still DRAM bound)
# Time: 14.5 ms
# GFLOPs/s: 13.8 GFLOPs/s
# Bandwidth utilization: 400 MB / (14.5ms × 62 GB/s) = 44.5%
3.6× faster. IPC jumped from 1.77 to 6.37 — the 8 independent accumulators allow 8 FMA operations to be in-flight simultaneously, hiding more of the memory latency.
Step 5: Software Prefetch
With 8 accumulators, we're issuing 8 loads per iteration (only the a[] loads; the FMA also loads b[] implicitly = 16 total loads per iteration). Add prefetch hints to keep the memory pipeline full:
; dot_v3: Add prefetching
.loop:
prefetcht0 [rdi + rcx*4 + 512] ; prefetch a[] 512 bytes ahead
prefetcht0 [rsi + rcx*4 + 512] ; prefetch b[] 512 bytes ahead
; [same 8 FMA body as v2]
# v3 results:
# Time: 11.2 ms
# GFLOPs/s: 17.9 GFLOPs/s
# Bandwidth: 57.7% of DRAM capacity
20% improvement from prefetching. The prefetch hints help the memory controller pipeline requests more efficiently.
Step 6: Non-Temporal Loads for Streaming Data
Since we are reading both arrays exactly once, the data should not pollute L2/L3:
; dot_v4: Use VMOVNTDQA (non-temporal loads from WC memory)
; Note: VMOVNTDQA requires write-combining memory mapping — use for GPU/device memory
; For regular DRAM, use PREFETCHNTA to reduce cache pollution instead
.loop:
prefetchnta [rdi + rcx*4 + 512] ; load to L1 only, no L2/L3 fill
prefetchnta [rsi + rcx*4 + 512]
; [same 8 FMA body]
# v4 results:
# Time: 9.8 ms
# GFLOPs/s: 20.4 GFLOPs/s
# Bandwidth: 63.2% of DRAM capacity
Step 7: AVX-512 (if available)
On hardware supporting AVX-512, ZMM registers hold 16 floats (vs. 8 for YMM):
; dot_v5: AVX-512 with 8 ZMM accumulators
dot_v5:
vzeroall ; zero all ZMM registers
xor ecx, ecx
sub edx, 128 ; 128 elements per iteration (8 × 16 floats)
jl .tail
.loop:
vmovaps zmm8, [rdi + rcx*4]
vfmadd231ps zmm0, zmm8, [rsi + rcx*4] ; 16 FMAs per instruction
vmovaps zmm8, [rdi + rcx*4 + 64]
vfmadd231ps zmm1, zmm8, [rsi + rcx*4 + 64]
; ... (8 ZMM accumulators total)
add ecx, 128
sub edx, 128
jge .loop
# v5 (AVX-512) results:
# Time: 6.8 ms
# GFLOPs/s: 29.4 GFLOPs/s
# Bandwidth: 87.3% of DRAM capacity
Final Results
| Version | Time (ms) | GFLOPs/s | BW Efficiency | Speedup |
|---|---|---|---|---|
| Baseline (-O0) | 902 | 0.22 | 0.9% | 1.0× |
| Compiler (-O3) | 52.1 | 3.84 | 12.5% | 17.3× |
| v1: 4 acc YMM | 52.1 | 3.84 | 12.5% | 17.3× |
| v2: 8 acc YMM | 14.5 | 13.8 | 45.0% | 62.2× |
| v3: + prefetch | 11.2 | 17.9 | 57.7% | 80.5× |
| v4: + PREFETCHNTA | 9.8 | 20.4 | 63.2% | 92.0× |
| v5: AVX-512 | 6.8 | 29.4 | 87.3% | 132.6× |
| Theoretical ceiling | 6.4 | 31.25 | 100% | — |
The final NASM implementation achieves 87% of the theoretical DRAM bandwidth ceiling — about as close to the hardware limit as software can reach. The gap from 100% represents memory controller overhead, DRAM refresh cycles, and row activation latency.
The path: 0.22 → 29.4 GFLOPs/s, a 133× improvement from the -O0 baseline. The compiler's -O3 implementation achieves 17× of that improvement automatically. The remaining 7.7× required understanding ILP (accumulator count), memory bandwidth (prefetching), and hardware instruction sets (AVX-512).
This case study is a microcosm of all performance optimization: measure first, identify the bottleneck type (DRAM bound, not compute bound), apply targeted techniques (more accumulators to hide latency, prefetch to fill the pipeline), and verify each step improves the right metric.