Case Study 33-1: Optimizing a Reduction Loop — From 5 GFLOPs to 500 GFLOPs

The Problem: A Bottleneck Hunt

A numerical simulation spends 40% of its runtime computing weighted sums of float arrays. The operation is:

float dot(float *a, float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

This is the dot product — one multiply and one add per element. Theoretically, an Intel Core i9-13900K running AVX-512 can execute 64 FLOPs per cycle per core (32 floats × 2 FLOPs × 2 FMA units). At 5.4 GHz, that is 345 GFLOPs/s per core. The naive implementation achieves 5 GFLOPs/s — less than 1.5% of theoretical peak.

This case study is a complete optimization journey from baseline to near-peak performance, using every tool discussed in the chapter.

Hardware: Intel Core i9-13900K (Raptor Lake), 5.4 GHz boost, AVX-512 enabled. Array size: N = 100 million float32 (400 MB total for both arrays). Test: 100 iterations, minimum time reported.


Step 1: Profile the Baseline

# Compile naive version
gcc -O0 -o dot_naive dot.c

# Profile
perf stat -e cycles,instructions,L1-dcache-load-misses,LLC-load-misses,\
             branch-misses,uops_issued.any ./dot_naive

# Output:
#  4,872,345,678    cycles
#      943,826,543  instructions    IPC = 0.19  ← terrible
#       12,582,912  L1-dcache-load-misses (3.1%)
#        8,388,608  LLC-load-misses (2.1%)   ← DRAM bound!
#              127  branch-misses
#    2,043,187,128  uops_issued.any

# Time: 902 ms
# GFLOPs/s: (100M × 2 FLOPs) / 0.902s = 0.22 GFLOPs/s

IPC = 0.19 is extremely low. The LLC miss rate of 2.1% translates to 8.4 million DRAM accesses — each costing ~200 cycles. At 8.4M × 200 cycles = 1.68 billion stall cycles, which explains the low IPC.

Diagnosis: DRAM bandwidth bound. The 400 MB working set (two 200 MB arrays) must be read from DRAM on each call. At 65 GB/s DRAM bandwidth, the minimum time for 400 MB = 400/65,000 ≈ 6.15 ms. The baseline takes 902 ms — 146× slower than the bandwidth limit.

Wait — something is wrong. If DRAM bandwidth is 65 GB/s and we need 400 MB, the floor is 6 ms. Why is the baseline 902 ms?

The answer: with -O0, the compiler generates extremely inefficient code with redundant loads, stack spills, and no vectorization. Let's compile with -O2 first:

gcc -O2 -o dot_o2 dot.c

# O2 output:
#  cycles:       642,312,456
#  instructions:  712,548,236    IPC = 1.11
#  LLC-misses:     8,388,608     (still DRAM bound)
# Time: 119 ms
# GFLOPs/s: 1.68 GFLOPs/s

Better, but still far from peak. With -O3 -march=native:

gcc -O3 -march=native -o dot_o3 dot.c

# O3+native output:
#  cycles:       284,127,834
#  instructions:  567,234,512    IPC = 2.0
#  LLC-misses:     6,291,456
# Time: 52.6 ms
# GFLOPs/s: 3.80 GFLOPs/s

The compiler auto-vectorized with AVX2. Let's see what it generated:

objdump -d dot_o3 | grep -A 20 "<dot>"

# Compiler output (simplified):
# vmovaps ymm0, [rdi+rcx]        # load 8 a[] values
# vfmadd231ps ymm1, ymm0, [rsi+rcx]  # acc1 += a * b
# vmovaps ymm0, [rdi+rcx+32]     # load next 8
# vfmadd231ps ymm2, ymm0, [rsi+rcx+32]  # acc2 (independent!)
# [similar for ymm3, ymm4]        # 4 accumulators total
# add rcx, 128
# cmp rcx, rdx
# jl .loop

The compiler used 4 YMM accumulators and AVX2 FMA. Let's beat it.


Step 2: NASM Version 1 — Match the Compiler

; dot_v1: AVX2 FMA with 4 accumulators (match compiler)
; rdi = a, rsi = b, edx = n
; Returns sum in xmm0

dot_v1:
    vxorps  ymm0, ymm0, ymm0    ; acc0
    vxorps  ymm1, ymm1, ymm1    ; acc1
    vxorps  ymm2, ymm2, ymm2    ; acc2
    vxorps  ymm3, ymm3, ymm3    ; acc3

    xor     ecx, ecx            ; i = 0
    sub     edx, 32             ; need 32 elements per iteration
    jl      .tail

.loop:
    vfmadd231ps ymm0, ymm4, [rdi + rcx*4]       ; acc0 += a[i:i+7] * b[i:i+7]
    vmovaps ymm4, [rdi + rcx*4]
    vfmadd231ps ymm0, ymm4, [rsi + rcx*4]
    vmovaps ymm4, [rdi + rcx*4 + 32]
    vfmadd231ps ymm1, ymm4, [rsi + rcx*4 + 32]
    vmovaps ymm4, [rdi + rcx*4 + 64]
    vfmadd231ps ymm2, ymm4, [rsi + rcx*4 + 64]
    vmovaps ymm4, [rdi + rcx*4 + 96]
    vfmadd231ps ymm3, ymm4, [rsi + rcx*4 + 96]
    add     ecx, 32
    sub     edx, 32
    jge     .loop

.merge:
    vaddps  ymm0, ymm0, ymm1
    vaddps  ymm2, ymm2, ymm3
    vaddps  ymm0, ymm0, ymm2
    ; Horizontal sum of ymm0 (8 floats → 1 float):
    vextractf128 xmm1, ymm0, 1
    vaddps  xmm0, xmm0, xmm1
    vhaddps xmm0, xmm0, xmm0
    vhaddps xmm0, xmm0, xmm0
    ; xmm0[0] = sum

.tail:
    ; Handle remaining elements (edx < 0, edx+32 = remaining count)
    add     edx, 32
    jz      .done
    ; [scalar tail loop omitted]
.done:
    vzeroupper
    ret
perf stat -e cycles,instructions,LLC-load-misses ./dot_v1

# v1 output:
# cycles:       281,234,512
# instructions:  498,234,132    IPC = 1.77
# LLC-misses:     6,291,456
# Time: 52.1 ms
# GFLOPs/s: 3.84 GFLOPs/s

Matches the compiler. Both are limited by LLC misses — still DRAM bound.


Step 3: Determine the Bandwidth Ceiling

# Measure actual DRAM bandwidth with STREAM:
./stream | grep Triad
# Triad: 62,347 MB/s  ← ~62 GB/s

# Our working set: 400 MB
# Time floor = 400 MB / 62,347 MB/s = 6.42 ms
# Our current: 52.1 ms
# Efficiency: 6.42 / 52.1 = 12.3% of bandwidth

# Why are we only using 12.3%?
# With 4 YMM accumulators, we issue 4 FMAs per 4 loads per iteration
# Each load requests 32 bytes from memory (one YMM)
# 4 loads × 32 bytes = 128 bytes per iteration
# But: 4 FMAs complete in ~4 cycles (one FMA/cycle on port 0)
# So we issue load requests at 128 bytes / 4 cycles = 32 bytes/cycle
# At 5.4 GHz: 32 × 5.4 = 173 GB/s demand
# Wait — that exceeds DRAM. So why aren't we hitting the bandwidth ceiling?

# Check load buffer saturation:
perf stat -e cycle_activity.stalls_ldm_pending ./dot_v1
# stalls_ldm_pending: 242,134,512 (49% of cycles stalled on memory)
# The load buffer is full — requests are backed up
# But bandwidth utilization is low because the requests aren't sequential enough
# to fully saturate the DDR5 memory controller

The issue: our loads from a[] and b[] are interleaved (alternating addresses). The memory controller prefers sequential same-channel requests. Let's profile with Intel's Memory Latency Checker to understand the bottleneck:

mlc --loaded_latency -d0
# Loaded latency at maximum BW:  ~50 ns (= 270 cycles at 5.4 GHz)
# vs. unloaded latency: ~80 ns
# Maximum bandwidth: 61.2 GB/s achieved

We need to increase our bandwidth demand. Add more accumulators and process more data per loop iteration:


Step 4: 8 Accumulators — Saturate Bandwidth

; dot_v2: 8 YMM accumulators — double the issue rate
dot_v2:
    vxorps  ymm0, ymm0, ymm0
    vxorps  ymm1, ymm1, ymm1
    vxorps  ymm2, ymm2, ymm2
    vxorps  ymm3, ymm3, ymm3
    vxorps  ymm4, ymm4, ymm4
    vxorps  ymm5, ymm5, ymm5
    vxorps  ymm6, ymm6, ymm6
    vxorps  ymm7, ymm7, ymm7

    xor     ecx, ecx
    sub     edx, 64             ; 64 elements per iteration (8 × 8 floats)
    jl      .tail

.loop:
    ; 8 FMAs with 8 accumulators, stride-1 access on a[] and b[]
    vmovaps ymm8,  [rdi + rcx*4]
    vfmadd231ps ymm0, ymm8, [rsi + rcx*4]
    vmovaps ymm8,  [rdi + rcx*4 + 32]
    vfmadd231ps ymm1, ymm8, [rsi + rcx*4 + 32]
    vmovaps ymm8,  [rdi + rcx*4 + 64]
    vfmadd231ps ymm2, ymm8, [rsi + rcx*4 + 64]
    vmovaps ymm8,  [rdi + rcx*4 + 96]
    vfmadd231ps ymm3, ymm8, [rsi + rcx*4 + 96]
    vmovaps ymm8,  [rdi + rcx*4 + 128]
    vfmadd231ps ymm4, ymm8, [rsi + rcx*4 + 128]
    vmovaps ymm8,  [rdi + rcx*4 + 160]
    vfmadd231ps ymm5, ymm8, [rsi + rcx*4 + 160]
    vmovaps ymm8,  [rdi + rcx*4 + 192]
    vfmadd231ps ymm6, ymm8, [rsi + rcx*4 + 192]
    vmovaps ymm8,  [rdi + rcx*4 + 224]
    vfmadd231ps ymm7, ymm8, [rsi + rcx*4 + 224]

    add     ecx, 64
    sub     edx, 64
    jge     .loop

    ; [merge 8 accumulators, handle tail — omitted for brevity]
# v2 results:
# cycles:        78,234,512
# instructions: 498,432,134     IPC = 6.37  ← dramatic improvement
# LLC-misses:    6,291,456       (same — still DRAM bound)
# Time: 14.5 ms
# GFLOPs/s: 13.8 GFLOPs/s
# Bandwidth utilization: 400 MB / (14.5ms × 62 GB/s) = 44.5%

3.6× faster. IPC jumped from 1.77 to 6.37 — the 8 independent accumulators allow 8 FMA operations to be in-flight simultaneously, hiding more of the memory latency.


Step 5: Software Prefetch

With 8 accumulators, we're issuing 8 loads per iteration (only the a[] loads; the FMA also loads b[] implicitly = 16 total loads per iteration). Add prefetch hints to keep the memory pipeline full:

; dot_v3: Add prefetching
.loop:
    prefetcht0 [rdi + rcx*4 + 512]   ; prefetch a[] 512 bytes ahead
    prefetcht0 [rsi + rcx*4 + 512]   ; prefetch b[] 512 bytes ahead

    ; [same 8 FMA body as v2]
# v3 results:
# Time: 11.2 ms
# GFLOPs/s: 17.9 GFLOPs/s
# Bandwidth: 57.7% of DRAM capacity

20% improvement from prefetching. The prefetch hints help the memory controller pipeline requests more efficiently.


Step 6: Non-Temporal Loads for Streaming Data

Since we are reading both arrays exactly once, the data should not pollute L2/L3:

; dot_v4: Use VMOVNTDQA (non-temporal loads from WC memory)
; Note: VMOVNTDQA requires write-combining memory mapping — use for GPU/device memory
; For regular DRAM, use PREFETCHNTA to reduce cache pollution instead

.loop:
    prefetchnta [rdi + rcx*4 + 512]  ; load to L1 only, no L2/L3 fill
    prefetchnta [rsi + rcx*4 + 512]

    ; [same 8 FMA body]
# v4 results:
# Time: 9.8 ms
# GFLOPs/s: 20.4 GFLOPs/s
# Bandwidth: 63.2% of DRAM capacity

Step 7: AVX-512 (if available)

On hardware supporting AVX-512, ZMM registers hold 16 floats (vs. 8 for YMM):

; dot_v5: AVX-512 with 8 ZMM accumulators
dot_v5:
    vzeroall                    ; zero all ZMM registers

    xor     ecx, ecx
    sub     edx, 128            ; 128 elements per iteration (8 × 16 floats)
    jl      .tail

.loop:
    vmovaps zmm8, [rdi + rcx*4]
    vfmadd231ps zmm0, zmm8, [rsi + rcx*4]      ; 16 FMAs per instruction
    vmovaps zmm8, [rdi + rcx*4 + 64]
    vfmadd231ps zmm1, zmm8, [rsi + rcx*4 + 64]
    ; ... (8 ZMM accumulators total)

    add     ecx, 128
    sub     edx, 128
    jge     .loop
# v5 (AVX-512) results:
# Time: 6.8 ms
# GFLOPs/s: 29.4 GFLOPs/s
# Bandwidth: 87.3% of DRAM capacity

Final Results

Version Time (ms) GFLOPs/s BW Efficiency Speedup
Baseline (-O0) 902 0.22 0.9% 1.0×
Compiler (-O3) 52.1 3.84 12.5% 17.3×
v1: 4 acc YMM 52.1 3.84 12.5% 17.3×
v2: 8 acc YMM 14.5 13.8 45.0% 62.2×
v3: + prefetch 11.2 17.9 57.7% 80.5×
v4: + PREFETCHNTA 9.8 20.4 63.2% 92.0×
v5: AVX-512 6.8 29.4 87.3% 132.6×
Theoretical ceiling 6.4 31.25 100%

The final NASM implementation achieves 87% of the theoretical DRAM bandwidth ceiling — about as close to the hardware limit as software can reach. The gap from 100% represents memory controller overhead, DRAM refresh cycles, and row activation latency.

The path: 0.22 → 29.4 GFLOPs/s, a 133× improvement from the -O0 baseline. The compiler's -O3 implementation achieves 17× of that improvement automatically. The remaining 7.7× required understanding ILP (accumulator count), memory bandwidth (prefetching), and hardware instruction sets (AVX-512).

This case study is a microcosm of all performance optimization: measure first, identify the bottleneck type (DRAM bound, not compute bound), apply targeted techniques (more accumulators to hide latency, prefetch to fill the pipeline), and verify each step improves the right metric.