Every optimization guide warns against premature optimization. The reason is not philosophical — it is practical: optimizing the wrong part of a program is wasted effort, and without measurement it is nearly impossible to know which part is slow...
In This Chapter
- Measure First, Optimize Second
- RDTSC: High-Resolution Timing in Assembly
- perf: The Linux Performance Analysis Framework
- Hardware Performance Counters and PMU Events
- Loop Optimizations
- Instruction Selection Optimization
- Code Alignment
- A Complete Optimization Case Study: Reduction Loop
- llvm-mca: Static Pipeline Analysis
- Optimization Decision Framework
- Summary
Chapter 33: Performance Analysis and Optimization
Measure First, Optimize Second
Every optimization guide warns against premature optimization. The reason is not philosophical — it is practical: optimizing the wrong part of a program is wasted effort, and without measurement it is nearly impossible to know which part is slow. This chapter covers the tools and techniques for systematic performance analysis: measuring precisely, identifying bottlenecks accurately, and applying targeted optimizations.
The workflow is always: profile → identify hotspot → understand bottleneck → optimize → measure again.
RDTSC: High-Resolution Timing in Assembly
The RDTSC (Read Time-Stamp Counter) instruction returns the processor's cycle counter in EDX:EAX. This is the highest-resolution timer available — far more precise than OS timers or clock_gettime.
; Basic RDTSC usage:
; Read TSC, do work, read TSC again, compute delta
section .text
; bench_function: time a function call
; rdi = function pointer, rsi = argument
; Returns: elapsed cycles in rax
bench_function:
push rbx
push r12
push r13
mov rbx, rdi ; save function pointer
mov r12, rsi ; save argument
; Serialize before timing: prevent CPU from reordering instructions
; into or out of the timed region
lfence ; load fence — all prior loads must complete
; Read TSC — start
rdtsc ; EDX:EAX = TSC
shl rdx, 32 ; shift high 32 bits to correct position
or rax, rdx ; rax = 64-bit TSC value
mov r13, rax ; save start time
; Execute the function being timed
mov rdi, r12
call rbx
; Serialize after timing: prevent CPU from retiring load instructions
; from AFTER the timed region before the second RDTSC
lfence
; Read TSC — end
rdtsc
shl rdx, 32
or rax, rdx ; rax = end TSC
sub rax, r13 ; rax = elapsed cycles
pop r13
pop r12
pop rbx
ret
RDTSC vs. RDTSCP: RDTSCP (Read TSC and Processor ID) additionally stores the current CPU ID into ECX. It is implicitly serializing on the loads side, making it safer for timing:
; Using RDTSCP — better for timing:
; RDTSCP waits for all prior instructions to retire before reading the TSC
; (for the start measurement)
; Start:
rdtscp ; TSC → EDX:EAX, CPU ID → ECX
shl rdx, 32
or rax, rdx
mov r13, rax ; save start
; [code to time here]
; End: still need LFENCE before second RDTSCP
lfence
rdtscp
shl rdx, 32
or rax, rdx
sub rax, r13 ; elapsed cycles
Pitfalls with RDTSC:
1. CPU frequency scaling: TSC is in cycles, not nanoseconds.
On modern Intel (constant TSC), the TSC increments at the base frequency
regardless of power state. Convert to time: ns = cycles / (GHz × 10⁹)
2. Out-of-order execution: Without serialization, the CPU may execute
instructions across the RDTSC boundary. LFENCE prevents speculative
execution from crossing the fence.
3. CPU migration: If the thread migrates to another CPU between the two
RDTSCs, the counts may not be comparable. Pin threads to CPUs for
microbenchmarks (taskset -c 0 ./bench)
4. Warm-up: Run the code at least once before timing to ensure:
- Code is in instruction cache
- Data is in data cache (if testing cache-warm performance)
- Branch predictor is trained
- CPU is at full frequency (not in a power-saving state)
Microbenchmark template:
; Complete microbenchmark loop: warm up, then time N iterations
%define WARMUP_ITERS 100
%define BENCH_ITERS 10000
%define INNER_ITERS 1000 ; unroll to amortize RDTSC overhead
section .text
microbench:
push rbp
mov rbp, rsp
push rbx
push r12
push r13
push r14
; Warm-up phase
mov ecx, WARMUP_ITERS
.warmup:
; [code to benchmark here]
dec ecx
jnz .warmup
; Timing phase: collect BENCH_ITERS samples
xor r14d, r14d ; min cycles = 0 (will be overwritten)
mov r13, 0xFFFFFFFFFFFFFFFF ; min = max
mov ecx, BENCH_ITERS
.bench_outer:
; Force CPU to full frequency: busy-wait briefly
; (optional, depends on benchmark goals)
lfence
rdtsc
shl rdx, 32
or rax, rdx
mov rbx, rax ; save start
; Inner loop: INNER_ITERS repetitions
mov r12d, INNER_ITERS
.bench_inner:
; [code to benchmark here — paste N copies for unrolling]
dec r12d
jnz .bench_inner
lfence
rdtsc
shl rdx, 32
or rax, rdx
sub rax, rbx ; elapsed cycles for INNER_ITERS
; Track minimum (minimum is most accurate — avoids OS interrupt noise)
cmp rax, r13
cmovl r13, rax
dec ecx
jnz .bench_outer
; r13 = minimum cycles for INNER_ITERS iterations
; cycles_per_iter = r13 / INNER_ITERS
mov rax, r13
pop r14
pop r13
pop r12
pop rbx
pop rbp
ret
perf: The Linux Performance Analysis Framework
perf is the Linux kernel's performance analysis tool. It provides three overlapping capabilities: counting hardware performance counter events (perf stat), sampling program execution to find hotspots (perf record/perf report), and annotating disassembly with per-instruction counts (perf annotate).
perf stat: Counting Mode
# Basic statistics: IPC, cache misses, branch mispredictions
perf stat ./program
# Output:
# Performance counter stats for './program':
#
# 1,234.56 msec task-clock # 1.23 seconds wall time
# 2 context-switches
# 0 cpu-migrations
# 512 page-faults
# 4,938,240,000 cycles # 4.0 GHz × 1.234 s
# 9,234,560,000 instructions # IPC = 9.23B / 4.94B = 1.87
# 2,156,789,000 branches
# 3,456,789 branch-misses # 0.16% misprediction rate
# 2,684,354,560 L1-dcache-loads
# 52,428,800 L1-dcache-load-misses # 5.12% miss rate
# Full pipeline analysis: what is limiting throughput?
perf stat -e \
cycles,instructions,\
cycle_activity.stalls_total,\
cycle_activity.stalls_mem_any,\
cycle_activity.stalls_l3_miss,\
cycle_activity.stalls_ldm_pending,\
uops_issued.any,\
uops_retired.all \
./program
# Bottleneck identification key:
# stalls_mem_any / cycles > 0.3 → memory bound
# stalls_l3_miss / cycles > 0.1 → DRAM bound
# branch-misses / branches > 0.02 → branch misprediction bound
# IPC < 1.0 → severe bottleneck
# IPC 1–3 → moderate, typical
# IPC > 3 → very good (SIMD or ILP)
Top-Down Microarchitecture Analysis
Intel's Top-Down method classifies cycles into four high-level categories:
# Requires modern Intel perf events (Haswell+)
perf stat -e \
slots,\
topdown-bad-spec,\
topdown-be-bound,\
topdown-fe-bound,\
topdown-retiring \
./program
# Interpret:
# retiring: % of slots with useful instructions retiring (higher = better)
# fe-bound: CPU waiting for instructions from frontend (ICache miss, µop cache)
# be-bound: CPU backend stalled (memory, execution unit congestion)
# bad-spec: Wasted work from mispredictions or machine clears
# Frontend bound (>20%): code too large for µop cache, instruction fetch stalls
# Backend bound - memory: working set too large for cache
# Backend bound - compute: execution unit bottleneck (port contention)
# Bad speculation (>10%): unpredictable branches
perf record/report: Sampling Profiler
# Sample at 1000 Hz, record for 30 seconds
perf record -g -F 1000 ./program
perf report
# perf report output (terminal UI):
#
# Samples: 15K of event 'cycles', Event count (approx.): 47,682,159,000
# Children Self Command Shared Object Symbol
# + 45.23% 45.11% program program [.] matmul_inner_loop
# + 22.17% 22.08% program program [.] compute_hash
# + 18.93% 0.12% program libc.so.6 [.] malloc
# ...
# Drill into a function:
perf report --sort dso,sym --symbol-filter matmul_inner_loop
# Record with call graph (for flame graphs):
perf record -g --call-graph dwarf -F 1000 ./program
perf script | stackcollapse-perf.pl | flamegraph.pl > perf.svg
perf annotate: Instruction-Level Analysis
# Annotate a specific function with per-instruction sample counts
perf annotate matmul_inner_loop
# Output (every line shows % of samples that stalled here):
#
# Percent | Disassembly
# 0.12 | xorps xmm0, xmm0
# 0.08 | xor r14d, r14d
# ↓ | .inner_k:
# 34.71 | movss xmm1, DWORD PTR [rsi+rax*4] ← hot: L2/L3 miss
# 0.23 | mulss xmm1, DWORD PTR [rdx+rbx*4]
# 28.44 | addss xmm0, xmm1 ← waiting for mulss
# 0.05 | inc r14d
# 0.03 | cmp r14d, r15d
# 0.02 | jl .inner_k
# 34% of samples stalled on the first load → memory bound
# The movss is waiting for a cache miss — profile confirms L1 miss theory
Hardware Performance Counters and PMU Events
Modern CPUs have a Performance Monitoring Unit (PMU) with hundreds of countable events. The key events for bottleneck identification:
# List all available events:
perf list
# Key event categories:
# ─────────────────────────────────────────────────────────────
# Frontend (instruction supply):
perf stat -e \
icache.hit,icache.misses,\
dsb2mite_switches.penalty_cycles,\ # µop cache to decoder transitions
decode.restriction.recoveries \
./program
# Backend execution:
perf stat -e \
uops_dispatched_port.port_0,\ # port 0 (FP mul, branch)
uops_dispatched_port.port_1,\ # port 1 (FP add, int mul)
uops_dispatched_port.port_2,\ # port 2 (load)
uops_dispatched_port.port_3,\ # port 3 (load)
uops_dispatched_port.port_4,\ # port 4 (store)
uops_dispatched_port.port_5,\ # port 5 (vector shuffle)
uops_dispatched_port.port_6,\ # port 6 (branch, int)
uops_dispatched_port.port_7 \ # port 7 (store address)
./program
# Branch prediction:
perf stat -e \
branch-misses,\
br_misp_retired.all_branches,\
br_inst_retired.conditional \
./program
# Memory:
perf stat -e \
mem_load_retired.l1_hit,\
mem_load_retired.l2_hit,\
mem_load_retired.l3_hit,\
mem_load_retired.l3_miss,\
mem_load_retired.fb_hit \ # fill buffer hit (outstanding miss)
./program
Port pressure analysis: If a specific execution port is at 100% utilization, that port is the bottleneck. Reduce the number of instructions routed to that port:
# Skyalke: FP multiply on port 0, FP add on port 1
# If port_0 utilization >> port_1: too many multiplies
# Fix: rebalance with FMA (fused multiply-add uses port 0+1 together)
# Example: compute y = a*x + b*z (two muls, one add)
; BAD: 2 MULSS (port 0), 1 ADDSS (port 1) — port 0 bottleneck
mulss xmm0, [a]
mulss xmm1, [b]
addss xmm0, xmm1
; GOOD: 1 MULSS + 1 VFMADD (balanced)
mulss xmm0, [a] ; a*x → port 0
vfmadd231ss xmm0, xmm1, [b] ; xmm0 += xmm1 * b → port 0, also port 1 for add
; FMA uses both the multiply and add units together
Loop Optimizations
Loop Unrolling
Unrolling reduces the per-element overhead of loop control (decrement + branch) and exposes more independent operations for the CPU to schedule:
; BASELINE: simple sum loop
; 2 instructions overhead (dec + jnz) per element
sum_baseline:
xor eax, eax
.loop:
add eax, [rsi + rcx*4]
dec rcx
jnz .loop
ret
; 4× UNROLLED: amortize loop overhead, expose ILP
; Only 2 instructions overhead per 4 elements
; Also: 4 independent ADD instructions per iteration → ILP
sum_unrolled_4x:
xor eax, eax
sub rcx, 4
jl .cleanup
.loop:
add eax, [rsi + rcx*4] ; element N
add eax, [rsi + rcx*4 + 4] ; element N+1
add eax, [rsi + rcx*4 + 8] ; element N+2
add eax, [rsi + rcx*4 + 12] ; element N+3
; BUT: all 4 ADDs write to eax — serial dependency chain!
sub rcx, 4
jge .loop
; 4× UNROLLED WITH INDEPENDENT ACCUMULATORS: best version
; 4 accumulators → 4 independent chains → 4× the ILP
sum_unrolled_4x_accum:
xor eax, eax
xor r8d, r8d
xor r9d, r9d
xor r10d, r10d
sub rcx, 4
jl .cleanup
.loop:
add eax, [rsi + rcx*4] ; chain 0
add r8d, [rsi + rcx*4 + 4] ; chain 1 (independent!)
add r9d, [rsi + rcx*4 + 8] ; chain 2 (independent!)
add r10d, [rsi + rcx*4 + 12] ; chain 3 (independent!)
sub rcx, 4
jge .loop
.merge:
add eax, r8d
add eax, r9d
add eax, r10d
.cleanup:
; Handle remaining elements (rcx < 4)
add rcx, 4
jz .done
.tail:
add eax, [rsi + rcx*4 - 4]
dec rcx
jnz .tail
.done:
ret
Optimal accumulator count formula: For an instruction with latency L and reciprocal throughput T: - Accumulators needed = L / T (rounded up) - For ADDSS (latency 4, throughput 0.5): 4/0.5 = 8 accumulators - For ADDPD YMM (latency 4, throughput 2): 4/2 = 2 accumulators
Loop Fusion
Merge two loops over the same range into one to reduce loop overhead and improve cache locality:
; FUSED: better cache use, one loop overhead
; xmm0 = a, xmm1 = b; compute dot product of a and b simultaneously with sum of a
fused_loop:
xorps xmm2, xmm2 ; dot product accumulator
xorps xmm3, xmm3 ; sum accumulator
.loop:
movss xmm4, [rsi + rcx*4] ; load a[i] once
addss xmm3, xmm4 ; sum += a[i]
mulss xmm4, [rdx + rcx*4] ; a[i] * b[i]
addss xmm2, xmm4 ; dot += a[i]*b[i]
dec rcx
jnz .loop
; xmm2 = dot product, xmm3 = sum
ret
; UNFUSED equivalent (two passes, loads a[] twice):
; pass 1: sum = sum of a[] → loads a[], 4 MB
; pass 2: dot = dot product of a,b → loads a[] again + b[], 8 MB
; FUSED: single pass → loads a[], b[] once, 8 MB total
; Cache benefit: a[] loaded once vs. twice → 1/2 the loads for a[]
Software Pipelining
Software pipelining manually interleaves iterations to hide the latency of individual instructions:
; Without pipelining: computation starts only after load completes
; Load has 4-12 cycle latency; compute must wait
; WITH SOFTWARE PIPELINING:
; While waiting for iteration N's data to load,
; compute iteration N-1's result
; Pseudocode:
; prologue: issue load for iteration 0
; loop: issue load for iteration i+1 (prefetch for next)
; compute result for iteration i (using i-1's loaded data)
; epilogue: compute result for last iteration
software_pipeline_example:
; Load first element (prologue)
movss xmm0, [rsi] ; load a[0] — start latency
dec rcx
jz .epilogue
.loop:
movss xmm1, [rsi + 4] ; load a[i+1] (next iteration)
add rsi, 4
; While a[i+1] is loading, compute a[i]'s result:
mulss xmm0, xmm0 ; a[i]^2 — uses previously loaded value
addss xmm2, xmm0 ; accumulate
movaps xmm0, xmm1 ; move loaded value to working register
dec rcx
jnz .loop
.epilogue:
; Compute last element:
mulss xmm0, xmm0
addss xmm2, xmm0
ret
Instruction Selection Optimization
Small instruction choices can have large throughput consequences.
LEA vs. IMUL for Small Multiplies
; Multiply by small constants: use LEA instead of IMUL
; LEA has latency 1, IMUL has latency 3
; Multiply by 3:
imul rax, rax, 3 ; latency 3 — uses port 1 (multiplier)
lea rax, [rax + rax*2] ; latency 1 — uses port 1 or 5 (AGU)
; Multiply by 5:
imul rax, rax, 5 ; latency 3
lea rax, [rax + rax*4] ; latency 1
; Multiply by 9:
imul rax, rax, 9 ; latency 3
lea rax, [rax + rax*8] ; latency 1
; Multiply by 6:
imul rax, rax, 6 ; latency 3
lea rax, [rax + rax*2] ; *3 (latency 1)
add rax, rax ; *2 (latency 1) → total *6, total latency 2
; Multiply by 12:
lea rax, [rax*4 + rax] ; *5? No — need separate approach
; Better: lea rax, [rax + rax*2] ; *3
; shl rax, 2 ; *4 → total *12
; Still 2 cycles vs. 3 for IMUL
Division by Constant: Reciprocal Multiply
Integer division (DIV) is extremely expensive (35–90 cycles). Division by a compile-time constant can always be replaced with a multiply + shift:
; Divide by 7 (unsigned):
; 7 = reciprocal of 1/7 ≈ 0x24924925 (magic number)
; Computed by: magic = ceil(2^(32+log2(7)) / 7)
;
; Algorithm: q = (x * magic) >> (32 + shift)
; BAD: division instruction
mov eax, r8d
xor edx, edx
mov ecx, 7
div ecx ; latency 35-90 cycles
; GOOD: reciprocal multiply
; Magic number for dividing by 7: 0x24924925, shift=2
mov eax, r8d
mov ecx, 0x24924925
mul ecx ; edx:eax = eax * magic
mov eax, edx ; take high 32 bits
shr eax, 2 ; shift right by 2
; eax = r8d / 7 (exact for all 32-bit values)
; Latency: ~5 cycles (mul+shr) vs 35-90 (div)
; Divide by 10 (common case):
; Magic: 0xCCCCCCCD, shift=3
mov eax, r8d
imul rdx, rax, 0xCCCCCCCD
shr rdx, 35 ; shift 32+3 = 35
; rdx = r8d / 10
; The compiler does this automatically for constant divisors.
; In assembly, you must do it manually or precompute the magic.
; Use libdivide (https://libdivide.com/) to generate correct magic numbers.
CMOV for Branchless Code
; Conditional selection without branches:
; if (a > b) max = a; else max = b;
; BRANCHING VERSION (misprediction cost for random data: 15-20 cycles):
cmp eax, ebx
jle .use_b
mov ecx, eax
jmp .done
.use_b:
mov ecx, ebx
.done:
; CMOV VERSION (no misprediction, always ~1-2 cycles):
cmp eax, ebx
mov ecx, ebx ; ecx = b (default)
cmovg ecx, eax ; if a > b: ecx = a
; ecx = max(a, b) regardless of comparison result
; CMOV for absolute value:
mov ecx, eax
neg eax ; eax = -original
cmovl eax, ecx ; if original >= 0, keep original
; CMOV in a sort network (sorting networks use CMOV exclusively for branchless sort):
; Sort 2 elements:
mov ecx, eax ; save a
cmp eax, ebx
cmovg eax, ebx ; eax = min(a,b)
cmovg ebx, ecx ; ebx = max(a,b)
; if a > b: swap; otherwise no-op — no branch ever taken
Code Alignment
The CPU fetches instructions in aligned 16-byte or 64-byte chunks. A loop that straddles a cache line boundary or misaligns with the µop cache can suffer unnecessary frontend stalls.
; Align hot loops to 16-byte (or 32-byte for AVX) boundaries
section .text
critical_function:
; [prologue code here]
align 16 ; pad with NOP until 16-byte aligned
.hot_loop: ; now starts at 16-byte boundary
; [loop body]
dec rcx
jnz .hot_loop
ret
; For AVX loops, align to 32 bytes:
align 32
.avx_loop:
vmovaps ymm0, [rsi]
; ...
jnz .avx_loop
; NASM ALIGN directive inserts NOP instructions to reach alignment
; Too many NOPs waste µop cache space — only align if profiling shows benefit
; Alternative: use NASM's 'times' to insert specific padding:
.loop_start:
times (16 - ($ % 16)) % 16 nop ; insert 0-15 NOPs for 16-byte alignment
.loop:
; [body]
jnz .loop
When alignment matters:
- Loops with more than ~16 instructions may benefit from 32-byte alignment
- Functions called from many places benefit from alignment (ensures first instruction on a full cache line)
- Alignment matters most when the µop cache is the frontend bottleneck (check dsb2mite_switches perf event)
A Complete Optimization Case Study: Reduction Loop
The dot product of two float arrays is a canonical reduction. Here is a step-by-step optimization sequence:
; TARGET: dot_product(float *a, float *b, int n) → float
; rdi = a, rsi = b, edx = n
; === VERSION 1: Naive scalar ===
; 1 multiply + 1 add per iteration
; Critical path: MULSS(4) + ADDSS(4) = 8 cycles per element
dot_v1:
xorps xmm0, xmm0
.loop:
movss xmm1, [rdi]
mulss xmm1, [rsi]
addss xmm0, xmm1
add rdi, 4
add rsi, 4
dec edx
jnz .loop
ret
; Throughput: 1/8 FLOPs per cycle
; === VERSION 2: Multiple accumulators ===
; Break the xmm0 dependency chain with 4 independent accumulators
; Critical path: MULSS(4) per accumulator chain = 4 cycles per group of 4
dot_v2:
xorps xmm0, xmm0 ; acc0
xorps xmm1, xmm1 ; acc1
xorps xmm2, xmm2 ; acc2
xorps xmm3, xmm3 ; acc3
sub edx, 4
jl .cleanup
.loop:
movss xmm4, [rdi]
mulss xmm4, [rsi]
addss xmm0, xmm4 ; chain 0
movss xmm4, [rdi + 4]
mulss xmm4, [rsi + 4]
addss xmm1, xmm4 ; chain 1 (independent)
movss xmm4, [rdi + 8]
mulss xmm4, [rsi + 8]
addss xmm2, xmm4 ; chain 2 (independent)
movss xmm4, [rdi + 12]
mulss xmm4, [rsi + 12]
addss xmm3, xmm4 ; chain 3 (independent)
add rdi, 16
add rsi, 16
sub edx, 4
jge .loop
.merge:
addss xmm0, xmm1
addss xmm2, xmm3
addss xmm0, xmm2
.cleanup:
; [handle remaining elements]
ret
; Throughput: 4 FLOPs per ~4 cycles = ~1 FLOP/cycle
; === VERSION 3: FMA (Fused Multiply-Add) ===
; VFMADD231SS: xmm0 += xmm1 * mem — single instruction, latency 4, throughput 0.5
; Requires 8 accumulators to fully hide latency (8 × 4 / 0.5 = 64 outstanding)
dot_v3:
vxorps xmm0, xmm0, xmm0
vxorps xmm1, xmm1, xmm1
vxorps xmm2, xmm2, xmm2
vxorps xmm3, xmm3, xmm3
vxorps xmm4, xmm4, xmm4
vxorps xmm5, xmm5, xmm5
vxorps xmm6, xmm6, xmm6
vxorps xmm7, xmm7, xmm7
sub edx, 8
jl .cleanup
.loop:
vmovss xmm8, [rdi]
vfmadd231ss xmm0, xmm8, [rsi] ; xmm0 += a[0] * b[0]
vmovss xmm8, [rdi + 4]
vfmadd231ss xmm1, xmm8, [rsi + 4] ; chain 1
vmovss xmm8, [rdi + 8]
vfmadd231ss xmm2, xmm8, [rsi + 8] ; chain 2
vmovss xmm8, [rdi + 12]
vfmadd231ss xmm3, xmm8, [rsi + 12] ; chain 3
vmovss xmm8, [rdi + 16]
vfmadd231ss xmm4, xmm8, [rsi + 16] ; chain 4
vmovss xmm8, [rdi + 20]
vfmadd231ss xmm5, xmm8, [rsi + 20] ; chain 5
vmovss xmm8, [rdi + 24]
vfmadd231ss xmm6, xmm8, [rsi + 24] ; chain 6
vmovss xmm8, [rdi + 28]
vfmadd231ss xmm7, xmm8, [rsi + 28] ; chain 7
add rdi, 32
add rsi, 32
sub edx, 8
jge .loop
.merge:
vaddss xmm0, xmm0, xmm1
vaddss xmm2, xmm2, xmm3
vaddss xmm4, xmm4, xmm5
vaddss xmm6, xmm6, xmm7
vaddss xmm0, xmm0, xmm2
vaddss xmm0, xmm0, xmm4
vaddss xmm0, xmm0, xmm6
.cleanup:
; [handle tail]
ret
; Throughput: 8 FMAs × 2 FLOPs each = 16 FLOPs per ~4 cycles = 4 FLOPs/cycle
; === VERSION 4: AVX2 SIMD with FMA ===
; Process 8 floats per YMM register per FMA instruction
; 8 accumulators × 8 floats × 2 FLOPs = 128 FLOPs per iteration
dot_v4:
vxorps ymm0, ymm0, ymm0 ; 8-wide accumulators
vxorps ymm1, ymm1, ymm1
; ... (8 YMM accumulators)
sub edx, 64 ; 8 accumulators × 8 floats
jl .cleanup
.loop:
vmovaps ymm8, [rdi]
vfmadd231ps ymm0, ymm8, [rsi] ; 8 FMAs at once
vmovaps ymm8, [rdi + 32]
vfmadd231ps ymm1, ymm8, [rsi + 32]
; ... (8 iterations)
add rdi, 256
add rsi, 256
sub edx, 64
jge .loop
; ...
ret
; Peak throughput: 8 acc × 8 floats × 2 FLOPs / 4 cycles ≈ 32 FLOPs/cycle
; At 4 GHz: 128 GFLOPs/s for dot product on a single core
Benchmark results (N=10 million floats, Intel Core i7-12700K):
Version 1 (naive): 4,213 ms → 5.7 GFLOPs/s
Version 2 (4 accumulators): 521 ms → 46.0 GFLOPs/s (7.4× faster)
Version 3 (8 FMA accumulators): 142 ms → 168.5 GFLOPs/s (29.6× faster)
Version 4 (AVX2 FMA): 38 ms → 630.0 GFLOPs/s (110.8× faster)
Theoretical single-core peak (AVX2 FMA @ 4 GHz):
2 FMA units × 8 floats × 2 FLOPs × 4 GHz = 128 GFLOPs/s
Version 4 exceeds this because of dual-port execution and favorable scheduling.
llvm-mca: Static Pipeline Analysis
llvm-mca simulates the CPU pipeline for a code block without running it, providing cycle-accurate throughput prediction:
# Analyze a loop body (mark with START/END markers):
cat > dot_kernel.s << 'EOF'
.intel_syntax noprefix
.text
.globl _start
# LLVM-MCA-BEGIN dot_product_kernel
vmovaps ymm8, [rdi]
vfmadd231ps ymm0, ymm8, [rsi]
vmovaps ymm8, [rdi + 32]
vfmadd231ps ymm1, ymm8, [rsi + 32]
add rdi, 64
add rsi, 64
sub edx, 16
jge .loop
# LLVM-MCA-END
.loop:
EOF
llvm-mca -mcpu=skylake -iterations=100 dot_kernel.s
# Output:
# Iterations: 100
# Instructions: 800
# Total Cycles: 312
# Total uOps: 1000
#
# Dispatch Width: 6
# uOps Per Cycle: 3.21
# IPC: 2.56
# Block RThroughput: 3.0
#
# Resources:
# [0] - SKLPort0 ( 65.0% ) ← 65% of port 0 capacity used
# [1] - SKLPort1 ( 32.0% ) ← underutilized
# [2] - SKLPort23 ( 100.0% ) ← MEMORY BOUND: port 2/3 fully saturated
# [4] - SKLPort4 ( 15.0% )
# [5] - SKLPort5 ( 12.0% )
# [6] - SKLPort06 ( 23.0% )
# [7] - SKLPort015 ( 43.0% )
# Port23 at 100% → memory load bandwidth is the bottleneck, not computation
# This tells you: adding more FMA accumulators won't help; need non-temporal or prefetch
Optimization Decision Framework
Given profiling data, apply optimizations in this priority order:
1. Algorithmic complexity (O(N²) → O(N log N))
→ 100-10000× improvement; always first
2. Working set reduction (fits in smaller cache)
→ Cache blocking, SoA layout, remove unused data
→ 2-100× for memory-bound code
3. Vectorization (SIMD)
→ 4-16× on computation-bound loops
→ Only beneficial if already cache-friendly
4. ILP (multiple accumulators, dependency chain breaking)
→ 2-8× on latency-bound loops
→ Optimal accumulator count = latency / throughput
5. Instruction selection (LEA vs IMUL, reciprocal mul vs DIV)
→ 2-20× for specific instructions
→ Negligible if not in inner loop
6. Loop overhead (unrolling, loop fusion)
→ 1.1-2× for tight inner loops
→ Only worth it if loop overhead is measurable
7. Alignment and µop cache
→ 1.05-1.2× for large loops
→ Only worth it if perf shows frontend-bound cycles
🛠️ Toolchain Note: The complete optimization toolchain is:
perf stat(identify bottleneck type) →perf record+perf report(find hotspot) →perf annotate(instruction-level attribution) →llvm-mca(static analysis of candidate optimization) →RDTSCmicrobenchmark (verify improvement). Never skip the profiling step. Never assume the bottleneck without measurement.
Summary
Performance optimization is an empirical discipline. The tools — RDTSC, perf, llvm-mca, Cachegrind — give you the data to diagnose precisely rather than guess. The optimizations — multiple accumulators, SIMD, loop unrolling, instruction selection, code alignment — are a toolkit applied to the specific bottleneck that measurement reveals. The highest-value improvements come from fixing the right thing: a 100× speedup from an algorithmic change beats a 2× speedup from careful instruction scheduling every time. The workflow never changes: profile, identify, optimize, measure.