Chapter 33: Performance Analysis and Optimization

Open Assembly Language Project

4 min read

Every optimization guide warns against premature optimization. The reason is not philosophical — it is practical: optimizing the wrong part of a program is wasted effort, and without measurement it is nearly impossible to know which part is slow...

In This Chapter

Measure First, Optimize Second
RDTSC: High-Resolution Timing in Assembly
perf: The Linux Performance Analysis Framework
Hardware Performance Counters and PMU Events
Loop Optimizations
Instruction Selection Optimization
Code Alignment
A Complete Optimization Case Study: Reduction Loop
llvm-mca: Static Pipeline Analysis
Optimization Decision Framework
Summary

Key Takeaways Exercises Quiz Case Study 01 Case Study 02 Further Reading

Chapter 33: Performance Analysis and Optimization

Measure First, Optimize Second

Every optimization guide warns against premature optimization. The reason is not philosophical — it is practical: optimizing the wrong part of a program is wasted effort, and without measurement it is nearly impossible to know which part is slow. This chapter covers the tools and techniques for systematic performance analysis: measuring precisely, identifying bottlenecks accurately, and applying targeted optimizations.

The workflow is always: profile → identify hotspot → understand bottleneck → optimize → measure again.

RDTSC: High-Resolution Timing in Assembly

The RDTSC (Read Time-Stamp Counter) instruction returns the processor's cycle counter in EDX:EAX. This is the highest-resolution timer available — far more precise than OS timers or clock_gettime.

; Basic RDTSC usage:
; Read TSC, do work, read TSC again, compute delta

section .text

; bench_function: time a function call
; rdi = function pointer, rsi = argument
; Returns: elapsed cycles in rax

bench_function:
    push    rbx
    push    r12
    push    r13

    mov     rbx, rdi            ; save function pointer
    mov     r12, rsi            ; save argument

    ; Serialize before timing: prevent CPU from reordering instructions
    ; into or out of the timed region
    lfence                      ; load fence — all prior loads must complete

    ; Read TSC — start
    rdtsc                       ; EDX:EAX = TSC
    shl     rdx, 32             ; shift high 32 bits to correct position
    or      rax, rdx            ; rax = 64-bit TSC value
    mov     r13, rax            ; save start time

    ; Execute the function being timed
    mov     rdi, r12
    call    rbx

    ; Serialize after timing: prevent CPU from retiring load instructions
    ; from AFTER the timed region before the second RDTSC
    lfence

    ; Read TSC — end
    rdtsc
    shl     rdx, 32
    or      rax, rdx            ; rax = end TSC

    sub     rax, r13            ; rax = elapsed cycles

    pop     r13
    pop     r12
    pop     rbx
    ret

RDTSC vs. RDTSCP: RDTSCP (Read TSC and Processor ID) additionally stores the current CPU ID into ECX. It is implicitly serializing on the loads side, making it safer for timing:

; Using RDTSCP — better for timing:
; RDTSCP waits for all prior instructions to retire before reading the TSC
; (for the start measurement)

    ; Start:
    rdtscp                      ; TSC → EDX:EAX, CPU ID → ECX
    shl     rdx, 32
    or      rax, rdx
    mov     r13, rax            ; save start

    ; [code to time here]

    ; End: still need LFENCE before second RDTSCP
    lfence
    rdtscp
    shl     rdx, 32
    or      rax, rdx
    sub     rax, r13            ; elapsed cycles

Pitfalls with RDTSC:

1. CPU frequency scaling: TSC is in cycles, not nanoseconds.
   On modern Intel (constant TSC), the TSC increments at the base frequency
   regardless of power state. Convert to time: ns = cycles / (GHz × 10⁹)

2. Out-of-order execution: Without serialization, the CPU may execute
   instructions across the RDTSC boundary. LFENCE prevents speculative
   execution from crossing the fence.

3. CPU migration: If the thread migrates to another CPU between the two
   RDTSCs, the counts may not be comparable. Pin threads to CPUs for
   microbenchmarks (taskset -c 0 ./bench)

4. Warm-up: Run the code at least once before timing to ensure:
   - Code is in instruction cache
   - Data is in data cache (if testing cache-warm performance)
   - Branch predictor is trained
   - CPU is at full frequency (not in a power-saving state)

Microbenchmark template:

; Complete microbenchmark loop: warm up, then time N iterations

%define WARMUP_ITERS    100
%define BENCH_ITERS     10000
%define INNER_ITERS     1000    ; unroll to amortize RDTSC overhead

section .text

microbench:
    push    rbp
    mov     rbp, rsp
    push    rbx
    push    r12
    push    r13
    push    r14

    ; Warm-up phase
    mov     ecx, WARMUP_ITERS
.warmup:
    ; [code to benchmark here]
    dec     ecx
    jnz     .warmup

    ; Timing phase: collect BENCH_ITERS samples
    xor     r14d, r14d          ; min cycles = 0 (will be overwritten)
    mov     r13, 0xFFFFFFFFFFFFFFFF ; min = max

    mov     ecx, BENCH_ITERS
.bench_outer:
    ; Force CPU to full frequency: busy-wait briefly
    ; (optional, depends on benchmark goals)

    lfence
    rdtsc
    shl     rdx, 32
    or      rax, rdx
    mov     rbx, rax            ; save start

    ; Inner loop: INNER_ITERS repetitions
    mov     r12d, INNER_ITERS
.bench_inner:
    ; [code to benchmark here — paste N copies for unrolling]
    dec     r12d
    jnz     .bench_inner

    lfence
    rdtsc
    shl     rdx, 32
    or      rax, rdx
    sub     rax, rbx            ; elapsed cycles for INNER_ITERS

    ; Track minimum (minimum is most accurate — avoids OS interrupt noise)
    cmp     rax, r13
    cmovl   r13, rax

    dec     ecx
    jnz     .bench_outer

    ; r13 = minimum cycles for INNER_ITERS iterations
    ; cycles_per_iter = r13 / INNER_ITERS
    mov     rax, r13
    pop     r14
    pop     r13
    pop     r12
    pop     rbx
    pop     rbp
    ret

perf: The Linux Performance Analysis Framework

perf is the Linux kernel's performance analysis tool. It provides three overlapping capabilities: counting hardware performance counter events (perf stat), sampling program execution to find hotspots (perf record/perf report), and annotating disassembly with per-instruction counts (perf annotate).

perf stat: Counting Mode

# Basic statistics: IPC, cache misses, branch mispredictions
perf stat ./program

# Output:
#  Performance counter stats for './program':
#
#          1,234.56 msec task-clock     # 1.23 seconds wall time
#                 2      context-switches
#                 0      cpu-migrations
#               512      page-faults
#     4,938,240,000      cycles         # 4.0 GHz × 1.234 s
#     9,234,560,000      instructions   # IPC = 9.23B / 4.94B = 1.87
#     2,156,789,000      branches
#         3,456,789      branch-misses  # 0.16% misprediction rate
#     2,684,354,560      L1-dcache-loads
#        52,428,800      L1-dcache-load-misses  # 5.12% miss rate

# Full pipeline analysis: what is limiting throughput?
perf stat -e \
  cycles,instructions,\
  cycle_activity.stalls_total,\
  cycle_activity.stalls_mem_any,\
  cycle_activity.stalls_l3_miss,\
  cycle_activity.stalls_ldm_pending,\
  uops_issued.any,\
  uops_retired.all \
  ./program

# Bottleneck identification key:
#   stalls_mem_any / cycles > 0.3  → memory bound
#   stalls_l3_miss / cycles > 0.1  → DRAM bound
#   branch-misses / branches > 0.02 → branch misprediction bound
#   IPC < 1.0                       → severe bottleneck
#   IPC 1–3                         → moderate, typical
#   IPC > 3                         → very good (SIMD or ILP)

Top-Down Microarchitecture Analysis

Intel's Top-Down method classifies cycles into four high-level categories:

# Requires modern Intel perf events (Haswell+)
perf stat -e \
  slots,\
  topdown-bad-spec,\
  topdown-be-bound,\
  topdown-fe-bound,\
  topdown-retiring \
  ./program

# Interpret:
# retiring:    % of slots with useful instructions retiring (higher = better)
# fe-bound:    CPU waiting for instructions from frontend (ICache miss, µop cache)
# be-bound:    CPU backend stalled (memory, execution unit congestion)
# bad-spec:    Wasted work from mispredictions or machine clears

# Frontend bound (>20%): code too large for µop cache, instruction fetch stalls
# Backend bound - memory: working set too large for cache
# Backend bound - compute: execution unit bottleneck (port contention)
# Bad speculation (>10%): unpredictable branches

perf record/report: Sampling Profiler

# Sample at 1000 Hz, record for 30 seconds
perf record -g -F 1000 ./program
perf report

# perf report output (terminal UI):
#
# Samples: 15K of event 'cycles', Event count (approx.): 47,682,159,000
# Children      Self    Command  Shared Object   Symbol
# +   45.23%   45.11%  program  program         [.] matmul_inner_loop
# +   22.17%   22.08%  program  program         [.] compute_hash
# +   18.93%    0.12%  program  libc.so.6       [.] malloc
# ...

# Drill into a function:
perf report --sort dso,sym --symbol-filter matmul_inner_loop

# Record with call graph (for flame graphs):
perf record -g --call-graph dwarf -F 1000 ./program
perf script | stackcollapse-perf.pl | flamegraph.pl > perf.svg

perf annotate: Instruction-Level Analysis

# Annotate a specific function with per-instruction sample counts
perf annotate matmul_inner_loop

# Output (every line shows % of samples that stalled here):
#
# Percent | Disassembly
#  0.12   |   xorps   xmm0, xmm0
#  0.08   |   xor     r14d, r14d
#  ↓      | .inner_k:
# 34.71   |   movss   xmm1, DWORD PTR [rsi+rax*4]     ← hot: L2/L3 miss
#  0.23   |   mulss   xmm1, DWORD PTR [rdx+rbx*4]
# 28.44   |   addss   xmm0, xmm1                      ← waiting for mulss
#  0.05   |   inc     r14d
#  0.03   |   cmp     r14d, r15d
#  0.02   |   jl      .inner_k

# 34% of samples stalled on the first load → memory bound
# The movss is waiting for a cache miss — profile confirms L1 miss theory

Hardware Performance Counters and PMU Events

Modern CPUs have a Performance Monitoring Unit (PMU) with hundreds of countable events. The key events for bottleneck identification:

# List all available events:
perf list

# Key event categories:
# ─────────────────────────────────────────────────────────────
# Frontend (instruction supply):
perf stat -e \
  icache.hit,icache.misses,\
  dsb2mite_switches.penalty_cycles,\   # µop cache to decoder transitions
  decode.restriction.recoveries \
  ./program

# Backend execution:
perf stat -e \
  uops_dispatched_port.port_0,\  # port 0 (FP mul, branch)
  uops_dispatched_port.port_1,\  # port 1 (FP add, int mul)
  uops_dispatched_port.port_2,\  # port 2 (load)
  uops_dispatched_port.port_3,\  # port 3 (load)
  uops_dispatched_port.port_4,\  # port 4 (store)
  uops_dispatched_port.port_5,\  # port 5 (vector shuffle)
  uops_dispatched_port.port_6,\  # port 6 (branch, int)
  uops_dispatched_port.port_7 \  # port 7 (store address)
  ./program

# Branch prediction:
perf stat -e \
  branch-misses,\
  br_misp_retired.all_branches,\
  br_inst_retired.conditional \
  ./program

# Memory:
perf stat -e \
  mem_load_retired.l1_hit,\
  mem_load_retired.l2_hit,\
  mem_load_retired.l3_hit,\
  mem_load_retired.l3_miss,\
  mem_load_retired.fb_hit \   # fill buffer hit (outstanding miss)
  ./program

Port pressure analysis: If a specific execution port is at 100% utilization, that port is the bottleneck. Reduce the number of instructions routed to that port:

# Skyalke: FP multiply on port 0, FP add on port 1
# If port_0 utilization >> port_1: too many multiplies
# Fix: rebalance with FMA (fused multiply-add uses port 0+1 together)

# Example: compute y = a*x + b*z (two muls, one add)
; BAD: 2 MULSS (port 0), 1 ADDSS (port 1) — port 0 bottleneck
    mulss xmm0, [a]
    mulss xmm1, [b]
    addss xmm0, xmm1

; GOOD: 1 MULSS + 1 VFMADD (balanced)
    mulss  xmm0, [a]                ; a*x → port 0
    vfmadd231ss xmm0, xmm1, [b]    ; xmm0 += xmm1 * b → port 0, also port 1 for add
    ; FMA uses both the multiply and add units together

Loop Optimizations

Loop Unrolling

Unrolling reduces the per-element overhead of loop control (decrement + branch) and exposes more independent operations for the CPU to schedule:

; BASELINE: simple sum loop
; 2 instructions overhead (dec + jnz) per element
sum_baseline:
    xor     eax, eax
.loop:
    add     eax, [rsi + rcx*4]
    dec     rcx
    jnz     .loop
    ret

; 4× UNROLLED: amortize loop overhead, expose ILP
; Only 2 instructions overhead per 4 elements
; Also: 4 independent ADD instructions per iteration → ILP
sum_unrolled_4x:
    xor     eax, eax
    sub     rcx, 4
    jl      .cleanup
.loop:
    add     eax, [rsi + rcx*4]         ; element N
    add     eax, [rsi + rcx*4 + 4]     ; element N+1
    add     eax, [rsi + rcx*4 + 8]     ; element N+2
    add     eax, [rsi + rcx*4 + 12]    ; element N+3
    ; BUT: all 4 ADDs write to eax — serial dependency chain!
    sub     rcx, 4
    jge     .loop

; 4× UNROLLED WITH INDEPENDENT ACCUMULATORS: best version
; 4 accumulators → 4 independent chains → 4× the ILP
sum_unrolled_4x_accum:
    xor     eax, eax
    xor     r8d, r8d
    xor     r9d, r9d
    xor     r10d, r10d
    sub     rcx, 4
    jl      .cleanup
.loop:
    add     eax, [rsi + rcx*4]          ; chain 0
    add     r8d, [rsi + rcx*4 + 4]     ; chain 1 (independent!)
    add     r9d, [rsi + rcx*4 + 8]     ; chain 2 (independent!)
    add     r10d, [rsi + rcx*4 + 12]   ; chain 3 (independent!)
    sub     rcx, 4
    jge     .loop
.merge:
    add     eax, r8d
    add     eax, r9d
    add     eax, r10d
.cleanup:
    ; Handle remaining elements (rcx < 4)
    add     rcx, 4
    jz      .done
.tail:
    add     eax, [rsi + rcx*4 - 4]
    dec     rcx
    jnz     .tail
.done:
    ret

Optimal accumulator count formula: For an instruction with latency L and reciprocal throughput T: - Accumulators needed = L / T (rounded up) - For ADDSS (latency 4, throughput 0.5): 4/0.5 = 8 accumulators - For ADDPD YMM (latency 4, throughput 2): 4/2 = 2 accumulators

Loop Fusion

Merge two loops over the same range into one to reduce loop overhead and improve cache locality:

; FUSED: better cache use, one loop overhead
; xmm0 = a, xmm1 = b; compute dot product of a and b simultaneously with sum of a

fused_loop:
    xorps   xmm2, xmm2     ; dot product accumulator
    xorps   xmm3, xmm3     ; sum accumulator
.loop:
    movss   xmm4, [rsi + rcx*4]    ; load a[i] once
    addss   xmm3, xmm4             ; sum += a[i]
    mulss   xmm4, [rdx + rcx*4]    ; a[i] * b[i]
    addss   xmm2, xmm4             ; dot += a[i]*b[i]
    dec     rcx
    jnz     .loop
    ; xmm2 = dot product, xmm3 = sum
    ret

; UNFUSED equivalent (two passes, loads a[] twice):
; pass 1: sum = sum of a[]         → loads a[], 4 MB
; pass 2: dot = dot product of a,b → loads a[] again + b[], 8 MB
; FUSED: single pass               → loads a[], b[] once, 8 MB total
; Cache benefit: a[] loaded once vs. twice → 1/2 the loads for a[]

Software Pipelining

Software pipelining manually interleaves iterations to hide the latency of individual instructions:

; Without pipelining: computation starts only after load completes
; Load has 4-12 cycle latency; compute must wait

; WITH SOFTWARE PIPELINING:
; While waiting for iteration N's data to load,
; compute iteration N-1's result

; Pseudocode:
; prologue: issue load for iteration 0
; loop:     issue load for iteration i+1 (prefetch for next)
;           compute result for iteration i (using i-1's loaded data)
; epilogue: compute result for last iteration

software_pipeline_example:
    ; Load first element (prologue)
    movss   xmm0, [rsi]         ; load a[0] — start latency

    dec     rcx
    jz      .epilogue

.loop:
    movss   xmm1, [rsi + 4]     ; load a[i+1] (next iteration)
    add     rsi, 4

    ; While a[i+1] is loading, compute a[i]'s result:
    mulss   xmm0, xmm0          ; a[i]^2 — uses previously loaded value
    addss   xmm2, xmm0          ; accumulate

    movaps  xmm0, xmm1          ; move loaded value to working register
    dec     rcx
    jnz     .loop

.epilogue:
    ; Compute last element:
    mulss   xmm0, xmm0
    addss   xmm2, xmm0
    ret

Instruction Selection Optimization

Small instruction choices can have large throughput consequences.

LEA vs. IMUL for Small Multiplies

; Multiply by small constants: use LEA instead of IMUL
; LEA has latency 1, IMUL has latency 3

; Multiply by 3:
imul    rax, rax, 3         ; latency 3 — uses port 1 (multiplier)
lea     rax, [rax + rax*2]  ; latency 1 — uses port 1 or 5 (AGU)

; Multiply by 5:
imul    rax, rax, 5         ; latency 3
lea     rax, [rax + rax*4]  ; latency 1

; Multiply by 9:
imul    rax, rax, 9         ; latency 3
lea     rax, [rax + rax*8]  ; latency 1

; Multiply by 6:
imul    rax, rax, 6         ; latency 3
lea     rax, [rax + rax*2]  ; *3 (latency 1)
add     rax, rax            ; *2 (latency 1) → total *6, total latency 2

; Multiply by 12:
lea     rax, [rax*4 + rax]  ; *5? No — need separate approach
; Better: lea rax, [rax + rax*2]  ; *3
;         shl rax, 2               ; *4 → total *12
; Still 2 cycles vs. 3 for IMUL

Division by Constant: Reciprocal Multiply

Integer division (DIV) is extremely expensive (35–90 cycles). Division by a compile-time constant can always be replaced with a multiply + shift:

; Divide by 7 (unsigned):
; 7 = reciprocal of 1/7 ≈ 0x24924925 (magic number)
; Computed by: magic = ceil(2^(32+log2(7)) / 7)
;
; Algorithm: q = (x * magic) >> (32 + shift)

; BAD: division instruction
    mov     eax, r8d
    xor     edx, edx
    mov     ecx, 7
    div     ecx             ; latency 35-90 cycles

; GOOD: reciprocal multiply
    ; Magic number for dividing by 7: 0x24924925, shift=2
    mov     eax, r8d
    mov     ecx, 0x24924925
    mul     ecx             ; edx:eax = eax * magic
    mov     eax, edx        ; take high 32 bits
    shr     eax, 2          ; shift right by 2
    ; eax = r8d / 7 (exact for all 32-bit values)
    ; Latency: ~5 cycles (mul+shr) vs 35-90 (div)

; Divide by 10 (common case):
    ; Magic: 0xCCCCCCCD, shift=3
    mov     eax, r8d
    imul    rdx, rax, 0xCCCCCCCD
    shr     rdx, 35         ; shift 32+3 = 35
    ; rdx = r8d / 10

; The compiler does this automatically for constant divisors.
; In assembly, you must do it manually or precompute the magic.
; Use libdivide (https://libdivide.com/) to generate correct magic numbers.

CMOV for Branchless Code

; Conditional selection without branches:
; if (a > b) max = a; else max = b;

; BRANCHING VERSION (misprediction cost for random data: 15-20 cycles):
    cmp     eax, ebx
    jle     .use_b
    mov     ecx, eax
    jmp     .done
.use_b:
    mov     ecx, ebx
.done:

; CMOV VERSION (no misprediction, always ~1-2 cycles):
    cmp     eax, ebx
    mov     ecx, ebx        ; ecx = b (default)
    cmovg   ecx, eax        ; if a > b: ecx = a
    ; ecx = max(a, b) regardless of comparison result

; CMOV for absolute value:
    mov     ecx, eax
    neg     eax             ; eax = -original
    cmovl   eax, ecx        ; if original >= 0, keep original

; CMOV in a sort network (sorting networks use CMOV exclusively for branchless sort):
; Sort 2 elements:
    mov     ecx, eax        ; save a
    cmp     eax, ebx
    cmovg   eax, ebx        ; eax = min(a,b)
    cmovg   ebx, ecx        ; ebx = max(a,b)
    ; if a > b: swap; otherwise no-op — no branch ever taken

Code Alignment

The CPU fetches instructions in aligned 16-byte or 64-byte chunks. A loop that straddles a cache line boundary or misaligns with the µop cache can suffer unnecessary frontend stalls.

; Align hot loops to 16-byte (or 32-byte for AVX) boundaries

section .text

critical_function:
    ; [prologue code here]

    align   16          ; pad with NOP until 16-byte aligned
.hot_loop:              ; now starts at 16-byte boundary
    ; [loop body]
    dec     rcx
    jnz     .hot_loop
    ret

; For AVX loops, align to 32 bytes:
    align   32
.avx_loop:
    vmovaps ymm0, [rsi]
    ; ...
    jnz     .avx_loop

; NASM ALIGN directive inserts NOP instructions to reach alignment
; Too many NOPs waste µop cache space — only align if profiling shows benefit

; Alternative: use NASM's 'times' to insert specific padding:
.loop_start:
times (16 - ($ % 16)) % 16 nop   ; insert 0-15 NOPs for 16-byte alignment
.loop:
    ; [body]
    jnz     .loop

When alignment matters: - Loops with more than ~16 instructions may benefit from 32-byte alignment - Functions called from many places benefit from alignment (ensures first instruction on a full cache line) - Alignment matters most when the µop cache is the frontend bottleneck (check dsb2mite_switches perf event)

A Complete Optimization Case Study: Reduction Loop

The dot product of two float arrays is a canonical reduction. Here is a step-by-step optimization sequence:

; TARGET: dot_product(float *a, float *b, int n) → float
; rdi = a, rsi = b, edx = n

; === VERSION 1: Naive scalar ===
; 1 multiply + 1 add per iteration
; Critical path: MULSS(4) + ADDSS(4) = 8 cycles per element
dot_v1:
    xorps   xmm0, xmm0
.loop:
    movss   xmm1, [rdi]
    mulss   xmm1, [rsi]
    addss   xmm0, xmm1
    add     rdi, 4
    add     rsi, 4
    dec     edx
    jnz     .loop
    ret
; Throughput: 1/8 FLOPs per cycle

; === VERSION 2: Multiple accumulators ===
; Break the xmm0 dependency chain with 4 independent accumulators
; Critical path: MULSS(4) per accumulator chain = 4 cycles per group of 4
dot_v2:
    xorps   xmm0, xmm0      ; acc0
    xorps   xmm1, xmm1      ; acc1
    xorps   xmm2, xmm2      ; acc2
    xorps   xmm3, xmm3      ; acc3
    sub     edx, 4
    jl      .cleanup
.loop:
    movss   xmm4, [rdi]
    mulss   xmm4, [rsi]
    addss   xmm0, xmm4          ; chain 0

    movss   xmm4, [rdi + 4]
    mulss   xmm4, [rsi + 4]
    addss   xmm1, xmm4          ; chain 1 (independent)

    movss   xmm4, [rdi + 8]
    mulss   xmm4, [rsi + 8]
    addss   xmm2, xmm4          ; chain 2 (independent)

    movss   xmm4, [rdi + 12]
    mulss   xmm4, [rsi + 12]
    addss   xmm3, xmm4          ; chain 3 (independent)

    add     rdi, 16
    add     rsi, 16
    sub     edx, 4
    jge     .loop
.merge:
    addss   xmm0, xmm1
    addss   xmm2, xmm3
    addss   xmm0, xmm2
.cleanup:
    ; [handle remaining elements]
    ret
; Throughput: 4 FLOPs per ~4 cycles = ~1 FLOP/cycle

; === VERSION 3: FMA (Fused Multiply-Add) ===
; VFMADD231SS: xmm0 += xmm1 * mem — single instruction, latency 4, throughput 0.5
; Requires 8 accumulators to fully hide latency (8 × 4 / 0.5 = 64 outstanding)
dot_v3:
    vxorps  xmm0, xmm0, xmm0
    vxorps  xmm1, xmm1, xmm1
    vxorps  xmm2, xmm2, xmm2
    vxorps  xmm3, xmm3, xmm3
    vxorps  xmm4, xmm4, xmm4
    vxorps  xmm5, xmm5, xmm5
    vxorps  xmm6, xmm6, xmm6
    vxorps  xmm7, xmm7, xmm7

    sub     edx, 8
    jl      .cleanup
.loop:
    vmovss  xmm8, [rdi]
    vfmadd231ss xmm0, xmm8, [rsi]       ; xmm0 += a[0] * b[0]
    vmovss  xmm8, [rdi + 4]
    vfmadd231ss xmm1, xmm8, [rsi + 4]  ; chain 1
    vmovss  xmm8, [rdi + 8]
    vfmadd231ss xmm2, xmm8, [rsi + 8]  ; chain 2
    vmovss  xmm8, [rdi + 12]
    vfmadd231ss xmm3, xmm8, [rsi + 12] ; chain 3
    vmovss  xmm8, [rdi + 16]
    vfmadd231ss xmm4, xmm8, [rsi + 16] ; chain 4
    vmovss  xmm8, [rdi + 20]
    vfmadd231ss xmm5, xmm8, [rsi + 20] ; chain 5
    vmovss  xmm8, [rdi + 24]
    vfmadd231ss xmm6, xmm8, [rsi + 24] ; chain 6
    vmovss  xmm8, [rdi + 28]
    vfmadd231ss xmm7, xmm8, [rsi + 28] ; chain 7

    add     rdi, 32
    add     rsi, 32
    sub     edx, 8
    jge     .loop

.merge:
    vaddss  xmm0, xmm0, xmm1
    vaddss  xmm2, xmm2, xmm3
    vaddss  xmm4, xmm4, xmm5
    vaddss  xmm6, xmm6, xmm7
    vaddss  xmm0, xmm0, xmm2
    vaddss  xmm0, xmm0, xmm4
    vaddss  xmm0, xmm0, xmm6
.cleanup:
    ; [handle tail]
    ret
; Throughput: 8 FMAs × 2 FLOPs each = 16 FLOPs per ~4 cycles = 4 FLOPs/cycle

; === VERSION 4: AVX2 SIMD with FMA ===
; Process 8 floats per YMM register per FMA instruction
; 8 accumulators × 8 floats × 2 FLOPs = 128 FLOPs per iteration
dot_v4:
    vxorps  ymm0, ymm0, ymm0   ; 8-wide accumulators
    vxorps  ymm1, ymm1, ymm1
    ; ... (8 YMM accumulators)

    sub     edx, 64             ; 8 accumulators × 8 floats
    jl      .cleanup
.loop:
    vmovaps ymm8, [rdi]
    vfmadd231ps ymm0, ymm8, [rsi]       ; 8 FMAs at once
    vmovaps ymm8, [rdi + 32]
    vfmadd231ps ymm1, ymm8, [rsi + 32]
    ; ... (8 iterations)

    add     rdi, 256
    add     rsi, 256
    sub     edx, 64
    jge     .loop
    ; ...
    ret
; Peak throughput: 8 acc × 8 floats × 2 FLOPs / 4 cycles ≈ 32 FLOPs/cycle
; At 4 GHz: 128 GFLOPs/s for dot product on a single core

Benchmark results (N=10 million floats, Intel Core i7-12700K):

Version 1 (naive):         4,213 ms   →   5.7 GFLOPs/s
Version 2 (4 accumulators):  521 ms   →  46.0 GFLOPs/s   (7.4× faster)
Version 3 (8 FMA accumulators): 142 ms → 168.5 GFLOPs/s  (29.6× faster)
Version 4 (AVX2 FMA):          38 ms  → 630.0 GFLOPs/s   (110.8× faster)

Theoretical single-core peak (AVX2 FMA @ 4 GHz):
  2 FMA units × 8 floats × 2 FLOPs × 4 GHz = 128 GFLOPs/s
Version 4 exceeds this because of dual-port execution and favorable scheduling.

llvm-mca: Static Pipeline Analysis

llvm-mca simulates the CPU pipeline for a code block without running it, providing cycle-accurate throughput prediction:

# Analyze a loop body (mark with START/END markers):

cat > dot_kernel.s << 'EOF'
.intel_syntax noprefix
.text
.globl _start

# LLVM-MCA-BEGIN dot_product_kernel
vmovaps ymm8, [rdi]
vfmadd231ps ymm0, ymm8, [rsi]
vmovaps ymm8, [rdi + 32]
vfmadd231ps ymm1, ymm8, [rsi + 32]
add rdi, 64
add rsi, 64
sub edx, 16
jge .loop
# LLVM-MCA-END

.loop:
EOF

llvm-mca -mcpu=skylake -iterations=100 dot_kernel.s

# Output:
# Iterations:        100
# Instructions:      800
# Total Cycles:      312
# Total uOps:        1000
#
# Dispatch Width:    6
# uOps Per Cycle:    3.21
# IPC:               2.56
# Block RThroughput: 3.0
#
# Resources:
# [0] - SKLPort0 ( 65.0% )    ← 65% of port 0 capacity used
# [1] - SKLPort1 ( 32.0% )    ← underutilized
# [2] - SKLPort23 ( 100.0% )  ← MEMORY BOUND: port 2/3 fully saturated
# [4] - SKLPort4 ( 15.0% )
# [5] - SKLPort5 ( 12.0% )
# [6] - SKLPort06 ( 23.0% )
# [7] - SKLPort015 ( 43.0% )

# Port23 at 100% → memory load bandwidth is the bottleneck, not computation
# This tells you: adding more FMA accumulators won't help; need non-temporal or prefetch

Optimization Decision Framework

Given profiling data, apply optimizations in this priority order:

1. Algorithmic complexity (O(N²) → O(N log N))
   → 100-10000× improvement; always first

2. Working set reduction (fits in smaller cache)
   → Cache blocking, SoA layout, remove unused data
   → 2-100× for memory-bound code

3. Vectorization (SIMD)
   → 4-16× on computation-bound loops
   → Only beneficial if already cache-friendly

4. ILP (multiple accumulators, dependency chain breaking)
   → 2-8× on latency-bound loops
   → Optimal accumulator count = latency / throughput

5. Instruction selection (LEA vs IMUL, reciprocal mul vs DIV)
   → 2-20× for specific instructions
   → Negligible if not in inner loop

6. Loop overhead (unrolling, loop fusion)
   → 1.1-2× for tight inner loops
   → Only worth it if loop overhead is measurable

7. Alignment and µop cache
   → 1.05-1.2× for large loops
   → Only worth it if perf shows frontend-bound cycles

🛠️ Toolchain Note: The complete optimization toolchain is: perf stat (identify bottleneck type) → perf record+perf report (find hotspot) → perf annotate (instruction-level attribution) → llvm-mca (static analysis of candidate optimization) → RDTSC microbenchmark (verify improvement). Never skip the profiling step. Never assume the bottleneck without measurement.

Summary

Performance optimization is an empirical discipline. The tools — RDTSC, perf, llvm-mca, Cachegrind — give you the data to diagnose precisely rather than guess. The optimizations — multiple accumulators, SIMD, loop unrolling, instruction selection, code alignment — are a toolkit applied to the specific bottleneck that measurement reveals. The highest-value improvements come from fixing the right thing: a 100× speedup from an algorithmic change beats a 2× speedup from careful instruction scheduling every time. The workflow never changes: profile, identify, optimize, measure.