Chapter 33 Key Takeaways: Performance Analysis and Optimization

Open Assembly Language Project

Chapter 33 Key Takeaways: Performance Analysis and Optimization

Profile before optimizing. Without measurement, it is nearly impossible to know which part of a program is slow. A function that takes 78% of execution time is worth 78× more optimization effort than a function taking 1%. Use perf record/perf report to find hotspots before writing a single line of optimized code.
RDTSC is the highest-resolution timer available. RDTSC returns the CPU cycle counter with sub-nanosecond granularity. Wrap it with LFENCE to prevent out-of-order execution from corrupting measurements. Report the minimum across many trials (not the average) to eliminate OS interrupt noise. Run warm-up iterations before timing to ensure code and data are cached.
perf stat classifies bottlenecks before code-level analysis. IPC < 1 indicates severe bottleneck (memory or branch). High LLC-load-misses → DRAM bound. High branch-misses → branch misprediction bound. High uops_dispatched_port.portN → execution unit bottleneck. These metrics guide which optimization to apply.
Multiple independent accumulators are the single most important optimization for latency-bound loops. A single accumulator creates a serial dependency chain where each iteration waits for the previous one to complete. With N independent accumulators, N dependency chains run simultaneously, hiding latency. The optimal number is instruction_latency / reciprocal_throughput. For FMA (latency 4, throughput 0.5): 8 accumulators.
VFMADD231PS is the most efficient floating-point reduction instruction. It computes dst += src1 * src2 in one instruction with one rounding (better accuracy than separate MULPS + ADDPS) and latency 4. On modern Intel, two FMA units provide 2 FMA operations per cycle — up to 16 FLOPs/cycle for YMM (AVX2) or 32 FLOPs/cycle for ZMM (AVX-512) on a single core.
Division by constant should always be replaced by reciprocal multiply. DIV takes 35–90 cycles. A constant divisor d can be converted to IMUL rax, magic_number + SHR in ~5 cycles. Compilers perform this transformation automatically at -O1; in assembly, compute the magic number manually or use libdivide.
LEA computes small constant multiplications in 1 cycle vs. IMUL's 3 cycles. LEA rax, [rax + rax*2] computes rax * 3 in 1 cycle on the address generation unit. Multiplications by 3, 5, 9, and combinations thereof can often be computed with 1-2 LEA instructions instead of one IMUL.
Loop unrolling is only beneficial when combined with multiple independent accumulators. Unrolling without multiple accumulators reduces loop overhead (which is often negligible) but does not break the serial dependency chain. The essential optimization is the multiple accumulators; unrolling is secondary (it reduces the branch-per-element overhead and allows the CPU to see more independent work in each iteration).
CMOV eliminates branch misprediction for unpredictable conditionals. A 15–20 cycle misprediction penalty per branch translates to significant overhead for data-dependent conditionals (sorting, comparisons on random data). CMOV converts the branch into a conditional data movement with no prediction cost. For predictable branches (> 95% one outcome), the branch predictor handles them nearly for free — CMOV there adds a data dependency that may be slower.
llvm-mca predicts pipeline behavior without running code. It simulates the CPU pipeline for an annotated assembly block, reporting cycles per iteration, IPC, and per-port resource utilization. Use it to verify that a planned optimization will actually help before implementing it. It is especially useful for identifying port saturation (execution unit bottleneck) vs. memory latency bottleneck.
Code alignment to 16-byte boundaries reduces frontend stalls. Hot loops should start at 16-byte (or 32-byte for AVX) boundaries so the CPU's instruction fetch window delivers the loop's first instructions without straddling a boundary. Use ALIGN 16 (NASM) before the loop label. The benefit is usually 5–15% for loops that straddle boundaries; it is irrelevant for loops already aligned or loops that fit in the µop cache.
The optimization hierarchy: algorithm → data layout → vectorization → ILP → micro-optimization. An O(N²) algorithm cannot be made competitive with O(N log N) by micro-optimization. After algorithmic correctness, data layout (SoA, cache tiling, working set reduction) provides the largest gains for memory-bound code. SIMD provides 4–16× for compute-bound loops. Independent accumulators provide 2–8× for latency-bound loops. Instruction selection and alignment are last, providing 10–30%.