Chapter 33 Key Takeaways: Performance Analysis and Optimization
-
Profile before optimizing. Without measurement, it is nearly impossible to know which part of a program is slow. A function that takes 78% of execution time is worth 78× more optimization effort than a function taking 1%. Use
perf record/perf reportto find hotspots before writing a single line of optimized code. -
RDTSC is the highest-resolution timer available.
RDTSCreturns the CPU cycle counter with sub-nanosecond granularity. Wrap it withLFENCEto prevent out-of-order execution from corrupting measurements. Report the minimum across many trials (not the average) to eliminate OS interrupt noise. Run warm-up iterations before timing to ensure code and data are cached. -
perf statclassifies bottlenecks before code-level analysis. IPC < 1 indicates severe bottleneck (memory or branch). HighLLC-load-misses→ DRAM bound. Highbranch-misses→ branch misprediction bound. Highuops_dispatched_port.portN→ execution unit bottleneck. These metrics guide which optimization to apply. -
Multiple independent accumulators are the single most important optimization for latency-bound loops. A single accumulator creates a serial dependency chain where each iteration waits for the previous one to complete. With N independent accumulators, N dependency chains run simultaneously, hiding latency. The optimal number is
instruction_latency / reciprocal_throughput. For FMA (latency 4, throughput 0.5): 8 accumulators. -
VFMADD231PSis the most efficient floating-point reduction instruction. It computesdst += src1 * src2in one instruction with one rounding (better accuracy than separate MULPS + ADDPS) and latency 4. On modern Intel, two FMA units provide 2 FMA operations per cycle — up to 16 FLOPs/cycle for YMM (AVX2) or 32 FLOPs/cycle for ZMM (AVX-512) on a single core. -
Division by constant should always be replaced by reciprocal multiply.
DIVtakes 35–90 cycles. A constant divisor d can be converted toIMUL rax, magic_number+SHRin ~5 cycles. Compilers perform this transformation automatically at-O1; in assembly, compute the magic number manually or use libdivide. -
LEAcomputes small constant multiplications in 1 cycle vs.IMUL's 3 cycles.LEA rax, [rax + rax*2]computesrax * 3in 1 cycle on the address generation unit. Multiplications by 3, 5, 9, and combinations thereof can often be computed with 1-2 LEA instructions instead of one IMUL. -
Loop unrolling is only beneficial when combined with multiple independent accumulators. Unrolling without multiple accumulators reduces loop overhead (which is often negligible) but does not break the serial dependency chain. The essential optimization is the multiple accumulators; unrolling is secondary (it reduces the branch-per-element overhead and allows the CPU to see more independent work in each iteration).
-
CMOVeliminates branch misprediction for unpredictable conditionals. A 15–20 cycle misprediction penalty per branch translates to significant overhead for data-dependent conditionals (sorting, comparisons on random data).CMOVconverts the branch into a conditional data movement with no prediction cost. For predictable branches (> 95% one outcome), the branch predictor handles them nearly for free —CMOVthere adds a data dependency that may be slower. -
llvm-mcapredicts pipeline behavior without running code. It simulates the CPU pipeline for an annotated assembly block, reporting cycles per iteration, IPC, and per-port resource utilization. Use it to verify that a planned optimization will actually help before implementing it. It is especially useful for identifying port saturation (execution unit bottleneck) vs. memory latency bottleneck. -
Code alignment to 16-byte boundaries reduces frontend stalls. Hot loops should start at 16-byte (or 32-byte for AVX) boundaries so the CPU's instruction fetch window delivers the loop's first instructions without straddling a boundary. Use
ALIGN 16(NASM) before the loop label. The benefit is usually 5–15% for loops that straddle boundaries; it is irrelevant for loops already aligned or loops that fit in the µop cache. -
The optimization hierarchy: algorithm → data layout → vectorization → ILP → micro-optimization. An O(N²) algorithm cannot be made competitive with O(N log N) by micro-optimization. After algorithmic correctness, data layout (SoA, cache tiling, working set reduction) provides the largest gains for memory-bound code. SIMD provides 4–16× for compute-bound loops. Independent accumulators provide 2–8× for latency-bound loops. Instruction selection and alignment are last, providing 10–30%.