Chapter 31 Key Takeaways: The Modern CPU Pipeline

  • x86-64 instructions are decoded into µops. The CPU does not execute CISC instructions directly; they are translated into simpler, fixed-format micro-operations. Simple instructions produce 1 µop; complex ones (PUSH, POP, complex addressing) produce 2–4; microcoded instructions (DIV, REP string ops) produce many more.

  • The µop cache (Decoded Stream Buffer) bypasses the decoders for recently executed code. A hot loop that fits in the µop cache (~1,500 µops on modern Intel) runs faster than one that must be decoded each iteration; loops that exceed it fall back to the legacy decoders and pay the decode cost every iteration.

  • Register renaming eliminates false data hazards. The CPU maintains 280+ physical registers, mapping multiple successive writes to architectural registers (RAX, RBX, etc.) to different physical registers. This allows instructions that write the same register to execute in parallel, eliminating WAW and WAR hazards.

  • Latency and throughput are different properties. Latency is the number of cycles until a result is available to dependent instructions; reciprocal throughput is the average interval between issues of independent instances of the same instruction. ADD has latency 1 and reciprocal throughput 0.25 (four per cycle). IMUL has latency 3 and reciprocal throughput 1. DIV has latency 35–90 and reciprocal throughput 21–74.
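
The difference can be made concrete with back-of-envelope arithmetic. A minimal sketch using the IMUL numbers from the text (the helper names are illustrative, not from the chapter):

```c
#include <assert.h>

/* Cycle estimates for n multiplies, using the IMUL figures quoted in
   the text: latency 3 cycles, reciprocal throughput 1 cycle. */
enum { IMUL_LATENCY = 3, IMUL_RTHROUGHPUT = 1 };

/* n dependent multiplies form a chain: each waits on the previous
   result, so the loop is latency-bound. */
long chained_cycles(long n)     { return n * IMUL_LATENCY; }

/* n independent multiplies can overlap, so the loop is bound only by
   how often the multiplier accepts a new instruction. */
long independent_cycles(long n) { return n * IMUL_RTHROUGHPUT; }
```

For 100 multiplies, the dependent chain needs roughly 300 cycles while the independent version needs roughly 100: a 3x difference from dependency structure alone.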

  • Dependency chains limit ILP. When instructions form a chain (each depending on the previous result), the critical path length = chain length × instruction latency. Independent instructions execute in parallel. Breaking long chains by using multiple independent accumulators is one of the highest-value optimizations.
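
The multiple-accumulator technique can be sketched in plain C; the function names here are illustrative, not from the chapter:

```c
#include <stdint.h>
#include <stddef.h>

/* One long dependency chain: every add waits on the previous sum, so
   throughput is limited by ADD latency. */
uint64_t sum_chained(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: the adds within one iteration do not
   depend on each other, so the CPU can execute them in parallel. */
uint64_t sum_unrolled(const uint64_t *a, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions compute the same sum; the second merely restructures the dependency graph so four chains run side by side instead of one.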

  • Branch mispredictions cost 15–20 cycles on modern deep pipelines. The branch predictor makes a guess before the condition is evaluated. For predictable branches (loop counters, simple conditionals), the predictor works well. For data-dependent, unpredictable branches, consider CMOV (conditional move) for branchless code.

  • CMOV is best for unpredictable branches. Replacing an unpredictable conditional jump with CMOV eliminates the misprediction penalty. For predictable branches, the predictor provides near-zero-cost branching — CMOV adds a data dependency that may be worse.
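
As a sketch of the branchless pattern: a side-effect-free conditional expression is the form compilers commonly lower to CMOV at typical optimization levels (the function names are illustrative):

```c
#include <stdint.h>

/* Branchy form: if the comparison is unpredictable, each mispredicted
   jump costs the 15-20 cycle penalty described above. */
int64_t max_branchy(int64_t a, int64_t b) {
    if (a > b)
        return a;
    return b;
}

/* Branchless form: compilers typically lower this ternary to a CMOV,
   trading the misprediction risk for a fixed data dependency. */
int64_t max_branchless(int64_t a, int64_t b) {
    return (a > b) ? a : b;
}
```

Whether the compiler actually emits CMOV depends on the target and optimization settings, which is why checking the generated assembly (or llvm-mca output) matters for this transformation.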

  • IPC (Instructions Per Cycle) is the key throughput metric. Use perf stat to measure it. IPC below 1 indicates a severe bottleneck (usually memory stalls or branch mispredictions); IPC of 1–3 is typical; IPC above 3 generally requires SIMD or exceptional ILP.

  • The Reorder Buffer (ROB) enables out-of-order execution while maintaining in-order retirement. With 512 entries in modern Intel CPUs, up to 512 µops can be in flight simultaneously. All retire in program order to preserve correctness. The ROB size determines the "out-of-order window" — how far ahead the CPU can look for independent instructions.

  • Spectre demonstrates that speculative execution crosses security boundaries. Speculatively accessed memory is not rolled back from the cache, even when the speculation itself is squashed. LFENCE mitigates Spectre v1 by preventing speculative loads from executing past the fence, but it serializes the pipeline, typically costing 10–30% on code paths that need protection.
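
A sketch of the LFENCE pattern, assuming x86-64 and the `_mm_lfence` intrinsic from `immintrin.h` (the `read_checked` function and `table` array are hypothetical illustrations, with a no-op fallback so the code compiles on other targets):

```c
#include <stdint.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SPECULATION_BARRIER() _mm_lfence()
#else
#define SPECULATION_BARRIER() ((void)0)   /* fallback for non-x86 builds */
#endif

static uint8_t table[16];   /* hypothetical secret-adjacent array */

/* Spectre v1-style gadget with the mitigation applied: the LFENCE
   keeps the load from executing until the bounds check has actually
   resolved, so a mispredicted branch cannot load table[i] speculatively. */
uint8_t read_checked(size_t i, size_t len) {
    if (i < len) {
        SPECULATION_BARRIER();   /* no loads move past this point */
        return table[i];
    }
    return 0;
}
```

The fence goes between the check and the dependent load; placing it anywhere else leaves the speculative window open.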

  • Agner Fog's instruction tables are the practical reference. Available at agner.org/optimize, they list latency, throughput, and port assignments for every instruction on every major CPU microarchitecture. Reading them alongside llvm-mca analysis is how you predict and measure loop throughput accurately.