
Chapter 31: The Modern CPU Pipeline

The Fiction You Were Taught

The classic pipeline — Fetch, Decode, Execute, Writeback — is a convenient lie. It is accurate for the Intel 486 (1989). It is not accurate for anything Intel has shipped since the Pentium Pro (1995). The gap between the textbook model and the actual hardware has been widening for thirty years, and the performance implications of that gap are significant enough that understanding the real model is not optional for anyone who wants to write fast assembly.

The modern x86-64 core — let us use an Intel Alder Lake Performance core as the reference — is a 6-wide, deeply pipelined (~20-stage) machine with a 512-entry reorder buffer, 12 execution ports, and out-of-order execution that can have 200+ instructions in flight simultaneously. Understanding this at a practical (not a microarchitectural-PhD) level lets you write code that takes advantage of it rather than fighting it.


Micro-Operations (µops)

The first critical concept: x86-64 instructions are not executed directly. They are translated into simpler, RISC-like micro-operations (µops) by the CPU's decoder. This translation is the foundation of everything else.

x86-64 instruction → decoder → 1 or more µops

Examples:

MOV RAX, RBX       → 1 µop (register-register move: may be eliminated by renaming)
ADD RAX, [RBX]     → 2 µops (load + add)
PUSH RBX           → 2 µops (subtract RSP + store)
POP RBX            → 2 µops (load + add RSP)
DIV RCX            → ~20-70 µops (microcode, variable)
REP MOVSB          → variable (complex microcode expansion)

⚙️ How It Works: The µop cache (the Decoded Stream Buffer, or DSB, on Intel) holds the decoded µops of recently executed instructions. When the CPU re-encounters the same code (as in a loop), it fetches µops straight from the DSB instead of re-decoding the x86-64 instructions. This is why µop cache capacity (roughly 1.5K µops on Skylake, roughly 4K on an Alder Lake P-core) bounds the size of loops that can run at peak front-end throughput: a loop body that overflows it falls back to the legacy decoders, whose lower fetch/decode bandwidth can throttle every iteration.


The Out-of-Order Pipeline in Detail

Frontend:
  L1 Instruction Cache (32KB)
        ↓
  Instruction Fetch (up to 16 bytes per cycle)
        ↓
  Branch Predictor (predicts branch direction + target)
        ↓
  Instruction Decode (6 decoders on the Alder Lake P-core: 1 complex + 5 simple)
        ↓
  µop Cache (Decoded Stream Buffer, ~4K µops on the Alder Lake P-core)
        ↓
  Allocator / Register Rename

Out-of-Order Core:
        ↓
  Scheduler / Reservation Stations (dispatch to execution ports)
        ↓ ↓ ↓ ↓ ↓ (multiple dispatch ports)
  Execution Units (simplified port map):
    Port 0: Integer ALU, Branch, FP Multiply/Divide, Shift
    Port 1: Integer ALU, Vector ALU, Vector Multiply
    Port 2: Load + Store Address
    Port 3: Load + Store Address
    Port 4: Store Data
    Port 5: Integer ALU, Vector Shuffle, LEA
    Port 6: Integer ALU, Branch
    Port 7: Store Address
    Port 8: Store Address
        ↓
  Reorder Buffer (ROB, 512 entries in Alder Lake P-core)
        ↓
  Retirement (in-order commit, up to 8 µops/cycle on the P-core)
        ↓
  Architectural Registers (the registers your code sees)

The ROB is the key structure. Every in-flight µop has an entry in the ROB. The ROB maintains them in program order even as the execution units process them out of order. When a µop completes execution, it is marked "done" in the ROB. µops retire in program order: the oldest µop at the head of the ROB is retired (its result is written to the architectural register file or memory) once it is complete and all older µops are also complete.

💡 Mental Model: Think of the ROB as an ordered queue of work. You can work on any item in the queue that has its inputs ready, but you can only deliver the results from the front of the queue, in order. Out-of-order execution is the freedom to work on queue items out of sequence; in-order retirement is the commitment to deliver them in sequence.
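The queue discipline can be made concrete with a toy model — a sketch for intuition only, not how hardware is built; the names RobEntry and rob_retire are invented for illustration:

```c
#include <stdbool.h>

#define ROB_SIZE 8

typedef struct {
    int  id;    /* program-order position of the µop */
    bool done;  /* has execution finished? */
} RobEntry;

/* Retire entries from the head of the ROB, in program order only:
   stop at the first entry that has not finished executing, even if
   younger entries behind it are already done. Returns the new head. */
static int rob_retire(RobEntry rob[], int head, int tail) {
    while (head != tail && rob[head].done)
        head = (head + 1) % ROB_SIZE;
    return head;
}
```

If µops 2 and 3 finish early while µop 0 is still executing, rob_retire advances nothing; only once the head completes does retirement drain all four entries in order.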


Register Renaming

Register renaming eliminates false data dependencies — the "write-after-write" (WAW) and "write-after-read" (WAR) hazards that would otherwise prevent out-of-order execution.

; Two independent computations that share a register — creates false dependency
    imul rax, rbx       ; rax = rax * rbx  (latency: 3 cycles)
    ; ... 3 cycles of waiting ...
    add rax, rcx        ; rax = rax + rcx  (dependent: must wait for IMUL)
    ...
    imul rax, rdx       ; SECOND imul overwrites rax

Without renaming, the second IMUL would have to wait for the ADD to finish reading RAX (a WAR hazard). With renaming, the CPU allocates a fresh physical register for each destination write, eliminating all false dependencies. The processor has roughly 280 integer physical registers (Intel Alder Lake P-core) backing only 16 architectural ones; the extra capacity is what holds the results of in-flight µops.


Latency vs. Throughput

This distinction is the key to understanding instruction-level performance:

  • Latency: how many cycles until the result of an instruction is available for use by subsequent instructions
  • Throughput (reciprocal throughput): how often the instruction can be issued (1/throughput = cycles between back-to-back issues of the same instruction)

For ADD RAX, RBX:
  • Latency: 1 cycle
  • Throughput: 0.25 cycles (4 ADDs can be issued per cycle — they go to 4 different ALU ports)

For IMUL RAX, RBX:
  • Latency: 3 cycles
  • Throughput: 1 cycle (only one IMUL unit)

For DIV RCX (64-bit ÷ 64-bit; RDX:RAX is the implicit dividend):
  • Latency: 35-90 cycles (varies by operand size and value)
  • Throughput: 21-74 cycles

For VCVTPS2PD YMM0, XMM1 (convert 4 floats → 4 doubles):
  • Latency: 4 cycles
  • Throughput: 1 cycle

Instruction         Latency         Throughput   Notes
MOV reg, reg        0 (eliminated)  0.25         Register rename can eliminate it
ADD/SUB             1               0.25         4 per cycle on 4 ALU ports
AND/OR/XOR          1               0.25         Same
LEA (no shift)      1               0.25         Useful for add+copy
LEA (with shift)    3               1            Uses complex port
IMUL (32-bit)       3               1
IMUL (64-bit)       3               1
DIV (32-bit)        21-74           10-24        Dramatically slow
DIV (64-bit)        35-90           21-74        Never use in hot loops
MOVZX reg, byte     1               0.25         Zero-extends automatically
BSF/BSR             3               1            Bit scan
LZCNT/TZCNT         3               1            Prefer over BSF/BSR
RCL/RCR             10              5            Very slow; avoid
MOVAPS/MOVAPD       1               0.25         Fast XMM/YMM copy
VADDPS ymm          4               0.5          AVX packed float add
VMULPS ymm          4               0.5          AVX packed float multiply
VDIVPS ymm          11-13           5-14         Float division is slow

📊 C Comparison: This is why n / 7 compiles to a multiply-by-reciprocal sequence at -O2 instead of a DIV instruction. The compiler replaces the 35+-cycle DIV with an IMUL by a magic constant followed by shifts (about 5 cycles total). Chapter 33 covers this technique in detail.
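The transformation is easy to reproduce in C. A sketch of the unsigned divide-by-7 rewrite: the constant 613566757 is ceil(2^32 / 7), and the subtract/shift fixup is the standard Granlund-Montgomery sequence compilers emit when the magic constant would otherwise overflow 32 bits (div7 is a hypothetical helper name; exact codegen varies by compiler and target).

```c
#include <stdint.h>

/* Divide by 7 without DIV: multiply by ceil(2^32 / 7), keep the high
   half of the 64-bit product, then apply the standard fixup sequence. */
static uint32_t div7(uint32_t n) {
    uint32_t t = (uint32_t)(((uint64_t)n * 613566757u) >> 32); /* high half */
    return (t + ((n - t) >> 1)) >> 2;   /* fixup: correct for all 32-bit n */
}
```

On x86-64 this compiles to a MOV of the constant, one MUL/IMUL, and a handful of single-cycle ALU ops — no DIV anywhere.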


Dependency Chains and ILP

Instruction-Level Parallelism (ILP) is the CPU's ability to execute independent instructions in parallel. The enemy of ILP is data dependency chains — sequences of instructions where each depends on the previous result.

; BAD: Long dependency chain (sequential - no ILP)
    imul rax, rax       ; latency 3 cycles
    imul rax, rax       ; must wait 3 cycles for above
    imul rax, rax       ; must wait 3 cycles for above
    imul rax, rax       ; must wait 3 cycles for above
    ; Total: 12 cycles (4 IMULs × 3 cycles each, serialized)

; GOOD: Independent chains (parallel - uses ILP)
    imul rax, rax       ; chain 1: cycles 1-3
    imul rbx, rbx       ; chain 2: cycles 1-3 (independent!)
    imul rcx, rcx       ; chain 3: cycles 1-3 (independent!)
    imul rdx, rdx       ; chain 4: cycles 1-3 (independent!)
    ; Total: ~3 cycles (all four execute in parallel)

The critical path through a sequence of instructions is the longest dependency chain. The critical path determines the minimum possible execution time, regardless of the CPU's throughput capabilities.

⚠️ Common Mistake: Developers often think adding more operations to a loop will make it slower. Adding independent operations can actually cost zero additional time — they execute in parallel with operations already on the critical path.

; Reduction loop: naive (long dependency chain on rax)
    xor rax, rax        ; accumulator
.loop:
    add rax, [rsi + rcx*8 - 8]  ; adds to rax, creating a chain (rcx runs n..1)
    dec rcx
    jnz .loop
    ; Critical path: every ADD must complete before next ADD
    ; Limited by ADD latency (1 cycle): 1 cycle per iteration

; Reduction loop: 4-way unrolled with independent accumulators
    xor rax, rax
    xor rbx, rbx
    xor rcx, rcx
    xor rdx, rdx
.loop:
    add rax, [rsi + r8*8]
    add rbx, [rsi + r8*8 + 8]
    add rcx, [rsi + r8*8 + 16]
    add rdx, [rsi + r8*8 + 24]
    add r8, 4
    cmp r8, r9
    jl .loop
    ; Combine: add rax, rbx; add rcx, rdx; add rax, rcx
    ; Critical path: 4 independent ADDs run in parallel
    ; Theoretical throughput: 4 elements per cycle if ports allow
    ; (in practice capped by the 2-3 loads/cycle the load ports sustain),
    ; vs 1 element per cycle for the single-chain version

Branch Prediction

The branch predictor is one of the most sophisticated components in the CPU. Its job: predict whether a conditional branch will be taken before the condition has been evaluated. If it predicts correctly, execution continues at full speed. If wrong — a misprediction — the pipeline flushes all speculative work and restarts at the correct address: a 15–20 cycle penalty.

Branch Prediction Structures

  • Branch Target Buffer (BTB): Cache of recent branch addresses and their predicted targets. Every time you jump to a function, the BTB records where CALL was from and where it went.
  • Pattern History Table (PHT): Records the recent taken/not-taken history for each branch (identified by its PC). A loop that iterates 1000 times and then falls through is predicted correctly 999 times and mispredicted once, on the final not-taken iteration — 1 misprediction per 1000 iterations.
  • Return Address Stack (RAS): Tracks CALL addresses to predict RET targets. The CPU maintains a stack of return addresses that mirrors the hardware call stack. When you CALL a function, the RAS pushes the return address. When you RET, the RAS predicts the correct return address. This is why indirect returns through corrupted stack pointers cause massive mispredictions.
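These structures explain the classic sorted-vs-unsorted benchmark. A sketch (sum_above_threshold is an illustrative name): timing this loop on the same array before and after sorting isolates the misprediction cost, since the arithmetic work is identical and only the branch pattern changes.

```c
#include <stddef.h>

/* Branchy sum: the if() is a data-dependent branch. On random bytes it
   is taken about half the time and mispredicts constantly; on sorted
   input the predictor locks on after the crossover point and the
   branch is nearly free. */
static long long sum_above_threshold(const int *data, size_t n, int threshold) {
    long long s = 0;
    for (size_t i = 0; i < n; i++)
        if (data[i] >= threshold)   /* predictable only if data is ordered */
            s += data[i];
    return s;
}
```

Sorting the array with qsort() and re-timing typically shows a large speedup at -O2 when the compiler keeps the branch (some compilers convert it to CMOV, which flattens the difference) — the sums, of course, stay identical.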

Branchless Code with CMOV

For branches that depend on unpredictable data (random input, arbitrary comparisons), replacing the branch with a conditional move eliminates the misprediction penalty:

; BRANCHY: conditional assignment (bad for random data)
    cmp rax, rbx
    jle .skip
    mov rax, rbx        ; rax = min(rax, rbx)
.skip:

; BRANCHLESS: conditional move (no misprediction possible)
    cmp rax, rbx
    cmovg rax, rbx      ; if RAX > RBX, move RBX to RAX
    ; Result: rax = min(rax, rbx), no branch taken

; BRANCHLESS: absolute value
    mov rbx, rax
    neg rax
    cmovs rax, rbx      ; SF set means NEG went negative (rax was positive): restore original

CMOV instructions: CMOVZ (zero), CMOVNZ, CMOVL (less), CMOVGE (greater/equal), CMOVLE, CMOVG, CMOVS (sign), CMOVNS — complete set matching all Jcc conditions.

⚠️ Common Mistake: CMOV is not always faster than a branch. For branches with highly predictable outcomes (the loop counter branch is taken 99.9% of the time), the branch predictor works well and the branch is nearly free. Use CMOV for branches with unpredictable data-dependent outcomes.
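In C you rarely write CMOV by hand: a simple ternary often compiles to CMP + CMOV at -O2, though that is a compiler choice, not a guarantee. A sketch of branchless equivalents (min64 and abs64 are illustrative names; note that x >> 63 on a signed value relies on arithmetic shift, which mainstream compilers provide but the C standard leaves implementation-defined):

```c
#include <stdint.h>

/* min: compilers typically lower this ternary to CMP + CMOV. */
static int64_t min64(int64_t a, int64_t b) {
    return (a < b) ? a : b;
}

/* Branchless absolute value via a sign mask: the same trick as the
   NEG/CMOVS sequence above, expressed in portable-ish C. */
static int64_t abs64(int64_t x) {
    int64_t mask = x >> 63;       /* all-ones if x is negative, else zero */
    return (x ^ mask) - mask;     /* flip bits and add 1 only when negative */
}
```

Like the assembly version, abs64(INT64_MIN) wraps around to INT64_MIN — the one value with no positive counterpart.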


Spectre: The Dark Side of Speculative Execution

Speculative execution is responsible for the performance of modern CPUs. It is also responsible for Spectre.

; Spectre v1 gadget (the victim's code):
; x is an untrusted input (from attacker-controlled memory)
    cmp x, array_size       ; bounds check
    jae .out_of_bounds      ; branch: taken if x >= size (safe)
    ; CPU speculatively executes this even when x >= array_size:
    movzx rax, byte [array + x]     ; speculative out-of-bounds access
    shl rax, 6                      ; multiply by 64 (cache line size)
    movzx rbx, byte [probe_array + rax]  ; pulls the probe_array LINE SELECTED BY THE SECRET into cache
    ; When misprediction detected: CPU rolls back rax, rbx
    ; But the cache state change is NOT rolled back!
    ; Attacker: time access to each probe_array[0], probe_array[64], probe_array[128]...
    ; The one that's fast (cache hit) reveals the secret byte value.

The Spectre mitigation for this pattern:

    cmp x, array_size
    jae .out_of_bounds
    lfence              ; prevent speculation past this point
    movzx rax, byte [array + x]   ; now safe: no speculation

The cost: LFENCE serializes execution. The CPU does not dispatch any instruction past an LFENCE until every prior instruction has completed, so the out-of-order engine sits idle while the pipeline drains. On code that is called in a tight loop, this can cost 10-15% throughput. This is the real cost of Spectre mitigations.
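LFENCE is not the only mitigation. A portable alternative is index masking, the idea behind Linux's array_index_nospec(): force a mis-speculated access to read index 0 instead of attacker-chosen memory. A sketch (index_nospec and read_checked are illustrative names; the kernel implements the mask in asm precisely so the compiler cannot turn it back into a predictable branch):

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an index to 0 when out of bounds, without a branch the CPU can
   mispredict past. */
static size_t index_nospec(size_t idx, size_t size) {
    size_t mask = -(size_t)(idx < size);  /* all-ones if in bounds, else 0 */
    return idx & mask;
}

static uint8_t read_checked(const uint8_t *array, size_t size, size_t x) {
    if (x >= size)
        return 0;
    /* Even if the branch above is speculated the wrong way, the masked
       index reads array[0] rather than out-of-bounds memory. */
    return array[index_nospec(x, size)];
}
```

Unlike LFENCE, the mask costs only a couple of single-cycle ALU ops on the load's address path, so speculation elsewhere continues unimpeded.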


Measuring IPC with perf

# Measure instructions per clock cycle for a program
perf stat -e cycles,instructions,branches,branch-misses,cache-misses ./my_program

# Sample output:
Performance counter stats for './my_program':

       1,823,456,789      cycles                    #    3.241 GHz
       3,456,789,012      instructions              #    1.90  insn per cycle  ← IPC
         145,678,901      branches                  #  258.896 M/sec
           2,345,678      branch-misses             #    1.61% of all branches
         456,789,012      cache-misses              # 10,234.5 K/sec

       0.562754545 seconds time elapsed

Interpreting IPC:

  • IPC < 1: severely bottlenecked — likely memory-bound or many mispredictions
  • IPC 1–2: modest utilization, room for improvement
  • IPC 2–4: good pipeline utilization
  • IPC > 4: exceptional (most programs do not achieve this; possible with SIMD)

Theoretical maximum for a modern Intel P-core: around 5–6 instructions per cycle (more in µops, since retirement is wider). Real programs: typically 1–3 IPC.


Writing Code for the Pipeline

Practical guidelines derived from the pipeline model:

1. Break long dependency chains:

; Sum 4 elements: chain length 4 (bad)
    add rax, [rsi]
    add rax, [rsi+8]
    add rax, [rsi+16]
    add rax, [rsi+24]

; Sum 4 elements: 2 independent chains (better)
    mov rbx, [rsi]
    mov rax, [rsi+8]
    add rbx, [rsi+16]
    add rax, [rsi+24]
    add rax, rbx        ; combine

2. Mix instruction types to use different execution ports:

; All on same port (bad: serializes on port 0)
    imul rax, rax
    imul rbx, rbx
    imul rcx, rcx

; Mix of operations (good: spreads across ports)
    imul rax, rax       ; port 0 or 1
    add  rdi, rsi       ; port 0 or 1 or 5 or 6
    lea  rbx, [rbx + rbx*2]  ; port 1 or 5

3. Avoid unpredictable branches in inner loops: Use CMOV, branchless comparisons, or sort data to make branches predictable.

4. Keep hot loops small: Very small loops (up to roughly 70 µops) can stream from the loop buffer on cores that have one, without touching the decoders at all; larger loops still run at full speed as long as they fit in the µop cache (a few thousand µops on recent cores). A loop body that exceeds the µop cache must be re-decoded from x86-64 bytes on every iteration, throttling front-end bandwidth.


Summary

The modern CPU pipeline is a superscalar out-of-order machine with µop translation, register renaming, speculative execution, and a large reorder buffer. Instructions have distinct latency (time to produce result) and throughput (issue rate) properties. Dependency chains limit ILP. Branch mispredictions cost 15–20 cycles. CMOV eliminates unpredictable branches. Spectre demonstrates that speculative execution crosses security boundaries — the mitigation cost (LFENCE) is a direct tax on pipeline efficiency.

🔄 Check Your Understanding:

  1. What is the difference between instruction latency and instruction throughput?
  2. Why does register renaming improve performance?
  3. When is CMOV faster than a conditional jump, and when is it slower?
  4. What does IPC tell you about a program's pipeline utilization?
  5. Why is DIV so much slower than IMUL?