Chapter 31: The Modern CPU Pipeline
The Fiction You Were Taught
The classic pipeline — Fetch, Decode, Execute, Writeback — is a convenient lie. It is accurate for the Intel 486 (1989). It is not accurate for anything Intel has shipped since the Pentium Pro (1995). The gap between the textbook model and the actual hardware has been widening for thirty years, and the performance implications of that gap are significant enough that understanding the real model is not optional for anyone who wants to write fast assembly.
The modern x86-64 core — let us use an Intel Alder Lake Performance core as the reference — is a 6-wide, deeply pipelined (~19-stage) machine with a 512-entry reorder buffer, 12 execution ports, and out-of-order execution that can keep hundreds of instructions in flight simultaneously. Understanding this at a practical (not a microarchitectural-PhD) level lets you write code that takes advantage of it rather than fighting it.
Micro-Operations (µops)
The first critical concept: x86-64 instructions are not executed directly. They are translated into simpler, RISC-like micro-operations (µops) by the CPU's decoder. This translation is the foundation of everything else.
x86-64 instruction → decoder → 1 or more µops
Examples:
MOV RAX, RBX → 1 µop (register-register move: may be eliminated by renaming)
ADD RAX, [RBX] → 2 µops (load + add)
PUSH RBX → 2 µops (subtract RSP + store)
POP RBX → 2 µops (load + add RSP)
DIV RCX → ~20-70 µops (microcode, variable)
REP MOVSB → variable (complex microcode expansion)
⚙️ How It Works: The µop cache (called the Decoded Stream Buffer or DSB on Intel) caches the decoded µops for recently-executed instructions. When the CPU re-encounters the same code (as in a loop), it fetches directly from the µop cache rather than re-decoding the x86-64 instructions. This is why the µop cache capacity (typically 1500–2000 µops) limits the practical size of loops that can run at peak throughput. A loop body exceeding the µop cache causes re-decoding, which adds ~4 cycles of latency per iteration.
The Out-of-Order Pipeline in Detail
Frontend:
L1 Instruction Cache (32KB)
↓
Instruction Fetch (up to 16 bytes per cycle)
↓
Branch Predictor (predicts branch direction + target)
↓
Instruction Decode (6 decoders: 1 complex + 5 simple)
↓
µop Cache (Decoded Stream Buffer, ~1500 µops)
↓
Allocator / Register Rename
Out-of-Order Core:
↓
Scheduler / Reservation Stations (dispatch to execution ports)
↓ ↓ ↓ ↓ ↓ (multiple dispatch ports)
Execution Units (simplified port map; the exact layout varies by generation):
Port 0: Integer ALU, Branch, FP Multiply/Divide, Shift
Port 1: Integer ALU, Vector ALU, Vector Multiply
Port 2: Load + Store Address
Port 3: Load + Store Address
Port 4: Store Data
Port 5: Integer ALU, Vector Shuffle, LEA
Port 6: Integer ALU, Branch
Port 7: Store Address
Port 8: Store Address
↓
Reorder Buffer (ROB, 512 entries in Alder Lake P-core)
↓
Retirement (in-order commit, up to 8 µops/cycle)
↓
Architectural Registers (the registers your code sees)
The ROB is the key structure. Every in-flight µop has an entry in the ROB. The ROB maintains them in program order even as the execution units process them out of order. When a µop completes execution, it is marked "done" in the ROB. µops retire in program order: the oldest µop at the head of the ROB is retired (its result is written to the architectural register file or memory) once it is complete and all older µops are also complete.
💡 Mental Model: Think of the ROB as an ordered queue of work. You can work on any item in the queue that has its inputs ready, but you can only deliver the results from the front of the queue, in order. Out-of-order execution is the freedom to work on queue items out of sequence; in-order retirement is the commitment to deliver them in sequence.
Register Renaming
Register renaming eliminates false data dependencies — the "write-after-write" (WAW) and "write-after-read" (WAR) hazards that would otherwise prevent out-of-order execution.
; Two computations that reuse RAX (a false dependency without renaming)
imul rax, rbx ; chain 1: rax = rax * rbx (latency: 3 cycles)
add rax, rcx ; chain 1: rax = rax + rcx (true dependency: must wait for IMUL)
...
mov rax, rdx ; chain 2 begins: RAX is overwritten (write-only)
imul rax, rsi ; chain 2: independent of chain 1
Without renaming, the MOV must wait for the ADD to finish reading RAX (a WAR hazard), serializing the two chains even though they share no data. With renaming, the CPU allocates a new physical register for every destination write, eliminating the false dependency: chain 2 starts immediately. The processor has around 280 integer physical registers (Intel Alder Lake) but only 16 architectural registers; the extra capacity is consumed by the in-flight µops.
Latency vs. Throughput
This distinction is the key to understanding instruction-level performance:
- Latency: the number of cycles until an instruction's result is available to subsequent instructions that depend on it
- Reciprocal throughput: the average number of cycles between back-to-back issues of independent instances of the same instruction (0.25 cycles means four can issue per cycle)
For ADD RAX, RBX:
- Latency: 1 cycle
- Throughput: 0.25 cycles (4 ADDs can be issued per cycle — they go to 4 different ALU ports)
For IMUL RAX, RBX:
- Latency: 3 cycles
- Throughput: 1 cycle (only one IMUL unit)
For DIV RAX (64-bit / 64-bit):
- Latency: 35-90 cycles (varies by operand size and value)
- Throughput: 21-74 cycles
For VCVTPS2PD YMM0, XMM1 (convert 4 float → double):
- Latency: 4 cycles
- Throughput: 1 cycle
| Instruction | Latency (cycles) | Reciprocal throughput (cycles) | Notes |
|---|---|---|---|
| MOV reg, reg | 0 (eliminated) | 0.25 | Register rename can eliminate it |
| ADD/SUB | 1 | 0.25 | 4 per cycle on 4 ALU ports |
| AND/OR/XOR | 1 | 0.25 | Same |
| LEA (no shift) | 1 | 0.25 | Useful for add+copy |
| LEA (with shift) | 3 | 1 | Uses complex port |
| IMUL (32-bit) | 3 | 1 | |
| IMUL (64-bit) | 3 | 1 | |
| DIV (32-bit) | 21-74 | 10-24 | Dramatically slow |
| DIV (64-bit) | 35-90 | 21-74 | Never use in hot loops |
| MOVZX reg, byte | 1 | 0.25 | Zero-extends automatically |
| BSF/BSR | 3 | 1 | Bit scan |
| LZCNT/TZCNT | 3 | 1 | Prefer over BSF/BSR |
| RCL/RCR | 10 | 5 | Very slow; avoid |
| MOVAPS/MOVAPD | 1 | 0.25 | Fast XMM/YMM copy |
| VADDPS ymm | 4 | 0.5 | AVX packed float add |
| VMULPS ymm | 4 | 0.5 | AVX packed float multiply |
| VDIVPS ymm | 11-13 | 5-14 | Float division is slow |
📊 C Comparison: This is why `n / 7` compiles to a multiply-by-reciprocal at -O2 instead of a DIV instruction. The compiler replaces a DIV by 7 (35+ cycles) with an `IMUL` by a magic constant followed by a `SHR` (about 5 cycles total). Chapter 33 covers this technique in detail.
Dependency Chains and ILP
Instruction-Level Parallelism (ILP) is the CPU's ability to execute independent instructions in parallel. The enemy of ILP is data dependency chains — sequences of instructions where each depends on the previous result.
; BAD: Long dependency chain (sequential - no ILP)
imul rax, rax ; latency 3 cycles
imul rax, rax ; must wait 3 cycles for above
imul rax, rax ; must wait 3 cycles for above
imul rax, rax ; must wait 3 cycles for above
; Total: 12 cycles (4 IMULs × 3 cycles each, serialized)
; GOOD: Independent chains (parallel - uses ILP)
imul rax, rax ; chain 1: cycles 1-3
imul rbx, rbx ; chain 2: cycles 1-3 (independent!)
imul rcx, rcx ; chain 3: cycles 1-3 (independent!)
imul rdx, rdx ; chain 4: cycles 1-3 (independent!)
; Total: ~3 cycles (all four execute in parallel)
The critical path through a sequence of instructions is the longest dependency chain. The critical path determines the minimum possible execution time, regardless of the CPU's throughput capabilities.
⚠️ Common Mistake: Developers often think adding more operations to a loop will make it slower. Adding independent operations can actually cost zero additional time — they execute in parallel with operations already on the critical path.
; Reduction loop: naive (long dependency chain on rax)
xor rax, rax ; accumulator
.loop:
add rax, [rsi + rcx*8] ; adds to rax, creating a chain
dec rcx
jnz .loop
; Critical path: every ADD must complete before next ADD
; Limited by ADD latency (1 cycle): 1 cycle per iteration
; Reduction loop: 4-way unrolled with independent accumulators
xor rax, rax
xor rbx, rbx
xor rcx, rcx
xor rdx, rdx
.loop:
add rax, [rsi + r8*8]
add rbx, [rsi + r8*8 + 8]
add rcx, [rsi + r8*8 + 16]
add rdx, [rsi + r8*8 + 24]
add r8, 4
cmp r8, r9
jl .loop
; Combine after the loop: add rax, rbx / add rcx, rdx / add rax, rcx
; Critical path: four independent ADD chains execute in parallel
; Throughput: ~4 elements (one unrolled iteration) per cycle,
; vs 1 element per cycle for the non-unrolled loop
Branch Prediction
The branch predictor is one of the most sophisticated components in the CPU. Its job: predict whether a conditional branch will be taken before the condition has been evaluated. If it predicts correctly, execution continues at full speed. If wrong — a misprediction — the pipeline flushes all speculative work and restarts at the correct address: a 15–20 cycle penalty.
Branch Prediction Structures
- Branch Target Buffer (BTB): a cache of recently executed branch addresses and their predicted targets. Every time you jump to a function, the BTB records where the CALL was and where it went.
- Pattern History Table (PHT): records the recent taken/not-taken history of each branch (identified by its address). A loop that iterates 1000 times and then falls through will be predicted "taken" correctly 999 times and "not taken" incorrectly once: 1 misprediction per 1000 iterations.
- Return Address Stack (RAS): tracks CALL sites to predict RET targets. The CPU maintains a small internal stack of return addresses that mirrors the in-memory call stack: CALL pushes the return address onto the RAS, and RET pops it as the prediction. This is why returning through a corrupted stack pointer causes massive mispredictions.
Branchless Code with CMOV
For branches that depend on unpredictable data (random input, arbitrary comparisons), replacing the branch with a conditional move eliminates the misprediction penalty:
; BRANCHY: conditional assignment (bad for random data)
cmp rax, rbx
jle .skip
mov rax, rbx ; rax = min(rax, rbx)
.skip:
; BRANCHLESS: conditional move (no misprediction possible)
cmp rax, rbx
cmovg rax, rbx ; if RAX > RBX, move RBX to RAX
; Result: rax = min(rax, rbx), no branch taken
; BRANCHLESS: absolute value
mov rbx, rax
neg rax
cmovs rax, rbx ; if SF set (NEG made it negative, i.e. RAX was positive), restore the original
CMOV instructions: CMOVZ (zero), CMOVNZ, CMOVL (less), CMOVGE (greater/equal), CMOVLE, CMOVG, CMOVS (sign), CMOVNS — complete set matching all Jcc conditions.
⚠️ Common Mistake: CMOV is not always faster than a branch. For branches with highly predictable outcomes (the loop counter branch is taken 99.9% of the time), the branch predictor works well and the branch is nearly free. Use CMOV for branches with unpredictable data-dependent outcomes.
Spectre: The Dark Side of Speculative Execution
Speculative execution is responsible for the performance of modern CPUs. It is also responsible for Spectre.
; Spectre v1 gadget (the victim's code):
; x is an untrusted input (from attacker-controlled memory)
cmp x, array_size ; bounds check
jae .out_of_bounds ; branch: taken if x >= size (safe)
; CPU speculatively executes this even when x >= array_size:
movzx rax, byte [array + x] ; speculative out-of-bounds access
shl rax, 6 ; multiply by 64 (cache line size)
movzx rbx, byte [probe_array + rax] ; BRING SECRET INTO CACHE
; When misprediction detected: CPU rolls back rax, rbx
; But the cache state change is NOT rolled back!
; Attacker: time access to each probe_array[0], probe_array[64], probe_array[128]...
; The one that's fast (cache hit) reveals the secret byte value.
The Spectre mitigation for this pattern:
cmp x, array_size
jae .out_of_bounds
lfence ; prevent speculation past this point
movzx rax, byte [array + x] ; now safe: no speculation
The cost: LFENCE serializes the pipeline. Every LFENCE is a ~4 cycle fence that prevents the out-of-order engine from doing any useful work while it waits for all prior instructions to complete their memory accesses. On code that is called in a tight loop, this can cost 10-15% throughput. This is the real cost of Spectre mitigations.
Measuring IPC with perf
# Measure instructions per clock cycle for a program
perf stat ./my_program
# Sample output:
Performance counter stats for './my_program':
1,823,456,789 cycles # 3.241 GHz
3,456,789,012 instructions # 1.90 insn per cycle ← IPC
145,678,901 branches # 258.896 M/sec
2,345,678 branch-misses # 1.61% of all branches
456,789,012 cache-misses # 10,234.5 K/sec
0.562754545 seconds time elapsed
Interpreting IPC:
- IPC < 1: severely bottlenecked — likely memory-bound or many mispredictions
- IPC 1–2: modest utilization, room for improvement
- IPC 2–4: good pipeline utilization
- IPC > 4: exceptional (most programs do not achieve this; possible with SIMD)
Theoretical maximum IPC for modern Intel: 4–6 µops/cycle (depending on operation mix). Real programs: typically 1–3 IPC.
Writing Code for the Pipeline
Practical guidelines derived from the pipeline model:
1. Break long dependency chains:
; Sum 4 elements: chain length 4 (bad)
add rax, [rsi]
add rax, [rsi+8]
add rax, [rsi+16]
add rax, [rsi+24]
; Sum 4 elements: 2 independent chains (better)
mov rbx, [rsi]
mov rax, [rsi+8]
add rbx, [rsi+16]
add rax, [rsi+24]
add rax, rbx ; combine
2. Mix instruction types to use different execution ports:
; All on same port (bad: serializes on port 0)
imul rax, rax
imul rbx, rbx
imul rcx, rcx
; Mix of operations (good: spreads across ports)
imul rax, rax ; port 0 or 1
add rdi, rsi ; port 0 or 1 or 5 or 6
lea rbx, [rbx + rbx*2] ; port 1 or 5
3. Avoid unpredictable branches in inner loops:
Use CMOV, branchless comparisons, or sort data to make branches predictable.
4. Keep hot loops in the µop cache: if a loop body's decoded µops exceed the µop cache capacity (~1500 µops on recent Intel cores), the CPU falls back to the legacy decoders on every iteration, adding fetch/decode latency.
Summary
The modern CPU pipeline is a superscalar out-of-order machine with µop translation, register renaming, speculative execution, and a large reorder buffer. Instructions have distinct latency (time to produce result) and throughput (issue rate) properties. Dependency chains limit ILP. Branch mispredictions cost 15–20 cycles. CMOV eliminates unpredictable branches. Spectre demonstrates that speculative execution crosses security boundaries — the mitigation cost (LFENCE) is a direct tax on pipeline efficiency.
🔄 Check Your Understanding:
1. What is the difference between instruction latency and instruction throughput?
2. Why does register renaming improve performance?
3. When is CMOV faster than a conditional jump, and when is it slower?
4. What does IPC tell you about a program's pipeline utilization?
5. Why is DIV so much slower than IMUL?