# Chapter 31 Exercises: The Modern CPU Pipeline

## Section A: Latency and Throughput Analysis
1. **Instruction Timing Table.** Using Agner Fog's instruction tables (https://agner.org/optimize/) for your CPU (or Intel's documentation), fill in the following table for Intel Alder Lake or Zen 4:
| Instruction | Latency | Throughput | Port(s) |
|---|---|---|---|
| ADD RAX, RBX | | | |
| IMUL RAX, RBX | | | |
| IDIV RCX | | | |
| MOV RAX, [RSI] (L1 hit) | | | |
| LEA RAX, [RBX + RCX*4] | | | |
| SHLX RAX, RBX, RCX | | | |
| POPCNT RAX, RBX | | | |
| BSWAP RAX | | | |
| MOVZX RAX, byte [RSI] | | | |
| VMULPS YMM0, YMM1, YMM2 | | | |
2. **Critical Path Analysis.** For each code sequence, identify the critical path (the longest dependency chain) and compute the minimum possible execution time in cycles:
```nasm
; Sequence A:
mov  rax, [rsi]     ; load latency = 4 cycles
add  rax, rbx       ; add latency  = 1 cycle
imul rax, rcx       ; imul latency = 3 cycles
add  rax, rdx       ; add latency  = 1 cycle
; Critical path length = ?
```

```nasm
; Sequence B:
mov rax, [rsi]      ; load latency = 4 cycles
mov rbx, [rsi+8]    ; load latency = 4 cycles (independent)
add rax, rbx        ; depends on both loads
; Critical path length = ?
```

```nasm
; Sequence C:
imul rax, rax       ; chain 1
imul rbx, rbx       ; chain 2 (independent)
imul rax, rax       ; extends chain 1
imul rbx, rbx       ; extends chain 2
add  rax, rbx       ; depends on both chains
; Critical path length = ?
```
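If you want to check your hand counts, the earliest-finish timing model behind these questions can be sketched in a few lines of C. The latencies and dependency lists are inputs you supply; the toy example in the comment (two independent 4-cycle loads feeding a 1-cycle add) is deliberately not one of the sequences above.

```c
#include <assert.h>

// Sketch: earliest-finish model for a small dependency DAG.
// finish[i] = latency[i] + max over dependencies j of finish[j];
// the critical path length is the largest finish time.
// Example: two 4-cycle loads feeding a 1-cycle add finish at cycle 5, not 9.
#define MAX_INSNS 16

// deps[i] lists the instruction indices that instruction i waits on; -1 terminates.
long critical_path(int n, const long latency[], int deps[][MAX_INSNS]) {
    long finish[MAX_INSNS] = {0};
    for (int i = 0; i < n; i++) {            // instructions in program order
        long start = 0;
        for (int k = 0; deps[i][k] >= 0; k++)
            if (finish[deps[i][k]] > start) start = finish[deps[i][k]];
        finish[i] = start + latency[i];
    }
    long best = 0;
    for (int i = 0; i < n; i++)
        if (finish[i] > best) best = finish[i];
    return best;
}
```

A serial chain degenerates to the sum of latencies; independent chains contribute only their maximum.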
3. **Throughput-Bound vs. Latency-Bound.** For each of the following loops, determine whether execution is latency-bound (the critical path limits speed) or throughput-bound (execution-port pressure limits speed):
```nasm
; Loop A: 8 independent adds
; (accumulators must not touch RCX, the loop counter, or RSP)
.loop:
    add rax, r8
    add rbx, r9
    add rdx, r10
    add rsi, r11
    add rdi, r12
    add rbp, r13
    add r14, r8
    add r15, r9
    dec rcx
    jnz .loop
```

```nasm
; Loop B: dependent multiply chain
.loop:
    imul rax, rax
    imul rax, rax
    imul rax, rax
    dec  rcx
    jnz  .loop
```

```nasm
; Loop C: memory loads with computation
.loop:
    mov  r8, [rsi + rcx*8]
    add  rax, r8
    imul rax, rdx
    dec  rcx
    jnz  .loop
```
## Section B: Branch Prediction
4. **Branch Prediction Benchmark.** Write two implementations of a search function:
```c
// Version A: conditional branch (data-dependent)
long find_value_branch(long *array, long n, long target) {
    for (long i = 0; i < n; i++) {
        if (array[i] == target) return i;
    }
    return -1;
}

// Version B: CMOV-based (branchless result accumulation)
// Uses CMOV to update the result register without branching
```
Benchmark both with:

a) A sorted array (predictable: target always in the first half)
b) A random array (target at a random position)
c) No match (target not present)

Measure each scenario with `perf stat -e branch-misses`. When is the branchy version faster? When is the branchless version faster?
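One possible shape for Version B, sketched here in C rather than asm: scanning backward and overwriting on match yields the first (lowest) matching index without an early-exit branch, and compilers typically lower the ternary to CMOV at `-O2`. The function name is illustrative, not from the chapter.

```c
#include <assert.h>

// Branchless search sketch: no data-dependent branch in the loop body.
// Iterating from the end means an earlier match overwrites a later one,
// so the result equals the first matching index, as in Version A.
long find_value_branchless(long *array, long n, long target) {
    long result = -1;
    for (long i = n - 1; i >= 0; i--)
        result = (array[i] == target) ? i : result;  // CMOV-friendly update
    return result;
}
```

Note the trade-off the benchmark should expose: unlike Version A, this version always touches all `n` elements.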
5. **Misprediction Cost Measurement.** Write a benchmark that deliberately causes branch mispredictions:
```nasm
; Array of random 0/1 values.
; For each element: if 1, add to sum; if 0, skip.
; This branch is maximally unpredictable when the array is random.
.loop:
    movzx rbx, byte [rsi + rcx]  ; load random 0 or 1
    test  rbx, rbx
    jz    .skip                  ; 50% taken, 50% not-taken: ~50% misprediction rate
    add   rax, rdx
.skip:
    dec   rcx
    jnz   .loop
```
Measure cycles per iteration. Then implement a branchless version:
```nasm
    movzx rbx, byte [rsi + rcx]
    imul  rbx, rdx              ; rbx = 0 or value (no branch)
    add   rax, rbx
```
What is the cycles-per-iteration difference?
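The same transformation expressed in C, for comparison against the asm. The multiply-by-flag trick is one of several branchless encodings; `flags[i]` is assumed to hold only 0 or 1, as in the benchmark above.

```c
#include <assert.h>

// Branchless conditional sum: multiplying by the 0/1 flag replaces
// "if (flag) sum += value" with an unconditional add of value or 0.
long sum_branchless(const unsigned char *flags, long n, long value) {
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += (long)flags[i] * value;   // adds value or 0, never branches
    return sum;
}
```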
6. **Predictable Branch Optimization.** A loop that runs exactly 1000 times has its loop-closing branch mispredicted exactly once (on the final iteration). With 1000 iterations and 15 cycles per misprediction:

a) What percentage of loop execution time is misprediction overhead?
b) At what loop body size (in cycles) does misprediction become less than 1% of total time?
c) What happens to the misprediction penalty if you unroll the loop 4×?
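A quick sanity-check helper for parts (a) and (b), with the body size left as a parameter rather than given as an answer:

```c
#include <assert.h>

// Percentage of total loop time consumed by a single misprediction of
// `penalty` cycles, for `iters` iterations of a `body_cycles`-cycle body.
double mispredict_overhead_pct(long iters, double body_cycles, double penalty) {
    return 100.0 * penalty / ((double)iters * body_cycles + penalty);
}
```

Sweeping `body_cycles` shows where the overhead crosses the 1% threshold for part (b).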
7. **Return Address Stack Depth.** The Return Address Stack (RAS) typically holds 16–32 entries. Write code that:
a) Has a call chain of depth 20 (function A calls B calls C ... calls T)
b) Measures the RET misprediction rate at various call depths using perf stat -e branch-misses
c) Plots misprediction rate vs. call depth
At what depth do you see a sharp increase in RET mispredictions?
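One way to get a configurable call depth without hand-writing 20 functions is recursion: each live level occupies one RAS entry, so a depth beyond the RAS capacity forces RET mispredictions on the way back up. This is a sketch; the `noinline` attribute is a GCC/Clang extension used to keep the CALL/RET pairs real.

```c
#include <assert.h>

// Each recursion level performs a real CALL and, on unwind, a real RET.
// The "1 +" after the recursive call prevents tail-call elimination,
// so the compiler cannot flatten the chain into a loop.
__attribute__((noinline)) long call_chain(long depth) {
    if (depth == 0) return 0;
    return 1 + call_chain(depth - 1);
}
```

Sweep `depth` across the 8–64 range while recording `branch-misses` to find the cliff.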
## Section C: ILP and Out-of-Order Execution
8. **Dependency Chain Breaking.** Start with this loop, whose IMUL forms a 3-cycle loop-carried dependency chain:
```nasm
.loop:
    imul rax, [rsi + rcx*8]    ; rax = rax * mem[i]
    dec  rcx
    jnz  .loop
```
Transform it into:

a) A 2-way unrolled version with 2 independent accumulators
b) A 4-way unrolled version with 4 independent accumulators
Measure cycles per element for each version. What is the speedup from 1 to 2 to 4 accumulators? Why does it plateau?
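The shape of transformation (a), sketched in C (the asm version splits the same way; `n` is assumed even here for brevity):

```c
#include <assert.h>

// Two independent accumulator chains: the single 3-cycle multiply chain
// becomes two chains the out-of-order core can execute in parallel.
long product_2acc(const long *a, long n) {   // assumes n is even
    long acc0 = 1, acc1 = 1;
    for (long i = 0; i < n; i += 2) {
        acc0 *= a[i];        // chain 1
        acc1 *= a[i + 1];    // chain 2, independent of chain 1
    }
    return acc0 * acc1;      // combine the chains once, at the end
}
```

The 4-way version follows the same pattern with four accumulators; the plateau question is about when port throughput, rather than chain latency, becomes the limit.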
9. **Mixed-Type Instruction Scheduling.** The following loop uses only port 1 (the IMUL port):
```nasm
.loop:
    imul rax, rax
    imul rbx, rbx
    imul rcx, rcx
    imul rdx, rdx
    dec  rsi
    jnz  .loop
```
Rewrite the loop replacing some IMULs with ADDs or LEAs to spread across more ports. What is the throughput improvement?
10. **µop Cache Limit.** Write a loop body that is exactly:

a) 50 instructions (well within the µop cache)
b) 150 instructions (near the µop cache boundary)
c) 300 instructions (beyond the µop cache)
The instructions should be independent NOPs or simple ADDs to isolate the µop cache effect. Measure cycles per iteration at each size. At what instruction count do you see a performance cliff?
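Rather than typing 300 instructions by hand, you can generate the loop bodies. A sketch that writes `n` independent NOPs into a text buffer (the buffer-based interface and the NOP filler are assumptions; adapt the filler to ADDs if you prefer):

```c
#include <assert.h>
#include <stdio.h>

// Emit an n-instruction loop body of independent NOPs into buf.
// Returns the number of characters written (excluding the terminator).
size_t emit_nop_body(char *buf, size_t cap, int n) {
    size_t used = 0;
    for (int i = 0; i < n && used + 9 < cap; i++)
        used += (size_t)snprintf(buf + used, cap - used, "    nop\n");
    return used;
}
```

Pipe the output into your assembler source between the loop label and the `dec`/`jnz` pair.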
## Section D: llvm-mca Analysis
11. **Analyzing a Loop with llvm-mca.** Install LLVM and use `llvm-mca` to analyze loop throughput:

```sh
llvm-mca --mcpu=skylake -iterations=1000 loop.s
```
For the following loop:
```nasm
.loop:
    vaddps ymm0, ymm0, [rsi]
    vaddps ymm1, ymm1, [rsi+32]
    add    rsi, 64
    dec    rcx
    jnz    .loop
```
Read the llvm-mca output:

- Dispatch Width
- µops per iteration
- Critical resource bottleneck
- IPC prediction
What does llvm-mca say is the bottleneck? Is it a port, a dependency chain, or a load bandwidth issue?
12. **Spectre Mitigation Cost.** Write a bounds-checked array access function:
```nasm
safe_array_load:
    cmp rsi, [array_size]
    jae .out_of_bounds
    lfence                      ; block speculative loads until the check retires
    mov rax, [array + rsi*8]
    ret
.out_of_bounds:
    xor eax, eax                ; arbitrary convention: return 0 when out of range
    ret
```
Benchmark this function called 100 million times:

a) With LFENCE
b) Without LFENCE
Measure the throughput difference. Also use perf stat -e instructions to compare instruction counts. What fraction of the overhead is from LFENCE vs. the bounds check?
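The same function in C with an intrinsic is handy for the A/B benchmark: `_mm_lfence` is the SSE2 speculation-barrier intrinsic, and the out-of-range return value of 0 is an arbitrary choice.

```c
#include <assert.h>
#include <emmintrin.h>   // _mm_lfence (SSE2)
#include <stddef.h>

// Bounds-checked load with a speculation barrier between check and access.
long safe_array_load(const long *array, size_t size, size_t index) {
    if (index >= size) return 0;     // out of bounds: arbitrary sentinel
    _mm_lfence();                    // no load issues until the check retires
    return array[index];
}
```

For the "without LFENCE" variant, simply delete the `_mm_lfence()` call and rebuild; keep optimization flags identical across both builds.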
## Section E: Profile-Guided Analysis
13. **IPC Profiling.** For a program of your choice (it could be your MinOS kernel routines, an AES encryption function, or a matrix multiply), run:

```sh
perf stat -e cycles,instructions,branch-misses,cache-misses ./program
```

Interpret the results:

- What is the IPC?
- What percentage of cycles is wasted on branch mispredictions?
- What percentage is potentially wasted on cache misses?
- Is the program pipeline-bound, memory-bound, or branch-bound?
14. **Hot Function Identification.** Run a larger program (e.g., your assembly AES encryption from earlier chapters) under:

```sh
perf record -e cycles ./program
perf report
```
Identify the top 3 functions by cycle count. Use perf annotate to see which specific instructions are most expensive within the hot function. Does the perf annotation match your expectation of where the bottleneck should be?
15. **RDTSC Microbenchmark.** Write a precise microbenchmark using RDTSC:
```nasm
    lfence                  ; serialize: earlier instructions retire before we read
    rdtsc                   ; EDX:EAX = timestamp counter
    shl   rdx, 32
    or    rax, rdx          ; RAX = 64-bit start timestamp
    mov   r8, rax           ; start time

    ; ... code to benchmark (N iterations) ...

    lfence                  ; ensure the benchmarked code has finished
    rdtsc
    shl   rdx, 32
    or    rax, rdx          ; RAX = 64-bit end timestamp
    sub   rax, r8           ; elapsed cycles
    xor   rdx, rdx
    mov   rcx, N
    div   rcx               ; RAX = cycles per iteration
```
Use this to measure:
a) NOP (should be ~0.25 cycles)
b) ADD RAX, RAX (should be ~0.25 cycles)
c) IMUL RAX, RAX (should be ~1 cycle at throughput)
d) DIV RCX where RCX=7 (should be ~35-90 cycles)
Warm up the cache by running the code once before measuring, then repeat the measurement 5 times and take the minimum to reduce measurement noise.
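The same discipline can be wrapped in C using the compiler's RDTSC intrinsic (`__rdtsc` from `x86intrin.h`, available in GCC/Clang on x86-64); the harness below is a sketch, with the workload function being an arbitrary stand-in.

```c
#include <assert.h>
#include <x86intrin.h>   // __rdtsc

// Measure fn() five times after one unmeasured warm-up call and report
// the minimum, which filters out interrupts and other one-off noise.
unsigned long long min_cycles(void (*fn)(void)) {
    fn();                                        // warm-up, not measured
    unsigned long long best = (unsigned long long)-1;
    for (int i = 0; i < 5; i++) {
        unsigned long long t0 = __rdtsc();
        fn();
        unsigned long long dt = __rdtsc() - t0;
        if (dt < best) best = dt;
    }
    return best;
}

// Hypothetical workload: a short volatile loop the compiler cannot remove.
static void sample_workload(void) {
    volatile long x = 0;
    for (int i = 0; i < 100; i++) x += i;
}
```

Divide the result by the iteration count inside `fn` to get cycles per iteration, as in the asm version.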