Chapter 31 Exercises: The Modern CPU Pipeline

Section A: Latency and Throughput Analysis

1. Instruction Timing Table

Using Agner Fog's instruction tables (https://agner.org/optimize/) for your CPU (or Intel's documentation), fill in the following table for Intel Alder Lake or Zen 4:

Instruction                  Latency   Throughput   Port(s)
-----------                  -------   ----------   -------
ADD RAX, RBX
IMUL RAX, RBX
IDIV RCX
MOV RAX, [RSI]  (L1 hit)
LEA RAX, [RBX + RCX*4]
SHLX RAX, RBX, RCX
POPCNT RAX, RBX
BSWAP RAX
MOVZX RAX, byte [RSI]
VMULPS YMM0, YMM1, YMM2

2. Critical Path Analysis

For each code sequence, identify the critical path (longest dependency chain) and compute the minimum possible execution time in cycles:

; Sequence A:
    mov rax, [rsi]      ; load latency = 4 cycles
    add rax, rbx        ; add latency = 1 cycle
    imul rax, rcx       ; imul latency = 3 cycles
    add rax, rdx        ; add latency = 1 cycle
; Critical path length = ?

; Sequence B:
    mov rax, [rsi]      ; load latency = 4 cycles
    mov rbx, [rsi+8]    ; load latency = 4 cycles (independent)
    add rax, rbx        ; depends on both loads
; Critical path length = ?

; Sequence C:
    imul rax, rax       ; chain 1
    imul rbx, rbx       ; chain 2 (independent)
    imul rax, rax       ; extends chain 1
    imul rbx, rbx       ; extends chain 2
    add rax, rbx        ; depends on both chains
; Critical path length = ?

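If the overlap rule is unclear, this toy model (with made-up latencies, deliberately not the sequences above) captures the two cases: serial instructions add their latencies, while independent chains run side by side and only the longest one counts before the joining op.

```c
/* Serial chain a -> b -> c: latencies simply add. */
int critical_serial(int a, int b, int c) {
    return a + b + c;
}

/* Two independent chains a and b joined by an op of latency c:
 * only the longer chain plus the join is on the critical path. */
int critical_join(int a, int b, int c) {
    return (a > b ? a : b) + c;
}
```

For example, with latencies 5, 3, and 1, the serial chain costs 9 cycles while the joined independent chains cost only 6.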
3. Throughput-Bound vs. Latency-Bound

For each of the following loops, determine whether execution is latency-bound (the critical path limits speed) or throughput-bound (execution-port contention limits speed):

; Loop A: 8 independent adds (the loop counter RCX and the stack
; pointer RSP are deliberately kept out of the add destinations)
    add rax, r8
    add rbx, r9
    add rdx, r10
    add rdi, r11
    add rsi, r12
    add rbp, r13
    add r14, r8
    add r15, r9
    dec rcx
    jnz .loop

; Loop B: Dependent multiply chain
    imul rax, rax
    imul rax, rax
    imul rax, rax
    dec rcx
    jnz .loop

; Loop C: Memory loads with computation
    mov r8, [rsi + rcx*8]
    add rax, r8
    imul rax, rdx
    dec rcx
    jnz .loop

Section B: Branch Prediction

4. Branch Prediction Benchmark

Write two implementations of a search function:

// Version A: conditional branch (data-dependent)
long find_value_branch(long *array, long n, long target) {
    for (long i = 0; i < n; i++) {
        if (array[i] == target) return i;
    }
    return -1;
}

// Version B: CMOV-based (branchless result accumulation)
// A sketch: at -O2 the conditional expression below usually compiles to a
// CMOV, but check the disassembly, since compilers may still emit a branch.
long find_value_cmov(long *array, long n, long target) {
    long result = -1;
    for (long i = 0; i < n; i++) {
        int hit = (array[i] == target) & (result < 0);  // first match only
        result = hit ? i : result;                      // intended CMOV
    }
    return result;
}

Benchmark both with:
a) A sorted array (predictable: the target is always in the first half)
b) A random array (target at a random position)
c) No match (target not present)

Measure with perf stat -e branch-misses for each scenario. When is the branchy version faster? When is the branchless version faster?

5. Misprediction Cost Measurement

Write a benchmark that deliberately causes branch mispredictions:

; Array of random 0/1 values
; For each element: if 1, add to sum; if 0, skip
; This branch is maximally unpredictable when array is random
.loop:
    movzx rbx, byte [rsi + rcx]  ; load random 0 or 1
    test rbx, rbx
    jz .skip                      ; 50% taken, 50% not-taken: ~50% misprediction rate
    add rax, rdx
.skip:
    dec rcx
    jnz .loop

Measure cycles per iteration. Then implement a branchless version:

    movzx rbx, byte [rsi + rcx]
    imul rbx, rdx               ; rbx = 0 or value (no branch)
    add rax, rbx

What is the cycles-per-iteration difference?
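The same transformation in C, for comparison under a compiler (a sketch; on random flags the first version takes the mispredictions, the second trades the branch for a multiply):

```c
/* Branchy: one hard-to-predict branch per element when flags are random. */
long sum_branchy(const unsigned char *flags, long n, long value) {
    long sum = 0;
    for (long i = 0; i < n; i++)
        if (flags[i]) sum += value;
    return sum;
}

/* Branchless: flags[i] is 0 or 1, so the multiply selects 0 or value. */
long sum_branchless(const unsigned char *flags, long n, long value) {
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += (long)flags[i] * value;
    return sum;
}
```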

6. Predictable Branch Optimization

A loop that runs exactly 1000 times has its loop-closing branch mispredicted exactly once (on the final iteration). Assume a misprediction costs 15 cycles:
a) If the loop body takes 1 cycle, what percentage of total execution time is misprediction overhead?
b) At what loop body size (in cycles) does the misprediction overhead fall below 1% of total time?
c) What happens to the total misprediction penalty if you unroll the loop 4×?

7. Return Address Stack Depth

The Return Address Stack (RAS) typically holds 16–32 entries. Write code that:
a) Has a call chain of depth 20 (function A calls B calls C ... calls T)
b) Measures the RET misprediction rate at various call depths using perf stat -e branch-misses
c) Plots misprediction rate vs. call depth

At what depth do you see a sharp increase in RET mispredictions?
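One way to build the chain without hand-writing 20 functions is a macro (a sketch: distinct functions give each call site a different return address; plain recursion would reuse one return address and can hide RAS overflow; noinline stops the compiler from flattening the chain):

```c
/* Terminal function of the chain. */
__attribute__((noinline)) long f20(void) { return 0; }

/* LINK(N, M) defines fN, which calls fM; the "1 +" after the call also
 * prevents tail-call optimization, so every level keeps its own frame. */
#define LINK(N, M) \
    __attribute__((noinline)) long f##N(void) { return 1 + f##M(); }

LINK(19, 20) LINK(18, 19) LINK(17, 18) LINK(16, 17)
LINK(15, 16) LINK(14, 15) LINK(13, 14) LINK(12, 13)
LINK(11, 12) LINK(10, 11) LINK(9, 10)  LINK(8, 9)
LINK(7, 8)   LINK(6, 7)   LINK(5, 6)   LINK(4, 5)
LINK(3, 4)   LINK(2, 3)   LINK(1, 2)   LINK(0, 1)
```

Calling f0() traverses a 21-deep chain of calls and returns; add or remove LINK lines to sweep the depth for part c).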


Section C: ILP and Out-of-Order Execution

8. Dependency Chain Breaking

Start with this loop, whose multiply forms a 3-cycle loop-carried dependency chain:

.loop:
    imul rax, [rsi + rcx*8]     ; rax = rax * mem[i]
    dec rcx
    jnz .loop

Transform it into:
a) A 2-way unrolled version with 2 independent accumulators
b) A 4-way unrolled version with 4 independent accumulators

Measure cycles per element for each version. What is the speedup from 1 to 2 to 4 accumulators? Why does it plateau?
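In C the same transformation looks like this (a sketch; n is assumed even in the unrolled version, and signed overflow is ignored for illustration):

```c
/* Single accumulator: one serial multiply chain, ~3 cycles per element. */
long product1(const long *a, long n) {
    long p = 1;
    for (long i = 0; i < n; i++)
        p *= a[i];
    return p;
}

/* Two accumulators: the chains overlap, hiding the 3-cycle IMUL latency,
 * until the multiply port itself saturates. Combine once after the loop. */
long product2(const long *a, long n) {
    long p0 = 1, p1 = 1;
    for (long i = 0; i < n; i += 2) {
        p0 *= a[i];        /* chain 0 */
        p1 *= a[i + 1];    /* chain 1, independent of chain 0 */
    }
    return p0 * p1;
}
```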

9. Mixed-Type Instruction Scheduling

The IMULs in the following loop all compete for port 1, the only integer-multiply port on recent Intel cores:

.loop:
    imul rax, rax
    imul rbx, rbx
    imul rcx, rcx
    imul rdx, rdx
    dec rsi
    jnz .loop

Rewrite the loop replacing some IMULs with ADDs or LEAs to spread across more ports. What is the throughput improvement?

10. µop Cache Limit

Write a loop body that is exactly:
a) 50 instructions (fits in both the loop buffer and the µop cache)
b) 150 instructions (beyond the ~64-µop loop buffer of Skylake-class cores, but still served from the µop cache)
c) 2,000 instructions (beyond a ~1.5K-µop cache, forcing fallback to the legacy decoders; Alder Lake's P-cores hold roughly 4K µops, so scale the sizes to your CPU)

The instructions should be independent NOPs or simple ADDs to isolate the front-end effect. Measure cycles per iteration at each size. At what instruction counts do you see performance cliffs?


Section D: llvm-mca Analysis

11. Analyzing a Loop with llvm-mca

Install LLVM and use llvm-mca to analyze loop throughput:

llvm-mca -mcpu=skylake -iterations=1000 loop.s

For the following loop:

.loop:
    vaddps ymm0, ymm0, [rsi]
    vaddps ymm1, ymm1, [rsi+32]
    add rsi, 64
    dec rcx
    jnz .loop

Read the llvm-mca output:
- Dispatch Width
- µops per iteration
- Critical resource bottleneck
- IPC prediction

What does llvm-mca say is the bottleneck? Is it a port, a dependency chain, or a load bandwidth issue?

12. Spectre Mitigation Cost

Write a bounds-checked array access function:

safe_array_load:
    cmp rsi, [array_size]
    jae .out_of_bounds
    lfence                        ; block speculative loads past the check
    mov rax, [array + rsi*8]
    ret
.out_of_bounds:
    mov rax, -1                   ; sentinel for an out-of-range index
    ret

Benchmark this function called 100 million times:
a) With the LFENCE
b) Without the LFENCE

Measure the throughput difference. Also use perf stat -e instructions to compare instruction counts. What fraction of the overhead is from LFENCE vs. the bounds check?
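A C-level version of the same experiment (a sketch: _mm_lfence is the SSE2 intrinsic that emits LFENCE; the array name, size, and sentinel here are illustrative):

```c
#include <emmintrin.h>   /* _mm_lfence */

enum { ARRAY_SIZE = 256 };
static long array[ARRAY_SIZE];

/* Bounds check + LFENCE: the fence keeps the load from executing
 * speculatively before the check resolves. */
long safe_load_fenced(unsigned long i) {
    if (i >= ARRAY_SIZE) return -1;
    _mm_lfence();
    return array[i];
}

/* The same check without the fence, for the baseline measurement. */
long safe_load_unfenced(unsigned long i) {
    if (i >= ARRAY_SIZE) return -1;
    return array[i];
}
```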


Section E: Profile-Guided Analysis

13. IPC Profiling

For a program of your choice (it could be your MinOS kernel routines, an AES encryption function, or a matrix multiply), run:

perf stat -e cycles,instructions,branch-misses,cache-misses ./program

Interpret the results:
- What is the IPC?
- What percentage of cycles is wasted on branch mispredictions?
- What percentage is potentially wasted on cache misses?
- Is the program pipeline-bound, memory-bound, or branch-bound?

14. Hot Function Identification

Run a larger program (e.g., your assembly AES encryption from earlier chapters) under:

perf record -e cycles ./program
perf report

Identify the top 3 functions by cycle count. Use perf annotate to see which specific instructions are most expensive within the hot function. Does the perf annotation match your expectation of where the bottleneck should be?

15. RDTSC Microbenchmark

Write a precise microbenchmark using RDTSC:

    lfence                  ; keep earlier work from drifting past rdtsc
    rdtsc                   ; counter in EDX:EAX
    shl rdx, 32
    or  rax, rdx
    mov r8, rax             ; start time

; ... code to benchmark (N iterations) ...

    lfence                  ; ensure the benchmarked code has finished
    rdtsc
    shl rdx, 32
    or  rax, rdx
    sub rax, r8             ; elapsed cycles
    xor rdx, rdx
    mov rcx, N
    div rcx                 ; RAX = cycles per iteration

Use this to measure:
a) NOP (should be ~0.2–0.25 cycles: limited by issue/retire width, not latency)
b) ADD RAX, RAX (should be ~1 cycle: each ADD depends on the previous one, so this measures latency; a stream of independent ADDs would approach ~0.25 cycles)
c) IMUL RAX, RAX (should be ~3 cycles: a self-dependent chain measures IMUL latency; independent IMULs would approach the ~1-cycle throughput)
d) DIV RCX where RCX = 7 (should be roughly 20–90 cycles, depending on the core and operand values)

Warm up the cache by running the code once before measuring, then repeat the measurement 5 times and take the minimum to suppress measurement noise.