Chapter 31 Quiz: The Modern CPU Pipeline

1. What are micro-operations (µops)?

A) Very small x86-64 instructions that fit in one byte B) RISC-like internal operations that x86-64 instructions are decoded into before execution C) Micro-optimized versions of common instruction sequences stored in cache D) The individual bits within a machine word that the CPU operates on

Answer: B — The x86-64 decoder translates complex CISC instructions into simpler, fixed-format µops that the out-of-order core can schedule and execute. Simple instructions produce 1 µop; complex instructions may produce 2–70+ µops.


2. What is the Reorder Buffer (ROB) used for?

A) Reordering memory accesses to improve cache performance B) Buffering recently decoded instructions for the µop cache C) Tracking in-flight µops in program order, allowing out-of-order execution while maintaining in-order retirement D) Storing branch prediction history for recently executed branches

Answer: C — The ROB holds every in-flight µop in program order. µops execute out of order (on whatever execution unit is free), but they retire in order from the head of the ROB, so the programmer still observes the effects of sequential execution while the CPU reorders freely underneath.


3. Instruction A has latency 3 and instruction B depends on A's result. What is the earliest cycle B can execute after A starts?

A) Cycle 0 (immediately) B) Cycle 1 (next cycle) C) Cycle 3 (B can start when A's result is available) D) Cycle 4 (A finishes at end of cycle 3; B can start cycle 4)

Answer: C — If A starts at cycle 0 with latency 3, it occupies cycles 0-2 and its result is available at the start of cycle 3, so B can issue at cycle 3 (and completes at cycle 3 + latency_B). "Latency 3" means the result is ready 3 cycles after issue, not one cycle later.


4. ADD RAX, RBX has throughput 0.25 cycles. What does this mean?

A) The instruction takes 1/4 cycle to execute B) Up to 4 ADD instructions can be issued per clock cycle (to 4 different ALU execution units) C) The result is available in 0.25 cycles D) The instruction occupies the execution unit for 0.25 cycles

Answer: B — Throughput of 0.25 cycles (often written as "reciprocal throughput 0.25") means one ADD can be issued every 0.25 cycles, i.e., 4 ADD instructions can start per cycle. This requires 4 independent ADD operations — with data dependencies, you are limited by latency (1 cycle between dependent ADDs).


5. What does register renaming accomplish?

A) It renames architectural registers for compatibility with different calling conventions B) It eliminates write-after-write (WAW) and write-after-read (WAR) false data hazards by mapping each destination to a fresh physical register C) It speeds up register access by using a smaller, faster register file D) It allows registers to be shared between different execution units

Answer: B — Register renaming eliminates "false" dependencies. When RAX is written by instruction A and later overwritten by instruction B, without renaming B must wait for A's write to complete (WAW) and for every older instruction still reading A's value to finish reading it (WAR). With renaming, B gets a fresh physical register for its result — B and A are now independent and can execute in parallel.


6. A branch predictor makes an incorrect prediction. What is the typical pipeline penalty?

A) 1-2 cycles (just one pipeline stage) B) 5-8 cycles (a few pipeline stages) C) 15-20 cycles (the entire pipeline must be flushed and restarted) D) 100+ cycles (similar to a cache miss)

Answer: C — A branch misprediction requires flushing all speculatively-executed instructions from the pipeline and restarting at the correct address. Modern deep pipelines (19+ stages) incur 15–20 cycle penalties. This is why unpredictable branches in tight loops can dramatically reduce performance.


7. When is CMOV (conditional move) faster than a conditional jump?

A) Always — CMOV is always faster than any branch instruction B) When the branch condition is predictable (> 95% taken or > 95% not-taken) C) When the branch condition depends on data that has no predictable pattern D) Only when the code is in a tight inner loop

Answer: C — CMOV eliminates the branch entirely, so there is no misprediction cost. It is most beneficial when the branch depends on data that the predictor cannot predict reliably (random data, hash values, comparisons of unsorted data). For predictable branches, the predictor works well and the branch has near-zero cost — CMOV may actually be slower there, because it turns a control dependency into a data dependency on both inputs, lengthening the critical path.


8. What is the µop cache (Decoded Stream Buffer)?

A) A cache of frequently-used machine code instructions in ROM B) A cache of pre-decoded µops that bypasses the instruction decoder for recently-executed code C) A buffer that stores µops waiting for execution unit availability D) The cache that maps x86-64 register names to physical registers

Answer: B — The µop cache (or DSB — Decoded Stream Buffer) stores the µops decoded from recently-executed x86-64 instructions. When the CPU re-encounters the same code (loops), it can fetch µops directly from this cache instead of re-decoding, saving decode bandwidth and reducing frontend latency.


9. Why is the DIV instruction dramatically slower than IMUL?

A) DIV accesses memory while IMUL only uses registers B) Division requires more complex hardware (not easily pipelined like multiplication) and uses microcode with many µops C) DIV locks the entire CPU pipeline D) DIV results must be verified by a second execution unit

Answer: B — Integer division is algorithmically more complex than multiplication: the hardware divider produces the quotient a few bits at a time, requiring many cycles, and cannot be pipelined as aggressively as the multiplier. IMUL uses a fast combinational multiplier and completes in 3 cycles. 64-bit DIV takes 35-90 cycles and has a reciprocal throughput of 20-74 cycles.


10. The Spectre v1 vulnerability exploits which CPU mechanism?

A) The µop cache executing malicious µops directly B) Speculative execution past a bounds check, which leaves observable traces in the cache state C) Branch prediction that redirects execution to attacker-controlled code D) Register renaming that leaks values between security domains

Answer: B — Spectre v1 exploits speculative execution: when the bounds-check branch is mispredicted, the CPU speculatively reads out-of-bounds memory and uses the secret value to index a second array, pulling a secret-dependent cache line in. Even after the misspeculation is squashed, that cache state change persists; the attacker then times accesses to the second array to infer the secret byte value.


11. What does IPC (Instructions Per Clock) measure, and what values indicate good pipeline utilization?

A) The number of instructions fetched per clock; 1.0 is optimal B) The number of architectural instructions retired per clock cycle; modern CPUs achieve 2-4 IPC for well-optimized code C) The number of µops issued per clock; should equal the pipeline width D) The percentage of clock cycles where the pipeline is fully utilized

Answer: B — IPC counts architecturally-visible instructions retired per clock cycle. Modern superscalar CPUs have theoretical peaks of 4-6 IPC, but real programs achieve 1-3 IPC typically. IPC < 1 usually indicates memory-bound or branch-misprediction-heavy code; IPC > 3 indicates very good pipeline utilization.


12. In a loop with a 3-cycle critical path and no other bottlenecks, what is the maximum throughput?

A) 3 iterations per cycle (since the pipeline handles 3 in parallel) B) 1 iteration per 3 cycles C) 1 iteration per cycle (the CPU executes the iterations in parallel) D) 3 iterations per 9 cycles

Answer: B — With a 3-cycle loop-carried critical path, each iteration cannot begin until the previous iteration's 3-cycle chain completes. This gives 1 iteration per 3 cycles. To break through this limit, you must either shorten the loop-carried dependency chain (e.g., reassociate a serial IMUL chain) or use multiple independent accumulator variables.


13. The Return Address Stack (RAS) predicts:

A) The target of indirect CALL instructions (calls through function pointers) B) The return address for RET instructions, by tracking CALL instructions C) Which branch in a function will be taken next D) The next instruction to fetch after a conditional jump

Answer: B — The RAS is a hardware stack that mirrors the software call stack. Every CALL pushes the return address; every RET pops the predicted return address. This gives near-perfect RET prediction for normal function call patterns. It fails when the call stack is corrupted, when setjmp/longjmp is used, or when call depth exceeds the RAS depth (~16-32 entries).


14. How does loop unrolling help performance?

A) It reduces the total number of iterations the loop must execute B) It reduces loop overhead (branch and counter update) and exposes more ILP by having more independent operations in the loop body C) It reduces the code size and improves µop cache usage D) It prevents branch mispredictions on the loop-closing branch

Answer: B — Loop unrolling does two things: (1) amortizes the loop overhead (dec/jnz) across more work, and (2) more importantly, creates multiple independent operations within one iteration that the CPU can execute in parallel. A 4x unrolled reduction loop can achieve 4x the throughput if the four accumulators are independent.


15. LFENCE is used as a Spectre mitigation. What performance cost does it impose?

A) It flushes the entire L1 instruction cache B) It serializes the instruction stream — all prior instructions must complete memory accesses before any instructions after LFENCE can execute, preventing speculative execution past it C) It disables the branch predictor for the duration of the protected region D) It doubles the cost of all subsequent memory loads

Answer: B — LFENCE prevents the CPU from speculatively executing loads past it until all preceding loads have completed. This directly prevents the speculative out-of-bounds access in Spectre v1. The cost is pipeline serialization: no instructions can execute until the fence's predecessor instructions finish, which can reduce throughput by 10-30% in tight loops.