Chapter 33 Quiz: Performance Analysis and Optimization

1. Why should you use the minimum (not the average) across many RDTSC measurements when benchmarking a function?

A) The minimum eliminates hardware errors that occasionally make instructions faster B) The minimum represents the best-case performance, which is the hardware limit; higher values reflect OS interrupts, context switches, and other noise C) The minimum is faster to compute than the average D) The average includes warm-up iterations that make the result inaccurate

Answer: B — When timing a function repeatedly, the minimum sample corresponds to the trial where no OS interrupt, TLB miss, or context switch occurred. That minimum represents what the hardware can actually achieve. Higher samples include external noise. If you are measuring hardware throughput, the minimum is the correct estimate; if you are measuring system latency (including OS overhead), the average is more appropriate.


2. What does LFENCE do when placed around an RDTSC instruction?

A) It clears the L1 cache so the measured code starts cold B) It prevents out-of-order execution from reordering instructions into or out of the timed region C) It flushes write-combining buffers before reading the time stamp D) It disables interrupts during the time measurement

Answer: B — Without serialization, the CPU's out-of-order execution engine may speculatively execute instructions that nominally appear after RDTSC before the counter is read (or instructions before RDTSC may not have completed). LFENCE prevents loads from being reordered past it, effectively creating a barrier that ensures the TSC is read at the right point in the execution sequence.


3. perf stat reports IPC = 0.4 for a program. What does this most likely indicate?

A) The CPU is executing too many instructions — IPC should be closer to 1.0 B) The program has a severe bottleneck: either many cache misses to DRAM or many branch mispredictions C) The program is very well optimized and using SIMD efficiently D) The frontend is delivering instructions faster than the backend can execute them

Answer: B — IPC < 1 means the CPU is stalling for more than one cycle per instruction. This indicates a severe bottleneck, most commonly DRAM-latency-bound pointer chasing (each load stalls the pipeline for 200+ cycles before the next address is known), or very high branch misprediction rate (flushing 15–20 cycles of work per misprediction). IPC 1–3 is typical; IPC > 3 requires SIMD or excellent ILP.


4. What does the Intel Top-Down analysis category "Backend Bound — Memory" indicate?

A) The program is generating too many store instructions B) The execution backend is waiting for data from cache or DRAM due to cache misses C) The memory bus between the CPU and GPU is saturated D) The program is spending more than 50% of cycles in OS memory allocation routines

Answer: B — "Backend Bound — Memory" means the execution units are idle because they are waiting for data to arrive from cache or DRAM. Load instructions that miss the cache stall the dependent instructions until the data arrives. This category distinguishes memory-bound behavior from compute-bound (Backend Bound — Core) and instruction supply problems (Frontend Bound).


5. A loop body has the following dependency chain: - MULSS XMM0, XMM1 (latency 4) - ADDSS XMM0, XMM2 (latency 4) The loop runs N iterations. What is the minimum number of independent accumulators needed to fully saturate the FP multiply and add units (assuming each unit has throughput 0.5)?

A) 2 accumulators B) 4 accumulators C) 8 accumulators D) 16 accumulators

Answer: C — The combined operation (MULSS + ADDSS) has a critical path latency of 8 cycles. Each unit has reciprocal throughput 0.5, so the combined throughput is also ~0.5 cycles per issue. To keep both units busy despite the 8-cycle latency chain, you need: accumulators = latency / throughput = 8 / 0.5 = 16? But since MULSS and ADDSS are on different ports, the combined throughput is min(0.5, 0.5) per pair = ~1 cycle for the pair. With pair latency 8: need 8 independent pairs. However, with 8 accumulators, each accumulator's chain is 8 cycles deep, allowing 8 chains to overlap perfectly. The correct answer is 8.


6. Why is dividing by a constant faster with a reciprocal multiply than with the DIV instruction?

A) The reciprocal multiply avoids a memory access that DIV requires B) DIV is implemented in microcode with dozens of µops and variable latency; reciprocal multiply uses a single IMUL instruction with 3-cycle latency C) The compiler automatically replaces DIV with the faster version D) DIV requires the operand to be in RDX:RAX which causes register conflicts

Answer: B — Integer DIV is a microcode instruction with latency 35–90 cycles and low throughput. It cannot be pipelined efficiently. A reciprocal multiply computes q ≈ n × (2^k / d) >> k using IMUL (3 cycles) and SHR (1 cycle), giving the exact quotient in ~5 cycles — 7–18× faster. This is why compilers automatically replace constant-divisor divisions with reciprocal multiplies at -O1 and above.


7. What is llvm-mca used for?

A) A runtime profiler that samples instruction execution on real hardware B) A static analysis tool that simulates the CPU pipeline for a code region, predicting throughput, IPC, and resource pressure without running the code C) A memory checker that detects cache coherence violations D) A binary rewriter that applies automatic SIMD vectorization

Answer: Bllvm-mca (LLVM Machine Code Analyzer) takes assembly source and simulates how a specified CPU pipeline would execute it, computing iterations-per-cycle throughput, IPC, and per-port resource utilization. It requires no hardware access. It is used to quickly evaluate whether an optimization will help and which execution port is the bottleneck, before spending time on full benchmarking.


8. In a floating-point reduction loop, replacing MULSS + ADDSS with VFMADD231SS primarily helps by:

A) Using the FMA unit which is twice as wide as the add unit B) Reducing the number of instructions, which reduces frontend pressure C) Fusing the multiply and add into one instruction, reducing the dependency chain length and improving accuracy D) Allowing the computation to execute in half the cycles because FMA is faster than separate mul+add

Answer: CVFMADD231SS computes dst += src1 * src2 as a single operation with latency 4 (same as MULSS alone). Without FMA, the chain is MULSS (4 cycles) + ADDSS (4 cycles) = 8 cycles. With FMA, the chain is just 4 cycles — halving the minimum accumulator count needed and doubling throughput for latency-bound loops. Additionally, FMA has better floating-point accuracy (single rounding vs. two roundings).


9. perf annotate shows that 35% of samples fall on a MOVSS [rsi + rax*4] load instruction. What does this indicate?

A) The instruction is executed 35% of the time in the program B) The instruction is causing frequent cache misses, and the CPU is stalled waiting for data to arrive from L2/L3/DRAM C) The instruction has high latency because of its complex addressing mode D) 35% of instructions in the program are load instructions

Answer: B — In sampling profilers, samples accumulate on the instruction that is currently being waited for when the sample fires. A high percentage on a load instruction typically indicates that load is frequently missing cache and the CPU is stalling (many cycles elapse waiting for the data), causing the sample counter to fire repeatedly while the CPU waits. The complex addressing mode (rsi + rax*4) itself adds at most 1 cycle; the miss is the real cost.


10. Loop unrolling with a single accumulator does NOT improve throughput for a serial dependency chain. Why?

A) Unrolling increases code size, which may evict other code from the instruction cache B) The dependency chain through the accumulator still exists; unrolling processes more elements per iteration but the chain length in cycles does not decrease C) The CPU automatically re-rolls unrolled loops, negating the benefit D) Unrolling requires extra registers that may not be available

Answer: B — If each iteration does acc += arr[i] and the accumulator acc is written every iteration, the dependency chain is: iteration 1 must complete before iteration 2, iteration 2 before iteration 3, etc. Unrolling without multiple accumulators just writes to acc more times per iteration, but the same serial chain exists. The critical path is still N × latency(ADD). Multiple independent accumulators break the chain by running N/k independent chains of length k × latency.


11. What is "software pipelining" in the context of loop optimization?

A) Using OS pipes to parallelize loop iterations across processes B) Manually interleaving instructions from different iterations so that while one iteration's data is loading from memory, another iteration's computation is executing C) Breaking a loop into a pipeline of stages, each in a separate function D) Using SIMD to process multiple loop iterations simultaneously

Answer: B — Software pipelining exploits instruction-level parallelism across iteration boundaries. While waiting for iteration i's load to complete (4–200 cycles), the CPU can execute computation from iteration i-1 whose data has already arrived. By explicitly issuing the load for iteration i+1 at the top of the loop body and computing iteration i's result using the previously loaded value, the memory latency is hidden behind computation.


12. Which instruction should be preferred for computing rax = rax * 3 in a latency-critical dependency chain?

A) IMUL RAX, RAX, 3 (multiply by immediate) B) LEA RAX, [RAX + RAX*2] (load effective address) C) ADD RAX, RAX; ADD RAX, RAX (two additions) D) SHL RAX, 1; ADD RAX, [original] (shift + add)

Answer: BLEA with a scaled index register computes RAX + RAX*2 = RAX*3 in 1 cycle using the address generation unit (AGU). IMUL takes 3 cycles latency. Two ADD instructions take 2 cycles total but require the original value to be preserved separately. LEA is both the fastest (1 cycle) and does not destroy the original value (unless the destination register is the same as the source, as written here where it does replace RAX).


13. The ALIGN 16 directive in NASM before a loop label serves what purpose?

A) It forces the loop data to be 16-byte aligned for SIMD loads B) It pads with NOP instructions so the loop's first instruction starts at a 16-byte aligned address, potentially improving instruction fetch efficiency C) It aligns the loop iteration count to a multiple of 16 D) It tells the assembler to generate 16-byte versions of instructions

Answer: BALIGN 16 inserts NOP instructions (or multi-byte NOP combinations) until the program counter reaches the next 16-byte boundary. This ensures the loop's first instruction is aligned, so the CPU's 16-byte instruction fetch window can deliver the loop's first several instructions without straddling a boundary. For AVX loops, ALIGN 32 is often preferred to match the 32-byte µop cache row width.


14. When is PREFETCHNTA more appropriate than PREFETCHT0?

A) When the data will be accessed multiple times and should stay in cache as long as possible B) When the data will be accessed exactly once and should not evict frequently-used data from L2/L3 C) When the data is a pointer that will be used for further loads D) When the access pattern is stride-1 and the hardware prefetcher is already active

Answer: BPREFETCHNTA (Non-Temporal Aligned) fetches the data into L1 with minimal insertion into L2/L3 — it has a "non-temporal" hint meaning "I will not reuse this soon." Use it for data that is processed once (streaming reads, memcpy source). PREFETCHT0 fills L1+L2+L3, which is better for data that will be accessed several times within the near future. Using PREFETCHT0 for one-time-access data evicts hot data from L2/L3 unnecessarily.


15. A program's perf stat shows: uops_dispatched_port.port_0 = 95%, port_1 = 30%. What optimization should be investigated?

A) Add more SIMD instructions to reduce instruction count B) Reduce the number of instructions that execute on port 0, possibly by rebalancing with FMA or rearranging the computation C) Increase cache blocking to reduce memory pressure on port 0 D) Use non-temporal stores to reduce port 0 saturation

Answer: B — Port 0 at 95% utilization means the port 0 execution unit is the bottleneck — all other ports are idle while waiting for port 0 to free up. Port 0 on Intel Skylake handles floating-point multiplication, branches, and some integer operations. If the loop has many VMULPS or VFMADD instructions, they all compete for port 0. Rebalancing with FMA instructions that use port 0 + port 1 together, or restructuring to reduce multiply-heavy sequences, can reduce port 0 pressure. Non-temporal stores use port 4, which is unrelated.