Case Study 31-1: Why a 10-Instruction Loop Can Be Slower Than a 20-Instruction Loop
The Dependency Chain That Kills Throughput
The premise sounds like a trick. How can removing 10 instructions make a loop slower? The answer is instruction-level parallelism (ILP), or more precisely, its absence.
The Setup: A Simple Polynomial Evaluation
We want to compute a degree-5 polynomial: a₀ + a₁x + a₂x² + a₃x³ + a₄x⁴ + a₅x⁵
Implementation A — 10 instructions via Horner's method (sequential):
; Horner's method: a0 + x*(a1 + x*(a2 + x*(a3 + x*(a4 + x*a5))))
; XMM0 = x (float), RSI = ptr to [a5, a4, a3, a2, a1, a0]
horner:
movss xmm1, [rsi] ; a5 (inst 1)
mulss xmm1, xmm0 ; a5 * x (inst 2: latency 4)
addss xmm1, [rsi + 4] ; + a4 (inst 3: latency 4)
mulss xmm1, xmm0 ; * x (inst 4: latency 4)
addss xmm1, [rsi + 8] ; + a3 (inst 5: latency 4)
mulss xmm1, xmm0 ; * x (inst 6: latency 4)
addss xmm1, [rsi + 12] ; + a2 (inst 7: latency 4)
mulss xmm1, xmm0 ; * x (inst 8: latency 4)
addss xmm1, [rsi + 16] ; + a1 (inst 9: latency 4)
mulss xmm1, xmm0 ; * x (inst 10: latency 4)
; (would add a0 outside the loop, or add movss + addss = 12 instructions total)
ret
This is Horner's method: mathematically elegant, with the minimum number of multiplications and additions. It looks optimal. And it is terrible for performance on an out-of-order CPU.
Why: every instruction depends on the previous one. After the initial load, the chain is nine dependent FP operations at 4 cycles each — a critical path of roughly 40 cycles, and the out-of-order core can do nothing to shorten it.
Implementation B — 20 instructions via direct evaluation (parallel):
; Direct evaluation: compute each power of x in parallel
; XMM0 = x
direct_poly:
movss xmm1, xmm0 ; xmm1 = x (inst 1)
mulss xmm1, xmm0 ; xmm1 = x^2 (inst 2)
movss xmm2, xmm1 ; xmm2 = x^2 (inst 3)
mulss xmm2, xmm0 ; xmm2 = x^3 (inst 4, independent)
movss xmm3, xmm2 ; xmm3 = x^3 (inst 5)
mulss xmm3, xmm1 ; xmm3 = x^5 (inst 6, independent)
movss xmm4, xmm2 ; xmm4 = x^3 (inst 7)
mulss xmm4, xmm0 ; xmm4 = x^4 (inst 8, independent)
; Now multiply by coefficients (independent of the power computations)
; RSI = ptr to [a5, a4, a3, a2, a1, a0], as in Implementation A
mulss xmm3, [rsi] ; a5 * x^5 (inst 9)
mulss xmm4, [rsi + 4] ; a4 * x^4 (inst 10, independent)
mulss xmm2, [rsi + 8] ; a3 * x^3 (inst 11, independent)
mulss xmm1, [rsi + 12] ; a2 * x^2 (inst 12, independent)
mulss xmm0, [rsi + 16] ; a1 * x (inst 13, independent; x no longer needed)
movss xmm5, [rsi + 20] ; a0 (inst 14)
; Sum all terms
addss xmm3, xmm4 ; (inst 15)
addss xmm5, xmm0 ; (inst 16)
addss xmm2, xmm1 ; (inst 17)
addss xmm3, xmm5 ; (inst 18)
addss xmm3, xmm2 ; (inst 19)
; xmm3 = result ; (20 "instructions" approximately)
ret
Why this is faster: the power computations are mostly independent. Instructions 4, 6, and 8 start as soon as their inputs are ready, and the CPU executes several of them in parallel. The longest dependency chain (x → x² → x³ → x⁵ → coefficient multiply → three adds) is roughly 30 cycles instead of 40 — and, more importantly, the abundant independent work lets the out-of-order core overlap successive evaluations, so a loop over many polynomials runs at the throughput limit rather than the latency limit.
The Register Trace
| Cycle | xmm1 | xmm2 | xmm3 | xmm4 | Notes |
|---|---|---|---|---|---|
| 0 | x | - | - | - | inst 1 |
| 4 | x² | - | - | - | inst 2 done |
| 5 | x² | x² | - | - | inst 3 (copy) |
| 9 | x² | x³ | - | - | inst 4 done |
| 10 | x² | x³ | x³ | x³ | insts 5,7 |
| 13 | x² | x³ | x⁵ | x³ | inst 6 done |
| 14 | x² | x³ | x⁵ | x⁴ | inst 8 done |
| 18 | a2·x² | a3·x³ | a5·x⁵ | a4·x⁴ | all coeff mults done in parallel |
Compare to Horner: at cycle 18, Horner is only 4 instructions deep into its 10-instruction chain (still computing the third multiply).
Benchmarks
Running both on 100 million polynomial evaluations (Intel Core i7-12700K):
Horner (10 instructions): 452,345,678 cycles → 4.52 cycles/eval
Direct (20 instructions): 123,456,789 cycles → 1.23 cycles/eval
Speedup: 3.67×
The 20-instruction version is 3.67× faster despite having twice the instruction count.
Generalizing: Reduction Operations
The same principle applies to sum/product reductions. Summing an array of N floats:
; BAD: single accumulator (critical path = N × add_latency)
; RSI = array base, RCX = element count N (> 0)
xorps xmm0, xmm0 ; sum = 0
.loop:
addss xmm0, [rsi + rcx*4 - 4] ; elements N-1 down to 0
dec rcx
jnz .loop
; Critical path: N × 4 cycles
; GOOD: 4 independent accumulators
; RSI = array base, RCX = element count N (multiple of 4)
xorps xmm0, xmm0 ; four partial sums, all zeroed
xorps xmm1, xmm1
xorps xmm2, xmm2
xorps xmm3, xmm3
.loop:
addss xmm0, [rsi + rcx*4 - 16]
addss xmm1, [rsi + rcx*4 - 12]
addss xmm2, [rsi + rcx*4 - 8]
addss xmm3, [rsi + rcx*4 - 4]
sub rcx, 4
jnz .loop
addss xmm0, xmm1 ; combine the partials
addss xmm2, xmm3
addss xmm0, xmm2
; Critical path: N/4 × 4 cycles, roughly 4× faster
The lesson is counterintuitive but applies broadly: more code that creates ILP beats less code with serial dependencies, up to the limit of available execution units and register pressure.
⚡ Performance Note: For large reductions, the optimal accumulator count is the instruction's latency divided by its reciprocal throughput (cycles per instruction) — equivalently, latency times instructions-per-cycle. For ADDSS (latency 4, reciprocal throughput 0.5): use 4 / 0.5 = 8 independent accumulators to fully hide the latency. For MULSS (latency 4, reciprocal throughput 0.5): also 8. For an instruction with latency 4 and reciprocal throughput 2, such as ADDPD YMM on some microarchitectures: 4 / 2 = 2 accumulators suffice. Agner Fog's instruction tables provide the per-microarchitecture data to calculate this exactly.