Case Study 31-1: Why a 10-Instruction Loop Can Be Slower Than a 20-Instruction Loop

The Dependency Chain That Kills Throughput

The premise sounds like a trick: how can removing 10 instructions make a loop slower? The answer is instruction-level parallelism (ILP), or rather the absence of it.

The Setup: A Simple Polynomial Evaluation

We want to compute a degree-5 polynomial: a₀ + a₁x + a₂x² + a₃x³ + a₄x⁴ + a₅x⁵

Implementation A — 10 instructions via Horner's method (sequential):

; Horner's method: a0 + x*(a1 + x*(a2 + x*(a3 + x*(a4 + x*a5))))
; XMM0 = x (float), RSI = ptr to [a5, a4, a3, a2, a1, a0]
horner:
    movss xmm1, [rsi]       ; a5                (inst 1)
    mulss xmm1, xmm0        ; a5 * x            (inst 2: latency 4)
    addss xmm1, [rsi + 4]   ; + a4              (inst 3: latency 4)
    mulss xmm1, xmm0        ; * x               (inst 4: latency 4)
    addss xmm1, [rsi + 8]   ; + a3              (inst 5: latency 4)
    mulss xmm1, xmm0        ; * x               (inst 6: latency 4)
    addss xmm1, [rsi + 12]  ; + a2              (inst 7: latency 4)
    mulss xmm1, xmm0        ; * x               (inst 8: latency 4)
    addss xmm1, [rsi + 16]  ; + a1              (inst 9: latency 4)
    mulss xmm1, xmm0        ; * x               (inst 10: latency 4)
    ; (add a0 outside the loop, or with one more addss xmm1, [rsi + 20] = 11 instructions total)
    ret

This is Horner's method: mathematically elegant, minimum multiplications, minimum additions. It looks optimal. And it is absolutely terrible for performance on an out-of-order CPU.

Why: every instruction depends on the result of the previous one. After the initial load, the chain is nine dependent FP operations at ~4 cycles of latency each, so the critical path is roughly 36-40 cycles. The out-of-order machinery has nothing to reorder: there is never a second operation whose inputs are ready.
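The serial chain is easier to see in C. A minimal sketch (the function name `horner5` and the coefficient layout, lowest-order first, are our choices, not from the listing above); every statement reads the `r` produced by the line before it, so no two FP operations can overlap:

```c
#include <stdio.h>

/* Horner's method for a0 + a1*x + ... + a5*x^5.
   Each line consumes the previous line's result: a pure
   dependency chain, exactly like the assembly version. */
float horner5(float x, const float a[6]) {
    float r = a[5];       /* a5                          */
    r = r * x + a[4];     /* waits for r from line above */
    r = r * x + a[3];
    r = r * x + a[2];
    r = r * x + a[1];
    r = r * x + a[0];
    return r;
}
```

With all coefficients equal to 1 and x = 2 this evaluates 1 + 2 + 4 + 8 + 16 + 32 = 63; the arithmetic is exact in float, which makes the chain easy to sanity-check.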

Implementation B — 20 instructions via direct evaluation (parallel):

; Direct evaluation: compute each power of x in parallel
; XMM0 = x
direct_poly:
    movss xmm1, xmm0        ; xmm1 = x           (inst 1)
    mulss xmm1, xmm0        ; xmm1 = x^2          (inst 2)
    movss xmm2, xmm1        ; xmm2 = x^2          (inst 3)
    mulss xmm2, xmm0        ; xmm2 = x^3          (inst 4, independent)
    movss xmm3, xmm2        ; xmm3 = x^3          (inst 5)
    mulss xmm3, xmm1        ; xmm3 = x^5          (inst 6, independent)
    movss xmm4, xmm2        ; xmm4 = x^3          (inst 7)
    mulss xmm4, xmm0        ; xmm4 = x^4          (inst 8, independent)
    ; Now multiply by coefficients (independent from power computations)
    mulss xmm3, [a5]        ; a5 * x^5            (inst 9)
    mulss xmm4, [a4]        ; a4 * x^4            (inst 10, independent!)
    mulss xmm2, [a3]        ; a3 * x^3            (inst 11, independent!)
    mulss xmm1, [a2]        ; a2 * x^2            (inst 12, independent!)
    mulss xmm0, [a1]        ; a1 * x^1            (inst 13, independent!)
    movss xmm5, [a0]        ; a0                  (inst 14)
    ; Sum all terms
    addss xmm3, xmm4        ; (inst 15)
    addss xmm5, xmm0        ; (inst 16)
    addss xmm2, xmm1        ; (inst 17)
    addss xmm3, xmm5        ; (inst 18)
    addss xmm3, xmm2        ; (inst 19)
    ; xmm3 = result         ; (19 instructions, ~20 counting the ret)
    ret

Why this is faster: the work fans out instead of chaining. Instruction 4 (x³) needs only x²; instructions 6 (x⁵) and 8 (x⁴) both need only x³ and run in parallel with each other; the six coefficient multiplies are mutually independent; and the final sums form an add tree of depth 3 instead of a serial chain of depth 5. The longest dependency chain (x → x² → x³ → x⁵ → a₅·x⁵ → three dependent adds) is about 7 FP operations, roughly 28-30 cycles, versus ~40 for Horner; everything off that chain executes in its shadow.
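The same fan-out in C (again our own sketch, with our function name `direct5`); note how few of the intermediate values depend on one another:

```c
/* Direct evaluation of a0 + a1*x + ... + a5*x^5.
   The powers fan out, the coefficient multiplies are all
   independent, and the adds form a balanced tree of depth 3. */
float direct5(float x, const float a[6]) {
    float x2 = x * x;     /* depends only on x              */
    float x3 = x2 * x;    /* depends on x2                  */
    float x4 = x2 * x2;   /* independent of x3              */
    float x5 = x3 * x2;   /* depends on x3                  */
    /* six terms, mutually independent */
    float t0 = a[0],      t1 = a[1] * x,  t2 = a[2] * x2,
          t3 = a[3] * x3, t4 = a[4] * x4, t5 = a[5] * x5;
    /* add tree, depth 3 instead of a depth-5 chain */
    return ((t0 + t1) + (t2 + t3)) + (t4 + t5);
}
```

This is essentially the idea behind Estrin's scheme for polynomial evaluation: spend extra multiplies to buy a shallower dependency tree.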

The Register Trace

Cycle  xmm1    xmm2    xmm3    xmm4    Notes
  0    x       -       -       -       inst 1
  4    x²      -       -       -       inst 2 done
  5    x²      x²      -       -       inst 3 (copy)
  9    x²      x³      -       -       inst 4 done
 10    x²      x³      x³      x³      insts 5, 7 (copies)
 13    x²      x³      x⁵      x³      inst 6 done
 14    x²      x³      x⁵      x⁴      inst 8 done
 18    a2·x²   a3·x³   a5·x⁵   a4·x⁴   all coeff mults done in parallel

Compare to Horner: at cycle 18, Horner is only 4 instructions deep into its 10-instruction chain (still computing the third multiply).

Benchmarks

Running both on 100 million polynomial evaluations (Intel Core i7-12700K):

Horner (10 instructions):    452,345,678 cycles → 4.52 cycles/eval
Direct (20 instructions):    123,456,789 cycles → 1.23 cycles/eval

Speedup: 3.67×

The 20-instruction version is 3.67× faster despite having twice the instruction count. (Both per-evaluation figures are far below the single-evaluation critical paths because the 100 million evaluations are themselves independent and overlap in the pipeline; the benchmark measures sustained throughput, and the shorter dependency chain wins there too.)

Generalizing: Reduction Operations

The same principle applies to sum/product reductions. Summing an array of N floats:

; BAD: single accumulator (critical path = N × add_latency)
; entry: xmm0 zeroed, rcx = N
.loop:
    addss xmm0, [rsi + rcx*4 - 4]   ; element rcx-1
    dec rcx
    jnz .loop
; Critical path: N × 4 cycles

; GOOD: 4 independent accumulators
; entry: xmm0-xmm3 zeroed, rcx = N (a multiple of 4)
.loop:
    addss xmm0, [rsi + rcx*4 - 16]
    addss xmm1, [rsi + rcx*4 - 12]
    addss xmm2, [rsi + rcx*4 - 8]
    addss xmm3, [rsi + rcx*4 - 4]
    sub rcx, 4
    jnz .loop
    addss xmm0, xmm1
    addss xmm2, xmm3
    addss xmm0, xmm2
; Critical path: N/4 × 4 cycles, roughly 4× faster
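The multi-accumulator pattern in C (a sketch under our own assumption, stated in the comment, that n is a multiple of 4):

```c
#include <stddef.h>

/* Four partial sums break the serial add chain: the four adds in
   each iteration are independent of one another and can be in
   flight simultaneously. Assumes n is a multiple of 4. */
float sum4(const float *v, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   /* combine once, outside the loop */
}
```

Note that a C compiler will not make this transformation for you on plain `float` sums without something like `-ffast-math`, because FP addition is not associative and the reordering can change the rounded result; writing the accumulators out by hand is how you opt in.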

The lesson is counterintuitive but applies broadly: more code that creates ILP beats less code with serial dependencies, up to the limit of available execution units and register pressure.

⚡ Performance Note: For large reductions, the optimal accumulator count is the instruction's latency divided by its reciprocal throughput (cycles per instruction at steady state). For ADDSS with latency 4 and reciprocal throughput 0.5, that is 4 / 0.5 = 8 independent accumulators to fully hide the latency; MULSS has the same figures, so also 8. An instruction with reciprocal throughput 2 at the same latency (ADDPD on YMM registers, on some microarchitectures) needs only 4 / 2 = 2 accumulators. These figures vary from one microarchitecture to the next; Agner Fog's instruction tables provide the data to calculate this exactly for your target.