Case Study 8.2: LEA as a Math Instruction

The Question

How does a compiler multiply an integer by a small constant without using a multiply instruction? The answer — almost always — is LEA. This case study dissects five implementations of y = x * 5 + 3 and explains when each is preferred.

Why Avoid IMUL?

The IMUL instruction is general-purpose and can multiply by any immediate. But it has a latency of 3 cycles on modern Intel processors, versus 1 cycle for LEA and 1 cycle for SHL. For multiplications by small constants that appear inside loops, saving 2 cycles per iteration is meaningful.

Additionally, LEA leaves flags untouched. IMUL sets some flags (OF, CF) and leaves others undefined. When the surrounding code uses flags, preserving them can simplify the code.

Five Implementations of y = x * 5 + 3

Implementation 1: Using IMUL

; Input: RDI = x
; Output: RAX = x*5 + 3

imul_version:
    imul rax, rdi, 5       ; rax = x*5  (three-operand form, 3 cycles latency)
    add  rax, 3             ; rax += 3   (1 cycle)
    ret

Latency: 3 + 1 = 4 cycles (serial dependency) Instructions: 2 Notes: Simple to read. GCC will almost never emit this for multiplication by 5.

Implementation 2: LEA (Single Instruction)

; Input: RDI = x
; Output: RAX = x*5 + 3

lea_single:
    lea rax, [rdi + rdi*4 + 3]   ; rax = rdi + rdi*4 + 3 = x*5 + 3
    ret

Latency: 1 cycle Instructions: 1 Notes: This is what GCC -O2 actually emits. The LEA computes x + x*4 + 3 in the AGU in one cycle. This is the optimal solution.

Wait — is x + x*4 = x*5? Yes: x*(1+4) = x*5. The base register and the index register are both RDI, and the scale is 4.

Implementation 3: Shift + Add

; Input: RDI = x
; Output: RAX = x*5 + 3

shift_add:
    mov  rax, rdi          ; rax = x     (1 cycle)
    shl  rax, 2            ; rax = x*4   (1 cycle)
    add  rax, rdi          ; rax = x*5   (1 cycle, adds original x)
    add  rax, 3            ; rax += 3    (1 cycle)
    ret

Latency: 4 cycles (serial chain) Instructions: 4 Notes: This was the traditional approach before LEA was well-understood. Still sometimes emitted by compilers for larger multipliers.

Implementation 4: Two LEAs

; Input: RDI = x
; Output: RAX = x*5 + 3
; (Hypothetical two-LEA version to demonstrate chaining)

lea_two:
    lea rax, [rdi + rdi*4]   ; rax = x*5  (1 cycle)
    lea rax, [rax + 3]       ; rax = x*5 + 3  (1 cycle, depends on first)
    ret

Latency: 2 cycles (two LEAs in series) Instructions: 2 Notes: Two LEAs in series. Each LEA takes 1 cycle but they cannot execute in parallel because the second depends on the output of the first. Worse than the single-LEA version.

Implementation 5: Branchless Conditional (Contrived)

; This version shows how NOT to do it — illustrating pipeline issues
suboptimal:
    mov  rax, 5             ; rax = 5
    mul  rdi                ; rdx:rax = rax * rdi  (single-operand, RDX clobbered!)
    add  rax, 3             ; rax = x*5 + 3
    ret

Latency: 3+ cycles, also clobbers RDX Instructions: 3 Notes: The single-operand MUL is the unsigned 128-bit multiply form. It clobbers RDX. Never use this for simple constant multiplication — it is the form designed for computing the upper half of very large products.

Comparison Table

Version Instructions Latency Flags Modified Notes
IMUL (3-op) 2 4 cycles OF, CF set; others undefined Readable but slow
LEA (single) 1 1 cycle None Optimal; what GCC emits
Shift + Add 4 4 cycles OF, CF, etc. Traditional; verbose
Two LEAs 2 2 cycles None Better than IMUL/shift
MUL (1-op) 3 3+ cycles All flags Clobbers RDX; wrong tool

What GCC Actually Does

Let us verify with GCC 13.2 at -O2:

long multiply_five_plus_three(long x) {
    return x * 5 + 3;
}
gcc -O2 -S -o - multiply.c

Output:

multiply_five_plus_three:
    lea     rax, [rdi+rdi*4+3]
    ret

One instruction. The compiler recognized the pattern immediately and collapsed the entire expression into a single LEA.

Now try a multiplier LEA cannot handle alone:

long multiply_seven(long x) {
    return x * 7;
}
multiply_seven:
    lea     rax, [rdi+rdi*2]    ; rax = x*3
    lea     rax, [rax+rax*2]    ; rax = x*9  ← WRONG, this would be x*9 not x*7

Actually, GCC does this for x*7:

multiply_seven:
    imul    rax, rdi, 7
    ret

GCC uses IMUL for 7 because the LEA decomposition of 7 requires a subtract (x*8 - x), and LEA cannot subtract. The compiler's heuristic: use LEA when the multiplier can be computed with additions in 1-2 instructions; fall back to IMUL otherwise.

long multiply_twelve(long x) {
    return x * 12;
}
multiply_twelve:
    lea     rax, [rdi+rdi*2]    ; rax = x*3
    sal     rax, 2               ; rax <<= 2, so rax = x*12
    ret

GCC decomposes 12 = 3 × 4 = (1+2) × 4. LEA computes x*3, then SHL multiplies by 4.

The Pattern: When Does Compiler Use LEA?

GCC's optimizer uses LEA for multiplication by constant N when N can be expressed as: - N = 2^k → use SHL only - N = 1 + 2^k (k ∈ {0,1,2,3}) → single LEA (base + index×scale where scale = 2^k) - N = 2^k + 2^j → two-operand form, often LEA + SHL - Other small N → IMUL with immediate if LEA decomposition is too complex

The breakeven is roughly: if LEA+SHL achieves it in ≤ 2 instructions with ≤ 2 cycles, prefer LEA. Otherwise, IMUL at 3 cycles with 1 instruction.

Takeaway for Assembly Programmers

When you need to multiply by a small constant in performance-sensitive code:

  1. Check if N = 1 + 2^k (k ∈ {0,1,2,3}): use lea rax, [rdi + rdi*(2^k)]
  2. Check if N = 2^k: use shl rax, k
  3. Check if N = M * 2^k where M is type 1: use LEA for M, SHL for 2^k
  4. Otherwise: use imul rax, rdi, N

The mental model: LEA is addition hardware that can scale one input by 1, 2, 4, or 8. Chains of LEA + SHL can compute many multiplications faster than IMUL. For anything complex, IMUL is clear, correct, and only 3 cycles — not worth contorting the code for a 2-cycle saving outside a hot loop.