Case Study 8.2: LEA as a Math Instruction
The Question
How does a compiler multiply an integer by a small constant without using a multiply instruction? The answer — almost always — is LEA. This case study dissects five implementations of y = x * 5 + 3 and explains when each is preferred.
Why Avoid IMUL?
The IMUL instruction is general-purpose and can multiply by any immediate. But it has a latency of 3 cycles on modern Intel processors, versus 1 cycle for LEA and 1 cycle for SHL. For multiplications by small constants that appear inside loops, saving 2 cycles per iteration is meaningful.
Additionally, LEA leaves flags untouched. IMUL sets some flags (OF, CF) and leaves others undefined. When the surrounding code uses flags, preserving them can simplify the code.
Five Implementations of y = x * 5 + 3
Implementation 1: Using IMUL
; Input: RDI = x
; Output: RAX = x*5 + 3
imul_version:
imul rax, rdi, 5 ; rax = x*5 (three-operand form, 3 cycles latency)
add rax, 3 ; rax += 3 (1 cycle)
ret
Latency: 3 + 1 = 4 cycles (serial dependency) Instructions: 2 Notes: Simple to read. GCC will almost never emit this for multiplication by 5.
Implementation 2: LEA (Single Instruction)
; Input: RDI = x
; Output: RAX = x*5 + 3
lea_single:
lea rax, [rdi + rdi*4 + 3] ; rax = rdi + rdi*4 + 3 = x*5 + 3
ret
Latency: 1 cycle
Instructions: 1
Notes: This is what GCC -O2 actually emits. The LEA computes x + x*4 + 3 in the AGU in one cycle. This is the optimal solution.
Wait — is x + x*4 = x*5? Yes: x*(1+4) = x*5. The base register and the index register are both RDI, and the scale is 4.
Implementation 3: Shift + Add
; Input: RDI = x
; Output: RAX = x*5 + 3
shift_add:
mov rax, rdi ; rax = x (1 cycle)
shl rax, 2 ; rax = x*4 (1 cycle)
add rax, rdi ; rax = x*5 (1 cycle, adds original x)
add rax, 3 ; rax += 3 (1 cycle)
ret
Latency: 4 cycles (serial chain) Instructions: 4 Notes: This was the traditional approach before LEA was well-understood. Still sometimes emitted by compilers for larger multipliers.
Implementation 4: Two LEAs
; Input: RDI = x
; Output: RAX = x*5 + 3
; (Hypothetical two-LEA version to demonstrate chaining)
lea_two:
lea rax, [rdi + rdi*4] ; rax = x*5 (1 cycle)
lea rax, [rax + 3] ; rax = x*5 + 3 (1 cycle, depends on first)
ret
Latency: 2 cycles (two LEAs in series) Instructions: 2 Notes: Two LEAs in series. Each LEA takes 1 cycle but they cannot execute in parallel because the second depends on the output of the first. Worse than the single-LEA version.
Implementation 5: Branchless Conditional (Contrived)
; This version shows how NOT to do it — illustrating pipeline issues
suboptimal:
mov rax, 5 ; rax = 5
mul rdi ; rdx:rax = rax * rdi (single-operand, RDX clobbered!)
add rax, 3 ; rax = x*5 + 3
ret
Latency: 3+ cycles, also clobbers RDX Instructions: 3 Notes: The single-operand MUL is the unsigned 128-bit multiply form. It clobbers RDX. Never use this for simple constant multiplication — it is the form designed for computing the upper half of very large products.
Comparison Table
| Version | Instructions | Latency | Flags Modified | Notes |
|---|---|---|---|---|
| IMUL (3-op) | 2 | 4 cycles | OF, CF set; others undefined | Readable but slow |
| LEA (single) | 1 | 1 cycle | None | Optimal; what GCC emits |
| Shift + Add | 4 | 4 cycles | OF, CF, etc. | Traditional; verbose |
| Two LEAs | 2 | 2 cycles | None | Better than IMUL/shift |
| MUL (1-op) | 3 | 3+ cycles | All flags | Clobbers RDX; wrong tool |
What GCC Actually Does
Let us verify with GCC 13.2 at -O2:
long multiply_five_plus_three(long x) {
return x * 5 + 3;
}
gcc -O2 -S -o - multiply.c
Output:
multiply_five_plus_three:
lea rax, [rdi+rdi*4+3]
ret
One instruction. The compiler recognized the pattern immediately and collapsed the entire expression into a single LEA.
Now try a multiplier LEA cannot handle alone:
long multiply_seven(long x) {
return x * 7;
}
multiply_seven:
lea rax, [rdi+rdi*2] ; rax = x*3
lea rax, [rax+rax*2] ; rax = x*9 ← WRONG, this would be x*9 not x*7
Actually, GCC does this for x*7:
multiply_seven:
imul rax, rdi, 7
ret
GCC uses IMUL for 7 because the LEA decomposition of 7 requires a subtract (x*8 - x), and LEA cannot subtract. The compiler's heuristic: use LEA when the multiplier can be computed with additions in 1-2 instructions; fall back to IMUL otherwise.
long multiply_twelve(long x) {
return x * 12;
}
multiply_twelve:
lea rax, [rdi+rdi*2] ; rax = x*3
sal rax, 2 ; rax <<= 2, so rax = x*12
ret
GCC decomposes 12 = 3 × 4 = (1+2) × 4. LEA computes x*3, then SHL multiplies by 4.
The Pattern: When Does Compiler Use LEA?
GCC's optimizer uses LEA for multiplication by constant N when N can be expressed as:
- N = 2^k → use SHL only
- N = 1 + 2^k (k ∈ {0,1,2,3}) → single LEA (base + index×scale where scale = 2^k)
- N = 2^k + 2^j → two-operand form, often LEA + SHL
- Other small N → IMUL with immediate if LEA decomposition is too complex
The breakeven is roughly: if LEA+SHL achieves it in ≤ 2 instructions with ≤ 2 cycles, prefer LEA. Otherwise, IMUL at 3 cycles with 1 instruction.
Takeaway for Assembly Programmers
When you need to multiply by a small constant in performance-sensitive code:
- Check if
N = 1 + 2^k(k ∈ {0,1,2,3}): uselea rax, [rdi + rdi*(2^k)] - Check if
N = 2^k: useshl rax, k - Check if
N = M * 2^kwhere M is type 1: use LEA for M, SHL for 2^k - Otherwise: use
imul rax, rdi, N
The mental model: LEA is addition hardware that can scale one input by 1, 2, 4, or 8. Chains of LEA + SHL can compute many multiplications faster than IMUL. For anything complex, IMUL is clear, correct, and only 3 cycles — not worth contorting the code for a 2-cycle saving outside a hot loop.