Case Study 17-1: Factorial in ARM64 — Recursive and Iterative
Objective
Implement factorial in both recursive and iterative styles in ARM64 assembly, with full stack frame diagrams showing the AAPCS64 prologue/epilogue pattern. Trace the recursive version to understand how the call stack builds and unwinds.
Part 1: Iterative Factorial
Iterative factorial is a leaf function (makes no function calls), making it simpler:
// C version
uint64_t factorial_iter(uint32_t n) {
uint64_t result = 1;
for (uint32_t i = 2; i <= n; i++) {
result *= i;
}
return result;
}
ARM64 assembly:
// factorial_iter(n): W0 = n, returns X0 = n!
// This is a leaf function — no calls to other functions.
// We don't need to save LR since we never overwrite it.
// We use only caller-saved registers (X1, X2) so no callee-save needed.
.global factorial_iter
factorial_iter:
// Special case: 0! = 1, 1! = 1
CMP W0, #2
MOV X1, #1 // result = 1
B.LT .iter_done // if n < 2, return 1
// Loop: i from 2 to n
MOV W2, #2 // i = 2
.iter_loop:
CMP W2, W0
B.GT .iter_done // if i > n, done
UXTW X3, W2 // X3 = (uint64_t)i
MUL X1, X1, X3 // result *= i
ADD W2, W2, #1 // i++
B .iter_loop
.iter_done:
MOV X0, X1 // return result
RET
Register trace for factorial_iter(5):
| Iteration | W0(n) | W2(i) | X1(result) |
|---|---|---|---|
| init | 5 | — | 1 |
| i=2 | 5 | 2 | 1×2 = 2 |
| i=3 | 5 | 3 | 2×3 = 6 |
| i=4 | 5 | 4 | 6×4 = 24 |
| i=5 | 5 | 5 | 24×5 = 120 |
| exit | 5 | 6 (>5) | 120 |
Return value: X0 = 120.
Part 2: Recursive Factorial
Recursive factorial is a non-leaf function — it calls itself. This requires saving X30 (LR) on every invocation:
// C version
uint64_t factorial_rec(uint32_t n) {
if (n <= 1) return 1;
return (uint64_t)n * factorial_rec(n - 1);
}
ARM64 assembly:
// factorial_rec(n): W0 = n, returns X0 = n!
// Non-leaf: must save X30 and any callee-saved regs we use.
// We use X19 to preserve n across the recursive call.
.global factorial_rec
factorial_rec:
// Prologue
STP X29, X30, [SP, #-32]! // Save FP and LR; allocate 32 bytes
MOV X29, SP // Update frame pointer
STR X19, [SP, #16] // Save X19 (callee-saved, we'll use it)
// (16 bytes for X29/X30 + 8 bytes for X19 + 8 bytes padding = 32)
// Base case: n <= 1
CMP W0, #1
B.GT .rec_recurse // if n > 1, recurse
MOV X0, #1 // return 1
B .rec_done
.rec_recurse:
MOV W19, W0 // X19 = n (preserve across call)
SUB W0, W0, #1 // arg = n - 1
BL factorial_rec // X0 = factorial_rec(n-1)
// On return: X0 = (n-1)!
UXTW X1, W19 // X1 = (uint64_t)n
MUL X0, X0, X1 // X0 = n * (n-1)!
.rec_done:
// Epilogue
LDR X19, [SP, #16] // Restore X19
LDP X29, X30, [SP], #32 // Restore FP and LR; SP += 32
RET
Stack Trace for factorial_rec(4)
Call sequence: factorial_rec(4) → factorial_rec(3) → factorial_rec(2) → factorial_rec(1).
Let's track the stack (assuming initial SP = 0x7FFFFFE0):
Call 1: factorial_rec(4), SP = 0x7FFFFFE0
Prologue: STP X29, X30, [SP, #-32]! → SP = 0x7FFFFFC0
STR X19, [SP, #16] → saves W19=0 (irrelevant at this point)
Stack after call 1's prologue:
0x7FFFFFE0: ┌───────────────────────────┐ ← original SP (caller's frame)
│ (caller's frame ...) │
0x7FFFFFD8: ├───────────────────────────┤
│ [SP+24] (padding) │
0x7FFFFFD0: ├───────────────────────────┤
│ [SP+16] X19 saved = ? │
0x7FFFFFC8: ├───────────────────────────┤
│ [SP+8] X30 = ret addr │ ← return addr to caller of factorial_rec(4)
0x7FFFFFC0: ├───────────────────────────┤
│ [SP+0] X29 = old FP │ ← SP after prologue (new FP)
└───────────────────────────┘ ← SP = X29 = 0x7FFFFFC0
Then MOV W19, W0 saves n=4. SUB W0, W0, #1 → W0=3. BL factorial_rec calls recursively, X30 = return address within factorial_rec(4) (the MUL instruction).
Call 2: factorial_rec(3), SP = 0x7FFFFFC0
Prologue: STP X29, X30, [SP, #-32]! → SP = 0x7FFFFFA0
Stack:
0x7FFFFFC0: ┌───────────────────────────┐ ← Call 1's frame
│ X29 of call 1 │ [0]
│ X30 of call 1 (to caller)│ [8]
│ X19 = 4 (call 1's n) │ [16]
│ padding │ [24]
0x7FFFFFA0: ├───────────────────────────┤ ← Call 2's frame
│ X29 = 0x7FFFFFC0 │ [0] (points to call 1's frame)
│ X30 = MUL in call 1 │ [8]
│ X19 = 3 (call 2's n) │ [16]
│ padding │ [24]
└───────────────────────────┘ ← SP = 0x7FFFFFA0
Call 3: factorial_rec(2), SP = 0x7FFFFFA0 → 0x7FFFFF80
Call 4: factorial_rec(1), SP = 0x7FFFFF80 → 0x7FFFFF60
At call 4 (n=1): hits base case
Returns X0 = 1 immediately
Complete stack at maximum depth (factorial_rec(1)):
HIGH ADDRESSES
┌──────────────────────────────────────────────────────────┐
│ Caller's frame (program main) │
│ │
0x7FFFFFC0: Call 1 frame (n=4) │
│ [+0] X29 (old FP) │
│ [+8] X30 (return to main) │
│ [+16] X19 = 4 │
│ [+24] padding │
│ │
0x7FFFFFA0: Call 2 frame (n=3) │
│ [+0] X29 = 0x7FFFFFC0 │
│ [+8] X30 = MUL in call 1 │
│ [+16] X19 = 3 │
│ [+24] padding │
│ │
0x7FFFFF80: Call 3 frame (n=2) │
│ [+0] X29 = 0x7FFFFFA0 │
│ [+8] X30 = MUL in call 2 │
│ [+16] X19 = 2 │
│ [+24] padding │
│ │
0x7FFFFF60: Call 4 frame (n=1) ← deepest │
│ [+0] X29 = 0x7FFFFF80 │
│ [+8] X30 = MUL in call 3 │
│ [+16] X19 = 1 (though we hit base case before using it)│
│ [+24] padding │
│ │
SP = 0x7FFFFF60 ← current top │
└──────────────────────────────────────────────────────────┘
LOW ADDRESSES
Unwinding
- Call 4 returns X0=1 to Call 3's MUL instruction
- Call 3:
X1 = (uint64_t)X19 = 2,X0 = 1 * 2 = 2. Epilogue restores X29/X30/X19, SP returns to 0x7FFFFF80. Returns X0=2 to Call 2. - Call 2:
X1 = 3,X0 = 2 * 3 = 6. Returns X0=6 to Call 1. - Call 1:
X1 = 4,X0 = 6 * 4 = 24. Returns X0=24 to main.
Comparison: Iterative vs. Recursive
Property Iterative Recursive
─────────────────────────────────────────────────────────────────
Stack frames 1 (constant) n+1 (grows with n)
Stack space ~16 bytes n × 32 bytes
LR save required No (leaf function) Yes (must save X30)
Callee-saved regs None needed X19 per frame
Max n before overflow No limit (while result ~125,000 (8MB stack)
fits in uint64)
Instructions per iter ~5 ~8 + prologue/epilogue
Suitable for n=10? Yes Yes
Suitable for n=100? Yes Yes
Suitable for n=10000? Yes Stack overflow risk
For n=20, factorial(20) = 2,432,902,008,176,640,000 (fits in uint64_t). For n=21, it overflows 64-bit unsigned. So either implementation is "limited" by the math before the stack becomes a concern for small inputs.
Building and Testing
// test_factorial.s — test harness
.section .data
fmt_iter: .asciz "Iterative: %lu! = %lu\n"
fmt_rec: .asciz "Recursive: %lu! = %lu\n"
.section .text
.extern printf
.extern factorial_iter
.extern factorial_rec
.global main
main:
STP X29, X30, [SP, #-16]!
MOV X29, SP
// Test factorial_iter(10)
MOV W0, #10
BL factorial_iter
MOV X2, X0 // result
ADR X0, fmt_iter
MOV X1, #10 // n
BL printf
// Test factorial_rec(10)
MOV W0, #10
BL factorial_rec
MOV X2, X0
ADR X0, fmt_rec
MOV X1, #10
BL printf
MOV W0, #0
LDP X29, X30, [SP], #16
RET
aarch64-linux-gnu-as factorial_iter.s -o factorial_iter.o
aarch64-linux-gnu-as factorial_rec.s -o factorial_rec.o
aarch64-linux-gnu-as test_factorial.s -o test_factorial.o
aarch64-linux-gnu-gcc -static factorial_iter.o factorial_rec.o test_factorial.o -o factorial_test
qemu-aarch64 ./factorial_test
# Output:
# Iterative: 10! = 3628800
# Recursive: 10! = 3628800
Summary
The iterative version demonstrates the leaf function pattern: no prologue/epilogue overhead, no LR save, minimal stack usage. The recursive version demonstrates the full non-leaf pattern: save FP and LR at function entry, save any callee-saved registers that will be used, restore everything in the epilogue, and return with RET.
The 32-byte stack frame allocation (16 for X29/X30, 8 for X19, 8 for padding) keeps SP 16-byte aligned throughout. The frame pointer chain (each frame's X29 points to the previous frame's X29) enables debuggers to unwind the stack reliably.