Case Study 17-1: Factorial in ARM64 — Recursive and Iterative

Objective

Implement factorial in both recursive and iterative styles in ARM64 assembly, with full stack frame diagrams showing the AAPCS64 prologue/epilogue pattern. Trace the recursive version to understand how the call stack builds and unwinds.


Part 1: Iterative Factorial

Iterative factorial is a leaf function (makes no function calls), making it simpler:

// C version
uint64_t factorial_iter(uint32_t n) {
    uint64_t result = 1;
    for (uint32_t i = 2; i <= n; i++) {
        result *= i;
    }
    return result;
}

ARM64 assembly:

// factorial_iter(n): W0 = n, returns X0 = n!
// This is a leaf function — no calls to other functions.
// We don't need to save LR since we never overwrite it.
// We use only caller-saved registers (X1, X2) so no callee-save needed.

.global factorial_iter
factorial_iter:
    // Special case: 0! = 1, 1! = 1
    CMP   W0, #2
    MOV   X1, #1               // result = 1
    B.LT  .iter_done           // if n < 2, return 1

    // Loop: i from 2 to n
    MOV   W2, #2               // i = 2
.iter_loop:
    CMP   W2, W0
    B.GT  .iter_done           // if i > n, done
    UXTW  X3, W2               // X3 = (uint64_t)i
    MUL   X1, X1, X3           // result *= i
    ADD   W2, W2, #1           // i++
    B     .iter_loop

.iter_done:
    MOV   X0, X1               // return result
    RET

Register trace for factorial_iter(5):

Iteration W0(n) W2(i) X1(result)
init 5 1
i=2 5 2 1×2 = 2
i=3 5 3 2×3 = 6
i=4 5 4 6×4 = 24
i=5 5 5 24×5 = 120
exit 5 6 (>5) 120

Return value: X0 = 120.


Part 2: Recursive Factorial

Recursive factorial is a non-leaf function — it calls itself. This requires saving X30 (LR) on every invocation:

// C version
uint64_t factorial_rec(uint32_t n) {
    if (n <= 1) return 1;
    return (uint64_t)n * factorial_rec(n - 1);
}

ARM64 assembly:

// factorial_rec(n): W0 = n, returns X0 = n!
// Non-leaf: must save X30 and any callee-saved regs we use.
// We use X19 to preserve n across the recursive call.

.global factorial_rec
factorial_rec:
    // Prologue
    STP   X29, X30, [SP, #-32]!   // Save FP and LR; allocate 32 bytes
    MOV   X29, SP                  // Update frame pointer
    STR   X19, [SP, #16]           // Save X19 (callee-saved, we'll use it)
    // (16 bytes for X29/X30 + 8 bytes for X19 + 8 bytes padding = 32)

    // Base case: n <= 1
    CMP   W0, #1
    B.GT  .rec_recurse             // if n > 1, recurse
    MOV   X0, #1                   // return 1
    B     .rec_done

.rec_recurse:
    MOV   W19, W0                  // X19 = n (preserve across call)
    SUB   W0, W0, #1               // arg = n - 1
    BL    factorial_rec            // X0 = factorial_rec(n-1)
    // On return: X0 = (n-1)!
    UXTW  X1, W19                  // X1 = (uint64_t)n
    MUL   X0, X0, X1               // X0 = n * (n-1)!

.rec_done:
    // Epilogue
    LDR   X19, [SP, #16]           // Restore X19
    LDP   X29, X30, [SP], #32      // Restore FP and LR; SP += 32
    RET

Stack Trace for factorial_rec(4)

Call sequence: factorial_rec(4)factorial_rec(3)factorial_rec(2)factorial_rec(1).

Let's track the stack (assuming initial SP = 0x7FFFFFE0):

Call 1: factorial_rec(4), SP = 0x7FFFFFE0

Prologue: STP X29, X30, [SP, #-32]!  → SP = 0x7FFFFFC0
          STR X19, [SP, #16]          → saves W19=0 (irrelevant at this point)

Stack after call 1's prologue:

0x7FFFFFE0: ┌───────────────────────────┐  ← original SP (caller's frame)
            │   (caller's frame ...)    │
0x7FFFFFD8: ├───────────────────────────┤
            │   [SP+24] (padding)       │
0x7FFFFFD0: ├───────────────────────────┤
            │   [SP+16] X19 saved = ?   │
0x7FFFFFC8: ├───────────────────────────┤
            │   [SP+8]  X30 = ret addr  │  ← return addr to caller of factorial_rec(4)
0x7FFFFFC0: ├───────────────────────────┤
            │   [SP+0]  X29 = old FP    │  ← SP after prologue (new FP)
            └───────────────────────────┘  ← SP = X29 = 0x7FFFFFC0

Then MOV W19, W0 saves n=4. SUB W0, W0, #1 → W0=3. BL factorial_rec calls recursively, X30 = return address within factorial_rec(4) (the MUL instruction).

Call 2: factorial_rec(3), SP = 0x7FFFFFC0

Prologue: STP X29, X30, [SP, #-32]!  → SP = 0x7FFFFFA0

Stack:

0x7FFFFFC0: ┌───────────────────────────┐  ← Call 1's frame
            │   X29 of call 1           │  [0]
            │   X30 of call 1 (to caller)│  [8]
            │   X19 = 4 (call 1's n)    │  [16]
            │   padding                 │  [24]
0x7FFFFFA0: ├───────────────────────────┤  ← Call 2's frame
            │   X29 = 0x7FFFFFC0       │  [0] (points to call 1's frame)
            │   X30 = MUL in call 1    │  [8]
            │   X19 = 3 (call 2's n)   │  [16]
            │   padding                │  [24]
            └───────────────────────────┘  ← SP = 0x7FFFFFA0

Call 3: factorial_rec(2), SP = 0x7FFFFFA0 → 0x7FFFFF80

Call 4: factorial_rec(1), SP = 0x7FFFFF80 → 0x7FFFFF60

At call 4 (n=1): hits base case
  Returns X0 = 1 immediately

Complete stack at maximum depth (factorial_rec(1)):

HIGH ADDRESSES
┌──────────────────────────────────────────────────────────┐
│  Caller's frame (program main)                            │
│                                                           │
0x7FFFFFC0:  Call 1 frame (n=4)                            │
│   [+0]  X29 (old FP)                                     │
│   [+8]  X30 (return to main)                             │
│   [+16] X19 = 4                                          │
│   [+24] padding                                          │
│                                                           │
0x7FFFFFA0:  Call 2 frame (n=3)                            │
│   [+0]  X29 = 0x7FFFFFC0                                 │
│   [+8]  X30 = MUL in call 1                              │
│   [+16] X19 = 3                                          │
│   [+24] padding                                          │
│                                                           │
0x7FFFFF80:  Call 3 frame (n=2)                            │
│   [+0]  X29 = 0x7FFFFFA0                                 │
│   [+8]  X30 = MUL in call 2                              │
│   [+16] X19 = 2                                          │
│   [+24] padding                                          │
│                                                           │
0x7FFFFF60:  Call 4 frame (n=1) ← deepest                  │
│   [+0]  X29 = 0x7FFFFF80                                 │
│   [+8]  X30 = MUL in call 3                              │
│   [+16] X19 = 1 (though we hit base case before using it)│
│   [+24] padding                                          │
│                                                           │
SP = 0x7FFFFF60 ← current top                              │
└──────────────────────────────────────────────────────────┘
LOW ADDRESSES

Unwinding

  1. Call 4 returns X0=1 to Call 3's MUL instruction
  2. Call 3: X1 = (uint64_t)X19 = 2, X0 = 1 * 2 = 2. Epilogue restores X29/X30/X19, SP returns to 0x7FFFFF80. Returns X0=2 to Call 2.
  3. Call 2: X1 = 3, X0 = 2 * 3 = 6. Returns X0=6 to Call 1.
  4. Call 1: X1 = 4, X0 = 6 * 4 = 24. Returns X0=24 to main.

Comparison: Iterative vs. Recursive

Property               Iterative               Recursive
─────────────────────────────────────────────────────────────────
Stack frames           1 (constant)            n+1 (grows with n)
Stack space            ~16 bytes               n × 32 bytes
LR save required       No (leaf function)      Yes (must save X30)
Callee-saved regs      None needed             X19 per frame
Max n before overflow  No limit (while result  ~125,000 (8MB stack)
                       fits in uint64)
Instructions per iter  ~5                      ~8 + prologue/epilogue
Suitable for n=10?     Yes                     Yes
Suitable for n=100?    Yes                     Yes
Suitable for n=10000?  Yes                     Stack overflow risk

For n=20, factorial(20) = 2,432,902,008,176,640,000 (fits in uint64_t). For n=21, it overflows 64-bit unsigned. So either implementation is "limited" by the math before the stack becomes a concern for small inputs.


Building and Testing

// test_factorial.s — test harness
.section .data
fmt_iter: .asciz "Iterative: %lu! = %lu\n"
fmt_rec:  .asciz "Recursive: %lu! = %lu\n"

.section .text
.extern printf
.extern factorial_iter
.extern factorial_rec
.global main

main:
    STP X29, X30, [SP, #-16]!
    MOV X29, SP

    // Test factorial_iter(10)
    MOV W0, #10
    BL  factorial_iter
    MOV X2, X0                // result
    ADR X0, fmt_iter
    MOV X1, #10               // n
    BL  printf

    // Test factorial_rec(10)
    MOV W0, #10
    BL  factorial_rec
    MOV X2, X0
    ADR X0, fmt_rec
    MOV X1, #10
    BL  printf

    MOV W0, #0
    LDP X29, X30, [SP], #16
    RET
aarch64-linux-gnu-as factorial_iter.s -o factorial_iter.o
aarch64-linux-gnu-as factorial_rec.s  -o factorial_rec.o
aarch64-linux-gnu-as test_factorial.s -o test_factorial.o
aarch64-linux-gnu-gcc -static factorial_iter.o factorial_rec.o test_factorial.o -o factorial_test
qemu-aarch64 ./factorial_test
# Output:
# Iterative: 10! = 3628800
# Recursive: 10! = 3628800

Summary

The iterative version demonstrates the leaf function pattern: no prologue/epilogue overhead, no LR save, minimal stack usage. The recursive version demonstrates the full non-leaf pattern: save FP and LR at function entry, save any callee-saved registers that will be used, restore everything in the epilogue, and return with RET.

The 32-byte stack frame allocation (16 for X29/X30, 8 for X19, 8 for padding) keeps SP 16-byte aligned throughout. The frame pointer chain (each frame's X29 points to the previous frame's X29) enables debuggers to unwind the stack reliably.