Case Study 16-2: The ARM64 Execution Model — Tracing an ARM64 Program

Objective

Step through an ARM64 arithmetic program instruction by instruction under QEMU + GDB, observing register state at each step. Compare the same algorithm compiled for x86-64 to see exactly how the architectures differ in practice.


The Program: Simple Arithmetic

We'll trace this C function in ARM64 assembly:

// compute.c
int compute(int a, int b, int c) {
    int sum = a + b;
    int product = sum * c;
    int result = product - a;
    return result;
}

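The ARM64 assembly, hand-written in the style of unoptimized compiler output (real GCC -O0 output would typically also spill the arguments to the stack, and would not save X29/X30 in a leaf function like this one):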

// compute_arm64.s
// compute(a, b, c): returns (a+b)*c - a
// Arguments: W0=a, W1=b, W2=c
// Return value: W0

.section .text
.global compute
compute:
    // Function prologue (save frame pointer and link register)
    STP  X29, X30, [SP, #-16]!   // push {fp, lr}; sp -= 16
    MOV  X29, SP                  // fp = sp

    // Body: all in 32-bit (W registers) for int arithmetic
    ADD  W3, W0, W1               // W3 = a + b  (sum)
    MUL  W4, W3, W2               // W4 = sum * c  (product)
    SUB  W0, W4, W0               // W0 = product - a  (result, also return value)

    // Function epilogue
    LDP  X29, X30, [SP], #16      // pop {fp, lr}; sp += 16
    RET                           // return (branch to X30)

Test harness (main):

// main_arm64.s
.section .data
fmt:    .asciz "compute(3, 4, 5) = %d\n"

.section .text
.extern printf
.global main

// The entry point is main, not _start: the C runtime's own _start
// initializes libc and calls main with a valid return address in X30.
main:
    // Set up frame for main
    STP  X29, X30, [SP, #-16]!
    MOV  X29, SP

    // Call compute(3, 4, 5)
    MOV  W0, #3       // a = 3
    MOV  W1, #4       // b = 4
    MOV  W2, #5       // c = 5
    BL   compute      // X30 = return address; branch to compute
    // On return: W0 = result = (3+4)*5 - 3 = 32

    // Call printf(fmt, result)
    MOV  W1, W0                // second arg = result
    ADRP X0, fmt               // first arg = format string (4 KB page address)
    ADD  X0, X0, :lo12:fmt     //   plus the offset within that page
    BL   printf

    // return 0; the C runtime then calls exit(0), which flushes stdout
    MOV  W0, #0
    LDP  X29, X30, [SP], #16
    RET

Build with the C library, letting its startup code call main:

aarch64-linux-gnu-as compute_arm64.s -o compute_arm64.o
aarch64-linux-gnu-as main_arm64.s -o main_arm64.o
aarch64-linux-gnu-gcc -static compute_arm64.o main_arm64.o -o compute_prog
qemu-aarch64 ./compute_prog
# Output: compute(3, 4, 5) = 32

GDB Trace Session

# Terminal 1
qemu-aarch64 -g 1234 ./compute_prog

# Terminal 2
aarch64-linux-gnu-gdb ./compute_prog
(gdb) target remote :1234
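(gdb) break *compute        # the '*' stops at the first instruction, before the prologue runs
(gdb) continue
(gdb) display/i $pc         # show the next instruction after every stop
(gdb) stepi                 # execute one instruction; repeat for each step below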

At the compute Function Entry

(gdb) info registers x0 x1 x2 x29 x30 sp
x0   0x3    3       // a = 3
x1   0x4    4       // b = 4
x2   0x5    5       // c = 5
x29  0x...  (caller's frame pointer)
x30  0x...  (return address = address after BL in main)
sp   0x7fffff10  (illustrative stack address; yours will differ)

Tracing the Prologue

STP X29, X30, [SP, #-16]!

This is the canonical ARM64 function prologue. Let's decode it:

STP X29, X30, [SP, #-16]!
│    │    │    │    │    │
│    │    │    │    │    └── ! means write-back: SP = SP + (#-16) = SP - 16 FIRST
│    │    │    │    └─────── pre-indexed offset: SP is decremented by 16 before the store
│    │    │    └──────────── base register
│    │    └───────────────── second register to store (X30 → [SP + 8])
│    └────────────────────── first register to store (X29 → [SP + 0])
└─────────────────────────── Store Pair

Effect:
  SP = SP - 16
  Memory[SP + 0]  = X29  (old frame pointer)
  Memory[SP + 8]  = X30  (return address / link register)

Stack before:

SP → ┌────────────────────────────┐ (high address)
     │  ... (caller's frame) ...  │
     └────────────────────────────┘

Stack after:

SP → ┌────────────────────────────┐
     │  X29 (old FP)              │ +0
     ├────────────────────────────┤
     │  X30 (return address)      │ +8
     ├────────────────────────────┤
     │  ... (caller's frame) ...  │ +16
     └────────────────────────────┘ (high address)

Register trace:

Instruction              SP          X29 (FP)  X30 (LR)  [SP+0]  [SP+8]
───────────────────────  ──────────  ────────  ────────  ──────  ────────
(before)                 0x7FFFFF10  old_fp    ret_addr  ?       ?
STP X29,X30,[SP,#-16]!   0x7FFFFF00  old_fp    ret_addr  old_fp  ret_addr
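The ordering in the effect listing above (update SP first, then store relative to the new SP) can be sketched in C. This is a toy model, not an emulator: `mem` is a hypothetical byte array standing in for stack memory, `*sp` is an offset into it rather than a real address, and the function name is made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Model of: STP Xt1, Xt2, [SP, #imm]!   (pre-indexed store pair, write-back).
   Assumption: mem/sp are a simulated stack, not real machine state. */
void stp_pre_index(uint8_t *mem, uint64_t *sp,
                   uint64_t xt1, uint64_t xt2, int64_t imm)
{
    *sp += (uint64_t)imm;             /* 1. SP = SP + imm (imm is -16 here) */
    memcpy(mem + *sp,     &xt1, 8);   /* 2. Xt1 -> [SP + 0], using the NEW SP */
    memcpy(mem + *sp + 8, &xt2, 8);   /* 3. Xt2 -> [SP + 8] */
}
```

Calling it with `sp = 32`, `xt1 = old_fp`, `xt2 = ret_addr` and `imm = -16` leaves `sp` at 16 with the two values stored at offsets 16 and 24, which matches the second row of the trace.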

MOV X29, SP

Sets the frame pointer to the current stack pointer. X29 now points at the saved X29/X30 pair (the frame record), which sits at the lowest address of this frame.

Instruction   SP          X29         Notes
────────────  ──────────  ──────────  ───────
MOV X29, SP   0x7FFFFF00  0x7FFFFF00  FP = SP

Tracing the Body

ADD W3, W0, W1

ADD W3, W0, W1
  W3 = W0 + W1 = 3 + 4 = 7
  (W3 is the 32-bit view of X3)
  Note: W register write zero-extends → X3 = 0x0000000000000007
  Flags NOT updated (no S suffix)

Instruction      W0  W1  W2  W3  W4  NZCV
───────────────  ──  ──  ──  ──  ──  ─────────
(before)         3   4   5   ?   ?   ????
ADD W3, W0, W1   3   4   5   7   ?   unchanged
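The zero-extension rule is easy to miss, so here is a minimal C sketch of it. The function names are hypothetical, and the model assumes only what the note above states: a W-register write replaces the whole X register with the zero-extended 32-bit result.

```c
#include <stdint.h>

/* Writing Wd clears the upper 32 bits of Xd; the old contents are discarded. */
uint64_t write_w(uint64_t old_x, uint32_t value)
{
    (void)old_x;               /* nothing of the previous X value survives */
    return (uint64_t)value;    /* zero-extend to 64 bits */
}

/* Model of: ADD Wd, Wn, Wm   (32-bit add, wraps modulo 2^32, no flags) */
uint64_t add_w(uint64_t xd_old, uint64_t xn, uint64_t xm)
{
    uint32_t sum = (uint32_t)xn + (uint32_t)xm;  /* only the low halves take part */
    return write_w(xd_old, sum);
}
```

Even if X3 held all ones beforehand, `add_w(0xFFFFFFFFFFFFFFFF, 3, 4)` yields 7 in the full 64-bit register.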

MUL W4, W3, W2

MUL W4, W3, W2
  W4 = W3 * W2 = 7 * 5 = 35
  MUL in ARM64 is actually MADD Wd, Wn, Wm, WZR (multiply and add zero)
  Flags NOT updated

Instruction      W0  W1  W2  W3  W4  NZCV
───────────────  ──  ──  ──  ──  ──  ─────────
MUL W4, W3, W2   3   4   5   7   35  unchanged
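The alias relationship between MUL and MADD can be written out as a short C sketch. The function names are illustrative, and unsigned 32-bit arithmetic stands in for the W registers.

```c
#include <stdint.h>

/* Model of: MADD Wd, Wn, Wm, Wa   ->   Wd = Wa + Wn * Wm (low 32 bits) */
uint32_t madd_w(uint32_t wn, uint32_t wm, uint32_t wa)
{
    return wa + wn * wm;       /* unsigned arithmetic wraps modulo 2^32 */
}

/* Model of: MUL Wd, Wn, Wm   ->   MADD with the zero register as addend */
uint32_t mul_w(uint32_t wn, uint32_t wm)
{
    const uint32_t wzr = 0;    /* WZR always reads as zero */
    return madd_w(wn, wm, wzr);
}
```

With the values from the trace, `mul_w(7, 5)` gives 35. A product that does not fit in 32 bits is simply truncated, and no flag records the overflow.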

SUB W0, W4, W0

SUB W0, W4, W0
  W0 = W4 - W0 = 35 - 3 = 32
  Result stored back in W0 (the return value register)
  Flags NOT updated

Instruction      W0  W1  W2  W3  W4  NZCV
───────────────  ──  ──  ──  ──  ──  ─────────
SUB W0, W4, W0   32  4   5   7   35  unchanged

Tracing the Epilogue

LDP X29, X30, [SP], #16

The mirror image of STP. Post-indexed: load first, then update SP.

LDP X29, X30, [SP], #16
  X29 = Memory[SP + 0]   (restore old frame pointer)
  X30 = Memory[SP + 8]   (restore return address)
  SP = SP + 16           (post-indexed: done AFTER the load)

Stack after:

SP → ┌────────────────────────────┐
     │  ... (caller's frame) ...  │ ← SP back to where it was before prologue
     └────────────────────────────┘

Instruction            SP          X29     X30
─────────────────────  ──────────  ──────  ────────
LDP X29,X30,[SP],#16   0x7FFFFF10  old_fp  ret_addr
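For contrast with the pre-indexed store, a post-indexed load can be modeled the same way. As before, this is a sketch under assumptions: `mem` is a hypothetical byte array, `*sp` is an offset into it, and the function name is invented for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Model of: LDP Xt1, Xt2, [SP], #imm   (post-indexed load pair).
   Assumption: mem/sp are a simulated stack, not real machine state. */
void ldp_post_index(const uint8_t *mem, uint64_t *sp,
                    uint64_t *xt1, uint64_t *xt2, int64_t imm)
{
    memcpy(xt1, mem + *sp,     8);   /* 1. Xt1 <- [SP + 0], using the OLD SP */
    memcpy(xt2, mem + *sp + 8, 8);   /* 2. Xt2 <- [SP + 8] */
    *sp += (uint64_t)imm;            /* 3. SP = SP + imm, only after the loads */
}
```

The only difference from the pre-indexed form is where the SP update sits relative to the memory accesses.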

RET

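RET with no operand branches to the address in X30. It behaves like BR X30, with the addition that it hints to the branch predictor that this is a function return. Here X30 holds the address of the instruction after the BL compute call in main, and W0 = 32 is the return value.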
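The BL/RET pairing can be sketched as two register updates. The struct and function names below are hypothetical and the addresses are arbitrary; the point is that neither instruction touches memory.

```c
#include <stdint.h>

/* Toy state: just the program counter and the link register. */
typedef struct {
    uint64_t pc;    /* address of the instruction being executed */
    uint64_t x30;   /* link register (LR) */
} LinkState;

/* Model of: BL target   (save the return address in X30, then branch) */
void bl(LinkState *s, uint64_t target)
{
    s->x30 = s->pc + 4;   /* every A64 instruction is 4 bytes long */
    s->pc  = target;
}

/* Model of: RET   (branch to the address held in X30) */
void ret(LinkState *s)
{
    s->pc = s->x30;
}
```

Because the return address lives in a register, a function that makes its own calls has to save X30 first, which is the job of the STP in the prologue.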


Comparing to Equivalent x86-64

The same compute(a, b, c) function in x86-64:

; x86-64 System V ABI: edi=a, esi=b, edx=c, return in eax
compute:
    push   rbp
    mov    rbp, rsp

    ; sum = a + b
    lea    eax, [rdi + rsi]      ; eax = a + b (lea trick: no flags)
    ; product = sum * c
    imul   eax, edx              ; eax = sum * c
    ; result = product - a
    sub    eax, edi              ; eax = product - a

    pop    rbp
    ret

Side-by-side comparison:

x86-64                              ARM64
────────────────────────────────    ────────────────────────────────
push   rbp                          STP X29, X30, [SP, #-16]!
mov    rbp, rsp                     MOV X29, SP

; sum = edi + esi → eax             ADD W3, W0, W1
lea    eax, [rdi + rsi]

; product = eax * edx               MUL W4, W3, W2
imul   eax, edx

; result = eax - edi                SUB W0, W4, W0
sub    eax, edi

pop    rbp                          LDP X29, X30, [SP], #16
ret                                 RET
────────────────────────────────    ────────────────────────────────
7 instructions                      7 instructions
Uses LEA for add (no flags)         ADD doesn't set flags anyway
Return address on stack             Return address in X30
IMUL sets flags                     MUL does NOT set flags

For this simple function, the instruction count is identical. The key differences:

  1. x86-64 saves and restores one register per push/pop; ARM64's STP/LDP move two registers in a single instruction
  2. On x86-64 the stack is written twice on the way in (call pushes the return address, then push rbp saves the frame pointer); on ARM64, BL writes only X30, and a single STP stores both values
  3. ARM64 explicitly uses W3 and W4 as temporaries; x86-64 reuses EAX throughout
  4. ARM64's MUL doesn't touch flags; x86-64's IMUL does


The Load/Store Model in Action

Now let's look at a function that demonstrates why load/store matters:

void increment_array(int *arr, int n) {
    for (int i = 0; i < n; i++) {
        arr[i]++;
    }
}

ARM64 assembly, hand-written in the style of GCC -O0 output:

// X0 = arr, W1 = n
// Locals live on the stack: i at [SP, #16], n at [SP, #20], arr at [SP, #24]
// ([SP, #0] and [SP, #8] hold the saved X29 and X30 and must not be overwritten)
increment_array:
    STP  X29, X30, [SP, #-32]!
    MOV  X29, SP
    STR  X0, [SP, #24]       // store arr to stack slot
    STR  W1, [SP, #20]       // store n to stack slot
    STR  WZR, [SP, #16]      // i = 0

.loop_check:
    LDR  W0, [SP, #16]       // W0 = i
    LDR  W1, [SP, #20]       // W1 = n
    CMP  W0, W1
    B.GE .loop_exit

.loop_body:
    LDR  X0, [SP, #24]       // X0 = arr
    LDR  W1, [SP, #16]       // W1 = i
    // Address of arr[i]: arr + i*4 (int is 4 bytes)
    // W1 is 32-bit and X0 is 64-bit, so sign-extend (SXTW) and scale by 4 (#2) in the add
    ADD  X3, X0, W1, SXTW #2 // X3 = arr + (int64_t)i*4
    LDR  W4, [X3]            // W4 = arr[i]   ← mandatory LOAD
    ADD  W4, W4, #1          // W4 = arr[i] + 1
    STR  W4, [X3]            // arr[i] = W4   ← mandatory STORE
    LDR  W0, [SP, #16]       // W0 = i
    ADD  W0, W0, #1          // W0 = i + 1
    STR  W0, [SP, #16]       // store i back
    B    .loop_check

.loop_exit:
    LDP  X29, X30, [SP], #32
    RET
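One detail of the listing is worth spelling out: the loop check is the first place in this case study where flags are written. CMP W0, W1 is an alias of SUBS WZR, W0, W1, so it subtracts, discards the result, and keeps only NZCV; B.GE then tests N == V. A minimal C sketch of that logic, with hypothetical type and function names:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } Nzcv;

/* Model of: CMP Wn, Wm   (SUBS WZR, Wn, Wm: subtract, keep only the flags) */
Nzcv cmp_w(uint32_t wn, uint32_t wm)
{
    uint32_t diff = wn - wm;
    Nzcv f;
    f.n = (diff >> 31) != 0;                        /* result negative when read as signed */
    f.z = (diff == 0);                              /* operands equal */
    f.c = (wn >= wm);                               /* no borrow in the unsigned subtraction */
    f.v = (((wn ^ wm) & (wn ^ diff)) >> 31) != 0;   /* signed overflow */
    return f;
}

/* B.GE is taken when N == V, i.e. Wn >= Wm as signed 32-bit integers */
bool cond_ge(Nzcv f)
{
    return f.n == f.v;
}
```

In the loop, `cond_ge(cmp_w(i, n))` becomes true on the first iteration where i reaches n, and the branch to .loop_exit is taken.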

In x86-64, arr[i]++ can be done in one instruction: inc dword [rax + rcx*4]. In ARM64, it requires: LDR → ADD → STR. Three instructions minimum. This is the load/store constraint in action.
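The same three-step discipline can be mirrored in C by making the load and the store explicit. This is only an illustration of the instruction sequence, not how anyone would write the loop body: the function name is made up, the local `w4` stands in for register W4, and a 4-byte int is assumed.

```c
#include <stdint.h>
#include <string.h>

/* arr[i]++ spelled out the way a load/store machine executes it. */
void increment_element(int32_t *arr, int32_t i)
{
    /* address of arr[i] = arr + (int64_t)i * 4 */
    unsigned char *addr = (unsigned char *)arr + (int64_t)i * 4;

    int32_t w4;
    memcpy(&w4, addr, sizeof w4);         /* LDR: memory -> register */
    w4 = (int32_t)((uint32_t)w4 + 1u);    /* ADD: modify in the register only */
    memcpy(addr, &w4, sizeof w4);         /* STR: register -> memory */
}
```

Between the load and the store, memory still holds the old value; only the register copy has changed.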

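An optimizing build looks completely different. With optimization enabled, GCC keeps arr, n, and i in registers instead of stack slots, and at -O3 (or, depending on the GCC version and cost model, sometimes at -O2) it can auto-vectorize the loop with NEON SIMD instructions to increment 4 elements at once. But at -O0, you see the raw load/store discipline.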


What You've Learned

After tracing this program:

  1. The ARM64 prologue (STP X29, X30, [SP, #-16]!) saves both the frame pointer and return address in a single instruction
  2. The epilogue (LDP X29, X30, [SP], #16) restores both and moves SP back in one instruction
  3. The first eight integer or pointer arguments arrive in registers 0 through 7: W0/W1/W2 here for 32-bit values, X0/X1/X2 for 64-bit
  4. The return value goes in W0 (32-bit) or X0 (64-bit)
  5. ARM64 MUL, ADD, SUB do not set flags unless you use the S suffix
  6. Load/store forces you to explicitly load, modify in register, then store — no inc [memory] shorthand

This execution model is fundamentally different from x86-64, but it's clean and predictable once internalized.