Case Study 16-2: The ARM64 Execution Model — Tracing an ARM64 Program
Objective
Step through an ARM64 arithmetic program instruction by instruction under QEMU + GDB, observing register state at each step. Then compare the trace with the same algorithm written for x86-64 to see how the two architectures differ in practice.
The Program: Simple Arithmetic
We'll trace this C function in ARM64 assembly:
// compute.c
int compute(int a, int b, int c) {
    int sum = a + b;
    int product = sum * c;
    int result = product - a;
    return result;
}
The ARM64 assembly, hand-written in the style of unoptimized compiler output (real GCC -O0 output would also spill a, b, and c to stack slots):
// compute_arm64.s
// compute(a, b, c): returns (a+b)*c - a
// Arguments: W0=a, W1=b, W2=c
// Return value: W0
.section .text
.global compute
compute:
// Function prologue (save frame pointer and link register)
STP X29, X30, [SP, #-16]! // push {fp, lr}; sp -= 16
MOV X29, SP // fp = sp
// Body: all in 32-bit (W registers) for int arithmetic
ADD W3, W0, W1 // W3 = a + b (sum)
MUL W4, W3, W2 // W4 = sum * c (product)
SUB W0, W4, W0 // W0 = product - a (result, also return value)
// Function epilogue
LDP X29, X30, [SP], #16 // pop {fp, lr}; sp += 16
RET // return (branch to X30)
Test harness (main):
// main_arm64.s
.section .data
fmt: .asciz "compute(3, 4, 5) = %d\n"
.section .text
.extern printf
.global main
// Entry point is main: the C runtime's _start initializes libc,
// then calls main with a valid return address in X30
main:
// Set up frame for main
STP X29, X30, [SP, #-16]!
MOV X29, SP
// Call compute(3, 4, 5)
MOV W0, #3 // a = 3
MOV W1, #4 // b = 4
MOV W2, #5 // c = 5
BL compute // X30 = return address; branch to compute
// On return: W0 = result = (3+4)*5 - 3 = 32
// Call printf(fmt, result)
MOV W1, W0 // second arg = result
ADR X0, fmt // first arg = format string
BL printf
// Return 0 from main; the C runtime then calls exit(0)
MOV W0, #0
LDP X29, X30, [SP], #16
RET
Build with C library for printf:
aarch64-linux-gnu-as compute_arm64.s -o compute_arm64.o
aarch64-linux-gnu-as main_arm64.s -o main_arm64.o
aarch64-linux-gnu-gcc -static compute_arm64.o main_arm64.o -o compute_prog
qemu-aarch64 ./compute_prog
# Output: compute(3, 4, 5) = 32
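As a cross-check on the expected output, the C source can also be compiled natively. The following is a minimal sketch; the helper name format_result is hypothetical and exists only to reproduce the line the assembly harness prints.

```c
#include <stdio.h>

/* Same function as compute.c. */
int compute(int a, int b, int c) {
    int sum = a + b;
    int product = sum * c;
    int result = product - a;
    return result;
}

/* Hypothetical helper: formats the same line the assembly harness prints.
   Returns the number of characters snprintf would have written. */
int format_result(char *buf, size_t len) {
    return snprintf(buf, len, "compute(3, 4, 5) = %d\n", compute(3, 4, 5));
}
```

Calling format_result with a 64-byte buffer and printing the buffer should give the same line as the QEMU run.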
GDB Trace Session
# Terminal 1
qemu-aarch64 -g 1234 ./compute_prog
# Terminal 2
aarch64-linux-gnu-gdb ./compute_prog
(gdb) target remote :1234
# break *compute stops at the first instruction; plain "break compute" may skip past the prologue
(gdb) break *compute
(gdb) continue
At the compute Function Entry
(gdb) info registers x0 x1 x2 x29 x30 sp
x0 0x3 3 // a = 3
x1 0x4 4 // b = 4
x2 0x5 5 // c = 5
x29 0x... (caller's frame pointer)
x30 0x... (return address = address after BL in main)
sp 0x7fffff10 (illustrative stack address)
Tracing the Prologue
STP X29, X30, [SP, #-16]!
This is the canonical ARM64 function prologue. Let's decode it:
STP  X29, X30, [SP, #-16]!
│    │    │    │    │    │
│    │    │    │    │    └── ! means write-back: SP is updated to SP - 16, and the store uses the updated address
│    │    │    │    └─────── pre-indexed offset: -16 is applied before the store
│    │    │    └──────────── base register
│    │    └───────────────── second register to store (X30 → [SP + 8])
│    └────────────────────── first register to store (X29 → [SP + 0])
└─────────────────────────── Store Pair
Effect:
SP = SP - 16
Memory[SP + 0] = X29 (old frame pointer)
Memory[SP + 8] = X30 (return address / link register)
Stack before:
SP → ┌────────────────────────────┐ (high address)
     │ ... (caller's frame) ...   │
     └────────────────────────────┘
Stack after:
SP → ┌────────────────────────────┐
     │ X29 (old FP)               │ +0
     ├────────────────────────────┤
     │ X30 (return address)       │ +8
     ├────────────────────────────┤
     │ ... (caller's frame) ...   │ +16
     └────────────────────────────┘ (high address)
Register trace:
| Instruction | SP | X29 (FP) | X30 (LR) | [SP+0] | [SP+8] |
|---|---|---|---|---|---|
| (before) | 0x7FFFFF10 | old_fp | ret_addr | ? | ? |
| STP X29,X30,[SP,#-16]! | 0x7FFFFF00 | old_fp | ret_addr | old_fp | ret_addr |
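The ordering of a pre-indexed store pair can be modeled in a few lines of C. This is a toy sketch, not an emulator: the Machine struct, the 64-byte stack, and the helper names are assumptions made for illustration, and sp is an offset into the array rather than a real address.

```c
#include <stdint.h>
#include <string.h>

#define STACK_BYTES 64

/* Toy machine state: a small byte-addressed stack plus the three registers
   the prologue touches. sp is an offset into stack[], not a real address. */
typedef struct {
    uint8_t  stack[STACK_BYTES];
    uint64_t sp;
    uint64_t x29;
    uint64_t x30;
} Machine;

/* Models STP X29, X30, [SP, #-16]! (pre-indexed with write-back). */
void stp_pre_index(Machine *m) {
    uint64_t addr = m->sp - 16;               /* address is computed first   */
    memcpy(&m->stack[addr],     &m->x29, 8);  /* first register at addr + 0  */
    memcpy(&m->stack[addr + 8], &m->x30, 8);  /* second register at addr + 8 */
    m->sp = addr;                             /* write-back to SP            */
}

/* Reads a 64-bit value back from the toy stack. */
uint64_t load64(const Machine *m, uint64_t offset) {
    uint64_t v;
    memcpy(&v, &m->stack[offset], 8);
    return v;
}
```

Starting with sp = 48, one call leaves sp = 32 with the old X29 at offset 32 and the old X30 at offset 40, which is the same shape as the register trace table.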
MOV X29, SP
Sets the frame pointer to the current stack pointer. X29 now points at the saved X29/X30 pair (the frame record), which sits at the lowest address of this frame and is also the current top of the stack.
| Instruction | SP | X29 | Notes |
|---|---|---|---|
| MOV X29, SP | 0x7FFFFF00 | 0x7FFFFF00 | FP = SP |
Tracing the Body
ADD W3, W0, W1
ADD W3, W0, W1
W3 = W0 + W1 = 3 + 4 = 7
(W3 is the 32-bit view of X3)
Note: W register write zero-extends → X3 = 0x0000000000000007
Flags NOT updated (no S suffix)
| Instruction | W0 | W1 | W2 | W3 | W4 | NZCV |
|---|---|---|---|---|---|---|
| (before) | 3 | 4 | 5 | ? | ? | ???? |
| ADD W3, W0, W1 | 3 | 4 | 5 | 7 | ? | unchanged |
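The zero-extension rule for W register writes can be illustrated with a small C model. The function names write_w and add_w are hypothetical; the sketch assumes only the behavior described above (32-bit result, upper 32 bits of the X register cleared).

```c
#include <stdint.h>

/* Models a write to a W register: the low 32 bits receive the value and the
   upper 32 bits of the containing X register are cleared. */
uint64_t write_w(uint64_t old_x, uint32_t value) {
    (void)old_x;              /* previous contents are discarded entirely */
    return (uint64_t)value;   /* zero-extend to 64 bits */
}

/* Models ADD Wd, Wn, Wm: 32-bit wrapping add, result zero-extended. */
uint64_t add_w(uint64_t old_xd, uint64_t xn, uint64_t xm) {
    uint32_t sum = (uint32_t)xn + (uint32_t)xm;  /* wraps modulo 2^32 */
    return write_w(old_xd, sum);
}
```

Even if X3 held all ones beforehand, add_w leaves it as 0x0000000000000007 for the inputs 3 and 4.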
MUL W4, W3, W2
MUL W4, W3, W2
W4 = W3 * W2 = 7 * 5 = 35
MUL in ARM64 is actually MADD Wd, Wn, Wm, WZR (multiply and add zero)
Flags NOT updated
| Instruction | W0 | W1 | W2 | W3 | W4 | NZCV |
|---|---|---|---|---|---|---|
| MUL W4, W3, W2 | 3 | 4 | 5 | 7 | 35 | unchanged |
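The MUL/MADD alias relationship can be written out directly. This is a sketch of the arithmetic only; the function names are hypothetical and the zero register is represented by a constant 0.

```c
#include <stdint.h>

/* Models MADD Wd, Wn, Wm, Wa: Wd = Wa + Wn * Wm, modulo 2^32. */
uint32_t madd_w(uint32_t wn, uint32_t wm, uint32_t wa) {
    return wa + wn * wm;
}

/* Models MUL Wd, Wn, Wm as the alias MADD Wd, Wn, Wm, WZR. */
uint32_t mul_w(uint32_t wn, uint32_t wm) {
    const uint32_t wzr = 0;   /* the zero register always reads as 0 */
    return madd_w(wn, wm, wzr);
}
```

For the trace values, mul_w(7, 5) gives 35. Results that do not fit in 32 bits wrap, as they would in a W register.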
SUB W0, W4, W0
SUB W0, W4, W0
W0 = W4 - W0 = 35 - 3 = 32
Result stored back in W0 (the return value register)
Flags NOT updated
| Instruction | W0 | W1 | W2 | W3 | W4 | NZCV |
|---|---|---|---|---|---|---|
| SUB W0, W4, W0 | 32 | 4 | 5 | 7 | 35 | unchanged |
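The three body instructions can be replayed in C to reproduce the register tables. This is a minimal sketch: the Regs struct and trace_body are hypothetical names, and only W0 through W4 are modeled.

```c
#include <stdint.h>

/* Register snapshot for the five W registers used in the trace. */
typedef struct {
    uint32_t w[5];   /* w[0]..w[4] stand for W0..W4 */
} Regs;

/* Executes the three body instructions of compute in order and records the
   register file after each one in trace[0..2]. */
void trace_body(Regs r, Regs trace[3]) {
    r.w[3] = r.w[0] + r.w[1];   /* ADD W3, W0, W1 */
    trace[0] = r;
    r.w[4] = r.w[3] * r.w[2];   /* MUL W4, W3, W2 */
    trace[1] = r;
    r.w[0] = r.w[4] - r.w[0];   /* SUB W0, W4, W0 */
    trace[2] = r;
}
```

With W0 = 3, W1 = 4, W2 = 5 as the starting state, the three snapshots show W3 = 7, then W4 = 35, then W0 = 32, while W1 and W2 stay untouched.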
Tracing the Epilogue
LDP X29, X30, [SP], #16
The mirror image of STP. Post-indexed: load first, then update SP.
LDP X29, X30, [SP], #16
X29 = Memory[SP + 0] (restore old frame pointer)
X30 = Memory[SP + 8] (restore return address)
SP = SP + 16 (post-indexed: done AFTER the load)
Stack after:
SP → ┌────────────────────────────┐
     │ ... (caller's frame) ...   │ ← SP back to where it was before prologue
     └────────────────────────────┘
| Instruction | SP | X29 | X30 |
|---|---|---|---|
| LDP X29,X30,[SP],#16 | 0x7FFFFF10 | old_fp | ret_addr |
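Post-indexed addressing reverses the order of address use and update, which a short C model makes explicit. This is a toy sketch under stated assumptions: mem is a plain byte array, *sp is an offset into it rather than a real address, and the function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Models LDP X29, X30, [SP], #16 (post-indexed).
   mem is a toy byte array; *sp is an offset into it, not a real address. */
void ldp_post_index(const uint8_t *mem, uint64_t *sp,
                    uint64_t *x29, uint64_t *x30) {
    uint64_t addr = *sp;              /* the load uses SP unmodified   */
    memcpy(x29, mem + addr,     8);   /* first register from addr + 0  */
    memcpy(x30, mem + addr + 8, 8);   /* second register from addr + 8 */
    *sp = addr + 16;                  /* SP is updated after the loads */
}
```

If the saved pair sits at offsets 32 and 40 and sp is 32, the call restores both registers and leaves sp at 48.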
RET
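RET with no operand defaults to X30 and behaves like BR X30, except that it also hints to the branch predictor that this is a function return. The processor jumps to the address in X30, which is the instruction after the BL compute call in main. W0 = 32 is the return value.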
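The BL/RET pairing can be sketched as two updates to a program counter and a link register. The Cpu struct and the addresses used with it are hypothetical; the only architectural facts assumed are that BL stores the address of the next instruction in X30 and that every A64 instruction is 4 bytes long.

```c
#include <stdint.h>

/* Toy control-flow state: program counter and link register. */
typedef struct {
    uint64_t pc;
    uint64_t x30;
} Cpu;

/* Models BL target: save the address of the next instruction in X30,
   then branch. Every A64 instruction is 4 bytes long. */
void bl(Cpu *cpu, uint64_t target) {
    cpu->x30 = cpu->pc + 4;
    cpu->pc  = target;
}

/* Models RET with its default operand X30. */
void ret(Cpu *cpu) {
    cpu->pc = cpu->x30;
}
```

No stack access is involved in either step, which is why a non-leaf function has to save X30 itself before making another call.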
Comparing to Equivalent x86-64
The same compute(a, b, c) function in x86-64:
; x86-64 System V ABI: edi=a, esi=b, edx=c, return in eax
compute:
push rbp
mov rbp, rsp
; sum = a + b
lea eax, [rdi + rsi] ; eax = a + b (lea trick: no flags)
; product = sum * c
imul eax, edx ; eax = sum * c
; result = product - a
sub eax, edi ; eax = product - a
pop rbp
ret
Side-by-side comparison:
x86-64                              ARM64
────────────────────────────────    ────────────────────────────────
push rbp                            STP X29, X30, [SP, #-16]!
mov rbp, rsp                        MOV X29, SP
; sum = edi + esi → eax             ADD W3, W0, W1
lea eax, [rdi + rsi]
; product = eax * edx               MUL W4, W3, W2
imul eax, edx
; result = eax - edi                SUB W0, W4, W0
sub eax, edi
pop rbp                             LDP X29, X30, [SP], #16
ret                                 RET
────────────────────────────────    ────────────────────────────────
7 instructions                      7 instructions
Uses LEA for add (no flags)         ADD doesn't set flags anyway
Return address on stack             Return address in X30
IMUL sets flags                     MUL does NOT set flags
For this simple function, the instruction count is identical. The key differences:
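1. x86-64 saves and restores rbp with separate push/pop instructions; ARM64 uses STP/LDP (store/load pair), which moves two registers in one instruction
2. On x86-64 the return address and the frame pointer reach the stack in two separate writes (call pushes the return address, then push rbp saves the frame pointer); ARM64's single STP writes both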
3. ARM64 explicitly uses W3 and W4 as temporaries; x86-64 reuses EAX throughout
4. ARM64's MUL doesn't touch flags; x86-64's IMUL does
The Load/Store Model in Action
Now let's look at a function that demonstrates why load/store matters:
void increment_array(int *arr, int n) {
    for (int i = 0; i < n; i++) {
        arr[i]++;
    }
}
ARM64 assembly (hand-written in the style of GCC -O0, with every variable kept in a stack slot):
// X0 = arr, W1 = n
// Frame layout (32 bytes): [SP+0] X29, [SP+8] X30, [SP+16] arr, [SP+24] n, [SP+28] i
increment_array:
STP X29, X30, [SP, #-32]!
MOV X29, SP
STR X0, [SP, #16] // store arr to stack slot
STR W1, [SP, #24] // store n to stack slot
STR WZR, [SP, #28] // i = 0
.loop_check:
LDR W0, [SP, #28] // W0 = i
LDR W1, [SP, #24] // W1 = n
CMP W0, W1
B.GE .loop_exit
.loop_body:
LDR X0, [SP, #16] // X0 = arr
LDR W1, [SP, #28] // W1 = i
// Address of arr[i]: arr + i*4 (int is 4 bytes)
LSL W2, W1, #2 // W2 = i << 2 = i * 4
// W2 is 32-bit and X0 is 64-bit, so sign-extend W2 with SXTW
ADD X3, X0, W2, SXTW // X3 = arr + (int64_t)(i*4)
LDR W4, [X3] // W4 = arr[i] ← mandatory LOAD
ADD W4, W4, #1 // W4 = arr[i] + 1
STR W4, [X3] // arr[i] = W4 ← mandatory STORE
LDR W0, [SP, #28] // W0 = i
ADD W0, W0, #1 // W0 = i + 1
STR W0, [SP, #28] // store i back
B .loop_check
.loop_exit:
LDP X29, X30, [SP], #32
RET
In x86-64, arr[i]++ can be done in one instruction: inc dword [rax + rcx*4]. In ARM64, it requires: LDR → ADD → STR. Three instructions minimum. This is the load/store constraint in action.
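The load, modify, store sequence can be spelled out in C so that each statement corresponds to one instruction. This is a sketch for illustration; increment_element is a hypothetical helper, and the explicit byte arithmetic only mirrors what the address calculation in the assembly does.

```c
#include <stdint.h>

/* arr[i]++ written as the three explicit steps a load/store machine needs. */
void increment_element(int *arr, int i) {
    /* Address arithmetic: base + sign-extended index * sizeof(int). */
    int *addr = (int *)((char *)arr + (int64_t)i * (int64_t)sizeof(int));

    int tmp = *addr;   /* LDR: memory -> register */
    tmp = tmp + 1;     /* ADD: register only      */
    *addr = tmp;       /* STR: register -> memory */
}

/* The loop from increment_array, built on the helper above. */
void increment_array(int *arr, int n) {
    for (int i = 0; i < n; i++) {
        increment_element(arr, i);
    }
}
```

An x86-64 compiler is free to fold the three statements into a single read-modify-write instruction; an ARM64 compiler has to emit all three.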
With optimization enabled, GCC may compile this loop very differently: it can vectorize it (at -O3, and at -O2 in recent GCC versions), using NEON SIMD instructions to increment four elements at once. At -O0, you see the raw load/store discipline.
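The shape of a four-elements-at-a-time loop can be sketched in portable C. This is not the compiler's output and it uses no NEON intrinsics; it only mirrors the structure of a vectorized loop, with a main loop that handles four lanes per iteration and a scalar tail for the leftovers. The function name is hypothetical.

```c
/* Portable sketch of a 4-lane loop; a vectorizing compiler could map each
   main-loop iteration to one vector load, one vector add, one vector store. */
void increment_array_by4(int *arr, int n) {
    int i = 0;

    /* Main loop: four consecutive elements per iteration. */
    for (; i <= n - 4; i += 4) {
        arr[i + 0] += 1;
        arr[i + 1] += 1;
        arr[i + 2] += 1;
        arr[i + 3] += 1;
    }

    /* Scalar tail: the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        arr[i] += 1;
    }
}
```

Even in the vector form the load/store rule still holds: the data is loaded into a vector register, modified there, and stored back.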
What You've Learned
After tracing this program:
- The ARM64 prologue (STP X29, X30, [SP, #-16]!) saves both the frame pointer and the return address in a single instruction
- The epilogue (LDP X29, X30, [SP], #16) restores both and moves SP back in one instruction
- Function arguments arrive in W0/W1/W2 for 32-bit values, X0/X1/X2 for 64-bit
- The return value goes in W0 (32-bit) or X0 (64-bit)
- ARM64 ADD and SUB do not set flags unless you use the S suffix (ADDS, SUBS); MUL never sets flags
- Load/store forces you to explicitly load, modify in register, then store; there is no inc [memory] shorthand
This execution model is fundamentally different from x86-64, but it's clean and predictable once internalized.