In This Chapter
- The Working Vocabulary
- 17.1 Data Processing Instructions
- 17.2 Comparison and Flag Instructions
- 17.3 Memory Instructions: Load and Store
- 17.4 Branch Instructions
- 17.5 AAPCS64 Calling Convention: Full Details
- 17.6 ARM64 Linux System Calls
- 17.7 Side-by-Side: x86-64 vs. ARM64 Instruction Comparison
- 17.8 Translating a C Function to ARM64
- Summary
Chapter 17: ARM64 Instruction Set
The Working Vocabulary
Chapter 16 gave you the philosophy. This chapter gives you the instructions. By the end of this chapter, you'll be able to read any ARM64 assembly listing and understand what's happening, and write ARM64 programs from scratch.
ARM64's instruction set is large but internally regular. Unlike x86-64, where ADD, IMUL, DIV, PUSH, CALL, and REP MOVS all follow completely different rules, ARM64 instructions mostly share the same structure: operation + destination register + source register(s) + optional modifier. Learn the pattern once; it applies everywhere.
17.1 Data Processing Instructions
Basic Arithmetic: ADD, ADDS, SUB, SUBS
ADD Xd, Xn, Xm // Xd = Xn + Xm (no flags)
ADD Xd, Xn, #imm // Xd = Xn + immediate (12-bit, optional LSL #12)
ADDS Xd, Xn, Xm // Xd = Xn + Xm, update N/Z/C/V
SUB Xd, Xn, Xm // Xd = Xn - Xm
SUB Xd, Xn, #imm // Xd = Xn - immediate
SUBS Xd, Xn, Xm // Xd = Xn - Xm, update flags
The immediate encoding deserves explanation. ARM64 data processing instructions can encode a 12-bit immediate (0–4095). Optionally, that immediate can be shifted left by 12 bits first, giving any multiple of 4096 up to 4095 × 4096. This covers the ranges needed for most small constants and page-aligned offsets.
ADD X0, X1, #1000 // X0 = X1 + 1000
ADD X0, X1, #1, LSL #12 // X0 = X1 + 4096 (1 shifted left 12)
For carry-propagating multi-precision arithmetic:
ADC Xd, Xn, Xm // Xd = Xn + Xm + C (add with carry)
ADCS Xd, Xn, Xm // Same, update flags
SBC Xd, Xn, Xm // Xd = Xn - Xm - (1 - C) (subtract with borrow)
SBCS Xd, Xn, Xm // Same, update flags
128-bit addition example:
// Add two 128-bit numbers: (X1:X0) + (X3:X2) → (X5:X4)
ADDS X4, X0, X2 // Low 64 bits, set C on carry
ADC X5, X1, X3 // High 64 bits + carry from low
Logic: AND, ORR, EOR, BIC, ORN, EON
AND Xd, Xn, Xm // Xd = Xn & Xm
ANDS Xd, Xn, Xm // Same, update flags (TST = ANDS XZR, Xn, Xm)
ORR Xd, Xn, Xm // Xd = Xn | Xm
EOR Xd, Xn, Xm // Xd = Xn ^ Xm (XOR)
BIC Xd, Xn, Xm // Xd = Xn & ~Xm (bit clear = AND NOT)
BICS Xd, Xn, Xm // Same, update flags
ORN Xd, Xn, Xm // Xd = Xn | ~Xm (OR NOT)
EON Xd, Xn, Xm // Xd = Xn ^ ~Xm (XOR NOT = XNOR)
Logical immediates (ANDS/ORR/EOR/AND with immediate) use a special encoding: the immediate must be a pattern of repeated bit sequences. Not all constants are expressible, but the compiler handles this automatically.
AND X0, X0, #0xFF // X0 &= 0xFF (extract low byte)
ORR X0, X0, #0x80 // X0 |= 0x80 (set high bit of byte)
EOR X0, X0, #0xFF // X0 ^= 0xFF (flip low byte)
Multiply
MUL Xd, Xn, Xm // Xd = Xn * Xm (lower 64 bits)
// Pseudoinstruction for: MADD Xd, Xn, Xm, XZR
MADD Xd, Xn, Xm, Xa // Xd = Xa + Xn*Xm (multiply-accumulate)
MSUB Xd, Xn, Xm, Xa // Xd = Xa - Xn*Xm (multiply-subtract)
MNEG Xd, Xn, Xm // Xd = -(Xn*Xm) (MSUB Xd, Xn, Xm, XZR)
// High-half multiply (for 128-bit results)
SMULH Xd, Xn, Xm // Xd = upper 64 bits of Xn*Xm (signed)
UMULH Xd, Xn, Xm // Xd = upper 64 bits of Xn*Xm (unsigned)
Getting a full 128-bit multiply result:
// X2:X0 = X0 * X1 (128-bit result)
UMULH X2, X0, X1 // X2 = high 64 bits
MUL X0, X0, X1 // X0 = low 64 bits
⚠️ Common Mistake: ARM64 has no implicit 128-bit multiply like x86-64's MUL rax (which puts the full product in RDX:RAX). You must use SMULH/UMULH explicitly for the high half.
📊 C Comparison: In C, (uint64_t)a * b gives the low 64 bits. You need __uint128_t multiplication or compiler builtins for the high bits. ARM64's UMULH is the hardware instruction that makes those builtins fast.
Division
SDIV Xd, Xn, Xm // Xd = Xn / Xm (signed integer division)
UDIV Xd, Xn, Xm // Xd = Xn / Xm (unsigned integer division)
Critical differences from x86-64:
1. ARM64 division stores ONLY the quotient in Xd — there is no remainder register
2. To get the remainder: compute remainder = dividend - (quotient * divisor)
3. Division by zero does NOT trap — it returns 0 (this is architecturally defined for UDIV/SDIV, not merely a Linux convention)
// C: q = a / b; r = a % b;
// X0 = a, X1 = b
SDIV X2, X0, X1 // X2 = a / b (quotient)
MSUB X3, X2, X1, X0 // X3 = X0 - X2*X1 = a - (a/b)*b = a % b (remainder)
🔍 Under the Hood: x86-64's IDIV produces both quotient and remainder in one instruction (RAX and RDX respectively). ARM64 requires SDIV + MSUB. This looks less efficient, but modern ARM64 CPUs have fast hardware dividers, and the compiler often optimizes integer division by constants into multiply-shift sequences that don't use SDIV at all.
Move Instructions
MOV Xd, Xn // Xd = Xn (ORR Xd, XZR, Xn)
MOV Xd, #imm16 // Xd = 16-bit immediate (MOVZ)
MOVZ Xd, #imm16 // Xd = imm16 (zero other bits)
MOVZ Xd, #imm16, LSL #16 // Xd = imm16 << 16
MOVK Xd, #imm16, LSL #32 // Xd[47:32] = imm16 (keep other bits)
MOVN Xd, #imm16 // Xd = ~imm16 (move NOT)
Loading a 64-bit constant requires multiple instructions:
// Load 0xDEADBEEFCAFEBABE into X0
MOVZ X0, #0xBABE // X0 = 0x000000000000BABE
MOVK X0, #0xCAFE, LSL #16 // X0 = 0x00000000CAFEBABE
MOVK X0, #0xBEEF, LSL #32 // X0 = 0x0000BEEFCAFEBABE
MOVK X0, #0xDEAD, LSL #48 // X0 = 0xDEADBEEFCAFEBABE
Shifts and Rotates (Inline)
ARM64's barrel shifter can apply a shift to the second source operand of any data processing instruction. This is an extremely powerful feature; the closest x86-64 analogue, LEA's scale factor, is limited to one instruction and scales of 1, 2, 4, or 8.
ADD X0, X1, X2, LSL #3 // X0 = X1 + (X2 << 3) = X1 + X2*8
ADD X0, X1, X2, LSR #2 // X0 = X1 + (X2 >> 2) (logical)
ADD X0, X1, X2, ASR #1 // X0 = X1 + (X2 >> 1) (arithmetic)
ORR X0, X1, X2, ROR #4 // X0 = X1 | rotate_right(X2, 4)
Shift types:
- LSL #n — Logical Shift Left (fill with zeros)
- LSR #n — Logical Shift Right (fill with zeros — unsigned right shift)
- ASR #n — Arithmetic Shift Right (fill with sign bit — signed right shift)
- ROR #n — ROtate Right
As standalone instructions:
LSL Xd, Xn, #n // Xd = Xn << n
LSL Xd, Xn, Xm // Xd = Xn << (Xm & 63)
LSR Xd, Xn, #n // Xd = Xn >> n (unsigned)
ASR Xd, Xn, #n // Xd = Xn >> n (signed)
ROR Xd, Xn, #n // Xd = rotate_right(Xn, n)
ROR Xd, Xn, Xm // Xd = rotate_right(Xn, Xm & 63)
⚡ Performance Note: ADD X0, X1, X2, LSL #3 is a single instruction that computes array index arithmetic: base + index * 8. In x86-64, this requires LEA rax, [rbx + rcx*8] — same instruction count, but ARM64's approach generalizes to any power-of-2 scale factor embedded in any ALU instruction, not just LEA.
Bit Field Operations
UBFX Xd, Xn, #lsb, #width // Extract unsigned bit field
SBFX Xd, Xn, #lsb, #width // Extract signed bit field (sign extend)
BFI Xd, Xn, #lsb, #width // Insert bit field into Xd from Xn
BFXIL Xd, Xn, #lsb, #width // Extract and insert at low end
// Example: extract bits [11:8] (4 bits) from X1
UBFX X0, X1, #8, #4 // X0 = (X1 >> 8) & 0xF
17.2 Comparison and Flag Instructions
CMP Xn, Xm // SUBS XZR, Xn, Xm — compare (set flags)
CMP Xn, #imm // SUBS XZR, Xn, #imm
CMN Xn, Xm // ADDS XZR, Xn, Xm — compare negative
TST Xn, Xm // ANDS XZR, Xn, Xm — test bits
ARM64 does NOT have TEQ (test equivalent/XOR). If you need it, use EOR and check the result.
17.3 Memory Instructions: Load and Store
This is where ARM64 programming gets rich. The addressing modes are varied but consistent.
Basic Load and Store
LDR Xd, [Xn] // Xd = Memory[Xn] (load 64-bit)
STR Xd, [Xn] // Memory[Xn] = Xd (store 64-bit)
LDR Wd, [Xn] // Wd = Memory[Xn] (load 32-bit, zero-extend)
STR Wd, [Xn] // Memory[Xn] = Wd (store 32-bit)
Offset Addressing
LDR Xd, [Xn, #imm] // Xd = Memory[Xn + imm] (immediate offset)
STR Xd, [Xn, #imm] // Memory[Xn + imm] = Xd
LDR Xd, [Xn, Xm] // Xd = Memory[Xn + Xm] (register offset)
LDR Xd, [Xn, Xm, LSL #3] // Xd = Memory[Xn + Xm*8] (scaled register)
The immediate offset in LDR Xd, [Xn, #imm] is a 12-bit unsigned value scaled by the access size. For 64-bit loads, that's 0–32760 in steps of 8. For 32-bit loads, 0–16380 in steps of 4.
// Accessing a struct:
// struct { int64_t x; int64_t y; int64_t z; } at X0
LDR X1, [X0, #0] // X1 = s.x
LDR X2, [X0, #8] // X2 = s.y
LDR X3, [X0, #16] // X3 = s.z
Pre-indexed and Post-indexed Addressing
Pre-indexed: update the base register BEFORE the access (the ! suffix):
LDR Xd, [Xn, #imm]! // Xn += imm, then Xd = Memory[Xn]
STR Xd, [Xn, #imm]! // Xn += imm, then Memory[Xn] = Xd
Post-indexed: access first, then update the base register:
LDR Xd, [Xn], #imm // Xd = Memory[Xn], then Xn += imm
STR Xd, [Xn], #imm // Memory[Xn] = Xd, then Xn += imm
These are essential for array traversal and stack operations:
// Walk an array of 8-byte elements
MOV X1, X0 // X1 = array pointer
.loop:
LDR X2, [X1], #8 // X2 = *X1; X1 += 8 (advance to next element)
// ... process X2 ...
CBZ X2, .done // if X2 == 0, exit (null-terminated array example)
B .loop
.done:
💡 Mental Model: Pre-indexed [Xn, #imm]! is "move then access." Post-indexed [Xn], #imm is "access then move." The ! means the base register is updated (written back). This is ARM's equivalent of C's *p++ (post-increment) and *++p (pre-increment).
Load/Store Pair: LDP and STP
One of ARM64's most useful instructions. Load or store two registers simultaneously:
LDP Xd1, Xd2, [Xn] // Xd1 = Memory[Xn]; Xd2 = Memory[Xn+8]
STP Xt1, Xt2, [Xn] // Memory[Xn] = Xt1; Memory[Xn+8] = Xt2
// Pre/post-indexed versions:
STP X29, X30, [SP, #-16]! // SP -= 16; store pair (canonical prologue)
LDP X29, X30, [SP], #16 // load pair; SP += 16 (canonical epilogue)
LDP/STP halves the number of instructions needed for saving/restoring registers:
// Save 4 callee-saved registers (typical function entry)
STP X19, X20, [SP, #-32]! // save X19 and X20, advance SP
STP X21, X22, [SP, #16] // save X21 and X22 at SP+16
// ... function body ...
LDP X21, X22, [SP, #16] // restore X21 and X22
LDP X19, X20, [SP], #32 // restore X19 and X20, reclaim stack
Sized Loads: Byte, Halfword, Word, Doubleword
LDRB Wd, [Xn] // Load byte, zero-extend to 32 bits
LDRH Wd, [Xn] // Load halfword (16-bit), zero-extend
LDRSB Wd, [Xn] // Load byte, sign-extend to 32 bits
LDRSH Wd, [Xn] // Load halfword, sign-extend to 32 bits
LDRSB Xd, [Xn] // Load byte, sign-extend to 64 bits
LDRSH Xd, [Xn] // Load halfword, sign-extend to 64 bits
LDRSW Xd, [Xn] // Load word (32-bit), sign-extend to 64 bits
STRB Wd, [Xn] // Store low byte of Wd
STRH Wd, [Xn] // Store low halfword of Wd
⚠️ Common Mistake: LDR Wd, [Xn] loads 32 bits and zero-extends into Xd. LDRSW Xd, [Xn] loads 32 bits and sign-extends into Xd. Use LDRSW when working with int arrays that contain negative numbers and you need the result in 64-bit arithmetic.
17.4 Branch Instructions
Unconditional Branch
B label // PC = label (26-bit PC-relative offset, ±128MB range)
BL label // X30 = PC+4; PC = label (call: saves return address)
BR Xn // PC = Xn (indirect jump — register contains address)
BLR Xn // X30 = PC+4; PC = Xn (indirect call)
RET // PC = X30 (return — BR X30)
RET Xn // PC = Xn (return via specific register)
Conditional Branch
B.EQ label // Branch if Z=1 (equal)
B.NE label // Branch if Z=0 (not equal)
B.LT label // Branch if signed less than (N≠V)
B.LE label // Branch if signed less or equal (Z=1 or N≠V)
B.GT label // Branch if signed greater than (Z=0 and N=V)
B.GE label // Branch if signed greater or equal (N=V)
B.LO label // Branch if unsigned lower (C=0)
B.LS label // Branch if unsigned lower or same (C=0 or Z=1)
B.HI label // Branch if unsigned higher (C=1 and Z=0)
B.HS label // Branch if unsigned higher or same (C=1)
B.MI label // Branch if minus (N=1)
B.PL label // Branch if plus/zero (N=0)
B.VS label // Branch if overflow (V=1)
B.VC label // Branch if no overflow (V=0)
Conditional branches have a ±1MB range (19-bit PC-relative offset). For longer jumps, invert the condition and branch over an unconditional B (which has ±128MB range), or use an indirect branch through a register.
Compare and Branch: CBZ, CBNZ
Compare register to zero and branch — no flag update:
CBZ Xn, label // Branch if Xn == 0
CBNZ Xn, label // Branch if Xn != 0
CBZ Wn, label // 32-bit version (checks W register)
This avoids needing a separate CMP + B.EQ for the extremely common "is this zero?" test:
// Loop until X0 is zero
.loop:
// ... work using X0 ...
SUBS X0, X0, #1 // decrement counter, set flags
CBNZ X0, .loop // if X0 != 0, continue
// Alternatively: CBZ/CBNZ after any computation
Test Bit and Branch: TBZ, TBNZ
TBZ Xn, #bit, label // Branch if bit N of Xn is 0
TBNZ Xn, #bit, label // Branch if bit N of Xn is non-zero
Useful for flag-bit testing without TST + B.EQ:
TBNZ X0, #0, .is_odd // Branch if bit 0 of X0 is set (X0 is odd)
TBZ X0, #63, .positive // Branch if bit 63 is clear (X0 is positive int64)
Range: ±32KB (14-bit PC-relative offset).
17.5 AAPCS64 Calling Convention: Full Details
The ARM Procedure Call Standard for AArch64 (AAPCS64) defines how functions talk to each other. You cannot deviate from this if you want your assembly to work with C code.
Argument Passing
AAPCS64 Argument Passing Rules
┌──────────────────────────────────────────────────────────────────────────┐
│ Integer/Pointer Arguments: │
│ Arguments 1-8: X0-X7 (64-bit) or W0-W7 (32-bit) │
│   Arguments 9+:  Passed on the stack (in order, at increasing addresses)  │
│ │
│ Floating-Point Arguments: │
│ Arguments 1-8: D0-D7 (double) or S0-S7 (float) │
│ Arguments 9+: Pushed on stack │
│ │
│ Return Values: │
│ Integer/Pointer: X0 (64-bit) or W0 (32-bit) │
│ Two 64-bit values: X0, X1 │
│ Floating-point: D0 (double) or S0 (float) │
│ Large structs: caller allocates, passes pointer in X8 │
└──────────────────────────────────────────────────────────────────────────┘
Caller-Saved vs. Callee-Saved
Caller-Saved (may be clobbered by called function — caller must save if needed):
X0-X18 (X8 is the indirect-result register; X16-X17 are the linker's
scratch registers IP0/IP1; X18 is the platform register — avoid it in
portable code)
Callee-Saved (must be preserved by the called function):
X19-X28, X29 (FP), X30 (LR), SP
Note on X30 (LR): technically caller-saved, but the callee must preserve it
if it makes any function calls (BL overwrites X30).
Convention: non-leaf functions always save X30 via STP.
Stack Alignment
The SP must be 16-byte aligned at the point of any function call. This is architectural — violating it causes an SP_ELx alignment fault on hardware.
The 16-byte alignment rule means you always allocate stack space in multiples of 16:
// Bad: SP becomes 8-byte aligned (not 16)
SUB SP, SP, #8 // allocate 8 bytes
BL some_function // fault or undefined behavior!
// Good: SP stays 16-byte aligned
SUB SP, SP, #16 // allocate 16 bytes (minimum allocation unit)
BL some_function
The Canonical Function Frame
// Standard ARM64 function with local variables
my_function:
    // Prologue: save FP and LR, allocate the whole frame at once
    STP X29, X30, [SP, #-64]!  // 64 bytes: FP/LR + four saved regs + 16 for locals
    MOV X29, SP                // FP = SP (point to frame)
    // Optional: save callee-saved registers if used
    STP X19, X20, [SP, #16]    // save X19, X20 at SP+16
    STP X21, X22, [SP, #32]    // save X21, X22 at SP+32
    // Access local variables (relative to SP or FP):
    STR X0, [SP, #48]          // first 8-byte local at SP+48
    LDR X1, [SP, #56]          // second 8-byte local at SP+56
    // ... function body ...
    // Epilogue: restore and return
    LDP X21, X22, [SP, #32]
    LDP X19, X20, [SP, #16]
    LDP X29, X30, [SP], #64    // restore FP, LR; SP += 64
    RET
Stack frame layout:
High address
┌──────────────────────────────────┐
│ Caller's frame                   │ ← SP on function entry (no return address
├──────────────────────────────────┤   slot — the return address is in X30/LR)
│ [SP+48..63] Local variables      │
│ [SP+40] Saved X22                │
│ [SP+32] Saved X21                │
│ [SP+24] Saved X20                │
│ [SP+16] Saved X19                │
│ [SP+8]  Saved X30 (return addr)  │
│ [SP+0]  Saved X29 (old FP)       │
└──────────────────────────────────┘ ← SP after STP X29, X30, [SP, #-64]!
Low address
🔍 Under the Hood: x86-64's CALL instruction pushes an 8-byte return address on the stack, making the stack 8-byte aligned at function entry (and you need sub rsp, 8 or push rbp to restore 16-byte alignment). ARM64's BL doesn't push anything — the return address goes in X30, so SP is still 16-byte aligned on function entry. The prologue STP X29, X30, [SP, #-16]! allocates 16 bytes and stores two 8-byte values, maintaining alignment.
17.6 ARM64 Linux System Calls
System Call Mechanism
// Linux ARM64 system call convention:
// X8 = syscall number
// X0 = argument 1
// X1 = argument 2
// X2 = argument 3
// X3 = argument 4
// X4 = argument 5
// X5 = argument 6
// SVC #0 — invoke kernel
// X0 = return value (negative errno on error)
Common Linux ARM64 Syscall Numbers
Linux ARM64 Syscall Reference (selected)
┌─────────┬───────────┬──────────────────────────────────────────────────┐
│ Number │ Name │ Signature │
├─────────┼───────────┼──────────────────────────────────────────────────┤
│ 0 │ io_setup │ - │
│ 3 │ io_cancel │ - │
│ 56 │ openat │ (dirfd, pathname, flags, mode) → fd │
│ 57 │ close │ (fd) → 0 │
│ 63 │ read │ (fd, buf, count) → bytes │
│ 64 │ write │ (fd, buf, count) → bytes │
│ 93 │ exit │ (status) → (no return) │
│ 94 │ exit_group│ (status) → (no return) │
│ 172 │ getpid │ () → pid │
│ 174 │ getuid │ () → uid │
│ 220 │ clone │ (flags, stack, ...) → pid │
│ 221 │ execve │ (path, argv, envp) → (no return on success) │
│ 222 │ mmap │ (addr, len, prot, flags, fd, offset) → addr │
│ 226 │ mprotect │ (addr, len, prot) → 0 │
│ 233 │ madvise │ (addr, len, advice) → 0 │
└─────────┴───────────┴──────────────────────────────────────────────────┘
Note: Linux ARM64 uses the generic syscall table (not x86-derived).
write(fd, buf, count) is syscall 64, not 1 as on x86-64.
openat is used instead of open (syscall 56, not 2).
Complete File I/O Example
// write_file.s — Write "Hello" to a file
// Uses: openat, write, close, exit
.section .rodata
filename: .asciz "output.txt"
message: .ascii "Hello, file!\n"
msg_len = . - message
// O_WRONLY|O_CREAT|O_TRUNC = 0x241 on Linux
O_WRONLY = 1
O_CREAT = 64 // 0x40
O_TRUNC = 512 // 0x200
O_FLAGS = O_WRONLY | O_CREAT | O_TRUNC // = 577 = 0x241
.section .text
.global _start
_start:
    // === openat(AT_FDCWD, filename, O_WRONLY|O_CREAT|O_TRUNC, 0644) ===
    MOV X8, #56          // syscall: openat
    MOV X0, #-100        // AT_FDCWD = -100 (relative to current dir)
    ADR X1, filename     // pathname
    MOV X2, #O_FLAGS     // 0x241 = 577; fits MOV's 16-bit immediate
    MOV X3, #0644        // mode (GAS reads the leading 0 as octal) = 420 decimal
SVC #0 // fd = openat(...)
// X0 = file descriptor (or negative errno)
MOV X19, X0 // save fd in callee-saved X19
// === write(fd, message, msg_len) ===
MOV X8, #64 // syscall: write
MOV X0, X19 // fd
ADR X1, message // buffer
MOV X2, #msg_len // count
SVC #0 // write(...)
// === close(fd) ===
MOV X8, #57 // syscall: close
MOV X0, X19 // fd
SVC #0
// === exit(0) ===
MOV X8, #93 // syscall: exit
MOV X0, #0
SVC #0
17.7 Side-by-Side: x86-64 vs. ARM64 Instruction Comparison
x86-64 / ARM64 Instruction Reference
═══════════════════════════════════════════════════════════════════════════
Operation x86-64 ARM64
───────────────────────────────────────────────────────────────────────────
Move reg→reg mov rax, rbx MOV X0, X1
Move immediate mov rax, 42 MOV X0, #42
Load 64-bit mov rax, [rbx] LDR X0, [X1]
Load 32-bit mov eax, [rbx] LDR W0, [X1]
Load byte movzx eax, byte [rbx] LDRB W0, [X1]
Store 64-bit mov [rbx], rax STR X0, [X1]
Store byte mov byte [rbx], al STRB W0, [X1]
Add regs add rax, rbx ADD X0, X0, X1
Add immediate add rax, 42 ADD X0, X0, #42
Subtract sub rax, rbx SUB X0, X0, X1
Multiply imul rax, rbx MUL X0, X0, X1
Multiply-add (no direct equiv) MADD X0, X1, X2, X3
Divide (signed) idiv rbx → rax,rdx SDIV X0, X0, X1
MSUB X2, X0, X1, X3 (remain.)
AND and rax, rbx AND X0, X0, X1
OR or rax, rbx ORR X0, X0, X1
XOR xor rax, rbx EOR X0, X0, X1
NOT not rax MVN X0, X0
Shift left shl rax, 3 LSL X0, X0, #3
Shift right (u) shr rax, 3 LSR X0, X0, #3
Shift right (s) sar rax, 3 ASR X0, X0, #3
Compare cmp rax, rbx CMP X0, X1
Test bits test rax, rbx TST X0, X1
Jump jmp label B label
Call call label BL label
Return ret RET
Cond branch (eq) je label B.EQ label
Cond branch (ne) jne label B.NE label
Cond branch (<s) jl label B.LT label
Cond branch (>s) jg label B.GT label
Push push rax STR X0, [SP, #-8]! (no PUSH)
Pop pop rax LDR X0, [SP], #8 (no POP)
Push pair (2 push instrs) STP X0, X1, [SP, #-16]!
System call syscall SVC #0
Branch if zero (test rax,rax; jz) CBZ X0, label
Branch if nonzero (test rax,rax; jnz) CBNZ X0, label
Conditional move cmove rax, rbx CSEL X0, X1, X2, EQ
───────────────────────────────────────────────────────────────────────────
17.8 Translating a C Function to ARM64
Let's translate a full C function to see AAPCS64 in action:
// C source
int64_t sum_array(const int64_t *arr, int n) {
int64_t total = 0;
for (int i = 0; i < n; i++) {
total += arr[i];
}
return total;
}
ARM64 assembly:
// sum_array(arr, n): sum of n int64_t elements
// X0 = arr (pointer), W1 = n (int)
// Returns: X0 = total
.global sum_array
sum_array:
    // Leaf function: it calls nothing, so there is no need to save X30 or
    // build a frame, and the caller-saved temporaries (X2-X7, X9-X15) are
    // free to use without saving anything.
MOV X2, XZR // X2 = total = 0
MOV W3, WZR // W3 = i = 0
.loop:
CMP W3, W1 // compare i to n
B.GE .done // if i >= n, exit loop
LDR X5, [X0, X3, LSL #3] // X5 = arr[i] (arr + i*8)
// X3 holds i as int (zero-extended to 64)
ADD X2, X2, X5 // total += arr[i]
ADD W3, W3, #1 // i++
B .loop
.done:
MOV X0, X2 // return value = total
RET
One subtlety: the load indexes with X3, but we only ever write W3. That happens to work because writing a W register zeroes the upper 32 bits of the corresponding X register. The explicit, idiomatic form puts an extension modifier on the W register itself:
LDR X5, [X0, W3, UXTW #3] // X5 = arr[i]: arr + (uint64_t)W3 * 8
// UXTW: zero-extend W3 to 64 bits, then shift
Cleaned-up, correct version:
// sum_array(arr, n)
// X0 = arr, W1 = n
.global sum_array
sum_array:
MOV X2, XZR // total = 0
MOV W3, WZR // i = 0
B .check // check before first iteration
.loop:
LDR X4, [X0, W3, UXTW #3] // X4 = arr[i] (UXTW: W3 zero-extended * 8)
ADD X2, X2, X4 // total += arr[i]
ADD W3, W3, #1 // i++
.check:
CMP W3, W1 // i < n?
B.LT .loop // if yes, continue
MOV X0, X2 // return total
RET
Register trace (arr=[10, 20, 30], n=3):
| Step | Instruction | X0(arr) | W1(n) | X2(total) | W3(i) | X4(arr[i]) |
|---|---|---|---|---|---|---|
| init | MOV X2, XZR | 0x1000 | 3 | 0 | 0 | ? |
| init | MOV W3, WZR | 0x1000 | 3 | 0 | 0 | ? |
| 1st | CMP W3,W1 | 0x1000 | 3 | 0 | 0 | ? |
| 1st | B.LT .loop (taken) | 0x1000 | 3 | 0 | 0 | ? |
| 1st | LDR X4,[X0,W3,UXTW #3] | 0x1000 | 3 | 0 | 0 | 10 |
| 1st | ADD X2,X2,X4 | 0x1000 | 3 | 10 | 0 | 10 |
| 1st | ADD W3,W3,#1 | 0x1000 | 3 | 10 | 1 | 10 |
| 2nd | LDR X4 (i=1) | 0x1000 | 3 | 10 | 1 | 20 |
| 2nd | ADD X2 | 0x1000 | 3 | 30 | 1 | 20 |
| 2nd | ADD W3 | 0x1000 | 3 | 30 | 2 | 20 |
| 3rd | LDR X4 (i=2) | 0x1000 | 3 | 30 | 2 | 30 |
| 3rd | ADD X2 | 0x1000 | 3 | 60 | 2 | 30 |
| 3rd | ADD W3 | 0x1000 | 3 | 60 | 3 | 30 |
| exit | CMP W3,W1 (3 >= 3) | 0x1000 | 3 | 60 | 3 | 30 |
| exit | B.LT not taken | 0x1000 | 3 | 60 | 3 | 30 |
| exit | MOV X0,X2 | 60 | 3 | 60 | 3 | 30 |
| exit | RET | 60 | 3 | 60 | 3 | 30 |
🔄 Check Your Understanding:
1. What does LDR X0, [X1, X2, LSL #3] compute for the load address?
2. What is the difference between LDR X0, [X1, #8]! and LDR X0, [X1], #8?
3. Why does the ARM64 calling convention save X30 in the prologue?
4. What is the ARM64 equivalent of x86-64's PUSH rax?
5. In AAPCS64, are X9-X15 caller-saved or callee-saved?
Summary
ARM64's instruction set is regular and consistent. ALU instructions take registers (and an optional shifted register), do one thing, and produce one result. Memory instructions are separate and provide rich addressing modes: base+offset, base+register (with optional shift), pre-indexed, and post-indexed. LDP/STP provide paired access — two registers in one instruction — which makes prologue/epilogue efficient.
The AAPCS64 calling convention gives you 8 argument registers (vs. 6 in x86-64 System V), 10 callee-saved registers (X19-X28), and a stack that must be 16-byte aligned at call time. The canonical prologue STP X29, X30, [SP, #-16]! / MOV X29, SP is the pattern you'll see in every compiled function.