
Chapter 17: ARM64 Instruction Set

The Working Vocabulary

Chapter 16 gave you the philosophy. This chapter gives you the instructions. By the end of this chapter, you'll be able to read any ARM64 assembly listing, understand what's happening, and write ARM64 programs from scratch.

ARM64's instruction set is large but internally regular. Unlike x86-64, where ADD, IMUL, DIV, PUSH, CALL, and REP MOVS all follow completely different rules, ARM64 instructions mostly share the same structure: operation + destination register + source register(s) + optional modifier. Learn the pattern once; it applies everywhere.


17.1 Data Processing Instructions

Basic Arithmetic: ADD, ADDS, SUB, SUBS

ADD  Xd, Xn, Xm        // Xd = Xn + Xm (no flags)
ADD  Xd, Xn, #imm      // Xd = Xn + immediate (12-bit, optional LSL #12)
ADDS Xd, Xn, Xm        // Xd = Xn + Xm, update N/Z/C/V
SUB  Xd, Xn, Xm        // Xd = Xn - Xm
SUB  Xd, Xn, #imm      // Xd = Xn - immediate
SUBS Xd, Xn, Xm        // Xd = Xn - Xm, update flags

The immediate encoding deserves explanation. ARM64 data processing instructions can encode a 12-bit immediate (0–4095). Optionally, that immediate can be shifted left by 12 bits first (making it 0–4095 × 4096). This covers the ranges needed for most small constants and page-aligned offsets.

ADD X0, X1, #1000      // X0 = X1 + 1000
ADD X0, X1, #1, LSL #12 // X0 = X1 + 4096 (1 shifted left 12)

For carry-propagating multi-precision arithmetic:

ADC  Xd, Xn, Xm        // Xd = Xn + Xm + C (add with carry)
ADCS Xd, Xn, Xm        // Same, update flags
SBC  Xd, Xn, Xm        // Xd = Xn - Xm - ~C (subtract with borrow)
SBCS Xd, Xn, Xm        // Same, update flags

128-bit addition example:

// Add two 128-bit numbers: (X1:X0) + (X3:X2) → (X5:X4)
ADDS X4, X0, X2        // Low 64 bits, set C on carry
ADC  X5, X1, X3        // High 64 bits + carry from low

Logic: AND, ORR, EOR, BIC, ORN, EON

AND  Xd, Xn, Xm        // Xd = Xn & Xm
ANDS Xd, Xn, Xm        // Same, update flags (TST = ANDS XZR, Xn, Xm)
ORR  Xd, Xn, Xm        // Xd = Xn | Xm
EOR  Xd, Xn, Xm        // Xd = Xn ^ Xm (XOR)
BIC  Xd, Xn, Xm        // Xd = Xn & ~Xm (bit clear = AND NOT)
BICS Xd, Xn, Xm        // Same, update flags
ORN  Xd, Xn, Xm        // Xd = Xn | ~Xm (OR NOT)
EON  Xd, Xn, Xm        // Xd = Xn ^ ~Xm (XOR NOT = XNOR)

Logical immediates (AND/ORR/EOR/ANDS with immediate) use a special encoding: the immediate must be a repeating bit pattern (a rotated run of ones, replicated across the register). Not every constant is expressible; the assembler rejects unencodable immediates, and you must build those in a register with MOVZ/MOVK instead. Compilers handle this choice automatically.

AND  X0, X0, #0xFF     // X0 &= 0xFF (extract low byte)
ORR  X0, X0, #0x80     // X0 |= 0x80 (set high bit of byte)
EOR  X0, X0, #0xFF     // X0 ^= 0xFF (flip low byte)

Multiply

MUL   Xd, Xn, Xm       // Xd = Xn * Xm (lower 64 bits)
                        // Pseudoinstruction for: MADD Xd, Xn, Xm, XZR
MADD  Xd, Xn, Xm, Xa   // Xd = Xa + Xn*Xm (multiply-accumulate)
MSUB  Xd, Xn, Xm, Xa   // Xd = Xa - Xn*Xm (multiply-subtract)
MNEG  Xd, Xn, Xm       // Xd = -(Xn*Xm)   (MSUB Xd, Xn, Xm, XZR)

// High-half multiply (for 128-bit results)
SMULH Xd, Xn, Xm       // Xd = upper 64 bits of Xn*Xm (signed)
UMULH Xd, Xn, Xm       // Xd = upper 64 bits of Xn*Xm (unsigned)

Getting a full 128-bit multiply result:

// X2:X0 = X0 * X1 (128-bit result)
UMULH X2, X0, X1       // X2 = high 64 bits
MUL   X0, X0, X1       // X0 = low 64 bits

⚠️ Common Mistake: ARM64 has no implicit 128-bit multiply like x86-64's MUL rax (which puts the full product in RDX:RAX). You must use SMULH/UMULH explicitly for the high half.

📊 C Comparison: In C, (uint64_t)a * b gives the low 64 bits; you need __uint128_t multiplication or compiler builtins to get the high bits. ARM64's UMULH is the hardware instruction that makes those builtins fast.

Division

SDIV  Xd, Xn, Xm       // Xd = Xn / Xm (signed integer division)
UDIV  Xd, Xn, Xm       // Xd = Xn / Xm (unsigned integer division)

Critical differences from x86-64:

1. ARM64 division stores ONLY the quotient in Xd — there is no remainder register.
2. To get the remainder, compute remainder = dividend - (quotient * divisor).
3. Division by zero does NOT trap — the architecture defines the result as 0.

// C: q = a / b; r = a % b;
// X0 = a, X1 = b
SDIV  X2, X0, X1       // X2 = a / b (quotient)
MSUB  X3, X2, X1, X0   // X3 = X0 - X2*X1 = a - (a/b)*b = a % b (remainder)

🔍 Under the Hood: x86-64's IDIV produces both quotient and remainder in one instruction (RAX and RDX respectively). ARM64 requires SDIV + MSUB. This looks less efficient, but modern ARM64 CPUs have fast hardware dividers, and the compiler often optimizes integer division by constants into multiply-shift sequences that don't use SDIV at all.

Move Instructions

MOV  Xd, Xn            // Xd = Xn (ORR Xd, XZR, Xn)
MOV  Xd, #imm16        // Xd = 16-bit immediate (MOVZ)
MOVZ Xd, #imm16        // Xd = imm16 (zero other bits)
MOVZ Xd, #imm16, LSL #16  // Xd = imm16 << 16
MOVK Xd, #imm16, LSL #32  // Xd[47:32] = imm16 (keep other bits)
MOVN Xd, #imm16        // Xd = ~imm16 (move NOT)

Loading a 64-bit constant requires multiple instructions:

// Load 0xDEADBEEFCAFEBABE into X0
MOVZ X0, #0xBABE               // X0 = 0x000000000000BABE
MOVK X0, #0xCAFE, LSL #16      // X0 = 0x00000000CAFEBABE
MOVK X0, #0xBEEF, LSL #32      // X0 = 0x0000BEEFCAFEBABE
MOVK X0, #0xDEAD, LSL #48      // X0 = 0xDEADBEEFCAFEBABE

Shifts and Rotates (Inline)

ARM64's barrel shifter can apply a shift to the second source operand of any data processing instruction. This is an extremely powerful feature with no general x86-64 equivalent (LEA covers only a few special cases).

ADD X0, X1, X2, LSL #3    // X0 = X1 + (X2 << 3) = X1 + X2*8
ADD X0, X1, X2, LSR #2    // X0 = X1 + (X2 >> 2) (logical)
ADD X0, X1, X2, ASR #1    // X0 = X1 + (X2 >> 1) (arithmetic)
ORR X0, X1, X2, ROR #4    // X0 = X1 | rotate_right(X2, 4)

Shift types: - LSL #n — Logical Shift Left (fill with zeros) - LSR #n — Logical Shift Right (fill with zeros — unsigned right shift) - ASR #n — Arithmetic Shift Right (fill with sign bit — signed right shift) - ROR #n — ROtate Right

As standalone instructions:

LSL  Xd, Xn, #n         // Xd = Xn << n
LSL  Xd, Xn, Xm         // Xd = Xn << (Xm & 63)
LSR  Xd, Xn, #n         // Xd = Xn >> n (unsigned)
ASR  Xd, Xn, #n         // Xd = Xn >> n (signed)
ROR  Xd, Xn, #n         // Xd = rotate_right(Xn, n)
ROR  Xd, Xn, Xm         // Xd = rotate_right(Xn, Xm & 63)

⚡ Performance Note: ADD X0, X1, X2, LSL #3 is a single instruction that computes array index arithmetic: base + index * 8. In x86-64, this requires LEA rax, [rbx + rcx*8] — same instruction count, but ARM64's approach generalizes to any power-of-2 scale factor embedded in any ALU instruction, not just LEA.

Bit Field Operations

UBFX Xd, Xn, #lsb, #width  // Extract unsigned bit field
SBFX Xd, Xn, #lsb, #width  // Extract signed bit field (sign extend)
BFI  Xd, Xn, #lsb, #width  // Insert bit field into Xd from Xn
BFXIL Xd, Xn, #lsb, #width // Extract and insert at low end

// Example: extract bits [11:8] (4 bits) from X1
UBFX X0, X1, #8, #4         // X0 = (X1 >> 8) & 0xF

17.2 Comparison and Flag Instructions

CMP  Xn, Xm         // SUBS XZR, Xn, Xm  — compare (set flags)
CMP  Xn, #imm       // SUBS XZR, Xn, #imm
CMN  Xn, Xm         // ADDS XZR, Xn, Xm  — compare negative
TST  Xn, Xm         // ANDS XZR, Xn, Xm  — test bits

ARM64 does NOT have TEQ (test equivalent/XOR). If you need it, use EOR and check the result.


17.3 Memory Instructions: Load and Store

This is where ARM64 programming gets rich. The addressing modes are varied but consistent.

Basic Load and Store

LDR  Xd, [Xn]           // Xd = Memory[Xn]   (load 64-bit)
STR  Xd, [Xn]           // Memory[Xn] = Xd   (store 64-bit)
LDR  Wd, [Xn]           // Wd = Memory[Xn]   (load 32-bit, zero-extend)
STR  Wd, [Xn]           // Memory[Xn] = Wd   (store 32-bit)

Offset Addressing

LDR  Xd, [Xn, #imm]     // Xd = Memory[Xn + imm]    (immediate offset)
STR  Xd, [Xn, #imm]     // Memory[Xn + imm] = Xd
LDR  Xd, [Xn, Xm]       // Xd = Memory[Xn + Xm]     (register offset)
LDR  Xd, [Xn, Xm, LSL #3] // Xd = Memory[Xn + Xm*8]  (scaled register)

The immediate offset in LDR Xd, [Xn, #imm] is a 12-bit unsigned value scaled by the access size. For 64-bit loads, that's 0–32760 in steps of 8. For 32-bit loads, 0–16380 in steps of 4.

// Accessing a struct:
// struct { int64_t x; int64_t y; int64_t z; } at X0
LDR X1, [X0, #0]         // X1 = s.x
LDR X2, [X0, #8]         // X2 = s.y
LDR X3, [X0, #16]        // X3 = s.z

Pre-indexed and Post-indexed Addressing

Pre-indexed: update the base register BEFORE the access (the ! suffix):

LDR  Xd, [Xn, #imm]!    // Xn += imm, then Xd = Memory[Xn]
STR  Xd, [Xn, #imm]!    // Xn += imm, then Memory[Xn] = Xd

Post-indexed: access first, then update the base register:

LDR  Xd, [Xn], #imm     // Xd = Memory[Xn], then Xn += imm
STR  Xd, [Xn], #imm     // Memory[Xn] = Xd, then Xn += imm

These are essential for array traversal and stack operations:

// Walk an array of 8-byte elements
MOV  X1, X0             // X1 = array pointer
.loop:
    LDR  X2, [X1], #8   // X2 = *X1; X1 += 8 (advance to next element)
    // ... process X2 ...
    CBZ  X2, .done      // if X2 == 0, exit (null-terminated array example)
    B    .loop
.done:

💡 Mental Model: Pre-indexed [Xn, #imm]! is "move then access." Post-indexed [Xn], #imm is "access then move." The ! means the base register is updated (written back). This is ARM's equivalent of C's *p++ (post-increment) and *++p (pre-increment).

Load/Store Pair: LDP and STP

One of ARM64's most useful instructions. Load or store two registers simultaneously:

LDP  Xd1, Xd2, [Xn]         // Xd1 = Memory[Xn]; Xd2 = Memory[Xn+8]
STP  Xn1, Xn2, [Xd]         // Memory[Xd] = Xn1; Memory[Xd+8] = Xn2

// Pre/post-indexed versions:
STP  X29, X30, [SP, #-16]!  // SP -= 16; store pair (canonical prologue)
LDP  X29, X30, [SP], #16    // load pair; SP += 16 (canonical epilogue)

LDP/STP halves the number of instructions needed for saving/restoring registers:

// Save 4 callee-saved registers (typical function entry)
STP  X19, X20, [SP, #-32]!  // save X19 and X20, advance SP
STP  X21, X22, [SP, #16]    // save X21 and X22 at SP+16
// ... function body ...
LDP  X21, X22, [SP, #16]    // restore X21 and X22
LDP  X19, X20, [SP], #32    // restore X19 and X20, reclaim stack

Sized Loads: Byte, Halfword, Word, Doubleword

LDRB  Wd, [Xn]     // Load byte, zero-extend to 32 bits
LDRH  Wd, [Xn]     // Load halfword (16-bit), zero-extend
LDRSB Wd, [Xn]     // Load byte, sign-extend to 32 bits
LDRSH Wd, [Xn]     // Load halfword, sign-extend to 32 bits
LDRSB Xd, [Xn]     // Load byte, sign-extend to 64 bits
LDRSH Xd, [Xn]     // Load halfword, sign-extend to 64 bits
LDRSW Xd, [Xn]     // Load word (32-bit), sign-extend to 64 bits

STRB  Wd, [Xn]     // Store low byte of Wd
STRH  Wd, [Xn]     // Store low halfword of Wd

⚠️ Common Mistake: LDR Wd, [Xn] loads 32 bits and zero-extends into Xd. LDRSW Xd, [Xn] loads 32 bits and sign-extends into Xd. Use LDRSW when working with int arrays that contain negative numbers and you need to use the result in 64-bit arithmetic.


17.4 Branch Instructions

Unconditional Branch

B   label           // PC = label (26-bit PC-relative offset, ±128MB range)
BL  label           // X30 = PC+4; PC = label (call: saves return address)
BR  Xn              // PC = Xn (indirect jump — register contains address)
BLR Xn              // X30 = PC+4; PC = Xn (indirect call)
RET                 // PC = X30 (return — BR X30)
RET Xn              // PC = Xn (return via specific register)

Conditional Branch

B.EQ label          // Branch if Z=1 (equal)
B.NE label          // Branch if Z=0 (not equal)
B.LT label          // Branch if signed less than (N≠V)
B.LE label          // Branch if signed less or equal (Z=1 or N≠V)
B.GT label          // Branch if signed greater than (Z=0 and N=V)
B.GE label          // Branch if signed greater or equal (N=V)
B.LO label          // Branch if unsigned lower (C=0)
B.LS label          // Branch if unsigned lower or same (C=0 or Z=1)
B.HI label          // Branch if unsigned higher (C=1 and Z=0)
B.HS label          // Branch if unsigned higher or same (C=1)
B.MI label          // Branch if minus (N=1)
B.PL label          // Branch if plus/zero (N=0)
B.VS label          // Branch if overflow (V=1)
B.VC label          // Branch if no overflow (V=0)

Conditional branches have a ±1MB range (19-bit PC-relative offset). For longer jumps, invert the condition and branch over an unconditional B (which has ±128MB range), or use an indirect branch through a register.

Compare and Branch: CBZ, CBNZ

Compare register to zero and branch — no flag update:

CBZ  Xn, label      // Branch if Xn == 0
CBNZ Xn, label      // Branch if Xn != 0
CBZ  Wn, label      // 32-bit version (checks W register)

This avoids needing a separate CMP + B.EQ for the extremely common "is this zero?" test:

// Loop until X0 is zero
.loop:
    // ... work using X0 ...
    SUBS X0, X0, #1      // decrement counter, set flags
    CBNZ X0, .loop       // if X0 != 0, continue
    // (CBNZ tests the register directly; the flags from SUBS aren't needed)

Test Bit and Branch: TBZ, TBNZ

TBZ  Xn, #bit, label    // Branch if bit N of Xn is 0
TBNZ Xn, #bit, label    // Branch if bit N of Xn is non-zero

Useful for flag-bit testing without TST + B.EQ:

TBNZ X0, #0, .is_odd    // Branch if bit 0 of X0 is set (X0 is odd)
TBZ  X0, #63, .positive // Branch if bit 63 is clear (X0 is positive int64)

Range: ±32KB (14-bit PC-relative offset).


17.5 AAPCS64 Calling Convention: Full Details

The ARM Procedure Call Standard for AArch64 (AAPCS64) defines how functions talk to each other. You cannot deviate from this if you want your assembly to work with C code.

Argument Passing

AAPCS64 Argument Passing Rules
┌──────────────────────────────────────────────────────────────────────────┐
│ Integer/Pointer Arguments:                                               │
│   Arguments 1-8: X0-X7 (64-bit) or W0-W7 (32-bit)                        │
│   Arguments 9+:  Placed on the stack, left to right, in 8-byte slots     │
│                                                                          │
│ Floating-Point Arguments:                                                │
│   Arguments 1-8: D0-D7 (double) or S0-S7 (float)                         │
│   Arguments 9+:  Placed on the stack                                     │
│                                                                          │
│ Return Values:                                                           │
│   Integer/Pointer: X0 (64-bit) or W0 (32-bit)                            │
│   Two 64-bit values: X0, X1                                              │
│   Floating-point: D0 (double) or S0 (float)                              │
│   Large structs: caller allocates, passes pointer in X8                  │
└──────────────────────────────────────────────────────────────────────────┘

Caller-Saved vs. Callee-Saved

Caller-Saved (may be clobbered by called function — caller must save if needed):
  X0-X18 (X8 = indirect result register; X16/X17 = intra-call scratch IP0/IP1;
  X18 = platform register — treat as reserved on OSes that use it)

Callee-Saved (must be preserved by the called function):
  X19-X28, X29 (FP), X30 (LR), SP

Note on X30 (LR): technically caller-saved, but the callee must preserve it
  if it makes any function calls (BL overwrites X30).
  Convention: non-leaf functions always save X30 via STP.

Stack Alignment

The SP must be 16-byte aligned at the point of any function call. This is architectural: with stack alignment checking enabled (as Linux enables it), using a misaligned SP as the base of any load or store causes an alignment fault.

The 16-byte alignment rule means you always allocate stack space in multiples of 16:

// Bad: SP becomes 8-byte aligned (not 16)
SUB SP, SP, #8         // allocate 8 bytes
BL  some_function      // fault or undefined behavior!

// Good: SP stays 16-byte aligned
SUB SP, SP, #16        // allocate 16 bytes (minimum allocation unit)
BL  some_function

The Canonical Function Frame

// Standard ARM64 function that saves callee-saved registers
my_function:
    // Prologue: save FP and LR, allocate the rest of the frame
    STP  X29, X30, [SP, #-48]!  // 48 bytes: 16 for FP/LR + 32 for X19-X22
    MOV  X29, SP                 // FP = SP (point to frame)

    // Optional: save callee-saved registers if used
    STP  X19, X20, [SP, #16]    // save X19, X20 at SP+16
    STP  X21, X22, [SP, #32]    // save X21, X22 at SP+32

    // Local variables would need extra frame space; they are addressed
    // relative to SP or FP:
    STR  X0,  [SP, #?]          // or [X29, #?]
    LDR  X1,  [SP, #?]

    // ... function body ...

    // Epilogue: restore and return
    LDP  X21, X22, [SP, #32]
    LDP  X19, X20, [SP, #16]
    LDP  X29, X30, [SP], #48    // restore FP, LR; SP += 48
    RET

Stack frame layout:

High address
┌─────────────────────────────────┐
│  Caller's frame                 │
├─────────────────────────────────┤ ← SP on function entry (no return
│  [SP+40] Saved X22              │    address slot — the return address
│  [SP+32] Saved X21              │    is in X30/LR, not on the stack)
│  [SP+24] Saved X20              │
│  [SP+16] Saved X19              │
│  [SP+8]  Saved X30 (return addr)│
│  [SP+0]  Saved X29 (old FP)     │
└─────────────────────────────────┘ ← SP after STP X29, X30, [SP, #-48]!
Low address

(A larger initial allocation would leave room for local variables
above the saved registers, still addressed as SP or FP offsets.)

🔍 Under the Hood: x86-64's CALL instruction pushes an 8-byte return address on the stack, leaving RSP misaligned by 8 at function entry (you need a push rbp or sub rsp, 8 to restore 16-byte alignment). ARM64's BL doesn't push anything — the return address goes in X30 — so SP is still 16-byte aligned on function entry. The prologue STP X29, X30, [SP, #-16]! allocates 16 bytes and stores two 8-byte values, maintaining alignment.


17.6 ARM64 Linux System Calls

System Call Mechanism

// Linux ARM64 system call convention:
// X8  = syscall number
// X0  = argument 1
// X1  = argument 2
// X2  = argument 3
// X3  = argument 4
// X4  = argument 5
// X5  = argument 6
// SVC #0 — invoke kernel
// X0  = return value (negative errno on error)

Common Linux ARM64 Syscall Numbers

Linux ARM64 Syscall Reference (selected)
┌─────────┬───────────┬──────────────────────────────────────────────────┐
│ Number  │ Name      │ Signature                                         │
├─────────┼───────────┼──────────────────────────────────────────────────┤
│ 0       │ io_setup  │ -                                                 │
│ 3       │ io_cancel │ -                                                 │
│ 56      │ openat    │ (dirfd, pathname, flags, mode) → fd               │
│ 57      │ close     │ (fd) → 0                                          │
│ 63      │ read      │ (fd, buf, count) → bytes                          │
│ 64      │ write     │ (fd, buf, count) → bytes                          │
│ 93      │ exit      │ (status) → (no return)                            │
│ 94      │ exit_group│ (status) → (no return)                            │
│ 172     │ getpid    │ () → pid                                          │
│ 174     │ getuid    │ () → uid                                          │
│ 220     │ clone     │ (flags, stack, ...) → pid                         │
│ 221     │ execve    │ (path, argv, envp) → (no return on success)       │
│ 222     │ mmap      │ (addr, len, prot, flags, fd, offset) → addr       │
│ 226     │ mprotect  │ (addr, len, prot) → 0                             │
│ 233     │ madvise   │ (addr, len, advice) → 0                           │
└─────────┴───────────┴──────────────────────────────────────────────────┘
Note: Linux ARM64 uses the generic syscall table (not x86-derived).
write(fd, buf, count) is syscall 64, not 1 as on x86-64.
openat is used instead of open (syscall 56, not 2).

Complete File I/O Example

// write_file.s — Write "Hello" to a file
// Uses: openat, write, close, exit

.section .rodata
filename:   .asciz "output.txt"
message:    .ascii "Hello, file!\n"
msg_len     = . - message

// O_WRONLY|O_CREAT|O_TRUNC = 0x241 on Linux
O_WRONLY    = 1
O_CREAT     = 64        // 0x40
O_TRUNC     = 512       // 0x200
O_FLAGS     = O_WRONLY | O_CREAT | O_TRUNC   // = 577 = 0x241

.section .text
.global _start
_start:
    // === openat(AT_FDCWD, filename, O_WRONLY|O_CREAT|O_TRUNC, 0644) ===
    MOV  X8, #56          // syscall: openat
    MOV  X0, #-100        // AT_FDCWD = -100 (relative to current dir)
    ADR  X1, filename     // pathname
    MOV  X2, #O_FLAGS     // flags = 0x241 = 577 (fits a 16-bit MOV immediate)
    MOV  X3, #0644        // mode: GAS reads 0644 as octal = 420 decimal
    SVC  #0               // fd = openat(...)
    // X0 = file descriptor (or negative errno)
    MOV  X19, X0          // save fd in callee-saved X19

    // === write(fd, message, msg_len) ===
    MOV  X8, #64          // syscall: write
    MOV  X0, X19          // fd
    ADR  X1, message      // buffer
    MOV  X2, #msg_len     // count
    SVC  #0               // write(...)

    // === close(fd) ===
    MOV  X8, #57          // syscall: close
    MOV  X0, X19          // fd
    SVC  #0

    // === exit(0) ===
    MOV  X8, #93          // syscall: exit
    MOV  X0, #0
    SVC  #0

17.7 Side-by-Side: x86-64 vs. ARM64 Instruction Comparison

x86-64 / ARM64 Instruction Reference
═══════════════════════════════════════════════════════════════════════════
Operation          x86-64                   ARM64
───────────────────────────────────────────────────────────────────────────
Move reg→reg       mov  rax, rbx            MOV X0, X1
Move immediate     mov  rax, 42             MOV X0, #42
Load 64-bit        mov  rax, [rbx]          LDR X0, [X1]
Load 32-bit        mov  eax, [rbx]          LDR W0, [X1]
Load byte          movzx eax, byte [rbx]    LDRB W0, [X1]
Store 64-bit       mov  [rbx], rax          STR X0, [X1]
Store byte         mov  byte [rbx], al      STRB W0, [X1]
Add regs           add  rax, rbx            ADD X0, X0, X1
Add immediate      add  rax, 42             ADD X0, X0, #42
Subtract           sub  rax, rbx            SUB X0, X0, X1
Multiply           imul rax, rbx            MUL X0, X0, X1
Multiply-add       (no direct equiv)        MADD X0, X1, X2, X3
Divide (signed)    idiv rbx → rax,rdx       SDIV X2, X0, X1 (quotient)
                                            MSUB X3, X2, X1, X0 (remainder)
AND                and  rax, rbx            AND X0, X0, X1
OR                 or   rax, rbx            ORR X0, X0, X1
XOR                xor  rax, rbx            EOR X0, X0, X1
NOT                not  rax                 MVN X0, X0
Shift left         shl  rax, 3              LSL X0, X0, #3
Shift right (u)    shr  rax, 3              LSR X0, X0, #3
Shift right (s)    sar  rax, 3              ASR X0, X0, #3
Compare            cmp  rax, rbx            CMP X0, X1
Test bits          test rax, rbx            TST X0, X1
Jump               jmp  label               B label
Call               call label               BL label
Return             ret                      RET
Cond branch (eq)   je   label               B.EQ label
Cond branch (ne)   jne  label               B.NE label
Cond branch (<s)   jl   label               B.LT label
Cond branch (>s)   jg   label               B.GT label
Push               push rax                 STR X0, [SP, #-16]! (no PUSH;
                                            use 16, not 8: SP stays aligned)
Pop                pop  rax                 LDR X0, [SP], #16   (no POP)
Push pair          (2 push instrs)          STP X0, X1, [SP, #-16]!
System call        syscall                  SVC #0
Branch if zero     (test rax,rax; jz)       CBZ X0, label
Branch if nonzero  (test rax,rax; jnz)      CBNZ X0, label
Conditional move   cmove rax, rbx           CSEL X0, X1, X2, EQ
───────────────────────────────────────────────────────────────────────────

17.8 Translating a C Function to ARM64

Let's translate a full C function to see AAPCS64 in action:

// C source
int64_t sum_array(const int64_t *arr, int n) {
    int64_t total = 0;
    for (int i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

ARM64 assembly:

// sum_array(arr, n): sum of n int64_t elements
// X0 = arr (pointer), W1 = n (int)
// Returns: X0 = total

.global sum_array
sum_array:
    // Leaf function: no calls, so X30 need not be saved, and the
    // caller-saved temporaries X2-X5 are free to use — no prologue at all.
    MOV  X2, XZR                  // X2 = total = 0
    MOV  W3, WZR                  // W3 = i = 0

.loop:
    CMP  W3, W1                   // compare i to n
    B.GE .done                    // if i >= n, exit loop

    LDR  X5, [X0, X3, LSL #3]     // X5 = arr[i] (arr + i*8)
    ADD  X2, X2, X5               // total += arr[i]
    ADD  W3, W3, #1               // i++
    B    .loop

.done:
    MOV  X0, X2                   // return value = total
    RET

One subtlety: [X0, X3, LSL #3] reads all 64 bits of X3, and this works only because writing W3 zero-extends into X3, clearing the upper 32 bits. To index with a 32-bit register explicitly, ARM64 provides extension modifiers:

LDR  X5, [X0, W3, UXTW #3]    // X5 = arr[i]: arr + (uint64_t)W3 * 8
                                // UXTW: zero-extend W3 to 64 bits, then shift

The version with the explicit extension, restructured to test the condition before the first iteration:

// sum_array(arr, n)
// X0 = arr, W1 = n
.global sum_array
sum_array:
    MOV  X2, XZR              // total = 0
    MOV  W3, WZR              // i = 0
    B    .check               // check before first iteration
.loop:
    LDR  X4, [X0, W3, UXTW #3]  // X4 = arr[i]  (UXTW: W3 zero-extended * 8)
    ADD  X2, X2, X4           // total += arr[i]
    ADD  W3, W3, #1           // i++
.check:
    CMP  W3, W1               // i < n?
    B.LT .loop                // if yes, continue
    MOV  X0, X2               // return total
    RET

Register trace (arr=[10, 20, 30], n=3):

Step  Instruction                  X0(arr)  W1(n)  X2(total)  W3(i)  X4(arr[i])
init  MOV X2, XZR                  0x1000   3      0          0      ?
init  MOV W3, WZR                  0x1000   3      0          0      ?
1st   CMP W3, W1; B.LT (taken)     0x1000   3      0          0      ?
1st   LDR X4, [X0, W3, UXTW #3]    0x1000   3      0          0      10
1st   ADD X2, X2, X4               0x1000   3      10         0      10
1st   ADD W3, W3, #1               0x1000   3      10         1      10
2nd   LDR X4 (i=1)                 0x1000   3      10         1      20
2nd   ADD X2                       0x1000   3      30         1      20
2nd   ADD W3                       0x1000   3      30         2      20
3rd   LDR X4 (i=2)                 0x1000   3      30         2      30
3rd   ADD X2                       0x1000   3      60         2      30
3rd   ADD W3                       0x1000   3      60         3      30
exit  CMP W3, W1 (3 >= 3), B.LT not taken
exit  MOV X0, X2 → X0 = 60
exit  RET

🔄 Check Your Understanding:

1. What does LDR X0, [X1, X2, LSL #3] compute for the load address?
2. What is the difference between LDR X0, [X1, #8]! and LDR X0, [X1], #8?
3. Why does the ARM64 calling convention save X30 in the prologue?
4. What is the ARM64 equivalent of x86-64's PUSH rax?
5. In AAPCS64, are X9-X15 caller-saved or callee-saved?


Summary

ARM64's instruction set is regular and consistent. ALU instructions take registers (and an optional shifted register), do one thing, and produce one result. Memory instructions are separate and provide rich addressing modes: base+offset, base+register (with optional shift), pre-indexed, and post-indexed. LDP/STP provide paired access — two registers in one instruction — which makes prologue/epilogue efficient.

The AAPCS64 calling convention gives you 8 argument registers (vs. 6 in x86-64 System V), 10 callee-saved registers (X19-X28), and a stack that must be 16-byte aligned at call time. The canonical prologue STP X29, X30, [SP, #-16]! / MOV X29, SP is the pattern you'll see in every compiled function.