Chapter 16: ARM64 Architecture

Open Assembly Language Project

In This Chapter

Welcome to AArch64
16.1 RISC vs. CISC: The Philosophy Difference
16.2 The ARM64 Register File
16.3 PSTATE: The Condition Flags
16.4 Fixed-Width Instructions
16.5 Load/Store Architecture
16.6 Conditional Execution and Select
16.7 Setting Up ARM64 Development
16.8 Calling Convention Preview (AAPCS64)
16.9 x86-64 vs. ARM64 Register Diagram
Summary

Welcome to AArch64

The chip inside your phone is more powerful than the computers that put astronauts on the moon. It's also ARM64. The chip inside every iPhone since the iPhone 5S is ARM64. The chip inside the Raspberry Pi 4 is ARM64. The M-series chips in Apple's computers that are beating Intel in benchmark after benchmark are ARM64.

AArch64 is the official name for the 64-bit execution state of the ARM architecture, version 8 and later. "ARM64" is what everybody calls it. "AArch64" is what the ISA reference manual calls it and what you'll see in toolchain names (aarch64-linux-gnu-as). They mean the same thing.

This chapter builds the mental model you need before writing a single line of ARM64 assembly: the register file, the instruction encoding philosophy, the condition flags, and the load/store discipline. If Part II gave you an x86-64 brain, Part III builds you a second brain with a different philosophy.

16.1 RISC vs. CISC: The Philosophy Difference

Before we look at registers and instructions, we need to understand why ARM64 looks the way it does. The philosophy shapes every design decision.

CISC: Complex Instruction Set Computer

x86-64 is CISC. The driving philosophy, born in the 1970s, was: make the machine do as much work as possible per instruction. Memory was slow and expensive; compilers were primitive; programmers wrote assembly by hand. Fewer instructions meant smaller programs and less memory fetching.

This led to instructions like:

; x86-64: move a block of memory
REP MOVSB              ; copies RCX bytes from [RSI] to [RDI] — one instruction

; x86-64: multiply memory operand by immediate, store in register
IMUL rbx, [rsp+24], 42

; x86-64: enter a stack frame
ENTER 32, 0           ; allocates locals and sets up frame pointer

; x86 (16/32-bit modes only): push all GP registers
PUSHA                 ; not encodable in 64-bit mode, but illustrates the philosophy

Variable-length encoding: x86-64 instructions are 1 to 15 bytes. The processor has to figure out where each instruction starts and how long it is before it can even decode what the instruction does. This is not a trivial problem — the x86 instruction decoder is one of the most complex components on the chip.

RISC: Reduced Instruction Set Computer

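ARM is RISC. The RISC philosophy grew out of research at IBM, Berkeley, and Stanford in the late 1970s and early 1980s, and Acorn Computers adopted it in 1983 for the processor family that would later power the Acorn Archimedes. The idea: simple, regular instructions that a chip designer can implement with minimal transistors and that a compiler can target predictably.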

The core tenets of RISC:
1. Load/Store architecture: Only LOAD and STORE instructions touch memory. Every other instruction operates on registers.
2. Fixed-width instructions: All instructions are exactly 32 bits (4 bytes) in ARM64.
3. Many registers: 31 general-purpose registers (vs. x86-64's 16).
4. Simple addressing modes: No single mode combines everything the way x86-64's [base + index*scale + displacement] does.

The consequence: the same algorithm takes more ARM64 instructions than x86-64 instructions. But each instruction is simple, predictable, and fast to decode.

Why RISC Is NOT Simpler

This is the most important thing to understand before you get frustrated:

RISC is not simpler to program. It is differently complex.

In x86-64, add rax, [rbx] loads from memory and adds in one instruction. In ARM64, you need:

LDR X2, [X1]      // Load from [X1] into X2
ADD X0, X0, X2    // Now add

More code. More instructions. But each instruction is well-defined: LDR loads, ADD adds. Nothing does both at once.

The practical benefit of RISC shows up in hardware: a CPU that decodes 4-byte instructions at fixed offsets can process 4 instructions per cycle trivially. An x86-64 decoder processing variable-length instructions is genuinely hard to build. Modern Intel CPUs devote enormous die area to instruction decoding.

The deeper irony: modern x86-64 CPUs internally convert CISC instructions into RISC-like micro-operations (µops) before execution. The CISC encoding is essentially an API maintained for backward compatibility. Inside a modern Intel or AMD core, it's RISC micro-ops all the way down.

Historical Context: Acorn to Apple

ARM stands for Advanced RISC Machine (originally Acorn RISC Machine). Designed by Sophie Wilson and Steve Furber at Acorn Computers in 1983-1985. First chip: the ARM1, fabricated in 1985.

Key moments:
- 1990: Acorn, Apple, and VLSI Technology form ARM Ltd as a separate company (Apple needed a chip for the Newton)
- 1995-2005: ARM dominates mobile and embedded: phones, PDAs, digital cameras
- 2010: iPhone 4 uses the Apple A4, an ARM Cortex-A8 derived design
- 2011: ARMv8-A is announced, introducing AArch64 (64-bit ARM)
- 2013: The Apple A7 in the iPhone 5S ships as the first 64-bit ARM phone chip
- 2018-2024: AWS Graviton and Ampere Altra bring ARM64 to data centers
- 2020: Apple M1, an ARM64 laptop/desktop CPU that outperforms contemporary Intel laptop chips in many benchmarks
- 2026: ARM is the dominant CPU architecture family by unit volume

16.2 The ARM64 Register File

ARM64 has 31 general-purpose registers, a zero register, and a stack pointer. Here is the complete picture:

ARM64 General-Purpose Register File
┌──────────────────────────────────────────────────────────────────────────┐
│  64-bit name │  32-bit name │  Role / Convention                          │
├──────────────┼──────────────┼─────────────────────────────────────────────┤
│  X0          │  W0          │  Argument 1 / Return value                  │
│  X1          │  W1          │  Argument 2 / Return value (high 64 bits)   │
│  X2          │  W2          │  Argument 3                                 │
│  X3          │  W3          │  Argument 4                                 │
│  X4          │  W4          │  Argument 5                                 │
│  X5          │  W5          │  Argument 6                                 │
│  X6          │  W6          │  Argument 7                                 │
│  X7          │  W7          │  Argument 8                                 │
├──────────────┼──────────────┼─────────────────────────────────────────────┤
│  X8          │  W8          │  Indirect result register / syscall number  │
│  X9-X15      │  W9-W15      │  Caller-saved temporaries                   │
├──────────────┼──────────────┼─────────────────────────────────────────────┤
│  X16 (IP0)   │  W16         │  Intra-procedure-call temp (linker trampoline)│
│  X17 (IP1)   │  W17         │  Intra-procedure-call temp                  │
│  X18         │  W18         │  Platform register (OS-reserved on some OSes)│
├──────────────┼──────────────┼─────────────────────────────────────────────┤
│  X19-X28     │  W19-W28     │  Callee-saved registers                     │
├──────────────┼──────────────┼─────────────────────────────────────────────┤
│  X29 (FP)    │  W29         │  Frame pointer                              │
│  X30 (LR)    │  W30         │  Link register (return address)             │
├──────────────┼──────────────┼─────────────────────────────────────────────┤
│  SP          │  WSP         │  Stack pointer (not X31)                    │
│  XZR         │  WZR         │  Zero register (always 0, writes discarded) │
├──────────────┼──────────────┼─────────────────────────────────────────────┤
│  PC          │              │  Program counter (not directly accessible)  │
└──────────────┴──────────────┴─────────────────────────────────────────────┘

X0-X30: 64-bit General-Purpose Registers

These are your working registers. Every one of them can hold a 64-bit integer, a pointer, or be used for arithmetic.

MOV X0, #42          // X0 = 42
ADD X1, X0, X0       // X1 = X0 + X0 = 84
MUL X2, X1, X0       // X2 = X1 * X0 = 3528

W0-W30: 32-bit Views of X0-X30

Every 64-bit register has a 32-bit alias. W0 is the low 32 bits of X0. W1 is the low 32 bits of X1. And so on.

The crucial behavior: writing to Wn zero-extends into Xn. If you write 32 bits to W0, the upper 32 bits of X0 are zeroed — not garbage, not sign-extended, zeroed.

MOV X0, #0xFFFFFFFFFFFFFFFF    // X0 = all ones
MOV W0, #1                     // W0 = 1, but X0 is now 0x0000000000000001
                                // Upper 32 bits are ZEROED

This is cleaner than x86-64's confusing aliasing (where EAX zeros the upper half of RAX, but AX and AL don't).

💡 Mental Model: Think of W registers as "truncating store" operations. You're not working with a "narrow" view that leaves garbage — you're doing a 32-bit operation that explicitly zeroes the high half.

⚠️ Common Mistake: Assuming Wn and Xn behave like AL/AX/EAX/RAX in x86. In x86-64, writing to EAX zeroes the high 32 bits of RAX, but writing to AX or AL does NOT zero anything above — the high bits are preserved. In ARM64, writing to a W register always zeroes the top 32 bits of the corresponding X register, making the behavior consistent.
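The difference is easy to check with a small model. The sketch below is plain Python rather than assembly, and the helper names (write_w, write_eax, write_ax, write_al) are invented for illustration. Each function takes the old 64-bit register value and the value being written, and returns the new 64-bit register value.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def write_w(x_old: int, value: int) -> int:
    """ARM64: writing Wn keeps the low 32 bits of value and zeroes bits 63:32 of Xn."""
    return value & MASK32


def write_eax(rax_old: int, value: int) -> int:
    """x86-64: writing EAX likewise zeroes bits 63:32 of RAX."""
    return value & MASK32


def write_ax(rax_old: int, value: int) -> int:
    """x86-64: writing AX replaces bits 15:0 and preserves bits 63:16 of RAX."""
    return (rax_old & ~0xFFFF & MASK64) | (value & 0xFFFF)


def write_al(rax_old: int, value: int) -> int:
    """x86-64: writing AL replaces bits 7:0 and preserves bits 63:8 of RAX."""
    return (rax_old & ~0xFF & MASK64) | (value & 0xFF)


if __name__ == "__main__":
    all_ones = MASK64
    print(hex(write_w(all_ones, 1)))   # 0x1
    print(hex(write_ax(all_ones, 1)))  # 0xffffffffffff0001
    print(hex(write_al(all_ones, 1)))  # 0xffffffffffffff01
```

Note that the old value is ignored entirely in write_w: on ARM64 a 32-bit write never depends on what was in the register before.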

SP: Stack Pointer

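The stack pointer is a separate register in ARM64. There is no X31. Register number 31 in the instruction encoding means either SP or XZR, depending on the instruction: ADD (immediate) and load/store base registers treat it as SP, while ADD (shifted register) and most other data-processing instructions treat it as XZR. The assembler picks the right meaning from the name you write, so in source code SP and XZR are always distinct.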

SP must be 16-byte aligned when making function calls. More on this in the calling convention.

XZR/WZR: The Zero Register

This is one of ARM64's most elegant features: a register that always reads as zero, and silently discards any writes.

ADD X0, X1, XZR      // X0 = X1 + 0 = X1 (effectively MOV X0, X1)
SUBS XZR, X0, X1     // Compute X0 - X1, set flags, discard result
                     // This is what CMP X0, X1 assembles to

The CMP instruction in ARM64 is a pseudoinstruction that assembles to SUBS XZR, Xn, Xm — subtract, set flags, throw away the result.

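Having an architectural zero register means "compare", "test", and "move" need no opcodes of their own, and "compare to zero" needs no special case in the decoder: they all fall out of ordinary instructions whose source or destination is XZR. That is why ARM64 has no dedicated register-to-register MOV encoding.

MOV Xd, Xn still exists as an assembler alias: it assembles to ORR Xd, XZR, Xn. (When SP is involved, where register 31 cannot mean XZR, the assembler uses ADD Xd, Xn, #0 instead.)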
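To make the zero-register idea concrete, here is a toy register file in Python. It is a sketch under simplifying assumptions: the class and function names are invented, index 31 always stands for XZR, and the SP-versus-XZR distinction that real encodings make is ignored.

```python
MASK64 = (1 << 64) - 1


class RegFile:
    """Toy model: indices 0-30 are X0-X30, index 31 stands for XZR."""

    XZR = 31

    def __init__(self):
        self._x = [0] * 31

    def read(self, n: int) -> int:
        # XZR always reads as zero, whatever was "written" to it.
        return 0 if n == self.XZR else self._x[n]

    def write(self, n: int, value: int) -> None:
        # Writes to XZR are silently discarded.
        if n != self.XZR:
            self._x[n] = value & MASK64


def orr(rf: RegFile, d: int, n: int, m: int) -> None:
    """ORR Xd, Xn, Xm"""
    rf.write(d, rf.read(n) | rf.read(m))


def mov(rf: RegFile, d: int, m: int) -> None:
    """MOV Xd, Xm modelled as its alias ORR Xd, XZR, Xm."""
    orr(rf, d, RegFile.XZR, m)


if __name__ == "__main__":
    rf = RegFile()
    rf.write(1, 7)
    mov(rf, 0, 1)
    rf.write(RegFile.XZR, 99)
    print(rf.read(0), rf.read(RegFile.XZR))  # 7 0
```

No special "move" logic is needed: OR-ing a value with zero copies it, and the zero comes for free from the register file.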

PC: Program Counter

The program counter is not directly accessible as a general-purpose register in ARM64 user-mode code. You cannot MOV X0, PC. However:

PC-relative addressing exists: ADR X0, label loads the address of label into X0
ADRP X0, label loads a page-aligned PC-relative address (used for accessing globals)
Branch instructions implicitly modify PC
LR (X30) contains the return address after a BL instruction
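ADRP is usually paired with an ADD that supplies the low 12 bits of the address. The Python sketch below shows the arithmetic a linker performs for that pair. The function names and the example addresses are made up for illustration. It assumes 4 KB pages, which is what ADRP uses regardless of the OS page size.

```python
PAGE = 4096


def adrp_add_split(pc: int, target: int):
    """Return (page_delta, lo12) for the pair
         ADRP Xd, target      ; Xd = page_of(pc) + page_delta * 4096
         ADD  Xd, Xd, #lo12   ; Xd = target
    """
    page_delta = (target // PAGE) - (pc // PAGE)  # signed count of 4 KB pages
    lo12 = target % PAGE                          # offset within the target page
    if not -(1 << 20) <= page_delta < (1 << 20):  # 21-bit signed field
        raise ValueError("target is outside ADRP's +/-4 GB range")
    return page_delta, lo12


def resolve(pc: int, page_delta: int, lo12: int) -> int:
    """Recompute the address the ADRP + ADD pair produces at run time."""
    return (pc & ~(PAGE - 1)) + page_delta * PAGE + lo12


if __name__ == "__main__":
    delta, lo = adrp_add_split(0x400078, 0x4100B0)
    print(delta, hex(lo))                     # 16 0xb0
    print(hex(resolve(0x400078, delta, lo)))  # 0x4100b0
```

Because the low 12 bits of PC are masked off first, the result does not depend on where inside its page the ADRP instruction sits.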

LR: Link Register (X30)

When you call a function with BL (Branch and Link), the return address is stored in X30. The called function returns with RET, which branches to X30.

BL  some_function     // PC = some_function, X30 = return address
// ... (inside some_function)
RET                   // PC = X30

If some_function wants to call another function, it must save X30 to the stack first, or it will lose the return address.

some_function:
    STP X29, X30, [SP, #-16]!   // Save FP and LR
    MOV X29, SP                  // Update frame pointer
    BL  another_function         // X30 now has the return addr back into some_function
    LDP X29, X30, [SP], #16      // Restore FP and LR
    RET                          // Return to caller

🔍 Under the Hood: In x86-64, CALL pushes the return address onto the stack. In ARM64, BL stores it in a register (X30). This is faster for leaf functions (functions that don't call anything else) because they never need to touch memory for the return address. Non-leaf functions must save and restore X30, which costs the same as x86-64's implicit push.
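What goes wrong when a non-leaf function forgets to save X30 can be traced by hand. The following Python sketch follows only the value of X30 through the call chain caller -> some_function -> another_function. The addresses 0x1000 and 0x2008 are arbitrary, and the function name is invented for this illustration.

```python
def return_target(save_lr: bool) -> int:
    """Return the address that some_function's final RET branches to."""
    stack = []

    # caller, at 0x1000: BL some_function
    x30 = 0x1000 + 4          # BL writes the address of the next instruction

    # some_function prologue
    if save_lr:
        stack.append(x30)     # STP X29, X30, [SP, #-16]!  (only X30 is modelled)

    # some_function, at 0x2008: BL another_function
    x30 = 0x2008 + 4          # overwrites the caller's return address

    # another_function: RET branches to 0x200C, back inside some_function

    # some_function epilogue
    if save_lr:
        x30 = stack.pop()     # LDP X29, X30, [SP], #16

    return x30                # RET branches to whatever X30 holds now


if __name__ == "__main__":
    print(hex(return_target(save_lr=True)))   # 0x1004
    print(hex(return_target(save_lr=False)))  # 0x200c
```

Without the save, RET jumps to 0x200C, which is inside some_function itself, so the function never gets back to its caller.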

FP: Frame Pointer (X29)

X29 is conventionally used as the frame pointer, and AAPCS64 reserves it for that role. Whether every function must actually maintain it is a platform decision: Apple platforms require X29 to always point to a valid frame record, while on Linux a compiler may omit it with -fomit-frame-pointer for performance. Keeping it gives you reliable stack unwinding, which debuggers and crash reporters depend on.

16.3 PSTATE: The Condition Flags

ARM64 tracks four condition flags in PSTATE (Process State), comparable to x86-64's RFLAGS:

PSTATE Condition Flags
┌──────┬──────────────────────────────────────────────────────────────────┐
│ Flag │ Meaning                                                           │
├──────┼──────────────────────────────────────────────────────────────────┤
│  N   │ Negative — result was negative (MSB = 1)                         │
│      │ Equivalent to x86-64 SF (Sign Flag)                              │
├──────┼──────────────────────────────────────────────────────────────────┤
│  Z   │ Zero — result was zero                                            │
│      │ Equivalent to x86-64 ZF                                           │
├──────┼──────────────────────────────────────────────────────────────────┤
│  C   │ Carry — unsigned overflow or borrow (add: carry out; sub: no borrow)│
│      │ Equivalent to x86-64 CF (but subtraction carry is INVERTED in ARM)│
├──────┼──────────────────────────────────────────────────────────────────┤
│  V   │ oVerflow — signed overflow                                        │
│      │ Equivalent to x86-64 OF                                           │
└──────┴──────────────────────────────────────────────────────────────────┘

⚠️ Common Mistake: The C flag in ARM64 for subtraction is the inverse of x86-64's CF. In x86-64, SUB sets CF=1 on borrow. In ARM64, SUBS sets C=0 on borrow (C=1 means no borrow). This matters when you're implementing multi-precision arithmetic using chained carry.
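The inverted carry is easiest to internalize by computing the flags yourself. The Python sketch below models the N, Z, C, and V results of a 64-bit SUBS and, next to it, the CF that x86-64 SUB would produce for the same operands. The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1
SIGN = 1 << 63


def subs_flags(a: int, b: int):
    """(N, Z, C, V) after ARM64 `SUBS Xd, Xn, Xm` with Xn = a, Xm = b."""
    a &= MASK64
    b &= MASK64
    result = (a - b) & MASK64
    n = int(bool(result & SIGN))
    z = int(result == 0)
    c = int(a >= b)  # ARM64: C = 1 means NO borrow occurred
    # Signed overflow: operands had different signs and the result's sign differs from a's.
    v = int(bool((a ^ b) & (a ^ result) & SIGN))
    return n, z, c, v


def x86_cf_after_sub(a: int, b: int) -> int:
    """CF after x86-64 `SUB a, b`: CF = 1 means a borrow occurred."""
    return int((a & MASK64) < (b & MASK64))


if __name__ == "__main__":
    print(subs_flags(5, 7), x86_cf_after_sub(5, 7))  # (1, 0, 0, 0) 1
    print(subs_flags(7, 5), x86_cf_after_sub(7, 5))  # (0, 0, 1, 0) 0
```

For every pair of operands the ARM64 C flag is the logical opposite of the x86-64 CF, which is exactly the trap when porting multi-precision subtraction.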

The S Suffix: Optional Flag Update

In ARM64, most arithmetic instructions do NOT update flags by default. You must explicitly request flag updates by appending the S suffix:

ADD  X0, X1, X2     // X0 = X1 + X2, flags NOT updated
ADDS X0, X1, X2     // X0 = X1 + X2, N/Z/C/V updated

SUB  X0, X1, X2     // X0 = X1 - X2, flags NOT updated
SUBS X0, X1, X2     // X0 = X1 - X2, N/Z/C/V updated

Compare to x86-64, where ADD always updates flags. ARM64's approach allows you to perform arithmetic without disturbing flags you need to preserve. It's particularly useful in complex expressions where you only need to test once at the end.

The comparison pseudoinstructions use this:

CMP X0, X1          // assembles to: SUBS XZR, X0, X1
CMN X0, X1          // assembles to: ADDS XZR, X0, X1  (compare negative)
TST X0, X1          // assembles to: ANDS XZR, X0, X1  (test bits)

Condition Codes

The full set of ARM64 condition codes used in conditional branches (B.cond) and conditional select instructions:

Condition Code Reference
┌───────┬──────────────────────────┬─────────────────────────────────────┐
│ Code  │ Meaning                  │ Flags tested                        │
├───────┼──────────────────────────┼─────────────────────────────────────┤
│ EQ    │ Equal                    │ Z = 1                               │
│ NE    │ Not Equal                │ Z = 0                               │
│ CS/HS │ Carry Set / Higher/Same  │ C = 1                               │
│ CC/LO │ Carry Clear / Lower      │ C = 0                               │
│ MI    │ Minus (negative)         │ N = 1                               │
│ PL    │ Plus (positive/zero)     │ N = 0                               │
│ VS    │ Overflow Set             │ V = 1                               │
│ VC    │ Overflow Clear           │ V = 0                               │
│ HI    │ Higher (unsigned >)      │ C = 1 AND Z = 0                     │
│ LS    │ Lower or Same (unsigned ≤)│ C = 0 OR Z = 1                     │
│ GE    │ Greater or Equal (signed ≥)│ N = V                             │
│ LT    │ Less Than (signed <)     │ N ≠ V                               │
│ GT    │ Greater Than (signed >)  │ Z = 0 AND N = V                     │
│ LE    │ Less or Equal (signed ≤) │ Z = 1 OR N ≠ V                      │
│ AL    │ Always                   │ (unconditional)                     │
│ NV    │ Encoding 1111; acts as AL│ (unconditional; do not use)         │
└───────┴──────────────────────────┴─────────────────────────────────────┘
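The table translates directly into code. Here is a minimal Python evaluator for the condition codes, written as a lookup from mnemonic to a predicate over the four flags. CONDITIONS and cond_holds are names chosen for this sketch, not part of any toolchain.

```python
# Each entry maps a condition mnemonic to a predicate over N, Z, C, V (each 0 or 1).
CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "CS": lambda n, z, c, v: c == 1,
    "HS": lambda n, z, c, v: c == 1,
    "CC": lambda n, z, c, v: c == 0,
    "LO": lambda n, z, c, v: c == 0,
    "MI": lambda n, z, c, v: n == 1,
    "PL": lambda n, z, c, v: n == 0,
    "VS": lambda n, z, c, v: v == 1,
    "VC": lambda n, z, c, v: v == 0,
    "HI": lambda n, z, c, v: c == 1 and z == 0,
    "LS": lambda n, z, c, v: c == 0 or z == 1,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: z == 0 and n == v,
    "LE": lambda n, z, c, v: z == 1 or n != v,
    "AL": lambda n, z, c, v: True,
}


def cond_holds(code: str, n: int, z: int, c: int, v: int) -> bool:
    """Return True if condition `code` passes for the given flag values."""
    return CONDITIONS[code.upper()](n, z, c, v)


if __name__ == "__main__":
    # Flags after CMP X0, X1 with X0 = -1 and X1 = 1: N=1, Z=0, C=1, V=0
    print(cond_holds("LT", 1, 0, 1, 0))  # True
    print(cond_holds("HI", 1, 0, 1, 0))  # True
```

The example shows why signed and unsigned conditions are separate: after comparing -1 with 1, LT passes because -1 is less than 1 as a signed value, and HI also passes because the same bit pattern is the largest possible unsigned value.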

16.4 Fixed-Width Instructions

Every ARM64 instruction is exactly 32 bits (4 bytes). No exceptions. This is one of the defining characteristics of RISC.

ARM64 Instruction Encoding
┌─────────────────────────────────────────────────────────────────────────┐
│  All instructions: exactly 4 bytes, aligned on 4-byte boundaries        │
│                                                                           │
│  0x0000:  ┌──────────┐                                                   │
│            │ ADD X0,  │  = 0x8B010000  (4 bytes)                         │
│            │ X0, X1   │                                                   │
│            └──────────┘                                                   │
│  0x0004:  ┌──────────┐                                                   │
│            │ SUB X2,  │  = 0xCB020042  (4 bytes)                         │
│            │ X2, X2   │                                                   │
│            └──────────┘                                                   │
│  0x0008:  ...                                                             │
│                                                                           │
│  vs. x86-64:                                                              │
│  0x0000:  48 01 C8       ADD RAX, RCX        (3 bytes)                   │
│  0x0003:  48 83 C0 01    ADD RAX, 1          (4 bytes)                   │
│  0x0007:  48 8D 84 CB    LEA RAX,[RBX+RCX*8+0x80] (8 bytes)              │
│           80 00 00 00                                                     │
└─────────────────────────────────────────────────────────────────────────┘
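The two ARM64 words in the diagram can be rebuilt with a handful of shifts, which shows how regular the encoding is. The Python sketch below covers only one instruction form, 64-bit ADD/SUB of registers with no shift and no flag update. The function names are invented for this example.

```python
def encode_add_sub_reg(op: str, rd: int, rn: int, rm: int) -> int:
    """Encode 64-bit `ADD Xd, Xn, Xm` or `SUB Xd, Xn, Xm` (shifted-register form, shift = 0)."""
    base = {"ADD": 0x8B000000, "SUB": 0xCB000000}[op]
    for r in (rd, rn, rm):
        if not 0 <= r <= 30:  # 31 would mean XZR in this form; excluded to keep the sketch simple
            raise ValueError("register number must be 0-30")
    # Rm sits in bits 20:16, Rn in bits 9:5, Rd in bits 4:0.
    return base | (rm << 16) | (rn << 5) | rd


def to_memory_bytes(word: int) -> bytes:
    """A64 instructions are stored little-endian."""
    return word.to_bytes(4, "little")


if __name__ == "__main__":
    print(hex(encode_add_sub_reg("ADD", 0, 0, 1)))  # 0x8b010000
    print(hex(encode_add_sub_reg("SUB", 2, 2, 2)))  # 0xcb020042
    print(to_memory_bytes(0x8B010000).hex())        # 0000018b
```

Every field sits at a fixed bit position in a fixed-size word, so a decoder can pull out all three register numbers in parallel without first working out how long the instruction is.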

Benefits of fixed-width encoding:
- Simpler decode: The processor always knows where the next instruction starts without scanning for the end of the current one
- Predictable alignment: All branch targets are 4-byte aligned, so validating a branch target is just (addr & 3) == 0
- Superscalar parallelism: Fetching 4 instructions per cycle means reading one 16-byte chunk and splitting it at fixed offsets
- Disassembly: Instruction boundaries are never ambiguous, so ARM64 disassemblers are far simpler to write than x86-64 ones

Costs:
- Code density: Programs are often somewhat larger than equivalent x86-64 binaries, and more code means more instruction-cache pressure
- Limited immediate range: You can't encode a 64-bit immediate in a 32-bit instruction. Large constants require multiple instructions or memory loads
- PC-relative addressing range: Branch instructions have a fixed number of bits for the offset, limiting range (±128 MB for B, ±1 MB for B.cond)
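To see what the limited immediate range costs in practice, the Python sketch below splits a 64-bit constant into 16-bit chunks and emits one MOVZ followed by a MOVK per remaining non-zero chunk. This is one straightforward strategy, written as an assumption about how you might do it by hand. Real assemblers and compilers often find shorter sequences (for example with MOVN or logical immediates), and the function name is made up.

```python
def movz_movk_sequence(value: int, reg: str = "X0"):
    """Return assembly lines that build `value` in `reg`, 16 bits at a time."""
    value &= (1 << 64) - 1
    chunks = [(shift, (value >> shift) & 0xFFFF) for shift in (0, 16, 32, 48)]
    nonzero = [(shift, chunk) for shift, chunk in chunks if chunk != 0]
    if not nonzero:
        return [f"MOVZ {reg}, #0"]
    lines = []
    for i, (shift, chunk) in enumerate(nonzero):
        # MOVZ zeroes the rest of the register; MOVK keeps the other bits intact.
        op = "MOVZ" if i == 0 else "MOVK"
        lines.append(f"{op} {reg}, #0x{chunk:X}, LSL #{shift}")
    return lines


if __name__ == "__main__":
    for line in movz_movk_sequence(0x123456789ABCDEF0):
        print(line)
    # MOVZ X0, #0xDEF0, LSL #0
    # MOVK X0, #0x9ABC, LSL #16
    # MOVK X0, #0x5678, LSL #32
    # MOVK X0, #0x1234, LSL #48
```

A full 64-bit constant therefore costs up to four instructions (16 bytes), where x86-64 can embed it in a single 10-byte MOV.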

📊 C Comparison: GCC typically generates more instructions for ARM64 than for x86-64 from the same C code. The ~20% often cited is a ballpark figure that varies with the workload, compiler version, and optimization level. Because each instruction is simple and cheap to decode, the actual execution time is often comparable or better, depending on the microarchitecture.

16.5 Load/Store Architecture

This is the rule that will catch you off-guard more than any other:

In ARM64, ALU instructions CANNOT access memory. Period.

In x86-64:

add rax, [rbx]          ; Load from [rbx] and add to rax — valid
add [rax], rbx          ; Add rbx to memory at [rax] — valid
imul rbx, [rsp+24], 42  ; Load, multiply by immediate, store — valid

In ARM64, none of these can be written as a single instruction. The equivalent is always a load, then the arithmetic, then (if needed) a store:

// "add rax, [rbx]" equivalent (rax -> X0, rbx -> X1)
LDR X2, [X1]        // Load from [X1] into a scratch register
ADD X0, X0, X2      // Now add register to register

// "add [rax], rbx" equivalent
LDR X2, [X0]        // Load from [X0]
ADD X2, X2, X1      // Add
STR X2, [X0]        // Store back

// "imul rbx, [rsp+24], 42" equivalent (rbx -> X1)
LDR X1, [SP, #24]   // Load from stack
MOV X2, #42         // MUL has no immediate form, so put 42 in a register first
MUL X1, X1, X2      // X1 = X1 * 42

This strict separation between memory operations and arithmetic is the defining characteristic of load/store architecture. It makes the pipeline cleaner: the Load/Store unit handles memory, the ALU handles arithmetic, and they never get confused about which is which.

💡 Mental Model: Think of ARM64 registers as a scratchpad. You can only do arithmetic on what's in the scratchpad. To work on something from memory, you bring it to the scratchpad (LDR), do your work (ADD/SUB/MUL/etc.), and put it back (STR). There is no shortcut.

16.6 Conditional Execution and Select

In ARM32 (the 32-bit predecessor), almost every instruction could be made conditional with a 4-bit condition code field. You could write ADDEQ R0, R1, R2 (add only if equal). This was ARM's answer to branches: avoid the branch entirely by making instructions conditional.

ARM64 removed this. Most instructions are unconditional. The branch predictor in modern CPUs is good enough that conditional execution usually isn't faster than branch prediction anymore.

What ARM64 does have are conditional select instructions:

// CSEL: conditional select
// CSEL Xd, Xn, Xm, cond  →  Xd = (cond true) ? Xn : Xm
CMP  X0, #0
CSEL X1, X2, X3, EQ    // X1 = (X0 == 0) ? X2 : X3

This is ARM64's equivalent of x86-64's CMOV family:

; x86-64 equivalent
cmp   rax, 0
cmove rbx, rcx          ; rbx = (rax == 0) ? rcx : rbx
; Note: CMOV only has one source — it's "if true, replace"
; CSEL has two sources — it's a true ternary

CSEL is actually more powerful than CMOV: it fully replaces the value in either case, rather than conditionally replacing an existing value.

Other conditional variants:

CSET  Xd, cond          // Xd = (cond true) ? 1 : 0
CSINC Xd, Xn, Xm, cond // Xd = (cond true) ? Xn : Xm + 1
CSINV Xd, Xn, Xm, cond // Xd = (cond true) ? Xn : ~Xm
CSNEG Xd, Xn, Xm, cond // Xd = (cond true) ? Xn : -Xm

These are extremely useful for branchless code — computing absolute value, clamping, sign extraction, etc.

// Absolute value using CSNEG
CMP   X0, #0
CSNEG X0, X0, X0, GE   // X0 = (X0 >= 0) ? X0 : -X0
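The whole conditional-select family fits in a few lines when modelled in Python. In this sketch `cond` is a boolean standing for "the condition passed", all values are treated as 64-bit patterns, and the function names simply mirror the mnemonics. The to_signed and abs64 helpers are illustrative additions.

```python
MASK64 = (1 << 64) - 1


def csel(cond: bool, xn: int, xm: int) -> int:
    return xn if cond else xm


def csinc(cond: bool, xn: int, xm: int) -> int:
    return xn if cond else (xm + 1) & MASK64


def csinv(cond: bool, xn: int, xm: int) -> int:
    return xn if cond else ~xm & MASK64


def csneg(cond: bool, xn: int, xm: int) -> int:
    return xn if cond else -xm & MASK64


def cset(cond: bool) -> int:
    return 1 if cond else 0


def to_signed(x: int) -> int:
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    return x - (1 << 64) if x & (1 << 63) else x


def abs64(x0: int) -> int:
    """CMP X0, #0 ; CSNEG X0, X0, X0, GE"""
    ge = to_signed(x0) >= 0  # after CMP X0, #0, GE passes exactly when X0 is non-negative
    return csneg(ge, x0, x0)


if __name__ == "__main__":
    print(abs64(-5 & MASK64))  # 5
    print(abs64(7))            # 7
```

One edge case carries over from the hardware: the most negative value, 0x8000000000000000, has no positive counterpart in 64 bits, so abs64 returns it unchanged.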

16.7 Setting Up ARM64 Development

Option 1: Raspberry Pi (Native)

Install Raspberry Pi OS (64-bit) and assemble natively:

# On the Raspberry Pi (native toolchain, no prefix needed)
sudo apt install binutils gcc
as hello.s -o hello.o
ld hello.o -o hello
./hello

Because you are assembling on the Pi itself, the tools are plain as and ld. The aarch64-linux-gnu- prefix is only needed when cross-assembling from an x86 host, as in Option 2.

Option 2: QEMU on x86-64 Linux

# Install cross-compiler and QEMU user-mode emulator
sudo apt install binutils-aarch64-linux-gnu qemu-user

# Assemble for ARM64
aarch64-linux-gnu-as hello.s -o hello.o
aarch64-linux-gnu-ld hello.o -o hello

# Run under QEMU user emulation
qemu-aarch64 ./hello

QEMU user mode translates ARM64 system calls to host system calls. No virtual machine needed.

For GDB debugging:

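# Run with GDB stub (QEMU waits for a debugger on port 1234)
qemu-aarch64 -g 1234 ./hello &
# On Debian/Ubuntu the cross-capable debugger is gdb-multiarch (package: gdb-multiarch);
# some distributions ship it as aarch64-linux-gnu-gdb instead
gdb-multiarch hello
(gdb) target remote :1234
(gdb) layout regs
(gdb) stepi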

Option 3: Apple Silicon (M1/M2/M3/M4)

On macOS with Apple Silicon, clang and as target ARM64 natively. GDB is replaced by LLDB.

# macOS ARM64 assembly
clang -arch arm64 hello.s -o hello
./hello

# Debug with LLDB
lldb ./hello
(lldb) run
(lldb) register read

Note: macOS uses different system call numbers and a different calling convention for syscalls than Linux. See Chapter 18 for details.

A First ARM64 Program

Let's write "Hello, World" to verify the toolchain works:

// hello_arm64.s
// Target: Linux ARM64 (AArch64)
// Assemble: aarch64-linux-gnu-as hello_arm64.s -o hello_arm64.o
// Link:     aarch64-linux-gnu-ld hello_arm64.o -o hello_arm64
// Run:      qemu-aarch64 ./hello_arm64  (or native on ARM64 Linux)

.section .data
msg:    .ascii "Hello, ARM64!\n"
msg_len = . - msg              // Calculate string length at assemble time

.section .text
.global _start
_start:
    // write(1, msg, msg_len)
    MOV X8, #64               // syscall number: write (64 on Linux ARM64)
    MOV X0, #1                // fd = stdout
    ADR X1, msg               // X1 = address of msg
    MOV X2, #msg_len          // X2 = length
    SVC #0                    // invoke syscall

    // exit(0)
    MOV X8, #93               // syscall number: exit (93 on Linux ARM64)
    MOV X0, #0                // status = 0
    SVC #0                    // invoke syscall

Observe:
- ADR X1, msg: PC-relative address of a label (ARM64's way to load addresses)
- SVC #0: Supervisor Call, equivalent to x86-64's SYSCALL
- Syscall number in X8 (not RAX like x86-64)
- Arguments in X0-X5 (similar to x86-64's RDI/RSI/RDX/...)

🛠️ Lab Exercise: Assemble and run the hello world program above. Then use GDB/LLDB or qemu-aarch64 -g 1234 to trace through it one instruction at a time. Note which register changes at each step.

16.8 Calling Convention Preview (AAPCS64)

The ARM Procedure Call Standard for AArch64 (AAPCS64) defines how functions receive arguments and return values:

AAPCS64 Register Usage Summary
┌──────────────────────┬──────────────────────────────────────────────────┐
│ Registers            │ Role                                              │
├──────────────────────┼──────────────────────────────────────────────────┤
│ X0-X7                │ Arguments (1st through 8th); X0 = return value   │
│ X8                   │ Indirect result location register                │
│ X9-X15               │ Caller-saved temporaries (may be trashed by calls)│
│ X16-X17 (IP0, IP1)   │ Reserved for linker trampolines                  │
│ X18                  │ Platform register (avoid; reserved on Apple)      │
│ X19-X28              │ Callee-saved (must preserve across calls)         │
│ X29 (FP)             │ Frame pointer (callee-saved)                      │
│ X30 (LR)             │ Link register (callee-saved logically)            │
│ SP                   │ Stack pointer (16-byte aligned at calls)          │
└──────────────────────┴──────────────────────────────────────────────────┘

Compare to System V AMD64 ABI (x86-64):

Side-by-Side Calling Convention Comparison
┌──────────────────────┬────────────────────────────────────────────────────┐
│ AAPCS64 (ARM64)      │ System V AMD64 (x86-64)                            │
├──────────────────────┼────────────────────────────────────────────────────┤
│ X0-X7 = args 1-8     │ RDI, RSI, RDX, RCX, R8, R9 = args 1-6            │
│ X0 = return value    │ RAX = return value                                 │
│ X19-X28 = callee-saved│ RBX, RBP, R12-R15 = callee-saved                 │
│ X0-X17 = caller-saved│ RAX, RCX, RDX, RSI, RDI, R8-R11 = caller-saved   │
│ SP must be 16-aligned│ RSP must be 16-aligned (8-byte misaligned at entry)│
│ X30 = return addr    │ Return addr pushed on stack by CALL                │
└──────────────────────┴────────────────────────────────────────────────────┘

ARM64 has more argument registers (8 vs. 6), which means more functions pass all their arguments in registers without touching the stack. For functions with 7 or 8 arguments, ARM64 is more efficient.
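The difference shows up as soon as a function takes a seventh argument. The Python sketch below assigns argument slots under both conventions. It assumes every argument is a 64-bit integer or pointer (no floats, structs, or varargs), reports stack slots as byte offsets from the first stack-passed argument, and uses names (INT_ARG_REGS, arg_locations) invented for this example.

```python
INT_ARG_REGS = {
    "aapcs64": ["X0", "X1", "X2", "X3", "X4", "X5", "X6", "X7"],
    "sysv": ["RDI", "RSI", "RDX", "RCX", "R8", "R9"],
}


def arg_locations(n_args: int, abi: str):
    """Return where each of n_args integer arguments is passed under `abi`."""
    regs = INT_ARG_REGS[abi]
    locations = []
    for i in range(n_args):
        if i < len(regs):
            locations.append(regs[i])
        else:
            # Remaining arguments go on the stack in 8-byte slots.
            locations.append(f"stack+{8 * (i - len(regs))}")
    return locations


if __name__ == "__main__":
    print(arg_locations(8, "aapcs64"))
    # ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7']
    print(arg_locations(8, "sysv"))
    # ['RDI', 'RSI', 'RDX', 'RCX', 'R8', 'R9', 'stack+0', 'stack+8']
```

With eight integer arguments, the x86-64 caller has to store two values to memory and the callee has to load them back, while the ARM64 call stays entirely in registers.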

Full details in Chapter 17.

16.9 x86-64 vs. ARM64 Register Diagram

Register File Comparison
═══════════════════════════════════════════════════════════════════
x86-64                          ARM64
───────────────────────────────────────────────────────────────────
RAX  64-bit  EAX  AX  AH/AL    X0/W0    64/32-bit  arg1/return
RCX  64-bit  ECX  CX  CH/CL    X1/W1               arg2
RDX  64-bit  EDX  DX  DH/DL    X2/W2               arg3
RBX  64-bit  EBX  BX  BH/BL    X3/W3               arg4
RSP  64-bit  (stack pointer)    X4/W4               arg5
RBP  64-bit  (frame pointer)    X5/W5               arg6
RSI  64-bit  ESI  SI            X6/W6               arg7
RDI  64-bit  EDI  DI            X7/W7               arg8
R8 - R15  64-bit                X8/W8               indirect result
  (R8D/R8W/R8B sub-registers)    X9-X15              caller-saved temps
  (no high-byte AH-style alias)  X16-X17             IP0, IP1
                                 X18                 platform reserved
                                 X19-X28             callee-saved
                                 X29/W29  (FP)       frame pointer
                                 X30/W30  (LR)       link register
                                 SP/WSP              stack pointer
                                 XZR/WZR             always zero
───────────────────────────────────────────────────────────────────
16 GP registers                 31 GP registers + XZR
Complex aliasing (8/16/32/64)   Clean aliasing (32/64 only)
Complex encodings (ModR/M, SIB) Fixed 4-byte encodings
Flags updated by most ALU ops   Flags updated only with S suffix
═══════════════════════════════════════════════════════════════════

🔄 Check Your Understanding:
1. If you write 0xDEADBEEF to W5, what does X5 contain?
2. What does SUBS XZR, X0, X1 do? (What pseudoinstruction is this?)
3. Why can't you write ADD X0, X1, [X2] in ARM64?
4. What is X30 used for? What must you do before calling another function?
5. What's the difference between ADD and ADDS?

Summary

ARM64 is a RISC architecture with 31 general-purpose registers, a dedicated zero register (XZR), fixed 4-byte instruction encoding, optional flag updates (S suffix), and a strict load/store model where ALU instructions cannot touch memory.

The key mindset shift from x86-64:
- Few implicit register operands: nearly every operand is explicit (BL writing X30 is the notable exception)
- No memory operands in ALU instructions: load first, compute, store
- Flag updates are optional and explicit
- Return address in a register (X30), not on the stack
- More registers means less stack pressure

The architecture was designed from the ground up to be compiled-to, to be power-efficient, and to scale from a smartwatch chip to an AWS server. Those constraints produced something genuinely elegant — once you stop expecting it to behave like x86.