Chapter 17 Key Takeaways: ARM64 Instruction Set

  1. ARM64 arithmetic and logical instructions (ADD, SUB, MUL, AND, ORR, EOR) take three explicit register operands. The destination is always first: ADD Xd, Xn, Xm means Xd = Xn + Xm. Compare this with x86-64's two-operand format, where the destination is also a source.

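  2. The barrel shifter is built into the register-operand forms of ADD, SUB, and the logical instructions (AND, ORR, EOR): the second source register can be shifted as part of the instruction. ADD X0, X1, X2, LSL #3 computes X0 = X1 + (X2 << 3) in one instruction. This folds array index arithmetic (base + index * element_size) into a single instruction when the element size is a power of two.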
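
The shifted-operand form in item 2 can be modeled in C. This is a sketch of what the instruction computes, not ARM64 code; add_lsl and elem_addr_u64 are hypothetical names, and the model assumes a shift amount below 64.

```c
#include <assert.h>
#include <stdint.h>

/* Models ADD Xd, Xn, Xm, LSL #shift: Xd = Xn + (Xm << shift).
   Unsigned arithmetic wraps modulo 2^64, like a 64-bit register. */
uint64_t add_lsl(uint64_t xn, uint64_t xm, unsigned shift) {
    assert(shift < 64);             /* shifting by >= 64 is undefined in C */
    return xn + (xm << shift);
}

/* Address of element `index` in an array of 8-byte elements:
   ADD X0, X1, X2, LSL #3 with X1 = base and X2 = index. */
uint64_t elem_addr_u64(uint64_t base, uint64_t index) {
    return add_lsl(base, index, 3); /* index * 8 == index << 3 */
}
```

For example, elem_addr_u64(0x1000, 5) yields 0x1028, the address 40 bytes past the base.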

  3. ARM64 has no division remainder register. SDIV/UDIV produces only the quotient. To get the remainder, use MSUB X_rem, X_quot, X_divisor, X_dividend (equivalent to remainder = dividend - quotient * divisor).
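
The SDIV + MSUB pairing in item 3 can be checked with a small C model. The function names are hypothetical, and the sketch assumes a non-zero divisor and excludes INT64_MIN / -1, both of which are undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Models MSUB Xd, Xn, Xm, Xa: Xd = Xa - Xn * Xm (mod 2^64).
   The work is done in unsigned arithmetic so wraparound is well defined. */
int64_t msub(int64_t xn, int64_t xm, int64_t xa) {
    return (int64_t)((uint64_t)xa - (uint64_t)xn * (uint64_t)xm);
}

/* Models the two-instruction remainder sequence:
       SDIV X2, X0, X1        ; X2 = dividend / divisor
       MSUB X3, X2, X1, X0    ; X3 = dividend - X2 * divisor */
int64_t srem(int64_t dividend, int64_t divisor) {
    assert(divisor != 0);               /* C precondition for this model */
    int64_t quot = dividend / divisor;  /* truncates toward zero, like SDIV */
    return msub(quot, divisor, dividend);
}
```

Because the quotient truncates toward zero, the remainder takes the sign of the dividend: srem(-17, 5) is -2, matching C's % operator.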

  4. SMULH/UMULH produce the upper 64 bits of a 128-bit multiply. Use these in sequence with MUL to get the full 128-bit product of two 64-bit values, or to implement fast division by constants via multiply-high + shift.
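
Both uses of UMULH in item 4 can be sketched in portable C. The umulh function below models the instruction with 32-bit partial products; div10 uses the well-known multiplier for unsigned division by 10 (0xCCCCCCCCCCCCCCCD with a right shift of 3). All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Models UMULH Xd, Xn, Xm: upper 64 bits of the unsigned 128-bit product. */
uint64_t umulh(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    /* Sum the middle column; its upper half is the carry into bits 64+. */
    uint64_t mid = (lo_lo >> 32) + (uint32_t)hi_lo + (uint32_t)lo_hi;
    return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (mid >> 32);
}

/* Division by the constant 10 as multiply-high + shift:
       UMULH X1, X0, Xmagic ; LSR X0, X1, #3 */
uint64_t div10(uint64_t x) {
    return umulh(x, 0xCCCCCCCCCCCCCCCDull) >> 3;
}

/* Small self-check of the model against ordinary C division. */
void umulh_selfcheck(void) {
    assert(umulh(1ull << 32, 1ull << 32) == 1);
    assert(div10(12345) == 12345 / 10);
    assert(div10(UINT64_MAX) == UINT64_MAX / 10);
}
```

The low half of the same product is simply a * b in uint64_t arithmetic, which is what MUL returns.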

  5. ARM64 addressing modes are rich: base only, base+immediate offset, base+register offset (with optional shift), pre-indexed ([Xn, #imm]!), and post-indexed ([Xn], #imm). Pre-indexed updates the base before access; post-indexed updates it after. The ! suffix means write-back.

  6. LDP and STP (load/store pair) transfer two registers in one instruction. The canonical ARM64 function prologue STP X29, X30, [SP, #-16]! saves both frame pointer and link register in one instruction while decrementing SP.

  7. Sized loads (LDRB, LDRH, LDRSB, LDRSH, LDRSW) control zero vs. sign extension. LDRB and LDRH zero-extend; the S in LDRSB/LDRSH/LDRSW means sign-extend. Use sign-extending loads when loading a signed char, short, or int for use in 64-bit arithmetic. Note that plain char is unsigned under AAPCS64, so it takes LDRB.
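
The difference between the zero-extending and sign-extending loads in item 7 can be illustrated with C casts. This is a model of the register result only; the function names are hypothetical and simply mirror the mnemonics.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* LDRB Wt, [Xn]: load one byte and zero-extend it. */
uint64_t ldrb(const void *p) {
    assert(p != NULL);
    return *(const uint8_t *)p;
}

/* LDRSB Xt, [Xn]: load one byte and sign-extend it to 64 bits. */
int64_t ldrsb(const void *p) {
    assert(p != NULL);
    return *(const int8_t *)p;
}

/* LDR Wt, [Xn]: load 32 bits; writing a W register clears the upper half. */
uint64_t ldr_w(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* memcpy sidesteps aliasing/alignment issues */
    return v;
}

/* LDRSW Xt, [Xn]: load 32 bits and sign-extend to 64 bits. */
int64_t ldrsw(const void *p) {
    int32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

The same byte 0xF0 reads back as 240 through ldrb and as -16 through ldrsb, which is why the load must match the signedness of the C type.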

  8. CBZ and CBNZ branch if a register is zero or non-zero without touching flags. These replace the x86-64 TEST reg, reg + JZ/JNZ pattern with a single instruction. TBZ/TBNZ do the same for individual bits.

  9. Conditional branches (B.EQ, B.NE, B.LT, etc.) have only a ±1MB range. Unconditional B has ±128MB range. For farther jumps, load the target address into a register and use BR Xn (indirect branch).
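
The ranges in item 9 follow from the encoding: B.cond carries a 19-bit signed immediate and B a 26-bit one, each counted in 4-byte instructions. A short C sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Most negative byte offset reachable with an imm_bits-wide signed
   immediate that is scaled by the 4-byte instruction size. */
int64_t branch_min_offset(unsigned imm_bits) {
    assert(imm_bits >= 1 && imm_bits <= 32);
    return -((int64_t)1 << (imm_bits - 1)) * 4;
}

/* Most positive reachable byte offset for the same encoding. */
int64_t branch_max_offset(unsigned imm_bits) {
    assert(imm_bits >= 1 && imm_bits <= 32);
    return (((int64_t)1 << (imm_bits - 1)) - 1) * 4;
}

/* Nonzero if `offset` (target minus branch address) is encodable. */
int branch_fits(int64_t offset, unsigned imm_bits) {
    return offset % 4 == 0
        && offset >= branch_min_offset(imm_bits)
        && offset <= branch_max_offset(imm_bits);
}
```

With imm_bits = 19 the range is -1048576 to +1048572 bytes (about ±1MB); with imm_bits = 26 it is -134217728 to +134217724 bytes (about ±128MB).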

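  10. AAPCS64 calling convention: X0-X7 = first 8 integer or pointer arguments, X0 = return value, X19-X28 (plus the frame pointer X29) = callee-saved, X0-X17 = caller-saved. X18 is the platform register; some operating systems reserve it, so portable code should leave it alone. SP must be 16-byte aligned at every BL to a public function, and whenever SP is used as the base of a memory access.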
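
The 16-byte alignment rule in item 10 usually shows up as rounding when a frame is sized. A minimal C sketch of that rounding, assuming a frame made of the saved X29/X30 pair plus a local-variable area (both function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Round a byte count up to the next multiple of 16. */
uint64_t align16(uint64_t nbytes) {
    assert(nbytes <= UINT64_MAX - 15);   /* avoid wraparound in the add */
    return (nbytes + 15) & ~(uint64_t)15;
}

/* Bytes to subtract from SP for a frame holding the X29/X30 pair
   (16 bytes) plus `locals` bytes of local storage. */
uint64_t frame_size(uint64_t locals) {
    return 16 + align16(locals);
}
```

A function with 24 bytes of locals would reserve 48 bytes under this model, keeping SP a multiple of 16.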

  11. The canonical function prologue is STP X29, X30, [SP, #-16]! / MOV X29, SP and the epilogue is LDP X29, X30, [SP], #16 / RET. Leaf functions (no calls) can skip saving LR and omit the full prologue.

  12. Linux ARM64 system calls use X8 for the syscall number, X0-X5 for arguments, and SVC #0 to invoke the kernel. ARM64 syscall numbers differ from x86-64: write=64, exit=93, openat=56.

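  13. The base ARM64 instruction set has no string instructions (no REP SCASB, REP MOVSB, etc.). Implementing strlen, memcpy, and memset requires explicit byte, word, or NEON loops. The NEON SIMD approach (16 bytes at once) is typically used in production libc implementations. Later architecture revisions add optional memory copy and set instructions, but general-purpose code cannot assume they are present.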
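
The simplest of the loops mentioned in item 13 is the byte-at-a-time strlen. The C sketch below mirrors a post-indexed LDRB + CBNZ loop step by step; the function name and the register assignments in the comments are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Byte-loop strlen modeled on:
       MOV  X1, X0            ; X1 = cursor
   loop:
       LDRB W2, [X1], #1      ; load byte, then advance the cursor
       CBNZ W2, loop          ; keep going until the NUL byte
       SUB  X0, X1, X0        ; cursor - start
       SUB  X0, X0, #1        ; cursor stopped one past the NUL */
size_t strlen_bytes(const char *s) {
    assert(s != NULL);
    const unsigned char *start = (const unsigned char *)s;
    const unsigned char *p = start;
    unsigned char c;
    do {
        c = *p++;              /* post-indexed load */
    } while (c != 0);
    return (size_t)(p - start) - 1;
}
```

The final subtraction of 1 is needed because the post-indexed load has already stepped past the terminator when the loop exits.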

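  14. CSEL (conditional select) is ARM64's branchless conditional. CSEL Xd, Xn, Xm, cond writes Xd = cond ? Xn : Xm using the flags set by an earlier instruction such as CMP, without branching. It is more flexible than x86-64's CMOV because it is a full three-register ternary, whereas CMOV can only conditionally overwrite its destination.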
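
The select in item 14 maps directly onto C's ternary operator. A small model follows, with a branchless signed maximum as the usage example; csel and max_s64 are hypothetical names, and the condition is passed as an int rather than read from flags.

```c
#include <assert.h>
#include <stdint.h>

/* Models CSEL Xd, Xn, Xm, cond: Xd = cond ? Xn : Xm. */
int64_t csel(int cond, int64_t xn, int64_t xm) {
    return cond ? xn : xm;
}

/* Signed maximum without a branch:
       CMP  X0, X1
       CSEL X0, X0, X1, GT */
int64_t max_s64(int64_t a, int64_t b) {
    return csel(a > b, a, b);
}

/* Small self-check of the model. */
void csel_selfcheck(void) {
    assert(max_s64(3, 7) == 7);
    assert(max_s64(-1, -5) == -1);
}
```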