Chapter 18 Key Takeaways: ARM64 Programming
- Array element access requires explicit index scaling via LSL. `LDR X2, [X0, X1, LSL #3]` loads the 8-byte element at `arr[X1]`. The shift must match the access size: `LSL #0` (or no shift) with `LDRB` for bytes, `#1` with `LDRH` for halfwords, `#2` with `LDR Wt` for 32-bit words, `#3` with `LDR Xt` for 64-bit doublewords. For other element sizes, compute the offset explicitly.
- memcpy and memset must be implemented with explicit loops on baseline ARMv8-A, which has no `REP MOVSB` or `REP STOSD` equivalent. Efficient implementations use `LDP`/`STP` pairs (16 bytes per instruction pair) or NEON Q-register loads/stores (16 bytes per instruction) for the main loop, with byte-precision tail handling.
- ARM64 FP/SIMD registers (V0-V31) are 128 bits wide with multiple name aliases: `Dn` = 64-bit double, `Sn` = 32-bit float, `Qn` = full 128-bit NEON. V8-V15 are callee-saved in their lower 64 bits only (D8-D15); code that keeps a full 128-bit value in one of them across a call must save and restore the full register itself.
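The copy-loop structure from the memcpy takeaway above (a 16-byte main loop followed by a byte-precision tail) can be modelled in portable C. This is only a sketch of the loop shape, not the chapter's assembly: `copy_bytes` is a hypothetical name, and each pair of 8-byte moves stands in for one `LDP`/`STP` pair.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the chapter): copy n bytes from src to dst.
   The two regions must not overlap. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Main loop: 16 bytes per iteration, modelling
       LDP X2, X3, [X1], #16  followed by  STP X2, X3, [X0], #16. */
    while (n >= 16) {
        uint64_t lo, hi;
        memcpy(&lo, s, 8);      /* fixed-size memcpy = alignment-safe 8-byte load */
        memcpy(&hi, s + 8, 8);
        memcpy(d, &lo, 8);      /* alignment-safe 8-byte stores */
        memcpy(d + 8, &hi, 8);
        s += 16;
        d += 16;
        n -= 16;
    }

    /* Tail: the remaining 0 to 15 bytes, one byte at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Calling `copy_bytes(dst, src, 37)` runs the main loop twice (32 bytes) and the tail loop five times. An assembly version would follow the same two-phase shape.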
- `FMADD Dd, Dn, Dm, Da` is a fused multiply-add with a single rounding step. It computes `Da + Dn*Dm` with better numerical precision than separate FMUL+FADD, and as one instruction it typically has lower latency than the dependent FMUL+FADD pair. Use FMADD instead of FMUL+FADD wherever possible, unless results must be bit-identical to the unfused sequence.
- NEON processes 4 float32s or 2 float64s per instruction. `FADD V0.4S, V1.4S, V2.4S` adds 4 single-precision floats simultaneously. `FMLA V0.4S, V1.4S, V2.4S` performs 4 multiply-accumulates (V0 += V1 * V2 in each lane). Load 128-bit vectors with `LDR Q0, [X1]`.
- NEON horizontal reduction uses `FADDP`. To sum the 4 elements of V0.4S: `FADDP V0.4S, V0.4S, V0.4S` (pairwise sums), then `FADDP S0, V0.2S` (final sum). Two instructions reduce 4 elements.
- Two-accumulator NEON unrolling enables instruction-level parallelism. Keeping two independent FMLA chains in flight (V0 and V1 accumulators) lets the CPU's out-of-order engine overlap their latencies, and on CPUs with multiple FP units execute both FMLAs in the same cycle. Add the two accumulators together before the final horizontal reduction.
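The two-accumulator pattern and the `FADDP` reduction can be traced lane by lane in plain C. The sketch below is a scalar model, not NEON code: `dot_two_acc` is a hypothetical name, the arrays `acc0` and `acc1` stand in for V0.4S and V1.4S, and ordinary multiply-add replaces the fused `FMLA` to keep the example free of library dependencies.

```c
#include <stddef.h>

/* Hypothetical helper (not from the chapter): dot product of a and b,
   structured like a two-accumulator NEON loop. */
float dot_two_acc(const float *a, const float *b, size_t n)
{
    float acc0[4] = {0.0f, 0.0f, 0.0f, 0.0f};   /* models V0.4S */
    float acc1[4] = {0.0f, 0.0f, 0.0f, 0.0f};   /* models V1.4S */
    size_t i = 0;

    /* Main loop: 8 floats per iteration, two independent chains.
       Each inner statement models one lane of an FMLA. */
    for (; i + 8 <= n; i += 8) {
        for (int lane = 0; lane < 4; lane++) {
            acc0[lane] += a[i + lane] * b[i + lane];
            acc1[lane] += a[i + 4 + lane] * b[i + 4 + lane];
        }
    }

    /* Combine the accumulators: FADD V0.4S, V0.4S, V1.4S */
    for (int lane = 0; lane < 4; lane++)
        acc0[lane] += acc1[lane];

    /* FADDP V0.4S, V0.4S, V0.4S: pairwise sums of adjacent lanes */
    float p0 = acc0[0] + acc0[1];
    float p1 = acc0[2] + acc0[3];

    /* FADDP S0, V0.2S: final sum of the two partial sums */
    float sum = p0 + p1;

    /* Scalar tail for the last n % 8 elements */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Because the two chains never read each other's accumulator inside the loop, neither has to wait for the other's result. That independence is what the hardware exploits.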
- Linux ARM64 program entry: the stack contains `argc` at `[SP]`, `argv[0]` at `[SP, #8]`, and `argv[1]` at `[SP, #16]`. No registers hold argc/argv at process entry (unlike a C `main`, which receives them in W0 and X1). Load them from the initial stack.
- macOS ARM64 system calls use X16 (not X8) for the syscall number and `SVC #0x80` (not `SVC #0`). macOS uses BSD syscall numbers: write=4, exit=1, open=5. These differ completely from Linux ARM64's generic table (write=64, exit=93, openat=56).
- macOS uses the Mach-O binary format with sections named `__TEXT,__text` and `__TEXT,__const`, not ELF `.text` and `.rodata`. External C function names require an underscore prefix in assembly (`_printf`, not `printf`).
- ADRP + ADD forms addresses with ±4GB range on macOS. This is necessary because Apple's toolchain and link model can place code and data more than 1MB apart (ADR's limit). `ADRP X0, msg@PAGE` + `ADD X0, X0, msg@PAGEOFF` replaces `ADR X0, msg` in large programs.
- Apple Silicon M-series chips use 16KB memory pages, not 4KB. This is unusual (most ARM64 Linux systems use 4KB pages). It affects `mmap`, memory alignment requirements for some operations, and code that hardcodes page size assumptions.
- Rosetta 2 translates x86-64 binaries to ARM64 ahead of time at first run (code generated at runtime is translated just in time). Translated code runs at approximately 70-85% of native x86-64 speed. Native ARM64 code on M-series chips typically outperforms Intel x86-64 code in real-world workloads despite the lower clock frequencies.
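One way to avoid the hardcoded page-size assumption mentioned in the 16KB-page takeaway above is to query the size at runtime. The sketch below uses the POSIX `sysconf(_SC_PAGESIZE)` call. The helper names `runtime_page_size` and `round_up_to_page` are hypothetical, and the 4096-byte fallback is an assumption for the unlikely case that the query fails.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: ask the OS for the page size instead of assuming 4096. */
size_t runtime_page_size(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    return ps > 0 ? (size_t)ps : 4096;   /* fallback value is an assumption */
}

/* Hypothetical helper: round len up to a whole number of pages,
   e.g. before passing a length to mmap. The mask trick relies on
   page sizes being powers of two. */
size_t round_up_to_page(size_t len)
{
    size_t ps = runtime_page_size();
    return (len + ps - 1) & ~(ps - 1);
}
```

With this approach the same source should work unchanged on a 16KB-page Apple Silicon system and on a 4KB-page ARM64 Linux system, because `round_up_to_page(1)` returns whatever the running kernel reports as one page.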
- Universal binaries (Mach-O fat binaries) contain multiple architecture slices. Use `lipo` to inspect, extract, or combine slices. When an Apple Silicon Mac runs a universal binary, it picks the native ARM64 slice; Rosetta 2 handles x86-64-only binaries automatically.