Chapter 18 Key Takeaways: ARM64 Programming

  1. Array element access requires explicit index scaling via LSL. LDR X2, [X0, X1, LSL #3] loads the 8-byte element arr[X1] from the array based at X0. The shift must match the transfer size: LSL #0 (or no shift) with LDRB for bytes, #1 with LDRH for halfwords, #2 with LDR Wt for 32-bit words, #3 with LDR Xt for 64-bit doublewords. For other element sizes, compute the offset explicitly.

  2. memcpy and memset must be implemented with explicit loops on ARM64. The baseline instruction set has no equivalent of x86's REP MOVSB or REP STOSD. Efficient implementations use LDP/STP on X registers (16 bytes per load/store pair) or NEON Q-register loads/stores (16 bytes per instruction) for the main loop, then handle the remaining tail of fewer than 16 bytes to byte precision.

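  3. ARM64 FP/SIMD registers (V0-V31) are 128 bits wide with multiple name aliases. Dn = low 64 bits (double), Sn = low 32 bits (float), Qn = full 128-bit NEON register. V8-V15 are callee-saved, but only in their lower 64 bits (D8-D15). A function that keeps full 128-bit values in V8-V15 across a call must save the upper halves itself, because the callee is not required to preserve them.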

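  4. FMADD Dd, Dn, Dm, Da is a fused multiply-add: it computes Da + Dn*Dm with a single rounding step, so the product is not rounded before the add. This gives better numerical precision than separate FMUL+FADD, and it issues as one instruction instead of two dependent ones. Prefer FMADD over FMUL+FADD unless the result must match separately rounded arithmetic bit for bit.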

  5. NEON processes 4 float32s or 2 float64s per instruction. FADD V0.4S, V1.4S, V2.4S adds 4 pairs of single-precision floats simultaneously. FMLA V0.4S, V1.4S, V2.4S computes V0 += V1 * V2 in each of the 4 lanes (a fused multiply-accumulate). Load 128-bit vectors with LDR Q0, [X1].

  6. NEON horizontal reduction uses FADDP. To sum the 4 elements of V0.4S: FADDP V0.4S, V0.4S, V0.4S (pairwise sum), then FADDP S0, V0.2S (final sum). Two instructions to reduce 4 elements.

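  7. Two-accumulator NEON unrolling enables instruction-level parallelism. With a single accumulator, each FMLA must wait for the previous one to finish. Keeping two independent FMLA chains in flight (V0 and V1 accumulators) lets the CPU's out-of-order engine overlap them, and on CPUs with multiple FP units both FMLAs can execute simultaneously. Add the two accumulators together after the loop.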

  8. Linux ARM64 program entry (_start): the stack contains argc at [SP], argv[0] at [SP, #8], argv[1] at [SP, #16]. No registers hold argc/argv at this point, unlike in main, where the C runtime passes them in X0 and X1. Load them from the initial stack.

  9. macOS ARM64 system calls use X16 (not X8) for the syscall number and SVC #0x80 (not SVC #0). macOS uses BSD syscall numbers: write=4, exit=1, open=5. These differ completely from Linux ARM64's generic table (write=64, exit=93, openat=56).

  10. macOS uses Mach-O binary format with sections named __TEXT,__text and __TEXT,__const, not ELF .text and .rodata. External C function names require an underscore prefix in assembly (_printf, not printf).

  11. ADRP + ADD loads addresses with ±4GB range on macOS. This is necessary because Apple's toolchain and link model can place code and data more than 1MB apart (ADR's limit). The pair ADRP X0, msg@PAGE followed by ADD X0, X0, msg@PAGEOFF replaces ADR X0, msg in large programs. ADRP always works in 4KB units, regardless of the operating system's page size.

  12. macOS on Apple Silicon M-series chips uses 16KB memory pages, not 4KB. This is unusual (most ARM64 Linux systems use 4KB pages). It affects mmap, memory alignment requirements for some operations, and code that hardcodes page size assumptions. Query the page size at run time instead of assuming 4096.

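  13. Rosetta 2 translates x86-64 binaries to ARM64 ahead of time, at install or first run, and translates at run time any code that an x86-64 program generates dynamically (a JIT compiler's output, for example). Translated code runs at approximately 70-85% of native x86-64 speed, depending on the workload. Native ARM64 code on M-series chips typically outperforms Intel x86-64 code in real-world workloads despite the lower clock frequencies.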

  14. Universal binaries (Mach-O fat binaries) contain multiple architecture slices. Use lipo to inspect, extract, or combine slices. When an Apple Silicon Mac runs a universal binary, it picks the native ARM64 slice; Rosetta 2 handles x86-only binaries automatically.
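
The address arithmetic behind takeaway 1 can be modeled in portable C. This is only a sketch: scaled_address and load_u64_scaled are hypothetical helper names, not part of any library, and the code models what the addressing mode computes rather than how a compiler would emit it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models the address formed by
   LDR Xt, [Xbase, Xindex, LSL #shift], i.e. base + (index << shift). */
static inline uintptr_t scaled_address(uintptr_t base, uint64_t index, unsigned shift)
{
    assert(shift <= 3);  /* 0 = byte, 1 = halfword, 2 = word, 3 = doubleword */
    return base + (uintptr_t)(index << shift);
}

/* Loads arr[index] through the modeled address, as
   LDR X2, [X0, X1, LSL #3] would with X0 = arr and X1 = index. */
static inline uint64_t load_u64_scaled(const uint64_t *arr, uint64_t index)
{
    return *(const uint64_t *)scaled_address((uintptr_t)arr, index, 3);
}
```

For example, with base 0x1000 and index 5, a shift of 2 yields 0x1014 (5 * 4 bytes past the base), while a shift of 0 yields 0x1005.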
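
The loop shape described in takeaway 2 (a 16-byte main loop followed by a byte tail) might look like the following in C. The function name copy_16_then_tail is hypothetical, the fixed-size memcpy calls merely stand in for the LDP/STP pair, and the buffers are assumed not to overlap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy n bytes, 16 per iteration, then a byte tail.
   Assumes dst and src do not overlap, as memcpy does. */
static void *copy_16_then_tail(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 16) {
        uint64_t lo, hi;
        memcpy(&lo, s, 8);       /* these two loads model LDP X2, X3, [X1], #16 */
        memcpy(&hi, s + 8, 8);
        memcpy(d, &lo, 8);       /* these two stores model STP X2, X3, [X0], #16 */
        memcpy(d + 8, &hi, 8);
        s += 16;
        d += 16;
        n -= 16;
    }
    while (n > 0) {              /* byte tail: at most 15 iterations */
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

A tuned implementation would typically handle the tail with a few overlapping or narrower loads rather than a byte loop; the byte loop is used here only because it is the simplest correct form.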
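
To see why the two FADDP instructions in takeaway 6 are enough, it can help to model the lanes in C. The vec4s type and the function names below are hypothetical stand-ins for a V register and the two instruction forms. The vector form is modeled as placing the pair sums of its first source in the low two lanes and those of its second source in the high two lanes.

```c
/* Hypothetical model of a V register viewed as four float lanes (.4S). */
typedef struct { float lane[4]; } vec4s;

/* Models FADDP Vd.4S, Vn.4S, Vm.4S: adjacent pairs of n fill the low
   two lanes of the result, adjacent pairs of m fill the high two. */
static vec4s faddp_4s(vec4s n, vec4s m)
{
    vec4s d = {{ n.lane[0] + n.lane[1], n.lane[2] + n.lane[3],
                 m.lane[0] + m.lane[1], m.lane[2] + m.lane[3] }};
    return d;
}

/* Models the scalar form FADDP Sd, Vn.2S: sum of the two low lanes. */
static float faddp_scalar_2s(vec4s n)
{
    return n.lane[0] + n.lane[1];
}

/* The two-instruction reduction from takeaway 6. */
static float horizontal_sum(vec4s v)
{
    vec4s pairs = faddp_4s(v, v);   /* FADDP V0.4S, V0.4S, V0.4S */
    return faddp_scalar_2s(pairs);  /* FADDP S0, V0.2S */
}
```

Note the summation order this implies: (v0 + v1) + (v2 + v3), which can differ in the last bit from a left-to-right scalar sum.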
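
The two-accumulator idea from takeaway 7 can be sketched in scalar C. Each C accumulator below stands in for one NEON accumulator register; real NEON code would process 4 lanes per accumulator (8 floats per iteration). The function name dot_two_accumulators is hypothetical.

```c
#include <stddef.h>

/* Hypothetical sketch: dot product with two independent accumulator
   chains, so consecutive multiply-adds do not depend on each other. */
static float dot_two_accumulators(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        acc0 += a[i] * b[i];          /* chain 0: depends only on acc0 */
        acc1 += a[i + 1] * b[i + 1];  /* chain 1: depends only on acc1 */
    }
    if (i < n) {
        acc0 += a[i] * b[i];          /* odd element left over */
    }
    return acc0 + acc1;               /* combine the chains once, after the loop */
}
```

Because floating-point addition is not associative, the result can differ slightly from a single-accumulator loop; for most workloads that difference is acceptable, but it is worth knowing about when comparing outputs.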
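
For the page-size point in takeaway 12, one way to avoid a hardcoded 4096 is to ask the system at run time. The sketch below uses the POSIX call sysconf(_SC_PAGESIZE); the helper names runtime_page_size and round_up_to_page are hypothetical, and the rounding helper assumes the page size is a power of two.

```c
#include <stddef.h>
#include <unistd.h>

/* Returns the page size reported by the OS, or 0 if the query fails. */
static size_t runtime_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);
    return sz > 0 ? (size_t)sz : 0;
}

/* Rounds len up to a multiple of page. Assumes page is a power of two. */
static size_t round_up_to_page(size_t len, size_t page)
{
    return (len + page - 1) & ~(page - 1);
}
```

With a 16KB page, a 5000-byte request rounds up to 16384 bytes; with a 4KB page, the same request rounds up to 8192 bytes. Code that assumed 4KB would compute the wrong mapping length on a 16KB system.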