Chapter 18 Further Reading: ARM64 Programming


1. ARM Neon Intrinsics Reference ARM Ltd. — https://developer.arm.com/architectures/instruction-sets/intrinsics

The NEON intrinsics guide documents the NEON instructions as C intrinsics and lists the instruction each intrinsic maps to. Essential for cross-referencing: when you see vfmaq_f32 in C code, look it up here to understand the underlying FMLA assembly instruction. The search function lets you filter by operation type.
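As an illustration of the kind of cross-referencing described above, here is a minimal sketch of a fused multiply-accumulate loop. The function name fma_accumulate is hypothetical, and the scalar fallback is an assumption added so the sketch also builds on machines without NEON. On AArch64, compilers typically lower vfmaq_f32 to a single FMLA instruction, though the exact code generated depends on the compiler and flags.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: acc[i] += a[i] * b[i] for every i in [0, n). */
void fma_accumulate(float *acc, const float *a, const float *b, size_t n)
{
    assert(acc != NULL && a != NULL && b != NULL);
    size_t i = 0;

#if defined(__aarch64__)
    /* Process four floats per iteration in 128-bit vector registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va   = vld1q_f32(a + i);
        float32x4_t vb   = vld1q_f32(b + i);
        float32x4_t vacc = vld1q_f32(acc + i);
        /* vfmaq_f32(x, y, z) returns x + y * z; usually one FMLA. */
        vacc = vfmaq_f32(vacc, va, vb);
        vst1q_f32(acc + i, vacc);
    }
#endif

    /* Scalar tail (and the whole loop on non-AArch64 builds). */
    for (; i < n; i++) {
        acc[i] += a[i] * b[i];
    }
}
```

Compiling this with optimization and inspecting the disassembly (for example with objdump -d) is one way to check which instruction the intrinsic actually became on your toolchain.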


2. "NEON Programmer's Guide" (DEN0018A) ARM Ltd. — free download from developer.arm.com

A guide devoted specifically to ARM NEON programming. Covers data types, load/store operations, the de-interleaving load instructions (LD2/LD3/LD4), and optimization tips. More accessible than the Architecture Reference Manual for learning NEON.
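To make the de-interleaving loads concrete, the following sketch splits interleaved stereo samples (L0 R0 L1 R1 ...) into separate left and right arrays. The function name deinterleave_stereo and the stereo-audio scenario are assumptions chosen for illustration, not taken from the guide. On AArch64 the vld2q_f32 intrinsic is normally implemented with an LD2 instruction; the scalar loop handles the tail and non-NEON builds.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: split L0 R0 L1 R1 ... into left[] and right[]. */
void deinterleave_stereo(const float *interleaved,
                         float *left, float *right, size_t frames)
{
    assert(interleaved != NULL && left != NULL && right != NULL);
    size_t i = 0;

#if defined(__aarch64__)
    /* One structure load reads 8 floats and separates even and odd lanes. */
    for (; i + 4 <= frames; i += 4) {
        float32x4x2_t lr = vld2q_f32(interleaved + 2 * i);
        vst1q_f32(left + i,  lr.val[0]);   /* L0 L1 L2 L3 */
        vst1q_f32(right + i, lr.val[1]);   /* R0 R1 R2 R3 */
    }
#endif

    /* Scalar tail (and the whole loop on non-AArch64 builds). */
    for (; i < frames; i++) {
        left[i]  = interleaved[2 * i];
        right[i] = interleaved[2 * i + 1];
    }
}
```

The same pattern extends to three- and four-channel data with vld3q_f32 and vld4q_f32, which correspond to LD3 and LD4.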


3. "Apple Silicon Reference" — Apple Developer Documentation Apple Inc. — https://developer.apple.com/documentation

Documents the Apple Silicon ABI, including stack layout, register usage, calling convention differences from standard AAPCS64, and system call conventions. Essential for macOS ARM64 assembly programming.


4. XNU Source Code (Apple's Kernel) Apple Open Source — https://github.com/apple-oss-distributions/xnu

The macOS kernel source includes ARM64-specific system call tables, trap handling code, and low-level assembly in osfmk/arm64/. Useful for understanding exactly how SVC #0x80 is handled and what happens to registers during a macOS system call.


5. "Writing Arm Assembly Code" — ARM Developer Blog Series developer.arm.com/blogs

Practical blog series covering ARM64 assembly optimization: loop unrolling, pipeline utilization, memory access patterns, and NEON vectorization. Includes code examples targeting Cortex-A series processors (the core family used in Raspberry Pi boards).


6. glibc ARM64 String Function Implementations https://github.com/bminor/glibc/tree/master/sysdeps/aarch64

Production NEON implementations of memcpy, memset, memmove, strlen, and strcpy optimized for various ARM64 microarchitectures. Study these to see real-world NEON programming, including multi-register loads (LD4), prefetching (PRFM), and alignment handling.


7. "Optimizing Memory Accesses on Arm Processors" ARM Ltd. Application Note AN455 — free download

Covers the memory access patterns that achieve best throughput on ARM Cortex-A processors: cache line size, prefetching with PRFM, alignment requirements, and when to use LDP/STP vs. NEON loads. Directly applicable to the memcpy/memset implementations in this chapter.
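As a rough sketch of software prefetching from C, the loop below issues one prefetch hint per cache line while summing an array. The function name sum_with_prefetch, the 64-byte line size, and the 256-byte prefetch distance are all assumptions for illustration; real values need to be measured on the target core. __builtin_prefetch is a GCC/Clang builtin, and on AArch64 these compilers generally emit a PRFM instruction for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES   64   /* assumed line size; check your core */
#define PREFETCH_DISTANCE 256   /* assumed look-ahead in bytes; tune by measurement */

/* Hypothetical helper: sum n 64-bit values, hinting upcoming lines. */
uint64_t sum_with_prefetch(const uint64_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    const size_t per_line = CACHE_LINE_BYTES / sizeof(uint64_t);
    const size_t ahead    = PREFETCH_DISTANCE / sizeof(uint64_t);
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* One hint per cache line, only while the target is in bounds. */
        if (i % per_line == 0 && i + ahead < n) {
            __builtin_prefetch(&data[i + ahead], 0 /* read */, 3 /* keep in cache */);
        }
#endif
        total += data[i];
    }
    return total;
}
```

A prefetch is only a hint, so the result is the same with or without it; whether it helps at all depends on the hardware prefetcher and the access pattern, which is why measurement matters.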


8. "The Mach-O Format" — Apple's Binary Format Documentation Apple Inc. — Inside Mach-O Binaries, developer.apple.com

Covers the structure of Mach-O executables, the load command format, section types, and symbol tables. Essential for understanding what the linker produces and how dyld (the dynamic linker) loads programs on macOS.


9. "lld Linker and ARM64" — LLVM Documentation https://lld.llvm.org

Documentation for lld, LLVM's linker, which Clang can use via -fuse-ld=lld (Apple's default toolchain ships its own linker, ld). Covers the link model for ARM64, including ADRP relocation types, GOT-relative addressing, and PLT generation for shared libraries. Relevant when your assembly programs call library functions.


10. "SIMD in the Library: Demystifying NEON" — Linaro Connect Presentation Linaro Connect, available on YouTube and linaro.org

A 45-minute talk covering practical NEON optimization for production code: vectorization of common patterns, pitfalls, measurement, and comparison with auto-vectorization. Includes the FIR filter example (dot product) from Case Study 18-1 in more detail.