In This Chapter
- Putting It Together
- 18.1 Arrays on ARM64
- 18.2 String Operations Without String Instructions
- 18.3 ARM64 Floating-Point
- 18.4 NEON SIMD: ARM64's Answer to SSE/AVX
- 18.5 ARM64 Linux System Programming
- 18.6 AArch64 vs. AArch32
- 18.7 Apple Silicon: ARM64 on macOS
- 18.8 Side-by-Side: Array Sum in x86-64 and ARM64
- Summary
Chapter 18: ARM64 Programming
Putting It Together
Chapters 16 and 17 gave you the register model and the instruction vocabulary. This chapter puts them together into real programs: array operations, string processing without string instructions, floating-point math, NEON SIMD, Linux system programming, and the differences you need to know for Apple Silicon.
18.1 Arrays on ARM64
The Scale Factor Problem
In x86-64, the SIB (Scale, Index, Base) addressing mode lets you access array elements with a built-in scale factor:
; x86-64: access 8-byte element arr[i]
mov rax, [rbx + rcx*8] ; rbx = arr, rcx = i, *8 for 8-byte elements
The *8 is part of the instruction encoding. ARM64 has no such built-in. To access an 8-byte element, you must shift the index:
// ARM64: access 8-byte element arr[i]
// X0 = arr, X1 = i
LDR X2, [X0, X1, LSL #3] // X2 = arr[i]: address = X0 + (X1 << 3) = X0 + X1*8
The shift is part of the register-offset addressing mode. You can use LSL #0 (no shift, 1-byte elements), LSL #1 (2-byte), LSL #2 (4-byte), LSL #3 (8-byte). These cover byte, halfword, word, and doubleword element sizes.
For other element sizes (3, 5, 6, 7 bytes), you'd need an explicit multiply or a sequence:
// 3-byte elements (unusual but illustrative):
// X0 = arr, X1 = i
// Need: offset = i * 3
ADD X2, X1, X1, LSL #1 // X2 = X1 + (X1 << 1) = X1 + 2*X1 = 3*X1
LDR W3, [X0, X2] // Load 4 bytes (reads one byte past the 3-byte element; safe only if not at the very end of the buffer)
Complete Array Sum Example
// Sum an array of int64_t
int64_t array_sum(const int64_t *arr, size_t count) {
int64_t sum = 0;
for (size_t i = 0; i < count; i++) {
sum += arr[i];
}
return sum;
}
ARM64 assembly:
// array_sum: X0 = arr, X1 = count, returns X0 = sum
.global array_sum
array_sum:
MOV X2, XZR // sum = 0
CBZ X1, .sum_done // if count == 0, return 0
.sum_loop:
LDR X3, [X0], #8 // X3 = *arr; arr++ (post-indexed by 8)
ADD X2, X2, X3 // sum += *arr
SUBS X1, X1, #1 // count-- (SUBS to set Z flag)
B.NE .sum_loop // if count != 0, continue
.sum_done:
MOV X0, X2 // return sum
RET
Multi-Element Load with LDP
LDP can load two consecutive array elements at once, reducing load instructions by half:
// Sum pairs of int64_t elements at a time (unrolled loop)
// Requires count to be even (or handle odd count separately)
array_sum_unrolled:
MOV X2, XZR // sum = 0
LSR X3, X1, #1 // X3 = count / 2 (pairs to process)
CBZ X3, .us_done
.us_loop:
LDP X4, X5, [X0], #16 // X4 = arr[i], X5 = arr[i+1]; arr += 16
ADD X2, X2, X4 // sum += X4
ADD X2, X2, X5 // sum += X5
SUBS X3, X3, #1
B.NE .us_loop
.us_done:
// Handle odd element if count was odd
TBNZ X1, #0, .us_odd // if bit 0 of count is set, count was odd
MOV X0, X2
RET
.us_odd:
LDR X4, [X0] // load the last element
ADD X2, X2, X4
MOV X0, X2
RET
Post-Increment for Array Traversal
The post-indexed addressing mode is ideal for array traversal:
// Find the maximum value in an int32_t array
// X0 = arr, W1 = count, returns W0 = max
find_max:
SXTW X1, W1 // sign-extend count to 64-bit
CBZ X1, .max_done
LDR W0, [X0], #4 // W0 = arr[0]; arr++
SUBS X1, X1, #1
B.EQ .max_done // only one element
.max_loop:
LDR W2, [X0], #4 // W2 = next element; arr++
CMP W2, W0
CSEL W0, W2, W0, GT // W0 = max(W0, W2) — branchless!
SUBS X1, X1, #1
B.NE .max_loop
.max_done:
RET
The CSEL (conditional select) here avoids a conditional branch entirely — ARM64's answer to branchless min/max.
18.2 String Operations Without String Instructions
ARM64 has no MOVSB, SCASB, STOSD, or any other x86-style string instructions. Every byte must be processed by explicit load-operate-store sequences. The saving grace: NEON can process 16 bytes at a time when performance matters.
memset
void *memset(void *dest, int c, size_t n);
ARM64 implementation:
// memset: X0 = dest, W1 = byte value, X2 = count
// Returns X0 = dest (original)
.global my_memset
my_memset:
MOV X3, X0 // save dest for return
// Create an 8-byte-wide version of the fill byte
AND W1, W1, #0xFF // ensure W1 is 8-bit
// Replicate byte into all 8 bytes of X1:
ORR W1, W1, W1, LSL #8 // W1 = byte | byte<<8 = 2-byte repeat
ORR W1, W1, W1, LSL #16 // W1 = 4-byte repeat
ORR X1, X1, X1, LSL #32 // X1 = 8-byte repeat (full 64-bit pattern)
// Set 16 bytes at a time using STP
LSR X4, X2, #4 // X4 = count / 16
CBZ X4, .mset_tail8
.mset_16:
STP X1, X1, [X0], #16 // store 16 bytes (two 8-byte stores)
SUBS X4, X4, #1
B.NE .mset_16
.mset_tail8:
TBZ X2, #3, .mset_tail4 // if bit 3 of count is 0, no 8-byte chunk
STR X1, [X0], #8 // store 8 bytes
.mset_tail4:
TBZ X2, #2, .mset_tail2 // if bit 2 of count is 0, no 4-byte chunk
STR W1, [X0], #4 // store 4 bytes
.mset_tail2:
TBZ X2, #1, .mset_tail1
STRH W1, [X0], #2 // store 2 bytes
.mset_tail1:
TBZ X2, #0, .mset_done
STRB W1, [X0] // store 1 byte
.mset_done:
MOV X0, X3 // return original dest
RET
💡 Mental Model: The TBZ X2, #N, label instructions test specific bits of the count to handle the tail (the remainder after processing 16-byte chunks). Bit 3 set → 8-byte tail chunk; bit 2 set → 4-byte; bit 1 → 2-byte; bit 0 → 1-byte. This handles any count from 0 to 15 in the tail.
memcpy with LDP/STP
// memcpy: X0 = dest, X1 = src, X2 = count
// Returns X0 = dest
.global my_memcpy
my_memcpy:
MOV X3, X0 // save dest
LSR X4, X2, #4 // X4 = count / 16
CBZ X4, .mcpy_tail
.mcpy_16:
LDP X5, X6, [X1], #16 // load 16 bytes from src
STP X5, X6, [X0], #16 // store 16 bytes to dest
SUBS X4, X4, #1
B.NE .mcpy_16
.mcpy_tail:
ANDS X2, X2, #15 // X2 = count % 16
CBZ X2, .mcpy_done
// Handle remaining bytes (1-15)
TBZ X2, #3, .mcpy_tail4
LDR X5, [X1], #8
STR X5, [X0], #8
.mcpy_tail4:
TBZ X2, #2, .mcpy_tail2
LDR W5, [X1], #4
STR W5, [X0], #4
.mcpy_tail2:
TBZ X2, #1, .mcpy_tail1
LDRH W5, [X1], #2
STRH W5, [X0], #2
.mcpy_tail1:
TBZ X2, #0, .mcpy_done
LDRB W5, [X1]
STRB W5, [X0]
.mcpy_done:
MOV X0, X3
RET
⚡ Performance Note: The LDP/STP pair approach processes 16 bytes per loop iteration. glibc's ARM64 memcpy goes further, unrolling the main loop to process 64 or even 128 bytes per iteration using multiple LDP/STP instructions and prefetch hints (PRFM). For buffers larger than ~4KB, NEON loads (LDR Q) or even SVE (Scalable Vector Extension, ARM's variable-width SIMD) are used.
18.3 ARM64 Floating-Point
The FP/SIMD Register File
ARM64 has 32 FP/SIMD registers, each 128 bits wide. They have multiple names depending on how you use them:
FP/SIMD Register Naming
┌──────────┬────────────────────────────────────────────────────────────┐
│ Name │ Meaning │
├──────────┼────────────────────────────────────────────────────────────┤
│ Vn │ 128-bit (the full register, usually for NEON vector ops) │
│ Qn │ 128-bit (alternative name for Vn in scalar context) │
│ Dn │ Low 64 bits (double-precision float scalar) │
│ Sn │ Low 32 bits (single-precision float scalar) │
│ Hn │ Low 16 bits (half-precision float scalar — ARMv8.2+) │
│ Bn │ Low 8 bits (byte — for NEON operations) │
└──────────┴────────────────────────────────────────────────────────────┘
V0-V7: FP/SIMD argument and return registers (caller-saved)
V8-V15: Callee-saved (but only the lower 64 bits must be preserved!)
V16-V31: Caller-saved temporaries
⚠️ Common Mistake: For AAPCS64 FP, V8-V15 are callee-saved — but only the lower 64 bits (D8-D15). The upper 64 bits are NOT required to be preserved. If you use V8-V15 as 128-bit NEON registers, you must save and restore the full 128 bits yourself.
Scalar Floating-Point Instructions
// Arithmetic
FADD Dd, Dn, Dm // Dd = Dn + Dm (double)
FADD Sd, Sn, Sm // Sd = Sn + Sm (float)
FSUB Dd, Dn, Dm
FMUL Dd, Dn, Dm
FDIV Dd, Dn, Dm
FSQRT Dd, Dn // Dd = sqrt(Dn)
FNEG Dd, Dn // Dd = -Dn
FABS Dd, Dn // Dd = |Dn|
// Multiply-accumulate (fused, single rounding)
FMADD Dd, Dn, Dm, Da // Dd = Da + Dn*Dm (fused multiply-add)
FMSUB Dd, Dn, Dm, Da // Dd = Da - Dn*Dm
FNMADD Dd, Dn, Dm, Da // Dd = -Da - Dn*Dm
FNMSUB Dd, Dn, Dm, Da // Dd = -Da + Dn*Dm
// Comparison
FCMP Dn, Dm // sets condition flags (FPSR N, Z, C, V)
FCMP Dn, #0.0 // compare to zero
FCCMP Dn, Dm, nzcv, cond // conditional FP compare
// Conversion
FCVT Dd, Sn // single → double
FCVT Sd, Dn // double → single
SCVTF Dd, Xn // signed int64 → double
SCVTF Sd, Wn // signed int32 → float
UCVTF Dd, Xn // unsigned int64 → double
FCVTZS Xd, Dn // double → int64 (truncate toward zero)
FCVTZS Wd, Sn // float → int32 (truncate toward zero)
FCVTZU Xd, Dn // double → uint64 (truncate toward zero)
// Move between GP and FP registers
FMOV Dd, Xn // Dd = Xn (bitwise, no conversion)
FMOV Xd, Dn // Xd = Dn (bitwise, no conversion)
FMOV Sd, Wn // Sd = Wn (bitwise, 32-bit)
FMOV Wd, Sn // Wd = Sn (bitwise, 32-bit)
Floating-Point Function Example
double hypot(double a, double b) {
return sqrt(a*a + b*b);
}
ARM64 assembly:
// hypot: D0 = a, D1 = b, returns D0 = sqrt(a*a + b*b)
.global my_hypot
my_hypot:
FMUL D2, D0, D0 // D2 = a*a
FMADD D2, D1, D1, D2 // D2 = D2 + b*b = a*a + b*b (fused!)
FSQRT D0, D2 // D0 = sqrt(a*a + b*b)
RET
⚡ Performance Note: FMADD (fused multiply-add) performs a*a + b*b with a single rounding step, which is both faster and more numerically accurate than separate FMUL + FADD. This is one of ARM64's advantages over architectures without native FMA instructions.
Floating-Point Load/Store
FP registers load and store with LDR/STR, but using FP register names:
LDR D0, [X0] // Load double from [X0] into D0
LDR S0, [X0] // Load float from [X0] into S0
STR D0, [X0] // Store D0 to [X0]
STR S0, [X0] // Store S0 to [X0]
LDP D0, D1, [X0] // Load two doubles
STP D0, D1, [X0] // Store two doubles
18.4 NEON SIMD: ARM64's Answer to SSE/AVX
NEON is ARM's SIMD (Single Instruction, Multiple Data) extension. It operates on the V registers (128 bits) in vector mode, treating each register as a vector of smaller elements.
Vector Data Types
NEON Vector Types
┌───────────────┬─────────────────────────────────────────────────────────┐
│ Suffix │ Meaning │
├───────────────┼─────────────────────────────────────────────────────────┤
│ V0.16B │ 16 × 8-bit bytes (uint8) │
│ V0.8H │ 8 × 16-bit halfwords (uint16/int16) │
│ V0.4S │ 4 × 32-bit words (uint32/int32/float32) │
│ V0.2D │ 2 × 64-bit doublewords (uint64/int64/float64) │
│ V0.4S (FP) │ 4 × single-precision floats (with FADD/FMUL etc.) │
│ V0.2D (FP) │ 2 × double-precision floats │
└───────────────┴─────────────────────────────────────────────────────────┘
NEON Integer SIMD
// Add 4 int32_t pairs simultaneously
ADD V0.4S, V1.4S, V2.4S // V0[i] = V1[i] + V2[i] for i in 0..3
// Multiply 4 int32_t pairs
MUL V0.4S, V1.4S, V2.4S // V0[i] = V1[i] * V2[i]
// Compare 16 bytes
CMEQ V0.16B, V1.16B, V2.16B // V0[i] = 0xFF if V1[i]==V2[i], else 0
// Saturating add (bytes stay in [0,255], no wrap)
UQADD V0.16B, V1.16B, V2.16B
// Horizontal max of all 4 uint32_t in vector
UMAXV S0, V1.4S // S0 = unsigned max of all four 32-bit elements (use SMAXV for signed)
NEON Floating-Point SIMD
// Add 4 float32s simultaneously
FADD V0.4S, V1.4S, V2.4S // V0[i] = V1[i] + V2[i] for i in 0..3
// Multiply-accumulate: 4 float32s
FMLA V0.4S, V1.4S, V2.4S // V0[i] += V1[i] * V2[i]
// Multiply-accumulate with scalar element (broadcasting)
// V2.S[0] means: use scalar element 0 of V2 for all 4 multiplies
FMLA V0.4S, V1.4S, V2.S[0] // V0[i] += V1[i] * V2[0] for all i
// Reduce: sum all 4 floats in vector
FADDP V0.4S, V1.4S, V2.4S // pairwise add → V0.4S = [a0+a1, a2+a3, b0+b1, b2+b3]
FADDP S0, V0.2S // S0 = V0[0] + V0[1] (horizontal sum of 2)
NEON Load/Store with Multiple Registers
// Load 4 consecutive int32 elements into V0
LDR Q0, [X0] // Load 128 bits (4×int32) into V0
// Load 4 registers worth of interleaved data (AoS → SoA)
// LD4 is NEON's deinterleave instruction
LD4 {V0.4S-V3.4S}, [X0] // Load 4×16 bytes, deinterleaving
// Store
STR Q0, [X0] // Store V0 (128 bits)
ST4 {V0.4S-V3.4S}, [X0] // Store with interleaving
Complete NEON Example: Array Sum
// sum_float_neon: sum an array of float32
// X0 = arr, X1 = count (must be multiple of 4 for simplicity)
// Returns S0 = sum
.global sum_float_neon
sum_float_neon:
MOVI V0.4S, #0 // accumulator = {0, 0, 0, 0}
LSR X2, X1, #2 // X2 = count / 4 (number of NEON iterations)
CBZ X2, .sneon_done
.sneon_loop:
LDR Q1, [X0], #16 // load 4 float32s (128 bits), X0 += 16
FADD V0.4S, V0.4S, V1.4S // V0 += V1 (4 floats at once)
SUBS X2, X2, #1
B.NE .sneon_loop
// Horizontal sum: reduce V0 to a single float
FADDP V0.4S, V0.4S, V0.4S // V0 = [s0+s1, s2+s3, s0+s1, s2+s3]
FADDP S0, V0.2S // S0 = (s0+s1) + (s2+s3) = total sum
.sneon_done:
RET
Register trace (arr = [1.0, 2.0, 3.0, 4.0], count = 4):
| Step | V0.4S | V1.4S | Notes |
|---|---|---|---|
| MOVI | {0,0,0,0} | — | init accumulator |
| LDR Q1 | {0,0,0,0} | {1.0,2.0,3.0,4.0} | load 4 floats |
| FADD V0.4S | {1.0,2.0,3.0,4.0} | {1.0,2.0,3.0,4.0} | accumulate |
| SUBS | {1.0,2.0,3.0,4.0} | {1.0,2.0,3.0,4.0} | count 1→0, exit loop |
| FADDP V0.4S | {3.0,7.0,3.0,7.0} | {1.0,2.0,3.0,4.0} | pairwise sum |
| FADDP S0 | S0 = 10.0 | {1.0,2.0,3.0,4.0} | final sum |
📊 C Comparison: GCC with -O2 -march=armv8-a+simd will auto-vectorize simple loops like this into NEON FMLA/FADD instructions. This is NEON's counterpart to SSE2 from Chapter 15 (x86-64 SIMD) — both are 128-bit SIMD with 4×float32 lanes.
18.5 ARM64 Linux System Programming
Complete File I/O Program
The following demonstrates reading a file into a buffer and writing it to stdout — the basis for implementing cat:
// cat_arm64.s — Minimal cat(1) implementation
// Opens argv[1], reads it, writes to stdout, exits
.section .data
err_msg: .asciz "Error: could not open file\n"
err_len = . - err_msg
.section .bss
.align 3
buf: .space 4096 // 4KB read buffer
.section .text
.global _start
_start:
// Stack on entry: [sp+0]=argc, [sp+8]=argv[0], [sp+16]=argv[1], ...
LDR X0, [SP, #16] // X0 = argv[1] (filename)
CBZ X0, .usage_error // if no argument, error
// === openat(AT_FDCWD, filename, O_RDONLY, 0) ===
MOV X8, #56 // openat
MOV X1, X0 // pathname = argv[1]
MOV X0, #-100 // AT_FDCWD
MOV X2, #0 // O_RDONLY = 0
MOV X3, #0 // mode (ignored for O_RDONLY)
SVC #0
CMP X0, #0
B.LT .open_error // if negative, error
MOV X19, X0 // save fd in callee-saved X19
.read_loop:
// === read(fd, buf, 4096) ===
MOV X8, #63 // read
MOV X0, X19 // fd
ADR X1, buf // buffer
MOV X2, #4096 // count
SVC #0
CMP X0, #0
B.EQ .close_and_exit // 0 bytes = EOF
B.LT .close_and_exit // read error (simplified: treated like EOF)
MOV X20, X0 // save bytes read
// === write(stdout, buf, bytes_read) ===
MOV X8, #64 // write
MOV X0, #1 // stdout
ADR X1, buf
MOV X2, X20 // bytes to write
SVC #0
B .read_loop
.close_and_exit:
// === close(fd) ===
MOV X8, #57 // close
MOV X0, X19
SVC #0
B .exit_success
.open_error:
// write error message to stderr
MOV X8, #64
MOV X0, #2 // stderr
ADR X1, err_msg
MOV X2, #err_len
SVC #0
MOV X0, #1 // exit with error
B .exit
.usage_error:
MOV X0, #1
.exit:
MOV X8, #93
SVC #0
.exit_success:
MOV X8, #93
MOV X0, #0
SVC #0
Key observations:
- argv[1] lives at [SP+16] at program entry (argc is at [SP], argv[0] at [SP+8])
- X19 is used to preserve the file descriptor across system calls (X0-X7 are caller-saved)
- The read/write loop continues until read() returns 0 (EOF) or negative (error)
18.6 AArch64 vs. AArch32
ARM64 (AArch64) is not just a wider version of ARM32 (AArch32). Key differences:
AArch64 vs. AArch32
┌─────────────────────────┬─────────────────────────┬─────────────────────┐
│ Feature │ AArch64 │ AArch32 │
├─────────────────────────┼─────────────────────────┼─────────────────────┤
│ GP registers │ 31 × 64-bit │ 16 × 32-bit │
│ Instruction encoding │ Fixed 32-bit │ 32-bit + Thumb-2 │
│ Per-instr condition code│ Removed (CSEL instead) │ Every instruction! │
│ Stack model │ 16-byte aligned │ 8-byte aligned │
│ NEON support │ Always present │ Optional extension │
│ FP registers │ 32 × 128-bit │ 16/32 × 64-bit │
│ Addressing modes │ Fewer, cleaner │ More complex │
│ Syscall instruction │ SVC #0 │ SVC #0 (same) │
│ Return address │ X30 (LR register) │ R14 (LR register) │
│ Address space           │ 64-bit (48-bit used)    │ 32-bit              │
└─────────────────────────┴─────────────────────────┴─────────────────────┘
ARMv8 processors can run both AArch64 and AArch32 code. A 64-bit kernel running at Exception Level 1 can schedule 32-bit user processes, which execute in AArch32 state at EL0. Android leaned on this interworking heavily during its 2013-2017 64-bit migration.
18.7 Apple Silicon: ARM64 on macOS
The Apple M-series chips run ARM64. Apple's implementation differs from Linux in several ways that matter for assembly programmers.
System Calls on macOS
macOS uses a different mechanism for system calls:
// macOS ARM64 system call via libsystem (preferred)
// Don't call syscalls directly — use the library stubs
// However, if you must make raw syscalls:
// The syscall number goes in X16, not X8!
// Return value is still in X0
// macOS write(fd, buf, count):
MOV X16, #4 // write on macOS arm64 = 4 (different from Linux's 64!)
MOV X0, #1 // stdout
ADR X1, msg
MOV X2, #len
SVC #0x80 // macOS uses SVC #0x80 (not SVC #0)
The #0x80 immediate in SVC tells the macOS kernel this is a Unix system call (as opposed to a Mach trap at #0, or a thread call at #0xC0). macOS syscall numbers follow the BSD numbering, not the Linux generic table.
Mach-O Binary Format
macOS uses Mach-O, not ELF. The segment names differ:
ELF Mach-O
─────────────────────────
.text __TEXT,__text
.data __DATA,__data
.bss __DATA,__bss
.rodata __TEXT,__const
An ARM64 Mach-O "Hello World":
// hello_macos_arm64.s
// Assemble: as -arch arm64 hello_macos_arm64.s -o hello.o
// Link: ld -arch arm64 -e _start -lSystem -syslibroot $(xcrun --show-sdk-path) hello.o -o hello
// Run: ./hello
.global _start // macOS entry point is _start (or _main via libc)
.p2align 2 // align to 4-byte boundary
.section __TEXT,__text
_start:
MOV X16, #4 // write syscall on macOS ARM64
MOV X0, #1 // stdout
ADR X1, msg
MOV X2, #14 // length of "Hello, macOS!\n"
SVC #0x80 // macOS syscall
MOV X16, #1 // exit syscall on macOS ARM64
MOV X0, #0
SVC #0x80
.section __TEXT,__const
msg:
.ascii "Hello, macOS!\n"
Rosetta 2: x86-64 on Apple Silicon
Rosetta 2 is Apple's x86-64 to ARM64 binary translation layer. When you run an x86-64 binary on M1, Rosetta 2:
- Translates the x86-64 instructions to ARM64 at the first run (ahead-of-time translation, cached)
- Handles x86-64's memory ordering model (TSO) vs. ARM64's weaker ordering (by inserting memory barriers)
- Emulates x86-64 semantics including RFLAGS, segment registers, and other x86 quirks
Performance: Rosetta 2-translated x86-64 code typically runs at 70-85% of native speed — which, given the M-series chips' raw performance advantage, often means it still outpaces an Intel Mac running the same x86-64 binary natively.
🔐 Security Note: Rosetta 2 runs x86-64 code in a sandboxed execution environment. From a security perspective, an x86-64 program running under Rosetta cannot directly make ARM64 system calls — it goes through Rosetta's translation layer. This creates a distinct threat model for exploits.
18.8 Side-by-Side: Array Sum in x86-64 and ARM64
Array Sum: C, x86-64, and ARM64 Compared
════════════════════════════════════════════════════════════════════════════
C source:
int64_t sum(const int64_t *arr, int n) {
int64_t s = 0;
for (int i = 0; i < n; i++) s += arr[i];
return s;
}
x86-64 (GCC -O2): ARM64 (GCC -O2):
───────────────────────────── ──────────────────────────────
sum: sum:
xor eax, eax MOV X2, XZR
test esi, esi CMP W1, #0
jle .done B.LE .done
xor ecx, ecx SXTW X3, W1
movsxd rdx, esi MOV X4, X0
.loop: .loop:
add rax, [rdi + rcx*8] LDR X5, [X4], #8
inc rcx ADD X2, X2, X5
cmp rcx, rdx SUBS X3, X3, #1
jne .loop B.NE .loop
.done: .done:
ret MOV X0, X2
RET
───────────────────────────── ──────────────────────────────
10 instructions                     11 instructions
Uses RCX as index, scales *8        Uses post-increment pointer
x86 implicit memory operand         ARM64 explicit load instruction
(add rax, [rdi+rcx*8])              (LDR X5, [X4], #8)
MOVSXD sign-extends loop limit      SXTW sign-extends loop limit
The instruction counts are nearly identical (10 vs. 11; ARM64 needs one extra MOV to place the result in X0). The key difference is structural: x86-64 folds the memory access into the ADD instruction, while ARM64 explicitly loads and then adds. In practice, on modern microarchitectures, both sequences execute in roughly the same time thanks to pipelining.
🔄 Check Your Understanding:
1. Why does LDR X2, [X0, X1, LSL #3] work for accessing int64_t array elements but not for int32_t elements (without changing the shift)?
2. What does FMLA V0.4S, V1.4S, V2.S[0] do differently from FMLA V0.4S, V1.4S, V2.4S?
3. Why does macOS use SVC #0x80 while Linux uses SVC #0?
4. What NEON register type would you use to process 8 uint16_t values at once?
5. Why does FMADD produce a more numerically accurate result than separate FMUL + FADD?
Summary
ARM64 programming requires explicit array index scaling (via LSL in addressing modes), explicit memcpy/memset loops (no string instructions), and explicit load-before-compute discipline. NEON SIMD provides high-throughput parallel computation on 128-bit registers for arrays, image processing, audio, cryptography, and any vectorizable computation.
Apple Silicon changes the game on macOS: different syscall numbers (X16, not X8), different binary format (Mach-O, not ELF), different section names, and the SVC #0x80 mechanism. Rosetta 2 enables x86-64 compatibility at the cost of ~15-30% performance overhead.
The load/store discipline that feels verbose in small examples becomes a feature at scale: each instruction does exactly one thing, the CPU's out-of-order engine can find and exploit instruction-level parallelism more easily, and performance is predictable.