
Chapter 18: ARM64 Programming

Putting It Together

Chapters 16 and 17 gave you the register model and the instruction vocabulary. This chapter puts them together into real programs: array operations, string processing without string instructions, floating-point math, NEON SIMD, Linux system programming, and the differences you need to know for Apple Silicon.


18.1 Arrays on ARM64

The Scale Factor Problem

In x86-64, the SIB (Scale, Index, Base) addressing mode lets you access array elements with a built-in scale factor:

; x86-64: access 8-byte element arr[i]
mov rax, [rbx + rcx*8]    ; rbx = arr, rcx = i, *8 for 8-byte elements

The *8 is part of the instruction encoding. ARM64 has no such built-in. To access an 8-byte element, you must shift the index:

// ARM64: access 8-byte element arr[i]
// X0 = arr, X1 = i
LDR X2, [X0, X1, LSL #3]    // X2 = arr[i]: address = X0 + (X1 << 3) = X0 + X1*8

The shift is part of the register-offset addressing mode. You can use LSL #0 (no shift, 1-byte elements), LSL #1 (2-byte), LSL #2 (4-byte), LSL #3 (8-byte). These cover byte, halfword, word, and doubleword element sizes.

For other element sizes (3, 5, 6, 7 bytes), you'd need an explicit multiply or a sequence:

// 3-byte elements (unusual but illustrative):
// X0 = arr, X1 = i
// Need: offset = i * 3
ADD  X2, X1, X1, LSL #1    // X2 = X1 + (X1 << 1) = X1 + 2*X1 = 3*X1
LDR  W3, [X0, X2]          // Load 4 bytes (reads 1 byte past the element — safe only if the buffer extends beyond it)

Complete Array Sum Example

// Sum an array of int64_t
int64_t array_sum(const int64_t *arr, size_t count) {
    int64_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += arr[i];
    }
    return sum;
}

ARM64 assembly:

// array_sum: X0 = arr, X1 = count, returns X0 = sum
.global array_sum
array_sum:
    MOV  X2, XZR              // sum = 0
    CBZ  X1, .sum_done        // if count == 0, return 0

.sum_loop:
    LDR  X3, [X0], #8         // X3 = *arr; arr++ (post-indexed by 8)
    ADD  X2, X2, X3           // sum += *arr
    SUBS X1, X1, #1           // count-- (SUBS to set Z flag)
    B.NE .sum_loop             // if count != 0, continue

.sum_done:
    MOV  X0, X2               // return sum
    RET

Multi-Element Load with LDP

LDP can load two consecutive array elements at once, reducing load instructions by half:

// Sum two int64_t elements per iteration (unrolled loop)
// The tail after the loop handles an odd count
array_sum_unrolled:
    MOV  X2, XZR              // sum = 0
    LSR  X3, X1, #1           // X3 = count / 2 (pairs to process)
    CBZ  X3, .us_done

.us_loop:
    LDP  X4, X5, [X0], #16    // X4 = arr[i], X5 = arr[i+1]; arr += 16
    ADD  X2, X2, X4           // sum += X4
    ADD  X2, X2, X5           // sum += X5
    SUBS X3, X3, #1
    B.NE .us_loop

.us_done:
    // Handle odd element if count was odd
    TBNZ X1, #0, .us_odd      // if bit 0 of count is set, count was odd
    MOV  X0, X2
    RET
.us_odd:
    LDR  X4, [X0]             // load the last element
    ADD  X2, X2, X4
    MOV  X0, X2
    RET

Post-Increment for Array Traversal

The post-indexed addressing mode is ideal for array traversal:

// Find the maximum value in an int32_t array
// X0 = arr, W1 = count, returns W0 = max
find_max:
    SXTW  X1, W1              // sign-extend count to 64-bit
    CMP   X1, #0
    B.LE  .max_done           // zero or negative count: nothing to scan

    LDR   W0, [X0], #4        // W0 = arr[0]; arr++
    SUBS  X1, X1, #1
    B.EQ  .max_done           // only one element

.max_loop:
    LDR   W2, [X0], #4        // W2 = next element; arr++
    CMP   W2, W0
    CSEL  W0, W2, W0, GT      // W0 = max(W0, W2) — branchless!
    SUBS  X1, X1, #1
    B.NE  .max_loop

.max_done:
    RET

The CSEL (conditional select) here avoids a conditional branch entirely — ARM64's answer to branchless min/max.


18.2 String Operations Without String Instructions

ARM64 has no MOVSB, SCASB, STOSD, or any other x86-style string instructions. Every byte must be processed by explicit load-operate-store sequences. The saving grace: NEON can process 16 bytes at a time when performance matters.

memset

void *memset(void *dest, int c, size_t n);

ARM64 implementation:

// memset: X0 = dest, W1 = byte value, X2 = count
// Returns X0 = dest (original)
.global my_memset
my_memset:
    MOV   X3, X0              // save dest for return

    // Create an 8-byte-wide version of the fill byte
    AND   W1, W1, #0xFF       // ensure W1 is 8-bit
    // Replicate byte into all 8 bytes of X1:
    ORR   W1, W1, W1, LSL #8  // W1 = byte | byte<<8 = 2-byte repeat
    ORR   W1, W1, W1, LSL #16 // W1 = 4-byte repeat
    ORR   X1, X1, X1, LSL #32 // X1 = 8-byte repeat (full 64-bit pattern)

    // Set 16 bytes at a time using STP
    LSR   X4, X2, #4          // X4 = count / 16
    CBZ   X4, .mset_tail8

.mset_16:
    STP   X1, X1, [X0], #16  // store 16 bytes (two 8-byte stores)
    SUBS  X4, X4, #1
    B.NE  .mset_16

.mset_tail8:
    TBZ   X2, #3, .mset_tail4 // if bit 3 of count is 0, no 8-byte chunk
    STR   X1, [X0], #8        // store 8 bytes

.mset_tail4:
    TBZ   X2, #2, .mset_tail2 // if bit 2 of count is 0, no 4-byte chunk
    STR   W1, [X0], #4        // store 4 bytes

.mset_tail2:
    TBZ   X2, #1, .mset_tail1
    STRH  W1, [X0], #2        // store 2 bytes

.mset_tail1:
    TBZ   X2, #0, .mset_done
    STRB  W1, [X0]            // store 1 byte

.mset_done:
    MOV   X0, X3              // return original dest
    RET

💡 Mental Model: The TBZ X2, #N, label instructions test specific bits of the count to handle the tail (the remainder after processing 16-byte chunks). Bit 3 set → 8-byte tail chunk; bit 2 set → 4-byte; bit 1 → 2-byte; bit 0 → 1-byte. This handles any count from 0 to 15 in the tail.

memcpy with LDP/STP

// memcpy: X0 = dest, X1 = src, X2 = count
// Returns X0 = dest

.global my_memcpy
my_memcpy:
    MOV   X3, X0              // save dest

    LSR   X4, X2, #4          // X4 = count / 16
    CBZ   X4, .mcpy_tail

.mcpy_16:
    LDP   X5, X6, [X1], #16   // load 16 bytes from src
    STP   X5, X6, [X0], #16   // store 16 bytes to dest
    SUBS  X4, X4, #1
    B.NE  .mcpy_16

.mcpy_tail:
    ANDS  X2, X2, #15         // X2 = count % 16
    CBZ   X2, .mcpy_done

    // Handle remaining bytes (1-15)
    TBZ   X2, #3, .mcpy_tail4
    LDR   X5, [X1], #8
    STR   X5, [X0], #8

.mcpy_tail4:
    TBZ   X2, #2, .mcpy_tail2
    LDR   W5, [X1], #4
    STR   W5, [X0], #4

.mcpy_tail2:
    TBZ   X2, #1, .mcpy_tail1
    LDRH  W5, [X1], #2
    STRH  W5, [X0], #2

.mcpy_tail1:
    TBZ   X2, #0, .mcpy_done
    LDRB  W5, [X1]
    STRB  W5, [X0]

.mcpy_done:
    MOV   X0, X3
    RET

⚡ Performance Note: The LDP/STP pair approach processes 16 bytes per loop iteration. glibc's ARM64 memcpy goes further, unrolling the main loop to process 64 or even 128 bytes per iteration using multiple LDP/STP instructions and prefetch hints (PRFM). For buffers larger than ~4KB, NEON loads (LDR Q) or even SVE (Scalable Vector Extension, ARM's variable-width SIMD) are used.


18.3 ARM64 Floating-Point

The FP/SIMD Register File

ARM64 has 32 FP/SIMD registers, each 128 bits wide. They have multiple names depending on how you use them:

FP/SIMD Register Naming
┌──────────┬────────────────────────────────────────────────────────────┐
│ Name     │ Meaning                                                     │
├──────────┼────────────────────────────────────────────────────────────┤
│ Vn       │ 128-bit (the full register, usually for NEON vector ops)   │
│ Qn       │ 128-bit (alternative name for Vn in scalar context)        │
│ Dn       │ Low 64 bits (double-precision float scalar)                │
│ Sn       │ Low 32 bits (single-precision float scalar)                │
│ Hn       │ Low 16 bits (half-precision float scalar — ARMv8.2+)       │
│ Bn       │ Low 8 bits (byte — for NEON operations)                    │
└──────────┴────────────────────────────────────────────────────────────┘

V0-V7:  FP/SIMD argument and return registers (caller-saved)
V8-V15: Callee-saved (but only the lower 64 bits must be preserved!)
V16-V31: Caller-saved temporaries

⚠️ Common Mistake: For AAPCS64 FP, V8-V15 are callee-saved — but only the lower 64 bits (D8-D15). The upper 64 bits are NOT required to be preserved. If you use V8-V15 as 128-bit NEON registers, you must save and restore the full 128 bits yourself.

Scalar Floating-Point Instructions

// Arithmetic
FADD  Dd, Dn, Dm     // Dd = Dn + Dm (double)
FADD  Sd, Sn, Sm     // Sd = Sn + Sm (float)
FSUB  Dd, Dn, Dm
FMUL  Dd, Dn, Dm
FDIV  Dd, Dn, Dm
FSQRT Dd, Dn          // Dd = sqrt(Dn)
FNEG  Dd, Dn          // Dd = -Dn
FABS  Dd, Dn          // Dd = |Dn|

// Multiply-accumulate (fused, single rounding)
FMADD Dd, Dn, Dm, Da  // Dd = Da + Dn*Dm  (fused multiply-add)
FMSUB Dd, Dn, Dm, Da  // Dd = Da - Dn*Dm
FNMADD Dd, Dn, Dm, Da // Dd = -Da - Dn*Dm
FNMSUB Dd, Dn, Dm, Da // Dd = -Da + Dn*Dm

// Comparison
FCMP  Dn, Dm           // sets PSTATE condition flags (N, Z, C, V)
FCMP  Dn, #0.0         // compare to zero
FCCMP Dn, Dm, nzcv, cond // conditional FP compare

// Conversion
FCVT  Dd, Sn           // single → double
FCVT  Sd, Dn           // double → single
SCVTF Dd, Xn           // signed int64 → double
SCVTF Sd, Wn           // signed int32 → float
UCVTF Dd, Xn           // unsigned int64 → double
FCVTZS Xd, Dn          // double → int64 (truncate toward zero)
FCVTZS Wd, Sn          // float  → int32 (truncate toward zero)
FCVTZU Xd, Dn          // double → uint64 (truncate toward zero)

// Move between GP and FP registers
FMOV  Dd, Xn           // Dd = Xn (bitwise, no conversion)
FMOV  Xd, Dn           // Xd = Dn (bitwise, no conversion)
FMOV  Sd, Wn           // Sd = Wn (bitwise, 32-bit)
FMOV  Wd, Sn           // Wd = Sn (bitwise, 32-bit)

Floating-Point Function Example

double hypot(double a, double b) {
    return sqrt(a*a + b*b);
}

ARM64 assembly:

// hypot: D0 = a, D1 = b, returns D0 = sqrt(a*a + b*b)
.global my_hypot
my_hypot:
    FMUL  D2, D0, D0    // D2 = a*a
    FMADD D2, D1, D1, D2 // D2 = D2 + b*b = a*a + b*b (fused!)
    FSQRT D0, D2         // D0 = sqrt(a*a + b*b)
    RET

⚡ Performance Note: FMADD (fused multiply-add) performs a*a + b*b with a single rounding step, which is both faster and more numerically accurate than separate FMUL + FADD. This is one of ARM64's advantages over architectures without native FMA instructions.

Floating-Point Load/Store

FP registers load and store with LDR/STR, but using FP register names:

LDR  D0, [X0]          // Load double from [X0] into D0
LDR  S0, [X0]          // Load float from [X0] into S0
STR  D0, [X0]          // Store D0 to [X0]
STR  S0, [X0]          // Store S0 to [X0]
LDP  D0, D1, [X0]      // Load two doubles
STP  D0, D1, [X0]      // Store two doubles

18.4 NEON SIMD: ARM64's Answer to SSE/AVX

NEON is ARM's SIMD (Single Instruction, Multiple Data) extension. It operates on the V registers (128 bits) in vector mode, treating each register as a vector of smaller elements.

Vector Data Types

NEON Vector Types
┌───────────────┬─────────────────────────────────────────────────────────┐
│ Suffix        │ Meaning                                                  │
├───────────────┼─────────────────────────────────────────────────────────┤
│ V0.16B        │ 16 × 8-bit bytes (uint8)                                │
│ V0.8H         │ 8 × 16-bit halfwords (uint16/int16)                     │
│ V0.4S         │ 4 × 32-bit words (uint32/int32/float32)                 │
│ V0.2D         │ 2 × 64-bit doublewords (uint64/int64/float64)           │
│ V0.4S (FP)    │ 4 × single-precision floats (with FADD/FMUL etc.)      │
│ V0.2D (FP)    │ 2 × double-precision floats                             │
└───────────────┴─────────────────────────────────────────────────────────┘

NEON Integer SIMD

// Add 4 int32_t pairs simultaneously
ADD  V0.4S, V1.4S, V2.4S    // V0[i] = V1[i] + V2[i] for i in 0..3

// Multiply 4 int32_t pairs
MUL  V0.4S, V1.4S, V2.4S    // V0[i] = V1[i] * V2[i]

// Compare 16 bytes
CMEQ V0.16B, V1.16B, V2.16B // V0[i] = 0xFF if V1[i]==V2[i], else 0

// Saturating add (bytes stay in [0,255], no wrap)
UQADD V0.16B, V1.16B, V2.16B

// Horizontal max of all 4 uint32_t in vector (use SMAXV for signed)
UMAXV S0, V1.4S              // S0 = max of all four unsigned 32-bit elements

NEON Floating-Point SIMD

// Add 4 float32s simultaneously
FADD V0.4S, V1.4S, V2.4S    // V0[i] = V1[i] + V2[i] for i in 0..3

// Multiply-accumulate: 4 float32s
FMLA V0.4S, V1.4S, V2.4S    // V0[i] += V1[i] * V2[i]

// Multiply-accumulate with scalar element (broadcasting)
// V2.S[0] means: use scalar element 0 of V2 for all 4 multiplies
FMLA V0.4S, V1.4S, V2.S[0]  // V0[i] += V1[i] * V2[0] for all i

// Reduce: sum all 4 floats in vector
FADDP V0.4S, V1.4S, V2.4S   // pairwise add → V0.4S = [a0+a1, a2+a3, b0+b1, b2+b3]
FADDP S0, V0.2S              // S0 = V0[0] + V0[1] (horizontal sum of 2)

NEON Load/Store with Multiple Registers

// Load 4 consecutive int32 elements into V0
LDR  Q0, [X0]              // Load 128 bits (4×int32) into V0

// Load 4 registers worth of interleaved data (AoS → SoA)
// LD4 is NEON's deinterleave instruction
LD4  {V0.4S-V3.4S}, [X0]   // Load 4×16 bytes, deinterleaving

// Store
STR  Q0, [X0]              // Store V0 (128 bits)
ST4  {V0.4S-V3.4S}, [X0]  // Store with interleaving

Complete NEON Example: Array Sum

// sum_float_neon: sum an array of float32
// X0 = arr, X1 = count (must be multiple of 4 for simplicity)
// Returns S0 = sum

.global sum_float_neon
sum_float_neon:
    MOVI  V0.4S, #0             // accumulator = {0, 0, 0, 0}
    LSR   X2, X1, #2            // X2 = count / 4 (number of NEON iterations)
    CBZ   X2, .sneon_done

.sneon_loop:
    LDR   Q1, [X0], #16         // load 4 float32s (128 bits), X0 += 16
    FADD  V0.4S, V0.4S, V1.4S  // V0 += V1 (4 floats at once)
    SUBS  X2, X2, #1
    B.NE  .sneon_loop

    // Horizontal sum: reduce V0 to a single float
    FADDP V0.4S, V0.4S, V0.4S  // V0 = [s0+s1, s2+s3, s0+s1, s2+s3]
    FADDP S0, V0.2S             // S0 = (s0+s1) + (s2+s3) = total sum

.sneon_done:
    RET

Register trace (arr = [1.0, 2.0, 3.0, 4.0], count = 4):

Step             V0.4S                 V1.4S                 Notes
──────────────────────────────────────────────────────────────────────
MOVI             {0,0,0,0}                                   init
LDR Q1           {0,0,0,0}             {1.0,2.0,3.0,4.0}     load 4 floats
FADD V0.4S       {1.0,2.0,3.0,4.0}     {1.0,2.0,3.0,4.0}     accumulate
SUBS             1→0                                         exit loop
FADDP V0.4S      {3.0,7.0,3.0,7.0}                           pairwise sum
FADDP S0         S0 = 10.0                                   final sum

📊 C Comparison: GCC with -O2 -march=armv8-a+simd will auto-vectorize simple loops like this into NEON FADD/FMLA instructions. Compare NEON here with SSE2 from Chapter 15 (x86-64 SIMD) — both are 128-bit SIMD with 4×float32 lanes.


18.5 ARM64 Linux System Programming

Complete File I/O Program

The following demonstrates reading a file into a buffer and writing it to stdout — the basis for implementing cat:

// cat_arm64.s — Minimal cat(1) implementation
// Opens argv[1], reads it, writes to stdout, exits

.section .data
err_msg:    .ascii "Error: could not open file\n"
err_len     = . - err_msg        // .ascii (no NUL), so err_len is the exact message length

.section .bss
.align 3
buf:        .space 4096          // 4KB read buffer

.section .text
.global _start
_start:
    // Stack on entry: [sp+0]=argc, [sp+8]=argv[0], [sp+16]=argv[1], ...
    LDR  X0, [SP, #16]           // X0 = argv[1] (filename)
    CBZ  X0, .usage_error        // if no argument, error

    // === openat(AT_FDCWD, filename, O_RDONLY, 0) ===
    MOV  X8, #56                 // openat
    MOV  X1, X0                  // pathname = argv[1]
    MOV  X0, #-100               // AT_FDCWD
    MOV  X2, #0                  // O_RDONLY = 0
    MOV  X3, #0                  // mode (ignored for O_RDONLY)
    SVC  #0
    CMP  X0, #0
    B.LT .open_error             // if negative, error
    MOV  X19, X0                 // save fd in callee-saved X19

.read_loop:
    // === read(fd, buf, 4096) ===
    MOV  X8, #63                 // read
    MOV  X0, X19                 // fd
    ADR  X1, buf                 // buffer
    MOV  X2, #4096               // count
    SVC  #0
    CMP  X0, #0
    B.EQ .close_and_exit         // 0 bytes = EOF
    B.LT .close_and_exit         // error
    MOV  X20, X0                 // save bytes read

    // === write(stdout, buf, bytes_read) ===
    MOV  X8, #64                 // write
    MOV  X0, #1                  // stdout
    ADR  X1, buf
    MOV  X2, X20                 // bytes to write
    SVC  #0
    B    .read_loop

.close_and_exit:
    // === close(fd) ===
    MOV  X8, #57                 // close
    MOV  X0, X19
    SVC  #0
    B    .exit_success

.open_error:
    // write error message to stderr
    MOV  X8, #64
    MOV  X0, #2                  // stderr
    ADR  X1, err_msg
    MOV  X2, #err_len
    SVC  #0
    MOV  X0, #1                  // exit with error
    B    .exit

.usage_error:
    MOV  X0, #1
.exit:
    MOV  X8, #93
    SVC  #0

.exit_success:
    MOV  X8, #93
    MOV  X0, #0
    SVC  #0

Key observations:

- argv[1] lives at [SP+16] at program entry (argc is at [SP], argv[0] at [SP+8])
- X19 preserves the file descriptor across system calls (X0-X7 are caller-saved)
- The read/write loop continues until read() returns 0 (EOF) or negative (error)


18.6 AArch64 vs. AArch32

ARM64 (AArch64) is not just a wider version of ARM32 (AArch32). Key differences:

AArch64 vs. AArch32
┌─────────────────────────┬─────────────────────────┬─────────────────────┐
│ Feature                 │ AArch64                 │ AArch32             │
├─────────────────────────┼─────────────────────────┼─────────────────────┤
│ GP registers            │ 31 × 64-bit             │ 16 × 32-bit         │
│ Instruction encoding    │ Fixed 32-bit            │ 32-bit + Thumb-2    │
│ Per-instr condition code│ Removed (CSEL instead)  │ Every instruction!  │
│ Stack model             │ 16-byte aligned         │ 8-byte aligned      │
│ NEON support            │ Always present          │ Optional extension  │
│ FP registers            │ 32 × 128-bit            │ 16/32 × 64-bit      │
│ Addressing modes        │ Fewer, cleaner          │ More complex        │
│ Syscall instruction     │ SVC #0                  │ SVC #0 (same)       │
│ Return address          │ X30 (LR register)       │ R14 (LR register)   │
│ Address space           │ 64-bit (48 implemented) │ 32-bit              │
└─────────────────────────┴─────────────────────────┴─────────────────────┘

ARMv8 processors can run both AArch64 and AArch32 code: an AArch64 kernel at Exception Level 1 can run AArch32 processes at EL0. Android relied on this heavily during its 32-to-64-bit migration, which began with Android 5.0 in 2014.


18.7 Apple Silicon: ARM64 on macOS

The Apple M-series chips run ARM64. Apple's implementation differs from Linux in several ways that matter for assembly programmers.

System Calls on macOS

macOS uses a different mechanism for system calls:

// macOS ARM64 system call via libsystem (preferred)
// Don't call syscalls directly — use the library stubs

// However, if you must make raw syscalls:
// The syscall number goes in X16, not X8!
// Return value is still in X0

// macOS write(fd, buf, count):
MOV X16, #4        // write on macOS arm64 = 4 (different from Linux's 64!)
MOV X0, #1         // stdout
ADR X1, msg
MOV X2, #len
SVC #0x80          // macOS uses SVC #0x80 (not SVC #0)

The #0x80 immediate is macOS convention, inherited from 32-bit Darwin, where the trap number distinguished call classes (Unix syscall vs. Mach trap vs. thread call); on ARM64 the kernel classifies the call by the number in X16 (BSD syscalls positive, Mach traps negative). macOS syscall numbers are the BSD numbers, not the Linux generic table.

Mach-O Binary Format

macOS uses Mach-O, not ELF. The segment names differ:

ELF             Mach-O
─────────────────────────
.text           __TEXT,__text
.data           __DATA,__data
.bss            __DATA,__bss
.rodata         __TEXT,__const

An ARM64 Mach-O "Hello World":

// hello_macos_arm64.s
// Assemble: as -arch arm64 hello_macos_arm64.s -o hello.o
// Link:     ld -arch arm64 -e _start -lSystem hello.o -o hello
// Run:      ./hello

.global _start      // entry point (named via ld -e _start; C programs enter at _main via libc)
.p2align 2          // align to 4-byte boundary

.section __TEXT,__text
_start:
    MOV X16, #4     // write syscall on macOS ARM64
    MOV X0, #1      // stdout
    ADR X1, msg
    MOV X2, #14     // length of "Hello, macOS!\n"
    SVC #0x80       // macOS syscall

    MOV X16, #1     // exit syscall on macOS ARM64
    MOV X0, #0
    SVC #0x80

.section __TEXT,__const
msg:
    .ascii "Hello, macOS!\n"

Rosetta 2: x86-64 on Apple Silicon

Rosetta 2 is Apple's x86-64 to ARM64 binary translation layer. When you run an x86-64 binary on M1, Rosetta 2:

  1. Translates the x86-64 instructions to ARM64 at first launch (ahead-of-time translation, cached)
  2. Handles x86-64's memory ordering model (TSO) vs. ARM64's weaker ordering (by inserting memory barriers)
  3. Emulates x86-64 semantics including RFLAGS, segment registers, and other x86 quirks

Performance: Rosetta 2 translated x86-64 code typically runs at 70-85% of native x86-64 speed — which, given the M-series chips' raw performance advantage, often means it still beats a native Intel Mac at x86-64 code.

🔐 Security Note: Rosetta 2 runs x86-64 code in a sandboxed execution environment. From a security perspective, an x86-64 program running under Rosetta cannot directly make ARM64 system calls — it goes through Rosetta's translation layer. This creates a distinct threat model for exploits.


18.8 Side-by-Side: Array Sum in x86-64 and ARM64

Array Sum: C, x86-64, and ARM64 Compared
════════════════════════════════════════════════════════════════════════════

C source:
  int64_t sum(const int64_t *arr, int n) {
      int64_t s = 0;
      for (int i = 0; i < n; i++) s += arr[i];
      return s;
  }

x86-64 (GCC -O2):                    ARM64 (GCC -O2):
─────────────────────────────         ──────────────────────────────
sum:                                  sum:
  xor  eax, eax                         MOV  X2, XZR
  test esi, esi                         CMP  W1, #0
  jle  .done                            B.LE .done
  xor  ecx, ecx                         SXTW X3, W1
  movsxd rdx, esi                       MOV  X4, X0
.loop:                                .loop:
  add  rax, [rdi + rcx*8]               LDR  X5, [X4], #8
  inc  rcx                              ADD  X2, X2, X5
  cmp  rcx, rdx                         SUBS X3, X3, #1
  jne  .loop                            B.NE .loop
.done:                                .done:
  ret                                   MOV  X0, X2
                                        RET

─────────────────────────────         ──────────────────────────────
10 instructions                       11 instructions
Uses RCX as index, scales *8          Uses post-increment pointer
x86 implicit memory operand           ARM64 explicit load instruction
  (add rax, [rdi+rcx*8])                (LDR X5, [X4], #8)
MOVSXD sign-extends loop limit        SXTW sign-extends loop limit

The instruction counts are nearly identical. The key difference is structural: x86-64 folds the memory access into the ADD instruction, while ARM64 explicitly loads, then adds. In practice, on modern microarchitectures, both sequences execute in roughly the same time due to pipelining.


🔄 Check Your Understanding:

1. Why does LDR X2, [X0, X1, LSL #3] work for accessing int64_t array elements but not for int32_t elements (without changing the shift)?
2. What does FMLA V0.4S, V1.4S, V2.S[0] do differently from FMLA V0.4S, V1.4S, V2.4S?
3. Why does macOS use SVC #0x80 while Linux uses SVC #0?
4. What NEON register type would you use to process 8 uint16_t values at once?
5. Why does FMADD produce a more numerically accurate result than separate FMUL + FADD?


Summary

ARM64 programming requires explicit array index scaling (via LSL in addressing modes), explicit memcpy/memset loops (no string instructions), and explicit load-before-compute discipline. NEON SIMD provides high-throughput parallel computation on 128-bit registers for arrays, image processing, audio, cryptography, and any vectorizable computation.

Apple Silicon changes the game on macOS: different syscall numbers (X16, not X8), different binary format (Mach-O, not ELF), different section names, and the SVC #0x80 mechanism. Rosetta 2 enables x86-64 compatibility at the cost of ~15-30% performance overhead.

The load/store discipline that feels verbose in small examples becomes a feature at scale: each instruction does exactly one thing, the CPU's out-of-order engine can find and exploit instruction-level parallelism more easily, and performance is predictable.