Case Study 17-2: ARM64 String Operations — Implementing strlen Without REP
Objective
Implement strlen in ARM64 assembly. ARM64 has no REP SCASB equivalent, so every byte must be tested manually (or with SIMD). We'll implement three versions — naive byte loop, optimized word-at-a-time, and a preview of the NEON SIMD approach — and compare them to the x86-64 REP SCASB approach.
The Problem
strlen counts characters in a null-terminated string until it finds a zero byte. On x86-64, this is famously done with REPNE SCASB:
; x86-64: strlen using REPNE SCASB
; RDI = string pointer, returns RAX = length
strlen_x86:
mov rcx, -1 ; set counter to max
xor al, al ; al = 0 (scan for null byte)
cld ; direction = forward
repne scasb ; scan until [RDI] == AL, decrementing RCX
; After: RCX = -(length+2), RDI = one past null terminator
not rcx ; RCX = length+1
dec rcx ; RCX = length
mov rax, rcx
ret
ARM64 has no SCAS, no REP prefix, no string instructions. Every byte must be checked explicitly (or in parallel with NEON). This is load/store architecture in action: it cannot combine memory access and comparison in one instruction.
Version 1: Naive Byte Loop
// strlen_naive: X0 = string pointer, returns X0 = length
// Uses: X1 (current pointer), X2 (byte read)
.global strlen_naive
strlen_naive:
MOV X1, X0 // X1 = start pointer (save original)
.byte_loop:
LDRB W2, [X0], #1 // W2 = *X0; X0++ (post-indexed byte load)
CBNZ W2, .byte_loop // if W2 != 0, continue
// X0 now points ONE past the null terminator
// Length = X0 - X1 - 1
SUB X0, X0, X1 // X0 = distance from start to one-past-null
SUB X0, X0, #1 // subtract 1 to exclude the null byte
RET
Wait — let's be careful with the post-increment. When CBNZ exits the loop, X0 points to the byte AFTER the null terminator (since we did X0++ before the compare). So length = X0 - original_X0 - 1.
Let's verify: - Input: "Hello\0" at address 0x1000 - Initial: X0=0x1000, X1=0x1000 - Loop: read 'H'=0x48, X0=0x1001, continue - Loop: read 'e'=0x65, X0=0x1002, continue - Loop: read 'l'=0x6C, X0=0x1003, continue - Loop: read 'l'=0x6C, X0=0x1004, continue - Loop: read 'o'=0x6F, X0=0x1005, continue - Loop: read '\0'=0x00, X0=0x1006, CBNZ not taken (W2=0) - SUB X0, X0, X1: X0 = 0x1006 - 0x1000 = 6 - SUB X0, X0, #1: X0 = 5 ✓ (correct: "Hello" is 5 characters)
Register trace:
| Step | X0 (pointer) | X1 (start) | W2 (byte) | Notes |
|---|---|---|---|---|
| init | 0x1000 | 0x1000 | ? | |
| LDRB+CBNZ | 0x1001 | 0x1000 | 'H' | continue |
| LDRB+CBNZ | 0x1002 | 0x1000 | 'e' | continue |
| LDRB+CBNZ | 0x1003 | 0x1000 | 'l' | continue |
| LDRB+CBNZ | 0x1004 | 0x1000 | 'l' | continue |
| LDRB+CBNZ | 0x1005 | 0x1000 | 'o' | continue |
| LDRB+CBNZ | 0x1006 | 0x1000 | '\0' | exit (W2=0) |
| SUB X0,X0,X1 | 6 | 0x1000 | '\0' | |
| SUB X0,X0,#1 | 5 | 0x1000 | '\0' | return 5 |
Version 2: Word-at-a-Time strlen
The byte loop works but touches memory once per byte. Modern CPUs prefer to load 8 bytes at a time. The trick: load 8 bytes, then use arithmetic to detect if any of those 8 bytes is zero.
The null-byte detection algorithm (from Hacker's Delight):
// Returns non-zero if any byte in x is zero
uint64_t has_zero_byte(uint64_t x) {
return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}
If has_zero_byte(x) != 0, then x contains a null byte somewhere in its 8 bytes.
// strlen_word: word-at-a-time strlen (unaligned-safe version for now)
// X0 = string pointer, returns X0 = length
.global strlen_word
strlen_word:
MOV X1, X0 // X1 = start pointer
// Load the magic constants
MOV X4, #0x0101010101010101 // 0x0101... (repeat pattern of 0x01)
// Actually 0x0101010101010101 doesn't fit in a 16-bit immediate.
// Load it using MOVZ + MOVK:
MOVZ X4, #0x0101
MOVK X4, #0x0101, LSL #16
MOVK X4, #0x0101, LSL #32
MOVK X4, #0x0101, LSL #48 // X4 = 0x0101010101010101
LSL X5, X4, #7 // X5 = 0x8080808080808080
.word_loop:
LDR X2, [X0], #8 // X2 = 8 bytes from string; X0 += 8
SUB X3, X2, X4 // X3 = X2 - 0x0101...
BIC X3, X3, X2 // X3 &= ~X2
AND X3, X3, X5 // X3 &= 0x8080...
CBZ X3, .word_loop // if no zero byte found, continue
// A null byte was found in the 8 bytes we just loaded.
// X0 now points 8 bytes PAST the word containing the null byte.
// We need to find which byte within that word was null.
// Back up to the word:
SUB X0, X0, #8 // X0 = address of the word with null byte
LDR X2, [X0] // reload the word
// Find the null byte position by byte-scanning
// (We could use RBIT/CLZ for a branchless version,
// but for clarity, use a byte loop on the final word)
MOV W6, #0
.final_scan:
LSR X7, X2, X6 // shift word right by W6 bits
AND X7, X7, #0xFF // isolate byte
CBZ X7, .found_null
ADD W6, W6, #8 // next byte position (8 bits)
B .final_scan
.found_null:
// W6 = bit offset of null byte (0, 8, 16, 24, 32, 40, 48, or 56)
LSR X6, X6, #3 // convert bit offset to byte offset (÷8)
ADD X0, X0, X6 // X0 = address of null byte
SUB X0, X0, X1 // length = null_addr - start_addr
RET
⚠️ Common Mistake: The word-at-a-time approach assumes the string is padded or that reading past the end of the string (within the same word) won't cause a fault. Production implementations of strlen (like musl libc) carefully align the pointer to word boundaries first, ensuring they never load across a page boundary. The version above can fault if the string ends near a page boundary and the next page is unmapped.
Version 3: NEON SIMD strlen Preview
ARM64 NEON allows loading 16 bytes at once and comparing all 16 in parallel:
// strlen_neon: 16-bytes-at-a-time using NEON
// X0 = string pointer (assumed 16-byte aligned for simplicity)
// Returns X0 = length
// Uses NEON V0 register
.global strlen_neon
strlen_neon:
MOV X1, X0 // save start
MOVI V1.16B, #0 // V1 = all zeros (our search value)
.neon_loop:
LDR Q0, [X0], #16 // Load 16 bytes into Q0 (= V0.16B); X0 += 16
CMEQ V0.16B, V0.16B, V1.16B // V0[i] = 0xFF if V0[i] == 0, else 0
UMAXV B2, V0.16B // B2 = max of all 16 bytes in V0
CBNZ W2, .found_neon_null // if any byte was 0xFF, we found a null
B .neon_loop
.found_neon_null:
// Back up to find exact position
SUB X0, X0, #16 // X0 = start of the 16-byte block with null
LDR Q0, [X0] // reload
MOVI V1.16B, #0
CMEQ V0.16B, V0.16B, V1.16B // 0xFF where null bytes are
// Find lowest set byte using CLZ on the result
// (This is simplified; production code uses more NEON tricks)
// For now, fall back to byte scan
MOV W3, #0
.neon_final:
LDRB W4, [X0, W3, UXTW] // load byte at X0+W3
CBZ W4, .neon_done // found null
ADD W3, W3, #1
B .neon_final
.neon_done:
ADD X0, X0, W3, UXTW // X0 = null byte address
SUB X0, X0, X1 // length = null_addr - start
RET
Comparison: Performance and Code Complexity
strlen Implementation Comparison
═══════════════════════════════════════════════════════════════════════════
x86-64 REPNE ARM64 Naive ARM64 Word ARM64 NEON
SCASB Byte Loop at-a-Time 16B at once
───────────────────────────────────────────────────────────────────────────
Bytes per iter 1 1 8 16
Instructions ~5 total ~3/byte ~10/8bytes ~6/16bytes
(very low loop
overhead)
Code size ~10 bytes ~20 bytes ~60 bytes ~40 bytes
Alignment req. None None Careful 16-byte
Correctness Easy Easy Subtle Moderate
Relative speed 1x ~1x ~3-4x ~8-12x
(for long strings)
───────────────────────────────────────────────────────────────────────────
For short strings (< 8 chars), the naive byte loop is often fastest because it avoids the setup overhead of the word-at-a-time approach. For long strings (>64 chars), NEON dominates.
Real-world ARM64 strlen (like in musl libc or glibc for ARM64) typically:
1. Handles the head (align pointer to 8 or 16 bytes)
2. Processes aligned words with the null-byte detection trick
3. Handles the tail (final partial word)
Building and Testing
// test_strlen.s — minimal test
.section .data
s1: .asciz "Hello" // expected length: 5
s2: .asciz "" // expected length: 0
s3: .asciz "A" // expected length: 1
s4: .asciz "Hello World" // expected length: 11
.section .text
.extern strlen_naive
.extern printf
.global main
main:
STP X29, X30, [SP, #-16]!
MOV X29, SP
ADR X0, s1
BL strlen_naive
// X0 should be 5; check it...
MOV W0, #0
LDP X29, X30, [SP], #16
RET
Summary
ARM64's lack of string instructions is the load/store philosophy in action: no single instruction combines memory fetch with comparison. The naive byte loop is two instructions (LDRB + CBNZ) per character. The word-at-a-time approach processes 8 bytes per iteration with clever arithmetic null detection — faster but more code and subtle alignment requirements. NEON can process 16 bytes per iteration.
The lesson: on RISC, when the hardware won't do it, software strategies (SIMD, bit tricks, alignment) bridge the gap. The compiler's libc and the OS's hand-optimized routines use these strategies; your assembly programs benefit from understanding them.