Case Study 17-2: ARM64 String Operations — Implementing strlen Without REP

Objective

Implement strlen in ARM64 assembly. ARM64 has no REP SCASB equivalent, so every byte must be tested manually (or with SIMD). We'll implement three versions — naive byte loop, optimized word-at-a-time, and a preview of the NEON SIMD approach — and compare them to the x86-64 REP SCASB approach.


The Problem

strlen counts characters in a null-terminated string until it finds a zero byte. On x86-64, this is famously done with REPNE SCASB:

; x86-64: strlen using REPNE SCASB
; RDI = string pointer, returns RAX = length
strlen_x86:
    mov  rcx, -1           ; set counter to max
    xor  al, al            ; al = 0 (scan for null byte)
    cld                    ; direction = forward
    repne scasb            ; scan until [RDI] == AL, decrementing RCX
    ; After: RCX = -(length+2), RDI = one past null terminator
    not  rcx               ; RCX = length+1
    dec  rcx               ; RCX = length
    mov  rax, rcx
    ret

ARM64 has no SCAS, no REP prefix, no string instructions. Every byte must be checked explicitly (or in parallel with NEON). This is load/store architecture in action: it cannot combine memory access and comparison in one instruction.


Version 1: Naive Byte Loop

// strlen_naive: X0 = string pointer, returns X0 = length
// Uses: X1 (current pointer), X2 (byte read)

.global strlen_naive
strlen_naive:
    MOV   X1, X0              // X1 = start pointer (save original)

.byte_loop:
    LDRB  W2, [X0], #1        // W2 = *X0; X0++  (post-indexed byte load)
    CBNZ  W2, .byte_loop      // if W2 != 0, continue

    // X0 now points ONE past the null terminator
    // Length = X0 - X1 - 1
    SUB   X0, X0, X1          // X0 = distance from start to one-past-null
    SUB   X0, X0, #1          // subtract 1 to exclude the null byte
    RET

Wait — let's be careful with the post-increment. When CBNZ exits the loop, X0 points to the byte AFTER the null terminator (since we did X0++ before the compare). So length = X0 - original_X0 - 1.

Let's verify: - Input: "Hello\0" at address 0x1000 - Initial: X0=0x1000, X1=0x1000 - Loop: read 'H'=0x48, X0=0x1001, continue - Loop: read 'e'=0x65, X0=0x1002, continue - Loop: read 'l'=0x6C, X0=0x1003, continue - Loop: read 'l'=0x6C, X0=0x1004, continue - Loop: read 'o'=0x6F, X0=0x1005, continue - Loop: read '\0'=0x00, X0=0x1006, CBNZ not taken (W2=0) - SUB X0, X0, X1: X0 = 0x1006 - 0x1000 = 6 - SUB X0, X0, #1: X0 = 5 ✓ (correct: "Hello" is 5 characters)

Register trace:

Step X0 (pointer) X1 (start) W2 (byte) Notes
init 0x1000 0x1000 ?
LDRB+CBNZ 0x1001 0x1000 'H' continue
LDRB+CBNZ 0x1002 0x1000 'e' continue
LDRB+CBNZ 0x1003 0x1000 'l' continue
LDRB+CBNZ 0x1004 0x1000 'l' continue
LDRB+CBNZ 0x1005 0x1000 'o' continue
LDRB+CBNZ 0x1006 0x1000 '\0' exit (W2=0)
SUB X0,X0,X1 6 0x1000 '\0'
SUB X0,X0,#1 5 0x1000 '\0' return 5

Version 2: Word-at-a-Time strlen

The byte loop works but touches memory once per byte. Modern CPUs prefer to load 8 bytes at a time. The trick: load 8 bytes, then use arithmetic to detect if any of those 8 bytes is zero.

The null-byte detection algorithm (from Hacker's Delight):

// Returns non-zero if any byte in x is zero
uint64_t has_zero_byte(uint64_t x) {
    return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

If has_zero_byte(x) != 0, then x contains a null byte somewhere in its 8 bytes.

// strlen_word: word-at-a-time strlen (unaligned-safe version for now)
// X0 = string pointer, returns X0 = length

.global strlen_word
strlen_word:
    MOV   X1, X0                   // X1 = start pointer

    // Load the magic constants
    MOV   X4, #0x0101010101010101  // 0x0101...  (repeat pattern of 0x01)
    // Actually 0x0101010101010101 doesn't fit in a 16-bit immediate.
    // Load it using MOVZ + MOVK:
    MOVZ  X4, #0x0101
    MOVK  X4, #0x0101, LSL #16
    MOVK  X4, #0x0101, LSL #32
    MOVK  X4, #0x0101, LSL #48    // X4 = 0x0101010101010101

    LSL   X5, X4, #7              // X5 = 0x8080808080808080

.word_loop:
    LDR   X2, [X0], #8            // X2 = 8 bytes from string; X0 += 8
    SUB   X3, X2, X4              // X3 = X2 - 0x0101...
    BIC   X3, X3, X2              // X3 &= ~X2
    AND   X3, X3, X5              // X3 &= 0x8080...
    CBZ   X3, .word_loop          // if no zero byte found, continue

    // A null byte was found in the 8 bytes we just loaded.
    // X0 now points 8 bytes PAST the word containing the null byte.
    // We need to find which byte within that word was null.
    // Back up to the word:
    SUB   X0, X0, #8              // X0 = address of the word with null byte
    LDR   X2, [X0]                // reload the word

    // Find the null byte position by byte-scanning
    // (We could use RBIT/CLZ for a branchless version,
    //  but for clarity, use a byte loop on the final word)
    MOV   W6, #0
.final_scan:
    LSR   X7, X2, X6              // shift word right by W6 bits
    AND   X7, X7, #0xFF           // isolate byte
    CBZ   X7, .found_null
    ADD   W6, W6, #8              // next byte position (8 bits)
    B     .final_scan

.found_null:
    // W6 = bit offset of null byte (0, 8, 16, 24, 32, 40, 48, or 56)
    LSR   X6, X6, #3              // convert bit offset to byte offset (÷8)
    ADD   X0, X0, X6              // X0 = address of null byte
    SUB   X0, X0, X1              // length = null_addr - start_addr
    RET

⚠️ Common Mistake: The word-at-a-time approach assumes the string is padded or that reading past the end of the string (within the same word) won't cause a fault. Production implementations of strlen (like musl libc) carefully align the pointer to word boundaries first, ensuring they never load across a page boundary. The version above can fault if the string ends near a page boundary and the next page is unmapped.


Version 3: NEON SIMD strlen Preview

ARM64 NEON allows loading 16 bytes at once and comparing all 16 in parallel:

// strlen_neon: 16-bytes-at-a-time using NEON
// X0 = string pointer (assumed 16-byte aligned for simplicity)
// Returns X0 = length
// Uses NEON V0 register

.global strlen_neon
strlen_neon:
    MOV   X1, X0                  // save start

    MOVI  V1.16B, #0              // V1 = all zeros (our search value)

.neon_loop:
    LDR   Q0, [X0], #16           // Load 16 bytes into Q0 (= V0.16B); X0 += 16
    CMEQ  V0.16B, V0.16B, V1.16B  // V0[i] = 0xFF if V0[i] == 0, else 0
    UMAXV B2, V0.16B              // B2 = max of all 16 bytes in V0
    CBNZ  W2, .found_neon_null    // if any byte was 0xFF, we found a null
    B     .neon_loop

.found_neon_null:
    // Back up to find exact position
    SUB   X0, X0, #16             // X0 = start of the 16-byte block with null
    LDR   Q0, [X0]                // reload
    MOVI  V1.16B, #0
    CMEQ  V0.16B, V0.16B, V1.16B  // 0xFF where null bytes are

    // Find lowest set byte using CLZ on the result
    // (This is simplified; production code uses more NEON tricks)
    // For now, fall back to byte scan
    MOV   W3, #0
.neon_final:
    LDRB  W4, [X0, W3, UXTW]     // load byte at X0+W3
    CBZ   W4, .neon_done          // found null
    ADD   W3, W3, #1
    B     .neon_final

.neon_done:
    ADD   X0, X0, W3, UXTW       // X0 = null byte address
    SUB   X0, X0, X1             // length = null_addr - start
    RET

Comparison: Performance and Code Complexity

strlen Implementation Comparison
═══════════════════════════════════════════════════════════════════════════
                  x86-64 REPNE    ARM64 Naive    ARM64 Word    ARM64 NEON
                  SCASB           Byte Loop      at-a-Time     16B at once
───────────────────────────────────────────────────────────────────────────
Bytes per iter    1               1              8             16
Instructions     ~5 total         ~3/byte        ~10/8bytes    ~6/16bytes
                 (very low loop
                  overhead)
Code size         ~10 bytes        ~20 bytes      ~60 bytes    ~40 bytes
Alignment req.   None             None           Careful       16-byte
Correctness       Easy             Easy           Subtle        Moderate
Relative speed    1x               ~1x            ~3-4x         ~8-12x
(for long strings)
───────────────────────────────────────────────────────────────────────────

For short strings (< 8 chars), the naive byte loop is often fastest because it avoids the setup overhead of the word-at-a-time approach. For long strings (>64 chars), NEON dominates.

Real-world ARM64 strlen (like in musl libc or glibc for ARM64) typically: 1. Handles the head (align pointer to 8 or 16 bytes) 2. Processes aligned words with the null-byte detection trick 3. Handles the tail (final partial word)


Building and Testing

// test_strlen.s — minimal test
.section .data
s1:  .asciz "Hello"        // expected length: 5
s2:  .asciz ""             // expected length: 0
s3:  .asciz "A"            // expected length: 1
s4:  .asciz "Hello World"  // expected length: 11

.section .text
.extern strlen_naive
.extern printf
.global main

main:
    STP  X29, X30, [SP, #-16]!
    MOV  X29, SP

    ADR  X0, s1
    BL   strlen_naive
    // X0 should be 5; check it...

    MOV  W0, #0
    LDP  X29, X30, [SP], #16
    RET

Summary

ARM64's lack of string instructions is the load/store philosophy in action: no single instruction combines memory fetch with comparison. The naive byte loop is two instructions (LDRB + CBNZ) per character. The word-at-a-time approach processes 8 bytes per iteration with clever arithmetic null detection — faster but more code and subtle alignment requirements. NEON can process 16 bytes per iteration.

The lesson: on RISC, when the hardware won't do it, software strategies (SIMD, bit tricks, alignment) bridge the gap. The compiler's libc and the OS's hand-optimized routines use these strategies; your assembly programs benefit from understanding them.