In This Chapter
- Memory as the Only Data Structure
- Arrays: Contiguous Memory with Scaled Indexing
- String Operations: The REP Prefix and String Instructions
- Implementing the C String Library in Assembly
- Linked Lists in Assembly
- Struct Layout and Padding
- Implementing a Simple Stack Data Structure
- C Comparison: String and Memory Operations
- Summary
Chapter 12: Arrays, Strings, and Data Structures
Memory as the Only Data Structure
At the machine level, there are no arrays, strings, or linked lists. There is only memory: a flat sequence of bytes, and registers that hold numbers. Every data structure your program uses is a convention for interpreting those bytes — an agreement between the writer and reader of the memory about what the bytes mean.
This chapter makes that concrete. Arrays are contiguous memory regions accessed with scaled indexing. Strings are byte arrays terminated by zero (or specified by length). Linked lists are structs where one field holds the address of the next struct. The REP-prefix string instructions (MOVSB, STOSB, SCASB, CMPSB) provide hardware acceleration for operating on these byte sequences. Understanding how they work at the instruction level gives you precise control over memory that no higher-level language can match.
Arrays: Contiguous Memory with Scaled Indexing
An array stores elements contiguously in memory. If the base address is B and each element is S bytes, element i is at address B + i*S.
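A quick C sanity check of the addressing formula; the helper name is ours, purely illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Byte offset of arr[i] from the array base: i * sizeof(element). */
static ptrdiff_t element_offset(const int64_t *arr, int64_t i) {
    return (const char *)&arr[i] - (const char *)arr;
}
```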
Element Access
; int64_t arr[8] — base at RDI, index in RCX
mov rax, [rdi + rcx*8] ; load arr[rcx] (8-byte elements, scale=8)
mov [rdi + rcx*8], rax ; store arr[rcx]
; int32_t arr[8] — scale=4
mov eax, [rdi + rcx*4] ; load arr[rcx] (4-byte elements)
; int16_t arr[8] — scale=2
movsx rax, word [rdi + rcx*2] ; load arr[rcx] (sign-extend 16-bit to 64)
; int8_t arr[8] — scale=1
movsx rax, byte [rdi + rcx] ; load arr[rcx] (sign-extend byte to 64)
Bounds Checking
Assembly provides no automatic bounds checking. Out-of-bounds access silently reads or writes whatever memory happens to be at the computed address:
; DANGEROUS: no bounds check
mov rax, [rdi + rcx*8] ; if rcx < 0 or rcx >= arr_size, this reads whatever memory sits at that address (or faults)
; CORRECT: add bounds check
cmp rcx, arr_size
jae .out_of_bounds ; if rcx >= arr_size (unsigned), error
mov rax, [rdi + rcx*8] ; now safe
The unsigned comparison jae covers both rcx >= arr_size (too large) and effectively also negative values (which are huge positive numbers in unsigned interpretation). One check handles both cases.
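The same trick is easy to check in C; the function name below is ours, purely illustrative:

```c
#include <stdint.h>

/* One unsigned comparison rejects both negative and too-large indices:
   a negative int64_t wraps to a huge uint64_t, so it also fails `< n`. */
static int in_bounds(int64_t i, uint64_t n) {
    return (uint64_t)i < n;
}
```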
Multi-Dimensional Arrays: Row-Major Layout
C stores multi-dimensional arrays in row-major order: a[r][c] is at offset (r * num_cols + c) * element_size.
int64_t matrix[5][4]; // 5 rows, 4 columns
// matrix[r][c] is at matrix + (r*4 + c) * 8
; Access matrix[r][c]:
; RDI = matrix base, RSI = row, RDX = col, num_cols = 4
; Method 1: compute full index
mov rax, rsi ; rax = row
imul rax, 4 ; rax = row * 4 (num_cols)
add rax, rdx ; rax = row*4 + col
mov rax, [rdi + rax*8] ; matrix[row][col]
; Method 2: one LEA (scale 4 is a valid addressing scale)
; lea rax, [rsi + rsi*3] alone would be WRONG here: it computes row*4 but leaves out col
lea rax, [rdx + rsi*4] ; rax = row*4 + col in a single LEA
mov rax, [rdi + rax*8] ; matrix[row][col]
; Method 3: advance a pointer row by row when iterating over all rows
; Each row is 4 * 8 = 32 bytes, but 32 is not a valid scale (only 1, 2, 4, 8),
; so compute the row address with a shift:
mov rbx, rsi ; rbx = row
shl rbx, 5 ; rbx = row * 32 (row stride in bytes)
add rbx, rdi ; rbx = &matrix[row][0]
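The row-major formula can be verified in C; `rm_offset` is an illustrative name of ours:

```c
#include <stdint.h>
#include <stddef.h>

enum { ROWS = 5, COLS = 4 };

/* Row-major byte offset of matrix[r][c] for int64_t elements. */
static size_t rm_offset(size_t r, size_t c) {
    return (r * COLS + c) * sizeof(int64_t);
}
```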
Arrays of Structs vs. Structs of Arrays
The AoS (Array of Structs) vs. SoA (Struct of Arrays) layout distinction matters enormously for SIMD performance:
// AoS: poor for SIMD (fields interleaved)
struct Point { float x, y, z; };
Point points[1000]; // x,y,z,x,y,z,x,y,z...
// SoA: excellent for SIMD (all x's together)
struct Points {
float x[1000];
float y[1000];
float z[1000];
};
In AoS, to process all x-coordinates, you must load x, skip y and z, load next x — stride 12 bytes. In SoA, all x-coordinates are contiguous — can be loaded 4 at a time with SSE or 8 at a time with AVX.
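The stride difference can be checked in C; this minimal sketch mirrors the `Point` layout above:

```c
#include <stddef.h>

/* AoS element: three floats, 12 bytes, no padding (all fields align to 4) */
typedef struct { float x, y, z; } Point;
```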
String Operations: The REP Prefix and String Instructions
x86-64 has a family of "string instructions" that operate on sequences of bytes (or words, or dwords) pointed to by RSI and/or RDI. Combined with the REP prefix, they repeat the operation RCX times.
Direction Flag (DF) and CLD/STD
All string instructions advance RSI and/or RDI after each operation. The direction depends on the Direction Flag (DF) in RFLAGS:
- DF = 0 (cleared): increment (forward direction) — the normal case
- DF = 1 (set): decrement (backward direction) — for overlapping copies
cld ; clear DF: forward direction (almost always what you want)
std ; set DF: backward direction (rare; dangerous if forgotten)
⚠️ Common Mistake: Calling code that uses STD without a subsequent CLD. If DF is left set, subsequent string operations (including those in library functions) will scan backward. Always pair STD with CLD, and strongly prefer CLD at function entry when using string instructions.
The System V ABI requires DF = 0 at function call boundaries. Library functions assume DF = 0 on entry.
MOVSB/MOVSW/MOVSD/MOVSQ: Memory Copy
Copies one element from [RSI] to [RDI], then advances both RSI and RDI (by 1/2/4/8 bytes depending on the size suffix). The RCX decrement comes from the REP prefix, which repeats the copy until RCX reaches zero; the bare instruction does not touch RCX.
; REP MOVSB: copy RCX bytes from [RSI] to [RDI]
cld ; forward direction
mov rcx, 64 ; count
lea rsi, [rel source] ; source
lea rdi, [rel dest] ; destination
rep movsb ; copy 64 bytes
; REP MOVSQ: copy RCX quadwords (8 bytes each) from [RSI] to [RDI]
mov rcx, 8 ; 8 × 8 = 64 bytes
rep movsq ; more efficient than rep movsb for aligned data
⚠️ Common Mistake: Using `rep movsb` when source and destination overlap and the source is at a lower address than the destination. With DF=0 (forward), `rep movsb` overwrites source bytes before they have been read. A forward copy is safe for overlapping regions only when destination < source; when destination > source, copy backward (STD, with RSI/RDI pointing at the last byte, then CLD afterward) or call `memmove`.
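The hazard is easy to reproduce in C; `copy_forward` below is our illustrative stand-in for what `rep movsb` does with DF=0:

```c
#include <string.h>

/* Naive forward byte copy: the behavior of rep movsb with DF = 0. */
static void copy_forward(char *dst, const char *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```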
Implementing memcpy with rep movsq:
; void *memcpy(void *dst, const void *src, size_t n)
; RDI = dst, RSI = src, RDX = n (bytes)
; Note: after this function, RDI points past the destination (REP advances it)
; So we save RDI first to return it
memcpy_impl:
push rbp
mov rbp, rsp
push rdi ; save original dst to return
mov rcx, rdx ; rcx = n
cld ; forward direction
rep movsb ; copy n bytes
pop rax ; return original dst
pop rbp
ret
STOSB/STOSW/STOSD/STOSQ: Memory Fill
Stores AL/AX/EAX/RAX into [RDI], then advances RDI and decrements RCX.
; REP STOSB: fill RCX bytes at [RDI] with value in AL
cld
mov rdi, [rel buffer_addr]
mov al, 0 ; fill with zero
mov rcx, 64 ; 64 bytes
rep stosb ; zero-fill 64 bytes
; REP STOSQ: fill RCX quadwords with RAX (8-byte fill value)
xor eax, eax ; rax = 0
mov rcx, 8 ; 8 × 8 = 64 bytes
rep stosq ; zero-fill 64 bytes more efficiently
Implementing memset:
; void *memset(void *dst, int c, size_t n)
; RDI = dst, ESI = c (fill value, treated as unsigned byte), RDX = n
memset_impl:
mov r8, rdi ; save dst for the return value (rep stos advances rdi)
movzx eax, sil ; eax = fill byte, zero-extended
mov r9, 0x0101010101010101
imul rax, r9 ; replicate byte: 0xCC becomes 0xCCCCCCCCCCCCCCCC
mov rcx, rdx
shr rcx, 3 ; rcx = n/8 (number of quadwords)
cld
rep stosq ; fill the bulk, eight bytes at a time
mov rcx, rdx
and rcx, 7 ; rcx = n%8 (remaining tail bytes)
rep stosb ; AL is the low byte of RAX, still the fill byte
mov rax, r8 ; return original dst
ret
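The byte-replication multiply used above can be checked in C; the helper name is ours:

```c
#include <stdint.h>

/* Replicate a byte into all 8 lanes of a 64-bit fill value, as the
   multiply-by-0x0101010101010101 trick does before rep stosq. */
static uint64_t replicate_byte(uint8_t c) {
    return (uint64_t)c * 0x0101010101010101ULL;
}
```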
SCASB/SCASW/SCASD/SCASQ: Scan for Value
Compares AL/AX/EAX/RAX with [RDI], sets flags, advances RDI, decrements RCX. Combined with REPNE (repeat while not equal), this implements strchr and strlen:
; strlen using REPNE SCASB
; RDI = string pointer, returns length in RAX
strlen_scasb:
cld
xor eax, eax ; al = 0 (search for null terminator)
or rcx, -1 ; rcx = 0xFFFFFFFFFFFFFFFF (max count; rcx must be initialized)
repne scasb ; scan: advance RDI, decrement RCX until [RDI-1] == AL or RCX == 0
; After: RDI points one past the null byte found
; RCX = 0xFFFFFFFFFFFFFFFF - (length+1)
not rcx ; rcx = length + 1
dec rcx ; rcx = length (not counting null)
mov rax, rcx
ret
Trace for "ABC\0":
- Initial: RDI = str, AL = 0, RCX = 0xFFFF...FFFF
- Iteration 1: compare AL with 'A' → not equal; RDI += 1, RCX--
- Iteration 2: compare AL with 'B' → not equal; RDI += 1, RCX--
- Iteration 3: compare AL with 'C' → not equal; RDI += 1, RCX--
- Iteration 4: compare AL with '\0' → EQUAL; exit loop (REPNE stops on equal)
- After: RCX = 0xFFFF...FFFF - 4 = 0xFFFF...FFFB
- not rcx = 4; dec rcx = 3. Returns 3. Correct.
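The NOT/DEC arithmetic can be replayed in plain C; `scasb_length` is an illustrative name of ours:

```c
#include <stdint.h>

/* The NOT/DEC idiom: starting from RCX = ~0, after repne scasb has run
   (len + 1) iterations, ~RCX - 1 recovers len. */
static uint64_t scasb_length(uint64_t iterations) {
    uint64_t rcx = ~(uint64_t)0;   /* max count before the scan */
    rcx -= iterations;             /* repne scasb decrements once per byte examined */
    return ~rcx - 1;               /* not rcx ; dec rcx */
}
```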
CMPSB/CMPSW/CMPSD/CMPSQ: Compare Memory Regions
Compares [RSI] with [RDI], sets flags, advances both RSI and RDI, decrements RCX.
; REPE CMPSB: compare RCX bytes, stop at first mismatch or end
; Used for: memcmp, strcmp
memcmp_impl:
; RDI = a, RSI = b, RDX = n
mov rcx, rdx
cld
jrcxz .equal ; n == 0: regions are trivially equal (repe would leave stale flags)
repe cmpsb ; compare until mismatch or RCX == 0
; After: ZF = 1 if all bytes matched; ZF = 0 if mismatch
; For memcmp return value: a[i] - b[i] at first mismatch
je .equal
; Get the differing bytes (RDI and RSI both advanced one past the mismatch)
movzx eax, byte [rdi - 1] ; the byte from a
movzx ecx, byte [rsi - 1] ; the byte from b
sub eax, ecx ; return a[i] - b[i] (positive if a > b, negative if a < b)
ret
.equal:
xor eax, eax
ret
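A C sketch confirming that the byte-difference convention agrees in sign with libc `memcmp`; the helper name is ours:

```c
#include <string.h>

/* Our memcmp returns a[i] - b[i] at the first mismatch, with bytes
   compared as unsigned: the same sign convention libc memcmp uses. */
static int byte_diff(const unsigned char *a, const unsigned char *b, size_t i) {
    return (int)a[i] - (int)b[i];
}
```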
Register State of REP Instructions
REP instructions modify RCX (decremented toward 0), RSI and/or RDI (advanced), and RFLAGS (based on the last comparison). After REP MOVSB:
- RCX = 0
- RSI = original_RSI + count
- RDI = original_RDI + count
After REPE CMPSB (stopped by a mismatch):
- RCX = count minus the number of pairs examined (the mismatching pair is counted as examined)
- RSI = original_RSI + (count − remaining), i.e., one past the mismatching byte
- RDI = similarly advanced
⚡ Performance Note: REP instructions have a startup overhead of several cycles before the microcode loop begins. For very short operations (under ~8 bytes), a few explicit MOV instructions are faster. For larger operations (over ~64 bytes), REP MOVSB achieves near-memory-bandwidth throughput on modern Intel processors via the Enhanced REP MOVSB/STOSB (ERMSB) optimization introduced with Ivy Bridge.
Implementing the C String Library in Assembly
; ─────────────────────────────────────────────────
; strlen: size_t strlen(const char *s)
; ─────────────────────────────────────────────────
strlen:
cld
xor eax, eax
or rcx, -1 ; rcx = max count (must be initialized for the NOT/DEC idiom below)
repne scasb ; scan for null
not rcx
dec rcx
mov rax, rcx
ret
; ─────────────────────────────────────────────────
; strcpy: char *strcpy(char *dst, const char *src)
; RDI = dst, RSI = src
; ─────────────────────────────────────────────────
strcpy:
push rbp
mov rbp, rsp
push r12 ; preserve callee-saved r12 (it will hold dst)
mov r12, rdi ; save dst for return
cld
.copy_loop:
lodsb ; al = [rsi], rsi++
stosb ; [rdi] = al, rdi++
test al, al ; null terminator?
jnz .copy_loop ; no: continue
mov rax, r12 ; return original dst
pop r12
pop rbp
ret
; ─────────────────────────────────────────────────
; memset: void *memset(void *s, int c, size_t n)
; RDI = s, ESI = c (byte), RDX = n
; ─────────────────────────────────────────────────
memset:
mov r8, rdi ; save dst for the return value (rep stosb advances rdi)
movzx eax, sil ; al = fill byte
mov rcx, rdx ; rcx = n
cld
rep stosb ; fill n bytes with AL
; (simplified: production versions use stosq for large n)
mov rax, r8 ; return original dst
ret
; ─────────────────────────────────────────────────
; memcmp: int memcmp(const void *a, const void *b, size_t n)
; RDI = a, RSI = b, RDX = n
; ─────────────────────────────────────────────────
memcmp:
mov rcx, rdx
cld
jrcxz .equal ; n == 0: trivially equal (avoid testing stale flags)
repe cmpsb
je .equal
movzx eax, byte [rdi - 1]
movzx ecx, byte [rsi - 1]
sub eax, ecx
ret
.equal:
xor eax, eax
ret
Note: LODSB (Load String Byte) loads [RSI] into AL and increments RSI. It is the read-side counterpart of MOVSB, for when you need to read and process each byte rather than just copy it.
Linked Lists in Assembly
A singly-linked list node:
struct Node {
int64_t value; // offset 0
struct Node *next; // offset 8
};
; Traverse a linked list, sum all values
; RDI = head pointer (or NULL if empty)
; Returns sum in RAX
list_sum:
xor eax, eax ; sum = 0
test rdi, rdi ; head == NULL?
jz .done ; if NULL, return 0
.loop:
add rax, [rdi] ; sum += node->value (offset 0)
mov rdi, [rdi + 8] ; rdi = node->next (offset 8)
test rdi, rdi ; next == NULL?
jnz .loop ; if not NULL, continue
.done:
ret
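For comparison, a C version of the same traversal (a sketch; `list_sum_c` is our name, and the `Node` layout matches the struct above):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct Node { int64_t value; struct Node *next; } Node;

/* Same traversal as the assembly: read value at offset 0, follow the
   next pointer at offset 8 until it is NULL. */
static int64_t list_sum_c(const Node *n) {
    int64_t sum = 0;
    for (; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}
```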
Linked List Insertion (at head)
; Node *list_prepend(Node *head, int64_t value)
; Allocates a new node with malloc, inserts at head
; RDI = head, RSI = value
; Returns new head in RAX
list_prepend:
push rbp
mov rbp, rsp
push rbx ; save head (callee-saved)
push r12 ; save value
push r13 ; for alignment
sub rsp, 8 ; pad to 16-byte alignment
mov rbx, rdi ; save head
mov r12, rsi ; save value
mov edi, 16 ; malloc(sizeof(Node)) = 16 bytes
call malloc ; rax = new node pointer
test rax, rax ; malloc returned NULL?
jz .fail
; Initialize the new node
mov [rax], r12 ; node->value = value
mov [rax + 8], rbx ; node->next = old head
; Return new node (it is now the head)
jmp .done
.fail:
xor eax, eax ; return NULL
.done:
add rsp, 8
pop r13
pop r12
pop rbx
pop rbp
ret
Linked List Deletion (remove node by value)
; Node *list_remove(Node *head, int64_t value)
; Removes the first node with the given value
; Returns new head (may be different if we removed the head)
; RDI = head, RSI = value
list_remove:
push rbp
mov rbp, rsp
push rbx ; current node
push r12 ; previous node
push r13 ; target value
push r14 ; head (to return)
mov r14, rdi ; save head for potential return
mov r13, rsi ; save target value
xor r12d, r12d ; prev = NULL
test rdi, rdi ; head == NULL?
jz .done ; empty list, return NULL
mov rbx, rdi ; current = head
.loop:
cmp [rbx], r13 ; current->value == target?
je .found_it
mov r12, rbx ; prev = current
mov rbx, [rbx + 8] ; current = current->next
test rbx, rbx
jnz .loop ; continue if not NULL
; Not found: return original head
jmp .done
.found_it:
; Remove rbx from the list
test r12, r12 ; prev == NULL? (removing head)
jz .remove_head
; Middle/tail: prev->next = current->next
mov rax, [rbx + 8] ; rax = current->next
mov [r12 + 8], rax ; prev->next = current->next
jmp .free_node
.remove_head:
mov r14, [rbx + 8] ; new head = old head->next
.free_node:
; Free the removed node; r14 (head) is callee-saved, so free() preserves it.
; (No push/pop around the call is needed, and an odd extra push would
; misalign the stack at the call site.)
mov rdi, rbx ; arg: pointer to free
call free
.done:
mov rax, r14 ; return (possibly new) head
pop r14
pop r13
pop r12
pop rbx
pop rbp
ret
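The same logic as a C reference sketch (names are ours; the node layout matches the assembly):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct Node { int64_t value; struct Node *next; } Node;

/* Mirror of the assembly: walk with a prev pointer; unlink either the
   head or an interior node, free it, and return the (possibly new) head. */
static Node *list_remove_c(Node *head, int64_t value) {
    Node *prev = NULL, *cur = head;
    while (cur != NULL && cur->value != value) {
        prev = cur;
        cur = cur->next;
    }
    if (cur == NULL) return head;             /* not found: head unchanged */
    if (prev != NULL) prev->next = cur->next; /* unlink middle/tail node */
    else              head = cur->next;       /* removed the head node */
    free(cur);
    return head;
}
```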
Struct Layout and Padding
C compilers insert padding bytes between struct fields to ensure alignment:
struct Padded {
char a; // offset 0, 1 byte
// 7 bytes padding
int64_t b; // offset 8, 8 bytes (aligned to 8-byte boundary)
int32_t c; // offset 16, 4 bytes
// 4 bytes padding
int64_t d; // offset 24, 8 bytes
}; // sizeof = 32 bytes
struct Reordered { // same fields, ordered largest-first (reordered by hand, not #pragma pack)
int64_t b; // offset 0, 8 bytes
int64_t d; // offset 8, 8 bytes
int32_t c; // offset 16, 4 bytes
char a; // offset 20, 1 byte
// 3 bytes padding (for overall alignment to 8 bytes)
}; // sizeof = 24 bytes (more compact)
The rule: each field is aligned to a multiple of its alignment requirement (for primitive types, equal to the type's size). The struct itself takes the alignment of its most strictly aligned member, and its size is rounded up to a multiple of that alignment. Ordering fields from largest to smallest minimizes padding.
In assembly, you work with the actual offsets regardless of padding:
; Accessing Padded struct:
mov rax, [rdi + 8] ; rdi->b (offset 8, not 1!)
mov eax, [rdi + 16] ; rdi->c (offset 16)
mov rbx, [rdi + 24] ; rdi->d (offset 24)
🔍 Under the Hood: Use `offsetof(struct_type, field_name)` in C to get the exact offset at compile time, or verify with GDB: `p &my_struct.field_name`. Never guess struct offsets in assembly — always verify.
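The stated offsets can be verified with `offsetof`; this block mirrors the `Padded` struct above:

```c
#include <stdint.h>
#include <stddef.h>

struct Padded {
    char    a;   /* offset 0 */
    int64_t b;   /* offset 8, after 7 bytes of padding */
    int32_t c;   /* offset 16 */
    int64_t d;   /* offset 24, after 4 bytes of padding */
};
```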
Implementing a Simple Stack Data Structure
; A simple integer stack implemented as an array with a top pointer
; Stack stored at: [rdi] = top index (initially -1), [rdi+8..] = data
; push: void stack_push(Stack *s, int64_t value)
; RDI = stack, RSI = value
stack_push:
mov eax, [rdi] ; eax = top index
inc eax ; top++
mov [rdi], eax ; store new top
movsx rax, eax ; sign-extend for indexing
mov [rdi + 8 + rax*8], rsi ; s->data[top] = value
ret
; pop: int64_t stack_pop(Stack *s)
; RDI = stack, returns popped value in RAX
stack_pop:
mov eax, [rdi] ; eax = top
test eax, eax ; top < 0?
js .underflow
movsx rcx, eax ; sign-extend for indexing
mov rax, [rdi + 8 + rcx*8] ; rax = s->data[top]
dec dword [rdi] ; top--
ret
.underflow:
mov rax, -1 ; error: stack underflow
ret
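A C model of the same layout and operations (a sketch; the names are ours, and the field offsets match the assembly's [rdi] and [rdi+8]):

```c
#include <stdint.h>
#include <stddef.h>

/* Same layout as the assembly: 32-bit top index at offset 0,
   int64_t data array starting at offset 8 (4 bytes of padding between). */
typedef struct {
    int32_t top;           /* -1 when empty */
    int64_t data[16];
} Stack;

static void stack_push_c(Stack *s, int64_t v) { s->data[++s->top] = v; }

static int64_t stack_pop_c(Stack *s) {
    if (s->top < 0) return -1;   /* underflow sentinel, as in the assembly */
    return s->data[s->top--];
}
```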
C Comparison: String and Memory Operations
| C operation | Assembly approach |
|---|---|
| `memcpy(dst, src, n)` | `rep movsb` or `rep movsq` |
| `memset(dst, c, n)` | `rep stosb` (or `rep stosq` with replicated byte) |
| `strlen(s)` | `repne scasb` with NOT/DEC idiom |
| `memcmp(a, b, n)` | `repe cmpsb` |
| `strchr(s, c)` | `repne scasb` with AL = target |
| `arr[i]` (int64_t) | `[rdi + rcx*8]` |
| `struct.field` | `[rdi + offset]` |
| `list->next` | `[rdi + next_offset]` |
🔄 Check Your Understanding: After `repne scasb` with AL = 0 and RDI pointing to "XY\0", how many times did the loop body execute? What is the final value of RCX relative to its starting value?
Answer
The loop executed 3 times: once for 'X' (no match), once for 'Y' (no match), once for '\0' (match, loop exits). RCX was decremented 3 times from its starting value.
After the loop, RDI points one past the null byte (to the byte after '\0'), and RCX = starting_RCX - 3.
Summary
Arrays, strings, and data structures in assembly are exactly what they are in theory: contiguous memory regions with consistent interpretation conventions. The x86-64 REP string instructions (MOVSB, STOSB, SCASB, CMPSB) provide hardware-assisted byte operations that form the core of the C standard library. The Direction Flag controls whether operations advance forward or backward; almost always you want CLD at the start.
Linked lists require careful management of next-pointer offsets and the null-termination convention. Struct layouts require knowing exact byte offsets, which padding rules determine. These are not abstract concerns — getting an offset wrong by 4 bytes produces a program that works on some compilers and crashes on others.
In Chapter 13, we turn to bit manipulation: the operations that let you pack information into individual bits, test flags efficiently, and build the foundation of the XOR cipher that starts the encryption tool anchor example.