In This Chapter
- Memory as the Only Data Structure
- Arrays: Contiguous Memory with Scaled Indexing
- String Operations: The REP Prefix and String Instructions
- Implementing the C String Library in Assembly
- Linked Lists in Assembly
- Struct Layout and Padding
- Implementing a Simple Stack Data Structure
- C Comparison: String and Memory Operations
- Summary
Chapter 12: Arrays, Strings, and Data Structures
Memory as the Only Data Structure
At the machine level, there are no arrays, strings, or linked lists. There is only memory: a flat sequence of bytes, and registers that hold numbers. Every data structure your program uses is a convention for interpreting those bytes — an agreement between the writer and reader of the memory about what the bytes mean.
This chapter makes that concrete. Arrays are contiguous memory regions accessed with scaled indexing. Strings are byte arrays terminated by zero (or specified by length). Linked lists are structs where one field holds the address of the next struct. The REP-prefix string instructions (MOVSB, STOSB, SCASB, CMPSB) provide hardware acceleration for operating on these byte sequences. Understanding how they work at the instruction level gives you precise control over memory that no higher-level language can match.
Arrays: Contiguous Memory with Scaled Indexing
An array stores elements contiguously in memory. If the base address is B and each element is S bytes, element i is at address B + i*S.
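A quick C sanity check of the addressing formula; the helper name is ours, purely illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Byte offset of arr[i] from the array base: i * sizeof(element). */
static ptrdiff_t element_offset(const int64_t *arr, int64_t i) {
    return (const char *)&arr[i] - (const char *)arr;
}
```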
Element Access
; int64_t arr[8] — base at RDI, index in RCX
mov rax, [rdi + rcx*8] ; load arr[rcx] (8-byte elements, scale=8)
mov [rdi + rcx*8], rax ; store arr[rcx]
; int32_t arr[8] — scale=4
mov eax, [rdi + rcx*4] ; load arr[rcx] (4-byte elements)
; int16_t arr[8] — scale=2
movsx rax, word [rdi + rcx*2] ; load arr[rcx] (sign-extend 16-bit to 64)
; int8_t arr[8] — scale=1
movsx rax, byte [rdi + rcx] ; load arr[rcx] (sign-extend byte to 64)
Bounds Checking
Assembly provides no automatic bounds checking. Out-of-bounds access silently reads or writes whatever memory happens to be at the computed address:
; DANGEROUS: no bounds check
mov rax, [rdi + rcx*8] ; if rcx < 0 or rcx >= arr_size, this reads whatever memory sits at that address (or faults)
; CORRECT: add bounds check
cmp rcx, arr_size
jae .out_of_bounds ; if rcx >= arr_size (unsigned), error
mov rax, [rdi + rcx*8] ; now safe
The unsigned comparison jae covers both rcx >= arr_size (too large) and effectively also negative values (which are huge positive numbers in unsigned interpretation). One check handles both cases.
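The same trick is easy to check in C; the function name below is ours, purely illustrative:

```c
#include <stdint.h>

/* One unsigned comparison rejects both negative and too-large indices:
   a negative int64_t wraps to a huge uint64_t, so it also fails `< n`. */
static int in_bounds(int64_t i, uint64_t n) {
    return (uint64_t)i < n;
}
```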
Multi-Dimensional Arrays: Row-Major Layout
C stores multi-dimensional arrays in row-major order: a[r][c] is at offset (r * num_cols + c) * element_size.
int64_t matrix[5][4]; // 5 rows, 4 columns
// matrix[r][c] is at matrix + (r*4 + c) * 8
; Access matrix[r][c]:
; RDI = matrix base, RSI = row, RDX = col, num_cols = 4
; Method 1: compute full index
mov rax, rsi ; rax = row
imul rax, 4 ; rax = row * 4 (num_cols)
add rax, rdx ; rax = row*4 + col
mov rax, [rdi + rax*8] ; matrix[row][col]
; Method 2: one LEA (scale 4 is a valid addressing scale)
; lea rax, [rsi + rsi*3] alone would be WRONG here: it computes row*4 but leaves out col
lea rax, [rdx + rsi*4] ; rax = row*4 + col in a single LEA
mov rax, [rdi + rax*8] ; matrix[row][col]
; Method 3: advance a pointer row by row when iterating over all rows
; Each row is 4 * 8 = 32 bytes, but 32 is not a valid scale (only 1, 2, 4, 8),
; so compute the row address with a shift:
mov rbx, rsi ; rbx = row
shl rbx, 5 ; rbx = row * 32 (row stride in bytes)
add rbx, rdi ; rbx = &matrix[row][0]
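The row-major formula can be verified in C; `rm_offset` is an illustrative name of ours:

```c
#include <stdint.h>
#include <stddef.h>

enum { ROWS = 5, COLS = 4 };

/* Row-major byte offset of matrix[r][c] for int64_t elements. */
static size_t rm_offset(size_t r, size_t c) {
    return (r * COLS + c) * sizeof(int64_t);
}
```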
Arrays of Structs vs. Structs of Arrays
The AoS (Array of Structs) vs. SoA (Struct of Arrays) layout distinction matters enormously for SIMD performance:
// AoS: poor for SIMD (fields interleaved)
struct Point { float x, y, z; };
Point points[1000]; // x,y,z,x,y,z,x,y,z...
// SoA: excellent for SIMD (all x's together)
struct Points {
float x[1000];
float y[1000];
float z[1000];
};
In AoS, to process all x-coordinates, you must load x, skip y and z, load next x — stride 12 bytes. In SoA, all x-coordinates are contiguous — can be loaded 4 at a time with SSE or 8 at a time with AVX.
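The stride difference can be checked in C; this minimal sketch mirrors the `Point` layout above:

```c
#include <stddef.h>

/* AoS element: three floats, 12 bytes, no padding (all fields align to 4) */
typedef struct { float x, y, z; } Point;
```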
String Operations: The REP Prefix and String Instructions
x86-64 has a family of "string instructions" that operate on sequences of bytes (or words, or dwords) pointed to by RSI and/or RDI. Combined with the REP prefix, they repeat the operation RCX times.
Direction Flag (DF) and CLD/STD
All string instructions advance RSI and/or RDI after each operation. The direction depends on the Direction Flag (DF) in RFLAGS:
- DF = 0 (cleared): increment (forward direction) — the normal case
- DF = 1 (set): decrement (backward direction) — for overlapping copies
cld ; clear DF: forward direction (almost always what you want)
std ; set DF: backward direction (rare; dangerous if forgotten)
⚠️ Common Mistake: Calling code that uses STD without a subsequent CLD. If DF is left set, subsequent string operations (including those in library functions) will scan backward. Always pair STD with CLD, and strongly prefer CLD at function entry when using string instructions.
The System V ABI requires DF = 0 at function call boundaries. Library functions assume DF = 0 on entry.
MOVSB/MOVSW/MOVSD/MOVSQ: Memory Copy
Copies one element from [RSI] to [RDI], then advances both RSI and RDI (by 1/2/4/8 bytes depending on the size suffix). The RCX decrement comes from the REP prefix, which repeats the copy until RCX reaches zero; the bare instruction does not touch RCX.
; REP MOVSB: copy RCX bytes from [RSI] to [RDI]
cld ; forward direction
mov rcx, 64 ; count
lea rsi, [rel source] ; source
lea rdi, [rel dest] ; destination
rep movsb ; copy 64 bytes
; REP MOVSQ: copy RCX quadwords (8 bytes each) from [RSI] to [RDI]
mov rcx, 8 ; 8 × 8 = 64 bytes
rep movsq ; more efficient than rep movsb for aligned data
⚠️ Common Mistake: Using `rep movsb` when source and destination overlap and the source is at a lower address than the destination. With DF=0 (forward), `rep movsb` overwrites source bytes before they have been read. A forward copy is safe for overlapping regions only when destination < source; when destination > source, copy backward (STD, with RSI/RDI pointing at the last byte, then CLD afterward) or call `memmove`.
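The hazard is easy to reproduce in C; `copy_forward` below is our illustrative stand-in for what `rep movsb` does with DF=0:

```c
#include <string.h>

/* Naive forward byte copy: the behavior of rep movsb with DF = 0. */
static void copy_forward(char *dst, const char *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```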
Implementing memcpy with rep movsq:
; void *memcpy(void *dst, const void *src, size_t n)
; RDI = dst, RSI = src, RDX = n (bytes)
; Note: after this function, RDI points past the destination (REP advances it)
; So we save RDI first to return it
memcpy_impl:
push rbp
mov rbp, rsp
push rdi ; save original dst to return
mov rcx, rdx ; rcx = n
cld ; forward direction
rep movsb ; copy n bytes
pop rax ; return original dst
pop rbp
ret
STOSB/STOSW/STOSD/STOSQ: Memory Fill
Stores AL/AX/EAX/RAX into [RDI], then advances RDI and decrements RCX.
; REP STOSB: fill RCX bytes at [RDI] with value in AL
cld
mov rdi, [rel buffer_addr]
mov al, 0 ; fill with zero
mov rcx, 64 ; 64 bytes
rep stosb ; zero-fill 64 bytes
; REP STOSQ: fill RCX quadwords with RAX (8-byte fill value)
xor eax, eax ; rax = 0
mov rcx, 8 ; 8 × 8 = 64 bytes
rep stosq ; zero-fill 64 bytes more efficiently
Implementing memset:
; void *memset(void *dst, int c, size_t n)
; RDI = dst, ESI = c (fill value, treated as unsigned byte), RDX = n
memset_impl:
mov r8, rdi ; save dst for the return value (rep stos advances rdi)
movzx eax, sil ; eax = fill byte, zero-extended
mov r9, 0x0101010101010101
imul rax, r9 ; replicate byte: 0xCC becomes 0xCCCCCCCCCCCCCCCC
mov rcx, rdx
shr rcx, 3 ; rcx = n/8 (number of quadwords)
cld
rep stosq ; fill the bulk, eight bytes at a time
mov rcx, rdx
and rcx, 7 ; rcx = n%8 (remaining tail bytes)
rep stosb ; AL is the low byte of RAX, still the fill byte
mov rax, r8 ; return original dst
ret
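The byte-replication multiply used above can be checked in C; the helper name is ours:

```c
#include <stdint.h>

/* Replicate a byte into all 8 lanes of a 64-bit fill value, as the
   multiply-by-0x0101010101010101 trick does before rep stosq. */
static uint64_t replicate_byte(uint8_t c) {
    return (uint64_t)c * 0x0101010101010101ULL;
}
```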
SCASB/SCASW/SCASD/SCASQ: Scan for Value
Compares AL/AX/EAX/RAX with [RDI], sets flags, advances RDI, decrements RCX. Combined with REPNE (repeat while not equal), this implements strchr and strlen:
; strlen using REPNE SCASB
; RDI = string pointer, returns length in RAX
strlen_scasb:
cld
xor eax, eax ; al = 0 (search for null terminator)
or rcx, -1 ; rcx = 0xFFFFFFFFFFFFFFFF (max count; rcx must be initialized)
repne scasb ; scan: advance RDI, decrement RCX until [RDI-1] == AL or RCX == 0
; After: RDI points one past the null byte found
; RCX = 0xFFFFFFFFFFFFFFFF - (length+1)
not rcx ; rcx = length + 1
dec rcx ; rcx = length (not counting null)
mov rax, rcx
ret
Trace for "ABC\0":
- Initial: RDI = str, AL = 0, RCX = 0xFFFF...FFFF
- Iteration 1: compare AL with 'A' → not equal; RDI += 1, RCX--
- Iteration 2: compare AL with 'B' → not equal; RDI += 1, RCX--
- Iteration 3: compare AL with 'C' → not equal; RDI += 1, RCX--
- Iteration 4: compare AL with '\0' → EQUAL; exit loop (REPNE stops on equal)
- After: RCX = 0xFFFF...FFFF - 4 = 0xFFFF...FFFB
- not rcx = 4; dec rcx = 3. Returns 3. Correct.
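The NOT/DEC arithmetic can be replayed in plain C; `scasb_length` is an illustrative name of ours:

```c
#include <stdint.h>

/* The NOT/DEC idiom: starting from RCX = ~0, after repne scasb has run
   (len + 1) iterations, ~RCX - 1 recovers len. */
static uint64_t scasb_length(uint64_t iterations) {
    uint64_t rcx = ~(uint64_t)0;   /* max count before the scan */
    rcx -= iterations;             /* repne scasb decrements once per byte examined */
    return ~rcx - 1;               /* not rcx ; dec rcx */
}
```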
CMPSB/CMPSW/CMPSD/CMPSQ: Compare Memory Regions
Compares [RSI] with [RDI], sets flags, advances both RSI and RDI, decrements RCX.
; REPE CMPSB: compare RCX bytes, stop at first mismatch or end
; Used for: memcmp, strcmp
memcmp_impl:
; RDI = a, RSI = b, RDX = n
mov rcx, rdx
cld
jrcxz .equal ; n == 0: regions are trivially equal (repe would leave stale flags)
repe cmpsb ; compare until mismatch or RCX == 0
; After: ZF = 1 if all bytes matched; ZF = 0 if mismatch
; For memcmp return value: a[i] - b[i] at first mismatch
je .equal
; Get the differing bytes (RDI and RSI both advanced one past the mismatch)
movzx eax, byte [rdi - 1] ; the byte from a
movzx ecx, byte [rsi - 1] ; the byte from b
sub eax, ecx ; return a[i] - b[i] (positive if a > b, negative if a < b)
ret
.equal:
xor eax, eax
ret
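A C sketch confirming that the byte-difference convention agrees in sign with libc `memcmp`; the helper name is ours:

```c
#include <string.h>

/* Our memcmp returns a[i] - b[i] at the first mismatch, with bytes
   compared as unsigned: the same sign convention libc memcmp uses. */
static int byte_diff(const unsigned char *a, const unsigned char *b, size_t i) {
    return (int)a[i] - (int)b[i];
}
```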
Register State of REP Instructions
REP instructions modify RCX (decremented toward 0), RSI and/or RDI (advanced), and RFLAGS (based on the last comparison). After REP MOVSB:
- RCX = 0
- RSI = original_RSI + count
- RDI = original_RDI + count
After REPE CMPSB (stopped by a mismatch):
- RCX = count minus the number of pairs examined (the mismatching pair is counted as examined)
- RSI = original_RSI + (count − remaining), i.e., one past the mismatching byte
- RDI = similarly advanced
⚡ Performance Note: REP instructions have a startup overhead of several cycles before the microcode loop begins. For very short operations (under ~8 bytes), a few explicit MOV instructions are faster. For larger operations (over ~64 bytes), REP MOVSB achieves near-memory-bandwidth throughput on modern Intel processors via the Enhanced REP MOVSB/STOSB (ERMSB) optimization introduced with Ivy Bridge.
Implementing the C String Library in Assembly
; ─────────────────────────────────────────────────
; strlen: size_t strlen(const char *s)
; ─────────────────────────────────────────────────
strlen:
cld
xor eax, eax
or rcx, -1 ; rcx = max count (must be initialized for the NOT/DEC idiom below)
repne scasb ; scan for null
not rcx
dec rcx
mov rax, rcx
ret
; ─────────────────────────────────────────────────
; strcpy: char *strcpy(char *dst, const char *src)
; RDI = dst, RSI = src
; ─────────────────────────────────────────────────
strcpy:
push rbp
mov rbp, rsp
push r12 ; preserve callee-saved r12 (it will hold dst)
mov r12, rdi ; save dst for return
cld
.copy_loop:
lodsb ; al = [rsi], rsi++
stosb ; [rdi] = al, rdi++
test al, al ; null terminator?
jnz .copy_loop ; no: continue
mov rax, r12 ; return original dst
pop r12
pop rbp
ret
; ─────────────────────────────────────────────────
; memset: void *memset(void *s, int c, size_t n)
; RDI = s, ESI = c (byte), RDX = n
; ─────────────────────────────────────────────────
memset:
mov r8, rdi ; save dst for the return value (rep stosb advances rdi)
movzx eax, sil ; al = fill byte
mov rcx, rdx ; rcx = n
cld
rep stosb ; fill n bytes with AL
; (simplified: production versions use stosq for large n)
mov rax, r8 ; return original dst
ret
; ─────────────────────────────────────────────────
; memcmp: int memcmp(const void *a, const void *b, size_t n)
; RDI = a, RSI = b, RDX = n
; ─────────────────────────────────────────────────
memcmp:
mov rcx, rdx
cld
jrcxz .equal ; n == 0: trivially equal (avoid testing stale flags)
repe cmpsb
je .equal
movzx eax, byte [rdi - 1]
movzx ecx, byte [rsi - 1]
sub eax, ecx
ret
.equal:
xor eax, eax
ret
Note: LODSB (Load String Byte) loads [RSI] into AL and increments RSI. It is the read-side counterpart of MOVSB, for when you need to read and process each byte rather than just copy it.
Linked Lists in Assembly
A singly-linked list node:
struct Node {
int64_t value; // offset 0
struct Node *next; // offset 8
};
; Traverse a linked list, sum all values
; RDI = head pointer (or NULL if empty)
; Returns sum in RAX
list_sum:
xor eax, eax ; sum = 0
test rdi, rdi ; head == NULL?
jz .done ; if NULL, return 0
.loop:
add rax, [rdi] ; sum += node->value (offset 0)
mov rdi, [rdi + 8] ; rdi = node->next (offset 8)
test rdi, rdi ; next == NULL?
jnz .loop ; if not NULL, continue
.done:
ret
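For comparison, a C version of the same traversal (a sketch; `list_sum_c` is our name, and the `Node` layout matches the struct above):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct Node { int64_t value; struct Node *next; } Node;

/* Same traversal as the assembly: read value at offset 0, follow the
   next pointer at offset 8 until it is NULL. */
static int64_t list_sum_c(const Node *n) {
    int64_t sum = 0;
    for (; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}
```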
Linked List Insertion (at head)
; Node *list_prepend(Node *head, int64_t value)
; Allocates a new node with malloc, inserts at head
; RDI = head, RSI = value
; Returns new head in RAX
list_prepend:
push rbp
mov rbp, rsp
push rbx ; save head (callee-saved)
push r12 ; save value
push r13 ; for alignment
sub rsp, 8 ; pad to 16-byte alignment
mov rbx, rdi ; save head
mov r12, rsi ; save value
mov edi, 16 ; malloc(sizeof(Node)) = 16 bytes
call malloc ; rax = new node pointer
test rax, rax ; malloc returned NULL?
jz .fail
; Initialize the new node
mov [rax], r12 ; node->value = value
mov [rax + 8], rbx ; node->next = old head
; Return new node (it is now the head)
jmp .done
.fail:
xor eax, eax ; return NULL
.done:
add rsp, 8
pop r13
pop r12
pop rbx
pop rbp
ret
Linked List Deletion (remove node by value)
; Node *list_remove(Node *head, int64_t value)
; Removes the first node with the given value
; Returns new head (may be different if we removed the head)
; RDI = head, RSI = value
list_remove:
push rbp
mov rbp, rsp
push rbx ; current node
push r12 ; previous node
push r13 ; target value
push r14 ; head (to return)
mov r14, rdi ; save head for potential return
mov r13, rsi ; save target value
xor r12d, r12d ; prev = NULL
test rdi, rdi ; head == NULL?
jz .done ; empty list, return NULL
mov rbx, rdi ; current = head
.loop:
cmp [rbx], r13 ; current->value == target?
je .found_it
mov r12, rbx ; prev = current
mov rbx, [rbx + 8] ; current = current->next
test rbx, rbx
jnz .loop ; continue if not NULL
; Not found: return original head
jmp .done
.found_it:
; Remove rbx from the list
test r12, r12 ; prev == NULL? (removing head)
jz .remove_head
; Middle/tail: prev->next = current->next
mov rax, [rbx + 8] ; rax = current->next
mov [r12 + 8], rax ; prev->next = current->next
jmp .free_node
.remove_head:
mov r14, [rbx + 8] ; new head = old head->next
.free_node:
; Free the removed node; r14 (head) is callee-saved, so free() preserves it.
; (No push/pop around the call is needed, and an odd extra push would
; misalign the stack at the call site.)
mov rdi, rbx ; arg: pointer to free
call free
.done:
mov rax, r14 ; return (possibly new) head
pop r14
pop r13
pop r12
pop rbx
pop rbp
ret
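The same logic as a C reference sketch (names are ours; the node layout matches the assembly):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct Node { int64_t value; struct Node *next; } Node;

/* Mirror of the assembly: walk with a prev pointer; unlink either the
   head or an interior node, free it, and return the (possibly new) head. */
static Node *list_remove_c(Node *head, int64_t value) {
    Node *prev = NULL, *cur = head;
    while (cur != NULL && cur->value != value) {
        prev = cur;
        cur = cur->next;
    }
    if (cur == NULL) return head;             /* not found: head unchanged */
    if (prev != NULL) prev->next = cur->next; /* unlink middle/tail node */
    else              head = cur->next;       /* removed the head node */
    free(cur);
    return head;
}
```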
Struct Layout and Padding
C compilers insert padding bytes between struct fields to ensure alignment:
struct Padded {
char a; // offset 0, 1 byte
// 7 bytes padding
int64_t b; // offset 8, 8 bytes (aligned to 8-byte boundary)
int32_t c; // offset 16, 4 bytes
// 4 bytes padding
int64_t d; // offset 24, 8 bytes
}; // sizeof = 32 bytes
struct Reordered { // same fields, ordered largest-first (reordered by hand, not #pragma pack)
int64_t b; // offset 0, 8 bytes
int64_t d; // offset 8, 8 bytes
int32_t c; // offset 16, 4 bytes
char a; // offset 20, 1 byte
// 3 bytes padding (for overall alignment to 8 bytes)
}; // sizeof = 24 bytes (more compact)
The rule: each field is aligned to a multiple of its alignment requirement (for primitive types, equal to the type's size). The struct itself takes the alignment of its most strictly aligned member, and its size is rounded up to a multiple of that alignment. Ordering fields from largest to smallest minimizes padding.
In assembly, you work with the actual offsets regardless of padding:
; Accessing Padded struct:
mov rax, [rdi + 8] ; rdi->b (offset 8, not 1!)
mov eax, [rdi + 16] ; rdi->c (offset 16)
mov rbx, [rdi + 24] ; rdi->d (offset 24)
🔍 Under the Hood: Use `offsetof(struct_type, field_name)` in C to get the exact offset at compile time, or verify with GDB: `p &my_struct.field_name`. Never guess struct offsets in assembly — always verify.
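The stated offsets can be verified with `offsetof`; this block mirrors the `Padded` struct above:

```c
#include <stdint.h>
#include <stddef.h>

struct Padded {
    char    a;   /* offset 0 */
    int64_t b;   /* offset 8, after 7 bytes of padding */
    int32_t c;   /* offset 16 */
    int64_t d;   /* offset 24, after 4 bytes of padding */
};
```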
Implementing a Simple Stack Data Structure
; A simple integer stack implemented as an array with a top pointer
; Stack stored at: [rdi] = top index (initially -1), [rdi+8..] = data
; push: void stack_push(Stack *s, int64_t value)
; RDI = stack, RSI = value
stack_push:
mov eax, [rdi] ; eax = top index
inc eax ; top++
mov [rdi], eax ; store new top
movsx rax, eax ; sign-extend for indexing
mov [rdi + 8 + rax*8], rsi ; s->data[top] = value
ret
; pop: int64_t stack_pop(Stack *s)
; RDI = stack, returns popped value in RAX
stack_pop:
mov eax, [rdi] ; eax = top
test eax, eax ; top < 0?
js .underflow
movsx rcx, eax ; sign-extend for indexing
mov rax, [rdi + 8 + rcx*8] ; rax = s->data[top]
dec dword [rdi] ; top--
ret
.underflow:
mov rax, -1 ; error: stack underflow
ret
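A C model of the same layout and operations (a sketch; the names are ours, and the field offsets match the assembly's [rdi] and [rdi+8]):

```c
#include <stdint.h>
#include <stddef.h>

/* Same layout as the assembly: 32-bit top index at offset 0,
   int64_t data array starting at offset 8 (4 bytes of padding between). */
typedef struct {
    int32_t top;           /* -1 when empty */
    int64_t data[16];
} Stack;

static void stack_push_c(Stack *s, int64_t v) { s->data[++s->top] = v; }

static int64_t stack_pop_c(Stack *s) {
    if (s->top < 0) return -1;   /* underflow sentinel, as in the assembly */
    return s->data[s->top--];
}
```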
C Comparison: String and Memory Operations
| C operation | Assembly approach |
|---|---|
| `memcpy(dst, src, n)` | `rep movsb` or `rep movsq` |
| `memset(dst, c, n)` | `rep stosb` (or `rep stosq` with replicated byte) |
| `strlen(s)` | `repne scasb` with NOT/DEC idiom |
| `memcmp(a, b, n)` | `repe cmpsb` |
| `strchr(s, c)` | `repne scasb` with AL = target |
| `arr[i]` (int64_t) | `[rdi + rcx*8]` |
| `struct.field` | `[rdi + offset]` |
| `list->next` | `[rdi + next_offset]` |
🔄 Check Your Understanding: After `repne scasb` with AL = 0 and RDI pointing to "XY\0", how many times did the loop body execute? What is the final value of RCX relative to its starting value?
Answer
The loop executed 3 times: once for 'X' (no match), once for 'Y' (no match), once for '\0' (match, loop exits). RCX was decremented 3 times from its starting value.
After the loop, RDI points one past the null byte (to the byte after '\0'), and RCX = starting_RCX - 3.
Summary
Arrays, strings, and data structures in assembly are exactly what they are in theory: contiguous memory regions with consistent interpretation conventions. The x86-64 REP string instructions (MOVSB, STOSB, SCASB, CMPSB) provide hardware-assisted byte operations that form the core of the C standard library. The Direction Flag controls whether operations advance forward or backward; almost always you want CLD at the start.
Linked lists require careful management of next-pointer offsets and the null-termination convention. Struct layouts require knowing exact byte offsets, which padding rules determine. These are not abstract concerns — getting an offset wrong by 4 bytes produces a program that works on some compilers and crashes on others.
In Chapter 13, we turn to bit manipulation: the operations that let you pack information into individual bits, test flags efficiently, and build the foundation of the XOR cipher that starts the encryption tool anchor example.