
Chapter 8: Data Movement and Addressing Modes

The Fundamental Operation of Every Computer

Before a processor can add two numbers, compare them, or jump to a subroutine, it must move data. Every non-trivial computation is mostly data movement: load a value from memory, operate on it in a register, store the result back. The x86-64 MOV instruction family and its addressing modes are the machinery behind all of that movement, and understanding them in detail separates programmers who write assembly from programmers who understand assembly.

This chapter covers every form of MOV, the complete x86-64 addressing mode syntax, the LEA instruction (which computes addresses without reading memory and serves double duty as fast arithmetic), and the MOVZX/MOVSX instructions that handle size mismatches safely. By the end of this chapter, you will read mov rax, [rbx + rcx*8 + 16] and immediately understand what it accesses, how the processor computes the address, and why a compiler might emit that exact instruction for a struct field inside an array.

The MOV Instruction: All Forms

MOV is the most common instruction in any x86-64 binary. Its job is conceptually simple: copy a value from a source to a destination. The complexity is in how many kinds of sources and destinations exist.

; Register to register (same size)
mov rax, rbx        ; 64-bit: RAX = RBX
mov eax, ebx        ; 32-bit: EAX = EBX (also zeros upper 32 bits of RAX)
mov ax, bx          ; 16-bit: AX = BX  (upper 48 bits of RAX unchanged)
mov al, bl          ; 8-bit:  AL = BL  (upper 56 bits of RAX unchanged)

; Immediate to register
mov rax, 0x1234567890abcdef   ; 64-bit immediate (MOVABS encoding)
mov eax, 42                   ; 32-bit immediate (zeros upper 32 bits of RAX)
mov ax, 0xFFFF                ; 16-bit immediate
mov al, 0xFF                  ; 8-bit immediate

; Memory to register
mov rax, [rbx]                ; load 64-bit from address in RBX
mov eax, [rbx]                ; load 32-bit from address in RBX
mov ax,  [rbx]                ; load 16-bit from address in RBX
mov al,  [rbx]                ; load 8-bit  from address in RBX

; Register to memory
mov [rbx], rax                ; store 64-bit to address in RBX
mov [rbx], eax                ; store 32-bit
mov [rbx], ax                 ; store 16-bit
mov [rbx], al                 ; store 8-bit

; Immediate to memory (must specify size explicitly)
mov qword [rbx], 42           ; store 64-bit value 42
mov dword [rbx], 42           ; store 32-bit value 42
mov word  [rbx], 42           ; store 16-bit value 42
mov byte  [rbx], 42           ; store 8-bit  value 42

The 32-bit Zero-Extension Rule

This is one of the most important (and surprising) behaviors in x86-64. When you write to a 32-bit register, the upper 32 bits of the full 64-bit register are automatically zeroed:

mov rax, 0xFFFFFFFFFFFFFFFF   ; RAX = 0xFFFFFFFFFFFFFFFF
mov eax, 1                    ; RAX = 0x0000000000000001  ← upper 32 zeroed!

mov rax, 0xFFFFFFFFFFFFFFFF   ; RAX = 0xFFFFFFFFFFFFFFFF
mov ax, 1                     ; RAX = 0xFFFFFFFFFFFF0001  ← only 16 bits changed
mov al, 1                     ; RAX = 0xFFFFFFFFFFFFFF01  ← only 8 bits changed

The 32-bit zero-extension rule exists for performance: AMD, who designed the x86-64 extension, specified that writing a 32-bit register clears the upper half, which eliminates false dependencies that would otherwise stall the out-of-order engine. The 8-bit and 16-bit variants do not have this behavior (they preserve the upper bits), a historical quirk inherited from 16-bit x86.

⚠️ Common Mistake: Writing to AX or AL when you intend to operate on a clean 64-bit value. If RAX contains garbage in the upper bits and you write to AL, the garbage persists. Use movzx rax, al to explicitly zero-extend when you need the full register clean.

What MOV Cannot Do

MOV has one important restriction: it cannot move memory to memory directly. At most one operand can be a memory reference:

; ILLEGAL — cannot do memory to memory
mov [rdi], [rsi]              ; assembler error

; CORRECT — use a register as intermediary
mov rax, [rsi]
mov [rdi], rax

MOV also cannot load a 64-bit immediate into memory directly (use a register first), and the immediate-to-memory form is limited to 32-bit sign-extended values for the mov [mem], imm encoding.

Addressing Modes: The Complete Reference

An addressing mode is the method by which an instruction specifies the memory location it wants to access. x86-64 supports more addressing modes than most RISC architectures, and they all reduce to one general form:

[ base + index × scale + displacement ]

Where:

- base is any general-purpose register
- index is any general-purpose register except RSP
- scale is 1, 2, 4, or 8 (no other values; enforced by the encoding)
- displacement is a signed 8-bit or 32-bit constant

Not all components are required. Here is every combination you will encounter:

Immediate (Not Really an Addressing Mode, but Listed for Completeness)

mov rax, 42            ; source is the literal value 42

The value is encoded directly in the instruction bytes. No memory access for the source operand.

Register Direct

mov rax, rbx           ; source is the register RBX

No memory access. The value comes from another register.

Direct Memory (Absolute Address)

mov rax, [0x600000]    ; load from absolute address 0x600000

Rarely used in position-independent code. Requires a 32-bit address in the 64-bit encoding, which limits it to addresses that fit in 32 bits (or use sign extension to reach negative addresses). In practice, you will almost always see RIP-relative addressing instead.

Register Indirect

mov rax, [rbx]         ; load from address stored in RBX

This is [base] — displacement and index are zero. The register holds the address. This is how you dereference a pointer.

; Dereferencing a pointer in C:
;   int x = *ptr;
; In assembly (ptr in RDI):
mov eax, [rdi]         ; EAX = *ptr

Base + Displacement

mov rax, [rbx + 8]     ; load from address RBX+8
mov rax, [rbx - 4]     ; displacement can be negative
mov rax, [rbx + 0x18]  ; or in hex

This is the workhorse of struct field access. If RBX points to a struct, [rbx + 8] accesses the field at offset 8:

struct Point {
    int64_t x;    // offset 0
    int64_t y;    // offset 8
    int64_t z;    // offset 16
};
; Accessing struct fields (struct pointer in RDI):
mov rax, [rdi]          ; rax = point.x
mov rbx, [rdi + 8]      ; rbx = point.y
mov rcx, [rdi + 16]     ; rcx = point.z

Base + Index

mov rax, [rbx + rcx]    ; load from address RBX + RCX

Useful for two independent dynamic values. Less common than the next form.

Base + Index × Scale

mov rax, [rbx + rcx*8]  ; load from RBX + RCX*8
mov rax, [rbx + rcx*4]  ; scale 4 for int32_t arrays
mov rax, [rbx + rcx*2]  ; scale 2 for int16_t arrays
mov rax, [rbx + rcx*1]  ; scale 1 for int8_t arrays (same as base+index)

This is the array indexing form. If RBX is the base address of an array of 64-bit integers and RCX is the index, [rbx + rcx*8] accesses element array[rcx].

; int64_t array[N]; accessing array[i]
; RBX = array base address, RCX = i
mov rax, [rbx + rcx*8]    ; rax = array[i]

The scale factor must be 1, 2, 4, or 8. These correspond to the sizes of the standard C types (char, short, int, long/pointer). No other scale values are encodable. If you need [rbx + rcx*3], you must do the multiply yourself or use LEA.

⚙️ How It Works: The scale multiplication happens in dedicated address generation hardware, not the integer ALU. It costs nothing extra compared to unscaled indexing. The processor computes base + index*scale + displacement in a single clock in the AGU (Address Generation Unit) pipeline stage.

Base + Index × Scale + Displacement

mov rax, [rbx + rcx*8 + 16]  ; the full form
mov rax, [rdi + rsi*4 + 24]  ; accessing struct in array

This is the most general form. It combines array indexing with a field offset, letting you access a field inside a struct inside an array in one instruction:

struct Record {
    int32_t id;      // offset 0
    int32_t flags;   // offset 4
    int64_t value;   // offset 8
    int64_t next;    // offset 16
};
// Accessing records[i].value:
// RBX = array base, RCX = i, sizeof(Record) = 24
mov rax, [rbx + rcx*8 + 8]   ; WRONG: scale 8 implies stride 8, not 24

Wait — there is a catch. The scale can only be 1, 2, 4, or 8. If your struct size is not one of those values, you cannot use the scaled index form directly. For a 24-byte struct, you need to multiply the index by 24 yourself:

; records[i].value where sizeof(Record) = 24
; Option 1: IMUL
imul rcx, rcx, 24          ; rcx = i * 24
mov rax, [rbx + rcx + 8]   ; rax = records[i].value

; Option 2: LEA (faster, see next section)
lea rcx, [rcx + rcx*2]     ; rcx = rcx * 3
shl rcx, 3                 ; rcx *= 8, so rcx = i * 24
mov rax, [rbx + rcx + 8]

RIP-Relative Addressing

mov rax, [rel my_variable]   ; NASM: RIP + offset to my_variable
mov rax, [my_variable]       ; also RIP-relative if the file declares "default rel"

In 64-bit mode, addresses can be up to 64 bits wide — too large for a fixed address in most instruction encodings. Position-independent code (PIE/shared libraries) cannot use absolute 32-bit addresses anyway. The solution is RIP-relative addressing: the address is expressed as a signed 32-bit offset from the next instruction's RIP.

section .data
    counter: dq 0

section .text
    global _start
_start:
    ; RIP-relative access to counter
    mov rax, [rel counter]   ; load counter
    inc rax
    mov [rel counter], rax   ; store counter

The assembler and linker between them compute the correct offset by link time. At runtime, the CPU adds the signed 32-bit offset to RIP to get the actual address. This works anywhere the data is within ±2 GB of the code, which covers virtually all binaries.

💡 Mental Model: Think of RIP-relative as "the data is this many bytes away from where I'm standing right now." It is the addressing mode that makes position-independent executables possible.

Addressing Mode Reference Table

| Mode | NASM Syntax | Address Computed | Use Case |
|---|---|---|---|
| Immediate | mov rax, 42 | N/A (value in instruction) | Literal values |
| Register | mov rax, rbx | N/A (register to register) | Register copies |
| Direct | mov rax, [0x600000] | 0x600000 | Absolute static addresses |
| Register indirect | mov rax, [rbx] | RBX | Pointer dereference |
| Base+disp | mov rax, [rbx+8] | RBX + 8 | Struct field access |
| Base+index | mov rax, [rbx+rcx] | RBX + RCX | Two dynamic indices |
| Base+index×scale | mov rax, [rbx+rcx*8] | RBX + RCX×8 | Array indexing |
| Full | mov rax, [rbx+rcx*4+16] | RBX + RCX×4 + 16 | Array of structs |
| RIP-relative | mov rax, [rel label] | RIP + offset | PIC static data |

LEA: The Instruction That Isn't Really About Memory

LEA stands for Load Effective Address. It computes an address using the full addressing mode syntax, but — crucially — it does not actually access memory. The computed address is stored directly in the destination register.

lea rax, [rbx + 8]          ; rax = rbx + 8  (no memory load)
lea rax, [rbx + rcx*4 + 16] ; rax = rbx + rcx*4 + 16

At first this seems pointless. Why compute an address and not use it? The answer is that LEA is the fastest way to perform certain kinds of arithmetic:

LEA as Addition with Offset

; These are equivalent:
add rax, rbx                 ; rax += rbx
lea rax, [rax + rbx]         ; rax = rax + rbx (different dest possible)

; LEA can use a different destination:
lea rcx, [rax + rbx]         ; rcx = rax + rbx, rax and rbx unchanged
; ADD cannot do this: add rdx, rax, rbx is NOT a valid encoding

LEA for Multiplication by Small Constants

The scale factor in the addressing mode encodes multiplication by 1, 2, 4, or 8. By combining base and index cleverly, you can multiply by 3, 5, 9:

; Multiply by 3: x*3 = x*2 + x
lea rax, [rbx + rbx*2]      ; rax = rbx + rbx*2 = rbx*3

; Multiply by 5: x*5 = x*4 + x
lea rax, [rbx + rbx*4]      ; rax = rbx + rbx*4 = rbx*5

; Multiply by 9: x*9 = x*8 + x
lea rax, [rbx + rbx*8]      ; rax = rbx + rbx*8 = rbx*9

; Multiply by 10: two LEAs
lea rax, [rbx + rbx*4]      ; rax = rbx*5
lea rax, [rax + rax]        ; rax = rbx*10  (or: lea rax, [rax*2])

; Multiply by 7: x*7 = x*8 - x
lea rax, [rbx*8]            ; rax = rbx*8
sub rax, rbx                ; rax = rbx*7
; Note: lea rax, [rbx*8 - rbx] is NOT encodable (registers cannot be subtracted)
; Two LEAs also work: lea rax, [rbx + rbx*2]   ; rax = rbx*3
;                     lea rax, [rbx + rax*2]   ; rax = rbx + rbx*6 = rbx*7
; Or just: imul rax, rbx, 7

; Multiply by 25: x*25 = x*5 * 5
lea rax, [rbx + rbx*4]      ; rax = rbx*5
lea rax, [rax + rax*4]      ; rax = rax*5 = rbx*25

LEA for Addition with Displacement

; Increment by a constant without affecting flags
lea rax, [rax + 4]          ; rax += 4  (does NOT touch flags — ADD does)

; Multi-operand addition in one instruction
lea rax, [rbx + rcx + 16]   ; rax = rbx + rcx + 16

⚡ Performance Note: Simple LEA forms (base + displacement, or base + index) execute with 1-cycle latency on modern Intel and AMD cores, the same as ADD; on recent Intel microarchitectures LEA actually issues to the integer ALU ports rather than the load/store AGUs. The three-component form (base + index + displacement) is slower on many Intel cores (3-cycle latency on Haswell through Skylake) and may issue on fewer ports, which is why compilers sometimes split it into two simpler instructions.

Why Compilers Love LEA

GCC and Clang use LEA aggressively for two reasons: it computes a three-way sum in one instruction, and it does not modify flags. Consider this C code:

long compute(long a, long b) {
    return a * 5 + b + 3;
}

Naive translation might use IMUL. But GCC -O2 emits:

; RDI = a, RSI = b
lea rax, [rdi + rdi*4]    ; rax = a*5
lea rax, [rax + rsi + 3]  ; rax = a*5 + b + 3
ret

Two LEA instructions, no flags clobbered, no multiply latency. This is one of those x86-specific tricks that makes the architecture fast despite its complexity.

🔍 Under the Hood: The "no-flags" property of LEA is important when surrounding code depends on flags from a previous comparison. An ADD in the middle would destroy those flags; a LEA leaves them untouched, allowing the conditional branch that checks them to remain correct.

MOVZX: Zero-Extend on Load

When loading a smaller value into a larger register, you often want the upper bits cleared. MOVZX does this explicitly:

movzx rax, byte [rbx]      ; rax = zero_extend_to_64(*(uint8_t*)rbx)
movzx rax, word [rbx]      ; rax = zero_extend_to_64(*(uint16_t*)rbx)
movzx eax, byte [rbx]      ; eax = zero_extend_to_32(*(uint8_t*)rbx)
movzx eax, word [rbx]      ; eax = zero_extend_to_32(*(uint16_t*)rbx)
movzx rax, al              ; rax = zero_extend_to_64(al)
movzx rax, ax              ; rax = zero_extend_to_64(ax)

The key thing to remember: movzx r64, r/m8 always zeroes bits 63:8 of the destination. This is different from mov al, [rbx] which only writes bits 7:0 and leaves the rest of RAX as-is.

; Incorrect: loading a byte into a loop counter
mov rax, 0xDEADBEEFDEADBEEF
mov al, [rbx]              ; RAX = 0xDEADBEEFDEADBE?? (garbage in upper bits!)

; Correct:
movzx rax, byte [rbx]      ; RAX = 0x00000000000000?? (clean)

📊 C Comparison:

uint8_t  b = read_byte(ptr);
uint64_t v = b;             // implicit zero-extension

The compiler emits movzx eax, byte [rdi] for this (the 32-bit destination suffices, since writing EAX zeroes the full register).

MOVSX: Sign-Extend on Load

When loading a signed smaller value into a larger register, you want the sign bit propagated. MOVSX sign-extends:

movsx rax, byte [rbx]      ; sign-extend 8-bit to 64-bit
movsx rax, word [rbx]      ; sign-extend 16-bit to 64-bit
movsx rax, dword [rbx]     ; sign-extend 32-bit to 64-bit
movsx eax, byte [rbx]      ; sign-extend 8-bit to 32-bit
movsx eax, word [rbx]      ; sign-extend 16-bit to 32-bit

Sign extension replicates the most significant bit:

; Value in memory: 0xFF = -1 as int8_t
movsx rax, byte [rbx]      ; RAX = 0xFFFFFFFFFFFFFFFF = -1 as int64_t
movzx rax, byte [rbx]      ; RAX = 0x00000000000000FF = 255 as uint64_t
; Value in memory: 0x80 = -128 as int8_t
movsx eax, byte [rbx]      ; EAX = 0xFFFFFF80 = -128 as int32_t
movzx eax, byte [rbx]      ; EAX = 0x00000080 = 128 as uint32_t

📊 C Comparison:

int8_t  b = *(int8_t *)ptr;
int64_t v = b;              // implicit sign-extension

The compiler emits movsx rax, byte [rdi] for this.

MOVSXD: Sign-Extend 32-bit to 64-bit

There is a special instruction for the 32-to-64 case because the regular movsx encoding does not cover it:

movsxd rax, dword [rbx]    ; sign-extend 32-bit to 64-bit
movsxd rax, ecx            ; sign-extend ECX into RAX

This is commonly seen when working with 32-bit indices or when interfacing with code that stores signed 32-bit values:

; int32_t index; used as a 64-bit array subscript
movsxd rax, dword [rbp-4]  ; load and sign-extend the 32-bit index
lea rbx, [rel array]       ; array base (RIP-relative, position-independent)
mov rbx, [rbx + rax*8]     ; index into the 64-bit array

⚠️ Common Mistake: Using mov eax, [mem] instead of movsxd rax, dword [mem] when the value is a signed 32-bit integer. The MOV zero-extends, which gives wrong results for negative indices. movsxd sign-extends, preserving the signed value.

XCHG: Exchange

XCHG swaps the contents of two operands; when one operand is memory, the swap is atomic:

xchg rax, rbx              ; swap RAX and RBX
xchg [mutex], rax          ; atomically swap memory with register

The memory form of XCHG has an implicit LOCK prefix — it is always atomic, regardless of whether you write LOCK explicitly. This makes it useful as a mutex acquire:

; Spinlock using XCHG (the classic test-and-set)
section .data
    lock_var: db 0          ; 0 = free, 1 = locked

acquire_lock:
    mov al, 1
.spin:
    xchg [lock_var], al     ; atomically: al = *lock_var, *lock_var = 1
    test al, al             ; was it 0 before?
    jnz .spin               ; no: someone else holds it, spin
    ret                     ; yes: we acquired it

release_lock:
    mov byte [lock_var], 0  ; a plain store suffices to release (no LOCK needed)
    ret

⚠️ Common Mistake: Using memory-operand XCHG as a fast way to swap two values. Its implicit LOCK prefix acts as a full memory fence, making it far slower than two or three MOVs for non-atomic use. When you just want to swap values and do not need atomicity, use register-register XCHG or plain MOVs.

A Complete Addressing Mode Example: Struct Access in C

Let us trace through what the compiler generates for a realistic C struct access:

typedef struct {
    uint32_t  id;       // offset  0, size 4
    uint32_t  flags;    // offset  4, size 4
    uint64_t  value;    // offset  8, size 8
    char     *name;     // offset 16, size 8
    double    score;    // offset 24, size 8
} Record;               // total size: 32 bytes

uint64_t get_value(Record *rec) {
    return rec->value;
}
; GCC -O2 output for get_value:
; RDI = rec (pointer to Record)
get_value:
    mov rax, [rdi + 8]     ; load rec->value (offset 8)
    ret

Now for an array access:

uint64_t sum_values(Record *array, int count) {
    uint64_t total = 0;
    for (int i = 0; i < count; i++) {
        total += array[i].value;
    }
    return total;
}
; GCC -O2 output (simplified):
; RDI = array, ESI = count
sum_values:
    xor eax, eax           ; total = 0
    test esi, esi
    jle .done              ; if count <= 0, return 0
    xor ecx, ecx           ; i = 0
.loop:
    add rax, [rdi + rcx*1 + 8]   ; total += array[i].value
    ; Note: GCC uses rcx as byte offset, not element index
    ; Because sizeof(Record)=32 is not 1/2/4/8, it uses byte offset
    add rcx, 32            ; advance to next Record (32 bytes)
    dec esi                ; count--
    jnz .loop
.done:
    ret

The compiler cannot use rcx*32 (32 is not a valid scale), so it tracks the byte offset instead of the element index. This is standard compiler behavior: when the stride is not 1, 2, 4, or 8, the index register holds the byte offset, updated by adding the stride each iteration.

Performance of Addressing Modes

Not all addressing modes are equally fast:

| Mode | Address Generation Latency | Notes |
|---|---|---|
| Register direct | 0 (no AGU needed) | Fastest possible |
| Base only | 1 cycle | Simple |
| Base + displacement | 1 cycle | Same as above |
| Base + index | 1 cycle | Same |
| Base + index×scale | 1 cycle | Scale is free |
| Base + index×scale + displacement | 1 cycle (Intel Haswell+) | Was 2 cycles on older CPUs |
| RIP-relative | 1 cycle | RIP is just another input to the AGU |

Modern Intel processors (Haswell and later) can compute the full general form in 1 cycle. On older hardware (Sandy Bridge, Ivy Bridge), the four-component form sometimes took 2 cycles. This is a rare case where the full general form has no practical penalty on current hardware.

⚡ Performance Note: The real performance concern with addressing modes is pipeline depth and memory latency, not the AGU complexity. A [rbx] load that hits L1 cache takes ~4 cycles to complete. A [rbx + rcx*8 + 16] load that also hits L1 takes ~4 cycles. The addressing mode itself is essentially free; the cache hierarchy is where the cost lives.

Register Trace: A Data Movement Sequence

Let us trace through a complete example to cement the concepts:

section .data
    array: dq 10, 20, 30, 40, 50   ; 5 × 8-byte values

section .text
global _start
_start:
    lea rbx, [rel array]    ; RBX = address of array
    mov rcx, 2              ; index = 2
    mov rax, [rbx + rcx*8] ; rax = array[2]

    movzx rdx, byte [rbx]   ; rdx = (uint8_t)array[0] — just the first byte (0x0A)
    lea rsi, [rbx + 4*8]    ; rsi = address of array[4] (not a load)

    ; rax should be 30, rdx should be 10 (byte), rsi should be array+32
| Instruction | RAX | RBX | RCX | RDX | RSI | Notes |
|---|---|---|---|---|---|---|
| (start) | ? | ? | ? | ? | ? | |
| lea rbx, [rel array] | ? | array_addr | ? | ? | ? | No memory load |
| mov rcx, 2 | ? | array_addr | 2 | ? | ? | Immediate load |
| mov rax, [rbx+rcx*8] | 30 | array_addr | 2 | ? | ? | Loads array[2] = 30 |
| movzx rdx, byte [rbx] | 30 | array_addr | 2 | 10 | ? | First byte of array[0] |
| lea rsi, [rbx+4*8] | 30 | array_addr | 2 | 10 | array_addr+32 | Address, not a load |

🛠️ Lab Exercise: Assemble and run this code in GDB. Set a breakpoint at _start, then use ni (next instruction) to step through. After each instruction, check register values with info registers. Verify the trace above. Then modify the index to 4 and confirm rax becomes 50.

Memory-Mapped I/O Preview

In Chapter 29, you will use these addressing modes to talk directly to hardware registers. The principle is that hardware devices are mapped to specific physical addresses, and writing to those addresses controls the device.

; On a bare-metal x86-64 system (or in MinOS kernel mode):
; VGA text buffer lives at physical address 0xB8000
; Each character cell is 2 bytes: character + attribute

mov rbx, 0xB8000           ; VGA buffer base address
mov word [rbx], 0x0741     ; 'A' (0x41) with white-on-black (0x07)
mov word [rbx + 2], 0x0742 ; 'B' in next cell

The addressing modes are identical to those for normal RAM access. The memory controller routes writes to certain address ranges to hardware registers instead of DRAM. From the instruction's perspective, it is just a memory write.

📐 OS Kernel Project (MinOS): Save this pattern. In Chapter 29's MinOS kernel project, you will implement a VGA text mode console driver that uses exactly this technique — direct writes to 0xB8000 — to display output from your kernel before any device driver infrastructure exists.

The AT&T Syntax Alternative

If you are reading compiler output or working with GDB disassembly, you will encounter AT&T syntax. The addressing mode syntax is reversed and differently formatted:

| NASM (Intel syntax) | AT&T syntax | Meaning |
|---|---|---|
| mov rax, rbx | movq %rbx, %rax | rax = rbx |
| mov rax, [rbx] | movq (%rbx), %rax | rax = *rbx |
| mov rax, [rbx+8] | movq 8(%rbx), %rax | rax = *(rbx+8) |
| mov rax, [rbx+rcx*8] | movq (%rbx,%rcx,8), %rax | rax = *(rbx+rcx*8) |
| mov rax, [rbx+rcx*4+16] | movq 16(%rbx,%rcx,4), %rax | rax = *(rbx+rcx*4+16) |
| movzx eax, byte [rbx] | movzbl (%rbx), %eax | zero-extend byte to dword |
| movsx rax, dword [rbx] | movslq (%rbx), %rax | sign-extend dword to qword |

The AT&T format is disp(base, index, scale). GDB defaults to AT&T; use set disassembly-flavor intel to switch.

Complete Example: Copying an Array

Here is a complete NASM program demonstrating multiple addressing modes:

; copy_array.asm — demonstrates addressing modes
; Copies src[0..4] to dst[0..4], reversing the order

section .data
    src: dq 1, 2, 3, 4, 5          ; source array
    dst: dq 0, 0, 0, 0, 0          ; destination array

section .text
global _start

_start:
    lea rsi, [rel src]     ; RSI = &src[0]
    lea rdi, [rel dst]     ; RDI = &dst[0]
    mov rcx, 0             ; loop index

.loop:
    ; Load src[rcx]
    mov rax, [rsi + rcx*8]

    ; Store into dst[4-rcx] (reversed)
    ; dst index = 4 - rcx
    mov rdx, 4
    sub rdx, rcx
    mov [rdi + rdx*8], rax

    ; Increment and check
    inc rcx
    cmp rcx, 5
    jl .loop

    ; Exit
    mov eax, 60            ; sys_exit
    xor edi, edi           ; status = 0
    syscall

🔄 Check Your Understanding:

1. In the loop above, what is the value in RDX on the first iteration (rcx=0)?
2. After mov [rdi + rdx*8], rax on the first iteration, which element of dst was written?
3. If you changed src to hold dq 10, 20, 30, 40, 50, what would dst contain after the loop?

Answer

  1. RDX = 4 - 0 = 4 on first iteration.
  2. dst[4] was written with src[0] = 1. So dst[4] = 1.
  3. dst would contain [50, 40, 30, 20, 10] — the array reversed.

Summary

The x86-64 addressing modes give you a powerful language for describing how to locate data. The general form [base + index*scale + displacement] covers array indexing ([rsi + rcx*8]), struct field access ([rdi + 24]), and array-of-struct field access ([rbx + rcx*8 + 16], falling back to a precomputed byte offset when the stride is not 1, 2, 4, or 8) all in one instruction. LEA exploits this syntax for arithmetic: multiplying by 3, 5, or 9, or computing multi-operand sums without touching flags. MOVZX and MOVSX handle size mismatches cleanly, preventing the partial-register bugs that have caused subtle errors since the 16-bit era.

The 32-bit zero-extension rule is the one behavior you must internalize: writing to EAX always zeroes the upper half of RAX, but writing to AX or AL does not. Get this wrong and you will spend an afternoon debugging a 64-bit value that should have been clean.

In the next chapter, you will use these addressing modes constantly as we work through every arithmetic and logic instruction in the ISA.