Chapter 3: The x86-64 Architecture

The Register Set Is the Architecture

When you sit down to write assembly, the first thing you need in your head is a complete map of the registers — what each one is, what size sub-registers it has, what each one is conventionally used for, and the rules the hardware enforces about them. Without this map, you're writing assembly blind.

This chapter provides that map in exhaustive detail. The general-purpose registers, the instruction pointer, the flags register, the segment registers, and the SIMD register files — all of it, with the specific x86-64 behaviors (particularly the aliasing behavior) that cause bugs when misunderstood.


The Sixteen General-Purpose Registers

x86-64 has 16 general-purpose 64-bit integer registers. They fall into two historical groups: the eight original registers inherited from x86 (with new 64-bit names), and the eight new registers added by AMD's 64-bit extension.

The Original Eight (with Historical Names)

 63              31       15    8 7      0
┌────────────────┬────────┬─────┬────────┐
│      RAX       │  EAX   │ AX  │ AH │AL │
├────────────────┼────────┼─────┼────────┤
│      RBX       │  EBX   │ BX  │ BH │BL │
├────────────────┼────────┼─────┼────────┤
│      RCX       │  ECX   │ CX  │ CH │CL │
├────────────────┼────────┼─────┼────────┤
│      RDX       │  EDX   │ DX  │ DH │DL │
├────────────────┼────────┼─────┼────────┤
│      RSI       │  ESI   │ SI  │   │SIL │
├────────────────┼────────┼─────┼────────┤
│      RDI       │  EDI   │ DI  │   │DIL │
├────────────────┼────────┼─────┼────────┤
│      RSP       │  ESP   │ SP  │   │SPL │
├────────────────┼────────┼─────┼────────┤
│      RBP       │  EBP   │ BP  │   │BPL │
└────────────────┴────────┴─────┴────────┘

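Note: SIL, DIL, SPL, and BPL (byte-width access to RSI, RDI, RSP, RBP) are reachable only with a REX prefix. An instruction that carries a REX prefix cannot address AH, BH, CH, or DH, so the high-byte registers can never appear in the same instruction as SIL, DIL, SPL, BPL, or any of R8-R15.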

The Eight New x86-64 Registers

 63              31       15         7      0
┌────────────────┬────────┬──────────┬──────┐
│       R8       │  R8D   │   R8W    │  R8B │
├────────────────┼────────┼──────────┼──────┤
│       R9       │  R9D   │   R9W    │  R9B │
├────────────────┼────────┼──────────┼──────┤
│      R10       │  R10D  │   R10W   │ R10B │
├────────────────┼────────┼──────────┼──────┤
│      R11       │  R11D  │   R11W   │ R11B │
├────────────────┼────────┼──────────┼──────┤
│      R12       │  R12D  │   R12W   │ R12B │
├────────────────┼────────┼──────────┼──────┤
│      R13       │  R13D  │   R13W   │ R13B │
├────────────────┼────────┼──────────┼──────┤
│      R14       │  R14D  │   R14W   │ R14B │
├────────────────┼────────┼──────────┼──────┤
│      R15       │  R15D  │   R15W   │ R15B │
└────────────────┴────────┴──────────┴──────┘

R8-R15 use a different naming convention: the D suffix for 32-bit, W for 16-bit, B for 8-bit. Unlike the original registers, R8-R15 do not have separate high-byte registers.


Register Purposes: The Calling Convention Assignment

The x86-64 System V ABI (used on Linux, macOS, and most Unix-like systems) assigns conventional purposes to registers. The hardware doesn't enforce these — any register can hold any value — but the calling convention determines what you can rely on after a function call:

Argument Registers (destroyed by function calls)

Register  Purpose in calls      Notes
RDI       1st integer argument  Also destination index for string ops
RSI       2nd integer argument  Also source index for string ops
RDX       3rd integer argument  Also 2nd return value; upper half of the MUL result (RDX:RAX)
RCX       4th integer argument  Also loop and shift counter; overwritten by SYSCALL
R8        5th integer argument
R9        6th integer argument
R9 6th integer argument

Return Value Registers

Register  Purpose
RAX       Primary return value (also SYSCALL number and SYSCALL result)
RDX       Secondary return value (upper half of 128-bit returns); also receives the remainder from DIV/IDIV

Caller-Saved (Scratch) Registers — NOT preserved across calls

RAX, R10, R11, and the argument registers (RDI, RSI, RDX, RCX, R8, R9) are caller-saved. If you need their values after a function call, save them before the call (typically with PUSH) and restore them afterward.

Note: SYSCALL itself overwrites RCX and R11. The CPU stores the return RIP in RCX and the current RFLAGS in R11, so any value you keep in either register is lost across a system call, regardless of what the kernel does.

Callee-Saved (Preserved) Registers — must be restored before returning

Register  Notes
RBX       "General" callee-saved; no specific architectural purpose in 64-bit mode
RBP       Frame pointer (optional; GCC and Clang omit it with -fomit-frame-pointer)
R12       General callee-saved
R13       General callee-saved
R14       General callee-saved
R15       General callee-saved

If your function uses these registers, you must push them at the start and pop them before returning.

The Stack Pointer

RSP is special: it points to the top of the stack (the lowest occupied address). PUSH decrements RSP by the operand size, then stores; POP loads, then increments RSP. RSP must be 16-byte-aligned at the moment a CALL instruction executes. CALL then pushes the 8-byte return address, so on entry to the called function RSP is 8 bytes off a 16-byte boundary (RSP mod 16 = 8).
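The alignment rule is plain modular arithmetic, which a short C model can make concrete. This is an illustrative sketch rather than ABI text: the helper names (push64, pop64, rsp_at_callee_entry, is_aligned16) are invented here, and the sample address is simply the RSP value that appears in the GDB session later in this chapter.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 64-bit PUSH: RSP drops by 8 before the store. */
uint64_t push64(uint64_t rsp) { return rsp - 8; }

/* Model of a 64-bit POP: RSP rises by 8 after the load. */
uint64_t pop64(uint64_t rsp) { return rsp + 8; }

/* RSP as the callee first sees it: CALL has pushed the return address. */
uint64_t rsp_at_callee_entry(uint64_t rsp_at_call) { return push64(rsp_at_call); }

/* Nonzero when rsp is a multiple of 16. */
int is_aligned16(uint64_t rsp) { return (rsp & 0xF) == 0; }

/* Walk through one call using a sample (hypothetical) stack address. */
void check_call_alignment(void) {
    uint64_t rsp = 0x7fffffffe0f0ULL;          /* aligned at the CALL */
    assert(is_aligned16(rsp));
    uint64_t entry = rsp_at_callee_entry(rsp); /* 0x7fffffffe0e8 */
    assert(entry % 16 == 8);                   /* 8 off the boundary on entry */
    assert(is_aligned16(push64(entry)));       /* one PUSH realigns it */
}
```

This is why a function that begins with push rbp, and adjusts RSP by a multiple of 16 after that, has an aligned stack again before its own CALLs.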


The Critical Register Aliasing Rule

This is the most important hardware rule in this chapter. Read it twice.

Writing to a 32-bit register (EAX, EBX, etc., or R8D-R15D) zeroes the upper 32 bits of the corresponding 64-bit register.

Writing to a 16-bit or 8-bit register does NOT zero any upper bits.

This is not a convention. It is an architectural specification, implemented in every x86-64 processor.

; Demonstration of aliasing behavior

mov  rax, 0xDEADBEEFCAFEBABE   ; rax = 0xDEADBEEFCAFEBABE

; 32-bit write: zeros upper half
mov  eax, 0x12345678           ; rax = 0x0000000012345678  (upper zeroed!)

; Now restore
mov  rax, 0xDEADBEEFCAFEBABE

; 16-bit write: does NOT zero upper
mov  ax, 0x1234                ; rax = 0xDEADBEEFCAFE1234
                               ; (only bits 15:0 changed)

; Now restore
mov  rax, 0xDEADBEEFCAFEBABE

; 8-bit high write: does NOT zero upper
mov  ah, 0x12                  ; rax = 0xDEADBEEFCAFE12BE
                               ; (only bits 15:8 changed)

; Now restore
mov  rax, 0xDEADBEEFCAFEBABE

; 8-bit low write: does NOT zero upper
mov  al, 0x12                  ; rax = 0xDEADBEEFCAFEBA12
                               ; (only bits 7:0 changed)
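The same four cases can be written as a small C model, which is handy for predicting what a sequence of partial writes will leave in a register. The function names (write32, write16, write8_low, write8_high) are made up for this sketch; each one takes the old 64-bit register value and returns the new one.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit write (mov eax, v): the old value is discarded, result is zero-extended. */
uint64_t write32(uint64_t reg, uint32_t v) { (void)reg; return (uint64_t)v; }

/* 16-bit write (mov ax, v): bits 63:16 are preserved. */
uint64_t write16(uint64_t reg, uint16_t v) { return (reg & ~0xFFFFULL) | v; }

/* Low-byte write (mov al, v): bits 63:8 are preserved. */
uint64_t write8_low(uint64_t reg, uint8_t v) { return (reg & ~0xFFULL) | v; }

/* High-byte write (mov ah, v): only bits 15:8 change. */
uint64_t write8_high(uint64_t reg, uint8_t v) {
    return (reg & ~0xFF00ULL) | ((uint64_t)v << 8);
}

/* Replays the demonstration above, restoring RAX before each write. */
void check_aliasing_demo(void) {
    const uint64_t rax = 0xDEADBEEFCAFEBABEULL;
    assert(write32(rax, 0x12345678u) == 0x0000000012345678ULL);
    assert(write16(rax, 0x1234) == 0xDEADBEEFCAFE1234ULL);
    assert(write8_high(rax, 0x12) == 0xDEADBEEFCAFE12BEULL);
    assert(write8_low(rax, 0x12) == 0xDEADBEEFCAFEBA12ULL);
}
```

Only write32 ignores its reg argument, which is the whole rule in one line: a 32-bit write is the only partial write whose result does not depend on the register's previous contents.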

Why This Rule Exists

AMD designed x86-64 (first shipped in 2003, in the Opteron and Athlon 64) so that 32 bits remains the default operand size in 64-bit mode. Most instructions in 64-bit code therefore still write 32-bit registers. If those writes preserved the upper half, each of them would depend on the register's previous contents, and code that wanted a clean 64-bit value would need an explicit zero-extension after every operation.

The solution: make 32-bit writes implicitly zero the upper half. A 32-bit result is always a valid zero-extended 64-bit value, the write carries no dependency on the old upper bits, and 64-bit code that works with 32-bit values gets the zero-extension for free.

The 16-bit and 8-bit writes don't zero upper bits because they keep the behavior they already had on 32-bit x86, where a write to AX or AL leaves the rest of EAX untouched. Changing that would break code that relies on the upper bits being preserved. (Think of code that packs multiple values into one register using byte-sized writes.)

The Aliasing Bug Template

Here is the exact form of the most common aliasing bug:

; Buggy function: computes a 64-bit result, but accidentally uses 32-bit write
compute:
    push  rbx
    mov   rbx, rdi          ; preserve argument
    mov   rax, 0x100000000  ; rax = 4,294,967,296

    ; ... lots of code using rax as a 64-bit value ...

    ; Bug: programmer uses 32-bit arithmetic "for efficiency"
    xor   eax, eax          ; ZEROES ALL 64 BITS of RAX (not just lower 32!)
    mov   eax, ebx          ; ZEROES upper 32 bits (intended: copy 64-bit value)

    ; Expected: rax = rbx (the original argument)
    ; Actual: rax = rbx & 0xFFFFFFFF (upper 32 bits silently dropped)
    pop   rbx
    ret

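GDB can show you this in action. Step over a real instruction rather than assigning to $eax from the debugger, because a debugger-side register write is not guaranteed to mimic the hardware rule:

(gdb) p/x $rax                   ← after mov rax, 0xDEADBEEFCAFEBABE
$1 = 0xdeadbeefcafebabe
(gdb) stepi                      ← executes mov eax, 0x12345678
(gdb) p/x $rax
$2 = 0x12345678                  ← upper 32 bits are gone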

The Compiler's Knowledge of This Rule

Importantly, the compiler knows this rule and exploits it:

// C function: takes a 64-bit value, returns it masked to 32 bits
unsigned long mask_low32(unsigned long x) {
    return x & 0xFFFFFFFF;
}

Compiler output (gcc -O2):

mask_low32:
    mov   eax, edi    ; NOT: and rax, 0xFFFFFFFF
    ret               ; The 32-bit write zeroes upper 32 bits of RAX
                      ; This is the correct 64-bit zero-extended result

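The compiler knows that mov eax, edi zero-extends, so it needs no masking instruction at all. The explicit alternative is worse than it looks: AND on a 64-bit register accepts only a sign-extended 32-bit immediate, so the mask 0xFFFFFFFF cannot be encoded directly and would first have to be loaded into another register.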


The REX Prefix: How x86-64 Encodes 64-bit Operations

x86-64 adds a one-byte prefix called REX (Register Extension) that:

  1. Extends the operand size to 64 bits (REX.W bit)
  2. Extends register fields to access R8-R15 (REX.R, REX.X, REX.B bits)
  3. Accesses the new byte registers SIL, DIL, SPL, BPL

The REX byte encoding is 0100WRXB, where:

  • W=1: 64-bit operand size
  • R=1: ModRM.reg field extends to R8-R15
  • X=1: SIB.index field extends to R8-R15
  • B=1: ModRM.rm or opcode reg field extends to R8-R15

Examples:

  • 48 01 d0: REX.W=1, ADD r/m64, r64 → add rax, rdx
  • 4c 01 c0: REX.W=1, REX.R=1 → add rax, r8
  • 41 50: REX.B=1, PUSH r64+rd → push r8
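The bit layout is easy to check by building the prefix byte by hand. The following C sketch uses invented helper names (rex_byte, is_rex, reg_number) and covers only the REX byte itself, not a full instruction encoder.

```c
#include <assert.h>
#include <stdint.h>

/* Build a REX prefix: fixed high nibble 0100, then the W, R, X, B bits. */
uint8_t rex_byte(int w, int r, int x, int b) {
    return (uint8_t)(0x40 | (w ? 8 : 0) | (r ? 4 : 0) | (x ? 2 : 0) | (b ? 1 : 0));
}

/* In 64-bit mode a byte is a REX prefix when its high nibble is 0100. */
int is_rex(uint8_t byte) { return (byte & 0xF0) == 0x40; }

/* Full 4-bit register number: a 3-bit ModRM/opcode field plus the
   matching REX extension bit as bit 3. */
unsigned reg_number(unsigned field3, int ext_bit) {
    assert(field3 < 8);
    return (ext_bit ? 8u : 0u) | field3;
}
```

With these, rex_byte(1, 0, 0, 0) gives 0x48, rex_byte(1, 1, 0, 0) gives 0x4c, and rex_byte(0, 0, 0, 1) gives 0x41, matching the three example encodings. In the second example the ModRM.reg field is 000, and reg_number(0, 1) yields 8, which is R8.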

You don't normally write REX prefixes manually — the assembler generates them automatically when you use 64-bit register names or R8-R15. But understanding them explains some instruction size surprises.


RIP: The Instruction Pointer

RIP, the 64-bit instruction pointer, holds the address of the next instruction to execute. The CPU updates it automatically as instructions execute. You cannot write to RIP directly with MOV, but you can influence it:

  • JMP instructions set RIP to a target address
  • CALL pushes the return address (next RIP value) and jumps
  • RET pops the return address and sets RIP
  • SYSCALL saves RIP to RCX before jumping to the kernel

The CALL instruction pushes RIP + instruction_length (the address of the next instruction after CALL, which is the return address), then jumps to the target. This is how the return address ends up on the stack.

RIP-relative addressing: In 64-bit mode, many memory references use a displacement relative to RIP rather than an absolute address. This is how position-independent code (PIC) works — strings and data are accessed relative to the current instruction, so the code can be loaded at any address and still find its data.

; RIP-relative addressing example (generated by compiler):
lea   rax, [rip + .some_string]   ; rax = address of .some_string

; In the binary, this encodes as:
; 48 8d 05 XX XX XX XX
; where XX XX XX XX is the 32-bit signed offset from the next instruction to .some_string
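The offset in those four bytes is measured from the end of the instruction, not its start, which is a frequent source of off-by-N mistakes when patching binaries by hand. A minimal C sketch of the arithmetic follows; the function names and the addresses in the usage note are hypothetical, and the code assumes addresses below 2^63 with the target inside the signed 32-bit range.

```c
#include <assert.h>
#include <stdint.h>

/* disp32 stored in the instruction: target minus the address of the NEXT
   instruction. Assumes addresses below 2^63 and a target within +/-2 GiB. */
int32_t rip_rel_disp(uint64_t insn_addr, unsigned insn_len, uint64_t target) {
    int64_t delta = (int64_t)target - (int64_t)(insn_addr + insn_len);
    assert(delta >= INT32_MIN && delta <= INT32_MAX);
    return (int32_t)delta;
}

/* Address the CPU forms at run time: next-instruction address plus disp32. */
uint64_t rip_rel_target(uint64_t insn_addr, unsigned insn_len, int32_t disp) {
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}
```

For the 7-byte lea above placed at a made-up address 0x401000, with its string at 0x402000, rip_rel_disp returns 0xFF9, which would appear in the instruction bytes little-endian as f9 0f 00 00.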

RFLAGS in Detail

RFLAGS is 64 bits wide (EFLAGS is the lower 32 bits; FLAGS is the lower 16). Only bits 0-21 are currently defined. Bits 22-63 are reserved and read as zero; software should not try to set them.

Bit 0:  CF  Carry Flag
Bit 1:  (Reserved, always 1)
Bit 2:  PF  Parity Flag
Bit 3:  (Reserved, always 0)
Bit 4:  AF  Adjust Flag (BCD arithmetic carry from bit 3)
Bit 5:  (Reserved, always 0)
Bit 6:  ZF  Zero Flag
Bit 7:  SF  Sign Flag
Bit 8:  TF  Trap Flag (single-step debugging)
Bit 9:  IF  Interrupt Enable Flag
Bit 10: DF  Direction Flag
Bit 11: OF  Overflow Flag
Bits 12-13: IOPL I/O Privilege Level
Bit 14: NT  Nested Task Flag
Bit 15: (Reserved)
Bit 16: RF  Resume Flag
Bit 17: VM  Virtual-8086 Mode Flag
Bit 18: AC  Alignment Check / Access Control
Bit 19: VIF Virtual Interrupt Flag
Bit 20: VIP Virtual Interrupt Pending
Bit 21: ID  ID Flag (CPUID support detection)
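When you read a raw flags value in a debugger, decoding it is just masking against the bit positions listed above. A small C sketch, with constant and function names invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Masks for the commonly inspected flags, from the bit list above. */
enum {
    FLAG_CF = 1 << 0,  FLAG_PF = 1 << 2,  FLAG_AF = 1 << 4,
    FLAG_ZF = 1 << 6,  FLAG_SF = 1 << 7,  FLAG_TF = 1 << 8,
    FLAG_IF = 1 << 9,  FLAG_DF = 1 << 10, FLAG_OF = 1 << 11
};

/* Nonzero when every bit in mask is set in the RFLAGS value. */
int flag_set(uint64_t rflags, uint64_t mask) { return (rflags & mask) == mask; }

/* I/O privilege level: the two-bit field at bits 13:12. */
unsigned iopl(uint64_t rflags) { return (unsigned)((rflags >> 12) & 3); }

/* 0x202 is the value in this chapter's GDB session: IF plus reserved bit 1. */
void check_flags_example(void) {
    assert(flag_set(0x202, FLAG_IF));
    assert(!flag_set(0x202, FLAG_ZF));
    assert(iopl(0x202) == 0);
}
```

A value such as 0x246 decodes the same way: bits 1, 2, 6, and 9 are set, so PF, ZF, and IF are on.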

The PUSHFQ/POPFQ instructions push and pop the full 64-bit RFLAGS:

; Reading and modifying specific RFLAGS bits:

; Set the Trap Flag (enables single-step exceptions):
pushfq
pop   rax
or    rax, (1 << 8)     ; set bit 8 (TF)
push  rax
popfq

; Clear the Direction Flag:
cld   ; equivalent to clearing bit 10, but faster

; Set the Direction Flag (rare; causes string ops to decrement):
std

The LAHF instruction loads bits 7:0 of EFLAGS (SF, ZF, AF, PF, CF) into AH, and SAHF stores AH back to those flag bits. These are old instructions, used mainly when a full PUSHFQ/POPFQ round trip is too expensive for flag preservation.


Segment Registers in 64-bit Mode

The six segment registers (CS, DS, ES, FS, GS, SS) have diminished but not zero importance in 64-bit mode:

Register  64-bit Behavior
CS        Code segment: base=0, limit ignored. CPL (privilege level) is in CS.RPL
DS        Data segment: base=0, limit ignored
ES        Extra segment: base=0, limit ignored; used by some string instructions
SS        Stack segment: base=0, limit ignored; used for stack accesses
FS        Base address in MSR_FS_BASE (IA32_FS_BASE); used for TLS on Linux
GS        Base address in MSR_GS_BASE; used for per-CPU data in kernel

The WRFSBASE and WRGSBASE instructions (available if FSGSBASE CPUID bit is set) write the FS/GS base addresses directly from a register. On Linux, arch_prctl(ARCH_SET_FS, addr) sets the FS base for user-space TLS.

In the kernel, SWAPGS swaps the GS base between the user-space GS base and the kernel's GS base. This is part of the syscall/interrupt entry code — the kernel uses GS to access its per-CPU data structure, but user space uses GS for its own purposes.

For user-space assembly programmers on Linux with glibc, the practical consequence is: fs:0x28 is the stack canary, fs:0x10 holds the thread control block's pointer to itself, and other fs: accesses are TLS variables. Don't modify FS or GS base in user-space programs.


The SIMD Register Files

x86-64 has three generations of SIMD (Single Instruction, Multiple Data) registers:

SSE Registers: XMM0-XMM15 (128 bits each)

Introduced with the Pentium III (SSE) and extended in SSE2 (Pentium 4). Every x86-64 processor supports SSE2 as a baseline.

127                              0
┌────────────────────────────────┐
│           XMM0 (128 bits)      │
│  Can hold: 4×float, 2×double,  │
│  16×byte, 8×word, 4×dword,     │
│  2×qword integer               │
└────────────────────────────────┘

XMM registers are used for:

  • Scalar floating-point (single and double precision)
  • Packed SIMD operations (process multiple values simultaneously)
  • AES-NI encryption/decryption (entire AES round in one instruction)
  • Carry-less multiplication (PCLMULQDQ), the building block of fast CRC implementations; the CRC32 instruction itself operates on general-purpose registers
  • String comparison instructions (PCMPESTRM, PCMPISTRM)

AVX Registers: YMM0-YMM15 (256 bits each)

Introduced with Sandy Bridge (2011). Each YMM register is the lower half of the corresponding ZMM register, and each XMM register is the lower half of the corresponding YMM register. A VEX-encoded instruction (the non-destructive three-operand form) that writes an XMM or YMM register zeroes every bit above the destination width, up to the top of the full register. Legacy SSE encodings leave those upper bits unchanged.

255                             128 127                             0
┌──────────────────────────────────┬────────────────────────────────┐
│     YMM0 upper half (255:128)    │          XMM0 (lower half)     │
└──────────────────────────────────┴────────────────────────────────┘
                     YMM0 (256 bits)

AVX-512 Registers: ZMM0-ZMM31 (512 bits each, plus k0-k7 mask registers)

AVX-512 (available on select Intel/AMD server and desktop CPUs from 2016 onward) doubles the number of SIMD registers to 32 and adds mask registers for predicated (conditional) SIMD operations.

Check for SIMD support with CPUID before using any extension. The CPU feature flags are:

  • SSE2: required for x86-64 (always present)
  • SSE4.1, SSE4.2: nearly universal on modern systems
  • AVX2: common on CPUs from 2013 onward
  • AVX-512: available on select Intel Skylake-X and later; AMD Zen 4 and later


The Execution Model: Fetch, Decode, Execute, Retire

The CPU executes instructions through a pipeline. The simplified view:

Memory ──► [Fetch] ──► [Decode] ──► [Execute] ──► [Retire]
                            │             │
                     Instruction ──► Micro-ops
                     Boundary

Fetch: The CPU reads bytes from the instruction cache (L1 I-cache) starting at the address in RIP. On x86-64, the CPU must fetch 1-15 bytes, not knowing the instruction length until it decodes.

Decode: The CPU identifies the instruction boundary, splits the instruction into micro-operations (μops), and sends them to the out-of-order execution engine. Modern CPUs can decode 4-6 instructions per cycle.

Execute: Micro-operations execute on execution units — ALUs (integer arithmetic), FPUs (floating-point), load/store units (memory access), etc. Out-of-order execution means μops from multiple instructions may be executing simultaneously.

Retire: Completed μops are retired in program order. Results are written to architectural registers and memory in the order the programmer specified, even though execution was out-of-order internally.

The program model is that instructions execute in sequence. The hardware's actual execution is parallel and out-of-order. The architectural guarantee is that the observable state (registers, memory) after N instructions is the same as if they had executed strictly in order.

Why this matters for assembly programmers:

  1. Instruction throughput vs. latency: An IMUL might have a latency of 3 cycles (the result isn't ready for 3 cycles) but a throughput of 1 per cycle (a new IMUL can start every cycle). You can approach the throughput limit despite the latency by keeping the pipeline full with independent instructions.

  2. Data hazards: If instruction N+1 needs the result of instruction N, it cannot start until N has produced that result (a "data hazard" or RAW, Read After Write, dependency). The out-of-order engine fills the gap with other independent work when it can. The compiler, or you when writing assembly by hand, can help by ordering instructions so that such work is available.

  3. Memory ordering: Loads and stores can be reordered by the CPU within certain constraints. The MFENCE, LFENCE, and SFENCE instructions enforce ordering when needed (e.g., in lock-free concurrent data structures).
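The gap between latency and throughput in point 1 can be estimated with a back-of-the-envelope model. The C sketch below is deliberately idealized: it assumes a single fully pipelined execution unit and ignores every other bottleneck (decode width, ports, memory), and the function names are invented for illustration.

```c
#include <assert.h>

/* Cycles for n operations where each one consumes the previous result:
   nothing overlaps, so the latencies add up. */
unsigned long dependent_chain_cycles(unsigned long n, unsigned latency) {
    return n * latency;
}

/* Cycles for n independent operations on one pipelined unit: a new one
   issues every `interval` cycles, and the last result arrives `latency`
   cycles after it was issued. */
unsigned long independent_cycles(unsigned long n, unsigned latency, unsigned interval) {
    assert(interval >= 1);
    return n == 0 ? 0 : (n - 1) * interval + latency;
}
```

With the IMUL figures from the text (latency 3, one issued per cycle), 100 chained multiplies cost about 300 cycles in this model, while 100 independent ones cost about 102. Real hardware will differ, but the ratio explains why splitting one long dependency chain into several accumulators pays off.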


Instruction Lengths: 1-15 Bytes

x86-64 instruction encoding has up to seven components, in order:

  1. Legacy prefixes (0-4 bytes): LOCK, REP/REPNE, operand-size override (0x66), address-size override (0x67)
  2. REX prefix (0 or 1 byte): W, R, X, B bits for 64-bit operation and register extension
  3. Opcode (1-3 bytes): the instruction identifier
  4. ModRM (0 or 1 byte): operand addressing
  5. SIB (0 or 1 byte): Scale-Index-Base addressing
  6. Displacement (0, 1, 2, or 4 bytes): memory offset
  7. Immediate (0, 1, 2, 4, or 8 bytes): embedded constant

A maximally long instruction might have: 4 legacy prefixes + 1 REX + 3 opcode bytes + 1 ModRM + 1 SIB + 4 displacement + 4 immediate = 18 bytes. The architecture, however, limits an instruction to 15 bytes; a longer encoding raises a general-protection exception (#GP).
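The length arithmetic can be captured in a few lines of C, which is useful when estimating the size of a hot loop by hand. The struct and function names below are made up for this sketch, and it only adds up byte counts; it does not validate that a given combination of components is a legal encoding.

```c
#include <assert.h>

#define MAX_INSN_LEN 15u  /* architectural limit on instruction length */

/* Byte count of each encoding component, in encoding order. */
struct insn_layout {
    unsigned prefixes;  /* legacy prefix bytes */
    unsigned rex;       /* 0 or 1 */
    unsigned opcode;    /* 1 to 3 */
    unsigned modrm;     /* 0 or 1 */
    unsigned sib;       /* 0 or 1 */
    unsigned disp;      /* displacement bytes */
    unsigned imm;       /* immediate bytes */
};

/* Total encoded length in bytes. */
unsigned insn_length(const struct insn_layout *l) {
    assert(l);
    return l->prefixes + l->rex + l->opcode + l->modrm + l->sib + l->disp + l->imm;
}

/* Nonzero when the layout fits within the 15-byte limit. */
int insn_length_ok(const struct insn_layout *l) {
    return insn_length(l) <= MAX_INSN_LEN;
}
```

Filling the struct with {4, 1, 3, 1, 1, 4, 4} reproduces the 18-byte total above and fails insn_length_ok, while add rax, rdx (48 01 d0) is {0, 1, 1, 1, 0, 0, 0}, which is 3 bytes.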

For practical purposes, most instructions are 3-7 bytes. Knowing the size of your hot loop's instructions helps estimate instruction cache pressure.


CPUID: Querying CPU Capabilities

The CPUID instruction returns information about the processor's features. It uses EAX as the "leaf" (which information to return) and may use ECX as a "sub-leaf." Results are returned in EAX, EBX, ECX, EDX.

; Example: read the processor brand string (48 characters)
; Returned as three CPUID calls with leaves 0x80000002, 0x80000003, 0x80000004

section .data
    brand:  times 48 db 0    ; 48-byte buffer for brand string

section .text
    global _start

_start:
    ; Read first 16 bytes of brand string (CPUID leaf 0x80000002)
    mov  eax, 0x80000002
    cpuid
    mov  [brand],      eax    ; first 4 bytes
    mov  [brand + 4],  ebx    ; next 4 bytes
    mov  [brand + 8],  ecx
    mov  [brand + 12], edx

    ; Second 16 bytes
    mov  eax, 0x80000003
    cpuid
    mov  [brand + 16], eax
    mov  [brand + 20], ebx
    mov  [brand + 24], ecx
    mov  [brand + 28], edx

    ; Third 16 bytes
    mov  eax, 0x80000004
    cpuid
    mov  [brand + 32], eax
    mov  [brand + 36], ebx
    mov  [brand + 40], ecx
    mov  [brand + 44], edx

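    ; Now brand[] contains the processor name string
    ; e.g., "12th Gen Intel(R) Core(TM) i7-12700K"
    ; (terminated with null bytes)

    ; Exit cleanly instead of running off the end of the code (Linux)
    mov  eax, 60              ; syscall number for exit
    xor  edi, edi             ; exit status 0
    syscall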
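The store pattern above works because x86 writes each 32-bit register to memory least significant byte first, so the characters land in reading order. A portable C sketch of the same packing, taking the twelve register values as input, shows the layout without needing to execute CPUID. The names pack_brand and store_le32 are invented here, and the caller is assumed to supply the values already read from leaves 0x80000002 through 0x80000004.

```c
#include <assert.h>
#include <stdint.h>

/* Store one 32-bit register value as four bytes, least significant first,
   matching how x86 writes it to memory. */
static void store_le32(char *dst, uint32_t v) {
    for (int i = 0; i < 4; i++)
        dst[i] = (char)((v >> (8 * i)) & 0xFF);
}

/* regs[leaf][0..3] holds EAX, EBX, ECX, EDX for the three brand leaves.
   out needs room for 49 bytes: 48 characters plus a terminator. */
void pack_brand(uint32_t regs[3][4], char out[49]) {
    assert(regs && out);
    for (int leaf = 0; leaf < 3; leaf++)
        for (int r = 0; r < 4; r++)
            store_le32(out + 16 * leaf + 4 * r, regs[leaf][r]);
    out[48] = '\0';  /* extra terminator in case all 48 bytes are used */
}
```

For example, an EAX value of 0x74736554 unpacks to the four characters "Test", because 0x54 ('T') is the low byte.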

Checking for specific CPU features:

; Check for AES-NI support (CPUID leaf 1, ECX bit 25)
mov   eax, 1
cpuid
test  ecx, (1 << 25)    ; bit 25 = AES-NI support
jz    no_aes_ni         ; if zero, AES-NI not supported

; Check for AVX2 support (CPUID leaf 7, sub-leaf 0, EBX bit 5)
mov   eax, 7
xor   ecx, ecx          ; sub-leaf 0
cpuid
test  ebx, (1 << 5)     ; bit 5 = AVX2
jz    no_avx2

; Check for FSGSBASE support (CPUID leaf 7, sub-leaf 0, EBX bit 0)
; (Allows direct read/write of FS/GS base from user space, if the OS has enabled it)
mov   eax, 7
xor   ecx, ecx          ; sub-leaf 0
cpuid
test  ebx, (1 << 0)     ; bit 0 = FSGSBASE
jz    no_fsgsbase       ; if zero, FSGSBASE not supported
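The same three bit tests can be expressed in C as a pure decoding step, which keeps the logic testable on any machine. This sketch takes the raw register values as parameters; the struct and function names are invented, and obtaining the values (for instance with a compiler-provided CPUID helper) is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Feature bits discussed in this section. */
struct cpu_features {
    int aesni;     /* leaf 1, ECX bit 25 */
    int avx2;      /* leaf 7 sub-leaf 0, EBX bit 5 */
    int fsgsbase;  /* leaf 7 sub-leaf 0, EBX bit 0 */
};

/* Decode the feature bits from raw CPUID register values. */
struct cpu_features decode_features(uint32_t leaf1_ecx, uint32_t leaf7_ebx) {
    struct cpu_features f;
    f.aesni    = (leaf1_ecx >> 25) & 1;
    f.avx2     = (leaf7_ebx >> 5) & 1;
    f.fsgsbase = (leaf7_ebx >> 0) & 1;
    return f;
}

/* Example with made-up register values: only AVX2 reported. */
void check_decode_example(void) {
    struct cpu_features f = decode_features(0, 1u << 5);
    assert(!f.aesni && f.avx2 && !f.fsgsbase);
}
```

Separating "run CPUID" from "interpret the bits" also makes it easy to log the raw values once and decode further features later.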

The full CPUID specification is in Intel's SDM Volume 2A, under "CPUID — CPU Identification."


Putting It Together: A Complete Register State Example

Here is a complete GDB session showing the register state after running a few instructions. This is what you'll look at during debugging:

(gdb) break _start
(gdb) run
(gdb) stepi              ; step through instructions

(gdb) info registers
rax            0x0000000000000001      1
rbx            0x0000000000000000      0
rcx            0x0000000000000000      0
rdx            0x0000000000000000      0
rsi            0x0000000000402000      4202496
rdi            0x0000000000000001      1
rbp            0x0000000000000000      0
rsp            0x00007fffffffe0f0      140737488347376
r8             0x0000000000000000      0
r9             0x0000000000000000      0
r10            0x0000000000000000      0
r11            0x0000000000000000      0
r12            0x0000000000000000      0
r13            0x0000000000000000      0
r14            0x0000000000000000      0
r15            0x0000000000000000      0
rip            0x0000000000401007      4198407
eflags         0x0000000000000202      [ IF ]
cs             0x0000000000000033      51
ss             0x000000000000002b      43
ds             0x0000000000000000      0
es             0x0000000000000000      0
fs             0x0000000000000000      0
gs             0x0000000000000000      0

Notice:

  • All integer registers are 64 bits wide (shown as 0x0000000000000001 etc.)
  • EFLAGS shows [ IF ]: the Interrupt Flag is set (interrupts enabled, normal for user space)
  • CS (code segment) is 0x33 = ring 3 (user mode); kernel would show 0x10 = ring 0
  • The initial RSP value is a kernel-assigned stack address
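The ring number can be read straight out of a selector value: the low two bits are the privilege level, bit 2 selects the descriptor table, and the remaining bits are the table index. A short C sketch, with helper names made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Requested privilege level: bits 1:0 of the selector. */
unsigned selector_rpl(uint16_t sel) { return sel & 3u; }

/* Table indicator: bit 2 (0 = GDT, 1 = LDT). */
unsigned selector_ti(uint16_t sel) { return (sel >> 2) & 1u; }

/* Descriptor table index: bits 15:3. */
unsigned selector_index(uint16_t sel) { return (unsigned)sel >> 3; }

/* The CS and SS values from the session above. */
void check_session_selectors(void) {
    assert(selector_rpl(0x33) == 3 && selector_index(0x33) == 6);  /* user CS */
    assert(selector_rpl(0x2b) == 3 && selector_index(0x2b) == 5);  /* user SS */
    assert(selector_rpl(0x10) == 0 && selector_index(0x10) == 2);  /* kernel CS */
}
```

So 0x33 is GDT entry 6 at ring 3, and 0x10 is GDT entry 2 at ring 0.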

This is the starting state. Every assembly operation you perform changes some subset of these registers.


Summary

The x86-64 register set is the foundation of everything. Before writing a single program, you must know:

  • All 16 general-purpose registers and their sub-register names
  • The 32-bit write rule: it zeros the upper half
  • The calling convention: which registers are arguments, return values, caller-saved, callee-saved
  • How RIP works and why you can't MOV to it directly
  • The SIMD register files (XMM/YMM/ZMM) and what instructions use them
  • How to check CPU capabilities with CPUID

The architecture is complex because it has accumulated 45 years of history. But the working programmer only needs to understand the 64-bit layer and what it inherited from below. The rest — the 8086 segment model, the x87 stack, 32-bit protected mode — is context that explains why things are the way they are.

🔄 Check Your Understanding: After the following instruction sequence, what is the value of RAX?

mov rax, 0xFFFFFFFFFFFFFFFF
mov ebx, 0xAAAAAAAA
mov eax, ebx

Answer:

  • After mov rax, 0xFFFFFFFFFFFFFFFF, RAX = 0xFFFFFFFFFFFFFFFF.
  • After mov ebx, 0xAAAAAAAA, RBX = 0x00000000AAAAAAAA (the 32-bit write zeroes the upper half of RBX).
  • After mov eax, ebx, EAX gets the value of EBX = 0xAAAAAAAA, and this 32-bit write zeroes the upper 32 bits of RAX.

Final RAX = 0x00000000AAAAAAAA.

The original 0xFFFFFFFF in the upper half of RAX is completely gone, destroyed by the mov eax, ebx instruction.