In This Chapter
- The Register Set Is the Architecture
- The Sixteen General-Purpose Registers
- Register Purposes: The Calling Convention Assignment
- The Critical Register Aliasing Rule
- The REX Prefix: How x86-64 Encodes 64-bit Operations
- RIP: The Instruction Pointer
- RFLAGS in Detail
- Segment Registers in 64-bit Mode
- The SIMD Register Files
- The Execution Model: Fetch, Decode, Execute, Retire
- Instruction Lengths: 1-15 Bytes
- CPUID: Querying CPU Capabilities
- Putting It Together: A Complete Register State Example
- Summary
Chapter 3: The x86-64 Architecture
The Register Set Is the Architecture
When you sit down to write assembly, the first thing you need in your head is a complete map of the registers — what each one is, what size sub-registers it has, what each one is conventionally used for, and the rules the hardware enforces about them. Without this map, you're writing assembly blind.
This chapter provides that map in exhaustive detail. The general-purpose registers, the instruction pointer, the flags register, the segment registers, and the SIMD register files — all of it, with the specific x86-64 behaviors (particularly the aliasing behavior) that cause bugs when misunderstood.
The Sixteen General-Purpose Registers
x86-64 has 16 general-purpose 64-bit integer registers. They fall into two historical groups: the eight original registers inherited from x86 (with new 64-bit names), and the eight new registers added by AMD's 64-bit extension.
The Original Eight (with Historical Names)
| 64-bit (bits 63:0) | 32-bit (bits 31:0) | 16-bit (bits 15:0) | High byte (bits 15:8) | Low byte (bits 7:0) |
|---|---|---|---|---|
| RAX | EAX | AX | AH | AL |
| RBX | EBX | BX | BH | BL |
| RCX | ECX | CX | CH | CL |
| RDX | EDX | DX | DH | DL |
| RSI | ESI | SI | (none) | SIL |
| RDI | EDI | DI | (none) | DIL |
| RSP | ESP | SP | (none) | SPL |
| RBP | EBP | BP | (none) | BPL |
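Note: SIL, DIL, SPL, BPL (byte-width access to RSI, RDI, RSP, RBP) require a REX prefix. With a REX prefix present, the encodings that would otherwise select AH, CH, DH, BH select SPL, BPL, SIL, DIL instead, so an instruction that needs REX (for example, one that also uses R8-R15) cannot reference AH, BH, CH, or DH.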
The Eight New x86-64 Registers
| 64-bit (bits 63:0) | 32-bit (bits 31:0) | 16-bit (bits 15:0) | Low byte (bits 7:0) |
|---|---|---|---|
| R8 | R8D | R8W | R8B |
| R9 | R9D | R9W | R9B |
| R10 | R10D | R10W | R10B |
| R11 | R11D | R11W | R11B |
| R12 | R12D | R12W | R12B |
| R13 | R13D | R13W | R13B |
| R14 | R14D | R14W | R14B |
| R15 | R15D | R15W | R15B |
R8-R15 use a different naming convention: the D suffix for 32-bit, W for 16-bit, B for 8-bit. Unlike the original registers, R8-R15 do not have separate high-byte registers.
Register Purposes: The Calling Convention Assignment
The x86-64 System V ABI (used on Linux, macOS, and most Unix-like systems) assigns conventional purposes to registers. The hardware doesn't enforce these — any register can hold any value — but the calling convention determines what you can rely on after a function call:
Argument Registers (destroyed by function calls)
| Register | Purpose in calls | Notes |
|---|---|---|
| RDI | 1st integer argument | Also destination index for string ops |
| RSI | 2nd integer argument | Also source index for string ops |
| RDX | 3rd integer argument | Also 2nd return value; high 64 bits of the MUL result |
| RCX | 4th integer argument | Also loop counter; overwritten by SYSCALL |
| R8 | 5th integer argument | |
| R9 | 6th integer argument | |
Return Value Registers
| Register | Purpose |
|---|---|
| RAX | Primary return value (also SYSCALL number and SYSCALL result) |
| RDX | Secondary return value (for 128-bit returns); also receives the remainder from DIV/IDIV |
Caller-Saved (Scratch) Registers — NOT preserved across calls
RAX, R10, R11, and the argument registers (RDI, RSI, RDX, RCX, R8, R9) are caller-saved. If you need their values after a function call, save them before the call (typically with PUSH, or by moving them into a callee-saved register) and restore afterward.
Note: the SYSCALL instruction itself overwrites RCX and R11: the CPU stores the return RIP in RCX and the saved RFLAGS in R11. Any value you keep in RCX or R11 is lost across a system call, which is also why the Linux system call convention passes the fourth argument in R10 instead of RCX.
Callee-Saved (Preserved) Registers — must be restored before returning
| Register | Notes |
|---|---|
| RBX | "General" callee-saved; no specific architectural purpose in 64-bit mode |
| RBP | Frame pointer (optional: compilers can omit it with -fomit-frame-pointer) |
| R12 | General callee-saved |
| R13 | General callee-saved |
| R14 | General callee-saved |
| R15 | General callee-saved |
If your function uses these registers, you must push them at the start and pop them before returning.
The Stack Pointer
RSP is special: it points to the top of the stack (the lowest occupied address). PUSH decrements RSP by the operand size, then stores; POP loads, then increments RSP. The ABI requires RSP to be 16-byte-aligned at the point of a CALL instruction. CALL then pushes the 8-byte return address, so on entry to the called function RSP is 8 bytes off a 16-byte boundary (RSP mod 16 = 8, and RSP + 8 is a multiple of 16).
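The pointer arithmetic behind PUSH, POP, and CALL can be modeled in a few lines of C. This is only a sketch of the address calculations; the helper names (`rsp_after_push`, `rsp_after_pop`, `rsp_after_call`, `is_aligned16`) are hypothetical, and nothing here touches the real stack.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers that model RSP arithmetic for 64-bit operands. */

/* PUSH: RSP moves down by 8 before the value is stored. */
uint64_t rsp_after_push(uint64_t rsp) { return rsp - 8; }

/* POP: RSP moves up by 8 after the value is loaded. */
uint64_t rsp_after_pop(uint64_t rsp) { return rsp + 8; }

/* CALL pushes the 8-byte return address, so RSP moves like one PUSH. */
uint64_t rsp_after_call(uint64_t rsp) { return rsp_after_push(rsp); }

/* True when the address is a multiple of 16. */
bool is_aligned16(uint64_t addr) { return (addr & 0xF) == 0; }
```

With an aligned call-site value such as 0x7fffffffe0f0, `rsp_after_call` yields 0x7fffffffe0e8. That entry value is not 16-byte-aligned, but adding 8 to it is, which is why a function typically adjusts RSP by an odd multiple of 8 before making calls of its own.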
The Critical Register Aliasing Rule
This is the most important hardware rule in this chapter. Read it twice.
Writing to a 32-bit register (EAX, EBX, etc., or R8D-R15D) zeroes the upper 32 bits of the corresponding 64-bit register.
Writing to a 16-bit or 8-bit register does NOT zero any upper bits.
This is not a convention. It is an architectural specification, implemented in every x86-64 processor.
; Demonstration of aliasing behavior
mov rax, 0xDEADBEEFCAFEBABE ; rax = 0xDEADBEEFCAFEBABE
; 32-bit write: zeros upper half
mov eax, 0x12345678 ; rax = 0x0000000012345678 (upper zeroed!)
; Now restore
mov rax, 0xDEADBEEFCAFEBABE
; 16-bit write: does NOT zero upper
mov ax, 0x1234 ; rax = 0xDEADBEEFCAFE1234
; (only bits 15:0 changed)
; Now restore
mov rax, 0xDEADBEEFCAFEBABE
; 8-bit high write: does NOT zero upper
mov ah, 0x12 ; rax = 0xDEADBEEFCAFE12BE
; (only bits 15:8 changed)
; Now restore
mov rax, 0xDEADBEEFCAFEBABE
; 8-bit low write: does NOT zero upper
mov al, 0x12 ; rax = 0xDEADBEEFCAFEBA12
; (only bits 7:0 changed)
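The same four write rules can be written as bit operations on a 64-bit value. The C functions below are a model for reasoning about the rule, not code that manipulates real registers; the function names are made up for this sketch.

```c
#include <stdint.h>

/* 32-bit write (e.g. mov eax, imm32): the upper 32 bits become zero. */
uint64_t write32(uint64_t reg, uint32_t value) {
    (void)reg;                  /* the old contents are discarded entirely */
    return (uint64_t)value;
}

/* 16-bit write (e.g. mov ax, imm16): bits 63:16 are preserved. */
uint64_t write16(uint64_t reg, uint16_t value) {
    return (reg & ~0xFFFFULL) | value;
}

/* Low-byte write (e.g. mov al, imm8): bits 63:8 are preserved. */
uint64_t write8_low(uint64_t reg, uint8_t value) {
    return (reg & ~0xFFULL) | value;
}

/* High-byte write (e.g. mov ah, imm8): only bits 15:8 change. */
uint64_t write8_high(uint64_t reg, uint8_t value) {
    return (reg & ~0xFF00ULL) | ((uint64_t)value << 8);
}
```

Starting from 0xDEADBEEFCAFEBABE each time, `write32` with 0x12345678 gives 0x0000000012345678, `write16` with 0x1234 gives 0xDEADBEEFCAFE1234, `write8_high` with 0x12 gives 0xDEADBEEFCAFE12BE, and `write8_low` with 0x12 gives 0xDEADBEEFCAFEBA12. Only the 32-bit case ignores the old register value.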
Why This Rule Exists
When AMD designed x86-64 (first shipped in the Opteron and Athlon 64 processors in 2003), they kept 32 bits as the default operand size in 64-bit mode, so a large share of the instructions in 64-bit code still write 32-bit registers. If each of those writes had to preserve the upper half, its result would depend on the register's previous contents, and the processor would have to merge the old upper half with the new lower half on every such write. That false dependency would cost performance.
The solution: make 32-bit writes implicitly zero the upper half. A 32-bit write then produces a complete 64-bit value with no dependency on the old one, "32-bit variables" always have zeroed upper halves, and 64-bit code that works with 32-bit values gets the zero-extension for free.
The 16-bit and 8-bit writes don't zero upper bits because that would break 16-bit and 8-bit code that relies on the upper bits being preserved. (Think of code that packs multiple values into one register using byte-sized writes.)
The Aliasing Bug Template
Here is the exact form of the most common aliasing bug:
; Buggy function: computes a 64-bit result, but accidentally uses 32-bit write
compute:
push rbx
mov rbx, rdi ; preserve argument
mov rax, 0x100000000 ; rax = 4,294,967,296
; ... lots of code using rax as a 64-bit value ...
; Bug: programmer uses 32-bit arithmetic "for efficiency"
xor eax, eax ; ZEROES ALL 64 BITS of RAX (not just lower 32!)
mov eax, ebx ; ZEROES upper 32 bits (intended: copy 64-bit value)
; Expected: rax = rbx (the original argument)
; Actual: rax = rbx & 0xFFFFFFFF (upper 32 bits silently dropped)
pop rbx
ret
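GDB can show you this in action. Stop just before a 32-bit write such as mov eax, 0x12345678 and step over it:
(gdb) p/x $rax
$1 = 0xdeadbeefcafebabe
(gdb) stepi
(gdb) p/x $rax
$2 = 0x12345678 ← upper 32 bits are gone
Stepping over the real instruction is the reliable demonstration. Assigning to $eax at the GDB prompt is not an equivalent test, because that edits the debugger's copy of the register instead of executing a 32-bit write.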
The Compiler's Knowledge of This Rule
Importantly, the compiler knows this rule and exploits it:
// C function: takes a 64-bit value, returns it masked to 32 bits
unsigned long mask_low32(unsigned long x) {
return x & 0xFFFFFFFF;
}
Compiler output (gcc -O2):
mask_low32:
mov eax, edi ; NOT: and rax, 0xFFFFFFFF
ret ; The 32-bit write zeroes upper 32 bits of RAX
; This is the correct 64-bit zero-extended result
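The compiler knows that mov eax, edi zero-extends, so it doesn't need a separate AND instruction. This is shorter and faster than masking explicitly. In fact, and rax, 0xFFFFFFFF cannot be encoded as a single instruction at all: 64-bit AND takes a sign-extended 32-bit immediate, so the mask would first have to be loaded into another register.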
The REX Prefix: How x86-64 Encodes 64-bit Operations
x86-64 adds a one-byte prefix called REX (Register Extension) that:
- Extends the operand size to 64 bits (REX.W bit)
- Extends register fields to access R8-R15 (REX.R, REX.X, REX.B bits)
- Accesses the new byte registers SIL, DIL, SPL, BPL
The REX byte encoding is: 0100WRXB where:
- W=1: 64-bit operand size
- R=1: ModRM.reg field extends to R8-R15
- X=1: SIB.index field extends to R8-R15
- B=1: ModRM.rm or opcode reg field extends to R8-R15
Examples:
- 48 01 d0: REX.W=1, ADD r/m64, r64 → add rax, rdx
- 4c 01 c0: REX.W=1, REX.R=1 → add rax, r8
- 41 50: REX.B=1, PUSH r64+rd → push r8
You don't normally write REX prefixes manually — the assembler generates them automatically when you use 64-bit register names or R8-R15. But understanding them explains some instruction size surprises.
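To make the 0100WRXB layout concrete, here is a small C sketch that recognizes a REX byte and splits out its four flag bits. The struct and function names are hypothetical, and the check applies only to code bytes interpreted in 64-bit mode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical struct for illustration; fields follow the REX bit names. */
struct rex_fields {
    bool w;  /* 1 = 64-bit operand size */
    bool r;  /* extends ModRM.reg */
    bool x;  /* extends SIB.index */
    bool b;  /* extends ModRM.rm or the opcode register field */
};

/* A REX prefix has the fixed high nibble 0100, i.e. bytes 0x40 to 0x4F. */
bool is_rex(uint8_t byte) { return (byte & 0xF0) == 0x40; }

/* Split the low nibble into W, R, X, B. Call only when is_rex() holds. */
struct rex_fields rex_decode(uint8_t byte) {
    struct rex_fields f;
    f.w = (byte >> 3) & 1;
    f.r = (byte >> 2) & 1;
    f.x = (byte >> 1) & 1;
    f.b = byte & 1;
    return f;
}
```

Applied to the prefixes in the examples, 0x48 decodes to W=1 only, 0x4C to W=1 and R=1, and 0x41 to B=1 only, matching the register choices in add rax, rdx, add rax, r8, and push r8.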
RIP: The Instruction Pointer
RIP (Register Instruction Pointer) holds the address of the next instruction to execute. The CPU updates it automatically as instructions execute. You cannot write to RIP directly with MOV, but you can influence it:
- JMP instructions set RIP to a target address
- CALL pushes the return address (next RIP value) and jumps
- RET pops the return address and sets RIP
- SYSCALL saves RIP to RCX before jumping to the kernel
The CALL instruction pushes RIP + instruction_length (the address of the next instruction after CALL, which is the return address), then jumps to the target. This is how the return address ends up on the stack.
RIP-relative addressing: In 64-bit mode, many memory references use a displacement relative to RIP rather than an absolute address. This is how position-independent code (PIC) works — strings and data are accessed relative to the current instruction, so the code can be loaded at any address and still find its data.
; RIP-relative addressing example (generated by compiler):
lea rax, [rip + .some_string] ; rax = address of .some_string
; In the binary, this encodes as:
; 48 8d 05 XX XX XX XX
; where XX XX XX XX is the 32-bit signed offset from the next instruction to .some_string
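The address calculation the CPU performs for a RIP-relative operand can be sketched in C as follows. The helper name and its parameters are assumptions made for this illustration; the key point is that the displacement is added to the address of the next instruction, not the current one.

```c
#include <stdint.h>

/* Hypothetical helper: resolve a RIP-relative operand.
 * insn_addr: address of the first byte of the instruction
 * insn_len:  total encoded length (7 for the LEA shown above)
 * disp32:    the signed 32-bit displacement stored in the instruction */
uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len,
                             int32_t disp32) {
    /* RIP already points past the instruction when the operand is resolved. */
    uint64_t next_rip = insn_addr + insn_len;
    /* Sign-extend the displacement to 64 bits, then add (wraps mod 2^64). */
    return next_rip + (uint64_t)(int64_t)disp32;
}
```

For example, a 7-byte LEA at 0x401000 with displacement 0xFF9 resolves to 0x402000, and a displacement of -7 resolves back to 0x401000, the start of the instruction itself.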
RFLAGS in Detail
RFLAGS is 64 bits wide (EFLAGS is the lower 32 bits; FLAGS is the lower 16). Only bits 0-21 are currently defined. Bits 22-63 are reserved and should always be zero (they're cleared when RFLAGS is popped from the stack).
Bit 0: CF Carry Flag
Bit 1: (Reserved, always 1)
Bit 2: PF Parity Flag
Bit 3: (Reserved, always 0)
Bit 4: AF Adjust Flag (BCD arithmetic carry from bit 3)
Bit 5: (Reserved, always 0)
Bit 6: ZF Zero Flag
Bit 7: SF Sign Flag
Bit 8: TF Trap Flag (single-step debugging)
Bit 9: IF Interrupt Enable Flag
Bit 10: DF Direction Flag
Bit 11: OF Overflow Flag
Bits 12-13: IOPL I/O Privilege Level
Bit 14: NT Nested Task Flag
Bit 15: (Reserved)
Bit 16: RF Resume Flag
Bit 17: VM Virtual-8086 Mode Flag
Bit 18: AC Alignment Check / Access Control
Bit 19: VIF Virtual Interrupt Flag
Bit 20: VIP Virtual Interrupt Pending
Bit 21: ID ID Flag (CPUID support detection)
The PUSHFQ/POPFQ instructions push and pop the full 64-bit RFLAGS:
; Reading and modifying specific RFLAGS bits:
; Set the Trap Flag (enables single-step exceptions):
pushfq
pop rax
or rax, (1 << 8) ; set bit 8 (TF)
push rax
popfq
; Clear the Direction Flag:
cld ; equivalent to clearing bit 10, but faster
; Set the Direction Flag (rare; causes string ops to decrement):
std
The LAHF instruction loads bits 7:0 of RFLAGS (SF, ZF, AF, PF, CF, plus the reserved bits between them) into AH. SAHF stores AH back to those flag bits. Neither instruction touches OF. These are old instructions, mainly useful when you need to preserve the arithmetic flags and PUSHFQ/POPFQ would be too expensive.
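When you read a raw RFLAGS value in a debugger or a crash dump, it helps to translate it back into flag names. The following C sketch does that for the flags in bits 0 to 11 of the table above; the function name and the choice of which flags to report are assumptions of this example.

```c
#include <stdint.h>
#include <string.h>

/* Status and control flags in bits 0-11 of RFLAGS. */
static const struct { unsigned bit; const char *name; } FLAG_NAMES[] = {
    {0, "CF"}, {2, "PF"}, {4, "AF"}, {6, "ZF"}, {7, "SF"},
    {8, "TF"}, {9, "IF"}, {10, "DF"}, {11, "OF"},
};

/* Write the names of the set flags into out, separated by single spaces,
 * in ascending bit order. out must hold at least 32 bytes. Returns out. */
char *rflags_describe(uint64_t rflags, char *out) {
    out[0] = '\0';
    for (size_t i = 0; i < sizeof FLAG_NAMES / sizeof FLAG_NAMES[0]; i++) {
        if ((rflags >> FLAG_NAMES[i].bit) & 1) {
            if (out[0] != '\0')
                strcat(out, " ");
            strcat(out, FLAG_NAMES[i].name);
        }
    }
    return out;
}
```

A value of 0x202 describes as "IF" (bit 1 is the always-set reserved bit and is not reported), and 0x246 describes as "PF ZF IF".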
Segment Registers in 64-bit Mode
The six segment registers (CS, DS, ES, FS, GS, SS) have diminished but not zero importance in 64-bit mode:
| Register | 64-bit Behavior |
|---|---|
| CS | Code segment: base=0, limit ignored. CPL (privilege level) is in CS.RPL |
| DS | Data segment: base=0, limit ignored |
| ES | Extra segment: base=0, limit ignored; used by some string instructions |
| SS | Stack segment: base=0, limit ignored; used for stack accesses |
| FS | Base address in MSR_FS_BASE (IA32_FS_BASE); used for TLS on Linux |
| GS | Base address in MSR_GS_BASE; used for per-CPU data in kernel |
The WRFSBASE and WRGSBASE instructions (available if FSGSBASE CPUID bit is set) write the FS/GS base addresses directly from a register. On Linux, arch_prctl(ARCH_SET_FS, addr) sets the FS base for user-space TLS.
In the kernel, SWAPGS swaps the GS base between the user-space GS base and the kernel's GS base. This is part of the syscall/interrupt entry code — the kernel uses GS to access its per-CPU data structure, but user space uses GS for its own purposes.
For user-space assembly programmers on Linux with glibc, the practical consequence is: fs:0x28 is the stack canary, fs:0x10 holds a pointer to the thread control block itself, and other fs: accesses are TLS variables. Don't modify the FS or GS base in user-space programs.
The SIMD Register Files
x86-64 has three generations of SIMD (Single Instruction, Multiple Data) registers:
SSE Registers: XMM0-XMM15 (128 bits each)
Introduced with the Pentium III (SSE) and extended in SSE2 (Pentium 4). Every x86-64 processor supports SSE2 as a baseline.
127 0
┌────────────────────────────────┐
│ XMM0 (128 bits) │
│ Can hold: 4×float, 2×double, │
│ 16×byte, 8×word, 4×dword, │
│ 2×qword integer │
└────────────────────────────────┘
XMM registers are used for:
- Scalar floating-point (single and double precision)
- Packed SIMD operations (process multiple values simultaneously)
- AES-NI encryption/decryption (entire AES round in one instruction)
- Carry-less multiplication (PCLMULQDQ), used in CRC and GCM computations (the SSE4.2 CRC32 instruction itself works on general-purpose registers)
- String comparison instructions (PCMPESTRM, PCMPISTRM)
AVX Registers: YMM0-YMM15 (256 bits each)
Introduced with Sandy Bridge (2011). Each XMM register is the lower 128 bits of the corresponding YMM register, and on CPUs with AVX-512 each YMM register is in turn the lower 256 bits of the corresponding ZMM register. VEX-encoded instructions (the non-destructive three-operand AVX forms) zero every bit of the full register above the vector length they write; legacy SSE encodings leave those upper bits unchanged.
255 128 127 0
┌──────────────────────────────────┬────────────────────────────────┐
│ YMM0 upper 128 bits │ XMM0 (lower 128 bits) │
└──────────────────────────────────┴────────────────────────────────┘
YMM0 (256 bits)
AVX-512 Registers: ZMM0-ZMM31 (512 bits each, plus k0-k7 mask registers)
AVX-512 (available on select Intel/AMD server and desktop CPUs from 2016 onward) doubles the number of SIMD registers to 32 and adds mask registers for predicated (conditional) SIMD operations.
Check for SIMD support with CPUID before using any extension. The CPU feature flags are:
- SSE2: required for x86-64 (always present)
- SSE4.1, SSE4.2: nearly universal on modern systems
- AVX2: common on CPUs from 2013 onward
- AVX-512: available on select Intel Skylake-X and later; AMD Zen 4 and later
The Execution Model: Fetch, Decode, Execute, Retire
The CPU executes instructions through a pipeline. The simplified view:
Memory ──► [Fetch] ──► [Decode] ──► [Execute] ──► [Retire]
│ │
Instruction ──► Micro-ops
Boundary
Fetch: The CPU reads bytes from the instruction cache (L1 I-cache) starting at the address in RIP. On x86-64, the CPU must fetch 1-15 bytes, not knowing the instruction length until it decodes.
Decode: The CPU identifies the instruction boundary, splits the instruction into micro-operations (μops), and sends them to the out-of-order execution engine. Modern CPUs can decode 4-6 instructions per cycle.
Execute: Micro-operations execute on execution units — ALUs (integer arithmetic), FPUs (floating-point), load/store units (memory access), etc. Out-of-order execution means μops from multiple instructions may be executing simultaneously.
Retire: Completed μops are retired in program order. Results are written to architectural registers and memory in the order the programmer specified, even though execution was out-of-order internally.
The program model is that instructions execute in sequence. The hardware's actual execution is parallel and out-of-order. The architectural guarantee is that the observable state (registers, memory) after N instructions is the same as if they had executed strictly in order.
Why this matters for assembly programmers:
- Instruction throughput vs. latency: An IMUL might have a latency of 3 cycles (the result isn't ready for 3 cycles) but a throughput of 1 per cycle (a new IMUL can start every cycle). You can reach that throughput despite the latency by keeping the pipeline full with independent instructions.
- Data hazards: If instruction N+1 needs the result of instruction N, the CPU must wait for N to complete (a "data hazard", also called a RAW or Read After Write dependency). Compilers, and assembly programmers scheduling code by hand, arrange instructions to minimize these stalls when possible; the assembler itself never reorders instructions.
- Memory ordering: Loads and stores can be reordered by the CPU within certain constraints. The MFENCE, LFENCE, and SFENCE instructions enforce ordering when needed (e.g., in lock-free concurrent data structures).
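The throughput-versus-latency point can be illustrated with two C functions that compute the same product. In the first, every multiply depends on the previous one, so the chain is bound by multiply latency. In the second, four independent chains give the out-of-order engine work to overlap. This is a sketch of the idea only: the function names are hypothetical, an optimizing compiler may restructure either loop, and any speedup has to be measured on the target CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each multiply must wait for the previous result. */
uint64_t product_single_chain(const uint64_t *v, size_t n) {
    uint64_t acc = 1;
    for (size_t i = 0; i < n; i++)
        acc *= v[i];
    return acc;
}

/* Four accumulators: four independent dependency chains. */
uint64_t product_four_chains(const uint64_t *v, size_t n) {
    uint64_t p0 = 1, p1 = 1, p2 = 1, p3 = 1;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        p0 *= v[i];
        p1 *= v[i + 1];
        p2 *= v[i + 2];
        p3 *= v[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        p0 *= v[i];
    return (p0 * p1) * (p2 * p3);
}
```

Both functions return the same value because unsigned multiplication modulo 2^64 is associative and commutative, so regrouping the factors changes only the dependency structure and not the result.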
Instruction Lengths: 1-15 Bytes
x86-64 instruction encoding has up to seven components, in order:
- Legacy prefixes (0-4 bytes): LOCK, REP/REPNE, segment overrides, operand-size override (0x66), address-size override (0x67)
- REX prefix (0 or 1 byte): W, R, X, B bits for 64-bit operation and register extension
- Opcode (1-3 bytes): the instruction identifier
- ModRM (0 or 1 byte): operand addressing
- SIB (0 or 1 byte): Scale-Index-Base addressing
- Displacement (0, 1, or 4 bytes in 64-bit mode; 2-byte displacements exist only with 16-bit addressing): memory offset
- Immediate (0, 1, 2, 4, or 8 bytes): embedded constant
A maximally long instruction might have: 4 legacy prefixes + 1 REX + 3 opcode bytes + 1 ModRM + 1 SIB + 4 displacement + 4 immediate = 18 bytes... but the architecture limits to 15 bytes per instruction. Instructions longer than 15 bytes trigger an exception.
For practical purposes, most instructions are 3-7 bytes. Knowing the size of your hot loop's instructions helps estimate instruction cache pressure.
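The length arithmetic is simple enough to check mechanically. The C sketch below sums the component sizes and compares the total against the 15-byte limit; the struct and function names are hypothetical, and the code counts bytes only, without validating that a given combination is a real encoding.

```c
#include <stdbool.h>

/* Hypothetical description of one instruction, in bytes per component. */
struct insn_layout {
    unsigned legacy_prefixes;
    unsigned rex;
    unsigned opcode;
    unsigned modrm;
    unsigned sib;
    unsigned displacement;
    unsigned immediate;
};

enum { MAX_INSN_BYTES = 15 };   /* architectural limit */

/* Total encoded length: the sum of all components. */
unsigned insn_length(struct insn_layout l) {
    return l.legacy_prefixes + l.rex + l.opcode + l.modrm +
           l.sib + l.displacement + l.immediate;
}

/* True when the instruction fits within the 15-byte limit. */
bool insn_fits(struct insn_layout l) {
    return insn_length(l) <= MAX_INSN_BYTES;
}
```

The worst case from the text, {4, 1, 3, 1, 1, 4, 4}, sums to 18 and does not fit. By contrast add rax, rdx is {0, 1, 1, 1, 0, 0, 0}, or 3 bytes, and a RIP-relative LEA is {0, 1, 1, 1, 0, 4, 0}, or 7 bytes.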
CPUID: Querying CPU Capabilities
The CPUID instruction returns information about the processor's features. It uses EAX as the "leaf" (which information to return) and may use ECX as a "sub-leaf." Results are returned in EAX, EBX, ECX, EDX.
; Example: read the processor brand string (48 characters)
; Returned as three CPUID calls with leaves 0x80000002, 0x80000003, 0x80000004
section .data
brand: times 48 db 0 ; 48-byte buffer for brand string
section .text
global _start
_start:
; Read first 16 bytes of brand string (CPUID leaf 0x80000002)
mov eax, 0x80000002
cpuid
mov [brand], eax ; first 4 bytes
mov [brand + 4], ebx ; next 4 bytes
mov [brand + 8], ecx
mov [brand + 12], edx
; Second 16 bytes
mov eax, 0x80000003
cpuid
mov [brand + 16], eax
mov [brand + 20], ebx
mov [brand + 24], ecx
mov [brand + 28], edx
; Third 16 bytes
mov eax, 0x80000004
cpuid
mov [brand + 32], eax
mov [brand + 36], ebx
mov [brand + 40], ecx
mov [brand + 44], edx
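; Now brand[] contains the processor name string
; e.g., "Intel(R) Core(TM) i7-12700K CPU @ 3.60GHz"
; (terminated with null bytes)
; Exit cleanly so execution does not run past the end of the code (Linux)
mov eax, 60 ; SYS_exit
xor edi, edi ; exit status 0
syscall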
Checking for specific CPU features:
; Check for AES-NI support (CPUID leaf 1, ECX bit 25)
mov eax, 1
cpuid
test ecx, (1 << 25) ; bit 25 = AES-NI support
jz no_aes_ni ; if zero, AES-NI not supported
; Check for AVX2 support (CPUID leaf 7, sub-leaf 0, EBX bit 5)
mov eax, 7
xor ecx, ecx ; sub-leaf 0
cpuid
test ebx, (1 << 5) ; bit 5 = AVX2
jz no_avx2
; Check for FSGSBASE support (CPUID leaf 7, sub-leaf 0, EBX bit 0)
; (Allows direct read/write of FS/GS base from user space)
mov eax, 7
xor ecx, ecx
cpuid
test ebx, (1 << 0) ; bit 0 = FSGSBASE
The full CPUID specification is in Intel's SDM Volume 2A, under "CPUID — CPU Identification."
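Once the register values are in hand, interpreting them is ordinary bit and byte manipulation. The C sketch below assumes the CPUID outputs have already been captured into variables (it does not execute CPUID itself), and the two helper names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when bit `bit` (0-31) of a CPUID output register is set. */
bool cpuid_bit_set(uint32_t reg, unsigned bit) {
    return (reg >> bit) & 1u;
}

/* Rebuild the brand string from the twelve registers returned by leaves
 * 0x80000002 to 0x80000004, given in EAX, EBX, ECX, EDX order per leaf.
 * Each register holds four characters, lowest byte first.
 * out must hold 49 bytes; the result is always NUL-terminated. */
void brand_from_regs(const uint32_t regs[12], char out[49]) {
    for (unsigned i = 0; i < 12; i++)
        for (unsigned byte = 0; byte < 4; byte++)
            out[4 * i + byte] = (char)((regs[i] >> (8 * byte)) & 0xFF);
    out[48] = '\0';
}
```

For instance, if the first two registers are 0x65746E49 and 0x2952286C and the rest are zero, the reconstructed string is "Intel(R)". Testing leaf 1 ECX with `cpuid_bit_set(ecx, 25)` corresponds to the AES-NI check shown earlier.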
Putting It Together: A Complete Register State Example
Here is a complete GDB session showing the register state after running a few instructions. This is what you'll look at during debugging:
(gdb) break _start
(gdb) run
(gdb) stepi ; step through instructions
(gdb) info registers
rax 0x0000000000000001 1
rbx 0x0000000000000000 0
rcx 0x0000000000000000 0
rdx 0x0000000000000000 0
rsi 0x0000000000402000 4202496
rdi 0x0000000000000001 1
rbp 0x0000000000000000 0
rsp 0x00007fffffffe0f0 140737488347376
r8 0x0000000000000000 0
r9 0x0000000000000000 0
r10 0x0000000000000000 0
r11 0x0000000000000000 0
r12 0x0000000000000000 0
r13 0x0000000000000000 0
r14 0x0000000000000000 0
r15 0x0000000000000000 0
rip 0x0000000000401007 4198407
eflags 0x0000000000000202 [ IF ]
cs 0x0000000000000033 51
ss 0x000000000000002b 43
ds 0x0000000000000000 0
es 0x0000000000000000 0
fs 0x0000000000000000 0
gs 0x0000000000000000 0
Notice:
- All integer registers are 64 bits wide (shown as 0x0000000000000001 etc.)
- EFLAGS shows [ IF ] — Interrupt Flag is set (interrupts enabled, normal for user space)
- CS (code segment) is 0x33: the low two bits of the selector are 3, meaning ring 3 (user mode); kernel code would show 0x10, whose low two bits are 0, meaning ring 0
- The initial RSP value is a kernel-assigned stack address
This is the starting state. Every assembly operation you perform changes some subset of these registers.
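The selector values in the cs and ss rows pack three fields into 16 bits. A short C sketch, with hypothetical function names, shows how to pull them apart:

```c
#include <stdint.h>

/* Privilege level field: bits 1:0 of a segment selector. */
unsigned selector_rpl(uint16_t sel) { return sel & 0x3; }

/* Table indicator: bit 2 (0 = GDT, 1 = LDT). */
unsigned selector_ti(uint16_t sel) { return (sel >> 2) & 0x1; }

/* Descriptor table index: bits 15:3. */
unsigned selector_index(uint16_t sel) { return sel >> 3; }
```

For cs = 0x33 this gives privilege level 3, the GDT, and index 6; for ss = 0x2b it gives privilege level 3, the GDT, and index 5. A kernel code selector of 0x10 decodes to privilege level 0 and index 2.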
Summary
The x86-64 register set is the foundation of everything. Before writing a single program, you must know:
- All 16 general-purpose registers and their sub-register names
- The 32-bit write rule: it zeros the upper half
- The calling convention: which registers are arguments, return values, caller-saved, callee-saved
- How RIP works and why you can't MOV to it directly
- The SIMD register files (XMM/YMM/ZMM) and what instructions use them
- How to check CPU capabilities with CPUID
The architecture is complex because it has accumulated 45 years of history. But the working programmer only needs to understand the 64-bit layer and what it inherited from below. The rest — the 8086 segment model, the x87 stack, 32-bit protected mode — is context that explains why things are the way they are.
🔄 Check Your Understanding: After the following instruction sequence, what is the value of RAX?
mov rax, 0xFFFFFFFFFFFFFFFF
mov ebx, 0xAAAAAAAA
mov eax, ebx
Answer
After mov rax, 0xFFFFFFFFFFFFFFFF, RAX = 0xFFFFFFFFFFFFFFFF. After mov ebx, 0xAAAAAAAA, RBX = 0x00000000AAAAAAAA (the 32-bit write zeroes the upper half of RBX). After mov eax, ebx, EAX receives the value of EBX, 0xAAAAAAAA, and this 32-bit write zeroes the upper 32 bits of RAX. Final RAX = 0x00000000AAAAAAAA.
The original 0xFFFFFFFF in the upper half of RAX is completely gone, destroyed by the mov eax, ebx instruction.