Case Study 1.2: What Makes x86-64 Weird

Open Assembly Language Project

A tour of x86-64's most notable quirks for programmers seeing assembly for the first time

Overview

x86-64 is a mature, widely-deployed, highly capable instruction set. It is also genuinely strange. It evolved from the 8086 (1978) through the 286, 386, 486, Pentium, and eventually to the 64-bit extension AMD designed in 2003, and it carries the scars of every decade of that evolution. Some of the weirdness is harmless; some of it causes real bugs; all of it is worth understanding before you write your first real program.

This case study catalogs the most important x86-64 quirks — the ones that trip up programmers who come from clean RISC architectures, or who learned x86 from a textbook that didn't mention these things.

Quirk 1: Variable-Length Instructions

x86-64 instructions range from 1 to 15 bytes long. The CPU doesn't know where one instruction ends and the next begins until it decodes the current instruction. There is no alignment requirement — an instruction can start at any byte address.

Compare this to ARM64, where every instruction is exactly 4 bytes. On ARM64, every 4-byte-aligned address in a code region is an instruction boundary. On x86-64, the only way to find instruction boundaries is to decode sequentially from a known start point.

Here's a real example showing the range:

; 1-byte instruction
ret             ; c3

; 2-byte instruction
xor eax, eax   ; 31 c0

; 3-byte instruction
add rax, rdi   ; 48 03 c7

; 4-byte instruction
add rsp, 8     ; 48 83 c4 08

; 5-byte instruction
mov eax, 12345 ; b8 39 30 00 00

; 7-byte instruction
lea rax, [rip + .LC0]  ; 48 8d 05 xx xx xx xx
                       ; (4-byte RIP-relative offset)

; 10-byte instruction
movabs rax, 0x123456789ABCDEF0  ; 48 b8 f0 de bc 9a 78 56 34 12

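; 15-byte instruction (the architectural maximum): three legacy prefixes + REX
; + opcode + ModRM + SIB + 4-byte displacement + 4-byte immediate
lock add QWORD PTR fs:[eax + ecx*4 + 0x12345678], 0x12345678
                ; f0 64 67 48 81 84 88 78 56 34 12 78 56 34 12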

Why this matters for security: Disassemblers can be confused by data embedded in code sections. If you start disassembling at the wrong offset — even one byte off — you get a completely different instruction stream. Some obfuscation techniques exploit this deliberately (see "overlapping instructions").

Why this matters for debugging: When GDB stops at a particular address and shows you an instruction, it had to figure out where that instruction starts. If you ask GDB to disassemble from an arbitrary address in the middle of an instruction, the output will be nonsense. This is why stepi (step one instruction) is more reliable than manually calculating addresses.
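
To make the one-byte-off problem concrete, here is a toy length decoder written in C. It is only a sketch: it understands just the handful of encodings shown in the listing above, and the name toy_insn_length is made up for this illustration. A real decoder has to handle hundreds of opcodes and every addressing mode.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy length decoder. Returns the instruction length in bytes,
   or 0 for any encoding this sketch does not model. */
size_t toy_insn_length(const uint8_t *p)
{
    size_t len = 0;
    int rex_w = 0;

    if (p[0] == 0x48) {                 /* REX.W prefix: 64-bit operand size */
        rex_w = 1;
        len = 1;
    }
    uint8_t op = p[len++];

    if (op == 0xC3)                     /* ret */
        return len;

    if (op >= 0xB8 && op <= 0xBF)       /* mov r32, imm32 or movabs r64, imm64 */
        return len + (rex_w ? 8 : 4);

    if (op == 0x31 || op == 0x03 || op == 0x8D || op == 0x83) {
        uint8_t modrm = p[len++];
        uint8_t mod = modrm >> 6;
        uint8_t rm = modrm & 7;

        if (mod == 0 && rm == 5)
            len += 4;                   /* RIP-relative: 4-byte displacement */
        else if (mod != 3 || op == 0x8D)
            return 0;                   /* other addressing modes are not modeled */

        if (op == 0x83)
            len += 1;                   /* 1-byte immediate */
        return len;
    }
    return 0;
}
```

Fed the bytes 48 83 c4 08 from offset 0, the function reports one 4-byte instruction (add rsp, 8). Started one byte later, it reports a 3-byte instruction: 83 c4 08 is a perfectly valid add esp, 8. The same bytes yield a different instruction depending on where decoding begins.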

Quirk 2: The Legacy Register Hierarchy

x86-64 has 16 general-purpose registers: RAX, RBX, RCX, RDX, RSP, RBP, RSI, RDI, R8-R15. But the first 8 of these have sub-registers that are individually named and individually accessible:

 63              31       15    8 7      0
┌────────────────┬────────┬─────┬────────┐
│      RAX       │  EAX   │ AX  │ AH │AL │
└────────────────┴────────┴─────┴────────┘

So you can write:

mov rax, 0x0102030405060708

And then read the sub-components:

- rax = 0x0102030405060708 (all 64 bits)
- eax = 0x05060708 (lower 32 bits)
- ax = 0x0708 (lower 16 bits)
- ah = 0x07 (bits 15:8)
- al = 0x08 (bits 7:0)

The 8 new registers (R8-R15) also have sub-registers, but named differently: r8d (lower 32 bits), r8w (lower 16 bits), r8b (lower 8 bits).
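
The sub-register names are just different windows onto the same 64 bits. A small C model can make that explicit; the view_* function names are invented for this sketch and each one returns what the corresponding register name would read.

```c
#include <stdint.h>

/* Each function returns the value seen through one register name. */
static inline uint32_t view_eax(uint64_t rax) { return (uint32_t)rax; }        /* bits 31:0 */
static inline uint16_t view_ax(uint64_t rax)  { return (uint16_t)rax; }        /* bits 15:0 */
static inline uint8_t  view_ah(uint64_t rax)  { return (uint8_t)(rax >> 8); }  /* bits 15:8 */
static inline uint8_t  view_al(uint64_t rax)  { return (uint8_t)rax; }         /* bits 7:0  */
```

Passing 0x0102030405060708 to these functions reproduces the values listed above for eax, ax, ah, and al.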

This is a direct consequence of backward compatibility. The 8086 had 16-bit registers AX, BX, CX, DX, SP, BP, SI, DI, with AX, BX, CX, and DX each split into a high and a low byte (AH and AL, BH and BL, and so on). The 386 extended these to 32 bits by prepending "E": EAX, EBX, etc. The 64-bit extension replaced the "E" with "R": RAX, RBX, etc. Every generation kept all the old names working.

Quirk 3: The 32-Bit Write That Changes 64 Bits

This is the single most dangerous quirk for programmers new to x86-64: writing a 32-bit register implicitly zeros the upper 32 bits of the 64-bit register.

mov rax, 0xFFFFFFFFFFFFFFFF  ; rax = 0xFFFFFFFFFFFFFFFF
mov eax, 0                    ; rax = 0x0000000000000000 (!!)

The second instruction wrote only 32 bits, but the result is a 64-bit zero. This is specified behavior — not a bug, not an implementation detail. Any 32-bit write to a register zeros the upper half.

But: writing a 16-bit or 8-bit register does NOT zero the upper bits:

mov rax, 0xFFFFFFFFFFFFFFFF  ; rax = 0xFFFFFFFFFFFFFFFF
mov ax, 0                     ; rax = 0xFFFFFFFFFFFF0000 -- upper 48 bits preserved!
mov rax, 0xFFFFFFFFFFFFFFFF  ; reset rax to all ones
mov al, 0                     ; rax = 0xFFFFFFFFFFFFFF00 -- only low 8 bits changed

This asymmetry was introduced specifically in x86-64 to make 32-bit code more efficient in 64-bit mode. When you zero-extend a 32-bit value to 64 bits (the common case), you just write to the 32-bit register and the upper half is automatically zeroed — no extra instruction needed. This optimization comes at the cost of the asymmetry, which is a genuine source of bugs.
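
The write rules can be summarized as a C model. This is a sketch of the architectural behavior, not of any real API; the function names write32, write16, and write8 are made up here, and each returns the new 64-bit register contents after the write.

```c
#include <stdint.h>

/* mov r32, value: the result is zero-extended into the full register. */
static inline uint64_t write32(uint64_t reg, uint32_t value)
{
    (void)reg;                          /* the old contents are discarded entirely */
    return (uint64_t)value;
}

/* mov r16, value: bits 63:16 keep their old contents. */
static inline uint64_t write16(uint64_t reg, uint16_t value)
{
    return (reg & ~UINT64_C(0xFFFF)) | value;
}

/* mov r8, value (low byte): bits 63:8 keep their old contents. */
static inline uint64_t write8(uint64_t reg, uint8_t value)
{
    return (reg & ~UINT64_C(0xFF)) | value;
}
```

Only write32 ignores the old register value; the two narrower writes merge the new bits into it.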

Real bug example:

; Buggy code: trying to preserve upper 32 bits while updating lower 32
; Programmer thinks: "I'm only writing EAX, so upper half should be unaffected"
; This is WRONG.

save_value:
    push rax                  ; save RAX
    ; ... some computation that puts a 32-bit result in ECX ...
    mov eax, ecx              ; BUG: this zeros the upper 32 bits of RAX!
    ; ... more code expecting RAX upper half unchanged (it now sees zeros) ...
    pop rax                   ; restores the saved RAX, but too late: the code above already used the damaged value
    ret
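
What the programmer actually wanted, replacing the low half while keeping the high half, needs an explicit mask-and-merge. A minimal C sketch of that operation follows; replace_low32 is a hypothetical helper name, not something from a library.

```c
#include <stdint.h>

/* Replace bits 31:0 of old_value with low32, leaving bits 63:32 untouched. */
static inline uint64_t replace_low32(uint64_t old_value, uint32_t low32)
{
    return (old_value & UINT64_C(0xFFFFFFFF00000000)) | low32;
}
```

In assembly the same merge takes several instructions (clear the low half, then OR in the new value), which is exactly the work a bare mov eax, ecx does not do.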

The compiler knows this rule and handles it correctly, which is part of why compiler output differs from naive hand-written assembly. When a compiler writes a 32-bit result, it knows the upper bits are zeroed and doesn't bother to zero them explicitly.

⚠️ Common Mistake: If you're reading assembly output and see mov eax, ecx where you expected mov rax, rcx, the compiler is intentionally using the 32-bit form to zero the upper half. This is not a bug — it's a sign that the compiler knows the value is 32-bit.

Quirk 4: The Accumulator Register's Implicit Role

Several x86-64 instructions implicitly use RAX/EAX/AX/AL as an operand, even when it's not written in the instruction:

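MUL and one-operand IMUL (unsigned and signed multiply; the two- and three-operand forms of IMUL name their destination explicitly and leave RDX alone):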

; MUL src computes: RDX:RAX = RAX * src
; (128-bit result split across two registers!)
mov rax, 1000000
mov rbx, 2000000
mul rbx                 ; Result: rdx:rax = rax * rbx = 2,000,000,000,000
                        ; For this example: rdx = 0, rax = 2000000000000 (fits in 64 bits)

Note: mul rbx doesn't mention RAX or RDX at all, but it implicitly reads RAX and writes both RDX and RAX. Forgetting this and expecting RDX to be unmodified is a classic bug.
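
The implicit operands are easier to see when written out as function parameters. The following C model of the 64-bit mul is a sketch that assumes GCC or Clang, since it relies on the unsigned __int128 extension; mul_model is an invented name.

```c
#include <stdint.h>

/* Model of "mul src": reads RAX, writes both RDX (high half) and RAX (low half). */
static inline void mul_model(uint64_t *rax, uint64_t *rdx, uint64_t src)
{
    unsigned __int128 product = (unsigned __int128)*rax * src;
    *rax = (uint64_t)product;
    *rdx = (uint64_t)(product >> 64);   /* overwritten even when the high half is zero */
}
```

Note that rdx is an output only: whatever it held before the multiply is lost, even when the product fits in 64 bits.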

DIV and IDIV:

; DIV src computes: rax = rdx:rax / src, rdx = remainder
; The 128-bit dividend is constructed from rdx (high) and rax (low)
xor rdx, rdx           ; zero rdx (upper half of dividend)
mov rax, 1000000       ; lower half of dividend
mov rbx, 7             ; divisor
div rbx                 ; rax = quotient (142857), rdx = remainder (1)

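If you forget to zero RDX before an unsigned division, you'll divide a 128-bit number instead of a 64-bit number, and you'll get the wrong answer, or a divide error exception (#DE, the same fault raised by division by zero, delivered as SIGFPE on Linux) if the quotient doesn't fit in 64 bits, which happens whenever RDX is greater than or equal to the divisor. For signed division with idiv, sign-extend RAX into RDX with cqo instead of zeroing RDX.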
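
A C model of the 64-bit div shows both failure modes. It is a sketch that assumes GCC or Clang (for unsigned __int128); div_model is an invented name, and a false return stands in for the #DE exception the CPU would raise.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of "div src". Returns false where the CPU would raise #DE. */
static inline bool div_model(uint64_t *rax, uint64_t *rdx, uint64_t src)
{
    if (src == 0 || *rdx >= src)
        return false;                   /* divide by zero, or quotient needs more than 64 bits */

    unsigned __int128 dividend = ((unsigned __int128)*rdx << 64) | *rax;
    *rax = (uint64_t)(dividend / src);  /* quotient */
    *rdx = (uint64_t)(dividend % src);  /* remainder */
    return true;
}
```

With rdx = 0 and rax = 1000000, dividing by 7 gives the expected quotient 142857 and remainder 1. With a stale rdx of 3 the call still succeeds but returns a huge, wrong quotient, and with a stale rdx of 7 or more it faults.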

SYSCALL and SYSRET:

The syscall instruction overwrites RCX with the address of the next instruction (the return RIP) and R11 with the saved RFLAGS. By Linux convention, RAX carries the syscall number in and the return value out. The sysret instruction restores RIP from RCX and RFLAGS from R11. If you keep live values in RCX or R11 across a syscall, you'll lose them.

Quirk 5: The EFLAGS/RFLAGS Register and Carry Propagation

x86-64 has a 64-bit flags register called RFLAGS (the lower 32 bits are EFLAGS for compatibility). It contains status flags that record the outcome of arithmetic and comparison operations.

The carry flag (CF) is particularly interesting: it enables multi-precision arithmetic. The ADC (Add with Carry) instruction adds two operands plus the current carry flag, allowing you to chain 64-bit additions to compute 128-bit or larger results:

; 128-bit addition: (rdx:rax) += (rcx:rbx)
; Low 64 bits:
add rax, rbx    ; rax += rbx; CF = 1 if the unsigned sum carried out of bit 63
; High 64 bits:
adc rdx, rcx    ; rdx += rcx + CF; propagates carry from the low addition

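This is elegant, but it means that any instruction that modifies CF between the add and the adc will corrupt the multi-precision arithmetic. Beginners often accidentally insert a cmp or test between the add and the adc (or between two adc operations in a longer chain), destroying the carry chain. Instructions that leave CF alone, such as mov, lea, inc, and dec, are safe to place there.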
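
The add/adc pair above corresponds to the following C sketch, which computes the carry by hand because C has no carry flag. The u128 struct and the add128 name are made up for this illustration.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

/* 128-bit addition built from two 64-bit additions and an explicit carry. */
static inline u128 add128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;                 /* add: wraps modulo 2^64 */
    uint64_t cf = r.lo < a.lo;          /* CF = 1 exactly when the low add wrapped */
    r.hi = a.hi + b.hi + cf;            /* adc: folds the carry into the high half */
    return r;
}
```

The cf variable plays the role of the carry flag. In assembly nothing stores it for you, which is why an intervening flag-writing instruction silently breaks the chain.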

The direction flag (DF) is another unusual flag. The string instructions (MOVS, SCAS, CMPS, LODS, STOS) use RSI/RDI as pointers and advance them after each operation. The direction they advance depends on DF: if DF=0, pointers increase; if DF=1, pointers decrease. The System V ABI requires that DF=0 on function entry and exit. Violating this causes string operations to walk backward through memory when the caller expected forward traversal — a subtle, hard-to-debug corruption.

; Common string idiom (requires DF=0):
mov rdi, dest
mov rsi, src
mov rcx, count
rep movsb          ; copy count bytes from [rsi] to [rdi], advancing both

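If DF=1 when this runs, both pointers move downward instead: the instruction still copies count bytes starting at src and dest, but it walks toward lower addresses, reading and overwriting memory that lies before the two buffers. The std instruction sets DF; cld clears it. Always clear DF with cld before using string instructions if there's any doubt about its state.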
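
The effect of DF can be modeled in a few lines of C. In this sketch memory is a plain byte array and rsi/rdi are indices into it rather than real pointers; rep_movsb_model is an invented name.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of "rep movsb" over a flat byte array. rsi and rdi are indices into mem. */
static inline void rep_movsb_model(uint8_t *mem, size_t rdi, size_t rsi,
                                   size_t rcx, int df)
{
    while (rcx != 0) {
        mem[rdi] = mem[rsi];
        if (df) { rsi--; rdi--; }       /* DF=1: both pointers move down */
        else    { rsi++; rdi++; }       /* DF=0: both pointers move up */
        rcx--;
    }
}
```

With df = 0, a 3-byte copy from index 8 to index 12 touches indices 8..10 and 12..14. With df = 1, the same call reads indices 8, 7, 6 and writes indices 12, 11, 10: only the first byte lands where the caller intended.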

Quirk 6: The Segment Registers That Won't Die

x86-64 still has the six segment registers: CS, DS, ES, FS, GS, SS. In 16-bit and 32-bit x86, these were critical — they formed the upper part of a segmented memory address. In 64-bit mode, their base addresses are forced to 0 (except FS and GS), and segmentation is effectively disabled. Every memory reference just uses the full 64-bit virtual address.

So why do FS and GS still matter?

Thread-local storage. The OS sets the FS register's base address to point to a per-thread data structure. On Linux with glibc, fs:0 points to the thread control block (the pthread structure) for the current thread, which holds the thread's stack canary value and other per-thread state; thread-local variables are laid out at negative offsets from it.

; Reading a thread-local variable (compiler-generated code):
mov rax, QWORD PTR fs:0x28    ; read the stack canary value from TLS

That fs:0x28 is how GCC implements stack smashing protection — it stores the canary in thread-local storage so each thread has its own canary value.

On 64-bit Windows, GS points to the Thread Environment Block (TEB) in user mode and to per-CPU data in the kernel. On Linux, the kernel uses GS for its per-CPU data.

For application programmers, you'll encounter FS-relative addressing primarily when you see fs:0x... in disassembly. Don't be confused — it's just thread-local storage.
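
To see where such accesses come from, it helps to look at the C side. The sketch below uses the standard C11 _Thread_local storage class; the variable and function names are invented. When built as a non-PIC executable on x86-64 Linux, GCC and Clang typically compile the access into an fs-relative load and store, though the exact code depends on the TLS model the compiler selects.

```c
#include <stdint.h>

/* One independent instance of this variable exists per thread. */
static _Thread_local uint64_t call_count = 0;

/* Increments the calling thread's own counter and returns the new value. */
uint64_t bump_call_count(void)
{
    return ++call_count;
}
```

Compiling this with -S and reading the output for bump_call_count is a quick way to connect the fs: prefix in a disassembly back to source code.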

Quirk 7: The LOCK Prefix

For multiprocessor synchronization, x86 provides the LOCK prefix, which guarantees that the prefixed instruction is atomic with respect to all other processors:

lock add QWORD PTR [counter], 1  ; atomically increment a 64-bit memory counter
lock cmpxchg [ptr], rbx          ; atomic compare-and-exchange (compares [ptr] with RAX)
lock xchg rax, [ptr]             ; atomic exchange (XCHG with memory is locked even without the prefix)

The LOCK prefix can only be applied to certain instructions (ADC, ADD, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG), and only to their memory-destination forms. Applying it to any other instruction, or to a form whose destination is a register, raises an invalid-opcode exception (#UD).

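The cost of LOCK is significant on modern CPUs. Even without contention, a locked read-modify-write is typically several times slower than the unlocked form, and when several cores hammer the same cache line the gap can grow to 10-50x or more, because the line has to move between cores for every operation. Note that lock-free algorithms do not avoid this cost: they are built on LOCK-prefixed instructions such as lock cmpxchg and lock xadd. What they eliminate is blocking on a mutex. To cut the cost of LOCK itself, reduce sharing, for example with per-thread counters that are combined only occasionally.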
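
In C you normally reach these instructions through the C11 <stdatomic.h> interface rather than writing the prefix yourself. The sketch below uses only standard atomic operations; the wrapper function names are invented, and the instruction named in each comment is what GCC and Clang typically emit on x86-64, not a guarantee.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic uint64_t counter = 0;

/* Typically compiled to lock xadd (or lock add when the result is unused). */
uint64_t increment(void)
{
    return atomic_fetch_add(&counter, 1) + 1;
}

/* Typically compiled to lock cmpxchg: stores desired only if *p == expected. */
bool swap_if_equal(_Atomic uint64_t *p, uint64_t expected, uint64_t desired)
{
    return atomic_compare_exchange_strong(p, &expected, desired);
}

/* Typically compiled to xchg with a memory operand, which is implicitly locked. */
uint64_t exchange_value(_Atomic uint64_t *p, uint64_t value)
{
    return atomic_exchange(p, value);
}
```

These calls default to sequentially consistent ordering. The compare-and-swap wrapper is the building block of most lock-free data structures.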

Quirk 8: The REP Prefix and Microcoded Loops

The REP prefix turns string instructions into loops:

mov rcx, 100      ; count
mov rdi, buffer   ; destination
xor al, al        ; value to store (0)
rep stosb         ; store AL to [RDI], advance RDI, decrement RCX; repeat until RCX = 0

rep stosb is equivalent to memset(buffer, 0, 100), but implemented as a single instruction. Modern CPUs handle rep stosb and rep movsb (the copy version) with fast microcode that can approach memory bandwidth, making them competitive with explicit vectorized loops for medium to large sizes. For very short copies, the microcode's startup overhead can make a plain loop or a few vector moves faster.

Intel's Enhanced REP MOVSB/STOSB feature (ERMSB) indicates that the CPU has this fast implementation. Check for it with CPUID leaf 7, sub-leaf 0 (EAX = 7, ECX = 0): the flag is EBX bit 9.
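
A feature check along these lines might look like the following C sketch. The helper names are invented; the cpu_has_erms wrapper assumes GCC or Clang on an x86 target, where <cpuid.h> provides __get_cpuid_count, and it is compiled out elsewhere.

```c
#include <stdint.h>

/* ERMSB is reported in CPUID leaf 7, sub-leaf 0, EBX bit 9. */
static inline int erms_from_leaf7_ebx(uint32_t ebx)
{
    return (ebx >> 9) & 1;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>

/* Queries the running CPU; returns 0 if leaf 7 is not available. */
static inline int cpu_has_erms(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return erms_from_leaf7_ebx(ebx);
}
#endif
```

Separating the bit test from the CPUID query keeps the decoding logic testable on any machine.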

Summary: A Mental Checklist for x86-64

When reading or writing x86-64 assembly, keep these quirks in mind:

Quirk	Consequence	How to Avoid the Bug
Variable-length instructions	Can't find boundaries without decoding	Trust your disassembler; never jump into the middle of an instruction
32-bit write zeros upper 32 bits	Silently destroys the upper half of a 64-bit value	Use 64-bit registers when working with 64-bit values
16/8-bit write does NOT zero upper bits	Partial update leaves stale upper bits	`movzx` for zero-extension, `movsx` for sign-extension
MUL/DIV use RDX:RAX implicitly	RDX gets clobbered by MUL; bad dividend for DIV	Zero RDX before DIV; sign-extend into RDX with `cqo` before IDIV
SYSCALL clobbers RCX and R11	Values in RCX/R11 lost after syscall	Preserve RCX/R11 if needed across syscalls
Direction flag affects string ops	String operations may go backward	`cld` before string instructions if DF state is unknown
FS/GS are base offsets for TLS	`fs:0x28` is not an error	Understand that FS-relative accesses are TLS
LOCK prefix is expensive	Contended atomic operations hurt performance	Use atomic intrinsics only where needed; reduce sharing between cores

These are not obscure corner cases. They are the everyday reality of x86-64 programming. The sooner they become second nature, the fewer mysterious bugs you'll spend hours tracking down.