Chapter 22 Key Takeaways: Inline Assembly
-
GCC extended inline assembly syntax is `asm("code" : outputs : inputs : clobbers)`. The sections are separated by colons. Outputs describe where results go, inputs describe what the asm reads, and clobbers list everything the asm modifies that does not appear as an output.
-
Constraint letters specify where operands live: `"r"` = any general-purpose register, `"m"` = memory location, `"i"` = immediate constant, `"a"` = RAX/EAX, `"b"` = RBX/EBX, `"c"` = RCX/ECX, `"d"` = RDX/EDX. The compiler picks registers that satisfy your constraints and substitutes the chosen operand, in the correct assembler syntax, for each placeholder in the template.
-
Output constraints use `"=r"(var)` (write-only) or `"+r"(var)` (read-write). An operand used as both input and output must either use `"+"` or appear in both the output and input sections, with the input using a matching constraint number such as `"0"`.
-
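`asm volatile` prevents the compiler from deleting the asm block, merging duplicate copies of it, or hoisting it out of loops. It does not by itself fix the block's position relative to surrounding code; add a `"memory"` clobber or explicit operand dependencies when ordering matters. Use it whenever the asm has side effects not visible through the output constraints: I/O operations, CPUID, RDTSC, CLFLUSH, memory fences, and any instruction that reads or writes processor state not expressed as C variables.
-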
The `"memory"` clobber prevents the compiler from caching memory values in registers across the asm statement and from reordering memory accesses around it. It is a compiler-level barrier only: it does not emit any hardware fence instruction. Use it whenever the asm reads or writes memory locations not expressed as explicit operands.
-
CPUID serializes before RDTSC at the start of a timed region; RDTSCP followed by LFENCE closes it. The canonical pattern is `CPUID; RDTSC` (start) and `RDTSCP; LFENCE` (stop). CPUID fully serializes the instruction stream. RDTSCP is not a serializing instruction, but it waits until all earlier instructions have executed before reading the counter. The trailing LFENCE keeps later instructions from starting before RDTSCP has read the counter.
-
`XCHG` with a memory operand is implicitly atomic on x86, so no LOCK prefix is needed. Intel's architecture specification guarantees that `XCHG reg, mem` applies the processor's locking protocol automatically. Use XCHG for test-and-set spinlocks. Use PAUSE inside spin loops to reduce power consumption, free execution resources for the sibling hyperthread, and avoid the memory-order mis-speculation penalty when the loop exits.
-
`LOCK CMPXCHG` implements compare-and-swap. The implicit operand is RAX/EAX (the expected value). If `*ptr == RAX`, then `*ptr = src` and ZF=1; otherwise RAX = `*ptr` and ZF=0. The ZF result is captured with SETE. Because RAX is both read and written, declare it as a read-write operand (`"+a"`) rather than a clobber. Clobbers: `"cc"` (ZF modified) and `"memory"` (memory modified).
-
Memory fence instructions differ in scope: MFENCE orders all loads and stores; SFENCE orders stores only; LFENCE orders loads and also keeps later instructions from starting until earlier ones complete. Under x86 TSO, ordinary loads already have acquire semantics and ordinary stores have release semantics, so hardware fences are needed only in specific cases: write-combining memory, non-temporal stores, or store-to-load ordering for sequentially consistent code.
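The timing, spinlock, and compare-and-swap takeaways above can be sketched together as follows. This is a minimal illustration, assuming GCC or Clang on x86-64 with the default AT&T syntax and a CPU that supports RDTSCP. The names `tsc_start`, `tsc_stop`, `cas_u64`, `xchg_u32`, `spin_t`, and the `spin_*` helpers are hypothetical and do not come from the chapter.

```c
#include <stdint.h>

/* Start of a timed region: CPUID serializes, then RDTSC reads the counter.
   CPUID overwrites EAX..EDX, so RBX and RCX are listed as clobbers. */
static inline uint64_t tsc_start(void)
{
    uint32_t lo, hi;
    __asm__ volatile("cpuid\n\t"
                     "rdtsc"
                     : "=a"(lo), "=d"(hi)
                     : "a"(0)
                     : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* End of a timed region: RDTSCP waits for earlier instructions,
   LFENCE holds back later ones. RDTSCP also writes ECX. */
static inline uint64_t tsc_stop(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtscp\n\t"
                     "lfence"
                     : "=a"(lo), "=d"(hi)
                     :
                     : "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Compare-and-swap: returns 1 if *ptr equalled expected and was replaced. */
static inline int cas_u64(volatile uint64_t *ptr, uint64_t expected,
                          uint64_t desired)
{
    unsigned char ok;
    __asm__ volatile("lock cmpxchgq %[des], %[mem]\n\t" /* compares RAX with *ptr */
                     "sete %[ok]"                       /* ZF=1 means success */
                     : [ok] "=q"(ok), [mem] "+m"(*ptr), "+a"(expected)
                     : [des] "r"(desired)
                     : "cc", "memory");
    return ok;
}

/* Atomic exchange: XCHG with a memory operand is locked implicitly. */
static inline uint32_t xchg_u32(volatile uint32_t *ptr, uint32_t val)
{
    __asm__ volatile("xchgl %[val], %[mem]"
                     : [val] "+r"(val), [mem] "+m"(*ptr)
                     :
                     : "memory");
    return val; /* previous contents of *ptr */
}

typedef struct { volatile uint32_t locked; } spin_t;

/* Single test-and-set attempt: returns 1 if the lock was acquired. */
static inline int spin_trylock(spin_t *l)
{
    return xchg_u32(&l->locked, 1) == 0;
}

static inline void spin_lock(spin_t *l)
{
    while (!spin_trylock(l)) {
        while (l->locked)                /* spin on plain reads until free */
            __asm__ volatile("pause");
    }
}

static inline void spin_unlock(spin_t *l)
{
    __asm__ volatile("" ::: "memory");   /* compiler-only barrier */
    l->locked = 0;                       /* plain store is a release on x86 TSO */
}
```

A caller would bracket the measured code with `tsc_start()` and `tsc_stop()` and subtract the two values. For the lock and the compare-and-swap, C11 atomics remain the safer choice in portable code; the asm versions are shown only to make the operand and clobber lists concrete.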
-
`CLFLUSH addr` evicts the cache line containing `addr` (64 bytes on current x86-64 processors) from every level of the cache hierarchy. It requires an `"m"(*ptr)` constraint so the compiler generates the correct memory addressing syntax. Follow CLFLUSH with MFENCE to ensure the eviction completes before subsequent accesses.
-
Inline assembly is appropriate for: CPUID, RDTSC/RDTSCP, CLFLUSH, I/O ports (`inb`/`outb`), CMPXCHG16B, PAUSE, and non-standard atomic patterns not covered by C11 `<stdatomic.h>`. It is not appropriate for standard arithmetic, loops, function calls, or anything a compiler intrinsic or C11 atomic covers: those alternatives are safer, more portable, and produce equally fast code.
-
RBX must be declared as an output or a clobber in CPUID inline assembly. CPUID overwrites EAX, EBX, ECX, and EDX, and in 64-bit mode a write to EBX also zeroes the upper half of RBX. RBX is a callee-saved register that the compiler may be actively using; leaving it undeclared causes silent register corruption and is one of the most common inline assembly bugs.
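The CPUID and CLFLUSH rules above can be sketched as two small wrappers. This is an illustration under stated assumptions (GCC or Clang, x86-64, AT&T syntax). The names `cpuid_regs`, `cpuid_query`, `cpu_vendor`, and `flush_line` are hypothetical. Here EBX is declared as an output, which tells the compiler about the write just as a clobber would.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t eax, ebx, ecx, edx; } cpuid_regs;

/* Run CPUID for the given leaf and subleaf. All four registers are outputs,
   so the compiler knows RBX is overwritten. The inputs use matching
   constraints "0" (same register as output 0, EAX) and "2" (ECX). */
static inline cpuid_regs cpuid_query(uint32_t leaf, uint32_t subleaf)
{
    cpuid_regs r;
    __asm__ volatile("cpuid"
                     : "=a"(r.eax), "=b"(r.ebx), "=c"(r.ecx), "=d"(r.edx)
                     : "0"(leaf), "2"(subleaf));
    return r;
}

/* Leaf 0 returns the 12-byte vendor string in EBX, EDX, ECX (in that order). */
static inline void cpu_vendor(char out[13])
{
    cpuid_regs r = cpuid_query(0, 0);
    memcpy(out + 0, &r.ebx, 4);
    memcpy(out + 4, &r.edx, 4);
    memcpy(out + 8, &r.ecx, 4);
    out[12] = '\0';
}

/* Evict the cache line holding *p, then fence so the eviction completes
   before any later memory access. */
static inline void flush_line(const void *p)
{
    __asm__ volatile("clflush %0"
                     :
                     : "m"(*(const volatile char *)p)
                     : "memory");
    __asm__ volatile("mfence" ::: "memory");
}
```

On GCC and Clang, `<cpuid.h>` and the `_mm_clflush` intrinsic cover the same ground and are usually preferable; the hand-written versions are shown only to make the constraint lists visible.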
-
An `asm` with an empty code string and a `"memory"` clobber is a compiler-only barrier: `asm volatile("" ::: "memory")` emits zero machine instructions but prevents the compiler from reordering loads and stores across the barrier point. Use it to force the compiler to re-read non-volatile variables from memory or to enforce ordering of memory accesses between C statements.
-
C11 `<stdatomic.h>` atomics and inline assembly `LOCK CMPXCHG` typically compile to the same machine code on x86-64. The difference is portability and compiler visibility. C11 atomics are portable to ARM64, RISC-V, and other architectures. Inline assembly is x86-64 specific but useful for instructions C11 does not expose (CMPXCHG16B, CRC32, CLFLUSH) when no compiler intrinsic fits. XADD is already reachable through `atomic_fetch_add`.
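The comparison between C11 atomics and hand-written asm can be sketched side by side. This is a minimal illustration, assuming a C11 compiler (GCC or Clang) targeting x86-64. The names `compiler_barrier`, `cas_c11`, `fetch_add_c11`, and `fetch_add_asm` are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Compiler-only barrier: emits no machine instructions. */
#define compiler_barrier() __asm__ volatile("" ::: "memory")

/* Portable compare-and-swap: returns 1 if *ptr equalled expected
   and was replaced with desired. */
static inline int cas_c11(_Atomic uint64_t *ptr, uint64_t expected,
                          uint64_t desired)
{
    return atomic_compare_exchange_strong(ptr, &expected, desired);
}

/* Portable fetch-and-add: returns the value held before the addition. */
static inline uint64_t fetch_add_c11(_Atomic uint64_t *ptr, uint64_t n)
{
    return atomic_fetch_add(ptr, n);
}

/* x86-64-only fetch-and-add written by hand with LOCK XADD. */
static inline uint64_t fetch_add_asm(volatile uint64_t *ptr, uint64_t n)
{
    __asm__ volatile("lock xaddq %[n], %[mem]"
                     : [n] "+r"(n), [mem] "+m"(*ptr)
                     :
                     : "cc", "memory");
    return n; /* XADD leaves the old value of *ptr in the register */
}
```

Both fetch-and-add helpers return the previous value and leave the sum in memory, but only the C11 version also builds for ARM64 or RISC-V, and only the C11 version lets the compiler reason about the operation.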