Chapter 22 Key Takeaways: Inline Assembly

  1. GCC extended inline assembly syntax is asm("code" : outputs : inputs : clobbers). Each section is colon-separated. Outputs describe where results go, inputs describe what the asm reads, and clobbers list everything the asm modifies that does not appear as an output.

  2. Constraint letters specify where operands live: "r" = any general-purpose register, "m" = memory location, "i" = immediate constant, "a" = RAX/EAX, "b" = RBX/EBX, "c" = RCX/ECX, "d" = RDX/EDX. The compiler allocates registers based on your constraints and substitutes each operand's register name or addressing expression into the code template.

  3. Output constraints use "=r"(var) for write-only operands and "+r"(var) for read-write operands. An operand that the asm both reads and writes must either use "+" or appear in both the output and input sections, with the input using a matching constraint (the output's operand number, such as "0").
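
The first three takeaways can be seen together in a small sketch. It assumes GCC or Clang on x86-64 with the default AT&T syntax; the helper names add_u64 and mul_hi_u64 are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* a + b using ADD. "+r" makes `a` a read-write register operand. */
static inline uint64_t add_u64(uint64_t a, uint64_t b)
{
    __asm__("addq %1, %0"         /* AT&T order: source, destination */
            : "+r"(a)             /* %0: read, then overwritten */
            : "r"(b)              /* %1: read only */
            : "cc");              /* ADD changes the flags */
    return a;
}

/* Upper 64 bits of x * y. MUL takes one factor in RAX and writes RDX:RAX. */
static inline uint64_t mul_hi_u64(uint64_t x, uint64_t y)
{
    uint64_t lo, hi;
    __asm__("mulq %3"
            : "=a"(lo), "=d"(hi)  /* %0 = RAX, %1 = RDX (write-only) */
            : "a"(x), "rm"(y)     /* %2 = RAX, %3 = register or memory */
            : "cc");
    (void)lo;                     /* low half is not needed here */
    return hi;
}
```

Neither block is volatile: both are pure functions of their inputs, so the compiler is free to delete or merge them when the results are unused.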

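  4. asm volatile prevents the compiler from deleting the asm block when its outputs are unused and from merging or hoisting repeated copies of it. It does not by itself pin the block relative to surrounding non-volatile code; add a "memory" clobber or an explicit data dependency when ordering matters. Use volatile whenever the asm has side effects not visible through the output constraints: I/O operations, CPUID, RDTSC, CLFLUSH, memory fences, and any instruction that reads or writes processor state not expressed as C variables.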

  5. The "memory" clobber prevents the compiler from caching values in registers across the asm boundary. It is a compiler-level barrier. It does not emit any hardware fence instruction. Use it whenever the asm reads or writes memory locations not expressed as explicit operands.

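  6. For accurate cycle timing, serialize before reading the TSC at the start of a timed region and fence after reading it at the end. The pattern used here is CPUID; RDTSC (start) and RDTSCP; LFENCE (stop). CPUID fully serializes the instruction stream, so no earlier instruction is still in flight when RDTSC executes. RDTSCP is not a serializing instruction, but it waits until all earlier instructions have executed before it reads the counter. The trailing LFENCE keeps later instructions from starting before RDTSCP completes.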
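
The start/stop pattern can be sketched as two helpers. This assumes GCC or Clang on an x86-64 processor that supports RDTSCP; the names tsc_start and tsc_stop are hypothetical.

```c
#include <stdint.h>

/* CPUID; RDTSC. CPUID serializes, then RDTSC reads the counter into EDX:EAX. */
static inline uint64_t tsc_start(void)
{
    uint32_t lo, hi;
    __asm__ volatile("cpuid\n\t"
                     "rdtsc"
                     : "=a"(lo), "=d"(hi)
                     : "a"(0)                    /* CPUID leaf 0 */
                     : "rbx", "rcx", "memory");  /* CPUID also writes EBX, ECX */
    return ((uint64_t)hi << 32) | lo;
}

/* RDTSCP; LFENCE. RDTSCP waits for earlier instructions; LFENCE holds later ones. */
static inline uint64_t tsc_stop(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtscp\n\t"
                     "lfence"
                     : "=a"(lo), "=d"(hi)
                     :
                     : "rcx", "memory");         /* RDTSCP writes IA32_TSC_AUX to ECX */
    return ((uint64_t)hi << 32) | lo;
}
```

A measurement is then tsc_stop() minus tsc_start() around the code under test, ideally repeated many times with the fixed overhead of the two helpers subtracted.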

  7. XCHG with a memory operand is implicitly atomic on x86, so no LOCK prefix is needed. Intel's architecture specification guarantees that XCHG reg, mem applies the processor's locking protocol automatically, whether or not a LOCK prefix is present. Use XCHG for test-and-set spinlocks. Use PAUSE inside spin loops to reduce power consumption, avoid the memory-order mis-speculation penalty when the loop exits, and free execution resources for the sibling thread on hyperthreaded CPUs.
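
A minimal test-and-set spinlock along these lines might look as follows. It is a sketch, assuming GCC or Clang on x86-64; spinlock_t, xchg_u32, spin_lock, and spin_unlock are hypothetical names, not an existing API.

```c
#include <stdint.h>

typedef struct { volatile uint32_t locked; } spinlock_t;  /* 0 = free, 1 = held */

/* Atomically swap v with *p and return the old value (XCHG locks implicitly). */
static inline uint32_t xchg_u32(volatile uint32_t *p, uint32_t v)
{
    __asm__ volatile("xchgl %0, %1"
                     : "+r"(v), "+m"(*p)
                     :
                     : "memory");
    return v;
}

static inline void spin_lock(spinlock_t *l)
{
    while (xchg_u32(&l->locked, 1) != 0) {
        /* Spin on plain reads until the lock looks free, then retry XCHG. */
        while (l->locked)
            __asm__ volatile("pause");
    }
}

static inline void spin_unlock(spinlock_t *l)
{
    __asm__ volatile("" ::: "memory");  /* compiler barrier only */
    l->locked = 0;                      /* plain store is a release on x86 TSO */
}
```

Spinning on a plain read inside the loop keeps the cache line shared between waiters instead of bouncing it with repeated locked writes.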

  8. LOCK CMPXCHG implements compare-and-swap. The implicit operand is RAX/EAX (the expected value). If *ptr == RAX, then *ptr = src and ZF=1; otherwise RAX = *ptr and ZF=0. The ZF result is captured with SETE. Because RAX is both read and written, bind the expected value with "+a", which is a read-write output operand rather than a clobber. Clobbers: "cc" (ZF modified) and "memory" (memory modified).
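
As a sketch of this pattern, assuming GCC or Clang on x86-64 (the function name cas_u64 and its pointer-to-expected interface are illustrative choices, modeled loosely on the C11 compare-exchange shape):

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true and stores desired if *ptr == *expected;
 * otherwise returns false and writes the observed value to *expected. */
static inline bool cas_u64(volatile uint64_t *ptr, uint64_t *expected,
                           uint64_t desired)
{
    bool ok;
    __asm__ volatile("lock cmpxchgq %3, %1\n\t"
                     "sete %0"             /* ok = ZF */
                     : "=q"(ok),           /* %0: byte register for SETE */
                       "+m"(*ptr),         /* %1: memory, read and written */
                       "+a"(*expected)     /* %2: RAX, read and written */
                     : "r"(desired)        /* %3: new value */
                     : "cc", "memory");
    return ok;
}
```

On failure the caller already holds the current value in *expected and can retry without a separate load.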

  9. Memory fence instructions differ in scope: MFENCE orders all earlier loads and stores before all later ones; SFENCE orders stores only; LFENCE orders loads and also keeps later instructions from starting until earlier ones have completed. On x86 TSO, regular loads have acquire semantics and regular stores have release semantics, so hardware fences are needed only in specific scenarios: write-combining memory, non-temporal stores, or store-to-load ordering (sequential consistency), where a later load must not pass an earlier store.

  10. CLFLUSH addr evicts the cache line containing addr (64 bytes on current x86-64 processors) from all levels of cache, writing it back first if it is dirty. It requires an "m"(*ptr) constraint so the compiler generates the correct memory addressing syntax. Follow CLFLUSH with MFENCE when subsequent accesses must not begin until the eviction has completed.
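
The fences and the flush can be wrapped as small helpers. This is a sketch assuming GCC or Clang on x86-64; the wrapper names are hypothetical, and the memory operand uses "+m" rather than "m" so the compiler treats the flushed byte as touched.

```c
/* Hardware fences; the "memory" clobber adds the compiler-level barrier. */
static inline void mfence(void) { __asm__ volatile("mfence" ::: "memory"); }
static inline void sfence(void) { __asm__ volatile("sfence" ::: "memory"); }
static inline void lfence(void) { __asm__ volatile("lfence" ::: "memory"); }

/* Evict the cache line containing p, then wait for the eviction to complete. */
static inline void clflush_line(volatile void *p)
{
    __asm__ volatile("clflush %0"
                     : "+m"(*(volatile char *)p));  /* memory operand, not a register */
    mfence();
}
```

Flushing changes only where the data lives, not its value, so a read after clflush_line returns the same contents with a cache miss.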

  11. Inline assembly is appropriate for: CPUID, RDTSC/RDTSCP, CLFLUSH, I/O ports (inb/outb), CMPXCHG16B, PAUSE, and non-standard atomic patterns not covered by C11 <stdatomic.h>. It is not appropriate for standard arithmetic, loops, function calls, or anything a compiler intrinsic or C11 atomic covers. Those alternatives are safer, more portable, and typically produce equally fast code.

  12. RBX must be declared in CPUID inline assembly, either as an output ("=b") or as a clobber. CPUID overwrites EAX, EBX, ECX, and EDX, and in 64-bit mode the write to EBX also zeroes the upper half of RBX. RBX is a callee-saved register that the compiler may be actively using; failing to declare it causes silent register corruption and is one of the most common inline assembly bugs.
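
One way to satisfy this rule is to bind all four registers as outputs, so nothing is left for the compiler to guess. A minimal sketch, assuming GCC or Clang on x86-64; cpuid and cpu_vendor are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Run CPUID for (leaf, subleaf); regs receives EAX, EBX, ECX, EDX in order. */
static inline void cpuid(uint32_t leaf, uint32_t subleaf, uint32_t regs[4])
{
    __asm__ volatile("cpuid"
                     : "=a"(regs[0]), "=b"(regs[1]),  /* EBX declared as output */
                       "=c"(regs[2]), "=d"(regs[3])
                     : "a"(leaf), "c"(subleaf));
}

/* Leaf 0 returns the vendor string in EBX, EDX, ECX (in that order). */
static inline void cpu_vendor(char out[13])
{
    uint32_t r[4];
    cpuid(0, 0, r);
    memcpy(out + 0, &r[1], 4);
    memcpy(out + 4, &r[3], 4);
    memcpy(out + 8, &r[2], 4);
    out[12] = '\0';
}
```

With "=b" in the output list the compiler saves and restores RBX itself if it was holding a live value there.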

  13. An asm with an empty code string and a "memory" clobber is a compiler-only barrier: asm volatile("" ::: "memory") emits zero machine instructions but prevents the compiler from reordering loads and stores across the barrier point or keeping memory values cached in registers across it. Use it to force ordinary (non-volatile) variables to be re-read from memory or to enforce ordering between C statements. It does not constrain the hardware.
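
A small sketch of the barrier in use, assuming GCC or Clang; the barrier macro and the payload, ready, and publish names are hypothetical.

```c
/* Compiler-only barrier: emits no instructions. */
#define barrier() __asm__ volatile("" ::: "memory")

static int payload;   /* deliberately not volatile */
static int ready;

/* Write the data, then the flag, in that order in the generated code. */
static void publish(int value)
{
    payload = value;
    barrier();        /* compiler may not sink the payload store below this point */
    ready = 1;
}
```

On x86 TSO the two stores also become visible to other cores in program order, so this pairing is enough for a release-style publish; on weaker architectures a hardware fence or a C11 release store would be needed as well.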

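  14. C11 <stdatomic.h> atomics and inline assembly LOCK CMPXCHG compile to the same machine code on x86-64: atomic_compare_exchange_strong becomes LOCK CMPXCHG. The difference is portability and compiler visibility. C11 atomics are portable to ARM64, RISC-V, and future architectures. Inline assembly is x86-64 specific but remains the direct route to instructions C11 does not expose (CMPXCHG16B, CRC32, CLFLUSH). XADD is reachable without assembly, since atomic_fetch_add typically compiles to LOCK XADD when its result is used.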
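
For comparison, the portable C11 counterparts can be sketched as below. The wrapper names cas_c11 and fetch_add_c11 are hypothetical; the <stdatomic.h> calls are standard C11.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Portable compare-and-swap; on x86-64 this typically becomes LOCK CMPXCHG. */
static inline bool cas_c11(_Atomic uint64_t *ptr, uint64_t *expected,
                           uint64_t desired)
{
    return atomic_compare_exchange_strong(ptr, expected, desired);
}

/* Portable fetch-and-add; on x86-64 this typically becomes LOCK XADD. */
static inline uint64_t fetch_add_c11(_Atomic uint64_t *ptr, uint64_t n)
{
    return atomic_fetch_add(ptr, n);
}
```

Because the compiler understands these operations, it can choose the instruction, pick registers freely, and apply the requested memory order, none of which is possible when the operation is hidden inside an asm block.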