
Chapter 22: Inline Assembly

When the Compiler Won't Do It

Inline assembly is the escape hatch. When you need a specific instruction the compiler won't generate — CPUID, RDTSC, I/O port access, an exact atomic sequence — inline assembly lets you drop raw instructions into C code with the compiler managing the register allocation around them.

Use it sparingly. The compiler is usually right. But sometimes only assembly will do.


22.1 What Inline Assembly Is and When to Use It

Use inline assembly when:

- You need instructions the compiler cannot generate: CPUID, RDTSC, RDRAND, IN/OUT (I/O ports), HLT, LGDT
- You need a specific atomic sequence: CMPXCHG, LOCK XCHG (though <stdatomic.h> is usually better)
- You are writing a security-critical function where the compiler might optimize away zeroing (use C23's memset_explicit, glibc's explicit_bzero, or an inline asm barrier to prevent this)
- You need precisely timed code and must prevent instruction reordering around a specific point

Do NOT use inline assembly when:

- A compiler intrinsic exists: <immintrin.h> for SSE/AVX, <nmmintrin.h> for SSE4.2
- <stdatomic.h> covers your atomic operation
- __builtin_clz, __builtin_popcount, or __builtin_expect exists for your case
- You could write a separate .asm file and link it


22.2 GCC Extended Inline Assembly Syntax

The full syntax:

asm volatile (
    "assembly code"
    : output_constraints
    : input_constraints
    : clobbers
);

Each part:

- "assembly code": AT&T-syntax instructions (can span multiple lines with \n\t)
- output_constraints: "=r"(variable) — compiler allocates a register, writes result to variable
- input_constraints: "r"(variable) — compiler puts variable in a register
- clobbers: "rax", "rcx" or "memory", "cc" — registers and state modified by the asm

Basic Examples

// Simple: no inputs, no outputs
asm("nop");                     // insert a NOP instruction

// With a clobber (nop doesn't actually clobber anything, but illustrative):
asm("nop" : : : "memory");     // also fence memory

// The "volatile" prevents the compiler from removing or reordering this asm
asm volatile("nop");

Output Constraints

int result;
asm("movl $42, %0"             // %0 refers to first operand (result)
    : "=r"(result)             // "=r": output (=), in any GP register (r)
    :                          // no inputs
    :                          // no clobbers
);
// After this: result == 42

Constraint letters:

- "r" — any general-purpose register
- "m" — memory operand (variable must be in memory)
- "i" — immediate integer constant
- "a" — specifically RAX/EAX/AX/AL
- "b" — specifically RBX
- "c" — specifically RCX
- "d" — specifically RDX

The = prefix means "output operand" (written by the asm). + means "read-write" (both input and output).

Input Constraints

int x = 10, y = 20, result;
asm("addl %1, %0"              // %0 = result, %1 = y (second operand = input)
    : "=r"(result)             // output: result in any register
    : "r"(y), "0"(x)           // inputs: y in any register, x in same reg as %0
);
// result = x + y = 30
// "0" constraint means "same register as operand 0" — so x starts in the result register

The "0"(x) constraint means "put x in the same register as the 0th operand." This is how you express "this input is also the output register."

Named Operands

Easier to read than %0, %1:

int value, result;
asm("movl %[val], %[res]"
    : [res] "=r"(result)
    : [val] "r"(value)
);

Clobbers

// CPUID clobbers EAX, EBX, ECX, EDX
uint32_t eax, ebx, ecx, edx;
asm volatile("cpuid"
    : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
    : "a"(1)                   // input: EAX = 1 (leaf)
);
// eax, ebx, ecx, edx now contain CPUID leaf 1 results

Special clobbers:

- "memory" — tells the compiler that the asm may read/write arbitrary memory; prevents reordering of memory accesses across the asm
- "cc" — tells the compiler that the asm modifies condition flags


22.3 Practical Inline Assembly Examples

CPUID: Read Processor Information

#include <stdint.h>

typedef struct {
    uint32_t eax, ebx, ecx, edx;
} cpuid_result_t;

cpuid_result_t cpuid(uint32_t leaf) {
    cpuid_result_t r;
    asm volatile("cpuid"
        : "=a"(r.eax), "=b"(r.ebx), "=c"(r.ecx), "=d"(r.edx)
        : "a"(leaf), "c"(0)    // EAX = leaf, ECX = 0 (subleaf)
    );
    return r;
}

// Usage: check if SSE4.2 is available
int has_sse42(void) {
    cpuid_result_t r = cpuid(1);
    return (r.ecx >> 20) & 1;   // bit 20 of ECX from leaf 1
}

Note: RBX is callee-saved in System V AMD64. GCC automatically saves/restores it when you list "=b"(r.ebx) as an output, because the compiler knows it must preserve RBX.

RDTSC: High-Resolution Timing

#include <stdint.h>

uint64_t rdtsc(void) {
    uint32_t lo, hi;
    asm volatile("rdtsc"
        : "=a"(lo), "=d"(hi)   // RDTSC puts low 32 bits in EAX, high 32 in EDX
    );
    return ((uint64_t)hi << 32) | lo;
}

// Usage: measure instruction count/time between two points
uint64_t start = rdtsc();
// ... work to measure ...
uint64_t end = rdtsc();
uint64_t cycles = end - start;

⚠️ Common Mistake: RDTSC is not serialized — the CPU may reorder it with surrounding instructions. Use RDTSCP for a serialized read, or use LFENCE + RDTSC + LFENCE to prevent reordering:

uint64_t rdtsc_serialized(void) {
    uint32_t lo, hi;
    asm volatile(
        "lfence\n\t"             // serialize before RDTSC
        "rdtsc\n\t"
        "lfence"                 // serialize after RDTSC
        : "=a"(lo), "=d"(hi)
        :
        : "memory"
    );
    return ((uint64_t)hi << 32) | lo;
}

RDTSCP: Serialized Timestamp with Processor ID

uint64_t rdtscp(uint32_t *processor_id) {
    uint32_t lo, hi, aux;
    asm volatile("rdtscp"
        : "=a"(lo), "=d"(hi), "=c"(aux)
    );
    if (processor_id) *processor_id = aux;
    return ((uint64_t)hi << 32) | lo;
}

XCHG: Atomic Swap

// Atomic exchange: swap *ptr with val, return old value
uint64_t atomic_xchg(uint64_t *ptr, uint64_t val) {
    asm volatile("xchgq %0, %1"
        : "=r"(val), "+m"(*ptr)    // val = old *ptr; *ptr = val
        : "0"(val)                  // input: val starts in same reg as output
        : "memory"
    );
    return val;
}

XCHG with a memory operand is automatically atomic on x86 (the bus lock is implicit for XCHG with memory). No LOCK prefix needed.

CMPXCHG: Compare and Swap

// Compare *ptr to expected; if equal, store desired; return old value
// Returns: old value (caller checks if == expected for success)
uint64_t cmpxchg64(uint64_t *ptr, uint64_t expected, uint64_t desired) {
    uint64_t old;
    asm volatile("lock cmpxchgq %2, %1"
        : "=a"(old),               // output: RAX = old value
          "+m"(*ptr)               // output: *ptr may be modified
        : "r"(desired),            // input: desired value in any register
          "0"(expected)            // input: expected value in EAX (same as output 0)
        : "memory", "cc"
    );
    return old;
}

// Usage: lock-free compare-and-swap
int cas64(uint64_t *ptr, uint64_t expected, uint64_t desired) {
    return cmpxchg64(ptr, expected, desired) == expected;
}

⚠️ Common Mistake: Not declaring "cc" as a clobber when your asm modifies condition flags. CMPXCHG sets ZF based on whether the comparison succeeded. If you omit "cc", the compiler might cache a flag result across your asm block, leading to incorrect conditional branches.

Memory Fence Instructions

// Full memory fence: no load or store passes this barrier
static inline void mfence(void) {
    asm volatile("mfence" : : : "memory");
}

// Store fence: all stores before this complete before stores after
static inline void sfence(void) {
    asm volatile("sfence" : : : "memory");
}

// Load fence: all loads before this complete before loads after
static inline void lfence(void) {
    asm volatile("lfence" : : : "memory");
}

// Compiler memory barrier only (no hardware barrier instruction)
static inline void compiler_barrier(void) {
    asm volatile("" : : : "memory");
}

The "memory" clobber tells GCC that the asm block may read or write any memory, preventing the compiler from moving memory accesses across the asm. This is a software barrier — it affects compiler optimization but not CPU execution order.

I/O Port Access (Kernel Use)

// Read a byte from x86 I/O port
uint8_t inb(uint16_t port) {
    uint8_t value;
    asm volatile("inb %1, %0"
        : "=a"(value)      // output: AL (the low byte of EAX)
        : "Nd"(port)       // "N" = constant 0-255 immediate; "d" = DX register
                           // "Nd" means: if port is a constant under 256, use the
                           // immediate form; otherwise pass the port in DX
    );
    return value;
}

// Write a byte to x86 I/O port
void outb(uint16_t port, uint8_t value) {
    asm volatile("outb %0, %1"
        :
        : "a"(value), "Nd"(port)
    );
}

🔐 Security Note: I/O port access requires kernel privilege (ring 0). Using inb/outb from user space triggers a #GP (General Protection fault). These are only usable in kernel code or via ioperm/iopl system calls (deprecated and restricted).

PAUSE: Efficient Spin Loop

// Hint to the processor that we're in a spin-wait loop
// Reduces power consumption and improves performance by avoiding
// pipeline stalls due to memory order violations in speculative execution
static inline void cpu_pause(void) {
    asm volatile("pause" : : : "memory");
}

// Correct spinlock using pause:
void spinlock_acquire(volatile int *lock) {
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE)) {
        do {
            cpu_pause();
        } while (*lock);  // wait without CMPXCHG to reduce memory bus traffic
    }
}
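The matching release needs no special instruction on x86: a release store is enough, since TSO forbids store-store reordering. A sketch pairing with spinlock_acquire above:

```c
// Release the lock. __ATOMIC_RELEASE prevents the compiler from sinking
// earlier stores below this; x86 needs no fence instruction for it.
static void spinlock_release(volatile int *lock) {
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```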

CLFLUSH: Flush Cache Line

// Flush the cache line containing addr to memory
// Used in persistent memory programming
void clflush(const void *addr) {
    asm volatile("clflush %0"  // CLFLUSHOPT is a weakly-ordered, faster variant on newer CPUs
        : "+m"(*(char *)addr)  // "+m" declares that we read and write *addr's cache line
    );
}

22.4 The volatile Qualifier

asm volatile(...) prevents the compiler from:

1. Removing the asm block if its outputs are unused
2. Reordering the asm block relative to other volatile operations and side effects

Note that GCC may still duplicate a volatile asm (e.g., by loop unrolling) and may move it past unrelated non-volatile code; add a "memory" clobber to pin it against memory accesses.

Use volatile for:

- Side-effecting instructions (CPUID, RDTSC, IN/OUT)
- Memory barriers (MFENCE, SFENCE, LFENCE)
- Instructions whose effect is not captured in the output constraints

Do NOT use volatile if the asm is truly pure (same inputs → same outputs, no side effects) and you want the compiler to be able to optimize it.


22.5 When NOT to Use Inline Assembly

Use Compiler Intrinsics Instead

For SIMD operations, compiler intrinsics are almost always better than inline assembly:

// BAD: inline assembly for SSE2 add
asm("paddd %1, %0" : "+x"(a) : "x"(b));

// GOOD: intrinsic
#include <immintrin.h>
__m128i result = _mm_add_epi32(a, b);

Intrinsics are:

- Portable between GCC, Clang, and MSVC
- Type-safe (wrong operand types cause compile errors, not crashes)
- Easier to read
- Optimizable by the compiler (can be combined, scheduled, etc.)

The intrinsic reference: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Use GCC Builtins for Common Operations

// CLZ: count leading zeros (result is undefined for x == 0)
int n = __builtin_clzll(x);   // instead of BSR + XOR
\
// CTZ: count trailing zeros (result is undefined for x == 0)
int n = __builtin_ctzll(x);   // instead of BSF

// Popcount
int n = __builtin_popcountll(x);   // instead of POPCNT inline asm

// Branch prediction hint (guides code layout so the likely path is the fall-through)
if (__builtin_expect(ptr == NULL, 0)) {
    handle_null();  // unlikely path
}
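These builtins compose into common bit-twiddling helpers; a sketch with illustrative names (both assume x > 0, per the undefined-for-zero rule of CLZ/CTZ):

```c
#include <stdint.h>

// floor(log2(x)) for x > 0: position of the highest set bit.
static inline int log2_floor(uint64_t x) {
    return 63 - __builtin_clzll(x);
}

// Round x up to the next power of two (x itself if already a power of
// two). Assumes 0 < x <= 2^63 so the shift below cannot overflow.
static inline uint64_t round_up_pow2(uint64_t x) {
    return (x & (x - 1)) == 0 ? x : 1ULL << (log2_floor(x) + 1);
}
```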

Use <stdatomic.h> for Atomic Operations

// BAD: inline CMPXCHG
asm volatile("lock cmpxchgq %2, %1" ...);

// GOOD: C11 atomics
#include <stdatomic.h>
_Atomic uint64_t counter = 0;   // plain initialization; ATOMIC_VAR_INIT is deprecated
uint64_t old = atomic_fetch_add(&counter, 1);

The C11 atomic operations (atomic_compare_exchange_strong, atomic_fetch_add, etc.) generate the correct LOCK-prefixed instructions and are portable across architectures.
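The C11 counterpart of a hand-rolled CAS loop is atomic_compare_exchange_weak, which conveniently reloads the current value into expected on failure, so the loop never re-reads by hand. A sketch with an illustrative fetch-max:

```c
#include <stdatomic.h>
#include <stdint.h>

// Atomically raise *p to at least v ("fetch max"). On CAS failure,
// atomic_compare_exchange_weak stores the winning value into cur,
// so the loop condition re-checks against the latest value.
static void atomic_store_max(_Atomic uint64_t *p, uint64_t v) {
    uint64_t cur = atomic_load(p);
    while (cur < v && !atomic_compare_exchange_weak(p, &cur, v)) {
        // cur now holds the value another thread wrote; retry if still smaller
    }
}
```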


22.6 MSVC Inline Assembly

MSVC uses a different syntax for inline assembly:

// MSVC: __asm keyword (32-bit only — MSVC x64 dropped inline asm)
__asm {
    mov eax, 1
    cpuid
    mov [eax_out], eax
}

// MSVC x64: use intrinsics instead (no inline asm for 64-bit MSVC)
#include <intrin.h>
int info[4];
__cpuid(info, 1);   // CPUID intrinsic for both 32-bit and 64-bit MSVC

💡 Mental Model: MSVC x64 intentionally removed inline assembly to push programmers toward intrinsics and compiler builtins. Microsoft's position: "If you need inline assembly, you're doing it wrong." For 64-bit Windows code, use intrinsics.

Clang's Inline Assembly

Clang supports the same GCC extended inline assembly syntax. Code using GCC extended asm generally compiles unchanged with Clang, though Clang is somewhat stricter about constraint checking.


22.7 Common Inline Assembly Mistakes

Missing Clobber for RBX

CPUID clobbers EBX. Since RBX is callee-saved, the compiler must save it. Specifying "=b"(ebx) as an output constraint automatically handles this. Alternatively:

// If you don't need EBX:
asm volatile("cpuid" : "=a"(eax), "=c"(ecx), "=d"(edx)
             : "a"(leaf) : "rbx");   // declare rbx as clobber

Using Wrong Constraint Letter

// WRONG: "m" on both operands — x86 allows at most one memory operand per instruction
asm("addl %1, %0" : "+m"(result) : "m"(input));
// This fails to assemble; at least one operand must be a register

// CORRECT:
asm("addl %1, %0" : "+r"(result) : "r"(input));

Assuming Register Allocation

// WRONG: assuming %0 will be EAX
asm("xorl %0, %0" : "=r"(result));  // compiler might put result in RCX!
// The output is correct (xor with itself = 0) but if you access EAX directly:
asm("xorl %%eax, %%eax" : "=r"(result));  // This sets EAX=0 but result might be elsewhere

// CORRECT for forcing EAX:
asm("xorl %%eax, %%eax" : "=a"(result));  // "a" = force EAX

Missing volatile for Side-Effecting Asm

// WRONG: compiler might remove this if result is unused
asm("rdtsc" : "=a"(lo), "=d"(hi));

// CORRECT: volatile prevents removal/reordering
asm volatile("rdtsc" : "=a"(lo), "=d"(hi));

22.8 Complete Example: Performance Measurement with RDTSC

// perf_measure.c — measure cache effects using RDTSC inline assembly
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

static inline uint64_t rdtsc_fence(void) {
    uint32_t lo, hi;
    asm volatile(
        "lfence\n\t"
        "rdtsc\n\t"
        "lfence"
        : "=a"(lo), "=d"(hi)
        :
        : "memory"
    );
    return ((uint64_t)hi << 32) | lo;
}

#define CACHE_LINE_SIZE 64
#define ARRAY_SIZE (16 * 1024 * 1024 / sizeof(uint64_t))  // 16MB

void measure_access_pattern(void) {
    uint64_t *arr = (uint64_t *)malloc(ARRAY_SIZE * sizeof(uint64_t));
    if (!arr) { perror("malloc"); return; }
    memset(arr, 0x55, ARRAY_SIZE * sizeof(uint64_t));

    int iterations = 10000;

    // Sequential access (cache-friendly)
    volatile uint64_t sum = 0;
    uint64_t start = rdtsc_fence();
    for (int j = 0; j < iterations; j++) {
        sum += arr[j % ARRAY_SIZE];
    }
    uint64_t seq_cycles = rdtsc_fence() - start;

    // Strided access (cache-unfriendly for large strides)
    int stride = 4096 / sizeof(uint64_t);  // 4 KB apart: a new page (and cache line) per access
    uint64_t start2 = rdtsc_fence();
    for (int j = 0; j < iterations; j++) {
        sum += arr[(j * stride) % ARRAY_SIZE];
    }
    uint64_t stride_cycles = rdtsc_fence() - start2;

    printf("Sequential %d accesses: %llu cycles (%.1f cycles/access)\n",
           iterations, (unsigned long long)seq_cycles, (double)seq_cycles / iterations);
    printf("Strided    %d accesses: %llu cycles (%.1f cycles/access)\n",
           iterations, (unsigned long long)stride_cycles, (double)stride_cycles / iterations);
    printf("(Ignore sum=%llu — prevents dead code elimination)\n\n", (unsigned long long)sum);

    free(arr);
}

int main(void) {
    printf("Measuring memory access patterns with RDTSC:\n");
    measure_access_pattern();
    return 0;
}

Expected output (varies by CPU):

Sequential 10000 accesses: 35000 cycles (3.5 cycles/access)     ← L1/L2 cache hit
Strided    10000 accesses: 400000 cycles (40.0 cycles/access)   ← L3 or RAM miss

🛠️ Lab Exercise: Compile and run this program. Then change ARRAY_SIZE to work within L1 cache (~32KB), L2 cache (~256KB), and beyond L3 (>8MB). Observe how the sequential and strided access times change at each cache level boundary.


🔄 Check Your Understanding:

1. What does the "memory" clobber tell the compiler?
2. Why must CPUID declare "rbx" as a clobber (or use "=b" as an output constraint)?
3. What constraint would you use to force a value into EAX specifically?
4. When is asm volatile necessary vs. plain asm?
5. Why does XCHG with a memory operand not need an explicit LOCK prefix on x86?


Summary

GCC extended inline assembly uses the syntax asm("code" : outputs : inputs : clobbers). Output constraints use "=r"(var) (any register) or "=a"(var) (specific register). Input constraints use "r"(var). Clobbers declare which registers and flags the asm modifies; "memory" is a special clobber that prevents memory access reordering.

Use inline assembly for: CPUID, RDTSC, atomics (when <stdatomic.h> isn't available), I/O ports, and specific instruction sequences. Prefer compiler intrinsics for SIMD, GCC builtins for bit manipulation, and <stdatomic.h> for atomic operations. Missing clobbers, wrong constraint letters, and absent volatile are the three most common bugs.