Chapter 22 Exercises: Inline Assembly


Exercise 1: Basic Constraint Writing

Write GCC extended inline assembly that reads the value of the RFLAGS register into a C variable using PUSHFQ (push flags) and POPQ (pop into register).

unsigned long flags;
// Write inline asm here to read RFLAGS into 'flags'

Requirements:
- Use the "=r" output constraint
- The only clobber needed is "memory" (PUSHFQ briefly writes to the stack)
- Hint: pushfq; popq %0


Exercise 2: Constraint Mismatch Debugging

The following inline assembly is broken. Identify all errors and write the corrected version.

// Broken: compute a = b + c using inline asm
int a, b = 10, c = 20;
asm("addl %1, %2"
    : "=r"(a)
    : "r"(b), "r"(c)
    );

Hint: There are two separate problems: one with the output constraint and one with the instruction semantics.


Exercise 3: Using Named Operands

Rewrite the following inline assembly to use named operands (%[name] syntax) instead of positional %0, %1, %2 notation. The function computes result = (a * b) + c using IMUL and ADD:

int fused_multiply_add(int a, int b, int c) {
    int result;
    asm("imull %2, %1\n\t"
        "addl  %3, %1\n\t"
        "movl  %1, %0"
        : "=r"(result), "+r"(a)
        : "r"(b), "r"(c)
        );
    return result;
}

After rewriting with named operands, explain why named operands improve readability for functions with more than three operands.


Exercise 4: The "a", "b", "c", "d" Constraints

Write a function cpuid_brand_string() that calls CPUID with EAX=0x80000002, 0x80000003, and 0x80000004 to retrieve the 48-character processor brand string. Store the result in a char[49] buffer (null-terminated).

The brand string is retrieved across three CPUID calls:
- EAX=0x80000002: EAX/EBX/ECX/EDX = bytes 0-15
- EAX=0x80000003: EAX/EBX/ECX/EDX = bytes 16-31
- EAX=0x80000004: EAX/EBX/ECX/EDX = bytes 32-47

Use the "a"/"b"/"c"/"d" constraints for EAX/EBX/ECX/EDX. Note that CPUID overwrites RBX, which is callee-saved: the "=b" output constraint is what lets the compiler preserve it (a register used as an operand must not also appear in the clobber list).


Exercise 5: Memory Constraint vs. Register Constraint

Explain the difference between the "r" and "m" constraints. Then write two versions of the same inline assembly snippet that atomically increments a memory-resident counter:

Version A: Use "r" constraint — load the value, increment in a register, write back. Is this actually atomic?

Version B: Use "m" constraint with LOCK INCL — increment the memory location directly with the LOCK prefix. Is this atomic?

Write both versions, explain why Version B provides atomicity that Version A lacks, and explain when you would use each.


Exercise 6: RDTSC Benchmark Framework

Using the RDTSC infrastructure from the chapter, write a complete benchmark function that measures the median latency (not mean) of calling sqrt(2.0) over 101 iterations (odd number for median calculation):

double benchmark_sqrt_median(void);

Requirements:
- Use RDTSCP (not RDTSC) for better serialization
- Use LFENCE before and after RDTSCP
- Run 111 iterations, discarding the first 10 as warmup
- Store the remaining 101 cycle counts, sort them, and return the median
- The return type is double (cycle counts are large but noisy; double is fine for a benchmark result)


Exercise 7: CMPXCHG and ABA Problem

Implement a lock-free stack push operation using CMPXCHG. The stack is represented as:

typedef struct Node {
    int value;
    struct Node *next;
} Node;

typedef struct {
    Node *top;  // shared stack top pointer
} LockFreeStack;
void stack_push(LockFreeStack *stack, Node *new_node);

Requirements:
- Use LOCK CMPXCHG to atomically update stack->top
- Loop until the CAS succeeds
- Use the __asm__ form (same as asm)
- Explain in a comment why the ABA problem can occur with this implementation and what the solution would be (no need to implement the solution)


Exercise 8: XCHG for Spinlock

Using the XCHG-based spinlock from the chapter as reference, implement a complete spinlock with both lock and unlock operations:

typedef volatile int spinlock_t;
#define SPINLOCK_INIT 0

void spinlock_lock(spinlock_t *lock);
void spinlock_unlock(spinlock_t *lock);

Requirements:
- spinlock_lock: Use XCHG in a loop (test-and-set). While the lock is held, spin with PAUSE before retrying (reduces power consumption and pipeline thrashing on hyperthreaded CPUs).
- spinlock_unlock: A simple store of 0 with a release fence is sufficient. Use inline assembly with MFENCE, or explain why a C volatile store suffices on x86.


Exercise 9: Compiler Barrier Without Hardware Fence

Write a compiler_barrier() macro using inline assembly that prevents the compiler from reordering memory operations across the barrier, but emits zero machine instructions at runtime.

#define compiler_barrier() /* your inline asm here */

Then write a test case that demonstrates the barrier's effect: show the difference in GCC -O2 output (using Compiler Explorer) between:

// Without barrier:
x = 1;
flag = 1;

// With barrier:
x = 1;
compiler_barrier();
flag = 1;

where x and flag are plain (non-volatile) global int variables. If both were declared volatile, the compiler would already be forbidden from reordering their accesses relative to each other, so the barrier would show no difference. Explain when compiler_barrier alone is enough (it only constrains the compiler) and when you additionally need MFENCE (to constrain hardware reordering visible to other cores).


Exercise 10: I/O Port Read — Port 0x70 (CMOS)

The x86 CMOS real-time clock is accessed via the I/O port pair 0x70 (address) and 0x71 (data). Reading the current second from CMOS:
1. Write the register index (0x00 = seconds) to port 0x70
2. Read the value from port 0x71

Write a kernel-mode function (assume ring 0 privilege):

unsigned char cmos_read_seconds(void);

Use the outb/inb inline assembly patterns from the chapter. The seconds register returns BCD (binary-coded decimal): convert the result from BCD to binary in C code after the inline asm returns.

BCD conversion: bcd = ((bcd >> 4) * 10) + (bcd & 0x0F).


Exercise 11: CLFLUSH Cache Invalidation Measurement

Write a program that measures the difference in memory access latency between:
1. Accessing a cache-hot value (loaded previously)
2. Accessing a cache-cold value (after CLFLUSH)

typedef struct {
    uint64_t hot_cycles;
    uint64_t cold_cycles;
} CacheLatencyResult;

CacheLatencyResult measure_cache_latency(void *ptr);

Use RDTSC + LFENCE for timing, and CLFLUSH for eviction. The function should:
- Measure the time to read *(uint64_t*)ptr after it has been accessed (hot)
- CLFLUSH the cache line
- MFENCE to ensure the flush completes
- Measure the time to read *(uint64_t*)ptr again (cold)

Typical values: hot = 4-5 cycles (L1 hit), cold = 200-300 cycles (DRAM).


Exercise 12: Inline Assembly in C++ — const Variables and the "i" Constraint

In C++, inline assembly works the same way as in C, but there is a subtle interaction with const variables. Consider:

const int MULTIPLIER = 7;

int multiply_const(int x) {
    int result;
    asm("imull %[mult], %[x]\n\t"
        "movl %[x], %[res]"
        : [res] "=r"(result), [x] "+r"(x)
        : [mult] "i"(MULTIPLIER)   // "i" = immediate constraint
        );
    return result;
}

a) What does the "i" constraint mean, and why is it valid for MULTIPLIER but not for a runtime variable?

b) Compile this with g++ -O2 -S and examine the output. What single instruction does GCC generate?

c) Rewrite using the "r" constraint instead of "i". What is the difference in the generated assembly?

d) For which constraint choices does GCC have the freedom to fold the multiply into a single IMULL immediate form?


Exercise 13: MSVC Equivalent

The following function computes the bit-reverse of a 32-bit integer with a portable C loop (x86-64 has no single-instruction bit-reverse, so inline assembly offers no shortcut here):

// Portable C version
unsigned int bit_reverse_32(unsigned int x) {
    // Naive shift-and-mask implementation
    unsigned int result = 0;
    for (int i = 0; i < 32; i++) {
        result = (result << 1) | (x & 1);
        x >>= 1;
    }
    return result;
}

a) There is no efficient single-instruction bit-reverse on x86-64, but byte-swap does exist as a single instruction (BSWAP), exposed by GCC as __builtin_bswap32. Write the MSVC equivalent using the _byteswap_ulong intrinsic (not inline asm).

b) ARM64 has RBIT (reverse bits). Write ARM64 GCC inline assembly for bit_reverse_32 using RBIT W0, W0.

c) Which approach is more portable? Which is more efficient? Explain the tradeoffs.


Exercise 14: Volatile and Optimization

Consider this timing loop that checks if an event has occurred:

volatile int event_occurred = 0;

void wait_for_event(void) {
    while (!event_occurred) {
        // spin
    }
}

a) Without volatile, what optimization would GCC -O2 apply to this loop?

b) Does volatile provide the correct memory ordering semantics for multithreaded code? Why or why not?

c) Rewrite wait_for_event using C11 atomics (_Atomic int, atomic_load_explicit with memory_order_acquire) and explain why this is preferable to either volatile alone or inline assembly with MFENCE.

d) When would you still need inline assembly with explicit MFENCE rather than C11 atomics? Give a concrete example.


Exercise 15: Full Inline Assembly Project — Hardware Performance Counters

Using a combination of RDTSC, CPUID (for serialization), and basic counter infrastructure, write a self-contained header perf_counter.h that provides:

typedef struct {
    uint64_t start;
    uint64_t end;
} PerfCounter;

static inline void perf_start(PerfCounter *pc);
static inline void perf_stop(PerfCounter *pc);
static inline uint64_t perf_elapsed(const PerfCounter *pc);

Requirements:
- perf_start: Use the CPUID + RDTSC sequence (CPUID serializes, then RDTSC reads)
- perf_stop: Use the RDTSCP + LFENCE sequence (RDTSCP is serialized against earlier instructions; LFENCE prevents later loads from reordering before it)
- perf_elapsed: Returns end - start in TSC cycles
- All functions must be static inline, with zero call overhead
- Include a main() test that measures the overhead of memset(buf, 0, 4096), where buf is a 4096-byte stack array

Bonus: Add a perf_cycles_to_ns(uint64_t cycles, double ghz) helper that converts cycles to nanoseconds given the CPU frequency.


Challenge Exercise: CMPXCHG16B — 128-bit Atomic Compare-and-Swap

x86-64 has CMPXCHG16B — a 128-bit compare-and-swap that atomically compares the 128-bit value in RDX:RAX with a memory location and swaps with RCX:RBX if equal.

typedef struct {
    uint64_t lo;
    uint64_t hi;
} uint128_aligned_t __attribute__((aligned(16)));

int cas128(uint128_aligned_t *ptr,
           uint128_aligned_t expected,
           uint128_aligned_t desired);

Write the implementation using LOCK CMPXCHG16B. Note:
- The memory operand must be 16-byte aligned (guaranteed by __attribute__((aligned(16))))
- RDX:RAX contains the expected value (hi:lo) on entry
- RCX:RBX contains the desired value
- ZF is set on success
- The "A" constraint names the RAX+RDX pair, but separate "a" and "d" operands are more reliable for 128-bit values; RCX and RBX take the "c" and "b" constraints
- Clobbers: "cc" (ZF is modified)

This is used in real lock-free data structures to implement double-word CAS, solving the ABA problem from Exercise 7.