Chapter 22 Further Reading: Inline Assembly

Open Assembly Language Project

1. GCC Documentation — "Extended Asm — Assembler Instructions with C Expression Operands" https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

The authoritative GCC reference for extended inline assembly. Covers output and input operands, named operand syntax, the clobber list, the volatile and goto qualifiers, and flag output operands. The companion "Constraints for asm Operands" section documents every constraint letter and modifier (+, =, &, %) and is the single most important reference for writing correct inline assembly.
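
As a rough illustration of the features this reference documents, the sketch below uses a read-write operand ("+r"), named operands, fixed-register constraints ("a", "d"), and a "cc" clobber. It assumes x86-64 with GCC or Clang and AT&T syntax; the function names add_in_place and mul_hi are hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* acc += b. "+r" marks acc as both read and written; %[dst] and %[src]
   are named operands. ADD changes the flags, so "cc" is listed as a clobber. */
static inline uint64_t add_in_place(uint64_t acc, uint64_t b)
{
    __asm__ ("addq %[src], %[dst]"
             : [dst] "+r" (acc)
             : [src] "r" (b)
             : "cc");
    return acc;
}

/* High 64 bits of a 64x64-bit unsigned multiply. MUL reads RAX implicitly and
   writes RDX:RAX, which is expressed with the "a" and "d" constraints. */
static inline uint64_t mul_hi(uint64_t x, uint64_t y)
{
    uint64_t lo, hi;
    __asm__ ("mulq %[y]"
             : "=a" (lo), "=d" (hi)
             : "a" (x), [y] "r" (y)
             : "cc");
    (void)lo;   /* lo is declared only so the compiler knows RAX is overwritten */
    return hi;
}
```

For example, add_in_place(40, 2) yields 42, and mul_hi(1ULL << 63, 4) yields 2 because the product is 2^65.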

2. "GCC-Inline-Assembly-HOWTO" by Sandeep.S https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html

A practical, example-driven tutorial, written before the GCC documentation was well organized, that walks through AT&T syntax, constraints, and common patterns. Dated (its examples target 32-bit x86, so registers appear as %%eax rather than %%rax) but still one of the clearest introductions. The sections on volatile, memory clobbers, and the difference between a numbered operand such as %0 and a literal register such as %%eax are particularly useful for beginners.
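
The distinctions that tutorial stresses can be sketched in a few lines. This is an illustrative example rather than code from the HOWTO: it assumes x86-64 with GCC or Clang, and the names load_via_rax and store_u64 are made up for the purpose.

```c
#include <stdint.h>

/* %0 and %1 are operands whose registers the compiler picks; %%rax is the
   literal register RAX (the % must be doubled in extended asm). Because the
   template overwrites RAX behind the compiler's back, "rax" is a clobber. */
static inline uint64_t load_via_rax(uint64_t value)
{
    uint64_t out;
    __asm__ ("movq %1, %%rax\n\t"
             "movq %%rax, %0"
             : "=r" (out)
             : "r" (value)
             : "rax");
    return out;
}

/* Store through a pointer. The written location does not appear as an output
   operand, so the "memory" clobber tells the compiler that memory changed.
   An asm with no outputs is implicitly volatile; the keyword is spelled out
   here for clarity. */
static inline void store_u64(uint64_t *p, uint64_t v)
{
    __asm__ volatile ("movq %1, (%0)"
                      :
                      : "r" (p), "r" (v)
                      : "memory");
}
```

Without the "memory" clobber, the compiler would be free to assume *p is unchanged after store_u64 returns and reuse a stale value held in a register.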

3. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2 — Instruction Set Reference https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

The primary reference for every instruction used in this chapter: CPUID (leaf definitions, clobbers), RDTSC/RDTSCP (serialization requirements), CMPXCHG/CMPXCHG16B (operation, ZF behavior), CLFLUSH (cache line granularity, ordering requirements), MFENCE/SFENCE/LFENCE (ordering scope), XCHG (implicit LOCK when one operand is in memory), PAUSE (spin-wait hint). Volume 3 of the same manual covers TSC invariance and the memory-ordering model in more depth. When the GCC docs say "use the right constraint," the Intel SDM tells you what the instruction actually does.
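
To show how the SDM's description of an instruction turns into constraints, here is a minimal CPUID wrapper. The SDM states that CPUID takes its leaf in EAX (and a subleaf in ECX) and overwrites EAX, EBX, ECX, and EDX, so all four appear as outputs. The sketch assumes x86-64 with GCC or Clang; cpuid_query and cpuid_vendor are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* regs[] receives EAX, EBX, ECX, EDX in that order. */
static inline void cpuid_query(uint32_t leaf, uint32_t subleaf, uint32_t regs[4])
{
    __asm__ volatile ("cpuid"
                      : "=a" (regs[0]), "=b" (regs[1]),
                        "=c" (regs[2]), "=d" (regs[3])
                      : "a" (leaf), "c" (subleaf));
}

/* Leaf 0 returns the highest basic leaf in EAX and the 12-byte vendor
   string in EBX, EDX, ECX (in that order). */
static inline uint32_t cpuid_vendor(char out[13])
{
    uint32_t r[4];
    cpuid_query(0, 0, r);
    memcpy(out + 0, &r[1], 4);  /* EBX */
    memcpy(out + 4, &r[3], 4);  /* EDX */
    memcpy(out + 8, &r[2], 4);  /* ECX */
    out[12] = '\0';
    return r[0];
}
```

Calling cpuid_vendor with a 13-byte buffer fills it with the vendor identification string and returns the highest supported basic leaf.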

4. "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms" — Maged M. Michael and Michael L. Scott, PODC 1996 (the original Michael-Scott queue paper)

The foundational paper for the lock-free queue implemented in Case Study 22-2. Describes the two-pointer (head/tail) structure, the CAS operations, the "helping" mechanism (advancing tail when it falls behind), and the correctness proof. It is a staple of courses on lock-free data structures. Reading it after implementing the queue makes both the code and the paper far easier to follow.
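
The CAS primitive the paper builds on can be sketched with LOCK CMPXCHG. This is not the queue itself, only the building block plus the read, compute, retry loop that the queue's enqueue and dequeue follow. It assumes x86-64 with GCC or Clang; cas_u64 and cas_increment are hypothetical names.

```c
#include <stdint.h>
#include <stdbool.h>

/* Atomically: if (*ptr == expected) { *ptr = desired; return true; }
   CMPXCHG compares RAX with the memory operand and sets ZF on success,
   so expected is pinned to RAX with "+a" and ZF is captured with SETE. */
static inline bool cas_u64(volatile uint64_t *ptr, uint64_t expected, uint64_t desired)
{
    unsigned char ok;
    __asm__ volatile ("lock cmpxchgq %[new], %[mem]\n\t"
                      "sete %[ok]"
                      : [ok] "=q" (ok), [mem] "+m" (*ptr), "+a" (expected)
                      : [new] "r" (desired)
                      : "memory", "cc");
    return ok;
}

/* Typical CAS retry loop: snapshot, compute the new value, try to install
   it, and start over if another thread changed the location in between. */
static inline uint64_t cas_increment(volatile uint64_t *ctr)
{
    uint64_t old;
    do {
        old = *ctr;
    } while (!cas_u64(ctr, old, old + 1));
    return old;
}
```

cas_increment returns the value it replaced, so concurrent callers each observe a distinct result.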

5. "Is Parallel Programming Hard, And, If So, What Can You Do About It?" — Paul E. McKenney, kernel.org (free PDF)

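Paul McKenney (Linux kernel RCU maintainer) covers memory ordering, memory barriers, and the difference between compiler barriers and hardware fences in exhaustive detail. Chapter numbers shift between editions: in recent editions the main treatment is the chapter "Advanced Synchronization: Memory Ordering", and Appendix C, "Why Memory Barriers?", explains the cache-coherence and store-buffer hardware that makes barriers necessary. The x86 memory model (TSO) is discussed alongside weaker architectures. The material on smp_mb(), smp_rmb(), and smp_wmb() shows how the Linux kernel builds portable concurrency primitives on top of inline assembly barriers.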
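
The compiler-barrier versus hardware-fence distinction can be sketched as follows. The macro names compiler_barrier and full_fence are hypothetical stand-ins in the spirit of the kernel's barrier() and mb(), and the reasoning in the comments assumes x86-64 (TSO) with GCC or Clang; on weaker architectures the first two functions would need real fences.

```c
/* Empty asm with a "memory" clobber: emits no instruction, but the compiler
   may not move memory accesses across it or keep values cached in registers. */
#define compiler_barrier() __asm__ volatile ("" : : : "memory")

/* MFENCE: a hardware fence that also orders earlier stores before later loads. */
#define full_fence()       __asm__ volatile ("mfence" : : : "memory")

static int payload;
static volatile int ready;
static volatile int flag_a, flag_b;

/* Store-store ordering: x86 hardware already keeps stores in program order,
   so only the compiler has to be restrained. */
static inline void publish(int v)
{
    payload = v;
    compiler_barrier();
    ready = 1;
}

/* Load-load ordering: likewise preserved by x86 hardware. */
static inline int try_consume(int *out)
{
    if (!ready)
        return 0;
    compiler_barrier();
    *out = payload;
    return 1;
}

/* Store-load ordering is the one case TSO does not guarantee: without the
   fence, the load of flag_b may be satisfied before the store to flag_a
   becomes visible to other cores. */
static inline int announce_and_check(void)
{
    flag_a = 1;
    full_fence();
    return flag_b;
}
```

The last function is the pattern at the heart of Dekker-style mutual exclusion, which is where a compiler-only barrier is not enough on x86.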

6. "Intel Intrinsics Guide" https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

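When inline assembly is the wrong tool, compiler intrinsics are usually the right one. The Intrinsics Guide documents the intrinsics for many of this chapter's instructions: _rdtsc, __rdtscp, _mm_clflush, _mm_mfence, _mm_pause. CMPXCHG16B is a notable exception: it has no _mm_ intrinsic and is normally reached through compiler builtins, such as GCC's __sync_bool_compare_and_swap on a 16-byte type (with -mcx16) or MSVC's _InterlockedCompareExchange128. Understanding both the inline assembly and the intrinsic for the same instruction clarifies what each approach provides and when to choose which.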
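
A side-by-side sketch of the two approaches for PAUSE and MFENCE is shown here. It assumes GCC or Clang on x86-64, where x86intrin.h is the umbrella header for these intrinsics; the four function names are hypothetical.

```c
#include <x86intrin.h>   /* _mm_pause, _mm_mfence (GCC/Clang umbrella header) */

/* Bounded spin-wait using the intrinsic. Returns the number of spins used. */
static inline unsigned spin_wait_intrinsic(const volatile int *flag, unsigned max_spins)
{
    unsigned spins = 0;
    while (!*flag && spins < max_spins) {
        _mm_pause();
        spins++;
    }
    return spins;
}

/* The same loop with the instruction written as inline assembly. */
static inline unsigned spin_wait_asm(const volatile int *flag, unsigned max_spins)
{
    unsigned spins = 0;
    while (!*flag && spins < max_spins) {
        __asm__ volatile ("pause");
        spins++;
    }
    return spins;
}

/* Store followed by a full fence, intrinsic form. */
static inline void store_fenced_intrinsic(int *slot, int v)
{
    *slot = v;
    _mm_mfence();
}

/* Store followed by a full fence, inline assembly form. The "memory"
   clobber has to be written by hand; the intrinsic handles that itself. */
static inline void store_fenced_asm(int *slot, int v)
{
    *slot = v;
    __asm__ volatile ("mfence" : : : "memory");
}
```

The intrinsic versions need no constraint or clobber bookkeeping and are the same source on MSVC and the Intel compiler (apart from the header name), which is most of the argument for preferring them when one exists.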

7. "Spectre Attacks: Exploiting Speculative Execution" — Kocher et al., IEEE S&P 2019, https://spectreattack.com/spectre.pdf

The Spectre paper relies on the cache timing techniques from Case Study 22-1: a timestamp-counter read (RDTSC/RDTSCP) for timing, CLFLUSH for cache eviction, and careful control of instruction ordering for measurement precision. The example implementation in the paper's appendix reaches these instructions through intrinsics such as _mm_clflush and __rdtscp rather than inline assembly, but after completing this chapter the underlying instructions are immediately recognizable. The paper is both a security research milestone and a practical demonstration of these instructions used for microarchitectural measurement.
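
The measurement primitive behind such attacks, timing one load with and without the line in cache, might look like the sketch below. It is not code from the paper: it assumes x86-64 with GCC or Clang, uses LFENCE around RDTSC as one common way to bound the timed region, and all function names are hypothetical. Actual cycle counts vary by machine and are noisy, so none are shown.

```c
#include <stdint.h>

/* Evict the cache line containing p from every cache level. */
static inline void flush_line(volatile void *p)
{
    __asm__ volatile ("clflush %0" : "+m" (*(volatile char *)p));
    __asm__ volatile ("mfence" : : : "memory");   /* wait for the flush to finish */
}

/* RDTSC bracketed by LFENCE so earlier loads complete before the counter
   is read and later loads do not start before it. */
static inline uint64_t read_tsc_fenced(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("lfence\n\t"
                      "rdtsc\n\t"
                      "lfence"
                      : "=a" (lo), "=d" (hi)
                      :
                      : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Cycles (TSC ticks) taken by a single one-byte load from p. */
static inline uint64_t timed_load(const volatile uint8_t *p)
{
    uint64_t start = read_tsc_fenced();
    (void)*p;                         /* the volatile access forces the load */
    uint64_t end = read_tsc_fenced();
    return end - start;
}

/* Time the same address once warm and once after CLFLUSH. */
static inline void measure_hit_and_miss(volatile uint8_t *p,
                                        uint64_t *hit_ticks, uint64_t *miss_ticks)
{
    (void)*p;                         /* warm the line */
    *hit_ticks = timed_load(p);
    flush_line(p);
    *miss_ticks = timed_load(p);
}
```

On typical hardware miss_ticks comes out several times larger than hit_ticks, and a threshold between the two is what lets an attacker infer whether a line was touched.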

8. "A Primer on Memory Consistency and Cache Coherence" — Sorin, Hill, and Wood, Synthesis Lectures on Computer Architecture (free PDF available)

Chapter 4 covers total store order (TSO) and the x86 memory model; Chapter 5 moves on to relaxed memory consistency models weaker than TSO (the family to which PSO and RMO belong) and their implications for synchronization. Together they explain why x86 needs fewer hardware fences than ARM64 or RISC-V for the same correctness guarantees, and why the inline assembly fence patterns differ across architectures. Essential context for anyone writing portable concurrent code.

9. Linux Kernel Source — arch/x86/include/asm/atomic.h and arch/x86/include/asm/barrier.h https://github.com/torvalds/linux

On x86, the Linux kernel implements its atomic operations and memory barriers with inline assembly. atomic.h shows production-quality inline assembly for add, cmpxchg, xchg, and related operations (named atomic_add, atomic_cmpxchg, and atomic_xchg in older trees and arch_atomic_* in recent ones). barrier.h defines the hardware barriers mb(), rmb(), and wmb(); the compiler-only barrier() lives in the generic compiler headers under include/linux/. This code is a widely used model for correct, production inline assembly; read it after completing the chapter.
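
A simplified user-space sketch in the style of those headers is given below. It is loosely modeled on the kernel's approach and is not a copy of it: the demo_atomic type and both function names are hypothetical, and x86-64 with GCC or Clang is assumed.

```c
/* Hypothetical wrapper type, similar in shape to the kernel's atomic_t. */
typedef struct { int counter; } demo_atomic;

/* Atomic add. "+m" makes the counter a read-write memory operand and "ir"
   lets the compiler pass i as an immediate or in a register. */
static inline void demo_atomic_add(int i, demo_atomic *v)
{
    __asm__ volatile ("lock addl %1, %0"
                      : "+m" (v->counter)
                      : "ir" (i)
                      : "memory", "cc");
}

/* Atomic fetch-and-add. LOCK XADD swaps the register with the old memory
   value while storing the sum, so i holds the previous value afterwards. */
static inline int demo_atomic_fetch_add(int i, demo_atomic *v)
{
    __asm__ volatile ("lock xaddl %0, %1"
                      : "+r" (i), "+m" (v->counter)
                      :
                      : "memory", "cc");
    return i;
}
```

Wrapping the integer in a struct is a deliberate choice: it prevents callers from touching the counter with ordinary, non-atomic expressions by accident.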

10. "The Art of Multiprocessor Programming" — Herlihy and Shavit, Morgan Kaufmann (textbook)

Chapter 7 covers spin locks and contention: test-and-set (TAS), test-and-test-and-set (TTAS), CLH queue locks, and MCS locks, which gives theoretical grounding to the XCHG-based spinlock from this chapter. Chapter 10 covers concurrent queues, including the Michael-Scott queue and the ABA problem, with linearizability arguments. Combining the hardware-level inline assembly knowledge from this chapter with the formal correctness reasoning in Herlihy & Shavit gives a much more complete understanding of concurrent data structures.
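
For reference while reading the spin lock chapter, a TTAS lock built on XCHG can be sketched as follows. This is an illustrative version, not the chapter's listing or the textbook's code: it assumes x86-64 with GCC or Clang, and demo_spinlock and its functions are hypothetical names.

```c
#include <stdint.h>

typedef struct { volatile uint32_t locked; } demo_spinlock;

/* XCHG with a memory operand is locked implicitly, so no LOCK prefix is needed. */
static inline uint32_t xchg_u32(volatile uint32_t *p, uint32_t v)
{
    __asm__ volatile ("xchgl %0, %1"
                      : "+r" (v), "+m" (*p)
                      :
                      : "memory");
    return v;
}

/* One test-and-set attempt: returns 1 if the lock was acquired. */
static inline int demo_trylock(demo_spinlock *l)
{
    return xchg_u32(&l->locked, 1) == 0;
}

/* Test-and-test-and-set: after a failed XCHG, spin on plain reads (which hit
   the local cache) and retry the expensive atomic only once the lock looks free. */
static inline void demo_lock(demo_spinlock *l)
{
    while (!demo_trylock(l)) {
        while (l->locked)
            __asm__ volatile ("pause");   /* spin-wait hint */
    }
}

/* On x86 a plain store is enough to release; the compiler barrier keeps
   critical-section accesses from being moved below it. */
static inline void demo_unlock(demo_spinlock *l)
{
    __asm__ volatile ("" : : : "memory");
    l->locked = 0;
}
```

The inner read-only loop is the difference between TAS and TTAS: waiting threads stop generating coherence traffic with repeated atomic writes, which is the contention effect the textbook analyzes.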