Chapter 30 Further Reading: Concurrency at the Hardware Level
Primary References
"A Primer on Memory Consistency and Cache Coherence" — Sorin, Hill, Wood (2nd ed.) https://www.morganclaypool.com/doi/10.2200/S00962ED2V01Y201910CAC049 The definitive academic treatment of memory consistency models: sequential consistency, TSO, relaxed models, and formal definitions. Chapter 5 (Relaxed Models) directly addresses x86-64 TSO and ARM64. Available as a free PDF from the publisher. If you want to understand why the hardware behaves as it does, this is the resource.
Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, Chapter 8 https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html Chapter 8 (Multiple-Processor Management) specifies the x86-64 memory ordering model precisely, with examples of each type of reordering and which instructions provide which ordering guarantees. The store buffer explanation and the exact semantics of MFENCE/SFENCE/LFENCE are normative here.
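The store-buffer reordering that the Intel chapter describes is usually demonstrated with the "store buffering" litmus test. The following is a minimal sketch in C11 rather than assembly, assuming a POSIX threads environment; the names x, y, r1, r2, and count_both_zero are illustrative and do not come from the manual. With sequentially consistent atomics, the outcome in which both loads return 0 is forbidden, so the function should always return 0.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Shared locations for the store-buffering (SB) litmus test. */
static atomic_int x, y;
static int r1, r2; /* each written by one thread, read after join */

static void *thread_a(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);  /* store x ... */
    r1 = atomic_load_explicit(&y, memory_order_seq_cst); /* ... then load y */
    return NULL;
}

static void *thread_b(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);  /* store y ... */
    r2 = atomic_load_explicit(&x, memory_order_seq_cst); /* ... then load x */
    return NULL;
}

/* Runs the test `iterations` times and returns how often both loads saw 0. */
int count_both_zero(int iterations) {
    int hits = 0;
    for (int i = 0; i < iterations; i++) {
        pthread_t a, b;
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0) {
            hits++;
        }
    }
    return hits;
}
```

If every memory_order_seq_cst above is changed to memory_order_relaxed, the both-zero outcome becomes permitted, and x86-64 hardware can produce it because each store may still be sitting in its core's store buffer when the following load executes. On x86-64, compilers typically implement the sequentially consistent store with XCHG or with MOV followed by MFENCE.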
ARM Architecture Reference Manual (ARM64) https://developer.arm.com/documentation/ddi0487/latest/ Chapter B2 (The AArch64 Application Level Memory Model) specifies the ARM64 memory model. It is a harder read than the Intel chapter because ARM64's model is weaker: it permits more reorderings than x86-64 TSO, so more cases need explicit barriers or acquire/release instructions. Essential for anyone writing concurrent code for ARM64 systems.
Deep Dives
"Preshing on Programming" — Memory Ordering Articles https://preshing.com/20120625/memory-ordering-at-compile-time/ https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ Jeff Preshing's series on memory ordering is the clearest practical explanation available. Uses source-control analogies that make the abstract memory model concrete. Covers compiler reordering vs. hardware reordering, acquire/release semantics, and the C11 atomic model. Read all six articles in sequence.
"Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms" — Michael and Scott (1996) https://www.cs.rochester.edu/u/scott/papers/1996_PODC_queues.pdf The original paper for the Michael-Scott queue from Case Study 30-1. Four pages. Shows the algorithm, the correctness proof (in informal terms), and the key insight that two sentinel nodes prevent ABA on head and tail simultaneously. A model of how to write a correct concurrent algorithm paper.
"Implementing Lock-Free Queues" — John D. Valois A deeper analysis of lock-free queue implementations, covering the hazard pointer approach for safe memory reclamation (the production-quality solution to the memory reclamation problem that our simplified queue ignores).
Tools
perf-c2c — Cache-to-Cache False Sharing Detector
https://man7.org/linux/man-pages/man1/perf-c2c.1.html
Linux perf c2c uses the CPU's hardware memory-access sampling to locate false sharing down to the cache line and offset. Run perf c2c record to collect samples, then perf c2c report to see which addresses trigger HITM (hit-modified) cache coherence events. Those addresses point to the data structures that need cache-line padding. Essential for diagnosing concurrent performance problems.
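Once a report identifies a contended line, the usual fix is to give each independently written field its own cache line. The following is a minimal C11 sketch of that padding, assuming 64-byte cache lines (common on current x86-64 parts but not universal). The struct names and the padded_gap helper are illustrative only.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumption: 64-byte cache lines */

/* Unpadded: both counters normally share one cache line, so two threads
   writing a and b respectively invalidate each other's copy of the line. */
struct counters_shared {
    long a;
    long b;
};

/* Padded: each counter starts on its own cache-line boundary. */
struct counters_padded {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};

/* Distance in bytes between the two padded counters. */
size_t padded_gap(void) {
    return offsetof(struct counters_padded, b)
         - offsetof(struct counters_padded, a);
}
```

The cost is memory: the padded struct occupies two full cache lines instead of a few bytes, so padding is worth applying only to fields that a profile shows are actually contended.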
Helgrind / ThreadSanitizer — Data Race Detection https://valgrind.org/docs/manual/hg-manual.html https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual Helgrind (part of Valgrind) and ThreadSanitizer (available in Clang and GCC) dynamically detect data races: concurrent accesses to the same memory without sufficient synchronization. These tools work at the C/C++ level, but the races they report are the same class of bug that a missing LOCK prefix or incorrect memory ordering produces in assembly. Note that ThreadSanitizer relies on compiler instrumentation and so does not see memory accesses made inside hand-written assembly, whereas Helgrind instruments the compiled binary. Use them on the C wrappers around assembly-backed concurrent data structures.
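To make the connection to the LOCK prefix concrete, here is a small C11 sketch of a shared counter, assuming POSIX threads. The names worker and run_counter and the thread and iteration counts are arbitrary choices for illustration. The version shown is the corrected one: replacing the atomic increment with a plain counter++ on a non-atomic long gives the kind of data race these tools are designed to report.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define THREADS 4
#define INCREMENTS 10000

static atomic_long counter;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        /* Atomic read-modify-write: the C-level analogue of putting the
           LOCK prefix on the increment. A plain `counter++` on a
           non-atomic long here would be a data race. */
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    }
    return NULL;
}

/* Starts THREADS workers and returns the final counter value. */
long run_counter(void) {
    pthread_t t[THREADS];
    atomic_store(&counter, 0);
    for (int i = 0; i < THREADS; i++) {
        pthread_create(&t[i], NULL, worker, NULL);
    }
    for (int i = 0; i < THREADS; i++) {
        pthread_join(t[i], NULL);
    }
    return atomic_load(&counter);
}
```

To check a program like this, build it with -fsanitize=thread (Clang or GCC) and run it normally, or run an ordinary build under valgrind --tool=helgrind.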
The LMAX Disruptor Design https://lmax-exchange.github.io/disruptor/disruptor.html A high-performance inter-thread queue that eliminates false sharing through aggressive cache-line padding and ring buffer design. The whitepaper explains exactly how padding cache lines, avoiding locks in the critical path, and single-writer-per-sequence design combine to achieve very high throughput; the paper reports more than 25 million messages per second with latencies below 50 nanoseconds. An excellent case study in applying the principles from this chapter to a real system.
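The ideas named above (a ring buffer, padded indices, and a single writer per index) can be sketched as a single-producer, single-consumer queue in C11. This is not the Disruptor's implementation, which is written in Java and supports multiple consumers through sequence barriers; it is a much smaller illustration under stated assumptions: 64-byte cache lines, a power-of-two capacity, and exactly one producer thread and one consumer thread. All names are hypothetical.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumption: 64-byte cache lines */
#define RING_SIZE 8   /* must be a power of two */

typedef struct {
    alignas(CACHE_LINE) atomic_size_t head; /* written only by the consumer */
    alignas(CACHE_LINE) atomic_size_t tail; /* written only by the producer */
    alignas(CACHE_LINE) int slots[RING_SIZE];
} spsc_ring;

void ring_init(spsc_ring *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side: returns false if the ring is full. */
bool ring_push(spsc_ring *r, int value) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE) {
        return false;
    }
    r->slots[tail & (RING_SIZE - 1)] = value;
    /* Release: the slot write becomes visible before the new tail does. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
bool ring_pop(spsc_ring *r, int *out) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail) {
        return false;
    }
    *out = r->slots[head & (RING_SIZE - 1)];
    /* Release: the slot read completes before the slot is handed back. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, neither operation needs a LOCK-prefixed instruction or a CAS loop, and because head and tail sit on separate cache lines, the producer and consumer do not invalidate each other's index on every operation.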