Chapter 30 Key Takeaways: Concurrency at the Hardware Level
- The hardware does not guarantee sequential consistency. Even correct single-threaded code can produce surprising results under concurrency because the CPU and memory system reorder operations for performance. Understanding the memory model is a prerequisite for writing correct concurrent code in any language.
- x86-64 TSO allows only one reordering: Store-Load. All other orderings (Load-Load, Store-Store, Load-Store) are preserved. A store may remain in the CPU's store buffer while a subsequent load reads from cache, allowing another CPU to see the load's result before the store. This is the source of the "both CPUs read 0" scenario.
- `MFENCE` is a full fence; `SFENCE` and `LFENCE` are partial. `MFENCE` orders all prior memory accesses before all subsequent ones. `SFENCE` orders stores only (needed after non-temporal stores). `LFENCE` orders loads and is used as a Spectre mitigation to prevent speculative execution past a security-sensitive load.
- The `LOCK` prefix makes read-modify-write operations atomic. It works with ADD, AND, INC, DEC, OR, SUB, XOR, XADD, CMPXCHG, NEG, NOT, SBB, and others. The LOCK prefix asserts exclusive ownership of the cache line containing the operand, forcing a cache coherence event. It is expensive (~10–100 cycles) but necessary for shared counters and flags.
- `XCHG` with a memory operand is always atomic — no LOCK prefix needed. This makes `XCHG [lock], al` a valid spinlock implementation without the explicit LOCK.
- `CMPXCHG` is the universal atomic primitive. It atomically compares `[mem]` with RAX and, if equal, stores `src_reg` into `[mem]` and sets ZF=1. If not equal, RAX is loaded with `[mem]` and ZF=0. Every mutex, spinlock, and lock-free data structure is built from this operation.
- `PAUSE` inside spin loops is mandatory for performance. Without PAUSE, the CPU speculatively executes ahead in the loop and pays a pipeline-flush penalty when the lock word changes. PAUSE signals the spin-loop state, reducing power consumption by ~40% and reducing contention with the lock holder.
- ARM64 allows all four reordering types (much weaker than x86-64 TSO). ARM64 assembly for concurrent data structures requires explicit `DMB`, `DSB`, and `ISB` barriers. ARM64 uses `LDXR`/`STXR` (Load/Store Exclusive) for CAS operations, not `LOCK CMPXCHG`.
- `CMPXCHG16B` enables ABA-safe lock-free algorithms. By combining a pointer with a version counter in a 128-bit value, each CAS includes both the pointer and the version — preventing a pointer from appearing unchanged after it went through A→B→A transitions. The 16-byte operand must be 16-byte aligned.
- False sharing causes cache lines to be invalidated between cores even when the threads write to different variables. Two variables on the same 64-byte cache line are not independent from the hardware's perspective. The fix — padding variables to cache-line boundaries — wastes 56 bytes per 8-byte variable but can eliminate a 10–12× performance penalty.
- `perf c2c` detects false sharing by counting HITM (Hit Modified) events — loads that found their cache line was in Modified state on another core. If `perf c2c report` shows high HITM rates on a specific address, that data structure needs cache-line padding.
- Futex provides the fast path for user-space mutexes. When a mutex is uncontended, `mutex_lock` and `mutex_unlock` make zero system calls — they are pure CAS operations. The system call overhead only occurs when there is actual contention. This is why `pthread_mutex_lock` can be fast even though it is conceptually a kernel operation.