Chapter 30 Key Takeaways: Concurrency at the Hardware Level

  • The hardware does not guarantee sequential consistency. Even correct single-threaded code can produce surprising results under concurrency because the CPU and memory system reorder operations for performance. Understanding the memory model is prerequisite for writing correct concurrent code in any language.

  • x86-64 TSO allows only one reordering: Store-Load. All other orderings (Load-Load, Store-Store, Load-Store) are preserved. A store may remain in the CPU's store buffer while a subsequent load reads from cache, allowing another CPU to see the load's result before the store. This is the source of the "both CPUs read 0" scenario.
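
The Store-Load scenario can be made concrete with the classic store-buffering litmus test. The sketch below (hypothetical `run_sb_litmus` helper, assuming C11 atomics and POSIX threads) uses memory_order_seq_cst, which makes the compiler emit the fences or LOCKed operations that forbid the "both read 0" outcome; weaken the orders to memory_order_relaxed and r1 == r2 == 0 becomes observable on real hardware.

```c
// Store-buffering litmus test sketch. Under sequential consistency the
// outcome r1 == 0 && r2 == 0 is impossible: at least one thread must
// observe the other's store.
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x, y;
static int r1, r2;

static void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_seq_cst);  // store x
    r1 = atomic_load_explicit(&y, memory_order_seq_cst); // then load y
    return NULL;
}

static void *t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_seq_cst);  // store y
    r2 = atomic_load_explicit(&x, memory_order_seq_cst); // then load x
    return NULL;
}

// Runs one iteration of the test; returns r1 + r2 (always >= 1 here).
int run_sb_litmus(void) {
    pthread_t a, b;
    atomic_store(&x, 0);
    atomic_store(&y, 0);
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return r1 + r2;
}
```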

  • MFENCE is a full fence; SFENCE and LFENCE are partial. MFENCE orders all prior memory accesses before all subsequent ones. SFENCE orders stores only (needed after non-temporal stores). LFENCE orders loads and, because it blocks later instructions from executing until all prior instructions complete, is used as a Spectre mitigation to prevent speculative execution past a security-sensitive load.
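
A sketch of where a full fence sits in practice (assuming C11 atomics; `publish`/`consume` are illustrative names): atomic_thread_fence(memory_order_seq_cst) typically compiles to MFENCE (or an equivalent LOCKed operation) on x86-64, while an acquire fence costs no instruction under TSO and only restrains the compiler. A release fence would suffice for this pattern; seq_cst is used here to illustrate the full fence.

```c
// Fence sketch: publish a payload, then raise a flag, with an explicit
// fence keeping the two stores ordered for other CPUs.
#include <stdatomic.h>

static atomic_int flag;
static int payload;

void publish(int v) {
    payload = v;                               // plain store
    atomic_thread_fence(memory_order_seq_cst); // full fence: MFENCE (or
                                               // a LOCKed RMW) on x86-64
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

int consume(void) {
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                      // spin until published
    atomic_thread_fence(memory_order_acquire); // no instruction on TSO,
                                               // but stops compiler
                                               // reordering
    return payload;
}
```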

  • The LOCK prefix makes read-modify-write operations atomic. It works with ADD, AND, INC, DEC, OR, SUB, XOR, XADD, CMPXCHG, NEG, NOT, SBB, and others. The LOCK prefix asserts exclusive ownership of the cache line containing the operand, forcing a cache coherence event. It is expensive (~10–100 cycles) but necessary for shared counters and flags.
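
A shared-counter sketch, assuming C11 atomics and POSIX threads (`run_counter` and `worker` are illustrative names): atomic_fetch_add on x86-64 compiles to LOCK XADD (or LOCK ADD when the old value is unused), the LOCKed read-modify-write described above. Without it, concurrent increments would lose updates.

```c
// Four threads each add 100000 to a shared counter; the LOCKed RMW
// guarantees the final value is exactly 400000.
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

enum { THREADS = 4, ITERS = 100000 };
static atomic_long counter;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return NULL;
}

long run_counter(void) {
    pthread_t t[THREADS];
    atomic_store(&counter, 0);
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&counter);
}
```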

  • XCHG with memory is always atomic — no LOCK prefix needed. This makes XCHG [lock], al a valid spinlock implementation without the explicit LOCK.
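
A minimal test-and-set spinlock sketch built on this property (`spinlock_t`, `spin_lock`, `spin_unlock` are illustrative names, assuming C11 atomics): atomic_exchange on x86-64 compiles to XCHG with a memory operand, which is implicitly LOCKed.

```c
// Test-and-set spinlock: XCHG returns the old value, so reading back 0
// means we acquired the lock; reading back 1 means someone else holds it.
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

void spin_lock(spinlock_t *l) {
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
        ;  // retry until the previous value was 0
}

void spin_unlock(spinlock_t *l) {
    // A plain release store suffices to unlock: TSO preserves Store-Store
    // ordering, so prior critical-section writes are visible first.
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```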

  • CMPXCHG is the universal atomic primitive. With the LOCK prefix, it atomically compares [mem] with RAX and, if they are equal, stores src_reg into [mem] and sets ZF=1; if they differ, it loads [mem] into RAX and clears ZF. Every mutex, spinlock, and lock-free data structure is built from this operation.
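
The classic CAS-loop pattern can be sketched with C11's atomic_compare_exchange, which compiles to LOCK CMPXCHG on x86-64 (`atomic_store_max` is an illustrative name): the "expected" variable plays the role of RAX, and on failure it is reloaded with the current memory contents, just as RAX is.

```c
// Lock-free "store maximum": retry the CAS until our value is installed
// or a larger value is already present.
#include <stdatomic.h>

long atomic_store_max(atomic_long *target, long v) {
    long cur = atomic_load_explicit(target, memory_order_relaxed);
    while (cur < v &&
           !atomic_compare_exchange_weak_explicit(
               target, &cur, v,
               memory_order_acq_rel, memory_order_relaxed))
        ;  // on failure, cur now holds the value another thread installed
    return cur < v ? v : cur;  // the maximum now stored in *target
}
```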

  • PAUSE inside spin loops is mandatory for performance. Without PAUSE, the CPU speculatively executes ahead in the loop and pays a pipeline-flush penalty when the lock word changes. PAUSE hints to the processor that it is in a spin-wait loop, reducing power consumption by ~40% and reducing contention with the lock holder.
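
A test-and-test-and-set spinlock sketch showing where PAUSE belongs (`ttas_lock_t` and the function names are illustrative; assumes GCC/Clang on x86-64, where `_mm_pause()` emits PAUSE): spin read-only on the locally cached lock word, executing PAUSE each iteration, and attempt the expensive LOCKed XCHG only when the lock looks free.

```c
// Test-and-test-and-set spinlock with PAUSE in the wait loop.
#include <stdatomic.h>
#include <immintrin.h>   // _mm_pause(), x86-only

typedef struct { atomic_int locked; } ttas_lock_t;

void ttas_lock(ttas_lock_t *l) {
    for (;;) {
        // Attempt the LOCKed atomic only when the lock looks free.
        if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;
        // Read-only spin: no cache-line ownership traffic, and PAUSE
        // signals the spin-wait to the core.
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            _mm_pause();
    }
}

void ttas_unlock(ttas_lock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```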

  • ARM64 allows all four reordering types (much weaker than x86-64 TSO). ARM64 assembly for concurrent data structures requires explicit DMB, DSB, and ISB barriers. ARM64 uses LDXR/STXR (Load/Store Exclusive) for CAS operations, not LOCK CMPXCHG.
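
One way to see the difference is that the same portable C11 CAS compiles to a single instruction on x86-64 but a retry loop on ARM64. The sketch below (`cas_long` is an illustrative name) shows the typical codegen in comments; the exact instruction sequences vary by compiler and are shown as an assumption, not a guarantee.

```c
// One portable CAS, two very different instruction sequences.
#include <stdatomic.h>
#include <stdbool.h>

bool cas_long(atomic_long *p, long expected, long desired) {
    return atomic_compare_exchange_strong(p, &expected, desired);
    // Typical x86-64:            Typical ARM64 (pre-v8.1):
    //   mov  rax, expected       retry: ldaxr x8, [x0]     ; load-exclusive
    //   lock cmpxchg [p], des           cmp   x8, x1
    //                                   b.ne  done
    //                                   stlxr w9, x2, [x0] ; store-exclusive
    //                                   cbnz  w9, retry    ; lost exclusivity
}
```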

  • CMPXCHG16B enables ABA-safe lock-free algorithms. By combining a pointer with a version counter in a 128-bit value, each CAS includes both the pointer and the version — preventing a pointer from appearing unchanged after it went through A→B→A transitions. The 16-byte operand must be 16-byte aligned.
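
A minimal sketch of the version-tagging idea (`pack` and `tagged_cas` are illustrative names). For portability this packs a 32-bit index and a 32-bit version counter into one 64-bit word; the 128-bit variant with CMPXCHG16B is the same idea with a full pointer and a 64-bit counter.

```c
// Version-tagged CAS: the swap succeeds only if BOTH the index and the
// version match, and every successful swap bumps the version, so an
// A->B->A index change cannot be mistaken for "unchanged".
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef atomic_uint_fast64_t tagged_t;

static uint64_t pack(uint32_t idx, uint32_t ver) {
    return ((uint64_t)ver << 32) | idx;
}

bool tagged_cas(tagged_t *t, uint32_t old_idx, uint32_t old_ver,
                uint32_t new_idx) {
    uint64_t expect = pack(old_idx, old_ver);
    uint64_t desire = pack(new_idx, old_ver + 1);  // bump the version
    return atomic_compare_exchange_strong(t, &expect, desire);
}
```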

  • False sharing causes cache lines to be invalidated between cores even when the threads write to different variables. Two variables on the same 64-byte cache line are not independent from the hardware's perspective. The fix — padding variables to cache line boundaries — wastes 56 bytes per 8-byte variable but can eliminate a 10–12× performance penalty.
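
The padding fix can be sketched in C11 (struct names are illustrative, assuming a 64-byte cache line): alignas(64) plus explicit padding gives each counter its own line, so writes from different threads no longer invalidate each other.

```c
// Cache-line padding: 8 bytes used, 56 bytes deliberately wasted per
// counter, so per_thread[0] and per_thread[1] land on different lines.
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64

struct padded_counter {
    alignas(CACHE_LINE) uint64_t value;       // the actual counter
    char pad[CACHE_LINE - sizeof(uint64_t)];  // fill the rest of the line
};

struct counters {
    struct padded_counter per_thread[4];      // one cache line per thread
};
```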

  • perf c2c detects false sharing by counting HITM (Hit Modified) events — loads that found their cache line was in Modified state on another core. If perf c2c report shows high HITM rates on a specific address, that data structure needs cache-line padding.

  • Futex provides the fast path for user-space mutexes. When a mutex is uncontended, mutex_lock and mutex_unlock make zero system calls — they are pure CAS operations. The system call overhead only occurs when there is actual contention. This is why pthread_mutex_lock can be fast even though it is conceptually a kernel operation.
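
The fast-path/slow-path split can be sketched as a simplified futex mutex (Linux-only; `fmutex_*` names are mine, and glibc's real implementation is more elaborate — this follows the well-known three-state design from Drepper's "Futexes Are Tricky"): state 0 is free, 1 is held with no waiters, 2 is held with possible waiters. Uncontended lock and unlock are a single CAS or exchange with no system call.

```c
// Simplified futex-backed mutex plus a small contention demo.
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { atomic_int state; } fmutex_t;  // 0 free, 1 held, 2 contended

static long futex(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void fmutex_lock(fmutex_t *m) {
    int c = 0;
    // Fast path: 0 -> 1 with one CAS, zero syscalls when uncontended.
    if (atomic_compare_exchange_strong_explicit(&m->state, &c, 1,
            memory_order_acquire, memory_order_relaxed))
        return;
    // Slow path: mark the lock contended (2) and sleep in the kernel.
    if (c != 2)
        c = atomic_exchange_explicit(&m->state, 2, memory_order_acquire);
    while (c != 0) {
        // The kernel re-checks that state is still 2 before sleeping,
        // so a wakeup between the check and the sleep is never lost.
        futex(&m->state, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange_explicit(&m->state, 2, memory_order_acquire);
    }
}

void fmutex_unlock(fmutex_t *m) {
    // Fast path: if the old state was 1, nobody is waiting (no syscall).
    if (atomic_exchange_explicit(&m->state, 0, memory_order_release) == 2)
        futex(&m->state, FUTEX_WAKE_PRIVATE, 1);
}

// Demo: four threads increment a plain counter under the mutex.
static fmutex_t demo_lock;
static long demo_count;

static void *demo_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        fmutex_lock(&demo_lock);
        demo_count++;                  // protected plain increment
        fmutex_unlock(&demo_lock);
    }
    return NULL;
}

long fmutex_demo(void) {
    pthread_t t[4];
    demo_count = 0;
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, demo_worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return demo_count;
}
```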