In This Chapter
- The Hardware That Does Not Do What You Think
- Memory Visibility: The Core Problem
- The x86-64 Memory Model (Total Store Order)
- Memory Fences
- The LOCK Prefix
- CMPXCHG: Compare and Exchange
- CMPXCHG16B: Double-Width Atomic CAS
- A Mutex Using futex
- The ARM64 Memory Model (Weakly Ordered)
- The ABA Problem
- Spectre: Speculative Execution as Security Vulnerability
- MinOS Kernel Spinlocks for SMP
- Summary
Chapter 30: Concurrency at the Hardware Level
The Hardware That Does Not Do What You Think
In a single-CPU world, concurrency is simple: only one instruction executes at a time, and if you disable interrupts, you have a perfectly ordered execution. In a multi-CPU world — which describes every server, laptop, and smartphone made in the last two decades — the rules change in ways that contradict every intuition you developed writing single-threaded code.
Two CPUs writing to different variables, apparently independent, can corrupt each other's data. Reading a value you just wrote can return a stale result on the same CPU. A mutex that looks correct in a high-level language can be broken in assembly. These failures are not bugs in your logic — they are the hardware behaving exactly as specified. Understanding the specification is the only way to write correct concurrent code.
Memory Visibility: The Core Problem
Consider this code running on two CPUs simultaneously:
CPU 0:              CPU 1:
store [x], 1        store [y], 1
load rax, [y]       load rbx, [x]
Both CPUs initialize x = 0, y = 0. After both sequences complete, what can rax and rbx contain?
The intuitive answer: at least one of {rax=1, rbx=1} must be true — either CPU 0 read CPU 1's write to y, or CPU 1 read CPU 0's write to x, or both. It seems impossible for both to read 0.
On x86-64, both CPUs can read 0. This is not a bug. It is the specified behavior of the x86-64 memory model.
The reason: processors have store buffers. When a CPU executes a store instruction, the value goes into a local store buffer before being written to cache (from which it propagates to other CPUs). The store buffer lets the CPU keep executing without waiting for the store to complete. So when CPU 0 loads [y], CPU 1's store to y may still be sitting in CPU 1's store buffer, invisible to everyone else; symmetrically, CPU 0's store to x may be invisible to CPU 1's load. Each CPU's load has effectively moved ahead of its own buffered store.
This is called store-load reordering — the only type of reordering that x86-64 explicitly permits.
The x86-64 Memory Model (Total Store Order)
x86-64 uses the TSO (Total Store Order) memory model, which is among the strongest orderings of any general-purpose ISA. TSO guarantees:
- Stores from a single CPU are globally visible in program order — other CPUs see that CPU's stores in the order they were issued
- Loads from a single CPU are not reordered with each other: if you load A and then B, the loads take effect in that order
- A CPU always sees its own stores (from its store buffer) before seeing anyone else's
The only permitted reordering is Store-Load: a store may become visible to other CPUs only after a later load from the same CPU has already completed. This is exactly the scenario above.
x86-64 TSO reordering summary:
Load → Load: NEVER reordered ✓
Store → Store: NEVER reordered ✓
Load → Store: NEVER reordered ✓
Store → Load: CAN reorder ✗ (store buffer allows this)
Memory Fences
Fences (also called memory barriers) prevent specific reorderings. x86-64 provides three:
MFENCE — Full Memory Fence
MFENCE is the strongest fence: all stores and loads that precede MFENCE in program order complete before any loads or stores after MFENCE are issued.
; Publisher CPU: store data, then store flag
mov [data], 42
mfence ; ensure data store is visible before flag store
mov [flag], 1
; Consumer CPU: load flag, then load data
.wait:
mov rax, [flag]
test rax, rax
jz .wait
mfence ; ensure flag load completes before data load
mov rbx, [data] ; guaranteed to see data=42 if flag=1
On x86-64, TSO already forbids the store-store and load-load reorderings in this pattern, so these fences are strictly redundant in hand-written assembly; they make the required ordering explicit, and the same protocol breaks without barriers on a weakly ordered ISA, or in a compiled language where the compiler itself may reorder.
SFENCE — Store Fence
SFENCE orders all stores: stores before SFENCE become visible to other CPUs before stores after SFENCE.
; After writing non-temporal (MOVNT) stores, flush them with SFENCE
movntdq [buf], xmm0 ; non-temporal store: bypasses cache
movntdq [buf+16], xmm1
sfence ; ensure NT stores are flushed and ordered
mov [ready], 1 ; signal consumer
LFENCE — Load Fence
LFENCE orders all loads (and prevents speculative execution past it in some security contexts):
mov rax, [index]
cmp rax, max_index
jae .out_of_bounds
lfence ; prevent speculative execution past the bounds check
; Without the LFENCE, the CPU could speculatively execute the load below
; with an out-of-bounds RAX before the branch resolves (the Spectre v1
; vulnerability pattern)
mov rbx, [array + rax*8]
🔐 Security Note:
LFENCE is used as a Spectre mitigation. The CPU's branch predictor can cause code past a bounds check to execute speculatively, loading potentially secret memory and leaking it through a cache side channel. LFENCE prevents speculative instructions from accessing memory until all preceding instructions have completed.
The LOCK Prefix
For read-modify-write operations that must be atomic across CPUs, x86-64 provides the LOCK prefix. It makes a read-modify-write memory operation indivisible.
lock add [counter], 1 ; atomic increment
lock xchg [flag], al ; atomic exchange (always locked even without LOCK)
lock cmpxchg [target], rbx ; atomic compare-exchange
lock xadd [value], rax ; atomic fetch-and-add (rax = old value)
lock inc [counter] ; atomic increment
lock dec [counter] ; atomic decrement
lock or [flags], al ; atomic OR
lock and [flags], al ; atomic AND
⚠️ Common Mistake:
XCHG with a memory operand is always atomic, even without the LOCK prefix; the CPU guarantees this. LOCK XCHG is not an error, the prefix is simply redundant. This is why XCHG [lock], al can implement a spinlock without explicitly writing LOCK XCHG.
⚡ Performance Note:
LOCK makes the operation atomic by requiring the CPU to establish exclusive ownership of the cache line containing the memory operand. On a multi-socket system where the cache line is on a different NUMA node, this can take hundreds of cycles: expensive compared to uncontended operations (~1–5 cycles) or even cache misses (~100 cycles). Design lock-free code to minimize LOCK operations on shared data.
CMPXCHG: Compare and Exchange
CMPXCHG is the foundation of nearly every lock-free algorithm. It atomically does:
if [mem] == RAX:
[mem] = src_register
ZF = 1
else:
RAX = [mem]
ZF = 0
With the LOCK prefix, this is an atomic compare-and-swap (CAS):
; Atomic CAS: if [counter] == 0, set it to 1, else get current value in RAX
xor rax, rax ; expected = 0
mov rcx, 1 ; new value = 1
lock cmpxchg [counter], rcx
jz .cas_succeeded ; ZF=1 means we did the swap
; ZF=0: RAX now holds the value that was in [counter] (not 0)
Implementing a Spinlock with CMPXCHG
A spinlock is a mutex that busy-waits (spins) instead of sleeping. For kernel use — where sleeping is impossible in interrupt context — it is the primary synchronization primitive.
; Spinlock implementation using CMPXCHG
; Lock format: 64-bit value at address RDI
; 0 = UNLOCKED
; 1 = LOCKED
; spinlock_acquire: acquire the lock, spinning until available
; RDI = pointer to lock variable
global spinlock_acquire
spinlock_acquire:
mov ecx, 1 ; desired: LOCKED
.try_lock:
xor eax, eax ; expected: UNLOCKED (0)
lock cmpxchg [rdi], ecx
jz .acquired ; ZF=1: was 0, swapped to 1 — we have the lock
; Lock was not free: spin with PAUSE to reduce power and contention
.spin:
pause ; hint to CPU: we are in a spin loop (reduces power usage
; and improves performance when the lock is released)
cmp dword [rdi], 0 ; is lock free yet?
jnz .spin ; still locked: keep checking
jmp .try_lock ; might be free: try CAS again
.acquired:
; Memory fence: ensure we see all writes by the previous lock holder
; On x86-64, the LOCK prefix of CMPXCHG already provides acquire semantics
; No explicit MFENCE needed here (x86-64 LOCK is sequentially consistent)
ret
; spinlock_release: release the lock
; RDI = pointer to lock variable
global spinlock_release
spinlock_release:
; Simple store: on x86-64 this has release semantics because TSO never
; reorders a store with an earlier load or store, which is exactly the
; ordering a lock handoff requires
mov dword [rdi], 0 ; release: set lock to UNLOCKED
ret
; No MFENCE is needed on x86-64. On a weakly ordered ISA such as ARM64,
; the unlock store must be preceded by an explicit release barrier
; (see the ARM64 spinlock later in this chapter).
The PAUSE Instruction
PAUSE is a hint instruction that tells the CPU it is in a spin-wait loop. Without it, the CPU speculatively executes many iterations of the loop ahead; when the lock value finally changes, it detects a memory-order violation and flushes the pipeline, wasting power and stealing memory bandwidth from the CPU that holds the lock. With PAUSE, the CPU inserts a short delay before re-checking, reducing power consumption by roughly 40% in tight spin loops.
CMPXCHG16B: Double-Width Atomic CAS
CMPXCHG16B performs a 128-bit compare-and-swap. The operands are:
- [mem]: 128-bit memory location (must be 16-byte aligned)
- RDX:RAX: expected value (combined 128-bit)
- RCX:RBX: new value (combined 128-bit)
This enables atomic updates of two-word structures — essential for ABA-free lock-free algorithms:
; Atomic 128-bit CAS
; If [rdi] == RDX:RAX, write RCX:RBX to [rdi]
; Set up: RDX:RAX = expected, RCX:RBX = new value
mov rax, expected_low
mov rdx, expected_high
mov rbx, new_low
mov rcx, new_high
lock cmpxchg16b [rdi]
jz .success
; If failed: RDX:RAX = actual value (the CPU loads it)
A Mutex Using futex
For user-space programs, spinning is wasteful. The futex (fast userspace mutex) system call provides an efficient sleep-and-wake mechanism:
; Futex-based mutex (simplified pthreads mutex model)
; States: 0 = unlocked, 1 = locked (no waiters), 2 = locked (waiters sleeping)
SYS_FUTEX equ 202
FUTEX_WAIT equ 0
FUTEX_WAKE equ 1
; mutex_lock: acquire mutex
; RDI = pointer to mutex (int32_t)
mutex_lock:
; Fast path: try CAS 0→1
xor eax, eax ; expected: 0 (unlocked)
mov ecx, 1 ; new: 1 (locked)
lock cmpxchg [rdi], ecx
jz .got_lock ; was 0, now 1 — we have it
; Slow path: need to sleep
; First set state to 2 (locked + waiter)
mov eax, [rdi]
cmp eax, 2
je .sleep ; already 2
mov eax, 1 ; expected: 1
mov ecx, 2 ; new: 2
lock cmpxchg [rdi], ecx
.sleep:
; futex(addr, FUTEX_WAIT, val=2, timeout=NULL, ...)
; Kernel atomically checks [rdi]==2 and sleeps if so
push rdi
mov rax, SYS_FUTEX
; rdi = mutex address (already set)
mov rsi, FUTEX_WAIT
mov rdx, 2 ; expected value (sleep only if still 2)
xor r10, r10 ; timeout = NULL (sleep forever)
xor r8, r8
xor r9, r9
syscall
pop rdi
; Woke up; try to acquire again
; Use CAS to try 0→2 (keep it as "locked+waiters")
.try_again:
xor eax, eax
mov ecx, 2
lock cmpxchg [rdi], ecx
jnz .sleep ; still locked, sleep again
.got_lock:
ret
; mutex_unlock: release mutex
; RDI = pointer to mutex
mutex_unlock:
; Atomic decrement: 1 -> 0 means no waiters; 2 -> 1 means waiters exist
lock dec dword [rdi]
jz .done ; was 1 (locked, no waiters): lock is now 0, nothing to wake
; Was 2 (waiters sleeping): fully release, then wake one waiter
mov dword [rdi], 0
; futex(addr, FUTEX_WAKE, 1, ...)
mov rax, SYS_FUTEX
; rdi = mutex address (already set)
mov rsi, FUTEX_WAKE
mov rdx, 1 ; wake 1 waiter
syscall
.done:
ret
💡 Mental Model: The futex design is elegant: when uncontended (no other thread wants the lock), mutex_lock and mutex_unlock make zero system calls — they are pure user-space atomic operations. Only when there is actually contention (another thread is waiting) does the system call overhead occur. This is why pthreads mutexes are fast in practice despite being implemented in terms of system calls.
The ARM64 Memory Model (Weakly Ordered)
ARM64 allows dramatically more reordering than x86-64:
ARM64 memory ordering (Relaxed model):
Load → Load: CAN reorder ✗
Load → Store: CAN reorder ✗
Store → Store: CAN reorder ✗
Store → Load: CAN reorder ✗
All four reordering types are permitted. ARM64 assembly must use explicit barriers everywhere x86-64 relies on implicit ordering:
// ARM64 spinlock:
// lock: 32-bit word at address in x0
// 0 = UNLOCKED, 1 = LOCKED
spinlock_acquire_arm64:
mov w1, #1 // new value: LOCKED
.try:
ldxr w2, [x0] // Load-Exclusive: load + mark for exclusive access
cbnz w2, .spin // not 0 (already locked): spin
stxr w3, w1, [x0] // Store-Exclusive: store if exclusive access still valid
cbnz w3, .try // w3=1 means store failed (another CPU intervened): retry
// Acquire barrier: ensures loads/stores after this point see
// the values written by the previous lock holder
dmb ish // Data Memory Barrier, inner-shareable
ret
.spin:
wfe // Wait For Event: low-power spin hint (like x86 PAUSE)
b .try
spinlock_release_arm64:
// Release barrier: ensures previous loads/stores are visible before unlock
dmb ish
str wzr, [x0] // wzr = zero register; store 0 = UNLOCKED
sev // Send Event: wake cores waiting on WFE
ret
The LDXR/STXR pair (Load Exclusive / Store Exclusive) is ARM64's equivalent of LOCK CMPXCHG. The exclusive monitor in the CPU tracks whether another CPU has written to the address between the LDXR and STXR; if so, STXR fails (returns 1) and you retry.
Memory Model Comparison Table
| Operation | x86-64 (TSO) | ARM64 (Relaxed) | RISC-V (Relaxed) |
|---|---|---|---|
| Load→Load reorder | No | Yes | Yes |
| Load→Store reorder | No | Yes | Yes |
| Store→Store reorder | No | Yes | Yes |
| Store→Load reorder | Yes | Yes | Yes |
| Atomic read-modify-write | LOCK prefix | LDXR/STXR | LR/SC |
| Full fence | MFENCE | DMB ISH | FENCE |
| Acquire barrier | implicit (plain loads) | DMB ISHLD | FENCE R,RW |
| Release barrier | implicit (plain stores) | DMB ISHST | FENCE RW,W |
The ABA Problem
A common pitfall in CAS-based lock-free code:
- Thread 1 reads [ptr] = A, intends to CAS A → C
- Thread 1 is preempted
- Thread 2 changes [ptr] from A → B → A (ABA)
- Thread 1 resumes, CAS succeeds (sees A), sets C — but the state is wrong
The classic fix: use a tagged pointer — combine the pointer with a version counter in one 128-bit value, use CMPXCHG16B. The version counter increments on every change, making ABA impossible (you would need the pointer AND the counter to match).
; Tagged pointer head for a lock-free stack
; struct { void *ptr; uint64_t version; } (16 bytes total)
; Push to lock-free stack (ABA-safe):
push_stack:
; RDI = stack head address (16-byte aligned)
; RSI = new node to push
push rbx ; RBX is callee-saved in the SysV ABI, and CMPXCHG16B uses it
.retry:
; Load current head into RDX:RAX
mov rax, [rdi] ; ptr
mov rdx, [rdi + 8] ; version
; Set new node's next pointer to current head
mov [rsi], rax ; new_node->next = head.ptr
; New head: ptr=new_node, version=current_version+1
mov rbx, rsi ; new ptr
lea rcx, [rdx + 1] ; new version = old + 1
lock cmpxchg16b [rdi]
jnz .retry ; CAS failed: head changed, retry
pop rbx ; restore callee-saved RBX
ret
Spectre: Speculative Execution as Security Vulnerability
The connection between concurrency and security: speculative execution lets the CPU race ahead of an architectural check:
Spectre v1 pattern:
1. CPU speculatively executes past a bounds check
2. Speculative load accesses out-of-bounds memory (potentially secret)
3. That access brings a cache line into the cache
4. Bounds check completes: speculation was wrong, CPU rolls back
5. But the cache state change is NOT rolled back
6. Attacker measures cache timing to infer the secret
Mitigation:
cmp rax, array_size
jae .out_of_bounds
lfence ; LFENCE prevents speculation past this point
mov rbx, [array + rax]
The LFENCE instruction serializes the instruction stream: nothing after LFENCE executes speculatively until all prior instructions have completed their memory accesses. This is expensive (serializes the pipeline) but correct.
MinOS Kernel Spinlocks for SMP
When MinOS runs on multiple CPUs, the kernel's shared data structures (the IDT, the PMM bitmap, the process table) need protection:
; minOS/kernel/spinlock.asm — Kernel spinlocks for SMP
; Global kernel lock (big kernel lock pattern, for simplicity)
section .data
kernel_lock: dq 0
; Before modifying any kernel structure:
lea rdi, [kernel_lock]
call spinlock_acquire
; ... modify shared kernel data ...
lea rdi, [kernel_lock]
call spinlock_release
The Big Kernel Lock (BKL) pattern — one global lock for all kernel operations — is how early Linux handled SMP. It is correct but eliminates all kernel parallelism. Real kernels use fine-grained locking (per-data-structure spinlocks), per-CPU data (no locking needed if only one CPU accesses it), and lock-free algorithms.
Summary
The hardware memory model defines which reorderings are possible without fences. x86-64 (TSO) only allows store-load reordering; ARM64 allows all four. MFENCE, SFENCE, and LFENCE enforce ordering; the LOCK prefix makes read-modify-write operations atomic. CMPXCHG is the universal atomic primitive underlying mutexes, spinlocks, and lock-free data structures. Understanding the hardware model is mandatory for writing correct concurrent code in any language.
🔄 Check Your Understanding:
1. What is the one type of memory reordering that x86-64 TSO permits?
2. Why is PAUSE important inside a spin loop?
3. What does LOCK CMPXCHG do if the comparison fails?
4. How does the ARM64 LDXR/STXR pair differ from x86-64's LOCK CMPXCHG?
5. What is false sharing, and how does cache-line padding fix it?