In This Chapter
- The Hardware That Does Not Do What You Think
- Memory Visibility: The Core Problem
- The x86-64 Memory Model (Total Store Order)
- Memory Fences
- The LOCK Prefix
- CMPXCHG: Compare and Exchange
- CMPXCHG16B: Double-Width Atomic CAS
- A Mutex Using futex
- The ARM64 Memory Model (Weakly Ordered)
- The ABA Problem
- Spectre: Speculative Execution as Security Vulnerability
- MinOS Kernel Spinlocks for SMP
- Summary
Chapter 30: Concurrency at the Hardware Level
The Hardware That Does Not Do What You Think
In a single-CPU world, concurrency is simple: only one instruction executes at a time, and if you disable interrupts, you have a perfectly ordered execution. In a multi-CPU world — which describes every server, laptop, and smartphone made in the last two decades — the rules change in ways that contradict every intuition you developed writing single-threaded code.
Two CPUs writing to different variables, apparently independent, can corrupt each other's data. Reading a value you just wrote can return a stale result on the same CPU. A mutex that looks correct in a high-level language can be broken in assembly. These failures are not bugs in your logic — they are the hardware behaving exactly as specified. Understanding the specification is the only way to write correct concurrent code.
Memory Visibility: The Core Problem
Consider this code running on two CPUs simultaneously:
CPU 0:              CPU 1:
store [x], 1        store [y], 1
load rax, [y]       load rbx, [x]
Both CPUs initialize x = 0, y = 0. After both sequences complete, what can rax and rbx contain?
The intuitive answer: at least one of {rax=1, rbx=1} must be true — either CPU 0 read CPU 1's write to y, or CPU 1 read CPU 0's write to x, or both. It seems impossible for both to read 0.
On x86-64, both CPUs can read 0. This is not a bug. It is the specified behavior of the x86-64 memory model.
The reason: processors have store buffers. When a CPU executes a store instruction, the value goes into a local store buffer before being written to cache (from which it propagates to other CPUs). The store buffer lets the CPU keep executing without waiting for the store to complete. So when CPU 0 loads [y], CPU 1's store to y may still be sitting in CPU 1's store buffer, invisible to everyone else; symmetrically, CPU 0's store to x may be invisible to CPU 1's load. Each CPU's load has effectively moved ahead of its own buffered store.
This is called store-load reordering — the only type of reordering that x86-64 explicitly permits.
The x86-64 Memory Model (Total Store Order)
x86-64 uses the TSO (Total Store Order) memory model, which is among the strongest orderings of any general-purpose ISA. TSO guarantees:
- Stores from a single CPU are globally visible in program order — other CPUs see that CPU's stores in the order they were issued
- Loads from a single CPU are not reordered with each other: if you load A and then B, the loads take effect in that order
- A CPU always sees its own stores (from its store buffer) before seeing anyone else's
The only permitted reordering is Store-Load: a store may become visible to other CPUs only after a later load from the same CPU has already completed. This is exactly the scenario above.
x86-64 TSO reordering summary:
Load → Load: NEVER reordered ✓
Store → Store: NEVER reordered ✓
Load → Store: NEVER reordered ✓
Store → Load: CAN reorder ✗ (store buffer allows this)
Memory Fences
Fences (also called memory barriers) prevent specific reorderings. x86-64 provides three:
MFENCE — Full Memory Fence
MFENCE is the strongest fence: all stores and loads that precede MFENCE in program order complete before any loads or stores after MFENCE are issued.
; Publisher CPU: store data, then store flag
mov [data], 42
mfence ; ensure data store is visible before flag store
mov [flag], 1
; Consumer CPU: load flag, then load data
.wait:
mov rax, [flag]
test rax, rax
jz .wait
mfence ; ensure flag load completes before data load
mov rbx, [data] ; guaranteed to see data=42 if flag=1
On x86-64, TSO already forbids the store-store and load-load reorderings in this pattern, so these fences are strictly redundant in hand-written assembly; they make the required ordering explicit, and the same protocol breaks without barriers on a weakly ordered ISA, or in a compiled language where the compiler itself may reorder.
SFENCE — Store Fence
SFENCE orders all stores: stores before SFENCE become visible to other CPUs before stores after SFENCE.
; After writing non-temporal (MOVNT) stores, flush them with SFENCE
movntdq [buf], xmm0 ; non-temporal store: bypasses cache
movntdq [buf+16], xmm1
sfence ; ensure NT stores are flushed and ordered
mov [ready], 1 ; signal consumer
LFENCE — Load Fence
LFENCE orders all loads (and prevents speculative execution past it in some security contexts):
mov rax, [index]
cmp rax, max_index
jae .out_of_bounds
lfence ; prevent speculative execution past the bounds check
; Without the LFENCE, the CPU could speculatively execute the load below
; with an out-of-bounds RAX before the branch resolves (the Spectre v1
; vulnerability pattern)
mov rbx, [array + rax*8]
🔐 Security Note:
LFENCE is used as a Spectre mitigation. The CPU's branch predictor can cause code past a bounds check to execute speculatively, loading potentially secret memory and leaking it through a cache side channel. LFENCE prevents speculative instructions from accessing memory until all preceding instructions have completed.
The LOCK Prefix
For read-modify-write operations that must be atomic across CPUs, x86-64 provides the LOCK prefix. It makes a read-modify-write memory operation indivisible.
lock add [counter], 1 ; atomic increment
lock xchg [flag], al ; atomic exchange (always locked even without LOCK)
lock cmpxchg [target], rbx ; atomic compare-exchange
lock xadd [value], rax ; atomic fetch-and-add (rax = old value)
lock inc [counter] ; atomic increment
lock dec [counter] ; atomic decrement
lock or [flags], al ; atomic OR
lock and [flags], al ; atomic AND
⚠️ Common Mistake:
XCHG with a memory operand is always atomic, even without the LOCK prefix; the CPU guarantees this. LOCK XCHG is not an error, the prefix is simply redundant. This is why XCHG [lock], al can implement a spinlock without explicitly writing LOCK XCHG.
⚡ Performance Note:
LOCK makes the operation atomic by requiring the CPU to establish exclusive ownership of the cache line containing the memory operand. On a multi-socket system where the cache line is on a different NUMA node, this can take hundreds of cycles: expensive compared to uncontended operations (~1–5 cycles) or even cache misses (~100 cycles). Design lock-free code to minimize LOCK operations on shared data.
CMPXCHG: Compare and Exchange
CMPXCHG is the foundation of nearly every lock-free algorithm. It atomically does:
if [mem] == RAX:
[mem] = src_register
ZF = 1
else:
RAX = [mem]
ZF = 0
With the LOCK prefix, this is an atomic compare-and-swap (CAS):
; Atomic CAS: if [counter] == 0, set it to 1, else get current value in RAX
xor rax, rax ; expected = 0
mov rcx, 1 ; new value = 1
lock cmpxchg [counter], rcx
jz .cas_succeeded ; ZF=1 means we did the swap
; ZF=0: RAX now holds the value that was in [counter] (not 0)
Implementing a Spinlock with CMPXCHG
A spinlock is a mutex that busy-waits (spins) instead of sleeping. For kernel use — where sleeping is impossible in interrupt context — it is the primary synchronization primitive.
; Spinlock implementation using CMPXCHG
; Lock format: 64-bit value at address RDI
; 0 = UNLOCKED
; 1 = LOCKED
; spinlock_acquire: acquire the lock, spinning until available
; RDI = pointer to lock variable
global spinlock_acquire
spinlock_acquire:
mov ecx, 1 ; desired: LOCKED
.try_lock:
xor eax, eax ; expected: UNLOCKED (0)
lock cmpxchg [rdi], ecx
jz .acquired ; ZF=1: was 0, swapped to 1 — we have the lock
; Lock was not free: spin with PAUSE to reduce power and contention
.spin:
pause ; hint to CPU: we are in a spin loop (reduces power usage
; and improves performance when the lock is released)
cmp dword [rdi], 0 ; is lock free yet?
jnz .spin ; still locked: keep checking
jmp .try_lock ; might be free: try CAS again
.acquired:
; Memory fence: ensure we see all writes by the previous lock holder
; On x86-64, the LOCK prefix of CMPXCHG already provides acquire semantics
; No explicit MFENCE needed here (x86-64 LOCK is sequentially consistent)
ret
; spinlock_release: release the lock
; RDI = pointer to lock variable
global spinlock_release
spinlock_release:
; Simple store: on x86-64 this has release semantics because TSO never
; reorders a store with an earlier load or store, which is exactly the
; ordering a lock handoff requires
mov dword [rdi], 0 ; release: set lock to UNLOCKED
ret
; No MFENCE is needed on x86-64. On a weakly ordered ISA such as ARM64,
; the unlock store must be preceded by an explicit release barrier
; (see the ARM64 spinlock later in this chapter).
The PAUSE Instruction
PAUSE is a hint instruction that tells the CPU it is in a spin-wait loop. Without it, the CPU speculatively executes many iterations of the loop ahead; when the lock value finally changes, it detects a memory-order violation and flushes the pipeline, wasting power and stealing memory bandwidth from the CPU that holds the lock. With PAUSE, the CPU inserts a short delay before re-checking, reducing power consumption by roughly 40% in tight spin loops.
CMPXCHG16B: Double-Width Atomic CAS
CMPXCHG16B performs a 128-bit compare-and-swap. The operands are:
- [mem]: 128-bit memory location (must be 16-byte aligned)
- RDX:RAX: expected value (combined 128-bit)
- RCX:RBX: new value (combined 128-bit)
This enables atomic updates of two-word structures — essential for ABA-free lock-free algorithms:
; Atomic 128-bit CAS
; If [rdi] == RDX:RAX, write RCX:RBX to [rdi]
; Set up: RDX:RAX = expected, RCX:RBX = new value
mov rax, expected_low
mov rdx, expected_high
mov rbx, new_low
mov rcx, new_high
lock cmpxchg16b [rdi]
jz .success
; If failed: RDX:RAX = actual value (the CPU loads it)
A Mutex Using futex
For user-space programs, spinning is wasteful. The futex (fast userspace mutex) system call provides an efficient sleep-and-wake mechanism:
; Futex-based mutex (simplified pthreads mutex model)
; States: 0 = unlocked, 1 = locked (no waiters), 2 = locked (waiters sleeping)
SYS_FUTEX equ 202
FUTEX_WAIT equ 0
FUTEX_WAKE equ 1
; mutex_lock: acquire mutex
; RDI = pointer to mutex (int32_t)
mutex_lock:
; Fast path: try CAS 0→1
xor eax, eax ; expected: 0 (unlocked)
mov ecx, 1 ; new: 1 (locked)
lock cmpxchg [rdi], ecx
jz .got_lock ; was 0, now 1 — we have it
; Slow path: need to sleep
; First set state to 2 (locked + waiter)
mov eax, [rdi]
cmp eax, 2
je .sleep ; already 2
mov eax, 1 ; expected: 1
mov ecx, 2 ; new: 2
lock cmpxchg [rdi], ecx
.sleep:
; futex(addr, FUTEX_WAIT, val=2, timeout=NULL, ...)
; Kernel atomically checks [rdi]==2 and sleeps if so
push rdi
mov rax, SYS_FUTEX
; rdi = mutex address (already set)
mov rsi, FUTEX_WAIT
mov rdx, 2 ; expected value (sleep only if still 2)
xor r10, r10 ; timeout = NULL (sleep forever)
xor r8, r8
xor r9, r9
syscall
pop rdi
; Woke up; try to acquire again
; Use CAS to try 0→2 (keep it as "locked+waiters")
.try_again:
xor eax, eax
mov ecx, 2
lock cmpxchg [rdi], ecx
jnz .sleep ; still locked, sleep again
.got_lock:
ret
; mutex_unlock: release mutex
; RDI = pointer to mutex
mutex_unlock:
; Atomic decrement: 1 -> 0 means no waiters; 2 -> 1 means waiters exist
lock dec dword [rdi]
jz .done ; was 1 (locked, no waiters): lock is now 0, nothing to wake
; Was 2 (waiters sleeping): fully release, then wake one waiter
mov dword [rdi], 0
; futex(addr, FUTEX_WAKE, 1, ...)
mov rax, SYS_FUTEX
; rdi = mutex address (already set)
mov rsi, FUTEX_WAKE
mov rdx, 1 ; wake 1 waiter
syscall
.done:
ret
💡 Mental Model: The futex design is elegant: when uncontended (no other thread wants the lock), mutex_lock and mutex_unlock make zero system calls — they are pure user-space atomic operations. Only when there is actually contention (another thread is waiting) does the system call overhead occur. This is why pthreads mutexes are fast in practice despite being implemented in terms of system calls.
The ARM64 Memory Model (Weakly Ordered)
ARM64 allows dramatically more reordering than x86-64:
ARM64 memory ordering (Relaxed model):
Load → Load: CAN reorder ✗
Load → Store: CAN reorder ✗
Store → Store: CAN reorder ✗
Store → Load: CAN reorder ✗
All four reordering types are permitted. ARM64 assembly must use explicit barriers everywhere x86-64 relies on implicit ordering:
// ARM64 spinlock:
// lock: 32-bit word at address in x0
// 0 = UNLOCKED, 1 = LOCKED
spinlock_acquire_arm64:
mov w1, #1 // new value: LOCKED
.try:
ldxr w2, [x0] // Load-Exclusive: load + mark for exclusive access
cbnz w2, .spin // not 0 (already locked): spin
stxr w3, w1, [x0] // Store-Exclusive: store if exclusive access still valid
cbnz w3, .try // w3=1 means store failed (another CPU intervened): retry
// Acquire barrier: ensures loads/stores after this point see
// the values written by the previous lock holder
dmb ish // Data Memory Barrier, inner-shareable
ret
.spin:
wfe // Wait For Event: low-power spin hint (like x86 PAUSE)
b .try
spinlock_release_arm64:
// Release barrier: ensures previous loads/stores are visible before unlock
dmb ish
str wzr, [x0] // wzr = zero register; store 0 = UNLOCKED
sev // Send Event: wake cores waiting on WFE
ret
The LDXR/STXR pair (Load Exclusive / Store Exclusive) is ARM64's equivalent of LOCK CMPXCHG. The exclusive monitor in the CPU tracks whether another CPU has written to the address between the LDXR and STXR; if so, STXR fails (returns 1) and you retry.
Memory Model Comparison Table
| Operation | x86-64 (TSO) | ARM64 (Relaxed) | RISC-V (Relaxed) |
|---|---|---|---|
| Load→Load reorder | No | Yes | Yes |
| Load→Store reorder | No | Yes | Yes |
| Store→Store reorder | No | Yes | Yes |
| Store→Load reorder | Yes | Yes | Yes |
| Atomic read-modify-write | LOCK prefix | LDXR/STXR | LR/SC |
| Full fence | MFENCE | DMB ISH | FENCE |
| Acquire barrier | implicit (plain loads) | DMB ISHLD | FENCE R,RW |
| Release barrier | implicit (plain stores) | DMB ISHST | FENCE RW,W |
The ABA Problem
A common pitfall in CAS-based lock-free code:
- Thread 1 reads [ptr] = A, intends to CAS A → C
- Thread 1 is preempted
- Thread 2 changes [ptr] from A → B → A (ABA)
- Thread 1 resumes, CAS succeeds (sees A), sets C — but the state is wrong
The classic fix: use a tagged pointer — combine the pointer with a version counter in one 128-bit value, use CMPXCHG16B. The version counter increments on every change, making ABA impossible (you would need the pointer AND the counter to match).
; Tagged pointer head for a lock-free stack
; struct { void *ptr; uint64_t version; } (16 bytes total)
; Push to lock-free stack (ABA-safe):
push_stack:
; RDI = stack head address (16-byte aligned)
; RSI = new node to push
push rbx ; RBX is callee-saved in the SysV ABI, and CMPXCHG16B uses it
.retry:
; Load current head into RDX:RAX
mov rax, [rdi] ; ptr
mov rdx, [rdi + 8] ; version
; Set new node's next pointer to current head
mov [rsi], rax ; new_node->next = head.ptr
; New head: ptr=new_node, version=current_version+1
mov rbx, rsi ; new ptr
lea rcx, [rdx + 1] ; new version = old + 1
lock cmpxchg16b [rdi]
jnz .retry ; CAS failed: head changed, retry
pop rbx ; restore callee-saved RBX
ret
Spectre: Speculative Execution as Security Vulnerability
The connection between concurrency and security: speculative execution lets the CPU race ahead of an architectural check:
Spectre v1 pattern:
1. CPU speculatively executes past a bounds check
2. Speculative load accesses out-of-bounds memory (potentially secret)
3. That access brings a cache line into the cache
4. Bounds check completes: speculation was wrong, CPU rolls back
5. But the cache state change is NOT rolled back
6. Attacker measures cache timing to infer the secret
Mitigation:
cmp rax, array_size
jae .out_of_bounds
lfence ; LFENCE prevents speculation past this point
mov rbx, [array + rax]
The LFENCE instruction serializes the instruction stream: nothing after LFENCE executes speculatively until all prior instructions have completed their memory accesses. This is expensive (serializes the pipeline) but correct.
MinOS Kernel Spinlocks for SMP
When MinOS runs on multiple CPUs, the kernel's shared data structures (the IDT, the PMM bitmap, the process table) need protection:
; minOS/kernel/spinlock.asm — Kernel spinlocks for SMP
; Global kernel lock (big kernel lock pattern, for simplicity)
section .data
kernel_lock: dq 0
; Before modifying any kernel structure:
lea rdi, [kernel_lock]
call spinlock_acquire
; ... modify shared kernel data ...
lea rdi, [kernel_lock]
call spinlock_release
The Big Kernel Lock (BKL) pattern — one global lock for all kernel operations — is how early Linux handled SMP. It is correct but eliminates all kernel parallelism. Real kernels use fine-grained locking (per-data-structure spinlocks), per-CPU data (no locking needed if only one CPU accesses it), and lock-free algorithms.
Summary
The hardware memory model defines which reorderings are possible without fences. x86-64 (TSO) only allows store-load reordering; ARM64 allows all four. MFENCE, SFENCE, and LFENCE enforce ordering; the LOCK prefix makes read-modify-write operations atomic. CMPXCHG is the universal atomic primitive underlying mutexes, spinlocks, and lock-free data structures. Understanding the hardware model is mandatory for writing correct concurrent code in any language.
🔄 Check Your Understanding:
1. What is the one type of memory reordering that x86-64 TSO permits?
2. Why is PAUSE important inside a spin loop?
3. What does LOCK CMPXCHG do if the comparison fails?
4. How does the ARM64 LDXR/STXR pair differ from x86-64's LOCK CMPXCHG?
5. What is false sharing, and how does cache-line padding fix it?