Chapter 30 Quiz: Concurrency at the Hardware Level

1. What is the only type of memory reordering that the x86-64 TSO memory model explicitly permits?

A) Load-Load reordering B) Store-Store reordering C) Store-Load reordering D) Load-Store reordering

Answer: C — x86-64 TSO forbids the other three reorderings but permits Store-Load: a store to address A may become visible to other CPUs only after a subsequent load from address B has already executed, because the store can sit in the CPU's store buffer while the load is served from cache.


2. What does MFENCE guarantee?

A) All stores before MFENCE are committed to DRAM B) All loads and stores before MFENCE are ordered before all loads and stores after MFENCE C) Only stores are ordered; loads may still be reordered D) The instruction cache is flushed

Answer: B — MFENCE is a full memory fence. Every load and store issued before MFENCE in program order becomes globally visible before any load or store issued after MFENCE executes. This prevents all four reordering types.


3. Why is the PAUSE instruction important inside a spinlock spin loop?

A) It delays for exactly 1000 cycles, reducing polling frequency B) It hints to the CPU that it is in a spin loop, reducing power consumption and contention, and preventing pipeline misprediction penalties when the lock is released C) It flushes the cache, ensuring the CPU sees the latest value of the lock D) It yields the CPU to other threads in the spin queue

Answer: B — PAUSE hints to the processor that the code is in a spin-wait loop. Without it, the spinning core speculatively executes many loop iterations ahead and suffers a memory-order-violation pipeline flush when the lock word finally changes; the tight polling loop also wastes power and execution resources that an SMT sibling on the same physical core could use. PAUSE inserts a short delay (roughly 10 cycles on older Intel microarchitectures, on the order of 140 cycles since Skylake) that avoids both penalties.


4. LOCK CMPXCHG [mem], reg does what if the comparison fails (i.e., [mem] ≠ RAX)?

A) Stores the new value anyway and sets CF=1 B) Leaves [mem] unchanged, loads [mem] into RAX, and clears ZF C) Retries the comparison automatically in hardware D) Generates a #GP fault

Answer: B — On CMPXCHG failure: the memory [mem] is NOT modified, RAX is updated with the current value of [mem], and ZF is cleared. This allows the caller to see the actual current value and retry with updated expectations.


5. Why is XCHG [mem], reg always atomic, even without the LOCK prefix?

A) The CPU's cache coherence protocol makes all word exchanges atomic B) The XCHG instruction has an implicit LOCK prefix when used with a memory operand C) Exchanges are single-instruction and cannot be interrupted D) The assembler adds LOCK automatically when it sees XCHG

Answer: B — The x86-64 architecture specification explicitly states that XCHG with a memory operand always asserts the LOCK signal, making it atomic. This is a special case — other read-modify-write instructions (ADD, OR, INC, etc.) require an explicit LOCK prefix for atomicity.


6. What is false sharing?

A) Two threads sharing a mutex that only one actually needs B) Two threads accessing different variables that happen to be on the same cache line, causing cache coherence traffic between cores C) A CPU incorrectly sharing a cache line with another CPU due to a TLB error D) Two processes mapping the same shared memory but using incompatible synchronization

Answer: B — False sharing occurs when two (or more) cores independently modify different variables that happen to be on the same 64-byte cache line. Each modification invalidates the cache line on all other cores, forcing them to re-fetch it, even though they never access each other's variable. The fix: pad variables to cache-line boundaries.


7. In the ARM64 memory model, which of the following reorderings can the hardware perform?

A) Load-Load only B) Store-Store only C) Store-Load only (same as x86-64) D) All four: Load-Load, Load-Store, Store-Store, and Store-Load

Answer: D — ARM64 uses a relaxed memory model that permits all four reordering types. This is why ARM64 code that uses shared data must explicitly insert DMB or DSB barriers in locations where x86-64 code requires none. Writing ARM64 concurrent code without understanding this is a common source of subtle bugs.


8. What does LOCK XADD [mem], rax do?

A) Atomically adds RAX to [mem] and discards the old value B) Atomically swaps [mem] and RAX, then stores their sum in [mem]; RAX = old [mem] C) Atomically loads [mem] into RAX, then adds 1 to [mem] D) Atomically compares [mem] and RAX, adds if equal

Answer: B — XADD (Exchange and Add): first it swaps [mem] with RAX (RAX gets the old value of [mem]), then [mem] gets the sum of the original [mem] and the original RAX. With LOCK, this is atomic. This is how __sync_fetch_and_add is implemented on x86-64.


9. The futex-based mutex is "fast" because:

A) It uses fewer CPU instructions than a spinlock B) When there is no contention, it makes zero system calls — all operations are user-space CAS C) The kernel optimizes futex wait/wake better than regular sleep/wakeup D) Futexes bypass the scheduler entirely

Answer: B — The key insight of futex: the mutex_lock fast path is just a CAS (user-space only). If the CAS succeeds (lock was free), no kernel call is made. Only on the slow path (lock is contended) does the program call futex(FUTEX_WAIT) to sleep. Most mutex operations in well-designed programs are uncontended, so the fast path dominates.


10. What is the ABA problem in CAS-based lock-free algorithms?

A) A deadlock between two CAS operations competing for the same memory B) A CAS operation succeeds even though the value went from A to B back to A between the read and the CAS, hiding an invalid intermediate state C) Aligned/unaligned memory access causing CAS to fail spuriously D) A performance problem where CAS retries cause excessive bus traffic

Answer: B — ABA: Thread 1 reads value A, is preempted. Thread 2 changes A→B→A. Thread 1 resumes, CAS(A→newval) succeeds because [mem]=A again — but the state is actually different (a different object occupies the same address, or the structure was modified during A→B→A). The fix: pair the value with a monotonically increasing version counter, so a recycled A carries a different tag (on x86-64, CMPXCHG16B can CAS a pointer and its counter together).


11. What is CMPXCHG16B used for?

A) Performing a CAS on two separate 8-byte values simultaneously B) Comparing and swapping a 16-byte (128-bit) value atomically, typically to implement tagged pointers or ABA-safe structures C) Extending the CMPXCHG instruction to work on 64-bit addresses D) Comparing 16 values simultaneously in a SIMD operation

Answer: B — CMPXCHG16B atomically compares and swaps 128 bits. The 16-byte value is held in RDX:RAX (expected) and RCX:RBX (new value). This enables tagged-pointer structures — e.g., one 64-bit half holding a pointer and the other a version counter — preventing ABA problems.


12. In the ARM64 spinlock, why is WFE (Wait For Event) used instead of a simple retry loop?

A) WFE is faster than a memory read in all cases B) WFE puts the CPU in a low-power state until another core executes SEV (Send Event), reducing power consumption and bus traffic C) WFE is required by the ARM64 memory model for spinlock correctness D) WFE automatically retries the LDXR when the event fires

Answer: B — WFE is ARM64's equivalent of x86-64's PAUSE but with an event mechanism. The CPU enters a low-power state. When the lock-holder executes SEV (in spinlock_release), all CPUs waiting on WFE wake up and retry. This reduces power consumption compared to a busy loop and reduces cache coherence traffic.


13. On x86-64, does the lock-release function (which stores 0 to the lock word) require an MFENCE?

A) Yes, always required to prevent store-load reordering in the unlock path B) No, because x86-64 TSO guarantees that stores are visible in program order; a plain store provides "release" semantics on x86-64 C) Yes, MFENCE is required to prevent the compiler from reordering the unlock D) No, because the LOCK CMPXCHG in acquire already acts as a full fence

Answer: B — On x86-64 TSO, a plain store provides release semantics: all prior stores are visible before this store. No MFENCE is needed for correctness. However, a compiler barrier (or __asm__ volatile("" ::: "memory") in GCC) is needed to prevent the compiler from reordering the store relative to other operations. MFENCE would also work but is unnecessarily expensive.


14. Why does false sharing disappear with cache-line padding?

A) Padding forces the hardware to use different cache sets for the two variables B) Padding puts the two variables on different 64-byte cache lines, so each core can hold and modify its line independently without invalidating the other C) Padding prevents the two variables from being prefetched together D) Padding triggers NUMA-aware memory allocation

Answer: B — Cache coherence operates at cache-line granularity. If two frequently-written variables share a 64-byte cache line, every write to either one invalidates the entire line on all other cores. With padding, each variable occupies its own cache line, and writes to one do not affect the other's cache state.


15. What does LFENCE prevent in the context of Spectre mitigation?

A) Load reordering between two cacheable memory regions B) The CPU from speculatively executing loads that follow LFENCE until all preceding instructions have completed their memory accesses C) The compiler from reordering loads with stores in the source code D) Cache prefetch speculation beyond a security boundary

Answer: B — In Spectre v1, the CPU speculatively executes a load past a mispredicted bounds check. LFENCE does not execute until all preceding instructions have completed locally, and no later instruction begins executing until LFENCE completes; placed after the bounds check, it stops the dependent load from being issued speculatively, so an out-of-bounds access cannot leave a trace in the cache.