Chapter 30 Exercises: Concurrency at the Hardware Level

Section A: Memory Ordering

1. The Store-Load Reordering Test

Implement the Dekker-style reordering test from the chapter introduction:

// Two threads, x=0 y=0 initially
// Thread 0: store x=1, then load rax=y
// Thread 1: store y=1, then load rbx=x
// Is it possible for both rax=0 AND rbx=0?

Write this using pthread_create for the threading infrastructure, but assembly for the critical section. Run it in a loop 1,000,000 times and count how often both results are 0. Then add MFENCE between the store and the load in both threads. Does the 0,0 outcome disappear?

2. Memory Fence Overhead Benchmark

Write a benchmark that measures the cost of each fence instruction:

- Baseline: empty loop, 1 billion iterations
- MFENCE loop: insert MFENCE in the loop body
- SFENCE loop: insert SFENCE
- LFENCE loop: insert LFENCE

Use RDTSC for precise cycle counting. Report cycles per fence instruction. Are the fences roughly equal cost, or are there significant differences?
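A minimal sketch of the measurement core, assuming x86-64 and GCC inline asm (the function names and the enum are ours). With a constant fence argument, -O2 specializes the switch away, so the loop body is just the fence:

```c
#include <stdint.h>

typedef enum { F_NONE, F_MFENCE, F_SFENCE, F_LFENCE } fence_t;

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

// Returns average cycles per loop iteration for the given fence kind.
double bench_fence(fence_t f, long iters) {
    uint64_t t0 = rdtsc();
    for (long i = 0; i < iters; i++) {
        switch (f) {
        case F_MFENCE: __asm__ volatile("mfence" ::: "memory"); break;
        case F_SFENCE: __asm__ volatile("sfence" ::: "memory"); break;
        case F_LFENCE: __asm__ volatile("lfence" ::: "memory"); break;
        default:       __asm__ volatile(""       ::: "memory"); break;
        }
    }
    uint64_t t1 = rdtsc();
    return (double)(t1 - t0) / (double)iters;
}
```

Subtract the baseline from each fence measurement to get cycles per fence. A serializing wrapper (CPUID or RDTSCP) would tighten the numbers further, but over millions of iterations the loop overhead dominates any edge effects.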

3. x86-64 vs ARM64 Reordering

Given this producer-consumer pattern:

; Producer:           Consumer:
    mov [data], 42       .wait: cmp [flag], 0
    mov [flag], 1               jz .wait
                                mov rax, [data]

a) On x86-64, is MFENCE required between the producer's stores? Between the consumer's loads? Justify your answer using the TSO ordering rules.

b) Write the equivalent on ARM64. Where must DMB instructions be added?

c) Using C11 atomic operations, write the equivalent with explicit memory_order_acquire and memory_order_release. What assembly does GCC generate for each?
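For part (c), the C11 version might look like the sketch below (the harness function consume_once is ours). On x86-64 both atomics compile to plain MOV; on ARM64 the store becomes STLR and the load LDAR:

```c
#include <pthread.h>
#include <stdatomic.h>

static int data;                 // plain variable, ordered by the flag
static atomic_int flag;

static void *producer(void *arg) {
    data = 42;                                             // ordinary store
    atomic_store_explicit(&flag, 1, memory_order_release); // publish
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                  // spin until published
    return (void *)(long)data;   // acquire/release guarantees we see 42
}

// Run one producer/consumer round and return what the consumer read.
int consume_once(void) {
    pthread_t p, c;
    void *result;
    data = 0;
    atomic_store(&flag, 0);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(c, &result);
    pthread_join(p, NULL);
    return (int)(long)result;
}
```

Compile with `-O2 -S` for both targets and diff the output around the two atomic operations.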


Section B: Atomic Operations

4. Building a Counter with LOCK ADD

Implement a thread-safe counter in assembly:

; counter_increment(counter_ptr: rdi)
; counter_decrement(counter_ptr: rdi)
; counter_read(counter_ptr: rdi) → rax
; counter_add(counter_ptr: rdi, amount: rsi)

Test by spawning 8 threads, each incrementing a shared counter 1,000,000 times, and verifying the final count is exactly 8,000,000.
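One possible shape for the API, prototyped with GCC inline asm rather than a separate .s file (the helper names mirror the signatures above; run_counter_test is our addition for the verification step):

```c
#include <pthread.h>
#include <stdint.h>

void counter_add(int64_t *c, int64_t amount) {
    // LOCK ADD: atomic read-modify-write, full barrier on x86-64
    __asm__ volatile("lock add %1, %0" : "+m"(*c) : "er"(amount) : "memory");
}
void counter_increment(int64_t *c) { counter_add(c, 1); }
void counter_decrement(int64_t *c) { counter_add(c, -1); }

int64_t counter_read(int64_t *c) {
    // An aligned 8-byte load is already atomic on x86-64; plain MOV suffices
    int64_t v;
    __asm__ volatile("mov %1, %0" : "=r"(v) : "m"(*c) : "memory");
    return v;
}

static int64_t shared_counter;

static void *inc_worker(void *arg) {
    for (int i = 0; i < 1000000; i++)
        counter_increment(&shared_counter);
    return NULL;
}

int64_t run_counter_test(int nthreads) {
    pthread_t t[16];
    shared_counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, inc_worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return counter_read(&shared_counter);
}
```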

5. LOCK XADD — Fetch and Add

LOCK XADD [mem], reg atomically exchanges [mem] with reg, then stores the sum of the two original values back into [mem]. After the instruction, [mem] = [mem]_original + reg_original and reg = [mem]_original.

Use it to implement a thread-safe ID allocator:

; get_next_id: returns a unique monotonically increasing ID
; Returns: RAX = unique ID (starts at 0, increments each call)
get_next_id:
    mov rax, 1              ; increment amount
    lock xadd [id_counter], rax
    ; After: RAX = old value of id_counter, id_counter = old + 1
    ret

Verify correctness with 8 threads each calling it 100,000 times: collect all returned IDs into a sorted array and verify no duplicates and no gaps.
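A C harness for this verification might look as follows; get_next_id wraps the same lock xadd in inline asm, and everything else (verify_ids, the thread/call counts) is our scaffolding:

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static uint64_t id_counter;

uint64_t get_next_id(void) {
    uint64_t v = 1;   // increment amount; xadd leaves the old counter here
    __asm__ volatile("lock xadd %0, %1"
                     : "+r"(v), "+m"(id_counter) :: "memory");
    return v;
}

#define NTHREADS 8
#define CALLS    100000
static uint64_t ids[NTHREADS * CALLS];

static void *alloc_worker(void *arg) {
    uint64_t *slot = (uint64_t *)arg;   // private slice of the ids array
    for (int i = 0; i < CALLS; i++)
        slot[i] = get_next_id();
    return NULL;
}

static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

// Returns 1 if all NTHREADS*CALLS IDs are unique with no gaps.
int verify_ids(void) {
    pthread_t t[NTHREADS];
    id_counter = 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, alloc_worker, &ids[i * CALLS]);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    qsort(ids, NTHREADS * CALLS, sizeof(uint64_t), cmp_u64);
    for (uint64_t i = 0; i < NTHREADS * CALLS; i++)
        if (ids[i] != i) return 0;      // duplicate or gap
    return 1;
}
```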

6. CMPXCHG-Based Lock-Free Stack

Implement a lock-free stack using LOCK CMPXCHG:

; Each node: [next_ptr (8 bytes)] [data (8 bytes)]

; push_stack(head_ptr: rdi, node: rsi)
; pop_stack(head_ptr: rdi) → rax (node pointer, or 0 if empty)

Be careful about ABA. For this exercise, use a single-threaded test first to verify correctness, then a multi-threaded test.

7. Test-and-Set vs. CMPXCHG Spinlock

Implement two spinlock variants:

- TAS spinlock: uses LOCK XCHG [lock], al to test-and-set
- CAS spinlock: uses LOCK CMPXCHG [lock], ecx (from the chapter)

Both should include the PAUSE instruction in the spin loop. Benchmark both under high contention (16 threads, 1,000,000 lock/unlock cycles each) and compare:

- Total execution time
- Cache coherence traffic (use perf stat -e cache-misses)

Which is faster under high contention? Why?
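A starting point for the TAS variant, written in C with __atomic_exchange_n (which GCC emits as XCHG on x86-64) and an inner read-only spin so that waiters burn cycles in their own cache rather than hammering the coherence fabric; the type and function names are ours:

```c
#include <pthread.h>

typedef struct { volatile unsigned char locked; } tas_lock_t;

void tas_acquire(tas_lock_t *l) {
    // Outer test-and-set; inner loop spins read-only until the lock looks free
    while (__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE))
        while (l->locked)
            __asm__ volatile("pause");   // hint: yield to sibling hyperthread
}

void tas_release(tas_lock_t *l) {
    __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}

static tas_lock_t lock;
static long protected_counter;

static void *contender(void *arg) {
    for (int i = 0; i < 100000; i++) {
        tas_acquire(&lock);
        protected_counter++;             // non-atomic; the lock protects it
        tas_release(&lock);
    }
    return NULL;
}

// Returns the final counter value after nthreads contenders finish.
long run_contention(int nthreads) {
    pthread_t t[16];
    protected_counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, contender, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return protected_counter;
}
```

The CAS variant differs only in the acquire loop; time both with the same run_contention harness.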


Section C: ARM64 Atomics

8. ARM64 LDXR/STXR Implementation

Implement the ARM64 spinlock from the chapter and compile it using an AArch64 cross-compiler:

aarch64-linux-gnu-as spinlock.s -o spinlock.o

Add the DMB ISH barrier in the correct positions. Verify with a qemu-aarch64 userspace test:

qemu-aarch64 ./spinlock_test

9. ARM64 vs x86-64 Spinlock Assembly

Write the same logical spinlock in both x86-64 and ARM64 assembly. Create a side-by-side comparison showing:

- How many instructions are needed for each implementation
- Where barriers are placed and why
- What the WFE/SEV pair does on ARM64 (hint: it is similar to PAUSE plus a notification mechanism)

10. Acquire/Release Semantics

In C11, atomic_load_explicit(ptr, memory_order_acquire) compiles to:

- x86-64: plain MOV (no fence needed due to TSO)
- ARM64: LDAR (Load-Acquire instruction)

And atomic_store_explicit(ptr, val, memory_order_release) compiles to:

- x86-64: plain MOV (stores are already ordered on x86)
- ARM64: STLR (Store-Release instruction)

Write a small C program using C11 atomics, compile with -O2 for both x86_64-linux-gnu and aarch64-linux-gnu, and compare the generated assembly for the acquire/release operations. Confirm the x86-64 version has no explicit fences.


Section D: ABA Problem and CMPXCHG16B

11. Demonstrating the ABA Problem

Construct a concrete example of an ABA bug in a CAS-based lock-free data structure:

1. Thread 1 reads head → A
2. Thread 1 pauses
3. Thread 2: pop A (head is now B)
4. Thread 2: push A back (head is now A again, but ABA!)
5. Thread 1: CAS head (A → new_value) — succeeds despite the stale assumption

Write the code to demonstrate this, then fix it using a tagged pointer (CMPXCHG16B) where the "version" prevents the false match.
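Before wiring up real threads, the interleaving can be replayed deterministically in a single thread. The sketch below (names ours) plays thread 1's read, then thread 2's pops and push, then thread 1's CAS, and returns 1 if the stale CAS went through and left head dangling at the long-gone node B:

```c
#include <stddef.h>

struct aba_node { struct aba_node *next; };

int demonstrate_aba(void) {
    struct aba_node a = {NULL}, b = {NULL};
    struct aba_node *head;

    a.next = &b;                          // stack: A -> B
    head = &a;

    // Thread 1: reads head (A) and head->next (B), then "pauses"
    struct aba_node *seen_head = head;
    struct aba_node *seen_next = head->next;

    // Thread 2: pop A, pop B, push A back. Head is A again, but B is gone.
    head = head->next;                    // pop A -> head = B
    head = head->next;                    // pop B -> head = NULL
    a.next = NULL;                        // push A
    head = &a;

    // Thread 1 resumes: CAS succeeds because only the pointer value is checked
    int ok = __atomic_compare_exchange_n(&head, &seen_head, seen_next, 0,
                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return ok && head == &b;              // head now points at the popped B!
}
```

With a tagged pointer, step 3 and 4 each bump the version, so thread 1's 16-byte compare fails even though the pointer half matches.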

12. CMPXCHG16B ABA-Safe Stack

Implement the ABA-safe lock-free stack using CMPXCHG16B:

; Stack head: 16-byte struct { void *ptr; uint64_t version; }
; push: atomically update head to new_node, increment version
; pop:  atomically replace head with head->next, increment version

Verify with the ABA scenario from exercise 11 — the ABA bug should no longer be possible.


Section E: False Sharing

13. False Sharing Demonstration

Create a benchmark that demonstrates false sharing:

// Scenario 1: Adjacent (false sharing)
struct {
    long counter_a;    // offset 0
    long counter_b;    // offset 8
} shared;

// Scenario 2: Padded (no false sharing)
struct {
    long counter_a;
    char pad[56];      // pad to 64 bytes (one cache line)
    long counter_b;
} padded;

Spawn 2 threads. Thread 0 increments counter_a 100 million times; Thread 1 increments counter_b. Measure wall-clock time for both scenarios. What is the speedup from cache-line padding?

Also measure with perf stat -e cache-misses,L1-dcache-load-misses for both scenarios.

14. NUMA False Sharing

If you have a multi-socket system (or can simulate with QEMU or numactl):

numactl --cpunodebind=0 --membind=1 ./false_sharing_test

Run the false sharing benchmark with thread 0 on socket 0 and thread 1 on socket 1, with shared data allocated on socket 1. Compare with both threads on the same socket. NUMA false sharing can be 10× more expensive than intra-socket false sharing due to the QPI/UPI interconnect latency.


Section F: MinOS SMP

15. Fine-Grained Locking for MinOS PMM

The MinOS physical memory manager uses a global bitmap. Under SMP, multiple CPUs might call pmm_alloc simultaneously. Implement a spinlock protecting the PMM bitmap:

a) Add a spinlock to the PMM data structure
b) Acquire it before searching for a free frame; release it after marking the frame
c) Benchmark: single-CPU performance (no contention), 2-CPU contention, 4-CPU contention

d) Optimize: can you use LOCK BTS (atomic bit test-and-set) to make the allocation itself atomic without holding a lock? Implement this and compare performance.
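One way part (d) might look in C, using __atomic_fetch_or as the atomic bit test-and-set (GCC can lower this pattern to LOCK BTS; the function names and frame count are ours, not MinOS's):

```c
#include <stdint.h>

#define NFRAMES 256
static uint64_t bitmap[NFRAMES / 64];    // bit set = frame in use

// Lock-free allocation: claim the first clear bit, retry on races.
long pmm_alloc_lockfree(void) {
    for (int w = 0; w < NFRAMES / 64; w++) {
        uint64_t word = __atomic_load_n(&bitmap[w], __ATOMIC_RELAXED);
        while (word != ~0ULL) {
            int bit = __builtin_ctzll(~word);        // first clear bit
            uint64_t old = __atomic_fetch_or(&bitmap[w], 1ULL << bit,
                                             __ATOMIC_ACQUIRE);
            if (!(old & (1ULL << bit)))
                return w * 64 + bit;     // we set the bit: frame is ours
            word = old | (1ULL << bit);  // lost the race; try the next bit
        }
    }
    return -1;                           // out of frames
}

void pmm_free_lockfree(long frame) {
    __atomic_fetch_and(&bitmap[frame / 64], ~(1ULL << (frame % 64)),
                       __ATOMIC_RELEASE);
}
```

No lock is held at any point: a losing CPU simply sees the bit already set in the returned old value and moves on to the next clear bit.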

16. Per-CPU Data and Lock Elimination

Design a per-CPU allocator that eliminates most PMM locking:

- Each CPU has its own "magazine" of 64 pre-allocated physical frames
- Allocation from the magazine requires no locks (only this CPU touches it)
- When a magazine is empty, refill from the global PMM (requires the lock, but rarely)
- When a CPU frees a frame, put it in a per-CPU free list first

Implement this two-level allocator and measure the reduction in lock contention vs. the simple locked allocator from exercise 15.
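A shape the two-level allocator might take, shown for a single "CPU" with a mutex standing in for the global PMM spinlock of exercise 15 (all names here are ours; global_refills counts how often the slow path was taken):

```c
#include <pthread.h>
#include <stdint.h>

#define MAG_SIZE 64

struct magazine {
    uint64_t frames[MAG_SIZE];
    int count;                 // frames usable without taking any lock
};

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t next_global_frame;   // trivial stand-in for the global PMM
static long global_refills;          // how often the lock was actually taken

static void magazine_refill(struct magazine *m) {
    pthread_mutex_lock(&global_lock);        // slow path: rare by design
    for (int i = 0; i < MAG_SIZE; i++)
        m->frames[i] = next_global_frame++;
    m->count = MAG_SIZE;
    global_refills++;
    pthread_mutex_unlock(&global_lock);
}

uint64_t percpu_alloc(struct magazine *m) {
    if (m->count == 0)
        magazine_refill(m);
    return m->frames[--m->count];            // fast path: no lock at all
}

void percpu_free(struct magazine *m, uint64_t frame) {
    if (m->count < MAG_SIZE)
        m->frames[m->count++] = frame;       // cache locally for reuse
    // else: a full implementation would return a batch to the global PMM
}
```

The measurement for this exercise is then the ratio of global_refills to total allocations: with 64-frame magazines, the lock is taken roughly once per 64 allocations instead of on every one.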