Case Study 32-2: False Sharing — When Two Threads Fight Over One Cache Line
The Bug That Looks Like a Race Condition but Isn't
False sharing is one of the most counterintuitive performance pathologies in concurrent programming. The code is correct — no data race, no undefined behavior — yet performance collapses as the thread count increases. This case study demonstrates false sharing with assembly-level analysis, shows exactly what is happening on the cache coherence bus, and presents the fix.
The Setup: Parallel Counters
Consider a parallel workload where each of N threads increments its own counter, then the main thread sums them at the end:
// Version A: adjacent counters (FALSE SHARING)
#define THREADS 8

typedef struct {
    long counter;              // 8 bytes per counter
} Counter;

Counter counters[THREADS];     // 8 × 8 = 64 bytes — one cache line!

// Each thread i runs:
void thread_func(int thread_id) {
    for (long i = 0; i < 100000000L; i++) {
        counters[thread_id].counter++;
    }
}
The counters array is exactly 64 bytes — precisely one cache line. All eight counters are packed into a single cache line that every thread must write to.
Memory layout — all 8 counters in one cache line:
┌────────────────────────────────────────────────────────────────┐
│ ctr0 │ ctr1 │ ctr2 │ ctr3 │ ctr4 │ ctr5 │ ctr6 │ ctr7 │
│ 8B │ 8B │ 8B │ 8B │ 8B │ 8B │ 8B │ 8B │
└────────────────────────────────────────────────────────────────┘
0x00 0x08 0x10 0x18 0x20 0x28 0x30 0x38 0x40
one cache line (64 bytes)
The NASM Code
Each thread executes a simple increment loop:
; thread_increment: increment own counter N times
; rdi = pointer to this thread's counter (&counters[thread_id].counter)
; rsi = iteration count (100,000,000)
thread_increment:
        xor     eax, eax
.loop:
        lock inc qword [rdi]    ; atomic increment (not strictly needed — only this
                                ; thread writes here; LOCK just makes it explicit)
        inc     rax
        cmp     rax, rsi
        jl      .loop
        ret

; Alternative: non-atomic (demonstrates false sharing equally well)
thread_increment_nolock:
        xor     eax, eax
.loop:
        mov     rcx, [rdi]      ; load counter
        inc     rcx             ; increment
        mov     [rdi], rcx      ; store counter
                                ; these loads/stores trigger MESI traffic even without LOCK
        inc     rax
        cmp     rax, rsi
        jl      .loop
        ret
What Happens on the Cache Bus
When Thread 0 writes counters[0] and Thread 1 writes counters[1]:
Timeline (8 cores, each writing to one 8-byte counter on the shared 64-byte cache line):
Cycle   Core 0 Action           Core 1 Action           Cache Line Owner
────────────────────────────────────────────────────────────────────────
0       Read line from L3       —                       Core 0 (E)
1       Write counter[0]        —                       Core 0 (E→M)
2       —                       RFO (wants to write)    Invalidation!
3       Core 0: M→I             Core 1: I→M             Core 1 (M)
4       —                       Write counter[1]        Core 1 (M)
5       RFO (wants to write)    —                       Invalidation!
6       Core 1: M→I             Core 0: I→M             Core 0 (M)
7       Write counter[0]        —                       Core 0 (M)
...
Every single write by any core invalidates the line in all other cores. Each write generates:
1. A request-for-ownership (RFO) message on the coherence interconnect
2. Invalidation acknowledgments from the other cores
3. A transfer of the line out of the current owner's cache
At 100,000,000 writes per thread × 8 threads, this generates ~800 million coherence transactions for work that should require zero coherence overhead.
Benchmarks
# Compile and run the benchmark
gcc -O2 -pthread -o false_sharing false_sharing.c
time ./false_sharing --threads 1
time ./false_sharing --threads 2
time ./false_sharing --threads 4
time ./false_sharing --threads 8
VERSION A (false sharing — all counters on one cache line):
  Threads:       1      2      4      8
  Time (s):      0.23   0.94   2.18   4.71   ← GETS SLOWER!
  GCounters/s:   0.43   0.21   0.18   0.17

VERSION B (padded — each counter on its own cache line):
  Threads:       1      2      4      8
  Time (s):      0.23   0.24   0.25   0.27   ← Time stays ~flat!
  GCounters/s:   0.43   0.83   1.59   2.94

Scaling efficiency (total throughput relative to 1 thread):
  False sharing: 1.0× → 0.5×  → 0.42× → 0.40×   (inverse scaling!)
  Padded:        1.0× → 1.93× → 3.70× → 6.84×   (near-linear)

Each thread does a fixed 100M increments, so perfect scaling means constant wall time and linearly growing throughput — which is what the padded version delivers. With false sharing, adding more threads makes the program slower in absolute terms: at 8 threads it takes 4.71 seconds — 20× longer than a single thread, and roughly 17× longer than the padded 8-thread run — for work that is genuinely independent.
The Fix: Cache Line Padding
; NASM fix: separate each counter onto its own cache line
section .bss
        align   64
counter0:   resq 1              ; 8 bytes (counter)
            resb 56             ; 56 bytes padding → 64 bytes total (one cache line)
counter1:   resq 1              ; starts at offset +64: a new cache line
            resb 56
counter2:   resq 1
            resb 56
; ... etc.

; Or use a struct with explicit padding in C:
;   typedef struct { long counter; char pad[56]; } PaddedCounter;
;   PaddedCounter counters[THREADS];  // 64 bytes each → separate cache lines
; The non-atomic version with padded counters:
;   Thread 0 writes counter0 (cache line 0)
;   Thread 1 writes counter1 (cache line 1)
;   → no coherence traffic between them
thread_increment_padded:
        xor     eax, eax
.loop:
        mov     rcx, [rdi]      ; load (this thread's private cache line)
        inc     rcx             ; increment
        mov     [rdi], rcx      ; store (no invalidation — no other writers)
        inc     rax
        cmp     rax, rsi
        jl      .loop
        ret

; After padding: Core 0 holds counter0's cache line in E or M state throughout.
; No other core ever writes to it. No coherence traffic.
Measuring with perf
# Observe cache line bouncing with perf:
perf stat -e cache-misses,cache-references,mem_load_retired.l3_miss \
    ./false_sharing --threads 8 --version A

# (perf c2c record / perf c2c report can additionally pinpoint the exact
#  contended cache line and the offsets each thread touches)
# Version A output (false sharing, 8 threads):
# cache-references: 3,284,729,132
# cache-misses: 2,947,154,813 (89.7% miss rate!)
# mem_load_retired.l3_miss: 1,842,341,782
# Version B output (padded, 8 threads):
# cache-references: 401,293,847
# cache-misses: 642,918 (0.16% miss rate)
# mem_load_retired.l3_miss: 51,234
# The difference: 2.9 billion vs 643K cache misses
# for the same amount of work.
Real-World False Sharing Patterns
False sharing is not limited to counters. It appears wherever concurrent code writes independent data that happens to be adjacent in memory:
1. Thread-local statistics in a shared struct:
// BAD: all threads' stats packed onto the same/adjacent cache lines (24 B each)
struct Stats { long ops; long errors; long latency_sum; } stats[MAX_THREADS];

// GOOD: pad each thread's stats to a full, aligned cache line (64 B each)
struct Stats { _Alignas(64) long ops; long errors; long latency_sum; char pad[40]; } stats[MAX_THREADS];
2. OpenMP reduction variables:
// BAD: one accumulator per thread in a shared array
double partials[N_THREADS];   // all on the same/adjacent cache lines
#pragma omp parallel for
for (int i = 0; i < N; i++)
    partials[omp_get_thread_num()] += data[i];

// GOOD: private accumulator per thread, merged at the end
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < N; i++)
    total += data[i];
3. Linux kernel per-CPU variables: The Linux kernel uses DEFINE_PER_CPU_ALIGNED to ensure per-CPU data structures are cache-aligned, preventing false sharing between CPUs. The kernel's implementation:
// From linux/percpu-defs.h (simplified):
#define DEFINE_PER_CPU_ALIGNED(type, name) \
DEFINE_PER_CPU_SECTION(type, name, "..cacheline_aligned") \
__aligned(SMP_CACHE_BYTES)
// Where SMP_CACHE_BYTES = 64 on x86-64
// This ensures no false sharing between CPU-local statistics,
// scheduler runqueues, or interrupt counters.
The Deeper Lesson
False sharing reveals an important fact about modern hardware: the cache coherence unit is the cache line, not the individual variable. Two variables 4 bytes apart are, from the cache protocol's perspective, the same object. Any write to either variable claims exclusive ownership of both.
This creates a counterintuitive design principle: for concurrent code, separation is performance. Variables written by different threads must be separated by at least one cache line (64 bytes on x86-64). This applies to:
- Per-thread accumulators
- Per-thread statistics or logs
- Lock structures (mutex and protected data on different cache lines)
- Producer-consumer queue head and tail pointers (separate cache lines!)
; Lock-free SPSC queue: head and tail must live on different cache lines.
; Otherwise the producer's writes (tail) needlessly invalidate the
; consumer's reads (head).
section .bss
        align   64
queue_tail: resq 1              ; producer writes here
            resb 56             ; padding
queue_head: resq 1              ; consumer writes here
            resb 56             ; padding
queue_buf:  resq 4096           ; 4096 element slots
The false sharing case study demonstrates that performance analysis must extend below the source code level to the memory model of the underlying hardware. A 20× slowdown from 8 threads doing genuinely independent work — invisible to any memory sanitizer or correctness checker — is the price of ignoring cache line boundaries.