Case Study 32-2: False Sharing — When Two Threads Fight Over One Cache Line
The Bug That Looks Like a Race Condition but Isn't
False sharing is one of the most counterintuitive performance pathologies in concurrent programming. The code is correct — no data race, no undefined behavior — yet performance collapses as the thread count increases. This case study demonstrates false sharing with assembly-level analysis, shows exactly what is happening on the cache coherence bus, and presents the fix.
The Setup: Parallel Counters
Consider a parallel workload where each of N threads increments its own counter, then the main thread sums them at the end:
// Version A: adjacent counters (FALSE SHARING)
#define THREADS 8

typedef struct {
    long counter;              // 8 bytes per counter
} Counter;

Counter counters[THREADS];     // 8 × 8 = 64 bytes — one cache line!

// Each thread i runs:
void thread_func(int thread_id) {
    for (long i = 0; i < 100000000L; i++) {
        counters[thread_id].counter++;
    }
}
The counters array is exactly 64 bytes — precisely one cache line. All eight counters are packed into a single cache line that every thread must write to.
Memory layout — all 8 counters in one cache line:
┌────────────────────────────────────────────────────────────────┐
│ ctr0 │ ctr1 │ ctr2 │ ctr3 │ ctr4 │ ctr5 │ ctr6 │ ctr7 │
│ 8B │ 8B │ 8B │ 8B │ 8B │ 8B │ 8B │ 8B │
└────────────────────────────────────────────────────────────────┘
0x00 0x08 0x10 0x18 0x20 0x28 0x30 0x38 0x40
one cache line (64 bytes)
The NASM Code
Each thread executes a simple increment loop:
; thread_increment: increment own counter N times
; rdi = pointer to this thread's counter (&counters[thread_id].counter)
; rsi = iteration count (100,000,000)
thread_increment:
        xor     eax, eax
.loop:
        lock inc qword [rdi]    ; atomic increment (not strictly needed — only this
                                ; thread writes here; LOCK just makes it explicit)
        inc     rax
        cmp     rax, rsi
        jl      .loop
        ret

; Alternative: non-atomic (demonstrates false sharing equally well)
thread_increment_nolock:
        xor     eax, eax
.loop:
        mov     rcx, [rdi]      ; load counter
        inc     rcx             ; increment
        mov     [rdi], rcx      ; store counter
                                ; these loads/stores trigger MESI traffic even without LOCK
        inc     rax
        cmp     rax, rsi
        jl      .loop
        ret
What Happens on the Cache Bus
When Thread 0 writes counters[0] and Thread 1 writes counters[1]:
Timeline (8 cores, each writing to one 8-byte counter on the shared 64-byte cache line):
Cycle   Core 0 Action           Core 1 Action           Cache Line Owner
────────────────────────────────────────────────────────────────────────
0       Read line from L3       —                       Core 0 (E)
1       Write counter[0]        —                       Core 0 (E→M)
2       —                       RFO (wants to write)    Invalidation!
3       Core 0: M→I             Core 1: I→M             Core 1 (M)
4       —                       Write counter[1]        Core 1 (M)
5       RFO (wants to write)    —                       Invalidation!
6       Core 1: M→I             Core 0: I→M             Core 0 (M)
7       Write counter[0]        —                       Core 0 (M)
...
Every single write by any core invalidates the line in all other cores. Each write generates:
1. A request-for-ownership (RFO) message on the coherence interconnect
2. Invalidation acknowledgments from the other cores
3. A transfer of the line out of the current owner's cache
At 100,000,000 writes per thread × 8 threads, this generates ~800 million coherence transactions for work that should require zero coherence overhead.
Benchmarks
# Compile and run the benchmark
gcc -O2 -pthread -o false_sharing false_sharing.c
time ./false_sharing --threads 1
time ./false_sharing --threads 2
time ./false_sharing --threads 4
time ./false_sharing --threads 8
VERSION A (false sharing — all counters on one cache line):
  Threads:       1      2      4      8
  Time (s):      0.23   0.94   2.18   4.71   ← GETS SLOWER!
  GCounters/s:   0.43   0.21   0.18   0.17

VERSION B (padded — each counter on its own cache line):
  Threads:       1      2      4      8
  Time (s):      0.23   0.24   0.25   0.27   ← Time stays ~flat!
  GCounters/s:   0.43   0.83   1.59   2.94

Scaling efficiency (total throughput relative to 1 thread):
  False sharing: 1.0× → 0.5×  → 0.42× → 0.40×   (inverse scaling!)
  Padded:        1.0× → 1.93× → 3.70× → 6.84×   (near-linear)

Each thread does a fixed 100M increments, so perfect scaling means constant wall time and linearly growing throughput — which is what the padded version delivers. With false sharing, adding more threads makes the program slower in absolute terms: at 8 threads it takes 4.71 seconds — 20× longer than a single thread, and roughly 17× longer than the padded 8-thread run — for work that is genuinely independent.
The Fix: Cache Line Padding
; NASM fix: separate each counter onto its own cache line
section .bss
        align   64
counter0:   resq 1              ; 8 bytes (counter)
            resb 56             ; 56 bytes padding → 64 bytes total (one cache line)
counter1:   resq 1              ; starts at offset +64: a new cache line
            resb 56
counter2:   resq 1
            resb 56
; ... etc.

; Or use a struct with explicit padding in C:
;   typedef struct { long counter; char pad[56]; } PaddedCounter;
;   PaddedCounter counters[THREADS];  // 64 bytes each → separate cache lines
; The non-atomic version with padded counters:
;   Thread 0 writes counter0 (cache line 0)
;   Thread 1 writes counter1 (cache line 1)
;   → no coherence traffic between them
thread_increment_padded:
        xor     eax, eax
.loop:
        mov     rcx, [rdi]      ; load (this thread's private cache line)
        inc     rcx             ; increment
        mov     [rdi], rcx      ; store (no invalidation — no other writers)
        inc     rax
        cmp     rax, rsi
        jl      .loop
        ret

; After padding: Core 0 holds counter0's cache line in E or M state throughout.
; No other core ever writes to it. No coherence traffic.
Measuring with perf
# Observe cache line bouncing with perf:
perf stat -e cache-misses,cache-references,mem_load_retired.l3_miss \
    ./false_sharing --threads 8 --version A

# (perf c2c record / perf c2c report can additionally pinpoint the exact
#  contended cache line and the offsets each thread touches)
# Version A output (false sharing, 8 threads):
# cache-references: 3,284,729,132
# cache-misses: 2,947,154,813 (89.7% miss rate!)
# mem_load_retired.l3_miss: 1,842,341,782
# Version B output (padded, 8 threads):
# cache-references: 401,293,847
# cache-misses: 642,918 (0.16% miss rate)
# mem_load_retired.l3_miss: 51,234
# The difference: 2.9 billion vs 643K cache misses
# for the same amount of work.
Real-World False Sharing Patterns
False sharing is not limited to counters. It appears wherever concurrent code writes independent data that happens to be adjacent in memory:
1. Thread-local statistics in a shared struct:
// BAD: all threads' stats packed onto the same/adjacent cache lines (24 B each)
struct Stats { long ops; long errors; long latency_sum; } stats[MAX_THREADS];

// GOOD: pad each thread's stats to a full, aligned cache line (64 B each)
struct Stats { _Alignas(64) long ops; long errors; long latency_sum; char pad[40]; } stats[MAX_THREADS];
2. OpenMP reduction variables:
// BAD: one accumulator per thread in a shared array
double partials[N_THREADS];   // all on the same/adjacent cache lines
#pragma omp parallel for
for (int i = 0; i < N; i++)
    partials[omp_get_thread_num()] += data[i];

// GOOD: private accumulator per thread, merged at the end
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < N; i++)
    total += data[i];
3. Linux kernel per-CPU variables: The Linux kernel uses DEFINE_PER_CPU_ALIGNED to ensure per-CPU data structures are cache-aligned, preventing false sharing between CPUs. The kernel's implementation:
// From linux/percpu-defs.h (simplified):
#define DEFINE_PER_CPU_ALIGNED(type, name) \
DEFINE_PER_CPU_SECTION(type, name, "..cacheline_aligned") \
__aligned(SMP_CACHE_BYTES)
// Where SMP_CACHE_BYTES = 64 on x86-64
// This ensures no false sharing between CPU-local statistics,
// scheduler runqueues, or interrupt counters.
The Deeper Lesson
False sharing reveals an important fact about modern hardware: the cache coherence unit is the cache line, not the individual variable. Two variables 4 bytes apart are, from the cache protocol's perspective, the same object. Any write to either variable claims exclusive ownership of both.
This creates a counterintuitive design principle: for concurrent code, separation is performance. Variables written by different threads must be separated by at least one cache line (64 bytes on x86-64). This applies to:
- Per-thread accumulators
- Per-thread statistics or logs
- Lock structures (mutex and protected data on different cache lines)
- Producer-consumer queue head and tail pointers (separate cache lines!)
; Lock-free SPSC queue: head and tail must live on different cache lines.
; Otherwise the producer's writes (tail) needlessly invalidate the
; consumer's reads (head).
section .bss
        align   64
queue_tail: resq 1              ; producer writes here
            resb 56             ; padding
queue_head: resq 1              ; consumer writes here
            resb 56             ; padding
queue_buf:  resq 4096           ; 4096 element slots
The false sharing case study demonstrates that performance analysis must extend below the source code level to the memory model of the underlying hardware. A 20× slowdown from 8 threads doing genuinely independent work — invisible to any memory sanitizer or correctness checker — is the price of ignoring cache line boundaries.