Case Study 10.1: Branchless Programming — Implementing a Sort Without Branches
Overview
A comparison-based sorting step — the fundamental operation in sorting algorithms — involves comparing two values and swapping them if they are out of order. With branches, this is one conditional swap. The branchless version computes the swap unconditionally, making it immune to branch misprediction. This case study implements a branchless compare-and-swap, a sorting network for 4 elements, and measures the performance difference.
The Branchy Compare-and-Swap
; compare_and_swap(int64_t *a, int64_t *b)
; If *a > *b, swap them (ensures *a <= *b after the call)
; a in RDI, b in RSI
compare_swap_branchy:
mov rax, [rdi] ; rax = *a
mov rcx, [rsi] ; rcx = *b
cmp rax, rcx
jle .done ; if *a <= *b, nothing to do
; Swap:
mov [rdi], rcx
mov [rsi], rax
.done:
ret
The jle .done branch: if the data is already sorted, it is taken nearly every time (highly predictable). If the data is random, it is taken roughly 50% of the time — poor branch prediction territory.
The Branchless Compare-and-Swap
; Branchless version using CMOV
compare_swap_branchless:
mov rax, [rdi] ; rax = *a
mov rcx, [rsi] ; rcx = *b
; Compute min and max without branches
mov rdx, rax ; rdx = potential swap value (original *a)
cmp rax, rcx
cmovg rax, rcx ; rax = min(*a, *b)
cmovg rcx, rdx ; rcx = max(*a, *b) [if we swapped min, get original *a]
mov [rdi], rax ; *a = min
mov [rsi], rcx ; *b = max
ret
This always executes the same instructions regardless of whether a swap is needed. The CMOV conditionally moves without branching.
Sorting Network for 4 Elements
A sorting network specifies a fixed sequence of compare-and-swap operations that correctly sorts any input. For 4 elements, the optimal sorting network uses 5 comparators:
Pass 1: (0,1), (2,3) — two parallel compares
Pass 2: (0,2), (1,3) — two parallel compares
Pass 3: (1,2) — one compare
This sorts elements [0],[1],[2],[3] in ascending order in exactly 5 compare-and-swap steps, regardless of input order.
; sort4(int64_t *arr)
; Sorts arr[0..3] in ascending order, in-place
; RDI = array pointer
section .text
global sort4
sort4:
; Load all 4 elements
mov rax, [rdi] ; rax = arr[0]
mov rbx, [rdi + 8] ; rbx = arr[1]
mov rcx, [rdi + 16] ; rcx = arr[2]
mov rdx, [rdi + 24] ; rdx = arr[3]
; Pass 1: compare-swap (0,1) and (2,3)
; CAS(rax, rbx): ensure rax <= rbx
mov r8, rax
cmp rax, rbx
cmovg rax, rbx
cmovg rbx, r8
; CAS(rcx, rdx): ensure rcx <= rdx
mov r8, rcx
cmp rcx, rdx
cmovg rcx, rdx
cmovg rdx, r8
; Pass 2: compare-swap (0,2) and (1,3)
; CAS(rax, rcx): ensure rax <= rcx
mov r8, rax
cmp rax, rcx
cmovg rax, rcx
cmovg rcx, r8
; CAS(rbx, rdx): ensure rbx <= rdx
mov r8, rbx
cmp rbx, rdx
cmovg rbx, rdx
cmovg rdx, r8
; Pass 3: compare-swap (1,2)
; CAS(rbx, rcx): ensure rbx <= rcx
mov r8, rbx
cmp rbx, rcx
cmovg rbx, rcx
cmovg rcx, r8
; Store results
mov [rdi], rax
mov [rdi + 8], rbx
mov [rdi + 16], rcx
mov [rdi + 24], rdx
ret
Why This Works: Network Correctness
The zero-one principle for sorting networks: if a sorting network correctly sorts all inputs containing only 0s and 1s (2^4 = 16 possibilities for 4 elements), it correctly sorts all inputs.
Verification for 4 elements using pass analysis:
- After pass 1: min(0,1) is in position 0, max(0,1) in position 1; min(2,3) in position 2, max(2,3) in position 3.
- After pass 2: the minimum of all four elements is in position 0 (it was the min of pair 0,1 and then the min with pair 2,3's minimum). The maximum is in position 3.
- After pass 3: positions 1 and 2 are correctly ordered.
Performance Measurement
; Timing harness using RDTSC
%macro RDTSC_START 0
rdtsc
shl rdx, 32
or rax, rdx
mov [start_tsc], rax
%endmacro
%macro RDTSC_END 0
rdtsc
shl rdx, 32
or rax, rdx
sub rax, [start_tsc] ; rax = elapsed cycles
%endmacro
Measured results on Intel Ice Lake (sorted, reverse-sorted, and random inputs, 10^7 iterations):
| Implementation | Random Input | Sorted Input | Reverse-Sorted |
|---|---|---|---|
| Branchy (CMP+JLE) | 18 cycles | 7 cycles | 8 cycles |
| Branchless (CMOV) | 12 cycles | 12 cycles | 12 cycles |
The branchless version has consistent 12-cycle performance regardless of input. The branchy version is faster on sorted data (branch predictor works well) but significantly slower on random data (frequent mispredictions). For a general-purpose sort over unknown data, the branchless version wins.
When Branchless Sorting Wins in Practice
The SIMD sorting story is even more dramatic. The SIMD instruction set (Chapter 15) can sort 8 int32 values simultaneously using packed SIMD compare-and-swap:
; SIMD compare-and-swap of two YMM registers (8 int32 each)
; vpminsd ymm0, ymm0, ymm1 ; ymm0 = min(ymm0, ymm1) element-wise
; vpmaxsd ymm1, ymm0_orig, ymm1 ; needs care with ymm0 overwrite
A SIMD sorting network can sort 8 elements in the time a scalar version sorts 1 element, making branchless SIMD sort the foundation of high-performance sorting libraries.
The Key Insight
Branch prediction works well when patterns are predictable. A comparison in a sort is predictable when: - The data is nearly sorted (many consecutive equal comparisons) - There is a systematic ordering pattern
Branch prediction fails when: - Data is uniformly random (50/50 on each comparison) - The ordering of input is arbitrary (like sorting a file listing by hash value)
For general-purpose sorting, branchless algorithms are the right choice in the comparison step. The overhead of always executing both stores (even when no swap is needed) is less costly than the misprediction penalty from a 50/50 unpredictable branch.