Network packet headers, hardware registers, file permission flags, color channels, encryption keys — these all pack information into individual bits. The x86-64 instruction set has a complete toolkit for operating on bits: instructions for...
In This Chapter
- Every Bit Has a Job
- Bitmask Operations: The Three Primitives
- Bit Test Instructions: BT, BTS, BTR, BTC
- Bit Scanning: BSF, BSR, LZCNT, TZCNT
- POPCNT: Count Set Bits
- BMI1 and BMI2: The Advanced Bit Manipulation Sets
- XOR Tricks: Classic Bit Manipulation Idioms
- Packing Flags into a Single Register
- The XOR Cipher: Anchor Example Part 1
- Bit Manipulation for Real Applications
- Register Trace: Bit Manipulation Operations
- Summary
Chapter 13: Bit Manipulation
Every Bit Has a Job
Network packet headers, hardware registers, file permission flags, color channels, encryption keys — these all pack information into individual bits. The x86-64 instruction set has a complete toolkit for operating on bits: instructions for isolating, setting, clearing, and toggling individual bits or groups, for finding the position of the highest or lowest set bit, for counting set bits, and the BMI1/BMI2 extension set that adds hardware acceleration for several advanced bit operations.
This chapter also introduces the first anchor example for encryption: the XOR cipher. XOR is the fundamental operation of symmetric cryptography — every AES round, every stream cipher, every one-time pad uses XOR at its core. The implementation here is pedagogical rather than secure (XOR cipher is not secure for repeated keys), but it establishes the pattern for the AES-NI implementation in Chapter 15.
Bitmask Operations: The Three Primitives
Every bit manipulation application reduces to three operations: isolate bits you care about, set bits to 1, clear bits to 0, toggle bits. These correspond exactly to AND, OR, XOR.
AND: Isolating and Clearing Bits
AND with a mask preserves bits where the mask has 1s, and forces 0s where the mask has 0s.
; Isolate the low 8 bits (mask out upper 56 bits):
and rax, 0xFF ; rax = rax & 0x00000000000000FF
; Check if bit 4 is set (test a specific bit):
and rax, (1 << 4) ; result is nonzero if bit 4 was set, zero if not
; Better: use TEST for this (doesn't modify rax)
test rax, (1 << 4) ; set ZF=0 if bit 4 set, ZF=1 if clear
; Clear specific bits using NOT of mask:
; Clear bits 3, 4, 5 (mask = ~0b00111000 = 0xFFFFFFFFFFFFFFC7)
and rax, ~(7 << 3) ; NASM constant expression: ~0b111000
; Extract a bit field: bits 12:8 (5-bit field)
and rax, 0x1F00 ; isolate bits 12:8
shr rax, 8 ; shift down to bits 4:0
OR: Setting Bits
OR with a mask forces 1s where the mask has 1s, preserves other bits.
; Set bit 5:
or rax, (1 << 5) ; rax |= 0x20
; Set multiple bits (bits 0, 2, 7):
or rax, (1 | 4 | 128) ; or rax, 0x85
; Set a bit field (write 0b101 into bits 5:3):
; First clear the field, then OR in the new value
and rax, ~(0x7 << 3) ; clear bits 5:3
or rax, (0x5 << 3) ; set bits 5:3 = 0b101
XOR: Toggling Bits
XOR with a mask flips bits where the mask has 1s, preserves other bits.
; Toggle bit 0:
xor rax, 1 ; flip the LSB
; Toggle bits 3 and 7:
xor rax, ((1 << 3) | (1 << 7)) ; xor rax, 0x88
; Toggle all bits (bitwise NOT — also achievable with NOT instruction):
xor rax, 0xFFFFFFFFFFFFFFFF ; same as: not rax
; Classic XOR swap (no temporary variable):
xor rax, rbx ; rax = rax ^ rbx
xor rbx, rax ; rbx = rbx ^ (rax ^ rbx) = original rax
xor rax, rbx ; rax = (rax ^ rbx) ^ original_rax = original rbx
; Zero a register (XOR with self):
xor eax, eax ; rax = 0 (canonical zero idiom)
NOT: Bitwise Complement
not rax ; rax = ~rax (flip all 64 bits)
not byte [rbx] ; flip all bits in a memory byte
NOT does not modify flags. It is equivalent to xor rax, -1.
ANDN: Clear Bits Efficiently (BMI1)
In BMI1 (Bit Manipulation Instructions set 1, available on Haswell+), ANDN dst, src1, src2 computes dst = (~src1) & src2. This lets you clear bits without a separate NOT:
; Clear the bits set in RDX from RAX (without modifying RDX):
; Old way (two instructions):
mov rcx, rdx
not rcx
and rax, rcx
; BMI1 ANDN (one instruction):
andn rax, rdx, rax ; rax = (~rdx) & rax
Bit Test Instructions: BT, BTS, BTR, BTC
These instructions test a single bit and optionally modify it.
; BT (Bit Test): copy bit N of dst into CF
bt rax, 5 ; CF = bit 5 of rax (rax unchanged)
bt rax, rcx ; CF = bit (rcx & 63) of rax
bt [rbx], rdi ; CF = bit rdi of memory (arbitrary bit offset!)
; BTS (Bit Test and Set): copy bit to CF, then set it
bts rax, 5 ; CF = old bit 5; rax bit 5 = 1
; BTR (Bit Test and Reset): copy bit to CF, then clear it
btr rax, 5 ; CF = old bit 5; rax bit 5 = 0
; BTC (Bit Test and Complement): copy bit to CF, then toggle it
btc rax, 5 ; CF = old bit 5; rax bit 5 flipped
After BT/BTS/BTR/BTC, check CF with jc (carry set = bit was 1) or jnc (carry clear = bit was 0).
The memory form of BT is powerful and unusual: bt [rbx], rdi treats memory starting at [rbx] as a bit array of arbitrary length and tests the bit at position RDI. This is used for implementing large bitsets/bit arrays.
; Bit array: test, set, clear, toggle bit at arbitrary position
; Array at [rdi], bit index in RSI
test_bit:
bt [rdi], rsi ; CF = bit at position rsi
setc al ; al = CF (0 or 1)
movzx eax, al
ret
set_bit:
bts [rdi], rsi ; set bit at position rsi (old value in CF)
ret
clear_bit:
btr [rdi], rsi ; clear bit at position rsi
ret
toggle_bit:
btc [rdi], rsi ; toggle bit at position rsi
ret
⚡ Performance Note: The memory forms of BT/BTS/BTR/BTC with a register bit index are slow (they can access any bit in memory, which requires computation). For bit arrays where the index is computed at runtime, consider using the register form with SHR/AND to locate the byte and bit position manually — it is often faster.
Bit Scanning: BSF, BSR, LZCNT, TZCNT
These instructions find the position of set bits.
BSF: Bit Scan Forward (Lowest Set Bit)
; BSF dst, src: dst = index of lowest set bit of src
; ZF = 1 if src == 0 (result is undefined if src is zero)
bsf rax, rbx ; rax = index of LSB of rbx
jz .no_bits_set ; if rbx was 0, result is undefined
BSF has undefined behavior when the source is zero. Never use BSF without first checking that the input is nonzero, or use TZCNT instead.
BSR: Bit Scan Reverse (Highest Set Bit)
; BSR dst, src: dst = index of highest set bit of src
bsr rax, rbx ; rax = index of MSB of rbx (= floor(log2(rbx)) for nonzero rbx)
jz .no_bits_set
BSR also has undefined behavior for zero input.
LZCNT and TZCNT: Better Alternatives (BMI1)
LZCNT (Leading Zero Count) and TZCNT (Trailing Zero Count) are defined for zero input and are generally preferred over BSF/BSR in new code:
; LZCNT: count leading zeros (zeros from MSB down to first 1)
lzcnt rax, rbx ; rax = number of leading zeros in rbx
; If rbx == 0: rax = 64 (64-bit operand)
; If rbx == 1: rax = 63
; If rbx == 0x8000000000000000: rax = 0
; TZCNT: count trailing zeros (zeros from LSB up to first 1)
tzcnt rax, rbx ; rax = number of trailing zeros in rbx
; If rbx == 0: rax = 64
; If rbx == 1: rax = 0
; If rbx == 8: rax = 3 (since 8 = 0b1000)
These require BMI1 (CPUID: CPUID.7H:EBX[bit 3]). On AMD processors, LZCNT maps to LZCNT; on older Intel without BMI1, the F3 0F BD encoding falls back to BSR behavior (with different semantics for zero).
; Applications:
; Integer log2 (floor):
bsr rax, rdi ; rax = floor(log2(rdi)) for rdi > 0
; Or with LZCNT:
lzcnt rax, rdi ; rax = leading zeros
mov rcx, 63
sub rcx, rax ; rcx = 63 - leading_zeros = floor(log2)
; Isolate lowest set bit:
mov rcx, rdi
neg rcx ; rcx = -rdi (two's complement)
and rcx, rdi ; rcx = rdi & (-rdi) = lowest set bit only
; This works because -x = ~x + 1, so x & (-x) cancels everything except the lowest bit
; Round up to next power of 2:
bsr rax, rdi ; find floor(log2)
jz .is_zero
; If rdi is already a power of 2, return rdi; else return 1 << (floor(log2) + 1)
lea rcx, [rax + 1] ; rcx = floor(log2) + 1
mov rax, 1
shl rax, cl ; rax = 2^(floor(log2)+1)
POPCNT: Count Set Bits
; POPCNT dst, src: count number of 1 bits in src
popcnt rax, rbx ; rax = popcount(rbx)
popcnt rax, [rbx] ; rax = popcount(memory_qword)
POPCNT is defined when the input is 0 (result = 0). It is available on all modern x86-64 processors (check CPUID feature flag).
; Applications:
; Hamming weight (number of 1 bits in a number)
popcnt rax, rdi ; rax = number of set bits in rdi
; Population count for a bit array:
; Count set bits in 64-byte bitset at [rdi]
xor eax, eax
mov rcx, 8 ; 8 × 8-byte = 64 bytes
.loop:
popcnt r8, [rdi + rcx*8 - 8]
add rax, r8
dec rcx
jnz .loop
; Parity (even or odd number of set bits):
popcnt rax, rdi
and rax, 1 ; 0 = even parity, 1 = odd parity
Software Popcount for Comparison
Before POPCNT (or on processors without it), the standard software implementation uses the "parallel counting" technique:
; Software popcount of RDI → RAX
; Uses the parallel bit-counting algorithm
popcount_software:
mov rax, rdi
; Step 1: count pairs (2-bit groups)
mov rcx, 0x5555555555555555 ; 01010101...
mov rdx, rax
shr rdx, 1
and rdx, rcx
and rax, rcx
add rax, rdx
; Step 2: count nibbles (4-bit groups)
mov rcx, 0x3333333333333333 ; 00110011...
mov rdx, rax
shr rdx, 2
and rdx, rcx
and rax, rcx
add rax, rdx
; ... (continue for 4→8→16→32→64 bit groups)
; Step 3: count bytes (8-bit groups)
mov rcx, 0x0F0F0F0F0F0F0F0F
mov rdx, rax
shr rdx, 4
and rdx, rcx
and rax, rcx
add rax, rdx
; Sum all byte counts
imul rax, 0x0101010101010101
shr rax, 56 ; extract high byte = sum of all bytes
ret
The POPCNT instruction replaces this entire function with one instruction. Always prefer POPCNT on hardware that has it.
BMI1 and BMI2: The Advanced Bit Manipulation Sets
BMI1 Instructions
Available on Intel Haswell (2013) and AMD Piledriver:
; ANDN: ~src1 & src2
andn rax, rdx, rcx ; rax = (~rdx) & rcx
; BEXTR: extract bit field
; BEXTR dst, src, control — control[7:0] = start, control[15:8] = length
mov rcx, (8 << 8) | 4 ; extract 8 bits starting at bit 4
bextr rax, rdi, rcx ; rax = bits 11:4 of rdi, shifted to bits 7:0
; BLSI: isolate lowest set bit
blsi rax, rdi ; rax = rdi & (-rdi) = lowest 1 bit of rdi
; ZF=1 if rdi was 0
; BLSMSK: mask from lowest set bit
blsmsk rax, rdi ; rax = rdi XOR (rdi-1) = all bits from bit 0 to lowest 1 bit
; BLSR: reset lowest set bit
blsr rax, rdi ; rax = rdi & (rdi-1) = rdi with lowest 1 bit cleared
; If rdi is a power of 2: rax = 0
; Use case: iterate over all set bits in a value
The BLSR application for iterating set bits:
; Process each set bit in RDI:
.loop:
test rdi, rdi
jz .done
blsi rax, rdi ; rax = isolated lowest bit of rdi
; ... process bit position using TZCNT rax ...
blsr rdi, rdi ; clear that bit from rdi
jmp .loop
.done:
BMI2 Instructions
Available on Intel Haswell and AMD Excavator:
; BZHI: zero high bits from a specified position
bzhi rax, rdi, rdx ; rax = rdi with bits above position rdx zeroed
; rax = rdi & ((1 << rdx) - 1) -- extract low rdx bits
; PEXT: parallel bits extract (gather specified bits)
mov rdx, 0x0000FF00 ; mask: which bits to extract
pext rax, rdi, rdx ; rax = bits 15:8 of rdi, compacted to bits 7:0
; Essentially: for each set bit i in mask, extract bit i from src and pack consecutively
; PDEP: parallel bits deposit (scatter specified bits)
mov rdx, 0x0000FF00 ; mask: where to place bits
pdep rax, rdi, rdx ; for each bit i in mask, place bit from src at position i
; PEXT and PDEP are inverses of each other
; RORX: rotate right without modifying flags (useful in chains)
rorx rax, rdi, 8 ; rax = rotate_right(rdi, 8), no flags modified
; SARX, SHLX, SHRX: shifts with register count, no flags modification
sarx rax, rdi, rdx ; rax = rdi SAR rdx (rdx = shift count, no flags)
shlx rax, rdi, rdx ; shift left
shrx rax, rdi, rdx ; shift right logical
🔍 Under the Hood: PEXT and PDEP are used in hardware-optimized implementations of bit interleaving (Morton codes for spatial indexing), CRC computation, and certain compression algorithms. On hardware that supports them, they replace software loops that would otherwise test and shift bits one at a time.
XOR Tricks: Classic Bit Manipulation Idioms
Check if a Value is a Power of 2
; A power of 2 has exactly one set bit: 0001, 0010, 0100, 1000, ...
; x & (x-1) clears the lowest set bit:
; - If x is a power of 2, only bit is the lowest set bit, result = 0
; - Otherwise, result != 0
; Also: x must be nonzero
; rdi = x
test rdi, rdi ; is x == 0?
jz .not_power_of_2
mov rax, rdi
sub rax, 1 ; rax = x - 1
test rdi, rax ; x & (x-1) == 0?
jz .is_power_of_2
Or with BLSR:
; blsr rax, rdi sets ZF=1 if rdi is a power of 2 (or 0)
blsr rax, rdi ; rax = rdi with lowest bit cleared; ZF=1 if rdi had exactly 1 bit
jz .is_power_of_2_or_zero
Clear the Lowest Set Bit
; x & (x-1) clears the lowest set bit:
mov rax, rdi
and rax, [rdi - 1] ; WRONG: that tries to access memory!
; Correct:
mov rax, rdi
lea rcx, [rdi - 1] ; rcx = rdi - 1
and rax, rcx ; rax = rdi & (rdi-1)
; Or:
blsr rax, rdi ; BMI1: equivalent, one instruction
Isolate the Lowest Set Bit
; x & (-x) isolates the lowest set bit:
mov rax, rdi
neg rax ; rax = -rdi (two's complement)
and rax, rdi ; rax = rdi & (-rdi) = isolated lowest bit
; Or:
blsi rax, rdi ; BMI1: equivalent, also sets ZF if rdi=0
XOR Swap (Without Temporary)
; Swap rdi and rsi without using a third register:
xor rdi, rsi
xor rsi, rdi ; rsi = original rdi
xor rdi, rsi ; rdi = original rsi
⚠️ Common Mistake: The XOR swap fails if both operands are the same variable (same register or same memory address).
xor rax, raxzeroes RAX; the three-step swap would zero out both values. Never use XOR swap when source and destination may alias.
Packing Flags into a Single Register
Hardware registers, network protocol headers, and file format fields often pack multiple boolean flags into a single byte or word:
; Define bit positions for a flags register:
FLAG_READABLE equ (1 << 0) ; bit 0
FLAG_WRITABLE equ (1 << 1) ; bit 1
FLAG_EXECUTABLE equ (1 << 2) ; bit 2
FLAG_DIRTY equ (1 << 3) ; bit 3
FLAG_PRESENT equ (1 << 4) ; bit 4
; Check if a page is present and writable:
test rax, (FLAG_PRESENT | FLAG_WRITABLE)
; ZF=0 if either flag is set (not exactly what we want for AND check)
; Better: use AND and compare
mov rcx, (FLAG_PRESENT | FLAG_WRITABLE)
and rcx, rax
cmp rcx, (FLAG_PRESENT | FLAG_WRITABLE)
je .both_flags_set
; Set the writable flag:
or rax, FLAG_WRITABLE
; Clear the dirty flag:
and rax, ~FLAG_DIRTY ; NASM: and rax, 0xFFFFFFFFFFFFFFE7
; Toggle the executable flag:
xor rax, FLAG_EXECUTABLE
The XOR Cipher: Anchor Example Part 1
XOR is reversible: a XOR b XOR b = a. This is the basis of every stream cipher: XOR the plaintext with a keystream, transmit the ciphertext. The receiver XORs with the same keystream to recover the plaintext.
A simple XOR cipher (pedagogical only — not secure for fixed keys):
; xor_encrypt(uint8_t *data, size_t len, uint8_t *key, size_t key_len)
; Encrypts/decrypts data in-place with a repeating XOR key
; RDI = data, RSI = len, RDX = key, RCX = key_len
section .text
global xor_encrypt
xor_encrypt:
push rbp
mov rbp, rsp
push rbx ; data index
push r12 ; key index
push r13 ; key_len
push r14 ; len
push r15 ; key pointer
sub rsp, 8 ; alignment (5 pushes + rbp = 6*8=48 after entry's 8 = net 40 from adjusted rsp)
; entry rsp ≡ 8; -8(rbp)→0; -8(rbx)→8; -8(r12)→0; -8(r13)→8; -8(r14)→0; -8(r15)→8; sub 8→0. Good.
xor ebx, ebx ; i = 0 (data index)
xor r12d, r12d ; j = 0 (key index)
mov r13, rcx ; key_len
mov r14, rsi ; len
mov r15, rdx ; key pointer
test r14, r14 ; len == 0?
jz .done
.loop:
; Load data byte, key byte
movzx eax, byte [rdi + rbx] ; al = data[i]
movzx ecx, byte [r15 + r12] ; cl = key[j]
; XOR encrypt
xor al, cl ; al = data[i] ^ key[j]
mov [rdi + rbx], al ; data[i] = result
; Advance data index
inc rbx
; Advance key index with wrapping
inc r12
cmp r12, r13
jl .no_wrap
xor r12d, r12d ; j = 0 (wrap around)
.no_wrap:
; Check if we're done
cmp rbx, r14
jl .loop
.done:
add rsp, 8
pop r15
pop r14
pop r13
pop r12
pop rbx
pop rbp
ret
Why XOR is the Foundation of Cryptography
XOR has three properties that make it cryptographically useful:
- Self-inverse:
a XOR b XOR b = a. The same operation decrypts and encrypts. - No bias: Each output bit is 1 if and only if the corresponding input bits differ. XOR with a uniformly random key produces a uniformly random output regardless of the input.
- Composability: Multiple layers of XOR with independent keys produce a net effect of XOR with the XOR of all the keys.
The weakness of a simple XOR cipher: if the key is shorter than the data and repeats, known-plaintext attacks become trivial. If two messages are encrypted with the same key, XORing the two ciphertexts gives plaintext XOR plaintext, eliminating the key entirely.
AES-NI (Chapter 15) uses XOR in every encryption round — the AESENC instruction applies a nonlinear S-Box substitution, a row permutation, a column mixing operation, and an XOR with the round key. XOR is always the last step. The S-Box is what provides the non-linearity that makes AES secure where simple XOR is not.
🔐 Security Note: Never use a simple XOR cipher for real encryption. It is secure only as a one-time pad (key as long as message, used once, truly random) — which is impractical in almost all real applications. The purpose of this implementation is to understand XOR's role in symmetric cryptography before upgrading to AES-NI in Chapter 15.
Bit Manipulation for Real Applications
File Permissions (POSIX)
; POSIX permission bits for a file mode
; rwxrwxrwx = bits 8:0
OWNER_READ equ (1 << 8) ; 0400
OWNER_WRITE equ (1 << 7) ; 0200
OWNER_EXEC equ (1 << 6) ; 0100
GROUP_READ equ (1 << 5) ; 0040
; ... etc.
; Check if owner can read:
test rax, OWNER_READ
jnz .owner_can_read
; Set read+write for owner, read for group:
or rax, (OWNER_READ | OWNER_WRITE | GROUP_READ) ; rax |= 0640
; Remove execute permission for all:
and rax, ~(OWNER_EXEC | (1 << 3) | (1 << 0))
Network Protocol Fields
; IPv4 header bit manipulation:
; IHL (Internet Header Length) is bits 3:0 of the first byte
; Version is bits 7:4 of the first byte
; Extract IHL from first byte in AL:
movzx eax, al
and eax, 0x0F ; isolate low 4 bits = IHL
shl eax, 2 ; IHL * 4 = header length in bytes
; Extract version:
movzx ecx, al ; al = first byte of IP header
shr ecx, 4 ; version in bits 3:0 of ecx
Register Trace: Bit Manipulation Operations
; Starting with RAX = 0x1234567890ABCDEF
mov rax, 0x1234567890ABCDEF
; RAX = 0x1234567890ABCDEF = 0001 0010 0011 0100 ... 1100 1101 1110 1111
; Isolate low byte:
and rax, 0xFF
; RAX = 0x00000000000000EF
; Reset and set bit 16:
mov rax, 0x1234567890ABCDEF
or rax, (1 << 16)
; RAX = 0x123456789**1**ABCDEF (bit 16 was already 0, now set → ABCDEF | 0x10000 = 0x00ABCDEF | 0x10000)
; More precisely: bit 16 of 0x90ABCDEF: 0x90ABCDEF & 0x10000 = 0 (bit 16 was 0)
; After OR: 0x90ABCDEF | 0x10000 = 0x90BBCDEF? Let's compute: 0xABCDEF | 0x10000 = 0xBBCDEF
; Actually: 0x1234567890ABCDEF | 0x0000000000010000 = 0x1234567890BBCDEF
; Popcount:
mov rax, 0xFF00FF00FF00FF00 ; 32 set bits
popcnt rbx, rax
; RBX = 32
; BSF (find lowest set bit):
mov rax, 0x0000000000000100 ; bit 8 is the only set bit
bsf rcx, rax
; RCX = 8 (bit 8 is the lowest and only set bit)
🛠️ Lab Exercise: In GDB, run this sequence and verify each step. After
popcnt rbx, rax, confirm thatrbx = 32. For the BSF, tryrax = 0x8000000000000001and verify that BSF gives 0 (the LSB, not the MSB). Then try BSR to find the MSB position (63).
Summary
Bit manipulation is the language of hardware interfaces, protocol processing, and cryptography. AND isolates and clears, OR sets, XOR toggles and forms the foundation of encryption. BT/BTS/BTR/BTC operate on single bits (or arbitrary positions in bit arrays). BSF/BSR/LZCNT/TZCNT find bit positions; POPCNT counts them. BMI1/BMI2 add ANDN, BLSI, BLSR, PEXT, PDEP, and the flag-free shift/rotate variants.
The XOR cipher introduced here is the first step toward the AES-NI implementation in Chapter 15. XOR's self-inverse property and its composability are what make it the kernel of symmetric cryptography.