7 min read

Network packet headers, hardware registers, file permission flags, color channels, encryption keys — these all pack information into individual bits. The x86-64 instruction set has a complete toolkit for operating on bits: instructions for...

Chapter 13: Bit Manipulation

Every Bit Has a Job

Network packet headers, hardware registers, file permission flags, color channels, encryption keys — these all pack information into individual bits. The x86-64 instruction set has a complete toolkit for operating on bits: instructions for isolating, setting, clearing, and toggling individual bits or groups, for finding the position of the highest or lowest set bit, for counting set bits, and the BMI1/BMI2 extension set that adds hardware acceleration for several advanced bit operations.

This chapter also introduces the first anchor example for encryption: the XOR cipher. XOR is the fundamental operation of symmetric cryptography — every AES round, every stream cipher, every one-time pad uses XOR at its core. The implementation here is pedagogical rather than secure (XOR cipher is not secure for repeated keys), but it establishes the pattern for the AES-NI implementation in Chapter 15.

Bitmask Operations: The Three Primitives

Every bit manipulation application reduces to three operations: isolate bits you care about, set bits to 1, clear bits to 0, toggle bits. These correspond exactly to AND, OR, XOR.

AND: Isolating and Clearing Bits

AND with a mask preserves bits where the mask has 1s, and forces 0s where the mask has 0s.

; Isolate the low 8 bits (mask out upper 56 bits):
and rax, 0xFF               ; rax = rax & 0x00000000000000FF

; Check if bit 4 is set (test a specific bit):
and rax, (1 << 4)           ; result is nonzero if bit 4 was set, zero if not
; Better: use TEST for this (doesn't modify rax)
test rax, (1 << 4)          ; set ZF=0 if bit 4 set, ZF=1 if clear

; Clear specific bits using NOT of mask:
; Clear bits 3, 4, 5 (mask = ~0b00111000 = 0xFFFFFFFFFFFFFFC7)
and rax, ~(7 << 3)          ; NASM constant expression: ~0b111000

; Extract a bit field: bits 12:8 (5-bit field)
and rax, 0x1F00             ; isolate bits 12:8
shr rax, 8                  ; shift down to bits 4:0

OR: Setting Bits

OR with a mask forces 1s where the mask has 1s, preserves other bits.

; Set bit 5:
or rax, (1 << 5)            ; rax |= 0x20

; Set multiple bits (bits 0, 2, 7):
or rax, (1 | 4 | 128)       ; or rax, 0x85

; Set a bit field (write 0b101 into bits 5:3):
; First clear the field, then OR in the new value
and rax, ~(0x7 << 3)        ; clear bits 5:3
or  rax, (0x5 << 3)         ; set bits 5:3 = 0b101

XOR: Toggling Bits

XOR with a mask flips bits where the mask has 1s, preserves other bits.

; Toggle bit 0:
xor rax, 1                  ; flip the LSB

; Toggle bits 3 and 7:
xor rax, ((1 << 3) | (1 << 7))  ; xor rax, 0x88

; Toggle all bits (bitwise NOT — also achievable with NOT instruction):
xor rax, 0xFFFFFFFFFFFFFFFF  ; same as: not rax

; Classic XOR swap (no temporary variable):
xor rax, rbx                ; rax = rax ^ rbx
xor rbx, rax                ; rbx = rbx ^ (rax ^ rbx) = original rax
xor rax, rbx                ; rax = (rax ^ rbx) ^ original_rax = original rbx

; Zero a register (XOR with self):
xor eax, eax                ; rax = 0 (canonical zero idiom)

NOT: Bitwise Complement

not rax                     ; rax = ~rax (flip all 64 bits)
not byte [rbx]              ; flip all bits in a memory byte

NOT does not modify flags. It is equivalent to xor rax, -1.

ANDN: Clear Bits Efficiently (BMI1)

In BMI1 (Bit Manipulation Instructions set 1, available on Haswell+), ANDN dst, src1, src2 computes dst = (~src1) & src2. This lets you clear bits without a separate NOT:

; Clear the bits set in RDX from RAX (without modifying RDX):
; Old way (two instructions):
mov  rcx, rdx
not  rcx
and  rax, rcx

; BMI1 ANDN (one instruction):
andn rax, rdx, rax          ; rax = (~rdx) & rax

Bit Test Instructions: BT, BTS, BTR, BTC

These instructions test a single bit and optionally modify it.

; BT (Bit Test): copy bit N of dst into CF
bt   rax, 5                 ; CF = bit 5 of rax (rax unchanged)
bt   rax, rcx               ; CF = bit (rcx & 63) of rax
bt   [rbx], rdi             ; CF = bit rdi of memory (arbitrary bit offset!)

; BTS (Bit Test and Set): copy bit to CF, then set it
bts  rax, 5                 ; CF = old bit 5; rax bit 5 = 1

; BTR (Bit Test and Reset): copy bit to CF, then clear it
btr  rax, 5                 ; CF = old bit 5; rax bit 5 = 0

; BTC (Bit Test and Complement): copy bit to CF, then toggle it
btc  rax, 5                 ; CF = old bit 5; rax bit 5 flipped

After BT/BTS/BTR/BTC, check CF with jc (carry set = bit was 1) or jnc (carry clear = bit was 0).

The memory form of BT is powerful and unusual: bt [rbx], rdi treats memory starting at [rbx] as a bit array of arbitrary length and tests the bit at position RDI. This is used for implementing large bitsets/bit arrays.

; Bit array: test, set, clear, toggle bit at arbitrary position
; Array at [rdi], bit index in RSI

test_bit:
    bt   [rdi], rsi         ; CF = bit at position rsi
    setc al                 ; al = CF (0 or 1)
    movzx eax, al
    ret

set_bit:
    bts  [rdi], rsi         ; set bit at position rsi (old value in CF)
    ret

clear_bit:
    btr  [rdi], rsi         ; clear bit at position rsi
    ret

toggle_bit:
    btc  [rdi], rsi         ; toggle bit at position rsi
    ret

⚡ Performance Note: The memory forms of BT/BTS/BTR/BTC with a register bit index are slow (they can access any bit in memory, which requires computation). For bit arrays where the index is computed at runtime, consider using the register form with SHR/AND to locate the byte and bit position manually — it is often faster.

Bit Scanning: BSF, BSR, LZCNT, TZCNT

These instructions find the position of set bits.

BSF: Bit Scan Forward (Lowest Set Bit)

; BSF dst, src: dst = index of lowest set bit of src
; ZF = 1 if src == 0 (result is undefined if src is zero)
bsf  rax, rbx              ; rax = index of LSB of rbx
jz   .no_bits_set          ; if rbx was 0, result is undefined

BSF has undefined behavior when the source is zero. Never use BSF without first checking that the input is nonzero, or use TZCNT instead.

BSR: Bit Scan Reverse (Highest Set Bit)

; BSR dst, src: dst = index of highest set bit of src
bsr  rax, rbx              ; rax = index of MSB of rbx (= floor(log2(rbx)) for nonzero rbx)
jz   .no_bits_set

BSR also has undefined behavior for zero input.

LZCNT and TZCNT: Better Alternatives (BMI1)

LZCNT (Leading Zero Count) and TZCNT (Trailing Zero Count) are defined for zero input and are generally preferred over BSF/BSR in new code:

; LZCNT: count leading zeros (zeros from MSB down to first 1)
lzcnt rax, rbx             ; rax = number of leading zeros in rbx
; If rbx == 0: rax = 64 (64-bit operand)
; If rbx == 1: rax = 63
; If rbx == 0x8000000000000000: rax = 0

; TZCNT: count trailing zeros (zeros from LSB up to first 1)
tzcnt rax, rbx             ; rax = number of trailing zeros in rbx
; If rbx == 0: rax = 64
; If rbx == 1: rax = 0
; If rbx == 8: rax = 3 (since 8 = 0b1000)

These require BMI1 (CPUID: CPUID.7H:EBX[bit 3]). On AMD processors, LZCNT maps to LZCNT; on older Intel without BMI1, the F3 0F BD encoding falls back to BSR behavior (with different semantics for zero).

; Applications:
; Integer log2 (floor):
bsr  rax, rdi              ; rax = floor(log2(rdi)) for rdi > 0
; Or with LZCNT:
lzcnt rax, rdi             ; rax = leading zeros
mov  rcx, 63
sub  rcx, rax              ; rcx = 63 - leading_zeros = floor(log2)

; Isolate lowest set bit:
mov  rcx, rdi
neg  rcx                    ; rcx = -rdi (two's complement)
and  rcx, rdi               ; rcx = rdi & (-rdi) = lowest set bit only
; This works because -x = ~x + 1, so x & (-x) cancels everything except the lowest bit

; Round up to next power of 2:
bsr  rax, rdi               ; find floor(log2)
jz   .is_zero
; If rdi is already a power of 2, return rdi; else return 1 << (floor(log2) + 1)
lea  rcx, [rax + 1]         ; rcx = floor(log2) + 1
mov  rax, 1
shl  rax, cl                ; rax = 2^(floor(log2)+1)

POPCNT: Count Set Bits

; POPCNT dst, src: count number of 1 bits in src
popcnt rax, rbx             ; rax = popcount(rbx)
popcnt rax, [rbx]           ; rax = popcount(memory_qword)

POPCNT is defined when the input is 0 (result = 0). It is available on all modern x86-64 processors (check CPUID feature flag).

; Applications:
; Hamming weight (number of 1 bits in a number)
popcnt rax, rdi             ; rax = number of set bits in rdi

; Population count for a bit array:
; Count set bits in 64-byte bitset at [rdi]
xor  eax, eax
mov  rcx, 8                 ; 8 × 8-byte = 64 bytes
.loop:
    popcnt r8, [rdi + rcx*8 - 8]
    add    rax, r8
    dec    rcx
    jnz    .loop

; Parity (even or odd number of set bits):
popcnt rax, rdi
and    rax, 1               ; 0 = even parity, 1 = odd parity

Software Popcount for Comparison

Before POPCNT (or on processors without it), the standard software implementation uses the "parallel counting" technique:

; Software popcount of RDI → RAX
; Uses the parallel bit-counting algorithm
popcount_software:
    mov  rax, rdi
    ; Step 1: count pairs (2-bit groups)
    mov  rcx, 0x5555555555555555  ; 01010101...
    mov  rdx, rax
    shr  rdx, 1
    and  rdx, rcx
    and  rax, rcx
    add  rax, rdx
    ; Step 2: count nibbles (4-bit groups)
    mov  rcx, 0x3333333333333333  ; 00110011...
    mov  rdx, rax
    shr  rdx, 2
    and  rdx, rcx
    and  rax, rcx
    add  rax, rdx
    ; ... (continue for 4→8→16→32→64 bit groups)
    ; Step 3: count bytes (8-bit groups)
    mov  rcx, 0x0F0F0F0F0F0F0F0F
    mov  rdx, rax
    shr  rdx, 4
    and  rdx, rcx
    and  rax, rcx
    add  rax, rdx
    ; Sum all byte counts
    imul rax, 0x0101010101010101
    shr  rax, 56                ; extract high byte = sum of all bytes
    ret

The POPCNT instruction replaces this entire function with one instruction. Always prefer POPCNT on hardware that has it.

BMI1 and BMI2: The Advanced Bit Manipulation Sets

BMI1 Instructions

Available on Intel Haswell (2013) and AMD Piledriver:

; ANDN: ~src1 & src2
andn rax, rdx, rcx          ; rax = (~rdx) & rcx

; BEXTR: extract bit field
; BEXTR dst, src, control — control[7:0] = start, control[15:8] = length
mov  rcx, (8 << 8) | 4     ; extract 8 bits starting at bit 4
bextr rax, rdi, rcx         ; rax = bits 11:4 of rdi, shifted to bits 7:0

; BLSI: isolate lowest set bit
blsi rax, rdi               ; rax = rdi & (-rdi) = lowest 1 bit of rdi
; ZF=1 if rdi was 0

; BLSMSK: mask from lowest set bit
blsmsk rax, rdi             ; rax = rdi XOR (rdi-1) = all bits from bit 0 to lowest 1 bit

; BLSR: reset lowest set bit
blsr rax, rdi               ; rax = rdi & (rdi-1) = rdi with lowest 1 bit cleared
; If rdi is a power of 2: rax = 0
; Use case: iterate over all set bits in a value

The BLSR application for iterating set bits:

; Process each set bit in RDI:
.loop:
    test rdi, rdi
    jz   .done
    blsi rax, rdi           ; rax = isolated lowest bit of rdi
    ; ... process bit position using TZCNT rax ...
    blsr rdi, rdi           ; clear that bit from rdi
    jmp  .loop
.done:

BMI2 Instructions

Available on Intel Haswell and AMD Excavator:

; BZHI: zero high bits from a specified position
bzhi rax, rdi, rdx          ; rax = rdi with bits above position rdx zeroed
; rax = rdi & ((1 << rdx) - 1)   -- extract low rdx bits

; PEXT: parallel bits extract (gather specified bits)
mov  rdx, 0x0000FF00        ; mask: which bits to extract
pext rax, rdi, rdx          ; rax = bits 15:8 of rdi, compacted to bits 7:0
; Essentially: for each set bit i in mask, extract bit i from src and pack consecutively

; PDEP: parallel bits deposit (scatter specified bits)
mov  rdx, 0x0000FF00        ; mask: where to place bits
pdep rax, rdi, rdx          ; for each bit i in mask, place bit from src at position i
; PEXT and PDEP are inverses of each other

; RORX: rotate right without modifying flags (useful in chains)
rorx rax, rdi, 8            ; rax = rotate_right(rdi, 8), no flags modified

; SARX, SHLX, SHRX: shifts with register count, no flags modification
sarx rax, rdi, rdx          ; rax = rdi SAR rdx  (rdx = shift count, no flags)
shlx rax, rdi, rdx          ; shift left
shrx rax, rdi, rdx          ; shift right logical

🔍 Under the Hood: PEXT and PDEP are used in hardware-optimized implementations of bit interleaving (Morton codes for spatial indexing), CRC computation, and certain compression algorithms. On hardware that supports them, they replace software loops that would otherwise test and shift bits one at a time.

XOR Tricks: Classic Bit Manipulation Idioms

Check if a Value is a Power of 2

; A power of 2 has exactly one set bit: 0001, 0010, 0100, 1000, ...
; x & (x-1) clears the lowest set bit:
;   - If x is a power of 2, only bit is the lowest set bit, result = 0
;   - Otherwise, result != 0
; Also: x must be nonzero

; rdi = x
test rdi, rdi               ; is x == 0?
jz   .not_power_of_2
mov  rax, rdi
sub  rax, 1                 ; rax = x - 1
test rdi, rax               ; x & (x-1) == 0?
jz   .is_power_of_2

Or with BLSR:

; blsr rax, rdi sets ZF=1 if rdi is a power of 2 (or 0)
blsr rax, rdi               ; rax = rdi with lowest bit cleared; ZF=1 if rdi had exactly 1 bit
jz   .is_power_of_2_or_zero

Clear the Lowest Set Bit

; x & (x-1) clears the lowest set bit:
mov  rax, rdi
and  rax, [rdi - 1]         ; WRONG: that tries to access memory!
; Correct:
mov  rax, rdi
lea  rcx, [rdi - 1]         ; rcx = rdi - 1
and  rax, rcx               ; rax = rdi & (rdi-1)
; Or:
blsr rax, rdi               ; BMI1: equivalent, one instruction

Isolate the Lowest Set Bit

; x & (-x) isolates the lowest set bit:
mov  rax, rdi
neg  rax                    ; rax = -rdi (two's complement)
and  rax, rdi               ; rax = rdi & (-rdi) = isolated lowest bit
; Or:
blsi rax, rdi               ; BMI1: equivalent, also sets ZF if rdi=0

XOR Swap (Without Temporary)

; Swap rdi and rsi without using a third register:
xor  rdi, rsi
xor  rsi, rdi               ; rsi = original rdi
xor  rdi, rsi               ; rdi = original rsi

⚠️ Common Mistake: The XOR swap fails if both operands are the same variable (same register or same memory address). xor rax, rax zeroes RAX; the three-step swap would zero out both values. Never use XOR swap when source and destination may alias.

Packing Flags into a Single Register

Hardware registers, network protocol headers, and file format fields often pack multiple boolean flags into a single byte or word:

; Define bit positions for a flags register:
FLAG_READABLE   equ (1 << 0)    ; bit 0
FLAG_WRITABLE   equ (1 << 1)    ; bit 1
FLAG_EXECUTABLE equ (1 << 2)    ; bit 2
FLAG_DIRTY      equ (1 << 3)    ; bit 3
FLAG_PRESENT    equ (1 << 4)    ; bit 4

; Check if a page is present and writable:
test rax, (FLAG_PRESENT | FLAG_WRITABLE)
; ZF=0 if either flag is set (not exactly what we want for AND check)
; Better: use AND and compare
mov  rcx, (FLAG_PRESENT | FLAG_WRITABLE)
and  rcx, rax
cmp  rcx, (FLAG_PRESENT | FLAG_WRITABLE)
je   .both_flags_set

; Set the writable flag:
or   rax, FLAG_WRITABLE

; Clear the dirty flag:
and  rax, ~FLAG_DIRTY         ; NASM: and rax, 0xFFFFFFFFFFFFFFE7

; Toggle the executable flag:
xor  rax, FLAG_EXECUTABLE

The XOR Cipher: Anchor Example Part 1

XOR is reversible: a XOR b XOR b = a. This is the basis of every stream cipher: XOR the plaintext with a keystream, transmit the ciphertext. The receiver XORs with the same keystream to recover the plaintext.

A simple XOR cipher (pedagogical only — not secure for fixed keys):

; xor_encrypt(uint8_t *data, size_t len, uint8_t *key, size_t key_len)
; Encrypts/decrypts data in-place with a repeating XOR key
; RDI = data, RSI = len, RDX = key, RCX = key_len

section .text
global xor_encrypt

xor_encrypt:
    push rbp
    mov  rbp, rsp
    push rbx               ; data index
    push r12               ; key index
    push r13               ; key_len
    push r14               ; len
    push r15               ; key pointer
    sub  rsp, 8            ; alignment (5 pushes + rbp = 6*8=48 after entry's 8 = net 40 from adjusted rsp)
                           ; entry rsp ≡ 8; -8(rbp)→0; -8(rbx)→8; -8(r12)→0; -8(r13)→8; -8(r14)→0; -8(r15)→8; sub 8→0. Good.

    xor  ebx, ebx          ; i = 0 (data index)
    xor  r12d, r12d        ; j = 0 (key index)
    mov  r13, rcx          ; key_len
    mov  r14, rsi          ; len
    mov  r15, rdx          ; key pointer

    test r14, r14          ; len == 0?
    jz   .done

.loop:
    ; Load data byte, key byte
    movzx eax, byte [rdi + rbx]   ; al = data[i]
    movzx ecx, byte [r15 + r12]   ; cl = key[j]

    ; XOR encrypt
    xor  al, cl                    ; al = data[i] ^ key[j]
    mov  [rdi + rbx], al           ; data[i] = result

    ; Advance data index
    inc  rbx

    ; Advance key index with wrapping
    inc  r12
    cmp  r12, r13
    jl   .no_wrap
    xor  r12d, r12d                ; j = 0 (wrap around)
.no_wrap:

    ; Check if we're done
    cmp  rbx, r14
    jl   .loop

.done:
    add  rsp, 8
    pop  r15
    pop  r14
    pop  r13
    pop  r12
    pop  rbx
    pop  rbp
    ret

Why XOR is the Foundation of Cryptography

XOR has three properties that make it cryptographically useful:

  1. Self-inverse: a XOR b XOR b = a. The same operation decrypts and encrypts.
  2. No bias: Each output bit is 1 if and only if the corresponding input bits differ. XOR with a uniformly random key produces a uniformly random output regardless of the input.
  3. Composability: Multiple layers of XOR with independent keys produce a net effect of XOR with the XOR of all the keys.

The weakness of a simple XOR cipher: if the key is shorter than the data and repeats, known-plaintext attacks become trivial. If two messages are encrypted with the same key, XORing the two ciphertexts gives plaintext XOR plaintext, eliminating the key entirely.

AES-NI (Chapter 15) uses XOR in every encryption round — the AESENC instruction applies a nonlinear S-Box substitution, a row permutation, a column mixing operation, and an XOR with the round key. XOR is always the last step. The S-Box is what provides the non-linearity that makes AES secure where simple XOR is not.

🔐 Security Note: Never use a simple XOR cipher for real encryption. It is secure only as a one-time pad (key as long as message, used once, truly random) — which is impractical in almost all real applications. The purpose of this implementation is to understand XOR's role in symmetric cryptography before upgrading to AES-NI in Chapter 15.

Bit Manipulation for Real Applications

File Permissions (POSIX)

; POSIX permission bits for a file mode
; rwxrwxrwx = bits 8:0
OWNER_READ    equ (1 << 8)   ; 0400
OWNER_WRITE   equ (1 << 7)   ; 0200
OWNER_EXEC    equ (1 << 6)   ; 0100
GROUP_READ    equ (1 << 5)   ; 0040
; ... etc.

; Check if owner can read:
test rax, OWNER_READ
jnz  .owner_can_read

; Set read+write for owner, read for group:
or   rax, (OWNER_READ | OWNER_WRITE | GROUP_READ)  ; rax |= 0640

; Remove execute permission for all:
and  rax, ~(OWNER_EXEC | (1 << 3) | (1 << 0))

Network Protocol Fields

; IPv4 header bit manipulation:
; IHL (Internet Header Length) is bits 3:0 of the first byte
; Version is bits 7:4 of the first byte

; Extract IHL from first byte in AL:
movzx eax, al
and   eax, 0x0F             ; isolate low 4 bits = IHL
shl   eax, 2                ; IHL * 4 = header length in bytes

; Extract version:
movzx ecx, al               ; al = first byte of IP header
shr   ecx, 4                ; version in bits 3:0 of ecx

Register Trace: Bit Manipulation Operations

; Starting with RAX = 0x1234567890ABCDEF

mov  rax, 0x1234567890ABCDEF
; RAX = 0x1234567890ABCDEF = 0001 0010 0011 0100 ... 1100 1101 1110 1111

; Isolate low byte:
and  rax, 0xFF
; RAX = 0x00000000000000EF

; Reset and set bit 16:
mov  rax, 0x1234567890ABCDEF
or   rax, (1 << 16)
; RAX = 0x123456789**1**ABCDEF (bit 16 was already 0, now set → ABCDEF | 0x10000 = 0x00ABCDEF | 0x10000)
; More precisely: bit 16 of 0x90ABCDEF: 0x90ABCDEF & 0x10000 = 0 (bit 16 was 0)
; After OR: 0x90ABCDEF | 0x10000 = 0x90BBCDEF? Let's compute: 0xABCDEF | 0x10000 = 0xBBCDEF
; Actually: 0x1234567890ABCDEF | 0x0000000000010000 = 0x1234567890BBCDEF

; Popcount:
mov  rax, 0xFF00FF00FF00FF00  ; 32 set bits
popcnt rbx, rax
; RBX = 32

; BSF (find lowest set bit):
mov  rax, 0x0000000000000100  ; bit 8 is the only set bit
bsf  rcx, rax
; RCX = 8 (bit 8 is the lowest and only set bit)

🛠️ Lab Exercise: In GDB, run this sequence and verify each step. After popcnt rbx, rax, confirm that rbx = 32. For the BSF, try rax = 0x8000000000000001 and verify that BSF gives 0 (the LSB, not the MSB). Then try BSR to find the MSB position (63).

Summary

Bit manipulation is the language of hardware interfaces, protocol processing, and cryptography. AND isolates and clears, OR sets, XOR toggles and forms the foundation of encryption. BT/BTS/BTR/BTC operate on single bits (or arbitrary positions in bit arrays). BSF/BSR/LZCNT/TZCNT find bit positions; POPCNT counts them. BMI1/BMI2 add ANDN, BLSI, BLSR, PEXT, PDEP, and the flag-free shift/rotate variants.

The XOR cipher introduced here is the first step toward the AES-NI implementation in Chapter 15. XOR's self-inverse property and its composability are what make it the kernel of symmetric cryptography.