Case Study 3.2: CPUID — Querying the CPU About Itself

A complete NASM program that reads the processor brand string and checks for SSE4.2, AVX2, and AES-NI


Overview

The CPUID instruction is one of x86-64's most unusual features: it's a self-reporting mechanism that lets software query the processor's capabilities, vendor, and version. Before using AES-NI for encryption, AVX2 for vectorized loops, or any other optional extension, you should check CPUID to confirm the current CPU supports it. This case study builds a complete NASM program that enumerates CPU features and demonstrates how runtime capability detection is implemented in real programs.


How CPUID Works

CPUID takes a "leaf" value in EAX (and sometimes a "sub-leaf" in ECX) and returns information in EAX, EBX, ECX, and EDX. The leaf value determines what information is returned.

Key CPUID leaves:

EAX (leaf)      ECX (sub-leaf)   Information returned
0               -                Maximum basic leaf; vendor string in EBX, EDX, ECX (in that order)
1               -                Processor version info; feature flags
4               0, 1, 2, ...     Cache topology
7               0                Extended feature flags (AVX2, SHA, etc.)
7               1                More extended features (AVX-VNNI, SHA512, etc.)
0x80000000      -                Maximum extended leaf
0x80000002-4    -                Processor brand string (48 chars)
0x80000008      -                Virtual/physical address size

Important: CPUID is a serializing instruction — it completes all previous instructions and prevents subsequent instructions from executing until CPUID completes. This makes it expensive (hundreds of cycles) but safe for one-time capability detection at startup.
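As a minimal illustration of the leaf mechanism, the sketch below queries leaf 0 and prints the 12-character vendor string. It is a standalone example rather than part of cpuid_demo.asm; the file name and the vendor buffer are assumptions made for illustration.

```nasm
; vendor.asm (hypothetical file name)
; Build: nasm -f elf64 vendor.asm -o vendor.o && ld vendor.o -o vendor

section .bss
    vendor      resb 13             ; 12 characters + newline

section .text
    global _start

_start:
    xor  eax, eax                   ; leaf 0
    cpuid                           ; EAX = max basic leaf, EBX/EDX/ECX = vendor

    lea  rdi, [rel vendor]
    mov  [rdi],     ebx             ; characters 0-3  ("Genu" or "Auth")
    mov  [rdi + 4], edx             ; characters 4-7  ("ineI" or "enti")
    mov  [rdi + 8], ecx             ; characters 8-11 ("ntel" or "cAMD")
    mov  BYTE [rdi + 12], 10        ; trailing newline

    mov  eax, 1                     ; sys_write
    mov  edi, 1                     ; stdout
    lea  rsi, [rel vendor]
    mov  edx, 13                    ; 12 characters + newline
    syscall

    mov  eax, 60                    ; sys_exit
    xor  edi, edi
    syscall
```

Note the register order: the string is assembled from EBX, then EDX, then ECX, which is why "GenuineIntel" and "AuthenticAMD" come out readable.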


The Complete Program

; cpuid_demo.asm
; Demonstrates CPUID: reads brand string, checks SSE4.2/AES-NI/AVX/AVX2/AVX-512F/SHA
; Build: nasm -f elf64 cpuid_demo.asm -o cpuid_demo.o && ld cpuid_demo.o -o cpuid_demo

section .data
    ; Labels and messages
    brand_label     db "CPU: ", 0
    brand_label_len equ $ - brand_label - 1     ; exclude null

    newline         db 10

    feature_labels:
    .sse42          db "SSE4.2    : ", 0
    .aesni          db "AES-NI    : ", 0
    .avx            db "AVX       : ", 0
    .avx2           db "AVX2      : ", 0
    .avx512f        db "AVX-512F  : ", 0
    .sha            db "SHA Ext.  : ", 0

    msg_yes         db "YES", 10
    msg_yes_len     equ $ - msg_yes
    msg_no          db "NO ", 10
    msg_no_len      equ $ - msg_no

section .bss
    ; 48 bytes for brand string + null terminator
    brand_string    resb 49

section .text
    global _start

; ============================================================
; print_cstr: print a null-terminated string
; Args: rdi = pointer to string
; Clobbers: rax, rcx, rdx, rsi, r11 (rdi is preserved; syscall overwrites rcx and r11)
; ============================================================
print_cstr:
    push rdi
    ; calculate length by scanning for null
    xor  ecx, ecx
.len_loop:
    cmp  BYTE [rdi + rcx], 0
    je   .len_done
    inc  ecx
    jmp  .len_loop
.len_done:
    ; rdi = string, rcx = length
    mov  rsi, rdi        ; buffer
    mov  rdx, rcx        ; length
    mov  rax, 1          ; sys_write
    mov  rdi, 1          ; stdout
    syscall
    pop  rdi
    ret

; ============================================================
; print_yes_no: print YES or NO followed by newline
; Args: rdi = 1 for yes, 0 for no
; Clobbers: rax, rcx, rdx, rsi, rdi, r11
; ============================================================
print_yes_no:
    test rdi, rdi
    jz   .no
    mov  rax, 1
    mov  rdi, 1
    lea  rsi, [rel msg_yes]
    mov  rdx, msg_yes_len
    syscall
    ret
.no:
    mov  rax, 1
    mov  rdi, 1
    lea  rsi, [rel msg_no]
    mov  rdx, msg_no_len
    syscall
    ret

; ============================================================
; get_brand_string: fill brand_string buffer with CPU brand
; Uses CPUID leaves 0x80000002, 0x80000003, 0x80000004
; Each leaf returns 16 bytes (EAX, EBX, ECX, EDX = 4×4 bytes)
; Clobbers: rax, rcx, rdx, rdi (rbx is saved and restored)
; ============================================================
get_brand_string:
    push rbx                ; callee-saved: CPUID clobbers EBX

    ; First, check if extended CPUID leaves are available
    mov  eax, 0x80000000
    cpuid
    cmp  eax, 0x80000004    ; check max extended leaf
    jb   .no_brand          ; if < 0x80000004, brand string not available

    lea  rdi, [rel brand_string]

    ; Leaf 0x80000002: first 16 bytes
    mov  eax, 0x80000002
    cpuid
    mov  [rdi],      eax    ; bytes 0-3
    mov  [rdi + 4],  ebx    ; bytes 4-7
    mov  [rdi + 8],  ecx    ; bytes 8-11
    mov  [rdi + 12], edx    ; bytes 12-15

    ; Leaf 0x80000003: second 16 bytes
    mov  eax, 0x80000003
    cpuid
    mov  [rdi + 16], eax
    mov  [rdi + 20], ebx
    mov  [rdi + 24], ecx
    mov  [rdi + 28], edx

    ; Leaf 0x80000004: third 16 bytes
    mov  eax, 0x80000004
    cpuid
    mov  [rdi + 32], eax
    mov  [rdi + 36], ebx
    mov  [rdi + 40], ecx
    mov  [rdi + 44], edx

    ; The brand string is null-terminated within the 48 bytes;
    ; the 49th buffer byte (zeroed .bss) guarantees termination regardless
    pop  rbx
    ret

.no_brand:
    ; Fill with "Unknown CPU"
    lea  rdi, [rel brand_string]
    mov  DWORD [rdi],     0x6E6B6E55   ; "Unkn" (little-endian)
    mov  DWORD [rdi + 4], 0x206E776F   ; "own "
    mov  DWORD [rdi + 8], 0x00555043   ; "CPU\0"
    pop  rbx
    ret

; ============================================================
; check_features: check and print CPU feature flags
; Uses CPUID leaf 1 (ECX) for SSE4.2, AES-NI, AVX
; Uses CPUID leaf 7, sub-leaf 0 (EBX) for AVX2, AVX-512F, SHA
; ============================================================
check_features:
    push r12                ; callee-saved: will hold leaf-1 ECX
    push r13                ; callee-saved: will hold leaf-7 EBX
    push rbx                ; callee-saved: CPUID clobbers EBX

    ; === CPUID Leaf 1 ===
    mov  eax, 1
    cpuid
    ; EAX = version info, EBX = brand/clflush/apicid, ECX = feature flags
    mov  r12d, ecx          ; save ECX (leaf-1 feature flags)
    ; EDX also has feature flags (older ones: SSE, SSE2, etc.)

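    ; === CPUID Leaf 7, Sub-leaf 0 ===
    ; Leaf 7 is only meaningful if the maximum basic leaf (leaf 0, EAX) is >= 7
    xor  r13d, r13d         ; default: no leaf-7 features
    xor  eax, eax           ; leaf 0
    cpuid
    cmp  eax, 7
    jb   .leaf7_done        ; leaf 7 not implemented: leave r13d = 0
    mov  eax, 7
    xor  ecx, ecx           ; sub-leaf 0
    cpuid
    mov  r13d, ebx          ; save EBX (leaf-7 feature flags)
.leaf7_done: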

    ; === Print SSE4.2 (leaf 1, ECX bit 20) ===
    lea  rdi, [rel feature_labels.sse42]
    call print_cstr

    xor  rdi, rdi
    test r12, (1 << 20)     ; bit 20 of leaf-1 ECX = SSE4.2
    setnz dil               ; set to 1 if bit is set, 0 otherwise
    call print_yes_no

    ; === Print AES-NI (leaf 1, ECX bit 25) ===
    lea  rdi, [rel feature_labels.aesni]
    call print_cstr

    xor  rdi, rdi
    test r12, (1 << 25)     ; bit 25 = AES-NI
    setnz dil
    call print_yes_no

    ; === Print AVX (leaf 1, ECX bit 28) ===
    lea  rdi, [rel feature_labels.avx]
    call print_cstr

    xor  rdi, rdi
    test r12, (1 << 28)     ; bit 28 = AVX
    setnz dil
    call print_yes_no

    ; === Print AVX2 (leaf 7, EBX bit 5) ===
    lea  rdi, [rel feature_labels.avx2]
    call print_cstr

    xor  rdi, rdi
    test r13, (1 << 5)      ; bit 5 = AVX2
    setnz dil
    call print_yes_no

    ; === Print AVX-512F (leaf 7, EBX bit 16) ===
    lea  rdi, [rel feature_labels.avx512f]
    call print_cstr

    xor  rdi, rdi
    test r13, (1 << 16)     ; bit 16 = AVX-512 Foundation
    setnz dil
    call print_yes_no

    ; === Print SHA Extensions (leaf 7, EBX bit 29) ===
    lea  rdi, [rel feature_labels.sha]
    call print_cstr

    xor  rdi, rdi
    test r13, (1 << 29)     ; bit 29 = SHA extensions
    setnz dil
    call print_yes_no

    pop  rbx
    pop  r13
    pop  r12
    ret

; ============================================================
; _start: main entry point
; ============================================================
_start:
    ; Get and print brand string
    call get_brand_string

    ; Print "CPU: " label
    mov  rax, 1
    mov  rdi, 1
    lea  rsi, [rel brand_label]
    mov  rdx, brand_label_len
    syscall

    ; Print the brand string
    lea  rdi, [rel brand_string]
    call print_cstr

    ; Print newline
    mov  rax, 1
    mov  rdi, 1
    lea  rsi, [rel newline]
    mov  rdx, 1
    syscall

    ; Check and print features
    call check_features

    ; Exit
    mov  rax, 60
    xor  rdi, rdi
    syscall

Sample Output

Running on an Intel Core i9-12900K:

CPU: 12th Gen Intel(R) Core(TM) i9-12900K
SSE4.2    : YES
AES-NI    : YES
AVX       : YES
AVX2      : YES
AVX-512F  : NO
SHA Ext.  : YES

Running on an AMD Ryzen 9 7950X (Zen 4):

CPU: AMD Ryzen 9 7950X 16-Core Processor
SSE4.2    : YES
AES-NI    : YES
AVX       : YES
AVX2      : YES
AVX-512F  : YES
SHA Ext.  : YES

Key Technical Notes

Why CPUID Clobbers EBX

The CPUID instruction writes to all four of EAX, EBX, ECX, and EDX (and, in 64-bit mode, zeroes the upper halves of RAX, RBX, RCX, and RDX). RBX is callee-saved in the System V ABI, so a function that executes CPUID must save RBX beforehand and restore it afterwards:

push rbx          ; save callee-saved register
mov  eax, 7
xor  ecx, ecx
cpuid
mov  r13d, ebx    ; save the CPUID result before restoring rbx
pop  rbx          ; restore callee-saved register

Copy any result you need out of EBX (here into R13D) before the pop, because restoring RBX discards it. The same applies before a second CPUID, which overwrites all four registers again.
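One way to keep the RBX bookkeeping in a single place is a small wrapper that follows the System V calling convention. The sketch below is an assumption, not part of cpuid_demo.asm: the name cpuid_query, its argument order, the regs buffer, and the file name are all hypothetical.

```nasm
; cpuid_wrap.asm (hypothetical file name)
; Build: nasm -f elf64 cpuid_wrap.asm -o cpuid_wrap.o && ld cpuid_wrap.o -o cpuid_wrap

section .bss
    regs        resd 4              ; EAX, EBX, ECX, EDX results

section .text
    global _start

; cpuid_query (hypothetical helper)
; Args: edi = leaf, esi = sub-leaf, rdx = pointer to 16-byte output buffer
; Clobbers: rax, rcx, rdx, r8; preserves rbx
cpuid_query:
    push rbx                        ; RBX is callee-saved, CPUID overwrites it
    mov  r8, rdx                    ; keep the pointer: CPUID overwrites EDX too
    mov  eax, edi                   ; leaf
    mov  ecx, esi                   ; sub-leaf
    cpuid
    mov  [r8],      eax
    mov  [r8 + 4],  ebx             ; copy EBX out before restoring RBX
    mov  [r8 + 8],  ecx
    mov  [r8 + 12], edx
    pop  rbx
    ret

_start:
    xor  edi, edi                   ; leaf 0
    xor  esi, esi                   ; sub-leaf 0 (ignored by leaf 0)
    lea  rdx, [rel regs]
    call cpuid_query

    mov  edi, [rel regs]            ; exit status = max basic leaf (low 8 bits)
    mov  eax, 60                    ; sys_exit
    syscall
```

Running ./cpuid_wrap followed by echo $? shows the maximum basic leaf modulo 256. Because the wrapper uses the standard argument registers, it could also be called from C, assuming a matching prototype is declared there.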

The SETNZ Idiom

test r12, (1 << 25)     ; sets ZF if bit 25 is 0; clears ZF if bit 25 is 1
setnz dil               ; set DIL to 1 if ZF=0 (bit was set), 0 if ZF=1 (bit was clear)

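SETNZ (Set if Not Zero) writes 1 to a byte register if the Zero Flag is clear and 0 if it is set. It writes only the low 8 bits, which is why the listing zeroes RDI with XOR first; the XOR must come before the TEST because XOR itself modifies the flags. This branchless pattern converts a flag bit into a 0/1 value without a conditional jump.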
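The idiom can be tried in isolation with a program that turns one feature bit into its exit status. This is a minimal sketch with a hypothetical file name; it reports 1 when the AES-NI bit is set, which deliberately inverts the usual shell convention that 0 means success.

```nasm
; has_aesni.asm (hypothetical file name)
; Build: nasm -f elf64 has_aesni.asm -o has_aesni.o && ld has_aesni.o -o has_aesni

section .text
    global _start

_start:
    mov  eax, 1                     ; leaf 1: feature flags in ECX/EDX
    cpuid

    xor  edi, edi                   ; zero EDI first (XOR changes the flags)
    test ecx, (1 << 25)             ; ZF = 0 if the AES-NI bit is set
    setnz dil                       ; DIL = 1 if set, 0 if clear

    mov  eax, 60                    ; sys_exit, status = EDI (0 or 1)
    syscall
```

Run it with ./has_aesni; echo $? to see the 0/1 result.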

The Brand String Encoding

The brand string is 48 bytes of ASCII text spread across three CPUID leaves. Each leaf supplies 16 bytes in the order EAX, EBX, ECX, EDX, and within each register the first character sits in the least significant byte.

For example, if a brand string starts with "Inte" (ASCII: 49 6E 74 65), then EAX at leaf 0x80000002 = 0x65746E49.

When you write EAX to memory with mov [buffer], eax, the little-endian store places the least significant byte first: 0x49 at offset 0, 0x6E at offset 1, and so on. The buffer therefore reads "Inte" from left to right, so plain 32-bit stores reassemble the string with no byte swapping.
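The byte order can be checked without CPUID at all. The sketch below loads the example constant into EAX, stores it, and writes the resulting bytes to stdout; the buffer name and file name are assumptions for illustration.

```nasm
; endian_demo.asm (hypothetical file name)
; Build: nasm -f elf64 endian_demo.asm -o endian_demo.o && ld endian_demo.o -o endian_demo

section .bss
    buf         resb 5              ; 4 characters + newline

section .text
    global _start

_start:
    mov  eax, 0x65746E49            ; the value from the example above
    mov  [rel buf], eax             ; stores 49 6E 74 65 at buf+0 .. buf+3
    mov  BYTE [rel buf + 4], 10     ; newline

    mov  eax, 1                     ; sys_write
    mov  edi, 1                     ; stdout
    lea  rsi, [rel buf]
    mov  edx, 5
    syscall                         ; prints "Inte" followed by a newline

    mov  eax, 60                    ; sys_exit
    xor  edi, edi
    syscall
```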


Practical Applications

Runtime Dispatch in Real Code

Production code uses CPUID at startup to choose the fastest available implementation:

; At program startup:
call detect_features

; Later, when calling a performance-critical function:
cmp  BYTE [rel has_avx2], 1
je   use_avx2_memcpy

cmp  BYTE [rel has_sse4_2], 1
je   use_sse42_memcpy

jmp  use_scalar_memcpy

Variants of this pattern appear in glibc (for memcpy, strlen, strcmp), OpenSSL (for AES), and many other performance-sensitive libraries.
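The snippet above leaves detect_features and the has_* flags undefined. A minimal sketch of how they might look follows; the body is an assumption, not the implementation used by any particular library. It also adds a step that the feature bits alone do not cover: before AVX2 code may run, the OS must have enabled YMM state saving, which is checked through the OSXSAVE bit and XGETBV.

```nasm
; detect.asm (hypothetical file name)
; Build: nasm -f elf64 detect.asm -o detect.o && ld detect.o -o detect

section .bss
    has_sse4_2  resb 1
    has_avx2    resb 1

section .text
    global _start

; detect_features: set has_sse4_2 and has_avx2 to 0 or 1
; Clobbers: rax, rcx, rdx, r8; preserves rbx
detect_features:
    push rbx                        ; CPUID overwrites EBX

    mov  eax, 1
    cpuid
    mov  r8d, ecx                   ; keep leaf-1 ECX

    test r8d, (1 << 20)             ; SSE4.2
    setnz BYTE [rel has_sse4_2]

    ; AVX2 needs OSXSAVE (bit 27) and AVX (bit 28) in leaf-1 ECX,
    ; XCR0 bits 1 and 2 (OS saves XMM and YMM state),
    ; and leaf-7 EBX bit 5.
    mov  BYTE [rel has_avx2], 0
    mov  eax, r8d
    and  eax, (1 << 27) | (1 << 28)
    cmp  eax, (1 << 27) | (1 << 28)
    jne  .done                      ; XGETBV would fault without OSXSAVE

    xor  ecx, ecx                   ; select XCR0
    xgetbv                          ; EDX:EAX = XCR0
    and  eax, 0x6
    cmp  eax, 0x6
    jne  .done

    xor  eax, eax                   ; leaf 0: maximum basic leaf
    cpuid
    cmp  eax, 7
    jb   .done

    mov  eax, 7
    xor  ecx, ecx                   ; sub-leaf 0
    cpuid
    test ebx, (1 << 5)              ; AVX2
    setnz BYTE [rel has_avx2]

.done:
    pop  rbx
    ret

_start:
    call detect_features
    movzx edi, BYTE [rel has_avx2]  ; exit status = has_avx2 (0 or 1)
    mov  eax, 60                    ; sys_exit
    syscall
```

Storing the results as bytes keeps the later dispatch to a single cmp per candidate. A real library would more likely store a function pointer once and call through it.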

IFUNC (Indirect Function)

On Linux, the ELF format supports STT_GNU_IFUNC: indirect functions whose address is chosen by a resolver that the dynamic linker runs when it resolves the symbol (at load time, or at the first call with lazy binding). This is how glibc's memcpy automatically uses AVX-512 on systems that support it and falls back to SSE2 on older CPUs. After resolution, each call goes straight to the selected implementation with no per-call feature test.


What to Build Next

The CPUID check for AES-NI support is the first step in the XOR→AES-NI encryption tool introduced in Chapter 13. Before writing any AES encryption code, you check:

; Is AES-NI available?
mov  eax, 1
cpuid
test ecx, (1 << 25)   ; AES-NI bit
jz   .no_aesni        ; fall back to software AES

If jz is not taken, you can safely use the AESENC, AESENCLAST, AESDEC, and AESDECLAST instructions, which perform AES rounds in hardware, along with AESKEYGENASSIST and AESIMC for key expansion.

This case study gives you the complete infrastructure for that detection.