Chapter 7: Your First Assembly Programs

Writing Code That Actually Does Something

The previous six chapters built the mental model. Now we use it. This chapter writes real programs: complete, runnable, debuggable assembly programs that demonstrate every fundamental instruction and pattern you'll use throughout the rest of the book.

We start with MOV — the most common x86-64 instruction, which turns out to have more nuance than it appears — and work through the arithmetic instructions, system calls, and complete programs with full register trace tables. By the end of this chapter, you'll have written programs that print text, perform arithmetic, convert numbers to ASCII, and read from standard input.

The MinOS kernel project also begins here, in the final section. The first bytes of the kernel are written.


The MOV Instruction Family

MOV is the basic data-transfer instruction: it copies data between registers, and between registers and memory, leaving the source unchanged. It is the most frequently used instruction in most programs, and it has more forms than beginners expect.

The Four MOV Forms

; Form 1: Register to Register
mov  rax, rbx           ; rax ← rbx (64-bit)
mov  eax, ebx           ; eax ← ebx (32-bit, zeroes upper 32 bits of rax!)
mov  ax, bx             ; ax ← bx  (16-bit, does NOT zero upper bits)
mov  al, bl             ; al ← bl  (8-bit, does NOT zero upper bits)

; Form 2: Immediate to Register
mov  rax, 42            ; rax ← 42 (64-bit)
mov  eax, 0xFF          ; eax ← 255 (32-bit write, zeroes upper 32 bits)
mov  rax, 0x123456789ABCDEF0   ; 64-bit immediate (MOVABS encoding, 10 bytes!)
xor  eax, eax           ; rax ← 0  (shorter than mov rax, 0)

; Form 3: Memory to Register (LOAD)
mov  rax, [rdi]         ; rax ← 8 bytes at address in rdi
mov  eax, [rdi]         ; eax ← 4 bytes at address in rdi; zeroes upper 32 of rax
mov  ax,  [rdi]         ; ax  ← 2 bytes at address in rdi; upper bits unchanged
mov  al,  [rdi]         ; al  ← 1 byte at address in rdi; upper bits unchanged
mov  rax, [rbp-8]       ; rax ← local variable (base+displacement)
mov  rax, [rdi+rcx*8]   ; rax ← arr[rcx] (base+index*scale)
mov  rax, [rdi+rcx*8+16] ; rax ← struct.field in arr (base+index*scale+disp)

; Form 4: Register to Memory (STORE)
mov  [rdi], rax         ; 8 bytes at address in rdi ← rax
mov  [rdi], eax         ; 4 bytes at address in rdi ← eax
mov  [rbp-8], rax       ; local variable ← rax
mov  QWORD [rdi], 42    ; store immediate to memory (must specify QWORD)
mov  DWORD [rdi], 42    ; store 4-byte immediate
mov  BYTE  [rdi], 'A'   ; store 1-byte immediate

What MOV Does NOT Do

; NO: memory-to-memory move does not exist
mov  [rdi], [rsi]       ; INVALID -- x86-64 has no memory-to-memory MOV

; Instead: use a register as intermediate
mov  rax, [rsi]         ; load from source
mov  [rdi], rax         ; store to destination

MOV has no memory-to-memory form in x86-64. A few special instructions do transfer data between two memory locations (for example REP MOVS, a string-move instruction covered in Chapter 14), but every ordinary data transfer between two memory locations requires a register as a waypoint.

MOV and Zero Extension

The most important behavioral property of MOV, which it shares with every instruction that writes a 32-bit register:

; CRITICAL: 32-bit write zeroes upper 32 bits
mov  rax, 0xFFFFFFFFFFFFFFFF  ; rax = 0xFFFFFFFFFFFFFFFF
mov  eax, 1                    ; rax = 0x0000000000000001 (upper zeroed!)

; 16-bit and 8-bit writes do NOT zero upper bits
mov  rax, 0xFFFFFFFFFFFFFFFF  ; rax = 0xFFFFFFFFFFFFFFFF
mov  ax, 1                     ; rax = 0xFFFFFFFFFFFF0001 (only low 16 changed)

; Zero-extension: use movzx
movzx rax, bl                  ; rax ← zero-extend BL to 64 bits

; Sign-extension: use movsx
movsx rax, bl                  ; rax ← sign-extend BL to 64 bits

ADD and SUB

ADD and SUB perform integer addition and subtraction, setting the flags as described in Chapter 2.

; ADD forms
add  rax, rbx           ; rax ← rax + rbx  (sets CF, OF, SF, ZF, PF, AF)
add  rax, 100           ; rax ← rax + 100  (immediate)
add  rax, [rdi]         ; rax ← rax + memory (memory operand)
add  [rdi], rax         ; memory ← memory + rax (memory destination)
add  QWORD [rdi], 1     ; memory ← memory + 1 (immediate + memory needs an explicit size)
; ^ Without the QWORD, NASM rejects this form: the operand size is ambiguous

; SUB forms (same structure as ADD)
sub  rax, rbx           ; rax ← rax - rbx
sub  rax, 1             ; rax ← rax - 1 (but see INC/DEC below)
sub  rax, [rdi]         ; rax ← rax - memory

Register Trace: ADD Examples

Instruction                  RAX                 RBX                 CF  OF  SF  ZF  Notes
(initial)                    0x0000000000000000  0x0000000000000001  0   0   0   0
add rax, rbx                 0x0000000000000001  0x0000000000000001  0   0   0   0
add rax, rbx                 0x0000000000000002  0x0000000000000001  0   0   0   0
mov rax, 0x7FFFFFFFFFFFFFFF  0x7FFFFFFFFFFFFFFF  0x0000000000000001  0   0   0   0   MOV does not change flags
add rax, rbx                 0x8000000000000000  0x0000000000000001  0   1   1   0   signed overflow
mov rax, 0xFFFFFFFFFFFFFFFF  0xFFFFFFFFFFFFFFFF  0x0000000000000001  0   1   1   0   MOV does not change flags
add rax, rbx                 0x0000000000000000  0x0000000000000001  1   0   0   1   unsigned wraparound

The last row: adding 1 to 0xFFFFFFFFFFFFFFFF produces 0 (unsigned overflow, CF=1, ZF=1).
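To watch these flags from a running program rather than a table, the following is a minimal sketch that reproduces the signed-overflow row and reports OF and CF through the exit status. The file name and the choice of encoding the flags as OF + 2*CF are illustrative assumptions, not part of the chapter's programs.

```nasm
; flags_demo.asm -- hypothetical file name; observe OF and CF after ADD
; Build: nasm -f elf64 flags_demo.asm -o flags_demo.o && ld flags_demo.o -o flags_demo
; Run:   ./flags_demo; echo $?

section .text
    global _start

_start:
    mov  rax, 0x7FFFFFFFFFFFFFFF ; largest positive signed 64-bit value
    mov  rbx, 1
    add  rax, rbx               ; rax = 0x8000000000000000: OF=1, CF=0

    seto dil                    ; dil = OF (SETcc reads flags, does not change them)
    setc sil                    ; sil = CF
    movzx edi, dil              ; widen both bytes to full registers
    movzx esi, sil
    lea  edi, [rdi + rsi*2]     ; exit status = OF + 2*CF (illustrative encoding)

    mov  rax, 60                ; SYS_EXIT
    syscall
```

With the operands shown, the shell should report an exit status of 1 (OF set, CF clear). Changing the first constant to 0xFFFFFFFFFFFFFFFF should give 2 instead (CF set, OF clear), matching the last row of the trace.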


INC, DEC, NEG

; INC: increment by 1 (does NOT set CF)
inc  rax            ; rax ← rax + 1 (CF unchanged! OF, SF, ZF, PF, AF set)
inc  QWORD [rdi]    ; increment memory

; DEC: decrement by 1 (does NOT set CF)
dec  rax            ; rax ← rax - 1 (CF unchanged! OF, SF, ZF, PF, AF set)
dec  QWORD [rdi]    ; decrement memory

; NEG: two's complement negation
neg  rax            ; rax ← -rax (sets CF=1 unless rax=0, plus OF, SF, ZF)

⚠️ Common Mistake: INC and DEC leave the Carry Flag untouched. This means you cannot use JC/JNC after INC/DEC to detect unsigned wraparound (signed overflow is still reported in OF). If you need the carry after incrementing, use ADD rax, 1 instead, which does set CF.

The motivation for this design: in 8086 code, INC/DEC appeared in loops and it was common to use CF for other purposes (multi-byte arithmetic with ADC). Making INC/DEC not touch CF allowed them to be used inside such loops without disturbing the carry chain.
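The carry-chain argument is easier to see in code. The sketch below adds two 128-bit numbers held as pairs of 64-bit limbs, using ADC inside a DEC/JNZ loop; it only works because DEC (and LEA, used to advance the pointers) leaves CF alone between iterations. The file name, the labels a and b, and the sample values are assumptions made for illustration.

```nasm
; adc_loop.asm -- hypothetical file name; multi-limb addition with ADC
; Build: nasm -f elf64 adc_loop.asm -o adc_loop.o && ld adc_loop.o -o adc_loop
; Run:   ./adc_loop; echo $?

section .data
    a   dq 0xFFFFFFFFFFFFFFFF, 0x0000000000000001   ; low limb first
    b   dq 0x0000000000000001, 0x0000000000000002   ; low limb first

section .text
    global _start

_start:
    lea  rdi, [rel a]           ; destination: a += b
    lea  rsi, [rel b]
    mov  rcx, 2                 ; number of 64-bit limbs
    clc                         ; start the chain with CF = 0

.loop:
    mov  rax, [rsi]             ; MOV does not touch flags
    adc  [rdi], rax             ; limb += limb + CF; CF = carry out
    lea  rsi, [rsi + 8]         ; LEA advances pointers without touching flags
    lea  rdi, [rdi + 8]
    dec  rcx                    ; sets ZF for the loop test, leaves CF intact
    jnz  .loop

    ; a is now high = 4, low = 0: the carry from the low limb reached the high limb
    mov  rdi, [rel a + 8]       ; exit status = high limb
    mov  rax, 60                ; SYS_EXIT
    syscall
```

The exit status should be 4 (1 + 2 plus the carry out of the low limb). Replacing DEC with SUB rcx, 1 would overwrite CF on every iteration and silently drop that carry.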


XOR: The Most Versatile Instruction

XOR performs bitwise exclusive-or, but its most common use in x86-64 assembly is zeroing a register:

; Zero a register (most common use):
xor  eax, eax        ; rax ← 0  (32-bit write zeroes upper 32 bits of rax!)
                     ; 2 bytes: 31 c0
                     ; Better than the plain 64-bit encoding of mov rax, 0
                     ; (7 bytes: 48 c7 c0 00 00 00 00)

; Bitwise XOR:
xor  rax, rbx        ; each bit of rax is XORed with corresponding bit of rbx
xor  rax, 0xFF       ; flip the low 8 bits of rax

; Toggle bits:
xor  BYTE [flag], 1   ; flip bit 0 of a flag byte (toggle on/off)

; Swap two registers without a temporary (classic trick):
xor  rax, rbx        ; rax ← rax XOR rbx
xor  rbx, rax        ; rbx ← rbx XOR (rax XOR rbx) = original rax
xor  rax, rbx        ; rax ← (rax XOR rbx) XOR rax = original rbx
; Note: this trick is clever but no faster than the alternatives; prefer xchg rax, rbx or a scratch register

; Clear a register that held sensitive data (for example key material):
xor  rax, rax        ; hand-written assembly is not subject to compiler dead-store elimination

💡 Mental Model: XOR reg, reg is a zero idiom recognized by the CPU microarchitecture. On Intel CPUs from Sandy Bridge onward, XOR EAX, EAX is handled by register renaming without any ALU operation — the register is simply flagged as "zero value" with no execution latency. It doesn't need to read the old value of EAX at all.


System Calls: The Linux Kernel Interface

The syscall instruction is how user-space programs request kernel services. On Linux x86-64:

Syscall Convention:
  RAX = syscall number
  RDI = argument 1
  RSI = argument 2
  RDX = argument 3
  R10 = argument 4 (not RCX — note the difference from the ABI!)
  R8  = argument 5
  R9  = argument 6

Return value: RAX (non-negative = success; a value from -4095 to -1 = -errno for errors)
Clobbered by syscall: RCX (saved RIP), R11 (saved RFLAGS)

Note the critical difference for argument 4: in the function calling convention, argument 4 is in RCX. But for syscalls, argument 4 is in R10 (because syscall saves the return address in RCX, making it unavailable for argument passing).

Essential Syscalls

; sys_write(fd, buf, count) → bytes_written
;   fd: 1=stdout, 2=stderr
mov  rax, 1             ; SYS_WRITE
mov  rdi, 1             ; fd
mov  rsi, buffer        ; buf (address)
mov  rdx, length        ; count
syscall
; rax = bytes written, or negative error

; sys_read(fd, buf, count) → bytes_read
;   fd: 0=stdin
mov  rax, 0             ; SYS_READ
mov  rdi, 0             ; stdin
mov  rsi, buffer        ; buf
mov  rdx, 4096          ; max bytes to read
syscall
; rax = bytes read (0 = EOF), or negative error

; sys_exit(status)
mov  rax, 60            ; SYS_EXIT
mov  rdi, 0             ; exit status
syscall
; Does not return

; sys_write to stderr (for error messages):
mov  rax, 1
mov  rdi, 2             ; fd=2 (stderr)
mov  rsi, errmsg
mov  rdx, errmsg_len
syscall
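None of the snippets above inspects the return value. A minimal sketch of the check might look like the following: it deliberately writes to a file descriptor that is assumed to be closed and turns the negative return value into an exit status. The file name and the use of fd 99 are assumptions for illustration; Linux reports errors as values in the range -4095 to -1, which is why an unsigned comparison against -4095 is enough.

```nasm
; errcheck.asm -- hypothetical file name; detect a failed syscall
; Build: nasm -f elf64 errcheck.asm -o errcheck.o && ld errcheck.o -o errcheck
; Run:   ./errcheck; echo $?

section .text
    global _start

_start:
    mov  rax, 1                 ; SYS_WRITE
    mov  rdi, 99                ; fd 99: assumed NOT to be open in this process
    lea  rsi, [rel _start]      ; any readable address will do
    mov  rdx, 1                 ; count
    syscall

    cmp  rax, -4095             ; errors are -4095..-1
    jae  .error                 ; unsigned >= 0xFFFFFFFFFFFFF001 means error

    xor  edi, edi               ; success: exit status 0
    jmp  .exit

.error:
    neg  rax                    ; rax = errno (EBADF = 9 for a bad descriptor)
    mov  rdi, rax               ; report errno as the exit status

.exit:
    mov  rax, 60                ; SYS_EXIT
    syscall
```

If fd 99 really is closed, the exit status should be 9 (EBADF). A plain signed test such as test rax, rax / js also works for most syscalls; the range check is the stricter form, since a few syscalls can legitimately return values with the top bit set.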

Stack Alignment Before SYSCALL

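The syscall instruction itself has no stack-alignment requirement: the kernel does not use the user stack to enter or leave a system call. Function calls are different. The System V ABI requires RSP to be 16-byte aligned at the point of every call instruction (so RSP is 8 bytes past a 16-byte boundary on entry to the callee), and library code that uses aligned SSE loads and stores can fault if this is violated. As a general habit, keep the stack 16-byte aligned at every call site.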


Program 1: Hello World (Complete With Register Trace)

; hello.asm -- Hello, Assembly World!
; Build: nasm -f elf64 hello.asm -o hello.o && ld hello.o -o hello

section .data
    msg     db "Hello, Assembly World!", 10     ; 22 characters + newline = 23 bytes
    msglen  equ $ - msg                         ; = 23

section .text
    global _start

_start:
    ; Set up sys_write arguments
    mov  rax, 1         ; syscall number
    mov  rdi, 1         ; fd = stdout
    mov  rsi, msg       ; buffer
    mov  rdx, msglen    ; count

    ; Execute sys_write
    syscall

    ; Set up sys_exit arguments
    mov  rax, 60        ; syscall number
    xor  rdi, rdi       ; exit status = 0
    syscall

Register trace:

Instruction      RAX  RDI  RSI       RDX  Notes
(entry)          ?    ?    ?         ?
mov rax, 1       1    ?    ?         ?
mov rdi, 1       1    1    ?         ?
mov rsi, msg     1    1    0x402000  ?    RSI = address of msg (exact value depends on the linker)
mov rdx, msglen  1    1    0x402000  23
syscall          23   1    0x402000  23   RAX = return value (23 bytes written); RCX = return addr, R11 = saved RFLAGS
mov rax, 60      60   1    0x402000  23
xor rdi, rdi     60   0    0x402000  23
syscall                                   Process exits

Program 2: Integer-to-ASCII Conversion and Printing

Converting an integer to its decimal string representation is a fundamental routine. Here's a complete implementation with full explanation:

; print_number.asm -- convert an integer to decimal and print it
; Algorithm: repeatedly divide by 10, collect remainders (digits in reverse),
;            then print the digits in forward order.

section .bss
    digit_buf   resb 24         ; enough for 20 digits + sign + newline + null

section .text
    global _start

; print_uint64: print a 64-bit unsigned integer to stdout, followed by newline
; Args: rdi = value to print
; Clobbers: rax, rcx, rdx, rsi, rdi, r11 (rbx is saved and restored)
print_uint64:
    push rbx

    lea  rbx, [rel digit_buf + 23]  ; rbx points to end of buffer
    mov  BYTE [rbx], 10             ; add newline at end
    dec  rbx

    ; Handle the special case of zero
    mov  rax, rdi               ; rax = value
    test rax, rax
    jnz  .convert               ; if nonzero, convert normally
    mov  BYTE [rbx], '0'
    dec  rbx
    jmp  .print

.convert:
    mov  rcx, 10                ; divisor

.loop:
    ; Divide rax by 10
    xor  rdx, rdx               ; zero rdx (required before DIV)
    div  rcx                    ; rax = quotient, rdx = remainder (0-9)

    ; Convert remainder to ASCII
    add  dl, '0'                ; '0' = 48; dl is now '0' to '9'
    mov  [rbx], dl              ; store digit (working right-to-left)
    dec  rbx

    test rax, rax               ; quotient zero?
    jnz  .loop                  ; if not, continue

.print:
    inc  rbx                    ; rbx now points to first digit

    ; Length = (one past the newline) - (first digit)
    ;   digit_buf + 24 = one past the newline stored at digit_buf + 23
    ;   rbx            = first digit
    lea  rdx, [rel digit_buf + 24]
    sub  rdx, rbx               ; rdx = length including newline
    mov  rsi, rbx               ; rsi = start of number string

    ; Write to stdout
    mov  rax, 1
    mov  rdi, 1
    syscall

    pop  rbx
    ret

_start:
    ; Print some numbers
    mov  rdi, 0
    call print_uint64           ; prints "0\n"

    mov  rdi, 42
    call print_uint64           ; prints "42\n"

    mov  rdi, 1234567890
    call print_uint64           ; prints "1234567890\n"

    mov  rdi, 0xFFFFFFFFFFFFFFFF  ; max uint64
    call print_uint64           ; prints "18446744073709551615\n"

    ; Exit
    mov  rax, 60
    xor  rdi, rdi
    syscall

Register trace for print_uint64(42):

Instruction                      RAX  RDX  RBX     RCX   Notes
(entry)                          ?    ?    ?       ?     rdi = 42
lea rbx, [digit_buf+23]          ?    ?    buf+23  ?
mov BYTE [rbx], 10               ?    ?    buf+23  ?     newline at buf[23]
dec rbx                          ?    ?    buf+22  ?
mov rax, rdi                     42   ?    buf+22  ?     rax = value (nonzero, so jnz .convert is taken)
mov rcx, 10                      42   ?    buf+22  10    divisor
xor rdx, rdx                     42   0    buf+22  10
div rcx                          4    2    buf+22  10    42÷10: quotient=4, remainder=2
add dl, '0'                      4    '2'  buf+22  10    '0'+2='2'
mov [rbx], dl                    4    '2'  buf+22  10    stores '2' at buf[22]
dec rbx                          4    '2'  buf+21  10
test rax, rax; jnz → taken       4    '2'  buf+21  10    quotient is nonzero, loop again
xor rdx, rdx                     4    0    buf+21  10
div rcx                          0    4    buf+21  10    4÷10: quotient=0, remainder=4
add dl, '0'                      0    '4'  buf+21  10
mov [rbx], dl                    0    '4'  buf+21  10    stores '4' at buf[21]
dec rbx                          0    '4'  buf+20  10
test rax, rax; jnz → not taken   0    '4'  buf+20  10    quotient is 0, exit loop
inc rbx                          0    '4'  buf+21  10    back to first digit
sys_write: buf[21..23] = "42\n"  3    3    buf+21  addr  prints "42\n"; RDX = length, RCX = return address saved by syscall

Program 3: Reading from Standard Input

; read_echo.asm -- read a line and echo it back
; Demonstrates: sys_read, error handling

section .bss
    buffer  resb 256            ; input buffer

section .text
    global _start

_start:
    ; Read from stdin (fd=0) into buffer
    mov  rax, 0                 ; SYS_READ
    mov  rdi, 0                 ; stdin
    lea  rsi, [rel buffer]
    mov  rdx, 256               ; max bytes
    syscall

    ; rax = bytes read (or negative error)
    ; Check for EOF (rax = 0) or error (rax < 0)
    test rax, rax
    jle  .done                  ; if <= 0, nothing to echo

    ; Echo: write the same bytes back to stdout
    mov  rdx, rax               ; rdx = bytes to write = bytes read
    mov  rax, 1                 ; SYS_WRITE
    mov  rdi, 1                 ; stdout
    lea  rsi, [rel buffer]
    syscall

.done:
    mov  rax, 60
    xor  rdi, rdi
    syscall

Program 4: Add Two Numbers and Print the Result

; add_print.asm -- add two hardcoded numbers and print the result
; Exercises: arithmetic, function calls, register passing

section .bss
    digit_buf   resb 24         ; scratch buffer used by print_uint64

section .data
    intro_msg   db "Sum: ", 0
    intro_len   equ $ - intro_msg - 1    ; exclude null

section .text
    global _start

; print_uint64: (same as above -- we'd use %include in a real project)
; ... (implementation as above)

_start:
    ; Compute 12345 + 67890
    mov  rax, 12345
    add  rax, 67890             ; rax = 80235

    ; Keep the result somewhere that survives the sys_write below.
    ; RCX is not an option: syscall overwrites RCX (and R11).
    ; RBX is preserved by the kernel and is callee-saved for function calls.
    push rbx                    ; save rbx
    mov  rbx, rax               ; rbx = 80235

    ; Print "Sum: "
    mov  rax, 1
    mov  rdi, 1
    lea  rsi, [rel intro_msg]
    mov  rdx, intro_len
    syscall

    ; Print the number
    mov  rdi, rbx               ; restore result as argument
    call print_uint64           ; prints "80235\n"

    pop  rbx
    mov  rax, 60
    xor  rdi, rdi
    syscall

This example demonstrates a critical real-world concern: syscall clobbers RCX and R11. The result 80235 must be saved in a callee-saved register (RBX, RBP, R12-R15) before calling syscall, or explicitly pushed/popped.
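The push/pop alternative mentioned above can be sketched as follows. The value 7 and the choice of getpid (syscall number 39 on x86-64 Linux) are arbitrary, picked only because getpid takes no arguments and always succeeds; the file name is hypothetical.

```nasm
; save_rcx.asm -- hypothetical file name; keep a value in RCX across a syscall
; Build: nasm -f elf64 save_rcx.asm -o save_rcx.o && ld save_rcx.o -o save_rcx
; Run:   ./save_rcx; echo $?

section .text
    global _start

_start:
    mov  rcx, 7                 ; a value we want to keep in RCX

    push rcx                    ; save it: syscall is about to overwrite RCX
    mov  rax, 39                ; SYS_GETPID (no arguments)
    syscall                     ; rax = pid; RCX and R11 are now clobbered
    pop  rcx                    ; restore the saved value

    mov  rdi, rcx               ; exit status = restored value
    mov  rax, 60                ; SYS_EXIT
    syscall
```

The exit status should be 7. Removing the push/pop pair would leave the address of the instruction after the first syscall in RCX, and the exit status would become the low 8 bits of that address.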


The Four strlen() Implementations

The strlen function — returning the length of a null-terminated string — is worth implementing multiple ways because it illustrates fundamental trade-offs in assembly programming.

Implementation 1: Naive Loop

; strlen_v1: byte-by-byte scan
; Args: rdi = string pointer
; Returns: rax = length
strlen_v1:
    xor  eax, eax               ; length = 0 (32-bit xor zeros rax)
.loop:
    cmp  BYTE [rdi + rax], 0    ; is current byte null?
    je   .done                  ; if yes, done
    inc  rax                    ; length++
    jmp  .loop
.done:
    ret

Simple, correct, but slow: one iteration per byte, with a load, compare, conditional branch, increment, and unconditional branch per byte.

Implementation 2: SCASB (String Scan Byte)

; strlen_v2: using SCASB
; Args: rdi = string pointer
; Returns: rax = length
strlen_v2:
    push  rdi                   ; save original pointer
    cld                         ; clear DF (scan forward)
    xor   al, al                ; AL = 0 (looking for null byte)
    mov   rcx, -1               ; scan up to 2^64-1 bytes

    repne scasb                 ; scan: while [rdi] != al, advance rdi, dec rcx
    ; After: rdi points one past the null byte
    ; rcx = (original rcx) - (bytes scanned including null) = -1 - (len+1)

    ; rcx started at -1 and was decremented (len + 1) times
    ; (len characters plus the null terminator), so:
    ;   rcx     = -1 - (len + 1) = -(len + 2)
    ;   not rcx = (len + 2) - 1  = len + 1
    not   rcx                   ; rcx = len + 1
    lea   rax, [rcx - 1]        ; rax = len (drop the null byte from the count)

    pop   rdi
    ret

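REPNE SCASB still examines one byte per iteration; the loop control is handled by the CPU's microcode instead of explicit compare-and-branch instructions. On modern CPUs its performance is typically similar to the naive loop, and for short strings its startup overhead can make it slower.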

Implementation 3: Word-at-a-Time (Aligned, Faster)

The idea: load 8 bytes at a time and check all 8 bytes for null using bitwise tricks. The same word-at-a-time technique appears in portable C implementations of strlen, such as glibc's generic (non-SIMD) version.

; strlen_v3: 8-bytes-at-a-time with alignment handling
; Args: rdi = string pointer
; Returns: rax = length
; Note: this is a simplified version; production code handles alignment edge cases
strlen_v3:
    mov  rsi, rdi               ; save start

    ; Check byte-by-byte until 8-byte aligned
.align_loop:
    test rdi, 7                 ; is RDI 8-byte aligned (low 3 bits zero)?
    jz   .aligned               ; if yes, start the 8-byte scan
    cmp  BYTE [rdi], 0          ; check byte
    je   .found_null
    inc  rdi
    jmp  .align_loop

.aligned:
    ; 8-byte-at-a-time scan
    ; Technique: a 64-bit word contains a null byte iff
    ; (word - 0x0101010101010101) & ~word & 0x8080808080808080 != 0
    mov  rax, 0x0101010101010101
    mov  rcx, 0x8080808080808080

.word_loop:
    mov  rdx, [rdi]             ; load 8 bytes
    mov  r8, rdx
    sub  r8, rax                ; r8 = word - 0x0101...
    not  rdx                    ; rdx = ~word
    and  r8, rdx                ; r8 = (word - 0x0101...) & ~word
    and  r8, rcx                ; r8 & 0x8080... = non-zero iff null byte present
    jnz  .found_null_in_word
    add  rdi, 8
    jmp  .word_loop

.found_null_in_word:
    ; Find which byte in the 8-byte word is null
    ; (simplified: just do byte-by-byte from here)
.check_bytes:
    cmp  BYTE [rdi], 0
    je   .found_null
    inc  rdi
    jmp  .check_bytes

.found_null:
    sub  rdi, rsi               ; length = (null address) - (start address)
    mov  rax, rdi
    ret

Implementation 4: AVX2 SIMD (32 bytes at a time)

; strlen_v4: AVX2 version -- requires AVX2 support (check CPUID first!)
; Args: rdi = string pointer
; Returns: rax = length
strlen_v4:
    ; Zero the YMM register for comparison
    vpxor   ymm0, ymm0, ymm0    ; ymm0 = all zeros (the null byte repeated 32 times)

    mov     rax, rdi
    and     rax, ~31            ; align down to 32-byte boundary for load
    mov     ecx, edi
    and     ecx, 31             ; ecx = offset of the string within that 32-byte block

    ; First block: it may include up to 31 bytes that precede the string.
    ; An aligned 32-byte load never crosses a page, so reading them is safe,
    ; but any zero bytes among them must be ignored.
    vmovdqa ymm1, [rax]         ; aligned load of the block containing the start
    vpcmpeqb ymm2, ymm1, ymm0   ; compare each byte with 0; ymm2[i] = 0xFF if byte[i]==0
    vpmovmskb edx, ymm2         ; edx = 32-bit mask; bit i = 1 if byte i is zero
    shr     edx, cl             ; discard mask bits for bytes before the string...
    shl     edx, cl             ; ...and move the remaining bits back into position
    test    edx, edx            ; any zero bytes found? (shifts by 0 leave flags alone)
    jnz     .found_zero

.avx_loop:
    add     rax, 32
    vmovdqa ymm1, [rax]         ; load the next aligned 32 bytes
    vpcmpeqb ymm2, ymm1, ymm0
    vpmovmskb edx, ymm2
    test    edx, edx            ; any zero bytes found?
    jz      .avx_loop

.found_zero:
    bsf     edx, edx            ; bit scan forward: edx = position of lowest set bit
    add     rax, rdx            ; pointer to null byte
    sub     rax, rdi            ; length = null_address - start
    vzeroupper                  ; clear upper YMM state (avoids AVX/SSE transition penalties)
    ret

Performance comparison for a 100-byte string:
- v1 (naive): ~100 iterations = ~100 cycles
- v2 (SCASB): ~100 iterations = ~80 cycles
- v3 (8-byte): ~13 iterations = ~40 cycles
- v4 (AVX2): ~4 iterations = ~15 cycles

These are rough estimates; actual performance depends on cache behavior, branch prediction, and CPU microarchitecture. The SIMD version (v4) is roughly 5-7x faster than the naive version for medium-length strings.


The MinOS Kernel Project: Step 1

The MinOS kernel project begins here. The goal for this chapter is to write a BIOS-compatible bootloader that:
1. Loads at address 0x7C00 (where the BIOS places a 512-byte boot sector)
2. Sets up a minimal 16-bit environment
3. Prints a message to screen using BIOS interrupts
4. Halts

This is the first 512 bytes of what will become the MinOS operating system.

; minos/boot/boot.asm -- MinOS Stage 1 Bootloader
; Loaded by BIOS at 0x7C00, executed in 16-bit real mode
;
; Build:
;   nasm -f bin boot.asm -o boot.bin
; Test:
;   qemu-system-x86_64 -fda boot.bin
; Verify:
;   wc -c boot.bin     # must be 512
;   xxd boot.bin | tail -1   # must end with 55 aa

; Tell NASM: generate 16-bit code
BITS 16
; Tell NASM: assume code is loaded at address 0x7C00
ORG 0x7C00

; ============================================================
; Entry point: BIOS jumps here
; At entry:
;   CS:IP = 0x0000:0x7C00 (or 0x07C0:0x0000 -- both name the same physical address, 0x7C00)
;   DL = boot drive number
; ============================================================
_start:
    ; Step 1: Establish a known segment environment
    ; The BIOS may have CS set to 0x07C0 or 0x0000; we normalize to 0x0000
    jmp     0x0000:init         ; far jump to force CS=0

init:
    ; Set all segment registers to 0
    xor     ax, ax
    mov     ds, ax              ; data segment = 0
    mov     es, ax              ; extra segment = 0
    mov     fs, ax              ; FS = 0
    mov     gs, ax              ; GS = 0
    mov     ss, ax              ; stack segment = 0
    mov     sp, 0x7C00          ; stack pointer: grows down from 0x7C00
                                ; (below our bootloader -- safe for small stacks)

    ; Step 2: Clear screen using BIOS INT 10h
    ; AH=0x00, AL=0x03 = set video mode to 80x25 color text
    mov     ah, 0x00
    mov     al, 0x03
    int     0x10

    ; Step 3: Print the welcome message
    lea     si, [welcome_msg]   ; SI = address of message
    call    print_string_16

    ; Step 4: Halt -- we'll add more in later chapters
.halt:
    hlt
    jmp     .halt               ; any interrupt (timer IRQ, NMI) wakes the CPU from HLT, so loop

; ============================================================
; print_string_16: print null-terminated string using BIOS
; Args: SI = pointer to null-terminated string
; Clobbers: AX, BX, SI
; ============================================================
print_string_16:
    mov     bh, 0               ; page number
    mov     bl, 0x07            ; foreground color (used by the BIOS in graphics modes only;
                                ; in text mode the existing attribute, light gray on black, is kept)
.loop:
    lodsb                       ; AL = [SI], SI++
    test    al, al              ; null terminator?
    jz      .done
    mov     ah, 0x0E            ; BIOS function: teletype output
    int     0x10                ; BIOS video interrupt
    jmp     .loop
.done:
    ret

; ============================================================
; Data
; ============================================================
welcome_msg:
    db "MinOS Bootloader v0.1", 13, 10    ; CR+LF for 16-bit text mode
    db "Initializing...", 13, 10
    db 0                                    ; null terminator

; ============================================================
; Boot Signature
; ============================================================
; Pad to exactly 510 bytes, then write the boot signature 0xAA55
times 510 - ($ - $$) db 0      ; fill with zeros up to byte 510
dw 0xAA55                       ; BIOS boot signature (little-endian: 0x55, 0xAA)

📐 OS Kernel Project — Step 1: This is the first file of MinOS. Assemble it with nasm -f bin boot/boot.asm -o boot.bin, verify it's 512 bytes with wc -c boot.bin, and test it in QEMU with qemu-system-x86_64 -fda boot.bin. You should see "MinOS Bootloader v0.1" and "Initializing..." on a black screen. In Chapter 19, this bootloader will be extended to enter 32-bit protected mode. In Chapter 28, it will switch to 64-bit long mode.


Summary

This chapter wrote real programs:
- Hello world with a full register trace showing every register state change
- Integer-to-ASCII conversion using DIV, generating digits right-to-left and printing them left-to-right
- Reading stdin with sys_read and echoing it back
- Four implementations of strlen showing the performance progression from naive loop to SIMD
- The first 512 bytes of the MinOS bootloader

The fundamental patterns established here:
1. Set up syscall registers (RAX = number, RDI/RSI/RDX = args), execute syscall, check RAX return value
2. Save values across syscalls in callee-saved registers (RBX, RBP, R12-R15) or on the stack
3. Unsigned division with DIV requires zeroing RDX first (signed IDIV needs RAX sign-extended into RDX with CQO instead)
4. Register traces are how you verify assembly programs

🔄 Check Your Understanding: Program 4 uses RBX to hold the sum across the sys_write syscall. Could it have used RCX instead?

Answer: No. The syscall instruction saves RIP to RCX and RFLAGS to R11. After any syscall returns to user space, RCX contains the address of the instruction after the syscall, not whatever was in RCX before. R11 contains the RFLAGS value from before the syscall. Both values are meaningless to the user-space program.

RBX is safe for two separate reasons. First, the Linux kernel preserves every general-purpose register across a syscall except RAX (the return value), RCX, and R11. Second, RBX is callee-saved in the System V ABI, so it also survives calls to any function that follows the ABI, such as print_uint64. The other registers with both properties are RBP and R12-R15.