In This Chapter
- Writing Code That Actually Does Something
- The MOV Instruction Family
- ADD and SUB
- INC, DEC, NEG
- XOR: The Most Versatile Instruction
- System Calls: The Linux Kernel Interface
- Program 1: Hello World (Complete With Register Trace)
- Program 2: Integer-to-ASCII Conversion and Printing
- Program 3: Reading from Standard Input
- Program 4: Add Two Numbers and Print the Result
- The Four strlen() Implementations
- The MinOS Kernel Project: Step 1
- Summary
Chapter 7: Your First Assembly Programs
Writing Code That Actually Does Something
The previous six chapters built the mental model. Now we use it. This chapter writes real programs: complete, runnable, debuggable assembly programs that demonstrate every fundamental instruction and pattern you'll use throughout the rest of the book.
We start with MOV — the most common x86-64 instruction, which turns out to have more nuance than it appears — and work through the arithmetic instructions, system calls, and complete programs with full register trace tables. By the end of this chapter, you'll have written programs that print text, perform arithmetic, convert numbers to ASCII, and read from standard input.
The MinOS kernel project also begins here, in the final section. The first bytes of the kernel are written.
The MOV Instruction Family
MOV is the load-store instruction: it moves data between registers and memory. It is the most frequently used instruction in most programs, and it has more forms than beginners expect.
The Four MOV Forms
; Form 1: Register to Register
mov rax, rbx ; rax ← rbx (64-bit)
mov eax, ebx ; eax ← ebx (32-bit, zeroes upper 32 bits of rax!)
mov ax, bx ; ax ← bx (16-bit, does NOT zero upper bits)
mov al, bl ; al ← bl (8-bit, does NOT zero upper bits)
; Form 2: Immediate to Register
mov rax, 42 ; rax ← 42 (64-bit)
mov eax, 0xFF ; eax ← 255 (32-bit write, zeroes upper 32 bits)
mov rax, 0x123456789ABCDEF0 ; 64-bit immediate (MOVABS encoding, 10 bytes!)
xor eax, eax ; rax ← 0 (shorter than mov rax, 0)
; Form 3: Memory to Register (LOAD)
mov rax, [rdi] ; rax ← 8 bytes at address in rdi
mov eax, [rdi] ; eax ← 4 bytes at address in rdi; zeroes upper 32 of rax
mov ax, [rdi] ; ax ← 2 bytes at address in rdi; upper bits unchanged
mov al, [rdi] ; al ← 1 byte at address in rdi; upper bits unchanged
mov rax, [rbp-8] ; rax ← local variable (base+displacement)
mov rax, [rdi+rcx*8] ; rax ← arr[rcx] (base+index*scale)
mov rax, [rdi+rcx*8+16] ; rax ← struct.field in arr (base+index*scale+disp)
; Form 4: Register to Memory (STORE)
mov [rdi], rax ; 8 bytes at address in rdi ← rax
mov [rdi], eax ; 4 bytes at address in rdi ← eax
mov [rbp-8], rax ; local variable ← rax
mov QWORD [rdi], 42 ; store immediate to memory (must specify QWORD)
mov DWORD [rdi], 42 ; store 4-byte immediate
mov BYTE [rdi], 'A' ; store 1-byte immediate
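The base+index*scale form is easiest to appreciate in a loop. The following is a minimal sketch, not part of the chapter's numbered programs: the file name `sum_array.asm` and the sample data in `values` are hypothetical, chosen so the sum fits in an 8-bit exit status.

```nasm
; sum_array.asm -- sum four qwords using base + index*scale addressing
; Build: nasm -f elf64 sum_array.asm -o sum_array.o && ld sum_array.o -o sum_array
; Run:   ./sum_array ; echo $?      (the exit status is the sum)

section .data
values  dq 10, 20, 30, 40           ; hypothetical sample data
count   equ ($ - values) / 8        ; number of 8-byte elements

section .text
global _start

_start:
    lea  rdi, [rel values]          ; rdi = base address of the array
    xor  ecx, ecx                   ; rcx = index = 0
    xor  eax, eax                   ; rax = running sum = 0
.next:
    add  rax, [rdi + rcx*8]         ; sum += values[rcx]
    inc  rcx
    cmp  rcx, count
    jb   .next                      ; loop while index < count

    mov  rdi, rax                   ; exit status = sum
    mov  eax, 60                    ; SYS_EXIT
    syscall
```

With the sample data above the sum is 100, so `echo $?` should report 100. The scale factor (8) matches the element size, which is why the index can count elements rather than bytes.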
What MOV Does NOT Do
; NO: memory-to-memory move does not exist
mov [rdi], [rsi] ; INVALID -- x86-64 has no memory-to-memory MOV
; Instead: use a register as intermediate
mov rax, [rsi] ; load from source
mov [rdi], rax ; store to destination
MOV has no memory-to-memory form, and almost no x86-64 instruction accepts two memory operands. The exceptions are special cases such as the string-move instruction MOVS (usually seen as REP MOVS, covered in Chapter 14). Every ordinary data transfer between two memory locations requires a register as a waypoint.
MOV and Zero Extension
The most important behavioral property of MOV, shared by every instruction that writes a 32-bit register:
; CRITICAL: 32-bit write zeroes upper 32 bits
mov rax, 0xFFFFFFFFFFFFFFFF ; rax = 0xFFFFFFFFFFFFFFFF
mov eax, 1 ; rax = 0x0000000000000001 (upper zeroed!)
; 16-bit and 8-bit writes do NOT zero upper bits
mov rax, 0xFFFFFFFFFFFFFFFF ; rax = 0xFFFFFFFFFFFFFFFF
mov ax, 1 ; rax = 0xFFFFFFFFFFFF0001 (only low 16 changed)
; Zero-extension: use movzx
movzx rax, bl ; rax ← zero-extend BL to 64 bits
; Sign-extension: use movsx
movsx rax, bl ; rax ← sign-extend BL to 64 bits
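One way to convince yourself of these rules is a small self-checking program. This is a sketch under the assumption that you build it like the other Linux programs in this chapter; the file name `extend.asm` and the convention "exit status 0 means every check passed" are choices made for this example only.

```nasm
; extend.asm -- verify partial-register write and extension behavior
; Build: nasm -f elf64 extend.asm -o extend.o && ld extend.o -o extend
; Run:   ./extend ; echo $?         (0 = every check passed, 1 = a check failed)

section .text
global _start

_start:
    mov   rax, -1                   ; rax = 0xFFFFFFFFFFFFFFFF
    mov   eax, 1                    ; 32-bit write zeroes the upper half
    cmp   rax, 1
    jne   .fail

    mov   rax, -1
    mov   ax, 1                     ; 16-bit write leaves bits 16-63 alone
    mov   rdx, 0xFFFFFFFFFFFF0001
    cmp   rax, rdx
    jne   .fail

    mov   bl, 0x80                  ; 0x80 = 128 unsigned, -128 signed
    movzx rax, bl                   ; zero-extend: rax = 0x0000000000000080
    cmp   rax, 0x80
    jne   .fail
    movsx rax, bl                   ; sign-extend: rax = 0xFFFFFFFFFFFFFF80
    cmp   rax, -128
    jne   .fail

    xor   edi, edi                  ; all checks passed: status 0
    jmp   .exit
.fail:
    mov   edi, 1
.exit:
    mov   eax, 60                   ; SYS_EXIT
    syscall
```

The byte 0x80 is a useful test value because its top bit is set, so zero-extension and sign-extension give visibly different results.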
ADD and SUB
ADD and SUB perform integer addition and subtraction, setting the flags as described in Chapter 2.
; ADD forms
add rax, rbx ; rax ← rax + rbx (sets CF, OF, SF, ZF, PF, AF)
add rax, 100 ; rax ← rax + 100 (immediate)
add rax, [rdi] ; rax ← rax + memory (memory operand)
add [rdi], rax ; memory ← memory + rax (memory destination)
add QWORD [rdi], 1 ; memory ← memory + 1 (size keyword required: with an
; immediate and a memory operand, nothing else tells NASM the operand size)
; SUB forms (same structure as ADD)
sub rax, rbx ; rax ← rax - rbx
sub rax, 1 ; rax ← rax - 1 (but see INC/DEC below)
sub rax, [rdi] ; rax ← rax - memory
Register Trace: ADD Examples
| Instruction | RAX | RBX | CF | OF | SF | ZF |
|---|---|---|---|---|---|---|
| (initial) | 0x0000000000000000 | 0x0000000000000001 | 0 | 0 | 0 | 0 |
| `add rax, rbx` | 0x0000000000000001 | 0x0000000000000001 | 0 | 0 | 0 | 0 |
| `add rax, rbx` | 0x0000000000000002 | 0x0000000000000001 | 0 | 0 | 0 | 0 |
| `mov rax, 0x7FFFFFFFFFFFFFFF` | 0x7FFFFFFFFFFFFFFF | 0x0000000000000001 | 0 | 0 | 0 | 0 |
| `add rax, rbx` | 0x8000000000000000 | 0x0000000000000001 | 0 | 1 | 1 | 0 |
| `mov rax, 0xFFFFFFFFFFFFFFFF` | 0xFFFFFFFFFFFFFFFF | 0x0000000000000001 | 0 | 1 | 1 | 0 |
| `add rax, rbx` | 0x0000000000000000 | 0x0000000000000001 | 1 | 0 | 0 | 1 |

The last row: adding 1 to 0xFFFFFFFFFFFFFFFF produces 0 (unsigned overflow, CF=1, ZF=1). OF is 0 there because, read as signed values, -1 + 1 = 0 does not overflow. The two `mov` rows leave every flag as it was, since MOV never modifies flags.
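The flag columns in a trace like this can be checked on real hardware by capturing flags with SETcc. The sketch below is an illustration only: the file name `flags.asm` and the way the four flag bits are packed into the exit status are assumptions made for this example.

```nasm
; flags.asm -- capture OF and CF after two ADDs and report them as exit status
; Build: nasm -f elf64 flags.asm -o flags.o && ld flags.o -o flags
; Run:   ./flags ; echo $?
; Status bits: bit 0 = OF and bit 1 = CF after the first ADD,
;              bit 2 = OF and bit 3 = CF after the second ADD

section .text
global _start

_start:
    mov   rax, 0x7FFFFFFFFFFFFFFF
    add   rax, 1                    ; largest signed value + 1
    seto  cl                        ; cl = OF (SETcc does not change flags)
    setc  dl                        ; dl = CF
    movzx ecx, cl
    movzx edx, dl
    lea   edi, [rcx + rdx*2]        ; edi = OF + 2*CF for the first ADD

    mov   rax, -1                   ; 0xFFFFFFFFFFFFFFFF
    add   rax, 1                    ; largest unsigned value + 1
    seto  cl
    setc  dl
    movzx ecx, cl
    movzx edx, dl
    lea   ecx, [rcx + rdx*2]        ; ecx = OF + 2*CF for the second ADD
    lea   edi, [rdi + rcx*4]        ; shift the second pair into bits 2-3

    mov   eax, 60                   ; SYS_EXIT
    syscall
```

The first ADD sets OF only and the second sets CF only, so the status should be 1 + 8 = 9. LEA is used for the packing because, unlike ADD or SHL, it does not modify any flags.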
INC, DEC, NEG
; INC: increment by 1 (does NOT set CF)
inc rax ; rax ← rax + 1 (CF unchanged! OF, SF, ZF, PF, AF set)
inc QWORD [rdi] ; increment memory
; DEC: decrement by 1 (does NOT set CF)
dec rax ; rax ← rax - 1 (CF unchanged! OF, SF, ZF, PF, AF set)
dec QWORD [rdi] ; decrement memory
; NEG: two's complement negation
neg rax ; rax ← -rax (sets CF=1 unless rax=0, plus OF, SF, ZF)
⚠️ Common Mistake:
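`INC` and `DEC` do not modify CF (the Carry Flag keeps whatever value it had). This means you cannot use `JC`/`JNC` after `INC`/`DEC` to detect unsigned wraparound. If you need the carry after incrementing, use `ADD rax, 1` instead, which does set CF. Signed overflow is still reported, because `INC` and `DEC` update OF.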
The motivation for this design: in 8086 code, INC/DEC appeared in loops and it was common to use CF for other purposes (multi-byte arithmetic with ADC). Making INC/DEC not touch CF allowed them to be used inside such loops without disturbing the carry chain.
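The carry-chain pattern described above still works on x86-64. Here is a minimal sketch of a multi-word addition loop; the file name `adc_loop.asm`, the labels `num_a`, `num_b`, and `total`, and the operand values are all hypothetical, picked so that the carry has to ripple through two limbs.

```nasm
; adc_loop.asm -- add two 256-bit integers 64 bits at a time with ADC
; Build: nasm -f elf64 adc_loop.asm -o adc_loop.o && ld adc_loop.o -o adc_loop
; Run:   ./adc_loop ; echo $?       (0 = result matched, 1 = mismatch)

section .data
; Least significant limb first. num_a = 2^128 - 1, num_b = 1.
num_a   dq 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF, 0, 0
num_b   dq 1, 0, 0, 0

section .bss
total   resq 4

section .text
global _start

_start:
    lea  rsi, [rel num_a]
    lea  rdx, [rel num_b]
    lea  rdi, [rel total]
    mov  ecx, 4                     ; number of 64-bit limbs
    clc                             ; start the chain with CF = 0
.limb:
    mov  rax, [rsi]                 ; MOV does not touch flags
    adc  rax, [rdx]                 ; rax = a[i] + b[i] + CF
    mov  [rdi], rax
    lea  rsi, [rsi + 8]             ; LEA does not touch flags either
    lea  rdx, [rdx + 8]
    lea  rdi, [rdi + 8]
    dec  ecx                        ; DEC sets ZF for the branch but keeps CF
    jnz  .limb

    ; Expected result is 2^128: limbs 0, 0, 1, 0
    mov  rax, [rel total]
    or   rax, [rel total + 8]       ; both low limbs must be zero
    jnz  .fail
    cmp  qword [rel total + 16], 1
    jne  .fail
    xor  edi, edi
    jmp  .exit
.fail:
    mov  edi, 1
.exit:
    mov  eax, 60                    ; SYS_EXIT
    syscall
```

If the loop counter were updated with `sub ecx, 1` instead of `dec ecx`, the SUB would overwrite CF and the next ADC would add the wrong carry.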
XOR: The Most Versatile Instruction
XOR performs bitwise exclusive-or, but its most common use in x86-64 assembly is zeroing a register:
; Zero a register (most common use):
xor eax, eax ; rax ← 0 (32-bit write zeroes upper 32 bits of rax!)
; 2 bytes: 31 c0
; Better than: mov rax, 0 (7 bytes as 48 c7 c0 00 00 00 00;
; some assemblers, NASM included, shorten it to the 5-byte mov eax, 0)
; Bitwise XOR:
xor rax, rbx ; each bit of rax is XORed with corresponding bit of rbx
xor rax, 0xFF ; flip the low 8 bits of rax
; Toggle bits:
xor BYTE [flag], 1 ; flip bit 0 of a flag byte (toggle on/off)
; Swap two registers without a temporary (classic trick):
xor rax, rbx ; rax ← rax XOR rbx
xor rbx, rax ; rbx ← rbx XOR (rax XOR rbx) = original rax
xor rax, rbx ; rax ← (rax XOR rbx) XOR rax = original rbx
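; Note: this trick is clever but slower than the alternatives, and it zeroes the
; value if both operands are the same register; use xchg rax, rbx or a scratch register
; Erase a register that held sensitive data (in hand-written assembly
; there is no compiler that could optimize the clear away):
xor rax, rax ; cryptographic code sometimes uses xor to zero keys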
💡 Mental Model:
`XOR reg, reg` is a zero idiom recognized by the CPU microarchitecture. On Intel CPUs from Sandy Bridge onward, `XOR EAX, EAX` is handled by register renaming without any ALU operation: the register is simply flagged as "zero value" with no execution latency. It doesn't need to read the old value of EAX at all.
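Bit toggling with XOR has a classic practical use: in ASCII, upper-case and lower-case letters differ only in bit 5 (0x20). The sketch below flips that bit in every letter of a string; the file name `toggle_case.asm` and the sample text are hypothetical, and the code assumes the string contains only letters before its trailing newline.

```nasm
; toggle_case.asm -- flip bit 5 of each letter to swap upper and lower case
; Build: nasm -f elf64 toggle_case.asm -o toggle_case.o && ld toggle_case.o -o toggle_case

section .data
text    db "Hello", 10              ; letters only, then a newline
textlen equ $ - text

section .text
global _start

_start:
    lea  rsi, [rel text]
    mov  ecx, textlen - 1           ; number of letters (skip the newline)
.flip:
    xor  byte [rsi + rcx - 1], 0x20 ; toggle bit 5: 'A' (0x41) <-> 'a' (0x61)
    dec  ecx
    jnz  .flip

    mov  eax, 1                     ; SYS_WRITE
    mov  edi, 1                     ; stdout
    mov  edx, textlen               ; rsi still points at text
    syscall                         ; prints "hELLO"

    xor  edi, edi
    mov  eax, 60                    ; SYS_EXIT
    syscall
```

Running it twice in a row on the same buffer would restore the original text, which is the defining property of XOR: applying the same mask twice cancels out.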
System Calls: The Linux Kernel Interface
The syscall instruction is how user-space programs request kernel services. On Linux x86-64:
Syscall Convention:
RAX = syscall number
RDI = argument 1
RSI = argument 2
RDX = argument 3
R10 = argument 4 (not RCX — note the difference from the ABI!)
R8 = argument 5
R9 = argument 6
Return value: RAX (success = non-negative; errors = -errno, in the range -4095..-1)
Clobbered by syscall: RCX (saved RIP), R11 (saved RFLAGS)
Note the critical difference for argument 4: in the function calling convention, argument 4 is in RCX. But for syscalls, argument 4 is in R10 (because syscall saves the return address in RCX, making it unavailable for argument passing).
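A syscall with more than three arguments makes the R10 rule concrete. The following sketch calls `mmap` (syscall number 9 on x86-64 Linux) to obtain one anonymous page; the file name `mmap_demo.asm` and the choice to report the result through the exit status are assumptions for this example.

```nasm
; mmap_demo.asm -- a six-argument syscall: mmap(addr, length, prot, flags, fd, offset)
; Build: nasm -f elf64 mmap_demo.asm -o mmap_demo.o && ld mmap_demo.o -o mmap_demo
; Run:   ./mmap_demo ; echo $?      (42 = page mapped and writable, 1 = mmap failed)

section .text
global _start

_start:
    mov   eax, 9                    ; SYS_MMAP
    xor   edi, edi                  ; arg 1: addr = NULL (kernel picks the address)
    mov   esi, 4096                 ; arg 2: length = one page
    mov   edx, 3                    ; arg 3: PROT_READ | PROT_WRITE
    mov   r10d, 0x22                ; arg 4: MAP_PRIVATE | MAP_ANONYMOUS (R10, not RCX)
    mov   r8, -1                    ; arg 5: fd = -1 (no backing file)
    xor   r9d, r9d                  ; arg 6: offset = 0
    syscall

    cmp   rax, -4095                ; errors come back as -4095..-1
    jae   .fail                     ; unsigned compare catches exactly that range

    mov   byte [rax], 42            ; the new page is writable
    movzx edi, byte [rax]           ; read it back: exit status = 42
    jmp   .exit
.fail:
    mov   edi, 1
.exit:
    mov   eax, 60                   ; SYS_EXIT
    syscall
```

The unsigned comparison against -4095 matters here because a valid mapping address can be any non-error value, so a plain sign test is not the right check for `mmap`.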
Essential Syscalls
; sys_write(fd, buf, count) → bytes_written
; fd: 1=stdout, 2=stderr
mov rax, 1 ; SYS_WRITE
mov rdi, 1 ; fd
mov rsi, buffer ; buf (address)
mov rdx, length ; count
syscall
; rax = bytes written, or negative error
; sys_read(fd, buf, count) → bytes_read
; fd: 0=stdin
mov rax, 0 ; SYS_READ
mov rdi, 0 ; stdin
mov rsi, buffer ; buf
mov rdx, 4096 ; max bytes to read
syscall
; rax = bytes read (0 = EOF), or negative error
; sys_exit(status)
mov rax, 60 ; SYS_EXIT
mov rdi, 0 ; exit status
syscall
; Does not return
; sys_write to stderr (for error messages):
mov rax, 1
mov rdi, 2 ; fd=2 (stderr)
mov rsi, errmsg
mov rdx, errmsg_len
syscall
Stack Alignment Before SYSCALL
The syscall instruction does not require stack alignment. A call is different: the System V ABI requires RSP to be 16-byte aligned at the point of the call instruction, before the return address is pushed, because the callee may use aligned SSE loads and stores on its stack frame. If you're calling any function that might use SSE instructions internally, maintain alignment. As a general habit, keep the stack 16-byte aligned at all times.
Program 1: Hello World (Complete With Register Trace)
; hello.asm -- Hello, Assembly World!
; Build: nasm -f elf64 hello.asm -o hello.o && ld hello.o -o hello
section .data
msg db "Hello, Assembly World!", 10 ; 22 characters + newline = 23 bytes
msglen equ $ - msg ; = 23
section .text
global _start
_start:
; Set up sys_write arguments
mov rax, 1 ; syscall number
mov rdi, 1 ; fd = stdout
mov rsi, msg ; buffer
mov rdx, msglen ; count
; Execute sys_write
syscall
; Set up sys_exit arguments
mov rax, 60 ; syscall number
xor rdi, rdi ; exit status = 0
syscall
Register trace:
| Instruction | RAX | RDI | RSI | RDX | Notes |
|---|---|---|---|---|---|
| (entry) | ? | ? | ? | ? | |
| `mov rax, 1` | 1 | ? | ? | ? | |
| `mov rdi, 1` | 1 | 1 | ? | ? | |
| `mov rsi, msg` | 1 | 1 | 0x402000 | ? | RSI = address of msg (the exact value depends on the linker) |
| `mov rdx, msglen` | 1 | 1 | 0x402000 | 23 | |
| `syscall` | 23 | 1 | 0x402000 | 23 | RAX = return value (23 bytes written); RCX = return address; R11 = saved RFLAGS |
| `mov rax, 60` | 60 | 1 | 0x402000 | 23 | |
| `xor rdi, rdi` | 60 | 0 | 0x402000 | 23 | |
| `syscall` | - | - | - | - | Process exits |
Program 2: Integer-to-ASCII Conversion and Printing
Converting an integer to its decimal string representation is a fundamental routine. Here's a complete implementation with full explanation:
; print_number.asm -- convert an integer to decimal and print it
; Algorithm: repeatedly divide by 10, collect remainders (digits in reverse),
; then print the digits in forward order.
section .bss
digit_buf resb 24 ; enough for 20 digits + sign + newline + null
section .text
global _start
; print_uint64: print a 64-bit unsigned integer to stdout, followed by newline
; Args: rdi = value to print
; Clobbers: rax, rcx, rdx, rsi, rdi, r11 (rbx is saved and restored)
print_uint64:
push rbx
lea rbx, [rel digit_buf + 23] ; rbx points to end of buffer
mov BYTE [rbx], 10 ; add newline at end
dec rbx
; Handle the special case of zero
mov rax, rdi ; rax = value
test rax, rax
jnz .convert ; if nonzero, convert normally
mov BYTE [rbx], '0'
dec rbx
jmp .print
.convert:
mov rcx, 10 ; divisor
.loop:
; Divide rax by 10
xor rdx, rdx ; zero rdx (required before DIV)
div rcx ; rax = quotient, rdx = remainder (0-9)
; Convert remainder to ASCII
add dl, '0' ; '0' = 48; dl is now '0' to '9'
mov [rbx], dl ; store digit (working right-to-left)
dec rbx
test rax, rax ; quotient zero?
jnz .loop ; if not, continue
.print:
inc rbx ; rbx now points to first digit
; digit_buf + 24 is one past the newline, so
; (digit_buf + 24) - rbx = number of digits + 1 (the newline)
lea rdx, [rel digit_buf + 24]
sub rdx, rbx ; rdx = length including newline
mov rsi, rbx ; rsi = start of number string
; Write to stdout
mov rax, 1
mov rdi, 1
syscall
pop rbx
ret
_start:
; Print some numbers
mov rdi, 0
call print_uint64 ; prints "0\n"
mov rdi, 42
call print_uint64 ; prints "42\n"
mov rdi, 1234567890
call print_uint64 ; prints "1234567890\n"
mov rdi, 0xFFFFFFFFFFFFFFFF ; max uint64
call print_uint64 ; prints "18446744073709551615\n"
; Exit
mov rax, 60
xor rdi, rdi
syscall
Register trace for print_uint64(42):
| Instruction | RAX | RDX | RBX | RCX | Notes |
|---|---|---|---|---|---|
| (entry) | ? | ? | ? | ? | rdi=42 |
| `lea rbx, [digit_buf+23]` | ? | ? | buf+23 | ? | |
| `mov BYTE [rbx], 10` | ? | ? | buf+23 | ? | newline at buf[23] |
| `dec rbx` | ? | ? | buf+22 | ? | |
| `mov rax, rdi` | 42 | ? | buf+22 | ? | rax = value |
| `test rax, rax`; `jnz .convert` → taken | 42 | ? | buf+22 | ? | value is nonzero |
| `mov rcx, 10` | 42 | ? | buf+22 | 10 | divisor |
| `xor rdx, rdx` | 42 | 0 | buf+22 | 10 | |
| `div rcx` | 4 | 2 | buf+22 | 10 | 42÷10: quotient=4, remainder=2 |
| `add dl, '0'` | 4 | '2' | buf+22 | 10 | '0'+2='2' |
| `mov [rbx], dl` | 4 | '2' | buf+22 | 10 | stores '2' at buf[22] |
| `dec rbx` | 4 | '2' | buf+21 | 10 | |
| `test rax, rax`; `jnz .loop` → taken | 4 | '2' | buf+21 | 10 | quotient is 4, loop again |
| `xor rdx, rdx` | 4 | 0 | buf+21 | 10 | |
| `div rcx` | 0 | 4 | buf+21 | 10 | 4÷10: quotient=0, remainder=4 |
| `add dl, '0'` | 0 | '4' | buf+21 | 10 | |
| `mov [rbx], dl` | 0 | '4' | buf+21 | 10 | stores '4' at buf[21] |
| `dec rbx` | 0 | '4' | buf+20 | 10 | |
| `test rax, rax`; `jnz .loop` → not taken | 0 | '4' | buf+20 | 10 | quotient is 0, exit loop |
| `inc rbx` | 0 | '4' | buf+21 | 10 | back to first digit |
| sys_write: buf[21..23] = "42\n" | 3 | 3 | buf+21 | return address | RDX was loaded with the length (3); RAX = 3 bytes written; syscall overwrote RCX; prints "42\n" |
Program 3: Reading from Standard Input
; read_echo.asm -- read a line and echo it back
; Demonstrates: sys_read, error handling
section .bss
buffer resb 256 ; input buffer
section .text
global _start
_start:
; Read from stdin (fd=0) into buffer
mov rax, 0 ; SYS_READ
mov rdi, 0 ; stdin
lea rsi, [rel buffer]
mov rdx, 256 ; max bytes
syscall
; rax = bytes read (or negative error)
; Check for EOF (rax = 0) or error (rax < 0)
test rax, rax
jle .done ; if <= 0, nothing to echo
; Echo: write the same bytes back to stdout
mov rdx, rax ; rdx = bytes to write = bytes read
mov rax, 1 ; SYS_WRITE
mov rdi, 1 ; stdout
lea rsi, [rel buffer]
syscall
.done:
mov rax, 60
xor rdi, rdi
syscall
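Program 3 handles a single read. A natural next step is to loop until `sys_read` reports EOF, and to cope with `sys_write` accepting fewer bytes than requested. The sketch below does both; the file name `cat_stdin.asm`, the 4096-byte buffer size, and the use of RBX and R12 as loop state are choices made for this example.

```nasm
; cat_stdin.asm -- copy stdin to stdout until EOF
; Build: nasm -f elf64 cat_stdin.asm -o cat_stdin.o && ld cat_stdin.o -o cat_stdin
; Run:   ./cat_stdin < somefile

section .bss
buffer  resb 4096

section .text
global _start

_start:
.read:
    xor  eax, eax                   ; SYS_READ
    xor  edi, edi                   ; fd 0 = stdin
    lea  rsi, [rel buffer]
    mov  edx, 4096                  ; max bytes per read
    syscall
    test rax, rax
    jz   .eof                       ; 0 bytes = end of input
    js   .error                     ; negative = -errno

    mov  rbx, rax                   ; rbx = bytes still to write
    lea  r12, [rel buffer]          ; r12 = next byte to write
.write:
    mov  eax, 1                     ; SYS_WRITE
    mov  edi, 1                     ; fd 1 = stdout
    mov  rsi, r12
    mov  rdx, rbx
    syscall
    test rax, rax
    jle  .error                     ; error, or no progress
    add  r12, rax                   ; skip what was written
    sub  rbx, rax
    jnz  .write                     ; partial write: send the rest
    jmp  .read

.eof:
    xor  edi, edi                   ; exit status 0
    jmp  .exit
.error:
    mov  edi, 1                     ; exit status 1
.exit:
    mov  eax, 60                    ; SYS_EXIT
    syscall
```

RBX and R12 hold the loop state because the kernel preserves them across `syscall`, while RAX, RCX, and R11 are overwritten each time.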
Program 4: Add Two Numbers and Print the Result
; add_print.asm -- add two hardcoded numbers and print the result
; Exercises: arithmetic, function calls, register passing
section .bss
digit_buf resb 24 ; scratch buffer used by print_uint64
section .data
intro_msg db "Sum: ", 0
intro_len equ $ - intro_msg - 1 ; exclude null
section .text
global _start
; print_uint64: (same as above -- we'd use %include in a real project)
; ... (implementation as above)
_start:
; Compute 12345 + 67890
mov rax, 12345
add rax, 67890 ; rax = 80235
; Save the sum before printing "Sum: ".
; RAX is about to hold the syscall number and then the return value, and
; syscall itself clobbers RCX and R11, so none of those can hold the sum.
; RBX is untouched by the kernel and preserved by print_uint64.
push rbx ; save rbx
mov rbx, rax ; keep the sum in rbx across the syscall
; Print "Sum: "
mov rax, 1
mov rdi, 1
lea rsi, [rel intro_msg]
mov rdx, intro_len
syscall
; Print the number
mov rdi, rbx ; restore result as argument
call print_uint64 ; prints "80235\n"
pop rbx
mov rax, 60
xor rdi, rdi
syscall
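This example demonstrates a critical real-world concern: syscall clobbers RCX and R11, and RAX comes back holding the return value. The kernel preserves every other register, but RDI, RSI, and RDX are reloaded here as sys_write arguments, so the result 80235 is kept in RBX. A callee-saved register (RBX, RBP, R12-R15) is the convenient choice because it also survives ordinary function calls; explicitly pushing and popping the value works too.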
The Four strlen() Implementations
The strlen function — returning the length of a null-terminated string — is worth implementing multiple ways because it illustrates fundamental trade-offs in assembly programming.
Implementation 1: Naive Loop
; strlen_v1: byte-by-byte scan
; Args: rdi = string pointer
; Returns: rax = length
strlen_v1:
xor eax, eax ; length = 0 (32-bit xor zeros rax)
.loop:
cmp BYTE [rdi + rax], 0 ; is current byte null?
je .done ; if yes, done
inc rax ; length++
jmp .loop
.done:
ret
Simple, correct, but slow: one iteration per byte, with a load, compare, conditional branch, increment, and unconditional branch per byte.
Implementation 2: SCASB (String Scan Byte)
; strlen_v2: using SCASB
; Args: rdi = string pointer
; Returns: rax = length
strlen_v2:
push rdi ; save original pointer
cld ; clear DF (scan forward)
xor al, al ; AL = 0 (looking for null byte)
mov rcx, -1 ; scan up to 2^64-1 bytes
repne scasb ; scan: compare AL with [rdi], advance rdi, dec rcx; stop on a match
; After: rdi points one past the null byte
; rcx started at -1 = 0xFFFFFFFFFFFFFFFF and was decremented (len+1) times
; (len characters plus the null terminator),
; so rcx = -1 - (len+1) = -(len+2)
not rcx ; not x = -x - 1, so rcx = len + 1
; subtract 1 for the null byte:
lea rax, [rcx - 1] ; rax = len
pop rdi
ret
SCASB still examines one byte per iteration; the REPNE prefix only moves the loop control into the instruction itself. On modern CPUs, it typically performs about the same as the naive loop for short strings, and the startup cost of the repeated string instruction can make it slower.
Implementation 3: Word-at-a-Time (Aligned, Faster)
The idea: load 8 bytes at a time and check all 8 bytes for null using bitwise tricks. glibc's portable C strlen has used this word-at-a-time approach; its hand-tuned x86-64 versions use SIMD instructions instead.
; strlen_v3: 8-bytes-at-a-time with alignment handling
; Args: rdi = string pointer
; Returns: rax = length
; Note: this is a simplified version; production code handles alignment edge cases
strlen_v3:
mov rsi, rdi ; save start
; Check byte-by-byte until 8-byte aligned
.align_loop:
test rdi, 7 ; is RDI 8-byte aligned (low 3 bits zero)?
jz .aligned ; if yes, start the 8-byte scan
cmp BYTE [rdi], 0 ; check byte
je .found_null
inc rdi
jmp .align_loop
.aligned:
; 8-byte-at-a-time scan
; Technique: a 64-bit word contains a null byte iff
; (word - 0x0101010101010101) & ~word & 0x8080808080808080 != 0
mov rax, 0x0101010101010101
mov rcx, 0x8080808080808080
.word_loop:
mov rdx, [rdi] ; load 8 bytes
mov r8, rdx
sub r8, rax ; r8 = word - 0x0101...
not rdx ; rdx = ~word
and r8, rdx ; r8 = (word - 0x0101...) & ~word
and r8, rcx ; r8 & 0x8080... = non-zero iff null byte present
jnz .found_null_in_word
add rdi, 8
jmp .word_loop
.found_null_in_word:
; Find which byte in the 8-byte word is null
; (simplified: just do byte-by-byte from here)
.check_bytes:
cmp BYTE [rdi], 0
je .found_null
inc rdi
jmp .check_bytes
.found_null:
sub rdi, rsi ; length = (null address) - (start address)
mov rax, rdi
ret
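The byte-by-byte fallback in `.found_null_in_word` can be avoided, because the mask left in R8 already says where the zero byte is. The sketch below shows the idea in isolation; the file name `first_zero.asm` and the label `sample` are hypothetical, and the approach relies on x86-64 being little-endian, so the lowest set bit of the mask belongs to the first zero byte in memory.

```nasm
; first_zero.asm -- find the index of the first zero byte in an 8-byte word
; Build: nasm -f elf64 first_zero.asm -o first_zero.o && ld first_zero.o -o first_zero
; Run:   ./first_zero ; echo $?     (index of the zero byte, or 8 if none)

section .data
align 8
sample  db "abc", 0, "wxyz"         ; zero byte at index 3

section .text
global _start

_start:
    mov  rdx, [rel sample]          ; load 8 bytes
    mov  rax, 0x0101010101010101
    mov  rcx, 0x8080808080808080
    mov  r8, rdx
    sub  r8, rax                    ; r8 = word - 0x0101...
    not  rdx                        ; rdx = ~word
    and  r8, rdx
    and  r8, rcx                    ; nonzero iff the word has a zero byte
    jz   .none

    bsf  rax, r8                    ; lowest set bit = 8*index + 7
    shr  rax, 3                     ; divide by 8: byte index
    mov  edi, eax
    jmp  .exit
.none:
    mov  edi, 8                     ; no zero byte in this word
.exit:
    mov  eax, 60                    ; SYS_EXIT
    syscall
```

For the sample data the status should be 3. Only the lowest set bit is trustworthy: the subtraction can borrow into higher bytes and set extra mask bits above the first zero byte, which is why BSF (scan from bit 0) is the right instruction here. In `strlen_v3`, the same two instructions followed by `add rdi, rax` would replace the `.check_bytes` loop.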
Implementation 4: AVX2 SIMD (32 bytes at a time)
; strlen_v4: AVX2 version -- requires AVX2 support (check CPUID first!)
; Args: rdi = string pointer
; Returns: rax = length
strlen_v4:
; Zero the YMM register for comparison
vpxor ymm0, ymm0, ymm0 ; ymm0 = all zeros (the null byte repeated 32 times)
mov rax, rdi
and rax, ~31 ; align down to a 32-byte boundary
mov ecx, edi
and ecx, 31 ; ecx = offset of the string inside that block
; First block: an aligned 32-byte load never crosses a page boundary, but it
; may include bytes that come before the string, so their mask bits are dropped.
vmovdqa ymm1, [rax] ; load 32 aligned bytes
vpcmpeqb ymm2, ymm1, ymm0 ; compare each byte with 0; ymm2[i] = 0xFF if byte[i]==0
vpmovmskb edx, ymm2 ; edx = 32-bit mask; bit i = 1 if ymm2[i] was 0xFF
shr edx, cl ; discard the bits for bytes before the string...
shl edx, cl ; ...and move the remaining bits back into position
test edx, edx ; any zero bytes found?
jnz .found_zero
.avx_loop:
add rax, 32
vmovdqa ymm1, [rax]
vpcmpeqb ymm2, ymm1, ymm0
vpmovmskb edx, ymm2
test edx, edx
jz .avx_loop
.found_zero:
bsf edx, edx ; bit scan forward: edx = position of lowest set bit
add rax, rdx ; pointer to null byte
sub rax, rdi ; length = null_address - start
vzeroupper ; clear upper YMM state (avoids AVX/SSE transition penalties)
ret
Performance comparison for a 100-byte string:
- v1 (naive): ~100 iterations = ~100 cycles
- v2 (SCASB): ~100 iterations = ~80 cycles
- v3 (8-byte): ~13 iterations = ~40 cycles
- v4 (AVX2): ~4 iterations = ~15 cycles
These are rough estimates; actual performance depends on cache behavior, branch prediction, and CPU microarchitecture. The SIMD version (v4) is roughly 5-7x faster than the naive version for medium-length strings.
The MinOS Kernel Project: Step 1
The MinOS kernel project begins here. The goal for this chapter is to write a BIOS-compatible bootloader that:
1. Loads at address 0x7C00 (where the BIOS places a 512-byte boot sector)
2. Sets up a minimal 16-bit environment
3. Prints a message to screen using BIOS interrupts
4. Halts
This is the first 512 bytes of what will become the MinOS operating system.
; minos/boot/boot.asm -- MinOS Stage 1 Bootloader
; Loaded by BIOS at 0x7C00, executed in 16-bit real mode
;
; Build:
; nasm -f bin boot.asm -o boot.bin
; Test:
; qemu-system-x86_64 -fda boot.bin
; Verify:
; wc -c boot.bin # must be 512
; xxd boot.bin | tail -1 # must end with 55 aa
; Tell NASM: generate 16-bit code
BITS 16
; Tell NASM: assume code is loaded at address 0x7C00
ORG 0x7C00
; ============================================================
; Entry point: BIOS jumps here
; At entry:
; CS:IP = 0x0000:0x7C00 (or 0x07C0:0x0000 -- both name the same physical address)
; DL = boot drive number
; ============================================================
_start:
; Step 1: Establish a known segment environment
; The BIOS may have CS set to 0x07C0 or 0x0000; we normalize to 0x0000
jmp 0x0000:init ; far jump to force CS=0
init:
; Set all segment registers to 0
xor ax, ax
mov ds, ax ; data segment = 0
mov es, ax ; extra segment = 0
mov fs, ax ; FS = 0
mov gs, ax ; GS = 0
mov ss, ax ; stack segment = 0
mov sp, 0x7C00 ; stack pointer: grows down from 0x7C00
; (below our bootloader -- safe for small stacks)
; Step 2: Clear screen using BIOS INT 10h
; AH=0x00, AL=0x03 = set video mode to 80x25 color text
mov ah, 0x00
mov al, 0x03
int 0x10
; Step 3: Print the welcome message
lea si, [welcome_msg] ; SI = address of message
call print_string_16
; Step 4: Halt -- we'll add more in later chapters
.halt:
hlt
jmp .halt ; HLT resumes after any interrupt (including NMI), so loop back
; ============================================================
; print_string_16: print null-terminated string using BIOS
; Args: SI = pointer to null-terminated string
; Clobbers: AX, BX, SI
; ============================================================
print_string_16:
mov bh, 0 ; page number
mov bl, 0x07 ; text attribute (light gray on black)
.loop:
lodsb ; AL = [SI], SI++
test al, al ; null terminator?
jz .done
mov ah, 0x0E ; BIOS function: teletype output
int 0x10 ; BIOS video interrupt
jmp .loop
.done:
ret
; ============================================================
; Data
; ============================================================
welcome_msg:
db "MinOS Bootloader v0.1", 13, 10 ; CR+LF for 16-bit text mode
db "Initializing...", 13, 10
db 0 ; null terminator
; ============================================================
; Boot Signature
; ============================================================
; Pad to exactly 510 bytes, then write the boot signature 0xAA55
times 510 - ($ - $$) db 0 ; fill with zeros up to byte 510
dw 0xAA55 ; BIOS boot signature (little-endian: 0x55, 0xAA)
📐 OS Kernel Project — Step 1: This is the first file of MinOS. Assemble it with
`nasm -f bin boot/boot.asm -o boot.bin`, verify it's 512 bytes with `wc -c boot.bin`, and test it in QEMU with `qemu-system-x86_64 -fda boot.bin`. You should see "MinOS Bootloader v0.1" and "Initializing..." on a black screen. In Chapter 19, this bootloader will be extended to enter 32-bit protected mode. In Chapter 28, it will switch to 64-bit long mode.
Summary
This chapter wrote real programs:
- Hello world with a full register trace showing every register state change
- Integer-to-ASCII conversion with the DIV instruction and backward-forward digit accumulation
- stdin reading with sys_read and echo
- Four implementations of strlen showing the performance progression from naive loop to SIMD
- The first 512 bytes of the MinOS bootloader
The fundamental patterns established here:
1. Set up syscall registers (RAX = number, RDI/RSI/RDX = args), execute syscall, check RAX return value
2. Save values across syscalls in callee-saved registers (RBX, R12-R15) or stack
3. Integer division: DIV requires zeroing RDX first; IDIV requires sign-extending RAX into RDX (CQO) instead
4. Register traces are how you verify assembly programs
🔄 Check Your Understanding: Program 4 uses RBX to hold the sum across a syscall. Could it have used RCX instead?
Answer
No. The `syscall` instruction saves RIP to RCX and RFLAGS to R11. After any syscall returns to user space, RCX contains the address of the instruction after the syscall, not whatever was in RCX before. R11 contains the RFLAGS value from before the syscall. Whatever the program had stored in either register is gone.

RBX is a callee-saved register: the System V ABI requires that functions preserve RBX across calls. `syscall` behaves differently from a regular function call (it has its own clobbering rules for RAX, RCX, and R11), but RBX is safe under both sets of rules, which makes it one of the safe choices for preserving a value across a syscall. The others are RBP and R12-R15 (all callee-saved in the System V ABI).