Chapter 25: System Calls

The Controlled Entry Point into Kernel Mode

Every program that does anything useful — reads a file, writes to a terminal, allocates memory, communicates over a network — eventually needs the kernel's help. The kernel has privileges your program does not: it can touch hardware directly, manage physical memory, and arbitrate between competing processes. The mechanism by which your program asks the kernel for help is the system call.

A system call is not a function call. It is a supervised privilege escalation: the CPU transitions from ring 3 (user mode) to ring 0 (kernel mode), executes a specific kernel-provided handler, then transitions back. The transition is controlled, the kernel handler is fixed, and your user-mode code cannot skip or bypass the transition. This is by design.

Understanding system calls at the assembly level is not academic. It is necessary for writing programs that run without a C library, for analyzing security vulnerabilities, for debugging programs whose behavior differs from their source code, and for implementing the kernel side of the interface.


The syscall Instruction (x86-64)

The syscall instruction is how user-mode x86-64 code enters the kernel. It is not an interrupt (though INT 0x80 served this purpose on 32-bit Linux and still works for compatibility). It is a dedicated fast-path instruction that does the following in hardware:

  1. Saves the address of the next instruction (RIP) into RCX
  2. Saves the current flags register (RFLAGS) into R11
  3. Loads CS and SS with the kernel-mode selectors derived from the IA32_STAR MSR
  4. Clears the RFLAGS bits that are set in the IA32_FMASK MSR (kernels put IF in this mask so that interrupts are disabled on entry)
  5. Jumps to the address in the IA32_LSTAR MSR (the kernel syscall entry point)

Note what is missing from this list: syscall does not touch RSP. The kernel entry code is still running on the user stack and must switch to a kernel stack itself.

After the kernel processes the request, it executes SYSRET, which reverses the process: restores RIP from RCX, restores RFLAGS from R11, and returns to ring 3.
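The register traffic can be made concrete with a toy model. The sketch below is a simulation in Python, not real CPU state: the `Cpu` class, the function names, and the sample addresses are all made up for illustration. It also simplifies SYSRET, which on real hardware adjusts a few RFLAGS bits. It only shows which registers SYSCALL overwrites and that RSP is left alone.

```python
from dataclasses import dataclass

IF_FLAG = 1 << 9  # RFLAGS.IF, the interrupt-enable bit


@dataclass
class Cpu:
    rip: int
    rsp: int
    rflags: int
    rcx: int = 0
    r11: int = 0
    cpl: int = 3  # current privilege level: 3 = user, 0 = kernel


def do_syscall(cpu: Cpu, lstar: int, fmask: int, insn_len: int = 2) -> None:
    """Model what SYSCALL changes; RSP is deliberately left untouched."""
    cpu.rcx = cpu.rip + insn_len   # return address: the instruction after SYSCALL
    cpu.r11 = cpu.rflags           # saved user flags
    cpu.rflags &= ~fmask           # clear every bit that is set in IA32_FMASK
    cpu.cpl = 0                    # CS/SS now hold kernel selectors
    cpu.rip = lstar                # continue at the kernel entry point


def do_sysret(cpu: Cpu) -> None:
    """Model SYSRET (simplified): resume user code from RCX/R11."""
    cpu.rip = cpu.rcx
    cpu.rflags = cpu.r11
    cpu.cpl = 3


# Hypothetical values: user code at 0x401000, RCX/R11 holding data the program cares about.
cpu = Cpu(rip=0x401000, rsp=0x7FFD0000, rflags=0x246, rcx=0x1234, r11=0x5678)
do_syscall(cpu, lstar=0xFFFFFFFF81000000, fmask=IF_FLAG)
in_kernel = (cpu.rcx, cpu.r11, cpu.rflags, cpu.rsp, cpu.cpl)
do_sysret(cpu)
print([hex(v) for v in in_kernel], hex(cpu.rip), cpu.cpl)
```

The old contents of RCX and R11 (0x1234 and 0x5678 here) are gone after the round trip, which is the reason neither register can carry an argument.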

⚠️ Common Mistake: Because syscall saves RIP to RCX and RFLAGS to R11, both of these registers are destroyed by a syscall instruction. Do not pass arguments in RCX or R11, and do not expect them to survive a system call. This is different from the regular calling convention.

🔍 Under the Hood: The IA32_LSTAR MSR (address 0xC0000082) contains the address of the kernel's syscall entry point. On Linux, this is the entry_SYSCALL_64 function in arch/x86/entry/entry_64.S. You can read the MSR with rdmsr in ring 0, or look up the address of entry_SYSCALL_64 in /proc/kallsyms (as root; unprivileged readers typically see zeroed addresses).


Linux x86-64 Syscall Calling Convention

The Linux kernel defines a specific convention for how to invoke system calls:

| Register | Role |
|----------|------|
| RAX | Syscall number (input) / Return value (output) |
| RDI | Argument 1 |
| RSI | Argument 2 |
| RDX | Argument 3 |
| R10 | Argument 4 (NOT RCX — that gets destroyed by syscall) |
| R8 | Argument 5 |
| R9 | Argument 6 |

The return value is placed in RAX. If the syscall fails, RAX holds a negative value in the range -4095 to -1, and the error code is its negation (errno == -RAX). For example, if RAX returns -13, the error is EACCES (permission denied).

⚠️ Common Mistake: Notice that argument 4 uses R10, not RCX. The C calling convention uses RCX for the fourth argument, but syscall destroys RCX. The Linux syscall wrapper in glibc handles this by moving RCX to R10 before executing syscall. If you are writing raw syscall wrappers, you must do this yourself.
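To see that a syscall really is selected by a bare number, the sketch below invokes getpid by number from Python. It goes through the C library's generic `syscall()` helper via `ctypes`, so the helper performs the register loading described above; the example demonstrates the numbering, not the instruction itself. The `GETPID_NR` table is an illustrative name, and its numbers are the ones listed in this chapter's x86-64 and ARM64 tables. It assumes a Linux system on one of those two architectures.

```python
import ctypes
import os
import platform

# Syscall numbers from this chapter's tables; other architectures are not covered here.
GETPID_NR = {"x86_64": 39, "aarch64": 172}

libc = ctypes.CDLL(None, use_errno=True)  # the C library already loaded into this process
libc.syscall.restype = ctypes.c_long      # syscall() returns a long


def raw_getpid() -> int:
    """Invoke getpid by number through libc's generic syscall() wrapper."""
    nr = GETPID_NR[platform.machine()]
    return libc.syscall(ctypes.c_long(nr))


print(raw_getpid(), os.getpid())
```

Both numbers printed should match, because `os.getpid()` reaches the same kernel entry through the named wrapper.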


Key System Calls with Complete NASM Examples

sys_write (1) — Write to a File Descriptor

; Write "Hello, syscall!\n" to stdout (fd=1)
; Returns number of bytes written, or negative error

section .data
    message db "Hello, syscall!", 10  ; 10 = newline
    msg_len equ $ - message

section .text
    global _start

_start:
    mov rax, 1          ; sys_write
    mov rdi, 1          ; fd = 1 (stdout)
    mov rsi, message    ; pointer to buffer
    mov rdx, msg_len    ; number of bytes to write
    syscall             ; enter kernel
    ; RAX now contains bytes written, or negative errno

    mov rax, 60         ; sys_exit
    xor rdi, rdi        ; exit code 0
    syscall

sys_read (0) — Read from a File Descriptor

; Read up to 64 bytes from stdin (fd=0) into buffer

section .bss
    buf resb 64         ; reserve 64 bytes

section .text
    global _start

_start:
    mov rax, 0          ; sys_read
    mov rdi, 0          ; fd = 0 (stdin)
    mov rsi, buf        ; destination buffer
    mov rdx, 64         ; max bytes to read
    syscall
    ; RAX = bytes actually read, or negative errno
    ; 0 = EOF (stdin closed)

    mov rax, 60         ; sys_exit, so execution does not run off the end
    xor rdi, rdi        ; exit code 0
    syscall

sys_open (2) — Open a File

; Open a file for reading, returns file descriptor

section .data
    filename db "/etc/hostname", 0  ; null-terminated path

section .text
    global _start

_start:
    mov rax, 2          ; sys_open
    mov rdi, filename   ; pathname
    mov rsi, 0          ; flags: O_RDONLY = 0
    mov rdx, 0          ; mode (only used when O_CREAT is set)
    syscall
    ; RAX = file descriptor (non-negative), or negative errno
    ; Common errors: -2 = ENOENT (file not found), -13 = EACCES
    test rax, rax
    js .exit            ; open failed: there is no fd to use or close

    ; Save fd for later use
    mov rbx, rax        ; fd in RBX (the kernel preserves it; only RAX, RCX, R11 change)

    ; ... read from rbx ...

    ; sys_close (3): close the file descriptor
    mov rax, 3          ; sys_close
    mov rdi, rbx        ; fd to close
    syscall

.exit:
    mov rax, 60         ; sys_exit
    xor rdi, rdi        ; exit code 0
    syscall

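📊 C Comparison: open(filename, O_RDONLY) in C ends up as essentially this sequence (current glibc issues openat rather than open, as the strace output later in this chapter shows). The difference is that glibc also converts the negative return value to -1 and stores the error code in errno. For syscalls with four or more arguments it additionally performs the RCX→R10 move; open takes at most three, so none is needed here.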

sys_mmap (9) — Map Memory

mmap is how you allocate large blocks of memory, load shared libraries, and map files into memory. The kernel finds a free region of virtual address space and maps it.

; Allocate 4096 bytes (one page) of anonymous memory
; mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)

section .text
_start:
    mov rax, 9          ; sys_mmap
    xor rdi, rdi        ; addr = NULL (kernel chooses)
    mov rsi, 4096       ; length = 4096 bytes (one page)
    mov rdx, 3          ; prot = PROT_READ(1) | PROT_WRITE(2) = 3
    mov r10, 0x22       ; flags = MAP_PRIVATE(0x2) | MAP_ANONYMOUS(0x20) = 0x22
    mov r8, -1          ; fd = -1 (anonymous, no file backing)
    xor r9, r9          ; offset = 0
    syscall
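    ; RAX = virtual address of mapped region, or negative errno (-4095..-1)
    ; MAP_FAILED (-1) is a libc convention; the raw syscall returns e.g. -12 (ENOMEM)
    ; Check: cmp rax, -4096; ja .error   (unsigned compare)

    ; The returned address is already usable — no extra setup needed
    mov rcx, 0xDEADBEEFCAFEBABE   ; a 64-bit immediate must go through a register
    mov [rax], rcx                ; write to it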
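For comparison, the same request can be made from Python's standard `mmap` module, which calls mmap on your behalf. This is a sketch of the equivalent high-level call, assuming a Unix system; the variable names are illustrative.

```python
import mmap
import struct

PAGE = 4096

# Same request as the NASM listing: fd = -1, private anonymous mapping, read/write.
region = mmap.mmap(
    -1,
    PAGE,
    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
    prot=mmap.PROT_READ | mmap.PROT_WRITE,
)

# Store the 64-bit pattern at offset 0 in little-endian order, as an x86-64 store would.
region[0:8] = struct.pack("<Q", 0xDEADBEEFCAFEBABE)
first_qword = struct.unpack("<Q", region[0:8])[0]
untouched = region[8:16]   # anonymous pages start out zero-filled
region.close()             # releases the mapping (the munmap step)

print(hex(first_qword), untouched)
```

Running the script under strace should show the anonymous mmap and the matching munmap among the interpreter's other calls.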

sys_munmap (11) — Unmap Memory

; Unmap the region we allocated above
; RBX = address returned by mmap, RSI = length
    mov rax, 11         ; sys_munmap
    mov rdi, rbx        ; addr to unmap
    mov rsi, 4096       ; length
    syscall

sys_brk (12) — Heap Management

brk is the traditional heap management interface. The "program break" is the end of the data segment; moving it up allocates memory, moving it down frees it.

; Get current program break (pass 0 to query)
    mov rax, 12         ; sys_brk
    xor rdi, rdi        ; 0 = query current break
    syscall
    ; RAX = current program break address

    mov rbx, rax        ; save current break

; Extend the heap by 4096 bytes
    add rbx, 4096
    mov rax, 12         ; sys_brk
    mov rdi, rbx        ; new break address
    syscall
    ; RAX = actual new break (may differ if allocation failed)
    cmp rax, rbx        ; did we get what we asked for?
    jne .alloc_failed

sys_fork (57) — Create a Child Process

; Fork the current process
    mov rax, 57         ; sys_fork
    syscall
    ; After fork:
    ; In parent: RAX = PID of child (positive)
    ; In child:  RAX = 0
    ; Error:     RAX = negative errno

    test rax, rax
    js  .fork_failed    ; negative = error
    jz  .child_code     ; zero = we are the child
    ; fall through: we are the parent, RAX = child PID

.parent_code:
    ; ... parent code ...
    jmp .done

.child_code:
    ; ... child code ...
    mov rax, 60         ; child: exit
    xor rdi, rdi
    syscall

.fork_failed:
    neg rax             ; RAX now = positive errno
    ; handle error

sys_execve (59) — Replace Process Image

; Execute /bin/sh with no arguments and minimal environment
; execve(path, argv, envp)

section .data
    shell   db "/bin/sh", 0
    argv0   dq shell            ; argv[0] = "/bin/sh"
    argv_null dq 0              ; argv[1] = NULL (terminate argv)
    envp_null dq 0              ; envp[0] = NULL (empty environment)

section .text
    mov rax, 59         ; sys_execve
    mov rdi, shell      ; path
    mov rsi, argv0      ; argv array (must end with NULL pointer)
    mov rdx, envp_null  ; envp array (must end with NULL pointer)
    syscall
    ; If execve succeeds, this code never executes — the process image is replaced
    ; If execve fails, RAX = negative errno (program still running)

sys_exit (60) — Terminate Process

    mov rax, 60         ; sys_exit
    mov rdi, 0          ; exit status (0 = success)
    syscall
    ; This instruction never returns

sys_socket (41) and sys_connect (42) — Network

; Create a TCP socket
; socket(AF_INET=2, SOCK_STREAM=1, 0)
    mov rax, 41         ; sys_socket
    mov rdi, 2          ; AF_INET
    mov rsi, 1          ; SOCK_STREAM
    xor rdx, rdx        ; protocol = 0 (auto)
    syscall
    ; RAX = socket file descriptor
    mov rbx, rax        ; save sockfd

; Connect to 127.0.0.1:8080
section .data
    ; struct sockaddr_in: sa_family(2), port_be(2), addr_be(4), zero(8)
    sockaddr:
        dw 2            ; AF_INET (host byte order, 2 bytes)
        dw 0x901F       ; port 8080 = 0x1F90; written byte-swapped so memory holds 1F 90 (network order)
        dd 0x0100007F   ; 127.0.0.1; memory holds 7F 00 00 01 (network order)
        dq 0            ; padding

section .text
    mov rax, 42         ; sys_connect
    mov rdi, rbx        ; sockfd
    mov rsi, sockaddr   ; struct sockaddr *
    mov rdx, 16         ; sizeof(struct sockaddr_in)
    syscall
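The byte order in that structure is easy to get wrong, so it helps to compute the layout independently. The sketch below builds the same 16 bytes with Python's `struct` and `socket` modules and compares them with what NASM emits for the `dw`/`dd`/`dq` directives above. The function name `sockaddr_in` is just a label for this example, and the comparison assumes a little-endian host.

```python
import socket
import struct


def sockaddr_in(ip: str, port: int) -> bytes:
    """Build the 16-byte struct sockaddr_in that connect() expects on Linux."""
    return (
        struct.pack("=H", socket.AF_INET)  # sin_family: host byte order
        + struct.pack("!H", port)          # sin_port: network (big-endian) order
        + socket.inet_aton(ip)             # sin_addr: 4 bytes, network order
        + bytes(8)                         # sin_zero padding
    )


addr = sockaddr_in("127.0.0.1", 8080)

# What NASM emits on little-endian x86-64 for: dw 2 / dw 0x901F / dd 0x0100007F / dq 0
nasm_bytes = struct.pack("<HHIQ", 2, 0x901F, 0x0100007F, 0)

print(addr.hex(" "))
print(addr == nasm_bytes)
```

If the two byte strings ever differ after you change the port or address in the assembly, the constant was written in the wrong byte order.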

sys_getpid (39), sys_kill (62)

; Get own PID
    mov rax, 39         ; sys_getpid
    syscall
    ; RAX = PID
    mov rbx, rax

; Send SIGTERM (15) to ourselves
    mov rax, 62         ; sys_kill
    mov rdi, rbx        ; pid
    mov rsi, 15         ; SIGTERM
    syscall

Syscall Reference Table (Key Linux x86-64 Entries)

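| RAX | Name | RDI | RSI | RDX | R10 |
|-----|------|-----|-----|-----|-----|
| 0 | read | fd | buf* | count | |
| 1 | write | fd | buf* | count | |
| 2 | open | path* | flags | mode | |
| 3 | close | fd | | | |
| 4 | stat | path* | stat_buf* | | |
| 5 | fstat | fd | stat_buf* | | |
| 9 | mmap | addr | length | prot | flags |
| 10 | mprotect | addr | length | prot | |
| 11 | munmap | addr | length | | |
| 12 | brk | addr | | | |
| 16 | ioctl | fd | request | arg | |
| 22 | pipe | fds[2]* | | | |
| 32 | dup | fd | | | |
| 39 | getpid | | | | |
| 41 | socket | domain | type | protocol | |
| 42 | connect | sockfd | addr* | addrlen | |
| 43 | accept | sockfd | addr* | addrlen* | |
| 44 | sendto | sockfd | buf* | len | flags |
| 45 | recvfrom | sockfd | buf* | len | flags |
| 49 | bind | sockfd | addr* | addrlen | |
| 50 | listen | sockfd | backlog | | |
| 57 | fork | | | | |
| 59 | execve | path* | argv** | envp** | |
| 60 | exit | status | | | |
| 61 | wait4 | pid | status* | options | rusage* |
| 62 | kill | pid | sig | | |
| 96 | gettimeofday | tv* | tz* | | |
| 102 | getuid | | | | |
| 160 | setrlimit | resource | rlim* | | |
| 231 | exit_group | status | | | |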

Writing a Minimal libc

The C standard library is largely a collection of system call wrappers plus utility functions. Here is how to implement the core wrappers yourself:

; minimal_libc.asm — A minimal libc in NASM
; Compile: nasm -f elf64 minimal_libc.asm -o minimal_libc.o

section .text

; ssize_t write(int fd, const void *buf, size_t count)
; Returns: bytes written, or -1 (sets errno via global)
global write
write:
    mov rax, 1              ; sys_write
    ; rdi, rsi, rdx already set by caller (standard C calling convention)
    syscall
    test rax, rax
    jns .ok                 ; non-negative = success
    neg rax                 ; rax = positive errno value
    mov [rel errno_val], eax ; store in errno
    mov rax, -1             ; return -1 (C convention for error)
.ok:
    ret

; ssize_t read(int fd, void *buf, size_t count)
global read
read:
    mov rax, 0              ; sys_read
    syscall
    test rax, rax
    jns .ok
    neg rax
    mov [rel errno_val], eax
    mov rax, -1
.ok:
    ret

; int open(const char *path, int flags, mode_t mode)
global open
open:
    mov rax, 2              ; sys_open
    syscall
    test rax, rax
    jns .ok
    neg rax
    mov [rel errno_val], eax
    mov rax, -1
.ok:
    ret

; int close(int fd)
global close_fd             ; avoid name collision with close()
close_fd:
    mov rax, 3              ; sys_close
    syscall
    test rax, rax
    jns .ok
    neg rax
    mov [rel errno_val], eax
    mov rax, -1
.ok:
    ret

; void _exit(int status): does not return
global _exit
_exit:
    mov rax, 60             ; sys_exit
    syscall
    ; Never returns
    hlt                     ; should never execute

; void *mmap(void *addr, size_t len, int prot, int flags, int fd, off_t off)
; Note: 4th arg is flags → must be in R10, not RCX
global mmap
mmap:
    ; C calling conv: rdi=addr, rsi=len, rdx=prot, rcx=flags, r8=fd, r9=off
    mov r10, rcx            ; flags: move from RCX to R10 (syscall convention)
    mov rax, 9              ; sys_mmap
    syscall
    cmp rax, -4096          ; errors are -4095..-1, i.e. above -4096 as unsigned values
    jbe .ok
    neg rax
    mov [rel errno_val], eax
    mov rax, -1             ; MAP_FAILED is (void *)-1
.ok:
    ret

section .bss
errno_val: resd 1           ; our errno variable

📊 C Comparison: When you link against glibc and call write(), you are calling a function that does essentially this: it moves the syscall number into RAX, executes syscall, and converts negative returns to -1/errno. For syscalls with four or more arguments, such as mmap, it also moves the fourth argument from RCX to R10. The "magic" of the C standard library is mostly bookkeeping.

The errno Mechanism

The kernel never touches user-space errno. Instead, it returns a negative value in RAX. The libc wrapper converts this: if RAX < 0, libc stores -RAX in errno and returns -1. This is why errno is only meaningful immediately after a failed system call — it's a thread-local global that any subsequent syscall wrapper can overwrite.

In assembly with no libc, you manage this yourself. The pattern above shows one approach: an errno_val in the BSS segment that your wrappers write to on error.
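The conversion rule is small enough to state as a function. The sketch below models it in Python on plain integers; `decode_syscall_return` is a hypothetical helper name, and it uses the Linux convention that only the values -4095 through -1 denote errors, the same bound the mmap wrapper above checks.

```python
import errno
import os


def decode_syscall_return(rax: int):
    """Interpret a raw 64-bit RAX value the way a libc wrapper would.

    Returns (result, errno_value): (rax, 0) on success, (-1, positive errno) on failure.
    """
    rax &= 0xFFFFFFFFFFFFFFFF        # treat the input as a 64-bit register
    if rax > 0xFFFFFFFFFFFFF000:     # unsigned view of -4095..-1
        return -1, (1 << 64) - rax   # errno = -RAX
    return rax, 0


result, err = decode_syscall_return(-13)
print(result, errno.errorcode[err], os.strerror(err))
```

Comparing against the unsigned bound rather than testing the sign bit keeps the rule correct even for calls whose successful result is a very large value.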


strace: Tracing System Calls of a Running Program

strace is one of the most powerful debugging tools for Linux programs. It intercepts every system call the program makes and prints the call with arguments and return value.

# Basic usage: trace all syscalls of a program
strace ./myprogram

# Trace with statistics: count each syscall type
strace -c ./myprogram

# Trace only specific syscalls (e.g., file-related)
strace -e trace=file ./myprogram

# Trace an already-running process by PID
strace -p 12345

# Show timestamps
strace -t ./myprogram

# Save output to file
strace -o strace.log ./myprogram

Sample strace output from running /bin/ls:

execve("/bin/ls", ["/bin/ls"], 0x7ffd... /* 24 vars */) = 0
brk(NULL)                               = 0x55a3b4001000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8b12345000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=180345, ...}) = 0
mmap(NULL, 180345, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f8b12318000
close(3)                                = 0
...
write(1, "bin  etc  lib  tmp  usr\n", 24) = 24
close(1)                                = 0
exit_group(0)                           = ?

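Reading this output: each line is one syscall, with its arguments in parentheses and the return value after the = sign. A failed call shows -1 followed by the errno name and message, which is how strace presents the negative value the kernel returned in RAX. These are the same calls your NASM code issues with the syscall instruction.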
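Because the format is so regular, a trace saved with -o can be picked apart mechanically. The following is a minimal sketch of a line parser in Python, tested only against lines shaped like the sample above; the regular expression and the name `parse_strace_line` are this example's own, and real traces contain further line types (signals, unfinished calls) that it simply skips.

```python
import re

LINE_RE = re.compile(
    r"^(?P<name>\w+)\((?P<args>.*)\)\s*=\s*(?P<ret>\?|-?\w+)"
    r"(?:\s+(?P<errno>E[A-Z0-9]+)\s+\((?P<message>.*)\))?\s*$"
)


def parse_strace_line(line: str):
    """Split one strace line into name, raw argument text, return value, errno name."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # e.g. the "..." elision, signal notices, exit notices
    return m.groupdict()


sample = [
    'access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)',
    'write(1, "bin  etc  lib  tmp  usr\\n", 24) = 24',
    'exit_group(0)                           = ?',
]
for line in sample:
    print(parse_strace_line(line))
```

A dictionary per line is enough to answer questions such as "which calls failed, and with which errno" without reading the log by eye.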


A Minimal Shell Using Only Raw Syscalls

Here is a complete, working shell that uses no libc — only raw system calls. It reads a line, executes it with /bin/sh -c, and repeats.

; minish.asm — A minimal shell using only raw system calls
; Build: nasm -f elf64 minish.asm -o minish.o && ld minish.o -o minish
; Note: commands run as "sh -c <input>" for simplicity

section .data
    prompt      db "minish> ", 0
    prompt_len  equ $ - prompt - 1
    sh_path     db "/bin/sh", 0
    sh_arg0     dq sh_path
    sh_arg1     db "-c", 0
    sh_arg1_ptr dq sh_arg1
    cmd_ptr     dq 0            ; will point to user's command
    argv_null   dq 0

section .bss
    input_buf   resb 256        ; input line buffer

section .text
    global _start

_start:
.loop:
    ; Print prompt
    mov rax, 1
    mov rdi, 1
    mov rsi, prompt
    mov rdx, prompt_len
    syscall

    ; Read a line from stdin
    mov rax, 0
    mov rdi, 0
    mov rsi, input_buf
    mov rdx, 255
    syscall
    test rax, rax
    jle .exit           ; EOF or error: exit

    ; Null-terminate the input (replace newline with 0)
    mov rbx, rax        ; save byte count
    dec rbx
    mov byte [input_buf + rbx], 0

    ; Set cmd_ptr to point to our input
    mov qword [cmd_ptr], input_buf

    ; Fork a child to run the command
    mov rax, 57         ; sys_fork
    syscall
    test rax, rax
    js .loop            ; fork failed, try again
    jz .child           ; in child process

    ; Parent: wait for child to finish
    mov rdi, rax        ; child PID
    xor rsi, rsi        ; *status = NULL (don't care)
    xor rdx, rdx        ; options = 0
    xor r10, r10        ; *rusage = NULL
    mov rax, 61         ; sys_wait4
    syscall
    jmp .loop           ; read next command

.child:
    ; Build argv: ["/bin/sh", "-c", cmd, NULL]
    ; We already have sh_arg0, sh_arg1_ptr, cmd_ptr, argv_null set up
    ; but they're scattered. Build a proper argv array on the stack:
    sub rsp, 32
    mov qword [rsp],    sh_path     ; argv[0] = "/bin/sh"
    mov qword [rsp+8],  sh_arg1     ; argv[1] = "-c"
    mov rax, input_buf
    mov qword [rsp+16], rax         ; argv[2] = command
    mov qword [rsp+24], 0           ; argv[3] = NULL

    ; envp = { NULL }: an empty environment. A NULL envp would not inherit
    ; anything; to pass the parent's variables you would forward the envp
    ; pointer that _start receives on the stack.
    sub rsp, 8
    mov qword [rsp], 0  ; envp[0] = NULL
    mov rdx, rsp        ; envp

    mov rax, 59         ; sys_execve
    mov rdi, sh_path    ; path
    lea rsi, [rsp+8]    ; argv (built above, just past the envp slot)
    syscall
    ; If execve fails, exit the child
    mov rax, 60
    mov rdi, 1
    syscall

.exit:
    mov rax, 60
    xor rdi, rdi
    syscall
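The control flow of minish is the classic fork / execve / wait4 pattern, and it may be easier to follow once with the register shuffling removed. The sketch below expresses one iteration of that loop in Python using the `os` module's thin wrappers around the same calls. `run_command` is an illustrative name, and unlike minish it collects the child's status instead of passing NULL.

```python
import os


def run_command(command: str) -> int:
    """Run `command` via /bin/sh -c in a child process and return its exit status."""
    pid = os.fork()  # fork: returns 0 in the child, the child's PID in the parent
    if pid == 0:
        try:
            # argv = ["/bin/sh", "-c", command]; empty environment, like envp = { NULL }
            os.execve("/bin/sh", ["/bin/sh", "-c", command], {})
        finally:
            os._exit(1)  # only reached if execve failed
    _, status, _rusage = os.wait4(pid, 0)  # wait4: block until that child terminates
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return 128 + os.WTERMSIG(status)  # shell-style convention for death by signal


print(run_command("exit 3"))
```

The status word that wait4 fills in is encoded, which is why the helper decodes it with WIFEXITED and WEXITSTATUS rather than returning it directly; an assembly version has to do the same bit extraction by hand.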

🛠️ Lab Exercise: Compile and run minish. Test it with simple commands like ls, pwd, echo hello. Add a check that prints the exit status of each child process (the status from wait4). Add built-in cd support — remember that cd cannot be implemented as a child process because chdir only affects the current process.


vDSO: Virtual Dynamic Shared Object

Some system calls are so frequent that the overhead of the ring transition is significant. gettimeofday is called millions of times per second in some applications. The kernel addresses this with the vDSO (virtual dynamic shared object): a small shared library that the kernel maps into every process's address space.

The vDSO contains implementations of a few time-related functions (gettimeofday, clock_gettime, time) that read directly from a kernel-maintained page of shared memory, without executing a real syscall instruction at all.

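# Verify vDSO is mapped in your process
grep vdso /proc/self/maps
# Output: 7ffca8b2c000-7ffca8b2e000 r-xp 00000000 00:00 0  [vdso]

# Disassemble the vDSO (bash syntax).
# Caveat: /proc/self refers to whichever process opens it, and with address
# randomization each process gets its own vDSO address. The address grep reports
# is therefore not necessarily where dd's vDSO lives; the pipeline only works if
# both commands see the same layout (for example with randomization disabled).
vdso_addr=$(grep vdso /proc/self/maps | awk -F- '{print $1}')
dd if=/proc/self/mem of=vdso.bin bs=4096 count=2 skip=$((16#$vdso_addr / 4096)) 2>/dev/null
objdump -D -b binary -m i386:x86-64 vdso.bin

; Calling gettimeofday through vDSO (via PLT in practice)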
; In C: gettimeofday(&tv, NULL) — the dynamic linker resolves this
; to the vDSO implementation automatically when vDSO is present.
;
; For raw assembly using vDSO, you need to find the vDSO base
; from the auxiliary vector (AT_SYSINFO_EHDR in the process's aux vector)
; and resolve the function symbol manually. Most programs just use
; the libc wrapper, which already does this.

The result: clock_gettime(CLOCK_REALTIME, &ts) in a program with libc costs ~30ns (a function call and some arithmetic) rather than ~300ns for a real syscall.
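Locating the vDSO is most reliable when one process both looks up the address and reads the memory, since every process may have the mapping at a different address. The sketch below does that in Python; it assumes Linux with /proc mounted, and the function names are this example's own. The parsing step is demonstrated on the sample maps line shown earlier, while `read_own_vdso` is defined but not called because it depends on the running system.

```python
import re

# range, perms, offset, dev, inode, then the [vdso] pseudo-path
MAPS_RE = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+\S+\s+\S+\s+\S+\s+\S+\s+\[vdso\]\s*$")


def find_vdso(maps_text: str):
    """Return (start, end) of the [vdso] mapping in /proc/<pid>/maps text, or None."""
    for line in maps_text.splitlines():
        m = MAPS_RE.match(line)
        if m:
            return int(m.group(1), 16), int(m.group(2), 16)
    return None


def read_own_vdso() -> bytes:
    """Read this process's vDSO image; lookup and read happen in the same process."""
    with open("/proc/self/maps") as f:
        span = find_vdso(f.read())
    if span is None:
        raise RuntimeError("no [vdso] mapping found")
    start, end = span
    with open("/proc/self/mem", "rb") as mem:
        mem.seek(start)
        return mem.read(end - start)


sample_line = "7ffca8b2c000-7ffca8b2e000 r-xp 00000000 00:00 0  [vdso]"
start, end = find_vdso(sample_line)
print(hex(start), hex(end), end - start)
```

On a Linux machine, the bytes returned by `read_own_vdso()` form an ordinary ELF shared object, so they can be written to a file and inspected with objdump or readelf.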


ARM64 System Calls

On ARM64 Linux, the system call mechanism is different but conceptually identical:

// ARM64 syscall convention:
// x8 = syscall number
// x0-x5 = arguments 1-6
// x0 = return value (negative = error)
// The instruction is SVC #0 (not SYSCALL)

// write(1, "Hello\n", 6)
    mov x0, #1          // fd = stdout
    adr x1, message     // buffer
    mov x2, #6          // length
    mov x8, #64         // sys_write on ARM64 (NOT 1 like x86-64!)
    svc #0              // system call
    // x0 = return value

// exit(0)
    mov x0, #0          // status
    mov x8, #93         // sys_exit on ARM64
    svc #0

message: .ascii "Hello\n"

⚠️ Common Mistake: ARM64 syscall numbers are completely different from x86-64 syscall numbers. ARM64 sys_write is 64, not 1. ARM64 sys_exit is 93, not 60. The numbers come from a different ABI defined in the kernel's include/uapi/asm-generic/unistd.h. Always check the ARM64 syscall table separately.

Key ARM64 syscall numbers:

| x8 | Name |
|----|------|
| 56 | openat (ARM64 doesn't have open) |
| 57 | close |
| 63 | read |
| 64 | write |
| 93 | exit |
| 94 | exit_group |
| 172 | getpid |
| 215 | munmap |
| 220 | clone (fork equivalent) |
| 221 | execve |
| 222 | mmap |


MinOS Kernel: System Call Handler Setup

The MinOS kernel needs to accept system calls from user-space programs. The setup involves configuring three Model-Specific Registers (MSRs) before user-space code runs:

; minOS/kernel/syscall_setup.asm
; Set up the SYSCALL/SYSRET mechanism

; MSR addresses
MSR_STAR    equ 0xC0000081     ; CS/SS for SYSCALL/SYSRET
MSR_LSTAR   equ 0xC0000082     ; kernel entry point for SYSCALL
MSR_FMASK   equ 0xC0000084     ; RFLAGS mask (bits to clear on SYSCALL)

setup_syscall:
    ; Enable SYSCALL instruction: set SCE bit in EFER MSR
    mov ecx, 0xC0000080         ; IA32_EFER MSR
    rdmsr
    or eax, 1                   ; set SCE (SysCall Enable) bit 0
    wrmsr

    ; Set LSTAR to our syscall entry point
    mov ecx, MSR_LSTAR
    mov rax, syscall_entry      ; low 32 bits of entry address
    mov rdx, syscall_entry      ;
    shr rdx, 32                 ; high 32 bits of entry address
    wrmsr

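    ; Set STAR. SYSCALL loads CS = STAR[47:32] and SS = STAR[47:32] + 8.
    ; 64-bit SYSRET loads CS = STAR[63:48] + 16 and SS = STAR[63:48] + 8, with RPL forced to 3.
    ; With the value below: kernel CS = 0x08, kernel SS = 0x10,
    ; user SS = 0x20|3 = 0x23, user CS = 0x28|3 = 0x2B. The GDT layout must match.
    mov ecx, MSR_STAR
    xor eax, eax                ; low 32 bits unused
    mov edx, 0x00180008         ; EDX[31:16] = STAR[63:48] = 0x0018, EDX[15:0] = STAR[47:32] = 0x0008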
    wrmsr

    ; Set FMASK: clear IF (interrupts) on syscall entry
    mov ecx, MSR_FMASK
    mov eax, 0x200              ; bit 9 = IF flag
    xor edx, edx
    wrmsr
    ret

; syscall_entry: called when user space executes SYSCALL
; At entry: RCX = user RIP, R11 = user RFLAGS, RAX = syscall number
; We are running on the USER'S STACK — must switch to kernel stack first!
syscall_entry:
    ; Save user stack pointer and switch to kernel stack
    swapgs                      ; swap GS.base with kernel GS (points to per-CPU data)
    mov [gs:user_rsp_offset], rsp  ; save user RSP
    mov rsp, [gs:kernel_rsp_offset] ; load kernel RSP

    ; Save all registers that the calling convention requires
    push r11                    ; user RFLAGS
    push rcx                    ; user RIP (return address)
    push rbp
    push rbx
    push r12
    push r13
    push r14
    push r15

    ; Dispatch on syscall number in RAX
    cmp rax, SYSCALL_MAX
    jae .invalid
    mov rcx, r10                ; 4th argument: user code passes it in R10, C-ABI handlers expect RCX
    call [syscall_table + rax*8] ; index the table by syscall number
    jmp .return

.invalid:
    mov rax, -38                ; -ENOSYS

.return:
    ; Restore saved registers
    pop r15
    pop r14
    pop r13
    pop r12
    pop rbx
    pop rbp
    pop rcx                     ; restore user RIP → RCX (for SYSRET)
    pop r11                     ; restore user RFLAGS → R11 (for SYSRET)

    ; Switch back to user stack
    mov rsp, [gs:user_rsp_offset]
    swapgs

    ; Return to user space
    sysretq                     ; 64-bit SYSRET: uses RCX as new RIP, R11 as new RFLAGS

; Syscall dispatch table
syscall_table:
    dq sys_read     ; 0
    dq sys_write    ; 1
    dq sys_open     ; 2
    dq sys_close    ; 3
    ; ... more entries ...
SYSCALL_MAX equ ($ - syscall_table) / 8   ; number of entries (parentheses are required)
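The STAR encoding is the part of this setup that is easiest to get wrong, because the selectors are derived by fixed offsets rather than stored directly. As a cross-check, the sketch below computes in Python which selectors the CPU loads for a given STAR value, following the SYSCALL and 64-bit SYSRET rules in the processor manuals. `star_selectors` is a hypothetical helper, and the MinOS GDT itself is not shown in this chapter, so whether these selectors match it is something to verify against your own GDT.

```python
def star_selectors(star: int) -> dict:
    """Derive the segment selectors that SYSCALL and 64-bit SYSRET load from IA32_STAR."""
    syscall_base = (star >> 32) & 0xFFFF  # STAR[47:32]
    sysret_base = (star >> 48) & 0xFFFF   # STAR[63:48]
    return {
        "kernel_cs": syscall_base & 0xFFFC,        # SYSCALL: CS = STAR[47:32], RPL 0
        "kernel_ss": (syscall_base + 8) & 0xFFFC,  # SYSCALL: SS = STAR[47:32] + 8
        "user_cs": (sysret_base + 16) | 3,         # SYSRET: CS = STAR[63:48] + 16, RPL 3
        "user_ss": (sysret_base + 8) | 3,          # SYSRET: SS = STAR[63:48] + 8, RPL 3
    }


# EDX:EAX as written by setup_syscall: EDX = 0x00180008, EAX = 0
star = (0x00180008 << 32) | 0
selectors = star_selectors(star)
for name, value in selectors.items():
    print(f"{name} = {value:#06x}")
```

Because SYSRET places the user SS descriptor 8 bytes below the user CS descriptor, the GDT has to list user data before 64-bit user code.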

📐 OS Kernel Project: This is the MinOS syscall dispatcher. Chapter 26 adds the IDT for hardware interrupt handling. Chapter 28 shows how the kernel is loaded before this code runs. Together, these three chapters build the core of a functional kernel.


Summary

System calls are the hardware-enforced boundary between user code and the kernel. The syscall instruction transitions to ring 0 by loading the kernel entry point from the LSTAR MSR, saving RIP in RCX and RFLAGS in R11. The Linux x86-64 ABI uses RAX for the syscall number, RDI/RSI/RDX/R10/R8/R9 for arguments, and returns in RAX (negative = errno). The strace tool makes the system call layer visible without any kernel modification, making it invaluable for debugging and security analysis.

🔄 Check Your Understanding:

  1. Why is R10 used for the 4th syscall argument instead of RCX?
  2. What are the two registers destroyed by the syscall instruction, and why?
  3. If sys_open returns -13, what does that mean and how should your program handle it?
  4. What is the vDSO, and which syscalls benefit from it?
  5. How does ARM64's SVC #0 differ from x86-64's SYSCALL in terms of argument passing?