In This Chapter
- The Controlled Entry Point into Kernel Mode
- The syscall Instruction (x86-64)
- Linux x86-64 Syscall Calling Convention
- Key System Calls with Complete NASM Examples
- Syscall Reference Table (Key Linux x86-64 Entries)
- Writing a Minimal libc
- strace: Tracing System Calls of a Running Program
- A Minimal Shell Using Only Raw Syscalls
- vDSO: Virtual Dynamic Shared Object
- ARM64 System Calls
- MinOS Kernel: System Call Handler Setup
- Summary
Chapter 25: System Calls
The Controlled Entry Point into Kernel Mode
Every program that does anything useful — reads a file, writes to a terminal, allocates memory, communicates over a network — eventually needs the kernel's help. The kernel has privileges your program does not: it can touch hardware directly, manage physical memory, and arbitrate between competing processes. The mechanism by which your program asks the kernel for help is the system call.
A system call is not a function call. It is a supervised privilege escalation: the CPU transitions from ring 3 (user mode) to ring 0 (kernel mode), executes a specific kernel-provided handler, then transitions back. The transition is controlled, the kernel handler is fixed, and your user-mode code cannot skip or bypass the transition. This is by design.
Understanding system calls at the assembly level is not academic. It is necessary for writing programs that run without a C library, for analyzing security vulnerabilities, for debugging programs whose behavior differs from their source code, and for implementing the kernel side of the interface.
The syscall Instruction (x86-64)
The syscall instruction is how user-mode x86-64 code enters the kernel. It is not an interrupt (though INT 0x80 served this purpose on 32-bit Linux and still works for compatibility). It is a dedicated fast-path instruction that does the following in hardware:
- Saves the address of the next instruction (the return RIP) into RCX
- Saves the current flags register (RFLAGS) into R11
- Leaves RSP untouched: no kernel stack pointer is loaded, so the kernel entry code must switch stacks itself
- Changes CS and SS to kernel-mode selectors (from the IA32_STAR MSR)
- Clears the RFLAGS bits specified in the IA32_FMASK MSR (kernels typically mask IF, which disables interrupts)
- Jumps to the address in the IA32_LSTAR MSR (the kernel syscall entry point)
After the kernel processes the request, it executes SYSRET, which reverses the process: restores RIP from RCX, restores RFLAGS from R11, and returns to ring 3.
⚠️ Common Mistake: Because syscall saves RIP to RCX and RFLAGS to R11, both of these registers are destroyed by a syscall instruction. Do not pass arguments in RCX or R11, and do not expect them to survive a system call. This is different from the regular calling convention.
🔍 Under the Hood: The IA32_LSTAR MSR (address 0xC0000082) contains the address of the kernel's syscall entry point. On Linux, this is the entry_SYSCALL_64 function in arch/x86/entry/entry_64.S. You can read it with rdmsr in ring 0, or inspect it via /proc/kallsyms.
Linux x86-64 Syscall Calling Convention
The Linux kernel defines a specific convention for how to invoke system calls:
| Register | Role |
|---|---|
| RAX | Syscall number (input) / Return value (output) |
| RDI | Argument 1 |
| RSI | Argument 2 |
| RDX | Argument 3 |
| R10 | Argument 4 (NOT RCX, which is destroyed by syscall) |
| R8 | Argument 5 |
| R9 | Argument 6 |
The return value is placed in RAX. If the syscall fails, RAX contains a negative value in the range -4095 to -1; the error code is -RAX (i.e., -RAX == errno). For example, if RAX returns -13, the error is EACCES (permission denied).
⚠️ Common Mistake: Notice that argument 4 uses R10, not RCX. The C calling convention uses RCX for the fourth argument, but syscall destroys RCX. The Linux syscall wrapper in glibc handles this by moving RCX to R10 before executing syscall. If you are writing raw syscall wrappers, you must do this yourself.
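The register assignments above can be sketched from C with GCC-style extended inline assembly. This is a minimal illustration, assuming x86-64 Linux and a GCC or Clang compiler; the helper name raw_syscall6 is hypothetical and is not a glibc function.

```c
#include <sys/syscall.h>   /* SYS_* syscall numbers */

/* Hypothetical helper: issue a raw x86-64 Linux syscall with six arguments.
 * Returns the kernel's raw RAX value (a negative errno on failure). */
static long raw_syscall6(long nr, long a1, long a2, long a3,
                         long a4, long a5, long a6)
{
    long ret;
    /* Arguments 4-6 have no constraint letters, so pin them to registers. */
    register long r10 __asm__("r10") = a4;   /* arg 4 goes in R10, not RCX */
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;

    __asm__ volatile ("syscall"
        : "=a"(ret)                              /* RAX: return value */
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3),    /* RAX, RDI, RSI, RDX */
          "r"(r10), "r"(r8), "r"(r9)
        : "rcx", "r11", "memory");               /* syscall clobbers RCX, R11 */
    return ret;
}
```

Because the helper returns RAX unmodified, a write to an invalid descriptor such as raw_syscall6(SYS_write, -1, (long)"x", 1, 0, 0, 0) yields -9 (EBADF) rather than the -1 a libc wrapper would return.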
Key System Calls with Complete NASM Examples
sys_write (1) — Write to a File Descriptor
; Write "Hello, syscall!\n" to stdout (fd=1)
; Returns number of bytes written, or negative error
section .data
message db "Hello, syscall!", 10 ; 10 = newline
msg_len equ $ - message
section .text
global _start
_start:
mov rax, 1 ; sys_write
mov rdi, 1 ; fd = 1 (stdout)
mov rsi, message ; pointer to buffer
mov rdx, msg_len ; number of bytes to write
syscall ; enter kernel
; RAX now contains bytes written, or negative errno
mov rax, 60 ; sys_exit
xor rdi, rdi ; exit code 0
syscall
sys_read (0) — Read from a File Descriptor
; Read up to 64 bytes from stdin (fd=0) into buffer
section .bss
buf resb 64 ; reserve 64 bytes
section .text
global _start
_start:
mov rax, 0 ; sys_read
mov rdi, 0 ; fd = 0 (stdin)
mov rsi, buf ; destination buffer
mov rdx, 64 ; max bytes to read
syscall
; RAX = bytes actually read, or negative errno
; 0 = EOF (stdin closed)
sys_open (2) — Open a File
; Open a file for reading, returns file descriptor
section .data
filename db "/etc/hostname", 0 ; null-terminated path
section .text
_start:
mov rax, 2 ; sys_open
mov rdi, filename ; pathname
mov rsi, 0 ; flags: O_RDONLY = 0
mov rdx, 0 ; mode (ignored for O_RDONLY)
syscall
; RAX = file descriptor (non-negative), or negative errno
; Common errors: -2 = ENOENT (file not found), -13 = EACCES
; Save fd for later use
mov rbx, rax ; fd in RBX (callee-saved, not affected by syscall)
; ... read from rbx ...
; sys_close (3): close the file descriptor
mov rax, 3 ; sys_close
mov rdi, rbx ; fd to close
syscall
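📊 C Comparison: open(filename, O_RDONLY) in C ends up in essentially this sequence. The difference is that glibc also converts the negative return value to -1 and stores the error code in errno. Modern glibc also tends to issue openat (syscall 257) with AT_FDCWD instead of open, which is why the strace output later in this chapter shows openat. The RCX→R10 move matters only for syscalls with four or more arguments, so it does not apply here.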
sys_mmap (9) — Map Memory
mmap is how you allocate large blocks of memory, load shared libraries, and map files into memory. The kernel finds a free region of virtual address space and maps it.
; Allocate 4096 bytes (one page) of anonymous memory
; mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
section .text
_start:
mov rax, 9 ; sys_mmap
xor rdi, rdi ; addr = NULL (kernel chooses)
mov rsi, 4096 ; length = 4096 bytes (one page)
mov rdx, 3 ; prot = PROT_READ(1) | PROT_WRITE(2) = 3
mov r10, 0x22 ; flags = MAP_PRIVATE(0x2) | MAP_ANONYMOUS(0x20) = 0x22
mov r8, -1 ; fd = -1 (anonymous, no file backing)
xor r9, r9 ; offset = 0
syscall
; RAX = virtual address of mapped region, or negative errno (-4095..-1)
; Check: cmp rax, -4096; ja .error (libc converts this to MAP_FAILED = -1)
; The returned address is already usable; no extra setup needed
mov rcx, 0xDEADBEEFCAFEBABE ; a 64-bit immediate must go through a register
mov [rax], rcx ; write to it
sys_munmap (11) — Unmap Memory
; Unmap the region we allocated above
; RBX = address returned by mmap, RSI = length
mov rax, 11 ; sys_munmap
mov rdi, rbx ; addr to unmap
mov rsi, 4096 ; length
syscall
sys_brk (12) — Heap Management
brk is the traditional heap management interface. The "program break" is the end of the data segment; moving it up allocates memory, moving it down frees it.
; Get current program break (pass 0 to query)
mov rax, 12 ; sys_brk
xor rdi, rdi ; 0 = query current break
syscall
; RAX = current program break address
mov rbx, rax ; save current break
; Extend the heap by 4096 bytes
add rbx, 4096
mov rax, 12 ; sys_brk
mov rdi, rbx ; new break address
syscall
; RAX = actual new break (may differ if allocation failed)
cmp rax, rbx ; did we get what we asked for?
jne .alloc_failed
sys_fork (57) — Create a Child Process
; Fork the current process
mov rax, 57 ; sys_fork
syscall
; After fork:
; In parent: RAX = PID of child (positive)
; In child: RAX = 0
; Error: RAX = negative errno
test rax, rax
js .fork_failed ; negative = error
jz .child_code ; zero = we are the child
; fall through: we are the parent, RAX = child PID
.parent_code:
; ... parent code ...
jmp .done
.child_code:
; ... child code ...
mov rax, 60 ; child: exit
xor rdi, rdi
syscall
.fork_failed:
neg rax ; RAX now = positive errno
; handle error
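The three-way branch on fork's return value looks like this in C. This is a sketch using the libc wrappers, with a hypothetical helper name run_child; note that glibc's fork() may be built on clone rather than syscall 57, but the return-value convention seen by the caller is the same.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`; return the exit status the parent
 * observes, or -1 if fork or waitpid fails. */
static int run_child(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;              /* error: no child was created */
    if (pid == 0)
        _exit(code);            /* child: fork returned 0 */

    int status;                 /* parent: fork returned the child's PID */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child calls _exit rather than returning, mirroring the sys_exit call in .child_code above, so it never runs the parent's code path.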
sys_execve (59) — Replace Process Image
; Execute /bin/sh with no arguments and minimal environment
; execve(path, argv, envp)
section .data
shell db "/bin/sh", 0
argv0 dq shell ; argv[0] = "/bin/sh"
argv_null dq 0 ; argv[1] = NULL (terminate argv)
envp_null dq 0 ; envp[0] = NULL (empty environment)
section .text
mov rax, 59 ; sys_execve
mov rdi, shell ; path
mov rsi, argv0 ; argv array (must end with NULL pointer)
mov rdx, envp_null ; envp array (must end with NULL pointer)
syscall
; If execve succeeds, this code never executes — the process image is replaced
; If execve fails, RAX = negative errno (program still running)
sys_exit (60) — Terminate Process
mov rax, 60 ; sys_exit
mov rdi, 0 ; exit status (0 = success)
syscall
; This instruction never returns
sys_socket (41) and sys_connect (42) — Network
; Create a TCP socket
; socket(AF_INET=2, SOCK_STREAM=1, 0)
mov rax, 41 ; sys_socket
mov rdi, 2 ; AF_INET
mov rsi, 1 ; SOCK_STREAM
xor rdx, rdx ; protocol = 0 (auto)
syscall
; RAX = socket file descriptor
mov rbx, rax ; save sockfd
; Connect to 127.0.0.1:8080
section .data
; struct sockaddr_in: sa_family(2), port_be(2), addr_be(4), zero(8)
sockaddr:
dw 2 ; AF_INET (little endian 2-byte)
dw 0x901F ; port 8080 = 0x1F90, stored big endian as bytes 1F 90
dd 0x0100007F ; 127.0.0.1 in big endian
dq 0 ; padding
section .text
mov rax, 42 ; sys_connect
mov rdi, rbx ; sockfd
mov rsi, sockaddr ; struct sockaddr *
mov rdx, 16 ; sizeof(struct sockaddr_in)
syscall
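The byte layout of the sockaddr structure above can be cross-checked in C. This is an illustrative sketch with a hypothetical helper name loopback_8080; htons and htonl produce the network byte order that the assembly version hand-encodes with dw and dd.

```c
#include <arpa/inet.h>    /* htons, htonl */
#include <netinet/in.h>   /* struct sockaddr_in, INADDR_LOOPBACK */
#include <string.h>

/* Fill a sockaddr_in for 127.0.0.1:8080, the address used above. */
static struct sockaddr_in loopback_8080(void)
{
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof sa);                   /* also zeroes the padding */
    sa.sin_family = AF_INET;                     /* 2 */
    sa.sin_port = htons(8080);                   /* bytes in memory: 1F 90 */
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* bytes in memory: 7F 00 00 01 */
    return sa;
}
```

The resulting structure is 16 bytes long, which is the value passed in RDX to sys_connect.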
sys_getpid (39), sys_kill (62)
; Get own PID
mov rax, 39 ; sys_getpid
syscall
; RAX = PID
mov rbx, rax
; Send SIGTERM (15) to ourselves
mov rax, 62 ; sys_kill
mov rdi, rbx ; pid
mov rsi, 15 ; SIGTERM
syscall
Syscall Reference Table (Key Linux x86-64 Entries)
| RAX | Name | RDI | RSI | RDX | R10 |
|---|---|---|---|---|---|
| 0 | read | fd | buf* | count | — |
| 1 | write | fd | buf* | count | — |
| 2 | open | path* | flags | mode | — |
| 3 | close | fd | — | — | — |
| 4 | stat | path* | stat_buf* | — | — |
| 5 | fstat | fd | stat_buf* | — | — |
| 9 | mmap | addr | length | prot | flags |
| 10 | mprotect | addr | length | prot | — |
| 11 | munmap | addr | length | — | — |
| 12 | brk | addr | — | — | — |
| 16 | ioctl | fd | request | arg | — |
| 22 | pipe | fds[2]* | — | — | — |
| 32 | dup | fd | — | — | — |
| 39 | getpid | — | — | — | — |
| 41 | socket | domain | type | protocol | — |
| 42 | connect | sockfd | addr* | addrlen | — |
| 43 | accept | sockfd | addr* | addrlen* | — |
| 44 | sendto | sockfd | buf* | len | flags |
| 45 | recvfrom | sockfd | buf* | len | flags |
| 49 | bind | sockfd | addr* | addrlen | — |
| 50 | listen | sockfd | backlog | — | — |
| 57 | fork | — | — | — | — |
| 59 | execve | path* | argv** | envp** | — |
| 60 | exit | status | — | — | — |
| 61 | wait4 | pid | status* | options | rusage* |
| 62 | kill | pid | sig | — | — |
| 96 | gettimeofday | tv* | tz* | — | — |
| 102 | getuid | — | — | — | — |
| 160 | setrlimit | resource | rlim* | — | — |
| 231 | exit_group | status | — | — | — |
Writing a Minimal libc
The C standard library is largely a collection of system call wrappers plus utility functions. Here is how to implement the core wrappers yourself:
; minimal_libc.asm — A minimal libc in NASM
; Compile: nasm -f elf64 minimal_libc.asm -o minimal_libc.o
section .text
; int write(int fd, const void *buf, size_t count)
; Returns: bytes written, or -1 (sets errno via global)
global write
write:
mov rax, 1 ; sys_write
; rdi, rsi, rdx already set by caller (standard C calling convention)
syscall
test rax, rax
jns .ok ; non-negative = success
neg rax ; rax = positive errno value
mov [rel errno_val], eax ; store in errno
mov rax, -1 ; return -1 (C convention for error)
.ok:
ret
; ssize_t read(int fd, void *buf, size_t count)
global read
read:
mov rax, 0 ; sys_read
syscall
test rax, rax
jns .ok
neg rax
mov [rel errno_val], eax
mov rax, -1
.ok:
ret
; int open(const char *path, int flags, mode_t mode)
global open
open:
mov rax, 2 ; sys_open
syscall
test rax, rax
jns .ok
neg rax
mov [rel errno_val], eax
mov rax, -1
.ok:
ret
; int close(int fd)
global close_fd ; avoid name collision with close()
close_fd:
mov rax, 3 ; sys_close
syscall
test rax, rax
jns .ok
neg rax
mov [rel errno_val], eax
mov rax, -1
.ok:
ret
; void exit(int status) — noreturn
global _exit
_exit:
mov rax, 60 ; sys_exit
syscall
; Never returns
hlt ; should never execute
; void *mmap(void *addr, size_t len, int prot, int flags, int fd, off_t off)
; Note: 4th arg is flags → must be in R10, not RCX
global mmap
mmap:
; C calling conv: rdi=addr, rsi=len, rdx=prot, rcx=flags, r8=fd, r9=off
mov r10, rcx ; flags: move from RCX to R10 (syscall convention)
mov rax, 9 ; sys_mmap
syscall
cmp rax, -4096 ; MAP_FAILED is (void*)-1, any large negative is error
jbe .ok
neg rax
mov [rel errno_val], eax
mov rax, -1
.ok:
ret
section .bss
errno_val: resd 1 ; our errno variable
📊 C Comparison: When you link against glibc and call write(), you are calling a function that does essentially this: it moves the syscall number into RAX, executes syscall, and converts negative returns to -1/errno. For syscalls with four or more arguments, the wrapper also moves the fourth argument from RCX to R10. The "magic" of the C standard library is mostly bookkeeping.
The errno Mechanism
The kernel never touches user-space errno. Instead, it returns a negative value in RAX. The libc wrapper converts this: if RAX is in the error range (-4095 to -1), libc stores -RAX in errno and returns -1. This is why errno is only meaningful immediately after a failed system call: it is a thread-local global that any subsequent syscall wrapper can overwrite.
In assembly with no libc, you manage this yourself. The pattern above shows one approach: an errno_val in the BSS segment that your wrappers write to on error.
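The conversion itself is small enough to model in plain C. The sketch below is illustrative only: my_errno and to_c_result are hypothetical names, and a real libc keeps errno in thread-local storage rather than in a plain static variable.

```c
#include <errno.h>   /* EACCES, ENOENT and friends, for callers */

/* Hypothetical stand-in for libc's errno slot. */
static int my_errno;

/* Convert a raw kernel return value into the C convention:
 * values in [-4095, -1] are errors, anything else passes through. */
static long to_c_result(long raw)
{
    /* One unsigned comparison covers the whole error range. */
    if ((unsigned long)raw >= (unsigned long)-4095L) {
        my_errno = (int)-raw;   /* store the positive error code */
        return -1;
    }
    return raw;
}
```

With this helper, to_c_result(-13) returns -1 and leaves EACCES in my_errno, while to_c_result(5) returns 5 and leaves my_errno untouched.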
strace: Tracing System Calls of a Running Program
strace is one of the most powerful debugging tools for Linux programs. It intercepts every system call the program makes and prints the call with arguments and return value.
# Basic usage: trace all syscalls of a program
strace ./myprogram
# Trace with statistics: count each syscall type
strace -c ./myprogram
# Trace only specific syscalls (e.g., file-related)
strace -e trace=file ./myprogram
# Trace an already-running process by PID
strace -p 12345
# Show timestamps
strace -t ./myprogram
# Save output to file
strace -o strace.log ./myprogram
Sample strace output from running /bin/ls:
execve("/bin/ls", ["/bin/ls"], 0x7ffd... /* 24 vars */) = 0
brk(NULL) = 0x55a3b4001000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8b12345000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=180345, ...}) = 0
mmap(NULL, 180345, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f8b12318000
close(3) = 0
...
write(1, "bin  etc  lib  tmp  usr\n", 24) = 24
close(1) = 0
exit_group(0) = ?
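Reading this output: each line is one syscall, with arguments in parentheses and the return value after the = sign. Failed calls show -1 followed by the errno name, which is the libc-style view of the negative value the kernel returned in RAX. These are the same calls your NASM code makes when it executes the syscall instruction directly.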
A Minimal Shell Using Only Raw Syscalls
Here is a complete, working shell that uses no libc — only raw system calls. It reads a line, executes it with /bin/sh -c, and repeats.
; minish.asm — A minimal shell using only raw system calls
; Build: nasm -f elf64 minish.asm -o minish.o && ld minish.o -o minish
; Note: commands run as "sh -c <input>" for simplicity
section .data
prompt db "minish> ", 0
prompt_len equ $ - prompt - 1
sh_path db "/bin/sh", 0
sh_arg0 dq sh_path
sh_arg1 db "-c", 0
sh_arg1_ptr dq sh_arg1
cmd_ptr dq 0 ; will point to user's command
argv_null dq 0
section .bss
input_buf resb 256 ; input line buffer
section .text
global _start
_start:
.loop:
; Print prompt
mov rax, 1
mov rdi, 1
mov rsi, prompt
mov rdx, prompt_len
syscall
; Read a line from stdin
mov rax, 0
mov rdi, 0
mov rsi, input_buf
mov rdx, 255
syscall
test rax, rax
jle .exit ; EOF or error: exit
; Null-terminate the input (replace newline with 0)
mov rbx, rax ; save byte count
dec rbx
mov byte [input_buf + rbx], 0
; Set cmd_ptr to point to our input
mov qword [cmd_ptr], input_buf
; Fork a child to run the command
mov rax, 57 ; sys_fork
syscall
test rax, rax
js .loop ; fork failed, try again
jz .child ; in child process
; Parent: wait for child to finish
mov rdi, rax ; child PID
xor rsi, rsi ; *status = NULL (don't care)
xor rdx, rdx ; options = 0
xor r10, r10 ; *rusage = NULL
mov rax, 61 ; sys_wait4
syscall
jmp .loop ; read next command
.child:
; Build argv: ["/bin/sh", "-c", cmd, NULL]
; We already have sh_arg0, sh_arg1_ptr, cmd_ptr, argv_null set up
; but they're scattered. Build a proper argv array on the stack:
sub rsp, 32
mov qword [rsp], sh_path ; argv[0] = "/bin/sh"
mov qword [rsp+8], sh_arg1 ; argv[1] = "-c"
mov rax, input_buf
mov qword [rsp+16], rax ; argv[2] = command
mov qword [rsp+24], 0 ; argv[3] = NULL
mov rsi, rsp ; argv (taken before RSP moves again)
; Build an empty environment: envp = { NULL }
; (an empty envp does not inherit the shell's environment)
sub rsp, 8
mov qword [rsp], 0
mov rdx, rsp ; envp
mov rax, 59 ; sys_execve
mov rdi, sh_path ; path
syscall
; If execve fails, exit the child
mov rax, 60
mov rdi, 1
syscall
.exit:
mov rax, 60
xor rdi, rdi
syscall
🛠️ Lab Exercise: Compile and run minish. Test it with simple commands like ls, pwd, echo hello. Add a check that prints the exit status of each child process (the status from wait4). Add built-in cd support; remember that cd cannot be implemented as a child process because chdir only affects the current process.
vDSO: Virtual Dynamic Shared Object
Some system calls are so frequent that the overhead of the ring transition is significant. gettimeofday is called millions of times per second in some applications. The kernel addresses this with the vDSO (virtual dynamic shared object): a small shared library that the kernel maps into every process's address space.
The vDSO contains implementations of a few time-related functions (gettimeofday, clock_gettime, time) that read directly from a kernel-maintained page of shared memory, without executing a real syscall instruction at all.
# Verify vDSO is mapped in your process
cat /proc/self/maps | grep vdso
# Output: 7ffca8b2c000-7ffca8b2e000 r-xp 00000000 00:00 0 [vdso]
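# Disassemble the vDSO
# grep and dd are separate processes, and ASLR gives each its own vDSO address,
# so run both with randomization disabled (setarch -R), which should keep it stable.
setarch "$(uname -m)" -R sh -c '
vdso_addr=$(grep vdso /proc/self/maps | cut -d- -f1)
dd if=/proc/self/mem bs=4096 count=2 skip=$((0x$vdso_addr / 4096)) 2>/dev/null
' > vdso.so
objdump -d vdso.so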
; Calling gettimeofday through vDSO (via PLT in practice)
; In C: gettimeofday(&tv, NULL) — the dynamic linker resolves this
; to the vDSO implementation automatically when vDSO is present.
;
; For raw assembly using vDSO, you need to find the vDSO base
; from the auxiliary vector (AT_SYSINFO_EHDR in the process's aux vector)
; and resolve the function symbol manually. Most programs just use
; the libc wrapper, which already does this.
The result: clock_gettime(CLOCK_REALTIME, &ts) in a program with libc costs roughly 30ns (a function call and some arithmetic) rather than roughly 300ns for a real syscall. Exact figures vary with the CPU and kernel configuration.
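Finding the vDSO base from the auxiliary vector, as described above, can be sketched in C with glibc's getauxval. The helper names vdso_base and vdso_has_elf_magic are hypothetical, and the sketch stops at locating the mapping: resolving symbols inside it means parsing the ELF dynamic symbol table, which is not shown.

```c
#include <elf.h>        /* ELFMAG, SELFMAG */
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>   /* getauxval, AT_SYSINFO_EHDR */

/* Return the vDSO base address from the auxiliary vector, or NULL if the
 * kernel did not provide one. */
static const void *vdso_base(void)
{
    return (const void *)getauxval(AT_SYSINFO_EHDR);
}

/* Return 1 if the mapping at the vDSO base starts with the ELF magic bytes. */
static int vdso_has_elf_magic(void)
{
    const void *base = vdso_base();
    return base != NULL && memcmp(base, ELFMAG, SELFMAG) == 0;
}
```

On a typical Linux system the base is non-NULL and the mapping is an ordinary ELF shared object, which is why tools such as objdump can disassemble a dump of it.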
ARM64 System Calls
On ARM64 Linux, the system call mechanism is different but conceptually identical:
// ARM64 syscall convention:
// x8 = syscall number
// x0-x5 = arguments 1-6
// x0 = return value (negative = error)
// The instruction is SVC #0 (not SYSCALL)
// write(1, "Hello\n", 6)
mov x0, #1 // fd = stdout
adr x1, message // buffer
mov x2, #6 // length
mov x8, #64 // sys_write on ARM64 (NOT 1 like x86-64!)
svc #0 // system call
// x0 = return value
// exit(0)
mov x0, #0 // status
mov x8, #93 // sys_exit on ARM64
svc #0
message: .ascii "Hello\n"
⚠️ Common Mistake: ARM64 syscall numbers are completely different from x86-64 syscall numbers. ARM64 sys_write is 64, not 1. ARM64 sys_exit is 93, not 60. The numbers come from a different ABI defined in the kernel's include/uapi/asm-generic/unistd.h. Always check the ARM64 syscall table separately.
Key ARM64 syscall numbers:
| x8 | Name |
|----|------|
| 56 | openat (ARM64 doesn't have open) |
| 57 | close |
| 63 | read |
| 64 | write |
| 93 | exit |
| 94 | exit_group |
| 172 | getpid |
| 215 | munmap |
| 220 | clone (fork equivalent) |
| 221 | execve |
| 222 | mmap |
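One way to avoid hard-coding per-architecture numbers in C is to use the SYS_* constants from <sys/syscall.h>, which expand to the right value for the target being compiled. A minimal sketch, assuming Linux with glibc; portable_write is a hypothetical name.

```c
#include <sys/syscall.h>   /* SYS_write resolves per architecture */
#include <unistd.h>        /* syscall() */

/* Write len bytes through libc's generic syscall() entry point.
 * SYS_write expands to 1 on x86-64 and to 64 on ARM64. */
static long portable_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Unlike a raw syscall instruction, syscall() follows the libc convention: it returns -1 and sets errno on failure.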
MinOS Kernel: System Call Handler Setup
The MinOS kernel needs to accept system calls from user-space programs. The setup involves configuring three Model-Specific Registers (MSRs) before user-space code runs:
; minOS/kernel/syscall_setup.asm
; Set up the SYSCALL/SYSRET mechanism
; MSR addresses
MSR_STAR equ 0xC0000081 ; CS/SS for SYSCALL/SYSRET
MSR_LSTAR equ 0xC0000082 ; kernel entry point for SYSCALL
MSR_FMASK equ 0xC0000084 ; RFLAGS mask (bits to clear on SYSCALL)
setup_syscall:
; Enable SYSCALL instruction: set SCE bit in EFER MSR
mov ecx, 0xC0000080 ; IA32_EFER MSR
rdmsr
or eax, 1 ; set SCE (SysCall Enable) bit 0
wrmsr
; Set LSTAR to our syscall entry point
mov ecx, MSR_LSTAR
mov rax, syscall_entry ; low 32 bits of entry address
mov rdx, syscall_entry ;
shr rdx, 32 ; high 32 bits of entry address
wrmsr
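; Set STAR: selector bases for SYSCALL and SYSRET
; STAR[47:32] = kernel CS (SYSCALL loads CS = this value, SS = this value + 8)
; STAR[63:48] = SYSRET base (64-bit SYSRET loads CS = base + 16, SS = base + 8, RPL forced to 3)
; With base 0x0018: user CS = 0x2B and user SS = 0x23, so the GDT must match
mov ecx, MSR_STAR
xor eax, eax ; low 32 bits unused in 64-bit mode
mov edx, 0x00180008 ; EDX[31:16] = 0x0018 SYSRET base, EDX[15:0] = 0x0008 kernel CS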
wrmsr
; Set FMASK: clear IF (interrupts) on syscall entry
mov ecx, MSR_FMASK
mov eax, 0x200 ; bit 9 = IF flag
xor edx, edx
wrmsr
ret
; syscall_entry: called when user space executes SYSCALL
; At entry: RCX = user RIP, R11 = user RFLAGS, RAX = syscall number
; We are running on the USER'S STACK — must switch to kernel stack first!
syscall_entry:
; Save user stack pointer and switch to kernel stack
swapgs ; swap GS.base with kernel GS (points to per-CPU data)
mov [gs:user_rsp_offset], rsp ; save user RSP
mov rsp, [gs:kernel_rsp_offset] ; load kernel RSP
; Save all registers that the calling convention requires
push r11 ; user RFLAGS
push rcx ; user RIP (return address)
push rbp
push rbx
push r12
push r13
push r14
push r15
; Dispatch on syscall number in RAX
cmp rax, SYSCALL_MAX
jae .invalid
mov rcx, r10 ; arg 4: R10 → RCX, if handlers follow the C calling convention
call [syscall_table + rax*8] ; handler's return value comes back in RAX
jmp .return
.invalid:
mov rax, -38 ; -ENOSYS
.return:
; Restore saved registers
pop r15
pop r14
pop r13
pop r12
pop rbx
pop rbp
pop rcx ; restore user RIP → RCX (for SYSRET)
pop r11 ; restore user RFLAGS → R11 (for SYSRET)
; Switch back to user stack
mov rsp, [gs:user_rsp_offset]
swapgs
; Return to user space
o64 sysret ; 64-bit SYSRET (sysretq in GNU as): uses RCX as new RIP, R11 as new RFLAGS
; Syscall dispatch table
syscall_table:
dq sys_read ; 0
dq sys_write ; 1
dq sys_open ; 2
dq sys_close ; 3
; ... more entries ...
SYSCALL_MAX equ ($ - syscall_table) / 8 ; number of 8-byte entries
📐 OS Kernel Project: This is the MinOS syscall dispatcher. Chapter 26 adds the IDT for hardware interrupt handling. Chapter 28 shows how the kernel is loaded before this code runs. Together, these three chapters build the core of a functional kernel.
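The bounds check and table dispatch in syscall_entry can be modeled in user-space C. This is a simplified sketch, not MinOS code: the stub handlers and the names table, TABLE_LEN, and dispatch are hypothetical, and only three arguments are passed.

```c
#include <errno.h>   /* ENOSYS */
#include <stddef.h>

typedef long (*syscall_fn)(long a1, long a2, long a3);

/* Hypothetical stub handlers standing in for sys_read and sys_write. */
static long stub_read(long fd, long buf, long count)
{
    (void)fd; (void)buf;
    return count;
}

static long stub_write(long fd, long buf, long count)
{
    (void)fd; (void)buf;
    return count;
}

static const syscall_fn table[] = { stub_read, stub_write };

/* Entry count = total bytes / bytes per entry, like ($ - syscall_table) / 8. */
#define TABLE_LEN (sizeof table / sizeof table[0])

/* Compare the number as unsigned (as jae does), then call through the table. */
static long dispatch(unsigned long nr, long a1, long a2, long a3)
{
    if (nr >= TABLE_LEN)
        return -ENOSYS;
    return table[nr](a1, a2, a3);
}
```

Treating the number as unsigned means a negative value in RAX is rejected by the same comparison that rejects numbers past the end of the table.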
Summary
System calls are the hardware-enforced boundary between user code and the kernel. The syscall instruction transitions to ring 0 by loading the kernel entry point from the LSTAR MSR, saving RIP in RCX and RFLAGS in R11. The Linux x86-64 ABI uses RAX for the syscall number, RDI/RSI/RDX/R10/R8/R9 for arguments, and returns in RAX (negative = errno). The strace tool makes the system call layer visible without any kernel modification, making it invaluable for debugging and security analysis.
🔄 Check Your Understanding:
1. Why is R10 used for the 4th syscall argument instead of RCX?
2. What are the two registers destroyed by the syscall instruction, and why?
3. If sys_open returns -13, what does that mean and how should your program handle it?
4. What is the vDSO, and which syscalls benefit from it?
5. How does ARM64's SVC #0 differ from x86-64's SYSCALL in terms of argument passing?