Case Study 7.2: Your First System Call — What Actually Happens When You Call sys_write

Tracing the path from SYSCALL instruction to terminal output and back

The Mystery of syscall

When you execute these five lines of assembly:

mov     rax, 1          ; sys_write
mov     rdi, 1          ; fd = stdout
mov     rsi, msg        ; buffer
mov     rdx, 13         ; length
syscall

...and text appears on your terminal, a remarkable amount of machinery activates and completes in roughly 1-5 microseconds. You see the result; the mechanism is invisible. This case study makes it visible.

Step 1: The SYSCALL Instruction Itself

The SYSCALL instruction is not a function call. It does not push a return address to the stack. Instead, it performs a hardware privilege-level transition from user mode to kernel mode, steered by Model-Specific Registers (MSRs) that the OS kernel configures at boot time.

What SYSCALL does as a single instruction:

Saves the next instruction's address (RIP) to RCX
Saves RFLAGS to R11
Clears every RFLAGS bit that is set in the FMASK MSR (Linux masks IF, TF, DF, and others, so the kernel starts with interrupts disabled)
Loads a new RIP from the LSTAR MSR, which holds the kernel's syscall entry point
Loads CS and SS with kernel segment selectors derived from the STAR MSR, putting the CPU at privilege level 0

What SYSCALL does not do is switch stacks: RSP still holds your user stack pointer when the kernel's entry code starts running. The entry code loads the kernel stack pointer itself, from per-CPU data, as one of its first actions.

This is why RCX and R11 are clobbered by syscall: the hardware uses them to save your RIP and RFLAGS. The kernel's entry code saves everything else.
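The clobber rule shows up directly when the instruction is wrapped in C inline assembly. The following is a minimal sketch, assuming GCC or Clang on Linux x86-64; `raw_syscall3` is a hypothetical helper name, not part of any library. RCX and R11 sit in the clobber list because the hardware overwrites them, and "memory" is listed because the kernel may read or write buffers the compiler cannot see.

```c
#include <errno.h>        /* EBADF, used when checking the raw result */
#include <sys/syscall.h>  /* SYS_write, SYS_getpid */

/* Hypothetical helper: issue a three-argument system call with the raw
 * SYSCALL instruction and return the kernel's raw result from RAX. */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;

    __asm__ volatile (
        "syscall"
        : "=a"(ret)                            /* result comes back in RAX      */
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3)   /* RAX, RDI, RSI, RDX            */
        : "rcx", "r11", "memory"               /* overwritten by SYSCALL itself */
    );
    return ret;
}
```

Calling `raw_syscall3(SYS_write, 1, (long)"Hello, World!", 13)` mirrors the assembly at the top of this case study. Because the helper bypasses the C library, a failure comes back as a negative errno value such as `-EBADF` rather than as -1.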

What you see in GDB right after syscall:

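(gdb) stepi
(gdb) info registers rax rcx r11 rip

A user-space debugger cannot follow execution into the kernel, so stepi runs the entire system call and stops at the next user instruction. By then RAX holds the return value (13), RCX holds the same address as RIP (the instruction after syscall), and R11 holds a copy of RFLAGS. To watch the kernel side you need a kernel debugging setup (for example QEMU's GDB stub); there RIP would be in the 0xFFFFFFFF... range, the upper half of the address space.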

Step 2: The Kernel's Syscall Entry Point

On Linux x86-64, the LSTAR MSR points to entry_SYSCALL_64 in arch/x86/entry/entry_64.S. This function:

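Executes swapgs so that GS points at per-CPU kernel data
Switches RSP to the current thread's kernel stack (on kernels with page-table isolation it also switches CR3 to the kernel page tables)
Saves all user registers to a pt_regs structure on that kernel stack
Calls the C function do_syscall_64(regs, nr) where nr = RAX (the syscall number)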

The pt_regs structure layout (simplified; the saved rip, cs, rflags, rsp, and ss that follow orig_rax are omitted):

┌─────────────────┐  ← kernel stack top after entry
│ r15             │
│ r14             │
│ r13             │
│ r12             │
│ rbp             │
│ rbx             │
│ r11             │  ← saved from RFLAGS (user)
│ r10             │
│ r9              │
│ r8              │
│ rax             │  ← return-value slot (preset to -ENOSYS at entry)
│ rcx             │  ← saved RIP (user return address)
│ rdx             │  ← 3rd argument (length = 13)
│ rsi             │  ← 2nd argument (buffer pointer)
│ rdi             │  ← 1st argument (fd = 1)
│ orig_rax        │  ← syscall number (1 = write), kept for syscall restart logic
└─────────────────┘
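The same layout can be written as a C struct, which makes the slot offsets easy to check. This is a sketch only: `pt_regs_sketch`, `REG_OFFSET`, and `write_args` are hypothetical names chosen for illustration, and the struct also lists the five trailing fields that the simplified diagram leaves out. The field names (without the `r` prefix) follow the kernel's own struct pt_regs.

```c
#include <stddef.h>   /* offsetof */

/* Sketch of the x86-64 register frame, lowest address first.
 * The struct name is hypothetical so it cannot be confused with the
 * kernel's real definition. */
struct pt_regs_sketch {
    unsigned long r15, r14, r13, r12, bp, bx;  /* callee-saved registers        */
    unsigned long r11, r10, r9, r8;            /* r11 holds the user RFLAGS     */
    unsigned long ax, cx, dx, si, di;          /* cx = user RIP; di/si/dx = arguments 1-3 */
    unsigned long orig_ax;                     /* syscall number                */
    unsigned long ip, cs, flags, sp, ss;       /* pushed first by the entry code */
};

/* Byte offset of a saved register from the start of the frame. */
#define REG_OFFSET(field) offsetof(struct pt_regs_sketch, field)

/* Read the three sys_write arguments back out of a saved frame. */
static void write_args(const struct pt_regs_sketch *regs,
                       unsigned long *fd, unsigned long *buf, unsigned long *len)
{
    *fd  = regs->di;   /* 1st argument */
    *buf = regs->si;   /* 2nd argument */
    *len = regs->dx;   /* 3rd argument */
}
```

With 21 eight-byte slots the frame is 168 bytes, and `REG_OFFSET(orig_ax)` is 120: the fifteen general-purpose slots in the diagram come before it.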

Step 3: Syscall Dispatch

do_syscall_64 uses the syscall number (the original RAX, preserved in regs->orig_ax) as an index into the syscall table:

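// Simplified from kernel source
long do_syscall_64(struct pt_regs *regs, int nr) {
    unsigned int unr = nr;              // unsigned compare also rejects negative numbers
    if (unr < NR_syscalls) {
        regs->ax = sys_call_table[unr](regs);
        // sys_call_table[1] = __x64_sys_write
    }                                   // otherwise regs->ax keeps its preset -ENOSYS
    return regs->ax;
}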

sys_call_table[1] contains the address of __x64_sys_write.
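Table dispatch of this kind can be tried out in user space. The sketch below is an analogue, not kernel code: `demo_regs`, `demo_table`, `demo_dispatch`, and the two handlers are hypothetical names, and the handlers only pretend to do I/O. It keeps the two properties that matter: the number is bounds-checked as an unsigned value, and an unknown number yields -ENOSYS.

```c
#include <errno.h>    /* ENOSYS */
#include <stddef.h>   /* size_t */

/* Hypothetical, cut-down register frame: only the fields this demo needs. */
struct demo_regs {
    long di, si, dx;   /* arguments 1-3                     */
    long ax;           /* return-value slot                 */
    long orig_ax;      /* call number as passed by the user */
};

typedef long (*demo_handler)(const struct demo_regs *);

/* Stand-in handlers: slot 0 plays "read", slot 1 plays "write". */
static long demo_read(const struct demo_regs *r)  { (void)r; return 0; }
static long demo_write(const struct demo_regs *r) { return r->dx; } /* pretend every byte was written */

static const demo_handler demo_table[] = { demo_read, demo_write };
static const size_t demo_nr = sizeof demo_table / sizeof demo_table[0];

/* Look up the handler by number and store its result in the ax slot. */
static long demo_dispatch(struct demo_regs *r)
{
    unsigned long nr = (unsigned long)r->orig_ax;  /* negative numbers become huge */

    r->ax = -ENOSYS;                               /* default for unknown numbers  */
    if (nr < demo_nr)
        r->ax = demo_table[nr](r);
    return r->ax;
}
```

Dispatching a frame with `orig_ax = 1` and `dx = 13` returns 13, while `orig_ax = -1` fails the unsigned comparison and returns `-ENOSYS` without touching the table.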

Step 4: __x64_sys_write and the VFS Layer

The syscall table entry, __x64_sys_write, unpacks the arguments from pt_regs (fd from di, buf from si, count from dx) and calls ksys_write:

// fs/read_write.c (simplified)
ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count) {
    struct fd f = fdget_pos(fd);    // look up fd=1 in the file descriptor table
    ssize_t ret = -EFAULT;

    if (!f.file)
        return -EBADF;

    // Verify buf pointer is in user space (security check).
    // In the real source this test sits at the top of vfs_write.
    if (access_ok(buf, count))
        ret = vfs_write(f.file, buf, count, &f.file->f_pos);

    fdput_pos(f);                   // drop the reference on every path, including -EFAULT
    return ret;
}

The first thing the kernel does: verify that fd=1 exists in your process's file descriptor table. Your process inherited fd=1 (stdout) from the shell. The file descriptor entry points to a struct file that ultimately describes a pseudo-terminal (PTY).

The second thing: access_ok(buf, count) — the kernel verifies that your buffer pointer (rsi) is in user space, not kernel space. This prevents a malicious program from passing a kernel address as the buffer and having the kernel write kernel memory to the terminal.
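The idea behind the pointer check can be sketched as a pure range test. This is an illustration under stated assumptions, not the kernel's implementation: `demo_range_ok` and `DEMO_USER_LIMIT` are hypothetical names, and the limit shown is the user-space ceiling for x86-64 with 4-level paging.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: one past the highest user-space address with 4-level paging
 * on x86-64 (2^47 minus one 4 KiB page). 5-level paging uses a larger limit. */
#define DEMO_USER_LIMIT 0x00007ffffffff000ULL

/* Hypothetical stand-in for the idea behind access_ok(): the whole range
 * [addr, addr + size) must lie below the user-space limit. */
static bool demo_range_ok(uint64_t addr, uint64_t size)
{
    if (size > DEMO_USER_LIMIT)
        return false;                        /* keeps the subtraction below from wrapping */
    return addr <= DEMO_USER_LIMIT - size;   /* same as addr + size <= limit, without overflow */
}
```

A check like this only says the range is in the user half of the address space. A pointer that passes can still be unmapped, which is caught later when the copy itself faults.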

Step 5: The Terminal Driver and the PTY

vfs_write follows the file operation pointer for your stdout file. If stdout is connected to a terminal (you're running interactively), it leads to the TTY (terminal) subsystem:

vfs_write
    → f->f_op->write_iter     (file operation function pointer)
    → tty_write               (TTY layer)
        → n_tty_write          (line discipline)
            → pty_write        (PTY driver, called on the slave end)
                → master-side buffer    (the terminal emulator reads from here)

The PTY (pseudo-terminal) is a kernel object with two ends:

- The slave end: what your program has as stdout (fd=1), inherited from the shell
- The master end: what the terminal emulator (xterm, gnome-terminal) reads from

When you write to the slave, the kernel copies your bytes to a ring buffer in the PTY master. The terminal emulator (running as a separate process) has the master end open for reading, and its event loop wakes up and reads your bytes.
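The two ends can be observed from a single process. The sketch below plays both roles: it writes to the slave end like your program and reads from the master end like a terminal emulator. It is Linux-specific and assumes `/dev/ptmx` and a mounted `/dev/pts` are available; `pty_roundtrip` is a hypothetical helper name. The message contains no newline, since the slave's default output processing would expand one.

```c
#include <fcntl.h>       /* open, O_RDWR, O_NOCTTY */
#include <stdio.h>       /* snprintf */
#include <sys/ioctl.h>   /* ioctl, TIOCSPTLCK, TIOCGPTN */
#include <unistd.h>      /* read, write, close */

/* Write len bytes to the slave end of a fresh PTY and read them back from
 * the master end. Returns the number of bytes read, or -1 if any step fails. */
static long pty_roundtrip(const char *msg, size_t len, char *out, size_t outsz)
{
    long got = -1;
    int unlock = 0;
    unsigned int ptn = 0;
    char path[32];

    int master = open("/dev/ptmx", O_RDWR | O_NOCTTY);   /* allocates a new PTY pair */
    if (master < 0)
        return -1;

    if (ioctl(master, TIOCSPTLCK, &unlock) == 0 &&       /* unlock the slave end */
        ioctl(master, TIOCGPTN, &ptn) == 0) {            /* ask for its index N  */
        snprintf(path, sizeof path, "/dev/pts/%u", ptn);
        int slave = open(path, O_RDWR | O_NOCTTY);
        if (slave >= 0) {
            if (write(slave, msg, len) == (ssize_t)len)  /* the program's side   */
                got = read(master, out, outsz);          /* the emulator's side  */
            close(slave);
        }
    }
    close(master);
    return got;
}
```

Passing `"Hello, World!"` with a length of 13 should hand the same 13 bytes back through the master end, provided the environment allows PTY allocation.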

This is the path of your "Hello, World!" bytes:

Your assembly program
    → SYSCALL (hardware transition)
    → entry_SYSCALL_64 (kernel entry)
    → ksys_write (kernel)
    → vfs_write (VFS layer)
    → tty_write (TTY layer)
    → n_tty_write (line discipline)
    → pty_write (PTY slave → master buffer)
    → [kernel wakes terminal emulator process]
    → terminal emulator reads master
    → terminal emulator renders text via X11/Wayland
    → display hardware

Step 6: Copying Your Buffer

At some point in the write path, the kernel copies bytes from your user-space buffer to a kernel buffer. This is where the security model enforces separation: the kernel cannot simply use your pointer directly for DMA or other operations, because your memory might be paged out or might not even be mapped.

The copy happens via copy_from_user:

// Deep inside tty_write
if (copy_from_user(kernel_buf, user_buf, count))
    return -EFAULT;    // nonzero result = number of bytes that could not be copied
// This handles:
// - page faults if your buffer is swapped out or not yet mapped in
// - the user/kernel address space boundary check
// - SMAP (Supervisor Mode Access Prevention), lifted only for the duration of the copy
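The effect of that copy is visible from user space: once write returns, the kernel holds its own copy of the bytes. The sketch below uses a pipe rather than a TTY, which is an assumption made to keep the demo self-contained; `kernel_copy_demo` is a hypothetical name.

```c
#include <string.h>   /* memcpy, memset, memcmp */
#include <unistd.h>   /* pipe, read, write, close */

/* Returns 1 if the pipe still delivers the original bytes after the caller's
 * buffer was overwritten, 0 if not, and -1 if a system call failed. */
static int kernel_copy_demo(void)
{
    int fds[2];
    char buf[13];
    char out[13];
    int result = -1;

    if (pipe(fds) != 0)
        return -1;

    memcpy(buf, "Hello, World!", sizeof buf);           /* 13 bytes, no terminator */
    if (write(fds[1], buf, sizeof buf) == (ssize_t)sizeof buf) {
        memset(buf, 'X', sizeof buf);                   /* scribble over the user buffer */
        if (read(fds[0], out, sizeof out) == (ssize_t)sizeof out)
            result = memcmp(out, "Hello, World!", sizeof out) == 0;
    }
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

The function returns 1 because the pipe delivers the bytes as they were at the time of the write, not the overwritten contents.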

Step 7: Return Path

After the write completes, the kernel needs to return to your process:

The write count (e.g., 13) is placed in regs->ax
entry_SYSCALL_64 restores all user registers from pt_regs
The SYSRET instruction is executed:
- Loads RIP from RCX (your saved return address)
- Loads RFLAGS from R11 (your saved flags, with some bits forced)
- Switches CS and SS back to the user segments (privilege level 3)
- The CPU continues at the instruction that follows syscall

Your RAX is now 13 (bytes written). RCX no longer holds whatever you put in it before the syscall: it holds the address of the instruction after syscall, because SYSCALL saved RIP there and SYSRET loaded RIP back from it. R11 likewise holds the RFLAGS value saved at entry. Everything else is restored exactly as you left it.

The Complete Timeline

User space:
    mov rax, 1; mov rdi, 1; mov rsi, msg; mov rdx, 13
    SYSCALL instruction executes

Hardware (no software involved):
    RCX ← RIP (next instruction address)
    R11 ← RFLAGS
    RIP ← LSTAR (kernel entry point)
    CS  ← kernel code segment

Kernel space:
    entry_SYSCALL_64:
        swapgs                  ; switch GS to kernel per-CPU data
        save all registers to pt_regs on kernel stack
    do_syscall_64:
        sys_call_table[1] → __x64_sys_write
    ksys_write:
        fdget_pos(1)            ; look up stdout file descriptor
        access_ok(buf, count)   ; verify user pointer
        vfs_write(file, buf, count)
            → tty_write
                → n_tty_write
                    → copy_from_user(kbuf, msg, 13)   ; copy from user
                    → pty_write → PTY master buffer
    return 13 to do_syscall_64
    entry_SYSCALL_64:
        restore all registers from pt_regs
    SYSRET instruction:
        RIP ← RCX
        RFLAGS ← R11
        CS ← user code segment

User space continues at instruction after SYSCALL:
    RAX = 13 (bytes written)
    RCX = address of the instruction after SYSCALL (your old value is gone)
    R11 = RFLAGS as saved at entry (your old value is gone)

Why This Matters for Assembly Programmers

1. The clobber rule is hardware, not convention. RCX and R11 are clobbered by SYSCALL because the CPU uses them to save state. You cannot "fix" this with software. If you need to preserve RCX or R11 across a syscall, push them before the SYSCALL and pop them after.

2. The kernel copies your buffer — it doesn't use your pointer indefinitely. After sys_write returns, the kernel has already copied your data. You can immediately reuse or free the buffer. (Async I/O is different, but for synchronous write: once it returns, the buffer is yours again.)

3. Error returns are negative integers. On failure the kernel returns a negated errno value in RAX (e.g., -9 for EBADF, -14 for EFAULT); error values lie in the range -4095 to -1. A C library wrapper converts this into a return value of -1 and sets errno = EBADF. In raw assembly, you test RAX for a negative value directly.

4. Partial writes are normal. The kernel may write fewer bytes than requested — for pipes, sockets, and TTYs, the underlying buffer may fill up. A robust assembly program loops on sys_write until all bytes are written (or an error occurs). This is what the C library's fwrite does automatically.

5. Crossing the user/kernel boundary is expensive but not catastrophic. The bare SYSCALL → kernel → SYSRET round trip, before any real work such as the TTY path above, takes roughly 100-300 ns on modern hardware (Spectre/Meltdown mitigations have pushed it well above the pre-2018 baseline of about 100 ns). Calling sys_write 1000 times to write 1000 individual bytes would add ~100-300 microseconds of syscall overhead. Buffering writes and calling sys_write once per buffer flush is standard practice.

This is why printf in C buffers output: not because the kernel can't handle many writes, but because reducing syscall frequency is the correct engineering trade-off.
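Points 3 and 4 combine into one small routine: test the raw result for a negative errno, and loop until every byte has been accepted. The sketch below assumes GCC or Clang on Linux x86-64; `sys_write_raw` and `write_all` are hypothetical helper names, and retrying on -EINTR is a design choice of this sketch rather than something the kernel requires.

```c
#include <errno.h>        /* EINTR, EBADF */
#include <stddef.h>       /* size_t */
#include <sys/syscall.h>  /* SYS_write */

/* Hypothetical helper: raw sys_write, returning the kernel's RAX unchanged. */
static long sys_write_raw(int fd, const char *buf, size_t len)
{
    long ret;

    __asm__ volatile (
        "syscall"
        : "=a"(ret)
        : "a"((long)SYS_write), "D"((long)fd), "S"(buf), "d"(len)
        : "rcx", "r11", "memory"   /* RCX and R11 are overwritten by SYSCALL */
    );
    return ret;
}

/* Keep writing until all len bytes are accepted.
 * Returns the byte count on success or a negative errno value on failure. */
static long write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        long n = sys_write_raw(fd, buf + done, len - done);
        if (n == -EINTR)
            continue;              /* interrupted before any byte was written: retry */
        if (n < 0)
            return n;              /* raw negative errno, e.g. -EBADF */
        if (n == 0)
            break;                 /* no progress: stop instead of spinning */
        done += (size_t)n;         /* partial write: advance and go again */
    }
    return (long)done;
}
```

A call such as `write_all(1, "Hello, World!", 13)` returns 13 once every byte is accepted, and a bad descriptor surfaces as `-EBADF` rather than as -1 with errno set.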

Looking Deeper

If you want to follow this path in real kernel source code (Linux 6.x):

Step	File	Function
Syscall entry	`arch/x86/entry/entry_64.S`	`entry_SYSCALL_64`
Dispatch	`arch/x86/entry/common.c`	`do_syscall_64`
Syscall table	`arch/x86/entry/syscalls/syscall_64.tbl`	line with `1 write`
write() impl	`fs/read_write.c`	`ksys_write`, `vfs_write`
TTY write	`drivers/tty/tty_io.c`	`tty_write`
Line discipline	`drivers/tty/n_tty.c`	`n_tty_write`
PTY driver	`drivers/tty/pty.c`	`pty_write`
User copy	`include/linux/uaccess.h`	`copy_from_user`

Every line of that kernel code runs between your syscall instruction and the moment sysret returns control. Those few microseconds are not idle time: they are the operating system doing its job.