Chapter 7 Further Reading: First Programs
Syscall Interface and Kernel Internals
1. "The Linux Programming Interface" Michael Kerrisk, No Starch Press, 2010
Chapter 3 covers the system call interface in detail: how errno works, how libc wraps syscalls, and the contract between user space and the kernel. Appendix A shows how to trace a program's system calls with strace. For the case study on what happens inside sys_write, Kerrisk's treatment of kernel I/O buffering in Chapter 13 and of the VFS (Virtual File System) layer in Chapter 14 fills in the details that the case study summarizes. It remains one of the most complete Linux programming references available.
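The errno convention described above can be seen directly from C. The following is a minimal sketch, assuming Linux with glibc: the kernel reports failure as a negative error code in RAX, and the libc wrapper turns that into a return value of -1 plus a positive value stored in errno. The helper name write_errno is hypothetical and exists only for this illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue write(2) through libc's generic syscall()
 * wrapper and return 0 on success, or the errno value the wrapper
 * stored on failure. */
static int write_errno(int fd, const void *buf, size_t len)
{
    errno = 0;
    long ret = syscall(SYS_write, fd, buf, len);
    if (ret == -1)
        return errno;   /* e.g. EBADF for an invalid descriptor */
    return 0;
}
```

Calling write_errno(-1, "x", 1) yields EBADF, because -1 is never a valid file descriptor. A hand-written assembly program sees the raw negative value in RAX instead and has to test for it itself.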
2. Linux kernel source: arch/x86/entry/entry_64.S
Linus Torvalds and contributors. Available at git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
The actual assembly implementation of entry_SYSCALL_64 — the first kernel code that runs when your syscall instruction fires. Reading this file with the case study as a guide reveals: the swapgs instruction (switching GS for per-CPU data access), the register save sequence that builds pt_regs, the call to do_syscall_64, and the sysretq instruction that returns to user space. Everything in Case Study 7.2 is visible in this file. Reading kernel assembly is an advanced skill, but entry_64.S is surprisingly readable once you know what you're looking for.
3. "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2B: Instruction Set Reference (M-U)" Intel Corporation. Free at intel.com/sdm
The definitive reference for the SYSCALL and SYSRET instructions (pages 4-672 and 4-683 in recent editions; page numbers shift between revisions). The formal description specifies exactly which MSRs are read, what is saved to RCX and R11 (the return RIP and the caller's RFLAGS, respectively), how CS is set, and what conditions cause #UD (invalid-opcode exception) or #GP (general-protection fault). MOVSX, MOVSXD, and MOVZX are documented in the same volume: MOVSX and MOVZX extend an 8-bit or 16-bit source with sign bits or zero bits, and MOVSXD sign-extends a 32-bit source into a 64-bit register. For SCASB, Volume 2B documents the instruction's complete behavior, including the interaction with REPNE and the direction flag.
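The sign-extension and zero-extension behaviors mentioned above have direct C equivalents, which can help when reading the manual's pseudocode. This is a sketch under two assumptions: the function names are hypothetical, and the narrowing casts rely on the wrap-around conversion that GCC and Clang implement. On x86-64 these compilers typically emit MOVSX, MOVZX, and MOVSXD for such conversions, but that is a code-generation choice, not a guarantee.

```c
#include <stdint.h>

/* Sign-extend an 8-bit value to 64 bits (the operation MOVSX performs).
 * The cast to int8_t is implementation-defined for values above 127;
 * GCC and Clang wrap modulo 256. */
static int64_t sign_extend_8(uint8_t b)
{
    return (int64_t)(int8_t)b;
}

/* Zero-extend an 8-bit value to 64 bits (the operation MOVZX performs). */
static uint64_t zero_extend_8(uint8_t b)
{
    return (uint64_t)b;
}

/* Sign-extend a 32-bit value to 64 bits (the operation MOVSXD performs). */
static int64_t sign_extend_32(uint32_t d)
{
    return (int64_t)(int32_t)d;
}
```

The byte 0x80 becomes -128 under sign extension and 128 under zero extension, which is exactly the difference between treating a register's contents as signed or unsigned.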
String Operations and Performance
4. "Hacker's Delight, 2nd Edition" Henry S. Warren Jr., Addison-Wesley, 2012
Section 6-1 ("Find First 0-Byte") documents the bitmask technique used in the 8-bytes-at-a-time strlen implementation: (x - 0x01010101...) & ~x & 0x80808080.... Warren shows why this correctly detects zero bytes and explains why the AND with ~x is necessary to eliminate false positives from bytes with their high bit set. Chapter 5 ("Counting Bits") covers counting leading and trailing zeros, the operations that x86 exposes through BSR and BSF (bit scan reverse and forward) and related instructions. For assembly programmers, this book is the reference for bit manipulation algorithms that are otherwise passed around informally.
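The bitmask expression above can be tried out in a few lines of C. This is an illustrative sketch for 64-bit words, not the Case Study 7.1 implementation; the names ONES, HIGHS, zero_byte_mask, and first_zero_byte are hypothetical. Byte index 0 is the least significant byte, which is the lowest address on little-endian x86-64.

```c
#include <stdint.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Nonzero if and only if some byte of x is zero. The lowest set bit
 * always marks the first zero byte; bits above it may be spurious
 * because of borrow propagation in the subtraction. */
static uint64_t zero_byte_mask(uint64_t x)
{
    return (x - ONES) & ~x & HIGHS;
}

/* Index (0..7, counting from the least significant byte) of the first
 * zero byte in x, or -1 if x contains no zero byte. */
static int first_zero_byte(uint64_t x)
{
    uint64_t m = zero_byte_mask(x);
    if (m == 0)
        return -1;
    int i = 0;
    while ((m & 0x80) == 0) {   /* walk up one byte at a time */
        m >>= 8;
        i++;
    }
    return i;
}
```

The loop in first_zero_byte stands in for a single BSF or TZCNT followed by a shift right by 3, which is what an assembly version would normally use.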
5. "Optimizing Subroutines in Assembly Language" Agner Fog. Free at agner.org/optimize
Chapter 14 covers string operations: SCASB, REPNE SCASB, CMPSB, and why they're often slower than scalar loops on modern hardware. Section 12.9 explains the "string instruction paradox": microcoded string instructions cannot be pipelined the way that scalar instructions can, which is why production strlen implementations avoid SCASB. Chapter 13 covers vector (SIMD) programming, the basis of the AVX2 approach used by glibc's optimized strlen. The manual is revised regularly, so chapter and section numbers may differ in the current edition. This is the empirical reference for the performance numbers cited in Case Study 7.1.
6. glibc sysdeps/x86_64/multiarch/strlen-avx2.S
GNU C Library contributors. Available at sourceware.org/git/glibc.git
The actual production strlen implementation for x86-64 systems with AVX2. It implements:
- A short-string fast path (0-16 bytes without SIMD overhead)
- 32-bytes-at-a-time AVX2 scanning with vpcmpeqb and vpmovmskb
- IFUNC dispatch (runtime CPU detection that selects the fastest available implementation)
Reading the real implementation alongside Case Study 7.1's simplified version reveals the additional complexity that production code requires: handling misaligned pointers, page boundary issues, and the short-string overhead trade-off.
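The page boundary issue mentioned above comes down to one comparison: an unaligned vector load that starts near the end of a page may touch the next page, which might not be mapped. A minimal sketch of that guard follows, assuming 4096-byte pages and 32-byte (AVX2) vectors. The constants and the function name are assumptions made for this illustration, not identifiers taken from glibc.

```c
#include <stdbool.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u  /* assumption: typical x86-64 page size */
#define ASSUMED_VEC_SIZE    32u  /* assumption: one 256-bit AVX2 register */

/* True if a load of ASSUMED_VEC_SIZE bytes starting at p would extend
 * past the end of the page that contains p. */
static bool load_crosses_page(const void *p)
{
    uintptr_t offset = (uintptr_t)p & (ASSUMED_PAGE_SIZE - 1);
    return offset > ASSUMED_PAGE_SIZE - ASSUMED_VEC_SIZE;
}
```

A load at page offset 4064 covers bytes 4064 through 4095 and stays inside the page; at offset 4065 it spills one byte into the next page, so the code must fall back to a slower path that cannot fault.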
x86-64 MOV and Flag Behavior
7. "Computer Systems: A Programmer's Perspective, 3rd Edition" Randal E. Bryant and David R. O'Hallaron, Pearson, 2016
Section 3.4 covers data movement instructions (MOV, MOVZ, MOVS) with the partial register write semantics documented in Table 3.4. Figure 3.10 documents RFLAGS behavior for arithmetic instructions. The treatment is pitched at the same level as this book's Chapter 7: not full Intel SDM depth, but enough to understand what programs actually do. Section 3.6 covers control flow with complete flag tables for signed and unsigned comparisons. This is the recommended companion textbook for Part I of this book.
8. "Systems Performance: Enterprise and the Cloud, 2nd Edition" Brendan Gregg, Addison-Wesley, 2020
Chapter 6 covers CPUs from a performance perspective, including the instruction pipeline, branch prediction, and the cost of system calls. The section on "Syscall Interface" (Chapter 5.4) documents the Meltdown/Spectre mitigations that increased syscall overhead from ~100ns to 200-300ns on affected hardware. For the performance context in Case Study 7.2 ("context-switching is expensive but not catastrophic"), Gregg provides the empirical data. The chapter on profiling with perf and perf stat lets you measure syscall overhead in your own programs.
9. "The UNIX System V Interface Definition" AT&T Bell Laboratories, 1985. Historical document; relevant passages available in various online archives
An early formal specification of write(2) semantics. The wording usually quoted today is that of its descendant, POSIX.1: "The write() function shall attempt to write nbyte bytes from the buffer pointed to by buf to the file associated with the open file descriptor, fildes." The phrase "attempt to write" (not "shall write") is what permits partial writes: write() may transfer fewer bytes than requested and report the actual count in its return value. Understanding that this is correct behavior, not an error, is essential for writing robust assembly programs. Many system programming bugs involving write() trace back to ignoring this part of the specification.
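Handling partial writes correctly means looping until every byte has been accepted. The following is a minimal sketch in C, assuming a blocking file descriptor; the function name write_all is a conventional but hypothetical helper, and the same loop structure carries over to an assembly version that checks RAX after each syscall.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write exactly len bytes from buf to fd, retrying after partial writes
 * and after interruption by a signal. Returns 0 on success, or -1 with
 * errno set by the failing write(). */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved */
            return -1;
        }
        p += n;                 /* skip the bytes already written */
        len -= (size_t)n;
    }
    return 0;
}
```

A short count is most likely on pipes, sockets, and terminals, so code that only ever writes to regular files can hide this bug for a long time.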
10. "What Every Programmer Should Know About Memory" Ulrich Drepper, 2007. Free at akkadia.org/drepper/cpumemory.pdf
Section 6.2 covers the interaction between instruction-level parallelism and the load/store pipeline that affects loop performance. The comparison between byte-at-a-time loops and qword-at-a-time loops directly applies to the strlen performance comparison in Case Study 7.1. Drepper's analysis of prefetch behavior (Section 6.3) explains why the 8-bytes-at-a-time strlen is faster not just because it does fewer iterations, but because the larger stride makes hardware prefetching more effective for long strings.
11. "Low-Level Programming: C, Assembly, and Program Execution on Intel 64 Architecture" Igor Zhirkov, Apress, 2017
Chapter 5 ("Control Flow") documents all conditional jump instructions and their flag conditions with the same register trace approach used in this chapter. Chapter 7 ("System Calls") provides a complete Linux syscall reference with examples written in NASM. Appendix A contains the complete Linux x86-64 syscall table. Zhirkov's book is the closest in style and target audience to this textbook, making it an excellent complement — where this chapter says "write factorial," Zhirkov's exercises often provide worked solutions you can compare against.