Agner Fog, microarchitecture.pdf (agner.org) The per-microarchitecture misprediction penalty table. Haswell: 14-17 cycles. Skylake: 14-17 cycles. Zen 2: 14-23 cycles. These numbers explain when CMOV is worth the additional code complexity. → Chapter 10 Further Reading: Control Flow
"Branchless Equivalents of Simple Functions"
Chess Programming Wiki chessprogramming.org/Branchless_Equivalents Extensive collection of branchless implementations for common functions (abs, min, max, sign, clamp, swap, etc.). The implementations use the sign-mask technique (SAR to get all-ones/all-zeros mask) that appears throughout systems pr → Chapter 10 Further Reading: Control Flow
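As a rough illustration of the sign-mask technique this entry describes, the C sketch below derives an all-ones/all-zeros mask with an arithmetic right shift and uses it for `abs` and `sign`. The function names are invented for this example, and the code assumes that `>>` on a negative signed value is an arithmetic shift (true for GCC and Clang on x86-64 and ARM64, but implementation-defined in ISO C).

```c
#include <stdint.h>

/* Branchless absolute value via the sign mask.
 * Assumption: x >> 31 is an arithmetic shift (SAR), giving 0 or -1.
 * Like abs(), this does not handle INT32_MIN. */
int32_t abs_branchless(int32_t x)
{
    int32_t mask = x >> 31;        /* 0 if x >= 0, all ones if x < 0 */
    return (x ^ mask) - mask;      /* x unchanged, or ~x + 1 == -x */
}

/* Branchless sign: -1, 0, or +1. */
int32_t sign_branchless(int32_t x)
{
    int32_t neg = x >> 31;                               /* -1 if x < 0 */
    int32_t pos = (int32_t)((0u - (uint32_t)x) >> 31);   /* 1 if x > 0 */
    return neg | pos;
}
```

Whether a compiler turns this into SAR/XOR/SUB or something else depends on the target and optimization level, so the generated assembly is worth checking.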
"Data Structures in the Linux Kernel"
various kernel documentation kernel.org/doc/html/latest/ The Linux kernel's `include/linux/list.h` implements doubly-linked lists as an intrusive linked list (the list pointers are embedded in the struct). Reading this implementation shows how to do linked list manipulation in real-world C and assem → Chapter 12 Further Reading: Arrays, Strings, and Data Structures
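The intrusive pattern can be sketched in a few lines of C. This is a simplified miniature written for illustration, not the kernel's actual code: `struct list_node`, `struct task`, and `sum_ids` are hypothetical names, and the `container_of` macro here omits the type checking the kernel version performs.

```c
#include <stddef.h>

/* The list pointers: embedded inside whatever struct needs to be listed. */
struct list_node {
    struct list_node *next, *prev;
};

/* Recover the enclosing struct from a pointer to its embedded node. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Example payload, invented for this sketch. */
struct task {
    int id;
    struct list_node link;
};

/* An empty circular list points at itself. */
void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert n immediately after head. */
void list_add(struct list_node *n, struct list_node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Walk the list, mapping each node back to its task. */
int sum_ids(struct list_node *head)
{
    int total = 0;
    for (struct list_node *p = head->next; p != head; p = p->next)
        total += container_of(p, struct task, link)->id;
    return total;
}
```

Because the node sits at a fixed offset inside the payload, `container_of` compiles down to a single pointer subtraction, which is why the pattern translates so directly to assembly.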
"Engineering a Compiler" by Cooper and Torczon
if the compiler pipeline discussion in Chapter 39 interested you. The complete academic treatment of compilation: from parsing to register allocation to instruction scheduling. → Chapter 40: Your Assembly Future
"Exploiting the Hard-Working DWARF"
James Oakley and Sergey Bratus, USENIX WOOT 2011 Discusses how exception handling tables and jump tables in compiled code can be exploited. Relevant background for Chapter 35's exploit development. → Chapter 10 Further Reading: Control Flow
"Falsehoods Programmers Believe About Money"
Erik Wijk, blog Financial correctness requires not just fixed-point arithmetic but also: understanding different currency denominations (JPY has no cents), exchange rate representation, rounding laws by jurisdiction (some require half-up, some round-half-to-even), and overflow analysis for large tra → Chapter 14 Further Reading: Floating Point
"Function Call Conventions and Stack Frame Layout"
Bryan Cantrill (now Oxide Computer) YouTube lecture. Clear explanation of the System V ABI with animated stack diagrams. Covers the 16-byte alignment requirement and its historical context (alignment needed for FXSAVE before SSE was common). → Chapter 11 Further Reading: The Stack and Function Calls
"Function Call Overhead"
Agner Fog, optimizing_assembly.pdf (agner.org) Chapter 14 covers the cost of function calls: CALL/RET, push/pop overhead, and how to minimize it (leaf functions, inlining, tail call optimization). The concrete cycle counts for function call overhead versus inline code are useful for justifying when → Chapter 11 Further Reading: The Stack and Function Calls
"Intel Intrinsics Guide"
software.intel.com/sites/landingpage/IntrinsicsGuide/ The intrinsics guide allows you to search for compiler intrinsics that map to specific instructions. For example, `_mm_popcnt_u64` maps to `POPCNT`. This is useful when writing C code that uses these instructions via intrinsics rather than inline → Chapter 13 Further Reading: Bit Manipulation
"Intel® 64 and IA-32 Software Developer's Manuals"
the authoritative source, always. Download the full PDF set or bookmark the HTML version. When something in assembly is ambiguous, this is where the answer lives. → Chapter 40: Your Assembly Future
"Microarchitecture" documentation
Agner Fog (agner.org/optimize/microarchitecture.pdf) Detailed per-microarchitecture analysis. The sections on Intel Sandy Bridge, Haswell, and Skylake explain the AGU (Address Generation Unit) pipeline and why the four-component addressing mode was cheaper on Haswell than on Sandy Bridge. → Chapter 8 Further Reading: Data Movement and Addressing Modes
"Optimizing Assembly"
instruction selection, dependency chains, loop optimization, SIMD, branch prediction, and every micro-optimization technique covered in this chapter with concrete NASM examples - **"Instruction Tables"** — latency, throughput, and port assignments for every instruction on every major Intel/AMD micro → Chapter 33 Further Reading: Performance Analysis and Optimization
"Optimizing subroutines in assembly language"
Agner Fog agner.org/optimize/optimizing_assembly.pdf Chapter 16 covers LEA and address generation extensively. Agner Fog's optimization manuals are the standard reference for x86 performance tuning. The "Instruction tables" document (separate PDF) gives the exact latency and throughput for every ins → Chapter 8 Further Reading: Data Movement and Addressing Modes
"Parsing Integers Quickly"
Daniel Lemire, blog. lemire.me/blog Shows how PEXT/PDEP can accelerate SIMD parsing of integers from text. Demonstrates real-world use of these BMI2 instructions beyond the toy examples in textbooks. → Chapter 13 Further Reading: Bit Manipulation
"Smashing The Stack For Fun And Profit"
Aleph One (Elias Levy), Phrack Magazine #49, 1996 phrack.org/issues/49/14.html The paper that defined modern stack overflow exploitation. Still readable and technically accurate for the basic technique. The stack layout diagrams and shellcode injection methodology are foundational. → Chapter 11 Further Reading: The Stack and Function Calls
"Software Optimization of AES on x86-64"
Käsper and Schwabe, IACR ePrint 2009 The paper that introduced the "bitsliced" AES implementation achieving record software speeds. Shows that before AES-NI, fast software AES on older hardware required sophisticated bit manipulation. The contrast with AES-NI performance in Chapter 15 is → Chapter 13 Further Reading: Bit Manipulation
"Sorting Networks and Their Applications"
Batcher, AFIPS Spring Joint Computer Conference 1968 The original paper on sorting networks built from odd-even merge and bitonic sort, which remain the most-cited constructions. Sorting networks are the foundation of SIMD-accelerated sorting. → Chapter 10 Further Reading: Control Flow
"Stack Smashing Protection"
Hiroaki Etoh, IBM Research The original description of the GCC stack canary implementation (then called ProPolice, now `-fstack-protector`). Explains the canary placement strategy and why local variables are reordered to put arrays near the canary. → Chapter 11 Further Reading: The Stack and Function Calls
"Structure Layout Optimization"
Ulrich Drepper, "What Every Programmer Should Know About Memory" lwn.net/Articles/250967/ Section 6 covers struct layout optimization for cache performance. The AoS vs. SoA analysis (Section 6.2) includes assembly-level examples of how different layouts affect SIMD vectorization. → Chapter 12 Further Reading: Arrays, Strings, and Data Structures
"Tail Call Optimization"
GCC wiki gcc.gnu.org/wiki/TailCalls Explains when GCC transforms `return func(args)` into a `jmp` instead of `call` + `ret`, eliminating the stack frame growth for recursive calls at the cost of losing the frame in backtraces. Relevant to the recursive factorial example: `factorial(n-1)` is not a ta → Chapter 11 Further Reading: The Stack and Function Calls
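To make the distinction concrete, here is a hedged C sketch of a recursive factorial next to an accumulator-passing variant. The name `factorial_acc` and its second parameter are assumptions introduced for this example; whether the compiler actually emits a `jmp` depends on the compiler and optimization level.

```c
#include <stdint.h>

/* The multiply runs after the recursive call returns, so the call is not
 * in tail position and each level needs its own stack frame. */
uint64_t factorial(uint64_t n)
{
    return n <= 1 ? 1 : n * factorial(n - 1);
}

/* Accumulator form: the recursive call is the last action, so a compiler
 * may replace call + ret with a jump and reuse the current frame. */
uint64_t factorial_acc(uint64_t n, uint64_t acc)
{
    return n <= 1 ? acc : factorial_acc(n - 1, acc * n);
}
```

Calling `factorial_acc(n, 1)` computes the same value as `factorial(n)`; comparing the two at `-O2` in a disassembler shows whether the transformation was applied.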
"The Art of Exploitation" by Jon Erickson
if the security chapters engaged you. The most approachable deep dive into x86 exploitation, shellcode, and format strings. Includes a live Linux environment for hands-on practice. → Chapter 40: Your Assembly Future
"Why memcpy() is Better Than You Think"
blog post, cloudflare.com Discusses the non-temporal store optimization (`MOVNTQ`) for large copies, showing 40-60% improvement for multi-GB copies by avoiding cache pollution. Includes assembly code examples. → Chapter 12 Further Reading: Arrays, Strings, and Data Structures
"x86 Instruction Encoding"
OSDev Wiki (wiki.osdev.org/X86-64_Instruction_Encoding) The most accessible explanation of how ModRM, SIB, REX, and displacement bytes work together to encode every addressing mode. Understanding the encoding is not required for using the instructions, but it explains *why* RSP cannot be an index re → Chapter 8 Further Reading: Data Movement and Addressing Modes
"operation size not specified": add `QWORD`/`DWORD`/`WORD`/`BYTE` to ambiguous memory operands - "symbol is multiply defined": use `.local` labels instead of global ones in functions - "invalid combination of opcode and operands": memory-to-memory move doesn't exist; wrong operand types - "`times` c → Chapter 6 Key Takeaways: The NASM Assembler
x86-64: `lea rsi, [rip + offset]` — one instruction (ModRM encoding handles it) - ARM64: `adr x1, label` — one instruction when within ±1MB; `adrp + add` for farther - RISC-V: `la a1, label` — pseudoinstruction that assembles to `auipc + addi` — always two instructions → Case Study 39-2: RISC-V Assembly — Hello World on RISC-V
5. Load immediate:
x86-64: `mov rax, 64` — 7 bytes (REX + opcode + ModRM + 4-byte immediate) - ARM64: `mov x8, #64` — 4 bytes (MOVZ encoding) - RISC-V: `li a7, 64` — pseudo-instruction → `addi a7, x0, 64` — 4 bytes → Case Study 39-2: RISC-V Assembly — Hello World on RISC-V
Major Linux distributions enable CET in packages as of 2022-2024 - Many system libraries (libc, libssl) ship with `ENDBR64` markers in recent versions - Not all software is recompiled yet; CET provides partial protection in mixed environments → Case Study 36-2: Intel CET — The Hardware Solution to Memory Corruption
Agner Fog, "Instruction Tables"
agner.org/optimize/instruction_tables.pdf Per-microarchitecture latency and throughput for all floating-point instructions: ADDSS, MULSD, SQRTSD, CVTSI2SD, FSIN, etc. The FSIN/FCOS timings (50-100 cycles) vs. SSE polynomial (10-20 cycles) comparison from the case study is documented here. → Chapter 14 Further Reading: Floating Point
Answer: A
sys_write is syscall number 1 (RAX=1). stderr is file descriptor 2 (RDI=2). Stdout is fd 1, stdin is fd 0. → Chapter 25 Quiz: System Calls
Answer: B
`syscall` uses RCX to save the return address (RIP), destroying whatever was there. R10 is used as the substitute. → Chapter 25 Quiz: System Calls
If you own an M1/M2/M3/M4 Mac, you are already on ARM64. `clang` on macOS compiles ARM64 natively. GDB is replaced by LLDB. The system calls differ from Linux. Chapter 18 covers the differences. → Part III: ARM64 Assembly
Argument conventions:
x86-64: `syscall` instruction; number in RAX; args in RDI, RSI, RDX, R10, R8, R9; return in RAX - ARM64: `svc #0` instruction; number in X8; args in X0, X1, X2, X3, X4, X5; return in X0 - RISC-V: `ecall` instruction; number in a7; args in a0, a1, a2, a3, a4, a5; return in a0 → Appendix F: Linux System Call Tables
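The same convention can be exercised from C through libc's `syscall()` wrapper, which loads the number and arguments into the registers listed above for the current architecture. This is a minimal, Linux-only sketch; `write_stderr` is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Write len bytes to stderr through the raw system call interface.
 * On x86-64, syscall() puts SYS_write in RAX and the three arguments in
 * RDI, RSI, RDX; on ARM64 they go in X8 and X0-X2.
 * Returns the number of bytes written, or -1 with errno set. */
long write_stderr(const char *buf, unsigned long len)
{
    return syscall(SYS_write, 2 /* fd 2 = stderr */, buf, len);
}
```

Running a program that calls this under `strace` shows the `write(2, ...)` call with the same argument order.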
ARM64 binary runs but produces wrong output
Likely a calling convention mismatch when mixing C and assembly. Verify that the function prologue and epilogue are correct and that the ABI register assignments match (x0–x7 for arguments, x0 for return value). → Lab Environment Setup: NASM, GDB, QEMU, and Cross-Compilation Tools
a 512-byte bootloader you write in assembly 2. **Transitions through CPU modes** — real mode → protected mode → long mode 3. **Initializes the hardware** — GDT, IDT, keyboard controller, timer 4. **Manages memory** — a page allocator, a simple heap 5. **Handles interrupts** — keyboard input, timer t → How to Use This Book
Abadi et al., CCS 2005 (Microsoft Research) The original paper on Control Flow Integrity, the defense against jump table hijacking and return-oriented programming. Modern compilers implement CFI via `-fsanitize=cfi`. Relevant to the Chapter 35-37 security chapters. → Chapter 10 Further Reading: Control Flow
Chapter 10: Control Flow
JMP (short, near, indirect); conditional jumps (all variants) - Signed vs. unsigned comparisons: JL vs. JB — the critical distinction - Translating if/else, while, for, do-while, switch/case - CMOV (conditional move): branchless programming - Jump tables for switch/case - Loop optimization: LOOP ins → Learning Assembly Language — Detailed Content Outline
Chapter 11: The Stack and Function Calls
PUSH, POP mechanics; CALL pushes RIP, RET pops it - Stack frame: push rbp / mov rbp, rsp / sub rsp, N - System V AMD64 ABI: RDI, RSI, RDX, RCX, R8, R9; callee/caller-saved - Red zone: 128 bytes below RSP reserved for leaf functions - Stack alignment: 16-byte requirement before CALL - Recursive facto → Learning Assembly Language — Detailed Content Outline
Chapter 12: Arrays, Strings, and Data Structures
Array access with base+index×scale addressing modes - REP MOVSB/STOSB/CMPSB/SCASB: string operations - Implementing strlen, strcpy, memset, memcmp in assembly - Linked list traversal and manipulation - Struct field access: base+offset for each field - AoS vs. SoA data layouts and their performance i → Learning Assembly Language — Detailed Content Outline
Chapter 13: Bit Manipulation
Bitmasks: isolate (AND), set (OR), toggle (XOR), clear (AND NOT) - BT, BTS, BTR, BTC: bit test operations - BSF, BSR, LZCNT, TZCNT: bit scan and count - POPCNT: hardware popcount - BMI1/BMI2 instructions: ANDN, BEXTR, BLSI, BLSR, PDEP, PEXT - XOR tricks: swap, power-of-2 test, isolate lowest set bit → Learning Assembly Language — Detailed Content Outline
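Several of the tricks listed for this chapter have direct C equivalents, sketched below. The function names are invented for this example; the comments note which BMI1 instruction computes the same result.

```c
#include <stdbool.h>
#include <stdint.h>

/* A power of two has exactly one set bit, so clearing the lowest set bit
 * leaves zero. Zero itself is excluded. */
bool is_power_of_two(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Isolate the lowest set bit: x AND (two's complement of x). Same result as BLSI. */
uint64_t lowest_set_bit(uint64_t x)
{
    return x & (~x + 1);
}

/* Clear the lowest set bit. Same result as BLSR. */
uint64_t clear_lowest_set_bit(uint64_t x)
{
    return x & (x - 1);
}

/* XOR swap. The guard matters: if a and b alias, XOR would zero the value. */
void xor_swap(uint64_t *a, uint64_t *b)
{
    if (a != b) {
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }
}
```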
The ARM64 architecture itself: the 31-register file, the zero register, PSTATE flags, fixed-width encoding, and the load/store discipline that defines RISC programming. → Part III: ARM64 Assembly
Chapter 16: ARM64 Architecture
RISC vs. CISC philosophy; why ARM64 is not "simpler" - 31 general-purpose registers (X0–X30), SP, XZR, LR, FP - PSTATE flags: N, Z, C, V — set with S-suffix instructions - Fixed-width 4-byte instructions vs. x86-64's variable length - Load/store architecture: no memory operands in ALU instructions - → Learning Assembly Language — Detailed Content Outline
Chapter 17
The ARM64 instruction set: data processing, the barrel shifter, load/store addressing modes, branches, the AAPCS64 calling convention, and Linux system calls. → Part III: ARM64 Assembly
Chapter 17: ARM64 Instruction Set
ADD, SUB, AND, ORR, EOR with barrel shifter: `ADD X0, X1, X2, LSL #3` - LDR, STR, LDP, STP with all addressing modes (pre/post-indexed) - B, BL, BR, BLR, RET; conditional: B.EQ, CBZ, CBNZ, TBZ - AAPCS64: X0–X7 for args, X19–X28 callee-saved, LR/FP preserved - ARM64 Linux system calls: SVC #0, X8 = n → Learning Assembly Language — Detailed Content Outline
Chapter 18
ARM64 programming in practice: arrays, string operations without string instructions, floating-point with the NEON/FP register file, SIMD with NEON, and the differences between Linux ARM64 and Apple Silicon macOS. → Part III: ARM64 Assembly
Chapter 18: ARM64 Programming
Arrays: LSL shift in address calculation, LDP for pairs - memcpy/strlen/memset without REP string instructions - SIMD/FP registers: V0–V31, D0–D31, S0–S31 - ARM64 floating-point: FADD, FMUL, FCMP, FCVT - NEON SIMD: ADD Vd.4S, FMLA Vd.4S — vectorizing a loop - macOS (Apple Silicon) differences from L → Learning Assembly Language — Detailed Content Outline
Chapter 19
The great comparison: x86-64 vs. ARM64, side by side. Same programs, both ISAs. Code density, power, performance, and why the industry is betting on ARM64 to win the next decade. → Part III: ARM64 Assembly
Chapter 19: x86-64 vs. ARM64 Comparison
Code density, instruction count, encoding complexity comparison - Register file: 16 GPRs with aliasing vs. 31 + zero register - Calling conventions side-by-side - Performance characteristics: clock speed vs. power efficiency - The Apple Silicon transition and its industry implications - ARM in the d → Learning Assembly Language — Detailed Content Outline
Chapter 1: Why Assembly Language?
The compilation pipeline: C → preprocessor → compiler → assembler → linker → executable - Disassembling a C program to see the machine code beneath it - Seven reasons to learn assembly in 2026: security, OS, embedded, performance, compilers, CTF, curiosity - The MinOS kernel project preview: what yo → Learning Assembly Language — Detailed Content Outline
Chapter 20
The assembly-C interface itself: calling C functions (printf, malloc, fopen) from assembly; writing assembly functions callable from C; passing structs; the red zone; variadic functions. A complete working mixed project. → Part IV: The Assembly-C Interface
Chapter 21
Reading compiler output: how to use `gcc -S`, Compiler Explorer (godbolt.org), and AT&T vs. Intel syntax. What `-O0`, `-O1`, `-O2`, `-O3` do to your code. The patterns to recognize: function prologue, local variable layout, if-else, loops, switch tables, virtual dispatch. → Part IV: The Assembly-C Interface
Chapter 21: Understanding Compiler Output
AT&T syntax vs. Intel syntax conversion table - GCC -S output patterns: prologue, if/else, loops, switch, recursion - Optimization levels -O0 through -O3: what each does to the assembly - Compiler Explorer (godbolt.org) as a learning tool - Recognizing: strength reduction, inlining, constant folding → Learning Assembly Language — Detailed Content Outline
Chapter 22
Inline assembly: GCC extended syntax, output/input/clobber constraints, and when to use inline assembly (CPUID, RDTSC, atomics, I/O ports). When NOT to use it (compiler intrinsics are usually better). Common mistakes. → Part IV: The Assembly-C Interface
Chapter 22: Inline Assembly
GCC extended asm syntax: `asm("..." : outputs : inputs : clobbers)` - Constraint types: "r", "m", "i"; named operands %[name] - Practical examples: CPUID, RDTSC, CMPXCHG, port I/O, memory fences - The volatile qualifier; when to use it - Compiler intrinsics as the preferred alternative to inline asm → Learning Assembly Language — Detailed Content Outline
Chapter 23
Linking, loading, and ELF: how source becomes an executable; ELF sections and segments; the linker's job (symbol resolution + relocation); static vs. dynamic linking; the loader's job; linker scripts for bare-metal code (the MinOS connection). → Part IV: The Assembly-C Interface
Chapter 23: Linking, Loading, and ELF
Object files: sections, symbol table, relocations - The linker: symbol resolution, relocation patching - Static vs. dynamic linking; ldd for dependency inspection - ELF format: header, program header table (segments), section header table - The Linux ELF loader: initial program state, argv/envp/auxv → Learning Assembly Language — Detailed Content Outline
Chapter 24
Dynamic linking in depth: the PLT/GOT mechanism traced to machine code; lazy binding; RELRO; LD_PRELOAD for interposition; dlopen/dlsym for runtime loading; GOT overwrite security implications (preview of Chapter 36). → Part IV: The Assembly-C Interface
Chapter 24: Dynamic Linking in Depth
LD.so: the dynamic linker and its initialization sequence - PLT/GOT mechanism: lazy binding step by step in assembly - GOT structure: first three entries, resolver function - RELRO: partial (sections reordered) and full (GOT read-only) - LD_PRELOAD for interposition; malloc debugger example - dlopen → Learning Assembly Language — Detailed Content Outline
Chapter 25: System Calls
The syscall instruction: saves RIP to RCX, RFLAGS to R11 - Linux x86-64 convention: RAX=number, RDI/RSI/RDX/R10/R8/R9=args - Key syscalls with complete NASM examples: read, write, open, mmap, fork, exec, exit - Writing a minimal libc: wrappers around raw syscalls - strace: tracing system calls for d → Learning Assembly Language — Detailed Content Outline
BIOS boot: CPU starts in real mode at 0xFFFF0, loads MBR to 0x7C00 - Real mode: 16-bit, segment:offset addressing, 1MB limit, BIOS interrupts - Protected mode: GDT, CR0.PE=1, far jump to flush prefetch - Long mode: PAE, minimal page tables, EFER.LME=1, CR0.PG=1 - Complete bootloader: prints boot mes → Learning Assembly Language — Detailed Content Outline
Chapter 29: Device I/O
Port-mapped I/O: IN/OUT instructions, x86-64 I/O address space - Memory-mapped I/O: devices at physical addresses, MOV instructions - Common ports: PS/2 (0x60/0x64), COM1 (0x3F8), PIC (0x20/0xA0), PIT (0x40) - PIT programming: 100Hz timer interrupt for MinOS scheduler - UART/Serial: baud rate, data → Learning Assembly Language — Detailed Content Outline
Chapter 2: Numbers in the Machine
Binary: bits, bytes, words, doublewords, quadwords - Hexadecimal as binary shorthand; hex↔binary conversion - Unsigned integers, overflow, and wraparound - Two's complement: representation, arithmetic, overflow vs. carry - The RFLAGS register: CF, OF, SF, ZF, PF, AF — when each is set - IEEE 754 flo → Learning Assembly Language — Detailed Content Outline
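The difference between carry and overflow is easiest to see on 8-bit values. The following C sketch models what an 8-bit `ADD` would leave in CF and OF; `struct add8_result` and `add8` are names made up for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

struct add8_result {
    uint8_t sum;
    bool cf;   /* carry: the unsigned result did not fit in 8 bits */
    bool of;   /* overflow: the signed result did not fit in 8 bits */
};

struct add8_result add8(uint8_t a, uint8_t b)
{
    struct add8_result r;
    unsigned wide = (unsigned)a + b;
    r.sum = (uint8_t)wide;
    r.cf = wide > 0xFF;
    /* Signed overflow: both operands differ in sign from the result,
     * which can only happen when they share a sign. */
    r.of = ((a ^ r.sum) & (b ^ r.sum) & 0x80) != 0;
    return r;
}
```

For example, `add8(0x7F, 0x01)` sets OF but not CF (127 + 1 overflows as signed), while `add8(0xFF, 0x01)` sets CF but not OF (255 + 1 wraps as unsigned, but -1 + 1 = 0 is fine as signed).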
Tools: objdump, GDB, Ghidra, IDA Free, radare2, pwndbg - Recognizing compiler patterns: prologue/epilogue, loops, switch tables, virtual dispatch - Working without symbols: string cross-references, constant identification - Reconstructing data types and control flow from disassembly - GDB scripting → Learning Assembly Language — Detailed Content Outline
Chapter 35: Buffer Overflows and Memory Corruption
Stack buffer overflow: overwriting adjacent stack memory including return address - Shellcode: position-independent code for exploit payloads (educational) - NOP sleds and reliability before ASLR - Format string vulnerabilities: %x stack reads, %n memory writes - Heap corruption: use-after-free, dou → Learning Assembly Language — Detailed Content Outline
Chapter 36: Exploit Mitigations
Stack canaries: fs:0x28, prologue/epilogue assembly, GCC flags - NX/DEP: the NX bit in page table entries, hardware enforcement - ASLR: stack, heap, library, and executable randomization; entropy values - PIE: position-independent executable for full ASLR - RELRO: partial and full; preventing GOT ov → Learning Assembly Language — Detailed Content Outline
Chapter 37: Return-Oriented Programming
Why ROP: NX/DEP killed shellcode injection, ROP reuses existing code - Gadgets: instruction sequences ending in RET - Building a ROP chain: forged stack, gadget addresses, chained execution - Finding gadgets: ROPgadget, ropper tools - ret2libc, ret2plt: common ROP techniques - JOP, SROP (sigreturn-o → Learning Assembly Language — Detailed Content Outline
Chapter 38: Capstone — A Minimal OS Kernel
MinOS architecture: bootloader + kernel in assembly/C - Components integrated: VGA driver, keyboard handler, timer, page allocator, scheduler, shell - MinOS source structure: boot/, kernel/, drivers/, proc/, syscall/, shell/ - Three capstone tracks: A (minimal), B (with scheduler), C (with filesyste → Learning Assembly Language — Detailed Content Outline
Chapter 39: Beyond Assembly
Compilers: lexing, parsing, IR, optimization passes, code generation - Register allocation (graph coloring) and instruction selection - JIT compilation: generating x86-64 machine code at runtime - WebAssembly: stack machine portable ISA, sandboxing through types - RISC-V: the open ISA, modular exten → Learning Assembly Language — Detailed Content Outline
Chapter 3: The x86-64 Architecture
The 16 general-purpose registers and their 32/16/8-bit sub-registers - The critical aliasing rule: 32-bit writes zero upper 32 bits; 16-bit writes do not - RIP (instruction pointer), RFLAGS, segment registers (FS/GS for TLS) - XMM/YMM/ZMM registers (SSE/AVX/AVX-512) — brief introduction - The execut → Learning Assembly Language — Detailed Content Outline
Chapter 40: Your Assembly Future
A genuine inventory of what you now know - Career paths: OS development, security research, compiler engineering, embedded, HPC - Next projects: extend MinOS, write a compiler backend, CTF competitions - Communities: OSDev, /r/asm, CTF platforms, security conferences - Books to read next: CS:APP, OS → Learning Assembly Language — Detailed Content Outline
Chapter 4: Memory
The flat 64-bit virtual address space (48 bits usable) - Process memory layout: text, data, BSS, heap, stack, mapped libraries - Virtual vs. physical addresses; the MMU's role - Byte alignment requirements and SIMD alignment (16/32/64 bytes) - Little-endian byte ordering with examples - NASM data de → Learning Assembly Language — Detailed Content Outline
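Little-endian ordering can be observed from C by copying a value's in-memory bytes into an array. This is a small sketch with a hypothetical helper name; the result described in the note assumes a little-endian host such as x86-64 or typical ARM64 Linux.

```c
#include <stdint.h>
#include <string.h>

/* Return the byte stored at the lowest address of a 32-bit value. */
uint8_t first_byte_in_memory(uint32_t value)
{
    uint8_t bytes[sizeof value];
    memcpy(bytes, &value, sizeof value);   /* copy the object representation */
    return bytes[0];
}
```

On a little-endian machine, `first_byte_in_memory(0x11223344)` yields `0x44`: the least significant byte sits at the lowest address, which is the same order a `dd 0x11223344` directive produces in a NASM listing.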
Chapter 5: Your Development Environment
Installing NASM, GCC, binutils, GDB, QEMU, Ghidra - Your first NASM program: hello world, assemble, link, run - The Makefile template for assembly projects - GDB for assembly: breakpoints, stepi, info registers, x/16xb, layout regs - objdump, readelf, nm — binary inspection tools - Linking assembly → Learning Assembly Language — Detailed Content Outline
Chapter 6: The NASM Assembler
NASM syntax: Intel syntax (destination first, no sigils, brackets for memory) - Sections: .text, .data, .bss, .rodata - Labels, global, extern, common directives - Data definition depth: db/dw/dd/dq, times, equ, $ and $$ - NASM preprocessor: %define, %assign, %macro/%endmacro, %if, %include - Useful → Learning Assembly Language — Detailed Content Outline
Chapter 7: Your First Assembly Programs
MOV in all its forms: register, immediate, memory load, memory store - ADD, SUB, INC, DEC, NEG — with complete register traces - XOR reg, reg — zeroing a register and why this is standard - System calls: RAX = number, RDI/RSI/RDX/R10/R8/R9 = args, RAX = return - Four complete programs: hello, exit, → Learning Assembly Language — Detailed Content Outline
Chapter 8: Data Movement and Addressing Modes
MOV forms; 32-bit write zero-extension behavior - Addressing modes: immediate, register, direct, indirect, base+offset, base+index×scale+disp - RIP-relative addressing for position-independent code - LEA: computing addresses without memory access; use as fast arithmetic - MOVZX (zero-extend) and MOV → Learning Assembly Language — Detailed Content Outline
Chapter 9: Arithmetic and Logic
ADD, SUB with all operand forms; flag effects - ADC, SBB: multi-precision arithmetic (128-bit addition example) - MUL/IMUL (one-, two-, three-operand forms); DIV/IDIV - AND, OR, XOR, NOT — bitwise operations - TEST (AND without store) and CMP (SUB without store) - SHL, SHR, SAR, ROL, ROR, SHLD, SHRD → Learning Assembly Language — Detailed Content Outline
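The ADC-based 128-bit addition mentioned in this outline can be mirrored in C by propagating the carry by hand. This is a sketch under the assumption that a 128-bit value is stored as two 64-bit halves; `struct u128` and `add128` are illustrative names.

```c
#include <stdint.h>

struct u128 {
    uint64_t lo, hi;
};

/* 128-bit addition from 64-bit halves, mirroring ADD (low) then ADC (high). */
struct u128 add128(struct u128 a, struct u128 b)
{
    struct u128 r;
    r.lo = a.lo + b.lo;               /* ADD: may wrap around */
    uint64_t carry = r.lo < a.lo;     /* CF: 1 if the low half wrapped */
    r.hi = a.hi + b.hi + carry;       /* ADC: add the halves plus carry-in */
    return r;
}
```

The comparison `r.lo < a.lo` recovers in software what the hardware keeps in CF, which is why the assembly version needs only two instructions.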
CMOV hurts when:
The branch is highly predictable (>95% one way), so the processor's branch predictor handles it at almost no cost - The "not selected" computation is expensive or involves a slow load - CMOV creates a longer dependency chain → Chapter 10: Control Flow
CMOV wins when:
The branch is unpredictable (roughly 50/50 distribution) - The values being selected are already in registers (no load involved) - The computation fits the "compute both, select one" pattern → Chapter 10: Control Flow
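The "compute both, select one" pattern can be written in C without a branch in the source. The sketch below uses a mask to pick between two values that are already computed; the function names are hypothetical, and whether the compiler emits CMOV, a mask sequence, or a branch is its own decision.

```c
#include <stdint.h>

/* Return a if cond is nonzero, otherwise b. Both inputs are already
 * evaluated by the time the selection happens. */
int64_t select_branchless(int cond, int64_t a, int64_t b)
{
    int64_t mask = -(int64_t)(cond != 0);   /* all ones if cond, else zero */
    return (a & mask) | (b & ~mask);
}

/* Example use: a max that never branches on the comparison result. */
int64_t max_branchless(int64_t a, int64_t b)
{
    return select_branchless(a > b, a, b);
}
```

Because both operands must be ready before the select, this form suits cheap values already in registers, matching the conditions listed above.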
Compiler Explorer (godbolt.org)
Matt Godbolt's invaluable tool for exploring compiler output. Chapter 21 uses it extensively. → Acknowledgments
configuration space
256 bytes of registers accessible via port I/O at ports 0xCF8 (address register) and 0xCFC (data register). → Chapter 29: Device I/O
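Accessing a register through these two ports starts with building the 32-bit value written to 0xCF8. The C sketch below assumes the standard PCI configuration mechanism #1 layout (enable bit, bus, device, function, dword-aligned register offset); the macro and function names are invented for this example, and the port write itself is left out because it needs ring-0 `OUT` access.

```c
#include <stdint.h>

#define PCI_CONFIG_ADDRESS 0xCF8   /* address register port */
#define PCI_CONFIG_DATA    0xCFC   /* data register port */

/* Build the value for port 0xCF8.
 * bit 31: enable, bits 23-16: bus, bits 15-11: device,
 * bits 10-8: function, bits 7-2: register offset (low 2 bits forced to 0). */
uint32_t pci_config_address(uint8_t bus, uint8_t dev, uint8_t func,
                            uint8_t offset)
{
    return 0x80000000u
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev & 0x1F) << 11)
         | ((uint32_t)(func & 0x07) << 8)
         | (offset & 0xFCu);
}
```

A kernel would write this value to `PCI_CONFIG_ADDRESS` with a 32-bit `OUT`, then read the selected dword from `PCI_CONFIG_DATA` with a 32-bit `IN`.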
CR3
the page table base register. CR3 holds the physical address of the PML4 table (aligned to 4KB). On a context switch, the OS writes a new value to CR3, and the new process's address space takes effect immediately. → Chapter 27: Memory Management
CTF community
particularly the pwn category — has pushed assembly and binary exploitation education further and faster than any academic setting. This book's security chapters are influenced by the quality of public CTF writeups. → Acknowledgments
D
Data layout
AoS (RGBA RGBA...) vs. SoA (RRR... GGG... BBB...). SoA is almost always better for SIMD; AoS requires shuffles to extract channels. 2. **Lane behavior** — AVX2 operates in two independent 128-bit halves for most byte/word instructions. Always check the manual: does your instruction cross lanes or no → Case Study 15.1: SIMD Image Processing — Grayscale Conversion
device I/O
how software talks to hardware. Two paradigms dominate: port-mapped I/O, using the `IN` and `OUT` instructions, and memory-mapped I/O, where device registers appear as ordinary memory addresses. You will program real devices: the PIT timer, the UART serial port, the PIC interrupt controller. These s → Part V: Systems Programming
Do NOT use inline assembly when:
A compiler intrinsic exists: `<immintrin.h>` for SSE/AVX, `<nmmintrin.h>` for SSE4.2 - `<stdatomic.h>` covers your atomic operation - `__builtin_clz`, `__builtin_popcount`, `__builtin_expect` exist for your case - You could write a separate `.asm` file and link it → Chapter 22: Inline Assembly
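As an example of the builtin route, the sketch below uses GCC/Clang builtins where one might otherwise reach for inline `POPCNT` or `BSR`. The wrapper names are hypothetical; the builtins themselves are the 64-bit variants of the ones named above.

```c
#include <stdint.h>

/* Number of set bits. The compiler emits POPCNT when the target allows it,
 * or a software fallback otherwise. */
int count_set_bits(uint64_t x)
{
    return __builtin_popcountll(x);
}

/* Index of the highest set bit, i.e. floor(log2(x)).
 * x must be nonzero: __builtin_clzll(0) is undefined. */
int highest_set_bit(uint64_t x)
{
    return 63 - __builtin_clzll(x);
}
```

Unlike an `asm` block, these stay visible to the optimizer, so the compiler can constant-fold them, schedule around them, and pick the best instruction for the target.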
E
Executable
output of linker, directly executable 3. **Shared library** (`.so`) — position-independent code, loaded by dynamic linker → Chapter 23: Linking, Loading, and ELF
Identify the construct For each disassembly snippet, identify whether it shows: (a) a function call, (b) a virtual method call, (c) a function pointer call through a struct, (d) a tail call, or (e) a leaf function with no frame. → Chapter 34 Exercises: Reverse Engineering
Crackme analysis Consider a program that calls the following validation function. Without running the program, determine what input produces a return value of 1: → Chapter 34 Exercises: Reverse Engineering
Ghidra workflow Describe the step-by-step process for using Ghidra to analyze a password-protected binary where you want to find the validation logic. Include: where to start, how to navigate, what to look for, and how to confirm your analysis. → Chapter 34 Exercises: Reverse Engineering
Exercise 34.17
Stripped binary navigation A stripped x86-64 ELF binary's entry point is at `0x401080`. The `_start` code calls `__libc_start_main`. Describe in detail how you would use GDB to find the address of `main()` in this binary without symbols. → Chapter 34 Exercises: Reverse Engineering
Exercise 34.18
Cross-architecture RE How does reverse engineering ARM64 binaries differ from x86-64? Specifically: a) What are the equivalent patterns for function prologue/epilogue? b) How does the calling convention affect register usage patterns? c) How do you identify a function's return value in ARM64? d) Wha → Chapter 34 Exercises: Reverse Engineering
Exercise 34.19
Obfuscation recognition Describe how you would identify and handle each of these obfuscation techniques encountered while reverse engineering: a) UPX-packed executable (the binary is compressed) b) Opaque predicates (always-taken or never-taken branches) c) String encryption (strings are decrypted a → Chapter 34 Exercises: Reverse Engineering
Exercise 34.2
objdump flags For each task, write the exact `objdump` command: a) Disassemble the binary `./mystery` using Intel syntax b) Show the symbol table of `./server` c) Show the dynamic symbols of `/usr/bin/curl` d) Show all sections and their addresses in `./program` e) Dump the contents of the `.rodata` → Chapter 34 Exercises: Reverse Engineering
Jump table analysis The following disassembly includes a jump table. Identify: a) The bounds check instruction b) The jump table load c) How many cases exist d) Reconstruct the switch statement structure → Chapter 34 Exercises: Reverse Engineering
GDB Python script Write a GDB Python script that: 1. Sets a breakpoint at `0x401196` (a hypothetical `strcmp` call site) 2. When the breakpoint is hit, prints both string arguments (RDI and RSI) 3. Does NOT stop execution — continues automatically 4. Logs all calls to a file named `strcmp_log.txt` → Chapter 34 Exercises: Reverse Engineering
Exercise 34.9
pwndbg workflow List five pwndbg commands (not standard GDB commands) that would be useful when analyzing a binary for security research, and describe what each shows. → Chapter 34 Exercises: Reverse Engineering
Compiler flags for security Compile a simple C program with each of the following flags and explain what protection each adds. Use `checksec` to verify the resulting binary has the expected properties: → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.15
AddressSanitizer usage Write a simple C program that has a buffer overflow (for testing purposes on your own machine), compile it with `-fsanitize=address`, and interpret the AddressSanitizer output report. What information does ASAN provide that GDB alone does not? → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Shellcode properties Answer the following about shellcode requirements: a) What does "position-independent" mean, and why is it required? b) Why must shellcode typically be free of null bytes? c) What instruction(s) are used for x86-64 system calls? d) What is the syscall number for `execve` on Linu → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Identify the vulnerability type For each description, identify the vulnerability: (a) stack buffer overflow, (b) heap buffer overflow, (c) use-after-free, (d) double-free, (e) format string vulnerability, (f) integer overflow leading to buffer overflow. → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Makefile security flags Write a Makefile that compiles a C program with all recommended security flags. Include: - Canary (`-fstack-protector-strong`) - FORTIFY_SOURCE - PIE - Full RELRO - CET (if supported: `-fcf-protection=full`) - Helpful warnings (`-Wall -Wextra -Wformat-security`) → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.15
Security regression testing Describe a CI/CD pipeline check that ensures compiled binaries always have the required security features. What command would you run? What would cause it to fail? → Chapter 36 Exercises: Exploit Mitigations
PIE vs no-PIE Compile a simple "Hello World" program twice: once with `-no-pie` and once with `-pie -fPIE`. Run each 5 times and record the address of `main`. What do you observe? What does this mean for an attacker who knows the binary but not the load address? → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.7
Canary bypass requirements A server has a stack canary and is running with ASLR. The server also has a format string vulnerability in its logging function. → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.8
ASLR entropy calculation a) On a 64-bit Linux system with ASLR fully enabled, the stack base has approximately 24 bits of randomness, aligned to 4096-byte page boundaries. How many possible stack base addresses exist? b) A 32-bit x86 Linux system has approximately 8 bits of stack randomness. How man → Chapter 36 Exercises: Exploit Mitigations
Boot sequence tracing Trace the execution path from power-on to the MinOS shell prompt. For each stage, identify: a) The CPU mode (real mode / protected mode / long mode) b) What code is executing (BIOS / bootloader / kernel) c) The approximate address range being executed from d) What the code does → Chapter 38 Exercises: Capstone — A Minimal OS Kernel
A20 line a) Why does the A20 address line need to be enabled? b) What address wraps without A20? c) Describe the three methods for enabling A20 (BIOS INT, keyboard controller, fast A20 via port 0x92) d) Which method does the MinOS bootloader use, and why is it the fastest? → Chapter 38 Exercises: Capstone — A Minimal OS Kernel
Exercise 38.5
VGA text mode a) The VGA text buffer is at physical address `0xB8000`. How many bytes does the 80×25 screen require? b) Write the C expression to compute the offset of character at column `x`, row `y` c) What are the attribute byte meanings for: white text on black, green text on blue, blinking red → Chapter 38 Exercises: Capstone — A Minimal OS Kernel
RISC-V hello world The RISC-V hello world in the chapter uses syscall 64 (write) and 93 (exit). Note that Linux RISC-V syscall numbers differ from x86-64. → Chapter 39 Exercises: Beyond Assembly
JIT security in a post-CET world Modern browsers run WebAssembly with JIT compilation on CET-enabled hardware. Describe: → Chapter 39 Exercises: Beyond Assembly
JIT code generation Write C code (using inline arrays of bytes) that generates and executes each of these x86-64 functions: → Chapter 39 Exercises: Beyond Assembly
Exercise 39.8
WASM vs. register machine Compare WASM (stack machine) and x86-64 (register machine) for the expression `(a + b) * (c - d)`: → Chapter 39 Exercises: Beyond Assembly
Exercise 40.11
Interview preparation Assembly and systems knowledge appears in technical interviews for security, systems, and performance roles. For each question, write a confident 2-3 sentence answer: → Chapter 40 Exercises: Your Assembly Future
Conference talk selection Browse the recorded talks from one of these conferences: - DEF CON (defcon.org/media/video) - CCC (media.ccc.de) - USENIX Security (usenix.org/conferences/byname/108) → Chapter 40 Exercises: Your Assembly Future
Teaching plan The best way to consolidate knowledge is to teach it. Choose one topic from this book and plan a 20-minute explanation you could give to a peer: → Chapter 40 Exercises: Your Assembly Future
Exercise 40.19
Reverse engineering practice Find a small open-source compiled binary you use regularly (a command-line utility, a library function). Strip its debug symbols and practice reverse engineering it: → Chapter 40 Exercises: Your Assembly Future
Explain to a beginner Write a clear explanation (3-4 sentences each) suitable for someone who knows C but has never written assembly: → Chapter 40 Exercises: Your Assembly Future
F
File 1: greeting.asm
Define a global string `greeting_msg` and its length `greeting_len` - Declare a global function `print_greeting` that prints the greeting using sys_write - No other global symbols → Chapter 6 Exercises: The NASM Assembler
File 2: math.asm
Define a global function `multiply_by_two(rdi)` that returns 2*rdi in rax - Define a global function `add_numbers(rdi, rsi)` that returns rdi+rsi in rax → Chapter 6 Exercises: The NASM Assembler
File 3: main.asm
Declare extern for all symbols from greeting.asm and math.asm - Implement `_start` which: 1. Calls `print_greeting` 2. Calls `multiply_by_two(21)` and stores the result 3. Calls `add_numbers(result, 0)` and exits with the result as exit code → Chapter 6 Exercises: The NASM Assembler
it fuses multiply and accumulate into one operation, matching the mathematical structure of FIR filters exactly 2. **The reduction step** (FADDP + FADDP) is a fixed cost amortized over the length of the vectors — for longer arrays, this overhead is negligible 3. **Two-accumulator unrolling** (the 8- → Case Study 18-1: NEON SIMD — Vectorizing a Dot Product for Audio Processing
Follow-up questions:
Can you give a real example of a program where hand-written assembly is still used today? (Hint: look at the Linux kernel or cryptography libraries.) - What would you do if a program was running slowly and profiling showed that 80% of time was in one function? How would assembly knowledge help? → Chapter 1: Why Assembly Language? — Discussion Questions
the GNU Debugger. Used in every lab in the book. `layout regs` mode has saved more debugging sessions than we can count. → Acknowledgments
Ghidra
the NSA's open-source reverse engineering framework. Its existence as a free, professional-quality tool has fundamentally changed who can learn reverse engineering. → Acknowledgments
Ghidra fails to launch
Verify Java 17+ is installed (`java -version`). Ghidra requires Java 17 or later; it will not launch with Java 11 or earlier. Run `update-alternatives --config java` to select the correct Java version if multiple are installed. → Lab Environment Setup: NASM, GDB, QEMU, and Cross-Compilation Tools
Glossary
150+ key terms with precise definitions - **Answers to Selected Exercises** — solutions to all ⭐-marked exercises - **Bibliography** — 60+ references organized by category - **Appendix A** — x86-64 instruction quick reference with flags and latency - **Appendix B** — ARM64 instruction quick referenc → Learning Assembly Language — Detailed Content Outline
Godbolt Compiler Explorer
godbolt.org The indispensable tool for understanding what a compiler does with C code. Enter a C function, select GCC x86-64 with `-O2`, and see the assembly output immediately. Essential for verifying the LEA patterns described in this chapter. → Chapter 8 Further Reading: Data Movement and Addressing Modes
Intel: Tiger Lake (2020), Ice Lake, and all subsequent desktop/server processors - AMD: Not yet implemented in mainstream products (as of this writing; announced for future processors) - ARM: Pointer Authentication (PAC) and Branch Target Identification (BTI) serve similar roles on ARM64 → Case Study 36-2: Intel CET — The Hardware Solution to Memory Corruption
Hardware knowledge required
knowing that SCASB is microcoded, that YMM registers can process 32 bytes, and that `bsf` finds the first set bit in a single instruction: all of this comes from reading CPU documentation, not from the C specification. → Case Study 7.1: strlen() in x86-64 Assembly — Four Implementations
CLFLUSH evicts the entire 64-byte line, and loading any byte in that line causes the entire line to be fetched from DRAM. This is the cache line granularity in action. → Case Study 22-1: Measuring Cache Effects with RDTSC
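The line-granularity point can be illustrated with a little address arithmetic. This is a minimal sketch, assuming the 64-byte line size quoted above; the helper names `cache_line_base` and `same_cache_line` are hypothetical and do not come from the case study.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumed line size, matching the 64-byte figure above */

/* Round an address down to the first byte of the cache line that holds it. */
uintptr_t cache_line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
}

/* True when both addresses fall inside the same 64-byte line, so a flush or
   a fetch triggered by one of them also affects the other. */
bool same_cache_line(uintptr_t a, uintptr_t b)
{
    return cache_line_base(a) == cache_line_base(b);
}
```

Under this assumption, addresses `0x1200` and `0x123F` share a line, while `0x1240` starts the next one.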
mnemonic plus operands, emit machine code bytes 2. **Directives** — control the assembler's behavior, do not emit code 3. **Preprocessor directives** — begin with `%`, processed before assembly → Chapter 6: The NASM Assembler
K
Key emphases:
AT&T syntax is not wrong, just different. Security tools (GDB default), objdump, and much online documentation use it. - Learning to read compiler output is a permanent skill — it improves debugging, performance work, and security analysis throughout a career. - Compiler Explorer is a legitimate pro → Chapter 21: Understanding Compiler Output — Instructor Notes
not a simulation or toy. It runs on QEMU with real emulated hardware, handles real interrupts, manages real memory pages, and runs real processes. Every instruction is understood because you wrote it. → Chapter 38 Key Takeaways: Capstone — A Minimal OS Kernel
Chapter 1: Why Assembly Language? — read fully; the security angle is central - Chapter 2: Numbers in the Machine — read fully; two's complement and hex are used constantly in RE - Chapter 3: x86-64 Architecture — study carefully; you must know all registers - Chapter 4: Memory — study carefully; th → Self-Paced Learning Guide — Learning Assembly Language
Module 2: x86-64 Core (30 hours)
Chapter 8: Data Movement and Addressing Modes — study carefully; addressing modes appear in every disassembly - Chapter 9: Arithmetic and Logic — read; focus on flag-setting behavior, not instruction catalog - Chapter 10: Control Flow — study carefully; every loop and conditional in disassembly - Ch → Self-Paced Learning Guide — Learning Assembly Language
Module 4: Security (30 hours)
Chapter 34: Reverse Engineering — study fully; set up Ghidra and complete all exercises - Chapter 35: Buffer Overflows — study fully; implement the exploits in a safe environment - Chapter 36: Defenses — study fully; understanding ASLR, NX, and stack canaries is as important as understanding the att → Self-Paced Learning Guide — Learning Assembly Language
N
NASM on Linux
the assembler the industry actually uses, on the platform where systems programming is taught - **Two architectures** — x86-64 as the primary (your laptop, CTF challenges, compiler output) and ARM64 as the essential secondary (phones, Macs, Raspberry Pi, and increasingly servers) - **Security conten → Preface
data structures maintained by the OS kernel in physical memory — to perform this translation. The OS controls what mappings exist by modifying the page tables; the hardware enforces those mappings on every access. → Chapter 27: Memory Management
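As an illustration of what the hardware does with those tables, the sketch below splits a virtual address into its table indices. It assumes x86-64 4-level paging with 4 KiB pages, which this excerpt does not state explicitly; the `va_parts` struct and the function name are hypothetical.

```c
#include <stdint.h>

/* Index fields of a virtual address under x86-64 4-level paging, 4 KiB pages. */
typedef struct {
    unsigned pml4;    /* bits 47..39: index into the top-level table   */
    unsigned pdpt;    /* bits 38..30: index into the second-level table */
    unsigned pd;      /* bits 29..21: index into the page directory     */
    unsigned pt;      /* bits 20..12: index into the page table         */
    unsigned offset;  /* bits 11..0 : byte offset within the 4 KiB page */
} va_parts;

/* Each level is indexed by 9 bits (512 entries); the low 12 bits pass
   through translation unchanged. */
va_parts split_virtual_address(uint64_t va)
{
    va_parts p;
    p.pml4   = (unsigned)((va >> 39) & 0x1FF);
    p.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    p.pd     = (unsigned)((va >> 21) & 0x1FF);
    p.pt     = (unsigned)((va >> 12) & 0x1FF);
    p.offset = (unsigned)(va & 0xFFF);
    return p;
}
```

For the address `0x401196`, this yields directory index 2, table index 1, and page offset `0x196`, with both upper indices zero.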
use a gadget to call `puts(printf)` or similar, printing a known libc address (a PLT stub address that we called). From this, calculate: `libc_base = leaked_address - known_offset_of_puts_in_libc` 2. **Phase 2: Calculate real addresses** — now that libc base is known, calculate the real addresses of → Chapter 37: Return-Oriented Programming and Modern Exploitation
Prevention:
Use memory-safe languages for code handling untrusted input - Enable AddressSanitizer (`-fsanitize=address`) during testing — it detects UAF and heap overflows - Use `valgrind` during development - Enable glibc heap hardening (tcache security, `MALLOC_CHECK_`) - Use safe allocators with added integr → Chapter 35: Buffer Overflows and Memory Corruption
Primary audience:
Computer science students (sophomore/junior level) who have written C programs and want to understand what the compiler is actually doing - Security researchers and CTF players who need to read disassembly fluently - Embedded engineers moving from microcontrollers to Linux-based ARM platforms → Learning Assembly Language: What's Really Happening Inside the Machine
the open-source machine emulator and virtualizer. Without QEMU, the bare-metal and OS chapters would require physical hardware. QEMU makes them accessible to everyone. → Acknowledgments
All assembly examples must assemble cleanly with NASM 2.16+ and run correctly on x86-64 Linux - ARM64 examples must assemble with GAS (GNU Assembler) on AArch64 Linux or Apple Silicon macOS - New content should match the practitioner tone — precise, direct, no fluff - Open an issue before starting l → Learning Assembly Language: What's Really Happening Inside the Machine
R
Raspberry Pi 4/5
Around $50-100, runs 64-bit Linux natively. Real hardware, real ARM64, boots off an SD card. Ideal for embedded development. → Part III: ARM64 Assembly
README.md
This file. Course mapping, guide structure, grading philosophy. - **lab-setup-qemu-gdb.md** — Complete environment setup for Linux, WSL2, and macOS. Required reading before the first lab session. - **syllabus-one-semester.md** — 15-week schedule (2 lectures + 1 lab/week). Suitable for a single semes → Instructor Guide — Learning Assembly Language: What's Really Happening Inside the Machine
Real-world applications to mention:
Chrome V8 JavaScript engine: hand-written assembly in hot paths - Linux kernel: hundreds of `.S` files for architecture-specific setup, interrupt handling, context switching - OpenSSL / BoringSSL: hand-written assembly for cryptographic primitives (AES-NI, SHA extensions) - Malware analysis: reverse → Chapter 1: Why Assembly Language? What You See When You Look Below C — Instructor Notes
Real-world applications:
Security researchers use GDB and objdump (both AT&T default) constantly - Performance engineers use Compiler Explorer to test optimization hypotheses - Embedded engineers read compiler output to verify that expensive patterns (division, floating point) were not silently introduced - Understanding co → Chapter 21: Understanding Compiler Output — Instructor Notes
indices into the GDT, not raw addresses. Each GDT entry (8 bytes) describes a segment: its base address, size limit, and access rights. → Chapter 28: Bare Metal Programming
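The 8-byte layout can be sketched in C. The field positions below follow the standard x86 segment descriptor format; the function names `gdt_encode` and `selector_index` are hypothetical, and the access and flag values used afterwards are common flat-model choices, not necessarily the ones the chapter uses.

```c
#include <stdint.h>

/* Pack base, 20-bit limit, access byte, and 4-bit flags into one descriptor.
   The base and limit are split across the entry for historical reasons. */
uint64_t gdt_encode(uint32_t base, uint32_t limit, uint8_t access, uint8_t flags)
{
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFF);             /* limit 15:0  -> bits 15:0  */
    d |= (uint64_t)(base & 0xFFFFFF) << 16;      /* base 23:0   -> bits 39:16 */
    d |= (uint64_t)access << 40;                 /* access byte -> bits 47:40 */
    d |= (uint64_t)((limit >> 16) & 0xF) << 48;  /* limit 19:16 -> bits 51:48 */
    d |= (uint64_t)(flags & 0xF) << 52;          /* flags       -> bits 55:52 */
    d |= (uint64_t)((base >> 24) & 0xFF) << 56;  /* base 31:24  -> bits 63:56 */
    return d;
}

/* A selector holds the GDT index in bits 15:3; bit 2 selects GDT or LDT and
   bits 1:0 carry the requested privilege level. */
unsigned selector_index(uint16_t selector)
{
    return selector >> 3;
}
```

With base 0, limit `0xFFFFF`, access `0x9A`, and flags `0xA`, `gdt_encode` produces `0x00AF9A000000FFFF`, and selector `0x08` refers to entry 1.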
Linux kernel: kernel-mode IBT since 5.18; user-space SHSTK support since 6.6 - glibc: CET-aware since 2.28 (required for setjmp/longjmp compatibility) - GCC: `-fcf-protection=full` enables IBT+SHSTK code generation (since GCC 8) - Clang: `-fcf-protection=full` sim → Case Study 36-2: Intel CET — The Hardware Solution to Memory Corruption
combine the pointer with a version counter in one 128-bit value, use `CMPXCHG16B`. The version counter increments on every change, making ABA impossible (you would need the pointer AND the counter to match). → Chapter 30: Concurrency at the Hardware Level
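The version-counter idea can be sketched portably with C11 atomics. Note the substitution: instead of a 64-bit pointer plus 64-bit counter updated with `CMPXCHG16B`, this sketch packs a 32-bit index and a 32-bit version into one 64-bit word, so an ordinary 8-byte compare-and-swap suffices. All names here (`tagged_head`, `head_pack`, `head_update`) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Low 32 bits: index of the current head; high 32 bits: version counter. */
typedef _Atomic(uint64_t) tagged_head;

uint64_t head_pack(uint32_t index, uint32_t version)
{
    return ((uint64_t)version << 32) | index;
}

uint32_t head_index(uint64_t v)   { return (uint32_t)v; }
uint32_t head_version(uint64_t v) { return (uint32_t)(v >> 32); }

/* Install new_index only if BOTH the index and the version still match the
   snapshot in `expected`. Every successful update bumps the version, so a
   head that went A -> B -> A no longer compares equal to a stale snapshot.
   The 32-bit counter can wrap; the 128-bit form makes that far less likely. */
bool head_update(tagged_head *head, uint64_t expected, uint32_t new_index)
{
    uint64_t desired = head_pack(new_index, head_version(expected) + 1);
    return atomic_compare_exchange_strong(head, &expected, desired);
}
```

A caller would take a snapshot with `atomic_load`, compute the new head, and retry `head_update` in a loop until it succeeds.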
The key instructions:
`vpxor ymm0, ymm0, ymm0`: zero all 32 bytes of YMM0 (our comparison target) - `vmovdqu ymm1, [rdi]`: load 32 unaligned bytes from memory into YMM1 - `vpcmpeqb ymm2, ymm1, ymm0`: compare each of the 32 bytes against zero; each result byte is 0xFF (match) or 0x00 (no match) - `vpmovmskb ecx, ymm2`: ex → Case Study 7.1: strlen() in x86-64 Assembly — Four Implementations
**Track A** (minimal): bootloader + VGA text + keyboard + simple shell - **Track B** (standard): Track A + preemptive scheduler + two processes - **Track C** (extended): Track B + a simple filesystem + loadable user programs → How to Use This Book
x >> 1 = 0b01011000 - XOR = 0b11101001 (1s at positions where adjacent bits differ) - popcount(0b11101001) = 5 transitions ✓ (transitions: 1→0 at pos 7, 0→1 at pos 5, 1→1 at pos 4=same, 1→0 at pos 3, 0→0 at pos 2=same, 0→0 at pos 1=same, 0→1 at pos 0) Wait: 0b10110001: positions 7,6,5,4,3,2,1,0 = 1, → Case Study 13.1: Bit Manipulation Puzzles — Classic Problems in Assembly
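The XOR-and-popcount technique can be checked with a short C sketch. One caveat is worth stating: bit 7 of `x ^ (x >> 1)` compares the top bit of `x` against a shifted-in zero, not against a real neighbour. For `0b10110001` the raw popcount is therefore 5, while the number of adjacent-bit transitions is 4. Whether the top bit should count depends on the definition the case study settles on; this sketch masks it off. The function names are hypothetical.

```c
#include <stdint.h>

/* Portable population count: each pass clears the lowest set bit. */
unsigned popcount8(uint8_t v)
{
    unsigned n = 0;
    while (v) {
        v &= (uint8_t)(v - 1);
        n++;
    }
    return n;
}

/* Count the positions i in 0..6 where bit i differs from bit i+1. */
unsigned count_transitions8(uint8_t x)
{
    uint8_t diff = (uint8_t)(x ^ (x >> 1));
    return popcount8(diff & 0x7F);  /* drop bit 7: it pairs with a shifted-in 0 */
}
```

In assembly the loop would normally be replaced by a single `POPCNT` on CPUs that support it.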
All of Track A - PIT timer at 100Hz - Round-robin preemptive scheduler - 2-4 concurrent kernel threads - Physical memory bitmap allocator - Kernel-mode context switch → Chapter 38: Capstone — A Minimal OS Kernel
Track C (Extended — Track B + ~1000 lines):
All of Track B - Virtual memory (user/kernel separation) - System call interface (SYSCALL/SYSRET) - Simple FAT12 filesystem (or RAM disk) - User-mode shell process (runs in ring 3) - User-mode programs (via ELF loading or simple format) → Chapter 38: Capstone — A Minimal OS Kernel
Typo and grammar fixes
always welcome - **Code corrections** — if assembly code doesn't assemble or produces wrong output - **Diagram improvements** — better ASCII art for complex concepts - **Additional exercises** — following the register-trace or programming format - **ARM64 coverage** — additional ARM64 examples paral → Contributing to Learning Assembly Language
U
unsigned
**OF = 1** after ADD means the result overflowed when operands are treated as **signed** - **CF = 1** after SUB means borrow occurred (a < b in unsigned comparison) - **OF = 1** after SUB means the result overflowed the signed range → Chapter 9: Arithmetic and Logic
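These flag rules can be modelled in C on 64-bit operands. This is a sketch of the conditions only, not of how the CPU computes them, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* CF after SUB a, b: a borrow is needed exactly when a < b as unsigned values. */
bool sub_sets_cf(uint64_t a, uint64_t b)
{
    return a < b;
}

/* OF after ADD a, b: both operands have the same sign and the result's sign
   differs from it. */
bool add_sets_of(uint64_t a, uint64_t b)
{
    uint64_t r = a + b;  /* unsigned wraparound is well defined in C */
    return (((a ^ r) & (b ^ r)) >> 63) != 0;
}

/* OF after SUB a, b: the operands have different signs and the result's sign
   differs from the sign of a. */
bool sub_sets_of(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;
    return (((a ^ b) & (a ^ r)) >> 63) != 0;
}
```

Adding 1 to `INT64_MAX` sets OF but not CF, while adding 1 to `UINT64_MAX` sets CF but not OF, which is why the two flags have to be read according to the intended signedness.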
Use inline assembly when:
You need instructions the compiler cannot generate: `CPUID`, `RDTSC`, `RDRAND`, `IN`/`OUT` (I/O ports), `HLT`, `LGDT` - You need a specific atomic sequence: `CMPXCHG`, `LOCK XCHG` (though `` is usually better) - You are writing a security-critical function where the compiler might optim → Chapter 22: Inline Assembly
Loops with more than ~16 instructions may benefit from 32-byte alignment - Functions called from many places benefit from alignment (ensures first instruction on a full cache line) - Alignment matters most when the µop cache is the frontend bottleneck (check `dsb2mite_switches` perf event) → Chapter 33: Performance Analysis and Optimization
When NOT to use software prefetch:
Sequential access: hardware prefetcher handles this - Data that fits in L1/L2: already cached, no benefit - High-bandwidth SIMD loops: bandwidth is the bottleneck, not latency → Chapter 32: The Memory Hierarchy
When to use software prefetch:
Linked list traversal: prefetch `node->next->data` while processing `node->data` - Tree traversal: prefetch children while processing parent - Hash table lookups: prefetch the bucket while computing the hash - Scatter/gather operations: prefetch target addresses before writing → Chapter 32: The Memory Hierarchy
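The linked-list case from the list above might look like the following sketch. It assumes GCC or Clang, since `__builtin_prefetch` is a compiler builtin and not standard C; the `struct node` layout and the function name are hypothetical.

```c
#include <stddef.h>

struct node {
    struct node *next;
    long *data;   /* payload stored out of line, so it misses separately */
};

/* Sum every payload, requesting the NEXT node's payload before working on
   the current one so the load can overlap with the addition. */
long sum_list_prefetch(const struct node *head)
{
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (n->next != NULL) {
            /* Arguments: address, 0 = read access, 1 = low temporal locality. */
            __builtin_prefetch(n->next->data, 0, 1);
        }
        total += *n->data;
    }
    return total;
}
```

With so little work per node the prefetch has little time to complete, so any gain should be measured, not assumed; the pattern pays off more when each node involves heavier processing.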
When x87 is appropriate:
80-bit extended precision is required (e.g., some numerical algorithms benefit from the extra precision) - Hardware transcendental functions (FSIN, FCOS) are desired without linking libm - Working with legacy code that uses x87 → Chapter 14: Floating Point