Glossary: Learning Assembly Language: What's Really Happening Inside the Machine

#

"Branch Misprediction Cost": Agner Fog, microarchitecture.pdf (agner.org) The per-microarchitecture misprediction penalty table. Haswell: 14-17 cycles. Skylake: 14-17 cycles. Zen 2: 14-23 cycles. These numbers explain when CMOV is worth the additional code complexity. → Chapter 10 Further Reading: Control Flow
"Branchless Equivalents of Simple Functions": Chess Programming Wiki chessprogramming.org/Branchless_Equivalents Extensive collection of branchless implementations for common functions (abs, min, max, sign, clamp, swap, etc.). The implementations use the sign-mask technique (SAR to get all-ones/all-zeros mask) that appears throughout systems pr → Chapter 10 Further Reading: Control Flow
"Data Structures in the Linux Kernel": various kernel documentation kernel.org/doc/html/latest/ The Linux kernel's `include/linux/list.h` implements doubly-linked lists as an intrusive linked list (the list pointers are embedded in the struct). Reading this implementation shows how to do linked list manipulation in real-world C and assem → Chapter 12 Further Reading: Arrays, Strings, and Data Structures
"Engineering a Compiler" by Cooper and Torczon: if the compiler pipeline discussion in Chapter 39 interested you. The complete academic treatment of compilation: from parsing to register allocation to instruction scheduling. → Chapter 40: Your Assembly Future
"Exploiting the Hard-Working DWARF": James Oakley and Sergey Bratus, USENIX WOOT 2011 Discusses how exception handling tables and jump tables in compiled code can be exploited. Relevant background for Chapter 35's exploit development. → Chapter 10 Further Reading: Control Flow
"Falsehoods Programmers Believe About Money": Erik Wijk, blog Financial correctness requires not just fixed-point arithmetic but also: understanding different currency denominations (JPY has no cents), exchange rate representation, rounding laws by jurisdiction (some require half-up, some round-half-to-even), and overflow analysis for large tra → Chapter 14 Further Reading: Floating Point
"Function Call Conventions and Stack Frame Layout": Bryan Cantrill (now Oxide Computer) YouTube lecture. Clear explanation of the System V ABI with animated stack diagrams. Covers the 16-byte alignment requirement and its historical context (alignment needed for FXSAVE before SSE was common). → Chapter 11 Further Reading: The Stack and Function Calls
"Function Call Overhead": Agner Fog, optimizing_assembly.pdf (agner.org) Chapter 14 covers the cost of function calls: CALL/RET, push/pop overhead, and how to minimize it (leaf functions, inlining, tail call optimization). The concrete cycle counts for function call overhead versus inline code are useful for justifying when → Chapter 11 Further Reading: The Stack and Function Calls
"Intel Intrinsics Guide": software.intel.com/sites/landingpage/IntrinsicsGuide/ The intrinsics guide allows you to search for compiler intrinsics that map to specific instructions. For example, `_mm_popcnt_u64` maps to `POPCNT`. This is useful when writing C code that uses these instructions via intrinsics rather than inline → Chapter 13 Further Reading: Bit Manipulation
"Intel x86 Encoding Cheat Sheet": Scott Wolchok A compact reference for the instruction encoding format. Useful if you ever need to manually decode bytes from a memory dump or write a disassembler. → Chapter 8 Further Reading: Data Movement and Addressing Modes
"Intel® 64 and IA-32 Software Developer's Manuals": the authoritative source, always. Download the full PDF set or bookmark the HTML version. When something in assembly is ambiguous, this is where the answer lives. → Chapter 40: Your Assembly Future
"Microarchitecture" documentation: Agner Fog (agner.org/optimize/microarchitecture.pdf) Detailed per-microarchitecture analysis. The sections on Intel Sandy Bridge, Haswell, and Skylake explain the AGU (Address Generation Unit) pipeline and why the four-component addressing mode was cheaper on Haswell than on Sandy Bridge. → Chapter 8 Further Reading: Data Movement and Addressing Modes
"Optimizing Assembly": instruction selection, dependency chains, loop optimization, SIMD, branch prediction, and every micro-optimization technique covered in this chapter with concrete NASM examples - **"Instruction Tables"** — latency, throughput, and port assignments for every instruction on every major Intel/AMD micro → Chapter 33 Further Reading: Performance Analysis and Optimization
"Optimizing subroutines in assembly language": Agner Fog agner.org/optimize/optimizing_assembly.pdf Chapter 16 covers LEA and address generation extensively. Agner Fog's optimization manuals are the standard reference for x86 performance tuning. The "Instruction tables" document (separate PDF) gives the exact latency and throughput for every ins → Chapter 8 Further Reading: Data Movement and Addressing Modes
"Parsing Integers Quickly": Daniel Lemire, blog. lemire.me/blog Shows how PEXT/PDEP can accelerate SIMD parsing of integers from text. Demonstrates real-world use of these BMI2 instructions beyond the toy examples in textbooks. → Chapter 13 Further Reading: Bit Manipulation
"Smashing The Stack For Fun And Profit": Aleph One (Elias Levy), Phrack Magazine #49, 1996 phrack.org/issues/49/14.html The paper that defined modern stack overflow exploitation. Still readable and technically accurate for the basic technique. The stack layout diagrams and shellcode injection methodology are foundational. → Chapter 11 Further Reading: The Stack and Function Calls
"Software Optimization of AES on x86-64": Käsper and Schwabe, IACR ePrint 2009 The paper that introduced the "bitsliced" AES implementation achieving record software speeds. Shows that even with AES-NI available, software AES on very old hardware required sophisticated bit manipulation. The contrast with AES-NI performance in Chapter 15 is → Chapter 13 Further Reading: Bit Manipulation
"Sorting Networks and Their Applications": Batcher, AFIPS Spring Joint Computer Conference 1968 The original paper on optimal sorting networks. Batcher's odd-even merge sort and bitonic sort are the most-cited networks. Sorting networks are the foundation of SIMD-accelerated sorting. → Chapter 10 Further Reading: Control Flow
"Stack Smashing Protection": Hiroaki Etoh, IBM Research The original description of the GCC stack canary implementation (then called ProPolice, now `-fstack-protector`). Explains the canary placement strategy and why local variables are reordered to put arrays near the canary. → Chapter 11 Further Reading: The Stack and Function Calls
"Structure Layout Optimization": Ulrich Drepper, "What Every Programmer Should Know About Memory" lwn.net/Articles/250967/ Section 6 covers struct layout optimization for cache performance. The AoS vs. SoA analysis (Section 6.2) includes assembly-level examples of how different layouts affect SIMD vectorization. → Chapter 12 Further Reading: Arrays, Strings, and Data Structures
"Tail Call Optimization": GCC wiki gcc.gnu.org/wiki/TailCalls Explains when GCC transforms `return func(args)` into a `jmp` instead of `call` + `ret`, eliminating the stack frame growth for recursive calls at the cost of losing the frame in backtraces. Relevant to the recursive factorial example: `factorial(n-1)` is not a ta → Chapter 11 Further Reading: The Stack and Function Calls
"The Art of Exploitation" by Jon Erickson: if the security chapters engaged you. The most approachable deep dive into x86 exploitation, shellcode, and format strings. Includes a live Linux environment for hands-on practice. → Chapter 40: Your Assembly Future
"Why memcpy() is Better Than You Think": blog post, cloudflare.com Discusses the non-temporal store optimization (`MOVNTQ`) for large copies, showing 40-60% improvement for multi-GB copies by avoiding cache pollution. Includes assembly code examples. → Chapter 12 Further Reading: Arrays, Strings, and Data Structures
"x86 Instruction Encoding": OSDev Wiki (wiki.osdev.org/X86-64_Instruction_Encoding) The most accessible explanation of how ModRM, SIB, REX, and displacement bytes work together to encode every addressing mode. Understanding the encoding is not required for using the instructions, but it explains *why* RSP cannot be an index re → Chapter 8 Further Reading: Data Movement and Addressing Modes
1. Syscall instruction name: x86-64: `syscall` (SYSCALL instruction) - ARM64: `svc #0` (Supervisor Call) - RISC-V: `ecall` (Environment Call) → Case Study 39-2: RISC-V Assembly — Hello World on RISC-V
12. Common NASM errors and their causes: "operation size not specified": add `QWORD`/`DWORD`/`WORD`/`BYTE` to ambiguous memory operands - "symbol is multiply defined": use `.local` labels instead of global ones in functions - "invalid combination of opcode and operands": memory-to-memory move doesn't exist; wrong operand types - "`times` c → Chapter 6 Key Takeaways: The NASM Assembler
12MB L2 cache per cluster: **Large on-chip caches**: 32MB "system level cache" (what Intel calls last-level cache) → Chapter 19: x86-64 vs. ARM64 Comparison
2. Syscall argument registers: x86-64: number in RAX, args in RDI, RSI, RDX, R10, R8, R9 - ARM64: number in X8, args in X0-X5 - RISC-V: number in A7 (x17), args in A0-A5 (x10-x15) → Case Study 39-2: RISC-V Assembly — Hello World on RISC-V
3. Syscall numbers: x86-64 `write` = 1; `exit` = 60 - ARM64 `write` = 64; `exit` = 93 - RISC-V `write` = 64; `exit` = 93 → Case Study 39-2: RISC-V Assembly — Hello World on RISC-V
4. PC-relative address loading: x86-64: `lea rsi, [rip + offset]`, one instruction (ModRM encoding handles it) - ARM64: `adr x1, label`, one instruction when within ±1MB; `adrp` + `add` for farther targets - RISC-V: `la a1, label`, a pseudoinstruction that assembles to `auipc` + `addi`, always two instructions → Case Study 39-2: RISC-V Assembly — Hello World on RISC-V
5. Load immediate: x86-64: `mov rax, 64` is 7 bytes in the sign-extended imm32 form (REX + opcode + ModRM + 4-byte immediate) - ARM64: `mov x8, #64` is 4 bytes (MOVZ encoding) - RISC-V: `li a7, 64` is a pseudo-instruction → `addi a7, x0, 64`, 4 bytes → Case Study 39-2: RISC-V Assembly — Hello World on RISC-V
8 accumulators: For ADDPD YMM (latency 4 cycles, throughput 2 per cycle): 4 × 2 = **8 accumulators** → Chapter 33: Performance Analysis and Optimization
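The idea behind counting accumulators can be sketched in C: splitting one running sum into several independent partial sums lets additions from different chains overlap instead of each waiting on the previous result. This is a minimal sketch, not the chapter's code; `NACC` and `sum_multi_acc` are hypothetical names, and the right value of `NACC` depends on the latency and throughput of the target CPU.

```c
#include <assert.h>
#include <stddef.h>

#define NACC 8  /* illustrative accumulator count; tune per microarchitecture */

/* Sums a[0..n) using NACC independent partial sums so that consecutive
   additions do not all depend on the same register. */
double sum_multi_acc(const double *a, size_t n) {
    assert(a != NULL || n == 0);
    double acc[NACC] = {0};
    size_t i = 0;

    /* Main loop: each acc[k] forms its own dependency chain. */
    for (; i + NACC <= n; i += NACC)
        for (size_t k = 0; k < NACC; k++)
            acc[k] += a[i + k];

    /* Combine the partial sums, then handle the leftover elements. */
    double total = 0.0;
    for (size_t k = 0; k < NACC; k++)
        total += acc[k];
    for (; i < n; i++)
        total += a[i];
    return total;
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a strictly sequential sum; that is the usual price of this transformation.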
`gets()` had no bounds checking: by design. C's philosophy of trusting the programmer meant no safety check was inserted. → Case Study 35-1: The Morris Worm's Buffer Overflow (1988) — The First Famous Exploit
`ld: cannot find crt1.o`: The C runtime startup files are missing. Run `sudo apt install libc6-dev`, or `sudo apt install gcc-multilib` if you are linking 32-bit code. → Lab Environment Setup: NASM, GDB, QEMU, and Cross-Compilation Tools
`nasm: command not found`: Installation did not complete. Re-run `sudo apt install nasm` and check for network errors. → Lab Environment Setup: NASM, GDB, QEMU, and Cross-Compilation Tools
`XCHG` with memory is always atomic: no LOCK prefix needed. This makes `XCHG [lock], al` a valid spinlock implementation without the explicit LOCK. → Chapter 30 Key Takeaways: Concurrency at the Hardware Level
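The same exchange-based lock can be written portably with C11 atomics. This is a hedged sketch: `xchg_lock` and the three helper names are hypothetical, and the claim that `atomic_exchange` becomes an `XCHG` with a memory operand on x86-64 is typical compiler behavior rather than a guarantee.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock type: 0 = free, 1 = held. */
typedef struct { atomic_uchar locked; } xchg_lock;

/* Atomically swap 1 into the lock byte; we own the lock only if
   the previous value was 0. */
static int xchg_trylock(xchg_lock *l) {
    return atomic_exchange(&l->locked, 1) == 0;
}

/* Spin until the exchange observes a free lock. */
static void xchg_acquire(xchg_lock *l) {
    while (!xchg_trylock(l)) {
        /* busy-wait */
    }
}

/* Release by storing 0; the lock must currently be held. */
static void xchg_release(xchg_lock *l) {
    assert(atomic_load(&l->locked) == 1);
    atomic_store(&l->locked, 0);
}
```

A production spinlock would usually add a pause hint and a read-only spin before retrying the exchange; both are left out here to keep the exchange itself visible.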

A

Abstraction cost: SCASB *looks* like the right tool but is slower due to microcode overhead. Specialized instructions are not always the fastest path. → Case Study 7.1: strlen() in x86-64 Assembly — Four Implementations
Accessible with some study: `fs/`: filesystem layer (C-heavy, but system calls you understand) - `net/`: networking (C-heavy but logical) - `mm/`: memory management (your page table knowledge helps here) → Case Study 40-2: From Assembly to Linux Kernel Contribution — One Path
Addresses were predictable: there was no ASLR. The stack was at the same address on every execution of `fingerd`. Morris could determine the approximate stack address from his own VAX and use that address in the exploit payload. → Case Study 35-1: The Morris Worm's Buffer Overflow (1988) — The First Famous Exploit
Adoption: Major Linux distributions enable CET in packages as of 2022-2024 - Many system libraries (libc, libssl) ship with `ENDBR64` markers in recent versions - Not all software is recompiled yet; CET provides partial protection in mixed environments → Case Study 36-2: Intel CET — The Hardware Solution to Memory Corruption
Agner Fog, "Instruction Tables": agner.org/optimize/instruction_tables.pdf Per-microarchitecture latency and throughput for all floating-point instructions: ADDSS, MULSD, SQRTSD, CVTSI2SD, FSIN, etc. The FSIN/FCOS timings (50-100 cycles) vs. SSE polynomial (10-20 cycles) comparison from the case study is documented here. → Chapter 14 Further Reading: Floating Point
Answer: A: sys_write is syscall number 1 (RAX=1). stderr is file descriptor 2 (RDI=2). Stdout is fd 1, stdin is fd 0. → Chapter 25 Quiz: System Calls
Answer: B: `syscall` uses RCX to save the return address (RIP), destroying whatever was there. R10 is used as the substitute. → Chapter 25 Quiz: System Calls
Answer: C: RAX holds the syscall number on entry, and the return value on exit. → Chapter 25 Quiz: System Calls
Answer: D: Vector 14 is #PF (Page Fault). The faulting virtual address is in CR2, and the error code is pushed on the stack. → Chapter 26 Quiz: Interrupts, Exceptions, and Kernel Mode
Apple Silicon: If you own an M1/M2/M3/M4 Mac, you are already on ARM64. `clang` on macOS compiles ARM64 natively. GDB is replaced by LLDB. The system calls differ from Linux. Chapter 18 covers the differences. → Part III: ARM64 Assembly
Argument conventions: x86-64: `syscall` instruction; number in RAX; args in RDI, RSI, RDX, R10, R8, R9; return in RAX - ARM64: `svc #0` instruction; number in X8; args in X0, X1, X2, X3, X4, X5; return in X0 - RISC-V: `ecall` instruction; number in a7; args in a0, a1, a2, a3, a4, a5; return in a0 → Appendix F: Linux System Call Tables
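From C, the "number plus up to six arguments, one return value" shape is visible through the libc `syscall()` wrapper, which places the values in the registers listed above for whichever architecture it is compiled for. A minimal sketch, assuming Linux with glibc; `raw_write` and `raw_getpid` are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* write(2) issued through the generic entry point: SYS_write is the
   syscall number, followed by the three arguments (fd, buffer, length). */
static long raw_write(int fd, const void *buf, unsigned long len) {
    assert(buf != NULL || len == 0);
    return syscall(SYS_write, fd, buf, len);
}

/* A syscall with no arguments: only the number is passed. */
static long raw_getpid(void) {
    return syscall(SYS_getpid);
}
```

The value of `SYS_write` differs per architecture (the header supplies the right one), which is why the same source builds on x86-64, ARM64, and RISC-V.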
ARM64 binary runs but produces wrong output: Likely a calling convention mismatch when mixing C and assembly. Verify that the function prologue and epilogue are correct and that the ABI register assignments match (x0–x7 for arguments, x0 for return value). → Lab Environment Setup: NASM, GDB, QEMU, and Cross-Compilation Tools
Attribution: You must give appropriate credit and indicate if changes were made - **ShareAlike** — If you remix or transform the material, you must distribute your contributions under the same license → Learning Assembly Language: What's Really Happening Inside the Machine
AWS Graviton3 (2022): 64-core ARM Neoverse V1 - 25% better performance/watt than Graviton2 - 3× better floating-point performance than Graviton2 - Used by AWS internally for ~50% of their own compute → Case Study 19-2: The Apple Silicon Revolution — A Technical Deep Dive
AWS Graviton4 (2024): 96-core ARM Neoverse V2 - "Best performance and energy efficiency in EC2" - C8g instances: 30% better price/performance than Intel equivalent → Case Study 19-2: The Apple Silicon Revolution — A Technical Deep Dive

B

Block encryption: Initial key whitening: `plaintext XOR round_key[0]` - 9 rounds of `AESENC` (SubBytes → ShiftRows → MixColumns → XOR round key) - Final round: `AESENCLAST` (SubBytes → ShiftRows → XOR round key, no MixColumns) → Case Study 15.2: AES-NI Encryption — Hardware-Accelerated AES in Assembly
Boots from a raw disk image: a 512-byte bootloader you write in assembly 2. **Transitions through CPU modes** — real mode → protected mode → long mode 3. **Initializes the hardware** — GDT, IDT, keyboard controller, timer 4. **Manages memory** — a page allocator, a simple heap 5. **Handles interrupts** — keyboard input, timer t → How to Use This Book
byte: 8 bits. The architecture processes data in four sizes: → Chapter 2: Numbers in the Machine

C

cache lines: aligned 64-byte blocks. When a single byte is accessed that is not in cache: → Chapter 32: The Memory Hierarchy
cache miss: one of the 8 ways is evicted (LRU policy) and the new line is loaded → Chapter 32: The Memory Hierarchy
CFI: Control Flow Integrity: Abadi et al., CCS 2005 (Microsoft Research) The original paper on Control Flow Integrity, the defense against jump table hijacking and return-oriented programming. Modern compilers implement CFI via `-fsanitize=cfi`. Relevant to the Chapter 35-37 security chapters. → Chapter 10 Further Reading: Control Flow
Chapter 10: Control Flow: JMP (short, near, indirect); conditional jumps (all variants) - Signed vs. unsigned comparisons: JL vs. JB — the critical distinction - Translating if/else, while, for, do-while, switch/case - CMOV (conditional move): branchless programming - Jump tables for switch/case - Loop optimization: LOOP ins → Learning Assembly Language — Detailed Content Outline
Chapter 11: The Stack and Function Calls: PUSH, POP mechanics; CALL pushes RIP, RET pops it - Stack frame: push rbp / mov rbp, rsp / sub rsp, N - System V AMD64 ABI: RDI, RSI, RDX, RCX, R8, R9; callee/caller-saved - Red zone: 128 bytes below RSP reserved for leaf functions - Stack alignment: 16-byte requirement before CALL - Recursive facto → Learning Assembly Language — Detailed Content Outline
Chapter 12: Arrays, Strings, and Data Structures: Array access with base+index×scale addressing modes - REP MOVSB/STOSB/CMPSB/SCASB: string operations - Implementing strlen, strcpy, memset, memcmp in assembly - Linked list traversal and manipulation - Struct field access: base+offset for each field - AoS vs. SoA data layouts and their performance i → Learning Assembly Language — Detailed Content Outline
Chapter 13: Bit Manipulation: Bitmasks: isolate (AND), set (OR), toggle (XOR), clear (AND NOT) - BT, BTS, BTR, BTC: bit test operations - BSF, BSR, LZCNT, TZCNT: bit scan and count - POPCNT: hardware popcount - BMI1/BMI2 instructions: ANDN, BEXTR, BLSI, BLSR, PDEP, PEXT - XOR tricks: swap, power-of-2 test, isolate lowest set bit → Learning Assembly Language — Detailed Content Outline
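Several of the tricks named in this outline entry have one-line C equivalents. The sketch below is illustrative only; the function names are hypothetical, and the comments naming BLSI and BLSR refer to the BMI1 instructions that compute the same results.

```c
#include <assert.h>
#include <stdint.h>

/* Isolate the lowest set bit: x & -x (what BLSI computes). */
static uint64_t lowest_set_bit(uint64_t x) {
    return x & (~x + 1);
}

/* Clear the lowest set bit: x & (x - 1) (what BLSR computes). */
static uint64_t clear_lowest_set_bit(uint64_t x) {
    return x & (x - 1);
}

/* A power of two has exactly one set bit, so clearing it leaves zero. */
static int is_power_of_two(uint64_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

/* XOR swap; the two pointers must not alias, or the value is zeroed. */
static void xor_swap(uint64_t *a, uint64_t *b) {
    assert(a != b);
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
```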
Chapter 14: Floating Point: x87 FPU (stack-based, legacy): FLD, FST, FADD, FSIN, FSQRT - SSE2 scalar floating point: MOVSS/MOVSD, ADDSS/ADDSD, CVTSI2SS - MXCSR register: exception masks, denormal performance trap - IEEE 754 comparison with UCOMISS/UCOMISD - Precision conversion: CVTSS2SD, CVTSD2SS - How GCC generates floating- → Learning Assembly Language — Detailed Content Outline
Chapter 15: SIMD Programming: XMM (128-bit SSE), YMM (256-bit AVX), ZMM (512-bit AVX-512) - SSE2: ADDPS (4 floats), packed integer operations - AVX/AVX2: VADDPS, VMULPS, VFMADD213PS (fused multiply-add) - Shuffle and permute: SHUFPS, PSHUFD, VPERMILPS - Alignment: MOVAPS vs. MOVUPS; performance implications - Vectorizing a loop: → Learning Assembly Language — Detailed Content Outline
Chapter 16: The ARM64 architecture itself: the 31-register file, the zero register, PSTATE flags, fixed-width encoding, and the load/store discipline that defines RISC programming. → Part III: ARM64 Assembly
Chapter 16: ARM64 Architecture: RISC vs. CISC philosophy; why ARM64 is not "simpler" - 31 general-purpose registers (X0–X30), SP, XZR, LR, FP - PSTATE flags: N, Z, C, V — set with S-suffix instructions - Fixed-width 4-byte instructions vs. x86-64's variable length - Load/store architecture: no memory operands in ALU instructions - → Learning Assembly Language — Detailed Content Outline
Chapter 17: The ARM64 instruction set: data processing, the barrel shifter, load/store addressing modes, branches, the AAPCS64 calling convention, and Linux system calls. → Part III: ARM64 Assembly
Chapter 17: ARM64 Instruction Set: ADD, SUB, AND, ORR, EOR with barrel shifter: `ADD X0, X1, X2, LSL #3` - LDR, STR, LDP, STP with all addressing modes (pre/post-indexed) - B, BL, BR, BLR, RET; conditional: B.EQ, CBZ, CBNZ, TBZ - AAPCS64: X0–X7 for args, X19–X28 callee-saved, LR/FP preserved - ARM64 Linux system calls: SVC #0, X8 = n → Learning Assembly Language — Detailed Content Outline
Chapter 18: ARM64 programming in practice: arrays, string operations without string instructions, floating-point with the NEON/FP register file, SIMD with NEON, and the differences between Linux ARM64 and Apple Silicon macOS. → Part III: ARM64 Assembly
Chapter 18: ARM64 Programming: Arrays: LSL shift in address calculation, LDP for pairs - memcpy/strlen/memset without REP string instructions - SIMD/FP registers: V0–V31, D0–D31, S0–S31 - ARM64 floating-point: FADD, FMUL, FCMP, FCVT - NEON SIMD: ADD Vd.4S, FMLA Vd.4S — vectorizing a loop - macOS (Apple Silicon) differences from L → Learning Assembly Language — Detailed Content Outline
Chapter 19: The great comparison: x86-64 vs. ARM64, side by side. Same programs, both ISAs. Code density, power, performance, and why the industry is betting on ARM64 to win the next decade. → Part III: ARM64 Assembly
Chapter 19: x86-64 vs. ARM64 Comparison: Code density, instruction count, encoding complexity comparison - Register file: 16 GPRs with aliasing vs. 31 + zero register - Calling conventions side-by-side - Performance characteristics: clock speed vs. power efficiency - The Apple Silicon transition and its industry implications - ARM in the d → Learning Assembly Language — Detailed Content Outline
Chapter 1: Why Assembly Language?: The compilation pipeline: C → preprocessor → compiler → assembler → linker → executable - Disassembling a C program to see the machine code beneath it - Seven reasons to learn assembly in 2026: security, OS, embedded, performance, compilers, CTF, curiosity - The MinOS kernel project preview: what yo → Learning Assembly Language — Detailed Content Outline
Chapter 20: The assembly-C interface itself: calling C functions (printf, malloc, fopen) from assembly; writing assembly functions callable from C; passing structs; the red zone; variadic functions. A complete working mixed project. → Part IV: The Assembly-C Interface
Chapter 21: Reading compiler output: how to use `gcc -S`, Compiler Explorer (godbolt.org), and AT&T vs. Intel syntax. What `-O0`, `-O1`, `-O2`, `-O3` do to your code. The patterns to recognize: function prologue, local variable layout, if-else, loops, switch tables, virtual dispatch. → Part IV: The Assembly-C Interface
Chapter 21: Understanding Compiler Output: AT&T syntax vs. Intel syntax conversion table - GCC -S output patterns: prologue, if/else, loops, switch, recursion - Optimization levels -O0 through -O3: what each does to the assembly - Compiler Explorer (godbolt.org) as a learning tool - Recognizing: strength reduction, inlining, constant folding → Learning Assembly Language — Detailed Content Outline
Chapter 22: Inline assembly: GCC extended syntax, output/input/clobber constraints, and when to use inline assembly (CPUID, RDTSC, atomics, I/O ports). When NOT to use it (compiler intrinsics are usually better). Common mistakes. → Part IV: The Assembly-C Interface
Chapter 22: Inline Assembly: GCC extended asm syntax: `asm("..." : outputs : inputs : clobbers)` - Constraint types: "r", "m", "i"; named operands %[name] - Practical examples: CPUID, RDTSC, CMPXCHG, port I/O, memory fences - The volatile qualifier; when to use it - Compiler intrinsics as the preferred alternative to inline asm → Learning Assembly Language — Detailed Content Outline
Chapter 23: Linking, loading, and ELF: how source becomes an executable; ELF sections and segments; the linker's job (symbol resolution + relocation); static vs. dynamic linking; the loader's job; linker scripts for bare-metal code (the MinOS connection). → Part IV: The Assembly-C Interface
Chapter 23: Linking, Loading, and ELF: Object files: sections, symbol table, relocations - The linker: symbol resolution, relocation patching - Static vs. dynamic linking; ldd for dependency inspection - ELF format: header, program header table (segments), section header table - The Linux ELF loader: initial program state, argv/envp/auxv → Learning Assembly Language — Detailed Content Outline
Chapter 24: Dynamic linking in depth: the PLT/GOT mechanism traced to machine code; lazy binding; RELRO; LD_PRELOAD for interposition; dlopen/dlsym for runtime loading; GOT overwrite security implications (preview of Chapter 36). → Part IV: The Assembly-C Interface
Chapter 24: Dynamic Linking in Depth: LD.so: the dynamic linker and its initialization sequence - PLT/GOT mechanism: lazy binding step by step in assembly - GOT structure: first three entries, resolver function - RELRO: partial (sections reordered) and full (GOT read-only) - LD_PRELOAD for interposition; malloc debugger example - dlopen → Learning Assembly Language — Detailed Content Outline
Chapter 25: System Calls: The syscall instruction: saves RIP to RCX, RFLAGS to R11 - Linux x86-64 convention: RAX=number, RDI/RSI/RDX/R10/R8/R9=args - Key syscalls with complete NASM examples: read, write, open, mmap, fork, exec, exit - Writing a minimal libc: wrappers around raw syscalls - strace: tracing system calls for d → Learning Assembly Language — Detailed Content Outline
Chapter 27: Memory Management: The MMU: virtual→physical translation, permission enforcement - x86-64 4-level page tables: PML4/PDP/PD/PT, 12-bit page offset - Page table entry bits: Present, R/W, U/S, NX, physical page number - TLB and INVLPG; context switch TLB flush - Page faults: error code, CR2 = faulting address, handler de → Learning Assembly Language — Detailed Content Outline
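The 4-level split mentioned in this outline entry amounts to slicing a virtual address into four 9-bit table indices and a 12-bit page offset. A small sketch of that arithmetic, assuming 4 KiB pages and 48-bit addresses; `va_parts` and `split_va` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Four 9-bit indices plus a 12-bit offset account for all 48 usable bits. */
static_assert(4 * 9 + 12 == 48, "index and offset widths must total 48 bits");

typedef struct {
    unsigned pml4;    /* bits 47..39: index into the PML4 */
    unsigned pdpt;    /* bits 38..30: index into the PDP table */
    unsigned pd;      /* bits 29..21: index into the page directory */
    unsigned pt;      /* bits 20..12: index into the page table */
    unsigned offset;  /* bits 11..0:  byte offset within the 4 KiB page */
} va_parts;

static va_parts split_va(uint64_t va) {
    va_parts p;
    p.offset = (unsigned)(va & 0xFFF);
    p.pt     = (unsigned)((va >> 12) & 0x1FF);
    p.pd     = (unsigned)((va >> 21) & 0x1FF);
    p.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    p.pml4   = (unsigned)((va >> 39) & 0x1FF);
    return p;
}
```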
Chapter 28: Bare Metal Programming: BIOS boot: CPU starts in real mode at 0xFFFF0, loads MBR to 0x7C00 - Real mode: 16-bit, segment:offset addressing, 1MB limit, BIOS interrupts - Protected mode: GDT, CR0.PE=1, far jump to flush prefetch - Long mode: PAE, minimal page tables, EFER.LME=1, CR0.PG=1 - Complete bootloader: prints boot mes → Learning Assembly Language — Detailed Content Outline
Chapter 29: Device I/O: Port-mapped I/O: IN/OUT instructions, x86-64 I/O address space - Memory-mapped I/O: devices at physical addresses, MOV instructions - Common ports: PS/2 (0x60/0x64), COM1 (0x3F8), PIC (0x20/0xA0), PIT (0x40) - PIT programming: 100Hz timer interrupt for MinOS scheduler - UART/Serial: baud rate, data → Learning Assembly Language — Detailed Content Outline
Chapter 2: Numbers in the Machine: Binary: bits, bytes, words, doublewords, quadwords - Hexadecimal as binary shorthand; hex↔binary conversion - Unsigned integers, overflow, and wraparound - Two's complement: representation, arithmetic, overflow vs. carry - The RFLAGS register: CF, OF, SF, ZF, PF, AF — when each is set - IEEE 754 flo → Learning Assembly Language — Detailed Content Outline
Chapter 30: Concurrency at the Hardware Level: x86-64 TSO memory model; store-load reordering - Memory fences: MFENCE, SFENCE, LFENCE - LOCK prefix: atomic read-modify-write - XCHG, CMPXCHG (atomic CAS), XADD - Spinlock implementation with CMPXCHG; correctness analysis - futex-based mutex: fast path without kernel syscall - ARM64 weakly-ordered → Learning Assembly Language — Detailed Content Outline
Chapter 31: The Modern CPU Pipeline: Real pipeline: frontend (fetch/decode/rename) → OoO core → retirement - Micro-operations: CISC → RISC-like µops, µop cache - Register renaming: eliminates WAW and WAR hazards - Execution units: ALU, load, store, FPU/SIMD, port assignments - Instruction latency vs. throughput; Agner Fog's tables - Br → Learning Assembly Language — Detailed Content Outline
Chapter 32: The Memory Hierarchy: The hierarchy: registers → L1(4cy) → L2(12cy) → L3(40cy) → DRAM(100+cy) - Cache organization: cache lines (64B), N-way set-associative, set/line structure - Cold/capacity/conflict misses - Cache-friendly patterns: sequential access, structure-of-arrays, cache line alignment - MESI protocol for multi → Learning Assembly Language — Detailed Content Outline
Chapter 33: Performance Analysis and Optimization: Profile first: perf stat, perf record, perf report, perf annotate - Hardware performance counters: cycles, instructions, cache-misses, branch-misses - RDTSC/RDTSCP for cycle-accurate measurement in assembly - Identifying bottlenecks: IPC interpretation, cache miss rate thresholds - Loop optimization → Learning Assembly Language — Detailed Content Outline
Chapter 34: Reverse Engineering: Tools: objdump, GDB, Ghidra, IDA Free, radare2, pwndbg - Recognizing compiler patterns: prologue/epilogue, loops, switch tables, virtual dispatch - Working without symbols: string cross-references, constant identification - Reconstructing data types and control flow from disassembly - GDB scripting → Learning Assembly Language — Detailed Content Outline
Chapter 35: Buffer Overflows and Memory Corruption: Stack buffer overflow: overwriting adjacent stack memory including return address - Shellcode: position-independent code for exploit payloads (educational) - NOP sleds and reliability before ASLR - Format string vulnerabilities: %x stack reads, %n memory writes - Heap corruption: use-after-free, dou → Learning Assembly Language — Detailed Content Outline
Chapter 36: Exploit Mitigations: Stack canaries: fs:0x28, prologue/epilogue assembly, GCC flags - NX/DEP: the NX bit in page table entries, hardware enforcement - ASLR: stack, heap, library, and executable randomization; entropy values - PIE: position-independent executable for full ASLR - RELRO: partial and full; preventing GOT ov → Learning Assembly Language — Detailed Content Outline
Chapter 37: Return-Oriented Programming: Why ROP: NX/DEP killed shellcode injection, ROP reuses existing code - Gadgets: instruction sequences ending in RET - Building a ROP chain: forged stack, gadget addresses, chained execution - Finding gadgets: ROPgadget, ropper tools - ret2libc, ret2plt: common ROP techniques - JOP, SROP (sigreturn-o → Learning Assembly Language — Detailed Content Outline
Chapter 38: Capstone — A Minimal OS Kernel: MinOS architecture: bootloader + kernel in assembly/C - Components integrated: VGA driver, keyboard handler, timer, page allocator, scheduler, shell - MinOS source structure: boot/, kernel/, drivers/, proc/, syscall/, shell/ - Three capstone tracks: A (minimal), B (with scheduler), C (with filesyste → Learning Assembly Language — Detailed Content Outline
Chapter 39: Beyond Assembly: Compilers: lexing, parsing, IR, optimization passes, code generation - Register allocation (graph coloring) and instruction selection - JIT compilation: generating x86-64 machine code at runtime - WebAssembly: stack machine portable ISA, sandboxing through types - RISC-V: the open ISA, modular exten → Learning Assembly Language — Detailed Content Outline
Chapter 3: The x86-64 Architecture: The 16 general-purpose registers and their 32/16/8-bit sub-registers - The critical aliasing rule: 32-bit writes zero upper 32 bits; 16-bit writes do not - RIP (instruction pointer), RFLAGS, segment registers (FS/GS for TLS) - XMM/YMM/ZMM registers (SSE/AVX/AVX-512) — brief introduction - The execut → Learning Assembly Language — Detailed Content Outline
Chapter 40: Your Assembly Future: A genuine inventory of what you now know - Career paths: OS development, security research, compiler engineering, embedded, HPC - Next projects: extend MinOS, write a compiler backend, CTF competitions - Communities: OSDev, /r/asm, CTF platforms, security conferences - Books to read next: CS:APP, OS → Learning Assembly Language — Detailed Content Outline
Chapter 4: Memory: The flat 64-bit virtual address space (48 bits usable) - Process memory layout: text, data, BSS, heap, stack, mapped libraries - Virtual vs. physical addresses; the MMU's role - Byte alignment requirements and SIMD alignment (16/32/64 bytes) - Little-endian byte ordering with examples - NASM data de → Learning Assembly Language — Detailed Content Outline
Chapter 5: Your Development Environment: Installing NASM, GCC, binutils, GDB, QEMU, Ghidra - Your first NASM program: hello world, assemble, link, run - The Makefile template for assembly projects - GDB for assembly: breakpoints, stepi, info registers, x/16xb, layout regs - objdump, readelf, nm — binary inspection tools - Linking assembly → Learning Assembly Language — Detailed Content Outline
Chapter 6: The NASM Assembler: NASM syntax: Intel syntax (destination first, no sigils, brackets for memory) - Sections: .text, .data, .bss, .rodata - Labels, global, extern, common directives - Data definition depth: db/dw/dd/dq, times, equ, $ and $$ - NASM preprocessor: %define, %assign, %macro/%endmacro, %if, %include - Useful → Learning Assembly Language — Detailed Content Outline
Chapter 7: Your First Assembly Programs: MOV in all its forms: register, immediate, memory load, memory store - ADD, SUB, INC, DEC, NEG — with complete register traces - XOR reg, reg — zeroing a register and why this is standard - System calls: RAX = number, RDI/RSI/RDX/R10/R8/R9 = args, RAX = return - Four complete programs: hello, exit, → Learning Assembly Language — Detailed Content Outline
Chapter 8: Data Movement and Addressing Modes: MOV forms; 32-bit write zero-extension behavior - Addressing modes: immediate, register, direct, indirect, base+offset, base+index×scale+disp - RIP-relative addressing for position-independent code - LEA: computing addresses without memory access; use as fast arithmetic - MOVZX (zero-extend) and MOV → Learning Assembly Language — Detailed Content Outline
Chapter 9: Arithmetic and Logic: ADD, SUB with all operand forms; flag effects - ADC, SBB: multi-precision arithmetic (128-bit addition example) - MUL/IMUL (one-, two-, three-operand forms); DIV/IDIV - AND, OR, XOR, NOT — bitwise operations - TEST (AND without store) and CMP (SUB without store) - SHL, SHR, SAR, ROL, ROR, SHLD, SHRD → Learning Assembly Language — Detailed Content Outline
CMOV hurts when:: The branch is highly predictable (>95% one way), so the processor's branch predictor handles it almost for free - The "not selected" computation is expensive or involves a slow load - CMOV creates a longer dependency chain → Chapter 10: Control Flow
CMOV wins when:: The branch is unpredictable (roughly 50/50 distribution) - The values being selected are already in registers (no load involved) - The computation fits the "compute both, select one" pattern → Chapter 10: Control Flow
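The "compute both, select one" pattern can be sketched in C. This is a minimal illustration, not the chapter's own listing: the function names `max_select` and `max_mask` are hypothetical, and whether a compiler actually emits CMOV for the ternary form depends on the compiler, flags, and target.

```c
#include <stdint.h>

/* Ternary form: with optimization enabled, x86-64 compilers often
   (but not always) lower this to CMP followed by a CMOVcc. */
int64_t max_select(int64_t a, int64_t b) {
    return (a > b) ? a : b;
}

/* Explicit mask form: both candidates are already computed, and a
   mask of all-ones or all-zeros selects one of them without a branch. */
int64_t max_mask(int64_t a, int64_t b) {
    int64_t mask = -(int64_t)(a > b);   /* -1 if a > b, else 0 */
    return (a & mask) | (b & ~mask);
}
```

Both functions return the same result for every input; comparing their output on Compiler Explorer at `-O2` is one way to see which form a given compiler turns into CMOV.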
Compiler Explorer (godbolt.org): Matt Godbolt's invaluable tool for exploring compiler output. Chapter 21 uses it extensively. → Acknowledgments
configuration space: 256 bytes of registers accessible via port I/O at ports 0xCF8 (address register) and 0xCFC (data register). → Chapter 29: Device I/O
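As a sketch of how the address register is typically used, the value written to port 0xCF8 under the legacy PCI configuration mechanism packs an enable bit with the bus, device, function, and register offset. The helper name `pci_config_address` and the macro names are hypothetical, and the actual `OUT`/`IN` port accesses (which need ring 0 or equivalent I/O privilege) are deliberately left out so the code stays runnable in user space.

```c
#include <stdint.h>

#define PCI_CONFIG_ADDRESS 0xCF8u  /* address register port */
#define PCI_CONFIG_DATA    0xCFCu  /* data register port */

/* Build the 32-bit value to write to PCI_CONFIG_ADDRESS.
   Layout: bit 31 enable, bits 23-16 bus, bits 15-11 device,
   bits 10-8 function, bits 7-2 dword-aligned register offset. */
uint32_t pci_config_address(uint8_t bus, uint8_t device,
                            uint8_t function, uint8_t offset) {
    return (1u << 31)
         | ((uint32_t)bus << 16)
         | ((uint32_t)(device & 0x1F) << 11)
         | ((uint32_t)(function & 0x07) << 8)
         | (uint32_t)(offset & 0xFC);
}
```

In a kernel, the result would be written to port 0xCF8 and the selected 32-bit register then read from or written to port 0xCFC.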
Correctness first: the naive loop is obviously correct; the AVX2 version requires careful thought about alignment, garbage bytes, and VZEROUPPER. → Case Study 7.1: strlen() in x86-64 Assembly — Four Implementations
CR3: the page table base register. CR3 holds the physical address of the PML4 table (aligned to 4KB). On a context switch, the OS writes a new value to CR3, and the new process's address space takes effect immediately. → Chapter 27: Memory Management
CTF community: particularly the pwn category — has pushed assembly and binary exploitation education further and faster than any academic setting. This book's security chapters are influenced by the quality of public CTF writeups. → Acknowledgments

D

Data layout: AoS (RGBA RGBA...) vs. SoA (RRR... GGG... BBB...). SoA is almost always better for SIMD; AoS requires shuffles to extract channels. 2. **Lane behavior** — AVX2 operates in two independent 128-bit halves for most byte/word instructions. Always check the manual: does your instruction cross lanes or no → Case Study 15.1: SIMD Image Processing — Grayscale Conversion
device I/O: how software talks to hardware. Two paradigms dominate: port-mapped I/O, using the `IN` and `OUT` instructions, and memory-mapped I/O, where device registers appear as ordinary memory addresses. You will program real devices: the PIT timer, the UART serial port, the PIC interrupt controller. These s → Part V: Systems Programming
Do NOT use inline assembly when:: A compiler intrinsic exists: `<immintrin.h>` for SSE/AVX, `<nmmintrin.h>` for SSE4.2 - `<stdatomic.h>` covers your atomic operation - `__builtin_clz`, `__builtin_popcount`, `__builtin_expect` exist for your case - You could write a separate `.asm` file and link it → Chapter 22: Inline Assembly

E

Executable: output of linker, directly executable 3. **Shared library** (`.so`) — position-independent code, loaded by dynamic linker → Chapter 23: Linking, Loading, and ELF
Exercise 2.1 (⭐):: `push rbp`: 1 byte (0x55) — special encoding for push register - `mov rbp, rsp`: 3 bytes (48 89 E5) — REX + opcode + ModRM - `sub rsp, 0x20`: 4 bytes (48 83 EC 20) — REX + opcode + ModRM + imm8 - `mov [rbp-8], rdi`: 4 bytes (48 89 7D F8) — REX + opcode + ModRM + disp8 - `ret`: 1 byte (C3) → Appendix B: Answers to Selected Exercises and Quiz Questions
Exercise 34.1: Syntax conversion Convert the following AT&T syntax instructions to Intel syntax: → Chapter 34 Exercises: Reverse Engineering
Exercise 34.10: Identify the construct For each disassembly snippet, identify whether it shows: (a) a function call, (b) a virtual method call, (c) a function pointer call through a struct, (d) a tail call, or (e) a leaf function with no frame. → Chapter 34 Exercises: Reverse Engineering
Exercise 34.11: Magic constant identification Identify the algorithm or context associated with each constant: → Chapter 34 Exercises: Reverse Engineering
Exercise 34.13: Crackme analysis Consider a program that calls the following validation function. Without running the program, determine what input produces a return value of 1: → Chapter 34 Exercises: Reverse Engineering
Exercise 34.14: Tool selection For each scenario, choose the most appropriate RE tool and justify your choice: → Chapter 34 Exercises: Reverse Engineering
Exercise 34.16: Ghidra workflow Describe the step-by-step process for using Ghidra to analyze a password-protected binary where you want to find the validation logic. Include: where to start, how to navigate, what to look for, and how to confirm your analysis. → Chapter 34 Exercises: Reverse Engineering
Exercise 34.17: Stripped binary navigation A stripped x86-64 ELF binary's entry point is at `0x401080`. The `_start` code calls `__libc_start_main`. Describe in detail how you would use GDB to find the address of `main()` in this binary without symbols. → Chapter 34 Exercises: Reverse Engineering
Exercise 34.18: Cross-architecture RE How does reverse engineering ARM64 binaries differ from x86-64? Specifically: a) What are the equivalent patterns for function prologue/epilogue? b) How does the calling convention affect register usage patterns? c) How do you identify a function's return value in ARM64? d) Wha → Chapter 34 Exercises: Reverse Engineering
Exercise 34.19: Obfuscation recognition Describe how you would identify and handle each of these obfuscation techniques encountered while reverse engineering: a) UPX-packed executable (the binary is compressed) b) Opaque predicates (always-taken or never-taken branches) c) String encryption (strings are decrypted a → Chapter 34 Exercises: Reverse Engineering
Exercise 34.2: objdump flags For each task, write the exact `objdump` command: a) Disassemble the binary `./mystery` using Intel syntax b) Show the symbol table of `./server` c) Show the dynamic symbols of `/usr/bin/curl` d) Show all sections and their addresses in `./program` e) Dump the contents of the `.rodata` → Chapter 34 Exercises: Reverse Engineering
Exercise 34.3: Compiler pattern identification Label each assembly snippet with the C construct it represents: → Chapter 34 Exercises: Reverse Engineering
Exercise 34.5: Jump table analysis The following disassembly includes a jump table. Identify: a) The bounds check instruction b) The jump table load c) How many cases exist d) Reconstruct the switch statement structure → Chapter 34 Exercises: Reverse Engineering
Exercise 34.6: String extraction Given this disassembly snippet from a program that prints a message: → Chapter 34 Exercises: Reverse Engineering
Exercise 34.8: GDB Python script Write a GDB Python script that: 1. Sets a breakpoint at `0x401196` (a hypothetical `strcmp` call site) 2. When the breakpoint is hit, prints both string arguments (RDI and RSI) 3. Does NOT stop execution — continues automatically 4. Logs all calls to a file named `strcmp_log.txt` → Chapter 34 Exercises: Reverse Engineering
Exercise 34.9: pwndbg workflow List five pwndbg commands (not standard GDB commands) that would be useful when analyzing a binary for security research, and describe what each shows. → Chapter 34 Exercises: Reverse Engineering
Exercise 35.1: Stack layout calculation For each function, calculate the offset from the start of the buffer to the return address: → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.10: Morris Worm analysis The Morris Worm (1988) exploited a buffer overflow in `fingerd` using `gets()`. → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.11: Historical progression Place these events in chronological order and explain the cause-effect relationship between each pair: → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.12: Vulnerability lifecycle For a hypothetical buffer overflow in a network service: → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.14: Compiler flags for security Compile a simple C program with each of the following flags and explain what protection each adds. Use `checksec` to verify the resulting binary has the expected properties: → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.15: AddressSanitizer usage Write a simple C program that has a buffer overflow (for testing purposes on your own machine), compile it with `-fsanitize=address`, and interpret the AddressSanitizer output report. What information does ASAN provide that GDB alone does not? → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.17: Format string primitives Without writing exploit code, explain theoretically: → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.18: Heap grooming concept Explain what "heap grooming" or "heap feng shui" means conceptually: → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.19: Cross-platform comparison How do buffer overflow mechanics differ on these platforms: → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.2: Dangerous function identification For each code snippet, identify whether it is vulnerable, why, and the fix: → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.3: Shellcode properties Answer the following about shellcode requirements: a) What does "position-independent" mean, and why is it required? b) Why must shellcode typically be free of null bytes? c) What instruction(s) are used for x86-64 system calls? d) What is the syscall number for `execve` on Linu → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.5: Format string analysis Given this vulnerable code: → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.7: Identify the vulnerability type For each description, identify the vulnerability: (a) stack buffer overflow, (b) heap buffer overflow, (c) use-after-free, (d) double-free, (e) format string vulnerability, (f) integer overflow leading to buffer overflow. → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 35.9: Security code review Review this function for all memory safety issues: → Chapter 35 Exercises: Buffer Overflows and Memory Corruption
Exercise 36.1: Mitigation identification Match each mitigation to its primary defense: → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.10: ENDBR64 identification Given this disassembly, determine whether CET IBT is enabled: → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.12: CET adoption and limitations Research (or reason about) these questions about CET's real-world deployment: → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.14: Makefile security flags Write a Makefile that compiles a C program with all recommended security flags. Include: - Canary (`-fstack-protector-strong`) - FORTIFY_SOURCE - PIE - Full RELRO - CET (if supported: `-fcf-protection=full`) - Helpful warnings (`-Wall -Wextra -Wformat-security`) → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.15: Security regression testing Describe a CI/CD pipeline check that ensures compiled binaries always have the required security features. What command would you run? What would cause it to fail? → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.17: TLS canary deep dive The stack canary lives at `fs:0x28` in the Thread Control Block. → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.18: GOT and PLT walkthrough For a dynamically linked binary with Partial RELRO: → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.19: CET microarchitecture Intel CET SHSTK is implemented partly in the CPU's microarchitecture: → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.2: checksec output reading Given this `checksec` output, answer the questions: → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.3: Canary assembly reading Identify the canary read and check in this epilogue: → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.5: PIE vs no-PIE Compile a simple "Hello World" program twice: once with `-no-pie` and once with `-pie -fPIE`. Run each 5 times and record the address of `main`. What do you observe? What does this mean for an attacker who knows the binary but not the load address? → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.7: Canary bypass requirements A server has a stack canary and is running with ASLR. The server also has a format string vulnerability in its logging function. → Chapter 36 Exercises: Exploit Mitigations
Exercise 36.8: ASLR entropy calculation a) On a 64-bit Linux system with ASLR fully enabled, the stack base has approximately 24 bits of randomness, aligned to 4096-byte page boundaries. How many possible stack base addresses exist? b) A 32-bit x86 Linux system has approximately 8 bits of stack randomness. How man → Chapter 36 Exercises: Exploit Mitigations
Exercise 37.1: Why ROP exists Answer these questions about the motivation for ROP: → Chapter 37 Exercises: Return-Oriented Programming and Modern Exploitation
Exercise 37.10: SROP conceptual Answer these questions about Sigreturn-Oriented Programming: → Chapter 37 Exercises: Return-Oriented Programming and Modern Exploitation
Exercise 37.12: Blind ROP concepts Describe the Blind ROP technique step by step, for the purpose of understanding why server security matters: → Chapter 37 Exercises: Return-Oriented Programming and Modern Exploitation
Exercise 37.14: Information leak analysis A server has NX + Canary + ASLR + PIE + Full RELRO. An attacker finds an out-of-bounds read vulnerability that can leak one 8-byte value at a specified stack offset. → Chapter 37 Exercises: Return-Oriented Programming and Modern Exploitation
Exercise 37.16: Turing completeness of ROP Shacham (2007) proved ROP is Turing complete. For each primitive needed for Turing completeness, identify an x86-64 gadget sequence that implements it: → Chapter 37 Exercises: Return-Oriented Programming and Modern Exploitation
Exercise 37.18: ret2plt deep understanding In a ret2plt chain, why is it important to call a function through the PLT rather than directly: → Chapter 37 Exercises: Return-Oriented Programming and Modern Exploitation
Exercise 37.19: SafeStack vs SHSTK Compare SafeStack (Clang software) and CET SHSTK (Intel hardware): → Chapter 37 Exercises: Return-Oriented Programming and Modern Exploitation
Exercise 37.2: Gadget identification For each instruction sequence, determine whether it is a useful ROP gadget and describe what it does: → Chapter 37 Exercises: Return-Oriented Programming and Modern Exploitation
Exercise 37.3: Unintended gadgets Given these bytes at address 0x401234: → Chapter 37 Exercises: Return-Oriented Programming and Modern Exploitation
Exercise 37.5: Required gadgets For each goal, list the gadgets you would need and the order they would appear in the chain: → Chapter 37 Exercises: Return-Oriented Programming and Modern Exploitation
Exercise 37.7: ROPgadget usage Write the ROPgadget commands for: → Chapter 37 Exercises: Return-Oriented Programming and Modern Exploitation
Exercise 37.8: Gadget analysis from ROPgadget output Given this ROPgadget output for a small binary: → Chapter 37 Exercises: Return-Oriented Programming and Modern Exploitation
Exercise 38.1: Boot sequence tracing Trace the execution path from power-on to the MinOS shell prompt. For each stage, identify: a) The CPU mode (real mode / protected mode / long mode) b) What code is executing (BIOS / bootloader / kernel) c) The approximate address range being executed from d) What the code does → Chapter 38 Exercises: Capstone — A Minimal OS Kernel
Exercise 38.11: User/kernel separation MinOS Track C adds user mode (ring 3) processes. → Chapter 38 Exercises: Capstone — A Minimal OS Kernel
Exercise 38.13: FAT12 filesystem MinOS Track C can optionally include a FAT12 filesystem reader. → Chapter 38 Exercises: Capstone — A Minimal OS Kernel
Exercise 38.15: ARM64 port conceptual Describe the major differences required to port MinOS to ARM64 running on QEMU `qemu-system-aarch64`: → Chapter 38 Exercises: Capstone — A Minimal OS Kernel
Exercise 38.17: MinOS extension: adding a `memtest` command Design and implement a `memtest` shell command: → Chapter 38 Exercises: Capstone — A Minimal OS Kernel
Exercise 38.19: SMP extension (research) The MinOS scheduler is single-processor. Describe what would be required to support two CPUs (SMP): → Chapter 38 Exercises: Capstone — A Minimal OS Kernel
Exercise 38.3: A20 line a) Why does the A20 address line need to be enabled? b) What address wraps without A20? c) Describe the three methods for enabling A20 (BIOS INT, keyboard controller, fast A20 via port 0x92) d) Which method does the MinOS bootloader use, and why is it the fastest? → Chapter 38 Exercises: Capstone — A Minimal OS Kernel
Exercise 38.5: VGA text mode a) The VGA text buffer is at physical address `0xB8000`. How many bytes does the 80×25 screen require? b) Write the C expression to compute the offset of character at column `x`, row `y` c) What are the attribute byte meanings for: white text on black, green text on blue, blinking red → Chapter 38 Exercises: Capstone — A Minimal OS Kernel
Exercise 38.7: Physical memory bitmap The physical memory bitmap allocator uses 1 bit per 4KB page. → Chapter 38 Exercises: Capstone — A Minimal OS Kernel
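The one-bit-per-page idea can be sketched as follows. This is an illustration under stated assumptions, not the MinOS allocator itself: the function names are hypothetical, and the bit order (page N lives in bit N%8 of byte N/8) is one common choice among several.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Bytes of bitmap needed to track mem_bytes of physical memory. */
size_t bitmap_bytes_for(uint64_t mem_bytes) {
    uint64_t pages = mem_bytes / PAGE_SIZE;
    return (size_t)((pages + 7) / 8);
}

/* Mark the page containing physical address phys as allocated. */
void page_mark_used(uint8_t *bitmap, uint64_t phys) {
    uint64_t page = phys / PAGE_SIZE;
    bitmap[page / 8] |= (uint8_t)(1u << (page % 8));
}

/* Mark the page containing physical address phys as free. */
void page_mark_free(uint8_t *bitmap, uint64_t phys) {
    uint64_t page = phys / PAGE_SIZE;
    bitmap[page / 8] &= (uint8_t)~(1u << (page % 8));
}

/* Return 1 if the page containing phys is allocated, else 0. */
int page_is_used(const uint8_t *bitmap, uint64_t phys) {
    uint64_t page = phys / PAGE_SIZE;
    return (bitmap[page / 8] >> (page % 8)) & 1;
}
```

With this layout, 128 MiB of physical memory is 32,768 pages, so the whole bitmap fits in 4,096 bytes, which is itself exactly one page.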
Exercise 38.9: Round-robin scheduling MinOS uses round-robin scheduling with a 10-tick timeslice (100ms at 100Hz). → Chapter 38 Exercises: Capstone — A Minimal OS Kernel
Exercise 39.1: IR identification Match each representation to its stage in the compiler pipeline: → Chapter 39 Exercises: Beyond Assembly
Exercise 39.10: WASM practical Write a C "Hello World" and compile it to WASM using Emscripten or wasi-sdk. Observe: → Chapter 39 Exercises: Beyond Assembly
Exercise 39.12: RISC-V hello world The RISC-V hello world in the chapter uses syscall 64 (write) and 93 (exit). Note that Linux RISC-V syscall numbers differ from x86-64. → Chapter 39 Exercises: Beyond Assembly
Exercise 39.14: RISC-V QEMU Set up a RISC-V development environment using QEMU: → Chapter 39 Exercises: Beyond Assembly
Exercise 39.15: SIMT vs. SIMD Compare NVIDIA GPU SIMT and x86-64 AVX-512 SIMD: → Chapter 39 Exercises: Beyond Assembly
Exercise 39.18: JIT security in a post-CET world Modern browsers run WebAssembly with JIT compilation on CET-enabled hardware. Describe: → Chapter 39 Exercises: Beyond Assembly
Exercise 39.3: Optimization recognition For each assembly sequence, identify which compiler optimization produced it: → Chapter 39 Exercises: Beyond Assembly
Exercise 39.6: JIT code generation Write C code (using inline arrays of bytes) that generates and executes each of these x86-64 functions: → Chapter 39 Exercises: Beyond Assembly
Exercise 39.8: WASM vs. register machine Compare WASM (stack machine) and x86-64 (register machine) for the expression `(a + b) * (c - d)`: → Chapter 39 Exercises: Beyond Assembly
Exercise 40.11: Interview preparation Assembly and systems knowledge appears in technical interviews for security, systems, and performance roles. For each question, write a confident 2-3 sentence answer: → Chapter 40 Exercises: Your Assembly Future
Exercise 40.12: Community choice Choose one community from the chapter and: → Chapter 40 Exercises: Your Assembly Future
Exercise 40.14: Conference talk selection Browse the recorded talks from one of these conferences: - DEF CON (defcon.org/media/video) - CCC (media.ccc.de) - USENIX Security (usenix.org/conferences/byname/108) → Chapter 40 Exercises: Your Assembly Future
Exercise 40.15: Contribution map For the Linux kernel: → Chapter 40 Exercises: Your Assembly Future
Exercise 40.17: Teaching plan The best way to consolidate knowledge is to teach it. Choose one topic from this book and plan a 20-minute explanation you could give to a peer: → Chapter 40 Exercises: Your Assembly Future
Exercise 40.19: Reverse engineering practice Find a small open-source compiled binary you use regularly (a command-line utility, a library function). Strip its debug symbols and practice reverse engineering it: → Chapter 40 Exercises: Your Assembly Future
Exercise 40.2: What you can now read For each of the following, predict what it does, then verify by looking it up: → Chapter 40 Exercises: Your Assembly Future
Exercise 40.3: Career path alignment For each career path, identify which chapters of this book most directly apply: → Chapter 40 Exercises: Your Assembly Future
Exercise 40.5: Timeline planning Choose your top project from Exercise 40.4. Create a realistic weekly plan: → Chapter 40 Exercises: Your Assembly Future
Exercise 40.7: ABI quiz (open book) Without looking at your notes, write from memory: → Chapter 40 Exercises: Your Assembly Future
Exercise 40.9: Explain to a beginner Write a clear explanation (3-4 sentences each) suitable for someone who knows C but has never written assembly: → Chapter 40 Exercises: Your Assembly Future

F

File 1: greeting.asm: Define a global string `greeting_msg` and its length `greeting_len` - Declare a global function `print_greeting` that prints the greeting using sys_write - No other global symbols → Chapter 6 Exercises: The NASM Assembler
File 2: math.asm: Define a global function `multiply_by_two(rdi)` that returns 2*rdi in rax - Define a global function `add_numbers(rdi, rsi)` that returns rdi+rsi in rax → Chapter 6 Exercises: The NASM Assembler
File 3: main.asm: Declare extern for all symbols from greeting.asm and math.asm - Implement `_start` which: 1. Calls `print_greeting` 2. Calls `multiply_by_two(21)` and stores the result 3. Calls `add_numbers(result, 0)` and exits with the result as exit code → Chapter 6 Exercises: The NASM Assembler
First Edition — 2026: *A Free, Open-Source Textbook* → Learning Assembly Language
FMLA is the key instruction: it fuses multiply and accumulate into one operation, matching the mathematical structure of FIR filters exactly 2. **The reduction step** (FADDP + FADDP) is a fixed cost amortized over the length of the vectors — for longer arrays, this overhead is negligible 3. **Two-accumulator unrolling** (the 8- → Case Study 18-1: NEON SIMD — Vectorizing a Dot Product for Audio Processing
Follow-up questions:: Can you give a real example of a program where hand-written assembly is still used today? (Hint: look at the Linux kernel or cryptography libraries.) - What would you do if a program was running slowly and profiling showed that 80% of time was in one function? How would assembly knowledge help? → Chapter 1: Why Assembly Language? — Discussion Questions

G

gate descriptor: a 16-byte structure that tells the CPU: → Chapter 26: Interrupts, Exceptions, and Kernel Mode
GDB: the GNU Debugger. Used in every lab in the book. `layout regs` mode has saved more debugging sessions than we can count. → Acknowledgments
Ghidra: the NSA's open-source reverse engineering framework. Its existence as a free, professional-quality tool has fundamentally changed who can learn reverse engineering. → Acknowledgments
Ghidra fails to launch: Verify Java 17+ is installed (`java -version`). Ghidra requires Java 17 or later; it will not launch with earlier versions. Run `update-alternatives --config java` to select the correct Java version if multiple are installed. → Lab Environment Setup: NASM, GDB, QEMU, and Cross-Compilation Tools
Glossary: 150+ key terms with precise definitions - **Answers to Selected Exercises** — solutions to all ⭐-marked exercises - **Bibliography** — 60+ references organized by category - **Appendix A** — x86-64 instruction quick reference with flags and latency - **Appendix B** — ARM64 instruction quick referenc → Learning Assembly Language — Detailed Content Outline
Godbolt Compiler Explorer: godbolt.org The indispensable tool for understanding what a compiler does with C code. Enter a C function, select GCC x86-64 with `-O2`, and see the assembly output immediately. Essential for verifying the LEA patterns described in this chapter. → Chapter 8 Further Reading: Data Movement and Addressing Modes
Google Axion (2024): 192-core ARM, used in GCP - 50% better performance than x86-64 equivalents for some workloads - Used for Google Search indexing and YouTube transcoding → Case Study 19-2: The Apple Silicon Revolution — A Technical Deep Dive

H

Hardware availability:: Intel: Tiger Lake (2020), Ice Lake, and all subsequent desktop/server processors - AMD: Not yet implemented in mainstream products (as of this writing; announced for future processors) - ARM: Pointer Authentication (PAC) and Branch Target Identification (BTI) serve similar roles on ARM64 → Case Study 36-2: Intel CET — The Hardware Solution to Memory Corruption
Hardware knowledge required: knowing that SCASB is microcoded, that YMM registers can process 32 bytes, and that `bsf` finds the first set bit in a single instruction. All of this comes from reading CPU documentation, not from the C specification. → Case Study 7.1: strlen() in x86-64 Assembly — Four Implementations
heap exploitation: a topic returned to in Part VII. → Chapter 27: Memory Management

I

identical latency: CLFLUSH evicts the entire 64-byte line, and loading any byte in that line causes the entire line to be fetched from DRAM. This is the cache line granularity in action. → Case Study 22-1: Measuring Cache Effects with RDTSC
implicit LOCK prefix: it is always atomic, regardless of whether you write LOCK explicitly. This makes it useful as a mutex acquire: → Chapter 8: Data Movement and Addressing Modes
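A minimal sketch of the mutex-acquire idea in C, assuming the instruction in question is XCHG with a memory operand. The type name `xchg_lock` and the function names are hypothetical. On x86-64, GCC and Clang typically compile `atomic_exchange` to an `xchg` against memory, but that is a property of the compiler's code generation, not something the C standard promises.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
} xchg_lock;

/* Swap 1 into the lock word; the old value says whether we won.
   Returns 1 if the lock was acquired, 0 if someone else holds it. */
int try_acquire(xchg_lock *l) {
    return atomic_exchange(&l->locked, 1) == 0;
}

/* Spin until the exchange returns 0 (the lock was free). */
void acquire(xchg_lock *l) {
    while (atomic_exchange(&l->locked, 1) != 0) {
        /* busy-wait; a real spinlock would add a PAUSE hint here */
    }
}

/* Release by storing 0 back into the lock word. */
void release(xchg_lock *l) {
    atomic_store(&l->locked, 0);
}
```

The exchange both tests and sets the lock word in one atomic step, which is why no separate compare is needed on the acquire path.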
Instruction count per byte: ~4 instructions. → Case Study 7.1: strlen() in x86-64 Assembly — Four Implementations
Instructions: mnemonic plus operands, emit machine code bytes 2. **Directives** — control the assembler's behavior, do not emit code 3. **Preprocessor directives** — begin with `%`, processed before assembly → Chapter 6: The NASM Assembler

K

Key emphases:: AT&T syntax is not wrong, just different. Security tools (GDB default), objdump, and much online documentation use it. - Learning to read compiler output is a permanent skill — it improves debugging, performance work, and security analysis throughout a career. - Compiler Explorer is a legitimate pro → Chapter 21: Understanding Compiler Output — Instructor Notes
Key relationships:: `buffer[0]` is at `[rbp-64]` = `0x7FFFDFF8` - `buffer[63]` (last byte) is at `[rbp-1]` = `0x7FFFE037` - `bytes_read` is at `[rbp-8]` = `0x7FFFE030` - Saved RBP is at `[rbp+0]` = `0x7FFFE038` - Return address is at `[rbp+8]` = `0x7FFFE040` → Case Study 4.2: Buffer Overflow Preview — Why Stack Layout Matters for Security

L

linker-defined symbols: their addresses are the start and end of the `.bss` section. They are visible to C code as `extern char _bss_start[]`. → Case Study 23-2: Writing a MinOS Kernel Linker Script and Boot Sequence

M

MAP_PRIVATE: changes are private to this process (copy-on-write). Writes do not propagate to the file. → Chapter 27: Memory Management
MAP_SHARED: changes are visible to all processes mapping the same file, and eventually written back to disk. → Chapter 27: Memory Management
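The difference between the two flags can be observed directly. The sketch below assumes a Linux system with a writable `/tmp`; the function name `first_byte_after_mapped_write` and the temporary file name are hypothetical. It writes the byte `'a'` to a scratch file, maps it with the given flag, stores `'X'` through the mapping, and then reads the file back with an ordinary `read`.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns the file's first byte after a store through the mapping,
   or -1 on any error. map_flags is MAP_PRIVATE or MAP_SHARED. */
int first_byte_after_mapped_write(int map_flags) {
    char path[64];
    snprintf(path, sizeof path, "/tmp/map_demo_%ld", (long)getpid());

    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) return -1;
    unlink(path);                      /* file disappears when fd closes */

    int result = -1;
    if (write(fd, "a", 1) == 1) {
        char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, map_flags, fd, 0);
        if (p != MAP_FAILED) {
            p[0] = 'X';                /* store through the mapping */
            msync(p, 1, MS_SYNC);      /* only has an effect for MAP_SHARED */
            munmap(p, 1);

            char c;
            if (lseek(fd, 0, SEEK_SET) == 0 && read(fd, &c, 1) == 1)
                result = (unsigned char)c;
        }
    }
    close(fd);
    return result;
}
```

Called with `MAP_PRIVATE`, the function should report that the file still holds `'a'`, because the store went to a private copy-on-write page. Called with `MAP_SHARED`, it should report `'X'`.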
Microsoft Cobalt 100 (2024): 128-core ARM, used in Azure - Based on ARM Neoverse N2 - Powers Azure's own services (Teams, Copilot, etc.) → Case Study 19-2: The Apple Silicon Revolution — A Technical Deep Dive
MinOS: a minimal x86-64 operating system kernel that boots under QEMU. By the end of the book, MinOS will: → Chapter 1: Why Assembly Language?
MinOS is a real, bootable OS: not a simulation or toy. It runs on QEMU with real emulated hardware, handles real interrupts, manages real memory pages, and runs real processes. Every instruction is understood because you wrote it. → Chapter 38 Key Takeaways: Capstone — A Minimal OS Kernel
misprediction: the pipeline flushes all speculative work and restarts at the correct address: a 15–20 cycle penalty. → Chapter 31: The Modern CPU Pipeline
Module 1: Foundation (20 hours): Chapter 1: Why Assembly Language? — read fully; the security angle is central - Chapter 2: Numbers in the Machine — read fully; two's complement and hex are used constantly in RE - Chapter 3: x86-64 Architecture — study carefully; you must know all registers - Chapter 4: Memory — study carefully; th → Self-Paced Learning Guide — Learning Assembly Language
Module 2: x86-64 Core (30 hours): Chapter 8: Data Movement and Addressing Modes — study carefully; addressing modes appear in every disassembly - Chapter 9: Arithmetic and Logic — read; focus on flag-setting behavior, not instruction catalog - Chapter 10: Control Flow — study carefully; every loop and conditional in disassembly - Ch → Self-Paced Learning Guide — Learning Assembly Language
Module 4: Security (30 hours): Chapter 34: Reverse Engineering — study fully; set up Ghidra and complete all exercises - Chapter 35: Buffer Overflows — study fully; implement the exploits in a safe environment - Chapter 36: Defenses — study fully; understanding ASLR, NX, and stack canaries is as important as understanding the att → Self-Paced Learning Guide — Learning Assembly Language

N

NASM on Linux: the assembler the industry actually uses, on the platform where systems programming is taught - **Two architectures** — x86-64 as the primary (your laptop, CTF challenges, compiler output) and ARM64 as the essential secondary (phones, Macs, Raspberry Pi, and increasingly servers) - **Security conten → Preface

O

optional for execution: it is only needed by tools like `readelf` and debuggers. The smallest valid runnable ELF omits it entirely. → Case Study 23-1: Building a Minimal ELF Executable by Hand
OR: GCC might recognize 37 = 32 + 4 + 1 = (1<<5) + (1<<2) + 1: → Case Study 21-2: Compiler Explorer Workshop — Five C Functions, Five Architectures
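The decomposition can be checked in C. This is a sketch of the arithmetic identity only, with a hypothetical function name; it does not claim to reproduce what any particular GCC version emits for a multiply by 37.

```c
#include <stdint.h>

/* x * 37 rewritten as shifts and adds:
   37 = 32 + 4 + 1 = (1 << 5) + (1 << 2) + 1 */
uint64_t times37_shift_add(uint64_t x) {
    return (x << 5) + (x << 2) + x;
}
```

Because unsigned arithmetic wraps modulo 2^64, the shift-and-add form matches `x * 37` for every input, including values large enough to overflow.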

P

page tables: data structures maintained by the OS kernel in physical memory — to perform this translation. The OS controls what mappings exist by modifying the page tables; the hardware enforces those mappings on every access. → Chapter 27: Memory Management
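To make the translation concrete, the sketch below splits a 48-bit virtual address into the four table indices and the page offset used by 4-level x86-64 paging with 4KB pages. The struct and function names are hypothetical; the bit positions (9 bits per level, 12-bit offset) are the standard 4-level layout.

```c
#include <stdint.h>

typedef struct {
    unsigned pml4;    /* bits 47-39: index into the PML4 table */
    unsigned pdpt;    /* bits 38-30: index into the PDPT */
    unsigned pd;      /* bits 29-21: index into the page directory */
    unsigned pt;      /* bits 20-12: index into the page table */
    unsigned offset;  /* bits 11-0:  byte offset within the 4KB page */
} va_parts;

/* Decompose a virtual address the way the MMU does during a walk. */
va_parts split_va(uint64_t va) {
    va_parts p;
    p.pml4   = (unsigned)((va >> 39) & 0x1FF);
    p.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    p.pd     = (unsigned)((va >> 21) & 0x1FF);
    p.pt     = (unsigned)((va >> 12) & 0x1FF);
    p.offset = (unsigned)(va & 0xFFF);
    return p;
}
```

For example, the address `0x401abc` decomposes into PML4 index 0, PDPT index 0, page directory index 2, page table index 1, and offset `0xabc`.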
PC-relative address: the address of the label `msg` is computed as (current PC + offset) at runtime. → Case Study 16-1: Setting Up ARM64 Development — Raspberry Pi and QEMU
Phase 1: Information leak: use a gadget to call `puts(printf)` or similar, printing a known libc address (a PLT stub address that we called). From this, calculate: `libc_base = leaked_address - known_offset_of_puts_in_libc` 2. **Phase 2: Calculate real addresses** — now that libc base is known, calculate the real addresses of → Chapter 37: Return-Oriented Programming and Modern Exploitation
Prevention:: Use memory-safe languages for code handling untrusted input - Enable AddressSanitizer (`-fsanitize=address`) during testing — it detects UAF and heap overflows - Use `valgrind` during development - Enable glibc heap hardening (tcache security, `MALLOC_CHECK_`) - Use safe allocators with added integr → Chapter 35: Buffer Overflows and Memory Corruption
Primary audience:: Computer science students (sophomore/junior level) who have written C programs and want to understand what the compiler is actually doing - Security researchers and CTF players who need to read disassembly fluently - Embedded engineers moving from microcontrollers to Linux-based ARM platforms → Learning Assembly Language: What's Really Happening Inside the Machine
pwndbg not loading: Verify that `~/.gdbinit` contains the pwndbg source line added by the install script. If it does not, add manually: `source ~/pwndbg/gdbinit.py`. → Lab Environment Setup: NASM, GDB, QEMU, and Cross-Compilation Tools

Q

QEMU: the open-source machine emulator and virtualizer. Without QEMU, the bare-metal and OS chapters would require physical hardware. QEMU makes them accessible to everyone. → Acknowledgments
QEMU fails to start or exits immediately: Check that the disk image exists and is a raw format binary (not ISO). Run `file minOS.img` to verify. For KVM-related errors, check that hardware virtualization is enabled in your BIOS/UEFI settings. → Lab Environment Setup: NASM, GDB, QEMU, and Cross-Compilation Tools
Quick summary:: All assembly examples must assemble cleanly with NASM 2.16+ and run correctly on x86-64 Linux - ARM64 examples must assemble with GAS (GNU Assembler) on AArch64 Linux or Apple Silicon macOS - New content should match the practitioner tone — precise, direct, no fluff - Open an issue before starting l → Learning Assembly Language: What's Really Happening Inside the Machine

R

Raspberry Pi 4/5: Around $50-100, runs 64-bit Linux natively. Real hardware, real ARM64, boots off an SD card. Ideal for embedded development. → Part III: ARM64 Assembly
README.md: This file. Course mapping, guide structure, grading philosophy. - **lab-setup-qemu-gdb.md** — Complete environment setup for Linux, WSL2, and macOS. Required reading before the first lab session. - **syllabus-one-semester.md** — 15-week schedule (2 lectures + 1 lab/week). Suitable for a single semes → Instructor Guide — Learning Assembly Language: What's Really Happening Inside the Machine
Real-world applications to mention:: Chrome V8 JavaScript engine: hand-written assembly in hot paths - Linux kernel: hundreds of `.S` files for architecture-specific setup, interrupt handling, context switching - OpenSSL / BoringSSL: hand-written assembly for cryptographic primitives (AES-NI, SHA extensions) - Malware analysis: reverse → Chapter 1: Why Assembly Language? What You See When You Look Below C — Instructor Notes
Real-world applications:: Security researchers use GDB and objdump (both AT&T default) constantly - Performance engineers use Compiler Explorer to test optimization hypotheses - Embedded engineers read compiler output to verify that expensive patterns (division, floating point) were not silently introduced - Understanding co → Chapter 21: Understanding Compiler Output — Instructor Notes
relocations: placeholders that the linker will fill in with the actual addresses. → Case Study 1.1: Hello World — From C to Binary
Return from function: `ret` behaves like a hypothetical `pop rip`: it reads the return address at `[rsp]`, adds 8 to `rsp`, and jumps to that address 4. **Dynamic dispatch** in interpreters → Chapter 10: Control Flow

S

Secondary audience:: Hobbyists building emulators, kernels, or other low-level projects - Experienced programmers who learned high-level languages first and want to fill in the foundation → Learning Assembly Language: What's Really Happening Inside the Machine
Segfault on first program with no obvious cause: Almost always a missing or misplaced `syscall` instruction, or an attempt to run a 32-bit binary on a 64-bit system. Check that NASM is invoked with `-f elf64`. → Lab Environment Setup: NASM, GDB, QEMU, and Cross-Compilation Tools
selectors: indices into the GDT, not raw addresses. Each GDT entry (8 bytes) describes a segment: its base address, size limit, and access rights. → Chapter 28: Bare Metal Programming
Share: copy and redistribute the material in any medium or format - **Adapt** — remix, transform, and build upon the material for any purpose, including commercially → Learning Assembly Language: What's Really Happening Inside the Machine
sign extend: fill the upper bits with copies of the sign bit. → Chapter 2: Numbers in the Machine
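
As a rough illustration of the definition, the sketch below sign-extends the low `bits` bits of a value to 64 bits. The helper name `sign_extend` and the XOR-and-subtract formulation are assumptions made for this example; in assembly the same job is normally done by a single `MOVSX`/`MOVSXD` (x86-64) or `SXTB`/`SXTH`/`SXTW` (ARM64) instruction.

```c
#include <stdint.h>

/* Hypothetical helper: sign-extend the low `bits` bits of v to 64 bits.
 * Assumes 1 <= bits <= 64. */
int64_t sign_extend(uint64_t v, unsigned bits)
{
    uint64_t sign = 1ULL << (bits - 1);                       /* position of the sign bit */
    uint64_t mask = (bits == 64) ? ~0ULL : ((1ULL << bits) - 1);
    v &= mask;                                                /* discard anything above the field */
    /* If the sign bit is set, the subtraction borrows through every upper
     * bit, filling them with ones; otherwise the upper bits stay zero. */
    return (int64_t)((v ^ sign) - sign);
}
```

For example, `sign_extend(0xFF, 8)` yields -1 and `sign_extend(0x7F, 8)` yields 127.
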
Software stack:: Linux kernel: user-space SHSTK support in kernel 6.6+; kernel-mode IBT in 5.18+ - glibc: CET-aware since 2.28 (required for setjmp/longjmp compatibility) - GCC: `-fcf-protection=full` enables IBT+SHSTK code generation (since GCC 8) - Clang: `-fcf-protection=full` sim → Case Study 36-2: Intel CET — The Hardware Solution to Memory Corruption
store-load reordering: the only type of reordering that x86-64 explicitly permits. → Chapter 30: Concurrency at the Hardware Level
succeeded: We use "=a" to capture this for the failure case → Case Study 22-2: Atomic Operations Without Libraries — Compare-and-Swap from Scratch

T

tagged pointer: combine the pointer with a version counter in one 128-bit value, use `CMPXCHG16B`. The version counter increments on every change, making ABA impossible (you would need the pointer AND the counter to match). → Chapter 30: Concurrency at the Hardware Level
The key instructions:: `vpxor ymm0, ymm0, ymm0`: zero all 32 bytes of YMM0 (our comparison target) - `vmovdqu ymm1, [rdi]`: load 32 unaligned bytes from memory into YMM1 - `vpcmpeqb ymm2, ymm1, ymm0`: compare each of the 32 bytes against zero; each result byte is 0xFF (match) or 0x00 (no match) - `vpmovmskb ecx, ymm2`: ex → Case Study 7.1: strlen() in x86-64 Assembly — Four Implementations
The setup:: A victim program has an array `array` and a size check: `if (x < array_size) { ... }` - The victim also has a `probe_array[256 * 64]` (256 cache-line-spaced entries) - An attacker can call the victim's function with attacker-controlled `x` → Case Study 31-2: Spectre — How Speculative Execution Becomes a Security Vulnerability
The stack was executable: in 1988, there was no NX/DEP concept. Stack memory was writable AND executable. Shellcode written into the stack could run when the return address pointed to it. → Case Study 35-1: The Morris Worm's Buffer Overflow (1988) — The First Famous Exploit
Three capstone tracks:: **Track A** (minimal): bootloader + VGA text + keyboard + simple shell - **Track B** (standard): Track A + preemptive scheduler + two processes - **Track C** (extended): Track B + a simple filesystem + loadable user programs → How to Use This Book
Throughput thinking: the shift from "instructions per byte" to "bytes per instruction" (or per clock) is the key mental model shift. The goal is to process more bytes with each instruction, not to shave instructions off a byte-at-a-time loop. → Case Study 7.1: strlen() in x86-64 Assembly — Four Implementations
Tile size calculation:: L2 = 1.25 MB = 1,310,720 bytes - Three tiles needed simultaneously (A tile, B tile, C tile accumulator) - Tile size T×T float32 = T² × 4 bytes per tile - 3 × T² × 4 ≤ 1,310,720 → T² ≤ 109,226 → T ≤ 330 - Choose T = 256 (power of 2, conservative): 3 × 256² × 4 = 786,432 bytes = 768 KB (fits in L2) → Case Study 32-1: Matrix Multiplication — Three Implementations, Three Cache Regimes
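
The arithmetic in this entry can be reproduced with a few lines of C. This is a sketch under the entry's own assumptions (a 1,310,720-byte L2, three resident tiles, 4-byte floats); the function names `max_tile_dim` and `floor_pow2` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: largest T such that n_tiles tiles of T x T elements
 * (elem_bytes each) fit in cache_bytes. Returns 0 for degenerate inputs. */
size_t max_tile_dim(size_t cache_bytes, size_t n_tiles, size_t elem_bytes)
{
    if (n_tiles == 0 || elem_bytes == 0)
        return 0;
    size_t t = 0;
    /* Grow T while the next size up still fits. */
    while (n_tiles * (t + 1) * (t + 1) * elem_bytes <= cache_bytes)
        t++;
    return t;
}

/* Hypothetical helper: round n down to a power of two (0 stays 0). */
size_t floor_pow2(size_t n)
{
    if (n == 0)
        return 0;
    size_t p = 1;
    while (p * 2 <= n)
        p *= 2;
    return p;
}
```

With the figures above, `max_tile_dim(1310720, 3, 4)` gives 330 and `floor_pow2(330)` gives 256, matching the T = 256 choice and its 786,432-byte footprint.
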
Trace for value = 0xABCDEF, start = 8, len = 8:: mask = (1 << 8) - 1 = 0xFF - value >> 8 = 0xABCD (low byte 0xEF shifted out) - 0xABCD & 0xFF = 0xCD - Result: 0xCD (the byte at bits 15:8) ✓ → Case Study 13.1: Bit Manipulation Puzzles — Classic Problems in Assembly
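
The shift-then-mask steps in this trace correspond to the following C sketch. The name `extract_bits` and the handling of `len >= 64` are assumptions added for the example; the case study's assembly may differ in its edge cases.

```c
#include <stdint.h>

/* Hypothetical helper: return `len` bits of `value` starting at bit `start`.
 * Assumes start < 64. */
uint64_t extract_bits(uint64_t value, unsigned start, unsigned len)
{
    /* (1 << len) - 1 is a run of `len` one bits; a 64-bit shift by 64
     * is undefined in C, so that case is handled separately. */
    uint64_t mask = (len >= 64) ? ~0ULL : ((1ULL << len) - 1);
    return (value >> start) & mask;   /* move the field to bit 0, then isolate it */
}
```

Calling `extract_bits(0xABCDEF, 8, 8)` reproduces the traced result 0xCD.
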
Trace for x = 0b10110001:: x >> 1 = 0b01011000 - XOR = 0b11101001 (1s at positions where adjacent bits differ) - popcount(0b11101001) = 5 transitions ✓ (transitions: 1→0 at pos 7, 0→1 at pos 5, 1→1 at pos 4=same, 1→0 at pos 3, 0→0 at pos 2=same, 0→0 at pos 1=same, 0→1 at pos 0) Wait: 0b10110001: positions 7,6,5,4,3,2,1,0 = 1, → Case Study 13.1: Bit Manipulation Puzzles — Classic Problems in Assembly
Trace for x = 6 (0b0110):: x-1 = 5 = 0b0101 - x & (x-1) = 0b0110 & 0b0101 = 0b0100 ≠ 0 - ZF = 0, result = 0 (not power of 2) ✓ → Case Study 13.1: Bit Manipulation Puzzles — Classic Problems in Assembly
Trace for x = 6:: x & (x-1) = 6 & 5 = 4 ≠ 0: not already a power of 2 - BSR: highest bit of 6 (0b110) is bit 2, so RCX = 2 - 2 << 2 = 8: next power of 2 is 8 ✓ → Case Study 13.1: Bit Manipulation Puzzles — Classic Problems in Assembly
Trace for x = 8 (0b1000):: x-1 = 7 = 0b0111 - x & (x-1) = 0b1000 & 0b0111 = 0 - ZF = 1, result = 1 (is power of 2) ✓ → Case Study 13.1: Bit Manipulation Puzzles — Classic Problems in Assembly
Trace for x = 8:: x & (x-1) = 8 & 7 = 0: already a power of 2 - Returns 8 ✓ → Case Study 13.1: Bit Manipulation Puzzles — Classic Problems in Assembly
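
The four traces for x = 6 and x = 8 exercise two small routines, sketched here in C. The function names are hypothetical, the treatment of x = 0 is an assumption (the traces do not cover it), and `__builtin_clzll` is a GCC/Clang builtin standing in for `BSR`.

```c
#include <stdint.h>

/* Hypothetical helper: 1 if x is a power of two, else 0.
 * x & (x - 1) clears the lowest set bit, so the result is zero
 * only when x has exactly one bit set. Zero is rejected by assumption. */
int is_power_of_two(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Hypothetical helper: smallest power of two >= x.
 * Returns 1 for x = 0 (an assumption); wraps to 0 if the answer
 * would not fit in 64 bits. */
uint64_t next_power_of_two(uint64_t x)
{
    if (x <= 1)
        return 1;
    if ((x & (x - 1)) == 0)
        return x;                                  /* already a power of two */
    unsigned highest = 63u - (unsigned)__builtin_clzll(x);   /* index of top set bit, as BSR gives */
    return 2ULL << highest;                        /* one bit above the top set bit */
}
```

For x = 6 the highest set bit is bit 2, so the result is `2 << 2 = 8`; for x = 8 the early return yields 8.
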
Track B (Standard — Track A + ~500 lines):: All of Track A - PIT timer at 100Hz - Round-robin preemptive scheduler - 2-4 concurrent kernel threads - Physical memory bitmap allocator - Kernel-mode context switch → Chapter 38: Capstone — A Minimal OS Kernel
Track C (Extended — Track B + ~1000 lines):: All of Track B - Virtual memory (user/kernel separation) - System call interface (SYSCALL/SYSRET) - Simple FAT12 filesystem (or RAM disk) - User-mode shell process (runs in ring 3) - User-mode programs (via ELF loading or simple format) → Chapter 38: Capstone — A Minimal OS Kernel
Typo and grammar fixes: always welcome - **Code corrections** — if assembly code doesn't assemble or produces wrong output - **Diagram improvements** — better ASCII art for complex concepts - **Additional exercises** — following the register-trace or programming format - **ARM64 coverage** — additional ARM64 examples paral → Contributing to Learning Assembly Language

U

unsigned: **CF = 1** after ADD means the result overflowed when operands are treated as **unsigned** - **OF = 1** after ADD means the result overflowed when operands are treated as **signed** - **CF = 1** after SUB means borrow occurred (a < b in unsigned comparison) - **OF = 1** after SUB means the result overflowed the signed range → Chapter 9: Arithmetic and Logic
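
The carry and overflow rules can be checked with a small 8-bit model in C. This is an illustrative sketch, not code from the chapter: the `flags8` type and the `add8`/`sub8` names are hypothetical, and the model covers only CF and OF.

```c
#include <stdint.h>

/* Hypothetical model of an 8-bit ALU result with the two flags of interest. */
typedef struct {
    uint8_t result;
    int cf;   /* carry (ADD) or borrow (SUB): unsigned overflow */
    int of;   /* signed overflow */
} flags8;

flags8 add8(uint8_t a, uint8_t b)
{
    flags8 f;
    unsigned wide = (unsigned)a + (unsigned)b;
    f.result = (uint8_t)wide;
    f.cf = wide > 0xFF;                                   /* carry out of bit 7 */
    /* Signed overflow: both inputs share a sign that the result lacks. */
    f.of = ((a ^ f.result) & (b ^ f.result) & 0x80) != 0;
    return f;
}

flags8 sub8(uint8_t a, uint8_t b)
{
    flags8 f;
    f.result = (uint8_t)(a - b);
    f.cf = a < b;                                         /* borrow: a < b as unsigned */
    /* Signed overflow: inputs differ in sign and the result's sign differs from a's. */
    f.of = ((a ^ b) & (a ^ f.result) & 0x80) != 0;
    return f;
}
```

For instance, `add8(0x7F, 0x01)` sets OF but not CF (127 + 1 overflows the signed range only), while `add8(0xFF, 0x01)` sets CF but not OF.
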
Use inline assembly when:: You need instructions the compiler cannot generate: `CPUID`, `RDTSC`, `RDRAND`, `IN`/`OUT` (I/O ports), `HLT`, `LGDT` - You need a specific atomic sequence: `CMPXCHG`, `LOCK XCHG` (though `` is usually better) - You are writing a security-critical function where the compiler might optim → Chapter 22: Inline Assembly

V

Verification against FIPS 197 test vector:: Key: `2b7e151628aed2a6abf7158809cf4f3c` - Plaintext: `6bc1bee22e409f96e93d7e117393172a` - Expected ciphertext: `3ad77bb40d7a3660a89ecaf32466ef97` (this triple is the first ECB-AES128 block from NIST SP 800-38A, Appendix F.1.1; FIPS 197 specifies the cipher itself) → Case Study 15.2: AES-NI Encryption — Hardware-Accelerated AES in Assembly

W

What a good answer covers:: `ldr x0, [x1, #16]` (offset): loads from address X1+16 = 0x4010; X0 = 0xABCD; X1 = 0x4000 (unchanged). No writeback. - `ldr x0, [x1, #16]!` (pre-index with writeback): updates X1 to X1+16 = 0x4010 FIRST, then loads from that address. X0 = 0xABCD; X1 = 0x4010. - `ldr x0, [x1], #16` (post-index): load → Chapter 17: ARM64 Instruction Set — Discussion Questions
What we especially need:: More worked exercises (with solutions) - Windows-specific callouts (MASM syntax, Windows x64 ABI) - Corrections to any code that doesn't build or run as described - Translations → Learning Assembly Language: What's Really Happening Inside the Machine
When alignment matters:: Loops with more than ~16 instructions may benefit from 32-byte alignment - Functions called from many places benefit from alignment (ensures first instruction on a full cache line) - Alignment matters most when the µop cache is the frontend bottleneck (check `dsb2mite_switches` perf event) → Chapter 33: Performance Analysis and Optimization
When NOT to use software prefetch:: Sequential access: hardware prefetcher handles this - Data that fits in L1/L2: already cached, no benefit - High-bandwidth SIMD loops: bandwidth is the bottleneck, not latency → Chapter 32: The Memory Hierarchy
When to use software prefetch:: Linked list traversal: prefetch `node->next->data` while processing `node->data` - Tree traversal: prefetch children while processing parent - Hash table lookups: prefetch the bucket while computing the hash - Scatter/gather operations: prefetch target addresses before writing → Chapter 32: The Memory Hierarchy
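
The linked-list case might look like the following C sketch. The `struct node` layout and the function name are hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin that typically lowers to `PREFETCHT0` on x86-64 or `PRFM` on ARM64. Whether it helps in practice depends on how much work is done per node, so treat it as a starting point for measurement rather than a guaranteed win.

```c
#include <stddef.h>

/* Hypothetical node layout: payload stored inline with the link. */
struct node {
    struct node *next;
    long data;
};

/* Sum a list while hinting the next node into cache ahead of its use. */
long sum_list_prefetch(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (n->next != NULL)
            __builtin_prefetch(n->next, 0, 3);   /* 0 = read, 3 = keep in all cache levels */
        sum += n->data;                          /* work on the current node overlaps the fetch */
    }
    return sum;
}
```
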
When x87 is appropriate:: 80-bit extended precision is required (e.g., some numerical algorithms benefit from the extra precision) - Hardware transcendental functions (FSIN, FCOS) are desired without linking libm - Working with legacy code that uses x87 → Chapter 14: Floating Point

X

XCHG with a memory operand is always atomic: it carries an implicit LOCK prefix. This makes it useful as a simple test-and-set mutex, but also means it is slower than two MOVs for non-atomic swaps. → Chapter 8 Key Takeaways: Data Movement and Addressing Modes
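
A test-and-set lock built on an atomic exchange can be sketched in C11 as follows. The `tas_lock` type and function names are hypothetical; on x86-64, compilers typically lower `atomic_exchange` on an `int` to `XCHG` with a memory operand, though that is a property of the compiler rather than a guarantee of the language.

```c
#include <stdatomic.h>

/* Hypothetical test-and-set lock: 0 = free, 1 = held. */
typedef struct {
    atomic_int locked;
} tas_lock;

/* Try once: returns 1 if the lock was acquired, 0 if it was already held. */
int tas_trylock(tas_lock *l)
{
    /* Atomically write 1 and get the old value back. */
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Spin until the exchange observes the lock as free. */
void tas_acquire(tas_lock *l)
{
    while (!tas_trylock(l)) {
        /* busy-wait; a production lock would pause or back off here */
    }
}

void tas_release(tas_lock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Releasing needs only a plain store with release ordering, so only the acquiring side needs the exchange.
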

Y

You do not need:: Prior assembly experience of any kind - A deep math background (we use binary and hex arithmetic; that's it) - Access to expensive tools (everything in this book is free and open source) → Learning Assembly Language: What's Really Happening Inside the Machine
You should know:: At least one programming language well — C is strongly preferred - Basic command-line use on Linux or macOS - What a function call is (even if you don't know what happens below the surface) → Learning Assembly Language: What's Really Happening Inside the Machine