Part VI: Performance and Microarchitecture
Why Instruction Count ≠ Execution Time
A loop that executes 10 instructions can take more wall-clock time than a loop that executes 20 instructions. This is not a trick: it happens routinely, and understanding why is the difference between assembly programmers who optimize and assembly programmers who actually make things faster.
Instruction count is a proxy for work, and proxies lie. The CPU does not execute one instruction at a time in sequence. It fetches several instructions ahead, decodes them in parallel, breaks them into micro-operations, renames registers to eliminate false dependencies, dispatches multiple micro-ops per cycle to several independent execution units, executes them out of order, and retires them in order to maintain the illusion of sequential execution. The gap between instruction count and execution time is as wide as the gap between this description and what a textbook from 1990 taught you about pipelining.
Part VI examines what the modern CPU actually does with your code.
Chapter 31: The Modern CPU Pipeline
The pipeline chapter dismantles the classic five-stage pipeline model and replaces it with the real thing: a 4-wide (or wider) superscalar out-of-order processor with a reorder buffer hundreds of entries deep (512 on recent Intel cores) and more execution units than most programmers have ever seen in a block diagram. You will understand why ADD RAX, 1 has a latency of 1 cycle but a throughput of 0.25 cycles (four can execute per cycle), why DIV takes 20-80 cycles depending on operand size and value, and why the difference between a 1-cycle latency path and a 3-cycle latency path can halve your throughput when instructions are chained.
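The latency-versus-throughput distinction can be sketched in C (hypothetical function names, written so the dependency structure is visible; the real lesson applies to the assembly the compiler emits). Both functions perform the same additions, but the first chains every add through one register, while the second keeps four independent chains in flight:

```c
#include <stdint.h>
#include <stddef.h>

/* Latency-bound: every add must wait for the previous sum, so the loop
   runs at the 1-cycle *latency* of ADD, one element per cycle at best. */
uint64_t sum_chained(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                  /* each add depends on the last s */
    return s;
}

/* Throughput-bound: four independent accumulators let the out-of-order
   core issue several adds per cycle, approaching the 0.25-cycle
   *throughput* of ADD. */
uint64_t sum_split(const uint64_t *a, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {    /* four independent dependency chains */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)              /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both return the same value; only the shape of the dependency graph differs, which is exactly why instruction count alone predicts nothing.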
Branch prediction gets the attention it deserves. The branch predictor makes a guess on every conditional branch before the condition is evaluated. A correctly predicted branch (predictable code) costs essentially nothing. A mispredicted branch (random data) incurs a 15-20 cycle penalty. This is why sorting an array before running a branch-heavy loop over it, choosing branchless CMOV over conditional jumps, and avoiding data-dependent branches in inner loops can produce 5x speedups that have nothing to do with instruction count.
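The branchy-versus-branchless trade can be illustrated with a small, hypothetical example (whether the compiler actually emits CMOV depends on target and optimization level; inspect the disassembly to confirm):

```c
#include <stdint.h>
#include <stddef.h>

/* Branchy: one conditional jump per element. On random data the
   predictor is wrong ~50% of the time, paying 15-20 cycles each miss. */
int64_t sum_positive_branchy(const int32_t *a, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] > 0)
            s += a[i];
    return s;
}

/* Branchless: the select has no control dependence, so there is nothing
   for the predictor to guess. Compilers typically lower this ternary to
   CMOV or an AND-mask rather than a jump. */
int64_t sum_positive_branchless(const int32_t *a, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t x = a[i];
        s += (x > 0) ? x : 0;   /* CMOV/mask: data dependence, no branch */
    }
    return s;
}
```

On sorted or mostly-uniform data the branchy version is fine; on random data the branchless one wins, which is the whole point of measuring before choosing.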
Chapter 32: The Memory Hierarchy
Memory accesses are not equal. An L1 cache hit costs 4 cycles. An L2 hit costs 12. An L3 hit costs 40. A DRAM access costs 100+ cycles. An NVMe SSD access costs 10,000 cycles. If your code touches memory in the wrong pattern — striding across columns of a row-major matrix, thrashing a linked list whose nodes are scattered across physical memory, reading struct fields from hundreds of different structs — the CPU's execution units sit idle waiting for memory that is always one level too deep in the hierarchy.
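The stride problem is easy to show concretely. In this sketch (hypothetical function names) both loops compute the same total over a row-major n x n matrix, but one walks memory sequentially while the other strides an entire row's width per access:

```c
#include <stddef.h>

/* Row-major matrix, sequential walk: each 64-byte cache line holds
   sixteen ints, so fifteen of every sixteen accesses are L1 hits. */
long sum_by_rows(const int *m, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += m[i * n + j];      /* stride: 4 bytes */
    return s;
}

/* Column walk over the same row-major data: stride is 4*n bytes, so
   once 4*n exceeds a cache line every access touches a new line, and
   for large n the working set defeats every cache level. */
long sum_by_cols(const int *m, size_t n) {
    long s = 0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += m[i * n + j];      /* stride: 4*n bytes */
    return s;
}
```

Identical result, identical instruction count, wildly different wall-clock time at large n: the difference is purely which level of the hierarchy serves each load.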
The chapter covers cache line organization (64 bytes), set-associative mapping, the MESI coherence protocol, and practical patterns that keep data in L1. Matrix multiplication serves as the running example: a naive O(n^3) implementation with column-major access can run 10x slower than a cache-blocked implementation on a large matrix, because the difference is between 40-cycle L3 accesses and 4-cycle L1 hits.
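A minimal version of the running example, under assumed parameters (BLOCK = 32 so that three 32x32 tiles of doubles, 24 KiB, fit a 32 KiB L1D; this is a sketch, not the chapter's tuned code):

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 32   /* assumed tile size: 3 tiles * 8 KiB fit a 32 KiB L1D */

/* Naive ijk: the inner loop walks b with stride n doubles, missing
   the cache on every iteration once n is large. */
void matmul_naive(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += a[i*n + k] * b[k*n + j];   /* b: stride-n access */
            c[i*n + j] = s;
        }
}

/* Blocked: each BLOCK x BLOCK tile of a, b, and c stays resident in L1
   for the whole time it is being used. */
void matmul_blocked(const double *a, const double *b, double *c, size_t n) {
    memset(c, 0, n * n * sizeof *c);
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                    for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                        double aik = a[i*n + k];        /* L1-resident */
                        for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                            c[i*n + j] += aik * b[k*n + j];
                    }
}
```

Same arithmetic, same operation count; the blocked version merely reorders it so the loads hit L1 instead of L3 or DRAM.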
Non-temporal stores (MOVNT variants) and software prefetch instructions (PREFETCHT0, PREFETCHT1) are covered where they actually help: streaming workloads where the data is write-once or known to be accessed only once.
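From C, software prefetch is most portably reached through GCC/Clang's __builtin_prefetch, which lowers to the PREFETCH family on x86. A hedged sketch of the streaming pattern (PF_DIST is an assumed distance that must be tuned by measurement; prefetch is only a hint and never changes the result):

```c
#include <stddef.h>

#define PF_DIST 16   /* assumed prefetch distance in elements; tune it */

/* Streaming, read-once pass: request the cache line PF_DIST elements
   ahead while summing the current one. Arguments to the builtin:
   address, rw (0 = read), locality (0 = non-temporal, evict soon). */
double stream_sum(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0, 0);
        s += a[i];
    }
    return s;
}
```

On modern cores the hardware prefetcher already handles a plain sequential stream well; explicit prefetch earns its keep on irregular but predictable access patterns, which is why the chapter insists on measuring before and after.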
Chapter 33: Performance Analysis and Optimization
You cannot optimize what you have not measured. Chapter 33 is the engineering chapter: how to measure performance accurately (RDTSC, perf stat, perf record), how to find the hot path in a large codebase, how to read a performance counter report, and what the counters actually tell you about what the CPU is doing.
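The smallest useful measurement tool looks something like this sketch: RDTSC where the target supports it, a nanosecond clock elsewhere. Real measurement also needs serialization (RDTSCP or a fence), warmup, and many repetitions, so treat this as the skeleton only:

```c
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>              /* __rdtsc */
#endif

/* Timestamp in TSC ticks on x86, nanoseconds elsewhere. Only deltas
   between two calls on the same core are meaningful. */
static uint64_t ticks(void) {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();               /* reads the timestamp counter */
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}
```

Typical use: `uint64_t t0 = ticks(); work(); uint64_t dt = ticks() - t0;` repeated many times, keeping the minimum. For anything beyond a microbenchmark, perf stat and perf record are the right tools, as the chapter shows.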
Loop optimization techniques form the bulk of the practical content: unrolling to expose ILP, instruction selection (prefer LEA over IMUL where possible, avoid DIV in loops), code alignment for the instruction cache, and the llvm-mca tool, which analyzes a loop's theoretical throughput from its assembly without ever running it.
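Two of those instruction-selection ideas fit in a few lines of C (hypothetical helpers; the comments name the instructions a typical x86-64 compiler emits):

```c
#include <stdint.h>
#include <stddef.h>

/* Multiply by 5 as shift-and-add: compilers emit a single LEA
   (lea rax, [rdi + rdi*4]) instead of the higher-latency IMUL. */
uint64_t times5(uint64_t x) {
    return (x << 2) + x;
}

/* Reducing a hash to a table index with %: a DIV on every call,
   20-80 cycles. Avoid this inside hot loops. */
size_t hash_div(uint64_t key, size_t table_size) {
    return key % table_size;            /* DIV */
}

/* Same reduction when the table size is a power of two: one AND,
   a single cycle. This only works for power-of-two sizes. */
size_t hash_mask(uint64_t key, size_t table_size_pow2) {
    return key & (table_size_pow2 - 1); /* AND */
}
```

The compiler performs the times5 rewrite on its own at -O2; the power-of-two table size is a design decision only you can make, which is exactly the "you know something the compiler cannot" category below.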
The closing theme: the compiler is usually right. -O2 makes reasonable choices. Assembly-level optimization is worth doing when (a) you have profiled and identified a genuine bottleneck, (b) you know something the compiler cannot: the data distribution, the calling frequency, the platform guarantees. Otherwise, write clear code, let the compiler optimize, and profile the result. The assembly chapters teach you to understand what the compiler does, not to do its job for it.
Performance Engineering as a Discipline
The three chapters together represent a discipline: performance engineering. It is not about making code faster for its own sake. It is about understanding the hardware well enough to predict where time is actually spent, to distinguish memory-bound from compute-bound workloads, and to apply the right technique (cache blocking, unrolling, branch elimination, SIMD) in the right place.
After Part VI, you will read Agner Fog's instruction tables not as academic trivia but as design documents. You will look at a loop and estimate its theoretical throughput from the dependency chain and port utilization. You will know when a 10% speedup from loop unrolling is worth the code complexity, and when it is not.
You will, in short, think like a CPU.
Why is a program slower than its instruction count suggests? The answer is almost always: the CPU is not doing what you think it is doing.
Modern processors are not simple machines that execute one instruction after another at a fixed speed. They are highly complex out-of-order superscalar machines with branch predictors, speculative execution engines, register renaming hardware, and multi-level cache hierarchies. Instruction count is a poor proxy for execution time. The real question is: what is the microarchitecture doing with your instructions?
Part VI answers that question.
What Part VI Covers
Chapter 31 explains the modern CPU pipeline. Not the simple five-stage pipeline from a computer organization textbook, but the real thing: out-of-order execution, the reorder buffer, micro-operations, register renaming, and the branch predictor. Why can a 10-instruction loop be slower than a 20-instruction loop? Because the 10-instruction version has a dependency chain that serializes execution, while the 20-instruction version has enough independent operations to keep all execution units busy simultaneously.
Chapter 32 explains the memory hierarchy. The numbers that should change how every programmer thinks about data structures: L1 cache access in 4 cycles, L2 in 12, L3 in 40, DRAM in 100+ cycles. A cache miss is not a minor inconvenience — it is a complete stall waiting for main memory. Cache-friendly programming is not optimization; it is the difference between code that performs acceptably and code that does not.
Chapter 33 teaches performance analysis: how to use perf to measure what is actually happening in your code, how to read hardware performance counters, and how to systematically identify and address bottlenecks. The chapter walks through optimizing a real function from a baseline implementation to near-peak throughput, using measurement to drive each decision.
The Practitioner's Principle
Profile before you optimize. The most common performance mistake is optimizing the wrong thing. A function that accounts for 0.1% of execution time cannot buy you more than a 0.1% speedup, no matter how well you optimize it. perf record followed by perf report will tell you where the time is actually going.
That said, knowing what makes code fast — independent instructions, sequential memory access, predictable branches — should inform every design decision. You will write better code from the start if you understand the hardware.
The Security Connection
Chapter 31's discussion of speculative execution has a direct connection to security: the Spectre vulnerability is a consequence of out-of-order and speculative execution across security boundaries. Understanding the pipeline helps you understand both why Spectre exists and why the mitigations work the way they do.
After Part VI
With an understanding of both systems programming (Part V) and performance (Part VI), Part VII turns to security and reverse engineering — where assembly knowledge, systems knowledge, and performance knowledge all converge.