Chapter 31 Further Reading: The Modern CPU Pipeline
Primary References
Agner Fog — CPU Architecture, Instruction Tables, and Optimization Manuals
https://agner.org/optimize/
The definitive practical resource for x86-64 performance optimization. Four of the manuals are directly relevant here: (1) "Optimizing software in C++", general principles; (2) "Optimizing subroutines in assembly language", deep instruction-level optimization; (3) "The microarchitecture of Intel, AMD, and VIA CPUs", detailed pipeline descriptions for each microarchitecture Fog has tested; (4) "Instruction tables", measured latency, throughput, and port assignments per instruction for each tested microarchitecture. Bookmark these permanently.
Intel 64 and IA-32 Architectures Optimization Reference Manual
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-64-ia-32-architectures-optimization-reference-manual.html
Intel's official optimization guide. Covers pipeline structure, execution units, branch prediction, memory hierarchy, and optimization techniques. Chapter 2 (Intel Microarchitecture) and Chapter 3 (General Optimization Guidelines) are the most relevant. Its guidance is more conservative and its per-instruction detail less precise than Agner Fog's manuals, but it is the authoritative source for Intel-specific behavior.
AMD Software Optimization Guide
https://developer.amd.com/resources/developer-guides-manuals/
AMD's counterpart to the Intel manual, covering Zen 4 and earlier architectures. Essential if you target AMD hardware: Zen's execution ports and instruction latencies differ from Intel's enough to change which optimizations pay off.
Deep Dives
"Why Skylake is Different from Haswell" — Agner Fog blog
https://www.agner.org/optimize/blog/read.php?i=208
An analysis of the µop cache improvements in Intel Skylake that changed frontend behavior. A useful example of how microarchitectural changes force you to revise optimization strategies.
"Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processors" — PMU Cookbook
https://software.intel.com/content/www/us/en/develop/articles/intel-performance-counter-monitor.html
Explains how to use hardware performance counters to diagnose CPU bottlenecks. Covers the Intel PMU (Performance Monitoring Unit) event hierarchy: how to distinguish frontend-bound from backend-bound execution, identify specific execution-unit bottlenecks, and measure retirement efficiency.
"Branch Prediction Survey" — Fog
https://agner.org/optimize/microarchitecture.pdf
Chapter 3 of Agner Fog's microarchitecture manual covers branch prediction in detail: BTB (Branch Target Buffer) structure, the PHT (Pattern History Table), indirect branch prediction, return stack behavior, and how to write code that the predictor handles well. Required reading before writing inner-loop conditionals.
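As a small illustration of what "predictable" means for an inner-loop conditional, the sketch below counts bytes above a threshold in two ways. The function names are hypothetical and not taken from Fog's manual. The first version uses a data-dependent branch, which tends to mispredict on random input; the second turns the comparison into an arithmetic value so there is nothing to predict. This is only a sketch: an optimizing compiler may already emit branch-free or vectorized code for both versions, so check the generated assembly before drawing conclusions.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy version: whether the `if` body runs depends on the data,
 * so on random input the branch outcome is hard to predict. */
size_t count_above_branchy(const uint8_t *data, size_t n, uint8_t threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] > threshold) {
            count++;
        }
    }
    return count;
}

/* Branch-free version: the comparison yields 0 or 1, which is added
 * unconditionally. The only remaining branch is the loop back-edge,
 * which follows a regular pattern. */
size_t count_above_branchless(const uint8_t *data, size_t n, uint8_t threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (size_t)(data[i] > threshold);
    }
    return count;
}
```

Both functions return the same result for any input; only the shape of the control flow differs. On sorted or otherwise regular data the branchy version is usually fine, which is why measuring on representative input matters.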
Tools
llvm-mca — LLVM Machine Code Analyzer
https://llvm.org/docs/CommandGuide/llvm-mca.html
A static analysis tool that simulates how a CPU pipeline would execute a block of code. It reports estimated throughput (cycles per iteration), predicted IPC, resource pressure per execution port, and a timeline view. No target hardware is required, since it works from assembly source. Its model does not cover cache misses or branch mispredictions, so use it to sanity-check an optimization before benchmarking, not to replace the benchmark.
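A minimal sketch of a typical llvm-mca workflow, assuming a file named dot.c (a hypothetical name) and a Skylake target chosen purely for illustration. The inline-assembly comments mark the region to analyze, following the pattern shown in the llvm-mca documentation; the markers can influence how the compiler optimizes the surrounding code, so inspect the generated .s file to confirm the marked region contains the instructions you expect.

```c
#include <stddef.h>

/* Scalar dot product. The running `sum` forms a loop-carried
 * dependency chain, which is the kind of bottleneck a timeline
 * view can make visible.
 *
 * Possible workflow (file name and CPU are assumptions):
 *   clang -O2 -S -o dot.s dot.c
 *   llvm-mca -mcpu=skylake -timeline dot.s
 */
double dot(const double *a, const double *b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Region markers: assembly comments that llvm-mca looks for.
         * They emit no machine instructions. */
        __asm volatile("# LLVM-MCA-BEGIN dot_body" ::: "memory");
        sum += a[i] * b[i];
        __asm volatile("# LLVM-MCA-END" ::: "memory");
    }
    return sum;
}
```

Placing the markers inside the loop body restricts the report to the multiply-add work and leaves out loop control; moving them around the whole loop would include it. Either choice is reasonable as long as the report is read with that in mind.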
Intel VTune Profiler
https://www.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html
Intel's profiler, free to download either standalone or as part of the oneAPI toolkits. It provides detailed bottleneck analysis: frontend bound vs. backend bound, bad speculation (mispredictions), and memory bound vs. core bound breakdowns. More actionable than raw perf stat output for deciding which type of bottleneck to address first.
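The top-level categories in this kind of breakdown partition the pipeline's issue slots: each slot is attributed to frontend bound, bad speculation, retiring (useful work), or backend bound, and backend bound is then split further into memory bound and core bound. The sketch below shows only that arithmetic. The struct, the function name, and the idea of passing pre-computed slot counts are assumptions made for illustration; real tools derive these counts from specific hardware events that differ between microarchitectures.

```c
#include <stdint.h>

/* Fractions of issue slots per top-level category.
 * Hypothetical type for illustration only. */
typedef struct {
    double frontend_bound;
    double bad_speculation;
    double retiring;
    double backend_bound;
} topdown_l1;

/* Convert slot counts into fractions of the total. Backend bound is
 * computed as the remainder, so the four fractions sum to 1.
 * Assumes the three counts together do not exceed total_slots. */
topdown_l1 topdown_level1(uint64_t total_slots,
                          uint64_t frontend_slots,
                          uint64_t bad_spec_slots,
                          uint64_t retiring_slots) {
    topdown_l1 r = {0.0, 0.0, 0.0, 0.0};
    if (total_slots == 0) {
        return r; /* nothing measured; avoid dividing by zero */
    }
    double total = (double)total_slots;
    r.frontend_bound = (double)frontend_slots / total;
    r.bad_speculation = (double)bad_spec_slots / total;
    r.retiring = (double)retiring_slots / total;
    r.backend_bound =
        1.0 - (r.frontend_bound + r.bad_speculation + r.retiring);
    return r;
}
```

For example, with 1024 total slots, of which 256 are frontend bound, 128 are lost to bad speculation, and 512 retire, the remaining 128 slots (12.5%) are attributed to the backend.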
Perf Wiki — Linux Performance Counters
https://perf.wiki.kernel.org/
Documentation for the Linux perf tool. The event list (perf list), the stat command, the record/report workflow for sampling, and the annotate command for instruction-level hotspot analysis are all covered. perf is the starting point for any performance investigation on Linux.