Chapter 33 Further Reading: Performance Analysis and Optimization

Open Assembly Language Project

Primary References

Agner Fog — Complete Optimization Manuals https://agner.org/optimize/ The full collection is required reading for serious assembly optimization:
- "Optimizing Assembly" — instruction selection, dependency chains, loop optimization, SIMD, branch prediction, and every micro-optimization technique covered in this chapter with concrete NASM examples
- "Instruction Tables" — latency, throughput, and port assignments for every instruction on every major Intel/AMD microarchitecture. The essential data for computing optimal accumulator count and predicting loop throughput
- "Microarchitecture" — detailed pipeline descriptions including issue ports, execution unit counts, and µop cache sizes. Necessary context for interpreting llvm-mca port pressure output
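
The accumulator-count calculation mentioned under "Instruction Tables" can be sketched as follows. This is a minimal illustration, assuming the usual rule of thumb that the number of independent dependency chains should cover latency divided by reciprocal throughput. The function name and the numeric inputs are hypothetical and are not taken from the tables; look up the real figures for the target microarchitecture.

```python
import math

def accumulators_needed(latency_cycles: float, reciprocal_throughput: float) -> int:
    """Independent dependency chains needed to keep the execution units busy.

    latency_cycles: cycles before a result can feed a dependent instruction.
    reciprocal_throughput: average cycles between starting two independent
        instructions of the same kind.
    """
    return math.ceil(latency_cycles / reciprocal_throughput)

# Illustrative values only (assumptions, not measurements).
print(accumulators_needed(4, 0.5))  # latency 4, two per cycle
print(accumulators_needed(3, 1))    # latency 3, one per cycle
```

With fewer accumulators than this, the loop is bound by the latency of the dependency chain rather than by execution throughput. Using more than this mainly costs registers.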

"Computer Architecture: A Quantitative Approach" — Hennessy & Patterson The textbook that defines modern computer architecture. Part of Chapter 3 (ILP and its Exploitation) and Appendix C (Pipelining) provide the theoretical foundation for everything in this chapter: Tomasulo's algorithm, branch prediction tables, execution unit topology.

Intel 64 and IA-32 Architectures Optimization Reference Manual https://www.intel.com/content/www/us/en/developer/articles/technical/intel-64-ia-32-architectures-optimization-reference-manual.html Chapter 3 (General Optimization Guidelines) and Appendix C (Intel Core Microarchitecture) cover the official Intel guidance on IPC optimization, loop unrolling, prefetching, non-temporal stores, and software pipelining. Conservative compared to Agner Fog but authoritative for Intel-specific microarchitectural details.

Tools Documentation

perf — Linux Performance Analysis https://perf.wiki.kernel.org/index.php/Tutorial The canonical tutorial covering: perf stat for counting, perf record/report for sampling, perf annotate for instruction-level analysis, perf top for live profiling, and the PMU event naming conventions. The event list for Intel CPUs is long — the tutorial explains how to find the right event name for a specific bottleneck.

llvm-mca — LLVM Machine Code Analyzer https://llvm.org/docs/CommandGuide/llvm-mca.html Documentation for the analysis tool used throughout this chapter. Covers the marker syntax (# LLVM-MCA-BEGIN/END), the -mcpu flag for selecting the target microarchitecture, the output format (throughput, IPC, resource pressure, and the timeline view), and command-line options for controlling simulation depth, such as the iteration count.

libdivide — Optimized Integer Division https://libdivide.com/ A header-only C/C++ library that generates correct magic numbers for division by a divisor that stays constant across a loop but is known only at runtime (compilers already perform this transformation for compile-time constants). It handles the signed and unsigned edge cases and also provides branchfree variants. Essential for any code that divides repeatedly by the same divisor in hot loops.
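
The magic-number idea can be sketched in a few lines. This is not libdivide's API or its exact algorithm; it is a minimal illustration of the round-up method for unsigned 32-bit dividends, with hypothetical function names, relying on Python's arbitrary-precision integers to hold the multiplier.

```python
def magic_u32(d: int) -> tuple[int, int]:
    """Return (multiplier, shift) such that n // d == (n * multiplier) >> shift
    for every unsigned 32-bit n. The multiplier may need 33 bits."""
    if not 1 <= d < 2**32:
        raise ValueError("divisor must be in [1, 2**32)")
    ell = (d - 1).bit_length()                # ceil(log2(d))
    shift = 32 + ell
    multiplier = ((1 << shift) + d - 1) // d  # ceil(2**shift / d)
    return multiplier, shift

def div_by_magic(n: int, multiplier: int, shift: int) -> int:
    """Divide using one multiplication and one shift instead of a divide."""
    return (n * multiplier) >> shift

# Precompute once, then reuse for every dividend in the hot loop.
m, s = magic_u32(7)
for n in (0, 6, 7, 13, 14, 2**32 - 1):
    assert div_by_magic(n, m, s) == n // 7
```

In assembly the multiplier can exceed 32 bits, so real implementations need an extra add-and-shift step or a different rounding choice. Handling those cases correctly is the work a library takes off your hands.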

Deep Dives

"IACA — Intel Architecture Code Analyzer" (deprecated but instructive) https://software.intel.com/content/www/us/en/develop/articles/intel-architecture-code-analyzer-overview.html Intel's predecessor to llvm-mca. While no longer updated (the last version supports up to Skylake), the documentation and example analyses explain the port utilization model in depth. Reading the IACA user guide alongside llvm-mca output builds intuition for interpreting resource pressure numbers.

"Performance Analysis Guide for Intel Core i7 and Intel Xeon 5500 Processors" https://software.intel.com/content/www/us/en/develop/articles/performance-analysis-guide-for-intel-core-i7-processor-and-intel-xeon-5500-processors.html The original Top-Down Microarchitecture Analysis Method (TMAM) paper. Explains the four-category hierarchy (Frontend Bound, Backend Bound Memory, Backend Bound Core, Bad Speculation) and how to derive it from hardware counter values. The methodology is now built into VTune and modern perf.

"Flame Graphs" — Brendan Gregg https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html Flame graphs visualize profiling data as stacked call chains where width represents time. They make hotspot identification faster than scrolling through tabular perf output. The flamegraph.pl script at the link works directly with perf script output. Brendan Gregg's blog contains decades of Linux performance analysis methodology.

"Fastware" — Andrei Alexandrescu (CppCon talk) https://www.youtube.com/watch?v=AxnotgLql0k A 1-hour walkthrough of optimizing a real function (string search) from baseline through SIMD. Demonstrates the profile → analyze → optimize cycle in real time, including the surprise that the "obvious" optimization is often not the bottleneck. The NASM optimization mindset applied to C++.