Appendix L: Performance Counters and Measurement Reference
This appendix covers the tools and event names for measuring CPU and memory performance, as used in Part VI of this book.
perf stat — Essential Events
perf stat runs a program and reports hardware performance counter totals. Counting your own process works at the default kernel.perf_event_paranoid level (2); system-wide measurement or kernel-space events require root or a lower setting (1 or 0).
```shell
perf stat ./program
perf stat -e event1,event2,... ./program
perf stat -r 5 ./program      # repeat 5 times and report mean ± stddev
perf stat -p PID              # attach to a running process (Ctrl-C to stop)
```
Default Events (always available)
These events are counted by default and need no extra configuration (though hardware events may be unavailable inside some virtual machines):
| Event | Meaning |
|---|---|
| `task-clock` | CPU time used (milliseconds) |
| `context-switches` | Number of OS context switches |
| `cpu-migrations` | Times the process moved to a different CPU |
| `page-faults` | Total page faults (minor + major) |
| `cycles` | CPU cycles elapsed |
| `instructions` | Instructions retired |
| `branches` | Branch instructions |
| `branch-misses` | Branch mispredictions |
| `cache-misses` | Last-level cache misses |
| `cache-references` | Last-level cache accesses |
Derived Metrics
From default perf stat output, compute:
| Metric | Formula | Good value |
|---|---|---|
| IPC (instructions per cycle) | instructions / cycles | > 2.0 (modern out-of-order CPU) |
| Branch miss rate | branch-misses / branches | < 1% |
| LLC miss rate | cache-misses / cache-references | < 10% |
| CPI (cycles per instruction) | cycles / instructions | < 1.0 ideal |
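These ratios can be computed mechanically from perf stat's machine-readable output (`-x,` emits CSV on stderr: field 1 is the count, field 3 the event name). A minimal sketch; the `compute_metrics` function name is ours, and the final line feeds made-up counts so the script runs standalone:

```shell
# compute_metrics: read perf stat -x, CSV on stdin, print IPC and branch miss rate.
compute_metrics() {
    awk -F, '
        $3 == "cycles"        { cycles = $1 }
        $3 == "instructions"  { insns  = $1 }
        $3 == "branches"      { br     = $1 }
        $3 == "branch-misses" { brm    = $1 }
        END {
            printf "IPC: %.2f\n", insns / cycles
            printf "Branch miss rate: %.2f%%\n", 100 * brm / br
        }'
}

# Real usage (perf writes the CSV to stderr):
#   perf stat -x, -e cycles,instructions,branches,branch-misses ./program 2>&1 | compute_metrics

# Standalone demonstration with sample counts:
printf '1000000,,cycles\n2500000,,instructions\n400000,,branches\n4000,,branch-misses\n' | compute_metrics
```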
perf stat — Hardware-Specific Events
These require root access or appropriate permissions. Available events vary by CPU model.
Memory Hierarchy Events
```shell
# The l2_rqsts.* names are Intel-specific symbolic events; check perf list.
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores \
          -e l2_rqsts.all_demand_data_rd,l2_rqsts.demand_data_rd_miss \
          -e LLC-loads,LLC-load-misses \
          ./program
```
| Event | Meaning |
|---|---|
| `L1-dcache-loads` | L1 data cache loads |
| `L1-dcache-load-misses` | L1 data cache load misses |
| `L1-dcache-stores` | L1 data cache stores |
| `L1-icache-loads` | L1 instruction cache loads |
| `L1-icache-load-misses` | L1 instruction cache load misses |
| `LLC-loads` | Last-level cache loads |
| `LLC-load-misses` | Last-level cache load misses (→ DRAM) |
| `LLC-stores` | Last-level cache stores |
| `dTLB-loads` | Data TLB lookups |
| `dTLB-load-misses` | Data TLB misses (→ page walk) |
| `iTLB-loads` | Instruction TLB lookups |
| `iTLB-load-misses` | Instruction TLB misses |
Pipeline Events
```shell
perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend ./program
```

(The two stalled-cycles events are not exposed on every CPU; if perf reports them as not supported, use the CYCLE_ACTIVITY.* events listed below.)
| Event | Meaning |
|---|---|
| `stalled-cycles-frontend` | Cycles where the front-end cannot supply instructions |
| `stalled-cycles-backend` | Cycles where the back-end cannot accept instructions |
| `branch-instructions` | Branch instructions retired |
| `branch-misses` | Branch mispredictions |
Intel-Specific PMU Events
For Intel CPUs, more specific events are available via symbolic names or via raw event codes (cpu/event=0xNN,umask=0xNN/). Use perf list to see all events available on your system.
Commonly Used Intel Events
```shell
# All retired branch instructions (BR_INST_RETIRED.ALL_BRANCHES):
perf stat -e cpu/event=0xc4,umask=0x00/ ./program

# Cache-hit-level breakdown for retired loads (MEM_LOAD_RETIRED.L1_HIT,
# .L2_HIT, .L3_HIT, .L3_MISS), counted together as one group:
perf stat -e '{cpu/event=0xd1,umask=0x01/,cpu/event=0xd1,umask=0x02/,cpu/event=0xd1,umask=0x04/,cpu/event=0xd1,umask=0x20/}' ./program
```
Key Intel Skylake Events
| Event | Code | Description |
|---|---|---|
| `UOPS_ISSUED.ANY` | 0x0e, umask=0x01 | Micro-ops issued by the front-end |
| `UOPS_RETIRED.ALL` | 0xc2, umask=0x01 | Micro-ops retired |
| `UOPS_EXECUTED.THREAD` | 0xb1, umask=0x01 | Micro-ops executed |
| `MACHINE_CLEARS.COUNT` | 0xc3, umask=0x01 | Machine clear events |
| `MEM_LOAD_RETIRED.L1_MISS` | 0xd1, umask=0x08 | L1 miss loads |
| `CYCLE_ACTIVITY.STALLS_L1D_MISS` | 0xa3, umask=0x0c | Stalls waiting for an L1D miss |
| `CYCLE_ACTIVITY.STALLS_L2_MISS` | 0xa3, umask=0x05 | Stalls waiting for an L2 miss |
| `CYCLE_ACTIVITY.STALLS_L3_MISS` | 0xa3, umask=0x06 | Stalls waiting for an L3 miss |
| `CYCLE_ACTIVITY.STALLS_MEM_ANY` | 0xa3, umask=0x14 | Any memory stall |
perf record and perf report
perf record samples the program at regular intervals and records which code was executing.
```shell
# Record at the default sampling frequency (4000 Hz):
perf record ./program
perf report                               # interactive TUI

# Sample at a higher frequency (capped by perf_event_max_sample_rate):
perf record -F 99999 ./program

# Record call graphs (needs frame pointers, or use --call-graph dwarf):
perf record -g ./program
perf report

# Record a specific event:
perf record -e cache-misses:u ./program   # :u = user space only

# Annotate a specific function with per-instruction sample counts:
perf annotate function_name
```
Interpreting perf report
The report shows the percentage of samples that landed in each function. A function consuming 80% of samples is not necessarily slow itself; it may be waiting for memory. Cross-reference with counters:

- High `cycles` share but low LLC miss rate → compute-bound (the function itself is slow)
- High LLC miss rate → memory-bound (the function is waiting for data from DRAM)
- High `stalled-cycles-backend` → memory or dependency stall
- High `stalled-cycles-frontend` → instruction cache misses or branch mispredictions
RDTSC and RDTSCP
The rdtsc (Read Time-Stamp Counter) instruction reads the processor's cycle counter.
Correct RDTSC Usage
```asm
; Serialize before reading (prevent out-of-order execution from moving
; code across the measurement boundary):
lfence
rdtsc                   ; reads TSC: high 32 bits → EDX, low 32 bits → EAX
shl rdx, 32
or  rax, rdx            ; combine into RAX (full 64-bit TSC value)
mov [start_tsc], rax

; ... code to measure ...

lfence
rdtsc
shl rdx, 32
or  rax, rdx            ; RAX = end TSC
sub rax, [start_tsc]    ; elapsed cycles in RAX
```
RDTSCP (with CPU ID)
rdtscp additionally reads the IA32_TSC_AUX MSR into ECX (Linux stores the CPU number there), which lets you detect measurements that straddled a CPU migration:

```asm
rdtscp          ; TSC → EDX:EAX, IA32_TSC_AUX (CPU id) → ECX
shl rdx, 32
or  rax, rdx
; Compare the ECX values from the start and end readings; discard the
; sample if they differ (the process migrated mid-measurement)
```
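In C, the same check can be done with the `__rdtscp` compiler intrinsic; a minimal sketch (the wrapper name and usage variables are ours):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */

/* Read the TSC; *aux receives IA32_TSC_AUX, which Linux fills with the
   current CPU number, so a start/end mismatch flags a migration. */
static inline uint64_t rdtscp_cycles(uint32_t *aux) {
    return __rdtscp(aux);
}

/* Usage:
     uint32_t cpu_start, cpu_end;
     uint64_t t0 = rdtscp_cycles(&cpu_start);
     ... measured code ...
     uint64_t t1 = rdtscp_cycles(&cpu_end);
     if (cpu_start != cpu_end)
         ;  // migrated mid-measurement: discard this sample
*/
```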
Notes on RDTSC
- On CPUs with an invariant TSC (everything recent), the counter increments at a fixed reference frequency (e.g., the 3.0 GHz nominal rate), not the current boosted frequency: it measures wall time in reference cycles, not actual core cycles
- On modern CPUs (Skylake+), the reference frequency is reported via CPUID leaf 0x15
- `rdtsc` is not privileged (unless the OS sets CR4.TSD), so it can be used from user space
- The `lfence` before and after is necessary for accurate measurement because out-of-order execution can otherwise reorder the `rdtsc` relative to the measured code
- Minimum overhead: approximately 20-40 cycles per measurement pair
C Wrapper
```c
#include <stdint.h>

static inline uint64_t rdtsc_start(void) {
    uint32_t lo, hi;
    __asm__ volatile (
        "lfence\n\t"
        "rdtsc\n\t"
        "lfence"
        : "=a"(lo), "=d"(hi)
        :
        : "memory"   /* keep the compiler from moving measured code
                        across the read */
    );
    return ((uint64_t)hi << 32) | lo;
}

static inline uint64_t rdtsc_end(void) {
    uint32_t lo, hi;
    __asm__ volatile (
        "lfence\n\t"
        "rdtsc"
        : "=a"(lo), "=d"(hi)
        :
        : "memory"
    );
    return ((uint64_t)hi << 32) | lo;
}
```
```c
// Usage:
uint64_t start = rdtsc_start();
// ... measured code ...
uint64_t elapsed = rdtsc_end() - start;
```
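A practical refinement, sketched here with the intrinsic equivalents of the wrappers above so the block is self-contained: repeat the measurement and keep the minimum, which discards samples inflated by interrupts, context switches, and frequency ramp-up. The function names are ours.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence */

/* Time fn() reps times; return the smallest cycle count observed. */
static uint64_t min_cycles(void (*fn)(void), int reps) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < reps; i++) {
        _mm_lfence();
        uint64_t t0 = __rdtsc();
        _mm_lfence();
        fn();                    /* code under test */
        _mm_lfence();
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Example payload: an empty function with a compiler barrier. */
static void noop(void) { __asm__ volatile ("" ::: "memory"); }
```

The minimum approximates the true cost; the mean mixes in scheduler noise.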
Memory Bandwidth and Latency Reference
Approximate Memory Hierarchy Latencies (Intel Skylake / AMD Zen 3)
| Level | Size | Latency | Bandwidth (GB/s) |
|---|---|---|---|
| L1 data cache | 32-64 KB | 4-5 cycles | 200-300 |
| L2 cache | 256 KB - 1 MB | 12-14 cycles | 100-200 |
| L3 cache (LLC) | 4-64 MB | 35-60 cycles | 50-100 |
| DRAM (local) | GBs | 70-200 ns (~200-600 cycles at 3 GHz) | 20-80 |
| DRAM (remote, NUMA) | — | 2-4× local | 10-40 |
| NVMe SSD | TBs | 50-100 μs | 3-7 |
| SATA SSD | TBs | 100-200 μs | 0.5-0.6 |
| HDD | TBs | 5-10 ms | 0.1-0.3 |
Note: These are rough approximations. Actual values depend on memory frequency, interleaving, prefetcher behavior, and workload patterns. Use perf stat and a bandwidth benchmark (like STREAM) for your specific hardware.
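A dependent pointer chase is the standard way to measure these latencies directly: each load's address comes from the previous load, so hops cannot overlap, and total time divided by hops approximates the latency of whichever level holds the working set. A sketch; the function names and pool size are illustrative.

```c
#include <stdlib.h>

typedef struct node { struct node *next; char pad[56]; } node;  /* one 64-B line */

/* Link the pool into a single random cycle so the hardware prefetcher
   cannot predict the next address. */
node *build_chain(node *pool, size_t n) {
    size_t *perm = malloc(n * sizeof *perm);
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        pool[perm[i]].next = &pool[perm[(i + 1) % n]];
    node *head = &pool[perm[0]];
    free(perm);
    return head;
}

/* Chase the chain; time this (e.g., with rdtsc) and divide by hops. */
node *chase(node *p, size_t hops) {
    for (size_t i = 0; i < hops; i++)
        p = p->next;
    return p;   /* returned so the loop is not optimized away */
}
```

Grow the pool from L1-sized to DRAM-sized and the cycles-per-hop figure steps through the levels of the table above.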
Cache Line Size
On x86-64 processors: 64 bytes. Most ARM64 cores also use 64-byte lines, though some (e.g., Apple M-series) use 128 bytes. This means:
- Any access within an aligned cache-line-sized block pulls the entire block into cache
- False sharing occurs when two threads write to different variables in the same cache line
- Padding shared structures to the line size eliminates false sharing: alignas(64) in C++ (or C11's `<stdalign.h>`)
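A minimal sketch of the effect, assuming POSIX threads and GCC/Clang atomic builtins (the struct and function names are ours): both layouts compute identical totals, but the padded one avoids bouncing a cache line between cores and typically runs severalfold faster.

```c
#include <pthread.h>
#include <stdalign.h>
#include <stdint.h>

enum { ITERS = 10000000 };

struct unpadded { uint64_t a, b; };              /* a and b share one line */
struct padded   { alignas(64) uint64_t a;        /* one 64-B line each */
                  alignas(64) uint64_t b; };

static void *bump(void *counter) {
    uint64_t *c = counter;
    for (int i = 0; i < ITERS; i++)
        __atomic_fetch_add(c, 1, __ATOMIC_RELAXED);
    return NULL;
}

/* Increment *x and *y concurrently from two threads. */
static void run_pair(uint64_t *x, uint64_t *y) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, x);
    pthread_create(&t2, NULL, bump, y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}
```

Time `run_pair(&u.a, &u.b)` against `run_pair(&p.a, &p.b)`, e.g., with the RDTSC wrappers from this appendix.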
TLB Capacity (Approximate, Skylake)
| TLB | Entries | Coverage |
|---|---|---|
| L1 dTLB (4 KB pages) | 64 | 256 KB |
| L1 dTLB (2 MB pages) | 32 | 64 MB |
| L1 iTLB (4 KB pages) | 128 | 512 KB |
| L2 STLB (unified, 4 KB or 2 MB pages) | 1536 | 6 MB / 3 GB |
TLB thrashing occurs when the working set exceeds TLB coverage, causing frequent page walks. Use huge pages (2 MB or 1 GB) to extend TLB coverage for large working sets.
Agner Fog Instruction Tables Summary
Agner Fog maintains detailed per-instruction latency and throughput tables for every microarchitecture since the Pentium. Available at: https://agner.org/optimize/
How to Read the Tables
- Latency: cycles from input available to output ready (dependency chain cost)
- Reciprocal throughput: one instruction every N cycles (parallelism limit)
- Execution ports: which CPU execution units can run this instruction
Example (Skylake):
| Instruction | Latency | Throughput (reciprocal) | Ports |
|---|---|---|---|
| `add r64, r64` | 1 | 0.25 | p0156 |
| `imul r64, r64` | 3 | 1 | p1 |
| `div r64` | 35-90 | 21-74 | p0 p1 p5 p6 |
| `vmovaps ymm, m256` | 5 | 0.5 | p23 |
| `vaddps ymm, ymm, ymm` | 4 | 0.5 | p01 |
| `vmulps ymm, ymm, ymm` | 4 | 0.5 | p01 |
| `vfmadd231ps ymm, ymm, ymm` | 4 | 0.5 | p01 |
Critical Path Optimization
For a loop with dependent instructions:
```asm
; Dependency chain (serial — bad):
vmulps ymm0, ymm1, ymm2   ; latency 4
vaddps ymm0, ymm0, ymm3   ; latency 4, depends on the previous result
; Total: 8 cycles of latency for 8 floats

; Unrolled into two independent chains (parallel — good):
vmulps ymm0, ymm1, ymm2   ; starts cycle 0
vmulps ymm4, ymm5, ymm6   ; starts cycle 0 (no dependency)
vaddps ymm0, ymm0, ymm3   ; starts cycle 4
vaddps ymm4, ymm4, ymm7   ; starts cycle 4 (no dependency)
; Total: still ~8 cycles of latency, but for 16 floats (2× throughput)
```
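The same dependency-breaking idea in C (a sketch; the function names are ours): a reduction with one accumulator is bound by add latency, while several independent accumulators let the pipelined FP units overlap. Compilers only apply this to floats under -ffast-math, because it changes the rounding order.

```c
#include <stddef.h>

/* One accumulator: every add waits on the previous one (latency-bound). */
float dot_1acc(const float *a, const float *b, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Four accumulators: four independent chains keep the FP units busy. */
float dot_4acc(const float *a, const float *b, size_t n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```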
Profiling Strategy Summary
| Question | Tool | Key metric |
|---|---|---|
| Is this CPU-bound or memory-bound? | perf stat | stalled-cycles-backend vs LLC-load-misses |
| Which function is slowest? | perf record / perf report | % of cycles in each function |
| Is branch prediction the problem? | perf stat -e branch-misses | branch-misses / branches |
| Is the TLB causing overhead? | perf stat -e dTLB-load-misses | dTLB-load-misses / dTLB-loads |
| Where are the cache misses? | perf annotate | % of LLC misses per instruction |
| Is memory bandwidth saturated? | STREAM benchmark + perf stat | observed vs. theoretical bandwidth |
| Is vectorization happening? | Compiler Explorer + perf stat -e fp_arith_inst_retired.256b_packed_single | packed vs. scalar FP op counts |
| Exact cycle count for a hot loop? | RDTSC + isolation | cycles per iteration |