Appendix L: Performance Counters and Measurement Reference
This appendix covers the tools and event names for measuring CPU and memory performance, as used in Part VI of this book.
perf stat — Essential Events
perf stat runs a program and reports hardware performance counter totals. Counting your own process works at the default kernel.perf_event_paranoid level (2); system-wide measurement or kernel-space events require root or a lower setting (1 or 0).
```shell
perf stat ./program
perf stat -e event1,event2,... ./program
perf stat -r 5 ./program      # repeat 5 times and report mean ± stddev
perf stat -p PID              # attach to a running process (Ctrl-C to stop)
```
Default Events (always available)
These events are counted by default and need no extra configuration (though hardware events may be unavailable inside some virtual machines):
| Event | Meaning |
|---|---|
| `task-clock` | CPU time used (milliseconds) |
| `context-switches` | Number of OS context switches |
| `cpu-migrations` | Times the process moved to a different CPU |
| `page-faults` | Total page faults (minor + major) |
| `cycles` | CPU cycles elapsed |
| `instructions` | Instructions retired |
| `branches` | Branch instructions |
| `branch-misses` | Branch mispredictions |
| `cache-misses` | Last-level cache misses |
| `cache-references` | Last-level cache accesses |
Derived Metrics
From default perf stat output, compute:
| Metric | Formula | Good value |
|---|---|---|
| IPC (instructions per cycle) | instructions / cycles | > 2.0 (modern out-of-order CPU) |
| Branch miss rate | branch-misses / branches | < 1% |
| LLC miss rate | cache-misses / cache-references | < 10% |
| CPI (cycles per instruction) | cycles / instructions | < 1.0 ideal |
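These ratios can be computed mechanically from perf stat's machine-readable output (`-x,` emits CSV on stderr: field 1 is the count, field 3 the event name). A minimal sketch; the `compute_metrics` function name is ours, and the final line feeds made-up counts so the script runs standalone:

```shell
# compute_metrics: read perf stat -x, CSV on stdin, print IPC and branch miss rate.
compute_metrics() {
    awk -F, '
        $3 == "cycles"        { cycles = $1 }
        $3 == "instructions"  { insns  = $1 }
        $3 == "branches"      { br     = $1 }
        $3 == "branch-misses" { brm    = $1 }
        END {
            printf "IPC: %.2f\n", insns / cycles
            printf "Branch miss rate: %.2f%%\n", 100 * brm / br
        }'
}

# Real usage (perf writes the CSV to stderr):
#   perf stat -x, -e cycles,instructions,branches,branch-misses ./program 2>&1 | compute_metrics

# Standalone demonstration with sample counts:
printf '1000000,,cycles\n2500000,,instructions\n400000,,branches\n4000,,branch-misses\n' | compute_metrics
```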
perf stat — Hardware-Specific Events
These require root access or appropriate permissions. Available events vary by CPU model.
Memory Hierarchy Events
```shell
# The l2_rqsts.* names are Intel-specific symbolic events; check perf list.
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores \
          -e l2_rqsts.all_demand_data_rd,l2_rqsts.demand_data_rd_miss \
          -e LLC-loads,LLC-load-misses \
          ./program
```
| Event | Meaning |
|---|---|
| `L1-dcache-loads` | L1 data cache loads |
| `L1-dcache-load-misses` | L1 data cache load misses |
| `L1-dcache-stores` | L1 data cache stores |
| `L1-icache-loads` | L1 instruction cache loads |
| `L1-icache-load-misses` | L1 instruction cache load misses |
| `LLC-loads` | Last-level cache loads |
| `LLC-load-misses` | Last-level cache load misses (→ DRAM) |
| `LLC-stores` | Last-level cache stores |
| `dTLB-loads` | Data TLB lookups |
| `dTLB-load-misses` | Data TLB misses (→ page walk) |
| `iTLB-loads` | Instruction TLB lookups |
| `iTLB-load-misses` | Instruction TLB misses |
Pipeline Events
```shell
perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend ./program
```

(The two stalled-cycles events are not exposed on every CPU; if perf reports them as not supported, use the CYCLE_ACTIVITY.* events listed below.)
| Event | Meaning |
|---|---|
| `stalled-cycles-frontend` | Cycles where the front-end cannot supply instructions |
| `stalled-cycles-backend` | Cycles where the back-end cannot accept instructions |
| `branch-instructions` | Branch instructions retired |
| `branch-misses` | Branch mispredictions |
Intel-Specific PMU Events
For Intel CPUs, more specific events are available via symbolic names or via raw event codes (cpu/event=0xNN,umask=0xNN/). Use perf list to see all events available on your system.
Commonly Used Intel Events
```shell
# All retired branch instructions (BR_INST_RETIRED.ALL_BRANCHES):
perf stat -e cpu/event=0xc4,umask=0x00/ ./program

# Cache-hit-level breakdown for retired loads (MEM_LOAD_RETIRED.L1_HIT,
# .L2_HIT, .L3_HIT, .L3_MISS), counted together as one group:
perf stat -e '{cpu/event=0xd1,umask=0x01/,cpu/event=0xd1,umask=0x02/,cpu/event=0xd1,umask=0x04/,cpu/event=0xd1,umask=0x20/}' ./program
```
Key Intel Skylake Events
| Event | Code | Description |
|---|---|---|
| `UOPS_ISSUED.ANY` | 0x0e, umask=0x01 | Micro-ops issued by the front-end |
| `UOPS_RETIRED.ALL` | 0xc2, umask=0x01 | Micro-ops retired |
| `UOPS_EXECUTED.THREAD` | 0xb1, umask=0x01 | Micro-ops executed |
| `MACHINE_CLEARS.COUNT` | 0xc3, umask=0x01 | Machine clear events |
| `MEM_LOAD_RETIRED.L1_MISS` | 0xd1, umask=0x08 | L1 miss loads |
| `CYCLE_ACTIVITY.STALLS_L1D_MISS` | 0xa3, umask=0x0c | Stalls waiting for an L1D miss |
| `CYCLE_ACTIVITY.STALLS_L2_MISS` | 0xa3, umask=0x05 | Stalls waiting for an L2 miss |
| `CYCLE_ACTIVITY.STALLS_L3_MISS` | 0xa3, umask=0x06 | Stalls waiting for an L3 miss |
| `CYCLE_ACTIVITY.STALLS_MEM_ANY` | 0xa3, umask=0x14 | Any memory stall |
perf record and perf report
perf record samples the program at regular intervals and records which code was executing.
```shell
# Record at the default sampling frequency (4000 Hz):
perf record ./program
perf report                               # interactive TUI

# Sample at a higher frequency (capped by perf_event_max_sample_rate):
perf record -F 99999 ./program

# Record call graphs (needs frame pointers, or use --call-graph dwarf):
perf record -g ./program
perf report

# Record a specific event:
perf record -e cache-misses:u ./program   # :u = user space only

# Annotate a specific function with per-instruction sample counts:
perf annotate function_name
```
Interpreting perf report
The report shows the percentage of samples that landed in each function. A function consuming 80% of samples is not necessarily slow itself; it may be waiting for memory. Cross-reference with counters:

- High `cycles` share but low LLC miss rate → compute-bound (the function itself is slow)
- High LLC miss rate → memory-bound (the function is waiting for data from DRAM)
- High `stalled-cycles-backend` → memory or dependency stall
- High `stalled-cycles-frontend` → instruction cache misses or branch mispredictions
RDTSC and RDTSCP
The rdtsc (Read Time-Stamp Counter) instruction reads the processor's cycle counter.
Correct RDTSC Usage
```asm
; Serialize before reading (prevent out-of-order execution from moving
; code across the measurement boundary):
lfence
rdtsc                   ; reads TSC: high 32 bits → EDX, low 32 bits → EAX
shl rdx, 32
or  rax, rdx            ; combine into RAX (full 64-bit TSC value)
mov [start_tsc], rax

; ... code to measure ...

lfence
rdtsc
shl rdx, 32
or  rax, rdx            ; RAX = end TSC
sub rax, [start_tsc]    ; elapsed cycles in RAX
```
RDTSCP (with CPU ID)
rdtscp additionally reads the IA32_TSC_AUX MSR into ECX (Linux stores the CPU number there), which lets you detect measurements that straddled a CPU migration:

```asm
rdtscp          ; TSC → EDX:EAX, IA32_TSC_AUX (CPU id) → ECX
shl rdx, 32
or  rax, rdx
; Compare the ECX values from the start and end readings; discard the
; sample if they differ (the process migrated mid-measurement)
```
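In C, the same check can be done with the `__rdtscp` compiler intrinsic; a minimal sketch (the wrapper name and usage variables are ours):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */

/* Read the TSC; *aux receives IA32_TSC_AUX, which Linux fills with the
   current CPU number, so a start/end mismatch flags a migration. */
static inline uint64_t rdtscp_cycles(uint32_t *aux) {
    return __rdtscp(aux);
}

/* Usage:
     uint32_t cpu_start, cpu_end;
     uint64_t t0 = rdtscp_cycles(&cpu_start);
     ... measured code ...
     uint64_t t1 = rdtscp_cycles(&cpu_end);
     if (cpu_start != cpu_end)
         ;  // migrated mid-measurement: discard this sample
*/
```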
Notes on RDTSC
- On CPUs with an invariant TSC (everything recent), the counter increments at a fixed reference frequency (e.g., the 3.0 GHz nominal rate), not the current boosted frequency: it measures wall time in reference cycles, not actual core cycles
- On modern CPUs (Skylake+), the reference frequency is reported via CPUID leaf 0x15
- `rdtsc` is not privileged (unless the OS sets CR4.TSD), so it can be used from user space
- The `lfence` before and after is necessary for accurate measurement because out-of-order execution can otherwise reorder the `rdtsc` relative to the measured code
- Minimum overhead: approximately 20-40 cycles per measurement pair
C Wrapper
```c
#include <stdint.h>

static inline uint64_t rdtsc_start(void) {
    uint32_t lo, hi;
    __asm__ volatile (
        "lfence\n\t"
        "rdtsc\n\t"
        "lfence"
        : "=a"(lo), "=d"(hi)
        :
        : "memory"   /* keep the compiler from moving measured code
                        across the read */
    );
    return ((uint64_t)hi << 32) | lo;
}

static inline uint64_t rdtsc_end(void) {
    uint32_t lo, hi;
    __asm__ volatile (
        "lfence\n\t"
        "rdtsc"
        : "=a"(lo), "=d"(hi)
        :
        : "memory"
    );
    return ((uint64_t)hi << 32) | lo;
}
```
```c
// Usage:
uint64_t start = rdtsc_start();
// ... measured code ...
uint64_t elapsed = rdtsc_end() - start;
```
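A practical refinement, sketched here with the intrinsic equivalents of the wrappers above so the block is self-contained: repeat the measurement and keep the minimum, which discards samples inflated by interrupts, context switches, and frequency ramp-up. The function names are ours.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence */

/* Time fn() reps times; return the smallest cycle count observed. */
static uint64_t min_cycles(void (*fn)(void), int reps) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < reps; i++) {
        _mm_lfence();
        uint64_t t0 = __rdtsc();
        _mm_lfence();
        fn();                    /* code under test */
        _mm_lfence();
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Example payload: an empty function with a compiler barrier. */
static void noop(void) { __asm__ volatile ("" ::: "memory"); }
```

The minimum approximates the true cost; the mean mixes in scheduler noise.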
Memory Bandwidth and Latency Reference
Approximate Memory Hierarchy Latencies (Intel Skylake / AMD Zen 3)
| Level | Size | Latency | Bandwidth (GB/s) |
|---|---|---|---|
| L1 data cache | 32-64 KB | 4-5 cycles | 200-300 |
| L2 cache | 256 KB - 1 MB | 12-14 cycles | 100-200 |
| L3 cache (LLC) | 4-64 MB | 35-60 cycles | 50-100 |
| DRAM (local) | GBs | 70-200 ns (~200-600 cycles at 3 GHz) | 20-80 |
| DRAM (remote, NUMA) | — | 2-4× local | 10-40 |
| NVMe SSD | TBs | 50-100 μs | 3-7 |
| SATA SSD | TBs | 100-200 μs | 0.5-0.6 |
| HDD | TBs | 5-10 ms | 0.1-0.3 |
Note: These are rough approximations. Actual values depend on memory frequency, interleaving, prefetcher behavior, and workload patterns. Use perf stat and a bandwidth benchmark (like STREAM) for your specific hardware.
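A dependent pointer chase is the standard way to measure these latencies directly: each load's address comes from the previous load, so hops cannot overlap, and total time divided by hops approximates the latency of whichever level holds the working set. A sketch; the function names and pool size are illustrative.

```c
#include <stdlib.h>

typedef struct node { struct node *next; char pad[56]; } node;  /* one 64-B line */

/* Link the pool into a single random cycle so the hardware prefetcher
   cannot predict the next address. */
node *build_chain(node *pool, size_t n) {
    size_t *perm = malloc(n * sizeof *perm);
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        pool[perm[i]].next = &pool[perm[(i + 1) % n]];
    node *head = &pool[perm[0]];
    free(perm);
    return head;
}

/* Chase the chain; time this (e.g., with rdtsc) and divide by hops. */
node *chase(node *p, size_t hops) {
    for (size_t i = 0; i < hops; i++)
        p = p->next;
    return p;   /* returned so the loop is not optimized away */
}
```

Grow the pool from L1-sized to DRAM-sized and the cycles-per-hop figure steps through the levels of the table above.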
Cache Line Size
On x86-64 processors: 64 bytes. Most ARM64 cores also use 64-byte lines, though some (e.g., Apple M-series) use 128 bytes. This means:
- Any access within an aligned cache-line-sized block pulls the entire block into cache
- False sharing occurs when two threads write to different variables in the same cache line
- Padding shared structures to the line size eliminates false sharing: alignas(64) in C++ (or C11's `<stdalign.h>`)
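A minimal sketch of the effect, assuming POSIX threads and GCC/Clang atomic builtins (the struct and function names are ours): both layouts compute identical totals, but the padded one avoids bouncing a cache line between cores and typically runs severalfold faster.

```c
#include <pthread.h>
#include <stdalign.h>
#include <stdint.h>

enum { ITERS = 10000000 };

struct unpadded { uint64_t a, b; };              /* a and b share one line */
struct padded   { alignas(64) uint64_t a;        /* one 64-B line each */
                  alignas(64) uint64_t b; };

static void *bump(void *counter) {
    uint64_t *c = counter;
    for (int i = 0; i < ITERS; i++)
        __atomic_fetch_add(c, 1, __ATOMIC_RELAXED);
    return NULL;
}

/* Increment *x and *y concurrently from two threads. */
static void run_pair(uint64_t *x, uint64_t *y) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, x);
    pthread_create(&t2, NULL, bump, y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}
```

Time `run_pair(&u.a, &u.b)` against `run_pair(&p.a, &p.b)`, e.g., with the RDTSC wrappers from this appendix.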
TLB Capacity (Approximate, Skylake)
| TLB | Entries | Coverage |
|---|---|---|
| L1 dTLB (4 KB pages) | 64 | 256 KB |
| L1 dTLB (2 MB pages) | 32 | 64 MB |
| L1 iTLB (4 KB pages) | 128 | 512 KB |
| L2 STLB (unified, 4 KB or 2 MB pages) | 1536 | 6 MB / 3 GB |
TLB thrashing occurs when the working set exceeds TLB coverage, causing frequent page walks. Use huge pages (2 MB or 1 GB) to extend TLB coverage for large working sets.
Agner Fog Instruction Tables Summary
Agner Fog maintains detailed per-instruction latency and throughput tables for every microarchitecture since the Pentium. Available at: https://agner.org/optimize/
How to Read the Tables
- Latency: cycles from input available to output ready (dependency chain cost)
- Reciprocal throughput: one instruction every N cycles (parallelism limit)
- Execution ports: which CPU execution units can run this instruction
Example (Skylake):
| Instruction | Latency | Throughput (reciprocal) | Ports |
|---|---|---|---|
| `add r64, r64` | 1 | 0.25 | p0156 |
| `imul r64, r64` | 3 | 1 | p1 |
| `div r64` | 35-90 | 21-74 | p0 p1 p5 p6 |
| `vmovaps ymm, m256` | 5 | 0.5 | p23 |
| `vaddps ymm, ymm, ymm` | 4 | 0.5 | p01 |
| `vmulps ymm, ymm, ymm` | 4 | 0.5 | p01 |
| `vfmadd231ps ymm, ymm, ymm` | 4 | 0.5 | p01 |
Critical Path Optimization
For a loop with dependent instructions:
```asm
; Dependency chain (serial — bad):
vmulps ymm0, ymm1, ymm2   ; latency 4
vaddps ymm0, ymm0, ymm3   ; latency 4, depends on the previous result
; Total: 8 cycles of latency for 8 floats

; Unrolled into two independent chains (parallel — good):
vmulps ymm0, ymm1, ymm2   ; starts cycle 0
vmulps ymm4, ymm5, ymm6   ; starts cycle 0 (no dependency)
vaddps ymm0, ymm0, ymm3   ; starts cycle 4
vaddps ymm4, ymm4, ymm7   ; starts cycle 4 (no dependency)
; Total: still ~8 cycles of latency, but for 16 floats (2× throughput)
```
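The same dependency-breaking idea in C (a sketch; the function names are ours): a reduction with one accumulator is bound by add latency, while several independent accumulators let the pipelined FP units overlap. Compilers only apply this to floats under -ffast-math, because it changes the rounding order.

```c
#include <stddef.h>

/* One accumulator: every add waits on the previous one (latency-bound). */
float dot_1acc(const float *a, const float *b, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Four accumulators: four independent chains keep the FP units busy. */
float dot_4acc(const float *a, const float *b, size_t n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```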
Profiling Strategy Summary
| Question | Tool | Key metric |
|---|---|---|
| Is this CPU-bound or memory-bound? | perf stat | stalled-cycles-backend vs LLC-load-misses |
| Which function is slowest? | perf record / perf report | % of cycles in each function |
| Is branch prediction the problem? | perf stat -e branch-misses | branch-misses / branches |
| Is the TLB causing overhead? | perf stat -e dTLB-load-misses | dTLB-load-misses / dTLB-loads |
| Where are the cache misses? | perf annotate | % of LLC misses per instruction |
| Is memory bandwidth saturated? | STREAM benchmark + perf stat | observed vs. theoretical bandwidth |
| Is vectorization happening? | Compiler Explorer + perf stat -e fp_arith_inst_retired.256b_packed_single | packed vs. scalar FP op counts |
| Exact cycle count for a hot loop? | RDTSC + isolation | cycles per iteration |