Appendix L: Performance Counters and Measurement Reference

This appendix covers the tools and event names for measuring CPU and memory performance, as used in Part VI of this book.


perf stat — Essential Events

perf stat runs a program and reports hardware performance counters. Counting your own process's user-space events works at the kernel's default perf_event_paranoid level (2); kernel-side or system-wide measurement requires root or perf_event_paranoid set to 0 or 1.

perf stat ./program
perf stat -e event1,event2,... ./program
perf stat -r 5 ./program          # repeat 5 times and average
perf stat -p PID                  # attach to running process

Default Events (always available)

These events are available on any Linux system without root (hardware events such as cycles and instructions may be unavailable inside some virtual machines):

Event               Meaning
task-clock          CPU time used (milliseconds)
context-switches    Number of OS context switches
cpu-migrations      Times the process moved to a different CPU
page-faults         Total page faults (minor + major)
cycles              CPU cycles elapsed
instructions        Instructions retired
branches            Branch instructions
branch-misses       Branch mispredictions
cache-misses        Last-level cache misses
cache-references    Last-level cache accesses

Derived Metrics

From default perf stat output, compute:

Metric                        Formula                          Good value
IPC (instructions per cycle)  instructions / cycles            > 2.0 (modern out-of-order CPU)
Branch miss rate              branch-misses / branches         < 1%
LLC miss rate                 cache-misses / cache-references  < 10%
CPI (cycles per instruction)  cycles / instructions            < 1.0 (reciprocal of IPC)

perf stat — Hardware-Specific Events

Available events vary by CPU model and kernel version; some require root or a lowered perf_event_paranoid setting. Run perf list to see what your system exposes.

Memory Hierarchy Events

perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores \
          -e l2_rqsts.all_demand_data_rd,l2_rqsts.demand_data_rd_miss \
          -e LLC-loads,LLC-load-misses \
          ./program
Event                   Meaning
L1-dcache-loads         L1 data cache loads
L1-dcache-load-misses   L1 data cache load misses
L1-dcache-stores        L1 data cache stores
L1-icache-loads         L1 instruction cache loads
L1-icache-load-misses   L1 instruction cache load misses
LLC-loads               Last-level cache loads
LLC-load-misses         Last-level cache load misses (→ DRAM)
LLC-stores              Last-level cache stores
dTLB-loads              Data TLB lookups
dTLB-load-misses        Data TLB misses (→ page walk)
iTLB-loads              Instruction TLB lookups
iTLB-load-misses        Instruction TLB misses

Pipeline Events

perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend ./program
Event                    Meaning
stalled-cycles-frontend  Cycles where the front-end cannot supply instructions
stalled-cycles-backend   Cycles where the back-end cannot accept instructions
branch-instructions      Branch instructions retired
branch-misses            Branch mispredictions

Note: many recent Intel CPUs do not expose the two stall events (perf prints <not supported>); the CYCLE_ACTIVITY.* events listed under the Intel-specific events cover similar ground.

Intel-Specific PMU Events

For Intel CPUs, additional model-specific events are available, either by name or as raw event/umask codes. (Intel PT is a separate tracing facility, not a source of counter events.) Use perf list to see all available events on your system.

Commonly Used Intel Events

# All retired branch instructions (BR_INST_RETIRED.ALL_BRANCHES):
perf stat -e cpu/event=0xc4,umask=0x00/ ./program

# Detailed cache-hit breakdown, counted as one group so the ratios are
# taken over the same interval: MEM_LOAD_RETIRED.L1_HIT (umask 0x01),
# L2_HIT (0x02), L3_HIT (0x04), L3_MISS (0x20):
perf stat -e '{cpu/event=0xd1,umask=0x01/,cpu/event=0xd1,umask=0x02/,cpu/event=0xd1,umask=0x04/,cpu/event=0xd1,umask=0x20/}' ./program

Key Intel Skylake Events

Event                           Code              Description
UOPS_ISSUED.ANY                 0x0e, umask=0x01  Micro-ops issued by the front-end
UOPS_RETIRED.ALL                0xc2, umask=0x01  Micro-ops retired
UOPS_EXECUTED.THREAD            0xb1, umask=0x01  Micro-ops executed
MACHINE_CLEARS.COUNT            0xc3, umask=0x01  Machine clear events
MEM_LOAD_RETIRED.L1_MISS        0xd1, umask=0x08  L1 miss loads
CYCLE_ACTIVITY.STALLS_L1D_MISS  0xa3, umask=0x0c  Stalls waiting for an L1D miss
CYCLE_ACTIVITY.STALLS_L2_MISS   0xa3, umask=0x05  Stalls waiting for an L2 miss
CYCLE_ACTIVITY.STALLS_L3_MISS   0xa3, umask=0x06  Stalls waiting for an L3 miss
CYCLE_ACTIVITY.STALLS_MEM_ANY   0xa3, umask=0x14  Any memory stall

perf record and perf report

perf record samples the program at regular intervals and records which code was executing.

# Record at default frequency (4000 Hz):
perf record ./program
perf report                     # interactive TUI

# Record at higher frequency:
perf record -F 99999 ./program

# Record with call graph (requires frame pointers or DWARF):
perf record -g ./program
perf report --call-graph callee

# Record specific events:
perf record -e cache-misses:u ./program    # :u = user space only

# Annotate specific function:
perf annotate function_name

Interpreting perf report

The report shows percentages of samples in each function. A function consuming 80% of samples is not necessarily slow itself — it may be waiting for memory. Use:

  • High cycles % but low LLC-miss % → compute-bound (the function is slow itself)
  • High LLC-miss % → memory-bound (the function is waiting for data from DRAM)
  • High stalled-cycles-backend → memory or dependency stall
  • High stalled-cycles-frontend → instruction cache miss or branch misprediction

RDTSC and RDTSCP

The rdtsc (Read Time-Stamp Counter) instruction reads the processor's cycle counter.

Correct RDTSC Usage

; Serialize before reading (prevent out-of-order execution from moving
; code across the measurement boundary):
lfence
rdtsc               ; reads TSC: high 32 bits → EDX, low 32 bits → EAX
shl     rdx, 32
or      rax, rdx    ; combine into RAX (full 64-bit TSC value)
mov     [start_tsc], rax

; ... code to measure ...

lfence
rdtsc
shl     rdx, 32
or      rax, rdx
mov     [end_tsc], rax

mov     rax, [end_tsc]
sub     rax, [start_tsc]   ; elapsed cycles in RAX

RDTSCP (with CPU ID)

rdtscp additionally reads the IA32_TSC_AUX value (set by the OS to the processor ID) into ECX, and waits until all prior instructions have executed before reading the TSC. Comparing the ECX values from the start and end readings detects whether the measurement spanned a CPU migration:

rdtscp              ; TSC → EDX:EAX, processor ID → ECX
shl     rdx, 32
or      rax, rdx
; Check ECX matches the same CPU between start and end

Notes on RDTSC

  • The counter increments at a constant reference frequency (e.g., 3.0 GHz nominal), not the current boosted or throttled core frequency (the "invariant TSC"), so TSC ticks measure elapsed time rather than executed core cycles
  • On modern CPUs (Skylake+), the TSC frequency can be derived from CPUID leaf 0x15
  • rdtsc is unprivileged by default in both 32-bit and 64-bit modes (the OS can restrict it via CR4.TSD); it can be used in user space
  • The lfence before and after is necessary for accurate measurement because out-of-order execution can otherwise reorder the rdtsc relative to the measured code
  • Minimum overhead: approximately 20-40 cycles per measurement pair

C Wrapper

#include <stdint.h>

static inline uint64_t rdtsc_start(void) {
    uint32_t lo, hi;
    __asm__ volatile (
        "lfence\n\t"
        "rdtsc\n\t"
        "lfence"
        : "=a"(lo), "=d"(hi)
    );
    return ((uint64_t)hi << 32) | lo;
}

static inline uint64_t rdtsc_end(void) {
    uint32_t lo, hi;
    __asm__ volatile (
        "lfence\n\t"
        "rdtsc"
        : "=a"(lo), "=d"(hi)
    );
    return ((uint64_t)hi << 32) | lo;
}

// Usage:
uint64_t start = rdtsc_start();
// ... measured code ...
uint64_t elapsed = rdtsc_end() - start;

Memory Bandwidth and Latency Reference

Approximate Memory Hierarchy Latencies (Intel Skylake / AMD Zen 3)

Level               Size           Latency                               Bandwidth (GB/s)
L1 data cache       32-64 KB       4-5 cycles                            200-300
L2 cache            256 KB - 1 MB  12-14 cycles                          100-200
L3 cache (LLC)      4-64 MB        35-60 cycles                          50-100
DRAM (local)        GBs            100-300 ns (~200-600 cycles @ 3 GHz)  20-80
DRAM (remote NUMA)  GBs            2-4× local                            10-40
NVMe SSD            TBs            50-100 μs                             3-7
SATA SSD            TBs            100-200 μs                            0.5-0.6
HDD                 TBs            5-10 ms                               0.1-0.3

Note: These are rough approximations. Actual values depend on memory frequency, interleaving, prefetcher behavior, and workload patterns. Use perf stat and a bandwidth benchmark (like STREAM) for your specific hardware.

Cache Line Size

On modern x86-64 processors the cache line is 64 bytes; most ARM64 designs also use 64 bytes, though some (e.g., Apple's M-series) use 128. This means:

  • Any access within a 64-byte aligned block pulls the entire block into cache
  • False sharing occurs when two threads write to different variables in the same cache line
  • Padding structures to 64 bytes eliminates false sharing: alignas(64) in C++ (stdalign.h in C11)

TLB Capacity (Approximate, Skylake)

TLB                    Entries  Coverage (4 KB pages)  Coverage (2 MB pages)
L1 dTLB                64       256 KB                 128 MB
L1 iTLB                128      512 KB                 256 MB
L2 TLB (shared d + i)  1536     6 MB                   3 GB

(On Skylake the second-level TLB is a single structure shared between data and instruction translations.)

TLB thrashing occurs when the working set exceeds TLB coverage, causing frequent page walks. Use huge pages (2 MB or 1 GB) to extend TLB coverage for large working sets.


Agner Fog Instruction Tables Summary

Agner Fog maintains detailed per-instruction latency and throughput tables for every microarchitecture since the Pentium. Available at: https://agner.org/optimize/

How to Read the Tables

  • Latency: cycles from input available to output ready (dependency chain cost)
  • Reciprocal throughput: one instruction every N cycles (parallelism limit)
  • Execution ports: which CPU execution units can run this instruction

Example (Skylake):

Instruction                Latency  Reciprocal throughput  Ports
add r64, r64               1        0.25                   p0156
imul r64, r64              3        1                      p1
div r64                    35-90    21-74                  p0 p1 p5 p6
vmovaps ymm, m256          5        0.5                    p23
vaddps ymm, ymm, ymm       4        0.5                    p01
vmulps ymm, ymm, ymm       4        0.5                    p01
vfmadd231ps ymm, ymm, ymm  4        0.5                    p01

Critical Path Optimization

For a loop with dependent instructions:

; Dependency chain (serial — bad):
vmulps  ymm0, ymm1, ymm2    ; latency 4
vaddps  ymm0, ymm0, ymm3    ; latency 4; depends on previous instruction
; Total: 8 cycles of latency for 8 floats (one ymm register)

; Unrolled to break the dependency chain (parallel — good):
vmulps  ymm0, ymm1, ymm2    ; starts cycle 0
vmulps  ymm4, ymm5, ymm6    ; starts cycle 0 (no dependency)
vaddps  ymm0, ymm0, ymm3    ; starts cycle 4
vaddps  ymm4, ymm4, ymm7    ; starts cycle 4 (no dependency)
; Total: ~8 cycles for 16 floats (2× throughput improvement)

Profiling Strategy Summary

Question                            Tool                           Key metric
Is this CPU-bound or memory-bound?  perf stat                      stalled-cycles-backend vs LLC-load-misses
Which function is slowest?          perf record / perf report      % of cycles in each function
Is branch prediction the problem?   perf stat -e branch-misses     branch-misses / branches
Is TLB causing overhead?            perf stat -e dTLB-load-misses  dTLB-load-misses / dTLB-loads
Where are the cache misses?         perf annotate                  % LLC misses per instruction
Is memory bandwidth saturated?      STREAM benchmark + perf stat   observed vs. theoretical bandwidth
Is vectorization happening?         Compiler Explorer + perf stat -e fp_arith_inst_retired.256b_packed_single   packed vs. scalar FP ops
Exact cycle count for a hot loop?   RDTSC + isolation              cycles per iteration