Chapter 27: Memory Management
The Lie Every Pointer Tells
When your program accesses memory address 0x7fff5fbff8a0, the CPU does not go directly to that location in DRAM. Instead, it performs a hardware page table walk — traversing four levels of tables, each holding 512 entries — to translate that virtual address into a physical address. The physical address might be anywhere in RAM. Or the page might not be in RAM at all, triggering a page fault. The hardware does this on every single memory access, every load, every store, every instruction fetch.
This mechanism — virtual memory, backed by page tables, enforced by the Memory Management Unit (MMU) — is the abstraction that makes modern operating systems possible. It provides isolation between processes (your process cannot accidentally write to mine), protection enforcement (code pages cannot be written, kernel memory cannot be accessed from user space), and the illusion of more memory than physically exists (demand paging, swapping). Understanding it at the bit level is prerequisite for kernel development, security research, and diagnosing any memory-related crash.
The MMU and Page Tables
The Memory Management Unit is a hardware component built into the CPU that translates virtual addresses to physical addresses on every memory access. It consults page tables — data structures maintained by the OS kernel in physical memory — to perform this translation. The OS controls what mappings exist by modifying the page tables; the hardware enforces those mappings on every access.
The fundamental unit is the page: a contiguous block of memory, typically 4KB (4096 bytes). The mapping granularity is one page: you map a 4KB virtual page to a 4KB physical page frame.
x86-64 Four-Level Page Tables
x86-64 uses a four-level hierarchy to translate 48-bit virtual addresses (in the current standard configuration) to 52-bit physical addresses:
Virtual Address (48 bits used of 64):
 63      48 | 47   39 | 38   30 | 29   21 | 20   12 | 11    0
  sign-ext  |  PML4   |   PDP   |   PD    |   PT    | offset
  (16 bits) |  (9b)   |  (9b)   |  (9b)   |  (9b)   |  (12b)
Level Bits Entries Entry points to
------ ---- ------- ---------------
PML4 47:39 512 Page Directory Pointer (PDP) tables
PDP 38:30 512 Page Directory (PD) tables
PD 29:21 512 Page Tables (PT) — or 2MB huge pages
PT 20:12 512 4KB physical pages
Offset 11:0 — Byte within the 4KB page
The virtual address is broken into five fields. Each field is an index into the corresponding level's table (512 entries = 9 bits, since 2⁹ = 512). The physical address is:
Physical address = page_table_entry[PT_index].physical_page_number × 4096 + offset
The CPU finds the top-level PML4 table from CR3 — the page table base register. CR3 holds the physical address of the PML4 table (aligned to 4KB). On a context switch, the OS writes a new value to CR3, and the new process's address space takes effect immediately.
Page Table Entry Format (64-bit)
Each entry in PML4, PDP, PD, and PT is 8 bytes (64 bits):
63|62    52|51          12|11 9| 8| 7| 6| 5|  4|  3|  2|  1| 0
NX|  avail | Physical PPN | AVL| G|PS| D| A|PCD|PWT|U/S|R/W| P
Bit 0 (P) : Present — 1 = entry is valid; 0 = page not mapped
Bit 1 (R/W) : Read/Write — 0 = read-only; 1 = readable and writable
Bit 2 (U/S) : User/Supervisor — 0 = kernel only; 1 = user accessible
Bit 3 (PWT) : Page-level Write-Through — cache write policy
Bit 4 (PCD) : Page-level Cache Disable
Bit 5 (A) : Accessed — set by CPU on any access to the page (read or write); the OS clears and re-samples it to approximate LRU for page eviction
Bit 6 (D) : Dirty — set by CPU when page is written (PTE only, not PML4/PDP/PD)
Bit 7 (PS) : Page Size — in PD: 0 = 4KB pages; 1 = 2MB huge page
Bits 11:8 : AVL — available for OS use (ignored by hardware)
Bits 51:12 : Physical Page Number (PPN) — physical address >> 12
Bit 63 (NX) : No-Execute — if EFER.NXE=1: set this to prevent execution of this page
⚙️ How It Works: The CPU caches recent translations in the TLB (Translation Lookaside Buffer). If the virtual address is in the TLB, the page walk is skipped entirely — the physical address comes from the cache. TLB misses trigger the full 4-level walk, which accesses 4 separate memory locations (one per level). On an L1-cached walk this costs ~16 cycles; on a cold walk it can cost 100+ cycles.
A Page Table Walk in Assembly
Here is what the hardware does on every memory access (simplified, ignoring TLB):
; Hardware page table walk for virtual address in RAX
; CR3 = physical address of PML4
; Returns physical address in RAX, or faults if not mapped
page_table_walk:
; Step 1: Extract PML4 index (bits 47:39)
mov rbx, rax
shr rbx, 39
and rbx, 0x1FF ; 9-bit index
; Step 2: Read PML4 entry
mov rcx, cr3 ; PML4 physical base
and rcx, ~0xFFF ; clear low 12 bits (flags)
mov rdx, [rcx + rbx*8] ; read PML4 entry
test rdx, 1 ; P bit set?
jz .page_fault ; not present
; Step 3: Extract PDP index (bits 38:30)
mov rbx, rax
shr rbx, 30
and rbx, 0x1FF
and rdx, ~0xFFF ; extract PDP physical address (clear flags)
mov rdx, [rdx + rbx*8] ; read PDP entry
test rdx, 1
jz .page_fault
; Step 4: Extract PD index (bits 29:21)
mov rbx, rax
shr rbx, 21
and rbx, 0x1FF
; Check for a 1GB huge page (PS bit in the PDP entry)
test rdx, (1<<7) ; PS bit
jnz .huge_1gb ; 1GB huge page (PDP entry with PS=1)
and rdx, ~0xFFF
mov rdx, [rdx + rbx*8] ; read PD entry
test rdx, 1
jz .page_fault
test rdx, (1<<7) ; PS bit in PD entry
jnz .huge_2mb ; 2MB huge page
; Step 5: Extract PT index (bits 20:12)
mov rbx, rax
shr rbx, 12
and rbx, 0x1FF
and rdx, ~0xFFF
mov rdx, [rdx + rbx*8] ; read PT entry
test rdx, 1
jz .page_fault
; Step 6: Combine physical page address with page offset
and rdx, ~0xFFF ; clear flags, keep PPN
and rax, 0xFFF ; extract 12-bit offset
or rax, rdx ; physical address
ret
.huge_2mb:
; 2MB page: use bits 20:0 as offset
and rdx, ~0x1FFFFF ; clear low 21 bits (flags + extra offset bits)
and rax, 0x1FFFFF ; 21-bit offset
or rax, rdx
ret
.huge_1gb:
; 1GB page: use bits 29:0 as offset
and rdx, ~0x3FFFFFFF ; clear low 30 bits (flags + extra offset bits)
and rax, 0x3FFFFFFF ; 30-bit offset
or rax, rdx
ret
.page_fault:
; Set CR2 to the faulting address and fire exception 14
; (hardware does this automatically)
Virtual Address Space Layout (Linux x86-64)
Linux divides the 48-bit virtual address space into user space (lower half) and kernel space (upper half):
Virtual Address Space Layout (Linux x86-64, 4-level paging):
0x0000000000000000 ┌─────────────────────────────────┐
│ User Space (~128 TB) │
│ Addresses: 0x0 to 0x7FFF... │
│ │
│ Text (code) │
│ Data / BSS │
│ Heap (grows up from end of BSS) │
│ Memory-mapped files / libraries │
│ Stack (grows down from top) │
│ vDSO / vvar │
0x00007FFFFFFFFFFF └─────────────────────────────────┘
│ Canonical hole (47-bit sign-ext)│
0xFFFF800000000000 ┌─────────────────────────────────┐
│ Kernel Space │
│ Direct physical memory map │
│ Kernel text, data, BSS │
│ vmalloc region │
│ kernel modules │
0xFFFFFFFFFFFFFFFF └─────────────────────────────────┘
⚙️ How It Works: Addresses between 0x0000800000000000 and 0xFFFF7FFFFFFFFFFF are "non-canonical" — the upper 16 bits are not a sign extension of bit 47. Accessing one causes a #GP fault immediately. This creates a "canonical hole" that separates user space from kernel space without needing any page table entries.
Viewing the Live Address Space
# View your shell's virtual address space
cat /proc/self/maps
# Example output:
55c8e6e45000-55c8e6e47000 r-xp 00000000 fd:01 7340177 /bin/bash (text)
55c8e6e47000-55c8e6e48000 r--p 00002000 fd:01 7340177 /bin/bash (data, read-only)
55c8e6e48000-55c8e6e4a000 rw-p 00003000 fd:01 7340177 /bin/bash (data, read-write)
55c8e703a000-55c8e705b000 rw-p 00000000 00:00 0 [heap]
7f4b7c000000-7f4b7c021000 rw-p 00000000 00:00 0 [anon mapping]
7f4b7e3e0000-7f4b7e5b8000 r-xp 00000000 fd:01 2884 /lib/x86_64-linux-gnu/libc-2.33.so
...
7fff3da00000-7fff3da21000 rw-p 00000000 00:00 0 [stack]
7fff3dbf4000-7fff3dbf8000 r--p 00000000 00:00 0 [vvar]
7fff3dbf8000-7fff3dbfa000 r-xp 00000000 00:00 0 [vdso]
Each line: start_addr-end_addr permissions offset dev inode pathname. Permissions: r=read, w=write, x=execute, p=private (copy-on-write), s=shared.
ASLR: Address Space Layout Randomization
ASLR randomizes the base addresses of the heap, stack, and mapped libraries on each execution. This makes exploitation harder — you cannot hardcode addresses in shellcode or ROP chains.
# Check if ASLR is enabled
cat /proc/sys/kernel/randomize_va_space
# 2 = full ASLR (default), 1 = stack only, 0 = disabled
# Compare addresses across runs:
ldd /bin/ls # shows library addresses (randomized with ASLR)
ldd /bin/ls # run again — different addresses!
From the assembly perspective, ASLR means:
- Code in position-independent executables (PIE) uses RIP-relative addressing
- The GOT (Global Offset Table) stores actual addresses of external symbols at runtime
- Hardcoded absolute addresses in shellcode will not work with ASLR
- Stack-based exploits need to defeat ASLR before constructing ROP chains
How malloc() Really Works
The heap allocation API hides a complex system. Let's trace what actually happens:
void *p = malloc(16); // What does this actually do?
For a small allocation (less than ~128KB by default), glibc uses its internal heap:
- glibc maintains a heap (called the "arena") starting at the program break
- The first call to malloc may call sys_brk to extend the heap by a large chunk
- glibc divides this chunk into allocation bins of various sizes
- It finds a free chunk that fits, marks it as allocated, and returns a pointer to the user data
- The 8 bytes before the returned pointer contain a header (size + flags)
For a large allocation (>= 128KB by default), glibc calls mmap directly:
; What glibc does for malloc(200000):
mov rax, 9 ; sys_mmap
xor rdi, rdi ; addr = NULL
mov rsi, 200704 ; length = 200000 + chunk header, rounded up to a page boundary (49 pages)
mov rdx, 3 ; PROT_READ | PROT_WRITE
mov r10, 0x22 ; MAP_PRIVATE | MAP_ANONYMOUS
mov r8, -1 ; fd = -1
xor r9, r9 ; offset = 0
syscall
; rax = address of new mapping; glibc stores metadata just before it
The Heap Chunk Header
glibc's malloc stores a header before each allocation. This is why buffer overflows that corrupt the heap are exploitable:
Heap layout (simplified glibc malloc):
┌──────────────────────────────────┐
│ prev_size (8 bytes, if free) │ ← only valid when the previous chunk is free
├──────────────────────────────────┤
│ size (8 bytes) │ ← size of THIS chunk (includes header)
│ bit 0: PREV_INUSE flag │ bit 0=1 means previous chunk is in use
│ bit 1: IS_MMAPPED flag │ bit 1=1 means this was mmap'd
│ bit 2: NON_MAIN_ARENA flag │
├──────────────────────────────────┤ ← malloc() returns pointer to HERE
│ User data │
│ (n bytes) │
│ │
└──────────────────────────────────┘
│ (next chunk header follows) │
Writing past the end of a heap allocation overwrites the next chunk's header. The size and PREV_INUSE fields are used by free() to recombine adjacent free chunks (coalescing). Corrupting these fields is the basis of heap exploitation — a topic returned to in Part VII.
Memory-Mapped Files
mmap with a file descriptor creates a mapping backed by file data:
; Map the file /etc/passwd read-only into memory
; mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0)
; First: open the file
mov rax, 2 ; sys_open
mov rdi, etc_passwd ; "/etc/passwd"
xor rsi, rsi ; O_RDONLY
xor rdx, rdx
syscall
mov r12, rax ; save fd
; Get file size with fstat
mov rax, 5 ; sys_fstat
mov rdi, r12
lea rsi, [stat_buf]
syscall
; stat_buf + 48 = st_size (64-bit)
mov r13, [stat_buf + 48] ; file size
; Map the file
mov rax, 9 ; sys_mmap
xor rdi, rdi ; addr = kernel chooses
mov rsi, r13 ; length = file size
mov rdx, 1 ; PROT_READ
mov r10, 2 ; MAP_PRIVATE (copy-on-write)
mov r8, r12 ; fd
xor r9, r9 ; offset = 0
syscall
; rax = virtual address of file contents
; Now you can read the file like memory:
; byte at rax+0 is byte 0 of the file
; byte at rax+r13-1 is the last byte of the file
; No read() calls needed!
MAP_PRIVATE — changes are private to this process (copy-on-write). Writes do not propagate to the file. MAP_SHARED — changes are visible to all processes mapping the same file, and eventually written back to disk.
Memory-mapped I/O is also used for device registers. The local APIC, for example, lives at a physical address range that the kernel direct-maps into its own address space; user space can reach such regions only through mechanisms like mmap on /dev/mem, where permitted.
Page Faults in Detail
The kernel's page fault handler is one of the most complex pieces of code in any OS. It must distinguish:
- Valid but not-present page: The page is in the process's virtual address space (has a VMA — Virtual Memory Area) but is not currently in RAM. Could be:
  - Demand paging: first access to a page that was never loaded (file-backed)
  - Swap: the page was evicted to swap space and must be swapped back in
  - Guard page: intentionally unmapped page (stack overflow detection)
- Copy-on-write (COW): A write to a read-only page that is marked as COW (typically after fork). The kernel allocates a new physical page, copies the content, and remaps the virtual page with write permission.
- Stack growth: Access slightly below the current stack pointer on a stack mapped with auto-grow enabled.
- Invalid access: The virtual address has no VMA — the process accessed unmapped memory. This is a segfault: the kernel sends SIGSEGV.
; Simplified MinOS page fault dispatch
; RSI = error code, RDI = CR2 (faulting address)
page_fault_dispatcher:
; Check if in valid VMA range
call find_vma ; returns NULL if no VMA for address
test rax, rax
jz .segfault ; no VMA → SIGSEGV
; Check P bit of error code: 0 = not present, 1 = protection
test sil, 1
jnz .protection_fault
.not_present:
; Is it a COW page (was marked read-only, write attempt)?
; ... check VMA flags ...
; Demand paging: allocate physical page, map it, return
call alloc_physical_page ; returns physical address
test rax, rax
jz .oom ; out of memory
; Map the physical page into the page table
mov rdx, rdi ; faulting address
and rdx, ~0xFFF ; page-align
call map_page ; rdi=vaddr, rax=paddr, using VMA permission flags
; INVLPG to flush the TLB entry for this address
invlpg [rdi]
; Return — the faulting instruction will be retried
ret
.protection_fault:
; Write to read-only page — check if COW
; ... COW handling ...
.segfault:
; Send SIGSEGV to the current process
call deliver_segfault
The INVLPG instruction invalidates a single TLB entry for the specified virtual address. After mapping a new page, you must do this — otherwise the CPU's cached "not present" translation will cause another fault on the next access.
5-Level Page Tables (LA57)
Linux 5.x+ supports 5-level page tables (CONFIG_X86_5LEVEL) for systems that need more than 128TB of user address space. The 5th level adds a PML5 table above PML4, extending the virtual address space to 57 bits (128 petabytes).
5-level address breakdown (57 bits):
PML5[9] → PML4[9] → PDP[9] → PD[9] → PT[9] → offset[12]
5-level paging requires CPU support (CPUID leaf 7, sub-leaf 0: ECX bit 16 = LA57). The default remains 4-level paging. Most code you write targets 4-level; 5-level is transparent to most user-space programs.
MinOS: Physical Memory Allocator
MinOS needs to allocate physical page frames when mapping virtual pages. The simplest approach is a bitmap allocator: one bit per 4KB physical page frame.
; minOS/mm/pmm.asm — Physical Memory Manager
; Uses a bitmap: bit N = 1 means page frame N is in use
section .bss
; Bitmap covering up to 2GB of physical memory:
; 2GB / 4KB = 524288 frames = 524288 bits = 65536 bytes = 8192 qwords
; (a 256MB machine uses only the first 65536 bits = 1024 qwords)
phys_bitmap: resq 8192 ; 8192 qwords = 524288 bits = 2GB addressable
phys_total: resq 1
phys_free: resq 1
section .text
; pmm_init: mark all memory as used, then free available pages
; RDI = total physical memory in bytes
; RSI = pointer to memory map (from multiboot or BIOS E820)
global pmm_init
pmm_init:
; Set all bits to 1 (all in use)
push rdi
lea rdi, [phys_bitmap]
mov rcx, 8192 ; 8192 qwords
mov rax, -1 ; all bits set
rep stosq
; Mark available regions as free based on memory map
; ... (parse E820 map, call pmm_free_range for each available region) ...
ret
; pmm_alloc: allocate one physical page frame
; Returns: RAX = physical address (page-aligned), or 0 if out of memory
global pmm_alloc
pmm_alloc:
; Find first clear bit in the bitmap
lea rdi, [phys_bitmap]
mov rcx, 8192 ; search 8192 qwords
.search_qword:
mov rax, [rdi] ; load qword
not rax ; invert: 1 = free (was 0), 0 = used
bsf rax, rax ; bit scan forward: find first set bit in inverted
jnz .found_free ; found a free page in this qword
add rdi, 8
dec rcx
jnz .search_qword
xor rax, rax ; out of memory
ret
.found_free:
; rax = bit index within the current qword, rdi = address of that qword
; Compute the global frame number
lea r9, [phys_bitmap]
sub rdi, r9 ; byte offset from start of bitmap
shl rdi, 3 ; convert bytes to bits (× 8)
add rax, rdi ; global frame number
mov r8, rax ; save it — the bit arithmetic below clobbers RAX
; Mark as used: set the bit
mov rcx, rax
shr rcx, 6 ; qword index (divide by 64)
and rax, 63 ; bit index within the qword
bts [r9 + rcx*8], rax ; bit test and set
; Return physical address: frame_number × 4096
mov rax, r8 ; restore the frame number
shl rax, 12
dec qword [phys_free]
ret
; pmm_free: free a physical page frame
; RDI = physical address to free
global pmm_free
pmm_free:
shr rdi, 12 ; convert to frame number
lea rax, [phys_bitmap]
mov rcx, rdi
shr rcx, 6 ; qword index
and rdi, 63 ; bit index
btr [rax + rcx*8], rdi ; bit test and reset
inc qword [phys_free]
ret
📐 OS Kernel Project: This bitmap allocator is the foundation of MinOS memory management. When a page fault fires and demands a new physical page, pmm_alloc provides it. After fork, both parent and child mappings point to the same physical frames until a write occurs, at which point pmm_alloc provides a new frame for the child's copy-on-write copy.
Summary
Virtual memory translates every pointer through a 4-level page table walk before the CPU reaches physical memory. CR3 holds the PML4 base. Each page table entry is 8 bytes encoding a physical page number plus access flags (P, R/W, U/S, NX). The TLB caches recent translations. Page faults fire vector 14 with CR2 = faulting address and a detailed error code. mmap and brk are the kernel interfaces for allocating virtual address space. Understanding this at the bit level explains why buffer overflows are exploitable, how ASLR works, and why kernel code cannot access user space without explicit permission checks.
🔄 Check Your Understanding:
1. Given virtual address 0x00007FFF12345678, extract the PML4, PDP, PD, PT indices and the page offset.
2. What bit in a page table entry prevents code execution on data pages?
3. Why does a fork call not immediately double the physical memory usage?
4. What does INVLPG do, and when must you use it?
5. What is the difference between MAP_PRIVATE and MAP_SHARED in an mmap call?