
Chapter 27: Memory Management

The Lie Every Pointer Tells

When your program accesses memory address 0x7fff5fbff8a0, the CPU does not go directly to that location in DRAM. Instead, it performs a hardware page table walk — traversing four levels of tables, each holding 512 entries — to translate that virtual address into a physical address. The physical address might be anywhere in RAM. Or the page might not be in RAM at all, triggering a page fault. The hardware does this on every single memory access, every load, every store, every instruction fetch.

This mechanism — virtual memory, backed by page tables, enforced by the Memory Management Unit (MMU) — is the abstraction that makes modern operating systems possible. It provides isolation between processes (your process cannot accidentally write to mine), protection enforcement (code pages cannot be written, kernel memory cannot be accessed from user space), and the illusion of more memory than physically exists (demand paging, swapping). Understanding it at the bit level is prerequisite for kernel development, security research, and diagnosing any memory-related crash.


The MMU and Page Tables

The Memory Management Unit is a hardware component built into the CPU that translates virtual addresses to physical addresses on every memory access. It consults page tables — data structures maintained by the OS kernel in physical memory — to perform this translation. The OS controls what mappings exist by modifying the page tables; the hardware enforces those mappings on every access.

The fundamental unit is the page: a contiguous block of memory, typically 4KB (4096 bytes). The mapping granularity is one page: you map a 4KB virtual page to a 4KB physical page frame.


x86-64 Four-Level Page Tables

x86-64 uses a four-level hierarchy to translate 48-bit virtual addresses (in the current standard configuration) to 52-bit physical addresses:

Virtual Address (48 bits used of 64):

 63      48|47    39|38    30|29    21|20    12|11        0
 sign-ext  | PML4  |  PDP   |   PD   |   PT   |  offset
  (16 bits)| (9b)  |  (9b)  |  (9b)  |  (9b)  |  (12b)

Level    Bits   Entries  Entry points to
------   ----   -------  ---------------
PML4     47:39  512      Page Directory Pointer (PDP) tables
PDP      38:30  512      Page Directory (PD) tables
PD       29:21  512      Page Tables (PT) — or 2MB huge pages
PT       20:12  512      4KB physical pages
Offset   11:0   —        Byte within the 4KB page

The virtual address is broken into five fields. Each field is an index into the corresponding level's table (512 entries = 9 bits, since 2⁹ = 512). The physical address is:

Physical address = page_table_entry[PT_index].physical_page_number × 4096 + offset

The CPU finds the top-level PML4 table from CR3 — the page table base register. CR3 holds the physical address of the PML4 table (aligned to 4KB). On a context switch, the OS writes a new value to CR3, and the new process's address space takes effect immediately.
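The index extraction is pure bit arithmetic, so it is easy to sanity-check. A small C model (standalone, not kernel code; the struct and function names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Split a 48-bit virtual address into the four table indices
   and the page offset, exactly as the MMU does. */
typedef struct {
    unsigned pml4, pdp, pd, pt;
    unsigned offset;
} va_fields;

static va_fields split_va(uint64_t va)
{
    va_fields f;
    f.pml4   = (va >> 39) & 0x1FF;  /* bits 47:39 */
    f.pdp    = (va >> 30) & 0x1FF;  /* bits 38:30 */
    f.pd     = (va >> 21) & 0x1FF;  /* bits 29:21 */
    f.pt     = (va >> 12) & 0x1FF;  /* bits 20:12 */
    f.offset =  va        & 0xFFF;  /* bits 11:0  */
    return f;
}
```

For the chapter's opening address 0x7fff5fbff8a0 this yields indices 255, 509, 253, 511 and offset 0x8A0.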

Page Table Entry Format (64-bit)

Each entry in PML4, PDP, PD, and PT is 8 bytes (64 bits):

 63|62   52|51          12|11  9|8|7 |6|5|4  |3  |2  |1  |0
 NX| avail | Physical PPN | AVL |G|PS|D|A|PCD|PWT|U/S|R/W|P

Bit  0 (P)   : Present — 1 = entry is valid; 0 = page not mapped
Bit  1 (R/W) : Read/Write — 0 = read-only; 1 = readable and writable
Bit  2 (U/S) : User/Supervisor — 0 = kernel only; 1 = user accessible
Bit  3 (PWT) : Page-level Write-Through — cache write policy
Bit  4 (PCD) : Page-level Cache Disable
Bit  5 (A)   : Accessed — set by CPU on any access, read or write (used by the OS for eviction decisions)
Bit  6 (D)   : Dirty — set by CPU when page is written (PTE only, not PML4/PDP/PD)
Bit  7 (PS)  : Page Size — in PD: 0 = 4KB pages; 1 = 2MB huge page
Bits 11:8    : AVL — available for OS use (ignored by hardware)
Bits 51:12   : Physical Page Number (PPN) — physical address >> 12
Bit 63 (NX)  : No-Execute — when EFER.NXE=1, setting this bit blocks instruction fetch from the page
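The flag layout translates directly into bit masks. A C sketch (the constant names are mine, not from any kernel header):

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P    (1ULL << 0)   /* Present */
#define PTE_RW   (1ULL << 1)   /* Read/Write */
#define PTE_US   (1ULL << 2)   /* User/Supervisor */
#define PTE_A    (1ULL << 5)   /* Accessed */
#define PTE_D    (1ULL << 6)   /* Dirty */
#define PTE_PS   (1ULL << 7)   /* Page Size */
#define PTE_NX   (1ULL << 63)  /* No-Execute */
#define PTE_PPN_MASK 0x000FFFFFFFFFF000ULL  /* bits 51:12 */

/* Build a PTE mapping one 4KB page at physical address 'pa'. */
static uint64_t make_pte(uint64_t pa, uint64_t flags)
{
    return (pa & PTE_PPN_MASK) | flags;
}

/* Recover the physical address encoded in a PTE. */
static uint64_t pte_paddr(uint64_t pte)
{
    return pte & PTE_PPN_MASK;
}
```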

⚙️ How It Works: The CPU caches recent translations in the TLB (Translation Lookaside Buffer). If the virtual address is in the TLB, the page walk is skipped entirely — the physical address comes from the cache. TLB misses trigger the full 4-level walk, which accesses 4 separate memory locations (one per level). On an L1-cached walk this costs ~16 cycles; on a cold walk it can cost 100+ cycles.

A Page Table Walk in Assembly

Here is what the hardware does on every memory access (simplified, ignoring TLB):

; Hardware page table walk for virtual address in RAX
; CR3 = physical address of PML4
; Returns physical address in RAX, or faults if not mapped

page_table_walk:
    ; Step 1: Extract PML4 index (bits 47:39)
    mov rbx, rax
    shr rbx, 39
    and rbx, 0x1FF          ; 9-bit index

    ; Step 2: Read PML4 entry
    mov rcx, cr3            ; PML4 physical base
    and rcx, ~0xFFF         ; clear low 12 bits (flags)
    mov rdx, [rcx + rbx*8] ; read PML4 entry
    test rdx, 1             ; P bit set?
    jz .page_fault          ; not present

    ; Step 3: Extract PDP index (bits 38:30)
    mov rbx, rax
    shr rbx, 30
    and rbx, 0x1FF
    and rdx, ~0xFFF         ; extract PDP physical address (clear flags)
    mov rdx, [rdx + rbx*8] ; read PDP entry
    test rdx, 1
    jz .page_fault

    ; Step 4: Extract PD index (bits 29:21)
    mov rbx, rax
    shr rbx, 21
    and rbx, 0x1FF
    ; Check for 1GB huge page (PS bit in the PDP entry)
    test rdx, (1<<7)        ; PS bit
    jnz .huge_1gb           ; 1GB huge page (PDP with PS=1)
    and rdx, ~0xFFF
    mov rdx, [rdx + rbx*8] ; read PD entry
    test rdx, 1
    jz .page_fault
    test rdx, (1<<7)        ; PS bit in PD entry
    jnz .huge_2mb           ; 2MB huge page

    ; Step 5: Extract PT index (bits 20:12)
    mov rbx, rax
    shr rbx, 12
    and rbx, 0x1FF
    and rdx, ~0xFFF
    mov rdx, [rdx + rbx*8] ; read PT entry
    test rdx, 1
    jz .page_fault

    ; Step 6: Combine physical page address with page offset
    and rdx, ~0xFFF         ; clear flags, keep PPN
    and rax, 0xFFF          ; extract 12-bit offset
    or  rax, rdx            ; physical address
    ret

.huge_2mb:
    ; 2MB page: use bits 20:0 as offset
    and rdx, ~0x1FFFFF      ; clear low 21 bits (flags + page offset bits)
    and rax, 0x1FFFFF       ; 21-bit offset
    or  rax, rdx
    ret

.huge_1gb:
    ; 1GB page: use bits 29:0 as offset
    and rdx, ~0x3FFFFFFF    ; clear low 30 bits (flags + page offset bits)
    and rax, 0x3FFFFFFF     ; 30-bit offset
    or  rax, rdx
    ret

.page_fault:
    ; Set CR2 to the faulting address and fire exception 14
    ; (hardware does this automatically)
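
The same walk can be modeled in C over a toy "physical memory" of table pages, which is handy for testing the index arithmetic without hardware. Everything here — the frames array, its 8-frame size, the helper name — is an assumption of the model, not real MMU state:

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P 1ULL

/* Toy physical memory: 8 page-sized frames, each usable as a table. */
static uint64_t frames[8][512];

/* Software model of the 4-level walk (frame 0 plays the role of CR3).
   Returns the physical address, or (uint64_t)-1 on a "page fault". */
static uint64_t walk(uint64_t va)
{
    uint64_t *table = frames[0];
    for (int shift = 39; ; shift -= 9) {
        uint64_t entry = table[(va >> shift) & 0x1FF];
        if (!(entry & PTE_P))
            return (uint64_t)-1;                  /* not present */
        if (shift == 12)                          /* PT level: done */
            return (entry & ~0xFFFULL) | (va & 0xFFF);
        table = frames[entry >> 12];              /* next table's frame */
    }
}
```

Populating one chain of entries and walking an address exercises the same four loads the hardware performs.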

Virtual Address Space Layout (Linux x86-64)

Linux divides the 48-bit virtual address space into user space (lower half) and kernel space (upper half):

Virtual Address Space Layout (Linux x86-64, 4-level paging):

0x0000000000000000  ┌──────────────────────────────────┐
                    │  User Space (~128 TB)            │
                    │  Addresses: 0x0 to 0x7FFF...     │
                    │                                  │
                    │  Text (code)                     │
                    │  Data / BSS                      │
                    │  Heap (grows up from end of BSS) │
                    │  Memory-mapped files / libraries │
                    │  Stack (grows down from top)     │
                    │  vDSO / vvar                     │
0x00007FFFFFFFFFFF  └──────────────────────────────────┘
                    │  Canonical hole (47-bit sign-ext)│
0xFFFF800000000000  ┌──────────────────────────────────┐
                    │  Kernel Space                    │
                    │  Direct physical memory map      │
                    │  Kernel text, data, BSS          │
                    │  vmalloc region                  │
                    │  Kernel modules                  │
0xFFFFFFFFFFFFFFFF  └──────────────────────────────────┘

⚙️ How It Works: Addresses between 0x0000800000000000 and 0xFFFF7FFFFFFFFFFF are "non-canonical" — the upper 16 bits are not a sign extension of bit 47. Accessing these addresses causes a #GP fault immediately. This creates a "canonical hole" that separates user space from kernel space without any page table entries needed.
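The canonical check itself is a single sign-extension. A C sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An address is canonical (under 4-level paging) when bits 63:48 are
   all copies of bit 47 — i.e. sign-extending the low 48 bits
   reproduces the original value. */
static bool is_canonical(uint64_t va)
{
    int64_t sext = ((int64_t)(va << 16)) >> 16;  /* sign-extend bit 47 */
    return (uint64_t)sext == va;
}
```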

Viewing the Live Address Space

# View your shell's virtual address space
cat /proc/self/maps
# Example output:
55c8e6e45000-55c8e6e47000 r-xp 00000000 fd:01 7340177  /bin/bash (text)
55c8e6e47000-55c8e6e48000 r--p 00002000 fd:01 7340177  /bin/bash (data, read-only)
55c8e6e48000-55c8e6e4a000 rw-p 00003000 fd:01 7340177  /bin/bash (data, read-write)
55c8e703a000-55c8e705b000 rw-p 00000000 00:00 0        [heap]
7f4b7c000000-7f4b7c021000 rw-p 00000000 00:00 0        [anon mapping]
7f4b7e3e0000-7f4b7e5b8000 r-xp 00000000 fd:01 2884     /lib/x86_64-linux-gnu/libc-2.33.so
...
7fff3da00000-7fff3da21000 rw-p 00000000 00:00 0        [stack]
7fff3dbf4000-7fff3dbf8000 r--p 00000000 00:00 0        [vvar]
7fff3dbf8000-7fff3dbfa000 r-xp 00000000 00:00 0        [vdso]

Each line: start_addr-end_addr permissions offset dev inode pathname. Permissions: r=read, w=write, x=execute, p=private (copy-on-write), s=shared.
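A line in this format is straightforward to parse with sscanf. A sketch (the struct and its field names are mine, chosen for illustration):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/<pid>/maps. */
typedef struct {
    unsigned long start, end;
    char perms[5];      /* e.g. "rw-p" */
    char path[256];     /* empty for anonymous mappings */
} map_entry;

/* Parse "start-end perms offset dev inode [path]".  Returns 1 on success. */
static int parse_maps_line(const char *line, map_entry *m)
{
    m->path[0] = '\0';
    int n = sscanf(line, "%lx-%lx %4s %*s %*s %*s %255s",
                   &m->start, &m->end, m->perms, m->path);
    return n >= 3;      /* path is optional */
}
```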

ASLR: Address Space Layout Randomization

ASLR randomizes the base addresses of the heap, stack, and mapped libraries on each execution. This makes exploitation harder — you cannot hardcode addresses in shellcode or ROP chains.

# Check if ASLR is enabled
cat /proc/sys/kernel/randomize_va_space
# 2 = full ASLR (default), 1 = conservative (stack/mmap randomized, heap not), 0 = disabled

# Compare addresses across runs:
ldd /bin/ls  # shows library addresses (randomized with ASLR)
ldd /bin/ls  # run again — different addresses!

From the assembly perspective, ASLR means:

  - Code in position-independent executables (PIE) uses RIP-relative addressing
  - The GOT (Global Offset Table) stores actual addresses of external symbols at runtime
  - Hardcoded absolute addresses in shellcode will not work with ASLR
  - Stack-based exploits need to defeat ASLR before constructing ROP chains


How malloc() Really Works

The heap allocation API hides a complex system. Let's trace what actually happens:

void *p = malloc(16);   // What does this actually do?

For a small allocation (less than ~128KB), glibc uses its internal heap:

  1. glibc maintains a heap (called the "arena") starting at the program break
  2. The first call to malloc may call sys_brk to extend the heap by a large chunk
  3. glibc carves allocations out of this chunk, tracking freed chunks in size-segregated free lists ("bins")
  4. It finds a free chunk that fits, marks it as allocated, returns a pointer to the user data
  5. The 8 bytes before the pointer contain a header (size + flags)

For a large allocation (>= 128KB by default), glibc calls mmap directly:

; What glibc does for malloc(200000):
mov rax, 9              ; sys_mmap
xor rdi, rdi            ; addr = NULL
mov rsi, 200704         ; length = 200000 (+ header) rounded up to pages: 49 × 4096
mov rdx, 3              ; PROT_READ | PROT_WRITE
mov r10, 0x22           ; MAP_PRIVATE | MAP_ANONYMOUS
mov r8, -1              ; fd = -1
xor r9, r9              ; offset = 0
syscall
; rax = address of new mapping; glibc stores metadata just before it

The Heap Chunk Header

glibc's malloc stores a header before each allocation. This is why buffer overflows that corrupt the heap are exploitable:

Heap layout (simplified glibc malloc):

     ┌──────────────────────────────────┐
     │ prev_size (8 bytes, if free)     │ ← valid only while the previous chunk is free
     ├──────────────────────────────────┤
     │ size (8 bytes)                   │ ← size of THIS chunk (includes header)
     │ bit 0: PREV_INUSE flag           │   bit 0=1 means previous chunk is in use
     │ bit 1: IS_MMAPPED flag           │   bit 1=1 means this was mmap'd
     │ bit 2: NON_MAIN_ARENA flag       │
     ├──────────────────────────────────┤ ← malloc() returns pointer to HERE
     │ User data                        │
     │ (n bytes)                        │
     │                                  │
     ├──────────────────────────────────┤
     │ (next chunk's header follows)    │

Writing past the end of a heap allocation overwrites the next chunk's header. The size and PREV_INUSE fields are used by free() to recombine adjacent free chunks (coalescing). Corrupting these fields is the basis of heap exploitation — a topic returned to in Part VII.
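The request-to-chunk rounding can be modeled arithmetically. This approximates 64-bit glibc behavior — 8-byte size header, 16-byte alignment, 32-byte minimum chunk — and is a model for reasoning, not the library's actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Approximate chunk size glibc reserves for a malloc(request) on
   x86-64: add 8 bytes for the size header, round up to a 16-byte
   multiple, enforce a 32-byte minimum. */
static size_t chunk_size(size_t request)
{
    size_t sz = (request + 8 + 15) & ~(size_t)15;
    return sz < 32 ? 32 : sz;
}
```

So malloc(16) consumes a 32-byte chunk, and a 25-byte request already costs 48 bytes — the header and alignment overhead that heap overflows corrupt.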


Memory-Mapped Files

mmap with a file descriptor creates a mapping backed by file data:

; Map the file /etc/passwd read-only into memory
; mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0)

    ; First: open the file
    mov rax, 2              ; sys_open
    mov rdi, etc_passwd     ; "/etc/passwd"
    xor rsi, rsi            ; O_RDONLY
    xor rdx, rdx
    syscall
    mov r12, rax            ; save fd

    ; Get file size with fstat
    mov rax, 5              ; sys_fstat
    mov rdi, r12
    lea rsi, [stat_buf]
    syscall
    ; stat_buf + 48 = st_size (64-bit)
    mov r13, [stat_buf + 48]  ; file size

    ; Map the file
    mov rax, 9              ; sys_mmap
    xor rdi, rdi            ; addr = kernel chooses
    mov rsi, r13            ; length = file size
    mov rdx, 1              ; PROT_READ
    mov r10, 2              ; MAP_PRIVATE (copy-on-write)
    mov r8, r12             ; fd
    xor r9, r9              ; offset = 0
    syscall
    ; rax = virtual address of file contents

    ; Now you can read the file like memory:
    ; byte at rax+0 is byte 0 of the file
    ; byte at rax+r13-1 is the last byte of the file
    ; No read() calls needed!

MAP_PRIVATE — changes are private to this process (copy-on-write). Writes do not propagate to the file. MAP_SHARED — changes are visible to all processes mapping the same file, and eventually written back to disk.
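The same sequence in C via the libc wrappers — create a scratch file (the path here is arbitrary), map it MAP_PRIVATE, and read its bytes directly as memory:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a small file, map it, verify its contents through the mapping.
   Returns 0 on success. */
static int demo_mmap_read(void)
{
    const char *path = "/tmp/mmap_demo.txt";   /* scratch file for the demo */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    if (write(fd, "hello, mmap", 11) != 11) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) return -1;

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return -1;

    int ok = (memcmp(p, "hello, mmap", 11) == 0);  /* file bytes, no read() */

    munmap(p, st.st_size);
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```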

Memory-mapped I/O is also used for device registers. The local APIC, for example, sits at a fixed physical address range that the kernel maps into its own address space; from user space such ranges are reachable only through special interfaces like /dev/mem, not an ordinary file mmap.


Page Faults in Detail

The kernel's page fault handler is one of the most complex pieces of code in any OS. It must distinguish:

  1. Valid but not-present page: The page is in the process's virtual address space (has a VMA — Virtual Memory Area) but is not currently in RAM. Could be:
     - Demand paging: first access to a page that was never loaded (file-backed)
     - Swap: page was evicted to swap space, must be swapped back in
     - Guard page: intentional unmapped page (stack overflow detection)

  2. Copy-on-write (COW): A write to a read-only page that is marked as COW (typically after fork). The kernel allocates a new physical page, copies the content, remaps the virtual page with write permission.

  3. Stack growth: Access slightly below the current stack pointer on a stack that is mapped with auto-grow enabled.

  4. Invalid access: The virtual address has no VMA — the process accessed unmapped memory. This is a segfault: the kernel sends SIGSEGV.

; Simplified MinOS page fault dispatch
; RSI = error code, RDI = CR2 (faulting address)
page_fault_dispatcher:
    ; Check if in valid VMA range
    call find_vma           ; returns NULL if no VMA for address
    test rax, rax
    jz .segfault            ; no VMA → SIGSEGV

    ; Check P bit of error code: 0 = not present, 1 = protection
    test sil, 1
    jnz .protection_fault

.not_present:
    ; Is it a COW page (was marked read-only, write attempt)?
    ; ... check VMA flags ...

    ; Demand paging: allocate physical page, map it, return
    call alloc_physical_page  ; returns physical address
    test rax, rax
    jz .oom                   ; out of memory

    ; Map the physical page into the page table
    mov rdx, rdi            ; faulting address
    and rdx, ~0xFFF         ; page-align
    call map_page           ; rdx=page-aligned vaddr, rax=paddr, using VMA permission flags
    ; INVLPG to flush the TLB entry for this address
    invlpg [rdi]
    ; Return — the faulting instruction will be retried
    ret

.protection_fault:
    ; Write to read-only page — check if COW
    ; ... COW handling ...

.segfault:
    ; Send SIGSEGV to the current process
    call deliver_segfault

The INVLPG instruction invalidates a single TLB entry for the specified virtual address. After mapping a new page, you must do this — otherwise the CPU's cached "not present" translation will cause another fault on the next access.


5-Level Page Tables (LA57)

Linux 5.x+ supports 5-level page tables (CONFIG_X86_5LEVEL) for systems that need more than 128TB of user address space. The 5th level adds a PML5 table above PML4, extending the virtual address space to 57 bits (128 petabytes).

5-level address breakdown (57 bits):
 PML5[9] → PML4[9] → PDP[9] → PD[9] → PT[9] → offset[12]

5-level paging requires CPU support (CPUID leaf 7, subleaf 0, ECX bit 16 = LA57). The default remains 4-level paging. Most code you write targets 4-level; 5-level is transparent to most user-space programs.


MinOS: Physical Memory Allocator

MinOS needs to allocate physical page frames when mapping virtual pages. The simplest approach is a bitmap allocator: one bit per 4KB physical page frame.

; minOS/mm/pmm.asm — Physical Memory Manager
; Uses a bitmap: bit N = 1 means page frame N is in use

section .bss
    ; One bit per 4KB frame.  8192 qwords × 64 bits = 524288 bits,
    ; enough to track 524288 × 4KB = 2GB of physical memory.
    ; (A 256MB machine uses only the first 65536 bits = 1024 qwords.)
    phys_bitmap:    resq 8192
    phys_total:     resq 1
    phys_free:      resq 1

section .text

; pmm_init: mark all memory as used, then free available pages
; RDI = total physical memory in bytes
; RSI = pointer to memory map (from multiboot or BIOS E820)
global pmm_init
pmm_init:
    ; Set all bits to 1 (all in use)
    push rdi
    lea rdi, [phys_bitmap]
    mov rcx, 8192           ; 8192 qwords
    mov rax, -1             ; all bits set
    rep stosq

    ; Mark available regions as free based on memory map
    ; ... (parse E820 map, call pmm_free_range for each available region) ...
    ret

; pmm_alloc: allocate one physical page frame
; Returns: RAX = physical address (page-aligned), or 0 if out of memory
global pmm_alloc
pmm_alloc:
    ; Find first clear bit in the bitmap
    lea rdi, [phys_bitmap]
    mov rcx, 8192           ; search 8192 qwords
.search_qword:
    mov rax, [rdi]          ; load qword
    not rax                 ; invert: 1 = free (was 0), 0 = used
    bsf rax, rax            ; bit scan forward: find first set bit in inverted
    jnz .found_free         ; found a free page in this qword
    add rdi, 8
    dec rcx
    jnz .search_qword
    xor rax, rax            ; out of memory
    ret

.found_free:
    ; rax = bit index within the current qword, rdi = address of qword
    ; Compute global frame number
    lea rbx, [phys_bitmap]
    sub rdi, rbx            ; byte offset from start of bitmap
    shl rdi, 3              ; convert bytes to bits (× 8)
    add rax, rdi            ; global frame number

    ; Mark as used: set the bit (use RDX for the bit index so RAX
    ; keeps the frame number for the return value)
    lea rbx, [phys_bitmap]
    mov rcx, rax
    shr rcx, 6              ; qword index (divide by 64)
    mov rdx, rax
    and rdx, 63             ; bit index within qword
    bts [rbx + rcx*8], rdx  ; bit test and set (atomic on single CPU)

    ; Return physical address: frame_number × 4096
    shl rax, 12
    dec qword [phys_free]
    ret

; pmm_free: free a physical page frame
; RDI = physical address to free
global pmm_free
pmm_free:
    shr rdi, 12             ; convert to frame number
    lea rax, [phys_bitmap]
    mov rcx, rdi
    shr rcx, 6              ; qword index
    and rdi, 63             ; bit index
    btr [rax + rcx*8], rdi  ; bit test and reset
    inc qword [phys_free]
    ret
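
The same allocator, modeled in C over a smaller bitmap for testing (1024 frames; __builtin_ctzll stands in for BSF and assumes GCC/Clang):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* C model of pmm.asm: bit N set = frame N in use. */
#define FRAMES 1024
static uint64_t bitmap[FRAMES / 64];

static void pmm_init_model(void)
{
    memset(bitmap, 0xFF, sizeof bitmap);      /* all used ... */
    for (int i = 256; i < FRAMES; i++)        /* ... then free frames 256+ */
        bitmap[i / 64] &= ~(1ULL << (i % 64));
}

/* Allocate one frame; returns its physical address, or 0 if none free. */
static uint64_t pmm_alloc_model(void)
{
    for (int q = 0; q < FRAMES / 64; q++) {
        uint64_t inv = ~bitmap[q];            /* 1 = free */
        if (inv == 0) continue;
        int bit = __builtin_ctzll(inv);       /* first free bit (like BSF) */
        bitmap[q] |= 1ULL << bit;             /* mark used (like BTS) */
        return ((uint64_t)(q * 64 + bit)) << 12;
    }
    return 0;                                 /* out of memory */
}

static void pmm_free_model(uint64_t paddr)
{
    uint64_t frame = paddr >> 12;
    bitmap[frame / 64] &= ~(1ULL << (frame % 64));  /* like BTR */
}
```

First-fit means a freed frame is handed out again before any higher-numbered frame, which keeps physical memory compact at low addresses.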

📐 OS Kernel Project: This bitmap allocator is the foundation of MinOS memory management. When a page fault fires and demands a new physical page, pmm_alloc provides it. After fork, both parent and child mappings point to the same physical frames until a write occurs, at which point pmm_alloc provides a new frame for the child's copy-on-write copy.


Summary

Virtual memory translates every pointer through a 4-level page table walk before the CPU reaches physical memory. CR3 holds the PML4 base. Each page table entry is 8 bytes encoding a physical page number plus access flags (P, R/W, U/S, NX). The TLB caches recent translations. Page faults fire vector 14 with CR2 = faulting address and a detailed error code. mmap and brk are the kernel interfaces for allocating virtual address space. Understanding this at the bit level explains why buffer overflows are exploitable, how ASLR works, and why kernel code cannot access user space without explicit permission checks.

🔄 Check Your Understanding:

  1. Given virtual address 0x00007FFF12345678, extract the PML4, PDP, PD, PT indices and the page offset.
  2. What bit in a page table entry prevents code execution on data pages?
  3. Why does a fork call not immediately double the physical memory usage?
  4. What does INVLPG do, and when must you use it?
  5. What is the difference between MAP_PRIVATE and MAP_SHARED in a mmap call?