Chapter 11: The Stack and Function Calls

The Stack Is Just Memory

The hardware stack is ordinary memory. What makes it "the stack" is the convention that RSP always points to the top, that growth is downward (toward lower addresses), and that PUSH and POP manipulate it through RSP. None of that is enforced by the hardware except for the RSP-manipulation behavior of PUSH, POP, CALL, and RET. You can ignore the stack and use memory any way you want — but you would be breaking every calling convention and every debugger, and you would not be able to call C library functions. So you use the stack.

This chapter covers how PUSH and POP work, how CALL plants a return address on the stack, how RET finds and jumps to it, and the full System V AMD64 ABI: the calling convention used by Linux, macOS, the BSDs, and most other Unix-like systems on x86-64. By the end of this chapter, you will be able to write functions that call and are called by C code correctly, implement recursive functions with full stack frame management, and understand exactly which memory layout Chapter 35's buffer overflow attack exploits.

PUSH and POP

push rax            ; RSP -= 8; [RSP] = RAX
pop  rbx            ; RBX = [RSP]; RSP += 8

PUSH decrements RSP first, then stores the value. POP loads first, then increments RSP. The stack grows toward lower addresses.

; Initial state: RSP = 0x7FFFFFFFE000

push rax            ; RSP = 0x7FFFFFFFDFF8, [0x7FFFFFFFDFF8] = RAX
push rbx            ; RSP = 0x7FFFFFFFDFF0, [0x7FFFFFFFDFF0] = RBX
pop  rcx            ; RCX = [0x7FFFFFFFDFF0] = RBX's value; RSP = 0x7FFFFFFFDFF8
pop  rdx            ; RDX = [0x7FFFFFFFDFF8] = RAX's value; RSP = 0x7FFFFFFFE000

After the four instructions, RSP is back to its original value and RCX/RDX hold the original values of RBX/RAX respectively (LIFO order). The memory at those stack addresses still holds the old values, but anything below RSP is considered garbage and will be overwritten by the next push.
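As a small illustration of the LIFO order, here is a sketch of a function that swaps two 64-bit values in memory using only PUSH and POP with memory operands. The name swap_via_stack and its C prototype are invented for this example; real code would normally use registers or xchg instead.

```nasm
; void swap_via_stack(int64_t *a, int64_t *b)   (hypothetical example function)
; RDI = a, RSI = b
section .text
global swap_via_stack

swap_via_stack:
    push qword [rdi]    ; stack (top first): *a
    push qword [rsi]    ; stack (top first): *b, *a
    pop  qword [rdi]    ; *a = old *b  (last pushed, first popped)
    pop  qword [rsi]    ; *b = old *a
    ret                 ; two pushes, two pops: RSP is back at the return address
```

Because the pushes and pops balance, RSP points at the return address again when ret executes.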

PUSH and POP Sizes

push rax            ; pushes 8 bytes (64-bit, the default)
push eax            ; NOT valid in 64-bit mode (32-bit push is not encodable)
push ax             ; pushes 2 bytes (16-bit push, valid but rare)
push 42             ; pushes 8 bytes (sign-extended 32-bit immediate)
push qword [rbx]    ; pushes 8 bytes from memory

In 64-bit mode, PUSH/POP default to 64-bit operand size. You cannot push a 32-bit value directly; the operand size is either 64 or 16.

CALL and RET: The Function Call Mechanism

CALL

call target         ; push return_address; jmp target
; return_address = address of the instruction AFTER the call

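CALL behaves like the following sequence (pseudocode: RIP cannot be named as a MOV source, and the real CALL does not modify the flags the way SUB does):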

; What CALL target actually does:
sub rsp, 8
mov [rsp], rip_of_next_instruction
jmp target

The return address pushed is the address of the instruction immediately following the CALL instruction — the place execution should resume after the function returns.

; Example:
0x401000: call my_function   ; push 0x401005; jmp my_function
0x401005: mov rax, 1         ; ← this is the return address

RET

ret                 ; pop return_address; jmp return_address

RET behaves like the following pair, except that the real RET needs no scratch register and leaves RCX untouched:

pop rcx            ; rcx = [rsp]; rsp += 8
jmp rcx            ; jump to the return address

The ret at the end of a function restores control to the caller by jumping to the address on the stack that CALL placed there.
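One way to see the return address for yourself is a tiny function that reads it before returning. This is a sketch only; the name get_return_address is made up for illustration, and it relies on nothing beyond the fact that [rsp] holds the return address at function entry.

```nasm
; void *get_return_address(void)   (hypothetical example function)
; Returns the address of the instruction that follows the CALL which invoked us.
section .text
global get_return_address

get_return_address:
    mov rax, [rsp]      ; at entry, [rsp] is the return address CALL pushed
    ret                 ; RET pops that same value and jumps to it
```

Called from C, the returned pointer should land inside the calling function, just after its call instruction.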

⚠️ Common Mistake: The call/return mechanism works correctly only if RSP is pointing to the return address when ret executes — i.e., RSP must have the same value at ret as it did when CALL pushed the address. If your function pushes registers, allocates stack space, or does anything that changes RSP, you must restore RSP to the correct value before ret. This is the purpose of the function prologue and epilogue.

Stack Diagrams

Let us trace through a complete function call with memory diagrams.

Before the Call

Stack (high addresses at top):
┌──────────────────────────────────┐
│ ... caller's stack frame ...     │ ← higher addresses
├──────────────────────────────────┤
│ (caller's local variables)       │
└──────────────────────────────────┘
          ↑
          RSP = 0x7FFFFFFFE010

After call my_func

┌──────────────────────────────────┐
│ ... caller's stack frame ...     │
├──────────────────────────────────┤
│ (caller's local variables)       │
├──────────────────────────────────┤ ← RSP was here before CALL
│ return address (0x401005)        │ 8 bytes
└──────────────────────────────────┘
          ↑
          RSP = 0x7FFFFFFFE008  (RSP decreased by 8)

After the Prologue push rbp; mov rbp, rsp

┌──────────────────────────────────┐
│ ... caller's stack frame ...     │
├──────────────────────────────────┤
│ (caller's local variables)       │
├──────────────────────────────────┤
│ return address (0x401005)        │ [RSP+8] = [RBP+8]
├──────────────────────────────────┤
│ saved RBP (caller's RBP)         │ [RSP] = [RBP]
└──────────────────────────────────┘
          ↑
          RSP = RBP = 0x7FFFFFFFE000

After sub rsp, 32 (allocate 32 bytes for locals)

┌──────────────────────────────────┐
│ ... caller's stack frame ...     │
├──────────────────────────────────┤
│ (caller's local variables)       │
├──────────────────────────────────┤
│ return address                   │ [RBP+8]
├──────────────────────────────────┤
│ saved RBP                        │ [RBP+0]  ← RBP points here
├──────────────────────────────────┤
│ local var 1 (8 bytes)            │ [RBP-8]
├──────────────────────────────────┤
│ local var 2 (8 bytes)            │ [RBP-16]
├──────────────────────────────────┤
│ local var 3 (8 bytes)            │ [RBP-24]
├──────────────────────────────────┤
│ local var 4 (8 bytes)            │ [RBP-32]
└──────────────────────────────────┘
          ↑
          RSP = 0x7FFFFFFFDFE0  (RBP - 32)

The Standard Function Prologue and Epilogue

Prologue

my_function:
    push rbp            ; save caller's frame pointer
    mov  rbp, rsp       ; establish our frame pointer
    sub  rsp, N         ; allocate N bytes of local storage
                        ; N must make RSP 16-byte aligned

After the prologue, the frame is established:

- [rbp] = saved caller's RBP
- [rbp + 8] = return address (put there by CALL)
- [rbp - 8], [rbp - 16], ... = local variables

Epilogue

    ; ... function body ...
    leave               ; = mov rsp, rbp; pop rbp
    ret                 ; = pop rip (implicit)

LEAVE is the single-instruction epilogue: it restores RSP from RBP (deallocates locals), then pops RBP (restores caller's frame pointer). Then RET pops the return address and jumps to it.

The two-instruction version is also common:

    mov rsp, rbp        ; deallocate locals (RSP = RBP)
    pop rbp             ; restore caller's RBP
    ret
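To see the prologue and epilogue around a real body, here is a minimal sketch of a function that deliberately keeps its arguments in stack locals. The function name sum_of_squares is hypothetical, and a compiler would keep these values in registers; the point is only to show the [rbp - N] addressing.

```nasm
; int64_t sum_of_squares(int64_t a, int64_t b)   (hypothetical example function)
; RDI = a, RSI = b, result in RAX
section .text
global sum_of_squares

sum_of_squares:
    push rbp                ; save caller's frame pointer
    mov  rbp, rsp           ; establish our frame pointer
    sub  rsp, 16            ; two 8-byte locals; 16 keeps RSP 16-byte aligned

    mov  [rbp - 8], rdi     ; local 1 = a
    mov  [rbp - 16], rsi    ; local 2 = b

    mov  rax, [rbp - 8]
    imul rax, rax           ; rax = a * a
    mov  rcx, [rbp - 16]
    imul rcx, rcx           ; rcx = b * b  (RCX is caller-saved, free to use)
    add  rax, rcx           ; rax = a*a + b*b

    leave                   ; mov rsp, rbp; pop rbp
    ret
```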

Frame Pointer Omission (-fomit-frame-pointer)

Modern compilers often omit the frame pointer with -fomit-frame-pointer (GCC's default at -O1 and above). Without a frame pointer, the function uses RSP directly for addressing locals:

; Without frame pointer:
my_function_no_fp:
    sub rsp, 24         ; allocate space for locals + alignment

    ; Locals at [rsp+0], [rsp+8], [rsp+16]
    ; (no RBP established)

    add rsp, 24         ; deallocate
    ret

Benefits: RBP is free to use as a general-purpose register, saving one push/pop per function. Drawback: debuggers and stack unwinding tools cannot easily walk the call stack without frame pointers (though the .eh_frame section provides DWARF unwind info that compensates).

💡 Mental Model: The frame pointer (RBP) is a stable anchor for the current stack frame. Local variable [rbp-8] always refers to the same location regardless of what happens to RSP during the function. With -fomit-frame-pointer, the compiler tracks the RSP offset at every instruction to correctly reference locals — which is fine for the compiler but harder for humans reading the assembly.
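The offset bookkeeping is easy to see in a small sketch. The function below (no_fp_demo is a made-up name) stores a local relative to RSP, then pushes a register; after the push, the same local sits 8 bytes further from RSP.

```nasm
; int64_t no_fp_demo(void)   (hypothetical example function, returns 42)
section .text
global no_fp_demo

no_fp_demo:
    sub  rsp, 24            ; entry RSP ≡ 8 (mod 16); now ≡ 0
    mov  qword [rsp + 8], 42 ; a local stored at RSP+8
    push rdi                ; RSP drops by 8 ...
    mov  rax, [rsp + 16]    ; ... so the same local is now at RSP+16
    pop  rdi                ; RSP back to where it was after the sub
    add  rsp, 24            ; deallocate
    ret
```

With a frame pointer, that local would have had one fixed name for the whole function, regardless of the push.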

The System V AMD64 ABI: The Linux Calling Convention

The System V AMD64 ABI (also used by macOS, the BSDs, and most other Unix-like systems on x86-64) defines:

1. Which registers pass function arguments
2. Which register receives the return value
3. Which registers must be preserved across calls (callee-saved)
4. Which registers may be clobbered (caller-saved)
5. Stack alignment requirements

Argument Passing

Arguments are passed in registers first, then on the stack:

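Argument #      Integer/Pointer          Floating-Point
1st             RDI                      XMM0
2nd             RSI                      XMM1
3rd             RDX                      XMM2
4th             RCX                      XMM3
5th             R8                       XMM4
6th             R9                       XMM5
7th             Stack (right to left)    XMM6
8th             Stack (right to left)    XMM7
9th, 10th, ...  Stack (right to left)    Stack (right to left)

Integer and floating-point arguments are counted separately: each class fills its own register sequence in order.

// C function signature:
int foo(int a, int b, int c, int d, int e, int f, int g, int h);

; How to call foo(1, 2, 3, 4, 5, 6, 7, 8):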
mov  edi, 1          ; arg 1: a
mov  esi, 2          ; arg 2: b
mov  edx, 3          ; arg 3: c
mov  ecx, 4          ; arg 4: d
mov  r8d, 5          ; arg 5: e
mov  r9d, 6          ; arg 6: f
push 8               ; arg 8: h (pushed right-to-left)
push 7               ; arg 7: g (pushed right-to-left)
call foo
add  rsp, 16         ; remove stack arguments (caller's responsibility)
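The callee's side of that call can be sketched as follows. The text above only gives foo's signature, so the body here (summing all eight arguments) is an assumption chosen to show where the two stack arguments sit relative to RBP.

```nasm
; int foo(int a, int b, int c, int d, int e, int f, int g, int h)
; Assumed body for illustration: returns a+b+c+d+e+f+g+h.
section .text
global foo

foo:
    push rbp
    mov  rbp, rsp
    mov  eax, edi           ; a
    add  eax, esi           ; + b
    add  eax, edx           ; + c
    add  eax, ecx           ; + d
    add  eax, r8d           ; + e
    add  eax, r9d           ; + f
    add  eax, [rbp + 16]    ; + g: first stack argument, just above the return address
    add  eax, [rbp + 24]    ; + h: each stack argument occupies an 8-byte slot
    pop  rbp
    ret
```

With the call sequence shown earlier, foo(1, 2, 3, 4, 5, 6, 7, 8) returns 36 in EAX.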

Return Values

Type / size                          Register
Integer or pointer, up to 64 bits    RAX
Integer, 65-128 bits                 RDX:RAX (RDX = high half)
float (32 bits)                      XMM0
double (64 bits)                     XMM0
Struct larger than 16 bytes          Memory: the caller passes a pointer to result space in RDI (hidden first argument), and the callee returns that pointer in RAX
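The RDX:RAX row falls out naturally from the MUL instruction, which already leaves a 128-bit product in that register pair. A minimal sketch, assuming a C prototype that uses the GCC/Clang unsigned __int128 extension (the name mul_wide is hypothetical):

```nasm
; unsigned __int128 mul_wide(uint64_t a, uint64_t b)   (hypothetical example function)
; RDI = a, RSI = b; 128-bit result returned in RDX:RAX
section .text
global mul_wide

mul_wide:
    mov rax, rdi        ; MUL takes one operand implicitly from RAX
    mul rsi             ; RDX:RAX = RAX * RSI (unsigned), high half in RDX
    ret                 ; result is already where the ABI expects it
```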

Register Preservation Contract

The ABI divides registers into two classes:

Caller-saved (volatile): The called function may freely modify these. If the caller needs them after the call, the caller must save them before the call.

Caller-saved registers
RAX, RCX, RDX, RSI, RDI, R8, R9, R10, R11
XMM0-XMM15 (all XMM/YMM/ZMM)
RFLAGS

Callee-saved (preserved): The called function must restore these to their original values before returning. If the function uses them internally, it must push them at the start and pop them before returning.

Callee-saved registers
RBX, RBP, R12, R13, R14, R15

RSP is implicitly preserved: when RET executes, RSP must equal its value at function entry (pointing at the return address), so that after RET pops it the caller sees the same RSP it had just before the CALL.

⚠️ Common Mistake: Using RBX without saving/restoring it. RBX is a callee-saved register — if you use it in a function without pushing it first, you corrupt the caller's RBX. The fix: push rbx in the prologue, pop rbx in the epilogue.

; Correct use of callee-saved registers:
my_function:
    push rbp
    mov  rbp, rsp
    push rbx           ; save RBX (we will use it)
    push r12           ; save R12 (we will use it)
    push r13           ; save R13
    sub  rsp, 8        ; three pushes left RSP ≡ 8 (mod 16); pad to 16 before calling

    ; Now we can use rbx, r12, r13 freely
    mov rbx, rdi       ; keep arg1 across the call
    mov r12, rsi       ; keep arg2 across the call
    call some_other_function
    ; rbx and r12 still hold our saved values
    ; (some_other_function must preserve them)

    add rsp, 8         ; drop the alignment pad
    pop r13            ; restore in reverse order of the pushes
    pop r12
    pop rbx
    pop rbp
    ret
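The contract pays off whenever a value has to survive a call into C. The sketch below adds the lengths of two strings by calling the C library's strlen twice; total_len is a made-up name, and the wrt ..plt suffix is NASM's ELF syntax for calling through the PLT, which is assumed here so the object links in both PIE and non-PIE builds.

```nasm
; size_t total_len(const char *a, const char *b)   (hypothetical example function)
; RDI = a, RSI = b; returns strlen(a) + strlen(b)
extern strlen

section .text
global total_len

total_len:
    push rbx                ; callee-saved: entry RSP ≡ 8 (mod 16), now ≡ 0
    push r12                ; callee-saved: now ≡ 8
    sub  rsp, 8             ; pad so RSP ≡ 0 (mod 16) at each call

    mov  rbx, rsi           ; b must survive the first call; RSI may be clobbered
    call strlen wrt ..plt   ; rax = strlen(a)
    mov  r12, rax           ; the first length must survive the second call

    mov  rdi, rbx
    call strlen wrt ..plt   ; rax = strlen(b)
    add  rax, r12           ; total

    add  rsp, 8
    pop  r12
    pop  rbx
    ret
```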

The 16-Byte Stack Alignment Requirement

Before any call instruction, RSP must be 16-byte aligned. After the call instruction pushes the 8-byte return address, RSP will be 8-byte aligned but not 16-byte aligned — and that is the state when the called function begins.

The rule in practice: at the point of a call, RSP must be divisible by 16 (i.e., the low 4 bits of RSP are 0000). After the call instruction pushes 8 bytes, RSP is divisible by 8 but not 16 (low 4 bits = 1000).

Why does this matter? Aligned SSE loads and stores such as movaps and movapd require 16-byte aligned memory operands, and compilers use them on stack data on the assumption that the ABI's alignment holds. If the stack was misaligned when a function was entered, its locals land at misaligned addresses and the first aligned access raises a general protection fault (reported on Linux as SIGSEGV).

; At process entry (_start), the ABI guarantees RSP is 16-byte aligned:
_start:
    ; RSP % 16 == 0 here
    call some_function      ; CALL pushes 8 bytes, so the callee starts with RSP % 16 == 8
    ; some_function's prologue sees RSP % 16 == 8 (correct for a called function)

; Inside a function, when RSP % 16 == 8 at the point of a call
; (e.g., right at function entry, or after an even number of 8-byte pushes since entry):
    sub rsp, 8              ; align to 16 before the call
    call other_function
    add rsp, 8              ; remove the padding after the call

The convention: at function entry (just after the CALL), RSP % 16 == 8 (because CALL pushed 8 bytes). The function prologue (push rbp) pushes 8 more bytes, making RSP % 16 == 0. Then sub rsp, N allocates space; N must be divisible by 16 to maintain alignment.

; Function with 3 locals of 8 bytes each (24 bytes total):
my_func:
    push rbp               ; RSP was ≡ 8 (mod 16); now RSP ≡ 0 (mod 16)
    mov  rbp, rsp
    sub  rsp, 24           ; 24 bytes NOT a multiple of 16!
    ; RSP is now ≡ 8 (mod 16) — MISALIGNED before any nested calls

    ; Fix: use sub rsp, 32 (or 48, or any multiple of 16 >= 24)

⚙️ How It Works: The 16-byte alignment requirement exists because the movaps and related SSE instructions fault on unaligned addresses. C compilers allocate local variable space in multiples of 16 bytes to guarantee alignment. When you write assembly by hand, you must do the same arithmetic.
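A short sketch makes the dependency concrete. The function below (aligned_local_demo is a hypothetical name) zeroes a 16-byte local with movaps; the store is only legal because the prologue arithmetic leaves RSP on a 16-byte boundary.

```nasm
; int64_t aligned_local_demo(void)   (hypothetical example function, returns 0)
section .text
global aligned_local_demo

aligned_local_demo:
    push rbp                ; entry RSP ≡ 8 (mod 16); now ≡ 0
    mov  rbp, rsp
    sub  rsp, 16            ; one 16-byte local; RSP still ≡ 0 (mod 16)

    xorps  xmm0, xmm0       ; xmm0 = 0
    movaps [rsp], xmm0      ; aligned store: faults if RSP were not 16-byte aligned
    mov    rax, [rsp]       ; read back the low 8 bytes (0)

    leave
    ret
```

Changing the sub to 8 or 24 bytes would misalign RSP, and the movaps would then fault.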

A Complete Worked Example: Recursive Factorial

; int64_t factorial(int64_t n)
; RDI = n, returns result in RAX

section .text
global factorial

factorial:
    push rbp               ; save caller's RBP
    mov  rbp, rsp          ; establish frame pointer

    ; Base case: if n <= 1, return 1
    cmp  rdi, 1
    jle  .base_case

    ; Recursive case: return n * factorial(n-1)
    push rdi               ; save n (RDI is caller-saved, but we need it after CALL)
                           ; RSP was ≡ 0 (mod 16) after push rbp; push rdi makes ≡ 8
    sub  rsp, 8            ; align RSP to 16 bytes before the call
    lea  rdi, [rdi - 1]    ; arg: n-1
    call factorial         ; rax = factorial(n-1)
    add  rsp, 8            ; restore alignment adjustment
    pop  rdi               ; restore n

    imul rax, rdi          ; rax = n * factorial(n-1)
    jmp  .return

.base_case:
    mov  rax, 1

.return:
    pop  rbp
    ret

Stack Frame Trace for factorial(4)

Call: factorial(4)
┌─────────────────────────────────────────────────┐
│ Frame: factorial(4)                              │
│ [rbp+8] = return address to caller              │
│ [rbp+0] = saved caller's RBP                    │
│ [rsp+8] = saved RDI = 4  (pushed after rbp)     │
│ [rsp+0] = alignment pad  (8 bytes)              │
└─────────────────────────────────────────────────┘
         ↓ calls factorial(3)

┌─────────────────────────────────────────────────┐
│ Frame: factorial(3)                              │
│ [rbp+8] = return address to factorial(4)        │
│ [rbp+0] = saved RBP (factorial(4)'s rbp)        │
│ [rsp+8] = saved RDI = 3                         │
│ [rsp+0] = alignment pad                         │
└─────────────────────────────────────────────────┘
         ↓ calls factorial(2)

┌─────────────────────────────────────────────────┐
│ Frame: factorial(2)                              │
│ ... same pattern ...                             │
│ RDI = 2                                          │
└─────────────────────────────────────────────────┘
         ↓ calls factorial(1)

┌─────────────────────────────────────────────────┐
│ Frame: factorial(1)                              │
│ n=1: base case, returns 1                        │
└─────────────────────────────────────────────────┘

Unwind: factorial(1) returns 1
        factorial(2) computes 2 * 1 = 2, returns 2
        factorial(3) computes 3 * 2 = 6, returns 6
        factorial(4) computes 4 * 6 = 24, returns 24

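Each recursive call adds a frame to the stack, so a recursion N levels deep has N live frames (32 bytes each in the recursive case here: return address, saved RBP, saved RDI, and alignment pad). For large N, you get a stack overflow: RSP moves below the lowest address mapped for the stack and the OS delivers a SIGSEGV.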
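For comparison, here is a sketch of the same computation written as a loop. It is not part of the recursive example above (factorial_iter is a made-up name); it is shown only because it uses no stack space beyond the return address, however large n is.

```nasm
; int64_t factorial_iter(int64_t n)   (hypothetical example function)
; RDI = n, returns result in RAX; same results as the recursive version
section .text
global factorial_iter

factorial_iter:
    mov  eax, 1             ; rax = 1 (writing EAX zero-extends into RAX)
.loop:
    cmp  rdi, 1
    jle  .done              ; stop once n <= 1
    imul rax, rdi           ; rax *= n
    dec  rdi                ; n -= 1
    jmp  .loop
.done:
    ret                     ; leaf function: RSP was never touched
```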

Leaf Functions: Optimization

A leaf function is one that calls no other functions. It does not need to maintain a full stack frame:

; Leaf function — no calls, can skip the frame setup
abs64:
    mov  rax, rdi
    neg  rax           ; rax = -rdi; SF reflects the sign of the negated value
    cmovs rax, rdi     ; negated value is negative, so rdi was positive: use original
    ret
    ; No push/pop rbp needed: we don't call anything,
    ; and RSP is never modified, so ret works correctly

Even if the function has local variables, a leaf function can keep them in registers (which it can do freely since no call can clobber them between instructions). Only when a function either calls other functions or cannot fit all state in registers does it need a stack frame.
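As a further sketch of a leaf function that keeps all of its working state in registers, the following returns the largest of three signed integers. The name max3 is hypothetical.

```nasm
; int64_t max3(int64_t a, int64_t b, int64_t c)   (hypothetical example function)
; RDI = a, RSI = b, RDX = c; returns the largest in RAX
section .text
global max3

max3:
    mov   rax, rdi          ; candidate = a
    cmp   rsi, rax
    cmovg rax, rsi          ; if b > candidate (signed), candidate = b
    cmp   rdx, rax
    cmovg rax, rdx          ; if c > candidate (signed), candidate = c
    ret                     ; no frame, no stack traffic
```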

Variadic Functions (Brief)

Variadic functions (like printf) accept a variable number of arguments. The ABI specifies that when calling a variadic function, AL must contain an upper bound on the number of vector (SSE) registers used for arguments. The callee uses it to decide how many XMM registers to save alongside the integer argument registers.

At the assembly level, printf("hello") in NASM:

lea  rdi, [rel fmt_string]  ; arg 1: format string
xor  al, al                  ; 0 floating-point args in vector registers
call printf

The xor al, al sets that vector-register count to zero, since this call passes no floating-point arguments.
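When a floating-point argument is present, AL changes accordingly. The following is a minimal sketch of a complete program that passes one integer and one double to printf; the label names fmt and val are made up, and it assumes you assemble with nasm -f elf64 and link through gcc so that the C runtime calls main.

```nasm
; Prints "42 2.500000" followed by a newline.
default rel
extern printf

section .rodata
fmt: db "%d %f", 10, 0      ; format string (hypothetical label)
val: dq 2.5                 ; the double to print (hypothetical label)

section .text
global main

main:
    sub   rsp, 8            ; entry RSP ≡ 8 (mod 16); align to 16 for the call
    lea   rdi, [fmt]        ; arg 1: format string (integer class)
    mov   esi, 42           ; arg 2: int (integer class)
    movsd xmm0, [val]       ; arg 3: double, first floating-point argument
    mov   eax, 1            ; AL = 1: one vector register carries an argument
    call  printf wrt ..plt
    xor   eax, eax          ; return 0 from main
    add   rsp, 8
    ret
```

Note that the double takes XMM0 even though it is the third argument overall: the integer and floating-point register sequences are filled independently.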

The Buffer Overflow Connection: Return Address on the Stack

Look at the stack layout for any function:

[rbp + 8] = return address   ← this is what RET jumps to
[rbp + 0] = saved RBP
[rbp - 8] = local variable 1
[rbp - 16] = local variable 2  (e.g., a character buffer)

If local variable 2 is a character buffer, and a function like gets() or an unchecked strcpy() writes more bytes into the buffer than it can hold, the overflow continues into local variable 1, then into the saved RBP, and then into the return address. When ret executes, it jumps to whatever is now in [rbp+8] — which an attacker has overwritten.

This is the fundamental mechanism of a stack buffer overflow exploit. You will see the exact stack layout, the overflow mechanics, and the exploit construction in Chapters 35-37. For now, the key insight: the return address is a pointer on the stack, positioned at a fixed offset above any local buffers. It is not protected by default. When you write to a local buffer without bounds checking, you can reach it.

🔐 Security Note: Stack canaries (-fstack-protector in GCC) insert a random value between the local variables and the saved RBP/return address. Before RET, the function checks that the canary value is unchanged. If it has been overwritten, the program terminates with a stack smashing error. This defense is now enabled by default in most distributions.

MinOS Kernel: Implementing the Calling Convention

; In MinOS, we define our own calling convention for kernel code:
; Arguments: RDI, RSI, RDX, RCX (same as System V for simplicity)
; Return: RAX
; Callee-saved: RBX, R12-R15, RBP
; The kernel stack is separate from user stack (switched on syscall/interrupt)

; MinOS system call dispatch:
; RAX = system call number
; Arguments follow the System V convention

syscall_handler:
    ; Save all caller-saved registers (could have been doing anything)
    push rcx            ; RCX is clobbered by syscall instruction (holds RIP)
    push r11            ; R11 is clobbered by syscall (holds old RFLAGS)
    push rdi
    push rsi
    push rdx
    push r10            ; R10 used instead of RCX for syscall arg4
    push r8
    push r9

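    ; Dispatch to kernel function via jump table
    cmp  rax, SYSCALL_MAX
    jae  .invalid_syscall
    lea  r11, [rel syscall_table]   ; RIP-relative addressing cannot take an index register,
    jmp  [r11 + rax*8]              ; so load the table base first (R11 is already saved)
    ; handlers are assumed to finish with: jmp syscall_handler.return_from_syscall

.invalid_syscall:
    mov  rax, -1        ; placeholder error value; MinOS may define its own codes
    ; fall through to the common return path

.return_from_syscall:
    ; RAX holds return value
    pop r9
    pop r8
    pop r10
    pop rdx
    pop rsi
    pop rdi
    pop r11             ; restore saved RFLAGS image
    pop rcx             ; restore user RIP (sysret will use it)
    o64 sysret          ; return to 64-bit user mode (plain sysret returns to 32-bit compatibility mode)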

📐 OS Kernel Project (MinOS): This pattern — saving registers, dispatching via jump table, restoring and returning — is the kernel system call handler you will implement in the MinOS project. It combines the jump tables from Chapter 10 with the stack frame discipline from this chapter. The syscall/sysret instructions are the ring 0 ↔ ring 3 transition mechanism covered in depth in Chapter 29.

Register Trace: Full Function Call

; Caller calls: long result = add_two(3, 5)
; add_two: RDI=a, RSI=b, returns RAX = a+b
; With full callee-save protocol

section .text
global add_two

add_two:
    push rbp               ; save RBP
    mov  rbp, rsp          ; frame pointer
    ; (no locals needed)
    lea  rax, [rdi + rsi]  ; rax = a + b (using LEA to avoid clobbering flags)
    pop  rbp               ; restore RBP
    ret

Instruction           RAX  RBP         RSP     Stack top        Notes
(entry)               ?    caller_RBP  0x0FF8  ret_addr         RSP was 0x1000 before CALL pushed the return address
push rbp              ?    caller_RBP  0x0FF0  caller_RBP       RBP saved
mov rbp, rsp          ?    0x0FF0      0x0FF0  caller_RBP       Frame established
lea rax, [rdi+rsi]    8    0x0FF0      0x0FF0  caller_RBP       Result computed (3+5)
pop rbp               8    caller_RBP  0x0FF8  ret_addr         RBP restored
ret                   8    caller_RBP  0x1000  (caller's data)  Returns to caller

After ret: RSP is back to 0x1000 (same as before the CALL instruction). RBP is restored. RAX holds the return value 8.

Summary

The stack is the mechanism that makes functions work. PUSH/POP manipulate it through RSP. CALL plants a return address; RET jumps to it. The function prologue establishes a stable frame pointer (RBP) and allocates local variable space; the epilogue tears it down. The System V AMD64 ABI specifies exactly which registers carry arguments (RDI, RSI, RDX, RCX, R8, R9), which carry the return value (RAX), and which are callee-saved (RBX, RBP, R12-R15) — the contract that allows assembly functions to interoperate with C code.

The 16-byte stack alignment requirement exists because aligned SSE instructions such as movaps need 16-byte aligned memory operands. The return address on the stack is the fundamental target of buffer overflow attacks. The frame pointer can be omitted for performance (GCC -O1 and above do this by default), but omitting it complicates debugging and stack unwinding.

In Chapter 12, you will use the stack management skills from this chapter to implement string and data structure operations that allocate and manipulate memory systematically.