In This Chapter
- The Stack Is Just Memory
- PUSH and POP
- CALL and RET: The Function Call Mechanism
- Stack Diagrams
- The Standard Function Prologue and Epilogue
- The System V AMD64 ABI: The Linux Calling Convention
- A Complete Worked Example: Recursive Factorial
- Leaf Functions: Optimization
- Variadic Functions (Brief)
- The Buffer Overflow Connection: Return Address on the Stack
- MinOS Kernel: Implementing the Calling Convention
- Register Trace: Full Function Call
- Summary
Chapter 11: The Stack and Function Calls
The Stack Is Just Memory
The hardware stack is ordinary memory. What makes it "the stack" is the convention that RSP always points to the top, that growth is downward (toward lower addresses), and that PUSH and POP manipulate it through RSP. None of that is enforced by the hardware except for the RSP-manipulation behavior of PUSH, POP, CALL, and RET. You can ignore the stack and use memory any way you want — but you would be breaking every calling convention and every debugger, and you would not be able to call C library functions. So you use the stack.
This chapter covers how PUSH and POP work, how CALL plants a return address on the stack, how RET finds and jumps to it, and the full System V AMD64 ABI: the calling convention used by Linux, macOS, the BSDs, and most other Unix-like x86-64 systems. By the end of this chapter, you will be able to write functions that call and are called by C code correctly, implement recursive functions with full stack frame management, and understand exactly which memory layout Chapter 35's buffer overflow attack exploits.
PUSH and POP
push rax ; RSP -= 8; [RSP] = RAX
pop rbx ; RBX = [RSP]; RSP += 8
PUSH decrements RSP first, then stores the value. POP loads first, then increments RSP. The stack grows toward lower addresses.
; Initial state: RSP = 0x7FFFFFFFE000
push rax ; RSP = 0x7FFFFFFFDFF8, [0x7FFFFFFFDFF8] = RAX
push rbx ; RSP = 0x7FFFFFFFDFF0, [0x7FFFFFFFDFF0] = RBX
pop rcx ; RCX = [0x7FFFFFFFDFF0] = RBX's value; RSP = 0x7FFFFFFFDFF8
pop rdx ; RDX = [0x7FFFFFFFDFF8] = RAX's value; RSP = 0x7FFFFFFFE000
After the four instructions, RSP is back to its original value and RCX/RDX hold the original values of RBX/RAX respectively (LIFO order). The memory at those stack addresses still holds the values, but they are now considered garbage and will be overwritten by the next push.
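The four-instruction sequence above can be checked with a small Python model. This is only a sketch of the arithmetic: the `Stack` class, its `mem` dictionary, and the sample register values are illustrative assumptions, not part of any real tool.

```python
# Toy model of the x86-64 stack: memory is a dict from address to 8-byte value.
class Stack:
    def __init__(self, rsp):
        self.rsp = rsp
        self.mem = {}

    def push(self, value):
        self.rsp -= 8                 # PUSH decrements RSP first...
        self.mem[self.rsp] = value    # ...then stores at the new top

    def pop(self):
        value = self.mem[self.rsp]    # POP loads from the top first...
        self.rsp += 8                 # ...then increments RSP
        return value


stack = Stack(rsp=0x7FFFFFFFE000)
rax, rbx = 0x1111, 0x2222             # arbitrary sample values

stack.push(rax)                       # RSP = 0x7FFFFFFFDFF8
stack.push(rbx)                       # RSP = 0x7FFFFFFFDFF0
rcx = stack.pop()                     # RCX receives RBX's value
rdx = stack.pop()                     # RDX receives RAX's value

print(hex(stack.rsp), hex(rcx), hex(rdx))   # 0x7fffffffe000 0x2222 0x1111
```

After both pops, `stack.mem` still contains the two stored values, mirroring the stale data left below RSP on real hardware.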
PUSH and POP Sizes
push rax ; pushes 8 bytes (64-bit, the default)
push eax ; NOT valid in 64-bit mode (32-bit push is not encodable)
push ax ; pushes 2 bytes (16-bit push, valid but rare)
push 42 ; pushes 8 bytes (sign-extended 32-bit immediate)
push qword [rbx] ; pushes 8 bytes from memory
In 64-bit mode, PUSH/POP default to 64-bit operand size. You cannot push a 32-bit value directly; the operand size is either 64 or 16.
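To make the sign extension of an immediate push concrete, the following Python sketch computes the 64-bit pattern that lands on the stack. The helper name `push_imm_qword` is hypothetical; it only reproduces the arithmetic and does not model the instruction encoding.

```python
def push_imm_qword(imm32):
    """Return the 64-bit pattern stored by `push imm32` in 64-bit mode."""
    if not -2**31 <= imm32 < 2**31:
        raise ValueError("immediate does not fit in a signed 32 bits")
    # Masking a Python int to 64 bits yields the two's-complement
    # representation of the sign-extended value.
    return imm32 & 0xFFFFFFFFFFFFFFFF


print(hex(push_imm_qword(42)))   # 0x2a
print(hex(push_imm_qword(-1)))   # 0xffffffffffffffff
```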
CALL and RET: The Function Call Mechanism
CALL
call target ; push return_address; jmp target
; return_address = address of the instruction AFTER the call
Conceptually, CALL does the following as a single instruction:
; What CALL target does (pseudocode: RIP cannot be used as a MOV source operand):
sub rsp, 8
mov [rsp], rip_of_next_instruction
jmp target
The return address pushed is the address of the instruction immediately following the CALL instruction — the place execution should resume after the function returns.
; Example:
0x401000: call my_function ; push 0x401005; jmp my_function
0x401005: mov rax, 1 ; ← this is the return address
RET
ret ; pop return_address; jmp return_address
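RET behaves like the following sequence, except that the real instruction uses no scratch register (RCX appears here only for illustration):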
pop rcx ; rcx = [rsp]; rsp += 8
jmp rcx ; jump to the return address
The ret at the end of a function restores control to the caller by jumping to the address on the stack that CALL placed there.
⚠️ Common Mistake: The call/return mechanism works correctly only if RSP is pointing to the return address when ret executes, i.e., RSP must have the same value at ret as it did when CALL pushed the address. If your function pushes registers, allocates stack space, or does anything that changes RSP, you must restore RSP to the correct value before ret. This is the purpose of the function prologue and epilogue.
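The effect of an unbalanced stack can be seen in a short Python sketch. It models a call into a function that pushes one register and either pops it again or forgets to. The function name `run`, the addresses, and the value `0xBAD` are assumptions chosen for illustration.

```python
def run(balanced):
    """Model `call f` where f pushes one register and optionally pops it before ret."""
    rsp = 0x1000
    mem = {}
    return_address = 0x401005

    # call f: push the return address
    rsp -= 8
    mem[rsp] = return_address

    # inside f: push rbx (0xBAD is an arbitrary sample value)
    rsp -= 8
    mem[rsp] = 0xBAD

    if balanced:
        rsp += 8              # pop rbx: RSP points at the return address again

    # ret: pop the top of the stack into RIP
    rip = mem[rsp]
    rsp += 8
    return rip, rsp


print([hex(v) for v in run(balanced=True)])    # ['0x401005', '0x1000']
print([hex(v) for v in run(balanced=False)])   # ['0xbad', '0xff8']
```

In the unbalanced case, RET pops the saved register value instead of the return address, so execution continues at a meaningless address.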
Stack Diagrams
Let us trace through a complete function call with memory diagrams.
Before the Call
Stack (high addresses at top):
┌──────────────────────────────────┐
│ ... caller's stack frame ... │ ← higher addresses
├──────────────────────────────────┤
│ (caller's local variables) │
└──────────────────────────────────┘
↑
RSP = 0x7FFFFFFFE010
After call my_func
┌──────────────────────────────────┐
│ ... caller's stack frame ... │
├──────────────────────────────────┤
│ (caller's local variables) │
├──────────────────────────────────┤ ← RSP was here before CALL
│ return address (0x401005) │ 8 bytes
└──────────────────────────────────┘
↑
RSP = 0x7FFFFFFFE008 (RSP decreased by 8)
After the Prologue push rbp; mov rbp, rsp
┌──────────────────────────────────┐
│ ... caller's stack frame ... │
├──────────────────────────────────┤
│ (caller's local variables) │
├──────────────────────────────────┤
│ return address (0x401005) │ [RBP+8]
├──────────────────────────────────┤
│ saved RBP (caller's RBP) │ [RBP+0] ← RSP and RBP both point here
└──────────────────────────────────┘
↑
RSP = RBP = 0x7FFFFFFFE000
After sub rsp, 32 (allocate 32 bytes for locals)
┌──────────────────────────────────┐
│ ... caller's stack frame ... │
├──────────────────────────────────┤
│ (caller's local variables) │
├──────────────────────────────────┤
│ return address │ [RBP+8]
├──────────────────────────────────┤
│ saved RBP │ [RBP+0] ← RBP points here
├──────────────────────────────────┤
│ local var 1 (8 bytes) │ [RBP-8]
├──────────────────────────────────┤
│ local var 2 (8 bytes) │ [RBP-16]
├──────────────────────────────────┤
│ local var 3 (8 bytes) │ [RBP-24]
├──────────────────────────────────┤
│ local var 4 (8 bytes) │ [RBP-32]
└──────────────────────────────────┘
↑
RSP = 0x7FFFFFFFDFE0 (RBP - 32)
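The addresses in these diagrams follow from simple subtraction, which the following Python sketch reproduces. The function `frame_layout` and its dictionary keys are hypothetical names used only for this illustration.

```python
def frame_layout(rsp_before_call, local_bytes):
    """Key addresses after `call`, `push rbp; mov rbp, rsp`, and `sub rsp, local_bytes`."""
    ret_slot = rsp_before_call - 8      # CALL pushes the return address
    rbp = ret_slot - 8                  # push rbp; mov rbp, rsp
    rsp = rbp - local_bytes             # sub rsp, local_bytes
    return {"return_address": ret_slot, "rbp": rbp, "rsp": rsp}


layout = frame_layout(0x7FFFFFFFE010, 32)
for name, addr in layout.items():
    print(f"{name:15} {addr:#x}")
```

The return address slot is always 8 bytes above the address held in RBP, which is why it is addressed as [RBP+8] once the frame is established.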
The Standard Function Prologue and Epilogue
Prologue
my_function:
push rbp ; save caller's frame pointer
mov rbp, rsp ; establish our frame pointer
sub rsp, N ; allocate N bytes of local storage
; N must make RSP 16-byte aligned
After the prologue, the frame is established:
- [rbp] = saved caller's RBP
- [rbp + 8] = return address (put there by CALL)
- [rbp - 8], [rbp - 16], ... = local variables
Epilogue
; ... function body ...
leave ; = mov rsp, rbp; pop rbp
ret ; = pop rip (implicit)
LEAVE is the single-instruction epilogue: it restores RSP from RBP (deallocates locals), then pops RBP (restores caller's frame pointer). Then RET pops the return address and jumps to it.
The two-instruction version is also common:
mov rsp, rbp ; deallocate locals (RSP = RBP)
pop rbp ; restore caller's RBP
ret
Frame Pointer Omission (-fomit-frame-pointer)
Modern compilers often omit the frame pointer with -fomit-frame-pointer (GCC's default at -O1 and above). Without a frame pointer, the function uses RSP directly for addressing locals:
; Without frame pointer:
my_function_no_fp:
sub rsp, 24 ; allocate space for locals + alignment
; Locals at [rsp+0], [rsp+8], [rsp+16]
; (no RBP established)
add rsp, 24 ; deallocate
ret
Benefits: RBP is free to use as a general-purpose register, saving one push/pop per function. Drawback: debuggers and stack unwinding tools cannot easily walk the call stack without frame pointers (though the .eh_frame section provides DWARF unwind info that compensates).
💡 Mental Model: The frame pointer (RBP) is a stable anchor for the current stack frame. Local variable [rbp-8] always refers to the same location regardless of what happens to RSP during the function. With -fomit-frame-pointer, the compiler tracks the RSP offset at every instruction to correctly reference locals, which is fine for the compiler but harder for humans reading the assembly.
The System V AMD64 ABI: The Linux Calling Convention
The System V AMD64 ABI (also used by macOS, the BSDs, and most other Unix-like x86-64 systems) defines:
1. Which registers pass function arguments
2. Which register receives the return value
3. Which registers must be preserved across calls (callee-saved)
4. Which registers may be clobbered (caller-saved)
5. Stack alignment requirements
Argument Passing
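Arguments are passed in registers first, then on the stack. Integer/pointer and floating-point arguments are counted separately, so each class fills its own register sequence. Arguments that do not fit in registers are pushed on the stack right to left.
| Argument # (within its class) | Integer/Pointer | Floating-Point |
|---|---|---|
| 1st | RDI | XMM0 |
| 2nd | RSI | XMM1 |
| 3rd | RDX | XMM2 |
| 4th | RCX | XMM3 |
| 5th | R8 | XMM4 |
| 6th | R9 | XMM5 |
| 7th | Stack | XMM6 |
| 8th | Stack | XMM7 |
| 9th, 10th, ... | Stack | Stack |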
// C function signature:
int foo(int a, int b, int c, int d, int e, int f, int g, int h);
; How to call foo(1, 2, 3, 4, 5, 6, 7, 8):
mov edi, 1 ; arg 1: a
mov esi, 2 ; arg 2: b
mov edx, 3 ; arg 3: c
mov ecx, 4 ; arg 4: d
mov r8d, 5 ; arg 5: e
mov r9d, 6 ; arg 6: f
push 8 ; arg 8: h (pushed right-to-left)
push 7 ; arg 7: g (pushed right-to-left)
call foo
add rsp, 16 ; remove stack arguments (caller's responsibility)
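The placement of integer arguments in the call above can be sketched in Python. This simplified model handles integer and pointer arguments only; the names `INT_ARG_REGS` and `place_int_args` are assumptions made for this example.

```python
INT_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]


def place_int_args(args):
    """Split integer arguments into register assignments and a push order."""
    in_regs = dict(zip(INT_ARG_REGS, args))       # first six go in registers
    on_stack = list(args[len(INT_ARG_REGS):])     # the rest go on the stack
    push_order = on_stack[::-1]                   # pushed right to left
    return in_regs, push_order


regs, pushes = place_int_args([1, 2, 3, 4, 5, 6, 7, 8])
print(regs)     # {'rdi': 1, 'rsi': 2, 'rdx': 3, 'rcx': 4, 'r8': 5, 'r9': 6}
print(pushes)   # [8, 7]
```

Pushing right to left leaves the 7th argument at the lowest address, so the callee finds it at [rsp+8] on entry, directly above the return address.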
Return Values
| Size | Register |
|---|---|
| 1-64 bits (integer) | RAX |
| 65-128 bits (integer) | RDX:RAX (RDX = high half) |
| 32 bits (float) | XMM0 |
| 64 bits (double) | XMM0 |
| Struct larger than 16 bytes | Memory: the caller allocates space and passes its address in RDI as a hidden first argument; the callee returns that address in RAX |
Register Preservation Contract
The ABI divides registers into two classes:
Caller-saved (volatile): The called function may freely modify these. If the caller needs them after the call, the caller must save them before the call.
| Caller-saved registers |
|---|
| RAX, RCX, RDX, RSI, RDI, R8, R9, R10, R11 |
| XMM0-XMM15 (all XMM/YMM/ZMM) |
| RFLAGS |
Callee-saved (preserved): The called function must restore these to their original values before returning. If the function uses them internally, it must push them at the start and pop them before returning.
| Callee-saved registers |
|---|
| RBX, RBP, R12, R13, R14, R15 |
RSP is implicitly preserved: just before RET executes, RSP must equal its value at function entry, so that after RET pops the return address RSP equals its value immediately before the CALL.
⚠️ Common Mistake: Using RBX without saving/restoring it. RBX is a callee-saved register: if you use it in a function without pushing it first, you corrupt the caller's RBX. The fix: push rbx in the prologue, pop rbx in the epilogue.
; Correct use of callee-saved registers:
my_function:
push rbp
mov rbp, rsp
push rbx ; save RBX (we will use it)
push r12 ; save R12 (we will use it)
push r13 ; save R13
sub rsp, 8 ; pad: four pushes since entry leave RSP ≡ 8 (mod 16)
; Now we can use rbx, r12, r13 freely
mov rbx, rdi ; keep arg1 across function calls
mov r12, rsi ; keep arg2 across function calls
call some_other_function ; RSP is 16-byte aligned here
; rbx and r12 still hold our values
; (some_other_function must preserve them)
add rsp, 8 ; drop the alignment pad
pop r13 ; restore in reverse order of the pushes
pop r12
pop rbx
pop rbp
ret
The 16-Byte Stack Alignment Requirement
Before any call instruction, RSP must be 16-byte aligned. After the call instruction pushes the 8-byte return address, RSP will be 8-byte aligned but not 16-byte aligned — and that is the state when the called function begins.
The rule in practice: at the point of a call, RSP must be divisible by 16 (i.e., the low 4 bits of RSP are 0000). After the call instruction pushes 8 bytes, RSP is divisible by 8 but not 16 (low 4 bits = 1000).
Why does this matter? Aligned SSE loads and stores such as movaps and movapd require 16-byte-aligned memory operands, and compilers use them on stack data because the ABI guarantees the alignment. If a called function's local variables end up at misaligned addresses (because the stack was misaligned when the function was entered), it will crash with a general protection fault (reported on Linux as SIGSEGV) when it executes one of these instructions.
; In _start (entered by the kernel, not by CALL, so RSP is already 16-byte aligned):
_start:
; RSP is 16-byte aligned here (the OS guarantees this at process start)
call some_function ; RSP becomes RSP-8, now 8-byte aligned (not 16)
; some_function's prologue sees RSP % 16 == 8 (correct for a called function)
; Before any call, RSP must be 16-byte aligned:
; If RSP % 16 == 8 here (e.g., after an odd number of pushes from an aligned RSP):
sub rsp, 8 ; align to 16 before the call
call other_function
add rsp, 8 ; remove the padding after the call
The convention: at function entry (just after the CALL), RSP % 16 == 8 (because CALL pushed 8 bytes). The function prologue (push rbp) pushes 8 more bytes, making RSP % 16 == 0. Then sub rsp, N allocates space; N must be divisible by 16 to maintain alignment.
; Function with 3 locals of 8 bytes each (24 bytes total):
my_func:
push rbp ; RSP was ≡ 8 (mod 16); now RSP ≡ 0 (mod 16)
mov rbp, rsp
sub rsp, 24 ; 24 bytes NOT a multiple of 16!
; RSP is now ≡ 8 (mod 16) — MISALIGNED before any nested calls
; Fix: use sub rsp, 32 (or 48, or any multiple of 16 >= 24)
⚙️ How It Works: The 16-byte alignment requirement exists because movaps and related aligned SSE instructions fault on unaligned addresses. C compilers size each stack frame so that RSP is a multiple of 16 at every call site. When you write assembly by hand, you must do the same arithmetic.
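The alignment arithmetic is easy to get wrong by hand, so a quick check can help. The Python sketch below rounds a local-variable size up to a multiple of 16 and computes RSP mod 16 at a nested call; both helper names are hypothetical.

```python
def aligned_locals(needed_bytes):
    """Round a local-variable size up to the next multiple of 16."""
    return (needed_bytes + 15) & ~15


def rsp_mod16_at_call(pushes, sub_bytes):
    """RSP mod 16 at a nested call, given the pushes and `sub rsp` done since entry.

    At function entry RSP % 16 == 8, because CALL pushed the 8-byte return address.
    """
    return (8 - 8 * pushes - sub_bytes) % 16


print(aligned_locals(24))          # 32
print(rsp_mod16_at_call(1, 24))    # 8, misaligned
print(rsp_mod16_at_call(1, 32))    # 0, aligned
```

A result of 0 from `rsp_mod16_at_call` means a `call` at that point satisfies the ABI; any other value means padding is still needed.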
A Complete Worked Example: Recursive Factorial
; int64_t factorial(int64_t n)
; RDI = n, returns result in RAX
section .text
global factorial
factorial:
push rbp ; save caller's RBP
mov rbp, rsp ; establish frame pointer
; Base case: if n <= 1, return 1
cmp rdi, 1
jle .base_case
; Recursive case: return n * factorial(n-1)
push rdi ; save n (RDI is caller-saved, but we need it after CALL)
; RSP was ≡ 0 (mod 16) after push rbp; push rdi makes ≡ 8
sub rsp, 8 ; align RSP to 16 bytes before the call
lea rdi, [rdi - 1] ; arg: n-1
call factorial ; rax = factorial(n-1)
add rsp, 8 ; restore alignment adjustment
pop rdi ; restore n
imul rax, rdi ; rax = n * factorial(n-1)
jmp .return
.base_case:
mov rax, 1
.return:
pop rbp
ret
Stack Frame Trace for factorial(4)
Call: factorial(4)
┌─────────────────────────────────────────────────┐
│ Frame: factorial(4) │
│ [rbp+8] = return address to caller │
│ [rbp+0] = saved caller's RBP │
│ [rsp+8] = saved RDI = 4 (pushed after rbp) │
│ [rsp+0] = alignment pad (8 bytes) │
└─────────────────────────────────────────────────┘
↓ calls factorial(3)
┌─────────────────────────────────────────────────┐
│ Frame: factorial(3) │
│ [rbp+8] = return address to factorial(4) │
│ [rbp+0] = saved RBP (factorial(4)'s rbp) │
│ [rsp+8] = saved RDI = 3 │
│ [rsp+0] = alignment pad │
└─────────────────────────────────────────────────┘
↓ calls factorial(2)
┌─────────────────────────────────────────────────┐
│ Frame: factorial(2) │
│ ... same pattern ... │
│ RDI = 2 │
└─────────────────────────────────────────────────┘
↓ calls factorial(1)
┌─────────────────────────────────────────────────┐
│ Frame: factorial(1) │
│ n=1: base case, returns 1 │
└─────────────────────────────────────────────────┘
Unwind: factorial(1) returns 1
factorial(2) computes 2 * 1 = 2, returns 2
factorial(3) computes 3 * 2 = 6, returns 6
factorial(4) computes 4 * 6 = 24, returns 24
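Each recursive call adds a frame to the stack, so a recursion N levels deep has N frames live at once. Here each recursive frame takes 32 bytes: the return address, the saved RBP, the saved n, and the alignment pad. For large N, you get a stack overflow: RSP moves below the lowest address of the stack's allocated region, and on Linux the resulting page fault is delivered to the process as SIGSEGV.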
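The stack cost of the recursion can be estimated with a Python sketch that mirrors the assembly's control flow. The frame sizes are derived from the listing above (return address, saved RBP, saved n, and pad for a recursive call; return address and saved RBP for the base case). The function and constant names are assumptions made for this example.

```python
FRAME_RECURSIVE = 32   # return address + saved RBP + saved n + 8-byte pad
FRAME_BASE = 16        # return address + saved RBP (the base case pushes nothing else)


def factorial_traced(n, depth=1, stats=None):
    """Compute n! recursively while recording the deepest nesting level reached."""
    if stats is None:
        stats = {"max_depth": 0}
    stats["max_depth"] = max(stats["max_depth"], depth)
    if n <= 1:
        return 1, stats
    sub_result, _ = factorial_traced(n - 1, depth + 1, stats)
    return n * sub_result, stats


def peak_stack_bytes(max_depth):
    """Stack bytes in use at the deepest point of the recursion."""
    return (max_depth - 1) * FRAME_RECURSIVE + FRAME_BASE


result, stats = factorial_traced(4)
print(result, stats["max_depth"], peak_stack_bytes(stats["max_depth"]))   # 24 4 112
```

Assuming an 8 MiB stack limit (a common Linux default, though `ulimit -s` reports the actual value), 32-byte frames would exhaust the stack after roughly 262,000 nested calls.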
Leaf Functions: Optimization
A leaf function is one that calls no other functions. It does not need to maintain a full stack frame:
; Leaf function — no calls, can skip the frame setup
abs64:
mov rax, rdi
neg rax
cmovs rax, rdi ; if the negated value is negative (rdi was positive), use original
ret
; No push/pop rbp needed: we don't call anything,
; and RSP is never modified, so ret works correctly
Even if the function has local variables, a leaf function can keep them in registers (which it can do freely since no call can clobber them between instructions). Only when a function either calls other functions or cannot fit all state in registers does it need a stack frame.
Variadic Functions (Brief)
Variadic functions (like printf) accept a variable number of arguments. The ABI specifies that when calling a variadic function, AL must contain an upper bound on the number of vector (SSE) registers used to pass arguments. The callee uses that value to decide whether it needs to save the XMM argument registers in addition to the integer registers.
At the assembly level, printf("hello") in NASM:
lea rdi, [rel fmt_string] ; arg 1: format string
xor al, al ; 0 floating-point args in vector registers
call printf
The xor al, al sets AL to 0, telling printf that no arguments were passed in vector registers.
The Buffer Overflow Connection: Return Address on the Stack
Look at the stack layout for any function:
[rbp + 8] = return address ← this is what RET jumps to
[rbp + 0] = saved RBP
[rbp - 8] = local variable 1
[rbp - 16] = local variable 2 (e.g., a character buffer)
If local variable 2 is a character buffer, and a function like gets() or an unchecked strcpy() writes more bytes into the buffer than it can hold, the overflow continues into local variable 1, then into the saved RBP, and then into the return address. When ret executes, it jumps to whatever is now in [rbp+8] — which an attacker has overwritten.
This is the fundamental mechanism of a stack buffer overflow exploit. You will see the exact stack layout, the overflow mechanics, and the exploit construction in Chapters 35-37. For now, the key insight: the return address is a pointer on the stack, positioned at a fixed offset above any local buffers. It is not protected by default. When you write to a local buffer without bounds checking, you can reach it.
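The overwrite can be modeled with a byte array standing in for the frame. In this Python sketch the frame is laid out from low to high address as an 8-byte buffer, local variable 1, the saved RBP, and the return address. All addresses and the `unchecked_copy` helper are illustrative assumptions, not a real exploit.

```python
import struct

# Frame bytes from low to high address: buffer, local 1, saved RBP, return address.
frame = bytearray(32)
struct.pack_into("<Q", frame, 8, 0x1234)              # local variable 1 at [rbp-8]
struct.pack_into("<Q", frame, 16, 0x7FFFFFFFE100)     # saved RBP at [rbp+0]
struct.pack_into("<Q", frame, 24, 0x401005)           # return address at [rbp+8]


def unchecked_copy(frame, data):
    """Copy data into the buffer at offset 0 with no bounds check, like gets()."""
    frame[0:len(data)] = data


# 24 filler bytes cover the buffer, local 1, and saved RBP; the last 8 bytes
# land exactly on the return address slot.
attacker_input = b"A" * 24 + struct.pack("<Q", 0xDEADBEEF)
unchecked_copy(frame, attacker_input)

(return_address,) = struct.unpack_from("<Q", frame, 24)
print(hex(return_address))   # 0xdeadbeef
```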
🔐 Security Note: Stack canaries (-fstack-protector in GCC) insert a random value between the local variables and the saved RBP/return address. Before RET, the function checks that the canary value is unchanged. If it has been overwritten, the program terminates with a stack smashing error. This defense is now enabled by default in most distributions.
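A canary check can be sketched in the same style. In this Python model the frame holds, from low to high address, an 8-byte buffer, the canary, the saved RBP, and the return address. The layout and the `canary_intact` helper are assumptions for illustration; real compilers choose their own placement. The canary's low byte is forced to zero in this sketch so that the filler bytes can never match it by chance.

```python
import secrets
import struct

# Frame bytes from low to high address: buffer, canary, saved RBP, return address.
canary = secrets.randbits(56) << 8                     # random value, low byte zero
frame = bytearray(32)
struct.pack_into("<Q", frame, 8, canary)               # canary sits above the buffer
struct.pack_into("<Q", frame, 24, 0x401005)            # return address


def canary_intact(frame, expected):
    """Epilogue check: compare the stored canary with the reference copy."""
    return struct.unpack_from("<Q", frame, 8)[0] == expected


assert canary_intact(frame, canary)                    # an untouched frame passes

# An overflow long enough to reach the return address must cross the canary.
frame[0:32] = b"A" * 24 + struct.pack("<Q", 0xDEADBEEF)
print(canary_intact(frame, canary))   # False
```

Because the canary lies between the buffer and the return address, a linear overflow cannot change the return address without also changing the canary, and the check fails before RET runs.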
MinOS Kernel: Implementing the Calling Convention
; In MinOS, we define our own calling convention for kernel code:
; Arguments: RDI, RSI, RDX, RCX (same as System V for simplicity)
; Return: RAX
; Callee-saved: RBX, R12-R15, RBP
; The kernel stack is separate from user stack (switched on syscall/interrupt)
; MinOS system call dispatch:
; RAX = system call number
; Arguments follow the System V convention
syscall_handler:
; Save all caller-saved registers (could have been doing anything)
push rcx ; RCX is clobbered by syscall instruction (holds RIP)
push r11 ; R11 is clobbered by syscall (holds old RFLAGS)
push rdi
push rsi
push rdx
push r10 ; R10 used instead of RCX for syscall arg4
push r8
push r9
; Dispatch to kernel function via jump table
cmp rax, SYSCALL_MAX
jae .invalid_syscall
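lea r11, [rel syscall_table] ; R11 is free as scratch: its value is saved on the stack
jmp [r11 + rax*8] ; RIP-relative addressing cannot take an index register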
.return_from_syscall:
; RAX holds return value
pop r9
pop r8
pop r10
pop rdx
pop rsi
pop rdi
pop r11 ; restore flags
pop rcx ; restore RIP (sysret will use it)
o64 sysret ; return to 64-bit user mode (plain sysret returns to compatibility mode)
📐 OS Kernel Project (MinOS): This pattern (saving registers, dispatching via jump table, restoring and returning) is the kernel system call handler you will implement in the MinOS project. It combines the jump tables from Chapter 10 with the stack frame discipline from this chapter. The syscall/sysret instructions are the ring 0 ↔ ring 3 transition mechanism covered in depth in Chapter 29.
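The dispatch logic itself (bounds check, then an indexed jump through a table) can be sketched in Python. The handler functions, the table contents, and the use of a negative errno as the failure value are assumptions for this example; MinOS may define these differently.

```python
def sys_write(*args):
    return len(args)          # placeholder body for illustration


def sys_exit(*args):
    return 0                  # placeholder body for illustration


SYSCALL_TABLE = [sys_write, sys_exit]    # hypothetical table; index = syscall number
SYSCALL_MAX = len(SYSCALL_TABLE)
ENOSYS = 38                              # Linux errno for "function not implemented"


def dispatch(number, *args):
    """Bounds-check the syscall number, then call through the table."""
    if not 0 <= number < SYSCALL_MAX:    # like `cmp rax, SYSCALL_MAX` / `jae`
        return -ENOSYS
    return SYSCALL_TABLE[number](*args)


print(dispatch(0, 1, b"hi", 2))   # 3
print(dispatch(7))                # -38
```

The unsigned `jae` comparison in the assembly rejects negative numbers for free, since they appear as very large unsigned values; the Python version needs the explicit lower bound.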
Register Trace: Full Function Call
; Caller calls: long result = add_two(3, 5)
; add_two: RDI=a, RSI=b, returns RAX = a+b
; With full callee-save protocol
section .text
global add_two
add_two:
push rbp ; save RBP
mov rbp, rsp ; frame pointer
; (no locals needed)
lea rax, [rdi + rsi] ; rax = a + b (using LEA to avoid clobbering flags)
pop rbp ; restore RBP
ret
| Instruction | RAX | RBP | RSP | Stack top | Notes |
|---|---|---|---|---|---|
| (entry; RSP was 0x1000 before CALL) | ? | caller_RBP | 0x0FF8 | ret_addr | CALL pushed ret addr |
| push rbp | ? | caller_RBP | 0x0FF0 | caller_RBP | RBP saved |
| mov rbp, rsp | ? | 0x0FF0 | 0x0FF0 | caller_RBP | Frame established |
| lea rax, [rdi+rsi] | 8 | 0x0FF0 | 0x0FF0 | caller_RBP | Result computed (3+5) |
| pop rbp | 8 | caller_RBP | 0x0FF8 | ret_addr | RBP restored |
| ret | 8 | caller_RBP | 0x1000 | caller's data | Returns to caller |
After ret: RSP is back to 0x1000 (same as before the CALL instruction). RBP is restored. RAX holds the return value 8.
Summary
The stack is the mechanism that makes functions work. PUSH/POP manipulate it through RSP. CALL plants a return address; RET jumps to it. The function prologue establishes a stable frame pointer (RBP) and allocates local variable space; the epilogue tears it down. The System V AMD64 ABI specifies exactly which registers carry arguments (RDI, RSI, RDX, RCX, R8, R9), which carry the return value (RAX), and which are callee-saved (RBX, RBP, R12-R15) — the contract that allows assembly functions to interoperate with C code.
The 16-byte stack alignment requirement exists because aligned SSE instructions such as movaps fault on misaligned memory operands. The return address on the stack is the fundamental target of buffer overflow attacks. The frame pointer can be omitted for performance (GCC -O1 and above do this by default), but omitting it complicates debugging and stack unwinding.
In Chapter 12, you will use the stack management skills from this chapter to implement string and data structure operations that allocate and manipulate memory systematically.