In This Chapter
- The Bridge
- 20.1 Why Interface Assembly with C
- 20.2 Linking Assembly with C
- 20.3 Calling C Functions from Assembly
- 20.4 Writing Assembly Functions Callable from C
- 20.5 Accessing C Global Variables from Assembly
- 20.6 Passing Structs
- 20.7 The Red Zone
- 20.8 Variadic Functions from Assembly
- 20.9 Complete Working Mixed C+Assembly Project
- Summary
Chapter 20: Calling C from Assembly and Assembly from C
The Bridge
Pure assembly is powerful and educational, but limited. You could write a network server in pure assembly using only system calls, but the result would be thousands of lines of error-prone code that every standard library already provides. And you cannot call getaddrinfo(), SSL_CTX_new(), or zlib's compress() without interfacing with C.
Similarly, C code that needs hand-optimized SIMD, precise timing via RDTSC, or hardware-specific instructions can call assembly functions — as long as those functions follow the ABI contract.
This chapter is the bridge.
20.1 Why Interface Assembly with C
Assembly Calls C
Assembly programs can call any C function by following the System V AMD64 ABI:
- printf — formatted output without writing your own formatting code
- malloc/free — dynamic memory allocation
- fopen/fread/fwrite/fclose — file I/O with buffering
- socket, connect, send, recv — networking
- pthread_create — threads
- Any third-party library with a C API (OpenSSL, zlib, SQLite, etc.)
C Calls Assembly
C programs can call assembly functions for:
- Performance-critical inner loops (SIMD, specific instruction usage)
- Hardware access (CPUID, RDTSC, port I/O)
- Cryptographic primitives (avoiding compiler optimizations that might remove security-critical zeroing)
- Boot/startup code before the C runtime is initialized
20.2 Linking Assembly with C
NASM extern and global
; In NASM assembly:
extern printf ; tell NASM that printf is defined elsewhere (in C library)
extern malloc ; same for malloc
global fast_checksum ; make fast_checksum visible to other object files
section .text
fast_checksum:
; ... implementation ...
ret
- extern name: declares a symbol defined in another object file
- global name: exports a symbol from this object file (makes it available to the linker)
Building a Mixed C+Assembly Project
# Files:
# - main.c (C entry point, calls assembly)
# - helper.asm (assembly functions called from C)
# Step 1: Compile C to object file
gcc -c main.c -o main.o
# Step 2: Assemble to object file
nasm -f elf64 helper.asm -o helper.o
# Step 3: Link them together
gcc main.o helper.o -o program
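# Note: gcc does not run nasm for you. It treats an unrecognized extension
# such as .asm as linker input, so always assemble with nasm first (step 2).
# If your gcc builds position-independent executables by default, absolute
# references such as "mov rdi, fmt" may require: gcc -no-pie main.o helper.o -o program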
⚠️ Common Mistake: Using ld directly to link instead of gcc. When you link with ld, you bypass the C runtime startup code (crt1.o, crti.o, crtn.o) and the C standard library. Use gcc to link unless you explicitly want no C runtime.
C++ Name Mangling
In C, printf is exported as printf. In C++, void foo(int) is exported as something like _Z3fooi — name mangling encodes the function signature to support overloading.
When C++ code calls assembly (or assembly calls C++ functions), you must either:
1. Declare the assembly function with extern "C" in the C++ code:
extern "C" int fast_checksum(const char *data, size_t len);
2. Or match the mangled name exactly in assembly (fragile, compiler-specific)
The extern "C" approach is always correct.
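A header shared by C and C++ callers commonly wraps the prototype in a preprocessor guard, so that C++ sees extern "C" while C sees a plain declaration. The sketch below is one way to do that; the file name checksum.h and the guard macro CHECKSUM_H are assumptions made for illustration, and the prototype reuses the fast_checksum signature shown above.

```c
/* checksum.h (hypothetical file name): one declaration for C and C++ callers */
#ifndef CHECKSUM_H
#define CHECKSUM_H 1

#include <stddef.h>

#ifdef __cplusplus
extern "C" {            /* C++ callers: use the unmangled symbol name */
#endif

/* Implemented in assembly and exported there with: global fast_checksum */
int fast_checksum(const char *data, size_t len);

#ifdef __cplusplus
}                       /* end of extern "C" block */
#endif

#endif /* CHECKSUM_H */
```

Both a C file and a C++ file can include this header unchanged, and the linker looks for the plain symbol fast_checksum in either case.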
20.3 Calling C Functions from Assembly
Calling printf
The C signature: int printf(const char *format, ...);
Arguments: RDI = format string, RSI = first argument, RDX = second, RCX = third, R8 = fourth, R9 = fifth.
Special rule for variadic functions: RAX must hold the number of XMM (floating-point) registers used for arguments. If you're only passing integers, RAX = 0.
; call_printf.asm
; Print: "Value: 42\n"
extern printf
section .data
fmt: db "Value: %d", 10, 0 ; "Value: %d\n\0"
section .text
print_42:
push rbp
mov rbp, rsp
; No extra sub rsp, 8 here: after push rbp, RSP is already 16-byte aligned,
; and subtracting 8 more would misalign it for the call below.
mov rdi, fmt ; arg1: format string
mov rsi, 42 ; arg2: integer value
xor eax, eax ; AL = 0: no XMM args (variadic convention)
call printf ; call C library printf
leave
ret
Let's trace the stack alignment carefully:
In the caller, before CALL: RSP = 0x7FFFFFFFF0 (16-byte aligned, as the ABI requires at a call site)
On entry, after CALL: RSP = 0x7FFFFFFFE8 (return address pushed; 8 bytes off a 16-byte boundary)
After push rbp: RSP = 0x7FFFFFFFE0 (16-byte aligned again)
If we added sub rsp, 8: RSP = 0x7FFFFFFFD8 (misaligned; printf could crash on an aligned SSE access)
This is why the standard x86-64 entry sequence only subtracts multiples of 16:
; Standard function entry:
push rbp ; On entry RSP is 8 bytes off alignment: the caller was 16-aligned
; before CALL, and CALL pushed an 8-byte return address.
; push rbp pushes 8 more, so RSP is 16-aligned again.
mov rbp, rsp
sub rsp, N ; N must be a multiple of 16 to maintain alignment
So at function entry (after the CALL), RSP is at original_RSP - 8 (8 bytes off a 16-byte boundary). After push rbp, RSP is at original_RSP - 16 (16-byte aligned). After sub rsp, N where N is a multiple of 16, RSP is still 16-byte aligned.
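The alignment arithmetic can be checked with a small C model. This is only a sketch: the function names are made up for illustration, and the model tracks nothing but the value of RSP through CALL, push rbp, and sub rsp, N.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when a stack pointer value is a multiple of 16. */
bool is_aligned16(uint64_t rsp) {
    return (rsp & 0xF) == 0;
}

/* Hypothetical model: RSP at the moment a function issues its own CALL,
 * given the caller's RSP just before calling it and the number of bytes
 * the function subtracts for locals. */
uint64_t rsp_before_inner_call(uint64_t rsp_at_call_site, uint64_t local_bytes) {
    uint64_t rsp = rsp_at_call_site;
    rsp -= 8;            /* CALL pushes the 8-byte return address */
    rsp -= 8;            /* push rbp */
    rsp -= local_bytes;  /* sub rsp, N */
    return rsp;
}
```

With a call-site RSP of 0x7FFFFFFFF0, a local size of 0 or 16 leaves the model aligned, while a local size of 8 does not.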
With no locals and no extra saved registers, the function needs no sub rsp at all:
print_value:
push rbp
mov rbp, rsp
; RSP is now 16-byte aligned: CALL pushed 8 bytes and push rbp pushed 8 more.
; A sub rsp, N (with N a multiple of 16) is only needed for local variables.
mov rdi, fmt ; format string
mov rsi, 42 ; integer
xor eax, eax ; no FP args
call printf ; RSP is 16-byte aligned ✓
pop rbp
ret
The Complete printf Stack Trace
Function call to print_value:
CALL at caller:
RSP = 0xXXXXFFF0 (16-aligned) ← before call
RSP = 0xXXXXFFE8 ← after CALL (pushed 8-byte return addr)
push rbp:
RSP = 0xXXXXFFE0 ← 16-aligned again
[RSP+0] = old RBP
[RSP+8] = return address (was at RSP before push)
call printf:
RSP must be 16-aligned → it is (0xXXXXFFE0) ✓
printf is called correctly
printf with Multiple Arguments and Types
; Print: "Point: (3, 7), Label: hello\n"
extern printf
section .data
fmt2: db "Point: (%d, %d), Label: %s", 10, 0
label: db "hello", 0
section .text
print_point:
push rbp
mov rbp, rsp
sub rsp, 16 ; align + local space (multiple of 16)
mov rdi, fmt2 ; format
mov rsi, 3 ; first %d (x coordinate)
mov rdx, 7 ; second %d (y coordinate)
mov rcx, label ; third %s (string pointer)
xor eax, eax ; no FP args
call printf
leave ; mov rsp, rbp; pop rbp
ret
printf with Floating-Point Arguments
; Print: "Pi = 3.141593\n"
extern printf
section .data
fmt_fp: db "Pi = %f", 10, 0
section .rodata
pi: dq 3.141592653589793 ; 64-bit double
section .text
print_pi:
push rbp
mov rbp, rsp
mov rdi, fmt_fp ; format string
movsd xmm0, [rel pi] ; XMM0 = 3.14159... (double)
mov eax, 1 ; AL = 1: one XMM register used
call printf ; printf knows to look at XMM0 for the %f arg
pop rbp
ret
⚠️ Common Mistake: Forgetting xor eax, eax (or the appropriate count) when calling variadic functions. The System V ABI requires AL to hold an upper bound on the number of XMM registers used for the call. If AL is 0 while a double sits in XMM0, printf will typically print garbage for %f; if AL holds a stale value, the behavior is not guaranteed.
Calling malloc and free
; Dynamic allocation from assembly
extern malloc
extern free
section .text
alloc_buffer:
push rbp
mov rbp, rsp
push rbx ; callee-saved: will use RBX to hold pointer
sub rsp, 8 ; the two pushes left RSP 8 bytes off; pad to realign before calls
; malloc(128) - allocate 128 bytes
mov rdi, 128 ; size argument
call malloc ; RAX = pointer (or NULL on failure)
; Check for NULL
test rax, rax
jz .alloc_failed
mov rbx, rax ; save pointer in callee-saved RBX
; ... use the buffer at [rbx] ...
; e.g., write some data:
mov dword [rbx], 0xDEADBEEF ; dword store: a qword store would sign-extend the 32-bit immediate
; free(ptr)
mov rdi, rbx ; pointer to free
call free ; returns void
.alloc_failed:
; On failure there is nothing to free; add error handling here if needed,
; then fall through to the shared epilogue.
.alloc_done:
add rsp, 8 ; drop the alignment pad
pop rbx
pop rbp
ret
Key insight: keep the malloc return value in a callee-saved register (RBX, R12-R15) for as long as you need it across other calls. Any call, including free, may clobber RAX, RCX, RDX, RSI, RDI, and R8-R11; only callee-saved registers survive a function call.
20.4 Writing Assembly Functions Callable from C
The Requirements
To write an assembly function that C can call:
- Declare it as global in NASM
- In C, declare it as extern (with the right signature)
- Follow the System V AMD64 ABI exactly:
  - Arguments in RDI, RSI, RDX, RCX, R8, R9
  - Return value in RAX
  - Preserve RBX, RBP, R12-R15 (callee-saved)
  - Return with RET (not any other jump)
// In C header:
extern uint32_t fast_checksum(const uint8_t *data, size_t len);
extern void *fast_memcpy(void *dest, const void *src, size_t n);
extern int fast_strlen(const char *s);
; In NASM assembly:
global fast_strlen
; int fast_strlen(const char *s)
; RDI = s (string pointer)
; Returns: RAX = length
fast_strlen:
mov rax, rdi ; RAX = walking pointer; RDI keeps the start address
.loop:
cmp byte [rax], 0 ; is *rax == '\0'?
je .done
inc rax
jmp .loop
.done:
sub rax, rdi ; RAX = (address of '\0') - start = length
ret
An index-based version avoids the final subtraction:
global fast_strlen
; int fast_strlen(const char *s)
; RDI = s
; Returns: RAX = length
fast_strlen:
xor eax, eax ; RAX = 0 (length counter)
; xor eax, eax also clears upper 32 bits of RAX
.loop:
cmp byte [rdi + rax], 0 ; is s[i] == '\0'?
je .done
inc rax ; length++
jmp .loop
.done:
ret ; RAX = length
Alternatively, the classic REP SCASB version:
global fast_strlen_rep
fast_strlen_rep:
mov rcx, -1 ; RCX = max count
xor al, al ; AL = 0 (search byte)
repne scasb ; scan [RDI] for AL=0, decrement RCX, increment RDI
; After: RCX = -(length+2): it started at -1 and was decremented length+1 times
not rcx ; RCX = length+1
lea rax, [rcx - 1] ; RAX = length
ret
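The RCX bookkeeping in the REP SCASB version is easy to get wrong, so here is a C model of the same arithmetic. It is a sketch rather than a replacement for the assembly: strlen_scasb_model is a hypothetical name, and the loop imitates repne scasb with AL = 0 by advancing a pointer and decrementing a 64-bit counter once per byte examined, including the terminating NUL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of: mov rcx,-1 / repne scasb / not rcx / lea rax,[rcx-1] */
size_t strlen_scasb_model(const char *s) {
    uint64_t rcx = UINT64_MAX;     /* mov rcx, -1 */
    const char *rdi = s;

    for (;;) {
        char byte = *rdi++;        /* scasb compares [RDI] with AL, then advances RDI */
        rcx--;                     /* each iteration decrements RCX */
        if (byte == 0) {
            break;                 /* repne stops once the byte equals AL (0) */
        }
    }
    /* Here rcx == -(length + 2) in two's complement. */
    rcx = ~rcx;                    /* not rcx  -> length + 1 */
    return (size_t)(rcx - 1);      /* lea rax, [rcx - 1] -> length */
}
```

For "hello" the loop runs six times (five letters plus the NUL), so RCX ends at -7, NOT gives 6, and subtracting 1 yields 5.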
Example: fast_memcpy
; void *fast_memcpy(void *dest, const void *src, size_t n)
; RDI = dest, RSI = src, RDX = n
; Returns: RAX = dest (as C memcpy spec requires)
global fast_memcpy
fast_memcpy:
push rbp
mov rbp, rsp
push rbx
mov rbx, rdi ; save dest for return value
; Copy in 8-byte chunks
mov rcx, rdx
shr rcx, 3 ; rcx = n / 8 (number of 8-byte chunks)
jz .tail
.chunk_loop:
mov rax, [rsi] ; load 8 bytes
mov [rdi], rax ; store 8 bytes
add rsi, 8
add rdi, 8
dec rcx
jnz .chunk_loop
.tail:
and rdx, 7 ; rdx = n % 8 (remaining bytes)
jz .done
.byte_loop:
mov al, [rsi]
mov [rdi], al
inc rsi
inc rdi
dec rdx
jnz .byte_loop
.done:
mov rax, rbx ; return original dest
pop rbx
pop rbp
ret
C usage:
// In C:
#include <stddef.h>
#include <stdio.h>
extern void *fast_memcpy(void *dest, const void *src, size_t n);
int main(void) {
char src[] = "Hello, Assembly!";
char dest[20];
fast_memcpy(dest, src, 17);
printf("%s\n", dest);
return 0;
}
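When testing an assembly routine like this, it can help to keep a C reference with the same structure and compare the two on the same inputs. The function below is such a sketch: ref_chunked_memcpy is a hypothetical name, and it mirrors the 8-byte chunk loop followed by the byte tail. It uses memcpy for the 8-byte moves so the C stays free of alignment and aliasing problems.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C reference mirroring the chunk loop and byte tail. */
void *ref_chunked_memcpy(void *dest, const void *src, size_t n) {
    unsigned char *d = dest;
    const unsigned char *s = src;

    size_t chunks = n >> 3;        /* shr rcx, 3: number of 8-byte chunks */
    while (chunks--) {
        uint64_t word;
        memcpy(&word, s, 8);       /* load 8 bytes */
        memcpy(d, &word, 8);       /* store 8 bytes */
        s += 8;
        d += 8;
    }

    size_t tail = n & 7;           /* and rdx, 7: remaining bytes */
    while (tail--) {
        *d++ = *s++;               /* byte-by-byte tail copy */
    }
    return dest;                   /* memcpy returns the original dest */
}
```

A test harness could call both fast_memcpy and this reference with a range of lengths (0, 1, 7, 8, 9, 17, and so on) and compare the destination buffers.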
20.5 Accessing C Global Variables from Assembly
// In C:
int global_counter = 0;
const char *program_name = "myapp";
; In NASM:
extern global_counter ; symbol defined in C
extern program_name
section .text
use_globals:
; Load global_counter's value
; (Note: 'extern' gives us the ADDRESS of the symbol in RIP-relative addressing)
mov eax, [rel global_counter] ; load the int value
inc eax
mov [rel global_counter], eax ; store back
; Load program_name (which is a pointer to a string)
mov rdi, [rel program_name] ; rdi = the pointer value (address of "myapp")
; Now RDI is the string pointer, usable with strlen/printf etc.
ret
⚠️ Common Mistake: mov rax, global_counter loads the ADDRESS of the symbol, not its value. You need mov eax, [rel global_counter] to get the value. This is the difference between a label (an address) and the thing the label points to.
For Position-Independent Code (required for shared libraries), use GOT-relative addressing:
; PIC access to external variable
use_globals_pic:
; In PIC code, global variables are accessed via the GOT
mov rax, [rel global_counter wrt ..got] ; RAX = address of global_counter, read from its GOT entry
mov eax, [rax] ; load the int value through that address
ret
20.6 Passing Structs
The System V AMD64 ABI has detailed rules for how structs are passed.
Small Structs (≤ 16 bytes): Passed in Registers
struct Point { int x; int y; }; // 8 bytes total
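Structs of up to 16 bytes are split into 8-byte chunks ("eightbytes"), and each eightbyte made only of integer fields is passed in one GP register. struct Point fits in a single eightbyte, so both fields travel packed in one register:
- struct Point p as first argument: rdi holds p.x in bits 0-31 and p.y in bits 32-63
struct BigPoint { int64_t x; int64_t y; }; // 16 bytes
- struct BigPoint p as first argument: rdi = p.x, rsi = p.y
; C: void use_point_func(struct Point p);
; Assembly calling use_point_func with p = {10, 20}:
extern use_point_func
use_point:
sub rsp, 8 ; realign RSP to 16 bytes before the call
mov rdi, (20 << 32) | 10 ; p.y in the high 32 bits, p.x in the low 32 bits
call use_point_func
add rsp, 8
ret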
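One way to see where the fields of an 8-byte struct land is to copy its bytes into a 64-bit integer, which is what a single argument register would carry. The sketch below assumes a little-endian x86-64 target and 4-byte int; point_as_register is a hypothetical helper, not part of any library.

```c
#include <stdint.h>
#include <string.h>

struct Point { int x; int y; };   /* 8 bytes: exactly one eightbyte */

_Static_assert(sizeof(struct Point) == 8,
               "this sketch assumes 4-byte int with no padding");

/* Hypothetical helper: the 64-bit image of a Point, as one register would hold it. */
uint64_t point_as_register(struct Point p) {
    uint64_t reg;
    memcpy(&reg, &p, sizeof reg);  /* byte-for-byte copy, no reinterpretation tricks */
    return reg;
}
```

On such a target, a Point of {10, 20} yields 20 in the upper 32 bits and 10 in the lower 32 bits, which is the value an assembly caller would need to place in RDI.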
Large Structs (> 16 bytes): Passed in Memory on the Stack
struct Matrix { int64_t m[4]; }; // 32 bytes, too big for registers
When a struct is larger than 16 bytes, the System V ABI classifies it as MEMORY: the CALLER copies the whole struct into the stack-argument area, so it sits at [rsp] at the moment of the CALL. No pointer is passed in a register (passing a pointer to a caller-made copy is the Windows x64 convention, not System V):
; C: void process_matrix(struct Matrix m);
; When called with a 32-byte struct, the struct itself is the stack argument:
; the callee finds it at [rsp+8], just above its return address.
extern process_matrix
pass_big_struct:
push rbp
mov rbp, rsp
sub rsp, 32 ; 32 bytes for the by-value copy (RSP stays 16-byte aligned)
; Build the Matrix struct at the top of the stack:
mov qword [rsp], 1
mov qword [rsp+8], 2
mov qword [rsp+16], 3
mov qword [rsp+24], 4
; RDI is not used for this argument
call process_matrix ; C receives its own copy on the stack
leave
ret
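From the C side, the important property is that the callee works on its own copy of a struct passed by value. The sketch below illustrates that with a hypothetical callee named sum_matrix; the name and its behavior are assumptions chosen for the example.

```c
#include <stdint.h>

struct Matrix { int64_t m[4]; };   /* 32 bytes */

_Static_assert(sizeof(struct Matrix) == 32, "Matrix is expected to be 32 bytes");

/* Hypothetical callee: receives the Matrix by value, so it owns a private copy. */
int64_t sum_matrix(struct Matrix m) {
    int64_t sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += m.m[i];
    }
    m.m[0] = -1;   /* changes only the callee's copy, never the caller's struct */
    return sum;
}
```

Whatever mechanism the ABI uses to deliver the bytes, an assembly caller has to preserve this by-value behavior: the callee is free to overwrite the argument it was handed.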
Return Value for Large Structs
When a C function returns a large struct, the CALLER provides a "hidden" first argument: a pointer to the memory where the return value should be stored.
struct Matrix compute_matrix(void); // Returns 32-byte struct
The calling convention transforms this to:
struct Matrix *compute_matrix_hidden(struct Matrix *return_buf); // conceptually; the same pointer comes back in RAX
In assembly, you must allocate the return buffer and pass its address in RDI before calling:
extern compute_matrix
get_matrix:
push rbp
mov rbp, rsp
sub rsp, 48 ; space for 32-byte return struct
lea rdi, [rbp-32] ; RDI = pointer to return buffer (hidden first arg)
call compute_matrix ; function stores result to [rbp-32]
; Now [rbp-32] to [rbp-1] contains the returned struct
mov rax, [rbp-32] ; example: use first field
leave
ret
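The transformation can be written out in C to make the hidden argument visible. This is a conceptual sketch only: the body of compute_matrix and its values are placeholders, and compute_matrix_hidden is the illustrative name used above, not a function the compiler actually emits.

```c
#include <stdint.h>

struct Matrix { int64_t m[4]; };

/* What the programmer writes: return a 32-byte struct by value.
 * The values are placeholders for illustration. */
struct Matrix compute_matrix(void) {
    struct Matrix r = {{1, 2, 3, 4}};
    return r;
}

/* Roughly what the calling convention does: the caller supplies the buffer,
 * the callee fills it in and hands the same pointer back. */
struct Matrix *compute_matrix_hidden(struct Matrix *return_buf) {
    *return_buf = compute_matrix();
    return return_buf;
}
```

An assembly caller plays the role of the second form: it owns the buffer, passes its address, and reads the fields from that buffer after the call.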
20.7 The Red Zone
The red zone is a 128-byte area BELOW the current RSP that is guaranteed not to be modified by signal handlers or other asynchronous events (on Linux, in user space).
This means leaf functions (functions that make no calls) can use up to 128 bytes below RSP for local variables without adjusting RSP:
; Leaf function: uses red zone for local storage without adjusting RSP
fast_inner:
; RSP not touched — we're in a leaf function
mov [rsp - 8], rdi ; store arg to red zone
mov [rsp - 16], rsi ; store second arg to red zone
; ... do work ...
mov rax, [rsp - 8] ; read back
ret ; RSP unchanged — red zone was safe to use
⚠️ Common Mistake: Using the red zone in a non-leaf function (one that calls other functions). The CALL instruction itself writes the return address to [rsp - 8], and the callee's pushes and locals land below that, so anything you stored under RSP can be overwritten. Only use the red zone in leaf functions.
🔐 Security Note: In kernel code, the red zone cannot be used: interrupts and exceptions push stack frames below RSP, obliterating the red zone. Linux kernel code is compiled with -mno-red-zone. Any kernel-level assembly must not use the red zone.
20.8 Variadic Functions from Assembly
printf is variadic: int printf(const char *format, ...). The ... means any number of additional arguments.
For variadic calls in System V AMD64:
- Integer args go in RDI, RSI, RDX, RCX, R8, R9 (first 6)
- Floating-point args go in XMM0-XMM7 (first 8)
- AL (low byte of RAX) must contain the number of XMM registers used for FP args
; Call printf("Values: %d %d %f\n", 1, 2, 3.14)
; Args: format (RDI), 1 (ESI/int), 2 (EDX/int), 3.14 (XMM0/double)
extern printf
section .data
fmt3: db "Values: %d %d %f", 10, 0
pi: dq 3.14
section .text ; code must not be assembled into .data
print_mixed:
push rbp
mov rbp, rsp
mov rdi, fmt3 ; format
mov esi, 1 ; first integer
mov edx, 2 ; second integer
movsd xmm0, [rel pi] ; first FP argument
mov eax, 1 ; AL=1: one XMM register used ← REQUIRED
call printf
pop rbp
ret
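It can help to look at the other side of a variadic call: how a C callee pulls mixed integer and floating-point arguments out with va_arg. The sketch below is illustrative only; sum_mixed is a hypothetical function, and its convention of passing the two counts first is an assumption made so that the callee knows how many values of each type to read.

```c
#include <stdarg.h>

/* Hypothetical variadic callee: adds n_ints int arguments followed by
 * n_doubles double arguments and returns the total as a double. */
double sum_mixed(int n_ints, int n_doubles, ...) {
    va_list ap;
    double total = 0.0;

    va_start(ap, n_doubles);
    for (int i = 0; i < n_ints; i++) {
        total += va_arg(ap, int);      /* integer args arrive via GP registers */
    }
    for (int i = 0; i < n_doubles; i++) {
        total += va_arg(ap, double);   /* FP args arrive via XMM registers */
    }
    va_end(ap);
    return total;
}
```

Note that a float passed through the ellipsis is promoted to double, which is why %f in printf always expects a 64-bit value in an XMM register.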
20.9 Complete Working Mixed C+Assembly Project
Here's a complete example: a C program that calls assembly functions for string processing and a checksum calculation.
The C Header (functions.h)
// functions.h
#pragma once
#include <stddef.h>
#include <stdint.h>
// Assembly-implemented functions
extern size_t asm_strlen(const char *s);
extern uint32_t asm_checksum(const uint8_t *data, size_t len);
extern int asm_strcmp(const char *a, const char *b);
The Assembly Implementation (functions.asm)
; functions.asm — Assembly functions called from C
global asm_strlen
global asm_checksum
global asm_strcmp
section .text
;; size_t asm_strlen(const char *s)
;; RDI = s, Returns RAX = length
asm_strlen:
xor eax, eax ; length = 0
.strlen_loop:
cmp byte [rdi + rax], 0
je .strlen_done
inc rax
jmp .strlen_loop
.strlen_done:
ret
;; uint32_t asm_checksum(const uint8_t *data, size_t len)
;; RDI = data, RSI = len
;; Returns EAX = simple 32-bit add-and-rotate checksum (illustrative, not a standard algorithm)
asm_checksum:
xor eax, eax ; sum = 0
xor ecx, ecx ; i = 0
test rsi, rsi
jz .cksum_done
.cksum_loop:
movzx edx, byte [rdi + rcx] ; load byte, zero-extend
add eax, edx ; sum += byte
ror eax, 3 ; rotate sum (mixing)
inc rcx
cmp rcx, rsi
jb .cksum_loop
.cksum_done:
ret
;; int asm_strcmp(const char *a, const char *b)
;; RDI = a, RSI = b
;; Returns: EAX = 0 if equal, <0 if a<b, >0 if a>b
asm_strcmp:
.strcmp_loop:
movzx eax, byte [rdi] ; eax = *a (unsigned byte)
movzx ecx, byte [rsi] ; ecx = *b (unsigned byte)
inc rdi
inc rsi
test al, al ; if *a == 0, end of string
jz .strcmp_check_end
cmp al, cl ; *a == *b?
je .strcmp_loop ; yes, continue
.strcmp_check_end:
sub eax, ecx ; return *a - *b
ret
The C Main Program (main.c)
// main.c
#include <stdio.h>
#include <stdlib.h>
#include "functions.h"
int main(void) {
const char *str1 = "Hello, Assembly!";
const char *str2 = "Hello, Assembly!";
const char *str3 = "Hello, World!";
// Test asm_strlen
size_t len = asm_strlen(str1);
printf("strlen(\"%s\") = %zu\n", str1, len);
// Test asm_strcmp
int cmp1 = asm_strcmp(str1, str2);
int cmp2 = asm_strcmp(str1, str3);
printf("strcmp(\"%s\", \"%s\") = %d\n", str1, str2, cmp1);
printf("strcmp(\"%s\", \"%s\") = %d\n", str1, str3, cmp2);
// Test asm_checksum
uint8_t data[] = {0x01, 0x02, 0x03, 0x04, 0xFF};
uint32_t cksum = asm_checksum(data, sizeof(data));
printf("checksum = 0x%08X\n", cksum);
return 0;
}
Building and Running
# Method 1: Separate compilation + linking
nasm -f elf64 -o functions.o functions.asm
gcc -c -o main.o main.c
gcc -o asm_demo main.o functions.o
./asm_demo
# Method 2: Assemble first, then compile and link main.c in a single gcc command
nasm -f elf64 -o functions.o functions.asm
gcc main.c functions.o -o asm_demo
./asm_demo
# Expected output:
# strlen("Hello, Assembly!") = 16
# strcmp("Hello, Assembly!", "Hello, Assembly!") = 0
# strcmp("Hello, Assembly!", "Hello, World!") = -22 (negative, since 'A' (65) < 'W' (87))
# checksum = 0xF1A2001F
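To cross-check the assembly checksum, a C reference that performs the same add-then-rotate steps is useful. The function below is a sketch under that assumption; ref_checksum is a hypothetical name, and it is meant to mirror asm_checksum byte for byte (add the zero-extended byte, then rotate the 32-bit sum right by 3).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C reference for asm_checksum: sum += byte; sum = ror32(sum, 3). */
uint32_t ref_checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += data[i];                    /* add eax, edx */
        sum = (sum >> 3) | (sum << 29);    /* ror eax, 3 */
    }
    return sum;
}
```

In main.c, comparing asm_checksum(data, sizeof(data)) against ref_checksum(data, sizeof(data)) for a few buffers, including an empty one, gives a quick regression test for the assembly version.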
🔄 Check Your Understanding:
1. When calling printf from assembly with one integer and one double argument, what must you set in RAX/AL?
2. Why must you save malloc's return value (RAX) into a callee-saved register before calling free?
3. What is the "red zone" and in what type of function can it be used safely?
4. If a C function returns a struct larger than 16 bytes, what hidden argument must the caller provide?
5. What does extern "C" do in C++, and why is it needed for assembly interoperability?
Summary
The assembly-C interface is governed entirely by the System V AMD64 ABI. Follow the rules (arguments in the right registers, callee-saved registers preserved, AL = number of XMM argument registers for variadic calls, stack 16-byte aligned at every CALL) and C and assembly interoperate transparently.
For calling C from assembly: declare with extern, set up arguments, handle callee-saved registers, set AL for variadic calls. For assembly callable from C: declare with global, match the C prototype's argument order exactly, return in RAX.
The interface enables the best of both worlds: C's libraries and ecosystem, assembly's precision and performance.