
Chapter 20: Calling C from Assembly and Assembly from C

The Bridge

Pure assembly is powerful and educational, but limited. You cannot write a network server in pure assembly using only system calls. Well, you can, but the result would be thousands of lines of error-prone code that every standard library already provides. You cannot call getaddrinfo(), SSL_CTX_new(), or zlib's compress() without interfacing with C.

Similarly, C code that needs hand-optimized SIMD, precise timing via RDTSC, or hardware-specific instructions can call assembly functions — as long as those functions follow the ABI contract.

This chapter is the bridge.


20.1 Why Interface Assembly with C

Assembly Calls C

Assembly programs can call any C function by following the System V AMD64 ABI:
  • printf: formatted output without writing your own formatting code
  • malloc/free: dynamic memory allocation
  • fopen/fread/fwrite/fclose: file I/O with buffering
  • socket, connect, send, recv: networking
  • pthread_create: threads
  • Any third-party library with a C API (OpenSSL, zlib, SQLite, etc.)

C Calls Assembly

C programs can call assembly functions for:
  • Performance-critical inner loops (SIMD, specific instruction usage)
  • Hardware access (CPUID, RDTSC, port I/O)
  • Cryptographic primitives (avoiding compiler optimizations that might remove security-critical zeroing)
  • Boot/startup code before the C runtime is initialized


20.2 Linking Assembly with C

NASM extern and global

; In NASM assembly:
extern printf         ; tell NASM that printf is defined elsewhere (in C library)
extern malloc         ; same for malloc
global fast_checksum  ; make fast_checksum visible to other object files

section .text
fast_checksum:
    ; ... implementation ...
    ret
  • extern name — declares a symbol defined in another object file
  • global name — exports a symbol from this object file (makes it available to the linker)

Building a Mixed C+Assembly Project

# Files:
# - main.c (C entry point, calls assembly)
# - helper.asm (assembly functions called from C)

# Step 1: Compile C to object file
gcc -c main.c -o main.o

# Step 2: Assemble to object file
nasm -f elf64 helper.asm -o helper.o

# Step 3: Link them together
gcc main.o helper.o -o program

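# Alternatively, compile main.c and link in a single gcc invocation:
gcc main.c helper.o -o program
# (gcc does not run nasm for you: it has no rule for NASM-syntax .asm files,
#  so helper.asm must still be assembled in Step 2)
# If the linker rejects the assembly's relocations on a toolchain that builds
# position-independent executables by default, adding -no-pie is one workaround.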

⚠️ Common Mistake: Using ld directly to link instead of gcc. When you link with ld, you bypass the C runtime startup code (crt1.o, crti.o, crtn.o) and the C standard library. Use gcc to link unless you explicitly want no C runtime.

C++ Name Mangling

In C, printf is exported as printf. In C++, void foo(int) is exported as something like _Z3fooi — name mangling encodes the function signature to support overloading.

When C++ code calls assembly (or assembly calls C++ functions), you must either:

  1. Declare the assembly function with extern "C" in the C++ code:
     extern "C" int fast_checksum(const char *data, size_t len);
  2. Or match the mangled name exactly in assembly (fragile, compiler-specific)

The extern "C" approach is always correct.
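A header that serves both C and C++ callers usually wraps its declarations in a guard, so the same file compiles either way. A minimal sketch, assuming the fast_checksum prototype used later in this chapter; the C body is only a hypothetical stand-in so the snippet builds without the assembly object file:

```c
#include <stddef.h>
#include <stdint.h>

#ifdef __cplusplus
extern "C" {            /* C++ callers: use the unmangled C symbol name */
#endif

uint32_t fast_checksum(const uint8_t *data, size_t len);

#ifdef __cplusplus
}
#endif

/* Hypothetical stand-in for the assembly routine: a plain byte sum.
   In a real project this definition lives in the .asm file instead. */
uint32_t fast_checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}
```

A C compiler skips the guard entirely, while a C++ compiler sees extern "C" and emits a reference to the plain symbol fast_checksum.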


20.3 Calling C Functions from Assembly

Calling printf

The C signature: int printf(const char *format, ...);

Arguments: RDI = format string, RSI = first argument, RDX = second, RCX = third, R8 = fourth, R9 = fifth.

Special rule for variadic functions: AL (the low byte of RAX) must hold the number of XMM (floating-point) registers used for arguments. The ABI treats it as an upper bound, from 0 to 8. If you're only passing integers, set it to 0 with xor eax, eax.

; call_printf.asm
; Print: "Value: 42\n"

extern printf

section .data
fmt:    db  "Value: %d", 10, 0   ; "Value: %d\n\0"

section .text

print_value:
    push    rbp                 ; RSP was 8 past a 16-byte boundary on entry;
                                ;   this push makes it 16-byte aligned again
    mov     rbp, rsp            ; no locals, so no sub rsp is needed

    mov     rdi, fmt            ; arg1: format string
    mov     esi, 42             ; arg2: integer value
    xor     eax, eax            ; AL = 0: no XMM args (variadic convention)
    call    printf              ; RSP is 16-byte aligned here ✓

    pop     rbp
    ret

Stack alignment deserves a careful look, because it is one of the most common reasons a call into the C library crashes. The ABI requires RSP to be a multiple of 16 immediately before every CALL instruction. CALL then pushes an 8-byte return address, so on entry to any function RSP sits 8 bytes past a 16-byte boundary. The standard prologue restores alignment:

; Standard function entry:
push    rbp          ; At the CALL site RSP was 16-aligned (BEFORE the call).
                     ; CALL pushed 8 bytes, so RSP is 8-byte aligned (not 16) on entry.
                     ; push rbp pushes 8 more, so RSP is 16-aligned again.
mov     rbp, rsp
sub     rsp, N       ; N must be a multiple of 16 to maintain alignment

Writing original_RSP for the caller's RSP just before its CALL: at function entry RSP is original_RSP - 8 (8-byte aligned). After push rbp, RSP is original_RSP - 16 (16-byte aligned). After sub rsp, N where N is a multiple of 16, RSP is still 16-byte aligned.

print_value has no locals and saves no extra registers, so it needs no sub at all. Adding sub rsp, 8 after the push would be a bug: RSP would be 8 bytes off a 16-byte boundary when printf is called, and printf may then fault on an aligned SSE access. The same arithmetic applies to extra pushes: each one moves RSP by 8, so an odd number of pushes after push rbp needs an 8-byte pad.

The Complete printf Stack Trace

Function call to print_value:
  CALL at caller:
    RSP = 0xXXXXFFF0 (16-aligned)  ← before call
    RSP = 0xXXXXFFE8                ← after CALL (pushed 8-byte return addr)

  push rbp:
    RSP = 0xXXXXFFE0                ← 16-aligned again
    [RSP+0] = old RBP
    [RSP+8] = return address (was at RSP before push)

  call printf:
    RSP must be 16-aligned → it is (0xXXXXFFE0) ✓
    printf is called correctly
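The alignment bookkeeping in this trace is plain arithmetic, so it can be checked mechanically. The sketch below models it in C; rsp_at_inner_call and is_16_aligned are hypothetical helper names, and the model assumes every push is 8 bytes wide:

```c
#include <stdbool.h>
#include <stdint.h>

/* RSP at the moment a function issues its own CALL, given the caller's RSP
   just before it called us. The return address and each push cost 8 bytes. */
uint64_t rsp_at_inner_call(uint64_t rsp_before_call, unsigned pushes,
                           uint64_t sub_bytes) {
    uint64_t rsp = rsp_before_call;
    rsp -= 8;                        /* CALL pushes the return address */
    rsp -= (uint64_t)8 * pushes;     /* push rbp, push rbx, ... */
    rsp -= sub_bytes;                /* sub rsp, N */
    return rsp;
}

/* True when the low four bits are clear, i.e. RSP is a multiple of 16. */
bool is_16_aligned(uint64_t rsp) {
    return (rsp & 0xF) == 0;
}
```

With one push and no sub, the result is aligned; one push plus sub rsp, 8 is not; two pushes plus sub rsp, 8 is aligned again.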

printf with Multiple Arguments and Types

; Print: "Point: (3, 7), Label: hello\n"
extern printf

section .data
fmt2:   db  "Point: (%d, %d), Label: %s", 10, 0
label:  db  "hello", 0

section .text

print_point:
    push    rbp
    mov     rbp, rsp
    sub     rsp, 16             ; align + local space (multiple of 16)

    mov     rdi, fmt2           ; format
    mov     rsi, 3              ; first %d (x coordinate)
    mov     rdx, 7              ; second %d (y coordinate)
    mov     rcx, label          ; third %s (string pointer)
    xor     eax, eax            ; no FP args
    call    printf

    leave                       ; mov rsp, rbp; pop rbp
    ret

printf with Floating-Point Arguments

; Print: "Pi = 3.141593\n"
extern printf

section .data
fmt_fp: db  "Pi = %f", 10, 0

section .rodata
pi:     dq  3.141592653589793   ; 64-bit double

section .text

print_pi:
    push    rbp
    mov     rbp, rsp

    mov     rdi, fmt_fp         ; format string
    movsd   xmm0, [rel pi]      ; XMM0 = 3.14159... (double)
    mov     eax, 1              ; AL = 1: one XMM register used
    call    printf              ; printf knows to look at XMM0 for the %f arg

    pop     rbp
    ret

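⚠️ Common Mistake: Forgetting xor eax, eax (or the appropriate count) when calling variadic functions. The System V ABI requires AL to hold an upper bound on the number of vector registers used for the call. If AL is 0 but a double was passed in XMM0, printf may never look at XMM0 and print garbage; if AL holds a stale value left over from earlier code, the callee's register-save prologue can misbehave and may crash.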

Calling malloc and free

; Dynamic allocation from assembly
extern malloc
extern free

section .text

alloc_buffer:
    push    rbp
    mov     rbp, rsp
    push    rbx             ; callee-saved: will use RBX to hold pointer
    sub     rsp, 8          ; two pushes leave RSP 8 off a 16-byte boundary;
                            ;   pad so RSP is aligned at each CALL

    ; malloc(128) - allocate 128 bytes
    mov     edi, 128        ; size argument
    call    malloc          ; RAX = pointer (or NULL on failure)

    ; Check for NULL
    test    rax, rax
    jz      .alloc_failed

    mov     rbx, rax        ; save pointer in callee-saved RBX

    ; ... use the buffer at [rbx] ...
    ; e.g., write some data (a dword store: a qword store would
    ; sign-extend the 32-bit immediate):
    mov     dword [rbx], 0xDEADBEEF

    ; free(ptr)
    mov     rdi, rbx        ; pointer to free
    call    free            ; returns void

.alloc_done:
    add     rsp, 8          ; drop the alignment pad
    pop     rbx
    pop     rbp
    ret

.alloc_failed:
    ; Handle allocation failure
    add     rsp, 8          ; drop the alignment pad
    pop     rbx
    pop     rbp
    ret

Key insight: if the pointer has to outlive another call, move malloc's return value (RAX) into a callee-saved register (RBX, R12-R15) first. Any call, including free, may clobber the caller-saved registers (RAX, RCX, RDX, RSI, RDI, R8-R11); only callee-saved registers survive a function call.


20.4 Writing Assembly Functions Callable from C

The Requirements

To write an assembly function that C can call:

  1. Declare it as global in NASM
  2. In C, declare it as extern (with the right signature)
  3. Follow the System V AMD64 ABI exactly:
     • Arguments in RDI, RSI, RDX, RCX, R8, R9
     • Return value in RAX
     • Preserve RBX, RBP, R12-R15 (callee-saved)
     • Return with RET (not any other jump)

// In C header:
extern uint32_t fast_checksum(const uint8_t *data, size_t len);
extern void *fast_memcpy(void *dest, const void *src, size_t n);
extern int fast_strlen(const char *s);

; In NASM assembly:

global fast_strlen

; int fast_strlen(const char *s)
; RDI = s (string pointer)
; Returns: RAX = length
fast_strlen:
    mov     rax, rdi            ; RAX = cursor, starts at s; RDI keeps the start
.loop:
    cmp     byte [rax], 0       ; is *cursor == '\0'?
    je      .done
    inc     rax
    jmp     .loop
.done:
    sub     rax, rdi            ; RAX = address of '\0' - start = length
    ret

The subtraction order matters: the length is the address of the terminator minus the start address, and reversing the operands yields a negative value. A version that counts from zero avoids the pointer subtraction altogether (use one definition or the other, not both in the same file):

global fast_strlen

; int fast_strlen(const char *s)
; RDI = s
; Returns: RAX = length
fast_strlen:
    xor     eax, eax            ; RAX = 0 (length counter)
                                ; xor eax, eax also clears upper 32 bits of RAX
.loop:
    cmp     byte [rdi + rax], 0 ; is s[i] == '\0'?
    je      .done
    inc     rax                 ; length++
    jmp     .loop
.done:
    ret                         ; RAX = length

Alternatively, the classic REP SCASB version:

global fast_strlen_rep

fast_strlen_rep:
    mov     rcx, -1             ; RCX = max count (all ones)
    xor     eax, eax            ; AL = 0 (search byte)
    repne   scasb               ; scan [RDI] for AL; every byte scanned does RCX--, RDI++
                                ; (relies on DF = 0, which the ABI guarantees on entry)
    ; After: RCX = -(length+2), since length+1 bytes were scanned starting from -1
    not     rcx                 ; RCX = length+1
    lea     rax, [rcx - 1]      ; RAX = length
    ret
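The counter arithmetic behind the REP SCASB version is easy to get wrong by one, so it helps to model it. The following C sketch mirrors the register updates step by step; strlen_scasb_model is a hypothetical name, and the loop is only a model of the instruction's behaviour, not how the CPU implements it:

```c
#include <stddef.h>
#include <stdint.h>

/* Models fast_strlen_rep: RCX starts at -1 and drops by one per byte scanned. */
size_t strlen_scasb_model(const char *s) {
    uint64_t rcx = UINT64_MAX;                  /* mov rcx, -1 */
    const unsigned char *rdi = (const unsigned char *)s;
    for (;;) {                                  /* repne scasb with AL = 0 */
        unsigned char byte = *rdi++;            /* compare [RDI] with AL, then RDI++ */
        rcx--;                                  /* RCX-- for every byte, terminator included */
        if (byte == 0)
            break;                              /* ZF set: repne stops */
    }
    /* Here rcx equals -(length + 2) in two's complement. */
    rcx = ~rcx;                                 /* not rcx: length + 1 */
    return (size_t)(rcx - 1);                   /* lea rax, [rcx - 1]: length */
}
```

For a string of length n the scan touches n + 1 bytes, so RCX ends at -1 - (n + 1) = -(n + 2); NOT turns that into n + 1, and subtracting one leaves n.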

Example: fast_memcpy

; void *fast_memcpy(void *dest, const void *src, size_t n)
; RDI = dest, RSI = src, RDX = n
; Returns: RAX = dest (as C memcpy spec requires)
global fast_memcpy

fast_memcpy:
    push    rbp
    mov     rbp, rsp
    push    rbx

    mov     rbx, rdi            ; save dest for return value

    ; Copy in 8-byte chunks
    mov     rcx, rdx
    shr     rcx, 3              ; rcx = n / 8 (number of 8-byte chunks)
    jz      .tail

.chunk_loop:
    mov     rax, [rsi]          ; load 8 bytes
    mov     [rdi], rax          ; store 8 bytes
    add     rsi, 8
    add     rdi, 8
    dec     rcx
    jnz     .chunk_loop

.tail:
    and     rdx, 7              ; rdx = n % 8 (remaining bytes)
    jz      .done

.byte_loop:
    mov     al, [rsi]
    mov     [rdi], al
    inc     rsi
    inc     rdi
    dec     rdx
    jnz     .byte_loop

.done:
    mov     rax, rbx            ; return original dest
    pop     rbx
    pop     rbp
    ret

C usage:

// In C:
#include <stdio.h>
#include <stddef.h>

extern void *fast_memcpy(void *dest, const void *src, size_t n);

int main(void) {
    char src[] = "Hello, Assembly!";
    char dest[20];
    fast_memcpy(dest, src, sizeof src);   // 17 bytes: 16 characters plus '\0'
    printf("%s\n", dest);
    return 0;
}

20.5 Accessing C Global Variables from Assembly

// In C:
int global_counter = 0;
const char *program_name = "myapp";
; In NASM:
extern global_counter   ; symbol defined in C
extern program_name

section .text
use_globals:
    ; Load global_counter's value
    ; (Note: 'extern' gives us the ADDRESS of the symbol in RIP-relative addressing)
    mov     eax, [rel global_counter]   ; load the int value
    inc     eax
    mov     [rel global_counter], eax   ; store back

    ; Load program_name (which is a pointer to a string)
    mov     rdi, [rel program_name]     ; rdi = the pointer value (address of "myapp")
    ; Now RDI is the string pointer, usable with strlen/printf etc.
    ret

⚠️ Common Mistake: mov rax, global_counter loads the ADDRESS of the symbol, not its value. You need mov eax, [rel global_counter] to get the value. This is the difference between a label (an address) and the thing the label points to.

For Position-Independent Code (required for shared libraries), use GOT-relative addressing:

; PIC access to external variable
use_globals_pic:
    ; In PIC code, global variables are accessed via the GOT
    mov     rax, [rel global_counter wrt ..got]   ; load GOT entry address
    mov     eax, [rax]                            ; load actual value via pointer
    ret

20.6 Passing Structs

The System V AMD64 ABI has detailed rules for how structs are passed.

Small Structs (≤ 16 bytes): Passed in Registers

struct Point { int x; int y; };   // 8 bytes total

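Structs of up to 16 bytes are split into 8-byte chunks ("eightbytes"), and each all-integer eightbyte travels in one general-purpose register. struct Point fits in a single eightbyte, so both fields share one register:
  • struct Point p as first argument: rdi holds the whole struct, with p.x in bits 0-31 and p.y in bits 32-63

struct BigPoint { int64_t x; int64_t y; };  // 16 bytes
  • struct BigPoint p as first argument: rdi = p.x, rsi = p.y

; C: void use_point_func(struct Point p);
; Assembly calling use_point_func with p = {10, 20}:
extern use_point_func

use_point:
    sub     rsp, 8                  ; realign RSP to 16 bytes before the CALL
    mov     rdi, (20 << 32) | 10    ; p.x = 10 in the low dword, p.y = 20 in the high dword
    call    use_point_func
    add     rsp, 8
    ret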
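Under the ABI's classification rules, an 8-byte struct of two ints occupies one eightbyte, so its register image is simply its memory image read as a 64-bit integer. A small C sketch can confirm the packing; point_as_register is a hypothetical helper, and the result assumes a little-endian target with 4-byte int, as on x86-64:

```c
#include <stdint.h>
#include <string.h>

struct Point { int x; int y; };   /* 8 bytes: one INTEGER eightbyte */

/* Returns the 64-bit image of a Point, i.e. the value a caller would place
   in RDI when passing it by value as the first argument. */
uint64_t point_as_register(struct Point p) {
    uint64_t bits = 0;
    memcpy(&bits, &p, sizeof p);  /* copy the raw bytes; sidesteps aliasing rules */
    return bits;
}
```

For p = {10, 20} the image has 10 in the low 32 bits and 20 in the high 32 bits, which is the constant (20 << 32) | 10.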

Large Structs (> 16 bytes): Passed in Memory

struct Matrix { int64_t m[4]; };  // 32 bytes, too big for registers

When a struct is larger than 16 bytes, the ABI classifies it as MEMORY: the CALLER copies the whole struct into the stack argument area, and no argument register is used for it. (Passing a pointer to a caller-made copy is the Windows x64 convention, not System V.) The callee finds the struct just above its return address:

; C: void process_matrix(struct Matrix m);
; The 32-byte struct is copied by value to the top of the caller's stack.

extern process_matrix

pass_big_struct:
    push    rbp
    mov     rbp, rsp
    sub     rsp, 32             ; 32-byte argument area; RSP stays 16-byte aligned

    ; Build the Matrix directly in the argument area at [rsp]:
    mov     qword [rsp],    1   ; m.m[0]
    mov     qword [rsp+8],  2   ; m.m[1]
    mov     qword [rsp+16], 3   ; m.m[2]
    mov     qword [rsp+24], 4   ; m.m[3]

    call    process_matrix      ; callee sees the struct at [rsp+8] on entry

    leave
    ret

Return Value for Large Structs

When a C function returns a large struct, the CALLER provides a "hidden" first argument: a pointer to the memory where the return value should be stored.

struct Matrix compute_matrix(void);   // Returns 32-byte struct

The calling convention transforms this to:

void compute_matrix_hidden(struct Matrix *return_buf);  // conceptually

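In assembly, you must allocate the return buffer and pass its address in RDI before calling. Any explicit arguments shift over by one register (the first one goes in RSI), and on return RAX holds the same buffer address: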

extern compute_matrix

get_matrix:
    push    rbp
    mov     rbp, rsp
    sub     rsp, 48             ; space for 32-byte return struct

    lea     rdi, [rbp-32]       ; RDI = pointer to return buffer (hidden first arg)
    call    compute_matrix      ; function stores result to [rbp-32]

    ; Now [rbp-32] to [rbp-1] contains the returned struct
    mov     rax, [rbp-32]       ; example: use first field

    leave
    ret
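The hidden-pointer transformation can be written out in C to make the contract concrete. This is only a sketch: compute_matrix_lowered is a hypothetical name for what the compiler effectively generates, and the values 1 to 4 are placeholders:

```c
#include <stdint.h>

struct Matrix { int64_t m[4]; };   /* 32 bytes: returned in memory */

/* What the programmer writes. */
struct Matrix compute_matrix(void) {
    struct Matrix r = { {1, 2, 3, 4} };
    return r;
}

/* Roughly what the ABI makes of it: the caller passes the destination
   (in RDI), and the same pointer comes back as the return value (in RAX). */
struct Matrix *compute_matrix_lowered(struct Matrix *return_buf) {
    return_buf->m[0] = 1;
    return_buf->m[1] = 2;
    return_buf->m[2] = 3;
    return_buf->m[3] = 4;
    return return_buf;
}
```

Both forms leave the same 32 bytes in the caller's buffer; the lowered form just makes the buffer's ownership explicit, which is the view an assembly caller has to take.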

20.7 The Red Zone

The red zone is a 128-byte area BELOW the current RSP that is guaranteed not to be modified by signal handlers or other asynchronous events (on Linux, in user space).

This means leaf functions (functions that make no calls) can use up to 128 bytes below RSP for local variables without adjusting RSP:

; Leaf function: uses red zone for local storage without adjusting RSP
fast_inner:
    ; RSP not touched — we're in a leaf function
    mov     [rsp - 8], rdi      ; store arg to red zone
    mov     [rsp - 16], rsi     ; store second arg to red zone
    ; ... do work ...
    mov     rax, [rsp - 8]      ; read back
    ret                         ; RSP unchanged — red zone was safe to use

⚠️ Common Mistake: Using the red zone in a non-leaf function (one that calls other functions). The CALL instruction itself pushes the return address at [rsp - 8], which is the top of your red zone, and the callee then builds its own frame and red zone below that. Anything you stored at [rsp - N] can be overwritten by the time the call returns. Only use the red zone in leaf functions.

🔐 Security Note: In kernel code, the red zone cannot be used — interrupts and exceptions push stack frames below RSP, obliterating the red zone. Linux kernel code is compiled with -mno-red-zone. Any kernel-level assembly must not use the red zone.


20.8 Variadic Functions from Assembly

printf is variadic: int printf(const char *format, ...). The ... means any number of additional arguments.

For variadic calls in System V AMD64:
  • Integer args go in RDI, RSI, RDX, RCX, R8, R9 (first 6)
  • Floating-point args go in XMM0-XMM7 (first 8)
  • AL (low byte of RAX) must contain the number of XMM registers used for FP args

; Call printf("Values: %d %d %f\n", 1, 2, 3.14)
; Args: format (RDI), 1 (ESI/int), 2 (EDX/int), 3.14 (XMM0/double)

extern printf

section .data
fmt3:   db  "Values: %d %d %f", 10, 0
pi:     dq  3.14                ; 64-bit double; %f always receives a double

section .text                   ; code must not be assembled into .data

print_mixed:
    push    rbp
    mov     rbp, rsp

    mov     rdi, fmt3           ; format
    mov     esi, 1              ; first integer
    mov     edx, 2              ; second integer
    movsd   xmm0, [rel pi]      ; first FP argument
    mov     eax, 1              ; AL=1: one XMM register used ← REQUIRED
    call    printf

    pop     rbp
    ret
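Seen from the callee's side, variadic arguments are pulled out one at a time with va_arg, and the callee has no prototype telling it which registers were used; the count in AL is what lets its prologue know whether vector registers need saving. A minimal C sketch of such a callee, where sum_doubles is a hypothetical function and not part of any library:

```c
#include <stdarg.h>

/* Hypothetical variadic callee: reads `count` doubles after the fixed argument. */
double sum_doubles(int count, ...) {
    va_list ap;
    double total = 0.0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, double);   /* float arguments arrive promoted to double */
    va_end(ap);
    return total;
}
```

An assembly caller of sum_doubles(2, 1.5, 2.5) would put 2 in EDI, the two doubles in XMM0 and XMM1, and 2 in AL.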

20.9 Complete Working Mixed C+Assembly Project

Here's a complete example: a C program that calls assembly functions for string processing and a checksum calculation.

The C Header (functions.h)

// functions.h
#pragma once
#include <stddef.h>
#include <stdint.h>

// Assembly-implemented functions
extern size_t   asm_strlen(const char *s);
extern uint32_t asm_checksum(const uint8_t *data, size_t len);
extern int      asm_strcmp(const char *a, const char *b);

The Assembly Implementation (functions.asm)

; functions.asm — Assembly functions called from C

global asm_strlen
global asm_checksum
global asm_strcmp

section .text

;; size_t asm_strlen(const char *s)
;; RDI = s, Returns RAX = length
asm_strlen:
    xor     eax, eax            ; length = 0
.strlen_loop:
    cmp     byte [rdi + rax], 0
    je      .strlen_done
    inc     rax
    jmp     .strlen_loop
.strlen_done:
    ret


;; uint32_t asm_checksum(const uint8_t *data, size_t len)
;; RDI = data, RSI = len
;; Returns EAX = simple 32-bit add-and-rotate checksum (not a standard algorithm)
asm_checksum:
    xor     eax, eax            ; sum = 0
    xor     ecx, ecx            ; i = 0
    test    rsi, rsi
    jz      .cksum_done
.cksum_loop:
    movzx   edx, byte [rdi + rcx]  ; load byte, zero-extend
    add     eax, edx            ; sum += byte
    ror     eax, 3              ; rotate sum (mixing)
    inc     rcx
    cmp     rcx, rsi
    jb      .cksum_loop
.cksum_done:
    ret


;; int asm_strcmp(const char *a, const char *b)
;; RDI = a, RSI = b
;; Returns: RAX = 0 if equal, <0 if a<b, >0 if a>b
asm_strcmp:
.strcmp_loop:
    movzx   eax, byte [rdi]     ; eax = *a (unsigned byte)
    movzx   ecx, byte [rsi]     ; ecx = *b (unsigned byte)
    inc     rdi
    inc     rsi
    test    al, al              ; if *a == 0, end of string
    jz      .strcmp_check_end
    cmp     al, cl              ; *a == *b?
    je      .strcmp_loop        ; yes, continue
.strcmp_check_end:
    sub     eax, ecx            ; return *a - *b
    ret

The C Main Program (main.c)

// main.c
#include <stdio.h>
#include <stdlib.h>
#include "functions.h"

int main(void) {
    const char *str1 = "Hello, Assembly!";
    const char *str2 = "Hello, Assembly!";
    const char *str3 = "Hello, World!";

    // Test asm_strlen
    size_t len = asm_strlen(str1);
    printf("strlen(\"%s\") = %zu\n", str1, len);

    // Test asm_strcmp
    int cmp1 = asm_strcmp(str1, str2);
    int cmp2 = asm_strcmp(str1, str3);
    printf("strcmp(\"%s\", \"%s\") = %d\n", str1, str2, cmp1);
    printf("strcmp(\"%s\", \"%s\") = %d\n", str1, str3, cmp2);

    // Test asm_checksum
    uint8_t data[] = {0x01, 0x02, 0x03, 0x04, 0xFF};
    uint32_t cksum = asm_checksum(data, sizeof(data));
    printf("checksum = 0x%08X\n", cksum);

    return 0;
}

Building and Running

# Method 1: Separate compilation + linking
nasm -f elf64 -o functions.o functions.asm
gcc -c -o main.o main.c
gcc -o asm_demo main.o functions.o
./asm_demo

# Method 2: One-step with gcc
nasm -f elf64 -o functions.o functions.asm
gcc main.c functions.o -o asm_demo
./asm_demo

# Expected output:
# strlen("Hello, Assembly!") = 16
# strcmp("Hello, Assembly!", "Hello, Assembly!") = 0
# strcmp("Hello, Assembly!", "Hello, World!") = -22   ('A' is 65, 'W' is 87)
# checksum = 0xF1A2001F
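When an assembly routine produces a number that is hard to check by eye, a straightforward C model gives you something to compare against. The sketch below mirrors asm_checksum instruction by instruction; ref_checksum is a hypothetical name introduced here for cross-checking only:

```c
#include <stddef.h>
#include <stdint.h>

/* Reference model of asm_checksum: add each byte, then rotate right by 3. */
uint32_t ref_checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += data[i];                     /* add eax, edx */
        sum = (sum >> 3) | (sum << 29);     /* ror eax, 3 */
    }
    return sum;
}
```

Calling both functions on the same buffer in main.c and comparing the results is a cheap regression test; for the five-byte array used there, this model returns 0xF1A2001F.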

🔄 Check Your Understanding:

  1. When calling printf from assembly with one integer and one double argument, what must you set in RAX/AL?
  2. Why must you save malloc's return value (RAX) into a callee-saved register before calling free?
  3. What is the "red zone" and in what type of function can it be used safely?
  4. If a C function returns a struct larger than 16 bytes, what hidden argument must the caller provide?
  5. What does extern "C" do in C++, and why is it needed for assembly interoperability?


Summary

The assembly-C interface is governed entirely by the System V AMD64 ABI. Follow the rules (arguments in the right registers, callee-saved registers preserved, AL = number of vector registers for variadic calls, stack 16-byte aligned at every CALL) and C and assembly interoperate transparently.

For calling C from assembly: declare with extern, set up arguments, handle callee-saved registers, set AL for variadic calls. For assembly callable from C: declare with global, match the C prototype's argument order exactly, return in RAX.

The interface enables the best of both worlds: C's libraries and ecosystem, assembly's precision and performance.