Case Study 20-2: Calling printf Without libc — printf's Implementation

Objective

Trace exactly what happens when your assembly code calls printf: from the PLT stub through the dynamic linker to the eventual write(2) system call. Use GDB to step through every layer. Understand that "calling printf" involves the dynamic linker, the C runtime, and ultimately the kernel.


Setup: A Simple Program

; trace_printf.asm — we'll trace this with GDB
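extern printf
extern exit

section .data
fmt:    db "Hello, %s! Count=%d", 10, 0
name:   db "World", 0

section .text
global _start

_start:
    push    rbp             ; rsp was 16-byte aligned at entry; now rsp % 16 == 8
    mov     rbp, rsp
    sub     rsp, 8          ; realign: rsp % 16 == 0 before the call, as the ABI requires

    mov     rdi, fmt        ; arg 1: format string
    mov     rsi, name       ; arg 2: "World" for %s
    mov     rdx, 42         ; arg 3: 42 for %d
    xor     eax, eax        ; AL = number of vector registers used for varargs (none)
    call    printf

    add     rsp, 8
    pop     rbp             ; rsp is 16-byte aligned again

    ; exit(0) through libc rather than a raw exit syscall (mov eax, 60 /
    ; syscall): exit flushes stdio buffers, which matters when stdout is
    ; a pipe or file and printf's output is still sitting in the buffer.
    xor     edi, edi
    call    exit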

Build with dynamic linking:

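nasm -f elf64 -o trace_printf.o trace_printf.asm
gcc -nostartfiles -no-pie -Wl,-z,lazy trace_printf.o -o trace_printf
# -nostartfiles: don't include crt1.o etc., use our _start
# -no-pie: the absolute addresses (mov rdi, fmt) and the plain call printf
#   need a non-PIE executable; many distributions build PIE by default
# -Wl,-z,lazy: keep lazy binding even if the toolchain defaults to -z now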

Layer 1: The call printf Instruction

By the time CALL printf executes, its target is no longer the symbol printf: the linker has resolved the call to the address of printf's PLT stub (Procedure Linkage Table). The call does not go to printf in libc; it goes to the PLT stub.

Disassemble to see this:

objdump -d trace_printf | grep -A 3 '<printf@plt>:'

Output:

0000000000401020 <printf@plt>:
  401020:  ff 25 e2 2f 00 00     jmpq   *0x2fe2(%rip)    # GOT entry
  401026:  68 00 00 00 00        pushq  $0x0              # relocation index
  40102b:  e9 e0 ff ff ff        jmpq   401010 <.plt>     # to PLT[0]

So CALL printf actually calls printf@plt. This is the PLT stub.
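Both jump targets in the stub are plain RIP-relative arithmetic: the encoded displacement is added to the address of the next instruction. As a small sketch (the helper names are ours, and the addresses are the ones from the objdump listing above, which will differ on your build), the stub bytes can be decoded to recover the GOT slot and the PLT[0] address:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian signed 32-bit displacement (assumes a little-endian host). */
static int32_t read_disp32(const uint8_t *p)
{
    uint32_t u;
    memcpy(&u, p, sizeof u);
    return (int32_t)u;
}

/* ff 25 disp32 = jmp *disp32(%rip): the target is loaded from next_ip + disp32. */
static uint64_t jmp_mem_rip_slot(uint64_t insn_addr, const uint8_t *insn)
{
    assert(insn[0] == 0xff && insn[1] == 0x25);
    return insn_addr + 6 + (int64_t)read_disp32(insn + 2);
}

/* e9 rel32 = jmp rel32: the target is next_ip + rel32. */
static uint64_t jmp_rel32_target(uint64_t insn_addr, const uint8_t *insn)
{
    assert(insn[0] == 0xe9);
    return insn_addr + 5 + (int64_t)read_disp32(insn + 1);
}

/* printf@plt bytes from the listing, starting at 0x401020. */
static const uint8_t printf_plt[16] = {
    0xff, 0x25, 0xe2, 0x2f, 0x00, 0x00,   /* jmp *0x2fe2(%rip) */
    0x68, 0x00, 0x00, 0x00, 0x00,         /* push $0x0         */
    0xe9, 0xe0, 0xff, 0xff, 0xff          /* jmp 401010        */
};
```

Decoding the first instruction at 0x401020 gives the GOT slot 0x404008, and decoding the last one at 0x40102b gives 0x401010, the start of PLT[0].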


Layer 2: The PLT Stub (First Call)

The PLT stub at 0x401020 does:

; printf@plt:
jmp    qword [rip + GOT_printf_offset]   ; indirect jump via GOT entry
push   0x0                               ; relocation index; reached only while the GOT entry still points here
jmp    PLT_0                             ; jump to PLT[0], which enters the resolver

Before the first call, the GOT entry for printf does not hold printf's address yet. The linker initialized it to the address of the stub's own second instruction (push 0x0), so the first instruction, jmp [GOT], simply lands on the next instruction and execution falls through into the resolver path.

On subsequent calls, the GOT entry has been updated to point directly to printf's address in libc, and the first jmp [GOT] goes directly there, skipping the resolver entirely. This is lazy binding.
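Lazy binding can be modeled in plain C with a function-pointer "GOT slot" that starts out pointing at a resolver. This is a simulation under simplifying assumptions: got_slot_greet, greet_resolver, real_greet and greet_plt are hypothetical names, the "lookup" is hard-coded, and the real resolver is assembly that must also preserve the argument registers. It does show the shape of the trick: the first call goes through the resolver, which patches the slot and forwards the call, and later calls go straight to the target.

```c
#include <stdio.h>

typedef int (*greet_fn)(const char *name, int count);

static int real_greet(const char *name, int count);
static int greet_resolver(const char *name, int count);

/* The simulated GOT slot: it starts out aimed at the resolver path, much as
   printf's GOT entry starts out aimed at the push inside printf@plt. */
static greet_fn got_slot_greet = greet_resolver;
static int resolver_runs = 0;

/* Stand-in for the library function the program actually wants. */
static int real_greet(const char *name, int count)
{
    return printf("Hello, %s! Count=%d\n", name, count);
}

/* Stand-in for the resolver: patch the slot, then forward the same
   arguments to the real function. */
static int greet_resolver(const char *name, int count)
{
    resolver_runs++;
    got_slot_greet = real_greet;         /* patch the GOT slot */
    return got_slot_greet(name, count);  /* continue into the target */
}

/* Stand-in for greet@plt: every call goes through the slot. */
static int greet_plt(const char *name, int count)
{
    return got_slot_greet(name, count);
}
```

Calling greet_plt twice prints two lines but increments resolver_runs only once; after the first call the slot holds real_greet and the resolver is never entered again.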


Layer 3: PLT[0] — The Resolver

PLT[0] is special:

; PLT[0] — calls the dynamic linker's resolver
pushq  qword [rip + GOT+8]    ; push link_map pointer (GOT[1])
jmpq   qword [rip + GOT+16]   ; jump (not call) to _dl_runtime_resolve, whose address is in GOT[2]

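GOT[0] = address of the .dynamic section (filled in by the linker)
GOT[1] = link_map pointer, the dynamic linker's runtime record for this ELF object (filled in by ld.so at load time)
GOT[2] = address of _dl_runtime_resolve, the dynamic linker's symbol resolution entry point (filled in by ld.so at load time)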

So the path is printf@plt → PLT[0] → _dl_runtime_resolve.


Layer 4: _dl_runtime_resolve in ld.so

The dynamic linker's resolver:

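  1. Receives the relocation index (pushed by printf@plt) and the link_map pointer (pushed by PLT[0]) on the stack, and saves the argument registers so printf will still see its arguments
  2. Uses the relocation index to find printf's entry in .rela.plt, which names the symbol "printf" and the GOT slot to patch
  3. Searches the loaded objects' dynamic symbol tables and finds printf in libc.so.6
  4. Writes the address of printf into the GOT entry for printf
  5. Restores the argument registers and jumps to printf

Steps 2 to 4 are carried out by _dl_fixup, a C function that _dl_runtime_resolve calls; the assembly wrapper handles steps 1 and 5.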

After this runs, the GOT entry for printf is updated. Future calls to printf@plt will hit the first jmp [GOT] and go directly to printf — bypassing the resolver entirely.
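The resolver's core work reduces to two lookups: the relocation index selects an entry naming the symbol and the GOT slot, and the symbol name is then searched for among the loaded libraries. The sketch below models that with small arrays; the structs, rela_plt, fake_libc and fixup are hypothetical simplifications, not glibc code (the real _dl_fixup reads .rela.plt and uses hashed dynamic symbol tables rather than a linear strcmp scan).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*fn_ptr)(void);

/* One jump-slot relocation: which symbol, and which GOT slot to patch. */
struct plt_reloc {
    const char *symbol;
    size_t got_index;
};

/* One exported definition in a loaded library. */
struct lib_symbol {
    const char *name;
    fn_ptr address;
};

static void fake_printf(void) {}
static void fake_exit(void) {}

/* Hypothetical stand-ins for the program's .rela.plt, its GOT, and libc's exports. */
static const struct plt_reloc rela_plt[] = {
    { "printf", 3 },   /* relocation index 0 -> GOT[3] */
    { "exit",   4 },   /* relocation index 1 -> GOT[4] */
};
static fn_ptr got[5];  /* GOT[0..2] are reserved for the dynamic linker */
static const struct lib_symbol fake_libc[] = {
    { "exit",   fake_exit },
    { "printf", fake_printf },
};

/* Resolve relocation reloc_index, patch its GOT slot, and return the
   address to jump to (NULL if no loaded library defines the symbol). */
static fn_ptr fixup(size_t reloc_index)
{
    assert(reloc_index < sizeof rela_plt / sizeof rela_plt[0]);
    const struct plt_reloc *r = &rela_plt[reloc_index];   /* index -> symbol + slot */

    for (size_t i = 0; i < sizeof fake_libc / sizeof fake_libc[0]; i++) {
        if (strcmp(fake_libc[i].name, r->symbol) == 0) {   /* symbol found */
            got[r->got_index] = fake_libc[i].address;      /* patch the GOT */
            return fake_libc[i].address;                   /* jump target */
        }
    }
    return NULL;  /* the real ld.so reports a symbol lookup error here */
}
```

fixup(0) fills GOT[3] with the printf stand-in; a later fixup(1) would fill GOT[4] for exit, each at most once because the patched slot bypasses the resolver afterwards.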


GDB Session: Tracing the Full Path

gdb ./trace_printf
(gdb) break _start
Breakpoint 1 at 0x401040

(gdb) run
Breakpoint 1, 0x0000000000401040 in _start ()

(gdb) disassemble
Dump of assembler code for function _start:
   0x401040:  push   %rbp
   0x401041:  mov    %rsp,%rbp
   0x401044:  sub    $0x8,%rsp
   0x401048:  mov    $0x402004,%edi    ; format string address
   0x40104d:  mov    $0x40200e,%esi    ; "World" address
   0x401052:  mov    $0x2a,%edx        ; 42
   0x401057:  xor    %eax,%eax
   0x401059:  call   0x401020 <printf@plt>   ← our call
   ...

(gdb) break *0x401059
(gdb) continue

(gdb) stepi    ; step INTO the call — enters printf@plt
(gdb) x/4i $rip
=> 0x401020 <printf@plt>:    jmpq   *0x2fe2(%rip)
   0x401026 <printf@plt+6>:  push   $0x0
   0x40102b <printf@plt+11>: jmpq   0x401010

(gdb) stepi    ; execute the jmp [GOT] — first call, goes to push
; We should now be at 0x401026 (the push, because GOT not yet resolved)

(gdb) x/1gx 0x404008   ; look at GOT entry for printf (address will vary)
0x404008:  0x0000000000401026   ; still points at the push in printf@plt (lazy binding)

(gdb) stepi   ; push $0x0 (relocation index)
(gdb) stepi   ; jmp PLT[0]
; Now in PLT[0]:
(gdb) x/4i $rip
=> 0x401010:  pushq  0x2fe2(%rip)    ; push link_map from GOT[1] (0x403ff8)
   0x401016:  jmpq   *0x2fe4(%rip)   ; jmp _dl_runtime_resolve via GOT[2] (0x404000)

(gdb) stepi   ; push link_map
(gdb) stepi   ; jmp to _dl_runtime_resolve (jumps into ld.so)

; Now inside ld-linux-x86-64.so.2!
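(gdb) bt      ; backtrace
#0  _dl_runtime_resolve_xsavec (...)  in /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
#1  0x40105e in _start ()
; No frame for printf@plt: the stub jumped rather than called, so the only
; return address on the stack is 0x40105e, the instruction after our call.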

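; ... _dl_runtime_resolve calls _dl_fixup and ld.so's symbol lookup code ...
; _dl_runtime_resolve never returns to printf@plt: it jumps straight into
; printf, so finish would run until printf itself returns to _start.
; To stop at printf's first instruction, break on it instead:

(gdb) break printf
(gdb) continue
(gdb) bt
#0  __printf (format=0x402004 "Hello, %s! Count=%d\n") in libc.so.6
#1  0x40105e in _start ()

; Now inside the real printf in libc.so.6!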

After the first call, check the GOT again:

(gdb) x/1gx 0x404008   ; same GOT entry, now updated
0x404008:  0x7ffff7e4a210   ; now points to printf in libc!

On the second call to printf (if there were one), jmp [GOT] would jump directly to 0x7ffff7e4a210: no resolver, no PLT[0], just one indirect jump through the GOT.


Layer 5: printf in libc

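Inside printf in libc, several things happen (the internal function names are glibc's and vary between versions):

  1. printf gathers its variadic arguments into a va_list and passes them, together with stdout, to vfprintf (called __vfprintf_internal in recent glibc)
  2. vfprintf parses the format string and writes the formatted characters into stdout's FILE buffer
  3. stdout is line-buffered when it is a terminal, so the trailing newline makes libio flush the buffer through internal functions such as _IO_new_file_overflow and _IO_new_file_write (not through fwrite); when stdout is a pipe or file it is fully buffered, and the flush waits until the buffer fills, fflush is called, or exit runs
  4. The flush ends in the write(1, buf, len) wrapper
  5. The wrapper executes the SYSCALL instruction

To catch that last step in GDB, break on the wrapper:

(gdb) break __write  ; or the specific syscall wrapper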
(gdb) continue
; ... eventually hits the write syscall:
(gdb) x/5i $rip
=> libc: mov    $0x1,%eax     ; syscall number 1 = write
   libc: syscall              ; invoke kernel

On x86-64 Linux, write is system call number 1. The libc wrapper executes SYSCALL with:

  - RAX = 1 (write)
  - RDI = 1 (fd = stdout)
  - RSI = buffer address
  - RDX = byte count

The kernel returns the number of bytes written, or a negative error code, in RAX.
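That register convention is all a minimal write wrapper needs. As a sketch for x86-64 Linux with GCC or Clang inline assembly (raw_write is our own hypothetical name; glibc's real wrapper also handles errno and thread cancellation), the same SYSCALL can be issued directly. SYSCALL overwrites RCX and R11, so they are declared as clobbers; on other platforms the sketch falls back to the libc write wrapper.

```c
#include <stddef.h>
#include <unistd.h>

/* Issue write(2) directly: RAX = 1, RDI = fd, RSI = buf, RDX = len.
   On x86-64 Linux this returns the byte count or a negative errno value
   (the raw kernel convention); the fallback returns -1 on error instead. */
static long raw_write(int fd, const void *buf, size_t len)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)                        /* RAX out: result */
                      : "0"(1L),                         /* RAX in: syscall number 1 */
                        "D"((long)fd), "S"(buf), "d"(len)
                      : "rcx", "r11", "memory");         /* clobbered by SYSCALL */
    return ret;
#else
    return (long)write(fd, buf, len);
#endif
}
```

Passing the 23-byte string "Hello, World! Count=42\n" to raw_write(1, ...) produces the same line printf prints, with no formatting or buffering layer in between.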


The Full Call Chain

Assembly code:
  CALL printf             → calls printf@plt

printf@plt:
  JMP [GOT_printf]        → (first call) falls through to push + PLT[0]
                          → (later calls) jumps directly to printf in libc

PLT[0]:
  push link_map
  JMP [GOT_resolver]      → _dl_runtime_resolve in ld.so

_dl_runtime_resolve:
  Looks up "printf" in libc.so.6
  Updates GOT[printf] = &printf
  Jumps to printf in libc

printf in libc:
  Formats the string into stdout's buffer (vfprintf)
  Flushes it through libio's internal write path → write_wrapper

write_wrapper in libc:
  MOV EAX, 1              ; syscall number
  SYSCALL                 ; invoke kernel

Linux kernel:
  Copies the buffer to whatever fd 1 refers to (terminal, pipe, file)
  Returns bytes written in RAX

Returns all the way back to printf
printf returns to assembly code

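Total call depth: the exact chain of frames depends on the glibc version. The dynamic linker's frames are not part of it while printf runs: _dl_runtime_resolve finishes its lookup and then jumps into printf, so a backtrace taken at the SYSCALL shows _start, printf, and a chain of libc formatting and flushing frames ending in the write wrapper.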


Checking with ltrace

ltrace intercepts library calls at the PLT level — exactly the layer we traced:

ltrace ./trace_printf
# Output:
# printf("Hello, %s! Count=%d\n", "World", 42) = 23

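ltrace runs the program under ptrace and plants breakpoints on the PLT entries of library functions; when one is hit, it reads the arguments from the registers, logs the call, and lets execution continue into printf. LD_PRELOAD (Chapter 24) works differently: it makes the dynamic linker resolve printf to a definition in a preloaded library first. Both rely on calls going through the PLT/GOT indirection traced above.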


Summary

A single CALL printf in assembly involves:

  1. The PLT stub — a 3-instruction trampoline in the binary's .plt section
  2. The GOT entry: initially points back into the PLT stub (the push), which leads to the resolver; updated after the first call to point directly to printf
  3. The dynamic linker (_dl_runtime_resolve) — only on the first call, resolves the symbol and patches the GOT
  4. printf in libc — formats the string
  5. write(2) system call — actually outputs to stdout via the kernel

After the first call, the PLT stub's jmp [GOT] goes directly to printf with no resolver overhead. Lazy binding means a process never pays to resolve library functions it imports but never calls, and pays exactly once per function for the ones it does call, no matter how often it calls them afterwards.

This mechanism is the foundation for GOT overwrite attacks (Chapter 36): if an attacker can write to the GOT entry for a function, every later call through that function's PLT stub follows the overwritten entry, so the stub's jmp [GOT] lands on the attacker's payload instead of printf.