Chapter 21: Understanding Compiler Output

The Compiler Is Your Reverse-Engineering Partner

The best way to learn assembly is to write C and then read what the compiler produces. The compiler has been writing correct, ABI-compliant, optimized assembly for 40 years. It knows patterns you haven't learned yet. And reading its output is the fastest way to understand what well-written assembly looks like.

This chapter teaches you to read compiler output fluently: GAS (AT&T) syntax, the compilation flags that change what you see, and the specific patterns to recognize.


21.1 Compiling to Assembly

# Compile C to assembly text (GAS syntax by default)
gcc -S program.c -o program.s

# Intel syntax (easier to read if you know NASM)
gcc -S -masm=intel program.c -o program.s

# Add C source as comments in the output
gcc -S -fverbose-asm program.c -o program.s

# Different optimization levels
gcc -S -O0 program.c -o program_O0.s   # no optimization (default debug)
gcc -S -O1 program.c -o program_O1.s   # basic optimizations
gcc -S -O2 program.c -o program_O2.s   # standard optimizations
gcc -S -O3 program.c -o program_O3.s   # aggressive (may vectorize)
gcc -S -Os program.c -o program_Os.s   # optimize for size

21.2 Reading GAS (AT&T) Syntax

GCC defaults to AT&T syntax (also called GAS syntax after the GNU Assembler). If you know NASM (Intel syntax), AT&T will feel backwards. Here's the translation:

Source and Destination are Swapped

This is the source of endless confusion:

AT&T syntax:          Intel syntax (NASM):
movq %rbx, %rax       mov  rax, rbx        ; rax = rbx
addq %rcx, %rax       add  rax, rcx        ; rax += rcx

In AT&T: source comes first, destination comes second. In Intel: destination comes first, source comes second.

The mnemonic: AT&T looks like an assignment written backwards — dst = src becomes src → dst.

Size Suffixes

AT&T appends a size suffix to the mnemonic:

Suffix  Size       NASM equivalent
b       8-bit      byte
w       16-bit     word
l       32-bit     dword ('l' for "long", from the era when C's long was 32-bit)
q       64-bit     qword

movb %al, (%rdi)     ; NASM: mov byte [rdi], al
movw %ax, (%rdi)     ; NASM: mov word [rdi], ax
movl %eax, (%rdi)    ; NASM: mov dword [rdi], eax
movq %rax, (%rdi)    ; NASM: mov qword [rdi], rax

Register Prefix %

All registers have a % prefix:

AT&T:    %rax, %rbx, %rsp
Intel:   rax,  rbx,  rsp

Immediate Prefix $

Immediate values have a $ prefix:

AT&T:    $42, $0xFF
Intel:   42,  0xFF

Memory Operands

AT&T syntax:                  Intel syntax (NASM):
(%rax)                        [rax]
8(%rbp)                       [rbp + 8]
-8(%rbp)                      [rbp - 8]
(%rax,%rcx,8)                 [rax + rcx*8]
8(%rax,%rcx,4)                [rax + rcx*4 + 8]

AT&T memory format: disp(base, index, scale) → computes base + index*scale + disp.

Complete Syntax Comparison Table

AT&T (GCC default)          Intel (NASM)              Operation
─────────────────────────────────────────────────────────────────────────
movq %rbx, %rax             mov  rax, rbx             rax = rbx
movq $42, %rax              mov  rax, 42              rax = 42
movq (%rbx), %rax           mov  rax, [rbx]           rax = *rbx
movq 8(%rbx), %rax          mov  rax, [rbx+8]         rax = *(rbx+8)
movq %rax, -8(%rbp)         mov  [rbp-8], rax         *(rbp-8) = rax
leaq (%rax,%rcx,8), %rdx    lea  rdx, [rax+rcx*8]     rdx = rax+rcx*8
addq %rcx, %rax             add  rax, rcx             rax += rcx
subq $1, %rax               sub  rax, 1               rax -= 1
imulq %rbx                  imul rbx                  rdx:rax = rax * rbx
cmpq %rcx, %rax             cmp  rax, rcx             set flags for rax-rcx
je   .label                 je   label                jump if equal
callq printf                call printf               call printf
retq                        ret                       return
─────────────────────────────────────────────────────────────────────────

💡 Mental Model: When reading GCC output, translate every instruction by: (1) removing %, $, and size suffix letters, (2) swapping source and destination, (3) converting disp(base,index,scale) to [base+index*scale+disp]. After a week, you'll stop needing to translate.


21.3 GCC Compiler Output Patterns

Function Prologue and Epilogue

int foo(int a, int b) {
    int x = a + b;
    return x * 2;
}

GCC -O0 output (AT&T):

foo:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -20(%rbp)    # store arg a to stack
    movl    %esi, -24(%rbp)    # store arg b to stack
    movl    -20(%rbp), %edx    # reload a
    movl    -24(%rbp), %eax    # reload b
    addl    %edx, %eax         # eax = a + b
    movl    %eax, -4(%rbp)     # store x
    movl    -4(%rbp), %eax     # reload x
    addl    %eax, %eax         # eax = x + x = x * 2
    popq    %rbp
    ret

GCC -O2 output:

foo:
    leal    (%rdi,%rsi), %eax  # eax = a + b (LEA as a non-destructive, flag-free add)
    addl    %eax, %eax         # eax *= 2
    ret

No prologue, no epilogue, no stack frame at -O2. The compiler realized foo is a leaf function with no local variables that need to be addressable. Everything lives in registers.

if-else

int abs_val(int x) {
    if (x < 0) return -x;
    return x;
}

GCC -O0:

abs_val:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -4(%rbp)
    cmpl    $0, -4(%rbp)    # compare x to 0
    jge     .L2              # if x >= 0, jump to return x
    negl    -4(%rbp)         # x = -x (in memory — inefficient but predictable)
.L2:
    movl    -4(%rbp), %eax
    popq    %rbp
    ret

GCC -O2:

abs_val:
    movl    %edi, %eax
    negl    %eax             # eax = -x
    testl   %edi, %edi       # set flags based on x
    cmovns  %edi, %eax       # if x >= 0 (NS = not sign), eax = x (original)
    ret

CMOVNS (conditional move if not sign): branchless abs_val. The compiler transformed the if into a conditional move.

For Loop

int64_t sum_to_n(int n) {
    int64_t sum = 0;
    for (int i = 1; i <= n; i++) {
        sum += i;
    }
    return sum;
}

GCC -O0:

sum_to_n:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -20(%rbp)    # store n
    movq    $0, -8(%rbp)       # sum = 0
    movl    $1, -12(%rbp)      # i = 1
    jmp     .L4                # jump to condition check
.L5:
    movl    -12(%rbp), %eax
    cltq                       # sign-extend EAX to RAX (for 64-bit sum)
    addq    %rax, -8(%rbp)     # sum += i
    addl    $1, -12(%rbp)      # i++
.L4:
    movl    -12(%rbp), %eax
    cmpl    -20(%rbp), %eax    # compare i to n
    jle     .L5                # if i <= n, continue
    movq    -8(%rbp), %rax     # load sum into return register
    popq    %rbp
    ret

GCC -O2 (the loop disappears entirely):

sum_to_n:
    testl   %edi, %edi         # test n
    jle     .L3                # if n <= 0, return 0
    movslq  %edi, %rdi         # sign-extend n to 64-bit
    leaq    1(%rdi), %rax      # rax = n + 1
    imulq   %rdi, %rax         # rax = n * (n + 1)
    sarq    $1, %rax           # rax /= 2 (arithmetic right shift = divide by 2)
    ret
.L3:
    xorl    %eax, %eax         # return 0
    ret

GCC -O2 recognized the sum-of-consecutive-integers idiom, sum(1..n) = n*(n+1)/2, and emitted the closed-form formula instead of the loop. It doesn't always manage this, but it's a striking example of optimization.

switch/case as Jump Table

int day_type(int day) {   // 0=Sun, 1=Mon, ..., 6=Sat
    switch (day) {
    case 0: case 6: return 0;  // weekend
    case 1: case 2: case 3: case 4: case 5: return 1;  // weekday
    default: return -1;
    }
}

GCC -O2:

day_type:
    cmpl    $6, %edi           # compare day to 6
    ja      .Ldefault          # if day > 6 (unsigned), default case
    movl    %edi, %edi         # zero-extend day to 64-bit index
    jmp     *.L_jumptable(,%rdi,8)   # indirect jump via jump table!

.L_jumptable:
    .quad   .Lweekend           # day=0: Sun
    .quad   .Lweekday           # day=1: Mon
    .quad   .Lweekday           # day=2: Tue
    .quad   .Lweekday           # day=3: Wed
    .quad   .Lweekday           # day=4: Thu
    .quad   .Lweekday           # day=5: Fri
    .quad   .Lweekend           # day=6: Sat

.Lweekend:
    xorl    %eax, %eax         # return 0
    ret
.Lweekday:
    movl    $1, %eax           # return 1
    ret
.Ldefault:
    movl    $-1, %eax          # return -1
    ret

The jmp *.L_jumptable(,%rdi,8) is a jump table: RIP = *(jumptable + day * 8). For switch statements with dense case values (0-6 here), GCC creates a jump table for O(1) dispatch rather than a series of comparisons.

Recursive Function

int64_t fibonacci(int n) {
    if (n <= 1) return n;
    return fibonacci(n-1) + fibonacci(n-2);
}

GCC -O1 (no tail-call optimization possible for binary recursion):

fibonacci:
    pushq   %rbp
    pushq   %rbx
    subq    $8, %rsp            # align stack
    movl    %edi, %ebx          # save n (callee-saved)
    movl    $0, %eax
    testl   %edi, %edi          # n == 0?
    je      .Lreturn
    movl    $1, %eax
    cmpl    $1, %edi            # n == 1?
    je      .Lreturn
    leal    -1(%rdi), %edi      # arg = n-1
    call    fibonacci           # rax = fib(n-1)
    movl    %eax, %ebp          # save fib(n-1) in callee-saved rbp
    leal    -2(%rbx), %edi      # arg = n-2
    call    fibonacci           # rax = fib(n-2)
    addl    %ebp, %eax          # rax = fib(n-1) + fib(n-2)
.Lreturn:
    addq    $8, %rsp
    popq    %rbx
    popq    %rbp
    ret

Key observation: n is saved in RBX (callee-saved), and the intermediate fib(n-1) result is saved in RBP (callee-saved) across the second recursive call.


21.4 Optimization Levels in Detail

-O0 (No Optimization)

Default for debug builds. Every variable lives at a fixed stack location. Every expression evaluates through memory. The code is predictable but slow.

Characteristics:

  • All local variables allocated at RBP-relative offsets
  • Arguments are immediately stored to the stack
  • Values are reloaded before each use (no register caching)
  • Function calls follow the ABI exactly with no shortcuts

Use case: debugging. GDB can inspect every variable because they're all in memory.

-O1 (Basic Optimizations)

  • Dead code elimination
  • Common subexpression elimination
  • Basic register allocation (some variables move to registers)
  • Tail call optimization (in some cases)
  • Constant folding (2+3 becomes 5 at compile time)

The output is shorter and faster but still generally readable.

-O2 (Standard Optimizations)

This is the standard production optimization level. Enables:

  • All -O1 optimizations
  • Inlining: small functions are inlined at call sites
  • Vectorization: loops may be converted to SIMD
  • Loop unrolling: inner loops may be partially unrolled
  • Strength reduction: x*8 → x << 3, x/9 → multiply-high + shift
  • Branch prediction hints: annotates hot paths
  • CMOV: conditional moves replace predictable branches

Most commercial software is compiled at -O2.

-O3 (Aggressive Optimization)

Adds:

  • More aggressive vectorization
  • Loop transformations (interchange, fusion)
  • More inlining
  • Speculative execution of side-effect-free operations

Sometimes -O3 is slower than -O2 due to code size increase causing cache pressure.

-Os (Optimize for Size)

Optimizes for minimum code size at the expense of speed:

  • Avoids loop unrolling
  • Limits function inlining (inlining grows code)
  • Prefers compact instruction encodings

Used in embedded systems, bootloaders, and shared libraries where code size matters more than throughput.


21.5 Compiler Explorer (godbolt.org)

Compiler Explorer is the single most useful tool for understanding assembly. You write C (or C++, Rust, Go, etc.) in the left panel and see the assembly output instantly in the right panel.

Key Features

Multi-compiler comparison: see GCC and Clang side by side. They often make different optimization choices.

Multiple architectures: x86-64, ARM64, RISC-V, MIPS, PowerPC — all from the same C source. Chapter 19's comparison examples were generated here.

Optimization level dropdown: change from -O0 to -O2 and instantly see what changes.

Color highlighting: source lines are color-coded to show which assembly instructions they map to.

Diff view: compare two compiler outputs to see exactly what changed.

Using Compiler Explorer

  1. Go to https://godbolt.org
  2. Type or paste C code in the left panel
  3. Select compiler (e.g., "x86-64 gcc 13.2") and flags (e.g., "-O2 -march=native")
  4. The right panel shows AT&T syntax assembly

To switch to Intel syntax: add -masm=intel to the compiler flags.

🛠️ Lab Exercise: Paste the fibonacci function into Compiler Explorer. Compare GCC -O0 output to GCC -O2 output. Then add -O2 -fno-optimize-sibling-calls and see what changes. Then switch to ARM64 (aarch64 gcc 13.2) and compare the ARM64 output to the x86-64 output.


21.6 Specific Optimization Patterns to Recognize

Strength Reduction: Multiply by Constant

int x = n * 9;

GCC -O2 output:

leal    (%rdi,%rdi,8), %eax    # eax = rdi + rdi*8 = 9*rdi

LEA computes base + index*scale + disp in one instruction without touching flags. Valid SIB scales are 1, 2, 4, and 8, so a single LEA directly handles multipliers like 2, 3, 4, 5, 8, and 9.

For multiply by constants not reachable with a single LEA:

int x = n * 37;

GCC might emit:

leal    (%rdi,%rdi,8), %eax   # eax = 9n
leal    (%rdi,%rax,4), %eax   # eax = n + 4*(9n) = 37n
# Or simply:
imull   $37, %edi, %eax       # three-operand IMUL with immediate

Constant Folding

int x = 2 * 3 + 4;  // GCC computes this at compile time

Output: movl $10, %eax — the arithmetic happens at compile time, not runtime.

Dead Code Elimination

int foo(int x) {
    if (0) return 42;  // unreachable
    return x + 1;
}

Output: only leal 1(%rdi), %eax; ret — the dead branch is completely removed.

Loop Invariant Code Motion

int sum_with_multiplier(int *arr, int n, int multiplier) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += arr[i] * multiplier;  // multiplier doesn't change
    }
    return sum;
}

GCC -O2 recognizes that multiplier is loop-invariant and keeps it in a register throughout the loop, not reloading it each iteration. At -O0, it would reload from the stack each time.

Integer Division by Constant: The Magic Number

int x = y / 7;

GCC -O2 output:

movl    %edi, %eax
movl    $-1840700269, %edx   # magic number = 0x92492493
imull   %edx                 # edx:eax = y * magic; edx = high 32 bits
addl    %edi, %edx           # magic is "negative": add y back in
movl    %edx, %eax
sarl    $2, %eax             # shift right by 2
sarl    $31, %edi            # edi = sign of y (0 or -1)
subl    %edi, %eax           # adjust for negative dividends

No DIV instruction! Division by a constant is replaced by a multiply-high + shift sequence. This is 2-3× faster than IDIV. The "magic number" is a precomputed constant that, when multiplied and shifted, produces the quotient. (The math is from Hacker's Delight.)

Tail Call Optimization

int64_t factorial_tail(int n, int64_t acc) {
    if (n <= 1) return acc;
    return factorial_tail(n - 1, n * acc);
}

GCC -O2 with tail-call optimization:

factorial_tail:
    cmpl    $1, %edi
    jle     .Lbase           # n <= 1: return acc unchanged
.Lloop:
    imulq   %rdi, %rsi       # acc = n * acc
    subl    $1, %edi         # n--
    cmpl    $1, %edi         # n > 1?
    jg      .Lloop           # loop while n > 1
.Lbase:
    movq    %rsi, %rax       # return acc
    ret

The recursive call was converted to a loop — no stack frames accumulate.


21.7 Using -fverbose-asm

The -fverbose-asm flag adds comments that link assembly instructions back to C source lines:

gcc -S -O2 -fverbose-asm foo.c -o foo.s

Output example:

sum_to_n:
.LFB0:
    .cfi_startproc
    testl   %edi, %edi          # n
    jle     .L3                 # ,
    movl    %edi, %eax          # n, n
    leal    1(%rdi), %edi       #, tmp89
    imull   %eax, %edi          # n, tmp89
    sarl    %edi                # tmp89
    movslq  %edi, %rax          #, <retval>
    ret
.L3:
    xorl    %eax, %eax          # <retval>
    ret

The # variable_name comments identify which C variable each register corresponds to at that point.


21.8 Reading Compiler Output for Debugging

When debugging "why did my optimized program give the wrong answer," reading the compiler output often reveals the issue:

Aliasing violations: if you access the same bytes through pointers of incompatible types (an int * and a float *, for example), strict aliasing lets the compiler assume those pointers don't alias, and it may reorder loads and stores in ways that break your code.

Undefined behavior: GCC exploits undefined behavior (signed integer overflow, out-of-bounds access) to make optimizations. Code with UB may look correct in C but produce incorrect assembly. -fsanitize=undefined catches this at runtime.

Incorrect register usage in inline assembly: If your inline asm doesn't declare all clobbers, the compiler may cache a value in a register that your asm overwrites. Reading the -S output shows exactly what register the compiler allocated.


🔄 Check Your Understanding:

  1. movq %rbx, %rax in AT&T syntax — which direction does the move go?
  2. What does leal (%rax,%rcx,4), %rdx compute?
  3. Why does GCC -O2 sometimes replace x / constant with a multiply-and-shift sequence?
  4. What is "tail call optimization" and what must be true for the compiler to apply it?
  5. When does GCC use a jump table vs. a series of comparisons for a switch statement?


Summary

Reading compiler output requires learning AT&T syntax (swapped operands, % registers, $ immediates, disp(base,index,scale) memory) and understanding that GCC at -O2 produces substantially different code than at -O0: no stack frames for simple leaf functions, CMOV instead of branches, strength reduction for multiply/divide, and occasional recognition of mathematical identities like sum-of-integers.

Compiler Explorer is your best tool: interactive, multi-compiler, multi-architecture, and instant. Use it to see what your C code actually compiles to, compare optimization levels, and understand why the assembly looks the way it does.