In This Chapter
- The Compiler Is Your Reverse-Engineering Partner
- 21.1 Compiling to Assembly
- 21.2 Reading GAS (AT&T) Syntax
- 21.3 GCC Compiler Output Patterns
- 21.4 Optimization Levels in Detail
- 21.5 Compiler Explorer (godbolt.org)
- 21.6 Specific Optimization Patterns to Recognize
- 21.7 Using -fverbose-asm
- 21.8 Reading Compiler Output for Debugging
- Summary
Chapter 21: Understanding Compiler Output
The Compiler Is Your Reverse-Engineering Partner
The best way to learn assembly is to write C and then read what the compiler produces. The compiler has been writing correct, ABI-compliant, optimized assembly for 40 years. It knows patterns you haven't learned yet. And reading its output is the fastest way to understand what well-written assembly looks like.
This chapter teaches you to read compiler output fluently: GAS (AT&T) syntax, the compilation flags that change what you see, and the specific patterns to recognize.
21.1 Compiling to Assembly
# Compile C to assembly text (GAS syntax by default)
gcc -S program.c -o program.s
# Intel syntax (easier to read if you know NASM)
gcc -S -masm=intel program.c -o program.s
# Add C source as comments in the output
gcc -S -fverbose-asm program.c -o program.s
# Different optimization levels
gcc -S -O0 program.c -o program_O0.s # no optimization (default debug)
gcc -S -O1 program.c -o program_O1.s # basic optimizations
gcc -S -O2 program.c -o program_O2.s # standard optimizations
gcc -S -O3 program.c -o program_O3.s # aggressive (may vectorize)
gcc -S -Os program.c -o program_Os.s # optimize for size
21.2 Reading GAS (AT&T) Syntax
GCC defaults to AT&T syntax (also called GAS syntax after the GNU Assembler). If you know NASM (Intel syntax), AT&T will feel backwards. Here's the translation:
Source and Destination are Swapped
This is the source of endless confusion:
AT&T syntax:           Intel syntax (NASM):
movq %rbx, %rax        mov rax, rbx      ; rax = rbx
addq %rcx, %rax        add rax, rcx      ; rax += rcx
In AT&T: source comes first, destination comes second. In Intel: destination comes first, source comes second.
The mnemonic: AT&T looks like an assignment written backwards — dst = src becomes src → dst.
Size Suffixes
AT&T appends a size suffix to the mnemonic:
Suffix   Size     NASM equivalent
b        8-bit    byte
w        16-bit   word
l        32-bit   dword  ('l' = "long", which was 32 bits in the original naming)
q        64-bit   qword
movb %al, (%rdi)     # NASM: mov byte [rdi], al
movw %ax, (%rdi)     # NASM: mov word [rdi], ax
movl %eax, (%rdi)    # NASM: mov dword [rdi], eax
movq %rax, (%rdi)    # NASM: mov qword [rdi], rax
Register Prefix %
All registers have a % prefix:
AT&T: %rax, %rbx, %rsp
Intel: rax, rbx, rsp
Immediate Prefix $
Immediate values have a $ prefix:
AT&T: $42, $0xFF
Intel: 42, 0xFF
Memory Operands
AT&T syntax:           Intel syntax (NASM):
(%rax)                 [rax]
8(%rbp)                [rbp + 8]
-8(%rbp)               [rbp - 8]
(%rax,%rcx,8)          [rax + rcx*8]
8(%rax,%rcx,4)         [rax + rcx*4 + 8]
AT&T memory format: disp(base, index, scale) → computes base + index*scale + disp.
Complete Syntax Comparison Table
AT&T (GCC default) Intel (NASM) Operation
─────────────────────────────────────────────────────────────────────────
movq %rbx, %rax mov rax, rbx rax = rbx
movq $42, %rax mov rax, 42 rax = 42
movq (%rbx), %rax mov rax, [rbx] rax = *rbx
movq 8(%rbx), %rax mov rax, [rbx+8] rax = *(rbx+8)
movq %rax, -8(%rbp) mov [rbp-8], rax *(rbp-8) = rax
leaq (%rax,%rcx,8), %rdx lea rdx, [rax+rcx*8] rdx = rax+rcx*8
addq %rcx, %rax add rax, rcx rax += rcx
subq $1, %rax sub rax, 1 rax -= 1
imulq %rbx imul rbx rdx:rax = rax * rbx
cmpq %rcx, %rax cmp rax, rcx set flags for rax-rcx
je .label je label jump if equal
callq printf call printf call printf
retq ret return
─────────────────────────────────────────────────────────────────────────
💡 Mental Model: When reading GCC output, translate every instruction by: (1) removing %, $, and the size suffix letter, (2) swapping source and destination, (3) converting disp(base,index,scale) to [base + index*scale + disp]. After a week, you'll stop needing to translate.
21.3 GCC Compiler Output Patterns
Function Prologue and Epilogue
int foo(int a, int b) {
int x = a + b;
return x * 2;
}
GCC -O0 output (AT&T):
foo:
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp) # store arg a to stack
movl %esi, -24(%rbp) # store arg b to stack
movl -20(%rbp), %edx # reload a
movl -24(%rbp), %eax # reload b
addl %edx, %eax # eax = a + b
movl %eax, -4(%rbp) # store x
movl -4(%rbp), %eax # reload x
addl %eax, %eax # eax = x + x = x * 2
popq %rbp
ret
GCC -O2 output:
foo:
leal (%rdi,%rsi), %eax # eax = a + b (LEA: three-operand add, doesn't touch flags)
addl %eax, %eax # eax *= 2
ret
No prologue, no epilogue, no stack frame at -O2. The compiler realized foo is a leaf function with no local variables that need to be addressable. Everything lives in registers.
if-else
int abs_val(int x) {
if (x < 0) return -x;
return x;
}
GCC -O0:
abs_val:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
cmpl $0, -4(%rbp) # compare x to 0
jge .L2 # if x >= 0, jump to return x
negl -4(%rbp) # x = -x (in memory — inefficient but predictable)
.L2:
movl -4(%rbp), %eax
popq %rbp
ret
GCC -O2:
abs_val:
movl %edi, %eax
negl %eax # eax = -x
testl %edi, %edi # set flags based on x
cmovns %edi, %eax # if x >= 0 (NS = not sign), eax = x (original)
ret
CMOVNS (conditional move if not sign): branchless abs_val. The compiler transformed the if into a conditional move.
For Loop
int64_t sum_to_n(int n) {
int64_t sum = 0;
for (int i = 1; i <= n; i++) {
sum += i;
}
return sum;
}
GCC -O0:
sum_to_n:
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp) # store n
movq $0, -8(%rbp) # sum = 0
movl $1, -12(%rbp) # i = 1
jmp .L4 # jump to condition check
.L5:
movl -12(%rbp), %eax
cltq # sign-extend EAX to RAX (for 64-bit sum)
addq %rax, -8(%rbp) # sum += i
addl $1, -12(%rbp) # i++
.L4:
movl -12(%rbp), %eax
cmpl -20(%rbp), %eax # compare i to n
jle .L5 # if i <= n, continue
movq -8(%rbp), %rax # load sum into return register
popq %rbp
ret
GCC -O2:
sum_to_n:
testl %edi, %edi # test n
jle .L3 # if n <= 0, return 0
movslq %edi, %rdi # sign-extend n to 64-bit
leaq 1(%rdi), %rax # rax = n + 1
imulq %rdi, %rax # rax = n * (n+1)
sarq $1, %rax # rax /= 2 (arithmetic right shift = divide by 2)
ret
.L3:
xorl %eax, %eax
ret
GCC -O2 replaced the loop with n*(n+1)/2. This is GCC recognizing an idiom and applying the mathematical closed form. It doesn't always do this, but it's a striking example of optimization.
switch/case as Jump Table
int day_type(int day) { // 0=Sun, 1=Mon, ..., 6=Sat
switch (day) {
case 0: case 6: return 0; // weekend
case 1: case 2: case 3: case 4: case 5: return 1; // weekday
default: return -1;
}
}
GCC -O2:
day_type:
cmpl $6, %edi # compare day to 6
ja .Ldefault # if day > 6 (unsigned), default case
movl %edi, %edi # zero-extend day to 64-bit index
jmp *.L_jumptable(,%rdi,8) # indirect jump via jump table!
.L_jumptable:
.quad .Lweekend # day=0: Sun
.quad .Lweekday # day=1: Mon
.quad .Lweekday # day=2: Tue
.quad .Lweekday # day=3: Wed
.quad .Lweekday # day=4: Thu
.quad .Lweekday # day=5: Fri
.quad .Lweekend # day=6: Sat
.Lweekend:
xorl %eax, %eax # return 0
ret
.Lweekday:
movl $1, %eax # return 1
ret
.Ldefault:
movl $-1, %eax # return -1
ret
The jmp *.L_jumptable(,%rdi,8) is a jump table: RIP = *(jumptable + day * 8). For switch statements with dense case values (0-6 here), GCC creates a jump table for O(1) dispatch rather than a series of comparisons.
Recursive Function
int64_t fibonacci(int n) {
if (n <= 1) return n;
return fibonacci(n-1) + fibonacci(n-2);
}
GCC -O1 (no tail-call optimization possible for binary recursion):
fibonacci:
pushq %rbp
pushq %rbx
subq $8, %rsp # align stack
movl %edi, %ebx # save n (callee-saved)
movl $0, %eax
testl %edi, %edi # n == 0?
je .Lreturn
movl $1, %eax
cmpl $1, %edi # n == 1?
je .Lreturn
leal -1(%rdi), %edi # arg = n-1
call fibonacci # rax = fib(n-1)
movq %rax, %rbp # save fib(n-1) in callee-saved rbp (64-bit: results outgrow 32 bits)
leal -2(%rbx), %edi # arg = n-2
call fibonacci # rax = fib(n-2)
addq %rbp, %rax # rax = fib(n-1) + fib(n-2)
.Lreturn:
addq $8, %rsp
popq %rbx
popq %rbp
ret
Key observation: n is saved in RBX (callee-saved), and the intermediate fib(n-1) result is saved in RBP (callee-saved) across the second recursive call.
21.4 Optimization Levels in Detail
-O0 (No Optimization)
Default for debug builds. Every variable lives at a fixed stack location. Every expression evaluates through memory. The code is predictable but slow.
Characteristics:
- All local variables allocated at RBP-relative offsets
- Arguments are immediately stored to the stack
- Values are reloaded before each use (no register caching)
- Function calls follow the ABI exactly with no shortcuts
Use case: debugging. GDB can inspect every variable because they're all in memory.
-O1 (Basic Optimizations)
- Dead code elimination
- Common subexpression elimination
- Basic register allocation (some variables move to registers)
- Tail call optimization (in some cases)
- Constant folding (2+3 becomes 5 at compile time)
The output is shorter and faster but still generally readable.
-O2 (Standard Optimizations)
This is the standard production optimization level. Enables:
- All -O1 optimizations
- Inlining: small functions are inlined at call sites
- Vectorization: loops may be converted to SIMD
- Loop unrolling: inner loops may be partially unrolled
- Strength reduction: x*8 → x << 3, x/9 → multiply-high sequence
- Branch prediction hints: annotates hot paths
- CMOV: conditional moves replace predictable branches
Most commercial software is compiled at -O2.
-O3 (Aggressive Optimization)
Adds:
- More aggressive vectorization
- Loop transformations (interchange, fusion)
- More inlining
- Speculative execution of side-effect-free operations
Sometimes -O3 is slower than -O2 due to code size increase causing cache pressure.
-Os (Optimize for Size)
Optimizes for minimum code size at the expense of speed:
- Avoids loop unrolling
- Limits function inlining (inlining grows code)
- Prefers compact instruction encodings
Used in embedded systems, bootloaders, and shared libraries where code size matters more than throughput.
21.5 Compiler Explorer (godbolt.org)
Compiler Explorer is the single most useful tool for understanding assembly. You write C (or C++, Rust, Go, etc.) in the left panel and see the assembly output instantly in the right panel.
Key Features
Multi-compiler comparison: see GCC and Clang side by side. They often make different optimization choices.
Multiple architectures: x86-64, ARM64, RISC-V, MIPS, PowerPC — all from the same C source. Chapter 19's comparison examples were generated here.
Optimization level dropdown: change from -O0 to -O2 and instantly see what changes.
Color highlighting: source lines are color-coded to show which assembly instructions they map to.
Diff view: compare two compiler outputs to see exactly what changed.
Using Compiler Explorer
- Go to https://godbolt.org
- Type or paste C code in the left panel
- Select compiler (e.g., "x86-64 gcc 13.2") and flags (e.g., "-O2 -march=native")
- The right panel shows AT&T syntax assembly
To switch to Intel syntax: add -masm=intel to the compiler flags.
🛠️ Lab Exercise: Paste the fibonacci function into Compiler Explorer. Compare GCC -O0 output to GCC -O2 output. Then add -O2 -fno-optimize-sibling-calls and see what changes. Then switch to ARM64 (aarch64 gcc 13.2) and compare the ARM64 output to the x86-64 output.
21.6 Specific Optimization Patterns to Recognize
Strength Reduction: Multiply by Constant
int x = n * 9;
GCC -O2 output:
leal (%rdi,%rdi,8), %eax # eax = rdi + rdi*8 = 9*rdi
LEA computes base + index*scale + disp, and the SIB scale must be 1, 2, 4, or 8 — so a single LEA (with base and index the same register) can multiply by 2, 3, 5, or 9. For multiply by 5: leal (%rdi,%rdi,4), %eax = rdi + rdi*4 = 5*rdi.
For multiply by constants not reachable with a single LEA:
int x = n * 37;
GCC might emit:
leal (%rdi,%rdi,8), %eax # eax = n + n*8 = 9n
leal (%rdi,%rax,4), %eax # eax = n + 9n*4 = 37n
# Or simply:
imull $37, %edi, %eax # single IMUL with immediate
Constant Folding
int x = 2 * 3 + 4; // GCC computes this at compile time
Output: movl $10, %eax — the arithmetic happens at compile time, not runtime.
Dead Code Elimination
int foo(int x) {
if (0) return 42; // unreachable
return x + 1;
}
Output: only leal 1(%rdi), %eax; ret — the dead branch is completely removed.
Loop Invariant Code Motion
int sum_with_multiplier(int *arr, int n, int multiplier) {
int sum = 0;
for (int i = 0; i < n; i++) {
sum += arr[i] * multiplier; // multiplier doesn't change
}
return sum;
}
GCC -O2 recognizes that multiplier is loop-invariant and keeps it in a register throughout the loop, not reloading it each iteration. At -O0, it would reload from the stack each time.
Integer Division by Constant: The Magic Number
int x = y / 7;
GCC -O2 output:
movl %edi, %eax
movl $-1840700269, %edx # magic number = 0x92492493
imull %edx # edx:eax = y * magic
leal (%rdx,%rdi), %eax # eax = high half + y (add-back: magic is negative)
sarl $2, %eax # shift right
sarl $31, %edi # extract sign bit
subl %edi, %eax # adjust for negative dividends
No DIV instruction! Division by a constant is replaced by a multiply-high + shift sequence. This is 2-3× faster than IDIV. The "magic number" is a precomputed constant that, when multiplied and shifted, produces the quotient. (The math is from Hacker's Delight.)
Tail Call Optimization
int64_t factorial_tail(int n, int64_t acc) {
if (n <= 1) return acc;
return factorial_tail(n - 1, n * acc);
}
GCC -O2 with tail-call optimization:
factorial_tail:
movl $1, %eax
testl %edi, %edi
jle .Lbase
.Lloop:
imulq %rdi, %rsi # acc = n * acc
subl $1, %edi # n--
testl %edi, %edi # n > 1?
jg .Lloop # loop
movq %rsi, %rax # return acc
ret
.Lbase:
movq %rsi, %rax
ret
The recursive call was converted to a loop — no stack frames accumulate.
21.7 Using -fverbose-asm
The -fverbose-asm flag adds comments that name the C variables and compiler temporaries each instruction operates on:
gcc -S -O2 -fverbose-asm foo.c -o foo.s
Output example:
sum_to_n:
.LFB0:
.cfi_startproc
testl %edi, %edi # n
jle .L3 # ,
movl %edi, %eax # n, n
leal 1(%rdi), %edi #, tmp89
imull %eax, %edi # n, tmp89
sarl %edi # tmp89
movslq %edi, %rax #, <retval>
ret
.L3:
xorl %eax, %eax # <retval>
ret
The # variable_name comments identify which C variable each register corresponds to at that point.
21.8 Reading Compiler Output for Debugging
When debugging "why did my optimized program give the wrong answer," reading the compiler output often reveals the issue:
Aliasing violations: if you write to memory through two pointers of incompatible types (say, an int * and a float *), strict-aliasing rules let the compiler assume they don't overlap, and it may reorder loads and stores in ways that break your code. The same applies to pointers you declared restrict.
Undefined behavior: GCC exploits undefined behavior (signed integer overflow, out-of-bounds access) to make optimizations. Code with UB may look correct in C but produce incorrect assembly. -fsanitize=undefined catches this at runtime.
Incorrect register usage in inline assembly: If your inline asm doesn't declare all clobbers, the compiler may cache a value in a register that your asm overwrites. Reading the -S output shows exactly what register the compiler allocated.
🔄 Check Your Understanding:
1. movq %rbx, %rax in AT&T syntax — which direction does the move go?
2. What does leal (%rax,%rcx,4), %rdx compute?
3. Why does GCC -O2 sometimes replace x / constant with a multiply-and-shift sequence?
4. What is "tail call optimization" and what must be true for the compiler to apply it?
5. When does GCC use a jump table vs. a series of comparisons for a switch statement?
Summary
Reading compiler output requires learning AT&T syntax (swapped operands, % registers, $ immediates, disp(base,index,scale) memory) and understanding that GCC at -O2 produces substantially different code than at -O0: no stack frames for simple leaf functions, CMOV instead of branches, strength reduction for multiply/divide, and occasional recognition of mathematical identities like sum-of-integers.
Compiler Explorer is your best tool: interactive, multi-compiler, multi-architecture, and instant. Use it to see what your C code actually compiles to, compare optimization levels, and understand why the assembly looks the way it does.