Chapter 39: Beyond Assembly
How Compilers Work (Now That You Understand the Output)
You have spent this entire book reading assembly. You know what good assembly looks like, what pathological assembly looks like, and why the compiler sometimes makes surprising choices. Now we can close the loop: let us understand how compilers produce the assembly you have been reading.
A compiler transforms source code through a series of representations, each one closer to machine code than the last. The pipeline:
1. Lexing and Parsing
The lexer (tokenizer) converts raw source text into a stream of tokens: keywords, identifiers, operators, literals. The parser builds an Abstract Syntax Tree (AST) from the token stream, representing the program's grammatical structure.
The AST for int x = a + b * 2 looks like:
Assignment
├── Variable: x (int)
└── Add
    ├── Variable: a
    └── Multiply
        ├── Variable: b
        └── Literal: 2
Nothing architecture-specific yet. The AST represents the source language, not the target.
2. Type Checking and Semantic Analysis
The compiler verifies that the program is semantically valid: types match, variables are declared, function calls have correct argument counts. This pass annotates the AST with type information.
After this phase, every node in the AST has a known type. The Add node above knows it is adding int + int → int. This type information guides code generation — int addition generates different code than float addition.
3. Intermediate Representation (IR)
The compiler lowers the AST to an IR — a language-independent, target-independent representation optimized for analysis and transformation.
LLVM IR (used by Clang, Rust's rustc, Swift, and many others):
; For: int add(int a, int b) { return a + b; }
define i32 @add(i32 %a, i32 %b) {
entry:
  %result = add nsw i32 %a, %b
  ret i32 %result
}
LLVM IR is typed, in SSA (Static Single Assignment) form (each variable assigned exactly once), and explicitly handles control flow with basic blocks and terminators.
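To make SSA concrete: a source variable assigned on both sides of a branch cannot keep one name in SSA form. Each assignment gets a fresh name, and the control-flow join selects between them with a phi node. A hand-written illustration (not compiler output):

```llvm
; Source: if (c) x = 1; else x = 2; return x;
define i32 @select(i1 %c) {
entry:
  br i1 %c, label %then, label %else
then:
  br label %merge
else:
  br label %merge
merge:
  ; phi: %x is 1 if control arrived from %then, 2 if from %else
  %x = phi i32 [ 1, %then ], [ 2, %else ]
  ret i32 %x
}
```

Because every value has exactly one definition, analyses like constant propagation become simple: follow the single def, no need to reason about which assignment "wins."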
GCC GIMPLE (GCC's primary IR):
add (int a, int b)
{
  int D.1234;
  D.1234 = a + b;
  return D.1234;
}
GIMPLE is simpler than LLVM IR but serves the same role: a clean, analyzable representation between the source and the final assembly.
4. Optimization Passes on the IR
The IR is where most compiler optimizations happen. Each optimization pass transforms the IR to be faster or smaller:
| Optimization | What it does | Example |
|---|---|---|
| Constant folding | Evaluate constant expressions at compile time | 2 + 3 → 5 |
| Dead code elimination | Remove unreachable code | Remove code after return |
| Inlining | Replace function call with function body | min(a,b) → a < b ? a : b |
| Loop unrolling | Repeat loop body to reduce branch overhead | Loop of 4 → 2 iterations of 2 |
| Strength reduction | Replace expensive ops with cheap ones | x * 4 → x << 2 |
| LICM | Hoist loop-invariant computations | Move y * 2 out of loop if y unchanged |
| Vectorization | Replace scalar loops with SIMD | 4× float add → vaddps |
| Register allocation | Assign IR values to physical registers | Minimize spills to stack |
| Instruction selection | Choose specific instructions | LEA for address + small offset computation |
LLVM has ~70 optimization passes at -O2. GCC has a similar count. At -O3, additional passes run including auto-vectorization and aggressive inlining.
5. Code Generation: IR → Assembly
The code generator maps IR constructs to target instructions:
Instruction selection (pattern matching): match IR patterns to instruction sequences. For LLVM, this uses the Target Description (.td) files — declarative specifications of instruction patterns.
For the IR addition:
%result = add nsw i32 %a, %b
Instruction selector produces: ADD W0, W0, W1 (ARM64) or mov eax, edi followed by add eax, esi (x86-64, with a in edi and b in esi per the System V ABI).
Register allocation (graph coloring): the IR uses an infinite number of "virtual registers." The register allocator maps these to the finite set of physical registers. Where virtual registers outnumber physical ones, values are "spilled" to the stack.
Graph coloring: build an interference graph (two virtual registers "interfere" if they are live at the same time). Color the graph with N colors (one per physical register). Variables with the same color can share a register; interference edges prevent same-color assignment.
Instruction scheduling: reorder instructions to avoid pipeline stalls. Move independent instructions to fill latency slots. The scheduler has detailed knowledge of the target pipeline's latency tables (from Agner Fog's work, among other sources).
6. Why Compilers Make the Choices They Do
Now that you understand the pipeline, specific compiler behaviors make sense:
- lea rax, [rdi + rdi*2 + 5] for x = y*3 + 5: LEA can compute base + index*scale + displacement in one instruction. The instruction selector recognizes this pattern.
- The function is 40 instructions at -O0 and 3 at -O2: at -O0, the compiler generates straight-line code from the AST. At -O2, constant propagation, dead code elimination, and inlining collapse it.
- xor eax, eax instead of mov eax, 0: the assembler produces a 2-byte encoding for xor eax, eax vs. 5 bytes for mov eax, 0. The instruction selector knows this.
- Tail call optimization: a recursive call that is the last operation in a function becomes a jmp instead of call; ret. The LLVM IR explicitly represents tail calls.
- Auto-vectorized loop with vpbroadcastd and vpaddd: the vectorizer detected an independent scalar loop and replaced it with 256-bit AVX2 vector operations. The IR transformation added a vector prologue for alignment checking and a scalar fallback for the remainder.
📊 C Comparison: godbolt.org lets you write C and see the assembly output for any compiler, architecture, and optimization level. This is the most direct way to understand what the compiler is doing: write one C function, compare -O0 vs -O2 vs -O3, and observe exactly which optimizations fired.
JIT Compilation: Generating Machine Code at Runtime
JIT (Just-In-Time) compilation generates machine code during program execution. Understanding JIT requires understanding what we have spent this entire book learning: how machine code looks and what the CPU expects.
Why JIT
JIT combines the portability of interpreted languages with the performance of compiled code:
- Ahead-of-time compilation: fast but requires knowing the target architecture at compile time
- Interpretation: portable but slow (10-100× slower than native code)
- JIT: runs interpreted initially, compiles hot paths to native code at runtime
JavaScript V8's TurboFan JIT, Java HotSpot, and Python's PyPy all use this approach.
Allocating Executable Memory
JIT compilers need pages that are both writable (to write the machine code) and executable (to run it). This is exactly the pattern flagged as suspicious in the malware chapter — which is why JIT compilers need special treatment under CET.
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

/* Allocate executable memory */
void *jit_alloc(size_t size) {
    /* Step 1: Allocate writable, non-executable memory */
    void *mem = mmap(NULL, size,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return mem == MAP_FAILED ? NULL : mem;
}

void jit_make_executable(void *mem, size_t size) {
    /* Step 2: Make it executable (remove write permission for security) */
    mprotect(mem, size, PROT_READ | PROT_EXEC);
}

/* Call the JIT-compiled function */
typedef int (*jit_func_t)(void);

int jit_call(void *code) {
    jit_func_t f = (jit_func_t)code;
    return f();
}
A Minimal JIT: Writing Machine Code Bytes
The simplest possible JIT: a function that returns the constant 42.
/* JIT: generate "mov eax, 42; ret" */
void jit_generate_return42(uint8_t *code) {
    /* mov eax, 42: B8 2A 00 00 00 (5 bytes) */
    code[0] = 0xB8;  /* MOV EAX, imm32 */
    code[1] = 42;    /* immediate low byte */
    code[2] = 0;
    code[3] = 0;
    code[4] = 0;     /* immediate high 3 bytes */
    /* ret: C3 (1 byte) */
    code[5] = 0xC3;
}

int main(void) {
    uint8_t *code = jit_alloc(64);
    jit_generate_return42(code);
    jit_make_executable(code, 64);
    int result = jit_call(code);
    printf("JIT returned: %d\n", result); /* prints: JIT returned: 42 */
    munmap(code, 64);
    return 0;
}
This works because the x86-64 instruction encoding for mov eax, 42; ret is exactly those 6 bytes. We are writing the bytes that the CPU will execute as instructions.
LLVM as a JIT Backend
Production JIT compilers do not write instruction bytes directly — they use a framework like LLVM's ORC JIT or GCC JIT:
/* Using LLVM's C API for JIT compilation.
   Requires LLVMLinkInMCJIT() and LLVMInitializeNativeTarget() /
   LLVMInitializeNativeAsmPrinter() to have been called first. */
LLVMModuleRef module = LLVMModuleCreateWithName("jit_module");
LLVMBuilderRef builder = LLVMCreateBuilder();

/* Create function: int add(int a, int b) */
LLVMTypeRef param_types[] = { LLVMInt32Type(), LLVMInt32Type() };
LLVMTypeRef func_type = LLVMFunctionType(LLVMInt32Type(), param_types, 2, 0);
LLVMValueRef func = LLVMAddFunction(module, "add", func_type);

/* Build function body */
LLVMBasicBlockRef entry = LLVMAppendBasicBlock(func, "entry");
LLVMPositionBuilderAtEnd(builder, entry);
LLVMValueRef result = LLVMBuildAdd(builder,
    LLVMGetParam(func, 0), LLVMGetParam(func, 1), "result");
LLVMBuildRet(builder, result);

/* JIT compile the module */
LLVMExecutionEngineRef engine;
char *error = NULL;
LLVMCreateJITCompilerForModule(&engine, module, 2, &error);

/* Get function pointer and call */
typedef int (*add_fn)(int, int);
add_fn add = (add_fn)LLVMGetFunctionAddress(engine, "add");
printf("add(3, 4) = %d\n", add(3, 4)); /* 7 */
LLVM handles all the instruction selection, register allocation, and encoding. You describe the computation in IR; LLVM generates the machine code.
WebAssembly: A Portable ISA for the Web (and Beyond)
WebAssembly (WASM) is a compact binary instruction format designed as a compilation target for high-performance web applications. It is, at its core, a virtual ISA — not for a real CPU, but for a virtual stack machine that all browsers implement.
Stack Machine Architecture
Unlike x86-64 (register machine) and ARM64 (register machine), WASM is a stack machine. Operations consume their operands from the stack and push results:
; WASM: compute a + b * 2
local.get $a ; push a
local.get $b ; push b
i32.const 2 ; push 2
i32.mul ; pop 2 and b, push b*2
i32.add ; pop b*2 and a, push a + b*2
Compare to x86-64:
; x86-64: same computation (a in edi, b in esi)
mov eax, edi     ; a → eax
imul esi, esi, 2 ; b * 2 → esi
add eax, esi     ; a + b*2 → eax
Stack machines have simpler code generation (no register allocation needed) and simpler verification. The tradeoff: every value goes through the stack, which is why WASM needs a JIT to run efficiently.
WASM Security Model
WASM is designed for sandboxed execution. Its security guarantees:
- Type safety: WASM is strongly typed; no arbitrary casts
- Memory safety: all memory accesses are bounds-checked against a linear memory region
- Control flow: only jumps to declared labels; no computed jumps to arbitrary addresses
- Isolation: each WASM module has its own linear memory; no access to other modules' memory
These guarantees allow browsers to run WASM from untrusted sources without sandboxing the entire process. WASM cannot read the browser's memory or bypass ASLR — the type system prevents it.
WASM → Native via JIT
When a browser executes WASM, it JIT-compiles it to native code. V8 (Chrome) has two WASM JIT tiers:
1. Liftoff (baseline): fast single-pass compilation, less optimized
2. TurboFan (optimizing): slower compilation, highly optimized, used for hot functions
The JIT takes the WASM stack-machine IR and produces x86-64 (or ARM64, RISC-V, etc.) machine code. The bounds checks for linear memory accesses can be implemented with guard pages (no explicit check instruction needed) or explicit bounds check instructions.
WASM Beyond the Browser
The WASI (WebAssembly System Interface) standardizes WASM as a portable application format. A WASM binary compiled for WASI can run on:
- Any browser
- Linux, macOS, Windows via the wasmtime or wasmer runtime
- Embedded systems with WASM interpreters
- Edge computing platforms (Cloudflare Workers, Fastly Compute@Edge)
This is the vision: compile once (to WASM), run anywhere — with better portability than Java and better performance than JavaScript.
RISC-V: The Open ISA
RISC-V (pronounced "risk five") is the most important new ISA in decades. It is significant not for technical reasons alone — other clean ISAs exist — but because it is open: anyone can build a RISC-V CPU without paying license fees, filing paperwork with ARM or Intel, or signing confidentiality agreements.
The Design
RISC-V is a clean RISC ISA with a 32-bit base instruction set (RV32I) and extensions defined as standard modules:
| Extension | Module | Description |
|---|---|---|
| M | Integer Multiply | MUL, DIV, REM |
| A | Atomic | LR/SC, AMO operations |
| F | Single Float | 32-bit FP, F registers |
| D | Double Float | 64-bit FP, extends F |
| C | Compressed | 16-bit compressed instructions |
| V | Vector | SIMD vector operations |
RV64GC is the common profile: the 64-bit base RV64I plus G (shorthand for the IMAFD extensions together with Zicsr and Zifencei) plus C (compressed). Linux runs on RV64GC.
Register Set
RISC-V has 32 integer registers (x0-x31) and 32 floating-point registers (f0-f31):
| Register | ABI Name | Role |
|---|---|---|
| x0 | zero | Hardwired zero |
| x1 | ra | Return address |
| x2 | sp | Stack pointer |
| x3 | gp | Global pointer |
| x4 | tp | Thread pointer |
| x5-x7 | t0-t2 | Temporaries |
| x8 | s0/fp | Saved / frame pointer |
| x9 | s1 | Saved |
| x10-x11 | a0-a1 | Arguments / return values |
| x12-x17 | a2-a7 | Arguments |
| x18-x27 | s2-s11 | Saved |
| x28-x31 | t3-t6 | Temporaries |
Compare to x86-64 (16 GP registers with implicit roles) and ARM64 (31 GP registers, x30=LR).
RISC-V Calling Convention
First 8 integer arguments in a0-a7. Return value in a0 (and a1 for 128-bit). Callee-saved: s0-s11. Very similar to ARM64's AAPCS64 — both are clean RISC designs that avoid implicit register roles.
Hello World in RISC-V Assembly
# RISC-V 64-bit hello world (GNU assembler syntax)
# Assemble: riscv64-linux-gnu-as hello.s -o hello.o
# Link:     riscv64-linux-gnu-ld hello.o -o hello
# Run:      qemu-riscv64 ./hello

    .section .data
msg:    .string "Hello from RISC-V!\n"
msg_len = . - msg

    .section .text
    .global _start
_start:
    li a7, 64        # syscall 64 = write
    li a0, 1         # fd = 1 (stdout)
    la a1, msg       # buffer = &msg
    li a2, msg_len   # length
    ecall            # execute syscall

    li a7, 93        # syscall 93 = exit
    li a0, 0         # exit code = 0
    ecall
Compare to x86-64 (Chapter 1's hello world): the structure is identical — load syscall number, load arguments, execute syscall instruction (syscall on x86-64, ecall on RISC-V). The register names differ; the concept is the same.
Running RISC-V in QEMU
# Install RISC-V tools (Ubuntu/Debian):
sudo apt install gcc-riscv64-linux-gnu qemu-user
# Compile C for RISC-V:
riscv64-linux-gnu-gcc -O2 hello.c -o hello_riscv
# Run on RISC-V emulator:
qemu-riscv64 ./hello_riscv
QEMU's user-mode emulation (qemu-riscv64) translates RISC-V instructions to your host architecture at runtime. It is not as fast as native RISC-V hardware but is convenient for testing.
RISC-V Hardware
Commercially available RISC-V hardware:
- SiFive HiFive Unmatched: developer board with U74 cores, Linux-capable
- StarFive VisionFive 2: low-cost Linux SBC (~$70)
- Kendryte K210: microcontroller for embedded systems
- China's domestic push: Alibaba's XuanTie C910 core, used in domestic server chips
The geopolitical significance: China has heavily invested in RISC-V as a path to domestic semiconductor independence. This has accelerated RISC-V hardware development significantly.
GPU "Assembly": Brief Overview
For completeness, GPUs have their own instruction sets — though "GPU assembly" is a different beast entirely.
CUDA PTX and SASS
NVIDIA's CUDA compilation pipeline:
- CUDA C → PTX (Parallel Thread Execution): NVIDIA's portable virtual ISA
- PTX → SASS (Shader ASSembly): the actual hardware ISA, different per GPU microarchitecture
PTX is to SASS roughly as LLVM IR is to x86-64 assembly: a stable intermediate that NVIDIA's driver JITs to hardware instructions.
// PTX: multiply-add (a * b + c), as a device function
.func (.reg .f32 r) example (.reg .f32 a, .reg .f32 b, .reg .f32 c)
{
    fma.rn.f32 r, a, b, c;  // fused multiply-add, round-to-nearest-even
    ret;
}
SIMT: Single Instruction Multiple Threads
The most fundamental difference between GPU and CPU "assembly": GPUs execute the same instruction across thousands of threads simultaneously (SIMT — Single Instruction Multiple Thread). A single SASS instruction in one warp (32 threads on NVIDIA GPUs) causes 32 execution units to do the same operation on 32 different data values.
This is SIMD taken to an extreme. x86-64 SIMD (AVX-512) works on 512-bit vectors — 16 floats at once. An NVIDIA GPU warp operates on 32 floats — but across 32 concurrent threads that can diverge (take different branches), which then serialize.
Understanding x86-64 SIMD from Part VI gives you the conceptual foundation to understand GPU execution. The key difference is the degree of parallelism and the programming model.
Other Architectures (Brief)
MIPS: the classic RISC ISA used in computer architecture education (Patterson & Hennessy textbook). Fixed 32-bit instructions, 32 GP registers. Still used in embedded systems (routers, etc.) though ARM has largely displaced it.
SPARC: Sun Microsystems' RISC architecture, now Oracle. Notable for its register windows (a novel mechanism for fast function calls). Declining market share; still used in some HPC and embedded contexts.
IBM POWER: the high-performance RISC architecture behind PowerPC (gone from desktops since Apple's 2006 move to Intel, though it survives in embedded use) and IBM's POWER servers. Still relevant in HPC (the Summit and Sierra supercomputers used POWER9). Notable for out-of-order execution at extreme scale.
Motorola 68000: the classic 16/32-bit processor from the 1980s, used in the original Macintosh, Amiga, and Atari ST. Still found in embedded contexts. Its clean CISC design influenced later architectures. Has a dedicated retrocomputing community.
The Future of Computing
The single-core performance scaling that characterized the 2000s is over. The future is about:
Heterogeneous computing: CPU + GPU + NPU (Neural Processing Unit) + DSP + specialized accelerators, all sharing memory. Programming these systems requires understanding each unit's capabilities and the cost of data movement between them.
Domain-specific architectures (DSAs): Google's TPU (Tensor Processing Unit) for neural network inference, Apple's Neural Engine, Intel's Nervana, various edge AI chips. These are not general-purpose processors — they execute specific computational patterns extremely efficiently. The "assembly" for a TPU is XLA IR; for Apple's Neural Engine, it is Core ML compiled representations.
Near-memory computing: moving computation closer to DRAM (Processing-In-Memory) to reduce the memory bandwidth bottleneck. Research prototypes demonstrate 10-100× energy efficiency improvements for memory-bound workloads.
RISC-V proliferation: as custom silicon becomes more accessible (chiplets, open-source PDKs), RISC-V enables domain-specific cores without licensing friction.
Where Assembly Knowledge Fits
Assembly is not going away. But the question of "where does it matter" shifts:
- x86-64: still dominant in servers, desktops, and laptops. Assembly knowledge is directly applicable.
- ARM64: dominant in mobile, increasingly in servers (AWS Graviton, Apple M-series). Assembly directly applicable.
- RISC-V: growing in embedded and custom silicon. Transferable from ARM64 knowledge.
- GPU "assembly": PTX for CUDA work; understanding SIMT principles matters for GPU optimization.
- DSA "assembly": compilers handle DSA programming, but understanding the computation model matters for effective use.
The skill that assembly teaches — understanding the machine without abstractions, thinking precisely about what the CPU does at each step — transfers to every future architecture, even those that have not been designed yet.
🔄 Check Your Understanding:
1. In the compiler pipeline, what does SSA (Static Single Assignment) form mean?
2. Why is register allocation modeled as a graph coloring problem?
3. What does a JIT compiler need to do before executing generated machine code on Linux?
4. What is the security model that makes WASM safe to run in a browser?
5. Why is RISC-V considered historically significant beyond its technical design?
Summary
Compilers transform source code through lexing, parsing, IR construction, optimization, instruction selection, register allocation, and instruction scheduling. Understanding this pipeline explains why the compiler makes the choices visible in disassembly. JIT compilers generate machine code at runtime by writing instruction bytes to executable memory (or via LLVM ORC). WebAssembly is a stack-machine portable ISA with a security model built for untrusted code execution in browsers and beyond. RISC-V is the open ISA that is changing the semiconductor landscape by enabling custom silicon without licensing friction. GPU computing uses SIMT execution at a scale that dwarfs CPU SIMD. The future is heterogeneous — but assembly knowledge is how you understand any architecture, present or future, without illusions.