Case Study 21-2: Compiler Explorer Workshop — Five C Functions, Three Architectures

Objective

Use Compiler Explorer (godbolt.org) to systematically compare how GCC compiles five representative C functions for x86-64, ARM64, and RISC-V (Clang and MSVC panes can be added the same way for further comparison). Each comparison teaches something different about how ISA design affects code generation.


Setup: Compiler Explorer Settings

Navigate to https://godbolt.org. For each function:

  1. Paste the C code in the left panel
  2. Add compiler "x86-64 gcc 13.2" with flags "-O2"
  3. Add compiler "ARM64 GCC 13.2 (aarch64-linux-gnu)" with flags "-O2"
  4. Add compiler "RISC-V (64-bit) GCC 13.2" with flags "-O2"
  5. Leave the x86-64 output in GCC's default AT&T syntax to match the listings below (Intel syntax is available by adding -masm=intel via the options gear)

Function 1: Branchless Absolute Value

int absolute_value(int x) {
    return x < 0 ? -x : x;
}

Expected x86-64 Output

absolute_value:
    movl    %edi, %eax
    negl    %eax
    testl   %edi, %edi
    cmovns  %edi, %eax    ; CMOVNS: conditional move if not sign (x >= 0)
    ret

Observation: GCC uses CMOVNS (conditional move if not signed/negative). No branch at all. The compiler computes both -x and x speculatively, then selects the right one with CMOVNS.

Expected ARM64 Output

absolute_value:
    CMP  W0, #0
    CNEG W0, W0, MI      ; CNEG: conditional negate if minus (N flag set)
    RET

ARM64 has a dedicated CNEG (conditional negate) instruction that directly implements "negate if negative." One instruction for the entire computation after the compare.

Key difference: ARM64's CNEG is more direct than x86-64's CMOVNS pattern.

Expected RISC-V Output

absolute_value:
    srli    a5,a0,31      ; a5 = sign bit (x >> 31, unsigned)
    neg     a4,a0         ; a4 = -x
    beqz    a5,.Lpositive ; if sign bit == 0 (x >= 0), use x
    mv      a0,a4         ; else use -x
.Lpositive:
    ret

RISC-V's base ISA has no conditional move, so the compiler falls back to a branch here. An alternative branch-based ordering (negate first, then branch past the mv) is possible, as is the fully branchless mask trick: extract the sign, XOR, subtract. Different compiler versions make different choices.

Lesson: ARM64's CNEG and x86-64's CMOVNS both avoid branches. RISC-V requires a branch (or a multi-instruction branchless sequence via sign-mask arithmetic).


Function 2: Multiply by Constant

uint64_t multiply_by_37(uint64_t x) {
    return x * 37;
}

x86-64

multiply_by_37:
    imulq   $37, %rdi, %rax   ; single IMUL with immediate
    ret

x86-64 has IMUL reg, reg, imm — three-operand multiply with immediate. One instruction.

ARM64

multiply_by_37:
    MOV  X1, #37
    MUL  X0, X0, X1
    RET

ARM64's MUL requires two register operands. 37 must be loaded first.

OR — GCC might recognize 37 = 4×9 + 1 (with 9 = 8 + 1) and use the barrel shifter:

multiply_by_37:
    ADD  X1, X0, X0, LSL #3   ; X1 = x + 8x = 9x
    ADD  X0, X0, X1, LSL #2   ; X0 = x + 4×9x = 37x
    RET

Whether GCC emits this shift-add sequence or MOV + MUL depends on the target's cost model; for powers of 2 and small shift-add sums, the barrel shifter expansion is common.

RISC-V

multiply_by_37:
    li      a5,37          ; load immediate 37
    mul     a0,a0,a5       ; multiply
    ret

Similar to ARM64 — needs to load the constant first, then multiply.

Lesson: x86-64's three-operand IMUL reg, reg, imm is uniquely powerful for multiplying by small constants.


Function 3: Division by Constant

int32_t divide_by_7(int32_t x) {
    return x / 7;
}

All three targets should avoid a hardware divide and emit the multiply-high reciprocal trick. The algorithm is the same across architectures; only the instructions differ.

x86-64

divide_by_7:
    movl    %edi, %eax
    movl    $-1840700269, %edx   ; magic number 0x92492493
    imull   %edx                 ; EDX:EAX = x * magic
    addl    %edi, %edx           ; magic is negative: add the dividend back
    movl    %edx, %eax
    sarl    $2, %eax             ; shift quotient
    sarl    $31, %edi            ; extract sign
    subl    %edi, %eax           ; correct for negative dividends
    ret

ARM64

divide_by_7:
    MOV   W1, #0x2493
    MOVK  W1, #0x9249, LSL #16  ; W1 = magic 0x92492493
    SMULL X1, W0, W1            ; 64-bit signed product of 32-bit inputs
    LSR   X1, X1, #32           ; high half of the product
    ADD   W1, W1, W0            ; magic is negative: add the dividend back
    ASR   W1, W1, #2            ; shift quotient
    SUB   W0, W1, W0, ASR #31   ; subtract sign to correct negatives
    RET

ARM64's SMULL (signed multiply long: 32×32→64) makes this slightly cleaner, with no need for the high half of a 64×64 multiply, and the final SUB folds the sign extraction (W0, ASR #31) into a shifted operand.

RISC-V

divide_by_7:
    li      a5,2454267027   ; magic = ceil(2^34 / 7); li is a pseudo-op
    mul     a5,a0,a5        ; full 64-bit product (a0 holds a sign-extended int)
    srli    a4,a0,63        ; sign bit of x
    srai    a5,a5,34        ; arithmetic shift: quotient estimate
    addw    a0,a5,a4        ; add 1 back for negative dividends
    ret

Because a 32-bit int arrives sign-extended in a 64-bit register, a plain MUL already yields the full 64-bit product. (Representative output; GCC's exact magic/shift pair may differ.) For 64-bit operands RISC-V uses MULH (multiply high, signed); MULHU (unsigned) and MULHSU (signed × unsigned) round out the family. These directly give the high half of a 128-bit product.

Lesson: All three architectures implement division-by-constant as multiply-high + shift. The multiply instruction differs (IMUL/SMULL/MULH), but the algorithm is identical.


Function 4: Population Count (Count Set Bits)

int popcount_c(uint64_t x) {
    return __builtin_popcountll(x);
}

x86-64

popcount_c:
    popcntq %rdi, %rax    ; POPCNT instruction (1 instruction!)
    ret

x86-64 has a dedicated POPCNT instruction, introduced with Nehalem alongside SSE4.2. One instruction.

Compile with -mpopcnt, -march=x86-64-v2, or -march=native (on any CPU from Nehalem onward) to get it. Without the feature flag, GCC emits a call to the libgcc software fallback __popcountdi2.

ARM64

popcount_c:
    FMOV D0, X0          ; move integer to FP register
    CNT  V0.8B, V0.8B    ; count set bits in each of 8 bytes
    ADDV B0, V0.8B       ; horizontal sum of the 8 byte-popcounts
    UMOV W0, V0.B[0]     ; move result back to GP register
    RET

ARM64 doesn't have a scalar POPCNT instruction, but NEON has CNT (count bits) for byte vectors. The trick: move the 64-bit value to a NEON register, count bits in each byte (CNT V0.8B), sum the byte counts (ADDV), extract result.

This is four instructions, and the moves between general-purpose and SIMD registers add some latency, but it is still far cheaper than a software bit-counting loop.

RISC-V (without B extension)

popcount_c:
    ; RISC-V base ISA has no popcount. Software implementation:
    li      a5,0x5555555555555555   ; alternating 1s
    srli    a4,a0,1
    and     a5,a5,a4
    sub     a0,a0,a5
    ; ... (many more instructions for Hamming weight)

RISC-V's ratified bit-manipulation ("B") extension adds CPOP (count population) in its Zbb subset; compile with -march=rv64gc_zbb to get it. Without it, software only.

Lesson: Hardware-specific instructions (POPCNT, NEON CNT) provide huge code density advantages. This is why __builtin_popcountll in C is not trivially portable — the generated code differs dramatically by architecture.


Function 5: 128-bit Addition

#include <stdint.h>
typedef __uint128_t uint128_t;

uint128_t add128(uint128_t a, uint128_t b) {
    return a + b;
}

x86-64

Under the x86-64 System V ABI, a __uint128_t is passed in two consecutive integer registers and returned in RDX:RAX. Here a arrives in RDI (low) and RSI (high), b in RDX (low) and RCX (high):

add128:
    addq    %rdx, %rdi    ; low64 = a_low + b_low, sets carry
    adcq    %rcx, %rsi    ; high64 = a_high + b_high + carry
    movq    %rdi, %rax    ; return low
    movq    %rsi, %rdx    ; return high
    ret

ADDQ + ADCQ (add with carry) — the x86-64 carry flag propagates the 65th bit from the low addition to the high addition.

ARM64

add128:
    ADDS X0, X0, X2     ; low64 = a_low + b_low, sets C flag
    ADC  X1, X1, X3     ; high64 = a_high + b_high + C
    RET

ARM64's ADDS (add and set flags) + ADC (add with carry) is the direct equivalent, with the same clean two-instruction structure.

RISC-V

add128:
    add     a4,a0,a2     ; low = a_lo + b_lo
    sltu    a5,a4,a0     ; carry = (result < a_lo) ? 1 : 0  (unsigned overflow)
    add     a1,a1,a3     ; high += b_hi
    add     a1,a1,a5     ; high += carry
    mv      a0,a4
    ret

RISC-V has no carry flag! It emulates carry with an unsigned comparison: if a + b < a (for unsigned), then overflow/carry occurred. This is the standard technique for multi-precision arithmetic on architectures without carry flags.

Lesson: Carry-propagation for multi-precision arithmetic is fundamental. x86-64 (ADCQ) and ARM64 (ADC) have dedicated carry-add instructions. RISC-V uses conditional comparison to emulate carry — more instructions but the same logical result.


Summary of Lessons from the Five Functions

Cross-Architecture Compiler Output Summary
═══════════════════════════════════════════════════════════════════════════
Function          x86-64 approach    ARM64 approach      RISC-V approach
───────────────────────────────────────────────────────────────────────────
Abs value         CMOVNS (no branch) CNEG (no branch)    Branch or mask trick
Mul × 37          IMUL reg,reg,imm   MOV + MUL            LI + MUL
Div by 7          IMUL high trick    SMULL trick          MULH trick
Popcount          POPCNT (1 instr)   NEON CNT (4 instr)  Software (12+ instr)
128-bit add       ADDQ + ADCQ        ADDS + ADC           ADD + SLTU + ADD×2
═══════════════════════════════════════════════════════════════════════════

The big takeaways:

  1. CMOV vs. branch: x86-64 and ARM64 both avoid branches for branchless patterns; RISC-V often uses branches
  2. Immediate multiply: x86-64's IMUL reg, reg, imm is uniquely powerful
  3. The division-by-constant trick works identically on all three — multiply-high + shift
  4. Popcount: x86-64 wins with POPCNT; ARM64 wins with NEON CNT; RISC-V needs the B extension
  5. 128-bit carry: ARM64 and x86-64 have dedicated carry instructions; RISC-V uses SLTU emulation