Case Study 21-2: Compiler Explorer Workshop — Five C Functions, Five Architectures
Objective
Use Compiler Explorer (godbolt.org) to systematically compare how GCC, Clang, and MSVC compile five representative C functions for x86-64, ARM64, and RISC-V. Each comparison teaches something different about how ISA design affects code generation.
Setup: Compiler Explorer Settings
Navigate to https://godbolt.org. For each function:
- Paste the C code in the left panel
- Add compiler "x86-64 gcc 13.2" with flags "-O2"
- Add compiler "ARM64 GCC 13.2 (aarch64-linux-gnu)" with flags "-O2"
- Add compiler "RISC-V (64-bit) GCC 13.2" with flags "-O2"
- Note on syntax: the x86-64 listings below use AT&T syntax (GCC's default). To see Intel syntax instead, add -masm=intel via the options gear
Function 1: Branchless Absolute Value
int absolute_value(int x) {
return x < 0 ? -x : x;
}
Expected x86-64 Output
absolute_value:
movl %edi, %eax
negl %eax
testl %edi, %edi
cmovns %edi, %eax ; CMOVNS: conditional move if not sign (x >= 0)
ret
Observation: GCC uses CMOVNS (conditional move if the sign flag is clear). No branch at all. The compiler computes both -x and x, then selects the right one with CMOVNS.
Expected ARM64 Output
absolute_value:
CMP W0, #0
CNEG W0, W0, MI ; CNEG: conditional negate if minus (N flag set)
RET
ARM64 has a dedicated CNEG (conditional negate) instruction that directly implements "negate if negative." One instruction for the entire computation after the compare.
Key difference: ARM64's CNEG is more direct than x86-64's CMOVNS pattern.
Expected RISC-V Output
absolute_value:
srli a5,a0,31 ; a5 = sign bit (x >> 31, unsigned)
neg a4,a0 ; a4 = -x
beqz a5,.Lpositive ; if sign bit == 0 (x >= 0), use x
mv a0,a4 ; else use -x
.Lpositive:
ret
RISC-V has no conditional move in the base ISA, so the compiler here uses a branch. Alternatives include reordering the compare (sub a4, x0, a0; bge a0, x0, ...; mv a0, a4) or the branchless sign-mask trick: extract the sign mask with an arithmetic shift, XOR, then subtract (sraiw a5,a0,31; xor a0,a0,a5; subw a0,a0,a5). Different compilers, and even different GCC versions, make different choices.
Lesson: ARM64's CNEG and x86-64's CMOVNS both avoid branches. RISC-V requires a branch (or a multi-instruction branchless sequence via sign-mask arithmetic).
Function 2: Multiply by Constant
uint64_t multiply_by_37(uint64_t x) {
return x * 37;
}
x86-64
multiply_by_37:
imulq $37, %rdi, %rax ; single IMUL with immediate
ret
x86-64 has IMUL reg, reg, imm — three-operand multiply with immediate. One instruction.
ARM64
multiply_by_37:
MOV X1, #37
MUL X0, X0, X1
RET
ARM64's MUL requires two register operands. 37 must be loaded first.
OR — GCC might recognize 37 = 4 × (8 + 1) + 1 and use the barrel shifter:
multiply_by_37:
ADD X1, X0, X0, LSL #3 ; X1 = x + x*8 = 9x
ADD X0, X0, X1, LSL #2 ; X0 = x + 9x*4 = x + 36x = 37x
RET
Whether GCC emits this shift-add pair or MOV + MUL depends on the target's cost model; for powers of 2 and small shift-add sums, the barrel-shifter expansion is common.
RISC-V
multiply_by_37:
li a5,37 ; load immediate 37
mul a0,a0,a5 ; multiply
ret
Similar to ARM64 — needs to load the constant first, then multiply.
Lesson: x86-64's three-operand IMUL reg, reg, imm is uniquely powerful for multiplying by small constants.
Function 3: Division by Constant
int32_t divide_by_7(int32_t x) {
return x / 7;
}
All three compilers should emit the multiply-high trick. This is the same across architectures.
x86-64
divide_by_7:
movl %edi, %eax
movl $-1840700269, %edx ; magic number (0x92492493)
imull %edx ; EDX:EAX = x * magic
addl %edi, %edx ; high half += x (needed because the magic is negative)
movl %edx, %eax
sarl $2, %eax ; shift to get quotient
sarl $31, %edi ; extract sign (0 or -1)
subl %edi, %eax ; correct for negative dividends
ret
ARM64
divide_by_7:
MOV W1, #0x2493 ; build magic 0x92492493 (MOVZ + MOVK)
MOVK W1, #0x9249, LSL #16
SMULL X1, W0, W1 ; 64-bit signed product of 32-bit inputs
LSR X1, X1, #32 ; high 32 bits of the product
ADD W1, W1, W0 ; += x (needed because the magic is negative)
ASR W1, W1, #2 ; shift to get quotient
SUB W0, W1, W0, ASR #31 ; subtract sign(x): correct for negative dividends
RET
ARM64's SMULL (signed multiply long: 32×32→64) makes this slightly cleaner — no need to use the high half of a 64×64 multiply.
RISC-V
divide_by_7:
li a5,-1840700269 ; magic number (0x92492493, sign-extended)
mul a5,a5,a0 ; full 64-bit product (both operands fit in 32 bits)
srai a5,a5,32 ; high 32 bits of the product
addw a5,a5,a0 ; += x (needed because the magic is negative)
sraiw a5,a5,2 ; shift to get quotient
sraiw a0,a0,31 ; extract sign (0 or -1)
subw a0,a5,a0 ; correct for negative dividends
ret
RISC-V has MULH (multiply high, signed), MULHU (unsigned), and MULHSU (signed × unsigned), which give the high half of a full 64×64-bit product directly; for 32-bit operands the entire product also fits in a single 64-bit MUL result, so a plain multiply plus an arithmetic shift works too.
Lesson: All three architectures implement division-by-constant as multiply-high + shift. The multiply instruction differs (IMUL/SMULL/MULH), but the algorithm is identical.
Function 4: Population Count (Count Set Bits)
int popcount_c(uint64_t x) {
return __builtin_popcountll(x);
}
x86-64
popcount_c:
popcntq %rdi, %rax ; POPCNT instruction (1 instruction!)
ret
x86-64 has a dedicated POPCNT instruction (SSE4.2+). One instruction.
POPCNT has shipped since Nehalem (alongside SSE4.2), so any -march that implies it (e.g. -march=x86-64-v2, or -march=native on recent hardware) or an explicit -mpopcnt gets this output. Without the feature flag, GCC emits a call to the libgcc routine __popcountdi2 instead.
ARM64
popcount_c:
FMOV D0, X0 ; move integer to FP register
CNT V0.8B, V0.8B ; count set bits in each of 8 bytes
ADDV B0, V0.8B ; horizontal sum of the 8 byte-popcounts
UMOV W0, V0.B[0] ; move result back to GP register
RET
ARM64 doesn't have a scalar POPCNT instruction, but NEON has CNT (count bits) for byte vectors. The trick: move the 64-bit value to a NEON register, count bits in each byte (CNT V0.8B), sum the byte counts (ADDV), extract result.
This is 4 instructions but executes efficiently due to NEON's high throughput.
RISC-V (without B extension)
popcount_c:
; RISC-V base ISA has no popcount. Software implementation:
li a5,0x5555555555555555 ; alternating 1s
srli a4,a0,1
and a5,a5,a4
sub a0,a0,a5
; ... (many more instructions for Hamming weight)
RISC-V's B extension ("Bit Manipulation"), specifically its Zbb subset, adds CPOP (count population). Without it, popcount is software only.
Lesson: Hardware-specific instructions (POPCNT, NEON CNT) provide huge code density advantages. This is why __builtin_popcountll in C is not trivially portable — the generated code differs dramatically by architecture.
Function 5: 128-bit Addition
#include <stdint.h>
typedef __uint128_t uint128_t;
uint128_t add128(uint128_t a, uint128_t b) {
return a + b;
}
x86-64
Calling convention: GCC passes and returns __uint128_t in register pairs, not through a hidden pointer. Here a arrives in RDI (low half) and RSI (high half), b in RDX (low) and RCX (high), and the result is returned in RAX (low) and RDX (high):
add128:
addq %rdx, %rdi ; low64 = a_low + b_low, sets carry
adcq %rcx, %rsi ; high64 = a_high + b_high + carry
movq %rdi, %rax ; return low
movq %rsi, %rdx ; return high
ret
ADDQ + ADCQ (add with carry) — the x86-64 carry flag propagates the 65th bit from the low addition to the high addition.
ARM64
add128:
ADDS X0, X0, X2 ; low64 = a_low + b_low, sets C flag
ADC X1, X1, X3 ; high64 = a_high + b_high + C
RET
ARM64's ADDS (add and set flags) + ADC (add with carry) is the direct equivalent. Clean and equivalent structure.
RISC-V
add128:
add a4,a0,a2 ; low = a_lo + b_lo
sltu a5,a4,a0 ; carry = (result < a_lo) ? 1 : 0 (unsigned overflow)
add a1,a1,a3 ; high += b_hi
add a1,a1,a5 ; high += carry
mv a0,a4
ret
RISC-V has no carry flag! It emulates carry with an unsigned comparison: if a + b < a (for unsigned), then overflow/carry occurred. This is the standard technique for multi-precision arithmetic on architectures without carry flags.
Lesson: Carry-propagation for multi-precision arithmetic is fundamental. x86-64 (ADCQ) and ARM64 (ADC) have dedicated carry-add instructions. RISC-V uses conditional comparison to emulate carry — more instructions but the same logical result.
Summary of Lessons from the Five Functions
Cross-Architecture Compiler Output Summary
═══════════════════════════════════════════════════════════════════════════
Function x86-64 approach ARM64 approach RISC-V approach
───────────────────────────────────────────────────────────────────────────
Abs value CMOVNS (branchless) CNEG (branchless) Branch / mask trick
Mul × 37 IMUL reg,reg,imm MOV + MUL LI + MUL
Div by 7 IMUL high trick SMULL trick MULH trick
Popcount POPCNT (1 instr) NEON CNT (4 instr) Software (12+ instr)
128-bit add ADDQ + ADCQ ADDS + ADC ADD + SLTU + ADD×2
═══════════════════════════════════════════════════════════════════════════
The big takeaways:
1. CMOV vs. branch: x86-64 and ARM64 both avoid branches for branchless patterns; RISC-V falls back to a branch or a multi-instruction sign-mask sequence
2. Immediate multiply: x86-64's IMUL reg, reg, imm is uniquely powerful
3. The division-by-constant trick works identically on all three — multiply-high + shift
4. Popcount: x86-64 wins with POPCNT; ARM64 wins with NEON CNT; RISC-V needs the B extension
5. 128-bit carry: ARM64 and x86-64 have dedicated carry instructions; RISC-V uses SLTU emulation