Chapter 19: x86-64 vs. ARM64 Comparison

Two Architectures, One Era

In 2026, you are writing code that will run on machines with two completely different instruction set architectures — and both are everywhere. Your cloud server is probably ARM64 (AWS Graviton, Azure Cobalt, GCP Axion). Your laptop might be x86-64 (Intel or AMD) or ARM64 (Apple M-series, Qualcomm Snapdragon). Your phone is ARM64. Your embedded system might be either.

This isn't a historical curiosity or a niche academic comparison. Understanding both architectures and the tradeoffs between them is practical systems knowledge for 2026.

This chapter puts everything side by side.


19.1 Architectural Philosophy: A Revisit

We've spent three chapters learning ARM64's philosophy from the inside. Now let's compare it directly to x86-64.

The CISC Worldview (x86-64)

x86-64 evolved from Intel 8086 (1978) through decades of backward-compatible extensions. The philosophy was: make each instruction do as much work as possible. Memory operands in arithmetic instructions. String operations that move entire buffers. Complex addressing modes with scale factors.

The result: a programmer (or compiler) can express a complete operation in fewer instructions. The processor is more complex because it must support this richer instruction set, but code is more compact.

; x86-64: three operations in one instruction
IMUL RAX, [RBX + RCX*8 + 24]   ; RAX = RAX * mem[RBX + RCX*8 + 24]
; Scaled address calculation, load, and multiply in ONE instruction.
; The decoder figures out the rest.

The RISC Worldview (ARM64)

ARM64's lineage begins with Acorn's ARM1 (1985); the 64-bit AArch64 ISA arrived with ARMv8-A in 2011. The design philosophy: simple instructions, regular encoding, let the compiler generate more of them. Nothing touches memory except loads and stores (LDR/STR). Every instruction is 4 bytes. No memory operands in arithmetic instructions.

The result: more instructions to express the same operation, but each instruction is simpler to decode, pipeline, and execute.

// ARM64: four explicit instructions where x86-64 used one
LSL  X4, X3, #3              // X4 = X3 * 8
ADD  X4, X2, X4              // X4 = X2 + X3*8  (address calculation)
LDR  X5, [X4, #24]           // load from [X4 + 24]
MUL  X0, X0, X5              // multiply

"CISC vs. RISC" Is a Spectrum, Not a Binary

The comparison isn't as clean as textbooks make it sound:

  1. Modern x86-64 CPUs translate CISC to RISC internally. Intel's Core microarchitecture breaks x86-64 CISC instructions into micro-operations (µops). An IMUL RAX, [RBX + RCX*8 + 24] might decompose into 2-3 µops: address calculation, load, multiply. The CPU executes the µops as RISC-like operations.

  2. ARM64 has complex features too. The barrel shifter (inline shifts), the conditional execution variants (CSEL, CSINC), LDP/STP (two-register memory operations), and NEON SIMD instructions are all multi-operation instructions in some sense.

  3. The real difference is the ISA encoding. x86-64 exposes the complexity to the programmer and the assembler. ARM64 hides it in higher-level abstractions (SIMD, compiler intrinsics) while keeping the base instruction set clean.


19.2 Code Density Comparison

A common claim is "x86-64 code is denser than ARM64 code." True in terms of bytes per instruction, but the actual binary sizes for real programs are within 5-20% of each other. Let's look at real examples.

Example: Simple Addition

Instruction       x86-64 bytes    ARM64 bytes
ADD reg, reg      2-3             4
ADD reg, imm8     3               4
ADD reg, imm32    6               4 (if imm fits in 12 bits)
ADD reg, [mem]    3-7             4 + 4 (LDR) = 8

For register-to-register operations, x86-64 is sometimes smaller. For memory operands, the x86-64 advantage grows (one instruction vs. load-then-operate). For large immediates, ARM64 needs a MOVZ followed by up to three MOVK instructions.

Example: strcmp (a real function)

musl libc strcmp implementations:

int strcmp(const char *l, const char *r) {
    for (; *l == *r && *l; l++, r++);
    return *(unsigned char *)l - *(unsigned char *)r;
}

x86-64 musl strcmp: approximately 15-20 instructions, 40-50 bytes. ARM64 musl strcmp: approximately 20-25 instructions, 80-100 bytes.

ARM64 strcmp has roughly 1.3× the instruction count and about 2× the bytes, because ARM64 can't fold the memory load into the comparison instruction.

The Caches Even It Out

ARM64 has a larger code footprint, but:

  1. ARM64 instruction caches are sized to match — a Cortex-A72 has a 48KB I-cache vs. 32KB for a comparable Intel core
  2. ARM64's fixed-width instructions are faster to decode
  3. The branch predictor can predict further ahead with fixed-width instructions

For real-world programs, the performance difference from code density alone is small — within measurement noise.


19.3 Register File Comparison

Register File: x86-64 vs. ARM64
═══════════════════════════════════════════════════════════════════════════
                        x86-64              ARM64
───────────────────────────────────────────────────────────────────────────
GP registers            16 (RAX-R15)        31 (X0-X30) + XZR
Register width          64-bit max          64-bit max
Sub-register views      8/16/32/64 bit      32/64 bit only (W/X)
                        (AX, AL, AH for RAX; (W0 is low 32 of X0)
                         complex aliasing)    (clean: W write zeroes high)
Zero register           No                  Yes (XZR/WZR)
Caller-saved (arg/temp) 6 arg + 3 temp      8 arg + 11 temp
Callee-saved            6                   10 (X19-X28)
Separate link reg       No (on stack)       Yes (X30)
Separate FP registers   Yes (16 XMM0-15)    Yes (32 V0-V31 × 128-bit)
Condition flags         RFLAGS (always set) PSTATE (set with S suffix only)
═══════════════════════════════════════════════════════════════════════════

The extra ARM64 registers matter for performance: more callee-saved registers means functions can keep more values "live" without spilling to the stack. The 8-argument register calling convention (vs. 6 on x86-64 System V) means fewer functions need to use the stack for arguments.

Sub-register Aliasing: x86-64's Design Debt

x86-64's sub-register aliasing is a historical accident that creates programmer confusion and occasional performance surprises:

; x86-64: partial register writes
mov  rax, 0x1234567890ABCDEF
mov  ax, 0x1111         ; AX = 0x1111, but RAX = 0x1234567890AB1111 !
mov  al, 0x22           ; AL = 0x22, but RAX = 0x1234567890AB1122 !
mov  eax, 0x33333333    ; EAX = 0x33333333, RAX = 0x0000000033333333
                        ; Writing EAX ZEROS the upper 32 bits (inconsistent!)

This behavior caused real performance bugs in hand-optimized code: some CPUs handle partial register writes by generating false dependencies (stall until the full register value is ready). Modern CPUs handle most cases with register renaming, but the semantic complexity remains.

ARM64's clean model: writing to W0 always zeroes the upper 32 bits of X0. There are no 8-bit or 16-bit partial-width register writes. The aliasing rule is simple: Wn is always the low 32 bits of Xn, and writing Wn zeroes Xn's high 32 bits.


19.4 Instruction Encoding Comparison

The deepest difference between the architectures is visible in instruction encoding.

x86-64 Instruction Format

x86-64 Variable-Length Instruction Format
┌──────────┬─────────┬──────────────┬─────────────┬──────────┬───────────────┐
│ Prefixes │  REX    │    Opcode    │   ModR/M    │   SIB    │  Displacement │
│ (0-4B)   │ (0 or 1)│ (1-3 bytes) │ (0 or 1B)   │ (0 or 1B)│   (0,1,2,4B) │
└──────────┴─────────┴──────────────┴─────────────┴──────────┴───────────────┘
+ Optional Immediate (0,1,2,4,8 bytes)

Total: 1-15 bytes per instruction

ModR/M byte:
  [7:6] = Mod (2 bits): 00=reg-indirect, 01=+disp8, 10=+disp32, 11=reg
  [5:3] = Reg (3 bits): register or opcode extension
  [2:0] = R/M (3 bits): register or 'use SIB'

SIB byte (when ModR/M R/M = 100):
  [7:6] = Scale: 00=×1, 01=×2, 10=×4, 11=×8
  [5:3] = Index: register (RSP = 'no index')
  [2:0] = Base: register

REX prefix: 0100WRXB
  W=1: 64-bit operand size
  R: extends ModR/M Reg
  X: extends SIB Index
  B: extends ModR/M R/M or SIB Base

To decode even the length of an x86-64 instruction, you need to:

  1. Check for legacy prefixes (up to 4 bytes)
  2. Check for a REX/VEX/EVEX prefix
  3. Decode the opcode (1-3 bytes, with escape bytes 0x0F, 0x0F 0x38, 0x0F 0x3A)
  4. Decode the ModR/M byte
  5. Maybe decode a SIB byte
  6. Determine displacement size from the ModR/M Mod bits
  7. Determine immediate size from the opcode

This is why x86-64 instruction decoders are complex. Intel's frontend (which fetches, decodes, and dispatches instructions) consumes approximately 30-35% of a modern CPU's die area.

ARM64 Instruction Format

ARM64 Fixed-Length 32-bit Instruction Format
┌──────────────────────────────────────────────────────────────────────────┐
│ All instructions: exactly 32 bits                                         │
│                                                                           │
│ Data processing (register):                                               │
│ [31:29] opcode [28:24] op_variant [23:22] shift [20:16] Rm [15:10] imm6  │
│ [9:5] Rn [4:0] Rd                                                         │
│                                                                           │
│ Load/Store:                                                               │
│ [31:30] size [29:27] opcode [26] V [25:24] encoding [23:22] opc          │
│ [21:10] offset/register [9:5] Rn [4:0] Rt                                │
│                                                                           │
│ Branch:                                                                   │
│ [31] op [30:26] opcode [25:0] imm26 (for B) or [23:5] imm19 (for B.cond) │
└──────────────────────────────────────────────────────────────────────────┘

To decode an ARM64 instruction:
1. It's always 32 bits at 4-byte alignment. You already know where it ends.
2. The top 5 bits determine the instruction class.
3. Within each class, the format is consistent.

ARM64 decoder complexity: approximately 5-10% of a modern CPU's die area.

Real Decoder Complexity

This difference is substantial at the silicon level:

Characteristic                  x86-64 decoder    ARM64 decoder
───────────────────────────────────────────────────────────────────────────
Die area (% of CPU)             ~30-35%           ~5-10%
Max instructions decoded/cycle  4-6               4-8
Power consumption               High              Low
Pipeline stages (decode)        3-5               1-2
Out-of-order window             512+ µops         256+ µops

Apple's M-series chips invest the transistors saved on decoding into larger caches, wider execution units, and deeper out-of-order windows — which is a significant factor in their performance advantage.


19.5 Calling Convention Side-by-Side

Calling Convention Comparison
═══════════════════════════════════════════════════════════════════════════
                    System V AMD64 ABI          AAPCS64 (ARM64)
───────────────────────────────────────────────────────────────────────────
Integer args        RDI, RSI, RDX, RCX, R8, R9  X0-X7
Number of arg regs  6                           8
FP args             XMM0-XMM7                   V0-V7 (D0-D7)
Return value        RAX                         X0
FP return           XMM0                        V0 (D0/S0)
Callee-saved        RBX, RBP, R12-R15           X19-X28, X29, SP
Caller-saved        RAX, RCX, RDX, RSI, RDI,    X0-X18
                    R8-R11, XMM0-XMM15
Stack alignment     RSP+8 at call entry         SP aligned to 16 at call
Stack on entry      Return addr pushed by CALL  Return addr in X30 (LR)
Red zone            128B below RSP (leaf funcs) None in base AAPCS64 (Apple: 128B)
Stack frame pointer RBP                         X29 (FP)
═══════════════════════════════════════════════════════════════════════════

ARM64 has 2 more integer argument registers (8 vs. 6). For functions with 7 or 8 parameters, ARM64 is more efficient: x86-64 must push arguments 7 and beyond onto the stack, while ARM64 keeps all 8 in registers.

ARM64 also has more callee-saved registers (X19-X28 = 10, vs. RBX/RBP/R12-R15 = 6). Functions can keep more intermediate values alive across calls without stack spills.


19.6 Performance Characteristics

Performance comparison is complex because "faster" depends on workload, compiler, and specific CPU microarchitecture. Here are the general trends as of 2026:

Single-Threaded Scalar Code

For typical application code compiled at -O2:

  • x86-64: Higher clock frequencies (up to 5.5 GHz for desktop parts), highly optimized superscalar execution, 30+ years of microarchitectural tuning for common code patterns
  • ARM64: Lower clocks (typically 3.0-4.0 GHz for ARM servers), but wider execution (Apple M4 can issue 8+ instructions/cycle) and a larger ROB (reorder buffer)

Apple M4 single-threaded performance in Geekbench exceeds Intel Core i9 despite lower clock frequency — primarily due to the larger out-of-order window and execution width enabled by the simpler decoder.

Power Efficiency

ARM64 wins comprehensively:

  • Apple M3 Pro: ~20W TDP in a 14" MacBook Pro
  • Intel Core Ultra 9 185H (competing laptop chip): ~45W TDP
  • At similar performance levels, ARM64 uses 40-60% less power

For battery life, mobile, and server efficiency, ARM64 is dominant.

Vectorization (SIMD)

Neither architecture has an inherent advantage:

  • x86-64 has SSE2, AVX, AVX-512 (up to 512-bit SIMD)
  • ARM64 has NEON (128-bit SIMD, standard) plus SVE/SVE2 (variable-width SIMD, optional)

For workloads that fit in 128-bit SIMD, performance is comparable. For applications using AVX-512, x86-64 has an advantage (AVX-512 is 4× NEON's width). For workloads using SVE, ARM64 can match or exceed AVX-512.


19.7 The Apple Silicon Transition

Apple's switch from Intel x86-64 to ARM64 (Apple Silicon) in November 2020 was the largest architectural transition in personal computing since the PowerPC→x86 switch in 2005.

What the M1 Actually Is

The Apple M1 (2020) was not just "ARM64, but faster." It was a complete redesign:

  • 5nm TSMC process (Intel was still at 10nm in 2020)
  • Firestorm (performance) cores: 8 decode, massive 192-entry ROB, 6 µop/cycle retirement
  • Icestorm (efficiency) cores: 3-decode, tight cache hierarchy, designed for background tasks
  • Unified memory architecture: CPU, GPU, and neural engine share one physical memory pool
  • 192KB L1 instruction cache per Firestorm core (Intel: 32KB — 6× larger)
  • 12MB L2 cache per cluster
  • Large on-chip caches: 32MB "system level cache" (what Intel calls last-level cache)

The M1's performance advantage over Intel was primarily: larger caches, wider execution, and fewer cycles wasted on x86 CISC decoding overhead. ARM64 enabled this by freeing up 25-30% of die area from the decoder.

Rosetta 2: Technical Deep Dive

Rosetta 2 translates x86-64 machine code to ARM64:

Rosetta 2 Translation Pipeline
┌─────────────────────────────────────────────────────────────────────────┐
│ 1. First run: binary is scanned, translated ahead-of-time               │
│    - x86-64 instruction → ARM64 instruction sequence                    │
│    - Translation cached in /var/db/oah/                                  │
│                                                                          │
│ 2. Memory ordering: x86-64 uses TSO (Total Store Order)                 │
│    - TSO permits only store→load reordering; ARM64's weaker model       │
│      also reorders load→load, store→store, and load→store               │
│    - On Apple silicon, Rosetta 2 enables a hardware TSO mode; a pure-   │
│      software translator would need explicit DMB barriers instead       │
│    - Preserving x86-64 ordering is the main cost of translation         │
│                                                                          │
│ 3. Atomics: x86-64 LOCK prefix vs. ARM64 LDAXR/STLXR                  │
│    - Rosetta 2 maps x86-64 LOCK CMPXCHG → ARM64 exclusive operations   │
│                                                                          │
│ 4. FPU semantics: x86-64 x87 vs. ARM64 scalar FP                       │
│    - x87 80-bit intermediate precision vs. ARM64 64-bit                 │
│    - Minor numerical differences in some programs                       │
└─────────────────────────────────────────────────────────────────────────┘

Performance: x86-64 code under Rosetta 2 typically runs at ~70-80% of native x86-64 speed, but since M1 is 50-100% faster than competing Intel chips, translated x86-64 code on M1 often beats native x86-64 on Intel.

Industry Implications

The Apple Silicon transition proved that: 1. ARM64 CAN achieve server/desktop performance (not just mobile/embedded) 2. The cost of x86 ISA compatibility (the CISC tax) is real and measurable 3. A clean-slate architecture design enabled by RISC can outperform decades of CISC optimization

AWS, Microsoft, and Google followed with custom ARM64 server chips: - AWS Graviton4: ARM64, competitive with Intel at 40% lower price-per-performance - Microsoft Cobalt 100: Azure's own ARM64 server chip (based on Neoverse) - Google Axion: Google Cloud's ARM64 chip for data centers


19.8 ARM in the Data Center

By 2026, approximately 30-40% of cloud instances sold are ARM64. The economics are compelling:

x86-64 vs. ARM64 Data Center Economics (approximate, 2026)
┌─────────────────────────────────────────────────────────────────────────┐
│ AWS EC2 C6i (Intel): $0.170/vCPU-hour                                   │
│ AWS EC2 C7g (Graviton3): $0.125/vCPU-hour (same performance tier)       │
│ Savings: ~26% for the same workload                                      │
│                                                                          │
│ Performance-per-watt: ARM64 Neoverse N2 ≈ 2× Intel Xeon Sapphire Rapids │
│   (meaning: same work with half the electricity)                         │
│                                                                          │
│ Hyperscaler adoption:                                                    │
│ - Amazon: Graviton (ARM64) powers ~50% of Amazon's own workloads        │
│ - Apple: 100% ARM64 since 2022                                           │
│ - Alibaba: Yitian 710 (custom ARM64) for internal workloads             │
└─────────────────────────────────────────────────────────────────────────┘

19.9 RISC-V: The Open ISA on the Horizon

Any comparison of x86-64 and ARM64 in 2026 is incomplete without mentioning RISC-V.

RISC-V is an open-source ISA: no license fees, no patent royalties, no ARM Ltd. or Intel controlling the specification. Anyone can implement it.

RISC-V vs. ARM64 vs. x86-64
═══════════════════════════════════════════════════════════════════════════
Feature              x86-64          ARM64           RISC-V (RV64GC)
───────────────────────────────────────────────────────────────────────────
ISA licensing        Intel/AMD owned ARM Ltd. owned  Open (BSD license)
GP registers         16              31              32 (x0 always zero)
Register width       64-bit          64-bit          64-bit
Instruction width    Variable 1-15B  Fixed 4B        32-bit + 16-bit (C)
Memory model         TSO (strong)    Weak + barriers RVWMO (weak)
Condition flags      Yes (RFLAGS)    Yes (PSTATE)    No (compare-and-branch)
Condition code ops   Yes (all ALU)   S-suffix        No (separate compare)
Divide instruction   IDIV/UDIV       SDIV/UDIV       DIV/DIVU (extension)
Atomic operations    LOCK prefix     LDAXR/STLXR     LR/SC, AMO
Vector SIMD          SSE-AVX512      NEON/SVE        V extension (optional)
Mature software      Yes             Yes             Growing
Hardware ecosystem   Mature          Mature          Emerging
───────────────────────────────────────────────────────────────────────────

RISC-V key difference: no condition flags at all. Where ARM64 uses CMP + B.EQ, RISC-V uses combined compare-and-branch instructions:

RISC-V conditional branches:
BEQ  rs1, rs2, offset    // branch if rs1 == rs2
BNE  rs1, rs2, offset    // branch if rs1 != rs2
BLT  rs1, rs2, offset    // branch if rs1 < rs2 (signed)
BGE  rs1, rs2, offset    // branch if rs1 >= rs2 (signed)
BLTU rs1, rs2, offset    // branch if rs1 < rs2 (unsigned)
BGEU rs1, rs2, offset    // branch if rs1 >= rs2 (unsigned)

RISC-V also doesn't have a barrel shifter — shifting must be a separate instruction, not an inline modifier. This is a step further toward "truly reduced" than ARM64.

Current RISC-V status (2026):

  • Dominant in embedded/IoT (SiFive, Espressif ESP32-C3/C6)
  • Growing in mobile (RISC-V cores in Android SoCs as coprocessors)
  • Data center research (Alibaba, NVIDIA)
  • Not yet a primary server or desktop architecture
  • Key challenge: a less mature software ecosystem vs. ARM64's decades of optimization

RISC-V matters because: if ARM ever becomes too aggressive with licensing fees (as they tried with their 2023 IPO pricing), the industry has an open alternative.


19.10 The Future: Heterogeneous Computing

Both x86-64 and ARM64 are evolving toward heterogeneous designs:

ARM64 big.LITTLE (and DynamIQ):
  • Performance cores + efficiency cores
  • Apple's Firestorm + Icestorm (M-series)
  • ARM's Cortex-X4 + Cortex-A520 in Snapdragon 8 Gen 3

x86-64 P+E cores:
  • Intel 12th-14th gen: Performance + Efficiency cores
  • AMD: compact Zen "c" cores (e.g., Zen 4c) alongside full-size cores

Domain-Specific Accelerators:
  • Apple Neural Engine (ANE): matrix math for ML
  • NVIDIA DLSS (GPU-side ML inference)
  • Google TPU (Tensor Processing Unit)
  • Custom accelerators for compression, encryption, etc.

The long-term trend is clear: "the CPU" is becoming one component in a heterogeneous compute system. The ISA you run on changes depending on which part of the chip your code executes on.


19.11 Side-by-Side Code Examples

Hello World

                    x86-64 (NASM)          ARM64 (GNU AS, Linux)
──────────────────────────────────────────────────────────────────────────
section .data                              .section .rodata
msg: db "Hello!", 10                       msg: .ascii "Hello!\n"
len equ $ - msg                            len = . - msg

section .text                              .section .text
global _start                              .global _start
_start:                                    _start:
  mov rax, 1       ; write                   MOV X8, #64       // write
  mov rdi, 1       ; stdout                  MOV X0, #1        // stdout
  mov rsi, msg                               ADR X1, msg
  mov rdx, len                               MOV X2, #len
  syscall                                    SVC #0

  mov rax, 60      ; exit                    MOV X8, #93       // exit
  xor rdi, rdi                               MOV X0, #0
  syscall                                    SVC #0
──────────────────────────────────────────────────────────────────────────

Factorial (Iterative)

                    x86-64                  ARM64
──────────────────────────────────────────────────────────────────────────
; factorial(n): edi=n, returns eax        // factorial(n): W0=n, returns X0
factorial:                                factorial:
  mov  eax, 1        ; result = 1           MOV  X1, #1        // result = 1
  test edi, edi      ; n == 0?              MOV  W2, W0        // zero-extend n
  jz   .done         ; 0! = 1               CBZ  W2, .done     // 0! = 1
.loop:                                    .loop:
  imul eax, edi      ; result *= n          MUL  X1, X1, X2    // result *= n
  dec  edi           ; n--                  SUBS W2, W2, #1    // n--, set flags
  jnz  .loop                                B.NE .loop
.done:                                    .done:
  ret                                       MOV  X0, X1        // return result
                                            RET
──────────────────────────────────────────────────────────────────────────

Linked List Traversal

// struct node { int value; struct node *next; }
// int sum_list(struct node *head);
x86-64:                              ARM64:
──────────────────────────────────────────────────────────────────────────
sum_list:                             sum_list:
  xor  eax, eax    ; sum = 0           MOV  X1, XZR       // sum = 0
  test rdi, rdi    ; if head==NULL     CBZ  X0, .list_done
  jz   .done
.loop:                                .loop:
  add  eax, [rdi]  ; sum += node->val  LDR  W2, [X0]      // W2 = node->value
  mov  rdi, [rdi+8]; head = head->next ADD  X1, X1, X2    // sum += value
  test rdi, rdi                        LDR  X0, [X0, #8]  // X0 = node->next
  jnz  .loop                           CBNZ X0, .loop     // if next != NULL, loop
.done:                                .list_done:
  ret                                  MOV  X0, X1        // return sum
                                       RET
──────────────────────────────────────────────────────────────────────────
x86-64:                               ARM64:
- add uses memory operand             - LDR + ADD (two instructions)
- test rdi, rdi + jnz pattern         - CBNZ X0 pattern (one instruction)
- 8 instructions                      - 8 instructions
──────────────────────────────────────────────────────────────────────────

19.12 Comprehensive Comparison Table

x86-64 vs. ARM64: Complete Feature Matrix
═══════════════════════════════════════════════════════════════════════════
Feature                    x86-64              ARM64
───────────────────────────────────────────────────────────────────────────
Origin                     Intel 8086 (1978)   Acorn ARM1 (1985)
ISA type                   CISC                RISC
Instruction width          Variable (1-15B)    Fixed (4B)
GP registers               16                  31 + XZR
Sub-register aliasing      Yes (complex)       Simple (W/X only)
Memory operands in ALU     Yes                 No (load/store arch)
Inline shifts in ALU ops   No                  Yes (barrel shifter)
Condition flags            Always set by ALU   S-suffix only
Zero register              No                  Yes (XZR)
Argument registers         6 int, 8 FP         8 int, 8 FP
Return address             Stack               X30 (LR register)
SIMD width (standard)      128-bit (SSE2)      128-bit (NEON)
SIMD width (extended)      512-bit (AVX-512)   Variable (SVE/SVE2)
Memory ordering model      TSO (strong)        Weak + barriers
Typical peak clocks        3-5.5 GHz           3-4 GHz (non-Apple)
                                               3.7-4.05 GHz (Apple M4)
Performance/watt           Lower               Higher
Power (data center)        ~200W TDP servers   ~100-150W comparable
Code density               Higher              ~10-20% lower
Decoder complexity         Very high           Low
Backward compatibility     Full (to 8086!)     ARMv8+ (2011+)
Open ISA variant           No                  RISC-V (different ISA)
Dominant platform          Desktop, x86 server Mobile, embedded, clouds
Market share (2026)        ~55% of cloud       ~35% of cloud, growing
───────────────────────────────────────────────────────────────────────────

🔄 Check Your Understanding:

  1. Why can't x86-64 processors simply be "made faster" by adding more transistors the way ARM64 processors can benefit from the die area freed by a simpler decoder?
  2. RISC-V has no condition flags at all. How does BEQ rs1, rs2, offset work — what does the processor compute?
  3. The M1's 192KB L1 instruction cache is 6× larger than Intel's typical 32KB. Why does this help ARM64 performance for code with large working sets?
  4. What does Rosetta 2 need to do to handle x86-64's TSO memory model on ARM64's weaker memory model?
  5. A function has 9 integer arguments. How many extra instructions does x86-64 (System V ABI) require vs. ARM64 (AAPCS64) to pass all arguments?


Summary

x86-64 and ARM64 represent two different answers to "how should we design a processor?" x86-64 maximized programmer convenience per instruction at the cost of hardware complexity. ARM64 maximized hardware efficiency per instruction at the cost of programmer verbosity.

In 2026, both are first-class architectures. x86-64 still dominates desktop and legacy enterprise. ARM64 now dominates mobile, is competitive in the data center, and is ascendant in high-performance computing (Apple Silicon).

The RISC-V wildcard: if ARM's licensing model becomes hostile, the industry has a clean-room, open alternative waiting.

A software developer in 2026 needs to understand both. A security researcher needs to understand both (exploits don't care about your license preferences). An embedded engineer might work on RISC-V. The era of "just learn x86 and you're done" ended when the iPhone shipped.