
Chapter 14: Floating Point

Three Subsystems, One Architecture

x86-64 has three distinct floating-point subsystems that coexist in every modern processor. Understanding which one to use, and why, separates programmers who get correct and fast floating-point code from those who get surprising wrong answers at unexpected times.

x87 FPU: The 8087 coprocessor design from 1980, still present in every x86-64 chip. Uses a stack-based register model (ST(0) through ST(7)), 80-bit extended precision, and hardware transcendentals (sin, cos, tan, log) plus square root. The right choice for 80-bit precision requirements and legacy code. The wrong choice for everything else.

SSE/SSE2 scalar: The modern default for floating-point computation. Uses XMM registers (128 bits, of which 32 or 64 bits are used for scalar operations). Explicit register-to-register operations, predictable precision, no stack to manage. This is what GCC uses for float and double arithmetic at any optimization level.

AVX/AVX2: 256-bit extension of SSE using the YMM registers. Scalar AVX instructions (VADDSD and friends) behave like their SSE counterparts; the payoff is SIMD: eight floats or four doubles per 256-bit operation, versus four floats or two doubles for 128-bit SSE. The right choice for data-parallel floating-point computation.

This chapter focuses on SSE/SSE2 scalar floating-point as the primary subject, covers x87 for completeness and legacy understanding, and previews AVX for the SIMD chapter that follows.

IEEE 754: The Standard Behind the Numbers

All three subsystems implement IEEE 754 (or its 2008 revision), which specifies:

Binary32 (float): 1 sign bit, 8 exponent bits, 23 mantissa bits. Range: ~1.2×10^-38 to ~3.4×10^38. About 7 significant decimal digits.

Binary64 (double): 1 sign bit, 11 exponent bits, 52 mantissa bits. Range: ~2.2×10^-308 to ~1.8×10^308. About 15 significant decimal digits.

Binary80 (x87 extended): 1 sign bit, 15 exponent bits, 64 mantissa bits (with an explicit integer bit, unlike binary32/64, whose leading 1 bit is implicit). About 18-19 significant decimal digits.

Special Values

IEEE 754 defines several non-numeric values:

  • ±Infinity: overflow result, propagates through operations
  • NaN (Not a Number): result of undefined operations (0/0, ∞-∞, sqrt(-1)), propagates through operations
  • Denormal (subnormal): very small values near underflow, gradual loss of precision
  • ±Zero: both positive and negative zero exist; they compare equal but behave differently in some operations (e.g. 1/+0 = +∞, 1/-0 = -∞)

; Creating special values:
; +Infinity = 0x7F800000 (float) or 0x7FF0000000000000 (double)
; NaN = 0x7FC00000 (float, quiet NaN) or 0x7FF8000000000000 (double)
; Denormal: float value 0x00000001 = 1.4012984×10^-45 (smallest positive float)

MXCSR: The Floating-Point Control Register

The SSE floating-point control and status register MXCSR contains:

  • Exception flags: IE (invalid), DE (denormal), ZE (divide by zero), OE (overflow), UE (underflow), PE (precision/inexact)
  • Exception masks: each flag has a corresponding mask bit
  • Rounding control: round to nearest (default), toward zero, toward +∞, toward -∞
  • Flush-to-zero (FTZ): treat denormal results as zero (performance optimization)
  • Denormal-as-zero (DAZ): treat denormal inputs as zero

; Read MXCSR:
stmxcsr [rsp - 4]          ; store MXCSR to memory
mov eax, [rsp - 4]          ; read it

; Write MXCSR:
mov dword [rsp - 4], 0x1F80  ; default value (all exceptions masked, round-to-nearest)
ldmxcsr [rsp - 4]

; Common: set flush-to-zero and denormal-as-zero for maximum performance
; (at the cost of losing gradual underflow)
mov dword [rsp - 4], 0x9FC0  ; FTZ (bit 15) = 1, DAZ (bit 6) = 1, all exceptions masked
ldmxcsr [rsp - 4]

x87 FPU: The Legacy Stack

The x87 register stack consists of eight 80-bit registers, accessed as ST(0) through ST(7). ST(0) is always the top of the stack. Loads push a new value; the 'p'-suffixed arithmetic forms (FADDP, FMULP, ...) pop an operand after computing.

x87 Basics

; Load constants:
fldz                       ; push 0.0 onto ST(0)
fld1                       ; push 1.0 onto ST(0)
fldpi                      ; push π ≈ 3.14159265...
fldl2e                     ; push log2(e)
fldl2t                     ; push log2(10)
fldlg2                     ; push log10(2)
fldln2                     ; push ln(2)

; Load/store float (32-bit):
fld dword [rbx]            ; push *(float*)rbx as 80-bit extended onto the stack
fst dword [rbx]            ; store ST(0) as float (no pop)
fstp dword [rbx]           ; store ST(0) as float and pop

; Load/store double (64-bit):
fld qword [rbx]            ; push *(double*)rbx
fst qword [rbx]            ; store ST(0) as double
fstp qword [rbx]           ; store and pop

; Arithmetic:
faddp                      ; ST(1) = ST(1) + ST(0), pop ST(0)
fsubrp                     ; ST(1) = ST(0) - ST(1), pop (note: fsubrp reverses operands)
fmulp                      ; ST(1) = ST(1) * ST(0), pop
fdivp                      ; ST(1) = ST(1) / ST(0), pop

; Transcendentals (no SSE equivalent):
fsqrt                      ; ST(0) = sqrt(ST(0))
fsin                       ; ST(0) = sin(ST(0)) (in radians)
fcos                       ; ST(0) = cos(ST(0))
fsincos                    ; replaces ST(0) with sin, then pushes cos: ST(0) = cos, ST(1) = sin
fptan                      ; ST(0) = tan(ST(0)), pushes 1.0 above (quirk: always pushes 1.0)
fpatan                     ; ST(0) = atan(ST(1) / ST(0)), pop
fyl2x                      ; ST(1) = ST(1) * log2(ST(0)), then pop: result in ST(0)
f2xm1                      ; ST(0) = 2^ST(0) - 1 (ST(0) must be in [-1, 1])

Why x87 is Mostly Avoided in New Code

  1. Stack discipline is error-prone. Every operation implicitly modifies the stack. A missed FSTP or extra FLD corrupts all subsequent calculations.
  2. No SIMD parallelism. x87 processes one value at a time. SSE can process 4 floats or 2 doubles simultaneously with the same instruction.
  3. Precision surprises. x87 keeps intermediate results in 80-bit extended precision by default, while C compilers assume 64-bit double. This produces subtle precision differences between compiled C and hand-written x87 assembly.
  4. Poor compiler fit. The x87 stack model maps badly onto compiler register allocation, so modern compilers keep floating-point temporaries in XMM registers instead.

When x87 is appropriate:

  • 80-bit extended precision is required (some numerical algorithms benefit from the extra precision)
  • Hardware transcendental functions (FSIN, FCOS) are desired without linking libm
  • Working with legacy code that uses x87

x87 Example: Computing sin(x) + cos(x)

; Compute sin(x) + cos(x) using x87 transcendentals
; Input: double x at [rbx]
; Output: double result at [rbx + 8]

; Load x:
fld qword [rbx]            ; ST(0) = x

; Compute sin and cos with one instruction:
fsincos                    ; replaces ST(0) with sin(x), then pushes cos(x)
                           ; after: ST(0) = cos(x), ST(1) = sin(x)

faddp                      ; ST(1) = ST(1) + ST(0) = sin(x) + cos(x), pop
                           ; result is now in ST(0)

; Store result to memory:
fstp qword [rbx + 8]       ; *(double*)(rbx+8) = result, pop (stack now empty)

SSE/SSE2 Scalar Floating-Point: The Main Event

SSE2 scalar instructions operate on a single floating-point value in the low portion of an XMM register. The high bits of the register are either left unchanged or zeroed depending on the instruction.

XMM Register Layout

XMM register (128 bits):
┌───────────────────────────────┬───────────────┬───────────────┐
│  Bits 127:64 (upper 64 bits)  │  Bits 63:32   │  Bits 31:0    │
│  (not used in scalar ops)     │               │               │
└───────────────────────────────┴───────────────┴───────────────┘
                                └── MOVSD uses bits 63:0 ───────┘
                                                └ MOVSS: 31:0 ──┘

Loading and Storing

; Load float (32-bit) into XMM0:
movss xmm0, [rbx]          ; xmm0[31:0] = *(float*)rbx; xmm0[127:32] = 0

; Load double (64-bit) into XMM0:
movsd xmm0, [rbx]          ; xmm0[63:0] = *(double*)rbx; xmm0[127:64] = 0

; Store float:
movss [rbx], xmm0          ; *(float*)rbx = xmm0[31:0]

; Store double:
movsd [rbx], xmm0          ; *(double*)rbx = xmm0[63:0]

; Register to register (MERGES: upper bits of the destination are preserved):
movss xmm1, xmm0           ; xmm1[31:0] = xmm0[31:0]; xmm1[127:32] unchanged
movsd xmm1, xmm0           ; xmm1[63:0] = xmm0[63:0]; xmm1[127:64] unchanged

Arithmetic

; Scalar single (SS suffix = "Scalar Single"):
addss xmm0, xmm1           ; xmm0 = xmm0 + xmm1 (float)
subss xmm0, xmm1           ; xmm0 = xmm0 - xmm1
mulss xmm0, xmm1           ; xmm0 = xmm0 * xmm1
divss xmm0, xmm1           ; xmm0 = xmm0 / xmm1
sqrtss xmm0, xmm1          ; xmm0 = sqrt(xmm1)
maxss xmm0, xmm1           ; xmm0 = max(xmm0, xmm1)
minss xmm0, xmm1           ; xmm0 = min(xmm0, xmm1)

; Scalar double (SD suffix = "Scalar Double"):
addsd xmm0, xmm1           ; xmm0 = xmm0 + xmm1 (double)
subsd xmm0, xmm1
mulsd xmm0, xmm1
divsd xmm0, xmm1
sqrtsd xmm0, xmm1
maxsd xmm0, xmm1
minsd xmm0, xmm1

All of these instructions read only the low 32 or 64 bits of the source operand and write only the low 32 or 64 bits of the destination; the upper bits of the destination are always preserved. (Only the MOVSS/MOVSD loads from memory zero the upper bits.)

Comparison

Floating-point comparison does not use RFLAGS directly. Instead, UCOMISS/UCOMISD sets CF, ZF, and PF:

; Compare float:
ucomiss xmm0, xmm1         ; compare xmm0 and xmm1 (floats)
; Compare double:
ucomisd xmm0, xmm1         ; compare xmm0 and xmm1 (doubles)

; After UCOMISS/UCOMISD:
; xmm0 > xmm1:  CF=0, ZF=0, PF=0  → use JA
; xmm0 < xmm1:  CF=1, ZF=0, PF=0  → use JB
; xmm0 == xmm1: CF=0, ZF=1, PF=0  → use JE or JZ
; Unordered (NaN): CF=1, ZF=1, PF=1 → use JP (parity flag set)
; (Use the UNSIGNED jumps JA/JB/JAE/JBE; the signed forms JG/JL test
;  SF and OF, which UCOMISS/UCOMISD do not set meaningfully.)

; Conditional jumps after UCOMISS:
; if (a < b): ucomiss; jb .less_than
; if (a == b): ucomiss; je .equal  (CAUTION: never do this with floats!)
; if (a > b): ucomiss; ja .greater_than
; Check for NaN: ucomiss; jp .is_nan

⚠️ Common Mistake — Never Compare Floats with ==: 0.1 + 0.2 == 0.3 is false in floating-point arithmetic because 0.1 and 0.2 are not exactly representable in binary. The result of the addition is 0.30000000000000004, which does not equal 0.3. Always compare floats with an epsilon tolerance (see below) or use exact comparison only when the values come from non-arithmetic sources (e.g., comparing two values that were loaded from the same memory location).

Conversion Instructions

; Integer to float:
cvtsi2ss xmm0, rax         ; xmm0 = (float)rax (int64 → float32)
cvtsi2ss xmm0, eax         ; xmm0 = (float)eax (int32 → float32)
cvtsi2sd xmm0, rax         ; xmm0 = (double)rax (int64 → double)
cvtsi2sd xmm0, eax         ; xmm0 = (double)eax (int32 → double)

; Float to integer (with rounding controlled by MXCSR):
cvtss2si rax, xmm0         ; rax = (int64)(float)xmm0 (round, not truncate)
cvtsd2si rax, xmm0         ; rax = (int64)(double)xmm0

; Float to integer with truncation toward zero (C cast semantics):
cvttss2si rax, xmm0        ; rax = (int64)(float)xmm0 (truncate)
cvttsd2si rax, xmm0        ; rax = (int64)(double)xmm0 (truncate)

; Precision conversion:
cvtss2sd xmm0, xmm1        ; xmm0 = (double)(float)xmm1 (widening)
cvtsd2ss xmm0, xmm1        ; xmm0 = (float)(double)xmm1 (narrowing, may lose precision)

💡 Mental Model: CVTSS2SI rounds using the rounding mode in MXCSR (default: round to nearest, ties to even). CVTTSS2SI always truncates toward zero — this matches C's (int)float_value cast. Use CVTT when you want C's behavior, CVT when you want the current rounding mode.

Rounding

; ROUNDSS / ROUNDSD with immediate control byte:
; Immediate bits [1:0] select the rounding mode:
;   0b00 = round to nearest (ties to even)
;   0b01 = round toward -∞ (floor)
;   0b10 = round toward +∞ (ceil)
;   0b11 = round toward 0 (truncate)
; Immediate bit [2] = 0 uses bits [1:0]; bit [2] = 1 uses the MXCSR mode
; Immediate bit [3] = 1 suppresses the precision (inexact) exception

roundsd xmm0, xmm1, 0b1000   ; round xmm1 to nearest, store in xmm0 (mode from imm)
roundss xmm0, xmm1, 0b1010   ; ceil(xmm1), store in xmm0

; Practical: floor, ceil, truncate (bit 3 set to suppress inexact):
roundsd xmm0, xmm0, 0b1001   ; floor: imm mode, round toward -∞
roundsd xmm0, xmm0, 0b1010   ; ceil: imm mode, round toward +∞
roundsd xmm0, xmm0, 0b1011   ; truncate: imm mode, round toward 0

GCC Floating-Point Code Generation

Let us examine what GCC -O2 generates for simple float arithmetic:

double hypotenuse(double a, double b) {
    return sqrt(a*a + b*b);
}
; GCC -O2 output:
; XMM0 = a, XMM1 = b
hypotenuse:
    mulsd xmm0, xmm0       ; xmm0 = a * a
    mulsd xmm1, xmm1       ; xmm1 = b * b
    addsd xmm0, xmm1       ; xmm0 = a*a + b*b
    sqrtsd xmm0, xmm0      ; xmm0 = sqrt(a*a + b*b)
    ret

Four instructions. The compiler maps sqrt() directly to SQRTSD; with -fno-math-errno (or when it can prove the argument is non-negative) there is no call into libm at all, and libm's own sqrt uses SQRTSD internally anyway. No function call overhead.

float dot3(float *a, float *b) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}
; GCC -O2 output:
; RDI = a, RSI = b
dot3:
    movss  xmm0, [rdi]         ; xmm0 = a[0]
    movss  xmm1, [rdi + 4]     ; xmm1 = a[1]
    movss  xmm2, [rdi + 8]     ; xmm2 = a[2]
    mulss  xmm0, [rsi]         ; xmm0 = a[0] * b[0]
    mulss  xmm1, [rsi + 4]     ; xmm1 = a[1] * b[1]
    mulss  xmm2, [rsi + 8]     ; xmm2 = a[2] * b[2]
    addss  xmm0, xmm1          ; xmm0 = a[0]*b[0] + a[1]*b[1]
    addss  xmm0, xmm2          ; xmm0 = dot product
    ret

IEEE 754 Special Cases in Practice

NaN Propagation

Any operation involving NaN produces NaN:

; Creating NaN via 0/0:
pxor    xmm0, xmm0         ; xmm0 = 0.0
divsd   xmm0, xmm0         ; xmm0 = 0.0 / 0.0 = NaN (MXCSR invalid exception set)
; Now xmm0 is NaN

; NaN propagates:
addsd   xmm0, xmm1         ; result is NaN (any operation with NaN input → NaN output)

; Checking for NaN (unordered comparison):
ucomisd xmm0, xmm0         ; compare with itself
jp      .is_nan            ; NaN != NaN (PF=1 for unordered comparison)
; A value is NaN iff it is not equal to itself

Infinity Arithmetic

; Overflow produces infinity:
; (e.g. DBL_MAX + DBL_MAX overflows to +inf)
; Exact arithmetic:
; +inf + anything = +inf (except -inf)
; +inf - (+inf) = NaN
; +inf * 0 = NaN
; +inf / +inf = NaN
; Finite / 0 = ±inf (with sign of numerator)

Denormal Numbers and the Performance Cliff

Denormal (subnormal) numbers are very small floating-point values near zero that have reduced precision. Operations involving denormals are handled by microcode on most processors, costing 40-150 extra cycles per operation — the "denormal performance cliff."

; Setting Flush-to-Zero to avoid denormal performance cliff:
; At the cost of losing gradual underflow (denormals become 0)
stmxcsr dword [rsp - 4]
or      dword [rsp - 4], 0x8040    ; set FTZ (bit 15) and DAZ (bit 6)
ldmxcsr dword [rsp - 4]

This is standard practice in audio processing and scientific computing where the tiny values near underflow provide no useful precision anyway.

⚡ Performance Note: If your floating-point code suddenly becomes 10× slower for certain inputs, denormals are the first suspect. Run the algorithm with FTZ+DAZ enabled and compare; if performance recovers, you have a denormal problem in your algorithm or data.

Floating-Point Comparison: Epsilon Comparison

; Epsilon comparison: |a - b| < epsilon
; XMM0 = a, XMM1 = b

fp_nearly_equal:
    subsd  xmm0, xmm1          ; xmm0 = a - b
    ; Take absolute value (clear sign bit):
    movsd  xmm1, [rel abs_mask]  ; xmm1 = 0x7FFFFFFFFFFFFFFF (sign bit cleared)
    andpd  xmm0, xmm1           ; xmm0 = |a - b|
    ucomisd xmm0, [rel epsilon]  ; compare |a-b| with epsilon
    jb     .is_equal            ; jump if |a-b| < epsilon (CF=1 for below)
    ; not equal
    xor    eax, eax
    ret
.is_equal:
    mov    eax, 1
    ret

section .data
    abs_mask: dq 0x7FFFFFFFFFFFFFFF  ; mask to clear sign bit of double
    epsilon:  dq 1.0e-9              ; comparison tolerance

📊 C Comparison:

#include <math.h>

double a = 0.1 + 0.2;
double b = 0.3;
// a == b is false in C!
// Correct:
if (fabs(a - b) < 1e-9) { /* nearly equal */ }

The assembly does exactly what fabs(a - b) < epsilon does in C.

A Complete Floating-Point Function: Polynomial Evaluation

Evaluating a polynomial p(x) = a₀ + a₁x + a₂x² + ... using Horner's method:

; double horner(double x, double *coeffs, int degree)
; p(x) = coeffs[0] + coeffs[1]*x + ... + coeffs[degree]*x^degree
; Computed as: ((coeffs[degree]*x + coeffs[degree-1])*x + ...)*x + coeffs[0]
; XMM0 = x, RDI = coeffs array, ESI = degree

horner:
    push rbp
    mov  rbp, rsp

    movsxd rax, esi            ; rax = degree (sign-extend 32 → 64)
    movsd  xmm1, [rdi + rax*8] ; xmm1 = coeffs[degree]

.loop:
    dec    rax                 ; index--
    js     .done               ; if index < 0, done

    mulsd  xmm1, xmm0          ; xmm1 *= x
    addsd  xmm1, [rdi + rax*8] ; xmm1 += coeffs[index]
    jmp    .loop

.done:
    movsd  xmm0, xmm1          ; return value in xmm0
    pop    rbp
    ret

🔄 Check Your Understanding: For coefficients [1.0, 2.0, 3.0] (degree=2) and x=2.0, trace through the Horner evaluation and verify it computes 1 + 2·2 + 3·2² = 17.

Answer

  • Start: xmm1 = coeffs[2] = 3.0
  • Iteration 1 (index=1): xmm1 = 3.0 * 2.0 = 6.0; xmm1 += coeffs[1] = 6.0 + 2.0 = 8.0
  • Iteration 2 (index=0): xmm1 = 8.0 * 2.0 = 16.0; xmm1 += coeffs[0] = 16.0 + 1.0 = 17.0
  • Return 17.0 ✓

The polynomial 3x² + 2x + 1 at x=2: 3·4 + 2·2 + 1 = 12 + 4 + 1 = 17 ✓

Why Floating-Point is Avoided in Cryptography

The encryption tool, this book's anchor example, uses AES-NI (integer operations in the SSE unit) rather than floating-point. Why?

  1. Non-determinism across platforms. Floating-point results can vary slightly between processors, compiler settings, and rounding modes. Cryptographic operations must produce identical results everywhere.
  2. Denormal side-channels. Denormal numbers trigger microcode paths that take different amounts of time, potentially leaking information through timing side-channels.
  3. No exact inverse. Floating-point operations are not exactly invertible. x / 3.0 * 3.0 != x in general. Cryptographic operations need exact arithmetic.
  4. MXCSR is shared state. Changing the rounding mode in one part of the program silently affects all floating-point operations until changed back.

The AES-NI instructions (covered in Chapter 15) live in the SSE unit and use XMM registers, but they operate on integer byte values, not floating-point. The XMM register file is shared between SSE floating-point and AES-NI, but the operations themselves are entirely integer.

Summary

The SSE/SSE2 scalar floating-point instructions (MOVSS/MOVSD, ADDSS/ADDSD, MULSS/MULSD, etc.) are the primary floating-point instruction set for modern x86-64 code. They are explicit register-to-register operations with predictable precision, and they integrate cleanly with the System V ABI (which passes float and double arguments in XMM0-XMM7). The x87 FPU is still present and still useful for 80-bit precision and hardware transcendentals, but is not the default for new code.

IEEE 754 special values (NaN, Infinity, denormals) are real and must be handled. NaN propagates silently through arithmetic. Equality comparison of floats is almost always wrong; use epsilon comparison. Denormals cause severe performance degradation and should be flushed to zero in performance-sensitive code.

Chapter 15 builds on the XMM register file to introduce packed SIMD operations — where instead of one float per instruction, we process four (SSE) or eight (AVX2) simultaneously.