Chapter 14 Key Takeaways: Floating Point

Open Assembly Language Project

x86-64 has three floating-point subsystems: x87 (legacy 80-bit stack), SSE/SSE2 (modern default), and AVX/AVX2 (256-bit SIMD). For new scalar code, use SSE2. For 80-bit precision or hardware transcendentals, use x87. For SIMD data-parallel computation, use AVX.
SSE2 scalar floating-point uses the low element of the XMM registers. MOVSS/ADDSS/MULSS operate on the low 32 bits (float); MOVSD/ADDSD/MULSD operate on the low 64 bits (double). For MOVSS/MOVSD, a load from memory zeroes the upper bits of the destination, while a register-to-register move preserves them. Scalar arithmetic instructions always leave the upper bits of the destination unchanged.
The System V AMD64 ABI passes float/double arguments in XMM0-XMM7 (first 8 floating-point arguments) and returns in XMM0. Integer and floating-point arguments use separate register sets and can be intermixed freely.
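UCOMISS/UCOMISD set ZF, PF, and CF from the comparison and clear OF, SF, and AF. After UCOMISD a, b (Intel operand order): a = b → ZF=1, CF=0, PF=0; a < b → ZF=0, CF=1, PF=0; a > b → ZF=0, CF=0, PF=0; unordered (NaN involved) → ZF=1, CF=1, PF=1. Branch with the unsigned condition codes (JA/JAE/JB/JBE/JE), and test PF first (JP) when NaN is possible, because the unordered result also satisfies JE, JB, and JBE.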
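The flag table above can be modeled in C to see which branches fire for each outcome. This is only an illustrative sketch: the struct and function names are hypothetical, and the model covers just the three flags listed.

```c
#include <stdbool.h>

/* Hypothetical model of the three flags UCOMISD writes. */
struct ucomi_flags {
    bool zf, cf, pf;
};

/* Models "UCOMISD a, b" (Intel operand order: a is compared against b). */
struct ucomi_flags ucomisd_model(double a, double b)
{
    struct ucomi_flags f = { false, false, false };
    if (a != a || b != b) {          /* unordered: at least one operand is NaN */
        f.zf = f.cf = f.pf = true;
    } else if (a == b) {
        f.zf = true;
    } else if (a < b) {
        f.cf = true;
    }                                /* a > b leaves all three flags clear */
    return f;
}

/* JA is taken when CF=0 and ZF=0, so it is never taken for unordered. */
bool ja_taken(struct ucomi_flags f) { return !f.cf && !f.zf; }

/* JB is taken when CF=1, which is also true for unordered: test PF first. */
bool jb_taken(struct ucomi_flags f) { return f.cf; }
```

In this model, JB is taken for a NaN operand even though no "less than" relation holds, which is why a JP check usually comes first.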
Never compare computed floating-point values for exact equality (je after UCOMISD). Floating-point arithmetic involves rounding; mathematically equal values often differ in their binary representation. Use a tolerance comparison instead: |a - b| < epsilon, with epsilon chosen relative to the magnitude of the operands.
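A tolerance comparison might look like the following C sketch. The function names and the combined relative/absolute form are assumptions made for illustration; appropriate tolerances depend on the computation.

```c
#include <stdbool.h>

/* Absolute value without libm, so the sketch has no link-time dependencies. */
double abs_f64(double x) { return x < 0.0 ? -x : x; }

/*
 * Tolerance comparison: |a - b| <= max(abs_tol, rel_tol * max(|a|, |b|)).
 * Both tolerances are caller-chosen assumptions, not universal constants.
 */
bool nearly_equal(double a, double b, double rel_tol, double abs_tol)
{
    double diff  = abs_f64(a - b);
    double scale = abs_f64(a) > abs_f64(b) ? abs_f64(a) : abs_f64(b);
    double bound = rel_tol * scale;
    if (bound < abs_tol)
        bound = abs_tol;
    return diff <= bound;   /* false if either input is NaN */
}
```

The absolute tolerance covers values near zero, where a purely relative bound shrinks to nothing.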
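NaN (Not a Number) propagates through nearly all arithmetic: an add, subtract, multiply, divide, or square root with a NaN operand produces NaN. (MINSD/MAXSD are an exception: they return the second source operand when either operand is NaN.) Checking for NaN: UCOMISD xmm0, xmm0; JP .is_nan. This works because NaN is the only value that compares unordered with itself.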
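The same self-comparison idiom works in C, where x != x is true only for NaN. The helper names below are hypothetical, and the sketch assumes default IEEE semantics (options such as -ffast-math may let the compiler remove the check).

```c
#include <stdbool.h>

/* True only for NaN: NaN compares unordered with itself, so x != x holds. */
bool is_nan_selfcmp(double x) { return x != x; }

/* Produces a quiet NaN at run time as 0.0 / 0.0 (volatile blocks folding). */
double make_nan(void)
{
    volatile double zero = 0.0;
    return zero / zero;
}
```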
Conversion instructions: CVTSI2SS/CVTSI2SD convert an integer to float/double. CVTTSS2SI/CVTTSD2SI truncate a float/double to an integer (C cast semantics). CVTSS2SD/CVTSD2SS convert between float and double precision.
The CVTT (truncate) variants match C's (int)x cast behavior (round toward zero). The CVT variants (without T) use the current MXCSR rounding mode (default: round to nearest, ties to even). Use CVTT for C-compatible integer truncation.
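The difference between the two variants can be observed from C through the SSE2 intrinsics that map to CVTTSD2SI and CVTSD2SI. This sketch assumes an x86-64 compiler and the default MXCSR rounding mode; the wrapper names are hypothetical.

```c
#include <emmintrin.h>   /* SSE2 intrinsics; x86-64 only */

/* CVTTSD2SI: truncate toward zero, like the C cast (int)x. */
int to_int_truncate(double x) { return _mm_cvttsd_si32(_mm_set_sd(x)); }

/* CVTSD2SI: round using the current MXCSR rounding mode. */
int to_int_mxcsr(double x) { return _mm_cvtsd_si32(_mm_set_sd(x)); }
```

With the default mode, to_int_mxcsr(2.5) gives 2 and to_int_mxcsr(3.5) gives 4 (ties to even), while to_int_truncate gives 2 and 3.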
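Denormal numbers (subnormals) can cause severe performance degradation, on the order of 40-150 extra cycles per operation on many microarchitectures, because they trigger a microcode assist. For performance-critical code that can tolerate the loss of precision near zero, set Flush-to-Zero (FTZ) in MXCSR so that denormal results are replaced with zero, and Denormals-Are-Zero (DAZ) so that denormal inputs are read as zero.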
The MXCSR register controls SSE floating-point behavior: exception masks (which IEEE 754 exceptions are silently handled vs. raised), sticky exception status flags, rounding mode (4 modes), FTZ (flush-to-zero), and DAZ (denormals-are-zero). Changing MXCSR affects all subsequent SSE floating-point operations in the current thread until it is changed back, so save and restore it around code that needs a non-default setting.
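As a rough illustration of FTZ, the following C sketch sets or clears the FTZ bit around a single multiplication and then restores the caller's MXCSR. It assumes an x86-64 compiler and that the underflow exception is masked (the default); the macro and function names are hypothetical.

```c
#include <xmmintrin.h>   /* _mm_getcsr / _mm_setcsr; x86-64 only */

#define MXCSR_DAZ 0x0040u   /* bit 6: denormal inputs are read as zero */
#define MXCSR_FTZ 0x8000u   /* bit 15: denormal results are flushed to zero */

/* Multiplies a * b with FTZ set or cleared, then restores the caller's MXCSR. */
double mul_with_ftz(double a, double b, int ftz_on)
{
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(ftz_on ? (saved | MXCSR_FTZ) : (saved & ~MXCSR_FTZ));

    volatile double x = a, y = b;   /* volatile forces a run-time multiply */
    volatile double r = x * y;

    _mm_setcsr(saved);              /* also discards flags set by the multiply */
    return r;
}
```

Multiplying the smallest normal double (about 2.2e-308) by 0.3 yields a subnormal with FTZ off and zero with FTZ on.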
The x87 unit provides the only hardware transcendental instructions in the instruction set (FSIN, FCOS, FSINCOS, FPTAN, FPATAN, F2XM1, FYL2X, FYL2XP1). SSE has no native sin/cos; the math library provides software implementations using polynomial approximations. For hot loops, a degree-9 minimax polynomial approximation over a reduced argument range can compute sin in roughly 15-20 cycles, versus roughly 50-100 cycles for hardware FSIN.
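The structure of such a polynomial evaluation can be sketched in C with Horner's rule. Note the assumption: the coefficients below are plain Taylor-series terms, used as a stand-in for a minimax fit, so accuracy degrades toward the ends of the range, and argument reduction is left to the caller.

```c
/*
 * Degree-9 odd polynomial for sin(x), evaluated with Horner's rule in x^2.
 * Coefficients are Taylor terms (-1/3!, 1/5!, -1/7!, 1/9!), not minimax ones.
 * Intended input range: roughly [-pi/2, pi/2], after argument reduction.
 */
double sin_poly9(double x)
{
    const double c3 = -1.0 / 6.0;
    const double c5 =  1.0 / 120.0;
    const double c7 = -1.0 / 5040.0;
    const double c9 =  1.0 / 362880.0;

    double x2 = x * x;
    double p  = c7 + x2 * c9;
    p = c5 + x2 * p;
    p = c3 + x2 * p;
    return x + x * x2 * p;   /* x + x^3 * (c3 + c5 x^2 + c7 x^4 + c9 x^6) */
}
```

Each Horner step is one multiply and one add, which maps naturally onto MULSD/ADDSD (or a fused multiply-add where available).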
Financial calculations must use integer (fixed-point) arithmetic, not floating-point. 0.01 is not exactly representable in binary; repeated additions accumulate error. Store monetary values as integers in the smallest unit needed (cents, millicents) and use 128-bit intermediate products (IMUL/IDIV with RDX:RAX) to avoid overflow.
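The 128-bit intermediate product can be sketched in C using the __int128 extension available in GCC and Clang. The function names and the basis-point example are illustrative assumptions, and the quotient is assumed to fit in 64 bits.

```c
#include <stdint.h>

/*
 * Computes (amount * numerator) / denominator with a 128-bit intermediate,
 * mirroring one-operand IMUL (result in RDX:RAX) followed by IDIV.
 * __int128 is a GCC/Clang extension. The quotient truncates toward zero and
 * is assumed to fit in 64 bits (IDIV would fault otherwise).
 */
int64_t mul_div_i64(int64_t amount, int64_t numerator, int64_t denominator)
{
    __int128 product = (__int128)amount * numerator;
    return (int64_t)(product / denominator);
}

/* Example: apply a rate in basis points (1 bp = 0.01%) to an amount in cents. */
int64_t apply_basis_points(int64_t cents, int64_t rate_bp)
{
    return mul_div_i64(cents, rate_bp, 10000);
}
```

Truncation toward zero is rarely the rounding rule a financial application wants, so a real implementation would add an explicit rounding step before the division.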
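ROUNDSD/ROUNDSS (SSE4.1) with an immediate control byte implement floor, ceil, and truncate without changing the MXCSR rounding mode. Immediate bits 1:0 select the rounding mode (00 = nearest even, 01 = floor, 10 = ceil, 11 = truncate); bit 2 = 0 means use those immediate bits, while bit 2 = 1 means use the MXCSR rounding mode instead; bit 3 = 1 suppresses the precision (inexact) exception.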
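The immediate byte layout, as documented for ROUNDSD/ROUNDSS (bits 1:0 mode, bit 2 rounding source, bit 3 inexact suppression), can be written down as a small C decoder. The constant and function names are hypothetical; this models the encoding only and does not execute the instruction.

```c
/* ROUNDSD/ROUNDSS imm8 layout (SSE4.1). Names here are illustrative. */
enum {
    ROUND_NEAREST_EVEN = 0x0,  /* bits 1:0 = 00 */
    ROUND_FLOOR        = 0x1,  /* bits 1:0 = 01, toward -infinity */
    ROUND_CEIL         = 0x2,  /* bits 1:0 = 10, toward +infinity */
    ROUND_TRUNC        = 0x3,  /* bits 1:0 = 11, toward zero */
    ROUND_USE_MXCSR    = 0x4,  /* bit 2: take the mode from MXCSR.RC instead */
    ROUND_NO_INEXACT   = 0x8   /* bit 3: suppress the precision exception */
};

/* Returns the 2-bit rounding mode the instruction would use. */
unsigned effective_rounding_mode(unsigned imm8, unsigned mxcsr)
{
    if (imm8 & ROUND_USE_MXCSR)
        return (mxcsr >> 13) & 0x3u;   /* MXCSR.RC occupies bits 14:13 */
    return imm8 & 0x3u;
}
```

For example, a floor that does not raise the inexact flag uses the immediate ROUND_FLOOR | ROUND_NO_INEXACT, which is 0x09.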
x87 precision surprises: x87 keeps intermediate results in 80-bit extended precision by default, while C assumes 64-bit double. This can produce different results between compiled C and hand-written x87 assembly, and between debug and release builds (optimization may keep values in 80-bit x87 registers vs. flushing to 64-bit memory).
SQRTSD computes square root in hardware, and GCC emits it inline for sqrt() calls on doubles when optimizing. Unless -fno-math-errno is given, GCC also keeps a fallback call to the libm sqrt for negative inputs so that errno is set. SQRTSD produces a correctly rounded result (at most 0.5 ULP error). RSQRTSS provides a fast approximate reciprocal square root (~12-bit precision), useful as a starting point for Newton-Raphson refinement.
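One Newton-Raphson step on top of the RSQRTSS estimate can be sketched in C with the SSE intrinsic _mm_rsqrt_ss. This assumes an x86-64 compiler and a positive, finite, normal input; the function name is hypothetical.

```c
#include <xmmintrin.h>   /* SSE intrinsics; x86-64 only */

/* RSQRTSS estimate (about 12 bits) refined by one Newton-Raphson step. */
float rsqrt_refined(float x)
{
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  /* y ~ 1/sqrt(x) */
    return y * (1.5f - 0.5f * x * y * y);                  /* y' = y(3 - x y^2)/2 */
}
```

Each step roughly doubles the number of correct bits, so a single step brings the estimate close to, though not necessarily all the way to, full single precision.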