Chapter 14 Key Takeaways: Floating Point

- x86-64 has three floating-point subsystems: x87 (legacy 80-bit stack), SSE/SSE2 (modern default), and AVX/AVX2 (256-bit SIMD). For new scalar code, use SSE2. For 80-bit precision or hardware transcendentals, use x87. For SIMD data-parallel computation, use AVX.
- SSE2 scalar floating-point uses XMM registers in scalar mode. `MOVSS`/`ADDSS`/`MULSS` operate on the low 32 bits (float); `MOVSD`/`ADDSD`/`MULSD` operate on the low 64 bits (double). For `MOVSS`/`MOVSD`, the upper bits of the destination are zeroed on loads from memory and preserved on register-to-register moves; the arithmetic instructions leave the upper bits of the destination unchanged.
- The System V AMD64 ABI passes float/double arguments in XMM0-XMM7 (the first 8 floating-point arguments) and returns floating-point results in XMM0. Integer and floating-point arguments are assigned from separate register sets, so they can be intermixed freely in a signature.
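
As a rough illustration of the register assignment described above, consider the hypothetical C function below (the name `scale_add` and its parameters are invented for this sketch). The comments list where a System V AMD64 compiler would be expected to place each argument. The instructions named in the body are what compilers typically emit, not a guarantee.

```c
/* Hypothetical function mixing integer and floating-point parameters.
 * Expected System V AMD64 argument registers, assigned per class in order:
 *   n -> EDI   (1st integer argument)
 *   a -> XMM0  (1st floating-point argument)
 *   k -> RSI   (2nd integer argument)
 *   b -> XMM1  (2nd floating-point argument)
 * The double result comes back in XMM0. */
double scale_add(int n, double a, long k, float b) {
    /* Typical codegen: CVTSI2SD for the integer conversions, CVTSS2SD to
     * widen b, then MULSD/ADDSD for the arithmetic. */
    return (double)n * a + (double)k * (double)b;
}
```

Compiling with `gcc -O2 -S` and reading the generated assembly is a quick way to confirm the assignment on a given toolchain.
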
- `UCOMISS`/`UCOMISD` set ZF, PF, and CF based on the comparison (OF, SF, and AF are cleared). After the comparison: equal → ZF=1, CF=0, PF=0; less than → ZF=0, CF=1, PF=0; greater than → ZF=0, CF=0, PF=0; unordered (NaN involved) → ZF=1, CF=1, PF=1. Because the unordered case also sets ZF and CF, test PF (`jp`) first when a NaN operand is possible.
- Never compare computed floating-point values for exact equality (`je` after `UCOMISD`). Floating-point arithmetic involves rounding; mathematically equal values often differ in their binary representation. Use an epsilon comparison: `|a - b| < epsilon`, with epsilon chosen for the magnitude of the values involved.
- NaN (Not a Number) propagates through arithmetic: any arithmetic operation with a NaN operand produces NaN. Checking for NaN: `UCOMISD xmm0, xmm0; JP .is_nan`. This works because NaN is the only value not equal to itself.
- Conversion instructions: `CVTSI2SS`/`CVTSI2SD` convert an integer to float/double. `CVTTSS2SI`/`CVTTSD2SI` truncate float/double to an integer (C cast semantics). `CVTSS2SD`/`CVTSD2SS` convert between float and double precision.
- The `CVTT` (truncate) variants match C's `(int)x` cast behavior (round toward zero). The `CVT` variants (without T) use the current MXCSR rounding mode (default: round to nearest, ties to even). Use `CVTT` for C-compatible integer truncation.
- Denormal numbers (subnormals) can cause severe performance degradation, on the order of 40-150 extra cycles per operation, because they trigger microcode handling. Enable Flush-to-Zero (FTZ) mode via MXCSR for performance-critical code that can tolerate denormal results being replaced by zero.
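
The comparison, NaN, and conversion points above map onto C fairly directly. The following is a minimal sketch, assuming an x86-64 target where SSE2 is available. The function names are hypothetical. The intrinsics `_mm_cvttsd_si32` and `_mm_cvtsd_si32` are the standard wrappers for `CVTTSD2SI` and `CVTSD2SI`.

```c
#include <emmintrin.h>  /* SSE2: _mm_set_sd, _mm_cvtsd_si32, _mm_cvttsd_si32 */
#include <math.h>       /* fabs */
#include <stdbool.h>

/* Epsilon comparison: |a - b| < eps. The caller picks eps to suit the
 * magnitude of the values being compared. */
bool nearly_equal(double a, double b, double eps) {
    return fabs(a - b) < eps;
}

/* NaN is the only value that compares unequal to itself. An SSE2 compiler
 * typically lowers this to UCOMISD x, x plus a parity-flag test.
 * Note: -ffast-math may optimize this check away. */
bool is_nan_self_compare(double x) {
    return x != x;
}

/* CVTTSD2SI: always rounds toward zero, like the C cast (int)x. */
int to_int_truncate(double x) {
    return _mm_cvttsd_si32(_mm_set_sd(x));
}

/* CVTSD2SI: rounds according to MXCSR (round-to-nearest-even by default). */
int to_int_mxcsr(double x) {
    return _mm_cvtsd_si32(_mm_set_sd(x));
}
```

With the default MXCSR, `to_int_mxcsr(2.5)` yields 2 and `to_int_mxcsr(3.5)` yields 4 (ties go to the even integer), while `to_int_truncate(2.7)` yields 2, the same as `(int)2.7`.
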
- The MXCSR register controls SSE floating-point behavior: exception masks (which IEEE 754 exceptions are silently handled vs. raised), rounding mode (4 modes), FTZ (flush-to-zero, applied to denormal results), and DAZ (denormals-are-zero, applied to denormal inputs). MXCSR is per-thread state: changing it affects all subsequent SSE floating-point operations on that thread until it is changed back.
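
A minimal sketch of toggling FTZ and DAZ from C, assuming an x86-64 target where double arithmetic is compiled to SSE2 instructions and the CPU supports DAZ. `_mm_getcsr` and `_mm_setcsr` are the standard intrinsics for reading and writing MXCSR. The helper names and the probe function are hypothetical.

```c
#include <float.h>      /* DBL_MIN: smallest normal double */
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr */

#define MXCSR_DAZ (1u << 6)   /* denormals-are-zero: denormal inputs read as 0 */
#define MXCSR_FTZ (1u << 15)  /* flush-to-zero: denormal results replaced by 0 */

/* Hypothetical helper: set FTZ and DAZ, and return the previous MXCSR value
 * so the caller can restore it later. */
unsigned int enable_ftz_daz(void) {
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(saved | MXCSR_FTZ | MXCSR_DAZ);
    return saved;
}

/* Hypothetical probe: DBL_MIN / 3 is too small for a normal double, so the
 * result is subnormal by default and zero under FTZ. volatile forces the
 * division to happen at run time instead of being folded by the compiler. */
double third_of_min_normal(void) {
    volatile double x = DBL_MIN;
    volatile double three = 3.0;
    return x / three;
}
```

A caller would typically save the value returned by `enable_ftz_daz()`, run the hot loop, and then pass the saved value to `_mm_setcsr` so that later code sees the original settings.
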
- The x87 transcendental instructions (FSIN, FCOS, and relatives such as FSINCOS, FPTAN, and FPATAN) are the only hardware transcendental functions in the instruction set. SSE has no native sin/cos; the math library provides software implementations using polynomial approximations. For hot loops, a degree-9 minimax polynomial approximation on a range-reduced argument computes sin in ~15-20 cycles vs. 50-100 cycles for hardware FSIN.
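
To show the shape of such a polynomial evaluation, here is a sketch using Horner's rule. Two assumptions are worth stating loudly: the coefficients are plain Taylor coefficients standing in for true minimax coefficients (which would be fitted numerically and are not given here), and the argument is assumed to be already reduced to [-pi/2, pi/2]. The function name is hypothetical.

```c
/* Degree-9 odd polynomial for sin(x), evaluated with Horner's rule in x^2.
 * ASSUMPTION: Taylor coefficients (-1/3!, 1/5!, -1/7!, 1/9!) are used as a
 * stand-in; a real minimax fit would use slightly different values.
 * ASSUMPTION: the caller has already reduced x to [-pi/2, pi/2]. */
double sin_poly9(double x) {
    const double c3 = -1.0 / 6.0;
    const double c5 =  1.0 / 120.0;
    const double c7 = -1.0 / 5040.0;
    const double c9 =  1.0 / 362880.0;

    double x2 = x * x;
    double p = c9;
    p = p * x2 + c7;   /* each step is one multiply and one add */
    p = p * x2 + c5;
    p = p * x2 + c3;
    return x + x * x2 * p;   /* x + c3*x^3 + c5*x^5 + c7*x^7 + c9*x^9 */
}
```

With Taylor coefficients the error grows toward the ends of the interval (a few parts per million near pi/2); spreading that error evenly across the interval is what a minimax fit is for.
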
- Financial calculations must use integer (fixed-point) arithmetic, not floating-point. `0.01` is not exactly representable in binary; repeated additions accumulate error. Store monetary values as integers in the smallest unit needed (cents, millicents) and use 128-bit intermediate products (IMUL/IDIV with RDX:RAX) to avoid overflow.
- `ROUNDSD`/`ROUNDSS` (SSE4.1) with an immediate control byte implement floor, ceil, and truncate without changing the MXCSR rounding mode. Immediate bits 1:0 select the rounding mode (00 = nearest, 01 = floor, 10 = ceil, 11 = truncate); bit 2 = 0 means use the immediate's rounding mode, while bit 2 = 1 defers to MXCSR; bit 3 = 1 suppresses the precision (inexact) exception.
- x87 precision surprises: x87 keeps intermediate results in 80-bit extended precision by default, while C assumes 64-bit double. This can produce different results between compiled C and hand-written x87 assembly, and between debug and release builds (optimization may keep values in 80-bit x87 registers vs. flushing to 64-bit memory).
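
To make the fixed-point recommendation above concrete, the sketch below keeps money in integer cents and applies a rate through a 128-bit intermediate product. It assumes GCC or Clang, whose `__int128` extension mirrors the RDX:RAX product of a 64-bit `IMUL`. The function names, the parts-per-million rate unit, and the rounding rule (half away from zero) are choices made for this example only.

```c
#include <stdint.h>

/* Hypothetical helper: amount_cents * rate_ppm / 1,000,000, rounded half
 * away from zero. The product is formed in 128 bits so that it cannot
 * overflow even when it exceeds the int64_t range. */
int64_t apply_rate_ppm(int64_t amount_cents, int64_t rate_ppm) {
    __int128 product = (__int128)amount_cents * rate_ppm;
    product += (product >= 0) ? 500000 : -500000;  /* half of the divisor */
    return (int64_t)(product / 1000000);           /* division truncates */
}

/* For contrast: repeated floating-point addition drifts, because steps such
 * as 0.1 or 0.01 have no exact binary representation. */
double sum_repeated(double step, int n) {
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        total += step;
    }
    return total;
}
```

For example, 8.25% of 19.99 is `apply_rate_ppm(1999, 82500)`, which gives 165 cents, whereas `sum_repeated(0.1, 10)` does not compare equal to `1.0`.
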
- `SQRTSD` computes square root in hardware and is emitted directly by GCC for `sqrt()` calls on doubles rather than as a plain libm call (with default flags GCC still keeps a libm fallback path for negative inputs so that `errno` is set; `-fno-math-errno` removes it). It produces a correctly rounded result (≤ 0.5 ULP error). `RSQRTSS` provides a fast approximate reciprocal square root (~12-bit precision), useful as a starting point for Newton-Raphson refinement.
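
The Newton-Raphson refinement mentioned above can be sketched as follows, assuming an x86-64 target. `_mm_rsqrt_ss` is the standard intrinsic for `RSQRTSS`; the wrapper names are hypothetical. One iteration of y' = y * (1.5 - 0.5 * x * y * y) roughly doubles the number of correct bits.

```c
#include <xmmintrin.h>  /* _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32 */

/* Raw RSQRTSS estimate of 1/sqrt(x): about 12 bits of precision. */
float rsqrt_approx(float x) {
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

/* One Newton-Raphson step on f(y) = 1/(y*y) - x, starting from the
 * hardware estimate. Assumes x is positive and finite. */
float rsqrt_refined(float x) {
    float y = rsqrt_approx(x);
    return y * (1.5f - 0.5f * x * y * y);
}
```

The exact bits returned by `RSQRTSS` differ between CPU vendors, so code that needs bit-reproducible results across machines may prefer `SQRTSD`/`SQRTSS` followed by a division.
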