Chapter 14 Further Reading: Floating Point

The Standard

IEEE 754-2008: IEEE Standard for Floating-Point Arithmetic. The definitive specification. Section 4 defines the basic formats (binary32, binary64, binary128). Section 5 defines all operations and their rounding behavior. Section 7 defines exception handling. Available from IEEE for purchase; summaries are freely available from multiple sources.

"What Every Computer Scientist Should Know About Floating-Point Arithmetic" — David Goldberg, ACM Computing Surveys, 1991 dl.acm.org/doi/10.1145/103162.103163 The classic introductory paper. Sections 1-4 cover representation, rounding, and error analysis with the clarity that the specification itself lacks. The cancellation examples (Section 1.4) and the Kahan summation algorithm (Section 4.3) are directly relevant to the financial arithmetic case study.

Intel Documentation

Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Chapter 8 — Programming with the x87 FPU. The complete x87 reference: the register-stack model, precision control, and every instruction. The precision-control field of the FPU control word and the transcendental instruction timing tables are in this chapter.

Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Chapter 10 — Programming with Intel SSE, SSE2, and SSE3 Instructions. The SSE2 scalar and packed instructions. Section 10.2 covers MXCSR in detail, including its bit layout, default initialization, and the interaction between masked exceptions and the special-value results (Infinity, NaN, etc.).

"Intel Intrinsics Guide" — SSE, SSE2, AVX sections software.intel.com/sites/landingpage/IntrinsicsGuide/ For each SSE2 scalar instruction, the guide shows the corresponding C intrinsic: _mm_add_sd for ADDSD, _mm_cvtsd_si64 for CVTSD2SI, and so on. Useful when reading code that uses intrinsics.

Numerical Analysis

"Accuracy and Stability of Numerical Algorithms" by Nicholas Higham (2nd edition) SIAM, 2002. The graduate-level reference for numerical error analysis. Chapter 2 covers floating-point arithmetic properties. Chapter 3 covers error analysis of summation algorithms, including Kahan compensated summation, whose rounding-error bound is essentially independent of the number of terms, where the bound for naive summation grows linearly with it.

"Handbook of Floating-Point Arithmetic" by Jean-Michel Muller et al. Birkhäuser, 2010. Comprehensive coverage including correctly-rounded elementary functions (transcendentals), multiple-precision arithmetic, and hardware design considerations. Relevant to the sin() implementation case study.

Financial Arithmetic

"Falsehoods Programmers Believe About Money" — Erik Wijk, blog. Financial correctness requires not just fixed-point arithmetic but also an understanding of currency denominations (JPY has no cents), exchange-rate representation, per-jurisdiction rounding laws (some require round-half-up, others round-half-to-even), and overflow analysis for large transaction volumes.

Java BigDecimal documentation (conceptual reference) java.sun.com/j2se/1.5.0/docs/api/java/math/BigDecimal.html The design of BigDecimal shows what a correct financial arithmetic library requires: arbitrary precision, explicit rounding mode specification, and separate scale tracking. Even if you never use Java, understanding what problems it solves informs correct x86-64 fixed-point implementation.

Performance

"Avoiding Denormals" — Intel software.intel.com (Application Note). Covers denormal performance, with specific cycle counts per microarchitecture and the recommended MXCSR settings (FTZ+DAZ) for performance-critical code. Includes before/after benchmarks for audio processing code.

Agner Fog, "Instruction Tables" — agner.org/optimize/instruction_tables.pdf Per-microarchitecture latency and throughput for all floating-point instructions: ADDSS, MULSD, SQRTSD, CVTSI2SD, FSIN, etc. The FSIN/FCOS timings (50-100 cycles) vs. SSE polynomial (10-20 cycles) comparison from the case study is documented here.