Chapter 2 Further Reading: Numbers in the Machine

Core References


1. "What Every Computer Scientist Should Know About Floating-Point Arithmetic" David Goldberg, ACM Computing Surveys, Vol. 23, No. 1, March 1991 Available free online via ACM Digital Library and widely mirrored

The canonical reference on IEEE 754. Despite being from 1991, it remains current because IEEE 754 has not fundamentally changed (the 2008 revision added new formats but didn't alter the basics). Goldberg covers the representation in detail, explains rounding modes, discusses error analysis, and addresses the implementation challenges. The appendix "Differences Among IEEE 754 Implementations" documents the historical variations that led to the standardization. If you work with floating-point arithmetic at any level, this paper is required reading.


2. "Hacker's Delight" Henry S. Warren Jr., Addison-Wesley, 2nd Edition, 2012

A collection of bit manipulation algorithms with thorough mathematical justification. The early chapters cover the properties of binary arithmetic — two's complement, overflow detection, shifts and rotates — before moving on to harder material. The book explains why x & (x-1) clears the lowest set bit, why (x & y) + ((x ^ y) >> 1) computes the average of two unsigned integers without the overflow that the naive (x + y) / 2 risks, and dozens of other bit-twiddling tricks that appear in security research, kernel code, and performance-critical software. The mathematical derivations make this much more than a recipe book.


3. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1. Intel Corporation. Chapter 4: Data Types; the chapters on the x87 FPU and SSE/SSE2 (chapter numbers vary by revision)

The authoritative specification for x86-64 data types, including the exact bit layout of all integer sizes and the x87/SSE floating-point formats. Chapter 4 covers the data type hierarchy from bytes to quadwords. A later chapter covers the x87 floating-point unit (the legacy floating-point stack, now mostly replaced by SSE/AVX but still present), followed by chapters on MMX and on SSE/SSE2 floating-point; exact chapter numbers shift between manual revisions. Of the instructions used in this chapter, ADDSS and COMISS were introduced with SSE, while ADDSD and UCOMISD came with SSE2.


4. "IEEE 754-2008: Standard for Floating-Point Arithmetic" IEEE Computer Society, 2008. Available for purchase from IEEE or through institutional access

The official standard. Most programmers don't need to own this, but knowing it exists and what it specifies is useful. The standard defines five basic formats — binary32, binary64, binary128, decimal64, and decimal128 (binary16 is defined only as an interchange format) — five rounding-direction attributes, and the exact behavior of operations, including the handling of NaN, infinity, signed zero, and subnormals. The 2008 revision also added fused multiply-add (FMA) as a required operation, which has significant implications for the precision of numerical algorithms. The standard has since been revised again as IEEE 754-2019, largely a clarification of the 2008 text.


Supplementary Reading


5. "The Floating-Point Guide" (floating-point-gui.de) Michael Borgwardt. Free online at floating-point-gui.de

A well-written online guide explaining floating-point to working programmers. Covers what every programmer needs to know without going as deep as Goldberg. Particularly good explanations of why floating-point comparisons need epsilon-based approaches and when decimal arithmetic is appropriate. The FAQ section addresses the most common misconceptions.


6. "Two's Complement: Why Computers Use It" Multiple sources. Search for rigorous mathematical treatments.

The mathematical explanation of two's complement is worth seeking out in detail. The key insight: two's complement is not "a way to represent negative numbers" but rather "the natural interpretation of modular arithmetic." When you count past the maximum value and wrap around, the resulting sequence (interpreted with the MSB having negative weight) produces the two's complement values. The computer doesn't "know" about signs — it does modular arithmetic, and signed arithmetic is just one valid interpretation of the results.


7. "Representations of Fixed-Point Numbers in C" (cppreference.com/fixed_point) cppreference.com documentation

For financial and embedded applications that use fixed-point arithmetic (integer arithmetic with an implied radix point), the <stdfix.h> fixed-point types specified in ISO/IEC TR 18037 are the reference; they are not part of the base C standard, but several compilers provide them as extensions for embedded targets. Fixed-point is the correct choice when you need exact fractional steps — decimal-like precision — without IEEE 754's approximation issues.


8. "Endianness — The NUXI Problem" Rob Pike, Bell Labs Technical Journal, 1980. Also: historical discussion in Dragon Book

The original documentation of endianness issues from the early days of Unix interoperability. The term "NUXI problem" comes from a system where the word "UNIX" stored as two 16-bit values would read as "NUXI" on a system with different byte order. While the specific systems described are ancient, the explanation of why byte order matters for data interchange is historically clear and still valid.


9. "Why Floating-Point Numbers Are Not Real Numbers" ACM SIGPLAN Notices, various authors

A collection of short, sharp explanations of the differences between mathematical real numbers and IEEE 754 floats. The key point: floats violate several properties of real arithmetic. Addition is not associative: (a + b) + c ≠ a + (b + c) in general. This has significant implications for parallel numerical algorithms where the order of operations is non-deterministic. Understanding that floats are not reals is the foundation for reasoning about numerical stability.


10. "Numeric Recipes: The Art of Scientific Computing" Press, Teukolsky, Vetterling, Flannery. Cambridge University Press, 3rd Edition, 2007

For readers who will use floating-point for scientific computation rather than financial calculation, Numerical Recipes provides practical algorithms for solving differential equations, matrix operations, FFTs, and statistical computations with appropriate attention to numerical stability. Chapter 1 covers floating-point arithmetic from the practitioner's perspective. The code in the third edition is C++ (earlier editions used C and Fortran), but the concepts apply directly to assembly-level numerical work.