Case Study 14.1: Implementing sin() Without the Math Library
The Challenge
The x87 FPU has hardware FSIN. SSE2 does not have a transcendental function. When you call sin() in C, the compiler links against libm, which provides a software implementation. This case study compares three approaches: hardware x87, SSE polynomial approximation, and table lookup — and measures their precision and speed.
Approach 1: x87 FSIN
; double sin_x87(double x)
; Uses hardware x87 FSIN instruction
; XMM0 = x (input, following SSE ABI)
; Returns result in XMM0 (output, following SSE ABI)
; Internally uses x87 for the computation
section .text
global sin_x87
sin_x87:
; Move double from XMM0 to x87 stack via memory
sub rsp, 8
movsd [rsp], xmm0 ; store x to stack
fldl [rsp] ; push x onto x87 stack as 80-bit extended
fsin ; ST(0) = sin(ST(0)) in 80-bit precision
fstpl [rsp] ; pop and store as 64-bit double
movsd xmm0, [rsp] ; load result back into XMM0
add rsp, 8
ret
Precision: Better than IEEE double (computed in 80-bit extended precision, then rounded to 64-bit). For most inputs, fully accurate to all 53 significant bits of a double.
Speed: Approximately 50-100 clock cycles on modern Intel processors. The FSIN instruction itself is microcode and takes 20-100 cycles depending on the input range.
Limitation: The x87 FSIN is defined for inputs in the range [-2^63, 2^63]. For very large inputs (|x| >> 2π), it performs internal range reduction at 80-bit precision, but the reduction itself may lose significant bits for very large |x|. For |x| > 2^20 or so, a higher-precision range reduction is needed.
Approach 2: SSE2 Polynomial Approximation (Minimax)
A degree-9 Taylor-derived minimax polynomial for sin(x) on [-π/2, π/2]:
sin(x) ≈ x - x³/6 + x⁵/120 - x⁷/5040 + x⁹/362880
Using Horner form with minimax-optimized coefficients:
; double sin_poly(double x) — valid for |x| <= π/2
; This is the core computation; a full sin() function needs range reduction first
; XMM0 = x, returns in XMM0
section .data
align 16
; Horner form coefficients for sin(x) = x*(1 + x²*(c3 + x²*(c5 + x²*(c7 + x²*c9))))
sin_coeff_c9: dq -2.7557314351801666e-6 ; -1/9!
sin_coeff_c7: dq 1.9841269841269841e-4 ; 1/7!
sin_coeff_c5: dq -8.3333333333333329e-3 ; -1/5!
sin_coeff_c3: dq 1.6666666666666667e-1 ; 1/3!
section .text
global sin_poly
sin_poly:
movsd xmm1, xmm0 ; xmm1 = x
mulsd xmm1, xmm1 ; xmm1 = x²
; Horner: start from highest degree
movsd xmm2, [rel sin_coeff_c9] ; xmm2 = c9
mulsd xmm2, xmm1 ; xmm2 = c9 * x²
addsd xmm2, [rel sin_coeff_c7] ; xmm2 = c7 + c9*x²
mulsd xmm2, xmm1 ; xmm2 = (c7 + c9*x²) * x²
addsd xmm2, [rel sin_coeff_c5] ; xmm2 = c5 + x²*(c7 + c9*x²)
mulsd xmm2, xmm1 ; xmm2 *= x²
addsd xmm2, [rel sin_coeff_c3] ; xmm2 = c3 + x²*(...)
mulsd xmm2, xmm1 ; xmm2 *= x²
addsd xmm2, [rel one] ; xmm2 = 1 + x²*(...)
mulsd xmm0, xmm2 ; xmm0 = x * (1 + x²*(...)) = sin(x)
ret
section .data
one: dq 1.0
Speed: ~15-20 clock cycles (7 multiplies + 4 adds + loads). Significantly faster than hardware FSIN.
Precision: Maximum error ≈ 1 ULP (Unit in the Last Place) for |x| ≤ π/2. Within the Taylor series approximation error of the polynomial degree chosen.
Range reduction (not shown above): A full sin(x) function must first reduce x modulo 2π and determine the quadrant. This requires a separate 128-bit or extended-precision range reduction for large |x|.
Approach 3: Table Lookup with Linear Interpolation
; sin_table: precomputed sin values at 1024 points in [0, 2π)
; Index i → sin(i * 2π / 1024)
; sin_table_lookup(double x) for 0 ≤ x < 2π
; Interpolates between table entries
section .data
align 8
sin_2pi_over_1024: dq 0.006135923151542 ; 2π / 1024 (index step)
sin_inv_step: dq 163.3986928... ; 1024 / 2π (for index computation)
N_TABLE equ 1024
section .text
sin_table_lookup:
; Compute fractional index: idx = x * (1024 / 2π)
mulsd xmm0, [rel sin_inv_step] ; xmm0 = x * (N / 2π)
cvttsd2si rax, xmm0 ; integer part of index
movsd xmm2, xmm0
cvtsi2sd xmm1, rax ; convert back to double for fractional part
subsd xmm2, xmm1 ; xmm2 = fractional part of index
; Get integer index (modulo table size)
and rax, 1023 ; rax = idx mod 1024 (power of 2: use AND)
lea rcx, [rax + 1]
and rcx, 1023 ; next index (wrap)
; Load sin values at i and i+1
lea rdx, [rel sin_table]
movsd xmm3, [rdx + rax*8] ; sin[i]
movsd xmm4, [rdx + rcx*8] ; sin[i+1]
; Linear interpolation: sin[i] + frac * (sin[i+1] - sin[i])
subsd xmm4, xmm3 ; sin[i+1] - sin[i]
mulsd xmm4, xmm2 ; * fractional part
addsd xmm0, xmm3 ; WRONG: xmm0 still has scaled index
; Fix: movsd xmm0, xmm3; addsd xmm0, xmm4
movsd xmm0, xmm3
addsd xmm0, xmm4 ; interpolated result
ret
Speed: ~8-10 cycles if sin_table is in L1 cache. Cache-dependent behavior for large workloads (table is 8KB, may not fit in L1).
Precision: Error ≈ 0.5 * (2π/1024)² / 2 * max|sin''| ≈ 1.9×10^-5. Only about 5 significant decimal digits — much worse than polynomial or hardware approaches. More table entries improve this at the cost of more memory.
Performance Comparison
| Approach | Cycles | Precision | Notes |
|---|---|---|---|
| x87 FSIN | 50-100 | ~19 decimal digits | Hardware, with x87 overhead |
| SSE polynomial (degree 9) | 15-20 | ~15 decimal digits | Fast, needs range reduction |
| libm sin() | 25-50 | ~15 decimal digits | Well-optimized, handles all inputs |
| Table lookup (1024 entries) | 8-10 | ~5 decimal digits | Fast but inaccurate |
What libm Actually Does
The glibc sin() implementation:
1. Range reduction using 256-bit precision arithmetic (for correct results with very large inputs)
2. A 12th-degree minimax polynomial for the core computation (within [-π/8, π/8])
3. Quadrant selection and sign adjustment
The polynomial in libm is longer than our 9-term version and achieves < 1 ULP error across the full double range. The range reduction step is the most complex part and accounts for most of the cycle count.
Practical Takeaway
For most applications, use the system's sin(). For hot loops where sin is called billions of times with known input range (e.g., game physics, audio synthesis), the 15-20 cycle polynomial approximation is worth the implementation effort. The table lookup is appropriate only for applications that can tolerate the limited precision (shader pre-computation, certain signal processing approximations).
The x87 FSIN is appropriate when you need the maximum precision from hardware instruction and can tolerate the 50-100 cycle cost — or when you are programming without access to the math library (embedded, kernel, early boot code).