Case Study 14.1: Implementing sin() Without the Math Library

The Challenge

The x87 FPU has hardware FSIN. SSE2 does not have a transcendental function. When you call sin() in C, the compiler links against libm, which provides a software implementation. This case study compares three approaches: hardware x87, SSE polynomial approximation, and table lookup — and measures their precision and speed.

Approach 1: x87 FSIN

; double sin_x87(double x)
; Uses hardware x87 FSIN instruction
; XMM0 = x (input, following SSE ABI)
; Returns result in XMM0 (output, following SSE ABI)
; Internally uses x87 for the computation

section .text
global sin_x87

sin_x87:
    ; Move double from XMM0 to x87 stack via memory
    sub  rsp, 8
    movsd [rsp], xmm0      ; store x to stack
    fldl  [rsp]            ; push x onto x87 stack as 80-bit extended
    fsin                   ; ST(0) = sin(ST(0)) in 80-bit precision
    fstpl [rsp]            ; pop and store as 64-bit double
    movsd xmm0, [rsp]      ; load result back into XMM0
    add  rsp, 8
    ret

Precision: Better than IEEE double (computed in 80-bit extended precision, then rounded to 64-bit). For most inputs, fully accurate to all 53 significant bits of a double.

Speed: Approximately 50-100 clock cycles on modern Intel processors. The FSIN instruction itself is microcode and takes 20-100 cycles depending on the input range.

Limitation: The x87 FSIN is defined for inputs in the range [-2^63, 2^63]. For very large inputs (|x| >> 2π), it performs internal range reduction at 80-bit precision, but the reduction itself may lose significant bits for very large |x|. For |x| > 2^20 or so, a higher-precision range reduction is needed.

Approach 2: SSE2 Polynomial Approximation (Minimax)

A degree-9 Taylor-derived minimax polynomial for sin(x) on [-π/2, π/2]:

sin(x) ≈ x - x³/6 + x⁵/120 - x⁷/5040 + x⁹/362880

Using Horner form with minimax-optimized coefficients:

; double sin_poly(double x) — valid for |x| <= π/2
; This is the core computation; a full sin() function needs range reduction first
; XMM0 = x, returns in XMM0

section .data
align 16
; Horner form coefficients for sin(x) = x*(1 + x²*(c3 + x²*(c5 + x²*(c7 + x²*c9))))
sin_coeff_c9: dq -2.7557314351801666e-6   ; -1/9!
sin_coeff_c7: dq  1.9841269841269841e-4   ; 1/7!
sin_coeff_c5: dq -8.3333333333333329e-3   ; -1/5!
sin_coeff_c3: dq  1.6666666666666667e-1   ; 1/3!

section .text
global sin_poly

sin_poly:
    movsd xmm1, xmm0           ; xmm1 = x
    mulsd xmm1, xmm1           ; xmm1 = x²

    ; Horner: start from highest degree
    movsd xmm2, [rel sin_coeff_c9]   ; xmm2 = c9
    mulsd xmm2, xmm1                  ; xmm2 = c9 * x²
    addsd xmm2, [rel sin_coeff_c7]   ; xmm2 = c7 + c9*x²
    mulsd xmm2, xmm1                  ; xmm2 = (c7 + c9*x²) * x²
    addsd xmm2, [rel sin_coeff_c5]   ; xmm2 = c5 + x²*(c7 + c9*x²)
    mulsd xmm2, xmm1                  ; xmm2 *= x²
    addsd xmm2, [rel sin_coeff_c3]   ; xmm2 = c3 + x²*(...)
    mulsd xmm2, xmm1                  ; xmm2 *= x²
    addsd xmm2, [rel one]            ; xmm2 = 1 + x²*(...)
    mulsd xmm0, xmm2                  ; xmm0 = x * (1 + x²*(...)) = sin(x)
    ret

section .data
one: dq 1.0

Speed: ~15-20 clock cycles (7 multiplies + 4 adds + loads). Significantly faster than hardware FSIN.

Precision: Maximum error ≈ 1 ULP (Unit in the Last Place) for |x| ≤ π/2. Within the Taylor series approximation error of the polynomial degree chosen.

Range reduction (not shown above): A full sin(x) function must first reduce x modulo 2π and determine the quadrant. This requires a separate 128-bit or extended-precision range reduction for large |x|.

Approach 3: Table Lookup with Linear Interpolation

; sin_table: precomputed sin values at 1024 points in [0, 2π)
; Index i → sin(i * 2π / 1024)

; sin_table_lookup(double x) for 0 ≤ x < 2π
; Interpolates between table entries

section .data
align 8
sin_2pi_over_1024: dq 0.006135923151542    ; 2π / 1024 (index step)
sin_inv_step:      dq 163.3986928...       ; 1024 / 2π (for index computation)
N_TABLE equ 1024

section .text
sin_table_lookup:
    ; Compute fractional index: idx = x * (1024 / 2π)
    mulsd xmm0, [rel sin_inv_step]  ; xmm0 = x * (N / 2π)
    cvttsd2si rax, xmm0             ; integer part of index
    movsd xmm2, xmm0
    cvtsi2sd xmm1, rax              ; convert back to double for fractional part
    subsd xmm2, xmm1                ; xmm2 = fractional part of index

    ; Get integer index (modulo table size)
    and  rax, 1023                  ; rax = idx mod 1024 (power of 2: use AND)
    lea  rcx, [rax + 1]
    and  rcx, 1023                  ; next index (wrap)

    ; Load sin values at i and i+1
    lea  rdx, [rel sin_table]
    movsd xmm3, [rdx + rax*8]      ; sin[i]
    movsd xmm4, [rdx + rcx*8]      ; sin[i+1]

    ; Linear interpolation: sin[i] + frac * (sin[i+1] - sin[i])
    subsd xmm4, xmm3                ; sin[i+1] - sin[i]
    mulsd xmm4, xmm2                ; * fractional part
    addsd xmm0, xmm3               ; WRONG: xmm0 still has scaled index
    ; Fix: movsd xmm0, xmm3; addsd xmm0, xmm4
    movsd xmm0, xmm3
    addsd xmm0, xmm4               ; interpolated result
    ret

Speed: ~8-10 cycles if sin_table is in L1 cache. Cache-dependent behavior for large workloads (table is 8KB, may not fit in L1).

Precision: Error ≈ 0.5 * (2π/1024)² / 2 * max|sin''| ≈ 1.9×10^-5. Only about 5 significant decimal digits — much worse than polynomial or hardware approaches. More table entries improve this at the cost of more memory.

Performance Comparison

Approach Cycles Precision Notes
x87 FSIN 50-100 ~19 decimal digits Hardware, with x87 overhead
SSE polynomial (degree 9) 15-20 ~15 decimal digits Fast, needs range reduction
libm sin() 25-50 ~15 decimal digits Well-optimized, handles all inputs
Table lookup (1024 entries) 8-10 ~5 decimal digits Fast but inaccurate

What libm Actually Does

The glibc sin() implementation: 1. Range reduction using 256-bit precision arithmetic (for correct results with very large inputs) 2. A 12th-degree minimax polynomial for the core computation (within [-π/8, π/8]) 3. Quadrant selection and sign adjustment

The polynomial in libm is longer than our 9-term version and achieves < 1 ULP error across the full double range. The range reduction step is the most complex part and accounts for most of the cycle count.

Practical Takeaway

For most applications, use the system's sin(). For hot loops where sin is called billions of times with known input range (e.g., game physics, audio synthesis), the 15-20 cycle polynomial approximation is worth the implementation effort. The table lookup is appropriate only for applications that can tolerate the limited precision (shader pre-computation, certain signal processing approximations).

The x87 FSIN is appropriate when you need the maximum precision from hardware instruction and can tolerate the 50-100 cycle cost — or when you are programming without access to the math library (embedded, kernel, early boot code).