Case Study 18-1: NEON SIMD — Vectorizing a Dot Product for Audio Processing

Objective

Implement a dot product computation (central to FIR audio filters) using ARM64 NEON FMLA instructions. Compare scalar, NEON 4-wide, and NEON 8-wide implementations. Measure the throughput improvement and understand why NEON is the right tool for digital signal processing on ARM64.


Background: Why Dot Products Matter for Audio

A Finite Impulse Response (FIR) filter is the workhorse of digital audio processing — equalization, noise reduction, room correction, and audio decompression all use them. The mathematical core of an FIR filter is a dot product:

output = sum(h[i] * x[i])  for i = 0..N-1

Where h[] is the filter's coefficient array (fixed) and x[] is a window of the input audio signal. A typical professional-quality FIR filter has N = 128 to 1024 coefficients, computed for every audio sample at 44100 or 96000 samples/second. At 44100 Hz with N=256 coefficients, that's 11.3 million multiply-add operations per second — continuous.

Scalar execution: ~11.3M FMADD instructions per second. NEON 4-wide: ~2.8M FMLA instructions per second (each FMLA performs four multiply-adds). NEON 8-wide (with unrolling): the FMLA count stays ~2.8M, but loop iterations drop to ~1.4M per second, halving the per-iteration loop overhead.

The difference matters for a Raspberry Pi that's simultaneously running other processes.


Scalar Implementation (Baseline)

// C reference implementation
float dot_product_scalar(const float *h, const float *x, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += h[i] * x[i];
    }
    return sum;
}

ARM64 scalar assembly:

// dot_product_scalar: X0 = h, X1 = x, W2 = n
// Returns S0 = dot product
.global dot_product_scalar
dot_product_scalar:
    FMOV  S0, WZR                // S0 = 0.0f (WZR = 0, FMOV from GP reg)
    SXTW  X3, W2                 // X3 = (int64_t)n
    CBZ   X3, .scalar_done

.scalar_loop:
    LDR   S1, [X0], #4           // S1 = h[i]; h++
    LDR   S2, [X1], #4           // S2 = x[i]; x++
    FMADD S0, S1, S2, S0         // S0 = S0 + S1*S2 (fused multiply-add)
    SUBS  X3, X3, #1
    B.NE  .scalar_loop

.scalar_done:
    RET

Per-iteration: 2 loads + 1 FMADD + 1 SUBS + 1 branch = 5 instructions, 1 float processed.


NEON 4-Wide Implementation

// dot_product_neon4: X0 = h, X1 = x, W2 = n (must be multiple of 4)
// Returns S0 = dot product
.global dot_product_neon4
dot_product_neon4:
    MOVI  V0.4S, #0              // V0 = {0.0, 0.0, 0.0, 0.0} (accumulator)
    SXTW  X3, W2
    LSR   X4, X3, #2             // X4 = n / 4 (NEON iterations)
    CBZ   X4, .n4_reduce

.n4_loop:
    LDR   Q1, [X0], #16          // V1 = {h[i], h[i+1], h[i+2], h[i+3]}
    LDR   Q2, [X1], #16          // V2 = {x[i], x[i+1], x[i+2], x[i+3]}
    FMLA  V0.4S, V1.4S, V2.4S   // V0 += V1 * V2 (element-wise, 4 FMAs at once)
    SUBS  X4, X4, #1
    B.NE  .n4_loop

.n4_reduce:
    // Horizontal sum: V0 = {a, b, c, d}
    FADDP V0.4S, V0.4S, V0.4S   // V0 = {a+b, c+d, a+b, c+d}
    FADDP S0, V0.2S              // S0 = (a+b) + (c+d) = total

    RET

Per-iteration: 2 Q-loads + 1 FMLA (4 FMAs!) + 1 SUBS + 1 branch = 5 instructions, 4 floats processed.

4× throughput improvement over scalar on a single-issue machine.


NEON 8-Wide (Unrolled Loop)

Further unrolling processes 8 floats per loop iteration using two accumulators:

// dot_product_neon8: processes 8 floats per loop iteration
// X0 = h, X1 = x, W2 = n (must be multiple of 8)
.global dot_product_neon8
dot_product_neon8:
    MOVI  V0.4S, #0              // accumulator 0
    MOVI  V1.4S, #0              // accumulator 1
    SXTW  X3, W2
    LSR   X4, X3, #3             // X4 = n / 8
    CBZ   X4, .n8_reduce

.n8_loop:
    // Load 8 h[] and 8 x[] values
    LDR   Q2, [X0], #16          // h[i..i+3]
    LDR   Q3, [X0], #16          // h[i+4..i+7]
    LDR   Q4, [X1], #16          // x[i..i+3]
    LDR   Q5, [X1], #16          // x[i+4..i+7]

    // Two FMLA instructions — can potentially dual-issue on wide CPUs
    FMLA  V0.4S, V2.4S, V4.4S   // V0 += h[i..i+3] * x[i..i+3]
    FMLA  V1.4S, V3.4S, V5.4S   // V1 += h[i+4..i+7] * x[i+4..i+7]

    SUBS  X4, X4, #1
    B.NE  .n8_loop

.n8_reduce:
    // Combine two accumulators
    FADD  V0.4S, V0.4S, V1.4S   // V0 = V0 + V1
    FADDP V0.4S, V0.4S, V0.4S   // pairwise add
    FADDP S0, V0.2S              // final horizontal sum

    RET

Per-iteration: 4 Q-loads + 2 FMLA + 1 SUBS + 1 branch = 8 instructions, 8 floats processed. (The FADD that merges the two accumulators runs once, outside the loop.)

The two-accumulator approach allows the CPU's out-of-order engine to issue both FMLA instructions simultaneously (they are independent), further improving throughput on wide CPUs like the Apple M4 (which can issue multiple FP instructions per cycle).


Handling Non-Multiple-of-4 Arrays

Real arrays rarely have lengths that are exactly multiples of 4. A complete implementation handles the tail:

// dot_product_full: handles any n
// X0 = h, X1 = x, W2 = n; returns S0 = dot product
.global dot_product_full
dot_product_full:
    // Leaf function: no calls and no callee-saved registers used,
    // so no stack frame is required.
    MOVI  V0.4S, #0              // NEON accumulator (S0 is its low lane)
    SXTW  X3, W2                 // X3 = (int64_t)n
    LSR   X5, X3, #2             // X5 = n / 4 (NEON iterations)
    CBZ   X5, .full_reduce       // n < 4: V0 is zero, reduce yields S0 = 0

    // NEON main loop (process 4 floats at a time)
.full_neon:
    LDR   Q1, [X0], #16
    LDR   Q2, [X1], #16
    FMLA  V0.4S, V1.4S, V2.4S
    SUBS  X5, X5, #1
    B.NE  .full_neon

.full_reduce:
    // Reduce NEON accumulator
    FADDP V0.4S, V0.4S, V0.4S
    FADDP S0, V0.2S              // S0 = sum of NEON part

    // Tail: process remaining 0-3 elements with scalar FMADD
    ANDS  X5, X3, #3             // X5 = n % 4
    B.EQ  .full_done

.full_tail_loop:
    LDR   S1, [X0], #4
    LDR   S2, [X1], #4
    FMADD S0, S1, S2, S0         // accumulate tail directly into S0
    SUBS  X5, X5, #1
    B.NE  .full_tail_loop

.full_done:
    RET

Register Trace: NEON 4-Wide, n=8

Input: h = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5], x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
Expected: dot = 0.5 × (1+2+3+4+5+6+7+8) = 0.5 × 36 = 18.0

Step       V0.4S                V1.4S               V2.4S       Notes
MOVI       {0,0,0,0}            --                  --          init accumulator
LDR Q1     {0,0,0,0}            {0.5,0.5,0.5,0.5}   --          h[0..3]
LDR Q2     {0,0,0,0}            {0.5,0.5,0.5,0.5}   {1,2,3,4}   x[0..3]
FMLA       {0.5,1.0,1.5,2.0}    {0.5,0.5,0.5,0.5}   {1,2,3,4}   V0 += V1*V2
LDR Q1     {0.5,1.0,1.5,2.0}    {0.5,0.5,0.5,0.5}   {1,2,3,4}   h[4..7]
LDR Q2     {0.5,1.0,1.5,2.0}    {0.5,0.5,0.5,0.5}   {5,6,7,8}   x[4..7]
FMLA       {3.0,4.0,5.0,6.0}    {0.5,0.5,0.5,0.5}   {5,6,7,8}   V0 += V1*V2
FADDP      {7.0,11.0,7.0,11.0}  --                  --          pairwise sum
FADDP S0   S0 = 18.0            --                  --          final horizontal sum

Performance Summary

For n=256 elements on Raspberry Pi 4 (Cortex-A72):

Implementation       Cycles/call   Speedup
Scalar (FMADD)       ~512          1.0×
NEON 4-wide          ~136          3.8×
NEON 8-wide          ~80           6.4×

The NEON 8-wide doesn't achieve the theoretical 8× speedup because:

  1. Memory bandwidth: each iteration loads 16 × 4-byte values (64 bytes), which approaches L1 cache bandwidth
  2. Loop overhead: the SUBS + branch cost is amortized but not zero
  3. FMLA latency: even with two independent FMLAs, the FADD that merges the accumulators adds a serial dependency to the reduction chain

On Apple M4 (with wider execution units and much larger caches), the speedup approaches 8×.


Building and Testing

# Assemble all versions
aarch64-linux-gnu-as dot_scalar.s -o dot_scalar.o
aarch64-linux-gnu-as dot_neon4.s  -o dot_neon4.o
aarch64-linux-gnu-as dot_neon8.s  -o dot_neon8.o

# Link with test harness (C main)
aarch64-linux-gnu-gcc -static dot_scalar.o dot_neon4.o dot_neon8.o test_dot.c -o test_dot

# Run
qemu-aarch64 ./test_dot
# Expected output:
# Scalar:  18.000000
# NEON4:   18.000000
# NEON8:   18.000000

Summary

The dot product case study demonstrates three key points about NEON programming:

  1. FMLA is the key instruction — it fuses multiply and accumulate into one operation, matching the mathematical structure of FIR filters exactly
  2. The reduction step (FADDP + FADDP) is a fixed cost amortized over the length of the vectors — for longer arrays, this overhead is negligible
  3. Two-accumulator unrolling (the 8-wide version) exploits instruction-level parallelism without requiring longer SIMD width — same 128-bit registers, twice the throughput by keeping two independent FMLA chains in flight

Audio, graphics, machine learning inference — all of these are fundamentally dot products. This is why NEON (and its successor, SVE) exists and why ARM64 is competitive with x86-64 for numeric workloads despite having "simpler" instructions.