Case Study 18-1: NEON SIMD — Vectorizing a Dot Product for Audio Processing
Objective
Implement a dot product computation (central to FIR audio filters) using ARM64 NEON FMLA instructions. Compare scalar, NEON 4-wide, and NEON 8-wide implementations. Measure the throughput improvement and understand why NEON is the right tool for digital signal processing on ARM64.
Background: Why Dot Products Matter for Audio
A Finite Impulse Response (FIR) filter is the workhorse of digital audio processing — equalization, noise reduction, room correction, and audio decompression all use them. The mathematical core of an FIR filter is a dot product:
output = sum(h[i] * x[i]) for i = 0..N-1
Where h[] is the filter's coefficient array (fixed) and x[] is a window of the input audio signal. A typical professional-quality FIR filter has N = 128 to 1024 coefficients, computed for every audio sample at 44100 or 96000 samples/second. At 44100 Hz with N=256 coefficients, that's 11.3 million multiply-add operations per second — continuous.
Scalar execution: 11.3M FMADD instructions per second (or 11.3M FMUL + 11.3M FADD without fusion). NEON 4-wide: ~2.8M FMLA instructions per second. NEON 8-wide (with unrolling): the same ~2.8M FMLAs, but only ~1.4M loop iterations, halving the loop overhead.
The difference matters for a Raspberry Pi that's simultaneously running other processes.
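In C, the per-sample structure looks like this minimal sketch (the function name `fir_process` is made up for this example, not from a real library; it uses the correlation form h[i]*x[s+i], matching the dot-product formulation above):

```c
/* Illustrative sketch: an N-tap FIR filter produces each output sample
 * as the dot product of the fixed coefficients h[] with a sliding
 * window of the input x[]. The caller must supply at least
 * n_out + n_taps - 1 valid input samples. */
void fir_process(const float *h, int n_taps,
                 const float *x, float *y, int n_out) {
    for (int s = 0; s < n_out; s++) {
        float sum = 0.0f;
        for (int i = 0; i < n_taps; i++)   /* the dot product */
            sum += h[i] * x[s + i];
        y[s] = sum;
    }
}
```

Every optimization in this case study targets that inner loop, which runs once per output sample.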
Scalar Implementation (Baseline)
// C reference implementation
float dot_product_scalar(const float *h, const float *x, int n) {
float sum = 0.0f;
for (int i = 0; i < n; i++) {
sum += h[i] * x[i];
}
return sum;
}
ARM64 scalar assembly:
// dot_product_scalar: X0 = h, X1 = x, W2 = n
// Returns S0 = dot product
.global dot_product_scalar
dot_product_scalar:
FMOV S0, WZR // S0 = 0.0f (WZR = 0, FMOV from GP reg)
SXTW X3, W2 // X3 = (int64_t)n
CBZ X3, .scalar_done
.scalar_loop:
LDR S1, [X0], #4 // S1 = h[i]; h++
LDR S2, [X1], #4 // S2 = x[i]; x++
FMADD S0, S1, S2, S0 // S0 = S0 + S1*S2 (fused multiply-add)
SUBS X3, X3, #1
B.NE .scalar_loop
.scalar_done:
RET
Per-iteration: 2 loads + 1 FMADD + 1 SUBS + 1 branch = 5 instructions, 1 float processed.
NEON 4-Wide Implementation
// dot_product_neon4: X0 = h, X1 = x, W2 = n (must be multiple of 4)
// Returns S0 = dot product
.global dot_product_neon4
dot_product_neon4:
MOVI V0.4S, #0 // V0 = {0.0, 0.0, 0.0, 0.0} (accumulator)
SXTW X3, W2
LSR X4, X3, #2 // X4 = n / 4 (NEON iterations)
CBZ X4, .n4_reduce
.n4_loop:
LDR Q1, [X0], #16 // V1 = {h[i], h[i+1], h[i+2], h[i+3]}
LDR Q2, [X1], #16 // V2 = {x[i], x[i+1], x[i+2], x[i+3]}
FMLA V0.4S, V1.4S, V2.4S // V0 += V1 * V2 (element-wise, 4 FMAs at once)
SUBS X4, X4, #1
B.NE .n4_loop
.n4_reduce:
// Horizontal sum: V0 = {a, b, c, d}
FADDP V0.4S, V0.4S, V0.4S // V0 = {a+b, c+d, a+b, c+d}
FADDP S0, V0.2S // S0 = (a+b) + (c+d) = total
RET
Per-iteration: 2 Q-loads + 1 FMLA (4 FMAs!) + 1 SUBS + 1 branch = 5 instructions, 4 floats processed.
4× throughput improvement over scalar on a single-issue machine.
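The same algorithm can be modeled in portable C: four independent partial sums (one per lane) updated together, then a horizontal reduction at the end. This is a scalar model of the vector code, not NEON intrinsics — but it mirrors the FMLA/FADDP structure exactly, and at -O2/-O3 compilers will often auto-vectorize it into the very loop shown above:

```c
/* Portable C model of the NEON 4-wide algorithm. acc[] models V0.4S.
 * n must be a multiple of 4, matching the assembly's precondition. */
float dot_product_4lane(const float *h, const float *x, int n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int i = 0; i < n; i += 4) {
        for (int lane = 0; lane < 4; lane++)    /* models one FMLA */
            acc[lane] += h[i + lane] * x[i + lane];
    }
    /* Horizontal reduction, models FADDP + FADDP: (a+b) + (c+d) */
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```

Note that the lane-wise sums accumulate in a different order than the scalar loop, so results can differ by a rounding ulp or two — acceptable for audio, but worth knowing.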
NEON 8-Wide (Unrolled Loop)
Further unrolling processes 8 floats per loop iteration using two accumulators:
// dot_product_neon8: processes 8 floats per loop iteration
// X0 = h, X1 = x, W2 = n (must be multiple of 8)
.global dot_product_neon8
dot_product_neon8:
MOVI V0.4S, #0 // accumulator 0
MOVI V1.4S, #0 // accumulator 1
SXTW X3, W2
LSR X4, X3, #3 // X4 = n / 8
CBZ X4, .n8_reduce
.n8_loop:
// Load 8 h[] and 8 x[] values
LDR Q2, [X0], #16 // h[i..i+3]
LDR Q3, [X0], #16 // h[i+4..i+7]
LDR Q4, [X1], #16 // x[i..i+3]
LDR Q5, [X1], #16 // x[i+4..i+7]
// Two FMLA instructions — can potentially dual-issue on wide CPUs
FMLA V0.4S, V2.4S, V4.4S // V0 += h[i..i+3] * x[i..i+3]
FMLA V1.4S, V3.4S, V5.4S // V1 += h[i+4..i+7] * x[i+4..i+7]
SUBS X4, X4, #1
B.NE .n8_loop
.n8_reduce:
// Combine two accumulators
FADD V0.4S, V0.4S, V1.4S // V0 = V0 + V1
FADDP V0.4S, V0.4S, V0.4S // pairwise add
FADDP S0, V0.2S // final horizontal sum
RET
Per-iteration: 4 Q-loads + 2 FMLA + 1 SUBS + 1 branch = 8 instructions, 8 floats processed (the FADD that combines the two accumulators runs once, outside the loop).
The two-accumulator approach allows the CPU's out-of-order engine to issue both FMLA instructions simultaneously (they are independent), further improving throughput on wide CPUs like the Apple M4 (which can issue multiple FP instructions per cycle).
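The two-accumulator idea also has a direct portable C model: two independent 4-lane partial-sum arrays, combined only once at the end. The key property — no dependency between the two per-iteration updates — is visible in the source:

```c
/* Portable C model of the two-accumulator 8-wide unroll.
 * acc0[]/acc1[] model V0.4S/V1.4S. n must be a multiple of 8. */
float dot_product_2acc(const float *h, const float *x, int n) {
    float acc0[4] = {0}, acc1[4] = {0};
    for (int i = 0; i < n; i += 8) {
        for (int lane = 0; lane < 4; lane++) {
            acc0[lane] += h[i + lane] * x[i + lane];          /* FMLA #1 */
            acc1[lane] += h[i + 4 + lane] * x[i + 4 + lane];  /* FMLA #2 */
        }
    }
    float acc[4];                                  /* models the final FADD */
    for (int lane = 0; lane < 4; lane++)
        acc[lane] = acc0[lane] + acc1[lane];
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);  /* FADDP + FADDP */
}
```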
Handling Non-Multiple-of-4 Arrays
Real arrays rarely have lengths that are exactly multiples of 4. A complete implementation handles the tail:
// dot_product_full: X0 = h, X1 = x, W2 = n (any n >= 0)
// Returns S0 = dot product. Leaf function — no calls are made, so no
// stack frame or callee-saved registers are needed.
.global dot_product_full
dot_product_full:
MOVI V0.4S, #0 // NEON accumulator
SXTW X3, W2 // X3 = (int64_t)n
// NEON main loop (process 4 floats at a time)
LSR X5, X3, #2 // X5 = n / 4 (NEON iterations)
CBZ X5, .full_reduce
.full_neon:
LDR Q1, [X0], #16
LDR Q2, [X1], #16
FMLA V0.4S, V1.4S, V2.4S
SUBS X5, X5, #1
B.NE .full_neon
.full_reduce:
// Reduce NEON accumulator (yields 0.0 if the loop never ran)
FADDP V0.4S, V0.4S, V0.4S
FADDP S0, V0.2S // S0 = sum of NEON part
// Tail: process remaining 0-3 elements with scalar code
ANDS X5, X3, #3 // X5 = n % 4
B.EQ .full_done
FMOV S3, WZR // S3 = 0.0f (scalar accumulator for tail)
.full_tail_loop:
LDR S1, [X0], #4 // X0/X1 already point past the NEON part
LDR S2, [X1], #4
FMADD S3, S1, S2, S3
SUBS X5, X5, #1
B.NE .full_tail_loop
FADD S0, S0, S3 // combine NEON and scalar parts
.full_done:
RET
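The control flow of `dot_product_full` can be expressed as portable C: a vector-style body over the largest multiple of 4, then a scalar loop over the 0-3 leftover elements:

```c
/* Portable C model of dot_product_full: 4-lane body over the largest
 * multiple of 4, then a scalar tail for the n % 4 remainder. */
float dot_product_any(const float *h, const float *x, int n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int vec_end = n & ~3;                   /* n rounded down to a x4 */
    for (int i = 0; i < vec_end; i += 4)
        for (int lane = 0; lane < 4; lane++)
            acc[lane] += h[i + lane] * x[i + lane];
    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (int i = vec_end; i < n; i++)       /* scalar tail: 0-3 elems */
        sum += h[i] * x[i];
    return sum;
}
```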
Register Trace: NEON 4-Wide, n=8
Input: h = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5], x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]. Expected: dot = 0.5 × (1+2+3+4+5+6+7+8) = 0.5 × 36 = 18.0
| Step | V0.4S | V1.4S | V2.4S | Notes |
|---|---|---|---|---|
| MOVI | {0,0,0,0} | — | — | init accumulator |
| LDR Q1 | {0,0,0,0} | {0.5,0.5,0.5,0.5} | — | h[0..3] |
| LDR Q2 | {0,0,0,0} | {0.5,0.5,0.5,0.5} | {1,2,3,4} | x[0..3] |
| FMLA V0 | {0.5,1.0,1.5,2.0} | {0.5,0.5,0.5,0.5} | {1,2,3,4} | V0 += V1*V2 |
| LDR Q1 | {0.5,1.0,1.5,2.0} | {0.5,0.5,0.5,0.5} | {1,2,3,4} | h[4..7] |
| LDR Q2 | {0.5,1.0,1.5,2.0} | {0.5,0.5,0.5,0.5} | {5,6,7,8} | x[4..7] |
| FMLA V0 | {3.0,4.0,5.0,6.0} | {0.5,0.5,0.5,0.5} | {5,6,7,8} | V0 += V1*V2 |
| FADDP | {7.0,11.0,7.0,11.0} | — | — | pairwise sum |
| FADDP S0 | S0 = 18.0 | — | — | final result ✓ |
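The trace can be checked numerically with a few lines of C using the same lane-wise model (`trace_neon4` is a name invented for this check):

```c
#include <stdio.h>

/* Replays the n=8 register trace: two 4-lane FMLA steps, then the two
 * pairwise (FADDP) reductions. Returns the final S0 value. */
float trace_neon4(void) {
    float h[8] = {0.5f,0.5f,0.5f,0.5f,0.5f,0.5f,0.5f,0.5f};
    float x[8] = {1,2,3,4,5,6,7,8};
    float v0[4] = {0};
    for (int i = 0; i < 8; i += 4)
        for (int lane = 0; lane < 4; lane++)
            v0[lane] += h[i+lane] * x[i+lane];     /* the two FMLA steps */
    printf("after FMLAs: {%g, %g, %g, %g}\n", v0[0], v0[1], v0[2], v0[3]);
    float p0 = v0[0] + v0[1], p1 = v0[2] + v0[3];  /* FADDP V0.4S */
    float s0 = p0 + p1;                            /* FADDP S0, V0.2S */
    printf("S0 = %g\n", s0);
    return s0;
}
```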
Performance Summary
For n=256 elements on Raspberry Pi 4 (Cortex-A72):
| Implementation | Cycles/call | Speedup |
|---|---|---|
| Scalar (FMADD) | ~512 | 1.0× |
| NEON 4-wide | ~136 | 3.8× |
| NEON 8-wide | ~80 | 6.4× |
The NEON 8-wide version doesn't achieve the theoretical 8× speedup because:
1. Memory bandwidth: loading 16 × 4-byte values (64 bytes) per iteration approaches L1 cache bandwidth
2. Loop overhead: the SUBS + branch cost is amortized but not zero
3. FMLA latency: even with two independent FMLAs, the FADD dependency in the reduction chain adds latency
On Apple M4 (with wider execution units and much larger caches), the speedup approaches 8×.
Building and Testing
# Assemble all versions
aarch64-linux-gnu-as dot_scalar.s -o dot_scalar.o
aarch64-linux-gnu-as dot_neon4.s -o dot_neon4.o
aarch64-linux-gnu-as dot_neon8.s -o dot_neon8.o
# Link with test harness (C main)
aarch64-linux-gnu-gcc -static dot_scalar.o dot_neon4.o dot_neon8.o test_dot.c -o test_dot
# Run
qemu-aarch64 ./test_dot
# Expected output:
# Scalar: 18.000000
# NEON4: 18.000000
# NEON8: 18.000000
Summary
The dot product case study demonstrates three key points about NEON programming:
- FMLA is the key instruction — it fuses multiply and accumulate into one operation, matching the mathematical structure of FIR filters exactly
- The reduction step (FADDP + FADDP) is a fixed cost amortized over the length of the vectors — for longer arrays, this overhead is negligible
- Two-accumulator unrolling (the 8-wide version) exploits instruction-level parallelism without requiring longer SIMD width — same 128-bit registers, twice the throughput by keeping two independent FMLA chains in flight
Audio, graphics, machine learning inference — all of these are fundamentally dot products. This is why NEON (and its successor, SVE) exists and why ARM64 is competitive with x86-64 for numeric workloads despite having "simpler" instructions.