Chapter 35 Key Takeaways: Spatial Audio & 3D Sound

The Physics of Spatial Hearing

Three-Mechanism Spatial Hearing: The auditory system localizes sounds in three dimensions using three distinct physical mechanisms:
- ITD (Interaural Time Difference): the difference in arrival time at the two ears. Dominant below 1500 Hz; maximum ≈700 µs at 90° azimuth. Enables azimuthal resolution of 1–2 degrees near the front.
- ILD (Interaural Level Difference): the level difference caused by head shadow. Dominant above 1500 Hz and frequency-dependent (larger at high frequencies, where head shadow is stronger). Maximum ≈20 dB at 90° azimuth at 4+ kHz.
- Pinna Filtering: direction-dependent spectral modifications from the outer-ear geometry. Provides elevation cues and front-rear discrimination. Operates above ~4 kHz, where wavelengths are short enough to interact with pinna features.
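The ITD cue above can be approximated with the classic Woodworth spherical-head model. This is a sketch, not a measurement: the head radius and speed-of-sound values below are typical textbook assumptions rather than figures from this chapter.

```python
import math

HEAD_RADIUS_M = 0.0875   # average adult head radius (assumed value)
SPEED_OF_SOUND = 343.0   # m/s in air at ~20 degrees C

def itd_woodworth(azimuth_deg: float) -> float:
    """Interaural time difference in seconds from the Woodworth
    spherical-head model: ITD = (a / c) * (theta + sin(theta)),
    valid for a frontal source at azimuth theta."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (theta + math.sin(theta))

# A source directly ahead produces no ITD; a source at 90 degrees
# yields the maximum, on the order of the ~700 microseconds cited above.
max_itd_us = itd_woodworth(90.0) * 1e6
```

With these constants the 90° value comes out in the 600–700 µs range, consistent with the maximum quoted in the text.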

The Cone of Confusion: ITD and ILD alone cannot distinguish sources in front of, behind, above, or below the listener — all positions on a given cone centered on the interaural axis produce the same interaural cues. Pinna filtering resolves this ambiguity by encoding source elevation and front-rear position in the spectral content of the received signal.

Head Movement as a Localization Cue: Turning the head changes the pattern of ITD and ILD in a direction-specific way. This dynamic information provides a powerful additional cue for resolving the cone of confusion — particularly for distinguishing front from rear sources.

HRTF, the Personal Acoustic Fingerprint: The HRTF (head-related transfer function) is a pair of direction-dependent filters (one per ear) that encodes the complete acoustic transformation of sound by the head, torso, and pinna. It incorporates ITD, ILD, and pinna filtering in a single mathematical object. HRTFs are individually unique, determined by head size, ear canal dimensions, and pinna geometry.
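Rendering with a measured HRTF reduces to convolving the source signal with the left- and right-ear impulse responses (HRIRs). A minimal pure-Python sketch, with toy three-sample "HRIRs" standing in for real measured responses (production renderers use partitioned FFT convolution instead of this direct form):

```python
def convolve(signal, ir):
    """Direct-form FIR convolution (pure Python, for illustration only)."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(ir):
            out[n + k] += x * h
    return out

def render_binaural(mono, hrir_left, hrir_right):
    """Apply an HRIR pair to a mono source: the two convolutions
    bake ITD, ILD, and pinna filtering into the two ear signals."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Toy HRIRs: the right ear is delayed by two samples and attenuated,
# mimicking a source on the listener's left.
hrir_l = [1.0, 0.0, 0.0]
hrir_r = [0.0, 0.0, 0.5]
left, right = render_binaural([1.0, 0.5], hrir_l, hrir_r)
```

The delayed, quieter right channel illustrates how ITD and ILD emerge automatically from the impulse responses; a real HRIR also carries the pinna's spectral notches.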

HRTF Individualization Is Critical: Using a generic (population-average) HRTF for spatial audio rendering produces degraded elevation perception, front-rear confusion, and in-head localization for most users, because their personal pinna notch frequencies do not match the generic model. Elevation perception is the most sensitive to HRTF mismatch because pinna geometry varies more between individuals than head size does.

Measurement and Personalization: Complete HRTF measurement requires capturing responses from many directions (hundreds to thousands) and takes 30–90 minutes. Consumer personalization approaches include photograph-based ear-geometry matching (Apple Personalized Spatial Audio), acoustic measurement via in-ear microphones, and BEM-computed HRTFs from 3D ear scans.

Binaural Audio and the Externalization Problem

Binaural Recording Captures Real Acoustics: Dummy-head recording (e.g., the Neumann KU 100) captures a real acoustic scene with authentic spatial cues through the dummy head's geometry. Played over headphones, it can sound remarkably realistic. Limitations: HRTF mismatch with individual listeners (especially for elevation), no head tracking, and reduced externalization for listeners with atypical ear anatomy.

The Externalization Problem: Even with HRTF convolution applied, binaural headphone audio often sounds inside the head rather than externalized. Physical causes: HRTF mismatch (wrong spectral notch frequencies), absent room reflections (the real room's acoustics compete with the virtual scene), and lack of head tracking (no dynamic cue updates). Solutions: personalized HRTFs, simulated room context, and head tracking.

Ambisonics: Scene-Based Spatial Audio

Spherical Harmonic Decomposition: First-order ambisonics encodes the acoustic field in four channels (W, X, Y, Z) corresponding to the four first-order spherical harmonics. For a source signal s at azimuth az and elevation el:
- W = (1/√2) × s (omnidirectional pressure)
- X = cos(az)cos(el) × s (front-back)
- Y = sin(az)cos(el) × s (left-right)
- Z = sin(el) × s (up-down)
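The four encoding equations above translate directly into code. A minimal sketch using the same gains (the 1/√2 weighting on W is the traditional FuMa-style convention the text uses):

```python
import math

SQRT2_INV = 1.0 / math.sqrt(2.0)

def encode_foa(sample: float, az_deg: float, el_deg: float):
    """Encode one mono sample into first-order B-format (W, X, Y, Z)
    using the panning gains listed above."""
    az, el = math.radians(az_deg), math.radians(el_deg)
    w = SQRT2_INV * sample                      # omnidirectional pressure
    x = math.cos(az) * math.cos(el) * sample    # front-back
    y = math.sin(az) * math.cos(el) * sample    # left-right
    z = math.sin(el) * sample                   # up-down
    return w, x, y, z

# A source straight ahead (az=0, el=0) lands entirely in W and X.
w, x, y, z = encode_foa(1.0, 0.0, 0.0)
```

Moving the source to 90° azimuth shifts its energy from X to Y while W stays constant, which is exactly why W alone carries no direction.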

Format Independence: The key advantage of ambisonics is that the encoded W/X/Y/Z (or higher-order) channels are playback-system agnostic. The same ambisonic master file can be decoded to stereo, 5.1, quadraphonic, binaural headphones, or a 64-speaker immersive installation using different decoding matrices. This makes ambisonics future-proof across playback formats.
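A decoding matrix can be as simple as the classic "projection" (sampling) decoder: each speaker receives the inner product of the B-format vector with that speaker's own encoding direction. This is one standard textbook decoder among several, sketched here for a horizontal-only (W, X, Y) signal and a regular speaker ring:

```python
import math

def encode_foa_h(az_deg: float):
    """Horizontal-only FOA encoding vector [W, X, Y] for a unit source."""
    az = math.radians(az_deg)
    return [1.0 / math.sqrt(2.0), math.cos(az), math.sin(az)]

def decode_projection(b, speaker_az_degs):
    """Projection decoder for a regular horizontal ring of N speakers:
    gain_i = (2 / N) * <b, e(speaker_i)>."""
    n = len(speaker_az_degs)
    gains = []
    for az in speaker_az_degs:
        e = encode_foa_h(az)
        gains.append((2.0 / n) * sum(bi * ei for bi, ei in zip(b, e)))
    return gains

# Same B-format source, two different layouts -- only the decoder changes.
quad = [0.0, 90.0, 180.0, 270.0]
gains = decode_projection(encode_foa_h(0.0), quad)
```

For a source at 0°, the front speaker dominates and the gains across the ring sum to the source amplitude, illustrating why one master decodes to any layout.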

Higher-Order Ambisonics: N-th-order ambisonics uses (N+1)² channels and provides angular resolution of approximately 180°/N. The "sweet spot" for accurate reproduction scales with order. First-order ambisonics is practical and commercially deployed; third-order (16 channels) provides good spatial resolution for small-to-medium reproduction systems.
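The two rules of thumb above are simple enough to check in a couple of lines:

```python
def hoa_channels(order: int) -> int:
    """Full-sphere channel count for N-th-order ambisonics: (N + 1)^2."""
    return (order + 1) ** 2

def angular_resolution_deg(order: int) -> float:
    """Rule-of-thumb angular resolution from the text: ~180 / N degrees."""
    return 180.0 / order

# First order: 4 channels; second: 9; third: 16 (as cited in the text).
channel_counts = [hoa_channels(n) for n in (1, 2, 3)]
```

Third order thus trades a 4x channel-count increase over first order for roughly 60° resolution instead of 180°.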

Spatial Audio Technology Platforms

Channel-Based (5.1/7.1): Fixed channel assignments, locked to specific speaker configurations. Good when playback matches production configuration; poor downmix compatibility and no height channels.

Object-Based (Dolby Atmos): Audio objects with 3D position metadata, rendered at playback time for any speaker configuration. Decouples production from reproduction. Beds for ambience, objects for positioned elements. Industry standard for cinema and increasingly for streaming.

Apple Spatial Audio: Dolby Atmos content with binaural rendering and dynamic head tracking via AirPods. Personalized HRTF option via iPhone ear geometry scan. The largest consumer deployment of HRTF-based spatial audio in history.

Wavefield Synthesis: Reproduces actual physical wave field using dense speaker arrays. No sweet-spot limitation in principle; speaker density requirements constrain practical bandwidth to ~8–10 kHz.

NHK 22.2: 24-channel format (three height layers) for Super Hi-Vision (8K) TV. Most comprehensive channel-based format in production deployment.

VR Audio: The Hardest Problem

The 25 ms Latency Limit: The perceptual threshold for detecting mismatch between head movement and audio update is approximately 25 ms end-to-end. Above this threshold, audio "lags" head movement and destroys presence. Typical modern VR systems achieve 14–33 ms at moderate scene complexity — within range but constrained.
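End-to-end latency is a budget summed across pipeline stages, with audio buffering usually the largest line item. A sketch of such a budget; every figure below is an illustrative assumption, not a measurement from any shipping headset:

```python
SAMPLE_RATE = 48_000  # Hz

def block_latency_ms(block_size: int, blocks: int = 1) -> float:
    """Latency contributed by buffering `blocks` blocks of audio."""
    return blocks * block_size / SAMPLE_RATE * 1000.0

# Hypothetical stage-by-stage budget for a head-tracked binaural pipeline.
budget_ms = {
    "head-tracker sample + transmit":      5.0,
    "pose-to-HRTF selection + crossfade":  2.0,
    "render block (256 samples @ 48 kHz)": block_latency_ms(256),
    "output buffer (256 samples)":         block_latency_ms(256),
    "transport to headphones":             3.0,
}
total_ms = sum(budget_ms.values())
within_budget = total_ms <= 25.0  # the chapter's ~25 ms threshold
```

Note how two 256-sample buffers alone consume over 10 ms of the 25 ms budget, which is why VR audio engines favor small block sizes despite the CPU cost.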

HRTF Personalization in VR: HRTF individualization is the single largest factor in VR audio quality (per Meta research). Current consumer approaches: photograph-based selection, acoustic measurement via in-ear microphone, and BEM computation from a 3D scan. Full personalization in all production VR headsets remains an active engineering goal.

Acoustic Occlusion and Propagation: Correct occlusion modeling (sound changes when blocked by a virtual wall) and room acoustic propagation (reverb matching the virtual geometry) are required for physical plausibility but are computationally demanding. Current implementations use simplified physics with two-tier update rates.
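One common simplification, sketched here under stated assumptions rather than as any particular engine's method: a slow-tier geometry query decides whether the direct path is blocked, and the fast audio tier then applies a fixed attenuation plus a low-pass filter (occluders pass low frequencies better than highs). All coefficients below are illustrative assumptions:

```python
def one_pole_lowpass(signal, alpha):
    """y[n] = alpha * x[n] + (1 - alpha) * y[n-1];
    smaller alpha means a darker, more muffled sound."""
    y, out = 0.0, []
    for x in signal:
        y = alpha * x + (1.0 - alpha) * y
        out.append(y)
    return out

def apply_occlusion(signal, occluded: bool):
    """Crude occlusion model: if the slow-tier geometry check flagged
    the path as blocked, attenuate and low-pass the direct sound."""
    if not occluded:
        return list(signal)
    attenuated = [0.3 * x for x in signal]      # ~10 dB broadband loss (assumed)
    return one_pole_lowpass(attenuated, 0.2)    # muffling (assumed cutoff)

dry = [1.0, 0.0, 0.0, 0.0]
wet = apply_occlusion(dry, occluded=True)
```

The two-tier split keeps the expensive ray or voxel query off the audio thread while the cheap per-sample filter runs at full audio rate.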

Music and Beyond

Spatial Position as Musical Parameter: Three-dimensional audio mixing enables spatial position to function as an expressive compositional parameter — not merely organization of the mix but a dimension of musical meaning. This possibility is in its earliest stages; most current spatial audio music is remixed stereo rather than purpose-composed 3D.

Auditory Display Applications: The physics of spatial hearing has value beyond music: navigation audio (directional turn cues), sonification (mapping data dimensions to spatial position), safety systems (spatially placed warning sounds), and accessibility applications all benefit from spatial audio's ability to convey directional information with low cognitive load.

The Recurring Themes

Technology as Mediator: Every layer of spatial audio technology — HRTF convolution, ambisonic encoding, object-based rendering, head tracking — is a mediation between the physics of acoustic waves and the perceptual experience of three-dimensional space. Each layer adds capability and introduces new approximations. The qualitative goal — creating the experience of being in a real acoustic space — remains the same as the physical reality it approximates; only the mediating mechanism changes.

Reductionism vs. Emergence: Spatial audio is a perfect example of emergence from reductive components. Reduced to physics, the cues are ITD values of a few hundred microseconds and ILD values of a few decibels. What emerges from these small physical differences is the rich, enveloping experience of being surrounded by a full three-dimensional sonic environment. The sum is immensely greater than the parts.

Constraint as Creativity: The Beatles' "Tomorrow Never Knows" demonstrates that spatial audio creativity does not require spatial audio technology. Working within the constraints of four-track tape, manual fader riding, and analog processing, Lennon, McCartney, Harrison, Starr, Martin, and Emerick produced a recording whose spatial conception few productions made with explicit three-dimensional tools have equaled in intentional design. The constraint of limited technology forced creative solutions that revealed the physics of spatial perception more clearly than unlimited tools might have.