Chapter 35 Quiz: Spatial Audio & 3D Sound

20 questions covering the physics of spatial hearing, HRTF, ambisonics, binaural audio, and spatial audio technology platforms.


Question 1
The primary binaural localization cue at frequencies below approximately 1500 Hz is:

A) Interaural Level Difference (ILD) — the head shadow effect
B) Pinna filtering — spectral notches created by the outer ear
C) Interaural Time Difference (ITD) — the difference in sound arrival time at each ear
D) Early reflection pattern — the difference in room reflections at each ear

**Answer: C — Interaural Time Difference (ITD)**

Below about 1500 Hz, sound wavelengths are long relative to head diameter, so the head shadow effect (ILD) is minimal — sound diffracts around the head easily. However, the difference in arrival time between the two ears (ITD) remains a reliable cue. The auditory system compares the phase of the low-frequency waveform at each ear; phase comparison is unambiguous at low frequencies because the wavelength is long enough that a given phase difference corresponds to a single ITD value. Above 1500 Hz, phase comparison becomes ambiguous (a 180-degree phase difference could correspond to many ITD values), so ILD takes over as the dominant cue.

Question 2
The maximum ITD — the longest interaural time difference, which occurs when a source is directly to one side of the listener — is approximately:

A) 70 microseconds
B) 200 microseconds
C) 700 microseconds
D) 7000 microseconds (7 ms)

**Answer: C — 700 microseconds**

At 90° azimuth (source directly to one side), sound must travel an extra path equal to approximately the width of the head plus the arc around the curved head surface. Using Woodworth's formula: ITD_max = (r/c) × (sin 90° + π/2) = (0.0875/343) × (1 + 1.571) ≈ 655 µs. In practice, measured maximum ITDs are approximately 650–720 µs depending on head size. This is detectable: the auditory system can resolve ITDs as small as 10–20 µs, giving angular resolution of approximately 1–2 degrees near the frontal median plane.
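
As a sanity check on the arithmetic above, here is a minimal Python sketch of Woodworth's formula, using the same head radius (0.0875 m) and speed of sound (343 m/s) as the worked example:

```python
import math

def woodworth_itd(azimuth_deg: float, head_radius_m: float = 0.0875,
                  speed_of_sound: float = 343.0) -> float:
    """Woodworth's ITD approximation for a spherical head, in seconds."""
    theta = math.radians(azimuth_deg)
    # Extra path length = straight-line segment (r*sin) + arc around the head (r*theta).
    return (head_radius_m / speed_of_sound) * (math.sin(theta) + theta)

# Source directly to one side (90 degrees azimuth):
print(f"{woodworth_itd(90.0) * 1e6:.0f} microseconds")  # ~656 us
```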

Question 3
The "cone of confusion" in spatial hearing refers to:

A) The region behind the listener's head where hearing is worst
B) The set of source positions that produce identical ITD and ILD values
C) The frequency range where ITD and ILD cues conflict
D) The confusion caused by listening in reverberant environments

**Answer: B**

For any given interaural axis, a surface of revolution (a cone) centered on that axis produces identical ITD and ILD values at all positions on the cone's surface. ITD and ILD together can determine azimuth (horizontal angle) but cannot distinguish between sources in front of and behind the listener, or above and below the listener, when those positions lie on the same cone. Pinna filtering provides the additional cues (spectral notches that vary with elevation and front-back angle) needed to resolve the cone of confusion ambiguity.

Question 4
Pinna filtering provides elevation cues primarily at frequencies above 4 kHz because:

A) The auditory nerve fibers for high frequencies are located closer to the brain
B) Low-frequency sounds diffract around the pinna without interacting with its ridges
C) The ILD is larger at high frequencies, which interferes with elevation cues at lower frequencies
D) High-frequency sounds contain more musical information

**Answer: B**

The pinna (outer ear) has ridges, valleys, and protrusions on the scale of centimeters. For sound wavelengths much longer than these features (low frequencies), sound diffracts around the pinna without interacting with its detailed geometry — no directional filtering occurs. At high frequencies, where wavelength approaches the scale of pinna features, sounds do interact with these features through reflection and diffraction, creating direction-dependent spectral patterns (notches and peaks). These frequency-specific modifications, which shift with source direction, provide the elevation and front-back cues that ITD and ILD cannot supply.

Question 5
The Head-Related Transfer Function (HRTF) is:

A) A standard filter applied uniformly to all listeners in spatial audio rendering
B) The direction-dependent acoustic modification of sound by an individual's head, torso, and pinna
C) A measure of the head's resonant frequency for sound transmission
D) The transfer function between speaker and microphone in a recording room

**Answer: B**

The HRTF encodes the complete acoustic transformation that sound from a specific direction undergoes as it travels from free space to the ear canal, including the effects of the head's diffraction, the torso's scattering, and the pinna's directional filtering. It is a pair of complex-valued functions (one for each ear) that depend on source direction and frequency. Critically, HRTFs are unique to individuals: because every person has different head dimensions, pinna shape, and torso geometry, their HRTFs differ — sometimes substantially. This is why using a generic or another person's HRTF for binaural rendering produces degraded spatial perception.
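
In rendering, the HRTF is applied by convolving a mono source with the head-related impulse response (HRIR, the time-domain form of the HRTF) for each ear. A minimal sketch follows; the HRIRs here are crude placeholders (a pure delay plus attenuation), standing in for responses from a measured dataset such as a SOFA file:

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono: np.ndarray,
                    hrir_left: np.ndarray,
                    hrir_right: np.ndarray) -> np.ndarray:
    """Convolve a mono source with the HRIR pair for one source direction.

    Real HRIRs come from a measured dataset indexed by azimuth and
    elevation; these placeholders only mimic a coarse ITD/ILD.
    """
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)  # (samples, 2) stereo output

fs = 48_000
hrir_l = np.zeros(64)
hrir_l[0] = 1.0      # near ear: no delay
hrir_r = np.zeros(64)
hrir_r[31] = 0.7     # far ear: ~650 us delay at 48 kHz (~31 samples), attenuated
signal = np.random.randn(fs)  # one second of noise
stereo = render_binaural(signal, hrir_l, hrir_r)
```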

Question 6
In first-order ambisonics (B-format), what does the Z channel encode?

A) The total acoustic pressure — equivalent to an omnidirectional microphone
B) The pressure gradient in the left-right direction
C) The pressure gradient in the up-down direction
D) The pressure gradient in the front-back direction

**Answer: C**

In B-format ambisonics (traditional FuMa channel ordering, W–X–Y–Z; ACN ordering places these channels as W, Y, Z, X), the four channels encode:

- W: omnidirectional pressure (total acoustic pressure, Y₀⁰)
- X: front-back pressure gradient (cosine of azimuth × cosine of elevation, Y₁¹)
- Y: left-right pressure gradient (sine of azimuth × cosine of elevation, Y₁⁻¹)
- Z: up-down pressure gradient (sine of elevation, Y₁⁰)

Z = sin(elevation) × signal, so Z is maximum for a source directly above (elevation = +90°), minimum for directly below (elevation = −90°), and zero for any source in the horizontal plane (elevation = 0°). This is why first-order ambisonics can capture height information when used with an appropriate microphone array.
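
A minimal encoding sketch using the panning gains listed above (the 1/√2 factor on W is the FuMa convention discussed in Question 14):

```python
import numpy as np

def encode_bformat(signal: np.ndarray, azimuth_deg: float, elevation_deg: float):
    """Encode a mono signal into first-order B-format (FuMa gains)."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    w = signal * (1.0 / np.sqrt(2.0))       # omnidirectional pressure
    x = signal * np.cos(az) * np.cos(el)    # front-back gradient
    y = signal * np.sin(az) * np.cos(el)    # left-right gradient
    z = signal * np.sin(el)                 # up-down gradient
    return w, x, y, z

# A source directly above the listener: X and Y vanish, signal lands in W and Z.
s = np.ones(4)
w, x, y, z = encode_bformat(s, azimuth_deg=0.0, elevation_deg=90.0)
```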

Question 7
A second-order ambisonic mix has how many channels?

A) 4 channels
B) 9 channels
C) 16 channels
D) 25 channels

**Answer: B — 9 channels**

The number of channels in an N-th order ambisonic mix is (N+1)². For second-order ambisonics: (2+1)² = 9. The pattern: 1st order, 4 channels; 2nd order, 9; 3rd order, 16; 4th order, 25; 5th order, 36; 6th order, 49; 7th order, 64. Higher orders provide better spatial resolution (finer angular discrimination) and a larger "sweet spot" where the spatial reproduction is accurate, but the channel count — and the computational cost with it — grows quadratically with order.
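
The formula in code, for reference:

```python
def ambisonic_channels(order: int) -> int:
    """Channel count for a full-sphere ambisonic mix of a given order."""
    return (order + 1) ** 2

for n in range(1, 8):
    print(f"order {n}: {ambisonic_channels(n)} channels")  # 4, 9, 16, ..., 64
```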

Question 8
The fundamental difference between object-based audio (Dolby Atmos) and channel-based audio (5.1/7.1) is:

A) Object-based audio uses more speakers than channel-based audio
B) Object-based audio encodes spatial intent as position metadata, which is rendered at playback time for any speaker configuration
C) Object-based audio can only be reproduced over headphones
D) Channel-based audio supports higher bit-depth than object-based audio

**Answer: B**

In channel-based audio, spatial positioning is "baked in" at mix time — each audio element is mixed into specific channels (e.g., left, center, right, left surround, right surround). The spatial intent is locked to a particular speaker configuration. In object-based audio, each sound element exists as an independent audio object with attached three-dimensional position metadata. The renderer at the playback device interprets this metadata and maps the objects to whatever speaker configuration is available — 2 speakers, 5.1, 7.1.4, or headphones — adapting the reproduction to the hardware. This decoupling of production from reproduction is the key advantage.
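
A minimal sketch of the decoupling, using a hypothetical AudioObject type: position metadata travels alongside the audio, and a renderer chosen at playback time turns it into speaker feeds. The constant-power stereo pan here is only an illustration of the principle; a real object renderer such as Dolby Atmos is proprietary and far more sophisticated:

```python
import math
from dataclasses import dataclass

@dataclass
class AudioObject:
    """One sound element: audio plus position metadata, not speaker feeds."""
    samples: list[float]
    azimuth_deg: float  # where the mixer placed it, not which channel it's in

def render_to_stereo(obj: AudioObject) -> tuple[list[float], list[float]]:
    """Interpret position metadata at playback time for a 2-speaker layout.

    A real renderer maps the same metadata to 5.1, 7.1.4, or headphones
    instead; the point is that the mapping happens at playback, not mix time.
    """
    # Map azimuth in [-90, +90] degrees to a pan angle in [0, pi/2].
    pan = (obj.azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    gain_l, gain_r = math.cos(pan), math.sin(pan)  # constant-power pan law
    return ([s * gain_l for s in obj.samples],
            [s * gain_r for s in obj.samples])
```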

Question 9
Apple Spatial Audio uses dynamic head tracking primarily to:

A) Improve battery life by reducing processing when the head is stationary
B) Make the spatial audio content fixed in space relative to the room as the listener turns their head
C) Reduce audio latency by predicting head position
D) Automatically select the optimal HRTF for the listener's head size

**Answer: B**

Without head tracking, a listener turning their head on headphones causes the "front" channel to follow — the spatial scene rotates with the head rather than staying fixed in the room. This is unnatural and reduces externalization. With dynamic head tracking, the gyroscope and accelerometer in AirPods detect head rotation and the rendering engine compensates in real time, keeping audio objects fixed in their intended spatial positions relative to the room (or the device). This dramatically improves the sense of externalization — the audio feels like it's in the room around you rather than inside your head.
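
A minimal sketch of the compensation step, assuming yaw-only tracking and azimuth measured in degrees, positive to the listener's right:

```python
def compensated_azimuth(source_azimuth_deg: float,
                        head_yaw_deg: float) -> float:
    """Keep a source fixed in the room: render it at the source's room
    azimuth minus the listener's current head yaw, wrapped to [-180, 180)."""
    relative = source_azimuth_deg - head_yaw_deg
    return (relative + 180.0) % 360.0 - 180.0

# A source straight ahead in the room (0 deg). The listener turns 30 deg to
# the right, so the renderer must now place the source 30 deg to the left.
print(compensated_azimuth(0.0, 30.0))  # -30.0
```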

Question 10
The "externalization problem" in headphone spatial audio refers to:

A) The technical difficulty of recording audio in outdoor environments
B) The tendency of binaural headphone audio to be perceived as originating inside the head rather than from external sources
C) The need to equalize headphones to match the frequency response of loudspeakers
D) The challenge of streaming spatial audio over external networks

**Answer: B**

Headphones deliver audio directly to the ear canal, bypassing the pinna filtering and room reflections that the brain normally uses to construct spatial awareness of external sound sources. Even when HRTF convolution is applied to create the correct spectral cues, many listeners still experience partial in-head localization. Causes include HRTF mismatch (generic HRTF notches at the wrong frequencies), absence of matching room reflections in the listening environment, and inability to use head movement to disambiguate source directions. Head tracking and personalized HRTFs each address different aspects of the externalization problem.

Question 11
The end-to-end latency limit for spatial audio in virtual reality (below which head movement updates are not perceptibly delayed) is approximately:

A) 5 milliseconds
B) 25 milliseconds
C) 100 milliseconds
D) 250 milliseconds

**Answer: B — 25 milliseconds**

Research on VR audio has established that when the latency between head movement and the corresponding update in the spatial audio rendering exceeds approximately 25 ms, users detect a lag between the visual scene (which updates faster, at display refresh rates of 60–120 Hz) and the audio. Above 50 ms latency, the mismatch becomes severely disorienting and can contribute to motion sickness. This 25 ms budget covers the entire signal chain: IMU sampling, head tracking computation, HRTF selection and interpolation, audio convolution, output buffer, and earphone driver response.
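
One way to reason about the budget is to sum the stages of the chain listed above. Every figure below is a hypothetical round number for illustration only, not a measured value for any product:

```python
# Illustrative motion-to-sound latency budget; all stage values are
# assumptions chosen to show how quickly 25 ms gets consumed.
budget_ms = {
    "IMU sampling interval": 2.0,
    "head-tracking computation": 2.0,
    "HRTF selection/interpolation": 3.0,
    "binaural convolution": 3.0,
    "output buffer (256 samples @ 48 kHz)": 5.3,
    "wireless link + driver response": 8.0,
}
total = sum(budget_ms.values())
print(f"total: {total:.1f} ms (target: <= 25 ms)")  # 23.3 ms
```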

Question 12
Wavefield Synthesis (WFS) aims to reproduce:

A) A binaural signal optimized for a single listener's HRTF
B) The actual physical acoustic wave field of the original source at a listening region
C) A channel-based surround mix using more speakers than conventional formats
D) The complete spherical harmonic decomposition of the sound field up to a specified order

**Answer: B**

Wavefield Synthesis applies Huygens' principle: any acoustic wave front can be recreated at a listening region by correctly controlling pressure and velocity at the boundary of that region. Practically, this means driving a large, dense array of loudspeakers around the listening area with individually calculated signals that collectively reproduce the wave front of the original source. Unlike binaural audio (optimized for one listener position) or ambisonics (accurate near the center), WFS theoretically provides correct spatial reproduction for any listener position within the listening region — though practical limitations (finite speaker density, computational load) constrain this ideal.
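
A minimal geometric sketch of the idea for a virtual point source: each loudspeaker receives the source signal delayed by its distance from the virtual source and attenuated for geometric spreading. Real WFS driving functions also include a spectral-shaping filter and array tapering, which are omitted here:

```python
import numpy as np

def wfs_driving_params(source_xy, speaker_positions, c: float = 343.0):
    """Per-speaker delay and gain for a virtual point source (2D sketch).

    Keeps only the geometric delay and 1/sqrt(r) amplitude decay to show
    the delay-and-attenuate structure of the driving signals.
    """
    source = np.asarray(source_xy, dtype=float)
    speakers = np.asarray(speaker_positions, dtype=float)
    r = np.linalg.norm(speakers - source, axis=1)  # source-to-speaker distances
    delays = r / c                                  # seconds
    gains = 1.0 / np.sqrt(np.maximum(r, 1e-3))      # geometric spreading
    return delays, gains

# A linear array of 16 speakers, 10 cm apart, reproducing a virtual
# point source 2 m behind the array.
speakers = [(0.1 * i, 0.0) for i in range(16)]
delays, gains = wfs_driving_params((0.75, -2.0), speakers)
```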

Question 13
Why is binaural dummy-head recording (using a microphone placed in each ear of an artificial head) limited in its effectiveness for all listeners?

A) Dummy heads use omnidirectional microphones that cannot capture directional information
B) The dummy head's pinna geometry matches some listeners but not others, causing HRTF mismatch
C) Dummy head recordings can only be played back over loudspeakers, not headphones
D) The recording captures only horizontal spatial information, not elevation

**Answer: B**

A dummy head (like the Neumann KU 100) has a specific head and pinna geometry that approximates a population average, but no individual listener has exactly the same anatomy. Pinna filtering — the spectral notches that provide elevation and front-back cues — is extremely sensitive to pinna shape. Listeners whose pinna geometry differs significantly from the dummy head will experience elevation errors, front-back confusion, or in-head localization because the spectral notch frequencies in the recording don't match their personal pinna notch frequencies. ITD and basic ILD will be approximately correct (since head size variation is smaller than pinna variation), but elevation cues suffer the most from HRTF mismatch.

Question 14
In the B-format ambisonic encoding, the W channel is normalized by 1/√2. What is the reason for this normalization?

A) It reduces the bit depth required to store the W channel
B) It ensures that a decorrelated sound field has equal energy in all four B-format channels
C) It compensates for the fact that omnidirectional microphones pick up more sound than figure-of-8 microphones
D) It makes W compatible with standard stereo playback systems

**Answer: B**

In a perfectly diffuse sound field — where sound arrives equally from all directions simultaneously — the four B-format channels (W, X, Y, Z) should carry comparable energy. For a single source, the directional gains of X, Y, Z peak at 1.0, and an unnormalized W would also be 1.0; but averaged over a diffuse field, the direction-dependent gains of X, Y, and Z fall below unity while an unnormalized W stays at full level, so W would dominate. Attenuating W by 3 dB (the 1/√2 factor) balances the channel energies in a diffuse field, maintaining consistent level relationships across all four channels. This is the traditional FuMa (Furse–Malham) B-format convention; in the SN3D (Semi-Normalized 3D) scheme used with ACN ordering, W instead carries unity gain.

Question 15
Which statement about binaural recording vs. ambisonics is most accurate?

A) Binaural recording can be decoded to any playback system; ambisonics is locked to headphones
B) Ambisonics captures the complete acoustic field and can be decoded for any playback system; binaural captures what a specific listener hears over headphones
C) Both formats encode identical acoustic information, differing only in technical implementation
D) Ambisonics is only useful for horizontal spatial audio; binaural is required for elevation

**Answer: B**

Ambisonics captures the physical acoustic field at a point (a spherical harmonic decomposition of pressure and velocity) — this representation is playback-system agnostic. The same ambisonic file can be decoded to headphones (binaural), two speakers (stereo), a speaker ring, a 3D speaker dome, or virtually any configuration. Binaural recording captures what a specific listener (or dummy head) would hear — the HRTF-filtered signal at two ear positions, intended specifically for headphone playback. Binaural captures experience; ambisonics captures the field. Converting ambisonics to binaural is straightforward; converting binaural to ambisonics or loudspeaker formats is more complex and lossy.

Question 16
Apple Personalized Spatial Audio improves on generic HRTF rendering by:

A) Using more speakers in AirPods to reproduce spatial information
B) Measuring the user's ear geometry with the iPhone camera to select a better-matched HRTF
C) Increasing the audio bitrate for spatial tracks
D) Adding more frequency bands to the equalizer applied to spatial content

**Answer: B**

Generic HRTF-based spatial audio rendering uses a single HRTF derived from population averages or a specific dummy head, which will not precisely match any individual listener. Apple Personalized Spatial Audio uses the iPhone's TrueDepth camera system to capture a three-dimensional model of the user's ear (specifically the pinna geometry). This geometry model is then used to select from a database of HRTFs measured from ears with similar geometry, or to interpolate between database HRTFs, finding a better acoustic match for that individual. The improvement is most significant for elevation perception, where pinna geometry is most critical.

Question 17 "Acoustic occlusion" in VR audio refers to:

A) The reduction in spatial audio quality when too many sound sources are processed simultaneously B) The acoustic attenuation and filtering that occurs when a sound source is physically blocked by a virtual object or wall C) The inability of headphones to reproduce very low-frequency spatial information D) The masking of binaural cues by the room's ambient noise

**Answer: B**

In a physically real environment, when a wall, door, or large object comes between you and a sound source, the direct path of the sound is blocked. High-frequency energy is attenuated more than low-frequency energy (since low frequencies diffract more easily), and the sound that does reach you is lower in level, spectrally modified (more bass-heavy), and potentially delayed. In VR audio, acoustic occlusion modeling recreates these physical effects: when the game engine determines that a virtual wall separates the listener from a sound source, the audio processing applies appropriate low-pass filtering and level reduction. Without occlusion modeling, sounds are heard at full level and clarity through virtual walls, breaking physical plausibility.
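
A crude sketch of the effect, assuming a one-pole low-pass plus a fixed attenuation; the cutoff and gain values are arbitrary illustrative choices, where a real engine would derive them from the occluding geometry and material:

```python
import numpy as np

def apply_occlusion(signal: np.ndarray, fs: float,
                    cutoff_hz: float = 800.0, gain_db: float = -12.0) -> np.ndarray:
    """Occlusion sketch: low-pass the blocked direct path, then reduce level."""
    # One-pole low-pass coefficient: y[n] = y[n-1] + alpha * (x[n] - y[n-1]).
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / fs)
    out = np.empty_like(signal)
    state = 0.0
    for i, x in enumerate(signal):  # sample-by-sample for clarity
        state += alpha * (x - state)
        out[i] = state
    return out * 10.0 ** (gain_db / 20.0)

fs = 48_000
direct_path = np.random.randn(fs)           # unoccluded source signal
muffled = apply_occlusion(direct_path, fs)  # heard through a virtual wall
```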

Question 18
NHK's 22.2 audio format was developed primarily for:

A) Binaural headphone playback of spatial audio
B) Accompanying Japan's Super Hi-Vision (8K) ultra-high-definition television system with matching audio immersion
C) Replacing Dolby Atmos in cinema applications
D) Professional recording studio monitoring

**Answer: B**

NHK developed the 22.2 multichannel audio format to accompany their Super Hi-Vision (SHV) 8K television system, which has 16 times the pixel count of HDTV. The visual immersion of 8K required a corresponding audio immersion system. 22.2 uses 24 speakers (22 full-range + 2 LFE) arranged in three height layers — upper (9 speakers), middle (10 speakers), lower (3 speakers) — providing comprehensive three-dimensional coverage including height information. It remains the most ambitious channel-based spatial audio format deployed in production content, though its 24-channel requirement makes consumer implementation impractical without immersive speaker arrays.

Question 19
The "sweet spot" limitation in ambisonics means that:

A) Only listeners with a sweet personality can perceive the spatial audio correctly
B) The spatial reproduction is accurate only near the center of the decoding region; accuracy decreases for off-center listeners
C) Ambisonics works only at specific frequencies (the "sweet" frequencies) and not others
D) The format requires a sweet (smooth, undistorted) audio source to work correctly

**Answer: B**

In ambisonics decoding, the accurate reproduction of the encoded sound field assumes a listening point at the center of the speaker array. The spatial accuracy of the reproduction degrades for off-center listener positions at a rate determined by the ambisonic order: higher orders provide accurate reproduction over a larger region. The "sweet spot" radius is approximately λ_min/2 (half a wavelength at the highest reproduction frequency) for the given order. First-order ambisonics decoded to 8 kHz has a sweet spot radius of approximately 2 cm — only perfect for a listener exactly at the center. Higher orders substantially enlarge the sweet spot, which is one motivation for higher-order systems in venues and VR.
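
The half-wavelength rule of thumb quoted above, in code:

```python
def sweet_spot_radius_m(freq_hz: float, c: float = 343.0) -> float:
    """Half-wavelength estimate of the first-order sweet-spot radius."""
    return (c / freq_hz) / 2.0

print(f"{sweet_spot_radius_m(8000.0) * 100:.1f} cm")  # ~2.1 cm at 8 kHz
```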

Question 20
Which of the following correctly describes the key advantage of spatial audio for auditory navigation (guiding a visually impaired person through a city)?

A) Spatial audio requires fewer speakers and simpler hardware than conventional navigation audio
B) Directional audio cues instinctively indicate where to turn, are processed faster than verbal instructions, and require less cognitive load
C) Spatial audio makes navigation sounds louder and therefore easier to hear in outdoor environments
D) The HRTF provides higher-frequency cues that carry better in outdoor environments than speech

**Answer: B**

Spatial audio navigation harnesses a fundamental property of the auditory localization system: the brain processes spatial information about sound automatically and rapidly, without conscious cognitive effort. A turn direction indicated by a spatial audio cue — a sound appearing to come from the right — is interpreted instinctively and processed faster than a verbal instruction ("Turn right in 50 meters"), which requires language processing. Research has shown that spatially placed navigation cues produce faster response times and lower cognitive load than verbal equivalents, leaving more cognitive resources for the navigation task itself. This advantage is particularly significant in complex, noisy environments where verbal instructions may be misheard or require concentration to decode.

End of Chapter 35 Quiz. Review incorrect answers by returning to the relevant sections in the chapter.