Chapter 35 Exercises: Spatial Audio & 3D Sound

25 exercises covering the physics of spatial hearing, HRTF analysis, ambisonics encoding and decoding, spatial audio format design, and critical evaluation of three-dimensional audio technologies.


Part A: Physics of Spatial Hearing — ITD, ILD, and Pinna Filtering

Exercise 1 Using Woodworth's formula ITD(θ) ≈ (r/c) × (sin θ + θ), where θ is the source azimuth in radians, r = 0.0875 m is the head radius, and c = 343 m/s is the speed of sound:

a) Calculate the ITD for source azimuths of 0°, 15°, 30°, 45°, 60°, 75°, and 90°. Express your answers in microseconds.
b) Plot (or sketch) ITD as a function of azimuth. Is the relationship linear? Where is it most nearly linear, and why?
c) The minimum ITD the auditory system can detect is approximately 10–20 µs. At what azimuth from center does a source produce this minimum detectable ITD?
d) At 90° azimuth, the ITD is approximately 700 µs. For a 1000 Hz tone, how many degrees of phase difference does 700 µs correspond to? Does this phase difference uniquely identify the source direction?
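A short Python check for parts (a) and (b), using the constants from the exercise (this snippet is an illustration, not part of the chapter's code):

```python
import numpy as np

R_HEAD = 0.0875   # head radius in metres
C_SOUND = 343.0   # speed of sound in m/s

def woodworth_itd(azimuth_deg):
    """Woodworth ITD in seconds; the azimuth must be converted to radians."""
    theta = np.radians(azimuth_deg)
    return R_HEAD / C_SOUND * (np.sin(theta) + theta)

for az in (0, 15, 30, 45, 60, 75, 90):
    print(f"azimuth {az:2d} deg -> ITD = {woodworth_itd(az) * 1e6:6.1f} us")
```

At 90° this model gives roughly 656 µs, consistent with the ~700 µs figure quoted in part (d).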

Exercise 2 The ILD depends on source frequency because longer wavelengths diffract more easily around the head.

a) The "head shadow" effect becomes significant when the head circumference is comparable to the sound wavelength. Estimate the frequency at which the head circumference (2πr = 2π × 0.0875 ≈ 0.55 m) equals one wavelength. What does this imply about the frequency above which ILD becomes a strong cue? b) Using the simplified model ILD(θ, f) ≈ ILD_max × sin(θ) × (1 − e^{−f/f_c}) with ILD_max = 20 dB and f_c = 1500 Hz, calculate ILD at azimuth = 45° for frequencies: 250, 500, 1000, 2000, 4000, 8000 Hz. c) Plot ILD vs. frequency for azimuth = 45°. Describe the transition region around 1500 Hz physically. d) Stereo loudspeakers produce ILD but typically very little ITD (because both speakers are in front of the listener and the path length difference is small). Does this mean stereo is acoustically "unrealistic"? Explain using the physics.

Exercise 3 Pinna filtering creates spectral notches at frequencies that depend on source elevation.

a) Explain physically (in terms of wavelength and pinna geometry) why pinna filtering provides elevation information only at frequencies above approximately 4 kHz.
b) A listener hears a sound with a spectral notch at 8 kHz. Using the approximate relationship f_notch ≈ 6000 + ((elevation + 90°)/180°) × 6000 Hz, estimate the likely source elevation. Is this above or below ear level?
c) Pinna notch frequencies differ between individuals because pinna geometry varies. A generic HRTF uses population-average notch frequencies. If a listener's actual notch frequency for directly overhead sounds is 10 kHz but the generic HRTF predicts 11.5 kHz, describe qualitatively what spatial error this mismatch might cause.
d) Some listeners report that certain spatial audio applications make sounds seem to come from above even when the source is at ear level. Explain this "elevation error" in terms of HRTF mismatch.
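The notch model in part (b) is linear in elevation and therefore trivial to invert; a one-function sketch (the helper name is hypothetical):

```python
def elevation_from_notch(f_notch_hz):
    """Invert f_notch = 6000 + ((el + 90)/180) * 6000 for the elevation in degrees."""
    return (f_notch_hz - 6000.0) / 6000.0 * 180.0 - 90.0

print(elevation_from_notch(8000.0))   # -30.0 -> the source is below ear level
```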

Exercise 4 The "cone of confusion" describes the set of source positions that produce identical ITD and ILD values.

a) For a listener with head radius r = 0.0875 m, what is the shape of the cone of confusion for a given interaural axis? Describe it geometrically.
b) A source is at azimuth = 30°, elevation = 0° (in front, slightly to the right). What other position on the cone of confusion would produce identical ITD and ILD? (Hint: consider front-rear symmetry.)
c) Describe how head movement helps resolve the cone-of-confusion ambiguity. What specific head movement would allow a listener to distinguish a source at 30° azimuth, 0° elevation (front) from the ambiguous rear position?
d) Apple Spatial Audio uses dynamic head tracking to keep spatial positions fixed as the listener turns their head. Explain how this head tracking resolves front-rear confusion in terms of the changing ITD as the listener rotates.

Exercise 5 The binaural code accompanying this chapter (binaural_simulation.py) implements a simplified binaural model.

a) Run the script (or trace through it manually) and record the ITD values (in µs) at azimuths 0°, 30°, 60°, and 90°. Compare these to your calculated values from Exercise 1a.
b) The script applies ILD using a simplified frequency-dependent model. At azimuth = 60° and frequency = 4000 Hz, what ILD does the model predict? How does this compare to a realistic maximum ILD of about 20 dB at 90°?
c) The orbiting binaural simulation creates a tone that rotates 360° in 3 seconds. If you listen over headphones, describe what you expect to hear and why some positions (front, rear) might be harder to localize than others.
d) The simulation does not implement head tracking. Explain what change to the code would be necessary to add basic head tracking and how this would improve perceived externalization.
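If binaural_simulation.py is not at hand, the sketch below captures the spirit of a static ITD+ILD render. It is a stand-in, not the chapter's script; the broadband ILD model and the symmetric gain split are assumptions:

```python
import numpy as np

FS = 48000            # sample rate in Hz
R, C = 0.0875, 343.0  # head radius (m), speed of sound (m/s)

def render_static_binaural(mono, azimuth_deg, fs=FS):
    """Delay and attenuate the far ear; no pinna filtering, no head tracking."""
    theta = abs(np.radians(azimuth_deg))
    itd_samples = int(round(R / C * (np.sin(theta) + theta) * fs))  # Woodworth ITD
    ild = 20.0 * np.sin(theta)                 # broadband ILD in dB (assumption)
    near = np.concatenate([mono * 10 ** (+ild / 40), np.zeros(itd_samples)])
    far = np.concatenate([np.zeros(itd_samples), mono * 10 ** (-ild / 40)])
    if azimuth_deg >= 0:                       # positive azimuth: source on the right
        return np.stack([far, near], axis=1)   # columns are (left, right)
    return np.stack([near, far], axis=1)

tone = np.sin(2 * np.pi * 500 * np.arange(FS) / FS)   # 1 s, 500 Hz test tone
stereo = render_static_binaural(tone, 60)             # source at 60 deg right
```

Adding head tracking (part d) would amount to re-evaluating the azimuth argument each block from a head-orientation sensor rather than holding it fixed.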


Part B: HRTF — Measurement, Individualization, and Externalization

Exercise 6 HRTF measurement requires capturing impulse responses from many source directions.

a) A complete HRTF is measured at 72 azimuth positions × 36 elevation positions. How many individual HRIR measurements per ear does this require? How many measurements does a complete binaural HRTF (both ears) require in total?
b) Each HRIR is 256 samples long at a sample rate of 48 kHz. How many milliseconds of impulse response does each HRIR capture? Is this duration sufficient to capture the pinna reflections and head-diffraction effects you have studied?
c) HRTF measurement requires the subject to remain perfectly still. For this 5° angular resolution grid (72 azimuths × 36 elevations), estimate the minimum measurement time if each source position requires 5 seconds (including source positioning). What practical challenges does this create?
d) Propose an alternative measurement approach using a dummy head (such as the Neumann KU 100) instead of a human subject. What are the advantages and disadvantages of this approach for HRTF capture?
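The counting in parts (a)-(c) reduces to simple arithmetic; a quick check (illustrative):

```python
azimuths, elevations = 72, 36
per_ear = azimuths * elevations      # HRIR measurements per ear: 2592
total = 2 * per_ear                  # complete binaural HRTF: 5184
hrir_ms = 256 / 48000 * 1000         # each HRIR spans ~5.33 ms
hours = per_ear * 5 / 3600           # 5 s per source position: 3.6 h
print(per_ear, total, round(hrir_ms, 2), round(hours, 1))
```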

Exercise 7 Apple Personalized Spatial Audio uses iPhone camera images of the user's ear to select or interpolate a personalized HRTF.

a) What specific anatomical measurements of the ear might a camera system capture to inform HRTF personalization?
b) The relationship between ear geometry and HRTF is complex. Suppose a database of 100 measured HRTFs is available, the user's ear geometry is measured, and the nearest-match HRTF is selected. What is the fundamental limitation of this approach compared to direct HRTF measurement for that individual?
c) Research suggests that elevation perception is more sensitive to HRTF individualization than azimuthal perception. Explain why this might be, using what you know about the roles of ITD, ILD, and pinna filtering.
d) Design a perceptual experiment to evaluate the quality of a personalized HRTF. What stimuli would you use, what spatial positions would you test, and what response method (pointing, verbal report, virtual interface) would you employ? What metric would you use to compare generic vs. personalized HRTF performance?

Exercise 8 The externalization problem: sounds rendered with headphone HRTF convolution sometimes seem to be inside the head rather than externalized.

a) List three physical reasons why headphone HRTF rendering might fail to externalize sounds effectively.
b) Research suggests that early room reflections significantly improve externalization in headphone spatial audio. Explain physically why adding a simulated room acoustic (early reflections convolved with appropriate directional HRTFs) would improve externalization.
c) A listener wearing AirPods Pro listens to Dolby Atmos music and reports that the sound feels externalized and immersive. They then switch to the same content without Spatial Audio and report that the sound feels "inside the head." Using the physics discussed in this chapter, identify the specific signal-processing differences between the two conditions.
d) Head movement while listening is known to improve externalization. Explain the physics: how does head movement provide spatial information that static HRTF rendering cannot?

Exercise 9 HRTF has both amplitude (ILD) and phase (ITD) components that vary with direction and frequency.

a) An HRTF measurement shows that at 1000 Hz, the right ear receives the signal 3 dB louder and 450 µs earlier than the left ear. What does this tell you about the source direction? Be specific about azimuth and hemisphere.
b) The same HRTF shows that at 8000 Hz, the right ear has a 12 dB spectral notch that is absent in the left ear. Using your knowledge of pinna filtering, what additional spatial information might this notch provide?
c) Convolution reverb applies a room impulse response h(t) to a dry signal x(t). Binaural rendering convolves HRTF impulse responses with the source signal to produce each ear's signal. Explain why these two operations are mathematically identical and could, in principle, be combined into a single convolution step.
d) A musician records a vocal track in a dead studio and wants to place it in a virtual concert hall using spatial audio. Describe the complete signal-processing chain, from dry recording to binaural headphone output.
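Part (c) is the associativity of convolution: (x * h_room) * h_HRIR = x * (h_room * h_HRIR). A numerical check with random arrays standing in for the measured impulse responses (the arrays are placeholders, not real HRIR/RIR data):

```python
import numpy as np

rng = np.random.default_rng(0)
dry = rng.standard_normal(4800)       # dry signal (placeholder)
room_ir = rng.standard_normal(2400)   # room impulse response (placeholder)
hrir_left = rng.standard_normal(256)  # left-ear HRIR (placeholder)

# Two-stage chain: reverb first, then binaural render for one ear
two_stage = np.convolve(np.convolve(dry, room_ir), hrir_left)

# Pre-combine the two filters into a single impulse response
combined = np.convolve(dry, np.convolve(room_ir, hrir_left))

print(np.allclose(two_stage, combined))   # True: convolution is associative
```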

Exercise 10 Compare the externalization and spatial quality of the following headphone playback scenarios:

Scenario A: Stereo recording on conventional headphones
Scenario B: Binaural recording (dummy head) on open-back headphones
Scenario C: Dolby Atmos mix, binaural render with generic HRTF, on AirPods (no head tracking)
Scenario D: Dolby Atmos mix, binaural render with personalized HRTF, on AirPods Pro (with head tracking)

For each scenario:
a) Predict the degree of externalization (fully externalized, partially externalized, in-head).
b) Identify which spatial cues (ITD, ILD, pinna filtering, head movement) are present and which are absent.
c) Rank the four scenarios in order of expected spatial realism and explain your ranking with physics-based reasoning.


Part C: Ambisonics — Encoding, Spherical Harmonics, and Decoding

Exercise 11 First-order ambisonics encoding.

Using the encoding equations:

W = (1/√2) × s
X = cos(az) cos(el) × s
Y = sin(az) cos(el) × s
Z = sin(el) × s

a) Calculate the W, X, Y, Z channel values for a unit-amplitude source (s = 1) at: (i) front (az = 0°, el = 0°), (ii) left (az = 90°, el = 0°), (iii) rear (az = 180°, el = 0°), (iv) directly above (az = 0°, el = 90°).
b) Two sources are present simultaneously: source A at az = 30°, el = 0° with amplitude 0.8, and source B at az = −60°, el = 20° with amplitude 0.5. Calculate the total W, X, Y, Z values (ambisonic encoding is linear, so the channels sum).
c) Show mathematically that the encoding preserves total energy for a source in any direction: verify that W² + X² + Y² + Z² = s² × constant for any azimuth and elevation.
d) At azimuth = 45°, elevation = 0°, which channels (W, X, Y, Z) are equal in magnitude? What does this tell you about the relationship between X and Y at diagonal directions?
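A minimal encoder for checking parts (a)-(c), following the equations above (illustrative sketch):

```python
import numpy as np

def encode_foa(s, az_deg, el_deg):
    """First-order B-format (W, X, Y, Z) with the exercise's W gain of 1/sqrt(2)."""
    az, el = np.radians(az_deg), np.radians(el_deg)
    return np.array([s / np.sqrt(2),
                     s * np.cos(az) * np.cos(el),
                     s * np.sin(az) * np.cos(el),
                     s * np.sin(el)])

# Part (b): encoding is linear, so simultaneous sources sum channel by channel
print(encode_foa(0.8, 30, 0) + encode_foa(0.5, -60, 20))

# Part (c): the squared channel sum is 1.5 * s**2 regardless of direction
for az, el in [(0, 0), (45, 0), (123, -37)]:
    print(az, el, round(float(np.sum(encode_foa(1.0, az, el) ** 2)), 6))
```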

Exercise 12 Spherical harmonic order and spatial resolution.

a) First-order ambisonics has 4 channels; second-order has 9; N-th order has (N + 1)². List the channel counts for orders 1 through 7.
b) The spatial resolution of an N-th order ambisonic system is approximately 180°/N. Calculate the angular resolution for orders 1, 2, 3, 4, and 6.
c) Human hearing has an angular resolution of approximately 1–2° in the horizontal front hemisphere. What ambisonic order would be needed to match this resolution, and how many channels would it require?
d) Higher-order ambisonics provides good spatial resolution only near the "sweet spot" (the center of the decoding region), whose radius is approximately λ/2 at the highest reproduction frequency. For third-order ambisonics decoding up to 8 kHz (λ ≈ 4.3 cm), calculate the sweet-spot radius. What does this imply about listener positioning for HOA systems?
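The channel counts and the 180°/N resolution rule from parts (a), (b), and (d) in a few lines (illustrative):

```python
C_SOUND = 343.0

for n in range(1, 8):
    print(f"order {n}: {(n + 1) ** 2:2d} channels, "
          f"~{180.0 / n:5.1f} deg resolution")

# Part (d): sweet-spot radius ~ lambda/2 at the highest reproduced frequency
f_max = 8000.0
print(f"sweet spot radius ~ {C_SOUND / f_max / 2 * 100:.1f} cm")   # ~2.1 cm
```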

Exercise 13 Ambisonics decoding to speaker arrays.

The basic decoding of first-order ambisonics to N speakers uses a decoding matrix D in which each row is the encoding vector for that speaker's position. For N equally spaced horizontal speakers, the decoding gain for the speaker at angle θ_k is:

L_k = (1/N) × [W + X×cos(θ_k) + Y×sin(θ_k)]

a) Calculate the decoder output for each of four quadraphonic speakers (FL = +45°, FR = −45°, RL = +135°, RR = −135°) for a source encoded at azimuth = 60°, elevation = 0°. Which speakers receive the most energy?
b) The same decoder is applied to a source at azimuth = 180° (rear center). Calculate the speaker outputs. Are the front speakers receiving any signal? If so, why? Is this acoustically correct?
c) A decoder designed for 4 speakers is applied to a 2-speaker (stereo) setup by summing FL with RL (left channel) and FR with RR (right channel). What happens to the rear source from part (b) in this simplified stereo downmix?
d) Explain why the ambisonic format is "future-proof": if a consumer buys an ambisonic recording and upgrades from a 4-speaker to an 8-speaker system, the same recording can be decoded to the new array without any change to the recorded content.
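A sketch of the quadraphonic decode in part (a), applying the gain equation above at each speaker angle (illustrative; the speaker labels follow the exercise):

```python
import numpy as np

SPEAKERS = {"FL": 45.0, "FR": -45.0, "RL": 135.0, "RR": -135.0}

def encode_horizontal(s, az_deg):
    """Horizontal-only first-order encode (elevation = 0)."""
    az = np.radians(az_deg)
    return s / np.sqrt(2), s * np.cos(az), s * np.sin(az)

def decode(w, x, y):
    n = len(SPEAKERS)
    return {name: (w + x * np.cos(np.radians(a)) + y * np.sin(np.radians(a))) / n
            for name, a in SPEAKERS.items()}

w, x, y = encode_horizontal(1.0, 60)     # source at azimuth 60 deg
for name, gain in decode(w, x, y).items():
    print(f"{name}: {gain:+.3f}")        # FL and RL carry the most energy
```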

Exercise 14 The ambisonics_intro.py code implements first-order ambisonic encoding and binaural decoding.

a) Run the script (or trace through it manually). At azimuth = 90° (directly left), what are the W, X, Y values according to the encoding equations? Verify against the console output.
b) The binaural decode uses virtual microphone angles of +30° (left) and −30° (right). Trace through the decoding equations for a source at azimuth = 90° (left). Which decode channel (left or right) receives more energy? Is this physically correct?
c) The script plots the spherical harmonic patterns as polar diagrams. Describe the shape of the X-channel pattern. Where is it maximum? Where is it zero? How does this relate to the fact that X encodes front-back information?
d) A student modifies the binaural decode to use virtual microphone angles of ±90° (directly to the sides) instead of ±30°. What effect would this have on the perceived left-right image width of the decoded audio?

Exercise 15 Higher-order ambisonics and the Schroeder frequency analogy.

a) There is a conceptual analogy between the Schroeder frequency in room acoustics (the transition from a modal to a statistical description) and the sweet-spot limitation in ambisonics. Articulate this analogy: what is "modal" in ambisonics, and what is the "statistical" regime?
b) First-order ambisonics can be captured with a tetrahedral array of four cardioid microphone capsules (as in the Sennheiser Ambeo VR Mic, the Rode NT-SF1, and others). Explain why four capsules are the minimum needed to capture all four B-format channels.
c) A producer wants to create a spatial audio recording of a full symphony orchestra and must choose between (i) a first-order ambisonics microphone array at the conductor's position and (ii) a binaural dummy head at the same position. Compare the two approaches for spatial accuracy, playback-format flexibility, and tonal quality.
d) Dolby Atmos uses object-based audio rather than ambisonics, yet its rendering engine must ultimately convert objects to speaker signals using similar spatial mathematics. What is the fundamental difference in workflow between an ambisonics producer and a Dolby Atmos producer, and what are the advantages of each approach for different production contexts?


Part D: Spatial Audio Formats and Applications

Exercise 16 Object-based audio (Dolby Atmos) vs. channel-based audio (5.1/7.1).

a) A Dolby Atmos mix includes a solo violin positioned at azimuth = 0°, elevation = +30° (slightly above front center). In a cinema with a 7.1.4 speaker system, which speaker(s) would reproduce this object? In a home system with 5.1 speakers and no height channels, what happens to it?
b) The same Atmos mix is rendered to binaural headphones. What signal processing is applied to the violin object to convey its position at elevation = +30°?
c) A music producer is deciding between an Atmos mix and a first-order ambisonics mix. For a jazz quartet recording intended both for streaming and for live immersive playback at a 64-speaker venue, which format would you recommend, and why?
d) Dolby Atmos for Music was criticized in its early years for "gimmicky" spatial mixes that placed instruments in unusual positions, seemingly prioritizing novelty over musicality. From a physics perspective, is there an objective definition of "musically appropriate" spatial positioning? How should mix engineers decide where to place elements in three-dimensional space?

Exercise 17 Virtual reality audio requirements and constraints.

a) A VR system has the following latency budget: IMU sampling 2 ms, head-tracking computation 3 ms, HRTF interpolation 5 ms, audio convolution 8 ms, and audio buffer output 4 ms. What is the total end-to-end latency? Is it within the 25 ms perceptual limit for VR audio?
b) The audio convolution step processes 16 simultaneous sound objects, each requiring convolution with a 256-sample HRIR for each ear at a 48 kHz sample rate. Estimate the computational load in multiply-accumulate operations per second.
c) A VR game features a character speaking from behind a door (an occluded source). Describe the acoustic changes you would physically expect in the character's voice, and how you would implement them in real-time spatial audio processing.
d) Two players in a multiplayer VR game use generic rather than personalized HRTFs. Player A turns to the right, so Player B's acoustic avatar should now be localized to the left-rear. Player A reports that the voice sounds "somewhere to the left but oddly close." What HRTF-related artifact explains this experience?
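The latency and compute budgets in parts (a) and (b) reduce to a sum and a product (illustrative check):

```python
# Part (a): end-to-end latency is the sum of the pipeline stages
stages_ms = {"IMU": 2, "tracking": 3, "HRTF interpolation": 5,
             "convolution": 8, "output buffer": 4}
print(f"total = {sum(stages_ms.values())} ms (perceptual limit: 25 ms)")  # 22 ms

# Part (b): time-domain convolution, all objects, both ears
objects, ears, taps, fs = 16, 2, 256, 48000
print(f"{objects * ears * taps * fs / 1e9:.2f} GMAC/s")   # ~0.39 GMAC/s
```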

Exercise 18 Binaural audio on streaming platforms.

a) A streaming service delivers Dolby Atmos content to headphone listeners using a binaural render with a generic HRTF derived from averaging measurements across 100 subjects. Describe three specific ways this generic HRTF will fail to produce accurate spatial perception for any individual listener.
b) Apple Music streams Dolby Atmos as Dolby Digital Plus with Joint Object Coding (E-AC-3 JOC), which carries the audio streams together with spatial metadata (object positions). Explain how decoding and rendering are divided between the streaming server and the end device (iPhone + AirPods Pro). Which stages happen where?
c) YouTube supports 360° videos with first-order ambisonics audio. A user watches on a phone and rotates it to look in different directions within the 360° video. Explain the signal-processing chain that adapts the audio to the phone's orientation.
d) A music critic argues that spatial audio streaming reduces music to a "novelty experience" focused on the spatial gimmick rather than the music itself; a spatial audio advocate argues that it enables new forms of musical expression. Using specific examples of music that does and does not benefit from spatial audio, engage with both sides of this argument from a physics and psychoacoustics perspective.

Exercise 19 Wavefield Synthesis and its practical limitations.

a) Wavefield synthesis requires speaker spacing smaller than λ/2 at the highest reproduction frequency. For 16 kHz reproduction, calculate the required spacing. For a listening room 6 m wide, how many speakers would be needed along each side wall?
b) Practical WFS systems are typically limited to frequencies below about 8 kHz. Which spatial localization cue does this limitation most affect: ITD, ILD, or pinna filtering? Explain why.
c) A WFS system correctly reproduces the wave field for a listener at the center of the listening room. A second listener sits 1.5 m off-center. Qualitatively describe how the off-center spatial experience compares with that of the same off-center listener in a standard 5.1 system.
d) NHK's 22.2 system uses 24 channels: 22 full-range speaker channels arranged in three height layers plus two LFE channels. Compare this approach to first-order ambisonics in terms of channel count, spatial resolution, sweet-spot size, and format flexibility. Which is more "future-proof," and why?
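A quick check of the spacing arithmetic in part (a) (illustrative):

```python
C_SOUND = 343.0
f_max = 16000.0
spacing = C_SOUND / f_max / 2          # lambda/2 anti-aliasing criterion
wall = 6.0                             # wall length in metres
print(f"spacing <= {spacing * 1000:.1f} mm, "
      f"~{int(wall / spacing)} speakers per 6 m wall")   # ~10.7 mm, ~560 speakers
```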

Exercise 20 Auditory display: spatial audio beyond music.

a) A navigation application wants to replace its conventional voice prompts ("Turn left in 200 meters") with a spatial audio cue: a sound that appears to come from the direction of the upcoming turn. What spatial audio technology would be most appropriate, and what physical constraints must be satisfied, given that the user is wearing earbuds while walking?
b) Sonification maps data to sound. Design a spatial audio sonification for weather data in which temperature maps to pitch, wind speed to overall level, and wind direction to the azimuthal position of the sound. What limitations of the auditory system (spatial resolution, simultaneous-sound discrimination) would affect the usefulness of this design?
c) Aircraft collision-avoidance systems could use spatial audio to indicate the direction of a threat. What are the critical latency, accuracy, and individualization requirements for such a safety-critical spatial audio system, and how do they compare to those of consumer entertainment spatial audio?


Part E: Advanced Analysis and Critical Design

Exercise 21 Critically analyze the claim that "spatial audio is merely a gimmick."

a) Identify five specific musical genres or applications where three-dimensional spatial audio genuinely adds musical or experiential value that stereo cannot provide. For each, explain the physical mechanism that makes spatial audio superior.
b) Identify three specific musical genres or applications where stereo or mono might be preferable to spatial audio. Explain physically why spatial audio would not add value.
c) The Beatles' "Tomorrow Never Knows" (featured in Case Study 35-1) created spatial impression through studio processing decades before Dolby Atmos existed. What does this suggest about the relationship between spatial audio technology and musical creativity?
d) Write a physics-grounded argument that spatial audio's value lies not in novelty but in restoring acoustic information that stereo removes. Reference ITD, ILD, pinna filtering, and the natural acoustic environment.

Exercise 22 Design a spatial audio production system for a new type of musical performance: a "spatial symphony" in which the orchestra surrounds the audience in a 360-degree arrangement, with music specifically composed to use three-dimensional space as a compositional parameter.

a) Capture system: How would you microphone 80 musicians arranged in a circle around the audience? What microphone types and ambisonic capture techniques would you use?
b) Reproduction system: For a 400-seat venue, what speaker array would you specify? Consider the physics of spatial resolution and sweet-spot limitations.
c) Streaming delivery: How would you deliver this experience to home listeners on headphones? Specify the format chain from capture to headphone output.
d) Compositional implications: Give two examples of musical gestures that the 360° spatial arrangement enables that would be impossible in a conventional concert-hall arrangement.

Exercise 23 HRTF research frontier: the "listener-adaptive" HRTF.

A research team is developing a system that continuously adapts the HRTF used in spatial audio rendering based on listener feedback. After each test stimulus, the listener rates their perceived spatial accuracy, and the system adjusts the HRTF parameters toward better performance.

a) What HRTF parameters could be adapted? List at least four, and for each, describe how it affects spatial perception.
b) What perceptual task should the listener perform to provide meaningful feedback (pointing, localization accuracy, preference rating)? Justify your choice.
c) The system uses a database of 500 measured HRTFs and interpolates between them based on listener feedback. What mathematical technique would allow smooth interpolation between HRTFs in a high-dimensional parameter space?
d) A concern is raised: listeners might adapt to an inaccurate HRTF by "learning" to reinterpret its wrong cues as correct. How would you design the adaptation system to avoid this failure mode?

Exercise 24 The "cocktail party effect" and spatial audio.

a) The cocktail party effect (the ability to attend to one talker in a noisy, multi-talker environment) relies partly on spatial separation of sources. Using ITD, ILD, and pinna filtering, explain why spatially separated sources are easier to segregate perceptually than co-located sources.
b) A conventional hearing aid amplifies and equalizes sound; a spatially aware hearing aid uses HRTF-based spatial audio processing to enhance spatially separated sources. Describe the signal-processing chain for the spatially aware hearing aid and explain how it would improve speech intelligibility in a noisy restaurant.
c) In a binaural recording of a cocktail party, two talkers are located at azimuths of +30° and −30°, and a listener plays the recording over headphones. Using what you know about ITD and ILD at ±30°, estimate the binaural cues separating the two talkers and predict the degree of spatial separation the listener would perceive.
d) The same recording is played over stereo loudspeakers instead of headphones. Does the spatial separation of the two talkers still benefit from HRTF-based localization? Why or why not?
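For part (c), the cue difference between the two talkers can be estimated with the models from Exercises 1 and 2 (illustrative; the broadband ILD value is model-dependent):

```python
import numpy as np

R, C = 0.0875, 343.0

def itd_us(az_deg):
    th = np.radians(az_deg)
    return R / C * (np.sin(th) + th) * 1e6          # Woodworth, in microseconds

def ild_db(az_deg, f_hz, ild_max=20.0, f_c=1500.0):
    return ild_max * np.sin(np.radians(az_deg)) * (1 - np.exp(-f_hz / f_c))

# Talkers at +30 and -30 deg: the separating cue is twice the single-source value
print(f"ITD separation: {2 * itd_us(30):.0f} us")                 # ~522 us
print(f"ILD separation at 4 kHz: {2 * ild_db(30, 4000):.1f} dB")  # ~18.6 dB
```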

Exercise 25 (Capstone Design Problem) Design a complete spatial audio system for a new immersive music venue called "The Sphere" (not to be confused with the Las Vegas Sphere). The Sphere seats 300 audience members in a perfectly spherical room, with the performers at the center. The venue hosts contemporary music productions specifically composed for three-dimensional spatial audio.

Write a complete design specification addressing:

a) Acoustic design: What surface treatments are needed for the spherical room? What is the major acoustic problem with a perfectly spherical room (think about what happens to sound waves in a spherical enclosure) and how would you address it?

b) Speaker system: Design a speaker array for the spherical room. Specify: number of speaker positions, vertical and horizontal distribution (consider covering all audience positions including above and below performers), speaker types, and the ambisonics order needed to achieve 5-degree spatial resolution throughout the room.

c) Microphone/monitoring: During live performance, how do performers monitor themselves? The performers are at the center of a spherical room that may have significant reverberation. What monitoring solution allows them to hear clearly without contaminating the main speaker output?

d) Composition implications: The venue commissions three new works each season. Give a detailed description of one musical work specifically designed for The Sphere that exploits its three-dimensional spatial capability in a way that would be impossible in a conventional concert hall.

e) Accessibility: Some audience members use hearing aids that are not compatible with the venue's spatial audio system. Design an accessibility solution that delivers a high-quality audio experience to these listeners.


All exercises align with the learning objectives of Chapter 35. For quantitative exercises, show all steps and state assumptions. For design exercises, support every recommendation with physics-based reasoning.