Chapter 5 Quiz: Psychoacoustics — The Physics Inside Your Head
20 questions — use the hidden answers to check your understanding after attempting each question.
Question 1. What are "qualia," and why does the concept present a challenge for a purely physical explanation of music perception?
Answer
**Qualia** (singular: quale) are the subjective, experiential qualities of perception — the "what it is like" dimension of sensory experience. The redness of red, the pain of pain, the melancholy of a minor-key melody are qualia. They are first-person experiences that cannot be directly observed from the outside.

The challenge for physical explanation: a complete physical description of a sound wave, a cochlea's mechanical response, and a brain's neural firing pattern describes *what happens* without explaining *what it is like* to experience it. Even if you had complete knowledge of every neural state in a listener's brain when they hear a cello, philosopher David Chalmers argues, you would still face an explanatory gap — you have not explained why there is a subjective experience at all. This "hard problem of consciousness" applies directly to music: physics can describe everything about the stimulus and the neural response, but whether it explains the experience of musical beauty remains genuinely contested.

Question 2. Why does a 60 dB tone at 100 Hz sound quieter than a 60 dB tone at 1,000 Hz, even though they have the same measured intensity?
Answer
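One way to put numbers on this is the standard IEC 61672 A-weighting curve, which roughly approximates an inverted equal-loudness contour. This short sketch (illustrative only; A-weighting is a stand-in for true equal-loudness contours, not the chapter's own method) estimates the loudness penalty a 100 Hz tone suffers relative to 1,000 Hz:

```python
import math

def a_weight_db(f: float) -> float:
    """IEC 61672 A-weighting gain in dB at frequency f (Hz).
    A-weighting roughly tracks the inverse of an equal-loudness
    contour, so it serves here as a crude stand-in."""
    f2 = f * f
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * math.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return 20.0 * math.log10(ra) + 2.0

# Perceived-level penalty of a 100 Hz tone relative to 1,000 Hz:
penalty = a_weight_db(1000.0) - a_weight_db(100.0)
print(round(penalty, 1))  # about 19 dB, inside the 12-20 dB range cited below
```

The curve is normalized so that 1,000 Hz gets approximately 0 dB of weighting, which is why the difference at 100 Hz reads directly as the extra intensity needed there.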
Because the human ear is not equally sensitive to all frequencies. The **equal-loudness contours** (Fletcher-Munson curves) show that at any given physical intensity level, sounds at 1,000–4,000 Hz are perceived as much louder than sounds at 100 Hz at the same dB SPL. The physical reason: the ear canal resonates around 3–4 kHz (amplifying those frequencies), and the cochlea's mechanical sensitivity is lower at low frequencies for physiological reasons related to basilar membrane stiffness and fluid dynamics. At 100 Hz, you need approximately 12–20 dB more physical intensity than at 1,000 Hz to achieve the same perceived loudness (varying with the overall loudness level). This is why the bass drum in a live rock concert needs dramatically more amplifier power than the snare, even if you want them to seem equally loud.

Question 3. What is a "critical band" in auditory perception, and what is its physical basis in the cochlea?
Answer
A **critical band** is the frequency range processed by a single "auditory filter" — the region of the basilar membrane that responds to a given center frequency. Within a critical band, sound components interact strongly (they can mask each other, beat against each other, and are analyzed as a group). Across critical bands, sound components are processed more independently.

**Physical basis:** The basilar membrane varies in width and stiffness along its length, creating a mechanical frequency-to-position mapping. Each location responds to a range of frequencies determined by its mechanical properties — this range is the critical bandwidth. Below ~500 Hz, critical bands are approximately 100 Hz wide; above 500 Hz, they widen to about 20% of center frequency. The critical band is essentially the cochlea's frequency resolution limit — it defines how finely the cochlea can separate different frequency components.

Question 4. Explain simultaneous masking. Why is masking asymmetric, with low-frequency sounds tending to mask high frequencies more than the reverse?
Answer
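The asymmetry can be caricatured with a two-slope "excitation skirt". The slope values below are illustrative choices made only to show the shape of the effect; real excitation-pattern slopes vary with level and frequency:

```python
import math

def masked_threshold_db(masker_hz, masker_db, target_hz,
                        lower_slope=-27.0, upper_slope=-10.0):
    """Toy excitation-pattern skirt: estimated threshold elevation (dB)
    at target_hz caused by a masker. Slopes are in dB per octave and
    are illustrative; the shallower upper slope models the upward
    spread of masking."""
    octaves = math.log2(target_hz / masker_hz)
    slope = upper_slope if octaves > 0 else lower_slope
    return masker_db + slope * abs(octaves)

# An 80 dB masker, probed two octaves above vs. two octaves below it:
up = masked_threshold_db(200.0, 80.0, 800.0)    # 80 - 2*10 = 60 dB
down = masked_threshold_db(800.0, 80.0, 200.0)  # 80 - 2*27 = 26 dB
print(up > down)  # True: upward masking reaches much farther
```

With these toy numbers, a target two octaves above the masker must exceed 60 dB to be heard, while one two octaves below only needs to clear 26 dB.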
**Simultaneous masking** occurs when a loud sound (the masker) makes a softer sound (the target) at a nearby frequency inaudible. The physical mechanism: the masker produces a large peak of vibration on the basilar membrane, and the "skirts" of this vibration pattern extend toward higher frequencies on the membrane, overlapping with the region that would respond to the target and reducing the target's neural response below threshold.

**Asymmetry:** The basilar membrane's vibration pattern is inherently asymmetric — a tone of given frequency produces a peak that extends strongly toward higher frequencies (toward the base of the cochlea) but falls off more steeply toward lower frequencies. This means a loud low-frequency masker (say, 200 Hz at 80 dB) will effectively mask much higher frequencies (up to 1,000 Hz or beyond), while a high-frequency masker of the same level will not mask low-frequency sounds nearly as effectively. This "upward spread of masking" is a fundamental asymmetry of the basilar membrane's mechanical response.

Question 5. ⚠️ Misconception question: "Because MP3 compression discards audio data, MP3 recordings always sound worse than uncompressed files." True or false? Explain.
Answer
**False, or at least significantly overstated.** MP3 compression is a *perceptual* codec — it discards only audio information that the psychoacoustic masking model predicts to be inaudible. At sufficient bit rates (typically 128 kbit/s or above for most music), a well-encoded MP3 is often perceptually indistinguishable from the original uncompressed file in careful blind listening tests. The key insight: the data discarded by the codec was not perceived anyway, because it was masked by louder sounds in the same frequency region at the same time. If the masking model's predictions are accurate, the codec is simply removing what the listener couldn't hear in the first place — without subjective quality loss.

At low bit rates (below ~96 kbit/s), or with certain musical content (very sparse music like solo piano, or signals with little masking), artifacts may become audible: pre-ringing, muffled high frequencies, or "pumping" effects. But at adequate bit rates, the compression-quality difference is not fundamental — it is an engineering precision problem, not a theoretical impossibility.

Question 6. What are the two competing theories of pitch perception, and at what frequencies does each theory's mechanism predominate?
Answer
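The place map itself has a standard closed form: Greenwood's frequency-position function for the human cochlea. A small sketch (the parameter values are the commonly cited human ones; treat the endpoints as approximate):

```python
def greenwood_cf_hz(x: float) -> float:
    """Greenwood frequency-position map for the human cochlea.
    x is the fractional distance from the apex (0.0) to the base (1.0);
    returns the characteristic frequency at that place in Hz.
    Parameters A, a, k are the commonly cited human values."""
    A, a, k = 165.4, 2.1, 0.88
    return A * (10.0 ** (a * x) - k)

print(round(greenwood_cf_hz(0.0)))  # ~20 Hz at the apex
print(round(greenwood_cf_hz(1.0)))  # ~20,700 Hz at the base
```

The map conveniently spans roughly the 20 Hz to 20 kHz range of human hearing, with low frequencies at the apex and high frequencies at the base, exactly as place theory requires.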
**Place theory (tonotopic coding):** Pitch is determined by *where* on the basilar membrane maximum vibration occurs. High frequencies excite the base; low frequencies excite the apex. The brain reads the position of maximum activation and assigns pitch accordingly. Associated with Helmholtz and refined by Békésy. Predominates above ~4–5 kHz.

**Temporal theory (phase-locking / volley theory):** Pitch is determined by the *timing* of neural firing — auditory nerve fibers fire in synchrony with the period of the sound wave (phase-locking). The brain extracts pitch by measuring the interval between neural spikes. Predominates below ~500 Hz.

**Modern synthesis:** Both mechanisms operate. Temporal coding is more precise for fine pitch discrimination at low frequencies (where the basilar membrane's place resolution is coarser than our perception requires). Place coding dominates at high frequencies where phase-locking becomes unreliable. Between ~500 Hz and ~4–5 kHz, both contribute.

Question 7. What is the "missing fundamental"? Give a real-world musical example of where this phenomenon matters.
Answer
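For idealized integer-valued harmonics, the "common periodicity" is just their greatest common divisor, which makes the phenomenon easy to sketch. This is a deliberately simplified stand-in for the auditory system's actual temporal processing:

```python
from math import gcd

def residue_pitch_hz(harmonics_hz):
    """Common periodicity shared by a set of integer harmonic
    frequencies (Hz): the pitch heard even when that frequency
    component is physically absent from the sound."""
    pitch = 0
    for h in harmonics_hz:
        pitch = gcd(pitch, h)
    return pitch

# A phone speaker reproducing only the upper harmonics of a 110 Hz note:
print(residue_pitch_hz([220, 330, 440]))  # 110: the absent fundamental
```

Note that the answer is not simply "the lowest component present" (220 Hz here); the auditory system, like the GCD, infers the deeper shared period.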
The **missing fundamental** (also called residue pitch or virtual pitch) is the perceptual phenomenon where listeners clearly hear the pitch of a fundamental frequency that is physically absent from the sound, when the harmonics above it are present.

**Mechanism:** The harmonics 2f, 3f, 4f... all share a common periodicity at f. The auditory system's temporal processing extracts this common periodicity across multiple frequency channels and assigns a pitch corresponding to f — even though f itself produces no basilar membrane vibration.

**Real-world examples:**

- **Small speaker bass:** A laptop or phone speaker physically cannot produce 110 Hz (the bass guitar's A2, two octaves below concert A), yet the pitch is clearly heard because the harmonics (220, 330, 440 Hz) are present.
- **Telephone voice:** The telephone bandpass (300–3,400 Hz) cuts off most fundamental voice frequencies (male voice: 85–180 Hz), yet voices sound completely natural in pitch because the harmonics are transmitted.
- **Orchestral bass at a distance:** At the rear of a large hall, direct bass from cellos and basses attenuates, but harmonics carry farther; the bass pitch still seems fully present.

Question 8. Describe the Haas effect (precedence effect). How is this concept relevant to both concert hall design AND audio production?
Answer
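The production technique can be sketched directly. The function below is illustrative and sample-based (no real DSP library is assumed); it builds a stereo pair by delaying one channel of a mono signal by a Haas-range amount:

```python
def haas_widen(mono, delay_samples):
    """Build a stereo pair from a mono signal by delaying one channel.
    With a delay of roughly 10-40 ms, the two channels fuse into a
    single image that leans toward the earlier (left) channel."""
    left = list(mono)
    right = [0.0] * delay_samples + list(mono)
    left += [0.0] * delay_samples  # pad to equal length
    return left, right

sample_rate = 48_000
delay = int(0.020 * sample_rate)  # 20 ms: inside the Haas window
left, right = haas_widen([0.5, 0.25, -0.5], delay)
print(len(left) == len(right))  # True
```

Pushing the delay well past the fusion window (toward 80+ ms) would instead be heard as a discrete echo, which is exactly the boundary the Haas effect describes.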
The **Haas effect** (precedence effect) states that when the same sound arrives from two directions within approximately 80 milliseconds, the brain fuses them into a single auditory event and attributes the source location to the first-arriving sound. The later arrivals are "suppressed" as distinct events — they reinforce the perceived loudness without creating separate phantom images.

**Concert hall design:** Early reflections (arriving within 80 ms) are deliberately aimed at the audience to reinforce the direct sound from the stage. Because of the Haas effect, these reflections are not heard as echo — they make the music seem louder, warmer, and more enveloping, while the listener still "locates" the music at the stage (the first-arriving source).

**Audio production:** In a stereo mix, a technique called the **Haas delay** places the same sound slightly delayed in one channel (10–40 ms) relative to the other. The listener hears a single sound but with a strong stereo image — the sound "leans" toward the side with the first-arriving signal. This creates apparent stereo width from a mono source without using level differences alone. It exploits exactly the same perceptual mechanism as concert hall early reflections.

Question 9. What is auditory scene analysis? Name two specific "grouping principles" the auditory system uses to separate simultaneous sound sources.
Answer
**Auditory scene analysis** (formalized by Albert Bregman) is the process by which the auditory brain separates a complex acoustic mixture into distinct perceptual streams corresponding to distinct sound sources. It is the basis of the "cocktail party effect" — the ability to follow one voice or instrument among many.

**Grouping principles include:**

1. **Harmonicity:** Frequency components in simple integer ratios tend to be grouped together as a single stream (consistent with natural sound sources, which produce harmonic series). Non-harmonic components tend to be segregated as a separate stream.
2. **Common onset/offset:** Components that start and stop simultaneously are grouped together. This is why a piano chord sounds like one event — all harmonics share the same onset.
3. **Continuity:** A sound that is briefly interrupted tends to be perceptually "filled in" and heard as continuing through the interruption.
4. **Spatial location:** Components arriving from the same direction (same ITD and ILD) are grouped together as originating from one location.
5. **Spectral similarity (timbre):** Components with similar spectral shapes tend to be grouped — helping us track instruments with consistent timbral identity through a complex texture.

Question 10. What is the Plomp-Levelt theory of consonance? How does it differ from Helmholtz's original theory, and what aspect of consonance does neither theory fully explain?
Answer
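The quantitative core of the Plomp-Levelt revision fits in a few lines. The critical-bandwidth helper restates this chapter's rule of thumb (not a precise auditory-filter model):

```python
def critical_bandwidth_hz(center_hz):
    """Chapter rule of thumb: ~100 Hz wide below 500 Hz,
    ~20% of center frequency above."""
    return 100.0 if center_hz < 500.0 else 0.20 * center_hz

def max_roughness_spacing_hz(center_hz):
    """Plomp-Levelt: dissonance peaks when two partials are separated
    by roughly a quarter of the local critical bandwidth."""
    return 0.25 * critical_bandwidth_hz(center_hz)

for f in (250.0, 1000.0, 4000.0):
    print(f, max_roughness_spacing_hz(f))  # 25.0, 50.0, 200.0 Hz
```

Because the critical band widens with frequency, the same musical interval can be smooth in a high register and rough in a low one, which is why close harmony is voiced higher and bass parts favor wide intervals.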
**Helmholtz's theory:** Consonance and dissonance result from acoustic beating between the harmonics of simultaneously sounding notes. Intervals with harmonics that beat at rates of 25–40 Hz produce roughness (dissonance); intervals whose harmonics largely avoid beating produce smoothness (consonance).

**Plomp-Levelt revision (1965):** Confirmed that roughness is the core percept underlying dissonance, but refined the mechanism: roughness occurs specifically when frequency components fall *within the same critical band* — the cochlea's frequency resolution limit. The most dissonant interval is where the frequency difference between components equals about 25% of the critical bandwidth (producing beat rates in the roughness zone). This is a psychoacoustic derivation rather than a simple beating calculation.

**What neither theory fully explains:** The *cultural dimension* of consonance and dissonance. Both theories treat consonance as purely psychoacoustic (a result of cochlear mechanics), but cross-cultural research shows that the hierarchy of consonant intervals — which intervals are "beautiful" vs. "tense" — varies across musical cultures. Indonesian gamelan deliberately uses intervals that produce beating; Arabic maqam includes intervals Western ears perceive as dissonant but trained Arab ears hear as specific, expressive pitches. Roughness avoidance may be partially universal (physiological), but the full cultural architecture of consonance/dissonance is substantially learned.

Question 11. What is forward masking (post-masking), and how long does it last? What perceptual consequence does it have for listening to rapidly played musical sequences?
Answer
**Forward masking (post-masking)** is a temporal masking phenomenon where a loud sound continues to suppress the audibility of softer sounds that occur *after* it ends. The loud masker leaves the basilar membrane (and auditory nerve) in a state of reduced sensitivity, requiring 50–200 milliseconds to fully recover. During this recovery window, softer sounds at nearby frequencies are masked — they may be below the elevated threshold of hearing and go unperceived, even though they would be clearly audible without the prior loud masker.

**Musical consequences:**

- Very rapid passages (e.g., fast scales at the piano) played at high dynamic levels may have softer interior notes partially masked by the surrounding louder notes.
- Staccato articulation in dense textures may be partially masked — notes seem to blur even when the player clearly articulates them.
- Recording engineers must be aware that a loud transient (snare drum hit) can mask softer elements that follow within ~100 ms — timing placement of softer elements is important for their intelligibility.
- Pianissimo notes following a fortissimo passage take a moment to "emerge" from the perceptual masking, which skilled performers account for with slight timing and dynamic adjustments.

Question 12. Explain interaural time difference (ITD) and interaural level difference (ILD). Which is most important for localizing low-frequency sounds, and which for high-frequency sounds? Why?
Answer
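The ~660 microsecond figure can be recovered from a simple spherical-head model (Woodworth's formula). The head radius and speed of sound below are typical textbook values, not measurements:

```python
import math

def max_itd_seconds(head_radius_m=0.0875, c=343.0):
    """Woodworth's spherical-head model: the ITD for a source at
    azimuth theta is (r/c) * (theta + sin(theta)), combining the
    around-the-head path with the straight-line path. It peaks at
    theta = 90 degrees (source directly to one side)."""
    theta = math.pi / 2.0
    return (head_radius_m / c) * (theta + math.sin(theta))

print(round(max_itd_seconds() * 1e6))  # ~656 microseconds
```

A smaller head radius (a child's, for example) shrinks the maximum ITD proportionally, which is one reason localization cues are learned and recalibrated as the head grows.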
**Interaural Time Difference (ITD):** A sound arriving from one side reaches the nearer ear slightly before the farther ear. The maximum ITD (for a source directly to one side) is approximately 660 microseconds. The auditory brainstem measures this difference with precision in the tens-of-microseconds range, corresponding to angular resolution of a few degrees.

**Interaural Level Difference (ILD):** The head acoustically shadows the far ear from sounds arriving from the side, creating an intensity difference between the two ears. High-frequency sounds (wavelengths much smaller than the head) are effectively blocked by the head; low-frequency sounds (wavelengths larger than the head) diffract around the head with little shadowing.

**Dominance by frequency:**

- **Low frequencies (below ~1.5 kHz):** ILD is very small because low-frequency sounds diffract easily around the head. ITD is the dominant localization cue.
- **High frequencies (above ~1.5 kHz):** Phase-locking (which is needed to measure ITD from ongoing waves) becomes unreliable above this limit. ILD is the dominant cue.

The transition frequency (~1.5 kHz) corresponds to a wavelength comparable to the dimensions of the head — the natural acoustic boundary between the "large-head" and "small-head" scattering regimes.

Question 13. What is a Head-Related Transfer Function (HRTF), and why do HRTFs vary between individuals? What are the practical implications of HRTF variation for commercial 3D audio products?
Answer
A **Head-Related Transfer Function (HRTF)** is the frequency-dependent modification that the outer ear (pinna), head, and torso impose on sound arriving from a given direction. Because sound arrives at the ear via paths that depend on the sound's direction (elevation, azimuth, and distance), the HRTF encodes directional information in spectral form. The auditory system learns to use these spectral cues for elevation localization and front-back disambiguation — aspects of spatial hearing that ITD and ILD alone cannot provide.

**Why HRTFs vary:** Each person's pinna has a unique geometry (size, shape, depth of cavities). Head size, shoulder shape, and torso dimensions all affect the transfer function. Two people with identical ears might have different HRTFs due to different head or shoulder proportions.

**Practical implications:** Commercial 3D audio products (headphones with spatial audio, gaming headsets, VR audio) typically use a "generic" HRTF measured from an artificial head (a standard measurement mannequin). For users whose own HRTF closely matches the generic one, spatial audio sounds convincingly external and well-localized. For users with very different HRTF characteristics, sounds may seem to come from inside the head (poor externalization), be mislocalized in elevation, or have front-back confusions. Personalized HRTF measurement (using cameras to model ear shape, or measured with a probe microphone in the ear canal) significantly improves spatial audio for those users — but adds cost and complexity to the system.

Question 14. ⚠️ Misconception question: "Noise-canceling headphones create true silence by removing all sound." True or false? Explain.
Answer
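The latency limitation can be made concrete with a trigonometric identity: summing a sine with an inverted copy delayed by latency tau leaves a residual whose peak amplitude is 2|sin(pi * f * tau)| times the original. A sketch with an illustrative 50 microsecond processing latency (real systems differ):

```python
import math

def residual_fraction(freq_hz, latency_s):
    """Residual amplitude (as a fraction of the original) after adding
    an inverted copy delayed by latency_s:
      sin(2*pi*f*t) - sin(2*pi*f*(t - latency))
    has peak amplitude 2*|sin(pi*f*latency)|, so cancellation
    degrades as f * latency grows."""
    return 2.0 * abs(math.sin(math.pi * freq_hz * latency_s))

latency = 50e-6  # 50 microseconds, an illustrative processing delay
print(residual_fraction(100.0, latency))   # near 0: engine hum cancels well
print(residual_fraction(4000.0, latency))  # large: can even exceed 1.0
```

Note the high-frequency case: with enough phase error, the "cancellation" signal can exceed the original noise, which is why active cancellation is restricted to low frequencies and passive isolation handles the rest.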
**False.** Noise-canceling headphones use **destructive interference** — a microphone on the outside of the headphone measures incoming noise, electronic circuitry generates an anti-phase version of that noise signal (inverted waveform), and this anti-phase signal is combined with the incoming noise. When two sounds of equal amplitude are exactly 180 degrees out of phase, they cancel — reducing the perceived noise. However, "true silence" is not achieved, for several reasons:

- The cancellation is not perfect at all frequencies — it works best for low-frequency, steady, predictable noise (airplane engine hum, air conditioning) and poorly for high-frequency, rapid, or unpredictable sounds (speech, high-frequency hiss, traffic noise).
- The anti-phase signal must be generated in real time, so there is an inherent processing delay; faster-changing sounds cannot be fully canceled.
- Some noise always "leaks" through passive isolation (the headphone cup and seal attenuate, but do not eliminate, sound).

What noise-canceling headphones produce is not silence but **reduced ambient noise** — specifically in the low-frequency, steady-noise regime. Listeners often describe the result as "quiet" or "comfortable," which is a psychoacoustic description: the *perception* of noise has been reduced to a level the brain registers as quiet, even though measurable sound remains. This is an example of perception diverging from physical measurement.

Question 15. What is the temporal resolution limit of the auditory system, and what musical phenomena operate near or beyond this limit?
Answer
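The timing figures in this answer all come from one division; a trivial helper makes the comparison against the resolution limits explicit:

```python
def inter_onset_ms(events_per_second):
    """Time between successive onsets, in ms, for comparison with the
    ~2-5 ms resolution floor and the 10-20 ms integration window."""
    return 1000.0 / events_per_second

print(inter_onset_ms(16))  # 62.5 ms: each trill note clearly distinct
print(inter_onset_ms(40))  # 25.0 ms: approaching fusion into texture
```

At 16 events per second the onsets sit far above the integration window, so each note registers individually; at 40 per second the intervals approach the window and the sequence begins to fuse.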
The auditory system's **temporal resolution** — the minimum gap between two events for them to be heard as distinct — is approximately 2–5 milliseconds for clicks and transients. For the purposes of stream segregation and rhythmic perception, a broader "integration window" of 10–20 ms applies.

**Musical phenomena at or near this limit:**

- **Rapid ornaments and trills:** A fast trill at 16 notes per second has inter-note intervals of ~62 ms — well within individual perception. A trill at 40 notes per second (25 ms between notes) may begin to lose individual note distinctness, perceived instead as a texture.
- **Tremolo:** A tremolo at 16 Hz (an amplitude modulation) is perceived as rhythmic variation; at 40+ Hz it transforms into perceived timbre rather than rhythm — the notes fuse perceptually.
- **Very fast drum fills:** At extremely fast rates (machine-gun snare rolls approaching ~20 strokes per second, around 1,200 strokes per minute), the individual strokes begin to merge into a buzz rather than a sequence of distinct impacts.
- **Onset perception and timing "feel":** Recording engineers know that timing differences as small as 5–15 ms between tracks create perceptible changes in the "tightness" or "looseness" of a groove — well within the range of perceptual sensitivity.

Question 16. How does the auditory system's "cocktail party effect" relate to musical polyphony? What compositional strategies support or undermine the listener's ability to follow individual voices?
Answer
The **cocktail party effect** — the brain's ability to segregate and follow individual sound streams within a complex acoustic mixture — is the perceptual foundation of musical polyphony. Following a fugal voice, a bass line, or a countermelody in a dense texture requires auditory scene analysis: the brain groups the target stream's frequency components and tracks them through the texture.

**Compositional strategies that support stream segregation (make individual voices easier to follow):**

- Wide pitch register separation between voices (each voice occupies a distinct spectral region)
- Distinct timbres for different voices (different instruments or vowel qualities)
- Different rhythmic patterns between voices (creating different onset timing sequences)
- Melodic continuity within each voice (smooth, stepwise motion helps the brain "predict" and track the stream)

**Compositional strategies that undermine stream segregation (create fused textures):**

- All voices in the same pitch register
- All voices with the same timbre (e.g., all strings playing in unison-region pitches)
- Homophonic rhythm (all voices moving in the same rhythm, sounding like a chord block rather than independent lines)
- Dense close-position harmonies at low pitch (where critical bands are wider and components interfere more)

Bach's counterpoint is so effective partly because he reliably applies the stream-supporting strategies: clear register separation, smooth voice leading, and independent rhythmic activity in each voice.

Question 17. What is the "hard problem" of consciousness, and why does it persist even after psychoacoustics has successfully mapped detailed relationships between physical stimuli and perceptual responses?
Answer
The **hard problem of consciousness** (David Chalmers) asks why there is subjective experience at all — why physical processes in the brain are accompanied by any "inner experience" rather than simply processing information "in the dark." No matter how complete our knowledge of neural firing patterns, ion channels, and brain states, Chalmers argues, we have an explanatory gap: we haven't explained why those processes produce a felt experience.

Psychoacoustics makes genuine progress on the "easy problems" — explaining *what* someone perceives (which frequencies are heard as louder, which intervals sound consonant, how the brain separates simultaneous sources). This is enormous scientific progress. But the hard problem persists because explaining the perceptual response — even perfectly — leaves the felt quality of the experience unexplained.

Example: We know in great detail how the auditory system processes a minor-key melody. We know what neural circuits fire, what processing stages occur, what predictive mechanisms are involved. But we haven't explained why there is a felt sense of melancholy — why there is something it is like to hear that melody, why the processing doesn't just occur without any experiential accompaniment. This gap is not simply a matter of needing more research; it is, Chalmers argues, a conceptual gap that purely physical-functional accounts cannot bridge.

Question 18. Describe the physical construction of a Shepard tone illusion. What does it reveal about pitch perception?
Answer
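The construction can be sketched numerically: octave-spaced partials with a bell-shaped (here Gaussian) amplitude envelope over log-frequency. All specific parameter values below are illustrative choices, not prescribed by the chapter:

```python
import math

def shepard_components(chroma_semitones, n_octaves=5, base_hz=32.7,
                       center_octave=2.0, sigma_octaves=1.0):
    """One frame of a Shepard tone: (frequency, amplitude) pairs for
    octave-spaced partials whose amplitudes follow a Gaussian bell
    over log-frequency. As chroma_semitones advances 0 -> 12, every
    partial slides up one octave while the envelope stays fixed, so
    components fade in at the bottom and out at the top."""
    components = []
    for octave in range(n_octaves):
        pos = octave + chroma_semitones / 12.0      # position in octaves
        f = base_hz * (2.0 ** pos)
        amp = math.exp(-0.5 * ((pos - center_octave) / sigma_octaves) ** 2)
        components.append((f, amp))
    return components

frame = shepard_components(0.0)
amps = [a for _, a in frame]
print(max(amps) == amps[2])  # True: the middle octave is loudest
```

Rendering successive frames with chroma stepping 0, 1, 2, ... and wrapping at 12 produces the endlessly rising scale: chroma cycles while the fixed envelope keeps overall height constant.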
A **Shepard tone** is constructed from a set of sine waves separated by octaves (e.g., tones at C3, C4, C5, C6, C7) all sounding simultaneously, with a bell-shaped amplitude envelope that makes the middle octaves loudest and fades out the extreme octaves. As the perceived pitch "rises" by a semitone, the entire pattern shifts upward — but simultaneously, the highest components fade out at the top of the envelope and new components fade in at the bottom. The result is a pitch that seems to continuously ascend (or descend, if moving in the other direction) without ever actually getting higher or lower.

**What it reveals about pitch perception:**

1. **Octave equivalence:** Pitch is perceived as cyclic — C4 and C5 share a "chroma" (they are both C), even though C5 is "higher" than C4. The Shepard tone exploits this by cycling through the chroma (semitone steps) without committing to a specific octave. The brain perceives the changing chroma but is uncertain about the actual octave, creating the illusion of infinite ascent.
2. **Separation of chroma from height:** Pitch has (at least) two dimensions — chroma (what note it is: C, D, E...) and height (how high or low). The Shepard tone keeps chroma changing while holding height ambiguous, demonstrating that these are separable perceptual dimensions.
3. **The constructive nature of pitch perception:** The brain doesn't simply read off a physical frequency; it constructs pitch from ambiguous signals, and the Shepard tone reveals how that construction can be systematically fooled.

Question 19. Why does Romantic-era orchestral music (written for reverberant 19th-century concert halls) sometimes sound muddy when played in modern recording studios or home listening environments with short reverberation?
Answer
Romantic-era orchestral music was composed with specific acoustic assumptions — composers like Brahms, Bruckner, and Mahler heard and revised their music in large concert halls with RT60 of 2.0–2.5 seconds. Several compositional features presuppose this acoustic context:

**Slow harmonic rhythm:** Chords change slowly enough that the room's reverberant tail can sustain between changes without creating harmonic blur. At 2.0 seconds RT60, a harmonic change every 2 beats at a slow tempo is acoustically appropriate.

**Thick, sustained string writing:** Long, held string chords rely on the room's reverberation to "fill out" and sustain between bow changes. In a dry room, string bow changes are more audible and the texture seems thinner.

**Blended orchestral textures:** The room's late reverberation acoustically blends the different sections of the orchestra, creating the "mass" of sound that characterizes Romantic orchestral style. In a dry environment, the individual sections remain distinct rather than fusing into the characteristic orchestral sound.

**Musical consequence:** In a very dry environment (short reverberation), the slow harmonic rhythm feels static rather than spacious, inner string voices seem pedestrian rather than rich, and the overall texture feels "thick" and "muddy" — because the music was scored densely on the assumption that the room's reverberation would blend and soften it. This is also why these works need added reverb in recordings — without it, they sound uncharacteristically lean or, paradoxically, cluttered.

Question 20. What does the comparison between auditory streaming and particle track reconstruction (from the Choir & Particle Accelerator running example) reveal about the nature of information-processing problems in science?