> "We do not see what is there; we see what we expect to see. And we do not hear what is there; we hear what we expect to hear."

Chapter 5: Psychoacoustics — The Physics Inside Your Head

"We do not see what is there; we see what we expect to see. And we do not hear what is there; we hear what we expect to hear." — R. Murray Schafer, The Tuning of the World

Stand next to a busy highway and watch a conversation happen a few meters away. You can see the speakers' mouths moving, their gestures, the expressions on their faces. But you can barely hear them — the roar of traffic overwhelms their voices. Now watch as they step behind a large concrete wall that blocks the direct traffic noise: suddenly you hear them clearly, even though you are no closer than before. The wall didn't create silence; there's still abundant ambient sound. But your brain, now able to extract the speakers' voices from the residual noise, reconstructs a fully intelligible conversation.

Or consider this: a bass guitar playing the note A at 110 Hz (two octaves below concert A at 440 Hz) through a tiny laptop speaker that physically cannot reproduce frequencies below 200 Hz. The laptop, in a strict physical sense, cannot produce the fundamental frequency of the note being played. Yet you hear the bass note perfectly clearly — its pitch, its timbre, its musical function. How?

Or this: two tuning forks, both vibrating near 440 Hz but one 2 Hz faster than the other. Hold them to your ears simultaneously, and instead of a confusing mixture, you hear a single tone at roughly 440 Hz that slowly pulses in loudness — a gentle beating, one throb every half second. Your auditory system has performed a calculation that no simple physical law predicts, producing a perceptual event quite unlike the physical stimulus.

These are problems in psychoacoustics — the study of how physical sound stimuli produce perceptual experiences. It is the bridge discipline between physics and psychology, between the objective world of pressure waves and the subjective world of musical experience. Understanding psychoacoustics won't just explain the curiosities above; it will explain why we find certain chord progressions tense and others relaxing, why noise-canceling headphones work, why digital audio compression doesn't noticeably degrade sound quality, and why the same note played by a violin and a clarinet sounds completely different even when they're at the same pitch and loudness.

This chapter is where the physics of the first four chapters meets the biology and psychology of perception — and where things get genuinely strange and wonderful.


5.1 From Physics to Perception: The Problem of Qualia — Introducing the "Hard Problem" in a Musical Context

Physics is very good at describing what a sound wave is: its frequency, amplitude, waveform, spectrum, phase relationships. Physics can tell you everything about the pressure variations propagating through air from a cello to your eardrum. But physics says nothing about what it is like to hear a cello — the particular warmth and longing of its tone, the way a well-played phrase can make your chest tighten with an emotion you can't quite name.

This gap — between the objective physical description of a stimulus and the subjective experiential quality it produces — is what philosopher David Chalmers famously called the hard problem of consciousness. The physical description of a red light wave (wavelength ~700 nm) tells you nothing about the subjective experience of redness. Similarly, the physical description of a minor key melody tells you nothing about the experience of melancholy that melody might produce.

Philosophers call these subjective experiential qualities qualia (singular: quale). The quale of middle C played on a Stradivarius cello is the "what-it-is-likeness" of that experience — the ineffable subjective character that mere physics cannot capture.

Psychoacoustics as the Middleman

Psychoacoustics doesn't solve the hard problem — no field does. But it maps the relationship between objective physical quantities and subjective perceptual quantities with remarkable precision, revealing that the mapping is far more complex than simple correspondence. It turns out that:

  • Equal physical intensity does not produce equal perceived loudness — because the ear is far more sensitive at some frequencies than others
  • Equal physical frequency does not produce equal perceived pitch intervals — because the auditory system's frequency analysis is not linear
  • Physically present sound can be perceptually absent — masked by other, louder sounds
  • Physically absent sound can be perceptually present — when the auditory system "fills in" missing information

These discoveries don't resolve the hard problem, but they give it structure — revealing that the gap between physics and experience is not arbitrary. It is shaped by the evolutionary history of hearing, the mechanical properties of the cochlea, and the information-processing architecture of the auditory cortex. Understanding that structure is the goal of this chapter.

💡 Key Insight: Perception is not passive reception. The brain does not simply record acoustic information like a tape recorder and play it back. It actively processes, interprets, compresses, and reconstructs — filling in missing information, segregating competing signals, extracting patterns, and generating predictions. What you "hear" is as much a product of your brain's activity as of the physical sound in the room.


5.2 The Equal-Loudness Contours — Why 60 dB at 1 kHz Sounds Louder Than 60 dB at 100 Hz

In Chapter 2, we established the decibel scale as a measure of sound intensity. We treated intensity as equivalent to loudness, which was a useful approximation. Now we need to correct it, because the human ear is not equally sensitive at all frequencies.

The Fletcher-Munson Curves

In 1933, Harvey Fletcher and Wilden Munson at Bell Laboratories published a landmark study: they asked listeners to adjust the level of pure tones at various frequencies until those tones seemed equally loud to a 1,000 Hz reference tone at a fixed level. What they found was startling.

The ear is dramatically more sensitive to sounds in the 2,000–5,000 Hz range than to sounds at very low or very high frequencies. To produce the same perceived loudness as a 1 kHz tone at 60 dB, you need:

  • A 100 Hz tone at approximately 72 dB (12 dB louder physically)
  • A 50 Hz tone at approximately 82 dB (22 dB louder physically)
  • A 4,000 Hz tone at approximately 57 dB (3 dB quieter physically — this is near the peak of the ear's sensitivity)
  • A 10,000 Hz tone at approximately 71 dB (11 dB louder physically)

📊 Data/Formula Box: Equal-Loudness Contours (Phons)

The phon is the unit of perceived loudness level, defined so that a tone at any frequency has a loudness level of N phons if it sounds as loud as a 1 kHz tone at N dB SPL.

The equal-loudness contours (also called Fletcher-Munson curves, updated as ISO 226 curves) are lines connecting frequency-SPL combinations that produce equal loudness. The contours are not flat — they dip steeply at low frequencies (meaning you need much more SPL to hear bass as loud) and have a region of increased sensitivity around 3–4 kHz.

Why 3–4 kHz? The human ear canal (external auditory canal) is a tube approximately 2.5 cm long, closed at one end (the eardrum). Like the tube resonances we studied in Chapter 3, this creates a resonant peak at the frequency whose quarter-wavelength equals the canal length: f ≈ 343 / (4 × 0.025) ≈ 3,430 Hz. The ear canal's resonance amplifies sounds in the 3–4 kHz range, which is why this region feels naturally loud and bright.
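
The quarter-wavelength estimate is easy to check numerically. Here is a minimal sketch (plain Python; the canal length and speed of sound are the approximate values quoted above):

```python
# Fundamental resonance of a tube closed at one end (quarter-wave resonator).
# Approximate values from the text: speed of sound 343 m/s, canal length 2.5 cm.
SPEED_OF_SOUND = 343.0   # m/s in air at roughly 20 degrees C
CANAL_LENGTH = 0.025     # m (about 2.5 cm)

def quarter_wave_resonance(length_m, c=SPEED_OF_SOUND):
    """Closed-open tube fundamental: f = c / (4 * L)."""
    return c / (4.0 * length_m)

print(f"Ear canal resonance: {quarter_wave_resonance(CANAL_LENGTH):.0f} Hz")
# -> Ear canal resonance: 3430 Hz, squarely inside the 3-4 kHz sensitivity peak
```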

Musical Implications of Equal-Loudness

The frequency-dependent sensitivity of the ear has profound musical implications:

Bass at low volumes disappears. At low playback levels — say, a 30 phon listening level — the ear is far less sensitive to low frequencies than at high listening levels. This is why the familiar "bass boost" on inexpensive audio equipment makes music seem "fuller" at low volumes: it compensates for the ear's reduced bass sensitivity, mimicking the perception you'd have at higher playback levels.

The "loudness" button on vintage stereo equipment was actually implementing this compensation — boosting bass (and sometimes treble) at low volume to maintain consistent perceived tonal balance across listening levels.

Orchestral balance depends on volume. A well-balanced orchestral score sounds balanced because it was heard (and balanced) at specific rehearsal and performance levels. If you listen to the same recording very softly, the bass and low strings seem to recede; the upper woodwinds and strings dominate. If you listen very loudly, the bass seems to surge. The perceived orchestral balance is not fixed by the score; it is also determined by the listening level.

⚠️ Common Misconception: Sound engineers turn up the bass to make music sound "better." In professional audio, bass boost is usually applied to compensate for a specific problem (poor room acoustics, small speakers, a specific listening context) rather than as a universal improvement. Flat frequency response — reproducing all frequencies equally — is the professional standard; bass boost relative to that flat response is context-specific correction.


5.3 Critical Bands and Auditory Filters — How the Cochlea Acts as a Spectrum Analyzer

The cochlea — the snail-shaped organ in the inner ear — is one of the most extraordinary structures in biology. About 35 millimeters long when unrolled, it contains approximately 15,000–20,000 hair cells that convert mechanical vibration into neural signals. But the cochlea is not just a passive transducer. Its mechanical properties make it a sophisticated spectrum analyzer — a device that separates a complex sound into its component frequencies.

The Basilar Membrane: A Mechanical Fourier Transform

Running the length of the cochlea is the basilar membrane — a structure that varies in width and stiffness along its length. The narrow, stiff end near the oval window responds best to high frequencies; the wide, floppy far end (the apex) responds best to low frequencies. The gradient in between produces a continuous mapping of frequency to position along the membrane.

When a complex sound enters the cochlea, the basilar membrane vibrates with maximum amplitude at the position that corresponds to each frequency component of the sound. High frequencies produce peak vibration near the base; low frequencies near the apex. The hair cells at each position along the membrane then signal the brain about the vibration intensity at their location — effectively reporting the amplitude of each frequency component at its corresponding position.

This is, remarkably, a mechanical implementation of the Fourier analysis we discussed in Chapter 3: the cochlea physically decomposes a complex sound wave into its sinusoidal components. The brain then receives a "frequency map" rather than a raw pressure waveform.
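
Software offers a direct analogue of this mechanical decomposition. The following sketch (NumPy; the 200/300/400 Hz mixture and its amplitudes are arbitrary choices for illustration) builds a complex tone and recovers its components with a discrete Fourier transform, the computational counterpart of the basilar membrane's frequency map:

```python
import numpy as np

fs = 8000                       # sample rate (Hz); one second of signal below
t = np.arange(fs) / fs
# A complex tone: three sinusoids, like partials arriving at the cochlea
signal = (1.00 * np.sin(2 * np.pi * 200 * t)
          + 0.50 * np.sin(2 * np.pi * 300 * t)
          + 0.25 * np.sin(2 * np.pi * 400 * t))

# The DFT plays the role of the basilar membrane: energy sorted by frequency.
amplitudes = np.abs(np.fft.rfft(signal)) / (len(signal) / 2)
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# Report components carrying significant energy, the software counterpart of
# hair cells signalling vigorous motion at their place along the membrane.
for f, a in zip(freqs, amplitudes):
    if a > 0.1:
        print(f"{f:6.0f} Hz  amplitude {a:.2f}")
```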

Critical Bands: Limits of Frequency Resolution

The cochlea's frequency analysis is not infinitely precise. There is a minimum frequency difference below which two simultaneous tones are not perceived as separate: they fall within the same critical band — the frequency range processed by a single "auditory filter" at any given position on the basilar membrane.

📊 Data/Formula Box: Critical Bands

The bandwidth of the auditory filter at a given center frequency is called the Critical Bandwidth (CBW). Roughly:

  • At 100 Hz center frequency: CBW ≈ 100 Hz (small in absolute terms, but a full 100% of the center frequency)
  • At 1,000 Hz: CBW ≈ 128 Hz
  • At 4,000 Hz: CBW ≈ 700 Hz
  • At 10,000 Hz: CBW ≈ 2,500 Hz

In general: Below about 500 Hz, critical bands are approximately constant at ~100 Hz wide. Above 500 Hz, they widen to approximately 20% of the center frequency (i.e., the critical band is roughly a quarter to a third of an octave wide).
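
A convenient smooth approximation to these bandwidths is the Glasberg–Moore equivalent rectangular bandwidth (ERB) formula. One caveat: the ERB is a related but distinct measure that runs somewhat narrower than the classic critical-band values tabulated above, so this sketch illustrates the trend rather than reproducing those numbers:

```python
def erb(f_hz):
    """Glasberg & Moore (1990) equivalent rectangular bandwidth, in Hz:
    ERB = 24.7 * (4.37 * f/1000 + 1). A smooth stand-in for the auditory
    filter bandwidth; narrower than the classic values tabulated above."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

for f in (100, 1000, 4000, 10000):
    bw = erb(f)
    print(f"{f:6d} Hz center: ERB ~ {bw:5.0f} Hz  ({100 * bw / f:5.1f}% of center)")
```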

This explains why musical harmony sounds different in different registers. Two notes forming a minor second (semitone, ~6% frequency ratio) in the middle register (say, E4 and F4) sound sharply dissonant — their frequency difference falls within a single critical band, causing strong interference on the basilar membrane. The same interval in the bass register (E2, F2) sounds even harsher, because the critical band in that region is far wider relative to the interval's small frequency gap, producing even stronger interaction. This is why jazz and classical composers tend to write closely-spaced chords in the middle and upper registers, with much wider spacing in the bass.


5.4 Masking: When Sounds Hide Other Sounds

One of the most important and practically consequential phenomena in psychoacoustics is masking: the process by which the presence of one sound reduces or eliminates the audibility of another. Masking is the physical basis for audio compression formats like MP3, and it explains both why whispered conversations are hard to hear in a crowd and why certain musical textures are more transparent than others.

Simultaneous Masking

When two sounds are present at the same time, a louder sound can make a softer sound at a nearby frequency inaudible — this is simultaneous masking. The physical mechanism: the louder sound produces a large peak of vibration on the basilar membrane, and the side-skirts of this peak overlap with the region that would respond to the softer sound, reducing the softer sound's basilar membrane response below the threshold needed to trigger neural signals.

Key features of simultaneous masking:

  • Asymmetric upward spread: A loud low-frequency tone tends to mask higher frequencies more than lower ones. A 200 Hz masker at 80 dB will mask tones up to 1000 Hz or beyond; a 1000 Hz masker at 80 dB will not mask tones at 200 Hz nearly as effectively.
  • Frequency specificity: Masking is strongest when the masker and target are at similar frequencies (within the same critical band), and decreases as frequency separation increases.

Temporal Masking: Before and After

Sound masking isn't limited to simultaneous sounds. Pre-masking (or backward masking) occurs when a loud sound masks a softer sound that occurred up to about 20 milliseconds before it. This seems paradoxical — how can something mask what happened before it? The answer lies in neural processing timing: the cochlea and auditory nerve take a few milliseconds to respond to a sound, and a very loud sound arriving shortly after can "swamp" the neural channels before the softer earlier sound's signals have fully propagated.

Post-masking (forward masking) is more intuitive: a loud sound can continue to mask a softer subsequent sound for 50–200 milliseconds after the loud sound ends. The basilar membrane, after vigorous stimulation, requires time to recover its sensitivity.

💡 Key Insight: Masking and music production. A skilled audio engineer uses masking strategically. Instruments that occupy the same frequency band compete for audibility — they mask each other. The engineer's job is partly to ensure that critical elements (lead vocal, melodic line, bass groove) are in frequency regions where they are not heavily masked by other elements, either by choice of instrumentation, equalization, or arrangement. When a song "sounds muddy," it often means too many elements are competing in the same frequency band, mutually masking each other.

MP3 Compression and Psychoacoustic Masking

The MP3 audio format (and its successors: AAC, Ogg Vorbis, Opus) achieves dramatic file size reductions by exploiting masking. A perceptual audio codec analyzes the audio in real time, determines what is audible and what is masked, and discards the masked components. If a loud 1 kHz tone is masking everything within a certain frequency range, the codec doesn't bother encoding those masked frequencies — it saves the bits, and the listener, whose auditory system would have masked those frequencies anyway, doesn't notice.

A typical MP3 at 128 kbit/s discards roughly 90% of the original audio data. The fact that a well-encoded 128 kbit/s MP3 is often indistinguishable from the original is a testament to how accurately psychoacoustic masking models capture real auditory behavior.
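
Real codec models are far more elaborate, but the core decision — compare each component's level against a masking threshold and drop what falls below — can be shown in a toy sketch. Everything numeric here is an assumption for illustration (the peak offset and slopes are invented placeholders, not the MP3 model); only the asymmetric upward spread mirrors the behavior described above:

```python
import numpy as np

def toy_masking_threshold(masker_freq, masker_db, freqs):
    """Crude masking threshold (dB) around one loud tone. The -14 dB peak
    offset and the slopes are invented placeholders, not the MP3 model;
    only the asymmetry (shallow above, steep below) mirrors Section 5.4."""
    octaves = np.log2(freqs / masker_freq)
    slope = np.where(octaves >= 0, -10.0, 30.0)   # dB per octave
    return masker_db - 14.0 + slope * octaves

freqs = np.array([250.0, 500.0, 900.0, 1000.0, 1100.0, 2000.0, 4000.0])
levels = np.array([30.0, 35.0, 45.0, 80.0, 42.0, 50.0, 38.0])  # dB SPL

threshold = toy_masking_threshold(1000.0, 80.0, freqs)
for f, lvl, thr in zip(freqs, levels, threshold):
    verdict = "keep (audible)" if lvl > thr else "drop (masked)"
    print(f"{f:6.0f} Hz: {lvl:4.0f} dB vs threshold {thr:5.1f} dB -> {verdict}")
```

Note how the components just above the 80 dB masker at 1 kHz are dropped even at fairly high levels, while the distant 250 Hz component survives: downward masking dies off quickly, upward masking spreads far.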

🔵 Try It Yourself: Observe Masking in Music Listen to a complex pop or orchestral recording, then focus on a single quiet instrument (maybe a triangle, a quiet guitar countermelody, or a distant wind instrument). Now try to find a passage where that instrument plays alone or nearly alone — notice how clearly you can now hear details you couldn't perceive before in the full texture. The masking effects from the full ensemble were suppressing your perception of those details. You hear more when there's less.


5.5 Pitch Perception: Two Competing Theories — Place Theory vs. Temporal Theory

How do we hear pitch? This question — seemingly simple — turned out to be one of the most contested problems in auditory science, spawning a debate that lasted nearly 150 years and was only partially resolved in recent decades.

Place Theory (Helmholtz and the Basilar Membrane)

The intuitive answer: pitch is determined by where on the basilar membrane maximum vibration occurs. High-frequency sounds produce peak activity near the base; low-frequency sounds near the apex. The brain reads out this position and assigns pitch accordingly.

This place theory (associated with Hermann von Helmholtz, who proposed a proto-version in 1863, later developed by Georg von Békésy, who won the 1961 Nobel Prize for work on the basilar membrane) has a lot going for it. The cochlea demonstrably is a frequency-to-place converter; cochlear implants exploit this by electrically stimulating different cochlear positions to convey different pitches. Place theory can naturally explain why we perceive pitch accurately across a wide frequency range.

But it has a problem: at low frequencies (below about 4,000 Hz), the basilar membrane's frequency resolution is coarser than our pitch perception. We can discriminate frequency differences of just a few hertz around 1,000 Hz; the basilar membrane's mechanical peak is too broad to account for this precision.

Temporal Theory (Volley Theory)

The alternative: pitch is determined by the timing of neural firing patterns. Hair cells, when stimulated, trigger action potentials in the auditory nerve. These action potentials fire at regular intervals corresponding to the period of the sound wave — a process called phase-locking. The brain extracts pitch by measuring the time between neural spikes.

This temporal (or volley) theory explains fine frequency discrimination at low frequencies excellently. A 1,000 Hz tone produces neural firing precisely timed to the 1 ms period of the wave; a 1,001 Hz tone produces firing timed to 0.999 ms. The difference is detectable.

But temporal theory has its own problem: phase-locking in the auditory nerve becomes unreliable above about 4,000–5,000 Hz (neural mechanisms can't fire fast enough to track high-frequency waveforms). So temporal theory alone can't explain pitch perception above this limit.

The Modern Synthesis

Contemporary auditory neuroscience accepts that both mechanisms operate: place coding dominates at high frequencies (above ~4–5 kHz), temporal coding (phase-locking) dominates at low frequencies, and there is a transition region in between where both contribute. This is not an unsatisfying compromise but a genuine insight: the auditory system apparently evolved two complementary mechanisms for pitch extraction, each optimized for a different frequency range. The brain integrates information from both to produce unified pitch perception across the full audible range.

📊 Data/Formula Box: Frequency Ranges of Pitch Mechanisms

| Frequency range | Dominant mechanism | Notes |
|---|---|---|
| Below ~500 Hz | Temporal (phase-locking) | Very precise, but limited by neural firing rates |
| 500 Hz – 4 kHz | Both contribute | Transition region; both mechanisms available |
| Above ~4–5 kHz | Place (tonotopic map) | Phase-locking unreliable above this limit |

5.6 The Missing Fundamental — You Can Perceive a Pitch Even When It's Absent

One of the most striking demonstrations in psychoacoustics is the missing fundamental phenomenon: play a complex tone consisting only of the 2nd, 3rd, 4th, and 5th harmonics — 200, 300, 400, and 500 Hz — without any energy at the fundamental (100 Hz), and most listeners will clearly perceive the pitch of 100 Hz. The fundamental is physically absent from the signal. The ear — or more precisely, the brain — reconstructs it.

How Does This Work?

The harmonics 200, 300, 400, 500 Hz all share a common period: the waveform they form together repeats every 10 ms, i.e., 100 times per second (because each component is an integer multiple of 100 Hz). The auditory system's temporal processing machinery detects this common periodicity in the neural firing patterns triggered by these harmonics and assigns a pitch corresponding to the fundamental period.

This is a remarkable feat of inference. The brain doesn't just passively read out basilar membrane positions; it performs a computation across multiple frequency channels, extracting the pattern of relationships and deriving the implied fundamental. This computation — called residue pitch or virtual pitch — is what makes telephones intelligible despite filtering out almost all energy below 300 Hz. The bass frequencies of the voice are not transmitted, but the brain reconstructs the implied pitch from the harmonics that are transmitted.

🔵 Try It Yourself: Hear the Missing Fundamental Many free tone generator apps allow you to specify individual sine wave frequencies. Create a combination of sine tones at 300, 400, 500, and 600 Hz (no 100 Hz tone!). What pitch do you hear? Most listeners report a clear 100 Hz pitch — well over an octave below the lowest component present. Now remove the 300 Hz component. Does the perceived pitch change? Try other combinations and notice how robustly the auditory system extracts the implied fundamental.
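
If you'd rather script the demonstration than use a tone-generator app, here is a minimal NumPy sketch that writes the four-component tone to a WAV file (the file name, duration, and amplitudes are arbitrary choices):

```python
import numpy as np
import wave

fs = 44100
t = np.arange(3 * fs) / fs                    # 3 seconds
components = [300, 400, 500, 600]             # harmonics 3-6 of 100 Hz
tone = sum(np.sin(2 * np.pi * f * t) for f in components)
tone /= len(components)                       # keep samples within [-1, 1]

# Note: no energy at 100 Hz exists anywhere in this signal, yet most
# listeners will report a 100 Hz pitch when the file is played back.
samples = (0.8 * 32767 * tone).astype(np.int16)
with wave.open("missing_fundamental.wav", "wb") as w:
    w.setnchannels(1)        # mono
    w.setsampwidth(2)        # 16-bit
    w.setframerate(fs)
    w.writeframes(samples.tobytes())
print("Wrote missing_fundamental.wav")
```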

Musical Consequences of the Missing Fundamental

The missing fundamental explains several important musical phenomena:

Small speaker bass. As noted in the chapter introduction, a tiny laptop speaker that physically cannot produce 110 Hz still lets you hear the pitch of a bass guitar playing that low A. The harmonics (220, 330, 440, 550 Hz) are present; the brain extracts the implied 110 Hz pitch. This is also why vintage AM radio, with its very limited bass response, could still convey bass guitar and cello lines intelligibly.

Telephone voice pitch. The telephone system's frequency response cuts off below ~300 Hz. Yet you have no trouble recognizing the pitch and identity of voices through a telephone. The fundamental frequency of a male speaking voice is typically 85–180 Hz — often below the telephone cutoff. But the harmonics are present, and the auditory system reconstructs the implied pitch.

Orchestral bass depth perception at a distance. In a large concert hall, the low fundamentals of a cello or contrabass often reach a distant listener only weakly: instruments that are small compared to the wavelengths they produce radiate low frequencies inefficiently, and absorption near the seating plane thins the direct bass further. At the back of a large hall, you may be hearing primarily harmonics of the bass instruments — yet the pitch still seems fully present, because the residue pitch mechanism fills it in.

⚠️ Common Misconception: "Low-quality speakers can't reproduce bass." While it's true that physically moving air at 40 Hz requires a large speaker cone, the perceptual reality is more nuanced. If a speaker can reproduce the harmonics of a bass sound — even if not the fundamental — the auditory system often supplies the missing low pitch. High-quality bass reproduction requires a good subwoofer; but some bass perception can occur even without one, through the missing fundamental mechanism.


5.7 Auditory Scene Analysis — How We Separate Multiple Sound Sources

Walk into a busy restaurant. Multiple conversations are happening simultaneously. Music is playing. Dishes are clattering. Yet you can, with effort, follow a single conversation across the table. How does the brain accomplish this seemingly impossible acoustic separation task?

This question — how the auditory system segregates a complex acoustic mixture into distinct perceptual streams corresponding to distinct sound sources — was formalized by psychologist Albert Bregman in his landmark 1990 book Auditory Scene Analysis. Bregman identified a set of principles the auditory system uses to group acoustic components into streams:

Principles of Auditory Grouping

Harmonicity: Frequency components whose frequencies are in simple integer ratios tend to be grouped together as originating from a single source (because most natural sound sources produce harmonic series). Components that don't fit into a harmonic series tend to be heard as a separate stream.

Common onset/offset: Frequency components that start and stop at the same time tend to be grouped together. This is why a piano note sounds like a single event even though it's physically a complex mixture of harmonics: all harmonics begin together (onset) at the moment of key strike.

Continuity: An ongoing sound tends to remain in the same perceptual stream even if briefly interrupted by another sound. The auditory system "bridges the gap," treating the interrupted sound as continuous.

Spatial location: Frequency components arriving from the same direction (same head-related transfer function) tend to be grouped together as originating from a common location. This is one reason why stereo reproduction improves the intelligibility of speech against background noise — the spatial separation of signals aids stream segregation.

Similarity of timbre and spectral shape: Components with similar spectral characteristics tend to be grouped. This helps you follow a violin melody through an orchestra: the violin's distinctive spectral envelope — its timbre — marks its components as a consistent stream.

💡 Key Insight: Auditory streaming is the foundation of musical texture. When a composer writes a fugue with four independent voices, the ability of the listener to follow each voice separately depends entirely on auditory stream segregation. Bach's success in writing polyphony that sounds like four distinct melodic lines rather than an undifferentiated mass is, from a psychoacoustic perspective, a feat of managing the conditions for stream segregation: ensuring that each voice is sufficiently differentiated in pitch register, timbre, and melodic contour to be tracked as a separate stream.


5.8 Running Example: The Choir & The Particle Accelerator — Auditory Streaming as Particle Sorting

🔗 Connection to Running Example

In a particle physics detector, thousands of particle tracks are produced simultaneously in each collision event. The detector's job — like the brain's job in a complex acoustic scene — is to sort these into distinct events: to group detector signals that came from the same particle track, distinguish separate tracks from each other, and identify each track with a specific particle type based on its characteristic signature.

The algorithms used for particle track reconstruction in high-energy physics experiments face the same fundamental challenge as auditory scene analysis: how do you separate a complex mixture of signals into distinct sources? And remarkably, some of the same principles apply:

Common origin: Detector hits that form a geometrically continuous track are grouped together — analogous to auditory grouping by continuity. Hits that don't fit the track geometry are rejected, just as frequency components that don't fit a harmonic series are assigned to a different stream.

Characteristic signatures: Each particle type (electron, muon, pion, proton) leaves a characteristic pattern of energy deposition across the detector layers — its "timbre," so to speak. Pattern-recognition algorithms identify particle type from these signatures, just as the auditory system identifies instrument type from spectral envelope.

Simultaneous processing: Just as the auditory system processes all frequency channels simultaneously (the cochlea provides a parallel frequency representation, not a serial scan), modern particle detectors process all detector channels simultaneously, building up the full event picture in parallel.

Aiko Tanaka's doctoral research bridges these two problems directly: she is developing machine learning algorithms for particle track reconstruction that are inspired by computational models of auditory scene analysis. The parallel is not merely metaphorical — the mathematical structure of separating mixed signals into source streams is genuinely shared between the two domains. A technique that works for separating overlapping chord tones in a piano recording may, with adaptation, help separate overlapping particle tracks in a collision event.

The choir in the concert hall and the particle beam in the accelerator ring are both producing complex mixtures of signals. The listener and the detector both face the problem of unmixing — extracting the individual sources from the superimposed mixture. The brain solved this problem through millions of years of evolution. Physics is now building machines to solve it through engineering.


5.9 The Perception of Consonance and Dissonance — Helmholtz's Beating Theory, Plomp & Levelt's Revision, and What Culture Adds

Why do some combinations of musical pitches sound smooth and pleasant (consonant) while others sound harsh and tense (dissonant)? This question has been debated since ancient Greece, when Pythagoras observed that intervals with simple frequency ratios (2:1, 3:2, 4:3) sound pleasant while those with complex ratios sound harsh. Two millennia later, the debate is still not fully resolved — but we understand it much better.

Helmholtz's Beating Theory

Hermann von Helmholtz proposed in 1863 that dissonance results from acoustic beats between partials (harmonics) of the simultaneously sounding notes that fall within the same critical band. When two frequencies are close but not identical, they produce periodic amplitude fluctuations — beats — at a rate equal to their frequency difference (recall Chapter 2). At moderate beat rates (roughly 25–40 beats per second), these beats are perceived as a harsh, rough sensation. Intervals that produce many partial pairs beating in this range are heard as dissonant; intervals whose partial structure avoids strong beating are heard as consonant.

This explains the traditional consonance ranking: the octave (2:1) produces perfectly synchronous partials (no beating); the perfect fifth (3:2) produces partials that mostly avoid beating; the minor second produces many closely-spaced partial pairs that beat harshly against each other.

Plomp & Levelt's Revision

In 1965, Reinier Plomp and Willem Levelt refined Helmholtz's theory with careful perceptual experiments. They showed that:

  1. The perception of roughness/dissonance depends primarily on whether frequency components fall within the same critical band — not just whether any beating occurs
  2. The most dissonant interval is not the smallest possible (which produces beats too slow to be rough) but the interval where frequency difference equals about 25% of the critical bandwidth — producing beat rates in the roughness zone (roughly 25–40 Hz)

This is a psychoacoustic derivation of consonance and dissonance from first principles. The "beautiful" intervals of Western harmony — octave, perfect fifth, perfect fourth, major third — turn out to be those whose harmonic structure produces minimal roughness in the Plomp-Levelt sense. This is not a cultural agreement or an arbitrary convention; it is a consequence of the mechanical properties of the cochlea.
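
The Plomp–Levelt roughness curve is simple enough to compute. The sketch below uses Sethares' widely cited parameterization of their data (the constants are his; the six-harmonic spectrum with 1/n amplitudes is an assumption for illustration, and the absolute numbers matter less than the ordering of intervals):

```python
import math

def pair_roughness(f1, f2, a1=1.0, a2=1.0):
    """Roughness of one pair of partials (Sethares' fit to Plomp & Levelt).
    Peaks when the frequency difference is ~25% of the critical bandwidth."""
    f_min = min(f1, f2)
    s = 0.24 / (0.0207 * f_min + 18.96)   # scales the curve with the CB
    x = s * abs(f2 - f1)
    return min(a1, a2) * (math.exp(-3.5 * x) - math.exp(-5.75 * x))

def interval_roughness(f0, ratio, n_harmonics=6):
    """Summed roughness of two complex tones with 1/n harmonic amplitudes."""
    partials = [(k * f0, 1.0 / k) for k in range(1, n_harmonics + 1)]
    partials += [(k * f0 * ratio, 1.0 / k) for k in range(1, n_harmonics + 1)]
    total = 0.0
    for i in range(len(partials)):
        for j in range(i + 1, len(partials)):
            (fa, aa), (fb, ab) = partials[i], partials[j]
            total += pair_roughness(fa, fb, aa, ab)
    return total

for name, ratio in [("minor second (16:15)", 16 / 15),
                    ("major third (5:4)", 5 / 4),
                    ("perfect fifth (3:2)", 3 / 2),
                    ("octave (2:1)", 2.0)]:
    print(f"{name:22s} roughness ~ {interval_roughness(261.63, ratio):.3f}")
# Expect the minor second to score highest, the fifth and octave lowest.
```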

⚖️ Debate: Is Consonance Universal or Culturally Constructed?

The Helmholtz/Plomp-Levelt theory predicts that consonance and dissonance should be universal — the same cochlea with the same critical bands should produce the same roughness perceptions regardless of cultural background.

The universalist view: Joshua McDermott and colleagues (2016) conducted experiments with an isolated indigenous community in Bolivia (the Tsimane people) who had had minimal contact with Western music. Both Tsimane and Western listeners found acoustic roughness unpleasant, suggesting some physiological basis for the perception of roughness.

The culturalist view: But there were important differences. The strong preference for harmonic intervals (simple frequency ratios) was much more pronounced in Western-trained listeners than in Tsimane listeners. The Tsimane did not show the strong preference for the perfect fifth and octave that Western listeners exhibit. This suggests that while roughness aversion may be partly physiological, the full cultural edifice of consonance/dissonance — the specific hierarchy of intervals, the conventions about which dissonances are "acceptable" and which are "harsh" — is substantially learned.

Furthermore, musical traditions worldwide have very different consonance hierarchies. Javanese and Balinese gamelan deliberately "mistune" paired instruments so that they beat against each other — the resulting "shimmer" is considered beautiful, not dissonant, even though Western acoustic theory would classify the interference as roughness. Arabic maqam scales include intervals (three-quarter tones, near-quartertones) that Western ears might perceive as "out of tune" but Arab musicians and listeners hear as specific, expressive pitches.

Our Recurring Theme 2 (Universal structures vs. cultural specificity) is alive here: the physical basis of roughness perception may be universal, but the cultural interpretation and use of roughness is historically specific, regionally variable, and musically productive. Dissonance is not simply an acoustic bad. In nearly every musical tradition, what counts as dissonance is a productive resource — used to create tension, forward motion, expressiveness, and contrast.


5.10 Temporal Resolution: How Fast Can We Hear?

The auditory system's ability to perceive timing is remarkable. We can detect timing differences between the two ears of as little as 10–30 microseconds (millionths of a second) — a precision used for spatial localization (Section 5.11). But our ability to hear the temporal structure of sounds has its own characteristics and limits.

The Temporal Resolution Window

For the auditory system to perceive two events as distinct rather than merged into one, they must be separated by at least about 2–5 milliseconds (an "integration window"). This is the minimum gap needed for two clicks or pulses to be heard as two events rather than one. Events separated by less than ~2 ms tend to fuse perceptually into a single event.

This 2–5 ms temporal resolution has direct implications for music:

Rhythm perception: The finest rhythmic subdivisions that can be accurately perceived and performed fall in the range of 30–50 ms (subdivisions of about 20–30 per second). Faster "subdivisions" are heard as timbre (a very rapid tremolo becomes a buzzy texture) rather than as distinct rhythmic events.

Attack transient importance: The attack portion of musical notes — the first 10–30 ms — contains crucial information for timbre identification (we'll explore this in Chapter 7). The auditory system's temporal processing is precise enough to extract this information despite its brevity.

The 10 ms "cocktail party" window: In complex acoustic scenes, the auditory system tends to integrate energy over windows of approximately 10–20 ms for the purposes of stream segregation. Events within one 10 ms window tend to be treated as simultaneous; events spanning a 10 ms boundary are more easily segregated. Music producers and recording engineers take advantage of this: small timing adjustments (10–30 ms) between track elements can dramatically change the perceived "tightness" or "looseness" of a rhythm.

📊 Data/Formula Box: Key Temporal Thresholds

| Phenomenon | Time window | Musical relevance |
|---|---|---|
| Interaural time difference detection | 10–30 microseconds | Spatial localization |
| Temporal resolution (minimum gap) | 2–5 ms | Individual event distinctness |
| Integration window (stream segregation) | 10–20 ms | Simultaneous vs. sequential perception |
| Haas effect (early reflections fused) | 0–40 ms | Concert hall design, reverb |
| Echo threshold (separate echo) | > ~50 ms | Distinct echo vs. reverb |
| Backward masking (pre-masking) | 0–20 ms before | Audio compression design |
| Forward masking (post-masking) | 0–200 ms after | Clarity of successive sounds |

5.11 The Cocktail Party Effect and Spatial Hearing — HRTFs and Binaural Processing

We return to the cocktail party — that paradigmatic example of auditory scene analysis. How does the brain extract a single voice from a cacophony? Spatial information — where each sound source is located — is a crucial cue, processed by the binaural (two-ear) auditory system with extraordinary precision.

The Physics of Spatial Hearing

The auditory system uses two primary cues for localizing sound sources:

Interaural Time Difference (ITD): A sound arriving from your left reaches your left ear slightly before your right ear. At the maximum (source directly to your left), this difference is approximately 660 microseconds. The auditory brainstem measures this time difference with precision in the 10–30 microsecond range, corresponding to angular resolution of a few degrees.

Interaural Level Difference (ILD): Your head and ears shade and diffract sound in frequency-dependent ways. A high-frequency sound coming from the left is partly blocked by your head from reaching the right ear, creating an intensity difference between the two ears. Low-frequency sounds (wavelengths larger than the head) bend around the head easily, creating minimal ILD; high-frequency sounds create significant ILD.

Together, ITD (dominant below ~1.5 kHz) and ILD (dominant above ~1.5 kHz) provide complementary localization cues across the full auditory spectrum — a remarkable division of labor whose frequency boundary is not arbitrary: near 1.5 kHz the wavelength (about 23 cm) becomes comparable to the head's dimensions, the transition between the regime where sound bends around the head and the regime where the head casts an acoustic shadow.
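
For a back-of-the-envelope feel for these numbers, the classic Woodworth spherical-head model estimates the ITD for a distant source (the 8.75 cm head radius below is a typical assumed value, not a measurement):

```python
import math

SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.0875     # m; a typical assumed average, not a measurement

def woodworth_itd(azimuth_deg, r=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """ITD for a distant source on a spherical head:
    ITD = (r / c) * (theta + sin(theta)), theta = azimuth from straight ahead."""
    theta = math.radians(azimuth_deg)
    return (r / c) * (theta + math.sin(theta))

for az in (0, 10, 45, 90):
    print(f"azimuth {az:3d} deg: ITD = {woodworth_itd(az) * 1e6:6.1f} microseconds")
# 90 degrees gives ~656 microseconds, matching the ~660 us maximum quoted
# above; a mere 10 degrees is already ~89 us, well above the 10-30 us JND.
```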

Beyond ITD and ILD, the complex geometry of the outer ear (pinna) creates intricate filtering of sounds that vary with the sound's elevation angle and front-back position. These filters — Head-Related Transfer Functions (HRTFs) — are essentially the acoustic "fingerprint" of how your particular ear shape modifies sounds from each direction.

The brain learns to use these HRTF-dependent spectral cues for elevation localization (distinguishing sounds above from below) and front-back disambiguation (distinguishing a source in front from one behind, which produces the same ITD and ILD). This learning is thought to occur partly during childhood, as the brain builds a model of the HRTF associated with the individual's particular head and ear geometry.

Binaural Audio and Spatial Illusion

HRTF data can be used to create remarkably convincing three-dimensional auditory experiences through headphones. By processing a mono or stereo signal with measured HRTFs for different directions, audio engineers can create binaural audio recordings that place phantom sound sources outside the head — sounds that seem to come from specific locations in the three-dimensional space around you, even though only two small speakers (the headphone drivers) are involved.

This technology underlies modern spatial audio formats (Dolby Atmos, Apple Spatial Audio) and has become increasingly important in AR/VR applications, gaming, and even medical rehabilitation (helping retrain spatial hearing after unilateral hearing loss).

🔵 Try It Yourself: Binaural Audio Demonstration Search online for "ASMR binaural barber shop" or "3D audio demo" — there are numerous freely available demonstrations of binaural recording. Using headphones (required — does not work with loudspeakers), listen to a good binaural demonstration. Notice how sounds seem to come from specific locations in the space around you rather than from inside your head. Try to notice what spatial cues the recording is using — can you identify when the source moves? When it is behind vs. in front of you?

⚠️ Common Misconception: Stereo = Spatial. Conventional stereo recording, played through loudspeakers, creates a convincing soundstage between the speakers, in the horizontal plane at ear level. It does not create sounds above, below, or behind the listener. True spatial audio — sounds in all directions including elevation and rear — requires either surround speaker systems or binaural (HRTF-processed) headphone audio. This is why "3D audio" headphone demonstrations are so striking: they offer a fundamentally richer spatial experience than conventional stereo.


5.12 🧪 Thought Experiment: What If Humans Could Hear Up to 100 kHz?

🧪 Thought Experiment: The 100 kHz Ear

Dolphins and bats can hear frequencies up to 100–150 kHz, using ultrasonic echolocation for navigation and hunting. Imagine a hypothetical human being with the same upper frequency limit: audible range 20 Hz to 100 kHz (compared to our 20 Hz to 20 kHz).

Work through the following implications:

The cochlear geometry problem: Recall that the cochlea is organized with high frequencies at the base and low frequencies at the apex, with frequency mapped to position in a roughly logarithmic manner. Our 20 kHz upper limit corresponds to the full length of the basilar membrane (about 35 mm). To extend hearing to 100 kHz, you would need either a much longer cochlea (to fit the additional frequency range) or the existing range compressed into less space. What would this mean for frequency discrimination in the 0–20 kHz range our current ear already handles?

The musical octave and its implications: Our current hearing spans about 10 octaves (20 Hz to 20 kHz). Extending the upper limit to 100 kHz would add roughly 2.3 additional octaves. But the musical content of natural sound sources falls off sharply with frequency: the fundamentals of most musical instruments lie below about 5 kHz, and even their overtone energy is weak above 15–20 kHz. The additional octaves would be mostly noise, overtone content from inharmonic sources, and ultrasonic signals inaudible to current human ears.
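
The octave arithmetic is quick to verify (a three-line check, assuming the stated 20 Hz to 100 kHz range):

```python
import math

def octaves(f_low, f_high):
    """Number of octaves between two frequencies: log2(f_high / f_low)."""
    return math.log2(f_high / f_low)

print(f"20 Hz - 20 kHz : {octaves(20, 20_000):5.2f} octaves")       # ~9.97
print(f"20 Hz - 100 kHz: {octaves(20, 100_000):5.2f} octaves")      # ~12.29
print(f"gained on top  : {octaves(20_000, 100_000):5.2f} octaves")  # ~2.32
```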

What new music would be possible? With audible range to 100 kHz, composers could write melodies in the 20–100 kHz range — frequencies that currently have no musical tradition at all. Would our neural machinery for pitch perception, melody recognition, and emotional response extend naturally to these frequencies? Or would they be heard as high-pitched noise rather than organized music? Recall that temporal phase-locking fails above ~4–5 kHz — at 100 kHz, place coding would be the only available mechanism, with possibly much coarser pitch discrimination than in our current mid-range.

The masking problem: Recall from Section 5.3 that above 500 Hz the critical bandwidth stays roughly proportional to the center frequency (~20%), so in absolute terms an ultrasonic critical band would be enormous: on the order of 10 kHz wide at a 50 kHz center. Any densely voiced ultrasonic chord would pack many partials into a single band, producing strong interaction and, if the roughness mechanism operates there at all, pervasive harshness. Would ultrasonic music sound more dissonant than our current music?

The evolutionary question: Why did mammalian hearing extend upward to 100+ kHz (in echolocating species) rather than downward to infrasound? Infrasound (below 20 Hz) is produced by earthquakes, ocean waves, and large animals (elephants communicate with infrasound) — potentially useful environmental information. Yet we don't hear it. What does this suggest about the evolutionary pressures that shaped our particular auditory range?


5.13 Psychoacoustics and Musical Emotion — A Preview of Chapter 27

Psychoacoustics can tell us a great deal about perception — what we hear and how we hear it. But the question of emotional response to music — why a minor key melody feels sad, why a rising pitch sequence creates tension, why the unexpected resolution of a suspended chord releases tension in a deeply satisfying way — carries us into territory that psychoacoustics alone cannot map.

The emotional response to music appears to involve at least three layers of processing that operate simultaneously and interact:

Psychoacoustic primitives: Some emotional responses may be grounded in basic psychoacoustic phenomena. Roughness (from beating between close frequency intervals) correlates with perceptions of harshness or tension. High-frequency sounds correlate with excitement (probably because high frequencies are characteristic of tense, fast-moving vocal sounds in human communication). Slow amplitude modulation correlates with calm.

Musical expectation: Much of the emotional power of music arises from the creation and manipulation of expectation. Music sets up patterns, creates expectations about where it is going, and then either confirms or violates those expectations — and both confirmation and violation, at the right moment and magnitude, are emotionally potent. This layer depends on musical training and cultural familiarity (though some elements of musical expectation may be universal).

Cultural and personal association: Music acquires emotional meaning through association — a melody heard at a moment of personal significance, a cultural tradition of using minor keys for lamentation, a film score that trains audiences to associate specific themes with specific characters or emotions. This layer is entirely learned.

Chapter 27 explores these layers in depth, asking to what degree musical emotion is psychoacoustic (and potentially universal) versus culturally constructed (and therefore variable). The debate connects directly to Theme 2 (Universal structures vs. cultural specificity) — and the answer turns out to be: both, intertwined in ways that are still being mapped.


5.14 Theme 1 Deep Dive: Where Physics Ends and Perception Begins — Is Psychoacoustics Just More Physics?

⚖️ Debate: Is psychoacoustics just more physics, or something fundamentally different?

Consider this argument: Everything we've discussed in this chapter — equal-loudness contours, critical bands, masking, pitch perception, auditory streaming — is, ultimately, the result of physical processes. The cochlea is a mechanical structure obeying the laws of fluid dynamics and elasticity. The auditory nerve is a collection of electrochemical cells following the physics of ion channels. The auditory cortex is a network of neurons whose firing patterns can (in principle) be described by the laws of biophysics. Where, in this chain of physical causation, is there room for anything other than physics?

The reductionist answer: There is no gap. Psychoacoustics is physics, applied to biological systems. What we call "perception" is what neural computation feels like from the inside. The qualia of hearing a cello are identical to (or at least supervenient on) the pattern of neural activation in your auditory cortex. There is no additional non-physical substance; there is no gap between the physical and the experiential. Science just hasn't fully mapped the relationship yet.

The emergence answer: Perception involves properties that are genuinely absent from the physical components and emerge only at the level of the system. No neuron "hears" music; no ion channel "perceives" consonance. These properties are emergent — they arise from the organization of components at a higher level of description. Emergence doesn't require anything supernatural; it just means that reductive explanation is incomplete, because system-level properties cannot be derived from component-level properties alone. Psychoacoustics is physics plus emergence.

The phenomenological answer: The hard problem of consciousness identifies a genuine explanatory gap that neither the reductionist nor the emergence answers fully bridge. Even if you have a complete physical description of all neural activity when someone hears a cello, you haven't explained why there is a subjective experience of hearing a cello at all. The existence of qualia — of what it is like to hear — seems to require explanation at a level that physics, as currently formulated, doesn't provide.

This is not merely a philosophical curiosity. It bears directly on fundamental questions in music research: Can we fully explain musical emotion in physical-computational terms? Is there a "neural correlate of musical beauty" that, once identified, constitutes an explanation of musical beauty? Or does the experience of musical beauty resist complete reduction to neural computation?

We won't resolve this debate. But holding it open — keeping the tension between the reductionist and emergence/phenomenological perspectives — is, we argue, the intellectually honest position. Physics is extraordinarily powerful. It explains a great deal about music. But the experience of music may exceed what physics, even in principle, can fully capture.

💡 Key Insight: Theme 1 in full view. The tension between reductionism and emergence is visible throughout this chapter. We can reduce pitch perception to basilar membrane mechanics and auditory nerve firing patterns — and this reduction is explanatorily powerful. But the experience of pitch — the difference between a B-flat and a C, perceived as musically meaningful within a tonal context — seems to require an account that includes the musical system, the cultural context, and the listening history of the individual. The physics is necessary but not sufficient. This is emergence.


5.15 Summary and Bridge to Part II

This chapter began at the boundary between the physical and the perceptual — the eardrum, where pressure waves become neural signals — and traveled through the astonishing complexity of the auditory brain. We found that:

  • Loudness perception is frequency-dependent, summarized by the equal-loudness contours: the ear is most sensitive near 3–4 kHz and progressively less sensitive at very low and very high frequencies
  • The cochlea is a frequency analyzer, mapping frequency to position along the basilar membrane and providing the brain with a parallel frequency representation
  • Critical bands are the cochlea's frequency resolution units — within a critical band, stimuli interact; across critical bands, they are processed more independently
  • Masking is the process by which louder sounds make softer sounds inaudible — exploited by perceptual audio codecs to achieve dramatic compression without audible quality loss
  • Pitch perception involves two complementary mechanisms — temporal coding at low frequencies and place coding at high frequencies — integrated by the brain into unified pitch perception
  • The missing fundamental demonstrates that pitch is a constructed perceptual attribute, not a simple readout of physical frequency content
  • Auditory scene analysis — the brain's ability to separate a complex acoustic mixture into distinct source streams — is the psychoacoustic basis of musical texture and polyphony
  • Consonance and dissonance have a psychoacoustic grounding in roughness (beating within critical bands), but their musical interpretation is substantially culturally constructed
  • Spatial hearing relies on interaural time and level differences, enriched by HRTF cues from the pinna, enabling the brain to localize sound sources with remarkable precision
  • The reductionism-vs-emergence debate is alive in psychoacoustics: physics explains much, but the full experience of musical perception may require emergent or phenomenological accounts that physics alone cannot provide

Key Takeaway: The brain is not a microphone. A microphone passively records pressure variations at a single point in space. The human auditory system actively constructs a rich, interpreted, emotion-laden representation of the acoustic world — separating sources, extracting pitch and rhythm, perceiving space, and transforming physical sound into musical meaning. Understanding this constructive process is essential to understanding music, not as a peripheral supplement to the "real" physics but as its most important context: music happens in minds, and minds are physical systems whose remarkable complexity gives rise to the remarkable experience we call musical listening.

Bridge to Part II: The Musical System

Part I has given us the physical and perceptual foundations: how sound is produced, propagates, is shaped by spaces, and is heard by humans. Part II — beginning with Chapter 6 — turns from physics and perception to the musical system itself: the organized structures of pitch, rhythm, and harmony that cultures have developed to exploit the physics and perception we've studied.

We'll ask: Why do humans organize sounds into scales? What determines which scales different cultures use? Is tonality (the system of keys and harmonic relationships central to Western music) a natural consequence of acoustics, or a cultural invention? How does rhythm relate to temporal perception? What is timbre, and how is it encoded in the spectral structure of sound?

The physics we've learned in Part I will remain constantly present — not as background but as the grounding for everything we discuss. The musical system is not arbitrary; it has deep physical and psychoacoustic roots. But it is also not fully determined by physics; culture, history, and aesthetic choice play constitutive roles. Mapping the boundary between the physical-universal and the cultural-specific is the central project of Part II.


Chapter 5 exercises, quiz, and case studies follow.