Chapter 33: Audio Compression — MP3, Perceptual Coding & What We Lose

33.1 Why Compress? — Bandwidth, Storage, and the Economics of Audio

Every technology confronts a scarcity. For recorded music in the late twentieth century, the scarcity was storage and transmission capacity. A three-minute song in CD-quality audio — 44,100 samples per second, 16 bits per sample, two channels — requires approximately 31.7 megabytes of storage. This number, which seems trivial in an age of terabyte hard drives and gigabit internet connections, was catastrophically large in 1993. At typical internet connection speeds of that era (28.8 kbps), downloading a single song would have required more than two hours. Carrying a few hundred songs in a portable device would have required storage hardware the size of a small refrigerator.

The economic problem was real, and its solution required a different question than the one the Nyquist theorem answered. The Nyquist theorem told engineers the minimum information required for perfect audio reconstruction. The compression question was: what is the minimum information required for perceptually satisfactory audio reconstruction? These are categorically different questions with categorically different answers, and the second question requires not just physics but psychology.

The realization that these questions have different answers — and that the gap between them is exploitable — is the foundational insight of perceptual audio coding. Human hearing is not a perfect information receiver. It discards information at every stage of the auditory processing pipeline: in the mechanical filtering of the cochlea, in the threshold of audibility that renders very quiet sounds inaudible, in the masking mechanisms that make sounds near a loud tone inaudible even if they are physically present, and in the temporal integration that blurs fine time-domain detail. If a codec can identify which information will be discarded by the auditory system anyway, it can discard that information from the digital file without the listener noticing.

The resulting technology — lossy audio compression — has been one of the most consequential technologies of the digital age. It made digital music distribution practical, enabled the portable music player revolution, created the conditions for streaming services, and indirectly reshaped the economic structure of the entire music industry. It also, as this chapter will examine, changed what music sounds like at a physical level in ways that are sometimes perceptible and sometimes musically significant — even when they are not perceptible to ordinary listeners.

💡 Key Insight: The gap between "information needed for perfect reconstruction" (the Nyquist answer) and "information needed for satisfactory perception" (the perceptual coding answer) is determined by the psychoacoustics of human hearing. A codec is a mathematical model of what you cannot hear. When that model is wrong — for some listeners, in some conditions, for some material — the codec removes something you can hear.

33.2 Lossless vs. Lossy: The Fundamental Choice

Before examining how lossy compression works, it is worth understanding the alternative: lossless compression, which achieves compression without discarding any information.

Lossless audio codecs (FLAC, ALAC, APE, WAV with certain compression options) use the same mathematical tools as general-purpose data compression (related to ZIP, gzip, DEFLATE) to remove statistical redundancy in the audio data without discarding any samples or reducing precision. Audio signals have significant statistical structure: samples are correlated with their neighbors (a signal that is positive at time n is likely to still be positive at time n+1), and a good lossless codec exploits this correlation to represent the data more efficiently.

The typical compression ratio for lossless audio is approximately 2:1 — a 30 MB WAV file compresses to approximately 15 MB as a FLAC file. The decompressed FLAC is bit-for-bit identical to the original WAV. No information is discarded. The dynamic range, frequency response, noise floor, and every other physical characteristic of the original digital audio is preserved exactly.

The 2:1 compression ratio of lossless coding is hard to improve upon because audio signals, while statistically structured, are not highly compressible — they contain a great deal of genuine information per sample. The Shannon entropy of typical audio data limits lossless compression to roughly this range.
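The neighbor-sample correlation described above can be demonstrated in a few lines. The toy below is illustrative, not the FLAC algorithm itself: it replaces each sample with a first-order prediction residual before handing the data to a general-purpose compressor (zlib), and the residual's smaller amplitude range is exactly what a lossless audio codec turns into its roughly 2:1 ratio.

```python
import array
import math
import zlib

# Toy signal: a decaying 439.7 Hz sine at 16-bit/8 kHz. The decay and
# the non-integer period keep the byte stream from repeating exactly,
# so zlib cannot simply match whole periods.
sr = 8000
samples = array.array('h', (
    int(20000 * math.exp(-n / sr) * math.sin(2 * math.pi * 439.7 * n / sr))
    for n in range(sr)
))

# First-order linear prediction: residual[n] = x[n] - x[n-1].
# Neighboring samples are strongly correlated, so residuals are small.
residual = array.array('h', [samples[0]] + [
    samples[n] - samples[n - 1] for n in range(1, len(samples))
])

raw_size = len(zlib.compress(samples.tobytes(), 9))
res_size = len(zlib.compress(residual.tobytes(), 9))
# The residual stream compresses smaller, and the coding is lossless:
# a running sum of the residuals recovers every original sample exactly.
```

Decoding is the running sum: `x[0] = residual[0]`, then `x[n] = x[n-1] + residual[n]`, bit-for-bit identical to the input.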

Lossy audio codecs (MP3, AAC, Ogg Vorbis, Opus) achieve much higher compression by discarding information. MP3 at 128 kbps achieves approximately 10:1 compression; at 320 kbps, approximately 4:1. This compression comes at a cost: the decompressed audio is not identical to the original. It is a perceptual approximation — similar enough (in the designers' model of hearing) to be indistinguishable from the original by most listeners under most conditions, but physically different.

⚠️ Common Misconception: "Compressed audio" in everyday language often refers to dynamic range compression — the gain-riding process described in Chapter 31 that reduces the difference between loud and quiet sounds. In the context of this chapter, "compression" refers to data compression — reducing the file size of digital audio. These are completely different processes with completely different physics. A CD-quality WAV file can have severe dynamic range compression (loud mastering) while being entirely "lossless" in data compression terms.

The choice between lossless and lossy compression is ultimately a question about what is being preserved and for whom. For an archivist, a producer, or a researcher studying the acoustic properties of recordings, lossless is the only acceptable choice — the original information must be preserved. For a listener streaming music in a noisy environment on consumer earbuds, a well-encoded 128 kbps AAC file may be perceptually indistinguishable from the lossless original. The right choice depends entirely on context, use case, and who is listening with what ears to what content on what equipment.

33.3 The Psychoacoustic Model — The Model of What You Cannot Hear

At the heart of every lossy audio codec is a psychoacoustic model: a mathematical representation of the human auditory system's sensitivity that is used to determine which parts of the audio signal are inaudible and therefore disposable.

The psychoacoustic model is built from experimental data: decades of hearing research that has measured, with great precision, the thresholds of audibility under various conditions. These measurements include:

The absolute threshold of hearing: The minimum sound pressure level detectable by a human ear, as a function of frequency. The ear is most sensitive around 3,000–4,000 Hz (where the ear canal has a resonance that amplifies incoming sound), and much less sensitive at low frequencies and very high frequencies. A 20 Hz tone must be approximately 60-70 dB SPL to be heard; a 3,000 Hz tone can be detected at nearly 0 dB SPL. Any sound below this threshold is inaudible — and a codec can discard it without consequence.
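The absolute threshold curve has a standard closed-form approximation due to Terhardt, widely used in codec psychoacoustic models. The sketch below reproduces the shape described above; note that it is an approximation and drifts from measured data at the frequency extremes.

```python
import math

def threshold_quiet_db(f_hz):
    # Terhardt's approximation of the absolute threshold of hearing,
    # in dB SPL, for a tone at f_hz Hz.
    khz = f_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

# The curve dips below 0 dB SPL near 3-4 kHz (the ear's most sensitive
# region) and rises steeply at very low and very high frequencies.
```

Any component whose level falls below this curve is inaudible in quiet, and a codec can allocate it zero bits.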

Simultaneous masking: A loud sound at one frequency raises the threshold of hearing for nearby frequencies (in the frequency domain). A 1,000 Hz tone at 80 dB SPL makes nearby frequencies inaudible at levels that would be perfectly audible in quiet. The masking is asymmetric: it extends further upward in frequency than downward, because the masking threshold falls off at roughly 10 dB per Bark above the masker but roughly 25 dB per Bark below it. The psychoacoustic model computes this masking threshold for every frequency at every moment, and discards audio components that fall below it.

Temporal masking: Masking also extends in time, not just in frequency. A loud sound masks quieter sounds that occur shortly before it (backward masking, typically 5-20 ms) and shortly after it (forward masking, typically 50-200 ms). The codec exploits temporal masking to remove quiet sounds that occur near loud ones in time, even if the quiet sounds would be audible in isolation.

📊 Data/Formula Box: The Masking Threshold

The masking threshold T(f) at frequency f, due to a masker at frequency f_m with level L_m, can be approximated as:

T(f) = L_m − spreading_function(f, f_m) − attenuation(L_m)

Where the spreading function describes how masking decreases with frequency distance (in Bark units — the perceptual frequency scale). At distance Δz Bark from the masker:

  • Upward masking (f > f_m): spreads at approximately 10 dB per Bark
  • Downward masking (f < f_m): spreads at approximately 25 dB per Bark

The Bark scale (named after German physicist Heinrich Barkhausen): z = 13 × arctan(0.00076 × f) + 3.5 × arctan((f/7500)²), with f in Hz.

This nonlinear frequency scale reflects the width of auditory critical bands.
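The box above can be turned into a minimal working model. The sketch below implements the Bark conversion and a simplified triangular spreading function with the two slopes given; the 10 dB offset between masker level and peak masking is an assumed round number (real models make it depend on masker level and tonality).

```python
import math

def hz_to_bark(f_hz):
    # Zwicker-style Bark-scale approximation.
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def masking_threshold_db(f_hz, masker_hz, masker_db):
    # Triangular spreading function: masking falls off ~10 dB/Bark
    # above the masker and ~25 dB/Bark below it.
    dz = hz_to_bark(f_hz) - hz_to_bark(masker_hz)
    slope = 10.0 if dz >= 0.0 else 25.0
    return masker_db - 10.0 - slope * abs(dz)  # 10 dB offset: assumption
```

With a 1,000 Hz masker at 80 dB SPL, the model masks 1,500 Hz far more strongly than 500 Hz, reproducing the upward-spread asymmetry.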

The psychoacoustic model is not a perfect description of human hearing. It is an average across many listeners, measured under laboratory conditions, for steady-state or slowly varying sounds. It may be inaccurate for specific listeners, for specific types of sounds, or for listeners with trained attention to specific acoustic features. This inaccuracy is the opening through which the lossy codec's artifacts become perceptible.

33.4 Simultaneous Masking — Codecs Exploit Your Ear's Masking to Hide What They Remove

Simultaneous masking is the most powerful tool in the codec designer's arsenal. When a loud tone is present at frequency f_m, it raises the threshold of hearing for neighboring frequencies by a large amount. Sounds within the masking shadow are genuinely inaudible — not just quieter, but physiologically masked by the saturation of the auditory nerve fibers tuned to that frequency region.

The MP3 and AAC codecs compute the masking threshold for each analysis frame (1,152 samples per frame in MP3, 1,024 in AAC — roughly 23-26 ms at 44.1 kHz). They allocate bits preferentially to frequency components that require precise representation to be heard accurately, and assign few or no bits to components that will be masked. A component that falls entirely below the masking threshold is allocated zero bits — it is simply discarded.

Critical bands: The frequency resolution of masking is described in terms of critical bands — the frequency ranges processed together by the cochlea. The auditory system does not have independent filters for every frequency; it has approximately 24 critical bands that together cover the audible range. Masking occurs primarily within a critical band: a loud tone in one critical band masks quieter tones in the same band but has less effect on adjacent bands.

💡 Key Insight: Critical bands are narrower at low frequencies and wider at high frequencies, following the mechanical properties of the basilar membrane. At 500 Hz, a critical band spans about 100 Hz. At 5,000 Hz, a critical band spans about 700 Hz. This is one reason MP3 compression artifacts are typically more noticeable at high frequencies: a wide high-frequency critical band lumps together a masker and many components the masker does not fully mask, so the model's within-band masking assumption is coarser there, and the components the codec discards are more likely to have been audible.

The practical implementation: the MP3 encoder applies a filterbank (a polyphase filterbank followed by a Modified Discrete Cosine Transform — see Section 33.13) that decomposes the audio into frequency subbands. Each subband's quantization level is set based on the masking threshold computed by the psychoacoustic model. Subbands far from any loud masker get high quantization precision (many bits). Subbands under the masking shadow get low precision or zero bits.
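A hypothetical bit allocator following this logic, in miniature: per-subband signal and mask levels in dB, a fixed bit pool, and the rule of thumb that each quantizer bit lowers quantization noise by about 6 dB. This is an illustrative sketch, not the actual MP3 inner loop.

```python
def allocate_bits(signal_db, mask_db, bit_pool):
    # Greedy water-filling: repeatedly give one bit to the subband whose
    # quantization noise is furthest above its masking threshold
    # (highest noise-to-mask ratio, NMR).
    bits = [0] * len(signal_db)
    noise_db = list(signal_db)          # 0 bits -> noise ~ signal level
    for _ in range(bit_pool):
        nmr = [n - m for n, m in zip(noise_db, mask_db)]
        i = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[i] <= 0.0:               # every band already fully masked
            break
        bits[i] += 1
        noise_db[i] -= 6.0              # each bit lowers noise ~6 dB
    return bits
```

A band whose signal already sits below its masking threshold (a masked component) receives zero bits — it is discarded, exactly as described above.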

33.5 Temporal Masking — Codecs Remove Pre-Echo Signals You Cannot Hear

Temporal masking describes how the auditory system's sensitivity changes over time in response to a loud sound. Forward masking (masking of sounds that follow the masker) is well established and relatively strong: a loud sound can mask quieter sounds for up to 200 ms after the masker ends. Backward masking (masking of sounds that precede the masker) is weaker and more controversial, but demonstrates that the auditory system integrates over time in ways that allow pre-masking of brief sounds.

MP3 exploits forward temporal masking extensively: quiet sounds that occur after a loud transient can be coded with less precision because the forward masking of the transient will hide any coding error.

However, temporal masking creates its own characteristic artifact: pre-echo. In the MP3 analysis, audio is processed in frames of 1,152 samples (approximately 26 ms at 44.1 kHz). If a sharp transient occurs near the end of a frame, the quantization noise introduced by coding that frame coarsely — the bit budget was set for a frame that is mostly quiet — spreads across the whole frame on reconstruction, and so appears before the transient in the decoded audio as a brief smearing or "splashing" sound that precedes the attack.

Pre-echo is one of the most characteristic and identifiable MP3 artifacts. It is particularly audible on recordings with sharp transients against quiet backgrounds: the castanets in classical guitar, the attack of a harpsichord, triangle strikes, the click of drum sticks. The "pre-echo" heard on triangle strikes in compressed audio is precisely the codec's quantization noise appearing in the silent pre-attack period, audible to the trained ear as a faint "shhhhh" before the sharp "ting."

Modern codecs address this with adaptive window switching: when a sharp transient is detected, the codec switches from the long analysis window to much shorter windows (in AAC, 128 coefficients instead of 1,024; MP3 similarly replaces each long block with three short ones), providing better temporal resolution at the cost of worse frequency resolution. The transient is then coded with higher temporal precision, and the pre-echo artifact is reduced. But window switching introduces its own quantization issues at the transitions, and the balance between frequency resolution and temporal resolution remains a fundamental challenge in codec design.
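A toy version of the transient detector that drives window switching might simply compare energy across the frame. This is an assumed, simplified heuristic for illustration — real encoders use more robust measures such as perceptual entropy.

```python
import math

def choose_window(frame, jump_db=20.0):
    # Compare energy in the two halves of the frame; a large jump in
    # level suggests an attack and triggers short windows.
    half = len(frame) // 2
    e1 = sum(x * x for x in frame[:half]) + 1e-12   # avoid log(0)
    e2 = sum(x * x for x in frame[half:]) + 1e-12
    return 'short' if 10.0 * math.log10(e2 / e1) > jump_db else 'long'
```

A frame of silence followed by a drum hit selects short windows; a steady tone keeps the long window and its finer frequency resolution.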

⚠️ Common Misconception: Pre-echo is often described as "hearing the future" — as if the smearing appears before the transient. In reality, pre-echo is quantization noise that spans the entire analysis frame, but is only audible in the quiet portion before the transient. Once the transient arrives, its energy masks the quantization noise in the period following it (forward masking). The pre-transient period has no such masking, so the quantization noise in that period becomes audible.

33.6 The Critical Band Framework in MP3 — How MPEG-1 Layer 3 Actually Works

The MP3 format (MPEG-1 Audio Layer 3) was developed at the Fraunhofer Institute for Integrated Circuits in Erlangen, Germany, with key contributions from Karlheinz Brandenburg, and published as an ISO standard in 1993. Its technical structure reflects the psychoacoustic framework described above.

Analysis filterbank: The encoder first decomposes the audio into 32 equal-bandwidth subbands using a polyphase filterbank, each subband approximately 689 Hz wide at a 44.1 kHz sample rate. The polyphase filterbank was chosen for computational efficiency.

MDCT (Modified Discrete Cosine Transform): The output of each subband is then further analyzed with an MDCT, which provides additional frequency resolution within the subband. In the normal (long window) mode, the MDCT produces 18 frequency coefficients per subband. This gives a total of 32 × 18 = 576 frequency coefficients per analysis block — enough to apply the psychoacoustic model with reasonable precision.

Psychoacoustic model: The psychoacoustic model runs in parallel with the filterbank, computing the masking threshold for the current frame based on the spectrum of the audio. This threshold determines the allocation of bits to each frequency region.

Quantization and Huffman coding: The MDCT coefficients are quantized (rounded to discrete values) with precision determined by the bit allocation from the psychoacoustic model. The quantized values are then entropy-coded using Huffman codes (variable-length codes that assign shorter codes to more common values). The resulting bitstream is the MP3 file.
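The quantization step in MP3 is nonuniform: coefficient magnitudes are raised to the 3/4 power before rounding, so loud components keep more relative precision while quiet ones are quantized coarsely in absolute terms. A minimal sketch of that quantize/dequantize pair (sign handling simplified, Huffman stage omitted):

```python
def quantize(coeffs, step):
    # MP3-style 3/4-power companding before rounding to integer levels.
    return [round((abs(c) / step) ** 0.75) * (1 if c >= 0 else -1)
            for c in coeffs]

def dequantize(levels, step):
    # Inverse: undo the power law with the 4/3 power.
    return [step * abs(q) ** (4.0 / 3.0) * (1 if q >= 0 else -1)
            for q in levels]
```

The round trip is approximate — that approximation error is the quantization noise the psychoacoustic model tries to keep below the masking threshold. The integer levels are then what the Huffman coder represents compactly.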

Decoding: The decoder reverses the process: reads the Huffman-coded bitstream, dequantizes the values, applies the inverse MDCT, and reconstructs the time-domain audio signal. The reconstructed signal is physically different from the original — the quantization and the discarding of masked components have changed it — but is designed to be perceptually similar.

33.7 Aiko's Experiment — The Codec Is Blind to Exactly What She Studies

🔗 Running Example: Aiko Tanaka

It is late on a Thursday evening in November. Aiko Tanaka, a doctoral candidate in musical acoustics, is sitting in the music perception lab with headphones on and a spectrogram tool running on her laptop. She has two audio files: a FLAC (lossless) recording of a professional choir performing the Brahms Liebeslieder Waltzes, and the same recording encoded as a 128 kbps MP3.

She has been running this comparison for forty minutes, and what she is seeing is making her simultaneously frustrated and fascinated.

Aiko's dissertation concerns the singer's formant cluster — a concentration of acoustic energy in the 2,800–3,200 Hz range that distinguishes professionally trained singers from untrained singers when singing against an orchestral or choral texture. The singer's formant is a genuine physical phenomenon: trained singers learn to tune their laryngeal and pharyngeal resonances to create an enhanced spectral peak in this frequency region, which sits in a spectral valley of the orchestral spectrum and therefore cuts through the texture without requiring the singer to sing louder. It is, in short, the acoustic superpower of the trained classical voice, and it is what Aiko has been measuring, modeling, and characterizing for the past two years.

She opens her spectrogram analysis software and compares the FLAC and MP3 spectrograms side by side, zoomed into the 2,500–3,500 Hz frequency range. The difference is right there in the visualization, and she has been staring at it for half an hour: in the FLAC recording, the singer's formant cluster is clearly visible as a distinct, stable spectral peak that persists across the phrase. In the MP3 recording, that peak is smeared — spread across a wider frequency range, its sharp structure blurred, its amplitude reduced in the critical 2,800–3,200 Hz core.

She also pulls up a waveform view synchronized between the two files, looking at consonant attacks. The FLAC file shows the precise, sharp onsets of consonants — the "L" in Liebes, the "s" in Walzer — with clear temporal precision. The MP3 file shows these same consonants spread across time by several milliseconds: the pre-echo artifact, the temporal smearing that results from the long analysis window catching the onset within a frame that spans silence and attack simultaneously.

She opens her lab notebook and writes the following entry:


Lab notebook, November 12, 2026:

Ran FLAC/MP3 comparison on Brahms LW recording (Vienna Philharmoniker Chor, 2019). Findings:

1. Singer's formant (2800–3200 Hz) is selectively degraded by MP3 encoding at 128 kbps. The spectral peak is intact in FLAC at approximately +8–10 dB relative to the surrounding spectrum. In MP3, the peak is approximately +4–5 dB — it still exists, but the psychoacoustic masking model has apparently treated the fine spectral structure in this region as "near-threshold" and under-allocated bits, reducing the precision of the formant peak.

2. Consonant temporal precision: The FLAC shows onset sharpness within approximately 5–10 ms. The MP3 shows pre-echo smearing extending 15–20 ms before the consonant attack. This is the pre-echo artifact from long-window MDCT analysis. For measurement of voice onset time and manner of articulation, this smearing is catastrophic.

Interpretation: The codec was designed by people who never thought about singer's formant or consonant onset time as items of interest. They built a model of "what ordinary listeners hear" and optimized for that model. The model does not include "ability to measure singer's formant amplitude" or "temporal precision of consonant onset for voice science research." Why would it? These are research-specific needs, not consumer needs.

The irony is elegant. The psychoacoustic model treats the singer's formant spectral peak as "hard to hear" — it sits above most of the harmonic energy and is close to the upper edge of the masking sensitivity curve. So the codec removes exactly the spectral detail that makes a trained voice audible through an orchestral texture. The codec optimizes for the most common listener while eliminating exactly what I study.

This is not a complaint about the codec engineers. They built something that works brilliantly for its intended purpose. It is a fact about what "intended purpose" means when it is built into a technology. Technology as mediator does not just mediate — it incorporates a model of what matters and what doesn't. When your interests fall outside that model, you are outside the zone of care.


The frustration fades. Aiko sits back and thinks. There is something genuinely interesting here, beyond the immediate methodological problem of needing to collect FLAC recordings of her study materials rather than MP3s (already noted, with a sigh, in the margin: data collection protocol: FLAC only going forward).

The codec's psychoacoustic model is built from measurements of what typical listeners can detect. The singer's formant is audible to trained listeners in live performance — it is the entire mechanism by which a soprano can project over an orchestra. But in a recording, at listening levels, through consumer speakers? The codec's model has apparently classified the fine structure of the singer's formant as "sub-threshold" in the masking model's framework. The codec is not wrong about its intended listeners. It is blind to Aiko's specific needs.

She writes one final note: "The codec is blind to exactly what I'm measuring. It optimizes for what ordinary listeners hear, not what I study. This is a limitation of the psychoacoustic model's scope, not of psychoacoustics itself. The model is an approximation of one kind of hearing. There are other kinds."

She closes the notebook, writes herself a note about sourcing FLAC versions of the lab's entire archive of recordings, and goes home.

33.8 The Spotify Spectral Dataset: Compression Artifacts Across Genres

🔗 Running Example: The Spotify Spectral Dataset

An analysis of compression artifacts across the 10,000-track Spotify Spectral Dataset reveals systematic patterns in which genres suffer most from standard MP3 encoding at 128 kbps.

Classical and acoustic instruments (highest artifact severity): Orchestral recordings, solo piano, acoustic guitar, and choral music show the most severe and musically significant compression artifacts. This is because:

  • High-frequency content is acoustically rich (cymbal wash, bow noise, breath) and lies in critical band regions where the codec's masking model is least accurate
  • Transient attacks are sharp and clear against quiet backgrounds — the worst condition for pre-echo
  • The singer's formant and similar fine spectral structures are present and perceptually significant to trained listeners
  • Dynamic range is large, and the codec's bit allocation fluctuates more under high dynamic range conditions

Jazz (high artifact severity): Cymbal work and piano harmonics show similar issues to classical. The natural room ambience of jazz recordings contains diffuse high-frequency content that the codec compresses aggressively.

Electronic music (moderate artifact severity): Synthesized sounds are often already band-limited by the synthesis process. Sustained, harmonically simple tones are the easiest content for codecs to handle — the masking model works well on steady-state tones. However, sharp electronic transients (especially in drum machine kicks and hi-hats with fast rise times) show pre-echo artifacts.

Highly compressed pop (lowest artifact severity): Pop productions that have been heavily processed — with dynamic range compression, saturation, limiting, and harmonic enhancement — present the codec with a signal that is already "dense" and contains fewer of the quiet-against-loud transitions that stress the masking model. Pre-echo is rare when there is little background silence to expose it. The irony is complete: heavily processed, dynamically compressed music (already subjected to signal quality degradation for loudness) is paradoxically the easiest material for perceptual codecs to handle.

Bit rate sensitivity by genre: The crossover point at which compression artifacts become inaudible to typical listeners varies by genre. For heavily compressed pop, 128 kbps AAC is typically sufficient for most listeners. For classical music, many listeners can detect compression artifacts at 128 kbps and some at 192 kbps; 320 kbps is more commonly considered "transparent" for this material. Trained listeners studying specific acoustic features (like Aiko) may detect artifacts even at 320 kbps on certain material.

33.9 Bit Rate and Quality — 128 kbps vs. 192 kbps vs. 320 kbps

Bit rate is the amount of audio data transmitted per unit time, measured in kilobits per second (kbps). Higher bit rates mean more data per second, which means more bits available to encode each analysis frame, which means more precise quantization and fewer forced compromises in the masking model. The relationship between bit rate and quality is not linear, but the general principles are:

128 kbps MP3: The compression ratio is approximately 10:1 relative to 16-bit/44.1 kHz CD audio. At this bit rate, the psychoacoustic model is under pressure: the bit allocation must be aggressive, and masked components are discarded even when their masking is incomplete. Artifacts are audible on challenging material (classical, acoustic, jazz) to listeners with good hearing. For consumer pop music on earbuds, typically acceptable.

192 kbps MP3: Approximately 6.5:1 compression. The psychoacoustic model has more room to work. Artifacts are reduced in frequency and severity. For most listeners on most material, 192 kbps is the practical transparency threshold for MP3.

320 kbps MP3: Approximately 4:1 compression. This is the highest standard MP3 bit rate. The psychoacoustic model has enough bits to code most audio components with high precision. Most double-blind tests find listeners cannot reliably distinguish 320 kbps MP3 from lossless audio for most program material, though some listeners on some demanding material (classical, acoustic guitar) report detectable differences.

📊 Data/Formula Box: Bit Rates and File Sizes

For a 3-minute stereo song (180 seconds):

Format              Bit Rate      File Size (approx.)   Compression
CD (WAV 16/44.1)    1,411 kbps    ~31.7 MB              1:1
FLAC (lossless)     ~700 kbps     ~15 MB                ~2:1
MP3 320 kbps        320 kbps      ~7.2 MB               ~4.5:1
MP3 192 kbps        192 kbps      ~4.3 MB               ~7.5:1
MP3 128 kbps        128 kbps      ~2.9 MB               ~11:1
AAC 128 kbps        128 kbps      ~2.9 MB               ~11:1
Opus 96 kbps        96 kbps       ~2.2 MB               ~15:1
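The figures in the box follow directly from bit rate and duration; a one-line helper (purely illustrative) reproduces them:

```python
def file_size_mb(bitrate_kbps, seconds):
    # kilobits/s × s -> kilobits; ÷8 -> kilobytes; ÷1000 -> megabytes
    return bitrate_kbps * seconds / 8.0 / 1000.0

# A 3-minute (180 s) song: ~31.7 MB at the CD rate of 1,411 kbps,
# ~7.2 MB at 320 kbps, ~2.9 MB at 128 kbps.
```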

33.10 AAC and Modern Codecs — Why AAC Sounds Better Than MP3 at the Same Bit Rate

The Advanced Audio Coding format (AAC), standardized by the ISO/MPEG group in 1997 with contributions from AT&T Bell Laboratories, Dolby, Fraunhofer, and Sony, achieves audibly better quality than MP3 at equivalent bit rates. The improvement is real and measurable, not a marketing claim.

Why AAC is better:

More flexible filterbank: MP3's hybrid filterbank yields 576 spectral coefficients; AAC uses a pure 1,024-coefficient MDCT (in the main profiles), providing significantly finer frequency resolution. This allows the psychoacoustic model to make more precise masking decisions and waste fewer bits.

Better psychoacoustic model: AAC was designed after a decade of additional psychoacoustic research. Its model incorporates more accurate measurements of masking thresholds and critical band widths, resulting in more efficient bit allocation.

Temporal noise shaping (TNS): AAC includes TNS, which applies noise shaping in the time domain within each analysis frame, specifically targeting the pre-echo problem. TNS shapes the quantization noise spectrum to minimize pre-echo artifacts, addressing one of MP3's most audible weaknesses.

Improved entropy coding: AAC uses a larger and more flexible set of Huffman codebooks than MP3, reducing the overhead of representing the quantized coefficients.

The practical result: AAC at 128 kbps is approximately equivalent to MP3 at 192 kbps in perceptual quality — the same quality requires less data. This is why Apple adopted AAC as its standard codec for iTunes and the iPod, and why streaming services generally prefer AAC or the even more modern Opus codec over MP3.

Opus (2012) is the current state-of-the-art perceptual audio codec for internet streaming, developed by the Xiph.Org Foundation and others. It achieves excellent quality at very low bit rates (as low as 32 kbps for speech, 96-128 kbps for music) through an advanced hybrid SILK+CELT architecture that combines linear prediction (efficient for speech) with MDCT transform coding (efficient for music).

33.11 Lossless Streaming: When the Physics Matters Again

The streaming era began with lossy compression as a necessary compromise. By the mid-2010s, storage costs had fallen enough and bandwidth had grown enough that lossless streaming became economically viable. Apple Music launched lossless streaming (ALAC, 24-bit/44.1 kHz to 24-bit/192 kHz) in 2021. Tidal launched "HiFi" (lossless CD quality) and "HiFi Plus" (24-bit/96 kHz MQA) services. Amazon Music Unlimited added HD and Ultra HD tiers. Spotify (as of 2026) has continued to evaluate high-quality audio options.

The physics of lossless streaming: at 1,411 kbps (CD quality), streaming requires approximately 10 MB per minute. At a typical 4G/LTE speed of 10-50 Mbps, this is entirely feasible. Even 24-bit/96 kHz audio at approximately 4,608 kbps is manageable on 5G or fast home broadband. The technical barrier that made lossy compression necessary in the 1990s no longer exists in much of the developed world.

What changes with lossless streaming? For most listeners in most contexts (earbuds on a subway, laptop speakers in a coffee shop), nothing perceptible changes. But for listeners using high-quality systems in quiet environments — the 5-10% of listeners who might actually hear the difference — lossless streaming provides assurance that no codec artifacts, no psychoacoustic model decisions, and no information discarding has occurred between the master recording and their ears.

For researchers like Aiko Tanaka, lossless streaming is not a luxury — it is a prerequisite. Any analysis of acoustic features that might be affected by psychoacoustic modeling (the singer's formant, fine spectral detail, temporal precision of onset) requires lossless source material.

💡 Key Insight: The question "does lossless streaming make a difference?" is not a single question. It depends on the listener, the listening environment, the playback equipment, the musical content, and the purpose of listening. For casual consumption of pop music on earbuds, probably not. For research, archiving, or critical listening of acoustic music on high-quality systems, it may be significant.

33.12 Pre-Echo: The Artifact You Can Hear

Pre-echo is worth examining in detail because it is the most distinctive and diagnosable MP3/lossy codec artifact — and one of the most instructive about how the masking model works and where it fails.

The mechanism, step by step:

  1. A loud transient (drum attack, piano key strike, plucked string attack) occurs near the end of an analysis frame.
  2. The psychoacoustic model analyzes the frame and computes the masking threshold.
  3. Much of the frame is quiet (before the transient), so the model sees a relatively low average signal level.
  4. The bit allocation for the frame is set based on the average masking, which may be relatively low — the encoder saves bits because it thinks the frame is mostly quiet.
  5. The encoded frame has relatively coarse quantization — there are not many bits available.
  6. The quantization noise is distributed across the entire frame, including the quiet period before the transient.
  7. During decoding and reconstruction, this quantization noise is present in the quiet pre-transient portion of the frame.
  8. In the original signal, this pre-transient period is near-silent. In the codec's output, it contains quantization noise.
  9. The auditory system provides almost no masking of this pre-transient noise: backward (pre-)masking is far weaker and briefer than forward masking, extending only a few milliseconds before a loud sound. Quantization noise arriving tens of milliseconds before the attack is therefore audible.
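The essential step — noise quantized in the frequency domain spreading across the whole frame in the time domain — can be demonstrated in a toy numpy sketch. This uses an orthonormal DCT as a stand-in for the codec's MDCT filterbank, and the quantizer step size is an arbitrary assumption, not a real encoder's bit allocation:

```python
import numpy as np

N = 1024
rng = np.random.default_rng(0)
frame = 1e-4 * rng.standard_normal(N)        # near-silent "room tone"
frame[900:] += np.sin(2 * np.pi * 3000 / 44100 * np.arange(N - 900))  # loud attack at frame end

# Orthonormal DCT-II matrix (stand-in for the MDCT analysis filterbank).
n = np.arange(N)
basis = np.sqrt(2 / N) * np.cos(np.pi / N * np.outer(n, n + 0.5))
basis[0] /= np.sqrt(2)

# Coarse quantization of the frequency coefficients, mimicking a frame whose
# bit budget was set by its low average level.
step = 0.05
coeffs = basis @ frame
decoded = basis.T @ (np.round(coeffs / step) * step)

# The error was injected per-coefficient in the frequency domain, so in the
# time domain it is smeared across the entire frame -- including the quiet
# region before the attack. That smeared noise is the pre-echo.
pre_rms_in = np.sqrt(np.mean(frame[:800] ** 2))
pre_rms_out = np.sqrt(np.mean(decoded[:800] ** 2))
print(f"pre-transient RMS: original {pre_rms_in:.1e}, decoded {pre_rms_out:.1e}")
```

Running this shows the pre-transient noise floor of the decoded frame rising well above that of the original, even though the quantization "decisions" were all made on frequency coefficients dominated by the loud attack.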

What it sounds like: On a triangle strike: a faint "shhh" immediately before the sharp "ting." On a castanets click: a "sh-" sound before the sharp percussive crack. On a guitar pluck: a very brief noise burst before the clear note attack. On speech consonants: a softening or smearing of the consonant onset. None of these are catastrophically obvious at higher bit rates, but trained listeners find them immediately identifiable.

Adaptive window switching reduces pre-echo by detecting upcoming transients and switching to a shorter MDCT window (128 samples instead of 1,024) when a transient is detected. The shorter window provides better temporal resolution — the quantization noise is confined to a shorter time window and thus less likely to smear across a large silent portion. But detecting the transient reliably and switching windows without introducing its own artifacts is technically challenging.
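A minimal sketch of the detection side, assuming a simple sub-block energy-ratio criterion (a hypothetical detector; real encoders use more sophisticated measures such as filtered energy and perceptual entropy):

```python
import numpy as np

def needs_short_window(frame, n_sub=8, ratio=10.0):
    """Flag a frame for short MDCT windows if any sub-block's energy jumps
    well above the mean energy of the sub-blocks preceding it."""
    sub = frame.reshape(n_sub, -1)
    energy = np.sum(sub ** 2, axis=1) + 1e-12   # guard against all-zero input
    for i in range(1, n_sub):
        if energy[i] > ratio * energy[:i].mean():
            return True
    return False

quiet = 1e-3 * np.ones(1024)
attack = quiet.copy()
attack[900:] = 0.8                       # sharp transient near the frame's end
print(needs_short_window(quiet))         # False
print(needs_short_window(attack))        # True
```

Even this crude detector illustrates the engineering tension the text describes: set `ratio` too low and the encoder switches windows on ordinary musical dynamics (hurting frequency resolution); set it too high and genuine transients slip through with long windows and audible pre-echo.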

🔵 Try It Yourself: To hear pre-echo clearly, use a free audio editing application to compare a FLAC and MP3 version of the same recording. Useful test material: the triangle hits in Mahler's 7th Symphony (II. Nachtmusik I), castanets on any Spanish classical guitar recording, or the opening piano notes of a Chopin nocturne. Zoom in on the waveform to the few hundred milliseconds around a loud attack. In the MP3 version, you should be able to see (and with careful listening, hear) a slight noise elevation in the milliseconds before the attack that is absent in the FLAC version.

⚖️ Debate/Discussion: Does Lossy Audio Compression "Change the Music," or Just Remove What You Can't Hear Anyway?

Position A: Compression changes the music. The physical signal that emerges from a lossy codec is provably different from the original. Frequency components have been removed or reduced. Temporal precision of transients has been degraded. For listeners with trained hearing (Aiko's case), specific acoustic features important to their experience of the music are measurably altered. A sonic object that has had its frequency content permanently altered is a different sonic object, regardless of whether most listeners notice.

Position B: Compression removes only what you can't hear. The psychoacoustic model is built from careful experimental data about what humans can and cannot hear. If the model is accurate, then everything removed by the codec was genuinely inaudible to the listener in question. Removing inaudible information does not change the music-as-experienced. The claim that "the physical signal is different" confuses physical representation with perceptual experience. The music is what the listener hears, not what the spectrogram shows.

The nuanced position: Both are right, for different audiences. For the majority of listeners in typical conditions, position B is approximately correct. For specialized listeners (researchers, professional musicians, audiophiles with trained ears), or for challenging material (classical, acoustic, transient-heavy), position A captures something real. "The music" is not a single thing but a relationship between a physical signal and a listener, and that relationship varies.

33.13 Advanced: The MDCT (Modified Discrete Cosine Transform)

🔴 Advanced Topic

The Modified Discrete Cosine Transform (MDCT) is the frequency analysis tool at the heart of MP3, AAC, Vorbis, Opus, and virtually all modern perceptual audio codecs. Understanding it explains why codecs process audio in overlapping blocks and why this choice has the temporal characteristics it does.

The Discrete Cosine Transform (DCT) is a relative of the Fourier transform that expresses a signal as a sum of cosine waves. Unlike the DFT (Discrete Fourier Transform), which produces complex-valued outputs, the DCT produces real-valued outputs, and for typical audio blocks it concentrates signal energy into fewer significant coefficients — both properties that suit it to compression.

The Modified DCT extends the DCT to overlapping windows. The MDCT analyzes blocks of 2N samples but produces only N output coefficients (not 2N). The "modification" is that consecutive blocks overlap by 50%: each block of 2N samples shares N samples with the preceding block and N samples with the following block. This overlap-add structure ensures that the reconstructed audio is continuous across block boundaries.

The MDCT formula: X[k] = Σ_{n=0}^{2N-1} x[n] × cos[π/N × (n + 1/2 + N/2) × (k + 1/2)], k = 0, 1, ..., N-1

Where x[n] is the windowed input signal (multiplied by an appropriate window function like the sine window to ensure smooth block overlaps) and X[k] are the MDCT coefficients.

Critical sampling: The MDCT is critically sampled: it produces the minimum number of coefficients needed to represent the signal without redundancy (N coefficients from 2N samples), but the overlap structure ensures perfect reconstruction when the inverse MDCT (IMDCT) is applied and the blocks are overlap-added. This critical sampling is what allows the MDCT to achieve efficient compression — every coefficient carries independent information.
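The formula and the overlap-add property can be checked directly in a short numpy sketch. This is a naive O(N²) matrix implementation for illustration (real codecs use fast FFT-based algorithms); the IMDCT scaling of 2/N is one common normalization convention:

```python
import numpy as np

def mdct_basis(N):
    """N x 2N cosine basis from the MDCT formula in the text."""
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))

def mdct(block, N):
    return mdct_basis(N) @ block                  # 2N samples -> N coefficients

def imdct(coeffs, N):
    return (2 / N) * (mdct_basis(N).T @ coeffs)   # N coefficients -> 2N samples

N = 64
window = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))   # sine window
rng = np.random.default_rng(1)
x = rng.standard_normal(4 * N)                                # a few blocks of test signal

# 50%-overlapping blocks, windowed before the MDCT and again after the
# IMDCT, then overlap-added.
out = np.zeros_like(x)
for start in range(0, len(x) - N, N):
    block = window * x[start:start + 2 * N]
    out[start:start + 2 * N] += window * imdct(mdct(block, N), N)

# Every interior sample (covered by two overlapping blocks) is recovered to
# machine precision: the time-domain aliasing of adjacent blocks cancels.
err = np.max(np.abs(out[N:-N] - x[N:-N]))
print(f"max interior reconstruction error: {err:.1e}")
```

Note what the sketch confirms: each block of 2N samples yields only N coefficients (critical sampling), yet the overlap-add of inverse transforms reconstructs the interior of the signal exactly.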

The time-frequency tradeoff: With N = 512 (for a 1,024-sample block), the MDCT provides 512 frequency coefficients spanning 0 to 22,050 Hz, giving a bin spacing of approximately 43 Hz at 44.1 kHz. With the short window (N = 64), bin spacing coarsens to approximately 345 Hz but temporal resolution improves to approximately 3 ms. The adaptive window switching discussed in Section 33.5 exploits this tradeoff: long windows for sustained sounds (better frequency resolution → better masking model accuracy), short windows for transients (better temporal resolution → less pre-echo).
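Under the convention used here (2N-sample blocks producing N coefficients spanning 0 to fs/2), the bin spacing is fs/(2N) and the block spans 2N/fs seconds. A quick arithmetic sketch (the helper name is my own):

```python
fs = 44_100  # CD sample rate (Hz)

def mdct_resolution(N, fs=fs):
    """Return (bin spacing in Hz, block length in ms) for a 2N-sample MDCT block."""
    return fs / (2 * N), 1000 * 2 * N / fs

print(mdct_resolution(512))  # long window: ~43 Hz bins, ~23 ms blocks
print(mdct_resolution(64))   # short window: ~345 Hz bins, ~3 ms blocks
```

The product of the two numbers is fixed: finer frequency bins necessarily mean longer blocks, which is exactly the uncertainty tradeoff that window switching navigates.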

33.14 Theme 4 Synthesis: Technology as Mediator and What Gets Lost

Across Chapters 31, 32, and 33, we have traced a consistent theme: technology mediates between sound and its representation, and every mediating technology embodies choices — about what to capture, what to discard, what to optimize for, and whose needs to serve.

Edison's phonograph was optimized for mechanical fidelity with the constraint of groove physics. It discarded high frequencies not because anyone chose to sacrifice them, but because the stylus geometry and horn resonances simply could not capture them. The technology made the choice by physical necessity.

Magnetic tape was optimized for linear magnetic response within the constraints of oxide particle physics and hysteresis curves. The "warmth" it added was not a designed feature — it was an artifact of the physics of ferromagnetic saturation, which engineers eventually learned to exploit rather than eliminate.

The Nyquist theorem provided a mathematical framework that made digital audio possible — and defined the minimum requirement for "perfect" representation. The CD standard embodied specific engineering choices about what "perfect" meant: 20 Hz to 20,000 Hz, 96 dB dynamic range, two channels.

The MP3 psychoacoustic model made a more radical set of choices: it built a model of what typical human listeners can and cannot hear, and used that model to determine what information to permanently discard. This is technology as mediator at its most consequential — the mediation is not just physical filtering but cognitive filtering, built from a specific theory of human perception, optimized for a specific statistical average of listeners, and encoded into the file format itself.

When the model is accurate — for most listeners, most content, most conditions — the mediation is invisible. The listener hears music as if it were unmediated. This is the goal, and for the vast majority of real-world listening, it is achieved.

When the model is wrong — for Aiko's specific research needs, for trained musicians listening to demanding material, for the listener who notices the pre-echo on a triangle strike — the mediation becomes visible. The technology's choices become audible as artifacts.

This is the deepest lesson of perceptual audio coding: the technology incorporates a theory of the listener. That theory is an approximation — a model, not the reality. The model works well enough for its intended purpose, but it cannot be universally correct, because human perception is not uniform across all listeners, all uses, and all contexts. What gets lost in lossy compression is determined by the perceptual model built into the technology, and that model is always a theory about what matters and what doesn't.

For most purposes, that model is right. For some purposes — including Aiko's — it is exactly wrong.

33.15 Summary and Bridge to Chapter 34

Chapter 33 has developed the physics and mathematics of perceptual audio coding — how lossy compression works, what it discards, and why what it discards is sometimes more than meets the ear.

The psychoacoustic model — masking thresholds, critical bands, temporal masking, the absolute threshold of hearing — provides the theoretical framework for identifying inaudible information. The MDCT provides the mathematical tool for decomposing audio into time-frequency components. The bit allocation algorithm combines these to produce a compact representation that is designed to be perceptually equivalent to the original.

The failure modes — pre-echo, high-frequency smearing, singer's formant degradation, fine spectral structure loss — reveal the boundaries of the psychoacoustic model's accuracy. These artifacts are most audible on demanding material (classical, acoustic) and for listeners with specific training or acoustic knowledge (researchers, musicians).

The trajectory of compression technology — from MP3 to AAC to Opus — has been one of steadily improving psychoacoustic models and more efficient coding, achieving better quality at lower bit rates. Lossless streaming, now technologically and economically feasible, offers an alternative for contexts where any information loss is unacceptable.

Key Takeaway: Every lossy codec is a built-in theory of what you cannot hear. When that theory is right about you, the compression is transparent. When it is wrong — because you are listening carefully, with trained ears, to demanding material — the artifacts reveal the technology's assumptions. Aiko's discovery is not exceptional. It is what happens whenever the model of the user built into a technology encounters a user who falls outside the model's design parameters.

Chapter 34 will move from the technology of recording to the technology of synthesis — how electronic instruments generate sound from first principles, and what new musical possibilities this has created.

