Chapter 33 Key Takeaways: Audio Compression — MP3, Perceptual Coding & What We Lose

DataField.Dev

Compression Fundamentals

✅ Lossless compression preserves every bit; lossy compression permanently discards information. FLAC and ALAC achieve approximately 2:1 compression by removing statistical redundancy — exploiting correlations between consecutive audio samples — while guaranteeing bit-for-bit identical reconstruction. MP3 and AAC achieve 4:1 to 15:1 compression by permanently discarding audio components the psychoacoustic model predicts to be inaudible.
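As a rough illustration of "removing statistical redundancy," the sketch below stores only the change from one sample to the next and then rebuilds the original exactly. It is a deliberate simplification: FLAC and ALAC use more elaborate linear prediction and entropy coding, and the test signal and the function names (`delta_encode`, `delta_decode`, `bits_needed`) are assumptions made for this example.

```python
import math

def delta_encode(samples):
    """Store the first sample, then only the change from each sample to the next."""
    residual = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        residual.append(cur - prev)
    return residual

def delta_decode(residual):
    """Undo delta_encode with a running sum."""
    samples = [residual[0]]
    for r in residual[1:]:
        samples.append(samples[-1] + r)
    return samples

def bits_needed(values):
    """Width of a fixed-size signed integer that can hold every value."""
    peak = max(abs(v) for v in values)
    return peak.bit_length() + 1  # +1 for the sign bit

# Hypothetical test signal: a 440 Hz sine at 44.1 kHz, scaled to 16-bit range.
fs = 44100
pcm = [round(30000 * math.sin(2 * math.pi * 440 * n / fs)) for n in range(2048)]
residual = delta_encode(pcm)

# Lossless means bit-for-bit identical reconstruction.
assert delta_decode(residual) == pcm

print(bits_needed(pcm), "bits per raw sample")
print(bits_needed(residual[1:]), "bits per residual sample")
```

Because neighboring samples are strongly correlated, the residual values are much smaller than the samples themselves and need fewer bits, yet nothing is thrown away.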

✅ "Compression" means two completely different things in audio. Dynamic range compression (Chapter 31) reduces the ratio of loud to quiet sounds. Data compression (this chapter) reduces the size of the digital file. The terms are unrelated in mechanism and should not be confused.

✅ The compression ratio achievable without perceptible loss depends on the listener, the content, and the listening context. 128 kbps AAC is transparent for most listeners on most pop music through consumer earbuds. The same bit rate may produce audible artifacts for a trained listener evaluating acoustic classical music on high-quality speakers.
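To connect bit rates to compression ratios, the arithmetic can be sketched as follows. The source is assumed to be CD-quality PCM (44.1 kHz, 16-bit, stereo); that assumption and the function names are illustrative rather than taken from the chapter.

```python
def pcm_bit_rate_kbps(sample_rate_hz=44100, bit_depth=16, channels=2):
    """Bit rate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000

def compression_ratio(coded_kbps, source_kbps=None):
    """How many times smaller the coded stream is than the PCM source."""
    if source_kbps is None:
        source_kbps = pcm_bit_rate_kbps()  # assumed CD-quality source
    return source_kbps / coded_kbps

for kbps in (320, 256, 128, 96):
    print(f"{kbps} kbps -> about {compression_ratio(kbps):.1f}:1")
```

Under that assumption, 128 kbps corresponds to a ratio of roughly 11:1, which sits inside the 4:1 to 15:1 range quoted for lossy codecs.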

Psychoacoustic Masking

✅ The psychoacoustic model is a mathematical theory of what you cannot hear. It encodes experimental measurements of simultaneous masking (loud tones masking nearby quieter tones), temporal masking (loud sounds masking events before and after them), and the absolute threshold of hearing (minimum audible level by frequency). These measurements determine which audio components can be safely discarded.
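One ingredient of such a model, the absolute threshold of hearing, is often approximated with a closed-form curve. The sketch below uses Terhardt's widely cited approximation; whether the codecs discussed in this chapter use exactly this formula is not stated here, and the helper `is_below_threshold` is a hypothetical name. It also assumes the playback level in dB SPL is known, which a real encoder can only guess.

```python
import math

def absolute_threshold_db_spl(freq_hz):
    """Terhardt's approximation of the threshold in quiet, in dB SPL."""
    f = freq_hz / 1000.0  # the formula works in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

def is_below_threshold(level_db_spl, freq_hz):
    """True if a lone component at this level would be inaudible in quiet."""
    return level_db_spl < absolute_threshold_db_spl(freq_hz)

for freq in (100, 1000, 3300, 16000):
    print(f"{freq:>6} Hz: {absolute_threshold_db_spl(freq):6.1f} dB SPL")
```

The curve dips lowest around 3 to 4 kHz, where hearing is most sensitive, and rises steeply toward both ends of the spectrum.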

✅ Simultaneous masking is asymmetric: more upward than downward. A loud tone masks nearby frequencies much more strongly upward (toward higher frequencies) than downward. This reflects the mechanics of the cochlea, specifically the direction of the traveling wave on the basilar membrane. Codecs exploit this by assigning fewer bits to frequency regions above loud maskers.
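The asymmetry can be sketched with a two-slope spreading function on the Bark (critical-band) scale. The slopes and offset below are assumed round numbers chosen only to show the shape; they are not the values used by any particular codec. The Hz-to-Bark conversion is the common Zwicker-style approximation.

```python
import math

def hz_to_bark(freq_hz):
    """Zwicker-style approximation of the Bark (critical-band) scale."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

# Assumed illustrative constants, not taken from a codec specification.
LOWER_SLOPE_DB_PER_BARK = 27.0  # steep falloff below the masker
UPPER_SLOPE_DB_PER_BARK = 10.0  # shallow falloff above the masker
MASKER_OFFSET_DB = 10.0         # threshold sits this far under the masker level

def masked_threshold_db(masker_hz, masker_db, probe_hz):
    """Level below which a probe tone is assumed hidden by the masker."""
    dz = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
    slope = UPPER_SLOPE_DB_PER_BARK if dz >= 0 else LOWER_SLOPE_DB_PER_BARK
    return masker_db - MASKER_OFFSET_DB - slope * abs(dz)

# An 80 dB masker at 1 kHz, probed 500 Hz above and 500 Hz below.
print("above:", round(masked_threshold_db(1000, 80, 1500), 1), "dB")
print("below:", round(masked_threshold_db(1000, 80, 500), 1), "dB")
```

With these assumed slopes the masked threshold stays far higher above the masker than below it, which is the region where a codec can spend fewer bits.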

✅ Temporal masking, especially forward masking, allows codecs to use fewer bits after loud transients. A loud sound masks quieter events that follow it for up to 200 ms, so the codec can code post-transient audio with lower precision. Backward masking (pre-masking) is much shorter, so attempts to rely on it lead to the pre-echo artifact when quantization noise lands earlier than the masked interval.

Codec Architecture and Artifacts

✅ Pre-echo is the most characteristic and diagnosable MP3 artifact. When a sharp transient occurs late in an analysis frame, the codec may allocate too few bits (based on the frame's average energy), spreading quantization noise across the entire frame including the pre-transient silence. This noise appears before the transient in the reconstructed audio — audible as "shhh-ting" instead of a clean "ting" on triangle or harpsichord attacks.
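The mechanism can be reproduced in a toy transform coder. This sketch makes several assumptions: a plain DFT stands in for the codec's filterbank, a single fixed quantizer step stands in for bit allocation, and the frame length, transient position, and step size are arbitrary. It is not an MP3 encoder, only a demonstration that coarse quantization in the frequency domain puts noise into the silence before a late transient.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (adequate for one short frame)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft(); keeps the real part of each output sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def quantize(value, step):
    """Round real and imaginary parts to the nearest multiple of step."""
    return complex(round(value.real / step) * step,
                   round(value.imag / step) * step)

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

FRAME_LEN = 256     # assumed toy frame length
TRANSIENT_AT = 200  # the "ting" starts late in the frame
STEP = 0.25         # assumed coarse quantizer step

# Silence, then a decaying sinusoid standing in for a triangle strike.
frame = [0.0] * FRAME_LEN
for n in range(TRANSIENT_AT, FRAME_LEN):
    t = n - TRANSIENT_AT
    frame[n] = math.exp(-t / 10.0) * math.cos(2 * math.pi * 0.2 * t)

decoded = idft([quantize(c, STEP) for c in dft(frame)])
error = [d - o for d, o in zip(decoded, frame)]

print("original level before the transient:", rms(frame[:TRANSIENT_AT]))
print("decoded noise before the transient: ", rms(error[:TRANSIENT_AT]))
```

The original is exactly silent before sample 200, but the decoded frame is not: the quantization error is spread over the whole frame, which is the "shhh" that precedes the "ting."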

✅ Adaptive window switching reduces pre-echo at the cost of frequency resolution. Switching from long blocks (1,024 coefficients in AAC, 576 in MP3; fine frequency resolution) to short blocks (128 coefficients in AAC, 192 in MP3; fine temporal resolution) around transients confines quantization noise to a shorter time window. The tradeoff is a less accurate frequency-domain masking model during and near the window switches.
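The effect of block length on where the noise lands can be sketched with the same kind of toy coder. The assumptions are the same as before and are worth stating: a plain DFT instead of an MDCT, non-overlapping rectangular blocks instead of overlapping windows, a fixed quantizer step, and toy block lengths (256 and 32) chosen to keep the example fast.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def code_block(block, step):
    """Transform one block, quantize it coarsely, and transform back."""
    coarse = [complex(round(c.real / step) * step, round(c.imag / step) * step)
              for c in dft(block)]
    return idft(coarse)

def code_signal(signal, block_len, step):
    """Code the signal in consecutive, non-overlapping blocks."""
    decoded = []
    for start in range(0, len(signal), block_len):
        decoded.extend(code_block(signal[start:start + block_len], step))
    return decoded

def first_noisy_sample(signal, decoded, floor=1e-6):
    """Index of the first sample whose coding error exceeds floor."""
    for i, (s, d) in enumerate(zip(signal, decoded)):
        if abs(d - s) > floor:
            return i
    return len(signal)

TRANSIENT_AT = 200
signal = [0.0] * 256
for n in range(TRANSIENT_AT, 256):
    t = n - TRANSIENT_AT
    signal[n] = math.exp(-t / 10.0) * math.cos(2 * math.pi * 0.2 * t)

long_start = first_noisy_sample(signal, code_signal(signal, 256, 0.25))
short_start = first_noisy_sample(signal, code_signal(signal, 32, 0.25))
print("noise begins at sample", long_start, "with one long block")
print("noise begins at sample", short_start, "with short blocks")
```

With short blocks, every block that contains only silence decodes to silence, so noise cannot start before the block that holds the transient. The price is that each short block has far fewer frequency bins to work with.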

✅ AAC is consistently superior to MP3 at the same bit rate. Larger MDCT blocks (1,024 vs. 576 coefficients), a more accurate psychoacoustic model, Temporal Noise Shaping to reduce pre-echo, and more efficient entropy coding combine so that AAC at roughly 64% of the bit rate sounds approximately equivalent to MP3.
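The block sizes and the 64% figure translate into concrete numbers with a little arithmetic. The sketch assumes a 44.1 kHz sample rate and treats the coefficients as evenly spaced from 0 Hz to half the sample rate, which gives nominal values only. It also reads the 64% figure as "AAC needs about 64% of the MP3 bit rate for similar quality." Function names are illustrative.

```python
SAMPLE_RATE_HZ = 44100  # assumed; the chapter summary does not fix a rate

def bin_spacing_hz(coefficients, fs=SAMPLE_RATE_HZ):
    """Nominal frequency spacing when the coefficients span 0 to fs/2."""
    return fs / (2 * coefficients)

def block_duration_ms(coefficients, fs=SAMPLE_RATE_HZ):
    """Time covered by one block of new samples, in milliseconds."""
    return 1000 * coefficients / fs

def aac_equivalent_kbps(mp3_kbps, factor=0.64):
    """AAC bit rate of roughly equal quality, using the 64% rule of thumb."""
    return mp3_kbps * factor

for name, n in (("AAC long block", 1024), ("MP3 long block", 576)):
    print(f"{name}: {bin_spacing_hz(n):.1f} Hz per bin, "
          f"{block_duration_ms(n):.1f} ms per block")

print(f"128 kbps MP3 ~ {aac_equivalent_kbps(128):.0f} kbps AAC")
```

The larger AAC block buys finer frequency resolution at the cost of a longer block in time, which is one reason AAC also needs tools such as Temporal Noise Shaping to keep pre-echo in check.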

Aiko Tanaka's Discovery

✅ The codec's psychoacoustic model is blind to research-specific acoustic features. Aiko found that 128 kbps MP3 encoding reduced the amplitude of the singer's formant cluster (2,800–3,200 Hz) by approximately 4–5 dB and smeared consonant onset times by 15–20 ms. The psychoacoustic model classified the fine spectral structure of the singer's formant as near-threshold in the complex choral masking environment, and treated pre-attack silence as backward-masked by the upcoming consonant. Both assumptions were wrong for her specific research needs.
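A measurement of this kind can be sketched as a band-level comparison between a reference signal and a processed version of it. This is not Aiko's actual analysis procedure, which is not described here, and the signals below are synthetic stand-ins rather than data from her study. The "processed" signal simply has its 3 kHz component cut by an assumed 4.5 dB to show what the measurement would report.

```python
import cmath
import math

def band_level_db(samples, fs, lo_hz, hi_hz):
    """Energy of the DFT bins between lo_hz and hi_hz, in dB (arbitrary reference)."""
    n = len(samples)
    # A Hann window limits leakage from strong components outside the band.
    windowed = [s * 0.5 * (1 - math.cos(2 * math.pi * t / n))
                for t, s in enumerate(samples)]
    k_lo = math.ceil(lo_hz * n / fs)
    k_hi = math.floor(hi_hz * n / fs)
    energy = 0.0
    for k in range(k_lo, k_hi + 1):
        bin_value = sum(w * cmath.exp(-2j * math.pi * k * t / n)
                        for t, w in enumerate(windowed))
        energy += abs(bin_value) ** 2
    return 10 * math.log10(energy)

FS = 44100
N = 2048

def tone(freq_hz, amplitude):
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / FS) for t in range(N)]

# Synthetic stand-ins: a strong low partial plus a weaker 3 kHz component.
low_partial = tone(500, 1.0)
formant = tone(3000, 0.2)
attenuation = 10 ** (-4.5 / 20)  # assumed 4.5 dB cut, for illustration only

reference = [a + b for a, b in zip(low_partial, formant)]
processed = [a + attenuation * b for a, b in zip(low_partial, formant)]

change_db = (band_level_db(processed, FS, 2800, 3200)
             - band_level_db(reference, FS, 2800, 3200))
print(f"band level change in 2,800-3,200 Hz: {change_db:.2f} dB")
```

In practice the reference would be the lossless recording and the processed signal would be the decoded MP3, time-aligned to the reference before comparison.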

✅ Lossless audio formats are a prerequisite for acoustic research, not a luxury. Any analysis of acoustic features that may be affected by psychoacoustic modeling — spectral fine structure, temporal precision of onsets, low-level detail — requires lossless source material. MP3 and AAC files are unsuitable for acoustic research on affected parameters.

Technology as Mediator

✅ Every lossy codec is a built-in theory of the listener that determines what gets preserved and what gets discarded. The codec embodies assumptions about average hearing, typical listening contexts, and which acoustic features matter. These assumptions serve the majority of listeners well. They fail for listeners outside the model's assumed parameters.

✅ The perceptual model in a codec is based on average human hearing — not any specific human's hearing. Listeners who differ from the average (trained musicians, hearing researchers, people with unusual hearing sensitivity) may find that codec decisions are systematically wrong about what they can and cannot hear.

✅ Lossless streaming represents the removal of the codec's editorial authority. Where lossy streaming builds a theory of the listener into the delivery mechanism, lossless streaming delivers the audio without perceptual filtering. The listener, not the codec's model, determines what is significant.

Theme Connections

Theme 4 (Technology as Mediator): The MP3 algorithm is the most sophisticated mediating technology in this textbook — not just a passive channel between sound and listener, but an active editorial agent that decides what information reaches the listener based on a model of what information matters. When the model is accurate, the mediation is invisible. When the model is wrong, the mediation reveals itself as artifact.

Theme 1 (Reductionism vs. Emergence): The psychoacoustic model is a reductionist project: reduce the listener to a set of masking thresholds and critical band widths. Aiko's experience demonstrates that this reduction loses something real — the capacity to detect fine spectral features that are genuinely audible in the right context. The music (and the listener) are more than the model captures.

Theme 3 (Constraint as Creativity): The bandwidth constraint that made lossy compression necessary produced one of the most consequential music technologies of the twentieth century. The MP3 format, designed to work within the scarcity of 1990s internet bandwidth, enabled the streaming economy that now funds most music production. The constraint was the condition of possibility for the transformation.