Chapter 33 Quiz: Audio Compression — MP3, Perceptual Coding & What We Lose

20 questions. Click the arrow to reveal each answer.


Question 1. What is the fundamental difference between lossless and lossy audio compression? Give an example of each type.

Show Answer **Lossless compression** (e.g., FLAC, ALAC) reduces file size by removing statistical redundancy in the data — exploiting the fact that audio samples are correlated with their neighbors — while preserving every bit of the original audio data. The decompressed file is bit-for-bit identical to the original. Typical compression ratio: approximately 2:1.

**Lossy compression** (e.g., MP3, AAC, Opus) permanently discards audio information judged to be inaudible by a psychoacoustic model. The decompressed audio is physically different from the original, though it is designed to be perceptually similar. Typical compression ratios: 4:1 to 15:1.

The key difference: lossless compression is reversible (decode → original); lossy compression is irreversible (once information is discarded, it cannot be recovered from the compressed file).
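The reversibility distinction can be shown in a few lines. This is a toy sketch (zlib stands in for a lossless codec, and coarse requantization stands in for a lossy one; neither resembles FLAC's or MP3's real machinery):

```python
import zlib

import numpy as np

# Lossless: the decompressed bytes are bit-for-bit identical to the input.
samples = np.arange(-1000, 1000, dtype=np.int16).tobytes()
assert zlib.decompress(zlib.compress(samples)) == samples

# Lossy (toy): coarse requantization discards the low bits irreversibly.
x = np.frombuffer(samples, dtype=np.int16)
lossy = (x // 64) * 64           # keep only coarse amplitude steps
print(np.array_equal(lossy, x))  # False: the fine detail cannot be recovered
```

Once the low-order detail is rounded away, no decoder can reconstruct it; the same is true of the spectral components an MP3 encoder omits.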

Question 2. What is a psychoacoustic model, and why is it essential to lossy audio compression?

Show Answer A **psychoacoustic model** is a mathematical representation of the human auditory system's sensitivity — specifically, what sounds humans can and cannot hear under various conditions. It encodes experimental measurements of:

- The absolute threshold of hearing (minimum audible level by frequency)
- Simultaneous masking (loud sounds making nearby frequencies inaudible)
- Temporal masking (sounds near loud events being inaudible)
- The critical band structure of the cochlea

The psychoacoustic model is essential to lossy compression because it identifies which audio components are *genuinely inaudible* to the listener in context. Only those components can be safely discarded without the listener noticing. Without the model, the codec would have no principled way to decide what to remove — any arbitrary removal would produce audible distortion.
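One ingredient listed above, the absolute threshold of hearing, has a widely used closed-form fit (Terhardt's approximation, in dB SPL as a function of frequency). A sketch for illustration; the constants come from the published fit, not from any particular codec:

```python
import math

def quiet_threshold_db(f_hz: float) -> float:
    """Terhardt's approximation of the absolute threshold of hearing (dB SPL)."""
    khz = f_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

# Hearing is most sensitive near 3-4 kHz; the threshold rises toward the
# extremes of the audible range, so components there can be dropped sooner.
for f in (100, 1000, 3300, 10000):
    print(f, round(quiet_threshold_db(f), 1))
```

Any component that falls below this curve is inaudible even in silence, before masking is considered at all.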

Question 3. Explain simultaneous masking and how an MP3 encoder exploits it.

Show Answer **Simultaneous masking** is the phenomenon by which a loud sound at one frequency raises the threshold of hearing for nearby frequencies, making quieter sounds in the same frequency region inaudible. The masking spreads asymmetrically: more strongly upward in frequency (toward higher frequencies) than downward.

**How MP3 exploits it:** For each analysis frame (approximately 23 ms), the MP3 encoder computes the spectral masking threshold across all frequencies. This threshold rises under loud components and falls away from them. Frequency components that fall entirely below the masking threshold are assigned zero bits — they are completely omitted from the encoded file. Components above the threshold receive bits proportional to how far they exceed the threshold. This allows the encoder to concentrate its available bits where they matter most perceptually.
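A heavily simplified sketch of threshold-based bit allocation. The band energies, slopes, and the 16 dB offset are illustrative values chosen for this demo, not MP3's actual model parameters; the asymmetry (a shallower attenuation slope toward higher bands) mirrors the upward spread described above:

```python
import numpy as np

def masking_threshold(band_db, down_slope=27.0, up_slope=10.0, offset=16.0):
    """Toy per-band masking threshold: each band spreads masking to its
    neighbors, attenuating steeply toward lower bands and gently toward
    higher bands (masking spreads more strongly upward in frequency)."""
    n = len(band_db)
    thresh = np.full(n, -np.inf)
    for m, level in enumerate(band_db):
        for b in range(n):
            slope = up_slope if b > m else down_slope  # dB of attenuation per band
            thresh[b] = max(thresh[b], level - offset - slope * abs(b - m))
    return thresh

bands = np.array([20.0, 70.0, 30.0, 15.0, 10.0])  # per-band energies, dB
thresh = masking_threshold(bands)
audible = bands > thresh   # bands at or below threshold would get zero bits
print(audible)             # here the loud 70 dB band masks all its neighbors
```

In this toy case only the loud band stays above its own masking threshold; the quieter neighbors would be omitted entirely, which is exactly the zero-bit assignment the answer describes.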

Question 4. What is the Bark scale, and why does the psychoacoustic model use it rather than a linear Hz scale?

Show Answer The **Bark scale** is a perceptual frequency scale that reflects the frequency resolution of the human cochlea. One Bark corresponds to approximately one critical band — the width of a frequency region processed together by the auditory system. The Bark scale is nonlinear: critical bands are narrow at low frequencies (around 160 Hz wide at 1,000 Hz) and wide at high frequencies (around 1,500 Hz wide at 10,000 Hz).

The psychoacoustic model uses the Bark scale rather than linear Hz because masking is mediated by the cochlea's critical band structure. A loud tone masks sounds within the same critical band (approximately one Bark) much more than sounds in adjacent bands. The Bark scale captures this physical reality: equal distances in Bark correspond to approximately equal perceptual distances, not equal Hz distances. Using a linear Hz scale would predict incorrect masking widths at different frequencies.
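The Hz-to-Bark mapping has a standard closed-form approximation (Zwicker and Terhardt's formula), which makes the nonlinearity easy to see:

```python
import math

def hz_to_bark(f_hz: float) -> float:
    """Zwicker & Terhardt's approximation of the Hz-to-Bark mapping
    (one of several published fits for the critical-band scale)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

# A 900 Hz step near 100 Hz crosses several critical bands; the same
# step near 10 kHz stays inside roughly one band.
for f in (100, 1000, 10000):
    print(f, round(hz_to_bark(f), 2))
```

The full audible range spans roughly 24 Bark, matching the approximately 24 critical bands of the cochlea.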

Question 5. What is temporal masking? Distinguish between forward and backward masking.

Show Answer **Temporal masking** is the extension of masking in time: a loud sound affects the audibility of quieter sounds not just at the same time, but before and after the loud sound.

**Forward masking (post-masking):** A loud sound masks quieter sounds that occur *after* it, for up to 200 ms. It is caused by the recovery time of auditory neurons adapting after strong stimulation. Strong and well-established.

**Backward masking (pre-masking):** A loud sound can mask quieter sounds that occurred *before* it, for up to approximately 20 ms. It is weaker and more controversial than forward masking, and is theorized to result from the auditory system's temporal integration window.

Codecs exploit forward masking extensively: quiet sounds immediately following a loud transient can be coded at lower precision because they will be masked. Backward masking is less reliably exploitable. The attempt to exploit backward masking is related to the pre-echo artifact: the codec may assume a quiet pre-attack signal is masked by the upcoming loud event, but this assumption is not always valid.

Question 6. What is pre-echo, and what causes it?

Show Answer **Pre-echo** is a compression artifact that appears as a brief burst of noise *before* a sharp transient in the reconstructed audio. On a triangle strike, it sounds like "shhh-ting" instead of the clean "ting" of the original.

**Cause:** The MP3 encoder analyzes audio in frames (approximately 23 ms). If a sharp transient occurs near the *end* of a frame, the psychoacoustic model analyzes the entire frame's energy and computes a high masking threshold driven by the transient's loud spectrum. The high masking threshold permits coarse quantization — the frame is coded with a large quantization noise floor, on the assumption that the transient will mask it. This noise floor is distributed across the entire frame, including the silent pre-transient portion. When reconstructed, this noise appears in the pre-transient silence — where there is no transient energy to provide masking (forward masking operates after a loud event, not before it). The noise is therefore audible.
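The mechanism can be simulated with a toy transform coder: quantize the transform coefficients of a frame that is silent until a late transient, and the quantization error lands across the whole frame, including the silence before the attack. This sketch uses a plain orthonormal DCT and a single fixed quantizer step, not MP3's actual filterbank or bit allocation:

```python
import numpy as np

N = 1024
n = np.arange(N)
# Orthonormal DCT-II basis (row k = analysis vector for coefficient k)
basis = np.sqrt(2.0 / N) * np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / N)
basis[0] *= np.sqrt(0.5)

frame = np.zeros(N)
frame[900:] = np.sin(2 * np.pi * 0.3 * np.arange(N - 900))  # transient late in frame

step = 0.2                                   # one coarse quantizer for the frame
coeffs = basis @ frame
decoded = basis.T @ (step * np.round(coeffs / step))

# The quantization error is spread over the whole frame by the inverse
# transform, so noise now appears in the formerly silent region *before*
# the transient, where nothing masks it.
pre_noise = decoded[:900] - frame[:900]
print(np.max(np.abs(pre_noise)))
```

The original first 900 samples are exactly zero; after coarse coding they are not, and that pre-transient noise is the pre-echo.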

Question 7. How does adaptive window switching in MP3 and AAC address the pre-echo problem?

Show Answer **Adaptive window switching** detects sharp transients in the audio and switches from the normal long MDCT analysis window (1,024 samples in AAC, corresponding to approximately 23 ms) to a shorter window (128 samples, approximately 3 ms) around the transient.

**Why it helps:** The shorter window confines the analysis frame to a much narrower time region around the transient. The pre-transient silent period and the post-attack period are each analyzed in separate short frames. The quantization noise introduced by coding the transient frame at low precision is now spread only across the 3 ms short window, not across 23 ms. The brief noise burst (now confined to 3 ms) is masked by the forward masking of the transient itself — the transient's loud energy masks the noise that occurs after it.

**The tradeoff:** Short windows provide poor frequency resolution (128 samples → approximately 344 Hz per coefficient at 44.1 kHz). This makes the masking model less precise for sustained tonal content in the same region, potentially wasting bits on components that would be better coded with a long window.
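A minimal sketch of the detection side. Real encoders use more elaborate criteria (e.g., perceptual entropy); this toy heuristic simply flags a frame for short windows when one sub-block's energy jumps far above its predecessor's:

```python
import numpy as np

def needs_short_windows(frame, n_sub=8, ratio=8.0):
    """Toy transient detector: split the frame into sub-blocks and flag a
    transient when a sub-block's energy jumps well above the previous one's."""
    sub = frame.reshape(n_sub, -1)
    energy = (sub ** 2).sum(axis=1) + 1e-12   # guard against division by zero
    return bool(np.any(energy[1:] / energy[:-1] > ratio))

rng = np.random.default_rng(1)
quiet = 0.01 * rng.standard_normal(1024)      # steady low-level noise
attack = quiet.copy()
attack[900:] += 1.0                           # sharp onset late in the frame
print(needs_short_windows(quiet), needs_short_windows(attack))
```

The steady frame keeps its long window; the frame containing the attack is switched, confining any coarse quantization noise to the short windows around the onset.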

Question 8. Describe Aiko Tanaka's experiment and what she discovered about MP3 compression and the singer's formant.

Show Answer **The experiment:** Aiko compared FLAC (lossless) and 128 kbps MP3 recordings of a professional choir performing Brahms, specifically examining the spectral and temporal characteristics of the recordings in her area of dissertation research.

**Discovery 1 — Singer's formant degradation:** The singer's formant cluster (2,800–3,200 Hz), a characteristic spectral peak that distinguishes trained singers from untrained ones and allows voices to project over orchestral texture, was significantly reduced in the MP3 version. The spectral peak's amplitude dropped from approximately +8-10 dB above the surrounding spectrum (in FLAC) to approximately +4-5 dB (in MP3). The psychoacoustic model had treated the fine spectral structure in this region as near-threshold in the complex choral masking environment and under-allocated bits.

**Discovery 2 — Temporal smearing:** Consonant attacks, which Aiko uses to measure voice onset time and manner of articulation, were smeared by pre-echo extending approximately 15-20 ms before the consonant onset — rendering her temporal measurements invalid from MP3 sources.

**Her notebook conclusion:** "The codec is blind to exactly what I'm measuring. It optimizes for what ordinary listeners hear, not what I study."

Question 9. Why does AAC sound better than MP3 at the same bit rate?

Show Answer AAC achieves better quality than MP3 at equivalent bit rates due to several technical improvements:

1. **Larger MDCT window:** AAC uses 1,024 MDCT coefficients (vs. MP3's 576), providing finer frequency resolution for more precise masking calculations and more efficient bit allocation.
2. **Better psychoacoustic model:** AAC incorporates more accurate experimental data on masking thresholds and critical band widths, developed after a decade of additional research following MP3's design.
3. **Temporal Noise Shaping (TNS):** AAC's TNS specifically addresses pre-echo by shaping quantization noise in the time domain within each analysis frame, reducing the amplitude of pre-echo artifacts.
4. **More efficient entropy coding:** AAC uses more flexible Huffman coding, with multiple codebooks selectable per group of coefficients, reducing the overhead required to represent quantized coefficients.

The practical result: AAC at 128 kbps is approximately equivalent to MP3 at 192 kbps in perceptual quality — the same audio quality requires less data.

Question 10. What compression ratio does 128 kbps MP3 achieve relative to CD audio, and how was this ratio calculated?

Show Answer **Ratio calculation:** CD audio bit rate: 44,100 samples/sec × 16 bits/sample × 2 channels = 1,411,200 bits/sec = 1,411 kbps.

128 kbps MP3 compression ratio: 1,411 / 128 ≈ **11:1**. This means the MP3 file contains approximately 1/11 of the data of the original CD audio — about 90% of the audio information has been discarded (or more precisely, deemed inaudible by the psychoacoustic model and omitted from the encoded file).

At 320 kbps MP3: 1,411 / 320 ≈ 4.4:1 compression ratio — approximately 77% of CD data is discarded.

At 64 kbps AAC (common for speech): 1,411 / 64 ≈ 22:1 compression ratio — approximately 95% of the data is discarded.
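The arithmetic above is easy to verify:

```python
# CD bit rate: samples/s x bits/sample x channels = 1,411,200 bits/s
cd_bps = 44_100 * 16 * 2

for kbps in (128, 320, 64):
    ratio = cd_bps / (kbps * 1000)          # compression ratio vs. CD audio
    discarded = 1 - (kbps * 1000) / cd_bps  # fraction of CD data not kept
    print(f"{kbps:3d} kbps -> {ratio:4.1f}:1 compression, {discarded:.0%} of CD data discarded")
```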

Question 11. What is the MDCT (Modified Discrete Cosine Transform), and why is it used in audio codecs rather than the DFT?

Show Answer The **MDCT** is a frequency analysis transform that decomposes overlapping blocks of audio into cosine-wave components, producing N output coefficients from 2N input samples.

**Why MDCT rather than DFT:**

1. **Real-valued output:** The DFT produces complex (real + imaginary) values; the MDCT produces only real values, roughly halving the number of values to store and making the representation more efficient.
2. **Critical sampling:** The MDCT with 50% overlap produces N coefficients from 2N samples — no redundancy beyond what is needed for smooth reconstruction. The DFT of N samples produces N complex values (2N real numbers), which is redundant for real signals.
3. **Perfect reconstruction:** The MDCT's overlap-add structure guarantees perfect reconstruction of the original signal when the inverse MDCT is applied and blocks are overlap-added. This is essential: artifacts from coding one block must not create discontinuities at block boundaries.
4. **Efficient computation:** The MDCT can be computed efficiently using fast algorithms related to the FFT.
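The perfect-reconstruction property is checkable in a few lines. This is a direct (matrix-based, not fast) MDCT/IMDCT pair with a Princen-Bradley sine window; the region covered by two overlapping frames comes back exactly, which is the time-domain aliasing cancellation (TDAC) behind point 3:

```python
import numpy as np

def mdct(frame):
    """2N samples -> N coefficients."""
    N = len(frame) // 2
    n, k = np.arange(2 * N), np.arange(N)
    C = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return C @ frame

def imdct(coeffs):
    """N coefficients -> 2N samples (time-aliased until overlap-added)."""
    N = len(coeffs)
    n, k = np.arange(2 * N), np.arange(N)
    C = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (2.0 / N) * (C @ coeffs)

N = 64
# Sine window, satisfying the Princen-Bradley condition w[m]^2 + w[m+N]^2 = 1
win = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))
rng = np.random.default_rng(0)
x = rng.standard_normal(3 * N)

# Two overlapping frames (hop = N), windowed on analysis and synthesis.
y = np.zeros(3 * N)
y[0:2 * N] += win * imdct(mdct(win * x[0:2 * N]))
y[N:3 * N] += win * imdct(mdct(win * x[N:3 * N]))

# The middle N samples are covered by both frames: each frame's time-domain
# aliasing cancels the other's in the overlap-add, recovering x exactly.
print(np.allclose(y[N:2 * N], x[N:2 * N]))
```

Note that each frame on its own is aliased; only the overlap-add of neighbors is exact, which is why block boundaries stay seamless even when coefficients are coded independently per block.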

Question 12. What does the Spotify Spectral Dataset analysis reveal about which music genres are most and least vulnerable to MP3 compression artifacts?

Show Answer **Most vulnerable (highest artifact severity):**

- Classical and orchestral music: Rich high-frequency acoustic content, wide dynamic range, sharp transients against quiet backgrounds (worst case for pre-echo), fine spectral structures (singer's formant, string bow noise) near masking thresholds.
- Acoustic jazz: Cymbal detail, natural room acoustics, wide dynamic range.

**Moderately vulnerable:**

- Electronic music: Often has sharp synthesized transients, but synthesized sounds may be already band-limited by design.

**Least vulnerable (lowest artifact severity):**

- Heavily produced, dynamically compressed pop: Already has reduced dynamic range (meaning fewer loud-against-quiet transitions that stress the masking model), densely packed spectral content (masking is pervasive, so the codec has more latitude), and limited acoustic high-frequency detail.

**The irony:** Pop music that has already been sonically degraded by heavy dynamic range compression is the easiest material for lossy codecs to encode, while acoustically pristine classical recordings are the hardest. The double compression (dynamics + data) compounds, but the first compression (dynamics) happens to make the second compression (data) less damaging perceptually.

Question 13. What is lossless streaming, and why has it become practically feasible while it was not in the early 2000s?

Show Answer **Lossless streaming** delivers audio files encoded with lossless codecs (FLAC, ALAC) that are bit-for-bit identical to the CD or high-resolution master when decoded. No audio information is discarded.

**Why it was not feasible in the early 2000s:** CD-quality lossless audio streams at approximately 1,000-1,400 kbps. In 2003, typical home broadband was 256 kbps-1 Mbps; mobile data was essentially unavailable for audio streaming. A lossless stream would have consumed the entire available bandwidth.

**Why it is feasible now (2026):** 5G mobile connections commonly deliver 100-400 Mbps; typical home broadband is 100-1,000 Mbps. Even on average connections, 1,400 kbps is a tiny fraction of available bandwidth. Storage costs have also fallen dramatically: storing a lossless music library is now affordable for consumer devices. Apple Music, Tidal, and Amazon Music HD all offer lossless streaming.

The technical barrier is gone; the remaining question is whether the perceptual benefit justifies the higher bandwidth and storage requirements for most listeners.

Question 14. Why does the chapter describe the psychoacoustic model as a "theory of the listener" built into technology?

Show Answer The psychoacoustic model encodes specific assumptions about who is listening, how they are listening, and what they can hear:

- **Assumption about hearing:** The model assumes "average" young adult hearing. It uses masking thresholds derived from typical listeners, not from any specific listener. Elderly listeners may have worse high-frequency hearing; trained musicians may pay more acute attention to specific spectral features.
- **Assumption about context:** The model assumes the listener is listening casually — trying to enjoy the music as a whole, not analyzing specific acoustic features. Aiko is analyzing singer's formant amplitude. The model does not account for this.
- **Assumption about content:** The model assumes typical music content. Its performance degrades for unusual content (very sparse recordings, unusual transient patterns).
- **Built into the file format:** These assumptions are not labeled or disclosed. Every MP3 file contains audio that has been filtered through these assumptions. A listener who falls outside the model's assumed profile — a researcher, a trained musician, someone with unusual hearing — receives audio that has been altered according to assumptions that don't apply to them, without being informed this has occurred.

Question 15. What is the critical band framework in MP3, and how does it relate to the cochlea's physical structure?

Show Answer The **critical band framework** models the cochlea's frequency resolution. The cochlea does not have independent detectors for every frequency; instead, it processes sound through approximately 24 overlapping frequency bands (critical bands), each corresponding to a region of the basilar membrane.

**Physical basis:** The basilar membrane acts as a mechanical filter bank. Each position along the membrane is most responsive to a specific frequency. Because the membrane is a continuous structure, nearby positions are excited together — the width of this co-excitation defines a critical band. Critical bands are approximately 160 Hz wide at 1,000 Hz and approximately 1,500 Hz wide at 10,000 Hz.

**In MP3:** Masking is computed within critical bands — a loud sound in one critical band most strongly masks quieter sounds in the same band. The MP3 filterbank roughly approximates this critical band structure (though the 32 equal-bandwidth subbands are a cruder approximation than the actual Bark scale; AAC's 1,024-coefficient MDCT provides a finer and more accurate frequency representation).

Question 16. Position A argues that "compression changes the music." Position B argues that "compression removes only what you can't hear anyway." Which position do you find more persuasive, and why?

Show Answer This is a debate question; the following represents a nuanced, well-supported response rather than a single "correct" answer.

**The case for Position A (compression changes the music):** The physical signal after decoding an MP3 is measurably and provably different from the original. Frequency components have been removed or reduced. Temporal precision has been degraded. For specific listeners with specific needs (Aiko's singer's formant research), specific acoustic features are measurably altered. A physical object with altered physical properties is a different physical object, regardless of whether most people notice the difference. Music is a physical phenomenon, and changing its physical constitution changes the music.

**The case for Position B (compression removes only what you can't hear):** The psychoacoustic model is built from careful experimental measurements of human perceptual limits. For the listeners and conditions for which those measurements were taken, the removed components are genuinely inaudible. The music-as-experienced is the same. Asking whether the "music" has changed when the perceptual experience is unchanged treats the physical signal as more fundamental than the phenomenal experience — a contestable philosophical commitment.

**The nuanced position:** Both are right for different situations. For most listeners under typical conditions, the psychoacoustic model is accurate and Position B is correct. For specialized listeners (Aiko) or challenging material, Position A captures something real. "The music" is not a single thing — it is a relationship between physical signal and listener — and that relationship changes differently for different listeners.

Question 17. How does the Opus codec differ from MP3 in its design goals and technical architecture?

Show Answer **Opus** (2012) was designed specifically for low-latency internet streaming and real-time communication, where MP3 was designed for stored audio files.

**Key differences:**

*Latency:* MP3 frames are approximately 23 ms — too long for real-time VoIP (voice-over-IP), where total one-way latency must be under approximately 150 ms. Opus uses shorter frames (2.5–60 ms selectable), allowing latency as low as 5-10 ms for real-time applications.

*Hybrid architecture:* Opus uses SILK (a speech linear-prediction model, highly efficient for voice at low bit rates) and CELT (Constrained Energy Lapped Transform, MDCT-based, efficient for music). It switches between them or blends them based on content type. MP3 uses only its MDCT filterbank for all content.

*Modern psychoacoustic model:* Opus incorporates 20 additional years of psychoacoustic research compared to MP3, achieving better quality at lower bit rates.

*Lower minimum bit rate:* Opus produces acceptable speech quality at 6-16 kbps; MP3 requires at least 32 kbps for speech (and sounds poor). For music, Opus at 96 kbps is competitive with MP3 at 192-256 kbps.

Question 18. What does the pre-echo artifact reveal about the fundamental tension in codec design between frequency resolution and temporal resolution?

Show Answer Pre-echo reveals the **time-frequency uncertainty principle** as applied to audio codecs. This is directly analogous to the Heisenberg uncertainty principle: you cannot simultaneously have perfect frequency resolution and perfect temporal resolution in the same analysis frame.

**Long analysis windows** (1,024 samples, approximately 23 ms): High frequency resolution (fine spectral detail, better masking model accuracy), but poor temporal resolution (events within the frame cannot be separately located in time). Pre-echo occurs when a transient arrives late in a long frame — the frame is coded with a noise budget appropriate for its average content, but the pre-transient silence has no masking to hide this noise.

**Short analysis windows** (128 samples, approximately 3 ms): Good temporal resolution (events can be more precisely located in time, and pre-echo is confined to a 3 ms interval easily masked by the transient), but poor frequency resolution (coarser spectral bins, a less precise masking model, less efficient coding of tonal content).

Adaptive window switching is the engineering compromise: use long windows for sustained tonal content (where temporal precision matters less) and short windows for transients (where temporal precision is essential). The transitions themselves introduce additional complexity and can produce their own artifacts.
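The tradeoff is fixed by arithmetic: a frame of n samples at sample rate fs spans n/fs seconds, and its analysis bins are spaced fs/n Hz apart, so the time-frequency product is always 1. A quick check with the window sizes discussed above:

```python
fs = 44_100  # Hz, CD sample rate
for n_samples in (1024, 128):    # the long and short window sizes above
    dt = n_samples / fs          # time span of the frame, seconds
    df = fs / n_samples          # spacing of the frame's frequency bins, Hz
    print(f"{n_samples:4d} samples: {1000 * dt:5.1f} ms window, "
          f"{df:6.1f} Hz bins, dt*df = {dt * df:.0f}")
```

Halving the window halves temporal uncertainty but doubles the bin width; no window length can shrink both at once.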

Question 19. How has the music industry's economic structure changed as a direct result of MP3 and lossy audio compression?

Show Answer Lossy audio compression enabled several transformative shifts in the music economy:

**File sharing (1999–2003):** MP3 files small enough to share over internet connections enabled Napster and subsequent peer-to-peer networks. This allowed users to access nearly the entire catalog of recorded music without payment. Music industry revenue declined from approximately $40 billion (1999) to $15 billion (2015).

**Portable players (2001 onward):** The original iPod's "1,000 songs in your pocket" was only possible because MP3 compression fit those songs into its 5 GB drive. Without compression, the same device would have held roughly 100 songs. This created the personal music player market.

**Digital download stores (2003-2015):** iTunes, Amazon MP3, and others sold individual compressed tracks for $0.99-$1.29, disaggregating the album (which had been the economic unit of music for 40 years). Artists and labels were forced to compete at the single-track level.

**Streaming (2008 onward):** Higher-bitrate lossy streaming made catalog-level music access economically viable. Spotify, Apple Music, and others deliver music at 128-320 kbps, storing entire libraries in data centers efficiently and streaming them on demand. Music industry revenue has partially recovered through streaming.

**Structural change:** The music industry shifted from selling physical objects (CDs) to selling access (subscriptions). Bargaining power shifted from labels (who controlled physical distribution) toward platforms (who control digital distribution).

Question 20. The chapter's Theme 4 analysis states: "What gets lost in lossy compression is determined by the perceptual model built into the technology, and that model is always a theory about what matters and what doesn't." What does this claim imply for how we should think about audio technology as a whole?

Show Answer This claim has several important implications:

**No technology is neutral.** Every recording and reproduction technology incorporates assumptions about what matters in audio — what frequency range, what dynamic range, what temporal precision, what spatial information, what acoustic context. Edison's phonograph assumed that fundamental frequencies matter and upper harmonics don't (because of stylus geometry limits). The CD standard assumed that 20 Hz-20 kHz and 96 dB dynamic range are sufficient. MP3 assumes that the psychoacoustic model accurately represents all relevant listeners. These are design choices, not physical necessities.

**The model is always somebody's model.** Psychoacoustic models are built from measurements on specific populations of listeners in specific experimental conditions. The "average listener" of the masking model may not resemble any actual listener — and certainly doesn't resemble Aiko. When a technology assumes a "typical user," it implicitly excludes atypical users.

**Technology shapes what is possible to perceive and study.** If all available recordings of choral music are in MP3 format (as is largely true for historical recordings in many archives), the psychoacoustic model's assumptions become a constraint on all research using those recordings. Aiko cannot study the singer's formant from MP3 files. Technology does not just mediate experience — it constrains what can be experienced, studied, and preserved.

**Implications for digital preservation:** As recorded music is progressively archived in compressed formats, the information discarded by the psychoacoustic model is permanently lost. Future researchers with interests we cannot anticipate may find their questions unanswerable because the relevant acoustic information was discarded in 2005.