Chapter 33 Exercises: Audio Compression — MP3, Perceptual Coding & What We Lose
Part A: Lossless vs. Lossy Compression Fundamentals
A1. Explain the fundamental difference between lossless and lossy audio compression. (a) What does "lossless" mean mathematically — what is preserved exactly? (b) What is the approximate compression ratio achievable by lossless codecs (like FLAC) and why is this ratio hard to improve upon? What principle of information theory sets the limit? (c) A FLAC file is 15 MB. An MP3 at 128 kbps of the same 3-minute recording is 2.9 MB. Calculate the compression ratios for each relative to the uncompressed 44.1 kHz/16-bit original (~32 MB). (d) Under what circumstances would you choose FLAC over MP3 for each of these use cases: (i) podcasting for general audiences, (ii) archiving a field recording for scientific research, (iii) music for a long airplane flight on a smartphone with 16 GB storage, (iv) master files for a commercial release?
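A quick numeric check of part (c), taking MB as 10^6 bytes and using the file sizes given in the exercise:

```python
# Worked check for A1(c): compression ratios relative to the uncompressed original.
# MB is taken as 10**6 bytes; the FLAC and MP3 sizes are the ones given above.
seconds = 3 * 60
fs, bits, channels = 44_100, 16, 2

uncompressed_mb = seconds * fs * channels * bits / 8 / 1e6
flac_mb, mp3_mb = 15.0, 2.9

print(round(uncompressed_mb, 1))             # ≈ 31.8 MB
print(round(uncompressed_mb / flac_mb, 1))   # FLAC: ≈ 2.1:1
print(round(uncompressed_mb / mp3_mb, 1))    # MP3:  ≈ 10.9:1
```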
A2. The term "compression" has two completely different meanings in audio production. (a) Define "dynamic range compression" as used in mastering (Chapter 31). (b) Define "data compression" as used in audio codecs (Chapter 33). (c) Describe a scenario in which music has undergone heavy dynamic range compression but is stored in an uncompressed file format. Is the audio "compressed"? (d) Describe a scenario in which music with excellent dynamic range is stored in a heavily compressed lossy format. Is the audio "compressed"? (e) What is the relationship (if any) between dynamic range compression and data compression? Can one affect the other?
A3. Shannon's entropy sets a theoretical limit on lossless compression. Audio signals have statistical structure that lossless codecs exploit. (a) What kind of statistical structure does audio have — what correlations exist between consecutive samples? (b) FLAC uses "linear prediction" to predict each sample from its neighbors, then stores only the prediction error. Why does storing the error require fewer bits than storing the sample directly? (c) A FLAC file of a 10-minute symphony is 89 MB. A FLAC file of 10 minutes of white noise (truly random) is 105 MB. Which is larger, and why? What does this reveal about the statistical structure of music versus noise? (d) Can lossless compression achieve a 10:1 compression ratio on typical audio? Why or why not?
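The intuition behind part (b) can be demonstrated with a toy first-order predictor. FLAC's actual predictors are higher-order LPC, so this is only a sketch of why correlated signals yield small residuals while noise does not:

```python
import numpy as np

# Toy illustration of A3(b): a predictable signal's prediction residual is
# small, so it codes in fewer bits; white noise has no structure to exploit.
fs = 44_100
t = np.arange(fs) / fs
tone = np.round(30_000 * np.sin(2 * np.pi * 440 * t)).astype(int)  # correlated
noise = np.random.default_rng(0).integers(-30_000, 30_000, fs)     # unpredictable

def residual_range(x):
    # Error of the trivial predictor "next sample = previous sample".
    return int(np.abs(np.diff(x)).max())

print(residual_range(tone))   # small (~1,900): neighboring samples are close
print(residual_range(noise))  # huge (~60,000): nothing to predict
```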
A4. Bit rate determines the data rate of an encoded audio file. (a) A 320 kbps stereo MP3 stream: how many bits are transmitted per second? How many bits are allocated per stereo sample pair at 44,100 Hz? (b) A 128 kbps AAC stream compared to a 320 kbps MP3 stream: both may sound similar in quality. What does this tell you about AAC's efficiency relative to MP3? (c) Spotify streams at different qualities: 24 kbps (low), 96 kbps (normal), 160 kbps (high), 320 kbps (very high). For each bit rate, calculate the file size for a 3-minute song and the approximate compression ratio relative to CD. (d) At what bit rate does a modern codec (AAC or Opus) achieve "transparent" quality for most listeners on most content?
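Parts (a) and (c) are straightforward arithmetic; a sketch tabulating the tiers (kilo = 1,000, MB = 10^6 bytes assumed):

```python
# A4(a,c): bits per stereo sample pair, 3-minute file sizes, and ratios vs. CD.
fs, seconds = 44_100, 3 * 60

def per_pair(kbps):            # bits allocated per stereo sample pair
    return kbps * 1000 / fs

def song_mb(kbps):             # 3-minute file size in MB
    return kbps * 1000 * seconds / 8 / 1e6

cd_kbps = fs * 16 * 2 / 1000   # 1,411.2 kbps uncompressed CD rate
print(round(per_pair(320), 2)) # 320 kbps: ~7.26 bits per stereo pair

for kbps in (24, 96, 160, 320):
    # 24 -> 0.54 MB (58.8:1), 96 -> 2.16 MB (14.7:1),
    # 160 -> 3.6 MB (8.8:1),  320 -> 7.2 MB (4.4:1)
    print(kbps, round(song_mb(kbps), 2), round(cd_kbps / kbps, 1))
```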
A5. FLAC (Free Lossless Audio Codec) is the dominant lossless format for music distribution. (a) Why might a music streaming service choose to offer FLAC alongside their lossy streams rather than offering a very high bit rate MP3 (e.g., 512 kbps)? What does each format guarantee? (b) A 24-bit/96 kHz FLAC file and a 16-bit/44.1 kHz FLAC file: the first is larger. Does the larger file contain more "music," or just more data? What is the actual difference in content? (c) Apple Music offers "lossless" (16-bit/44.1 kHz or 24-bit/48 kHz ALAC) and "hi-res lossless" (24-bit/192 kHz). Is the "hi-res lossless" tier meaningfully better than "lossless" for most listening? What conditions would need to be met for the difference to be perceptible?
Part B: Psychoacoustic Masking
B1. Explain simultaneous masking. (a) A 1,000 Hz tone at 85 dB SPL is played. Approximately what is the masking threshold at 1,100 Hz? At 900 Hz? (Use the asymmetric spreading function: masking spreads upward at approximately 10 dB per critical bandwidth and downward at approximately 25 dB per critical bandwidth, with critical bandwidths of approximately 100 Hz at 1,000 Hz.) (b) Why is upward masking stronger than downward masking? (Hint: consider the traveling wave on the basilar membrane.) (c) A quiet 1,050 Hz tone at 65 dB SPL is present at the same time as the 1,000 Hz masker at 85 dB. Is the 1,050 Hz tone audible? (d) An MP3 codec uses this information to decide: should it allocate bits to encode the 1,050 Hz tone? What is the correct decision for this case?
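The spreading function in part (a) can be expressed as a piecewise-linear rule. This is an idealization (real spreading functions are level- and frequency-dependent) using the exercise's flat 100 Hz critical band:

```python
# Idealized spreading function for B1: 10 dB per critical band upward,
# 25 dB per critical band downward, with a 100 Hz critical band near 1 kHz.
def masking_threshold(masker_hz, masker_db, probe_hz, cb_hz=100.0):
    cb_distance = (probe_hz - masker_hz) / cb_hz   # signed distance in critical bands
    slope = 10.0 if cb_distance >= 0 else 25.0     # upward vs. downward slope
    return masker_db - slope * abs(cb_distance)

print(masking_threshold(1000, 85, 1100))  # 75.0 dB SPL
print(masking_threshold(1000, 85, 900))   # 60.0 dB SPL
print(masking_threshold(1000, 85, 1050))  # 80.0 dB SPL -> a 65 dB tone is masked
```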
B2. The Bark scale is the perceptual frequency scale used in psychoacoustic models. (a) Convert the following frequencies to the Bark scale using z ≈ 13 × arctan(0.00076f) + 3.5 × arctan((f/7500)²): 100 Hz, 1,000 Hz, 4,000 Hz, 10,000 Hz. (b) What do you notice about the spacing of these frequencies in the Bark domain compared to the linear Hz domain? (c) Why does the psychoacoustic model use the Bark scale rather than the linear Hz scale? What physical property of the cochlea does the Bark scale model? (d) A critical band at 1,000 Hz spans approximately 100 Hz. A critical band at 10,000 Hz spans approximately 1,500 Hz. How does this relate to the Bark scale values you calculated?
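A helper for part (a), implementing the classic Zwicker-style approximation (note the square applies to the arctangent's argument):

```python
import math

# B2(a): Hz-to-Bark conversion, z = 13·arctan(0.00076 f) + 3.5·arctan((f/7500)^2).
def hz_to_bark(f):
    return 13 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500) ** 2)

for f in (100, 1000, 4000, 10000):
    # roughly 1.0, 8.5, 17.3, and 22.4 Bark: equal Hz steps shrink in Bark
    print(f, round(hz_to_bark(f), 2))
```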
B3. Temporal masking describes how masking extends in time. (a) Explain forward temporal masking: what causes it, how long does it last, and how does a codec exploit it? (b) Explain backward masking: what causes it, how long does it last (shorter than forward masking — typically 5-20 ms), and why is it more controversial than forward masking? (c) A loud drum beat occurs at t = 0.1 s. A quiet 5,000 Hz tone occurs at t = 0.05 s (50 ms before the drum). Is the 5,000 Hz tone masked by backward masking? A quiet 5,000 Hz tone occurs at t = 0.15 s (50 ms after the drum). Is this tone masked by forward masking? (d) How do these temporal masking properties affect how a codec allocates bits to frames surrounding a loud transient?
B4. The absolute threshold of hearing varies with frequency. (a) At what frequency is the human ear most sensitive (lowest threshold)? What is the approximate threshold in dB SPL at this frequency? (b) At 50 Hz, the threshold of hearing is approximately 50 dB SPL. At 4,000 Hz, it is approximately 0-5 dB SPL. A recording contains a 50 Hz bass rumble at 45 dB SPL and a 4,000 Hz string tone at 3 dB SPL. Which is above its absolute threshold and thus audible? Which can the codec safely discard? (c) How does the absolute threshold of hearing change with age? How does this affect the design of a "universal" psychoacoustic model that must work for all listeners?
B5. Codecs rely on pre-masking (backward masking) to hide quantization noise near transients, but when the analysis frame far exceeds the short pre-masking window, the result is the pre-echo artifact. Trace through the following scenario carefully: a sharp snare drum attack at 90 dB SPL occurs at t = 0.050 s within an analysis frame spanning t = 0.035 s to t = 0.058 s (23 ms frame). The pre-attack portion of the frame (t = 0.035 s to t = 0.050 s) is nearly silent. (a) The psychoacoustic model analyzes the frame and estimates the masking threshold based on the average energy in the frame. Will this estimate be too high, too low, or accurate for the pre-attack silent period? (b) Based on this masking estimate, will the codec allocate too many bits or too few bits to the frame? (c) Where does the resulting quantization noise appear in the reconstructed audio? (d) In the original signal, would this quantization noise be audible during the silent pre-attack period? Why or why not? (e) What is this artifact called, and how does adaptive window switching reduce it?
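The mechanism in parts (a)-(c) can be demonstrated with a toy block-transform quantizer. An FFT stands in for the MDCT here, and a single coarse quantization step plays the role of a frame-averaged masking estimate:

```python
import numpy as np

# Toy demonstration of B5: quantizing a block transform of a frame containing a
# sharp attack spreads quantization noise over the WHOLE frame, including the
# silence before the attack.
fs = 44_100
frame = np.zeros(1024)                      # ~23 ms frame at 44.1 kHz
attack = 660                                # ~15 ms of silence, then the hit
t = np.arange(1024 - attack) / fs
frame[attack:] = 0.9 * np.sin(2 * np.pi * 1000 * t)

X = np.fft.rfft(frame)
step = np.abs(X).max() / 16                 # one coarse step for the whole frame
Xq = step * (np.round(X.real / step) + 1j * np.round(X.imag / step))
decoded = np.fft.irfft(Xq, 1024)

# RMS error in the formerly silent pre-attack region: no longer zero.
pre_noise = np.sqrt(np.mean((decoded[:attack] - frame[:attack]) ** 2))
print(pre_noise > 0)
```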
Part C: Codec Architecture
C1. The MP3 filterbank decomposes audio into 32 equal-bandwidth subbands before applying the MDCT. (a) If the audio bandwidth is 22,050 Hz (Nyquist for 44.1 kHz), what is the bandwidth of each of the 32 subbands? (b) Is uniform subband bandwidth consistent with the Bark scale (which has narrower critical bands at low frequencies and wider ones at high frequencies)? What does this mismatch imply for the accuracy of MP3's masking model at low frequencies? (c) The MDCT provides 18 frequency coefficients per subband in long-window mode. What is the total frequency resolution (total number of coefficients across all 32 subbands)? How does this compare to AAC's 1,024-coefficient long-window MDCT? (d) Why does higher frequency resolution generally improve codec quality?
C2. Compare MP3 and AAC codecs. (a) What specific improvements does AAC's psychoacoustic model have over MP3's? (b) The MDCT window size: MP3 uses 576 coefficients (long window), AAC uses 1,024. What is the frequency resolution of each in Hz, assuming 44.1 kHz sample rate and Nyquist = 22,050 Hz? (frequency resolution ≈ Nyquist / N, where N is the number of coefficients). (c) AAC includes Temporal Noise Shaping (TNS). What problem does this address that MP3's standard model does not handle well? (d) The practical claim is "AAC at 128 kbps ≈ MP3 at 192 kbps." Express this as a ratio: how many bits per sample of audio does each actually use? At 44,100 samples per second in stereo: bit rate in bits per second = bits per sample per channel × 44,100 × 2.
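A sketch of the arithmetic in parts (b) and (d):

```python
# C2(b): frequency resolution; C2(d): actual bits per stereo sample pair.
nyquist, fs = 22_050, 44_100

print(round(nyquist / 576, 1))    # MP3 long window: ~38.3 Hz per coefficient
print(round(nyquist / 1024, 1))   # AAC long window: ~21.5 Hz per coefficient

def bits_per_pair(kbps):
    return kbps * 1000 / fs

print(round(bits_per_pair(128), 2))  # AAC at 128 kbps: ~2.9 bits per stereo pair
print(round(bits_per_pair(192), 2))  # MP3 at 192 kbps: ~4.35 bits per stereo pair
```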
C3. The MDCT produces N coefficients from 2N input samples with 50% overlap. (a) Explain why the MDCT uses overlapping blocks rather than non-overlapping blocks. What problem does the overlap-add reconstruction solve? (b) The overlap-add structure means each sample contributes to two MDCT frames. Does this create redundancy? If so, why doesn't it prevent compression? (c) A 1,024-sample MDCT at 44,100 Hz: what is the duration of one analysis frame in milliseconds? What is the frequency resolution? How does this compare to the 23 ms / 86 Hz resolution at 44.1 kHz? (d) When adaptive window switching uses short windows (128 samples), the frequency resolution drops dramatically. What trade-off is being made, and for what type of audio content is this trade-off appropriate?
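The numbers for parts (c) and (d) can be computed directly, under the convention that an N-coefficient MDCT spans a 2N-sample window:

```python
# C3(c,d): frame duration and frequency resolution for long vs. short
# MDCT windows at 44.1 kHz (window length = 2N samples for N coefficients).
fs, nyquist = 44_100, 22_050

def mdct_stats(n_coeffs):
    window_samples = 2 * n_coeffs
    ms = round(1000 * window_samples / fs, 1)     # frame duration in ms
    hz = round(nyquist / n_coeffs, 1)             # Hz per coefficient
    return ms, hz

print(mdct_stats(1024))  # long window:  ~46.4 ms, ~21.5 Hz
print(mdct_stats(128))   # short window: ~5.8 ms, ~172.3 Hz
```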
C4. Aiko Tanaka's experiment (Section 33.7) found that 128 kbps MP3 encoding degraded the singer's formant cluster (2,800-3,200 Hz). Using your knowledge of the psychoacoustic masking model: (a) In the context of a full choral recording, what sources of masking are present near the singer's formant frequency range? Consider: other choir voices, orchestral mid-range instruments, the fundamental harmonics of the tenors and baritones at 200-600 Hz and their harmonics at 400-1,200 Hz, etc. (b) Why might the masking model classify the fine spectral peak of the singer's formant as "below threshold" in the context of this complex masking environment? (c) What would Aiko need to do to collect data unaffected by MP3 compression? What format would she require, and what does this mean practically for a researcher studying recorded choral music? (d) If Aiko's university archive contains only MP3 recordings of historical choral performances she needs to study, what analyses can she still validly perform, and which analyses are now invalid?
C5. Opus is a modern audio codec developed by the Xiph.Org Foundation. Research question (use your knowledge from the chapter): (a) Opus was designed specifically for internet streaming and real-time communication (VoIP, video calls). What latency constraint does this impose on the codec's analysis frame size? How does this compare to MP3's 23 ms frames? (b) Opus uses a hybrid SILK+CELT architecture. SILK is optimized for speech (linear prediction), CELT for music (MDCT-based). How might a single codec intelligently switch between these modes? What audio feature could trigger the switch? (c) Opus achieves "acceptable speech quality" at 6-16 kbps. At 6 kbps and 44,100 Hz sampling, how many bits are available per sample? What does this suggest about how much spectral information per sample can be preserved? (d) For music streaming, Opus is typically used at 96-128 kbps. Compare this to MP3 and AAC at the same bit rate: what advantages does Opus offer, based on its more modern design?
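Part (c) is one line of arithmetic. The exercise's 44,100 Hz figure is used here, although Opus natively operates in the 48 kHz sample rate family:

```python
# C5(c): bits available per sample at very low Opus speech bit rates.
fs = 44_100
for kbps in (6, 16):
    print(kbps, round(kbps * 1000 / fs, 3))  # ~0.136 and ~0.363 bits per sample
```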
Part D: Artifacts and Perceptual Quality
D1. Pre-echo is among the most characteristic and identifiable codec artifacts. (a) Describe precisely what pre-echo sounds like on a triangle strike or harpsichord note. What is the perceptual character of the artifact (tonal? noisy? brief? extended)? (b) Why does adaptive window switching reduce pre-echo? Trace through the mechanism. (c) Window switching from 1,024 samples to 128 samples at 44,100 Hz: what happens to the frequency resolution? How does this create its own type of distortion for sustained tonal content? (d) Design a test to determine whether you can personally hear pre-echo artifacts. What audio material would you choose, what equipment would you use, and how would you ensure you are not being influenced by expectation?
D2. Singer's formant analysis (extending Aiko's experiment): (a) The singer's formant is a spectral peak at 2,800-3,200 Hz. In the context of a masking model, what other spectral content in a choral recording is present near this frequency? Does this content create masking that would push the singer's formant below the masking threshold? (b) At what bit rate does MP3 begin to preserve the singer's formant reliably? You may need to reason from principles: the singer's formant is approximately 8-10 dB above surrounding spectral content. How large a masking shadow would be needed to suppress it? (c) If you were designing a psychoacoustic model specifically for choral voice recording, what modification would you make to the masking model to better preserve singer's formant features? (d) What does this exercise reveal about the assumption of "universal" psychoacoustic models?
D3. The Spotify Spectral Dataset analysis found that different genres suffer differently from MP3 compression (Section 33.8). (a) Why does heavily produced, dynamically compressed pop music suffer less from MP3 compression than classical music, despite being in the same 44.1 kHz/16-bit format? (b) What specific characteristics of acoustic jazz (live recording, wide dynamic range, natural room acoustics) make it more vulnerable to pre-echo artifacts than electronic music? (c) If a streaming service wanted to optimize bit rate allocation by genre — using lower bit rates for genres that suffer less from compression — what technical and ethical issues would this create? (d) Would it be technically feasible to design genre-specific psychoacoustic models that better preserve the genre-specific acoustic features most important to each genre? What would this require?
D4. Double-blind ABX testing is used to determine whether listeners can detect differences between audio formats. (a) Explain the ABX test procedure: what is X, what are A and B, and what does a "correct" response mean? (b) Why is "double-blind" important — why must the experimenter also not know which format is which? (c) In an ABX test comparing FLAC and 128 kbps MP3 of a classical recording, a listener scores 15 correct out of 20 trials. Is this statistically significant? What would chance performance look like? (Calculate the probability of getting 15 or more correct by chance from 20 binary trials using the binomial distribution.) (d) The same listener, tested on 128 kbps MP3 versus 320 kbps MP3, scores 11 correct out of 20. What does this suggest?
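Part (c) calls for an exact binomial tail probability; a small sketch:

```python
from math import comb

# D4(c,d): probability of scoring k or more correct by chance on n binary trials.
def p_at_least(k, n, p=0.5):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(round(p_at_least(15, 20), 4))  # ≈ 0.0207 -> significant at the 0.05 level
print(round(p_at_least(11, 20), 4))  # ≈ 0.4119 -> consistent with guessing
```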
D5. Lossless streaming represents the "return to physics" promised by the chapter. (a) Apple Music Lossless (ALAC 24-bit/48 kHz) requires on the order of 1,200 kbps on average for stereo streaming (the uncompressed rate is 2,304 kbps, and lossless compression roughly halves it). Compare this to the 128 kbps AAC also available on Apple Music. For a listener on a cellular data plan with 5 GB/month, how many hours of each format can they stream? (b) Apple Music Lossless uses 24-bit depth rather than 16-bit CD standard. As discussed in Chapter 32, is the additional dynamic range (above 96 dB) useful for any real-world listening scenario? Under what conditions might 24-bit provide a genuine benefit over 16-bit? (c) If lossless streaming is now economically and technically feasible, what remaining justification exists for lossy streaming? Consider: bandwidth in developing countries, device storage limits, battery life impact of higher-bandwidth streaming, battery life impact of more complex decoding, and legacy device compatibility.
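A sketch of part (a)'s data-plan arithmetic. The ~1,200 kbps lossless rate is an assumption (24-bit/48 kHz stereo is 2,304 kbps uncompressed, and lossless coding roughly halves that); GB is taken as 10^9 bytes:

```python
# D5(a): streaming hours available on a 5 GB/month data plan at each bit rate.
def hours(kbps, gb=5):
    return gb * 1e9 * 8 / (kbps * 1000) / 3600

print(round(hours(1200), 1))  # lossless ALAC estimate: ~9.3 hours
print(round(hours(128), 1))   # 128 kbps AAC: ~86.8 hours
```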
Part E: Synthesis, Ethics, and Historical Context
E1. Karlheinz Brandenburg and the Fraunhofer team tested the early MP3 codec with Suzanne Vega's Tom's Diner (an a cappella voice recording). (a) Why is an a cappella voice recording particularly good for testing audio codec quality? What specific acoustic features of voice are most revealing of codec artifacts? (b) Brandenburg reported that early versions of the codec made the voice sound "like Mickey Mouse." What specific distortion would cause this? (Hint: think about how frequency shifting artifacts affect vocal formants.) (c) The MP3 standard was finalized in 1993. What would have been the quality of the best 128 kbps MP3 encoder in 1993 compared to 2026? Have encoders improved over time for a fixed codec standard? How? (d) Suzanne Vega's Tom's Diner became known as "the mother of MP3." What does it mean for a piece of music to have this historical significance — to have shaped a technology that shaped the music industry?
E2. The psychoacoustic model encodes assumptions about who the listener is. (a) The masking model is based on measurements from "typical" young adult listeners. How might the model perform differently for (i) a 70-year-old with age-related high-frequency hearing loss, (ii) a professional orchestral musician with trained attention to timbre, (iii) a child with exceptionally broad hearing range, (iv) a person with tinnitus (ringing in the ears)? (b) Should codec designers tailor psychoacoustic models to individual listeners — using hearing test data to customize masking thresholds? What are the technical challenges, and what are the ethical implications of having "personalized compression"? (c) The GDPR and similar privacy regulations govern the collection of health-related data. Is an audiogram (hearing test result) health data? If so, what constraints would this place on a streaming service that wanted to offer personalized codec settings?
E3. The MP3 had major social consequences — enabling Napster, portable music players, and ultimately streaming. (a) Trace the causal chain from the psychoacoustic masking principle (a piece of auditory science) to the streaming music industry (a $25 billion annual market). What were the key technical and social steps? (b) Brandenburg has stated that he had no intention of enabling unauthorized music sharing when developing MP3. What responsibility, if any, do inventors bear for unintended uses of their inventions? (c) The RIAA's legal campaign against Napster ultimately failed to prevent the streaming transition — it succeeded only in delaying it and driving users to purchase iPods and buy music through iTunes instead. Was the RIAA's response to MP3-enabled file sharing effective? What alternative strategies might have been available? (d) The music industry's revenue declined from approximately $40 billion (1999) to $15 billion (2015) and has since partially recovered to approximately $30 billion (2025). To what extent is the MP3 responsible for this decline?
E4. Theme 4 synthesis (Technology as Mediator): Throughout Part VII, we have seen that recording technology mediates between musical sound and its representation in ways that shape what music sounds like and what music can be made. (a) Identify at least three specific ways in which the psychoacoustic model embedded in the MP3 algorithm constitutes a "theory of the listener" built into the technology. What assumptions about listeners does the model encode? (b) Who decided that these assumptions were valid — who designed the model, based on whose data, for whose listening conditions? (c) How does the MP3's "theory of the listener" compare to the theories embedded in earlier recording technologies — Edison's phonograph (what frequency range matters), magnetic tape (what distortion is acceptable), the CD standard (what frequency range and dynamic range humans can use)? (d) The chapter argues that Aiko's discovery is "what happens whenever the model of the user built into a technology encounters a user who falls outside the model's design parameters." Give two other examples from contemporary technology — in any domain, not just audio — where this pattern occurs.
E5. Write a 400-word analysis of the following claim: "Lossy audio compression is not a degradation of music — it is a curation of music." Consider: (a) In what sense is the codec's discarding of masked content a form of "curation" — selecting what is perceptually relevant and presenting only that? (b) How does this framing change if the curated content includes something the listener can hear (pre-echo, singer's formant degradation)? (c) Is the editorial authority embedded in the psychoacoustic model analogous to other forms of curation (record label A&R decisions, streaming algorithm recommendations, radio playlist selections)? (d) Should listeners be informed about what their codec is discarding — should there be a "codec transparency" requirement analogous to nutritional labeling? What form would this take?