Code Lab: Audio Compression Analysis

Lossy audio compression formats like MP3 and AAC achieve dramatic file-size reductions by discarding information that is perceptually irrelevant. The key insight is psychoacoustic masking: a loud sound at one frequency renders nearby quieter sounds inaudible. By modeling the ear's masking behavior, a codec can determine which spectral components to keep and which to discard without audible degradation. In this lab, you will implement a simplified psychoacoustic masking model, visualize which components survive compression, explore quantization noise, and compare original and compressed signal spectra.

Synthesizing a Test Signal

We start by building a composite signal that exercises the masking model: a strong tone with decaying harmonics plus two quiet tones, one close to the fundamental (likely masked) and one spectrally distant (likely audible).

import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import rfft, rfftfreq
from scipy.signal import windows

SAMPLE_RATE = 44100
DURATION = 0.1
N = int(SAMPLE_RATE * DURATION)
t = np.linspace(0, DURATION, N, endpoint=False)

# Strong fundamental at 1000 Hz with decaying harmonics
signal = np.zeros(N)
for harmonic, amp in enumerate([1.0, 0.5, 0.25, 0.12, 0.06], start=1):
    signal += amp * np.sin(2 * np.pi * 1000 * harmonic * t)

# Quiet tone at 1100 Hz -- close to the fundamental, likely masked
signal += 0.03 * np.sin(2 * np.pi * 1100 * t)

# Quiet tone at 4500 Hz -- far away, harder to mask
signal += 0.04 * np.sin(2 * np.pi * 4500 * t)

# Normalize
signal *= 0.9 / np.max(np.abs(signal))

Computing the Spectrum

We window a frame of the signal and compute its magnitude spectrum in decibels.

FRAME_SIZE = 2048
frame = signal[:FRAME_SIZE] * windows.hann(FRAME_SIZE)
freqs = rfftfreq(FRAME_SIZE, d=1.0 / SAMPLE_RATE)
# Amplitude spectrum; the Hann window's coherent gain (~0.5) is not
# compensated here, so absolute dB levels read roughly 6 dB low
magnitude = np.abs(rfft(frame)) / (FRAME_SIZE / 2)

# Convert to dB (referenced to full scale)
eps = 1e-12
spectrum_db = 20 * np.log10(magnitude + eps)

Simplified Psychoacoustic Masking Model

Real codecs use the Bark scale and detailed spreading functions. Here we implement an instructive simplification: each spectral peak spreads a masking skirt that falls off asymmetrically, steeper below the masker and shallower above it.

def absolute_threshold(f_hz):
    """Approximate threshold of hearing in quiet (Terhardt, 1979)."""
    f_khz = np.maximum(np.asarray(f_hz) / 1000.0, 0.02)
    return (3.64 * f_khz**-0.8
            - 6.5 * np.exp(-0.6 * (f_khz - 3.3)**2)
            + 1e-3 * f_khz**4)

def masking_spread(f_masker, level_db, f_probe):
    """
    Compute the masking threshold at f_probe due to a masker at f_masker.
    Uses asymmetric slopes on the Bark scale:
      - Lower side (f_probe < f_masker): falls at 27 dB/Bark
      - Upper side (f_probe > f_masker): falls at 10 dB/Bark
    """
    # One-term approximation of the Bark scale; Zwicker's full formula adds
    # a second arctan term that only becomes significant above a few kHz
    bark_m = 13 * np.arctan(0.76 * f_masker / 1000)
    bark_p = 13 * np.arctan(0.76 * f_probe / 1000)
    delta = bark_p - bark_m
    slope = np.where(delta >= 0, -10 * delta, 27 * delta)
    return level_db + slope

def compute_global_threshold(freqs, spectrum_db, peak_min_db=-30):
    """Combine absolute threshold with masking from all significant peaks."""
    threshold = absolute_threshold(freqs)
    # Find spectral peaks above the minimum level
    for i in range(1, len(spectrum_db) - 1):
        if (spectrum_db[i] > spectrum_db[i - 1] and
            spectrum_db[i] > spectrum_db[i + 1] and
            spectrum_db[i] > peak_min_db):
            mask_curve = masking_spread(freqs[i], spectrum_db[i], freqs)
            threshold = np.maximum(threshold, mask_curve)
    return threshold

threshold_db = compute_global_threshold(freqs, spectrum_db)

Visualizing Masking: What the Codec Keeps and Discards

plt.figure(figsize=(11, 5))
plt.plot(freqs, spectrum_db, color='steelblue', linewidth=1.2,
         label='Signal spectrum')
plt.plot(freqs, threshold_db, color='firebrick', linewidth=1.5,
         linestyle='--', label='Masking threshold')
plt.fill_between(freqs, threshold_db, spectrum_db,
                 where=(spectrum_db < threshold_db),
                 alpha=0.3, color='orange', label='Below threshold (discarded)')
plt.fill_between(freqs, threshold_db, spectrum_db,
                 where=(spectrum_db >= threshold_db),
                 alpha=0.15, color='green', label='Above threshold (kept)')
plt.xlim(0, 10000)
plt.ylim(-60, 10)
plt.xlabel('Frequency (Hz)')
plt.ylabel('Level (dB)')
plt.title('Psychoacoustic Masking: Original Spectrum vs. Masking Threshold')
plt.legend(fontsize=8)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Notice how the quiet 1100 Hz tone falls below the masking threshold created by the strong 1000 Hz fundamental -- the codec would discard it because your ear cannot hear it anyway. The 4500 Hz tone, being spectrally distant from the masker, sits above the threshold and is preserved.
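We can confirm this reading with a back-of-envelope check that needs none of the FFT machinery. The sketch below (standard library only; the helper names are ad hoc, not part of the lab code) reuses the lab's Bark constant and asymmetric slopes to compute the threshold each quiet tone faces, taking the 1000 Hz fundamental (amplitude 1.0) as the 0 dB reference:

```python
import math

def bark(f_hz):
    # Same one-term Bark approximation used in masking_spread
    return 13 * math.atan(0.76 * f_hz / 1000)

def threshold_from_masker(f_masker, level_db, f_probe):
    # Asymmetric spreading: -10 dB/Bark above the masker, -27 dB/Bark below
    delta = bark(f_probe) - bark(f_masker)
    slope = -10 * delta if delta >= 0 else 27 * delta
    return level_db + slope

# Tone levels relative to the 1000 Hz fundamental (amplitude 1.0 -> 0 dB)
tone_1100_db = 20 * math.log10(0.03)  # about -30.5 dB
tone_4500_db = 20 * math.log10(0.04)  # about -28.0 dB

thr_1100 = threshold_from_masker(1000, 0.0, 1100)  # about -6 dB
thr_4500 = threshold_from_masker(1000, 0.0, 4500)  # about -83 dB

print(f"1100 Hz: tone {tone_1100_db:.1f} dB vs threshold {thr_1100:.1f} dB")
print(f"4500 Hz: tone {tone_4500_db:.1f} dB vs threshold {thr_4500:.1f} dB")
```

The 1100 Hz tone sits roughly 24 dB below its masking threshold (discarded), while the 4500 Hz tone sits some 55 dB above its threshold (kept), matching the shaded regions in the plot.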

Quantization Noise and Bit-Rate Effects

Audio codecs allocate bits to each frequency band based on the masking threshold. Bands where the masking threshold is high receive fewer bits (more quantization noise is tolerable), while bands where the ear is sensitive receive more bits. We can simulate this effect directly.

def quantize_spectrum(magnitude, bits_per_band):
    """Quantize each spectral bin with a given number of bits."""
    quantized = np.zeros_like(magnitude)
    full_scale = np.max(magnitude)  # fixed quantizer range shared by all bins
    for i, (mag, bits) in enumerate(zip(magnitude, bits_per_band)):
        if bits <= 0:
            quantized[i] = 0.0  # band is discarded
        else:
            levels = 2 ** bits
            step = (2 * full_scale) / levels
            quantized[i] = np.round(mag / step) * step
    return quantized

# Allocate bits: roughly one extra bit per 6 dB of signal-over-threshold
# margin, since each bit of resolution lowers quantization noise by ~6 dB
margin = spectrum_db - threshold_db
bits_alloc = np.clip((margin / 6).astype(int) + 2, 0, 16)
bits_alloc[spectrum_db < threshold_db] = 0  # masked bands get zero bits

# Compare at different total bit budgets (simulating bit-rate settings)
fig, axes = plt.subplots(1, 3, figsize=(14, 4), sharey=True)
for ax, scale, label in zip(axes, [1.0, 0.5, 0.25],
                             ['High (320 kbps)', 'Medium (128 kbps)', 'Low (64 kbps)']):
    scaled_bits = np.clip((bits_alloc * scale).astype(int), 0, 16)
    q_mag = quantize_spectrum(magnitude, scaled_bits)
    q_db = 20 * np.log10(q_mag + eps)
    noise_db = 20 * np.log10(np.abs(magnitude - q_mag) + eps)

    ax.plot(freqs, spectrum_db, 'steelblue', linewidth=1.0, label='Original')
    ax.plot(freqs, q_db, 'firebrick', linewidth=0.8, linestyle='--', label='Quantized')
    ax.fill_between(freqs, -80, noise_db, alpha=0.2, color='red', label='Quant. noise')
    ax.set_xlim(0, 10000)
    ax.set_ylim(-80, 10)
    ax.set_xlabel('Frequency (Hz)')
    ax.set_title(f'Bit Rate: {label}', fontsize=10)
    ax.legend(fontsize=7)
    ax.grid(True, alpha=0.3)

axes[0].set_ylabel('Level (dB)')
fig.suptitle('Quantization Noise at Different Bit Rates', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

At 320 kbps the quantized spectrum closely follows the original, and the noise floor stays well below the masking threshold. At 64 kbps, aggressive quantization raises the noise floor into audible territory, producing the "underwater" or "swirly" artifacts characteristic of low-bitrate MP3 files.
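The "6 dB per bit" rule behind the allocation can be checked directly. This standalone sketch (standard library only, independent of the lab's variables) uniformly quantizes a full-scale sine at several bit depths and measures the resulting signal-to-noise ratio; theory predicts roughly 6.02b + 1.76 dB for b bits:

```python
import math

def sine_quant_snr_db(bits, n=8192):
    """Quantize a full-scale sine with `bits` bits; return measured SNR in dB."""
    step = 2.0 / (2 ** bits)  # uniform quantizer spanning [-1, 1]
    sig_pow = noise_pow = 0.0
    for i in range(n):
        x = math.sin(2 * math.pi * 7 * i / n)  # 7 full cycles of a test sine
        q = round(x / step) * step             # mid-tread quantization
        sig_pow += x * x
        noise_pow += (x - q) ** 2
    return 10 * math.log10(sig_pow / noise_pow)

for b in (4, 8, 12):
    print(f"{b:2d} bits: SNR ~ {sine_quant_snr_db(b):.1f} dB")
```

Each additional bit lowers the noise floor by about 6 dB, which is why the allocation step grants one bit per 6 dB of margin, and why halving the bit budget raises the quantization noise floor so visibly in the plots above.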

Try It Yourself

  1. Masking asymmetry. Add a quiet test tone at 900 Hz (just below the 1000 Hz masker) and another at 1200 Hz (just above). Which one is more effectively masked? Relate your observation to the asymmetric slopes in the spreading function. Why is upward masking (masker below, probe above) stronger than downward masking?

  2. Temporal masking. The current model only considers simultaneous (frequency-domain) masking. Research pre-masking and post-masking. Modify the signal to include a loud click at t = 0.05 s followed by a quiet tone. How would a codec exploit temporal masking to reduce the bit rate further around the transient?

  3. ABX listening test design. Using the quantized spectra at different bit rates, describe how you would set up a double-blind ABX test to determine the lowest bit rate at which compression artifacts become audible. What signal characteristics (transients, sustained tones, noise) would make the best test material, and why?
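As a starting point for the first exercise, you can predict the two thresholds before touching the signal. This sketch uses the same Bark constant and slopes as the lab's masking model (helper names are ad hoc), with the 1000 Hz masker again taken as 0 dB:

```python
import math

def bark(f_hz):
    # One-term Bark approximation, as in the lab's masking model
    return 13 * math.atan(0.76 * f_hz / 1000)

def threshold_from_masker(f_masker, level_db, f_probe):
    # -10 dB/Bark above the masker, -27 dB/Bark below
    delta = bark(f_probe) - bark(f_masker)
    return level_db + (-10 * delta if delta >= 0 else 27 * delta)

# Probes 100 Hz below and 200 Hz above a 0 dB masker at 1000 Hz
print(f" 900 Hz threshold: {threshold_from_masker(1000, 0.0, 900):.1f} dB")
print(f"1200 Hz threshold: {threshold_from_masker(1000, 0.0, 1200):.1f} dB")
```

Despite lying twice as far from the masker in hertz, the 1200 Hz probe faces a higher threshold (about -12 dB) than the 900 Hz probe (about -18 dB), so an equal-level tone is masked more easily above the masker. Add the actual test tones to `signal` and re-run the masking plot to confirm.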