In This Chapter
- 29.1 The Physics and Representation of Sound
- 29.2 Automatic Speech Recognition: From HMMs to Transformers
- 29.3 Whisper: Robust Speech Recognition at Scale
- 29.4 Text-to-Speech (TTS) Systems
- 29.5 Speaker Recognition and Verification
- 29.6 Audio Classification
- 29.7 Audio Transformers and Self-Supervised Learning
- 29.8 Music AI
- 29.9 Advanced Topics
- 29.10 Practical Considerations
- 29.11 Summary
- 29.12 Exercises
Chapter 29: Speech, Audio, and Music AI
"The ear is the avenue to the heart." — Voltaire
Sound is one of the most fundamental channels through which humans perceive and interact with the world. From the nuances of spoken language to the emotional power of music, audio carries rich information that machines have only recently begun to understand at near-human levels. This chapter explores the rapidly evolving field of audio AI — spanning automatic speech recognition (ASR), text-to-speech synthesis (TTS), speaker identification, audio event classification, and music generation — and equips you with the theoretical foundations and practical skills to build systems that listen, speak, and create.
We will trace the evolution from handcrafted signal-processing pipelines to modern transformer-based architectures like Whisper, examining how deep learning has unified disparate audio tasks under a common framework. Along the way, you will implement working systems using PyTorch and Hugging Face, gaining hands-on experience with the tools that power today's production audio AI.
The techniques covered here connect to the broader themes of this book in important ways. The mel spectrogram computation that opens the chapter is a domain-specific feature engineering step analogous to the tokenization we saw in Chapter 3 — both transform raw inputs into representations suitable for neural networks. The CTC loss function shares conceptual DNA with the sequence-to-sequence objectives of Chapter 8. And the contrastive audio-text learning of CLAP mirrors the CLIP framework we explored in Chapter 28, demonstrating how the same architectural patterns transfer across modalities.
29.1 The Physics and Representation of Sound
Before we can build models that understand audio, we need to understand how sound is captured, digitized, and represented in formats suitable for neural networks.
29.1.1 Sound as a Physical Phenomenon
Sound is a longitudinal pressure wave that propagates through a medium (typically air). The key physical properties are:
- Frequency (measured in Hertz, Hz): The number of oscillation cycles per second. Human hearing spans roughly 20 Hz to 20,000 Hz.
- Amplitude: The magnitude of pressure variation, perceived as loudness.
- Phase: The position within the oscillation cycle at a given point in time.
When a microphone captures sound, it converts these pressure variations into an electrical signal, which is then digitized through sampling and quantization.
29.1.2 Digital Audio Fundamentals
Digital audio is created by sampling the continuous analog signal at discrete time intervals. Two parameters define this process:
- Sample rate ($f_s$): The number of samples per second. Common rates include 16,000 Hz (telephony and most speech models), 22,050 Hz (half the CD rate, common in music analysis), 44,100 Hz (CD quality), and 48,000 Hz (professional audio).
- Bit depth: The number of bits per sample, determining amplitude resolution. CD audio uses 16 bits (65,536 levels), while professional audio often uses 24 bits.
The Nyquist-Shannon sampling theorem states that to faithfully represent a signal, the sample rate must be at least twice the highest frequency present:
$$f_s \geq 2 f_{\max}$$
A digital audio signal can be represented as a one-dimensional tensor $\mathbf{x} \in \mathbb{R}^{T}$, where $T = f_s \times d$ is the total number of samples for a recording of duration $d$ seconds.
import torch
import torchaudio
torch.manual_seed(42)
# Load an audio file
waveform, sample_rate = torchaudio.load("speech.wav")
print(f"Waveform shape: {waveform.shape}") # [channels, samples]
print(f"Sample rate: {sample_rate} Hz")
print(f"Duration: {waveform.shape[1] / sample_rate:.2f} seconds")
29.1.3 Time-Domain Representations: Waveforms
The raw waveform is the most basic representation of audio. It preserves all information in the signal but presents challenges for learning:
- High dimensionality: A 10-second clip at 16 kHz contains 160,000 samples.
- Local structure: Relevant patterns span vastly different time scales (phonemes ~30 ms, words ~500 ms, sentences ~5 s).
- Lack of explicit frequency information: The waveform encodes frequency implicitly through patterns of oscillation.
Despite these challenges, some modern architectures (such as WaveNet and wav2vec) operate directly on raw waveforms, learning feature extractors end-to-end. However, frequency-domain representations remain the dominant paradigm for most tasks.
29.1.4 Frequency-Domain Representations: Spectrograms
The Short-Time Fourier Transform (STFT) converts a waveform into a time-frequency representation by applying the Discrete Fourier Transform (DFT) to overlapping windows of the signal:
$$\text{STFT}\{x[n]\}(m, k) = \sum_{n=0}^{N-1} x[n + mH] \cdot w[n] \cdot e^{-j2\pi kn / N}$$
where: - $w[n]$ is a window function (typically Hann or Hamming) of length $N$ - $H$ is the hop length (stride between successive windows) - $m$ is the time frame index - $k$ is the frequency bin index
The magnitude of the STFT gives us the spectrogram:
$$S(m, k) = |\text{STFT}\{x[n]\}(m, k)|$$
In practice, we often work with the power spectrogram $|S|^2$ or the log spectrogram $\log |S|$.
import torch
import torchaudio
import torchaudio.transforms as T
torch.manual_seed(42)
def compute_spectrogram(
waveform: torch.Tensor,
n_fft: int = 1024,
hop_length: int = 512,
power: float = 2.0,
) -> torch.Tensor:
"""Compute a power spectrogram from a waveform.
Args:
waveform: Input audio tensor of shape [channels, samples].
n_fft: Size of the FFT window.
hop_length: Number of samples between successive frames.
power: Exponent for the magnitude spectrogram.
Returns:
Spectrogram tensor of shape [channels, freq_bins, time_frames].
"""
spectrogram_transform = T.Spectrogram(
n_fft=n_fft,
hop_length=hop_length,
power=power,
)
return spectrogram_transform(waveform)
# Example usage
waveform, sample_rate = torchaudio.load("speech.wav")
spec = compute_spectrogram(waveform)
print(f"Spectrogram shape: {spec.shape}")
# [1, 513, T] where 513 = n_fft//2 + 1
29.1.5 Mel Spectrograms
Human auditory perception is non-linear in frequency: we are far more sensitive to differences between 200 Hz and 400 Hz than between 8,000 Hz and 8,200 Hz. The mel scale approximates this perceptual nonlinearity:
$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
where: - $m$ is the frequency in mels - $f$ is the frequency in Hertz - The constants 2595 and 700 are chosen to fit human perceptual data
Intuition: The mel scale is approximately linear below 1,000 Hz and approximately logarithmic above. This reflects the physiology of the cochlea: hair cells at the base respond to high frequencies and are logarithmically spaced, while those at the apex respond to low frequencies and are more linearly spaced.
Worked Example: Converting common frequencies to mel: - 200 Hz $\to$ $2595 \log_{10}(1 + 200/700) = 2595 \times 0.109 = 283$ mel - 400 Hz $\to$ $2595 \log_{10}(1 + 400/700) = 2595 \times 0.196 = 509$ mel (difference: 226 mel) - 8000 Hz $\to$ $2595 \log_{10}(1 + 8000/700) = 2595 \times 1.094 = 2840$ mel - 8200 Hz $\to$ $2595 \log_{10}(1 + 8200/700) = 2595 \times 1.104 = 2866$ mel (difference: 26 mel)
We see that a 200 Hz difference at low frequencies (200-400 Hz) spans roughly 226 mel, while the same absolute difference at high frequencies (8000-8200 Hz) spans only about 26 mel. This confirms that the mel scale allocates more resolution to lower frequencies, matching human perception.
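These conversions are easy to verify numerically. The short helper below is a hand-rolled illustration (torchaudio uses its own mel-scale utilities internally) that reproduces the worked example:
import torch
def hz_to_mel(freq_hz: torch.Tensor) -> torch.Tensor:
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * torch.log10(1.0 + freq_hz / 700.0)
freqs = torch.tensor([200.0, 400.0, 8000.0, 8200.0])
mels = hz_to_mel(freqs)
for f, m in zip(freqs.tolist(), mels.tolist()):
    print(f"{f:8.0f} Hz -> {m:7.1f} mel")
# The 200-400 Hz gap spans ~226 mel; the 8000-8200 Hz gap spans only ~26 mel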
A mel spectrogram applies a bank of triangular filters (the mel filterbank) to the power spectrogram, compressing the frequency axis to better match human perception. The mel filterbank consists of $M$ overlapping triangular filters, linearly spaced in the mel domain. Each filter has: - A center frequency (in mel) at position $c_i$ - A bandwidth spanning from $c_{i-1}$ to $c_{i+1}$ - Unity gain at the center, linearly tapering to zero at the edges
With $M$ mel filters, the mel spectrogram $\mathbf{S}_{\text{mel}} \in \mathbb{R}^{M \times T'}$ is:
$$S_{\text{mel}}(i, m) = \sum_{k} H_i(k) \cdot |S(m, k)|^2$$
where $H_i(k)$ is the $i$-th mel filter response at frequency bin $k$.
Typically, we also take the logarithm, producing the log-mel spectrogram — the most widely used input representation for audio neural networks:
$$S_{\log\text{-mel}}(i, m) = \log(S_{\text{mel}}(i, m) + \epsilon)$$
where $\epsilon$ is a small constant (typically $10^{-9}$ or $10^{-10}$) for numerical stability.
The logarithm is important for two reasons: (1) it compresses the dynamic range, making the representation more suitable for neural networks that work best with bounded inputs, and (2) it matches the logarithmic perception of loudness in the human auditory system (a sound must double in power to be perceived as noticeably louder).
Common parameter choices: The choice of STFT and mel parameters depends on the task:
| Parameter | Speech (16 kHz) | Music (22 kHz) | Whisper |
|---|---|---|---|
| Sample rate | 16,000 Hz | 22,050 Hz | 16,000 Hz |
| FFT size ($N$) | 512 or 1024 | 2048 | 400 |
| Hop length ($H$) | 160 or 256 | 512 | 160 |
| Window | Hann | Hann | Hann |
| Mel bins ($M$) | 80 | 128 | 80 |
| Freq range | 0-8000 Hz | 0-11025 Hz | 0-8000 Hz |
Whisper uses notably small FFT and hop sizes (400 and 160 samples, corresponding to 25 ms windows with 10 ms stride), producing a fine-grained temporal representation with 100 frames per second.
import torch
import torchaudio
import torchaudio.transforms as T
torch.manual_seed(42)
def compute_log_mel_spectrogram(
waveform: torch.Tensor,
sample_rate: int = 16000,
n_fft: int = 1024,
hop_length: int = 512,
n_mels: int = 80,
) -> torch.Tensor:
"""Compute a log-mel spectrogram from a waveform.
Args:
waveform: Input audio tensor of shape [channels, samples].
sample_rate: Audio sample rate in Hz.
n_fft: Size of the FFT window.
hop_length: Number of samples between successive frames.
n_mels: Number of mel filter banks.
Returns:
Log-mel spectrogram of shape [channels, n_mels, time_frames].
"""
mel_spectrogram = T.MelSpectrogram(
sample_rate=sample_rate,
n_fft=n_fft,
hop_length=hop_length,
n_mels=n_mels,
)
mel_spec = mel_spectrogram(waveform)
log_mel_spec = torch.log(mel_spec + 1e-9)
return log_mel_spec
waveform, sample_rate = torchaudio.load("speech.wav")
log_mel = compute_log_mel_spectrogram(waveform, sample_rate)
print(f"Log-mel spectrogram shape: {log_mel.shape}")
# [1, 80, T'] where T' depends on duration and hop_length
29.1.6 Mel-Frequency Cepstral Coefficients (MFCCs)
MFCCs take the log-mel spectrogram one step further by applying the Discrete Cosine Transform (DCT) to decorrelate the mel filter energies:
$$c_j = \sum_{i=1}^{M} S_{\log\text{-mel}}(i) \cos\left[\frac{\pi j}{M}\left(i - \frac{1}{2}\right)\right]$$
Typically, only the first 13-20 coefficients are retained, often augmented with their first and second derivatives (delta and delta-delta coefficients). MFCCs were the standard feature representation in classical speech recognition systems (GMM-HMM models) and remain useful for some audio classification tasks, though they have been largely supplanted by log-mel spectrograms in deep learning pipelines.
import torch
import torchaudio
import torchaudio.transforms as T
torch.manual_seed(42)
def compute_mfcc(
waveform: torch.Tensor,
sample_rate: int = 16000,
n_mfcc: int = 13,
n_mels: int = 40,
n_fft: int = 1024,
hop_length: int = 512,
) -> torch.Tensor:
"""Compute MFCCs from a waveform.
Args:
waveform: Input audio tensor of shape [channels, samples].
sample_rate: Audio sample rate in Hz.
n_mfcc: Number of MFCC coefficients to return.
n_mels: Number of mel filter banks.
n_fft: Size of the FFT window.
hop_length: Number of samples between successive frames.
Returns:
MFCC tensor of shape [channels, n_mfcc, time_frames].
"""
mfcc_transform = T.MFCC(
sample_rate=sample_rate,
n_mfcc=n_mfcc,
melkwargs={
"n_fft": n_fft,
"hop_length": hop_length,
"n_mels": n_mels,
},
)
return mfcc_transform(waveform)
waveform, sample_rate = torchaudio.load("speech.wav")
mfccs = compute_mfcc(waveform, sample_rate)
print(f"MFCC shape: {mfccs.shape}") # [1, 13, T']
29.2 Automatic Speech Recognition: From HMMs to Transformers
Automatic Speech Recognition (ASR) — the task of converting spoken language into text — has undergone a dramatic transformation over the past decade. Let us trace this evolution.
29.2.1 The Classical Pipeline: GMM-HMM Systems
For decades, the dominant ASR architecture combined Gaussian Mixture Models (GMMs) for acoustic modeling with Hidden Markov Models (HMMs) for temporal modeling. The pipeline included:
- Feature extraction: MFCCs from the audio signal.
- Acoustic model: GMMs modeling the probability $P(\mathbf{x}_t | s)$ of observing features $\mathbf{x}_t$ given a hidden state $s$.
- Language model: N-gram models providing $P(w_1, w_2, \ldots, w_n)$.
- Pronunciation dictionary: Mapping words to phoneme sequences.
- Decoder: Viterbi or beam search combining all components.
This required extensive expert engineering: hand-designed features, phoneme inventories, pronunciation dictionaries, and decision trees for context-dependent modeling. While effective, the system was brittle and required substantial language-specific expertise.
29.2.2 The Deep Learning Revolution in ASR
The transition to deep learning in ASR occurred in several waves:
Wave 1: DNN-HMM Hybrids (2011-2014). Deep neural networks replaced GMMs as acoustic models while retaining the HMM framework. This alone reduced word error rates (WER) by 20-30% relative.
Wave 2: Sequence-to-Sequence Models (2014-2018). End-to-end models eliminated the need for explicit phoneme inventories and pronunciation dictionaries: - CTC-based models (Graves et al., 2006): Models like DeepSpeech used Connectionist Temporal Classification to directly map audio to character sequences. - Attention-based encoder-decoder models (Chan et al., 2016): Listen, Attend and Spell (LAS) applied the sequence-to-sequence framework from machine translation to ASR. - RNN-Transducer (Graves, 2012): Combined the benefits of CTC and attention for streaming ASR.
Wave 3: Self-Supervised and Foundation Models (2019-present). Large-scale pre-training transformed ASR: - wav2vec 2.0 (Baevski et al., 2020): Self-supervised learning on unlabeled audio, then fine-tuned with CTC. - HuBERT (Hsu et al., 2021): Hidden-unit BERT applied to speech. - Whisper (Radford et al., 2022): Weakly supervised training on 680,000 hours of labeled data, achieving remarkable robustness.
29.2.3 Connectionist Temporal Classification (CTC)
CTC is a critical loss function that enables training ASR models without requiring frame-level alignments between audio and text. This deserves careful attention, as the same alignment challenge arises in other sequence-to-sequence tasks with unknown alignment.
The alignment problem: Given an audio sequence of $T$ frames and a text sequence of $U$ characters ($U \ll T$), we need to map frames to characters without knowing which frames correspond to which characters. For example, a 3-second audio clip at 100 frames/second produces $T = 300$ frames, while the spoken word "hello" has only $U = 5$ characters. Which of the 300 frames correspond to the "h"? Which to the "e"? The answer varies between speakers, speaking rates, and even repetitions by the same speaker.
Before CTC, this alignment had to be provided explicitly (through forced alignment using HMM-based systems), creating a chicken-and-egg problem: you needed an alignment to train the model, but you needed a trained model to produce an alignment.
CTC formulation: CTC introduces a special blank token ($\varnothing$) to the vocabulary. The model outputs a probability distribution over the extended vocabulary $\mathcal{V} \cup \{\varnothing\}$ at each time step:
$$P(\pi | \mathbf{x}) = \prod_{t=1}^{T} P(\pi_t | \mathbf{x})$$
where $\pi$ is a CTC path of length $T$. A many-to-one mapping function $\mathcal{B}$ collapses paths to label sequences by (1) removing consecutive duplicates and (2) removing blanks. For example:
$$\mathcal{B}(\text{-h-ee-l-l-oo-}) = \text{hello}$$ $$\mathcal{B}(\text{hh-eee-ll-lo-}) = \text{hello}$$
The CTC loss marginalizes over all valid alignments:
$$P(\mathbf{y} | \mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} P(\pi | \mathbf{x})$$
$$\mathcal{L}_{\text{CTC}} = -\log P(\mathbf{y} | \mathbf{x})$$
This sum can be computed efficiently using dynamic programming (the forward-backward algorithm), similar to the forward algorithm in HMMs. The forward-backward algorithm computes probabilities for all valid alignments simultaneously by maintaining a 2D table of size $T \times (2U + 1)$ (the extra factor accounts for possible blanks between each character). The time complexity is $O(T \cdot U)$, which is tractable even for long sequences.
Worked Example — CTC alignment: For the target "hi" with $T = 5$ frames and vocabulary $\{\text{blank}, h, i\}$, valid CTC paths include: - $(\text{blank}, h, h, i, i)$ $\to$ $\mathcal{B} = \text{hi}$ - $(h, h, \text{blank}, i, i)$ $\to$ $\mathcal{B} = \text{hi}$ - $(h, \text{blank}, \text{blank}, i, \text{blank})$ $\to$ $\mathcal{B} = \text{hi}$ - $(h, i, i, i, i)$ $\to$ $\mathcal{B} = \text{hi}$
The CTC loss sums over all such valid paths, automatically marginalizing out the unknown alignment. This is what makes CTC so powerful: the model learns to produce spikes of probability for each character at roughly the right time, with blanks filling the gaps, without any explicit alignment supervision.
CTC assumptions and limitations: - Conditional independence: CTC assumes that the outputs at each time step are conditionally independent given the input. This means CTC models cannot learn language model-like dependencies between output characters. - Monotonic alignment: CTC assumes that the output sequence appears in the same order as the input. This is appropriate for ASR (speech is temporally ordered) but not for tasks like machine translation where word order changes between languages. - Peaky behavior: CTC models tend to produce very "peaky" output distributions — most frames are predicted as blank, with brief spikes for each character. This can make decoding noisy.
import torch
import torch.nn as nn
torch.manual_seed(42)
class CTCASRModel(nn.Module):
"""A simple CTC-based ASR model using bidirectional LSTMs.
Args:
input_dim: Dimension of input features (e.g., 80 for mel bins).
hidden_dim: Hidden dimension of the LSTM layers.
num_layers: Number of LSTM layers.
vocab_size: Size of the output vocabulary (including blank).
"""
def __init__(
self,
input_dim: int = 80,
hidden_dim: int = 256,
num_layers: int = 3,
vocab_size: int = 29,
) -> None:
super().__init__()
self.lstm = nn.LSTM(
input_size=input_dim,
hidden_size=hidden_dim,
num_layers=num_layers,
batch_first=True,
bidirectional=True,
dropout=0.1,
)
self.fc = nn.Linear(hidden_dim * 2, vocab_size)
self.log_softmax = nn.LogSoftmax(dim=-1)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass through the CTC ASR model.
Args:
x: Input tensor of shape [batch, time, features].
Returns:
Log probabilities of shape [batch, time, vocab_size].
"""
output, _ = self.lstm(x)
logits = self.fc(output)
return self.log_softmax(logits)
# Example: Computing CTC loss
model = CTCASRModel(input_dim=80, vocab_size=29)
batch_size = 4
seq_len = 100
features = torch.randn(batch_size, seq_len, 80)
log_probs = model(features) # [4, 100, 29]
log_probs = log_probs.permute(1, 0, 2) # [T, N, C] for CTC loss
# Target sequences (variable length)
targets = torch.randint(1, 29, (4, 15)) # 4 targets of length 15
input_lengths = torch.full((batch_size,), seq_len, dtype=torch.long)
target_lengths = torch.full((batch_size,), 15, dtype=torch.long)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(f"CTC Loss: {loss.item():.4f}")
29.2.4 CTC Decoding Strategies
At inference time, we need to decode the CTC output into a text sequence. Common strategies include:
- Greedy decoding: Select the most probable token at each time step, then apply the collapsing function $\mathcal{B}$. Fast but suboptimal (a minimal sketch follows this list).
- Beam search decoding: Maintain multiple hypotheses, potentially integrating an external language model:
$$\text{score}(\mathbf{y}) = \log P_{\text{CTC}}(\mathbf{y} | \mathbf{x}) + \alpha \log P_{\text{LM}}(\mathbf{y}) + \beta |\mathbf{y}|$$
where $\alpha$ controls language model weight and $\beta$ is a word insertion bonus.
- Prefix beam search: An efficient variant that merges hypotheses sharing the same prefix.
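A minimal sketch of greedy decoding, including the collapse function $\mathcal{B}$, is shown below; random log-probabilities stand in for the output of the CTCASRModel defined earlier:
import torch
torch.manual_seed(42)
def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list[int]:
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.
    Args:
        log_probs: Log probabilities of shape [time, vocab_size] for one utterance.
        blank: Index of the CTC blank token.
    Returns:
        Decoded token indices after applying the collapse function B.
    """
    best_path = log_probs.argmax(dim=-1).tolist()  # most probable token per frame
    decoded: list[int] = []
    prev = None
    for token in best_path:
        if token != prev and token != blank:  # collapse repeats, then remove blanks
            decoded.append(token)
        prev = token
    return decoded
# Random log-probabilities standing in for model output of shape [time, vocab_size]
log_probs = torch.randn(100, 29).log_softmax(dim=-1)
print(f"Decoded token ids: {ctc_greedy_decode(log_probs)}")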
29.3 Whisper: Robust Speech Recognition at Scale
OpenAI's Whisper (Radford et al., 2022) represents a paradigm shift in ASR. Rather than training on carefully curated datasets, Whisper was trained on 680,000 hours of weakly supervised data collected from the internet, achieving remarkable zero-shot generalization.
29.3.1 Architecture
Whisper uses a standard encoder-decoder Transformer architecture:
Encoder: - Input: 80-channel log-mel spectrogram computed from 30-second audio segments (padded or trimmed) - Two 1D convolution layers with GELU activation for initial feature processing - Sinusoidal positional encoding - Standard Transformer encoder blocks with pre-layer normalization
Decoder: - Autoregressive text decoder with learned positional embeddings - Cross-attention to encoder outputs - Predicts tokens from a byte-level BPE vocabulary
The model comes in several sizes:
| Model | Parameters | Encoder Layers | Decoder Layers | Hidden Dim |
|---|---|---|---|---|
| tiny | 39M | 4 | 4 | 384 |
| base | 74M | 6 | 6 | 512 |
| small | 244M | 12 | 12 | 768 |
| medium | 769M | 24 | 24 | 1024 |
| large | 1.55B | 32 | 32 | 1280 |
29.3.2 Multitask Training Format
A key innovation of Whisper is its use of a multitask training format through special tokens. The decoder predicts a sequence of tokens that encode the task:
<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|> Hello world <|endoftext|>
This format enables:
- Language identification: Predicting the <|lang|> token
- Voice activity detection: Through <|nospeech|> token
- Transcription vs. translation: <|transcribe|> vs. <|translate|>
- Timestamp prediction: Optional word or segment-level timestamps
29.3.3 Using Whisper with Hugging Face
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
torch.manual_seed(42)
def transcribe_audio(
audio_path: str,
model_name: str = "openai/whisper-base",
language: str = "english",
task: str = "transcribe",
) -> str:
"""Transcribe an audio file using Whisper.
Args:
audio_path: Path to the audio file.
model_name: Hugging Face model identifier for Whisper.
language: Language of the audio.
task: Either 'transcribe' or 'translate'.
Returns:
Transcribed text string.
"""
# Load model and processor
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
# Load and preprocess audio
waveform, sample_rate = torchaudio.load(audio_path)
# Resample to 16kHz if necessary
if sample_rate != 16000:
resampler = torchaudio.transforms.Resample(sample_rate, 16000)
waveform = resampler(waveform)
# Convert to mono if stereo
if waveform.shape[0] > 1:
waveform = waveform.mean(dim=0, keepdim=True)
# Process through Whisper
input_features = processor(
waveform.squeeze().numpy(),
sampling_rate=16000,
return_tensors="pt",
).input_features
# Generate transcription
forced_decoder_ids = processor.get_decoder_prompt_ids(
language=language,
task=task,
)
generated_ids = model.generate(
input_features,
forced_decoder_ids=forced_decoder_ids,
)
# Decode tokens to text
transcription = processor.batch_decode(
generated_ids,
skip_special_tokens=True,
)[0]
return transcription
# Example usage
# transcription = transcribe_audio("lecture.wav")
# print(f"Transcription: {transcription}")
29.3.4 Whisper's Training Data and Robustness
Whisper's robustness comes from the scale and diversity of its training data. The 680,000 hours of audio span: - 99 languages (though English dominates at ~438,000 hours) - Diverse acoustic conditions: studio recordings, phone calls, street interviews, podcasts, lectures, and more - Weakly supervised labels: the transcriptions come from web-scraped caption/subtitle data and may contain errors
This "messy" training data is both a strength and a weakness. The strength is extraordinary robustness: Whisper handles accents, background noise, music, and multiple speakers far better than models trained on clean read-speech corpora. The weakness is that the noisy labels introduce a performance ceiling — Whisper occasionally produces hallucinated text, especially on very quiet or silent audio segments, because it was trained to always produce output.
Comparison with wav2vec 2.0 approach: The wav2vec 2.0 paradigm (self-supervised pre-training on unlabeled audio, then fine-tuning with CTC on labeled data) requires only a small amount of labeled data but achieves strong performance primarily in the languages and domains represented in the fine-tuning set. Whisper, by contrast, uses enormous amounts of weakly labeled data and achieves broader generalization out of the box. The tradeoff is:
| Aspect | wav2vec 2.0 + CTC | Whisper |
|---|---|---|
| Labeled data needed | 10-100 hours | 0 (zero-shot) |
| Unlabeled data | 60,000 hours | N/A |
| Weakly labeled data | N/A | 680,000 hours |
| Language adaptation | Requires fine-tuning | Works zero-shot |
| Streaming support | Yes (CTC is non-autoregressive) | No (encoder-decoder) |
| Hallucination risk | Low (CTC is constrained) | Moderate |
29.3.5 Long-Form Transcription
Whisper processes 30-second segments. For longer audio, a sequential decoding approach is used:
- Process the first 30-second segment
- Use the predicted timestamps to determine how much audio was consumed
- Slide the window forward and process the next segment
- Repeat until the audio is fully processed
The Hugging Face pipeline automates long-form transcription by chunking the audio with overlapping strides and stitching the chunk transcriptions back together:
import torch
from transformers import pipeline
torch.manual_seed(42)
def transcribe_long_audio(
audio_path: str,
model_name: str = "openai/whisper-base",
chunk_length_s: int = 30,
batch_size: int = 8,
) -> str:
"""Transcribe a long audio file using chunked processing.
Args:
audio_path: Path to the audio file.
model_name: Hugging Face model identifier.
chunk_length_s: Length of each chunk in seconds.
batch_size: Number of chunks to process in parallel.
Returns:
Full transcription text.
"""
pipe = pipeline(
"automatic-speech-recognition",
model=model_name,
chunk_length_s=chunk_length_s,
batch_size=batch_size,
device="cuda" if torch.cuda.is_available() else "cpu",
)
result = pipe(audio_path)
return result["text"]
29.4 Text-to-Speech (TTS) Systems
Text-to-speech synthesis is the inverse problem of ASR: generating natural-sounding speech from text. Modern TTS systems have achieved near-human quality.
29.4.1 The TTS Pipeline
Classical TTS systems use a two-stage pipeline:
- Text analysis / Front-end: Converts raw text to a linguistic representation (phonemes, prosody markers, etc.).
- Acoustic model / Back-end: Converts linguistic features to audio.
Modern neural TTS systems often combine these stages or use end-to-end approaches.
29.4.2 Tacotron 2: Attention-Based TTS
Tacotron 2 (Shen et al., 2018) is a landmark neural TTS system consisting of:
- Text encoder: Character or phoneme embedding followed by a stack of convolution layers and a bidirectional LSTM.
- Attention mechanism: Location-sensitive attention that aligns text to mel spectrogram frames.
- Decoder: An autoregressive network that generates mel spectrogram frames one at a time, using the previous frame and attention context as input.
- Vocoder: A separate neural network (WaveNet, WaveGlow, or HiFi-GAN) that converts the mel spectrogram to a raw waveform.
The decoder at each step predicts: - The next mel spectrogram frame - A stop probability indicating whether to terminate generation
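The snippet below is a minimal, illustrative sketch of such a decoder step — not the full Tacotron 2 decoder, which adds a pre-net, location-sensitive attention, and a post-net — showing how the previous mel frame and an attention context yield the next frame and a stop probability:
import torch
import torch.nn as nn
torch.manual_seed(42)
class TacotronDecoderStep(nn.Module):
    """Illustrative Tacotron-2-style decoder step (heavily simplified).
    Args:
        mel_dim: Number of mel channels per frame.
        context_dim: Dimension of the attention context vector.
        hidden_dim: Hidden dimension of the decoder LSTM cell.
    """
    def __init__(
        self,
        mel_dim: int = 80,
        context_dim: int = 512,
        hidden_dim: int = 1024,
    ) -> None:
        super().__init__()
        self.rnn = nn.LSTMCell(mel_dim + context_dim, hidden_dim)
        self.mel_proj = nn.Linear(hidden_dim, mel_dim)  # predicts the next mel frame
        self.stop_proj = nn.Linear(hidden_dim, 1)       # predicts the stop probability
    def forward(self, prev_mel, context, state):
        h, c = self.rnn(torch.cat([prev_mel, context], dim=-1), state)
        return self.mel_proj(h), torch.sigmoid(self.stop_proj(h)), (h, c)
step = TacotronDecoderStep()
prev_mel = torch.zeros(1, 80)                           # all-zero "go" frame
context = torch.randn(1, 512)                           # attention context (placeholder)
state = (torch.zeros(1, 1024), torch.zeros(1, 1024))
mel_frame, stop_prob, state = step(prev_mel, context, state)
print(f"Next frame: {mel_frame.shape}, stop probability: {stop_prob.item():.3f}")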
29.4.3 Modern TTS: VITS and Beyond
VITS (Kim et al., 2021) represents the state of the art in end-to-end TTS, combining: - A posterior encoder that models the variational distribution of speech given text and audio: $q_\phi(\mathbf{z} | \mathbf{x}_{\text{audio}}, \mathbf{y}_{\text{text}})$ - A normalizing flow that transforms between simple and complex distributions, improving the expressiveness of the latent space beyond what a simple Gaussian can represent - A HiFi-GAN decoder for waveform generation, trained adversarially with multi-period and multi-scale discriminators - A stochastic duration predictor that models the inherent variability in speech timing - All trained jointly in an end-to-end manner, eliminating the need for separate mel spectrogram prediction and vocoding stages
The end-to-end nature of VITS is its key advantage over two-stage systems like Tacotron 2 + HiFi-GAN. By training the entire pipeline jointly, VITS avoids the distribution mismatch between the acoustic model's predicted mel spectrograms and the vocoder's expected inputs — a common source of artifacts in two-stage systems.
29.4.4 Bark and Neural Codec-Based TTS
Bark (Suno AI, 2023) represents a different paradigm: it uses an autoregressive transformer to generate audio tokens (EnCodec tokens) conditioned on text, then decodes these tokens to waveforms. This approach inherits the expressiveness of language models, enabling: - Natural prosody and intonation that emerges from the autoregressive modeling - Non-speech sounds (laughter, sighs, hesitations) interspersed in speech - Speaker voice cloning from short audio prompts - Multilingual generation without language-specific training
The architecture consists of three stages: 1. Semantic tokens: A GPT-style model converts text to "semantic" audio tokens (derived from HuBERT features) 2. Coarse acoustic tokens: A second autoregressive model converts semantic tokens to coarse EnCodec tokens 3. Fine acoustic tokens: A final model generates the remaining EnCodec codebook levels
This cascaded generation, where each stage operates at a different level of abstraction, is similar to the hierarchical token generation in MusicGen (discussed in Section 29.8).
import torch
from transformers import VitsModel, AutoTokenizer
torch.manual_seed(42)
def synthesize_speech(
text: str,
model_name: str = "facebook/mms-tts-eng",
) -> tuple[torch.Tensor, int]:
"""Synthesize speech from text using VITS.
Args:
text: Input text to synthesize.
model_name: Hugging Face model identifier.
Returns:
Tuple of (waveform tensor, sample rate).
"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = VitsModel.from_pretrained(model_name)
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
output = model(**inputs)
waveform = output.waveform
sample_rate = model.config.sampling_rate
return waveform, sample_rate
# Example usage
# waveform, sr = synthesize_speech("Hello, welcome to the world of AI.")
# torchaudio.save("output.wav", waveform.cpu(), sr)  # waveform has shape [1, time]
29.4.5 Vocoder Networks
The vocoder is a critical component that converts mel spectrograms to waveforms. Key architectures include:
WaveNet (van den Oord et al., 2016): An autoregressive model that generates audio sample by sample using dilated causal convolutions. Produces high-quality audio but is extremely slow at inference.
WaveGlow (Prenger et al., 2019): A flow-based model that generates all samples in parallel, trading some quality for much faster inference.
HiFi-GAN (Kong et al., 2020): A GAN-based vocoder that achieves both high quality and fast inference. It uses: - A generator with transposed convolutions for upsampling - Multi-period and multi-scale discriminators - Feature matching and mel spectrogram reconstruction losses
import torch
import torch.nn as nn
torch.manual_seed(42)
class SimpleVocoderGenerator(nn.Module):
"""Simplified HiFi-GAN-style vocoder generator.
Args:
mel_channels: Number of mel spectrogram channels.
upsample_rates: List of upsampling rates for each layer.
upsample_kernel_sizes: Kernel sizes for upsampling layers.
hidden_channels: Number of hidden channels.
"""
def __init__(
self,
mel_channels: int = 80,
upsample_rates: list[int] | None = None,
upsample_kernel_sizes: list[int] | None = None,
hidden_channels: int = 256,
) -> None:
super().__init__()
if upsample_rates is None:
upsample_rates = [8, 8, 2, 2]
if upsample_kernel_sizes is None:
upsample_kernel_sizes = [16, 16, 4, 4]
self.pre_conv = nn.Conv1d(mel_channels, hidden_channels, 7, padding=3)
self.upsamples = nn.ModuleList()
ch = hidden_channels
for rate, kernel in zip(upsample_rates, upsample_kernel_sizes):
self.upsamples.append(
nn.ConvTranspose1d(
ch,
ch // 2,
kernel,
stride=rate,
padding=(kernel - rate) // 2,
)
)
ch = ch // 2
self.post_conv = nn.Conv1d(ch, 1, 7, padding=3)
self.tanh = nn.Tanh()
def forward(self, mel: torch.Tensor) -> torch.Tensor:
"""Generate waveform from mel spectrogram.
Args:
mel: Mel spectrogram of shape [batch, mel_channels, time].
Returns:
Waveform tensor of shape [batch, 1, time * product(upsample_rates)].
"""
x = self.pre_conv(mel)
for upsample in self.upsamples:
x = torch.nn.functional.leaky_relu(x, 0.1)
x = upsample(x)
x = self.post_conv(x)
return self.tanh(x)
# Example
mel_input = torch.randn(1, 80, 50) # 50 mel frames
vocoder = SimpleVocoderGenerator()
waveform = vocoder(mel_input)
print(f"Generated waveform shape: {waveform.shape}")
# [1, 1, 12800] (50 * 8 * 8 * 2 * 2 = 12800 samples)
29.5 Speaker Recognition and Verification
Speaker recognition involves identifying or verifying individuals based on their voice characteristics. It encompasses two related tasks:
- Speaker identification: Determining who is speaking from a set of known speakers (closed-set classification).
- Speaker verification: Determining whether a given utterance belongs to a claimed speaker identity (binary decision).
29.5.1 Speaker Embeddings
Modern speaker recognition systems learn a mapping from variable-length utterances to fixed-dimensional speaker embeddings (also called d-vectors or x-vectors). The key architectures are:
x-vector systems use a Time-Delay Neural Network (TDNN) followed by a statistics pooling layer that aggregates frame-level features into a single utterance-level embedding.
ECAPA-TDNN (Desplanques et al., 2020) enhances x-vectors with channel attention, Squeeze-and-Excitation blocks, and multi-scale feature aggregation.
The training objective is typically a classification loss (cross-entropy with speaker labels) combined with a metric learning loss:
Angular Additive Margin (AAM-Softmax):
$$\mathcal{L} = -\log \frac{e^{s(\cos(\theta_{y_i} + m))}}{e^{s(\cos(\theta_{y_i} + m))} + \sum_{j \neq y_i} e^{s \cos \theta_j}}$$
where $\theta_{y_i}$ is the angle between the embedding and the weight vector of the true class, $m$ is the angular margin, and $s$ is a scaling factor.
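A compact implementation of this loss might look like the sketch below; the margin and scale values are illustrative defaults rather than tuned settings:
import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(42)
class AAMSoftmaxLoss(nn.Module):
    """Additive angular margin softmax loss (sketch).
    Args:
        embedding_dim: Dimension of the speaker embeddings.
        num_classes: Number of training speakers.
        margin: Angular margin m added to the target-class angle.
        scale: Scaling factor s applied to the cosine logits.
    """
    def __init__(
        self,
        embedding_dim: int = 192,
        num_classes: int = 1000,
        margin: float = 0.2,
        scale: float = 30.0,
    ) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_dim))
        nn.init.xavier_uniform_(self.weight)
        self.margin = margin
        self.scale = scale
    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between normalized embeddings and class weight vectors
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target_mask = F.one_hot(labels, num_classes=cosine.size(1)).bool()
        # Add the angular margin only to the target-class angle
        logits = torch.where(target_mask, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)
aam_loss = AAMSoftmaxLoss()
embeddings = torch.randn(4, 192)
labels = torch.randint(0, 1000, (4,))
print(f"AAM-Softmax loss: {aam_loss(embeddings, labels).item():.4f}")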
import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(42)
class SpeakerEncoder(nn.Module):
"""Speaker embedding network with statistics pooling.
Args:
input_dim: Dimension of input features.
hidden_dim: Hidden dimension for TDNN layers.
embedding_dim: Dimension of the output speaker embedding.
num_speakers: Number of speakers for classification head.
"""
def __init__(
self,
input_dim: int = 80,
hidden_dim: int = 512,
embedding_dim: int = 192,
num_speakers: int = 1000,
) -> None:
super().__init__()
self.frame_layers = nn.Sequential(
nn.Conv1d(input_dim, hidden_dim, 5, padding=2),
nn.ReLU(),
nn.BatchNorm1d(hidden_dim),
nn.Conv1d(hidden_dim, hidden_dim, 3, padding=1),
nn.ReLU(),
nn.BatchNorm1d(hidden_dim),
nn.Conv1d(hidden_dim, hidden_dim, 3, padding=1),
nn.ReLU(),
nn.BatchNorm1d(hidden_dim),
)
# Statistics pooling: mean and std
self.embedding = nn.Linear(hidden_dim * 2, embedding_dim)
self.classifier = nn.Linear(embedding_dim, num_speakers)
def forward(
self,
x: torch.Tensor,
return_embedding: bool = False,
) -> torch.Tensor:
"""Forward pass through speaker encoder.
Args:
x: Input features of shape [batch, features, time].
return_embedding: If True, return embedding instead of logits.
Returns:
Speaker logits [batch, num_speakers] or
embeddings [batch, embedding_dim].
"""
frame_out = self.frame_layers(x) # [B, H, T]
# Statistics pooling
mean = frame_out.mean(dim=2)
std = frame_out.std(dim=2)
pooled = torch.cat([mean, std], dim=1) # [B, 2*H]
embedding = self.embedding(pooled) # [B, E]
if return_embedding:
return F.normalize(embedding, p=2, dim=1)
return self.classifier(embedding)
# Example
model = SpeakerEncoder()
features = torch.randn(2, 80, 200) # 2 utterances
logits = model(features)
print(f"Logits shape: {logits.shape}") # [2, 1000]
embeddings = model(features, return_embedding=True)
print(f"Embedding shape: {embeddings.shape}") # [2, 192]
# Compute cosine similarity for verification
similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(f"Speaker similarity: {similarity.item():.4f}")
29.5.2 Speaker Verification in Practice
Speaker verification is a binary decision: "Is this speech from the claimed speaker?" The decision is based on comparing the cosine similarity of two speaker embeddings against a threshold:
$$\text{decision} = \begin{cases} \text{accept} & \text{if } \cos(\mathbf{e}_{\text{test}}, \mathbf{e}_{\text{enrolled}}) > \theta \\ \text{reject} & \text{otherwise} \end{cases}$$
where $\mathbf{e}_{\text{test}}$ is the embedding of the test utterance, $\mathbf{e}_{\text{enrolled}}$ is the enrolled speaker embedding (typically the average of multiple enrollment utterances), and $\theta$ is the decision threshold.
The threshold $\theta$ controls the tradeoff between false acceptance (impostors accepted) and false rejection (genuine speakers rejected). The Equal Error Rate (EER) is the operating point where these two error rates are equal, providing a single-number summary of system performance. State-of-the-art systems achieve EER below 1% on standard benchmarks such as VoxCeleb1, meaning that at that operating point fewer than 1% of impostor trials are accepted and fewer than 1% of genuine trials are rejected.
Practical deployment considerations: - Enrollment quality: More enrollment utterances (3-5 seconds each, from different sessions) produce better enrolled embeddings. A single short utterance is often insufficient. - Domain mismatch: A model trained on clean speech may fail when deployed in noisy environments. Domain adaptation through fine-tuning or multi-condition training is often necessary. - Anti-spoofing: Verification systems are vulnerable to replay attacks (playing a recording of the enrolled speaker) and synthesis attacks (generating speech in the enrolled speaker's voice using TTS). Separate anti-spoofing modules that detect synthetic or replayed audio are essential for security-critical applications.
29.5.3 Speaker Diarization
Speaker diarization answers the question "who spoke when?" in a multi-speaker recording. Modern approaches include:
- Clustering-based: Extract speaker embeddings for short segments, then cluster them (e.g., spectral clustering, agglomerative clustering).
- End-to-end neural diarization (EEND): Directly predict speaker activities for each frame using a neural network.
- Hybrid approaches: Combine neural embedding extraction with clustering, often incorporating overlap detection.
The evaluation metric for diarization is Diarization Error Rate (DER), which includes missed speech, false alarm speech, and speaker confusion errors.
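As a sketch of the clustering-based approach, the snippet below assigns each segment embedding (for example, produced by the SpeakerEncoder above) to a speaker using agglomerative clustering; it assumes scikit-learn is available and that the number of speakers is known in advance:
import torch
import torch.nn.functional as F
from sklearn.cluster import AgglomerativeClustering
torch.manual_seed(42)
def cluster_speaker_segments(
    embeddings: torch.Tensor,
    num_speakers: int = 2,
) -> list[int]:
    """Assign a speaker label to each segment by clustering its embedding.
    Args:
        embeddings: Segment embeddings of shape [num_segments, embedding_dim].
        num_speakers: Known (or estimated) number of speakers.
    Returns:
        Speaker label for each segment.
    """
    # L2-normalize so Euclidean distances track cosine distances
    normalized = F.normalize(embeddings, p=2, dim=1).numpy()
    clustering = AgglomerativeClustering(n_clusters=num_speakers)
    return clustering.fit_predict(normalized).tolist()
# Hypothetical example: six short segments from a two-speaker recording
segment_embeddings = torch.randn(6, 192)
print(f"Speaker labels per segment: {cluster_speaker_segments(segment_embeddings)}")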
29.6 Audio Classification
Audio classification extends beyond speech to encompass environmental sounds, music genres, acoustic events, and more.
29.6.1 Audio Event Detection and Classification
Common audio classification tasks include:
- Environmental sound classification: Identifying sounds like dog barking, sirens, rain, etc. (e.g., ESC-50, UrbanSound8K datasets)
- Acoustic scene classification: Identifying the environment (airport, park, office) from ambient audio (DCASE challenge)
- Sound event detection: Detecting and localizing multiple overlapping sound events in time (multi-label, temporal localization)
29.6.2 Audio Spectrogram Transformer (AST)
The Audio Spectrogram Transformer (AST) (Gong et al., 2021) applies Vision Transformer (ViT) to audio by treating the mel spectrogram as an image:
- Split the spectrogram into $16 \times 16$ patches
- Linearly embed each patch
- Add positional embeddings
- Process through Transformer encoder layers
- Use the [CLS] token for classification
AST achieves state-of-the-art results on AudioSet and other benchmarks, benefiting from ImageNet pre-training through transfer of the ViT weights.
import torch
import torch.nn as nn
torch.manual_seed(42)
class AudioPatchEmbedding(nn.Module):
"""Convert mel spectrogram to patch embeddings for AST.
Args:
mel_bins: Number of mel frequency bins.
patch_height: Height of each patch (frequency dimension).
patch_width: Width of each patch (time dimension).
embed_dim: Embedding dimension.
"""
def __init__(
self,
mel_bins: int = 128,
patch_height: int = 16,
patch_width: int = 16,
embed_dim: int = 768,
) -> None:
super().__init__()
self.patch_height = patch_height
self.patch_width = patch_width
self.projection = nn.Conv2d(
1, embed_dim,
kernel_size=(patch_height, patch_width),
stride=(patch_height, patch_width),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Create patch embeddings from mel spectrogram.
Args:
x: Mel spectrogram of shape [batch, 1, mel_bins, time_frames].
Returns:
Patch embeddings of shape [batch, num_patches, embed_dim].
"""
x = self.projection(x) # [B, E, H', W']
x = x.flatten(2).transpose(1, 2) # [B, num_patches, E]
return x
class SimpleAST(nn.Module):
"""Simplified Audio Spectrogram Transformer.
Args:
mel_bins: Number of mel frequency bins.
num_classes: Number of output classes.
embed_dim: Transformer embedding dimension.
num_heads: Number of attention heads.
num_layers: Number of Transformer layers.
max_patches: Maximum number of patches for positional embedding.
"""
def __init__(
self,
mel_bins: int = 128,
num_classes: int = 527,
embed_dim: int = 768,
num_heads: int = 12,
num_layers: int = 6,
max_patches: int = 512,
) -> None:
super().__init__()
self.patch_embed = AudioPatchEmbedding(
mel_bins=mel_bins, embed_dim=embed_dim,
)
self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
self.pos_embed = nn.Parameter(
torch.randn(1, max_patches + 1, embed_dim)
)
encoder_layer = nn.TransformerEncoderLayer(
d_model=embed_dim,
nhead=num_heads,
dim_feedforward=embed_dim * 4,
activation="gelu",
batch_first=True,
norm_first=True,
)
self.transformer = nn.TransformerEncoder(
encoder_layer, num_layers=num_layers,
)
self.norm = nn.LayerNorm(embed_dim)
self.head = nn.Linear(embed_dim, num_classes)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass through AST.
Args:
x: Mel spectrogram of shape [batch, 1, mel_bins, time_frames].
Returns:
Class logits of shape [batch, num_classes].
"""
patches = self.patch_embed(x) # [B, N, E]
batch_size = patches.shape[0]
num_patches = patches.shape[1]
cls_tokens = self.cls_token.expand(batch_size, -1, -1)
x = torch.cat([cls_tokens, patches], dim=1) # [B, N+1, E]
x = x + self.pos_embed[:, :num_patches + 1, :]
x = self.transformer(x)
x = self.norm(x)
cls_output = x[:, 0]
return self.head(cls_output)
# Example
model = SimpleAST(mel_bins=128, num_classes=50)
mel_input = torch.randn(2, 1, 128, 256) # 2 spectrograms
logits = model(mel_input)
print(f"Output shape: {logits.shape}") # [2, 50]
29.6.3 Pre-trained Audio Models
Several pre-trained models have achieved strong results across audio tasks:
- Audio Spectrogram Transformer (AST): ViT-based, pre-trained on AudioSet
- BEATs (Chen et al., 2023): Audio pre-training with discrete audio tokenizers
- CLAP (Wu et al., 2023): Contrastive Language-Audio Pretraining, analogous to CLIP for audio
- AudioMAE: Masked autoencoder pre-training for audio
Using pre-trained models from Hugging Face:
import torch
from transformers import ASTForAudioClassification, ASTFeatureExtractor
torch.manual_seed(42)
def classify_audio(
waveform: torch.Tensor,
sample_rate: int = 16000,
model_name: str = "MIT/ast-finetuned-audioset-10-10-0.4593",
top_k: int = 5,
) -> list[dict[str, float]]:
"""Classify an audio clip using a pre-trained AST model.
Args:
waveform: Audio tensor of shape [samples].
sample_rate: Audio sample rate.
model_name: Hugging Face model identifier.
top_k: Number of top predictions to return.
Returns:
List of dicts with 'label' and 'score' keys.
"""
feature_extractor = ASTFeatureExtractor.from_pretrained(model_name)
model = ASTForAudioClassification.from_pretrained(model_name)
inputs = feature_extractor(
waveform.numpy(),
sampling_rate=sample_rate,
return_tensors="pt",
)
with torch.no_grad():
logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
top_probs, top_indices = probs.topk(top_k)
results = []
for prob, idx in zip(top_probs[0], top_indices[0]):
results.append({
"label": model.config.id2label[idx.item()],
"score": prob.item(),
})
return results
29.7 Audio Transformers and Self-Supervised Learning
29.7.1 wav2vec 2.0
wav2vec 2.0 (Baevski et al., 2020) pioneered self-supervised learning for speech by combining contrastive learning with masked prediction:
- Feature encoder: Multi-layer 1D CNN that processes raw waveform into latent representations $\mathbf{z}_t$.
- Quantization module: Discretizes the latent representations using product quantization: $\mathbf{q}_t = \text{Quantize}(\mathbf{z}_t)$.
- Context network: A Transformer that processes masked latent representations.
- Contrastive loss: The model must identify the true quantized representation $\mathbf{q}_t$ among distractors for masked positions.
$$\mathcal{L}_{\text{contrastive}} = -\log \frac{\exp(\text{sim}(\mathbf{c}_t, \mathbf{q}_t) / \tau)}{\sum_{\tilde{q} \in Q_t} \exp(\text{sim}(\mathbf{c}_t, \tilde{q}) / \tau)}$$
After pre-training, the model can be fine-tuned for ASR by adding a linear CTC head.
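The sketch below illustrates the contrastive objective for a single masked position, sampling distractors from other time steps of the same utterance; it simplifies away the quantization module and wav2vec 2.0's actual masking and sampling strategy:
import torch
import torch.nn.functional as F
torch.manual_seed(42)
def contrastive_loss_at_masked_step(
    context: torch.Tensor,
    targets: torch.Tensor,
    masked_idx: int,
    num_distractors: int = 10,
    temperature: float = 0.1,
) -> torch.Tensor:
    """Contrastive loss for one masked time step (simplified wav2vec 2.0 style).
    Args:
        context: Context-network outputs of shape [time, dim].
        targets: (Quantized) target representations of shape [time, dim].
        masked_idx: Index of the masked time step.
        num_distractors: Number of negative targets sampled from other steps.
        temperature: Softmax temperature for the similarity scores.
    """
    c_t = context[masked_idx]            # context vector at the masked step
    q_t = targets[masked_idx]            # true target for that step
    candidates = torch.randperm(targets.size(0))
    candidates = candidates[candidates != masked_idx][:num_distractors]
    q_all = torch.cat([q_t.unsqueeze(0), targets[candidates]], dim=0)  # true target first
    sims = F.cosine_similarity(c_t.unsqueeze(0), q_all, dim=-1) / temperature
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([0]))  # index 0 = true target
context = torch.randn(100, 256)
targets = torch.randn(100, 256)
loss = contrastive_loss_at_masked_step(context, targets, masked_idx=42)
print(f"Contrastive loss: {loss.item():.4f}")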
29.7.2 HuBERT
HuBERT (Hsu et al., 2021) takes a different approach to self-supervised audio learning:
- Generate pseudo-labels using k-means clustering on MFCC or learned features
- Train a BERT-style masked prediction model to predict these pseudo-labels
- Iterate: use learned features to generate better pseudo-labels
This iterative refinement produces increasingly meaningful discrete representations of speech.
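The first iteration of pseudo-label generation can be sketched in a few lines; the example below assumes scikit-learn is available and uses random frames standing in for MFCC features:
import torch
from sklearn.cluster import KMeans
torch.manual_seed(42)
def generate_pseudo_labels(
    frame_features: torch.Tensor,
    num_clusters: int = 100,
) -> torch.Tensor:
    """Produce HuBERT-style pseudo-labels by clustering frame-level features.
    Args:
        frame_features: Features of shape [num_frames, feature_dim] (e.g., MFCCs).
        num_clusters: Number of k-means clusters (the discrete label inventory).
    Returns:
        Pseudo-label tensor of shape [num_frames].
    """
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
    labels = kmeans.fit_predict(frame_features.numpy())
    return torch.from_numpy(labels)
# Random frames standing in for MFCCs pooled from many utterances
frames = torch.randn(2000, 13)
pseudo_labels = generate_pseudo_labels(frames, num_clusters=100)
print(f"Pseudo-labels: {pseudo_labels.shape}, first 10: {pseudo_labels[:10].tolist()}")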
29.7.3 Fine-Tuning Pre-trained Audio Models
import torch
import torch.nn as nn
from transformers import (
Wav2Vec2ForCTC,
Wav2Vec2Processor,
Wav2Vec2ForSequenceClassification,
)
torch.manual_seed(42)
def create_audio_classifier(
model_name: str = "facebook/wav2vec2-base",
num_labels: int = 10,
freeze_feature_encoder: bool = True,
) -> Wav2Vec2ForSequenceClassification:
"""Create an audio classifier from a pre-trained wav2vec2 model.
Args:
model_name: Hugging Face model identifier.
num_labels: Number of classification labels.
freeze_feature_encoder: Whether to freeze the CNN feature encoder.
Returns:
Wav2Vec2 model configured for classification.
"""
model = Wav2Vec2ForSequenceClassification.from_pretrained(
model_name,
num_labels=num_labels,
problem_type="single_label_classification",
)
if freeze_feature_encoder:
model.freeze_feature_encoder()
# Count trainable parameters
trainable = sum(
p.numel() for p in model.parameters() if p.requires_grad
)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} parameters")
return model
# Example
model = create_audio_classifier(num_labels=10)
# Trainable parameters will be ~90M out of ~94M
29.8 Music AI
Music AI encompasses analysis, understanding, and generation of music — a fascinating domain that combines audio processing with higher-level musical concepts.
29.8.1 Music Information Retrieval (MIR)
MIR tasks include:
- Beat tracking: Detecting the temporal positions of musical beats
- Chord recognition: Identifying chord progressions over time
- Key detection: Determining the musical key of a piece
- Melody extraction: Isolating the dominant melodic line
- Genre classification: Categorizing music by genre
- Instrument recognition: Identifying instruments in a mix
- Source separation: Decomposing a mixture into individual instrument tracks
29.8.2 Music Generation
Music generation has seen remarkable progress with deep learning:
Symbolic music generation operates on MIDI or music notation: - Music Transformer (Huang et al., 2018): Transformer with relative attention for long-term musical structure - MuseNet (Payne, 2019): GPT-2-based model generating multi-instrument compositions
Audio-domain music generation directly generates audio waveforms or spectrograms: - Jukebox (Dhariwal et al., 2020): VQ-VAE with autoregressive priors for raw audio generation with lyrics - MusicLM (Agostinelli et al., 2023): Text-to-music generation using hierarchical sequence-to-sequence modeling - MusicGen (Copet et al., 2023): Single-stage transformer language model over audio tokens
import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration
torch.manual_seed(42)
def generate_music(
description: str,
model_name: str = "facebook/musicgen-small",
duration_seconds: float = 8.0,
) -> tuple[torch.Tensor, int]:
"""Generate music from a text description.
Args:
description: Text describing the desired music.
model_name: Hugging Face model identifier for MusicGen.
duration_seconds: Duration of generated audio in seconds.
Returns:
Tuple of (generated waveform tensor, sample rate).
"""
processor = AutoProcessor.from_pretrained(model_name)
model = MusicgenForConditionalGeneration.from_pretrained(model_name)
inputs = processor(
text=[description],
padding=True,
return_tensors="pt",
)
# Calculate max new tokens based on duration
# MusicGen generates at ~50 tokens per second
max_new_tokens = int(duration_seconds * 50)
audio_values = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
)
sample_rate = model.config.audio_encoder.sampling_rate
return audio_values, sample_rate
# Example usage
# waveform, sr = generate_music(
# "An upbeat electronic dance track with synth pads and a driving beat"
# )
# torchaudio.save("generated_music.wav", waveform[0].cpu(), sr)
29.8.3 Audio Tokenization for Language Models
A crucial innovation enabling music (and general audio) language models is audio tokenization — converting continuous audio into discrete tokens that can be modeled with standard language model architectures. This mirrors how BPE tokenization converts continuous text into discrete tokens for language models (as we saw in Chapter 3), creating a universal interface between raw signals and transformer architectures.
EnCodec (Defossez et al., 2022) is a neural audio codec that: 1. Encodes audio using a convolutional encoder that progressively downsamples the waveform (e.g., from 24 kHz to 75 Hz, a 320x compression in time) 2. Quantizes the latent representation using Residual Vector Quantization (RVQ) with multiple codebooks 3. Decodes back to audio with a symmetric convolutional decoder 4. Is trained with a combination of reconstruction loss, adversarial loss, and feature matching loss
The intuition behind RVQ is like successive approximation in signal processing. The first codebook captures the coarse structure of the audio (overall energy, pitch contour), and each subsequent codebook refines the representation by quantizing what the previous codebooks missed.
RVQ works by iteratively quantizing the residual error:
$$\mathbf{r}_0 = \mathbf{z}, \quad q_i = \text{VQ}_i(\mathbf{r}_{i-1}), \quad \mathbf{r}_i = \mathbf{r}_{i-1} - q_i$$
where: - $\mathbf{z}$ is the continuous latent representation from the encoder - $\text{VQ}_i$ is the $i$-th vector quantizer (each with its own codebook of, say, 1024 entries) - $q_i$ is the quantized approximation at level $i$ - $\mathbf{r}_i$ is the residual error after $i$ levels of quantization
The final quantized representation is:
$$\hat{\mathbf{z}} = \sum_{i=1}^{N_q} q_i$$
where $N_q$ is the number of codebook levels (typically 4-8). This produces discrete tokens at multiple levels of detail, which can then be modeled by transformer language models.
Worked Example: Consider a 1-second audio clip at 24 kHz (24,000 samples). EnCodec with a 320x downsampling factor produces 75 latent frames per second. With $N_q = 8$ codebooks, each frame is represented by 8 discrete tokens (one from each codebook). The total representation is $75 \times 8 = 600$ tokens per second. Each token is an index into a codebook of 1,024 entries, so the bitrate is $75 \times 8 \times 10 = 6{,}000$ bits/second = 6 kbps — a remarkable compression from the original $24{,}000 \times 16 = 384{,}000$ bits/second.
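The RVQ encoding loop itself is short. The sketch below uses random codebooks purely for illustration; in EnCodec the codebooks are learned jointly with the encoder and decoder:
import torch
torch.manual_seed(42)
def rvq_encode(z: torch.Tensor, codebooks: list[torch.Tensor]) -> tuple[torch.Tensor, torch.Tensor]:
    """Residual vector quantization sketch.
    Args:
        z: Continuous latent frames of shape [time, dim].
        codebooks: List of codebook tensors, each of shape [codebook_size, dim].
    Returns:
        Tuple of (token indices [num_codebooks, time], quantized reconstruction).
    """
    residual = z
    indices = []
    quantized = torch.zeros_like(z)
    for codebook in codebooks:
        # Nearest codebook entry for each frame of the current residual
        dists = torch.cdist(residual, codebook)   # [time, codebook_size]
        idx = dists.argmin(dim=-1)                # [time]
        q = codebook[idx]                         # [time, dim]
        indices.append(idx)
        quantized = quantized + q
        residual = residual - q                   # quantize what remains
    return torch.stack(indices), quantized
z = torch.randn(75, 128)                          # one second of latents at 75 Hz
codebooks = [torch.randn(1024, 128) for _ in range(8)]
tokens, z_hat = rvq_encode(z, codebooks)
print(f"Tokens per second: {tokens.numel()}")     # 75 frames x 8 codebooks = 600
print(f"Relative reconstruction error: {(z - z_hat).norm() / z.norm():.3f}")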
SoundStream (Zeghidour et al., 2021) from Google is a concurrent and architecturally similar neural audio codec. The key difference is that SoundStream introduced a streamable architecture suitable for real-time applications, while EnCodec was designed with a broader range of bitrates in mind.
29.8.4 MusicGen Architecture in Detail
MusicGen (Copet et al., 2023) from Meta generates music from text descriptions using a single-stage autoregressive transformer over EnCodec tokens. Its key innovation is the codebook interleaving pattern that enables modeling multiple codebook levels efficiently.
The challenge is that with $N_q$ codebooks and $T$ time steps, a naive autoregressive model would need to predict $T \times N_q$ tokens sequentially. MusicGen introduces several interleaving strategies:
- Flat: Predict all tokens sequentially $[q_1^1, q_2^1, \ldots, q_{N_q}^1, q_1^2, q_2^2, \ldots]$. Slow.
- Parallel: Predict all codebook levels simultaneously for each time step, with a delay pattern. The first codebook is predicted autoregressively; the others are predicted in parallel with a 1-step delay.
- Coarse-first: Predict all time steps of codebook 1 first, then codebook 2 conditioned on codebook 1, etc.
MusicGen uses the delay pattern, which provides a good balance between quality and speed. The text conditioning is provided through cross-attention from the T5 text encoder's embeddings, using the same cross-attention mechanism we saw in diffusion models (Section 27.8.2) and multimodal LLMs (Chapter 28).
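The delay pattern can be illustrated with a small helper that shifts codebook $k$ right by $k$ steps — a simplified sketch of the idea, not the library implementation:
import torch
def apply_delay_pattern(codes: torch.Tensor, pad_token: int) -> torch.Tensor:
    """Arrange codebook tokens in a delay pattern (illustrative sketch).
    Args:
        codes: Token indices of shape [num_codebooks, time].
        pad_token: Filler value for positions with no token yet.
    Returns:
        Delayed token grid of shape [num_codebooks, time + num_codebooks - 1],
        where codebook k is shifted right by k steps.
    """
    num_q, T = codes.shape
    delayed = torch.full((num_q, T + num_q - 1), pad_token, dtype=codes.dtype)
    for k in range(num_q):
        delayed[k, k:k + T] = codes[k]
    return delayed
codes = torch.arange(12).reshape(4, 3)            # 4 codebooks, 3 time steps
print(apply_delay_pattern(codes, pad_token=-1))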
29.8.5 MusicLM and Hierarchical Generation
MusicLM (Agostinelli et al., 2023) from Google takes a hierarchical approach: 1. A text encoder (MuLan) produces a text embedding 2. A "semantic modeling" stage generates semantic tokens (from w2v-BERT) conditioned on the text 3. An "acoustic modeling" stage generates acoustic tokens (from SoundStream) conditioned on the semantic tokens
This hierarchical decomposition — from text to semantic to acoustic — mirrors the way humans think about music: first the high-level structure and feel, then the specific sounds and timbres. MusicLM generates music at 24 kHz and can produce minutes of coherent audio, maintaining musical structure over long time spans.
29.9 Advanced Topics
29.9.1 Audio-Language Models
Following the success of CLIP for vision-language alignment, CLAP (Contrastive Language-Audio Pretraining) learns a shared embedding space for audio and text:
$$\mathcal{L}_{\text{CLAP}} = -\frac{1}{N}\sum_{i=1}^{N} \left[ \log \frac{\exp(\text{sim}(\mathbf{a}_i, \mathbf{t}_i)/\tau)}{\sum_j \exp(\text{sim}(\mathbf{a}_i, \mathbf{t}_j)/\tau)} + \log \frac{\exp(\text{sim}(\mathbf{t}_i, \mathbf{a}_i)/\tau)}{\sum_j \exp(\text{sim}(\mathbf{t}_i, \mathbf{a}_j)/\tau)} \right]$$
This enables zero-shot audio classification by comparing audio embeddings with text embeddings of class descriptions.
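The symmetric loss is straightforward to implement; in the sketch below, random embeddings stand in for the outputs of the audio and text encoders:
import torch
import torch.nn.functional as F
torch.manual_seed(42)
def clap_contrastive_loss(
    audio_emb: torch.Tensor,
    text_emb: torch.Tensor,
    temperature: float = 0.07,
) -> torch.Tensor:
    """Symmetric audio-text contrastive loss (CLIP/CLAP style sketch).
    Args:
        audio_emb: Audio embeddings of shape [batch, dim], paired row-wise with text_emb.
        text_emb: Text embeddings of shape [batch, dim].
        temperature: Softmax temperature.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # [batch, batch] similarity matrix
    targets = torch.arange(audio_emb.size(0))         # matching pairs lie on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)       # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)   # text -> audio direction
    return (loss_a2t + loss_t2a) / 2
audio_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(f"CLAP-style loss: {clap_contrastive_loss(audio_emb, text_emb).item():.4f}")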
29.9.2 Speech Translation
Speech translation combines ASR and machine translation. Two paradigms exist:
- Cascaded: ASR then MT, which suffers from error propagation but leverages abundant text MT data.
- End-to-end: Direct speech-to-text translation, which avoids error propagation but requires parallel speech-translation data.
Whisper supports end-to-end translation to English through its <|translate|> task token.
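Using the transcribe_audio helper from Section 29.3.3, switching to translation is a one-argument change (the file name below is hypothetical):
# Translate non-English speech directly to English text with Whisper
# english_text = transcribe_audio(
#     "interview_french.wav",
#     model_name="openai/whisper-small",
#     language="french",
#     task="translate",
# )
# print(english_text)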
29.9.3 Building a Complete ASR Pipeline
Let us bring together the concepts of this chapter into a practical ASR pipeline that handles real-world audio:
import torch
import torchaudio
from transformers import (
WhisperProcessor,
WhisperForConditionalGeneration,
pipeline,
)
class ASRPipeline:
"""Production-ready ASR pipeline with pre/post-processing.
Args:
model_name: Whisper model identifier.
device: Device to run inference on.
language: Target language for transcription.
"""
def __init__(
self,
model_name: str = "openai/whisper-base",
device: str = "cpu",
language: str = "english",
) -> None:
self.processor = WhisperProcessor.from_pretrained(model_name)
self.model = WhisperForConditionalGeneration.from_pretrained(
model_name
).to(device)
self.device = device
self.language = language
self.target_sr = 16000
def preprocess_audio(
self, audio_path: str
) -> torch.Tensor:
"""Load and preprocess audio for Whisper.
Args:
audio_path: Path to audio file.
Returns:
Preprocessed mono waveform at 16 kHz.
"""
waveform, sr = torchaudio.load(audio_path)
# Convert to mono
if waveform.shape[0] > 1:
waveform = waveform.mean(dim=0, keepdim=True)
# Resample if needed
if sr != self.target_sr:
resampler = torchaudio.transforms.Resample(sr, self.target_sr)
waveform = resampler(waveform)
# Normalize amplitude
waveform = waveform / (waveform.abs().max() + 1e-8)
return waveform.squeeze()
@torch.no_grad()
def transcribe(
self,
audio_path: str,
return_timestamps: bool = False,
) -> dict:
"""Transcribe an audio file.
Args:
audio_path: Path to audio file.
return_timestamps: Whether to include word timestamps.
Returns:
Dict with 'text' and optionally 'chunks' keys.
"""
waveform = self.preprocess_audio(audio_path)
input_features = self.processor(
waveform.numpy(),
sampling_rate=self.target_sr,
return_tensors="pt",
).input_features.to(self.device)
forced_ids = self.processor.get_decoder_prompt_ids(
language=self.language, task="transcribe",
)
generate_kwargs = {
"forced_decoder_ids": forced_ids,
"return_timestamps": return_timestamps,
}
generated_ids = self.model.generate(
input_features, **generate_kwargs,
)
result = self.processor.batch_decode(
generated_ids, skip_special_tokens=True,
)
output = {"text": result[0].strip()}
if return_timestamps:
output["chunks"] = self.processor.batch_decode(
generated_ids,
skip_special_tokens=False,
output_offsets=True,
)
return output
# Usage:
# asr = ASRPipeline(model_name="openai/whisper-small")
# result = asr.transcribe("meeting_recording.wav")
# print(result["text"])
29.9.4 Emotion Recognition from Speech
Speech Emotion Recognition (SER) extracts emotional states from vocal characteristics. Beyond linguistic content, speech carries paralinguistic information through:
- Prosody: Pitch patterns, rhythm, and stress
- Voice quality: Breathiness, hoarseness, tension
- Speaking rate: Speed variations that correlate with emotional states
- Energy patterns: Loudness dynamics
Modern SER systems often use wav2vec 2.0 or HuBERT features, which capture both linguistic and paralinguistic information, fine-tuned on emotion-labeled datasets.
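As a minimal sketch of this recipe, the classifier below mean-pools frame-level features from a frozen self-supervised encoder and maps them to emotion logits. The feature dimension (768, matching wav2vec 2.0 base), the number of classes, and the random features standing in for encoder outputs are all illustrative.
import torch
import torch.nn as nn
torch.manual_seed(42)
class EmotionClassifier(nn.Module):
    """Pooling-plus-MLP head over frame-level SSL features.

    Args:
        feature_dim: Dimension of the frame-level features.
        num_emotions: Number of emotion classes.
    """
    def __init__(self, feature_dim: int = 768, num_emotions: int = 4) -> None:
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feature_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_emotions),
        )
    def forward(self, features: torch.Tensor) -> torch.Tensor:
        """Args: features of shape [batch, frames, feature_dim]. Returns logits [batch, num_emotions]."""
        pooled = features.mean(dim=1)  # average over time frames
        return self.head(pooled)
# Example with random features standing in for frozen wav2vec 2.0 outputs
features = torch.randn(2, 149, 768)
print(EmotionClassifier()(features).shape)  # torch.Size([2, 4])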
29.9.5 Audio Denoising and Enhancement
Audio enhancement aims to improve signal quality by removing noise, reverberation, or interference. Deep learning approaches include:
- Spectral masking: Predict a time-frequency mask that, when applied to the noisy spectrogram, isolates the clean signal
- Waveform-domain models: U-Net-style architectures (e.g., DEMUCS) that operate directly on waveforms
- Diffusion-based enhancement: Iterative refinement using diffusion models conditioned on noisy audio
import torch
import torch.nn as nn
torch.manual_seed(42)
class SpectralMaskNet(nn.Module):
"""Simple spectral masking network for audio denoising.
Args:
n_fft: FFT size determining frequency bins.
hidden_dim: Hidden dimension for LSTM layers.
num_layers: Number of LSTM layers.
"""
def __init__(
self,
n_fft: int = 512,
hidden_dim: int = 256,
num_layers: int = 2,
) -> None:
super().__init__()
freq_bins = n_fft // 2 + 1
self.lstm = nn.LSTM(
freq_bins,
hidden_dim,
num_layers=num_layers,
batch_first=True,
bidirectional=True,
)
self.fc = nn.Linear(hidden_dim * 2, freq_bins)
self.sigmoid = nn.Sigmoid()
def forward(self, magnitude: torch.Tensor) -> torch.Tensor:
"""Predict a spectral mask for denoising.
Args:
magnitude: Magnitude spectrogram [batch, freq_bins, time].
Returns:
Predicted mask of shape [batch, freq_bins, time].
"""
x = magnitude.permute(0, 2, 1) # [B, T, F]
x, _ = self.lstm(x)
mask = self.sigmoid(self.fc(x)) # [B, T, F]
return mask.permute(0, 2, 1) # [B, F, T]
# Example
model = SpectralMaskNet()
noisy_magnitude = torch.randn(1, 257, 100).abs()
mask = model(noisy_magnitude)
clean_estimate = noisy_magnitude * mask
print(f"Mask shape: {mask.shape}") # [1, 257, 100]
29.10 Practical Considerations
29.10.1 Data Augmentation for Audio
Audio data augmentation is crucial for robust model training:
- Time stretching: Change speed without altering pitch
- Pitch shifting: Change pitch without altering speed
- Adding noise: Mix with environmental noise at various SNR levels (a waveform-domain sketch follows the SpecAugment example below)
- Room simulation: Apply room impulse responses (RIR) for reverberation
- SpecAugment (Park et al., 2019): Mask blocks of frequency channels and/or time steps in the spectrogram
import torch
import torchaudio.transforms as T
torch.manual_seed(42)
class SpecAugment(torch.nn.Module):
"""SpecAugment data augmentation for spectrograms.
Args:
freq_mask_param: Maximum width of frequency masks.
time_mask_param: Maximum width of time masks.
num_freq_masks: Number of frequency masks to apply.
num_time_masks: Number of time masks to apply.
"""
def __init__(
self,
freq_mask_param: int = 27,
time_mask_param: int = 100,
num_freq_masks: int = 2,
num_time_masks: int = 2,
) -> None:
super().__init__()
self.freq_masks = torch.nn.ModuleList([
T.FrequencyMasking(freq_mask_param)
for _ in range(num_freq_masks)
])
self.time_masks = torch.nn.ModuleList([
T.TimeMasking(time_mask_param)
for _ in range(num_time_masks)
])
def forward(self, spec: torch.Tensor) -> torch.Tensor:
"""Apply SpecAugment to a spectrogram.
Args:
spec: Spectrogram tensor of shape [batch, freq, time].
Returns:
Augmented spectrogram of same shape.
"""
for mask in self.freq_masks:
spec = mask(spec)
for mask in self.time_masks:
spec = mask(spec)
return spec
# Example
augment = SpecAugment()
spec = torch.randn(1, 80, 200)
print(f"Original sum: {spec.sum():.2f}")
augmented = augment(spec.clone())  # clone so the input spectrogram is left untouched
print(f"Augmented sum: {augmented.sum():.2f}")
29.10.2 Evaluation Metrics
Different audio tasks use different evaluation metrics:
Speech Recognition:
- Word Error Rate (WER): $\frac{S + D + I}{N}$, where $S$ = substitutions, $D$ = deletions, $I$ = insertions, and $N$ = total reference words (a minimal implementation sketch follows this list)
- Character Error Rate (CER): The same computation at the character level
Speaker Verification:
- Equal Error Rate (EER): The operating point where the false acceptance rate equals the false rejection rate
- minDCF: Minimum Detection Cost Function
Audio Classification:
- mean Average Precision (mAP): Standard for multi-label classification (e.g., AudioSet)
- Accuracy: For single-label tasks
TTS:
- Mean Opinion Score (MOS): Subjective human rating on a 1-5 scale
- Mel Cepstral Distortion (MCD): Objective measure of spectral distance
- PESQ/POLQA: Perceptual evaluation of speech quality
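The WER formula above is a normalized edit distance over words; a minimal reference implementation is sketched below (the function name is illustrative).
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as the word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 0.1667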
29.10.3 Deployment Considerations
When deploying audio AI systems:
- Streaming vs. offline: Streaming applications (voice assistants, live captioning) require models that can process audio incrementally without seeing the full utterance.
- Latency: Real-time applications typically require end-to-end latency under 200ms.
- Model size: Edge deployment may require model compression (quantization, distillation, pruning).
- Noise robustness: Production systems must handle diverse acoustic conditions.
- Privacy: On-device processing may be preferred for voice data to protect user privacy.
29.11 Summary
This chapter has taken you on a comprehensive journey through audio AI, from the physics of sound to state-of-the-art deep learning models. The key themes are:
- Representations matter: The choice between waveforms, spectrograms, mel spectrograms, and MFCCs significantly impacts model performance, with log-mel spectrograms being the dominant choice for most tasks.
- End-to-end learning has won: The field has moved decisively from handcrafted pipelines (GMM-HMM with pronunciation dictionaries) to end-to-end neural models that learn directly from audio.
- Scale and weak supervision: Whisper demonstrated that training on massive amounts of weakly supervised data can produce remarkably robust models that generalize well across languages and conditions.
- Self-supervised pre-training: Models like wav2vec 2.0 and HuBERT learn powerful audio representations from unlabeled data, enabling few-shot adaptation to downstream tasks.
- Convergence with language modeling: Audio tokenization (EnCodec, RVQ) has enabled the application of language model architectures to audio generation, leading to breakthroughs in TTS and music generation.
- Transformers are the universal architecture: From AST for audio classification to Whisper for ASR to MusicGen for music generation, the Transformer architecture has become the backbone of audio AI.
As we move forward, the boundaries between speech, audio, and music AI continue to blur, with universal audio models capable of handling multiple tasks emerging as a dominant paradigm. The integration of audio understanding with large language models — enabling conversational AI that can truly listen and speak — represents one of the most exciting frontiers in the field.
29.12 Exercises
- Mel spectrogram parameters: Compute log-mel spectrograms of the same audio clip using different parameters: (a) 40 mel bins vs. 80 mel bins, (b) hop length 160 vs. 512, (c) FFT size 512 vs. 2048. Visualize the differences and describe how each parameter affects the time-frequency resolution tradeoff.
- CTC decoding comparison: Train a simple CTC-based ASR model on a subset of LibriSpeech. Compare greedy decoding vs. beam search (beam width 10) vs. beam search with a language model. Measure the WER for each and report the improvement from each technique.
- Whisper evaluation: Evaluate Whisper (base, small, and medium sizes) on a noisy audio recording (e.g., a street interview or a podcast with background music). Report WER for each model size and characterize the types of errors each model makes. Which errors does scaling resolve?
- Speaker verification: Using a pre-trained ECAPA-TDNN model from SpeechBrain, compute speaker embeddings for 10 utterances each from 5 different speakers. Compute all pairwise cosine similarities and determine the threshold that maximizes verification accuracy.
- Music generation: Generate 30-second music clips using MusicGen with 5 different text descriptions (varying genre, tempo, and instrumentation). Subjectively evaluate the quality, temporal coherence, and adherence to the text description. How does generation quality compare between the "small" and "medium" model sizes?