Chapter 29: Exercises -- Speech, Audio, and Music AI
Section 1: Audio Representations
Exercise 29.1: Spectrogram Parameters
Compute a spectrogram for a 5-second audio clip sampled at 16,000 Hz with n_fft=1024 and hop_length=256. Calculate: (a) the number of frequency bins, (b) the number of time frames, and (c) the time resolution in milliseconds per frame.
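A starting point for checking the hand calculation, assuming a synthetic 5-second signal in place of a real clip; torch.stft with a Hann window lets the bin and frame counts be read off directly.

```python
import torch

sr, dur, n_fft, hop = 16_000, 5.0, 1024, 256
x = torch.randn(int(sr * dur))  # stand-in for the real 5-second clip

spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft), return_complex=True)

freq_bins, frames = spec.shape
print(f"(a) frequency bins: {freq_bins}")            # n_fft // 2 + 1 = 513
print(f"(b) time frames:    {frames}")               # ~ len(x) // hop + 1 with centering
print(f"(c) ms per frame:   {1000 * hop / sr:.1f}")  # 256 / 16000 s = 16 ms
```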
Exercise 29.2: Mel Scale Conversion
Write a function that converts frequencies from Hz to mel scale and back. Verify that the round-trip conversion is accurate for frequencies 200, 1000, 4000, and 8000 Hz. Plot the mel scale against the Hz scale for 0-16000 Hz.
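One possible starting point, using the common HTK-style formula mel = 2595 · log10(1 + f/700); note that librosa's default filterbank uses the Slaney variant, so expect small numerical differences if you compare against it.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style Hz -> mel conversion."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

for f in (200, 1000, 4000, 8000):
    roundtrip = mel_to_hz(hz_to_mel(f))
    print(f"{f} Hz -> {hz_to_mel(f):.1f} mel -> {roundtrip:.4f} Hz")
```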
Exercise 29.3: Feature Comparison
For a 3-second speech clip at 16 kHz, compute and visually compare: (a) the raw waveform, (b) the power spectrogram, (c) the log-mel spectrogram (80 mels), and (d) MFCCs (13 coefficients). Discuss the information retained and lost in each representation.
Exercise 29.4: MFCC Delta Features
Implement delta and delta-delta feature computation for MFCCs. Given MFCC features of shape [13, T], compute the first and second temporal derivatives, producing a combined feature matrix of shape [39, T].
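A minimal sketch of the standard regression-based delta formula with window N=2 and edge padding; librosa.feature.delta implements the same idea and can serve as a cross-check.

```python
import numpy as np

def deltas(feats, N=2):
    """Regression deltas: d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2*sum_n n^2)."""
    T = feats.shape[1]
    padded = np.pad(feats, ((0, 0), (N, N)), mode="edge")  # pad along the time axis
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(feats)
    for n in range(1, N + 1):
        out += n * (padded[:, N + n:N + n + T] - padded[:, N - n:N - n + T])
    return out / denom

mfcc = np.random.randn(13, 200)            # stand-in for real MFCCs, shape [13, T]
d1 = deltas(mfcc)                          # deltas
d2 = deltas(d1)                            # delta-deltas
combined = np.concatenate([mfcc, d1, d2])  # shape [39, T]
print(combined.shape)
```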
Exercise 29.5: Nyquist and Aliasing
Demonstrate aliasing by generating a 6000 Hz sine wave, sampling it at 8000 Hz (below the Nyquist rate of 12,000 Hz required for this frequency), and comparing the reconstructed signal with one sampled at 16,000 Hz. Plot both to visualize the aliasing effect.
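A sketch of the demonstration: sampled at 8 kHz, a 6 kHz sinusoid folds down to |6000 - 8000| = 2000 Hz, which the plot makes visible against a densely sampled reference.

```python
import numpy as np
import matplotlib.pyplot as plt

f0, dur = 6000, 0.005  # 6 kHz tone, 5 ms window for plotting

def sample(fs):
    t = np.arange(0, dur, 1 / fs)
    return t, np.sin(2 * np.pi * f0 * t)

t_lo, x_lo = sample(8000)    # undersampled: aliases to 2 kHz
t_hi, x_hi = sample(16000)   # adequately sampled (16 kHz > 12 kHz Nyquist rate)

t_ref = np.arange(0, dur, 1 / 96000)  # dense grid as a visual reference
plt.plot(t_ref, np.sin(2 * np.pi * f0 * t_ref), "k-", lw=0.5, label="true 6 kHz")
plt.plot(t_lo, x_lo, "o--", label="fs = 8 kHz (aliased)")
plt.plot(t_hi, x_hi, ".-", label="fs = 16 kHz")
plt.legend(); plt.xlabel("time (s)"); plt.show()
```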
Section 2: Speech Recognition
Exercise 29.6: CTC Loss Computation
Implement a simplified CTC forward pass for a vocabulary of 5 characters plus blank. Given a sequence of 10 frames and a target of 3 characters, compute the CTC loss using dynamic programming. Verify your result against torch.nn.CTCLoss.
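For the verification step, a sketch of how the reference value can be obtained from torch.nn.CTCLoss under assumed shapes (10 frames, batch of 1, 6 classes with blank at index 0, target length 3); your dynamic-programming forward pass should reproduce this number.

```python
import torch
import torch.nn as nn

T, C, S = 10, 6, 3                    # frames, classes (5 chars + blank), target length
log_probs = torch.randn(T, 1, C).log_softmax(dim=-1)  # [T, batch, classes]
targets = torch.tensor([[1, 3, 2]])   # 3 target characters (blank is index 0)

ctc = nn.CTCLoss(blank=0, reduction="none")
ref = ctc(log_probs, targets,
          input_lengths=torch.tensor([T]),
          target_lengths=torch.tensor([S]))
print("reference CTC loss:", ref.item())  # compare with your own implementation
```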
Exercise 29.7: CTC Decoding
Implement greedy CTC decoding and prefix beam search decoding. Compare the outputs on 5 synthetic CTC output sequences. Show an example where beam search produces a different (better) result than greedy decoding.
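Greedy decoding reduces to an argmax per frame followed by collapsing repeats and removing blanks; a minimal sketch assuming blank index 0.

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """log_probs: [T, C] array of per-frame log probabilities."""
    best = np.argmax(log_probs, axis=-1)
    decoded, prev = [], None
    for idx in best:
        if idx != blank and idx != prev:  # collapse repeats, drop blanks
            decoded.append(int(idx))
        prev = idx
    return decoded

# e.g. frames favouring [a, a, blank, b, b, blank, a] decode to [a, b, a]
```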
Exercise 29.8: Whisper Transcription
Use the HuggingFace Whisper pipeline to transcribe three audio clips. Compare the accuracy of whisper-tiny, whisper-base, and whisper-small models on the same clips. Measure WER and inference time for each.
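A sketch of the transcription loop with the transformers ASR pipeline; the clip paths are placeholders and the timing is wall-clock only.

```python
import time
from transformers import pipeline

clips = ["clip1.wav", "clip2.wav", "clip3.wav"]  # placeholder paths
models = ["openai/whisper-tiny", "openai/whisper-base", "openai/whisper-small"]

for name in models:
    asr = pipeline("automatic-speech-recognition", model=name)
    start = time.perf_counter()
    texts = [asr(clip)["text"] for clip in clips]
    elapsed = time.perf_counter() - start
    print(name, f"{elapsed:.1f}s", texts)
    # compute WER against reference transcripts here (see Exercise 29.10)
```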
Exercise 29.9: Language Detection with Whisper
Use Whisper to detect the language of 5 audio clips in different languages. Examine the decoder's language token probabilities. How confident is the model in each case?
Exercise 29.10: Word Error Rate
Implement a WER calculator using dynamic programming (edit distance). Compute WER for 10 reference-hypothesis pairs. Identify which types of errors (substitutions, insertions, deletions) are most common.
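A minimal WER sketch using dynamic-programming edit distance over words; extending it with a backtrace to count substitutions, insertions, and deletions separately covers the second part of the exercise.

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance over word lists."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # sub / del / ins
    return dp[len(r)][len(h)] / max(len(r), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words
```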
Section 3: Text-to-Speech
Exercise 29.11: Vocoder Comparison
Implement a simplified Griffin-Lim vocoder (iterative spectrogram inversion) and compare its output quality with a neural vocoder (using a pre-trained HiFi-GAN from HuggingFace). Discuss the differences in quality and speed.
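For the classical baseline, torchaudio ships a GriffinLim transform that can serve as a reference before (or alongside) writing the iteration by hand; the clip path is a placeholder.

```python
import torchaudio
import torchaudio.transforms as T

wav, sr = torchaudio.load("speech.wav")  # placeholder path
n_fft, hop = 1024, 256

to_spec = T.Spectrogram(n_fft=n_fft, hop_length=hop, power=2.0)
griffin_lim = T.GriffinLim(n_fft=n_fft, hop_length=hop, power=2.0, n_iter=32)

spec = to_spec(wav)        # power spectrogram (phase is discarded)
recon = griffin_lim(spec)  # iterative phase estimation + inversion
torchaudio.save("griffin_lim.wav", recon, sr)
```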
Exercise 29.12: TTS Pipeline
Build a complete TTS pipeline using VITS from HuggingFace. Synthesize speech for 5 different sentences and compute the mel cepstral distortion (MCD) between mel-cepstral features extracted from the synthesized and reference audio.
Exercise 29.13: Prosody Analysis
Extract pitch (F0) contours from 3 speech utterances using torchaudio. Compare the prosody patterns of statements, questions, and exclamations. Visualize the pitch contours overlaid on spectrograms.
Section 4: Speaker Recognition
Exercise 29.14: Speaker Embedding Extraction
Use a pre-trained speaker encoder (e.g., speechbrain/spkrec-ecapa-voxceleb) to extract speaker embeddings for 10 utterances from 3 different speakers. Compute the cosine similarity matrix and visualize it. Does the model correctly cluster speakers?
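A sketch of embedding extraction and similarity computation with SpeechBrain's pretrained ECAPA encoder; the import path varies slightly across SpeechBrain versions, and the utterance list is a placeholder.

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in newer releases

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

files = [f"utt_{i:02d}.wav" for i in range(10)]  # placeholder utterance paths
embs = []
for path in files:
    wav, sr = torchaudio.load(path)
    embs.append(encoder.encode_batch(wav).squeeze())  # one embedding per utterance
embs = torch.nn.functional.normalize(torch.stack(embs), dim=-1)

similarity = embs @ embs.T  # 10x10 cosine similarity matrix
print(similarity)
```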
Exercise 29.15: Speaker Verification System
Build a speaker verification system with enrollment and verification phases. Enroll 3 speakers with 3 utterances each. Test verification with 6 probe utterances (3 genuine, 3 impostor). Report EER and plot the DET curve.
Exercise 29.16: Speaker Diarization
Implement a simple clustering-based speaker diarization pipeline: (a) segment audio into 2-second windows, (b) extract speaker embeddings, (c) cluster with agglomerative clustering, (d) assign speaker labels. Test on a synthetic 2-speaker conversation.
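The clustering step (c) can be sketched with scikit-learn once per-window embeddings are available, e.g. from the encoder in Exercise 29.14; the embedding array here is a placeholder.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Placeholder: one embedding per 2-second window, e.g. from an ECAPA encoder
window_embeddings = np.random.randn(30, 192)

labels = AgglomerativeClustering(
    n_clusters=2, metric="cosine", linkage="average"  # older scikit-learn uses affinity=
).fit_predict(window_embeddings)

for i, spk in enumerate(labels):
    print(f"{2 * i:5.1f}-{2 * (i + 1):5.1f} s  ->  speaker {spk}")
```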
Section 5: Audio Classification
Exercise 29.17: Environmental Sound Classification
Fine-tune a pre-trained Audio Spectrogram Transformer (AST) on a subset of ESC-50 (10 classes). Report accuracy, per-class precision and recall, and a confusion matrix. Identify the most commonly confused sound categories.
Exercise 29.18: SpecAugment Ablation
Train an audio classifier with and without SpecAugment augmentation. Vary the masking parameters (frequency mask width, time mask width, number of masks) and report the effect on classification accuracy. Which parameters have the largest impact?
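torchaudio provides the frequency- and time-masking transforms that SpecAugment is built from, so the ablation is largely a matter of sweeping the mask parameters; a minimal sketch on a placeholder log-mel batch.

```python
import torch
import torchaudio.transforms as T

mel = torch.randn(1, 80, 300)  # placeholder log-mel batch [batch, mels, frames]

def spec_augment(x, freq_param=15, time_param=35, n_masks=2):
    """Apply n_masks frequency masks and n_masks time masks."""
    freq_mask = T.FrequencyMasking(freq_mask_param=freq_param)
    time_mask = T.TimeMasking(time_mask_param=time_param)
    for _ in range(n_masks):
        x = freq_mask(x)
        x = time_mask(x)
    return x

augmented = spec_augment(mel)
# sweep freq_param, time_param, and n_masks, retraining the classifier for each setting
```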
Exercise 29.19: Audio Event Detection
Implement a multi-label audio event detection system that can identify overlapping sounds. Use sigmoid outputs instead of softmax. Evaluate using mean average precision (mAP) on a synthetic dataset with overlapping events.
Exercise 29.20: Zero-Shot Audio Classification with CLAP
Use a pre-trained CLAP model to classify audio clips using text descriptions as class labels (zero-shot). Compare with a supervised AST classifier on the same classes. When does zero-shot classification work well?
Section 6: Music AI
Exercise 29.21: Music Genre Classification
Build a music genre classifier using mel spectrograms and a CNN or transformer encoder. Train on a subset of GTZAN (or synthetic data). Report accuracy and analyze which genres are most difficult to distinguish.
Exercise 29.22: Beat Tracking
Implement a simple beat tracking algorithm: (a) compute an onset strength envelope from a spectrogram, (b) apply peak picking, (c) estimate tempo using autocorrelation. Compare with librosa's beat tracker on 3 music clips.
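A sketch of steps (a) and (c) plus the librosa comparison; tempo is read off the autocorrelation peak within a plausible BPM range, and the audio path is a placeholder.

```python
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=22050)  # placeholder path
hop = 512

# (a) onset strength envelope
onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)

# (c) tempo from the autocorrelation of the onset envelope
ac = librosa.autocorrelate(onset_env)
lags = np.arange(1, len(ac))
bpm = 60.0 * sr / (hop * lags)       # convert lag (frames) to BPM
valid = (bpm >= 60) & (bpm <= 200)   # restrict to a plausible tempo range
tempo_est = bpm[valid][np.argmax(ac[1:][valid])]

# reference: librosa's built-in tracker
tempo_ref, beats = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr, hop_length=hop)
print(f"autocorrelation tempo: {tempo_est:.1f} BPM, librosa: {float(tempo_ref):.1f} BPM")
```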
Exercise 29.23: Music Generation with MusicGen
Use MusicGen from HuggingFace to generate 4 music clips from different text descriptions. Evaluate the generated audio subjectively and measure basic statistics (duration, sample rate, spectral characteristics).
Exercise 29.24: Audio Tokenization
Implement a simplified audio tokenizer using vector quantization: (a) train a small autoencoder on speech spectrograms, (b) add a vector quantization bottleneck, (c) reconstruct the spectrogram from discrete tokens. Measure reconstruction quality (MSE, SNR).
Section 7: Advanced Topics
Exercise 29.25: Audio Denoising
Build a spectral masking denoiser: (a) mix clean speech with noise at various SNR levels, (b) train an LSTM to predict a time-frequency mask, (c) evaluate PESQ and STOI improvements at SNR levels of -5, 0, 5, and 10 dB.
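Step (a) comes down to scaling the noise so the mixture hits a target SNR; a small helper, assuming clean and noise arrays of equal length.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale noise so that 10*log10(P_clean / P_noise) == snr_db, then mix."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example: build mixtures at the SNRs used for evaluation
clean = np.random.randn(16000)  # placeholder 1 s of "speech"
noise = np.random.randn(16000)  # placeholder noise
mixtures = {snr: mix_at_snr(clean, noise, snr) for snr in (-5, 0, 5, 10)}
```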
Exercise 29.26: Emotion Recognition
Fine-tune a wav2vec2 model for speech emotion recognition on a subset of RAVDESS (or synthetic data). Compare performance with MFCC + SVM baseline. Which emotions are hardest to classify from speech?
Exercise 29.27: Speech Translation
Use Whisper's translate mode to translate speech from French (or another language) to English. Compare the translation quality (BLEU score) with a cascaded approach (Whisper transcription + text translation).
Exercise 29.28: Audio Data Augmentation Pipeline
Implement a comprehensive audio augmentation pipeline including: (a) time stretching, (b) pitch shifting, (c) noise addition at random SNR, (d) room impulse response simulation, (e) SpecAugment. Measure the effect of each augmentation on downstream classification accuracy.
Exercise 29.29: Streaming ASR
Implement a simple streaming ASR system that processes audio in 1-second chunks using a CTC model. Compare the latency and accuracy with full-utterance processing. What is the minimum chunk size that maintains acceptable accuracy?
Exercise 29.30: Multi-Modal Audio-Text Retrieval
Build a system that retrieves audio clips from text queries and text descriptions from audio queries using CLAP embeddings. Create a small test set of 20 audio-text pairs and measure Recall@1 and Recall@5 in both directions.
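A sketch of both retrieval directions using the transformers CLAP interface, assuming the laion/clap-htsat-unfused checkpoint and 48 kHz mono audio arrays; the exact processor and model arguments may differ across transformers versions, and the captions and clips are placeholders.

```python
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

texts = ["a dog barking", "rain on a window"]       # placeholder captions
audios = [np.random.randn(48_000) for _ in texts]   # placeholder 1 s clips at 48 kHz

text_in = processor(text=texts, return_tensors="pt", padding=True)
audio_in = processor(audios=audios, sampling_rate=48_000, return_tensors="pt")

with torch.no_grad():
    t = torch.nn.functional.normalize(model.get_text_features(**text_in), dim=-1)
    a = torch.nn.functional.normalize(model.get_audio_features(**audio_in), dim=-1)

sims = t @ a.T  # text-to-audio similarities; transpose for audio-to-text
# For Recall@k: for each query, check whether its paired item appears in the top-k scores.
```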