Chapter 29: Quiz -- Speech, Audio, and Music AI


Question 1

What does the Nyquist-Shannon sampling theorem state?

A) The sample rate must equal the highest frequency in the signal
B) The sample rate must be at least twice the highest frequency present in the signal
C) Higher sample rates always produce better audio quality
D) The sample rate determines the bit depth of the audio

Answer: B
Explanation: The Nyquist-Shannon theorem states that to faithfully represent a signal, the sample rate must be at least twice the highest frequency component. For example, to capture frequencies up to 8000 Hz (typical for speech), we need a sample rate of at least 16000 Hz.
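
As a quick numerical illustration (a minimal sketch using NumPy; the 12 kHz tone and 16 kHz rate are arbitrary examples), undersampling folds a tone above the Nyquist limit onto a lower frequency:

```python
import numpy as np

def nyquist_rate(max_freq_hz: float) -> float:
    """Minimum sample rate needed to represent content up to max_freq_hz."""
    return 2.0 * max_freq_hz

# Speech content up to 8 kHz needs at least 16 kHz sampling.
print(nyquist_rate(8000))  # 16000.0

# Sampling a 12 kHz tone at only 16 kHz aliases it onto 16000 - 12000 = 4000 Hz.
sr, f = 16000, 12000
t = np.arange(sr) / sr
aliased = np.sin(2 * np.pi * f * t)           # samples of the 12 kHz tone
folded = np.sin(2 * np.pi * (sr - f) * t)     # a 4 kHz tone at the same rate
print(np.allclose(aliased, -folded))          # True: identical up to phase
```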


Question 2

Why are log-mel spectrograms preferred over raw spectrograms as input to audio neural networks?

A) They are computationally cheaper to compute
B) They compress the frequency axis to match human auditory perception and reduce dimensionality
C) They preserve phase information better
D) They are lossless representations of the audio signal

Answer: B
Explanation: The mel scale approximates human auditory perception, which is non-linear in frequency (more sensitive to low-frequency differences). Log-mel spectrograms compress the frequency axis to match this perception and reduce the number of frequency bins, making them more suitable for neural network input.
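
A minimal sketch of the usual feature pipeline, assuming librosa is installed; the file name and parameter choices (80 mel bins, 25 ms window, 10 ms hop) are illustrative, not required values:

```python
import librosa
import numpy as np

# Load audio at 16 kHz (a common rate for speech models).
y, sr = librosa.load("speech.wav", sr=16000)

# Mel spectrogram: 80 mel bins, 25 ms windows (n_fft=400), 10 ms hop (hop_length=160).
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80
)

# Log compression mirrors the roughly logarithmic loudness perception of the ear.
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (80, num_frames)
```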


Question 3

What is the primary purpose of the blank token in CTC?

A) To represent silence in the audio
B) To enable many-to-one alignment between audio frames and output characters without requiring frame-level labels
C) To separate words in the output
D) To pad sequences to equal length

Answer: B
Explanation: The CTC blank token enables the model to output "nothing" at frames that do not correspond to a new character. This, combined with the collapsing function that removes consecutive duplicates and blanks, allows CTC to align variable-length audio with variable-length text without explicit frame-level alignment annotations.
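
The collapsing rule is small enough to show directly. A minimal sketch, with the blank written as "_":

```python
def ctc_collapse(frame_outputs, blank="_"):
    """Collapse a per-frame CTC output: merge repeats, then drop blanks."""
    collapsed = []
    prev = None
    for symbol in frame_outputs:
        if symbol != prev:        # merge consecutive duplicates
            collapsed.append(symbol)
        prev = symbol
    return "".join(s for s in collapsed if s != blank)  # remove blanks

# "hh_e_ll_lo" -> "hello": the blank between the two "l" runs preserves the double letter.
print(ctc_collapse(list("hh_e_ll_lo")))  # hello
```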


Question 4

How does Whisper's multitask training format work?

A) It trains separate models for each task
B) It uses special tokens in the decoder sequence to specify the language, task (transcribe/translate), and timestamp preferences
C) It requires different training datasets for each task
D) It uses multiple decoder heads, one per task

Answer: B
Explanation: Whisper uses a structured token format where special tokens encode the task: <|startoftranscript|> <|language|> <|task|> <|timestamps|>. This allows a single model to perform language identification, transcription, translation, and timestamp prediction by conditioning on different token sequences.
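
For instance, with the Hugging Face transformers library the task-conditioning prefix can be inspected as forced decoder ids. A sketch; the checkpoint name is just an example and the call assumes a recent transformers version:

```python
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Returns (position, token_id) pairs that force the decoder prefix, i.e. the ids
# of tokens such as <|fr|> <|translate|> <|notimestamps|> after <|startoftranscript|>.
forced_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
print(forced_ids)
```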


Question 5

What is the key architectural difference between Whisper and wav2vec 2.0?

A) Whisper uses CNNs while wav2vec 2.0 uses transformers
B) Whisper is an encoder-decoder model trained with weak supervision, while wav2vec 2.0 is an encoder-only model trained with self-supervised contrastive learning
C) Whisper operates on raw waveforms while wav2vec 2.0 uses spectrograms
D) Whisper can only transcribe English while wav2vec 2.0 is multilingual

Answer: B
Explanation: Whisper uses a standard encoder-decoder transformer trained on 680K hours of weakly labeled data. wav2vec 2.0 uses an encoder-only architecture with a contrastive learning objective on unlabeled audio, then fine-tunes with CTC for ASR.


Question 6

What is the Audio Spectrogram Transformer (AST)?

A) A transformer that generates audio spectrograms from text
B) A ViT-based model that treats mel spectrograms as images, splitting them into patches for classification
C) A convolutional model that processes spectrograms
D) A model that converts audio to text using attention

Answer: B
Explanation: AST applies the Vision Transformer approach to audio by treating the mel spectrogram as an image, splitting it into 16x16 patches, embedding each patch, and processing them with a transformer encoder. It benefits from ImageNet ViT pre-training.
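
A rough sketch of the patch-embedding step in PyTorch; the spectrogram is random, and the real AST additionally uses overlapping patches (stride 10) and learned positional embeddings:

```python
import torch
import torch.nn as nn

# A log-mel spectrogram treated as a 1-channel "image": (batch, 1, mel_bins, frames).
spec = torch.randn(1, 1, 128, 1024)

# Split into 16x16 patches and linearly embed each one, ViT-style.
patch_embed = nn.Conv2d(in_channels=1, out_channels=768, kernel_size=16, stride=16)
patches = patch_embed(spec)                  # (1, 768, 8, 64) patch grid
tokens = patches.flatten(2).transpose(1, 2)  # (1, 512, 768) token sequence for the encoder
print(tokens.shape)
```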


Question 7

In speaker verification, what does the Equal Error Rate (EER) measure?

A) The error rate when the system processes equal numbers of genuine and impostor trials
B) The operating point where the false acceptance rate equals the false rejection rate
C) The average of all error types
D) The error rate at a 50% confidence threshold

Answer: B
Explanation: EER is the point on the DET (Detection Error Tradeoff) curve where FAR (false acceptance rate) equals FRR (false rejection rate). Lower EER indicates better verification performance. Modern systems achieve EER below 1% on standard benchmarks.
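
A common way to estimate EER from trial scores, sketched with scikit-learn; the labels and scores below are made-up examples:

```python
import numpy as np
from sklearn.metrics import roc_curve

# labels: 1 = genuine trial (same speaker), 0 = impostor trial.
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.3, 0.7, 0.1, 0.6, 0.2])  # similarity scores

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1 - tpr

# EER is where the false acceptance rate (fpr) crosses the false rejection rate (fnr).
idx = np.argmin(np.abs(fpr - fnr))
eer = (fpr[idx] + fnr[idx]) / 2
print(f"EER ~ {eer:.3f}")
```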


Question 8

What is Residual Vector Quantization (RVQ) used for in audio AI?

A) Reducing the sample rate of audio signals
B) Iteratively quantizing residual errors to convert continuous audio representations into discrete tokens at multiple levels of detail
C) Computing residual connections in transformer models
D) Normalizing audio volume levels

Answer: B
Explanation: RVQ works by first quantizing the latent representation, then quantizing the residual (error), then quantizing the residual of that residual, and so on. Each level captures finer details. The resulting discrete tokens can then be modeled autoregressively by transformer language models, as in MusicGen.
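
A toy sketch of the residual quantization loop in NumPy; the codebooks here are random, whereas real codecs such as EnCodec learn them jointly with the encoder and decoder:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x with a stack of codebooks; each level quantizes the previous residual."""
    residual = x.copy()
    indices, reconstruction = [], np.zeros_like(x)
    for codebook in codebooks:                       # each codebook: (num_codes, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                  # nearest code at this level
        indices.append(idx)
        reconstruction += codebook[idx]
        residual = residual - codebook[idx]          # pass the remaining error down
    return indices, reconstruction

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]  # 4 levels, 256 codes each
x = rng.normal(size=8)
tokens, x_hat = rvq_encode(x, codebooks)
print(tokens, np.linalg.norm(x - x_hat))  # one token per level; error shrinks with depth
```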


Question 9

What is SpecAugment?

A) A method for augmenting the spectrogram by masking blocks of frequency channels and time steps
B) A technique for increasing the resolution of spectrograms
C) An algorithm for spectrogram denoising
D) A method for converting spectrograms to waveforms

Answer: A
Explanation: SpecAugment (Park et al., 2019) applies random masking to spectrograms along the frequency and time dimensions during training. This simple augmentation technique significantly improves ASR model robustness and has become standard practice in audio model training.
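
A minimal NumPy version of the two masking operations (the original recipe also includes time warping, omitted here; mask sizes are illustrative):

```python
import numpy as np

def spec_augment(spec, freq_mask=27, time_mask=100, rng=None):
    """Zero out one random block of frequency bins and one block of time steps."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    n_mels, n_frames = spec.shape

    f = rng.integers(0, freq_mask + 1)                 # frequency mask width
    f0 = rng.integers(0, max(n_mels - f, 1))
    spec[f0:f0 + f, :] = 0.0

    t = rng.integers(0, min(time_mask, n_frames) + 1)  # time mask width
    t0 = rng.integers(0, max(n_frames - t, 1))
    spec[:, t0:t0 + t] = 0.0
    return spec

masked = spec_augment(np.random.randn(80, 300))
print(masked.shape)  # (80, 300), with two zeroed-out blocks
```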


Question 10

Why does wav2vec 2.0 use a quantization module during pre-training?

A) To reduce memory usage during training
B) To create discrete targets for the contrastive prediction task, since continuous targets would make contrastive learning ill-defined
C) To convert audio to text tokens
D) To speed up inference

Answer: B
Explanation: wav2vec 2.0's contrastive learning requires discrete targets (the model must identify the true quantized representation among distractors). The quantization module discretizes continuous latent representations using product quantization, providing well-defined targets for the contrastive loss.
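
A toy illustration of the idea in PyTorch: the contrastive (InfoNCE-style) loss asks the model to pick the true quantized target among distractors. All tensors below are random stand-ins rather than real model outputs:

```python
import torch
import torch.nn.functional as F

dim, num_distractors = 256, 100
context = torch.randn(dim)                        # transformer output at a masked frame
true_q = torch.randn(dim)                         # quantized latent for that frame (the target)
distractors = torch.randn(num_distractors, dim)   # quantized latents sampled from other frames

candidates = torch.cat([true_q.unsqueeze(0), distractors], dim=0)
sims = F.cosine_similarity(context.unsqueeze(0), candidates, dim=-1) / 0.1  # temperature-scaled

# Index 0 is the true quantized target; cross-entropy over similarities is the contrastive loss.
loss = F.cross_entropy(sims.unsqueeze(0), torch.tensor([0]))
print(loss.item())
```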


Question 11

What is the primary advantage of end-to-end TTS systems like VITS over pipeline TTS systems?

A) They require less training data
B) They are always faster at inference
C) They jointly optimize all components, avoiding error propagation between separate stages
D) They produce higher sample rate output

Answer: C
Explanation: End-to-end systems like VITS train the text encoder, alignment mechanism, and audio decoder jointly, allowing gradients to flow through the entire system. This avoids the error propagation that occurs when separately trained components are cascaded in pipeline systems.


Question 12

What is speaker diarization?

A) Converting speech to text for a known speaker
B) Determining "who spoke when" in a multi-speaker recording
C) Enhancing the voice of a specific speaker in noisy audio
D) Cloning a speaker's voice for TTS

Answer: B
Explanation: Speaker diarization segments a multi-speaker recording into speaker-homogeneous regions, answering "who spoke when." It combines speaker embedding extraction with clustering or end-to-end neural approaches. The standard metric is Diarization Error Rate (DER).
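
A sketch of the classic embedding-plus-clustering recipe; the embeddings below are random placeholders for what a speaker encoder (e.g., x-vector or ECAPA-TDNN) would produce, and the number of speakers is assumed known:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Placeholder per-segment speaker embeddings: a 1.5 s window every 0.75 s.
segment_times = [(i * 0.75, i * 0.75 + 1.5) for i in range(20)]
embeddings = np.random.randn(20, 192)   # stand-in for real speaker embeddings

# Cluster segments into speakers (real systems also estimate the speaker count).
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)

# "Who spoke when": one speaker label per time segment.
for (start, end), spk in zip(segment_times, labels):
    print(f"{start:5.2f}-{end:5.2f}s  speaker_{spk}")
```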


Question 13

How does CLAP (Contrastive Language-Audio Pretraining) enable zero-shot audio classification?

A) By training on every possible audio class
B) By learning a shared embedding space for audio and text, allowing comparison of audio embeddings with text descriptions of classes
C) By using a large language model to describe audio
D) By converting audio to text first

Answer: B
Explanation: CLAP learns aligned audio and text embeddings through contrastive learning, similar to CLIP for images and text. At inference, audio clips are compared against text embeddings of class descriptions (e.g., "a dog barking") using cosine similarity, enabling classification without training on the target classes.
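
The zero-shot step itself reduces to cosine similarity between one audio embedding and the text embeddings of the class prompts; sketched below with random placeholder vectors standing in for CLAP encoder outputs:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Stand-ins for CLAP outputs: one audio embedding and one text embedding per class prompt.
audio_emb = np.random.randn(512)
class_prompts = ["a dog barking", "rain falling", "a car engine"]
text_embs = {p: np.random.randn(512) for p in class_prompts}   # placeholder encodings

# Predicted class = the prompt whose text embedding is most similar to the audio embedding.
scores = {p: cosine(audio_emb, e) for p, e in text_embs.items()}
print(max(scores, key=scores.get))
```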


Question 14

What is the Word Error Rate (WER) formula?

A) WER = (Substitutions + Deletions) / Total Reference Words
B) WER = (Substitutions + Deletions + Insertions) / Total Reference Words
C) WER = Incorrect Words / Total Hypothesis Words
D) WER = 1 - (Correct Words / Total Reference Words)

Answer: B
Explanation: WER = (S + D + I) / N, where S = substitutions, D = deletions, I = insertions, and N = total words in the reference. Note that WER can exceed 100% if there are more insertions than correct words.
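
A compact reference implementation via word-level edit distance (in practice, libraries such as jiwer are commonly used); a minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,     # substitution / match
                          d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1)           # insertion
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ~ 0.167
```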


Question 15

Why do modern neural vocoders like HiFi-GAN use multiple discriminators?

A) To increase the model size
B) Multi-period and multi-scale discriminators evaluate different aspects of audio quality (periodicity, spectral structure), providing a more comprehensive training signal
C) Each discriminator handles a different audio format
D) Multiple discriminators are faster than a single large one

Answer: B
Explanation: HiFi-GAN uses multi-period discriminators (evaluating audio at different periodic intervals) and multi-scale discriminators (evaluating at different resolutions). This provides the generator with rich training signals about different aspects of audio quality, producing more natural-sounding speech.
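
The core trick of a multi-period discriminator is a reshape: the 1-D waveform is folded into a 2-D grid whose width is the period, so 2-D convolutions can see periodic structure. A rough sketch of just that folding step (the period values follow the HiFi-GAN paper; the discriminator layers themselves are omitted):

```python
import torch
import torch.nn.functional as F

def fold_by_period(waveform, period):
    """Reshape (batch, 1, T) audio into (batch, 1, T // period, period) for 2-D convolutions."""
    b, c, t = waveform.shape
    if t % period != 0:                               # pad so the length divides the period
        waveform = F.pad(waveform, (0, period - t % period), mode="reflect")
        t = waveform.shape[-1]
    return waveform.view(b, c, t // period, period)

audio = torch.randn(1, 1, 22050)
for period in (2, 3, 5, 7, 11):                       # HiFi-GAN's multi-period discriminator periods
    print(period, fold_by_period(audio, period).shape)
```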


Scoring Guide

Score   Level          Recommendation
14-15   Expert         Ready for production audio AI systems
11-13   Advanced       Strong foundation, explore advanced topics
8-10    Intermediate   Good understanding, practice implementations
5-7     Developing     Review core concepts, especially ASR and representations
0-4     Beginning      Re-read the chapter focusing on audio fundamentals