Chapter 29: Further Reading
Speech Recognition
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). "Robust Speech Recognition via Large-Scale Weak Supervision." ICML 2023. The Whisper paper, demonstrating that training on 680K hours of weakly supervised audio produces a robust multilingual ASR system that generalizes across domains and conditions (a usage sketch follows this list).
- Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." NeurIPS 2020. Introduced contrastive self-supervised pre-training for speech, enabling strong ASR with minimal labeled data through quantized latent representations.
- Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006. The foundational CTC paper that solved the alignment problem for sequence-to-sequence tasks without requiring frame-level annotations (see the loss example after this list).
- Gulati, A., Qin, J., Chiu, C.-C., et al. (2020). "Conformer: Convolution-Augmented Transformer for Speech Recognition." Interspeech 2020. Combines convolution and self-attention to capture both local and global dependencies in audio, achieving state-of-the-art ASR results.
- Zhang, Y., Park, D. S., Han, W., et al. (2022). "BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition." IEEE Journal of Selected Topics in Signal Processing. Explores scaling self-supervised speech models to billions of parameters with large unlabeled corpora.
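To get a feel for the Whisper line of work, here is a minimal transcription sketch through the Hugging Face pipeline API; the checkpoint size and the audio file path are placeholder choices, not recommendations.

```python
from transformers import pipeline

# "openai/whisper-small" is one of several released checkpoint sizes;
# the file path is a hypothetical local recording.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("meeting_clip.wav")
print(result["text"])
```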
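And because CTC underpins many of the systems above, here is a toy use of PyTorch's built-in CTCLoss; every shape and value is illustrative. The point is that no frame-level alignment between inputs and labels is ever supplied.

```python
import torch
import torch.nn as nn

# Toy setup: 50 encoder frames, batch of 2, 20 output classes (index 0 = blank).
logits = torch.randn(50, 2, 20, requires_grad=True)   # (time, batch, classes)
log_probs = logits.log_softmax(dim=-1)

targets = torch.randint(1, 20, (2, 10))               # label sequences, no blanks
input_lengths = torch.full((2,), 50)
target_lengths = torch.full((2,), 10)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                       # gradients without alignments
```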
Audio Representations and Processing
- Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). "A Scale for the Measurement of the Psychological Magnitude of Pitch." Journal of the Acoustical Society of America, 8(3), 185–190. The original paper defining the mel scale based on psychoacoustic experiments with human listeners (the commonly used analytic fit is sketched after this list).
- Park, D. S., Chan, W., Zhang, Y., et al. (2019). "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition." Interspeech 2019. Introduced frequency and time masking on spectrograms as a simple yet highly effective augmentation technique for ASR training (sketched below).
- Défossez, A., Copet, J., Synnaeve, G., & Adi, Y. (2023). "High Fidelity Neural Audio Compression." TMLR 2023. Introduces EnCodec, a neural audio codec that achieves high-quality audio compression at low bitrates using residual vector quantization (illustrated after this list).
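For reference, the widely used HTK-style analytic fit to the mel scale; the 2595 and 700 constants come from later curve fits, not from the 1937 paper itself.

```python
import numpy as np

def hz_to_mel(f_hz):
    # HTK-style fit: roughly linear below 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))  # ~1000 mels: 1 kHz anchors the scale
```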
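A minimal sketch of SpecAugment's two masking operations, assuming a (mels, frames) NumPy array and the zero-fill variant of masking; mask counts and widths are illustrative defaults.

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, max_f=15, n_time_masks=2, max_t=40):
    """Zero out random frequency bands and time spans of a (mels, frames) array."""
    rng = np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, max_f + 1)           # mask width in mel bins
        f0 = rng.integers(0, n_mels - f + 1)
        out[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, max_t + 1)           # mask width in frames
        t0 = rng.integers(0, n_frames - t + 1)
        out[:, t0:t0 + t] = 0.0
    return out
```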
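And a toy illustration of the residual vector quantization idea behind EnCodec, shown for a single vector and plain NumPy codebooks; the real codec quantizes whole latent sequences with learned codebooks.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Each codebook quantizes the residual the previous stages left behind."""
    residual, codes = x, []
    for cb in codebooks:                          # cb: (codebook_size, dim)
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]             # pass the remainder downstream
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is simply the sum of the chosen codewords.
    return sum(cb[i] for cb, i in zip(codebooks, codes))
```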
Text-to-Speech
- Kim, J., Kong, J., & Son, J. (2021). "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech." ICML 2021. The VITS paper, combining VAE, normalizing flows, and adversarial training for fully end-to-end TTS with high naturalness.
- Kong, J., Kim, J., & Bae, J. (2020). "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis." NeurIPS 2020. Introduced multi-period and multi-scale discriminators for high-fidelity neural vocoding at real-time speeds (the period reshaping trick is sketched after this list).
- Shen, J., Pang, R., Weiss, R. J., et al. (2018). "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions." ICASSP 2018. The Tacotron 2 paper, establishing the mel spectrogram prediction + vocoder pipeline that dominated TTS for several years.
- Wang, C., Chen, S., Wu, Y., et al. (2023). "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers." arXiv preprint arXiv:2301.02111. The VALL-E paper, demonstrating that audio language modeling with neural codec tokens enables zero-shot TTS from a 3-second enrollment recording.
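As a flavor of HiFi-GAN's multi-period discriminator, the sketch below shows the reshaping step that lets plain 2D convolutions examine samples spaced a fixed period apart; the reflection padding is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def period_view(wav, period):
    """Fold a waveform (batch, 1, T) into (batch, 1, T // period, period) so a
    2D-conv discriminator sees samples that lie `period` steps apart."""
    b, c, t = wav.shape
    if t % period:                                # pad so the length divides evenly
        pad = period - t % period
        wav = F.pad(wav, (0, pad), mode="reflect")
        t += pad
    return wav.view(b, c, t // period, period)

print(period_view(torch.randn(1, 1, 8000), 5).shape)  # torch.Size([1, 1, 1600, 5])
```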
Speaker Recognition
- Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification." Interspeech 2020. State-of-the-art speaker embedding architecture using channel attention and multi-scale aggregation for robust speaker verification.
- Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). "X-Vectors: Robust DNN Embeddings for Speaker Recognition." ICASSP 2018. Introduced the x-vector framework that became the standard for neural speaker embeddings (its statistics-pooling step is sketched after this list).
- Park, T. J., Kanda, N., Dimitriadis, D., et al. (2022). "A Review of Speaker Diarization: Recent Advances with Deep Learning." Computer Speech and Language, 76, 101420. Comprehensive survey of speaker diarization methods, from traditional clustering-based approaches to modern end-to-end neural ones.
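A minimal sketch of the statistics-pooling step at the heart of the x-vector recipe, followed by the cosine scoring typically used at verification time; all dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def stats_pooling(frames):
    """Collapse frame-level features (batch, feat_dim, time) into a fixed-length
    utterance vector by concatenating the mean and std over time."""
    return torch.cat([frames.mean(dim=-1), frames.std(dim=-1)], dim=-1)

# Two utterances of different lengths still yield same-size embeddings,
# which are then compared with cosine similarity against a threshold.
emb_a = stats_pooling(torch.randn(1, 512, 300))
emb_b = stats_pooling(torch.randn(1, 512, 280))
print(F.cosine_similarity(emb_a, emb_b).item())
```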
Audio Classification and Understanding
- Gong, Y., Chung, Y.-A., & Glass, J. (2021). "AST: Audio Spectrogram Transformer." Interspeech 2021. Applied the Vision Transformer architecture to audio spectrograms, achieving state-of-the-art results on multiple audio classification benchmarks (the patch tokenization is sketched after this list).
- Gemmeke, J. F., Ellis, D. P. W., Freedman, D., et al. (2017). "Audio Set: An Ontology and Human-Labeled Dataset for Audio Events." ICASSP 2017. The AudioSet dataset of over 2 million audio clips with 527 sound event labels, which has become the standard pre-training corpus for audio models.
- Wu, Y., Chen, K., Zhang, T., et al. (2023). "Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation." ICASSP 2023. The CLAP paper, extending CLIP-style contrastive learning to audio-text pairs for zero-shot audio classification and retrieval (the contrastive objective is sketched after this list).
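As a sketch of AST's input tokenization, the snippet below cuts a spectrogram into flattened ViT-style patches; it uses non-overlapping patches for simplicity, whereas AST itself strides them with overlap.

```python
import torch

def patchify(spec, patch=16):
    """Turn a (batch, 1, mels, frames) spectrogram into (batch, n_patches,
    patch * patch) tokens ready for a linear projection and a Transformer."""
    b = spec.shape[0]
    tiles = spec.unfold(2, patch, patch).unfold(3, patch, patch)
    return tiles.reshape(b, -1, patch * patch)

tokens = patchify(torch.randn(4, 1, 128, 1024))
print(tokens.shape)  # torch.Size([4, 512, 256]): 8 x 64 patches per clip
```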
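And a compact version of the CLIP-style symmetric contrastive objective that CLAP applies to paired audio and text embeddings; the temperature here is an illustrative constant rather than the paper's learned parameter.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: each clip should score highest against its own
    caption (rows) and each caption against its own clip (columns)."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature
    labels = torch.arange(len(a), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```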
Music AI
- Copet, J., Kreuk, F., Gat, I., et al. (2023). "Simple and Controllable Music Generation." NeurIPS 2023. The MusicGen paper, demonstrating efficient music generation using a single-stage language model over residual vector quantized audio tokens (a usage sketch follows this list).
- Agostinelli, A., Denk, T. I., Borsos, Z., et al. (2023). "MusicLM: Generating Music from Text." arXiv preprint arXiv:2301.11325. Google's approach to text-to-music generation using a hierarchical sequence-to-sequence model with audio tokens.
- Dhariwal, P., Jun, H., Payne, C., et al. (2020). "Jukebox: A Generative Model for Music." arXiv preprint arXiv:2005.00341. OpenAI's approach to generating music with singing in the raw audio domain using VQ-VAE and autoregressive transformers.
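For a hands-on starting point, a sketch of running a released MusicGen checkpoint through the transformers library (assuming a recent transformers release; the prompt and token budget are placeholders).

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["lo-fi beat with a mellow piano line"],
                   padding=True, return_tensors="pt")
audio = model.generate(**inputs, max_new_tokens=256)  # ~5 s at 50 tokens/second
print(audio.shape)  # (batch, channels, samples)
```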
Practical Resources
- Hugging Face. "Audio Course." Available at: https://huggingface.co/learn/audio-course. A hands-on course covering ASR, TTS, and audio classification with the Transformers library.
- Piczak, K. J. (2015). "ESC: Dataset for Environmental Sound Classification." ACM Multimedia 2015. The ESC-50 dataset of 2,000 environmental sound recordings across 50 categories, widely used for benchmarking audio classifiers.