Chapter 29: Further Reading

Speech Recognition

  • Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). "Robust Speech Recognition via Large-Scale Weak Supervision." ICML 2023. The Whisper paper, demonstrating that training on 680K hours of weakly supervised audio produces a robust multilingual ASR system that generalizes across domains and recording conditions.

  • Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." NeurIPS 2020. Introduced contrastive self-supervised pre-training for speech, enabling strong ASR with minimal labeled data through quantized latent representations.

  • Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006. The foundational CTC paper, which made alignment-free training possible for sequence transcription by marginalizing over all alignments between input frames and output labels, removing the need for frame-level annotations (a minimal PyTorch sketch follows this list).

  • Gulati, A., Qin, J., Chiu, C.-C., et al. (2020). "Conformer: Convolution-Augmented Transformer for Speech Recognition." Interspeech 2020. Combines convolution and self-attention to capture both local and global dependencies in audio, achieving state-of-the-art ASR results.

  • Zhang, Y., Park, D. S., Han, W., et al. (2022). "BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition." IEEE Journal of Selected Topics in Signal Processing. Explores scaling self-supervised speech models to billions of parameters with large unlabeled corpora.
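
Of the entries above, CTC is the most self-contained to try directly, since the loss is built into PyTorch. A minimal sketch of computing it on random tensors; the shapes and the blank index are illustrative, not taken from the paper:

    import torch
    import torch.nn as nn

    T, N, C = 50, 4, 28   # input frames, batch size, label set size (incl. blank)
    S = 12                # length of each target label sequence

    # Per-frame log-probabilities over labels, as an acoustic model would emit.
    log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)

    # Integer label sequences; index 0 is reserved for the CTC blank symbol.
    targets = torch.randint(1, C, (N, S), dtype=torch.long)
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), S, dtype=torch.long)

    # The loss marginalizes over every frame-level alignment consistent with
    # the target sequence, so no per-frame annotation is needed.
    ctc = nn.CTCLoss(blank=0)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()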

Audio Representations and Processing

  • Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). "A Scale for the Measurement of the Psychological Magnitude of Pitch." Journal of the Acoustical Society of America, 8(3), 185–190. The original paper defining the mel scale from psychoacoustic experiments with human listeners (the standard conversion formula is sketched in code after this list).

  • Park, D. S., Chan, W., Zhang, Y., et al. (2019). "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition." Interspeech 2019. Introduced frequency and time masking on spectrograms as a simple yet highly effective augmentation technique for ASR training (also sketched after this list).

  • Défossez, A., Copet, J., Synnaeve, G., & Adi, Y. (2023). "High Fidelity Neural Audio Compression." Transactions on Machine Learning Research, 2023. Introduces EnCodec, a neural audio codec using residual vector quantization that achieves high-quality audio compression at low bitrates (a toy sketch of the quantization scheme follows below).
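
Two of the ideas above are compact enough to sketch in a few lines of NumPy. The Hz-to-mel conversion below uses the widely cited 2595 · log10(1 + f/700) fit, a later parameterization of Stevens et al.'s data rather than a formula from the 1937 paper itself, and the masking function follows the spirit of SpecAugment; mask sizes are illustrative:

    import numpy as np

    def hz_to_mel(f_hz):
        # Common fit to the mel scale: roughly linear below 1 kHz,
        # logarithmic above.
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    rng = np.random.default_rng(0)
    spec = rng.random((80, 300))  # stand-in log-mel spectrogram: 80 bands x 300 frames

    def spec_augment(spec, max_freq_mask=15, max_time_mask=40):
        """Zero one random frequency band and one random time span."""
        out = spec.copy()
        n_mels, n_frames = out.shape
        f = int(rng.integers(0, max_freq_mask + 1))
        f0 = int(rng.integers(0, n_mels - f + 1))
        out[f0:f0 + f, :] = 0.0
        t = int(rng.integers(0, max_time_mask + 1))
        t0 = int(rng.integers(0, n_frames - t + 1))
        out[:, t0:t0 + t] = 0.0
        return out

    augmented = spec_augment(spec)
    print(round(float(hz_to_mel(1000.0))))  # ~1000: 1000 Hz maps near 1000 mel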

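Residual vector quantization, the mechanism behind EnCodec-style codecs, is also easy to illustrate: each quantizer stage encodes whatever the previous stages missed. A toy sketch with random, untrained codebooks; dimensions and codebook sizes are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, codebook_size, n_stages = 8, 16, 4

    # One small random codebook per quantizer stage.
    codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_stages)]

    def rvq_encode(x, codebooks):
        """Greedily quantize x, then quantize the residual, stage by stage."""
        residual, codes = x, []
        for cb in codebooks:
            idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
            codes.append(idx)
            residual = residual - cb[idx]
        return codes

    def rvq_decode(codes, codebooks):
        return sum(cb[i] for cb, i in zip(codebooks, codes))

    x = rng.normal(size=dim)
    codes = rvq_encode(x, codebooks)   # a handful of integers per frame
    x_hat = rvq_decode(codes, codebooks)
    # Reconstruction error; with trained codebooks, each extra stage reduces it.
    print(np.linalg.norm(x - x_hat))
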
Text-to-Speech

  • Kim, J., Kong, J., & Son, J. (2021). "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech." ICML 2021. The VITS paper, combining VAE, normalizing flows, and adversarial training for fully end-to-end TTS with high naturalness.

  • Kong, J., Kim, J., & Bae, J. (2020). "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis." NeurIPS 2020. Introduced multi-period and multi-scale discriminators for high-fidelity neural vocoding at real-time speeds.

  • Shen, J., Pang, R., Weiss, R. J., et al. (2018). "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions." ICASSP 2018. The Tacotron 2 paper, establishing the mel spectrogram prediction + vocoder pipeline that dominated TTS for several years.

  • Wang, C., Chen, S., Wu, Y., et al. (2023). "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers." arXiv preprint arXiv:2301.02111. The VALL-E paper, demonstrating that audio language modeling with neural codec tokens enables zero-shot TTS from a 3-second enrollment.

Speaker Recognition

  • Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification." Interspeech 2020. State-of-the-art speaker embedding architecture using channel attention and multi-scale aggregation for robust speaker verification (a cosine-scoring sketch follows this list).

  • Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). "X-Vectors: Robust DNN Embeddings for Speaker Recognition." ICASSP 2018. Introduced the x-vector framework that became the standard for neural speaker embeddings.

  • Park, T. J., Kanda, N., Dimitriadis, D., et al. (2022). "A Review of Speaker Diarization: Recent Advances with Deep Learning." Computer Speech & Language, 72, 101317. Comprehensive survey of speaker diarization methods, from traditional clustering-based approaches to modern end-to-end neural ones.
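
However the embeddings above are produced, verification itself reduces to comparing two fixed-length vectors against a threshold. A minimal cosine-scoring sketch; the 192-dimensional random vectors stand in for real x-vector or ECAPA-TDNN embeddings, and the 0.5 threshold is a placeholder that real systems calibrate on held-out trials:

    import numpy as np

    def cosine_score(emb_a, emb_b):
        """Cosine similarity between two speaker embeddings."""
        a = emb_a / np.linalg.norm(emb_a)
        b = emb_b / np.linalg.norm(emb_b)
        return float(a @ b)

    # Stand-ins for embeddings of an enrollment and a test utterance.
    rng = np.random.default_rng(0)
    enroll, test = rng.normal(size=192), rng.normal(size=192)

    score = cosine_score(enroll, test)
    same_speaker = score > 0.5   # placeholder threshold; tune per system
    print(score, same_speaker)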

Audio Classification and Understanding

  • Gong, Y., Chung, Y.-A., & Glass, J. (2021). "AST: Audio Spectrogram Transformer." Interspeech 2021. Applied Vision Transformer architecture to audio spectrograms, achieving state-of-the-art on multiple audio classification benchmarks.

  • Gemmeke, J. F., Ellis, D. P. W., Freedman, D., et al. (2017). "Audio Set: An Ontology and Human-Labeled Dataset for Audio Events." ICASSP 2017. The AudioSet dataset of over 2 million human-labeled 10-second YouTube clips covering 527 sound event classes, which has become a standard pre-training corpus for audio models.

  • Wu, Y., Chen, K., Zhang, T., et al. (2023). "Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation." ICASSP 2023. The CLAP paper, extending CLIP-style contrastive learning to audio-text pairs for zero-shot audio classification and retrieval (a schematic sketch follows this list).
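
The zero-shot recipe behind CLAP mirrors CLIP: embed the audio clip and a set of candidate text prompts into a shared space, then pick the most similar prompt. A schematic sketch in which embed_audio and embed_text are hypothetical stand-ins for the model's two trained encoders:

    import numpy as np

    rng = np.random.default_rng(0)

    def embed_audio(waveform):
        # Hypothetical stand-in for a contrastively trained audio encoder.
        return rng.normal(size=512)

    def embed_text(prompt):
        # Hypothetical stand-in for the matching text encoder.
        return rng.normal(size=512)

    def zero_shot_classify(waveform, labels):
        """Pick the label whose prompt embedding is closest to the audio."""
        a = embed_audio(waveform)
        a = a / np.linalg.norm(a)
        scores = []
        for label in labels:
            t = embed_text(f"the sound of a {label}")
            scores.append(a @ (t / np.linalg.norm(t)))
        return labels[int(np.argmax(scores))]

    labels = ["dog barking", "siren", "acoustic guitar"]
    print(zero_shot_classify(np.zeros(48000), labels))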

Music AI

  • Copet, J., Kreuk, F., Gat, I., et al. (2023). "Simple and Controllable Music Generation." NeurIPS 2023. The MusicGen paper, demonstrating efficient music generation using a single-stage language model over residual vector quantized audio tokens.

  • Agostinelli, A., Denk, T. I., Borsos, Z., et al. (2023). "MusicLM: Generating Music from Text." arXiv preprint arXiv:2301.11325. Google's approach to text-to-music generation using a hierarchical sequence-to-sequence model with audio tokens.

  • Dhariwal, P., Jun, H., Payne, C., et al. (2020). "Jukebox: A Generative Model for Music." arXiv preprint arXiv:2005.00341. OpenAI's approach to generating music with singing in the raw audio domain using VQ-VAE and autoregressive transformers.

Practical Resources

  • Hugging Face. "Audio Course." Available at: https://huggingface.co/learn/audio-course. A hands-on course covering ASR, TTS, and audio classification with the Transformers library (a pipeline sketch follows this list).

  • Piczak, K. J. (2015). "ESC: Dataset for Environmental Sound Classification." ACM Multimedia 2015. The ESC-50 dataset of 2,000 five-second environmental sound recordings across 50 categories, widely used for benchmarking audio classifiers.
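
For hands-on work with several of the models cited in this chapter, the Transformers pipeline API wraps them in a few lines. A brief sketch; the file names are placeholders, and the model IDs are published Hub checkpoints that can be swapped for larger or smaller variants:

    from transformers import pipeline

    # Whisper for speech recognition (see the Radford et al. entry above).
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    print(asr("speech.wav")["text"])

    # Audio classification with an AudioSet-pretrained AST checkpoint.
    clf = pipeline("audio-classification",
                   model="MIT/ast-finetuned-audioset-10-10-0.4593")
    print(clf("sound.wav")[:3])   # top predictions with scores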