Chapter 29: Further Reading
Speech Recognition
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). "Robust Speech Recognition via Large-Scale Weak Supervision." ICML 2023. The Whisper paper, demonstrating that training on 680K hours of weakly supervised audio produces a robust multilingual ASR system that generalizes across domains and conditions (a usage sketch follows this list).
- Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." NeurIPS 2020. Introduced contrastive self-supervised pre-training for speech, enabling strong ASR with minimal labeled data through quantized latent representations.
- Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006. The foundational CTC paper that solved the alignment problem for sequence-to-sequence tasks without requiring frame-level annotations (see the loss example after this list).
- Gulati, A., Qin, J., Chiu, C.-C., et al. (2020). "Conformer: Convolution-Augmented Transformer for Speech Recognition." Interspeech 2020. Combines convolution and self-attention to capture both local and global dependencies in audio, achieving state-of-the-art ASR results.
- Zhang, Y., Park, D. S., Han, W., et al. (2022). "BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition." IEEE Journal of Selected Topics in Signal Processing. Explores scaling self-supervised speech models to billions of parameters with large unlabeled corpora.
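To get a feel for the Whisper line of work, here is a minimal transcription sketch through the Hugging Face pipeline API; the checkpoint size and the audio file path are placeholder choices, not recommendations.

```python
from transformers import pipeline

# "openai/whisper-small" is one of several released checkpoint sizes;
# the file path is a hypothetical local recording.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("meeting_clip.wav")
print(result["text"])
```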
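And because CTC underpins many of the systems above, here is a toy use of PyTorch's built-in CTCLoss; every shape and value is illustrative. The point is that no frame-level alignment between inputs and labels is ever supplied.

```python
import torch
import torch.nn as nn

# Toy setup: 50 encoder frames, batch of 2, 20 output classes (index 0 = blank).
logits = torch.randn(50, 2, 20, requires_grad=True)   # (time, batch, classes)
log_probs = logits.log_softmax(dim=-1)

targets = torch.randint(1, 20, (2, 10))               # label sequences, no blanks
input_lengths = torch.full((2,), 50)
target_lengths = torch.full((2,), 10)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                       # gradients without alignments
```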
Audio Representations and Processing
- Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). "A Scale for the Measurement of the Psychological Magnitude of Pitch." Journal of the Acoustical Society of America, 8(3), 185–190. The original paper defining the mel scale based on psychoacoustic experiments with human listeners (the commonly used analytic fit is sketched after this list).
- Park, D. S., Chan, W., Zhang, Y., et al. (2019). "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition." Interspeech 2019. Introduced frequency and time masking on spectrograms as a simple yet highly effective augmentation technique for ASR training (sketched below).
- Défossez, A., Copet, J., Synnaeve, G., & Adi, Y. (2023). "High Fidelity Neural Audio Compression." TMLR 2023. Introduces EnCodec, a neural audio codec that achieves high-quality audio compression at low bitrates using residual vector quantization (illustrated after this list).
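For reference, the widely used HTK-style analytic fit to the mel scale; the 2595 and 700 constants come from later curve fits, not from the 1937 paper itself.

```python
import numpy as np

def hz_to_mel(f_hz):
    # HTK-style fit: roughly linear below 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))  # ~1000 mels: 1 kHz anchors the scale
```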
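A minimal sketch of SpecAugment's two masking operations, assuming a (mels, frames) NumPy array and the zero-fill variant of masking; mask counts and widths are illustrative defaults.

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, max_f=15, n_time_masks=2, max_t=40):
    """Zero out random frequency bands and time spans of a (mels, frames) array."""
    rng = np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, max_f + 1)           # mask width in mel bins
        f0 = rng.integers(0, n_mels - f + 1)
        out[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, max_t + 1)           # mask width in frames
        t0 = rng.integers(0, n_frames - t + 1)
        out[:, t0:t0 + t] = 0.0
    return out
```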
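And a toy illustration of the residual vector quantization idea behind EnCodec, shown for a single vector and plain NumPy codebooks; the real codec quantizes whole latent sequences with learned codebooks.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Each codebook quantizes the residual the previous stages left behind."""
    residual, codes = x, []
    for cb in codebooks:                          # cb: (codebook_size, dim)
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]             # pass the remainder downstream
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is simply the sum of the chosen codewords.
    return sum(cb[i] for cb, i in zip(codebooks, codes))
```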
Text-to-Speech
- Kim, J., Kong, J., & Son, J. (2021). "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech." ICML 2021. The VITS paper, combining VAE, normalizing flows, and adversarial training for fully end-to-end TTS with high naturalness.
- Kong, J., Kim, J., & Bae, J. (2020). "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis." NeurIPS 2020. Introduced multi-period and multi-scale discriminators for high-fidelity neural vocoding at real-time speeds (the period reshaping trick is sketched after this list).
- Shen, J., Pang, R., Weiss, R. J., et al. (2018). "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions." ICASSP 2018. The Tacotron 2 paper, establishing the mel spectrogram prediction + vocoder pipeline that dominated TTS for several years.
- Wang, C., Chen, S., Wu, Y., et al. (2023). "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers." arXiv preprint arXiv:2301.02111. The VALL-E paper, demonstrating that audio language modeling with neural codec tokens enables zero-shot TTS from a 3-second enrollment recording.
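As a flavor of HiFi-GAN's multi-period discriminator, the sketch below shows the reshaping step that lets plain 2D convolutions examine samples spaced a fixed period apart; the reflection padding is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def period_view(wav, period):
    """Fold a waveform (batch, 1, T) into (batch, 1, T // period, period) so a
    2D-conv discriminator sees samples that lie `period` steps apart."""
    b, c, t = wav.shape
    if t % period:                                # pad so the length divides evenly
        pad = period - t % period
        wav = F.pad(wav, (0, pad), mode="reflect")
        t += pad
    return wav.view(b, c, t // period, period)

print(period_view(torch.randn(1, 1, 8000), 5).shape)  # torch.Size([1, 1, 1600, 5])
```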
Speaker Recognition
- Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification." Interspeech 2020. State-of-the-art speaker embedding architecture using channel attention and multi-scale aggregation for robust speaker verification.
- Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). "X-Vectors: Robust DNN Embeddings for Speaker Recognition." ICASSP 2018. Introduced the x-vector framework that became the standard for neural speaker embeddings (its statistics-pooling step is sketched after this list).
- Park, T. J., Kanda, N., Dimitriadis, D., et al. (2022). "A Review of Speaker Diarization: Recent Advances with Deep Learning." Computer Speech and Language, 76, 101420. Comprehensive survey of speaker diarization methods, from traditional clustering-based approaches to modern end-to-end neural ones.
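A minimal sketch of the statistics-pooling step at the heart of the x-vector recipe, followed by the cosine scoring typically used at verification time; all dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def stats_pooling(frames):
    """Collapse frame-level features (batch, feat_dim, time) into a fixed-length
    utterance vector by concatenating the mean and std over time."""
    return torch.cat([frames.mean(dim=-1), frames.std(dim=-1)], dim=-1)

# Two utterances of different lengths still yield same-size embeddings,
# which are then compared with cosine similarity against a threshold.
emb_a = stats_pooling(torch.randn(1, 512, 300))
emb_b = stats_pooling(torch.randn(1, 512, 280))
print(F.cosine_similarity(emb_a, emb_b).item())
```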
Audio Classification and Understanding
- Gong, Y., Chung, Y.-A., & Glass, J. (2021). "AST: Audio Spectrogram Transformer." Interspeech 2021. Applied the Vision Transformer architecture to audio spectrograms, achieving state-of-the-art results on multiple audio classification benchmarks (the patch tokenization is sketched after this list).
- Gemmeke, J. F., Ellis, D. P. W., Freedman, D., et al. (2017). "Audio Set: An Ontology and Human-Labeled Dataset for Audio Events." ICASSP 2017. The AudioSet dataset of over 2 million audio clips with 527 sound event labels, which has become the standard pre-training corpus for audio models.
- Wu, Y., Chen, K., Zhang, T., et al. (2023). "Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation." ICASSP 2023. The CLAP paper, extending CLIP-style contrastive learning to audio-text pairs for zero-shot audio classification and retrieval (the contrastive objective is sketched after this list).
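As a sketch of AST's input tokenization, the snippet below cuts a spectrogram into flattened ViT-style patches; it uses non-overlapping patches for simplicity, whereas AST itself strides them with overlap.

```python
import torch

def patchify(spec, patch=16):
    """Turn a (batch, 1, mels, frames) spectrogram into (batch, n_patches,
    patch * patch) tokens ready for a linear projection and a Transformer."""
    b = spec.shape[0]
    tiles = spec.unfold(2, patch, patch).unfold(3, patch, patch)
    return tiles.reshape(b, -1, patch * patch)

tokens = patchify(torch.randn(4, 1, 128, 1024))
print(tokens.shape)  # torch.Size([4, 512, 256]): 8 x 64 patches per clip
```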
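And a compact version of the CLIP-style symmetric contrastive objective that CLAP applies to paired audio and text embeddings; the temperature here is an illustrative constant rather than the paper's learned parameter.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: each clip should score highest against its own
    caption (rows) and each caption against its own clip (columns)."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature
    labels = torch.arange(len(a), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```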
Music AI
- Copet, J., Kreuk, F., Gat, I., et al. (2023). "Simple and Controllable Music Generation." NeurIPS 2023. The MusicGen paper, demonstrating efficient music generation using a single-stage language model over residual vector quantized audio tokens (a usage sketch follows this list).
- Agostinelli, A., Denk, T. I., Borsos, Z., et al. (2023). "MusicLM: Generating Music from Text." arXiv preprint arXiv:2301.11325. Google's approach to text-to-music generation using a hierarchical sequence-to-sequence model with audio tokens.
- Dhariwal, P., Jun, H., Payne, C., et al. (2020). "Jukebox: A Generative Model for Music." arXiv preprint arXiv:2005.00341. OpenAI's approach to generating music with singing in the raw audio domain using VQ-VAE and autoregressive transformers.
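For a hands-on starting point, a sketch of running a released MusicGen checkpoint through the transformers library (assuming a recent transformers release; the prompt and token budget are placeholders).

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["lo-fi beat with a mellow piano line"],
                   padding=True, return_tensors="pt")
audio = model.generate(**inputs, max_new_tokens=256)  # ~5 s at 50 tokens/second
print(audio.shape)  # (batch, channels, samples)
```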
Practical Resources
- Hugging Face. "Audio Course." Available at: https://huggingface.co/learn/audio-course. A hands-on course covering ASR, TTS, and audio classification with the Transformers library.
- Piczak, K. J. (2015). "ESC: Dataset for Environmental Sound Classification." ACM Multimedia 2015. The ESC-50 dataset of 2,000 environmental sound recordings across 50 categories, widely used for benchmarking audio classifiers.