Case Study 1: Building a Speech-to-Text System with Whisper
Overview
Automatic speech recognition has traditionally required large labeled datasets, language-specific pipelines, and extensive tuning. OpenAI's Whisper changed this landscape by providing a single model that handles multiple languages, accents, and acoustic conditions out of the box. In this case study, you will build a production-ready speech-to-text system using Whisper, handling real-world challenges such as long audio files, noisy recordings, and multilingual content.
Problem Statement
A media company needs to transcribe podcast episodes (30-90 minutes each) in English, Spanish, and French. The system must handle background music, multiple speakers, and varying audio quality. Target Word Error Rate (WER) is below 10% for clean speech and below 20% for noisy conditions.
Approach
Step 1: Model Selection
We evaluate Whisper model sizes on a validation set of 50 audio clips:
| Model | Parameters | English WER | Spanish WER | French WER | RTF (A100) |
|---|---|---|---|---|---|
| tiny | 39M | 14.2% | 18.7% | 19.3% | 0.02 |
| base | 74M | 10.8% | 14.1% | 15.2% | 0.03 |
| small | 244M | 7.3% | 10.5% | 11.1% | 0.08 |
| medium | 769M | 5.9% | 8.2% | 8.8% | 0.18 |
| large-v3 | 1.55B | 4.6% | 6.1% | 6.7% | 0.35 |
RTF (Real-Time Factor) is the ratio of processing time to audio duration: an RTF of 0.18 means a 60-minute recording takes roughly 11 minutes to transcribe. We select whisper-medium as the best tradeoff between accuracy and latency.
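The benchmark loop is simple to reproduce. Below is a minimal sketch using the openai-whisper package and jiwer for WER scoring; the sample format and library choices are ours, not prescribed by this case study.

```python
# Sketch: benchmarking Whisper sizes on a validation set.
# Assumes `samples` is a list of (audio_path, reference_transcript)
# pairs; the jiwer dependency and helper names are our choices.
import time

import jiwer    # pip install jiwer
import whisper  # pip install openai-whisper

def evaluate(model_name: str, samples) -> tuple[float, float]:
    model = whisper.load_model(model_name)  # e.g. "medium"
    hyps, refs = [], []
    proc_seconds = audio_seconds = 0.0
    for path, reference in samples:
        audio = whisper.load_audio(path)    # 16 kHz mono float32
        audio_seconds += len(audio) / 16_000
        start = time.perf_counter()
        result = model.transcribe(audio)
        proc_seconds += time.perf_counter() - start
        hyps.append(result["text"])
        refs.append(reference)
    wer = jiwer.wer(refs, hyps)
    rtf = proc_seconds / audio_seconds      # processing time per audio second
    return wer, rtf
```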
Step 2: Long-Form Audio Processing
Whisper natively processes audio in 30-second windows. For podcast-length audio, we use the HuggingFace pipeline with chunked processing and timestamp-based stitching, which avoids word repetition at segment boundaries.
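A minimal sketch of the chunked pipeline follows; the model checkpoint matches our selection above, while the file name and device settings are illustrative.

```python
# Sketch: long-form transcription via the HuggingFace pipeline.
# chunk_length_s enables chunked inference with overlap-and-stitch;
# the file name and device are illustrative.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-medium",
    torch_dtype=torch.float16,
    device="cuda:0",
)

result = asr(
    "episode_042.mp3",                  # hypothetical input file
    chunk_length_s=30,                  # Whisper's native window
    return_timestamps=True,             # needed for stitching/alignment
    generate_kwargs={"task": "transcribe"},
)
print(result["text"])
for chunk in result["chunks"]:          # (start, end) tuples per segment
    print(chunk["timestamp"], chunk["text"])
```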
Step 3: Audio Preprocessing
Before transcription, the audio preprocessing pipeline handles the following steps (a code sketch follows the list):
- Resampling: Convert all audio to 16 kHz mono.
- Silence removal: Trim leading and trailing silence using energy-based VAD.
- Normalization: Apply peak normalization to -3 dBFS for consistent input levels.
- Segment detection: Skip silent segments using voice-activity detection (Whisper's no-speech probability, or an external VAD), reducing processing time by 15-25%.
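Here is a sketch of the resampling, trimming, and normalization steps using librosa and soundfile (our library choices); the -3 dBFS target comes from the list above, while the trim threshold is illustrative.

```python
# Sketch of the preprocessing steps above using librosa/soundfile
# (our library choices); the trim threshold is illustrative.
import librosa
import numpy as np
import soundfile as sf

def preprocess(in_path: str, out_path: str,
               peak_dbfs: float = -3.0, top_db: int = 30) -> str:
    # Resample to 16 kHz mono, the input format Whisper expects.
    audio, _ = librosa.load(in_path, sr=16_000, mono=True)
    # Energy-based trim of leading/trailing silence.
    audio, _ = librosa.effects.trim(audio, top_db=top_db)
    # Peak-normalize to the target dBFS.
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio *= 10 ** (peak_dbfs / 20) / peak
    sf.write(out_path, audio, 16_000)
    return out_path
```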
Step 4: Post-Processing
Raw Whisper output benefits from several post-processing steps (sketched after the list):
- Punctuation normalization: Ensure consistent punctuation style.
- Number formatting: Convert spoken numbers to appropriate format (dates, currencies, percentages).
- Speaker attribution: Basic speaker change detection using embedding similarity between segments.
- Profanity filtering: Optional censoring for broadcast content.
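A minimal sketch of the rule-based steps follows; the regex patterns and word list are placeholders rather than the production rule set, and speaker attribution (which requires an embedding model) is omitted.

```python
# Sketch: rule-based post-processing. The regex patterns and word
# list are placeholders, not the production rule set; speaker
# attribution (which needs an embedding model) is omitted.
import re

PROFANITY = {"badword"}  # placeholder list

def postprocess(text: str, censor: bool = False) -> str:
    # Normalize whitespace and punctuation style.
    text = re.sub(r"\s+", " ", text).strip()
    text = text.replace("…", "...")
    # Spoken numbers: e.g. "50 percent" -> "50%".
    text = re.sub(r"(\d+)\s*percent\b", r"\1%", text, flags=re.IGNORECASE)
    # Optional profanity censoring for broadcast content.
    if censor:
        for word in PROFANITY:
            text = re.sub(rf"\b{re.escape(word)}\b", "*" * len(word),
                          text, flags=re.IGNORECASE)
    return text
```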
Results
Accuracy by Audio Condition
| Condition | English WER | Spanish WER | French WER |
|---|---|---|---|
| Clean studio | 4.8% | 7.1% | 7.5% |
| Light background music | 6.2% | 9.3% | 9.8% |
| Noisy environment | 11.5% | 15.2% | 16.1% |
| Phone quality audio | 8.9% | 12.4% | 13.0% |
Processing Performance
| Metric | Value |
|---|---|
| Average RTF (medium, A100) | 0.18 |
| 60-min podcast processing time | ~11 minutes |
| GPU memory usage | 5.2 GB |
| Effective RTF (batch size 8) | 0.06 |
Key Lessons
- Model size matters for non-English languages. The accuracy gap between tiny and medium is about 8 points for English but roughly 10.5 points for Spanish and French, so multilingual applications benefit more from larger models.
- Chunked processing requires careful overlap. Without proper timestamp-based stitching, word repetitions and omissions appear at chunk boundaries. Using 30-second chunks with seek-based advancement (not fixed-stride) eliminates these artifacts.
- Audio preprocessing significantly impacts accuracy. Peak normalization alone improved WER by 1.2 points (absolute) on phone-quality audio. Silence removal reduced processing time by 20% without affecting accuracy.
- Whisper handles code-switching gracefully. When speakers switch between English and Spanish mid-sentence, Whisper correctly transcribes both languages without explicit language-switching signals.
- Batched inference provides substantial speedup. Processing 8 segments simultaneously reduced the effective RTF from 0.18 to 0.06 on an A100 GPU, making real-time transcription feasible (see the sketch below).
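The batched configuration is a one-argument change to the Step 2 pipeline; the sketch below repeats the setup so it stands alone, with the file name and device again illustrative.

```python
# Sketch: batched chunked inference; batch_size=8 mirrors the
# throughput row in the performance table. Setup repeats Step 2;
# file name and device are illustrative.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-medium",
    torch_dtype=torch.float16,
    device="cuda:0",
)

result = asr(
    "episode_042.mp3",
    chunk_length_s=30,
    batch_size=8,              # decode 8 chunks per forward pass
    return_timestamps=True,
)
```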
Code Reference
The complete implementation is available in code/case-study-code.py.