Case Study 1: Building a Speech-to-Text System with Whisper

Overview

Automatic speech recognition has traditionally required large labeled datasets, language-specific pipelines, and extensive tuning. OpenAI's Whisper changed this landscape by providing a single model that handles multiple languages, accents, and acoustic conditions out of the box. In this case study, you will build a production-ready speech-to-text system using Whisper, handling real-world challenges such as long audio files, noisy recordings, and multilingual content.

Problem Statement

A media company needs to transcribe podcast episodes (30-90 minutes each) in English, Spanish, and French. The system must handle background music, multiple speakers, and varying audio quality. The target Word Error Rate (WER) is below 10% for clean speech and below 20% for noisy conditions.

Approach

Step 1: Model Selection

We evaluate Whisper model sizes on a validation set of 50 audio clips:

Model      Parameters   English WER   Spanish WER   French WER   RTF (A100)
tiny       39M          14.2%         18.7%         19.3%        0.02
base       74M          10.8%         14.1%         15.2%        0.03
small      244M         7.3%          10.5%         11.1%        0.08
medium     769M         5.9%          8.2%          8.8%         0.18
large-v3   1.55B        4.6%          6.1%          6.7%         0.35

RTF (Real-Time Factor) is the ratio of processing time to audio duration, so an RTF of 0.18 means a 60-minute episode takes roughly 11 minutes to transcribe. We select whisper-medium as the best tradeoff between accuracy and latency.
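
A minimal sketch of the evaluation loop, assuming the Hugging Face transformers ASR pipeline, the jiwer package for WER (substitutions + deletions + insertions divided by reference words), and soundfile for audio durations; the clip paths, reference transcripts, and the evaluate helper are illustrative, not part of the shipped code:

```python
import time

import jiwer
import soundfile as sf
from transformers import pipeline

def evaluate(model_name, clips):
    """clips: list of (audio_path, reference_transcript) pairs."""
    asr = pipeline("automatic-speech-recognition", model=model_name, device=0)  # GPU 0; drop for CPU
    refs, hyps, audio_s, proc_s = [], [], 0.0, 0.0
    for path, reference in clips:
        info = sf.info(path)
        audio_s += info.frames / info.samplerate        # clip duration in seconds
        start = time.perf_counter()
        hyps.append(asr(path, chunk_length_s=30)["text"])  # chunking in case a clip exceeds 30 s
        proc_s += time.perf_counter() - start
        refs.append(reference)
    return jiwer.wer(refs, hyps), proc_s / audio_s      # (WER, real-time factor)

# Hypothetical validation set; the real one has 50 clips per language.
validation_clips = [("clips/en_001.wav", "reference transcript goes here")]
for name in ("openai/whisper-small", "openai/whisper-medium"):
    wer, rtf = evaluate(name, validation_clips)
    print(f"{name}: WER={wer:.3f} RTF={rtf:.2f}")
```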

Step 2: Long-Form Audio Processing

Whisper natively processes audio in 30-second windows. For podcast-length audio, we use the Hugging Face pipeline with chunked processing and timestamp-based stitching to avoid word repetitions at segment boundaries.
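
A minimal sketch of this chunked setup, assuming the transformers pipeline and the whisper-medium checkpoint selected above; the episode file name is hypothetical:

```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-medium",
    torch_dtype=torch.float16,
    device="cuda:0",
)

result = asr(
    "episode_042.mp3",          # hypothetical podcast file
    chunk_length_s=30,          # Whisper's native window size
    return_timestamps=True,     # timestamps let the pipeline stitch chunk boundaries
    generate_kwargs={"task": "transcribe"},
)

print(result["text"][:200])
for segment in result["chunks"]:
    print(segment["timestamp"], segment["text"])
```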

Step 3: Audio Preprocessing

Before transcription, the audio preprocessing pipeline handles the following steps (see the sketch after the list):

  • Resampling: Convert all audio to 16 kHz mono.
  • Silence removal: Trim leading and trailing silence using energy-based VAD.
  • Normalization: Apply peak normalization to -3 dBFS for consistent input levels.
  • Segment detection: Use Whisper's VAD to skip silent segments, reducing processing time by 15-25%.
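
A minimal sketch of the resampling, trimming, and normalization steps, assuming librosa and soundfile; the -3 dBFS target comes from the list above, while the trim threshold and file paths are illustrative:

```python
import librosa
import numpy as np
import soundfile as sf

TARGET_SR = 16_000   # Whisper expects 16 kHz mono input
PEAK_DBFS = -3.0     # peak-normalization target from the list above

def preprocess(in_path: str, out_path: str) -> None:
    # Resample to 16 kHz mono.
    audio, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)

    # Trim leading/trailing silence with an energy threshold (dB below peak; value illustrative).
    audio, _ = librosa.effects.trim(audio, top_db=30)

    # Peak-normalize to -3 dBFS.
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio * (10 ** (PEAK_DBFS / 20) / peak)

    sf.write(out_path, audio, TARGET_SR)

preprocess("raw/episode_042.mp3", "clean/episode_042.wav")
```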

Step 4: Post-Processing

Raw Whisper output benefits from several post-processing steps (see the sketch after the list):

  • Punctuation normalization: Ensure consistent punctuation style.
  • Number formatting: Convert spoken numbers to appropriate format (dates, currencies, percentages).
  • Speaker attribution: Basic speaker change detection using embedding similarity between segments.
  • Profanity filtering: Optional censoring for broadcast content.
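
A minimal sketch of the first two steps (punctuation normalization and a small slice of number formatting); the regexes, word list, and normalize_transcript name are illustrative only, and a production pipeline would likely use a dedicated inverse-text-normalization library:

```python
import re

def normalize_transcript(text: str) -> str:
    # Punctuation normalization: collapse repeated whitespace and remove space before punctuation.
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)

    # Number formatting: "fifty percent" -> "50%" for a few common cases.
    words_to_digits = {
        "ten": "10", "twenty": "20", "thirty": "30", "forty": "40", "fifty": "50",
        "sixty": "60", "seventy": "70", "eighty": "80", "ninety": "90",
    }
    for word, digit in words_to_digits.items():
        text = re.sub(rf"\b{word} percent\b", f"{digit}%", text, flags=re.IGNORECASE)
    return text

print(normalize_transcript("revenue grew  fifty percent , year over year ."))
# -> "revenue grew 50%, year over year."
```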

Results

Accuracy by Audio Condition

Condition                English WER   Spanish WER   French WER
Clean studio             4.8%          7.1%          7.5%
Light background music   6.2%          9.3%          9.8%
Noisy environment        11.5%         15.2%         16.1%
Phone-quality audio      8.9%          12.4%         13.0%

Processing Performance

Metric                           Value
Average RTF (medium, A100)       0.18
60-min podcast processing time   ~11 minutes
GPU memory usage                 5.2 GB
Batch throughput (8 segments)    0.06 RTF

Key Lessons

  1. Model size matters more for non-English languages. The WER gap between tiny and medium is about 8 points for English but more than 10 points for Spanish and French, so multilingual applications benefit more from larger models.

  2. Chunked processing requires careful overlap. Without proper timestamp-based stitching, word repetitions and omissions appear at chunk boundaries. Using 30-second chunks with seek-based advancement (not fixed-stride) eliminates these artifacts.

  3. Audio preprocessing significantly impacts accuracy. Peak normalization alone improved WER by 1.2 percentage points on phone-quality audio, and silence removal reduced processing time by about 20% without affecting accuracy.

  4. Whisper handles code-switching gracefully. When speakers switch between English and Spanish mid-sentence, Whisper correctly transcribes both languages without explicit language switching signals.

  5. Batched inference provides substantial speedup. Processing 8 segments simultaneously reduced the effective RTF from 0.18 to 0.06 on an A100 GPU, making real-time transcription feasible.
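
A minimal sketch of the batched setup, assuming the same transformers pipeline as in Step 2; batch_size=8 mirrors the figure above, and the file name is hypothetical:

```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-medium",
    torch_dtype=torch.float16,
    device="cuda:0",
)

result = asr(
    "episode_042.mp3",      # hypothetical podcast file
    chunk_length_s=30,
    batch_size=8,           # decode 8 chunks per forward pass on the GPU
    return_timestamps=True,
)
print(result["text"][:200])
```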

Code Reference

The complete implementation is available in code/case-study-code.py.