Case Study 1: Building a Speech-to-Text System with Whisper

Overview

Automatic speech recognition has traditionally required large labeled datasets, language-specific pipelines, and extensive tuning. OpenAI's Whisper changed this landscape by providing a single model that handles multiple languages, accents, and acoustic conditions out of the box. In this case study, you will build a production-ready speech-to-text system using Whisper, handling real-world challenges such as long audio files, noisy recordings, and multilingual content.

Problem Statement

A media company needs to transcribe podcast episodes (30-90 minutes each) in English, Spanish, and French. The system must handle background music, multiple speakers, and varying audio quality. The target Word Error Rate (WER) is below 10% for clean speech and below 20% for noisy conditions.

Approach

Step 1: Model Selection

We evaluate Whisper model sizes on a validation set of 50 audio clips:

Model      Parameters   English WER   Spanish WER   French WER   RTF (A100)
tiny       39M          14.2%         18.7%         19.3%        0.02
base       74M          10.8%         14.1%         15.2%        0.03
small      244M         7.3%          10.5%         11.1%        0.08
medium     769M         5.9%          8.2%          8.8%         0.18
large-v3   1.55B        4.6%          6.1%          6.7%         0.35

RTF (Real-Time Factor) is the ratio of processing time to audio duration, so an RTF of 0.18 means a 60-minute episode takes roughly 11 minutes to transcribe. We select whisper-medium as the best tradeoff between accuracy and latency.
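
A minimal sketch of the evaluation loop, assuming the Hugging Face transformers ASR pipeline, the jiwer package for WER (substitutions + deletions + insertions divided by reference words), and soundfile for audio durations; the clip paths, reference transcripts, and the evaluate helper are illustrative, not part of the shipped code:

```python
import time

import jiwer
import soundfile as sf
from transformers import pipeline

def evaluate(model_name, clips):
    """clips: list of (audio_path, reference_transcript) pairs."""
    asr = pipeline("automatic-speech-recognition", model=model_name, device=0)  # GPU 0; drop for CPU
    refs, hyps, audio_s, proc_s = [], [], 0.0, 0.0
    for path, reference in clips:
        info = sf.info(path)
        audio_s += info.frames / info.samplerate        # clip duration in seconds
        start = time.perf_counter()
        hyps.append(asr(path, chunk_length_s=30)["text"])  # chunking in case a clip exceeds 30 s
        proc_s += time.perf_counter() - start
        refs.append(reference)
    return jiwer.wer(refs, hyps), proc_s / audio_s      # (WER, real-time factor)

# Hypothetical validation set; the real one has 50 clips per language.
validation_clips = [("clips/en_001.wav", "reference transcript goes here")]
for name in ("openai/whisper-small", "openai/whisper-medium"):
    wer, rtf = evaluate(name, validation_clips)
    print(f"{name}: WER={wer:.3f} RTF={rtf:.2f}")
```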

Step 2: Long-Form Audio Processing

Whisper natively processes audio in 30-second windows. For podcast-length audio, we use the Hugging Face pipeline with chunked processing and timestamp-based stitching to avoid word repetitions at segment boundaries.
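
A minimal sketch of this chunked setup, assuming the transformers pipeline and the whisper-medium checkpoint selected above; the episode file name is hypothetical:

```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-medium",
    torch_dtype=torch.float16,
    device="cuda:0",
)

result = asr(
    "episode_042.mp3",          # hypothetical podcast file
    chunk_length_s=30,          # Whisper's native window size
    return_timestamps=True,     # timestamps let the pipeline stitch chunk boundaries
    generate_kwargs={"task": "transcribe"},
)

print(result["text"][:200])
for segment in result["chunks"]:
    print(segment["timestamp"], segment["text"])
```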

Step 3: Audio Preprocessing

Before transcription, the audio preprocessing pipeline handles the following steps (see the sketch after the list):

  • Resampling: Convert all audio to 16 kHz mono.
  • Silence removal: Trim leading and trailing silence using energy-based VAD.
  • Normalization: Apply peak normalization to -3 dBFS for consistent input levels.
  • Segment detection: Use Whisper's VAD to skip silent segments, reducing processing time by 15-25%.
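
A minimal sketch of the resampling, trimming, and normalization steps, assuming librosa and soundfile; the -3 dBFS target comes from the list above, while the trim threshold and file paths are illustrative:

```python
import librosa
import numpy as np
import soundfile as sf

TARGET_SR = 16_000   # Whisper expects 16 kHz mono input
PEAK_DBFS = -3.0     # peak-normalization target from the list above

def preprocess(in_path: str, out_path: str) -> None:
    # Resample to 16 kHz mono.
    audio, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)

    # Trim leading/trailing silence with an energy threshold (dB below peak; value illustrative).
    audio, _ = librosa.effects.trim(audio, top_db=30)

    # Peak-normalize to -3 dBFS.
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio * (10 ** (PEAK_DBFS / 20) / peak)

    sf.write(out_path, audio, TARGET_SR)

preprocess("raw/episode_042.mp3", "clean/episode_042.wav")
```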

Step 4: Post-Processing

Raw Whisper output benefits from several post-processing steps (see the sketch after the list):

  • Punctuation normalization: Ensure consistent punctuation style.
  • Number formatting: Convert spoken numbers to appropriate format (dates, currencies, percentages).
  • Speaker attribution: Basic speaker change detection using embedding similarity between segments.
  • Profanity filtering: Optional censoring for broadcast content.
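
A minimal sketch of the first two steps (punctuation normalization and a small slice of number formatting); the regexes, word list, and normalize_transcript name are illustrative only, and a production pipeline would likely use a dedicated inverse-text-normalization library:

```python
import re

def normalize_transcript(text: str) -> str:
    # Punctuation normalization: collapse repeated whitespace and remove space before punctuation.
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)

    # Number formatting: "fifty percent" -> "50%" for a few common cases.
    words_to_digits = {
        "ten": "10", "twenty": "20", "thirty": "30", "forty": "40", "fifty": "50",
        "sixty": "60", "seventy": "70", "eighty": "80", "ninety": "90",
    }
    for word, digit in words_to_digits.items():
        text = re.sub(rf"\b{word} percent\b", f"{digit}%", text, flags=re.IGNORECASE)
    return text

print(normalize_transcript("revenue grew  fifty percent , year over year ."))
# -> "revenue grew 50%, year over year."
```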

Results

Accuracy by Audio Condition

Condition                English WER   Spanish WER   French WER
Clean studio             4.8%          7.1%          7.5%
Light background music   6.2%          9.3%          9.8%
Noisy environment        11.5%         15.2%         16.1%
Phone-quality audio      8.9%          12.4%         13.0%

Processing Performance

Metric                           Value
Average RTF (medium, A100)       0.18
60-min podcast processing time   ~11 minutes
GPU memory usage                 5.2 GB
Batch throughput (8 segments)    0.06 RTF

Key Lessons

  1. Model size matters more for non-English languages. The WER gap between tiny and medium is about 8 points for English but more than 10 points for Spanish and French, so multilingual applications benefit more from larger models.

  2. Chunked processing requires careful overlap. Without proper timestamp-based stitching, word repetitions and omissions appear at chunk boundaries. Using 30-second chunks with seek-based advancement (not fixed-stride) eliminates these artifacts.

  3. Audio preprocessing significantly impacts accuracy. Peak normalization alone improved WER by 1.2 percentage points on phone-quality audio, and silence removal reduced processing time by about 20% without affecting accuracy.

  4. Whisper handles code-switching gracefully. When speakers switch between English and Spanish mid-sentence, Whisper correctly transcribes both languages without explicit language switching signals.

  5. Batched inference provides substantial speedup. Processing 8 segments simultaneously reduced the effective RTF from 0.18 to 0.06 on an A100 GPU, making real-time transcription feasible.
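
A minimal sketch of the batched setup, assuming the same transformers pipeline as in Step 2; batch_size=8 mirrors the figure above, and the file name is hypothetical:

```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-medium",
    torch_dtype=torch.float16,
    device="cuda:0",
)

result = asr(
    "episode_042.mp3",      # hypothetical podcast file
    chunk_length_s=30,
    batch_size=8,           # decode 8 chunks per forward pass on the GPU
    return_timestamps=True,
)
print(result["text"][:200])
```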

Code Reference

The complete implementation is available in code/case-study-code.py.