Case Study 2: Audio Classification for Environmental Sound

Overview

Environmental sound classification has applications in wildlife monitoring, urban noise mapping, smart home systems, and industrial equipment health monitoring. In this case study, you build an audio classification system that identifies 10 environmental sound categories using a fine-tuned Audio Spectrogram Transformer (AST), demonstrating transfer learning from audio pre-training.

Problem Statement

A smart city project requires a system to classify urban sounds from street-level microphones into 10 categories: car horn, siren, dog bark, jackhammer, children playing, street music, engine idling, air conditioner, gun shot, and drilling. The system must operate in real time on edge devices with limited compute.

Dataset

The dataset contains 2,000 audio clips (200 per class), each 4 seconds long at 16 kHz:

  • Training: 1,400 clips (140 per class)
  • Validation: 300 clips (30 per class)
  • Test: 300 clips (30 per class)

Audio conditions vary: indoor/outdoor, different SNR levels, and varying distances from the source.
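
The split is balanced per class (140/30/30). One way to reproduce such a balanced split is a two-stage stratified split, sketched below; `files` and `labels` are hypothetical placeholders for the 2,000 clip paths and their integer class indices.

```python
from sklearn.model_selection import train_test_split

# files: 2,000 clip paths; labels: parallel class indices (placeholders here).
# 70/15/15 stratified split -> 1,400 / 300 / 300 clips, 140/30/30 per class.
train_f, rest_f, train_y, rest_y = train_test_split(
    files, labels, test_size=0.30, stratify=labels, random_state=42)
val_f, test_f, val_y, test_y = train_test_split(
    rest_f, rest_y, test_size=0.50, stratify=rest_y, random_state=42)
```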

Approach

Step 1: Feature Extraction

Each 4-second clip is converted to a 128-bin mel spectrogram with n_fft=1024 and hop_length=160, producing a spectrogram of shape [128, 400]. SpecAugment (2 frequency masks of width 24, 2 time masks of width 50) is applied during training.
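
A minimal feature-extraction sketch using torchaudio is shown below. The file name is a placeholder, and the mask widths are treated as maximum widths, as in the original SpecAugment recipe; note that AST checkpoints on the Hugging Face Hub normally apply their own normalization via ASTFeatureExtractor, so this illustrates the parameters above rather than the exact production pipeline.

```python
import torch
import torchaudio

SAMPLE_RATE = 16_000  # 4 s clips at 16 kHz -> 64,000 samples

# Log-mel front end with the parameters from the text:
# n_fft=1024, hop_length=160, 128 mel bins.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=160, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

# SpecAugment: masks of width up to 24 (freq) / 50 (time), two of each,
# applied only during training.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=24)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=50)

def extract_features(waveform: torch.Tensor, training: bool = False) -> torch.Tensor:
    """[1, 64000] waveform -> log-mel spectrogram of shape [1, 128, 400].

    torchaudio's centered framing yields 401 frames, so the last frame
    is trimmed to match the [128, 400] shape stated above."""
    spec = to_db(mel(waveform))[..., :400]
    if training:
        for _ in range(2):
            spec = freq_mask(spec)
            spec = time_mask(spec)
    return spec

wav, sr = torchaudio.load("clip.wav")        # placeholder file, mono 16 kHz
spec = extract_features(wav, training=True)  # torch.Size([1, 128, 400])
```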

Step 2: Model Architecture

We fine-tune a pre-trained AST model (MIT/ast-finetuned-audioset) by replacing the classification head with a 10-class linear layer. The pre-trained transformer backbone (patch embedding plus encoder) is frozen for the first 3 epochs, then unfrozen with a reduced learning rate.
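
A minimal loading-and-freezing sketch with Hugging Face transformers follows. The full Hub checkpoint id and the attribute names (audio_spectrogram_transformer, classifier) are assumptions based on the current transformers AST implementation, not taken from the text.

```python
from transformers import ASTForAudioClassification

# Assumed full Hub id of the AudioSet-fine-tuned AST checkpoint.
CHECKPOINT = "MIT/ast-finetuned-audioset-10-10-0.4593"

model = ASTForAudioClassification.from_pretrained(
    CHECKPOINT,
    num_labels=10,                 # replace the 527-class AudioSet head
    ignore_mismatched_sizes=True,  # the new 10-class head is freshly initialized
)

# Phase 1: freeze the pre-trained backbone; only the new head trains.
for p in model.audio_spectrogram_transformer.parameters():
    p.requires_grad = False

# Phase 2, after 3 epochs: unfreeze for end-to-end fine-tuning at a lower LR.
def unfreeze_backbone(m: ASTForAudioClassification) -> None:
    for p in m.audio_spectrogram_transformer.parameters():
        p.requires_grad = True
```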

Step 3: Training

  • Optimizer: AdamW, lr=1e-4 (head), lr=1e-5 (backbone after unfreezing)
  • Scheduler: Cosine annealing over 20 epochs
  • Batch size: 16
  • Mixup: alpha=0.3 for regularization
  • Early stopping: patience=5 on validation accuracy
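
The configuration above can be wired together as in the sketch below, assuming the `model` object from Step 2; the `input_values` layout follows the Hugging Face AST API, and early stopping (patience=5) is assumed to wrap the epoch loop.

```python
import torch
import torch.nn.functional as F

# Two parameter groups: the new head at 1e-4, the backbone at 1e-5.
optimizer = torch.optim.AdamW([
    {"params": model.classifier.parameters(), "lr": 1e-4},
    {"params": model.audio_spectrogram_transformer.parameters(), "lr": 1e-5},
])
# Cosine annealing over 20 epochs; call scheduler.step() once per epoch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

def mixup_step(specs: torch.Tensor, labels: torch.Tensor, alpha: float = 0.3):
    """One training step with Mixup: blend two batches and their losses."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(specs.size(0))
    mixed = lam * specs + (1.0 - lam) * specs[perm]
    logits = model(input_values=mixed).logits
    loss = (lam * F.cross_entropy(logits, labels)
            + (1.0 - lam) * F.cross_entropy(logits, labels[perm]))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```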

Results

Category          Precision  Recall  F1
----------------  ---------  ------  ----
Car horn          0.93       0.90    0.91
Siren             0.91       0.93    0.92
Dog bark          0.88       0.87    0.87
Jackhammer        0.90       0.93    0.91
Children playing  0.82       0.80    0.81
Street music      0.79       0.83    0.81
Engine idling     0.87       0.83    0.85
Air conditioner   0.85       0.87    0.86
Gun shot          0.95       0.93    0.94
Drilling          0.86       0.90    0.88
Macro average     0.88       0.88    0.88

Overall accuracy: 88.0%
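
The per-class table and macro average can be reproduced with scikit-learn; the helper below is a sketch in which `y_true` and `y_pred` stand in for the integer labels and predictions collected over the 300-clip test set.

```python
from sklearn.metrics import accuracy_score, classification_report

CLASSES = ["car horn", "siren", "dog bark", "jackhammer", "children playing",
           "street music", "engine idling", "air conditioner", "gun shot",
           "drilling"]

def report(y_true, y_pred):
    """Print per-class precision/recall/F1, macro average, and accuracy."""
    print(classification_report(y_true, y_pred, target_names=CLASSES, digits=2))
    print(f"Overall accuracy: {accuracy_score(y_true, y_pred):.1%}")
```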

Key Lessons

  1. Pre-training on AudioSet transfers well. Fine-tuning from AudioSet pre-training improved accuracy by 12% over fine-tuning from ImageNet pre-training alone, confirming that in-domain audio pre-training is the better starting point for audio tasks.

  2. SpecAugment is essential. Without augmentation, the model overfits by epoch 8 and reaches only 81% accuracy. Together, SpecAugment and Mixup contribute a 7% accuracy improvement.

  3. Confused categories share acoustic properties. "Children playing" and "street music" are the most confused pair, as both contain broadband energy with complex temporal patterns. Similarly, "engine idling" and "air conditioner" share low-frequency drone characteristics.

  4. Edge deployment requires distillation. The full AST model runs at 45 ms per inference on GPU but 800 ms on CPU. Distilling to a smaller MobileNet-based student achieves 83% accuracy at 50 ms on CPU, suitable for edge deployment; a minimal distillation-loss sketch follows below.
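
The distillation in lesson 4 can be implemented with the standard Hinton-style loss: cross-entropy on the hard labels plus a KL term toward the AST teacher's softened outputs. The temperature T and weight alpha below are hypothetical values, not taken from the text.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
    """Knowledge-distillation loss: CE on hard labels + KL to the teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to match the hard-label term
    return alpha * hard + (1.0 - alpha) * soft
```

The soft targets carry inter-class similarity (for example, between engine idling and air conditioner, the acoustically similar pair noted in lesson 3), which is what helps the much smaller student recover most of the teacher's accuracy.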

Code Reference

The complete implementation is available in code/case-study-code.py.