Capstone 1: Build a Spectrogram Analyzer — Visualizing the Physics of Real Music
Overview
Sound is invisible. For most of human history, we could describe it only in words — bright, warm, harsh, resonant — or in the abstract symbols of musical notation, which captures pitch and duration but nothing about timbre, dynamics, or the physical richness of a real acoustic event. The spectrogram changed that. A spectrogram is a visual representation of the frequency content of a sound as it changes over time: time on the x-axis, frequency on the y-axis, and amplitude encoded as color or brightness. It makes the physics of sound visible.
In this capstone project, you will build a spectrogram analyzer from scratch using Python. You will implement the Short-Time Fourier Transform (STFT) — first manually, to understand the mathematics, then using industry-standard libraries to see how the implementation scales. You will apply your analyzer to real recorded music from a Spotify-like spectral dataset, and you will use what you see to do genuine physical analysis: identifying harmonics, measuring transients, locating chord changes, and — most importantly — demonstrating the Fourier Uncertainty Principle on actual audio.
This project connects directly to the Spotify Spectral Dataset, the running example used throughout the textbook. The features that Spotify and other streaming services extract from recordings to power recommendation algorithms — spectral centroid, mel-frequency cepstral coefficients, chroma features — are all downstream computations from the STFT. When you build a spectrogram analyzer, you are building the foundational layer of the technology that decides which music finds its audience.
By the time you finish this project, you will be able to look at a spectrogram and read it the way a physicist reads an oscilloscope trace: as a precise physical record, full of information, waiting to be interpreted.
Learning Objectives
By the end of this project, students will be able to:
- Explain the mathematical relationship between the time-domain waveform and the frequency-domain spectrum, and articulate why neither representation alone is sufficient for analyzing music.
- Implement a Short-Time Fourier Transform from first principles using NumPy, correctly applying windowing functions and selecting appropriate overlap parameters.
- Describe the trade-off between time resolution and frequency resolution in the STFT, and connect this trade-off to the Fourier Uncertainty Principle developed in Chapter 22.
- Produce publication-quality spectrogram visualizations with correctly labeled axes, appropriate color scales, and decibel normalization.
- Read a spectrogram of real music and identify specific physical features: fundamental frequencies, harmonic series structure, transient events, noise floors, and spectral envelopes.
- Quantitatively verify the Fourier Uncertainty Principle by measuring Δt and Δf for specific acoustic events in recorded audio.
- Compare the spectral characteristics of different musical genres and instrument families, grounding aesthetic descriptions in physical measurement.
- Explain how the STFT relates to psychoacoustic processing in the auditory system and to the feature extraction pipelines used in music streaming platforms.
Background Reading
Before beginning the implementation phases, review the following:
- Chapter 1 (Wave Basics): The wave equation, frequency, period, amplitude, and the concept of a spectrum.
- Chapter 7 (Fourier Analysis): The Fourier Transform in detail, complex exponentials, and the frequency-domain representation of periodic and aperiodic signals. This is the mathematical backbone of everything in this project.
- Chapter 8 (Timbre and Spectral Envelope): Why different instruments playing the same pitch have different sounds, and how the harmonic series appears in real spectra.
- Chapter 22 (Uncertainty Principles in Music and Physics): The Fourier Uncertainty Principle — that Δt × Δf ≥ 1/(4π) — and its musical implications. This chapter is essential for Phase 5.
- Chapter 31 (Physics of Recording): The transduction chain from air pressure to electrical signal to digital samples, and the physical meaning of a sampled audio file.
- Chapter 32 (Digital Audio): The Nyquist-Shannon sampling theorem, quantization, and why a 44,100 Hz sample rate captures all frequencies up to 22,050 Hz.
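The Nyquist limit from Chapter 32 can be checked numerically. The following sketch (a minimal illustration, not a project deliverable) samples a 30 kHz tone at 44,100 Hz; since 30 kHz exceeds the 22,050 Hz Nyquist frequency, the tone aliases down to 44,100 − 30,000 = 14,100 Hz:

```python
import numpy as np

sr = 44100                          # sample rate (Hz)
t = np.arange(sr) / sr              # 1 second of sample times
x = np.sin(2 * np.pi * 30000 * t)   # 30 kHz tone: above the Nyquist frequency

# With exactly 1 second of audio, each rfft bin is 1 Hz wide.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1/sr)
peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)  # the tone aliases to 44100 - 30000 = 14100 Hz
```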
Phase 1: Setting Up the Audio Analysis Environment
Estimated time: 1–2 hours
Required Libraries
Install the following Python packages before beginning. All are available via pip or conda:
pip install librosa matplotlib scipy numpy soundfile
- librosa: High-level audio analysis library. We will use it for loading audio files and for comparison against our hand-built implementations.
- matplotlib: Visualization. We will build our spectrograms as matplotlib figures.
- scipy: Scientific Python. We will use scipy.signal for windowing functions and for reference FFT implementations.
- numpy: Numerical computing. Our hand-built STFT will be built entirely with NumPy operations.
- soundfile: Reading and writing audio files in various formats.
Testing Your Environment
Run the following code block to verify that your installation is working correctly. If you see a waveform plot and no error messages, you are ready to proceed.
import librosa
import matplotlib.pyplot as plt
import numpy as np
# Load a built-in librosa example file (no external audio needed yet)
y, sr = librosa.load(librosa.ex('trumpet'))
# Display the waveform
plt.figure(figsize=(12, 3))
plt.plot(np.linspace(0, len(y)/sr, len(y)), y)
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.title('Waveform: Trumpet Example')
plt.tight_layout()
plt.savefig('test_waveform.png', dpi=150)
plt.show()
print(f"Sample rate: {sr} Hz")
print(f"Duration: {len(y)/sr:.2f} seconds")
print(f"Number of samples: {len(y)}")
Phase 1 Exercise
Load three different audio clips: ideally a solo piano note, a drum hit, and a vocal phrase (librosa includes several example files; you may also use your own clips). For each, display the waveform and record the following observations:
- Approximate duration
- Whether the amplitude envelope is percussive (fast attack, fast decay) or sustained (slow attack, long decay)
- Any visible periodicity in the waveform
- Your prediction, before computing the spectrum, of where most of the energy will be concentrated in frequency
Write 2–3 sentences for each clip connecting your waveform observations to the physical concepts from Chapter 1. These predictions will be tested when you compute spectrograms in Phase 3.
Phase 2: Building the Short-Time Fourier Transform
Estimated time: 3–5 hours
The Core Idea
A standard Fourier Transform analyzes the entire signal at once, producing a single spectrum that represents the average frequency content over the full duration. This is useful for stationary signals — pure tones, white noise — but music is anything but stationary. A piano piece changes its frequency content thousands of times per second.
The STFT solves this by dividing the signal into short, overlapping segments (called frames) and computing the Fourier Transform of each frame independently. The result is a two-dimensional array: one dimension is time (frame index), the other is frequency (FFT bin), and the value at each position is the complex amplitude of that frequency at that moment.
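The framing arithmetic above can be checked on a concrete (made-up) case: three seconds of audio at 22,050 Hz with the default frame and hop lengths yields 126 frames, and it is the hop length, not the frame length, that sets the time spacing between spectrogram columns:

```python
import numpy as np

sample_rate = 22050
audio = np.zeros(3 * sample_rate)     # 3 s of silence: 66,150 samples
frame_length, hop_length = 2048, 512

# Each frame starts hop_length samples after the previous one;
# the last frame must still fit entirely inside the signal.
n_frames = 1 + (len(audio) - frame_length) // hop_length
print(n_frames)  # 126

# Time spacing between successive spectrogram columns:
frame_spacing_ms = 1000 * hop_length / sample_rate
print(round(frame_spacing_ms, 1))  # 23.2
```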
Implementing STFT from Scratch
def stft_from_scratch(audio, sample_rate, frame_length=2048, hop_length=512, window='hann'):
"""
Compute the Short-Time Fourier Transform of an audio signal.
Parameters
----------
audio : 1D numpy array of audio samples (float, -1.0 to 1.0)
sample_rate : integer, samples per second (e.g., 44100)
frame_length : integer, number of samples per FFT frame (default 2048)
hop_length : integer, samples between successive frames (default 512)
window : string, window function type ('hann' or 'hamming')
Returns
-------
stft_matrix : 2D numpy array, shape (frame_length//2 + 1, n_frames)
Complex-valued STFT coefficients
time_axis : 1D numpy array of frame center times (seconds)
freq_axis : 1D numpy array of frequency bin centers (Hz)
"""
# TODO: Create the window function using scipy.signal.get_window
# The window multiplies each frame to reduce spectral leakage at frame edges.
# Try both 'hann' and 'hamming' and observe the difference.
# TODO: Calculate the number of frames.
# n_frames = 1 + (len(audio) - frame_length) // hop_length
# Pad the audio signal with zeros if needed to ensure complete frames.
# TODO: Build the STFT matrix.
# For each frame i:
# 1. Extract the slice: audio[i*hop_length : i*hop_length + frame_length]
# 2. Multiply by the window function
# 3. Compute np.fft.rfft() of the windowed frame
# (rfft returns only the non-redundant half of the spectrum)
# 4. Store the result in column i of your output matrix
# TODO: Compute time_axis and freq_axis
# time_axis: center time of each frame = (i * hop_length + frame_length/2) / sample_rate
# freq_axis: frequency of each bin = k * sample_rate / frame_length, for k = 0..frame_length//2
pass # Remove this and implement above
# After implementing, verify against librosa. Pass center=False so that
# librosa does not pad and center each frame, matching our left-aligned framing:
# stft_librosa = librosa.stft(y, n_fft=2048, hop_length=512, center=False)
# Check that np.allclose(np.abs(stft_mine), np.abs(stft_librosa)) returns True
Window Functions and Spectral Leakage
A critical concept in STFT implementation is windowing. Because we are analyzing a finite segment of the signal, the FFT assumes the segment repeats infinitely. If the signal does not complete an exact integer number of cycles within the frame, the artificial discontinuity at the frame boundary introduces spurious frequency components (spectral leakage) that are not present in the original signal.
Window functions taper the signal to zero at the frame boundaries, suppressing leakage at the cost of slightly reducing frequency resolution. The Hann window (also called Hanning window) is the standard choice for audio analysis:
w[n] = 0.5 × (1 − cos(2πn / (N−1)))
The Hamming window uses different constants (0.54 and 0.46) to achieve a slightly different leakage/resolution trade-off.
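Both window formulas can be checked directly against SciPy (note that scipy.signal.get_window returns the periodic variant by default; the symmetric formulas above correspond to sym=True in scipy.signal.windows). The sketch below also demonstrates the leakage the windows exist to suppress, using a tone deliberately placed between FFT bins:

```python
import numpy as np
import scipy.signal

N = 2048
n = np.arange(N)

# Symmetric Hann and Hamming windows, straight from the formulas above
hann_manual = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))
hamming_manual = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
assert np.allclose(hann_manual, scipy.signal.windows.hann(N, sym=True))
assert np.allclose(hamming_manual, scipy.signal.windows.hamming(N, sym=True))

# Leakage demo: a tone that does not complete an integer number of cycles
# in the frame spreads energy across many bins unless it is windowed.
tone = np.sin(2 * np.pi * 440.7 * n / 44100)
spec_rect = np.abs(np.fft.rfft(tone))            # no window (rectangular)
spec_hann = np.abs(np.fft.rfft(tone * hann_manual))

def peak_energy_fraction(spec):
    """Fraction of spectral energy within +/-3 bins of the peak."""
    k = int(np.argmax(spec))
    return np.sum(spec[max(k - 3, 0):k + 4] ** 2) / np.sum(spec ** 2)

# The Hann window concentrates the energy far better:
print(peak_energy_fraction(spec_rect) < peak_energy_fraction(spec_hann))  # True
```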
Phase 2 Exercise: The Time-Frequency Trade-Off
This exercise demonstrates the Fourier Uncertainty Principle experimentally. Compute four spectrograms of the same audio clip using the following frame lengths, keeping hop_length = frame_length // 4. (The resolution values below assume a 44,100 Hz sample rate: time resolution is frame_length / sample_rate, and frequency resolution is sample_rate / frame_length.)
| Frame Length | Time Resolution | Frequency Resolution |
|---|---|---|
| 256 samples | 5.8 ms | 172 Hz |
| 1024 samples | 23.2 ms | 43 Hz |
| 2048 samples | 46.4 ms | 21.5 Hz |
| 8192 samples | 185.8 ms | 5.4 Hz |
For each spectrogram, answer: Can you clearly see individual beats (time events)? Can you clearly see individual pitches (frequency events)? For a drum kit recording, which frame length works best? For a slow string melody, which works best? Write 3–4 sentences connecting your observations to the mathematical statement of the Fourier Uncertainty Principle from Chapter 22.
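The table values follow directly from the two definitions, and their product is constant: improving one resolution necessarily degrades the other. A quick sketch to reproduce the table (assuming the 44,100 Hz sample rate):

```python
sr = 44100  # sample rate assumed by the table above

for frame_length in (256, 1024, 2048, 8192):
    time_res_ms = 1000 * frame_length / sr   # duration of one frame
    freq_res_hz = sr / frame_length          # spacing between FFT bins
    print(f"{frame_length:5d} samples: {time_res_ms:6.1f} ms, {freq_res_hz:6.1f} Hz")
```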
Phase 3: Spectrogram Visualization
Estimated time: 2–3 hours
From STFT to Spectrogram
The STFT matrix contains complex numbers. To make a spectrogram, we take the magnitude (or magnitude squared for power spectrum), convert to decibels, and display as a color image.
def plot_spectrogram(stft_matrix, time_axis, freq_axis, title='Spectrogram',
fmax=8000, db_range=80):
"""
Display a spectrogram from STFT output.
Parameters
----------
stft_matrix : complex 2D array from stft_from_scratch()
time_axis : 1D array of frame times (seconds)
freq_axis : 1D array of frequency bin centers (Hz)
title : string, plot title
fmax : maximum frequency to display (Hz)
db_range : dynamic range of the color scale (dB)
"""
# TODO: Compute magnitude in decibels
# magnitude = np.abs(stft_matrix) # linear amplitude
# db = 20 * np.log10(magnitude + 1e-8) # convert to dB (add small value to avoid log(0))
# db_normalized = db - db.max() # normalize so peak = 0 dB
# TODO: Select frequency range
# Find the index of fmax in freq_axis and trim the matrix accordingly
# TODO: Create the plot
# Use plt.pcolormesh() with the time_axis and freq_axis as coordinates
# Use cmap='magma' or cmap='inferno' for perceptually uniform color mapping
# Add a colorbar labeled 'dB'
# Label x-axis 'Time (s)', y-axis 'Frequency (Hz)'
# Add the title
pass
Reading a Spectrogram
Once you have a working visualization, spend time learning to read what you see:
- Horizontal bright lines = sustained tones at specific frequencies. Each line corresponds to one frequency component.
- A stack of horizontal lines at harmonic ratios (f, 2f, 3f, 4f...) = a pitched instrument with a clear harmonic series.
- Vertical bright stripes = percussive events (drum hits, plucked strings, consonants in speech) — sudden, broadband energy.
- Diagonal features = frequency glides (portamento, vibrato, Doppler effects).
- Diffuse, broadband brightness = noise (breath noise, bow noise, white noise).
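A synthetic test signal is a good way to calibrate your eye before turning to real music: a sustained 440 Hz tone produces exactly one horizontal line, and a single-sample click produces exactly one vertical stripe. This sketch builds such a signal and verifies both features numerically, using scipy.signal.stft as a stand-in for your own implementation (the fade-in/fade-out is there so the only transient is the click we add on purpose):

```python
import numpy as np
import scipy.signal

sr = 22050
t = np.arange(2 * sr) / sr
x = 0.5 * np.sin(2 * np.pi * 440 * t)                   # sustained tone -> horizontal line
x *= np.minimum(1.0, np.minimum(t, t[-1] - t) / 0.05)   # 50 ms fades: no edge transients
x[sr] += 1.0                                            # one-sample click at t = 1 s -> vertical stripe

f, times, Z = scipy.signal.stft(x, fs=sr, nperseg=2048, noverlap=1536)
mag = np.abs(Z)

# The tone is the strongest bin in any steady-state frame:
tone_bin = int(np.argmax(mag[:, 20]))
print(round(f[tone_bin]))            # within one bin (~10.8 Hz) of 440

# The click is broadband: zero out the tone's bins, and the remaining
# energy is concentrated in the frames containing the impulse.
offband = mag.copy()
offband[max(tone_bin - 3, 0):tone_bin + 4, :] = 0
click_frame = int(np.argmax(offband.sum(axis=0)))
print(round(times[click_frame], 2))  # close to 1.0
```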
Phase 3 Exercise: Piano Scale Spectrogram
Load a recording of a piano playing a chromatic or diatonic scale (ascending or descending). Compute the spectrogram with frame_length=2048, hop_length=512. On your spectrogram:
- Identify each note's onset time (the vertical stripe marking the attack).
- For three notes, identify the fundamental frequency by reading the y-axis.
- Count the visible harmonics for one clearly isolated note. How many harmonic partials can you see? At what frequency does the harmonic series fade below the noise floor?
- Calculate the ratio of each visible harmonic to the fundamental. Are these whole-number ratios, as predicted by the physics of a vibrating string? Quantify any deviations.
Write a 3–4 sentence analysis connecting your spectrogram observations to the harmonic series content developed in Chapter 8.
Phase 4: Real Music Analysis
Estimated time: 3–5 hours
Choosing Your Three Songs
Select three recordings that differ as much as possible in timbre, instrumentation, and genre. Suggested combinations:
- Option A: String quartet movement + electronic dance track + field recording (rain, traffic, or birdsong)
- Option B: Solo acoustic guitar + heavy metal song + choral piece
- Option C: Jazz piano trio + hip-hop track + orchestral excerpt
If you are using the Spotify Spectral Dataset samples referenced in Chapter 31, select tracks from three distinct genre clusters.
Analysis Framework
For each song, compute the spectrogram with frame_length=2048, hop_length=512. Then answer the following questions with specific reference to features you can see in the spectrogram:
1. Where is the spectral energy concentrated? Compute the spectral centroid — the amplitude-weighted mean frequency at each time frame — and overlay it on the spectrogram. Instruments and genres have characteristic centroid ranges: orchestral strings typically center around 1–3 kHz; electronic bass around 60–200 Hz; hi-hats around 8–12 kHz.
2. How does the harmonic structure differ across genres? Compare the number and relative amplitude of harmonic partials visible in a sustained note from each track. Electronic synthesizers often show extremely regular harmonic series; acoustic instruments show slight inharmonicity; percussion shows very irregular or inharmonic spectra.
3. Can you identify a specific musical event? Find one of the following events on each spectrogram and annotate it (draw a box or arrow on the plot):
   - A chord change (visible as a simultaneous shift in the frequencies of multiple horizontal lines)
   - A drum hit (vertical broadband stripe)
   - A section transition (systematic change in the density or character of features)
   - A note being held and then released (bright horizontal feature fading to silence)
4. Transient density Count the number of clearly visible transient events (vertical stripes) per second in a 10-second window of each song. How does this differ across genres? Connect this observation to the physical concept of attack time developed in Chapter 8.
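The spectral centroid from question 1 is a one-line computation once you have an STFT magnitude matrix. A minimal sketch, sanity-checked on a pure tone (whose centroid must equal its frequency):

```python
import numpy as np

def spectral_centroid(magnitude, freq_axis):
    """Amplitude-weighted mean frequency, one value per STFT frame."""
    weights = magnitude.sum(axis=0)
    weights[weights == 0] = 1e-12                # guard against silent frames
    return (freq_axis[:, None] * magnitude).sum(axis=0) / weights

# Sanity check: a 1000 Hz tone lasting an exact integer number of cycles
sr, n = 44100, 4410                              # 100 cycles of 1000 Hz exactly
tone = np.sin(2 * np.pi * 1000 * np.arange(n) / sr)
mag = np.abs(np.fft.rfft(tone))[:, None]         # one "frame", shape (n//2 + 1, 1)
freqs = np.fft.rfftfreq(n, d=1/sr)
print(spectral_centroid(mag, freqs)[0])          # very close to 1000.0
```

To overlay the centroid on a spectrogram, plot it with plt.plot(time_axis, centroid) on the same axes as the pcolormesh.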
Phase 4 Exercise
Write a 500-word analysis report (one song per ~170 words) that answers all four questions above for each of your three songs. Include annotated spectrogram images as figures. The analysis should be grounded in physical concepts: use terms like harmonic series, spectral envelope, transient, sustained tone, noise floor, and fundamental frequency with precision. Avoid purely aesthetic language ("this sounds warm") without connecting it to a physical observable ("this track's spectral centroid remains below 2 kHz throughout, consistent with the dominance of low-register instruments and the reduced presence of high-frequency harmonics").
Phase 5: The Fourier Uncertainty Demonstration
Estimated time: 2–3 hours
What You Will Demonstrate
The Fourier Uncertainty Principle (Chapter 22) states that for any signal:
Δt × Δf ≥ 1/(4π) ≈ 0.08
where Δt is the temporal spread of the signal and Δf is its spectral spread (both measured as standard deviations). For audio events, this means that a very brief event (small Δt) must have a very wide frequency spread (large Δf), and vice versa.
This is not a limitation of measurement equipment. It is a mathematical theorem about the nature of waves. In this phase, you will measure it directly on real audio.
Measurement Procedure
For each acoustic event you analyze, you will:
- Isolate a short window of audio containing the event (e.g., a single drum hit or a sustained pure tone).
- Compute the temporal spread Δt: the standard deviation of the squared amplitude envelope in the time domain.
- Compute the spectral spread Δf: the standard deviation of the squared magnitude spectrum in the frequency domain.
- Calculate the product Δt × Δf and verify it is greater than or equal to 0.08.
def measure_uncertainty(audio_segment, sample_rate):
"""
Measure the time-bandwidth product of an audio event.
Parameters
----------
audio_segment : 1D numpy array, the isolated acoustic event
sample_rate : integer, samples per second
Returns
-------
delta_t : float, temporal spread (seconds)
delta_f : float, spectral spread (Hz)
product : float, delta_t * delta_f (should be >= 0.08)
"""
# TODO: Compute time axis in seconds
# t = np.arange(len(audio_segment)) / sample_rate
# TODO: Compute temporal spread
# power_env = audio_segment ** 2
# mean_t = np.sum(t * power_env) / np.sum(power_env)
# delta_t = np.sqrt(np.sum((t - mean_t)**2 * power_env) / np.sum(power_env))
# TODO: Compute frequency spectrum and spread
# Use np.fft.rfft and np.fft.rfftfreq
# Compute spectral power as |FFT|^2
# Compute mean frequency and standard deviation in the same way as time
# TODO: Return delta_t, delta_f, and their product
pass
The Experiment
Analyze at least five acoustic events spanning a wide range of durations:
- A very short drum hit (Δt ~ 5 ms)
- A medium-length plucked note (Δt ~ 50 ms)
- A sustained bowed string note (Δt ~ 500 ms)
- A pure synthesized tone (Δt ~ 200 ms)
- A consonant in speech (e.g., the 't' sound) (Δt ~ 10 ms)
For each event, record Δt, Δf, and their product in a table.
Phase 5 Exercise
Construct a scatter plot with Δt on the x-axis (log scale) and Δf on the y-axis (log scale). Plot each of your measured events as a point. Draw the theoretical lower bound curve Δf = 0.08/Δt.
Write a paragraph answering: Do your measured points lie above or below the theoretical bound? If any lie below (which would violate the theorem), investigate why — possible causes include measurement error, boundary effects, or non-Gaussian signal shape. Which of your events comes closest to the lower bound (the "minimum uncertainty" signal)? What does it sound like, and why might evolution or musical practice select for signals near the uncertainty limit?
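Before measuring real events, validate your measure_uncertainty implementation on a signal with a known answer. A Gaussian-windowed tone is the minimum-uncertainty signal: its time-bandwidth product should land almost exactly on the 1/(4π) ≈ 0.0796 bound. The helper below is a self-contained sketch of the same computation your function should perform, using the standard-deviation definitions given above:

```python
import numpy as np

def time_bandwidth_product(x, sr):
    """Delta_t, delta_f, and their product for an isolated event."""
    t = np.arange(len(x)) / sr
    p = x ** 2                                    # squared amplitude envelope
    mean_t = np.sum(t * p) / np.sum(p)
    delta_t = np.sqrt(np.sum((t - mean_t) ** 2 * p) / np.sum(p))
    spec = np.abs(np.fft.rfft(x)) ** 2            # squared magnitude spectrum
    f = np.fft.rfftfreq(len(x), d=1/sr)
    mean_f = np.sum(f * spec) / np.sum(spec)
    delta_f = np.sqrt(np.sum((f - mean_f) ** 2 * spec) / np.sum(spec))
    return delta_t, delta_f, delta_t * delta_f

# Gaussian envelope (sigma_t = 10 ms) on a 2 kHz carrier: the
# minimum-uncertainty signal, so the product should sit near 1/(4*pi).
sr = 44100
t = np.arange(sr) / sr
sigma_t = 0.01
x = np.exp(-(t - 0.5) ** 2 / (4 * sigma_t ** 2)) * np.cos(2 * np.pi * 2000 * t)

dt, df, product = time_bandwidth_product(x, sr)
print(round(dt, 4), round(df, 2), round(product, 3))
# dt ~ 0.01 s, df ~ 7.96 Hz, product ~ 0.080 (the 1/(4*pi) bound)
```

A drum hit or plucked note measured the same way should give a product comfortably above this bound.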
Deliverables and Grading Rubric
Submit all of the following:
1. Working Python Script (40 points)
A single .py file containing your complete STFT implementation and all analysis code, clearly commented. The script must run without errors on a fresh Python environment with the required libraries installed. Code quality (clarity, documentation, logical organization) will be assessed.
| Sub-criterion | Points |
|---|---|
| STFT implementation from scratch runs correctly | 15 |
| Spectrogram visualization function produces labeled, readable plots | 10 |
| Uncertainty measurement function runs correctly | 10 |
| Code is commented and organized | 5 |
2. Written Analysis Report: Three Songs (30 points)
A written document (minimum 500 words) analyzing your three chosen songs using their spectrograms, addressing all four questions from Phase 4. Must include annotated spectrogram images.
| Sub-criterion | Points |
|---|---|
| Physical precision of language and analysis | 10 |
| Correct identification and annotation of spectral features | 10 |
| Connection to textbook concepts (chapters cited) | 5 |
| Quality and clarity of spectrogram images | 5 |
3. Uncertainty Principle Demonstration (20 points)
The scatter plot from Phase 5, a table of all measured events, and a written paragraph interpreting the results.
| Sub-criterion | Points |
|---|---|
| At least 5 events measured correctly | 10 |
| Scatter plot with theoretical bound clearly drawn | 5 |
| Written interpretation connecting to Chapter 22 | 5 |
4. Presentation / Discussion (10 points)
A 5–10 minute in-class or recorded presentation in which you walk through one spectrogram, identify 3+ physical features, and demonstrate the uncertainty trade-off by playing audio examples at different time-frequency resolutions.
Extension Challenges (Optional)
These extensions are not required for full credit but are strongly recommended for students considering further work in audio engineering, computational musicology, or physics research:
Real-Time Spectrogram
Using Python's sounddevice or pyaudio library, implement a real-time spectrogram that analyzes microphone input as it arrives. This requires computing the STFT on overlapping audio buffers as they stream in, not after the full recording is available. The key challenge is maintaining consistent frame timing.
Mel-Scale Spectrogram
The mel scale is a perceptually motivated frequency scale that compresses the high-frequency range (where human pitch discrimination is coarser) and expands the low-frequency range. Mel-scale spectrograms (mel spectrograms) are the standard input representation for AI music systems, speech recognition, and music information retrieval algorithms. Implement a mel filterbank and apply it to your linear-frequency STFT. Compare the two representations on the same audio and explain why the mel scaling is appropriate given the auditory physiology covered in Chapter 26.
STFT vs. Continuous Wavelet Transform
The Continuous Wavelet Transform (CWT) uses different basis functions than the STFT and achieves a different time-frequency resolution trade-off: high time resolution at high frequencies, high frequency resolution at low frequencies. This matches the frequency-dependent resolution of the auditory system more closely than the STFT. Use a CWT implementation such as PyWavelets' pywt.cwt (note that scipy.signal.cwt was deprecated and has been removed from recent SciPy releases) to compute the CWT of one of your audio clips and compare the result to the STFT spectrogram. Which representation makes the harmonic structure more visible? Which makes transients more visible?
Python Starter Code
The following skeleton provides the complete file structure. Replace each # TODO block with working code. Do not modify the function signatures or return types.
"""
Capstone 1: Spectrogram Analyzer
Physics of Music — Student Starter Code
Replace all # TODO blocks with working implementations.
Do not modify function signatures or return value structures.
"""
import numpy as np
import scipy.signal
import scipy.fft
import matplotlib.pyplot as plt
import librosa
import librosa.display
# ─────────────────────────────────────────────
# PHASE 2: STFT Implementation
# ─────────────────────────────────────────────
def stft_from_scratch(audio, sample_rate, frame_length=2048,
hop_length=512, window='hann'):
"""Compute STFT. See Phase 2 for full docstring."""
# TODO: Get window function
win = scipy.signal.get_window(window, frame_length)
# TODO: Pad audio and calculate n_frames
n_frames = 1 + (len(audio) - frame_length) // hop_length
stft_matrix = np.zeros((frame_length // 2 + 1, n_frames), dtype=complex)
# TODO: Main STFT loop
for i in range(n_frames):
start = i * hop_length
frame = audio[start: start + frame_length]
windowed = frame * win
stft_matrix[:, i] = np.fft.rfft(windowed)
# TODO: Compute time and frequency axes
time_axis = np.array([]) # replace with real computation
freq_axis = np.array([]) # replace with real computation
return stft_matrix, time_axis, freq_axis
# ─────────────────────────────────────────────
# PHASE 3: Visualization
# ─────────────────────────────────────────────
def plot_spectrogram(stft_matrix, time_axis, freq_axis,
title='Spectrogram', fmax=8000, db_range=80):
"""Visualize STFT as spectrogram. See Phase 3 for full docstring."""
# TODO: Magnitude to dB
# TODO: Frequency trimming
# TODO: pcolormesh plot with labels, colorbar, title
pass
# ─────────────────────────────────────────────
# PHASE 5: Uncertainty Measurement
# ─────────────────────────────────────────────
def measure_uncertainty(audio_segment, sample_rate):
"""Measure time-bandwidth product. See Phase 5 for full docstring."""
# TODO: time axis
# TODO: temporal spread (delta_t)
# TODO: frequency spectrum and spectral spread (delta_f)
# TODO: return delta_t, delta_f, delta_t * delta_f
pass
# ─────────────────────────────────────────────
# MAIN ANALYSIS PIPELINE
# ─────────────────────────────────────────────
if __name__ == "__main__":
# --- Load audio ---
# TODO: Replace with your own audio file path or use librosa.ex()
y, sr = librosa.load(librosa.ex('trumpet'), sr=None)
print(f"Loaded audio: {len(y)} samples at {sr} Hz")
# --- Phase 2: Compute STFT ---
S, t, f = stft_from_scratch(y, sr)
print(f"STFT shape: {S.shape} ({S.shape[1]} frames, {S.shape[0]} frequency bins)")
# --- Phase 3: Visualize ---
plot_spectrogram(S, t, f, title="My First Spectrogram")
    # --- Phase 4: Compare to librosa reference ---
    # center=False matches our left-aligned framing (librosa pads and
    # centers each frame by default, which would shift every column).
    S_ref = librosa.stft(y, n_fft=2048, hop_length=512, center=False)
    # TODO: Print the maximum absolute difference between your STFT and librosa's
    # This verifies your implementation is numerically correct.
# --- Phase 5: Uncertainty measurement ---
# TODO: Select a short audio segment containing a single event
# TODO: Call measure_uncertainty() and print results
# TODO: Repeat for at least 5 different events
# TODO: Collect results and make the scatter plot