Case Study 6.2: The Spectrogram of a Human Voice — Reading the Harmonic Series in Speech

A Window Into Sound

In the middle of the twentieth century, Bell Telephone Laboratories developed a device they called the "visible speech" machine — the spectrograph. It produced paper printouts that translated speech sounds into visual patterns: a vertical axis showing frequency, a horizontal axis showing time, and the density or color of the trace showing amplitude. For the first time in history, it was possible to see what speech sounded like.

Linguists, speech therapists, and engineers immediately recognized the value of the spectrogram. But for students of the harmonic series, the spectrogram is something more profound: it is a direct visual display of how the harmonic series structures every human utterance. Every vowel, every consonant, every song is a pattern of harmonics, visible and readable once you know what you are looking at.

What a Vowel Looks Like

Take the vowel sound "ah", the vowel of the open mouth and one of the earliest vowel sounds infants produce. Record someone singing "ah" on a sustained pitch, produce a spectrogram, and you will see a remarkable structure.

Along the vertical axis, at regular frequency intervals, you will see horizontal bands of energy — a series of bright lines, one above the other, each separated from the next by the same frequency gap. These are the harmonics of the singer's fundamental pitch. If the singer is holding A4 (440 Hz), the harmonics appear at 440, 880, 1320, 1760, 2200, 2640 Hz, and so on — the integer multiples of the fundamental, visible as evenly spaced (in terms of frequency on a linear scale) horizontal stripes.
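The arithmetic behind those stripes can be sketched in a few lines of Python. This is only an illustration of the integer-multiple rule described above; the function name `harmonic_series` is a hypothetical helper, not part of any spectrogram tool.

```python
def harmonic_series(fundamental_hz, count):
    """Return the first `count` harmonics (integer multiples) of a fundamental."""
    return [n * fundamental_hz for n in range(1, count + 1)]


a4_harmonics = harmonic_series(440, 6)
print(a4_harmonics)  # [440, 880, 1320, 1760, 2200, 2640]

# The gap between neighboring stripes is constant and equals the fundamental,
# which is why the stripes look evenly spaced on a linear frequency axis.
gaps = [hi - lo for lo, hi in zip(a4_harmonics, a4_harmonics[1:])]
print(gaps)  # [440, 440, 440, 440, 440]
```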

But the harmonics are not all equally bright. Certain harmonics are much more prominent: their stripes are brighter and denser than those of their neighbors. For "ah", the brightest regions are centered at approximately 800–1000 Hz and 1200–1400 Hz. These regions mark the formants, the resonant frequencies of the vocal tract, which amplify whichever harmonics happen to fall near them.

Now have the same singer sustain the vowel "ee" (as in "see") on the same pitch. The harmonic stripes will still be there, still at 440, 880, 1320, 1760 Hz, because the vocal folds are still vibrating at the same rate. But the bright bands will have moved. The first formant (F1) drops to approximately 300–400 Hz, and the second formant (F2) rises dramatically to approximately 2200–2500 Hz. The harmonic source is identical; the formant pattern is completely different. That difference is the vowel.
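The idea that the same harmonics are reshaped by different formants can be sketched as a toy source-filter model. The formant centers below are the midpoints of the ranges quoted above; the bell-shaped gain curve, the 200 Hz bandwidth, and the 0.25 "bright" threshold are assumptions chosen for illustration, not measured properties of any voice.

```python
def resonance_gain(freq_hz, center_hz, bandwidth_hz):
    """Bell-shaped gain: 1.0 at the formant center, 0.5 half a bandwidth away."""
    offset = (freq_hz - center_hz) / (bandwidth_hz / 2)
    return 1.0 / (1.0 + offset ** 2)


def bright_harmonics(f0_hz, formants, count=6, bandwidth_hz=200.0, threshold=0.25):
    """List the harmonics of f0 that some formant boosts above the threshold."""
    bright = []
    for n in range(1, count + 1):
        freq = n * f0_hz
        gain = max(resonance_gain(freq, center, bandwidth_hz) for center in formants)
        if gain >= threshold:
            bright.append(freq)
    return bright


# (F1, F2) midpoints of the ranges given in the text, in Hz.
AH = (900.0, 1300.0)
EE = (350.0, 2350.0)

print(bright_harmonics(440, AH))  # [880, 1320]
print(bright_harmonics(440, EE))  # [440, 2200]
```

The source (440 Hz and its multiples) is identical in both calls; only the formant pair changes, and with it the set of harmonics that stand out.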

Formant Frequencies and the Vowel Space

The International Phonetic Alphabet (IPA) classifies vowels using articulatory descriptions: high, mid, or low (tongue and jaw height); front, central, or back (tongue position). Remarkably, these articulatory categories map closely, though not perfectly, onto the acoustic properties of formants:

F1 (first formant) correlates with jaw height: low vowels (wide open mouth, as in "ah") have a high F1; high vowels (nearly closed mouth, as in "ee" or "oo") have a low F1.
F2 (second formant) correlates with tongue position front-to-back: front vowels (tongue pushed forward, as in "ee") have a high F2; back vowels (tongue pulled back, as in "oo") have a low F2.

This two-dimensional space, with F1 on one axis and F2 on the other, is called the acoustic vowel space or F1-F2 space. Every language's vowel system can be plotted in this space, and the distribution of a language's vowels within the space is remarkably systematic. Languages tend to spread their vowels widely across the acoustic space, which increases the perceptual distance between vowels and reduces the chance of confusion. English has 12–15 distinct vowel sounds (depending on dialect); Hawaiian has only 5 vowel qualities. Both systems spread their vowels across the same acoustic space, just at different densities.
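One way to make the F1-F2 space concrete is a nearest-center lookup: plot a few reference vowels as points and label a measurement by whichever point is closest. The sketch below assumes straight-line distance in hertz, which is a simplification (perceptual frequency scales are usually preferred), and its reference values are illustrative: "ah" and "ee" use midpoints of the ranges quoted earlier, while the "oo" value is an assumption chosen only to be low in both F1 and F2.

```python
import math

# Illustrative (F1, F2) centers in Hz. Only "ah" and "ee" come from the text.
VOWEL_CENTERS = {
    "ah": (900.0, 1300.0),
    "ee": (350.0, 2350.0),
    "oo": (350.0, 900.0),  # assumed value, not taken from the text
}


def classify_vowel(f1_hz, f2_hz, centers=VOWEL_CENTERS):
    """Return the vowel whose (F1, F2) center is closest to the measurement."""
    point = (f1_hz, f2_hz)
    return min(centers, key=lambda vowel: math.dist(point, centers[vowel]))


print(classify_vowel(850, 1250))  # ah
print(classify_vowel(320, 2300))  # ee
print(classify_vowel(380, 1000))  # oo
```

A language with more vowels would simply place more points in the same space, so each measurement has less room for error before it lands nearer a neighboring vowel.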

The Harmonic Series and Speech Intelligibility

The harmonic structure of voiced speech is not merely an acoustic curiosity. It is the physical foundation of intelligibility — the reason we can understand each other at all.

When you speak or sing, the harmonics of your voice are your raw materials. The formants select which harmonics to amplify. The listener's auditory system reads the formant pattern from the amplified harmonics and decodes the vowel. This chain works best when the voice has a harmonic spectrum in the first place. If the voice produced an irregular scatter of unrelated (inharmonic) frequencies rather than integer multiples, the formant peaks would be sampled unevenly, and the listener could extract vowel identity far less reliably.

This is why whispered speech, which has a non-periodic, noise-like source (turbulent air at the open glottis rather than the periodic buzz of vibrating vocal folds), is less intelligible than normal voiced speech, particularly over distance or in noise. Whispering removes the harmonic structure that makes formant reading efficient. The formants are still there, excited by noise, but they stand out less clearly.
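The difference between the two sources can be sketched numerically. The example below builds a periodic "voiced" source and a noise "whispered" source, then compares energy at the harmonic frequencies with energy midway between them. The sample rate, duration, 200 Hz fundamental, and the `harmonic_contrast` measure are all assumptions made for this illustration; it models only the source, not the vocal tract or intelligibility itself.

```python
import math
import random

SAMPLE_RATE = 8000               # Hz (assumed)
N = SAMPLE_RATE // 4             # 0.25 s of samples


def voiced_source(f0_hz, n_harmonics=8):
    """Periodic source: equal-amplitude sines at integer multiples of f0."""
    return [
        sum(math.sin(2 * math.pi * k * f0_hz * t / SAMPLE_RATE)
            for k in range(1, n_harmonics + 1))
        for t in range(N)
    ]


def whispered_source(seed=0):
    """Noise-like source: independent Gaussian samples, no periodicity."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(N)]


def energy_at(signal, freq_hz):
    """Squared magnitude of the signal's Fourier sum at a single frequency."""
    step = 2 * math.pi * freq_hz / SAMPLE_RATE
    re = sum(x * math.cos(step * t) for t, x in enumerate(signal))
    im = sum(x * math.sin(step * t) for t, x in enumerate(signal))
    return re * re + im * im


def harmonic_contrast(signal, f0_hz, n_harmonics=8):
    """Mean energy on the harmonics divided by mean energy midway between them."""
    on = [energy_at(signal, k * f0_hz) for k in range(1, n_harmonics + 1)]
    off = [energy_at(signal, (k + 0.5) * f0_hz) for k in range(1, n_harmonics + 1)]
    # The small constant guards against division by zero for a perfectly periodic signal.
    return (sum(on) / len(on)) / (sum(off) / len(off) + 1e-12)


print("voiced:   ", harmonic_contrast(voiced_source(200), 200))
print("whispered:", harmonic_contrast(whispered_source(), 200))
```

The voiced source concentrates essentially all of its energy on the harmonics, so its contrast is enormous; the noise source spreads energy roughly evenly, so its contrast stays close to 1.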

It may also help explain why spaces built or prized for the voice, such as the ancient Greek theater at Epidaurus, the Sydney Opera House concert hall, and the nave of Notre Dame de Paris, tend to support the frequency range where speech harmonics and formants are densest (roughly 500–4000 Hz). In this sense, architecture and acoustics have co-evolved to serve the harmonic structure of the human voice.

Speech vs. Song: How the Harmonic Series Shifts

When we move from speech to singing, the harmonic series remains the same physical object — but the relationship between harmonics and formants changes in important ways.

In speech, the fundamental frequency varies rapidly — from approximately 85 Hz (low male voice) to 255 Hz (high female voice) in normal conversation, with fast pitch glides that carry the prosodic information (statement vs. question, emphasis, emotion). The formants remain relatively stable across this pitch range, defining the vowel identity.

In singing, the fundamental frequency is held much more precisely on target pitches, and the harmonics are consequently cleaner and more stable. Professional singers learn to exploit the relationship between their fundamental and their formants in ways that casual speakers do not. Two strategies are particularly important:

Singer's Formant: In trained classical and operatic singers, a cluster of higher formants (F3, F4, F5) merges into a single powerful resonance peak around 2500–3000 Hz, the frequency range where the orchestra is relatively weak and the human auditory system is most sensitive. This "singer's formant" allows a singer to project over a full orchestra in a large hall without amplification. It is created by specific training-induced changes to the laryngeal and pharyngeal configuration. Not all vocal traditions cultivate the singer's formant: pop singers, who rely on microphone amplification, and overtone singers, who pursue very different spectral goals, shape the vocal tract in other ways.

Formant Tuning: Some singers, particularly operatic sopranos at the upper end of their range, use a technique called formant tuning: they adjust a formant frequency (most often the first formant) to coincide with a harmonic of the pitch they are singing, often the fundamental itself, producing an extremely bright, resonant sound. At high pitches the harmonics are widely spaced, and a formant that falls midway between two harmonics amplifies neither effectively. By "tuning" the formant to lock onto a harmonic, the singer maximizes vocal power. This is an example of a performer exploiting the structure of the harmonic series, using physics to serve musical expression.
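The spacing problem that formant tuning solves can be sketched with simple arithmetic: for a fixed formant, find the nearest harmonic of the sung pitch and see how far the formant would have to move to land on it. The 900 Hz formant and the three pitches below are illustrative assumptions, and both function names are hypothetical.

```python
def nearest_harmonic(f0_hz, target_hz):
    """Frequency of the harmonic of f0 that lies closest to target_hz."""
    n = max(1, round(target_hz / f0_hz))
    return n * f0_hz


def tuning_offset(f0_hz, formant_hz):
    """Signed shift in Hz that would move the formant onto its nearest harmonic."""
    return nearest_harmonic(f0_hz, formant_hz) - formant_hz


FORMANT_HZ = 900  # assumed first-formant value for an open vowel

for f0 in (130, 260, 700):
    print(f0, tuning_offset(f0, FORMANT_HZ))
# 130 10
# 260 -120
# 700 -200
```

At a low pitch some harmonic always lies close to the formant, so little adjustment is needed. As the pitch rises, the worst-case gap grows to half the fundamental, which is why the technique matters most at the top of the range.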

Reading Voice Spectrograms: A Practical Guide

Modern spectrogram software (Praat, Sonic Visualiser, Adobe Audition) makes it possible to produce and read spectrograms easily. Here is what to look for:

Voiced vs. Voiceless sounds: Voiced sounds (vowels, voiced consonants like /b/, /d/, /g/, /v/, /z/) show a regular, striped harmonic structure. Voiceless sounds (fricatives like /s/, /f/, /ʃ/ as in "she"; stops /p/, /t/, /k/) show irregular noise patterns or silence.

Fundamental frequency (F0): In a narrowband spectrogram, the lowest bright stripe in a voiced sound is the fundamental. Its frequency tells you the speaker's current pitch, and the spacing between neighboring stripes gives the same value.

Formants: Look for broad, bright bands where multiple harmonics are amplified together. These are the formants. Count F1 (lowest band) and F2 (next band) to identify the vowel.

Transitions: The most acoustically and perceptually informative moments in speech are the transitions, when formant frequencies sweep rapidly up or down as the vocal tract changes shape. A /b/ before a vowel shows the formants sweeping upward rapidly. A /g/ typically shows F2 and F3 starting close together and then separating as the vowel begins. The harmonic series provides the background canvas against which these formant transitions are drawn.
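The fundamental-frequency step in the guide above can be turned into a small calculation. Because harmonics are evenly spaced, the gap between neighboring stripes gives F0 even when the lowest stripe is faint or missing, as it can be on a band-limited recording. The sketch below assumes you have read a handful of stripe frequencies off a narrowband spectrogram by eye; the readings are made-up values and `estimate_f0` is a hypothetical helper, not a Praat or Sonic Visualiser function.

```python
from statistics import median


def estimate_f0(stripe_freqs_hz):
    """Estimate F0 as the median gap between neighboring harmonic stripes."""
    ordered = sorted(stripe_freqs_hz)
    if len(ordered) < 2:
        raise ValueError("need at least two stripes to measure a gap")
    gaps = [hi - lo for lo, hi in zip(ordered, ordered[1:])]
    return median(gaps)


# Hypothetical readings in Hz; the fundamental itself is not among them.
readings = [442, 655, 881, 1098, 1322]
print(estimate_f0(readings))  # 220.5
```

Using the median rather than the mean keeps a single misread stripe from skewing the estimate too far.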

The spectrogram is not just a scientific tool. It is, in a genuine sense, a portrait of the harmonic series at work in the most intimate instrument any human possesses.

Discussion Questions

Record yourself saying your own name aloud and make a spectrogram of the recording (you can use Praat, which is free). Identify the vowels in your name. For each vowel, estimate F1 and F2. Do these values match the expected formant frequencies for those vowels?
Why is the spectrogram described as "visible speech"? What kinds of speech analysis become possible with spectrograms that were impossible before their development?
The singer's formant allows operatic singers to project over orchestras without amplification. Modern pop singers do not typically cultivate the singer's formant. What technological and cultural factors explain this difference in technique?
Speech intelligibility depends on the harmonic structure of the voice. What implications does this have for the design of audio compression systems (like MP3) that reduce the amount of data in recorded sound? What acoustic information must be preserved to maintain speech intelligibility?
Compare the acoustic vowel space approach to vowel classification with the articulatory (tongue/jaw position) approach. What are the advantages and disadvantages of each? Why do linguists use both?