Capstone Project 3: The Spoken-Word Production
The podcast and content path — Aisha's road. One complete spoken-word piece, 10–20 minutes, produced to professional loudness and clarity specs. In music, the voice is the lead instrument. In spoken word, the voice is the whole record, and the production succeeds precisely when nobody notices it happened.
Overview & Outcomes
Produce one finished spoken-word audio piece — a podcast episode, a narrated essay, or a short audio documentary — 10 to 20 minutes long, publish-ready, with the measurements to prove it.
Say this plainly before anything else: this is not the consolation path. It's the same physics, the same gates, and arguably the strictest listening conditions in the book — spoken word gets consumed on earbuds in traffic, on phone speakers over dishwashers, at low volume next to sleeping children. A music mix that translates is impressive. A voice that stays intelligible and comfortable across all of that, for twenty minutes, without the listener once touching the volume — that's engineering. It's also where an enormous share of working audio jobs actually live, which is why Aisha's arc got equal billing with Jaylen's for forty chapters.
The skills travel both directions, too. If you came to this book for music, this road sharpens exactly the muscles your vocal productions lean on — voice clarity at conversational level, ducking instinct, loudness literacy — and it does so under a microscope, because a twenty-minute piece with one voice has nowhere for sloppy craft to hide. If you came for the show, this is your capstone in the most literal sense: the thing you ship next, produced to a standard your audience will feel before they can name it.
By the end, you'll have demonstrated that you can:
- capture broadcast-quality voice in a treated domestic space (Chapters 10–11);
- run a dialogue edit that serves the content first and hides its seams completely (Chapter 15, adapted);
- build a voice chain that holds a voice steady and present for the full runtime (Chapters 22–23, with Chapter 29's logic adapted);
- place music beds and duck them so they support the voice without ever competing (Chapters 24, 27–28);
- conform a full episode to -16 LUFS stereo with true-peak compliance, measured rather than guessed (Chapter 33);
- translation-test for spoken word's real listening environments (Chapter 30, adapted);
- and assemble publish-ready metadata plus a production log that could become a weekly system (Chapters 35 and 19).
Prerequisites
The load-bearing chapters: Parts I–II (Chapters 1–10) for fundamentals, tools, and the room; Chapter 11 for voice capture; Chapter 15 for editing — the heart of this project; Chapters 22–23 for EQ and compression on the voice (Aisha's case studies in both are your previews); Chapter 19 for hygiene, and for collaboration if your format includes guests; Chapters 27–28 for rides and ducking; Chapter 29 for the chain-building logic, adapted to speech; Chapter 30 for translation; Chapter 33 for loudness — this road's crown jewel, as Chapter 39 put it; and Chapter 35 for metadata, adapted to publishing. Chapter 24 supports (space used with restraint), and Chapter 34 feeds the optional binaural variant. Chapters 13–14, 16–18, 25–26, and 31–32 inform the work but aren't load-bearing here.
Choose your format up front, because scope lives there: solo narration is the cleanest; an interview adds multi-voice leveling (the whiplash problem that was Aisha's founding wound); a documentary adds field audio and sound design, and is this brief's hard mode.
The Brief
Produce one complete spoken-word piece, 10–20 minutes finished, and deliver the following five items.
- The publish-ready episode. A stereo WAV master plus the distribution encode your hosting target actually takes. If you publish it (encouraged, but optional), the live link joins the package.
- The loudness QA report. Measured on the full episode, written down: integrated LUFS at your declared target — -16 LUFS stereo is the working convention as of this writing (Chapter 33); name your platform's guidance if it differs — and maximum true peak ≤ -1.0 dBTP (some spoken-word specs ask for extra headroom before lossy encoding; state what you chose and why). Plus a segment-consistency check: voice-to-voice and segment-to-segment levels that never make a commuter ride the volume knob. The report is the deliverable that proves the road's central skill.
- The session, in pro hygiene. Chapter 19 standard, plus this road's specifics: markers at every segment, tracks named per voice, bed, and effect, and the content edit and technical edit saved as separate versions so the structural decisions stay visible.
- The production log. Every stage with the ONE key decision and its reasoning — including the edit's cut list: what you removed, and why. In spoken word, the deletions are the production.
- The metadata package. Episode title, description and show notes, credits, an artwork spec note, chapter markers if used — and licensing documentation for every music bed: original (Chapter 14), library-licensed, or otherwise cleared (Chapter 36 awareness). Nothing "borrowed."
A note on the encode: listen to it. The WAV is your master, but the encode is what the audience receives, and lossy compression occasionally takes a swing at sibilance and bed cymbals — Chapter 2's what-lossy-loses lesson, now personal. The QA report should say the encode was auditioned end to end, not merely generated.
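The loudness QA report in the brief is, at bottom, a handful of numbers checked against a handful of limits. Here is a minimal Python sketch of that check, under stated assumptions: the readings are typed in from whatever loudness meter you trust (the code measures nothing itself), the class and field names are hypothetical, and the 0.5 LU tolerance is an illustrative choice, not a figure from this book or from any platform spec.

```python
from dataclasses import dataclass


@dataclass
class LoudnessQA:
    """Meter readings for one full episode, copied in by hand."""
    integrated_lufs: float       # integrated loudness over the whole episode
    max_true_peak_dbtp: float    # highest true-peak reading
    encode_auditioned: bool      # was the lossy encode listened to end to end?
    target_lufs: float = -16.0   # working stereo convention named in the brief
    peak_ceiling_dbtp: float = -1.0
    tolerance_lu: float = 0.5    # ASSUMPTION: the brief states no tolerance

    def gain_to_target_db(self) -> float:
        """Static gain in dB that would move the episode onto the target."""
        return self.target_lufs - self.integrated_lufs

    def peak_after_gain_dbtp(self) -> float:
        """A static gain shifts the true peak by the same number of dB."""
        return self.max_true_peak_dbtp + self.gain_to_target_db()

    def failures(self) -> list:
        """Return a list of plain-language problems; empty means the report passes."""
        problems = []
        offset = self.integrated_lufs - self.target_lufs
        if abs(offset) > self.tolerance_lu:
            problems.append(f"integrated loudness is {offset:+.1f} LU from target")
        if self.max_true_peak_dbtp > self.peak_ceiling_dbtp:
            problems.append(
                f"true peak {self.max_true_peak_dbtp:.1f} dBTP exceeds "
                f"{self.peak_ceiling_dbtp:.1f} dBTP"
            )
        if not self.encode_auditioned:
            problems.append("encode generated but not auditioned end to end")
        return problems


if __name__ == "__main__":
    qa = LoudnessQA(integrated_lufs=-17.2, max_true_peak_dbtp=-2.4,
                    encode_auditioned=True)
    print(qa.failures())
    print(round(qa.gain_to_target_db(), 1), round(qa.peak_after_gain_dbtp(), 1))
```

In the example, an episode measured at -17.2 LUFS needs about 1.2 dB of gain, which would lift a -2.4 dBTP peak to about -1.2 dBTP. The arithmetic only predicts the result; re-measure after applying the gain, and write the re-measured numbers in the report.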
Milestones
Suggested span: 4–6 weeks. Spoken word runs shorter than the music capstones because the mix surface is smaller — but the edit is bigger than you think, and the schedule should say so.
| # | Milestone | The work | Chapters | Week (of 6) |
|---|---|---|---|---|
| 1 | Format design & script | The format document — this road's manifesto: what the piece is, who it's for, what it promises sonically; script or outline; three reference episodes chosen and profiled | 4, 5 | 1 |
| 2 | Capture plan & room pass | Mic choice (dynamic-in-an-untreated-room wisdom), position, the deadest spot in the house, blanket-fort legitimacy; recording peaks targeted at -18 to -12 dBFS | 7, 10, 11 | 1 |
| 3 | Voice capture | Performance production on yourself or guests: read-throughs, marked script, retakes as full paragraphs rather than single words; 30–60 seconds of deliberate room tone; interviews with each voice on its own track | 11 | 2 |
| 4 | The content edit | Structure first: cuts, reorders, the kill-your-darlings pass — this is arrangement, and every minute defends its existence like a part | 15, 16 | 2–3 |
| 5 | The technical edit | Breaths reduced (never deleted), de-click, gap discipline, crossfades on every cut, room-tone patching until the seams vanish | 15 | 3 |
| 6 | The voice chain & beds | High-pass discipline; boxiness and mud cuts (the 200–600 Hz neighborhood); de-essing; two gentle compression stages; presence without harshness; beds chosen or made, carved out of the speech band, ducked by rides or sidechain — choose deliberately and document the choice | 22–24, 27–29 | 3–4 |
| 7 | Conform & QA | Loudness conform to target; true-peak check; segment-consistency pass; translation tests — earbuds, car, phone speaker, low volume, and the noisy-kitchen test | 30, 33 | 4–5 |
| 8 | Publish package | Metadata, show notes, artwork; the top-and-tail seam listen; the encode checked end to end; publish or schedule | 35 | 5–6 |
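Milestone 2's level target is easy to check by arithmetic. The sketch below converts a sample peak to dBFS and compares it with the -18 to -12 dBFS window from the table. It assumes floating-point samples normalized so that full scale is 1.0, and the function names are hypothetical. It reports sample peak, not true peak, so it belongs at the capture stage and not in the final QA report.

```python
import math


def peak_dbfs(samples):
    """Sample-peak level in dBFS for floats where full scale is 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return float("-inf")  # digital silence has no finite dBFS value
    return 20.0 * math.log10(peak)


def capture_level_verdict(samples, low_dbfs=-18.0, high_dbfs=-12.0):
    """Compare a test recording's peak with the capture window."""
    level = peak_dbfs(samples)
    if level > high_dbfs:
        return "too hot"
    if level < low_dbfs:
        return "too quiet"
    return "in window"


if __name__ == "__main__":
    test_take = [0.0, 0.12, -0.2, 0.05]   # made-up sample values
    print(round(peak_dbfs(test_take), 1), capture_level_verdict(test_take))
```

A peak amplitude of 0.2 works out to roughly -14 dBFS, inside the window; 0.5 is about -6 dBFS and would call for less preamp gain.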
Two scheduling truths from this road. First: the content edit and the technical edit are different jobs done with different ears — structure decisions and seam decisions corrupt each other when interleaved, so milestone them separately and keep the versions separate. Second: the ducking decision (rides versus sidechain, Chapter 27 versus Chapter 28) is a real fork — rides give you taste at the cost of time, sidechain gives you consistency at the cost of character — and the log entry explaining your pick is exactly the kind of reasoning the rubric pays for.
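To make the sidechain side of that fork concrete, here is a rough sketch of what a ducker does: watch the voice level block by block, pull the bed down when speech is present, and let it back up when the voice stops. Every number in it (threshold, duck depth, attack and release coefficients) is an illustrative assumption, and the function name is hypothetical. Rides have no equivalent sketch, because a ride is a human decision drawn by hand.

```python
def duck_gains_db(voice_levels_db, threshold_db=-40.0, duck_db=-12.0,
                  attack=0.5, release=0.1):
    """Return one bed gain in dB per block of measured voice level.

    attack and release are smoothing fractions per block (0..1): a higher
    attack drops the bed faster, a lower release brings it back more slowly.
    All defaults are assumptions chosen for illustration.
    """
    gains = []
    current = 0.0  # bed starts at unity gain
    for level in voice_levels_db:
        # Voice above threshold means the bed should head for the ducked level.
        target = duck_db if level > threshold_db else 0.0
        # Moving downward uses the attack rate, moving upward the release rate.
        coeff = attack if target < current else release
        current += coeff * (target - current)
        gains.append(current)
    return gains


if __name__ == "__main__":
    # Made-up block levels: silence, three blocks of speech, silence again.
    voice = [-60.0, -20.0, -20.0, -20.0, -60.0, -60.0]
    print([round(g, 2) for g in duck_gains_db(voice)])
```

The slow release is where the consistency, and the loss of character, both come from: the bed recovers at the same rate after every phrase, whether or not that phrase deserved it.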
Two capture details earn their own sentences as well. Room tone — 30 to 60 seconds of your space doing nothing, recorded deliberately at session gain — is the patching material every invisible edit depends on; without it, every cut either jumps or hisses. And paragraph-level retakes beat word-level punch-ins for the same reason full vocal takes beat early surgical comping in Chapter 11: cadence is performance, and a single re-recorded word almost never matches the breath around it.
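The patching itself comes down to a crossfade: the outgoing audio fades down while the room tone (or the next clip) fades up over the same few milliseconds. Below is a minimal sketch of one common fade shape, the equal-power crossfade. The choice of curve is an assumption; the book's editing chapter may prefer another, and any DAW offers several. The function name is hypothetical.

```python
import math


def equal_power_crossfade(tail, head):
    """Blend the end of one clip into the start of the next, sample by sample.

    tail: the last N samples of the outgoing clip
    head: the first N samples of the incoming clip (for example, room tone)
    """
    if len(tail) != len(head):
        raise ValueError("tail and head must be the same length")
    n = len(tail)
    blended = []
    for i in range(n):
        t = (i + 0.5) / n                      # position through the fade, 0..1
        fade_out = math.cos(t * math.pi / 2)   # outgoing gain falls toward 0
        fade_in = math.sin(t * math.pi / 2)    # incoming gain rises toward 1
        blended.append(tail[i] * fade_out + head[i] * fade_in)
    return blended


if __name__ == "__main__":
    outgoing = [1.0, 1.0, 1.0, 1.0]   # made-up sample values
    room_tone = [0.0, 0.0, 0.0, 0.0]
    print([round(x, 3) for x in equal_power_crossfade(outgoing, room_tone)])
```

The two gains are built so their squares always sum to one, which keeps the combined power steady through the seam when the two sides are unrelated signals, as a voice take and a stretch of room tone usually are.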
Constraints & Variations
The brief is format-agnostic within spoken word: episode, essay, or documentary all satisfy it. The constraints that don't bend: 10–20 minutes finished, real loudness measurement, and licensing documentation for anything you didn't make.
Whatever the format, profile your three references the way Chapter 4 taught for records: how loud do their beds sit under speech, how much room sound does the voice carry, how fast do they cut, what does their first thirty seconds promise? Spoken word has conventions exactly as binding as trap hi-hats or country snares — Aisha's "why does country snare sound like THAT" question has a podcast equivalent — and your listeners hold those expectations even though nobody wrote them down.
The three reader archetypes. The actual podcaster or content creator should use a real episode of a real show — but the log and QA report are still due, and the format document gets written even if the show predates it (Aisha revisits hers quarterly; the assignment is the excuse to write it down). The bedroom producer should take this road seriously rather than tourist through it: voice clarity, ducking instinct, and loudness literacy transfer directly back to vocal production — and pick a subject you'd genuinely listen to, because boredom is audible. The band or musician archetype has a purpose-built variant: the "making of the EP" mini-documentary — interviews plus your own music as beds, which solves licensing by ownership and produces a release-week asset Chapter 37 would be proud of.
Variations. The interview cut: two or more voices, with multi-voice leveling assessed hard — this is the whiplash problem, and conquering it is the variant's whole point. The documentary cut (hard mode): field recordings, scene transitions, an original score (Chapter 14 synthesis or Chapter 12 capture), and optionally one scene delivered as a binaural alternate (Chapter 34) with a sentence on what it cost. The series-starter variant: deliver the episode plus the per-episode folder spec and ten-minute QA checklist that would let you ship it weekly — Aisha's system, generalized. Repetition doesn't erode QA; un-designed QA erodes.
What "Done" Means
The gate list survives the format change — Chapter 39's Road Three walked this exact translation, and the printable version lives in Appendix G. On this road:
- Gate 0 — the format document honored: the piece you shipped is the piece you designed.
- Gate 1 — clarity and consistency verified on three or more systems, earbuds and a noisy room among them.
- Gates 2–3 — compressed: no separate mastering hire; your conform pass is the master, and the next-day listen still applies.
- Gate 4 — measured: integrated LUFS at target, true peak compliant, numbers written in the report.
- Gate 5 — release QA: the top-and-tail seam listen, the encode auditioned end to end, art and metadata verified.
- Gate 6 — the package complete, every bed's licensing documented.
- Gates 7–9 — published or scheduled, with notes and links live.
- Gate 10 — the next episode scheduled. On this road, Gate 10 has a day name. That's the point of the road.
Assessment
Score against the Capstone Assessment Rubric: self-assessment first, then the outside pass, with the same 70% gate that applies to every capstone. Two emphasis notes. Translation weighs the low-volume and noisy-environment tests heavily here, because spoken word is consumed in motion; a piece that needs silence and attention to be intelligible has failed its actual job. And Identity & Taste reads as the show's sound: format voice, bed taste, pacing, and restraint. The test is whether a listener could recognize episode two from its first thirty seconds.
If the piece is a real episode of a real show, resist grading on the curve of your back catalog: the rubric scores this piece, on its own evidence. The series-starter variant exists precisely so the standard survives contact with a schedule.
Common Failure Modes
Each with its chapter home.
- The bathroom sound. Voice recorded in an untreated room's reflections. Treat first — the blanket fort is legitimate (Chapters 10–11).
- Level whiplash. Host hot, guest quiet, intro slamming. Clip-gain first, then compression, then the conform (Chapters 21, 23, 33).
- Radio cosplay. An over-compressed announcer voice on a conversational show. Two gentle stages beat one violent one (Chapters 23, 29).
- Beds fighting the voice. A bed camped in the speech band, or one too interesting to ignore — a lyric under dialogue is an arrangement problem (Chapter 16's logic). Carve the pocket, duck deliberately (Chapters 22, 27–28).
- The breath-deletion uncanny valley. Dead silence where a human should inhale. Reduce, don't delete (Chapter 15).
- Edits you can hear. Missing crossfades, room-tone jumps at every cut. Every edit gets a fade; patch with tone (Chapter 15).
- The guessed loudness. "Sounds about right" ships quiet on one platform and hot on another — Aisha's founding wound, re-enacted. Measure (Chapter 33; Gate 4).
- Skipping the noise test. Perfect in headphones at midnight, unintelligible over a dishwasher (Chapter 30, adapted).
- Unlicensed beds. A commercial track dropped in "just for now" that ships forever. Document every bed's right to exist (Chapter 36).
- The one-and-done. No system, so episode two never arrives. The per-episode spec is the difference between a project and a show (Chapter 19's hygiene, run as a weekly machine).
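Level whiplash is the one failure on this list that a few lines of arithmetic can catch before a listener does. The sketch below takes one loudness reading per segment, again typed in from your meter, and flags any segment sitting too far from the episode's median. The 2 LU tolerance, the segment names, and the function name are all illustrative assumptions; pick a tolerance, state it in the QA report, and defend it.

```python
from statistics import median


def flag_whiplash(segment_lufs, tolerance_lu=2.0):
    """Return {segment name: offset in LU} for segments far from the median.

    segment_lufs maps a segment or voice name to its measured loudness.
    tolerance_lu is an ASSUMPTION, not a figure from any spec.
    """
    centre = median(segment_lufs.values())
    return {
        name: level - centre
        for name, level in segment_lufs.items()
        if abs(level - centre) > tolerance_lu
    }


if __name__ == "__main__":
    # Made-up readings for a four-segment interview episode.
    readings = {"intro": -13.0, "host": -16.0, "guest": -20.5, "outro": -16.5}
    for name, offset in flag_whiplash(readings).items():
        print(f"{name}: {offset:+.2f} LU from the episode median")
```

A flagged segment is a prompt to go back to clip gain, not a verdict; the commuter's thumb on the volume button is still the final meter.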
Ship the episode. Then schedule the next one — on this road, momentum isn't a virtue, it's the format.