Chapter 30: Sound Design and Music — The Design Sense You Forgot About

Claude (Anthropic)

In This Chapter

Why Sound Is Underrated
The Four Audio Layers
Sound Effects as Game Feel
Music as Emotional Spine
Dynamic / Adaptive Music
Voice Acting
Ambient and World Sound
Diegetic vs. Non-Diegetic Audio
Mixing for Games
Accessibility — Sound
Budget Sound Design
GDScript Implementation — AudioManager.gd
Progressive Project Update — Chapter 30
Common Pitfalls
Summary

Here is an experiment I run with every class of new designers. I put a game on the screen — usually something they love, something they have played for fifty hours and would argue about online. I mute it. I hand someone the controller and say: play for five minutes. Tell me what is happening.

What happens next is always the same. The player can still navigate. They can still see enemies. They can still see the HUD. They can technically play. But within thirty seconds, something in their body goes wrong. They tense up and lean forward. They miss inputs they would never miss. They look anxious in a way that has nothing to do with what is on the screen. When an enemy takes a swing at them, they do not flinch, because there is nothing to flinch at. The hit connects in silence. The player looks at the screen like someone watching a film with the picture slightly out of sync with the audio, except more so. Something is deeply wrong and they cannot quite name what.

Then I turn the sound back on. The player visibly relaxes. Their shoulders drop. They make a sound that is part relief, part recognition — ohh. And for the rest of the session, they play better. Faster reactions. Fewer missed inputs. More flow.

That is the mute test. It is the single most important test in audio design, and most designers I have worked with have never done it to their own game. They test their visuals with the sound on. They test their combat with the sound on. They check their UI with the sound on. Then they wonder, during playtest, why the game feels "floaty" or "flat" or "unresponsive," and they reach for their visual and haptic toolkits — more particles, more screen shake, more flash — because those are the tools they understand.

Ninety percent of the time, the problem was audio. The hit sound was wrong, or the footsteps were missing, or the music was not reinforcing the energy of the scene, or the UI clicks were the same pitch for every action. Audio is the sense designers forgot about — and it is the sense that is doing the heaviest lifting in every game your players love.

This chapter is about sound as design. Not sound as polish. Not sound as "the audio team will handle it." Sound as a first-class channel of player feedback, emotional communication, and spatial information, equal in weight to graphics and equal in design responsibility to anyone calling themselves a designer. If your project has a designer who says "I'll figure out the audio later," your project has an audio problem already. The audio problem started when they said that.

By the end of this chapter, you will have a working mental model of the four audio layers (SFX, music, voice, ambient), a practitioner's sense for when each is doing its job and when it is quietly failing, and an AudioManager.gd singleton in your Godot project that implements the basics: bus mixing, SFX pooling, music cross-fading, ambient layering, and a simple dynamic-music state machine that responds to combat versus exploration. You will run the mute test on your own project and discover, probably with some embarrassment, how much of the feel you thought was coming from your combat code is actually coming from two .wav files you grabbed off Freesound in the last hour of a game jam.

You will understand, I hope, that audio is not a decoration you apply at the end of production. Audio is part of the design. The moment you start prototyping, you start thinking about how it sounds. Every swing of the sword, every menu click, every footstep, every door-opening, every environmental beat — each one is a design decision, and each one deserves the same care you give to a level layout or an enemy behavior tree. When audio is good, players feel the game. When audio is bad, players feel off in ways they cannot name. And because they cannot name it, they cannot forgive you for it.

Why Sound Is Underrated

Designers are visual-first people. We come from games we watched our friends play; we come from films and comics and art. We sketch. We storyboard. We build mood boards. We pin screenshots. We can look at Hollow Knight and Dark Souls and Celeste and Journey and articulate the visual design — the silhouettes, the color palettes, the environmental motifs. We can talk about pixel density and camera framing and color theory for hours.

Ask the same designers to talk about the audio of those same games and the conversation collapses. They can name the composer, maybe. They can hum a theme, maybe. They will say "the sound design is really good," which is the audio equivalent of looking at the Sistine Chapel and saying "nice painting."

This is a professional deficit. It is also a solvable one, if you start paying attention. But the deficit has consequences during production. Designers who are not fluent in audio make three recurring mistakes:

They underfund it. In small-team budgets, audio is the first line item to be cut. The team reasons: we can ship with placeholder sounds, or a music library, or the bare minimum, and work on the "real" problems first. This is a disaster, because audio is not a finishing touch — it is a feel multiplier. A combat system with great design and terrible audio feels worse than an average combat system with great audio. The audio is where players feel the combat. If you underfund audio, you are underfunding feel itself.

They specify too little. In design documents, audio gets a line: "Enemy attack — slashing sound." No designer would accept "Enemy attack — sharp-looking animation" as the only spec for a visual. But we write the audio equivalent and hand it off as if it were enough. The composer or sound designer ends up guessing at half the game's emotional intent because the designer never wrote it down.

They ignore mix. Even games with great individual sounds and great music often ship with a bad mix — where music drowns out dialogue, or combat SFX are so loud they hurt, or ambient is so faint you cannot hear the world. Mix is the designer's responsibility, because mix is about what the player hears at any given moment, and what the player hears is a design decision. "The audio team will mix it" is a phrase that leads to games where players reach for the volume slider in the first ten minutes and set music to forty percent because otherwise they cannot hear the voice track.

💡 The Mute Test: Pick your own game. Mute it. Play for five minutes. If someone watching over your shoulder can still tell what is happening — what enemy is attacking, whether you got hit, what menu you are in, whether something important just happened — your audio is redundant (bad) or your visuals are overspecified (also bad). If someone watching cannot tell, your audio is doing work; turn it back on and audit whether every sound is the right sound for its moment. Run this test monthly. You will find things.
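
If you want the mute test one keypress away during playtests, a small debug helper is enough. This is a minimal sketch, assuming Godot 4 and that F9 is unused in your project; the script name is hypothetical.

```gdscript
# MuteTestToggle.gd (hypothetical debug helper; add it as an autoload)
extends Node

# Assumption: F9 is free in your input setup. Pick any key you like.
const TOGGLE_KEY := KEY_F9

func _unhandled_input(event: InputEvent) -> void:
	if event is InputEventKey and event.pressed and not event.echo \
			and event.keycode == TOGGLE_KEY:
		# Look up the Master bus and flip its mute flag.
		var master := AudioServer.get_bus_index("Master")
		AudioServer.set_bus_mute(master, not AudioServer.is_bus_mute(master))
```

Muting the Master bus silences every layer at once, which is exactly what the test calls for.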

The reason audio is underrated is not that designers are stupid. It is that audio is invisible. You cannot screenshot it. You cannot put it in a pitch deck. You cannot tweet a clip of your "audio mockup." It exists only in time, only in experience, and only when the player's volume is on. Our professional culture is not built to talk about it. The burden is on you, as a practitioner, to build the fluency anyway.

Play games with your eyes closed for thirty seconds at a time. Read Karen Collins's Game Sound. Watch Austin Wintory's GDC talks about Journey. Watch any Game Maker's Toolkit video that covers audio and listen more than you look. Open Wwise or FMOD even if you never ship with them, just to understand what audio middleware treats as a first-class concern. This is remedial education you will give yourself, because nobody gave it to you in school.

The Four Audio Layers

Game audio lives in four layers. Each has a distinct function. Each has distinct design concerns. When you mix them well, they cooperate: they tell the player what is happening, where it is happening, how they should feel about it, and what kind of place they are in. When you mix them badly, they fight: the music drowns the voice, the SFX drowns the music, the ambient disappears, the mix is a mud puddle.

The four layers are:

Sound effects (SFX). Short, event-driven sounds tied to specific gameplay events: footsteps, weapon impacts, UI clicks, item pickups, door-opens, enemy cries, jumps, lands. These are your primary game-feel channel. They fire on input or on simulation events. Their job is immediate feedback.
Music (BGM). Continuous, composed audio that underscores the scene. Its job is emotional framing and pacing energy. Music tells the player how to feel about the current situation — tense, heroic, grieving, at peace.
Voice (VO). Dialogue, barks, narration, recorded lines delivered by actors. Its job is characterization, plot delivery, and moment-to-moment information ("Enemy behind you!"). Voice is the most expensive layer by far.
Ambient. Continuous environmental sound: wind, rain, city hum, cave drips, forest birds, air-conditioning rumble, distant crowds. Its job is to build the world's physicality — to convince the player they are somewhere rather than nowhere.

Each layer operates on a different time scale. SFX is instantaneous (fires once per event, typically 0.1 to 2 seconds long). Music is minute-scale (changes with scenes, areas, combat state). Voice is line-scale (a single utterance, 1–10 seconds). Ambient is area-scale (one bed loops for an entire region, 30 seconds to several minutes).

Each layer also lives on a different bus in the mixer — a dedicated audio channel with its own volume, its own effects, and (critically) its own user-facing volume slider in the options menu. If your game does not expose separate Master / Music / SFX / Voice / Ambient sliders, your accessibility story is broken. Some players cannot tolerate loud music but need loud voice. Some players want cinematic ambient but find the UI clicks grating. Give them the controls. We will build exactly this bus structure in the AudioManager.gd section.
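
To make the slider idea concrete before the full implementation, here is a minimal sketch of mapping an options-menu slider to a bus volume in Godot 4. It assumes your audio bus layout already contains buses named Master, Music, SFX, Voice, and Ambient; the script and function names are hypothetical.

```gdscript
# AudioSettingsSketch.gd (hypothetical helper)
extends Node

# Assumption: these bus names exist in the project's audio bus layout.
const BUSES := ["Master", "Music", "SFX", "Voice", "Ambient"]

func _ready() -> void:
	# Warn early if the layout is missing a bus the options menu expects.
	for bus_name in BUSES:
		if AudioServer.get_bus_index(bus_name) == -1:
			push_warning("Audio bus not found: %s" % bus_name)

# slider_value is linear: 0.0 (silent) to 1.0 (full volume).
func set_bus_volume(bus_name: String, slider_value: float) -> void:
	var idx := AudioServer.get_bus_index(bus_name)
	if idx == -1:
		return
	# Clamp to a tiny floor, because linear_to_db(0.0) is negative infinity.
	var linear := clampf(slider_value, 0.0001, 1.0)
	AudioServer.set_bus_volume_db(idx, linear_to_db(linear))
```

With a slider whose range is 0 to 1, you could connect its `value_changed` signal to a small lambda such as `func(v): set_bus_volume("Music", v)`. The conversion matters because bus volume is stored in decibels while players think in linear terms.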

The rest of this chapter visits each layer — its purpose, its pitfalls, its craft — and then visits the cross-cutting concerns: diegetic versus non-diegetic framing, mixing, accessibility, budget. But the four-layer model is the spine. Hold it in your head. Every audio problem you hit in production will be located somewhere in that model, and naming it — "this is a music-vs-voice mix problem" or "this is an ambient-bed-disappearing-in-combat problem" — is half the battle.

Sound Effects as Game Feel

SFX are the audio layer most under the designer's direct control, and they are where most of your game's feel lives.

Pick up any platformer you consider beautiful to play. Celeste, Super Meat Boy, Hollow Knight. Now, silently in your head, try to separate the feel from the sound. You cannot, quite. The Celeste dash is not just the animation and the momentum curve — it is the specific tsh of the dash SFX, the little pulse of pitch-shifted feedback when you dash into the air, the different sound for dashing through a block. The Meat Boy jump is the cartoon boing. The Hollow Knight slash is the clean metallic shing. Each of these is not a single file — it is usually a layered stack of three to six audio elements: a main impact, a sub-bass thud, a high-frequency pop, maybe a tail reverb. And each is slightly randomized in pitch on every play so it never feels mechanical.

This is where designers go wrong: they think of an SFX as one sound. It is not. It is a design artifact with multiple components, tuned deliberately, and every layer is there on purpose.

The anatomy of a good hit sound

Let's dissect a sword-hit SFX the way we would dissect a combat animation:

Attack (the first 20 ms). The hit sound starts before the impact frame — usually on the anticipation frame — because audio has perceived latency. If you fire the SFX exactly on the visual contact frame, it feels late. The ear arrives after the eye. So your attack envelope starts a frame or two early.

Impact (20–80 ms). The main event. For a sword hit, this is usually a layered stack: a metallic clang for the weapon, a meaty thud for the flesh contact, and a high-frequency crack to give the hit edge. Without all three layers, the hit feels incomplete — too metallic, too soft, or too thin.

Body (80–300 ms). The resonance of the impact — the physical after-sound, a reverberant tail, a bass rumble if the enemy is large. This tells the player how solid the hit was. Boss hits have longer, heavier bodies. Normal enemy hits have tighter, crisper bodies.

Release (300 ms+). The fade. How quickly the sound decays back to silence. Short releases feel snappy; long releases feel weighty. You pick based on the game's tempo.

When designers say "our combat feels floaty," a diagnosis of the hit SFX almost always reveals the problem. Usually one of two things is wrong: the body is too short (so the hit has no weight), or the attack is late (so the hit feels disconnected from the animation). Both are audio problems masquerading as design problems.
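
The layered impact can be sketched in a few lines. This is one possible arrangement, assuming Godot 4, an existing "SFX" bus, and three samples you supply yourself; all names here are hypothetical. The body and release live inside the sample files, so the code only handles layering and timing.

```gdscript
# LayeredHitSfx.gd (hypothetical name)
extends Node

# Assumption: three short samples assigned in the Inspector.
@export var clang: AudioStream  # metallic weapon layer
@export var thud: AudioStream   # low, meaty contact layer
@export var crack: AudioStream  # high-frequency edge

var _players: Array[AudioStreamPlayer] = []

func _ready() -> void:
	# One player per layer so all three can sound at the same moment.
	for layer in [clang, thud, crack]:
		var p := AudioStreamPlayer.new()
		p.stream = layer
		p.bus = &"SFX"  # assumes an "SFX" bus exists
		add_child(p)
		_players.append(p)

# thud_boost_db above 0 makes the low layer louder for heavier hits.
func play_hit(thud_boost_db := 0.0) -> void:
	_players[1].volume_db = thud_boost_db
	for p in _players:
		if p.stream:
			p.play()
```

To get the early attack described above, call `play_hit()` from the anticipation frame rather than the contact frame, for example with a method-call track in the attack animation.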

Pitch variance and the repetition problem

Here is a subtle trap. You implement a footstep sound. You play it every time the player's foot hits the ground. It sounds fine for the first few seconds. Then the player walks for a minute and you, the designer, start to hate your own game. The footstep is driving you insane. Your brain has pattern-matched it as a mechanical repetition, and mechanical repetition is wrong in a way players cannot articulate but definitely feel.

The fix is variance. Every time you play the footstep, randomize the pitch by ±10–20 percent. Pick from a pool of three to five different footstep samples rather than the same one. Randomize volume by a few percent. The human ear stops pattern-matching the moment you introduce small random variation, because natural footsteps are not identical; they vary with surface, weight distribution, fatigue. Your varied samples fool the brain into treating them as natural rather than mechanical.

The rule of thumb: any sound that plays more than four times in a row needs variance. Footsteps, gunshots, menu clicks, enemy grunts, ambient drips. The cost is a few extra samples and a random number generator call. The payoff is the difference between a game that sounds alive and a game that sounds like a loop.
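
A minimal sketch of that variance, assuming Godot 4 and a handful of samples you assign in the Inspector. The script name, property names, and default ranges are illustrative choices, not fixed values.

```gdscript
# VariedSfx.gd (hypothetical name)
extends AudioStreamPlayer

# Assumption: three to five alternate takes of the same sound.
@export var samples: Array[AudioStream] = []
@export_range(0.0, 0.3) var pitch_variance := 0.1      # 0.1 means plus or minus 10 percent
@export_range(0.0, 6.0) var volume_variance_db := 1.0  # random attenuation, in dB

var _last_index := -1

func play_varied() -> void:
	if samples.is_empty():
		return
	var i := randi() % samples.size()
	# Avoid playing the same sample twice in a row when there is a choice.
	if samples.size() > 1 and i == _last_index:
		i = (i + 1) % samples.size()
	_last_index = i
	stream = samples[i]
	pitch_scale = randf_range(1.0 - pitch_variance, 1.0 + pitch_variance)
	# Note: this overwrites the node's volume_db on every play.
	volume_db = randf_range(-volume_variance_db, 0.0)
	play()
```

Call `play_varied()` wherever you would have called `play()`. Godot 4 also ships an AudioStreamRandomizer resource that covers similar ground without a script, so check that before writing your own.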

Super Meat Boy and SFX density

Super Meat Boy is a masterclass in SFX density. Watch a let's-play with headphones. Every death has a wet splat (and the game is designed around dying hundreds of times). Every jump has a squeak. Every saw has a rising whirr as you approach. Every retry triggers a specific beep. There is almost no frame of gameplay without sound.

This density matters because Meat Boy's core loop is failure. You are going to die constantly. If each death were a silent screen-flash, the failure loop would be flat. Because each death is a juicy, specific, almost-comedic sound, the failure loop is enjoyable. The audio rewards you for failing. The audio makes dying part of the fun.

Contrast this with the silence of Journey's pilgrimage sequences: the long solo traverses across the dunes, where the only sounds are the character's movement, the hiss of sand, and the distant wind. The silence is equally deliberate. Journey uses silence to build weight for its musical swells. When the choir comes in near the peak of a mountain, the silence that preceded it is half the payoff. Meat Boy densifies. Journey rarefies. Both are designed. Both serve the game's emotion.

Your project's SFX density should match its emotional target. Tight, action-packed, precise games (platformers, shooters, character-action) want high density. Meditative, atmospheric, exploratory games (walking simulators, certain metroidvanias, horror) want strategic silence. Get the density wrong and the game feels off in a way players cannot name.

Silence as design choice

A moment of silence — true silence, a hard cut where music stops and ambient drops — is one of the most powerful tools in game audio. Silent Hill 2 uses it before major encounters: the ambient drone cuts, and the player's heartbeat becomes audible. The Last of Us uses it after major deaths: the music stops, the world continues, and the silence carries the grief. Hollow Knight uses it in boss arenas before the boss appears: the area's music stops, you hear the character's footsteps clear in the room, and you know what is coming.

Silence is loud, paradoxically, because it breaks the expectation of continuous audio. Every player has learned, implicitly, that games have music and ambient playing all the time. When you cut it, the ear notices. Use that power sparingly — one silence per scene, maybe one per hour — and it is devastating. Overuse it and it becomes another loop.

Music as Emotional Spine

If SFX are the game's feel, music is the game's feeling. Music tells the player how to experience the current moment. A room with nothing in it feels mysterious with one piece of music, triumphant with another, and terrifying with a third. The room has not changed. The music has.

This is a staggering amount of power, and it is the reason so many games hire a composer. Outside of full voice acting, music is often the largest line in an audio budget, and for a reason: it is the channel that sets the emotional baseline against which everything else operates.

What music does in a game

Music in a game does at least four distinct jobs, and understanding them is the difference between a soundtrack that works and one that does not.

Sets mood. The obvious job. Soft piano for grief, loud brass for triumph, dissonant strings for dread. The composer picks instrumentation, tempo, and harmony to tell the player what kind of moment they are in. This is where game music most resembles film scoring, and most composers' intuitions transfer.

Paces energy. Music builds and releases tension in ways gameplay cannot easily do on its own. A battle theme starts at a specific intensity; it modulates upward as the fight escalates; it releases when the enemy dies. Even in a static piece, the music's internal structure — verse, chorus, bridge — paces the player's experience on timescales gameplay alone would struggle to control.

Signals transitions. A musical stinger when the boss appears. A theme change when the player enters a new area. A cue when a scripted event fires. Music is a transition-marker that the player reads instinctively. Change the music and the player's brain registers something is different.

Builds identity. A game's themes become its emotional signature. When you hum "Zelda's Lullaby," you summon the feeling of Zelda. When the Hollow Knight piano motif returns in the final area, twenty hours after you first heard it, you remember everything the game has put you through. Music is how games make you remember them.

A composer working in games needs to hit all four. A composer who only understands mood (film-scoring intuition) produces games where the music is beautiful but disconnected from pacing. A composer who only understands identity (theme-writing intuition) produces games where the music is memorable but inflexible — the boss theme is great but there is no way to ramp it up or cool it down dynamically. The best game composers — Austin Wintory, Christopher Larkin, Lena Raine, Mick Gordon — hit all four with craft.

Journey — Austin Wintory

Austin Wintory's score for Journey is the most-studied piece of game music of its generation, and it deserves the attention. The score is structured as a single arc that mirrors the game's pilgrimage. The opening is sparse, introspective, solo cello. The middle deepens with strings and piano. The summit swells with full orchestra and wordless choir. The descent reprises earlier themes in transformed ways, recontextualizing them against the end.

What makes this music game music and not film music is that it is interactive in its pacing. Players who rush the game hear the music compressed and reactive; players who linger hear it stretch and develop. The score has a composed backbone but its unfolding is responsive to player movement and location. Wintory has given GDC talks about this — about how he had to think in branches and layers rather than in linear cues, and how the DNA of game music differs from film. If you can only watch one GDC audio talk in your career, watch his.

One craft lesson from Journey: the score uses one theme, stated simply and then elaborated, for two hours. No filler. No generic "area 3" music disconnected from the rest. Every note is on the spine. This kind of economy is the hallmark of a composer who has thought about their material as a whole, not as a pile of cues.

Hollow Knight — Christopher Larkin

Christopher Larkin's score for Hollow Knight is a masterclass in using a small number of elements to cover a massive game. Piano, strings, occasional choir, occasional pipe organ. Dozens of distinct area themes, each with its own melodic identity, but all unmistakably from the same world. Motifs recur — the Hornet theme, the Radiance theme — across the game, linking scenes that are hours apart in play time but thematically connected in the narrative.

What is extraordinary is the economy of orchestration. Larkin does not have the budget for a full film orchestra. He has piano and strings and some choir and synth pads. But he uses them so carefully that the score never feels thin. Each piece earns its presence. The quiet piano of Dirtmouth earns the loud organ of the final boss because it was restrained for twenty hours of play.

Case study 1 goes deep on Larkin's approach. The short version here: if you have a small audio budget, do less with more intention. Pick three instruments, write one recurring theme, build everything around them. This is how indies punch above their weight.

The temptation to over-score

A common failure mode for first-time designers is to cover every scene with music. Silence must be a mistake, the designer thinks. Every moment needs a musical bed.

This is wrong. Music needs to breathe. If every scene has music, no scene has music, because the player habituates to continuous underscore and stops hearing it. Music works by contrast — with silence, with other music, with tonal shifts — and if you remove the contrast, you remove the function.

Look at how Breath of the Wild handles its overworld. Most of the time, the overworld has no music. Just wind, ambient, the horse's hooves, the character's footsteps. Music enters only at specific moments: cresting a hill to reveal a vista, entering a new region, fighting an enemy. The absence of music for most of play is what makes the presence of music, when it arrives, feel meaningful. BOTW is not under-scored — it is correctly-scored. Most designers would have laid music over the whole overworld and rendered every musical moment invisible.

Dynamic / Adaptive Music

A linear piece of music plays from start to end in a fixed form. This is how film scores work: the composer knows exactly when the scene starts, how long it lasts, and what the emotional arc is. Games do not have this luxury. In a game, the composer does not know how long the player will be in this area, how hard the combat will be, whether the player will run or fight, or whether the scripted event will fire now or in ten minutes.

Games respond with dynamic or adaptive music — scores built to shift in response to player state. There are three main techniques, and knowing which to use is a real design decision.

Vertical layering (stems)

In vertical layering, a single piece of music is recorded as multiple stems — separate tracks that play simultaneously but can be muted or faded independently. A battle theme might have stems for:

Drums (rhythm, always playing)
Bass (low foundation, always playing)
Strings (melodic, muted until tension rises)
Brass (punchy, muted until combat escalates)
Choir (pays off at peak intensity)

As combat intensity rises, more stems fade in. As it falls, they fade out. The music is always "the same piece" — same tempo, same key, same structure — but its instrumentation responds to gameplay state.

Red Dead Redemption 2 is perhaps the most elaborate example. The game's score is mixed in stems and fades them in and out based on what the player is doing: riding, fighting, hunting, exploring. You can play for ten hours and never hear the "full" version of a track, because the full version only plays when every stem is active at once — a state the game rarely reaches. Most of what you hear is three or four stems at a time, interlocking in ways that feel continuous and responsive.

Vertical layering is powerful because transitions are seamless. Nothing has to restart or re-cue. Stems just fade. The composer designs the stems to work in any subset, which is a skill of its own.
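
Here is a rough sketch of vertical layering in engine-native Godot 4. It assumes each stem is a looping AudioStreamPlayer child of this node, that all stems share the same length, and that the children are ordered from always-on to peak-only; the names and the intensity mapping are hypothetical.

```gdscript
# StemMixer.gd (hypothetical name)
extends Node

const SILENT_DB := -60.0
const ALWAYS_ON := 2  # drums and bass in the example above

@export var fade_time := 1.5

var _stems: Array[AudioStreamPlayer] = []
var _fades: Array[Tween] = []

func _ready() -> void:
	for child in get_children():
		if child is AudioStreamPlayer:
			_stems.append(child as AudioStreamPlayer)
	_fades.resize(_stems.size())
	# Start every stem together so they stay aligned; silence the upper ones.
	for i in _stems.size():
		_stems[i].volume_db = 0.0 if i < ALWAYS_ON else SILENT_DB
		_stems[i].play()

# intensity runs from 0.0 (calm) to 1.0 (peak); higher values unmute more stems.
func set_intensity(intensity: float) -> void:
	var extra := maxi(_stems.size() - ALWAYS_ON, 0)
	var active := ALWAYS_ON + roundi(clampf(intensity, 0.0, 1.0) * extra)
	for i in _stems.size():
		var target: float = 0.0 if i < active else SILENT_DB
		# Cancel any fade still running on this stem before starting a new one.
		if _fades[i] and _fades[i].is_valid():
			_fades[i].kill()
		_fades[i] = create_tween()
		_fades[i].tween_property(_stems[i], "volume_db", target, fade_time)
```

With the five stems listed above, `set_intensity(0.5)` leaves four playing audibly and `set_intensity(1.0)` brings in the choir. Separate players started on the same frame are usually close enough for a prototype; newer Godot 4 releases include an AudioStreamSynchronized resource aimed at keeping stems tightly aligned.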

Horizontal re-sequencing (branching)

In horizontal re-sequencing, music is built as a set of segments that can be re-arranged on the fly. The game jumps from one segment to another at musical boundaries — usually at bar lines or phrase endings — so transitions happen in time rather than at arbitrary moments.

The landmark example is LucasArts' iMUSE system, used in Monkey Island 2 and onward. iMUSE tracked the musical state and, when the player changed scenes, queued the transition to happen at the next beat, bar, or phrase end. The result was music that flowed between scenes as if the composer had known all along what was coming.

Modern implementations use middleware (Wwise, FMOD) that knows about musical meter and can queue transitions at musically appropriate moments. If your composer has written the music with defined transition points, the system waits for the next valid transition point and jumps. The player experiences what feels like hand-composed pacing, even though no single linear arrangement was ever authored.
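
Without middleware, the core arithmetic is small: work out the bar length from tempo and meter, then wait until the next bar line before switching. The sketch below assumes Godot 4, a fixed tempo, and a child AudioStreamPlayer named MusicPlayer; every name in it is hypothetical.

```gdscript
# BarQuantizer.gd (hypothetical name)
extends Node

# Assumptions: constant tempo and meter, supplied by the composer.
@export var bpm := 120.0
@export var beats_per_bar := 4

@onready var _player: AudioStreamPlayer = $MusicPlayer  # hypothetical child node

func seconds_per_bar() -> float:
	return 60.0 / bpm * beats_per_bar

# Seconds remaining until the next bar line of the current segment.
func time_to_next_bar() -> float:
	var bar := seconds_per_bar()
	# Add the time since the last audio mix for a slightly finer position.
	var pos := _player.get_playback_position() + AudioServer.get_time_since_last_mix()
	return bar - fmod(pos, bar)

# Waits for the bar line, then swaps to the next segment.
func queue_segment(next_segment: AudioStream) -> void:
	await get_tree().create_timer(time_to_next_bar()).timeout
	_player.stream = next_segment
	_player.play()
```

At 120 BPM in 4/4 a bar lasts two seconds, so a request made 1.3 seconds into a bar waits about 0.7 seconds. Scene-tree timers fire on frame boundaries, so this is bar-accurate to within a frame rather than sample-accurate, which is one of the gaps middleware closes.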

State machines (combat vs. exploration)

The simplest and most common approach: the music is a set of named states (explore, tension, combat, victory, defeat), and the game switches between them based on gameplay events. Transitions are usually cross-fades — the old track fades out while the new one fades in over 1–3 seconds.

DOOM 2016 and DOOM Eternal use a combat-state approach. Mick Gordon's scores have a tightly-defined "combat mode" that fires when enemies engage and releases when combat ends. Between combat, the score is quieter, atmospheric, less insistent. The switch is immediate and unambiguous — you hear the guitars come in, you know it is time to fight.

Most indie games should start with this approach. Two states, three at most. Explore, combat, maybe boss. Cross-fade between them on combat-start and combat-end events. You will not need iMUSE. You will not need stems. You will get ninety percent of the emotional benefit with one tenth of the composition and engineering work.

We will build exactly this in DynamicMusicPlayer.gd below.
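
Ahead of the full implementation, the core of the idea fits in a short sketch: two players, one fading out while the other fades in. This assumes Godot 4, an existing "Music" bus, and looping tracks you assign yourself. It is an illustration with hypothetical names, not the chapter's DynamicMusicPlayer.gd.

```gdscript
# MusicStateSketch.gd (hypothetical name)
extends Node

const SILENT_DB := -60.0

# Assumption: looping streams keyed by state name, e.g. "explore" and "combat".
@export var tracks: Dictionary = {}
@export var fade_time := 2.0

var _state := ""
var _current: AudioStreamPlayer
var _next: AudioStreamPlayer
var _fade: Tween

func _ready() -> void:
	_current = _make_player()
	_next = _make_player()

func _make_player() -> AudioStreamPlayer:
	var p := AudioStreamPlayer.new()
	p.bus = &"Music"  # assumes a "Music" bus exists
	p.volume_db = SILENT_DB
	add_child(p)
	return p

func set_state(new_state: String) -> void:
	if new_state == _state or not tracks.has(new_state):
		return
	_state = new_state
	# Cancel a fade that is still running so it cannot stop the wrong player.
	if _fade and _fade.is_valid():
		_fade.kill()
	_next.stream = tracks[new_state]
	_next.volume_db = SILENT_DB
	_next.play()
	_fade = create_tween().set_parallel(true)
	_fade.tween_property(_current, "volume_db", SILENT_DB, fade_time)
	_fade.tween_property(_next, "volume_db", 0.0, fade_time)
	_fade.chain().tween_callback(_current.stop)
	# The incoming player becomes the current one for the next transition.
	var old := _current
	_current = _next
	_next = old
```

Gameplay code would call `set_state("combat")` when enemies engage and `set_state("explore")` when the fight ends. Each transition restarts the incoming track from the top, which is fine for two or three states and is the first thing you would refine later.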

Middleware vs. engine-native

In a professional shop, you would run audio through Wwise or FMOD: commercial middleware that handles buses, stems, states, RTPCs (real-time parameter controls), occlusion, reverb, and the thousand other concerns audio programmers have. Wwise and FMOD integrate with Godot, Unity, Unreal, and custom engines. Both offer free tiers for small projects and paid licenses for larger ones.

For a first project, engine-native audio is fine. Godot's AudioServer and AudioStreamPlayer nodes cover most of what you need. When your project outgrows them — usually when you want stem-based dynamic music with musically-timed transitions, or complex spatial audio with occlusion, or a composer who wants to iterate in Wwise without shipping builds — you migrate. Middleware is a tool for when you hit the wall. You will hit the wall, if your project is ambitious. But do not introduce middleware on day one, because the learning curve is real and for a small 2D project it is overkill.

Voice Acting

Voice is the most expensive layer of audio. A good voice actor in a good studio with a good director costs meaningful money, and the cost scales with line count. Disco Elysium's team shipped the original game with partial VO; two years later they went back and recorded full VO for the Final Cut edition, adding hundreds of hours of recording and dozens of actors. The Final Cut budget for VO alone probably exceeded the entire production budget of many indie games.

If you are a small team, you probably cannot afford full VO. That is fine. What you must do is be honest about the tradeoff and design around it.

The VO spectrum

Games live on a spectrum of voice density:

Full VO. Every line is spoken. The Witcher 3, The Last of Us, Mass Effect, Cyberpunk 2077. Requires AAA budget.

Major-character VO. Protagonist and a few key NPCs are voiced; minor characters have text only or grunt-style barks. Skyrim is a near neighbor rather than a pure example: it voices its NPCs but leaves the protagonist silent, and gives minor NPCs distinctive but limited voice lines.

Cinematic-only VO. Cutscenes are voiced, but in-game dialogue is text. Many JRPGs work this way, especially in the PS1 and PS2 eras.

Bark-only VO. Characters have short combat and reaction barks — "Behind you!", "Reloading!", "Aaargh!" — but no long dialogue. Many shooters work this way. This is the cheapest form of useful VO.

No VO. Text-only. Undertale, Stardew Valley, Hollow Knight. Your writing is doing all the work. This is an artistic choice as much as a budget one.

Most indies should start at bark-only or no-VO. Adding major-character VO later is possible; ripping it out if you cannot sustain it is embarrassing.

Undertale — the no-VO choice

Undertale has no voice acting. Every line is text. And yet the game is wildly expressive — you feel Sans's exhaustion, Papyrus's bravado, Toriel's motherliness, Asgore's grief. How?

Three things. First, the writing is specific. Toby Fox writes characters who sound distinct on the page; you hear Papyrus in your head because his lines are designed to be heard a certain way. Second, the text-box rhythm is controlled. Lines reveal at different speeds to convey emotion — Sans speaks slowly and in lowercase, Papyrus in uppercase at a faster clip. The text timing is doing what voice inflection would do. Third, each character has a distinctive text-chirp — the little audio blip that plays as text reveals. Toriel's chirp is warm; Sans's is lower and slower; the final boss's chirp is menacing. These chirps are, in a real sense, the game's voice acting. They are voices without words.

The Undertale approach proves that VO is not mandatory. If you write well and design your text delivery, you can carry character with text alone. But note the design discipline: every text chirp was chosen. Every line was paced. This is not "we didn't bother with voice" — this is "we designed a non-voice voice system." The no-VO games that work are the ones where someone thought as hard about text delivery as voice games think about casting.
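
The mechanism is simple enough to sketch: reveal text one character at a time and play a short blip per visible character, with speed and pitch set per speaker. This is a rough illustration assuming Godot 4 and a "Voice" bus; it is not how Undertale itself is implemented, and every name is hypothetical.

```gdscript
# ChirpText.gd (hypothetical name)
extends Label

# Assumption: one short blip per character voice, swapped per speaker.
@export var chirp: AudioStream
@export var chars_per_second := 30.0
@export var chirp_pitch := 1.0  # lower for a slow, deep voice

var _blip := AudioStreamPlayer.new()
var _say_id := 0

func _ready() -> void:
	_blip.bus = &"Voice"  # assumes a "Voice" bus exists
	add_child(_blip)

func say(line: String) -> void:
	_say_id += 1
	var my_id := _say_id
	text = line
	visible_characters = 0
	_blip.stream = chirp
	for i in line.length():
		if my_id != _say_id:
			return  # a newer line replaced this one
		visible_characters = i + 1
		# Skip the blip on spaces so pauses in the text sound like pauses.
		if chirp and line[i] != " ":
			_blip.pitch_scale = chirp_pitch * randf_range(0.95, 1.05)
			_blip.play()
		await get_tree().create_timer(1.0 / chars_per_second).timeout
```

Giving each speaker a different `chirp`, `chirp_pitch`, and `chars_per_second` is the whole casting process in a system like this, which is why those three values deserve the same attention a voice game gives to auditions.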

Disco Elysium — the full-VO rebirth

Disco Elysium originally shipped with partial VO — the Thought Cabinet, key moments, a fraction of the total text. The game still worked because the prose was phenomenal. Then, in 2021, the Final Cut added full VO for every line. Hundreds of hours of recording, dozens of actors, a massive undertaking.

The effect of the Final Cut is subtle and worth studying. The game is not a different game with full VO — it is the same game, more embodied. Characters who were voices-in-your-head become voices-in-the-air. The prose retains its power because the performances honor the writing rather than replacing it. The actors interpret; they do not invent.

The lesson: when you add VO, the quality of the performance and direction matters more than the quantity. A single great VO line lands harder than a hundred workmanlike reads. If you are hiring actors, hire for interpretation. Direct them. Do not just hand them a script and record the first take.

A good VO take saves bad writing; a bad one sinks good writing

Two craft observations, side by side.

A good VO take saves bad writing. A mediocre line delivered by a great actor becomes charged. The actor finds the subtext the writer missed. The delivery carries what the words did not. This is why AAA games with passable-but-not-great writing can still feel emotionally compelling — Troy Baker and Ashley Johnson are doing heavy lifting for the script in The Last of Us Part II, making scenes work that, read on the page, would feel overwrought.

A bad VO take sinks good writing. A great line delivered flatly, at the wrong pace, with the wrong emotion, becomes worse than no line at all. The player hears the mismatch. The character collapses. This is why "just get any voice actor" is dangerous — a poor VO cast makes your writing feel worse than it is.

The practical implication: if your VO budget is thin, cut lines before you compromise performance. Fifty great-acted lines beat five hundred acceptable ones. Write fewer, cast better, direct seriously, and you will end up with a game that punches above its VO weight.

Ambient and World Sound

Ambient is the audio layer designers most often forget. It is easy to see why: ambient is continuous, is rarely the center of attention, and does not respond to player input. You can ship a build without ambient and the build will not crash. The designer moves on.

The player, though, notices immediately. A world without ambient feels hollow. Even if the player cannot name what is missing, they feel it — the environments seem small, flat, unreal. The in-world soundscape is what convinces the body that there is a there there.

What ambient does

Ambient establishes place. A cave has drips, distant wind, low rumble. A forest has birds, rustling leaves, distant streams. A city has cars, murmured crowds, air-conditioning hum. A spaceship has the low drone of the engines, distant mechanical whirrs, occasional hisses. Each of these is a layered bed — usually three to six simultaneous loops, mixed to a specific character — that plays continuously while the player is in that kind of space.

Good ambient is layered. A single forest-loop is suspicious — the repetition becomes audible within a minute. A good forest ambient is a base layer of rustling leaves, a secondary layer of occasional bird calls (randomized in timing), a tertiary layer of distant wind, and occasional one-shot elements (a branch creaking, a distant animal call) that fire at random intervals. No single element is the whole sound. The brain does not pattern-match because there is no single pattern.
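The layering described above can be sketched in GDScript. This is a minimal sketch under stated assumptions, not a definitive implementation: the AmbientBed class, its exported properties, and the gap values are hypothetical names chosen for illustration, and it assumes an "Ambient" bus exists and that the loop streams were imported with looping enabled.

```gdscript
# AmbientBed.gd (hypothetical helper, not one of the chapter's scripts)
class_name AmbientBed
extends Node

@export var loop_layers: Array[AudioStream] = []  # e.g. leaves, distant wind (imported as loops)
@export var one_shots: Array[AudioStream] = []    # e.g. branch creak, distant animal call
@export var min_gap := 4.0    # assumed shortest pause between one-shots, in seconds
@export var max_gap := 15.0   # assumed longest pause between one-shots, in seconds

var _one_shot_player: AudioStreamPlayer
var _timer: Timer

func _ready() -> void:
    # One continuously playing player per loop layer.
    for stream in loop_layers:
        var p := AudioStreamPlayer.new()
        p.bus = "Ambient"
        p.stream = stream
        add_child(p)
        p.play()
    _one_shot_player = AudioStreamPlayer.new()
    _one_shot_player.bus = "Ambient"
    add_child(_one_shot_player)
    # A one-shot Timer that is re-armed with a fresh random gap each time it fires.
    _timer = Timer.new()
    _timer.one_shot = true
    add_child(_timer)
    _timer.timeout.connect(_on_timer_timeout)
    _timer.start(next_gap(min_gap, max_gap))

static func next_gap(min_s: float, max_s: float) -> float:
    # Random pause so no fixed rhythm emerges.
    return randf_range(min_s, max_s)

func _on_timer_timeout() -> void:
    if not one_shots.is_empty():
        _one_shot_player.stream = one_shots.pick_random()
        _one_shot_player.pitch_scale = randf_range(0.9, 1.1)
        _one_shot_player.play()
    _timer.start(next_gap(min_gap, max_gap))
```

Because every one-shot is scheduled with a new random gap and a slightly different pitch, the bed never repeats on a fixed period, which is the property the paragraph above is after.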

3D positional audio

In 2D games, audio can pan left-right based on where the source is. In 3D games, audio becomes fully spatial: filtered and delayed between the ears to simulate head-related transfer functions (HRTFs), attenuated with distance, and filtered for walls and obstructions. A well-designed 3D audio system lets the player close their eyes and point to where a sound came from.

For our 2D project, we will implement basic positional audio via AudioStreamPlayer2D, which attenuates by distance and pans by horizontal position. This is enough to make an off-screen enemy's footsteps give the player useful information about where the enemy is. For 3D, Godot's AudioStreamPlayer3D handles the full model, including Doppler effects and reverb buses for area-based acoustic spaces.

The design principle: every on-screen or off-screen sound source should be spatialized if its location is meaningful. A UI click is not spatialized (it is "in the player's head"). A distant waterfall is spatialized (it tells the player which direction the river lies). Getting this mix right — what is in-world and what is in-the-player's-head — is a recurring judgment call.

Alien: Isolation

Alien: Isolation is the example. Creative Assembly built the game around audio to an extent that is unusual even for horror games. The alien is not always on screen. But you can always hear it — scratching in a vent above you, hissing around a corner, heavy footsteps one deck up. The motion tracker beeps with a specific cadence that indicates proximity, creating a second audio channel dedicated to spatial awareness. The ventilation systems groan. Doors hiss. The station itself breathes.

The practical effect is that the player spends the entire game with their ears open. Visual information is often obstructed; audio is not. Players who played the game with good headphones report meaningfully better performance than those with speakers, because the spatialization is doing gameplay-relevant work.

Case study 2 goes deep on Alien: Isolation. The craft lesson here: ambient is not background. Ambient is a channel through which you communicate spatial, narrative, and gameplay information. Design it that way and it becomes one of your most powerful tools.

Occlusion and reverb zones

A sound source behind a wall should sound muffled. A sound source in a large stone room should sound reverberant. A sound source in a carpeted office should sound dead. These acoustic properties — occlusion (sound passing through barriers) and reverb (sound reflecting in spaces) — are part of what convinces the player they are in a real space.

Godot 4 supports reverb buses: you assign reverb effects to an audio bus, route certain sounds through it, and tune the parameters per area. A cave area routes ambient and SFX through a "cave reverb" bus; an outdoor area routes through a "no reverb" bus. As the player moves between areas, you swap the bus routing and the acoustic character of the world shifts without any loop-crossfade.
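One way to wire up an area-based reverb zone in 2D is Area2D's audio bus override, which re-routes positional players located inside the area. The sketch below assumes a bus named "CaveReverb" that you have created and given an AudioEffectReverb in the editor; the ReverbZone class and that bus name are hypothetical.

```gdscript
# ReverbZone.gd (hypothetical): attach to an Area2D that covers the cave.
class_name ReverbZone
extends Area2D

@export var reverb_bus: StringName = &"CaveReverb"  # assumed bus name

func _ready() -> void:
    # Fail loudly rather than silently routing to a bus that does not exist.
    if AudioServer.get_bus_index(reverb_bus) == -1:
        push_warning("ReverbZone: bus '%s' not found; override disabled" % reverb_bus)
        return
    audio_bus_name = reverb_bus
    audio_bus_override = true
```

Note the limits of this approach: the override applies to AudioStreamPlayer2D nodes positioned inside the area (subject to their area_mask). Non-positional players, such as a global ambient bed, still need their bus property set by hand when the player crosses the boundary.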

Occlusion is harder in a general case — it requires ray-casting from the sound source to the listener to check for barriers — but for a 2D project you can approximate it by detecting whether the sound source is behind a wall tile and applying a low-pass filter if so. The result is an enemy whose footsteps sound muffled until they round the corner and become clear. Players will not consciously notice, but their spatial awareness will sharpen.
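The 2D approximation can be sketched with a single ray cast. This is an illustrative sketch, not the chapter's implementation: Occlusion2D is a hypothetical helper, "SFXMuffled" is an assumed extra bus carrying an AudioEffectLowPassFilter, and wall_mask assumes your wall tiles sit on physics layer 1.

```gdscript
# Occlusion2D.gd (hypothetical helper)
class_name Occlusion2D
extends RefCounted

# True if any collider on wall_mask lies between the source and the listener.
# Call this from _physics_process, where the physics space is safe to query.
static func is_occluded(world: World2D, source: Vector2, listener: Vector2,
                        wall_mask: int = 1) -> bool:
    var query := PhysicsRayQueryParameters2D.create(source, listener, wall_mask)
    var hit := world.direct_space_state.intersect_ray(query)
    return not hit.is_empty()

# Chooses the bus for a source: the muffled bus when a wall is in the way.
static func bus_for(occluded: bool) -> StringName:
    return &"SFXMuffled" if occluded else &"SFX"

# Example use inside an enemy script (footsteps is an AudioStreamPlayer2D,
# player is a reference to the player node; both names are placeholders):
#
#   func _physics_process(_delta: float) -> void:
#       var blocked := Occlusion2D.is_occluded(
#           get_world_2d(), global_position, player.global_position)
#       footsteps.bus = Occlusion2D.bus_for(blocked)
```

Switching buses is a hard cut between clear and muffled. If that is audible in your game, a gentler variant would blend the filter cutoff over a few frames instead.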

Diegetic vs. Non-Diegetic Audio

A useful framing from film theory: diegetic audio exists in the world of the story (a character's voice, a radio playing in a room); non-diegetic audio exists only for the audience (the film score, narration). Games use both, and the choice is meaningful.

The player's character hearing a sound versus the player hearing a sound is a real distinction. When your game plays combat music during a fight, the character does not hear orchestral strings — the player does. The music is non-diegetic. When the player turns on a jukebox in a bar and music starts playing, both the player and the character hear it. The music is diegetic.

The classic diegetic-music games

Grand Theft Auto's radio stations are the canonical example of diegetic music as design. When you get in a car, you flip through radio stations, each with its own playlist, DJ banter, and genre identity. The music becomes part of the world: the character hears it, the player hears it, and the act of choosing a station is roleplay. The radio is also a solution to a design problem: GTA is an open-world game with long drives, and static non-diegetic score would get old. The diegetic radio gives the player control over their own score, refreshing with each station change, and tying the music to the world's 1980s-Miami or 1990s-California fiction.

BioShock uses diegetic mid-century recordings (roughly 1930s through 1950s) playing on phonographs scattered through Rapture. The music is canonical to the world (it is what the Rapturites listened to) and it establishes the fallen-utopia period setting without a word of exposition. When "Beyond the Sea" plays over the opening descent, it is diegetic in spirit; the player is being welcomed to Rapture by Rapture's own music. This contrast, period music in a horror context, is half the reason the game's tone works.

Fallout uses both: in-world radio stations (diegetic) and a non-diegetic orchestral score. The layering lets the game establish period through diegetic music and emotional framing through the score, without either mode exhausting itself.

Breaking the wall deliberately

Sometimes designers break diegesis deliberately as a design statement. Undertale does this in its boss fights, where the music can read as part of the confrontation rather than accompaniment to it: the Sans fight's "MEGALOVANIA" is so bound to the encounter that it blurs the line between non-diegetic score and character performance. Cadence of Hyrule, as a rhythm game, makes the score diegetic-ish: the game's mechanics are locked to the beat, so the music is simultaneously for the player and the mechanics.

These wall-breaking moves work when the blurring is meaningful. They fail when the designer simply did not think about the distinction. Be deliberate about which mode you are in at each moment, and do not switch modes without reason.

Mixing for Games

A game's mix is the moment-to-moment balance of all audio layers. Is the music louder than the dialogue? Is the UI click louder than the ambient? Does combat SFX overwhelm the music? Can the player hear the enemy footsteps behind them while the player's own weapon is firing?

These are mix questions, and they are design questions. A good mix is invisible — the player never thinks about the mix because every sound is at the right level. A bad mix is intrusive — the player reaches for the volume slider, or the subtitle toggle, or the uninstall button.

Dynamic range vs. compression

Film mixes have large dynamic range: quiet scenes are very quiet, loud scenes are very loud. This works in a controlled environment (cinema, living-room setup). It fails in every other context. Players play games on laptop speakers, on TV speakers at midnight, on commuter headphones, on mid-range phones, on gaming headsets with compressed profiles. Their environments have varied noise floors and varied dynamic ceilings.

Games compensate by compressing dynamic range. Loud sounds are slightly quieter; quiet sounds are slightly louder; the mix sits in a tighter band so that it is audible across environments. This is not a degradation of fidelity — it is a different mixing philosophy, suited to the medium.

The operational concept is loudness normalization. Most games target around -18 to -24 LUFS (integrated loudness) for their mix, depending on genre. Action games run hotter (-18); atmospheric games run cooler (-24). If your mix is significantly louder than this, players will reach for the volume knob immediately. If it is significantly quieter, they will miss important information.

The car-speakers vs. headphones problem

A mix that sounds great on studio monitors will often sound wrong on laptop speakers, phone speakers, car speakers, or cheap earbuds. Bass disappears on small speakers; treble fatigues on earbuds; stereo separation collapses on mono sources. If you mix exclusively on high-end gear, you are mixing for an audience that does not exist.

The discipline is to mix on multiple systems. Studio monitors for fidelity, laptop speakers for worst-case, phone speakers for mobile, good headphones for detail, cheap headphones for the long-tail audience. After every significant mix pass, listen on at least three of these. You will hear things on one that you did not hear on another.

For solo indies without multiple systems: the minimum viable test is studio headphones + laptop speakers + one mobile device. If it sounds good on all three, you are probably fine. If it falls apart on one, you have a problem.

Ducking

When dialogue plays over music, the music needs to duck — lower in volume — so the voice is clear. This is sidechain compression or ducking, and every game mixer uses it. When a VO line fires, the music bus drops by 6–12 dB for the duration of the line, then fades back up. The player hears the dialogue clearly without consciously noticing the music level change.

The same applies to UI: when an important notification fires, you can duck the game mix slightly so the UI sound cuts through. When combat SFX are dense, you might duck music to give the SFX more headroom. Ducking is a fundamental technique, and we will implement a simple version in AudioManager.gd.

Stereo separation

Good mixes use the stereo field deliberately. Ambient beds spread wide. Music sits centered. Dialogue sits centered and slightly forward. Spatialized SFX pan by source position. UI sounds are centered. Leaving everything centered (a mono-ish mix) wastes the spatial information the player's ears can use.

For 2D games, stereo spatialization is usually horizontal panning based on source X-position relative to the player. AudioStreamPlayer2D does this automatically; tune its max_distance, attenuation, and panning_strength properties to taste. For 3D, full spatialization is available via AudioStreamPlayer3D. Use these nodes for everything that has a location; use AudioStreamPlayer (non-positional) only for UI, music, and whole-world ambient.

Accessibility — Sound

A substantial minority of your players will experience audio differently than the typical player. Deaf and hard-of-hearing players may not hear some or any of your game's audio. Players with auditory processing differences may struggle to parse overlapping audio channels. Players with sensory sensitivities may find loud or sudden sounds painful. Players with single-sided hearing (genetic, surgical, traumatic) cannot use stereo localization.

Designing for these players is not charity — it is competent audio design. And, as always, accessibility features benefit everyone: the parent whose baby is asleep, the commuter in a loud train, the player at 2 a.m. who muted the TV.

Closed captions and sound captions

Every VO line needs an optional subtitle. This is baseline. Beyond VO, consider sound captions — on-screen text describing non-dialogue audio events. "[Enemy footsteps, left]." "[Distant growl]." "[Door creaks]." These are the captions you would see on a TV show: [door slams], [tense music playing]. They turn auditory information into visual information without removing it from the game.

Sound captions should be toggleable and designed. Not every caption should fire constantly — that would clutter the screen. Typically you caption (1) VO always, (2) gameplay-critical SFX when the option is enabled (footsteps of hidden enemies, pickup confirmations, death triggers), and (3) musical transitions or narrative ambient cues for atmospheric support. The Last of Us Part II's captioning is the current AAA gold standard — it captions far more than dialogue, and it does so in a way that is readable without being overwhelming.
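The tiering above, plus the bracketed caption format, can be sketched as two small pure functions. Everything here is hypothetical naming for illustration: the SoundCaptions class, the Category enum, the two toggle flags, and the 64-pixel dead zone used to decide when a direction is worth stating.

```gdscript
# SoundCaptions.gd (hypothetical helper)
class_name SoundCaptions
extends RefCounted

enum Category { VOICE, GAMEPLAY, ATMOSPHERE }

# Decides whether a caption of this category should appear, given the
# player's two caption toggles. Voice always passes, per the tiers above.
static func should_show(category: Category, gameplay_on: bool, atmosphere_on: bool) -> bool:
    match category:
        Category.VOICE:
            return true
        Category.GAMEPLAY:
            return gameplay_on
        Category.ATMOSPHERE:
            return atmosphere_on
    return false

# Builds "[Description, left]" style text. The direction is omitted when the
# source is within dead_zone pixels of the listener horizontally.
static func format_caption(description: String, source_x: float, listener_x: float,
                           dead_zone := 64.0) -> String:
    var dx := source_x - listener_x
    if absf(dx) <= dead_zone:
        return "[%s]" % description
    return "[%s, %s]" % [description, "left" if dx < 0.0 else "right"]
```

A caption for footsteps at x = 100 heard by a listener at x = 400 would read "[Enemy footsteps, left]". Keeping the filter and the formatter separate from the UI node that draws the text makes both easy to tune without touching the HUD.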

Vibration as audio proxy

Controller rumble can substitute for audio cues. An enemy attack that produces a low growl SFX can also trigger a short, low-frequency rumble. A deaf player who cannot hear the growl can still feel the impending attack. This is not a replacement for captioning — it is an additional channel.

Modern controllers (DualSense) have richer haptics than older ones, allowing for textured vibration that carries more information than simple rumble. Returnal uses this extensively, with different rumble patterns communicating different threat types. This is the frontier of accessible audio: using non-auditory channels to carry audio information.
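As a rough sketch of rumble as an audio proxy, the helper below pairs a growl with a short pulse on the strong motor (typically the lower-frequency one). HapticCue, its default values, and the distance-based strength curve are assumptions made for illustration; only Input.start_joy_vibration is Godot's own call.

```gdscript
# HapticCue.gd (hypothetical helper)
class_name HapticCue
extends RefCounted

# Closer threats rumble harder: 1.0 at distance 0, fading to 0.0 at max_distance.
static func strength_for_distance(distance: float, max_distance: float) -> float:
    return clampf(1.0 - distance / max_distance, 0.0, 1.0)

# Fires a short low rumble on the given gamepad. Call it at the same moment
# you play the growl SFX so the two cues arrive together.
static func growl_rumble(device: int = 0, strength: float = 0.6, seconds: float = 0.4) -> void:
    # Arguments: device, weak motor magnitude, strong motor magnitude, duration.
    Input.start_joy_vibration(device, 0.0, clampf(strength, 0.0, 1.0), seconds)
```

Expose a setting to disable or scale vibration as well, since some players find rumble uncomfortable.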

Mono summing

A player who is deaf in one ear cannot use stereo. Their "left-right" information is lost; all they hear is whatever arrives at their hearing ear. If your spatialized footstep pans fully left, they miss it.

The accessibility feature is a mono sum option: combine left and right channels into a centered mono output, so no information is lost to the non-hearing side. Godot has no effect literally named "mono sum", but an AudioEffectStereoEnhance on the Master bus with pan_pullout set to 0.0 downmixes stereo to mono. Expose it as a setting ("Mono Audio: On/Off") and you have solved a real problem for single-ear players.
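One way to build the mono toggle in Godot 4 is an AudioEffectStereoEnhance on the Master bus, whose pan_pullout property collapses stereo to mono at 0.0 and leaves panning untouched at 1.0. The MonoAudioToggle class below is a hypothetical wrapper; in the chapter's project the same two members could live in AudioManager.gd instead.

```gdscript
# MonoAudioToggle.gd (hypothetical helper)
class_name MonoAudioToggle
extends RefCounted

var effect: AudioEffectStereoEnhance

func set_mono_audio(enabled: bool) -> void:
    var master := AudioServer.get_bus_index("Master")
    # Add the effect once, the first time the setting is touched.
    if effect == null:
        effect = AudioEffectStereoEnhance.new()
        AudioServer.add_bus_effect(master, effect)
    # 0.0 downmixes to mono; 1.0 leaves the stereo image as it was.
    effect.pan_pullout = 0.0 if enabled else 1.0
```

Because the effect sits on Master, it catches everything, including spatialized footsteps that would otherwise pan fully to one side.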

Volume sliders

The non-negotiable baseline: separate volume sliders for Master, Music, SFX, Voice, and (ideally) Ambient. Allow each to go to 0. Allow each to be adjusted independently while the game is running. Do not hide them behind twenty menu clicks. If a player wants to mute your music and keep only dialogue + SFX — a common preference for streamers who cannot play copyrighted music — they should be able to do so in ten seconds.

This is ten minutes of engineering work and a settings menu entry. There is no excuse to ship without it.

Budget Sound Design

If you are a small team or a solo developer, you cannot commission a full original score and a full sound design pass. You do not have the money. That is okay. You can still ship a game with good audio. The techniques differ.

Royalty-free libraries

Freesound.org is a community library of user-contributed audio. Huge. Free for most uses (always check per-sample license). Zapsplat.com is a commercial library with a free tier. Soundly is a paid sound library used by film professionals. Epidemic Sound and Artlist offer music and SFX on a subscription model.

For a solo developer, Freesound plus a few hours of searching can assemble a passable SFX palette for a small game. The trick is to curate aggressively. Do not grab the first result. Listen to twenty options per need. Pick the one that matches your game's tone. Treat the library as raw material, not finished work.

Sub-syncing

Sub-syncing is using an SFX for something other than what it was recorded for. A tiger's growl, pitched down, can be the sound of a monster. A squeaking door hinge, pitched up, can be the sound of a small mechanical creature. A crumpling paper bag can be footsteps on leaves.

Foley artists have done this since the early days of sound film (the classic example is coconuts as horse hooves) and game sound designers do it constantly. Your SFX library is not a catalog to be matched one-to-one; it is a supply of raw sonic material to be cut, pitched, layered, and re-used across contexts.

Learn a DAW (digital audio workstation) at a basic level. Reaper is cheap and great. Audacity is free and adequate for basic edits. You do not need to be a mixing engineer. You need to be able to trim, pitch-shift, layer two samples, and export a .wav. This is an afternoon of learning. It will transform your audio options.

Foley on a budget

Foley is recorded in-studio sound effects for things that cannot be captured in-action: footsteps, clothing rustle, prop impacts. Professional foley is done in acoustically-treated rooms with serious microphones. Budget foley is done in your closet with a phone.

This sounds like a joke. It is not. A clothes closet is an acoustically-dry environment (soft absorbent surfaces on all sides) that produces clean recordings. A modern phone records at 48 kHz stereo, which is more than adequate for game SFX. You can record footsteps on a gravel tray, sword swishes with a stick, cloth rustles with an actual shirt, water splashes with a bowl, and you will have usable audio with thirty minutes of work.

This is how small indies get distinctive SFX — by recording their own, not by pulling library samples every other game has already used. The sound of your sword swing will be your sound because you recorded your roommate swinging a stick in your closet. This is not lesser than a library sample. It is arguably better, because it is original.

When to hire vs. DIY

The heuristic: hire a composer for music. DIY your SFX. This is for two reasons.

Music requires composition skills that are hard to fake. A bad original score is worse than a good library track. If you have never composed, hours of noodling will not produce a good theme. Hire a composer, even on a small budget: you will find serious composers willing to work on indie projects for equity, for portfolio, or for modest fixed fees.

SFX can be learned with craft. Sound design is an engineering task as much as a creative one, and engineers can learn it. Give yourself a week with tutorials and a DAW and you can do passable SFX work. Give yourself six months and you can do good SFX work. You will not compose a theme in six months of trying, but you can absolutely learn to layer three library samples into a great hit sound.

A reasonable indie budget allocates ~70 percent of audio spend to music (composer fees) and ~30 percent to SFX (library subscriptions, a microphone, a DAW, some time). A solo developer with no budget at all can still ship with a decent Freesound-curated SFX palette and a composer willing to work for revenue share.

GDScript Implementation — AudioManager.gd

Time to build. We will implement three scripts that together form the audio layer of your progressive project: AudioManager.gd (the global audio singleton), SpatialAudio.gd (2D positional SFX), and DynamicMusicPlayer.gd (combat/exploration music state machine).

Bus setup

First, set up your audio bus layout in Godot. Open the Audio panel at the bottom of the editor (the layout is saved to res://default_bus_layout.tres by default). You want this structure:

Master — top-level
Music — BGM bus
SFX — sound effects bus
Voice — dialogue bus
Ambient — ambient bed bus
UI — menu sounds bus

Each of these routes to Master. Each has its own volume fader. Your settings menu will expose sliders for all five (plus Master).
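AudioServer.get_bus_index returns -1 for a name it cannot find, so a misspelled bus fails quietly. A small startup check can catch that early. The sketch below is an assumption-labeled helper, not part of the chapter's scripts: BusCheck and REQUIRED_BUSES are hypothetical names.

```gdscript
# BusCheck.gd (hypothetical helper)
class_name BusCheck
extends RefCounted

const REQUIRED_BUSES := ["Master", "Music", "SFX", "Voice", "Ambient", "UI"]

# Returns the names from `required` that are missing from the current bus layout.
static func missing_buses(required: Array = REQUIRED_BUSES) -> Array:
    var missing := []
    for bus_name in required:
        if AudioServer.get_bus_index(bus_name) == -1:
            missing.append(bus_name)
    return missing

# Example use at startup (for instance at the top of an autoload's _ready):
#
#   for bus_name in BusCheck.missing_buses():
#       push_error("Audio bus '%s' is missing from the bus layout" % bus_name)
```

Running the check once at boot turns a silent routing mistake into a visible error in the debugger.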

AudioManager.gd — the singleton

Create res://autoload/AudioManager.gd and register it in Project Settings → Autoload as AudioManager.

# res://autoload/AudioManager.gd
extends Node

# -- Bus references (indices cached on ready) --
var master_bus: int
var music_bus: int
var sfx_bus: int
var voice_bus: int
var ambient_bus: int
var ui_bus: int

# -- Player pools for concurrent SFX --
const SFX_POOL_SIZE := 16
var sfx_players: Array[AudioStreamPlayer] = []
var sfx_pool_index := 0

# -- Long-lived players for music & ambient --
var music_player: AudioStreamPlayer
var ambient_player: AudioStreamPlayer

# -- Ducking state (for dialogue-over-music) --
var music_duck_target := 0.0
var music_duck_current := 0.0

func _ready() -> void:
    master_bus = AudioServer.get_bus_index("Master")
    music_bus = AudioServer.get_bus_index("Music")
    sfx_bus = AudioServer.get_bus_index("SFX")
    voice_bus = AudioServer.get_bus_index("Voice")
    ambient_bus = AudioServer.get_bus_index("Ambient")
    ui_bus = AudioServer.get_bus_index("UI")

    # Build SFX player pool
    for i in SFX_POOL_SIZE:
        var p := AudioStreamPlayer.new()
        p.bus = "SFX"
        add_child(p)
        sfx_players.append(p)

    # Music and ambient (long-lived, single instances)
    music_player = AudioStreamPlayer.new()
    music_player.bus = "Music"
    add_child(music_player)

    ambient_player = AudioStreamPlayer.new()
    ambient_player.bus = "Ambient"
    add_child(ambient_player)

A few design choices to notice. The SFX pool is a fixed array of pre-allocated AudioStreamPlayer nodes. When you want to play an SFX, you pick the next one in the pool (round-robin) and hand it a stream. This avoids allocating and freeing nodes at runtime, which is slow in Godot. The trade-off is that if all sixteen players are busy, the next call reuses the oldest one and cuts its sound short. Sixteen is usually enough for a 2D game; if you have a lot of concurrent audio events (bullet hell, DOOM-style shooter), raise it.

Music and ambient each get one long-lived player, because you only ever want one track of each playing at a time. Cross-fades are achieved by having two players, which we will see in DynamicMusicPlayer.gd.

Playing SFX

Add this method to AudioManager.gd:

func play_sfx(stream: AudioStream, volume_db := 0.0, pitch_variance := 0.1) -> void:
    var p := sfx_players[sfx_pool_index]
    sfx_pool_index = (sfx_pool_index + 1) % SFX_POOL_SIZE
    p.stream = stream
    # Reset the bus: a pooled player may have been re-routed by an earlier call
    p.bus = "SFX"
    p.volume_db = volume_db
    # Randomize pitch slightly to avoid mechanical repetition
    p.pitch_scale = 1.0 + randf_range(-pitch_variance, pitch_variance)
    p.play()

Now anywhere in your game, you call:

AudioManager.play_sfx(preload("res://audio/sfx/sword_hit.wav"))

And the SFX plays with slight pitch randomization and no allocation. This is the pattern you use for every event-driven sound in the game: hits, jumps, lands, pickups, UI clicks, door-opens. Fire-and-forget.

For SFX with a specific bus need (UI click should use UI bus, not SFX), add a bus parameter:

func play_sfx_on_bus(stream: AudioStream, bus: String, volume_db := 0.0, pitch_variance := 0.1) -> void:
    var p := sfx_players[sfx_pool_index]
    sfx_pool_index = (sfx_pool_index + 1) % SFX_POOL_SIZE
    p.stream = stream
    p.bus = bus
    p.volume_db = volume_db
    p.pitch_scale = 1.0 + randf_range(-pitch_variance, pitch_variance)
    p.play()

Music cross-fade

Add a cross-fade method for smooth music transitions:

var music_fade_tween: Tween

func play_music(stream: AudioStream, fade_seconds := 2.0) -> void:
    if music_player.stream == stream and music_player.playing:
        return  # Already playing this track

    # Kill any prior tween
    if music_fade_tween:
        music_fade_tween.kill()

    # Fade out current, then swap, then fade in
    music_fade_tween = create_tween()
    music_fade_tween.tween_property(music_player, "volume_db", -80.0, fade_seconds * 0.5)
    music_fade_tween.tween_callback(func():
        music_player.stream = stream
        music_player.volume_db = -80.0
        music_player.play()
    )
    music_fade_tween.tween_property(music_player, "volume_db", 0.0, fade_seconds * 0.5)

Strictly speaking, this is a fade-out followed by a fade-in rather than a true cross-fade: half the time fading out, a swap, half fading in, with a brief dip to silence in the middle. For smoother results you can use a dedicated secondary_music_player and cross-fade between the two, which eliminates that silence. We will do exactly that for dynamic music in DynamicMusicPlayer.gd.

Ambient layering

Ambient is similar but simpler — you rarely want hard cuts; you almost always want cross-fades on area change:

var ambient_fade_tween: Tween

func play_ambient(stream: AudioStream, fade_seconds := 4.0) -> void:
    if ambient_player.stream == stream and ambient_player.playing:
        return
    # Kill any fade still in flight so rapid area changes do not stack tweens
    if ambient_fade_tween:
        ambient_fade_tween.kill()
    ambient_fade_tween = create_tween()
    ambient_fade_tween.tween_property(ambient_player, "volume_db", -80.0, fade_seconds * 0.5)
    ambient_fade_tween.tween_callback(func():
        ambient_player.stream = stream
        ambient_player.volume_db = -80.0
        ambient_player.play()
    )
    ambient_fade_tween.tween_property(ambient_player, "volume_db", 0.0, fade_seconds * 0.5)

Call AudioManager.play_ambient(cave_ambient) when the player enters a cave; four seconds later the ambient has cross-faded from forest to cave.

Bus volume control

Expose bus volume control for your settings menu:

func set_bus_volume(bus_name: String, linear_volume: float) -> void:
    # linear_volume is 0.0 to 1.0 from the slider
    var bus_idx := AudioServer.get_bus_index(bus_name)
    if bus_idx < 0:
        return
    if linear_volume <= 0.0:
        AudioServer.set_bus_mute(bus_idx, true)
    else:
        AudioServer.set_bus_mute(bus_idx, false)
        AudioServer.set_bus_volume_db(bus_idx, linear_to_db(linear_volume))

Your settings menu hands slider values (0.0–1.0) to this method, which converts them to decibels with Godot's linear_to_db helper. set_bus_volume_db expects decibels, not a 0–1 fraction, so never pass the raw slider value straight through: hearing is roughly logarithmic, and the dB conversion is what makes the slider behave sensibly.
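A few worked values make the conversion concrete. The helper below is a hypothetical wrapper (VolumeMath and slider_to_db are illustrative names) around Godot's linear_to_db; it clamps the input because linear_to_db(0.0) is negative infinity.

```gdscript
# VolumeMath.gd (hypothetical helper)
class_name VolumeMath
extends RefCounted

# Converts a 0.0-1.0 slider value to decibels, flooring at about -80 dB.
static func slider_to_db(linear_volume: float) -> float:
    return linear_to_db(clampf(linear_volume, 0.0001, 1.0))

# Worked values (dB = 20 * log10(linear)):
#   1.0    ->   0 dB   (full volume)
#   0.5    ->  about -6 dB
#   0.25   ->  about -12 dB
#   0.1    ->  -20 dB
#   0.0001 ->  -80 dB  (the floor used here)
```

The same arithmetic explains the ducking figures earlier in the chapter: a 12 dB drop leaves the music at roughly a quarter of its original amplitude, and a 6 dB drop at roughly half.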

Ducking

For dialogue-over-music ducking, add:

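var music_duck_tween: Tween

func duck_music(db_drop: float = -12.0, duration: float = 0.3) -> Tween:
    # Kill any duck/unduck still running so two tweens never fight over volume_db
    if music_duck_tween:
        music_duck_tween.kill()
    music_duck_tween = create_tween()
    music_duck_tween.tween_property(music_player, "volume_db", db_drop, duration)
    return music_duck_tween

func unduck_music(duration: float = 0.6) -> void:
    if music_duck_tween:
        music_duck_tween.kill()
    music_duck_tween = create_tween()
    music_duck_tween.tween_property(music_player, "volume_db", 0.0, duration)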

Now in your dialogue system (the DialogueSystem.gd from Chapter 21), when you start a VO line:

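AudioManager.duck_music()
# ... play the VO line and wait for it to finish before continuing,
# e.g. `await voice_player.finished` (voice_player is a placeholder for whatever node plays your VO) ...
AudioManager.unduck_music()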

This is the minimum-viable ducking system. The music drops 12 dB (meaningfully quieter but still audible) for the duration of the line, then returns. Players never consciously notice. If they did, you would have a mix problem.

SpatialAudio.gd — positional SFX

For sounds that have a location in the game world (an enemy's attack at some off-screen position, a dripping pipe, a nearby waterfall), use AudioStreamPlayer2D via a small helper:

# res://scripts/SpatialAudio.gd
class_name SpatialAudio
extends Node

static func play_at(parent: Node, stream: AudioStream, position: Vector2,
                    max_distance := 800.0, volume_db := 0.0, pitch_variance := 0.1) -> void:
    var p := AudioStreamPlayer2D.new()
    p.bus = "SFX"
    p.stream = stream
    p.max_distance = max_distance
    p.volume_db = volume_db
    p.pitch_scale = 1.0 + randf_range(-pitch_variance, pitch_variance)
    # Auto-remove when finished (connect before playing)
    p.finished.connect(p.queue_free)
    parent.add_child(p)
    # Set the world-space position after entering the tree, so it is correct
    # even if the parent is not at the origin
    p.global_position = position
    p.play()

Call it like this, passing the game world as parent and the event position:

SpatialAudio.play_at(get_tree().current_scene,
    preload("res://audio/sfx/enemy_growl.wav"),
    enemy.global_position)

Godot handles the attenuation and panning automatically based on distance and horizontal position relative to the AudioListener2D (usually your player camera). Off-screen enemies sound distant; on-screen enemies sound clear; enemies behind you pan to the left or right depending on their position.

Note: this version allocates one AudioStreamPlayer2D per call and frees it when done. For most games this is fine. If you have hundreds of concurrent spatial sounds (a large battle), you would extend this with a pool, exactly as we did for non-spatial SFX.

DynamicMusicPlayer.gd — combat/exploration state machine

Finally, a simple dynamic music system. Create res://autoload/DynamicMusicPlayer.gd and autoload it:

# res://autoload/DynamicMusicPlayer.gd
extends Node

enum MusicState { NONE, EXPLORE, COMBAT, BOSS }

var current_state: MusicState = MusicState.NONE
var explore_player: AudioStreamPlayer
var combat_player: AudioStreamPlayer
var boss_player: AudioStreamPlayer
var active_tween: Tween

const FADE_TIME := 2.0

func _ready() -> void:
    explore_player = _make_player("Music")
    combat_player = _make_player("Music")
    boss_player = _make_player("Music")
    # Start muted; caller loads streams and transitions
    explore_player.volume_db = -80.0
    combat_player.volume_db = -80.0
    boss_player.volume_db = -80.0

func _make_player(bus: String) -> AudioStreamPlayer:
    var p := AudioStreamPlayer.new()
    p.bus = bus
    add_child(p)
    return p

func set_tracks(explore: AudioStream, combat: AudioStream, boss: AudioStream = null) -> void:
    explore_player.stream = explore
    combat_player.stream = combat
    if boss:
        boss_player.stream = boss
    explore_player.play()
    combat_player.play()
    if boss:
        boss_player.play()

# Cross-fade to new_state over `fade` seconds; does nothing if already in that state.
func transition_to(new_state: MusicState, fade: float = FADE_TIME) -> void:
    if new_state == current_state:
        return
    current_state = new_state
    # Cancel any fade still in flight so two tweens never fight over volume_db.
    if active_tween:
        active_tween.kill()
    # Parallel mode runs all three fades at once instead of one after another.
    active_tween = create_tween().set_parallel(true)
    active_tween.tween_property(explore_player, "volume_db",
        0.0 if new_state == MusicState.EXPLORE else -80.0, fade)
    active_tween.tween_property(combat_player, "volume_db",
        0.0 if new_state == MusicState.COMBAT else -80.0, fade)
    active_tween.tween_property(boss_player, "volume_db",
        0.0 if new_state == MusicState.BOSS else -80.0, fade)

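The design is three parallel AudioStreamPlayer nodes, all playing their streams simultaneously from the moment set_tracks is called, but all muted except the currently active one. When you transition from explore to combat, you fade the explore player down and the combat player up in parallel: a cross-fade, with no stop-and-restart gap between tracks. One caveat: tweening volume_db gives a fade that is linear in decibels, so halfway through both tracks sit near -40 dB and the mix dips noticeably. If that dip bothers you, fade a 0-to-1 amplitude instead and convert it with linear_to_db.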
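If the middle of the cross-fade sounds too quiet, one option is to fade amplitude rather than decibels. A tween on volume_db from 0 to -80 passes through -40 dB at its midpoint; a tween on a 0-to-1 amplitude, converted with linear_to_db, passes through roughly -6 dB instead. The following is a minimal sketch under that assumption, not part of the chapter's listing. The names amplitude_to_db and fade_player are hypothetical.

```gdscript
extends Node

# Hypothetical floor: 0.0001 in linear amplitude is -80 dB, the same
# "muted" level the DynamicMusicPlayer listing uses.
const SILENT_AMPLITUDE := 0.0001

# Convert a 0..1 amplitude to decibels, clamped so that zero is never
# passed to linear_to_db (which would return negative infinity).
static func amplitude_to_db(amplitude: float) -> float:
    return linear_to_db(clampf(amplitude, SILENT_AMPLITUDE, 1.0))

# Queue a fade on `tween` that moves `player` to `target_amplitude`
# (0.0 = muted, 1.0 = full volume), linear in amplitude, not decibels.
func fade_player(tween: Tween, player: AudioStreamPlayer,
        target_amplitude: float, fade: float) -> void:
    var start := db_to_linear(player.volume_db)
    var apply := func(a: float) -> void:
        player.volume_db = amplitude_to_db(a)
    tween.tween_method(apply, start, target_amplitude, fade)
```

Inside transition_to, each tween_property call would become something like fade_player(active_tween, combat_player, 1.0 if new_state == MusicState.COMBAT else 0.0, fade). Because the tween is already in parallel mode, the three fades still run together.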

Because all three players started at the same time, they are phase-aligned in the musical sense: measure 1 of the combat track aligns (in time) with measure 1 of the explore track. That alignment survives looping only if the loops are the same length, or whole multiples of one another; otherwise the tracks drift apart after the first pass. If your composer wrote the tracks at the same tempo and with compatible structures, the cross-fade will feel musical rather than abrupt. This is the composer's responsibility. Give them the spec: "two tracks, 90 BPM, 4/4, structured so they can cross-fade at any bar line."
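If you want a transition to start on a bar line rather than at whatever instant the player triggered it, the wait can be worked out from the tempo. At 90 BPM in 4/4, one bar lasts 4 × 60 / 90 ≈ 2.67 seconds. The sketch below assumes a constant tempo and a track that starts on a downbeat. The function names are hypothetical and do not appear in the chapter's listing.

```gdscript
extends Node

# Length of one bar in seconds: beats_per_bar beats, each 60 / bpm seconds long.
static func bar_length(bpm: float, beats_per_bar: int) -> float:
    return beats_per_bar * 60.0 / bpm

# Seconds remaining until the next bar line, given how far into the
# track playback currently is (in seconds).
static func seconds_to_next_bar(playback_pos: float, bpm: float, beats_per_bar: int) -> float:
    var bar := bar_length(bpm, beats_per_bar)
    var into_bar := fmod(playback_pos, bar)
    # Already on a bar line: no need to wait.
    if is_zero_approx(into_bar):
        return 0.0
    return bar - into_bar
```

A caller could read explore_player.get_playback_position(), pass it to seconds_to_next_bar, wait that long with get_tree().create_timer(...).timeout, and then call transition_to. The playback position is only updated as the audio server mixes, so treat the result as approximate; with a two-second fade, that is usually close enough.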

Usage from your combat system (the CombatSystem.gd of Chapter 26):

# On combat start
DynamicMusicPlayer.transition_to(DynamicMusicPlayer.MusicState.COMBAT)

# On combat end (e.g., all enemies dead, no alerts for N seconds)
DynamicMusicPlayer.transition_to(DynamicMusicPlayer.MusicState.EXPLORE)

# On boss encounter
DynamicMusicPlayer.transition_to(DynamicMusicPlayer.MusicState.BOSS)

The state machine is minimal — three states and transitions — but it produces a meaningfully interactive score. When the player aggros an enemy, combat music swells in over two seconds. When combat ends, exploration music eases back. When a boss door opens, the boss theme takes over. Players will feel the responsiveness without thinking about it. That is what good dynamic music does: it feels composed even though it is reacting to your gameplay in real time.

Progressive Project Update — Chapter 30

At the end of Chapter 29, your project had a full UI layer. Chapter 30 adds the audio layer:

Set up the bus structure (Master / Music / SFX / Voice / Ambient / UI).
Autoload AudioManager.gd and wire it into every existing system that should produce sound: Player.gd calls play_sfx on footsteps, jump, land, take-damage, heal. CombatSystem.gd calls it on attack, hit-register, parry, miss. ScreenShake.gd (Chapter 8's juice) gains a hit SFX alongside the visual effect. ShopNPC.gd (Chapter 24) plays UI sounds for buy/sell. DialogueSystem.gd (Chapter 21) plays a typewriter chirp per character and ducks music during voiced lines.
Autoload DynamicMusicPlayer.gd and give it three tracks: an exploration loop, a combat loop, and a boss loop, all at the same tempo. Wire it into CombatSystem.gd and BossFight.gd state changes.
Add at least two ambient beds (one for the outdoor Level 1 area, one for the cave/interior area) and switch them on area-transition events.
Expose the five volume sliders in your settings menu (Chapter 29 scaffold) and make them call AudioManager.set_bus_volume.
Add a Mono Audio toggle that applies a mono-sum effect to the Master bus for single-ear accessibility.
Add a Subtitle toggle that enables sound-caption display for gameplay-critical SFX.

This is one to two weeks of solid work for a solo developer, assuming you source SFX from Freesound and use an agreed-upon composer for the three music tracks. When you ship Chapter 30's deliverable, run the mute test on the result. Hand your project to one friend with the sound off and to another with it on. Compare what each can tell you about what is happening on screen. That gap is the value your audio layer delivered.

This work connects back to Chapter 8's feedback systems (every juice effect now has an audio counterpart), Chapter 11's flow concept (music now reinforces the energy of the current state), Chapter 15's emotional design (music and ambient now do the heavy lifting for scene tone), Chapter 26's combat design (hit SFX and combat music now sell your combat feel), and Chapter 27's AI barks (enemy audio reactions are now gameplay-readable). It forward-loads into Chapter 32's balancing work: much of what you will balance there involves audio levels and mix decisions across the whole game.

Common Pitfalls

No audio at all for important actions. The player jumps; nothing plays. The player picks up a key; nothing plays. The player takes damage; nothing plays. Silent actions feel broken. Every gameplay-relevant action needs at least a minimal SFX. If you cannot afford a great SFX for every action, grab a placeholder from Freesound and ship with the placeholder until you can replace it. Silent is worse than generic.

SFX too loud or over-compressed. A sword swing that peaks at 0 dBFS every time is fatiguing. Ears close down. Players turn the volume off. Mix your SFX so they peak between -6 and -12 dBFS (roughly half to a quarter of full-scale amplitude), leaving headroom. Gentle compression is fine; brick-walled, loudness-war mixes produce listener fatigue within minutes.

One looping track for four hours. You have one piece of music for the whole first zone, and the zone takes the player four hours to explore. By hour two they hate the track. By hour four they hate you. The minimum is three or four tracks per major zone, rotated; the better approach is a dynamic layered system where the music varies with state. If you truly have only one track per zone, make sure the zone is small enough to beat in twenty minutes.
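If you go the rotation route, the one rule worth enforcing is that the same track never plays twice in a row. The sketch below is one way to pick the next track under that rule. The function name next_track_index is hypothetical, and it assumes you keep the zone's tracks in an array and track the index of the one currently playing.

```gdscript
extends Node

# Pick the index of the next track at random, never repeating `current`.
# Pass a negative `current` when nothing has played yet.
static func next_track_index(current: int, track_count: int, rng: RandomNumberGenerator) -> int:
    if track_count <= 1:
        return 0
    if current < 0:
        return rng.randi_range(0, track_count - 1)
    # Draw from track_count - 1 slots, then step over the current index
    # so every other track is equally likely.
    var pick := rng.randi_range(0, track_count - 2)
    if pick >= current:
        return pick + 1
    return pick
```

One way to use it is to connect the music player's finished signal to a handler that calls next_track_index, assigns the chosen stream, and calls play(). This applies to non-looping zone tracks, not to the looping layers in DynamicMusicPlayer.gd.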

Voice lines blaring over BGM. You added VO but did not implement ducking. Now every dialogue line is a wall of overlapping music and voice, and the voice is half-audible. Ducking is mandatory. Implement it before you ship a single voiced line.

No subtitle controls, no volume sliders, no mono option. Accessibility basics. Shipping without these is a failure, not a compromise.

Copyrighted music in a royalty-free library. A subtle disaster — someone uploaded a Zimmer cue to a "free" library and the DMCA claim hits your game three months after launch. Verify licenses. When in doubt, use reputable sources (Freesound with per-sample license check, Epidemic Sound with confirmed terms, a composer under contract) and keep documentation.

Summary

Audio is the sense you forgot about. It is doing half the work of your game's feel, and designers who do not speak audio fluently are building on top of a hole in their craft. The mute test reveals the hole. Four layers — SFX, music, voice, ambient — cover every kind of audio work, and each has its own craft concerns, pitfalls, and techniques. Sound effects carry feel through layered attack-impact-body-release envelopes, with pitch variance to defeat repetition. Music carries emotion through mood, pacing, transition-signaling, and identity, with dynamic techniques (vertical layering, horizontal re-sequencing, state machines) scaling up in complexity as your game and team do. Voice is expensive and optional, and the no-VO games work as well as the full-VO ones when the text delivery is designed with care. Ambient makes the world feel real. Diegetic audio binds the music to the world; non-diegetic audio frames the experience from outside; the distinction is a real design choice.

Mixing is the invisible craft that balances all of this. A bad mix drives players to the volume slider within minutes. A good mix is inaudible as a mix — players hear the game. Accessibility — captions, volume sliders, mono options, vibration — is non-negotiable. Budget sound design is a learnable craft; Freesound plus a DAW plus a composer on a small fee will ship a respectable audio layer for an indie game.

You now have AudioManager.gd, SpatialAudio.gd, and DynamicMusicPlayer.gd wired into your project. Run the mute test on your build. Listen to what you have. Fix what is broken. Then move on to Chapter 31, where we test the whole thing in front of real players and discover, inevitably, that half of what we thought worked does not, and half of what we thought did not work is secretly carrying the game. Audio will be at the center of that reckoning. It always is.