Chapter 31: Playtesting — The Most Important Skill No Designer Wants to Practice

Claude (Anthropic)

In This Chapter

The Uncomfortable Truth
Playtesting vs. QA
Types of Playtest
Who to Playtest With
How to Prepare a Playtest
Running the Session
The Two Questions Rule
Collecting Data
Interpreting Results
Common Playtest Traps
Acting on Results
Remote vs. In-Person
Analytics and Telemetry as Ongoing Playtesting
Progressive Project Update — Ch 31
Common Pitfalls
Summary

There is a specific silence that descends over a designer's office the first time they put their game in front of a stranger. You have spent months on this thing. You know every input, every edge case, every secret. You are excited. The tester sits down, picks up the controller, and immediately does something you have never seen a human being do. They walk the wrong way. They press the wrong button. They miss the affordance you were so proud of. They mistake your inventory screen for the pause menu. They die to the first enemy because they did not realize they could jump.

You sit there, saying nothing, watching your game fail in real time. And the worst part — the part nobody tells you about when you decide to be a game designer — is that this is not the tester's fault. This is your game. This is what your game actually is, once it leaves your head. Every assumption you made about what was obvious, what was intuitive, what was "explained by the design" — every single one of those assumptions is now being measured against a real human brain, and most of them are failing.

Playtesting is the practice of sitting in that silence on purpose. Of seeking it out. Of structuring your development process around it, so that the silence happens as early and as often as possible, while there is still time to fix what it reveals. Playtesting is the single most important skill a game designer can develop, and it is the skill that almost every designer avoids, delays, underinvests in, and rationalizes away. There is a reason this chapter exists, and there is a reason it sits near the end of the book rather than the start: most designers only understand why they needed to playtest after they have shipped a game that was never playtested.

The recurring theme of this book has been playtest or die. Chapter 1 said the player experience is the only thing that matters. Chapter 4 said you are not your player. Chapter 11 argued that flow is a fragile state you can only observe, not engineer. Chapter 16 pushed you to get levels in front of testers before the tilemap was finalized. Every chapter has been circling this one. Now we address it directly. You will not escape the need for this practice. You cannot think your way past it, you cannot substitute your own intuition for it, and you cannot replace it with any number of reviews, critiques, or discussions among designers. The only replacement for playtesting is shipping a game that has not been playtested, and the player reviews will tell you what the tests would have told you — except now it is too late.

The Uncomfortable Truth

Your opinion about your game is wrong. Not "might be wrong." Not "could be improved by input." Wrong in specific, measurable, predictable ways, and the longer you have been working on the game, the more wrong your opinion has become.

This is not a rhetorical flourish. It is a cognitive fact about how human brains interact with creative work. When you build something complex over time, you develop what is often called expert blindness (a close relative of the curse of knowledge): a deep, internalized familiarity with the object that makes it impossible for you to see it the way a first-time observer sees it. You know where the hidden paths are. You know which enemy attacks are telegraphed. You know that the yellow glow on the ledge means "climbable." You cannot unknow these things. When you look at your game, you are looking at a palimpsest of your own design memory, not the game itself.

A first-time player looks at a screen of pixels and makes guesses. The player who plays Hollow Knight for the first time does not know that the shimmer on the cave wall means a breakable wall. They do not know that the bench ahead is a checkpoint. They do not know that the bug hovering in the distance is friendly. They have to infer all of this from visual language they have not yet learned. Your job as a designer is to build affordances that communicate these things to a player who knows nothing. Your ability to evaluate whether your affordances work, whether the yellow glow reads as "climbable" to a brain that has never seen your game, is structurally compromised by the months you have spent inside the game.

The second fact, related but distinct: you are not a representative sample. You are a person who decided to make video games for a living, which means your relationship to video games is pathological in ways that are invisible from the inside. You are faster with a controller than 95 percent of your audience. You have played more games. You have played more obscure games. You have internalized conventions — the crouch-button is always on O, double-tap-to-dodge, the X prompt means interact — that your audience may or may not share. When you play your own game and conclude "the combat feels tight," what you are actually saying is "the combat feels tight to someone who has been playing my combat for six months at four hours a day." A first-time player will have a different experience, and that experience is the one you are shipping.

The third fact: you have been inside the game too long. Every day you work on it, your calibration drifts further from the player's. You lose sensitivity to the things you have been staring at for months — the UI feels normal to you because you have been staring at it since November. You gain sensitivity to things that players will never notice — the one-frame animation inconsistency that bothers you because you saw it at Wednesday's review. Your internal meter is reading noise. You cannot trust it. And the fix is not "try harder to be objective." The fix is to stop asking your own nervous system the question and start asking other people's.

This is the uncomfortable truth, and it is uncomfortable because it implies you cannot finish your game alone. You cannot evaluate your own work to completion. The finishing step — the step that turns a game you think is good into a game that actually is good — requires outsiders. You must, at some point, hand the controller to strangers, watch them play, and respond to what you see. There is no workaround. The designers who learn this lesson early ship better games faster. The designers who learn it late ship worse games later. The designers who never learn it ship once, read the reviews, and quietly change careers.

Playtesting vs. QA

Before going further, disambiguate two practices that beginners routinely confuse: playtesting and quality assurance. Both involve putting the game in front of humans who will report issues. They are not the same discipline and they serve different questions.

QA tests correctness. Does the game crash on this level? Does the save system corrupt at this transition? Does the audio glitch when you pause during dialogue? Does the boss's second phase trigger even if you cheese the first one? QA's output is bug reports — specific reproducible failures of the software as software, with steps-to-reproduce and severity ratings. QA testers are often specialists who play methodically, file careful reports, and probe the weird corners where bugs hide. Large studios employ QA teams; indie developers do QA themselves, or pay contractors per hour. QA is essential. QA is not what this chapter is about.

Playtesting tests design. Is the core loop fun? Does the player understand what to do? Are they motivated to continue? Does the difficulty feel fair, or frustrating, or trivial? Does the boss feel climactic, or cheap, or forgettable? Are they laughing in the places you hoped they would laugh? Are they confused in the places you thought were obvious? Playtesting's output is design findings — observations about player experience, patterns across testers, moments of delight and moments of failure. Playtesters are not specialists. They are, ideally, people who resemble your intended audience as closely as possible and have not been told anything about the game before they sit down.

The distinction matters because the two practices have different protocols, different recruitment, different data collection, and different outputs. If you hire a QA tester and ask them "is my game fun?", they will earnestly try to help but they will be poorly calibrated for the question, because their job is to find defects and their attention is tuned to the seams where the code breaks. If you hand a playtester a list of crash scenarios to probe, you will get information but you will not get design signal, because the playtester's job is to be a naive player and a testing checklist corrupts that naivety.

You need both. Most small studios conflate them, with predictable consequences: designs ship with obvious play-experience problems that QA did not catch because QA was not looking for them, and builds ship with embarrassing bugs that playtesters did not catch because playtesters were trying to have fun. Keep the practices separate. Run playtests for design. Run QA for correctness. If you must run them with the same people, at least separate the questions: a playtest session on Monday, a QA bug hunt on Tuesday, and never both in a single sitting.

This chapter is about playtesting. Everything that follows assumes the question on the table is "does the design work?" not "does the build ship?"

Types of Playtest

Not all playtests are the same thing. The word is a bucket covering several distinct practices, each suited to a different stage of development and each asking a different question. Know which kind you are running before you run it. Sloppy designers mix them and get confused data.

Alpha-stage playtest (is the core loop fun?). This is the earliest playtest and the most important one. The build is ugly. The art is placeholder. There is one level, or one combat encounter, or one loop of the resource-gathering mechanic. You put it in front of a tester and you watch whether the thing you built has any spark of the experience you want. This is where games get made or abandoned. Supercell kills most of its projects at this phase, and we will look at their practice in the second case study. The question at alpha is brutal and simple: does this have fun in it? If yes, you continue. If no, you either fix it fundamentally or you stop.

Kleenex test (first five minutes). A specific subset of the alpha test, named after the idea that testers are like kleenex — single-use, because once they have played the first five minutes they can never again be a first-time player of your game. The kleenex test evaluates your onboarding. You sit a fresh tester down, hand them the controller, say almost nothing, and watch the first five minutes. Do they know what to do? Do they press the right buttons? Do they understand the world, the goal, the camera? Valve famously uses kleenex tests extensively for this reason: the first five minutes of a game are where most players decide whether to continue, and they are also the hardest minutes to evaluate from inside the design team because you all already know what to do.

Beta-stage playtest (is the content balanced?). Later in development, with most content in place, you shift from "is the core loop fun" to "is the full experience paced correctly?" Does the player's progression feel satisfying? Is the midgame sagging? Is the final boss too hard or too easy? Is the tutorial front-loaded? This is the phase where you play through the whole thing with a tester and collect data on the overall arc. Beta playtests are longer, more expensive (each tester costs hours), and higher-resolution about content balance. They cannot substitute for alpha tests — if your core loop is not fun, all the beta polish in the world will not save the game — but they catch the problems alpha tests cannot see, because alpha tests operate on fragments.

Focus group (guided discussion). A group of testers plays the game (or watches a demo), and then you run a moderated discussion afterward. This is borrowed from market research and is the least useful playtest format for most design questions, but it has its place. Focus groups are good for evaluating themes, marketing positioning, and broad emotional responses. They are bad for evaluating moment-to-moment gameplay, because the group dynamic distorts individual responses — testers follow the most confident voice in the room, or perform agreement to avoid conflict, or rationalize their experience to sound coherent. If you run focus groups, use them to evaluate framing and audience fit, not to measure whether the jump feels good.

UX playtest (think-aloud, eye-tracking). A tightly structured session in which the tester is asked to narrate their internal state while playing ("I'm trying to open the door, I see the key icon, I think I need a key"), sometimes with a camera tracking their eyes on the screen. This comes from the usability-research tradition — Jakob Nielsen, Steve Krug, the web-design world — and it is the most systematic way to diagnose specific interface and onboarding problems. UX playtests are expensive (they require a trained moderator and careful protocol) but they produce the highest-quality data per hour for specific questions about affordance, readability, and cognitive load.

Analytics-based playtest (silent observation at scale). Ship an early build to a private group — 50 people, 500 people, however many you can recruit — instrument it with event logging, and look at the funnel. Where do players drop off? How long does the average session last? Which levels do people replay and which do they abandon? This is playtesting at population scale, and it is what mobile studios run continuously. Analytics alone does not tell you why players dropped off, but it tells you where they dropped off with far more reliability than any in-person session can. Pair analytics with in-person qualitative sessions and you get the "what" and the "why" together.
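As a rough illustration of the funnel idea, here is a minimal Python sketch for offline analysis. It assumes each player's log has already been reduced to the furthest stage that player reached; the stage names and the sample data are invented for the example, not taken from any real build.

```python
from collections import Counter

# Hypothetical ordered stages of an early build; rename to match your game.
STAGES = ["tutorial", "level_1", "level_2", "boss_1"]


def funnel(furthest_stage_by_player, stages=STAGES):
    """Return (stage, players_reached, share_of_all_players) per stage.

    furthest_stage_by_player maps a player id to the last stage that
    player reached. Reaching stage i implies reaching stages 0..i.
    """
    index = {name: i for i, name in enumerate(stages)}
    furthest = Counter(index[s] for s in furthest_stage_by_player.values())
    total = len(furthest_stage_by_player)
    rows = []
    for i, name in enumerate(stages):
        reached = sum(n for idx, n in furthest.items() if idx >= i)
        share = reached / total if total else 0.0
        rows.append((name, reached, share))
    return rows


def biggest_drop(rows):
    """Return the stage after which the most players stopped, or None."""
    worst, worst_loss = None, 0
    for (name, reached, _), (_, next_reached, _) in zip(rows, rows[1:]):
        loss = reached - next_reached
        if loss > worst_loss:
            worst, worst_loss = name, loss
    return worst


if __name__ == "__main__":
    sample = {"p1": "boss_1", "p2": "level_1", "p3": "level_1",
              "p4": "level_2", "p5": "tutorial"}
    rows = funnel(sample)
    for name, reached, share in rows:
        print(f"{name:10s} {reached:3d} {share:6.0%}")
    print("largest drop-off after:", biggest_drop(rows))
```

The output tells you where to point your next in-person session. It says nothing about why players stopped there.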

Each of these is a different tool, and the confused data comes from mixing them: running a focus group and calling it UX, or running a kleenex test and drawing beta-stage balance conclusions. Match the method to the question. If you are asking "is the core loop fun" you need an alpha-stage session with a naive tester, not a beta-stage playthrough with a friend. If you are asking "are the late levels balanced" you need beta testers with hours in the build, not a kleenex tester who quit after the tutorial.

Who to Playtest With

This section covers the single most common playtest failure: playtesting with your friends, your spouse, your coworkers, or anyone else who is emotionally invested in you or in the game.

Your spouse played it. Your roommate played it. Your three designer friends played it. Everyone said it was great. You concluded the game works. You shipped. The reviews came in at 52 on Metacritic and said the tutorial was incomprehensible, the combat felt floaty, and the third level was a slog. You were astonished. Your spouse had said it was great.

Your spouse was being kind. Your designer friends were being collegial. Your roommate was being your roommate. None of them were being players. The feedback you got was filtered through their relationship to you, their understanding of how hard this has been, their desire to be supportive, and their familiarity with game-development context. It was not data. It was love. Love is not a substitute for data, and a game designed on the basis of love-filtered feedback will ship into a world that does not love it.

You need testers who owe you nothing. Specifically: testers who are in your target audience, have never heard of the game before, and will not encounter you socially after the session. The further the social distance between you and the tester, the more honest the signal. This is counterintuitive — you would think that friends, who want you to succeed, would give you the most useful feedback. In practice, they give you the least useful feedback, because the cost of honest criticism in a friendship is high enough that most friends will blunt the criticism, whether they mean to or not. A stranger has no such cost. A stranger will tell you, with offhand candor, that the combat feels bad, and then continue being a stranger.

Where do you find strangers? Several options, in rough order of quality:

Paid playtesting services. PlaytestCloud is the current standard for mobile and PC games. You pay somewhere between $30 and $100 per tester depending on session length, they recruit testers who match your target demographic, they record the session, and you get a video plus a report. For an indie team, this is the fastest way to get high-quality remote data. It is not free (a full round of twenty testers can run $1,000 or more), but it is cheaper than shipping a game that fails. UserTesting.com serves a similar function for software in general; it is less game-specific but covers adjacent use cases.

Your own Discord / forum / subreddit. If you have a community, mine it. Post a call for playtesters. Pre-screen for demographic fit. Offer a small incentive if you can (Steam gift card, free copy of the final game). Run sessions over Discord with screen sharing. This works well once you have any community at all; it works badly if your community is ten people and eight of them are your cousins.

Campus, conventions, local game-dev nights. Physical proximity makes some tests possible that remote tests cannot replicate (seeing facial expressions clearly, watching hand position on the controller). University campuses are rich with in-target testers for many games, and most universities have a game-dev or media-studies club that will happily test for pizza. Local IGDA chapters run playtest nights. Conventions (PAX, GDC-adjacent events) host demo booths where you can get ten-minute kleenex tests from strangers all day.

Cold recruitment through target-audience channels. If your game is a puzzle game, post on puzzle-game forums asking for testers. If it is a roguelike, post on r/roguelikes. If it is a narrative game, reach out through Twitter/Bluesky accounts that cover narrative games. Cold recruitment is hit-or-miss; you will hear from enthusiasts who are poor matches and fail to hear from ideal matches who ignore the post. Budget time for the noise.

What you want, across your pool, is diversity along the axes that matter for your game. If your target audience is 18-to-35-year-old puzzle-game enthusiasts, your pool should span that age range and vary in gender, experience level with puzzle games, and (if possible) cultural background. Three twenty-two-year-old men in computer-science departments are not a representative sample of your audience, even if your audience includes twenty-two-year-old men in computer-science departments.

A rough rule: five testers is the minimum useful sample for a qualitative round. Jakob Nielsen's classic usability-research finding is that five users will surface roughly 85 percent of the usability problems in an interface, with diminishing returns after that. Games are more variable than interfaces (player experience is higher-variance than interface usability), so for gameplay questions you probably want closer to ten testers per round, drawn from across your target demographics. Below five, you are at risk of mistaking an individual quirk for a pattern. Above twenty, you are spending money faster than you are learning new things. Find the zone.

How to Prepare a Playtest

A playtest that is not prepared in advance is an expensive way to collect nothing. You will watch a tester play, something will happen, you will take notes, and afterward you will realize you have no structured way to compare this tester's experience to the next tester's, no clear hypothesis you were testing, and no answer to any question you actually cared about. Preparation is what converts raw testing hours into design data.

Start with the question. Every playtest round should answer a specific question. "Is my game fun?" is not a specific question. "Do players understand that the red door requires the red key?" is a specific question. "Do players reach the first checkpoint within 10 minutes?" is a specific question. "Do players form any attachment to the companion character by the end of Act 1?" is a specific question. Write the question down before you build the testing session. If you have more than two or three questions per round, narrow them, or you will lose resolution on every one.

Build the right prototype for the question. If your question is "is the core loop fun," you do not need final art, final music, or any of the meta-systems. You need a three-minute slice of the core loop, playable, with enough feedback that the loop reads. Playtesting final art in an alpha test is worse than useless — it both costs you money (final art is expensive) and distorts the test (testers will comment on art and their attention will drift from the loop you wanted them to evaluate). If your question is "is the tutorial clear," you need the tutorial sequence and a few minutes of post-tutorial gameplay, not the whole game. Match the build to the question.

Plan the tasks. Decide what you will ask the tester to do. The minimum is "play the game." The more structured option is a task list — "find your way to the first boss," "collect all three items in the first area," "try to defeat the enemy without taking damage." Tasks help focus attention and produce comparable data across testers. Over-tasking them, however, distorts the play experience — if the player is hunting a checklist, they are not playing the game the way a real player would. For most rounds, keep tasks to two or three and otherwise let the tester play freely.

Prepare the environment. If the session is in-person, you need a clean setup: the build installed and tested, the controller charged, a quiet room, a backup controller in case the battery dies, and either a camera aimed at the screen or a screen-recording tool running. If the session is remote, you need the tester's tech to work: have them install and test the build before the session, not during it, or you will lose twenty minutes to troubleshooting. A typical remote setup is Discord with screen sharing, plus OBS on your end recording the call.

Write the script. Yes, a script. A document you read from as the session begins, so that every tester hears the same words, and you do not accidentally prime one tester with a hint you did not give the next. The script includes the welcome, the explanation of think-aloud protocol (if you are using it), the consent/recording permission, the description of the task, and the prepared questions you will ask at the end. Professional UX researchers script every word. You do not have to be that strict, but the more you script, the more your sessions will produce comparable data.

Decide what you will observe and what you will ignore. You cannot watch everything. Pick the channels you care about — hand position, facial expression, verbal response, in-game behavior, timing — and pre-commit to taking notes on those. If you try to catch everything, you will catch nothing. Typical primary channels: what the player is doing on screen; what they say aloud; where they hesitate; where they smile; where they swear.

📋 Protocol Snapshot: A typical hour-long playtest session breaks down as: 5 minutes welcome and consent; 5 minutes context-setting (demographics, gaming experience); 30-40 minutes play; 10-15 minutes debrief questions. If the play session is expected to be shorter, scale proportionally. Never run sessions longer than 90 minutes without a break — tester fatigue corrupts data after an hour.

Running the Session

The session begins when the tester arrives, or joins the call. Your job from that moment until they leave has three parts: make them comfortable, watch what happens, and shut up.

The comfortable part is real. Testers who are nervous, who feel watched, who feel they are being judged — these testers play worse than they normally would, and they self-censor. A warm welcome, an explanation that the game is being tested, not the tester, and a reassurance that there are no wrong answers all make the session more honest. Spend the first five minutes putting them at ease. Offer water. Chat briefly. Remember that most people have never been playtesters before, and the social script is unfamiliar.

Then you explain the protocol. Something like: "I'm going to have you play the game for about thirty minutes. As you play, I'd love if you could narrate what you're thinking — what you're trying to do, what you notice, whether you're confused. Think of it as talking to yourself, except out loud. I'll be mostly quiet. If you get really stuck, I might ask a question, but try to play the way you normally would. At the end, I'll ask you some questions." This is called think-aloud protocol and it is the single most valuable technique in playtesting. Done right, it gives you a running commentary on the tester's internal state, letting you see mismatches between what the game is communicating and what the player is perceiving.

Ask for consent to record. Always. Screen recording, audio recording, and (if in person) camera on their face are all valuable, and all require explicit consent. Have a one-paragraph consent form for your records. Do not skip this; it protects both you and them.

And then — and this is the hardest part of the whole practice — you shut up.

You will want to speak. Every designer wants to speak. You will see the tester walk past the obvious affordance and you will want to say "the yellow one is climbable." You will see them struggle with the controls and you will want to explain the control scheme. You will see them miss the joke you labored over and you will want to point it out. Do not. Every word you speak during a playtest is data destroyed. The moment you explain something, you have told the tester something that real players will never be told, and everything they do afterward is compromised data.

The discipline of silence is the single most important skill to develop, and the one that most designers never master. The best playtest moderators are the ones who can sit for thirty minutes watching their own work be misunderstood, say nothing, take notes, and only speak up when the tester has been stuck for several minutes in a way that will invalidate the rest of the session (for example, they cannot progress at all, and the remaining twenty-five minutes of the session will be wasted if you do not intervene). Even then, intervene with the lightest possible hint: "What were you thinking of trying?" rather than "Try the door." The goal is to unstick them without teaching them.

A useful trick: physically sit behind the tester, out of their peripheral vision. You can see the screen, you can hear them, but they cannot see your face. This removes the feedback loop where they look at you for reassurance and adjust their play based on what your face is doing. In remote sessions, turn your camera off or angle it so they cannot see you. They should feel alone with the game as much as possible.

When the session ends, you switch modes. Now you ask questions. This is where much of the data will come from, and where a common failure mode lurks.

The Two Questions Rule

Here is the trap: when you finish the play session, you will want to ask the tester, "so, did you like it?" You will want to know whether the game is good. You will want the tester to validate you.

Do not ask this. Do not ever ask this. The answer will be "yeah, it was cool!" because that is the polite answer when a designer stares at you waiting for approval. The question tells you nothing. The answer tells you nothing. You have wasted the most valuable part of the session — the debrief — on a social performance.

Instead, ask process questions. Two of them are load-bearing enough to memorize:

"What were you thinking when X happened?" Pick a specific moment. The moment they died to the first boss. The moment they found the secret room. The moment they paused for a long time staring at the map. Ask them to reconstruct their internal state at that specific moment. What were they trying to do? What were they expecting? What were they feeling? Specific moments produce specific answers; general questions produce general answers.

"What did you expect to happen?" This is the diagnostic tool. When a tester does something and the game responds in a way that does not match their expectation, you have found a design problem. Every time you see the tester look surprised, confused, or disappointed, ask what they were expecting. Compare their expectation to what the game actually did. The gap is your finding.

These two questions, asked repeatedly about specific moments, will produce more design data than any amount of "did you like it." They work because they bypass the social-performance layer. When you ask "did you like it," the tester hears "please tell me I am a good designer," and they oblige. When you ask "what were you thinking when you died at the spikes," they drop into recall mode and report their actual experience, because there is no social answer to that question.

Other useful questions: What was the most frustrating moment? What was the most confusing moment? If you were describing this game to a friend, what would you say it was? What did you think this game was going to be, when you first started? Was there anything you wanted to do that the game would not let you do? Each of these is specific enough to produce a specific answer, and open enough not to lead.

Avoid leading questions. "Did you feel like the combat was tight?" is not a question; it is an ask for agreement. "How would you describe the combat?" is a question. Avoid binary questions; they collapse rich experience into yes/no. Avoid asking about preferences that cannot be acted on: "Would you have liked more powerups?" — they will say yes to almost any addition, because addition is cheap in the abstract.

The tester is not a designer. Their feedback is not a design prescription. Their feedback is a report on their experience, and your job is to translate that report into design insight. When a tester says "the boss was too hard," the finding is not "make the boss easier." The finding is "something about the boss is too hard for this player," and now you need to figure out what. Maybe the telegraphing is bad. Maybe the player did not know they could parry. Maybe the checkpoint is too far back. Maybe the boss is fine and this tester is bad. The tester's words are the start of your diagnosis, not the conclusion.

Collecting Data

Every playtest session should produce three kinds of record: observation notes, session recording, and analytics events. Skipping any of the three means you are flying on incomplete information.

Observation notes. You, or a dedicated note-taker, watch the session and write down what happens. Not everything — you can't — but the patterns you have pre-committed to watching for. Structured note templates help: a timestamped log with columns for "what the tester did," "what the tester said," "my interpretation." The interpretation column is critical and must be separated from the observation column. Observation: tester circled the first room three times. Interpretation: tester did not see the door to the north, possibly because the door blends into the wall tiles. The observation is fact; the interpretation is hypothesis. Never collapse them into one line.
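One way to enforce the separation between observation and interpretation is to bake it into the note format. The Python sketch below is a hypothetical template, not a standard: the field names and the CSV layout are assumptions chosen to mirror the columns described above.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class Note:
    timestamp: str       # mm:ss into the session
    tester: str          # anonymized tester label, e.g. "T3"
    did: str             # observation: what the tester did
    said: str            # observation: what the tester said aloud
    interpretation: str  # hypothesis only; can be left empty and filled later


def write_notes(notes, stream):
    """Write notes as CSV, one column per field, with a header row."""
    writer = csv.DictWriter(stream, fieldnames=[f.name for f in fields(Note)])
    writer.writeheader()
    for note in notes:
        writer.writerow(asdict(note))


if __name__ == "__main__":
    notes = [
        Note("04:12", "T3", "circled the first room three times", "",
             "did not see the north door; it may blend into the wall tiles"),
    ]
    buffer = io.StringIO()
    write_notes(notes, buffer)
    print(buffer.getvalue())
```

Because the interpretation lives in its own column, you can hide it when comparing testers and revisit it after you have seen the pattern across the whole round.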

Screen recording. Use OBS or a similar tool to record the tester's screen (and, ideally, a picture-in-picture of their face if they consent). Recording is non-negotiable. Your memory of the session is wrong — you will misremember timings, you will forget which tester said what, you will lose the subtle patterns. Video lets you go back. It also lets you share findings with your team ("watch this twelve-second clip of tester three at the boss door"), which is a hundred times more persuasive than any summary document.

Analytics events. Instrument the build to log events as the tester plays. Death locations. Level completion times. Menu opens. Item pickups. Any action the game can observe. After the session, the logs give you a precise timeline of what the player did, timestamped, across every session. This catches patterns you would not spot by eye — for example, "every tester died twice at the same jump in the second level," which an analytics log will show immediately but which you might miss while watching three different sessions in isolation.

Mixing the three is where insight emerges. The analytics log says "tester four died six times at the jump in level two." The video shows you why she died (she kept trying to double-jump at the wrong spot because the platform visually suggests a two-step rhythm). The observation notes capture her verbal reaction ("I don't get what this wants me to do"). Put all three together and you have a finding: the level-two jump's visual rhythm misreads to players as a double-jump, causing a repeated death pattern. The finding is specific, sourced, and actionable. A pure analytics finding ("deaths at this jump") is not actionable without the why. A pure observation ("the jump was hard") is not strong without the count. You need all three.

Below is a minimal analytics script you can drop into a Godot 4 project to start capturing events today (it uses the FileAccess API, so it will not run unmodified on Godot 3). Register it as an autoload named Telemetry. It writes JSON lines to a local file, which you can collect after a session and analyze. For small playtests, this is more than sufficient; for large-scale testing, you would extend it to POST to a server.

# Telemetry.gd
# Autoload singleton. Appends one JSON-lines event per call to a local file.
# Enough for indie playtests. For scale, upgrade to HTTP POST.

extends Node

const LOG_PATH := "user://telemetry.jsonl"
var _session_id: String
var _file: FileAccess
var _start_time: float

func _ready() -> void:
    _session_id = "%d_%s" % [int(Time.get_unix_time_from_system()),
                             str(randi()).pad_zeros(6)]
    _start_time = Time.get_ticks_msec() / 1000.0
    # Append to an existing log so earlier sessions on this machine survive.
    # Opening with FileAccess.WRITE alone truncates the file on every launch.
    if FileAccess.file_exists(LOG_PATH):
        _file = FileAccess.open(LOG_PATH, FileAccess.READ_WRITE)
        if _file != null:
            _file.seek_end()
    else:
        _file = FileAccess.open(LOG_PATH, FileAccess.WRITE)
    if _file == null:
        push_warning("Telemetry disabled: could not open %s (error %d)"
                % [LOG_PATH, FileAccess.get_open_error()])
        return
    log_event("session_start", {"version": ProjectSettings.get_setting("application/config/version", "dev")})

func log_event(event_name: String, data: Dictionary = {}) -> void:
    if _file == null:
        return  # logging is best-effort; never crash the game over telemetry
    var payload := {
        "t": (Time.get_ticks_msec() / 1000.0) - _start_time,  # seconds since session start
        "session": _session_id,
        "event": event_name,
        "data": data
    }
    _file.store_line(JSON.stringify(payload))
    _file.flush()  # flush so we don't lose events on crash

func log_death(location: Vector2, cause: String) -> void:
    log_event("death", {"x": location.x, "y": location.y, "cause": cause})

func log_level_complete(level_id: String, duration_s: float) -> void:
    log_event("level_complete", {"level": level_id, "duration": duration_s})

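Call Telemetry.log_death(global_position, "spike_trap") from your enemy or hazard code (global_position rather than position, so that deaths logged from different nodes share one coordinate space); call Telemetry.log_level_complete("level_01", time_taken) from your level manager. After the session, ask the tester to send you the log file. The user:// prefix resolves to the game's per-user data folder, and OS.get_user_data_dir() returns its absolute path. The file holds one JSON object per line and is easy to parse. Aggregate across sessions. Plot death positions on a heatmap of your level geometry. Patterns will em 	erge.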
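As a rough sketch of that aggregation step, the Python below reads JSON lines gathered from one or more log files and bins deaths into grid cells, counting both total deaths and how many distinct sessions died in each cell. The 64-pixel cell size, the function names, and the sample session IDs are assumptions for illustration; only the event fields ("event", "session", "data", "x", "y") come from the logger above.

```python
import json
from collections import Counter, defaultdict


def load_events(lines):
    """Parse JSON-lines text into event dicts, skipping blank or broken lines."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # e.g. a half-written final line after a crash
    return events


def death_hotspots(events, cell_size=64):
    """Return (cell, deaths, sessions) tuples, widest-spread cells first.

    cell is (column, row) on a grid of cell_size world units (assumed size).
    """
    deaths = Counter()
    sessions = defaultdict(set)
    for event in events:
        if event.get("event") != "death":
            continue
        data = event.get("data", {})
        cell = (int(data["x"] // cell_size), int(data["y"] // cell_size))
        deaths[cell] += 1
        sessions[cell].add(event["session"])
    rows = [(cell, deaths[cell], len(sessions[cell])) for cell in deaths]
    # Sort by distinct sessions, then raw death count, then cell for stability.
    rows.sort(key=lambda row: (-row[2], -row[1], row[0]))
    return rows


# Hypothetical sample: two sessions ("a" and "b") concatenated together.
sample = """
{"t": 41.2, "session": "a", "event": "death", "data": {"x": 530.0, "y": 200.0, "cause": "pit"}}
{"t": 55.0, "session": "a", "event": "death", "data": {"x": 540.0, "y": 210.0, "cause": "pit"}}
{"t": 38.9, "session": "b", "event": "death", "data": {"x": 520.0, "y": 195.0, "cause": "pit"}}
{"t": 90.1, "session": "b", "event": "death", "data": {"x": 1300.0, "y": 64.0, "cause": "spike_trap"}}
{"t": 95.0, "session": "b", "event": "level_complete", "data": {"level": "level_01", "duration": 95.0}}
""".splitlines()

for cell, count, spread in death_hotspots(load_events(sample)):
    print(f"cell {cell}: {count} deaths across {spread} session(s)")
```

Ranking by distinct sessions before raw deaths is a deliberate choice here: one tester dying ten times at a spot is a weaker signal than five testers each dying there once.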

A note on privacy. Even for small indie games, you should disclose what you are logging and why. A simple in-game screen at session start ("This playtest build logs your gameplay events — level timing, death locations, menu actions — to help us improve the game. No personal information is collected.") is sufficient for most indie contexts. If you plan to retain or share the data, get explicit consent. For post-launch analytics shipped to all players, GDPR and similar regulations apply and you need a proper opt-in flow. The principle: never log more than you will actually analyze, and never log anything the player would be uncomfortable with you knowing.

Interpreting Results

You have run eight sessions. You have notes, video, and analytics. Now what?

This is where most indie teams make the next big mistake: they treat individual complaints as findings. Tester three said the jump felt floaty. Tester five said the music was too loud. Tester seven said the boss was unfair. The designer, trying to be responsive, writes down every complaint as a task. Two weeks later, they have addressed twenty complaints and the game is not better — because several of the complaints were idiosyncratic, several contradicted each other, and several pointed at symptoms of deeper problems that the tester could not articulate.

The signal in a playtest round is not in any one tester's commentary. It is in patterns across testers. What did four of the eight testers do? What did every single tester miss? Where did five testers hesitate, even if they did not articulate why? Where did seven testers laugh, and what was happening on screen at that moment? The patterns are the findings; the individual anecdotes are the evidence supporting or refuting them.

A working protocol for analysis:

Collect all observation notes and video into a shared document, per-tester.
Tag every observation with a category: onboarding, combat, navigation, difficulty, narrative, UI, audio, meta-system. Use as many categories as your game has major surfaces.
Within each category, group observations that describe the same phenomenon. "Tester 2 missed the map icon," "Tester 5 asked where the map was," "Tester 7 opened the inventory looking for the map" — these are all the same finding: the map affordance is not reading.
Count the occurrences. Findings that show up in 4+ of 8 testers are strong signal. Findings in 2-3 are worth investigating. Findings in 1 tester are worth noting but should not drive major work unless the finding is about a critical failure (crash, complete blocker).
Separate what a tester said from what the design problem is. "Tester said combat felt floaty" is not a finding; it is a report. The finding is "something about the combat's weight, feedback, or input response is failing for this demographic, and we need to investigate which." Maybe the finding, after investigation, is that enemy hit-stop is too short. Maybe it is that screen shake is too weak. Maybe it is that the audio does not sell impact. The tester's word "floaty" is the starting point for the design question, not the answer.
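The counting step of this protocol can be sketched in a few lines of Python. Everything named here is hypothetical: the tuple layout, the tester labels, and the `rank_findings` function. The thresholds follow the 4-of-8 and 2-to-3 guidance above, and generalizing "4 of 8" to "at least half of the testers" is an assumption.

```python
from collections import defaultdict

# Each tagged observation: (tester, category, grouped finding label).
observations = [
    ("T2", "navigation", "map affordance not reading"),
    ("T5", "navigation", "map affordance not reading"),
    ("T7", "navigation", "map affordance not reading"),
    ("T7", "navigation", "map affordance not reading"),  # same tester, counted once
    ("T1", "combat", "combat lacks weight"),
    ("T3", "combat", "combat lacks weight"),
    ("T4", "combat", "combat lacks weight"),
    ("T6", "combat", "combat lacks weight"),
    ("T5", "audio", "music too loud"),
    ("T8", "difficulty", "crash on boss door"),
]

ORDER = {"critical": 0, "strong": 1, "investigate": 2, "note": 3}


def rank_findings(observations, total_testers, critical=frozenset()):
    """Count distinct testers per finding and label the signal strength."""
    testers = defaultdict(set)
    for tester, category, finding in observations:
        testers[(category, finding)].add(tester)

    ranked = []
    for (category, finding), who in testers.items():
        count = len(who)
        if finding in critical:
            strength = "critical"     # crash or complete blocker: one tester is enough
        elif count * 2 >= total_testers:
            strength = "strong"       # at least half of the testers (4+ of 8)
        elif count >= 2:
            strength = "investigate"  # 2-3 of 8
        else:
            strength = "note"         # a single tester
        ranked.append((finding, category, count, strength))

    ranked.sort(key=lambda row: (ORDER[row[3]], -row[2], row[0]))
    return ranked


for row in rank_findings(observations, 8, critical={"crash on boss door"}):
    print(row)
```

Counting distinct testers rather than raw observations matters: a single talkative tester who mentions the map four times is still one data point.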

Once you have a ranked list of findings, you are in the position to act. Before you do, one more filter: do you agree with the finding? A tester is reporting their experience; you are the designer, and your game has a vision. If four of eight testers want the combat to be easier, and your vision is a Souls-like that rewards mastery, you do not comply with the findings — you investigate whether your game is recruiting the right audience, or whether the tutorial is not signaling the game's difficulty contract clearly enough. The tester is not always right. The tester's experience is always real.

Common Playtest Traps

A handful of classic failure modes haunt playtesting. Most of them have been named by designers who made them first. Recognize the patterns.

Confirmation bias. You already believe your game is great (or terrible). You watch the session through that lens. You overweight the moments that confirm your belief and discount the moments that disconfirm it. The antidote is numerical: commit to counting findings rather than feeling them. If three of eight testers hated the second level, that is a number, not a vibe. Act on numbers.

Friends-and-family syndrome. Discussed above but worth naming again. Feedback from people who know and care about you is systematically biased toward kindness. Do not use it as your primary data. Use it as sanity-check data after strangers have told you the truth.

The one-angry-tester trap. One tester will, statistically, hate your game. They will say so, loudly and at length. The temptation is to overweight this feedback because it is vivid, specific, and emotionally stinging. Resist. One tester is one data point. If seven others liked the thing, the angry one is noise; if seven others were lukewarm, the angry one is confirming a trend. The anger itself is not evidence of anything.

The demographic mismatch. You tested with college students. You are shipping to middle-aged strategy-game enthusiasts. Nothing your testers said applies to your audience in any reliable way, and you have wasted the round. Pre-screen for demographics. Every time.

The silent-but-confused tester. Some testers, especially in think-aloud protocols, simply will not narrate their confusion. They will go quiet. They will try things without talking about them. You will later review the recording and realize they spent ten minutes stuck and never said a word. The fix: during the session, when you see them go quiet for an unusually long time, prompt: "What are you thinking about right now?" Light prompts re-engage narration without contaminating the data.

The expert blindspot on the designer's side. Related to confirmation bias but specific to playtest interpretation. You look at the tester missing the affordance and think "they're being obtuse, anyone would see that." The correct response is "they are not obtuse, my affordance is failing, and I can only see the failure because a non-designer brain tried to parse it." The mental move is to default to "the design is wrong" rather than "the tester is wrong." Make that move every time a tester's behavior baffles you. You will be right far more often than if you blame the tester.

Priming. You told the tester, before they played, that the game is inspired by Celeste. Now every time they struggle with a jump, they frame their frustration through their memory of Celeste. You have contaminated their experience with a reference. Pre-play briefings should be minimal: "This is an action-adventure game. Try to play it the way you would play any game you just downloaded. I'll be quiet while you play." No more.

Over-tasking. You gave the tester a checklist of twelve things to try. They are now executing a checklist, not playing. Their experience is distorted. Scale tasks down. Two or three at most, with time to play freely between.

Acting on Results

Findings accumulate. At some point you have to decide what to do about them. This is a triage problem more than a design problem, and the triage discipline separates teams that finish games from teams that do not.

Classify every finding into four tiers:

Critical. The finding describes a blocker. Players cannot get past this point. Players crash here. Players uniformly quit here. Critical findings override every other priority. Fix them before the next playtest round. If you cannot fix them, you do not run another playtest, because the critical blocker will corrupt every session that follows it.

Important. The finding describes a significant experiential failure that does not block progression but hurts the game's appeal for most players. The tutorial is confusing but completable. The first boss is too easy and drains tension. The inventory UI is slow to navigate. These are ship-blockers in the sense that shipping with them lowers review scores and player retention, but they do not prevent players from seeing the game. Fix them before release.

Nice-to-have. The finding describes a polish item. One tester noted that the footstep audio loops slightly. Two testers wished the camera zoomed a little further out in open areas. These improve the game if fixed but do not significantly harm it if not. Fix them after the important items, and only if time allows.

Ignore. The finding describes something that contradicts your design intent, affects a single tester, or would require changes disproportionate to the improvement. Not every finding is a problem. Document the decision to ignore, so future-you remembers that this question was asked and answered.
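If you track findings in a script or spreadsheet export, the four tiers map naturally onto a small record type. The sketch below is one hypothetical shape for it, not a required tool: the `Tier` and `Finding` names are assumptions. It refuses to accept an ignored finding without a written reason, which is one way to guarantee the decision gets documented.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    CRITICAL = 0
    IMPORTANT = 1
    NICE_TO_HAVE = 2
    IGNORE = 3


@dataclass
class Finding:
    summary: str
    testers_affected: int
    tier: Tier
    rationale: str = ""  # required when tier is IGNORE

    def __post_init__(self):
        if self.tier == Tier.IGNORE and not self.rationale.strip():
            raise ValueError("an ignored finding needs a written rationale")


def work_queue(findings):
    """Findings to act on, most urgent first; ignored findings are left out."""
    active = [f for f in findings if f.tier != Tier.IGNORE]
    return sorted(active, key=lambda f: (f.tier, -f.testers_affected))


findings = [
    Finding("footstep audio loops audibly", 1, Tier.NICE_TO_HAVE),
    Finding("tutorial is confusing but completable", 5, Tier.IMPORTANT),
    Finding("crash at the boss door", 2, Tier.CRITICAL),
    Finding("wants a jump button", 1, Tier.IGNORE,
            rationale="contradicts the point-and-click puzzle design"),
    Finding("inventory UI is slow to navigate", 3, Tier.IMPORTANT),
]

for finding in work_queue(findings):
    print(finding.tier.name, "-", finding.summary)
```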

The rhythm of acting on findings is: one cluster of findings → fix pass → retest. You do not accumulate fifty findings and then spend three months addressing them all before testing again. You prioritize five or six, you fix them, you run another round, you see whether the fixes worked and what new problems the fixes exposed. This iterative cadence is how Valve works, how Supercell works, how any serious studio works. The teams that batch findings and fix them all in one heroic pass always, always ship games with the same problems players originally reported, because the "fixes" were untested hypotheses. A fix that has not been retested is not a fix; it is a hope.

A harder problem: when to not act on feedback. If the data says "players want a jump button" and your game is a point-and-click adventure where jumping would wreck the puzzle design, you do not add a jump button. The data is real; the design decision is yours. Your job is not to execute playtest findings; your job is to make a game, of which playtest findings are one input. Hold the vision when the data attacks your vision's mechanism. Give up the vision when the data attacks your vision's core promise to the player. Distinguishing these is the subtlest part of the practice, and it gets easier with experience.

Remote vs. In-Person

Some notes on the logistics of running sessions.

In-person sessions give you access to facial expressions, hand position, body posture, and the micro-tells that remote sessions flatten. You can see exactly when the tester's thumb hovers over the jump button without pressing it — a signal of hesitation that tells you a great deal about how they are reading the situation. In-person sessions are also more socially pleasant for the tester, especially first-time ones; they respond to human presence better than to an empty call.

The costs: in-person is geographically constrained, logistically complex, and harder to scale. You can realistically run three in-person sessions in a day. You can run three remote sessions in a morning.

Remote sessions scale better and access any target audience that can get online. Discord with screen sharing is the minimum viable setup; Zoom works too. For higher-fidelity remote testing, Lookback and UserTesting offer built-in screen+face capture, observer dashboards, and session libraries. PlaytestCloud specializes in game-specific remote testing and handles recruitment for you.

The costs of remote: you lose facial and body cues unless you explicitly record the tester's webcam. You are subject to technical failures — bad internet, driver issues, audio problems — that eat session time. You have less social presence and must work harder to establish trust in the first few minutes.

For most indie teams, a hybrid approach works: run in-person sessions with local recruits for deep-dive kleenex tests and UX studies, and run remote sessions through PlaytestCloud or Discord for broader demographic coverage. The in-person sessions teach you what to look for; the remote sessions quantify how widely those patterns apply.

Analytics and Telemetry as Ongoing Playtesting

Playtesting does not end when you ship. In an always-connected world, every player is a potential data source, and a well-instrumented game runs continuous, population-scale playtests after release. This is how Candy Crush was built and how it continues to operate — King ships new levels, watches the drop-off curves, and adjusts within days.

At industrial scale, the tools are custom — King, Supercell, and Riot all built internal analytics pipelines over years. At indie scale, you have options:

GameAnalytics — free for small teams, straightforward event logging, prebuilt funnels and retention curves.
Unity Analytics — built into Unity, trivial to enable, reasonable coverage.
Firebase Analytics — Google's offering, strong on mobile, free at small scale.
Custom — roll your own. The Telemetry.gd snippet earlier is a starting point; extend it to POST events to a simple server, ingest into a database, and plot in Grafana or a spreadsheet. For indie games, a custom stack can be cheaper and more flexible than the prebuilt tools, at the cost of engineering time up front.
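To give a sense of how small the server side of a custom stack can start, here is a standard-library Python sketch of an endpoint that accepts one POSTed event and appends it to a combined log. Treat it as a local experiment under stated assumptions, not a deployable service: it has no authentication, TLS, or rate limiting, and the output filename, port, and the `validate_event` and `serve` names are all hypothetical. The required keys mirror the payload written by Telemetry.gd.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Keys every event from the logger is expected to carry.
REQUIRED_KEYS = {"t", "session", "event", "data"}


def validate_event(raw):
    """Return the parsed event dict, or None if the body is not a usable event."""
    try:
        event = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    if not isinstance(event, dict) or not REQUIRED_KEYS.issubset(event):
        return None
    return event


class TelemetryHandler(BaseHTTPRequestHandler):
    log_path = "telemetry_all.jsonl"  # hypothetical combined log file

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = validate_event(self.rfile.read(length))
        if event is None:
            self.send_response(400)  # reject malformed bodies
        else:
            with open(self.log_path, "a", encoding="utf-8") as log:
                log.write(json.dumps(event) + "\n")
            self.send_response(204)  # accepted, nothing to return
        self.end_headers()


def serve(port=8000):
    """Run the collector on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), TelemetryHandler).serve_forever()
```

Calling `serve()` starts the collector. Because the output is the same JSON-lines format as the local file, any analysis written for playtest logs keeps working on the combined log.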

The metrics that matter for most games:

D1, D7, D30 retention. What fraction of day-zero players return the next day, a week later, a month later. The shape of this curve is the health of your game.
Session length. Mean and distribution. A long tail of five-hour sessions is a different signal than a fat lump at fifteen minutes.
Level completion rates. What fraction of players who attempt a level beat it? The drop across levels reveals the difficulty curve's actual shape, as distinct from the shape you intended.
Death locations. Where are players dying, and how often? Heatmaps reveal problem hotspots faster than any observation.
Funnel steps. What fraction of players complete each step of the onboarding? Of the tutorial? Of the first hour? The drop-off points are your leaks.
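Two of these metrics are simple enough to compute by hand from raw events, which is a useful check on whatever dashboard you adopt. The Python sketch below assumes data the Telemetry.gd script does not log on its own: a stable player ID with a calendar date per session, and a hypothetical "level_start" event alongside "level_complete". It also assumes the strict definition of retention, meaning the player came back exactly N days after their first session.

```python
from datetime import date, timedelta


def retention(sessions, day, as_of):
    """Share of players who played exactly `day` days after their first session.

    sessions: iterable of (player_id, date). Players whose first session is too
    recent to have reached `day` by `as_of` are excluded from the cohort.
    Returns None when no player is old enough to measure.
    """
    first_seen, days_played = {}, {}
    for player, played_on in sessions:
        days_played.setdefault(player, set()).add(played_on)
        first_seen[player] = min(played_on, first_seen.get(player, played_on))
    offset = timedelta(days=day)
    cohort = [p for p, start in first_seen.items() if start + offset <= as_of]
    if not cohort:
        return None
    returned = sum(1 for p in cohort if first_seen[p] + offset in days_played[p])
    return returned / len(cohort)


def completion_rates(events):
    """Per level: sessions that completed it divided by sessions that started it."""
    started, completed = {}, {}
    for event in events:
        level = event.get("data", {}).get("level")
        if event.get("event") == "level_start":
            started.setdefault(level, set()).add(event["session"])
        elif event.get("event") == "level_complete":
            completed.setdefault(level, set()).add(event["session"])
    return {
        level: len(completed.get(level, set()) & who) / len(who)
        for level, who in started.items()
    }


# Hypothetical sample data.
sessions = [
    ("p1", date(2024, 5, 1)), ("p1", date(2024, 5, 2)), ("p1", date(2024, 5, 8)),
    ("p2", date(2024, 5, 1)),
    ("p3", date(2024, 5, 1)), ("p3", date(2024, 5, 2)),
    ("p4", date(2024, 5, 6)), ("p4", date(2024, 5, 7)),
]
today = date(2024, 5, 10)
print("D1:", retention(sessions, 1, today))
print("D7:", retention(sessions, 7, today))

events = (
    [{"session": s, "event": "level_start", "data": {"level": "level_01"}} for s in "abcd"]
    + [{"session": s, "event": "level_complete", "data": {"level": "level_01"}} for s in "abc"]
    + [{"session": s, "event": "level_start", "data": {"level": "level_02"}} for s in "abc"]
    + [{"session": "a", "event": "level_complete", "data": {"level": "level_02"}}]
)
print(completion_rates(events))
```

Excluding players who have not yet reached day N keeps a recent marketing spike from dragging D7 and D30 down artificially.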

Analytics is not a substitute for in-person playtesting. It tells you where and how often, not why. But paired with in-person sessions — where the in-person session explains the "why" for the patterns the analytics revealed — it is the highest-leverage tool indie teams have ever had access to. For free. Use it.

💡 Intuition: Analytics is a telescope; in-person playtesting is a microscope. The telescope shows you the galaxy — where the problems are, how big, how common. The microscope shows you the cell — what is actually happening, what the player was thinking, why the pattern exists. Using one without the other leaves you half-blind.

Progressive Project Update — Ch 31

This chapter's project task: Run three formal playtests of your game, document findings, triage.

Recruit. Find three testers who are not family, close friends, or fellow developers. Draft a recruiting message that you can post on Discord, a subreddit in your game's genre, or a campus bulletin board. Include: genre, approximate session length (30-45 minutes), what you are asking them to do (play the current build, narrate their thinking, answer a few questions), any incentive you can offer (digital key, gift card, profound gratitude). Pre-screen respondents: do they play games in your genre? Are they in your target demographic? Have they heard of the game before?

Prepare. Write a one-page playtest protocol document. Include: the question you are answering in this round (e.g., "Can a new player complete the tutorial and reach the first boss?"), the task given to testers, the debrief questions, the observation channels. Stand up your build with telemetry enabled (use the Telemetry.gd script from this chapter). Check the build runs cleanly on a test machine that is not your dev machine.

Run the sessions. Three sessions, 30-45 minutes each. Collect consent for recording before each session starts. Record screen (OBS), audio, and telemetry. Use think-aloud protocol. Shut up during play. Ask the two-question-rule questions at the debrief.

Analyze. Produce a findings document with four sections that match the triage tiers: Critical (blockers), Important (experiential failures in 2+ testers), Nice-to-have (polish items), and Ignore (findings you have decided not to act on, with the reason). For each finding, include: number of testers affected, the observed behavior, the designer's interpretation, and a proposed response. Leave no tester out; document every decision, including the decisions not to act. Budget this analysis at roughly equal time to the sessions themselves: three hours of testing means at least three hours of analysis, often more.

Plan the next round. Identify the top three fixes you will make before the next playtest. When those are implemented, run a smaller retest round (two testers) focused specifically on whether the fixes addressed the findings. This is the loop you will run for the rest of the project: test, fix, retest, ship.

This chapter's deliverable is the findings document plus video clips of the most important moments. Show them to a collaborator. Defend them. Revise. This is how the game gets better.

Common Pitfalls

Playtesting too late. The most destructive pitfall of them all. Teams delay playtesting until the game is "ready" — art finished, features complete, tutorial polished — and by then the findings are devastating because they reveal structural problems that would have been cheap to fix six months ago and are now prohibitive. Playtest in prototype. Playtest in alpha. Playtest with placeholder art. Every month of delay multiplies the cost of fixing what the tests would have surfaced.

Playtesting with the wrong audience. Your testers were computer-science undergrads; your game is for mobile puzzle players. Nothing your sessions revealed maps to your actual market. Pre-screen religiously.

Not recording. You ran the session, you remember it went well, you did not record because it felt intrusive. Two weeks later you cannot remember what happened at the boss. Record everything. Always.

Acting on every comment. The tester said the sword should be blue. You made it blue. The next tester said the sword should be red. You made it red. You are not designing; you are being remote-controlled by noise. Filter through patterns. Ignore singletons unless they describe a critical blocker.

Dismissing all feedback as "wrong audience." The inverse trap. Every finding that challenges your vision is rationalized away as a demographic mismatch. This is how designers ship games that fail and blame the market. If the same finding appears across three different demographic pools, it is not a demographic problem; it is a design problem. Own it.

Confusing QA and playtesting. You hired a QA service to evaluate whether the game is fun. They filed 200 bug reports and zero design findings, because QA's job is bugs. Now you think the game works because "no major bugs found." Run separate practices for the two questions.

Summary

Playtesting is the practice of confronting the gap between the game you think you built and the game that actually exists in players' hands. Your internal model of your own design is systematically wrong — expert blindness, unrepresentative intuition, and accumulated exposure have all corrupted your ability to see the game clearly. The only remedy is to put the game in front of strangers, watch them play, and listen to what happens.

The discipline has structure. Distinguish playtesting from QA. Choose the right type of session for the question you are asking. Recruit testers who owe you nothing. Prepare the session with a clear question and a minimal build. Shut up during play. Ask process questions, not preference questions, at the debrief. Record everything. Look for patterns across testers, not complaints from single testers. Triage findings into critical, important, nice-to-have, and ignore. Fix a small cluster, retest, repeat.

Scale with analytics. Instrument the build with event logging, collect death locations and funnel steps and session lengths, and use the data to see patterns that individual sessions cannot reveal. Pair analytics with in-person sessions so that the where from the telescope meets the why from the microscope.

And above all, do it early and often. The cost of a playtest round in pre-alpha is a few hours and a few hundred dollars. The cost of learning the same finding through post-launch reviews is your game. Designers who internalize this rhythm ship better games faster, and sleep better, and keep their studios open. Designers who do not — the ones who insist that their taste is sufficient, that playtesting is for insecure teams, that they will get around to it later — those designers ship one game, read the reviews, and start asking whether maybe they should have tested.

The recurring theme of this book is "playtest or die." Chapter 32, next, is about what you do with the data the tests produced: turning findings into balance decisions, rebalancing the economy, retuning the difficulty curve. But the rebalance is only as good as the test that informed it. Get the test right first. Everything else follows from what real humans did when they picked up your game, alone, and tried to play it.