Appendix E: Probability and Math for Designers

Introduction: You Don't Have to Love Math. You Have to Use It.

Most designers got into games because of a feeling — the first time they cleared a level in Celeste, the first time they survived a boss in Dark Souls, the first time a Breath of the Wild chemistry interaction made them laugh out loud. Nobody I know got into games because they loved linear algebra.

Good news: you do not need to love math. You do not need a statistics degree. You do not need to be able to derive the binomial distribution from first principles. You need to be able to use a small toolkit of math concepts well enough that you do not ship a broken economy, a broken drop rate, or a boss whose damage feels "random" in a way the players will hate you for.

This appendix is the toolbox. Everything here is the math I actually reach for when I am balancing a game — the math I use on real projects, at the level of precision I use it. It does not replace Chapter 10 (randomness and probability) or Chapter 24 (economy design). It is the reference you flip to when you are mid-sprint and cannot remember the formula for expected value.

Open it, copy the formula, check the worked example, move on. Math is a tool. Use it.


Basic Probability

A probability is a number between 0 and 1 (or between 0% and 100%; same thing) that answers the question: how often does this happen across many attempts?

  • p = 0.0 → never happens
  • p = 0.5 → happens half the time
  • p = 1.0 → always happens
  • p = 0.1 → happens 1 in 10 tries, on average, across many tries

That "on average, across many tries" is the part designers lose sight of. A 10% drop rate does not mean the tenth enemy drops the item. It means that across ten thousand enemies, roughly a thousand will drop the item. Across ten enemies, you might get zero, you might get three. That's the whole reason Chapter 10 exists.

Independent vs. Dependent Events

Two events are independent if the outcome of one does not affect the outcome of the other. Rolling a die twice. Flipping a coin twice. Killing two different enemies, each with its own drop roll.

Two events are dependent if the outcome of one does affect the other. Drawing two cards from a deck without putting the first one back: the first draw changes what is available for the second. Pulling a ball out of an urn without replacement. Hearthstone card draws within a turn.

This distinction matters because the math is different. For independent events, you multiply raw probabilities. For dependent events, the probability changes after each event.

The Multiplication Rule: "Both Happen"

For independent events A and B, the probability that both happen is:

P(A and B) = P(A) × P(B)

Worked example: flipping three heads in a row on a fair coin.

P(H and H and H) = 0.5 × 0.5 × 0.5 = 0.125 = 12.5%

Worked example: rolling a 6 on a d6 and getting a 10% drop from the kill that follows.

P(6 and drop) = (1/6) × 0.10 = 0.0167 = 1.67%

The Addition Rule: "Either Happens"

For mutually exclusive events (they cannot both happen), the probability that either happens is:

P(A or B) = P(A) + P(B)

Rolling a 5 or a 6 on a d6: 1/6 + 1/6 = 2/6 = 33.3%.

For events that could both happen, you have to subtract the overlap so you do not count it twice:

P(A or B) = P(A) + P(B) − P(A and B)

This is a source of designer bugs. "Crits on 10% of hits, and my weapon procs on 15% of hits. What's the chance the hit is interesting?" It is not 25%. Assuming the crit and the proc are rolled independently, it is 10% + 15% − (10% × 15%) = 23.5%. Small difference at low probabilities, big difference as the numbers rise.
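
The two rules above fit in a few lines of code. This is a minimal Python sketch, not engine code; the function names p_both and p_either are made up for illustration, and both functions assume the events are independent.

```python
def p_both(p_a: float, p_b: float) -> float:
    """P(A and B), assuming A and B are independent."""
    return p_a * p_b


def p_either(p_a: float, p_b: float) -> float:
    """P(A or B) for independent events that can both happen.

    Subtracts the overlap so it is not counted twice.
    """
    return p_a + p_b - p_both(p_a, p_b)


# The crit/proc question from the text: 10% crit, 15% proc.
crit, proc = 0.10, 0.15
print(f"hit is interesting: {p_either(crit, proc):.1%}")
```

If the two events are mutually exclusive, skip p_either and just add the probabilities.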

The "At Least Once" Formula

This is the single most useful probability formula in game design. If an event has probability p per attempt, the probability of it happening at least once across n attempts is:

P(at least one) = 1 − (1 − p)^n

Worked example: a 10% drop rate. Chance of getting at least one drop across 20 kills?

1 − 0.9^20 = 1 − 0.1216 = 0.8784 = 87.8%

About one in eight players still gets nothing after twenty kills. If your design says "the player will have this item after killing twenty of the enemy," your design is wrong, and you will get forum posts about it. Use this formula before you ship.
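
Here is the formula as a small Python sketch, plus its inverse: how many attempts it takes before a target share of players has the item. The function names are hypothetical, and the inverse assumes 0 < p < 1 and 0 < target < 1.

```python
import math


def at_least_once(p: float, n: int) -> float:
    """Chance of one or more successes in n independent attempts."""
    return 1.0 - (1.0 - p) ** n


def attempts_needed(p: float, target: float) -> int:
    """Smallest n for which at_least_once(p, n) reaches the target share."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


print(f"10% drop, 20 kills: {at_least_once(0.10, 20):.1%}")
print(f"kills until half of players have it: {attempts_needed(0.10, 0.50)}")
print(f"kills until 95% of players have it:  {attempts_needed(0.10, 0.95)}")
```

The second function is the one to run against a design doc: if the doc says "after twenty kills," check what share of players that promise actually covers.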


Expected Value

E(X) = Σ (probability × value)

Expected value is the average outcome you would see if you could replay the event infinitely. Sum up each possible outcome multiplied by its probability.

Worked example: an enemy drops 100 gold with probability 0.1, and drops nothing (0 gold) the rest of the time.

E(gold per kill) = (0.1 × 100) + (0.9 × 0) = 10 gold

On average, every kill is worth 10 gold. Not "you get 10 gold per kill." You get 100 gold one-tenth of the time and zero the rest. But across a hundred kills, you will have roughly 1,000 gold — and that is what the economy must be balanced around.

Worked example: a loot chest.

Outcome           | Probability | Value (gold)
Nothing           | 0.40        | 0
Small coin pouch  | 0.30        | 25
Medium coin pouch | 0.20        | 75
Big coin pouch    | 0.08        | 200
Jackpot           | 0.02        | 1,000

E(chest) = (0.4)(0) + (0.3)(25) + (0.2)(75) + (0.08)(200) + (0.02)(1,000) = 0 + 7.5 + 15 + 16 + 20 = 58.5 gold per chest.

If the player opens 50 chests in a playthrough, budget for roughly 2,925 gold of chest income. Price your shop items accordingly.

Worked example: critical hits.

Base damage 20, crit chance 15%, crit multiplier 2x.

E(damage per hit) = (0.85 × 20) + (0.15 × 40) = 17 + 6 = 23 damage per hit.

A 15% crit chance at 2x is equivalent to a flat 15% damage increase. Players feel that 15% as drama (big numbers sometimes!), but your DPS spreadsheet treats it as a scalar. Balance the DPS first, then tune the feel of the variance second.
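
All three worked examples above are the same sum over (probability, value) pairs. A minimal Python sketch follows; the expected_value helper and its sanity check are my own illustration, not part of any engine API.

```python
def expected_value(outcomes):
    """outcomes: list of (probability, value) pairs covering every case."""
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        # A table that does not sum to 1 is the most common spreadsheet bug.
        raise ValueError(f"probabilities sum to {total_p}, expected 1.0")
    return sum(p * v for p, v in outcomes)


gold_drop = [(0.10, 100), (0.90, 0)]
chest = [(0.40, 0), (0.30, 25), (0.20, 75), (0.08, 200), (0.02, 1000)]
crit_hit = [(0.85, 20), (0.15, 40)]  # base 20 damage, 15% chance of 2x

for name, table in [("gold drop", gold_drop), ("chest", chest), ("crit hit", crit_hit)]:
    print(f"{name}: {expected_value(table):.1f}")
```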


Distributions Designers Actually Use

You do not need to know all of probability theory. You need to recognize four shapes.

Uniform Distribution

Every outcome equally likely. A fair die. A random number from 0 to 1 in code. randi_range(1, 10) picks each value 10% of the time.

Use when: you want maximum fairness and no bias. Tetris piece order (sort of: modern Tetris uses a bag system, which draws without replacement rather than rolling a true uniform each time). Random encounter selection from an equal-weighted pool.

Normal Distribution (Gaussian)

The bell curve. Most values cluster near the mean; extreme values are rare. Real-world "scatter" often looks like this: weapon spread around a crosshair, player reaction times, enemy aggression variation.

Use when: you want realistic variation around a target value. Damage rolls where most hits land near the average but occasional big and small hits break the rhythm. Bullet spread in a shooter.

Binomial Distribution

The number of successes in N independent attempts, each with probability p. Flipping ten coins and counting heads. Killing twenty enemies and counting how many dropped the rare.

Use when: you want to answer "across N attempts, how many will succeed?" This is the distribution behind drop-rate anxiety — the shape that tells you half your players will get the 5% drop in 14 kills, but 8% will still be empty-handed after 50.

Poisson Distribution

The number of events per time window, when events happen at a steady average rate but individual timing is random. Enemy spawns. Crit procs per minute. Customer arrivals at a shop.

Use when: modeling arrivals or rare events over time. "An average of 3 reinforcements arrive per minute, but they clump." Poisson is the cleanest way to model that clumping without hand-rolling logic.
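
To get a feel for the four shapes, sample them. This is a minimal Python sketch using only the standard library; the function names are made up for illustration. Python's random module has no Poisson sampler, so the last function uses Knuth's multiplication method, which is fine for small rates like spawns per minute.

```python
import math
import random

rng = random.Random(42)  # fixed seed so a run can be repeated


def uniform_roll(low: int, high: int) -> int:
    """Uniform: every integer from low to high is equally likely."""
    return rng.randint(low, high)


def normal_damage(mean: float, sd: float) -> float:
    """Normal: clusters near the mean. Clamped at 0 so damage never heals."""
    return max(0.0, rng.gauss(mean, sd))


def binomial_drops(kills: int, p: float) -> int:
    """Binomial: number of successes across `kills` independent rolls."""
    return sum(1 for _ in range(kills) if rng.random() < p)


def poisson_spawns(rate: float) -> int:
    """Poisson: event count in one time window with the given average rate.

    Knuth's method: multiply uniforms until the product drops to e^-rate.
    """
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


print("d10 rolls:        ", [uniform_roll(1, 10) for _ in range(8)])
print("damage ~ N(50, 5):", [round(normal_damage(50, 5)) for _ in range(8)])
print("drops in 20 kills:", [binomial_drops(20, 0.05) for _ in range(8)])
print("spawns per minute:", [poisson_spawns(3.0) for _ in range(8)])
```

Look at the last row in particular: an average of 3 per minute still produces minutes with 0 and minutes with 6, which is the clumping described above.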


Standard Deviation: Why It Matters

The mean (average) tells you the center of a distribution. The standard deviation (SD) tells you the spread.

Two enemies, both dealing "50 damage on average":

  • Enemy A: damage = 50, SD = 5. Most hits between 45 and 55.
  • Enemy B: damage = 50, SD = 30. Hits range from 5 to 95 regularly.

Same average. Radically different feel.

Enemy A is predictable. The player can plan. "I can survive two hits." This supports tight, skill-based combat — Dark Souls, Hollow Knight, fighting games.

Enemy B is chaotic. Sometimes it tickles you; sometimes it one-shots you. This can feel exciting (XCOM) or infuriating (any RPG where the bandit occasionally crits for the entire party's HP). High variance demands mitigation systems — cover, retreat, healing, revival — so the swings don't end the run.

Design rule of thumb: the more a single hit matters to the player's survival, the lower you want the variance. Boss damage in Celeste is essentially deterministic. Damage in a horde shooter can vary widely because no single hit kills you. Pick your variance for the role the system plays in the experience.
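
You can measure spread straight from logged hits. A minimal Python sketch follows; the two hit lists are made-up samples chosen to share a mean of 50, not data from any real game.

```python
import statistics

# Hypothetical damage logs for the two enemies described above.
hits_a = [45, 55, 45, 55, 45, 55, 50]
hits_b = [5, 95, 50, 20, 80, 35, 65]

for name, hits in [("Enemy A", hits_a), ("Enemy B", hits_b)]:
    mean = statistics.mean(hits)
    sd = statistics.pstdev(hits)  # population SD of the logged hits
    print(f"{name}: mean {mean:.1f}, SD {sd:.1f}, range {min(hits)}-{max(hits)}")
```

If two enemies print the same mean and very different SDs, they are not the same enemy, whatever the spreadsheet says.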


The Gambler's Fallacy in Game Design

The gambler's fallacy is the false belief that past independent events affect future ones. "I've missed five 50% shots in a row — the next one must hit." No. Each shot is 50%, every time. The coin has no memory.

Chapter 10 covered this at length. Here's what matters for your math:

Players feel the fallacy. You cannot explain it away. The 95%-shot-that-missed is a famous moment of XCOM anguish, and no amount of "but it was technically 5%" ever makes the player happy. The most famous version is the Civilization "phalanx kills my tank" outrage — a technically-correct battle outcome that feels like a bug to the player because the displayed odds ("99% vs 1%") do not match their intuition about military history.

Your tools to manage this:

  1. Don't show probabilities you haven't committed to supporting. XCOM 2 on lower difficulties secretly boosts your real hit chance above the displayed percentage, because Firaxis learned that players perceive "75%" as "almost always" and the game feels broken when it doesn't match that perception.
  2. Use pity systems. (See below.) After N failures, force a success.
  3. Use pseudo-random distribution. (Covered in Ch. 10 §10.6.) Increase probability after each failure; reset on success.

The fallacy is a player psychology problem. Your design has to account for it regardless of whether it's mathematically "fair."
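
Tool 3 from the list above can be sketched in a few lines. This is a minimal Python illustration of "increase after each failure, reset on success," not the exact scheme from Chapter 10 or from any shipped game; the class name and the 0.25 step are assumptions.

```python
import random


class PseudoRandomProc:
    """Proc chance that grows by `step` after every miss and resets on a hit."""

    def __init__(self, step: float, rng: random.Random):
        self.step = step
        self.rng = rng
        self.misses = 0  # misses since the last success

    def roll(self) -> bool:
        chance = self.step * (self.misses + 1)
        if self.rng.random() < chance:
            self.misses = 0
            return True
        self.misses += 1
        return False


proc = PseudoRandomProc(step=0.25, rng=random.Random(1))
rolls = [proc.roll() for _ in range(10_000)]
print(f"observed proc rate: {sum(rolls) / len(rolls):.1%}")
```

Two things to notice. With a step of 0.25 the fourth attempt is a guaranteed hit, so a dry streak can never exceed three misses. And the observed rate comes out well above 25%, so if you want a specific average you have to tune the step down and measure, not assume.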


Pseudorandom vs. Shuffled: Replacement and Memory

There are two fundamental flavors of random selection: with replacement and without replacement.

With replacement (rolling): Each event is independent. Each roll of a d20 is unrelated to the last. Drop tables are usually with-replacement. So are crit procs. So are gacha pulls (for the most part).

Without replacement (shuffled): The "bag" runs out of outcomes and has to be refilled. Each draw changes the probability of future draws. Hearthstone card draws from your deck are without-replacement — once you draw the Fireball, it's gone until you shuffle.

Without-replacement feels fairer to players, because bad streaks end. If you shuffle 10 cards with 2 rare ones, by the time you've drawn 9 cards you must have at least one rare, because only 8 non-rares exist. The randomness has a guaranteed floor.

With-replacement feels more volatile. You can theoretically roll the same d20 forever and never see a 20. The randomness has no floor.

When to use which:

  • Without replacement: Tetris pieces (7-bag system), card games, "shuffled pool" enemy spawns, Into the Breach–style puzzle drops.
  • With replacement: Combat rolls, crit procs, gacha, enemy drops from generic pools.
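
The difference shows up as soon as you measure droughts. Below is a minimal Python sketch of a 7-bag next to independent rolls; the function names are hypothetical, and the piece letters are the usual Tetris set.

```python
import random

PIECES = ["I", "O", "T", "S", "Z", "J", "L"]


def bag_draws(rng: random.Random, count: int) -> list:
    """Without replacement: shuffle a full bag, deal it out, refill when empty."""
    out, bag = [], []
    while len(out) < count:
        if not bag:
            bag = PIECES.copy()
            rng.shuffle(bag)
        out.append(bag.pop())
    return out


def independent_draws(rng: random.Random, count: int) -> list:
    """With replacement: every draw is a fresh, memoryless roll."""
    return [rng.choice(PIECES) for _ in range(count)]


def longest_gap(draws: list, piece: str) -> int:
    """Longest run of consecutive draws that does not contain `piece`."""
    longest = current = 0
    for d in draws:
        current = 0 if d == piece else current + 1
        longest = max(longest, current)
    return longest


rng = random.Random(7)
print("bag, longest wait for I: ", longest_gap(bag_draws(rng, 7000), "I"))
print("roll, longest wait for I:", longest_gap(independent_draws(rng, 7000), "I"))
```

The bag can never make you wait more than 12 draws for a given piece (last in one bag, first in the next). The independent roll has no such ceiling.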

XCOM 2 actually uses weighted random with memory for hit chances on lower difficulties — a shuffled-like pity system invisibly smoothing the experience. Players do not know. The game just feels less cruel.


RNG Manipulation and Pity Systems

When pure probability produces experiences you cannot ship, you engineer around it. This is not cheating. This is design.

The Gacha Math

Gacha games advertise rates like "0.6% for the top-tier character." Players see that number and assume "one pull in 167 gets me one." Here is what actually happens.

Probability of at least one top-tier in N pulls:

1 − (1 − 0.006)^N = 1 − 0.994^N

Pulls | Chance of at least one
10    | 5.8%
50    | 26.0%
100   | 45.2%
200   | 70.0%
300   | 83.6%
500   | 95.1%
1,000 | 99.8%

The median is around pull 115. The unlucky tail — players who hit 300+ pulls with nothing — is not rare. It is 16% of players. In a game with a million active users, that is 160,000 people with awful luck, and they will tell you about it on Reddit.

Hard Pity

Guarantee the item after N pulls. "If you haven't gotten a 5-star in 89 pulls, pull 90 is guaranteed to be one." The tail is capped. No one ever pulls 300 times for nothing.

Soft Pity

Dynamically increase the drop rate as the dry streak grows. Genshin Impact's 5-star character rate is advertised as 0.6% but the real rate ramps aggressively after pull 74, hitting near-certainty by pull 89 — and hard pity at 90. Most 5-stars are obtained during soft pity, not at the advertised rate.

Soft pity is good design (it smooths the tail). It is also ethically complicated, because it makes the stated rate misleading. The player thinks they're pulling against a 0.6% rate. They are actually pulling against a dynamic rate that saves them from the worst outcomes. This is the kind of thing regulators are starting to look at. Disclose or don't disclose — both are defensible — but know which one you're doing and why.
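
Hard and soft pity combine into one rate function. The Python sketch below takes the 0.6% base rate, the ramp starting at pull 74, and the hard pity at 90 from the text. The size of the ramp (RAMP_PER_PULL) is a made-up number for illustration and is not any real game's curve.

```python
import random

BASE_RATE = 0.006      # advertised rate, from the text
SOFT_PITY_START = 74   # first pull of the ramp (assumed to be pull 74 itself)
HARD_PITY = 90         # guaranteed pull, from the text
RAMP_PER_PULL = 0.06   # HYPOTHETICAL: extra chance added per soft-pity pull


def pull_rate(pull_number: int) -> float:
    """Chance of a 5-star on the given pull (1-based) of a dry streak."""
    if pull_number >= HARD_PITY:
        return 1.0
    if pull_number < SOFT_PITY_START:
        return BASE_RATE
    steps = pull_number - SOFT_PITY_START + 1
    return min(1.0, BASE_RATE + RAMP_PER_PULL * steps)


def pulls_until_five_star(rng: random.Random) -> int:
    """Simulate one dry streak and return the pull that ended it."""
    pull = 1
    while rng.random() >= pull_rate(pull):
        pull += 1
    return pull


rng = random.Random(2024)
streaks = [pulls_until_five_star(rng) for _ in range(10_000)]
in_soft_pity = sum(1 for s in streaks if s >= SOFT_PITY_START) / len(streaks)
print(f"worst streak: {max(streaks)} pulls")
print(f"share of 5-stars won during soft or hard pity: {in_soft_pity:.0%}")
```

Swap in your own ramp and rerun. The two printed numbers are the ones to put in front of whoever signs off on the disclosed rate.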


Combat Math

Three formulas you will use constantly when balancing a fight.

DPS (damage per second) = damage × attacks per second × hit rate

A sword doing 40 damage, swinging once per second, hitting 90% of the time, has DPS = 40 × 1 × 0.9 = 36.

Effective HP (EHP) = HP / (1 − damage_reduction)

An enemy with 200 HP and 25% damage reduction has EHP = 200 / 0.75 ≈ 267. That's how much raw damage you actually have to deal.

Time to Kill (TTK) = EHP / DPS

267 EHP, 36 DPS → 7.4 seconds to kill.

Worked example: balancing a fighter class.

You want the Fighter to kill a basic grunt in 3 seconds. The grunt has 150 HP, 0% armor. Required DPS = 150 / 3 = 50.

Your Fighter attacks once every 1.5 seconds with 85% accuracy. Required raw damage per hit = 50 / (1/1.5 × 0.85) = 50 / 0.567 = 88 damage per swing.

Now check against other enemies. Against a heavy (400 HP, 20% armor), EHP = 500. TTK = 500 / 50 = 10 seconds. Too long? Give the Fighter an armor-pierce ability, or a crit, or a heavy-slow stagger mechanic. Iterate.
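
The three formulas and the Fighter example translate directly into code. This is a minimal Python sketch of the spreadsheet logic; the function names are hypothetical.

```python
def dps(damage: float, attacks_per_second: float, hit_rate: float) -> float:
    return damage * attacks_per_second * hit_rate


def effective_hp(hp: float, damage_reduction: float) -> float:
    """Raw damage needed to remove `hp` through the given reduction (0 to <1)."""
    return hp / (1.0 - damage_reduction)


def time_to_kill(hp: float, damage_reduction: float, attacker_dps: float) -> float:
    return effective_hp(hp, damage_reduction) / attacker_dps


def damage_per_hit_for_ttk(hp, damage_reduction, target_ttk,
                           attacks_per_second, hit_rate) -> float:
    """Work backward from a target TTK to the damage each swing must deal."""
    required_dps = effective_hp(hp, damage_reduction) / target_ttk
    return required_dps / (attacks_per_second * hit_rate)


# Fighter: one swing every 1.5 s, 85% accuracy, grunt should die in 3 s.
swing = damage_per_hit_for_ttk(150, 0.0, 3.0, 1 / 1.5, 0.85)
fighter_dps = dps(swing, 1 / 1.5, 0.85)
print(f"damage per swing: {swing:.0f}")
print(f"TTK vs heavy (400 HP, 20% armor): {time_to_kill(400, 0.20, fighter_dps):.1f} s")
```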

This is what "game balancing" actually looks like. You build a spreadsheet. You change numbers. You check the feel in-game. You repeat until the TTK chart across enemies tells the story you want — trash dies fast, elites take commitment, bosses take mastery. Chapter 32 walks through the full balancing process; this is the math underneath it.


Progression Math: XP Curves

The shape of your XP curve determines how your game feels across the level-up arc. Four shapes cover almost every game:

Linear: xp_needed(level) = k × level

Level 1 needs 100 XP, level 2 needs 200, level 10 needs 1,000. Each level costs a flat 100 XP more than the last.

Feels like: flat grind after a few levels. Early levels pop; late levels drag. Bad for long games.

Exponential: xp_needed(level) = base × multiplier^level

Classic: 100 × 2^level. Level 1: 200. Level 10: 102,400. Level 20: 104,857,600.

Feels like: punishing. Every level is a brick wall. Used in old MMOs and Korean RPGs where grinding is the game. Do not use this unless your design embraces the grind as a feature.

Polynomial: xp_needed(level) = k × level^2 (or level^1.5)

Level 1: 100. Level 10: 10,000. Level 100: 1,000,000. The gaps grow — but sub-exponentially.

Feels like: balanced. Leveling speeds up early, slows down late, but never hits the wall of exponential. This is what most modern RPGs use. It's the Skyrim shape. It's the Diablo shape.

Logarithmic (diminishing): xp_yield shrinks as level rises

Not an XP curve so much as a reward curve. Each point in a skill gives less benefit. damage_bonus = log(strength). Common for soft caps: the bonus never stops growing, but each extra point buys less, which reins in runaway scaling.

Feels like: "after a while, more points don't matter much." Good for soft caps, late-game design space.

Pick your curve based on game length. A 15-hour indie RPG uses a gentler polynomial (level^1.5). A 100-hour loot grinder can afford level^2. Never use pure exponential unless you want your players to feel the wall.
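
Printing the curves side by side makes the difference obvious. A minimal Python sketch, using the same constants as the examples above (k = 100, base = 100, multiplier = 2); the function names are hypothetical, and the natural log in damage_bonus is an assumption, since the text does not name a base.

```python
import math


def xp_linear(level: int, k: float = 100) -> float:
    return k * level


def xp_exponential(level: int, base: float = 100, multiplier: float = 2) -> float:
    return base * multiplier ** level


def xp_polynomial(level: int, k: float = 100, alpha: float = 2.0) -> float:
    return k * level ** alpha


def damage_bonus(strength: float) -> float:
    """Diminishing reward curve; natural log is an assumed choice of base."""
    return math.log(strength)


print(f"{'level':>5} {'linear':>10} {'poly^1.5':>10} {'poly^2':>10} {'exponential':>14}")
for level in (1, 5, 10, 20):
    print(f"{level:>5} {xp_linear(level):>10,.0f} "
          f"{xp_polynomial(level, alpha=1.5):>10,.0f} "
          f"{xp_polynomial(level):>10,.0f} {xp_exponential(level):>14,.0f}")
```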


Economy Math

Economy design is Chapter 24's territory. Here are the numbers you need at your fingertips.

Inflation Rate

inflation_per_hour = (gold_sources_per_hour − gold_sinks_per_hour) / gold_in_economy

If players earn 1,000 gold/hour and spend 800, and the total economy is 50,000 gold, inflation is 200/50,000 = 0.4% per player-hour. Across a thousand players and 100 hours, that compounds into a wealth tsunami. Endgame sinks exist to prevent that.

Time-to-Earn

If item X costs 5,000 gold and players earn 500 gold/hour, time-to-earn is 10 hours. Compare against your intended pacing: if that item is meant to feel "available around 5 hours in," you have a problem. Either increase gold sources, decrease item cost, or move the intended unlock to match reality.
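
Both checks are one-liners, which is exactly why they belong in a script you rerun whenever a price changes. A minimal Python sketch with hypothetical function names:

```python
def inflation_per_hour(sources_per_hour: float, sinks_per_hour: float,
                       gold_in_economy: float) -> float:
    """Net gold created per hour as a fraction of all gold in circulation."""
    return (sources_per_hour - sinks_per_hour) / gold_in_economy


def hours_to_earn(cost: float, income_per_hour: float) -> float:
    return cost / income_per_hour


def price_for_target_hours(target_hours: float, income_per_hour: float) -> float:
    """Invert time-to-earn: what should the item cost to land on schedule?"""
    return target_hours * income_per_hour


print(f"inflation: {inflation_per_hour(1000, 800, 50_000):.1%} per hour")
print(f"item X takes {hours_to_earn(5000, 500):.0f} hours to earn")
print(f"price for a 5-hour unlock: {price_for_target_hours(5, 500):,.0f} gold")
```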

Soft vs. Hard Caps

A hard cap says "you cannot gain more of this." Max level 50. Max gold 999,999. Simple to implement, creates obvious pressure to spend before cap.

A soft cap says "you can gain more, but it gets dramatically harder." XP required doubles after level 50. Gold above 100k generates taxes. Smoother, but requires more tuning.

The Cautionary Number: $100,000

Diablo Immortal (2022) famously required around $100,000 USD to fully max out a character through its gem and legendary-gem upgrade system, per community math. This number made international headlines not because Blizzard intended it but because the economic systems, when composed, multiplied. Each individual conversion rate looked reasonable. The total was obscene. When you design stacking upgrade economies, always compute the full composed cost from scratch to max. If that number is embarrassing, the design is broken.


Graph Theory for Metroidvania Design

A metroidvania world is a graph. Rooms are nodes. Connections between rooms are edges. Some edges require an unlock (double jump, dash, key, ability).

The design question: given the player's current set of unlocks, which rooms can they reach?

This is a graph reachability problem. In code, it's a breadth-first search from the player's current node, following only edges whose required unlocks are a subset of what the player owns.

Why this matters for design:

  • Sequence breaking: if players can reach room X without ability A, but you assumed A was required, they can break your intended progression. Sometimes that's a feature (Super Metroid loved this). Sometimes it's a bug.
  • Softlocks: if a player can drop into a room and then can't leave without an ability they don't own, you have a softlock. Graph analysis catches these before playtesting.
  • Progression gating: to force the player through a specific ability unlock, remove all alternate paths from the graph. To allow freedom, keep multiple paths in.

You don't need fancy graph algorithms. A room-to-room adjacency list and a BFS is enough to validate any metroidvania map. Build the tool. Run it every time you add a room. It will save you playtesting hours.
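
Here is roughly what that tool looks like. This is a minimal Python sketch; the five-room map, the room names, and the ability names are all made up for illustration. Each edge carries the set of abilities it requires, and an edge is usable when that set is a subset of what the player owns.

```python
from collections import deque

# Hypothetical map: room -> list of (neighbor, abilities required for that edge).
WORLD = {
    "start":  [("cavern", set()), ("ledge", {"double_jump"})],
    "cavern": [("start", set()), ("shrine", {"dash"}), ("pit", set())],
    "ledge":  [("start", set()), ("shrine", set())],
    "shrine": [("cavern", {"dash"}), ("ledge", set())],
    "pit":    [("cavern", {"double_jump"})],  # easy to drop in, hard to leave
}


def reachable(world: dict, start: str, abilities: set) -> set:
    """Breadth-first search over edges the player can currently use."""
    seen = {start}
    queue = deque([start])
    while queue:
        room = queue.popleft()
        for neighbor, required in world.get(room, []):
            if neighbor not in seen and required <= abilities:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def softlock_rooms(world: dict, start: str, abilities: set) -> set:
    """Rooms the player can enter but cannot get back to `start` from."""
    return {room for room in reachable(world, start, abilities)
            if start not in reachable(world, room, abilities)}


for owned in (set(), {"dash"}, {"double_jump"}):
    print(sorted(owned), "reach:", sorted(reachable(WORLD, "start", owned)),
          "softlocks:", sorted(softlock_rooms(WORLD, "start", owned)))
```

On this toy map the output flags two things worth a design conversation: the pit is a softlock without double jump, and dash alone reaches the ledge through the shrine, which is a sequence break if the ledge was supposed to be gated on double jump.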


Simple Simulation (Monte Carlo)

When the math gets hard, simulate. Monte Carlo simulation means running a system thousands of times with random inputs and measuring the distribution of outcomes. In ten lines of code, you can answer questions that would take days of algebra.

# Simulate 10,000 dungeon runs and tally loot by rarity.
func simulate_loot_drops() -> Dictionary:
    var rng = RandomNumberGenerator.new()
    rng.randomize()  # seed from the clock so every call differs
    var counts = {"common": 0, "rare": 0, "epic": 0, "legendary": 0}
    for _run in 10000:
        for _kill in 20:  # 20 kills per run
            var roll = rng.randf()
            if roll < 0.60:      # 60% common
                counts["common"] += 1
            elif roll < 0.90:    # 30% rare
                counts["rare"] += 1
            elif roll < 0.99:    # 9% epic
                counts["epic"] += 1
            else:                # 1% legendary
                counts["legendary"] += 1
    return counts

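After this runs, divide each count by 10,000 to get the per-run average. Now you know: the average run yields about 12 commons, 6 rares, 1.8 epics, and 0.2 legendaries. These are averages, not medians: at a 1% rate over 20 kills, roughly 82% of runs see no legendary at all, so the median player gets zero. You can tune drop rates against desired per-run yields.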

Monte Carlo answers questions like:

  • "How long is my unlucky tail?" (Run 10,000 players, plot the distribution.)
  • "Can a player realistically afford X after N hours?" (Simulate the economy across 10,000 play sessions.)
  • "Does this rock-paper-scissors enemy rotation produce varied fights?" (Simulate combat encounters and measure fight length variance.)

Designers who simulate ship better economies and fewer absurd tails. It is not a statistics trick. It is just running the system and looking at what happens.
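
Totals hide the tail, so it helps to record the outcome of each run separately. The Python sketch below reuses the 1% legendary rate and 20 kills per run from the loot simulation above; the function names and the fixed seed are my own choices for illustration.

```python
import random
from collections import Counter


def legendaries_in_run(rng: random.Random, kills: int = 20,
                       legendary_rate: float = 0.01) -> int:
    """Count legendary drops in a single run."""
    return sum(1 for _ in range(kills) if rng.random() < legendary_rate)


def simulate(runs: int = 10_000, seed: int = 1) -> Counter:
    """Histogram: how many runs ended with 0, 1, 2, ... legendaries."""
    rng = random.Random(seed)
    return Counter(legendaries_in_run(rng) for _ in range(runs))


histogram = simulate()
total_runs = sum(histogram.values())
for drops in sorted(histogram):
    share = histogram[drops] / total_runs
    print(f"{drops} legendaries: {share:6.1%} of runs")
```

The row for 0 is the unlucky tail. The closed-form check is 0.99^20, or about 82% of runs, and the simulation should land close to it.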


Statistics for Playtesting

You will read more on playtesting in Chapter 31. Here are the numbers that matter.

Sample Size

Jakob Nielsen's classic finding: 5 to 7 usability testers reveal about 80% of usability issues. This is for surface-level issues — menus, controls, confusion. After 7 testers, you hit diminishing returns on discovery. Add more testers only if you're testing specific questions (difficulty, balance, accessibility), not general usability.

Significance

For qualitative feedback (did they like it? did they understand it?), don't trust results with n < 10 as anything but anecdote. You might notice a pattern with n = 5, but you don't have enough data to make decisions off it alone.

For quantitative metrics (completion rates, retention curves), you want n ≥ 30 for rough estimates, n ≥ 100 for decision-grade numbers, n ≥ 1,000 for statistical significance on small differences.

Ratios are slippery. "2 out of 5 testers failed" is a warning sign. "2 out of 500 testers failed" is noise. Same absolute count; different stories. Always ask for the denominator.
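
One way to put numbers on "ask for the denominator" is a confidence interval on the failure rate. The Python sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96). That method is an assumption on my part, picked because it behaves sensibly at small n; this appendix does not prescribe one.

```python
import math


def wilson_interval(failures: int, testers: int, z: float = 1.96):
    """Approximate 95% range for the true failure rate, given a sample."""
    p = failures / testers
    denom = 1 + z * z / testers
    center = (p + z * z / (2 * testers)) / denom
    half = z * math.sqrt(p * (1 - p) / testers
                         + z * z / (4 * testers * testers)) / denom
    return center - half, center + half


for failures, testers in [(2, 5), (2, 500)]:
    low, high = wilson_interval(failures, testers)
    print(f"{failures}/{testers} failed: true rate plausibly {low:.1%} to {high:.1%}")
```

With 5 testers the plausible range is enormous, which is why it is a warning sign rather than a measurement. With 500 it is pinned down near zero.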

Confirmation Bias

The biggest statistical error in playtesting isn't sample size — it's watching your own playtest and seeing what you want to see. "They paused there because they were impressed." No, they paused there because the UI was broken. The way to fight this is to watch the tape, not your memory of the session. And to count what testers do, not what they say.


Rounding, Truncation, and Floor/Ceiling

Integer math bites designers. 3.7 damage can round to 3 or 4. It matters.

Truncation (chop the decimal): 3.7 → 3. Easy, fast, but biased downward.

Rounding (nearest integer): 3.7 → 4, 3.4 → 3. Unbiased but weird at .5 boundaries.

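Floor: 3.7 → 3 (always down). Matches truncation for positive numbers, but not for negatives: −3.7 floors to −4 and truncates to −3.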

Ceiling: 3.7 → 4 (always up).

The famous "4 damage doesn't kill an 8-HP enemy" bug: you attack, you deal 4 damage twice, enemy should be at 0 HP. Except buried somewhere is a 5% damage reduction that reduces 4 to 3.8, which truncates to 3. Now two hits leave the enemy at 2 HP. The player sees 4 damage in the popup. The system subtracts 3. Rage.

Rule: pick one rounding strategy per subsystem and stick to it. Document which. Damage calculations in my projects always floor() the final result after all multipliers apply; the popup shows the same floored value so what the player sees is what the player gets.

Accumulated error is the subtler version. Say each tick of a damage-over-time effect deals 3.3 damage, one tick per second. Truncation drops 0.3 from every tick, so over a 10-second DoT the player expected 33 damage and got 30. Over a 30-second fight, the target survived when it shouldn't have. Fix: accumulate as float, floor only when applying to integer HP.
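
The fix is a carry variable. This is a minimal Python sketch with hypothetical function names. It uses a 3.25-damage tick, a made-up value chosen because quarters are exact in floating point, so the totals are easy to verify by hand.

```python
import math


def dot_truncating(tick_damage: float, ticks: int) -> int:
    """Buggy version: floor every tick, throw the fraction away."""
    return sum(math.floor(tick_damage) for _ in range(ticks))


def dot_accumulating(tick_damage: float, ticks: int) -> int:
    """Fixed version: keep the fraction and apply it once it adds up to 1."""
    applied = 0    # integer damage actually subtracted from HP
    carry = 0.0    # fractional damage not yet applied
    for _ in range(ticks):
        carry += tick_damage
        whole = math.floor(carry)
        applied += whole
        carry -= whole
    return applied


ticks, per_tick = 10, 3.25
print(f"expected:     {ticks * per_tick}")
print(f"truncating:   {dot_truncating(per_tick, ticks)}")
print(f"accumulating: {dot_accumulating(per_tick, ticks)}")
```

The accumulating version is never more than one point behind the true total. With tick values that are not exact in binary (3.3, for example), expect the carry to drift by tiny amounts; store damage in integer tenths if that matters.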


Unit Economics Basics

The two numbers that determine whether a F2P game is a business:

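LTV (Lifetime Value): average total revenue per player across their entire play lifetime. A common first estimate is ARPDAU × average lifetime in days. ARPDAU already averages over paying and non-paying players, so do not multiply by a conversion rate on top of it. Better still, pull LTV from your analytics dashboards.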

CAC (Customer Acquisition Cost): average ad spend to acquire one player. Pulled from the user-acquisition team's reports.

For a F2P game to make money: LTV > CAC. By a comfortable margin. If LTV is $3 and CAC is $2.50, you're sustainable-ish. If LTV is $3 and CAC is $5, every player you buy costs you money, and you are burning the runway. That's most F2P studios, by the way. Even at profitable publishers, maybe 20% of the portfolio makes money, and those hits subsidize the rest.

Retention curves are the shape that determines LTV. Day 1 retention, day 7, day 30, day 90. A healthy mobile game might see D1 = 40%, D7 = 15%, D30 = 5%. A great one: D1 = 50%, D7 = 25%, D30 = 12%. Retention is the lifeblood; nothing else matters if players leave in week one.
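
To see how the retention curve drives LTV, add up the daily retention fractions: the sum is the expected number of days a new player is active. The Python sketch below assumes ARPDAU stays constant over a player's lifetime, which real games rarely satisfy. Every number in it is hypothetical.

```python
def expected_active_days(retention_by_day: list) -> float:
    """Sum of daily retention fractions (day 0 = install day = 1.0)."""
    return sum(retention_by_day)


def estimate_ltv(arpdau: float, retention_by_day: list) -> float:
    """Rough LTV: revenue per active day times expected active days."""
    return arpdau * expected_active_days(retention_by_day)


# Hypothetical first-week curve and monetization numbers.
first_week = [1.00, 0.40, 0.30, 0.25, 0.20, 0.18, 0.16, 0.15]
arpdau = 0.12  # dollars per daily active user (made up)
cac = 0.50     # dollars per install (made up)

ltv = estimate_ltv(arpdau, first_week)
print(f"expected active days in week one: {expected_active_days(first_week):.2f}")
print(f"week-one LTV ${ltv:.2f} vs CAC ${cac:.2f}: "
      f"{'paid back' if ltv > cac else 'not paid back yet'}")
```

Extend the list to day 30 or day 90 and the same two functions show how much of the payback depends on the long tail of the curve.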

The ethical issue (Chapter 33 covers this at length): the design patterns that maximize LTV are often the ones that maximize player harm. Daily login streaks, FOMO timers, variable-ratio reward schedules. The math of the business pushes toward exploitation. You as a designer will have to decide where the line is. The math will not help you find it. But the math will show you when you have crossed it — if the top 0.1% of players generates more than 50% of revenue, somebody is being harmed, no matter what the dashboard says.


Quick Reference: Formulas Sheet

Copy this to your design doc.

What                                   | Formula
Probability of both (independent)      | P(A) × P(B)
Probability of either (mutually excl.) | P(A) + P(B)
Probability of either (overlapping)    | P(A) + P(B) − P(A and B)
Probability of at least one in N tries | 1 − (1 − p)^N
Median kills for p-drop                | ln(0.5) / ln(1 − p)
Expected value                         | Σ (probability × value)
DPS                                    | damage × attacks_per_sec × hit_rate
Effective HP                           | HP / (1 − damage_reduction)
Time to kill                           | EHP / DPS
XP curve (linear)                      | k × level
XP curve (polynomial)                  | k × level^α (α between 1.3 and 2.0)
Inflation rate                         | (sources − sinks) / total_economy
LTV > CAC                              | required for business viability
Playtester count                       | 5–7 for 80% of usability issues
F2P whales                             | top 0.15%–2% generate 50%–80% revenue

Keep the formulas simple. Check your work against simulations when the algebra gets hairy. When in doubt, write 10 lines of GDScript and run the system 10,000 times. The computer will tell you the truth.