

Chapter 7: The Law of Large Numbers — Why Small Samples Lie

"Six weeks of data is not a pattern. It's a story you're telling yourself about a pattern." — Dr. Yuki Tanaka


Opening Scene

Nadia slides her laptop across the library table so Dr. Yuki can see the screen. There are charts everywhere — bar graphs in teal and coral, a line chart showing weekly growth, a scatter plot she made herself at 2 a.m. after watching three YouTube tutorials on data visualization.

"Look," Nadia says, pointing to the line that climbs steadily from left to right. "Six weeks of data. My engagement rate goes up every week I post at 7 p.m. on Wednesdays and include a text hook in the first three seconds. I've figured it out."

Dr. Yuki leans in. She doesn't say anything for a moment.

"You've figured out six weeks," she says finally.

Nadia blinks. "What?"

"You've figured out six weeks. What you haven't figured out is whether those six weeks are telling you something real or something random. And right now, you have no way to know."

Nadia pulls the laptop back toward herself slightly. She's proud of this analysis. She stayed up until 2 a.m. making those charts. "But the pattern is right there. Look at the trend line."

"I see the trend line," Dr. Yuki says. She pulls out a notepad — she always has a notepad — and draws a short squiggly line. Then she draws a long squiggly line next to it. "Which one of these," she says, "is more likely to be heading somewhere meaningful?"

Nadia looks at the two lines. They're both going generally upward, but the longer one has enough variation that you can actually see it isn't perfectly straight. "The longer one. Because you can see more of it."

"Exactly," Dr. Yuki says. "A six-week trend is a short squiggly line. It looks like a pattern because there's not enough data to show you the noise. This is one of the most important ideas in all of statistics. It has a name: the law of large numbers."

Marcus, who has been sitting across the table pretending to work on his startup pitch deck, looks up. "I've heard of that."

"Everyone has heard of it," Dr. Yuki says. "Almost no one understands what it actually says."

She reaches across the table and turns the laptop so the screen is facing her more directly. She studies Nadia's charts for a moment, then looks up with an expression that is sympathetic but entirely honest.

"How did it feel, making these?"

Nadia is caught off guard. "Good. Like I was doing something right. Like I was finally being systematic."

"You were being systematic. The effort is real." Dr. Yuki turns the laptop back. "But systematic effort applied to insufficient data produces systematic illusions. That's actually worse than random guessing, in some ways, because you believe in it more."

Marcus leans forward. His startup has three months of revenue data. He has been making very similar charts.

"How much data do you actually need?" he asks.

"That's exactly the right question," Dr. Yuki says. "And the answer is almost always: more than you have."


What the Law of Large Numbers Actually Says

Let's start with the precise statement, because precision matters here.

The law of large numbers states that as the number of trials in a random experiment increases, the observed average (or frequency) of outcomes converges toward the true expected value (or probability).

In plain language: the more you repeat something, the closer your observed results will get to the true underlying probability.

Flip a fair coin 10 times, and you might get 7 heads. That's 70% — far from the true 50%. Flip it 1,000 times and you'll almost certainly be within a few percentage points of 50%. Flip it 1,000,000 times and you'll be within fractions of a percent.

This sounds obvious. But three critical implications are routinely missed:

Implication 1: The law says nothing about individual trials.

The law of large numbers is a statement about aggregates over time, not about individual outcomes. It does not say that after 7 heads, a tail is "due." The coin has no memory. Each flip remains exactly 50/50. What happens over a million flips is that the 7-heads-in-a-row weirdness gets swamped by the enormous weight of subsequent flips, all still at 50/50.

This distinction — between what the law says about aggregates vs. what people falsely infer about individual events — is the source of the gambler's fallacy.
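You can watch the coin's lack of memory directly. This is a minimal sketch (plain Python, standard library only) that flips a simulated fair coin a million times, finds every run of seven heads, and checks what happened on the very next flip:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# One million simulated fair-coin flips.
flips = [random.choice("HT") for _ in range(1_000_000)]

# Collect every flip that immediately follows a run of 7 heads.
after_streak = [
    flips[i + 7]
    for i in range(len(flips) - 7)
    if all(f == "H" for f in flips[i:i + 7])
]

p_heads_after_streak = after_streak.count("H") / len(after_streak)
print(f"Runs of 7 heads found: {len(after_streak)}")
print(f"Proportion of heads on the very next flip: {p_heads_after_streak:.3f}")
```

The seven-heads streak occurs thousands of times in a million flips, and the next flip still lands heads about half the time. No compensation, no correction.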

Implication 2: Convergence is slow, especially for rare events.

The law guarantees convergence, but it doesn't tell you how fast. For common events (like 50% coin flips), convergence to the true probability is reasonably fast. For rare events (like 1-in-100 outcomes), you need a very large sample before your observed frequency reliably reflects the true probability. For extremely rare events (1-in-10,000), you may need millions of trials before the estimate stabilizes.

This is why studying rare cancers, rare side effects of medications, or rare social media virality requires enormous datasets. Small samples of rare events are essentially meaningless.
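A quick simulation makes the rare-event problem concrete. The sketch below estimates a true 1-in-100 probability from samples of several sizes (the sizes and repeat counts are arbitrary choices, kept small for speed):

```python
import random

random.seed(1)  # reproducible run

TRUE_P = 0.01  # a 1-in-100 event

def estimate(n):
    """Observed frequency of the rare event across n trials."""
    return sum(random.random() < TRUE_P for _ in range(n)) / n

results = {}
for n in (100, 1_000, 10_000):
    results[n] = [estimate(n) for _ in range(500)]
    zeros = sum(e == 0 for e in results[n])
    print(f"n={n:>6}: estimates span {min(results[n]):.4f}-{max(results[n]):.4f}; "
          f"{zeros}/500 runs never saw the event")
```

With only 100 trials, a large share of runs never observe the event at all, so their estimate is exactly zero; only in the thousands of trials does the estimate settle near 0.01.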

Implication 3: The law describes long-run behavior, not short-run guarantees.

"Long run" in mathematics can mean much larger than human intuition expects. Six weeks is not a long run. Even six months might not be. The relevant question is always: long run relative to the variance in what you're measuring.


Worked Example: Watching Convergence Happen With Numbers

Sometimes the most powerful way to understand the law of large numbers is to watch it unfold numerically. Let's do exactly that.

Suppose we're flipping a fair coin — true probability of heads is 0.5. Here's what observed proportions might look like across increasing sample sizes:

| Flips completed | Heads observed | Observed proportion | Distance from true 0.5 |
|---|---|---|---|
| 5 | 4 | 0.800 | 0.300 (very far) |
| 10 | 7 | 0.700 | 0.200 |
| 25 | 15 | 0.600 | 0.100 |
| 50 | 28 | 0.560 | 0.060 |
| 100 | 54 | 0.540 | 0.040 |
| 250 | 129 | 0.516 | 0.016 |
| 500 | 253 | 0.506 | 0.006 |
| 1,000 | 503 | 0.503 | 0.003 |
| 5,000 | 2,506 | 0.501 | 0.001 |
| 10,000 | 5,002 | 0.5002 | 0.0002 |

Notice the pattern: the distance from the true value does not shrink at a constant rate. Early on, it shrinks rapidly. But in the small-sample zone — under 50 flips — the proportion can still be wildly far from the true value. Even at 100 flips, you might be 4 percentage points off. Only in the thousands does the estimate become reliably tight.

Now consider that this is a coin flip — the simplest possible random system, with only two outcomes and equal probability for each. Most real-world phenomena are far noisier. A TikTok video's view count, a startup's monthly revenue, a student's exam performance — all have far more variability than a coin flip. If convergence takes thousands of trials for a coin, how many observations does it take for something with twenty times the variance?

The answer is proportionally more. The law of large numbers always works. It just takes longer the noisier the system.
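The slowing convergence in the table follows a precise rule: the typical error of an observed proportion shrinks like 1/√n, so quadrupling the sample only halves the error. A small simulation (standard library only; the sample sizes are arbitrary) checks this against theory:

```python
import math
import random

random.seed(7)  # reproducible run

def mean_abs_error(n, repeats=2_000):
    """Average |observed proportion - 0.5| over repeated runs of n flips."""
    total = 0.0
    for _ in range(repeats):
        heads = sum(random.random() < 0.5 for _ in range(n))
        total += abs(heads / n - 0.5)
    return total / repeats

errors = {}
for n in (25, 100, 400, 1_600):
    errors[n] = mean_abs_error(n)
    # Theory: the proportion's standard deviation is 0.5/sqrt(n), and the
    # mean absolute error of a roughly normal estimate is about 0.8 of that.
    theory = 0.8 * 0.5 / math.sqrt(n)
    print(f"n={n:>5}: mean |error| = {errors[n]:.4f}   (theory ~ {theory:.4f})")
```

This square-root law is why the last digits of precision are so expensive: going from 100 flips to 10,000 buys only one extra decimal place.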


The Weak and Strong Laws: A Brief Technical Note

Mathematically, there are actually two versions of the law of large numbers — the weak law and the strong law. You don't need to master the technical distinction, but understanding it at a conceptual level is illuminating.

The Weak Law of Large Numbers (Bernoulli, 1713) says: for any fixed positive number ε (epsilon, representing a small margin of error), the probability that the observed average differs from the true mean by more than ε converges to zero as the sample size grows.

In other words: for any level of precision you want, you can make it arbitrarily likely to achieve that precision by using a large enough sample. But notice — it's "likely," not "certain." Even with a huge sample, there's technically a very small probability of being far off.

The Strong Law of Large Numbers (proved rigorously in the early 20th century) makes a stronger claim: the observed average converges to the true mean with probability 1. Almost surely — not just likely, but virtually certain — the long-run average will equal the expected value.

For practical purposes, both laws say the same thing: take more samples. But the strong law reminds us that we're operating in the realm of near-certainty, not absolute certainty. Extreme random runs can happen in principle. They just become astronomically unlikely as sample size grows.

The key insight for the real world: neither law helps you predict individual outcomes. Both laws are about what happens when you zoom way, way out.
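The weak law's statement can itself be watched numerically. Fix a margin ε and estimate, by repeated simulation, the probability that a sample mean of fair-coin flips misses the true 0.5 by more than ε (the ε = 0.05 and the sample sizes below are arbitrary illustrative choices):

```python
import random

random.seed(3)  # reproducible run

EPSILON = 0.05  # the fixed margin of error in the weak law's statement

def prob_outside_margin(n, repeats=2_000):
    """Estimate P(|sample mean - 0.5| > EPSILON) for n fair-coin flips."""
    misses = 0
    for _ in range(repeats):
        heads = sum(random.random() < 0.5 for _ in range(n))
        if abs(heads / n - 0.5) > EPSILON:
            misses += 1
    return misses / repeats

results = {}
for n in (10, 50, 200, 800):
    results[n] = prob_outside_margin(n)
    print(f"n={n:>4}: P(miss 0.5 by more than {EPSILON}) ~ {results[n]:.3f}")
```

The probability of a miss falls toward zero as n grows, exactly as the weak law promises, but it never becomes exactly zero at any finite n.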


Myth vs. Reality: The Law of Large Numbers Edition

Myth: "I've had bad luck recently, so I'm due for good luck soon." Reality: The law of large numbers explains why, over millions of events, luck averages out. It says nothing about your next event being "compensated." The law works through accumulation, not correction.

Myth: "I've collected data for six weeks. That's enough to see a pattern." Reality: Whether six weeks is enough depends entirely on how variable the thing you're measuring is. For most human behaviors — engagement rates, sales, academic performance — six weeks is far too short to distinguish real patterns from noise.

Myth: "The law of large numbers guarantees I'll win back my losses if I keep playing long enough." Reality: This is exactly backwards. In a game with negative expected value (like most casino games), the law of large numbers guarantees that the more you play, the closer your outcomes will converge to the expected value — which is a net loss. The house edge is the "true probability" that your observed results converge toward.


Why Small Samples Lie: Variance and the Sample Size Problem

To understand why small samples are unreliable, you need to understand the relationship between sample size and variance.

Variance is a measure of how spread out individual measurements are around the average. High variance means individual data points are scattered widely. Low variance means they're clustered close together.

Here's the key mathematical relationship: the variance of a sample mean decreases as sample size increases. Specifically, if the population variance is σ² and the sample size is n, the variance of the sample mean is σ²/n.

What this means in practice: the larger your sample, the less "wiggly" your sample average will be. Small samples are dominated by noise. Large samples quiet the noise and let the signal through.
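The σ²/n relationship is easy to verify empirically. This sketch draws repeated samples from a population with known variance (a Normal(100, 10) population, assumed purely for illustration) and measures how the variance of the sample mean falls:

```python
import random
import statistics

random.seed(11)  # reproducible run

SIGMA = 10.0  # population standard deviation (assumed, for illustration)

def sample_mean(n):
    """Mean of n draws from a Normal(100, SIGMA) population."""
    return statistics.fmean(random.gauss(100, SIGMA) for _ in range(n))

results = {}
for n in (1, 4, 16, 64):
    means = [sample_mean(n) for _ in range(5_000)]
    results[n] = statistics.pvariance(means)
    print(f"n={n:>3}: variance of the sample mean = {results[n]:7.2f}   "
          f"(theory sigma^2/n = {SIGMA**2 / n:7.2f})")
```

Each quadrupling of n cuts the variance of the mean by a factor of four, exactly as σ²/n predicts.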

Let's make this concrete with Nadia's situation.

Suppose the true effect of posting at 7 p.m. on Wednesday is that it increases engagement by 8% compared to posting at other times. But Nadia's individual video performances vary enormously — some videos randomly go mini-viral (500% above average) and some flop (80% below average) regardless of when she posts them. The natural variance in video performance is far larger than the 8% effect she's trying to detect.

With six weeks of data (six Wednesday posts vs., say, thirty other posts), the random variation in individual video performance will overwhelm the real 8% signal. She might see a huge apparent Wednesday effect entirely due to chance — or she might see no Wednesday effect even though it's real.

This is the fundamental problem: when the signal you're looking for is smaller than the noise in your measurement, small samples will mislead you more often than they enlighten you.

The formal concept is called statistical power — the probability that a study will correctly detect a real effect when one exists. A study with low power (from too-small samples) will frequently miss real effects and frequently find false ones. Both types of error are systematic.
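To see low power in action, here is a toy model of Nadia's situation. The numbers are assumptions for illustration, not real platform data: a true +8% Wednesday effect buried under heavy-tailed per-video noise (lognormal, so occasional mini-virals and flops):

```python
import random
import statistics

random.seed(5)  # reproducible run

# Illustrative assumptions: baseline engagement averages 100 units,
# a Wednesday post gets a true +8% lift, and individual videos carry
# heavy-tailed lognormal noise.
def video_engagement(wednesday):
    base = 100 * (1.08 if wednesday else 1.00)
    return base * random.lognormvariate(0, 0.8)

def wednesday_looks_better(n_wed, n_other):
    """One simulated dataset: does the Wednesday average come out ahead?"""
    wed = [video_engagement(True) for _ in range(n_wed)]
    other = [video_engagement(False) for _ in range(n_other)]
    return statistics.fmean(wed) > statistics.fmean(other)

detection = {}
for n_wed, n_other in ((6, 30), (50, 250), (500, 2_500)):
    wins = sum(wednesday_looks_better(n_wed, n_other) for _ in range(1_000))
    detection[n_wed] = wins / 1_000
    print(f"{n_wed:>4} Wednesday posts: the real +8% effect shows up "
          f"in {100 * detection[n_wed]:.0f}% of simulated runs")
```

With six Wednesday posts, the real effect points in the right direction barely more often than a coin flip would. That is what "underpowered" means.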

The Signal-to-Noise Ratio: A Practical Framework

The relationship between sample size, signal strength, and noise can be thought of as a signal-to-noise ratio (SNR). The higher the SNR, the easier it is to detect the truth with a small sample. The lower the SNR, the more data you need.

For different domains, the SNR varies enormously:

High SNR domains (small samples sometimes sufficient):

  • Drug A vs. Drug B when one genuinely works and the other is a placebo: the effect size is often large relative to individual variation in outcomes
  • Physical constants (speed of light, gravitational constant): extremely low noise in properly controlled measurements
  • Obvious quality differences (comparing a polished professional resume to a sloppy first draft)

Low SNR domains (large samples almost always necessary):

  • Social media content performance: enormous variation in individual post outcomes; real effects from strategy are small
  • Financial returns: monthly returns vary enormously; distinguishing skill from luck requires decades of data
  • Educational interventions: student outcomes are highly variable; detecting a 10% improvement in test scores requires many students
  • Sports performance in small samples: a basketball player's "slump" or "streak" in a single week means almost nothing

Nadia is operating in a very low SNR domain. The honest implication is that she needs far more data than feels comfortable — and even then, she should remain humble about what the data actually proves.


The Hot Hand, Reloaded: Sample Size and the Illusion of Streaks

In Chapter 4, we introduced the hot hand fallacy — the belief that in random processes, recent successes make future successes more likely. Now we can add a critical layer: even in skill-influenced processes, the hot hand is often an artifact of small samples rather than a real phenomenon.

The original hot hand study (Gilovich, Vallone, and Tversky, 1985) analyzed basketball shooting data: NBA field goal sequences, professional free throws, and a controlled shooting experiment with college players. They found that after a made shot, the probability of making the next shot was essentially unchanged, and sometimes slightly lower, not higher. They concluded the hot hand was a cognitive illusion.

For years, this was accepted as settled science. Then something interesting happened.

In 2015, Miller and Sanjurjo reanalyzed the original data and found a mathematical flaw in the original analysis. When you condition on streaks within a finite sequence, you introduce a subtle bias that makes the hot hand artificially look absent. After correcting for this bias, they found evidence for a genuine hot hand effect in the controlled shooting data: small, but real.

What does this saga teach us? Several things:

  1. Even expert statisticians can analyze small samples incorrectly.
  2. The hot hand, when real, is small — much smaller than intuition suggests.
  3. In genuinely random processes, the hot hand is still entirely illusory.
  4. The difference between "statistically significant hot hand in skill tasks" and "the hot hand explains this streak I'm watching" is enormous.

For Nadia, this means: even if there's a real pattern in her content performance — some posting strategy genuinely works better — six weeks of data is far too small to distinguish that real pattern from random variation. Her brain will find the pattern whether or not it's real.

Research Spotlight: The Hot Hand Revisited

Miller, J.B., & Sanjurjo, A. (2018). "Surprised by the Hot Hand Fallacy? A Truth in the Law of Small Numbers." Econometrica.

Miller and Sanjurjo identified a systematic mathematical bias in studies that deny the hot hand. In a finite sequence of binary outcomes (hit/miss), conditioning on the event "the last k shots were hits" creates a selection bias: the shot immediately following a streak is drawn from a systematically harder-to-sample set of positions in the sequence. This creates an apparent cold hand even when performance is truly independent.

After correcting for this bias, they found evidence of a modest genuine hot hand in basketball shooting data. Crucially, they did not find an enormous hot hand. Streaks are partially real in skill tasks, but the degree is far smaller than human perception suggests.

The takeaway for luck science: the relationship between streaks, skill, and randomness is subtler than either the "hot hand exists" or "hot hand doesn't exist" camps claimed. Small samples make it essentially impossible to sort out.
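The Miller-Sanjurjo bias is subtle enough to be worth simulating for yourself. This sketch uses a fair coin, so any deviation from 0.5 is pure selection bias, not a hot hand:

```python
import random

random.seed(9)  # reproducible run

def after_heads_proportion(seq):
    """Within one sequence: share of heads among flips that follow a head."""
    follows = [seq[i + 1] for i in range(len(seq) - 1) if seq[i] == 1]
    return sum(follows) / len(follows) if follows else None

props = []
for _ in range(200_000):
    seq = [random.randint(0, 1) for _ in range(4)]  # four fair flips
    p = after_heads_proportion(seq)
    if p is not None:  # skip sequences with no head to condition on
        props.append(p)

avg = sum(props) / len(props)
print(f"Sequences used: {len(props)}")
print(f"Average within-sequence proportion of heads after a head: {avg:.3f}")
```

For sequences of four flips, the within-sequence average lands near 0.40 rather than 0.50. An analysis that treats 0.50 as the no-hot-hand baseline would therefore wrongly "refute" a real hot hand of that size.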


Social Media Metrics as Small Samples

Let's spend time on Nadia's situation, because it illustrates the small-sample problem with unusual clarity.

Social media content creators are drowning in data and starving for insight. Every platform provides dashboards full of metrics: views, engagement rate, reach, watch time, saves, shares, follower growth, click-through rates, and dozens more. All of this data arrives quickly — you can see how a video is performing within hours of posting.

The problem is that this speed creates an illusion of knowledge. The data arrives fast. That doesn't mean it's reliable.

Consider a typical content creator posting 3-4 times per week. In six weeks, they've posted roughly 18-24 pieces of content. Across those pieces, they've probably tried several different:

  • Posting times (morning, evening, lunch hour)
  • Content formats (talking head, B-roll, text overlay, duet)
  • Hook styles (question, controversial statement, story, statistic)
  • Topic categories (however many they cover)

If they're tracking 4 variables with 2-3 options each, each option of any single variable gets perhaps 6-12 posts, and each full combination of options gets far fewer, often one or zero. With that many cells and that few data points per cell, the law of large numbers hasn't had time to work. Any pattern Nadia finds in 6 weeks has a reasonable probability of being a spurious coincidence.

The multiple comparisons problem makes this worse. If Nadia tests 30 different variables for correlation with views (as any creator reflexively does when staring at a dashboard), she should expect, by pure chance, to find 1-2 "statistically significant" correlations at the 5% level even if absolutely nothing she does matters — because 5% × 30 = 1.5 false positives on average.

This is the notorious problem of data dredging or p-hacking: when you look at many variables simultaneously, you guarantee finding patterns that aren't real. And on a content analytics dashboard with dozens of metrics and hundreds of possible combinations, data dredging is essentially automatic and invisible.

What would it actually take to reliably detect a real posting-time effect?

If the true effect of Wednesday 7 p.m. posting is a 15% boost in engagement, and individual video performance varies by a factor of 3x or 4x (which is typical in content creation), you'd need many weeks of controlled data to detect the effect reliably. Exactly how many depends on the variance structure, but rough calculations typically yield numbers in the range of 50-100 posting instances per condition — which, for a weekly posting schedule, means 1-2 years of data per posting time.

That's not Nadia's six weeks. It's not even close.

A Framework for Content Creators: What Data Actually Tells You

Given the small-sample problem, what should content creators actually do with their analytics dashboards? Here is a practical tiered framework:

Tier 1: Strong signal, even in small samples (act on these)

  • Dramatic drops in total views after a major format change (if views fall by 80%, not 10%)
  • Direct audience feedback in comments that reveals what they want or dislike
  • Completion rate data that differs dramatically by video length (2-minute videos watched at 90% vs. 10-minute videos watched at 20%)

These signals are strong because the effect sizes are large, the feedback is direct, or the comparison is clear-cut.

Tier 2: Moderate signal, requires validation (form hypothesis, test explicitly)

  • A particular hook style that consistently produces higher first-3-second retention rates across 15+ videos
  • A topic category that generates significantly more saves or shares than others
  • A format that seems to have consistently lower drop-off rates

These are worth noticing, but should be explicitly tested — meaning Nadia should predict in advance which videos will perform better, then check whether she's right.

Tier 3: Noise, ignore it (even though it's tempting)

  • Day-of-week posting time patterns across fewer than 50 data points
  • Hashtag performance differences in any dataset under 100 posts
  • Whether videos with a specific filter or color palette perform better

These are almost certainly data artifacts. The sample is too small and the effect sizes too small for the signal to overcome the noise.

Nadia, after hearing Dr. Yuki's explanation, opens her spreadsheet and starts color-coding her data by tier. Most of what she's been optimizing is Tier 3. This is uncomfortable to confront. It's also, she thinks, genuinely useful to know.


Lucky Break or Earned Win?

Nadia's Wednesday Effect

After six weeks of posting, Nadia is convinced she's found a pattern: Wednesday 7 p.m. posts outperform everything else. She builds her whole content calendar around it.

Six months later, the "Wednesday effect" has completely disappeared. Her calendar feels rigid. She's missed opportunities to post during trending moments because she was saving content for Wednesdays.

Was the original Wednesday pattern a lucky break (random variation that looked like a signal) or an earned win (genuine discovery through analysis)?

Consider: What does the law of large numbers suggest about how confident she should have been after only six weeks? What would a better approach have looked like? How could she have tested her hypothesis before committing to a strategy based on it?


The Sample Size Question: When Can You Trust a Pattern?

This is the practical question everyone wants answered. Here's a framework for evaluating whether a pattern you observe is likely to be real.

Step 1: How variable is what you're measuring?

Some things are relatively stable: the boiling point of water, the average height of a population, the speed of light. Some things are extremely variable: individual sales performance, stock prices, video view counts, mood ratings. The more variable the thing you're measuring, the more data you need before the law of large numbers kicks in and averages out the noise.

Step 2: How large is the effect you're looking for?

Large effects are easier to detect in small samples. If a new cancer drug increases survival rates by 50%, you'll see it in a relatively small trial. If it increases survival by 2%, you need an enormous trial. For social media, a 10% improvement in engagement rate is easily swamped by natural variation. A 200% improvement would be visible quickly.

Step 3: How many variables are you testing simultaneously?

As we've seen, testing many variables simultaneously creates false positives. If you're looking at a single pre-specified hypothesis ("Does Wednesday 7 p.m. outperform everything else?"), your threshold for evidence is different than if you're looking at a hundred variables and finding the best-performing one.

Step 4: Are you pre-registering or post-hoc explaining?

This is the critical question. Pre-registration means stating your hypothesis before you collect the data. Post-hoc analysis means fitting a story to data you've already seen. The latter is much more likely to produce false patterns.

Step 5: Has the pattern held up in new data?

The gold standard is out-of-sample validation. Does the pattern you found in the first six weeks show up in the next six weeks, when you're explicitly looking for it? If your Wednesday effect holds up across twelve separate Wednesdays, that's evidence. If it disappears, you've learned it was noise.

A rough rule of thumb that experienced researchers use: treat any pattern you've found in a small sample as a hypothesis to test, not a conclusion to act on.

Marcus Tests His Startup Data

Marcus has been listening to this conversation with growing discomfort. His pitch deck contains a slide titled "Revenue Momentum" that shows three months of climbing revenue. He has been treating this as proof that his strategy is working.

After the library session with Dr. Yuki, he sits alone at his laptop and applies the five-step framework to his own data.

Step 1 — How variable is revenue? For an early-stage app with a small user base, very variable. A single good month can look like a trend.

Step 2 — How large is the effect? Month-over-month growth of 18%, 22%, and 31%. That looks large. But with a small starting base (the app went from 180 paying users to 340 over three months), those percentages involve small absolute numbers. At this scale, a single corporate client or a single mention from a moderately followed Twitter account could produce an 18% jump.

Step 3 — How many variables am I testing? He's tracking five different growth drivers: word-of-mouth, his Instagram posts, a partnership with a chess club federation, direct outreach, and organic app store discovery. Five variables, three months of data. That's barely one data point per variable-month combination.

Step 4 — Pre-registered or post-hoc? Post-hoc, entirely. He noticed the growth first, then decided what was causing it.

Step 5 — Out-of-sample validation? He hasn't tried to predict anything yet.

His conclusion: three months of upward revenue is consistent with a genuine growth trend, but it is not strong evidence for one. He cannot distinguish between "my strategy is working" and "I got lucky with a few good months." He needs to actually predict what next month will look like before he can start treating the trend as real.

He types at the bottom of his pitch deck slide: "Revenue trend: 3 months. Interpreting with caution per LLN." He'll explain that note in the room. It's more intellectually honest than what was there before.


Medical Research and Underpowered Studies

The small-sample problem isn't just about content creators and investors. It's one of the defining crises in modern science.

Between 2010 and 2020, a series of high-profile failures revealed that a substantial fraction of published psychology studies couldn't be replicated when researchers tried to repeat them. A 2015 project coordinated by the Open Science Collaboration attempted to replicate 100 published psychology studies. Only 36 of the replications produced statistically significant results, compared with 97 of the original studies.

The single most important explanation: underpowered studies.

The original studies, published in prestigious journals, often used small samples — 20, 30, 50 participants. Their statistical analyses found significant effects. But with small samples, two things happen simultaneously:

  1. You're more likely to miss a real effect (false negative — low power)
  2. You're more likely to find a false effect (false positive — the other side of the same problem)

This seems paradoxical. How can small samples increase both types of error? The answer is that small samples produce noisy estimates. Your estimate of the effect size swings wildly. In a large sample, the estimate stabilizes near the truth. In a small sample, it might be anywhere.

The statistician Andrew Gelman has called this the "winner's curse in science": studies that just barely reach statistical significance with small samples are likely to have accidentally overestimated the effect size. The studies that get published (because significant results are more publishable) are thus systematically biased toward overstated effects. When you try to replicate them with proper power, the effect shrinks or vanishes.

The practical lesson: when you read about a study with a striking finding, one of your first questions should be: how large was the sample? A finding based on 30 college students at one university is epistemically very different from a finding based on 3,000 people across multiple countries.


Research Spotlight: Power Failures in Neuroscience

Button, K.S., et al. (2013). "Power failure: Why small sample size undermines the reliability of neuroscience." Nature Reviews Neuroscience.

Button and colleagues conducted a systematic review of neuroscience studies and found that the median statistical power — the probability of detecting a real effect if one exists — was approximately 21%. In other words, the typical neuroscience study had roughly a one-in-five chance of finding a real effect if it existed.

The consequences are severe: low-powered studies that do find significant effects are likely to be finding noise. And because those studies get published while null results don't (publication bias), the literature fills with false positives.

The paper concluded that neuroscience was in a statistical crisis that required larger samples, pre-registration of hypotheses, and much greater skepticism about small-sample findings.

Crucially, this is not unique to neuroscience. Similar analyses have been run in psychology, medicine, economics, and epidemiology, with similar findings. The small-sample problem is pervasive.


The Law of Large Numbers in Financial Markets

No discussion of small samples and the law of large numbers would be complete without examining financial markets — one of the most consequential domains where the small-sample problem plays out.

Active fund managers — people who pick stocks and try to beat a market index — have been the subject of decades of research. The finding is consistent and troubling: most active fund managers underperform simple index funds over the long run. But in any given year, a substantial fraction of active managers will outperform the index.

Here is where the small-sample problem becomes directly consequential.

If a fund manager beats the market for three years in a row, investors flock to them. The "hot hand" is visible and compelling. Magazine covers celebrate their genius. But three years of data is, statistically speaking, barely enough to distinguish skill from luck in financial markets — which are among the noisiest environments in existence.

Research by economists Laurent Barras, Olivier Scaillet, and Russ Wermers found that when they analyzed the returns of over 2,000 mutual fund managers over long periods (15+ years), the fraction who showed evidence of genuine skill, rather than luck, was approximately 0.6%. Nearly every manager who appears to have beaten the market has done so through random variation rather than superior stock-picking ability.

The law of large numbers, applied to markets, has a brutal implication: the expected value of active management is negative (after fees). The more years of data you have, the more clearly this emerges. Short-run data creates the illusion of skill. Long-run data largely dissolves it.

This does not mean no one has genuine investment skill. It means that distinguishing genuine skill from luck requires far more data than investors typically demand — and that the financial industry has powerful incentives to keep investors focused on short-term performance rather than the long-run picture.

A Worked Example: Three-Year Returns and What They Mean

Suppose a fund manager posts these annual returns relative to the market index:

  • Year 1: +4% above index
  • Year 2: +6% above index
  • Year 3: +3% above index

Three years of outperformance. Should you invest?

Let's think about this probabilistically. In any given year, roughly 50% of active managers outperform the index (by definition, before fees, the average must equal the index). The probability that any specific manager beats the index in all three years by pure chance is roughly 0.5³ = 0.125, or about 12.5%.

That's not negligible. If there are 1,000 active fund managers in the market, we'd expect about 125 of them to have three-year outperformance records purely by luck. The law of large numbers hasn't had enough time to sort the skilled 0.6% from the lucky 12.5%.
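The 125-lucky-managers arithmetic is easy to check by simulation. This sketch assumes a zero-skill world in which every manager is a coin flip against the index:

```python
import random

random.seed(17)  # reproducible run

N_MANAGERS = 1_000
YEARS = 3

# Zero-skill world: every manager has exactly a 50% chance of beating
# the index each year, independently of everything else.
lucky_streaks = sum(
    all(random.random() < 0.5 for _ in range(YEARS))
    for _ in range(N_MANAGERS)
)
print(f"Managers with a perfect {YEARS}-year record, zero skill anywhere: "
      f"{lucky_streaks} of {N_MANAGERS}")
```

Roughly an eighth of the managers post a perfect three-year record despite having no skill at all.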

To see a fund manager's true alpha emerge from the noise, research suggests you'd need data from at least 20–30 years — a period long enough that sustained luck becomes vanishingly unlikely. Of course, very few fund managers are around for 30 years, which makes genuine skill essentially impossible to verify during a typical investor's relationship with them.

The lesson: the financial industry sells three-year track records as evidence of skill. The law of large numbers says three years is a very short squiggly line.


Python Simulation Preview: Coin Flip Convergence

One of the most powerful ways to understand the law of large numbers is to watch it happen. Here's a simple Python simulation that shows how the observed proportion of heads converges toward 0.5 as the number of flips increases.

import random
import matplotlib.pyplot as plt

def simulate_coin_convergence(total_flips=10000):
    """
    Simulate coin flips and plot how the running
    proportion of heads converges to 0.5.
    """
    flips = [random.choice([0, 1]) for _ in range(total_flips)]

    # Calculate running proportion of heads
    running_heads = []
    running_total = []
    heads_count = 0

    for i, flip in enumerate(flips, 1):
        heads_count += flip
        running_heads.append(heads_count / i)
        running_total.append(i)

    # Plot
    plt.figure(figsize=(12, 6))
    plt.plot(running_total, running_heads, color='steelblue',
             linewidth=1, label='Observed proportion of heads')
    plt.axhline(y=0.5, color='red', linestyle='--',
                linewidth=2, label='True probability (0.5)')

    # Mark the "small sample" zone
    plt.axvspan(0, 50, alpha=0.15, color='orange',
                label='Small sample zone (n < 50)')

    plt.xlabel('Number of flips', fontsize=12)
    plt.ylabel('Proportion of heads', fontsize=12)
    plt.title('Law of Large Numbers: Coin Flip Convergence', fontsize=14)
    plt.legend()
    plt.ylim(0, 1)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

    # Print summary statistics
    print(f"Proportion of heads after 10 flips: {running_heads[9]:.3f}")
    print(f"Proportion of heads after 100 flips: {running_heads[99]:.3f}")
    print(f"Proportion of heads after 1,000 flips: {running_heads[999]:.3f}")
    print(f"Proportion of heads after 10,000 flips: {running_heads[9999]:.3f}")
    print("True probability: 0.500")

# Run the simulation
simulate_coin_convergence(10000)

When you run this, you'll see something striking: the early part of the graph is chaotic. The proportion bounces wildly — maybe 80% heads after 10 flips, then 60% after 20, then 45% after 30. It looks like meaningful variation. It's not. As the simulation progresses, the line gradually smooths out and settles close to the red dashed line at 0.5.

What to notice:

  • In the first 50 flips (orange zone), the proportion can be far from 0.5 purely by chance
  • Even after 500 flips, there can be visible deviations
  • Only after thousands of flips does the line become reliably close to 0.5
  • Every time you run the simulation, the early chaos looks different — confirming it's random

This is precisely what happens to Nadia's Wednesday data. Six weeks of posts is living in that orange zone. The signal she thinks she sees is mostly chaos.

Extending the Simulation: Multiple Paths

The following extension shows that the chaos in the early phase is not a bug — it's a feature of small samples. Every random path looks different. Every one of them eventually converges:

import random
import matplotlib.pyplot as plt

def simulate_multiple_paths(n_paths=10, total_flips=1000):
    """
    Show multiple independent coin-flip paths converging to 0.5.
    Illustrates that each path is different but all converge.
    """
    plt.figure(figsize=(14, 7))

    for path in range(n_paths):
        flips = [random.choice([0, 1]) for _ in range(total_flips)]
        running_proportion = []
        heads_count = 0

        for i, flip in enumerate(flips, 1):
            heads_count += flip
            running_proportion.append(heads_count / i)

        plt.plot(range(1, total_flips + 1), running_proportion,
                 alpha=0.5, linewidth=0.8)

    plt.axhline(y=0.5, color='black', linestyle='--',
                linewidth=2, label='True probability (0.5)', zorder=5)
    plt.axvspan(0, 50, alpha=0.1, color='red', label='Small sample zone')

    plt.xlabel('Number of flips')
    plt.ylabel('Running proportion of heads')
    plt.title(f'Law of Large Numbers: {n_paths} Independent Paths')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

simulate_multiple_paths(n_paths=10, total_flips=1000)

Each of the ten colored lines represents a completely independent sequence of coin flips. In the small-sample zone on the left, the lines are everywhere — some near 0.2, some near 0.8, all different. As they move right, they begin to cluster around the true value. The spread narrows. The chaos subsides.

This is what patience looks like, mathematically. Not one path correcting, but all paths, given enough time, finding their way to the truth.


When Small Samples Are All You Have: Acting Wisely Under Uncertainty

A fair objection to everything in this chapter: sometimes, you simply cannot wait for a large sample. Decisions must be made. You can't postpone a job application for two years while you gather enough data to calculate the true base rate for your specific skill set at your specific career stage in your specific industry.

This is the real-world constraint that the law of large numbers cannot dissolve. Small samples are often all we have. The question is not "should I wait for a large sample?" — often the answer is no, you cannot. The question is "given that I have a small sample, how should I act differently than I would if I had a large one?"

Several practical shifts follow from accepting small-sample reality honestly:

Shift 1: Widen your confidence intervals.

When you have a large sample, you can be precise. When you have a small sample, you must be humble. If three months of data suggest your startup's monthly growth rate is 20%, a large-sample estimate might produce a 95% confidence interval of 18–22%. A small-sample honest estimate might produce 5–40%. Act accordingly: plan for the wide range, not just the central estimate.
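The widening can be quantified with the standard error of a mean. A sketch using the normal-approximation 95% interval; the 10-point standard deviation is an assumed illustrative value, not anyone's actual data:

```python
import math

def ci_half_width(std_dev, n, z=1.96):
    """Half-width of a ~95% normal-approximation confidence interval for a mean."""
    return z * std_dev / math.sqrt(n)

# Monthly growth rate estimate with a standard deviation of 10 percentage points
for n_months in [3, 12, 48]:
    hw = ci_half_width(10.0, n_months)
    print(f"n = {n_months:2d} months -> estimate 20% +/- {hw:.1f} points")
```

Three months of data leaves the interval twice as wide as a full year's worth, and four times as wide as four years' worth. The honest response is to plan for the whole range.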

Shift 2: Favor reversible decisions.

When your evidence base is thin — which it always is in a small-sample environment — the cost of being wrong matters more. Prefer decisions that can be undone if new evidence shifts your view. Nadia should experiment with Wednesday posts as one element of her strategy, not restructure her entire content calendar around a six-week pattern that might be noise.

Shift 3: Seek better evidence, not just more of the same.

Sometimes you cannot get more of the existing data type quickly. But you might be able to get a different type of evidence that's more informative. Marcus cannot generate three more years of startup revenue in three months. But he can talk to ten customers in depth and ask why they subscribed — qualitative evidence that is less statistically rigorous but potentially more diagnostic about what's actually working.

Shift 4: Be explicit about your uncertainty when communicating.

When sharing conclusions based on small samples — in a pitch deck, a conversation with a mentor, a strategy document — be transparent. "Based on three months of data, which may not yet be reliable, the trend suggests X" is more accurate than "our data shows X." This is not weakness. It is epistemic honesty that experienced listeners will respect.

Shift 5: Plan for the next data collection point.

Instead of treating current data as the final word, treat it as the first word and plan explicitly for when you will revisit the question with more data. "I'll check whether this Wednesday pattern holds after three months of deliberately testing it" is a better posture than "I've confirmed this pattern, moving on."

Dr. Yuki talks about this in the context of poker. In a game with limited information and high variance, you cannot always wait for certainty before acting. But professionals act while holding the uncertainty explicitly in their minds — they know their read might be wrong, they size their bets accordingly, and they learn from outcomes with appropriate humility. That combination — acting decisively while remaining genuinely humble about your evidence base — is the stance the law of large numbers demands.


Building Better Intuitions About Sample Size

The law of large numbers gives us permission to distrust our early results. But it also gives us a roadmap for getting better results. Here are the practical principles that follow from understanding it properly.

Principle 1: Treat early results as hypotheses, not conclusions.

When you see a pattern in a small sample — your content performs better on certain days, your startup converts better with certain copy, your studying works better at certain times — treat it as a hypothesis worth testing, not a conclusion worth acting on. The law of large numbers hasn't had time to work.

Principle 2: Increase sample size before making irreversible decisions.

If you're going to make a significant, hard-to-reverse decision based on data (changing your entire content strategy, pivoting your startup, choosing a major), try to collect more data first. The convergence guaranteed by the law of large numbers is your friend — but only if you let it run.

Principle 3: Beware of "confirming" patterns by looking for them.

Once you believe a pattern exists (Wednesday posts do better!), you'll interpret subsequent evidence in light of that belief. This is confirmation bias layered on top of small-sample noise. If a Wednesday post does well, it confirms the belief. If it does poorly, it was an anomaly. You need to pre-commit to how you'll interpret results before you see them.

Principle 4: Compare your sample size to the variance in what you're measuring.

The relevant question isn't "is my sample big?" but "is my sample big relative to the variance in what I'm measuring?" For stable, low-variance outcomes, a small sample suffices. For volatile, high-variance outcomes (like content performance, startup revenue, or individual academic grades), you need much larger samples.
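This principle has a standard formula behind it: to estimate a mean to within a margin of error E at roughly 95% confidence, you need about n = (1.96·σ/E)² observations, where σ is the standard deviation of individual observations. A sketch with illustrative numbers:

```python
import math

def required_n(std_dev, margin, z=1.96):
    """Sample size needed to estimate a mean to within +/- margin at ~95% confidence."""
    return math.ceil((z * std_dev / margin) ** 2)

# Low-variance outcome, estimated to within 2 units: a small sample suffices
print(required_n(std_dev=5, margin=2))
# High-variance outcome (e.g. per-post engagement), same precision target
print(required_n(std_dev=50, margin=2))
```

Multiplying the standard deviation by ten multiplies the required sample size by a hundred. That is why volatile metrics like content performance demand so much more data than stable ones.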

Principle 5: External replications are worth more than internal ones.

If your pattern holds up in a completely new context — different week, different platform, different audience segment — that's much stronger evidence than the pattern holding up in continued monitoring of the same dataset.


Research Spotlight: The Replication Crisis and What It Teaches Us

Open Science Collaboration (2015). "Estimating the Reproducibility of Psychological Science." Science.

This project coordinated 270 researchers across multiple institutions to attempt to replicate 100 published psychological experiments. The results were alarming: only 36% of the replications produced statistically significant results, compared to 97% of the original studies.

The causes were multiple, but underpowered original studies were central. Many original studies used 20–50 participants; many replication attempts used 150–300. The larger samples in replications were more likely to tell the truth — and they often found that the original, celebrated effects were smaller or absent.
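The impact of those sample sizes on statistical power can be sketched with a normal approximation for a two-sample comparison of means. The effect size d = 0.4 is an assumed illustrative value, not taken from any particular study:

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def approx_power(d, n_per_group, alpha_z=1.96):
    """Approximate power of a two-sample test for effect size d (normal approximation)."""
    return normal_cdf(d * math.sqrt(n_per_group / 2) - alpha_z)

for n in [25, 150]:
    print(f"n = {n:3d} per group, d = 0.4: power ~ {approx_power(0.4, n):.2f}")
```

With 25 participants per group, a real medium-small effect is detected well under half the time; with 150 per group, detection is nearly assured. Underpowered studies both miss real effects and, when they do reach significance, tend to exaggerate them.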

What this means for everyday probabilistic thinking: the findings you've read about in popular science articles — power poses changing hormone levels, certain priming effects, ego depletion — have mostly failed to replicate under larger, better-controlled conditions. The small-sample bias affected not just individual researchers but the entire publication system, which preferentially published surprising findings that turned out to be noise.

The replication crisis is the law of large numbers taking revenge on an entire scientific field. The truth is eventually emerging — but it took decades and required deliberately running the samples much, much larger.


The Deep Lesson: What the Law of Large Numbers Teaches About Luck

We're in a book about luck. So why spend an entire chapter on a statistical theorem?

Because the law of large numbers is the hidden engine behind many of the things people attribute to luck, skill, intuition, or strategy.

When Nadia notices that her best-performing videos came in weeks when she was most consistent — and concludes that consistency causes performance — she might be right. Or she might be observing a small-sample coincidence. The law of large numbers hasn't had time to adjudicate. She can't tell the difference from inside the data.

When Marcus notices that his startup had three exceptional months in a row and concludes that his strategy is working — we'll encounter this directly in Chapter 8 — he might be right. Or he might be in a lucky stretch that will regress to the mean. Six months is not enough data to tell the difference.

When a fund manager posts three years of above-market returns and concludes they've beaten the market through skill — we'll see in Chapter 9 that survivorship bias is layered on top — they might be skilled. Or they might be one of the lucky ones in a large group of managers, and the law of large numbers simply hasn't run long enough to prove the difference.

In each case, the principle is the same: we make observations in short windows and draw conclusions that the data can't yet support. The law of large numbers tells us how many observations we need before we can start trusting patterns. And that number is almost always larger than we'd like.

This is humbling. But it's also liberating. If small samples lie, then the bad run you're on right now — the content not performing, the applications not landing, the chess games going wrong — is probably less meaningful than it feels. And the hot streak that feels like breakthrough clarity might be less conclusive than you'd like.

The law of large numbers teaches patience. Not passive patience, but active, data-collecting patience. Keep going. Increase the sample. Let the truth emerge.

What This Means for "Luck Is Not a Force. It's an Outcome."

Recall the foundational claim from Chapter 1: luck is not a force. It's an outcome.

The law of large numbers adds a crucial layer to this. Not only is luck an outcome rather than a mystical force — it's an outcome whose true probability can only be revealed over many, many trials. In the short run, luck looks like skill. In the short run, skill looks like luck. The law of large numbers is the mechanism by which the distinction eventually becomes clear.

This is why professional poker players — Dr. Yuki among them — must track results over thousands of hands, not dozens. In a single session, a great player can lose badly through no fault of their own. A weak player can win through pure variance. The poker pros who treat short sessions as meaningful evidence about their own skill level are making the same error as Nadia with her Wednesday effect.

The discipline of knowing that a small sample doesn't tell you the truth — and refusing to act as though it does — is one of the highest-value intellectual skills that probability education can develop. It is also among the hardest, because our brains are built to find patterns and act on them quickly. The law of large numbers asks us to do something profoundly counter-intuitive: sit with uncertainty rather than act on premature signal.

That is not passivity. It is epistemic discipline. And it is, in a real sense, one of the foundations of luck science.


Myth vs. Reality: Streaks and What They Mean

Myth: "If I've had a bad streak, something must be wrong with my strategy."

Reality: In any high-variance process — job applications, content creation, startup sales — a run of bad outcomes is expected even when your strategy is sound. The law of large numbers predicts this. Before concluding your strategy has failed, ask: how long would a bad streak last by pure chance, even for someone with a good process? For most high-variance activities, a two- or three-week bad stretch is well within normal random variation.
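That question has a computable answer. A sketch, assuming a sound process with a 20% success rate per attempt and one attempt per week for a year (both numbers are illustrative):

```python
import random

def longest_losing_streak(p_success=0.2, n_trials=52, seed=None):
    """Longest run of consecutive failures in n_trials independent attempts."""
    rng = random.Random(seed)
    longest = current = 0
    for _ in range(n_trials):
        if rng.random() < p_success:
            current = 0  # a success resets the streak
        else:
            current += 1
            longest = max(longest, current)
    return longest

# Average longest dry spell across many simulated years
streaks = [longest_losing_streak(seed=s) for s in range(2000)]
print(sum(streaks) / len(streaks))
```

Even with a genuinely good 20% process, the longest dry spell in a typical year runs well past two or three weeks. A bad stretch is not, on its own, evidence that the process is broken.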

Myth: "My success rate has been consistent for six months — that's reliable data."

Reality: Six months might be reliable or unreliable depending on what you're measuring. Six months of data from a high-volume process (a restaurant serving 200 customers per day generates 36,000 data points in six months) is very different from six months of data in a low-volume process (a content creator posting three times per week generates 72 data points). Volume matters as much as time elapsed.
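The gap in precision can be made explicit with the standard error of a proportion. A sketch, assuming a 50% underlying rate purely for illustration:

```python
import math

def standard_error(p, n):
    """Standard error of an observed proportion from n observations."""
    return math.sqrt(p * (1 - p) / n)

restaurant_n = 200 * 30 * 6   # 200 customers/day over six months
creator_n = 3 * 24            # 3 posts/week over ~24 weeks

print(f"Restaurant (n={restaurant_n}): +/- {standard_error(0.5, restaurant_n):.4f}")
print(f"Creator    (n={creator_n}): +/- {standard_error(0.5, creator_n):.4f}")
```

Roughly ±0.3 percentage points for the restaurant versus ±6 points for the creator: the same six calendar months, wildly different evidential weight.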

Myth: "Randomness averages out quickly."

Reality: This is perhaps the most pervasive misconception. For common events with moderate variance, randomness does average out relatively quickly. For rare events, extreme outcomes, or processes with very high variance, "averages out" can mean thousands or millions of trials. The casino does not average out after a few hours of play. The social media algorithm does not average out after a few weeks of posts.


The Luck Ledger

What this chapter gave you: The law of large numbers — precisely stated — tells you that observed averages converge to true probabilities as samples grow. Small samples are dominated by variance, not signal. Almost every pattern you observe in a small sample has a meaningful probability of being coincidental. This applies to Nadia's content analytics, Marcus's startup revenue, fund managers' track records, and published scientific research alike.

What's still uncertain: How large is "large enough"? The answer depends on the variance of what you're measuring, the size of the effect you're looking for, and how many hypotheses you're testing. There is no universal sample size. But the principle is unambiguous: whatever sample you have right now, more is almost always better before making irreversible decisions.


Chapter Summary

  • The law of large numbers states that observed averages converge to true expected values as the number of trials increases. It says nothing about individual outcomes.
  • The weak law says convergence to the truth becomes arbitrarily probable; the strong law says it happens with probability 1.
  • Convergence is slow — especially for rare events or high-variance processes. Numerical examples illustrate that even after 100 coin flips, estimates can be several percentage points off.
  • The signal-to-noise ratio determines how much data you need: high-SNR domains (large effects, low variance) reveal truth quickly; low-SNR domains (small effects, high variance) require enormous samples.
  • Small samples lie because variance in individual observations dominates the signal in any real effect you're trying to detect.
  • The hot hand, in random processes, is entirely illusory. In skill-influenced processes, it exists but is far smaller than intuition suggests.
  • Social media analytics are particularly susceptible to small-sample illusions because data arrives fast, variation is high, and the temptation to find patterns is enormous. A practical three-tier framework helps creators distinguish actionable signals from noise.
  • Financial markets illustrate the same principle at high stakes: three years of outperformance is a very short squiggly line.
  • Medical and psychological research has suffered enormously from underpowered (small-sample) studies, contributing to the replication crisis — which is the law of large numbers taking revenge on an entire scientific field.
  • The Python simulation shows coin flip convergence visually; the multiple-paths extension shows that every random sequence eventually finds its way to the true probability, given enough time.
  • Practical response: treat small-sample patterns as hypotheses, not conclusions. Increase sample size before acting. Pre-register your hypotheses before you collect data.
  • The deep lesson for luck science: in the short run, luck looks like skill and skill looks like luck. Only the long run — the large sample — reveals the difference. Epistemic patience is not passivity. It is one of the foundations of reasoning honestly about luck.