Key Takeaways: Power, Effect Sizes, and What "Significant" Really Means

One-Sentence Summary

Statistical significance tells you whether a result is unlikely under the null hypothesis, but it cannot tell you whether the result is important — for that, you need effect sizes, confidence intervals, and power analysis, which together form the complete toolkit for evaluating evidence.

Core Concepts at a Glance

| Concept | Definition | Why It Matters |
|---|---|---|
| Effect size | A standardized measure of the magnitude of a phenomenon, independent of sample size | Tells you how big the effect is, not just whether it exists |
| Cohen's d | Difference between group means divided by the pooled SD; measures separation in standard deviation units | The go-to effect size for two-group comparisons; small $\approx$ 0.2, medium $\approx$ 0.5, large $\approx$ 0.8 |
| $r^2$ (proportion of variance) | Proportion of total variance explained by the group variable: $r^2 = t^2 / (t^2 + df)$ | Puts effect sizes in humbling perspective: even "large" effects explain only ~14% of variance |
| Statistical power | $P(\text{reject } H_0 \mid H_0 \text{ is false}) = 1 - \beta$; probability of detecting a real effect | Studies with low power miss real effects and overestimate those they do find |
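
One way to see the sample-size independence claimed above: replicate the same two (hypothetical) samples 50 times over. The p-value collapses as $n$ grows, but Cohen's d barely moves. The data below are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements (illustrative numbers only)
base1 = np.array([5.1, 5.5, 4.9, 5.3, 5.2, 4.8])
base2 = np.array([4.9, 5.2, 4.7, 5.0, 5.1, 4.6])

def cohens_d(g1, g2):
    n1, n2 = len(g1), len(g2)
    sp = np.sqrt(((n1-1)*g1.var(ddof=1) + (n2-1)*g2.var(ddof=1)) / (n1+n2-2))
    return (g1.mean() - g2.mean()) / sp

for k in (1, 50):                       # same data, tiled 1x and 50x
    g1, g2 = np.tile(base1, k), np.tile(base2, k)
    t, p = stats.ttest_ind(g1, g2)
    print(f"n = {len(g1):3d}  d = {cohens_d(g1, g2):.2f}  p = {p:.2g}")
```

At $n = 6$ per group the difference is not significant; at $n = 300$ it is overwhelmingly so, even though the effect size is essentially unchanged.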

The Key Formulas

Cohen's d (Two Independent Groups)

$$\boxed{d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \quad s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}}$$

$r^2$ from a t-test

$$\boxed{r^2 = \frac{t^2}{t^2 + df}}$$
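
A quick sketch of this formula in action, using invented scores for two groups of eight (the numbers are hypothetical, chosen only for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical scores for two groups (illustrative numbers only)
treat = np.array([12.1, 13.4, 11.8, 12.9, 13.1, 12.5, 13.0, 12.2])
ctrl  = np.array([11.5, 12.0, 11.2, 12.4, 11.9, 11.6, 12.1, 11.4])

t, p = stats.ttest_ind(treat, ctrl)
df = len(treat) + len(ctrl) - 2        # 8 + 8 - 2 = 14
r2 = t**2 / (t**2 + df)                # proportion of variance explained
print(f"t = {t:.2f}, r^2 = {r2:.2f}")
```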

Cohen's d to $r^2$

$$\boxed{r^2 = \frac{d^2}{d^2 + 4}}$$
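
This conversion (an approximation that assumes equal group sizes) is easy to check by hand; plugging in Cohen's benchmark values of $d$ recovers the $r^2$ benchmarks quoted in this chapter:

```python
# d -> r^2 for Cohen's benchmark values (pure arithmetic, no data needed)
benchmarks = {}
for d in (0.2, 0.5, 0.8):
    benchmarks[d] = d**2 / (d**2 + 4)
    print(f"d = {d}: r^2 = {benchmarks[d]:.3f}")
```

The three results round to 0.01, 0.06, and 0.14, i.e. the 1%/6%/14% values in the benchmark table.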

Cohen's h (Two Proportions)

$$\boxed{h = 2\arcsin(\sqrt{p_1}) - 2\arcsin(\sqrt{p_2})}$$
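
A minimal sketch with hypothetical A/B-test conversion rates (the 10% and 12% figures are invented for illustration): even a two-percentage-point lift is a tiny standardized effect.

```python
import numpy as np

def cohens_h(p1, p2):
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# Hypothetical conversion rates: 12% vs 10%
h = cohens_h(0.12, 0.10)
print(f"h = {h:.3f}")   # well below the 0.2 "small" benchmark
```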

Statistical Power

$$\boxed{\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ is false})}$$
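
The definition can be made concrete by simulation. In the sketch below (an assumed setup: two groups of $n = 30$, true $d = 0.5$, $\alpha = 0.05$), power is estimated by Monte Carlo and compared with statsmodels' analytic answer; both come out near 0.48, well below the conventional 80%.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(0)
d, n, alpha, reps = 0.5, 30, 0.05, 5000

hits = 0
for _ in range(reps):
    g1 = rng.normal(d, 1.0, n)   # true effect exists (H0 is false)
    g2 = rng.normal(0.0, 1.0, n)
    if stats.ttest_ind(g1, g2).pvalue < alpha:
        hits += 1                # correctly rejected H0

sim_power = hits / reps
analytic = TTestIndPower().power(effect_size=d, nobs1=n, alpha=alpha)
print(f"simulated: {sim_power:.3f}, analytic: {analytic:.3f}")
```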

Effect Size Benchmarks

| Measure | Small | Medium | Large |
|---|---|---|---|
| Cohen's d | 0.2 | 0.5 | 0.8 |
| Cohen's h | 0.2 | 0.5 | 0.8 |
| $r^2$ | 0.01 (1%) | 0.06 (6%) | 0.14 (14%) |

Use with caution: These benchmarks are generic guidelines, not rigid rules. What counts as "small" or "large" depends on the field and the stakes. Always interpret in context.

Four Factors Affecting Power

| Factor | Increase Power By... | Tradeoff |
|---|---|---|
| Sample size ($n$) | Collecting more data | Costs time and money |
| Effect size | Studying larger effects | The true effect size is set by nature, not the researcher |
| Significance level ($\alpha$) | Using a more lenient threshold (e.g., 0.10) | Increases the false positive rate |
| Variability ($\sigma$) | Reducing noise (better measurement, paired designs) | May limit generalizability |
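
Sample size is the factor you usually control, and it is dominated by the effect size you hope to detect. The sketch below uses `TTestIndPower.solve_power` to find the per-group $n$ for 80% power at $\alpha = 0.05$: roughly 26 for a large effect, 64 for a medium one, and about 394 for a small one.

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
needed = {}
for d in (0.8, 0.5, 0.2):
    needed[d] = analysis.solve_power(effect_size=d, alpha=0.05,
                                     power=0.80, alternative='two-sided')
    print(f"d = {d}: about {math.ceil(needed[d])} per group")
```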

The Significance Matrix

| | Practically Significant | Not Practically Significant |
|---|---|---|
| Statistically Significant | Best case: real and important effect | "Significant but trivial" — large $n$, tiny effect |
| Not Statistically Significant | "Important but missed" — small $n$, real effect | Consistent with no meaningful effect |

Key insight: You need both statistical significance and practical significance. A p-value alone cannot tell you which cell you're in.
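
The "significant but trivial" cell is easy to reproduce with simulated data (the $d = 0.03$ effect and $n = 100{,}000$ per group are invented for illustration): the p-value is vanishingly small while the effect remains trivial.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000                       # a huge (hypothetical) sample per group
g1 = rng.normal(0.03, 1.0, n)     # true d = 0.03, far below "small"
g2 = rng.normal(0.00, 1.0, n)

t, p = stats.ttest_ind(g1, g2)
sp = np.sqrt((g1.var(ddof=1) + g2.var(ddof=1)) / 2)  # equal n: simple pooling
d_hat = (g1.mean() - g2.mean()) / sp
print(f"p = {p:.2g} (significant), d = {d_hat:.3f} (trivial)")
```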

The Statistical Reporting Checklist

Every statistical analysis should report:

| Component | What It Tells You |
|---|---|
| Point estimate | The observed effect size in original units |
| 95% confidence interval | The plausible range of the true effect |
| Cohen's d (or h) | Standardized effect magnitude |
| $r^2$ | Proportion of variance explained |
| P-value | Compatibility with $H_0$ |
| Power | Probability the study could detect this effect |
| Sample size | How much data the conclusion rests on |

Python Quick Reference

```python
from statsmodels.stats.power import TTestIndPower
from scipy import stats
import numpy as np

# --- Cohen's d ---
def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    s1, s2 = np.std(group1, ddof=1), np.std(group2, ddof=1)
    s_p = np.sqrt(((n1-1)*s1**2 + (n2-1)*s2**2) / (n1+n2-2))
    return (np.mean(group1) - np.mean(group2)) / s_p

# --- r-squared from t-test ---
def r_squared(t_stat, df):
    return t_stat**2 / (t_stat**2 + df)

# --- Power analysis: find required n (per group) ---
power = TTestIndPower()
n_needed = power.solve_power(effect_size=0.5, alpha=0.05,
                             power=0.80, alternative='two-sided')
# returns a float; round up to the next whole participant

# --- Power analysis: find achieved power ---
achieved = power.solve_power(effect_size=0.23, nobs1=250,
                             alpha=0.05, alternative='two-sided')

# --- Cohen's h for proportions ---
def cohens_h(p1, p2):
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
```

Common Misconceptions

| Misconception | Reality |
|---|---|
| "Significant = important" | A result can be significant but trivially small (with large $n$) |
| "Not significant = no effect" | A result can be non-significant because the study was underpowered |
| "$p = 0.001$ means a huge effect" | P-values mix effect size and sample size; $p = 0.001$ can come from a tiny effect with a massive sample |
| "$d = 0.2$ is always small" | Effect size benchmarks depend on context; a $d$ of 0.2 on mortality could save thousands of lives |
| "Post-hoc power analysis is useful" | Computing power from the observed effect size is circular; use the hypothesized effect size or focus on the CI |
| "We need $n = 30$ for everything" | Required sample size depends on the effect size you want to detect; small effects require hundreds or thousands |

How This Chapter Connects

| This Chapter | Builds On | Leads To |
|---|---|---|
| Effect sizes (Cohen's d, $r^2$) | Means and SDs (Ch.6), two-sample t-test (Ch.16) | Regression $R^2$ (Ch.22-23) |
| Statistical power | Type I/II errors (Ch.13), standard error (Ch.11) | ANOVA power (Ch.20), sample size planning (throughout) |
| Practical significance | P-value definition (Ch.13), CI interpretation (Ch.12) | Communicating results (Ch.25), ethical practice (Ch.27) |
| Publication bias, p-hacking | Replication crisis (Ch.1, Ch.13) | Bootstrap methods (Ch.18), ethics (Ch.27) |
| Power analysis in Python | scipy.stats (Ch.13-16) | statsmodels throughout (Ch.20-24) |

The Key Themes

Theme 4: Uncertainty is not failure. The confidence interval is more honest and more useful than a binary declaration of "significant" or "not significant." Reporting that the true difference is "somewhere between 1 and 8 minutes" communicates both what we know and what we don't. Treating uncertainty as information rather than failure is the hallmark of mature statistical thinking.

Theme 6: P-hacking and ethical data practice. Testing many hypotheses and reporting only the significant ones inflates the false positive rate from 5% to as high as 64%. Publication bias compounds the problem by making the published literature systematically overconfident. The ethical obligation: pre-register hypotheses, report all analyses, report effect sizes, and treat null results as valuable information.
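
The jump from 5% to 64% follows from a standard multiple-testing calculation, assuming 20 independent tests each at $\alpha = 0.05$ (the "20 independent tests" setup is this sketch's assumption):

```python
# Probability of at least one false positive across m independent tests,
# each at significance level alpha, when every null hypothesis is true
alpha, m = 0.05, 20
p_any_false_positive = 1 - (1 - alpha)**m
print(f"P(at least one false positive) = {p_any_false_positive:.0%}")
```

With these assumptions the chance of at least one spurious "discovery" is about 64%.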

The One Thing to Remember

If you forget everything else from this chapter, remember this:

"Statistically significant" does not mean "important." A p-value tells you whether a result is surprising under $H_0$ — not whether the effect is large, meaningful, or worth acting on. For that, you need the effect size. Cohen's d expresses the difference in standard deviation units (small $\approx$ 0.2, medium $\approx$ 0.5, large $\approx$ 0.8). $r^2$ tells you the proportion of variance explained. Statistical power (minimum 80%) tells you whether you had enough data to find the effect. And the confidence interval — which simultaneously conveys the direction, magnitude, and precision of the effect — is the single most informative summary of any statistical analysis. Always report all of these. Never report just a p-value.

Key Terms

| Term | Definition |
|---|---|
| Statistical power | The probability of correctly rejecting $H_0$ when it is false: Power $= 1 - \beta$; depends on $\alpha$, $n$, effect size, and variability |
| Effect size | A quantitative, sample-size-independent measure of the magnitude of a phenomenon; answers "how big?" rather than "is there an effect?" |
| Cohen's d | Effect size for comparing two means: $d = (\bar{x}_1 - \bar{x}_2) / s_p$; expresses the group difference in standard deviation units |
| Practical significance | Whether an effect is large enough to matter in the real world, as opposed to merely being statistically detectable |
| Power analysis | The process of determining the sample size needed to detect a given effect size with a specified power and significance level |
| Sample size planning | Using power analysis prospectively to determine how many observations are needed before collecting data |
| Underpowered study | A study with insufficient sample size to reliably detect the effect of interest; typically power $< 80\%$ |
| P-hacking | Manipulating data analysis — testing multiple hypotheses, variables, or subgroups — until a statistically significant result is found; inflates the false positive rate |
| Publication bias | The tendency for journals to publish significant results and reject null results, creating a distorted literature that overestimates effect sizes |
| Replication crisis | The discovery that many published scientific findings cannot be reproduced; driven by underpowered studies, p-hacking, and publication bias |