Appendix F: FAQ and Troubleshooting

This appendix collects the questions students ask most frequently, organized by topic. If you're stuck, start here.


F.1 Conceptual Questions

Q1: What does a confidence interval actually mean?

A 95% confidence interval does NOT mean "there is a 95% probability the true parameter is in this interval." The parameter is a fixed (unknown) number — it's either in the interval or it isn't.

What 95% confidence means: if we repeated the entire sampling process many times and constructed a confidence interval each time, approximately 95% of those intervals would contain the true parameter.

Think of it as a statement about the method, not about any single interval. The method is reliable 95% of the time.
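The repeated-sampling claim can be checked directly. The sketch below (illustrative values of our choosing: a normal population with mu = 50, sigma = 10, samples of n = 25) builds many t-based 95% intervals and counts how often they capture the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 50.0, 10.0, 25, 10_000
covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)
    se = sample.std(ddof=1) / np.sqrt(n)              # estimated standard error
    t_crit = stats.t.ppf(0.975, df=n - 1)             # two-tailed 95% critical value
    lo, hi = sample.mean() - t_crit * se, sample.mean() + t_crit * se
    covered += (lo <= mu <= hi)
print(covered / reps)  # close to 0.95
```

Any single interval either contains mu or it doesn't; it is the long-run hit rate of the procedure that sits near 95%.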

Common wrong interpretations:
  • "There's a 95% chance mu is between 42 and 58." (Wrong — mu is fixed, not random.)
  • "95% of the data falls in this interval." (Wrong — that describes the data, not the parameter.)
  • "If we sampled again, there's a 95% chance the new sample mean would be in this interval." (Wrong — this confuses the sampling distribution with the CI.)

See Ch.12, Section 12.3 for the full treatment.


Q2: What does a p-value actually mean?

The p-value is the probability of observing a test statistic as extreme as (or more extreme than) what you actually got, assuming the null hypothesis is true.

In notation: p-value = P(data this extreme | H0 is true).

It is NOT:
  • The probability that H0 is true.
  • The probability the results are "due to chance."
  • The probability of making an error.
  • The probability of getting the same result if you repeat the study.

A small p-value (say, 0.003) means: "If there really were no effect, it would be very unusual to get data like ours. So either there is no effect and we got very unlucky, or there IS an effect."
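The flip side of this logic shows up in simulation: when H0 really is true, small p-values are rare by construction. In the sketch below (a made-up setup: two groups drawn from the same normal distribution), p < 0.05 occurs only about 5% of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
reps = 5_000
pvals = np.empty(reps)
for i in range(reps):
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)  # H0 is true: both groups share one distribution
    pvals[i] = stats.ttest_ind(a, b).pvalue
print((pvals < 0.05).mean())  # about 0.05: "significant" results are rare under H0
```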

See Ch.13, Section 13.5 for the full treatment.


Q3: What's the difference between "failing to reject H0" and "accepting H0"?

When p > alpha, we say "fail to reject H0." We do NOT say "accept H0" or "H0 is true."

Why? Because absence of evidence is not evidence of absence. A non-significant result might mean:
  • H0 is actually true, OR
  • H0 is false but our sample was too small to detect the effect (underpowered study), OR
  • H0 is false but the effect is so small it's indistinguishable from zero with our sample size.

Think of it like a courtroom: "not guilty" doesn't mean "innocent." It means the prosecution didn't present enough evidence.

See Ch.13, Section 13.7 and Ch.17 on power.


Q4: What's the difference between statistical significance and practical significance?

Statistical significance means p <= alpha. It tells you the effect is probably real (not zero).

Practical significance means the effect is large enough to matter in the real world.

These are NOT the same thing. With a large enough sample, even a tiny, meaningless difference becomes statistically significant. For example, a study with 100,000 participants might find that a drug lowers blood pressure by 0.3 mmHg (p < 0.001). That's statistically significant but clinically meaningless — no doctor would prescribe a drug for a 0.3 mmHg reduction.

Always report effect sizes (Cohen's d, r-squared, Cramer's V, eta-squared) alongside p-values. The effect size tells you how big the effect is; the p-value tells you how sure you are that it's not zero.
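As a sketch of one of these effect sizes, here is a minimal Cohen's d using the pooled standard deviation (the function name cohens_d and the toy data are ours, not from a library):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

print(round(cohens_d([5, 6, 7, 8, 9], [3, 4, 5, 6, 7]), 2))  # 1.26
```

Unlike the p-value, d does not grow just because you collect more data; it measures the size of the difference in standard-deviation units.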

See Ch.17, Section 17.11 for the full treatment.


Q5: When someone says "correlation does not imply causation," what do they mean? And when CAN we make causal claims?

Correlation means two variables are associated — when one changes, the other tends to change. But there are four possible explanations for a correlation:

  1. Direct causation: X causes Y (what people usually assume).
  2. Reverse causation: Y causes X.
  3. Common cause: A third variable Z causes both X and Y (confounding).
  4. Coincidence: The correlation is a fluke (especially with many variables tested).

You can make causal claims when:
  • The study uses a randomized experiment with random assignment to treatment and control groups. Random assignment breaks confounding.
  • You have a strong natural experiment with a clearly exogenous source of variation.
  • You have very strong observational evidence with multiple controls, dose-response relationships, temporal ordering, and no plausible confounders (rare).

In an observational study, you can only claim association, not causation — no matter how large the sample or how small the p-value.

See Ch.4, Section 4.6 and Ch.22, Section 22.5.


Q6: What's the difference between Type I and Type II errors? Which is worse?

                        H0 is actually TRUE           H0 is actually FALSE
  Reject H0             Type I error (false alarm)    Correct decision (power)
  Fail to reject H0     Correct decision              Type II error (missed detection)
  • Type I error (alpha): You conclude there IS an effect when there ISN'T one. Like a fire alarm going off when there's no fire.
  • Type II error (beta): You conclude there ISN'T an effect when there IS one. Like a fire alarm failing to go off when the building is burning.

Which is worse depends entirely on context:
  • In a criminal trial, Type I error (convicting an innocent person) is considered worse.
  • In medical screening, Type II error (missing a treatable disease) might be worse.
  • In quality control, Type I error (stopping production unnecessarily) might be more expensive.

There's always a tradeoff: lowering alpha reduces Type I errors but increases Type II errors. The only way to reduce both simultaneously is to increase sample size.

See Ch.13, Section 13.9.


Q7: Why do we divide by n-1 instead of n when computing sample variance?

Short answer: because dividing by n would systematically underestimate the population variance. Dividing by n-1 corrects this bias.

Intuitive explanation: when you calculate deviations from the sample mean, the deviations always sum to zero (by construction). This means one deviation is determined by all the others — you have only n-1 truly independent pieces of information. We say the sample mean "uses up" one degree of freedom, leaving n-1.

This correction is called Bessel's correction. It only matters for small samples. By n = 30, dividing by n versus n-1 gives nearly identical results.
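A quick simulation makes the bias visible. The sketch below (an illustrative setup: many small samples from a population with variance 4) compares the two divisors:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0          # true population variance
n, reps = 5, 100_000
samples = rng.normal(0, np.sqrt(sigma2), (reps, n))
biased = samples.var(axis=1, ddof=0).mean()    # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n - 1 (Bessel's correction)
print(round(biased, 2), round(unbiased, 2))
# the n divisor averages about (n-1)/n * 4 = 3.2; the n-1 divisor about 4.0
```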

See Ch.6, Section 6.5.


Q8: What's the Central Limit Theorem and why does it matter so much?

The Central Limit Theorem (CLT) says: if you take many random samples of size n from any population and compute the sample mean each time, the distribution of those sample means will be approximately normal — regardless of the shape of the original population — as long as n is large enough (typically n >= 30).

Why this matters: it's the reason we can do inference. Confidence intervals and hypothesis tests assume the sampling distribution is normal. The CLT guarantees this is approximately true for large samples, even if your data is skewed, bimodal, or shaped like a camel.

Without the CLT, every statistical test would require knowing the exact shape of the population distribution. The CLT frees us from that requirement.
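You can watch the CLT work in a short simulation. The sketch below (an illustrative choice: an exponential population, which is heavily right-skewed) shows that means of samples of n = 40 cluster around the population mean and are far more symmetric than the population itself:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# 10,000 samples of size 40 from an exponential population (skewness = 2)
means = rng.exponential(1.0, size=(10_000, 40)).mean(axis=1)

print(round(means.mean(), 2))       # near the population mean of 1
print(round(stats.skew(means), 2))  # far below the population skewness of 2
```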

See Ch.11, Section 11.4.


Q9: Is it ever okay to remove outliers?

It depends on WHY the outlier exists:

  1. Data entry error (someone typed 999 instead of 99): Yes, fix or remove it.
  2. Measurement error (instrument malfunction): Yes, remove it with documentation.
  3. Different population (a CEO's salary in a dataset of teacher salaries): Consider analyzing separately.
  4. Legitimate extreme value (a very healthy 105-year-old in a longevity study): Do NOT remove it. This is real data that belongs in your analysis.

The rule: never delete an outlier just because it makes your results look worse. Always document your reasoning in a cleaning log, and consider running your analysis both with and without the outlier (sensitivity analysis).
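A sensitivity analysis can be as simple as computing your summary statistics with and without the suspect value. A sketch with made-up data containing one suspected entry error:

```python
import numpy as np

data = np.array([22, 25, 23, 24, 26, 21, 25, 99])  # 99 is a suspected entry error
print(round(data.mean(), 1), round(float(np.median(data)), 1))        # 33.1 24.5
trimmed = data[data != 99]
print(round(trimmed.mean(), 1), round(float(np.median(trimmed)), 1))  # 23.7 24.0
```

The mean swings wildly with the outlier while the median barely moves — reporting both, and stating which values were excluded, keeps the analysis honest.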

See Ch.6, Section 6.9 and Ch.7, Section 7.3.


Q10: What does "degrees of freedom" actually mean?

Degrees of freedom (df) represent the number of independent pieces of information in your data that are free to vary after you've estimated some statistic.

For a one-sample t-test with n observations: you estimated the mean from the data, so one piece of information is "used up." That leaves df = n - 1 independent values.

Practical rule: more degrees of freedom means the t-distribution gets closer to the normal distribution, critical values get smaller, and your tests become more powerful.

Specific formulas:
  • One-sample or paired t-test: df = n - 1
  • Two-sample t-test (Welch's): df is computed from both sample sizes and variances (use software)
  • Chi-square goodness-of-fit: df = k - 1 (number of categories minus 1)
  • Chi-square independence: df = (r - 1)(c - 1)
  • ANOVA: df_between = k - 1, df_within = N - k

See Ch.6, Section 6.5; Ch.12, Section 12.4; Ch.15, Section 15.2.


F.2 "When Should I Use..." Decision Guides

Q11: When should I use z vs. t for means?

  • z-test: When the population standard deviation (sigma) is KNOWN. This almost never happens in practice.
  • t-test: When sigma is unknown and estimated by the sample standard deviation (s). This is almost always the case.

Default answer: Use the t-test. If you're not sure, use the t-test. For large samples (n > 30), the t-distribution and z-distribution give nearly identical results anyway.
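The convergence is easy to verify. The sketch below compares two-tailed 95% critical values from the t-distribution against the z critical value:

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # 1.96
for n in (5, 15, 30, 100):
    t_crit = stats.t.ppf(0.975, df=n - 1)
    print(f"n = {n:3d}: t* = {t_crit:.3f}")
print(f"z* = {z_crit:.3f}")
```

The t critical values shrink toward 1.96 as n grows, which is why the choice matters mainly for small samples.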


Q12: When should I use a one-tailed vs. two-tailed test?

  • Two-tailed: You're interested in any difference from the null (could be higher or lower). Ha: mu != mu_0.
  • One-tailed: You have a specific directional prediction BEFORE seeing the data. Ha: mu > mu_0 or Ha: mu < mu_0.

Default answer: Use two-tailed unless you have a strong, pre-specified reason for a one-tailed test. Journal reviewers and statistics instructors are often skeptical of one-tailed tests because they make it easier to get significant results.

Rule of thumb: if rejecting H0 in the unexpected direction would be interesting or actionable, you need a two-tailed test.


Q13: How do I choose between parametric and nonparametric tests?

Use the parametric test (t-test, ANOVA) when:
  • Data is roughly normally distributed (check with a QQ-plot or Shapiro-Wilk test)
  • Sample size is n >= 30 (the CLT makes the t-test robust)
  • Data is interval or ratio scale
  • There are no extreme outliers

Use the nonparametric test (Mann-Whitney U, Wilcoxon, Kruskal-Wallis) when:
  • Data is ordinal (ranked data, Likert scales)
  • Sample size is small (n < 15) AND the data is clearly non-normal
  • There are extreme outliers that you can't justify removing
  • The distribution is heavily skewed with a small sample

When in doubt: run both. If they agree, report the parametric test (more familiar to most audiences). If they disagree, investigate why and report the nonparametric test with an explanation.
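Running both takes one line each in scipy. A sketch with made-up group data, pairing Welch's t-test with the Mann-Whitney U test:

```python
from scipy import stats

group_a = [12, 14, 11, 19, 15, 13, 16, 14]
group_b = [10, 9, 13, 11, 8, 12, 10, 11]

t_res = stats.ttest_ind(group_a, group_b, equal_var=False)       # Welch's t-test
u_res = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Welch's t-test:  p = {t_res.pvalue:.4f}")
print(f"Mann-Whitney U:  p = {u_res.pvalue:.4f}")
# Here both p-values are small and the tests agree
```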

See Ch.21 for the full decision framework.


Q14: How do I choose the right statistical test?

Start with two questions:
  1. What type is your response (outcome) variable? Categorical or numerical?
  2. How many groups are you comparing? One, two, or more than two?

  Response Variable            Groups                              Test
  Numerical (mean)             1 sample vs. known value            One-sample t-test (Ch.15)
  Numerical (mean)             2 independent groups                Two-sample t-test (Ch.16)
  Numerical (mean)             2 paired/matched groups             Paired t-test (Ch.16)
  Numerical (mean)             3+ groups                           One-way ANOVA (Ch.20)
  Categorical (proportion)     1 sample vs. known value            One-sample z-test for proportions (Ch.14)
  Categorical (proportion)     2 groups                            Two-proportion z-test (Ch.16)
  Categorical (distribution)   1 sample vs. expected               Chi-square goodness-of-fit (Ch.19)
  Categorical x Categorical    Association between two variables   Chi-square test of independence (Ch.19)
  Numerical x Numerical        Linear relationship                 Correlation / regression (Ch.22)
  Numerical (outcome)          Multiple predictors                 Multiple regression (Ch.23)
  Binary (outcome)             One or more predictors              Logistic regression (Ch.24)

Q15: How do I know if my sample size is large enough?

It depends on what you're doing:

  • For the CLT to kick in: n >= 30 for most population shapes. For very skewed populations, you might need n >= 50 or more.
  • For proportion inference: Need np >= 10 and n(1-p) >= 10 (success-failure condition).
  • For chi-square tests: All expected cell counts should be >= 5.
  • For regression: A common guideline is n >= 10-20 observations per predictor variable.
  • For a specific power level: Use a power analysis (Ch.17). To detect a medium effect (d = 0.5) with 80% power at alpha = 0.05, you need about 64 per group for a two-sample t-test.
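For the power-analysis case, statsmodels can solve for the required sample size. A sketch reproducing the d = 0.5, 80% power, alpha = 0.05 example:

```python
from statsmodels.stats.power import TTestIndPower

# Solve for n per group: medium effect (d = 0.5), 80% power, alpha = 0.05
n_per_group = TTestIndPower().solve_power(effect_size=0.5, power=0.80, alpha=0.05)
print(round(n_per_group))  # about 64 per group
```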

F.3 Common Python Errors and Fixes

Q16: "TypeError: unsupported operand type(s)"

Cause: You're trying to do math on a column that contains text.

# This fails if "income" has values like "$50,000"
df["income"].mean()

# Fix: strip the formatting characters, then convert to numeric
# (regex=False treats "$" literally; as a regex, "$" matches end-of-string
#  and would replace nothing)
df["income"] = df["income"].str.replace("$", "", regex=False).str.replace(",", "", regex=False)
df["income"] = pd.to_numeric(df["income"], errors="coerce")
df["income"].mean()  # Now it works

Q17: "ValueError: could not convert string to float"

Cause: There's text mixed into a numerical column — often missing data coded as "N/A", "None", or a blank.

# Fix: force non-numeric values to NaN
df["column"] = pd.to_numeric(df["column"], errors="coerce")

Q18: My p-value is showing as 0.0. Is that right?

The p-value is never truly zero — it's just smaller than Python can display. Report it as "p < 0.001" rather than "p = 0.000."

# If you need more precision, format the float in scientific notation:
print(f"p = {p_value:.2e}")

Q19: ttest_ind is giving me a two-tailed p-value but I need one-tailed.

scipy.stats.ttest_ind() returns a two-tailed p-value by default. In SciPy 1.6+ you can pass alternative="greater" or alternative="less" to get a one-tailed p-value directly; otherwise, convert by hand:

t_stat, p_two = stats.ttest_ind(group_a, group_b, equal_var=False)

# One-tailed (Ha: group_a > group_b)
p_one = p_two / 2 if t_stat > 0 else 1 - p_two / 2

# One-tailed (Ha: group_a < group_b)
p_one = p_two / 2 if t_stat < 0 else 1 - p_two / 2

Q20: My regression summary shows a very small p-value for the F-statistic but large p-values for individual predictors.

This means the model as a whole explains significant variation, but no single predictor stands out when the others are in the model. This usually indicates multicollinearity — your predictors are correlated with each other.

Fix:
  1. Check VIF values (variance inflation factor). VIF > 10 is problematic.
  2. Consider removing one of the correlated predictors.
  3. See Ch.23, Section 23.8 on multicollinearity.
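The VIF check can be done with statsmodels. A sketch with made-up predictors, where x2 is deliberately a near-copy of x1:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a duplicate of x1
x3 = rng.normal(size=n)                  # unrelated predictor
X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})

for i in range(1, X.shape[1]):  # skip the intercept column
    print(X.columns[i], round(variance_inflation_factor(X.values, i), 1))
# x1 and x2 get very large VIFs; x3 stays near 1
```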


F.4 Common Excel Issues

Q21: Excel is rounding my p-values to 0.

Excel displays a limited number of decimal places by default.

Fix: Right-click the cell, select Format Cells, choose Number, and increase decimal places to 6 or more. Alternatively, use Scientific notation.


Q22: Which Excel function should I use for a t-test?

  • T.TEST(array1, array2, tails, type): Returns a p-value directly.
      - tails: 1 (one-tailed) or 2 (two-tailed)
      - type: 1 (paired), 2 (equal variance), 3 (unequal variance / Welch's)
  • T.INV.2T(alpha, df): Returns the two-tailed critical value t*.
  • T.DIST.2T(t, df): Returns the two-tailed p-value for a given t-statistic.

Q23: How do I create a histogram in Excel?

  1. Select your data range.
  2. Go to Insert > Chart > Histogram.
  3. To adjust bin width: right-click the x-axis > Format Axis > Bin width.

For more control, use the Data Analysis ToolPak:
  1. Go to Data > Data Analysis (if you don't see it: File > Options > Add-ins > Analysis ToolPak > Go > check the box).
  2. Select Histogram.
  3. Specify the input range and optional bin range.


F.5 Interpreting Results

Q24: My confidence interval includes zero. What does that mean?

If zero is inside the CI for a difference (difference in means or difference in proportions), it means zero is a plausible value for the true difference. In other words, there might be no difference at all.

This is equivalent to failing to reject H0 in a two-tailed test at the corresponding alpha level.

However, look at the WIDTH of the interval. A CI of (-0.5, 0.6) that includes zero is very different from a CI of (-15, 20) that includes zero. The first suggests the effect, if it exists, is small. The second suggests you simply don't have enough data to tell.


Q25: I got a "statistically significant" result with a tiny effect size. Should I be excited?

Probably not. A statistically significant result with a tiny effect size usually means you had a very large sample that detected a trivially small difference.

Ask yourself: "Would this difference matter to anyone in the real world?" If a new teaching method increases test scores by 0.3 points on a 100-point exam (p = 0.02, d = 0.04), that's real but meaningless.

Always report the effect size alongside the p-value so readers can judge for themselves.


Q26: My R-squared is low but my regression coefficients are significant. Is my model useless?

Not necessarily. A low R-squared means your model explains only a small portion of the variation in Y. But the significant coefficients tell you that the relationship between your predictors and Y is real (not just noise).

This is common in social science and public health research where human behavior has enormous natural variability that no simple model can capture. A regression predicting income from education might have R-squared = 0.15 (education explains 15% of income variation), but the slope is highly significant and practically meaningful.

A model can be useful for understanding relationships even if it's poor for making individual predictions.

See Ch.22, Section 22.8 and Ch.23.