Answers to Selected Exercises

This appendix provides solutions to odd-numbered problems from Parts A and B of each chapter. Solutions show the reasoning process, not just the final answer. For computational problems, both the formula approach and the Python approach are shown where relevant.


Chapter 1: Why Statistics Matters

A.1. Descriptive statistics summarizes data you already have — for example, computing the average temperature in your city last month. Inferential statistics uses sample data to draw conclusions about a larger population — for example, using temperature data from 50 weather stations to estimate the average temperature across an entire state.

A.3. Statistics involves much more than calculations. It requires critical thinking about how data was collected (study design), what the data can and cannot tell us (inference vs. description), how to handle uncertainty (probability), and how to communicate findings honestly (ethics). A calculator can compute a mean, but only a trained statistical thinker can judge whether that mean is meaningful, representative, and honestly reported.

A.5. Almost every decision involves uncertainty because we rarely have complete information. For example, deciding whether to carry an umbrella involves uncertain weather forecasts. Statistics provides formal tools for quantifying uncertainty (confidence intervals), making decisions under uncertainty (hypothesis tests), and weighing evidence (p-values, Bayes' theorem).

A.7. Most people see variation as noise — something to be eliminated or ignored. Statistical thinking sees variation as information. When test scores vary, that variation tells us something about the students, the test, or the teaching. When patient outcomes vary, that variation may reveal differences in treatment effectiveness. The shift is from "variation is a problem" to "variation is the starting point of understanding."

B.1.

(a) Inferential — the sample of 1,500 is being used to draw a conclusion about all voters in the state.

(b) Descriptive — the teacher is simply summarizing the data she has (the whole class).

(c) Inferential — the 5,000 patients are a sample, and the conclusion is about the general population.

(d) Descriptive — the Census attempts to measure the entire population, so this is a summary of the population itself. (Note: technically the Census doesn't reach everyone, but its intent is a full count.)

(e) Inferential — past claims are being used to predict future claims from a different group of customers.

B.3. Alex can't simply compare this month's watch time to last month's because many other things change over time: new content releases, seasonal viewing patterns, holidays, marketing campaigns, and platform updates. Any observed difference might be caused by these confounding factors rather than the algorithm. A better approach is a randomized experiment (A/B test): randomly assign some users to the new algorithm and others to the old one, running both simultaneously. This controls for all other factors.


Chapter 2: Types of Data and the Language of Statistics

A.1. A categorical variable represents group membership or labels (blood type, favorite color, political party), while a numerical variable represents quantities that can be measured or counted (height, income, number of siblings). The key test: does it make sense to compute a mean? If yes, it's numerical. If not, it's categorical.

A.3. Ordinal — the categories have a natural order (not at all, somewhat, very), but the distance between "not at all" and "somewhat" is not necessarily the same as the distance between "somewhat" and "very." This has practical consequences: computing a mean of ordinal data is debatable.

A.5. (a) Nominal — no natural order among positions. (b) Ratio — height has a true zero and meaningful ratios. (c) Ordinal — there is a clear order, but the distance between gold and silver isn't necessarily the same as silver and bronze.

B.1. (a) Observational unit: one adult resident. (b) Variable 1: income (numerical, continuous, ratio). Variable 2: education level (categorical, ordinal). Variable 3: zip code (categorical, nominal — even though it uses numbers).

B.3. Parameter: the true average GPA of ALL students at the university (unknown). Statistic: the average GPA of the 200 sampled students (known, computed from data). The statistic is our best estimate of the parameter, but the two will almost certainly differ due to sampling variability.


Chapter 3: Your Data Toolkit

A.1. A Jupyter notebook is an interactive document that combines code cells (where you write and run Python) and Markdown cells (where you write formatted text). Unlike a traditional script that runs top to bottom, a notebook lets you run cells individually, see output immediately, and document your thought process alongside your code. This makes it ideal for data exploration and reproducible analysis.

A.3. A library is a collection of pre-written code that extends Python's capabilities. Without pandas, you'd have to write hundreds of lines of code to load, filter, and summarize a CSV file. With pandas, it takes one line. Libraries let you stand on the shoulders of other programmers.

B.1. .head() shows the first 5 rows of the DataFrame — useful for quickly checking that data loaded correctly and seeing what the columns look like. .info() shows the number of rows, column names, data types, and count of non-null values — essential for identifying missing data and incorrect data types. .describe() computes summary statistics (mean, std, min, max, quartiles) for numerical columns — gives you a first look at the center, spread, and range of your data.


Chapter 4: Designing Studies

A.1. In an observational study, the researcher observes and records data without intervening — they take the world as it is. In an experiment, the researcher actively imposes a treatment on subjects and compares outcomes. The critical difference: only experiments with random assignment can establish causation, because randomization controls for confounding variables (both known and unknown).

A.3. (a) Selection bias — the sample systematically excludes certain groups. (b) Response bias — participants give inaccurate answers, often due to social desirability (e.g., underreporting unhealthy behaviors). (c) Nonresponse bias — people who don't respond differ systematically from those who do. (d) Survivorship bias — only the "survivors" are studied, creating an incomplete picture.

A.5. A confounding variable is associated with both the explanatory and response variables, creating a spurious association (or masking a real one). Example: ice cream sales and drowning rates both increase in summer — but the confounding variable is temperature/season, not a causal link between ice cream and drowning.

B.1. (a) Experiment — students are assigned to different instruction methods. But random assignment is critical; if students self-select into methods, it becomes observational with confounders (motivation, prior ability). (b) Observational — the researcher observes coffee consumption without assigning it. Coffee drinkers may differ from non-drinkers in many ways (sleep habits, stress levels, age).


Chapter 5: Exploring Data

A.1. A bar chart displays categories on the x-axis with bars of different heights representing counts; the bars have gaps between them because categories are distinct. A histogram displays bins of a numerical variable with bars touching because the bins represent a continuous range. The visual distinction: gaps mean categorical; touching bars mean numerical.

A.3. No. A mean of 50 and SD of 10 is consistent with a symmetric distribution, a skewed distribution, a bimodal distribution, or many other shapes. Summary statistics alone cannot determine shape — you must visualize the data.

A.5. This is a bimodal distribution. The two peaks might indicate two distinct subgroups of students — perhaps students who studied thoroughly (scoring around 85%) and students who didn't prepare well (scoring around 55%).

A.7. Statisticians prefer bar charts because humans compare lengths (bar heights) more accurately than areas (pie slices). Pie charts are acceptable only when: (a) you have very few categories (3-5), (b) you want to emphasize part-to-whole relationships, and (c) no two slices are close in size. They clearly fail when there are many categories or when you need precise comparisons.

B.1. (a) Categorical (nominal). Best graph: bar chart.

B.3. This is tricky. Star ratings (1-5) are ordinal categorical data. While some analysts treat them as numerical, they are discrete and bounded. Best graph: bar chart (one bar per star rating) or a horizontal bar chart showing the frequency of each rating.

B.5. Categorical (nominal). Best graph: bar chart (sorted by frequency).

B.7. Numerical (discrete). Best graph: histogram (with integer bins) or a dot plot since the range is small (0-15).

B.9. Numerical (continuous). Best graph: histogram.


Chapter 6: Numerical Summaries

A.1. The mean is the sum of all values divided by the count — the "balance point." The median is the middle value when data is sorted — the 50th percentile. They differ most for skewed distributions or data with outliers. The median is resistant to outliers; the mean is pulled toward them.

A.3. Standard deviation measures the typical distance of observations from the mean. Conceptually, it answers: "On average, how far are individual data points from the center?" A small SD means values cluster tightly around the mean; a large SD means they're spread out.

A.5. Using the 1.5 * IQR rule: IQR = Q3 - Q1 = 85 - 60 = 25. Lower fence = Q1 - 1.5 * IQR = 60 - 37.5 = 22.5. Upper fence = Q3 + 1.5 * IQR = 85 + 37.5 = 122.5. A score of 15 is below 22.5 and would be flagged as an outlier. A score of 120 is below 122.5, so it is NOT an outlier by this criterion.
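The same fence check in Python (a small sketch using the quartiles stated in the problem):

```python
# 1.5 * IQR rule with the quartiles from the exercise
q1, q3 = 60, 85
iqr = q3 - q1                    # 25
lower_fence = q1 - 1.5 * iqr     # 22.5
upper_fence = q3 + 1.5 * iqr     # 122.5

def is_outlier(x):
    return x < lower_fence or x > upper_fence

print(is_outlier(15))    # True  -- flagged as an outlier
print(is_outlier(120))   # False -- inside the upper fence
```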

B.1. Mean = (12 + 15 + 18 + 22 + 25 + 28 + 30 + 35 + 42 + 150) / 10 = 377 / 10 = 37.7. Median = (25 + 28) / 2 = 26.5 (average of the 5th and 6th values when sorted). The mean (37.7) is much larger than the median (26.5) because the outlier (150) pulls the mean upward. The median is the better summary of "typical" here.
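Verified in Python with the standard library:

```python
from statistics import mean, median

values = [12, 15, 18, 22, 25, 28, 30, 35, 42, 150]
print(mean(values))     # 37.7 -- pulled upward by the outlier 150
print(median(values))   # 26.5 -- resistant to the outlier
```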

B.3. z-score = (x - mean) / SD = (82 - 72) / 8 = 1.25. This means the student scored 1.25 standard deviations above the class mean. In a roughly normal distribution, this would place the student at approximately the 89th percentile.
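In Python, scipy converts the z-score to a percentile directly (assuming the roughly normal distribution stated above):

```python
from scipy import stats

x, mu, sd = 82, 72, 8
z = (x - mu) / sd                 # 1.25
percentile = stats.norm.cdf(z)    # ~0.894, i.e. about the 89th percentile
print(z, round(percentile, 3))
```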


Chapter 7: Data Wrangling

A.1. Data cleaning is essential because real-world data is messy — it contains errors, inconsistencies, missing values, and formatting problems. If you analyze dirty data, your results will be unreliable (garbage in, garbage out). Cleaning also forces you to understand your data deeply before making claims about it.

A.3. MCAR (Missing Completely At Random): data is missing for reasons unrelated to any variable. Example: a survey page stuck together in printing. MAR (Missing At Random): missingness is related to observed variables but not the missing values themselves. Example: younger respondents skip the income question more often (age predicts missingness, not income itself). MNAR (Missing Not At Random): missingness is related to the missing values. Example: high-income people are less likely to report income.

B.1. (a) Check for missing values: df.isna().sum(). (b) Check for duplicates: df.duplicated().sum(). (c) Check data types: df.dtypes. (d) Check for inconsistent categories: df['column'].value_counts(). (e) Check value ranges: df.describe() and look for impossible values (e.g., negative ages, percentages above 100).


Chapter 8: Probability Foundations

A.1. P(at least one head) = 1 - P(no heads) = 1 - P(all tails) = 1 - (0.5)^3 = 1 - 0.125 = 0.875.

A.3. The gambler's fallacy is the incorrect belief that past independent events affect future ones. After 5 heads in a row, the probability of heads on the next flip is still 0.5. The coin has no memory. The law of large numbers says the PROPORTION of heads approaches 0.5 over many flips — but this is because future flips dilute the current streak, not because future flips "correct" it.
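A quick simulation illustrates the dilution idea (a sketch with simulated flips; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
flips = rng.integers(0, 2, size=100_000)   # 1 = heads, 0 = tails

# Running proportion of heads after each flip
running_prop = np.cumsum(flips) / np.arange(1, len(flips) + 1)

print(running_prop[99])    # proportion after 100 flips -- can stray from 0.5
print(running_prop[-1])    # after 100,000 flips -- very close to 0.5
```

Early streaks are never "corrected" by opposite outcomes; they simply become a negligible fraction of a growing total.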

A.5. P(A or B) = P(A) + P(B) - P(A and B) = 0.30 + 0.20 - 0.10 = 0.40. We subtract P(A and B) because those individuals are counted in both P(A) and P(B). Without the subtraction, we'd double-count them.

B.1. (a) P(Jack) = 4/52 = 1/13 approximately 0.0769. (b) P(Heart) = 13/52 = 1/4 = 0.25. (c) P(Jack or Heart) = P(Jack) + P(Heart) - P(Jack AND Heart) = 4/52 + 13/52 - 1/52 = 16/52 = 4/13 approximately 0.3077.


Chapter 9: Conditional Probability and Bayes' Theorem

A.1. P(A|B) is the probability of A given that B has already occurred. P(B|A) is the probability of B given that A has already occurred. These are generally NOT equal. Example: P(positive test | disease) is the sensitivity (perhaps 0.95). P(disease | positive test) is the PPV (perhaps 0.10 if the disease is rare). Confusing these is the prosecutor's fallacy.

A.3. Using Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive). P(positive) = P(positive | disease) * P(disease) + P(positive | no disease) * P(no disease) = 0.99 * 0.001 + 0.05 * 0.999 = 0.00099 + 0.04995 = 0.05094. P(disease | positive) = 0.00099 / 0.05094 = 0.0194, about 1.9%. Despite the test's 99% sensitivity, a positive result from such a rare disease means only about a 2% chance of actually having it. This is the base rate fallacy in action.
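The arithmetic in Python:

```python
# Bayes' theorem with the numbers from the exercise
sens = 0.99    # P(positive | disease)
prev = 0.001   # P(disease)
fpr = 0.05     # P(positive | no disease)

p_positive = sens * prev + fpr * (1 - prev)       # 0.05094
p_disease_given_pos = sens * prev / p_positive    # ~0.0194

print(round(p_disease_given_pos, 4))
```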

B.1. Using Bayes' theorem with the tree diagram: P(defective | Machine A) = 0.02, P(defective | Machine B) = 0.05. Machine A produces 60% of items, Machine B produces 40%. P(defective) = 0.60 * 0.02 + 0.40 * 0.05 = 0.012 + 0.020 = 0.032. P(Machine A | defective) = (0.60 * 0.02) / 0.032 = 0.012 / 0.032 = 0.375. So a defective item is more likely to have come from Machine B (62.5%) despite Machine A producing more items, because Machine B has a higher defect rate.


Chapter 10: Probability Distributions and the Normal Curve

A.1. A discrete probability distribution lists all possible values and their probabilities (e.g., binomial). A continuous probability distribution uses a curve (PDF) where probability is the area under the curve (e.g., normal). For continuous distributions, P(X = any exact value) = 0; only intervals have positive probability.

A.3. The Empirical Rule was the approximation; the normal distribution is the exact model. The rule says ~68% of data falls within 1 SD of the mean. The normal distribution gives the exact probability: P(-1 < Z < 1) = 0.6827. The Empirical Rule works because many real-world distributions are approximately normal.

B.1. This is binomial with n = 20, p = 0.15. (a) P(X = 0) = C(20,0) * 0.15^0 * 0.85^20 = 0.85^20 = 0.0388. (b) P(X >= 5) = 1 - P(X <= 4). Using the binomial CDF or Python: 1 - stats.binom.cdf(4, n=20, p=0.15) = 0.1702. (c) Expected value = np = 20 * 0.15 = 3. Standard deviation = sqrt(np(1-p)) = sqrt(20 * 0.15 * 0.85) = sqrt(2.55) = 1.597.
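The same results with scipy:

```python
import math
from scipy import stats

n, p = 20, 0.15
p_zero = stats.binom.pmf(0, n, p)               # 0.85**20 ~ 0.0388
p_at_least_5 = 1 - stats.binom.cdf(4, n, p)     # ~0.170
ev = n * p                                      # 3.0
sd = math.sqrt(n * p * (1 - p))                 # ~1.597
print(p_zero, p_at_least_5, ev, sd)
```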


Chapter 11: Sampling Distributions and the CLT

A.1. A sampling distribution is the distribution of a statistic (like the sample mean) computed from ALL possible random samples of the same size from the same population. It describes how much the statistic varies from sample to sample. This is different from the population distribution (which describes individual values) and from a single sample's distribution.

A.3. Standard error = sigma / sqrt(n). SE is large when the population SD is large (more variability) or when n is small (less information). Doubling n does NOT halve SE — it divides SE by sqrt(2), approximately 1.41. To halve SE, you must quadruple n. This is the "diminishing returns" of sample size.

B.1. SE = sigma / sqrt(n) = 15 / sqrt(36) = 15 / 6 = 2.5. By the CLT, the sampling distribution of x-bar is approximately normal with mean 100 and standard error 2.5. P(x-bar > 105) = P(Z > (105 - 100) / 2.5) = P(Z > 2.0) = 1 - 0.9772 = 0.0228. There is about a 2.3% chance of getting a sample mean above 105.
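In Python:

```python
import math
from scipy import stats

sigma, n, mu = 15, 36, 100
se = sigma / math.sqrt(n)                       # 2.5
p = stats.norm.sf(105, loc=mu, scale=se)        # P(x-bar > 105) ~ 0.0228
print(se, round(p, 4))
```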


Chapter 12: Confidence Intervals

A.1. A 95% confidence interval means that if we repeated the sampling and CI construction process many times, approximately 95% of the resulting intervals would contain the true population parameter. It does NOT mean there is a 95% probability the parameter is in THIS specific interval.

A.3. Increasing the confidence level (say from 95% to 99%) makes the interval wider. Higher confidence requires a larger critical value (1.960 vs. 2.576 for z), which increases the margin of error. You gain more confidence at the cost of less precision. The tradeoff: wider interval = more confident but less informative.

B.1. n = 50, x-bar = 72, s = 12, confidence = 95%. SE = s / sqrt(n) = 12 / sqrt(50) = 1.697. df = 49, t* = 2.010. MOE = 2.010 * 1.697 = 3.411. CI = (72 - 3.411, 72 + 3.411) = (68.59, 75.41). We are 95% confident that the true population mean is between 68.59 and 75.41.
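With scipy:

```python
import math
from scipy import stats

n, xbar, s = 50, 72, 12
se = s / math.sqrt(n)                       # ~1.697
t_star = stats.t.ppf(0.975, df=n - 1)       # ~2.010
moe = t_star * se                           # ~3.41
ci = (xbar - moe, xbar + moe)               # ~(68.59, 75.41)
print(ci)
```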


Chapter 13: Hypothesis Testing

A.1. Hypothesis testing works like a courtroom trial. The null hypothesis (H0) is "the defendant is innocent" — assumed true until proven otherwise. The sample data is the evidence presented by the prosecution. The p-value measures how surprising the evidence would be if the defendant really were innocent. If the evidence is surprising enough (p-value below alpha), we reject the null (convict). If not, we fail to reject (acquit). Importantly, "not guilty" doesn't mean "innocent."

A.3. (a) False — a small p-value means the EVIDENCE against H0 is strong, not that the EFFECT is large. You can get p = 0.001 with a tiny effect if n is large enough. (b) False — failing to reject H0 means we don't have enough evidence; it doesn't mean H0 is true. (c) False — two studies with p < 0.05 could have very different effect sizes. (d) True — alpha must be set before looking at data to avoid biasing the decision. (e) True — smaller p-values mean the data is more inconsistent with H0. (f) False — p-values range from 0 to 1.

B.1. (a) H0: mu = 1200, Ha: mu < 1200. One-tailed (left). (b) H0: mu = 72, Ha: mu > 72. One-tailed (right). (c) H0: mu = 25, Ha: mu > 25. One-tailed (right). (d) H0: p = 0.78, Ha: p != 0.78. Two-tailed. (e) H0: p = 0.15, Ha: p > 0.15. One-tailed (right).


Chapter 14: Inference for Proportions

A.1. Success-failure condition: np0 >= 10 and n(1 - p0) >= 10 for tests, or np-hat >= 10 and n(1 - p-hat) >= 10 for CIs. This ensures the sampling distribution of p-hat is approximately normal (by the CLT). If the condition fails, use an exact binomial test instead.

B.1. H0: p = 0.50, Ha: p != 0.50 (two-tailed). p-hat = 320/600 = 0.5333. Check conditions: np0 = 300 >= 10 and n(1-p0) = 300 >= 10. SE = sqrt(0.50 * 0.50 / 600) = 0.0204. z = (0.5333 - 0.50) / 0.0204 = 1.63. p-value = 2 * P(Z > 1.63) = 0.103. At alpha = 0.05, we fail to reject H0. There is not sufficient evidence to conclude that the coin is biased.
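With scipy, carrying full precision (intermediate rounding can shift z and the p-value slightly):

```python
import math
from scipy import stats

n, successes, p0 = 600, 320, 0.50
p_hat = successes / n                     # ~0.5333
se = math.sqrt(p0 * (1 - p0) / n)         # ~0.0204
z = (p_hat - p0) / se                     # ~1.63
p_value = 2 * stats.norm.sf(abs(z))       # ~0.102
print(round(z, 3), round(p_value, 3))
```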


Chapter 15: Inference for Means

A.1. When sigma is unknown (which is almost always), we estimate it with s, and this extra uncertainty is accounted for by using the t-distribution instead of the z-distribution. The t-distribution has heavier tails, producing wider confidence intervals and larger critical values, which reflects the additional uncertainty from estimating sigma.

B.1. H0: mu = 500, Ha: mu != 500. n = 25, x-bar = 520, s = 50. SE = 50 / sqrt(25) = 10. t = (520 - 500) / 10 = 2.00. df = 24. From the t-table, the two-tailed p-value for t = 2.00 with df = 24 is approximately 0.057. At alpha = 0.05, we fail to reject H0 (barely). However, this is borderline — the effect size is d = 20/50 = 0.40 (small-to-medium), and a slightly larger sample might yield significance. The 95% CI is 520 +/- 2.064 * 10 = (499.4, 540.6), which barely includes 500.
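Because only summary statistics are given, the test is computed by hand in Python here (scipy.stats.ttest_1samp needs the raw data):

```python
import math
from scipy import stats

n, xbar, s, mu0 = 25, 520, 50, 500
se = s / math.sqrt(n)                          # 10.0
t = (xbar - mu0) / se                          # 2.0
p_value = 2 * stats.t.sf(abs(t), df=n - 1)     # ~0.057

t_star = stats.t.ppf(0.975, df=n - 1)          # ~2.064
ci = (xbar - t_star * se, xbar + t_star * se)  # ~(499.4, 540.6)
print(round(p_value, 3), ci)
```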


Chapter 16: Comparing Two Groups

A.1. Independent samples come from two separate, unrelated groups (e.g., men vs. women, treatment vs. control). Paired samples have natural pairings (e.g., before/after on the same person, left eye vs. right eye). The paired design controls for individual differences, making it more powerful when pairs are similar.

B.1. This is a two-sample t-test (independent groups, Welch's). H0: mu_new - mu_old = 0, Ha: mu_new - mu_old != 0. SE = sqrt(8^2/40 + 10^2/40) = sqrt(1.6 + 2.5) = sqrt(4.1) = 2.025. t = (75 - 71) / 2.025 = 1.975. With Welch's df (approximately 74), the two-tailed p-value is approximately 0.052. At alpha = 0.05, this is borderline — we barely fail to reject. Cohen's d = 4 / sqrt((8^2 + 10^2)/2) = 4 / 9.06 = 0.44 (small-to-medium effect).
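scipy can run Welch's test directly from the summary statistics:

```python
from scipy import stats

# Summary statistics from the exercise (new method vs. old method)
res = stats.ttest_ind_from_stats(
    mean1=75, std1=8, nobs1=40,
    mean2=71, std2=10, nobs2=40,
    equal_var=False,              # Welch's t-test
)
print(res.statistic, res.pvalue)  # t ~ 1.98, p ~ 0.052
```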


Chapter 17: Power, Effect Sizes, and What "Significant" Really Means

A.1. Statistical power is the probability of correctly rejecting H0 when it is actually false: Power = 1 - beta. It depends on three things: (1) the significance level alpha (larger alpha = more power), (2) the true effect size (larger effect = easier to detect = more power), and (3) the sample size (larger n = more power). The standard target is 80% power.

A.3. With 1 million users, even a 0.01% difference in click-through rates would likely be statistically significant (p < 0.05) because the standard error shrinks with large n. But a 0.01% difference is practically meaningless — no business decision should hinge on it. This is why effect sizes and confidence intervals matter: they tell you HOW BIG the effect is, not just whether it exists.

B.1. Using the power analysis formula or TTestIndPower: effect_size = 0.3 (small-to-medium), alpha = 0.05, power = 0.80. Required n per group = approximately 175. If the researcher can only recruit 50 per group, the power would be approximately 0.32 — the study would have only a 32% chance of detecting the effect even if it exists. An underpowered study is worse than no study at all, because a non-significant result would be uninformative.
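One way to reproduce these numbers without a dedicated power package is the noncentral t distribution; this is a sketch (the helper power_two_sample_t is ours, not a library routine), and statsmodels' TTestIndPower reports essentially the same values:

```python
import math
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample t-test."""
    df = 2 * n_per_group - 2
    nc = d * math.sqrt(n_per_group / 2)      # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return (stats.nct.sf(t_crit, df, nc)
            + stats.nct.cdf(-t_crit, df, nc))

print(power_two_sample_t(0.3, 175))   # ~0.80 -- the planned study
print(power_two_sample_t(0.3, 50))    # ~0.32 -- the underpowered study
```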


Chapter 18: The Bootstrap and Simulation-Based Inference

A.1. The bootstrap works by treating the observed sample as a stand-in for the population. We draw many resamples (with replacement) from our data, compute the statistic of interest for each resample, and use the distribution of those resampled statistics to estimate the sampling distribution. This works because if the sample is representative of the population, then the bootstrap distribution approximates the true sampling distribution.
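A minimal bootstrap sketch (the data here is hypothetical, chosen only to illustrate the mechanics):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([12, 15, 18, 22, 25, 28, 30, 35, 42])  # hypothetical data

# Resample with replacement many times; record the mean of each resample
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(10_000)
])

# Percentile bootstrap 95% CI for the mean
ci = np.percentile(boot_means, [2.5, 97.5])
print(ci)
```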

A.3. In a permutation test, we randomly shuffle the group labels and compute the test statistic many times to build a null distribution. If the observed test statistic is extreme relative to this null distribution, we reject H0. The key assumption: under H0, the group labels don't matter, so any permutation is equally likely.
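A minimal permutation-test sketch (again with hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(1)
group_a = np.array([75.0, 80, 68, 90, 72, 78])   # hypothetical scores
group_b = np.array([65.0, 70, 62, 74, 68, 66])

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

# Shuffle labels many times to build the null distribution
n_perm = 10_000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:n_a].mean() - shuffled[n_a:].mean()
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / n_perm
print(observed, p_value)
```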


Chapter 19: Chi-Square Tests

A.1. Chi-square goodness-of-fit tests whether a single categorical variable follows a hypothesized distribution (e.g., are M&M colors evenly distributed?). Chi-square test of independence tests whether two categorical variables are associated (e.g., is gender associated with voting preference?). Both use the same test statistic: chi-squared = sum of (O-E)^2/E.

B.1. H0: The die is fair (each face has probability 1/6). Ha: The die is not fair. Expected frequency for each face: 120/6 = 20. Chi-squared = (25-20)^2/20 + (17-20)^2/20 + (22-20)^2/20 + (18-20)^2/20 + (14-20)^2/20 + (24-20)^2/20 = 1.25 + 0.45 + 0.20 + 0.20 + 1.80 + 0.80 = 4.70. df = 6 - 1 = 5. Critical value at alpha = 0.05: 11.07. Since 4.70 < 11.07, we fail to reject H0. There is not sufficient evidence to conclude the die is unfair.
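scipy performs the whole calculation (equal expected frequencies are the default):

```python
from scipy import stats

observed = [25, 17, 22, 18, 14, 24]    # counts from 120 rolls
chi2, p = stats.chisquare(observed)    # expected = 20 per face by default
print(chi2, p)                         # chi2 = 4.7, p ~ 0.45
```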


Chapter 20: ANOVA

A.1. If we compare 3 groups using three separate t-tests (A vs. B, A vs. C, B vs. C) at alpha = 0.05 each, the probability of at least one Type I error is 1 - (0.95)^3 = 0.143, almost three times the intended 5%. With 5 groups there are C(5,2) = 10 pairwise tests, so it's 1 - (0.95)^10 = 0.401 — a 40% chance of a false positive. ANOVA solves this by testing all groups simultaneously with a single F-test at alpha = 0.05.

B.1. Grand mean = (50 * 20 + 55 * 20 + 60 * 20) / 60 = 3300 / 60 = 55. SS_Between = 20 * (50-55)^2 + 20 * (55-55)^2 + 20 * (60-55)^2 = 20 * 25 + 0 + 20 * 25 = 1000. MS_Between = 1000 / (3-1) = 500. Given SS_Within = 5700, MS_Within = 5700 / (60-3) = 100. F = 500 / 100 = 5.00. df1 = 2, df2 = 57. Critical value at alpha = 0.05 approximately 3.16. Since 5.00 > 3.16, reject H0. At least one group mean differs significantly. Eta-squared = 1000 / (1000 + 5700) = 0.149 (medium effect).
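The same table in Python, computed from the summaries given in the exercise:

```python
from scipy import stats

ns = [20, 20, 20]
means = [50, 55, 60]
grand = sum(n * m for n, m in zip(ns, means)) / sum(ns)           # 55.0

ss_between = sum(n * (m - grand) ** 2 for n, m in zip(ns, means)) # 1000
ss_within = 5700                                  # given in the exercise
df1, df2 = len(ns) - 1, sum(ns) - len(ns)         # 2, 57

f_stat = (ss_between / df1) / (ss_within / df2)   # 5.0
p_value = stats.f.sf(f_stat, df1, df2)            # ~0.010
eta_sq = ss_between / (ss_between + ss_within)    # ~0.149
print(f_stat, round(p_value, 4), round(eta_sq, 3))
```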


Chapter 21: Nonparametric Methods

A.1. Use nonparametric methods when: (1) data is ordinal rather than interval/ratio, (2) the sample is small AND the distribution is clearly non-normal, (3) there are extreme outliers that cannot be justified for removal, or (4) the distribution is heavily skewed with a small sample. The tradeoff: nonparametric tests have less power than parametric tests when assumptions ARE met, but they are valid when assumptions are violated.


Chapter 22: Correlation and Simple Linear Regression

A.1. (a) Strong positive correlation (r near +1): as study hours increase, test scores tend to increase. (b) Near-zero correlation (r near 0): hours of TV watched and shoe size are unrelated. (c) Moderate negative correlation (r near -0.5): as distance from campus increases, the number of library visits tends to decrease.

A.3. r = 0.90 and r-squared = 0.81. Interpretation: height explains 81% of the variation in weight in this sample. The remaining 19% is due to other factors (body composition, age, fitness level, diet, etc.).

B.1. Slope = b1 = r * (sy / sx) = 0.72 * (8 / 5) = 1.152. Intercept = b0 = y-bar - b1 * x-bar = 78 - 1.152 * 15 = 78 - 17.28 = 60.72. Regression equation: y-hat = 60.72 + 1.15x. Interpretation: for each additional hour of study, the predicted exam score increases by about 1.15 points. A student who doesn't study at all is predicted to score about 60.7 (the intercept — though this is an extrapolation if no students studied 0 hours). R-squared = 0.72^2 = 0.518; study hours explain about 52% of the variation in exam scores.
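In Python (predict is an illustrative helper, not a library function):

```python
# Summary statistics from the exercise
r, sx, sy = 0.72, 5, 8
xbar, ybar = 15, 78

b1 = r * sy / sx            # slope ~ 1.152
b0 = ybar - b1 * xbar       # intercept ~ 60.72
r_squared = r ** 2          # ~0.518

def predict(hours):
    return b0 + b1 * hours

print(predict(10))          # predicted score after 10 hours of study
```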


Chapter 23: Multiple Regression

A.1. "Holding other variables constant" means: for a one-unit increase in X1, the predicted Y changes by b1 WHEN all other predictors remain at fixed values. It allows us to isolate the unique contribution of each predictor. This is the regression equivalent of a controlled experiment — we're statistically controlling for confounders rather than physically holding them constant.


Chapter 24: Logistic Regression

A.1. Linear regression predicts a continuous outcome and can produce predictions outside the 0-1 range (e.g., a predicted probability of -0.3 or 1.4, which is nonsensical). Logistic regression uses the sigmoid function to constrain predictions between 0 and 1, making it appropriate for binary outcomes.

A.3. An odds ratio of 1.42 for prior admissions means: for each additional prior admission, the odds of readmission increase by a factor of 1.42, holding all other variables constant. In percentage terms: the odds increase by 42%. This does NOT mean the probability increases by 42% — odds and probability are different scales.


Chapter 25: Communicating with Data

A.1. A truncated y-axis starts above zero, making small differences appear dramatic. A cherry-picked time window selects only the dates that support a preferred narrative. Both techniques are dishonest because they exploit visual perception to create a false impression. The fix: always start bar chart y-axes at zero, and show the full time range (or clearly explain why a subset was chosen).


Chapter 26: Statistics and AI

A.1. AI systems use statistics constantly: training a model is regression or classification on data; evaluating a model uses confusion matrices, accuracy metrics, and hypothesis tests; the data used for training is a sample of reality and carries all the biases and limitations that any sample does. Understanding statistics means understanding the foundations — and the failure modes — of AI.


Chapter 27: Ethical Data Practice

A.1. Simpson's paradox occurs when a trend in aggregated data reverses when the data is broken into subgroups. The UC Berkeley admissions example: overall, men were admitted at a higher rate than women. But within most individual departments, women were admitted at a slightly higher rate. The paradox arose because women applied disproportionately to more competitive departments. The lesson: always consider whether aggregated statistics hide important subgroup differences.


Chapter 28: Your Statistical Journey Continues

A.1. [This is a reflection question — no single correct answer.] A strong response would trace the arc from descriptive statistics (summarizing what you have) through probability (quantifying uncertainty) to inference (making claims about what you don't have). The key transformation: from trusting numbers at face value to interrogating where they came from, what they assume, and who they affect.