Chapter 21 Quiz: Data Journalism and Statistical Literacy
Instructions: Answer each question to the best of your ability. Questions marked (M) are multiple choice; (T/F) are true/false; (SA) are short answer. Answers are hidden in collapsible sections — attempt each question before revealing the answer.
Section 1: Data Journalism Institutions
Question 1 (M): FiveThirtyEight became prominent for which type of journalistic contribution?
(A) Investigative reporting on government corruption (B) Probabilistic election forecasting using polling aggregation (C) Creating interactive maps of economic inequality (D) Developing the Freedom of Information Act as a reporting tool
Reveal Answer
**Answer: (B)** FiveThirtyEight, founded by Nate Silver, gained widespread recognition for aggregating polls and presenting election forecasts as probabilities rather than simple predictions. Its 2008 and 2012 presidential forecasts, which correctly called nearly every state outcome, demonstrated that rigorous quantitative methods could outperform pundits relying on intuition.

Question 2 (M): ProPublica's "Machine Bias" series investigated:
(A) Racial bias in social media recommendation algorithms (B) Racial disparities in an algorithmic criminal sentencing tool (C) Gender bias in hiring software used by Fortune 500 companies (D) Political bias in search engine result rankings
Reveal Answer
**Answer: (B)** ProPublica's "Machine Bias" series, published in 2016, analyzed the COMPAS algorithm used by courts to predict recidivism (re-offending). The investigation found that the algorithm was nearly twice as likely to falsely flag Black defendants as higher-risk compared to white defendants, and conversely, it more often labeled white defendants lower-risk when they subsequently re-offended. This landmark piece of data journalism spurred extensive debate about algorithmic fairness and criminal justice.

Question 3 (T/F): The Guardian Data Blog pioneered the practice of making its underlying datasets publicly available alongside stories, allowing readers to verify and independently analyze reported data.
Reveal Answer
**Answer: True** The Guardian Data Blog, established around 2009, was notable for publishing not just data-driven stories but the raw data itself (hosted on platforms like Google Sheets and data repositories). This commitment to data transparency allowed readers, other journalists, and researchers to independently check the Guardian's analysis and perform their own investigations — a significant contribution to open journalism.

Question 4 (SA): What distinguishes data journalism from traditional investigative journalism? Give one specific example of something data journalism can reveal that traditional narrative journalism cannot.
Reveal Answer
**Sample Answer:** Traditional journalism excels at documenting specific cases through interviews, documents, and observations. Data journalism adds the ability to systematically measure the magnitude and prevalence of a phenomenon — to determine whether an anecdotal case is typical or exceptional. **Example:** ProPublica's "Surgeon Scorecard" identified surgeons with abnormally high complication rates across thousands of procedures. No amount of patient interviews or surgical observations could reveal this pattern; only systematic database analysis of surgical outcomes records could demonstrate that specific surgeons had statistically elevated complication rates compared to peers performing similar procedures. Other valid examples: FiveThirtyEight demonstrating systematic patterns in police use of force data; The Upshot revealing geographic patterns in income mobility invisible at the individual level.

Section 2: Statistical Literacy Basics
Question 5 (M): In a highly right-skewed income distribution, which of the following is always true?
(A) The mean equals the median (B) The mean is greater than the median (C) The median is greater than the mean (D) The mean and median are not mathematically comparable
Reveal Answer
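One way to see the result before reading the explanation: simulate a right-skewed income distribution and compare the two statistics. The lognormal parameters below are illustrative assumptions, not figures from the chapter.

```python
import random
import statistics

# Simulate a right-skewed "income" distribution (lognormal), then
# compare mean and median. Parameters are illustrative, not real data.
random.seed(42)
incomes = [random.lognormvariate(10.5, 0.8) for _ in range(100_000)]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# The long right tail pulls the mean above the median.
print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")
```

With these parameters the sample mean comes out well above the sample median, the characteristic signature of right skew.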
**Answer: (B)** In a right-skewed (positively skewed) distribution, the long right tail — created by high-value outliers like very wealthy households — pulls the mean upward while the median remains anchored near the center of the distribution, so the mean exceeds the median. This is why income and wealth distributions are typically reported using median figures when the goal is to capture typical experience.

Question 6 (M): A new cancer drug reduces the risk of cancer recurrence from 8% to 4% over five years. The drug company's press release claims the drug "reduces recurrence risk by 50%." This figure represents:
(A) The absolute risk reduction (B) The relative risk reduction (C) The number needed to treat (D) The odds ratio
Reveal Answer
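The quantities named in the options can all be computed from the question's figures; a minimal sketch:

```python
# Control and treatment recurrence rates from the question.
control_risk = 0.08
treated_risk = 0.04

arr = control_risk - treated_risk  # absolute risk reduction (4 points)
rrr = arr / control_risk           # relative risk reduction
nnt = 1 / arr                      # number needed to treat

print(f"ARR: {arr:.0%} (percentage points)")
print(f"RRR: {rrr:.0%}")
print(f"NNT: {nnt:.0f}")
```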
**Answer: (B)** The 50% figure is the relative risk reduction (RRR): the proportional change in risk from the control group rate (8%) to the treatment group rate (4%). Relative risk reduction = (8% - 4%) / 8% = 50%. The absolute risk reduction (ARR) is 8% - 4% = 4 percentage points. The number needed to treat (NNT) = 1/ARR = 1/0.04 = 25. Pharmaceutical companies typically prefer relative risk framing because it produces larger-sounding numbers that are more impressive without the context of the baseline risk.

Question 7 (M): A rare disease affects 1 in 1,000 people. A test for the disease has 95% sensitivity and 95% specificity. If a randomly selected person tests positive, approximately what is the probability they actually have the disease?
(A) 95% (B) 50% (C) 2% (D) Less than 2%
Reveal Answer
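A quick natural-frequency computation of the scenario, using the numbers straight from the question:

```python
# Base rate fallacy arithmetic: rare disease, accurate test.
population = 100_000
prevalence = 1 / 1_000
sensitivity = 0.95
specificity = 0.95

diseased = population * prevalence        # 100 people
true_pos = diseased * sensitivity         # ~95
healthy = population - diseased           # 99,900 people
false_pos = healthy * (1 - specificity)   # ~4,995

# Positive predictive value: P(disease | positive test)
ppv = true_pos / (true_pos + false_pos)
print(f"PPV: {ppv:.1%}")
```

Even with a 95%-accurate test, false positives swamp true positives at this prevalence.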
**Answer: (D) Less than 2%** Working through Bayes' theorem with a population of 100,000:
- People with the disease: 100 (1 in 1,000)
- True positives (95% of 100): ~95
- People without the disease: 99,900
- False positives (5% of 99,900): ~4,995

Total positive tests: 95 + 4,995 = 5,090. Positive predictive value: 95 / 5,090 ≈ 1.9%. So despite the apparently high test accuracy, the positive predictive value is less than 2%. This counterintuitive result — called the base rate fallacy or base rate neglect — arises because the very low prevalence of the disease means false positives vastly outnumber true positives even for a highly accurate test.

Question 8 (T/F): The margin of error reported by a poll captures all sources of polling error, including non-response bias and question wording effects.
Reveal Answer
**Answer: False** The margin of error (at a given confidence level) captures only sampling error — the uncertainty introduced by drawing a random sample rather than surveying the entire population. It does not capture non-sampling errors, including:
- Coverage error (not all population members are reachable)
- Non-response bias (people who refuse to participate differ from those who agree)
- Question wording effects
- Social desirability bias
- Interviewer effects
- Weighting errors

Polls can be substantially wrong even when their stated margin of error is small, if non-sampling errors are large. The 2016 and 2020 US presidential polls, which missed results in key states by more than their margins of error, illustrate this problem — primarily due to differential non-response by education level.

Section 3: Statistical Misuse
Question 9 (M): Which of the following best describes "p-hacking"?
(A) Fraudulently fabricating data to achieve a desired p-value (B) Selecting which statistical test to use only after seeing which one produces p < 0.05 (C) Analyzing data in multiple ways until a statistically significant result is found, exploiting researcher degrees of freedom (D) Using a one-tailed test when a two-tailed test is more appropriate
Reveal Answer
**Answer: (C)** P-hacking refers to the practice — which can be unconscious — of exploring multiple analytical choices (which covariates to include, how to define outcomes, whether to exclude outliers, which subgroups to analyze, when to stop data collection) until a p < 0.05 result is achieved. Each analytical choice creates a "researcher degree of freedom" that increases the probability of finding a spurious significant result. Option (B) describes a specific form of p-hacking (selective test choice), but (C) is the broader and more complete definition. Option (A) describes data fraud, which is distinct from p-hacking.

Question 10 (M): A politician begins a term in office when unemployment is at 7.2% and leaves when it is at 4.8%. Critics note that the labor force participation rate declined from 64.1% to 61.9% over the same period. The most likely explanation is:
(A) More people voluntarily retired or pursued education (B) Discouraged workers left the labor force, reducing unemployment without improving employment (C) The economy created jobs faster than the working-age population grew (D) Women disproportionately entered the labor force, which paradoxically reduced the participation rate
Reveal Answer
**Answer: (B)** When both the unemployment rate and the labor force participation rate fall simultaneously, it typically indicates that discouraged workers — people who want employment but have stopped actively searching — are leaving the labor force. Since the official unemployment rate (U-3) counts only those actively seeking work as unemployed, people who stop searching are reclassified from "unemployed" to "not in the labor force," which arithmetically reduces the unemployment rate without reflecting actual improvement in job creation. The decline in labor force participation from 64.1% to 61.9% represents millions of people who left the measured workforce and are not counted in the official unemployment figure.

Question 11 (T/F): A correlation coefficient of 0.85 between two variables proves that one variable causes the other.
Reveal Answer
**Answer: False** Correlation, regardless of its magnitude, does not establish causation. A strong correlation (positive or negative) is consistent with multiple causal structures: A causes B, B causes A, a third variable C causes both A and B, or the variables co-vary by chance in the dataset (especially in small samples or when multiple correlations have been examined). Establishing causation requires more than correlation: it requires eliminating alternative explanations (confounders), preferably through randomized experimental design or sophisticated quasi-experimental observational methods (regression discontinuity, instrumental variables, natural experiments). The correlation between ice cream sales and drowning deaths is approximately 0.85 — hot weather drives both, and obviously no direct causal relationship exists.

Question 12 (M): The Open Science Collaboration (2015) attempted to replicate 100 published psychology studies. Approximately what fraction replicated with a significant result in the same direction?
(A) 97% (B) 77% (C) 60% (D) 39%
Reveal Answer
**Answer: (D) 39%** The Open Science Collaboration, published in *Science* in 2015, found that approximately 39% of the 100 replicated studies achieved statistical significance (p < 0.05) in the same direction as the original study. Using other replication criteria (subjective assessment by replication teams, comparison of effect sizes), somewhat higher replication rates were observed, but the overall picture was of widespread failure to replicate. Replication failures were concentrated in social psychology, which had lower replication rates than cognitive psychology. This landmark paper launched widespread discussion of the replication crisis in behavioral science.

Section 4: Polling Methodology
Question 13 (M): Which type of poll is NOT actually a poll but rather a form of targeted political advertising?
(A) Exit poll (B) Push poll (C) Tracking poll (D) Robo-poll
Reveal Answer
**Answer: (B)** Push polls are not genuine opinion measurement instruments. They are political communications that use the format of a survey to "push" respondents toward a particular conclusion by exposing them to negative (often false or misleading) information about a candidate or policy under the guise of asking questions (e.g., "Would you be more or less likely to support Candidate X if you knew she supported releasing violent criminals from prison?"). Push polls typically involve very large numbers of contacts (hundreds of thousands), do not record or report data in any meaningful way, and are designed to influence rather than measure opinion. Genuine polls — including exit polls, tracking polls, and robo-polls — all have the primary purpose of measuring opinion.

Question 14 (SA): Explain why opt-in internet panels produce a stated margin of error that is essentially meaningless for inferring population parameters.
Reveal Answer
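For context, the standard margin-of-error formula can be computed directly. This is a sketch of the textbook 95% calculation, which is valid only under probability sampling:

```python
import math

# Worst-case (p = 0.5) margin of error for a simple random sample at
# 95% confidence: z * sqrt(p(1-p)/n), roughly 1/sqrt(n).
def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    return z * math.sqrt(p * (1 - p) / n)

for n in (1_000, 10_000):
    print(f"n = {n:>6}: MOE = +/-{margin_of_error(n):.1%}")

# Applying this formula to an opt-in panel yields a number with no
# inferential meaning: the probability-sampling assumption fails.
```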
**Sample Answer:** The margin of error formula (approximately ±1/√n at 95% confidence) is derived from probability sampling theory, which assumes that every member of the target population has a known, nonzero probability of being included in the sample. This assumption allows us to use statistical theory to quantify how much sample estimates are expected to vary from the true population value.

Opt-in internet panels violate this assumption fundamentally. People who sign up for a panel in exchange for compensation are a self-selected group whose characteristics differ in unknown ways from the broader population. Because the selection probability is unknown and varies across individuals, the theoretical foundation for calculating a margin of error does not apply.

A poll of n = 10,000 opt-in panel respondents might report a "margin of error of ±1%" — but this figure is mathematically meaningless because the formula assumes probability sampling. The true uncertainty (from the combination of non-probability selection and unmeasured differences between panel members and non-members) could be many times larger and is fundamentally unquantifiable from the data alone.

Question 15 (T/F): Stratified sampling is a form of probability sampling that can reduce sampling variance compared to simple random sampling.
Reveal Answer
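A Monte Carlo sketch (all parameters illustrative) of why stratification helps: with two internally homogeneous strata, the stratified estimator of the population mean varies much less across repeated samples than simple random sampling does.

```python
import random
import statistics

# Population of two homogeneous strata with very different means.
random.seed(1)
stratum_a = [random.gauss(30, 5) for _ in range(5_000)]
stratum_b = [random.gauss(70, 5) for _ in range(5_000)]
population = stratum_a + stratum_b

def srs_estimate(n: int) -> float:
    """Simple random sample mean."""
    return statistics.mean(random.sample(population, n))

def stratified_estimate(n: int) -> float:
    """Proportionate stratified sample mean (equal-size strata)."""
    half = n // 2
    return statistics.mean(random.sample(stratum_a, half)
                           + random.sample(stratum_b, half))

srs = [srs_estimate(100) for _ in range(500)]
strat = [stratified_estimate(100) for _ in range(500)]
print(f"SRS sd:        {statistics.stdev(srs):.2f}")
print(f"Stratified sd: {statistics.stdev(strat):.2f}")
```

The stratified estimates cluster far more tightly because between-stratum variance is eliminated by design.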
**Answer: True** Stratified random sampling divides the population into homogeneous subgroups (strata) — such as age, gender, or region — and then draws independent random samples within each stratum. Because within-stratum variance is typically lower than overall population variance, stratified sampling can produce more precise estimates (smaller sampling error) for the same total sample size compared to simple random sampling. It also guarantees representation of important subgroups, which simple random sampling may undersample by chance. Stratified sampling remains a form of probability sampling as long as the within-stratum selection is random.

Section 5: Data Visualization
Question 16 (M): A bar chart comparing two government programs shows Program A at 81% effectiveness and Program B at 86% effectiveness. The chart's y-axis runs from 79% to 88%. Visually, Program B's bar appears to be about four times as tall as Program A's. What is the actual ratio of their effectiveness rates?
(A) 4:1 (B) 2:1 (C) Approximately 1.06:1 (D) Approximately 1.5:1
Reveal Answer
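The two competing ratios can be checked directly with the question's numbers; a minimal sketch:

```python
# Truncated-axis arithmetic from the question.
a, b = 0.81, 0.86   # effectiveness rates
baseline = 0.79     # truncated y-axis starting point

actual_ratio = b / a                            # what the data say
visual_ratio = (b - baseline) / (a - baseline)  # what the bars show

print(f"actual ratio of rates: {actual_ratio:.2f}")
print(f"visual ratio of bars:  {visual_ratio:.2f}")
```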
**Answer: (C) Approximately 1.06:1** The actual ratio of effectiveness rates is 86/81 ≈ 1.06:1 — Program B is about 6% more effective than Program A in relative terms (an absolute difference of 5 percentage points). The visual ratio of bar heights on the truncated chart (y-axis from 79–88%) reflects position within the displayed range: Program A is 2 percentage points above the baseline (81 − 79 = 2), Program B is 7 percentage points above the baseline (86 − 79 = 7). The visual ratio is 7/2 = 3.5 — about 3.5 times taller, not quite the 4 times described, and dramatically overstating the actual difference in effectiveness. This is exactly why zero-baseline bar charts are important: the bar height is supposed to represent the total quantity, not just the amount above an arbitrary baseline.

Question 17 (M): Choropleth maps of election results frequently overrepresent rural areas because:
(A) Rural voters are systematically more politically active (B) Geographic area and population density are uncorrelated, so sparsely populated large areas dominate the visual display (C) Rural county election data is more accurately reported (D) Map projections systematically enlarge areas near the poles where rural areas are concentrated
Reveal Answer
**Answer: (B)** The fundamental problem with choropleth election maps is that geographic area has nothing to do with the number of votes. Sparsely populated rural counties can cover enormous geographic areas — and when colored according to election outcomes, they dominate the visual impression. Densely populated urban areas that may contain millions of voters occupy tiny slivers of geographic space. A county-level map of US presidential election results that shows vast swaths of red across the Great Plains and Mountain West (and small blue islands in major cities) accurately depicts the geographic distribution of voting outcomes but creates a deeply misleading impression of the partisan distribution of the electorate. Solutions include cartograms (where area is proportional to population or votes) or dot density maps.

Question 18 (T/F): Edward Tufte's concept of "data-ink ratio" suggests that the most effective data visualizations minimize the amount of visual information displayed.
Reveal Answer
**Answer: False** Tufte's "data-ink ratio" does not suggest minimizing visual information — it suggests maximizing the proportion of a graphic's ink that is devoted to representing data, as opposed to non-data ink (decorative elements, redundant labeling, chart junk). The goal is to maximize the amount of data communicated per unit of visual complexity, not to reduce total information. A visualization with high data-ink ratio eliminates non-informative embellishments (3D effects, background patterns, excessive gridlines) while preserving or increasing the amount of data displayed. The concept supports information-dense, clean visualizations — not sparse ones.

Section 6: Scientific Studies
Question 19 (M): A study finds that a new teaching method significantly improves test scores (p = 0.001) with an effect size of Cohen's d = 0.08. The most accurate interpretation is:
(A) The teaching method has a large, practically meaningful effect (B) The result is highly statistically significant but the effect is negligibly small (C) The effect is meaningful but would not replicate in other schools (D) The p-value is too small to be credible
Reveal Answer
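A back-of-envelope check of how this combination arises, using the normal approximation z ≈ d·√(n/2) for a two-sample comparison with n subjects per group (a sketch; the study's actual design is not given in the question):

```python
# How large must each group be for an effect of d = 0.08 to reach
# z ~ 3.29, i.e. two-tailed p ~ 0.001? Normal-approximation sketch.
d = 0.08
z_target = 3.29

n_per_group = 2 * (z_target / d) ** 2
print(f"n per group needed: {n_per_group:.0f}")
```

A few thousand subjects per group are enough to make a negligible effect "highly significant."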
**Answer: (B)** A p-value of 0.001 indicates strong evidence against the null hypothesis (the result is unlikely to be due to chance), but says nothing about the practical importance of the effect. Cohen's d = 0.08 is considered very small by conventional standards (Cohen characterized d = 0.2 as small, d = 0.5 as medium, d = 0.8 as large). A d of 0.08 represents a tiny difference in test scores. The extreme statistical significance (p = 0.001) is consistent with d = 0.08 if the study has a very large sample size — because large samples can detect even tiny true effects with high precision. This combination — very significant, very small — is increasingly common as datasets grow larger, and it illustrates why reporting effect sizes alongside p-values is essential.

Question 20 (SA): What is a confounder, and why can observational studies not fully eliminate confounding even with sophisticated statistical controls?
Reveal Answer
**Sample Answer:** A confounder is a variable that is associated with both the exposure (e.g., coffee drinking) and the outcome (e.g., heart disease risk), and that is not on the causal pathway between them. When a confounder is uncontrolled, it can produce the appearance of an exposure-outcome relationship that does not exist (positive confounding) or mask a real relationship (negative confounding).

Observational studies can control for confounders by including them as covariates in statistical models, matching exposed and unexposed subjects on confounders, or restricting the analysis to a stratum where the confounder does not vary. However, these methods can only control for confounders that are:
1. Known to the researchers
2. Measured accurately in the study

"Residual confounding" refers to confounding that persists despite statistical adjustment, due to either unmeasured confounders or imprecise measurement of measured confounders (measurement error). Unmeasured confounders — variables that exist, matter, but were never collected — represent a fundamental limitation that no amount of statistical sophistication can overcome. Randomized controlled trials solve this problem by randomly assigning exposure, which on average distributes both measured and unmeasured confounders equally across groups.

Section 7: Health and Economic Statistics
Question 21 (M): A drug company reports that its vaccine is "95% effective" at preventing infection. The trial found an infection rate of 1% in the placebo group and 0.05% in the vaccinated group. The NNT for this vaccine is:
(A) 95 (B) 105 (C) 20 (D) 1,000
Reveal Answer
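Worked directly from the trial figures in the question; a minimal sketch:

```python
# Placebo vs. vaccine infection rates from the question.
placebo_risk = 0.01     # 1%
vaccine_risk = 0.0005   # 0.05%

rrr = (placebo_risk - vaccine_risk) / placebo_risk  # "95% effective"
arr = placebo_risk - vaccine_risk                   # absolute reduction
nnt = 1 / arr                                       # number needed to treat

print(f"RRR: {rrr:.0%}")
print(f"ARR: {arr:.2%}")
print(f"NNT: {nnt:.0f}")
```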
**Answer: (B) Approximately 105**
ARR = 1% - 0.05% = 0.95%
NNT = 1 / 0.0095 ≈ 105

So approximately 105 people would need to be vaccinated to prevent one infection under the conditions of the trial. The "95% effective" figure is the relative risk reduction: (1% - 0.05%) / 1% = 95%. Note that this figure is technically accurate but gives no indication of the absolute baseline risk, which determines how many people need to be vaccinated to prevent one case. In a population with higher background infection risk, the NNT would be lower (fewer people needed to prevent one case).

Question 22 (M): Which of the following is NOT captured by the official US unemployment rate (U-3)?
(A) Full-time workers who were recently laid off (B) Part-time workers who want to work full-time (C) Recent graduates who have never worked but are actively applying (D) Workers on temporary layoff who expect to be recalled
Reveal Answer
**Answer: (B)** Involuntary part-time workers — those who are working part-time but want and are available for full-time work — are not counted in the U-3 unemployment rate. They are captured in the broader U-6 measure. The U-3 counts people who are without a job, are available for work, and actively searched for work in the past four weeks. Recent graduates actively seeking their first job (C) are counted as unemployed under U-3. Workers on temporary layoff expecting recall (D) are also counted. The gap between U-3 and U-6 — which can be several percentage points, especially during recessions — reflects discouraged workers and involuntary part-timers who are not counted in the headline figure.

Question 23 (T/F): GDP per capita is a reliable measure of typical living standards because it accounts for the distribution of income across the population.
Reveal Answer
**Answer: False** GDP per capita (total GDP divided by population) is an average measure that says nothing about the distribution of income or consumption. In a highly unequal economy, GDP per capita can grow substantially while median household income stagnates or declines, if the gains are concentrated among high earners. The United States provides a well-documented example: from roughly 1980 to the present, GDP per capita has grown substantially in real terms, but much of this growth has been captured by the top income quintile, particularly the top 1% and 0.1%. Median real household income has grown much more slowly. GDP per capita tells you about the size of the economic pie; it tells you nothing about how that pie is distributed.

Question 24 (SA): What is the "Number Needed to Harm" (NNH) and how is it used alongside NNT to evaluate a medical intervention?
Reveal Answer
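One way to quantify the helped/harmed trade-off, using illustrative figures (NNT = 40, NNH = 100, hypothetical numbers for the sketch):

```python
# Hypothetical figures: NNT = 40 (benefit), NNH = 100 (harm).
nnt, nnh = 40, 100

events_prevented_per_100 = 100 / nnt  # events prevented per 100 treated
harms_caused_per_100 = 100 / nnh      # serious side effects per 100

lhh = nnh / nnt  # likelihood of being helped vs. harmed
print(f"prevented per 100 treated: {events_prevented_per_100}")
print(f"harmed per 100 treated:    {harms_caused_per_100}")
print(f"LHH: {lhh}  (> 1 favors treatment)")
```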
**Sample Answer:** The Number Needed to Harm (NNH) is the number of patients who must be exposed to a treatment (or risk factor) for one additional patient to experience a specified adverse effect. It is calculated as 1 / Absolute Risk Increase (ARI), where ARI is the difference in harm rates between the treatment and control groups.

NNH is used alongside NNT to characterize the risk-benefit profile of a medical intervention in clinically interpretable terms. If a drug has an NNT of 40 (40 patients treated to prevent one event) and an NNH of 100 (100 patients treated for one serious side effect), the interpretation is that for every 100 patients treated, approximately 2.5 events are prevented and 1 serious side effect occurs — a favorable balance. If the NNT is 200 and the NNH is 50, the balance is unfavorable (more harms than benefits at the population level). The ratio NNH/NNT (sometimes called the Likelihood of Being Helped or Harmed, LHH) provides a single number that summarizes this trade-off. An LHH > 1 favors treatment; LHH < 1 suggests harms outweigh benefits.

Question 25 (M): Which of the following is an example of cherry-picking a timeframe to make an economic record appear more favorable?
(A) Reporting the ten-year average unemployment rate (B) Starting a GDP growth chart in the year of a recession trough when measuring a politician's record (C) Using seasonally adjusted figures for employment data (D) Adjusting nominal wages for inflation to show real wages
Reveal Answer
**Answer: (B)** Starting a measurement of economic performance at the trough of a recession (the lowest point) ensures that all subsequent data points show improvement — even if the recovery is slow by historical standards. If a politician took office during a recession trough and the economy then recovered to its pre-recession level, using the trough as the baseline makes the record look exceptional even if the recovery was ordinary. The fairest baseline is typically either the pre-recession peak or the date of the politician's inauguration — not the worst point of the preceding downturn. Options (A), (C), and (D) represent standard statistical practices designed to improve accuracy, not cherry-picking.

Question 26 (T/F): A preprint study posted to bioRxiv or medRxiv has undergone peer review and can be treated as equivalent to a published journal article for the purposes of evaluating scientific claims.
Reveal Answer
**Answer: False** Preprints are manuscripts posted publicly before peer review. They have not been evaluated by independent expert reviewers commissioned by a journal, and they have not been accepted through the journal's editorial process. This does not mean preprints are necessarily wrong or unreliable — many are high quality and subsequently published with few changes — but it does mean they carry a higher degree of uncertainty than peer-reviewed publications. During the COVID-19 pandemic, numerous preprints were widely covered by media before peer review and later substantially revised or retracted. Notable examples include early studies on hydroxychloroquine, ivermectin, and mask efficacy that received extensive media attention while in preprint form, some of which were subsequently retracted or heavily criticized after peer review. Preprints should be treated as preliminary findings requiring corroboration, not as established scientific conclusions.

Question 27 (SA): Explain the difference between the Consumer Price Index (CPI) and the Personal Consumption Expenditures (PCE) price index. Why might they give different inflation readings?