In This Chapter
- Learning Objectives
- Introduction
- Section 21.1: The Rise of Data Journalism
- Section 21.2: Statistical Literacy Basics
- Section 21.3: How Statistics Get Misused in Media
- Section 21.4: Polling and Survey Methodology
- Section 21.5: Visualizing Data Honestly
- Section 21.6: Reading Scientific Studies
- Section 21.7: Evaluating Health Statistics
- Section 21.8: Economic Statistics and Their Manipulation
- Section 21.9: A Data Literacy Checklist
- Callout Boxes
- Key Terms
- Discussion Questions
- Chapter Summary
Chapter 21: Data Journalism and Statistical Literacy
Learning Objectives
By the end of this chapter, students will be able to:
- Identify the major data journalism organizations and explain what distinguishes data journalism from traditional reporting.
- Distinguish between mean and median, absolute and relative risk, and explain why these distinctions matter for interpreting news.
- Recognize common statistical manipulation techniques used in media reporting: cherry-picked timeframes, misleading axes, and correlation/causation confusion.
- Evaluate polling methodology by assessing margin of error, sampling methods, question wording, and potential for bias.
- Identify visual deceptions in data visualizations including truncated axes, misleading area representations, and problematic map projections.
- Apply a structured framework to evaluate claims from scientific studies, including distinguishing statistical significance from practical significance.
- Interpret health statistics using concepts like number needed to treat, absolute risk reduction, and relative risk reduction.
- Recognize how economic statistics can be selectively deployed to support political arguments.
- Apply a data literacy checklist to any statistical claim encountered in media consumption.
Introduction
In 1954, Darrell Huff published a slim volume titled How to Lie with Statistics that remains, seven decades later, one of the most cited texts in statistical education. Its opening line captures a paradox that has only grown more acute in the digital age: statistics are simultaneously among the most powerful tools for understanding reality and among the most effective instruments for distorting it.
The proliferation of data journalism in the twenty-first century represents a genuine advance in accountability reporting. Organizations like FiveThirtyEight, The Upshot at The New York Times, the Guardian's data desk, and ProPublica have demonstrated that rigorous quantitative analysis can expose systemic injustice, predict election outcomes with calibrated uncertainty, and hold institutions accountable in ways that anecdotal reporting cannot. At the same time, the democratization of data tools has enabled a flood of misleading statistics, fabricated infographics, and selectively framed claims to circulate with the visual authority that numbers and charts confer.
For the citizen navigating this landscape, statistical literacy is not optional. It is a core democratic competency. You do not need to become a statistician. You need to develop enough fluency to ask the right questions: What is being measured? By whom, and why? What is the comparison? What is the base rate? What is missing from this picture?
This chapter builds that fluency systematically. We begin with the institutions of data journalism and what they do well. We then move through the fundamental statistical concepts that news consumers need, the common manipulation techniques to watch for, and specific domains — polling, health, economics — where statistics are chronically misused. We end with a practical checklist you can apply to any quantitative claim you encounter.
Section 21.1: The Rise of Data Journalism
What Is Data Journalism?
Data journalism is reporting that makes systematic use of quantitative data — government records, scientific databases, financial filings, social media datasets — as primary source material. It combines traditional journalistic values (public interest, accountability, verification) with skills borrowed from statistics, computer science, and information design.
The practice is not entirely new. Florence Nightingale used statistical graphics in the 1850s to demonstrate that British soldiers were dying from preventable disease rather than battle wounds, and her polar area diagrams helped reform military medicine. John Snow's 1854 mapping of cholera deaths around a Broad Street pump is a canonical example of spatial data analysis serving public health. But modern data journalism, enabled by cheap computing, open government data portals, and sophisticated visualization tools, has qualitatively changed what is possible.
The Major Institutions
FiveThirtyEight, founded by Nate Silver in 2008 and subsequently hosted by The New York Times, ESPN, and ABC News, pioneered probabilistic election forecasting for mainstream audiences. Silver's 2012 presidential forecast, which correctly predicted the winner in all 50 states, demonstrated that rigorous polling aggregation with transparent methodology could outperform pundits operating on intuition. FiveThirtyEight's models display uncertainty explicitly — rather than declaring a winner, they report a probability — a significant epistemological advance over the binary "who will win" framing of traditional political reporting.
FiveThirtyEight's broader contribution has been normalizing quantitative reasoning in political journalism. Sports analytics, public health data, and social science research all received similar treatment: not as auxiliary color but as primary evidentiary material. The organization has also been willing to explain its methodology in technical detail, allowing readers and critics to evaluate its assumptions.
The Upshot, launched by The New York Times in 2014 under the editorship of David Leonhardt and later Amanda Cox, brought data journalism into one of the world's most prestigious news organizations. The Upshot's interactive features — particularly its 2016 election needle, which displayed vote-counting uncertainty in real time — reached mass audiences that specialized outlets like FiveThirtyEight could not. Amanda Cox, trained as a statistician, brought rigor to data visualization that influenced the broader field.
The Upshot pioneered what might be called "explanatory data journalism" — using data not merely to report facts but to help readers understand complex systems. Its analyses of income mobility, healthcare costs, housing economics, and wage stagnation frequently deployed novel datasets assembled from government records and academic research, reframing familiar policy debates with quantitative precision.
The Guardian's Data Blog and Data Desk, established around 2009, demonstrated that data journalism could thrive in a European public-interest media environment. The Guardian pioneered making its underlying data publicly available alongside stories, enabling reader verification and secondary analysis. Its coverage of the WikiLeaks releases, involving sophisticated database analysis of diplomatic cables and military logs, remains a landmark in collaborative data journalism.
ProPublica occupies a distinctive niche as a nonprofit investigative newsroom that combines traditional accountability journalism with sophisticated data analysis. Its "Surgeon Scorecard," which tracked complication rates for individual surgeons, and its "Machine Bias" series, which analyzed racial disparities in algorithmic criminal sentencing tools, exemplify data journalism's capacity to expose systemic injustice invisible to traditional reporting.
ProPublica's "Dollars for Docs" database, which made pharmaceutical payments to physicians searchable and public, led directly to policy changes and journalistic investigations around the country. This "data as public utility" model — building searchable tools that serve other journalists and the public — represents one of data journalism's most distinctive contributions to democratic accountability.
What Data Journalism Adds
Traditional journalism excels at narrative: the specific case, the telling anecdote, the human face on a systemic problem. Data journalism adds the capacity to establish whether a case is typical or exceptional, to measure the magnitude of a problem, to test causal claims, and to make predictions with calibrated uncertainty.
The combination is powerful. The anecdote that opens a ProPublica investigation — a patient harmed by a surgeon with a high complication rate — gains force from the database demonstrating that the problem is systemic. The prediction that FiveThirtyEight makes gains credibility from the historical track record that allows probabilistic claims to be verified against outcomes.
But data journalism has its own failure modes. Data-driven analysis can create false precision, lending quantitative authority to findings that depend on questionable assumptions. The choice of which datasets to analyze, which metrics to privilege, and how to visualize results involves editorial judgments that can introduce bias. And the accessibility of data tools has lowered barriers for bad actors as well as good ones.
Section 21.2: Statistical Literacy Basics
Mean vs. Median
The mean (arithmetic average) and median (middle value) are both measures of central tendency, but they behave very differently in the presence of outliers and skewed distributions. Understanding the difference is essential for interpreting economic and demographic statistics.
Consider household income in the United States. If ten households have annual incomes of $40,000, $45,000, $50,000, $55,000, $60,000, $65,000, $70,000, $75,000, $80,000, and $5,000,000, the mean income is approximately $554,000 — a figure that describes none of the households accurately. The median income is $62,500 — the value that 50% of households fall below — which is far more representative of typical experience.
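The chapter's income example can be checked directly with Python's standard library; the figures below use the same ten hypothetical households.

```python
import statistics

# Ten hypothetical household incomes from the text; the single
# $5,000,000 outlier dominates the mean but not the median.
incomes = [40_000, 45_000, 50_000, 55_000, 60_000,
           65_000, 70_000, 75_000, 80_000, 5_000_000]

mean = statistics.mean(incomes)      # pulled upward by the outlier
median = statistics.median(incomes)  # midpoint of 60,000 and 65,000

print(f"mean:   ${mean:,.0f}")    # mean:   $554,000
print(f"median: ${median:,.0f}")  # median: $62,500
```

Dropping the outlier barely moves the median but cuts the mean by roughly a factor of nine — the practical signature of a right-skewed distribution.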
When distributions are heavily right-skewed (as income, wealth, and many other economic variables are), the mean is pulled upward by extreme values while the median remains anchored to typical experience. Politicians and advocates frequently exploit this: those wishing to emphasize prosperity cite mean figures; those wishing to emphasize inequality cite medians. Neither is lying, but neither is giving you the full picture either.
The alert reader asks: Is this a mean or a median? How skewed is the underlying distribution? Would the other measure tell a different story?
Absolute vs. Relative Risk
Perhaps no statistical distinction causes more misunderstanding in health journalism than the difference between absolute and relative risk. Consider a hypothetical drug that reduces the risk of heart attack from 4% to 2% over ten years. Both of the following statements are true:
- The drug reduces heart attack risk by 50% (relative risk reduction)
- The drug reduces heart attack risk by 2 percentage points (absolute risk reduction)
The relative figure sounds dramatic; the absolute figure reveals how small the benefit is in practical terms. Whether a 50% relative reduction in a 4% baseline risk is worth the drug's side effects, cost, and inconvenience is a question that absolute figures make tractable and relative figures obscure.
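Both framings can be computed from the same two numbers; a minimal sketch using the hypothetical drug above:

```python
# Hypothetical drug from the text: 10-year heart attack risk falls 4% -> 2%.
baseline_risk = 0.04
treated_risk = 0.02

arr = baseline_risk - treated_risk  # absolute risk reduction
rrr = arr / baseline_risk           # relative risk reduction

print(f"Absolute risk reduction: {arr:.1%}")  # 2.0 percentage points
print(f"Relative risk reduction: {rrr:.0%}")  # 50%
```

The same underlying trial supports both headlines; only the absolute figure tells you how rare the prevented event actually is.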
Pharmaceutical companies, as we will see in Section 21.7, systematically prefer relative risk framing in their promotional materials. Health journalists often reproduce this framing uncritically. Readers who understand the distinction can demand the context that makes claims interpretable.
Base Rates
A base rate is the background frequency of an event in a population. Ignoring base rates — a cognitive error psychologists call base rate neglect — leads to systematic misinterpretation of conditional probabilities.
The classic illustration is medical testing. Suppose a test for a rare disease has 99% sensitivity (correctly identifies 99% of people with the disease) and 99% specificity (correctly identifies 99% of people without it). If you test positive, what is the probability you actually have the disease?
Most people intuit the answer to be around 99%. The correct answer depends critically on the disease's prevalence. If the disease affects 1 in 10,000 people, and you test a random person from the population:
- True positives: roughly 1 per 10,000 (the one person with the disease who tests positive)
- False positives: roughly 100 per 10,000 (the 9,999 healthy people, 1% of whom test positive incorrectly)
So approximately 101 people test positive, of whom only 1 actually has the disease. The positive predictive value of this test in this population is roughly 1%. A positive result, despite the apparently excellent test characteristics, means you probably do not have the disease.
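The tally above can be reproduced as a short calculation, using the same prevalence, sensitivity, and specificity as the example:

```python
# Screening example from the text: prevalence 1 in 10,000,
# sensitivity and specificity both 99%.
population = 10_000
prevalence = 1 / 10_000
sensitivity = 0.99
specificity = 0.99

sick = population * prevalence                 # ~1 person
healthy = population - sick                    # ~9,999 people

true_positives = sick * sensitivity            # ~0.99
false_positives = healthy * (1 - specificity)  # ~100

# Positive predictive value: P(disease | positive test)
ppv = true_positives / (true_positives + false_positives)
print(f"Positive predictive value: {ppv:.1%}")  # ~1.0%
```

Re-running the calculation with a prevalence of 1 in 100 pushes the predictive value to roughly 50% — the test has not changed, only the base rate.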
Base rate neglect explains many errors in security screening, medical diagnosis, and social policy. When a news story reports that a screening program has detected terrorists, criminals, or disease cases, the relevant question is always: how common is the condition being screened for in the population being screened?
Sample Size
Statistical estimates become more precise as sample size increases. This seems obvious but has non-obvious implications for how we should treat results from small studies versus large ones.
A poll of 1,000 randomly sampled voters will produce a margin of error of roughly ±3 percentage points at 95% confidence — meaning that if the true population value is 50%, we would expect 95% of polls of this size to report values between 47% and 53%. A poll of 100 voters has a margin of error of roughly ±10 percentage points — substantially less precise.
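The 1/√n rule of thumb behind those figures is easy to compute, as a sketch:

```python
import math

def margin_of_error(n: int) -> float:
    """Approximate 95% margin of error for a proportion near 50%,
    using the 1/sqrt(n) rule of thumb from the text."""
    return 1 / math.sqrt(n)

print(f"n=1,000: +/-{margin_of_error(1_000):.1%}")  # +/-3.2%
print(f"n=100:   +/-{margin_of_error(100):.1%}")    # +/-10.0%
```

Note the diminishing returns: quadrupling the sample only halves the margin of error, which is why polls rarely exceed a few thousand respondents.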
Small studies in medicine and social science are particularly problematic. A randomized controlled trial with 30 participants per arm lacks statistical power to detect modest effects and is vulnerable to chance findings that do not replicate. Yet small studies are conducted routinely (they are cheaper) and reported in media without adequate context about their limitations.
The rule of thumb that "extraordinary claims require extraordinary evidence" should be understood in part as a sample-size argument: a surprising finding from a small study should be regarded as a hypothesis to be tested in larger samples, not as established fact.
Section 21.3: How Statistics Get Misused in Media
Cherry-Picking Timeframes
Almost any trend can be made to appear positive or negative by choosing the right starting and ending points. Stock market performance, unemployment rates, crime statistics, and temperature records are all frequently manipulated through timeframe selection.
Consider crime statistics. If violent crime in a city peaked in year X and has declined since, a politician who took office in year X+5 can truthfully claim that crime has fallen "during my tenure" — while concealing that the trend predates their administration and may have nothing to do with their policies. Conversely, a politician's opponents can choose a trough year as the baseline to make the current level look high.
The solution is context: What is the longer-term trend? Is this timeframe typical or cherry-picked? What does the full data series show?
Misleading Axes
The visual encoding of quantitative information creates powerful opportunities for distortion. Two techniques are especially common:
Truncated Y-axes begin a chart's vertical axis at some value other than zero, magnifying the visual impression of differences. A bar chart showing stock prices ranging from $48 to $52 will look like a volatile roller coaster if the y-axis runs from $47 to $53, and like a flat line if it runs from $0 to $100. Neither is technically false, but the truncated version dramatically overstates the practical significance of the variation.
For time series data showing continuous measures (like temperature or poll averages), truncated axes are sometimes defensible — the variation within a narrow range may be the story. For bar charts or column charts where the height of a bar is meant to represent a quantity, starting the axis at zero is generally required for honest representation. The zero-baseline norm for bars is not arbitrary; it reflects the visual expectation that bar height encodes the full quantity, not just the variation.
Dual-axis charts with different scales on left and right axes frequently create the illusion of correlation or correspondence between two variables that move quite differently at their actual scales. By adjusting the scales independently, designers can make any two time series appear to track each other closely or diverge dramatically.
Correlation and Causation
The correlation/causation fallacy — inferring that because two variables are correlated, one must cause the other — is among the most common errors in popular statistical reasoning. It is institutionalized in breathless headlines like "Coffee consumption linked to lower cancer risk" or "Social media use associated with teen depression."
Correlation can arise through several causal structures:
- A causes B
- B causes A
- A third variable C causes both A and B
- A and B co-vary by chance (especially in small samples)
The website Spurious Correlations, compiled by Tyler Vigen, famously demonstrates that per capita cheese consumption in the United States correlates almost perfectly with the number of people who died by becoming tangled in their bedsheets between 2000 and 2009. No causal mechanism is plausible; the correlation is a statistical artifact of two variables that happen to share a trend.
Genuine causal inference requires more than correlation: it requires elimination of alternative explanations (confounders), often through randomized controlled experiments or sophisticated observational study designs (regression discontinuity, instrumental variables, difference-in-differences). Media reports rarely explain the causal inference strategy underlying claims about cause and effect.
P-Hacking and the Replication Crisis
The p-value — the probability of observing results at least as extreme as those observed, given that the null hypothesis is true — became the standard gatekeeper of scientific publication in the twentieth century. The conventional threshold of p < 0.05 was intended to limit false positives. But publication bias (the tendency to publish positive findings and file-drawer negative ones), combined with the practice of analyzing data in multiple ways until a significant result emerges (p-hacking), has generated a substantial literature of false positives that cannot be replicated.
P-hacking does not require conscious fraud. Researchers face legitimate choices about which covariates to include in a model, how to handle outliers, whether to analyze subgroups, and when to stop data collection. Each choice creates a "researcher degree of freedom" that increases the probability of finding a spurious significant result. When those choices are made after examining the data — consciously or not — the nominal p-value no longer controls the false positive rate.
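The arithmetic behind researcher degrees of freedom is stark. If each of k independent analyses carries a 5% chance of a spurious "significant" result under the null, the chance that at least one comes up significant grows quickly with k:

```python
# Family-wise false positive rate across k independent analyses,
# each tested at the conventional alpha = 0.05 threshold.
alpha = 0.05
for k in [1, 5, 10, 20]:
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:2d} analyses -> P(at least one p < 0.05) = {p_any:.0%}")
```

By 20 analyses the probability of at least one false positive is about 64% — a researcher who tries twenty subgroup cuts will usually "find" something even when nothing is there.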
The replication crisis, documented systematically in psychology beginning with the Open Science Collaboration's 2015 replication attempt of 100 published studies (roughly 60% of which failed to replicate), has profound implications for media consumers. Studies that were widely reported in popular media as establishing surprising facts about human behavior — that power poses boost testosterone, that willpower is a depletable resource, that priming with elderly-related words makes people walk slowly — have largely failed to replicate under rigorous conditions.
Section 21.4: Polling and Survey Methodology
Margin of Error
A poll's margin of error quantifies the sampling uncertainty introduced by surveying a sample rather than the entire population. For a simple random sample of n respondents, the margin of error at 95% confidence for a proportion near 50% is approximately ±1/√n.
Critically, the margin of error describes only sampling error — the uncertainty due to drawing a random sample rather than surveying everyone. It does not capture non-sampling errors including:
- Coverage error: Some members of the population have no chance of being included (e.g., people without phones in telephone polls).
- Non-response bias: People who decline to participate differ systematically from those who agree.
- Question wording effects: How a question is phrased influences the answers.
- Social desirability bias: Respondents give answers they believe are socially acceptable rather than their true views.
For this reason, a poll with a small stated margin of error can still be substantially wrong if it suffers from large non-sampling errors. The 2016 and 2020 US presidential elections both featured polls with nominal margins of error around ±3% that missed actual results by substantially more in key states — primarily because polls overrepresented college-educated voters, who supported the Democratic candidate at substantially higher rates than non-college voters.
Sampling Methods
The quality of a poll depends critically on how respondents are selected. Methods vary substantially in their rigor:
Probability sampling (simple random sampling, stratified random sampling, cluster sampling) gives each member of the target population a known, nonzero probability of selection. This is the theoretical foundation for margin-of-error calculations. Landline telephone surveys of registered voters historically approximated probability samples reasonably well; the decline of landline use and rise of caller ID evasion have eroded this.
Quota sampling sets targets for demographic subgroups (e.g., 52% women, 12% Black respondents) and recruits until the quotas are filled, without probability selection within groups. This can reduce obvious demographic imbalances but does not ensure that respondents within demographic groups are representative of non-respondents within those groups.
Opt-in web panels recruit respondents from panels of volunteers who have signed up to take surveys in exchange for compensation. These are not probability samples of any defined population; results must be weighted aggressively and the representativeness of the panel is always uncertain. Polls from opt-in web panels often have large stated sample sizes (n = 10,000) that create an appearance of precision, but their non-probability nature means that sampling error calculations do not apply in the conventional sense.
Question Wording Effects
Small changes in question wording can produce large changes in poll results — a fact routinely exploited by political actors who commission polls designed to generate favorable results.
Classic examples from political polling research:
- "Welfare" versus "assistance to the poor" produces substantially different support levels for government programs.
- Asking whether the government should "allow" versus "not allow" a behavior systematically shifts responses.
- Presenting arguments on both sides of an issue before asking the question generally produces more moderate responses than asking cold.
- Question order effects: attitudes expressed toward a specific policy are influenced by preceding questions about general values.
Push polls are not polls at all but thinly disguised attack advertising. They "push" voters toward a preferred conclusion by asking leading questions that embed negative information about a candidate: "Would you be more or less likely to vote for Candidate X if you knew he had been convicted of tax fraud?" (where no such conviction exists). Genuine polls ask questions to measure opinion; push polls ask questions to change it.
Evaluating Polls
Key questions for evaluating any reported poll:
1. Who conducted it, and who funded it?
2. What was the sampling method?
3. What is the exact question wording?
4. What is the response rate?
5. How was the sample weighted?
6. When was the poll conducted, and what was happening at the time?
7. What is the stated margin of error, and what errors does it not capture?
Section 21.5: Visualizing Data Honestly
The Principles of Honest Data Visualization
Edward Tufte's concept of "data-ink ratio" — the proportion of a graphic's ink devoted to displaying data rather than decoration — provides a useful standard for evaluating charts. The best visualizations maximize the information conveyed per unit of visual complexity. Deceptive visualizations often introduce complexity that obscures rather than reveals.
Alberto Cairo's framework of "truthful" visualization adds the dimension of intention: a visualization should accurately represent the underlying data, be functionally efficient, aesthetically pleasing, insightful, and enlightening. Each principle can be violated in ways that mislead readers.
Truncated Axes in Practice
The Fox News network became the subject of academic study for its use of truncated y-axes on bar charts. A 2012 chart showing the top marginal tax rate under different scenarios (35% if the Bush tax cuts were extended, 39.6% under Obama's proposal) used a y-axis running from 34% to 42%, making the Obama-scenario bar appear several times as tall as the extension bar, when the underlying rate was only about 13% higher. The chart was not technically incorrect, but its visual message was deeply misleading.
The practical rule for consumers: on any bar chart or column chart, check whether the y-axis starts at zero. If not, recalibrate your intuition by estimating the actual proportional difference, not the visual proportional difference.
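The recalibration can be made concrete. Using the tax-rate figures above, compare the ratio the truncated chart shows to the ratio the data actually supports:

```python
# Visual vs. actual ratio for a truncated bar chart, using the chapter's
# tax-rate example: bars at 35% and 39.6% on an axis starting at 34%.
axis_start = 34.0
low, high = 35.0, 39.6

actual_ratio = high / low                                # ~1.13 (13% higher)
visual_ratio = (high - axis_start) / (low - axis_start)  # ~5.6x taller

print(f"actual: {actual_ratio:.2f}x   visual: {visual_ratio:.1f}x")
```

A 13% difference in the data becomes a nearly sixfold difference in bar height — the distortion is entirely a function of where the axis starts.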
Area vs. Linear Scales
When circles or squares are used to represent quantities, the area should scale proportionally with the quantity, not the radius or side length. A circle with twice the radius has four times the area; a square with twice the side has four times the area. Designers who scale radius rather than area therefore exaggerate differences quadratically: a quantity twice as large is drawn four times as large.
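A quick sketch of the two encodings makes the distortion explicit:

```python
import math

# Radius-scaled circles (wrong) vs. area-scaled circles (right)
# for a quantity that doubles from 1 to 2.
def area_if_radius_scaled(value: float) -> float:
    return math.pi * value ** 2            # radius proportional to value

def area_if_area_scaled(value: float) -> float:
    return math.pi * value                 # radius = sqrt(value)

wrong_ratio = area_if_radius_scaled(2) / area_if_radius_scaled(1)  # 4.0
right_ratio = area_if_area_scaled(2) / area_if_area_scaled(1)      # 2.0
print(wrong_ratio, right_ratio)  # 4.0 2.0
```

Honest bubble charts set radius to the square root of the value, so that displayed area — the quantity the eye actually compares — tracks the data.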
Three-dimensional charts — 3D bar charts, 3D pie charts — add an extra layer of distortion because depth perspective makes nearest elements appear larger than equivalent elements in the background. Three-dimensional charts are almost never justified by data complexity; they exist primarily to make charts look impressive at the cost of accuracy.
Choropleth Maps and Their Problems
Choropleth maps color geographic areas according to a statistical value (e.g., income per capita, vote share, COVID case rate). They are extremely common in news reporting and extremely prone to misinterpretation.
The fundamental problem is that geographic area and population are essentially unrelated: land area tells you little about how many people live there. In the United States, the sparsely populated Great Plains and Mountain West states occupy vast geographic area but contain relatively few people. A map that colors counties by vote share will appear to show one party dominating the nation because that party wins the many large, sparsely populated counties. The other party wins geographically small but densely populated urban areas. The map's visual message — this is an overwhelming victory for Party A — reflects geography, not votes.
Solutions include:
- Cartograms that distort geographic area to be proportional to population
- Dot density maps where each dot represents a fixed number of people or events
- Bivariate choropleth maps that encode two variables simultaneously
- Simply switching from geographic area to the relevant unit of analysis
The choice of color scale also matters substantially. Sequential color scales (light to dark of one color) work well for continuous variables ranging from low to high. Diverging scales (two colors meeting at a midpoint) work well for variables with a meaningful center (e.g., temperature anomaly above and below average). Using a diverging scale on a non-diverging variable, or vice versa, creates misleading impressions.
What Good Chart Design Looks Like
Honest data visualization shares several characteristics:
- The chosen chart type matches the data structure (line charts for time series, bar charts for categorical comparisons, scatter plots for bivariate relationships)
- Axes are labeled clearly with units
- The scale is chosen to accurately represent the variability of interest
- The source of the data is credited
- Uncertainty is represented where relevant (confidence intervals, error bars)
- The chart does not require the caption to contradict the visual impression
Section 21.6: Reading Scientific Studies
The Peer Review Process
Peer review — the evaluation of submitted manuscripts by experts in the field before publication — is the primary quality-control mechanism in academic science. It is imperfect in well-documented ways: reviewers are human and have biases; review is conducted under time pressure; novel findings that challenge consensus may face undue resistance; fraudulent fabrication of data may not be detectable from the manuscript alone.
But peer review remains meaningful. A peer-reviewed study published in a well-regarded journal has been subjected to independent expert scrutiny that a press release, preprint, or media report has not. The complete absence of peer review is a significant warning sign.
Preprints — manuscripts posted to public servers like arXiv, bioRxiv, or medRxiv before peer review — played an enormous role in COVID-19 research communication, where the speed of peer review was genuinely too slow for a rapidly evolving public health crisis. But preprints that were widely covered in media before peer review were also disproportionately likely to report sensational findings that were later substantially revised or retracted. The preprint label should trigger additional skepticism.
Effect Size vs. Statistical Significance
A statistically significant finding is not necessarily an important finding. Statistical significance (p < 0.05 or some other threshold) tells you that the observed effect is unlikely to be due to chance alone given the sample size. It tells you nothing about whether the effect is large enough to be practically important.
Effect size measures — Cohen's d for comparing means, correlation coefficients, odds ratios, Number Needed to Treat — quantify the magnitude of an effect independently of sample size. A drug that reduces systolic blood pressure by 1 mmHg may produce a highly statistically significant result in a large trial while having essentially no clinical relevance. A drug that reduces systolic blood pressure by 15 mmHg in a small trial may produce a nonsignificant p-value while potentially offering large clinical benefit.
The distinction becomes acute when evaluating social science research. Very large datasets (e.g., census records, administrative data with millions of observations) can detect effects too small to be meaningful. A study finding that X "significantly" predicts Y with a correlation of r = 0.03 in a dataset of n = 1,000,000 has detected a real but negligibly small relationship.
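The r = 0.03 example can be made concrete with the standard t-statistic for a correlation coefficient, t = r·√((n−2)/(1−r²)); a sketch:

```python
import math

# t-statistic for testing whether a correlation differs from zero.
# A negligible r = 0.03 becomes wildly "significant" with a million
# observations, while the same r is nonsignificant at n = 1,000.
def correlation_t(r: float, n: int) -> float:
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t_large = correlation_t(0.03, 1_000_000)  # ~30: far beyond the 1.96 cutoff
t_small = correlation_t(0.03, 1_000)      # ~0.95: nowhere near significant
print(f"n=1,000,000: t={t_large:.1f}   n=1,000: t={t_small:.2f}")
```

The effect size is identical in both cases; only the sample size changes. Statistical significance is answering a question about detectability, not importance.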
Confounders
A confounder is a variable that is associated with both the exposure and the outcome of interest, and that can therefore produce the appearance of an exposure-outcome relationship that does not exist, or mask one that does.
If a study finds that coffee drinkers have lower rates of heart disease, confounding is a serious concern: heavy coffee drinkers may have other lifestyle characteristics (exercise habits, dietary patterns, social behaviors) that independently reduce cardiovascular risk. The observed association may reflect those other factors rather than coffee specifically.
Observational studies attempt to control for confounding through statistical adjustment (including known confounders as covariates in regression models) and study design choices (matching, restriction). But these methods can only control for confounders that are measured. Unmeasured confounders — whose existence may not even be suspected — cannot be controlled and represent a fundamental limitation of observational research.
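Confounding is easy to demonstrate by simulation. The sketch below assumes, purely hypothetically, that a "healthy lifestyle" variable makes people both more likely to drink coffee and less likely to develop disease, while coffee itself has zero causal effect:

```python
import random

random.seed(42)

n = 200_000
counts = {True: [0, 0], False: [0, 0]}  # coffee? -> [cases, total]

for _ in range(n):
    healthy = random.random() < 0.5
    # Assumed link: healthy people drink coffee more often (70% vs 30%).
    coffee = random.random() < (0.7 if healthy else 0.3)
    # Disease risk depends ONLY on lifestyle, never on coffee.
    disease = random.random() < (0.05 if healthy else 0.15)
    counts[coffee][0] += disease
    counts[coffee][1] += 1

rate_coffee = counts[True][0] / counts[True][1]
rate_none = counts[False][0] / counts[False][1]
print(f"coffee drinkers: {rate_coffee:.1%}   non-drinkers: {rate_none:.1%}")
# Coffee drinkers show clearly lower disease rates (~8% vs ~12%)
# even though coffee has no causal effect in this simulation.
```

An observational study of this simulated population would find a robust "protective" association for coffee; only adjusting for the lifestyle variable (or randomizing coffee assignment) would reveal the null effect.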
Section 21.7: Evaluating Health Statistics
Number Needed to Treat (NNT)
The Number Needed to Treat (NNT) is the number of patients who must receive a treatment for one additional patient to experience the benefit (or be spared the harm). It is calculated as 1 / Absolute Risk Reduction (ARR).
NNT provides an immediately interpretable measure of treatment benefit that neither absolute nor relative risk framing alone conveys. Consider again our hypothetical drug that reduces heart attack risk from 4% to 2%:
- ARR = 4% - 2% = 2%
- NNT = 1 / 0.02 = 50
To prevent one heart attack, 50 patients must take the drug for ten years. Whether that is worthwhile depends on the drug's side effects, cost, and the patient's individual risk profile — but NNT makes the denominator of the decision visible in a way that "50% risk reduction" completely obscures.
For harms, the analogous measure is Number Needed to Harm (NNH): the number of patients exposed to a risk factor or treatment for one additional patient to experience the harm. Comparing NNT to NNH for a given intervention provides the clearest picture of its risk-benefit profile.
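The relationships among ARR, RRR, and NNT reduce to a few lines of arithmetic. This sketch uses the chapter's hypothetical 4%-to-2% drug, plus an invented low-baseline scenario to show how an identical relative risk reduction can imply a very different NNT:

```python
def risk_summary(control_rate: float, treated_rate: float) -> dict:
    """Absolute risk reduction, relative risk reduction, and Number
    Needed to Treat, from event rates in control and treated groups."""
    arr = control_rate - treated_rate   # absolute risk reduction
    rrr = arr / control_rate            # relative risk reduction
    nnt = 1 / arr                       # patients treated per event prevented
    return {"ARR": arr, "RRR": rrr, "NNT": nnt}

# The chapter's hypothetical drug: heart attack risk falls from 4% to 2%.
drug = risk_summary(0.04, 0.02)
# ARR = 2 percentage points, RRR = 50%, NNT = 50

# Invented low-baseline scenario: risk falls from 0.4% to 0.2%.
# The headline "50% risk reduction" is identical, but the NNT is 500.
low = risk_summary(0.004, 0.002)
```

The comparison makes the text's point concrete: the relative figure is the same in both cases, so only the absolute framing (ARR and NNT) distinguishes a substantial benefit from a marginal one.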
How Pharmaceutical PR Inflates Efficacy Claims
Pharmaceutical companies are legally required to accurately represent their products in promotional materials, but the selective presentation of technically accurate statistics can create substantially misleading impressions. Common techniques include:
Relative risk reporting without absolute context: Reporting that a vaccine reduces infection risk by 95% without mentioning that the baseline risk in the trial population over the trial period was 0.9% (absolute reduction: ~0.85%, NNT: ~120).
Surrogate endpoint reporting: Reporting improvement on a biomarker (e.g., LDL cholesterol, blood pressure, bone density) rather than on clinical endpoints (heart attacks, strokes, fractures). Surrogate endpoints do not always translate to clinical benefit, and some drugs that improve surrogates while worsening clinical outcomes have been approved and prescribed.
Subgroup analysis: Reporting favorable results from a post-hoc subgroup analysis of a trial that showed no significant overall effect. If a drug doesn't work overall but seems to work in left-handed patients under 45 with high baseline cholesterol, that finding is almost certainly due to chance — but it can be published and promoted.
Trial design advantages: Comparing to placebo rather than active control (standard of care), using doses of the comparator that are subtherapeutic, or defining outcome endpoints in ways that favor the drug.
The independent NNT website (thennt.com), maintained by physicians, provides patient-oriented summaries of evidence that translate published trial results into NNT and NNH figures, offering a counterweight to pharmaceutical framing.
Section 21.8: Economic Statistics and Their Manipulation
GDP and Its Limitations
Gross Domestic Product — the total market value of all goods and services produced within a country in a period — is the most widely cited measure of economic performance. Its limitations are substantial and well-known among economists, less well-known among the public:
GDP counts all production as positive regardless of whether it is beneficial or harmful. Rebuilding after a natural disaster increases GDP. Pollution remediation increases GDP. Treating cancer increases GDP; preventing it does not (or does so less, since the treatment is not needed). GDP is insensitive to the distribution of production — an economy where total output is growing while most of the gains flow to the top 1% will show robust GDP growth while typical household living standards stagnate.
Politicians across the ideological spectrum use GDP selectively: claiming credit for growth during their tenure and deflecting blame for recessions by citing global factors. The relevant questions for contextualizing GDP claims include: What is the per capita trend (adjusting for population)? What is the distribution of gains? How does the current growth rate compare to historical rates and to other comparable economies?
Unemployment Rate Definitions
The US Bureau of Labor Statistics publishes six measures of labor underutilization, labeled U-1 through U-6. The "official" unemployment rate (U-3) counts people without jobs who are actively seeking work as a fraction of the civilian labor force. It excludes:
- Discouraged workers (U-4 adds these): People who have given up actively seeking employment because they believe no jobs are available.
- Marginally attached workers (U-5 adds these): People who want and could work but have not searched recently for various reasons.
- Part-time workers seeking full-time employment (U-6 adds these): People working part-time involuntarily.
During economic downturns, the official unemployment rate can actually fall as discouraged workers exit the labor force and are no longer counted. The labor force participation rate — the fraction of the working-age population that is either employed or actively seeking work — provides an important complement to the unemployment rate.
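A toy calculation (all figures invented for illustration) shows how U-3 can fall even as conditions worsen, and why the participation rate is a necessary complement:

```python
def u3_and_participation(employed: float, unemployed_seeking: float,
                         working_age_pop: float) -> tuple[float, float]:
    """Official (U-3) unemployment rate and labor force participation rate."""
    labor_force = employed + unemployed_seeking
    u3 = unemployed_seeking / labor_force
    participation = labor_force / working_age_pop
    return u3, participation

# Hypothetical starting point: 90M employed, 10M actively seeking,
# 150M working-age population.
before = u3_and_participation(90e6, 10e6, 150e6)  # U-3 = 10.0%, LFPR ~ 66.7%

# 2M seekers give up and become discouraged workers. No one gained a
# job, yet the official unemployment rate falls.
after = u3_and_participation(90e6, 8e6, 150e6)    # U-3 ~ 8.2%, LFPR ~ 65.3%
```

The unemployment rate improves while the participation rate deteriorates; reading the two together reveals what either one alone conceals.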
When politicians cite falling unemployment as evidence of a strong economy, the question to ask is: which unemployment measure? Is the labor force participation rate changing? Are new jobs full-time or part-time? What are wage growth trends?
Inflation Measures
The Consumer Price Index (CPI) measures the price of a fixed basket of goods and services purchased by urban consumers. The Personal Consumption Expenditures (PCE) price index, used by the Federal Reserve as its primary inflation gauge, covers a broader range of expenditures and uses a different methodology for dealing with substitution (when consumers shift from expensive to cheaper goods as prices change).
The choice between measures matters because they can produce different numbers. "Core" inflation excludes food and energy prices, which are volatile, to provide a cleaner signal of underlying inflation trends — but food and energy are the expenditures that matter most to lower-income households, for whom there is less ability to substitute.
When politicians or commentators cite inflation statistics, the relevant questions include: Which price index? Over what period? For which demographic groups? Inflation experienced by retirees, who spend more of their income on healthcare, differs from inflation experienced by young renters, who are more exposed to housing costs.
Section 21.9: A Data Literacy Checklist
The following questions should be asked of any statistical claim encountered in media coverage:
About the claim itself:
1. What exactly is being measured? Is the metric well-defined?
2. Is this an absolute or relative figure? What does the other framing show?
3. What is the baseline or comparison point?
4. What is the time period, and could a different period tell a different story?
5. What is the sample size? Is it adequate for the precision being claimed?

About the source:
6. Who collected this data, and what are their incentives?
7. Has this been peer-reviewed or independently verified?
8. Is this a single study or a body of evidence? Are results consistent across multiple studies?
9. Is the claim from the abstract/press release, or have I checked what the actual study says?

About what's missing:
10. What confounders might explain this relationship?
11. What alternative explanations have been ruled out?
12. What data would contradict this claim? Has anyone looked for it?
13. Who is not included in this sample?
14. What is the base rate in this population?

About the visualization:
15. Does the y-axis start at zero (for bar charts)?
16. Are dual axes being used? Are scales comparable?
17. Does the chart type match the data structure?
18. Does the visual impression match the numerical reality?

About the interpretation:
19. Does this claim establish causation or merely correlation?
20. Is statistical significance being confused with practical importance?
21. Has this finding been replicated?
22. Am I being shown the cherry-picked result from multiple analyses?
Callout Boxes
STATISTICAL MANIPULATION IN THE WILD In 2020, multiple news outlets reported that hydroxychloroquine was "94% effective" against COVID-19, citing a small French study with methodological problems severe enough that the International Society of Antimicrobial Chemotherapy, which publishes the journal that carried the study, publicly distanced itself from its conclusions. The study had 26 patients in the treatment group, excluded six who deteriorated or died from the analysis, and had no true control group. The relative risk figure (94%) was technically derived from the data; the absolute risk reduction was trivial, and the study design was inadequate to support any causal inference. This illustrates how small studies with dramatic relative risk figures can generate media coverage that far outstrips the actual evidence.
WHAT PEER REVIEW IS NOT Peer review is not a guarantee that a published finding is true. It is a process of expert scrutiny that improves the odds of catching errors and filters out the most obviously inadequate work. High-profile retractions — including the 1998 Wakefield MMR-autism study, which passed peer review at The Lancet and was retracted twelve years later after its data were found to have been falsified — demonstrate that even prestigious journals with rigorous review processes can be deceived. Peer review is a necessary but not sufficient condition for trusting a scientific claim.
Key Terms
Absolute risk reduction (ARR): The arithmetic difference between event rates in treated and control groups, expressed as a percentage.
Base rate: The background frequency of an event in the general population before considering specific factors.
Choropleth map: A map that colors geographic areas according to a statistical value, prone to misrepresentation when geographic area and population density are uncorrelated.
Confounder: A variable associated with both exposure and outcome that can produce spurious apparent associations.
Effect size: A quantitative measure of the magnitude of a relationship or difference, independent of sample size.
Margin of error: The range around a poll result within which the true population value is expected to fall, given sampling variability; conventionally corresponds to a 95% confidence interval for a simple random sample.
Number Needed to Treat (NNT): The number of patients who must receive a treatment for one patient to benefit; calculated as 1/ARR.
P-hacking: Analyzing data in multiple ways until a statistically significant result emerges, exploiting researcher degrees of freedom to inflate false positive rates.
Relative risk reduction (RRR): The proportional reduction in event rate in the treated group compared to the control group; frequently presented without the absolute baseline.
Replication crisis: The widespread failure of published scientific findings — particularly in psychology and social science — to replicate when independently tested, attributed to publication bias, p-hacking, and small sample sizes.
Spurious correlation: A statistical association between two variables that does not reflect a causal relationship.
Discussion Questions
- FiveThirtyEight correctly predicted 49 of 50 states in the 2008 presidential election, then all 50 in 2012, before performing less well in 2016 and 2020. Does this track record validate or undermine confidence in probabilistic election forecasting? How should we evaluate forecast accuracy?
- A pharmaceutical company reports that its new drug reduces the risk of a certain cancer by 40%. Using only this information, what are the most important questions you cannot answer? What additional information would you need to make an informed judgment about whether to take the drug?
- The replication crisis has predominantly affected social psychology. Does this mean you should distrust social psychological findings? What criteria would you use to distinguish more from less reliable findings in a field with known replication problems?
- Election polling errors in 2016 and 2020 were larger than the stated margins of error. Does this mean we should abandon polling as a tool, or should we interpret poll results differently? What reforms to polling practice or reporting practice might help?
- Compare the GDP-based and household-income-based narratives of the US economy from 2009 to 2019. How do different statistical choices produce different — but potentially both technically accurate — stories about economic performance over this period?
- A news headline reads: "Social media use linked to depression in teenagers." What would you need to know about the study behind this headline before accepting or rejecting the claim? Design a study that would provide more convincing evidence for or against a causal relationship.
Chapter Summary
Statistical literacy is the ability to critically evaluate quantitative claims — to distinguish meaningful evidence from misleading representation, and to ask the questions that expose what statistics conceal as well as what they reveal. This chapter has built that literacy across nine domains:
Data journalism organizations like FiveThirtyEight, The Upshot, the Guardian Data Desk, and ProPublica have made rigorous quantitative reasoning central to accountability reporting. The best of this work sets a standard for what responsible data communication looks like.
The fundamental statistical concepts — mean vs. median, absolute vs. relative risk, base rates, and sample size adequacy — provide the vocabulary for evaluating any quantitative claim. The common manipulation techniques — cherry-picked timeframes, misleading axes, false correlation-causation inferences, and p-hacking — represent the adversarial landscape that an informed reader must navigate.
Polling methodology, data visualization design, scientific study interpretation, health statistics, and economic statistics each present domain-specific challenges requiring specific knowledge. The data literacy checklist synthesizes these into a practical tool for everyday media consumption.
The goal is not to make you a statistician. It is to make you a reader who cannot be easily fooled — who knows to ask "compared to what?", "what's the baseline?", "could this be confounded?", and "has this been replicated?" before accepting any number as evidence of anything.
Numbers are not neutral. They are constructed, selected, framed, and visualized by people with interests and incentives. Statistical literacy is the skill that allows you to engage with those numbers on equal terms.