Appendix D: Data Sources Guide

Finding good data is the first step in any statistical analysis. This appendix catalogs the datasets referenced throughout the textbook, along with additional sources for the Data Detective Portfolio and independent projects.

D.1 U.S. Government Data

CDC Behavioral Risk Factor Surveillance System (BRFSS)

What it contains: The largest continuously conducted health survey in the world. Over 400,000 adults surveyed annually on health behaviors, chronic conditions, access to healthcare, and demographics.
Format: SAS transport files, CSV exports available through web tools.
URL: cdc.gov/brfss
How to access: Use the Web Enabled Analysis Tool (WEAT) for pre-built queries, or download raw annual data files.
Suggested uses: Health behavior analysis (Ch.5-7), proportion inference on smoking/exercise rates (Ch.14), comparing health outcomes across states or demographics (Ch.16, Ch.19, Ch.20), regression on health indicators (Ch.22-23).
Used in textbook: Suggested portfolio dataset; Dr. Maya Chen's anchor example context.

U.S. Census Bureau / American Community Survey (ACS)

What it contains: Demographic, economic, social, and housing data for every community in the United States. The decennial Census counts everyone; the ACS surveys ~3.5 million households annually for detailed characteristics.
Format: CSV, API access, pre-built tables.
URL: data.census.gov
How to access: Use the data explorer for pre-built tables, or the Census API for programmatic access. The cenpy Python package simplifies API queries.
Suggested uses: Income distributions (Ch.5-6), demographic comparisons (Ch.16), regression on economic factors (Ch.22-23), Simpson's paradox examples (Ch.27).
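
As a rough illustration of programmatic access, the sketch below builds a Census API query URL and converts the API's JSON response (a header row followed by data rows) into a pandas DataFrame. The dataset path (2022 ACS 5-year estimates), the variable code B19013_001E (median household income), and the helper names build_acs_url and rows_to_frame are assumptions chosen for illustration; check the API documentation for the tables and years you actually need.

```python
from urllib.parse import urlencode

import pandas as pd

# Assumed endpoint: 2022 ACS 5-year estimates. Change the year/dataset as needed.
BASE_URL = "https://api.census.gov/data/2022/acs/acs5"


def build_acs_url(variables, geography="state:*", api_key=None):
    """Return a query URL for the given variable codes and geography."""
    params = {"get": ",".join(["NAME", *variables]), "for": geography}
    if api_key:  # include a key if you have one
        params["key"] = api_key
    # Keep ',', ':' and '*' readable instead of percent-encoding them.
    return f"{BASE_URL}?{urlencode(params, safe=',:*')}"


def rows_to_frame(rows):
    """Convert the response (first row = column names) into a DataFrame."""
    header, *records = rows
    return pd.DataFrame(records, columns=header)


url = build_acs_url(["B19013_001E"])
print(url)

# To fetch for real (needs network access and the requests package):
# rows = requests.get(url, timeout=30).json()
# df = rows_to_frame(rows)
```

The API returns every value as text, so numeric columns usually need pd.to_numeric before analysis.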

Bureau of Labor Statistics (BLS)

What it contains: Employment, unemployment, wages, prices (CPI), productivity, and workplace safety data. Updated monthly.
Format: CSV, Excel, API.
URL: bls.gov/data
How to access: Use the BLS Data Viewer for interactive queries, or download flat files. The BLS API supports automated data retrieval.
Suggested uses: Time series of unemployment rates, wage comparisons across industries (Ch.16, Ch.20), CPI inflation analysis, labor force participation trends.
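
For automated retrieval, a minimal sketch of working with the BLS API follows. It assumes the version 2 timeseries endpoint, which accepts a JSON body sent by POST, and a response nested under Results and series. The helper names are hypothetical, and the series ID in the usage note is only an example; look up the IDs you need on the BLS site.

```python
import json

# Assumed endpoint for version 2 of the BLS public API.
BLS_URL = "https://api.bls.gov/publicAPI/v2/timeseries/data/"


def build_payload(series_ids, start_year, end_year):
    """Return the JSON body for a POST request to the timeseries endpoint."""
    return json.dumps({
        "seriesid": list(series_ids),
        "startyear": str(start_year),
        "endyear": str(end_year),
    })


def _to_float(text):
    """Convert a value string to float; non-numeric markers become None."""
    try:
        return float(text)
    except ValueError:
        return None


def flatten_series(response):
    """Turn the nested response into flat monthly records, oldest first."""
    records = []
    for series in response["Results"]["series"]:
        for obs in series["data"]:
            period = obs["period"]
            # Keep monthly periods M01..M12; skip annual averages (M13)
            # and non-monthly periods.
            if not period.startswith("M") or period == "M13":
                continue
            records.append({
                "series_id": series["seriesID"],
                "year": int(obs["year"]),
                "month": int(period[1:]),
                "value": _to_float(obs["value"]),
            })
    return sorted(records, key=lambda r: (r["series_id"], r["year"], r["month"]))
```

A typical call would post build_payload(["LNS14000000"], 2020, 2023) to BLS_URL with a Content-type of application/json, then pass the parsed response to flatten_series. The resulting list of dictionaries can be handed directly to pd.DataFrame.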

NOAA Climate Data Online

What it contains: Historical weather and climate data from thousands of stations worldwide. Includes temperature, precipitation, wind, snow, and extreme weather events.
Format: CSV.
URL: ncdc.noaa.gov/cdo-web
How to access: Search by location and date range, select variables, and download. Free account required.
Suggested uses: Temperature distributions (Ch.5-6, Ch.10), correlation between climate variables (Ch.22), regression on climate trends (Ch.22-23), normality assessment (Ch.10).
Used in textbook: Suggested portfolio dataset.

FBI Uniform Crime Reporting (UCR) / Crime Data Explorer

What it contains: National crime statistics including violent crime, property crime, hate crimes, and law enforcement data. Reported by ~18,000 agencies.
Format: CSV, Excel, API.
URL: crime-data-explorer.fr.cloud.gov
How to access: Interactive data explorer with download options. Bulk downloads available.
Suggested uses: Crime rate comparisons across cities or years (Ch.16, Ch.20), correlation between crime and socioeconomic factors (Ch.22), chi-square tests on crime type distributions (Ch.19), algorithmic fairness analysis (Ch.26-27).
Used in textbook: Professor Washington's anchor example context.

U.S. College Scorecard

What it contains: Data on every degree-granting institution in the United States: graduation rates, student debt, post-graduation earnings, admission rates, demographics, and financial aid.
Format: CSV (updated annually).
URL: collegescorecard.ed.gov/data
How to access: Download the full dataset or use the API. The "Most Recent" data file is most useful for course projects.
Suggested uses: Earnings distributions (Ch.5-6), comparing outcomes across institution types (Ch.16, Ch.20), regression on graduation rates (Ch.22-23), logistic regression on graduation (Ch.24).
Used in textbook: Suggested portfolio dataset.

Data.gov

What it contains: A meta-catalog of over 250,000 datasets from federal agencies covering agriculture, climate, education, energy, finance, health, and public safety.
Format: Varies by dataset (CSV, JSON, XML, API).
URL: data.gov
How to access: Search by topic or keyword. Quality and format vary widely.
Suggested uses: Browse for topic-specific projects; useful for finding niche datasets.

D.2 International Data

World Health Organization (WHO) Global Health Observatory

What it contains: Health statistics for 194 countries: life expectancy, disease prevalence, mortality rates, healthcare spending, vaccination coverage, and health workforce data.
Format: CSV, API.
URL: who.int/data/gho
How to access: Use the interactive data explorer or download bulk files. The who Python package provides API access.
Suggested uses: International health comparisons (Ch.16, Ch.20), life expectancy regression (Ch.22-23), disease prevalence proportions (Ch.14).

World Bank Open Data

What it contains: Development indicators for 217 countries: GDP, poverty rates, education enrollment, access to electricity, CO2 emissions, and 1,400+ other indicators spanning decades.
Format: CSV, Excel, API.
URL: data.worldbank.org
How to access: Search by indicator or country. The wbdata or world_bank_data Python packages simplify access.
Suggested uses: Economic development analysis, international comparisons (Ch.20), regression on GDP and education (Ch.22-23), longitudinal trends.
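
If you prefer not to install a wrapper package, the World Bank API can also be queried directly. The sketch below assumes the version 2 endpoint, annual data, and a JSON response shaped as a two-element list (paging metadata, then observations). The indicator code NY.GDP.PCAP.CD (GDP per capita, current US$) and the helper names are illustrative assumptions.

```python
def build_indicator_url(indicator, country="all", per_page=1000):
    """Return a query URL for one indicator; results are paged by the API."""
    return (
        f"https://api.worldbank.org/v2/country/{country}"
        f"/indicator/{indicator}?format=json&per_page={per_page}"
    )


def flatten_indicator(payload):
    """Flatten the parsed JSON response into a list of simple records.

    Observations with no value are skipped. Assumes annual data, so the
    'date' field is a four-digit year.
    """
    # Error responses and empty results do not contain an observation list.
    if len(payload) < 2 or payload[1] is None:
        return []
    rows = []
    for obs in payload[1]:
        if obs["value"] is None:
            continue
        rows.append({
            "country": obs["country"]["value"],
            "iso3": obs["countryiso3code"],
            "year": int(obs["date"]),
            "value": obs["value"],
        })
    return rows


url = build_indicator_url("NY.GDP.PCAP.CD")
```

Because results are paged, a large query may need several requests; the first element of the response reports how many pages exist.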

Gapminder

What it contains: A curated collection of global statistics on health, wealth, population, and education, designed for teaching and data visualization. Made famous by Hans Rosling's TED talks.
Format: CSV, Excel.
URL: gapminder.org/data
How to access: Download directly from the website or via the Python gapminder package: pip install gapminder
Suggested uses: Life expectancy vs. GDP scatterplots (Ch.5, Ch.22), international health comparisons (Ch.16), distributions of development indicators (Ch.5-6).
Used in textbook: Suggested portfolio dataset; referenced in Ch.1 case study and Ch.5 case study.

Our World in Data

What it contains: Research-driven datasets and visualizations on global problems: COVID-19, climate change, poverty, education, health, energy, and technology. Curated by researchers at the University of Oxford.
Format: CSV (GitHub-hosted), interactive charts.
URL: ourworldindata.org
How to access: Every chart has a "Download" button for the underlying data. All data is also on GitHub at github.com/owid/owid-datasets.
Suggested uses: COVID-19 analysis (Ch.5-7, Ch.14), climate change trends (Ch.22), global inequality (Ch.16), data visualization examples (Ch.25).

D.3 Sports Data

Basketball Reference

What it contains: Comprehensive NBA statistics: player stats, team records, game logs, advanced metrics, and historical data dating back to the BAA/NBA's founding in 1946.
Format: HTML tables (easily scraped), CSV export available for some tables.
URL: basketball-reference.com
How to access: Browse team and player pages. Use the basketball_reference_web_scraper Python package or copy/paste tables into Excel.
Suggested uses: Shooting percentage analysis (Ch.14, Ch.17), player comparison (Ch.16), regression on player performance (Ch.22-23), sports analytics projects.
Used in textbook: Sam Okafor's anchor example context.

ESPN

What it contains: Statistics for NFL, NBA, MLB, NHL, soccer, and college sports. Real-time scores, standings, and player stats.
Format: HTML, undocumented (unofficial) API.
URL: espn.com/stats
How to access: Browse directly or use community-built API wrappers.
Suggested uses: Cross-sport comparisons, team performance analysis, current-season data for timely projects.

FiveThirtyEight Data

What it contains: Datasets behind FiveThirtyEight's data journalism articles on politics, sports, science, economics, and culture. Well-documented and analysis-ready.
Format: CSV (GitHub-hosted).
URL: github.com/fivethirtyeight/data
How to access: Browse the GitHub repository and download individual datasets. Includes README files explaining each dataset.
Suggested uses: Election polling analysis (Ch.12, Ch.14), sports predictions (Ch.22-24), examples of data journalism (Ch.25), pre-cleaned datasets for quick projects.

D.4 General-Purpose Repositories

Kaggle Datasets

What it contains: Over 200,000 user-contributed datasets on every imaginable topic. Quality varies from excellent (curated competition datasets) to rough (user uploads).
Format: CSV, JSON, SQLite, images.
URL: kaggle.com/datasets
How to access: Free account required. Search by topic, sort by votes or usability rating. Download directly or use the Kaggle command-line tool (pip install kaggle; requires an API token from your account settings): kaggle datasets download -d owner/dataset-name
Suggested uses: Great for finding topic-specific datasets for the portfolio project. Look for datasets with high usability scores (8+/10) and active discussions.
Caution: Always check the data description and license. Some datasets have quality issues or unclear provenance.

UCI Machine Learning Repository

What it contains: A classic collection of ~600 datasets curated for machine learning and statistics research since 1987. Well-documented with variable descriptions and prior analyses.
Format: CSV, data files.
URL: archive.ics.uci.edu/datasets
How to access: Browse by task type, subject area, or number of observations. Download directly.
Suggested uses: Regression datasets (Ch.22-24), classification datasets (Ch.24), chi-square analysis (Ch.19). The Iris, Wine, Heart Disease, and Auto MPG datasets are particularly well-suited for introductory statistics.

Google Dataset Search

What it contains: A search engine that indexes datasets from across the web — government agencies, research institutions, news organizations, and data repositories.
Format: Varies by source.
URL: datasetsearch.research.google.com
How to access: Search by keyword, just like regular Google. Results link to the original data source.
Suggested uses: Finding datasets on specific topics that aren't covered by the sources above. Useful for portfolio project exploration.

World Happiness Report

What it contains: Annual survey data ranking 150+ countries by self-reported life satisfaction, along with explanatory factors: GDP per capita, social support, healthy life expectancy, freedom, generosity, and corruption.
Format: CSV, Excel.
URL: worldhappiness.report
How to access: Download from the report website or from Kaggle.
Suggested uses: Multiple regression (Ch.23 — excellent for "holding other variables constant"), correlation analysis (Ch.22), international comparisons (Ch.16, Ch.20), visualization projects (Ch.25).
Used in textbook: Suggested portfolio dataset.

D.5 Tips for Choosing a Dataset

For the Data Detective Portfolio

Your portfolio dataset should have:

At least 200 observations (rows). More is better for inference, but avoid datasets so large they're unwieldy (over 1 million rows may be slow on older computers).
At least 6 variables, including a mix of:
  - At least 2 categorical variables (for chi-square tests, group comparisons)
  - At least 2 numerical variables (for t-tests, correlation, regression)
  - At least 1 potential outcome/response variable
Some missing values (so you can practice data cleaning in Chapter 7).
A meaningful question you care about. You'll spend the entire semester with this dataset. Pick something you find genuinely interesting.
Documentation. A data dictionary or codebook that explains what each variable means and how the data were collected.
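
The measurable items on this checklist can be screened automatically once a candidate dataset is loaded. The following is a minimal sketch, not an official grading tool: the function name and its defaults are assumptions that mirror the thresholds above, and it treats every non-numeric column as categorical.

```python
import pandas as pd


def check_portfolio_dataset(df, min_rows=200, min_vars=6, min_each_type=2):
    """Return a dict mapping each measurable guideline to True or False."""
    numeric = df.select_dtypes(include="number").columns
    # Assumption: any column that is not numeric counts as categorical.
    categorical = df.select_dtypes(exclude="number").columns
    return {
        "enough_rows": len(df) >= min_rows,
        "enough_variables": df.shape[1] >= min_vars,
        "enough_categorical": len(categorical) >= min_each_type,
        "enough_numerical": len(numeric) >= min_each_type,
        "has_missing_values": bool(df.isna().any().any()),
    }
```

For example, check_portfolio_dataset(pd.read_csv("data.csv")) returns one True/False entry per guideline. Categories stored as numeric codes (such as 1 and 2 for survey responses) will be counted as numerical, so read the codebook before trusting the counts. Whether the question is meaningful and the documentation adequate still requires your own judgment.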

Red Flags to Watch For

No documentation: If you can't figure out what the variables mean, move on.
Too clean: If every value is perfect with no missing data, you won't get cleaning practice. (But don't pick data that's 50% missing either.)
Unclear provenance: If you can't determine who collected the data and how, the data may not be trustworthy.
Ethical concerns: Avoid datasets with identifiable personal information unless they were explicitly released for public use with appropriate consent.
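
To judge whether a dataset is "too clean" or has too much missing data, it helps to look at missingness column by column. The sketch below is one way to do that; the function names are hypothetical, and applying the 50% cutoff to individual columns is an assumption borrowed from the guideline above.

```python
import pandas as pd


def missing_report(df, too_much=0.5):
    """Share of missing values per column, most-missing columns first."""
    share = df.isna().mean()
    report = pd.DataFrame({
        "missing_share": share,
        "too_much": share >= too_much,  # flag heavily missing columns
    })
    return report.sort_values("missing_share", ascending=False)


def overall_missing_share(df):
    """Share of all cells in the table that are missing."""
    return float(df.isna().to_numpy().mean())
```

An overall share of exactly 0 suggests there will be little to practice on in Chapter 7, while several flagged columns suggest the dataset may be more trouble than it is worth.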

Loading Common Formats

import pandas as pd

# CSV (most common)
df = pd.read_csv("data.csv")

# Excel (reading .xlsx files requires the openpyxl package)
df = pd.read_excel("data.xlsx")

# JSON
df = pd.read_json("data.json")

# From a URL
df = pd.read_csv("https://example.com/data.csv")

# Tab-separated
df = pd.read_csv("data.tsv", sep="\t")

# Fixed-width (column boundaries are inferred; pass colspecs or widths to set them)
df = pd.read_fwf("data.txt")
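
When a project mixes several file types, one option is a small helper that picks the reader from the file extension. The sketch below is illustrative only: load_table is a hypothetical name, and the extension-to-reader mapping is an assumption. It also covers SAS transport files (the format listed for BRFSS in D.1) through pd.read_sas.

```python
from pathlib import Path

import pandas as pd


def load_table(path, **kwargs):
    """Load a data file into a DataFrame, choosing the reader by extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path, **kwargs)
    if suffix in {".tsv", ".tab"}:
        return pd.read_csv(path, sep="\t", **kwargs)
    if suffix in {".xlsx", ".xls"}:
        return pd.read_excel(path, **kwargs)  # needs an Excel engine installed
    if suffix == ".json":
        return pd.read_json(path, **kwargs)
    if suffix == ".xpt":
        # SAS transport (XPORT) format
        return pd.read_sas(path, format="xport", **kwargs)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

Extra keyword arguments are passed through to the underlying reader, so load_table("data.csv", nrows=100) reads only the first 100 rows.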

D.6 Quick Reference: Datasets by Chapter Topic

Chapter Topic	Recommended Sources
Getting started / variable types (Ch.1-2)	Any dataset from this list
Data cleaning (Ch.7)	BRFSS, ACS (real-world messiness)
Probability (Ch.8-10)	BRFSS (contingency tables), sports data (proportions)
Confidence intervals (Ch.12)	College Scorecard, BLS
Proportions (Ch.14)	BRFSS, polling data (FiveThirtyEight)
Comparing groups (Ch.16)	World Happiness, Gapminder, BRFSS by state
Chi-square (Ch.19)	Any dataset with 2+ categorical variables
ANOVA (Ch.20)	College Scorecard (by region), World Happiness (by continent)
Correlation / regression (Ch.22-23)	Gapminder, World Happiness, NOAA, College Scorecard
Logistic regression (Ch.24)	UCI Heart Disease, College Scorecard (graduation)
AI and bias (Ch.26-27)	COMPAS recidivism data, ProPublica datasets