Appendix D: Data Sources Guide
Finding good data is the first step in any statistical analysis. This appendix catalogs the datasets referenced throughout the textbook and additional sources for the Data Detective Portfolio and independent projects.
D.1 U.S. Government Data
CDC Behavioral Risk Factor Surveillance System (BRFSS)
- What it contains: The largest continuously conducted health survey in the world. Over 400,000 adults surveyed annually on health behaviors, chronic conditions, access to healthcare, and demographics.
- Format: SAS transport files, CSV exports available through web tools.
- URL: cdc.gov/brfss
- How to access: Use the Web Enabled Analysis Tool (WEAT) for pre-built queries, or download raw annual data files.
- Suggested uses: Health behavior analysis (Ch.5-7), proportion inference on smoking/exercise rates (Ch.14), comparing health outcomes across states or demographics (Ch.16, Ch.19, Ch.20), regression on health indicators (Ch.22-23).
- Used in textbook: Suggested portfolio dataset; Dr. Maya Chen's anchor example context.
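The raw annual files are SAS transport (XPT) files, which pandas can read directly. The sketch below shows one way to load such a file and then treat survey "don't know" and "refused" codes as missing. The helper names are hypothetical. The codes 7 and 9 are an assumption about a typical one-digit BRFSS item, so check the codebook for each variable you use.

```python
import numpy as np
import pandas as pd


def load_brfss(path):
    """Read a SAS transport (.XPT) file into a DataFrame."""
    # format="xport" tells pandas this is a SAS transport file
    return pd.read_sas(path, format="xport")


def recode_sentinels(series, codes=(7, 9)):
    """Replace sentinel codes (assumed: 7 = don't know, 9 = refused) with NaN."""
    return series.where(~series.isin(codes), np.nan)


# Illustrative values only, not real BRFSS responses
answers = pd.Series([1, 2, 7, 1, 9, 2], dtype="float64")
cleaned = recode_sentinels(answers)
print(cleaned.isna().sum())  # number of responses now treated as missing
```

With a downloaded file, `load_brfss` would be called on its local path before any recoding. Recoding before computing proportions keeps refusals from being counted as a real response category.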
U.S. Census Bureau / American Community Survey (ACS)
- What it contains: Demographic, economic, social, and housing data for every community in the United States. The decennial Census counts everyone; the ACS surveys ~3.5 million households annually for detailed characteristics.
- Format: CSV, API access, pre-built tables.
- URL: data.census.gov
- How to access: Use the data explorer for pre-built tables, or the Census API for programmatic access. The `cenpy` Python package simplifies API queries.
- Suggested uses: Income distributions (Ch.5-6), demographic comparisons (Ch.16), regression on economic factors (Ch.22-23), Simpson's paradox examples (Ch.27).
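For readers who prefer to call the Census API directly, a query is just a URL, and the response is a JSON list of rows with the header first. The following is a minimal sketch under those assumptions. The function names are hypothetical, the variable code `B19013_001E` (median household income) is used as an example, and the sample rows are made up to show the response shape.

```python
from urllib.parse import urlencode

import pandas as pd

BASE = "https://api.census.gov/data"


def acs_url(year, variables, geography="state:*", dataset="acs/acs5"):
    """Build a Census API query URL for the given ACS variables."""
    params = {"get": ",".join(["NAME", *variables]), "for": geography}
    # Keep commas, colons, and asterisks readable instead of percent-encoding them
    return f"{BASE}/{year}/{dataset}?{urlencode(params, safe=',:*')}"


def rows_to_frame(rows):
    """Convert the API's list-of-lists JSON (header row first) to a DataFrame."""
    header, *records = rows
    return pd.DataFrame(records, columns=header)


url = acs_url(2022, ["B19013_001E"])
print(url)

# Made-up rows in the API's shape; real rows come from the JSON at `url`
sample = [
    ["NAME", "B19013_001E", "state"],
    ["Alpha", "50000", "01"],
    ["Beta", "60000", "02"],
]
df = rows_to_frame(sample)
df["B19013_001E"] = pd.to_numeric(df["B19013_001E"])  # values arrive as strings
print(df)
```

The numeric conversion matters because the API returns every value as text, including the numbers.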
Bureau of Labor Statistics (BLS)
- What it contains: Employment, unemployment, wages, prices (CPI), productivity, and workplace safety data. Updated monthly.
- Format: CSV, Excel, API.
- URL: bls.gov/data
- How to access: Use the BLS Data Viewer for interactive queries, or download flat files. The BLS API supports automated data retrieval.
- Suggested uses: Time series of unemployment rates, wage comparisons across industries (Ch.16, Ch.20), CPI inflation analysis, labor force participation trends.
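The BLS API accepts a JSON request listing series IDs and a year range, and returns nested JSON. The sketch below builds such a request body and flattens a response into a table. The function names are hypothetical, `LNS14000000` (the seasonally adjusted unemployment rate) is used as an example series, and the sample response contains made-up values that only illustrate the shape.

```python
import json

import pandas as pd

BLS_URL = "https://api.bls.gov/publicAPI/v2/timeseries/data/"


def bls_request_body(series_ids, start_year, end_year):
    """JSON body to POST to BLS_URL (years are sent as strings)."""
    return json.dumps({
        "seriesid": list(series_ids),
        "startyear": str(start_year),
        "endyear": str(end_year),
    })


def bls_to_frame(response):
    """Flatten the nested JSON response into one row per observation."""
    rows = [
        {"series": s["seriesID"], "year": int(obs["year"]),
         "period": obs["period"], "value": obs["value"]}
        for s in response["Results"]["series"]
        for obs in s["data"]
    ]
    frame = pd.DataFrame(rows)
    # Values arrive as strings; anything non-numeric becomes NaN
    frame["value"] = pd.to_numeric(frame["value"], errors="coerce")
    return frame


body = bls_request_body(["LNS14000000"], 2022, 2023)

# Made-up response in the API's shape; not real unemployment rates
sample = {"status": "REQUEST_SUCCEEDED", "Results": {"series": [
    {"seriesID": "LNS14000000", "data": [
        {"year": "2023", "period": "M02", "value": "4.0"},
        {"year": "2023", "period": "M01", "value": "4.2"},
    ]},
]}}
rates = bls_to_frame(sample)
print(rates)
```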
NOAA Climate Data Online
- What it contains: Historical weather and climate data from thousands of stations worldwide. Includes temperature, precipitation, wind, snow, and extreme weather events.
- Format: CSV.
- URL: ncei.noaa.gov/cdo-web
- How to access: Search by location and date range, select variables, and download. Free account required.
- Suggested uses: Temperature distributions (Ch.5-6, Ch.10), correlation between climate variables (Ch.22), regression on climate trends (Ch.22-23), normality assessment (Ch.10).
- Used in textbook: Suggested portfolio dataset.
FBI Uniform Crime Reporting (UCR) / Crime Data Explorer
- What it contains: National crime statistics including violent crime, property crime, hate crimes, and law enforcement data. Reported by ~18,000 agencies.
- Format: CSV, Excel, API.
- URL: cde.ucr.cjis.gov
- How to access: Interactive data explorer with download options. Bulk downloads available.
- Suggested uses: Crime rate comparisons across cities or years (Ch.16, Ch.20), correlation between crime and socioeconomic factors (Ch.22), chi-square tests on crime type distributions (Ch.19), algorithmic fairness analysis (Ch.26-27).
- Used in textbook: Professor Washington's anchor example context.
U.S. College Scorecard
- What it contains: Data on every degree-granting institution in the United States: graduation rates, student debt, post-graduation earnings, admission rates, demographics, and financial aid.
- Format: CSV (updated annually).
- URL: collegescorecard.ed.gov/data
- How to access: Download the full dataset or use the API. The "Most Recent" data file is most useful for course projects.
- Suggested uses: Earnings distributions (Ch.5-6), comparing outcomes across institution types (Ch.16, Ch.20), regression on graduation rates (Ch.22-23), logistic regression on graduation (Ch.24).
- Used in textbook: Suggested portfolio dataset.
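Scorecard files mark unavailable values with text such as "PrivacySuppressed" and "NULL", which would otherwise force numeric columns to load as text. The sketch below shows one way to handle this at load time. The column names follow the Scorecard data dictionary (`C150_4` for the four-year completion rate, `MD_EARN_WNE_P10` for median earnings ten years after entry), but the rows are made up for illustration.

```python
import io

import pandas as pd

# Made-up rows using Scorecard column names; values are illustrative only
sample_csv = io.StringIO(
    "UNITID,INSTNM,C150_4,MD_EARN_WNE_P10\n"
    "1,College A,0.62,41000\n"
    "2,College B,NULL,PrivacySuppressed\n"
    "3,College C,0.48,36500\n"
)

# Treat the suppression markers as missing so numeric columns stay numeric
df = pd.read_csv(sample_csv, na_values=["NULL", "PrivacySuppressed"])

print(df.dtypes)
print(df["MD_EARN_WNE_P10"].median())  # median skips the missing value
```

For the real download, the same `na_values` argument can be passed along with the path to the CSV file.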
Data.gov
- What it contains: A meta-catalog of over 250,000 datasets from federal agencies covering agriculture, climate, education, energy, finance, health, and public safety.
- Format: Varies by dataset (CSV, JSON, XML, API).
- URL: data.gov
- How to access: Search by topic or keyword. Quality and format vary widely.
- Suggested uses: Browse for topic-specific projects; useful for finding niche datasets.
D.2 International Data
World Health Organization (WHO) Global Health Observatory
- What it contains: Health statistics for 194 countries: life expectancy, disease prevalence, mortality rates, healthcare spending, vaccination coverage, and health workforce data.
- Format: CSV, API.
- URL: who.int/data/gho
- How to access: Use the interactive data explorer or download bulk files. The `who` Python package provides API access.
- Suggested uses: International health comparisons (Ch.16, Ch.20), life expectancy regression (Ch.22-23), disease prevalence proportions (Ch.14).
World Bank Open Data
- What it contains: Development indicators for 217 countries: GDP, poverty rates, education enrollment, access to electricity, CO2 emissions, and 1,400+ other indicators spanning decades.
- Format: CSV, Excel, API.
- URL: data.worldbank.org
- How to access: Search by indicator or country. The `wbdata` or `world_bank_data` Python packages simplify access.
- Suggested uses: Economic development analysis, international comparisons (Ch.20), regression on GDP and education (Ch.22-23), longitudinal trends.
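Without a helper package, the World Bank API can also be queried by URL. Its JSON body is a two-item list: paging metadata followed by the records. The following is a rough sketch under that assumption. The function names are hypothetical, `NY.GDP.PCAP.CD` (GDP per capita in current US$) is used as an example indicator, and the sample payload is made up to show the shape.

```python
import pandas as pd


def wb_url(indicator, countries="all", per_page=1000):
    """Build a World Bank API (v2) URL that returns JSON."""
    return (
        f"https://api.worldbank.org/v2/country/{countries}"
        f"/indicator/{indicator}?format=json&per_page={per_page}"
    )


def wb_to_frame(payload):
    """Turn the [metadata, records] payload into a tidy DataFrame."""
    _meta, records = payload
    rows = [
        {"country": r["country"]["value"], "year": int(r["date"]), "value": r["value"]}
        for r in records
    ]
    return pd.DataFrame(rows)


url = wb_url("NY.GDP.PCAP.CD", countries="USA;MEX")

# Made-up payload in the API's shape; values are illustrative only
sample = [
    {"page": 1, "pages": 1},
    [
        {"country": {"id": "AA", "value": "Country A"}, "date": "2020", "value": 1000.0},
        {"country": {"id": "BB", "value": "Country B"}, "date": "2020", "value": None},
    ],
]
gdp = wb_to_frame(sample)
print(gdp)
```

Country-years with no reported figure come back as null, so expect missing values even in well-known indicators.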
Gapminder
- What it contains: A curated collection of global statistics on health, wealth, population, and education, designed for teaching and data visualization. Made famous by Hans Rosling's TED talks.
- Format: CSV, Excel.
- URL: gapminder.org/data
- How to access: Download directly from the website, or use the Python `gapminder` package (`pip install gapminder`).
- Suggested uses: Life expectancy vs. GDP scatterplots (Ch.5, Ch.22), international health comparisons (Ch.16), distributions of development indicators (Ch.5-6).
- Used in textbook: Suggested portfolio dataset; referenced in Ch.1 case study and Ch.5 case study.
Our World in Data
- What it contains: Research-driven datasets and visualizations on global problems: COVID-19, climate change, poverty, education, health, energy, and technology. Curated by researchers at the University of Oxford.
- Format: CSV (GitHub-hosted), interactive charts.
- URL: ourworldindata.org
- How to access: Every chart has a "Download" button for the underlying data. All data is also on GitHub at github.com/owid/owid-datasets.
- Suggested uses: COVID-19 analysis (Ch.5-7, Ch.14), climate change trends (Ch.22), global inequality (Ch.16), data visualization examples (Ch.25).
D.3 Sports Data
Basketball Reference
- What it contains: Comprehensive NBA statistics: player stats, team records, game logs, advanced metrics, and historical data dating back to the BAA/NBA's founding in 1946.
- Format: HTML tables (easily scraped), CSV export available for some tables.
- URL: basketball-reference.com
- How to access: Browse team and player pages. Use the `basketball_reference_web_scraper` Python package or copy/paste tables into Excel.
- Suggested uses: Shooting percentage analysis (Ch.14, Ch.17), player comparison (Ch.16), regression on player performance (Ch.22-23), sports analytics projects.
- Used in textbook: Sam Okafor's anchor example context.
ESPN
- What it contains: Statistics for NFL, NBA, MLB, NHL, soccer, and college sports. Real-time scores, standings, and player stats.
- Format: HTML, plus an unofficial, undocumented API.
- URL: espn.com/stats
- How to access: Browse directly or use community-built API wrappers.
- Suggested uses: Cross-sport comparisons, team performance analysis, current-season data for timely projects.
FiveThirtyEight Data
- What it contains: Datasets behind FiveThirtyEight's data journalism articles on politics, sports, science, economics, and culture. Well-documented and analysis-ready.
- Format: CSV (GitHub-hosted).
- URL: github.com/fivethirtyeight/data
- How to access: Browse the GitHub repository and download individual datasets. Includes README files explaining each dataset.
- Suggested uses: Election polling analysis (Ch.12, Ch.14), sports predictions (Ch.22-24), examples of data journalism (Ch.25), pre-cleaned datasets for quick projects.
D.4 General-Purpose Repositories
Kaggle Datasets
- What it contains: Over 200,000 user-contributed datasets on every imaginable topic. Quality varies from excellent (curated competition datasets) to rough (user uploads).
- Format: CSV, JSON, SQLite, images.
- URL: kaggle.com/datasets
- How to access: Free account required. Search by topic, sort by votes or usability rating. Download directly or use the Kaggle API: `kaggle datasets download -d owner/dataset-name`
- Suggested uses: Great for finding topic-specific datasets for the portfolio project. Look for datasets with high usability scores (8+/10) and active discussions.
- Caution: Always check the data description and license. Some datasets have quality issues or unclear provenance.
UCI Machine Learning Repository
- What it contains: A classic collection of ~600 datasets curated for machine learning and statistics research since 1987. Well-documented with variable descriptions and prior analyses.
- Format: CSV, data files.
- URL: archive.ics.uci.edu/datasets
- How to access: Browse by task type, subject area, or number of observations. Download directly.
- Suggested uses: Regression datasets (Ch.22-24), classification datasets (Ch.24), chi-square analysis (Ch.19). The Iris, Wine, Heart Disease, and Auto MPG datasets are particularly well-suited for introductory statistics.
Google Dataset Search
- What it contains: A search engine that indexes datasets from across the web — government agencies, research institutions, news organizations, and data repositories.
- Format: Varies by source.
- URL: datasetsearch.research.google.com
- How to access: Search by keyword, just like regular Google. Results link to the original data source.
- Suggested uses: Finding datasets on specific topics that aren't covered by the sources above. Useful for portfolio project exploration.
World Happiness Report
- What it contains: Annual survey data ranking 150+ countries by self-reported life satisfaction, along with explanatory factors: GDP per capita, social support, healthy life expectancy, freedom, generosity, and corruption.
- Format: CSV, Excel.
- URL: worldhappiness.report
- How to access: Download from the report website or from Kaggle.
- Suggested uses: Multiple regression (Ch.23 — excellent for "holding other variables constant"), correlation analysis (Ch.22), international comparisons (Ch.16, Ch.20), visualization projects (Ch.25).
- Used in textbook: Suggested portfolio dataset.
D.5 Tips for Choosing a Dataset
For the Data Detective Portfolio
Your portfolio dataset should have:
- At least 200 observations (rows). More is better for inference, but avoid datasets so large they're unwieldy (over 1 million rows may be slow on older computers).
- At least 6 variables, including a mix of:
  - At least 2 categorical variables (for chi-square tests, group comparisons)
  - At least 2 numerical variables (for t-tests, correlation, regression)
  - At least 1 potential outcome/response variable
- Some missing values (so you can practice data cleaning in Chapter 7).
- A meaningful question you care about. You'll spend the entire semester with this dataset. Pick something you find genuinely interesting.
- Documentation. A data dictionary or codebook that explains what each variable means and how the data were collected.
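The countable items on this checklist can be screened automatically once a candidate dataset is loaded. The sketch below is one possible check, not an official grading tool. The function name is hypothetical, and it assumes that every non-numeric column is categorical, so categories stored as numeric codes would be miscounted. The demonstration frame is randomly generated.

```python
import numpy as np
import pandas as pd


def portfolio_check(df):
    """Screen a DataFrame against the countable portfolio criteria."""
    numerical = df.select_dtypes(include="number").columns
    categorical = df.select_dtypes(exclude="number").columns  # assumption: non-numeric = categorical
    return {
        "rows_ok": len(df) >= 200,
        "variables_ok": df.shape[1] >= 6,
        "categorical_ok": len(categorical) >= 2,
        "numerical_ok": len(numerical) >= 2,
        "has_missing": bool(df.isna().any().any()),
    }


# Randomly generated demonstration data, not a real dataset
rng = np.random.default_rng(0)
n = 250
demo = pd.DataFrame({
    "region": rng.choice(["N", "S", "E", "W"], n),
    "smoker": rng.choice(["yes", "no"], n),
    "age": rng.integers(18, 90, n),
    "income": rng.normal(50_000, 10_000, n),
    "bmi": rng.normal(27, 4, n),
    "hours_sleep": rng.normal(7, 1, n),
})
demo.loc[::25, "income"] = np.nan  # blank out every 25th income value

print(portfolio_check(demo))
```

The remaining criteria (a meaningful question, a response variable, and documentation) still need your own judgment.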
Red Flags to Watch For
- No documentation: If you can't figure out what the variables mean, move on.
- Too clean: If every value is perfect with no missing data, you won't get cleaning practice. (But don't pick data that's 50% missing either.)
- Unclear provenance: If you can't determine who collected the data and how, the data may not be trustworthy.
- Ethical concerns: Avoid datasets with identifiable personal information unless they were explicitly released for public use with appropriate consent.
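The "too clean" and "too much missing" red flags can be checked with a quick missingness profile. The sketch below is one way to do it. The function name is hypothetical, the 0.5 threshold mirrors the 50% figure mentioned above, and the small frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_profile(df, too_sparse=0.5):
    """Report the share of missing values per column and flag sparse columns."""
    share = df.isna().mean()  # fraction of missing values in each column
    return pd.DataFrame({"missing_share": share, "too_sparse": share >= too_sparse})


# Small made-up frame for illustration
demo = pd.DataFrame({
    "age": [34, 51, np.nan, 29],
    "income": [np.nan, np.nan, np.nan, 42000],
    "state": ["OH", "TX", "CA", "NY"],
})

profile = missing_profile(demo)
print(profile)
print("Any missing values at all:", bool(demo.isna().any().any()))
```

A profile with zeros in every row suggests the data offer no cleaning practice, while several flagged columns suggest the dataset may be too sparse to analyze.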
Loading Common Formats
import pandas as pd
# CSV (most common)
df = pd.read_csv("data.csv")
# Excel
df = pd.read_excel("data.xlsx")
# JSON
df = pd.read_json("data.json")
# From a URL
df = pd.read_csv("https://example.com/data.csv")
# Tab-separated
df = pd.read_csv("data.tsv", sep="\t")
# Fixed-width
df = pd.read_fwf("data.txt")
D.6 Quick Reference: Datasets by Chapter Topic
| Chapter Topic | Recommended Sources |
|---|---|
| Getting started / variable types (Ch.1-2) | Any dataset from this list |
| Data cleaning (Ch.7) | BRFSS, ACS (real-world messiness) |
| Probability (Ch.8-10) | BRFSS (contingency tables), sports data (proportions) |
| Confidence intervals (Ch.12) | College Scorecard, BLS |
| Proportions (Ch.14) | BRFSS, polling data (FiveThirtyEight) |
| Comparing groups (Ch.16) | World Happiness, Gapminder, BRFSS by state |
| Chi-square (Ch.19) | Any dataset with 2+ categorical variables |
| ANOVA (Ch.20) | College Scorecard (by region), World Happiness (by continent) |
| Correlation / regression (Ch.22-23) | Gapminder, World Happiness, NOAA, College Scorecard |
| Logistic regression (Ch.24) | UCI Heart Disease, College Scorecard (graduation) |
| AI and bias (Ch.26-27) | COMPAS recidivism data, ProPublica datasets |