Appendix E: Dataset Catalog
This appendix catalogs every dataset used in exercises, case studies, and the progressive project throughout this textbook. For each dataset, you will find what it contains, how large it is, where to get it, what license governs its use, which chapters reference it, and any preprocessing notes specific to our exercises.
E.1 Tabular / Structured Datasets
MovieLens 25M
| Field | Details |
| --- | --- |
| Description | 25 million movie ratings from 162,000 users across 62,000 movies. Includes ratings, tags, and genome scores. |
| Size | ~250 MB (compressed), ~1.2 GB (extracted) |
| Format | CSV files (ratings.csv, movies.csv, tags.csv, genome-scores.csv, genome-tags.csv) |
| Download | `wget https://files.grouplens.org/datasets/movielens/ml-25m.zip` |
| License | Research use only (GroupLens terms of use) |
| Chapters | Ch. 7 (collaborative filtering), Ch. 9 (matrix factorization), Ch. 14 (embedding layers), Ch. 32 (large-scale recommendations) |
| Preprocessing | Filter to users with >= 20 ratings for cold-start experiments. Convert timestamps to datetime. For causal exercises in Ch. 23, use the tag genome scores to construct synthetic treatment variables. |
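The filtering and timestamp conversion above can be sketched in pandas. The tiny frame below is a stand-in for ratings.csv (same column names as the real file); a lower threshold is used so the toy example actually filters something — use 20 on the full data.

```python
import pandas as pd

# Toy stand-in for ratings.csv (columns: userId, movieId, rating, timestamp)
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 2],
    "movieId": [10, 20, 10, 30, 40],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
    "timestamp": [1147880044, 1147868817, 1147878820, 1147869000, 1147870000],
})

# Keep only users with at least `min_ratings` ratings (use 20 on the real data)
min_ratings = 3
counts = ratings.groupby("userId")["movieId"].transform("count")
ratings = ratings[counts >= min_ratings].copy()

# Convert UNIX timestamps to datetime
ratings["datetime"] = pd.to_datetime(ratings["timestamp"], unit="s")
```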
UCI Adult (Census Income)
| Field | Details |
| --- | --- |
| Description | Predict whether income exceeds $50K/year based on census data. 48,842 instances, 14 attributes. |
| Size | ~4 MB |
| Format | CSV |
| Download | `from sklearn.datasets import fetch_openml; df = fetch_openml("adult", version=2, as_frame=True).frame` |
| License | Public domain (CC0) |
| Chapters | Ch. 3 (classification baselines), Ch. 25 (fairness metrics), Ch. 26 (bias mitigation), Ch. 27 (differential privacy) |
| Preprocessing | Drop fnlwgt column. Binarize income as target. Use race and sex as sensitive attributes for fairness exercises. Handle missing values marked as "?" in workclass, occupation, and native-country. |
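These steps look like the following sketch. The frame here is a hand-made toy slice, and it assumes the OpenML version names the label column `class`; verify the column names on the frame you actually fetch.

```python
import numpy as np
import pandas as pd

# Toy slice of the Adult data; the real frame comes from fetch_openml("adult", version=2)
df = pd.DataFrame({
    "workclass": ["Private", "?", "Self-emp-not-inc"],
    "fnlwgt": [226802, 89814, 336951],
    "occupation": ["Machine-op-inspct", "?", "Farming-fishing"],
    "class": [">50K", "<=50K", ">50K"],
})

df = df.drop(columns=["fnlwgt"])                    # survey weight, not a feature
df = df.replace("?", np.nan)                        # "?" marks missing values
df["target"] = (df["class"] == ">50K").astype(int)  # binarize income
```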
UCI Wine Quality
| Field | Details |
| --- | --- |
| Description | Physicochemical properties and quality scores for red and white Portuguese wines. 6,497 instances combined. |
| Size | ~300 KB |
| Format | CSV (semicolon-delimited) |
| Download | `from sklearn.datasets import fetch_openml; red = fetch_openml("wine-quality-red", as_frame=True).frame; white = fetch_openml("wine-quality-white", as_frame=True).frame` |
| License | CC BY 4.0 |
| Chapters | Ch. 2 (EDA), Ch. 4 (regression), Ch. 17 (Bayesian regression) |
| Preprocessing | Combine red and white datasets with an added color column for the full analysis. Quality scores range 3-9; for binary classification exercises, threshold at quality >= 7. |
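The combine-and-threshold step is a two-liner; the toy frames below stand in for the red and white tables fetched above.

```python
import pandas as pd

# Toy stand-ins for the red and white wine frames
red = pd.DataFrame({"alcohol": [9.4, 10.2], "quality": [5, 7]})
white = pd.DataFrame({"alcohol": [8.8, 11.0], "quality": [6, 8]})

# Tag each frame with its color, then stack them
red["color"] = "red"
white["color"] = "white"
wine = pd.concat([red, white], ignore_index=True)

# Binary target for classification exercises: "good" wine is quality >= 7
wine["good"] = (wine["quality"] >= 7).astype(int)
```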
California Housing
| Field | Details |
| --- | --- |
| Description | Median house values for California districts from the 1990 census. 20,640 instances, 8 features. |
| Size | ~1.5 MB |
| Format | Bundled in scikit-learn |
| Download | `from sklearn.datasets import fetch_california_housing; data = fetch_california_housing(as_frame=True).frame` |
| License | Public domain |
| Chapters | Ch. 1 (introduction), Ch. 4 (regression), Ch. 5 (feature engineering), Ch. 30 (data validation with Great Expectations) |
| Preprocessing | Note that the target (MedHouseVal) is capped at $500K in the original data; this is by design, not an error, so do not treat capped values as outliers. Log-transform MedInc and Population for the normalization exercises. |
Kaggle Ames Housing
| Field | Details |
| --- | --- |
| Description | Detailed housing data for Ames, Iowa. 2,930 instances, 80 features (23 nominal, 23 ordinal, 14 discrete, 20 continuous). |
| Size | ~1 MB |
| Format | CSV |
| Download | `kaggle competitions download -c house-prices-advanced-regression-techniques` (requires Kaggle API key) |
| License | Competition terms (educational use permitted) |
| Chapters | Ch. 5 (feature engineering), Ch. 6 (advanced pipelines) |
| Preprocessing | Significant missing data in PoolQC, MiscFeature, Alley, Fence, FireplaceQu. These are meaningful absences (no pool, no alley), not random missingness. Encode accordingly. |
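One way to encode these meaningful absences is to turn NaN into an explicit "None" category, as sketched below on a toy slice (the full data has the same pattern in MiscFeature and Fence as well).

```python
import pandas as pd

# Toy slice; in the real data NaN in these columns means "feature absent", not "unknown"
df = pd.DataFrame({
    "PoolQC": [None, "Ex", None],
    "Alley": [None, None, "Grvl"],
    "FireplaceQu": ["Gd", None, "TA"],
})

# Replace NaN with an explicit "no pool / no alley / no fireplace" category
absence_cols = ["PoolQC", "Alley", "FireplaceQu"]  # also MiscFeature, Fence on the full data
df[absence_cols] = df[absence_cols].fillna("None")
```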
Lending Club Loan Data
| Field | Details |
| --- | --- |
| Description | Anonymized loan application and repayment data. Millions of records with outcomes (fully paid, charged off, current). |
| Size | ~1.5 GB |
| Format | CSV |
| Download | Available via Kaggle: `kaggle datasets download -d wordsforthewise/lending-club` |
| License | CC0 (public domain) |
| Chapters | Ch. 22 (causal inference — does interest rate cause default?), Ch. 23 (instrumental variables), Ch. 25 (fairness in lending) |
| Preprocessing | Filter to fully paid and charged off loans only (drop "current"). Create binary target `default = 1 if loan_status == "Charged Off"`. Parse term, emp_length to numeric. Drop leakage features: total_pymnt, recoveries, collection_recovery_fee. |
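A minimal sketch of this pipeline on a hand-made toy slice follows; the string formats for `term` and `emp_length` mirror the real columns, but double-check them against the file you download.

```python
import pandas as pd

# Toy slice with the columns the preprocessing notes mention
loans = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current"],
    "term": [" 36 months", " 60 months", " 36 months"],
    "emp_length": ["10+ years", "< 1 year", "3 years"],
    "total_pymnt": [12000.0, 3400.0, 800.0],  # leakage: known only after the outcome
})

# Keep only resolved loans and define the binary default target
loans = loans[loans["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
loans["default"] = (loans["loan_status"] == "Charged Off").astype(int)

# Parse term and emp_length to numeric
loans["term_months"] = loans["term"].str.extract(r"(\d+)", expand=False).astype(int)
loans["emp_years"] = (loans["emp_length"]
                      .str.replace("< 1 year", "0", regex=False)
                      .str.extract(r"(\d+)", expand=False).astype(float))

# Drop post-outcome (leakage) features before modeling
loans = loans.drop(columns=["total_pymnt"])
```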
IBM HR Analytics (Synthetic)
| Field | Details |
| --- | --- |
| Description | Synthetic employee attrition dataset created by IBM data scientists. 1,470 instances, 35 features. |
| Size | ~230 KB |
| Format | CSV |
| Download | `kaggle datasets download -d pavansubhasht/ibm-hr-analytics-attrition-dataset` |
| License | CC0 (public domain) |
| Chapters | Ch. 21 (causal forests for heterogeneous treatment effects), Ch. 24 (uplift modeling) |
| Preprocessing | Binarize Attrition as target. EmployeeNumber is an ID column — drop it before modeling. Several features are ordinal (e.g., JobSatisfaction 1-4) — encode as numeric, not one-hot. |
E.2 Time Series and Financial Datasets
Yahoo Finance (via yfinance)
| Field | Details |
| --- | --- |
| Description | Historical stock prices, volumes, and market indices. |
| Size | Varies (downloaded on demand) |
| Format | pandas DataFrame via API |
| Download | `pip install yfinance`, then `import yfinance as yf; df = yf.download("AAPL", start="2015-01-01", end="2025-01-01")` |
| License | Yahoo terms of use (educational/personal use) |
| Chapters | Ch. 15 (time series with LSTMs), Ch. 16 (attention for sequences), Ch. 18 (Bayesian structural time series) |
| Preprocessing | Use adjusted close prices for returns calculations. Handle market holidays (missing dates). For multi-stock exercises, align all tickers to the same trading days. |
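The alignment step can be sketched without hitting the network; the toy series below stand in for per-ticker download output, with one series missing a trading day.

```python
import pandas as pd

# Toy price series on slightly different trading calendars (stand-ins for
# yf.download output); the second series is missing 2024-01-03
aapl = pd.Series(
    [10.0, 10.5, 10.2, 10.8],
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]),
)
tsla = pd.Series(
    [20.0, 20.4, 21.0],
    index=pd.to_datetime(["2024-01-02", "2024-01-04", "2024-01-05"]),
)

# Align both tickers to the shared trading days, then compute daily returns
prices = pd.concat({"AAPL": aapl, "TSLA": tsla}, axis=1).dropna()
returns = prices.pct_change().dropna()
```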
FRED Economic Data
| Field | Details |
| --- | --- |
| Description | Federal Reserve Economic Data: GDP, unemployment, CPI, interest rates, and thousands of other macroeconomic time series. |
| Size | Individual series are small (KB); bulk downloads can be large |
| Format | CSV or API |
| Download | `pip install fredapi`, then `from fredapi import Fred; fred = Fred(api_key="YOUR_KEY"); df = fred.get_series("GDP")` |
| License | Public domain (U.S. government data) |
| Chapters | Ch. 18 (Bayesian structural time series), Ch. 22 (causal inference with time series — difference-in-differences) |
| Preprocessing | Series have different frequencies (daily, monthly, quarterly). Align frequencies before merging. Handle revisions — FRED provides "vintage" data where historical values are revised. |
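Frequency alignment is a resample-then-join; a minimal sketch with toy stand-ins for a quarterly and a monthly FRED series:

```python
import pandas as pd

# Toy stand-ins for two FRED series at different frequencies
quarterly_gdp = pd.Series(
    [100.0, 102.0, 104.0, 106.0],
    index=pd.date_range("2020-01-01", periods=4, freq="QS"),
)
monthly_unrate = pd.Series(
    range(12),
    index=pd.date_range("2020-01-01", periods=12, freq="MS"),
)

# Downsample the monthly series to quarterly means, then join
# on the shared quarterly index
unrate_q = monthly_unrate.resample("QS").mean()
merged = pd.concat({"gdp": quarterly_gdp, "unrate": unrate_q}, axis=1)
```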
ERA5 Climate Reanalysis
| Field | Details |
| --- | --- |
| Description | Hourly global climate data from ECMWF: temperature, precipitation, wind, humidity, pressure. 0.25° grid resolution from 1940 to present. |
| Size | Extremely large (full dataset is petabytes); subset to specific variables, regions, and time ranges |
| Format | NetCDF (.nc) or GRIB |
| Download | Via the Copernicus Climate Data Store API: `pip install cdsapi`, then use the CDS web interface to select variables and submit download requests |
| License | Copernicus License (free for all uses including commercial, with attribution) |
| Chapters | Ch. 15 (spatiotemporal modeling), Ch. 31 (distributed training on large data), Ch. 33 (production pipeline case study) |
| Preprocessing | Use xarray to read NetCDF files. Subset to region of interest before loading into memory. Convert temperature from Kelvin to Celsius. Aggregate hourly to daily for most exercises. |
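The unit conversion and daily aggregation look the same regardless of the container; the sketch below uses a synthetic hourly pandas series standing in for one ERA5 grid cell (the real workflow does the equivalent with xarray's `.resample(time="1D")`).

```python
import numpy as np
import pandas as pd

# Synthetic hourly 2m temperature for one grid cell, in Kelvin,
# with a simple diurnal cycle over two days
hours = pd.date_range("2024-01-01", periods=48, freq="h")
t2m_kelvin = pd.Series(280.0 + np.sin(np.arange(48) / 24 * 2 * np.pi), index=hours)

t2m_celsius = t2m_kelvin - 273.15              # Kelvin -> Celsius
daily_mean = t2m_celsius.resample("D").mean()  # hourly -> daily
```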
CMIP6 Climate Projections
| Field | Details |
| --- | --- |
| Description | Climate model output from the Coupled Model Intercomparison Project Phase 6. Multiple models, multiple scenarios (SSP1-2.6, SSP2-4.5, SSP3-7.0, SSP5-8.5). |
| Size | Subsets: 1-50 GB per variable/model/scenario |
| Format | NetCDF |
| Download | Via ESGF nodes: https://esgf-node.llnl.gov/search/cmip6/ |
| License | Varies by modeling center; most are CC BY 4.0 |
| Chapters | Ch. 15 (climate time series), Ch. 35 (capstone project option) |
| Preprocessing | Regrid to common resolution using xesmf. Select a single model (e.g., CESM2) for exercises to avoid multi-model complexity. Compute anomalies relative to 1981-2010 baseline. |
E.3 Natural Language Processing Datasets
IMDb Movie Reviews
| Field | Details |
| --- | --- |
| Description | 50,000 movie reviews labeled as positive or negative sentiment. Balanced dataset (25K/25K). |
| Size | ~65 MB |
| Format | HuggingFace Datasets |
| Download | `from datasets import load_dataset; ds = load_dataset("imdb")` |
| License | Apache 2.0 |
| Chapters | Ch. 13 (text classification with CNNs), Ch. 14 (transformer fine-tuning), Ch. 27 (differential privacy for NLP) |
| Preprocessing | Already split into train/test. For validation, split 10% from training set with stratification. Truncate to 256 tokens for efficiency in most exercises; use 512 for the fine-tuning chapter. |
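The stratified validation split can be sketched with scikit-learn on toy data as below (recent versions of HuggingFace Datasets also support this directly via `ds["train"].train_test_split(test_size=0.1, stratify_by_column="label")`).

```python
from sklearn.model_selection import train_test_split

# Toy stand-in: 100 reviews with binary sentiment labels, perfectly
# balanced like IMDb
texts = [f"review {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# Hold out 10% for validation, stratified so both splits stay balanced
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.1, stratify=labels, random_state=0
)
```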
SQuAD 2.0
| Field | Details |
| --- | --- |
| Description | Stanford Question Answering Dataset with 150K questions, including 50K unanswerable questions. |
| Size | ~45 MB |
| Format | JSON / HuggingFace Datasets |
| Download | `from datasets import load_dataset; ds = load_dataset("rajpurkar/squad_v2")` |
| License | CC BY-SA 4.0 |
| Chapters | Ch. 14 (question answering fine-tuning) |
| Preprocessing | Use the HuggingFace tokenizer's `return_offsets_mapping=True` to align token positions with character-level answer spans. Unanswerable questions have empty answers — handle these explicitly. |
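The alignment itself is a small search over the offsets list the tokenizer returns. The helper below (a hypothetical name, not a library function) shows the core logic on hand-written offsets, independent of any tokenizer.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character-level answer span [start_char, end_char) to token
    indices, given the (start, end) character offsets per token — the
    structure that return_offsets_mapping=True produces."""
    start_tok = end_tok = None
    for i, (s, e) in enumerate(offsets):
        if s <= start_char < e and start_tok is None:
            start_tok = i          # first token containing the span start
        if s < end_char <= e:
            end_tok = i            # token containing the span end
    return start_tok, end_tok

# "The cat sat" -> tokens "The", "cat", "sat" with character offsets:
offsets = [(0, 3), (4, 7), (8, 11)]
```

Unanswerable questions have no span at all; by convention they are usually mapped to the `[CLS]` position (token 0) rather than passed through a helper like this.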
AG News
| Field | Details |
| --- | --- |
| Description | News articles classified into 4 categories: World, Sports, Business, Science/Technology. 120K training, 7,600 test. |
| Size | ~30 MB |
| Format | HuggingFace Datasets |
| Download | `from datasets import load_dataset; ds = load_dataset("ag_news")` |
| License | Academic use |
| Chapters | Ch. 13 (text classification baselines), Ch. 16 (attention visualization) |
| Preprocessing | Combine title and description fields with a [SEP] token for transformer models. Labels are 0-indexed (0=World, 1=Sports, 2=Business, 3=Sci/Tech). |
Amazon Reviews (Multilingual)
| Field | Details |
| --- | --- |
| Description | Product reviews in multiple languages with star ratings. Millions of reviews. |
| Size | 1-20 GB depending on language/category subset |
| Format | HuggingFace Datasets |
| Download | `from datasets import load_dataset; ds = load_dataset("amazon_reviews_multi", "en")` |
| License | Amazon terms (research use) |
| Chapters | Ch. 14 (transfer learning), Ch. 32 (large-scale NLP pipeline) |
| Preprocessing | Filter to English subset for most exercises. Create binary sentiment: 4-5 stars = positive, 1-2 stars = negative, drop 3-star reviews. Subsample to 100K reviews for local training. |
E.4 Health and Biomedical Datasets
MIMIC-IV
| Field | Details |
| --- | --- |
| Description | De-identified electronic health records from Beth Israel Deaconess Medical Center. Includes diagnoses, procedures, lab results, medications, and clinical notes for ~300K ICU admissions. |
| Size | ~7 GB (compressed) |
| Format | CSV (relational tables) |
| Download | Requires credentialed access: complete CITI training, sign data use agreement at https://physionet.org/content/mimiciv/ |
| License | PhysioNet Credentialed Health Data License 1.5.0 (no redistribution, requires IRB-equivalent training) |
| Chapters | Ch. 20 (survival analysis), Ch. 22 (causal inference in healthcare), Ch. 25 (fairness in clinical models), Ch. 35 (capstone project option) |
| Preprocessing | Join admissions, patients, diagnoses_icd, and labevents tables on subject_id and hadm_id. ICD codes are in both ICD-9 and ICD-10 — use the icd_version column to distinguish. For mortality prediction, define the target as in-hospital death (discharge_location == "DIED" or deathtime IS NOT NULL). Remove neonates (anchor_age == 0) for adult-only analyses. |
Access note: MIMIC-IV requires approximately 1-2 weeks for access approval. Begin the credentialing process before starting Chapter 20. If access is not available, the MIMIC-IV demo dataset (100 patients) is available without credentialing for code testing, and all exercises include synthetic data alternatives.
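The join-and-target logic above can be sketched on hand-made stand-ins for the admissions and patients tables (the real tables have many more columns, and the values below are invented):

```python
import pandas as pd

# Toy stand-ins for the MIMIC-IV admissions and patients tables
admissions = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "hadm_id": [101, 102, 103],
    "deathtime": [None, "2180-07-23 14:00:00", None],
    "discharge_location": ["HOME", "DIED", "HOME"],
})
patients = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "anchor_age": [0, 65, 47],
})

# Join on subject_id, then drop neonates for adult-only analyses
cohort = admissions.merge(patients, on="subject_id", how="inner")
cohort = cohort[cohort["anchor_age"] > 0].copy()

# In-hospital mortality target: died at discharge OR a recorded deathtime
cohort["mortality"] = (
    (cohort["discharge_location"] == "DIED") | cohort["deathtime"].notna()
).astype(int)
```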
Framingham Heart Study (Teaching Dataset)
| Field | Details |
| --- | --- |
| Description | Cardiovascular risk factors and outcomes from the Framingham Heart Study. Teaching version with ~4,000 participants. |
| Size | ~500 KB |
| Format | CSV |
| Download | Available via Kaggle: `kaggle datasets download -d aasheesh200/framingham-heart-study-dataset` |
| License | Public (teaching version only) |
| Chapters | Ch. 17 (Bayesian logistic regression), Ch. 20 (survival analysis) |
| Preprocessing | Handle missing values in education, cigsPerDay, BPMeds, totChol, BMI, heartRate, glucose. Do not impute randomly — missingness patterns are informative (patients with missing glucose may not have been tested). |
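One standard way to respect informative missingness is to add an explicit indicator column before imputing, so the model still sees that the value was missing; a minimal sketch on a toy slice:

```python
import numpy as np
import pandas as pd

# Toy slice; in the real data missingness is informative (untested patients)
df = pd.DataFrame({
    "glucose": [85.0, np.nan, 110.0, np.nan],
    "totChol": [195.0, 240.0, np.nan, 210.0],
})

# Record *that* a value was missing, then impute with the column median
for col in ["glucose", "totChol"]:
    df[f"{col}_missing"] = df[col].isna().astype(int)
    df[col] = df[col].fillna(df[col].median())
```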
E.5 Graph and Network Datasets
Cora Citation Network
| Field | Details |
| --- | --- |
| Description | Citation network of 2,708 machine learning papers classified into 7 categories. Each paper has a binary bag-of-words feature vector (1,433 dimensions). |
| Size | ~5 MB |
| Format | PyTorch Geometric |
| Download | `from torch_geometric.datasets import Planetoid; data = Planetoid(root="/tmp/cora", name="Cora")` |
| License | Public domain |
| Chapters | Ch. 12 (graph neural networks) |
| Preprocessing | Standard train/val/test split is provided via boolean masks. For transductive learning, all node features are available during training; only labels are masked. |
OGB (Open Graph Benchmark)
| Field | Details |
| --- | --- |
| Description | Standardized graph datasets for node classification, link prediction, and graph classification across molecular, social, and citation domains. |
| Size | 10 MB to 10 GB depending on dataset |
| Format | OGB Python package |
| Download | `pip install ogb`, then `from ogb.nodeproppred import PygNodePropPredDataset; dataset = PygNodePropPredDataset(name="ogbn-arxiv")` |
| License | MIT (OGB code), varies by underlying dataset |
| Chapters | Ch. 12 (graph neural networks — ogbn-arxiv), Ch. 31 (distributed GNN training — ogbn-products) |
| Preprocessing | OGB provides standardized splits and evaluators. Always use `dataset.get_idx_split()` for reproducible train/val/test splits. Do not create your own splits. |
E.6 Image Datasets
CIFAR-10 / CIFAR-100
| Field | Details |
| --- | --- |
| Description | 60,000 32x32 color images in 10 classes (CIFAR-10) or 100 classes (CIFAR-100). |
| Size | ~170 MB (CIFAR-10), ~170 MB (CIFAR-100) |
| Format | PyTorch built-in |
| Download | `torchvision.datasets.CIFAR10(root="./data", download=True)` |
| License | MIT |
| Chapters | Ch. 8 (CNNs), Ch. 10 (regularization), Ch. 11 (transfer learning) |
| Preprocessing | Normalize with mean=(0.4914, 0.4822, 0.4465), std=(0.2470, 0.2435, 0.2616) for CIFAR-10. Apply standard augmentation: random crop with padding=4, random horizontal flip. |
ImageNet (ILSVRC 2012 subset)
| Field | Details |
| --- | --- |
| Description | 1.2 million training images in 1,000 classes. The standard benchmark for image classification. |
| Size | ~150 GB |
| Format | JPEG images in class-labeled directories |
| Download | Registration required at https://image-net.org/. Use `torchvision.datasets.ImageNet` after downloading. |
| License | Non-commercial research only |
| Chapters | Ch. 11 (pretrained backbones), Ch. 31 (distributed training) |
| Preprocessing | Standard: resize to 256, center crop to 224, normalize with ImageNet mean/std. For data-efficient exercises, use the ImageNet-1k subset available via HuggingFace: `load_dataset("imagenet-1k", split="train[:10%]")`. |
E.7 Synthetic Datasets
Several exercises use synthetic data to isolate specific statistical concepts. These are generated programmatically and do not require downloads.
Synthetic Causal Data
```python
import numpy as np
import pandas as pd

def generate_causal_dataset(n=5000, seed=42):
    """Generate data with known causal structure for validation.

    DAG: Z -> X -> Y, Z -> Y (confounder)
         U -> X (instrument)
    True ATE of X on Y: 2.0
    """
    rng = np.random.default_rng(seed)
    Z = rng.normal(0, 1, n)  # Confounder
    U = rng.normal(0, 1, n)  # Instrument
    X = 0.5 * Z + 0.8 * U + rng.normal(0, 0.5, n)  # Treatment
    Y = 2.0 * X + 1.5 * Z + rng.normal(0, 1, n)    # Outcome (true ATE = 2.0)
    return pd.DataFrame({"Z": Z, "U": U, "X": X, "Y": Y})
```
Chapters: Ch. 21 (causal identification), Ch. 22 (backdoor adjustment), Ch. 23 (instrumental variables)
Synthetic Fairness Data
```python
def generate_fairness_dataset(n=10000, seed=42):
    """Generate hiring data with known demographic disparity.

    True qualification rate is equal across groups.
    Selection rate differs due to biased threshold.
    """
    rng = np.random.default_rng(seed)
    group = rng.choice(["A", "B"], size=n, p=[0.6, 0.4])
    qualification = rng.normal(50, 10, n)
    noise = rng.normal(0, 5, n)
    # Biased scoring: Group B gets a penalty
    score = qualification + noise - 5.0 * (group == "B").astype(float)
    hired = (score > 50).astype(int)
    return pd.DataFrame({
        "group": group,
        "qualification": qualification,
        "score": score,
        "hired": hired,
    })
```
Chapters: Ch. 25 (fairness metrics), Ch. 26 (mitigation)
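Given a frame of groups and hiring decisions like the one `generate_fairness_dataset` returns, the demographic-parity gap is just the difference in selection rates. A self-contained sketch on an inline toy frame (the name `dp_gap` is ours, not a library API):

```python
import pandas as pd

# Toy hiring outcomes; in the exercises this frame comes from
# generate_fairness_dataset()
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B"],
    "hired": [1, 1, 0, 1, 0],
})

# Selection rate per group, and the demographic parity gap between them
selection_rates = df.groupby("group")["hired"].mean()
dp_gap = selection_rates["A"] - selection_rates["B"]
```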
Synthetic Time Series
```python
def generate_arima_with_intervention(n=500, intervention_point=300, seed=42):
    """Generate time series with a known structural break for
    difference-in-differences and interrupted time series exercises.
    """
    rng = np.random.default_rng(seed)
    trend = np.linspace(0, 10, n)
    seasonality = 3 * np.sin(2 * np.pi * np.arange(n) / 52)
    noise = rng.normal(0, 1, n)
    treatment_effect = np.zeros(n)
    treatment_effect[intervention_point:] = 5.0  # true effect = 5.0
    y = trend + seasonality + noise + treatment_effect
    return pd.DataFrame({
        "t": np.arange(n),
        "y": y,
        "post_intervention": (np.arange(n) >= intervention_point).astype(int),
    })
```
Chapters: Ch. 22 (difference-in-differences), Ch. 18 (Bayesian structural time series)
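Because the true effect is known (5.0), this generator makes a good end-to-end check for an interrupted-time-series regression. The sketch below regenerates the same data-generating process inline so it is self-contained, and recovers the step effect with ordinary least squares, using one sine/cosine harmonic to absorb the seasonality:

```python
import numpy as np
import pandas as pd

def generate(n=500, ip=300, seed=42):
    # Same DGP as generate_arima_with_intervention above (true effect = 5.0)
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    y = (np.linspace(0, 10, n) + 3 * np.sin(2 * np.pi * t / 52)
         + rng.normal(0, 1, n) + 5.0 * (t >= ip))
    return pd.DataFrame({"t": t, "y": y, "post": (t >= ip).astype(int)})

df = generate()

# Interrupted-time-series design matrix: intercept, linear trend,
# one seasonal harmonic, and the post-intervention step
X = np.column_stack([
    np.ones(len(df)),
    df["t"].to_numpy(),
    np.sin(2 * np.pi * df["t"].to_numpy() / 52),
    np.cos(2 * np.pi * df["t"].to_numpy() / 52),
    df["post"].to_numpy(),
])
beta, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)
effect = beta[-1]  # estimated step effect; should land near the true 5.0
```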
E.8 Download and Storage Recommendations
Directory Structure
Organize datasets under a single root to keep paths consistent across chapters:
```
~/ads-data/
├── movielens/
│   └── ml-25m/
├── lending-club/
├── mimic-iv/          # credentialed access
├── era5/
│   ├── temperature/
│   └── precipitation/
├── imagenet/          # if using full dataset
├── huggingface/       # HuggingFace datasets cache
└── synthetic/         # generated by code in exercises
```
Set these environment variables in your shell profile:

```bash
export ADS_DATA_DIR="$HOME/ads-data"
export HF_HOME="$HOME/ads-data/huggingface"
```
Storage Estimates
| Category | Approximate Size |
| --- | --- |
| All tabular datasets | ~3 GB |
| NLP datasets (HuggingFace) | ~2 GB |
| Climate data (ERA5 subset) | ~10-50 GB |
| Image datasets (CIFAR only) | ~500 MB |
| Image datasets (with ImageNet) | ~160 GB |
| MIMIC-IV | ~7 GB |
| Total (without ImageNet) | ~20-65 GB |
| Total (with ImageNet) | ~180-225 GB |
Download Script
```python
"""download_datasets.py — Download all freely available datasets."""
import os
from pathlib import Path

DATA_DIR = Path(os.environ.get("ADS_DATA_DIR", Path.home() / "ads-data"))
DATA_DIR.mkdir(parents=True, exist_ok=True)


def download_sklearn_datasets():
    from sklearn.datasets import fetch_california_housing, fetch_openml

    print("Downloading scikit-learn datasets...")
    fetch_california_housing(data_home=DATA_DIR / "sklearn")
    fetch_openml("adult", version=2, data_home=DATA_DIR / "sklearn")
    fetch_openml("wine-quality-red", data_home=DATA_DIR / "sklearn")
    fetch_openml("wine-quality-white", data_home=DATA_DIR / "sklearn")
    print("  Done.")


def download_huggingface_datasets():
    from datasets import load_dataset

    print("Downloading HuggingFace datasets...")
    load_dataset("imdb", cache_dir=DATA_DIR / "huggingface")
    load_dataset("rajpurkar/squad_v2", cache_dir=DATA_DIR / "huggingface")
    load_dataset("ag_news", cache_dir=DATA_DIR / "huggingface")
    print("  Done.")


def download_torchvision_datasets():
    import torchvision

    print("Downloading torchvision datasets...")
    torchvision.datasets.CIFAR10(root=DATA_DIR / "cifar", download=True)
    torchvision.datasets.CIFAR100(root=DATA_DIR / "cifar", download=True)
    print("  Done.")


def download_movielens():
    import urllib.request
    import zipfile

    print("Downloading MovieLens 25M...")
    url = "https://files.grouplens.org/datasets/movielens/ml-25m.zip"
    dest = DATA_DIR / "movielens" / "ml-25m.zip"
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    with zipfile.ZipFile(dest, "r") as z:
        z.extractall(DATA_DIR / "movielens")
    print("  Done.")


if __name__ == "__main__":
    download_sklearn_datasets()
    download_huggingface_datasets()
    download_torchvision_datasets()
    download_movielens()
    print(f"\nAll datasets saved to {DATA_DIR}")
    print("Note: MIMIC-IV and ERA5 require separate credentialed access.")
    print("Note: Kaggle datasets require `kaggle` CLI and API key.")
```
Cross-references: Appendix C documents the libraries used to load and process these datasets. Appendix D covers the environment setup required to run the download script. Individual chapters provide detailed loading and preprocessing code for each dataset they use.