Appendix E: Dataset Catalog

This appendix catalogs every dataset used in exercises, case studies, and the progressive project throughout this textbook. For each dataset, you will find what it contains, how large it is, where to get it, what license governs its use, which chapters reference it, and any preprocessing notes specific to our exercises.


E.1 Tabular / Structured Datasets

MovieLens 25M

Field Details
Description 25 million movie ratings from 162,000 users across 62,000 movies. Includes ratings, tags, and genome scores.
Size ~250 MB (compressed), ~1.2 GB (extracted)
Format CSV files (ratings.csv, movies.csv, tags.csv, genome-scores.csv, genome-tags.csv)
Download wget https://files.grouplens.org/datasets/movielens/ml-25m.zip
License Research use only (GroupLens terms of use)
Chapters Ch. 7 (collaborative filtering), Ch. 9 (matrix factorization), Ch. 14 (embedding layers), Ch. 32 (large-scale recommendations)
Preprocessing Filter to users with >= 20 ratings for cold-start experiments. Convert timestamps to datetime. For causal exercises in Ch. 23, use the tag genome scores to construct synthetic treatment variables.
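
The minimum-ratings filter can be written as a groupby transform. A minimal sketch — the mini frame below is an illustrative stand-in for ratings.csv (same column names):

```python
import pandas as pd

def filter_min_ratings(ratings: pd.DataFrame, min_count: int = 20) -> pd.DataFrame:
    """Keep only rows from users with at least min_count ratings."""
    counts = ratings.groupby("userId")["movieId"].transform("size")
    return ratings[counts >= min_count].copy()

# Tiny stand-in for ratings.csv (userId, movieId, rating, timestamp)
ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2],
    "movieId": [10, 11, 12, 10],
    "rating": [4.0, 3.5, 5.0, 2.0],
    "timestamp": [1147880044] * 4,
})
filtered = filter_min_ratings(ratings, min_count=3)
# Convert Unix timestamps to datetime, as noted above
filtered["datetime"] = pd.to_datetime(filtered["timestamp"], unit="s")
```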

UCI Adult (Census Income)

Field Details
Description Predict whether income exceeds $50K/year based on census data. 48,842 instances, 14 attributes.
Size ~4 MB
Format CSV
Download from sklearn.datasets import fetch_openml; df = fetch_openml("adult", version=2, as_frame=True).frame
License Public domain (CC0)
Chapters Ch. 3 (classification baselines), Ch. 25 (fairness metrics), Ch. 26 (bias mitigation), Ch. 27 (differential privacy)
Preprocessing Drop the fnlwgt column (a survey sampling weight, not a predictive feature). Binarize income as target. Use race and sex as sensitive attributes for fairness exercises. Handle missing values in workclass, occupation, and native-country (marked as "?" in the raw CSVs; fetch_openml returns them as NaN).
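
The cleaning steps above fit in a few lines of pandas; the mini frame here is illustrative (values invented, column names per the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "?", "Private"],
    "fnlwgt": [77516, 83311, 215646],
    "class": [">50K", "<=50K", "<=50K"],
})
df = df.drop(columns=["fnlwgt"])                    # sampling weight, not a feature
df = df.replace("?", np.nan)                        # "?" marks missing in the raw CSVs
df["target"] = (df["class"] == ">50K").astype(int)  # binarize income
```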

UCI Wine Quality

Field Details
Description Physicochemical properties and quality scores for red and white Portuguese wines. 6,497 instances combined.
Size ~300 KB
Format CSV (semicolon-delimited)
Download from sklearn.datasets import fetch_openml; red = fetch_openml("wine-quality-red", version=1, as_frame=True).frame; white = fetch_openml("wine-quality-white", version=1, as_frame=True).frame
License CC BY 4.0
Chapters Ch. 2 (EDA), Ch. 4 (regression), Ch. 17 (Bayesian regression)
Preprocessing Combine red and white datasets with an added color column for the full analysis. Quality scores range 3-9; for binary classification exercises, threshold at quality >= 7.
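
Combining the two subsets with a color column and thresholding the target might look like this (tiny stand-in frames; the real ones come from fetch_openml):

```python
import pandas as pd

red = pd.DataFrame({"alcohol": [9.4, 10.2], "quality": [5, 7]})
white = pd.DataFrame({"alcohol": [8.8, 11.0], "quality": [6, 8]})

red["color"] = "red"
white["color"] = "white"
wine = pd.concat([red, white], ignore_index=True)
wine["good"] = (wine["quality"] >= 7).astype(int)   # binary target at quality >= 7
```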

California Housing

Field Details
Description Median house values for California districts from the 1990 census. 20,640 instances, 8 features.
Size ~1.5 MB
Format Bundled in scikit-learn
Download from sklearn.datasets import fetch_california_housing; data = fetch_california_housing(as_frame=True).frame
License Public domain
Chapters Ch. 1 (introduction), Ch. 4 (regression), Ch. 5 (feature engineering), Ch. 30 (data validation with Great Expectations)
Preprocessing The target MedHouseVal is capped at 5.0 — i.e., $500K, since it is expressed in units of $100,000 — in the original data (this is by design, not an error; leave the cap in place). Log-transform MedInc and Population for normalization exercises. Note the scikit-learn frame uses these abbreviated column names.

Kaggle Ames Housing

Field Details
Description Detailed housing data for Ames, Iowa. 2,930 instances, 80 features (23 nominal, 23 ordinal, 14 discrete, 20 continuous). The Kaggle competition version provides 1,460 training and 1,459 test rows drawn from the full De Cock dataset.
Size ~1 MB
Format CSV
Download kaggle competitions download -c house-prices-advanced-regression-techniques (requires Kaggle API key)
License Competition terms (educational use permitted)
Chapters Ch. 5 (feature engineering), Ch. 6 (advanced pipelines)
Preprocessing Significant missing data in PoolQC, MiscFeature, Alley, Fence, FireplaceQu. These are meaningful absences (no pool, no alley), not random missingness. Encode accordingly.
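
One common encoding for these meaningful absences is to make "None" an explicit category rather than imputing. A sketch with a two-row illustrative frame:

```python
import pandas as pd

# Columns where NaN means "feature absent", not "value unrecorded"
MEANINGFUL_NA = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"]

df = pd.DataFrame({
    "PoolQC": [None, "Ex"],
    "Alley": [None, None],
    "LotArea": [8450, 9600],
})
for col in df.columns.intersection(MEANINGFUL_NA):
    df[col] = df[col].fillna("None")   # encode absence as its own category
```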

Lending Club Loan Data

Field Details
Description Anonymized loan application and repayment data. Millions of records with outcomes (fully paid, charged off, current).
Size ~1.5 GB
Format CSV
Download Available via Kaggle: kaggle datasets download -d wordsforthewise/lending-club
License CC0 (public domain)
Chapters Ch. 22 (causal inference — does interest rate cause default?), Ch. 23 (instrumental variables), Ch. 25 (fairness in lending)
Preprocessing Filter to fully paid and charged off loans only (drop "current"). Create binary target default = 1 if loan_status == "Charged Off". Parse term, emp_length to numeric. Drop leakage features: total_pymnt, recoveries, collection_recovery_fee.
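
A minimal sketch of the target construction and leakage removal (three invented rows standing in for the real file):

```python
import pandas as pd

df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current"],
    "term": [" 36 months", " 60 months", " 36 months"],
    "total_pymnt": [12000.0, 3500.0, 800.0],
})
# Keep only resolved loans and define the default target
df = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
df["default"] = (df["loan_status"] == "Charged Off").astype(int)
df["term_months"] = df["term"].str.extract(r"(\d+)", expand=False).astype(int)
df = df.drop(columns=["total_pymnt"])   # leakage: observed only after the outcome
```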

IBM HR Analytics (Synthetic)

Field Details
Description Synthetic employee attrition dataset created by IBM data scientists. 1,470 instances, 35 features.
Size ~230 KB
Format CSV
Download kaggle datasets download -d pavansubhasht/ibm-hr-analytics-attrition-dataset
License CC0 (public domain)
Chapters Ch. 21 (causal forests for heterogeneous treatment effects), Ch. 24 (uplift modeling)
Preprocessing Binarize Attrition as target. EmployeeNumber is an ID column — drop it before modeling. Several features are ordinal (e.g., JobSatisfaction 1-4) — encode as numeric, not one-hot.
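
In code, that preprocessing is short — ordinal columns like JobSatisfaction are simply left numeric (illustrative three-row frame):

```python
import pandas as pd

df = pd.DataFrame({
    "EmployeeNumber": [1, 2, 3],      # pure ID: drop before modeling
    "Attrition": ["Yes", "No", "No"],
    "JobSatisfaction": [4, 1, 3],     # ordinal 1-4: keep numeric, do not one-hot
})
df["target"] = (df["Attrition"] == "Yes").astype(int)
df = df.drop(columns=["EmployeeNumber", "Attrition"])
```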

E.2 Time Series and Financial Datasets

Yahoo Finance (via yfinance)

Field Details
Description Historical stock prices, volumes, and market indices.
Size Varies (downloaded on demand)
Format pandas DataFrame via API
Download pip install yfinance then import yfinance as yf; df = yf.download("AAPL", start="2015-01-01", end="2025-01-01")
License Yahoo terms of use (educational/personal use)
Chapters Ch. 15 (time series with LSTMs), Ch. 16 (attention for sequences), Ch. 18 (Bayesian structural time series)
Preprocessing Use adjusted close prices for returns calculations (recent yfinance versions auto-adjust by default; pass auto_adjust=False if you need both raw and adjusted closes). Handle market holidays (missing dates). For multi-stock exercises, align all tickers to the same trading days.
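
Aligning tickers and computing returns reduces to an index join plus pct_change. A sketch with two hand-made price series standing in for yf.download output:

```python
import pandas as pd

idx_a = pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"])
idx_b = pd.to_datetime(["2024-01-02", "2024-01-04"])   # a missing trading day
aapl = pd.Series([185.0, 184.2, 181.9], index=idx_a, name="AAPL")
msft = pd.Series([370.9, 367.8], index=idx_b, name="MSFT")

prices = pd.concat([aapl, msft], axis=1).dropna()   # align to common trading days
returns = prices.pct_change().dropna()              # simple daily returns
```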

FRED Economic Data

Field Details
Description Federal Reserve Economic Data: GDP, unemployment, CPI, interest rates, and thousands of other macroeconomic time series.
Size Individual series are small (KB); bulk downloads can be large
Format CSV or API
Download pip install fredapi then from fredapi import Fred; fred = Fred(api_key="YOUR_KEY"); df = fred.get_series("GDP")
License Public domain (U.S. government data)
Chapters Ch. 18 (Bayesian structural time series), Ch. 22 (causal inference with time series — difference-in-differences)
Preprocessing Series have different frequencies (daily, monthly, quarterly). Align frequencies before merging. Handle revisions — FRED provides "vintage" data where historical values are revised.
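
Frequency alignment is a resample-then-merge pattern. A sketch with two invented series standing in for FRED output:

```python
import pandas as pd

# Stand-ins for two FRED series at different frequencies
daily = pd.Series(range(90), index=pd.date_range("2024-01-01", periods=90, freq="D"))
quarterly = pd.Series([100.0], index=pd.to_datetime(["2024-01-01"]), name="gdp")

# Downsample the daily series to quarter-start frequency before merging
daily_q = daily.resample("QS").mean().rename("daily_avg")
merged = pd.concat([daily_q, quarterly], axis=1)
```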

ERA5 Climate Reanalysis

Field Details
Description Hourly global climate data from ECMWF: temperature, precipitation, wind, humidity, pressure. 0.25° grid resolution from 1940 to present.
Size Extremely large (full dataset is petabytes); subset to specific variables, regions, and time ranges
Format NetCDF (.nc) or GRIB
Download Via the Copernicus Climate Data Store: pip install cdsapi and configure an API key in ~/.cdsapirc, or use the CDS web interface to select variables and submit download requests
License Copernicus License (free for all uses including commercial, with attribution)
Chapters Ch. 15 (spatiotemporal modeling), Ch. 31 (distributed training on large data), Ch. 33 (production pipeline case study)
Preprocessing Use xarray to read NetCDF files. Subset to region of interest before loading into memory. Convert temperature from Kelvin to Celsius. Aggregate hourly to daily for most exercises.
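
With xarray the unit conversion and aggregation would look like `ds["t2m"] - 273.15` and `ds.resample(time="1D").mean()`; the same transforms are sketched below in pandas on an invented hourly series, so the logic is clear without a NetCDF file:

```python
import pandas as pd

hours = pd.date_range("2024-07-01", periods=48, freq="h")
t2m_kelvin = pd.Series(290.0, index=hours)      # stand-in for an extracted t2m series
t2m_celsius = t2m_kelvin - 273.15               # ERA5 temperatures are in Kelvin
daily_mean = t2m_celsius.resample("D").mean()   # hourly -> daily aggregation
```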

CMIP6 Climate Projections

Field Details
Description Climate model output from the Coupled Model Intercomparison Project Phase 6. Multiple models, multiple scenarios (SSP1-2.6, SSP2-4.5, SSP3-7.0, SSP5-8.5).
Size Subsets: 1-50 GB per variable/model/scenario
Format NetCDF
Download Via ESGF nodes: https://esgf-node.llnl.gov/search/cmip6/
License Varies by modeling center; most are CC BY 4.0
Chapters Ch. 15 (climate time series), Ch. 35 (capstone project option)
Preprocessing Regrid to common resolution using xesmf. Select a single model (e.g., CESM2) for exercises to avoid multi-model complexity. Compute anomalies relative to 1981-2010 baseline.
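
The anomaly computation is just subtraction of a baseline-window mean. A pandas sketch on an invented annual-mean temperature series (with xarray you would select the baseline years with `.sel(time=...)` instead):

```python
import pandas as pd

years = pd.Index(range(1981, 2021), name="year")
gmst = pd.Series([14.0 + 0.02 * (y - 1981) for y in years], index=years)

baseline = gmst.loc[1981:2010].mean()   # 1981-2010 climatological baseline
anomaly = gmst - baseline
```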

E.3 Natural Language Processing Datasets

IMDb Movie Reviews

Field Details
Description 50,000 movie reviews labeled as positive or negative sentiment. Balanced dataset (25K/25K).
Size ~65 MB
Format HuggingFace Datasets
Download from datasets import load_dataset; ds = load_dataset("imdb")
License Apache 2.0
Chapters Ch. 13 (text classification with CNNs), Ch. 14 (transformer fine-tuning), Ch. 27 (differential privacy for NLP)
Preprocessing Already split into train/test. For validation, split 10% from training set with stratification. Truncate to 256 tokens for efficiency in most exercises; use 512 for the fine-tuning chapter.
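
With HuggingFace Datasets the stratified split is `ds["train"].train_test_split(test_size=0.1, stratify_by_column="label")`; the same idea in scikit-learn, on a toy corpus standing in for the training set:

```python
from sklearn.model_selection import train_test_split

texts = [f"review {i}" for i in range(12)]
labels = [1] * 6 + [0] * 6

# Stratification keeps the positive/negative ratio in the validation split
train_x, val_x, train_y, val_y = train_test_split(
    texts, labels, test_size=0.1, stratify=labels, random_state=0
)
```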

SQuAD 2.0

Field Details
Description Stanford Question Answering Dataset with 150K questions, including 50K unanswerable questions.
Size ~45 MB
Format JSON / HuggingFace Datasets
Download from datasets import load_dataset; ds = load_dataset("rajpurkar/squad_v2")
License CC BY-SA 4.0
Chapters Ch. 14 (question answering fine-tuning)
Preprocessing Use the HuggingFace tokenizer's return_offsets_mapping=True to align token positions with character-level answer spans. Unanswerable questions have empty answers — handle these explicitly.
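
The span-alignment step reduces to scanning the offset mapping for the tokens that cover the answer's character span. A minimal sketch with a toy (invented) offset mapping:

```python
# Toy offset mapping as returned by a fast tokenizer with
# return_offsets_mapping=True: one (char_start, char_end) pair per token
offsets = [(0, 3), (4, 9), (10, 15)]
answer_start, answer_end = 4, 9        # character span of the gold answer

token_start = next(i for i, (s, e) in enumerate(offsets) if s <= answer_start < e)
token_end = next(i for i, (s, e) in enumerate(offsets) if s < answer_end <= e)
```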

AG News

Field Details
Description News articles classified into 4 categories: World, Sports, Business, Science/Technology. 120K training, 7,600 test.
Size ~30 MB
Format HuggingFace Datasets
Download from datasets import load_dataset; ds = load_dataset("ag_news")
License Academic use
Chapters Ch. 13 (text classification baselines), Ch. 16 (attention visualization)
Preprocessing The HuggingFace version exposes a single text field with title and description already concatenated; if you load the original CSVs instead, combine the title and description fields yourself (e.g., with a [SEP] token for transformer models). Labels are 0-indexed (0=World, 1=Sports, 2=Business, 3=Sci/Tech).

Amazon Reviews (Multilingual)

Field Details
Description Product reviews in multiple languages with star ratings. Millions of reviews.
Size 1-20 GB depending on language/category subset
Format HuggingFace Datasets
Download from datasets import load_dataset; ds = load_dataset("amazon_reviews_multi", "en")
License Amazon terms (research use)
Chapters Ch. 14 (transfer learning), Ch. 32 (large-scale NLP pipeline)
Preprocessing Filter to English subset for most exercises. Create binary sentiment: 4-5 stars = positive, 1-2 stars = negative, drop 3-star reviews. Subsample to 100K reviews for local training.
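
The star-to-sentiment mapping in code (illustrative frame; the real records carry review text as well):

```python
import pandas as pd

df = pd.DataFrame({"stars": [1, 2, 3, 4, 5]})
df = df[df["stars"] != 3].copy()                  # drop ambiguous 3-star reviews
df["sentiment"] = (df["stars"] >= 4).astype(int)  # 4-5 positive, 1-2 negative
```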

E.4 Health and Biomedical Datasets

MIMIC-IV

Field Details
Description De-identified electronic health records from Beth Israel Deaconess Medical Center. Includes diagnoses, procedures, lab results, medications, and clinical notes for roughly 300K hospital admissions, including ICU stays.
Size ~7 GB (compressed)
Format CSV (relational tables)
Download Requires credentialed access: complete CITI training, sign data use agreement at https://physionet.org/content/mimiciv/
License PhysioNet Credentialed Health Data License 1.5.0 (no redistribution, requires IRB-equivalent training)
Chapters Ch. 20 (survival analysis), Ch. 22 (causal inference in healthcare), Ch. 25 (fairness in clinical models), Ch. 35 (capstone project option)
Preprocessing Join admissions, patients, diagnoses_icd, and labevents tables on subject_id and hadm_id. ICD codes are in both ICD-9 and ICD-10 — use the icd_version column to distinguish. For mortality prediction, define the target as in-hospital death (discharge_location == "DIED" or deathtime IS NOT NULL). Remove neonates (anchor_age == 0) for adult-only analyses.
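
A sketch of the join and mortality-target logic on invented two-row stand-ins for the admissions and patients tables (column names per MIMIC-IV; values fabricated — real dates in MIMIC-IV are deliberately shifted into the future):

```python
import pandas as pd

admissions = pd.DataFrame({
    "subject_id": [1, 2],
    "hadm_id": [100, 200],
    "discharge_location": ["HOME", "DIED"],
    "deathtime": [pd.NaT, pd.Timestamp("2150-03-01")],
})
patients = pd.DataFrame({"subject_id": [1, 2], "anchor_age": [45, 0]})

df = admissions.merge(patients, on="subject_id", how="inner")
df = df[df["anchor_age"] > 0].copy()    # drop neonates for adult-only analyses
df["mortality"] = ((df["discharge_location"] == "DIED")
                   | df["deathtime"].notna()).astype(int)
```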

Access note: MIMIC-IV requires approximately 1-2 weeks for access approval. Begin the credentialing process before starting Chapter 20. If access is not available, the MIMIC-IV demo dataset (100 patients) is available without credentialing for code testing, and all exercises include synthetic data alternatives.

Framingham Heart Study (Teaching Dataset)

Field Details
Description Cardiovascular risk factors and outcomes from the Framingham Heart Study. Teaching version with ~4,000 participants.
Size ~500 KB
Format CSV
Download Available via Kaggle: kaggle datasets download -d aasheesh200/framingham-heart-study-dataset
License Public (teaching version only)
Chapters Ch. 17 (Bayesian logistic regression), Ch. 20 (survival analysis)
Preprocessing Handle missing values in education, cigsPerDay, BPMeds, totChol, BMI, heartRate, glucose. Do not impute randomly — missingness patterns are informative (patients with missing glucose may not have been tested).
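
One way to keep the informative missingness is to record an indicator column before imputing, as sketched here on an invented three-row frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"glucose": [85.0, np.nan, 90.0], "age": [50, 61, 46]})
# Record missingness before imputing: untested patients may differ systematically
df["glucose_missing"] = df["glucose"].isna().astype(int)
df["glucose"] = df["glucose"].fillna(df["glucose"].median())
```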

E.5 Graph and Network Datasets

Cora Citation Network

Field Details
Description Citation network of 2,708 machine learning papers classified into 7 categories. Each paper has a binary bag-of-words feature vector (1,433 dimensions).
Size ~5 MB
Format PyTorch Geometric
Download from torch_geometric.datasets import Planetoid; dataset = Planetoid(root="/tmp/cora", name="Cora"); data = dataset[0]
License Public domain
Chapters Ch. 12 (graph neural networks)
Preprocessing Standard train/val/test split is provided via boolean masks. For transductive learning, all node features are available during training; only labels are masked.

OGB (Open Graph Benchmark)

Field Details
Description Standardized graph datasets for node classification, link prediction, and graph classification across molecular, social, and citation domains.
Size 10 MB to 10 GB depending on dataset
Format OGB Python package
Download pip install ogb then from ogb.nodeproppred import PygNodePropPredDataset; dataset = PygNodePropPredDataset(name="ogbn-arxiv")
License MIT (OGB code), varies by underlying dataset
Chapters Ch. 12 (graph neural networks — ogbn-arxiv), Ch. 31 (distributed GNN training — ogbn-products)
Preprocessing OGB provides standardized splits and evaluators. Always use dataset.get_idx_split() for reproducible train/val/test splits. Do not create your own splits.

E.6 Image Datasets

CIFAR-10 / CIFAR-100

Field Details
Description 60,000 32x32 color images in 10 classes (CIFAR-10) or 100 classes (CIFAR-100).
Size ~170 MB each (CIFAR-10 and CIFAR-100)
Format PyTorch built-in
Download torchvision.datasets.CIFAR10(root="./data", download=True)
License MIT
Chapters Ch. 8 (CNNs), Ch. 10 (regularization), Ch. 11 (transfer learning)
Preprocessing Normalize with mean=(0.4914, 0.4822, 0.4465), std=(0.2470, 0.2435, 0.2616) for CIFAR-10. Apply standard augmentation: random crop with padding=4, random horizontal flip.
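
In torchvision this is transforms.Normalize(mean, std) applied after ToTensor(); the arithmetic it performs is just per-channel standardization, shown here in NumPy on a flat gray stand-in image:

```python
import numpy as np

mean = np.array([0.4914, 0.4822, 0.4465])
std = np.array([0.2470, 0.2435, 0.2616])

img = np.full((32, 32, 3), 0.5)     # a flat gray image in [0, 1], HWC layout
normalized = (img - mean) / std     # broadcasts per channel
```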

ImageNet (ILSVRC 2012 subset)

Field Details
Description 1.2 million training images in 1,000 classes. The standard benchmark for image classification.
Size ~150 GB
Format JPEG images in class-labeled directories
Download Registration required at https://image-net.org/. Use torchvision.datasets.ImageNet after downloading.
License Non-commercial research only
Chapters Ch. 11 (pretrained backbones), Ch. 31 (distributed training)
Preprocessing Standard: resize to 256, center crop to 224, normalize with ImageNet mean/std. For data-efficient exercises, use the ImageNet-1k subset available via HuggingFace (gated — accept the terms on the Hub and authenticate first): load_dataset("imagenet-1k", split="train[:10%]").

E.7 Synthetic Datasets

Several exercises use synthetic data to isolate specific statistical concepts. These are generated programmatically and do not require downloads.

Synthetic Causal Data

import numpy as np
import pandas as pd

def generate_causal_dataset(n=5000, seed=42):
    """Generate data with known causal structure for validation.

    DAG: Z -> X -> Y, Z -> Y (confounder)
         U -> X (instrument)
    True ATE of X on Y: 2.0
    """
    rng = np.random.default_rng(seed)

    Z = rng.normal(0, 1, n)                          # Confounder
    U = rng.normal(0, 1, n)                          # Instrument
    X = 0.5 * Z + 0.8 * U + rng.normal(0, 0.5, n)  # Treatment
    Y = 2.0 * X + 1.5 * Z + rng.normal(0, 1, n)    # Outcome (true ATE = 2.0)

    return pd.DataFrame({"Z": Z, "U": U, "X": X, "Y": Y})

Chapters: Ch. 21 (causal identification), Ch. 22 (backdoor adjustment), Ch. 23 (instrumental variables)
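
As a sanity check on the generator above, an ordinary least-squares regression of Y on X alone is confounded, while adjusting for Z recovers the true ATE of 2.0. A sketch that inlines the same data-generating process (large n so sampling noise is negligible):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
Z = rng.normal(0, 1, n)
U = rng.normal(0, 1, n)
X = 0.5 * Z + 0.8 * U + rng.normal(0, 0.5, n)
Y = 2.0 * X + 1.5 * Z + rng.normal(0, 1, n)

# Naive regression of Y on X is confounded by Z and overestimates the effect
naive = np.linalg.lstsq(np.column_stack([np.ones(n), X]), Y, rcond=None)[0][1]
# Adjusting for the confounder (backdoor criterion) recovers the true ATE
adjusted = np.linalg.lstsq(np.column_stack([np.ones(n), X, Z]), Y, rcond=None)[0][1]
```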

Synthetic Fairness Data

def generate_fairness_dataset(n=10000, seed=42):
    """Generate hiring data with known demographic disparity.

    True qualification rate is equal across groups.
    Selection rate differs due to biased threshold.
    """
    rng = np.random.default_rng(seed)

    group = rng.choice(["A", "B"], size=n, p=[0.6, 0.4])
    qualification = rng.normal(50, 10, n)
    noise = rng.normal(0, 5, n)

    # Biased scoring: Group B gets a penalty
    score = qualification + noise - 5.0 * (group == "B").astype(float)
    hired = (score > 50).astype(int)

    return pd.DataFrame({
        "group": group,
        "qualification": qualification,
        "score": score,
        "hired": hired,
    })

Chapters: Ch. 25 (fairness metrics), Ch. 26 (mitigation)
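
The built-in disparity is easy to surface with a selection-rate comparison (demographic parity difference). A sketch that inlines the same biased scoring process:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
group = rng.choice(["A", "B"], size=n, p=[0.6, 0.4])
score = rng.normal(50, 10, n) + rng.normal(0, 5, n) - 5.0 * (group == "B")
hired = (score > 50).astype(int)
df = pd.DataFrame({"group": group, "hired": hired})

# Demographic parity difference: gap in selection rates between groups
rates = df.groupby("group")["hired"].mean()
dp_diff = rates["A"] - rates["B"]
```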

Synthetic Time Series

def generate_arima_with_intervention(n=500, intervention_point=300, seed=42):
    """Generate a weekly series (linear trend + annual seasonality + white
    noise) with a known structural break, for difference-in-differences and
    interrupted time series exercises.
    """
    rng = np.random.default_rng(seed)

    trend = np.linspace(0, 10, n)
    seasonality = 3 * np.sin(2 * np.pi * np.arange(n) / 52)
    noise = rng.normal(0, 1, n)

    treatment_effect = np.zeros(n)
    treatment_effect[intervention_point:] = 5.0  # true effect = 5.0

    y = trend + seasonality + noise + treatment_effect

    return pd.DataFrame({
        "t": np.arange(n),
        "y": y,
        "post_intervention": (np.arange(n) >= intervention_point).astype(int),
    })

Chapters: Ch. 18 (Bayesian structural time series), Ch. 22 (difference-in-differences)
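
A quick interrupted-time-series check on this generator: regressing y on an intercept, the trend, and a post-intervention indicator recovers an effect near the true 5.0 (seasonality is left in the residuals here, which is fine for a rough check). The sketch inlines the same data-generating process:

```python
import numpy as np

rng = np.random.default_rng(42)
n, cut = 500, 300
t = np.arange(n)
y = (np.linspace(0, 10, n)
     + 3 * np.sin(2 * np.pi * t / 52)
     + rng.normal(0, 1, n)
     + 5.0 * (t >= cut))

# Interrupted time series: intercept + linear trend + post-intervention step
X = np.column_stack([np.ones(n), t, (t >= cut).astype(float)])
effect = np.linalg.lstsq(X, y, rcond=None)[0][2]
```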


E.8 Download and Storage Recommendations

Directory Structure

Organize datasets under a single root to keep paths consistent across chapters:

~/ads-data/
├── movielens/
│   └── ml-25m/
├── lending-club/
├── mimic-iv/          # credentialed access
├── era5/
│   ├── temperature/
│   └── precipitation/
├── imagenet/          # if using full dataset
├── huggingface/       # HuggingFace datasets cache
└── synthetic/         # generated by code in exercises

Set the environment variable in your shell profile:

export ADS_DATA_DIR="$HOME/ads-data"
export HF_HOME="$HOME/ads-data/huggingface"

Storage Estimates

Category Approximate Size
All tabular datasets ~3 GB
NLP datasets (HuggingFace) ~2 GB
Climate data (ERA5 subset) ~10-50 GB
Image datasets (CIFAR only) ~500 MB
Image datasets (with ImageNet) ~160 GB
MIMIC-IV ~7 GB
Total (without ImageNet) ~20-65 GB
Total (with ImageNet) ~180-225 GB

Download Script

"""download_datasets.py — Download all freely available datasets."""

import os
from pathlib import Path

DATA_DIR = Path(os.environ.get("ADS_DATA_DIR", Path.home() / "ads-data"))
DATA_DIR.mkdir(parents=True, exist_ok=True)

def download_sklearn_datasets():
    from sklearn.datasets import (
        fetch_california_housing,
        fetch_openml,
    )
    print("Downloading scikit-learn datasets...")
    fetch_california_housing(data_home=DATA_DIR / "sklearn")
    fetch_openml("adult", version=2, data_home=DATA_DIR / "sklearn")
    fetch_openml("wine-quality-red", version=1, data_home=DATA_DIR / "sklearn")
    fetch_openml("wine-quality-white", version=1, data_home=DATA_DIR / "sklearn")
    print("  Done.")

def download_huggingface_datasets():
    from datasets import load_dataset
    print("Downloading HuggingFace datasets...")
    load_dataset("imdb", cache_dir=DATA_DIR / "huggingface")
    load_dataset("rajpurkar/squad_v2", cache_dir=DATA_DIR / "huggingface")
    load_dataset("ag_news", cache_dir=DATA_DIR / "huggingface")
    print("  Done.")

def download_torchvision_datasets():
    import torchvision
    print("Downloading torchvision datasets...")
    torchvision.datasets.CIFAR10(root=DATA_DIR / "cifar", download=True)
    torchvision.datasets.CIFAR100(root=DATA_DIR / "cifar", download=True)
    print("  Done.")

def download_movielens():
    import urllib.request
    import zipfile
    print("Downloading MovieLens 25M...")
    url = "https://files.grouplens.org/datasets/movielens/ml-25m.zip"
    dest = DATA_DIR / "movielens" / "ml-25m.zip"
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    extracted = DATA_DIR / "movielens" / "ml-25m"
    if not extracted.exists():  # re-extract even if a prior run left only the zip
        with zipfile.ZipFile(dest, "r") as z:
            z.extractall(DATA_DIR / "movielens")
    print("  Done.")

if __name__ == "__main__":
    download_sklearn_datasets()
    download_huggingface_datasets()
    download_torchvision_datasets()
    download_movielens()
    print(f"\nAll datasets saved to {DATA_DIR}")
    print("Note: MIMIC-IV and ERA5 require separate credentialed access.")
    print("Note: Kaggle datasets require `kaggle` CLI and API key.")

Cross-references: Appendix C documents the libraries used to load and process these datasets. Appendix D covers the environment setup required to run the download script. Individual chapters provide detailed loading and preprocessing code for each dataset they use.