Appendix E: Dataset Catalog

This appendix catalogs every dataset used in exercises, case studies, and the progressive project throughout this textbook. For each dataset, you will find what it contains, how large it is, where to get it, what license governs its use, which chapters reference it, and any preprocessing notes specific to our exercises.


E.1 Tabular / Structured Datasets

MovieLens 25M

Field Details
Description 25 million movie ratings from 162,000 users across 62,000 movies. Includes ratings, tags, and genome scores.
Size ~250 MB (compressed), ~1.2 GB (extracted)
Format CSV files (ratings.csv, movies.csv, tags.csv, genome-scores.csv, genome-tags.csv)
Download wget https://files.grouplens.org/datasets/movielens/ml-25m.zip
License Research use only (GroupLens terms of use)
Chapters Ch. 7 (collaborative filtering), Ch. 9 (matrix factorization), Ch. 14 (embedding layers), Ch. 32 (large-scale recommendations)
Preprocessing Filter to users with >= 20 ratings for cold-start experiments. Convert timestamps to datetime. For causal exercises in Ch. 23, use the tag genome scores to construct synthetic treatment variables.
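
The minimum-ratings filter can be written as a groupby transform. A minimal sketch — the mini frame below is an illustrative stand-in for ratings.csv (same column names):

```python
import pandas as pd

def filter_min_ratings(ratings: pd.DataFrame, min_count: int = 20) -> pd.DataFrame:
    """Keep only rows from users with at least min_count ratings."""
    counts = ratings.groupby("userId")["movieId"].transform("size")
    return ratings[counts >= min_count].copy()

# Tiny stand-in for ratings.csv (userId, movieId, rating, timestamp)
ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2],
    "movieId": [10, 11, 12, 10],
    "rating": [4.0, 3.5, 5.0, 2.0],
    "timestamp": [1147880044] * 4,
})
filtered = filter_min_ratings(ratings, min_count=3)
# Convert Unix timestamps to datetime, as noted above
filtered["datetime"] = pd.to_datetime(filtered["timestamp"], unit="s")
```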

UCI Adult (Census Income)

Field Details
Description Predict whether income exceeds $50K/year based on census data. 48,842 instances, 14 attributes.
Size ~4 MB
Format CSV
Download from sklearn.datasets import fetch_openml; df = fetch_openml("adult", version=2, as_frame=True).frame
License Public domain (CC0)
Chapters Ch. 3 (classification baselines), Ch. 25 (fairness metrics), Ch. 26 (bias mitigation), Ch. 27 (differential privacy)
Preprocessing Drop the fnlwgt column (a survey sampling weight, not a predictive feature). Binarize income as target. Use race and sex as sensitive attributes for fairness exercises. Handle missing values in workclass, occupation, and native-country (marked as "?" in the raw CSVs; fetch_openml returns them as NaN).
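
The cleaning steps above fit in a few lines of pandas; the mini frame here is illustrative (values invented, column names per the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "?", "Private"],
    "fnlwgt": [77516, 83311, 215646],
    "class": [">50K", "<=50K", "<=50K"],
})
df = df.drop(columns=["fnlwgt"])                    # sampling weight, not a feature
df = df.replace("?", np.nan)                        # "?" marks missing in the raw CSVs
df["target"] = (df["class"] == ">50K").astype(int)  # binarize income
```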

UCI Wine Quality

Field Details
Description Physicochemical properties and quality scores for red and white Portuguese wines. 6,497 instances combined.
Size ~300 KB
Format CSV (semicolon-delimited)
Download from sklearn.datasets import fetch_openml; red = fetch_openml("wine-quality-red", version=1, as_frame=True).frame; white = fetch_openml("wine-quality-white", version=1, as_frame=True).frame
License CC BY 4.0
Chapters Ch. 2 (EDA), Ch. 4 (regression), Ch. 17 (Bayesian regression)
Preprocessing Combine red and white datasets with an added color column for the full analysis. Quality scores range 3-9; for binary classification exercises, threshold at quality >= 7.
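
Combining the two subsets with a color column and thresholding the target might look like this (tiny stand-in frames; the real ones come from fetch_openml):

```python
import pandas as pd

red = pd.DataFrame({"alcohol": [9.4, 10.2], "quality": [5, 7]})
white = pd.DataFrame({"alcohol": [8.8, 11.0], "quality": [6, 8]})

red["color"] = "red"
white["color"] = "white"
wine = pd.concat([red, white], ignore_index=True)
wine["good"] = (wine["quality"] >= 7).astype(int)   # binary target at quality >= 7
```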

California Housing

Field Details
Description Median house values for California districts from the 1990 census. 20,640 instances, 8 features.
Size ~1.5 MB
Format Bundled in scikit-learn
Download from sklearn.datasets import fetch_california_housing; data = fetch_california_housing(as_frame=True).frame
License Public domain
Chapters Ch. 1 (introduction), Ch. 4 (regression), Ch. 5 (feature engineering), Ch. 30 (data validation with Great Expectations)
Preprocessing The target MedHouseVal is capped at 5.0 — i.e., $500K, since it is expressed in units of $100,000 — in the original data (this is by design, not an error; leave the cap in place). Log-transform MedInc and Population for normalization exercises. Note the scikit-learn frame uses these abbreviated column names.

Kaggle Ames Housing

Field Details
Description Detailed housing data for Ames, Iowa. 2,930 instances, 80 features (23 nominal, 23 ordinal, 14 discrete, 20 continuous). The Kaggle competition version provides 1,460 training and 1,459 test rows drawn from the full De Cock dataset.
Size ~1 MB
Format CSV
Download kaggle competitions download -c house-prices-advanced-regression-techniques (requires Kaggle API key)
License Competition terms (educational use permitted)
Chapters Ch. 5 (feature engineering), Ch. 6 (advanced pipelines)
Preprocessing Significant missing data in PoolQC, MiscFeature, Alley, Fence, FireplaceQu. These are meaningful absences (no pool, no alley), not random missingness. Encode accordingly.
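
One common encoding for these meaningful absences is to make "None" an explicit category rather than imputing. A sketch with a two-row illustrative frame:

```python
import pandas as pd

# Columns where NaN means "feature absent", not "value unrecorded"
MEANINGFUL_NA = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"]

df = pd.DataFrame({
    "PoolQC": [None, "Ex"],
    "Alley": [None, None],
    "LotArea": [8450, 9600],
})
for col in df.columns.intersection(MEANINGFUL_NA):
    df[col] = df[col].fillna("None")   # encode absence as its own category
```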

Lending Club Loan Data

Field Details
Description Anonymized loan application and repayment data. Millions of records with outcomes (fully paid, charged off, current).
Size ~1.5 GB
Format CSV
Download Available via Kaggle: kaggle datasets download -d wordsforthewise/lending-club
License CC0 (public domain)
Chapters Ch. 22 (causal inference — does interest rate cause default?), Ch. 23 (instrumental variables), Ch. 25 (fairness in lending)
Preprocessing Filter to fully paid and charged off loans only (drop "current"). Create binary target default = 1 if loan_status == "Charged Off". Parse term, emp_length to numeric. Drop leakage features: total_pymnt, recoveries, collection_recovery_fee.
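
A minimal sketch of the target construction and leakage removal (three invented rows standing in for the real file):

```python
import pandas as pd

df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current"],
    "term": [" 36 months", " 60 months", " 36 months"],
    "total_pymnt": [12000.0, 3500.0, 800.0],
})
# Keep only resolved loans and define the default target
df = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
df["default"] = (df["loan_status"] == "Charged Off").astype(int)
df["term_months"] = df["term"].str.extract(r"(\d+)", expand=False).astype(int)
df = df.drop(columns=["total_pymnt"])   # leakage: observed only after the outcome
```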

IBM HR Analytics (Synthetic)

Field Details
Description Synthetic employee attrition dataset created by IBM data scientists. 1,470 instances, 35 features.
Size ~230 KB
Format CSV
Download kaggle datasets download -d pavansubhasht/ibm-hr-analytics-attrition-dataset
License CC0 (public domain)
Chapters Ch. 21 (causal forests for heterogeneous treatment effects), Ch. 24 (uplift modeling)
Preprocessing Binarize Attrition as target. EmployeeNumber is an ID column — drop it before modeling. Several features are ordinal (e.g., JobSatisfaction 1-4) — encode as numeric, not one-hot.
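
In code, that preprocessing is short — ordinal columns like JobSatisfaction are simply left numeric (illustrative three-row frame):

```python
import pandas as pd

df = pd.DataFrame({
    "EmployeeNumber": [1, 2, 3],      # pure ID: drop before modeling
    "Attrition": ["Yes", "No", "No"],
    "JobSatisfaction": [4, 1, 3],     # ordinal 1-4: keep numeric, do not one-hot
})
df["target"] = (df["Attrition"] == "Yes").astype(int)
df = df.drop(columns=["EmployeeNumber", "Attrition"])
```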

E.2 Time Series and Financial Datasets

Yahoo Finance (via yfinance)

Field Details
Description Historical stock prices, volumes, and market indices.
Size Varies (downloaded on demand)
Format pandas DataFrame via API
Download pip install yfinance then import yfinance as yf; df = yf.download("AAPL", start="2015-01-01", end="2025-01-01")
License Yahoo terms of use (educational/personal use)
Chapters Ch. 15 (time series with LSTMs), Ch. 16 (attention for sequences), Ch. 18 (Bayesian structural time series)
Preprocessing Use adjusted close prices for returns calculations (recent yfinance versions auto-adjust by default; pass auto_adjust=False if you need both raw and adjusted closes). Handle market holidays (missing dates). For multi-stock exercises, align all tickers to the same trading days.
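
Aligning tickers and computing returns reduces to an index join plus pct_change. A sketch with two hand-made price series standing in for yf.download output:

```python
import pandas as pd

idx_a = pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"])
idx_b = pd.to_datetime(["2024-01-02", "2024-01-04"])   # a missing trading day
aapl = pd.Series([185.0, 184.2, 181.9], index=idx_a, name="AAPL")
msft = pd.Series([370.9, 367.8], index=idx_b, name="MSFT")

prices = pd.concat([aapl, msft], axis=1).dropna()   # align to common trading days
returns = prices.pct_change().dropna()              # simple daily returns
```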

FRED Economic Data

Field Details
Description Federal Reserve Economic Data: GDP, unemployment, CPI, interest rates, and thousands of other macroeconomic time series.
Size Individual series are small (KB); bulk downloads can be large
Format CSV or API
Download pip install fredapi then from fredapi import Fred; fred = Fred(api_key="YOUR_KEY"); df = fred.get_series("GDP")
License Public domain (U.S. government data)
Chapters Ch. 18 (Bayesian structural time series), Ch. 22 (causal inference with time series — difference-in-differences)
Preprocessing Series have different frequencies (daily, monthly, quarterly). Align frequencies before merging. Handle revisions — FRED provides "vintage" data where historical values are revised.
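
Frequency alignment is a resample-then-merge pattern. A sketch with two invented series standing in for FRED output:

```python
import pandas as pd

# Stand-ins for two FRED series at different frequencies
daily = pd.Series(range(90), index=pd.date_range("2024-01-01", periods=90, freq="D"))
quarterly = pd.Series([100.0], index=pd.to_datetime(["2024-01-01"]), name="gdp")

# Downsample the daily series to quarter-start frequency before merging
daily_q = daily.resample("QS").mean().rename("daily_avg")
merged = pd.concat([daily_q, quarterly], axis=1)
```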

ERA5 Climate Reanalysis

Field Details
Description Hourly global climate data from ECMWF: temperature, precipitation, wind, humidity, pressure. 0.25° grid resolution from 1940 to present.
Size Extremely large (full dataset is petabytes); subset to specific variables, regions, and time ranges
Format NetCDF (.nc) or GRIB
Download Via the Copernicus Climate Data Store: pip install cdsapi and configure an API key in ~/.cdsapirc, or use the CDS web interface to select variables and submit download requests
License Copernicus License (free for all uses including commercial, with attribution)
Chapters Ch. 15 (spatiotemporal modeling), Ch. 31 (distributed training on large data), Ch. 33 (production pipeline case study)
Preprocessing Use xarray to read NetCDF files. Subset to region of interest before loading into memory. Convert temperature from Kelvin to Celsius. Aggregate hourly to daily for most exercises.
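
With xarray the unit conversion and aggregation would look like `ds["t2m"] - 273.15` and `ds.resample(time="1D").mean()`; the same transforms are sketched below in pandas on an invented hourly series, so the logic is clear without a NetCDF file:

```python
import pandas as pd

hours = pd.date_range("2024-07-01", periods=48, freq="h")
t2m_kelvin = pd.Series(290.0, index=hours)      # stand-in for an extracted t2m series
t2m_celsius = t2m_kelvin - 273.15               # ERA5 temperatures are in Kelvin
daily_mean = t2m_celsius.resample("D").mean()   # hourly -> daily aggregation
```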

CMIP6 Climate Projections

Field Details
Description Climate model output from the Coupled Model Intercomparison Project Phase 6. Multiple models, multiple scenarios (SSP1-2.6, SSP2-4.5, SSP3-7.0, SSP5-8.5).
Size Subsets: 1-50 GB per variable/model/scenario
Format NetCDF
Download Via ESGF nodes: https://esgf-node.llnl.gov/search/cmip6/
License Varies by modeling center; most are CC BY 4.0
Chapters Ch. 15 (climate time series), Ch. 35 (capstone project option)
Preprocessing Regrid to common resolution using xesmf. Select a single model (e.g., CESM2) for exercises to avoid multi-model complexity. Compute anomalies relative to 1981-2010 baseline.
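
The anomaly computation is just subtraction of a baseline-window mean. A pandas sketch on an invented annual-mean temperature series (with xarray you would select the baseline years with `.sel(time=...)` instead):

```python
import pandas as pd

years = pd.Index(range(1981, 2021), name="year")
gmst = pd.Series([14.0 + 0.02 * (y - 1981) for y in years], index=years)

baseline = gmst.loc[1981:2010].mean()   # 1981-2010 climatological baseline
anomaly = gmst - baseline
```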

E.3 Natural Language Processing Datasets

IMDb Movie Reviews

Field Details
Description 50,000 movie reviews labeled as positive or negative sentiment. Balanced dataset (25K/25K).
Size ~65 MB
Format HuggingFace Datasets
Download from datasets import load_dataset; ds = load_dataset("imdb")
License Apache 2.0
Chapters Ch. 13 (text classification with CNNs), Ch. 14 (transformer fine-tuning), Ch. 27 (differential privacy for NLP)
Preprocessing Already split into train/test. For validation, split 10% from training set with stratification. Truncate to 256 tokens for efficiency in most exercises; use 512 for the fine-tuning chapter.
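
With HuggingFace Datasets the stratified split is `ds["train"].train_test_split(test_size=0.1, stratify_by_column="label")`; the same idea in scikit-learn, on a toy corpus standing in for the training set:

```python
from sklearn.model_selection import train_test_split

texts = [f"review {i}" for i in range(12)]
labels = [1] * 6 + [0] * 6

# Stratification keeps the positive/negative ratio in the validation split
train_x, val_x, train_y, val_y = train_test_split(
    texts, labels, test_size=0.1, stratify=labels, random_state=0
)
```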

SQuAD 2.0

Field Details
Description Stanford Question Answering Dataset with 150K questions, including 50K unanswerable questions.
Size ~45 MB
Format JSON / HuggingFace Datasets
Download from datasets import load_dataset; ds = load_dataset("rajpurkar/squad_v2")
License CC BY-SA 4.0
Chapters Ch. 14 (question answering fine-tuning)
Preprocessing Use the HuggingFace tokenizer's return_offsets_mapping=True to align token positions with character-level answer spans. Unanswerable questions have empty answers — handle these explicitly.
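
The span-alignment step reduces to scanning the offset mapping for the tokens that cover the answer's character span. A minimal sketch with a toy (invented) offset mapping:

```python
# Toy offset mapping as returned by a fast tokenizer with
# return_offsets_mapping=True: one (char_start, char_end) pair per token
offsets = [(0, 3), (4, 9), (10, 15)]
answer_start, answer_end = 4, 9        # character span of the gold answer

token_start = next(i for i, (s, e) in enumerate(offsets) if s <= answer_start < e)
token_end = next(i for i, (s, e) in enumerate(offsets) if s < answer_end <= e)
```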

AG News

Field Details
Description News articles classified into 4 categories: World, Sports, Business, Science/Technology. 120K training, 7,600 test.
Size ~30 MB
Format HuggingFace Datasets
Download from datasets import load_dataset; ds = load_dataset("ag_news")
License Academic use
Chapters Ch. 13 (text classification baselines), Ch. 16 (attention visualization)
Preprocessing The HuggingFace version exposes a single text field with title and description already concatenated; if you load the original CSVs instead, combine the title and description fields yourself (e.g., with a [SEP] token for transformer models). Labels are 0-indexed (0=World, 1=Sports, 2=Business, 3=Sci/Tech).

Amazon Reviews (Multilingual)

Field Details
Description Product reviews in multiple languages with star ratings. Millions of reviews.
Size 1-20 GB depending on language/category subset
Format HuggingFace Datasets
Download from datasets import load_dataset; ds = load_dataset("amazon_reviews_multi", "en")
License Amazon terms (research use)
Chapters Ch. 14 (transfer learning), Ch. 32 (large-scale NLP pipeline)
Preprocessing Filter to English subset for most exercises. Create binary sentiment: 4-5 stars = positive, 1-2 stars = negative, drop 3-star reviews. Subsample to 100K reviews for local training.
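
The star-to-sentiment mapping in code (illustrative frame; the real records carry review text as well):

```python
import pandas as pd

df = pd.DataFrame({"stars": [1, 2, 3, 4, 5]})
df = df[df["stars"] != 3].copy()                  # drop ambiguous 3-star reviews
df["sentiment"] = (df["stars"] >= 4).astype(int)  # 4-5 positive, 1-2 negative
```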

E.4 Health and Biomedical Datasets

MIMIC-IV

Field Details
Description De-identified electronic health records from Beth Israel Deaconess Medical Center. Includes diagnoses, procedures, lab results, medications, and clinical notes for roughly 300K hospital admissions, including ICU stays.
Size ~7 GB (compressed)
Format CSV (relational tables)
Download Requires credentialed access: complete CITI training, sign data use agreement at https://physionet.org/content/mimiciv/
License PhysioNet Credentialed Health Data License 1.5.0 (no redistribution, requires IRB-equivalent training)
Chapters Ch. 20 (survival analysis), Ch. 22 (causal inference in healthcare), Ch. 25 (fairness in clinical models), Ch. 35 (capstone project option)
Preprocessing Join admissions, patients, diagnoses_icd, and labevents tables on subject_id and hadm_id. ICD codes are in both ICD-9 and ICD-10 — use the icd_version column to distinguish. For mortality prediction, define the target as in-hospital death (discharge_location == "DIED" or deathtime IS NOT NULL). Remove neonates (anchor_age == 0) for adult-only analyses.
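
A sketch of the join and mortality-target logic on invented two-row stand-ins for the admissions and patients tables (column names per MIMIC-IV; values fabricated — real dates in MIMIC-IV are deliberately shifted into the future):

```python
import pandas as pd

admissions = pd.DataFrame({
    "subject_id": [1, 2],
    "hadm_id": [100, 200],
    "discharge_location": ["HOME", "DIED"],
    "deathtime": [pd.NaT, pd.Timestamp("2150-03-01")],
})
patients = pd.DataFrame({"subject_id": [1, 2], "anchor_age": [45, 0]})

df = admissions.merge(patients, on="subject_id", how="inner")
df = df[df["anchor_age"] > 0].copy()    # drop neonates for adult-only analyses
df["mortality"] = ((df["discharge_location"] == "DIED")
                   | df["deathtime"].notna()).astype(int)
```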

Access note: MIMIC-IV requires approximately 1-2 weeks for access approval. Begin the credentialing process before starting Chapter 20. If access is not available, the MIMIC-IV demo dataset (100 patients) is available without credentialing for code testing, and all exercises include synthetic data alternatives.

Framingham Heart Study (Teaching Dataset)

Field Details
Description Cardiovascular risk factors and outcomes from the Framingham Heart Study. Teaching version with ~4,000 participants.
Size ~500 KB
Format CSV
Download Available via Kaggle: kaggle datasets download -d aasheesh200/framingham-heart-study-dataset
License Public (teaching version only)
Chapters Ch. 17 (Bayesian logistic regression), Ch. 20 (survival analysis)
Preprocessing Handle missing values in education, cigsPerDay, BPMeds, totChol, BMI, heartRate, glucose. Do not impute randomly — missingness patterns are informative (patients with missing glucose may not have been tested).
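
One way to keep the informative missingness is to record an indicator column before imputing, as sketched here on an invented three-row frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"glucose": [85.0, np.nan, 90.0], "age": [50, 61, 46]})
# Record missingness before imputing: untested patients may differ systematically
df["glucose_missing"] = df["glucose"].isna().astype(int)
df["glucose"] = df["glucose"].fillna(df["glucose"].median())
```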

E.5 Graph and Network Datasets

Cora Citation Network

Field Details
Description Citation network of 2,708 machine learning papers classified into 7 categories. Each paper has a binary bag-of-words feature vector (1,433 dimensions).
Size ~5 MB
Format PyTorch Geometric
Download from torch_geometric.datasets import Planetoid; dataset = Planetoid(root="/tmp/cora", name="Cora"); data = dataset[0]
License Public domain
Chapters Ch. 12 (graph neural networks)
Preprocessing Standard train/val/test split is provided via boolean masks. For transductive learning, all node features are available during training; only labels are masked.

OGB (Open Graph Benchmark)

Field Details
Description Standardized graph datasets for node classification, link prediction, and graph classification across molecular, social, and citation domains.
Size 10 MB to 10 GB depending on dataset
Format OGB Python package
Download pip install ogb then from ogb.nodeproppred import PygNodePropPredDataset; dataset = PygNodePropPredDataset(name="ogbn-arxiv")
License MIT (OGB code), varies by underlying dataset
Chapters Ch. 12 (graph neural networks — ogbn-arxiv), Ch. 31 (distributed GNN training — ogbn-products)
Preprocessing OGB provides standardized splits and evaluators. Always use dataset.get_idx_split() for reproducible train/val/test splits. Do not create your own splits.

E.6 Image Datasets

CIFAR-10 / CIFAR-100

Field Details
Description 60,000 32x32 color images in 10 classes (CIFAR-10) or 100 classes (CIFAR-100).
Size ~170 MB each (CIFAR-10 and CIFAR-100)
Format PyTorch built-in
Download torchvision.datasets.CIFAR10(root="./data", download=True)
License MIT
Chapters Ch. 8 (CNNs), Ch. 10 (regularization), Ch. 11 (transfer learning)
Preprocessing Normalize with mean=(0.4914, 0.4822, 0.4465), std=(0.2470, 0.2435, 0.2616) for CIFAR-10. Apply standard augmentation: random crop with padding=4, random horizontal flip.
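
In torchvision this is transforms.Normalize(mean, std) applied after ToTensor(); the arithmetic it performs is just per-channel standardization, shown here in NumPy on a flat gray stand-in image:

```python
import numpy as np

mean = np.array([0.4914, 0.4822, 0.4465])
std = np.array([0.2470, 0.2435, 0.2616])

img = np.full((32, 32, 3), 0.5)     # a flat gray image in [0, 1], HWC layout
normalized = (img - mean) / std     # broadcasts per channel
```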

ImageNet (ILSVRC 2012 subset)

Field Details
Description 1.2 million training images in 1,000 classes. The standard benchmark for image classification.
Size ~150 GB
Format JPEG images in class-labeled directories
Download Registration required at https://image-net.org/. Use torchvision.datasets.ImageNet after downloading.
License Non-commercial research only
Chapters Ch. 11 (pretrained backbones), Ch. 31 (distributed training)
Preprocessing Standard: resize to 256, center crop to 224, normalize with ImageNet mean/std. For data-efficient exercises, use the ImageNet-1k subset available via HuggingFace (gated — accept the terms on the Hub and authenticate first): load_dataset("imagenet-1k", split="train[:10%]").

E.7 Synthetic Datasets

Several exercises use synthetic data to isolate specific statistical concepts. These are generated programmatically and do not require downloads.

Synthetic Causal Data

import numpy as np
import pandas as pd

def generate_causal_dataset(n=5000, seed=42):
    """Generate data with known causal structure for validation.

    DAG: Z -> X -> Y, Z -> Y (confounder)
         U -> X (instrument)
    True ATE of X on Y: 2.0
    """
    rng = np.random.default_rng(seed)

    Z = rng.normal(0, 1, n)                          # Confounder
    U = rng.normal(0, 1, n)                          # Instrument
    X = 0.5 * Z + 0.8 * U + rng.normal(0, 0.5, n)  # Treatment
    Y = 2.0 * X + 1.5 * Z + rng.normal(0, 1, n)    # Outcome (true ATE = 2.0)

    return pd.DataFrame({"Z": Z, "U": U, "X": X, "Y": Y})

Chapters: Ch. 21 (causal identification), Ch. 22 (backdoor adjustment), Ch. 23 (instrumental variables)
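
As a sanity check on the generator above, an ordinary least-squares regression of Y on X alone is confounded, while adjusting for Z recovers the true ATE of 2.0. A sketch that inlines the same data-generating process (large n so sampling noise is negligible):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
Z = rng.normal(0, 1, n)
U = rng.normal(0, 1, n)
X = 0.5 * Z + 0.8 * U + rng.normal(0, 0.5, n)
Y = 2.0 * X + 1.5 * Z + rng.normal(0, 1, n)

# Naive regression of Y on X is confounded by Z and overestimates the effect
naive = np.linalg.lstsq(np.column_stack([np.ones(n), X]), Y, rcond=None)[0][1]
# Adjusting for the confounder (backdoor criterion) recovers the true ATE
adjusted = np.linalg.lstsq(np.column_stack([np.ones(n), X, Z]), Y, rcond=None)[0][1]
```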

Synthetic Fairness Data

def generate_fairness_dataset(n=10000, seed=42):
    """Generate hiring data with known demographic disparity.

    True qualification rate is equal across groups.
    Selection rate differs due to biased threshold.
    """
    rng = np.random.default_rng(seed)

    group = rng.choice(["A", "B"], size=n, p=[0.6, 0.4])
    qualification = rng.normal(50, 10, n)
    noise = rng.normal(0, 5, n)

    # Biased scoring: Group B gets a penalty
    score = qualification + noise - 5.0 * (group == "B").astype(float)
    hired = (score > 50).astype(int)

    return pd.DataFrame({
        "group": group,
        "qualification": qualification,
        "score": score,
        "hired": hired,
    })

Chapters: Ch. 25 (fairness metrics), Ch. 26 (mitigation)
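
The built-in disparity is easy to surface with a selection-rate comparison (demographic parity difference). A sketch that inlines the same biased scoring process:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
group = rng.choice(["A", "B"], size=n, p=[0.6, 0.4])
score = rng.normal(50, 10, n) + rng.normal(0, 5, n) - 5.0 * (group == "B")
hired = (score > 50).astype(int)
df = pd.DataFrame({"group": group, "hired": hired})

# Demographic parity difference: gap in selection rates between groups
rates = df.groupby("group")["hired"].mean()
dp_diff = rates["A"] - rates["B"]
```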

Synthetic Time Series

def generate_arima_with_intervention(n=500, intervention_point=300, seed=42):
    """Generate a weekly series (linear trend + annual seasonality + white
    noise) with a known structural break, for difference-in-differences and
    interrupted time series exercises.
    """
    rng = np.random.default_rng(seed)

    trend = np.linspace(0, 10, n)
    seasonality = 3 * np.sin(2 * np.pi * np.arange(n) / 52)
    noise = rng.normal(0, 1, n)

    treatment_effect = np.zeros(n)
    treatment_effect[intervention_point:] = 5.0  # true effect = 5.0

    y = trend + seasonality + noise + treatment_effect

    return pd.DataFrame({
        "t": np.arange(n),
        "y": y,
        "post_intervention": (np.arange(n) >= intervention_point).astype(int),
    })

Chapters: Ch. 18 (Bayesian structural time series), Ch. 22 (difference-in-differences)
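
A quick interrupted-time-series check on this generator: regressing y on an intercept, the trend, and a post-intervention indicator recovers an effect near the true 5.0 (seasonality is left in the residuals here, which is fine for a rough check). The sketch inlines the same data-generating process:

```python
import numpy as np

rng = np.random.default_rng(42)
n, cut = 500, 300
t = np.arange(n)
y = (np.linspace(0, 10, n)
     + 3 * np.sin(2 * np.pi * t / 52)
     + rng.normal(0, 1, n)
     + 5.0 * (t >= cut))

# Interrupted time series: intercept + linear trend + post-intervention step
X = np.column_stack([np.ones(n), t, (t >= cut).astype(float)])
effect = np.linalg.lstsq(X, y, rcond=None)[0][2]
```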


E.8 Download and Storage Recommendations

Directory Structure

Organize datasets under a single root to keep paths consistent across chapters:

~/ads-data/
├── movielens/
│   └── ml-25m/
├── lending-club/
├── mimic-iv/          # credentialed access
├── era5/
│   ├── temperature/
│   └── precipitation/
├── imagenet/          # if using full dataset
├── huggingface/       # HuggingFace datasets cache
└── synthetic/         # generated by code in exercises

Set the environment variable in your shell profile:

export ADS_DATA_DIR="$HOME/ads-data"
export HF_HOME="$HOME/ads-data/huggingface"

Storage Estimates

Category Approximate Size
All tabular datasets ~3 GB
NLP datasets (HuggingFace) ~2 GB
Climate data (ERA5 subset) ~10-50 GB
Image datasets (CIFAR only) ~500 MB
Image datasets (with ImageNet) ~160 GB
MIMIC-IV ~7 GB
Total (without ImageNet) ~20-65 GB
Total (with ImageNet) ~180-225 GB

Download Script

"""download_datasets.py — Download all freely available datasets."""

import os
from pathlib import Path

DATA_DIR = Path(os.environ.get("ADS_DATA_DIR", Path.home() / "ads-data"))
DATA_DIR.mkdir(parents=True, exist_ok=True)

def download_sklearn_datasets():
    from sklearn.datasets import (
        fetch_california_housing,
        fetch_openml,
    )
    print("Downloading scikit-learn datasets...")
    fetch_california_housing(data_home=DATA_DIR / "sklearn")
    fetch_openml("adult", version=2, data_home=DATA_DIR / "sklearn")
    fetch_openml("wine-quality-red", version=1, data_home=DATA_DIR / "sklearn")
    fetch_openml("wine-quality-white", version=1, data_home=DATA_DIR / "sklearn")
    print("  Done.")

def download_huggingface_datasets():
    from datasets import load_dataset
    print("Downloading HuggingFace datasets...")
    load_dataset("imdb", cache_dir=DATA_DIR / "huggingface")
    load_dataset("rajpurkar/squad_v2", cache_dir=DATA_DIR / "huggingface")
    load_dataset("ag_news", cache_dir=DATA_DIR / "huggingface")
    print("  Done.")

def download_torchvision_datasets():
    import torchvision
    print("Downloading torchvision datasets...")
    torchvision.datasets.CIFAR10(root=DATA_DIR / "cifar", download=True)
    torchvision.datasets.CIFAR100(root=DATA_DIR / "cifar", download=True)
    print("  Done.")

def download_movielens():
    import urllib.request
    import zipfile
    print("Downloading MovieLens 25M...")
    url = "https://files.grouplens.org/datasets/movielens/ml-25m.zip"
    dest = DATA_DIR / "movielens" / "ml-25m.zip"
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    extracted = DATA_DIR / "movielens" / "ml-25m"
    if not extracted.exists():  # re-extract even if a prior run left only the zip
        with zipfile.ZipFile(dest, "r") as z:
            z.extractall(DATA_DIR / "movielens")
    print("  Done.")

if __name__ == "__main__":
    download_sklearn_datasets()
    download_huggingface_datasets()
    download_torchvision_datasets()
    download_movielens()
    print(f"\nAll datasets saved to {DATA_DIR}")
    print("Note: MIMIC-IV and ERA5 require separate credentialed access.")
    print("Note: Kaggle datasets require `kaggle` CLI and API key.")

Cross-references: Appendix C documents the libraries used to load and process these datasets. Appendix D covers the environment setup required to run the download script. Individual chapters provide detailed loading and preprocessing code for each dataset they use.