Appendix D: Data Sources and Datasets

This appendix catalogs the major datasets, APIs, and data generation tools referenced throughout this textbook. For each entry, we provide a brief description, approximate size, licensing information where known, and the chapters where it is most relevant.


D.1 Natural Language Processing Datasets

D.1.1 Benchmark Suites

GLUE (General Language Understanding Evaluation) - URL: https://gluebenchmark.com - Description: A collection of nine sentence- and sentence-pair-level NLU tasks including sentiment analysis (SST-2), textual entailment (MNLI, RTE), paraphrase detection (MRPC, QQP), and linguistic acceptability (CoLA). - Size: Varies by task; SST-2 has approximately 67,000 training examples. - License: Varies by sub-dataset. - Chapters: 9, 12, 13. - Access: datasets.load_dataset("glue", "sst2")

SuperGLUE - URL: https://super.gluebenchmark.com - Description: A more challenging successor to GLUE, including tasks like BoolQ (yes/no QA), CB (textual entailment with three classes), MultiRC (multi-sentence reading comprehension), and WiC (word-in-context). - Size: Varies; generally smaller than GLUE tasks. - License: Varies by sub-dataset. - Chapters: 9, 12, 13. - Access: datasets.load_dataset("super_glue", "boolq")

MMLU (Massive Multitask Language Understanding) - URL: https://github.com/hendrycks/test - Description: 57 multiple-choice tasks spanning STEM, humanities, social sciences, and more. A standard benchmark for evaluating LLM knowledge breadth. - Size: Approximately 14,000 test questions. - License: MIT. - Chapters: 15, 16, 31.

D.1.2 Question Answering

SQuAD (Stanford Question Answering Dataset) - URL: https://rajpurkar.github.io/SQuAD-explorer/ - Description: SQuAD 1.1 contains 100,000+ question-answer pairs based on Wikipedia passages (extractive QA). SQuAD 2.0 adds 50,000+ unanswerable questions. - License: CC BY-SA 4.0. - Chapters: 10, 12, 20. - Access: datasets.load_dataset("squad")

Natural Questions (NQ) - URL: https://ai.google.com/research/NaturalQuestions - Description: Real questions from Google Search paired with Wikipedia articles containing the answers. Includes both short and long answer annotations. - Size: 307,000+ training examples. - License: CC BY-SA 3.0. - Chapters: 20, 26.

TriviaQA - Description: 650,000 question-answer-evidence triples gathered from trivia and quiz-league websites. - Chapters: 20, 26.

D.1.3 Text Generation and Summarization

CNN/DailyMail - Description: 300,000+ news articles from CNN and the Daily Mail paired with multi-sentence highlight summaries. The standard benchmark for abstractive summarization. - Access: datasets.load_dataset("cnn_dailymail", "3.0.0") - Chapters: 12, 15.

XSum (Extreme Summarization) - Description: 227,000 BBC news articles with single-sentence summaries. Tests models' ability to generate highly abstractive summaries. - Access: datasets.load_dataset("xsum") - Chapters: 12, 15.

D.1.4 Large-Scale Pre-training Corpora

Common Crawl - URL: https://commoncrawl.org - Description: Petabytes of raw web data crawled since 2008, with new crawls released roughly monthly. The basis for many pre-training datasets. - Size: Petabytes (raw); filtered subsets vary. - License: Open; content licensing varies per page. - Chapters: 14, 15.

The Pile - Description: An 825GB curated dataset combining 22 high-quality sub-datasets (academic papers, books, code, etc.) designed for LLM pre-training. - License: MIT (compilation); individual components vary. - Chapters: 14, 15.

C4 (Colossal Clean Crawled Corpus) - Description: A cleaned version of Common Crawl used to train the T5 model. Approximately 750GB of text. - Access: datasets.load_dataset("allenai/c4", "en") - Chapters: 14, 15.

RedPajama - Description: An open reproduction of the LLaMA training data, consisting of approximately 1.2 trillion tokens from Common Crawl, C4, GitHub, books, arXiv, Wikipedia, and StackExchange. - License: Apache 2.0. - Chapters: 14, 15.

FineWeb / FineWeb-Edu - Description: High-quality filtered web text from HuggingFace, with an educational content subset particularly useful for training smaller models. - License: ODC-By 1.0. - Chapters: 14, 15.

D.1.5 Instruction and Alignment Datasets

Open Assistant (OASST) - Description: 160,000+ human-annotated messages organized into conversation trees, for training assistant-style models. Includes ranking information for RLHF. - License: Apache 2.0. - Chapters: 16, 17.

Anthropic HH-RLHF - Description: Human preference data for helpfulness and harmlessness. Contains pairs of model responses with human preference labels. - License: MIT. - Chapters: 16, 17.

UltraFeedback - Description: 64,000 instructions with responses from multiple models, scored on helpfulness, honesty, instruction following, and truthfulness. - Chapters: 16, 17.
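Preference datasets like these are typically flattened into (prompt, chosen, rejected) triples before reward modeling or DPO training. The sketch below shows that conversion for UltraFeedback-style scored responses; the field names ("instruction", "responses", "score") are illustrative, not the dataset's actual schema.

```python
# Convert records with scored responses into preference pairs.
# Schema here is hypothetical; adapt to the real dataset's fields.

def to_preference_pairs(records):
    """Each record: {"instruction": str, "responses": [{"text": str, "score": float}, ...]}."""
    pairs = []
    for rec in records:
        ranked = sorted(rec["responses"], key=lambda r: r["score"], reverse=True)
        # Only keep a pair when the best response strictly beats the worst.
        if len(ranked) >= 2 and ranked[0]["score"] > ranked[-1]["score"]:
            pairs.append({
                "prompt": rec["instruction"],
                "chosen": ranked[0]["text"],
                "rejected": ranked[-1]["text"],
            })
    return pairs

records = [{
    "instruction": "Explain overfitting in one sentence.",
    "responses": [
        {"text": "Overfitting is when a model memorizes noise.", "score": 4.5},
        {"text": "It is a thing that happens.", "score": 1.0},
    ],
}]
pairs = to_preference_pairs(records)
```

The same triple format is what most open RLHF/DPO training libraries expect as input.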


D.2 Computer Vision Datasets

ImageNet (ILSVRC) - URL: https://www.image-net.org - Description: 1.28 million training images across 1,000 classes. The foundational benchmark for image classification. ImageNet-21k has approximately 14 million images in 21,841 classes. - License: Research use only (requires registration). - Chapters: 8, 22.

CIFAR-10 / CIFAR-100 - Description: 60,000 32x32 color images in 10 (or 100) classes. A standard dataset for prototyping and experimentation due to its small size. - License: Not formally specified; freely used in research. - Chapters: 7, 8. - Access: torchvision.datasets.CIFAR10(root="./data", download=True)

COCO (Common Objects in Context) - URL: https://cocodataset.org - Description: 330,000+ images with 80 object categories. Includes annotations for object detection, segmentation, keypoints, and image captioning. - License: CC BY 4.0 (annotations); images are subject to Flickr terms of use. - Chapters: 22, 23.

LAION-5B - Description: 5.85 billion image-text pairs scraped from the web. Used to train open models such as Stable Diffusion and OpenCLIP. - License: CC BY 4.0 (metadata); images are linked, not redistributed. - Chapters: 22, 23.

Visual Question Answering (VQA) - URL: https://visualqa.org - Description: 265,000 images with 760,000+ questions and 10 ground-truth answers each. - Chapters: 23.


D.3 Audio and Speech Datasets

LibriSpeech - URL: https://www.openslr.org/12 - Description: 1,000 hours of read English speech from audiobooks. The standard benchmark for automatic speech recognition (ASR). - License: CC BY 4.0. - Chapters: 24. - Access: datasets.load_dataset("librispeech_asr", "clean")

Common Voice (Mozilla) - URL: https://commonvoice.mozilla.org - Description: The world's largest open multilingual voice dataset, with contributions in 100+ languages. Crowdsourced recordings with validated transcriptions. - Size: 20,000+ hours across all languages. - License: CC0 (public domain). - Chapters: 24, 25.

GigaSpeech - Description: 10,000 hours of English audio from audiobooks, podcasts, and YouTube. Designed as a large-scale ASR training corpus. - License: Apache 2.0. - Chapters: 24.

MusicCaps - Description: 5,521 music clips with text descriptions, created by Google for music understanding and generation tasks. - Chapters: 25.

VoxCeleb / VoxCeleb2 - Description: Speaker recognition datasets containing speech from thousands of celebrities. VoxCeleb2 has over 1 million utterances from 6,000+ speakers. - Chapters: 24.


D.4 Tabular and General-Purpose Datasets

UCI Machine Learning Repository - URL: https://archive.ics.uci.edu - Description: A long-standing collection of 600+ datasets for machine learning research. Includes classics like Iris, Wine, Adult Census, and Boston Housing. - License: Varies; most are freely available for research. - Chapters: 3, 4, 5, 6.

Kaggle Datasets - URL: https://www.kaggle.com/datasets - Description: Community-contributed datasets spanning virtually every domain. Kaggle also hosts competitions with curated datasets and leaderboards. - License: Varies per dataset. - Chapters: 3, 4, 5, 6, 38. - Access: kaggle datasets download -d <dataset-slug>

OpenML - URL: https://www.openml.org - Description: A platform for sharing machine learning datasets, tasks, and experiments. Provides standardized interfaces for benchmarking. - License: Varies per dataset. - Chapters: 3, 6.

HuggingFace Datasets Hub - URL: https://huggingface.co/datasets - Description: Centralized repository hosting 100,000+ datasets with a unified Python API. Supports streaming for large datasets. - Access: datasets.load_dataset("dataset_name") - Chapters: Used throughout the book.


D.5 APIs and Services for Data Access

D.5.1 Cloud Data APIs

Google BigQuery Public Datasets - Description: Petabytes of publicly available datasets queryable via SQL. Includes GitHub activity data, Wikipedia, US Census data, and more. - Pricing: The first 1TB of query processing per month is free. - Chapters: 28, 38.

AWS Open Data - URL: https://registry.opendata.aws - Description: Hundreds of datasets available directly on S3. Includes satellite imagery, genomics data, and Common Crawl. - Chapters: 28.

Kaggle API - Description: Programmatic access to download competition data, public datasets, and submit predictions. - Usage: pip install kaggle && kaggle datasets download -d <slug>

D.5.2 LLM APIs for Synthetic Data Generation

OpenAI API - Description: Access to GPT-4 and related models. Commonly used for generating synthetic training data, labels, and evaluations. - Chapters: 15, 17, 30.

Anthropic API - Description: Access to Claude models. Used for generating high-quality synthetic data, especially for complex reasoning tasks. - Chapters: 15, 17, 30.

HuggingFace Inference API - Description: Free-tier access to thousands of hosted models for quick experimentation. - Chapters: 30.

D.5.3 Web Scraping and Data Collection

Scrapy - Description: Python framework for building web scrapers. Useful for collecting domain-specific text data. - Chapters: 28.

Selenium / Playwright - Description: Browser automation tools for scraping JavaScript-rendered pages. - Chapters: 28.
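Whatever tool fetches the pages, a scraping pipeline for text data usually ends with an HTML-to-text step. A minimal sketch using only the standard library is shown below; real pipelines would typically use a robust parser such as lxml or BeautifulSoup instead.

```python
# Minimal HTML text extraction with the stdlib HTMLParser,
# skipping <script> and <style> content.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = "<html><body><h1>Title</h1><script>var x=1;</script><p>Body text.</p></body></html>"
parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.chunks)  # -> "Title Body text."
```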

Common Crawl Index - Description: Query interface to search Common Crawl archives by URL pattern or domain without downloading the full archive. - Chapters: 28.
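The index is exposed as a CDX-style HTTP API: each crawl has its own endpoint that takes a URL pattern and returns matching capture records. The snippet below only constructs such a query URL (no network call); the crawl label "CC-MAIN-2023-50" is an example, and current labels are listed at index.commoncrawl.org.

```python
# Build a Common Crawl CDX index query URL for a given crawl and URL pattern.
from urllib.parse import urlencode

def cc_index_query(crawl, url_pattern, output="json"):
    params = urlencode({"url": url_pattern, "output": output})
    return f"https://index.commoncrawl.org/{crawl}-index?{params}"

query = cc_index_query("CC-MAIN-2023-50", "example.com/*")
```

Fetching this URL (e.g. with urllib or requests) returns one JSON record per matching capture, including the WARC file and byte offsets needed to download just that page.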


D.6 Synthetic Data Generation Tools

Synthetic data generation is increasingly important in AI engineering, especially for addressing data scarcity, working under privacy constraints, and building evaluation datasets.

D.6.1 Text Data Generation

Faker - Description: Python library for generating realistic fake data (names, addresses, text, dates, etc.). Useful for generating test fixtures and tabular data. - Usage: pip install faker - Chapters: 28, 31.

LLM-Based Generation - Description: Using large language models to generate training data, synthetic conversations, and evaluation sets. This is now the dominant approach for creating instruction-following datasets. - Key technique: Provide few-shot examples in a prompt, then sample diverse completions with temperature > 0. - Chapters: 16, 17, 28.
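The few-shot pattern just described can be sketched as follows. The prompt builder is generic; the sampling parameters in `request` are illustrative, since parameter names vary across LLM APIs.

```python
# Build a few-shot prompt for synthetic data generation, then sample
# diverse completions with temperature > 0 (request payload is illustrative).

def build_fewshot_prompt(task, examples, n_new=5):
    lines = [f"Task: {task}", "", "Examples:"]
    for ex in examples:
        lines.append(f"- Input: {ex['input']}")
        lines.append(f"  Output: {ex['output']}")
    lines.append("")
    lines.append(f"Generate {n_new} new, diverse input/output pairs in the same format.")
    return "\n".join(lines)

seed = [
    {"input": "The movie was wonderful.", "output": "positive"},
    {"input": "I want a refund.", "output": "negative"},
]
prompt = build_fewshot_prompt("Sentiment classification", seed)

# Temperature above 0 and multiple samples per prompt encourage diversity.
request = {"prompt": prompt, "temperature": 0.9, "n": 4}
```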

Self-Instruct / Evol-Instruct - Description: Methods for using a strong LLM to generate instruction-response pairs, optionally evolving them to increase complexity. Used to create datasets like Alpaca and WizardLM. - Chapters: 16, 17.

D.6.2 Image and Multimodal Generation

Stable Diffusion / SDXL - Description: Open-source text-to-image models that can generate synthetic training images. Useful for data augmentation in computer vision. - Chapters: 22, 23.

Albumentations - Description: Fast image augmentation library. Provides geometric and photometric transformations for creating augmented training samples. - Usage: pip install albumentations - Chapters: 8, 22.

D.6.3 Tabular Data Generation

CTGAN (Conditional Tabular GAN) - Description: GAN-based approach for generating synthetic tabular data that preserves statistical properties of the original dataset. - Usage: pip install ctgan - Chapters: 6, 14.

SDV (Synthetic Data Vault) - Description: Comprehensive library for generating synthetic relational, tabular, and time-series data using statistical and deep learning models. - Usage: pip install sdv - Chapters: 6, 28.
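For intuition about what these tools add, consider the naive baseline below: resampling each column independently from its empirical distribution. This preserves per-column marginals but destroys cross-column correlations, which is exactly the structure CTGAN and SDV are designed to keep.

```python
# Naive synthetic tabular data: independent per-column resampling.
# Useful as a baseline, not as a replacement for CTGAN/SDV.
import random

def naive_synthetic(rows, n, rng):
    cols = rows[0].keys()
    return [{c: rng.choice([r[c] for r in rows]) for c in cols} for _ in range(n)]

real = [
    {"age": 34, "income": 72000},
    {"age": 51, "income": 98000},
    {"age": 28, "income": 54000},
]
fake = naive_synthetic(real, n=5, rng=random.Random(0))
```

Any age can now co-occur with any income, so a downstream model trained on `fake` would see correlation structure that the real data does not have.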

D.6.4 Privacy-Preserving Synthetic Data

Differential Privacy: Differentially private synthetic data generators provide formal privacy guarantees by adding calibrated noise during the generation process.

Key libraries: - diffprivlib (IBM): Differentially private machine learning. - opacus (Meta): Differentially private training for PyTorch models. - Chapters: 28, 37.
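The core idea these libraries build on can be shown with the Laplace mechanism: a count query with sensitivity 1 satisfies epsilon-differential privacy when noise drawn from Laplace(scale = 1/epsilon) is added to the true count. This is a pedagogical sketch, not a substitute for an audited library.

```python
# Laplace mechanism sketch for an epsilon-DP count query.
import math
import random

def laplace_noise(scale, rng):
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    # Inverse-CDF sampling from the Laplace distribution.
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(values, predicate, epsilon, rng):
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

ages = [23, 35, 41, 29, 52, 61, 38]
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=random.Random(0))
```

Smaller epsilon means a larger noise scale and stronger privacy; the reported count is correspondingly less accurate.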


D.7 Dataset Loading Cheat Sheet

The following code snippets demonstrate how to quickly load datasets from the most common sources used in this book.

from datasets import load_dataset

# HuggingFace Hub
dataset = load_dataset("glue", "sst2")                    # GLUE benchmark
dataset = load_dataset("squad")                             # SQuAD
dataset = load_dataset("wikitext", "wikitext-103-v1")      # WikiText-103
dataset = load_dataset("json", data_files="data.jsonl")    # Local JSONL file
dataset = load_dataset("csv", data_files="data.csv")       # Local CSV file

# Streaming (for very large datasets)
dataset = load_dataset("allenai/c4", "en", streaming=True)
for example in dataset["train"]:
    process(example)  # examples stream in lazily; nothing is downloaded up front
    break  # remove this break to iterate over the whole split

# Torchvision
import torchvision
cifar10 = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
imagenet = torchvision.datasets.ImageNet(root="/data/imagenet", split="train")

# Torchaudio
import torchaudio
librispeech = torchaudio.datasets.LIBRISPEECH(root="./data", url="train-clean-100")

# Scikit-learn built-in
from sklearn.datasets import load_iris, fetch_california_housing
iris = load_iris()
housing = fetch_california_housing()

# Kaggle (CLI)
# kaggle competitions download -c titanic
# kaggle datasets download -d zillow/zecon

D.8 Dataset Best Practices

  1. Always check the license before using a dataset. Some datasets (e.g., ImageNet) have restrictions on commercial use.
  2. Document your data sources in your project's README and model card. This is essential for reproducibility and responsible AI.
  3. Inspect data quality before training. Use tools like cleanlab for label-error detection and ydata-profiling (formerly pandas-profiling) for exploratory data analysis.
  4. Version your datasets using tools like DVC (Data Version Control) or HuggingFace dataset versioning.
  5. Consider data contamination: If evaluating on a public benchmark, verify that your training data does not contain test set examples. This is an increasing concern with web-scraped pre-training corpora.
  6. Respect privacy: De-identify personal information before using real-world data for training. Consider using synthetic data when working with sensitive domains.
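The contamination check in item 5 is often implemented as an n-gram overlap test: flag any training document that shares a long n-gram with a benchmark's test set. The sketch below uses 13-gram overlap on whitespace tokens, a common choice in contamination analyses; production checks would add normalization (punctuation stripping, deduplication) and scale via hashing.

```python
# Flag training documents sharing any long n-gram with benchmark test data.

def ngrams(text, n=13):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(train_doc, test_docs, n=13):
    if not test_docs:
        return False
    test_grams = set().union(*(ngrams(d, n) for d in test_docs))
    return bool(ngrams(train_doc, n) & test_grams)

bench = "the quick brown fox jumps over the lazy dog near the river bank today"
train = "note that the quick brown fox jumps over the lazy dog near the river bank today indeed"
clean = "a totally unrelated sentence about gardening"
```

Here `contaminated(train, [bench])` is True because the documents share a 13-token span, while `contaminated(clean, [bench])` is False.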