Appendix D: Data Sources and Datasets
This appendix catalogs the major datasets, APIs, and data generation tools referenced throughout this textbook. For each entry, we provide a brief description, approximate size, licensing information where known, and the chapters where it is most relevant.
D.1 Natural Language Processing Datasets
D.1.1 Benchmark Suites
GLUE (General Language Understanding Evaluation)
- URL: https://gluebenchmark.com
- Description: A collection of nine sentence- and sentence-pair-level NLU tasks including sentiment analysis (SST-2), textual entailment (MNLI, RTE), paraphrase detection (MRPC, QQP), and linguistic acceptability (CoLA).
- Size: Varies by task; SST-2 has approximately 67,000 training examples.
- License: Varies by sub-dataset.
- Chapters: 9, 12, 13.
- Access: datasets.load_dataset("glue", "sst2")
SuperGLUE
- URL: https://super.gluebenchmark.com
- Description: A more challenging successor to GLUE, including tasks like BoolQ (yes/no QA), CB (textual entailment with three classes), MultiRC (multi-sentence reading comprehension), and WiC (word-in-context).
- Size: Varies; generally smaller than GLUE tasks.
- License: Varies by sub-dataset.
- Chapters: 9, 12, 13.
- Access: datasets.load_dataset("super_glue", "boolq")
MMLU (Massive Multitask Language Understanding)
- URL: https://github.com/hendrycks/test
- Description: 57 multiple-choice tasks spanning STEM, humanities, social sciences, and more. A standard benchmark for evaluating LLM knowledge breadth.
- Size: Approximately 15,000 test questions.
- License: MIT.
- Chapters: 15, 16, 31.
D.1.2 Question Answering
SQuAD (Stanford Question Answering Dataset)
- URL: https://rajpurkar.github.io/SQuAD-explorer/
- Description: SQuAD 1.1 contains 100,000+ question-answer pairs based on Wikipedia passages (extractive QA). SQuAD 2.0 adds 50,000+ unanswerable questions.
- License: CC BY-SA 4.0.
- Chapters: 10, 12, 20.
- Access: datasets.load_dataset("squad")
Natural Questions (NQ)
- URL: https://ai.google.com/research/NaturalQuestions
- Description: Real questions from Google Search paired with Wikipedia articles containing the answers. Includes both short and long answer annotations.
- Size: 307,000+ training examples.
- License: Apache 2.0.
- Chapters: 20, 26.
TriviaQA
- Description: 650,000 question-answer-evidence triples gathered from trivia and quiz-league websites.
- Chapters: 20, 26.
D.1.3 Text Generation and Summarization
CNN/DailyMail
- Description: 300,000+ article-summary pairs from CNN and DailyMail news articles. The standard benchmark for abstractive summarization.
- Access: datasets.load_dataset("cnn_dailymail", "3.0.0")
- Chapters: 12, 15.
XSum (Extreme Summarization)
- Description: 227,000 BBC news articles with single-sentence summaries. Tests models' ability to generate highly abstractive summaries.
- Access: datasets.load_dataset("xsum")
- Chapters: 12, 15.
D.1.4 Large-Scale Pre-training Corpora
Common Crawl
- URL: https://commoncrawl.org
- Description: Petabytes of raw web data collected monthly since 2008. The basis for many pre-training datasets.
- Size: Petabytes (raw); filtered subsets vary.
- License: Open; content licensing varies per page.
- Chapters: 14, 15.
The Pile
- Description: An 800GB curated dataset combining 22 high-quality sub-datasets (academic papers, books, code, etc.) designed for LLM pre-training.
- License: MIT (compilation); individual components vary.
- Chapters: 14, 15.
C4 (Colossal Clean Crawled Corpus)
- Description: A cleaned version of Common Crawl used to train the T5 model. Approximately 750GB of text.
- Access: datasets.load_dataset("c4", "en")
- Chapters: 14, 15.
RedPajama
- Description: An open reproduction of the LLaMA training data, consisting of approximately 1.2 trillion tokens from Common Crawl, C4, GitHub, books, arXiv, Wikipedia, and StackExchange.
- License: Apache 2.0.
- Chapters: 14, 15.
FineWeb / FineWeb-Edu
- Description: High-quality filtered web text from HuggingFace, with an educational content subset particularly useful for training smaller models.
- License: ODC-By 1.0.
- Chapters: 14, 15.
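All of the corpora above apply heavy heuristic filtering to raw web text before training. As a rough illustration of the idea, here is a minimal quality filter loosely inspired by published cleaning pipelines such as C4's; the specific thresholds and rules below are illustrative choices, not the originals.

```python
def keep_document(text, min_words=50, max_word_len=1000, min_terminal_ratio=0.6):
    """Heuristic quality filter for raw web text (illustrative thresholds)."""
    words = text.split()
    if len(words) < min_words:  # too short to be useful prose
        return False
    if max(len(w) for w in words) > max_word_len:  # likely binary or markup junk
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    # Require most lines to end in terminal punctuation: prose, not nav menus.
    terminal = sum(ln.endswith((".", "!", "?", '"')) for ln in lines)
    return terminal / len(lines) >= min_terminal_ratio
```

Real pipelines add many more signals (language identification, deduplication, perplexity filtering), but they compose in the same keep-or-drop fashion.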
D.1.5 Instruction and Alignment Datasets
Open Assistant (OASST)
- Description: 160,000+ human-annotated conversation messages for training assistant-style models. Includes ranking information for RLHF.
- License: Apache 2.0.
- Chapters: 16, 17.
Anthropic HH-RLHF
- Description: Human preference data for helpfulness and harmlessness. Contains pairs of model responses with human preference labels.
- Chapters: 16, 17.
UltraFeedback
- Description: 64,000 instructions with responses from multiple models, scored on helpfulness, honesty, instruction following, and truthfulness.
- Chapters: 16, 17.
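Preference-tuning code (RLHF reward modeling, DPO) typically consumes these datasets as (prompt, chosen, rejected) triples. The sketch below converts a scored multi-response record into that form; the input field names are hypothetical, since each dataset (HH-RLHF, UltraFeedback) uses its own schema.

```python
def to_preference_triple(record):
    """Convert one scored record into the (prompt, chosen, rejected) form
    expected by most preference-tuning pipelines. Field names are
    hypothetical placeholders, not any specific dataset's schema."""
    responses = sorted(record["responses"], key=lambda r: r["score"], reverse=True)
    return {
        "prompt": record["instruction"],
        "chosen": responses[0]["text"],    # highest-scored response
        "rejected": responses[-1]["text"],  # lowest-scored response
    }
```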
D.2 Computer Vision Datasets
ImageNet (ILSVRC)
- URL: https://www.image-net.org
- Description: 1.28 million training images across 1,000 classes. The foundational benchmark for image classification. ImageNet-21k has approximately 14 million images in 21,841 classes.
- License: Research use only (requires registration).
- Chapters: 8, 22.
CIFAR-10 / CIFAR-100
- Description: 60,000 32x32 color images in 10 (or 100) classes. A standard dataset for prototyping and experimentation due to its small size.
- License: MIT.
- Chapters: 7, 8.
- Access: torchvision.datasets.CIFAR10(root="./data", download=True)
COCO (Common Objects in Context)
- URL: https://cocodataset.org
- Description: 330,000+ images with 80 object categories. Includes annotations for object detection, segmentation, keypoints, and image captioning.
- License: CC BY 4.0.
- Chapters: 22, 23.
LAION-5B
- Description: 5.85 billion image-text pairs scraped from the web. Used to train open text-to-image and vision-language models such as Stable Diffusion.
- License: CC BY 4.0 (metadata); images are linked, not redistributed.
- Chapters: 22, 23.
Visual Question Answering (VQA)
- URL: https://visualqa.org
- Description: 265,000 images with 760,000+ questions and 10 ground-truth answers each.
- Chapters: 23.
D.3 Audio and Speech Datasets
LibriSpeech
- URL: https://www.openslr.org/12
- Description: 1,000 hours of read English speech from audiobooks. The standard benchmark for automatic speech recognition (ASR).
- License: CC BY 4.0.
- Chapters: 24.
- Access: datasets.load_dataset("librispeech_asr", "clean")
Common Voice (Mozilla)
- URL: https://commonvoice.mozilla.org
- Description: The world's largest open multilingual voice dataset, with contributions in 100+ languages. Crowdsourced recordings with validated transcriptions.
- Size: 20,000+ hours across all languages.
- License: CC0 (public domain).
- Chapters: 24, 25.
GigaSpeech
- Description: 10,000 hours of English audio from audiobooks, podcasts, and YouTube. Designed as a large-scale ASR training corpus.
- License: Apache 2.0.
- Chapters: 24.
MusicCaps
- Description: 5,521 music clips with text descriptions, created by Google for music understanding and generation tasks.
- Chapters: 25.
VoxCeleb / VoxCeleb2
- Description: Speaker recognition datasets containing speech from thousands of celebrities. VoxCeleb2 has over 1 million utterances from 6,000+ speakers.
- Chapters: 24.
D.4 Tabular and General-Purpose Datasets
UCI Machine Learning Repository
- URL: https://archive.ics.uci.edu
- Description: A long-standing collection of 600+ datasets for machine learning research. Includes classics like Iris, Wine, Adult Census, and Boston Housing.
- License: Varies; most are freely available for research.
- Chapters: 3, 4, 5, 6.
Kaggle Datasets
- URL: https://www.kaggle.com/datasets
- Description: Community-contributed datasets spanning virtually every domain. Kaggle also hosts competitions with curated datasets and leaderboards.
- License: Varies per dataset.
- Chapters: 3, 4, 5, 6, 38.
- Access: kaggle datasets download -d <dataset-slug>
OpenML
- URL: https://www.openml.org
- Description: A platform for sharing machine learning datasets, tasks, and experiments. Provides standardized interfaces for benchmarking.
- License: Varies per dataset.
- Chapters: 3, 6.
HuggingFace Datasets Hub
- URL: https://huggingface.co/datasets
- Description: Centralized repository hosting 100,000+ datasets with a unified Python API. Supports streaming for large datasets.
- Access: datasets.load_dataset("dataset_name")
- Chapters: Used throughout the book.
D.5 APIs and Services for Data Access
D.5.1 Cloud Data APIs
Google BigQuery Public Datasets
- Description: Petabytes of publicly available datasets queryable via SQL. Includes GitHub activity data, Wikipedia, US Census data, and more.
- Pricing: 1TB of free queries per month.
- Chapters: 28, 38.
AWS Open Data
- URL: https://registry.opendata.aws
- Description: Hundreds of datasets available directly on S3. Includes satellite imagery, genomics data, and Common Crawl.
- Chapters: 28.
Kaggle API
- Description: Programmatic access to download competition data, public datasets, and submit predictions.
- Usage: pip install kaggle && kaggle datasets download -d <slug>
D.5.2 LLM APIs for Synthetic Data Generation
OpenAI API
- Description: Access to GPT-4 and related models. Commonly used for generating synthetic training data, labels, and evaluations.
- Chapters: 15, 17, 30.
Anthropic API
- Description: Access to Claude models. Used for generating high-quality synthetic data, especially for complex reasoning tasks.
- Chapters: 15, 17, 30.
HuggingFace Inference API
- Description: Free-tier access to thousands of hosted models for quick experimentation.
- Chapters: 30.
D.5.3 Web Scraping and Data Collection
Scrapy
- Description: Python framework for building web scrapers. Useful for collecting domain-specific text data.
- Chapters: 28.
Selenium / Playwright
- Description: Browser automation tools for scraping JavaScript-rendered pages.
- Chapters: 28.
Common Crawl Index
- Description: Query interface for searching Common Crawl archives by URL pattern or domain without downloading the full archive.
- Chapters: 28.
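The index is exposed as a CDX-style HTTP API, one endpoint per crawl. The sketch below only builds a query URL; the crawl label is an example (pick a current crawl id from the index listing), and fetching the URL returns one JSON record per line.

```python
from urllib.parse import urlencode

def cc_index_query(domain, crawl="CC-MAIN-2024-10"):
    """Build a query URL for the Common Crawl CDX-style index API.
    The crawl label is an example id, not a guaranteed-current one."""
    base = f"https://index.commoncrawl.org/{crawl}-index"
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{base}?{params}"
```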
D.6 Synthetic Data Generation Tools
Synthetic data generation is increasingly important for AI engineering, especially for addressing data scarcity, privacy constraints, and generating evaluation datasets.
D.6.1 Text Data Generation
Faker
- Description: Python library for generating realistic fake data (names, addresses, text, dates, etc.). Useful for generating test fixtures and tabular data.
- Usage: pip install faker
- Chapters: 28, 31.
LLM-Based Generation
- Description: Using large language models to generate training data, synthetic conversations, and evaluation sets. This is now the dominant approach for creating instruction-following datasets.
- Key technique: Provide few-shot examples in a prompt, then sample diverse completions with temperature > 0.
- Chapters: 16, 17, 28.
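A minimal sketch of the few-shot technique, assuming a simple Input/Output template (the template wording is a choice, not a standard): the assembled prompt would be sent to an LLM API several times with temperature > 0 to collect diverse synthetic examples.

```python
def build_generation_prompt(task, fewshot, new_input):
    """Assemble a few-shot prompt for synthetic data generation.
    `fewshot` is a list of {"input": ..., "output": ...} dicts."""
    parts = [task]
    for ex in fewshot:
        parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    # Leave the final Output blank for the model to complete.
    parts.append(f"Input: {new_input}\nOutput:")
    return "\n\n".join(parts)
```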
Self-Instruct / Evol-Instruct
- Description: Methods for using a strong LLM to generate instruction-response pairs, optionally evolving them to increase complexity. Used to create datasets like Alpaca and WizardLM.
- Chapters: 16, 17.
D.6.2 Image and Multimodal Generation
Stable Diffusion / SDXL
- Description: Open-source text-to-image models that can generate synthetic training images. Useful for data augmentation in computer vision.
- Chapters: 22, 23.
Albumentations
- Description: Fast image augmentation library. Provides geometric and photometric transformations for creating augmented training samples.
- Usage: pip install albumentations
- Chapters: 8, 22.
D.6.3 Tabular Data Generation
CTGAN (Conditional Tabular GAN)
- Description: GAN-based approach for generating synthetic tabular data that preserves statistical properties of the original dataset.
- Usage: pip install ctgan
- Chapters: 6, 14.
SDV (Synthetic Data Vault)
- Description: Comprehensive library for generating synthetic relational, tabular, and time-series data using statistical and deep learning models.
- Usage: pip install sdv
- Chapters: 6, 28.
D.6.4 Privacy-Preserving Synthetic Data
Differential Privacy: Synthetic data generators that provide formal privacy guarantees by adding calibrated noise during the generation process.
Key libraries:
- diffprivlib (IBM): Differentially private machine learning.
- opacus (Meta): Differentially private training for PyTorch models.
- Chapters: 28, 37.
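The core idea behind these libraries can be shown in a few lines: release a statistic plus Laplace noise scaled to sensitivity/epsilon. This is the textbook Laplace mechanism, sketched here for illustration; it is not the API of diffprivlib or opacus.

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Release a noisy statistic satisfying epsilon-differential privacy
    via the Laplace mechanism. Smaller epsilon means more noise."""
    scale = sensitivity / epsilon
    # Sample Laplace(0, scale) by inverse-CDF from a uniform draw.
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_value + noise
```

For a counting query, sensitivity is 1 (one person changes the count by at most one), so releasing `laplace_mechanism(count, 1.0, epsilon)` bounds what any single record reveals.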
D.7 Dataset Loading Cheat Sheet
The following code snippets demonstrate how to quickly load datasets from the most common sources used in this book.
from datasets import load_dataset
# HuggingFace Hub
dataset = load_dataset("glue", "sst2") # GLUE benchmark
dataset = load_dataset("squad") # SQuAD
dataset = load_dataset("wikitext", "wikitext-103-v1") # WikiText-103
dataset = load_dataset("json", data_files="data.jsonl") # Local JSONL file
dataset = load_dataset("csv", data_files="data.csv") # Local CSV file
# Streaming (for very large datasets)
dataset = load_dataset("c4", "en", streaming=True)
for example in dataset["train"]:
    process(example)
    break  # process one at a time without downloading everything
# Torchvision
import torchvision
cifar10 = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
imagenet = torchvision.datasets.ImageNet(root="/data/imagenet", split="train")
# Torchaudio
import torchaudio
librispeech = torchaudio.datasets.LIBRISPEECH(root="./data", url="train-clean-100")
# Scikit-learn built-in
from sklearn.datasets import load_iris, fetch_california_housing
iris = load_iris()
housing = fetch_california_housing()
# Kaggle (CLI)
# kaggle competitions download -c titanic
# kaggle datasets download -d zillow/zecon
D.8 Dataset Best Practices
- Always check the license before using a dataset. Some datasets (e.g., ImageNet) have restrictions on commercial use.
- Document your data sources in your project's README and model card. This is essential for reproducibility and responsible AI.
- Inspect data quality before training. Use tools like cleanlab for label error detection and pandas profiling for exploratory data analysis.
- Version your datasets using tools like DVC (Data Version Control) or HuggingFace dataset versioning.
- Consider data contamination: If evaluating on a public benchmark, verify that your training data does not contain test set examples. This is an increasing concern with web-scraped pre-training corpora.
- Respect privacy: De-identify personal information before using real-world data for training. Consider using synthetic data when working with sensitive domains.
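The contamination check recommended above is commonly approximated with word-level n-gram overlap between training documents and benchmark examples. A 13-gram window is a heuristic reported in some LLM evaluation write-ups; the exact n is a tuning choice, and this sketch ignores tokenization details.

```python
def ngram_set(text, n=13):
    """All word-level n-grams in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc, test_example, n=13):
    """Flag a training document sharing any n-gram with a test example.
    Smaller n is stricter; larger n is more permissive."""
    return bool(ngram_set(train_doc, n) & ngram_set(test_example, n))
```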