Glossary

Common Crawl

- URL: https://commoncrawl.org
- Description: Petabytes of raw web data collected monthly since 2008. The basis for many pre-training datasets.
- Size: Petabytes (raw); filtered subsets vary.
- License: Open; content licensing varies per page.
- Chapters: 14, 15.

Related Terms