- Description: An open reproduction of the LLaMA training data, consisting of approximately 1.2 trillion tokens from Common Crawl, C4, GitHub, books, arXiv, Wikipedia, and StackExchange.
- License: Apache 2.0.
- Chapters: 14, 15.
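
Because the corpus is on the order of 1.2 trillion tokens, streaming access is usually preferable to a full download. Below is a minimal sketch of streaming such a corpus with the Hugging Face `datasets` library; the hub identifier and the `text` field name are illustrative assumptions, not the dataset's confirmed ID or schema.

```python
from datasets import load_dataset

# streaming=True iterates over records without downloading
# the full multi-terabyte corpus to disk first.
ds = load_dataset(
    "org/open-llama-reproduction-1t",  # hypothetical hub ID; substitute the real one
    split="train",
    streaming=True,
)

# Peek at a few records; a "text" field is assumed here,
# as is common for raw-text pretraining corpora.
for i, example in enumerate(ds):
    print(example["text"][:200])
    if i >= 2:
        break
```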