C4 dataset

Concept

Colossal Clean Crawled Corpus (C4): a large, cleaned and filtered text dataset derived from Common Crawl web scrapes, often used for pre-training large language models (LLMs).
