C4 dataset

C4 (Colossal Clean Crawled Corpus) is a large, cleaned text dataset derived from Common Crawl web scrapes. It is often used for pre-training large language models (LLMs).