Common Crawl / C4

Software / App

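Large web-crawl data sources often used in data mixtures for LLM pretraining. Common Crawl is a public archive of raw web-crawl data; C4 (Colossal Clean Crawled Corpus) is a filtered, cleaned dataset derived from it. Discussed in the training-data section.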

Mentioned in 1 video