Common Crawl / C4
Software / App
Mentioned in 1 video
Large web-crawl data sources often used in data mixtures for LLM pretraining; discussed in the training-data section.