C4
Software / App
A dataset from which samples were rewritten in the Pratus paper to improve format and quality.
Mentioned in 3 videos
Videos Mentioning C4
Best of 2024: Synthetic Data / Smol Models, Loubna Ben Allal, HuggingFace [LS Live! @ NeurIPS 2024]
Latent Space
A dataset from which samples were rewritten in the Pratus paper to improve format and quality.

Building an open AI company - with Ce and Vipul of Together AI
Latent Space
A large dataset from Google, mentioned as an inspiration for the RedPajama dataset.

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 13: Data (Sources, Datasets)
Stanford Online
A dataset from Google, best known as the pretraining corpus for the T5 model. It was built by filtering Common Crawl with a fixed set of heuristic rules to improve quality, yielding about 156 billion tokens.
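
As a rough illustration of what rule-based filtering of Common Crawl text can look like, here is a minimal Python sketch. It is not C4's actual pipeline: the function name `filter_page`, the thresholds, and the blocked substrings are all assumptions chosen for the example, loosely modeled on the kind of heuristics publicly described for C4 (keep lines that end in terminal punctuation, drop very short lines, drop pages with too few sentences or obvious code and placeholder text).

```python
import re

# Illustrative values only; assumptions for this sketch, not taken from the videos.
MIN_WORDS_PER_LINE = 5
MIN_SENTENCES_PER_PAGE = 3
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
BLOCKED_SUBSTRINGS = ("lorem ipsum", "{")


def filter_page(text):
    """Return the cleaned page text, or None if the whole page is rejected."""
    lowered = text.lower()
    # Page-level rule: reject placeholder text and pages that look like code.
    if any(s in lowered for s in BLOCKED_SUBSTRINGS):
        return None

    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Line-level rule: keep only lines ending in terminal punctuation.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        # Line-level rule: drop very short lines (menus, buttons, captions).
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        kept.append(line)

    cleaned = "\n".join(kept)
    # Crude sentence count: each run of ., ! or ? counts as one sentence end.
    num_sentences = len(re.findall(r"[.!?]+", cleaned))
    if num_sentences < MIN_SENTENCES_PER_PAGE:
        return None
    return cleaned
```

Calling `filter_page` on a page made of a navigation bar followed by three full sentences would return only the three sentences, while a page with a single sentence or a stray `{` would be dropped entirely.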