FineWeb
Concept
Dataset curated by Hugging Face used as an example pretraining corpus (filtered, ~44 TB).
Mentioned in 3 videos
Videos Mentioning FineWeb

Deep Dive into LLMs like ChatGPT
Andrej Karpathy
Dataset curated by Hugging Face used as an example pretraining corpus (filtered, ~44 TB).
Best of 2024: Synthetic Data / Smol Models, Loubna Ben Allal, HuggingFace [LS Live! @ NeurIPS 2024]
Latent Space
A large dataset (15 trillion tokens) used as a source for creating more specialized datasets like FineWeb-Edu.

⚡ Open Model Pretraining Masterclass — Elie Bakouch, HuggingFace SmolLM 3, FineWeb, FinePDF
Latent Space