FineWeb
Concept
Dataset curated by Hugging Face, used as an example pretraining corpus (filtered, ~44 TB).
Mentioned in 3 videos
Videos Mentioning FineWeb

Deep Dive into LLMs like ChatGPT
Andrej Karpathy
Dataset curated by Hugging Face, used as an example pretraining corpus (filtered, ~44 TB).
Best of 2024: Synthetic Data / Smol Models, Loubna Ben Allal, HuggingFace [LS Live! @ NeurIPS 2024]
Latent Space
A large dataset (15 trillion tokens) used as a source for creating more specialized datasets like FineWeb-Edu.

⚡ Open Model Pretraining Masterclass — Elie Bakouch, HuggingFace SmolLM 3, FineWeb, FinePDF
Latent Space