FineWeb

Concept

Web-text dataset curated by Hugging Face, used as an example pretraining corpus (about 44 TB after filtering).

Mentioned in 3 videos