Dataset curated by Hugging Face, used as an example pretraining corpus (filtered, ~44 TB).
Andrej Karpathy
Latent Space