RedPajama
Software / App
A large open-source dataset for language-model training, developed by Together. A cleaned variant, SlimPajama, is also available.
Mentioned in 3 videos
Videos Mentioning RedPajama

Let's reproduce GPT-2 (124M)
Andrej Karpathy
Mentioned as an example of an available cleaned dataset (RedPajama, specifically its SlimPajama variant) suitable for language-model training.

RWKV: Reinventing RNNs for the Transformer Era
Latent Space
A large open-source dataset, mentioned as a candidate for future RWKV training runs intended to compete with models such as Falcon on English use cases.

FlashAttention-2: Making Transformers 800% faster AND exact
Latent Space
A dataset developed by Together, mentioned in the context of Tri Dao's work.