Common Crawl
Concept
Mentioned in 5 videos
Dataset used for broad internet-scale pretraining of language models
Videos Mentioning Common Crawl

Is RL a dead end? – Dario Amodei
Dwarkesh Clips
Dataset used for broad internet-scale pretraining of language models

E165: Vision Pro: use or lose? Meta vs Snap, SaaS recovery, AI investing, rolling real estate crisis
All-In Podcast
An open-source web crawling dataset used for training AI models like GPT-3, highlighted as significantly smaller than YouTube's data repository.
Best of 2024: Synthetic Data / Smol Models, Loubna Ben Allal, HuggingFace [LS Live! @ NeurIPS 2024]
Latent Space
A large dataset used for training LLMs, with analysis showing an increase in AI-generated content over different dumps.
Best of 2024: Open Models [LS LIVE! at NeurIPS 2024]
Latent Space
A publicly available scrape of a subset of the internet used for training language models. A study analyzing its snapshots revealed diminishing accessibility to web content.

Aravind Srinivas: Perplexity CEO on Future of AI, Search & the Internet | Lex Fridman Podcast #434
Lex Fridman