Common Crawl

Concept

Dataset used for broad internet-scale pretraining of language models

Mentioned in 9 videos

Videos Mentioning Common Crawl

Is RL a dead end? – Dario Amodei

Dwarkesh Clips

Dataset used for broad internet-scale pretraining of language models

E143: Nvidia smashes earnings, Arm walks the plank, M&A market, Vivek dominates GOP debate & more

All-In Podcast

An open-source project that provides web crawl data to researchers and developers; many have used it to train AI models, including GPT-3. It was founded and largely funded by Gil Elbaz.

E165: Vision Pro: use or lose? Meta vs Snap, SaaS recovery, AI investing, rolling real estate crisis

All-In Podcast

An open-source web crawl dataset used for training AI models like GPT-3, highlighted as significantly smaller than YouTube's data repository.

Best of 2024: Synthetic Data / Smol Models, Loubna Ben Allal, HuggingFace [LS Live! @ NeurIPS 2024]

Latent Space

A large dataset used for training LLMs, with analysis showing an increase in AI-generated content across successive dumps.

Best of 2024: Open Models [LS LIVE! at NeurIPS 2024]

Latent Space

A publicly available scrape of a subset of the internet used for training language models. A study analyzing its snapshots revealed diminishing accessibility to web content.

A Comprehensive Overview of Large Language Models - Latent Space Paper Club

Latent Space

A non-profit organization that crawls and archives the web, providing vast datasets for research.

The Four Wars of the AI Stack - Dec 2023 Recap

Latent Space

A source of data for AI training, with concerns about model-generated content.

Beating GPT-4 with Open Source Models - with Michael Royzen of Phind

Latent Space

A massive, open repository of web crawl data, used by Michael Royzen to build an extensive search index for ELI5-based models.

Aravind Srinivas: Perplexity CEO on Future of AI, Search & the Internet | Lex Fridman Podcast #434

Lex Fridman