Common Crawl
Software / App
A large dataset of web-crawled text, containing billions of tokens, used to train models such as GloVe.
Mentioned in 3 videos
Videos Mentioning Common Crawl

Deep Learning for Natural Language Processing (Richard Socher, Salesforce)
Lex Fridman
A large dataset of web-crawled text, containing billions of tokens, used to train models such as GloVe.

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 13: Data (Sources, Datasets)
Stanford Online
A large, publicly available dataset of web crawl data, used by many researchers and companies for training language models.

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 14: Data
Stanford Online
A publicly available archive of web crawl data, used as a primary source for training large language models. It contains raw HTML and other web content, requiring significant processing.
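
As a rough illustration of the kind of processing raw HTML needs before it can serve as training text, here is a minimal sketch that strips markup and drops script and style contents using only Python's standard library. The names `TextExtractor` and `html_to_text` are hypothetical, chosen for this example; this is not the pipeline described in the lecture, and real pipelines typically add language identification, deduplication, and quality filtering on top.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(raw_html: str) -> str:
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{}</style><script>var x=1;</script></head>"
        "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    )
    print(html_to_text(sample))
```

Joining fragments with single spaces discards paragraph and heading boundaries, which is acceptable for a sketch but loses structure a real extractor would usually preserve.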