
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 13: Data (Sources, Datasets)

Stanford Online
Education | 7 min read | 83 min video
May 19, 2026
TL;DR

Most LLM training data is copyrighted, and while fair use is a defense, its boundaries are blurry, leading to a complex legal landscape and secretive data practices by companies.

Key Insights

1. The LLaMA 3 paper is transparent about architecture and training procedures but says only that its data comes from 'a variety of data sources,' highlighting the competitive and legal secrecy around data.

2. Technical and legal restrictions on web crawling have increased significantly, with the fraction of websites having full restrictions growing to nearly 50% by mid-2023, up from a negligible percentage in 2016.

3. While courts are increasingly deeming training 'fair use' in select cases, pirating copyrighted books for training data is not excused, as seen in the Anthropic lawsuit settlement.

4. Common Crawl offers approximately 3-5 billion web pages per monthly crawl, for about 300 billion pages to date, providing a massive, albeit uncurated, source of text data.

5. The C4 dataset filtered Common Crawl using rules like requiring punctuation, having more than five words, and avoiding boilerplate text, yielding 156 billion tokens (800 GB).

6. The Pile dataset is a diverse, grassroots effort including Common Crawl, PubMed, Books 3 (from a shadow library, now taken down), arXiv, GitHub, Wikipedia, and Enron emails.

Data is the most important, yet most secretive, aspect of language models

While model architecture and training procedures are often transparent, especially for open-weight models like LLaMA 3, the exact data used for pre-training remains highly guarded. Companies cite competitive advantage and copyright liability as key reasons for this secrecy. Data has historically been a bottleneck in machine learning and continues to be so, particularly for foundation models that aim for broad capabilities. This is partly due to the 'long-tail problem': data efforts can be scaled up far more easily than human effort in areas like architecture or systems design. Data plays a role across multiple stages: pre-training on raw web data, mid-training on higher-quality data for specific capabilities (like long context), and post-training for task-specific fine-tuning (e.g., chat logs, reinforcement learning environments). Across these stages, the trend is a shift from large amounts of low-quality data to smaller, more curated, high-quality datasets.

Navigating the web's technical and legal restrictions on data access

The notion that language models are trained on 'the entire internet' is an oversimplification. Much of the web is dynamic and can only be reached by interacting with applications, not by simple crawling. A significant portion of web content is also locked behind 'walled gardens' that require authentication, such as social media platforms or subscription services. Even for accessible content, technical obstacles are prevalent: `robots.txt` files, IP and country blocking, and anti-bot measures (e.g., CAPTCHAs) from services like Cloudflare. Legal restrictions are increasingly significant as well, with websites often including terms of service that explicitly prohibit AI training. A study led by Shayne Longpre documented a dramatic increase in these restrictions, with the fraction of websites imposing full `robots.txt` restrictions growing to nearly 50% by mid-2023, and most terms of service now forbidding AI training.
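To make the `robots.txt` mechanism concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user agent, and the `example.com` URLs are made up for illustration; the sketch only shows how a well-behaved crawler would consult such a file before fetching a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is fully blocked,
# every other crawler is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article"))      # False
print(allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/article"))  # True
print(allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/x"))  # False
```

Note that `robots.txt` is advisory: it expresses the site's wishes, and compliance is up to the crawler.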

The evolving landscape of copyright and fair use for AI training

Copyright law, designed to incentivize creation, protects original works fixed in a tangible medium. While it doesn't cover ideas, it applies to expressions with a very low threshold for protection. Copyright lasts for decades (the lecture cites 75 years; under current US law the usual term is the author's life plus 70 years), after which works enter the public domain. Crucially, most content accessible online is copyrighted. To use copyrighted material, one typically needs a license (like Creative Commons or a paid agreement) or must appeal to the 'fair use' doctrine. Fair use considers four factors: the purpose and character of the use (transformative, educational uses are favored), the nature of the copyrighted work (factual works are less protected than fictional ones), the amount and substantiality of the portion used (snippets are favored), and the effect on the market for the original work. While mere copying can be a violation, training a model is often argued to be transformative.

Recent legal battles and precedents in AI training data

High-profile lawsuits are shaping the interpretation of fair use in AI training. The New York Times sued OpenAI, alleging verbatim generation of news articles by ChatGPT. A significant case against Anthropic involved allegations of pirating millions of books; while the court deemed the scanning of purchased books fair use, the pirating itself was not excused, leading to a $1.5 billion settlement. Similarly, Meta faced a lawsuit for training on books, where training itself was deemed fair use, but 'torrenting' books remained under litigation. These rulings suggest that while the act of training might be considered fair use, acquiring data through piracy is still illegal, and the impact on the market remains a critical factor. The legal landscape is highly active and evolving.

Key data sources: Common Crawl and specialized corpora

While many large model developers maintain proprietary crawlers, Common Crawl is a widely available resource providing monthly web crawls of billions of pages (300 billion to date). However, raw Common Crawl data requires extensive processing, including HTML-to-text conversion using tools like Trafilatura, and filtering. Specialized, higher-quality sources are also critical. Wikipedia, with its extensive articles and notability requirements, offers a curated dump. GitHub provides code repositories, valuable for both coding and reasoning capabilities, with permissive licenses allowing training. Project Gutenberg offers public domain books, while arXiv provides academic papers often under Creative Commons licenses. These structured sources facilitate easier acquisition through data dumps rather than live crawling.
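The HTML-to-text step can be illustrated with a deliberately crude sketch built on Python's standard `html.parser`. This is not how Trafilatura works; it is a stand-in that only drops a hand-picked set of non-content tags (the `SKIP_TAGS` set is an assumption) and keeps the remaining text, to show why extraction involves choices that affect data quality.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside SKIP_TAGS."""

    # Assumed list of tags that rarely hold main content.
    SKIP_TAGS = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav>Home | About</nav>"
    "<p>Common Crawl stores raw pages.</p>"
    "<script>var x = 1;</script>"
    "<p>Text must be extracted.</p>"
    "</body></html>"
)
print(html_to_text(page))
```

A real extractor also has to handle inline markup, tables, comment sections, and templated sidebars, which is where dedicated tools earn their keep.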

Evolution of data filtering and curation techniques

Early models like BERT used Wikipedia and a 'Books' corpus scraped from Smashwords (later taken down). GPT-2 utilized a filtered subset of web pages linked from Reddit posts with high karma. CCNet (developed by Facebook) used a language model trained on Wikipedia to score document quality. Google's C4 dataset applied a series of heuristic rules to filter Common Crawl, removing non-English text, boilerplate, and pages with insufficient sentences, resulting in 156 billion tokens. GPT-3 used Common Crawl processed with a quality classifier, along with WebText, books corpora, and Wikipedia, reaching 400 billion tokens. The Pile dataset aggregated diverse sources, including Books 3 (from a shadow library, now removed), reflecting a move towards more comprehensive, yet curated, datasets.
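The rule-based filtering style used for C4 can be sketched as follows. This is loosely modeled on the rules described above, not a reproduction of the actual C4 pipeline: the thresholds (`min_words`, `min_sentences`) and the `BOILERPLATE_MARKERS` list are illustrative assumptions.

```python
import re

TERMINAL_PUNCTUATION = (".", "!", "?", '"')
# Assumed markers of boilerplate lines; a real pipeline would use a longer list.
BOILERPLATE_MARKERS = ("lorem ipsum", "javascript", "cookie policy")


def clean_page(text, min_words=5, min_sentences=3):
    """Keep sentence-like lines; return None if too little text survives."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # menus, headings, and buttons rarely end in punctuation
        if len(line.split()) < min_words:
            continue  # very short lines are usually navigation or labels
        if any(marker in line.lower() for marker in BOILERPLATE_MARKERS):
            continue  # drop obvious boilerplate
        kept.append(line)
    cleaned = "\n".join(kept)
    # Rough sentence count: punctuation runs followed by whitespace or end of text.
    n_sentences = len(re.findall(r"[.!?]+(?:\s|$)", cleaned))
    return cleaned if n_sentences >= min_sentences else None


page = "\n".join([
    "Home",
    "Click here",
    "Common Crawl publishes a new snapshot of the web roughly every month.",
    "The raw pages need heavy filtering before they are useful.",
    "Simple rules keep lines that look like real sentences.",
    "Please enable javascript to view this page.",
])
print(clean_page(page))
```

Rules like these are cheap and transparent, but they also discard legitimate text (code, lists, poetry), which is one reason later datasets combined them with model-based filters.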

Advanced filtering and the rise of synthetic data

More recent efforts like RefinedWeb (5 trillion tokens) and FineWeb (15 trillion tokens) focused on aggressive web filtering, relying on rules rather than model-based classifiers to avoid introducing algorithmic biases. Datasets like Dolma from AI2 incorporated processed Common Crawl, Stack Exchange data, and academic papers, using both rule-based and classifier-based filtering. The DataComp-LM (DCLM) initiative aimed to standardize data processing and analysis, releasing an unfiltered pool (240 trillion tokens) and a filtered subset (1.4% of the pool) selected by a model-based quality classifier trained on instruction-style data like OpenHermes and ELI5. Nemotron from NVIDIA leaned heavily into synthetic data, using language models to rephrase low-quality data or generate tasks from high-quality data, resulting in a 6 trillion token dataset. This highlights a growing trend of synthetic data generation for pre-training, aiming to increase data volume and quality, with models like LLaMA 3 and Qwen 3 trained on 15 and 36 trillion tokens respectively.
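Model-based quality filtering can be sketched as: train a classifier on examples of "good" and "bad" text, score every document, and keep the top fraction. The sketch below swaps in a tiny smoothed unigram log-odds scorer written with the standard library; it is not the classifier used by any of the datasets above, and the training strings and the 50% cutoff are made-up toy values.

```python
import math
from collections import Counter


def train(positive_docs, negative_docs):
    """Return per-word log-odds weights with add-one smoothing."""
    pos = Counter(w for d in positive_docs for w in d.lower().split())
    neg = Counter(w for d in negative_docs for w in d.lower().split())
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def score(weights, doc):
    """Average log-odds per word; unknown words contribute 0."""
    words = doc.lower().split()
    if not words:
        return 0.0
    return sum(weights.get(w, 0.0) for w in words) / len(words)


def keep_top_fraction(weights, docs, fraction):
    """Rank documents by score and keep the best `fraction` of them."""
    ranked = sorted(docs, key=lambda d: score(weights, d), reverse=True)
    k = max(1, int(len(docs) * fraction))
    return ranked[:k]


# Toy training data (hypothetical): instruction-like text vs. spam-like text.
positives = [
    "explain how gradient descent updates the weights",
    "the answer is that photosynthesis converts light into energy",
]
negatives = [
    "buy cheap pills now click here",
    "free free free click now win prizes",
]
weights = train(positives, negatives)

pool = [
    "click here to win free prizes now",
    "explain how photosynthesis converts light",
]
print(keep_top_fraction(weights, pool, 0.5))
```

The choice of positive examples defines what "quality" means, so the resulting dataset inherits whatever style those examples have.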

Focus on coding data and permissively licensed datasets

Specialized datasets for code, such as The Stack (v1 and v2), have been developed by curating permissively licensed GitHub repositories, the Software Heritage archive, and documentation. The Stack v2 includes metadata like issues and pull requests, along with complex linearization strategies to represent the software development process. It also innovates by compiling code into an intermediate representation (LLVM) to help models learn low-resource languages by mapping them to higher-resource ones. Common Pile represents an attempt to build a high-quality dataset exclusively from permissively licensed data sources, including The Stack v2, government proceedings, wikis, and academic papers, totaling 8 terabytes. However, challenges like license laundering and the distinction between collection licenses and individual work licenses remain significant hurdles. Even when competently curated, models trained on purely permissively licensed datasets may still struggle to match the performance of models trained on a broader, albeit legally complex, data mix.
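The license-based curation step can be sketched as a simple allowlist filter. The record layout (`name`, `license`) and the specific allowlist are assumptions for illustration, not the criteria used by The Stack or Common Pile.

```python
# Assumed allowlist of permissive license identifiers (SPDX-style, lowercased).
PERMISSIVE_LICENSES = {
    "mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc", "unlicense",
}


def is_permissive(repo: dict) -> bool:
    """True only if the repo declares a license that is on the allowlist."""
    license_id = repo.get("license")
    return license_id is not None and license_id.lower() in PERMISSIVE_LICENSES


def filter_repos(repos):
    return [repo for repo in repos if is_permissive(repo)]


# Hypothetical repository metadata.
repos = [
    {"name": "alpha", "license": "MIT"},
    {"name": "beta", "license": "GPL-3.0"},
    {"name": "gamma", "license": None},  # no declared license: excluded
    {"name": "delta", "license": "Apache-2.0"},
]
print([repo["name"] for repo in filter_repos(repos)])  # ['alpha', 'delta']
```

A filter like this trusts the declared license. It cannot detect license laundering, where copyrighted material is re-uploaded under a permissive label, which is exactly the hurdle noted above.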

Common Questions

Data is crucial because it directly influences the model's capabilities and knowledge. Companies often keep their data sources secret, highlighting its role as a competitive advantage and a factor in avoiding legal issues like copyright liability.

Topics

Mentioned in this video

Software & Apps
Llama 3

Mentioned as an example of a model whose developers do not disclose its training data, despite transparency in architecture and training procedures.

Qwen 3.5 397B

Cited as an example of a large model where intermediate checkpoints are no longer tracked, indicating a shift in model development practices.

libgen

Mentioned as an example of a shadow library that disregards copyright and bypasses paywalls to make books and articles available for free.

Anna's Archive

Mentioned as an example of a shadow library that disregards copyright and bypasses paywalls to make books and articles available for free.

OpenCourseware

Cited as an example of content distributed under Creative Commons licenses, making it freely usable.

Google Books

A service allowing users to see snippets of books, which was the subject of a lawsuit regarding fair use and copyright.

ChatGPT

Allegedly trained on New York Times articles, leading to a lawsuit.

Claude

An LLM developed by Anthropic, allegedly trained on pirated books, leading to legal action.

Common Crawl

A large, publicly available dataset of web crawl data, used by many researchers and companies for training language models.

WARC

A file format used to store web crawl data, representing the raw HTTP response.

WET

A text-only version of the WARC files produced by Common Crawl, which can be a lossy representation of the data.

Trafilatura

A tool for converting HTML to text, mentioned as being better than Common Crawl's pre-extracted text files for processing web content.

Resiliparser

A tool for converting HTML to text, mentioned as being better than Common Crawl's pre-extracted text files for processing web content.

arXiv

A repository for research papers, particularly in physics, a valuable source for academic content.

PushShift

A project that provided Reddit data dumps before access became restricted.

Semantic Scholar

AI2's own crawl of academic papers, used to derive a dataset for language model training.

OpenHermes

Instruction data generated by GPT-4, used as positive examples for training the quality classifier in DCLM.

GPT-4

Used to generate instruction data (OpenHermes) for training the quality classifier in DCLM.

GPT-2

The model that sought high-quality web content by filtering pages linked from Reddit posts with high karma, using 40GB of text.

Qwen 3

Trained on 36 trillion tokens, referenced for scale comparison with Nemotron and other datasets.

ELI5

A subreddit with questions and answers, used as a source of positive examples in the DCLM quality classifier training.

T5

A model famous for pushing the text-to-text framework, associated with the C4 dataset.

RefinedWeb

A dataset built on the premise that web data alone is sufficient for training. Its loosely filtered version was used as negative examples in DCLM.

Llama 1

The dataset processing for this model was described in detail, covering Common Crawl (CCNet), C4, GitHub, Wikipedia, Books 3, Project Gutenberg, arXiv, and Stack Exchange.

Python

A common programming language mentioned in the context of the Stack V2 dataset, noting its prevalence compared to lower-resource languages.

CCNet

Developed by Facebook, used for creating high-quality datasets, particularly for low-resource languages, by employing deduplication and language identification, and using a language model to score document quality based on Wikipedia-likeness.

LLVM

A low-level intermediate language used in the Stack V2 dataset to enable language models to learn mappings between low-resource and high-resource programming languages.

BERT

An earlier language model trained on Wikipedia and books, noting its use of document-level sequences rather than sentences.

GPT-3

Its dataset included Common Crawl processing, expanded web text, internet-based books corpora, and Wikipedia, totaling 500GB of text.

Nemotron

NVIDIA's dataset that used a prompt-based model to score educational value and incorporated synthetic data, resulting in 6 trillion tokens.

Llama

Mentioned as a benchmark model for comparison with Common Pile, which was trained on permissively licensed data.

Bibliotik

A shadow library from which the Books 3 dataset was sourced.

Stack V2

An updated version of The Stack dataset focusing on code, including repository data, metadata, and documentation, with LLVM intermediate representation for low-resource languages.

Qwen

Mentioned as a benchmark model that significantly outperforms models trained solely on permissively licensed data.

Chinchilla

A model mentioned as having subsumed the work on Gopher.

MassiveWeb

A web dataset created by DeepMind and used to train Gopher together with C4, books, news, GitHub, and Wikipedia, with undisclosed data sources for some components.

Dolma

An AI2 dataset that includes processed Common Crawl, Stack Exchange, C4, and other sources, utilizing model-based filtering for quality.

Nim

A low-resource programming language mentioned in the context of the Stack V2 dataset and its use of LLVM intermediate representation.

C4

A dataset from Google, known for its use in the T5 model. It filtered Common Crawl data using defined rules to improve quality, resulting in 156 billion tokens.

The Pile

A diverse dataset created by EleutherAI from various sources including Common Crawl, PubMed, arXiv, GitHub, and books, aiming for open-source accessibility.

Gopher

A model developed by DeepMind, whose data processing methods were described in a paper, contributing to understanding data curation.

Red Pajama V1

A reproduction of the LLaMA 1 dataset, initially including Books 3 but later stripped out due to copyright concerns.

Common Pile

A project that aimed to create a model using only permissively licensed data, scouring the internet for sources like Stack V2, government proceedings, and wikis, resulting in 8TB of data.

Companies
Facebook

Mentioned as an example of a platform whose content is locked behind a 'walled garden' and inaccessible to external crawlers for training purposes.

LinkedIn

Mentioned as an example of a platform whose content is locked behind a 'walled garden' and inaccessible to external crawlers for training purposes.

Cloudflare

A service used by websites to detect and block bot activity, often by presenting CAPTCHAs or blocking IP addresses.

Khan Academy

Cited as an example of content distributed under Creative Commons licenses, making it freely usable.

OpenAI

Sued by The New York Times for allegedly training ChatGPT on their news articles.

Anthropic

Faced a lawsuit for allegedly pirating millions of books for training Claude. The court ruled training was fair use, but pirating books was illegal, leading to a $1.5 billion settlement.

Meta

Sued for allegedly training on copyrighted books; training was deemed fair use, but book torrenting is still under review.

ElevenLabs

A company mentioned in an audience question about voice data, a topic on which the speaker had limited knowledge.

GitHub

A platform for hosting code repositories; a valuable source for code and reasoning capabilities in language models.

DeepMind

Released a paper describing the Gopher model and its data processing steps, though parts of the data remained undisclosed.

NVIDIA

Developer of the Nemotron dataset, which proposed a more elaborate filtering method than DCLM and incorporated synthetic data.

Stack Exchange

A platform with user-contributed Q&A, used as a dataset for its structured, application-like data, aiding in question-answering capabilities.

Hugging Face

Platform where the FineWeb dataset, a replication and improvement of RefinedWeb, was made available.

Smashwords

A platform for self-published ebooks, from which a 'books' corpus was scraped for the BERT model. The scraping violated Smashwords' terms of service, and the corpus was later taken down.
