What are the main challenges in collecting data from the public web?

Collecting web data faces technical hurdles like dynamic content and authentication requirements, as well as legal restrictions like robots.txt and terms of service. Furthermore, the increasing use of bot detection mechanisms and varying legal interpretations of crawling complicate the process.

How does copyright law apply to training data for language models?

Copyright law protects original works, and most content on the internet is copyrighted. Copyrighted works can still be used under a license or by appealing to the fair use doctrine. The interpretation of fair use for AI training is still evolving and subject to legal scrutiny.

What are the four factors of fair use?

The four factors are the purpose and character of the use (e.g., educational vs. commercial, transformative vs. direct copy), the nature of the copyrighted work (factual vs. fictional), the amount and substantiality of the portion used, and the effect of the use on the potential market for the original work.

What is the difference between web crawling and using datasets like Common Crawl?

While web crawling involves actively discovering and downloading pages, datasets like Common Crawl provide pre-compiled archives of web crawls. Common Crawl makes billions of web pages available, saving the effort of building and running a custom crawler.

Why are specialized datasets like Wikipedia and GitHub important?

Specialized datasets offer higher quality and relevance for specific tasks. Wikipedia provides encyclopedic knowledge, while GitHub is crucial for code generation and reasoning capabilities, offering structured data beyond general web crawls.

How has the approach to data filtering evolved over time?

Early methods relied on simple rules (as in C4) or heuristics (such as Reddit links for GPT-2). More recently, sophisticated classifier-based filtering and synthetic data generation (as seen in DCLM and Nemotron) are used to improve data quality and model performance.

What are the legal implications of using Books 3 or similar datasets?

Datasets like Books 3, sourced from shadow libraries and containing copyrighted material without explicit permission, pose significant legal risks. Training on such data can lead to lawsuits and substantial settlements, as seen in cases involving Anthropic and Meta.

What is 'license laundering' in the context of datasets?

License laundering refers to the practice of improperly applying permissive licenses to data that may not actually have them, or applying a dataset-level license to individual works within the dataset that do not hold that license. This makes it difficult to ensure true permissiveness.

Can language models be trained solely on permissively licensed data?

While it's possible to train models on permissively licensed data (e.g., Common Pile project), competing with models trained on vast, less restricted datasets is challenging. It requires significant effort in data sourcing and curation to achieve comparable performance.

What is the role of synthetic data in modern language model training?

Synthetic data, generated by other models, is increasingly used for pre-training. It can be used to rephrase low-quality data so that it reads more like high-quality text (for example, Wikipedia-style prose) or to generate specific tasks like question answering, enhancing model capabilities.

How does the Stack dataset differ from general web crawls for code?

The Stack dataset provides a more curated and structured collection of code, including repositories, metadata like issues and pull requests, and documentation. It also normalizes various programming languages into an LLVM intermediate representation, facilitating better learning.

Key Moments

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 13: Data (Sources, Datasets)

Stanford Online

Education | 7 min read | 83 min video

May 19, 2026

TL;DR

Most LLM training data is copyrighted, and while fair use is a defense, its boundaries are blurry, leading to a complex legal landscape and secretive data practices by companies.

Key Insights

The LLaMA 3 paper is transparent about architecture and training procedures but says only that its data comes from 'a variety of data sources,' highlighting the competitive and legal secrecy around data.

Technical and legal restrictions on web crawling have significantly increased, with the fraction of websites having full restrictions growing to nearly 50% by mid-2023, up from a negligible percentage in 2016.

While training is increasingly being deemed 'fair use' by courts in select cases, pirating copyrighted books for training data is explicitly illegal, as seen in the Anthropic lawsuit settlement.

Common Crawl offers approximately 3-5 billion web pages per monthly crawl, resulting in 300 billion pages to date, providing a massive, albeit uncurated, source of text data.

The C4 dataset filtered Common Crawl using rules like requiring punctuation, having more than five words, and avoiding boilerplate text, yielding 156 billion tokens (800 GB).

The Pile dataset is a diverse, grassroots effort including Common Crawl, PubMed, Books 3 (from a shadow library, now taken down), arXiv, GitHub, Wikipedia, and Enron emails.

Data is the most important, yet most secretive, aspect of language models

While model architecture and training procedures are often transparent, especially for open-weight models like LLaMA 3, the exact data used for pre-training remains highly guarded. Companies cite competitive advantage and copyright liability as key reasons for this secrecy. Data has historically been a bottleneck in machine learning and continues to be so, particularly for foundation models that aim for broad capabilities. This is partly because data is a 'long-tail problem': data work scales readily with added human effort, whereas architecture or systems design does not. Data plays a role across multiple stages: pre-training on raw web data, mid-training on higher-quality data for specific capabilities (like long context), and post-training for task-specific fine-tuning (e.g., chat logs, reinforcement learning environments). Across these stages, the data shifts from large amounts of lower-quality text to smaller, more curated, high-quality datasets.

Navigating the web's technical and legal restrictions on data access

The notion that language models are trained on 'the entire internet' is an oversimplification. Many web pages are dynamic and require interaction with an application rather than a simple fetch. A significant portion of web content is also locked behind 'walled gardens' that require authentication, such as social media platforms or subscription services. Even for accessible content, technical obstacles are prevalent: `robots.txt` files, anti-bot measures (e.g., CAPTCHAs) from services like Cloudflare, and IP or country blocking. Legal restrictions are increasingly significant as well, since websites often include terms of service that explicitly prohibit AI training. A study led by Shayne Longpre highlighted a dramatic increase in these restrictions, with the fraction of websites imposing full `robots.txt` restrictions growing to nearly 50% by mid-2023, and most terms of service now forbidding AI training.
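
As a minimal sketch of how a crawler might honor `robots.txt`, the example below uses Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleAIBot` and `GenericCrawler` user agents, and the URLs are all hypothetical; the point is only to show a site fully restricting one AI-specific agent while leaving most paths open to others.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; real sites publish this at /robots.txt.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The AI-specific agent is fully restricted; other agents only lose /private/.
ai_allowed = parser.can_fetch("ExampleAIBot", "https://example.com/articles/1")
generic_allowed = parser.can_fetch("GenericCrawler", "https://example.com/articles/1")
generic_private = parser.can_fetch("GenericCrawler", "https://example.com/private/x")

print(ai_allowed, generic_allowed, generic_private)
```

Note that `robots.txt` is advisory: this check only tells a well-behaved crawler what the site has asked for, and it says nothing about terms of service or anti-bot measures.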

The evolving landscape of copyright and fair use for AI training

Copyright law, designed to incentivize creation, protects original works fixed in a tangible medium. It does not cover ideas, only their expression, and the threshold for protection is very low. Copyright is long-lived (under current U.S. law, generally the author's life plus 70 years), after which works enter the public domain. Crucially, most content accessible online is copyrighted. To use copyrighted material, one typically needs a license (like Creative Commons or a paid agreement) or must appeal to the 'fair use' doctrine. Fair use considers four factors: the purpose and character of the use (transformative, educational uses are favored), the nature of the copyrighted work (factual works are less protected than fictional), the amount and substantiality of the portion used (snippets are favored), and the effect on the market for the original work. While mere copying can be a violation, training a model is often argued to be transformative.

Recent legal battles and precedents in AI training data

High-profile lawsuits are shaping the interpretation of fair use in AI training. The New York Times sued OpenAI, alleging verbatim generation of news articles by ChatGPT. A significant case against Anthropic involved allegations of pirating millions of books; while the court deemed the scanning of purchased books as fair use, the illegal pirating itself was not excused, leading to a $1.5 billion settlement. Similarly, Meta faced a lawsuit for training on books, where training itself was deemed fair use, but 'torrenting' books remained under litigation. These rulings suggest that while the act of training might be considered fair use, the methods of data acquisition (like piracy) are still illegal, and the impact on the market remains a critical factor. The legal landscape is highly active and evolving.

Key data sources: Common Crawl and specialized corpora

While many large model developers maintain proprietary crawlers, Common Crawl is a widely available resource providing monthly web crawls of billions of pages (300 billion to date). However, raw Common Crawl data requires extensive processing, including HTML-to-text conversion using tools like Trafilatura, and filtering. Specialized, higher-quality sources are also critical. Wikipedia, with its extensive articles and notability requirements, offers a curated dump. GitHub provides code repositories, valuable for both coding and reasoning capabilities, with permissive licenses allowing training. Project Gutenberg offers public domain books, while arXiv provides academic papers often under Creative Commons licenses. These structured sources facilitate easier acquisition through data dumps rather than live crawling.
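
The HTML-to-text step can be illustrated with a deliberately simple extractor built on Python's standard `html.parser`. This is a stand-in for illustration only, not how Trafilatura works: it just drops the contents of a few tags that usually hold non-content markup. The `SKIP_TAGS` set and the sample HTML are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags that usually hold non-content markup."""

    # Illustrative choice of tags to ignore; real extractors use richer heuristics.
    SKIP_TAGS = {"script", "style", "nav", "footer", "header"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.chunks)


SAMPLE_HTML = """
<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><nav>Home | About</nav>
<p>Common Crawl stores raw HTTP responses.</p>
<script>trackUser();</script>
<p>Text must be extracted before filtering.</p>
</body></html>
"""

text = html_to_text(SAMPLE_HTML)
print(text)
```

Even this toy version shows why extraction quality matters: whatever the extractor lets through (menus, scripts, cookie banners) becomes training text unless a later filter removes it.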

Evolution of data filtering and curation techniques

Early models like BERT used Wikipedia and a 'Books' corpus scraped from Smashwords (later taken down). GPT-2 utilized a filtered subset of web pages linked from Reddit posts with high karma. CCNet (developed by Facebook) used a language model trained on Wikipedia to score document quality. Google's C4 dataset applied a series of heuristic rules to filter Common Crawl, removing non-English text, boilerplate, and pages with insufficient sentences, resulting in 156 billion tokens. GPT-3 used Common Crawl processed with a quality classifier, along with WebText, books corpora, and Wikipedia, reaching 400 billion tokens. The Pile dataset aggregated diverse sources, including Books 3 (from a shadow library, now removed), reflecting a move towards more comprehensive, yet curated, datasets.
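
Rule-based filtering of the kind attributed to C4 above can be sketched in a few lines. The thresholds (`min_words`, `min_sentences`), the boilerplate markers, and the sample page are illustrative assumptions, not the exact rules of the original pipeline.

```python
import re

TERMINAL_PUNCTUATION = (".", "!", "?", '"')
# Illustrative markers of boilerplate lines; a real list would be longer.
BOILERPLATE_MARKERS = ("lorem ipsum", "javascript", "cookie policy")


def clean_page(text, min_words=5, min_sentences=3):
    """Return the filtered page text, or None if the page should be dropped."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # keep only lines that end like a sentence
        if len(line.split()) < min_words:
            continue  # drop very short lines (menus, buttons)
        if any(marker in line.lower() for marker in BOILERPLATE_MARKERS):
            continue  # drop lines that look like boilerplate
        kept.append(line)
    cleaned = "\n".join(kept)
    # Rough sentence count: number of sentence-ending punctuation marks.
    if len(re.findall(r"[.!?]", cleaned)) < min_sentences:
        return None
    return cleaned


PAGE = """\
Home
Please enable JavaScript to view this site.
The C4 dataset was built from Common Crawl.
It keeps lines that end in punctuation!
Short line.
Does this page have enough sentences to survive?
"""

result = clean_page(PAGE)
thin_result = clean_page("Home\nContact us\n")
print(result)
print(thin_result)
```

Rules like these are cheap and transparent, but they are blunt: well-written text that happens to lack punctuation (lists, code, poetry) is discarded along with the boilerplate.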

Advanced filtering and the rise of synthetic data

More recent efforts like RefinedWeb (5 trillion tokens) and FineWeb (15 trillion tokens) relied on aggressive rule-based web filtering, avoiding model-based filters that could introduce biases. Dolma from AI2 combined processed Common Crawl, Stack Exchange data, and academic papers, using both rule-based and classifier-based filtering. The DataComp-LM (DCLM) initiative aimed to standardize data processing and analysis, releasing an unfiltered pool (240 trillion tokens) and a filtered subset (1.4% of the pool) selected by a model-based quality classifier trained on instruction-style data such as OpenHermes and ELI5. Nemotron from NVIDIA leaned heavily into synthetic data, using language models to rephrase low-quality data or generate tasks from high-quality data, resulting in a 6 trillion token dataset. This reflects a growing trend of generating synthetic data for pre-training to increase both volume and quality. For scale, LLaMA 3 and Qwen 3 were trained on 15 and 36 trillion tokens respectively.
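
The idea of model-based quality filtering can be sketched with a toy scorer: learn which words are more typical of "good" reference text than of random web text, score every document, and keep only the top fraction. The example below swaps in a simple unigram log-odds scorer with add-one smoothing, which is an assumption made for brevity and not the classifier used by any of the pipelines named above. All documents and the `fraction` value are made up.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train_log_odds(positive_docs, negative_docs):
    """Per-word log-odds of appearing in positive vs. negative text (add-one smoothing)."""
    pos = Counter(w for doc in positive_docs for w in tokenize(doc))
    neg = Counter(w for doc in negative_docs for w in tokenize(doc))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def quality_score(doc, log_odds):
    """Average log-odds over the document; unknown words contribute zero."""
    words = tokenize(doc)
    if not words:
        return 0.0
    return sum(log_odds.get(w, 0.0) for w in words) / len(words)


def keep_top_fraction(docs, log_odds, fraction):
    """Rank documents by score and keep the highest-scoring fraction."""
    ranked = sorted(docs, key=lambda d: quality_score(d, log_odds), reverse=True)
    n_keep = max(1, round(len(ranked) * fraction))
    return ranked[:n_keep]


# Hypothetical training examples: "good" reference text vs. random web text.
positive = [
    "explain how gradient descent updates the weights",
    "the answer is that attention compares queries and keys",
]
negative = [
    "click here to win free prizes now",
    "buy cheap pills free shipping click now",
]
pool = [
    "click now for free prizes",
    "how attention updates the weights",
    "free shipping click here",
    "explain the answer",
]

log_odds = train_log_odds(positive, negative)
kept = keep_top_fraction(pool, log_odds, fraction=0.5)
print(kept)
```

The choice of positive examples is the real design decision here: the filter keeps whatever resembles them, so the reference set effectively defines what "quality" means for the resulting dataset.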

Focus on coding data and permissively licensed datasets

Specialized datasets for code, such as The Stack (v1 and v2), have been developed by curating permissively licensed GitHub repositories, the Software Heritage archive, and documentation. The Stack v2 includes metadata like issues and pull requests, and uses complex linearization strategies to represent the software development process. It also compiles code into an intermediate representation (LLVM) to help models learn low-resource languages by mapping them to higher-resource ones. Common Pile is an attempt to build a high-quality dataset exclusively from permissively licensed sources, including The Stack v2, government proceedings, wikis, and academic papers, totaling 8 terabytes. However, license laundering and the distinction between a collection's license and the licenses of individual works remain significant hurdles. Even when the data is competently curated, models trained purely on permissively licensed datasets may still struggle to match models trained on a broader, albeit legally complex, data mix.
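
The distinction between a collection-level license and the license of each individual work can be made concrete with a small filter. In this sketch the `Work` record, its field names, the allowlist of SPDX identifiers, and the sample entries are all hypothetical; the only point is that the filter trusts the per-work license and ignores whatever the collection claims.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative allowlist of SPDX identifiers commonly treated as permissive.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "CC0-1.0"}


@dataclass
class Work:
    name: str
    collection_license: str      # license claimed for the collection as a whole
    work_license: Optional[str]  # license verified for this individual work, if any


def is_permissive(work: Work) -> bool:
    """Trust only the per-work license; a collection-level claim is not enough."""
    return work.work_license in PERMISSIVE_LICENSES


works = [
    Work("repo-a", collection_license="MIT", work_license="MIT"),
    # The collection claims MIT, but the work itself is copyleft.
    Work("repo-b", collection_license="MIT", work_license="GPL-3.0-only"),
    # The collection claims CC0, but no license could be verified for the work.
    Work("scraped-book", collection_license="CC0-1.0", work_license=None),
]

kept = [w.name for w in works if is_permissive(w)]
print(kept)
```

Filtering on `collection_license` instead would have kept all three entries, which is exactly the failure mode that license laundering exploits.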

Common Questions

Data is crucial because it directly influences the model's capabilities and knowledge. Companies often keep their data sources secret, highlighting its role as a competitive advantage and a factor in avoiding legal issues like copyright liability.

Topics

AI & Machine Learning, Technology & Innovation, Programming & Software, Synthetic Data, Copyright Law, Web Crawling, Fair Use, Data Curation, Data Filtering, Dataset Creation, Data Sourcing

Mentioned in this video

Software & Apps

Llama 3

Mentioned as an example of a model whose developer does not disclose its training data, despite transparency about architecture and training procedures.

Qwen 3.5 397B

Cited as an example of a large model where intermediate checkpoints are no longer tracked, indicating a shift in model development practices.

libgen

Mentioned as an example of a shadow library that disregards copyright and bypasses paywalls to make books and articles available for free.

Anna's Archive

Mentioned as an example of a shadow library that disregards copyright and bypasses paywalls to make books and articles available for free.

OpenCourseware

Cited as an example of content distributed under Creative Commons licenses, making it freely usable.

Google Books

A service allowing users to see snippets of books, which was the subject of a lawsuit regarding fair use and copyright.

ChatGPT

Allegedly trained on New York Times articles, leading to a lawsuit.

Claude

An LLM developed by Anthropic, allegedly trained on pirated books, leading to legal action.

Common Crawl

A large, publicly available dataset of web crawl data, used by many researchers and companies for training language models.

WARC

A file format used to store web crawl data, representing the raw HTTP response.

WAT

A processed version of WARC files from Common Crawl, which can be a lossy representation of the data.

Trafilatura

A tool for converting HTML to text, mentioned as being better than WAT files for processing web content.

Resiliparse

A tool for converting HTML to text, mentioned as being better than WAT files for processing web content.

arXiv

A repository for research papers, particularly in physics, a valuable source for academic content.

PushShift

A project that provided Reddit data dumps before access became restricted.

Semantic Scholar

AI2's own crawl of academic papers, used to derive a dataset for language model training.

OpenHermes

Instruction data generated by GPT-4, used as positive examples for training the quality classifier in DCLM.

GPT-4

Used to generate instruction data (OpenHermes) for training the quality classifier in DCLM.

GPT-2

The model that sought high-quality web content by filtering pages linked from Reddit posts with high karma, using 40GB of text.

Qwen 3

Trained on 36 trillion tokens, referenced for scale comparison with Nemotron and other datasets.

ELI5

A subreddit with questions and answers, used as positive examples in the DCLM quality classifier training.

T5

A model famous for pushing the text-to-text framework, associated with the C4 dataset.

RefinedWeb

A dataset built on the premise that web data alone is sufficient for training. Its loosely filtered version was used as negative examples in DCLM.

Llama 1

The dataset processing for this model was described in detail, including Common Crawl (CCNet), C4, GitHub, Wikipedia, Books 3, Project Gutenberg, arXiv, and Stack Exchange.

Python

A common programming language mentioned in the context of the Stack V2 dataset, noting its prevalence compared to lower-resource languages.

CCNet

Developed by Facebook, used for creating high-quality datasets, particularly for low-resource languages, by employing deduplication and language identification, and using a language model to score document quality based on Wikipedia-likeness.

LLVM

A low-level intermediate language used in the Stack V2 dataset to enable language models to learn mappings between low-resource and high-resource programming languages.

BERT

An earlier language model trained on Wikipedia and books, noting its use of document-level sequences rather than sentences.

GPT-3

Its dataset included Common Crawl processing, expanded web text, internet-based books corpora, and Wikipedia, totaling 500GB of text.

Nemotron

NVIDIA's dataset that used a prompt-based model to score educational value and incorporated synthetic data, resulting in 6 trillion tokens.

Llama

Mentioned as a benchmark model for comparison with Common Pile, which was trained on permissively licensed data.

Bibliotik

A shadow library from which the Books 3 dataset was sourced.

Stack V2

An updated version of The Stack dataset focusing on code, including repository data, metadata, and documentation, with LLVM intermediate representation for low-resource languages.

Qwen

Mentioned as a benchmark model that significantly outperforms models trained solely on permissively licensed data.

Chinchilla

A model mentioned as having subsumed the work on Gopher.

MassiveWeb

A web dataset created by DeepMind. It was combined with C4, books, news, GitHub, and Wikipedia for training, with the data sources of some components undisclosed.

Dolma

An AI2 dataset that includes processed Common Crawl, Stack Exchange, C4, and other sources, utilizing model-based filtering for quality.

Nim

A low-resource programming language mentioned in the context of the Stack V2 dataset and its use of LLVM intermediate representation.

C4

A dataset from Google, known for its use in the T5 model. It filtered Common Crawl data using defined rules to improve quality, resulting in 156 billion tokens.

The Pile

A diverse dataset created by EleutherAI from various sources including Common Crawl, PubMed, arXiv, GitHub, and books, aiming for open-source accessibility.

Gopher

A model developed by DeepMind, whose data processing methods were described in a paper, contributing to understanding data curation.

Red Pajama V1

A reproduction of the LLaMA 1 dataset, initially including Books 3 but later stripped out due to copyright concerns.

Common Pile

A project that aimed to create a model using only permissively licensed data, scouring the internet for sources like Stack V2, government proceedings, and wikis, resulting in 8TB of data.

Organizations

AI2

Mentioned as the origin of the open-source model Olmo, which allows full visibility into its training process.

New York Times

Filed a lawsuit against OpenAI in 2023 alleging that ChatGPT was trained on their news articles and could reproduce them verbatim.

Wikipedia

Cited as an example of content distributed under Creative Commons licenses, making it freely usable.

Authors Guild

Plaintiff in a lawsuit against Google over the display of copyrighted book snippets in Google Books, which the courts ultimately decided in favor of Google as fair use.

Software Heritage

An initiative focused on archiving code repositories from various platforms like GitHub, GitLab, and Bitbucket.

EleutherAI

Developed 'The Pile', an early grassroots effort to create a diverse and open-source dataset for language model training.

PubMed

Included in The Pile dataset for its collection of academic and medical papers.

Companies

Facebook

Mentioned as an example of a platform whose content is locked behind a 'walled garden' and inaccessible to external crawlers for training purposes.

Cloudflare

A service used by websites to detect and block bot activity, often by presenting CAPTCHAs or blocking IP addresses.

Khan Academy

Cited as an example of content distributed under Creative Commons licenses, making it freely usable.

OpenAI

Sued by The New York Times for allegedly training ChatGPT on their news articles.

Anthropic

Faced a lawsuit for allegedly pirating millions of books for training Claude. The court ruled training was fair use, but pirating books was illegal, leading to a $1.5 billion settlement.
