Key Moments
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 14: Data
Training language models requires meticulously cleaning and remixing vast datasets, because even slight imperfections or near-duplicates can lead to wasted compute and overfitting. Unlike humans, models struggle with ambiguity and require carefully curated synthetic data for advanced tasks.
Key Insights
Raw web data is often in HTML or PDF formats, requiring complex, heuristic-based transformation processes that can lose semantic information, especially for tables.
Filtering is critical for quality, toxicity, and language identification, typically involving training a fast classifier (like fastText) on a small target dataset to score and select from a massive raw data pool.
De-duplication, essential for efficiency and avoiding memorization, employs techniques like hashing for exact matches and MinHash with Locality Sensitive Hashing (LSH) for near-duplicate detection based on Jaccard similarity.
Data mixing involves balancing diverse sources (e.g., web text, code, books) by assigning weights; regression-based methods at small scales are used to predict optimal mixtures for large-scale training, with caps on epochs to prevent overfitting.
Post-training data is largely synthetic, generated by strong models (teachers) responding to prompts in defined environments (e.g., GitHub repos for coding tasks), with recent efforts focusing on complex software development scenarios.
Even with advanced techniques, data work is often 'grungy' and domain-specific, requiring deep dives into concrete examples to build high-quality datasets representative of the real-world data landscape.
Transforming raw web data into usable text
The journey of data for language model training begins with raw internet content, which is rarely in a clean text format. Much of it is HTML, requiring parsers to extract core content while removing boilerplate like navigation, ads, and footers. This process is inherently lossy, particularly for complex structures like tables, which are difficult to linearize into sequences of tokens. PDFs present another challenge, often requiring OCR if they are image-based, and can be truncated or lack semantic structure inherent in HTML tags. The reliance on rule-based systems for this transformation, while fast, introduces imperfections and a risk of misinterpreting content, highlighting the need for robust data processing pipelines.
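The rule-based extraction described above can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not the pipeline from the lecture: the `SKIP_TAGS` set and the `html_to_text` helper are assumptions chosen for the example, and production extractors use far richer heuristics.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (an assumption for illustration,
# not a rule set taken from the lecture).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Linearize an HTML page into newline-separated text chunks."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = """
<html><body>
<nav><a href="/">Home</a> | <a href="/about">About</a></nav>
<h1>MinHash</h1>
<p>MinHash estimates Jaccard similarity.</p>
<table><tr><td>bands</td><td>20</td></tr></table>
<footer>Copyright notice</footer>
</body></html>
"""
print(html_to_text(page))
```

The navigation and footer text is dropped, while the table row comes out as two unrelated lines ("bands" and "20"). That is the lossiness the paragraph describes: the row and column relationship does not survive linearization.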
The critical role of filtering for quality and focus
Filtering is a cornerstone of preparing data for language models, aiming to isolate high-quality, relevant, and safe content. The general schema involves using a small set of target data to train a classifier that can then efficiently score and select similar examples from a vast raw dataset. Common filtering objectives include language identification (e.g., ensuring an English-only model gets English text), quality filtering (removing spam or low-value content), and toxicity filtering (excluding offensive material). FastText, a linear classifier, is often employed due to its speed on massive datasets, enabling the selection of a small, high-quality subset, typically in the single-digit percentage range of the original data.
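The general schema (train on a small target set, then score a large pool) can be sketched as follows. To stay self-contained, this example uses a tiny word-level log-odds scorer as a stand-in for fastText; it is not the fastText API. The function names, the toy documents, and the threshold of 0.0 are all assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train_log_odds(positives, negatives):
    """Per-word log-odds weights: target (positive) text vs. raw (negative) text."""
    pos = Counter(w for doc in positives for w in tokenize(doc))
    neg = Counter(w for doc in negatives for w in tokenize(doc))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)  # add-one smoothing
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def score(weights, text):
    """Average word weight; words never seen in training contribute 0."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(weights.get(w, 0.0) for w in words) / len(words)


def select(weights, pool, threshold=0.0):
    """Keep only pool documents that score above the threshold."""
    return [doc for doc in pool if score(weights, doc) > threshold]


# Toy data, made up for this example.
target = ["the proof follows from the lemma", "we train the model on filtered data"]
raw_sample = ["click here to win free prizes", "buy cheap pills online free shipping"]
pool = ["the model follows the data", "win free pills online"]

weights = train_log_odds(target, raw_sample)
kept = select(weights, pool)
print(kept)
```

In practice the threshold controls how aggressive the filter is: raising it shrinks the selected subset toward the single-digit percentage range mentioned above.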
De-duplication: Eliminating redundancy for efficiency and integrity
Duplicate and near-duplicate content significantly inflates dataset size without adding new information, leading to wasted computational resources and potential memorization issues. Exact duplicates can arise from server mirrors or forked code repositories. Near-duplicates, differing by only a few tokens, might stem from common templates, standardized text like licenses, or minor typographical variations. Detecting these efficiently is a major algorithmic challenge. While exact duplicates can be handled with hashing, near-duplicates require more sophisticated methods. Techniques like MinHash, combined with Locality Sensitive Hashing (LSH), are employed to identify documents with high Jaccard similarity above a certain threshold, reducing the dataset to a unique and informative core.
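The two levels of de-duplication can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions: word 3-gram shingles, seeded BLAKE2b hashes standing in for a family of random hash functions, and 20 bands of 5 rows. None of these parameter choices come from the lecture.

```python
import hashlib
from collections import defaultdict


def exact_dedup(docs):
    """Drop exact duplicates by hashing the full text, keeping first occurrences."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, n=3):
    """Set of word n-grams; Jaccard similarity is computed over these sets."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a, b):
    return len(a & b) / len(a | b)


def stable_hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=100):
    """One minimum per seeded hash. For two sets, the chance that a given
    position matches is (up to hash collisions) their Jaccard similarity."""
    return [min(stable_hash(seed, x) for x in items) for seed in range(num_hashes)]


def lsh_candidates(signatures, bands=20, rows=5):
    """Split signatures into bands; documents that agree on every row of at
    least one band become candidate pairs."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((x, y) for i, x in enumerate(ordered) for y in ordered[i + 1:])
    return pairs


# Toy documents, made up for this example: "a" and "b" differ by one word.
docs = {
    "a": "permission is hereby granted free of charge to any person obtaining a copy",
    "b": "permission is hereby granted free of cost to any person obtaining a copy",
    "c": "the quick brown fox jumps over the lazy dog near the quiet river bank",
}
sets = {k: shingles(v) for k, v in docs.items()}
sigs = {k: minhash_signature(s) for k, s in sets.items()}
estimate = sum(x == y for x, y in zip(sigs["a"], sigs["b"])) / len(sigs["a"])
print("exact Jaccard(a, b):", jaccard(sets["a"], sets["b"]))
print("MinHash estimate   :", estimate)
print("candidate pairs    :", lsh_candidates(sigs))
```

With b bands of r rows each, a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b. This S-shaped curve is what pushes pairs above the threshold toward certain detection and pairs below it toward being ignored, so only candidate pairs need an exact comparison.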
Strategic data mixing for diverse and balanced training
Language models are typically trained on a mixture of data sources, each contributing different styles, topics, and knowledge. The challenge lies in determining the optimal proportions for these sources. Naive methods like uniform or token-proportional mixing can be suboptimal and lead to issues like excessive 'epoching' (repeatedly training on the same data) on scarce high-quality sources, causing overfitting. More advanced techniques, such as regression-based mixing (e.g., 'RegMix'), involve training small proxy models on various data mixtures to predict performance metrics, then optimizing these mixtures for large-scale training. Careful consideration is given to preventing overfitting through mechanisms like capping epochs or simulating the effects of epoching at smaller scales.
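One simple way to cap epochs is sketched below. It is an illustrative scheme, not a published algorithm or the method from the lecture: the `capped_mixture` function, the cap of 4 epochs, and the source sizes are assumptions. Any source whose share of the token budget would exceed the cap is pinned at the cap, and the leftover budget is redistributed to the remaining sources in proportion to their weights.

```python
def capped_mixture(source_tokens, desired_weights, budget, max_epochs=4.0):
    """Return tokens to draw per source so that no source is repeated more
    than max_epochs times. Assumes all weights are positive."""
    alloc = {}
    free = set(source_tokens)
    remaining = float(budget)
    while free:
        total_w = sum(desired_weights[s] for s in free)
        # Sources whose proportional share would exceed their epoch cap.
        over = [
            s for s in free
            if remaining * desired_weights[s] / total_w > max_epochs * source_tokens[s]
        ]
        if not over:
            for s in free:
                alloc[s] = remaining * desired_weights[s] / total_w
            break
        for s in over:
            alloc[s] = max_epochs * source_tokens[s]  # pin at the cap
            remaining -= alloc[s]
            free.discard(s)
    return alloc


# Made-up sizes (in billions of tokens) and target weights.
sizes = {"web": 1000, "code": 200, "books": 10}
weights = {"web": 0.5, "code": 0.3, "books": 0.2}

alloc = capped_mixture(sizes, weights, budget=500, max_epochs=4.0)
for name, tokens in sorted(alloc.items()):
    print(f"{name}: {tokens:.1f}B tokens, {tokens / sizes[name]:.2f} epochs")
```

Here the books source would need 10 epochs to supply its 20% share, so it is capped at 4 epochs (40B tokens), and the remaining 460B tokens are split between web and code in a 5:3 ratio (about 287.5B and 172.5B).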
Synthetic data generation for post-training and specialized tasks
Beyond the massive pre-training datasets, post-training often relies on synthetically generated data tailored for specific tasks. The general recipe involves defining an environment or task space (e.g., coding repositories), collecting prompts, and then using a strong 'teacher' model to generate responses. For coding, this involves tasks like code generation, bug fixing, or software development workflows. Projects like 'OpenThoughts' and 'SWE-bench' exemplify this, using real-world repositories and complex evaluation setups. Human feedback can be used, but it is slow and expensive, so capable AI models serve as the primary teachers for generating vast, task-specific datasets. Challenges remain in obtaining execution feedback and preventing model 'cheating'.
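The recipe (prompts, a teacher, and execution feedback as a filter) can be sketched as below. Everything here is hypothetical: `teacher` is a canned stand-in for a strong model, and the task format is invented for the example. Running generated code with `exec` is only acceptable in a toy; a real pipeline would execute candidates in a sandbox.

```python
def teacher(prompt):
    """Hypothetical stand-in for a strong model: returns candidate solutions."""
    canned = {
        "Write add(a, b) returning the sum.": [
            "def add(a, b):\n    return a - b",  # buggy candidate
            "def add(a, b):\n    return a + b",
        ],
    }
    return canned.get(prompt, [])


def passes(candidate_src, test_src):
    """Execution feedback: define the candidate, then run the task's tests."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # unsafe outside a sandbox
        exec(test_src, namespace)
    except Exception:
        return False
    return True


def build_dataset(tasks):
    """Keep (prompt, response) pairs whose response passes the task's tests."""
    dataset = []
    for task in tasks:
        for candidate in teacher(task["prompt"]):
            if passes(candidate, task["tests"]):
                dataset.append({"prompt": task["prompt"], "response": candidate})
                break  # one verified response per prompt in this sketch
    return dataset


tasks = [{
    "prompt": "Write add(a, b) returning the sum.",
    "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0",
}]
dataset = build_dataset(tasks)
print(dataset)
```

The filter is only as strong as the tests. A candidate that hard-codes the expected outputs would also pass, which is one form of the 'cheating' problem mentioned above.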
The nuances of scale-dependent data optimization
The optimal data mixture and filtering strategy can be scale-dependent. For instance, a small model trained on limited tokens may benefit from very high-quality data, while a large model trained on vastly more tokens might tolerate or even benefit from lower-quality data to avoid overfitting. Regression-based mixing approaches attempt to bridge this gap by using small-scale experiments to predict large-scale performance, but this carries risks. Optimizing mixtures based on small-scale proxy models might not perfectly transfer to the large scale, and phenomena like 'epoching' behave differently across scales. Strategies to mitigate this include capped epoching or simulating the large-scale epoching effects at the small scale, ensuring that optimization efforts are aligned with the eventual training regime.
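The idea of predicting a good mixture from small proxy runs can be sketched for the simplest case: two sources, so the mixture is a single fraction p. The sketch fits a quadratic to proxy losses and then searches for the p with the lowest predicted loss. It illustrates the regress-then-optimize pattern only; it is not the implementation of any published method, and the proxy losses are made-up numbers, not real results.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x^2 via the normal equations."""
    feats = [[1.0, x, x * x] for x in xs]
    # Build the 3x3 system (A^T A) c = A^T y.
    ata = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
    aty = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(3)]
    m = [row + [rhs] for row, rhs in zip(ata, aty)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for k in range(col, 4):
                m[r][k] -= factor * m[col][k]
    # Back substitution.
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (m[i][3] - tail) / m[i][i]
    return coef


def best_mixture(coef, steps=100):
    """Grid-search the fraction p in [0, 1] with the lowest predicted loss."""
    def predict(p):
        return coef[0] + coef[1] * p + coef[2] * p * p
    return min((i / steps for i in range(steps + 1)), key=predict)


# Made-up proxy runs: fraction of tokens from source A -> validation loss.
proxy_p = [0.0, 0.25, 0.5, 0.75, 1.0]
proxy_loss = [2.09, 2.0025, 2.04, 2.2025, 2.49]

coef = fit_quadratic(proxy_p, proxy_loss)
print("predicted best fraction of source A:", best_mixture(coef))
```

The risk discussed above sits in the last step: the fitted curve describes the proxy scale, and nothing in the regression guarantees that its minimum stays in the same place at the target scale, especially once scarce sources start repeating.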
Common Questions
Transforming raw web data, often in HTML or PDF formats, involves extracting meaningful content while removing boilerplate like navigation and ads. Linearizing HTML, handling images and tables, and dealing with the layout-centric nature of PDFs are significant challenges.
Mentioned in this video
A company that has trained fastText language identification models supporting 176 languages, often used off-the-shelf for language filtering in data processing pipelines.
An organization involved in AI research and development, which released a dataset called 'FinePDFs' related to PDF processing for language models.
A model mentioned in an experiment showing that high-quality data yields better results initially, but over very long training periods, lower-quality data can also be effective.
The first paper on LLaMA used pages referenced by Wikipedia as positive examples for filtering, similar to other quality filtering approaches.
A library for efficient text classification and learning word representations, commonly used for its speed in tasks like language identification and filtering.
Used by Microsoft in the 'phi-1' project to classify a subset of data based on the prompt 'determine educational value', generating target data for training a cheaper classifier.
A publicly available archive of web crawl data, used as a primary source for training large language models. It contains raw HTML and other web content, requiring significant processing.
Mentioned in the context of its training data filtering process, which used Wikipedia, web text, and books as positive examples and web samples as negative examples, trained with a linear classifier.
A technique used to find approximate nearest neighbors in high-dimensional spaces, applied here to efficiently find near-duplicate documents by 'sharpening' collision probabilities.
A regression-based mixing method that trains small proxy models with different data mixtures, fits a regression to map weights to loss, and then optimizes for large-scale model training.
A probabilistic algorithm used for efficiently estimating the Jaccard similarity between sets, crucial for near-deduplication in large datasets.
A method introduced for training multilingual models that involves capping the number of epochs for each data source to prevent issues with low-resource languages.
A paper that proposed generating tasks automatically from a code repository using an agent, creating 50,000 synthetic task instances for models.
A 2023 paper focused on creating a large corpus of mathematical text, employing rules, generative models (KenLM), and classifiers to filter for mathematical content.