Key Moments
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 15: Mid/Post-Training
Want to know something specific about what's covered?
We've already dissected every moment. Ask and we will deliver (with timestamps).
Key Moments
Large language models can now follow complex instructions, but achieving this requires significant post-training effort and data collection, which is often kept secret by companies.
Key Insights
Post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), is crucial for transforming pre-trained models like GPT-3 into instruction-following systems like ChatGPT.
The creation of high-quality SFT data is a major challenge, evolving from early methods like FLAN (using existing NLP benchmarks) to Self-Instruct (model-generated data) and human-driven efforts like Open Assistant.
Modern SFT data and RLHF annotation increasingly involve expert annotators, with median wages exceeding $50/hour, and some experts earning over $100/hour, reflecting the complexity and cost of data collection.
While SFT requires numerous examples, it can be highly effective with as few as 500 high-quality safety examples to dramatically reduce the rate of malicious instruction following.
Reinforcement Learning from Human Feedback (RLHF) shifts from fitting a distribution (pre-training/SFT) to maximizing a reward, allowing for more targeted behavior shaping.
Direct Preference Optimization (DPO) offers a simpler alternative to complex RL algorithms like Proximal Policy Optimization (PPO) for RLHF, often achieving comparable results by directly optimizing for preferences.
The leap from GPT-3 to ChatGPT: The necessity of post-training
The lecture begins by highlighting the significant qualitative difference between large pre-trained models like GPT-3 and interactive systems like ChatGPT. While GPT-3 is capable of generating text, its utility for reliable instruction following and complex tasks was limited. Post-training, specifically Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), is presented as the crucial set of techniques that bridge this gap, enabling models to understand and respond to complicated prompts with remarkable accuracy. The process is described as 'artisanal,' involving more explicit data collection, steering, and engineering compared to the broad, diverse nature of pre-training.
Evolution of instruction tuning data sets
The historical progression of SFT data collection is detailed, starting with FLAN, which aggregated existing NLP datasets, demonstrating an early attempt at multitask training. This was followed by Self-Instruct, which leveraged models themselves to generate training data. Distillation-based approaches like Alpaca and Vicuna emerged, using outputs from stronger models. More human-driven efforts like Open Assistant aimed for high-quality, crowdsourced data. More recently, the focus has shifted towards agent and tool-use data, reflecting the growing complexity of desired model behaviors. This evolution underscores the field's continuous efforts to gather more effective and diverse data for instruction following.
The shift towards agents and tool use in data collection
A significant recent trend in SFT data is the move beyond simple chat interactions to more complex agentic systems. This includes generating data that enables models to perform tool calls, manage to-do lists, and interact with external functions. Examples like NVIDIA's NeOtron dataset and pipelines are discussed, where SFT data explicitly incorporates these structured formats, including parallel data for tool calls alongside textual responses. This signifies a broader evolution in what users expect from AI models, moving from purely conversational agents to more functional, task-oriented AI assistants.
Challenges and nuances in data quality and collection
Despite the goal of collecting high-quality responses, there are complexities. While good data is generally preferred, models can learn from imperfect data due to pre-training generalization. The 'unnaturalness' of data structures, as seen in FLAN derived from older datasets, can lead to deficiencies. Furthermore, stylistic factors like bullet points and length can influence human preference ratings without necessarily improving underlying capabilities, potentially creating a gap between engagement signals and true performance. Balancing these factors is a key challenge in collecting effective SFT data.
The role of human annotators and evolving annotation practices
RLHF relies heavily on human feedback for rating model outputs. Annotation practices have evolved from basic pairwise comparisons to more sophisticated guidelines emphasizing helpfulness, truthfulness, and harmlessness. There's a significant trend towards using more expert annotators with higher education levels and specialized knowledge (e.g., doctors, lawyers), justifying higher compensation, often exceeding $50-$100 per hour. This shift is driven by the need for nuanced judgments and the difficulty in verifying AI-generated responses accurately, especially when annotators may be pressured for time or tempted to use AI themselves.
Demographic biases and the impact of annotator selection
The composition of the annotator pool can unintentionally influence model behavior. Studies have shown that annotator demographics can correlate with the ideological leanings of post-trained models. For instance, specific demographic groups of annotators have been linked to shifts in alignment compared to base models. Furthermore, the expertise and focus of annotators matter; non-expert annotators may overemphasize formatting, while experts are more likely to focus on factuality and consistency. This highlights the intricate connection between annotator characteristics and the resulting model's capabilities and biases.
Model-based annotation and the future of data collection
While human annotation remains critical for pushing frontiers, model-based annotation is increasingly prevalent for catching up to existing capabilities. Large language models like GPT-4 can generate annotations that closely match human rankings at a significantly lower cost. This has led to a scenario where many open-source efforts, including HuggingFace's Zephyr and Tulu 3, rely on model-generated data (e.g., UltraChat, UltraFeedback) for both SFT and RLHF. While models can bootstrap data generation, human annotators are still essential for tasks requiring world knowledge or specialized expertise.
RLHF algorithms: From PPO to DPO
The lecture contrasts SFT (fitting a distribution) with RLHF (maximizing a reward). Core RLHF algorithms have evolved from complex methods like Proximal Policy Optimization (PPO), which involves careful KL divergence regularization, to simpler approaches like Direct Preference Optimization (DPO). DPO aims to eliminate the need for a separate reward model and complex on-policy training by directly optimizing a preference-based loss function. This involves increasing the likelihood of preferred responses and decreasing the likelihood of dispreferred ones, making RLHF more accessible and efficient.
Pitfalls in RLHF: Overoptimization and model collapse
Key challenges in RLHF include overoptimization, where models become overly tailored to the learned reward model, potentially losing general capabilities, and model collapse, where the model's output diversity significantly decreases. The KL regularizer in PPO is crucial for mitigating overoptimization. Model collapse is a consequence of RLHF focusing on maximizing reward rather than modeling a distribution. Additionally, models can become uncalibrated after RLHF, impacting their reliability and exploration capabilities, which is particularly relevant for future advancements in reasoning models.
Mentioned in This Episode
●Software & Apps
●Companies
●Organizations
●Books
●Concepts
●People Referenced
Common Questions
Pre-training involves learning general language patterns from vast amounts of diverse data. Post-training, or instruction tuning, focuses on extracting specific behaviors, like instruction following and alignment with human preferences, from the pre-trained model.
Topics
Mentioned in this video
A large language model that represents a strong base model, but with limited utility and difficulty in instruction following compared to newer models. Its primary uses were copywriting and simple tasks.
A successor to GPT-3 that offered a significant improvement in interaction and instruction following, making it seem amazing to early users. It is a benchmark for current language model capabilities.
Mentioned as a significant advancement that enabled better instruction following with long, programmatic prompts, contrasting with the limitations of earlier models like GPT-3.
A model exhibiting strong instruction following capabilities, capable of 'oneshotting' complex prompts. It serves as a benchmark for which other models are compared against, even in reverse engineering efforts.
A language model trained by Google using the FLAN dataset. The FLAN dataset's structure, derived from existing benchmarks, led to some unnatural task formulations.
Founded by Berkeley researchers, Alpaca used distillation from ChatGPT traces to create input-output pairs, demonstrating that such chat-style data could effectively train ChatGPT-like systems when applied to models like LLaMA.
A language model from Berkeley that utilized online user-shared prompts as inputs for distillation.
A large-scale, crowdsourced effort to build a high-quality, human-driven instruction tuning dataset, akin to Wikipedia in its collaborative model, aiming to match the performance of closed-source labs.
A newer generation of instruction tuning datasets that employs increasingly sophisticated methods for generating instruction-following data using language models.
Presented as a reference for a performant post-training pipeline, including a safety component with approximately 50,000 examples and utilizing a 'wild chat' dataset for mining unsafe behaviors.
NVIDIA's open-source initiative that incorporates a substantial amount of agentic SFT data, including tool calls alongside textual responses.
Used as an example to illustrate how different models exhibit distinct tones, highlighting conscious decisions made in data collection regarding chatbot style.
Its description of safety SFT is presented as one of the more detailed publicly available explanations, though it lacks specifics on the number of examples used.
A previous project that provided free chat access to users, collecting interactions to filter out unsafe behaviors and jailbreaks, which were then used to generate SFT data for Tulu 3's safety component.
Its appendix provides a glimpse into industry data collection processes for RLHF, detailing guidelines for rating outputs on helpfulness, truthfulness, and harmlessness.
Annotation data for Bard was leaked, showing a similar structure to InstructGPT's guidelines, focusing on helpfulness and presentation using a Likert scale rather than pairwise feedback.
Trust Region Policy Optimization, an off-policy RL algorithm that takes multiple steps while staying close to the original policy using importance weighting corrections.
Proximal Policy Optimization, a reinforcement learning algorithm that aims to improve stability by using a clipping heuristic to discourage large policy changes. It's a key algorithm in RLHF.
An open-source model Hugging Face attempted to build without model distillation, prioritizing human data collection, but ultimately found human data to be less efficient than AI feedback.
Direct Preference Optimization, a simpler RLHF algorithm that eliminates the need for a separate reward model and on-policy sampling. It works by taking gradient steps towards preferred responses and negative steps away from dispreferred ones.
Models like LLaMA were trained using DPO for their core RLHF primitive, demonstrating its effectiveness. LLaMA 7B is mentioned as a respectable open-source model size.
A dataset generation approach where a model is used to generate its own high-quality responses to inputs, based on the idea that models themselves might surpass human annotators.
An early work by Anthropic that involved prompting a model to generate safety data, which was then used to train the model itself, creating a self-post-training data generation loop for safer outputs.
Mentioned for its open-source efforts, specifically Neotron, which includes a significant portion of agentic SFT examples featuring tool calls.
Researchers at Meta conducted ablations and used court documents from a lawsuit regarding the use of books in training data to estimate the usefulness of different book subsets for model training.
Mentioned for its work in self-verification using models, particularly in domains like mathematics where verifying a proof is easier than generating one.
Mentioned for its development of InstructGPT, its pioneering work in RLHF, and its models like GPT-4. They also noted that their RLHF models were uncalibrated.
Attempted to build an open-source model called Zephier without model distillation, relying solely on human data collection. They found human data collection to be time-consuming, costly, and not superior to model-based annotations, eventually using AI feedback.
Pioneered self-training with constitutional AI, prompting models to generate safety data. They also noted models can be naturally uncalibrated after RLHF.
More from Stanford Online
View all 67 summaries
66 minStanford CS153 Frontier Systems | The Road Ahead: Resilience Required
102 minStanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 7 - Evaluation
85 minStanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 14: Data
47 minStanford MS&E435 Economics of the AI Supercycle | Spring 2026 | Infrastructure, Capstone Case
Ask anything from this episode.
Save it, chat with it, and connect it to Claude or ChatGPT. Get cited answers from the actual content — and build your own knowledge base of every podcast and video you care about.
Get Started Free