How did models like ChatGPT improve upon GPT-3?

ChatGPT significantly enhanced instruction following and reliability, making it much more useful than GPT-3, which was primarily suited for tasks like copywriting and lacked fine-grained control and instruction adherence.

What are the key datasets or approaches used for supervised fine-tuning (SFT)?

Early approaches like FLAN aggregated existing datasets. Later methods include Self-Instruct (model-generated data), distillation-based methods like Alpaca, human-driven efforts like Open Assistant, and newer approaches focusing on agents and tool use.

Why is data quality important in SFT, and why can it be tricky?

While higher quality data is generally better, SFT can sometimes inherit deficiencies from source data or lead to hallucinations if models are trained to emit facts they don't know, especially when associated with specific formatting cues.

How has the focus of SFT data collection evolved?

The field has shifted from basic chat interfaces to more detailed, human-like responses, increased emphasis on higher quality annotators, and a significant move towards generating data for agent systems and tool use.

What are the main challenges in collecting human annotation data for SFT?

Challenges include ensuring response length and style variation, avoiding engagement signal manipulation, preventing hallucinations due to 'tail knowledge,' and managing safety controls without excessive false refusals. Ensuring annotator verifiability and preventing AI misuse are also critical.

How does RLHF differ conceptually from SFT and pre-training?

Pre-training and SFT are primarily generative modeling tasks focused on predicting the next token. RLHF shifts to a reward maximization objective, aiming to optimize a policy towards a specific reward signal rather than fitting a distribution.

What is the typical process for RLHF data collection?

It involves generating multiple model outputs for a prompt, having human raters rank these outputs (pairwise or on a scale) based on criteria like helpfulness, truthfulness, and harmlessness, and then training a reward model on these rankings.

What are the trends in RLHF annotator demographics and quality?

There's a shift towards more expert and highly educated annotators, with some paid over $100/hour, due to the need for specialized knowledge and the challenge of preventing AI from being used in annotations. However, lower-cost annotation still exists.

Why is model-based annotation becoming prevalent in RLHF?

Models like GPT-4 can generate annotations with high agreement to human judgments at a significantly lower cost. This approach is scalable and effective for catching up to frontier capabilities, as demonstrated by HuggingFace's experience with Zephier.

What is Direct Preference Optimization (DPO) and why is it used?

DPO is a simpler RLHF algorithm that bypasses the need for a separate reward model. It directly optimizes the policy using collected preference data by increasing the likelihood of preferred responses and decreasing the likelihood of dispreferred ones.

What are the main challenges or pitfalls in RLHF?

Key challenges include overoptimization to the learned reward model (prevented by KL regularization), model collapse leading to reduced diversity in outputs, and uncalibrated models. These issues highlight the complexity of aligning models with human preferences.

Key Moments

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 15: Mid/Post-Training

Stanford Online

Education5 min read80 min video

May 27, 2026|18,788 views|387|30

Stanford Stanford Online Artificial Intelligence AI

Save to Pod

Want to know something specific about what's covered?

We've already dissected every moment. Ask and we will deliver (with timestamps).

Key Moments

On this page

TL;DR

Large language models can now follow complex instructions, but achieving this requires significant post-training effort and data collection, which is often kept secret by companies.

Key Insights

Post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), is crucial for transforming pre-trained models like GPT-3 into instruction-following systems like ChatGPT.

The creation of high-quality SFT data is a major challenge, evolving from early methods like FLAN (using existing NLP benchmarks) to Self-Instruct (model-generated data) and human-driven efforts like Open Assistant.

Modern SFT data and RLHF annotation increasingly involve expert annotators, with median wages exceeding $50/hour, and some experts earning over $100/hour, reflecting the complexity and cost of data collection.

While SFT requires numerous examples, it can be highly effective with as few as 500 high-quality safety examples to dramatically reduce the rate of malicious instruction following.

Reinforcement Learning from Human Feedback (RLHF) shifts from fitting a distribution (pre-training/SFT) to maximizing a reward, allowing for more targeted behavior shaping.

Direct Preference Optimization (DPO) offers a simpler alternative to complex RL algorithms like Proximal Policy Optimization (PPO) for RLHF, often achieving comparable results by directly optimizing for preferences.

The leap from GPT-3 to ChatGPT: The necessity of post-training

The lecture begins by highlighting the significant qualitative difference between large pre-trained models like GPT-3 and interactive systems like ChatGPT. While GPT-3 is capable of generating text, its utility for reliable instruction following and complex tasks was limited. Post-training, specifically Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), is presented as the crucial set of techniques that bridge this gap, enabling models to understand and respond to complicated prompts with remarkable accuracy. The process is described as 'artisanal,' involving more explicit data collection, steering, and engineering compared to the broad, diverse nature of pre-training.

Evolution of instruction tuning data sets

The historical progression of SFT data collection is detailed, starting with FLAN, which aggregated existing NLP datasets, demonstrating an early attempt at multitask training. This was followed by Self-Instruct, which leveraged models themselves to generate training data. Distillation-based approaches like Alpaca and Vicuna emerged, using outputs from stronger models. More human-driven efforts like Open Assistant aimed for high-quality, crowdsourced data. More recently, the focus has shifted towards agent and tool-use data, reflecting the growing complexity of desired model behaviors. This evolution underscores the field's continuous efforts to gather more effective and diverse data for instruction following.

The shift towards agents and tool use in data collection

A significant recent trend in SFT data is the move beyond simple chat interactions to more complex agentic systems. This includes generating data that enables models to perform tool calls, manage to-do lists, and interact with external functions. Examples like NVIDIA's NeOtron dataset and pipelines are discussed, where SFT data explicitly incorporates these structured formats, including parallel data for tool calls alongside textual responses. This signifies a broader evolution in what users expect from AI models, moving from purely conversational agents to more functional, task-oriented AI assistants.

Challenges and nuances in data quality and collection

Despite the goal of collecting high-quality responses, there are complexities. While good data is generally preferred, models can learn from imperfect data due to pre-training generalization. The 'unnaturalness' of data structures, as seen in FLAN derived from older datasets, can lead to deficiencies. Furthermore, stylistic factors like bullet points and length can influence human preference ratings without necessarily improving underlying capabilities, potentially creating a gap between engagement signals and true performance. Balancing these factors is a key challenge in collecting effective SFT data.

The role of human annotators and evolving annotation practices

RLHF relies heavily on human feedback for rating model outputs. Annotation practices have evolved from basic pairwise comparisons to more sophisticated guidelines emphasizing helpfulness, truthfulness, and harmlessness. There's a significant trend towards using more expert annotators with higher education levels and specialized knowledge (e.g., doctors, lawyers), justifying higher compensation, often exceeding $50-$100 per hour. This shift is driven by the need for nuanced judgments and the difficulty in verifying AI-generated responses accurately, especially when annotators may be pressured for time or tempted to use AI themselves.

Demographic biases and the impact of annotator selection

The composition of the annotator pool can unintentionally influence model behavior. Studies have shown that annotator demographics can correlate with the ideological leanings of post-trained models. For instance, specific demographic groups of annotators have been linked to shifts in alignment compared to base models. Furthermore, the expertise and focus of annotators matter; non-expert annotators may overemphasize formatting, while experts are more likely to focus on factuality and consistency. This highlights the intricate connection between annotator characteristics and the resulting model's capabilities and biases.

Model-based annotation and the future of data collection

While human annotation remains critical for pushing frontiers, model-based annotation is increasingly prevalent for catching up to existing capabilities. Large language models like GPT-4 can generate annotations that closely match human rankings at a significantly lower cost. This has led to a scenario where many open-source efforts, including HuggingFace's Zephyr and Tulu 3, rely on model-generated data (e.g., UltraChat, UltraFeedback) for both SFT and RLHF. While models can bootstrap data generation, human annotators are still essential for tasks requiring world knowledge or specialized expertise.

RLHF algorithms: From PPO to DPO

The lecture contrasts SFT (fitting a distribution) with RLHF (maximizing a reward). Core RLHF algorithms have evolved from complex methods like Proximal Policy Optimization (PPO), which involves careful KL divergence regularization, to simpler approaches like Direct Preference Optimization (DPO). DPO aims to eliminate the need for a separate reward model and complex on-policy training by directly optimizing a preference-based loss function. This involves increasing the likelihood of preferred responses and decreasing the likelihood of dispreferred ones, making RLHF more accessible and efficient.

Pitfalls in RLHF: Overoptimization and model collapse

Key challenges in RLHF include overoptimization, where models become overly tailored to the learned reward model, potentially losing general capabilities, and model collapse, where the model's output diversity significantly decreases. The KL regularizer in PPO is crucial for mitigating overoptimization. Model collapse is a consequence of RLHF focusing on maximizing reward rather than modeling a distribution. Additionally, models can become uncalibrated after RLHF, impacting their reliability and exploration capabilities, which is particularly relevant for future advancements in reasoning models.

Mentioned in This Episode

●Software & Apps

●Companies

●Organizations

●Books

●Concepts

●People Referenced

Common Questions

Pre-training involves learning general language patterns from vast amounts of diverse data. Post-training, or instruction tuning, focuses on extracting specific behaviors, like instruction following and alignment with human preferences, from the pre-trained model.

Topics

Mindset & Self-Improvement AI & Machine Learning Technology & Innovation Data Collection Supervised Fine-tuning Reinforcement Learning From Human Feedback Ethical AI Language Model Training Instruction Following LLM Alignment

Mentioned in this video

Software & Apps

GPT-3

A large language model that represents a strong base model, but with limited utility and difficulty in instruction following compared to newer models. Its primary uses were copywriting and simple tasks.

ChatGPT

A successor to GPT-3 that offered a significant improvement in interaction and instruction following, making it seem amazing to early users. It is a benchmark for current language model capabilities.

GPT-3.5

Mentioned as a significant advancement that enabled better instruction following with long, programmatic prompts, contrasting with the limitations of earlier models like GPT-3.

GPT-4

A model exhibiting strong instruction following capabilities, capable of 'oneshotting' complex prompts. It serves as a benchmark for which other models are compared against, even in reverse engineering efforts.

A language model trained by Google using the FLAN dataset. The FLAN dataset's structure, derived from existing benchmarks, led to some unnatural task formulations.

Alpaca

Founded by Berkeley researchers, Alpaca used distillation from ChatGPT traces to create input-output pairs, demonstrating that such chat-style data could effectively train ChatGPT-like systems when applied to models like LLaMA.

Vicuna

A language model from Berkeley that utilized online user-shared prompts as inputs for distillation.

Open Assistant

A large-scale, crowdsourced effort to build a high-quality, human-driven instruction tuning dataset, akin to Wikipedia in its collaborative model, aiming to match the performance of closed-source labs.

WizardLM

A newer generation of instruction tuning datasets that employs increasingly sophisticated methods for generating instruction-following data using language models.

Tulu 3

Presented as a reference for a performant post-training pipeline, including a safety component with approximately 50,000 examples and utilizing a 'wild chat' dataset for mining unsafe behaviors.

Neotron

NVIDIA's open-source initiative that incorporates a substantial amount of agentic SFT data, including tool calls alongside textual responses.

Claude

Used as an example to illustrate how different models exhibit distinct tones, highlighting conscious decisions made in data collection regarding chatbot style.

LLaMA 2

Its description of safety SFT is presented as one of the more detailed publicly available explanations, though it lacks specifics on the number of examples used.

WildChat

A previous project that provided free chat access to users, collecting interactions to filter out unsafe behaviors and jailbreaks, which were then used to generate SFT data for Tulu 3's safety component.

InstructGPT

Its appendix provides a glimpse into industry data collection processes for RLHF, detailing guidelines for rating outputs on helpfulness, truthfulness, and harmlessness.

Google Bard

Annotation data for Bard was leaked, showing a similar structure to InstructGPT's guidelines, focusing on helpfulness and presentation using a Likert scale rather than pairwise feedback.

TRPO

Trust Region Policy Optimization, an off-policy RL algorithm that takes multiple steps while staying close to the original policy using importance weighting corrections.

PPO

Proximal Policy Optimization, a reinforcement learning algorithm that aims to improve stability by using a clipping heuristic to discourage large policy changes. It's a key algorithm in RLHF.

Zephier

An open-source model Hugging Face attempted to build without model distillation, prioritizing human data collection, but ultimately found human data to be less efficient than AI feedback.

DPO

Direct Preference Optimization, a simpler RLHF algorithm that eliminates the need for a separate reward model and on-policy sampling. It works by taking gradient steps towards preferred responses and negative steps away from dispreferred ones.

Llama

Models like LLaMA were trained using DPO for their core RLHF primitive, demonstrating its effectiveness. LLaMA 7B is mentioned as a respectable open-source model size.

Concepts

Self-Instruct

A dataset generation approach where a model is used to generate its own high-quality responses to inputs, based on the idea that models themselves might surpass human annotators.

Constitutional AI

An early work by Anthropic that involved prompting a model to generate safety data, which was then used to train the model itself, creating a self-post-training data generation loop for safer outputs.

Companies

NVIDIA

Mentioned for its open-source efforts, specifically Neotron, which includes a significant portion of agentic SFT examples featuring tool calls.

Ask anything from this episode.

Save it, chat with it, and connect it to Claude or ChatGPT. Get cited answers from the actual content — and build your own knowledge base of every podcast and video you care about.

Get Started Free