What is the difference between pre-training and post-training AI models?

Pre-training involves massive datasets and compute to learn general language patterns, essentially compressing vast knowledge into model weights. Post-training, or alignment, refines these models for specific tasks, safety, and user interaction, like chat formats.

What are the current bottlenecks in AI model development?

Historically, bottlenecks included compute power, architecture, and data scale for pre-training. Currently, the focus is on RL environments for reasoning, and the future bottleneck is seen as continual learning and extreme data efficiency with sparse rewards.

Why is code considered the first frontier for AI model training?

Code and math provide verifiable rewards, making them ideal for RL training. Additionally, there's a vast amount of code data available, and many view coding as a fundamental skill encompassing many other tasks, making it a proxy for AGI.

Where does the future data for training AI models come from?

As readily available internet data saturates, future data will increasingly come from proprietary sources within enterprises, synthetic generation, and more efficient learning from real-world interactions through methods like RL environments.

What is continual learning in AI, and why is it important?

Continual learning is the ability for AI models to learn from sparse rewards and real-world interactions, updating themselves over time to improve. This is crucial for production systems and achieving human-like learning efficiency.

What is the future of AI training beyond transformers?

While transformers are currently dominant due to their scalability and ongoing research, there's interest in non-transformer architectures like Mamba for efficiency. However, the consensus leans towards continued scaling and optimization of transformers.

What are the biggest investment opportunities in AI right now?

Compute power and chip providers like NVIDIA remain strong. There's also a significant opportunity in hardware innovation for more efficient chips and energy sources to power AI, as compute scarcity is a major issue.

Stanford MS&E435 Economics of the AI Supercycle | Spring 2026 | Enterprise Internal Knowledge

Stanford Online

Education | 7 min read | 49 min video

May 22, 2026|5,747 views|145|7

TL;DR

AI models are nearing a data cliff, forcing a pivot from internet-scale pre-training to more efficient, specialized post-training and 'continual learning' for future enterprise applications.

Key Insights

Deep learning's pivotal moment, AlexNet in 2012, enabled massive gains by scaling GPUs, data, and neural nets, but also produced models whose internals are poorly understood.

The current bottleneck in AI model development is not just data or compute, but the ability for models to learn continuously from sparse, real-world feedback, akin to human learning.

Code and math are currently favored for advanced AI training (like RLVR) because they offer deterministic, verifiable rewards, making them ideal for 'eval-maxing' and iterative improvement.

Pre-training models on internet-scale data requires vast compute (e.g., about 2.5 million H800 hours for DeepSeek V3), whereas post-training with RL can use a small fraction of that, roughly 6% in the DeepSeek R1 example.

Applied Compute specializes AI models for enterprises like DoorDash and Cognition by fine-tuning general models to specific business needs, which is cheaper and faster than waiting for future general-purpose models to improve.

Transformers remain the dominant architecture, and scaling them is currently seen as more promising than switching to non-transformer alternatives, though research into architectures like Mamba continues.

Continual learning, exemplified by Cursor's 'Composer' model, involves gradual, iterative updates based on user interactions and implicit rewards, requiring days or weeks of training time per step but yielding significant performance gains.

The AI supercycle's foundational shift: Deep learning and transformer architectures

The journey of AI has seen exponential growth, particularly since the advent of deep learning, marked by the pivotal AlexNet moment around 2012. This era revolutionized AI by leveraging GPUs and massive datasets like ImageNet, enabling neural networks to learn complex representations from data. In language models, the transformer architecture, introduced in 2017, proved to be a game-changer. Its self-attention mechanism allowed for significantly better performance and scalability in processing long sequences of text compared to previous recurrent neural networks (RNNs) and LSTMs. Following this, the era of pre-training emerged, where models like GPT-3, trained on vast internet-scale text data, demonstrated emergent general intelligence. Key insights from scaling laws, such as the Chinchilla scaling laws, revealed that optimal performance requires balancing both model size (parameters) and the amount of training data. Finally, post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and preference tuning became crucial for aligning these powerful base models, which by default just predict the next token, to be more helpful, harmless, and honest. This progression culminated in models like GPT-4, which represented a significant step-change in quality and reasoning capabilities.
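
The Chinchilla trade-off between model size and training data can be made concrete with a short sketch. It relies on two widely quoted rules of thumb that are assumptions here, not figures from the lecture: training compute is roughly 6 x parameters x tokens (in FLOPs), and the compute-optimal ratio is roughly 20 training tokens per parameter. The function name and constants are illustrative only.

```python
import math

# Assumed rules of thumb (not taken from the lecture summary):
FLOPS_PER_PARAM_TOKEN = 6.0   # C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0       # D ~ 20 * N at the compute-optimal point


def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) that spend `compute_flops` under the assumptions above."""
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    # C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    n, d = compute_optimal_split(1e23)
    print(f"params ~ {n:.3e}, tokens ~ {d:.3e}")
```

Under these assumptions, doubling the compute budget increases both the parameter count and the token count by a factor of about 1.41, rather than putting all of the extra compute into a larger model.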

Bottlenecks in AI development: From compute to continual learning

The evolution of AI has been driven by overcoming various bottlenecks. Initially, the limiting factor was compute power. As this became more accessible, attention shifted to developing the right architectures, like the transformer, to effectively utilize that compute. Subsequently, acquiring and processing the massive datasets required for pre-training became a hurdle. Post-training methods like RLHF and RLVR (Reinforcement Learning from Verifiable Rewards) then emerged to refine models, but this also presented data scarcity challenges. Today, the primary bottleneck is evolving towards 'continual learning.' This refers to a model's ability to learn from extremely sparse rewards and real-world interactions, much like humans do. For instance, a single experience of touching a hot stove can teach a person a lasting lesson, a capability current AI models largely lack. The next frontier involves making AI vastly more data-efficient, enabling it to learn from individual interactions rather than requiring massive datasets or prolonged training. This capability is crucial for deploying AI in dynamic, real-world enterprise environments where continuous adaptation is key.

The dominance of code and math in advanced AI training

The focus on software engineering and code as a primary frontier for AI advancement is deliberate, largely owing to the verifiable rewards required by training techniques like RLVR. Unlike many tasks in natural language or vision, code and mathematical problems offer deterministic outcomes that can be objectively checked. Compiling code, running unit tests, or solving equations provide clear, binary signals of success or failure, which are essential for the reinforcement learning process to 'climb the hill' of performance. This inherent verifiability makes these domains ideal for 'eval-maxing,' where AI labs create training pipelines that directly mirror evaluation benchmarks, allowing models to improve iteratively and reach high performance. Furthermore, the vast amount of code available online serves as a rich source of training data, and many researchers see coding itself as a fundamental, almost 'AGI-complete' task that can serve as a general language for AI to interact with the world and execute complex instructions.
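
The idea of a binary, verifiable reward can be illustrated with a minimal sketch. This is not any lab's actual pipeline: the function name `verifiable_reward`, the test-case format, and the toy `add` task are all hypothetical, and a production system would run candidate code in a sandbox rather than calling `exec` directly.

```python
from typing import Iterable, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected return value)


def verifiable_reward(candidate_source: str, func_name: str, tests: Iterable[TestCase]) -> float:
    """Return 1.0 if the candidate defines `func_name` and passes every test, else 0.0."""
    namespace: dict = {}
    try:
        # NOTE: exec on untrusted model output is unsafe; a real pipeline would sandbox this.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        for args, expected in tests:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, a missing function, or runtime errors all count as failure.
        return 0.0
    return 1.0


# Hypothetical task: the model is asked to write `add(a, b)`.
tests = [((2, 3), 5), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
print(verifiable_reward(good, "add", tests))
print(verifiable_reward(bad, "add", tests))
```

Because the reward depends only on whether the tests pass, no human judgment is needed to score an attempt, which is what makes this kind of signal cheap to generate at scale.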

Pre-training versus post-training: Compute economics and data scarcity

The landscape of AI model training is broadly divided into pre-training and post-training. Pre-training involves training a model on a massive corpus of data, often internet-scale, to develop a foundational understanding of patterns and knowledge. This phase is incredibly compute-intensive, with models like DeepSeek V3 requiring approximately 2.5 million H800 hours of compute, a significant capital expenditure. In contrast, post-training, which includes methods like fine-tuning, RLHF, and RLVR, aims to align and specialize the pre-trained model for specific tasks or to adhere to safety guidelines. This phase is considerably more data- and compute-efficient. For example, the RL training for DeepSeek R1 used about 150,000 hours of compute, roughly 6% of the pre-training budget. However, the scarcity of high-quality, diverse data for both pre-training and post-training is becoming a critical issue, pushing research towards architectural innovations that use existing data more effectively and towards synthetic data generation techniques.
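
The compute comparison is simple arithmetic, sketched below using only the figures quoted in this summary. Dividing them gives about 6%, so the RL share is best read as a rough order of magnitude. The sketch assumes the R1 figure is in GPU hours comparable to the H800 hours quoted for V3, which the summary does not state; the constant and function names are illustrative.

```python
# Figures as quoted in this summary; the GPU type for the RL run is an assumption.
PRETRAIN_GPU_HOURS_LOW = 2_400_000   # DeepSeek V3, lower quoted bound
PRETRAIN_GPU_HOURS_HIGH = 2_500_000  # DeepSeek V3, upper quoted bound
RL_GPU_HOURS = 150_000               # DeepSeek R1 RL training


def rl_share(rl_hours: float, pretrain_hours: float) -> float:
    """Fraction of the pre-training compute budget consumed by RL training."""
    if pretrain_hours <= 0:
        raise ValueError("pretrain_hours must be positive")
    return rl_hours / pretrain_hours


share_vs_high = rl_share(RL_GPU_HOURS, PRETRAIN_GPU_HOURS_HIGH)
share_vs_low = rl_share(RL_GPU_HOURS, PRETRAIN_GPU_HOURS_LOW)
print(f"RL share of pre-training compute: {share_vs_high:.2%} to {share_vs_low:.2%}")
```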

Specialization for enterprises: Applied Compute's approach

The core insight behind Applied Compute, founded by former OpenAI researchers, is that while general-purpose foundation models (like GPT-4) set a baseline, true differentiation for enterprises lies in specializing these models for their unique needs. Companies possess vast amounts of proprietary data that general models, despite their intelligence, do not understand. Applied Compute bridges this gap by training specialized models that enhance specific business functions. A key example is their work with DoorDash. To onboard merchants, DoorDash requires accurate extraction and formatting of menu information, adhering to strict style guides for modifiers, add-ons, and item attachments. General models struggled with this task. Applied Compute developed a solution by allowing humans to correct the model's output, creating a feedback loop to directly optimize for reducing error rates. This approach bypasses complex prompting and directly targets desired business outcomes, offering significant ROI by reducing manual effort and improving data quality. This specialization is crucial for enterprises to gain a competitive edge.
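
The summary does not say how the human corrections are turned into a training signal, so the following is only a hypothetical sketch of one way such a feedback loop could score an extraction: compare the model's structured output with the human-corrected version field by field and use one minus the error rate as the reward. The field names and functions are invented for illustration.

```python
def field_error_rate(model_output: dict, human_corrected: dict) -> float:
    """Fraction of fields in the human-corrected record that the model got wrong or omitted.

    Extra fields present only in the model output are not penalized in this sketch.
    """
    if not human_corrected:
        return 0.0
    wrong = sum(
        1 for key, value in human_corrected.items() if model_output.get(key) != value
    )
    return wrong / len(human_corrected)


def reward_from_correction(model_output: dict, human_corrected: dict) -> float:
    """Higher is better: 1.0 means the human reviewer changed nothing."""
    return 1.0 - field_error_rate(model_output, human_corrected)


# Hypothetical menu item before and after human review.
model_output = {"name": "Cheeseburger", "price": "8.99", "modifier": "extra cheese"}
human_corrected = {"name": "Cheeseburger", "price": "8.99", "modifier": "Extra Cheese (+$1.00)"}
print(reward_from_correction(model_output, human_corrected))
```

A signal of this kind targets the business metric directly (how much a reviewer has to fix) instead of relying on prompt engineering to encode the style guide.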

The future of AI training: Continual learning and architectural innovation

Looking ahead, the focus is shifting towards enabling AI models to learn continuously and efficiently in real-world deployments. Continual learning addresses how deployed AI systems can adapt over time by learning from sparse feedback and downstream consequences. This is a gradual process that hinges on obtaining the right telemetry and understanding context. Cursor's 'Composer' model is one example: it uses user interactions (accepting or reverting code suggestions) as implicit rewards to update the model gradually over days or weeks. Applied Compute's 'Context Base' initiative uses offline agents to analyze past interactions and documents and extract learnings that improve downstream performance, demonstrating significant gains in reasoning capabilities with fewer tokens. Innovations are expected across weight updates, context management, and the 'harness' (the infrastructure enabling AI interactions). Transformer architectures are expected to remain dominant due to existing infrastructure and scaling success. Research into non-transformer models persists, but the consensus leans towards scaling current technologies and trusting future AI to potentially discover superior architectures.
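
The step from user telemetry to implicit rewards can be sketched as follows. This is an illustrative assumption about what such a pipeline might look like, not a description of Cursor's system: the action names, reward values, and class names are all hypothetical, and the actual model update is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical mapping from user actions to implicit rewards.
ACTION_REWARD = {"accepted": 1.0, "reverted": -1.0, "ignored": 0.0}


@dataclass(frozen=True)
class SuggestionEvent:
    prompt: str
    suggestion: str
    action: str  # expected to be one of ACTION_REWARD's keys


@dataclass(frozen=True)
class TrainingExample:
    prompt: str
    suggestion: str
    reward: float


def to_training_examples(events: Iterable[SuggestionEvent]) -> List[TrainingExample]:
    """Turn raw telemetry into reward-labeled examples, skipping unknown actions."""
    examples = []
    for event in events:
        reward = ACTION_REWARD.get(event.action)
        if reward is None:
            continue  # unrecognized telemetry is dropped rather than guessed at
        examples.append(TrainingExample(event.prompt, event.suggestion, reward))
    return examples


def mean_reward(examples: List[TrainingExample]) -> float:
    """Average implicit reward, a rough health signal to track between updates."""
    return sum(e.reward for e in examples) / len(examples) if examples else 0.0


events = [
    SuggestionEvent("fix off-by-one", "for i in range(n):", "accepted"),
    SuggestionEvent("rename var", "x = y", "reverted"),
    SuggestionEvent("add docstring", '"""Doc."""', "accepted"),
    SuggestionEvent("misc", "pass", "hovered"),
]
batch = to_training_examples(events)
print(len(batch), mean_reward(batch))
```

The rewards here are sparse and noisy (a revert may reflect a change of mind, not a bad suggestion), which is part of why learning from this kind of signal is described as slow and gradual.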

The economics of models: Inference, chips, and data markets

The AI industry is characterized by massive investments, particularly in compute and specialized hardware. NVIDIA, a key chip provider, commands significant margins, leading to questions about whether major AI labs might invest in in-house chip design to capture more value, especially given the immense sums spent on compute. While this is a difficult undertaking, it represents a potential disruption to the current chip provider model. On the data front, the market is more challenging. As AI models become more adept at specific tasks through RL, creating new, challenging tasks for further improvement becomes harder and more expensive. Furthermore, the increasing capability of AI itself to generate synthetic data, especially for tasks that don't require human judgment (like code verification), might reduce reliance on human-labeled datasets. Founders in this space need to be agile, pivoting towards emerging data needs like robotics or egocentric data to stay relevant, as AI's capacity for data generation and task creation evolves.

Compute Cost Comparison: Pre-training vs. RL Training

Data extracted from this episode

Training Type	Example Model	Compute Hours
Pre-training	DeepSeek V3	2.4-2.5 million H800 hours
RL Training	DeepSeek R1	150K hours

Common Questions

Yash Patil, founder of Applied Compute, started his AI journey at Stanford, briefly worked in research at OpenAI, and was motivated by the release of ChatGPT to pursue a career in AI model development.

Topics

Reinforcement Learning, AI & Machine Learning, Technology & Innovation, Business & Entrepreneurship, Enterprise AI, Continual Learning, Transformer Architecture, AI Model Development, Data Efficiency, Compute Scarcity, Model Specialization

Mentioned in this video

Companies

Applied Compute

Company founded by Yash Patil, focusing on creating specialized AI models for enterprises using frontier technology.

OpenAI

Research organization where Yash Patil worked and gained insights for starting Applied Compute. Mentioned for its residency program and contribution to AI development.

Codex

An agentic AI coding research project that evolved from work at OpenAI.

DoorDash

A customer of Applied Compute that uses its services to specialize models for merchant onboarding and menu extraction.

Cognition

A company that uses Applied Compute's models for real-time bug catching in code development.

NVIDIA

The dominant provider of compute hardware (chips) for AI training, with high profit margins, suggesting potential for labs to develop in-house solutions.

Ramp Labs

A client that trained an RL model for fast search within spreadsheets, showcasing product enhancement through specialized AI.

People

Yash Patil

Founder and CEO of Applied Compute, formerly part of the post-training team at OpenAI research.

Sam Altman

Mentioned for his support of young entrepreneurs and his role in connecting Yash Patil to OpenAI.

Ilya Sutskever

Mentioned as a prominent figure in the transformer debate who believes in alternative architectures.

Yann LeCun

Mentioned as a prominent figure in the transformer debate, advocating for architectures that do not require pre-training-scale data.

Software & Apps

ChatGPT

The release of ChatGPT in late 2022 was a pivotal moment that motivated Yash Patil to pursue work at OpenAI.

OpenAI residency

A program at OpenAI that helps individuals transition into full-time roles, which Yash Patil took part in.

GPT-3

Mentioned as the first model that exhibited a level of general intelligence, building on scaling laws.

GPT-4

Described as a significant step-change in model quality.

AlexNet

A pivotal moment in deep learning: GPUs and large datasets delivered much better performance, at the cost of models whose internals are less well understood.

Transformer

A novel neural network architecture that enabled scaling of language model training through self-attention, leading to improved performance on GPUs.

DeepSeek V3

A model trained on approximately 2.4-2.5 million H800 hours for pre-training.

DeepSeek R1

A model that underwent RL training using about 150K hours.

Windsurf

Mentioned alongside Cognition as a client benefiting from Applied Compute's specialized AI for bug catching.

Cursor

An AI coding tool that applies continual learning by capturing telemetry and performing online training steps based on user feedback.

Composer

Cursor's coding model, trained on the company's proprietary data on top of an open-source model and updated through continual learning.

Mamba

A non-transformer architecture mentioned as a potential competitor to the dominant transformer models.

Image Duo

Yash Patil's favorite AI product, useful for visual design tasks and creating walkthroughs, especially for those without design expertise.

Concepts

Chinchilla scaling laws

These laws demonstrated the compute-optimal way to scale models by increasing both parameter size and training data.

Deep Learning

A machine learning method that allows learning underlying representations from data, often involving large datasets and significant compute.

Organizations

Google Brain

Researchers at Google Brain developed the Transformer architecture.
