Stanford MS&E435 Economics of the AI Supercycle | Spring 2026 | Enterprise Internal Knowledge

Stanford Online
Education | 7 min read | 49 min video
May 22, 2026 | 5,747 views

TL;DR

AI models are nearing a data cliff, forcing a pivot from internet-scale pre-training to more efficient, specialized post-training and 'continual learning' for future enterprise applications.

Key Insights

1. Deep learning's pivotal moment, AlexNet in 2012, enabled massive gains by scaling GPUs, data, and neural nets, but also led to models whose internals are poorly understood.

2. The current bottleneck in AI model development is not just data or compute, but the ability of models to learn continuously from sparse, real-world feedback, akin to human learning.

3. Code and math are currently favored for advanced AI training such as reinforcement learning from verifiable rewards (RLVR) because they offer deterministic, verifiable rewards, making them ideal for 'eval-maxing' and iterative improvement.

4. Pre-training models on internet-scale data requires vast compute (e.g., 2.5 million H800 hours for DeepSeek V3), whereas post-training like RL can use as little as 5% of that compute.

5. Applied Compute specializes AI models for enterprises like DoorDash and Cognition by fine-tuning general models to specific business needs, which is more cost-effective and faster than waiting for future monolithic AI advancements.

6. Transformers remain the dominant architecture, and scaling them is currently seen as more promising than exploring non-transformer alternatives, though research into architectures like Mamba continues.

7. Continual learning, exemplified by Cursor's 'Composer' model, involves gradual, iterative updates based on user interactions and implicit rewards, requiring days or weeks of training time per step but yielding significant performance gains.

The AI supercycle's foundational shift: Deep learning and transformer architectures

The journey of AI has seen exponential growth, particularly since the advent of deep learning, marked by the pivotal AlexNet moment around 2012. This era revolutionized AI by leveraging GPUs and massive datasets like ImageNet, enabling neural networks to learn complex representations from data. In language models, the transformer architecture, introduced in 2017, proved to be a game-changer. Its self-attention mechanism allowed for significantly better performance and scalability in processing long sequences of text compared to previous recurrent neural networks (RNNs) and LSTMs. Following this, the era of pre-training emerged, where models like GPT-3, trained on vast internet-scale text data, demonstrated emergent general intelligence. Key insights from scaling laws, such as the Chinchilla scaling laws, revealed that optimal performance requires balancing both model size (parameters) and the amount of training data. Finally, post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and preference tuning became crucial for aligning these powerful base models, which by default just predict the next token, to be more helpful, harmless, and honest. This progression culminated in models like GPT-4, which represented a significant step-change in quality and reasoning capabilities.
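The balance between model size and training data can be illustrated with a small calculation. The sketch below is not from the talk: it assumes the common approximation that training cost is about 6 FLOPs per parameter per token, and the widely quoted Chinchilla rule of thumb of roughly 20 training tokens per parameter. The function name and constants are illustrative choices.

```python
import math

# Assumptions (not from the talk): training FLOPs C ~ 6 * N * D, and the
# commonly quoted Chinchilla heuristic of about 20 tokens per parameter.
FLOPS_PER_PARAM_TOKEN = 6.0
TOKENS_PER_PARAM = 20.0


def compute_optimal_split(flops_budget: float) -> tuple[float, float]:
    """Return (parameters, tokens) that spend flops_budget under the heuristic."""
    if flops_budget <= 0:
        raise ValueError("flops_budget must be positive")
    # C = 6 * N * D with D = 20 * N gives C = 120 * N**2, so N = sqrt(C / 120).
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    for budget in (1e21, 1e23, 1e25):
        n, d = compute_optimal_split(budget)
        print(f"C={budget:.0e} FLOPs -> N={n:.2e} params, D={d:.2e} tokens")
```

Under these assumptions, both the parameter count and the token count grow with the square root of the compute budget, which is the sense in which size and data have to be scaled together.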

Bottlenecks in AI development: From compute to continual learning

The evolution of AI has been driven by overcoming various bottlenecks. Initially, the limiting factor was compute power. As this became more accessible, attention shifted to developing the right architectures, like the transformer, to effectively utilize that compute. Subsequently, acquiring and processing the massive datasets required for pre-training became a hurdle. Post-training methods like RLHF and RLVR (reinforcement learning from verifiable rewards) then emerged to refine models, but these also presented data scarcity challenges. Today, the primary bottleneck is shifting towards 'continual learning.' This refers to a model's ability to learn from extremely sparse rewards and real-world interactions, much like humans do. For instance, a single experience of touching a hot stove can teach a person a lasting lesson, a capability current AI models largely lack. The next frontier involves making AI vastly more data-efficient, enabling it to learn from individual interactions rather than requiring massive datasets or prolonged training. This capability is crucial for deploying AI in dynamic, real-world enterprise environments where continuous adaptation is key.

The dominance of code and math in advanced AI training

The focus on software engineering and code as a primary frontier for AI advancement is deliberate, largely owing to the nature of verifiable rewards required for sophisticated training techniques like RLVR. Unlike tasks in natural language or vision, code and mathematical problems offer deterministic outcomes that can be objectively checked. Compiling code, running unit tests, or solving equations provide clear, binary signals of success or failure, which are essential for the reinforcement learning process to 'climb the hill' of performance. This inherent verifiability makes these domains ideal for 'eval-maxing,' where AI labs can create training pipelines that directly mirror evaluation benchmarks. This allows models to iteratively improve and achieve high performance. Furthermore, the vast amount of code available online serves as a rich source of synthetic data, and many researchers see coding itself as a fundamental, almost 'AGI-complete' task that can serve as a general language for AI to interact with the world and execute complex instructions.
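To make the idea of a verifiable reward concrete, here is a minimal sketch of a binary reward for generated code: the candidate earns 1.0 only if it loads and passes every test case, and 0.0 otherwise. The function name `verifiable_reward` and the test-case format are hypothetical, and this is not any lab's actual pipeline. A real system would run candidates in a sandbox with time limits rather than calling `exec` directly.

```python
from typing import Any


def verifiable_reward(
    candidate_source: str,
    func_name: str,
    tests: list[tuple[tuple, Any]],
) -> float:
    """Return 1.0 if the candidate passes every (args, expected) test, else 0.0."""
    namespace: dict[str, Any] = {}
    try:
        # WARNING: exec on untrusted code is unsafe; shown only for illustration.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        for args, expected in tests:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime errors all score zero.
        return 0.0
    return 1.0


if __name__ == "__main__":
    tests = [((1, 2), 3), ((0, 0), 0)]
    good = "def add(a, b):\n    return a + b\n"
    bad = "def add(a, b):\n    return a - b\n"
    print(verifiable_reward(good, "add", tests))
    print(verifiable_reward(bad, "add", tests))
```

The reward is deterministic and needs no human judgment, which is the property that makes code and math convenient targets for this style of training.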

Pre-training versus post-training: Compute economics and data scarcity

The landscape of AI model training is broadly divided into pre-training and post-training. Pre-training involves training a model on a massive corpus of data, often internet-scale, to develop a foundational understanding of patterns and knowledge. This phase is incredibly compute-intensive, with models like DeepSeek V3 requiring approximately 2.5 million H800 hours of compute. This represents a significant capital expenditure. In contrast, post-training, which includes methods like fine-tuning, RLHF, and RLVR, aims to align and specialize the pre-trained model for specific tasks or to adhere to safety guidelines. This phase is considerably more data- and compute-efficient. For example, the RL training for DeepSeek R1 used about 150,000 hours of compute, roughly 5% of the pre-training budget. However, the scarcity of high-quality, diverse data for both pre-training and post-training is becoming a critical issue, pushing research towards architectural innovations that can utilize existing data more effectively and the development of synthetic data generation techniques.
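The compute comparison can be checked with a quick calculation using the figures quoted above. This sketch assumes both numbers are in comparable GPU-hour units, which the summary does not state explicitly for the RL figure. With these inputs the ratio comes out near 6%, in the same ballpark as the rough 5% cited.

```python
# Figures as quoted in the summary; units assumed comparable (GPU-hours).
PRETRAIN_GPU_HOURS = 2_500_000  # DeepSeek V3 pre-training (approximate)
RL_GPU_HOURS = 150_000          # DeepSeek R1 RL training (approximate)


def rl_share(rl_hours: float, pretrain_hours: float) -> float:
    """Return RL compute as a fraction of pre-training compute."""
    if pretrain_hours <= 0:
        raise ValueError("pretrain_hours must be positive")
    return rl_hours / pretrain_hours


if __name__ == "__main__":
    share = rl_share(RL_GPU_HOURS, PRETRAIN_GPU_HOURS)
    print(f"RL used {share:.1%} of the pre-training compute")
```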

Specialization for enterprises: Applied Compute's approach

The core insight behind Applied Compute, founded by former OpenAI researchers, is that while general-purpose foundation models (like GPT-4) set a baseline, true differentiation for enterprises lies in specializing these models for their unique needs. Companies possess vast amounts of proprietary data that general models, despite their intelligence, do not understand. Applied Compute bridges this gap by training specialized models that enhance specific business functions. A key example is their work with DoorDash. To onboard merchants, DoorDash requires accurate extraction and formatting of menu information, adhering to strict style guides for modifiers, add-ons, and item attachments. General models struggled with this task. Applied Compute developed a solution by allowing humans to correct the model's output, creating a feedback loop to directly optimize for reducing error rates. This approach bypasses complex prompting and directly targets desired business outcomes, offering significant ROI by reducing manual effort and improving data quality. This specialization is crucial for enterprises to gain a competitive edge.
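The feedback loop described above can be sketched as a simple scoring function: compare the model's extracted menu record with the human-corrected version and turn the error rate into a reward. This is an illustrative assumption about how such a signal could be computed, not Applied Compute's or DoorDash's actual method, and the field names are made up.

```python
def field_error_rate(model_output: dict[str, str], corrected: dict[str, str]) -> float:
    """Fraction of human-corrected fields that the model got wrong or omitted."""
    if not corrected:
        return 0.0
    wrong = sum(
        1 for key, value in corrected.items() if model_output.get(key) != value
    )
    return wrong / len(corrected)


def reward_from_correction(model_output: dict[str, str], corrected: dict[str, str]) -> float:
    """Higher reward for outputs that needed fewer human corrections."""
    return 1.0 - field_error_rate(model_output, corrected)


if __name__ == "__main__":
    # Hypothetical menu item: the human fixed only the modifier's capitalization.
    model = {"name": "Cheeseburger", "price": "8.99", "modifier": "extra cheese"}
    human = {"name": "Cheeseburger", "price": "8.99", "modifier": "Extra Cheese"}
    print(reward_from_correction(model, human))
```

Note that this sketch only scores fields present in the corrected record, so extra fields the model invented and a human deleted would need separate handling.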

The future of AI training: Continual learning and architectural innovation

Looking ahead, the focus is shifting towards enabling AI models to learn continuously and efficiently in real-world deployments. Continual learning aims to address how deployed AI systems can adapt over time by learning from sparse feedback and downstream consequences. This is a gradual process and hinges on obtaining the right telemetry and understanding context. Examples like Cursor's 'Composer' model showcase this by using user interactions (accepting/reverting code suggestions) as implicit rewards to update the model in near real-time over days or weeks. Applied Compute's 'Context Base' initiative uses offline agents to analyze past interactions and documents to extract learnings for downstream performance improvements, demonstrating significant gains in reasoning capabilities with fewer tokens. Innovations are expected across weight updates, context management, and the 'harness' (the infrastructure enabling AI interactions). While transformer architectures are expected to remain dominant due to existing infrastructure and scaling success, research into non-transformer models persists, though the consensus leans towards scaling current technologies, trusting future AI to potentially discover superior architectures.
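As a rough illustration of implicit rewards, the sketch below turns accept and revert events into a per-suggestion reward. The event schema, the reward values, and the function names are all hypothetical; this is not Cursor's actual telemetry format or training pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    suggestion_id: str
    action: str  # hypothetical values: "accept" or "revert"


# Assumed reward values; a real system would tune or learn these.
ACTION_REWARD = {"accept": 1.0, "revert": -1.0}


def implicit_rewards(events: list[Event]) -> dict[str, float]:
    """Sum implicit rewards per suggestion; unknown actions contribute 0."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        totals[event.suggestion_id] += ACTION_REWARD.get(event.action, 0.0)
    return dict(totals)


if __name__ == "__main__":
    log = [
        Event("s1", "accept"),
        Event("s1", "revert"),  # accepted, then undone: nets out to zero
        Event("s2", "accept"),
    ]
    print(implicit_rewards(log))
```

A suggestion that is accepted and later reverted nets to zero here, which hints at why downstream consequences matter as much as the first click.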

The economics of models: Inference, chips, and data markets

The AI industry is characterized by massive investments, particularly in compute and specialized hardware. NVIDIA, a key chip provider, commands significant margins, leading to questions about whether major AI labs might invest in in-house chip design to capture more value, especially given the immense sums spent on compute. While this is a difficult undertaking, it represents a potential disruption to the current chip provider model. On the data front, the market is more challenging. As AI models become more adept at specific tasks through RL, creating new, challenging tasks for further improvement becomes harder and more expensive. Furthermore, the increasing capability of AI itself to generate synthetic data, especially for tasks that don't require human judgment (like code verification), might reduce reliance on human-labeled datasets. Founders in this space need to be agile, pivoting towards emerging data needs like robotics or egocentric data to stay relevant, as AI's capacity for data generation and task creation evolves.

Compute Cost Comparison: Pre-training vs. RL Training

Data extracted from this episode

Training Type | Example Model | Compute Hours
Pre-training | DeepSeek V3 | 2.4-2.5 million H800 hours
RL Training | DeepSeek R1 | 150K hours

Common Questions

Yash Patil, founder of Applied Compute, started his AI journey at Stanford, worked briefly in research at OpenAI, and was motivated by the release of ChatGPT to pursue a career in AI model development.

Mentioned in this video

Software & Apps
ChatGPT

The release of ChatGPT in late 2022 was a pivotal moment that motivated Yash Patil to pursue work at OpenAI.

OpenAI residency

A program at OpenAI that helps individuals transition into full-time roles, which Yash Patil took part in.

GPT-3

Mentioned as the first model that exhibited a level of general intelligence, building on scaling laws.

GPT-4

Described as a significant step-change in model quality.

AlexNet

A pivotal moment in deep learning, marking a shift toward models whose internals are less well understood but which achieve better performance using GPUs and large datasets.

Transformer

A novel neural network architecture that enabled scaling of language model training through self-attention, leading to improved performance on GPUs.

DeepSeek V3

A model trained on approximately 2.4-2.5 million H800 hours for pre-training.

DeepSeek R1

A model that underwent RL training using about 150K hours.

Windsurf

Mentioned alongside Cognition as a client benefiting from Applied Compute's specialized AI for bug catching.

Cursor
An AI coding tool whose model utilizes continual learning by capturing telemetry and performing online training steps based on user feedback.

Composer

Cursor's coding model, trained on their proprietary data on top of an open-source model and leveraging continual learning.

Mamba

A non-transformer architecture mentioned as a potential competitor to the dominant transformer models.

Image Duo

Yash Patil's favorite AI product, useful for visual design tasks and creating walkthroughs, especially for those without design expertise.
