Cooking with OpenAI’s Research Chief: AGI, o1, Evals, and Scaling Laws — Mark Chen

Latent Space Podcast
Science & Technology | 7 min read | 42 min video
Jun 25, 2026|1,776 views|71|5

TL;DR

OpenAI's research chief believes scaling laws are still valid and crucial for AGI, despite skepticism, and emphasizes the importance of replication and well-designed evals over just benchmark scores.

Key Insights

1. OpenAI's Chief Research Officer, Mark Chen, remains a strong believer in the continued validity of scaling laws, stating there's no reason they shouldn't continue to hold, as they have for nearly ten orders of magnitude.

2. Research taste, crucial for AI development, is best developed through rigorous replication of existing papers, aiming to match exact training curves and losses. Chen says this practice taught him many overlooked techniques.

3. Reinforcement learning (RL) traditionally faces challenges in subjective fields like creative writing, where expert opinions can vary widely. It is more effective in objective domains such as math and computer science, where correctness is binary.

4. The field faces an 'evals crisis' due to a low number of canonical benchmarks and the potential for models to 'overfit' onto existing distributions, necessitating continuous innovation in evaluation methods.

5. OpenAI deliberately separates teams developing evals from those optimizing models to prevent co-incentivization and maintain an adversarial process where evaluators strive to create genuinely challenging tests.

6. While models excel at complex tasks like solving IMO problems, they still struggle with mundane human-like capabilities, partly due to a lack of broader context and biological wiring, though long-context learning and engineering shortcuts are being developed.

Scaling laws and the pre-training paradigm remain central to AGI development

Mark Chen, OpenAI's Chief Research Officer, firmly believes in the enduring power of scaling laws, stating there's no reason they should cease to hold, given their consistent validity across nearly ten orders of magnitude of development. He dismisses claims that pre-training is dead, characterizing such narratives as recurring skepticism that has historically been overcome by engineering improvements and new research insights. Chen likens the current advancements to previous phases where bottlenecks were identified and then surpassed, suggesting that continued careful research, data engineering, and scaling will unlock further capabilities. This perspective underpins OpenAI's strategy, emphasizing that fundamental research, even when challenged, is key to pushing AI frontiers. The act of cooking together, a hands-on, real-world activity, serves as a lighthearted backdrop to a serious discussion about the complex research landscape.
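To make the scaling-law claim concrete, here is a minimal sketch of how such a law is typically checked: fit a straight line to loss versus compute in log-log space. The power-law form `loss = a * compute ** -b`, the function name `fit_power_law`, and the synthetic data are all assumptions for illustration. The episode summary does not give a formula or any numbers, and real fits usually also include an irreducible-loss term that is omitted here.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute).

    Returns (a, b) for the assumed form loss = a * compute ** -b.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic (hypothetical) data: ten compute budgets, each 10x the previous one.
compute = [10.0 ** k for k in range(10)]
loss = [5.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A law "holding" in this sense means new, larger runs keep landing near the line fitted to smaller ones. A run that lands well above the line is the kind of bottleneck the discussion says has historically been resolved by engineering and research improvements.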

Developing research taste through replication and learning from the past

For aspiring AI researchers, particularly those without formal training, Chen highlights replication as the most effective mechanism for developing 'research taste.' He emphasizes replicating papers to match exact training curves and loss metrics, a process that reveals crucial, often unstated, techniques. Chen’s own journey into AI was inspired by AlphaGo's matches and subsequently led him to work on Deep Q-Networks (DQN). He notes that the current era feels like witnessing 'move 37s' across various fields, indicating rapid, profound advancements. This sentiment is echoed by the observation that many professionals are now realizing AI agents can perform long-horizon, meaningful work in their domains, signifying a paradigm shift in human-AI collaboration and task execution.
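One way to make "matching exact training curves" checkable is to compare the replicated loss curve against the published one step by step. The sketch below is illustrative only: the helper names, the 2% tolerance, and the loss values are hypothetical and do not come from the episode.

```python
def max_relative_gap(reference, replicated):
    """Largest per-step relative difference between two loss curves."""
    if len(reference) != len(replicated):
        raise ValueError("curves must be logged at the same steps")
    return max(abs(r - p) / abs(r) for r, p in zip(reference, replicated))

def first_divergence(reference, replicated, tolerance=0.02):
    """Index of the first step whose relative gap exceeds `tolerance`, else None."""
    for step, (r, p) in enumerate(zip(reference, replicated)):
        if abs(r - p) / abs(r) > tolerance:
            return step
    return None

# Hypothetical losses logged at the same four checkpoints.
reference = [4.0, 3.0, 2.5, 2.2]
replicated = [4.0, 3.03, 2.5, 2.42]

print(max_relative_gap(reference, replicated))
print(first_divergence(reference, replicated))
```

Locating the first step where the curves diverge narrows the search for the unstated detail, such as a learning-rate schedule or data-ordering choice, that the replication is missing.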

Navigating RL's challenges in subjective domains and the 'evals crisis'

Reinforcement learning (RL), traditionally powerful in objective tasks, faces headwinds in domains where outcomes are subjective, such as creative writing, where expert opinions can differ significantly. This makes grading and direct application difficult. In contrast, RL excels in fields like mathematics and computer science where correctness is clearly defined. This distinction feeds into the broader 'evals crisis' affecting the field. Because current models can surpass top human performance, even on benchmarks like IMO questions, it is unclear how to evaluate superhuman intelligence. The scarcity of canonical, gold-standard benchmarks, coupled with the risk of models overfitting to existing evaluation distributions, makes newly developed evaluation methods crucial. Tools like Codex have been instrumental in enabling rapid, high-quality eval creation, allowing for faster iteration and a better understanding of model capabilities in real-world, long-horizon tasks.
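The contrast between objective and subjective domains comes down to whether a reward can be computed by a simple check. Below is a minimal sketch of a binary reward for short math answers. The function names and the normalization rules are assumptions for illustration, not a description of how OpenAI grades answers.

```python
from fractions import Fraction

def normalize(answer: str):
    """Parse an answer as an exact rational if possible, else as lowercase text."""
    text = answer.strip().replace(" ", "")
    try:
        return Fraction(text)  # accepts forms like "3", "0.5", "1/2"
    except (ValueError, ZeroDivisionError):
        return text.lower()

def binary_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 when the answers match after normalization, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0

print(binary_reward("0.5", "1/2"))
print(binary_reward("3", "4"))
```

No comparable check exists for a short story or an essay: two expert graders can reasonably assign different scores, so the reward signal itself becomes noisy.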

The rationale behind OpenAI's research bets and organizational structure

OpenAI strategically allocates compute to key research bets, with a high-level roadmap remaining stable to provide direction, while implementation details evolve. This roadmap encompasses foundational areas: pre-training for world knowledge, RL for reasoning and insight chaining, and alignment/post-training. The company actively seeks new bets that unlock different or more aggressive scaling properties. To manage this, OpenAI focuses its bets, typically on three to five core initiatives per 'org,' empowering managers with both directed compute for major bets and flexible pools for exploration. This approach balances top-down strategic vision, driven by respected research leaders, with bottom-up innovation, where researchers can bring compelling evidence for new directions. Decisions on resource allocation are critical, prompting regular reviews to ensure compute and talent are applied to the highest-priority areas.

Identifying research potential and the value of 'research taste'

Identifying potential researchers is a challenging but crucial task. While past research output is a primary indicator, experienced managers develop an intuition for a candidate's thinking process, the nature of their ideas, and whether their intuition aligns with established research directions. This 'gut check' is difficult to assess fully at the outset, with clear trajectories often emerging within six to twelve months. OpenAI recognizes diverse forms of impact, from those who execute clear ideas swiftly to those who propose ambitious, 'moonshot' concepts that fundamentally shift perspectives. The distinction between top engineers and top researchers lies in the inherent uncertainty of research. While engineering principles can follow known patterns, research success hinges on 'research taste'—the ability to identify promising directions, articulate their value, and integrate them into the core research strategy.

The challenge of evals: avoiding benchmark overfitting and cultivating new methods

A significant challenge in AI development is avoiding 'benchmark maxing,' where models become overfitted to specific evaluation distributions, leading to an inaccurate representation of their true generalization capabilities. Chen emphasizes the need to operate across diverse and representative eval mixtures and to continuously invest in creating new evaluation methods. The philosophy at OpenAI is that once an eval is widely known, it is no longer truly effective. To combat this, they partner with external organizations to develop novel, high-quality evaluations, particularly in difficult areas like math and science. A key strategy is to separate the teams responsible for creating evals from those optimizing the models. This adversarial approach ensures that evaluators are incentivized to create challenging benchmarks, preventing the self-deception that can arise from internal optimization.

The evolving role of researchers: orchestration and the future of idea generation

The landscape of AI research is shifting towards orchestration, where models are increasingly capable of implementation and execution, placing greater value on human researchers' ability to generate novel ideas. While both idea generation and execution remain important, there's a market shift favoring the conception of many ideas, with models handling the execution. This marks a significant evolution in research methodology. However, there's a recognition that models currently lack 'taste'—the intuitive judgment for what constitutes a good research idea—making the researcher's role in ideation critical. While acknowledging that models may eventually develop taste, the immediate benefit lies in accelerating research through automated execution and orchestration, enabling a more efficient pace toward AGI.

Embracing failure and the importance of post-mortems

OpenAI’s strategy involves taking significant high-risk bets, which inherently means some will not pan out. A crucial part of their 'alpha' is the ability to learn from these failures. When a bet doesn't succeed, it's vital to avoid self-delusion and disengage from the idea. This involves retrospective analysis: identifying whether the idea was less important than initially thought, whether a better approach emerged, or whether discoveries were made that were not directly related to the original goal. Even unsuccessful research efforts yield valuable insights. Write-ups from failed projects often become important resources, leading to ideas that can be built upon and saving others from repeating the same work. This positive view of failure is balanced by the expectation that researchers eventually deliver impactful contributions: ambitious, riskier bets are justified by periodic major successes grounded in sound ideas.

Common Questions

Mark Chen suggests that replicating existing research papers is the best way to develop research taste and learn practical techniques. While formal training is valuable, the ability to creatively problem-solve and think outside the box is crucial.
