Scaling Test Time Compute to Multi-Agent Civilizations — Noam Brown, OpenAI
Key Moments
OpenAI's Noam Brown discusses AI scaling, reasoning paradigms, and the future of multi-agent systems and AGI.
Key Insights
AI progress benefits significantly from both larger language models and advanced reasoning paradigms, moving beyond simple pre-training scaling.
The System 1/System 2 analogy for AI thinking is useful but imperfect, particularly concerning the prerequisite base capabilities for System 2 to be effective.
AI safety and alignment are critical, with steerability and controllable systems like Cicero offering promising approaches.
While coding and math are easily verifiable domains, AI is also proving capable in less well-defined areas like deep research, offering existence proofs that success is possible beyond verifiable tasks.
Self-play can lead to superhuman performance in zero-sum games (like Go, Chess), but its application to multi-agent, collaborative, or non-zero-sum scenarios is more complex and requires new objective functions.
The future of AI may involve multi-agent civilizations that cooperate and compete, mirroring human civilizational progress to achieve greater intelligence.
FROM DIPLOMACY CHAMPION TO AI RESEARCHER
Noam Brown, renowned for his work on Cicero, a top-tier AI for the game Diplomacy, shares insights from his journey. Immersing himself in the game to debug Cicero improved his own play, culminating in winning the 2025 World Diplomacy Championship. He notes that while AI didn't play for him directly, Cicero's novel strategies inspired his human play, highlighting the 'centaur' model of human-AI collaboration. Brown also touches on the early challenges with Cicero's language model, which would occasionally 'hallucinate' or produce bizarre outputs, a problem now significantly reduced in more advanced models.
THE EVOLUTION OF REASONING AND THINKING PARADIGMS
The conversation delves into the 'thinking fast and slow' paradigm, comparing AI's System 1 (fast, intuitive) and System 2 (slow, deliberate reasoning) capabilities. Brown emphasizes that this analogy has limitations: AI models require a certain baseline capability before 'System 2' reasoning techniques, like chain-of-thought, provide significant benefits. He likens this to the brain needing cortical development before higher cognitive functions can emerge. This necessity for a foundational level of intelligence is crucial for reasoning paradigms to yield results, as seen when early models failed to benefit from these techniques.
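As a concrete illustration of the technique (a minimal, hypothetical sketch; the question and exact wording are illustrative, not from the episode), chain-of-thought prompting changes only how the question is posed, eliciting intermediate reasoning before the final answer:

```python
# Minimal sketch of direct vs. chain-of-thought prompting. The question and
# phrasing here are hypothetical illustrations; the point is that the CoT
# variant elicits intermediate reasoning steps before the final answer -- a
# technique that pays off only once the base model is capable enough.
question = "A train travels 60 miles in 1.5 hours. What is its average speed?"

# Direct prompting: ask for the answer immediately.
direct_prompt = f"Q: {question}\nA:"

# Chain-of-thought prompting: nudge the model to reason step by step first.
cot_prompt = f"Q: {question}\nA: Let's think step by step."
```

With a sufficiently capable model, the second prompt tends to produce worked intermediate steps (60 / 1.5 = 40 mph) before the final answer; with a weak base model such as GPT-2, it yields no benefit.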
SCALING TEST-TIME COMPUTE AND THE LIMITS OF CURRENT AI
Brown discusses the concept of 'test-time compute,' where models deliberate longer to solve problems. While current LLMs can 'think' for minutes, scaling this to hours or days is a frontier. He contrasts this with zero-sum games like Go, where self-play converges to a minimax equilibrium, leading to superhuman performance. However, this self-play approach is less directly applicable to multi-agent, collaborative, or non-zero-sum scenarios like Diplomacy or open-ended math problems, where defining the objective function becomes significantly more complex. Computational cost and the inherently serial nature of iterative experiments also pose challenges to this scaling.
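The self-play dynamic can be sketched in a toy zero-sum setting. The following is a minimal, assumed illustration (regret matching on rock-paper-scissors, not code from the episode): two copies of the same learner play each other, and their average strategies converge toward the game's minimax equilibrium of (1/3, 1/3, 1/3).

```python
ACTIONS = 3  # 0 = rock, 1 = paper, 2 = scissors

# Payoff matrix from player 0's perspective: rows = player 0's action,
# columns = player 1's action. Zero-sum: player 1's payoff is the negation.
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]

def strategy_from_regrets(regrets):
    """Regret matching: play each action in proportion to its positive regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def expected_payoff(action, opp_strategy, player):
    """Expected payoff of `action` for `player` against a mixed opponent strategy."""
    if player == 0:
        return sum(PAYOFF[action][b] * opp_strategy[b] for b in range(ACTIONS))
    return sum(-PAYOFF[b][action] * opp_strategy[b] for b in range(ACTIONS))

def self_play(iterations=20000):
    """Two regret-matching learners play each other; returns average strategies."""
    # Seed player 0 with a slight bias so the dynamics leave the fixed point.
    regrets = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    strategy_sum = [[0.0] * ACTIONS for _ in range(2)]
    for _ in range(iterations):
        strats = [strategy_from_regrets(regrets[p]) for p in range(2)]
        for p in range(2):
            opp = strats[1 - p]
            baseline = sum(strats[p][a] * expected_payoff(a, opp, p)
                           for a in range(ACTIONS))
            for a in range(ACTIONS):
                regrets[p][a] += expected_payoff(a, opp, p) - baseline
                strategy_sum[p][a] += strats[p][a]
    # Average strategies approach the minimax equilibrium (1/3, 1/3, 1/3).
    return [[s / iterations for s in strategy_sum[p]] for p in range(2)]
```

The same principle, scaled up with function approximation and search, underlies superhuman play in Go and poker; the difficulty Brown highlights is that non-zero-sum settings lack such a clean objective to converge to.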
CHALLENGES AND FUTURE DIRECTIONS IN MULTI-AGENT SYSTEMS
OpenAI's multi-agent team, led by Brown, explores both collaborative and competitive AI interactions. He argues that human intelligence is not a narrow band but rather a broad spectrum amplified by millennia of cooperation and competition. Similarly, AI, currently akin to 'cavemen,' could achieve vastly greater intelligence by building a 'civilization' through sustained multi-agent interaction. This differs from traditional approaches, which Brown considers too heuristic, advocating for a more principled scaling approach akin to the 'bitter lesson'.
AI'S IMPACT ON SOFTWARE DEVELOPMENT AND PRODUCTIVITY
Tools like Codex and GPT-4 are transforming software development, with models capable of generating pull requests and handling complex tasks. Brown shares that he now uses these tools for nearly all his coding and finds them incredibly effective. He notes that magical-feeling moments of AI capability quickly lead users to identify its limitations and want further improvements. The key, he suggests, is to build systems that let models gain experience, moving beyond 'first day on the job' output to more seasoned performance without relying on excessive 'harnesses'.
BEYOND CODE: THE BROADENING APPLICABILITY OF AI
Brown anticipates AI's expansion beyond software engineering to encompass a wide array of remote work tasks, including those typically found on platforms like Upwork or in virtual assistant roles. For virtual assistants, aligned AI could potentially offer a more consistent and preference-aligned service than human agents, mitigating the principal-agent problem. He also touches upon the rapid progress in generative media, like Sora, and the ongoing research into different image generation techniques (autoregressive vs. diffusion), underscoring the dynamic and multi-faceted nature of AI advancement across various domains.
Common Questions
How did Noam Brown improve his own Diplomacy gameplay? By deeply understanding the game in order to debug Cicero, playing in tournaments, and observing the bot's unconventional strategies, which provided new insights into the game.
Mentioned in this video
Mentioned alongside AlphaGo as an example of AI achieving superhuman performance through pre-training, test-time compute, and self-play.
An IDE tool that Noam Brown uses daily, though he notes that his preferred model, o3, is not yet the default interface, requiring manual selection.
Used as a parallel to language model development, highlighting the stages of pre-training, large-scale test-time compute, and self-play in achieving superhuman performance.
An AI tool for coding that Noam Brown uses extensively for tasks ranging from research to generating pull requests, finding it effective and a valuable way to understand model limitations.
The model is mentioned in the context of the 'all you can eat' scaling paradigm and as a benchmark for reasoning capabilities. Its successor, GPT-4.5, is also discussed.
Mentioned in the context of playing games like tic-tac-toe, where it performs reasonably well but can make mistakes, suggesting a need for System 2 thinking for perfect play.
Noam Brown's preferred model for daily use, even replacing Google Search, highlighting its utility for web browsing and research. It's also suggested as a primary tool for coding.
Mentioned as an example of another AI model that could be benchmarked against Diplomacy.
A model that is discussed as passing the Turing test, with its capabilities improving since 2022. It's also mentioned in the context of agentic systems and conversational AI.
A generative AI model for creating videos from text, whose initial announcement was seen as magical and indicative of AGI, but now shows flaws upon closer inspection.
Mentioned in the context of text diffusion, alongside auto-regressive imaging, as an example of different directions in generative media research.
Mentioned as a model that likely would not have benefited from reasoning paradigms, highlighting the need for a certain baseline capability in models before applying techniques like chain-of-thought.
Mentioned as a prior iteration of language models, serving as a precursor to current capabilities and benchmarks.
A key figure at OpenAI with whom Noam Brown discussed AGI timelines, RL, and the importance of reasoning paradigms. His early emails emphasized the value of large experiments.
Author of 'The Art of Doing Science and Engineering,' cited for his ideas on technological shifts and the need to adapt workflows rather than simply replicating old tasks in new technology.
A podcast host on whose show Noam Brown has appeared previously, reflecting Brown's frequent podcast appearances.
An AI system developed by Noam Brown and OpenAI that performed in the top 10% of human players in Diplomacy. Its development influenced Brown's own playstyle and understanding of the game.
Mentioned for his work on Voyager skill library and a talk on physical Turing tests, highlighting his role as an educator in AI and robotics.
Former president of OpenAI, whom Noam Brown suggests asking about the long-term future of AI and how to steer it towards positive outcomes.
Used as an example of a game where a 'harness' might be needed for AI to perform well, but Noam Brown argues that improving the base model's capabilities is the better approach than relying on external harnesses.
A complex card game with imperfect information and a vast number of possible states, discussed as a challenging domain for AI that current poker AI techniques would not directly solve.
A benchmark that features difficult but easily gradable problems, which Noam Brown suggests limits the scope of AI evaluation to more common, measurable tasks rather than fuzzier, more complex ones.
A game frequently used as a reference point for AI development, particularly in areas like game theory, sample efficiency, and the trade-off between exploitative and optimal strategies.
Monte Carlo Tree Search, which DeepSeek reportedly did not find very useful in their setting, though engineers still try it as a search method in AI.
A social deduction game that Blood on the Clock Tower is compared to, highlighting its similar gameplay mechanics.
A prompting technique that improves reasoning in LLMs, discussed as a paradigm that requires a certain baseline model capability to be effective, as seen when applied to GPT-2 versus larger models.