
Scaling Test Time Compute to Multi-Agent Civilizations — Noam Brown, OpenAI

Latent Space Podcast
Science & Technology · 4 min read · 78 min video
Jun 19, 2025
TL;DR

OpenAI's Noam Brown discusses AI scaling, reasoning paradigms, and the future of multi-agent systems and AGI.

Key Insights

1. AI models benefit significantly from larger language models and advanced reasoning paradigms, moving beyond simple scaling.

2. The System 1/System 2 analogy for AI thinking is useful but imperfect, particularly regarding the baseline capabilities a model needs before System 2 reasoning becomes effective.

3. AI safety and alignment are critical, with steerability and controllable systems like Cicero offering promising approaches.

4. While coding and math are readily verifiable domains, AI is proving capable in less well-defined areas like deep research, providing existence proofs that success there is possible.

5. Self-play can lead to superhuman performance in zero-sum games (like Go, Chess), but its application to multi-agent, collaborative, or non-zero-sum scenarios is more complex and requires new objective functions.

6. The future of AI may involve multi-agent civilizations that cooperate and compete, mirroring human civilizational progress to achieve greater intelligence.

FROM AI RESEARCHER TO DIPLOMACY CHAMPION

Noam Brown, renowned for his work on Cicero, a top-tier AI for the game Diplomacy, shares insights from his journey. His deep involvement in the game to debug Cicero led to personal improvement, culminating in winning the 2025 World Diplomacy Championship. He notes that while AI didn't directly play for him, inspiration from Cicero's novel strategies influenced his human play, highlighting the 'centaur' model of human-AI collaboration. Brown also touches on the early challenges with Cicero's language model, which would occasionally "hallucinate" or produce bizarre outputs, a problem now significantly reduced with more advanced models.

THE EVOLUTION OF REASONING AND THINKING PARADIGMS

The conversation delves into the 'thinking fast and slow' paradigm, comparing AI's System 1 (fast, intuitive) and System 2 (slow, deliberate reasoning) capabilities. Brown emphasizes that this analogy has limitations: AI models require a certain baseline capability before 'System 2' reasoning techniques, like chain-of-thought, provide significant benefits. He likens this to the brain needing cortical development before higher cognitive functions can emerge. This necessity for a foundational level of intelligence is crucial for reasoning paradigms to yield results, as seen when early models failed to benefit from these techniques.

SCALING TEST-TIME COMPUTE AND THE LIMITS OF CURRENT AI

Brown discusses the concept of 'test-time compute,' where models deliberate longer to solve harder problems. While current LLMs can 'think' for minutes, scaling this to hours or days is a frontier. He contrasts this with zero-sum games like Go, where self-play converges to a minimax equilibrium, leading to superhuman performance. This self-play approach, however, is less directly applicable to multi-agent, collaborative, or non-zero-sum scenarios like Diplomacy or math problems, where defining the objective function becomes significantly harder. Computational cost and the inherently serial nature of iterative experiments also limit how far this scaling can currently be pushed.
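One concrete way to picture spending more test-time compute is self-consistency-style majority voting; this is a minimal toy sketch of my own, not a method described in the episode. A stochastic solver is sampled repeatedly and the most common answer returned, so accuracy climbs as more samples, i.e. more compute, are spent per question.

```python
import random
from collections import Counter

def noisy_solver(rng):
    # Toy stand-in for a stochastic model: the correct answer (42)
    # comes back 40% of the time, otherwise one of three distinct
    # wrong answers.
    return 42 if rng.random() < 0.4 else rng.choice([7, 13, 99])

def majority_vote(n_samples, rng):
    # More test-time compute = more samples; return the plurality answer.
    votes = Counter(noisy_solver(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
trials = 2000
accuracy = {}
for n in (1, 5, 25):
    accuracy[n] = sum(majority_vote(n, rng) == 42
                      for _ in range(trials)) / trials
    print(f"{n:2d} samples per question -> accuracy {accuracy[n]:.2f}")
```

The solver here is right only 40% of the time, yet the voted answer is right far more often; the pattern carries over whenever correct answers cluster while errors scatter.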

CHALLENGES AND FUTURES IN MULTI-AGENT SYSTEMS

OpenAI's multi-agent team, led by Brown, explores both collaborative and competitive AI interactions. He argues that human intelligence is not a narrow band but rather a broad spectrum amplified by millennia of cooperation and competition. Similarly, AI, currently akin to 'cavemen,' could achieve vastly greater intelligence by building a 'civilization' through sustained multi-agent interaction. This differs from traditional approaches, which Brown considers too heuristic, advocating for a more principled scaling approach akin to the 'bitter lesson'.
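Why self-play is so well-behaved in two-player zero-sum games, and why that guarantee evaporates in general multi-agent settings, can be seen in a toy sketch of my own (assuming fictitious play on rock-paper-scissors, not anything from the episode): a policy that repeatedly best-responds to its own empirical history converges to the minimax strategy.

```python
import numpy as np

# Rock-paper-scissors payoff for the row player. The matrix is
# skew-symmetric, so the game is zero-sum and self-play against
# one's own empirical history is well defined.
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

counts = np.ones(3)  # empirical action counts, uniform prior
for _ in range(30000):
    # Fictitious play: best-respond to the empirical mixture of
    # past self-play actions. In zero-sum games the empirical
    # frequencies converge to a minimax strategy.
    best_response = np.argmax(A @ (counts / counts.sum()))
    counts[best_response] += 1

print(counts / counts.sum())  # approaches the minimax strategy [1/3, 1/3, 1/3]
```

In a general-sum or many-player game there is no single payoff matrix whose maximin value pins down 'optimal' play, which is why Brown argues that new objective functions are needed for multi-agent training.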

AI'S IMPACT ON SOFTWARE DEVELOPMENT AND PRODUCTIVITY

Tools like Codex and GPT-4 are transforming software development, with models capable of generating pull requests and handling complex tasks. Brown shares that he now uses these tools for nearly all his coding and finds them highly effective. He notes that moments where AI capability feels magical quickly lead users to spot its limitations and want further improvements. The key, he suggests, is to build systems that let models gain experience, moving beyond their 'first day on the job' output to more seasoned performance without excessive 'harnesses'.

BEYOND CODE: THE BROADENING APPLICABILITY OF AI

Brown anticipates AI's expansion beyond software engineering to encompass a wide array of remote work tasks, including those typically found on platforms like Upwork or in virtual assistant roles. For virtual assistants, aligned AI could potentially offer a more consistent and preference-aligned service than human agents, mitigating the principal-agent problem. He also touches upon the rapid progress in generative media, like Sora, and the ongoing research into different image generation techniques (autoregressive vs. diffusion), underscoring the dynamic and multi-faceted nature of AI advancement across various domains.

Common Questions

How did Noam Brown improve his own Diplomacy gameplay?

He improved by deeply understanding the game while debugging Cicero, playing in tournaments, and observing the bot's unconventional strategies, which gave him new insights into the game.


Mentioned in this video

Software & Apps
AlphaZero

Mentioned alongside AlphaGo as an example of AI achieving superhuman performance through pre-training, test-time compute, and self-play.

Windsurf

An IDE tool that Noam Brown uses daily, though he notes that his preferred model, o3, is not yet the default interface, requiring manual selection.

AlphaGo

Used as a parallel to language model development, highlighting the stages of pre-training, large-scale test-time compute, and self-play in achieving superhuman performance.

Codex

An AI tool for coding that Noam Brown uses extensively for tasks ranging from research to generating pull requests, finding it effective and a valuable way to understand model limitations.

GPT-4

The model is mentioned in the context of the 'all you can eat' scaling paradigm and as a benchmark for reasoning capabilities. Its successor, GPT-4.5, is also discussed.

GPT-4.5

Mentioned in the context of playing games like tic-tac-toe, where it performs reasonably well but can make mistakes, suggesting a need for System 2 thinking for perfect play.

o3

Noam Brown's preferred model for daily use, even replacing Google Search, highlighting its utility for web browsing and research. It's also suggested as a primary tool for coding.

Llama

Mentioned as an example of another AI model that could be benchmarked against Diplomacy.

GPT-4o

A model that is discussed as passing the Turing test, with its capabilities improving since 2022. It's also mentioned in the context of agentic systems and conversational AI.

Sora

A generative AI model for creating videos from text, whose initial announcement was seen as magical and indicative of AGI, but now shows flaws upon closer inspection.

Gemini

Mentioned in the context of text diffusion, alongside autoregressive image generation, as an example of different directions in generative media research.

GPT-2

Mentioned as a model that likely would not have benefited from reasoning paradigms, highlighting the need for a certain baseline capability in models before applying techniques like chain-of-thought.

GPT-3

Mentioned as a prior iteration of language models, serving as a precursor to current capabilities and benchmarks.
