Key Moments

Scaling Test Time Compute to Multi-Agent Civilizations — Noam Brown, OpenAI

Latent Space Podcast
Science & Technology · 4 min read · 78 min video
Jun 19, 2025
TL;DR

OpenAI's Noam Brown discusses AI scaling, reasoning paradigms, and the future of multi-agent systems and AGI.

Key Insights

1. AI capability improves significantly with larger language models and advanced reasoning paradigms, moving beyond simple scaling.

2. The System 1/System 2 analogy for AI thinking is useful but imperfect, particularly because System 2 techniques only pay off once base capabilities reach a certain threshold.

3. AI safety and alignment are critical, with steerability and controllable systems like Cicero offering promising approaches.

4. Coding and math are easily verifiable domains, but AI is also proving capable in fuzzier areas like deep research, providing existence proofs that success there is possible.

5. Self-play can reach superhuman performance in zero-sum games (like Go and chess), but applying it to multi-agent, collaborative, or non-zero-sum settings is more complex and requires new objective functions.

6. The future of AI may involve multi-agent civilizations that cooperate and compete, mirroring human civilizational progress, to achieve greater intelligence.

FROM DIPLOMACY CHAMPION TO AI RESEARCHER

Noam Brown, renowned for his work on Cicero, the Diplomacy-playing AI he helped build at Meta AI that reached top-tier human performance, shares insights from his journey. The deep study of the game required to debug Cicero improved his own play, culminating in his winning the 2025 World Diplomacy Championship. He notes that while AI didn't play for him directly, Cicero's novel strategies inspired his human play, an example of the 'centaur' model of human-AI collaboration. Brown also touches on early challenges with Cicero's language model, which would occasionally "hallucinate" or produce bizarre outputs, a problem now greatly reduced in more advanced models.

THE EVOLUTION OF REASONING AND THINKING PARADIGMS

The conversation delves into the 'thinking fast and slow' paradigm, comparing AI's System 1 (fast, intuitive) and System 2 (slow, deliberate reasoning) capabilities. Brown emphasizes that this analogy has limitations: AI models require a certain baseline capability before 'System 2' reasoning techniques, like chain-of-thought, provide significant benefits. He likens this to the brain needing cortical development before higher cognitive functions can emerge. This necessity for a foundational level of intelligence is crucial for reasoning paradigms to yield results, as seen when early models failed to benefit from these techniques.
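As a concrete illustration of spending extra 'System 2' deliberation (a common technique known as self-consistency, not a method attributed to Brown in the episode): sample several chain-of-thought traces for the same question and majority-vote their final answers. The `sample_model` function below is a hypothetical stub standing in for a real LLM call.

```python
# Sketch of self-consistency over chain-of-thought samples: draw several
# reasoning traces and take a majority vote over the final answers.
from collections import Counter

def sample_model(question: str, seed: int) -> str:
    # Hypothetical stub: a real sampler would prompt the model to reason
    # step by step and parse out the final answer. Here we simulate a
    # sampler that is right most of the time but occasionally slips.
    return "42" if seed % 3 else "41"

def self_consistency(question: str, n_samples: int = 5) -> str:
    answers = [sample_model(question, seed=i) for i in range(n_samples)]
    # The majority answer is typically more reliable than a single sample.
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # majority vote over 5 samples: 42
```

The point matches Brown's caveat: voting over reasoning traces only helps if the base model's individual samples are already better than chance.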

SCALING TEST-TIME COMPUTE AND THE LIMITS OF CURRENT AI

Brown discusses the concept of 'test-time compute,' where models deliberate more to solve problems. While current LLMs can 'think' for minutes, scaling this to hours or days is a frontier. He contrasts this with zero-sum games like Go, where self-play converges to a minimax equilibrium, leading to superhuman performance. However, this self-play approach is less directly applicable to multi-agent, collaborative, or non-zero-sum scenarios like Diplomacy or math problems, where defining the objective function becomes significantly more complex. The limitations due to computational cost and the serial nature of iterative experiments also pose challenges to this scaling.
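The minimax convergence Brown describes can be made concrete on a toy zero-sum game. The sketch below (tic-tac-toe, chosen only because it is small enough to solve exactly; not from the episode) computes the game's equilibrium value, which under perfect play by both sides is a draw.

```python
# Minimax for tic-tac-toe: in a two-player zero-sum game, perfect play
# by both sides converges to the game's minimax value. For tic-tac-toe
# that equilibrium value is 0 (a draw).
from functools import lru_cache

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board: str):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board: str, player: str) -> int:
    """Value of `board` with `player` to move: +1 if X wins under
    perfect play, -1 if O wins, 0 for a draw."""
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0  # board full, no winner: draw
    nxt = "O" if player == "X" else "X"
    values = [minimax(board[:m] + player + board[m + 1:], nxt) for m in moves]
    return max(values) if player == "X" else min(values)

print(minimax(" " * 9, "X"))  # equilibrium value from the empty board: 0
```

This exhaustive solve is exactly what stops working at the scale of Go or Diplomacy, which is why self-play and learned value functions take over there.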

CHALLENGES AND FUTURES IN MULTI-AGENT SYSTEMS

OpenAI's multi-agent team, led by Brown, explores both collaborative and competitive AI interactions. He argues that human intelligence is not a narrow band but rather a broad spectrum amplified by millennia of cooperation and competition. Similarly, AI, currently akin to 'cavemen,' could achieve vastly greater intelligence by building a 'civilization' through sustained multi-agent interaction. This differs from traditional approaches, which Brown considers too heuristic, advocating for a more principled scaling approach akin to the 'bitter lesson'.

AI'S IMPACT ON SOFTWARE DEVELOPMENT AND PRODUCTIVITY

Tools like Codex and GPT-4 are transforming software development, with models capable of generating pull requests and handling complex tasks. Brown shares that he now uses these tools for nearly all of his coding and finds them remarkably effective. He notes that moments where AI capability feels magical quickly lead users to identify its limitations and desire further improvements. The key, he suggests, is to build systems that let models accumulate experience, moving beyond 'first day on the job' output to more seasoned performance without relying on elaborate 'harnesses'.

BEYOND CODE: THE BROADENING APPLICABILITY OF AI

Brown anticipates AI's expansion beyond software engineering to encompass a wide array of remote work tasks, including those typically found on platforms like Upwork or in virtual assistant roles. For virtual assistants, aligned AI could potentially offer a more consistent and preference-aligned service than human agents, mitigating the principal-agent problem. He also touches upon the rapid progress in generative media, like Sora, and the ongoing research into different image generation techniques (autoregressive vs. diffusion), underscoring the dynamic and multi-faceted nature of AI advancement across various domains.

Common Questions

How did Noam Brown improve his Diplomacy gameplay?

He improved by studying the game deeply in order to debug Cicero, playing in tournaments, and observing the bot's unconventional strategies, which gave him new insights into the game.

Topics

Mentioned in this video

Tool: AlphaZero

Mentioned alongside AlphaGo as an example of AI achieving superhuman performance through pre-training, test-time compute, and self-play.

Tool: Windsurf

An IDE that Noam Brown uses daily, though he notes that his preferred model, o3, is not the default and must be selected manually.

Concept: The Bitter Lesson

Rich Sutton's principle that general methods which leverage scale tend to outperform complex, hand-engineered solutions in AI research. It is cited as a guiding principle for OpenAI's approach to multi-agent systems.

Organization: DeepSeek

The AI lab cited for its finding that Monte Carlo Tree Search (MCTS) was not very useful to them.

Book: The Art of Doing Science and Engineering

A book by Richard Hamming that informed the discussion of technological shifts and adapting workflows to new technologies.

Person: Ilya Sutskever

An OpenAI co-founder with whom Noam Brown discussed AGI timelines, RL, and the importance of reasoning paradigms. His early emails emphasized the value of large experiments.

Media: Pokemon

Used as an example of a game where a 'harness' might be needed for AI to perform well; Noam Brown argues that improving the base model's capabilities is a better approach than relying on external harnesses.

Media: Omaha Poker

A poker variant with more hidden cards than Texas Hold'em, increasing the complexity of state enumeration for AI.

Concept: Reasoning models

AI models that use techniques like chain-of-thought to improve problem-solving, discussed as a leap forward that requires a base level of capability and is now being integrated into more AI systems.

Tool: AlphaGo

Used as a parallel to language model development, highlighting the stages of pre-training, large-scale test-time compute, and self-play in achieving superhuman performance.

Tool: Codex

An AI coding tool that Noam Brown uses extensively for tasks ranging from research to generating pull requests, finding it effective and a valuable way to understand model limitations.

Concept: Mid-training

A stage of AI model development between pre-training and post-training, which OpenAI is explicitly hiring for, though its definition remains fuzzy.

Organization: Physical Intelligence

A company whose CEO's pitches shaped Noam Brown's view of the benefits of non-humanoid robotics.

Media: Werewolf

A social deduction game used as a comparison point for Blood on the Clock Tower, which shares its core concepts.

Tool: GPT-4

Mentioned in the context of the 'all you can eat' scaling paradigm and as a benchmark for reasoning capabilities. Its successor, GPT-4.5, is also discussed.

Person: Richard Hamming

Author of 'The Art of Doing Science and Engineering,' cited for his ideas on technological shifts and on adapting workflows rather than simply replicating old tasks in a new technology.

Software: OpenAI models

Referred to in the context of Diplomacy benchmarking and their potential to improve with scale and new paradigms.

Location: Taiwan

Mentioned as having its own AI research and development landscape, in contrast to Western hubs.

Tool: GPT-4.5

Mentioned in the context of playing games like tic-tac-toe, where it performs reasonably well but can make mistakes, suggesting a need for System 2 thinking for perfect play.

Software: o3

Noam Brown's preferred model for daily use, even replacing Google Search, highlighting its utility for web browsing and research. It is also suggested as a primary tool for coding.

Person: Lex Fridman

A podcast host on whose show Noam Brown has previously appeared, reflecting Brown's frequent podcast appearances.

Organization: Affinity

The organization that the host implies is working on a model router aimed at balancing fast and slow thinking in AI.

Tool: Llama

Mentioned as another AI model that could be benchmarked on Diplomacy.

Concept: The Turing Test

A benchmark for machine intelligence, discussed in the context of current language models like GPT-4 and o3 passing it, and how this shifts human perception of machine intelligence.

Tool: Cicero

An AI system for the game Diplomacy, developed by Noam Brown and colleagues at Meta AI, that performed in the top 10% of human players. Its development influenced Brown's own playstyle and understanding of the game.

Tool: Sora

A generative AI model for creating videos from text, whose initial announcement felt magical and indicative of AGI but shows flaws on closer inspection.

Tool: Gemini

Mentioned in the context of text diffusion, alongside auto-regressive imaging, as an example of different directions in generative media research.

Person: Greg Brockman

President and co-founder of OpenAI, whom Noam Brown suggests asking about the long-term future of AI and how to steer it toward positive outcomes.

Media: Magic: The Gathering

A complex card game with imperfect information and a vast number of possible states, discussed as a challenging domain that current poker AI techniques would not directly solve.

Concept: Texas Hold'em

A poker variant with a limited amount of hidden information, making it amenable to explicit state enumeration and probability assignment for AI.

Tool: GPT-4o

A model discussed as passing the Turing test, with capabilities that have improved since 2022. It is also mentioned in the context of agentic systems and conversational AI.

Organization: OpenAI

The organization where Noam Brown works, developer of the GPT series and reasoning models such as o3.

Person: Jim Fan

Mentioned for his work on the Voyager skill library and a talk on the physical Turing test, highlighting his role as an educator in AI and robotics.

Concept: Humanity's Last Exam

A benchmark of difficult but easily gradable problems; Noam Brown suggests that the easy-grading requirement limits AI evaluation to more common, measurable tasks rather than fuzzier, more complex ones.

Concept: Poker

A game frequently used as a reference point for AI development, particularly for game theory, sample efficiency, and the trade-off between exploitative and optimal strategies.

Concept: Humanoid robotics

The embodiment of AI in human-like robots, discussed in terms of its advantages (familiarity, existing infrastructure) and disadvantages (limitations of the human form, potential creepiness) compared to non-humanoid forms, especially in domestic environments.

Media: Stratego

A board game with a very large number of possible states (approaching 40 factorial), which breaks the traditional AI approaches used for poker and requires different methods. (Transcribed as 'Stratum' at one point in the episode.)

Tool: GPT-2

Mentioned as a model that likely would not have benefited from reasoning paradigms, highlighting the baseline capability models need before techniques like chain-of-thought pay off.

Person: Timnit Gebru

Referenced for a well-received ICLR talk on open-endedness and multi-agent systems.

Concept: MCTS

Monte Carlo Tree Search, which DeepSeek reported as not very useful to them, though engineers still try it as a search method in AI.

Media: Blood on the Clock Tower

A social deduction game similar to Mafia or Werewolf that has become popular for socializing and even recruiting in the tech industry, displacing poker as a favored activity.

Concept: Mafia

A social deduction game that Blood on the Clock Tower is compared to, sharing similar gameplay mechanics.

Concept: Chain of Thought

A prompting technique that improves reasoning in LLMs, discussed as a paradigm requiring a certain baseline model capability to be effective, as seen when applied to GPT-2 versus larger models.

Tool: GPT-3

Mentioned as a prior iteration of language models, a precursor to current capabilities and benchmarks.
