
Scaling Test Time Compute to Multi-Agent Civilizations — Noam Brown, OpenAI

Latent Space Podcast
Science & Technology · 4 min read · 78 min video
Jun 19, 2025
TL;DR

OpenAI's Noam Brown discusses AI scaling, reasoning paradigms, and the future of multi-agent systems and AGI.

Key Insights

1. AI models benefit significantly from larger language models and advanced reasoning paradigms, moving beyond simple scaling.

2. The System 1/System 2 analogy for AI thinking is useful but imperfect, particularly regarding the baseline capabilities a model needs before System 2 reasoning becomes effective.

3. AI safety and alignment are critical, with steerability and controllable systems like Cicero offering promising approaches.

4. While coding and math are readily verifiable domains, AI is proving capable in less well-defined areas like deep research, providing existence proofs that success there is possible.

5. Self-play can lead to superhuman performance in zero-sum games (like Go, Chess), but its application to multi-agent, collaborative, or non-zero-sum scenarios is more complex and requires new objective functions.

6. The future of AI may involve multi-agent civilizations that cooperate and compete, mirroring human civilizational progress to achieve greater intelligence.

FROM AI RESEARCHER TO DIPLOMACY CHAMPION

Noam Brown, renowned for his work on Cicero, a top-tier AI for the game Diplomacy, shares insights from his journey. His deep involvement in the game to debug Cicero led to personal improvement, culminating in winning the 2025 World Diplomacy Championship. He notes that while AI didn't directly play for him, inspiration from Cicero's novel strategies influenced his human play, highlighting the 'centaur' model of human-AI collaboration. Brown also touches on the early challenges with Cicero's language model, which would occasionally "hallucinate" or produce bizarre outputs, a problem now significantly reduced with more advanced models.

THE EVOLUTION OF REASONING AND THINKING PARADIGMS

The conversation delves into the 'thinking fast and slow' paradigm, comparing AI's System 1 (fast, intuitive) and System 2 (slow, deliberate reasoning) capabilities. Brown emphasizes that this analogy has limitations: AI models require a certain baseline capability before 'System 2' reasoning techniques, like chain-of-thought, provide significant benefits. He likens this to the brain needing cortical development before higher cognitive functions can emerge. This necessity for a foundational level of intelligence is crucial for reasoning paradigms to yield results, as seen when early models failed to benefit from these techniques.

SCALING TEST-TIME COMPUTE AND THE LIMITS OF CURRENT AI

Brown discusses the concept of 'test-time compute,' where models deliberate longer to solve harder problems. While current LLMs can 'think' for minutes, scaling this to hours or days is a frontier. He contrasts this with zero-sum games like Go, where self-play converges to a minimax equilibrium, leading to superhuman performance. This self-play approach, however, is less directly applicable to multi-agent, collaborative, or non-zero-sum scenarios like Diplomacy or math problems, where defining the objective function becomes significantly harder. Computational cost and the inherently serial nature of iterative experiments also limit how far this scaling can currently be pushed.
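One concrete way to picture spending more test-time compute is self-consistency-style majority voting; this is a minimal toy sketch of my own, not a method described in the episode. A stochastic solver is sampled repeatedly and the most common answer returned, so accuracy climbs as more samples, i.e. more compute, are spent per question.

```python
import random
from collections import Counter

def noisy_solver(rng):
    # Toy stand-in for a stochastic model: the correct answer (42)
    # comes back 40% of the time, otherwise one of three distinct
    # wrong answers.
    return 42 if rng.random() < 0.4 else rng.choice([7, 13, 99])

def majority_vote(n_samples, rng):
    # More test-time compute = more samples; return the plurality answer.
    votes = Counter(noisy_solver(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
trials = 2000
accuracy = {}
for n in (1, 5, 25):
    accuracy[n] = sum(majority_vote(n, rng) == 42
                      for _ in range(trials)) / trials
    print(f"{n:2d} samples per question -> accuracy {accuracy[n]:.2f}")
```

The solver here is right only 40% of the time, yet the voted answer is right far more often; the pattern carries over whenever correct answers cluster while errors scatter.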

CHALLENGES AND FUTURES IN MULTI-AGENT SYSTEMS

OpenAI's multi-agent team, led by Brown, explores both collaborative and competitive AI interactions. He argues that human intelligence is not a narrow band but rather a broad spectrum amplified by millennia of cooperation and competition. Similarly, AI, currently akin to 'cavemen,' could achieve vastly greater intelligence by building a 'civilization' through sustained multi-agent interaction. This differs from traditional approaches, which Brown considers too heuristic, advocating for a more principled scaling approach akin to the 'bitter lesson'.
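Why self-play is so well-behaved in two-player zero-sum games, and why that guarantee evaporates in general multi-agent settings, can be seen in a toy sketch of my own (assuming fictitious play on rock-paper-scissors, not anything from the episode): a policy that repeatedly best-responds to its own empirical history converges to the minimax strategy.

```python
import numpy as np

# Rock-paper-scissors payoff for the row player. The matrix is
# skew-symmetric, so the game is zero-sum and self-play against
# one's own empirical history is well defined.
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

counts = np.ones(3)  # empirical action counts, uniform prior
for _ in range(30000):
    # Fictitious play: best-respond to the empirical mixture of
    # past self-play actions. In zero-sum games the empirical
    # frequencies converge to a minimax strategy.
    best_response = np.argmax(A @ (counts / counts.sum()))
    counts[best_response] += 1

print(counts / counts.sum())  # approaches the minimax strategy [1/3, 1/3, 1/3]
```

In a general-sum or many-player game there is no single payoff matrix whose maximin value pins down 'optimal' play, which is why Brown argues that new objective functions are needed for multi-agent training.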

AI'S IMPACT ON SOFTWARE DEVELOPMENT AND PRODUCTIVITY

Tools like Codex and GPT-4 are transforming software development, with models capable of generating pull requests and handling complex tasks. Brown shares that he now uses these tools for nearly all his coding and finds them highly effective. He notes that moments where AI capability feels magical quickly lead users to spot its limitations and want further improvements. The key, he suggests, is to build systems that let models gain experience, moving beyond their 'first day on the job' output to more seasoned performance without excessive 'harnesses'.

BEYOND CODE: THE BROADENING APPLICABILITY OF AI

Brown anticipates AI's expansion beyond software engineering to encompass a wide array of remote work tasks, including those typically found on platforms like Upwork or in virtual assistant roles. For virtual assistants, aligned AI could potentially offer a more consistent and preference-aligned service than human agents, mitigating the principal-agent problem. He also touches upon the rapid progress in generative media, like Sora, and the ongoing research into different image generation techniques (autoregressive vs. diffusion), underscoring the dynamic and multi-faceted nature of AI advancement across various domains.

Common Questions

How did Noam Brown improve his own Diplomacy gameplay?

He improved by deeply understanding the game while debugging Cicero, playing in tournaments, and observing the bot's unconventional strategies, which gave him new insights into the game.


Mentioned in this video

Software & Apps
AlphaZero

Mentioned alongside AlphaGo as an example of AI achieving superhuman performance through pre-training, test-time compute, and self-play.

Windsurf

An IDE tool that Noam Brown uses daily, though he notes that his preferred model, o3, is not yet the default interface, requiring manual selection.

AlphaGo

Used as a parallel to language model development, highlighting the stages of pre-training, large-scale test-time compute, and self-play in achieving superhuman performance.

Codex

An AI tool for coding that Noam Brown uses extensively for tasks ranging from research to generating pull requests, finding it effective and a valuable way to understand model limitations.

GPT-4

The model is mentioned in the context of the 'all you can eat' scaling paradigm and as a benchmark for reasoning capabilities. Its successor, GPT-4.5, is also discussed.

GPT-4.5

Mentioned in the context of playing games like tic-tac-toe, where it performs reasonably well but can make mistakes, suggesting a need for System 2 thinking for perfect play.

o3

Noam Brown's preferred model for daily use, even replacing Google Search, highlighting its utility for web browsing and research. It's also suggested as a primary tool for coding.

Llama

Mentioned as an example of another AI model that could be benchmarked against Diplomacy.

GPT-4o

A model that is discussed as passing the Turing test, with its capabilities improving since 2022. It's also mentioned in the context of agentic systems and conversational AI.

Sora

A generative AI model for creating videos from text, whose initial announcement was seen as magical and indicative of AGI, but now shows flaws upon closer inspection.

Gemini

Mentioned in the context of text diffusion, alongside autoregressive image generation, as an example of different directions in generative media research.

GPT-2

Mentioned as a model that likely would not have benefited from reasoning paradigms, highlighting the need for a certain baseline capability in models before applying techniques like chain-of-thought.

GPT-3

Mentioned as a prior iteration of language models, serving as a precursor to current capabilities and benchmarks.
