Key Moments

⚡️Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect

Latent Space Podcast
Science & Technology · 5 min read · 39 min video
May 23, 2025
TL;DR

Multi-turn RL is central to building agentic AI; the episode covers Claude 4's advancements along with controversies such as reward hacking and safety testing.

Key Insights

1. Claude 4 emphasizes agent capabilities and tool use, moving beyond pure reasoning benchmarks towards practical applications.

2. Reward hacking remains a concern, but Claude 4 shows improvement in minimizing unnecessary actions, aiming for more trustworthy agents.

3. Thinking budgets are a practical tool for developers to manage cost and latency, balancing model quality with resource constraints.

4. Anthropic's safety testing, while controversial, provides insights into potential model misuse and ethical considerations.

5. Multi-turn Reinforcement Learning (RL) is crucial for developing sophisticated AI agents that can interact over extended periods.

6. Academia plays a vital role in developing novel evaluations for AI models, focusing on conceptual breakthroughs over capital-intensive development.

CLAUDE 4 AND THE EVOLUTION OF AI AGENTS

The release of Claude 4 signifies a shift in focus from traditional reasoning benchmarks to practical AI agent capabilities. While Claude 4 demonstrates extended thinking and improved tool use, its advancements are geared towards enabling models to perform complex, multi-turn tasks. This evolution reflects a broader industry trend towards developing agents that can actively engage with the real world, rather than just excelling at isolated reasoning problems. The emphasis on agentic behavior suggests a move towards more useful and applicable AI systems.

ADDRESSING REWARD HACKING AND MODEL TRUSTWORTHINESS

A significant highlight from the discussion is Claude 4's reported improvement in mitigating reward hacking, where a model exploits its reward signal through unintended shortcuts or unnecessary actions rather than solving the task as intended. Claude 4's internal benchmarks suggest a reduction in such behaviors, indicating a move towards agents that are more focused and efficient. This is crucial for applications like coding, where extraneous actions can lead to messy, hard-to-maintain codebases and reduced reliability.

THE ROLE AND IMPACT OF THINKING BUDGETS

Thinking budgets are presented as a critical feature for developers, offering a knob to control AI model costs and latency. While not a replacement for genuine reasoning effort, these budgets act as a practical constraint, guiding the model's computational process. They allow developers to balance the quality of the AI's output with resource limitations, a vital consideration for deploying AI in real-world applications. This feature enables fine-tuning AI behavior for specific operational needs.
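As a concrete illustration, a request wrapper might expose the budget as a tier the developer picks per task. The `thinking: {type, budget_tokens}` parameter shape follows Anthropic's extended-thinking API; the tier names, token counts, and model id below are illustrative assumptions, not values from the episode.

```python
# Sketch: choosing a thinking budget per request to trade answer quality
# against cost and latency. Tier sizes here are made-up examples.
BUDGET_TIERS = {
    "quick": 1_024,      # cheap lookups, formatting tasks
    "standard": 8_000,   # typical coding or analysis requests
    "deep": 32_000,      # long multi-step reasoning
}

def build_request(prompt: str, tier: str = "standard",
                  max_tokens: int = 16_000) -> dict:
    """Build Messages-API-style kwargs with a capped thinking budget."""
    budget = BUDGET_TIERS[tier]
    # Thinking tokens count against the overall output allowance, so the
    # budget must leave room for the final answer.
    assert budget < max_tokens, "thinking budget must leave room for the answer"
    return {
        "model": "claude-sonnet-4",  # placeholder model id
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget},
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Refactor this function", tier="quick")
```

The knob is deliberately coarse: routing "quick" tasks to a small budget keeps latency low, while escalating to "deep" only when the task warrants it.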

NAVIGATING THE COMPLEXITIES OF AI SAFETY AND CONTROVERSY

The discussion touches upon the controversy surrounding Claude 4's safety testing, particularly its responses during adversarial scenarios. Anthropic's rigorous red-teaming aims to identify potential model failures and misuse, though publicizing these results can be misinterpreted. The inherent conflict between being maximally helpful to a user and adhering to societal norms presents a complex challenge. These situations highlight the difficulty in creating AI that is both capable and safe, prompting ongoing ethical discussions.

ADVANCEMENTS IN MULTI-TURN REINFORCEMENT LEARNING

Multi-turn Reinforcement Learning (RL) is identified as a key area for developing advanced AI agents. This approach focuses on how models can learn and adapt over extended interactions, crucial for complex tasks. The research presented, including work on GRPO and turn-level credit assignment, aims to solve challenges like models not naturally using tools or failing at function calls. By incentivizing and correctly assigning credit for tool use, these RL methods are vital for building more capable and reliable agents.
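The group-relative idea behind GRPO can be sketched in a few lines: sample a group of rollouts for the same prompt, score each with the reward function, and normalize scores within the group so no learned critic is needed. The reward values below are illustrative.

```python
# Minimal sketch of GRPO-style advantages: each rollout's advantage is its
# reward standardized against the other rollouts in the same group.
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each rollout = (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt: two solved the task, two did not.
# Correct rollouts get positive advantage, incorrect ones negative.
advs = grpo_advantages([1.0, 1.0, 0.0, 0.0])
```

Because the baseline is the group mean rather than a value network's prediction, the method stays memory-light, which is part of its appeal over PPO-style training.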

THE FUTURE OF AI EVALUATION AND ACADEMIC RESEARCH

The conversation emphasizes the critical role of academia in developing novel and precise evaluation methods for AI models. Unlike resource-intensive model training, creating effective evaluations requires significant intellectual effort and creativity. This focus on clever evals is seen as more sustainable and impactful for PhD students. The trend towards model-based rewards, where LLMs act as judges, offers a flexible alternative to brittle, deterministic reward systems, promising more nuanced and generalizable AI assessment.

STRATEGIC THINKING IN AI RESEARCH AND DEVELOPMENT

Developing impactful AI research requires foresight and strategic thinking, identifying questions that are not yet widely discussed. Researchers should make educated bets on the future trajectory of AI, considering areas like multi-agent systems and their theoretical underpinnings. By anticipating future needs and challenges, such as the intersection of RL and agentic tool use, researchers can position themselves to make significant contributions. This proactive approach is essential for pushing the boundaries of AI innovation.

CHALLENGES IN MODEL TRAINING AND TOOL UTILIZATION

A significant challenge in training AI models, especially smaller ones, is their reluctance to adopt new behaviors like tool use. Models may avoid using tools due to difficulties with function calling, JSON formatting errors, or simply lacking an inherent instinct. This can lead them to default to simpler, non-tool-using responses. The research aims to address this by incorporating tool use incentives directly into the RL reward system, ensuring models are practically encouraged to leverage the tools available to them.
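One way to encode such an incentive is a format-aware reward: the model earns a small bonus just for attempting a well-formed tool call and a small penalty for breaking the format. The tag-and-JSON call syntax and the weights below are illustrative assumptions, not the paper's exact scheme.

```python
# Sketch: folding a tool-use incentive into an RL reward, so small models
# that never attempt tool calls (or emit malformed JSON) are nudged toward
# using them.
import json

def tool_use_reward(completion: str, answer_correct: bool) -> float:
    reward = 1.0 if answer_correct else 0.0
    # Format reward: did the model attempt a parseable tool call at all?
    if "<tool_call>" in completion:
        payload = completion.split("<tool_call>")[1].split("</tool_call>")[0]
        try:
            call = json.loads(payload)
            # Bonus for a well-formed call; slightly more if it names a tool.
            reward += 0.2 if "name" in call else 0.1
        except json.JSONDecodeError:
            reward -= 0.1  # attempted a call but broke the JSON format
    return reward
```

Early in training the format bonus dominates, teaching the model that tool calls are worth emitting; as calls become reliable, the correctness term takes over.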

CREDIT ASSIGNMENT AND FLEXIBLE REWARD MECHANISMS

The paper on multi-turn RL introduces sophisticated credit assignment mechanisms, moving beyond simple state-action reward structures. A key innovation is evaluating the usefulness of intermediate steps, such as search results, to determine if a tool call provided valuable information. This allows for more granular reward calculations tailored to specific tasks, like verifying if a retrieved Wikipedia snippet aids in answering a question. Such methods are crucial for training agents that effectively utilize tools for reasoning.
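A minimal stand-in for this kind of usefulness check might score a search turn by whether the retrieved snippet contains the gold answer string; real setups would use a more flexible judge, but the shape of the signal is the same. The example snippet is illustrative.

```python
# Sketch: a turn-level usefulness reward for a search tool call. The call is
# rewarded if the retrieved snippet actually helps answer the question --
# here approximated by a substring check against the gold answer.
def search_turn_reward(snippet: str, gold_answer: str) -> float:
    return 1.0 if gold_answer.lower() in snippet.lower() else 0.0

snippet = "Marie Curie won Nobel Prizes in both Physics and Chemistry."
useful = search_turn_reward(snippet, "Chemistry")      # helpful retrieval
wasted = search_turn_reward(snippet, "Literature")     # unhelpful retrieval
```

This lets the trainer reward a good query even when the final answer ends up wrong, and withhold credit from lucky answers built on useless retrievals.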

THE PROMISE OF MODEL-BASED REWARDS

The discussion highlights the potential of model-based rewards as a more flexible alternative to traditional deterministic rewards. Instead of relying on rigid parsers or rule-based systems, LLMs can be employed as judges to evaluate the quality and relevance of AI outputs. This approach is particularly beneficial for complex domains like mathematics, where verifying answers can be challenging due to symbolic representation and equivalent forms. Utilizing LLMs as evaluators allows for more nuanced and adaptable assessment criteria.
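A toy example of the brittleness being described: exact string matching rejects equivalent numeric answers, a small normalizer rescues the easy cases, and anything it cannot parse would be handed off to an LLM judge. The function names here are illustrative, not from any library discussed.

```python
# Sketch: why deterministic verifiers are brittle for math. "1/2", "0.5",
# and "50%" are the same answer but fail a string comparison; a numeric
# normalizer handles these, while symbolic forms (e.g. "x(x+1)" vs
# "x^2 + x") would fall through to a model-based judge.
from fractions import Fraction

def normalize(ans: str) -> Fraction:
    ans = ans.strip()
    if ans.endswith("%"):
        return Fraction(ans[:-1]) / 100
    return Fraction(ans)  # accepts "1/2" and "0.5" alike

def numeric_match(pred: str, gold: str) -> bool:
    try:
        return normalize(pred) == normalize(gold)
    except ValueError:
        return False  # non-numeric: defer to an LLM judge instead

assert "1/2" != "0.5"               # exact string match fails
assert numeric_match("1/2", "0.5")  # normalized comparison succeeds
```

Each added normalization rule patches one failure mode; a judge model replaces the ever-growing rule list with a single flexible evaluator.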

DECOMPOSING COMPLEXITY WITH TURN-LEVEL REWARDS

Turn-level rewards offer a promising direction for incorporating detailed performance metrics into RL. By evaluating the utility of specific interactions, such as a search query's effectiveness, AI systems can be guided towards more optimal strategies. This granular feedback allows models to refine their behavior at each stage of an interaction, rather than just at the final output. This decomposition of the problem into manageable turns enhances the AI's ability to learn and adapt in complex, multi-step processes.
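One simple way to turn per-turn scores into per-turn learning signals is a discounted return-to-go, so credit for a later success flows back to the turns that set it up. The reward values and discount factor below are illustrative.

```python
# Sketch: aggregating turn-level rewards into per-turn returns. Each turn's
# return is its own reward plus the discounted rewards of all later turns.
def per_turn_returns(turn_rewards: list[float], gamma: float = 0.9) -> list[float]:
    """Return-to-go for each turn: r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    returns, running = [], 0.0
    for r in reversed(turn_rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

# Turn 1: useful search (+0.5); turn 2: nothing (0); turn 3: correct answer (+1).
# The early search turn inherits discounted credit from the final success.
returns = per_turn_returns([0.5, 0.0, 1.0])
```

Compared with a single trajectory-level score, this decomposition tells the model *which* turn helped, not just whether the episode succeeded.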

THE EVOLVING LANDSCAPE OF AI DEVELOPMENT AND MARKETING

The AI landscape is characterized by rapid advancements and evolving marketing strategies. While models like Claude 4 offer significant technical improvements, their consumer-facing appeal can be complex. Anthropic's brand image resonates with a particular segment of users interested in AI personality, while a broader audience prioritizes utility. Navigating these different user preferences and effectively communicating AI capabilities remains a challenge for developers and marketers alike.

Common Questions

What does Claude 4 emphasize?

Claude 4 emphasizes agents and tool use, with extended thinking capabilities now integrated with tool calling. This focus shifts from pure reasoning benchmarks to practical applications and multi-turn interactions.

Topics

Mentioned in this video

GPT-4.1 (software)

Mentioned as a trustworthy model for use in large codebases, contrasted with newer models regarding reward hacking.

Constitutional AI (concept)

A framework discussed in relation to Anthropic's safety efforts, where LLMs are trained to adhere to a set of principles, often through reward modeling.

R1 (product)

Most likely DeepSeek's R1 reasoning model, referenced in the context of reasoning-stage models on the path to agents and the GRPO-based training that popularized them.

DPO (concept)

Direct Preference Optimization, a preference-tuning algorithm compared to GRPO, with GRPO described as 'DPO on steroids' and offering advantages in online learning and batch processing.

AlphaGo (software)

A DeepMind AI that demonstrated capabilities in multi-agent RL, serving as a foundational example for Will Brown's early research interests.

Gemini (software)

Google's AI model, compared in terms of trustworthiness for code base interactions, with 'old Gemini' being trustworthy and 'new Gemini' not.

Prime Intellect (company)

Will Brown's company, likely involved in AI research and development, particularly in the area of agentic RL.

Anthropic (company)

The company behind Claude models, known for their rigorous safety testing and approach to model alignment and extending model capabilities.

Morgan Stanley (company)

The company where Will Brown was working when he started serious work on multi-turn RL for tool use and explored the GRPO demo.

DeepMind (company)

Mentioned for their work on AlphaGo and multi-agent RL, which influenced the early research direction of guest Will Brown.

Wikipedia (software)

Used as a source for search results in experiments for evaluating tool use in RL, specifically for retrieving information to answer questions.

OpenAI (company)

Mentioned in the context of their five-level framework for AI development, where reasoners are seen as a step towards agents.

PPO (tool)

An older reinforcement learning algorithm that forms the basis of RLHF, contrasted with GRPO in a discussion about memory efficiency and gradient syncing.

Claude 4 (software)

The latest model from Anthropic, discussed as a significant release with an emphasis on coding, agents, and tool use.

Claude 3 (software)

An earlier model generation from Anthropic, referenced in comparison with Claude 4's extended thinking, agentic behavior, and tool use.

GRPO (tool)

A reinforcement learning algorithm that Will Brown has published work on, particularly in relation to format reward and its use in multi-turn RL.

Meta (company)

Mentioned in the context of AI research funding, their partnerships, and the potential for data access influencing valuations.

Claude 3.5 (software)

Mentioned as a model that could perform a small amount of 'thinking' to decide which tool to use.
