What are the main challenges in getting AI agents to perform well on SWE-Bench?

Challenges include operating at the correct level of abstraction (e.g., performing a large refactor vs. a small fix), the ambiguity in task descriptions, and the lack of multimodal capabilities in some agents, preventing them from interpreting visual outputs like graphs. Getting to very high success rates requires addressing these nuanced issues.

How does SWE-Bench Verified differ from the original SWE-Bench?

SWE-Bench Verified involved a partnership with OpenAI to manually review and filter tasks, removing those that were effectively impossible, such as tasks whose tests required a specific error string the model could not know in advance. This ensures all tasks in SWE-Bench Verified are theoretically doable by an AI agent.

What makes Anthropic's approach to agent architecture minimal and effective?

Anthropic's approach gives the AI model, specifically Claude, the reins to decide its own steps and tool usage rather than enforcing rigid workflows. The model keeps thinking and calling tools until it believes it's done, leveraging its self-correction and grit, especially in models like Claude 3.5 Sonnet.

Why is tool design iteration important for AI agents?

Simply providing basic API specs for tools isn't enough. Iterating on tool design, similar to designing human interfaces with examples and clear instructions, makes them more foolproof for models. This includes making tools mistake-proof (poka-yoke) by enforcing formats like absolute paths to prevent confusion.
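
One way to picture the absolute-path rule is a tool wrapper that rejects relative paths before it touches the file system. The sketch below is a minimal illustration, not Anthropic's implementation; the function name `view_file` and the error wording are assumptions made for the example.

```python
import os


def view_file(path: str) -> str:
    """Hypothetical file-viewing tool that only accepts absolute paths."""
    if not os.path.isabs(path):
        # Return an instructive error the model can act on instead of guessing.
        return (
            f"Error: '{path}' is a relative path. "
            "Pass an absolute path such as /repo/src/main.py."
        )
    if not os.path.isfile(path):
        return f"Error: no file found at '{path}'."
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

Returning an error message rather than raising keeps the mistake visible to the model, which can then retry with a corrected path.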

What are the advantages of using computer use for AI integration?

Computer use, like giving an AI a browser, drastically reduces the friction of integrating with existing systems compared to API integrations, which can be slow and require significant engineering. It allows for rapid deployment of AI agents for tasks that leverage existing UIs, acting as a universal tool.

What are the key breakthroughs enabling current progress in AI robotics?

Two main breakthroughs are general-purpose large language models (LLMs) for common-sense reasoning, which address the problem of describing tasks, and diffusion-inspired algorithms for path planning. Together they let robots understand tasks more intuitively and learn motion control from demonstrations.

What is the biggest bottleneck for AI adoption in robotics?

The primary bottleneck is reliability. AI can already power impressive demos, but reaching the 99.9% reliability needed for real-world applications (not breaking dishes, not dropping boxes) is extremely challenging, and climbing from 95% to that level takes time.
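
The gap between 95% and 99.9% is easier to see as expected failures. The short calculation below is an illustration only; the workload of 1,000 tasks is an assumed figure, not a number from the episode.

```python
def expected_failures(success_rate: float, tasks: int) -> float:
    """Expected number of failed tasks at a given per-task success rate."""
    return (1.0 - success_rate) * tasks


TASKS = 1_000  # assumed workload, chosen only for illustration

for rate in (0.95, 0.99, 0.999):
    failures = expected_failures(rate, TASKS)
    print(f"{rate:.1%} success -> about {failures:.0f} failures per {TASKS} tasks")
```

Under these assumptions, moving from 95% to 99.9% means cutting failures from roughly 50 per 1,000 tasks to roughly 1.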

Why is there skepticism about the business model of autonomous vehicles like Waymo?

The skepticism arises from the high cost of the vehicles and the unit economics. While revenue generation is possible, the concern is whether the profit per vehicle, after accounting for costs like depreciation, can be significantly better than human-driven services like Uber, which have lower capital expenditure.
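
The unit-economics question can be framed as a small calculation. The sketch below uses straight-line depreciation and placeholder inputs; none of the numbers are Waymo or Uber figures, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VehicleEconomics:
    """Annual figures for one vehicle; every value is a hypothetical input."""
    revenue: float
    operating_cost: float      # energy, maintenance, support, and so on
    vehicle_price: float
    service_life_years: float

    def annual_depreciation(self) -> float:
        # Straight-line depreciation with no resale value (a simplification).
        return self.vehicle_price / self.service_life_years

    def annual_profit(self) -> float:
        return self.revenue - self.operating_cost - self.annual_depreciation()


# Placeholder numbers for illustration, not real company data.
robotaxi = VehicleEconomics(
    revenue=100_000,
    operating_cost=40_000,
    vehicle_price=150_000,
    service_life_years=5,
)
print(robotaxi.annual_depreciation())  # 30000.0
print(robotaxi.annual_profit())        # 30000.0
```

The point of the sketch is the structure of the comparison: a more expensive vehicle raises depreciation, which has to be offset by the savings from not paying a driver.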

What is the future role of computer use versus API integrations for AI?

Computer use is excellent for prototyping and handling the long tail of tasks where APIs don't exist or are hard to integrate. Once these use cases prove valuable and gain traction, they can be optimized by developing specific APIs for faster and more efficient execution.
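
The progression from computer use to dedicated APIs can be sketched as a dispatcher with a fallback. Everything here is hypothetical: `run_with_computer_use` is a stand-in for a UI-driving agent, and the task names are invented for the example.

```python
from typing import Callable, Dict

# Registry of tasks that have earned a dedicated API integration (hypothetical).
api_handlers: Dict[str, Callable[[str], str]] = {}


def run_with_computer_use(task: str, payload: str) -> str:
    """Stand-in for a slower, general-purpose UI-driving agent."""
    return f"computer-use:{task}:{payload}"


def dispatch(task: str, payload: str) -> str:
    """Prefer a dedicated API handler; fall back to computer use otherwise."""
    handler = api_handlers.get(task)
    if handler is not None:
        return handler(payload)
    return run_with_computer_use(task, payload)


# A long-tail task starts on the general path.
print(dispatch("refund", "order-42"))
# Once it proves valuable, it is promoted to a dedicated API.
api_handlers["refund"] = lambda payload: f"api:refund:{payload}"
print(dispatch("refund", "order-42"))
```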

What are the critical factors for trust and auditability in AI agents?

For AI agents to be valuable, humans need to trust their output. This requires not just completing tasks but doing so in a trustable and auditable way, where the agent can explain its reasoning and how it arrived at its conclusions, similar to how a human would present their work.

Key Moments

The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic

Latent Space Podcast

Science & Technology | 4 min read | 72 min video

Nov 28, 2024

TL;DR

Claude 3.5 Sonnet achieves SOTA on SWE-Bench; discusses agent architecture, tooling, and computer use.

Key Insights

Claude 3.5 Sonnet, particularly its upgraded version, has set a new state-of-the-art on the SWE-Bench Verified benchmark for coding agents.

SWE-Bench focuses on real-world repository tasks, offering a more practical evaluation than isolated coding puzzles.

The agent architecture is minimal, giving Claude significant control to decide its approach and tool usage until the task is complete or the context window runs out.

Tool design is crucial; iterating on tools with robust documentation, examples, and explicit instructions (like using absolute paths) makes them more foolproof for LLMs.

Computer use, analogous to a robot pressing elevator buttons instead of relying on APIs, significantly reduces integration friction for LLMs to interact with the world.

While LLMs excel at common sense tasks and diffusion models aid path planning in robotics, achieving high reliability and viable unit economics remains a major challenge.

ACHIEVING STATE-OF-THE-ART IN CODING AGENTS

The discussion highlights Anthropic's recent success with Claude 3.5 Sonnet, which has achieved state-of-the-art performance on the SWE-Bench Verified benchmark. This achievement is significant because SWE-Bench evaluates agents on real-world coding tasks within software repositories, providing a more practical assessment than traditional, isolated coding challenges. The release of the agent's architecture and prompts aims to empower developers to build upon Anthropic's findings and maximize the utility of their models.

UNDERSTANDING SWE-BENCH AND ITS VERIFICATION

SWE-Bench, a benchmark for evaluating coding agents, focuses on a subset of tasks from popular open-source Python repositories. It's distinguished by its requirement for agents to navigate and modify existing codebases, rather than solving isolated problems. SWE-Bench Verified further refines this by manually filtering tasks to remove ambiguities and impossibilities, ensuring that each task is theoretically solvable by a model. This verification process is a partnership with OpenAI, aiming for a more accurate representation of engineering challenges.

AGENT ARCHITECTURE AND TOOL USAGE

The agent architecture is designed to be minimal, granting Claude substantial autonomy. The agent receives a set of tools and then iteratively thinks, calls tools, and observes results until it determines the task is complete or it hits the context window limit. This approach contrasts with more rigid, pre-defined workflows, allowing the model, particularly Claude 3.5 Sonnet with its enhanced self-correction capabilities, to adapt its strategy dynamically. The primary downside identified is token expense, though future research might focus on pruning inefficient paths.
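
The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Anthropic's code: the model is replaced by a scripted function, the step format is invented, and a character budget stands in for the context window limit.

```python
from typing import Callable, Dict, List, Tuple

# A model step is ("tool", tool_name, argument) or ("done", answer, "").
ModelStep = Tuple[str, str, str]


def run_agent(
    model: Callable[[List[str]], ModelStep],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_context_chars: int = 10_000,
) -> str:
    """Loop: think, call a tool, observe; stop when done or context is full."""
    transcript = [f"TASK: {task}"]
    while sum(len(entry) for entry in transcript) < max_context_chars:
        kind, name, argument = model(transcript)
        if kind == "done":
            return name  # the model's final answer
        tool = tools.get(name)
        result = tool(argument) if tool else f"Error: unknown tool '{name}'"
        transcript.append(f"CALL {name}({argument}) -> {result}")
    return "stopped: context limit reached"


def scripted_model(transcript: List[str]) -> ModelStep:
    """Toy stand-in for an LLM: makes one tool call, then finishes."""
    if len(transcript) == 1:
        return ("tool", "echo", "hello")
    return ("done", f"finished after {len(transcript) - 1} tool call(s)", "")


print(run_agent(scripted_model, {"echo": lambda text: text.upper()}, "demo"))
```

Note that the loop contains no fixed plan: the only stopping conditions are the model declaring it is done and the budget running out.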

THE CRITICAL ROLE OF TOOL DESIGN AND ENGINEERING

Significant engineering effort was dedicated to tool design, with a focus on making them 'foolproof' for LLMs. For example, mandating absolute paths for file operations prevents confusion when the agent changes directories. Similarly, providing explicit instructions within tool descriptions, such as avoiding commands that don't return or advising against launching interactive programs like Vim, greatly improves reliability. This iterative process of testing tool usage and refining descriptions is presented as a key strategy for effective LLM agent development.
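
As an illustration of putting explicit instructions into a tool description, the sketch below pairs a description with a small pre-execution check. The tool spec layout, the blocklist, and the function name `check_command` are assumptions for the example, not the prompts Anthropic published.

```python
import shlex

# Hypothetical tool spec: the description carries the usage rules the model reads.
BASH_TOOL = {
    "name": "bash",
    "description": (
        "Run a shell command and return its output. "
        "Do not launch interactive programs such as vim. "
        "Do not run commands that never return; run long jobs in the background."
    ),
}

# Assumed blocklist for illustration; a real one would be tuned from test runs.
INTERACTIVE_PROGRAMS = {"vim", "vi", "nano", "less", "top"}


def check_command(command: str) -> str:
    """Return 'ok' or an instructive error before the command is executed."""
    parts = shlex.split(command)
    if not parts:
        return "Error: empty command."
    if parts[0] in INTERACTIVE_PROGRAMS:
        return (
            f"Error: '{parts[0]}' is interactive and will hang. "
            "Use a non-interactive command instead."
        )
    return "ok"


print(check_command("vim setup.py"))
print(check_command("ls -la"))
```

The description tells the model what to avoid, and the check catches the cases where it does so anyway.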

COMPUTER USE AS A LOW-FRICTION INTEGRATION TOOL

Computer use is framed as a revolutionary way for LLMs to interact with the digital world, analogous to a robot using its arm to press elevator buttons instead of waiting for complex API integrations. By providing the model with a browser and login credentials, it can immediately perform actions across various applications. This drastically reduces the engineering overhead typically required for API integrations, enabling rapid deployment of sophisticated tools for customer support and other applications, transforming how LLMs interface with existing software systems.

ROBOTICS: PROGRESS, HYPE, AND CHALLENGES

Although Schluntz describes burnout from his time as a robotics founder, he sees significant positive progress in the field. General-purpose language models are bridging the gap in describing complex tasks, while diffusion models are revolutionizing path planning. However, considerable hype surrounds robotics, similar to the early days of self-driving cars. The primary hurdles remain achieving high reliability (moving from 99% to 99.9% success rates) and viable unit economics, as current hardware costs are substantial, making widespread adoption challenging compared to software-based AI agents.

THE FUTURE OF AGENTS AND TRUST

The critical factor for the future success of AI agents will be trust. As agents perform increasingly complex tasks over longer durations, users need to be confident in their outputs. This requires not only task completion but also auditable and explainable processes, allowing humans to understand how an agent arrived at its solution. This focus on trust and transparency is seen as paramount for agents to deliver true value beyond synthetic benchmarks.
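
One simple way to make an agent's work auditable is to record the reasoning, action, and observation for every step. The sketch below is an assumption about what such a record could look like; the class names and fields are invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Step:
    reasoning: str    # why the agent took the action
    action: str       # what it did
    observation: str  # what came back


@dataclass
class AuditTrail:
    task: str
    steps: List[Step] = field(default_factory=list)

    def record(self, reasoning: str, action: str, observation: str) -> None:
        self.steps.append(Step(reasoning, action, observation))

    def report(self) -> str:
        """Serialize the full trail so a reviewer can retrace each decision."""
        return json.dumps(asdict(self), indent=2)


trail = AuditTrail(task="fix failing test")
trail.record("Need to see the failure first", "run pytest", "1 failed")
trail.record("Traceback points at parse()", "edit parser.py", "file saved")
print(trail.report())
```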

Common Questions

SWE-Bench is a benchmark designed to test AI coding agents on real-world engineering tasks within existing code repositories. It's crucial because it evaluates agents on their ability to navigate and modify complex codebases, which is more representative of actual software development than isolated coding puzzles.

Topics

AI Safety, Tool Use, AI & Machine Learning, Technology & Innovation, Programming & Software, Large Language Models, Coding Agents, Computer Vision, Agent Architecture

Mentioned in this video

Software & Apps

Copilot

Erik Schluntz mentions using Copilot for coding, which was a catalyst for his increased excitement about AI.

Matplotlib

A Python plotting library. Some SWE-Bench tasks involve generating plots with it, and agents sometimes fail because they do not visually inspect the output images.

Replit

Mentioned for its coding agent that creates a plan first and its comments on multi-agent systems.

CrewAI

Mentioned as one of the agent frameworks available, alongside LangGraph.

Docker

A containerization technology that can be used for sandboxing agent environments, though it can be slow or resource-intensive.

Claude 3.5 Haiku

Anthropic's smallest and fastest model, which still performed surprisingly well on the SWE-Bench benchmark.

Claude 3.5 Sonnet

A model released by Anthropic that has shown significant improvements in coding tasks and agentic capabilities.

Devin

Mentioned as an example of a coding agent that allows users to edit the plan as it goes along.

ReAct

A paradigm (Think-Act-Observe) that underlies agent frameworks like SWE-agent and influenced Anthropic's agent implementation.

LangGraph

Mentioned as one of the agent frameworks available, alongside CrewAI.

E2B

A startup working on agent sandboxing, mentioned as a friend of Anthropic.

Concepts

SWE-Bench

A benchmark developed to evaluate the performance of coding agents, focusing on real-world engineering tasks within existing code repositories.

Companies

OpenAI

Worked with the SWE-Bench authors to create SWE-Bench Verified by manually reviewing and filtering tasks to ensure they are fully doable.

Uber

Used as a comparison point for the potential revenue and profitability of autonomous vehicle services like Waymo.

SpaceX

Mentioned as a previous employer of Erik Schluntz before joining Anthropic.

Cobalt Robotics

Erik Schluntz was the CTO and co-founder of this company, which built security and inspection robots for buildings and warehouses.

Anthropic

The company where Erik Schluntz currently works, focusing on AI safety and developing advanced models like Claude.

WeMobility

A company operating autonomous vehicles, discussed in the context of the challenges and economics of self-driving technology.

1X

The company where Eric Jang currently works, mentioned in the context of AI and robotics crossover.

Citrix

A technology from the past that provided a remote desktop experience, used as an analogy before the widespread adoption of remote access or cloud computing.

Dusty Robotics

Mentioned in relation to Stan Po's perspective on AI in robotics.

Waymo

An autonomous vehicle company whose operational figures and vehicle costs are discussed in the context of the self-driving industry's economic viability.

Imbue

Mentioned alongside Josh Albrecht regarding issues with purchasing and testing large batches of GPUs.

People

Josh Albrecht

Mentioned in the context of discussing hardware issues, specifically with GPUs not performing as expected.

Chelsea Finn

Associated with Stanford and a startup in physical intelligence, focused on diffusion-inspired path planning.

Eric Jang

Formerly at Google AI and now at 1X, he jokes with Erik Schluntz about switching between AI and robotics fields and mentions hardware/supply chain complaints.

Amanda Askell

Head of Claude character at Anthropic, who uses computer use for generating research ideas during her lunch breaks.

Stan Po

From Dusty Robotics, who has a view that AI vision might not be the primary workhorse in robotics.

Organizations

Physical Intelligence

A startup working on diffusion model inspired path planning for robotics.
