Key Moments

Beating OpenAI and Anthropic by Looking At Data: the new #1 on SWE-Bench w/ W&B CTO Shawn Lewis

Latent Space Podcast
Science & Technology · 4 min read · 35 min video

Jan 28, 2025
TL;DR

Weights & Biases CTO Shawn Lewis discusses how data-driven insights and custom tooling led to the top SWE-bench score.

Key Insights

1. Achieving the #1 spot on SWE-bench was a focused effort, leveraging data and custom tooling for evaluation.

2. Weights & Biases' own tools (Weave, Eval Studio) were crucial for tracking experiments and analyzing agent performance.

3. The agent relied solely on OpenAI's GPT-4o model, demonstrating its effectiveness in complex programming tasks.

4. Detailed data analysis, including comparing agent performance against historical data and public leaderboards, was key to identifying regressions and driving improvements.

5. The development process highlighted the continued necessity of manually inspecting agent traces for effective debugging and improvement.

6. Shawn Lewis plans to open-source the 'Face Shift' framework used in this work, given strong community interest.

7. The future of AI programming agents looks very promising, with autonomous programmers plausible within a couple of years, likely accompanied by a growing focus on UI and business integration.

THE ACCIDENTAL INNOVATOR AND A NEW LEADER

Shawn Lewis, CTO of Weights & Biases (W&B), recounts his unexpected journey to the top of the SWE-bench leaderboard for AI coding agents. This achievement, coinciding with the birth of his child, was the result of months of focused work, driven by a personal challenge to build something impactful. Lewis, a natural tool builder, saw an opportunity to apply his expertise to the burgeoning field of AI programming, a domain he felt uniquely positioned to tackle given his background as a programmer.

DOGFOODING: LEVERAGING INTERNAL TOOLS FOR EXTERNAL SUCCESS

A core philosophy at W&B is 'dogfooding'—using their own tools to build and improve. This principle was central to Lewis's SWE-bench success. He utilized W&B's experiment tracking capabilities, specifically the Weave platform, for logging and analyzing agent behavior. To gain deeper insights, he developed 'Eval Studio,' a custom frontend that provides enhanced visibility into agent performance, allowing for detailed analysis of prompts, tool calls, and resulting code changes. This data-centric approach was critical for iterating and optimizing the AI agent.
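The episode doesn't show Eval Studio's internals, but the kind of structured trace it analyzes (prompts, tool calls, and results per benchmark instance) can be sketched with plain dataclasses. This is an illustrative data model, not the actual Weave or Eval Studio API; `AgentStep`, `AgentTrace`, and the example instance ID are assumptions for the sketch.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class AgentStep:
    """One step of an agent run: the prompt sent, the tool invoked, and its result."""
    prompt: str
    tool_call: str
    result: str

@dataclass
class AgentTrace:
    """A full run on one benchmark instance, accumulating steps for later analysis."""
    instance_id: str
    steps: list = field(default_factory=list)

    def log(self, prompt: str, tool_call: str, result: str) -> None:
        self.steps.append(AgentStep(prompt, tool_call, result))

    def to_json(self) -> str:
        # Serialize the whole trace so a frontend can filter and display it.
        return json.dumps(asdict(self), indent=2)

trace = AgentTrace("django__django-11099")  # hypothetical instance ID
trace.log("Fix the failing regex test", "edit_file", "patched validators.py")
print(trace.to_json())
```

Logging traces in a structured, serializable form like this is what makes the later sorting, filtering, and side-by-side comparison of runs possible.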

THE POWER OF GPT-4O AND DATA-DRIVEN EVALUATION

The agent's high performance on SWE-bench was attributed to the strategic use of OpenAI's GPT-4o model, making it the first publicly verified SWE-bench result to rely solely on this model for both agent logic and code generation. Lewis noted GPT-4o's adeptness at precise instruction following, a characteristic that, while sometimes leading to 'malicious compliance,' proved highly effective. The evaluation process involved extensive experimentation on SWE-bench, a benchmark widely recognized for its robust testing of AI programming capabilities.

UNCOVERING REGRESSIONS THROUGH DETAILED DATA ANALYSIS

Lewis emphasized the critical role of detailed data analysis in identifying and rectifying performance regressions. By comparing the agent's output across multiple evaluations, particularly against historical data and public leaderboard results, he could pinpoint specific instances where the agent's performance declined. Tools like spreadsheets and Eval Studio allowed him to sort and filter results, spotlighting problems that were previously solved but now failed, thereby guiding his debugging efforts towards the root causes of these issues.
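The regression-hunting workflow described above (comparing pass/fail outcomes across evaluation runs to find instances that were previously solved but now fail) can be sketched in a few lines. The function name and the pass/fail-dict representation are assumptions; SWE-bench results could equally live in a spreadsheet or database.

```python
def find_regressions(previous: dict, current: dict) -> list:
    """Return instance IDs that passed in the previous eval but fail in the current one.

    `previous` and `current` map SWE-bench instance IDs to a pass/fail bool.
    Instances missing from `current` are treated as failures.
    """
    return sorted(
        iid for iid, passed in previous.items()
        if passed and not current.get(iid, False)
    )

prev_run = {"astropy__astropy-1234": True, "django__django-5678": True}
curr_run = {"astropy__astropy-1234": False, "django__django-5678": True}
print(find_regressions(prev_run, curr_run))  # ['astropy__astropy-1234']
```

Sorting the regressed IDs to the top of a report is exactly the kind of spotlighting a spreadsheet or Eval Studio view enables: debugging effort goes straight to the instances that got worse.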

THE NECESSITY OF MANUAL TRACE INSPECTION

Despite advancements in AI and tooling, Lewis stressed that manually inspecting agent traces remains an indispensable part of the development process for complex tasks like those on SWE-bench. While tools can automate much of the data collection and initial analysis, understanding the nuanced decision-making of an AI agent often requires looking directly at its step-by-step reasoning and actions. This in-depth review allows developers to uncover subtle errors or inefficiencies that automated metrics might miss.
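Manual trace review is easier when each step's reasoning and actions are rendered as a readable transcript. A minimal formatter, assuming a trace is a list of (thought, action, observation) tuples; the tuple shape and example content are illustrative:

```python
def render_trace(steps: list) -> str:
    """Format an agent's step-by-step actions as a transcript for manual review."""
    lines = []
    for i, (thought, action, observation) in enumerate(steps, 1):
        lines.append(f"Step {i}")
        lines.append(f"  thought:     {thought}")
        lines.append(f"  action:      {action}")
        lines.append(f"  observation: {observation}")
    return "\n".join(lines)

steps = [
    ("Locate the failing test", "grep -r test_validate", "found tests/test_validators.py"),
    ("Inspect the validator", "cat validators.py", "regex misses unicode domains"),
]
print(render_trace(steps))
```

A view like this is where subtle errors surface, e.g. an agent whose observations contradict its next thought, which aggregate pass/fail metrics would never reveal.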

FACE SHIFT FRAMEWORK AND FUTURE OPEN-SOURCING

The development also involved a new TypeScript framework for building agents, dubbed 'Face Shift.' This framework, built on top of Weave, simplifies the process of logging and composing agentic tasks. Lewis expressed his intention to open-source Face Shift, acknowledging the strong community interest and the personal utility he has found in it. Although currently optimized for the SWE-bench environment, he aims to refine it for broader applications, potentially using AI assistance to polish the framework itself.

THE EVOLVING LANDSCAPE OF AI PROGRAMMING AGENTS

Looking ahead, Lewis anticipates significant advancements in autonomous AI programmers, potentially within the next one to two years. He highlighted the potential for AI to excel in UI and business integration, citing examples like Devin. While acknowledging that current benchmarks might not fully capture real-world utility, he believes the focus will increasingly shift towards user interfaces and business applicability, alongside the continuous improvement of underlying AI models. The competitive nature of the field drives innovation, pushing the boundaries of what AI can achieve.

SWE-Bench Performance Comparison

Data extracted from this episode

| Agent/Model | Single Rollout (%) | Final Score (%) |
| --- | --- | --- |
| OpenAI (GPT-4o) | 49 | — |
| Shawn Lewis's Agent (GPT-4o) | 57 | 64 |

Common Questions

What is SWE-bench, and why does this score matter?

SWE-bench is a benchmark for evaluating AI coding agents on real-world programming tasks. Achieving a high score, like Shawn Lewis's 64% with GPT-4o, signifies significant progress in AI's ability to autonomously write and fix code.
