Key Moments

Beating OpenAI and Anthropic by Looking At Data: the new #1 on SWE-Bench w/ W&B CTO Shawn Lewis

Latent Space Podcast
Science & Technology · 4 min read · 35 min video

Jan 28, 2025
TL;DR

Weights & Biases CTO Shawn Lewis discusses how data-driven insights and custom tooling led to the top SWE-bench score.

Key Insights

1. Achieving the #1 spot on SWE-bench was a focused effort, leveraging data and custom tooling for evaluation.

2. Weights & Biases' own tools (Weave, Eval Studio) were crucial for tracking experiments and analyzing agent performance.

3. The agent relied solely on OpenAI's GPT-4o model, demonstrating its effectiveness in complex programming tasks.

4. Detailed data analysis, including comparing agent performance against historical data and public leaderboards, was key to identifying regressions and driving improvements.

5. The development process highlighted the continued necessity of manually inspecting agent traces for effective debugging and improvement.

6. Shawn Lewis plans to open-source the 'Face Shift' framework used in this work, given strong community interest.

7. The future of AI programming agents looks very promising, with autonomous programmers plausible within a couple of years, likely accompanied by a growing focus on UI and business integration.

THE ACCIDENTAL INNOVATOR AND A NEW LEADER

Shawn Lewis, CTO of Weights & Biases (W&B), recounts his unexpected journey to the top of the SWE-bench leaderboard for AI coding agents. This achievement, coinciding with the birth of his child, was the result of months of focused work, driven by a personal challenge to build something impactful. Lewis, a natural tool builder, saw an opportunity to apply his expertise to the burgeoning field of AI programming, a domain he felt uniquely positioned to tackle given his background as a programmer.

DOGFOODING: LEVERAGING INTERNAL TOOLS FOR EXTERNAL SUCCESS

A core philosophy at W&B is 'dogfooding'—using their own tools to build and improve. This principle was central to Lewis's SWE-bench success. He utilized W&B's experiment tracking capabilities, specifically the Weave platform, for logging and analyzing agent behavior. To gain deeper insights, he developed 'Eval Studio,' a custom frontend that provides enhanced visibility into agent performance, allowing for detailed analysis of prompts, tool calls, and resulting code changes. This data-centric approach was critical for iterating and optimizing the AI agent.
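The episode doesn't show Eval Studio's internals, but the kind of structured trace it analyzes (prompts, tool calls, and results per benchmark instance) can be sketched with plain dataclasses. This is an illustrative data model, not the actual Weave or Eval Studio API; `AgentStep`, `AgentTrace`, and the example instance ID are assumptions for the sketch.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class AgentStep:
    """One step of an agent run: the prompt sent, the tool invoked, and its result."""
    prompt: str
    tool_call: str
    result: str

@dataclass
class AgentTrace:
    """A full run on one benchmark instance, accumulating steps for later analysis."""
    instance_id: str
    steps: list = field(default_factory=list)

    def log(self, prompt: str, tool_call: str, result: str) -> None:
        self.steps.append(AgentStep(prompt, tool_call, result))

    def to_json(self) -> str:
        # Serialize the whole trace so a frontend can filter and display it.
        return json.dumps(asdict(self), indent=2)

trace = AgentTrace("django__django-11099")  # hypothetical instance ID
trace.log("Fix the failing regex test", "edit_file", "patched validators.py")
print(trace.to_json())
```

Logging traces in a structured, serializable form like this is what makes the later sorting, filtering, and side-by-side comparison of runs possible.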

THE POWER OF GPT-4O AND DATA-DRIVEN EVALUATION

The agent's high performance on SWE-bench was attributed to the strategic use of OpenAI's GPT-4o model, making it the first publicly verified SWE-bench result to rely solely on this model for both agent logic and code generation. Lewis noted GPT-4o's adeptness at precise instruction following, a characteristic that, while sometimes leading to 'malicious compliance,' proved highly effective. The evaluation process involved extensive experimentation on SWE-bench, a benchmark widely recognized for its robust testing of AI programming capabilities.

UNCOVERING REGRESSIONS THROUGH DETAILED DATA ANALYSIS

Lewis emphasized the critical role of detailed data analysis in identifying and rectifying performance regressions. By comparing the agent's output across multiple evaluations, particularly against historical data and public leaderboard results, he could pinpoint specific instances where the agent's performance declined. Tools like spreadsheets and Eval Studio allowed him to sort and filter results, spotlighting problems that were previously solved but now failed, thereby guiding his debugging efforts towards the root causes of these issues.
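The regression-hunting workflow described above (comparing pass/fail outcomes across evaluation runs to find instances that were previously solved but now fail) can be sketched in a few lines. The function name and the pass/fail-dict representation are assumptions; SWE-bench results could equally live in a spreadsheet or database.

```python
def find_regressions(previous: dict, current: dict) -> list:
    """Return instance IDs that passed in the previous eval but fail in the current one.

    `previous` and `current` map SWE-bench instance IDs to a pass/fail bool.
    Instances missing from `current` are treated as failures.
    """
    return sorted(
        iid for iid, passed in previous.items()
        if passed and not current.get(iid, False)
    )

prev_run = {"astropy__astropy-1234": True, "django__django-5678": True}
curr_run = {"astropy__astropy-1234": False, "django__django-5678": True}
print(find_regressions(prev_run, curr_run))  # ['astropy__astropy-1234']
```

Sorting the regressed IDs to the top of a report is exactly the kind of spotlighting a spreadsheet or Eval Studio view enables: debugging effort goes straight to the instances that got worse.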

THE NECESSITY OF MANUAL TRACE INSPECTION

Despite advancements in AI and tooling, Lewis stressed that manually inspecting agent traces remains an indispensable part of the development process for complex tasks like those on SWE-bench. While tools can automate much of the data collection and initial analysis, understanding the nuanced decision-making of an AI agent often requires looking directly at its step-by-step reasoning and actions. This in-depth review allows developers to uncover subtle errors or inefficiencies that automated metrics might miss.
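Manual trace review is easier when each step's reasoning and actions are rendered as a readable transcript. A minimal formatter, assuming a trace is a list of (thought, action, observation) tuples; the tuple shape and example content are illustrative:

```python
def render_trace(steps: list) -> str:
    """Format an agent's step-by-step actions as a transcript for manual review."""
    lines = []
    for i, (thought, action, observation) in enumerate(steps, 1):
        lines.append(f"Step {i}")
        lines.append(f"  thought:     {thought}")
        lines.append(f"  action:      {action}")
        lines.append(f"  observation: {observation}")
    return "\n".join(lines)

steps = [
    ("Locate the failing test", "grep -r test_validate", "found tests/test_validators.py"),
    ("Inspect the validator", "cat validators.py", "regex misses unicode domains"),
]
print(render_trace(steps))
```

A view like this is where subtle errors surface, e.g. an agent whose observations contradict its next thought, which aggregate pass/fail metrics would never reveal.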

FACE SHIFT FRAMEWORK AND FUTURE OPEN-SOURCING

The development also involved a new TypeScript framework for building agents, dubbed 'Face Shift.' This framework, built on top of Weave, simplifies the process of logging and composing agentic tasks. Lewis expressed his intention to open-source Face Shift, acknowledging the strong community interest and the personal utility he has found in it. Although currently optimized for the SWE-bench environment, he aims to refine it for broader applications, potentially using AI assistance to polish the framework itself.

THE EVOLVING LANDSCAPE OF AI PROGRAMMING AGENTS

Looking ahead, Lewis anticipates significant advancements in autonomous AI programmers, potentially within the next one to two years. He highlighted the potential for AI to excel in UI and business integration, citing examples like Devin. While acknowledging that current benchmarks might not fully capture real-world utility, he believes the focus will increasingly shift towards user interfaces and business applicability, alongside the continuous improvement of underlying AI models. The competitive nature of the field drives innovation, pushing the boundaries of what AI can achieve.

SWE-Bench Performance Comparison

Data extracted from this episode

| Agent/Model | Single Rollout (%) | Final Score (%) |
| --- | --- | --- |
| OpenAI (GPT-4o) | 49 | — |
| Shawn Lewis's Agent (GPT-4o) | 57 | 64 |

Common Questions

What is SWE-bench, and why does this score matter?

SWE-bench is a benchmark for evaluating AI coding agents on real-world programming tasks. Achieving a high score, like Shawn Lewis's 64% with GPT-4o, signifies significant progress in AI's ability to autonomously write and fix code.
