The #1 SWE-Bench Verified Agent

Latent Space Podcast
Science & Technology · 6 min read · 32 min video
Apr 2, 2025 · 2,090 views
TL;DR

Augment Code launches a top-tier AI agent for SWE-Bench, leveraging off-the-shelf models and innovative techniques like sequential thinking and ensembling.

Key Insights

1. Augment Code's new AI agent ranks #1 on the SWE-Bench Verified leaderboard using off-the-shelf LLMs, without custom model fine-tuning.
2. Key strategies for agent performance include sequential thinking, ensembling multiple model outputs, and reliable file editing.
3. The SWE-Bench benchmark is useful for prompt refinement but less so for evaluating codebase understanding, since its tasks tend to pinpoint the required changes.
4. Hybrid cloud and multi-model approaches are crucial for building robust AI systems, balancing cost and capability.
5. Augment Code prioritizes meeting developers where they are by integrating agents into existing IDEs (VS Code, JetBrains) rather than forcing new workflows.
6. The company's agent development focuses on large, complex codebases, emphasizing context engine integration and future multi-agent capabilities.

LAUNCH OF AUGMENT CODE'S NEW AI AGENT

Augment Code has launched a new AI agent feature that significantly enhances their coding assistance capabilities. Building on previous features like code completion, next edit suggestions, and chat with codebase understanding, the agent introduces advanced codebase comprehension. This allows the agent to understand requests, identify necessary changes within the codebase while respecting its conventions, execute commands, run tests, and ultimately generate a working Pull Request (PR).

SWE-BENCH SUCCESS AND MODEL STRATEGIES

The team's agent has achieved the #1 spot on the SWE-Bench Verified leaderboard, a feat accomplished using off-the-shelf LLMs. While the product includes custom models for codebase understanding, the SWE-Bench performance highlights the power of readily available models. They found that SWE-Bench, while useful for prompt engineering and tool experimentation, doesn't significantly benefit from their specific codebase understanding features due to its focused nature, where changes are often pinpointed.

OPTIMIZATION TECHNIQUES FOR AGENT PERFORMANCE

Several techniques contributed to the agent's high performance. Sequential thinking, allowing the model to reflect and improve its actions, was a significant factor. Ensembling, where multiple agent runs are combined (e.g., through majority voting), also boosted scores, although it incurs higher costs. Reliable file editing was identified as a non-trivial but crucial task that required considerable iteration to perfect.
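
The ensembling idea above can be sketched as a simple majority vote over independent agent runs. This is an illustrative sketch, not Augment Code's implementation: it assumes each run produces a comparable final patch, and a real system would normalize patches before comparing and might break ties with a judge model.

```python
from collections import Counter

def majority_vote(candidate_patches):
    """Pick the patch produced most often across independent agent runs.

    Assumes patches are directly comparable strings; a production system
    would normalize whitespace and hunk ordering first.
    """
    if not candidate_patches:
        raise ValueError("need at least one candidate")
    counts = Counter(candidate_patches)
    patch, _ = counts.most_common(1)[0]
    return patch

runs = ["patch_a", "patch_b", "patch_a", "patch_a", "patch_c"]
print(majority_vote(runs))  # patch_a (3 of 5 runs)
```

Note the cost trade-off: five runs cost roughly five times one run, which is why the episode flags ensembling as effective but expensive.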

THE IMPORTANCE OF HYBRID APPROACHES IN AI

The development of effective AI agents necessitates a hybrid approach, similar to hybrid cloud strategies in infrastructure. This involves supporting multiple models and potentially multiple cloud providers to maximize benefits and availability. Augment Code built its system with this in mind from the start, considering different models for generation and retrieval, and acknowledges that cost management is a key driver for mixing and matching models.
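
A minimal sketch of what such a multi-model, multi-provider setup might look like: a routing table that assigns different model tiers to different tasks (cheap models for retrieval, frontier models for generation) with cross-provider fallback. The provider and model names here are placeholders, not Augment Code's actual stack.

```python
# Hypothetical routing table: task -> ordered (provider, model) preferences.
MODELS = {
    "retrieval": [("provider_a", "small-embed"), ("provider_b", "small-embed")],
    "generation": [("provider_a", "frontier-llm"), ("provider_b", "frontier-llm")],
}

def pick_model(task, available):
    """Return the first (provider, model) pair whose provider is reachable."""
    for provider, model in MODELS[task]:
        if provider in available:
            return provider, model
    raise RuntimeError(f"no provider available for {task}")

# If provider_a is down, generation falls back to provider_b.
print(pick_model("generation", available={"provider_b"}))
```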

EXPERIMENTATION AND EVALUATION FRAMEWORKS

Developing and refining AI agents involves a robust experimentation process. Augment Code starts with small, curated sets of samples for initial feature development, becoming deeply familiar with these examples. As development progresses, they scale to larger datasets and employ infrastructure for evaluations, including those with and without code execution. They also focus on bridging the gap between research evaluations and production systems to catch regressions.
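
The "start small, scale up" loop described above can be sketched as a tiny eval harness over a hand-curated sample set. The sample fields and `agent_fn` interface are assumptions for illustration, not Augment Code's real evaluation infrastructure.

```python
# A hand-curated starter set: small enough to know every example deeply.
CURATED = [
    {"id": "s1", "task": "fix off-by-one in pagination", "expected": "ok"},
    {"id": "s2", "task": "rename symbol across module", "expected": "ok"},
]

def run_eval(agent_fn, samples):
    """Run the agent on each sample and report per-sample pass/fail."""
    results = {}
    for sample in samples:
        out = agent_fn(sample["task"])
        results[sample["id"]] = (out == sample["expected"])
    pass_rate = sum(results.values()) / len(results)
    return results, pass_rate

# Stub agent that always succeeds, to show the harness shape.
results, rate = run_eval(lambda task: "ok", CURATED)
print(rate)  # 1.0 with the stub agent
```

Once the small set is saturated, the same harness scales to larger datasets, and tracking per-sample results over time is what catches regressions between research and production.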

INTEGRATING AGENTS INTO DEVELOPER WORKFLOWS

Augment Code's philosophy is to meet developers where they are, integrating AI agents into existing Integrated Development Environments (IDEs) like VS Code and JetBrains, rather than forcing a change in workflow. While advanced AI development might eventually move beyond the IDE, currently, for complex codebases, the IDE remains essential. They plan for a future where standalone apps control agents, with the IDE used for deeper dives.

ADDRESSING AGENT COST AND UX CHALLENGES

The cost of running powerful AI models is a significant consideration. While ensembling can improve results, its cost-effectiveness needs careful balancing. Furthermore, as agents become more capable, user experience (UX) becomes a critical factor. Presenting multiple agent trajectories from ensembling can be confusing for users who want to follow the process, highlighting the need for intuitive interfaces even if costs decrease in the future.

MULTI-AGENT SYSTEMS AND CODEBASE ORIENTATION

The concept of multi-agent systems, such as an 'orientation agent' and a 'regression fixing agent,' is crucial. Orientation, in particular, is vital for agents operating in complex codebases. This involves understanding codebase conventions, testing frameworks, and identifying how to execute tests. Augment Code is building these orientation capabilities into their product and plans to ship a more thorough orientation process that runs for several minutes.
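
An orientation pass like the one described might start by scanning the repository for conventional marker files that reveal the build and test setup. The marker list below reflects common ecosystem conventions, not anything Augment Code has published about their orientation process.

```python
import os

def orient(repo_root):
    """Collect notes on how to build and test a codebase from marker files."""
    markers = {
        "pytest.ini": "tests run with pytest",
        "package.json": "JS project; check the 'scripts' field for a test command",
        "Makefile": "try `make test`",
        "BUILD.bazel": "Bazel build; run tests via `bazel test`",
    }
    findings = {}
    for name, note in markers.items():
        if os.path.exists(os.path.join(repo_root, name)):
            findings[name] = note
    return findings
```

A thorough multi-minute orientation, as planned, would presumably go far beyond file markers (reading docs, running commands, sampling code style), but the output is the same kind of artifact: a structured summary the agent can consult before editing.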

MEMORIES AND CONTINUOUS LEARNING MECHANISMS

Augment Code is incorporating 'memories' into their agent system, allowing the agent to learn from its mistakes and adapt over time. This feature helps the agent avoid repeating errors and generalize correctly about codebase conventions, such as testing procedures or execution methods. By creating memories, the agent continuously improves its performance as it works alongside the developer.
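
A minimal sketch of such a memory store, assuming memories are short natural-language notes keyed by topic and injected into future prompts; the class and its interface are illustrative, not Augment Code's design.

```python
class MemoryStore:
    """Deduplicated notes the agent accumulates while working in a codebase."""

    def __init__(self):
        self._notes = {}

    def remember(self, topic, note):
        # Store each note once per topic, so repeated lessons don't pile up.
        self._notes.setdefault(topic, [])
        if note not in self._notes[topic]:
            self._notes[topic].append(note)

    def recall(self, topic):
        return self._notes.get(topic, [])

mem = MemoryStore()
mem.remember("testing", "run unit tests with `pytest -q`")
mem.remember("testing", "run unit tests with `pytest -q`")  # deduplicated
print(mem.recall("testing"))  # one note, not two
```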

COMPARISON WITH OTHER ENTERPRISE AI CODING SOLUTIONS

Augment Code positions itself for developers working in large, complex codebases, contrasting with zero-to-one development tools. Their focus is on complementing existing workflows within IDEs, offering extensions for VS Code and upcoming support for JetBrains. They aim to facilitate multi-agent usage without disrupting established developer practices, differentiating them from potential competitors focusing on entirely new platforms or AI-centric workflows.

LEVERAGING OFF-THE-SHELF MODELS AND FUTURE CUSTOMIZATION

For their agent product, Augment Code primarily uses off-the-shelf models, with custom models reserved for specific needs like codebase understanding. The company believes the explosion of agent usage will drive up costs, creating an opportunity for custom-trained models to optimize performance and cost-effectiveness in the future. This strategy allows for rapid product development and market entry while keeping future customization options open.

DEMONSTRATION OF END-TO-END AGENT CAPABILITIES

A demonstration showcased the agent implementing a new tool (a dialog box) within Augment Code's own VS Code extension. The agent successfully retrieved ticket information, used the context engine for orientation, planned the implementation, edited its own code, registered the new tool, and integrated it with the VS Code API. The process included registering the tool, defining its schema, and implementing the functionality, culminating in a working feature.
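
The register-schema-implement pattern from the demo can be sketched generically: a tool is a name, a JSON-Schema-style parameter spec, and a handler. This is an illustration of the pattern, not the extension's real API (which, per the demo, goes through the VS Code extension machinery).

```python
TOOLS = {}

def register_tool(name, schema, handler):
    """Register a tool under a name with its parameter schema and handler."""
    TOOLS[name] = {"schema": schema, "handler": handler}

def call_tool(name, **kwargs):
    """Validate required parameters against the schema, then invoke the handler."""
    tool = TOOLS[name]
    missing = [k for k in tool["schema"].get("required", []) if k not in kwargs]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    return tool["handler"](**kwargs)

register_tool(
    "show_dialog",
    {"type": "object", "required": ["message"],
     "properties": {"message": {"type": "string"}}},
    lambda message: f"[dialog] {message}",
)
print(call_tool("show_dialog", message="Build finished"))
```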

INTEGRATION WITH EXTERNAL SYSTEMS AND PULL REQUESTS

Following the implementation of the new tool, the agent demonstrated its capability to create a Pull Request (PR) via GitHub integration. By connecting to both Linear for ticket management and GitHub for code changes, the system facilitates an end-to-end workflow. While external actions like PR creation require manual confirmation for safety, this end-to-end automation streamlines the development process significantly.
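
The manual-confirmation gate for external actions can be sketched as follows: the agent proposes a side-effecting action, and nothing executes until a human approves. `create_pr` is a stub standing in for a real GitHub call; the gate itself is the point.

```python
def create_pr(title, branch):
    # Stub for an external GitHub call.
    return f"PR opened: {title} ({branch})"

def run_external_action(action, confirm):
    """Execute a side-effecting action only after explicit confirmation."""
    if not confirm(action["description"]):
        return "action skipped"
    return action["run"]()

action = {
    "description": "Open PR 'Add dialog tool' from branch agent/dialog-tool",
    "run": lambda: create_pr("Add dialog tool", "agent/dialog-tool"),
}
# In a real UI, `confirm` would prompt the developer instead of auto-approving.
print(run_external_action(action, confirm=lambda desc: True))
```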

PERSPECTIVES ON EMERGENT AI TRENDS AND RESEARCH

The discussion touched on emerging research areas like reinforcement learning (RL) for coding, specifically the SWE-RL paper by Wei et al., along with general techniques like DPO and GRPO. On Gemini and Google's AI progress, the sentiment is that Google has 'woken up' and is making solid progress, though the ultimate frontier may still be AGI, with significant room for improvement in current models' quality and reasoning capabilities.

CALL TO ACTION AND OPEN-SOURCED RESOURCES

Augment Code encourages developers to visit their website (augmentcode.com) to download the extension and try the agent and other features, especially those working with large codebases. They also offer a free tier where code can be used for training purposes. Additionally, Augment Code has open-sourced their implementation of SWE-Bench, providing valuable resources for the community interested in understanding their top-ranking approach.

Augment Code Agent Best Practices

Practical takeaways from this episode

Do This

Utilize the agent for complex codebases to understand conventions and existing structures.
Leverage the agent's ability to integrate with tools like GitHub and Linear for end-to-end workflows.
Explore the 'memories' feature for agents to learn from past mistakes and adapt to your coding style.
Consider using sequential thinking and ensembling techniques to potentially improve agent performance.
For custom integrations, explore MCP for flexibility in connecting agents to various systems.

Avoid This

Don't rely solely on off-the-shelf models; consider custom models for specific needs like codebase understanding.
Avoid expecting agents to perfectly handle complex, multi-step tasks without breaking them down.
Be mindful of agent costs; mixing and matching models can help manage expenses.
Don't underestimate the importance of the 'orientation' phase for agents in understanding a codebase.
For now, expect to work within your IDE for complex agent-assisted development, rather than solely through external interfaces.

Common Questions

What does Augment Code's new agent feature do?

Augment Code has launched a new agent feature that goes beyond code completion and chat. It understands requests, analyzes the codebase for conventions and design, and can execute commands, run tests, and generate pull requests.

Topics

Mentioned in this video

Software & Apps
SWE-Bench

A benchmark used by Augment Code to evaluate and refine their agent capabilities; its Verified subset is where their agent reached the #1 leaderboard spot.

Linear

A project management tool used by Augment Code for managing tickets, which their agent can interact with to implement tasks.

Devin

A competitor in the enterprise coding agent space, mentioned as part of a landscape analysis.

Sourcegraph

Mentioned as a player in the enterprise coding agent space, providing context for Augment Code's market positioning.

Bazel

A build system used by Augment Code, mentioned in the context of building VS Code extensions and the challenges involved.

Gemini

Google's advanced AI model series, with Gemini 2.5 Pro being highlighted as a top-performing model.

VS Code

A popular code editor where Augment Code's agent feature is implemented as an extension, allowing seamless integration into developer workflows.

Vim

A text editor for which Augment Code offers a plugin, demonstrating their commitment to meeting developers where they are.

Cursor

A code editor that users can use Augment with, indicating compatibility and wider adoption of Augment's tools.

PaLM 2

A previous generation AI model from Google, mentioned in the context of their ongoing AI development and competition.

Magic.dev

A company offering coding extensions, discussed in the context of proprietary model training versus Augment Code's approach.

ChatGPT

An AI chatbot developed by OpenAI, whose release is noted as a significant moment that prompted a crisis response within Google.
