Devin’s 80% Moment: Background Agents, 7x PRs, & End of Hand-Held Coding — Walden Yan & Cole Murray

Latent Space Podcast
Science & Technology | 6 min read | 70 min video
May 28, 2026 | 1,391 views

TL;DR

At Cognition, Devin now accounts for about 80% of commits across the company's repos, but complex tasks like end-to-end testing still require significant human expertise.

Key Insights

1. AI coding agents went through a capability shift around December 2025 with models like Opus 4.5 and GPT-5.2, moving from hand-holding to autonomously driving tasks from specification to pull request.
2. Devin's merged pull requests have grown 7x in the last 2-3 months, while its share of commits across Cognition repos jumped from 16% in January to 80% in March.
3. OpenInspect uses a 'brain out of the box' architecture, separating the agent's core logic from its execution environment for better security, at the cost of added state-management complexity.
4. Testing complex, end-to-end features that span front-end and back-end services remains a significant challenge for AI, often requiring orchestration of multiple models or human intervention.
5. While AI can generate code, human oversight is crucial to prevent the codebase from regressing to the level of the 'worst engineer' and to maintain architectural integrity and clear contracts between modules.
6. Key use cases for cloud agents include SRE tasks like first-responder alerts, PM-initiated bug fixes, and customer support issue resolution, with potential costs ranging from $1,000 to $5,000 per engineer.

The 'background agent' paradigm shift and Devin's rapid growth

The engineering world is increasingly embracing 'background agents' or 'cloud agents,' a trend significantly accelerated around December 2025. This shift, driven by advancements in models like Opus 4.5 and GPT-5.2, has enabled AI agents to move from requiring constant human guidance to autonomously driving tasks from a specification to a completed pull request with minimal friction. Walden Yan, co-founder and CPO of Cognition, highlights this transformation, noting that Devin's development saw a 7x increase in merged pull requests over a few months, alongside a dramatic rise in its commit percentage across Cognition repositories, soaring from 16% in January to 80% by March. This growth underscores the practical viability and increasing adoption of autonomous AI agents in software development workflows.

OpenInspect's architectural choices for robust cloud agents

Cole Murray, creator of OpenInspect, discusses the architectural decisions behind building cloud agents, particularly the 'harness in the box' versus 'out of the box' dilemma. OpenInspect adopts an 'out of the box' approach, running the agent's 'brain' in a control plane worker while the sandbox acts as the 'hands.' This separation is favored for security, as it keeps sensitive secrets out of the potentially unpredictable agent environment. It adds complexity in state management but offers better control. Murray also notes the importance of robust developer environment setups, recommending Docker Compose for running supporting infrastructure (though not for running the agent itself), and emphasizes that a good local developer experience naturally translates to easier agent sandbox setup. OpenInspect includes hooks for pre-installing dependencies and snapshotting environments for rapid agent startup.
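The 'brain' and 'hands' split can be illustrated with a minimal Python sketch. This is not OpenInspect's code: the `ControlPlane` and `Sandbox` classes, their methods, and the `GIT_TOKEN` key are hypothetical names chosen for illustration. The point it shows is that only command text crosses into the sandbox, while credentials stay in the control plane.

```python
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """The 'hands': executes commands and never receives credentials."""
    env: dict[str, str] = field(default_factory=dict)
    log: list[str] = field(default_factory=list)

    def run(self, command: str) -> str:
        # A real sandbox would execute the command; here we only record it.
        self.log.append(command)
        return f"ran: {command}"


@dataclass
class ControlPlane:
    """The 'brain': holds secrets and decides what the sandbox does."""
    secrets: dict[str, str]
    sandbox: Sandbox

    def step(self, command: str) -> str:
        # Only the command text crosses the boundary, never a secret.
        return self.sandbox.run(command)

    def open_pull_request(self, branch: str) -> str:
        # The privileged call happens here, outside the sandbox.
        token = self.secrets["GIT_TOKEN"]  # hypothetical secret name
        return f"PR opened for {branch} (authenticated: {bool(token)})"


sandbox = Sandbox()
control_plane = ControlPlane(secrets={"GIT_TOKEN": "hypothetical-token"}, sandbox=sandbox)
control_plane.step("pytest -q")
print(control_plane.open_pull_request("fix/bug"))
```

The trade-off named in the episode shows up even at this scale: the control plane has to track what the sandbox has done (here, via `log`), which is the extra state management that the split introduces.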

The persistent challenge of AI-driven testing and complex problem-solving

Despite the advancements in AI's ability to generate code and perform basic computer use (like clicking buttons), complex testing remains a significant hurdle. The true challenge lies not in controlling the cursor, but in the AI's capacity for 'arbitrary testing' – reasoning through intricate scenarios that span front-end, back-end, and multiple services. To test a change effectively, an AI must orchestrate applications with the correct code versions, trigger features, and understand complex interdependencies, sometimes requiring administrative privileges or specific feature flag configurations. Yan explains that this comprehensive testing often necessitates orchestrating multiple frontier models together, as a single model may not be capable of handling the end-to-end task. Therefore, 'testing' in this context is a deep problem-solving challenge for AI, far exceeding simple computer interaction.

Memory and knowledge management for autonomous agents

A significant unsolved problem in agent development is effective memory and knowledge management. Cognition's journey with Devin included developing a system called 'Kip,' which aimed to auto-generate memories and learn over time without explicit user input. The goal was to have agents proactively ask for approval to remember information, building a knowledge base organically. However, both generation and retrieval of these memories pose challenges. Agents need to distinguish between common patterns and one-off requests, and efficiently retrieve relevant information from potentially thousands of memories without overwhelming the context window. Devin has evolved to allow editing memories, but rebuilding them to feel more like a navigable file system managed by the agent itself is an ongoing exploration. This 'memory pruning' and temporal aspect are crucial for agents to maintain long-term context and relevance.
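The two problems named here, telling recurring patterns from one-off requests and retrieving within a limited context window, can be sketched in a few lines of Python. This is an illustrative toy, not how Devin's memory works: `MemoryStore`, the `promote_after` threshold, the word-overlap scoring, and the character budget are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    times_seen: int = 1  # how often the same instruction has recurred


class MemoryStore:
    def __init__(self, promote_after: int = 2) -> None:
        # An instruction is treated as a pattern only after it recurs.
        self.promote_after = promote_after
        self._items: dict[str, Memory] = {}

    def observe(self, text: str) -> None:
        key = text.strip().lower()
        if key in self._items:
            self._items[key].times_seen += 1
        else:
            self._items[key] = Memory(text=text.strip())

    def retrieve(self, query: str, budget_chars: int) -> list[str]:
        """Return relevant, promoted memories that fit within the budget."""
        query_words = set(query.lower().split())
        scored = []
        for memory in self._items.values():
            if memory.times_seen < self.promote_after:
                continue  # one-off requests are not recalled
            overlap = len(query_words & set(memory.text.lower().split()))
            if overlap > 0:
                scored.append((overlap, memory.text))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        picked: list[str] = []
        used = 0
        for _, text in scored:
            if used + len(text) <= budget_chars:
                picked.append(text)
                used += len(text)
        return picked
```

A real system would likely replace the word-overlap score with embedding search and add the pruning and time-decay the summary mentions; the sketch only shows where those decisions would sit.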

Integrating agents into company ecosystems and common use cases

Beyond core code generation, the real value of cloud agents emerges when they are deeply integrated into a company's broader ecosystem. Common use cases include SRE tasks, where agents act as first responders to alerts, collecting context from logs and databases to generate immediate pull requests for fixes. For non-technical teams like PMs and marketing, agents enable code modifications directly via Slack, bypassing traditional engineering bottlenecks. Customer support also benefits, as agents can rapidly gather detailed context about reported issues, streamlining the debugging process. These integrations often require custom solutions beyond standard MCPs, involving webhooks and natural language interaction to ensure seamless communication and value realization.

The critical role of human oversight and architectural integrity

Despite the push towards agent autonomy, human oversight remains indispensable. A significant risk is 'codebase regression,' where AI agents, by replicating patterns from their training data or from less experienced engineers, can inadvertently cement 'sloppy' or inefficient coding practices. This necessitates a focus on code quality, scheduled cleanups, and strong architectural contracts between modules. While agents can now interact more sophisticatedly, sometimes even pushing back on human instructions, the 'human in the loop' is vital for maintaining architectural integrity, managing complex decision-making, and ensuring the long-term maintainability and scalability of codebases. The ability to define strict boundaries and require human sign-off for cross-module changes is highlighted as a crucial responsibility for engineering leadership.
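One way to picture the sign-off rule for cross-module changes is a small check run against the files changed in a pull request. The sketch below is hypothetical: it assumes that the first path segment names the module, which will not hold for every repository layout, and the function names are invented for illustration.

```python
from pathlib import PurePosixPath


def modules_touched(changed_files: list[str]) -> set[str]:
    """Treat the first path segment as the module name (an assumption)."""
    modules = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        if parts:
            modules.add(parts[0])
    return modules


def needs_human_signoff(changed_files: list[str]) -> bool:
    """Require a human reviewer when a change crosses a module boundary."""
    return len(modules_touched(changed_files)) > 1


print(needs_human_signoff(["billing/api.py", "auth/session.py"]))
```

A check like this could run in CI and block an agent-authored pull request until a human approves it, leaving single-module changes on the faster path.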

Infrastructure: The unseen backbone of agent performance

The performance and reliability of AI agents are heavily reliant on the underlying infrastructure, an area where Cognition has invested significant effort. Early challenges with Devin included slow boot-up and shutdown times for virtual machines (VMs) not designed for repeated use. This led to the development of specialized infrastructure, including a custom file system format, 'block diff file storage,' which incrementally builds on previous states, drastically reducing VM spin-up and teardown times. This focus on 'agent infra' allows for deployment in diverse environments like VPCs and on-premises setups. Experiences with network file systems causing slow grep operations also highlighted the need for fine-grained infra optimization. The effort to provide a seamless agent experience means selling not just the agent, but also the robust infrastructure enabling it.
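The summary gives no details of Cognition's 'block diff file storage' format, so the following Python sketch substitutes a well-known stand-in: content-addressed block deduplication. It only illustrates the general idea of building a new snapshot on top of earlier state so that unchanged blocks are not rewritten. `SnapshotStore`, `BLOCK_SIZE`, and the snapshot names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, to keep the example readable


def split_blocks(data: bytes) -> list[bytes]:
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


class SnapshotStore:
    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}         # content hash -> block bytes
        self.snapshots: dict[str, list[str]] = {}  # snapshot name -> ordered hashes

    def save(self, name: str, data: bytes) -> int:
        """Store a snapshot and return how many new blocks were written."""
        written = 0
        hashes = []
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block  # only unseen blocks cost storage
                written += 1
            hashes.append(digest)
        self.snapshots[name] = hashes
        return written

    def load(self, name: str) -> bytes:
        return b"".join(self.blocks[h] for h in self.snapshots[name])


store = SnapshotStore()
store.save("base", b"AAAABBBBCCCC")
print(store.save("after-task", b"AAAAXXXXCCCC"))  # only the changed block is new
```

Because a second snapshot writes only the blocks that differ from what is already stored, saving and restoring a machine image after a small change becomes proportional to the size of the change rather than the size of the disk.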

The future: Hybrid intelligence, multi-agent systems, and the evolving developer experience

The trajectory of AI agents points towards more sophisticated capabilities. The concept of 'hybrid intelligence,' blending powerful frontier models with efficient sub-frontier models, promises to optimize performance and cost. Multi-agent systems, where agents can collaborate or spawn sub-agents, are an area of ongoing research, though current practical applications often still rely on single, highly capable agents. A key development is the 'handoff' between background and foreground agents, enabling a more fluid developer experience with tools like Windsurf. This allows developers to seamlessly transition between automated tasks and local debugging or intervention. While AI's ability to write code is rapidly advancing, the emphasis is shifting towards how agents can be integrated, managed, and overseen to truly enhance productivity and maintain high-quality software development.
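As a rough illustration of 'hybrid intelligence', a router might send hard tasks to a frontier model and routine ones to a cheaper model. The heuristic below is a guess made for the sake of the example: the `Task` fields, the `file_threshold` default, and the tier labels are hypothetical and do not describe any vendor's routing logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    description: str
    files_touched: int
    needs_planning: bool


def choose_model(task: Task, file_threshold: int = 5) -> str:
    """Pick a model tier from coarse task features (hypothetical heuristic)."""
    if task.needs_planning or task.files_touched > file_threshold:
        return "frontier"
    return "sub-frontier"


print(choose_model(Task("Fix typo in README", files_touched=1, needs_planning=False)))
```

In practice the routing signal would more likely come from a classifier or from the agent's own plan than from a fixed threshold.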

Best Practices for Implementing and Using AI Coding Agents

Practical takeaways from this episode

Do This

Consider 'out-of-the-box' agent architectures for better security, despite complexity.
Utilize Docker for running infrastructure, but be cautious about using it for the agent itself.
Integrate agents into your company's existing ecosystem (databases, logs, knowledge bases) for maximum value.
Focus on building robust local development environments and mock servers for effective agent testing.
Implement scheduled cleanup and duplication checks to prevent codebase regression.
Ensure strict boundaries and clear contracts between different modules in your system.
Leverage AI to assist in migrating older codebases towards better local development practices.
Use agents for SRE use cases, including automated triage and incident response.
Consider agents for non-developer tasks like PMs prompting for bug fixes or customer support.
Develop well-defined roles for humans in overseeing AI operations, especially for deep infrastructure problems.

Avoid This

Do not assume Docker containers provide a true security boundary for agents.
Avoid giving agents direct access to sensitive production credentials for testing.
Do not solely rely on AI-generated code without auditing; be aware of potential codebase regression.
Do not underestimate the importance of human expertise in complex infrastructure challenges.
Avoid 'reward hacking' behaviors in AI coding, such as excessive use of `getattr` or untyped tuples.
Do not expect AI memory systems to be fully solved; approach them with caution.
Do not neglect the need for clear communication and defined protocols when transitioning between local and cloud agent environments.
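The 'reward hacking' item above can be made concrete with a short Python contrast. It reads the summary's 'get attribute' as Python's built-in `getattr`, which is an assumption, and the `User` and `UserSummary` types are hypothetical. The loose version passes silently even with a misspelled field name, while the typed version makes the same mistake fail loudly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    email: str


# Pattern to avoid: getattr with a default hides the misspelled "emial",
# and the bare tuple gives callers no field names or types to rely on.
def summarize_loose(user) -> tuple:
    return (getattr(user, "name", None), getattr(user, "emial", None))


@dataclass(frozen=True)
class UserSummary:
    name: str
    email: str


# Preferred: direct attribute access and a named, typed return value.
# A misspelled attribute here would raise AttributeError immediately.
def summarize_typed(user: User) -> UserSummary:
    return UserSummary(name=user.name, email=user.email)


user = User(name="Ada", email="ada@example.com")
print(summarize_loose(user))
print(summarize_typed(user))
```

Code like `summarize_loose` tends to make tests and type checkers pass without being correct, which is why it is worth flagging during review of agent-written changes.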

Common Questions

Background agents, also known as cloud agents, are AI systems designed to operate autonomously in the background. They can transform specifications into completed pull requests with minimal human intervention, shifting the paradigm from hand-holding models to leveraging their increased autonomy.

Topics

Mentioned in this video

Software & Apps
Opus 4.5

A model that enabled a shift towards autonomous AI agents capable of driving tasks from specification to pull request with minimal human intervention.

Sonnet

A model version that represented a significant leap in intelligence, leading to the stripping out of unnecessary parts of Devin and enabling higher autonomy.

Open Inspect

An open-source project focused on creating cloud agents, inspired by client friction with existing tools like Cloud IDE and Slack-based interactions.

Claude

Mentioned as one of the models whose capabilities were tested, particularly in the context of early development of agents and comparing GPT and Claude.

GPT

Mentioned in comparison to Claude regarding early agent development and capabilities.

Daytona

A sandbox layer provider where money is being made in the agent ecosystem, mentioned as a potential contributor to Open Inspect.

E2B

A sandbox layer provider where money is being made in the agent ecosystem, mentioned as a potential contributor to Open Inspect.

EC2

Raw VMs from cloud providers like EC2 were used in the early stages of building Devin, leading to slow boot times and difficulty in bringing systems back up.

Docker Compose

A common tool used by teams for microservices, which can be a good solution for running infrastructure, though not ideal for running the agent itself.

Docker

Containers are discussed as a potential abstraction for models, but noted to not be a true security boundary and can lead to 'Docker in Docker' issues.

Cloud IDE

A tool that clients were using, revealing friction points that inspired the development of Open Inspect, particularly regarding session sharing.

Cursor

Their released recordings of AI agent testing were enlightening, and their codebase experiments with single to multi-agent approaches are discussed.

Deep Brookie

An AI system that Devin can call upon to make requests and return results, functioning like a tool call within a larger agent system.

Claude 4.6

Mentioned for preserving backwards compatibility at all costs, a habit it shares with GPT models, through odd import/export workarounds.

Git AI

The concept of storing agent prompts alongside code in git metadata for future reference by agents and review bots.

PostgreSQL

A local database setup recommended for AI agents to test code without needing to provide sensitive production credentials.

Little Snitch

A man-in-the-middle tool that shows all traffic, which can be used to reconstruct server behavior and create local mocks for testing.

Windsurf

A recent release that acts as a local command center for managing both background and local agents, facilitating handoffs between them and improving the testing process.

Modal

A highly regarded offering for cloud agent sandboxes, particularly its container offering, though it is Python-centric.

Python

The primary language for many libraries in the observability space, and often the default for AI-generated code, though JavaScript is gaining traction.

JavaScript

Appears to be winning in terms of workload shifts, with many new greenfield applications being built using it.

Mac OS

An operating system for which specific VM support might be needed for certain technologies or development tasks.

Java

Programming language used by Cognition, contrasting with the prevalence of Python and JavaScript in newer applications.

Windows

An operating system that an agent managed to rebuild by running for a long enough period, showcasing its capabilities.

Confluence

An internal knowledge base system that agents can integrate with to provide comprehensive context and support.

C++

Programming language used by Cognition, contrasting with the prevalence of Python and JavaScript in newer applications.

S3

Amazon Simple Storage Service, used in network file systems where data is cached, causing network calls during operations like 'grep' and slowing down performance.

Pon Ping

A tool that plays sound packs from popular games like Command and Conquer and Warcraft when an agent completes its work.

iOS

An operating system for which specific VM support might be needed for iOS development.

Firecracker

A virtualized instance used as the basis for VMs in the agent system, requiring nested virtualization for Android emulator support.

Android

Development for Android requires nested virtualization within machines, with performance issues that are still being addressed in beta.

Slack

A communication platform where agents can be integrated for various use cases, including customer support and prompting for code changes.

D-Piki

An AI system that Devin can call to make requests and return results, functioning as a tool call within the agent's workflow.
