How has the capability of AI models like GPT-5.2 and Opus 4.5 changed AI agent development?

Starting around December 2025, these models reached a level of capability at which they could autonomously drive a task from specification to completed pull request. This significantly reduced friction and made background agents a more practical and powerful tool.

What were the key challenges that led to the creation of Open Inspect?

Open Inspect was developed due to friction points observed with tools like Cloud IDE and Slack-based interactions. A major issue was that sessions were specific to the invoker, preventing collaboration, especially for PMs needing to engage with engineering.

What is the difference between 'harness in the box' and 'out of the box' agent architectures?

The 'harness in the box' model runs the agent within the sandbox, which can be simpler but poses security risks due to secrets being co-located. The 'out of the box' model separates the agent's 'brain' in a control plane from the sandbox 'hands', offering better security but increased architectural complexity.

Why is AI testing considered more complex than basic computer use?

Testing an AI's ability to run an app involves reasoning about orchestrating complex systems, triggering features across front-end and back-end, and handling arbitrary conditions like admin access or feature flags. This requires deep codebase context and orchestration, often more than a single frontier model can manage alone.

How does Devin handle interactions on GitHub, and what are the implications?

Devin can interact directly on GitHub, receiving and addressing comments on its own pull requests. This requires significant tuning to keep its comments high-signal and to prevent infinite loops, but it enables Devin to act as a collaborator, even pushing back when it disagrees with a proposed change.
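One way to picture the loop-prevention tuning mentioned above is a small guard that decides whether the agent should answer a given comment. This is a minimal sketch under stated assumptions, not Devin's actual logic: the `CommentGuard` class, the `devin-bot` login, and the per-thread reply budget are all hypothetical.

```python
from collections import Counter


class CommentGuard:
    """Decide whether an agent should respond to a review comment."""

    def __init__(self, agent_login, max_replies_per_thread=3):
        self.agent_login = agent_login
        self.max_replies = max_replies_per_thread
        self.replies = Counter()  # thread id -> replies already sent

    def should_respond(self, thread_id, author):
        if author == self.agent_login:
            return False  # never react to the agent's own comments
        if self.replies[thread_id] >= self.max_replies:
            return False  # reply budget spent; leave it to a human
        self.replies[thread_id] += 1
        return True


guard = CommentGuard(agent_login="devin-bot", max_replies_per_thread=2)
print(guard.should_respond("thread-1", "alice"))
print(guard.should_respond("thread-1", "devin-bot"))
```

Ignoring self-authored comments stops the most direct feedback loop, and the budget caps back-and-forth with other bots or reviewers.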

What are the biggest integration challenges for AI agents in enterprise environments?

Integrating agents into existing company ecosystems is crucial but challenging. This includes connecting to production databases, logs, or knowledge bases, especially in compliance-heavy environments with strict access control. Building custom ad-hoc integrations is often necessary.

Is the problem of AI memory and knowledge bases considered solved yet?

No, AI memory and knowledge bases are largely considered unsolved problems. While skills can bridge some gaps, the core challenges lie in retrieval and generation of memories, ensuring accuracy, relevance, and efficient storage without overwhelming context.

What are the potential pitfalls of relying solely on AI for coding?

A major pitfall is that a codebase can regress to the patterns of the least experienced engineer if AI-generated code isn't audited. This can lead to exponential growth of 'slop' and duplication. Scheduled cleanup and strict modular boundaries are essential.
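A scheduled duplication check could start as simply as the sketch below, which groups functions whose bodies are structurally identical. It is an illustrative assumption rather than a tool named in the episode: `find_duplicate_functions` and the `SAMPLE` source are hypothetical, and only Python's standard `ast` module is used.

```python
import ast
from collections import defaultdict


def find_duplicate_functions(source):
    """Group function names whose bodies have identical syntax trees."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            # ast.dump omits line numbers by default, so identical
            # bodies produce identical fingerprints.
            fingerprint = "".join(ast.dump(stmt) for stmt in node.body)
            groups[fingerprint].append(node.name)
    return [names for names in groups.values() if len(names) > 1]


SAMPLE = '''
def area(w, h):
    return w * h

def rect_area(w, h):
    return w * h

def perimeter(w, h):
    return 2 * (w + h)
'''

print(find_duplicate_functions(SAMPLE))
```

This only catches exact structural copies; near-duplicates with renamed variables would need a fuzzier comparison.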

Can AI agents truly collaborate in multi-agent systems?

While the concept is exciting, practical multi-agent collaboration is still evolving. Currently, many multi-agent experiments function more like tool calls. However, AI agents showing maturity by sometimes disagreeing and pushing back suggests true collaboration might become more feasible.

What are some common use cases for cloud agents today?

Common use cases include SRE functions like automated triage and incident response, non-developer tasks like PMs prompting for bug fixes, and customer support, where agents can gather full context to help resolve issues more efficiently.

How much should companies expect to spend on AI agent infrastructure?

Spending can range from approximately $1,000 to $5,000 per engineer, though numbers as high as $50,000 per engineer have been seen. The ultimate budget depends on the value and responsible usage derived from the agents.

Key Moments

Devin’s 80% Moment: Background Agents, 7x PRs, & End of Hand-Held Coding — Walden Yan & Cole Murray

Latent Space Podcast

Science & Technology | 6 min read | 70 min video

May 28, 2026 | 6,433 views

TL;DR

AI coding agents now account for 80% of commits across Cognition's repos, but complex tasks like testing still require significant human expertise.

Key Insights

AI coding agents went through a capability shift around December 2025 with models like Opus 4.5 and GPT-5.2, moving from hand-holding to autonomously driving tasks from specification to pull request.

Devin's merged pull requests have grown 7x in the last 2-3 months, while its commit percentage across Cognition repos jumped from 16% in January to 80% in March.

OpenInspect uses a 'brain out of the box' architecture, separating the agent's core logic from its execution environment for better security, at the cost of added complexity in state management.

Testing complex, end-to-end features that span front-end and back-end services remains a significant challenge for AI, often requiring orchestration of multiple models or human intervention.

While AI can generate code, human oversight is crucial to prevent the codebase from regressing to the level of the 'worst engineer' and to maintain architectural integrity and clear contracts between modules.

Key use cases for cloud agents include SRE tasks like first-responder alerts, PM-initiated bug fixes, and customer support issue resolution, with potential costs ranging from $1,000 to $5,000 per engineer.

The 'background agent' paradigm shift and Devin's rapid growth

The engineering world is increasingly embracing 'background agents' or 'cloud agents,' a trend significantly accelerated around December 2025. This shift, driven by advancements in models like Opus 4.5 and GPT-5.2, has enabled AI agents to move from requiring constant human guidance to autonomously driving tasks from a specification to a completed pull request with minimal friction. Walden Yan, co-founder and CPO of Cognition, highlights this transformation, noting that Devin's development saw a 7x increase in merged pull requests over a few months, alongside a dramatic rise in its commit percentage across Cognition repositories, soaring from 16% in January to 80% by March. This growth underscores the practical viability and increasing adoption of autonomous AI agents in software development workflows.

OpenInspect's architectural choices for robust cloud agents

Cole Murray, creator of OpenInspect, discusses the architectural decisions behind building cloud agents, particularly the choice between 'harness in the box' and 'out of the box.' OpenInspect adopts the 'out of the box' approach, running the agent's 'brain' in a control plane worker while the sandbox acts as the 'hands.' This separation is favored for security because it keeps sensitive secrets out of the potentially unpredictable environment where the agent's commands run. It adds complexity in state management but offers better control. Murray also stresses the importance of robust developer environment setups: he recommends Docker Compose for running supporting infrastructure, though not for the agent itself, and notes that a good local developer experience naturally translates to easier agent sandbox setup. OpenInspect includes hooks for pre-installing dependencies and snapshotting environments for rapid agent startup.
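The brain/hands split described above can be sketched in a few lines. This is an illustration under stated assumptions, not OpenInspect's code: the `ControlPlane` and `Sandbox` classes, the `GITHUB_TOKEN` value, and the leak check are hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """The 'hands': runs commands but is never given secrets."""
    log: list = field(default_factory=list)

    def run(self, command: str) -> str:
        self.log.append(command)
        return f"ran: {command}"


@dataclass
class ControlPlane:
    """The 'brain': holds secrets and session state outside the sandbox."""
    secrets: dict
    sandbox: Sandbox
    history: list = field(default_factory=list)

    def step(self, command: str) -> str:
        # Refuse to forward any command that would carry a secret value
        # into the sandbox; privileged calls would run here instead.
        if any(value in command for value in self.secrets.values()):
            raise ValueError("command would leak a secret into the sandbox")
        result = self.sandbox.run(command)
        self.history.append((command, result))  # state lives with the brain
        return result


plane = ControlPlane(secrets={"GITHUB_TOKEN": "s3cr3t"}, sandbox=Sandbox())
print(plane.step("pytest -q"))
```

The state-management cost shows up in `history`: because the session lives in the control plane, it has to be persisted and synchronized there rather than simply living on the sandbox's disk.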

The persistent challenge of AI-driven testing and complex problem-solving

Despite the advancements in AI's ability to generate code and perform basic computer use (like clicking buttons), complex testing remains a significant hurdle. The true challenge lies not in controlling the cursor, but in the AI's capacity for 'arbitrary testing' – reasoning through intricate scenarios that span front-end, back-end, and multiple services. To test a change effectively, an AI must orchestrate applications with the correct code versions, trigger features, and understand complex interdependencies, sometimes requiring administrative privileges or specific feature flag configurations. Yan explains that this comprehensive testing often necessitates orchestrating multiple frontier models together, as a single model may not be capable of handling the end-to-end task. Therefore, 'testing' in this context is a deep problem-solving challenge for AI, far exceeding simple computer interaction.

Memory and knowledge management for autonomous agents

A significant unsolved problem in agent development is effective memory and knowledge management. Cognition's journey with Devin included developing a system called 'Kip,' which aimed to auto-generate memories and learn over time without explicit user input. The goal was to have agents proactively ask for approval to remember information, building a knowledge base organically. However, both generation and retrieval of these memories pose challenges. Agents need to distinguish between common patterns and one-off requests, and to retrieve relevant information efficiently from potentially thousands of memories without overwhelming the context window. Devin has evolved to allow editing memories, but rebuilding them to feel more like a navigable file system managed by the agent itself is an ongoing exploration. Pruning stale memories and tracking when each one was recorded are crucial for agents to maintain long-term context and relevance.
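The retrieval side of the problem, picking a few relevant memories without flooding the context window, can be illustrated with a deliberately naive sketch. It is not how Devin works: `retrieve_memories`, the word-overlap score, and the character budget are assumptions chosen for brevity, and a real system would more likely use embeddings and token counts.

```python
def retrieve_memories(query, memories, max_chars=200):
    """Return the most relevant memories that fit within a size budget."""
    query_words = set(query.lower().split())

    def score(memory):
        # Count how many query words also appear in the memory.
        return len(query_words & set(memory.lower().split()))

    ranked = sorted(memories, key=score, reverse=True)
    selected, used = [], 0
    for memory in ranked:
        if score(memory) == 0:
            break  # remaining memories share no words with the query
        if used + len(memory) > max_chars:
            continue  # skip memories that would overflow the budget
        selected.append(memory)
        used += len(memory)
    return selected


memories = [
    "run database migrations with make migrate",
    "the staging deploy uses feature flags",
    "prefer typed dataclasses over tuples",
]
print(retrieve_memories("how do I run migrations", memories))
```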

Integrating agents into company ecosystems and common use cases

Beyond core code generation, the real value of cloud agents emerges when they are deeply integrated into a company's broader ecosystem. Common use cases include SRE tasks, where agents act as first responders to alerts, collecting context from logs and databases to generate immediate pull requests for fixes. For non-technical teams like PMs and marketing, agents enable code modifications directly via Slack, bypassing traditional engineering bottlenecks. Customer support also benefits, as agents can rapidly gather detailed context about reported issues, streamlining the debugging process. These integrations often require custom solutions beyond standard MCPs, involving webhooks and natural language interaction to ensure seamless communication and value realization.

The critical role of human oversight and architectural integrity

Despite the push towards agent autonomy, human oversight remains indispensable. A significant risk is 'codebase regression,' where AI agents, by replicating patterns from their training data or from less experienced engineers, can inadvertently cement 'sloppy' or inefficient coding practices. This necessitates a focus on code quality, scheduled cleanups, and strong architectural contracts between modules. While agents can now interact in more sophisticated ways, sometimes even pushing back on human instructions, the 'human in the loop' is vital for maintaining architectural integrity, managing complex decision-making, and ensuring the long-term maintainability and scalability of codebases. The ability to define strict boundaries and require human sign-off for cross-module changes is highlighted as a crucial responsibility for engineering leadership.
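A sign-off rule for cross-module changes could be enforced with a check like the following sketch. It assumes, purely for illustration, that each top-level directory is one module; `modules_touched` and `needs_human_signoff` are hypothetical helpers, not part of any tool mentioned in the episode.

```python
from pathlib import PurePosixPath


def modules_touched(changed_paths, depth=1):
    """Map changed file paths to the top-level modules they belong to."""
    return {"/".join(PurePosixPath(p).parts[:depth]) for p in changed_paths}


def needs_human_signoff(changed_paths):
    """Require sign-off when a change crosses a module boundary."""
    return len(modules_touched(changed_paths)) > 1


print(needs_human_signoff(["billing/api.py", "billing/models.py"]))
print(needs_human_signoff(["billing/api.py", "auth/session.py"]))
```

A check of this kind could run in CI against the list of files in an agent's pull request, leaving single-module changes on the fast path.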

Infrastructure: The unseen backbone of agent performance

The performance and reliability of AI agents are heavily reliant on the underlying infrastructure, an area where Cognition has invested significant effort. Early challenges with Devin included slow boot-up and shutdown times for virtual machines (VMs) not designed for repeated use. This led to the development of specialized infrastructure, including a custom file system format, 'block diff file storage,' which incrementally builds on previous states, drastically reducing VM spin-up and teardown times. This focus on 'agent infra' allows for deployment in diverse environments like VPCs and on-premises setups. Experiences with network file systems causing slow grep operations also highlighted the need for fine-grained infra optimization. The effort to provide a seamless agent experience means selling not just the agent, but also the robust infrastructure enabling it.
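The episode names 'block diff file storage' without describing the format, so the sketch below only illustrates the general idea of building each snapshot on blocks already stored. It is a generic content-addressed block store, not Cognition's implementation; `BlockStore`, `BLOCK_SIZE`, and the sample data are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; real block sizes would be far larger


def split_blocks(data: bytes):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


class BlockStore:
    """Content-addressed blocks shared across snapshots."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes

    def snapshot(self, data: bytes):
        """Store only unseen blocks; return (manifest, new_block_count)."""
        manifest, new_blocks = [], 0
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block
                new_blocks += 1
            manifest.append(digest)
        return manifest, new_blocks

    def restore(self, manifest):
        return b"".join(self.blocks[d] for d in manifest)


store = BlockStore()
base, written_base = store.snapshot(b"aaaabbbbcccc")
child, written_child = store.snapshot(b"aaaaXXXXcccc")
print(written_base, written_child)
```

The second snapshot writes a single block because the other two are unchanged, which is the property that makes repeated spin-up and teardown cheap.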

The future: Hybrid intelligence, multi-agent systems, and the evolving developer experience

The trajectory of AI agents points towards more sophisticated capabilities. The concept of 'hybrid intelligence,' blending powerful frontier models with efficient sub-frontier models, promises to optimize performance and cost. Multi-agent systems, where agents can collaborate or spawn sub-agents, are an area of ongoing research, though current practical applications often still rely on single, highly capable agents. A key development is the 'handoff' between background and foreground agents, enabling a more fluid developer experience with tools like Windsurf. This allows developers to seamlessly transition between automated tasks and local debugging or intervention. While AI's ability to write code is rapidly advancing, the emphasis is shifting towards how agents can be integrated, managed, and overseen to truly enhance productivity and maintain high-quality software development.
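The hybrid idea of pairing a cheap model with a frontier one can be sketched as a simple escalation rule. Everything here is an assumption for illustration: the `route` function, the confidence threshold, and the two stand-in models are hypothetical and do not call any real API.

```python
def route(task, cheap_model, frontier_model, min_confidence=0.8):
    """Try the cheaper model first and escalate when it is unsure."""
    answer, confidence = cheap_model(task)
    if confidence >= min_confidence:
        return answer, "sub-frontier"
    # Escalate: ask the more capable model, passing the draft as context.
    return frontier_model(task, draft=answer), "frontier"


# Stand-in models for illustration only; real ones would call an LLM API.
def cheap_model(task):
    confidence = 0.9 if len(task.split()) <= 3 else 0.4
    return f"quick answer to {task!r}", confidence


def frontier_model(task, draft):
    return f"careful answer to {task!r}"


print(route("rename variable", cheap_model, frontier_model))
print(route("redesign the billing service schema", cheap_model, frontier_model))
```

The design choice is where the escalation decision lives: here the cheap model reports its own confidence, but a separate classifier or a failed test run could serve as the trigger instead.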

Best Practices for Implementing and Using AI Coding Agents

Practical takeaways from this episode

Do This

Consider 'out-of-the-box' agent architectures for better security, despite complexity.

Utilize Docker for running infrastructure, but be cautious about using it for the agent itself.

Integrate agents into your company's existing ecosystem (databases, logs, knowledge bases) for maximum value.

Focus on building robust local development environments and mock servers for effective agent testing.

Implement scheduled cleanup and duplication checks to prevent codebase regression.

Ensure strict boundaries and clear contracts between different modules in your system.

Leverage AI to assist in migrating older codebases towards better local development practices.

Use agents for SRE use cases, including automated triage and incident response.

Consider agents for non-developer tasks like PMs prompting for bug fixes or customer support.

Develop well-defined roles for humans in overseeing AI operations, especially for deep infrastructure problems.

Avoid This

Do not assume Docker containers provide a true security boundary for agents.

Avoid giving agents direct access to sensitive production credentials for testing.

Do not solely rely on AI-generated code without auditing; be aware of potential codebase regression.

Do not underestimate the importance of human expertise in complex infrastructure challenges.

Avoid 'reward hacking' behaviors in AI coding, such as excessive use of `getattr` or untyped tuples.

Do not expect AI memory systems to be fully solved; approach them with caution.

Do not neglect the need for clear communication and defined protocols when transitioning between local and cloud agent environments.

Common Questions

What are background agents?

Background agents, also known as cloud agents, are AI systems designed to operate autonomously in the background. They can transform specifications into completed pull requests with minimal human intervention, shifting the paradigm from hand-holding models to leveraging their increased autonomy.

Topics

AI Agents, AI & Machine Learning, Technology & Innovation, Programming & Software, Enterprise AI, Software Development, Agent Architecture, Codebase Management, Autonomous Coding

Mentioned in this video

Software & Apps

Opus 4.5

A model that enabled a shift towards autonomous AI agents capable of driving tasks from specification to pull request with minimal human intervention.

Sonnet

A model version that represented a significant leap in intelligence, leading to the stripping out of unnecessary parts of Devin and enabling higher autonomy.

Open Inspect

An open-source project focused on creating cloud agents, inspired by client friction with existing tools like Cloud IDE and Slack-based interactions.

Claude

Mentioned as one of the models whose capabilities were tested, particularly in the context of early development of agents and comparing GPT and Claude.

GPT

Mentioned in comparison to Claude regarding early agent development and capabilities.

Daytona

A sandbox layer provider where money is being made in the agent ecosystem, mentioned as a potential contributor to Open Inspect.

E2B

A sandbox layer provider where money is being made in the agent ecosystem, mentioned as a potential contributor to Open Inspect.

EC2

Raw VMs from cloud providers like EC2 were used in the early stages of building Devin, leading to slow boot times and difficulty in bringing systems back up.

Docker Compose

A common tool used by teams for microservices, which can be a good solution for running infrastructure, though not ideal for running the agent itself.

Docker

Containers are discussed as a potential abstraction for models, but noted to not be a true security boundary and can lead to 'Docker in Docker' issues.

Cloud IDE

A tool that clients were using, revealing friction points that inspired the development of Open Inspect, particularly regarding session sharing.

Cursor

Their released recordings of AI agent testing were enlightening, and their codebase experiments with single to multi-agent approaches are discussed.

Deep Brookie

An AI system that Devin can call upon to make requests and return results, functioning like a tool call within a larger agent system.

Claude 4.6

Mentioned for preserving backwards compatibility at all costs, much as GPT models do, through odd import and export workarounds.

Git AI

The concept of storing agent prompts alongside code in git metadata for future reference by agents and review bots.

PostgreSQL

A local database setup recommended for AI agents to test code without needing to provide sensitive production credentials.

Little Snitch

A man-in-the-middle tool that shows all traffic, which can be used to reconstruct server behavior and create local mocks for testing.

Windsurf

A recent release that acts as a local command center for managing both background and local agents, facilitating handoffs between them and improving the testing process.

Modal

A highly regarded offering for cloud agent sandboxes, particularly its container offering, though it is Python-centric.

Python

The primary language for many libraries in the observability space, and often the default for AI-generated code, though JavaScript is gaining traction.

JavaScript

Appears to be winning in terms of workload shifts, with many new greenfield applications being built using it.

Mac OS

An operating system for which specific VM support might be needed for certain technologies or development tasks.

Java

Programming language used by Cognition, contrasting with the prevalence of Python and JavaScript in newer applications.

Windows

An operating system that an agent managed to rebuild by running for a long enough period, showcasing its capabilities.

Confluence

An internal knowledge base system that agents can integrate with to provide comprehensive context and support.

C++

Programming language used by Cognition, contrasting with the prevalence of Python and JavaScript in newer applications.

S3

Amazon Simple Storage Service, used in network file systems where data is cached, causing network calls during operations like 'grep' and slowing down performance.

Peon Ping

A tool that plays sound packs from popular games like Command and Conquer and Warcraft when an agent completes its work.

iOS

An operating system for which specific VM support might be needed for iOS development.

Firecracker

A virtualized instance used as the basis for VMs in the agent system, requiring nested virtualization for Android emulator support.

Android

Development for Android requires nested virtualization within machines, with performance issues that are still being addressed in beta.

Slack

A communication platform where agents can be integrated for various use cases, including customer support and prompting for code changes.

D-Piki

An AI system that Devin can call to make requests and return results, functioning as a tool call within the agent's workflow.

People

Walden Yan

Co-founder and CPO, credited as a coiner of 'context engineering', discussing the evolution of agent development and infrastructure.

Cole Murray

Creator of Open Inspect, discussing the rise of background agents, architectural decisions, and real-world use cases.

Wilson Lynn

Associated with experiments at Cursor on single-agent to multi-agent transitions.

Ryan Lopo

From OpenAI, discussed regarding 'slot cannon' approaches to agent development and the challenges of bottlenecking single agents.

Companies

Ramp

Their blog post was instrumental in inspiring Cole Murray to build Open Inspect, providing technical details on building agent systems.

Cognition

The company building Devin, focusing on helping enterprises learn, use, and adopt coding agents, acting as thought partners for customers.

GitHub

A platform that Devin agents interact with, raising questions about user permissions and separation between the system deciding actions and the secrets on the machine.

Sentry

A supported integration for Open Inspect, enabling autonomous responses to alerts and issues within a company's systems.

DataDog

An alerting system that can trigger autonomous responses from agents like Open Inspect.

OpenAI

Mentioned as the developer of Codex, a tool that clients were using, revealing friction that led to the creation of Open Inspect.

OpenClaw

Has a daily memory journal feature that functions as a file system, allowing for discovery of information and potential application of forgetting algorithms.

Cloudflare

Provides the control plane for the discussed agent infrastructure.

Concepts

Smart Friend

A hybrid approach to AI systems combining fast, efficient sub-frontier models with powerful frontier models for complex tasks.

On-prem

On-premises deployment, mentioned as an environment that Devin's infrastructure control allows it to support without relying on external providers.

SRE

Site Reliability Engineering use cases are the easiest and most common for cloud agents, involving automated triage and response to alerts.

Media

Command and Conquer

A popular game whose sound packs are used by the 'Peon Ping' tool to play sounds when an agent completes a task.

Locations

Fed GovCloud

Federal Government Cloud, mentioned as an environment that Devin's infrastructure control allows it to support without relying on external providers.

Books

Warcraft