Key Moments
Devin’s 80% Moment: Background Agents, 7x PRs, & End of Hand-Held Coding — Walden Yan & Cole Murray
AI coding agents now account for 80% of commits across Cognition's repos, but complex tasks like testing still require significant human expertise.
Key Insights
AI coding agents underwent a capability shift around December 2025 with models like Opus 4.5 and GPT-5.2, moving from hand-held assistance to autonomously driving tasks from specification to pull request.
Devin's merged pull requests have grown 7x in the last 2-3 months, while its commit percentage across Cognition repos jumped from 16% in January to 80% in March.
OpenInspect utilizes a 'brain out of the box' architecture, separating the agent's core logic from its execution environment for enhanced security and complexity management.
Testing complex, end-to-end features that span front-end and back-end services remains a significant challenge for AI, often requiring orchestration of multiple models or human intervention.
While AI can generate code, human oversight is crucial to keep the codebase from regressing to the level of its 'worst engineer' and to preserve architectural integrity and the contracts between modules.
Key use cases for cloud agents include SRE tasks like first-responder alerts, PM-initiated bug fixes, and customer support issue resolution, with potential costs ranging from $1,000 to $5,000 per engineer.
The 'background agent' paradigm shift and Devin's rapid growth
The engineering world is increasingly embracing 'background agents' or 'cloud agents,' a trend that accelerated sharply around December 2025. Driven by advances in models like Opus 4.5 and GPT-5.2, the shift has let AI agents move from requiring constant human guidance to autonomously driving tasks from a specification to a completed pull request with minimal friction. Walden Yan, co-founder and CPO of Cognition, highlights this transformation: Devin's merged pull requests grew 7x over a few months, while its share of commits across Cognition repositories rose from 16% in January to 80% by March. This growth underscores the practical viability and increasing adoption of autonomous AI agents in software development workflows.
OpenInspect's architectural choices for robust cloud agents
Cole Murray, creator of OpenInspect, discusses the architectural decisions behind building cloud agents, particularly the choice between running the agent harness 'in the box' (inside the sandbox) and 'out of the box' (outside it). OpenInspect adopts the 'out of the box' approach, running the agent's 'brain' in a control plane worker while the sandbox acts as the 'hands.' This separation is favored for security, as it keeps sensitive secrets out of the potentially unpredictable agent environment. It adds complexity in state management, but offers better control and security. Murray also notes the importance of robust developer environment setups, advocating Docker Compose for supporting infrastructure, if not for local development itself, and emphasizes that a good local developer experience naturally translates to easier agent sandbox setup. OpenInspect includes hooks for pre-installing dependencies and snapshotting environments for rapid agent startup.
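The 'brain out of the box' split can be illustrated with a minimal Python sketch. It is not OpenInspect's code: the class names `ControlPlane` and `Sandbox`, the `GIT_TOKEN` key, and the secret-filtering rule are all hypothetical, chosen only to show secrets staying on the control-plane side while the sandbox executes commands.

```python
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """The 'hands': executes commands and never receives secrets (hypothetical)."""
    log: list = field(default_factory=list)

    def run(self, command: str) -> str:
        # A real sandbox would execute the command; here it is only recorded.
        self.log.append(command)
        return f"ran: {command}"


@dataclass
class ControlPlane:
    """The 'brain': holds secrets and decides what the sandbox does (hypothetical)."""
    secrets: dict
    sandbox: Sandbox

    def step(self, command: str) -> str:
        # Refuse to forward any command that embeds a secret value.
        if any(value in command for value in self.secrets.values()):
            raise ValueError("refusing to send a secret into the sandbox")
        return self.sandbox.run(command)

    def push_branch(self, branch: str) -> str:
        # The credentialed step happens here, outside the sandbox.
        token = self.secrets["GIT_TOKEN"]
        return f"pushed {branch} (authenticated: {bool(token)})"
```

The cost of this design, as the episode notes, is state management: the control plane has to track what the sandbox has done, which the sketch reduces to a simple command log.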
The persistent challenge of AI-driven testing and complex problem-solving
Despite the advancements in AI's ability to generate code and perform basic computer use (like clicking buttons), complex testing remains a significant hurdle. The true challenge lies not in controlling the cursor, but in the AI's capacity for 'arbitrary testing' – reasoning through intricate scenarios that span front-end, back-end, and multiple services. To test a change effectively, an AI must orchestrate applications with the correct code versions, trigger features, and understand complex interdependencies, sometimes requiring administrative privileges or specific feature flag configurations. Yan explains that this comprehensive testing often necessitates orchestrating multiple frontier models together, as a single model may not be capable of handling the end-to-end task. Therefore, 'testing' in this context is a deep problem-solving challenge for AI, far exceeding simple computer interaction.
Memory and knowledge management for autonomous agents
A significant unsolved problem in agent development is effective memory and knowledge management. Cognition's journey with Devin included developing a system called 'Kip,' which aimed to auto-generate memories and learn over time without explicit user input. The goal was to have agents proactively ask for approval to remember information, building a knowledge base organically. However, both generation and retrieval of these memories pose challenges. Agents need to distinguish between common patterns and one-off requests, and to retrieve relevant information efficiently from potentially thousands of memories without overwhelming the context window. Devin has evolved to allow editing memories, but rebuilding them to feel more like a navigable file system managed by the agent itself is an ongoing exploration. Pruning stale memories and accounting for when each memory was formed are crucial for agents to maintain long-term context and relevance.
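The retrieval half of the problem, picking a few relevant memories without overflowing the context window, can be sketched as follows. This is an illustrative assumption, not how Devin works: the `Memory` class, the keyword-overlap `score`, and the word-count budget are hypothetical stand-ins for whatever ranking and token accounting a real system would use.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    uses: int = 0  # how often this memory has proven relevant (hypothetical signal)


def score(memory: Memory, query: str) -> int:
    """Count the words shared by the query and the memory text."""
    return len(set(query.lower().split()) & set(memory.text.lower().split()))


def retrieve(memories: list, query: str, budget_words: int) -> list:
    """Greedily pick the best-matching memories that fit the word budget."""
    ranked = sorted(memories, key=lambda m: (score(m, query), m.uses), reverse=True)
    picked, used = [], 0
    for m in ranked:
        cost = len(m.text.split())
        # Skip irrelevant memories and any that would exceed the budget.
        if score(m, query) == 0 or used + cost > budget_words:
            continue
        picked.append(m)
        used += cost
    return picked
```

Tie-breaking on `uses` is one simple way to favor recurring patterns over one-off requests, the distinction the episode says agents struggle to draw.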
Integrating agents into company ecosystems and common use cases
Beyond core code generation, the real value of cloud agents emerges when they are deeply integrated into a company's broader ecosystem. Common use cases include SRE tasks, where agents act as first responders to alerts, collecting context from logs and databases to generate immediate pull requests for fixes. For non-technical teams like PMs and marketing, agents enable code modifications directly via Slack, bypassing traditional engineering bottlenecks. Customer support also benefits, as agents can rapidly gather detailed context about reported issues, streamlining the debugging process. These integrations often require custom solutions beyond standard MCPs, involving webhooks and natural language interaction to ensure seamless communication and value realization.
The critical role of human oversight and architectural integrity
Despite the push towards agent autonomy, human oversight remains indispensable. A significant risk is 'codebase regression,' where AI agents, by replicating patterns from their training data or from less experienced engineers, can inadvertently cement 'sloppy' or inefficient coding practices. This necessitates a focus on code quality, scheduled cleanups, and strong architectural contracts between modules. While agents can now interact in more sophisticated ways, sometimes even pushing back on human instructions, the 'human in the loop' is vital for maintaining architectural integrity, managing complex decision-making, and ensuring the long-term maintainability and scalability of codebases. The ability to define strict boundaries and require human sign-off for cross-module changes is highlighted as a crucial responsibility for engineering leadership.
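One way to picture the sign-off rule for cross-module changes is a check over the files a pull request touches. The sketch below is a hypothetical policy, not something described in the episode: it assumes modules map to top-level directories and that any change spanning more than one module needs a human reviewer.

```python
from pathlib import PurePosixPath


def modules_touched(changed_files: list, depth: int = 1) -> set:
    """Map each changed path to its module, taken as the first `depth` directories."""
    modules = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Files that sit above the module level are grouped under "<root>".
        modules.add("/".join(parts[:depth]) if len(parts) > depth else "<root>")
    return modules


def needs_human_signoff(changed_files: list) -> bool:
    """Require a human reviewer when a change crosses a module boundary."""
    return len(modules_touched(changed_files)) > 1
```

For example, a change limited to `billing/api.py` and `billing/models.py` would pass without sign-off, while one that also edits `auth/tokens.py` would be flagged.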
Infrastructure: The unseen backbone of agent performance
The performance and reliability of AI agents are heavily reliant on the underlying infrastructure, an area where Cognition has invested significant effort. Early challenges with Devin included slow boot-up and shutdown times for virtual machines (VMs) not designed for repeated use. This led to the development of specialized infrastructure, including a custom file system format, 'block diff file storage,' which incrementally builds on previous states, drastically reducing VM spin-up and teardown times. This focus on 'agent infra' allows for deployment in diverse environments like VPCs and on-premises setups. Experiences with network file systems causing slow grep operations also highlighted the need for fine-grained infra optimization. The effort to provide a seamless agent experience means selling not just the agent, but also the robust infrastructure enabling it.
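The idea of storage that 'incrementally builds on previous states' can be illustrated with a toy block-level snapshot chain. This is a sketch of the general technique only, not Cognition's format: the `Snapshot` class, the 4-byte `BLOCK` size, and the in-memory representation are assumptions made for readability.

```python
BLOCK = 4  # unrealistically small block size, chosen for illustration


def split(data: bytes) -> list:
    """Cut data into fixed-size blocks."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


class Snapshot:
    """Stores only the blocks that differ from the parent snapshot."""

    def __init__(self, data: bytes, parent: "Snapshot | None" = None):
        self.parent = parent
        blocks = split(data)
        self.length = len(blocks)
        base = parent.read_blocks() if parent else []
        self.delta = {
            i: b for i, b in enumerate(blocks)
            if i >= len(base) or base[i] != b
        }

    def read_blocks(self) -> list:
        """Rebuild the full block list by layering this delta on the parent."""
        blocks = self.parent.read_blocks() if self.parent else []
        blocks = blocks[:self.length] + [b""] * (self.length - len(blocks))
        for i, b in self.delta.items():
            blocks[i] = b
        return blocks

    def read(self) -> bytes:
        return b"".join(self.read_blocks())
```

Saving a new state writes only the changed blocks, which is the property that makes repeated spin-up and teardown cheap; the trade-off in this toy version is that reads walk the whole parent chain.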
The future: Hybrid intelligence, multi-agent systems, and the evolving developer experience
The trajectory of AI agents points towards more sophisticated capabilities. The concept of 'hybrid intelligence,' blending powerful frontier models with efficient sub-frontier models, promises to optimize performance and cost. Multi-agent systems, where agents can collaborate or spawn sub-agents, are an area of ongoing research, though current practical applications often still rely on single, highly capable agents. A key development is the 'handoff' between background and foreground agents, enabling a more fluid developer experience with tools like Windsurf. This allows developers to seamlessly transition between automated tasks and local debugging or intervention. While AI's ability to write code is rapidly advancing, the emphasis is shifting towards how agents can be integrated, managed, and overseen to truly enhance productivity and maintain high-quality software development.
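A common way to combine a sub-frontier and a frontier model is to try the cheaper one first and escalate only when its output fails a check. The sketch below assumes that pattern; the episode does not specify how the blend is done, and `run_hybrid`, its callables, and the acceptance check are hypothetical.

```python
from typing import Callable, Tuple


def run_hybrid(
    task: str,
    cheap: Callable[[str], str],
    frontier: Callable[[str], str],
    accept: Callable[[str], bool],
) -> Tuple[str, str]:
    """Return (model_used, output), escalating only when the cheap draft is rejected."""
    draft = cheap(task)
    if accept(draft):
        return "cheap", draft
    # The cheap draft failed the check, so pay for the stronger model.
    return "frontier", frontier(task)
```

In practice `accept` might run a test suite or a linter against the draft; the quality of that check determines how much cost the cheaper model actually saves.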
Common Questions
Background agents, also known as cloud agents, are AI systems designed to operate autonomously in the background. They can transform specifications into completed pull requests with minimal human intervention, shifting the paradigm from hand-holding models to leveraging their increased autonomy.
Mentioned in this video
A model that enabled a shift towards autonomous AI agents capable of driving tasks from specification to pull request with minimal human intervention.
A model version that represented a significant leap in intelligence, leading to the stripping out of unnecessary parts of Devin and enabling higher autonomy.
An open-source project focused on creating cloud agents, inspired by client friction with existing tools like Cloud IDE and Slack-based interactions.
Mentioned as one of the models whose capabilities were tested, particularly in the context of early development of agents and comparing GPT and Claude.
Mentioned in comparison to Claude regarding early agent development and capabilities.
A sandbox layer provider where money is being made in the agent ecosystem, mentioned as a potential contributor to OpenInspect.
A sandbox layer provider where money is being made in the agent ecosystem, mentioned as a potential contributor to OpenInspect.
Raw VMs from cloud providers like EC2 were used in the early stages of building Devin, leading to slow boot times and difficulty in bringing systems back up.
A common tool used by teams for microservices, which can be a good solution for running infrastructure, though not ideal for running the agent itself.
Containers are discussed as a potential abstraction for models, but noted to not be a true security boundary and can lead to 'Docker in Docker' issues.
A tool that clients were using, revealing friction points that inspired the development of OpenInspect, particularly regarding session sharing.
Their released recordings of AI agent testing were enlightening, and their codebase experiments with single to multi-agent approaches are discussed.
An AI system that Devin can call upon to make requests and return results, functioning like a tool call within a larger agent system.
Mentioned for preserving backwards compatibility at all costs, like other GPT models, through odd import and export patterns.
The concept of storing agent prompts alongside code in git metadata for future reference by agents and review bots.
A local database setup recommended for AI agents to test code without needing to provide sensitive production credentials.
A man-in-the-middle tool that shows all traffic, which can be used to reconstruct server behavior and create local mocks for testing.
A recent release that acts as a local command center for managing both background and local agents, facilitating handoffs between them and improving the testing process.
A highly regarded offering for cloud agent sandboxes, particularly its container offering, though it is Python-centric.
The primary language for many libraries in the observability space, and often the default for AI-generated code, though JavaScript is gaining traction.
Appears to be winning in terms of workload shifts, with many new greenfield applications being built using it.
An operating system for which specific VM support might be needed for certain technologies or development tasks.
Programming language used by Cognition, contrasting with the prevalence of Python and JavaScript in newer applications.
An operating system that an agent managed to rebuild by running for a long enough period, showcasing its capabilities.
An internal knowledge base system that agents can integrate with to provide comprehensive context and support.
Programming language used by Cognition, contrasting with the prevalence of Python and JavaScript in newer applications.
Amazon Simple Storage Service, used in network file systems where data is cached, causing network calls during operations like 'grep' and slowing down performance.
A tool that plays sound packs from popular games like Command and Conquer and Warcraft when an agent completes its work.
An operating system for which specific VM support might be needed for iOS development.
A virtualized instance used as the basis for VMs in the agent system, requiring nested virtualization for Android emulator support.
Development for Android requires nested virtualization within machines, with performance issues that are still being addressed in beta.
A communication platform where agents can be integrated for various use cases, including customer support and prompting for code changes.
An AI system that Devin can call to make requests and return results, functioning as a tool call within the agent's workflow.
Co-founder and CPO, credited as a coiner of 'context engineering', discussing the evolution of agent development and infrastructure.
Creator of OpenInspect, discussing the rise of background agents, architectural decisions, and real-world use cases.
Associated with experiments at Cursor on single-agent to multi-agent transitions.
From OpenAI, discussed regarding 'slot cannon' approaches to agent development and the challenges of bottlenecking single agents.
Their blog post was instrumental in inspiring Cole Murray to build OpenInspect, providing technical details on building agent systems.
The company building Devin, focusing on helping enterprises learn, use, and adopt coding agents, acting as thought partners for customers.
A platform that Devin agents interact with, raising questions about user permissions and separation between the system deciding actions and the secrets on the machine.
A supported integration for OpenInspect, enabling autonomous responses to alerts and issues within a company's systems.
An alerting system that can trigger autonomous responses from agents like OpenInspect.
Mentioned as the developer of Codex, a tool that clients were using, revealing friction that led to the creation of OpenInspect.
Has a daily memory journal feature that functions as a file system, allowing for discovery of information and potential application of forgetting algorithms.
Provides the control plane for the discussed agent infrastructure.
A hybrid approach to AI systems combining fast, efficient sub-frontier models with powerful frontier models for complex tasks.
On-premises deployment, mentioned as an environment that Devin's infrastructure control allows it to support without relying on external providers.
Site Reliability Engineering use cases are the easiest and most common for cloud agents, involving automated triage and response to alerts.