AI Dev 26 x SF | Diamond Bishop: The Next 100 Agents. Building the Agent Native Office
Key Moments
Scaling AI agents from one to hundreds requires companies to prioritize agent-native interfaces and robust evaluation frameworks, not just advanced models, to avoid production failures and ensure long-term viability.
Key Insights
By 2026, Datadog observed a significant shift from demo agents to production-ready agents saving real time and effort across companies.
Companies should adopt an 'agent-native' interface mandate, similar to the 'Bezos API mandate,' ensuring all team functionalities are accessible via agents, not just human users.
A key mistake early in agent development was the lack of strong evaluation frameworks, leading to prolonged debugging and difficulty assessing true improvement.
Framework adoption for agent development has doubled in the past year as companies focus on productionizing agents, with options like OpenAI's Agents SDK, LangGraph, and Pydantic AI becoming popular.
The future of enterprise AI agents will heavily feature 'learning on the job' through reinforcement learning and human observation, necessitating robust data logging for continuous improvement.
Future agent capabilities will include longer horizons for task execution (up to 12+ hours), advanced multimodal interaction (computer vision, voice), and generative UI for on-the-fly customization.
The shift from single agents to 'agent offices' necessitates new infrastructure and strategies.
The journey of building AI agents is evolving from creating one or two for specific tasks to deploying hundreds across an organization. Diamond Bishop from Datadog highlights this transition, emphasizing that scaling AI agents to power a 'next-gen enterprise' requires moving beyond individual agent development to building platforms that support diverse agent workloads safely and efficiently. This evolution is fueled by advancements in AI models that are becoming more powerful and accessible. However, the real challenge lies not in the intelligence of the models themselves, but in the infrastructure and methodologies required to manage and deploy these agents at scale. As more companies move from impressive demos to production agents that deliver tangible time and effort savings, the focus shifts to building 'agent offices' capable of handling this complexity.
Early agent development focused on core functionalities like SRE and code generation.
Datadog’s initial foray into AI agents focused on automating critical functions previously handled by human teams. This included an 'automated AI SRE' agent designed to debug problems automatically, inspired by the increasing complexity of codebases, especially those generated partly by AI. Complementing this, the 'Bits AI dev' agent was built to write and develop code based on identified errors. A third key agent developed was the 'security analyst' agent, which investigates suspicious signals and automates initial responses to potential security issues, mirroring the investigative process of human analysts. These early agents demonstrated the potential for AI to take on complex, time-consuming tasks within IT operations, development, and security.
Empowering agent-native interfaces and proactive operations is crucial for adoption.
A significant lesson learned is the need for an 'agent-native' approach to user experience (UX). Traditionally, UX design focuses on human users, but in an agent-driven future, agents themselves become first-class users of applications and APIs. Bishop advocates an equivalent of the 'Bezos API mandate' for agents: every team's functionality should be reachable through agent-friendly interfaces, whether MCP servers, APIs, or skills. This means actively designing for non-browser interactions, not merely supporting them. Agents should also operate proactively rather than reactively. Instead of waiting for commands, they should run in the background, triggered by events, much as human employees operate in a business. This proactive stance requires durable infrastructure, such as Temporal, to handle failures and keep work running. Chat interfaces, while useful, should not be the primary mode of interaction; event-driven triggers are more efficient for background agents.
Robust and continuous evaluation is critical to agent effectiveness.
One of the most cited mistakes in early agent development was the lack of a strong evaluation framework. Without rigorous evaluation, it is difficult to tell whether an agent is actually improving or whether added tools and tweaks help. Bishop stresses the importance of a multi-stage evaluation process:
1. **Offline Eval:** Using representative, measurable, and rerunnable datasets to test base performance.
2. **Online Data:** Incorporating observability data, clicks, and user interactions to understand performance in the wild.
3. **Living Evals:** Continuously feeding real-world data back into offline datasets to account for drift and evolving usage patterns.
This continuous feedback loop is essential for maintaining agent efficacy over time. The process can even be aided by agents designed to evaluate other agents, creating a 'who watches the watchmen' scenario for automated improvement.
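The three evaluation stages can be sketched as follows. This is a simplified illustration under stated assumptions: `Case`, `run_offline_eval`, `fold_in_online_cases`, and `toy_agent` are hypothetical names, and exact string matching stands in for whatever scoring a real evaluation would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str


def run_offline_eval(agent: Callable[[str], str], dataset: List[Case]) -> float:
    """Offline eval: fraction of cases where the agent's answer matches exactly."""
    if not dataset:
        return 0.0
    passed = sum(1 for case in dataset if agent(case.prompt) == case.expected)
    return passed / len(dataset)


def fold_in_online_cases(dataset: List[Case], online: List[Case]) -> List[Case]:
    """Living eval: add production cases that the dataset does not already hold."""
    merged = list(dataset)
    seen = set(dataset)
    for case in online:
        if case not in seen:
            merged.append(case)
            seen.add(case)
    return merged


def toy_agent(prompt: str) -> str:
    """Stand-in agent that only knows two answers."""
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


offline = [Case("2+2", "4"), Case("capital of France", "Paris")]
baseline = run_offline_eval(toy_agent, offline)

# Cases collected from production traffic (online data), one of them new.
online = [Case("2+2", "4"), Case("3*3", "9")]
living = fold_in_online_cases(offline, online)
refreshed = run_offline_eval(toy_agent, living)

print(baseline, len(living), round(refreshed, 2))
```

The drop in score after folding in production cases is the point: a static dataset kept reporting a perfect result while real usage had already moved on.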
Embracing framework and model agnosticism accelerates adaptation.
Given the rapid pace of model development, companies should adopt a strategy of being framework and model agnostic. The 'bitter lesson' suggests that general methods leveraging new off-the-shelf models will prevail. This means building agents with flexible tools and functions, and being prepared to swap out underlying models as better ones become available. Frameworks like OpenAI's Agents SDK, LangGraph, and Pydantic AI can provide useful building blocks, but organizations should avoid being locked into a single one. Multi-model support is also key, as different models excel at different tasks. Companies should be able to test and switch models for various use cases without significant re-engineering. Maintaining memory and context across model updates is crucial for retaining learned improvements and customer insights.
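One way to keep an agent model agnostic is to code it against a narrow interface and hold memory outside the model. The sketch below assumes a hypothetical `ChatModel` protocol with a single `complete` method; `EchoModel` and `UpperModel` are toy stand-ins for real providers, not actual SDK classes.

```python
from typing import List, Optional, Protocol


class ChatModel(Protocol):
    """Hypothetical minimal interface every model backend must satisfy."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UpperModel:
    def complete(self, prompt: str) -> str:
        return prompt.upper()


class Agent:
    """Keeps memory on the agent so it survives a model swap."""

    def __init__(self, model: ChatModel, memory: Optional[List[str]] = None) -> None:
        self.model = model
        self.memory: List[str] = list(memory or [])

    def ask(self, prompt: str) -> str:
        # Earlier prompts are passed along as context for the current one.
        context = " | ".join(self.memory + [prompt])
        answer = self.model.complete(context)
        self.memory.append(prompt)
        return answer

    def swap_model(self, model: ChatModel) -> None:
        self.model = model


agent = Agent(EchoModel())
first = agent.ask("hello")

agent.swap_model(UpperModel())      # new backend, same memory
second = agent.ask("status")

print(first)
print(second)
```

Because the agent depends only on `complete`, trying a different model for a given use case is a one-line change, and the accumulated context carries over.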
Multiplayer capabilities and agent-to-agent communication are the next frontier.
The concept of 'multiplayer' is expanding beyond human-to-human collaboration to include agent-to-agent and human-agent collaboration. This involves not just shared repositories but transparency into agent skills, tools, and actions, fostering learning and remixing of agent capabilities. A 'tools hub' or 'skills hub' can facilitate this. Human-agent collaboration goes beyond simple human-in-the-loop feedback; it includes agents sharing their actions and explanations with humans, and humans demonstrating tasks to agents, potentially leading to new RPA-like paradigms. Secure agent-to-agent communication is also vital, often managed within an enclave or cluster with restricted network access to prevent unauthorized interactions and ensure safety.
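Restricted agent-to-agent communication can be pictured as an allow-list with an audit trail. This is a toy model only: `Enclave`, the agent names, and the link rules are hypothetical, and a real deployment would enforce such rules at the network layer, not in application code.

```python
from typing import List, Set, Tuple


class Enclave:
    """Permits messages only along explicitly allowed (sender, receiver) links."""

    def __init__(self, allowed_links: Set[Tuple[str, str]]) -> None:
        self.allowed = set(allowed_links)
        # Every attempt is recorded so humans can inspect what agents tried to do.
        self.audit_log: List[Tuple[str, str, str, bool]] = []

    def send(self, sender: str, receiver: str, message: str) -> bool:
        permitted = (sender, receiver) in self.allowed
        self.audit_log.append((sender, receiver, message, permitted))
        return permitted


enclave = Enclave({("sre-agent", "dev-agent")})

ok = enclave.send("sre-agent", "dev-agent", "null pointer in checkout handler")
blocked = enclave.send("dev-agent", "security-agent", "please widen my permissions")

print(ok, blocked, len(enclave.audit_log))
```

Links are directional, so allowing the SRE agent to message the dev agent does not allow the reverse. The audit log also serves the transparency goal: denied attempts stay visible.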
Future predictions include learning on the job, synthetic environments, and multimodal interaction.
Looking ahead, expect a surge in agents that can 'learn on the job' through reinforcement learning and human observation, requiring companies to log data for continuous feedback. Synthetic environments will allow for product-specific world modeling, enabling agents to interact with virtual versions of services and simulated users. Durable agents capable of long-horizon tasks (12+ hours) will become more common, though managing token costs will be a concern. Authorization for agents acting on behalf of users (for example, Auth0-style OAuth flows) is a critical yet underdeveloped area. Multimodal capabilities, including direct computer interaction with applications and voice-based real-time communication, are on the horizon, promising higher-bandwidth interactions. Finally, generative UI will enable dynamic, on-the-fly creation of custom user interfaces for dashboards and services.
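The logging that 'learning on the job' depends on can start very simply. The sketch below assumes a hypothetical `Step` record and a JSON Lines log; the field names and the idea of filtering on a `reward` value are illustrative, not a format the talk prescribes.

```python
import io
import json
from dataclasses import asdict, dataclass
from typing import IO, Iterable, List, Optional


@dataclass
class Step:
    agent: str
    action: str
    observation: str
    reward: Optional[float] = None   # filled in later by human or automated feedback


def log_steps(stream: IO[str], steps: Iterable[Step]) -> None:
    """Append each step as one JSON object per line."""
    for step in steps:
        stream.write(json.dumps(asdict(step)) + "\n")


def load_rewarded(stream: IO[str]) -> List[Step]:
    """Read the log back and keep only steps that received feedback."""
    steps = [Step(**json.loads(line)) for line in stream if line.strip()]
    return [s for s in steps if s.reward is not None]


buffer = io.StringIO()               # stands in for a log file
log_steps(buffer, [
    Step("sre-agent", "query_logs", "found 500s in checkout", reward=1.0),
    Step("sre-agent", "restart_service", "no change"),
])
buffer.seek(0)
training_candidates = load_rewarded(buffer)

print(len(training_candidates), training_candidates[0].action)
```

Only steps with feedback attached are useful as a learning signal, so the log records everything and the filter runs at read time.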
Common Questions
Datadog focuses on observability and on helping companies scale their AI agents. It is developing AI agents for its own products (such as SRE, Dev, and Security) and also aims to help other companies build and manage their own custom AI agent fleets effectively.
Mentioned in this video
An AI assistant developed by Microsoft, previously used with Windows Phone.
A startup mentioned in the context of reinforcement learning for enterprise.
An AI agent developed by Datadog that acts as an automated SRE to debug problems.
An early experiment by the speaker to provide tools for agents to communicate with each other.
An AI agent developed by Datadog focused on writing and developing code based on identified errors.
A platform mentioned in relation to Steve Yegge, advised against for production use.
An AI agent developed by Datadog that investigates suspicious security signals.
A framework mentioned as a good option for agent development.
A communication platform where the speaker interacts with coworkers, compared to agent communication.
A technology company where the speaker previously worked on Cortana.
Mentioned as a provider of agent frameworks.
The speaker's company, specializing in observability for SaaS applications.
Mentioned as a provider of agent frameworks and models.
A company mentioned alongside Tinker in the context of enterprise RL.
A company whose tools are used by Datadog to ensure agent durability and problem resolution.
Mentioned as a past trend for 'X for Figma' pitches, compared to current agent collaboration trends.
A company associated with the startup Tinker, in the space of enterprise RL.
Mentioned in relation to a classic approach to OAuth and agent permissions.