How do MCP servers address tool call failures in AI agents?

MCP servers help by providing a centralized and consistent way for agents to access and use tools. This reduces the burden on individual agent teams to handle tool call failures and ensures a more reliable interaction with backend services.

What is the eval philosophy for MCP servers?

The eval philosophy for MCP servers is agent and tool agnostic. They focus on the final result rather than specific tool calls or agent behavior, acknowledging the creativity and resourcefulness of agents and aiming for ergonomic usability for any agent.

How can eval scenarios for MCP servers be generated more efficiently?

Instead of manual creation, eval scenarios can be automatically generated. This process involves using an LLM or coding agent with product documentation to create seed queries, which are then converted into natural language question-answer pairs for labeled eval scenarios.

How are evaluation results analyzed and used for optimization?

Evaluation results can be visualized on a dashboard and analyzed for failure patterns. By tagging capabilities and using tools like LLM Observability, developers can identify weak areas, and coding agents can help optimize the underlying code, creating a self-optimization loop.

What are the advantages of using an MCP server for agent builders?

Agent builders benefit from a centralized team handling tool call failures and improvements. This reduces their workload, and as the MCP server updates, all agents using it improve, leading to a better overall agentic experience without individual teams needing to manage tool complexity.

Can MCP servers offer a more flexible API interface for backends?

Yes, MCP servers allow for a more flexible interface compared to traditional APIs. They can incorporate features like spelling correction, pre-call checks, or dynamically choosing different backend APIs (e.g., clustering vs. list queries) based on the agent's needs.

Key Moments

AI Dev 25 x NYC | Scott Yak: Building MCP Servers That Make Agents More Effective

DeepLearning.AI

Education3 min read28 min video

Dec 5, 2025|732 views|9|1

Save to Pod

Key Moments

TL;DR

MCP servers centralize agent tools, improving effectiveness and simplifying evaluation.

Key Insights

Consolidating agent tools into an MCP server simplifies development and enhances agent capabilities.

MCP servers act as a product, providing direct value to customers and third-party agents.

Agent-agnostic and tool-agnostic evaluation strategies are crucial for MCP servers due to their diverse user base.

Automated generation of evaluation scenarios significantly reduces development time and effort.

Integrating evaluations into the development cycle, visualized through LLM observability, makes the process more efficient and enjoyable.

MCP servers enable a self-optimization loop where evaluations inform code improvements, leading to better tool performance.

THE STRATEGIC ADVANTAGE OF MCP SERVERS

The core message emphasizes that consolidating agent tools into a Managed Connectable Platform (MCP) server can transform the evaluation process, making it a source of joy rather than a pain. Scott Yak from Datadog explains that Datadog, an observability platform, uses these servers to empower agents, moving beyond mere data visualization to actionable insights. By centralizing tools, MCP servers reduce duplicated effort for agent teams, streamline tool usage, and allow for remote accessibility, effectively turning tools into a product that directly benefits customers and third-party agents like Cursor and Cloud Code.

ARCHITECTING THE AGENTIC WORKFLOW

The typical agent workflow involves multiple steps, starting with the agent managing its context window using system prompts and tool descriptions obtained from the MCP server. User requests are added, and an LLM decides the next action, potentially calling an MCP server tool. The MCP server processes this request, interacts with back-end services, performs business logic like filtering or post-processing, and returns a response to the agent. This response is added to the agent's context, allowing for iterative loops until the task is complete, ultimately providing a result to the user. This structured interaction ensures agents can leverage backend capabilities effectively.

EVALUATION PHILOSOPHY FOR MCP SERVERS

MCP server developers face a unique evaluation challenge because they control little about the agents using their services. Therefore, Datadog adopts an agent-agnostic and tool-agnostic evaluation philosophy. This approach focuses on the final outcome rather than specific tool calls or agent behaviors. By not optimizing for any particular agent, the MCP server remains flexible and ergonomic for all users, including simpler agents. This strategy allows for the use of faster, cheaper evaluation methods, making the evaluation process itself more efficient and less burdensome.

THE POWER OF AUTOMATED EVALUATION SCENARIO GENERATION

Manually creating comprehensive evaluation scenarios for MCP servers is a daunting task, potentially requiring thousands of scenarios to cover all functionalities and edge cases across different products like logs, metrics, and traces. The video highlights a more efficient method: generating these scenarios automatically. This involves taking a natural language question, converting it into a structured query language (like a CQ query), and then transforming that back into a natural language question for which the answer is known. This process, aided by coding agents and product documentation, can yield hundreds of labeled eval scenarios from a single documentation page, vastly accelerating the evaluation setup.

VISUALIZING AND OPTIMIZING WITH OBSERVABILITY

Once evaluation scenarios are generated and run, visualizing the results is key. Datadog uses LLM observability to display evaluation outcomes on a dashboard, showing which tool calls succeeded or failed. This data is crucial for debugging and improvement. By analyzing failure patterns, developers can identify areas needing optimization, whether it's in prompts, tool descriptions, or deeper backend logic. The ability to group these evaluations by capability further highlights areas of weakness, guiding development efforts towards enhancing specific functionalities.

THE SELF-OPTIMIZATION LOOP AND BENEFITS

The integration of MCP servers, automated evaluations, and LLM observability creates a powerful self-optimization loop. Developers can analyze evaluation failures, use coding agents to suggest and implement code improvements in tool descriptions or backend logic, and then re-run evaluations to confirm the fix. This iterative process, often taking only a few minutes from code change to evaluation result, makes development cyclical and enjoyable. For agent builders, this means fewer tool call failures to manage, and for the MCP server team, it provides motivation for continuous improvement, ultimately leading to a better experience for all users.

Mentioned in This Episode

●Software & Apps

●Tools

●Companies

●Concepts

Common Questions

An MCP server consolidates an agent's tools into a single, remote server. This simplifies tool management, reduces duplication of effort, and allows tools to serve multiple agents, turning them into a product that offers a better user experience.

Topics

MCP Servers Evaluation Frameworks Tool Integration LLM Observability Automated Testing Agent Development DataDog Self-Optimization

Mentioned in this video

Software & Apps

Large Language Model

Referred to as 'LM' and 'LLM', these models are used by agents to decide on actions, including making tool calls. They are also utilized in the process of generating eval scenarios and analyzing results.

LLM Observability

A DataDog product used for instrumenting and visualizing evaluation results. It helps in analyzing failure patterns, identifying areas for improvement, and can be accessed through the MCP server.

Cloud Code

Concepts

evals

Short for evaluations, used to assess agent performance and identify failure modes like hallucination and output formatting issues. The speaker aims to make evals a source of joy rather than a pain by using MCP servers.

Time travel

A feature available in the server that allows passing a specific timestamp to simulate a point in time, making eval scenarios unambiguous and repeatable regardless of when they are run.

Search logs

A tool within the DataDog MCP server that allows agents to search through logs. It's used as an example in demonstrating the agent-MCP server interaction and in creating eval scenarios.

Companies

DataDog