AI Dev 25 x NYC | Ori Goshen: Reliability Is the Bottleneck for Agents

DeepLearning.AI · Dec 5, 2025 · 29 min video

TL;DR

AI agents struggle with reliability in mission-critical enterprise tasks; AI21's Maestro is presented as an orchestration technology that addresses this gap.

Key Insights

1. Current AI agents face significant adoption barriers in enterprises due to reliability issues, especially in mission-critical, multi-step workflows.

2. The 'prompt and pray' approach, relying solely on LLM orchestration, leads to inconsistent accuracy and compounded errors, making it unsuitable for production.

3. Manually coded static workflows offer control but are rigid and use-case specific, requiring significant development and optimization.

4. AI21's Maestro aims to bridge the gap by providing an agent orchestration technology focused on control and high accuracy through structured planning and dynamic validation.

5. Maestro lets users define requirements as policies or constraints, from which it generates validators and fixers to ensure adherence, combined with computational budget controls.

6. The system dynamically creates structured, deterministic plans, ranks alternative courses of action for each step by probability of success, and validates outputs to mitigate compounding errors.

THE ENTERPRISE ADOPTION WALL FOR AI AGENTS

While consumer AI adoption is widespread, enterprise AI, particularly agentic systems, faces significant adoption hurdles. The core challenge lies in applying AI to mission-critical workflows with high value but also high cost of error, such as financial underwriting or compliance reviews. Existing generative AI tools are adept at tasks like data entry or marketing content creation, but their unreliability makes them unsuitable for these more demanding applications. This discrepancy highlights a fundamental issue preventing AI from penetrating deeper into enterprise operations.

THE FUNDAMENTAL PROBLEM: RELIABILITY AND COMPOUNDING ERRORS

The primary bottleneck for serious AI agents in enterprises is accuracy and reliability. Large Language Models (LLMs), being probabilistic in nature, often make mistakes, ignore instructions, or act inconsistently. This becomes even more problematic in multi-step tasks, where errors from one step compound, leading to a significant drop in overall accuracy. This effect makes it extremely difficult for AI systems to reliably complete complex workflows, illustrating why many AI projects fail to reach production.
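The compounding effect can be made concrete with simple arithmetic: if each step succeeds independently with probability p, an n-step workflow succeeds with probability p^n. The figures below are illustrative, not from the talk:

```python
def workflow_success_rate(per_step_accuracy: float, num_steps: int) -> float:
    """End-to-end success rate of a workflow whose steps succeed
    independently with the given per-step accuracy."""
    return per_step_accuracy ** num_steps

# Even a strong 95%-accurate step degrades quickly over many steps:
print(workflow_success_rate(0.95, 10))  # ≈ 0.60
print(workflow_success_rate(0.95, 20))  # ≈ 0.36
```

This is why a model that looks impressive on single-turn tasks can still fail most of the time on a twenty-step workflow.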

EXISTING APPROACHES AND THEIR LIMITATIONS

Current methods for building AI agents often fall into two camps, both with limitations. The 'prompt and pray' approach involves an LLM controlling the agent's actions, offering high automation but low control and unreliable outcomes, making it suitable only for demos. Conversely, manually building static, coded workflows provides more precision and control by codifying the process and calling LLMs at specific steps. However, this approach is rigid, use-case specific, and requires substantial development and optimization, trapping builders in a trade-off between automation and accuracy.
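A minimal sketch of the static-workflow camp illustrates the trade-off: control is high because every step and decision rule is hand-coded, but the pipeline only handles the one use case it was written for. The function and field names here (`underwriting_workflow`, `income`) are hypothetical examples, not part of any real system:

```python
def underwriting_workflow(application: dict, llm) -> dict:
    """Hand-coded static workflow: the LLM is called only at one
    fixed, narrow step; everything else is deterministic code."""
    # Step 1: deterministic validation in code, not delegated to a model.
    if application.get("income") is None:
        raise ValueError("missing income field")
    # Step 2: LLM used only for a bounded subtask (summarization).
    summary = llm(f"Summarize risk factors: {application}")
    # Step 3: deterministic decision rule, hard-coded by the developer.
    decision = "review" if application["income"] < 50_000 else "approve"
    return {"summary": summary, "decision": decision}
```

Reliability comes from the fixed structure, but any change to the process means rewriting the code, which is the rigidity the talk describes.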

INTRODUCING MAESTRO: ORCHESTRATION FOR CONTROL AND ACCURACY

AI21's Maestro is presented as a solution to overcome the trade-offs in current agent development. It's an agent orchestration technology designed to build agents that can automate complex enterprise tasks with an emphasis on control and high accuracy. Maestro is model-agnostic, allowing integration with various LLMs, and can incorporate any tool, whether first-party or third-party, through API specifications. This flexibility allows for the creation of robust agents tailored to specific enterprise needs.

MAESTRO'S MECHANISMS FOR ENSURING RELIABILITY

Maestro addresses reliability by dynamically creating structured, deterministic plans for tasks, rather than relying on free-form natural language prompts for execution. It identifies dependencies between steps and implements checkpoints. For each step, Maestro ranks alternative courses of action by their probability of success and, based on a defined computational budget, chooses how many to execute at inference time. This approach, combined with a validation mechanism that selects the best results from various attempts, significantly reduces the compounding error effect.
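The ranking-plus-budget idea can be sketched as a budgeted best-of-n loop: try candidate actions in rank order, validate each output, and stop when one passes or the budget is exhausted. This is an illustrative interpretation, not AI21's actual API; all names are hypothetical:

```python
def run_step(candidates, validate, budget: int):
    """Try candidate actions in rank order (highest estimated success
    probability first), stopping at the computational budget; return
    the first output that passes validation, or the best-ranked
    attempt as a fallback."""
    ranked = sorted(candidates, key=lambda c: c["p_success"], reverse=True)
    attempts = []
    for cand in ranked[:budget]:
        output = cand["execute"]()
        attempts.append(output)
        if validate(output):
            return output
    # No attempt validated within budget: fall back to the top-ranked one.
    return attempts[0]
```

Raising the budget buys more attempts (and thus higher expected accuracy) at higher inference cost, which is exactly the knob the computational-budget control exposes.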

CONTROL, BUDGETING, AND TRANSPARENCY

A key feature of Maestro is its ability to incorporate user-defined requirements—policies, instructions, or constraints—which the system translates into validators and fixers. This ensures that agents adhere to specified guidelines. Furthermore, Maestro includes computational budget controls, allowing users to set spending limits in terms of tokens, query cost, or latency, preventing runaway expenses. The system also provides detailed accuracy reports and execution traces, offering transparency into model and tool calls, and a report card showing whether requirements were met.
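The policy-to-validator-and-fixer idea can be sketched as follows. A declared requirement compiles into a pair of functions, applied in a bounded repair loop; the names and the word-count policy are invented for illustration and do not reflect AI21's implementation:

```python
def make_length_policy(max_words: int):
    """Compile a 'respond in at most N words' policy into a
    (validator, fixer) pair."""
    def validator(text: str) -> bool:
        return len(text.split()) <= max_words
    def fixer(text: str) -> str:
        return " ".join(text.split()[:max_words])
    return validator, fixer

def enforce(text: str, policies, max_rounds: int = 3) -> str:
    """Apply fixers for any failing policies, for a bounded number
    of rounds (the repair loop itself consumes budget)."""
    for _ in range(max_rounds):
        failing = [(v, f) for v, f in policies if not v(text)]
        if not failing:
            break
        for _, fixer in failing:
            text = fixer(text)
    return text
```

The bounded `max_rounds` mirrors the budget controls: enforcement effort is capped rather than allowed to loop indefinitely.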

ENSURING TRUSTWORTHINESS IN VALIDATION AND CONFIDENCE SCORES

The discussion addresses the challenge of generating reliable confidence scores, particularly when LLMs are involved in validation and may exhibit self-enhancement bias or inaccuracies with numerical tasks. Maestro supports custom, deterministic validators, such as code execution, for specific requirements, ensuring trust. When LLM-based validation is used, Maestro employs specialized judges and directs them to focus on single constraints rather than multiple ones. This focused validation, while still probabilistic, empirically improves output reliability and provides better guarantees.
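A deterministic validator of the kind described might look like a plain code check that the output must pass, with no LLM judgment involved. The schema below (a `decision` field and a numeric `confidence`) is an invented example of such a check:

```python
import json

def validate_decision_json(raw: str) -> bool:
    """Deterministic validator: the output must be valid JSON with an
    allowed decision value and a confidence in [0, 1]. Code execution
    gives a hard pass/fail, avoiding LLM self-enhancement bias."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        obj.get("decision") in {"approve", "review", "reject"}
        and isinstance(obj.get("confidence"), (int, float))
        and 0.0 <= obj["confidence"] <= 1.0
    )
```

Checks like this are where trust is strongest; only constraints that cannot be expressed as code fall back to the focused, single-constraint LLM judges mentioned above.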

ORGANIZATIONAL INTELLIGENCE OVER SUPER INTELLIGENCE

The overarching vision presented is not about achieving Artificial General Intelligence (AGI) or Superintelligence, but rather about creating 'organizational intelligence.' This involves developing AI systems that deeply understand and optimize how work is done within an enterprise context. By focusing on reliability, control, and transparency in agentic systems, AI21 aims to build a future where AI can effectively and dependably assist in complex business processes, offering immediate and concrete opportunities for improvement.

Common Questions

Why do AI agents struggle in mission-critical enterprise workflows?

LLMs are probabilistic and can make mistakes or act inconsistently. This leads to compounding errors in multi-step processes, resulting in low overall accuracy that prevents production deployment.
