AI Dev 25 x NYC | Samraj Moorjani: Accelerate High-Quality Agent Development with MLflow
Key Moments
MLflow accelerates AI agent development with tracing, evaluation, and insights for production quality.
Key Insights
Developing high-quality AI agents is challenging due to non-deterministic outputs and the need for domain expertise.
MLflow provides end-to-end lifecycle management for AI agents, including tracing for debugging and observability.
Agentic Insights analyzes production traces to automatically identify and root-cause issues in agent performance.
Offline evaluations in MLflow act as regression test suites to ensure quality and prevent unintended side effects.
MLflow facilitates the creation of trustworthy evaluation judges by aligning them with human feedback and domain expertise.
Agent as a Judge allows expressing complex evaluation criteria in natural language, simplifying judge development.
Managed MLflow on Databricks offers enterprise features like governance and fine-grained access control.
THE CHALLENGE OF BUILDING RELIABLE AI AGENTS
Traditional software development has established methods for ensuring reliability through testing and QA. AI agents, however, present unique challenges: unpredictable user interactions, non-deterministic outputs, and the constant need to balance cost, latency, and quality make development complex. Developers also often lack the domain expertise required for specialized agents, necessitating collaboration across organizational boundaries. The risks of low-quality AI agents, such as poor customer experience, increased costs, and reputational damage, underscore the critical need for robust quality assurance processes.
MLFLOW AS A COMPREHENSIVE GENAI LIFECYCLE PLATFORM
MLflow is an open-source platform designed to manage the entire lifecycle of generative AI applications. It offers a suite of capabilities, including tracing for detailed, step-by-step observability of agent execution, and evaluation tools that incorporate human feedback and AI judges to assess quality. MLflow also versions prompts, parameters, and code, and provides a gateway for controlled, audited access to LLMs and agents with built-in guardrails. Its open-source nature and broad framework support, built on open standards, have made it a widely adopted path for bringing agents to production.
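As a rough illustration of the versioning piece, here is a hedged sketch using the prompt registry API from recent MLflow 3.x releases; the prompt name and template are invented for the example, and exact API names vary by version:

```python
import mlflow

# A sketch of prompt versioning, assuming MLflow 3.x's prompt registry
# (mlflow.genai.register_prompt / load_prompt). The prompt name and
# template below are illustrative, not from the talk.
prompt = mlflow.genai.register_prompt(
    name="support_agent_system",
    template="You are a support agent. Answer the question: {{question}}",
)

# Pin an exact revision at serving time via a prompts:/ URI.
loaded = mlflow.genai.load_prompt(f"prompts:/support_agent_system/{prompt.version}")
print(loaded.format(question="How do I reset my password?"))
```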
ENHANCING DEBUGGING AND OBSERVABILITY WITH TRACING
Debugging AI agents is significantly harder than debugging traditional software because agent actions are unbounded and non-deterministic. MLflow addresses this with tracing, which provides rich, step-by-step observability into agent executions: developers can visualize every LLM call, tool invocation, and operational metric such as latency and cost. Enabling tracing takes a single line of code, with built-in support for more than 25 agentic frameworks, and manual instrumentation is similarly straightforward, making tracing an essential tool for root-causing issues and developing high-quality agents.
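A minimal sketch of both styles follows; the autolog call assumes the OpenAI integration (MLflow ships similar hooks for other frameworks), and `lookup_order` is a hypothetical tool added only for illustration:

```python
import mlflow

# One-line autologging: every OpenAI call is captured as a trace span.
# MLflow provides similar autolog hooks for LangChain, LlamaIndex,
# DSPy, and many other frameworks.
mlflow.openai.autolog()

# Manual instrumentation is just as lightweight: the @mlflow.trace
# decorator records a function's inputs, outputs, and latency as a span.
@mlflow.trace
def lookup_order(order_id: str) -> dict:
    # Hypothetical agent tool, included only to illustrate the decorator.
    return {"order_id": order_id, "status": "shipped"}

lookup_order("A-1001")
```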
ADDRESSING PRODUCTION-SCALE ISSUES WITH AGENTIC INSIGHTS
When agents are deployed at scale, quality and cost issues can arise that are buried within vast amounts of log data. MLflow's Agentic Insights automates the analysis of production traces, identifying and root-causing issues such as slow performance or high cost and providing supporting evidence; the same capability works in development environments. By incorporating human feedback or AI judges, Agentic Insights can further refine its analysis, and it produces a markdown report detailing issues, root causes, and prioritized fixes, streamlining debugging for large-scale deployments.
OFFLINE EVALUATIONS AND PROMPT OPTIMIZATION FOR QUALITY ASSURANCE
To build confidence in fixes and new features, offline evaluations serve as crucial regression test suites for AI agents. MLflow allows the creation of evaluation datasets from various sources, treating each data point as a unit test. These evaluations can be run easily by defining criteria and using out-of-the-box judges, with results displayed visually in the UI. This helps in comparing agent versions and understanding improvements or regressions. Furthermore, MLflow's automatic prompt optimization facilitates migration to cheaper or newer models while preserving quality, and can optimize system prompts for improved accuracy by leveraging feedback signals.
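For concreteness, here is a hedged sketch of an offline evaluation using MLflow 3's `mlflow.genai.evaluate` API; it assumes a judge LLM is configured in your environment, the questions and `predict_fn` are stand-ins for a real agent, and details vary by version:

```python
import mlflow
from mlflow.genai.scorers import Guidelines

# Each dataset row behaves like a unit test for the agent.
# (Questions here are illustrative, not from the talk.)
eval_data = [
    {"inputs": {"question": "How do I reset my password?"}},
    {"inputs": {"question": "What is your refund policy?"}},
]

def predict_fn(question: str) -> str:
    # Stand-in for the real agent under test (hypothetical).
    return f"Here is an answer to: {question}"

# Run the evaluation with an out-of-the-box guidelines judge;
# results appear in the MLflow UI for version-over-version comparison.
mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[
        Guidelines(
            name="tone",
            guidelines="The response must be polite, concise, and in English.",
        )
    ],
)
```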
BUILDING TRUSTWORTHY EVALUATION JUDGES
Creating reliable evaluation criteria and implementing them into automatic judges is a significant challenge, especially in specialized domains. MLflow addresses this by enabling users to tune judges with human feedback, aligning them with domain expert preferences and ensuring agreement. The platform offers an intuitive judge builder workflow for creating, labeling, and aligning judges. For complex criteria, the 'Agent as a Judge' feature allows expressing evaluation logic in plain English, simplifying the process of introspecting traces and extracting necessary context without writing brittle code. This enhances the trustworthiness and maintainability of evaluation systems.
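As a sketch of what "Agent as a Judge" looks like in code, recent MLflow 3.x releases expose a `make_judge` helper; treat the exact API as version-dependent, and note the judge name and criterion below are invented for the example:

```python
from mlflow.genai.judges import make_judge

# 'Agent as a Judge': the criterion is plain English. Referencing
# {{ trace }} lets the judge introspect the agent's execution trace
# itself rather than relying on brittle hand-written parsing code.
tool_choice_judge = make_judge(
    name="used_retrieval_first",
    instructions=(
        "Inspect {{ trace }} and determine whether the agent called a "
        "retrieval tool before producing its final answer. "
        "Answer 'yes' or 'no' with a one-sentence justification."
    ),
)
```

A judge defined this way can then be used as a scorer during evaluation and iteratively aligned with human labels through the judge builder workflow.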
MANAGED MLFLOW AND COMMUNITY CONTRIBUTIONS
MLflow can be used in three primary ways: as open-source software, as managed MLflow within the Databricks ecosystem, or self-hosted on one's own infrastructure. Managed MLflow on Databricks adds enterprise-grade features such as governance, fine-grained access controls, and lineage through Unity Catalog; self-hosting offers flexibility but does not include these enterprise features. MLflow is an open-source project that actively welcomes community contributions to further its capabilities across the GenAI lifecycle.
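In practice, switching between self-hosted and managed backends is largely a matter of the tracking URI; a minimal sketch (the localhost URL and experiment path are illustrative):

```python
import mlflow

# Self-hosted: point the client at your own MLflow tracking server.
mlflow.set_tracking_uri("http://localhost:5000")

# Managed: point at a Databricks workspace instead; credentials come
# from a Databricks CLI profile or DATABRICKS_HOST / DATABRICKS_TOKEN.
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/support-agent")
```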
Common Questions
Why are AI agents harder to develop reliably than traditional software?
AI agents present unique challenges compared to traditional software due to unpredictable user inputs, non-deterministic outputs, difficulty defining quality, and developers often lacking domain expertise. MLflow helps address these by providing tools for observability, evaluation, and lifecycle management.
Mentioned in this video
Unity Catalog: A feature on Databricks that backs MLflow datasets, providing governance and access control.
Agentic Insights: A feature in MLflow that analyzes traces to find and root-cause issues in agents, applicable in both production and development.
MLflow: An open-source platform that manages the end-to-end GenAI lifecycle, including tracing, evaluation, versioning, and access control.