⚡️Ranking Agentic LLMs — Pratik Bhavsar, Galileo
Key Moments
Galileo AI launches Agent Leaderboard to rank LLM agents on tool use, cost, and realistic task completion.
Key Insights
The shift from evaluating base LLMs to evaluating LLM agents capable of complex, real-world tasks is crucial.
Galileo's Agent Leaderboard focuses on tool selection, tool use, single vs. multi-turn interactions, and cost-efficiency.
Gemini models, particularly Flash, showed surprising cost-efficiency and strong performance, while Llama models underperformed in tool calling.
LLM-as-a-judge metrics, like Galileo's Tool Selection Quality (TSQ), are becoming viable and flexible alternatives to traditional metrics, provided the judge prompts are carefully engineered.
Version 2 of the leaderboard will introduce domain-specific evaluations, increased benchmark hardness, and more complex multi-turn scenarios simulating real-world agent workflows.
The primary goal of V2 is to recalibrate user expectations by highlighting that current LLM agents are not yet perfect and can make significant mistakes in realistic scenarios.
THE SHIFT TOWARDS AGENT EVALUATIONS
The landscape of LLM evaluation is rapidly evolving from assessing foundational model knowledge to evaluating the performance of specific LLM agents. With recent advancements like Grok 4 and Kimi K2, there's a growing need for robust methods to gauge an agent's ability to perform real-world tasks. This includes crucial skills like tool calling, context utilization, and cost-effectiveness, moving beyond simple knowledge checks. Traditional benchmarks are becoming saturated, necessitating specialized evaluations for these agentic capabilities.
DESIGNING THE GALILEO AGENT LEADERBOARD
Galileo Labs developed an agent leaderboard to provide specific insights into LLM agent performance, addressing customer demands for a shift from RAG evaluations to agentic tasks. Key dimensions considered include performance, cost, the performance of smaller versus larger models (SLM vs LLM), and the open-source versus private model debate. The evaluation framework was designed to be holistic, incorporating data slices from various existing benchmarks and focusing on critical agent capabilities.
METHODOLOGY: DATA SETS AND METRICS
The leaderboard was constructed using 14 curated datasets, filtered for difficulty, to evaluate LLMs across hundreds of domains and edge cases, including missing information, tool errors, and long contexts. The core evaluation metric used is Galileo's 'Tool Selection Quality' (TSQ), an LLM-based judge metric powered by GPT-4. This approach aims for higher performance through multiple judges and assesses aspects like tool selection and tool use, encompassing single-turn and multi-turn interactions.
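The actual TSQ metric is LLM-judged (powered by GPT-4) and proprietary, but the two dimensions it covers — did the agent pick the right tool, and did it pass the right arguments — can be illustrated with a toy deterministic check. This is only a sketch to clarify the concepts; the function name, scoring weights, and dict shapes are hypothetical, not Galileo's implementation:

```python
def score_tool_call(expected: dict, predicted: dict) -> float:
    """Toy reference-based score for one tool call: 0 for the wrong tool,
    otherwise 0.5 for correct tool selection plus up to 0.5 for matching
    arguments (the 'tool use' half). A real TSQ-style judge would instead
    ask an LLM to grade the call against the conversation context."""
    if predicted["name"] != expected["name"]:
        return 0.0  # wrong tool selected
    exp_args = expected["arguments"]
    if not exp_args:
        return 1.0  # no arguments to check
    pred_args = predicted["arguments"]
    matched = sum(1 for k, v in exp_args.items() if pred_args.get(k) == v)
    return 0.5 + 0.5 * (matched / len(exp_args))
```

For example, calling `get_weather` with one of two expected arguments correct would score 0.75 under this toy rubric, while calling an unrelated tool scores 0 regardless of arguments.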
SURPRISING RESULTS AND KEY FINDINGS
Initial findings revealed that Gemini models, especially the Flash variant, performed exceptionally well and were highly cost-efficient, often outperforming larger models. Conversely, Meta's Llama models (3.3 and 4) performed poorly at tool calling, contrary to community expectations, and reasoning-focused models, such as some from the 'O' series, struggled with multi-tool call outputs. Among open-source models, Mistral 3.1 stood out as a strong performer.
THE ROLE AND VALIDATION OF LLM-AS-A-JUDGE
Galileo emphasizes the validity and flexibility of using LLMs as judges, such as their TSQ metric. While acknowledging potential biases in LLMs, they employ rigorous internal datasets, prompt optimization, and self-consistency with multiple outputs and majority voting. This approach offers flexibility in model choice and prompt customization, allowing users to tailor evaluations. They stress the importance of initial validation with small annotated datasets to build confidence in the LLM-judge methodology, which has shown significant improvements over the past year.
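The self-consistency idea described here — run the judge several times and take the majority verdict — can be sketched in a few lines. The judge callable and verdict labels below are hypothetical stand-ins for an actual LLM call:

```python
from collections import Counter

def judge_with_self_consistency(judge_fn, sample, n_trials=5):
    """Run an LLM judge n_trials times on the same sample and return the
    majority verdict plus the agreement ratio. judge_fn is a hypothetical
    callable returning a verdict label such as 'correct' or 'incorrect';
    in practice it would wrap an LLM API call with a graded prompt."""
    votes = [judge_fn(sample) for _ in range(n_trials)]
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict, count / n_trials
```

The agreement ratio doubles as a cheap confidence signal: low agreement across trials flags samples worth routing to the small annotated validation set the speakers recommend.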
PREVIEW OF LEADERBOARD VERSION 2
Version 2 of the Agent Leaderboard aims to address limitations of the first iteration, such as score saturation and a lack of domain specificity. It will feature entirely new, domain-specific datasets across five verticals (banking, finance, telecommunication, insurance, healthcare) to prevent data leakage and provide more relevant insights. The benchmarks will be harder, focusing on complex multi-turn scenarios simulating support agents with numerous tools and adversarial situations, aiming to reset user expectations by showcasing the current imperfections of LLM agents.
EVALUATION METRICS FOR V2 AND REALISTIC SCENARIOS
The V2 evaluation will incorporate two primary metrics: the existing Tool Selection Quality and a new 'Action Completion' metric. Action Completion focuses on whether the agent successfully accomplished all user-requested goals within a scenario, a critical factor for business applications. This more comprehensive evaluation, simulating realistic multi-turn interactions with user and tool simulators, is designed to provide a clearer picture of agent capabilities and limitations in production-like settings, highlighting areas where models still struggle.
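At its simplest, the Action Completion idea — did the agent accomplish every goal the user asked for — reduces to comparing requested goals against completed ones. This minimal sketch assumes goals can be represented as labels; the real V2 metric judges completion from the full simulated conversation:

```python
def action_completion(requested_goals, completed_goals):
    """Fraction of user-requested goals the agent actually accomplished
    in a scenario; 1.0 only when every requested goal is met. An empty
    request trivially counts as complete."""
    requested = set(requested_goals)
    if not requested:
        return 1.0
    return len(requested & set(completed_goals)) / len(requested)
```

Scoring the scenario rather than individual tool calls is what makes this metric business-relevant: an agent can select every tool correctly yet still leave the user's request half-finished.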
Common Questions
What is an agent evaluation leaderboard?
An agent evaluation leaderboard ranks Large Language Models (LLMs) based on their ability to perform real-world tasks using tools, rather than just testing knowledge. It considers factors like tool selection, multi-turn interactions, and cost-efficiency.