Key Moments

⚡️Ranking Agentic LLMs — Pratik Bhavsar, Galileo

Latent Space Podcast
Science & Technology · 3 min read · 35 min video
Jul 14, 2025 · 1,988 views
TL;DR

Galileo AI launches Agent Leaderboard to rank LLM agents on tool use, cost, and realistic task completion.

Key Insights

1. The shift from evaluating base LLMs to evaluating LLM agents capable of complex, real-world tasks is crucial.

2. Galileo's Agent Leaderboard focuses on tool selection, tool use, single- vs. multi-turn interactions, and cost-efficiency.

3. Gemini models, particularly Flash, showed surprising cost-efficiency and strong performance, while Llama models underperformed in tool calling.

4. LLM-as-a-judge metrics, like Galileo's Tool Selection Quality (TSQ), are becoming viable and flexible alternatives to traditional metrics, with careful prompt engineering being key.

5. Version 2 of the leaderboard will introduce domain-specific evaluations, harder benchmarks, and more complex multi-turn scenarios simulating real-world agent workflows.

6. The primary goal of V2 is to recalibrate user expectations by showing that current LLM agents are not yet perfect and can make significant mistakes in realistic scenarios.

THE SHIFT TOWARDS AGENT EVALUATIONS

The landscape of LLM evaluation is rapidly evolving from assessing foundational model knowledge to evaluating the performance of specific LLM agents. With recent advancements like Grok 4 and Kimi K2, there's a growing need for robust methods to gauge an agent's ability to perform real-world tasks. This includes crucial skills like tool calling, context utilization, and cost-effectiveness, moving beyond simple knowledge checks. Traditional benchmarks are becoming saturated, necessitating specialized evaluations for these agentic capabilities.

DESIGNING THE GALILEO AGENT LEADERBOARD

Galileo Labs developed the Agent Leaderboard to provide specific insights into LLM agent performance, responding to customer demand for a shift from RAG evaluations to agentic tasks. Key dimensions include raw performance, cost, small versus large models (SLMs vs. LLMs), and the open-source versus proprietary debate. The evaluation framework was designed to be holistic, incorporating data slices from existing benchmarks and focusing on critical agent capabilities.

METHODOLOGY: DATA SETS AND METRICS

The leaderboard was constructed from 14 curated datasets, filtered for difficulty, evaluating LLMs across hundreds of domains and edge cases, including missing information, tool errors, and long contexts. The core metric is Galileo's Tool Selection Quality (TSQ), an LLM-as-a-judge metric powered by GPT-4. The approach uses multiple judges to improve reliability and assesses both tool selection and tool use across single-turn and multi-turn interactions.
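
TSQ itself is Galileo's proprietary metric, but the general shape of an LLM-judge tool-selection score can be sketched. In this hypothetical sketch, the prompt wording, the 0-to-1 scale, and the `judge` callable are all assumptions, not Galileo's implementation:

```python
# Illustrative sketch of an LLM-as-judge tool-selection score
# (not Galileo's actual TSQ implementation).
import json
from typing import Callable

JUDGE_PROMPT = """You are grading an AI agent's tool use.
User request: {request}
Available tools (JSON schemas): {tools}
Tool call the agent made: {call}
Reply with a single number from 0 to 1: 1 if the agent chose the right
tool with correct arguments, 0 if not, partial credit in between."""

def tool_selection_score(
    request: str,
    tools: list[dict],
    call: dict,
    judge: Callable[[str], str],  # any LLM completion function, e.g. GPT-4
) -> float:
    prompt = JUDGE_PROMPT.format(
        request=request,
        tools=json.dumps(tools),
        call=json.dumps(call),
    )
    # A production metric would validate and retry on unparsable judge output.
    return float(judge(prompt).strip())
```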

SURPRISING RESULTS AND KEY FINDINGS

Initial findings revealed that Gemini models, especially Flash, performed exceptionally well and were highly cost-efficient, often outperforming larger models. Conversely, Meta's Llama models (3.3 and 4) performed poorly at tool calling, contrary to community expectations, and reasoning-focused models such as some in OpenAI's o-series struggled with multi-tool-call outputs. Mistral 3.1 stood out as a strong open-source performer on this benchmark.

THE ROLE AND VALIDATION OF LLM-AS-A-JUDGE

Galileo emphasizes the validity and flexibility of using LLMs as judges, such as their TSQ metric. While acknowledging potential biases in LLMs, they employ rigorous internal datasets, prompt optimization, and self-consistency with multiple outputs and majority voting. This approach offers flexibility in model choice and prompt customization, allowing users to tailor evaluations. They stress the importance of initial validation with small annotated datasets to build confidence in the LLM-judge methodology, which has shown significant improvements over the past year.
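
Self-consistency here means sampling the judge several times (with nonzero temperature) and taking the majority answer. A minimal sketch of that voting step; the sample count and string verdicts are assumptions:

```python
# Self-consistency via majority voting over repeated judge samples.
from collections import Counter
from typing import Callable

def majority_verdict(prompt: str, judge: Callable[[str], str], n: int = 5) -> str:
    """Sample the judge n times and return the most common answer."""
    votes = Counter(judge(prompt).strip() for _ in range(n))
    return votes.most_common(1)[0][0]
```

An odd sample count avoids ties when verdicts are binary.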

PREVIEW OF LEADERBOARD VERSION 2

Version 2 of the Agent Leaderboard addresses limitations of the first iteration, such as score saturation and a lack of domain specificity. It will feature entirely new, domain-specific datasets across five verticals (banking, finance, telecommunications, insurance, healthcare) to prevent data leakage and provide more relevant insights. The benchmarks will be harder, focusing on complex multi-turn scenarios that simulate support agents with numerous tools and adversarial situations, with the aim of resetting user expectations by exposing the current imperfections of LLM agents.

EVALUATION METRICS FOR V2 AND REALISTIC SCENARIOS

The V2 evaluation will incorporate two primary metrics: the existing Tool Selection Quality and a new 'Action Completion' metric. Action Completion focuses on whether the agent successfully accomplished all user-requested goals within a scenario, a critical factor for business applications. This more comprehensive evaluation, simulating realistic multi-turn interactions with user and tool simulators, is designed to provide a clearer picture of agent capabilities and limitations in production-like settings, highlighting areas where models still struggle.
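
The episode does not spell out how Action Completion is computed; one plausible reading, scoring the fraction of user-requested goals satisfied over a simulated session, could look like this sketch (the `Session` structure, goal strings, and scoring rule are assumptions):

```python
# Hypothetical Action Completion check over a simulated multi-turn session.
from dataclasses import dataclass, field

@dataclass
class Session:
    """One simulated conversation between a user simulator and the agent."""
    goals: set[str]                          # goals the simulated user requested
    completed: set[str] = field(default_factory=set)

    def record(self, goal: str) -> None:
        """Mark a goal as done, e.g. when a tool result satisfies it."""
        self.completed.add(goal)

def action_completion(session: Session) -> float:
    """Fraction of requested goals the agent actually accomplished."""
    if not session.goals:
        return 1.0
    return len(session.goals & session.completed) / len(session.goals)

# Example: the simulated user asked for two actions; only one succeeded.
s = Session(goals={"issue_refund", "update_address"})
s.record("issue_refund")
print(action_completion(s))  # 0.5
```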

Agent Evaluation Best Practices

Practical takeaways from this episode

Do This

Focus on agent-specific evaluations that test real tasks, tool calling, and context utilization.
Include cost awareness as a key component in evaluations.
Consider Small Language Models (SLMs) and open-source models for potential cost-performance trade-offs.
Use LLMs as judges with rigorous prompt optimization, self-consistency, and multiple judges for robust evaluation.
Validate LLM judge metrics with annotated datasets before making them the default.
For V2 benchmarking, prioritize domain-specific datasets and harder, more complex tool-calling scenarios.
Evaluate agents on multi-turn interactions and overall action completion, not just individual tool selection.

Avoid This

Rely solely on general knowledge or previous benchmarks like Arena for agent evaluation, as performance may differ.
Ignore cost-efficiency when comparing models; a slightly worse model might be more economical.
Overlook smaller or less-hyped models that could offer competitive performance at a lower cost.
Dismiss LLMs as judges outright over potential biases; rigorous prompting and self-consistency can mitigate them.
Use overly simple benchmarks like BFCL for realistic agent evaluation; opt for more complex and domain-specific tasks.
Assume models are perfect at tool calling; real-world scenarios reveal significant limitations.

Common Questions

What is an agent evaluation leaderboard?

An agent evaluation leaderboard ranks Large Language Models (LLMs) on their ability to perform real-world tasks using tools, rather than just testing knowledge. It considers factors like tool selection, multi-turn interactions, and cost-efficiency.
