Key Moments

⚡️Ranking Agentic LLMs — Pratik Bhavsar, Galileo

Latent Space Podcast
Science & Technology · 3 min read · 35 min video
Jul 14, 2025 · 1,988 views
TL;DR

Galileo AI launches Agent Leaderboard to rank LLM agents on tool use, cost, and realistic task completion.

Key Insights

1. The shift from evaluating base LLMs to evaluating LLM agents capable of complex, real-world tasks is crucial.

2. Galileo's Agent Leaderboard focuses on tool selection, tool use, single- vs. multi-turn interactions, and cost-efficiency.

3. Gemini models, particularly Flash, showed surprising cost-efficiency and strong performance, while Llama models underperformed in tool calling.

4. LLM-as-a-judge metrics, like Galileo's Tool Selection Quality (TSQ), are becoming viable and flexible alternatives to traditional metrics, with careful prompt engineering being key.

5. Version 2 of the leaderboard will introduce domain-specific evaluations, harder benchmarks, and more complex multi-turn scenarios simulating real-world agent workflows.

6. The primary goal of V2 is to recalibrate user expectations by showing that current LLM agents are not yet perfect and can make significant mistakes in realistic scenarios.

THE SHIFT TOWARDS AGENT EVALUATIONS

The landscape of LLM evaluation is rapidly evolving from assessing foundational model knowledge to evaluating the performance of specific LLM agents. With recent advancements like Grok 4 and Kimi K2, there's a growing need for robust methods to gauge an agent's ability to perform real-world tasks. This includes crucial skills like tool calling, context utilization, and cost-effectiveness, moving beyond simple knowledge checks. Traditional benchmarks are becoming saturated, necessitating specialized evaluations for these agentic capabilities.

DESIGNING THE GALILEO AGENT LEADERBOARD

Galileo Labs developed the Agent Leaderboard to provide specific insights into LLM agent performance, responding to customer demand for a shift from RAG evaluations to agentic tasks. Key dimensions include raw performance, cost, small versus large models (SLMs vs. LLMs), and the open-source versus proprietary debate. The evaluation framework was designed to be holistic, incorporating data slices from existing benchmarks and focusing on critical agent capabilities.

METHODOLOGY: DATA SETS AND METRICS

The leaderboard was constructed from 14 curated datasets, filtered for difficulty, evaluating LLMs across hundreds of domains and edge cases, including missing information, tool errors, and long contexts. The core metric is Galileo's Tool Selection Quality (TSQ), an LLM-as-a-judge metric powered by GPT-4. The approach uses multiple judges to improve reliability and assesses both tool selection and tool use across single-turn and multi-turn interactions.
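
TSQ itself is Galileo's proprietary metric, but the general shape of an LLM-judge tool-selection score can be sketched. In this hypothetical sketch, the prompt wording, the 0-to-1 scale, and the `judge` callable are all assumptions, not Galileo's implementation:

```python
# Illustrative sketch of an LLM-as-judge tool-selection score
# (not Galileo's actual TSQ implementation).
import json
from typing import Callable

JUDGE_PROMPT = """You are grading an AI agent's tool use.
User request: {request}
Available tools (JSON schemas): {tools}
Tool call the agent made: {call}
Reply with a single number from 0 to 1: 1 if the agent chose the right
tool with correct arguments, 0 if not, partial credit in between."""

def tool_selection_score(
    request: str,
    tools: list[dict],
    call: dict,
    judge: Callable[[str], str],  # any LLM completion function, e.g. GPT-4
) -> float:
    prompt = JUDGE_PROMPT.format(
        request=request,
        tools=json.dumps(tools),
        call=json.dumps(call),
    )
    # A production metric would validate and retry on unparsable judge output.
    return float(judge(prompt).strip())
```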

SURPRISING RESULTS AND KEY FINDINGS

Initial findings revealed that Gemini models, especially Flash, performed exceptionally well and were highly cost-efficient, often outperforming larger models. Conversely, Meta's Llama models (3.3 and 4) performed poorly at tool calling, contrary to community expectations, and reasoning-focused models such as some in OpenAI's o-series struggled with multi-tool-call outputs. Mistral 3.1 stood out as a strong open-source performer on this benchmark.

THE ROLE AND VALIDATION OF LLM-AS-A-JUDGE

Galileo emphasizes the validity and flexibility of using LLMs as judges, such as their TSQ metric. While acknowledging potential biases in LLMs, they employ rigorous internal datasets, prompt optimization, and self-consistency with multiple outputs and majority voting. This approach offers flexibility in model choice and prompt customization, allowing users to tailor evaluations. They stress the importance of initial validation with small annotated datasets to build confidence in the LLM-judge methodology, which has shown significant improvements over the past year.
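
Self-consistency here means sampling the judge several times (with nonzero temperature) and taking the majority answer. A minimal sketch of that voting step; the sample count and string verdicts are assumptions:

```python
# Self-consistency via majority voting over repeated judge samples.
from collections import Counter
from typing import Callable

def majority_verdict(prompt: str, judge: Callable[[str], str], n: int = 5) -> str:
    """Sample the judge n times and return the most common answer."""
    votes = Counter(judge(prompt).strip() for _ in range(n))
    return votes.most_common(1)[0][0]
```

An odd sample count avoids ties when verdicts are binary.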

PREVIEW OF LEADERBOARD VERSION 2

Version 2 of the Agent Leaderboard addresses limitations of the first iteration, such as score saturation and a lack of domain specificity. It will feature entirely new, domain-specific datasets across five verticals (banking, finance, telecommunications, insurance, healthcare) to prevent data leakage and provide more relevant insights. The benchmarks will be harder, focusing on complex multi-turn scenarios that simulate support agents with numerous tools and adversarial situations, with the aim of resetting user expectations by exposing the current imperfections of LLM agents.

EVALUATION METRICS FOR V2 AND REALISTIC SCENARIOS

The V2 evaluation will incorporate two primary metrics: the existing Tool Selection Quality and a new 'Action Completion' metric. Action Completion focuses on whether the agent successfully accomplished all user-requested goals within a scenario, a critical factor for business applications. This more comprehensive evaluation, simulating realistic multi-turn interactions with user and tool simulators, is designed to provide a clearer picture of agent capabilities and limitations in production-like settings, highlighting areas where models still struggle.
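
The episode does not spell out how Action Completion is computed; one plausible reading, scoring the fraction of user-requested goals satisfied over a simulated session, could look like this sketch (the `Session` structure, goal strings, and scoring rule are assumptions):

```python
# Hypothetical Action Completion check over a simulated multi-turn session.
from dataclasses import dataclass, field

@dataclass
class Session:
    """One simulated conversation between a user simulator and the agent."""
    goals: set[str]                          # goals the simulated user requested
    completed: set[str] = field(default_factory=set)

    def record(self, goal: str) -> None:
        """Mark a goal as done, e.g. when a tool result satisfies it."""
        self.completed.add(goal)

def action_completion(session: Session) -> float:
    """Fraction of requested goals the agent actually accomplished."""
    if not session.goals:
        return 1.0
    return len(session.goals & session.completed) / len(session.goals)

# Example: the simulated user asked for two actions; only one succeeded.
s = Session(goals={"issue_refund", "update_address"})
s.record("issue_refund")
print(action_completion(s))  # 0.5
```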

Agent Evaluation Best Practices

Practical takeaways from this episode

Do This

Focus on agent-specific evaluations that test real tasks, tool calling, and context utilization.
Include cost awareness as a key component in evaluations.
Consider Small Language Models (SLMs) and open-source models for potential cost-performance trade-offs.
Use LLMs as judges with rigorous prompt optimization, self-consistency, and multiple judges for robust evaluation.
Validate LLM judge metrics with annotated datasets before making them the default.
For V2 benchmarking, prioritize domain-specific datasets and harder, more complex tool-calling scenarios.
Evaluate agents on multi-turn interactions and overall action completion, not just individual tool selection.

Avoid This

Rely solely on general knowledge or previous benchmarks like Arena for agent evaluation, as performance may differ.
Ignore cost-efficiency when comparing models; a slightly worse model might be more economical.
Overlook smaller or less-hyped models that could offer competitive performance at a lower cost.
Dismiss LLMs as judges outright over potential biases; rigorous prompting and self-consistency can mitigate them.
Use overly simple benchmarks like BFCL for realistic agent evaluation; opt for more complex and domain-specific tasks.
Assume models are perfect at tool calling; real-world scenarios reveal significant limitations.

Common Questions

What is an agent evaluation leaderboard?

An agent evaluation leaderboard ranks Large Language Models (LLMs) on their ability to perform real-world tasks using tools, rather than just testing knowledge. It considers factors like tool selection, multi-turn interactions, and cost-efficiency.
