Why shouldn't I trust model lab evals at face value?

Model labs often optimize for benchmark scores to gain clout, which may not accurately reflect real-world performance. Scores can be close between models, making precise differentiation difficult and potentially misleading.

How often do AI models change, and how should I approach adopting new ones?

The AI landscape moves extremely fast, with models changing every couple of months. Instead of being an early adopter, it's advised to wait a few weeks for the dust to settle after a new model is released before testing it.

What is Terminal Bench and why is it better than older coding benchmarks?

Terminal Bench is a benchmark from Stanford consisting of 89 real-world software engineering problems like database issues and race conditions. It's more applicable to actual daily tasks faced by developers than traditional algorithm problems like Fibonacci.

What key metrics should I track when evaluating AI agents?

Key metrics include the number of turns, tool calls, tokens used, and the total time the agent run takes. This helps in understanding the cost, performance, and quality trade-offs for agent development.

What are the three things being tested when evaluating an AI agent?

You are testing the AI model itself, the 'harness' or scaffolding of your agent code, and the relevance of the problem you are trying to solve. All three need to be aligned for successful evaluation.

What are the three 'zones of improvement' when refining an AI agent?

Zone 1 addresses obvious flaws in the agent's functionality. Zone 2 covers more nuanced, judgment-driven improvements to prompt engineering and tool usage. Zone 3 is the danger zone of optimizing only for a metric and overfitting to it.

After improving an agent's score on an eval, what else should I consider?

Even with a good score, you must 'pass the vibe check.' This means intuitively assessing if the agent makes sense, is behaving sensibly, and is genuinely solving the intended problem, not just optimizing for the benchmark.

AI Dev 26 x SF | Ara Khan: Evals Are Broken Use Them Anyway

DeepLearning.AI

Education | 5 min read | 25 min video

May 22, 2026|5,321 views|72|1

TL;DR

AI evals are broken, but using them anyway with practical heuristics is better than relying on 'vibes'. The key is to interpret them judiciously, stay current without being an early adopter, and focus on problem-specific benchmarks.

Key Insights

Most current AI evals are compromised: either they focus on easily gamed objective scores ('benchmark maxing') or purely subjective 'vibes' (like how an AI sounds).

Objective benchmark scores from model labs are only 'close approximations' and shouldn't be taken as definitive proof of a model's superiority.

When evaluating models, stay current but wait for the 'dust to settle' for a couple of weeks after a new model release before adopting it.

Instead of generic evals, focus on problem-specific benchmarks. For coding agents, a benchmark like Stanford's Terminal Bench, with 89 real-world software engineering tasks, is more applicable than older benchmarks like SWE-bench.

When developing your own evals, containerize environments using tools like Harbor for isolation and parallelization, and focus on tracking metrics like turns, tool calls, tokens, and run time.

Improving an agent involves three zones: fixing obvious flaws, refining prompts and tool usage through more nuanced, judgment-driven changes, and avoiding the danger zone of simply optimizing for the metric (overfitting).
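The per-run metrics named in the insights above (turns, tool calls, tokens, run time) can be recorded in a small structure and averaged across runs. This is a minimal sketch, not the speaker's tooling: the `RunMetrics` fields, the `summarize` helper, and all numbers are hypothetical placeholders.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunMetrics:
    """Hypothetical record of one agent run on one eval task."""
    task_id: str
    passed: bool
    turns: int
    tool_calls: int
    tokens: int
    wall_time_s: float


def summarize(runs: list[RunMetrics]) -> dict[str, float]:
    """Average the cost and quality signals across a batch of runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    return {
        "pass_rate": mean([1.0 if r.passed else 0.0 for r in runs]),
        "avg_turns": mean([r.turns for r in runs]),
        "avg_tool_calls": mean([r.tool_calls for r in runs]),
        "avg_tokens": mean([r.tokens for r in runs]),
        "avg_wall_time_s": mean([r.wall_time_s for r in runs]),
    }


# Made-up example data, for illustration only.
runs = [
    RunMetrics("task-01", True, 12, 20, 40_000, 300.0),
    RunMetrics("task-02", False, 30, 55, 120_000, 1500.0),
    RunMetrics("task-03", True, 18, 30, 80_000, 900.0),
]
summary = summarize(runs)
print(summary)
```

Keeping the pass rate next to the cost signals makes it easier to see when a score gain was bought with many more tokens or much longer runs.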

AI evals are flawed but essential tools

Ara Khan argues that the current state of AI evaluations is fundamentally broken, yet indispensable for agent development. He identifies two main camps of flawed thinking: the 'objective metrics camp,' which blindly trusts benchmark scores that are easily gamed, and the 'taste is king' camp, which dismisses numbers entirely in favor of subjective 'vibes.' Khan asserts that the truth lies in a balanced approach, where evals are used strategically rather than being treated as absolute truths. The goal is to build, interpret, and utilize evals effectively within agent workflows, whether for simple or complex applications. He emphasizes that while evals aren't the 'end all, be all,' they offer critical insights that surpass pure intuition.

Interpreting external evals with caution

Khan advises against taking model lab evaluations at face value. While these benchmarks provide useful approximations, they can be misleading. He cites an example where models with very close scores might not actually perform equally well in practice. The prevalent practice of 'benchmark maxing,' where labs focus on achieving high scores rather than genuine model quality, further erodes trust. A key heuristic is to 'stay current but don't be the earliest adopter.' Given the rapid evolution of AI models (in AI, a couple of months can feel like years), waiting a few weeks for the hype to die down allows for more stable and practical assessment. Prioritize evals specific to your problem domain; generic benchmarks may not reflect the realities of your specific application.
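One way to see why close scores are hard to tell apart is to put an uncertainty interval around each pass rate. The sketch below uses the standard Wilson score interval; the choice of this interval and the pass counts are assumptions made for illustration and were not presented in the talk.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical results for two models on the same 89-task benchmark.
model_a = wilson_interval(52, 89)
model_b = wilson_interval(55, 89)
print(model_a, model_b, intervals_overlap(model_a, model_b))
```

With only 89 tasks, a three-task difference sits well inside both intervals, so a leaderboard gap of that size says little about which model would do better on your own workload.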

Leveraging evals to improve agent performance

Improving AI agents, especially complex ones like coding agents, is challenging due to the high variance in their responses and the infinite solution spaces. Traditional evals often focused on trivial tasks like Fibonacci sequences, which are irrelevant to real-world applications. Khan highlights the development of Terminal Bench by Stanford as a significant advancement. This benchmark features 89 problems directly relevant to actual software engineering tasks, such as database issues, race conditions, and front-end bugs. These agentic evals involve letting the agent run for extended periods (5-45 minutes) to complete tasks, followed by deterministic unit tests to evaluate success. This approach is crucial for understanding how agents perform in multi-step, complex scenarios.
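The run-then-verify pattern described above can be sketched with the standard library alone: give the agent a time budget, then score the resulting workspace with a deterministic check. Everything here is a toy stand-in. `run_task`, the fake agent, and the fake check are hypothetical, and this is not how Terminal Bench itself is implemented; treating a timeout as its own outcome is also just one possible design choice.

```python
import subprocess
import sys
import tempfile


def run_task(agent_cmd: list[str], check_cmd: list[str], workdir: str, timeout_s: float) -> str:
    """Run the agent under a time budget, then score with a deterministic check."""
    try:
        subprocess.run(agent_cmd, cwd=workdir, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return "timeout"
    # The check's exit code, not the agent's own claims, decides pass or fail.
    check = subprocess.run(check_cmd, cwd=workdir, check=False)
    return "pass" if check.returncode == 0 else "fail"


# Toy stand-ins: the "agent" writes a file, the "unit test" verifies its content.
fake_agent = [sys.executable, "-c", "open('answer.txt', 'w').write('42')"]
fake_check = [
    sys.executable,
    "-c",
    "import sys; sys.exit(0 if open('answer.txt').read() == '42' else 1)",
]

with tempfile.TemporaryDirectory() as workdir:
    result = run_task(fake_agent, fake_check, workdir, timeout_s=60)
print(result)
```

In a real setup the agent command would launch your harness and the check would be the task's test suite, with a budget in the minutes range rather than seconds.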

Building and running effective agentic evals

When building custom evals, it's vital to containerize environments to ensure isolation and prevent interference between tasks. Tools like Harbor and Modal facilitate parallelized, containerized evaluations. The evaluation process involves running an agent on specific tasks, analyzing its performance, and identifying failure points. For example, if an agent repeatedly fails at editing files or encounters installation issues, these patterns become apparent at scale. Analyzing these failures helps bucket problems into broad categories (e.g., file editing, inference issues) which then allows for iterative improvement of specific tools or agent logic. This aggregate view provides a more realistic simulation of user experience than subjective testing.
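Bucketing failures into broad categories can start as simple keyword matching over run logs. The sketch below is one rough way to do it; the bucket names, keyword rules, and log lines are all invented for illustration, and a real setup would likely need rules tuned to its own logs.

```python
from collections import Counter

# Hypothetical keyword rules; the first matching bucket wins.
BUCKET_RULES = [
    ("file_editing", ("patch failed", "no such file", "edit rejected")),
    ("installation", ("pip install", "command not found", "modulenotfounderror")),
    ("inference", ("rate limit", "context length", "timed out waiting for model")),
]


def bucket_failure(log: str) -> str:
    """Assign a failed run's log to the first bucket whose keyword appears in it."""
    text = log.lower()
    for bucket, keywords in BUCKET_RULES:
        if any(keyword in text for keyword in keywords):
            return bucket
    return "uncategorized"


# Made-up log excerpts from failed runs.
failed_logs = [
    "Edit rejected: patch failed to apply to src/app.py",
    "bash: cargo: command not found",
    "ModuleNotFoundError: No module named 'requests'",
    "HTTP 429: rate limit exceeded",
    "Assertion failed in test_orders",
]
counts = Counter(bucket_failure(log) for log in failed_logs)
print(counts.most_common())
```

The largest bucket is a reasonable candidate for the next fix, and the "uncategorized" bucket shows which logs still need to be read by hand.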

Testing the model, harness, and problem set

Effective evaluation requires testing three components: the model itself, the agent's 'harness' (the scaffolding and logic surrounding the model), and the relevance of the problem set. A model might perform poorly not due to its inherent quality, but because the harness is not optimized for it, or because the problem set is ill-suited. This explains why the same model might perform differently across various agent frameworks. By iteratively making changes to parameters, timeouts, and reasoning behaviors, Khan's team saw their agent's scores improve, eventually outperforming competitors like Claude Code on specific evals for Opus 4.5.
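To separate the model's contribution from the harness's, results can be grouped by (model, harness) pair and compared cell by cell. This is a minimal sketch with made-up model names, harness names, and outcomes; it is not the analysis Khan's team ran.

```python
from collections import defaultdict


def pass_rates(runs: list[tuple[str, str, bool]]) -> dict[tuple[str, str], float]:
    """Pass rate for each (model, harness) combination."""
    totals = defaultdict(lambda: [0, 0])  # key -> [passed, attempted]
    for model, harness, passed in runs:
        totals[(model, harness)][0] += int(passed)
        totals[(model, harness)][1] += 1
    return {key: p / n for key, (p, n) in totals.items()}


# Hypothetical outcomes: (model, harness, passed).
runs = [
    ("model-x", "harness-a", True), ("model-x", "harness-a", True),
    ("model-x", "harness-a", False), ("model-x", "harness-a", True),
    ("model-x", "harness-b", False), ("model-x", "harness-b", True),
    ("model-x", "harness-b", False), ("model-x", "harness-b", False),
    ("model-y", "harness-a", True), ("model-y", "harness-a", False),
    ("model-y", "harness-b", True), ("model-y", "harness-b", True),
]
rates = pass_rates(runs)
for (model, harness), rate in sorted(rates.items()):
    print(f"{model:8} {harness:10} {rate:.2f}")
```

In this toy data the same model scores 0.75 in one harness and 0.25 in another, which is the kind of gap that points at the scaffolding rather than the model.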

Zones of improvement and avoiding overfitting

Agent improvement through evals can be categorized into three zones. Zone 1 involves fixing obvious, fundamental flaws in the agent's basic functionality. Zone 2, the 'real hill climbing,' focuses on more nuanced improvements in prompt engineering, tool selection, and logic to enhance performance on complex tasks. This is where evals provide objective feedback for subjective improvements. Zone 3 is the 'danger zone' of overfitting, where developers focus solely on optimizing the metric without regard for solving the actual problem. This can lead to agents that perform well on the eval but fail in real-world scenarios. Developers must ensure they are genuinely improving the agent's capability, not just manipulating it to pass the test.
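One possible guard against drifting into Zone 3 is to keep some tasks out of the tuning loop and compare scores on the two groups. The held-out split is an assumption added here as an illustration, not something the talk prescribes; the task names and results are invented.

```python
def pass_rate(results: dict[str, bool], task_ids) -> float:
    """Fraction of the given tasks that passed."""
    ids = list(task_ids)
    if not ids:
        raise ValueError("no tasks given")
    return sum(results[task_id] for task_id in ids) / len(ids)


def overfit_gap(results: dict[str, bool], held_out: set[str]) -> float:
    """Pass rate on tasks used for tuning minus pass rate on held-out tasks."""
    tuned = [task_id for task_id in results if task_id not in held_out]
    return pass_rate(results, tuned) - pass_rate(results, held_out)


# Hypothetical results after several rounds of tuning against tasks t1..t6.
results = {
    "t1": True, "t2": True, "t3": True, "t4": True, "t5": True, "t6": False,
    "t7": True, "t8": False,
}
held_out = {"t7", "t8"}  # never inspected while tuning
gap = overfit_gap(results, held_out)
print(f"gap between tuned and held-out pass rates: {gap:.2f}")
```

A gap that keeps widening across iterations suggests the changes are fitting the tuned tasks rather than improving the agent, which is a cue to stop and run the vibe check on real problems.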

The ongoing value of evals

Ultimately, Khan advocates for finding or building benchmarks that work for your specific needs, 'hill climbing' (iteratively improving scores), and crucially, always performing a 'vibe check.' This means ensuring the agent not only achieves good scores but also behaves sensibly and solves the intended problems. The commitment to evals, even with their flaws, has been instrumental in uncovering the capabilities of sophisticated open-source models that might otherwise have been overlooked. By embracing this discipline, teams can continuously improve agent experiences and make informed decisions about model selection and development.

AI Eval Best Practices: Dos and Don'ts

Practical takeaways from this episode

Do This

Use heuristics to interpret other people's evals.

Stay current with AI model advancements, but wait for the dust to settle before adopting new models.

Prioritize evals that are specific to your problem or as close to it as possible.

Let agents run for extended periods (5-40 minutes) for complex tasks before evaluating.

Track metrics like turns, tool calls, tokens, and run time to understand cost vs. quality.

Containerize and isolate eval environments to prevent interference.

Identify broad buckets of failures (e.g., reading files, inference, installation issues).

Iteratively improve agents by fixing specific problems identified through evals.

Test the model, the agent harness, and the relevance of the problem itself.

Focus on fixing obvious flaws first (Zone 1).

Engage in nuanced judgment by giving agents real problems to solve (Zone 2).

Be cautious of optimizing solely for a metric (Zone 3 Danger Zone).

Find a benchmark that works for you and 'hill climb' on it.

Always pass the 'vibe check' – ensure the agent makes sense and solves the actual problem.

Avoid This

Don't blindly believe model lab evals or take them as absolute truth.

Don't be the earliest adopter; wait for consensus and stability.

Don't use generic, general-purpose evals if specific ones exist for your problem.

Don't use trivial evals (e.g., counting letters, cat toes) for complex agents.

Don't ignore the cost implications of using certain models or eval setups.

Don't assume that a successful agent setup is solely due to the model; your harness might be the factor.

Don't overfit to a metric; focus on solving the actual problem.

Don't solely rely on benchmarks; ensure the agent makes intuitive sense.

AI Model Performance Evolution (Conceptual)

Data extracted from this episode

Timeframe	Top Model (Example)
Couple months ago	Sonnet 4.6 / Opus
Now	Newer models surpass previous leaders

Common Questions

What are the two main camps of thinking about AI evals?

The two main camps are the 'objective metrics' camp, who believe benchmark scores are absolute truth, and the 'taste is king' camp, who dismiss numbers entirely and focus only on perceived 'vibes' or user experience.

Topics

AI & Machine Learning, Technology & Innovation, Programming & Software, Prompt Engineering, Agentic Workflows, AI Evaluation, Coding Agents, Software Development, LLM Benchmarks, Model Performance, Custom Evals

Mentioned in this video

Organizations

Epoch AI

Mentioned as a company that releases objective evaluation numbers for AI models.

Stanford University

Institution from which the Terminal Bench benchmark originated.

Software & Apps

Sonnet

A model family (Sonnet 4.6 was the version named) mentioned to illustrate how close scores between models can be misleading.

Claude

AI model that some users prefer due to its perceived 'vibes' or pleasant conversational style.

Cursor

An AI coding assistant that was mentioned in the context of evaluating models.

Claude Code

An AI coding assistant mentioned alongside Cursor and within the context of evals.

Modal

An infrastructure layer used to build parallelized, containerized environments for running eval tasks.

Codex

An AI coding model that the speaker uses as an example of an agent that can get stuck in loops or encounter installation issues when evaluated.

Opus

An AI model mentioned in the context of beating Claude Code in evals after iterative improvements.
