AI Dev 26 x SF | Ara Khan: Evals Are Broken Use Them Anyway

DeepLearning.AI
Education | 5 min read | 25 min video
May 22, 2026|1,216 views|29|1

TL;DR

AI evals are broken, but using them anyway with practical heuristics is better than relying on 'vibes'. The key is to interpret them judiciously, stay current without being an early adopter, and focus on problem-specific benchmarks.

Key Insights

1. Most current AI evals are compromised: they rely either on easily gamed objective scores ('benchmark maxing') or on purely subjective 'vibes' (like how an AI sounds).

2. Objective benchmark scores from model labs are only 'close approximations' and shouldn't be taken as definitive proof of a model's superiority.

3. When evaluating models, stay current, but wait a couple of weeks after a new model release for the 'dust to settle' before adopting it.

4. Instead of generic evals, focus on problem-specific benchmarks. For coding agents, a benchmark like Stanford's 'Terminal-Bench', with 89 real-world software engineering tasks, is more applicable than older benchmarks like SWE-bench.

5. When developing your own evals, containerize environments using tools like Harbor for isolation and parallelization, and track metrics like turns, tool calls, tokens, and run time.

6. Improving an agent involves three zones: fixing obvious flaws, refining prompts and tool usage for philosophical alignment, and avoiding the danger zone of simply optimizing for the metric (overfitting).

AI evals are flawed but essential tools

Ara Khan argues that the current state of AI evaluations is fundamentally broken, yet indispensable for agent development. He identifies two main camps of flawed thinking: the 'objective metrics camp,' which blindly trusts benchmark scores that are easily gamed, and the 'taste is king' camp, which dismisses numbers entirely in favor of subjective 'vibes.' Khan asserts that the truth lies in a balanced approach, where evals are used strategically rather than being treated as absolute truths. The goal is to build, interpret, and utilize evals effectively within agent workflows, whether for simple or complex applications. He emphasizes that while evals aren't the 'end all, be all,' they offer critical insights that surpass pure intuition.

Interpreting external evals with caution

Khan advises against taking model lab evaluations at face value. While these benchmarks provide useful approximations, they can be misleading. He cites an example where models with very close scores might not actually perform equally well in practice. The prevalent practice of 'benchmark maxing,' where labs focus on achieving high scores rather than genuine model quality, further erodes trust. A key heuristic is to 'stay current but don't be the earliest adopter.' Given the rapid evolution of AI models (in AI, a couple of months can feel like years), waiting a few weeks for the hype to die down allows for more stable and practical assessment. Prioritize evals specific to your problem domain; generic benchmarks may not reflect the realities of your specific application.
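One way to see why very close scores may not separate two models is to put an uncertainty interval around each pass rate. The sketch below is an illustration rather than anything shown in the talk: it uses the standard Wilson score interval, and the task count and pass counts are hypothetical.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical results: two models on the same 89-task benchmark.
model_a = wilson_interval(52, 89)
model_b = wilson_interval(55, 89)
print(model_a, model_b, intervals_overlap(model_a, model_b))
```

With fewer than a hundred tasks, each interval spans roughly ten percentage points on either side of the score, so a gap of a few tasks sits well inside the noise. Overlapping intervals are only a rough check, not a formal significance test.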

Leveraging evals to improve agent performance

Improving AI agents, especially complex ones like coding agents, is challenging because their responses vary widely from run to run and the solution space is effectively unbounded. Traditional evals often focused on trivial tasks like Fibonacci sequences, which are irrelevant to real-world applications. Khan highlights the development of 'Terminal-Bench' by Stanford as a significant advancement. This benchmark features 89 problems directly relevant to actual software engineering tasks, such as database issues, race conditions, and front-end bugs. These agentic evals involve letting the agent run for extended periods (5-45 minutes) to complete tasks, followed by deterministic unit tests to evaluate success. This approach is crucial for understanding how agents perform in multi-step, complex scenarios.
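The run-then-test pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the benchmark's actual harness: `run_task`, `TaskResult`, and the toy commands are hypothetical names, and a real setup would run each task inside its own container.

```python
import subprocess
import sys
import tempfile
import time
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    passed: bool
    timed_out: bool
    seconds: float


def run_task(task_id, agent_cmd, test_cmd, workdir, timeout_s):
    """Let the agent work in workdir, then score it with a deterministic test command."""
    start = time.monotonic()
    timed_out = False
    try:
        subprocess.run(agent_cmd, cwd=workdir, timeout=timeout_s,
                       capture_output=True, check=False)
    except subprocess.TimeoutExpired:
        timed_out = True  # the agent ran out of time; still score what it left behind
    tests = subprocess.run(test_cmd, cwd=workdir, capture_output=True, check=False)
    return TaskResult(task_id, tests.returncode == 0, timed_out,
                      time.monotonic() - start)


# Toy stand-ins for an agent and its unit tests (assumptions for illustration).
AGENT_CMD = [sys.executable, "-c", "open('answer.txt', 'w').write('42')"]
TEST_CMD = [sys.executable, "-c",
            "import sys; sys.exit(0 if open('answer.txt').read() == '42' else 1)"]

with tempfile.TemporaryDirectory() as workdir:
    result = run_task("toy-task", AGENT_CMD, TEST_CMD, workdir, timeout_s=60)
print(result.passed, result.timed_out)
```

Scoring with the test command's exit code keeps the verdict deterministic even though the agent's path to a solution varies between runs. Running the tests after a timeout is a design choice here; a stricter harness could count every timeout as a failure.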

Building and running effective agentic evals

When building custom evals, it's vital to containerize environments to ensure isolation and prevent interference between tasks. Tools like Harbor and Modal facilitate parallelized, containerized evaluations. The evaluation process involves running an agent on specific tasks, analyzing its performance, and identifying failure points. For example, if an agent repeatedly fails at editing files or encounters installation issues, these patterns become apparent at scale. Analyzing these failures helps bucket problems into broad categories (e.g., file editing, inference issues) which then allows for iterative improvement of specific tools or agent logic. This aggregate view provides a more realistic simulation of user experience than subjective testing.
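Bucketing failures at scale can start as simple keyword matching over run logs. The sketch below assumes plain-text logs; the bucket names and keywords are hypothetical examples, not categories taken from the talk's tooling.

```python
from collections import Counter

# Hypothetical keyword rules; the first matching bucket wins.
BUCKET_RULES = {
    "file_editing": ("edit failed", "patch did not apply", "no such file"),
    "installation": ("command not found", "modulenotfounderror", "could not install"),
    "inference": ("rate limit", "context length", "model request timed out"),
}


def bucket_failure(log: str) -> str:
    """Assign one failure log to a broad bucket by case-insensitive keyword match."""
    text = log.lower()
    for bucket, needles in BUCKET_RULES.items():
        if any(needle in text for needle in needles):
            return bucket
    return "uncategorized"


def summarize(failure_logs) -> Counter:
    """Count failures per bucket so the largest problem areas stand out."""
    return Counter(bucket_failure(log) for log in failure_logs)


# Made-up log lines for illustration.
logs = [
    "Edit failed: patch did not apply to src/app.py",
    "bash: cargo: command not found",
    "ModuleNotFoundError: No module named 'requests'",
    "AssertionError: expected 3 rows, got 2",
]
print(summarize(logs).most_common())
```

A large "uncategorized" count is itself a signal that the rules need another bucket, which fits the iterative loop the section describes.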

Testing the model, harness, and problem set

Effective evaluation requires testing three components: the model itself, the agent's 'harness' (the scaffolding and logic surrounding the model), and the relevance of the problem set. A model might perform poorly not due to its inherent quality, but because the harness is not optimized for it, or because the problem set is ill-suited. This explains why the same model might perform differently across various agent frameworks. By iteratively making changes to parameters, timeouts, and reasoning behaviors, Khan's team saw their agent's scores improve, eventually outperforming competitors like Claude Code on specific evals for Opus 4.5.
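To separate the three components, results can be grouped by model and harness, with a separate check for tasks that no configuration solves. The sketch below is a minimal illustration; the record layout and all model, harness, and task names are hypothetical.

```python
from collections import defaultdict


def pass_rates(records):
    """Pass rate per (model, harness); records are (model, harness, task_id, passed)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [solved, attempted]
    for model, harness, _task_id, passed in records:
        cell = totals[(model, harness)]
        cell[0] += int(passed)
        cell[1] += 1
    return {key: solved / attempted for key, (solved, attempted) in totals.items()}


def never_solved(records):
    """Tasks that fail under every configuration, which may point at the problem set."""
    tasks = {task for _, _, task, _ in records}
    solved = {task for _, _, task, passed in records if passed}
    return sorted(tasks - solved)


# Made-up results: the same model under two harnesses.
RECORDS = [
    ("model-x", "harness-a", "task-1", True),
    ("model-x", "harness-a", "task-2", False),
    ("model-x", "harness-a", "task-3", False),
    ("model-x", "harness-a", "task-4", False),
    ("model-x", "harness-b", "task-1", True),
    ("model-x", "harness-b", "task-2", True),
    ("model-x", "harness-b", "task-3", True),
    ("model-x", "harness-b", "task-4", False),
]
rates = pass_rates(RECORDS)
print(rates, never_solved(RECORDS))
```

In this toy data the model is constant, so the gap between the two rates comes from the harness, while `task-4` fails everywhere and is worth inspecting as a possibly ill-suited problem.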

Zones of improvement and avoiding overfitting

Agent improvement through evals can be categorized into three zones. Zone 1 involves fixing obvious, fundamental flaws in the agent's basic functionality. Zone 2, the 'real hill climbing,' focuses on more nuanced improvements in prompt engineering, tool selection, and logic to enhance performance on complex tasks. This is where evals provide objective feedback for subjective improvements. Zone 3 is the 'danger zone' of overfitting, where developers focus solely on optimizing the metric without regard for solving the actual problem. This can lead to agents that perform well on the eval but fail in real-world scenarios. Developers must ensure they are genuinely improving the agent's capability, not just manipulating it to pass the test.
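One common guard against Zone 3 is to hold some tasks back from tuning. The talk does not prescribe this technique; the sketch below is an added illustration that assumes a held-out split, with hypothetical function names and task IDs.

```python
import random


def split_tasks(task_ids, holdout_fraction=0.3, seed=0):
    """Deterministically split tasks into (tuning set, held-out set)."""
    ids = sorted(task_ids)
    random.Random(seed).shuffle(ids)
    cut = max(1, int(len(ids) * holdout_fraction))
    return ids[cut:], ids[:cut]


def overfit_gap(results, tuning, heldout):
    """Tuning pass rate minus held-out pass rate; results maps task_id -> bool."""
    if not tuning or not heldout:
        raise ValueError("both splits must be non-empty")

    def rate(ids):
        return sum(results[t] for t in ids) / len(ids)

    return rate(tuning) - rate(heldout)


# Hypothetical task IDs; hill climb only on `tuning`, check `heldout` occasionally.
tasks = [f"task-{i}" for i in range(10)]
tuning, heldout = split_tasks(tasks)
print(len(tuning), len(heldout))
```

If the score keeps rising on the tuning set while the gap to the held-out set widens, the changes are probably fitting the metric rather than improving the agent.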

The ongoing value of evals

Ultimately, Khan advocates for finding or building benchmarks that work for your specific needs, 'hill climbing' (iteratively improving scores), and crucially, always performing a 'vibe check.' This means ensuring the agent not only achieves good scores but also behaves sensibly and solves the intended problems. The commitment to evals, even with their flaws, has been instrumental in uncovering the capabilities of sophisticated open-source models that might otherwise have been overlooked. By embracing this discipline, teams can continuously improve agent experiences and make informed decisions about model selection and development.

AI Eval Best Practices: Dos and Don'ts

Practical takeaways from this episode

Do This

Use heuristics to interpret other people's evals.
Stay current with AI model advancements, but wait for the dust to settle before adopting new models.
Prioritize evals that are specific to your problem or as close to it as possible.
Let agents run for extended periods (5-40 minutes) for complex tasks before evaluating.
Track metrics like turns, tool calls, tokens, and run time to understand cost vs. quality.
Containerize and isolate eval environments to prevent interference.
Identify broad buckets of failures (e.g., reading files, inference, installation issues).
Iteratively improve agents by fixing specific problems identified through evals.
Test the model, the agent harness, and the relevance of the problem itself.
Focus on fixing obvious flaws first (Zone 1).
Engage in nuanced judgment by giving agents real problems to solve (Zone 2).
Be cautious of optimizing solely for a metric (Zone 3 Danger Zone).
Find a benchmark that works for you and 'hill climb' on it.
Always pass the 'vibe check' – ensure the agent makes sense and solves the actual problem.
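The metrics named in the list above (turns, tool calls, tokens, and run time) can be recorded per run and rolled up to compare cost against quality. The sketch below is a minimal illustration; the field names and sample numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunMetrics:
    task_id: str
    passed: bool
    turns: int
    tool_calls: int
    tokens: int
    seconds: float


def summarize_runs(runs):
    """Aggregate per-run metrics into quality (pass rate) and cost figures."""
    solved = sum(r.passed for r in runs)
    total_tokens = sum(r.tokens for r in runs)
    return {
        "pass_rate": solved / len(runs),
        "mean_turns": mean(r.turns for r in runs),
        "mean_tool_calls": mean(r.tool_calls for r in runs),
        "mean_seconds": mean(r.seconds for r in runs),
        # Every token spent, including on failed runs, divided by tasks solved.
        "tokens_per_solved_task": total_tokens / solved if solved else float("inf"),
    }


# Made-up runs for illustration.
runs = [
    RunMetrics("task-1", True, turns=4, tool_calls=10, tokens=1000, seconds=300.0),
    RunMetrics("task-2", False, turns=6, tool_calls=20, tokens=3000, seconds=900.0),
]
summary = summarize_runs(runs)
print(summary)
```

Charging failed runs to the tokens-per-solved-task figure is a deliberate choice: it makes a configuration that burns tokens on failures look as expensive as it is.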

Avoid This

Don't blindly believe model lab evals or take them as absolute truth.
Don't be the earliest adopter; wait for consensus and stability.
Don't use generic, general-purpose evals if specific ones exist for your problem.
Don't use trivial evals (e.g., counting letters, cat toes) for complex agents.
Don't ignore the cost implications of using certain models or eval setups.
Don't assume that a successful agent setup is solely due to the model; your harness might be the factor.
Don't overfit to a metric; focus on solving the actual problem.
Don't solely rely on benchmarks; ensure the agent makes intuitive sense.

AI Model Performance Evolution (Conceptual)

Data extracted from this episode

Timeframe | Top Model (Example)
Couple months ago | Sonnet 4.6 / Opus
Now | Newer models surpass previous leaders

Common Questions

The two main camps are the 'objective metrics' camp, who believe benchmark scores are absolute truth, and the 'taste is king' camp, who dismiss numbers entirely and focus only on perceived 'vibes' or user experience.
