Key Moments
AI Dev 26 x SF | Ara Khan: Evals Are Broken Use Them Anyway
AI evals are broken, but using them anyway with practical heuristics is better than relying on 'vibes'. The key is to interpret them judiciously, stay current without being an early adopter, and focus on problem-specific benchmarks.
Key Insights
Most current AI evals are compromised: either they focus on easily gamed objective scores ('benchmark maxing') or purely subjective 'vibes' (like how an AI sounds).
Objective benchmark scores from model labs are only 'close approximations' and shouldn't be taken as definitive proof of a model's superiority.
When evaluating models, stay current, but wait a couple of weeks after a new model release for the 'dust to settle' before adopting it.
Instead of generic evals, focus on problem-specific benchmarks. For coding agents, a benchmark like Stanford's 'Terminal Bench', with 89 real-world software engineering tasks, is more applicable than older benchmarks like SWE-bench.
When developing your own evals, containerize environments using tools like Harbor for isolation and parallelization, and focus on tracking metrics like turns, tool calls, tokens, and run time.
Improving an agent involves three zones: fixing obvious flaws, fine-tuning prompts and tool usage for philosophical alignment, and avoiding the danger zone of simply optimizing for the metric (overfitting).
AI evals are flawed but essential tools
Ara Khan argues that the current state of AI evaluations is fundamentally broken, yet indispensable for agent development. He identifies two main camps of flawed thinking: the 'objective metrics camp,' which blindly trusts benchmark scores that are easily gamed, and the 'taste is king' camp, which dismisses numbers entirely in favor of subjective 'vibes.' Khan asserts that the truth lies in a balanced approach, where evals are used strategically rather than being treated as absolute truths. The goal is to build, interpret, and utilize evals effectively within agent workflows, whether for simple or complex applications. He emphasizes that while evals aren't the 'end all, be all,' they offer critical insights that surpass pure intuition.
Interpreting external evals with caution
Khan advises against taking model lab evaluations at face value. While these benchmarks provide useful approximations, they can be misleading. He cites an example where models with very close scores might not actually perform equally well in practice. The prevalent practice of 'benchmark maxing,' where labs focus on achieving high scores rather than genuine model quality, further erodes trust. A key heuristic is to 'stay current but don't be the earliest adopter.' Given the rapid evolution of AI models (in AI, a couple of months can feel like years), waiting a few weeks for the hype to die down allows for more stable and practical assessment. Prioritize evals specific to your problem domain; generic benchmarks may not reflect the realities of your specific application.
Leveraging evals to improve agent performance
Improving AI agents, especially complex ones like coding agents, is challenging because their responses vary widely and their solution spaces are effectively infinite. Traditional evals often focused on trivial tasks like Fibonacci sequences, which are irrelevant to real-world applications. Khan highlights the development of 'Terminal Bench' by Stanford as a significant advancement. This benchmark features 89 problems directly relevant to actual software engineering tasks, such as database issues, race conditions, and front-end bugs. These agentic evals let the agent run for an extended period (5-45 minutes) to complete a task, then apply deterministic unit tests to evaluate success. This approach is crucial for understanding how agents perform in multi-step, complex scenarios.
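The "let the agent run, then apply deterministic tests" pattern described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: `Task`, `Workspace`, `score_task`, and `pass_rate` are hypothetical names, the workspace is modeled as an in-memory dictionary rather than a container, and timeouts are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a task's files: path -> contents.
Workspace = Dict[str, str]


@dataclass
class Task:
    name: str
    initial_files: Workspace
    # Deterministic checks applied only after the agent has stopped.
    checks: List[Callable[[Workspace], bool]]


def score_task(agent: Callable[[Workspace], None], task: Task) -> bool:
    """Run the agent on a fresh copy of the task, then apply every check."""
    workspace = dict(task.initial_files)  # copy so runs cannot interfere
    try:
        agent(workspace)
    except Exception:
        return False  # a crashed run counts as a failure
    return all(check(workspace) for check in task.checks)


def pass_rate(agent: Callable[[Workspace], None], tasks: List[Task]) -> float:
    """Fraction of tasks whose checks all pass (0.0 for an empty task list)."""
    results = [score_task(agent, task) for task in tasks]
    return sum(results) / len(results) if results else 0.0
```

The point of the design is that the agent's path to a solution is unconstrained, while the verdict depends only on the final state of the workspace.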
Building and running effective agentic evals
When building custom evals, it's vital to containerize environments to ensure isolation and prevent interference between tasks. Tools like Harbor and Modal facilitate parallelized, containerized evaluations. The evaluation process involves running an agent on specific tasks, analyzing its performance, and identifying failure points. For example, if an agent repeatedly fails at editing files or encounters installation issues, these patterns become apparent at scale. Analyzing these failures helps bucket problems into broad categories (e.g., file editing, inference issues) which then allows for iterative improvement of specific tools or agent logic. This aggregate view provides a more realistic simulation of user experience than subjective testing.
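As a rough sketch of the aggregate view described above, the following groups failed runs into broad buckets and averages the per-run metrics mentioned in the talk (turns, tool calls, tokens, run time). The `RunRecord` fields, the `BUCKET_RULES` keywords, and the keyword-matching approach are all assumptions made for illustration; in practice the categories would come from reading real traces, and none of this reflects the Harbor or Modal APIs.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional

# Hypothetical keyword -> bucket rules, checked in insertion order.
BUCKET_RULES = {
    "edit": "file_editing",
    "install": "installation",
    "timeout": "timeout",
}


@dataclass
class RunRecord:
    task: str
    passed: bool
    turns: int
    tool_calls: int
    tokens: int
    run_time_s: float
    error: Optional[str] = None  # short failure note taken from the trace


def bucket_for(error: Optional[str]) -> str:
    """Map a failure note to a broad category; unmatched notes go to 'other'."""
    text = (error or "").lower()
    for keyword, bucket in BUCKET_RULES.items():
        if keyword in text:
            return bucket
    return "other"


def summarize(records: List[RunRecord]) -> Dict[str, object]:
    """Aggregate pass rate, mean metrics, and failure counts per bucket."""
    if not records:
        raise ValueError("no runs to summarize")
    failures = [r for r in records if not r.passed]
    return {
        "pass_rate": sum(r.passed for r in records) / len(records),
        "mean_turns": mean(r.turns for r in records),
        "mean_tool_calls": mean(r.tool_calls for r in records),
        "mean_tokens": mean(r.tokens for r in records),
        "mean_run_time_s": mean(r.run_time_s for r in records),
        "failure_buckets": Counter(bucket_for(r.error) for r in failures),
    }
```

A bucket that dominates `failure_buckets` across many tasks points at a specific tool or piece of agent logic to fix first.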
Testing the model, harness, and problem set
Effective evaluation requires testing three components: the model itself, the agent's 'harness' (the scaffolding and logic surrounding the model), and the relevance of the problem set. A model might perform poorly not due to its inherent quality, but because the harness is not optimized for it, or because the problem set is ill-suited. This explains why the same model might perform differently across various agent frameworks. By iteratively making changes to parameters, timeouts, and reasoning behaviors, Khan's team saw their agent's scores improve, eventually outperforming competitors like Claude Code on specific evals for Opus 4.5.
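One way to separate model effects from harness effects is to score every model and harness pairing on the same problem set and look at how much each model's score moves between harnesses. The sketch below assumes a caller-supplied `run_eval(model, harness)` function that returns a pass rate; that function and the helper names are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Grid = Dict[Tuple[str, str], float]


def score_grid(
    models: List[str],
    harnesses: List[str],
    run_eval: Callable[[str, str], float],
) -> Grid:
    """Score every (model, harness) pair on the same problem set."""
    return {(m, h): run_eval(m, h) for m, h in product(models, harnesses)}


def harness_spread(grid: Grid) -> Dict[str, float]:
    """Per model: best score minus worst score across harnesses."""
    by_model: Dict[str, List[float]] = {}
    for (model, _harness), score in grid.items():
        by_model.setdefault(model, []).append(score)
    return {model: max(s) - min(s) for model, s in by_model.items()}


def best_harness(grid: Grid, model: str) -> str:
    """Harness with the highest score for the given model."""
    candidates = {h: s for (m, h), s in grid.items() if m == model}
    return max(candidates, key=candidates.get)
```

A large spread for a model suggests its score reflects the harness as much as the model itself, which is the situation the paragraph above describes.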
Zones of improvement and avoiding overfitting
Agent improvement through evals can be categorized into three zones. Zone 1 involves fixing obvious, fundamental flaws in the agent's basic functionality. Zone 2, the 'real hill climbing,' focuses on more nuanced improvements in prompt engineering, tool selection, and logic to enhance performance on complex tasks. This is where evals provide objective feedback for subjective improvements. Zone 3 is the 'danger zone' of overfitting, where developers focus solely on optimizing the metric without regard for solving the actual problem. This can lead to agents that perform well on the eval but fail in real-world scenarios. Developers must ensure they are genuinely improving the agent's capability, not just manipulating it to pass the test.
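The talk's safeguard against Zone 3 is a manual 'vibe check'. A complementary, more mechanical check, which is an assumption of this sketch rather than something the talk describes, is to keep a held-out set of tasks that is never used for tuning and to compare gains on both sets. All names and the `min_gain` threshold below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Change:
    name: str
    tuned_before: float    # pass rate on the tasks used for hill climbing
    tuned_after: float
    heldout_before: float  # pass rate on tasks never used for tuning
    heldout_after: float


def classify(change: Change, min_gain: float = 0.01) -> str:
    """Label a change by whether its gain carries over to held-out tasks."""
    tuned_gain = change.tuned_after - change.tuned_before
    heldout_gain = change.heldout_after - change.heldout_before
    if tuned_gain < min_gain:
        return "no_gain"
    if heldout_gain >= min_gain:
        return "generalizes"
    # The tuned score rose but the held-out score did not: possible overfitting.
    return "suspect_overfit"
```

A "suspect_overfit" label is only a prompt to inspect the change by hand; with small task sets, run-to-run variance alone can produce it.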
The ongoing value of evals
Ultimately, Khan advocates for finding or building benchmarks that work for your specific needs, 'hill climbing' (iteratively improving scores), and crucially, always performing a 'vibe check.' This means ensuring the agent not only achieves good scores but also behaves sensibly and solves the intended problems. The commitment to evals, even with their flaws, has been instrumental in uncovering the capabilities of sophisticated open-source models that might otherwise have been overlooked. By embracing this discipline, teams can continuously improve agent experiences and make informed decisions about model selection and development.
AI Model Performance Evolution (Conceptual)
Data extracted from this episode
| Timeframe | Top Model (Example) |
|---|---|
| Couple months ago | Sonnet 4.6 / Opus |
| Now | Newer models surpass previous leaders |
Common Questions
The two main camps are the 'objective metrics' camp, who believe benchmark scores are absolute truth, and the 'taste is king' camp, who dismiss numbers entirely and focus only on perceived 'vibes' or user experience.
Mentioned in this video
A model mentioned with a score of 4.6, highlighting how close scores between models can be misleading.
AI model that some users prefer due to its perceived 'vibes' or pleasant conversational style.
An AI coding assistant that was mentioned in the context of evaluating models.
An AI coding assistant mentioned alongside Cursor and within the context of evals.
An infrastructure layer used to build parallelized, containerized environments for running eval tasks.
An AI coding model that the speaker uses as an example of an agent that can get stuck in loops or encounter installation issues when evaluated.
An AI model mentioned in the context of beating Claude Code in evals after iterative improvements.
Came out with a new model that was a disappointment due to benchmark maxing.
Mentioned for stating that their benchmarks have become saturated, indicating a need for new evaluation methods.
Mentioned as having models that perform better in certain coding environments like Claude Code compared to others.
An AI model mentioned as a cost-effective alternative to more expensive frontier models, costing significantly less.
A tool used in conjunction with Terminal Bench that allows for isolated, containerized environments to run evals in parallel.