The Great Evals Debate — Ankur Goyal & Malte Ubl
Key Moments
Debate on AI coding agent evaluations: vibes vs. rigorous evals, feedback loops, and competitive advantage.
Key Insights
Evals are crucial feedback loops for AI products, with varying levels of effort and efficiency from 'vibes' to rigorous testing.
Coding is a prime use case for evals due to its verifiability, though nuances exist in complex tasks like refactoring.
Public benchmarks often serve marketing purposes, distinct from internal evals used for product iteration and improvement.
Offline evals are evolving from creating 'golden datasets' to reconciling production behavior with iterative testing.
Evals can be a competitive advantage, and their design should focus on the 'true north star' of the problem, not implementation details.
Product managers can leverage evals as a precise form of product management, especially in non-coding domains.
RL environments offer a powerful, programmable form of evaluation, though they require specialized expertise and may decouple from human-centric data.
THE SPECTRUM OF FEEDBACK LOOPS
The discussion highlights a spectrum of feedback mechanisms for AI products, ranging from informal "vibes" to structured evaluations (evals) and A/B tests. While pure vibes require minimal effort, rigorous evals demand significant investment. The key is to find an efficient balance, with the best teams intentionally investing in and questioning the effectiveness of each feedback loop. This is particularly relevant for AI coding agents, where development involves non-deterministic magic that necessitates a reliable loop to understand outcomes and ensure quality.
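The structured end of that spectrum can be as small as a scored dataset run. Below is a minimal, illustrative Python sketch of such a loop; the dataset, the task stub, and the scorer are assumptions for demonstration, not details from the episode.

```python
# Minimal sketch of a structured eval loop, in contrast to eyeballing "vibes".
# The dataset, task stub, and scorer are illustrative assumptions.
from statistics import mean

dataset = [
    {"input": "Reverse the string 'abc'.", "expected": "cba"},
    {"input": "Sum the numbers 1 through 10.", "expected": "55"},
]

def task(prompt: str) -> str:
    # Stand-in for the model or agent under test; replace with a real call.
    return "cba" if "Reverse" in prompt else "55"

def scorer(output: str, expected: str) -> float:
    # Crude substring check; real scorers range from exact match to LLM judges.
    return 1.0 if expected in output else 0.0

def run_eval() -> float:
    scores = [scorer(task(case["input"]), case["expected"]) for case in dataset]
    return mean(scores)

print(run_eval())  # 1.0 with the stand-in task above
```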
CODING AS A PRIME EVALUATION DOMAIN
Coding is identified as a use case with strong product-market fit for AI, largely because it offers a high degree of verifiability. Unlike creative writing or summarization, many aspects of code generation are objective and can be evaluated. However, complexities arise with tasks like refactoring codebases or assessing the visual quality of artifacts, introducing subtleties that challenge traditional evaluation methods. The ability to write and run evals is a core reason for the success seen in coding-related AI advancements.
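As a concrete illustration of that verifiability, a coding eval can simply execute the generated code against unit tests and read the exit code. The toy harness below is an assumption for illustration, not something described in the episode; note that refactoring quality or visual polish does not reduce to a pass/fail signal this cleanly.

```python
# Illustrative sketch of coding's verifiability: run generated code against
# unit tests and treat the exit code as a pass/fail signal.
import subprocess
import sys
import tempfile

def passes_tests(candidate_code: str, test_code: str) -> bool:
    # Write the candidate plus its tests to a temp file, then run it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
    return result.returncode == 0

candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(candidate, tests))  # True only if the generated code is correct
```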
THE PRIVILEGE OF BEING AN AI LAB
A key point is that agents built inside AI labs, such as Anthropic, benefit from an extreme privilege: direct access to internal teams that are actively developing evals and training models concurrently. Such close collaboration allows the agent and the model to be iterated on hand in hand. This contrasts with external developers, who must work with off-the-shelf models or open-source alternatives and face a different set of challenges in building effective feedback loops without direct access to model training infrastructure.

PUBLIC BENCHMARKS VS. INTERNAL EVALS
The conversation distinguishes public benchmarks from internal evals. Public benchmarks are useful for marketing and for communicating a technology's value, but they are generally not designed for product iteration; they serve a purpose orthogonal to evals as a feedback mechanism. Companies therefore develop their own internal benchmarks, which are proprietary and not directly comparable across organizations, but which they trust more for guiding product development and keeping it aligned with specific business goals.
EVOLVING OFFLINE EVALS AND PRODUCTION INSIGHTS
Offline evaluations are shifting from simply creating comprehensive 'golden datasets' to a more dynamic process. The best teams leverage offline evals to reconcile observed production behavior with iterative offline testing. This involves discovering interesting behaviors or failure modes from user logs, capturing and reproducing them offline, iterating, and then testing these changes alongside other scenarios. This approach allows for more confident iteration and faster deployment of improvements.
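One way to picture that reconciliation is a small capture-and-replay utility: failures spotted in production logs become offline cases that run alongside the rest of the suite on every future change. The sketch below assumes a hypothetical JSONL file and log schema; it is not a tool from the episode.

```python
# Illustrative sketch of capture-and-replay: pull an interesting case out of
# production logs, store it as an offline eval case, and replay it later.
import json
from pathlib import Path

EVAL_FILE = Path("offline_eval_cases.jsonl")

def capture_case(log_entry: dict) -> None:
    # Append an interesting production interaction to the offline dataset.
    case = {
        "input": log_entry["prompt"],
        "observed_output": log_entry["response"],
        "notes": log_entry.get("user_feedback", ""),
    }
    with EVAL_FILE.open("a") as f:
        f.write(json.dumps(case) + "\n")

def load_cases() -> list[dict]:
    # Replay captured cases, alongside hand-written ones, in offline testing.
    if not EVAL_FILE.exists():
        return []
    return [json.loads(line) for line in EVAL_FILE.read_text().splitlines() if line]
```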
EVALS AS A COMPETITIVE ADVANTAGE AND PRODUCT TOOL
Evals are increasingly recognized as a competitive advantage. Designing evals that represent the core problem, rather than brittle implementation details, ensures their long-term utility. Furthermore, product managers and designers are finding value in participating in the eval process, using it as a precise method to communicate product intuition and criteria, especially in non-coding domains like finance or healthcare. This participatory approach offers a more actionable alternative to traditional specification writing.
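In that spirit, a product manager's criteria can be written down directly as a rubric the eval harness scores, rather than living only in a spec document. The finance-flavored criteria and the placeholder judge below are hypothetical illustrations, not material from the episode; in practice the judgment step would be an LLM-as-judge call or human review.

```python
# Illustrative sketch of product criteria expressed as an eval rubric rather
# than a written spec. Criteria, domain, and judging stub are hypothetical.
RUBRIC = [
    "Quotes the user's actual account balance rather than a vague range.",
    "Never recommends a product the user is ineligible for.",
    "Keeps the answer under 120 words.",
]

def meets_criterion(output: str, criterion: str) -> bool:
    # Placeholder judgment: in practice this would be an LLM-as-judge call
    # or a human review step keyed to the criterion text.
    return bool(output)

def rubric_score(output: str) -> float:
    # Fraction of product-defined criteria the output satisfies.
    return sum(meets_criterion(output, c) for c in RUBRIC) / len(RUBRIC)

print(rubric_score("Your balance is $1,240.18."))  # 1.0 with the placeholder judge
```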
THE ROLE OF RANDOMIZED CONTROLLED TRIALS AND RL ENVIRONMENTS
The discussion touches on how feedback loops, including A/B testing and reinforcement learning (RL) environments, enable faster iteration. Rigorous evals can reveal regressions, while RL environments offer a structured, programmable way to train models by scoring their performance or decision-making within a defined system. Designing RL environments that avoid reward hacking requires specialized expertise, however, and adoption by typical companies is still emerging; part of the appeal is the potential to decouple the training signal from slow, expensive human data collection.
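One way to make the "programmable evaluation" idea concrete is a tiny environment with a reset/step interface and an explicit reward. The toy ticket-triage environment below is an illustrative assumption, not an environment discussed in the episode; the reward comment marks where hacking risk enters.

```python
# Minimal sketch of an RL-environment-style eval: the environment defines the
# observations, the allowed actions, and a programmable reward.
from dataclasses import dataclass

@dataclass
class TicketTriageEnv:
    # Toy environment: an agent labels support tickets; reward is correctness.
    tickets: list  # list of (ticket_text, correct_label) pairs
    index: int = 0
    total_reward: float = 0.0

    def reset(self) -> str:
        self.index, self.total_reward = 0, 0.0
        return self.tickets[0][0]

    def step(self, action: str):
        _, correct = self.tickets[self.index]
        # Reward only the true objective; rewarding proxies (response length,
        # politeness, etc.) is how reward hacking creeps in.
        reward = 1.0 if action == correct else 0.0
        self.total_reward += reward
        self.index += 1
        done = self.index >= len(self.tickets)
        next_obs = None if done else self.tickets[self.index][0]
        return next_obs, reward, done

env = TicketTriageEnv([("App crashes on login", "bug"), ("How do I export data?", "question")])
obs = env.reset()
while obs is not None:
    obs, reward, done = env.step("bug")  # a trivial fixed policy for illustration
print(env.total_reward)  # 1.0: the fixed policy only gets the first ticket right
```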
DECOUPLING EVAL CREATION FROM MODEL DEVELOPMENT
A forward-looking perspective suggests a future where the creation of evals is decoupled from the entity building the AI models. Businesses with specific goals could incentivize model labs to develop agents that excel in their domains by providing tailored evals. This shifts the focus to who writes the evals and for what purpose, and could lead to a marketplace for specialized evaluations that aligns AI development more precisely with diverse business needs. The development of Next.js-specific evals is cited as an example.
Common Questions
The main discussion revolves around whether teams building coding agents rely on rigorous 'evals' (evaluations or metrics) to gauge performance. Some believe many teams lack standardized or robust evals and rely instead on 'vibes' or intuition, even in high-value businesses.
Mentioned in this video
An AI model mentioned as being 'okay' but not 'really great' for specific healthcare tasks.
A company working on upgrading senior dev evals.
A standard benchmark mentioned in the context of marketing and communicating technological value.
Latent Space — the podcast hosting the discussion.
A company mentioned in relation to AI products and agent development.
Ankur Goyal and Malte Ubl — the guests debating the role of evals.
A tool or product mentioned for its user interface and complaint handling.