The Great Evals Debate — Ankur Goyal & Malte Ubl

Latent Space PodcastLatent Space Podcast
People & Blogs4 min read35 min video
Dec 7, 2025|2,343 views|42|5
Save to Pod

Key Moments

TL;DR

Debate on AI coding agent evaluations: vibes vs. rigorous evals, feedback loops, and competitive advantage.

Key Insights

1

Evals are crucial feedback loops for AI products, with varying levels of effort and efficiency from 'vibes' to rigorous testing.

2

Coding is a prime use case for evals due to its verifiability, though nuances exist in complex tasks like refactoring.

3

Public benchmarks often serve marketing purposes, distinct from internal evals used for product iteration and improvement.

4

Offline evals are evolving from creating 'golden datasets' to reconciling production behavior with iterative testing.

5

Evals can be a competitive advantage, and their design should focus on the 'true north star' of the problem, not implementation details.

6

Product managers can leverage evals as a precise form of product management, especially in non-coding domains.

7

RL environments offer a powerful, programmable form of evaluation, though they require specialized expertise and may decouple from human-centric data.

THE SPECTRUM OF FEEDBACK LOOPS

The discussion highlights a spectrum of feedback mechanisms for AI products, ranging from informal "vibes" to structured evaluations (evals) and A/B tests. While pure vibes require minimal effort, rigorous evals demand significant investment. The key is to find an efficient balance, with the best teams intentionally investing in and questioning the effectiveness of each feedback loop. This is particularly relevant for AI coding agents, where development involves non-deterministic magic that necessitates a reliable loop to understand outcomes and ensure quality.

CODING AS A PRIME EVALUATION DOMAIN

Coding is identified as a use case with strong product-market fit for AI, largely because it offers a high degree of verifiability. Unlike creative writing or summarization, many aspects of code generation are objective and can be evaluated. However, complexities arise with tasks like refactoring codebases or assessing the visual quality of artifacts, introducing subtleties that challenge traditional evaluation methods. The ability to write and run evals is a core reason for the success seen in coding-related AI advancements.

THE PRIVILEGE OF BEING AN AI LAB

A key point is that companies deeply embedded within AI labs, like Anthropic, may benefit from extreme privilege. This includes having access to internal teams actively developing evals and training models concurrently. Such close collaboration allows for rapid, hand-in-hand iteration between the agent and the model. This contrasts with external developers who must work with off-the-shelf models or open-source alternatives, facing a different set of challenges in building effective feedback loops without direct access to model training infrastructure.

PUBLIC BENCHMARKS VS. INTERNAL EVALS

The conversation distinguishes between public benchmarks and internal evals. Public benchmarks, while useful for marketing and communicating technological value, are often not designed for product iteration. They serve orthogonal purposes to the practice of using evals as a feedback mechanism. Companies develop their own internal benchmarks, which are proprietary and not comparable, but are trusted more for guiding product development and ensuring alignment with specific business goals.

EVOLVING OFFLINE EVALS AND PRODUCTION INSIGHTS

Offline evaluations are shifting from simply creating comprehensive 'golden datasets' to a more dynamic process. The best teams leverage offline evals to reconcile observed production behavior with iterative offline testing. This involves discovering interesting behaviors or failure modes from user logs, capturing and reproducing them offline, iterating, and then testing these changes alongside other scenarios. This approach allows for more confident iteration and faster deployment of improvements.

EVALS AS A COMPETITIVE ADVANTAGE AND PRODUCT TOOL

Evals are increasingly recognized as a competitive advantage. Designing evals that represent the core problem, rather than brittle implementation details, ensures their long-term utility. Furthermore, product managers and designers are finding value in participating in the eval process, using it as a precise method to communicate product intuition and criteria, especially in non-coding domains like finance or healthcare. This participatory approach offers a more actionable alternative to traditional specification writing.

THE ROLE OF RANDOMIZED CONTROLLED TRIALS AND RL ENVIRONMENTS

The discussion touches on how feedback loops, including A/B testing and Reinforcement Learning (RL) environments, enable faster iteration. While rigorous evals can reveal regressions, RL environments offer a structured way to train models by evaluating their performance or decision-making within a defined system. However, designing effective RL environments to avoid reward hacking requires specialized expertise, and their adoption by average companies is still emerging, with potential to be decoupled from slow, expensive human data collection.

DECOUPLING EVAL CREATION FROM MODEL DEVELOPMENT

A forward-looking perspective suggests a future where the creation of evals can be decoupled from the entity building the AI models. Businesses with specific goals could incentivize model labs to develop agents excelling in their domains by providing tailored evals. This shifts the focus to who writes the evals and for what purpose, potentially leading to a marketplace for specialized evaluations, enabling more precise AI development aligned with diverse business needs, as exemplified by the development of Next.js specific evals.

Evals Best Practices

Practical takeaways from this episode

Do This

Invest deliberately in the efficiency of all feedback loops (evals, A/B tests, vibes).
Treat offline evals as a mechanism for reconciling production behavior with iteration, not just creating golden datasets.
Construct evals to represent the core problem you're solving, avoiding brittleness from implementation details.
Leverage evals as a competitive advantage by developing robust evaluation strategies.
Involve product managers and designers in the eval process for better product intuition and communication.
Use examples and scoring functions (or LLM as a judge) to communicate product intuition effectively.
Capture production data and upgrade failure modes into measurable eval criteria.
Consider using trivial evals (e.g., does it compile, does it start) as valuable components for RL pipelines.
Decouple from human-centric data by creating controllable, repeatable RL environments.
Track incentives to understand who creates evals and for what purpose (e.g., improving own models vs. others').

Avoid This

Rely solely on one type of feedback loop (vibes, A/B tests, or evals).
Get bogged down in manufacturing golden datasets for offline evals.
Hardcode evals to narrow implementation details, making them brittle.
Publish public benchmarks solely for marketing without considering their value for product iteration.
Underestimate the power of evals for creating competitive advantages.
Treat evals purely as a way to generate numbers; focus on communicating product intuition.
Over-rely on human data collection when scalable RL environments can be leveraged.
Ignore the potential for reward hacking in RL environments.
Assume the creator of an eval will always be the one building the model or agent.

Common Questions

The main discussion revolves around whether coding agents effectively use 'evals' (evaluations or metrics) to gauge performance. Some believe many agents lack standardized or robust evals, relying instead on 'vibes' or intuition, even in high-value businesses.

Topics

Mentioned in this video

More from Latent Space

View all 63 summaries

Found this useful? Build your knowledge library

Get AI-powered summaries of any YouTube video, podcast, or article in seconds. Save them to your personal pods and access them anytime.

Try Summify free