The Great Evals Debate — Ankur Goyal & Malte Ubl
Key Moments
Debate on AI coding agent evaluations: vibes vs. rigorous evals, feedback loops, and competitive advantage.
Key Insights
Evals are crucial feedback loops for AI products, with varying levels of effort and efficiency from 'vibes' to rigorous testing.
Coding is a prime use case for evals due to its verifiability, though nuances exist in complex tasks like refactoring.
Public benchmarks often serve marketing purposes, distinct from internal evals used for product iteration and improvement.
Offline evals are evolving from creating 'golden datasets' to reconciling production behavior with iterative testing.
Evals can be a competitive advantage, and their design should focus on the 'true north star' of the problem, not implementation details.
Product managers can leverage evals as a precise form of product management, especially in non-coding domains.
RL environments offer a powerful, programmable form of evaluation, though they require specialized expertise and may decouple from human-centric data.
THE SPECTRUM OF FEEDBACK LOOPS
The discussion highlights a spectrum of feedback mechanisms for AI products, ranging from informal "vibes" to structured evaluations (evals) and A/B tests. While pure vibes require minimal effort, rigorous evals demand significant investment. The key is to find an efficient balance, with the best teams intentionally investing in and questioning the effectiveness of each feedback loop. This is particularly relevant for AI coding agents, where development involves non-deterministic magic that necessitates a reliable loop to understand outcomes and ensure quality.
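The structured end of that spectrum can be as small as a scored dataset run. Below is a minimal, illustrative Python sketch of such a loop; the dataset, the task stub, and the scorer are assumptions for demonstration, not details from the episode.

```python
# Minimal sketch of a structured eval loop, in contrast to eyeballing "vibes".
# The dataset, task stub, and scorer are illustrative assumptions.
from statistics import mean

dataset = [
    {"input": "Reverse the string 'abc'.", "expected": "cba"},
    {"input": "Sum the numbers 1 through 10.", "expected": "55"},
]

def task(prompt: str) -> str:
    # Stand-in for the model or agent under test; replace with a real call.
    return "cba" if "Reverse" in prompt else "55"

def scorer(output: str, expected: str) -> float:
    # Crude substring check; real scorers range from exact match to LLM judges.
    return 1.0 if expected in output else 0.0

def run_eval() -> float:
    scores = [scorer(task(case["input"]), case["expected"]) for case in dataset]
    return mean(scores)

print(run_eval())  # 1.0 with the stand-in task above
```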
CODING AS A PRIME EVALUATION DOMAIN
Coding is identified as a use case with strong product-market fit for AI, largely because it offers a high degree of verifiability. Unlike creative writing or summarization, many aspects of code generation are objective and can be evaluated. However, complexities arise with tasks like refactoring codebases or assessing the visual quality of artifacts, introducing subtleties that challenge traditional evaluation methods. The ability to write and run evals is a core reason for the success seen in coding-related AI advancements.
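As a concrete illustration of that verifiability, a coding eval can simply execute the generated code against unit tests and read the exit code. The toy harness below is an assumption for illustration, not something described in the episode; note that refactoring quality or visual polish does not reduce to a pass/fail signal this cleanly.

```python
# Illustrative sketch of coding's verifiability: run generated code against
# unit tests and treat the exit code as a pass/fail signal.
import subprocess
import sys
import tempfile

def passes_tests(candidate_code: str, test_code: str) -> bool:
    # Write the candidate plus its tests to a temp file, then run it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
    return result.returncode == 0

candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(candidate, tests))  # True only if the generated code is correct
```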
THE PRIVILEGE OF BEING AN AI LAB
A key point is that agents built inside AI labs, such as Anthropic, benefit from an extreme privilege: direct access to internal teams that are actively developing evals and training models concurrently. Such close collaboration allows the agent and the model to be iterated on hand in hand. This contrasts with external developers, who must work with off-the-shelf models or open-source alternatives and face a different set of challenges in building effective feedback loops without direct access to model training infrastructure.

PUBLIC BENCHMARKS VS. INTERNAL EVALS
The conversation distinguishes public benchmarks from internal evals. Public benchmarks are useful for marketing and for communicating a technology's value, but they are generally not designed for product iteration; they serve a purpose orthogonal to evals as a feedback mechanism. Companies therefore develop their own internal benchmarks, which are proprietary and not directly comparable across organizations, but which they trust more for guiding product development and keeping it aligned with specific business goals.
EVOLVING OFFLINE EVALS AND PRODUCTION INSIGHTS
Offline evaluations are shifting from simply creating comprehensive 'golden datasets' to a more dynamic process. The best teams leverage offline evals to reconcile observed production behavior with iterative offline testing. This involves discovering interesting behaviors or failure modes from user logs, capturing and reproducing them offline, iterating, and then testing these changes alongside other scenarios. This approach allows for more confident iteration and faster deployment of improvements.
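One way to picture that reconciliation is a small capture-and-replay utility: failures spotted in production logs become offline cases that run alongside the rest of the suite on every future change. The sketch below assumes a hypothetical JSONL file and log schema; it is not a tool from the episode.

```python
# Illustrative sketch of capture-and-replay: pull an interesting case out of
# production logs, store it as an offline eval case, and replay it later.
import json
from pathlib import Path

EVAL_FILE = Path("offline_eval_cases.jsonl")

def capture_case(log_entry: dict) -> None:
    # Append an interesting production interaction to the offline dataset.
    case = {
        "input": log_entry["prompt"],
        "observed_output": log_entry["response"],
        "notes": log_entry.get("user_feedback", ""),
    }
    with EVAL_FILE.open("a") as f:
        f.write(json.dumps(case) + "\n")

def load_cases() -> list[dict]:
    # Replay captured cases, alongside hand-written ones, in offline testing.
    if not EVAL_FILE.exists():
        return []
    return [json.loads(line) for line in EVAL_FILE.read_text().splitlines() if line]
```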
EVALS AS A COMPETITIVE ADVANTAGE AND PRODUCT TOOL
Evals are increasingly recognized as a competitive advantage. Designing evals that represent the core problem, rather than brittle implementation details, ensures their long-term utility. Furthermore, product managers and designers are finding value in participating in the eval process, using it as a precise method to communicate product intuition and criteria, especially in non-coding domains like finance or healthcare. This participatory approach offers a more actionable alternative to traditional specification writing.
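In that spirit, a product manager's criteria can be written down directly as a rubric the eval harness scores, rather than living only in a spec document. The finance-flavored criteria and the placeholder judge below are hypothetical illustrations, not material from the episode; in practice the judgment step would be an LLM-as-judge call or human review.

```python
# Illustrative sketch of product criteria expressed as an eval rubric rather
# than a written spec. Criteria, domain, and judging stub are hypothetical.
RUBRIC = [
    "Quotes the user's actual account balance rather than a vague range.",
    "Never recommends a product the user is ineligible for.",
    "Keeps the answer under 120 words.",
]

def meets_criterion(output: str, criterion: str) -> bool:
    # Placeholder judgment: in practice this would be an LLM-as-judge call
    # or a human review step keyed to the criterion text.
    return bool(output)

def rubric_score(output: str) -> float:
    # Fraction of product-defined criteria the output satisfies.
    return sum(meets_criterion(output, c) for c in RUBRIC) / len(RUBRIC)

print(rubric_score("Your balance is $1,240.18."))  # 1.0 with the placeholder judge
```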
THE ROLE OF RANDOMIZED CONTROLLED TRIALS AND RL ENVIRONMENTS
The discussion touches on how feedback loops, including A/B testing and reinforcement learning (RL) environments, enable faster iteration. Rigorous evals can reveal regressions, while RL environments offer a structured, programmable way to train models by scoring their performance or decision-making within a defined system. Designing RL environments that avoid reward hacking requires specialized expertise, however, and adoption by typical companies is still emerging; part of the appeal is the potential to decouple the training signal from slow, expensive human data collection.
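One way to make the "programmable evaluation" idea concrete is a tiny environment with a reset/step interface and an explicit reward. The toy ticket-triage environment below is an illustrative assumption, not an environment discussed in the episode; the reward comment marks where hacking risk enters.

```python
# Minimal sketch of an RL-environment-style eval: the environment defines the
# observations, the allowed actions, and a programmable reward.
from dataclasses import dataclass

@dataclass
class TicketTriageEnv:
    # Toy environment: an agent labels support tickets; reward is correctness.
    tickets: list  # list of (ticket_text, correct_label) pairs
    index: int = 0
    total_reward: float = 0.0

    def reset(self) -> str:
        self.index, self.total_reward = 0, 0.0
        return self.tickets[0][0]

    def step(self, action: str):
        _, correct = self.tickets[self.index]
        # Reward only the true objective; rewarding proxies (response length,
        # politeness, etc.) is how reward hacking creeps in.
        reward = 1.0 if action == correct else 0.0
        self.total_reward += reward
        self.index += 1
        done = self.index >= len(self.tickets)
        next_obs = None if done else self.tickets[self.index][0]
        return next_obs, reward, done

env = TicketTriageEnv([("App crashes on login", "bug"), ("How do I export data?", "question")])
obs = env.reset()
while obs is not None:
    obs, reward, done = env.step("bug")  # a trivial fixed policy for illustration
print(env.total_reward)  # 1.0: the fixed policy only gets the first ticket right
```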
DECOUPLING EVAL CREATION FROM MODEL DEVELOPMENT
A forward-looking perspective suggests a future where the creation of evals is decoupled from the entity building the AI models. Businesses with specific goals could incentivize model labs to develop agents that excel in their domains by providing tailored evals. This shifts the focus to who writes the evals and for what purpose, and could lead to a marketplace for specialized evaluations that aligns AI development more precisely with diverse business needs. The development of Next.js-specific evals is cited as an example.
Common Questions
The main discussion revolves around whether teams building coding agents rely on rigorous 'evals' (evaluations or metrics) to gauge performance. Some believe many teams lack standardized or robust evals and rely instead on 'vibes' or intuition, even in high-value businesses.
Mentioned in this video
An AI model mentioned as being 'okay' but not 'really great' for specific healthcare tasks.
A company working on upgrading senior dev evals.
A standard benchmark mentioned in the context of marketing and communicating technological value.
Latent Space — the podcast hosting the discussion.
A company mentioned in relation to AI products and agent development.
Ankur Goyal and Malte Ubl — the guests debating the role of evals.
A tool or product mentioned for its user interface and complaint handling.