Why is context engineering crucial for AI code review at scale?

Code review requires understanding the integration of code across the entire codebase and the business context. Extensive context enrichment is necessary for an AI to perform this task effectively, ensuring proposed changes don't negatively impact the business.

Why is cloning the repository important for AI code review?

Cloning the repository ensures the AI has access to the 'source of truth,' which is essential in dynamic environments like code generation or review. This approach reduces reliance on potentially outdated cached information (like RAG) and anchors the AI's process.

How should software engineers and ML researchers collaborate on AI applications?

Collaboration is crucial as software engineers bring expertise in testing fundamentals (like the testing pyramid), while ML researchers understand probabilistic systems. Combining these skills is vital for writing effective evaluations (evals) and avoiding business or edge case oversights.

What are the differences between offline, shadow, and online evaluations for AI systems?

Offline evaluation is done internally during development (e.g., by an engineer checking slides). Shadow evaluation involves testing with a small group (like colleagues), while online evaluation is live testing in production, collecting real user feedback and metrics.

Why is building custom benchmarks (evals) more important than public benchmarks for AI products?

Public benchmarks like SWE-bench are useful for initial signals but don't fully represent a specific product's use cases. Building custom evals tailored to your application's specific needs and failure modes is critical for trustworthy evaluation.

How can companies stay on the forefront of AI model advancements?

By building robust evaluation systems (evals), companies can quickly test and integrate new models as they are released by providers like OpenAI, DeepMind, and Anthropic. This agility allows them to maintain a competitive advantage.

What are common production failure modes in AI agentic applications?

Common failures include using too little context (missing issues), too much context (creating noise), and a lack of instrumentation or online evals, which hides regressions and makes debugging difficult.

Key Moments

AI Dev 26 x SF | Erik Thorelli: Deploying AI Code Review at Scale

DeepLearning.AI

Education | 5 min read | 31 min video

May 21, 2026|351 views|3

TL;DR

AI code generation has exploded, but AI-generated code shows roughly 40% more critical bugs and 70% more general bugs than human-written code, which makes rigorous code review and evaluation essential.

Key Insights

AI-generated code can have a 40% increase in critical bugs and a 70% increase in general bugs compared to human-written code.

Evals (evaluation systems) are crucial for AI applications, especially for managing context engineering and ensuring model performance.

Testing agentic applications requires a blend of software engineering and ML expertise, focusing on probabilistic tests and business implications.

Switching LLM models without re-evaluating prompts and context can lead to application failure due to model divergence.

Online evals in production, looking at metrics like acceptance rate and latency, are critical for understanding real-world performance and catching regressions.

Routing AI models based on trusted evaluations is the architectural conclusion for scaling complex agentic systems with multiple models and contexts.

The AI-driven code generation revolution and its hidden costs

The software development landscape has been dramatically reshaped by AI code generation tools, significantly boosting developer velocity. What once took days can now be accomplished in minutes through agentic loops. However, this rapid code generation has shifted the primary bottleneck from writing code to reviewing it. Despite advances, AI-generated code exhibits a concerning trend of increased bugs, with studies showing approximately 40% more critical bugs and 70% more general bugs compared to exclusively human-written code. This surge in AI-generated code, potentially exceeding a billion lines daily, coupled with increased bugs, makes robust code review more critical than ever. The cost of downtime, cited as $5 million per hour, underscores the business imperative to address these quality issues.

The necessity of extensive context for effective AI code review

Unlike straightforward code generation tasks, AI-driven code review demands a deep and broad understanding of context, mirroring the capabilities of experienced senior engineers. A good AI code reviewer must integrate context not only from the codebase but also from business objectives and past issues. For instance, a code change might be technically correct but could introduce performance bottlenecks or, more critically, undermine business viability by making a product free for all users. This highlights the need for 'context enrichment' to ensure that AI-generated suggestions are not just correct but also strategically sound and aligned with broader goals. Relying solely on a model's latent knowledge or limited prompts is insufficient for such nuanced tasks.

Anchoring AI with deterministic context and source of truth

AI systems are inherently probabilistic, and in dynamic environments like code review the way to contain that is to ground the AI in a 'source of truth', or deterministic context. This reduces reliance on one-shot prompting, which can be unpredictable. By cloning the repository and providing direct, reliable context, AI agents can be anchored to specific concerns, improving accuracy and reducing errors. This contrasts with relying solely on Retrieval-Augmented Generation (RAG), which behaves like a cache that can go stale in a rapidly changing codebase. Context engineering therefore becomes the core differentiator for high-performing AI applications: the work shifts from simply invoking models to carefully curating the information they receive.
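The cache analogy can be illustrated with a small sketch. It assumes a hypothetical `ContextIndex` record that remembers which commit a retrieval index was built from; the class, field names, and commit hashes are invented for illustration and are not taken from the talk or from any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextIndex:
    """Hypothetical record of a retrieval index and the commit it was built from."""
    repo: str
    indexed_commit: str


def choose_context_source(index: ContextIndex, head_commit: str) -> str:
    """Return 'index' only when the cached index matches the current HEAD.

    Otherwise fall back to a fresh clone, treated as the source of truth.
    """
    if index.indexed_commit == head_commit:
        return "index"
    return "fresh_clone"


index = ContextIndex(repo="example/repo", indexed_commit="a1b2c3d")
print(choose_context_source(index, head_commit="a1b2c3d"))  # index
print(choose_context_source(index, head_commit="f9e8d7c"))  # fresh_clone
```

The point of the sketch is only the decision rule: a cached view of the code is usable when it provably matches the commit under review, and the repository itself is the fallback.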

The crucial role of blended expertise in building reliable evals

Developing trustworthy evaluation systems (evals) for AI applications requires a synergistic collaboration between software engineers and ML/AI practitioners. While software engineers bring expertise in unit testing, integration testing, and understanding the testing pyramid, ML experts understand probabilistic testing and model behavior. In agentic applications, where probabilistic outcomes are common, this blend is essential. ML researchers might overlook critical business implications or edge cases, while software engineers might struggle with the nuances of probabilistic tests. Effective evals need to bridge these gaps, ensuring that AI solutions are not only technically sound but also practically viable and aligned with business objectives.

Navigating model divergence and the risks of prompt-based assumptions

The AI landscape is characterized by rapid model advancements and significant divergence between model families, such as OpenAI's GPT and Anthropic's Claude. A common failure mode is assuming that prompts effective for one model will work identically for another, especially when switching between versions or families. The sheer number of available models, including fine-tuned versions, complicates this further and makes selectivity crucial. The recommended approach is to focus on frontier models and treat every change as a hypothesis to be tested. Continuous integration and continuous deployment (CI/CD) become a competitive advantage here: small, frequent rollouts make it possible to attribute a regression to a specific change, which is difficult in probabilistic systems when many changes ship at once.
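Treating a model switch as a hypothesis could look roughly like the following gate. This is a minimal sketch under stated assumptions: the eval results are hypothetical lists of pass/fail outcomes from running the same eval suite twice, and the `max_regression` tolerance of 0.02 is an arbitrary illustrative value, not a figure from the talk.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed."""
    if not results:
        raise ValueError("no eval results")
    return sum(results) / len(results)


def should_promote(baseline: list[bool], candidate: list[bool],
                   max_regression: float = 0.02) -> bool:
    """Accept the candidate only if its pass rate has not dropped by more
    than max_regression relative to the baseline (an assumed tolerance)."""
    return pass_rate(candidate) >= pass_rate(baseline) - max_regression


# Hypothetical outcomes of the same eval suite under two configurations.
baseline = [True] * 90 + [False] * 10    # current model, current prompt
candidate = [True] * 84 + [False] * 16   # new model, prompt left unchanged
print(should_promote(baseline, candidate))  # False
```

Because individual runs are probabilistic, a real gate would likely need enough cases, or repeated runs, for the difference in pass rates to be meaningful.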

Beyond benchmarks: Crafting and trusting custom evaluation systems

Public benchmarks like SWE-bench can offer initial signals about model capabilities, particularly if they align with a product's domain. However, they are rarely comprehensive enough to map an entire business or product. Therefore, building custom benchmarks, or 'evals,' is essential. The speaker emphasizes 'benchmarking' as a critical internal practice, urging users to critically assess the origin and relevance of public benchmarks. Ultimately, the most trusted evals are those that directly reflect the specific needs, contexts, and potential failure modes of the application being developed. These custom evals form the bedrock for making informed decisions about model selection and deployment.

Staged deployment: From offline testing to online validation

Deployment strategies for AI applications typically involve multiple stages: offline, shadow, and online. Offline evals, often integrated into the development flow, catch obvious issues without the dynamism of production. Key metrics include precision (the share of flagged issues that are genuine) and recall (the share of genuine issues that get flagged, especially relevant in large contexts). Shadow environments allow for testing with a small group such as colleagues. However, the ultimate validation occurs in online, production environments. Here, metrics like acceptance rate (how often suggestions are adopted) and latency become paramount. These online evals are indispensable for evaluating aspects that cannot be accurately proxied internally, providing the real-world feedback needed for continuous improvement and for catching unforeseen regressions or performance issues.
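For an offline eval, precision and recall can be computed per review from a labeled set. The sketch below assumes each issue has a string identifier and that the eval set records which issues are genuine; the identifiers are made up, and returning 0.0 for an empty denominator is one convention among several.

```python
def precision_recall(flagged: set[str], real_issues: set[str]) -> tuple[float, float]:
    """Compute precision and recall for one review.

    flagged: issue identifiers the reviewer reported.
    real_issues: issue identifiers labeled as genuine in the eval set.
    """
    true_positives = len(flagged & real_issues)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(real_issues) if real_issues else 0.0
    return precision, recall


# Hypothetical review: four comments posted, two of them on genuine issues,
# and one genuine issue missed entirely.
flagged = {"null-deref", "sql-injection", "style-nit", "unused-import"}
real_issues = {"null-deref", "sql-injection", "race-condition"}
precision, recall = precision_recall(flagged, real_issues)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Low precision shows up to users as noisy reviews, while low recall shows up as missed bugs, so the two are usually tracked together.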

The architectural conclusion of evaluation: Intelligent routing for complex systems

Effective scaling of AI applications, particularly those with multiple models and diverse contexts, hinges on intelligent routing. Routing, in this context, is the direct architectural outcome of rigorous evaluation. By understanding the strengths and weaknesses of various models and contexts through trusted evals, systems can dynamically select the most appropriate resources for a given task. This is evident in the built-in routing mechanisms of advanced models like GPT-5 and platforms like Claude. The management of context is also critical; while abundant context is often needed, too much can lead to noise, increased costs, and reduced quality. Finding the optimal balance, often by using internal dashboards to monitor tails of metrics like latency and acceptance rates, ensures that the system remains efficient and effective at scale.
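Routing as an outcome of evaluation can be sketched as a lookup over eval scores. Everything here is assumed for illustration: the model names, task types, scores, costs, and the `min_score` threshold are hypothetical, and this is not a description of how any provider's built-in routing works.

```python
# Hypothetical eval scores per task type and model, on a 0-1 scale.
EVAL_SCORES = {
    "security_review": {"model_a": 0.91, "model_b": 0.84},
    "style_review": {"model_a": 0.78, "model_b": 0.80},
}
# Hypothetical relative cost per call.
COST = {"model_a": 3.0, "model_b": 1.0}


def route(task_type: str, min_score: float = 0.75) -> str:
    """Pick the cheapest model whose eval score clears min_score for this task.

    Falls back to the highest-scoring model if none clears the bar.
    """
    scores = EVAL_SCORES[task_type]
    eligible = [model for model, score in scores.items() if score >= min_score]
    if eligible:
        return min(eligible, key=lambda model: COST[model])
    return max(scores, key=scores.get)


print(route("style_review"))                     # model_b
print(route("security_review", min_score=0.90))  # model_a
```

The routing table is only as good as the evals behind it, which is why the scores need to come from evals that reflect the product's own tasks.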

AI Code Review at Scale: Key Principles

Practical takeaways from this episode

Do This

Treat every change as a hypothesis and conduct evaluations.

Collaborate closely between ML researchers and software engineers.

Clone the repository to get the latest source of truth.

Focus on frontier models rather than evaluating millions of fine-tuned ones.

Implement staged rollouts and monitor online evaluations for regressions.

Develop empathy and intuition for different model families.

Focus on context engineering to anchor models to relevant information.

Measure latency, hallucination indicators, and cohort regressions.

Keep invariant aspects stable while adapting variable elements per model.

Avoid This

Do not assume prompts work across different models or versions without testing.

Do not rely solely on public benchmarks for evaluating your specific product.

Do not attempt to stuff all context into a single prompt or context window.

Do not deploy changes to production without robust evaluation and instrumentation.

Do not stop evaluating models once they are in production; continuous eval is key.

Do not 'vibe' on whether code works; use data-driven evaluations.

Do not miss business implications or edge cases by focusing only on ML metrics.

Do not test only for average performance; examine the tails for edge cases.
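The advice to examine tails instead of averages can be shown with a few lines of standard-library Python. The latency values below are invented for illustration: most reviews are fast and a minority are very slow.

```python
import statistics

# Hypothetical review latencies in seconds.
latencies = [2.0] * 90 + [40.0] * 10

mean = statistics.mean(latencies)
# quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
p95 = statistics.quantiles(latencies, n=100)[94]
print(f"mean={mean:.1f}s p95={p95:.1f}s")  # mean=5.8s p95=40.0s
```

The mean suggests a tolerable experience, while the 95th percentile shows that a meaningful share of users wait far longer, which is the kind of regression an average alone would hide.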

Common Questions

With AI significantly speeding up code generation, the primary bottleneck has shifted to code review, which requires integrating various forms of context about the codebase and business.

Topics

Context Engineering, AI & Machine Learning, Technology & Innovation, Programming & Software, Model Deployment, Agentic Systems, AI Code Review, Software Development Lifecycle, Probabilistic Systems

Mentioned in this video

Companies

CodeRabbit

The company developing the AI code review product discussed in the talk, emphasizing its context engineering capabilities and focus on evaluation.

OpenAI

The company that developed the GPT family of models. They published guides on updating prompts for newer models.

Anthropic

A company developing frontier AI models, specifically the Claude family, which presents a divergence in prompting compared to GPT models.

Hugging Face

A platform where millions of AI models, including fine-tuned ones, are available. It's noted that evaluating all of them is impractical.

Groupon

A customer that uses the AI code review product, experiencing reductions in review time.

Software & Apps

GPT-4

Mentioned as part of the GPT family of models, highlighting the divergence in prompting strategies between its generation and newer models like GPT-5.

GPT-5

Mentioned as a newer generation of GPT models, indicating a need to update prompts and highlighting divergence from older models.

Claude

A family of frontier models from Anthropic, mentioned in contrast to GPT models and noted for having a routing layer (dispatch).

Dispatch

A routing layer the speaker recommends looking into, described as being used by Claude to route between models and supply prompts.

Studies & Research

SWE-bench

A popular software engineering benchmark mentioned as a source of early signals for model performance, but not comprehensive enough for evaluating a full product.
