AI Dev 26 x SF | Erik Thorelli: Deploying AI Code Review at Scale

DeepLearning.AI
Education | 5 min read | 31 min video
May 21, 2026|67 views

TL;DR

AI code generation has exploded, but AI-generated code shows roughly 40% more critical bugs and 70% more general bugs than human-written code, which makes rigorous evaluation of AI code review essential.

Key Insights

1. AI-generated code can have a 40% increase in critical bugs and a 70% increase in general bugs compared to human-written code.
2. Evals (evaluation systems) are crucial for AI applications, especially for managing context engineering and ensuring model performance.
3. Testing agentic applications requires a blend of software engineering and ML expertise, focusing on probabilistic tests and business implications.
4. Switching LLM models without re-evaluating prompts and context can lead to application failure due to model divergence.
5. Online evals in production, looking at metrics like acceptance rate and latency, are critical for understanding real-world performance and catching regressions.
6. Routing AI models based on trusted evaluations is the architectural conclusion for scaling complex agentic systems with multiple models and contexts.

The AI-driven code generation revolution and its hidden costs

AI code generation tools have dramatically reshaped software development and significantly boosted developer velocity. What once took days can now be accomplished in minutes through agentic loops. However, this speed has shifted the primary bottleneck from writing code to reviewing it. AI-generated code also shows a concerning increase in defects, with studies showing approximately 40% more critical bugs and 70% more general bugs than exclusively human-written code. With AI-generated code potentially exceeding a billion lines daily, robust code review is more critical than ever. The cost of downtime, cited as $5 million per hour, underscores the business imperative to address these quality issues.

The necessity of extensive context for effective AI code review

Unlike straightforward code generation tasks, AI-driven code review demands a deep and broad understanding of context, mirroring the capabilities of experienced senior engineers. A good AI code reviewer must integrate context not only from the codebase but also from business objectives and past issues. For instance, a code change might be technically correct but could introduce performance bottlenecks or, more critically, undermine business viability by making a product free for all users. This highlights the need for 'context enrichment' to ensure that AI-generated suggestions are not just correct but also strategically sound and aligned with broader goals. Relying solely on a model's latent knowledge or limited prompts is insufficient for such nuanced tasks.

Anchoring AI with deterministic context and source of truth

To mitigate the inherent nondeterminism of AI systems, especially in dynamic environments like code review, it is essential to ground the AI in a 'source of truth', that is, deterministic context. This approach reduces reliance on one-shot prompting, which can be unpredictable. By cloning the repository and providing direct, reliable context, AI agents can be anchored to specific concerns, improving accuracy and reducing errors. This contrasts with relying solely on Retrieval-Augmented Generation (RAG), which behaves like a cache that can go stale in a rapidly changing codebase. Context engineering therefore becomes the core differentiator for high-performing AI applications: the work moves beyond simply invoking models to carefully curating the information they receive.
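The cache analogy can be made concrete with a minimal sketch. It assumes a hypothetical retrieval index that stores, for each file, a content hash taken at indexing time plus a cached summary. When the hash no longer matches the checked-out file, the sketch falls back to the file itself as the source of truth. The function names and the index layout are illustrative assumptions, not details from the talk.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple

def content_hash(text: str) -> str:
    """Stable fingerprint of a file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def fresh_context(
    repo_root: Path,
    changed_files: List[str],
    index: Dict[str, Tuple[str, str]],
) -> Dict[str, str]:
    """Build review context for the changed files.

    `index` is a hypothetical retrieval cache: path -> (hash at index time,
    cached summary). A cached entry is used only if the file on disk still
    matches the hash; otherwise the current file contents are used.
    """
    context: Dict[str, str] = {}
    for rel_path in changed_files:
        current = (repo_root / rel_path).read_text(encoding="utf-8")
        cached = index.get(rel_path)
        if cached is not None and cached[0] == content_hash(current):
            context[rel_path] = cached[1]  # cache entry is still valid
        else:
            context[rel_path] = current    # stale or missing: use the checkout
    return context
```

The design choice here is that the cloned repository always wins: the index is only an optimization, never the authority.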

The crucial role of blended expertise in building reliable evals

Developing trustworthy evaluation systems (evals) for AI applications requires a synergistic collaboration between software engineers and ML/AI practitioners. While software engineers bring expertise in unit testing, integration testing, and understanding the testing pyramid, ML experts understand probabilistic testing and model behavior. In agentic applications, where probabilistic outcomes are common, this blend is essential. ML researchers might overlook critical business implications or edge cases, while software engineers might struggle with the nuances of probabilistic tests. Effective evals need to bridge these gaps, ensuring that AI solutions are not only technically sound but also practically viable and aligned with business objectives.
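A probabilistic test of the kind described here can be sketched as follows. Instead of asserting on a single output, it runs a nondeterministic reviewer many times and checks that the pass rate clears a threshold. The `flaky_reviewer` stand-in, its 90% detection rate, the 0.8 threshold, and the trial count are all illustrative assumptions.

```python
import random

def flaky_reviewer(diff: str, rng: random.Random) -> bool:
    """Hypothetical stand-in for an LLM reviewer.

    Flags a dangerous statement about 90% of the time, never flags safe diffs.
    """
    return "DROP TABLE" in diff and rng.random() < 0.9

def pass_rate(diff: str, trials: int, seed: int = 0) -> float:
    """Fraction of runs in which the reviewer flagged the diff."""
    rng = random.Random(seed)  # seeded so the test itself is reproducible
    hits = sum(flaky_reviewer(diff, rng) for _ in range(trials))
    return hits / trials

def probabilistic_test(diff: str, threshold: float = 0.8, trials: int = 200) -> bool:
    """Pass if the reviewer catches the issue in at least `threshold` of runs."""
    return pass_rate(diff, trials) >= threshold
```

The threshold is where the business judgment enters: how often a miss is tolerable depends on what a missed issue costs, which is why the test needs both engineering and ML input.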

Navigating model divergence and the risks of prompt-based assumptions

The AI landscape is characterized by rapid model advancements and significant divergence between model families, such as OpenAI's GPT and Anthropic's Claude. A common failure mode is assuming that prompts effective for one model will work identically for another, especially when switching between versions or families. The sheer number of available models, including fine-tuned versions, complicates this further and makes selectivity crucial. Focusing on frontier models and treating every change as a hypothesis is key. Continuous Integration and Continuous Deployment (CI/CD) become critical competitive advantages because small, frequent rollouts make regressions in these probabilistic systems quick to identify, whereas a regression is hard to attribute when many changes are deployed at once.
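Treating a model swap as a hypothesis can be sketched as a simple gate: score the baseline and the candidate on the same labeled cases and accept the swap only if the candidate does not regress. The two reviewer functions below are hypothetical stand-ins for two model families given an identical prompt, and the cases are made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple

EvalCase = Tuple[str, bool]  # (diff text, should the reviewer flag it?)

def score(reviewer: Callable[[str], bool], cases: Sequence[EvalCase]) -> float:
    """Fraction of eval cases where the reviewer's verdict matches the label."""
    return sum(reviewer(diff) == label for diff, label in cases) / len(cases)

def accept_change(
    baseline: Callable[[str], bool],
    candidate: Callable[[str], bool],
    cases: Sequence[EvalCase],
    tolerance: float = 0.0,
) -> bool:
    """Accept the swap only if the candidate does not regress beyond `tolerance`."""
    return score(candidate, cases) >= score(baseline, cases) - tolerance

# Hypothetical stand-ins: same prompt, different model behavior.
def model_a(diff: str) -> bool:
    return "password" in diff or "eval(" in diff

def model_b(diff: str) -> bool:
    return "password" in diff  # misses the eval( case with this prompt

CASES: List[EvalCase] = [
    ("log(password)", True),
    ("x = eval(user_input)", True),
    ("rename variable", False),
    ("fix typo in README", False),
]
```

Running a gate like this in CI on every prompt or model change is one way to keep each rollout small enough that a regression can be traced to a single cause.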

Beyond benchmarks: Crafting and trusting custom evaluation systems

Public benchmarks like SWE-bench can offer initial signals about model capabilities, particularly if they align with a product's domain. However, they are rarely comprehensive enough to map an entire business or product. Therefore, building custom benchmarks, or 'evals,' is essential. The speaker emphasizes 'benchmarking' as a critical internal practice and urges users to scrutinize the origin and relevance of public benchmarks. Ultimately, the most trusted evals are those that directly reflect the specific needs, contexts, and potential failure modes of the application being developed. These custom evals form the bedrock for making informed decisions about model selection and deployment.

Staged deployment: From offline testing to online validation

Deployment strategies for AI applications typically involve multiple stages: offline, shadow, and online. Offline evals, often integrated into the development flow, catch obvious issues without the dynamism of production. Key metrics include precision (the share of flagged issues that are real) and recall (the share of real issues that are found, which is especially hard in large contexts). Shadow environments allow for testing with colleagues. However, the ultimate validation occurs online, in production. Here, metrics like acceptance rate (how often suggestions are adopted) and latency become paramount. These online evals are indispensable for evaluating aspects that cannot be accurately proxied internally. They provide the real-world feedback needed for continuous improvement and for catching unforeseen regressions or performance issues.
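The offline and online metrics named above reduce to simple ratios. The sketch below computes precision and recall from a set of flagged issues and a labeled set of real issues, and acceptance rate from counts of shown and adopted suggestions. The issue identifiers and counts are made-up illustrations.

```python
from typing import Set, Tuple

def precision_recall(flagged: Set[str], real_issues: Set[str]) -> Tuple[float, float]:
    """Precision: share of flagged issues that are real.
    Recall: share of real issues that were flagged."""
    true_positives = len(flagged & real_issues)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(real_issues) if real_issues else 0.0
    return precision, recall

def acceptance_rate(accepted: int, shown: int) -> float:
    """Share of review suggestions that developers adopted."""
    return accepted / shown if shown else 0.0

# Illustrative data: the reviewer flagged four issues, two of which are real,
# and missed one real issue entirely.
flagged = {"null-deref", "sql-injection", "naming", "unused-import"}
real = {"null-deref", "sql-injection", "race-condition"}
precision, recall = precision_recall(flagged, real)  # 0.5 and 2/3
```

Precision and recall need labeled ground truth, so they suit offline evals. Acceptance rate needs real users, so it is only available online.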

The architectural conclusion of evaluation: Intelligent routing for complex systems

Effective scaling of AI applications, particularly those with multiple models and diverse contexts, hinges on intelligent routing. Routing, in this context, is the direct architectural outcome of rigorous evaluation. By understanding the strengths and weaknesses of various models and contexts through trusted evals, systems can dynamically select the most appropriate resources for a given task. This is evident in the built-in routing mechanisms of advanced models like GPT-5 and platforms like Claude. The management of context is also critical; while abundant context is often needed, too much can lead to noise, increased costs, and reduced quality. Finding the optimal balance, often by using internal dashboards to monitor tails of metrics like latency and acceptance rates, ensures that the system remains efficient and effective at scale.
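Routing as an outcome of evaluation can be sketched as a lookup over eval results: for each task type, pick the model that scored best on the trusted evals. The model names, task types, and scores below are hypothetical placeholders, not figures from the talk.

```python
from typing import Dict

# Hypothetical eval results: task type -> model name -> eval score.
EVAL_SCORES: Dict[str, Dict[str, float]] = {
    "security": {"model-a": 0.82, "model-b": 0.91},
    "style": {"model-a": 0.88, "model-b": 0.79},
}

def route(task_type: str, default: str = "model-a") -> str:
    """Return the best-scoring model for this task type.

    Falls back to `default` when no eval data exists for the task type,
    since routing without a trusted eval would be a guess.
    """
    scores = EVAL_SCORES.get(task_type)
    if not scores:
        return default
    return max(scores, key=lambda model: scores[model])
```

The table is only as good as the evals behind it, which is why routing comes last: it encodes conclusions that the evaluation work has already earned.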

AI Code Review at Scale: Key Principles

Practical takeaways from this episode

Do This

Treat every change as a hypothesis and conduct evaluations.
Collaborate closely between ML researchers and software engineers.
Clone the repository to get the latest source of truth.
Focus on frontier models rather than evaluating millions of fine-tuned ones.
Implement staged rollouts and monitor online evaluations for regressions.
Develop empathy and intuition for different model families.
Focus on context engineering to anchor models to relevant information.
Measure latency, hallucination indicators, and cohort regressions.
Keep invariant aspects stable while adapting variable elements per model.

Avoid This

Do not assume prompts work across different models or versions without testing.
Do not rely solely on public benchmarks for evaluating your specific product.
Do not attempt to stuff all context into a single prompt or context window.
Do not deploy changes to production without robust evaluation and instrumentation.
Do not stop evaluating models once they are in production; continuous eval is key.
Do not 'vibe' on whether code works; use data-driven evaluations.
Do not miss business implications or edge cases by focusing only on ML metrics.
Do not test only for average performance; examine the tails for edge cases.
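The last point, examining the tails, can be illustrated with a short sketch. It uses a nearest-rank percentile over made-up latency samples to show how a mean and a median can both hide a slow tail that p95 exposes. The sample values are assumptions chosen for illustration.

```python
import math
from typing import Sequence

def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Illustrative sample: 90 fast requests and 10 very slow ones.
latencies_ms = [100] * 90 + [2000] * 10
mean_ms = sum(latencies_ms) / len(latencies_ms)  # 290.0
p50_ms = percentile(latencies_ms, 50)            # 100
p95_ms = percentile(latencies_ms, 95)            # 2000
```

Here the median suggests every request is fast and the mean looks tolerable, while one request in ten takes twenty times longer.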

Common Questions

With AI significantly speeding up code generation, the primary bottleneck has shifted to code review, which requires integrating various forms of context about the codebase and business.
