AI Dev 26 x SF | Erik Thorelli: Deploying AI Code Review at Scale
Key Moments
AI code generation has exploded, but AI-generated code shows roughly 40% more critical bugs and 70% more general bugs than human-written code, which makes rigorous review and evaluation necessary.
Key Insights
AI-generated code can have a 40% increase in critical bugs and a 70% increase in general bugs compared to human-written code.
Evals (evaluation systems) are crucial for AI applications, especially for managing context engineering and ensuring model performance.
Testing agentic applications requires a blend of software engineering and ML expertise, focusing on probabilistic tests and business implications.
Switching LLM models without re-evaluating prompts and context can lead to application failure due to model divergence.
Online evals in production, looking at metrics like acceptance rate and latency, are critical for understanding real-world performance and catching regressions.
Routing AI models based on trusted evaluations is the architectural conclusion for scaling complex agentic systems with multiple models and contexts.
The AI-driven code generation revolution and its hidden costs
The software development landscape has been dramatically reshaped by AI code generation tools, significantly boosting developer velocity. What once took days can now be accomplished in minutes through agentic loops. However, this rapid code generation has shifted the primary bottleneck from writing code to reviewing it. Despite advances, AI-generated code exhibits a concerning trend of increased bugs, with studies showing approximately 40% more critical bugs and 70% more general bugs compared to exclusively human-written code. This surge in AI-generated code, potentially exceeding a billion lines daily, coupled with increased bugs, makes robust code review more critical than ever. The cost of downtime, cited as $5 million per hour, underscores the business imperative to address these quality issues.
The necessity of extensive context for effective AI code review
Unlike straightforward code generation tasks, AI-driven code review demands a deep and broad understanding of context, mirroring the capabilities of experienced senior engineers. A good AI code reviewer must integrate context not only from the codebase but also from business objectives and past issues. For instance, a code change might be technically correct but could introduce performance bottlenecks or, more critically, undermine business viability by making a product free for all users. This highlights the need for 'context enrichment' to ensure that AI-generated suggestions are not just correct but also strategically sound and aligned with broader goals. Relying solely on a model's latent knowledge or limited prompts is insufficient for such nuanced tasks.
Anchoring AI with deterministic context and source of truth
To counter the probabilistic behavior of AI systems, especially in dynamic environments like code review, it is paramount to ground the AI in a 'source of truth', that is, deterministic context. This reduces reliance on one-shot prompting, which can be unpredictable. By cloning the repository and supplying direct, reliable context, AI agents can be anchored to specific concerns, improving accuracy and reducing errors. Relying solely on Retrieval-Augmented Generation (RAG), by contrast, is akin to reading from a cache that can go stale as a fast-moving codebase changes. Context engineering therefore becomes the core differentiator for high-performing AI applications: the work moves beyond simply invoking models to carefully curating the information they receive.
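The cache analogy can be made concrete with a minimal sketch. It assumes a retrieval index that records the commit each snippet was indexed at; `IndexedSnippet`, `fresh_context`, and the `read_from_checkout` callback are hypothetical names for illustration, not part of the product described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedSnippet:
    path: str
    text: str
    indexed_at_commit: str  # commit SHA the retrieval index was built from


def fresh_context(path, head_commit, index, read_from_checkout):
    """Return the text for `path`, trusting the index only if it matches HEAD.

    `index` is a dict of path -> IndexedSnippet (a stand-in for a RAG store).
    `read_from_checkout` reads the file from the cloned repository, which is
    treated here as the source of truth.
    """
    cached = index.get(path)
    if cached is not None and cached.indexed_at_commit == head_commit:
        return cached.text
    # Index entry is missing or stale: fall back to the cloned repository
    # and refresh the entry so later lookups at this commit hit the index.
    text = read_from_checkout(path)
    index[path] = IndexedSnippet(path, text, head_commit)
    return text
```

The design choice is that the checkout always wins: the index is only an accelerator, so a stale entry costs a re-read rather than a wrong review comment.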
The crucial role of blended expertise in building reliable evals
Developing trustworthy evaluation systems (evals) for AI applications requires a synergistic collaboration between software engineers and ML/AI practitioners. While software engineers bring expertise in unit testing, integration testing, and understanding the testing pyramid, ML experts understand probabilistic testing and model behavior. In agentic applications, where probabilistic outcomes are common, this blend is essential. ML researchers might overlook critical business implications or edge cases, while software engineers might struggle with the nuances of probabilistic tests. Effective evals need to bridge these gaps, ensuring that AI solutions are not only technically sound but also practically viable and aligned with business objectives.
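One way to picture a probabilistic test, as opposed to a conventional unit test, is to assert on a pass rate over many trials rather than on a single outcome. The sketch below is illustrative only: `flaky_reviewer_flags_bug` is a hypothetical stand-in for a model call, and the trial count and threshold are arbitrary assumptions.

```python
import random


def pass_rate(check, trials, seed=0):
    """Run `check` repeatedly with a seeded RNG and return the fraction passed."""
    rng = random.Random(seed)
    passed = sum(1 for _ in range(trials) if check(rng))
    return passed / trials


def flaky_reviewer_flags_bug(rng):
    """Hypothetical stand-in for a model call.

    Simulates a reviewer that flags a seeded bug about 90% of the time.
    """
    return rng.random() < 0.9


def assert_pass_rate(check, trials=200, minimum=0.8):
    """Fail only if the observed pass rate drops below `minimum`."""
    rate = pass_rate(check, trials)
    assert rate >= minimum, f"pass rate {rate:.2f} below {minimum:.2f}"
    return rate
```

A software engineer would recognize the assertion; an ML practitioner would recognize that the threshold and trial count encode a tolerance for noise. Choosing the threshold is where the business implications enter.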
Navigating model divergence and the risks of prompt-based assumptions
The AI landscape is characterized by rapid model advancements and significant divergence between model families, such as OpenAI's GPT and Anthropic's Claude. A common failure mode is assuming that prompts effective for one model will work identically for another, especially when switching between versions or families. The sheer number of available models, including fine-tuned versions, complicates this further and makes it essential to be selective. The recommended approach is to focus on frontier models and treat every change as a hypothesis to be tested. Continuous Integration and Continuous Deployment (CI/CD) then become critical competitive advantages: small, frequent rollouts make regressions in these probabilistic systems quick to identify, whereas a regression is hard to attribute when many changes are deployed at once.
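Treating a model or prompt change as a hypothesis can be sketched as a simple gate in CI: run the same eval cases against the baseline and the candidate, then compare. The function name, the score format, and the tolerance below are assumptions chosen for illustration.

```python
def compare_runs(baseline, candidate, tolerance=0.02):
    """Compare two eval runs over their shared cases.

    `baseline` and `candidate` map an eval case id to a score in [0, 1].
    The candidate "ships" only if its mean score has not dropped by more
    than `tolerance` and no individual case regressed beyond `tolerance`.
    """
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no shared eval cases to compare")
    base_mean = sum(baseline[c] for c in shared) / len(shared)
    cand_mean = sum(candidate[c] for c in shared) / len(shared)
    regressed = [c for c in shared if candidate[c] < baseline[c] - tolerance]
    return {
        "baseline_mean": base_mean,
        "candidate_mean": cand_mean,
        "regressed_cases": regressed,
        "ship": cand_mean >= base_mean - tolerance and not regressed,
    }
```

Checking per-case regressions as well as the mean matters because an unchanged average can hide one case that got much worse while another improved.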
Beyond benchmarks: Crafting and trusting custom evaluation systems
Public benchmarks like SWE-bench can offer initial signals about model capabilities, particularly if they align with a product's domain. However, they are rarely comprehensive enough to map an entire business or product. Therefore, building custom benchmarks, or 'evals,' is essential. The speaker emphasizes 'benchmarking' as a critical internal practice, urging users to critically assess the origin and relevance of public benchmarks. Ultimately, the most trusted evals are those that directly reflect the specific needs, contexts, and potential failure modes of the application being developed. These custom evals form the bedrock for making informed decisions about model selection and deployment.
Staged deployment: From offline testing to online validation
Deployment strategies for AI applications typically involve multiple stages: offline, shadow, and online. Offline evals, often integrated into the development flow, catch obvious issues without the dynamism of production. Key metrics include precision (the share of flagged issues that are real) and recall (the share of real issues that get flagged, which is especially hard in large contexts). Shadow environments allow for testing with colleagues. However, the ultimate validation occurs in online, production environments. Here, metrics like acceptance rate (how often suggestions are adopted) and latency become paramount. These online evals are indispensable for evaluating aspects that cannot be accurately proxied internally, providing real-world feedback crucial for continuous improvement and identifying unforeseen regressions or performance issues.
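The offline and online metrics named here reduce to a few ratios. The following sketch uses the standard definitions of precision and recall; the function names and the input shapes (sets of issue ids, a list of accepted/rejected flags) are assumptions for illustration.

```python
def precision_recall(flagged, actual):
    """Offline metric: compare issues the reviewer flagged with known issues.

    precision = true positives / everything flagged
    recall    = true positives / everything that should have been flagged
    """
    flagged, actual = set(flagged), set(actual)
    true_pos = len(flagged & actual)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall


def acceptance_rate(events):
    """Online metric: fraction of suggestions that users accepted.

    `events` is an iterable of booleans, True when a suggestion was adopted.
    """
    events = list(events)
    return sum(events) / len(events) if events else 0.0
```

For example, a reviewer that flags four issues, two of them real, out of two real issues in total has a precision of 0.5 and a recall of 1.0: it found everything but half of its comments were noise.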
The architectural conclusion of evaluation: Intelligent routing for complex systems
Effective scaling of AI applications, particularly those with multiple models and diverse contexts, hinges on intelligent routing. Routing, in this context, is the direct architectural outcome of rigorous evaluation. By understanding the strengths and weaknesses of various models and contexts through trusted evals, systems can dynamically select the most appropriate resources for a given task. The talk cites the built-in routing of GPT-5 and the routing layer used by Claude as examples. The management of context is also critical: abundant context is often needed, but too much adds noise, raises costs, and reduces quality. Finding the optimal balance, often by using internal dashboards to monitor the tails of metrics like latency and acceptance rate, ensures that the system remains efficient and effective at scale.
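As a rough sketch of routing driven by evals, the selector below picks the model with the best eval score for a task type, subject to a tail-latency budget. The model names, score tables, and the choice of p95 with a nearest-rank percentile are all illustrative assumptions, not details from the talk.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of `values`; `pct` is in (0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def route(task_type, eval_scores, latencies_ms, p95_budget_ms):
    """Pick the best-scoring model for `task_type` within the latency budget.

    `eval_scores` maps model name -> {task type -> score} from trusted evals.
    `latencies_ms` maps model name -> observed latencies in milliseconds.
    """
    eligible = [
        (scores[task_type], name)
        for name, scores in eval_scores.items()
        if task_type in scores
        and percentile(latencies_ms[name], 95) <= p95_budget_ms
    ]
    if not eligible:
        raise LookupError(f"no model fits the budget for {task_type!r}")
    # Highest score wins; ties fall back to comparing model names.
    return max(eligible)[1]
```

Filtering on the tail rather than the mean reflects the point about dashboards: a model with a good average latency can still be unusable if its slowest requests blow the budget.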
Common Questions
With AI significantly speeding up code generation, the primary bottleneck has shifted to code review, which requires integrating various forms of context about the codebase and business.
Mentioned in this video
The company developing the AI code review product discussed in the talk, emphasizing its context engineering capabilities and focus on evaluation.
The company that developed the GPT family of models. They published guides on updating prompts for newer models.
A company developing frontier AI models, specifically the Claude family, which presents a divergence in prompting compared to GPT models.
A platform where millions of AI models, including fine-tuned ones, are available. It's noted that evaluating all of them is impractical.
A customer that uses the AI code review product, experiencing reductions in review time.
Mentioned as part of the GPT family of models, highlighting the divergence in prompting strategies between its generation and newer models like GPT-5.
Mentioned as a newer generation of GPT models, indicating a need to update prompts and highlighting divergence from older models.
A family of frontier models from Anthropic, mentioned in contrast to GPT models and noted for having a routing layer (dispatch).
A routing layer recommended for checking out, used by Claude for routing models and providing prompts.