Key Moments
AI Dev 26 x SF | Tom Howlett: Can LLMs Generate Enterprise Quality Code?
LLMs can generate code twice as fast, but this comes at the cost of 2x more code, drastically different quality, and a near-doubled number of bugs, making current AI-generated code unsuitable for enterprise applications without significant adaptation.
Key Insights
A Carnegie Mellon study on Cursor showed a 3-5x velocity spike with LLM code generation, but this was accompanied by an increase in static analysis warnings and code complexity, leading to a slowdown by month three.
Sonar's benchmark of 52 LLMs revealed significant variability, with some models producing over twice as much code (e.g., 336,000 lines vs. 703,000 lines for specific tasks) and vastly different numbers of bugs and security vulnerabilities.
The 'slop code bench' benchmark demonstrated that LLMs can thrive on poorly structured code, continuously patching existing code to meet task requirements (e.g., pass rates) without an inherent drive for quality, leading to single files of 6,000 lines with 600-line functions.
A Wharton paper found that humans still incorrectly approve AI-generated code 80% of the time when it's wrong, highlighting a psychological bias that challenges the objectivity of manual code reviews for AI-generated code.
Sonar's agent-centric development cycle proposes a 'guide, verify, solve' loop with an 'inner loop' for agents to self-correct code before human review and an 'outer loop' for final checks and PR merges.
A new verification method developed by Sonar achieves 95% of the issue detection of a full analysis in seconds, enabling significantly faster feedback loops for agents within the inner loop.
The enterprise quality gap: LLMs accelerate, but at what cost?
Rapid advances in Large Language Models (LLMs) for code generation offer impressive speed, creating excitement akin to the early days of the World Wide Web. However, deploying the generated code in enterprise environments presents significant challenges. A Carnegie Mellon study on the use of Cursor found an initial 3-5x increase in development velocity, marked by more lines of code and commits. Crucially, this velocity spike was accompanied by a rise in static analysis warnings and code complexity. As these issues compound, productivity paradoxically declines: the study showed a significant slowdown by month three, likely because developers were struggling to understand and fix the generated code. This phenomenon, termed the 'enterprise quality gap,' suggests that while LLMs are excellent for rapid prototyping or short-lived internal tools (e.g., apps with fewer than 50,000 lines of code), they fall short for mission-critical, long-lived applications that demand high levels of quality, security, and maintainability. Adversarial users of public-facing applications raise the bar for code quality further. Without adequate quality checks, the added speed of LLMs can lead to escalating production incidents and to developers concluding that it is 'easier to write it myself' to meet the required standards. The critical oversight is that the cost and effort of bringing AI-generated code up to enterprise standards are often left out of development timelines, so teams push for speed without sustainable quality.
Benchmarking LLMs reveals vast disparities in code quality
To address the 'enterprise quality gap,' Sonar ran a comprehensive benchmark of 52 LLMs, evaluating not only task completion but also the quality of the generated code. The results show a significant lack of consistency across models. While pass rates for tasks might hover around 80%, the code actually produced varies drastically. For instance, on specific tasks, models like Gemini and Opus 4.7 produced around 336,000 lines of code, whereas GPT-5.5 generated 703,000 lines, more than double. Beyond sheer volume, the benchmark uncovered substantial differences in bugs, security vulnerabilities, and maintainability issues. Every model exhibits its own behavior profile: some are concise but introduce severe security flaws, while others are more secure but excessively verbose. This variability means that simply picking the 'best' performing model on a given task is insufficient; understanding its specific quality profile is essential for enterprise adoption. Concurrency issues, which are notoriously difficult to debug, are also becoming more prevalent as models grow more sophisticated.
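As a quick check on the "more than double" claim, the ratio can be computed directly from the two line counts quoted above. The dictionary keys and the `verbosity_ratio` helper are illustrative names only, and the counts are the approximate figures from this summary, not raw benchmark data.

```python
# Line counts as quoted in the summary above (approximate figures).
reported_lines = {
    "Gemini / Opus 4.7": 336_000,
    "GPT-5.5": 703_000,
}


def verbosity_ratio(counts: dict[str, int]) -> float:
    """Return the largest line count divided by the smallest."""
    return max(counts.values()) / min(counts.values())


ratio = verbosity_ratio(reported_lines)
print(f"Most verbose output is {ratio:.2f}x the most concise")  # prints 2.09x
```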
The 'slop code bench' and the challenge of evolving codebases
Traditional benchmarks often assess code generation in a static, one-shot manner. Real-world software development, however, involves continuous evolution, with new features and changes arriving over time. To simulate this, Sonar introduced the 'slop code bench.' The benchmark starts with a single task and then adds further tasks one checkpoint at a time, without telling the LLM in advance what is coming. The results are striking: even when the number of checkpoints is increased to ten, the LLM keeps generating code that passes the tests, but the codebase degrades rapidly. A basic code search application, initially a neat 300 lines of Python, can balloon into a single 6,000-line file with 600-line functions that are incomprehensible to humans, yet the LLM continues to cope. This shows that LLMs, driven by task completion (such as passing tests), tend to 'patch' existing code rather than refactor or improve its structure. They 'thrive on sloppy code' by making the minimum necessary changes. The implication is that if the LLM's only objective is to pass the tests it is given, and not to maintain architectural integrity or readability, overall code quality will degrade, especially in large, evolving applications.
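The kind of structural decay described above (a 6,000-line file with 600-line functions) is easy to measure mechanically. The following is a minimal sketch using Python's standard `ast` module. The function names and the threshold values are assumptions chosen for illustration; they are not taken from the benchmark or from Sonar's tooling.

```python
import ast


def structure_metrics(source: str) -> dict[str, int]:
    """Return the file's line count and the length of its longest function."""
    tree = ast.parse(source)
    longest = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is populated by ast.parse on Python 3.8+.
            longest = max(longest, node.end_lineno - node.lineno + 1)
    return {
        "total_lines": len(source.splitlines()),
        "longest_function_lines": longest,
    }


def exceeds_limits(metrics: dict[str, int],
                   max_file_lines: int = 1000,
                   max_function_lines: int = 60) -> bool:
    """Flag a file whose size crosses either (hypothetical) threshold."""
    return (metrics["total_lines"] > max_file_lines
            or metrics["longest_function_lines"] > max_function_lines)


sample = (
    "def a():\n"
    "    return 1\n"
    "\n"
    "def b():\n"
    "    x = 1\n"
    "    y = 2\n"
    "    return x + y\n"
)
metrics = structure_metrics(sample)
print(metrics, exceeds_limits(metrics))
```

Tracking these two numbers at each checkpoint would make the drift visible even while the pass rate stays flat.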
The psychological bias undermining human and AI code reviews
The increasing reliance on AI-generated code intensifies the challenge of effective code review. Human code review, which consumed as much time as coding itself before LLMs, becomes far more difficult and less effective when faced with large, AI-generated changes. Adding to this is a documented psychological bias. A Wharton paper, 'Thinking Fast, Slow and Artificial,' found that humans incorrectly approved AI-generated code as correct 80% of the time when it was actually flawed. This suggests a natural inclination to trust AI output, especially when it arrives quickly and looks superficially correct. The bias also extends to AI-on-AI code reviews. While AI review tools are impressive, having one LLM review code generated by another can create a cycle of confirmation that masks errors rather than exposing them. The speaker likens this to a junior developer being easily placated by a senior developer, which fosters a false sense of security. The conclusion is that an objective, deterministic verification process is needed, one that supplements rather than replaces critical human oversight.
Introducing the agent-centric development cycle: Guide, verify, solve
Sonar proposes an 'agent-centric development cycle' to address these challenges, built around three core phases: Guide, Verify, and Solve. This cycle operates in both an 'inner loop' (for agent-driven development) and an 'outer loop' (for human review). The 'Guide' phase emphasizes providing agents with clear architectural guidelines, coding standards, and quality expectations, akin to onboarding a new human developer. The 'Verify' phase involves robust, deterministic checks to identify issues early. The 'Solve' phase enables the agent to rectify identified problems. This contrasts with current practices where agents often discover architecture and quality standards organically, leading to suboptimal outcomes. The emphasis is on a continuous feedback loop, where the agent is guided, its code is verified, and any issues are solved, ideally within the inner loop before reaching the human reviewer. This structured approach aims to consistently produce code of the expected quality.
The dual-loop system: Inner loop for agent efficiency, outer loop for final checks
The agent-centric cycle incorporates an 'inner loop' for rapid agent self-correction and an 'outer loop' for more comprehensive human and system-level verification. The inner loop aims to resolve issues before they reach a pull request (PR). This is achieved by equipping agents with tools for guidance (coding standards, architecture) and verification (static analysis). The agent generates code, analyzes it, fixes issues, and reanalyzes, repeating the cycle until quality standards are met. This proactive approach significantly reduces the number of issues presented in the outer loop. The outer loop, triggered by a PR, involves full builds, test runs, and a final quality gate check using SonarQube. If the code passes the quality gate, it can be merged; otherwise, it's blocked. Issues found in the outer loop can be addressed by a remediation agent or the original code generation agent. The human developer's role remains critical for the final review, but the goal is for them to encounter significantly fewer issues, allowing them to focus on higher-level validation and architectural coherence rather than low-level bug fixing.
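The generate, analyze, fix, reanalyze cycle of the inner loop can be sketched as a bounded loop. This is a minimal illustration under stated assumptions: `inner_loop`, `Issue`, and the three `toy_*` callables are hypothetical stand-ins for an agent and an analyzer, not Sonar or SonarQube APIs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Issue:
    rule: str
    message: str


def inner_loop(
    generate: Callable[[], str],
    analyze: Callable[[str], list[Issue]],
    fix: Callable[[str, list[Issue]], str],
    max_rounds: int = 5,
) -> tuple[str, list[Issue]]:
    """Generate code, then analyze and fix until clean or out of rounds.

    Returns the final code and any issues still open, so the caller can
    decide whether the change is ready for the outer loop.
    """
    code = generate()
    for _ in range(max_rounds):
        issues = analyze(code)
        if not issues:
            return code, []
        code = fix(code, issues)
    return code, analyze(code)


# Toy stand-ins (hypothetical): a real setup would call an agent and an analyzer.
def toy_generate() -> str:
    return "value = eval(user_input)\n"


def toy_analyze(code: str) -> list[Issue]:
    if "eval(" in code:
        return [Issue("no-eval", "avoid eval on untrusted input")]
    return []


def toy_fix(code: str, issues: list[Issue]) -> str:
    return code.replace("eval(", "int(")


final_code, remaining = inner_loop(toy_generate, toy_analyze, toy_fix)
print(final_code, remaining)
```

The round limit matters: without it, an agent that cannot resolve an issue would loop indefinitely instead of handing the remaining issues to the outer loop.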
Efficient verification for rapid feedback: The power of hybrid analysis
A key bottleneck in the verification process, especially for large codebases, is the time required for comprehensive static analysis. To overcome this, Sonar has developed a new verification method that combines previous analysis results with patch analysis of new code changes. This 'hybrid' approach analyzes code changes within the context of an already resolved codebase, significantly speeding up the process. It aims to surface 95% of the issues a full analysis would find, but in a matter of seconds, making it as fast as a linter while retaining the depth of Sonar's analysis. This rapid feedback is crucial for the inner loop of agent development, allowing iterative code generation and correction without lengthy delays. While it does not yet match the coverage of a full analysis, the method represents a substantial gain in efficiency and enables faster, more continuous verification cycles.
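The general idea of reusing earlier results and re-checking only the patch can be illustrated with a small sketch. This is not Sonar's method, whose internals the talk summary does not describe; `hybrid_analysis`, `Issue`, and `toy_analyze_file` are hypothetical names, and the per-file granularity is an assumption made for simplicity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Issue:
    path: str
    rule: str


def hybrid_analysis(
    cached_issues: list[Issue],
    changed_files: dict[str, str],
    analyze_file: Callable[[str, str], list[Issue]],
) -> list[Issue]:
    """Combine cached results with a fresh analysis of changed files only.

    cached_issues: issues from the last full analysis.
    changed_files: mapping of path -> new source for files in the patch.
    """
    # Results for untouched files are reused as-is.
    kept = [issue for issue in cached_issues if issue.path not in changed_files]
    # Changed files are re-analyzed from scratch.
    fresh = [
        issue
        for path, source in changed_files.items()
        for issue in analyze_file(path, source)
    ]
    return kept + fresh


# Toy per-file analyzer (hypothetical rule).
def toy_analyze_file(path: str, source: str) -> list[Issue]:
    return [Issue(path, "no-eval")] if "eval(" in source else []


cached = [Issue("a.py", "unused-import"), Issue("b.py", "no-eval")]
patch = {"b.py": "x = int(s)\n"}
result = hybrid_analysis(cached, patch, toy_analyze_file)
print(result)
```

Note that this sketch only re-checks the changed files, so it would miss any issue that a change introduces in a file outside the patch.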
Implementing enterprise-grade AI development with SonarQube
SonarQube provides the foundational tools for this agent-centric development cycle. Its static analysis capabilities cover over 40 languages with thousands of checks, offering deterministic and unbiased code verification. The recent addition of architecture analysis allows for defining and enforcing dependency graphs, ensuring adherence to architectural blueprints. Plugins for cloud and agent marketplaces facilitate the integration of Sonar tools into agent workflows. In practice, an agent would use SonarQube features to retrieve project-specific guidelines and architectural constraints, generate code, and then use Sonar's rapid verification tools to identify and fix issues iteratively. The final outer-loop PR checks leverage SonarQube's quality gates for automated blocking of non-compliant code, while also providing detailed reports on metrics, test coverage (with a focus on new code), and duplication. While the agent handles much of the early detection and correction, the human developer remains accountable, tasked with maintaining critical judgment and performing final reviews, ensuring that the agentic development process ultimately serves the goal of reliable and maintainable enterprise software.
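Enforcing a dependency graph can be pictured as checking each observed dependency against a table of allowed ones. The sketch below is a generic illustration, not SonarQube's architecture analysis or its configuration format. The layer names, the `ALLOWED` table, and the convention that a module's first dotted segment is its layer are all assumptions.

```python
# Hypothetical layering rules: each layer may depend only on the layers listed.
ALLOWED: dict[str, set[str]] = {
    "api": {"service"},
    "service": {"repository"},
    "repository": set(),
}


def layer_of(module: str) -> str:
    """Assume the first dotted segment of a module name is its layer."""
    return module.split(".")[0]


def find_violations(
    dependencies: list[tuple[str, str]],
    allowed: dict[str, set[str]],
) -> list[tuple[str, str]]:
    """Return (source, target) pairs that cross layers in a disallowed way."""
    violations = []
    for source, target in dependencies:
        src_layer, dst_layer = layer_of(source), layer_of(target)
        if src_layer == dst_layer:
            continue  # dependencies within a layer are always permitted here
        if dst_layer not in allowed.get(src_layer, set()):
            violations.append((source, target))
    return violations


deps = [
    ("api.orders", "service.orders"),
    ("service.orders", "repository.orders"),
    ("api.orders", "repository.orders"),   # skips the service layer
    ("service.orders", "service.pricing"),
]
violations = find_violations(deps, ALLOWED)
print(violations)
```

A check of this shape is deterministic, which is the property the talk emphasizes: the same code always yields the same verdict, regardless of which model wrote it.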
LLM Benchmark Comparison: Lines of Code and Bugs
Data extracted from this episode
| Model | Lines of Code (for benchmark tasks) | Bugs | Security Vulnerabilities |
|---|---|---|---|
| Gemini & Opus 4.7 | 336,000 | High range | Significant |
| GPT-5.5 | 703,000 | High range | Significant |
Common Questions
While LLMs can rapidly generate code that works for simple or short-lived applications, they often produce code with higher complexity, potential bugs, and security vulnerabilities. Adapting the software development lifecycle with rigorous verification processes is crucial for enterprise-quality code.
Mentioned in this video
Mentioned as an example of LLMs that can generate code, but the focus is on enterprise quality.
Cursor: An AI code editor that, according to a Carnegie Mellon study, initially provided a 3-5x velocity spike in code generation but led to issues like increased complexity and warnings.
An LLM that was benchmarked, showing specific line counts and bug numbers in Sonar's analysis.
Mentioned as one of the LLM models benchmarked, with specific performance data provided.
Slop code bench: A benchmark designed to mimic real-world software development by adding tasks sequentially to code, evaluating complexity and verbosity over time.
SonarQube: The primary tool discussed for static code analysis and verification, being adapted for agent-centric development.
Sonar: The speaker's company, which provides tools for code quality and security, and has developed benchmarks and solutions for LLM-generated code.
Wharton: Mentioned as the institution where a paper on AI's impact on thinking was authored.
OWASP: Mentioned in the context of regulatory reports that Sonar can generate, specifically OWASP reports related to security.