AI Dev 25 x NYC | Manish Kapur: Assessing the Quality of AI Generated Code
Key Moments
AI-generated code requires quality assessment beyond benchmarks; Sonar offers solutions.
Key Insights
AI code generation is rapidly increasing, but its quality and reliability are concerns.
Current benchmarks for AI code quality, like HumanEval, focus on functional correctness, not maintainability or security.
Large, state-of-the-art LLMs do not always produce better code; they can be unnecessarily complex and verbose.
AI-generated code inherits security flaws and logic errors from training data and can be probabilistic and lack explainability.
Sonar provides tools and research to analyze AI-generated code for quality, security, and maintainability, offering a verification layer.
Ensuring high-quality training data is crucial for LLMs to reduce vulnerabilities and improve output.
THE RISE OF AI IN CODE GENERATION
The volume of code being generated by Artificial Intelligence is growing exponentially. While AI significantly boosts developer productivity, a critical concern arises regarding the quality, reliability, and deployability of this AI-generated code. The core challenge isn't the speed of writing code, but the subsequent verification process to ensure it meets production standards without compromising quality or security.
LIMITATIONS OF EXISTING BENCHMARKS
Leading LLMs are primarily evaluated with benchmarks such as HumanEval and MBPP, which measure functional correctness: whether generated code passes the tests for a given task. These benchmarks say little about security, maintainability, or overall quality, leaving a gap in our understanding of how robust AI-produced code really is.
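As a concrete illustration of that gap, here is a hypothetical Python sketch (not from the talk): a function that would pass a HumanEval-style functional test while carrying a SQL-injection flaw that such a benchmark never examines.

```python
# Hypothetical example (not from the talk): functionally "correct" code
# that a pass/fail benchmark accepts despite an obvious security flaw.

def find_user_query(username: str) -> str:
    """Build a user-lookup query by string concatenation (the flaw)."""
    return "SELECT * FROM users WHERE name = '" + username + "'"

# The functional check a benchmark would run passes:
assert find_user_query("alice") == "SELECT * FROM users WHERE name = 'alice'"

# A security-minded check would still flag it: attacker-controlled input
# flows straight into the query text.
def looks_injectable(query_builder) -> bool:
    probe = "x' OR '1'='1"
    return probe in query_builder(probe)

assert looks_injectable(find_user_query)
```

A pass/fail benchmark only runs the first kind of assertion; a quality-focused analysis is what catches the second.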
THE PROBABILISTIC NATURE OF LLMS AND THEIR FLAWS
LLMs are inherently probabilistic, meaning they can 'hallucinate' or produce code that is functional but potentially inefficient, unreliable, complex, or verbose. They inherit existing security bugs and logic errors from their vast training data, which often consists of code written by human developers over decades. Furthermore, LLMs can lack context and are not easily explainable, leading to unpredictable outputs.
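"Functional but needlessly complex or verbose" can be made concrete with a small before/after sketch. Both hypothetical functions below are correct, but the first shows the kind of redundant structure an LLM can emit:

```python
# Hypothetical before/after: both functions are functionally correct,
# but the first shows the redundant structure an LLM can produce.

def max_verbose(values):
    result = None
    for i in range(len(values)):        # index loop where iteration suffices
        if result is None:
            result = values[i]
        else:
            if values[i] > result:      # nested if instead of elif
                result = values[i]
            else:
                pass                    # dead branch, pure noise
    return result

def max_simple(values):
    return max(values) if values else None  # same behavior, one line

assert max_verbose([3, 1, 4]) == max_simple([3, 1, 4]) == 4
```

Both pass the same functional test, which is exactly why correctness-only benchmarks cannot tell them apart.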
COMPLEXITY AND 'CODE SMELLS' IN AI CODE
Research by Sonar, analyzing various LLMs, reveals that larger and newer models do not always equate to superior code quality. These models can generate code that is needlessly complex, leading to higher cognitive and cyclomatic complexity. This can manifest as 'code smells'—indicators of deeper maintainability issues—making the code difficult for human developers to understand, debug, and maintain over time.
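Cyclomatic complexity, one of the metrics mentioned here, can be approximated in a few lines: start at 1 and add one for every decision point. The sketch below is a rough approximation for Python source using the standard `ast` module, not Sonar's actual implementation:

```python
import ast

# Rough sketch of cyclomatic complexity (not Sonar's exact metric):
# start at 1 and add one for each decision point in the source.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp,
                  ast.ExceptHandler, ast.And, ast.Or)

def cyclomatic_complexity(source: str) -> int:
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, DECISION_NODES)
                   for node in ast.walk(tree))

snippet = """
def classify(x):
    if x > 10 and x % 2 == 0:
        return "big even"
    elif x > 10:
        return "big odd"
    return "small"
"""
print(cyclomatic_complexity(snippet))  # → 4: two `if` branches plus one `and`
```

The same idea underlies the complexity scores in the table below: more branches and boolean operators mean more paths a human must hold in mind to understand the code.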
SONAR'S APPROACH TO VERIFICATION AND QUALITY ASSURANCE
Sonar offers solutions to address the challenges of AI-generated code quality. Their platform provides a standardized verification layer, analyzing code for quality, security, and maintainability. This includes detecting over 7,000 types of issues, reducing technical debt, and improving developer productivity. Sonar supports various languages and integrates with DevOps platforms, offering early detection in IDEs and pull requests.
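The idea of a verification layer in a pipeline can be sketched as a simple quality gate: count issues by type in an analysis report and fail the build when thresholds are exceeded. The report format (a `"type"` field) and the thresholds below are assumptions for illustration, not Sonar's actual API:

```python
# Illustrative quality-gate sketch, not Sonar's actual API: the report
# format ("type" field) and thresholds are assumptions for the example.

THRESHOLDS = {"vulnerability": 0, "bug": 0, "code_smell": 10}

def quality_gate(issues: list[dict]) -> bool:
    """Pass only if each issue type stays within its threshold."""
    counts: dict[str, int] = {}
    for issue in issues:
        counts[issue["type"]] = counts.get(issue["type"], 0) + 1
    return all(counts.get(kind, 0) <= limit
               for kind, limit in THRESHOLDS.items())

# One new bug is enough to fail a zero-bug gate:
report = [
    {"type": "bug"},
    {"type": "code_smell"},
]
print("gate passed" if quality_gate(report) else "gate failed")  # → gate failed
```

Running a gate like this in IDEs and pull requests is what "early detection" means in practice: flawed code is blocked before merge rather than found in production.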
ENSURING HIGH-QUALITY TRAINING DATA WITH SONAR SWEEP
A new product, SonarSweep, focuses on ensuring the quality of training data used by LLMs. Applying the principle of 'garbage in, garbage out,' this service cleanses coding datasets before they are fed to models. Evaluations show this can lead to a significant reduction in security vulnerabilities, as cleaner data results in less flawed AI output, a critical step for model builders seeking to improve their LLMs.
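The 'garbage in, garbage out' idea can be sketched as a pre-training filter that drops code samples matching obviously unsafe patterns. The three regex checks below are illustrative stand-ins, not SonarSweep's actual analysis:

```python
import re

# Hedged sketch of dataset cleansing before training: drop samples that
# trip obviously unsafe patterns. These three regexes are illustrative
# stand-ins, not SonarSweep's actual analysis.
SUSPECT_PATTERNS = [
    re.compile(r"\beval\s*\("),                    # dynamic code execution
    re.compile(r"(password|secret)\s*=\s*['\"]"),  # hardcoded credential
    re.compile(r"verify\s*=\s*False"),             # disabled TLS verification
]

def sweep(samples: list[str]) -> list[str]:
    """Keep only samples that trip none of the suspect patterns."""
    return [s for s in samples
            if not any(p.search(s) for p in SUSPECT_PATTERNS)]

corpus = [
    "def add(a, b):\n    return a + b\n",
    "password = 'hunter2'\n",
    "result = eval(user_input)\n",
]
clean = sweep(corpus)
print(len(clean))  # → 1: only the benign sample survives
```

A model fine-tuned on the filtered corpus simply never sees the flawed idioms, which is the mechanism behind the vulnerability reductions reported in the table below.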
KEY TAKEAWAYS FOR AI CODE ADOPTION
The key takeaways emphasize that benchmarks, while necessary, are insufficient for assessing AI code quality. Newer, more capable models bring new challenges, particularly around maintainability and the introduction of subtle bugs, even as functional correctness and security improve. Robust verification processes and a focus on code quality and security are non-negotiable for responsible AI code adoption.
LLM Code Generation vs. Cognitive Complexity
Data extracted from this episode
| LLM Model | Reasoning Mode | Cognitive Complexity | Functional Performance (Benchmark) |
|---|---|---|---|
| Llama | N/A | Low | Low |
| Open Coder 8b | N/A | Low | Low |
| GPT-5 | Minimal | 75% | High |
| Claude 4 | Minimal | 77% | High |
Code Volume for a Single Programming Task Across Different LLMs
Data extracted from this episode
| LLM Model | Lines of Code to Solve Task |
|---|---|
| GPT-4 | 200,000 |
| GPT (High Reasoning Mode) | 75,000 - 80,000 |
Impact of SonarSweep on Security Vulnerabilities in Tested LLM Data
Data extracted from this episode
| Action | Reduction in Security Vulnerabilities |
|---|---|
| Cleaning training data with SonarSweep | 50% - 65% |
Common Questions
Why is AI-generated code a concern even when it works?
LLMs are probabilistic and can hallucinate, leading to code that is functional but potentially inefficient, complex, or contains hidden security issues. Benchmarks often only test functional correctness, not overall quality or security.
Topics
Mentioned in this video
●A technology with historically high developer trust, mentioned in contrast to current trust in AI code.
●Sonar — the company the speaker works for, focused on helping developers and AI produce better code through code health dashboards and analysis tools.
●HumanEval — a benchmark used in the industry to evaluate LLMs for coding tasks, focusing on functional correctness.
●OWASP — a standard for mobile app security with the OWASP Top 10, which Sonar supports and is developing for LLMs.
●SonarSweep — a new product announced by Sonar that sweeps LLM training data to ensure high quality and reduce security vulnerabilities.
●A DevOps platform with which Sonar integrates for code analysis and pipeline integration.
●A simple application example where developers can easily write code, contrasted with complex applications generated by LLMs.
●SonarQube — a tool familiar to users that provides metrics like code smells for maintainability.
●Claude 4 — a version of Claude, evaluated for code generation quality and security.
●GPT-5 — an LLM evaluated for cognitive complexity, showing a high score (75%) even with minimal reasoning.
●SonarQube Server — a server version of SonarQube for on-premises code analysis and evaluation.
●Open Coder 8b — a relatively small, open-source LLM that, along with Meta Llama, showed less cognitive complexity but also less functional performance.
●SonarQube Cloud — a version of SonarQube available in the cloud for code analysis and evaluation.