AI Dev 25 x NYC | Manish Kapur: Assessing the Quality of AI Generated Code

DeepLearning.AI
Education · 3 min read · 28 min video
Dec 5, 2025


TL;DR

AI-generated code requires quality assessment beyond benchmarks; Sonar offers solutions.

Key Insights

1. AI code generation is rapidly increasing, but its quality and reliability are concerns.

2. Current benchmarks for AI code quality, like HumanEval, focus on functional correctness, not maintainability or security.

3. Large, state-of-the-art LLMs do not always produce better code; they can be unnecessarily complex and verbose.

4. AI-generated code inherits security flaws and logic errors from training data, and LLMs' probabilistic, hard-to-explain behavior compounds the risk.

5. Sonar provides tools and research to analyze AI-generated code for quality, security, and maintainability, offering a verification layer.

6. Ensuring high-quality training data is crucial for LLMs to reduce vulnerabilities and improve output.

THE RISE OF AI IN CODE GENERATION

The volume of AI-generated code is growing exponentially. While AI significantly boosts developer productivity, it raises critical concerns about the quality, reliability, and deployability of that code. The core challenge is not the speed of writing code but the verification that follows: ensuring the code meets production standards without compromising quality or security.

LIMITATIONS OF EXISTING BENCHMARKS

Leading LLMs are primarily evaluated with benchmarks like HumanEval and MBPP, which measure functional correctness: whether generated code passes the tests for specific tasks. These benchmarks rarely assess security, maintainability, or overall code quality, leaving a gap in our understanding of how robust AI-produced code really is.
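
To make the gap concrete, here is a hypothetical, HumanEval-style check (the function and test are invented for this sketch, not taken from the talk): a solution can pass the benchmark assertion while carrying an obvious security flaw that the benchmark never measures.

```python
# Hypothetical illustration: a "solution" that would pass a
# HumanEval-style functional check while hiding a security flaw.

def add_numbers(a: str, b: str) -> int:
    # Functionally correct for benchmark inputs, but calling eval()
    # on untrusted strings is a classic code-injection risk.
    return eval(a) + eval(b)

# A benchmark-style test only checks input/output behavior:
assert add_numbers("2", "3") == 5  # passes; the flaw goes unmeasured
```

The assertion succeeds, so a correctness-only benchmark would score this solution perfectly; only a quality or security analysis would flag the `eval` call.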

THE PROBABILISTIC NATURE OF LLMS AND THEIR FLAWS

LLMs are inherently probabilistic, meaning they can 'hallucinate' or produce code that is functional but potentially inefficient, unreliable, complex, or verbose. They inherit existing security bugs and logic errors from their vast training data, which often consists of code written by human developers over decades. Furthermore, LLMs can lack context and are not easily explainable, leading to unpredictable outputs.
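
As a hedged illustration of an inherited flaw (the table schema and function names are invented for this sketch), consider the string-interpolated SQL pattern that pervades decades of human-written training code. Both versions below behave identically on benign input, which is exactly why the flaw survives functional testing:

```python
import sqlite3

def find_user_unsafe(conn, name):
    # Inherited anti-pattern: string interpolation invites SQL injection.
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(conn, name):
    # Parameterized query: the fix a verification layer would suggest.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (name,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

# Identical results for benign input...
assert find_user_unsafe(conn, "alice") == find_user_safe(conn, "alice") == [(1,)]
# ...but a crafted input makes the unsafe version dump every row.
assert find_user_unsafe(conn, "x' OR '1'='1") == [(1,)]
assert find_user_safe(conn, "x' OR '1'='1") == []
```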

COMPLEXITY AND 'CODE SMELLS' IN AI CODE

Research by Sonar, analyzing various LLMs, reveals that larger and newer models do not always equate to superior code quality. These models can generate code that is needlessly complex, leading to higher cognitive and cyclomatic complexity. This can manifest as 'code smells'—indicators of deeper maintainability issues—making the code difficult for human developers to understand, debug, and maintain over time.
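
A minimal before/after sketch (both functions are hypothetical, not from Sonar's research) shows how identical behavior can come at very different cognitive-complexity costs: each extra level of nesting adds to the mental load of reading the code.

```python
def classify_verbose(n):
    # Deep nesting: every branch level adds cognitive complexity.
    if n is not None:
        if n >= 0:
            if n % 2 == 0:
                return "even"
            else:
                return "odd"
        else:
            return "negative"
    else:
        return "missing"

def classify_flat(n):
    # Guard clauses: identical behavior with far less nesting.
    if n is None:
        return "missing"
    if n < 0:
        return "negative"
    return "even" if n % 2 == 0 else "odd"

# Behavior is unchanged; only the structure (and readability) differs.
for x in (None, -3, 2, 7):
    assert classify_verbose(x) == classify_flat(x)
```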

SONAR'S APPROACH TO VERIFICATION AND QUALITY ASSURANCE

Sonar offers solutions to address the challenges of AI-generated code quality. Their platform provides a standardized verification layer, analyzing code for quality, security, and maintainability. This includes detecting over 7,000 types of issues, reducing technical debt, and improving developer productivity. Sonar supports various languages and integrates with DevOps platforms, offering early detection in IDEs and pull requests.
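
This is not Sonar's algorithm, but a toy sketch of what one static-analysis signal could look like: walking Python's `ast` to measure maximum nesting depth, a crude proxy for cognitive complexity. Real analyzers compute far richer metrics across thousands of rule types.

```python
import ast

def max_nesting(source: str) -> int:
    """Deepest nesting of control-flow statements in the source."""
    NESTED = (ast.If, ast.For, ast.While, ast.Try, ast.With)

    def depth(node, d=0):
        d2 = d + 1 if isinstance(node, NESTED) else d
        # Take the maximum over this node and all its children.
        return max([d2] + [depth(c, d2) for c in ast.iter_child_nodes(node)])

    return depth(ast.parse(source))

shallow = "def f(x):\n    return x + 1\n"
deep = "def f(x):\n    if x:\n        for i in x:\n            if i:\n                pass\n"
assert max_nesting(shallow) == 0
assert max_nesting(deep) == 3
```

A tool wired into an IDE or pull-request check could surface such a score early, before the complexity hardens into technical debt.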

ENSURING HIGH-QUALITY TRAINING DATA WITH SONAR SWEEP

A new product, SonarSweep, focuses on ensuring the quality of the training data used by LLMs. Applying the principle of 'garbage in, garbage out,' the service cleanses coding datasets before they are fed to models. Evaluations show this can significantly reduce security vulnerabilities: cleaner data yields less flawed AI output, a critical step for model builders seeking to improve their LLMs.
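
SonarSweep's actual checks are not detailed in this summary; the sketch below only illustrates the 'garbage in, garbage out' idea with two deliberately naive quality gates (valid syntax, no obviously risky calls — the `BANNED` list is illustrative):

```python
import ast

BANNED = ("eval(", "exec(", "os.system(")  # illustrative risk markers

def is_clean(sample: str) -> bool:
    # Gate 1: the sample must at least be valid Python.
    try:
        ast.parse(sample)
    except SyntaxError:
        return False
    # Gate 2: drop samples containing obviously risky calls.
    return not any(tok in sample for tok in BANNED)

dataset = [
    "def square(x):\n    return x * x\n",                   # kept
    "def broken(:\n    pass\n",                             # dropped: syntax error
    "def run(cmd):\n    import os\n    os.system(cmd)\n",   # dropped: risky call
]
cleaned = [s for s in dataset if is_clean(s)]
assert cleaned == [dataset[0]]
```

A production pipeline would apply full static analysis rather than substring checks, but the shape is the same: filter flawed samples out before the model ever learns from them.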

KEY TAKEAWAYS FOR AI CODE ADOPTION

The key takeaways emphasize that while benchmarks are necessary, they are insufficient for assessing AI code quality. Newer, more capable models bring new challenges, particularly around maintainability and the introduction of complex bugs, despite improvements in functional correctness and security. Robust verification processes and a focus on code quality and security are non-negotiable for responsible AI code adoption.

Guidelines for Using AI-Generated Code

Practical takeaways from this episode

Do This

Verify the quality, security, and maintainability of AI-generated code.
Understand the specific 'personality' and reasoning modes of different LLMs for your use case.
Use static code analysis tools for consistent and deterministic results.
Ensure AI-generated code integrates well with existing architecture and requirements.
Utilize AI code fixing tools for potential solutions, but always review them.
Clean LLM training data to reduce security vulnerabilities (e.g., using SonarSweep).

Avoid This

Solely rely on benchmarks (like HumanEval, MBPP) to assess code quality.
Assume larger or newer models are always better; smaller models can be sufficient.
Deploy AI-generated code without thorough review, especially for mission-critical applications.
Trust LLMs for code review; their probabilistic nature can lead to inconsistent or incorrect feedback.
Ignore potential issues like cognitive complexity, cyclomatic complexity, and code smells.
Use LLM-generated code for sensitive applications (healthcare, finance) without rigorous verification.

LLM Code Generation vs. Cognitive Complexity

Data extracted from this episode

| LLM Model | Reasoning Mode | Cognitive Complexity | Functional Performance (Benchmark) |
| --- | --- | --- | --- |
| Llama | N/A | Low | Low |
| Open Coder 8b | N/A | Low | Low |
| GPT-5 | Minimal | 75% | High |
| Claude 4 | Minimal | 77% | High |

Code Volume for a Single Programming Task Across Different LLMs

Data extracted from this episode

| LLM Model | Lines of Code to Solve Task |
| --- | --- |
| GPT-4 | 200,000 |
| GPT (High Reasoning Mode) | 75,000 - 80,000 |

Impact of SonarSweep on Security Vulnerabilities in Tested LLM Data

Data extracted from this episode

| Action | Reduction in Security Vulnerabilities |
| --- | --- |
| Cleaning training data with SonarSweep | 50% - 65% |

Common Questions

Why isn't passing a benchmark enough to trust AI-generated code?

LLMs are probabilistic and can hallucinate, producing code that is functional but potentially inefficient, overly complex, or hiding security issues. Benchmarks often test only functional correctness, not overall quality or security.

Topics

Mentioned in this video

concept: cloud native

Mentioned as a technology with historically high developer trust, contrasting with trust in AI code.

organization: Sonar

The company the speaker works for, focused on helping developers and AI produce better code through code health dashboards and analysis tools.

study: MBPP

A benchmark used in the industry to evaluate LLMs on coding tasks, focusing on functional correctness.

organization: OWASP

A security organization whose Top 10 standards Sonar supports; the talk notes a version targeting LLMs is in development.

product: SonarSweep

A new product announced by Sonar that sweeps LLM training data to ensure high quality and reduce security vulnerabilities.

company: Bitbucket

A DevOps platform with which Sonar integrates for code analysis and pipeline integration.

concept: BMI

Mentioned as a simple application example where developers can easily write the code themselves, contrasting with complex applications generated by LLMs.

software: SonarQube

A tool familiar to users that provides metrics like code smells for maintainability.

software: Claude 3.7

Mentioned as a version of Claude, evaluated for code generation quality and security.

software: Claude Sonnet 3.7

An LLM evaluated for cognitive complexity, showing a high score (75%) even with minimal reasoning.

software: SonarQube Server

A server version of SonarQube for on-premises code analysis and evaluation.

concept: DevOps

Mentioned as a technology where developer trust was historically high, contrasting with current trust in AI code.

software: Open Coder 8b

A relatively small, open-source LLM that, along with Meta Llama, showed less cognitive complexity but also less functional performance.

software: SonarQube Cloud

A version of SonarQube available in the cloud for code analysis and evaluation.

tool: Java
organization: GitLab
software: Stack Overflow
