AI Dev 25 x NYC | Manish Kapur: Assessing the Quality of AI Generated Code
Key Moments
AI-generated code requires quality assessment beyond benchmarks; Sonar offers solutions.
Key Insights
AI code generation is rapidly increasing, but its quality and reliability are concerns.
Current benchmarks for AI code quality, like HumanEval, focus on functional correctness, not maintainability or security.
Large, state-of-the-art LLMs do not always produce better code; they can be unnecessarily complex and verbose.
AI-generated code inherits security flaws and logic errors from training data and can be probabilistic and lack explainability.
Sonar provides tools and research to analyze AI-generated code for quality, security, and maintainability, offering a verification layer.
Ensuring high-quality training data is crucial for LLMs to reduce vulnerabilities and improve output.
THE RISE OF AI IN CODE GENERATION
The volume of code being generated by Artificial Intelligence is growing exponentially. While AI significantly boosts developer productivity, a critical concern arises regarding the quality, reliability, and deployability of this AI-generated code. The core challenge isn't the speed of writing code, but the subsequent verification process to ensure it meets production standards without compromising quality or security.
LIMITATIONS OF EXISTING BENCHMARKS
Leading LLMs are primarily evaluated with benchmarks such as HumanEval and MBPP, which measure functional correctness: whether generated code passes the tests for a given task. These benchmarks say little about security, maintainability, or overall quality, leaving a gap in our understanding of how robust AI-produced code really is.
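As a concrete illustration of that gap, here is a hypothetical Python sketch (not from the talk): a function that would pass a HumanEval-style functional test while carrying a SQL-injection flaw that such a benchmark never examines.

```python
# Hypothetical example (not from the talk): functionally "correct" code
# that a pass/fail benchmark accepts despite an obvious security flaw.

def find_user_query(username: str) -> str:
    """Build a user-lookup query by string concatenation (the flaw)."""
    return "SELECT * FROM users WHERE name = '" + username + "'"

# The functional check a benchmark would run passes:
assert find_user_query("alice") == "SELECT * FROM users WHERE name = 'alice'"

# A security-minded check would still flag it: attacker-controlled input
# flows straight into the query text.
def looks_injectable(query_builder) -> bool:
    probe = "x' OR '1'='1"
    return probe in query_builder(probe)

assert looks_injectable(find_user_query)
```

A pass/fail benchmark only runs the first kind of assertion; a quality-focused analysis is what catches the second.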
THE PROBABILISTIC NATURE OF LLMS AND THEIR FLAWS
LLMs are inherently probabilistic, meaning they can 'hallucinate' or produce code that is functional but potentially inefficient, unreliable, complex, or verbose. They inherit existing security bugs and logic errors from their vast training data, which often consists of code written by human developers over decades. Furthermore, LLMs can lack context and are not easily explainable, leading to unpredictable outputs.
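"Functional but needlessly complex or verbose" can be made concrete with a small before/after sketch. Both hypothetical functions below are correct, but the first shows the kind of redundant structure an LLM can emit:

```python
# Hypothetical before/after: both functions are functionally correct,
# but the first shows the redundant structure an LLM can produce.

def max_verbose(values):
    result = None
    for i in range(len(values)):        # index loop where iteration suffices
        if result is None:
            result = values[i]
        else:
            if values[i] > result:      # nested if instead of elif
                result = values[i]
            else:
                pass                    # dead branch, pure noise
    return result

def max_simple(values):
    return max(values) if values else None  # same behavior, one line

assert max_verbose([3, 1, 4]) == max_simple([3, 1, 4]) == 4
```

Both pass the same functional test, which is exactly why correctness-only benchmarks cannot tell them apart.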
COMPLEXITY AND 'CODE SMELLS' IN AI CODE
Research by Sonar, analyzing various LLMs, reveals that larger and newer models do not always equate to superior code quality. These models can generate code that is needlessly complex, leading to higher cognitive and cyclomatic complexity. This can manifest as 'code smells'—indicators of deeper maintainability issues—making the code difficult for human developers to understand, debug, and maintain over time.
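Cyclomatic complexity, one of the metrics mentioned here, can be approximated in a few lines: start at 1 and add one for every decision point. The sketch below is a rough approximation for Python source using the standard `ast` module, not Sonar's actual implementation:

```python
import ast

# Rough sketch of cyclomatic complexity (not Sonar's exact metric):
# start at 1 and add one for each decision point in the source.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp,
                  ast.ExceptHandler, ast.And, ast.Or)

def cyclomatic_complexity(source: str) -> int:
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, DECISION_NODES)
                   for node in ast.walk(tree))

snippet = """
def classify(x):
    if x > 10 and x % 2 == 0:
        return "big even"
    elif x > 10:
        return "big odd"
    return "small"
"""
print(cyclomatic_complexity(snippet))  # → 4: two `if` branches plus one `and`
```

The same idea underlies the complexity scores in the table below: more branches and boolean operators mean more paths a human must hold in mind to understand the code.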
SONAR'S APPROACH TO VERIFICATION AND QUALITY ASSURANCE
Sonar offers solutions to address the challenges of AI-generated code quality. Their platform provides a standardized verification layer, analyzing code for quality, security, and maintainability. This includes detecting over 7,000 types of issues, reducing technical debt, and improving developer productivity. Sonar supports various languages and integrates with DevOps platforms, offering early detection in IDEs and pull requests.
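The idea of a verification layer in a pipeline can be sketched as a simple quality gate: count issues by type in an analysis report and fail the build when thresholds are exceeded. The report format (a `"type"` field) and the thresholds below are assumptions for illustration, not Sonar's actual API:

```python
# Illustrative quality-gate sketch, not Sonar's actual API: the report
# format ("type" field) and thresholds are assumptions for the example.

THRESHOLDS = {"vulnerability": 0, "bug": 0, "code_smell": 10}

def quality_gate(issues: list[dict]) -> bool:
    """Pass only if each issue type stays within its threshold."""
    counts: dict[str, int] = {}
    for issue in issues:
        counts[issue["type"]] = counts.get(issue["type"], 0) + 1
    return all(counts.get(kind, 0) <= limit
               for kind, limit in THRESHOLDS.items())

# One new bug is enough to fail a zero-bug gate:
report = [
    {"type": "bug"},
    {"type": "code_smell"},
]
print("gate passed" if quality_gate(report) else "gate failed")  # → gate failed
```

Running a gate like this in IDEs and pull requests is what "early detection" means in practice: flawed code is blocked before merge rather than found in production.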
ENSURING HIGH-QUALITY TRAINING DATA WITH SONAR SWEEP
A new product, SonarSweep, focuses on ensuring the quality of training data used by LLMs. Applying the principle of 'garbage in, garbage out,' this service cleanses coding datasets before they are fed to models. Evaluations show this can lead to a significant reduction in security vulnerabilities, as cleaner data results in less flawed AI output, a critical step for model builders seeking to improve their LLMs.
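The 'garbage in, garbage out' idea can be sketched as a pre-training filter that drops code samples matching obviously unsafe patterns. The three regex checks below are illustrative stand-ins, not SonarSweep's actual analysis:

```python
import re

# Hedged sketch of dataset cleansing before training: drop samples that
# trip obviously unsafe patterns. These three regexes are illustrative
# stand-ins, not SonarSweep's actual analysis.
SUSPECT_PATTERNS = [
    re.compile(r"\beval\s*\("),                    # dynamic code execution
    re.compile(r"(password|secret)\s*=\s*['\"]"),  # hardcoded credential
    re.compile(r"verify\s*=\s*False"),             # disabled TLS verification
]

def sweep(samples: list[str]) -> list[str]:
    """Keep only samples that trip none of the suspect patterns."""
    return [s for s in samples
            if not any(p.search(s) for p in SUSPECT_PATTERNS)]

corpus = [
    "def add(a, b):\n    return a + b\n",
    "password = 'hunter2'\n",
    "result = eval(user_input)\n",
]
clean = sweep(corpus)
print(len(clean))  # → 1: only the benign sample survives
```

A model fine-tuned on the filtered corpus simply never sees the flawed idioms, which is the mechanism behind the vulnerability reductions reported in the table below.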
KEY TAKEAWAYS FOR AI CODE ADOPTION
The key takeaways emphasize that benchmarks, while necessary, are insufficient for assessing AI code quality. Newer, more capable models bring new challenges, particularly around maintainability and the introduction of subtle bugs, even as functional correctness and security improve. Robust verification processes and a focus on code quality and security are non-negotiable for responsible AI code adoption.
LLM Code Generation vs. Cognitive Complexity
Data extracted from this episode
| LLM Model | Reasoning Mode | Cognitive Complexity | Functional Performance (Benchmark) |
|---|---|---|---|
| Llama | N/A | Low | Low |
| Open Coder 8b | N/A | Low | Low |
| GPT-5 | Minimal | 75% | High |
| Claude 4 | Minimal | 77% | High |
Code Volume for a Single Programming Task Across Different LLMs
Data extracted from this episode
| LLM Model | Lines of Code to Solve Task |
|---|---|
| GPT-4 | 200,000 |
| GPT (High Reasoning Mode) | 75,000 - 80,000 |
Impact of SonarSweep on Security Vulnerabilities in Tested LLM Data
Data extracted from this episode
| Action | Reduction in Security Vulnerabilities |
|---|---|
| Cleaning training data with SonarSweep | 50% - 65% |
Common Questions
Why is AI-generated code a concern even when it works?
LLMs are probabilistic and can hallucinate, leading to code that is functional but potentially inefficient, complex, or contains hidden security issues. Benchmarks often only test functional correctness, not overall quality or security.
Topics
Mentioned in this video
●A technology with historically high developer trust, mentioned in contrast to current trust in AI code.
●Sonar — the company the speaker works for, focused on helping developers and AI produce better code through code health dashboards and analysis tools.
●HumanEval — a benchmark used in the industry to evaluate LLMs for coding tasks, focusing on functional correctness.
●OWASP — a standard for mobile app security with the OWASP Top 10, which Sonar supports and is developing for LLMs.
●SonarSweep — a new product announced by Sonar that sweeps LLM training data to ensure high quality and reduce security vulnerabilities.
●A DevOps platform with which Sonar integrates for code analysis and pipeline integration.
●A simple application example where developers can easily write code, contrasted with complex applications generated by LLMs.
●SonarQube — a tool familiar to users that provides metrics like code smells for maintainability.
●Claude 4 — a version of Claude, evaluated for code generation quality and security.
●GPT-5 — an LLM evaluated for cognitive complexity, showing a high score (75%) even with minimal reasoning.
●SonarQube Server — a server version of SonarQube for on-premises code analysis and evaluation.
●Open Coder 8b — a relatively small, open-source LLM that, along with Meta Llama, showed less cognitive complexity but also less functional performance.
●SonarQube Cloud — a version of SonarQube available in the cloud for code analysis and evaluation.