Key Moments
⚡️Making DeepSeek v4 outperform Opus 4.7 with Taste — @AhmadAwais , CommandCode.ai
Open-source models like DeepSeek can now outperform premium ones like Opus 4.7 once specific 'tool confusion' errors are fixed, a problem previously costing billions in token usage.
Key Insights
Ahmad Awais's 'taste' system encodes personal preferences and learned behaviors into 'taste files' that LLMs can utilize.
Tool confusion in open models such as DeepSeek V4 Pro causes the same failed tool call to be repeated (50+ failures on average), typically when a tool schema has optional parameters or the arguments sent do not match the schema.
A 'validate-then-repair' approach, which started as roughly 3,200 lines of repair code and grew to 16,000 repair variations, has fixed tool calling for models like DeepSeek, Kimi, and MiniMax.
After processing hundreds of billions of tokens, Command Code has observed that models with fewer tool-call errors perform better, becoming more creative and exploring for longer.
The same 'repair tool logic' used for coding issues has been successfully applied to fix 'design slop' in LLM-generated designs, by providing a compositional framework of seven design patterns.
Command Code is being open-sourced and aims to be a hackable, 'Apple-like' platform for top-tier models (open and closed), rather than a 'soup' of all models.
From DevRel to AI Engineering: The Genesis of Command Code
Ahmad Awais, whose background spans more than 25 years in open source and includes a significant role in the WordPress community, moved into AI engineering in July 2020. Early access to GPT-3 let him conceive and build CLAI, a precursor to GitHub Copilot that suggested code snippets. That work continued with LangBase, an AI cloud that handled 1.2 billion agent runs per month, and the team eventually pivoted to Command Code. The core philosophy behind Command Code is that all agents are fundamentally coding agents, so coding capability should be central. Awais's extensive experience with cutting-edge projects, which often lack documentation, led him to develop 'Taste,' a meta-neuro-symbolic model that encodes personal opinions and preferences into 'taste files' or 'skill files.' These files let AI agents learn and adapt to user-specific patterns, such as preferring pnpm for package installations but npm for local CLI linking, and so steer LLMs in the right direction.
The 'Tool Confusion' problem in open source models
A significant challenge with open-source LLMs, particularly DeepSeek V4 Pro, is 'tool confusion': the model struggles to interpret and execute tool calls correctly. For example, if a tool schema specifies optional parameters, DeepSeek might send null values or empty objects for them instead of omitting them, which fails validation. Zod, a strict schema validation library, flags these errors, but instead of correcting its approach DeepSeek repeats the same incorrect tool call, averaging 50-60 failures per billion tokens. Awais speculates that this behavior stems from training data: models that learn from superior models may develop an inherent belief in the correctness of their own outputs and resist correction. As a result, these models seem unreliable and slow despite their potential.
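To make the failure mode concrete, here is a minimal sketch of strict argument validation. It is a hand-rolled stand-in for a schema library such as Zod, not Command Code's code. The tool schema (`path`, `limit`) and the `validateArgs` helper are hypothetical names chosen for illustration.

```typescript
// Hypothetical tool schema: `path` is required, `limit` is optional.
type FieldSpec = { type: "string" | "number"; optional?: boolean };
type ToolSchema = Record<string, FieldSpec>;

const readFileSchema: ToolSchema = {
  path: { type: "string" },
  limit: { type: "number", optional: true },
};

// Strict validation: an optional field may be omitted, but when it is
// present it must have the declared type. `null` does not count as omitted.
function validateArgs(schema: ToolSchema, args: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const [field, spec] of Object.entries(schema)) {
    const value = args[field];
    if (value === undefined) {
      if (!spec.optional) errors.push(`${field}: required`);
      continue;
    }
    if (typeof value !== spec.type) {
      const got = value === null ? "null" : typeof value;
      errors.push(`${field}: expected ${spec.type}, received ${got}`);
    }
  }
  for (const key of Object.keys(args)) {
    if (!(key in schema)) errors.push(`${key}: unrecognized key`);
  }
  return errors;
}

// A well-formed call simply omits the optional parameter.
const good = validateArgs(readFileSchema, { path: "src/index.ts" });

// A "confused" call sends null for the optional parameter and is rejected.
const confused = validateArgs(readFileSchema, { path: "src/index.ts", limit: null });

console.log(good.length, confused);
```

The second call is the pattern described above: the argument is semantically harmless, yet a strict validator rejects it, and a model that keeps resending it fails every time.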
A 'Validate-then-Repair' solution for tool calling
To combat tool confusion, Awais developed a 'validate-then-repair' layer. Instead of just relying on error messages to correct the LLM, this system first attempts to repair the erroneous tool call. This began with approximately 3,200 lines of code for repair logic, generating 'repair files' similar to database migrations, each addressing a specific failure pattern. For instance, if an LLM emits a JSON string when an array is expected, the repair logic converts it to an array and provides a 'repair hint' to the LLM. This approach acts like teaching someone to drive: you first prevent an accident, then explain how to avoid it. The results showed a dramatic improvement, with subsequent tool calls becoming correct after the first repair, turning previously unusable models like DeepSeek V4 Flash into competitive options. This repair logic has been generalized, fixing issues for models like Kimi and MiniMax across hundreds of billions of tokens.
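The validate-then-repair flow described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the actual Command Code implementation: the `Repair` shape, the repair id, the hint wording, and the `files` parameter name are all hypothetical.

```typescript
// A repair returns a fixed value plus a hint for the model,
// or null when the repair does not apply to this input.
type Repair = {
  id: string; // hypothetical id, numbered like a migration file
  apply: (value: unknown) => { value: unknown; hint: string } | null;
};

// Handles the case from the text: a JSON string arrives where an array was expected.
const jsonStringToArray: Repair = {
  id: "0001-json-string-to-array",
  apply: (value) => {
    if (typeof value !== "string") return null;
    try {
      const parsed: unknown = JSON.parse(value);
      if (!Array.isArray(parsed)) return null;
      return {
        value: parsed,
        hint: "Pass `files` as a JSON array, not as a string containing JSON.",
      };
    } catch {
      return null; // not valid JSON, so this repair does not apply
    }
  },
};

type Outcome =
  | { ok: true; value: string[]; hints: string[] }
  | { ok: false; error: string };

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

// Validate first; on failure try each repair in order and re-validate the result.
// Only when no repair produces a valid value does the error reach the model.
function validateThenRepair(value: unknown, repairs: Repair[]): Outcome {
  if (isStringArray(value)) return { ok: true, value, hints: [] };
  for (const repair of repairs) {
    const fixed = repair.apply(value);
    if (fixed && isStringArray(fixed.value)) {
      return { ok: true, value: fixed.value, hints: [fixed.hint] };
    }
  }
  return { ok: false, error: "expected an array of strings" };
}

const repaired = validateThenRepair('["a.ts","b.ts"]', [jsonStringToArray]);
const rejected = validateThenRepair(42, [jsonStringToArray]);
console.log(repaired, rejected);
```

In this sketch the tool call proceeds with the repaired value, and the hint is returned alongside the result so the model can correct its next call, which mirrors the "prevent the accident, then explain" analogy.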
Repair logic improves model creativity and performance
The implementation of the repair logic has a profound effect beyond just fixing tool calls; it significantly enhances the perceived intelligence and creativity of LLMs. When models encounter fewer tool call errors, they become more exploratory and can sustain operations for longer periods. Users have reported running DeepSeek with Command Code for over 12-hour sessions without issues. This suggests that tool confusion is a major bottleneck. By reducing these errors, models seem to gain a 'creative boost,' allowing them to explore different paths and maintain coherence. This is analogous to how models perform better without permission bypasses, indicating that a cleaner inference path leads to superior output.
Applying repair logic to design flaws
The 'repair tool logic' has proven versatile, extending beyond coding to address 'design slop': the common, often uninspired visual output of LLMs (e.g., the ubiquitous indigo-purple gradient). By analyzing patterns from conversations with designers, Awais identified seven key design patterns and ten 'design smells' that LLMs tend to exhibit. Applying a similar repair framework has allowed Command Code to guide LLMs toward better design outcomes. For instance, LLMs can be nudged to consider the 'intention' behind a design (is it a dashboard for monitoring?) and to define color palettes in the OKLCH color space, which offers better control over lightness and hue than HSL. This framework enables LLMs to produce designs that look perceptibly more human-curated, reducing the distinct 'AI' look.
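As a small illustration of why OKLCH gives finer control over lightness, the sketch below builds a palette by holding hue and chroma fixed and varying only the lightness channel, emitting CSS `oklch()` color strings. The function name `oklchRamp` and the specific hue, chroma, and lightness range are arbitrary assumptions for the example, not values from Command Code's design framework.

```typescript
// Build a lightness ramp in OKLCH: hue and chroma stay fixed,
// only the lightness channel changes from step to step.
function oklchRamp(hue: number, chroma: number, steps: number): string[] {
  const colors: string[] = [];
  for (let i = 0; i < steps; i++) {
    // Spread lightness evenly from 95% (lightest) down to 25% (darkest).
    const lightness = steps === 1 ? 60 : 95 - (70 * i) / (steps - 1);
    colors.push(`oklch(${lightness.toFixed(1)}% ${chroma} ${hue})`);
  }
  return colors;
}

// Five shades of a single blue-ish hue (hue angle 250, chroma 0.12).
const ramp = oklchRamp(250, 0.12, 5);
console.log(ramp.join("\n"));
```

Because OKLCH is designed to be perceptually uniform, equal steps in the lightness channel read as roughly equal steps in brightness, which is not the case for the lightness channel in HSL.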
Taste as a transparent, user-controlled memory system
Awais emphasizes that 'Taste' is a form of transparent, user-controlled memory integrated into Command Code. Unlike opaque memory systems, taste files are stored directly within the user's Git repository as markdown files. This allows for review and modification in every Pull Request (PR). The system automatically learns micro-decisions and preferences, such as always using pnpm for dependencies or preferring Vitest over other testing frameworks. It can even adapt to changes within a project, like switching from Commander to Meow for CLI building. This ensures that the 'learned' behaviors are never stale and are fully transparent to the user, forming a powerful, personalized AI assistant that evolves with the developer's workflow.
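The summary does not show what a taste file looks like, so the following is purely a hypothetical sketch of the idea: preferences kept as plain markdown in the repository and read back as simple task-to-tool rules. The file content, the `- task: tool` bullet format, and the `parseTaste` helper are all invented for illustration.

```typescript
// Hypothetical taste file content; the real format is not described in the summary.
const tasteMarkdown = `
# Taste: tooling
- install dependencies: pnpm
- link local CLI: npm
- tests: Vitest
`;

// Read every "- task: tool" bullet into a lookup table, ignoring other lines.
function parseTaste(markdown: string): Map<string, string> {
  const rules = new Map<string, string>();
  for (const line of markdown.split("\n")) {
    const match = /^-\s+(.+?):\s+(.+)$/.exec(line.trim());
    if (match) rules.set(match[1], match[2]);
  }
  return rules;
}

const prefs = parseTaste(tasteMarkdown);
console.log(prefs.get("install dependencies"), prefs.get("link local CLI"));
```

Keeping preferences in a reviewable text file like this is what makes the memory transparent: a change in preference shows up as an ordinary diff in a pull request.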
Command Code's future: Open-sourcing and an 'Apple-like' model philosophy
Command Code, a project with roots dating back to 2020 and significant backing, is set to be open-sourced. Awais plans for it to be fully hackable, allowing users to modify any part of the system. This aligns with his background in WordPress core development and reflects a philosophy of enabling deep customization. The vision for Command Code is not to become a vast collection of every available model, but rather to emulate Apple's approach: offering a curated selection of the 'best of the best' commercial and open-source models, while remaining fully customizable, including the ability to integrate local models. This curated yet hackable model aims to provide a high-quality, user-controlled experience for developers.
Common Questions
Command Code is a full-fledged coding agent designed to help developers command LLMs. It differentiates itself by its 'Taste' feature, which learns and applies developer preferences, and its focus on improving the tool-calling capabilities of open-source models.
Mentioned in this video
Mentioned in the context of the guest's past roles.
The company where Ahmad Awais previously held the role of VP of DevRel.
One of the companies where Ahmad Awais worked, primarily in an open-source capacity.
Another company where Ahmad Awais worked, contributing to his open-source background.
A model announced by DeepSeek, with hiring efforts for its development mentioned following the discussion on open-source coding agents.
The venture fund associated with Tom Preston-Werner, which observed positive changes in DeepSeek's performance.
Models from this company also showed issues with tool calling that were fixed using the developed repair logic.
Ahmad Awais: The guest on the podcast, discussing his work on Command Code and AI engineering. He has a background in open source and developer relations.
Mentioned for having a current viral post about 'design slop' in LLM-generated designs.
Tom Preston-Werner: An investor whose fund (PW) noted the significant improvement in DeepSeek V4 Flash after the tool confusion repairs.
Matt Mullenweg: Founder of WordPress and an angel investor in Command Code, who reached out upon hearing about the decision to open source the project.
WordPress: The platform where Ahmad Awais spent 13 years working on its core, contributing to his background in open-source development.
CLAI: An early AI project by Ahmad Awais, focused on suggesting the next line of code, which evolved into Command Code.
DeepSeek V4 Pro: An open-source LLM that showed 'tool confusion' errors, prompting the development of repair logic to improve its performance.
Opus 4.7: A commercial LLM used as a benchmark against which DeepSeek V4 Pro's performance was compared after implementing repair logic.
Zod: A schema definition library used to identify and report errors in tool calls made by LLMs, particularly relevant in the discussion of 'tool confusion'.
A framework for TypeScript mentioned in the context of desired developer preferences.
A tool for building CLIs, mentioned as an example of how Command Code learns and adapts to changing tool usage within a project.
A framework suggested as an alternative to TRPC, indicating specific implementation choices LLMs might make and which can be guided.
A library mentioned for interactive CLI elements that Command Code's 'Taste' feature can enforce.
LangBase: An AI cloud platform built by Ahmad Awais's team, which handled a large volume of agent runs before pivoting to Command Code.
Command Code: A full-fledged coding agent and AI cloud platform developed by Ahmad Awais's team, focusing on making LLMs more effective for coding tasks.
Another LLM that exhibited similar issues to DeepSeek with tool calling, which were addressed by the same repair logic.
TypeScript: A programming language mentioned in the context of a user's preference for specific tools such as tsup.
A framework preference mentioned by the speaker, illustrating how user preferences are handled.
Vitest: A preferred testing framework mentioned by the speaker, highlighting the 'Taste' feature's ability to learn and enforce preferences.
Meow: A CLI framework that, if adopted in a project, Command Code's 'Taste' feature would automatically recognize and manage.
A framework mentioned as an example of a technology that developers might choose to avoid, demonstrating how LLM-generated code edits can be reduced.
A commercial LLM that is lenient with tool calls, making it more forgiving than some open-source models when the Command Code harness makes mistakes.
Taste: A meta-neuro-symbolic model developed by Ahmad Awais's team that encodes developer preferences and patterns to steer AI models, functioning like 'skill files'.
OKLCH: A color space found to be effective for LLMs in controlling color palettes, in contrast with HSL, which can have issues with lightness control.