What is 'tool confusion' and how does Command Code address it?

Tool confusion is a failure mode in which LLMs, particularly open-source ones like DeepSeek V4 Pro, struggle to call tools correctly and often repeat the same invalid call even after a schema validation error. Command Code implements repair logic that corrects these calls and feeds a hint back to the model, significantly improving its performance.

How does the 'Taste' feature in Command Code work?

Taste is a meta-neuro-symbolic model that automatically learns user preferences and coding patterns from their work. It stores these as 'skill files' or 'taste files', ensuring that the coding agent consistently follows the user's preferred methodologies and tools.

Can this repair logic be applied to areas other than code generation?

Yes, the same repair logic has been successfully applied to fixing 'design slop,' the common, uninspired design patterns generated by LLMs. By providing a compositional framework based on designer insights, Command Code can improve LLM-generated designs.

What are the main advantages of using Command Code with open-source models?

Command Code is particularly effective at improving the performance of open-source models, which often struggle with tool calling errors. By implementing repair logic and 'Taste,' it makes these models more reliable and efficient, approaching or even surpassing the capabilities of some commercial models.

What is the future of Command Code?

The team plans to open-source Command Code soon, making it highly hackable. Their philosophy is to offer the best models (both open and closed) rather than all models, aiming for an 'Apple-like' curated yet adaptable experience.

How does Command Code's 'Taste' feature differ from traditional skills or memory systems?

Taste is the highest-level component that manages skills and rules, acting as an automatic engine that creates and maintains skills. It's designed to be transparent, stored within the user's repository, ensuring continuous learning and adaptation without stale information.


⚡️Making DeepSeek v4 outperform Opus 4.7 with Taste — @AhmadAwais , CommandCode.ai

Latent Space Podcast

Science & Technology | 5 min read | 41 min video

Jun 6, 2026|485 views|22|4


TL;DR

Open-source models like DeepSeek can now outperform premium ones like Opus 4.7 once their specific 'tool confusion' errors are repaired, a problem that previously cost billions in token usage.

Key Insights

Ahmad Awais's 'Taste' system encodes personal preferences and learned behaviors into 'taste files' that LLMs can use.

Tool confusion in open models like DeepSeek V4 Pro leads to repeated errors (50-60 failures per billion tokens on average) when tool calls have optional parameters or incorrect schemas.

A 'validate-then-repair' approach, initially 3,200 lines of code and later expanded to 16,000 repair variations, has fixed tool calling for models like DeepSeek, Kimi, and MiniMax.

Command Code has processed hundreds of billions of tokens and observed that models perform better with fewer tool call errors, showing more creativity and exploring for longer.

The same 'repair tool logic' used for coding issues has been successfully applied to fix 'design slop' in LLM-generated designs by providing a compositional framework of seven design patterns.

Command Code is being open-sourced and aims to be a hackable, 'Apple-like' platform for top-tier models (open and closed), rather than a 'soup' of all models.

From DevRel to AI Engineering: The Genesis of Command Code

Ahmad Awais, whose background spans more than 25 years in open source and includes a significant role in the WordPress community, moved into AI engineering in July 2020. Early access to GPT-3 let him conceptualize and build CLAI, a code-suggestion tool that predated GitHub Copilot. That early work led to LangBase, an AI cloud that handled 1.2 billion agent runs monthly, and the team eventually pivoted to Command Code. The core philosophy behind Command Code is that all agents are fundamentally coding agents, so coding capability should be central. Awais's extensive work with cutting-edge projects, which often lack documentation, led him to develop 'Taste,' a meta-neuro-symbolic model designed to encode personal opinions and preferences into 'taste files' or 'skill files.' These files let AI agents learn and adapt to user-specific patterns, such as preferring pnpm for package installations but npm for local CLI linking, and so steer LLMs in the right direction.

The 'Tool Confusion' problem in open source models

A significant challenge identified with open-source LLMs, particularly DeepSeek V4 Pro, is 'tool confusion': the model struggles to form and execute tool calls correctly. For example, if a tool schema marks some parameters as optional, DeepSeek may send null values or empty objects for them instead of omitting them, which fails validation. Zod, a TypeScript schema validation library, flags these errors, but instead of correcting its approach DeepSeek repeats the same invalid tool call, averaging 50-60 failures per billion tokens. Awais speculates that this behavior stems from training data in which models learn from superior models, developing an inherent belief in the correctness of their own outputs and a resistance to correction. As a result, these models seem unreliable and slow despite their potential.
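The failure mode can be illustrated with a small, dependency-free sketch. The source describes a Zod-based TypeScript harness; the plain-Python validator below is only a stand-in, and the `read_file` tool, its `path` and `limit` parameters, and the `validate_call` helper are hypothetical names chosen for illustration. It shows why an explicit null for an optional parameter is rejected by strict validation even though simply omitting the parameter would pass.

```python
# Hypothetical schema for an imaginary `read_file` tool (not Command Code's real schema).
# Each parameter maps to (expected_type, required).
READ_FILE_SCHEMA = {
    "path": (str, True),
    "limit": (int, False),  # optional: may be omitted, but must be an int if present
}


def validate_call(args: dict, schema: dict) -> list[str]:
    """Strictly validate tool-call arguments and return a list of error messages."""
    errors = []
    for name, (expected_type, required) in schema.items():
        if name not in args:
            if required:
                errors.append(f"{name}: required parameter is missing")
            continue
        value = args[name]
        type_ok = isinstance(value, expected_type)
        # bool is a subclass of int in Python, so reject it unless bool is expected.
        if isinstance(value, bool) and expected_type is not bool:
            type_ok = False
        if not type_ok:
            errors.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    for name in args:
        if name not in schema:
            errors.append(f"{name}: unknown parameter")
    return errors


# Omitting the optional parameter is valid.
print(validate_call({"path": "README.md"}, READ_FILE_SCHEMA))
# Sending null (None) for it is not; this is the kind of call the model keeps repeating.
print(validate_call({"path": "README.md", "limit": None}, READ_FILE_SCHEMA))
```

The first call prints an empty list and the second reports a type error for `limit`. A model that resends the second call unchanged will hit the same error every time, which is the loop described above.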

A 'Validate-then-Repair' solution for tool calling

To combat tool confusion, Awais developed a 'validate-then-repair' layer. Instead of relying on error messages alone to correct the LLM, the system first attempts to repair the erroneous tool call. This began with approximately 3,200 lines of repair logic, producing 'repair files' similar to database migrations, each addressing a specific failure pattern. For instance, if an LLM emits a JSON string where an array is expected, the repair logic converts it to an array and returns a 'repair hint' to the LLM. The approach resembles teaching someone to drive: you first prevent the accident, then explain how to avoid it. The results showed a dramatic improvement: after the first repair, subsequent tool calls were correct, turning previously unusable models like DeepSeek V4 Flash into competitive options. The repair logic has since been generalized and has fixed issues for models like Kimi and MiniMax across hundreds of billions of tokens.
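A minimal sketch of the validate-then-repair idea follows, assuming a hand-rolled validator in place of Zod and an imaginary `run_tests` tool. Every name here (`SCHEMA`, `validate`, the two repair functions, `validate_then_repair`) is hypothetical and is not taken from Command Code; the sketch only mirrors the description above: small, migration-like repairs that each target one failure pattern, fix the call, and emit a hint for the model.

```python
import json

# Hypothetical schema for an imaginary `run_tests` tool: parameter -> (type, required).
SCHEMA = {"files": (list, True), "timeout": (int, False)}


def validate(args: dict) -> list[str]:
    """Return validation errors for args against SCHEMA (empty list means valid)."""
    errors = []
    for name, (expected_type, required) in SCHEMA.items():
        if name not in args:
            if required:
                errors.append(f"{name}: missing")
        elif not isinstance(args[name], expected_type):
            errors.append(f"{name}: expected {expected_type.__name__}")
    return errors


def repair_stringified_array(args: dict):
    """Repair 001: a JSON string was sent where an array is expected."""
    for name, (expected_type, _) in SCHEMA.items():
        value = args.get(name)
        if expected_type is list and isinstance(value, str):
            try:
                parsed = json.loads(value)
            except json.JSONDecodeError:
                continue
            if isinstance(parsed, list):
                fixed = {**args, name: parsed}
                return fixed, f"'{name}' must be a JSON array, not a string containing one."
    return None


def repair_null_optional(args: dict):
    """Repair 002: null was sent for an optional parameter, so drop it."""
    for name, (_, required) in SCHEMA.items():
        if not required and name in args and args[name] is None:
            fixed = {k: v for k, v in args.items() if k != name}
            return fixed, f"Omit optional '{name}' instead of sending null."
    return None


# Ordered like migrations; each repair runs at most once per call in this sketch.
REPAIRS = [repair_stringified_array, repair_null_optional]


def validate_then_repair(args: dict):
    """Return (args, hints, errors): repaired arguments, hints for the model, remaining errors."""
    hints = []
    for repair in REPAIRS:
        if not validate(args):
            break  # already valid, nothing left to repair
        result = repair(args)
        if result is not None:
            args, hint = result
            hints.append(hint)
    return args, hints, validate(args)


call = {"files": '["a_test.py", "b_test.py"]', "timeout": None}
fixed, hints, errors = validate_then_repair(call)
print(fixed)
print(hints)
print(errors)
```

In this sketch the tool would run with `fixed`, and `hints` would be returned to the model alongside the tool result so the next call is formed correctly. Anything the repairs cannot handle stays in `errors` and falls back to an ordinary validation failure.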

Repair logic improves model creativity and performance

The repair logic does more than fix tool calls; it noticeably improves the perceived intelligence and creativity of LLMs. When models encounter fewer tool call errors, they become more exploratory and can sustain operations for longer periods. Users have reported running DeepSeek with Command Code in sessions of more than 12 hours without issues. This suggests that tool confusion is a major bottleneck. With fewer errors, models seem to gain a 'creative boost,' exploring different paths while maintaining coherence. This is analogous to how models perform better without permission bypasses, indicating that a cleaner inference path leads to superior output.

Applying repair logic to design flaws

The 'repair tool logic' has proven versatile, extending beyond coding to address 'design slop': the common, often uninspired visual output of LLMs (e.g., the ubiquitous indigo-purple gradient). By analyzing patterns from conversations with designers, Awais identified seven key design patterns and ten 'design smells' that LLMs tend to exhibit. Applying a similar repair framework lets Command Code guide LLMs toward better design outcomes. For instance, LLMs can be nudged to consider the 'intention' behind a design (is it a dashboard for monitoring?) and to specify colors in a color space like OKLCH, which offers more predictable control over lightness and hue than HSL. With this framework, LLMs produce designs that look perceptibly more human-curated and less distinctly 'AI'.
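To make the OKLCH point concrete, here is a small sketch that builds a lightness ramp as CSS `oklch()` values. The `oklch_ramp` helper and the specific hue, chroma, and lightness numbers are illustrative assumptions, not values from the episode. Because OKLCH is designed to be perceptually uniform, holding hue and chroma fixed and stepping only lightness tends to give evenly spaced shades, which is harder to achieve by stepping the L channel in HSL.

```python
def oklch_ramp(hue: float, chroma: float, steps: int = 5,
               lightest: float = 0.95, darkest: float = 0.25) -> list[str]:
    """Build CSS oklch() colors that share hue and chroma and differ only in lightness."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    step = (lightest - darkest) / (steps - 1)
    return [
        # CSS syntax: oklch(<lightness> <chroma> <hue in degrees>)
        f"oklch({(lightest - i * step) * 100:.1f}% {chroma:g} {hue:g})"
        for i in range(steps)
    ]


# A five-shade ramp from light to dark; hue 250 and chroma 0.12 are arbitrary choices.
for shade in oklch_ramp(hue=250, chroma=0.12):
    print(shade)
```

The resulting strings can be used directly as CSS color values, for example as custom properties for a dashboard theme.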

Taste as a transparent, user-controlled memory system

Awais emphasizes that 'Taste' is a form of transparent, user-controlled memory integrated into Command Code. Unlike opaque memory systems, taste files are stored directly within the user's Git repository as markdown files. This allows for review and modification in every Pull Request (PR). The system automatically learns micro-decisions and preferences, such as always using pnpm for dependencies or preferring Vitest over other testing frameworks. It can even adapt to changes within a project, like switching from Commander to Meow for CLI building. This ensures that the 'learned' behaviors are never stale and are fully transparent to the user, forming a powerful, personalized AI assistant that evolves with the developer's workflow.
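The source does not describe the actual taste file format, so the following is purely a hypothetical sketch of how preferences stored as markdown in a repository could steer an agent's choices. The file contents, the `- key: value` bullet convention, and the `parse_taste` and `install_command` helpers are all assumptions made for illustration.

```python
# Hypothetical taste file content; the real Taste file format is not described in the source.
TASTE_MD = """\
# Taste: tooling

- package-manager: pnpm
- cli-link: npm
- test-runner: vitest
"""


def parse_taste(markdown: str) -> dict[str, str]:
    """Collect `- key: value` bullet lines from a markdown document into a dict."""
    prefs = {}
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            prefs[key.strip()] = value.strip()
    return prefs


def install_command(prefs: dict[str, str], package: str) -> str:
    """Pick the install command from learned preferences, defaulting to npm."""
    manager = prefs.get("package-manager", "npm")
    verb = "add" if manager in ("pnpm", "yarn") else "install"
    return f"{manager} {verb} {package}"


prefs = parse_taste(TASTE_MD)
print(install_command(prefs, "zod"))  # pnpm add zod
```

Because the preferences live in a plain markdown file, a change such as swapping the test runner shows up as an ordinary diff that can be reviewed in a pull request, which is the transparency property described above.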

Command Code's future: Open-sourcing and an 'Apple-like' model philosophy

Command Code, a project with roots dating back to 2020 and significant backing, is set to be open-sourced. Awais plans for it to be fully hackable, allowing users to modify any part of the system. This aligns with his background in WordPress core development and reflects a philosophy of enabling deep customization. The vision for Command Code is not to become a vast collection of every available model, but rather to emulate Apple's approach: offering a curated selection of the 'best of the best' commercial and open-source models, while remaining fully customizable, including the ability to integrate local models. This curated yet hackable model aims to provide a high-quality, user-controlled experience for developers.


Command Code Best Practices

Practical takeaways from this episode

Do This

Use the '/design' skill for better LLM-generated designs.

Leverage 'Taste' to codify your developer preferences.

Consider the intention behind a design, not just the aesthetics.

Use OKLCH for better color control in LLM-generated designs.

Apply similar repair logic to improve security code generation.

Once Command Code is open-sourced, hack on it to fit your own workflow.

Avoid This

Don't expect raw open models to handle tool calls perfectly without a repair layer.

Don't rely solely on default LLM design patterns (e.g., the 'design slop').

Don't let your agent's permissions overly restrict its creativity and performance.

Don't ignore the potential for 'tool confusion' in open-source models.

Don't use outdated or generic AGENTS.md/CLAUDE.md files; keep them updated.

Common Questions

What is Command Code?

Command Code is a full-fledged coding agent designed to help developers command LLMs. It differentiates itself through its 'Taste' feature, which learns and applies developer preferences, and through its focus on improving the tool-calling capabilities of open-source models.

Topics

AI & Machine Learning, Technology & Innovation, Programming & Software, Coding Agents, Open-source Models, AI In Design, Developer Experience, Developer Productivity, LLM Tool Calling, Meta-neuro-symbolic Models

Mentioned in this video

Companies

Netlify

Mentioned in the context of the guest's past roles.

RapidAPI

The company where Ahmad Awais previously held the role of VP of DevRel.

Google

One of the companies where Ahmad Awais worked, primarily in an open-source capacity.

Airbnb

Another company where Ahmad Awais worked, contributing to his open-source background.

DeepSeek

A model announced by DeepSeek, with hiring efforts for its development mentioned following the discussion on open-source coding agents.

The venture fund associated with Tom Preston-Werner, which observed positive changes in DeepSeek's performance.

MiniMax

Models from this company also showed issues with tool calling that were fixed using the developed repair logic.

People

Ahmad Awais

The guest on the podcast, discussing his work on Command Code and AI engineering. He has a background in open source and developer relations.

Mario Zechner

Mentioned for having a current viral post about 'design slop' in LLM-generated designs.

Tom Preston-Werner

An investor whose fund (PW) noted the significant improvement in DeepSeek V4 Flash after the tool confusion repairs.

Matt Mullenweg

Founder of WordPress and an angel investor in Command Code, who reached out upon hearing about the decision to open source the project.

Software & Apps

WordPress

The platform where Ahmad Awais spent 13 years working on its core, contributing to his background in open-source development.

CLAI

An early AI project by Ahmad Awais, focused on suggesting the next line of code, which evolved into Command Code.

DeepSeek V4 Pro

An open-source LLM that showed 'tool confusion' errors, prompting the development of repair logic to improve its performance.

Opus 4.7

A commercial LLM used as a benchmark against which DeepSeek V4 Pro's performance was compared after implementing repair logic.

Zod

A schema definition library used to identify and report errors in tool calls made by LLMs, particularly relevant in the discussion of 'tool confusion'.

TSE

A framework for TypeScript mentioned in the context of desired developer preferences.

Commander

A tool for building CLIs, mentioned as an example of how Command Code learns and adapts to changing tool usage within a project.

Hono

A framework suggested as an alternative to tRPC, indicating specific implementation choices LLMs might make and which can be guided.

Clack

A library mentioned for interactive CLI elements that Command Code's 'Taste' feature can enforce.

LangBase

An AI cloud platform built by Ahmad Awais's team, which handled a large volume of agent runs before pivoting to Command Code.

Command Code

A full-fledged coding agent and AI cloud platform developed by Ahmad Awais's team, focusing on making LLMs more effective for coding tasks.

Kimi

Another LLM that exhibited similar issues to DeepSeek with tool calling, which were addressed by the same repair logic.

TypeScript

A programming language mentioned in the context of a user's preference for specific tooling like tsup.

tsup

A TypeScript bundler the speaker prefers, illustrating how user preferences are handled.

Vitest

A preferred testing framework mentioned by the speaker, highlighting the 'Taste' feature's ability to learn and enforce preferences.

Meow

A CLI framework that, if adopted in a project, Command Code's 'Taste' feature would automatically recognize and manage.

tRPC

A framework mentioned as an example of a technology that developers might choose to avoid, demonstrating how LLM-generated code edits can be reduced.

Claude

A commercial LLM that is lenient with tool calls, making it more forgiving than some open-source models when the Command Code harness makes mistakes.

Concepts

Taste

A meta-neuro-symbolic model developed by Ahmad Awais's team that encodes developer preferences and patterns to steer AI models, functioning like 'skill files'.

OKLCH

A color space found to be effective for LLMs in controlling color palettes, contrasting with HSL which can have issues with lightness control.
