⚡️Making DeepSeek v4 outperform Opus 4.7 with Taste — @AhmadAwais , CommandCode.ai

Latent Space Podcast
Science & Technology | 5 min read | 41 min video
Jun 6, 2026 | 485 views

TL;DR

Open source models like DeepSeek can now outperform premium ones like Opus 4.7 by fixing specific 'tool confusion' errors, a problem previously costing billions in token usage.

Key Insights

1. Ahmad Awais's 'taste' system encodes personal preferences and learned behaviors into 'taste files' that LLMs can utilize.

2. Tool confusion in open models, like DeepSeek V4 Pro, leads to repeated errors (50+ failures on average) when tool calls have optional parameters or incorrect schemas.

3. A 'validate-then-repair' approach with 3,200 lines of code initially, expanding to 16,000 repair variations, has fixed tool calling for models like DeepSeek, Kimi, and MiniMax.

4. Command Code processed hundreds of billions of tokens and has observed that models perform better with fewer tool call errors, leading to increased creativity and longer exploration.

5. The same 'repair tool logic' used for coding issues has been successfully applied to fix 'design slop' in LLM-generated designs, by providing a compositional framework of seven design patterns.

6. Command Code is being open-sourced and aims to be a hackable, 'Apple-like' platform for top-tier models (open and closed), rather than a 'soup' of all models.

From DevRel to AI Engineering: The Genesis of Command Code

Ahmad Awais, with a background spanning over 25 years in open source and a significant role in the WordPress community, transitioned into AI engineering in July 2020. Early access to GPT-3 let him build CLAI, a tool that suggested code snippets and predated GitHub Copilot. That work led to LangBase, an AI cloud that handled 1.2 billion agent runs monthly, before the team pivoted to Command Code. The core philosophy behind Command Code is that all agents are fundamentally coding agents, so coding capability should be central. Awais's extensive experience with cutting-edge projects, which often lack documentation, led him to develop 'Taste,' a meta-neuro-symbolic model designed to encode personal opinions and preferences into 'taste files' or 'skill files.' These files let AI agents learn and adapt to user-specific patterns, such as preferring pnpm for package installations but npm for local CLI linking, and so steer LLMs in the right direction.

The 'Tool Confusion' problem in open source models

A significant challenge identified with open source LLMs, particularly DeepSeek V4 Pro, is 'tool confusion.' This occurs when models struggle to correctly interpret and execute tool calls. For example, if a tool schema specifies optional parameters, DeepSeek might incorrectly send null values or empty objects, leading to errors. Zod, a strict validation library, flags these errors, but DeepSeek, instead of correcting its approach, repeatedly makes the same incorrect tool call, averaging 50-60 failures per billion tokens. Awais speculates this behavior stems from training data where models might learn from superior models, developing an inherent belief in the correctness of their own outputs and a resistance to correction. This issue makes these models seem unreliable and slow, despite their potential.
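The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not Command Code's schema or Zod itself: the `readFileSchema` tool, its field names, and the hand-rolled `validate` function are all hypothetical stand-ins. It mirrors one behavior that Zod does have, namely that an `.optional()` field accepts `undefined` but not `null`.

```typescript
// Hypothetical tool schema: `path` is required, `limit` is optional.
type FieldSpec = { type: "string" | "number"; optional?: boolean };
type ToolSchema = Record<string, FieldSpec>;

const readFileSchema: ToolSchema = {
  path: { type: "string" },
  limit: { type: "number", optional: true },
};

// Strict check: an optional field may be absent, but if it is present it must
// have the declared type. `null` counts as present, so it is rejected.
function validate(schema: ToolSchema, args: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const [name, spec] of Object.entries(schema)) {
    const value = args[name];
    if (value === undefined) {
      if (!spec.optional) errors.push(`${name}: required`);
      continue;
    }
    if (typeof value !== spec.type) {
      const received = value === null ? "null" : typeof value;
      errors.push(`${name}: expected ${spec.type}, received ${received}`);
    }
  }
  return errors;
}

// A correct call simply omits the optional parameter.
const good = validate(readFileSchema, { path: "src/index.ts" });
// A confused call sends null for it and fails validation every time.
const confused = validate(readFileSchema, { path: "src/index.ts", limit: null });

console.log(good);     // []
console.log(confused); // ["limit: expected number, received null"]
```

If the model resends the same arguments after reading the error, every retry fails in exactly the same way, which is the loop the summary describes.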

A 'Validate-then-Repair' solution for tool calling

To combat tool confusion, Awais developed a 'validate-then-repair' layer. Instead of just relying on error messages to correct the LLM, this system first attempts to repair the erroneous tool call. This began with approximately 3,200 lines of code for repair logic, generating 'repair files' similar to database migrations, each addressing a specific failure pattern. For instance, if an LLM emits a JSON string when an array is expected, the repair logic converts it to an array and provides a 'repair hint' to the LLM. This approach acts like teaching someone to drive: you first prevent an accident, then explain how to avoid it. The results showed a dramatic improvement, with subsequent tool calls becoming correct after the first repair, turning previously unusable models like DeepSeek V4 Flash into competitive options. This repair logic has been generalized, fixing issues for models like Kimi and MiniMax across hundreds of billions of tokens.
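The validate-then-repair flow described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than Command Code's implementation: the tool shape (`files`, `limit`), the `Repair` interface, the rule names, and the hint wording are all hypothetical. Each rule handles one failure pattern, in the spirit of the migration-like 'repair files'.

```typescript
type Args = Record<string, unknown>;

// One repair rule per failure pattern (hypothetical interface).
interface Repair {
  name: string;
  hint: string; // sent back to the model after a successful repair
  apply(args: Args): Args | null; // null means "rule does not apply"
}

// Hypothetical tool: `files` must be string[], `limit` is an optional number.
function validate(args: Args): string[] {
  const errors: string[] = [];
  const files = args.files;
  if (!Array.isArray(files) || !files.every((f) => typeof f === "string")) {
    errors.push("files: expected string[]");
  }
  if ("limit" in args && typeof args.limit !== "number") {
    errors.push("limit: expected number");
  }
  return errors;
}

const repairs: Repair[] = [
  {
    name: "json-string-to-array",
    hint: "`files` must be a JSON array, not a string containing JSON.",
    apply(args) {
      if (typeof args.files !== "string") return null;
      try {
        const parsed: unknown = JSON.parse(args.files);
        return Array.isArray(parsed) ? { ...args, files: parsed } : null;
      } catch {
        return null; // not parseable, leave it for the error path
      }
    },
  },
  {
    name: "drop-null-optional",
    hint: "Omit optional parameters instead of sending null.",
    apply(args) {
      if (args.limit !== null) return null;
      const rest: Args = { ...args };
      delete rest.limit;
      return rest;
    },
  },
];

interface Outcome {
  ok: boolean;
  args: Args;
  hints: string[];
  errors: string[];
}

// Validate first; only on failure try each repair, then validate again.
function validateThenRepair(input: Args): Outcome {
  let args = input;
  const hints: string[] = [];
  if (validate(args).length > 0) {
    for (const rule of repairs) {
      const fixed = rule.apply(args);
      if (fixed !== null) {
        args = fixed;
        hints.push(`[${rule.name}] ${rule.hint}`);
      }
    }
  }
  const errors = validate(args);
  return { ok: errors.length === 0, args, hints, errors };
}

const outcome = validateThenRepair({ files: '["a.ts","b.ts"]', limit: null });
console.log(outcome.ok, outcome.args, outcome.hints);
```

In this sketch the repaired call proceeds and the hints travel back with the tool result, so the model sees both a successful outcome and an explanation of what it got wrong. Calls that no rule can fix still surface the validation errors.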

Repair logic improves model creativity and performance

The implementation of the repair logic has a profound effect beyond just fixing tool calls; it significantly enhances the perceived intelligence and creativity of LLMs. When models encounter fewer tool call errors, they become more exploratory and can sustain operations for longer periods. Users have reported running DeepSeek with Command Code for over 12-hour sessions without issues. This suggests that tool confusion is a major bottleneck. By reducing these errors, models seem to gain a 'creative boost,' allowing them to explore different paths and maintain coherence. This is analogous to how models perform better without permission bypasses, indicating that a cleaner inference path leads to superior output.

Applying repair logic to design flaws

The 'repair tool logic' has proven versatile, extending beyond coding to address 'design slop': the common, often uninspired visual output of LLMs (e.g., the ubiquitous indigo-purple gradient). By analyzing patterns from conversations with designers, Awais identified seven key design patterns and ten 'design smells' that LLMs tend to exhibit. Applying a similar repair framework has allowed Command Code to guide LLMs towards better design outcomes. For instance, LLMs can be nudged to consider the 'intention' behind a design (is it a dashboard for monitoring?) and to define colors in the OKLCH color space, which offers more predictable control over lightness and hue than HSL. This framework enables LLMs to produce designs that look more human-curated and less distinctly 'AI'.
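To make the OKLCH point concrete, here is a small sketch that builds a palette of CSS `oklch()` colors sharing one lightness and chroma and differing only in hue. The function name and the specific values are illustrative assumptions, not something taken from Command Code's '/design' skill.

```typescript
// Build CSS oklch() colors with fixed lightness and chroma and evenly spaced hues.
// lightness is 0..1, chroma is typically about 0..0.4, hue is in degrees.
function oklchPalette(
  lightness: number,
  chroma: number,
  count: number,
  startHue = 0,
): string[] {
  const colors: string[] = [];
  for (let i = 0; i < count; i++) {
    const hue = (startHue + (i * 360) / count) % 360;
    colors.push(`oklch(${lightness} ${chroma} ${hue})`);
  }
  return colors;
}

const palette = oklchPalette(0.7, 0.12, 3, 30);
console.log(palette);
// ["oklch(0.7 0.12 30)", "oklch(0.7 0.12 150)", "oklch(0.7 0.12 270)"]
```

Because lightness is held constant in a perceptually uniform space, the three colors should read as roughly equally bright, which is harder to guarantee when rotating hue in HSL.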

Taste as a transparent, user-controlled memory system

Awais emphasizes that 'Taste' is a form of transparent, user-controlled memory integrated into Command Code. Unlike opaque memory systems, taste files are stored directly within the user's Git repository as markdown files. This allows for review and modification in every Pull Request (PR). The system automatically learns micro-decisions and preferences, such as always using pnpm for dependencies or preferring Vitest over other testing frameworks. It can even adapt to changes within a project, like switching from Commander to Meow for CLI building. This ensures that the 'learned' behaviors are never stale and are fully transparent to the user, forming a powerful, personalized AI assistant that evolves with the developer's workflow.
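The summary does not show what a taste file actually contains, so the sketch below is purely hypothetical: the `Preference` fields, the `# Taste` heading, and the bullet layout are invented for illustration. It only shows the general idea of rendering learned preferences into a markdown file that can be committed and reviewed in a PR.

```typescript
// Hypothetical shape for one learned preference.
interface Preference {
  topic: string;
  prefer: string;
  avoid?: string;
  note?: string;
}

// Example preferences taken from the ones mentioned in the episode summary.
const preferences: Preference[] = [
  { topic: "package manager", prefer: "pnpm", note: "use npm for local CLI linking" },
  { topic: "test runner", prefer: "Vitest" },
  { topic: "CLI framework", prefer: "Meow", avoid: "Commander" },
];

// Render preferences as a reviewable markdown document (invented layout).
function renderTaste(prefs: Preference[]): string {
  const lines: string[] = ["# Taste", ""];
  for (const p of prefs) {
    let line = `- ${p.topic}: prefer ${p.prefer}`;
    if (p.avoid) line += `, avoid ${p.avoid}`;
    if (p.note) line += ` (${p.note})`;
    lines.push(line);
  }
  return lines.join("\n") + "\n";
}

const tasteMarkdown = renderTaste(preferences);
console.log(tasteMarkdown);
```

Keeping the output as plain markdown means a change in preference, such as the switch from Commander to Meow, shows up as an ordinary one-line diff.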

Command Code's future: Open-sourcing and an 'Apple-like' model philosophy

Command Code, a project with roots dating back to 2020 and significant backing, is set to be open-sourced. Awais plans for it to be fully hackable, allowing users to modify any part of the system. This aligns with his background in WordPress core development and reflects a philosophy of enabling deep customization. The vision for Command Code is not to become a vast collection of every available model, but rather to emulate Apple's approach: offering a curated selection of the 'best of the best' commercial and open-source models, while remaining fully customizable, including the ability to integrate local models. This curated yet hackable model aims to provide a high-quality, user-controlled experience for developers.

Command Code Best Practices

Practical takeaways from this episode

Do This

Use the '/design' skill for better LLM-generated designs.
Leverage 'Taste' to codify your developer preferences.
Consider the intention behind a design, not just the aesthetics.
Use OKLCH for better color control in LLM-generated designs.
Apply similar repair logic to improve security code generation.
Contribute to the open-source release to make Command Code more hackable.

Avoid This

Don't expect raw open models to handle tool calls perfectly without a validate-then-repair layer.
Don't rely solely on default LLM design patterns (e.g., the 'design slop').
Don't let your agent's permissions overly restrict its creativity and performance.
Don't ignore the potential for 'tool confusion' in open-source models.
Don't use outdated or generic AGENTS.md/CLAUDE.md files; keep them updated.

Common Questions

What is Command Code?

Command Code is a full-fledged coding agent designed to help developers command LLMs. It differentiates itself through its 'Taste' feature, which learns and applies developer preferences, and its focus on improving the tool-calling capabilities of open-source models.

Mentioned in this video

Software & Apps

WordPress

The platform whose core Ahmad Awais worked on for 13 years, the basis of his background in open-source development.

CLAI

An early AI project by Ahmad Awais, focused on suggesting the next line of code, which evolved into Command Code.

DeepSeek V4 Pro

An open-source LLM that showed 'tool confusion' errors, prompting the development of repair logic to improve its performance.

Opus 4.7

A commercial LLM used as a benchmark against which DeepSeek V4 Pro's performance was compared after implementing repair logic.

Zod

A schema definition library used to identify and report errors in tool calls made by LLMs, particularly relevant in the discussion of 'tool confusion'.

TSE

A framework for TypeScript mentioned in the context of desired developer preferences.

Commander

A tool for building CLIs, mentioned as an example of how Command Code learns and adapts to changing tool usage within a project.

Hono

A framework suggested as an alternative to tRPC, illustrating the kind of implementation choice an LLM might make and that can be guided.

Clack

A library mentioned for interactive CLI elements that Command Code's 'Taste' feature can enforce.

LangBase

An AI cloud platform built by Ahmad Awais's team, which handled a large volume of agent runs before pivoting to Command Code.

Command Code

A full-fledged coding agent and AI cloud platform developed by Ahmad Awais's team, focusing on making LLMs more effective for coding tasks.

Kimi

Another LLM that exhibited similar issues to DeepSeek with tool calling, which were addressed by the same repair logic.

TypeScript

A programming language mentioned in the context of a user's preference for specific tools like tsup.

tsup

A TypeScript bundler the speaker prefers, illustrating how user preferences are handled.

Vitest

A preferred testing framework mentioned by the speaker, highlighting the 'Taste' feature's ability to learn and enforce preferences.

Meow

A CLI framework that, if adopted in a project, Command Code's 'Taste' feature would automatically recognize and manage.

tRPC

A framework mentioned as an example of a technology that developers might choose to avoid, demonstrating how LLM-generated code edits can be reduced.

Claude

A commercial LLM that is lenient with tool calls, making it more forgiving than some open-source models when the Command Code harness makes mistakes.
