Why are AI tokens so expensive?

AI tokens are expensive primarily because large language models generate text autoregressively, which is inherently inefficient. Every new token depends on the entire input and on everything generated so far, so the model has to work through that whole context at each step. This requires significant computational power and electricity, especially for complex tasks such as those run by coding agents.

How do AI models process information autoregressively?

An autoregressive model takes an input and outputs a single next token. To predict the token after that, it must process the entire sequence again, including its own generated text, rather than simply remembering it as a human would.

What is KV caching and how does it help with AI costs?

KV caching is a technique that stores intermediate representations of tokens already processed. This prevents the model from recalculating relationships between tokens it has already seen, significantly improving efficiency, especially for large context windows, thereby reducing computational cost.

Why are AI coding agents more expensive than chatbots?

Coding agents are more expensive because they often have more autonomy and can perform 'tool calls' like reading files or executing commands. This leads to much larger input token requirements as the agent needs to process file contents and its own thought process repeatedly, drastically increasing costs.

What is the cost breakdown for AI tokens?

Modern AI providers typically charge around $2 to $3 per million input tokens and about $15 per million output tokens. However, the true cost escalates with complex agentic tasks where the model reads multiple files and goes through numerous internal processing steps, multiplying the token count significantly.

How did GitHub Copilot's pricing change impact users?

GitHub Copilot shifted from a flat monthly fee to a token-based credit system. This change made users more aware of the high token consumption, especially for coding agents, and highlighted that previous unlimited usage was heavily subsidized.

What are the unsustainable practices in AI usage?

Measuring AI adoption solely by token usage or providing unlimited agentic capabilities for a flat fee is unsustainable. This incentivizes excessive, inefficient token generation, leading to prohibitive costs and a need for more realistic, cost-aware AI interaction models.

Why AI Tokens are so Expensive - Computerphile

Computerphile

Education | 7 min read | 26 min video

Jul 2, 2026|666,303 views|21,365|1,611

computers computerphile computer science

TL;DR

AI agents can rack up tens of thousands of tokens for a single bug fix by repeatedly reading files and re-evaluating context, making previously flat-rate services prohibitively expensive.

Key Insights

A token is a word or a piece of a word, with spaces, punctuation, and even Chinese characters potentially counting as individual tokens.

Large language models process information auto-regressively, meaning they output one token at a time and must re-process the entire context each time, making them inherently inefficient compared to human cognition.

A simple bug fix request to an AI coding agent could involve tens of thousands of tokens due to agentic 'thinking' and tool calls to read multiple files, each potentially thousands of tokens long.

An example shows a simple bug fix requiring approximately 55,000-60,000 input tokens when only two files were read and the agent 'thought' several times.

A real-world test in which GitHub Copilot built a starfield screensaver over 5-6 prompts consumed 2 million input tokens and 47,000 output tokens.

Companies are moving away from subsidized flat-rate pricing for AI agents towards token-based pricing, revealing the true, and often high, cost of these sophisticated operations.

What are AI tokens and why are they expensive?

The 'AI token' is fundamental to understanding the cost of using large language models (LLMs). A token is a word or a part of a word: the basic unit of text that an LLM processes. Depending on the model's tokenizer, punctuation, spaces, and individual Chinese characters can each be tokens too. For example, the common word 'the' might be its own token, while a longer or rarer word could be broken into several. Modern LLMs use a vocabulary of tens of thousands of tokens or more, covering words, code symbols, and special characters. Each token is mapped to a numerical representation called an embedding, which is what the model actually processes.

The definition of a token is simple; the expense comes from how LLMs process them. Unlike humans, who can hold context in memory, LLMs operate auto-regressively: they output one token at a time and must re-process the entire input context with each new output. This repetitive processing is computationally intensive, and it is the core reason tokens become so costly, especially in complex agentic tasks.
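To make the word-piece idea concrete, here is a minimal sketch of a toy tokenizer. The vocabulary (`TOY_VOCAB`) and the greedy longest-match rule are assumptions made up for this illustration; real tokenizers learn much larger vocabularies from text and use more involved algorithms.

```python
# Toy greedy longest-match tokenizer. The vocabulary is invented for this
# example; real tokenizers learn theirs from large amounts of text.
TOY_VOCAB = ["the", "token", "iz", "ation", "un", "believ", "able", " ", "."]


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary pieces, left to right."""
    pieces = set(vocab)
    longest = max(len(piece) for piece in pieces)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for length in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in pieces:
                tokens.append(candidate)
                i += length
                break
        else:
            # Nothing matched: treat the single character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


def to_ids(tokens, vocab=TOY_VOCAB):
    """Map each known token to its index in the vocabulary (-1 if unknown)."""
    index = {piece: n for n, piece in enumerate(vocab)}
    return [index.get(token, -1) for token in tokens]


if __name__ == "__main__":
    for sample in ["the", "the tokenization.", "unbelievable"]:
        tokens = toy_tokenize(sample)
        print(repr(sample), "->", tokens, to_ids(tokens))
```

In this toy setup 'the' is a single token, while 'tokenization' splits into three pieces. The integer IDs stand in for the lookup step that precedes the embedding.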

The inefficiency of auto-regressive processing

The auto-regressive nature of LLMs, while powerful for generation, is also their primary source of inefficiency and cost. When an LLM generates output, it doesn't simply remember what it just said and then predict the next word. Instead, it takes the entire sequence of input and previously generated output, processes it, and then predicts the single next token. This means that for every token generated, the model essentially performs a full pass through the input context it has to consider. For simple tasks, this is manageable. However, for more complex requests, like coding agents that need to read files, analyze code, and formulate solutions, this 're-reading' of the entire context becomes a significant cost driver. The problem is exacerbated because the input context grows with each step, requiring more computational resources for each subsequent output token. This fundamental aspect of LLM operation is often not grasped by the general public or even in media discussions, leading to surprise when costs escalate rapidly.
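The growth described above can be sketched with a few lines of arithmetic. This is a simplified model, assuming no caching at all and counting only how many tokens pass through the model; the function names and the example sizes are illustrative, not taken from the video.

```python
def tokens_processed_without_cache(prompt_tokens, generated_tokens):
    """Total tokens run through the model if every step re-reads everything."""
    total = 0
    context = prompt_tokens
    for _ in range(generated_tokens):
        total += context   # one full pass over the current context
        context += 1       # the newly generated token joins the context
    return total


def closed_form(prompt_tokens, generated_tokens):
    """Same total as the loop above: n*p + n*(n-1)/2."""
    p, n = prompt_tokens, generated_tokens
    return n * p + n * (n - 1) // 2


if __name__ == "__main__":
    for prompt, generated in [(1_000, 100), (10_000, 100)]:
        total = tokens_processed_without_cache(prompt, generated)
        assert total == closed_form(prompt, generated)
        print(f"prompt={prompt:>6}, generated={generated}: {total:,} tokens processed")
```

Under these assumptions, producing 100 tokens from a 1,000-token prompt touches 104,950 tokens in total, and a tenfold larger prompt makes the same 100 output tokens roughly ten times more work.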

How coding agents inflate token usage

Coding agents, designed to perform complex tasks like debugging or writing code, are particularly prone to high token usage due to their autonomous nature and need to interact with external data. A typical interaction might start with a system prompt (e.g., instructions for the agent) and a user query, perhaps a few thousand tokens in total. The agent then 'thinks' about the problem, generating its own internal 'thought' tokens. Critically, to perform its task, it often needs to read files. Each file content, if it's several thousand tokens long, is added to the context. The agent then thinks again, possibly decides to read another file, and so on. With each step, the input context grows: original prompt + user query + previous thoughts + newly read file content. For instance, a simple bug fix request might involve an initial prompt of a few thousand tokens, a user query of 200 tokens, the agent's initial thoughts (2,000 tokens), a tool call to read a file (100 tokens), the file content itself (5,000 tokens), further thoughts (2,000 tokens), another tool call (100 tokens), and another file's content (4,000 tokens). By the time the agent formulates a fix and a response, the cumulative input tokens can easily reach tens of thousands, each demanding processing by the GPU. This iterative process, where the context balloons with each 'thought' and file access, is the primary mechanism by which simple requests can generate immense token counts.
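The accumulation in that example can be tallied with a short sketch. It assumes a 3,000-token system prompt (the text only says 'a few thousand') and a 1,500-token final answer (not stated in the text), and it counts the whole context as input on every model call, ignoring any discount a provider might give for cached input. `agent_token_totals` is a hypothetical helper, not part of any agent framework.

```python
def agent_token_totals(initial_context, turns):
    """Tally input and output tokens over a sequence of model calls.

    turns: list of (output_tokens, tool_result_tokens) pairs, one per call.
    Returns (total_input, total_output, final_context_size).
    """
    context = initial_context
    total_input = 0
    total_output = 0
    for output_tokens, tool_result_tokens in turns:
        total_input += context        # the whole context is read on each call
        total_output += output_tokens
        # The model's own output and the tool result both join the context.
        context += output_tokens + tool_result_tokens
    return total_input, total_output, context


SYSTEM_PROMPT = 3_000   # assumption: "a few thousand" tokens
USER_QUERY = 200

TURNS = [
    (2_000 + 100, 5_000),  # thoughts + tool call, then first file content
    (2_000 + 100, 4_000),  # thoughts + tool call, then second file content
    (1_500, 0),            # assumption: final fix and response, no tool call
]

if __name__ == "__main__":
    total_in, total_out, final_context = agent_token_totals(
        SYSTEM_PROMPT + USER_QUERY, TURNS
    )
    print(f"input tokens:  {total_in:,}")
    print(f"output tokens: {total_out:,}")
    print(f"final context: {final_context:,}")
```

With these numbers, three model calls already account for 29,900 input tokens even though the files total only 9,000 tokens. Each extra round of thinking or file reading adds the full, larger context again, so the total depends heavily on how many rounds the agent takes.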

The role of KV caching and its limitations

To mitigate the extreme inefficiency of re-processing the entire context every time, systems employ KV caching. This technique stores intermediate calculations (key and value states) from previous processing steps. When new tokens are added, the cached values for the existing context are reused, avoiding redundant computations. This significantly speeds up the process, especially for large context windows. However, KV caching has limitations. The cache has a finite size and a limited 'time to live'. If a user pauses for too long between prompts, or if the context window becomes too full, the cached data might be discarded. When this happens, the system must 'pre-fill' the context again, re-processing much of the original information, which incurs a cost. Furthermore, the aggressive management of this cache is a complex system design issue, balancing performance with the need to serve multiple users and manage GPU resources. Thus, while KV caching helps, it doesn't eliminate the underlying cost of processing large contexts.
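A toy sketch of the caching idea follows. It is not how any real inference server is implemented: `project` is a made-up stand-in for computing a token's key and value, the context is assumed to grow only by appending, and `evict` simply imitates the cache being discarded after a timeout.

```python
def project(token_id):
    """Stand-in for computing one token's key and value (hypothetical)."""
    return (token_id * 2 + 1, token_id * 3 + 2)


class ToyKVCache:
    """Keeps one (key, value) pair per token that has been processed."""

    def __init__(self):
        self.entries = []

    def extend(self, context):
        """Process only tokens not cached yet; return how many that was."""
        new_tokens = context[len(self.entries):]
        self.entries.extend(project(token_id) for token_id in new_tokens)
        return len(new_tokens)

    def evict(self):
        """Drop everything, as if the cache's time to live had expired."""
        self.entries.clear()


def demo():
    cache = ToyKVCache()
    context = list(range(5_000))             # a 5,000-token context

    prefill = cache.extend(context)          # first pass: everything is new
    context.extend([7, 8, 9])                # three more tokens arrive
    incremental = cache.extend(context)      # only the new tokens are processed

    cache.evict()                            # the user paused too long
    after_eviction = cache.extend(context)   # the whole context is pre-filled again
    return prefill, incremental, after_eviction


if __name__ == "__main__":
    print(demo())
```

The sketch only counts key and value computations. Even with a warm cache, each new token still has to attend over every cached entry, which is one reason long contexts stay costly.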

Real-world examples of extreme token use

To illustrate the scale of token consumption, the video presents a hypothetical bug fix scenario. If an AI agent reads two files of roughly 4,000 to 5,000 tokens each and goes through multiple 'thinking' phases, the total input for a single request could reach approximately 55,000 to 60,000 tokens, plus several thousand output tokens for thoughts and tool calls. This is for a relatively simple bug fix. A more extreme, real-world example involved using GitHub Copilot to generate a Windows 3.11 starfield screensaver. After about five or six prompts, including adding features and fixing bugs introduced by the AI itself, the session had consumed 2 million input tokens and 47,000 output tokens. This dwarfs the token usage of simple chatbot interactions and highlights the immense computational resources required by agentic AI, particularly in coding assistance. The shift from flat-rate monthly subscriptions to per-token pricing by services like GitHub Copilot has brought these costs into sharp focus for users.
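As a rough sense of scale, the token counts from the screensaver session can be priced at the per-million rates quoted earlier on this page (about $2 to $3 for input and $15 for output). This is only a sketch: `session_cost_usd` is a hypothetical helper, and the result is not what the session was actually billed, since Copilot charges in credits and providers may discount cached input.

```python
def session_cost_usd(input_tokens, output_tokens, input_rate, output_rate):
    """Cost in US dollars, with rates given in dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Token counts from the screensaver session described above.
SESSION_INPUT = 2_000_000
SESSION_OUTPUT = 47_000

if __name__ == "__main__":
    low = session_cost_usd(SESSION_INPUT, SESSION_OUTPUT, input_rate=2, output_rate=15)
    high = session_cost_usd(SESSION_INPUT, SESSION_OUTPUT, input_rate=3, output_rate=15)
    print(f"estimated cost: ${low:.2f} to ${high:.2f}")
```

Note that at these rates the input side dominates the bill, even though output tokens cost several times more each, because the session read about forty times more tokens than it wrote.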

The unsustainability of subsidized pricing models

The recent shift in pricing for AI tools, such as GitHub Copilot's move from a flat monthly fee to token-based credits, reveals the unsustainability of previously subsidized models. When users paid a fixed price for potentially unlimited agentic use, the incentive was to leverage the AI for every possible task, regardless of token efficiency. This led to practices like asking overly long-winded questions or allowing AI agents to get stuck in loops, driving up costs unnoticed by the consumer. Measuring AI adoption by raw token usage, as some companies have done, is akin to measuring a driver's skill by tire wear – it focuses on activity without regard for efficiency or outcome. For normal companies, not just big tech giants, continuously spending vast sums on tokens without clear, immediate product returns is not viable. The industry is now grappling with how to justify these costs and encourage more efficient usage, moving away from the era of hidden subsidies and towards models where users directly confront the expense of complex AI operations.

Towards more reasonable AI interactions

The high cost associated with agentic AI, especially in coding, necessitates a reconsideration of how these tools are used. While agents capable of reading files and performing complex actions have their place, their current cost structure for autonomous operations is a significant concern. The video suggests that more efficient and cost-effective uses of LLMs include smaller, succinct questions and quick code completions, such as finishing a partially written loop in an IDE. These tasks require less context and generate fewer tokens. The challenge for the next year will be for users and providers to find a balance, understanding when to leverage powerful agentic capabilities versus employing simpler, more direct methods. The goal is to move towards a model where AI interactions are not only functional but also economically sustainable, avoiding the trap of paying for massive, repetitive computations hidden behind a flat fee. The industry must answer whether the quality and return on investment from these expensive AI operations can genuinely justify the ongoing token expenditure.

AI Token Cost Management: Dos and Don'ts

Practical takeaways from this episode

Do This

Use AI for small, succinct questions and quick fixes where minimal context is needed.

Utilize AI for code completion within an editor, where only a small context window is typically required.

Be mindful of token costs when using AI agents, especially for complex tasks involving file reading or multiple steps.

Understand that your AI session's cost can escalate rapidly with follow-up questions and file analysis.

Consider the efficiency of token usage and seek product return on investment for AI spending.

Avoid This

Do not assume a coding agent costs about the same as a conversational chatbot; agents performing complex tasks are far more expensive.

Avoid long-winded questions or prompts that cause AI to get stuck in lengthy thought processes.

Do not treat token usage as a metric for AI adoption; it's a cost, not necessarily a measure of progress.

Do not rely on subsidized pricing models for AI agents long-term; be prepared for per-token costs.

Avoid inflicting AI-generated code on others when it hasn't been thoroughly reviewed or understood.

Common Questions

What is an AI token?

An AI token is essentially a word or a piece of a word that a large language model processes. It can also be a space, a punctuation mark, or another character. The specific tokenization depends on the model and its tokenizer, whose vocabulary is often built from how frequently characters or word pieces occur.

Topics

AI Agents, AI & Machine Learning, Technology & Innovation, Token-efficiency, Coding Assistants, Autoregressive Models, AI Tokenization, LLM Costs, KV Caching, AI Pricing

Mentioned in this video

Software & Apps

JavaScript

Mentioned as an example of a programming language where models trained on a broad set of tokens can still be effective, even if not all tokens are used in a specific domain.

GPU

Graphics Processing Units are essential hardware for running large language models, and their cost is directly related to the electricity consumption required for the intensive computations involved in processing tokens.

Windows

An older version of the Windows operating system, mentioned as the platform for a starfield screensaver that the speaker programmed using GitHub Copilot as an example of extensive token usage.

GitHub Copilot

A specific AI coding assistant whose pricing model changed from a request-based monthly cost to an AI token credit system, leading to reduced usage for users. It's also used as a case study for high token consumption in coding agent tasks.

Codex

Mentioned as an example of an environment where AI code completion is used for quick fixes, representing a more cost-effective and reasonable use case compared to full agentic tasks.

Companies

Anthropic

Mentioned as a company that has implemented caps for premium users on their AI services, indicating cost or usage limitations.

Media

The Simpsons

Mentioned in an analogy to illustrate how incentives can lead to inefficient or unsustainable practices, comparing it to a pulley system or a bird automatically performing a task.
