Why AI Tokens are so Expensive - Computerphile
Key Moments
AI agents can rack up tens of thousands of tokens for a single bug fix by repeatedly reading files and re-evaluating context, making previously flat-rate services prohibitively expensive.
Key Insights
A token is a word or a piece of a word, with spaces, punctuation, and even Chinese characters potentially counting as individual tokens.
Large language models process information auto-regressively: they output one token at a time, and each new token depends on the entire preceding context, which must be re-processed unless intermediate results are cached. This makes them inherently inefficient compared to human cognition.
A simple bug fix request to an AI coding agent can involve tens of thousands of tokens, because agentic 'thinking' and tool calls pull in multiple files, each potentially thousands of tokens long.
An example shows a simple bug fix requiring approximately 55,000-60,000 input tokens when only two files were read and the agent 'thought' several times.
A real-world test with GitHub Copilot creating a starfield screensaver after 5-6 prompts resulted in 2 million input tokens and 47,000 output tokens.
Companies are moving away from subsidized flat-rate pricing for AI agents towards token-based pricing, revealing the true, and often high, cost of these sophisticated operations.
What are AI tokens and why are they expensive?
The concept of an 'AI token' is fundamental to understanding the cost of using large language models (LLMs). A token is essentially a word or a part of a word, and it is the basic unit of text that an LLM processes. Tokens can also be punctuation, spaces, and even individual Chinese characters, depending on the tokenizer the model uses. For example, the common word 'the' might be its own token, while a more complex word could be broken into several tokens. Modern LLMs use a vocabulary of tens of thousands of these tokens, covering words, code symbols, and special characters. The tokens are turned into numerical representations called embeddings, which the model then processes. While the definition of a token is straightforward, the expense comes from how LLMs process them. Unlike humans, who can hold context in memory, LLMs operate auto-regressively: they output one token at a time and must re-process the entire input context with each new output. This repetitive processing is computationally intensive and is the core reason tokens become so costly, especially in complex agentic tasks.
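As a rough illustration of how text splits into word pieces, here is a minimal sketch of greedy longest-match tokenization in Python. The vocabulary `TOY_VOCAB` and the function names are hypothetical and chosen only for this example; real tokenizers learn tens of thousands of entries from data and differ in detail.

```python
# Toy greedy longest-match tokenizer. TOY_VOCAB is a made-up vocabulary
# for illustration only; real models learn theirs from training data.
TOY_VOCAB = {"the", " the", "token", " token", "iz", "er", " ", ".", "s"}


def tokenize(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Split text into the longest vocabulary pieces, left to right.

    Characters not covered by the vocabulary become single-character tokens.
    """
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


def token_ids(tokens: list[str], vocab: set[str] = TOY_VOCAB) -> list[int]:
    """Map each token to an integer ID; unknown characters get ID -1."""
    index = {piece: n for n, piece in enumerate(sorted(vocab))}
    return [index.get(tok, -1) for tok in tokens]


pieces = tokenize("the tokenizer.")
ids = token_ids(pieces)
```

With this toy vocabulary, the single word "tokenizer" ends up as three tokens (" token", "iz", "er"), while "the" stays whole. The integer IDs are what a model would look up to obtain embeddings.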
The inefficiency of auto-regressive processing
The auto-regressive nature of LLMs, while powerful for generation, is also their primary source of inefficiency and cost. When an LLM generates output, it doesn't simply remember what it just said and then predict the next word. Instead, it takes the entire sequence of input and previously generated output, processes it, and then predicts the single next token. This means that for every token generated, the model essentially performs a full pass through the input context it has to consider. For simple tasks, this is manageable. However, for more complex requests, like coding agents that need to read files, analyze code, and formulate solutions, this 're-reading' of the entire context becomes a significant cost driver. The problem is exacerbated because the input context grows with each step, requiring more computational resources for each subsequent output token. This fundamental aspect of LLM operation is often not grasped by the general public or even in media discussions, leading to surprise when costs escalate rapidly.
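The growth described above can be put into numbers with a small sketch. It assumes the worst case where nothing is cached, so every generation step reads the prompt plus everything generated so far; the function name is hypothetical.

```python
def tokens_processed_without_cache(prompt_tokens: int, output_tokens: int) -> int:
    """Total tokens read if every step re-processes the full context.

    Step k (0-based) reads the prompt plus the k tokens generated so far,
    so the total is output_tokens * prompt_tokens + 0 + 1 + ... + (n - 1).
    """
    return sum(prompt_tokens + k for k in range(output_tokens))


# A 1,000-token prompt followed by 100 generated tokens.
total = tokens_processed_without_cache(1_000, 100)
```

Under this assumption, generating just 100 tokens from a 1,000-token prompt means reading 104,950 tokens in total, which is why the size of the context matters far more than the length of the answer.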
How coding agents inflate token usage
Coding agents, designed to perform complex tasks like debugging or writing code, are particularly prone to high token usage due to their autonomous nature and need to interact with external data. A typical interaction might start with a system prompt (e.g., instructions for the agent) and a user query, perhaps a few thousand tokens in total. The agent then 'thinks' about the problem, generating its own internal 'thought' tokens. Critically, to perform its task, it often needs to read files. Each file content, if it's several thousand tokens long, is added to the context. The agent then thinks again, possibly decides to read another file, and so on. With each step, the input context grows: original prompt + user query + previous thoughts + newly read file content. For instance, a simple bug fix request might involve an initial prompt of a few thousand tokens, a user query of 200 tokens, the agent's initial thoughts (2,000 tokens), a tool call to read a file (100 tokens), the file content itself (5,000 tokens), further thoughts (2,000 tokens), another tool call (100 tokens), and another file's content (4,000 tokens). By the time the agent formulates a fix and a response, the cumulative input tokens can easily reach tens of thousands, each demanding processing by the GPU. This iterative process, where the context balloons with each 'thought' and file access, is the primary mechanism by which simple requests can generate immense token counts.
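The bug fix example above can be tallied with a short sketch. It assumes each model call receives the whole accumulated context as input. The 3,000-token system prompt (the text only says "a few thousand") and the 1,000-token final answer are assumptions, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One model call: tokens the model generates, then tokens a tool returns."""
    name: str
    generated: int    # thought + tool-call (or final answer) tokens produced
    tool_result: int  # tokens appended afterwards, e.g. file contents


def input_tokens_per_call(initial_context: int, steps: list[Step]) -> list[int]:
    """Return the input size of each model call as the context grows."""
    context = initial_context
    inputs = []
    for step in steps:
        inputs.append(context)  # the whole context so far is sent as input
        context += step.generated + step.tool_result
    return inputs


# Figures from the example; SYSTEM_PROMPT and the final answer size are assumed.
SYSTEM_PROMPT = 3_000
USER_QUERY = 200
steps = [
    Step("think, request file 1", generated=2_000 + 100, tool_result=5_000),
    Step("think, request file 2", generated=2_000 + 100, tool_result=4_000),
    Step("write fix and reply", generated=1_000, tool_result=0),
]
per_call = input_tokens_per_call(SYSTEM_PROMPT + USER_QUERY, steps)
total_input = sum(per_call)
```

Under these assumptions the three calls take 3,200, 10,300 and 16,400 input tokens, 29,900 in total, even though the user typed only 200. More thinking rounds, more files, or a larger system prompt push the total higher.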
The role of KV caching and its limitations
To mitigate the extreme inefficiency of re-processing the entire context every time, systems employ KV caching. This technique stores intermediate calculations (key and value states) from previous processing steps. When new tokens are added, the cached values for the existing context are reused, avoiding redundant computations. This significantly speeds up the process, especially for large context windows. However, KV caching has limitations. The cache has a finite size and a limited 'time to live'. If a user pauses for too long between prompts, or if the context window becomes too full, the cached data might be discarded. When this happens, the system must 'pre-fill' the context again, re-processing much of the original information, which incurs a cost. Furthermore, the aggressive management of this cache is a complex system design issue, balancing performance with the need to serve multiple users and manage GPU resources. Thus, while KV caching helps, it doesn't eliminate the underlying cost of processing large contexts.
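The effect of a cache expiring can be sketched as follows. This is a simplified model, not how any particular serving system works: the function name, the single time-to-live value, and the all-or-nothing eviction are assumptions made for illustration.

```python
def prefill_tokens(context_tokens: int, cached_tokens: int,
                   seconds_since_last_use: float, ttl_seconds: float) -> int:
    """Tokens that must be processed from scratch before generation starts.

    While the cache entry is alive, only the uncached suffix is processed;
    once it has expired, the whole context is prefilled again.
    """
    if seconds_since_last_use > ttl_seconds:
        cached_tokens = 0  # entry was discarded, nothing can be reused
    cached_tokens = min(cached_tokens, context_tokens)
    return context_tokens - cached_tokens


TTL = 300.0  # hypothetical time to live, in seconds

# 16,400-token context of which 10,300 tokens were cached by the previous call.
quick_follow_up = prefill_tokens(16_400, 10_300, seconds_since_last_use=10, ttl_seconds=TTL)
after_long_pause = prefill_tokens(16_400, 10_300, seconds_since_last_use=600, ttl_seconds=TTL)
```

In this model a quick follow-up only needs to process the 6,100 new tokens, while the same request after a long pause has to prefill all 16,400 again.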
Real-world examples of extreme token use
To illustrate the scale of token consumption, the video presents a hypothetical bug fix scenario. If an AI agent reads two files, each around 5,000 tokens, and engages in multiple 'thinking' phases, the total input tokens for a single request could reach approximately 55,000 to 60,000, plus several thousand output tokens for thoughts and tool calls. This is for a relatively simple bug fix. A more extreme example involved using GitHub Copilot to generate a Windows 3.11 starfield screensaver. After about five or six prompts, including adding features and fixing bugs introduced by the AI itself, the session consumed an astonishing 2 million input tokens and 47,000 output tokens. This dwarfs the token usage of simple chatbot interactions and highlights the immense computational resources required by agentic AI, particularly in coding assistance. The shift from flat-rate monthly subscriptions to per-token pricing by services like GitHub Copilot has brought these costs into sharp focus for users.
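To see how such token counts translate into money, here is a minimal sketch of per-token billing. The prices are placeholders chosen for illustration, not the rates of GitHub Copilot or any other provider, and the function name is hypothetical.

```python
def session_cost(input_tokens: int, output_tokens: int,
                 usd_per_million_input: float,
                 usd_per_million_output: float) -> tuple[float, float]:
    """Return (input cost, output cost) in USD for one session."""
    input_cost = input_tokens / 1_000_000 * usd_per_million_input
    output_cost = output_tokens / 1_000_000 * usd_per_million_output
    return input_cost, output_cost


# Placeholder prices (assumptions), in USD per million tokens.
PRICE_IN = 3.0
PRICE_OUT = 15.0

# Token counts from the starfield screensaver session.
input_cost, output_cost = session_cost(2_000_000, 47_000, PRICE_IN, PRICE_OUT)
total_cost = input_cost + output_cost
```

With these placeholder prices the input side comes to $6.00 and the output side to about $0.71. Even when output tokens are priced several times higher, the repeatedly re-sent context dominates the bill.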
The unsustainability of subsidized pricing models
The recent shift in pricing for AI tools, such as GitHub Copilot's move from a flat monthly fee to token-based credits, reveals the unsustainability of previously subsidized models. When users paid a fixed price for potentially unlimited agentic use, the incentive was to leverage the AI for every possible task, regardless of token efficiency. This led to practices like asking overly long-winded questions or allowing AI agents to get stuck in loops, driving up costs unnoticed by the consumer. Measuring AI adoption by raw token usage, as some companies have done, is akin to measuring a driver's skill by tire wear – it focuses on activity without regard for efficiency or outcome. For normal companies, not just big tech giants, continuously spending vast sums on tokens without clear, immediate product returns is not viable. The industry is now grappling with how to justify these costs and encourage more efficient usage, moving away from the era of hidden subsidies and towards models where users directly confront the expense of complex AI operations.
Towards more reasonable AI interactions
The high cost associated with agentic AI, especially in coding, necessitates a reconsideration of how these tools are used. While agents capable of reading files and performing complex actions have their place, their current cost structure for autonomous operations is a significant concern. The video suggests that more efficient and cost-effective uses of LLMs include smaller, succinct questions and quick code completions, such as finishing a partially written loop in an IDE. These tasks require less context and generate fewer tokens. The challenge for the next year will be for users and providers to find a balance, understanding when to leverage powerful agentic capabilities versus employing simpler, more direct methods. The goal is to move towards a model where AI interactions are not only functional but also economically sustainable, avoiding the trap of paying for massive, repetitive computations hidden behind a flat fee. The industry must answer whether the quality and return on investment from these expensive AI operations can genuinely justify the ongoing token expenditure.
Common Questions
An AI token is essentially a word or a piece of a word that a large language model processes. It can also include spaces, punctuation, and other characters. The specific tokenization depends on the model and the tokenizer used, often based on the frequency of characters or word pieces.
Mentioned in this video
Mentioned as an example of a programming language where models trained on a broad set of tokens can still be effective, even if not all tokens are used in a specific domain.
GPUs: Graphics Processing Units are essential hardware for running large language models, and their cost is directly related to the electricity consumption required for the intensive computations involved in processing tokens.
Windows 3.11: An older version of the Windows operating system, mentioned as the platform for a starfield screensaver that the speaker programmed using GitHub Copilot as an example of extensive token usage.
GitHub Copilot: An AI coding assistant whose pricing model changed from a request-based monthly cost to an AI token credit system, leading to reduced usage for users. It is also used as a case study for high token consumption in coding agent tasks.
IDE: Mentioned as an example of an environment where AI code completion is used for quick fixes, representing a more cost-effective and reasonable use case compared to full agentic tasks.