What is context rot and why is it a problem?

Context rot refers to the degradation of a language model's performance as the amount of context provided increases beyond a certain limit. Even with large context windows advertised, models may not effectively utilize information beyond 40,000-100,000 tokens, making their responses unreliable.

How does Chroma's Context One model differ from other large language models?

Chroma Context One is a 20 billion parameter model that is significantly smaller, faster (3000 tokens/sec on Cerebras vs. 40 tokens/sec for Opus), and cheaper (~$1/million tokens vs. $25/million for Opus) while achieving state-of-the-art performance in agentic search.

What is context engineering?

Context engineering is the process of curating and managing the information provided to an AI model to ensure it performs optimally. It's necessary because of context rot, meaning we can't just throw data at models and expect them to reason effectively.

How will AI development change information work?

AI is expected to automate many information-based tasks, changing the nature of labor. Humans currently spend about 30% of their working time finding the information they need for their jobs, and AI agents doing the same work will need an equally capable way to search.

What are the future predictions for context in AI?

Jeff Huber predicts that context will become continuous (with push/pull steering layers), extremely fast (driven by small, efficient models), and that continual learning will primarily occur at the context layer rather than by updating model weights.

Why is speed critical for future AI applications?

As models become much faster, the architecture of applications will shift. Instead of being cautious about slow language model interactions, developers can embrace 'the bitter lesson' by using more tokens and letting agents handle complexity, pushing compute closer to data to minimize network costs.


AI Dev 26 x SF | Jeff Huber: Everything You Need to Know About Agentic Search

DeepLearning.AI

Education | 4 min read | 24 min video

May 20, 2026|165 views|5


TL;DR

Agentic search promises to solve AI context and reliability issues by using a continuous loop of reading and writing information, but it only works when paired with careful context engineering.

Key Insights

Over 45% of ChatGPT queries are "asking-centric", that is, centered on a specific question or topic, according to OpenAI's report on ChatGPT usage.

The average human information worker spends approximately 30% of their time searching for relevant information.

Despite marketing of million-token context windows, many builders report models becoming unreliable past 40,000 to 100,000 tokens due to context rot.

Chroma's Context One, a 20 billion parameter model, runs at 400 tokens/sec on Blackwell hardware and 3,000 tokens/sec on Cerebras hardware, significantly faster and cheaper than models like Opus (about 40 tokens/sec).

Agentic search is a loop where a model uses tools, decides when to stop, and can employ hybrid search (dense + sparse vectors) and grep search.

The future of context is predicted to be continuous, extremely fast due to small language models, and centered around continual learning at the context layer rather than model weight updates.

AI as a tool: context and reasoning are key

Jeff Huber, CEO of Chroma, frames AI not as a technological deity but as a powerful tool, emphasizing the critical role of context alongside reasoning. He defines a traditional computer as a universal structured information processor and AI as a universal unstructured information processor. Huber contends that while reasoning has received significant investment and focus, context remains largely underrated, despite data showing a substantial portion of AI interactions, like over 45% of ChatGPT queries, are centered around user information or specific questions. This highlights the fundamental need for AI systems to effectively process and utilize context to be truly useful.

The human cost of information seeking

The challenge of information retrieval is not unique to AI; it mirrors the struggles of human information workers. On average, humans spend about 30% of their workday searching for the right information to perform their tasks accurately. This statistic underscores the inherent difficulty and time consumption associated with sifting through vast amounts of data. As AI agents are increasingly tasked with performing information work, they will require similar capabilities to efficiently find and utilize relevant context, making agentic search a crucial development.

Context rot: the silent killer of large context windows

Despite advancements in language model context windows, often marketed with millions of tokens, practical applications face a phenomenon known as 'context rot.' Chroma's research indicates that models do not perform consistently across large context lengths; performance often degrades significantly beyond 40,000 to 100,000 tokens, a range far below manufacturer claims. This degradation makes model outputs unreliable, akin to a coin flip. This limitation necessitates 'context engineering,' the deliberate curation and management of information fed to AI models, rather than simply relying on larger context windows. For builders, this means understanding and working within these effective limits, even when higher limits are advertised, to ensure reliable application performance.
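One simple form of context engineering is to cap what is sent to the model at an effective token budget instead of the advertised window. The sketch below is illustrative only and is not Chroma's method: `estimate_tokens` and `pack_context` are hypothetical helpers, the relevance scores are assumed to come from some upstream retriever, and word count is a crude stand-in for a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: count whitespace-separated words.

    Assumption: a real system would use the model's own tokenizer.
    """
    return len(text.split())


def pack_context(chunks: list[tuple[float, str]], budget: int) -> list[str]:
    """Greedily keep the highest-scoring chunks that fit in the token budget.

    `chunks` is a list of (relevance_score, text) pairs; the scores are
    assumed to be produced elsewhere (for example by a search step).
    """
    selected: list[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            selected.append(text)
            used += cost
    return selected
```

In practice the budget would be set well below the advertised window, for example somewhere in the 40,000 to 100,000 token range cited above, and tuned by measuring output quality on the actual task.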

Agentic search: a continuous loop for reading and writing context

Agentic search is proposed as a solution to context and reliability issues. It operates on a continuous loop where a 'search agent' utilizes a set of provided tools, which can include hybrid search (combining dense vector and sparse vector search), grep for full-text queries, and document retrieval. Crucially, the agent has the ability to decide when to stop its search process. This mirrors human interaction with search engines like Google, where users iteratively refine queries and explore links. Agentic search is essential for both the 'read path' (retrieving information the agent needs) and the 'write path' (determining where to store newly learned information to maintain a consistent knowledge base). This paradigm shift allows agents to manage their own context effectively.
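The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Chroma's implementation: the corpus is made up, `keyword_search` is a word-overlap stand-in for real hybrid search, and `choose_action` is a scripted stand-in for the model's decision about which tool to call next and when to stop. All names are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Toy in-memory corpus (hypothetical data, stands in for a document store).
CORPUS = {
    "doc1": "Context rot is the degradation of model output as context grows.",
    "doc2": "Bishkek is the capital of Kyrgyzstan.",
    "doc3": "Agentic search is a loop in which a model calls tools and decides when to stop.",
}


def keyword_search(query: str, k: int = 2) -> list[str]:
    """Rank documents by shared words (a stand-in for hybrid search)."""
    words = set(re.findall(r"\w+", query.lower()))
    scored = []
    for doc_id, text in CORPUS.items():
        overlap = len(words & set(re.findall(r"\w+", text.lower())))
        if overlap:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def grep(pattern: str) -> list[str]:
    """Return ids of documents whose text matches the regex pattern."""
    return [d for d, t in CORPUS.items() if re.search(pattern, t, re.IGNORECASE)]


def read_document(doc_id: str) -> str:
    return CORPUS[doc_id]


TOOLS = {"keyword_search": keyword_search, "grep": grep, "read_document": read_document}


@dataclass
class AgentState:
    question: str
    notes: list[str] = field(default_factory=list)       # context gathered so far
    candidates: list[str] = field(default_factory=list)  # doc ids from the last search
    seen: set[str] = field(default_factory=set)          # doc ids already read
    tried: set[str] = field(default_factory=set)         # tools already called


def choose_action(state: AgentState) -> tuple[str, str]:
    """Scripted policy standing in for the model: returns (tool, argument)."""
    if state.notes:
        return ("stop", "")  # the agent decides it has enough context
    unread = [d for d in state.candidates if d not in state.seen]
    if unread:
        return ("read_document", unread[0])
    if "keyword_search" not in state.tried:
        return ("keyword_search", state.question)
    if "grep" not in state.tried:
        words = re.findall(r"\w+", state.question)
        return ("grep", max(words, key=len)) if words else ("stop", "")
    return ("stop", "")


def run_agent(question: str, max_steps: int = 6) -> list[str]:
    """Run the tool loop until the policy stops or the step budget runs out."""
    state = AgentState(question)
    for _ in range(max_steps):
        tool, arg = choose_action(state)
        if tool == "stop":
            break
        state.tried.add(tool)
        result = TOOLS[tool](arg)
        if tool == "read_document":
            state.seen.add(arg)
            state.notes.append(result)
        else:
            state.candidates = result
    return state.notes
```

Calling `run_agent("What is the capital of Kyrgyzstan?")` searches, reads the top hit, and stops. A query such as `"degradat"` finds nothing by word overlap, so the policy falls back to `grep` before reading. In a real system the scripted policy would be replaced by a model call, and a write path would store what the agent learned.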

Chroma's Context One: speed, cost, and performance breakthroughs

Chroma has developed Context One, a 20 billion parameter open-source model designed for agentic search. This model offers significant performance and cost advantages over larger, frontier models. It achieves speeds of 400 tokens per second on Blackwell hardware and 3,000 tokens per second on Cerebras hardware, far surpassing the typical 40 tokens per second of models like Opus. In terms of cost, Context One is priced at approximately $1 per million output tokens, a fraction of the cost of commercial alternatives. This efficiency makes it possible to train smaller models to excel at complex retrieval tasks, challenging the notion that larger models are always superior for such applications. Chroma claims Context One defines the Pareto frontier for accuracy versus latency and cost, with further advancements (Context 2 and 3) anticipated.
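To make the quoted figures concrete, the arithmetic below converts throughput and price into wall-clock time and cost for a given number of output tokens. It is a back-of-the-envelope sketch: the helper names are hypothetical, the speeds and prices are the ones quoted in the talk rather than measured benchmarks, and it assumes a single sequential generation stream.

```python
def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to emit `tokens` at a steady decoding speed."""
    return tokens / tokens_per_second


def cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of `tokens` output tokens at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


# Figures as quoted in the talk (not independently measured).
SPEEDS = {"Opus": 40, "Context One (Blackwell)": 400, "Context One (Cerebras)": 3000}

if __name__ == "__main__":
    for name, speed in SPEEDS.items():
        print(f"{name}: {seconds_to_generate(10_000, speed):.1f} s per 10,000 tokens")
    print(f"Opus: ${cost_usd(1_000_000, 25):.2f} per million output tokens")
    print(f"Context One: ${cost_usd(1_000_000, 1):.2f} per million output tokens")
```

At these rates, 10,000 output tokens take 250 seconds at 40 tokens/sec, 25 seconds at 400 tokens/sec, and roughly 3.3 seconds at 3,000 tokens/sec, which is why a many-step search loop becomes practical on the faster model.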

The future of context: continuous, fast, and learning-centric

Huber outlines three key predictions for the future of context:

1. Continuous context management, where a steering layer continually guides reasoning models, both through retrieval (pull) and by pushing relevant information or interruptions (push).

2. Extreme speed, driven by small, efficient language models integrated deeply into applications, enabling architectures where compute is pushed close to the data to minimize network costs and latency.

3. Continual learning primarily at the context layer, meaning systems will learn by adding knowledge to their context systems rather than by frequently fine-tuning or updating model weights.

This approach is made feasible by the low cost and speed of fine-tuning models like Context One, allowing systems to adapt and integrate new knowledge efficiently.

Speed Comparison: Chroma Context One vs. Opus

Data extracted from this episode

Model	Hardware	Speed (tokens/sec)	Cost (USD per million output tokens)
Chroma Context One	Blackwell	400	~$1
Chroma Context One	Cerebras	3,000 (current)	N/A
Chroma Context One	Cerebras	15,000-20,000 (future)	N/A
Opus	N/A	40 (average)	$25 (for 4.6)

Common Questions

What is agentic search?

Agentic search is a loop in which an AI model, acting as a search agent, has access to tools (such as search functions) and decides for itself when to stop searching. It mimics how humans use search engines: querying, reviewing results, and exploring further.

Topics

AI Agents, Context Engineering, AI & Machine Learning, Technology & Innovation, Large Language Models, Retrieval-Augmented Generation, LLM Performance, Agentic Search, Context Rot

Mentioned in this video

Software & Apps

Chroma DB

Described in the talk as the most popular open-source solution for search and retrieval, available both for local use and as a distributed cloud service.

Chroma

A company focused on context in AI, makers of Chroma DB and the Context One model, and known for research on context rot.

ChatGPT

Mentioned in the context of OpenAI's report on usage, with over 45% of its queries being 'asking centric'.

Copilot

Mentioned as a tool users might be familiar with, implicitly related to agentic tasks and context management.

Context One

An open-source 20 billion parameter model trained by Chroma, described as state-of-the-art at agentic search and significantly faster and cheaper than frontier models.

Claude

Mentioned as a model whose users might already be experiencing agents running sub-agents and gathering context.

Git

Mentioned as an example of a single repository representing 'a little bit of information'.

Opus

A language model from Anthropic, mentioned as a benchmark for performance and speed (40 tokens/sec) compared to Chroma's Context One.

Codex

Mentioned as a tool users might be familiar with, which engages in sub-agent activities and context gathering.

Chroma sync

Mentioned as an example of a simple query ('what are the features of Chroma sync?').

People

Daniel Kahneman

Author of 'Thinking, Fast and Slow', whose ideas on System One and System Two thinking are used to explain how the brain splits its workload between fast and slow processing.

Locations

Kyrgyzstan

Mentioned as an example of a simple query ('what is the capital of Kyrgyzstan?').

Products

Blackwell

Hardware on which Chroma Context One is said to run at 400 tokens per second.

Companies

GitHub

Mentioned as a platform where Chroma DB has a significant number of stars, indicating its popularity.

OpenAI

Mentioned for putting out a report on how people are using ChatGPT and as a company whose model releases are informed by Chroma's research on context rot.

Anthropic

Mentioned as a company that routinely cites Chroma's research on context rot in their new model releases.

Google

Used as an analogy for how humans naturally perform agentic search by typing queries, seeing previews, and exploring links.

Cerebras

A hardware company whose chips allow Chroma Context One to run at 3,000 tokens per second, significantly faster than models like Opus.

Books

Thinking, Fast and Slow

A book by Daniel Kahneman that introduces the concept of System One and System Two thinking, used as an analogy for AI processing.

Media

Doom

A video game demoed previously by the speaker, used as a point of comparison for current AI capabilities.
