⚡️ Beyond Transformers with Power Retention

Latent Space Podcast
Science & Technology · 4 min read · 33 min video
Sep 23, 2025


TL;DR

Manifest AI introduces Power Retention, a fixed-memory architecture replacing Transformer Attention to solve long-context AI bottlenecks and reduce inference costs significantly.

Key Insights

1. Power Retention uses a fixed-size memory, a significant departure from the Transformer's growing KV cache, eliminating a major computational bottleneck for long contexts.

2. The fixed-size state enables more efficient, cost-effective inference: GPUs can be partitioned predictably among users, and memory-management complexity drops.

3. The Vidrial framework produces highly optimized CUDA kernels by sweeping over implementation configurations to find the best one for a specific operation and hardware.

4. Existing pre-trained models can be transformed ("metamorphosed") into Power Retention variants with minimal mid-training fine-tuning, without full pre-training.

5. At long context lengths (e.g., 64k tokens), Power Retention demonstrates substantial speedups over traditional attention: up to 10x in training and up to 100x in inference.

6. Manifest AI is open-sourcing its components to foster community adoption and the development of new models and applications that leverage long contexts.

7. The true value of long context is data-dependent: internet text often favors short contexts, while domains with long-term dependencies (e.g., human trajectories) stand to benefit most from Power Retention.

THE ORIGIN AND MISSION OF MANIFEST AI

Manifest AI was founded by Jacob Buckman and Carles Gelada, former researchers from Google Brain and OpenAI, respectively. Their core belief was that simply scaling up existing Transformer models would not lead to true human-level intelligence. They identified the challenge of efficiently processing large inputs, specifically long contexts, as a fundamental blocker. The company set out to address this bottleneck at its deepest level, moving beyond incremental improvements to create a fundamentally different architecture.

THE KV CACHE BOTTLENECK IN TRANSFORMERS

Traditional Transformer models face a significant computational bottleneck due to the KV cache. This cache stores information from every token processed, causing it to grow linearly with the sequence length. Consequently, each subsequent token prediction becomes more expensive, making long contexts computationally intractable and costly. Many existing 'long context' solutions are workarounds like windowed attention, which sacrifice information from earlier parts of the context, leading to degraded performance on complex, long-range dependencies.
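The growing-cache dynamic described above can be sketched in a few lines. This is an illustrative NumPy toy, not any production implementation; the single-head setup and `d_head = 64` are simplifying assumptions:

```python
import numpy as np

d_head = 64  # head dimension (hypothetical)

def decode_step(kv_cache, k_new, v_new, q_new):
    """Append the new token's key/value, then attend over the whole cache."""
    kv_cache["k"].append(k_new)
    kv_cache["v"].append(v_new)
    K = np.stack(kv_cache["k"])           # (t, d) -- grows by one row per token
    V = np.stack(kv_cache["v"])
    scores = K @ q_new / np.sqrt(d_head)  # O(t * d) work at step t
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                    # attention output for the new token

cache = {"k": [], "v": []}
rng = np.random.default_rng(0)
for t in range(5):
    out = decode_step(cache,
                      rng.standard_normal(d_head),
                      rng.standard_normal(d_head),
                      rng.standard_normal(d_head))

print(len(cache["k"]))  # 5: the cache holds one entry per decoded token
```

Because each call touches every past token, step t costs O(t) and decoding n tokens costs O(n²) in total; this is exactly the cost that makes very long contexts intractable.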

INTRODUCING POWER RETENTION: A FIXED-MEMORY ARCHITECTURE

Manifest AI's solution is 'Power Retention,' a new family of architectures that replace the Transformer's attention mechanism. Instead of a growing KV cache, Power Retention utilizes a fixed-size memory. New tokens are compressed into this memory, which does not grow. The size of this memory can be adjusted based on the problem's difficulty and the available compute budget. This provides a stateful memory comparable to model parameters, allowing for dynamic scaling without the quadratic cost associated with traditional attention.
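A minimal sketch of the fixed-memory idea, using a generic linear-recurrence toy. The decay constant and the outer-product update are illustrative stand-ins, not the actual Power Retention equations:

```python
import numpy as np

d = 64        # state dimension (hypothetical)
decay = 0.99  # hypothetical forgetting factor

state = np.zeros((d, d))  # the entire memory: fixed size, never grows
rng = np.random.default_rng(0)
for _ in range(1000):     # an arbitrarily long sequence
    k, v, q = (rng.standard_normal(d) for _ in range(3))
    state = decay * state + np.outer(k, v)  # compress the token into memory
    out = q @ state                         # read: O(d^2), independent of t

print(state.shape)  # (64, 64) regardless of how many tokens were processed
```

Making the state larger (a bigger `d`, or several such matrices) buys capacity without reintroducing per-token growth, which is the compute-budget knob the section describes.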

ENGINEERING ADVANTAGES AND INFERENCE EFFICIENCY

Power Retention dramatically simplifies inference infrastructure. The fixed-size memory eliminates the dynamic scheduling challenges associated with growing KV caches, allowing GPUs to be partitioned efficiently among users without waste. This leads to substantial reductions in engineering complexity and inference costs. While training sees significant speedups (e.g., 10x at 64k tokens), inference benefits are even more pronounced, with potential for 100x speed increases due to savings in computation and memory operations.
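A back-of-envelope comparison of the two memory footprints makes the partitioning argument concrete. All model dimensions below are hypothetical example numbers, not those of StarCoder or any released model:

```python
BYTES = 2  # fp16
n_layers, n_heads, d_head = 32, 32, 64  # hypothetical model shape

# Growing KV cache: two vectors (K and V) per head, per layer, per token.
kv_bytes_per_token = 2 * n_layers * n_heads * d_head * BYTES
kv_at_64k = kv_bytes_per_token * 64_000
print(f"KV cache at 64k tokens: {kv_at_64k / 1e9:.1f} GB")   # grows with context

# Fixed retention state: one d x d matrix per head, per layer -- constant.
state_bytes = n_layers * n_heads * d_head * d_head * BYTES
print(f"Fixed state:            {state_bytes / 1e9:.3f} GB")  # any context length
```

Because the second number is constant per user, a GPU can be statically divided into as many fixed-size slots as fit, which is the scheduling simplification the paragraph above describes.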

VIDRIAL: OPTIMIZING HARDWARE UTILIZATION WITH CUSTOM KERNELS

To achieve practical speedups, Power Retention is implemented using highly optimized CUDA kernels generated by Vidrial. Vidrial is a framework for writing generalized GPU kernels that dynamically sweep over various implementation strategies (e.g., core selection, data movement, tiling) to find the optimal configuration for a specific hardware and problem shape. This 'just-in-time sweeping' approach ensures maximum GPU utilization, delivering performance gains even in scenarios where traditional kernels are less optimized, potentially achieving 20-30% improvement over highly optimized existing kernels like Flash Attention.
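The "sweep over configurations" idea can be sketched as a generic autotuner. This is a toy in plain Python; Vidrial's actual API and search space are not shown here, and `toy_kernel` is purely hypothetical:

```python
import itertools
import time

def benchmark(kernel, config, shape, iters=3):
    """Average wall-clock time of the kernel under one configuration."""
    start = time.perf_counter()
    for _ in range(iters):
        kernel(shape, **config)
    return (time.perf_counter() - start) / iters

def autotune(kernel, shape, search_space):
    """Try every configuration on the target problem shape; keep the fastest."""
    candidates = [dict(zip(search_space, vals))
                  for vals in itertools.product(*search_space.values())]
    return min(candidates, key=lambda c: benchmark(kernel, c, shape))

# Toy "kernel": runtime depends on how the tile size divides the problem.
def toy_kernel(shape, tile_m, tile_n):
    m, n = shape
    tiles = -(-m // tile_m) * -(-n // tile_n)  # ceil-div: tiles launched
    _ = sum(range(tiles))                      # stand-in for real work

space = {"tile_m": [16, 32, 64], "tile_n": [16, 32, 64]}
best = autotune(toy_kernel, (1024, 1024), space)
print(best)  # the fastest tiling for this shape on this machine
```

A real kernel framework sweeps far more dimensions (core selection, data movement, pipelining) and benchmarks on the actual GPU, but the select-by-measurement loop is the same.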

METAMORPHOSIS: TRANSFORMATION WITHOUT FULL PRE-TRAINING

A key advantage of Power Retention is that it doesn't require complete re-pre-training. Existing pre-trained Transformer models can be transformed into Power Retention variants through a process called 'metamorphosis,' which involves minimal mid-training fine-tuning on all parameters. This process has been shown to quickly recover performance, enabling models like StarCoder 3B to achieve comparable or even superior downstream task performance with significantly faster training and inference, particularly at extended context lengths.
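The metamorphosis process can be sketched structurally: walk the pretrained model's layers, swap each attention block for a retention block that reuses compatible weights, and leave everything else untouched before the short all-parameter fine-tune. This is a conceptual sketch with hypothetical layer representations, not the released tooling:

```python
def metamorphose(layers):
    """layers: list of dicts like {"type": "attention", "weights": ...}."""
    new_layers = []
    for layer in layers:
        if layer["type"] == "attention":
            # Reuse the existing projection weights; only the mixing rule changes.
            new_layers.append({"type": "retention", "weights": layer["weights"]})
        else:
            new_layers.append(layer)  # embeddings, MLPs, norms stay as-is
    return new_layers

pretrained = [
    {"type": "embedding", "weights": "E"},
    {"type": "attention", "weights": "W0"},
    {"type": "mlp",       "weights": "M0"},
    {"type": "attention", "weights": "W1"},
]
morphed = metamorphose(pretrained)
print([layer["type"] for layer in morphed])
# ['embedding', 'retention', 'mlp', 'retention']
```

Because most weights carry over, only a brief mid-training run on long-context data is needed to recover performance, rather than pre-training from scratch.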

POWER CODER AND OPEN-SOURCE CONTRIBUTIONS

Manifest AI is releasing 'PowerCoder,' a transformed version of the StarCoder 3B model, demonstrating the efficacy of their approach. They are also open-sourcing all components, including the Vidrial framework and Power Retention kernels, to encourage community adoption. This allows other researchers and developers to apply the 'metamorphosis' process to various base models, creating specialized Power Retention variants for different tasks and languages, fostering a new ecosystem of efficient, long-context AI models.

THE ROLE OF DATA IN LONG CONTEXT VALUE

The effectiveness of long context capabilities is highly dependent on the dataset. Internet text, often composed of short documents, provides limited benefit for extremely long contexts when simply 'packed' together. While Transformers are efficient for such data, Manifest AI believes that unique datasets with inherent long-term dependencies, such as human trajectories in administrative or coding tasks, are where Power Retention's true potential can be unlocked. They are actively seeking collaborations with holders of such specialized datasets.

FUTURE OF ARCHITECTURE AND COMMUNITY BUILDING

Manifest AI aims to build community trust and drive adoption by open-sourcing their technology. They anticipate that after initial community experimentation and validation, larger foundation model companies will gradually adopt Power Retention. Their roadmap includes scaling to larger models (e.g., 30 billion parameters) and exploring tensor/sequence parallelism for even longer contexts (64k and beyond). The goal is to see a widespread shift towards retention-based architectures for inference-heavy, long-context use cases.

Common Questions

What problem is Manifest AI trying to solve?

Manifest AI aims to solve the long-context bottleneck in large language models. The standard Transformer architecture struggles with the growing cost of processing longer sequences: the KV cache expands with every token, making long contexts inefficient and expensive.


Mentioned in this video

company · Fireworks

Mentioned alongside Together as an inference provider that could benefit from Power Retention technology.

software · StarCoder 3B

A 3-billion-parameter coding model from the BigCode project, used to demonstrate the 'metamorphosis' process and PowerCoder, showing significant improvements with Power Retention.

company · Together

An inference provider that Manifest AI sees as a potential partner to serve their models at lower prices and faster latencies.

software · log cabin

Manifest AI's in-house run-visualization platform, used to showcase training curves and model performance.

concept · windowed attention

A workaround for long-context limits in Transformers that discards tokens outside a fixed window, degrading performance on older parts of the context.

concept · Power Retention

Manifest AI's novel architecture that uses a fixed-size, compressed memory instead of a growing KV cache, enabling efficient handling of long contexts.

company · Manifest AI

A company founded by Jacob Buckman and Carles Gelada, focused on solving the long-context bottleneck in language models with the Power Retention architecture.

person · Alessio Fanelli

Host of the Latent Space podcast and associated with Kernel Labs, an investor in Manifest AI.

concept · the Transformer

The foundational architecture that has dominated LLMs; its history and evolution are compared to the potential adoption of Power Retention.

concept · HumanEval

A downstream evaluation benchmark for coding models, on which PowerCoder showed improved accuracy.

software · GPT-5

Mentioned as a future frontier model, raising the question of whether it could be built on Power Retention.

organization · Decibel

An investment firm that invested in Manifest AI early on.

software · flash attention

An optimized implementation of Transformer attention, which Manifest AI's Vidrial framework can match or outperform, especially on non-standard problem shapes.

media · Latent Space podcast

The podcast where the interview takes place, hosted by Alessio Fanelli.

software · Vidrial

A general framework developed by Manifest AI for writing hardware-efficient CUDA kernels, enabling performance sweeps for optimal GPU utilization.

software · retention

The package that provides access to Manifest AI's flash attention kernels (written in Vidrial) and the Power Retention architecture.

person · Jacob Buckman

Co-founder of Manifest AI, previously at Google Brain. He discusses the company's innovations in AI architecture.

software · PowerCoder

A metamorphosed version of StarCoder 3B using the Power Retention architecture, demonstrating significantly faster training and inference for long contexts.

software · Manifesto

A tool built by Manifest AI that points at a repository and attempts to fix errors in its code, demonstrating the capabilities of PowerCoder.

concept · MoE models

Mixture-of-Experts models, mentioned as an architecture that differs from dense Transformers, relevant to the future development of large-scale models.

company · SF Compute

A compute provider that Manifest AI has used for research and development.

organization · Triton

A GPU kernel programming language from OpenAI, relevant to the discussion of writing hardware-efficient kernels.
