⚡️ Beyond Transformers with Power Retention

Latent Space Podcast
Science & Technology · 4 min read · 33 min video
Sep 23, 2025
TL;DR

Manifest AI introduces Power Retention, a fixed-memory architecture replacing Transformer Attention to solve long-context AI bottlenecks and reduce inference costs significantly.

Key Insights

1. Power Retention offers a fixed-size memory architecture, a significant departure from the growing KV cache in Transformers, eliminating a major computational bottleneck for long contexts.

2. This new architecture enables more efficient and cost-effective inference by simplifying GPU allocation and reducing memory-management complexity.

3. The Vidrial framework enables highly optimized CUDA kernels by dynamically selecting the best configuration for a specific operation and hardware target.

4. Existing pre-trained models can be transformed ("metamorphosed") into Power Retention variants with minimal mid-training fine-tuning, with no full pre-training required.

5. Power Retention demonstrates substantial speed-ups over traditional attention at long context lengths (e.g., 64k tokens): up to 10x in training and up to 100x in inference.

6. Manifest AI is open-sourcing its components to foster community adoption and the development of new models and applications that leverage long-context capabilities.

7. The value of long context is data-dependent: internet text often favors short contexts, while domains with long-term dependencies (e.g., human trajectories) stand to benefit most from Power Retention.

THE ORIGIN AND MISSION OF MANIFEST AI

Manifest AI was founded by Jacob Buckman and Carles Gelada, former researchers at Google Brain and OpenAI, respectively. Their core belief was that simply scaling up existing Transformer models would not lead to human-level intelligence. They identified efficient processing of large inputs, long contexts in particular, as a fundamental blocker, and set out to address that bottleneck at its deepest level: not incremental improvements, but a fundamentally different architecture.

THE KV CACHE BOTTLENECK IN TRANSFORMERS

Traditional Transformer models face a significant computational bottleneck in the KV cache, which stores the keys and values of every token processed and therefore grows linearly with sequence length. Each new token prediction reads the entire cache, so per-token cost rises with position and total generation cost scales quadratically, making very long contexts computationally intractable and expensive. Many existing 'long context' solutions are workarounds such as windowed attention, which discard information from earlier parts of the context and degrade performance on tasks with complex long-range dependencies.
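A back-of-envelope estimate shows why the cache becomes a bottleneck. The model dimensions below are assumed for illustration, not taken from the episode:

```python
# Hedged sketch: KV-cache memory for a hypothetical decoder-only model.
# Two tensors (K and V) are stored per layer, one entry per token.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

for n in (4_096, 65_536, 1_048_576):
    gib = kv_cache_bytes(n) / 2**30
    print(f"{n:>9} tokens -> {gib:6.1f} GiB per sequence")
```

With these (illustrative) dimensions the cache costs 128 KiB per token, so a 64k-token sequence already holds 8 GiB of cache per user, before any model weights.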

INTRODUCING POWER RETENTION: A FIXED-MEMORY ARCHITECTURE

Manifest AI's solution is 'Power Retention,' a new family of architectures that replace the Transformer's attention mechanism. Instead of a growing KV cache, Power Retention utilizes a fixed-size memory. New tokens are compressed into this memory, which does not grow. The size of this memory can be adjusted based on the problem's difficulty and the available compute budget. This provides a stateful memory comparable to model parameters, allowing for dynamic scaling without the quadratic cost associated with traditional attention.
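The fixed-memory idea can be pictured with a minimal linear-attention-style state update. This is illustrative only; the actual Power Retention update rule (with its powered similarity and gating) differs, but the key property is the same: the state tensor never grows, no matter how many tokens are absorbed.

```python
import numpy as np

def retention_step(S, k, v, decay=0.99):
    """Compress one (key, value) pair into the fixed-size state S."""
    return decay * S + np.outer(k, v)

def retention_read(S, q):
    """Produce an output for query q from the compressed state."""
    return q @ S

d_k, d_v = 16, 16
rng = np.random.default_rng(0)
S = np.zeros((d_k, d_v))
for _ in range(100_000):  # absorb 100k tokens
    S = retention_step(S, rng.standard_normal(d_k), rng.standard_normal(d_v))
print(S.shape)  # still (16, 16): memory is independent of sequence length
```

Increasing `d_k` and `d_v` is the knob the section describes: a larger state for harder problems, traded against compute budget.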

ENGINEERING ADVANTAGES AND INFERENCE EFFICIENCY

Power Retention dramatically simplifies inference infrastructure. The fixed-size memory eliminates the dynamic scheduling challenges associated with growing KV caches, allowing GPUs to be partitioned efficiently among users without waste. This leads to substantial reductions in engineering complexity and inference costs. While training sees significant speedups (e.g., 10x at 64k tokens), inference benefits are even more pronounced, with potential for 100x speed increases due to savings in computation and memory operations.
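A rough cost model suggests why decode-time savings exceed training-time savings. The constants below are assumed for illustration, not measurements from the episode:

```python
# Illustrative decode-cost model (assumed constants, not measurements):
# attention re-reads the whole KV cache at every step, so the cost of
# step n is proportional to n; a fixed-state model does constant work.
def attention_decode_cost(seq_len):
    return sum(range(1, seq_len + 1))  # grows ~ seq_len**2 / 2 in total

def retention_decode_cost(seq_len, state_size=1024):
    return state_size * seq_len  # constant cost per token

n = 65_536
ratio = attention_decode_cost(n) / retention_decode_cost(n)
print(f"attention/retention total decode cost at {n} tokens: {ratio:.0f}x")
```

The ratio keeps growing with context length, which is why the gap widens exactly in the long-context regimes the episode discusses.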

VIDRIAL: OPTIMIZING HARDWARE UTILIZATION WITH CUSTOM KERNELS

To achieve practical speedups, Power Retention is implemented using highly optimized CUDA kernels generated by Vidrial. Vidrial is a framework for writing generalized GPU kernels that dynamically sweep over various implementation strategies (e.g., core selection, data movement, tiling) to find the optimal configuration for a specific hardware and problem shape. This 'just-in-time sweeping' approach ensures maximum GPU utilization, delivering performance gains even in scenarios where traditional kernels are less optimized, potentially achieving 20-30% improvement over highly optimized existing kernels like Flash Attention.
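The 'just-in-time sweeping' idea can be pictured with a generic autotuning loop. This is not the Vidrial API; the tiled matmul and candidate tile sizes below are a stand-in for the configuration space a real kernel generator would sweep:

```python
import time
import numpy as np

def blocked_matmul(A, B, tile):
    """Tiled matrix multiply; `tile` stands in for a kernel config knob."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile])
    return C

def autotune(A, B, candidates=(32, 64, 128)):
    """Benchmark each candidate on the actual problem shape; keep the best."""
    best, best_t = None, float("inf")
    for tile in candidates:
        t0 = time.perf_counter()
        blocked_matmul(A, B, tile)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = tile, dt
    return best

n = 256
A = np.random.default_rng(1).standard_normal((n, n))
B = np.random.default_rng(2).standard_normal((n, n))
tile = autotune(A, B)
print(f"best tile for this shape/hardware: {tile}")
```

The point of sweeping on the target machine is that the winning configuration depends on both the problem shape and the hardware, so it cannot be fixed at library-authoring time.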

METAMORPHOSIS: TRANSFORMATION WITHOUT FULL PRE-TRAINING

A key advantage of Power Retention is that it doesn't require complete re-pre-training. Existing pre-trained Transformer models can be transformed into Power Retention variants through a process called 'metamorphosis,' which involves minimal mid-training fine-tuning on all parameters. This process has been shown to quickly recover performance, enabling models like StarCoder 3B to achieve comparable or even superior downstream task performance with significantly faster training and inference, particularly at extended context lengths.
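The module-swap pattern behind metamorphosis can be sketched with toy classes. All names here are hypothetical; the real transformation operates on actual Transformer checkpoints and is followed by fine-tuning on all parameters:

```python
# Toy sketch of the metamorphosis pattern (hypothetical classes): each
# attention block is swapped for a retention block that inherits the
# pre-trained Q/K/V projection weights; only the token-mixing rule
# changes, so a short mid-training run can recover performance.
class Attention:
    def __init__(self, wq, wk, wv):
        self.wq, self.wk, self.wv = wq, wk, wv

class Retention:
    @classmethod
    def from_attention(cls, attn):
        r = cls()
        r.wq, r.wk, r.wv = attn.wq, attn.wk, attn.wv  # reuse weights
        return r

def metamorphose(blocks):
    return [Retention.from_attention(b) if isinstance(b, Attention) else b
            for b in blocks]

model = [Attention(1.0, 2.0, 3.0), "mlp", Attention(4.0, 5.0, 6.0)]
new_model = metamorphose(model)
print([type(b).__name__ for b in new_model])
```

Because the projection weights carry over, the transformed model starts close to the original, which is what makes the "minimal mid-training" claim plausible.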

POWER CODER AND OPEN-SOURCE CONTRIBUTIONS

Manifest AI is releasing 'PowerCoder,' a transformed version of the StarCoder 3B model, demonstrating the efficacy of their approach. They are also open-sourcing all components, including the Vidrial framework and Power Retention kernels, to encourage community adoption. This allows other researchers and developers to apply the 'metamorphosis' process to various base models, creating specialized Power Retention variants for different tasks and languages, fostering a new ecosystem of efficient, long-context AI models.

THE ROLE OF DATA IN LONG CONTEXT VALUE

The effectiveness of long context capabilities is highly dependent on the dataset. Internet text, often composed of short documents, provides limited benefit for extremely long contexts when simply 'packed' together. While Transformers are efficient for such data, Manifest AI believes that unique datasets with inherent long-term dependencies, such as human trajectories in administrative or coding tasks, are where Power Retention's true potential can be unlocked. They are actively seeking collaborations with holders of such specialized datasets.

FUTURE OF ARCHITECTURE AND COMMUNITY BUILDING

Manifest AI aims to build community trust and drive adoption by open-sourcing their technology. They anticipate that after initial community experimentation and validation, larger foundation model companies will gradually adopt Power Retention. Their roadmap includes scaling to larger models (e.g., 30 billion parameters) and exploring tensor/sequence parallelism for even longer contexts (64k and beyond). The goal is to see a widespread shift towards retention-based architectures for inference-heavy, long-context use cases.

Common Questions

What problem is Manifest AI trying to solve?

Manifest AI aims to solve the bottleneck of long context windows in large language models. The standard Transformer architecture struggles with long sequences because the KV cache, and with it the per-token cost, grows as the sequence gets longer, making long-context inference inefficient and expensive.
