⚡️ Beyond Transformers with Power Retention

Latent Space Podcast
Science & Technology · 4 min read · 33 min video
Sep 23, 2025


TL;DR

Manifest AI introduces Power Retention, a fixed-memory architecture replacing Transformer Attention to solve long-context AI bottlenecks and reduce inference costs significantly.

Key Insights

1. Power Retention uses a fixed-size memory, a significant departure from the Transformer's growing KV cache, eliminating a major computational bottleneck for long contexts.

2. The fixed-size state enables more efficient, cost-effective inference: GPUs can be partitioned predictably among users, and memory-management complexity drops.

3. The Vidrial framework produces highly optimized CUDA kernels by sweeping over implementation configurations to find the best one for a specific operation and hardware.

4. Existing pre-trained models can be transformed ("metamorphosed") into Power Retention variants with minimal mid-training fine-tuning, without full pre-training.

5. At long context lengths (e.g., 64k tokens), Power Retention demonstrates substantial speedups over traditional attention: up to 10x in training and up to 100x in inference.

6. Manifest AI is open-sourcing its components to foster community adoption and the development of new models and applications that leverage long contexts.

7. The true value of long context is data-dependent: internet text often favors short contexts, while domains with long-term dependencies (e.g., human trajectories) stand to benefit most from Power Retention.

THE ORIGIN AND MISSION OF MANIFEST AI

Manifest AI was founded by Jacob Buckman and Carles Gelada, former researchers from Google Brain and OpenAI, respectively. Their core belief was that simply scaling up existing Transformer models would not lead to true human-level intelligence. They identified the challenge of efficiently processing large inputs, specifically long contexts, as a fundamental blocker. The company set out to address this bottleneck at its deepest level, moving beyond incremental improvements to create a fundamentally different architecture.

THE KV CACHE BOTTLENECK IN TRANSFORMERS

Traditional Transformer models face a significant computational bottleneck due to the KV cache. This cache stores information from every token processed, causing it to grow linearly with the sequence length. Consequently, each subsequent token prediction becomes more expensive, making long contexts computationally intractable and costly. Many existing 'long context' solutions are workarounds like windowed attention, which sacrifice information from earlier parts of the context, leading to degraded performance on complex, long-range dependencies.
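The growing-cache dynamic described above can be sketched in a few lines. This is an illustrative NumPy toy, not any production implementation; the single-head setup and `d_head = 64` are simplifying assumptions:

```python
import numpy as np

d_head = 64  # head dimension (hypothetical)

def decode_step(kv_cache, k_new, v_new, q_new):
    """Append the new token's key/value, then attend over the whole cache."""
    kv_cache["k"].append(k_new)
    kv_cache["v"].append(v_new)
    K = np.stack(kv_cache["k"])           # (t, d) -- grows by one row per token
    V = np.stack(kv_cache["v"])
    scores = K @ q_new / np.sqrt(d_head)  # O(t * d) work at step t
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                    # attention output for the new token

cache = {"k": [], "v": []}
rng = np.random.default_rng(0)
for t in range(5):
    out = decode_step(cache,
                      rng.standard_normal(d_head),
                      rng.standard_normal(d_head),
                      rng.standard_normal(d_head))

print(len(cache["k"]))  # 5: the cache holds one entry per decoded token
```

Because each call touches every past token, step t costs O(t) and decoding n tokens costs O(n²) in total; this is exactly the cost that makes very long contexts intractable.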

INTRODUCING POWER RETENTION: A FIXED-MEMORY ARCHITECTURE

Manifest AI's solution is 'Power Retention,' a new family of architectures that replace the Transformer's attention mechanism. Instead of a growing KV cache, Power Retention utilizes a fixed-size memory. New tokens are compressed into this memory, which does not grow. The size of this memory can be adjusted based on the problem's difficulty and the available compute budget. This provides a stateful memory comparable to model parameters, allowing for dynamic scaling without the quadratic cost associated with traditional attention.
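A minimal sketch of the fixed-memory idea, using a generic linear-recurrence toy. The decay constant and the outer-product update are illustrative stand-ins, not the actual Power Retention equations:

```python
import numpy as np

d = 64        # state dimension (hypothetical)
decay = 0.99  # hypothetical forgetting factor

state = np.zeros((d, d))  # the entire memory: fixed size, never grows
rng = np.random.default_rng(0)
for _ in range(1000):     # an arbitrarily long sequence
    k, v, q = (rng.standard_normal(d) for _ in range(3))
    state = decay * state + np.outer(k, v)  # compress the token into memory
    out = q @ state                         # read: O(d^2), independent of t

print(state.shape)  # (64, 64) regardless of how many tokens were processed
```

Making the state larger (a bigger `d`, or several such matrices) buys capacity without reintroducing per-token growth, which is the compute-budget knob the section describes.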

ENGINEERING ADVANTAGES AND INFERENCE EFFICIENCY

Power Retention dramatically simplifies inference infrastructure. The fixed-size memory eliminates the dynamic scheduling challenges associated with growing KV caches, allowing GPUs to be partitioned efficiently among users without waste. This leads to substantial reductions in engineering complexity and inference costs. While training sees significant speedups (e.g., 10x at 64k tokens), inference benefits are even more pronounced, with potential for 100x speed increases due to savings in computation and memory operations.
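A back-of-envelope comparison of the two memory footprints makes the partitioning argument concrete. All model dimensions below are hypothetical example numbers, not those of StarCoder or any released model:

```python
BYTES = 2  # fp16
n_layers, n_heads, d_head = 32, 32, 64  # hypothetical model shape

# Growing KV cache: two vectors (K and V) per head, per layer, per token.
kv_bytes_per_token = 2 * n_layers * n_heads * d_head * BYTES
kv_at_64k = kv_bytes_per_token * 64_000
print(f"KV cache at 64k tokens: {kv_at_64k / 1e9:.1f} GB")   # grows with context

# Fixed retention state: one d x d matrix per head, per layer -- constant.
state_bytes = n_layers * n_heads * d_head * d_head * BYTES
print(f"Fixed state:            {state_bytes / 1e9:.3f} GB")  # any context length
```

Because the second number is constant per user, a GPU can be statically divided into as many fixed-size slots as fit, which is the scheduling simplification the paragraph above describes.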

VIDRIAL: OPTIMIZING HARDWARE UTILIZATION WITH CUSTOM KERNELS

To achieve practical speedups, Power Retention is implemented using highly optimized CUDA kernels generated by Vidrial. Vidrial is a framework for writing generalized GPU kernels that dynamically sweep over various implementation strategies (e.g., core selection, data movement, tiling) to find the optimal configuration for a specific hardware and problem shape. This 'just-in-time sweeping' approach ensures maximum GPU utilization, delivering performance gains even in scenarios where traditional kernels are less optimized, potentially achieving 20-30% improvement over highly optimized existing kernels like Flash Attention.
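The "sweep over configurations" idea can be sketched as a generic autotuner. This is a toy in plain Python; Vidrial's actual API and search space are not shown here, and `toy_kernel` is purely hypothetical:

```python
import itertools
import time

def benchmark(kernel, config, shape, iters=3):
    """Average wall-clock time of the kernel under one configuration."""
    start = time.perf_counter()
    for _ in range(iters):
        kernel(shape, **config)
    return (time.perf_counter() - start) / iters

def autotune(kernel, shape, search_space):
    """Try every configuration on the target problem shape; keep the fastest."""
    candidates = [dict(zip(search_space, vals))
                  for vals in itertools.product(*search_space.values())]
    return min(candidates, key=lambda c: benchmark(kernel, c, shape))

# Toy "kernel": runtime depends on how the tile size divides the problem.
def toy_kernel(shape, tile_m, tile_n):
    m, n = shape
    tiles = -(-m // tile_m) * -(-n // tile_n)  # ceil-div: tiles launched
    _ = sum(range(tiles))                      # stand-in for real work

space = {"tile_m": [16, 32, 64], "tile_n": [16, 32, 64]}
best = autotune(toy_kernel, (1024, 1024), space)
print(best)  # the fastest tiling for this shape on this machine
```

A real kernel framework sweeps far more dimensions (core selection, data movement, pipelining) and benchmarks on the actual GPU, but the select-by-measurement loop is the same.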

METAMORPHOSIS: TRANSFORMATION WITHOUT FULL PRE-TRAINING

A key advantage of Power Retention is that it doesn't require complete re-pre-training. Existing pre-trained Transformer models can be transformed into Power Retention variants through a process called 'metamorphosis,' which involves minimal mid-training fine-tuning on all parameters. This process has been shown to quickly recover performance, enabling models like StarCoder 3B to achieve comparable or even superior downstream task performance with significantly faster training and inference, particularly at extended context lengths.
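The metamorphosis process can be sketched structurally: walk the pretrained model's layers, swap each attention block for a retention block that reuses compatible weights, and leave everything else untouched before the short all-parameter fine-tune. This is a conceptual sketch with hypothetical layer representations, not the released tooling:

```python
def metamorphose(layers):
    """layers: list of dicts like {"type": "attention", "weights": ...}."""
    new_layers = []
    for layer in layers:
        if layer["type"] == "attention":
            # Reuse the existing projection weights; only the mixing rule changes.
            new_layers.append({"type": "retention", "weights": layer["weights"]})
        else:
            new_layers.append(layer)  # embeddings, MLPs, norms stay as-is
    return new_layers

pretrained = [
    {"type": "embedding", "weights": "E"},
    {"type": "attention", "weights": "W0"},
    {"type": "mlp",       "weights": "M0"},
    {"type": "attention", "weights": "W1"},
]
morphed = metamorphose(pretrained)
print([layer["type"] for layer in morphed])
# ['embedding', 'retention', 'mlp', 'retention']
```

Because most weights carry over, only a brief mid-training run on long-context data is needed to recover performance, rather than pre-training from scratch.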

POWER CODER AND OPEN-SOURCE CONTRIBUTIONS

Manifest AI is releasing 'PowerCoder,' a transformed version of the StarCoder 3B model, demonstrating the efficacy of their approach. They are also open-sourcing all components, including the Vidrial framework and Power Retention kernels, to encourage community adoption. This allows other researchers and developers to apply the 'metamorphosis' process to various base models, creating specialized Power Retention variants for different tasks and languages, fostering a new ecosystem of efficient, long-context AI models.

THE ROLE OF DATA IN LONG CONTEXT VALUE

The effectiveness of long context capabilities is highly dependent on the dataset. Internet text, often composed of short documents, provides limited benefit for extremely long contexts when simply 'packed' together. While Transformers are efficient for such data, Manifest AI believes that unique datasets with inherent long-term dependencies, such as human trajectories in administrative or coding tasks, are where Power Retention's true potential can be unlocked. They are actively seeking collaborations with holders of such specialized datasets.

FUTURE OF ARCHITECTURE AND COMMUNITY BUILDING

Manifest AI aims to build community trust and drive adoption by open-sourcing their technology. They anticipate that after initial community experimentation and validation, larger foundation model companies will gradually adopt Power Retention. Their roadmap includes scaling to larger models (e.g., 30 billion parameters) and exploring tensor/sequence parallelism for even longer contexts (64k and beyond). The goal is to see a widespread shift towards retention-based architectures for inference-heavy, long-context use cases.

Common Questions

What problem is Manifest AI trying to solve?

Manifest AI aims to solve the long-context bottleneck in large language models. The standard Transformer architecture struggles with the growing cost of processing longer sequences: the KV cache expands with every token, making long contexts inefficient and expensive.


Mentioned in this video

company · Fireworks

Mentioned alongside Together as an inference provider that could benefit from Power Retention technology.

software · StarCoder 3B

A 3-billion-parameter coding model from the BigCode project, used to demonstrate the 'metamorphosis' process and PowerCoder, showing significant improvements with Power Retention.

company · Together

An inference provider that Manifest AI sees as a potential partner to serve their models at lower prices and faster latencies.

software · log cabin

Manifest AI's in-house run-visualization platform, used to showcase training curves and model performance.

concept · windowed attention

A workaround for long-context limits in Transformers that discards tokens outside a fixed window, degrading performance on older parts of the context.

concept · Power Retention

Manifest AI's novel architecture that uses a fixed-size, compressed memory instead of a growing KV cache, enabling efficient handling of long contexts.

company · Manifest AI

A company founded by Jacob Buckman and Carles Gelada, focused on solving the long-context bottleneck in language models with the Power Retention architecture.

person · Alessio Fanelli

Host of the Latent Space podcast and associated with Kernel Labs, an investor in Manifest AI.

concept · the Transformer

The foundational architecture that has dominated LLMs; its history and evolution are compared to the potential adoption of Power Retention.

concept · HumanEval

A downstream evaluation benchmark for coding models, on which PowerCoder showed improved accuracy.

software · GPT-5

Mentioned as a future frontier model, raising the question of whether it could be built on Power Retention.

organization · Decibel

An investment firm that invested in Manifest AI early on.

software · flash attention

An optimized implementation of Transformer attention, which Manifest AI's Vidrial framework can match or outperform, especially on non-standard problem shapes.

media · Latent Space podcast

The podcast where the interview takes place, hosted by Alessio Fanelli.

software · Vidrial

A general framework developed by Manifest AI for writing hardware-efficient CUDA kernels, enabling performance sweeps for optimal GPU utilization.

software · retention

The package that provides access to Manifest AI's flash attention kernels (written in Vidrial) and the Power Retention architecture.

person · Jacob Buckman

Co-founder of Manifest AI, previously at Google Brain. He discusses the company's innovations in AI architecture.

software · PowerCoder

A metamorphosed version of StarCoder 3B using the Power Retention architecture, demonstrating significantly faster training and inference for long contexts.

software · Manifesto

A tool built by Manifest AI that points at a repository and attempts to fix errors in its code, demonstrating the capabilities of PowerCoder.

concept · MoE models

Mixture-of-Experts models, mentioned as an architecture that differs from dense Transformers, relevant to the future development of large-scale models.

company · SF Compute

A compute provider that Manifest AI has used for research and development.

organization · Triton

A GPU kernel programming language from OpenAI, relevant to the discussion of writing hardware-efficient kernels.
