
⚡️Mercury: Ultra-Fast Diffusion LLMs — Stefano Ermon, CEO of Inception Labs

Latent Space Podcast · Science & Technology
6 min read · 29 min video · Aug 4, 2025
TL;DR

Inception Labs introduces Mercury, diffusion-based LLMs that offer faster speeds and competitive quality, especially for coding and latency-sensitive tasks.

Key Insights

1. Diffusion models, inspired by image generation, are being adapted for discrete data like text and code, offering an alternative to autoregressive LLMs.

2. Mercury LLMs (Inception Labs) significantly improve speed and efficiency, outperforming autoregressive models in throughput and latency.

3. While autoregressive models generate text token by token, diffusion models refine a complete output iteratively, enabling parallel processing and faster generation.

4. Adapting diffusion for text requires new training objectives and noise mechanisms, differing from the Gaussian noise used in image generation.

5. Diffusion models are not inherently causal, allowing them to use bidirectional context, which is advantageous for tasks like code generation and infilling.

6. While architectural components like Transformers can be reused, diffusion LLMs require novel training losses and objectives, and potentially specialized inference engines.

7. Inception Labs post-trains its models with techniques similar to those used for autoregressive LLMs, including SFT and RLHF (with specialized algorithms such as a DPO variant for diffusion models), to align them with user preferences.

8. The primary advantage of diffusion LLMs currently lies in inference efficiency (speed and cost), making them ideal for latency-sensitive applications like real-time agents and coding assistants.

9. Although current diffusion LLMs may not match the absolute frontier intelligence of the very largest autoregressive models, they offer comparable or superior performance on speed-optimized benchmarks and are expected to scale.

10. The future of LLMs may shift toward diffusion models, driven by efficiency demands, especially under resource constraints such as power and data-center capacity.

THE ORIGIN AND EVOLUTION OF DIFFUSION MODELS

Stefano Ermon of Inception Labs traces the journey of diffusion models back to 2019, when his group began exploring image generation out of dissatisfaction with GANs. The core idea is an iterative refinement process rather than one-shot generation. The approach proved highly successful for continuous data like images and video, but extending it to discrete data such as text and code remained a challenge for years, requiring new mathematical frameworks and engineering work to reach competitive results.

BREAKING BARRIERS IN DISCRETE DATA GENERATION

A breakthrough came with the development of discrete diffusion models, demonstrated in research that earned a Best Paper Award at ICML. This work showed that diffusion models could compete with autoregressive models in language generation at smaller scales. Motivated by these findings, Inception Labs was founded to scale these research ideas, leading to the creation of 'Mercury,' their first commercial-scale diffusion language models.

MERCURY: DIFFUSION LLMS FOR CODING AND BEYOND

Mercury represents a new generation of LLMs parameterized by Transformers but trained to predict and refine multiple tokens in parallel. This parallel processing capability is the key to their speed advantage over traditional autoregressive models, which generate tokens sequentially. Mercury Coder was the first release, targeting coding applications, and has since been expanded to general text use cases like summarization and translation, offering substantial speed gains and novel capabilities.

THE MECHANICS OF TEXT AND CODE DIFFUSION

Unlike autoregressive models that generate left-to-right one token at a time, diffusion models start with a rough guess and iteratively refine it. This refinement process modifies multiple tokens in parallel, explaining their significantly higher throughput. The training process involves adding noise to text or code data and training a neural network to denoise it, effectively learning to correct mistakes or fill in missing parts. Noise can be introduced via masking or token flipping, analogous to Gaussian noise in image diffusion.
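The forward "noising" step for text can be sketched in a few lines. This toy example illustrates the general masking recipe described above, not Inception's actual implementation: each token is independently replaced by a `<mask>` symbol with probability tied to the diffusion timestep, and a denoiser would then be trained to recover the originals at the masked positions.

```python
import random

MASK = "<mask>"

def add_mask_noise(tokens, t, rng):
    """Forward 'noising' for absorbing-state discrete diffusion:
    each token is independently replaced by <mask> with probability t,
    where t in [0, 1] plays the role of the diffusion timestep.
    A denoiser is then trained to predict the original tokens at the
    masked positions (a cross-entropy loss in practice).
    """
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
tokens = "def add ( a , b ) : return a + b".split()
lightly_noised = add_mask_noise(tokens, 0.2, rng)  # late in the reverse process
heavily_noised = add_mask_noise(tokens, 0.9, rng)  # early in the reverse process
```

Running the reverse process then amounts to walking `t` back toward zero while the model fills in the masks.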

ARCHITECTURAL CONSIDERATIONS AND ADAPTATION

While diffusion models can leverage existing architectures like Transformers, they cannot simply be fine-tuned from pre-trained autoregressive models. The training objective—denoising versus next-token prediction—is fundamentally different, and diffusion models are not inherently causal, allowing them to look at context from both left and right. This non-causal nature is a strength for tasks requiring bidirectional understanding but complicates adaptation from causally masked models. However, architectural components and data types are largely compatible.
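The causality difference is easy to see in the attention mask itself. A minimal boolean sketch, where entry `[i][j]` means position `i` may attend to position `j`:

```python
def causal_mask(n):
    # Autoregressive decoding: position i attends only to positions j <= i,
    # so a prefix can never "see" tokens to its right.
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Diffusion denoising: every position attends to the full sequence,
    # so infilling can condition on both left and right context.
    return [[True] * n for _ in range(n)]
```

This is why a pre-trained causally masked model cannot simply be fine-tuned into a denoiser: its weights were never trained to use right-hand context.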

INNOVATIONS IN TRAINING AND ALIGNMENT

Adapting diffusion models for text and code requires innovations in training losses and objectives. Techniques like classifier guidance, familiar from image generation, transfer well. Post-training alignment methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), are also employed. Inception Labs has developed specialized algorithms, like a variant of DPO for diffusion language models, to effectively fine-tune models based on human preferences and proprietary customer data.
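For reference, the standard DPO objective that such alignment runs build on has the following shape; the diffusion-specific variant mentioned in the episode would presumably swap the exact sequence log-likelihoods for a diffusion-based estimate, and that substitution, like all names here, is an assumption for illustration.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.
    Inputs are sequence log-probabilities under the policy (pi_*) and a
    frozen reference model (ref_*); beta scales the implicit KL penalty.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written via log1p for clarity
    return math.log1p(math.exp(-beta * margin))

# The loss drops as the policy favors the chosen answer more than the
# reference does, and sits at log(2) when there is no preference signal.
```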

COMPUTATIONAL ADVANTAGES: SPEED AND EFFICIENCY

A major differentiator for diffusion models is their inference efficiency. They outperform autoregressive models in the throughput-latency trade-off, meaning they can achieve higher throughput for the same latency or lower latency for the same throughput. This translates directly to cost savings. This efficiency is particularly valuable for latency-sensitive applications like voice agents, real-time coding assistants, and UI development tools where user experience is paramount.

PERFORMANCE BENCHMARKS AND CAPABILITIES

Mercury Coder models have demonstrated state-of-the-art performance on the speed-quality frontier. The generalist Mercury models have achieved intelligence scores comparable to leading closed-source, speed-optimized autoregressive models like GPT-4.1, but at 5-10x faster speeds. While current models might not match the peak intelligence of the largest frontier models, they offer a compelling balance of speed, quality, and cost-effectiveness, with ongoing research aiming for even greater intelligence.

APPLICATIONS AND IDEAL USE CASES

While diffusion LLMs can theoretically handle any text-in, text-out task, their current strength lies in latency-sensitive applications. They are ideal for scenarios where existing models are too slow or require the use of smaller, less capable autoregressive models to meet speed requirements. Diffusion models can replace these with larger, higher-quality models that are still faster, significantly improving user experience and enabling new real-time applications. Inception Labs actively works with customers to fine-tune models for specific latency-constrained problems.

THE FUTURE TRAJECTORY OF LLM ARCHITECTURES

The Inception Labs team believes diffusion models have the potential to become the dominant architecture for LLMs, driven by the critical need for inference efficiency amidst growing demands and resource constraints (power, data centers). While the race for peak intelligence continues, the efficiency gains offered by diffusion models, coupled with ongoing R&D, position them as a strong contender to replace autoregressive models, especially as they become even more capable. The ability to iteratively correct errors also offers unique advantages.

INFERENCE ENGINE AND PRODUCTION DEPLOYMENT

Developing an efficient inference engine is crucial for diffusion LLMs due to their unique operational characteristics. Inception Labs has built its own proprietary inference engine that supports features like continuous batching and caching. Releasing open-source diffusion models is complex because the inference code itself is innovative and proprietary, unlike the more straightforward inference of autoregressive models. The company is actively hiring engineers with expertise in optimizing model serving and inference for continuous batching, quantization, and kernel development.
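Continuous batching can be illustrated with a toy scheduler (a simplification, not Inception's proprietary engine): each request needs some number of model iterations, and a finished request frees its batch slot for a waiting one immediately, rather than holding it until the whole batch finishes.

```python
from collections import deque

def continuous_batching(requests, max_batch):
    # Each request is (id, steps), where `steps` is the number of model
    # iterations it needs. Slots are refilled every iteration, so a new
    # request joins as soon as one finishes -- no waiting for the batch.
    queue = deque(requests)
    active, timeline = [], []
    while queue or active:
        while queue and len(active) < max_batch:
            rid, steps = queue.popleft()
            active.append([rid, steps])
        timeline.append([rid for rid, _ in active])  # requests sharing this step
        for req in active:
            req[1] -= 1
        active = [req for req in active if req[1] > 0]
    return timeline

# Request "c" slips into the slot "a" frees, without waiting for "b".
timeline = continuous_batching([("a", 2), ("b", 4), ("c", 1)], max_batch=2)
```

A real engine layers caching, quantization, and custom kernels on top of this scheduling idea.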

EXPLORING NEW FRONTIERS: BEYOND CHAT

While chat applications are well-served by existing autoregressive models, diffusion LLMs offer significant advantages for other use cases where sequential interaction is less critical. This includes tasks requiring strong controllability, reduced hallucinations, and novel forms of generation. Inception Labs is exploring new form factors and capabilities where diffusion models can provide a distinct edge, moving beyond direct competition with autoregressive models in chat and towards unique applications that leverage their inherent strengths.

Diffusion Language Model Best Practices

Practical takeaways from this episode

Do This

Leverage diffusion models for latency-sensitive applications to improve user experience.
Swap in a larger diffusion model where speed limits previously forced a smaller autoregressive one, gaining output quality while staying faster.
Explore diffusion models for use cases beyond basic chat, such as tasks demanding strong controllability or complex generation.
Utilize the iterative refinement and error correction capabilities inherent in diffusion models.

Avoid This

Do not solely focus on chat applications if seeking to leverage unique diffusion model strengths.
Do not expect current diffusion models to match the absolute frontier intelligence of the largest autoregressive models for all tasks.
Do not overlook the potential for diffusion models in applications where speed and cost-efficiency are critical factors.

Common Questions

What are diffusion language models?

Diffusion language models are generative models that, unlike autoregressive models which generate token by token, start with a rough guess and iteratively refine it. This approach lets them modify multiple tokens in parallel, yielding significant speed improvements.
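The refinement loop described here can be sketched as a toy decoder (illustrative only; `toy_predict` and its confidence scheme are made up): start from an all-masked sequence and, at each pass, commit the most confident predictions in parallel.

```python
def parallel_refine(length, predict, steps):
    """Fill a fully masked sequence of `length` tokens in `steps` parallel
    refinement passes, committing the most confident proposals each pass."""
    seq = [None] * length                 # None plays the role of <mask>
    per_step = -(-length // steps)        # ceil(length / steps)
    for _ in range(steps):
        # predict() proposes (position, token, confidence) for masked slots
        proposals = sorted(predict(seq), key=lambda p: -p[2])
        for pos, tok, _ in proposals[:per_step]:
            seq[pos] = tok
    return seq

# Toy "denoiser": always proposes the right token, more confident on the left.
target = "x = a + b".split()

def toy_predict(seq):
    return [(i, target[i], 1.0 / (i + 1)) for i, tok in enumerate(seq) if tok is None]

decoded = parallel_refine(len(target), toy_predict, steps=2)
# 5 tokens decoded in 2 passes, versus 5 sequential autoregressive steps
```

Setting `steps` equal to the sequence length recovers one-token-at-a-time decoding; fewer steps is where the speedup comes from.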
