
⚡️Mercury: Ultra-Fast Diffusion LLMs — Stefano Ermon, CEO of Inception Labs

Latent Space Podcast · Science & Technology
6 min read · 29 min video · Aug 4, 2025
TL;DR

Inception Labs introduces Mercury, diffusion-based LLMs that offer faster speeds and competitive quality, especially for coding and latency-sensitive tasks.

Key Insights

1. Diffusion models, inspired by image generation, are being adapted for discrete data like text and code, offering an alternative to autoregressive LLMs.

2. Mercury LLMs (Inception Labs) significantly improve speed and efficiency, outperforming autoregressive models in throughput and latency.

3. While autoregressive models generate text token by token, diffusion models refine a complete output iteratively, enabling parallel processing and faster generation.

4. Adapting diffusion for text requires new training objectives and noise mechanisms, differing from the Gaussian noise used in image generation.

5. Diffusion models are not inherently causal, allowing them to use bidirectional context, which is advantageous for tasks like code generation and infilling.

6. While architectural components like Transformers can be reused, diffusion LLMs require novel training losses and objectives, and potentially specialized inference engines.

7. Inception Labs post-trains its models with techniques similar to those used for autoregressive LLMs, including SFT and RLHF (with specialized algorithms such as a DPO variant for diffusion models), to align them with user preferences.

8. The primary advantage of diffusion LLMs currently lies in inference efficiency (speed and cost), making them ideal for latency-sensitive applications like real-time agents and coding assistants.

9. Although current diffusion LLMs may not match the absolute frontier intelligence of the very largest autoregressive models, they offer comparable or superior performance on speed-optimized benchmarks and are expected to scale.

10. The future of LLMs may shift toward diffusion models, driven by efficiency demands, especially under resource constraints such as power and data-center capacity.

THE ORIGIN AND EVOLUTION OF DIFFUSION MODELS

Stefano Ermon of Inception Labs traces the journey of diffusion models back to 2019, when his group began exploring image generation out of dissatisfaction with GANs. The core idea is an iterative refinement process rather than one-shot generation. The approach proved highly successful for continuous data like images and video, but extending it to discrete data such as text and code remained a challenge for years, requiring new mathematical frameworks and engineering work to reach competitive results.

BREAKING BARRIERS IN DISCRETE DATA GENERATION

A breakthrough came with the development of discrete diffusion models, demonstrated in research that earned a Best Paper Award at ICML. This work showed that diffusion models could compete with autoregressive models in language generation at smaller scales. Motivated by these findings, Inception Labs was founded to scale these research ideas, leading to the creation of 'Mercury,' their first commercial-scale diffusion language models.

MERCURY: DIFFUSION LLMS FOR CODING AND BEYOND

Mercury represents a new generation of LLMs parameterized by Transformers but trained to predict and refine multiple tokens in parallel. This parallel processing capability is the key to their speed advantage over traditional autoregressive models, which generate tokens sequentially. Mercury Coder was the first release, targeting coding applications, and has since been expanded to general text use cases like summarization and translation, offering substantial speed gains and novel capabilities.

THE MECHANICS OF TEXT AND CODE DIFFUSION

Unlike autoregressive models that generate left-to-right one token at a time, diffusion models start with a rough guess and iteratively refine it. This refinement process modifies multiple tokens in parallel, explaining their significantly higher throughput. The training process involves adding noise to text or code data and training a neural network to denoise it, effectively learning to correct mistakes or fill in missing parts. Noise can be introduced via masking or token flipping, analogous to Gaussian noise in image diffusion.
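The forward "noising" step for text can be sketched in a few lines. This toy example illustrates the general masking recipe described above, not Inception's actual implementation: each token is independently replaced by a `<mask>` symbol with probability tied to the diffusion timestep, and a denoiser would then be trained to recover the originals at the masked positions.

```python
import random

MASK = "<mask>"

def add_mask_noise(tokens, t, rng):
    """Forward 'noising' for absorbing-state discrete diffusion:
    each token is independently replaced by <mask> with probability t,
    where t in [0, 1] plays the role of the diffusion timestep.
    A denoiser is then trained to predict the original tokens at the
    masked positions (a cross-entropy loss in practice).
    """
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
tokens = "def add ( a , b ) : return a + b".split()
lightly_noised = add_mask_noise(tokens, 0.2, rng)  # late in the reverse process
heavily_noised = add_mask_noise(tokens, 0.9, rng)  # early in the reverse process
```

Running the reverse process then amounts to walking `t` back toward zero while the model fills in the masks.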

ARCHITECTURAL CONSIDERATIONS AND ADAPTATION

While diffusion models can leverage existing architectures like Transformers, they cannot simply be fine-tuned from pre-trained autoregressive models. The training objective—denoising versus next-token prediction—is fundamentally different, and diffusion models are not inherently causal, allowing them to look at context from both left and right. This non-causal nature is a strength for tasks requiring bidirectional understanding but complicates adaptation from causally masked models. However, architectural components and data types are largely compatible.
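The causality difference is easy to see in the attention mask itself. A minimal boolean sketch, where entry `[i][j]` means position `i` may attend to position `j`:

```python
def causal_mask(n):
    # Autoregressive decoding: position i attends only to positions j <= i,
    # so a prefix can never "see" tokens to its right.
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Diffusion denoising: every position attends to the full sequence,
    # so infilling can condition on both left and right context.
    return [[True] * n for _ in range(n)]
```

This is why a pre-trained causally masked model cannot simply be fine-tuned into a denoiser: its weights were never trained to use right-hand context.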

INNOVATIONS IN TRAINING AND ALIGNMENT

Adapting diffusion models for text and code requires innovations in training losses and objectives. Techniques like classifier guidance, familiar from image generation, transfer well. Post-training alignment methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), are also employed. Inception Labs has developed specialized algorithms, like a variant of DPO for diffusion language models, to effectively fine-tune models based on human preferences and proprietary customer data.
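For reference, the standard DPO objective that such alignment runs build on has the following shape; the diffusion-specific variant mentioned in the episode would presumably swap the exact sequence log-likelihoods for a diffusion-based estimate, and that substitution, like all names here, is an assumption for illustration.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.
    Inputs are sequence log-probabilities under the policy (pi_*) and a
    frozen reference model (ref_*); beta scales the implicit KL penalty.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written via log1p for clarity
    return math.log1p(math.exp(-beta * margin))

# The loss drops as the policy favors the chosen answer more than the
# reference does, and sits at log(2) when there is no preference signal.
```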

COMPUTATIONAL ADVANTAGES: SPEED AND EFFICIENCY

A major differentiator for diffusion models is their inference efficiency. They outperform autoregressive models in the throughput-latency trade-off, meaning they can achieve higher throughput for the same latency or lower latency for the same throughput. This translates directly to cost savings. This efficiency is particularly valuable for latency-sensitive applications like voice agents, real-time coding assistants, and UI development tools where user experience is paramount.

PERFORMANCE BENCHMARKS AND CAPABILITIES

Mercury Coder models have demonstrated state-of-the-art performance on the speed-quality frontier. The generalist Mercury models have achieved intelligence scores comparable to leading closed-source, speed-optimized autoregressive models like GPT-4.1, but at 5-10x faster speeds. While current models might not match the peak intelligence of the largest frontier models, they offer a compelling balance of speed, quality, and cost-effectiveness, with ongoing research aiming for even greater intelligence.

APPLICATIONS AND IDEAL USE CASES

While diffusion LLMs can theoretically handle any text-in, text-out task, their current strength lies in latency-sensitive applications. They are ideal for scenarios where existing models are too slow or require the use of smaller, less capable autoregressive models to meet speed requirements. Diffusion models can replace these with larger, higher-quality models that are still faster, significantly improving user experience and enabling new real-time applications. Inception Labs actively works with customers to fine-tune models for specific latency-constrained problems.

THE FUTURE TRAJECTORY OF LLM ARCHITECTURES

The Inception Labs team believes diffusion models have the potential to become the dominant architecture for LLMs, driven by the critical need for inference efficiency amidst growing demands and resource constraints (power, data centers). While the race for peak intelligence continues, the efficiency gains offered by diffusion models, coupled with ongoing R&D, position them as a strong contender to replace autoregressive models, especially as they become even more capable. The ability to iteratively correct errors also offers unique advantages.

INFERENCE ENGINE AND PRODUCTION DEPLOYMENT

Developing an efficient inference engine is crucial for diffusion LLMs due to their unique operational characteristics. Inception Labs has built its own proprietary inference engine that supports features like continuous batching and caching. Releasing open-source diffusion models is complex because the inference code itself is innovative and proprietary, unlike the more straightforward inference of autoregressive models. The company is actively hiring engineers with expertise in optimizing model serving and inference for continuous batching, quantization, and kernel development.
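Continuous batching can be illustrated with a toy scheduler (a simplification, not Inception's proprietary engine): each request needs some number of model iterations, and a finished request frees its batch slot for a waiting one immediately, rather than holding it until the whole batch finishes.

```python
from collections import deque

def continuous_batching(requests, max_batch):
    # Each request is (id, steps), where `steps` is the number of model
    # iterations it needs. Slots are refilled every iteration, so a new
    # request joins as soon as one finishes -- no waiting for the batch.
    queue = deque(requests)
    active, timeline = [], []
    while queue or active:
        while queue and len(active) < max_batch:
            rid, steps = queue.popleft()
            active.append([rid, steps])
        timeline.append([rid for rid, _ in active])  # requests sharing this step
        for req in active:
            req[1] -= 1
        active = [req for req in active if req[1] > 0]
    return timeline

# Request "c" slips into the slot "a" frees, without waiting for "b".
timeline = continuous_batching([("a", 2), ("b", 4), ("c", 1)], max_batch=2)
```

A real engine layers caching, quantization, and custom kernels on top of this scheduling idea.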

EXPLORING NEW FRONTIERS: BEYOND CHAT

While chat applications are well-served by existing autoregressive models, diffusion LLMs offer significant advantages for other use cases where sequential interaction is less critical. This includes tasks requiring strong controllability, reduced hallucinations, and novel forms of generation. Inception Labs is exploring new form factors and capabilities where diffusion models can provide a distinct edge, moving beyond direct competition with autoregressive models in chat and towards unique applications that leverage their inherent strengths.

Diffusion Language Model Best Practices

Practical takeaways from this episode

Do This

Leverage diffusion models for latency-sensitive applications to improve user experience.
Swap in a larger diffusion model where speed limits previously forced a smaller autoregressive one, gaining output quality while staying faster.
Explore diffusion models for use cases beyond basic chat, such as tasks demanding strong controllability or complex generation.
Utilize the iterative refinement and error correction capabilities inherent in diffusion models.

Avoid This

Do not solely focus on chat applications if seeking to leverage unique diffusion model strengths.
Do not expect current diffusion models to match the absolute frontier intelligence of the largest autoregressive models for all tasks.
Do not overlook the potential for diffusion models in applications where speed and cost-efficiency are critical factors.

Common Questions

What are diffusion language models?

Diffusion language models are generative models that, unlike autoregressive models which generate token by token, start with a rough guess and iteratively refine it. This approach lets them modify multiple tokens in parallel, yielding significant speed improvements.
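The refinement loop described here can be sketched as a toy decoder (illustrative only; `toy_predict` and its confidence scheme are made up): start from an all-masked sequence and, at each pass, commit the most confident predictions in parallel.

```python
def parallel_refine(length, predict, steps):
    """Fill a fully masked sequence of `length` tokens in `steps` parallel
    refinement passes, committing the most confident proposals each pass."""
    seq = [None] * length                 # None plays the role of <mask>
    per_step = -(-length // steps)        # ceil(length / steps)
    for _ in range(steps):
        # predict() proposes (position, token, confidence) for masked slots
        proposals = sorted(predict(seq), key=lambda p: -p[2])
        for pos, tok, _ in proposals[:per_step]:
            seq[pos] = tok
    return seq

# Toy "denoiser": always proposes the right token, more confident on the left.
target = "x = a + b".split()

def toy_predict(seq):
    return [(i, target[i], 1.0 / (i + 1)) for i, tok in enumerate(seq) if tok is None]

decoded = parallel_refine(len(target), toy_predict, steps=2)
# 5 tokens decoded in 2 passes, versus 5 sequential autoregressive steps
```

Setting `steps` equal to the sequence length recovers one-token-at-a-time decoding; fewer steps is where the speedup comes from.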
