How did Latent Diffusion and Stable Diffusion contribute to AI image generation?

Latent diffusion introduced a more efficient method for training generative models by using a compressed, perceptually equivalent representation of images. This led to Stable Diffusion, which made high-quality image generation accessible and widely adopted.

What was the prevailing AI dogma before Stable Diffusion's impact?

The dominant belief in the ML community was that language modeling was the primary path to intelligence, with language seen as the core interface for reasoning. Visual processing was considered secondary or a niche topic.

What is the difference between natural and unnatural representations in AI?

Natural representations (like video and audio) are derived from sources humans cannot control, while unnatural representations (like text) are inherently human-made and highly efficient for communication. Black Forest Labs believes starting with natural representations is key to developing higher intelligence.

How has the approach to generative visual models evolved?

Initially, models were unimodal (e.g., text-to-image), primarily for content creation. Now, the focus is on multimodal models trained on natural data (image, video, audio) to achieve a deeper understanding and enable applications like physical AI and robotics.

What was the strategy behind Black Forest Labs' first product, Flux One?

Facing resource constraints, BFL focused on improving a specific area within image generation, aiming to be 10x better than existing models, specifically addressing the issue of generating more accurate hands. This initial focus led to product-market fit and customer feedback.

How did customer feedback influence the development of Flux One Context?

By observing users trying to achieve character consistency with Flux One, BFL learned about the demand for image editing capabilities. This led to the development of Flux One Context, an editing model that significantly improved character consistency.

Why is open source important for Black Forest Labs' business model?

Open models allow for greater customization by end-users and businesses with diverse preferences. This approach creates value by enabling tailored solutions, which is key to BFL's strategy of providing advanced, customizable visual intelligence infrastructure.

What is Self Flow and how does it advance multimodal reasoning?

Self Flow is an algorithm published by Black Forest Labs that enables generative models to align multimodal representations. It solves the problem of combining representations from different data types (image, audio, video) to achieve a more comprehensive understanding, crucial for higher intelligence.

How does Black Forest Labs ensure data privacy and prevent misuse of its models?

They implement comprehensive content filters on their API and comply with EU regulations like the GDPR for data privacy. Users can request deletion of their personal data, and BFL maintains strict guardrails that apply universally to all partners to prevent misuse.

What are the challenges with data labeling for images and videos?

Unlike text, images and videos are complex to label. Black Forest Labs uses automated labeling for large-scale pre-training with noisy data and human labeling for crucial, high-quality annotations in later training stages to ensure alignment with desired outcomes.

How do diffusion models compare to autoregressive models in terms of efficiency?

Diffusion models are data-inefficient during training but offer efficient inference through techniques like distillation, allowing for steps to be reduced significantly. Autoregressive models are more data-efficient in training but less capable of similar inference speed-ups through distillation.

What is Laten Adversarial Distillation's role at Black Forest Labs?

Laten Adversarial Distillation is crucial for making models exceptionally efficient and commercially viable. It allows for the packaging of core technology into different versions (e.g., fast, high-quality) catering to both open-source developers and enterprise clients.

What is Black Forest Labs' stance on explicit 3D representations versus video-based approaches for spatial intelligence?

Black Forest Labs favors learning from natural representations like video and audio, mirroring human childhood development, over explicit 3D coordinate representations. They believe this approach is more flexible and generalizable for achieving higher forms of intelligence.

Key Moments

Stanford CS153 Frontier Systems | Andreas Blattmann from Black Forest Labs on Visual Intelligence

Stanford Online

Education7 min read62 min video

May 4, 2026|3,013 views|62|1

Stanford Stanford Online Artificial Intelligence AI

Save to Pod

Want to know something specific about what's covered?

We've already dissected every moment. Ask and we will deliver (with timestamps).

Key Moments

On this page

TL;DR

Generative AI models are evolving beyond just content creation to understand and interact with the physical world, but current approaches to 3D representation in AI may be less effective than learning from raw sensory data like humans do.

Key Insights

While early generative models like GANs struggled with 256x256 image generation, latent diffusion models, like Stable Diffusion, enabled training on perceptually equivalent but lower-dimensional representations, saving significant compute.

The prevailing AI dogma shifted from language modeling to recognizing the importance of multimodal natural representations (audio, video) for developing higher intelligence, mirroring early human learning through observation.

Black Forest Labs' Flux models have progressed from unimodal text-to-image generation to multimodal models capable of physical AI tasks, robotics, and world modeling, crucial for real-world applications.

Flux 1's success in image editing and character consistency, particularly through its open-weight release and customer feedback loop, demonstrated the power of adapting to user needs and unexpected use cases like LoRA training.

The company's "debate, disagree, then commit" culture, with only one person leaving in its history, has been a key factor in its sustained progress and ability to overcome challenges in the fast-paced AI field.

Future AI development will likely focus on multimodal reasoning, integrating diverse sensory inputs and action prediction, moving beyond passive observation to active interaction with the physical world, potentially rendering explicit 3D representations less critical than learning from continuous sensory streams.

The foundational leap: From GANs to latent diffusion

The journey into advanced visual intelligence began with significant limitations. In the past, generating even a 256x256 pixel image was a computational challenge, often relying on Generative Adversarial Networks (GANs) that had specific inductive biases for image data. Andy Blattmann, co-founder of Black Forest Labs, recounts his early research days in 2019 when computer vision was a niche within AI. Competing with giants like Google and OpenAI with far less compute, his team focused on developing more efficient algorithms. This led to the concept of latent generative modeling, essentially training a compression model akin to a learned JPEG encoder to find perceptually equivalent but lower-dimensional representations of images. By training generative models in this latent space, they achieved orders of magnitude less compute usage, leading to models that were more efficient and often superior to those trained on raw pixels. This algorithm, latent diffusion, was the direct precursor to Stable Diffusion, released in 2022, which surprised even its creators with its widespread adoption.

Challenging the language-centric AI dogma

At the time of Stable Diffusion's release, the AI community was heavily focused on language modeling, with a prevailing belief that language was the primary interface for intelligence and reasoning. However, those in the computer vision field, like Blattmann and Anjney Midha, recognized this as an incomplete picture. Humans learn through a rich interplay of sensory inputs – seeing, hearing, and interacting with the world from infancy. This observation led to the crucial insight that true intelligence requires understanding natural representations (like video and audio) rather than solely relying on human-made, compressed forms like text. Blattmann argues that starting with language and layering other modalities on top is the wrong approach; instead, intelligence should be built from first principles, starting with natural sensory observation, mirroring human development. This perspective emphasizes the fundamental role of visual and auditory intelligence as a bedrock for higher cognitive functions.

Evolving visual models: Beyond content creation to physical AI

Initially, visual models like Stable Diffusion were primarily designed for content creation, excelling at tasks like artistic style transfer or character consistency for marketing. These were unimodal, text-to-image models. The frontier is now moving towards multimodal models trained on natural representations (images, video, audio) to achieve far broader capabilities. Black Forest Labs' 'Flux' model family exemplifies this shift. Instead of training single-purpose models, they are developing unified, multimodal systems capable of physical AI, robotics, and world modeling. The integration of different natural representations allows models to learn crucial correlations; for instance, the sound of colliding objects (audio) provides context for the physical action (visual). This richer understanding, derived from cross-modal learning, enables models to grasp real-world phenomena more effectively, as demonstrated by demos in world modeling and simulation, alongside continued content creation advancements.

Bootstrapping growth: From an open-weight image model to customer feedback

Starting a frontier AI company requires strategic focus, especially with limited resources. Black Forest Labs leveraged their expertise in unimodal image generation to tackle a specific problem: the inability of existing models to consistently produce correct anatomical features, like five-fingered hands. This led to Flux 1, a next-generation image model designed to be significantly better than its predecessors. Crucially, they adopted an open-weight strategy, allowing for rapid feedback from the community. Users discovered unexpected applications, such as extensive use for LoRA training to achieve character consistency. This feedback loop was invaluable: it revealed a strong customer desire for image editing and control beyond text prompts. Responding to this, BFL developed Flux 1 Context, an image editing model that achieved character consistency at scale, supercharging creative applications and leading to significant commercial traction and partnerships, including with Meta.

The feedback loop and adapting to user needs

The development of Flux 1 Context highlighted the critical importance of closing the feedback loop with real-world usage. Initially, users struggled with consistent character representation in generated images. This wasn't just a technical limitation; it was a clear signal of user need and a gap in current capabilities. While some in the industry doubted AI could achieve such fine-grained control, BFL observed the data from their open-weight model. This observational data, combined with proactive prompt engineering and user feedback, led to the insight that developing a specialized image editing model was necessary. The team rapidly iterated, pivoting to focus on this capability, resulting in Flux 1 Context. This agility, driven by direct customer engagement and a methodical assessment of the landscape, allowed them to quickly adapt and deliver a highly sought-after feature, demonstrating how user-driven insights can accelerate product development and market success.

Open source as a strategy for customization and sustainability

Black Forest Labs' decision to release open-weight models, such as Flux 1, has been a strategic differentiator. While open models can be hard to monetize directly, they enable broad customization. This is particularly valuable because aesthetic preferences and desired outcomes often vary significantly across different user groups or enterprises. Open models allow partners like Meta to customize models for their specific user bases, catering to diverse cultural nuances and biases. Blattmann argues that this is a false trade-off between open and closed models; open models offer significant value where customization is key. This strategy not only fosters innovation within the community but also provides a clear path for commercial sustainability by addressing the demand for personalized AI solutions, thereby creating robust and adaptable infrastructure.

From interaction to physical AI: Closing the loop with real-world data

Moving beyond passive observation, the next frontier for AI involves interaction with the physical world. BFL's training pipeline now incorporates structured phases: intensive, noisy pre-training on natural representations (images, video, audio) using methods like Selflow; mid-training with additional context and action conditioning; and finally, post-training that involves actual interaction. By hooking models up to robots in the physical world, they can generate data through simulated or real-world actions, which is then fed back into training. This process is vital for developing higher forms of intelligence. Verification in this domain becomes inherently tied to physical constraints; a robot arm cannot perform impossible movements, thus naturally bounding the model's actions. This contrasts with the subjective verification of aesthetics in image generation, which often requires extensive human judgment and can be ambiguous. This interaction-driven approach is seen as key to building more capable and general AI systems.

The future: Multimodal reasoning and questioning explicit 3D representations

The current state-of-the-art in AI is increasingly focused on unified reasoning across modalities like text, image, and video, with techniques like Selflow enabling cross-modal transfer learning. Looking ahead, the discussion turns to spatial intelligence and the relevance of explicit 3D representations. Blattmann expresses a belief they are not necessary for AI, mirroring human learning which relies on continuous video and audio input rather than explicit coordinate systems. Midha, having previously founded a 3D mapping company that faced challenges, offers a slightly more nuanced view, suggesting that while explicit 3D representations can be useful (e.g., for robotics in GPS-denied environments), they are often narrow, inflexible, and static. He speculates that AI networks might implicitly learn 3D structure based on interaction and perception, making explicit representations less critical than integrating diverse, temporal sensory data. This debate underscores the ongoing exploration into how AI can best learn to understand and interact with the world, emphasizing observation and interaction over artificial constructs.

Mentioned in This Episode

●Software & Apps

●Companies

●People Referenced

Common Questions

Visual intelligence refers to a model's ability to understand and reason about visual information, analogous to how humans learn through seeing. It's considered critical for developing AI that can operate in real-world, mission-critical contexts, moving beyond basic content creation.

Topics

AI & Machine Learning Technology & Innovation Science & Mathematics Deep Learning Multimodal AI Computer Vision AI Research Generative Models Frontier AI Visual Intelligence

Mentioned in this video

Companies

Black Forest Labs

A frontier research company focused on visual intelligence, co-founded by Andreas Blattmann.

11 Labs

A company whose representative, Maddie, discussed the frontier of audio and speech intelligence.

Google

A major tech company whose research teams were competitors in the field of AI during Andreas Blattmann's early research.

OpenAI

A prominent AI research organization whose teams were competitors in the field during Andreas Blattmann's early research.

Ubiquity 6

A 3D mapping and computer vision company that the host previously worked at, focusing on 3D reconstruction.

Ask anything from this episode.

Save it, chat with it, and connect it to Claude or ChatGPT. Get cited answers from the actual content — and build your own knowledge base of every podcast and video you care about.

Get Started Free