Generating 3D Models with Diffusion - Computerphile
Key Moments
2D diffusion fuels 3D via score distillation; multi-view cues and challenges
Key Insights
3D diffusion lags behind 2D due to smaller datasets and the need for consistent quality across all viewing angles.
DreamFusion demonstrated that 2D diffusion can guide 3D model generation by projecting 2D priors into 3D space.
Score Distillation Sampling (SDS) optimizes a 3D scene by iteratively rendering from multiple angles and refining with diffusion guidance.
Gaussian splatting and NeRF-based view synthesis are common 3D representations used in diffusion-guided reconstruction.
The multi-face Janus problem reveals a bias where front-facing views dominate, causing inconsistent multi-angle outputs.
Solutions include multi-angle prompts (front and back) and interpolating across angles to improve cross-view coherence.
A carefully scheduled noise ramp helps the model first learn rough structure, then refine textures and details.
Training can produce surprising failures (e.g., extra limbs or unexpected reflections) that illustrate current model limits.
Progress is moving toward multi-view latent diffusion models, which aim to stabilize and improve cross-view consistency.
Overall, the field is early but advancing toward practical 3D content generation from 2D diffusion foundations.
INTRODUCTION TO 3D DIFFUSION CHALLENGES
3D generative AI is slower to mature than 2D diffusion because geometry and data present extra difficulties. In 2D, diffusion models learn to combine concepts from billions of image-caption examples, so a prompt like "a frog on stilts" can emerge by blending learned ideas that rarely appear together in a single image. In 3D, however, training datasets are far smaller, and a model must render a coherent scene from every viewpoint while maintaining spatial realism and lighting consistency. Those requirements create bigger data gaps and greater optimization complexity than in 2D.
WHY 2D DIFFUSION IS EASIER THAN 3D
2D diffusion works from a prompt by gradually denoising a completely random image until it resembles the requested concept. The model can blend separate ideas—like a frog and stilts—because its training teaches how to map prompts to image features accumulated over billions of examples. In 3D, the same process would have to hold up across many viewing angles and lighting conditions, requiring the model to represent and optimize a full volume or surface. That extra dimensionality is the core reason 3D diffusion lags behind 2D.
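The denoising loop described above can be illustrated with a toy sketch. Everything here is a stand-in: the "denoiser" simply predicts the difference from a hand-picked target, whereas a real diffusion model learns noise prediction from billions of image-caption pairs, and the target values and step size are invented for illustration.

```python
import random

def toy_denoiser(x, target):
    # Stand-in for a trained diffusion model: "predicts" the noise
    # separating the current sample from the prompt's target appearance.
    return [xi - ti for xi, ti in zip(x, target)]

def sample_2d(target, steps=50, seed=0):
    rng = random.Random(seed)
    # Start from pure noise, as 2D diffusion sampling does.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(steps):
        noise_estimate = toy_denoiser(x, target)
        # Remove a fraction of the estimated noise at each step.
        x = [xi - 0.2 * ni for xi, ni in zip(x, noise_estimate)]
    return x

target = [0.1, 0.5, 0.9, 0.3]   # hypothetical "pixel values" for the prompt
result = sample_2d(target)
```

Each iteration removes part of the estimated noise, so the sample drifts from randomness toward the prompted appearance, which is the essence of the sampling process.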
DATA SET SIZE LIMITATIONS FOR 3D
The presenter notes that the largest 3D datasets are in the millions, whereas 2D diffusion benefits from billions of image-caption pairs. This gap means fewer real-world examples exist for specific 3D concepts (like a frog on stilts) across all angles, poses, textures, and lighting. The consequence is weaker priors, slower learning, and more room for artifacts when trying to synthesize novel 3D objects from text prompts.
DREAMFUSION: BRIDGING 2D DIFFUSION TO 3D
DreamFusion, released in 2022, was a watershed because it used 2D diffusion to generate 3D content. The idea was to project a 2D diffusion-driven concept into a 3D representation, treating 3D reconstruction as an optimization guided by 2D priors. This allowed abstract 3D concepts, like a frog on stilts, to emerge from 2D diffusion signals. It marked the first major demonstration that 2D diffusion can meaningfully inform and drive 3D model generation.
FROM 2D PROMPTS TO 3D VIEWS: THE SCORE DISTILLATION APPROACH
The core mechanism is score distillation sampling (SDS). Start with a blank 3D scene and render a 2D image from a chosen camera angle. Add noise to that image, then ask a 2D diffusion model, conditioned on the prompt, to predict the noise present in it. The difference between the predicted and injected noise indicates how the render should change to look more like the prompt, and that signal is used to update the 3D scene. Repeating this across many camera angles distills the 2D model's knowledge into a coherent 3D object.
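The loop above can be sketched in miniature. Everything here is a toy stand-in: the "scene" is four numbers (one per camera angle), the "renderer" reads one of them, and the "diffusion model" predicts noise relative to a hand-picked target appearance — but the update rule mirrors the SDS pattern of subtracting injected noise from predicted noise.

```python
import random

def render(scene, angle):
    # Toy renderer: each camera angle sees one slice of the scene.
    # Real systems render a Gaussian-splat or NeRF scene to a full image.
    return scene[angle]

def sds_step(scene, angle, target_views, sigma, lr, rng):
    image = render(scene, angle)
    noise = rng.gauss(0.0, sigma)
    noisy = image + noise
    # A real 2D diffusion model would predict the noise conditioned on the
    # prompt; this toy predicts everything separating the noisy render
    # from the target appearance for this angle.
    predicted_noise = noisy - target_views[angle]
    # SDS-style update: move scene parameters along (predicted - injected).
    scene[angle] -= lr * (predicted_noise - noise)
    return scene

rng = random.Random(0)
target_views = [0.2, 0.8, 0.5, 0.1]   # hypothetical per-angle appearance
scene = [0.0] * 4                      # blank 3D scene
for step in range(400):
    angle = rng.randrange(4)           # sample a random camera angle
    scene = sds_step(scene, angle, target_views, sigma=0.5, lr=0.1, rng=rng)
```

Note how the injected noise cancels in the update, leaving a gradient-like signal that pulls each view toward the diffusion-derived appearance over many iterations.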
GAUSSIAN SPLATTING VS NEURAL RENDERERS
For 3D representation, Gaussian splatting is common, though NeRF-based approaches exist as well. In Gaussian splatting, the 3D scene is built from a collection of simple colored Gaussians (or splats) that combine to form surfaces and volumes. In NeRF-style methods, a neural network models color and density in space. Both approaches can be guided by 2D diffusion through SDS, effectively turning 2D priors into a 3D reconstruction.
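A minimal sketch of the splatting idea, reduced to one dimension: each Gaussian contributes its colour with a distance-based weight, and the weighted blend forms the rendered value. The splat parameters here are invented, and real Gaussian splatting uses 3D anisotropic Gaussians with depth-ordered alpha compositing rather than this simple normalized blend.

```python
import math

def splat_value(x, splats):
    # Evaluate a 1-D "image" coordinate by blending Gaussian splats.
    # Each splat contributes its colour weighted by a Gaussian falloff.
    num, den = 0.0, 0.0
    for mean, sigma, colour, opacity in splats:
        w = opacity * math.exp(-0.5 * ((x - mean) / sigma) ** 2)
        num += w * colour
        den += w
    return num / den if den > 0 else 0.0

# Two hypothetical splats forming a simple blended "surface":
# (mean, sigma, colour, opacity)
splats = [
    (0.0, 1.0, 1.0, 0.8),
    (3.0, 0.5, 0.2, 0.9),
]
```

Because splats are simple parametric primitives, their positions, sizes, colours, and opacities are exactly the kind of parameters an SDS loop can optimize.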
MULTI-VIEW RECONSTRUCTION: LEARNING FROM MULTIPLE ANGLES
A major challenge is ensuring consistency across views. If you optimize from a single camera angle repeatedly, you can get a great look from that angle but a poor or distorted appearance elsewhere. The solution is to sample many camera positions and refine the 3D scene holistically so that updates from one view align with others, creating a stable, plausible 3D object rather than angle-specific artifacts.
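Sampling many camera poses can be sketched as drawing random azimuth/elevation pairs on a sphere around the object; the angle ranges and radius below are illustrative choices, not values from the episode.

```python
import math
import random

def sample_camera(rng, radius=2.5):
    # Sample a random camera position on a sphere around the object,
    # so optimization updates come from many viewpoints rather than one.
    # Angle ranges are illustrative assumptions.
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    elevation = rng.uniform(math.radians(-10), math.radians(60))
    x = radius * math.cos(elevation) * math.cos(azimuth)
    y = radius * math.cos(elevation) * math.sin(azimuth)
    z = radius * math.sin(elevation)
    return (x, y, z)

rng = random.Random(0)
cameras = [sample_camera(rng) for _ in range(100)]
```

Each optimization step would then render from one of these poses, so no single angle dominates the updates.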
THE JANUS PROBLEM: MULTI-FACE CHALLENGE AND ITS SYMPTOMS
A phenomenon called the multi-face Janus problem arises because diffusion models are biased toward front-facing, dominant views. When synthesizing from multiple angles, the model can produce conflicting features, like a frog with two heads or misinterpreted shapes, since each image pushes the model to assume a primary, front-facing geometry. This content drift across views highlights how view dominance can derail multi-angle consistency.
SOLVING FRONT/BACK CONSISTENCY AND INTERPOLATION
A mitigation strategy is to condition the process on both front and back prompts. By encoding front-facing and back-facing descriptions as CLIP embeddings and interpolating between them for intermediate angles, the model learns a smoother transition across viewpoints. This reduces the risk of abrupt changes or dual-headed artifacts and improves the coherence of the generated 3D object as it is rotated.
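One simple way to realize this interpolation is a cosine-weighted blend of the two prompt embeddings by azimuth. The three-dimensional vectors below stand in for real CLIP text embeddings (which have hundreds of dimensions), and the exact weighting scheme is an assumption for illustration.

```python
import math

def lerp_embedding(front, back, azimuth_deg):
    # Blend front- and back-prompt embeddings by viewing angle.
    # `front`/`back` stand in for CLIP embeddings of prompts like
    # "a frog on stilts, front view" / "..., back view".
    # Weight is 1.0 at 0 deg (front), 0.0 at 180 deg (back), smooth between.
    w = 0.5 * (1.0 + math.cos(math.radians(azimuth_deg)))
    return [w * f + (1.0 - w) * b for f, b in zip(front, back)]

front = [1.0, 0.0, 0.5]   # hypothetical embedding vectors
back = [0.0, 1.0, 0.5]
side = lerp_embedding(front, back, 90.0)
```

The conditioning thus changes gradually as the camera orbits, rather than flipping abruptly between two incompatible "front" interpretations.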
TRAINING DYNAMICS: NOISE SCHEDULE AND CONVERGENCE
An important empirical insight is that the amount of noise should not be kept constant. Starting with high noise encourages the model to discover rough structure and large-scale geometry (eyes, limbs, basic silhouette). Gradually reducing noise as more camera angles are processed helps refine textures and details. A fixed noise level often prevents convergence and leads to misaligned details, whereas a graduated schedule supports a more accurate final render.
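A graduated schedule can be as simple as a linear ramp from a high noise level down to a low one; the endpoints and linear shape here are illustrative assumptions, since real schedules vary.

```python
def noise_level(step, total_steps, sigma_max=1.0, sigma_min=0.02):
    # Ramp noise from high (discover rough structure and silhouette)
    # to low (refine textures and fine detail) over the optimization.
    # Linear shape and endpoint values are illustrative choices.
    t = step / max(total_steps - 1, 1)
    return sigma_max + t * (sigma_min - sigma_max)

schedule = [noise_level(s, 100) for s in range(100)]
```

Early high-noise steps let large-scale geometry emerge; later low-noise steps make only small corrections, which is what allows details to settle instead of being repeatedly overwritten.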
SUCCESS STORIES AND COMMON FAILURES DURING TRAINING
Examples show a frog on stilts evolving from chaotic initial renders to recognizable silhouettes, but failures are instructive. Some runs yield frogs with extra legs, antenna-like protrusions, or misinterpreted reflections mistaken for geometry. Such artifacts reveal the boundaries of current models and training regimes, including how diffuse priors, multi-view alignment, and perspective cues interact. Observing both successes and failures helps researchers understand where improvements are most needed.
WHERE THE FIELD IS NOW AND FUTURE DIRECTIONS
Progress continues with newer approaches like multi-view latent diffusion models that aim to stabilize cross-view consistency and reduce drift. The core idea is to integrate latent representations that better capture multi-view geometry and lighting, making 3D generation from 2D guidance more robust. While still early in practice, the trajectory points toward more reliable, scalable 3D content generation from diffusion-based systems, with ongoing work on prompting, conditioning, and architectural improvements.
Common Questions
What was DreamFusion, and why was it significant?
DreamFusion was the first model to use 2D diffusion to generate 3D models by projecting 2D prompts into a 3D scene. It popularized the idea of using 2D diffusion outputs to supervise and refine 3D reconstructions via optimization across multiple viewpoints. This opened the door to leveraging strong 2D diffusion priors for 3D content creation.
Topics Mentioned in This Video
Neural Radiance Fields (NeRF): a 3D representation technique used in conjunction with diffusion-based methods.
A 3D generation model mentioned as a good example to explore 3D data-driven generation.
A score distillation sampling model that uses Gaussian primitives to build 3D objects.
CLIP: a text-image embedding used to guide prompt conditioning for different 3D viewpoints.
DreamFusion: the first model to use 2D diffusion to generate 3D models by projecting 2D diffusion outputs into a 3D scene.
Multi-view latent diffusion models: advanced models that address multi-view consistency in 3D diffusion, improving alignment across angles.