Inference, Diffusion, World Models, and More | YC Paper Club

Y Combinator
Science & Technology | 5 min read | 68 min video
May 28, 2026|9,056 views|496|19

TL;DR

AI inference speed is becoming a capability, not just a cost, with speculative decoding and world models promising faster, more intelligent agents.

Key Insights

1. Inference is becoming a capability lever, especially as reinforcement learning (a wrapper on inference) starts to exceed pre-training compute.

2. Speculative Speculative Decoding (SSD), a variant of speculative decoding, parallelizes drafting and verification. It gains speed by hiding drafting latency and predicting verification outcomes.

3. Diffusion Model Predictive Control (DMPC) uses diffusion models for multi-step action proposals and dynamics, outperforming existing methods and allowing runtime adaptation to novel rewards and dynamics.

4. World models aim to learn world dynamics, enabling imagined outcomes, model-based control, and, crucially, surprise quantification. LeWorldModel offers a potential solution to representation collapse via its SIGReg regularizer.

5. Deep learning's generalization is not entirely a mystery; classical theories like PAC-Bayes can explain phenomena like overparameterization and benign overfitting by considering compressibility and soft inductive biases.

6. When data is constrained but compute is abundant, aggressive regularization, ensembling, and distillation, rather than scaling model size alone, offer data efficiency gains of up to 5x.

Inference as a capability: Beyond cost and convenience

The presentation opens by reframing AI inference as a crucial capability lever rather than merely a cost or convenience factor. Several trends drive this shift. First, inference costs already dominate training costs, especially when serving large-scale models to billions of users or generating trillions of tokens. Second, within training itself, reinforcement learning (which is essentially a form of inference) is beginning to demand more compute than traditional pre-training. The presenter argues that inference speed will soon translate directly into peak delivered intelligence, particularly for methods whose performance scales with computational 'thinking' time. This capability-centric view motivates research into making inference significantly faster and more powerful.

Accelerating inference with speculative decoding

Vanilla speculative decoding speeds up inference by using a small, fast 'draft' model to generate several token guesses, which are then verified by a larger, more accurate 'target' model in a single forward pass. The key insight is that verification is easier than generation. However, a bottleneck arises from the sequential dependency between drafting and verification: each model sits idle while the other works. The proposed Speculative Speculative Decoding (SSD) tackles this by running the two operations in parallel. While the verifier is still processing the current batch, the draft model anticipates likely verification outcomes and drafts ahead for them, which hides the drafting latency. It uses information from the draft model's token distributions to predict those outcomes. Because verification is the slower step, the draft model also has time to draft more tokens, which improves overall throughput and latency. Reported results showed SSD outperforming other inference engines, reaching 300 tokens per second for Llama 3 70B on four H100s.
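The draft-then-verify loop of vanilla speculative decoding can be sketched as follows. This is a minimal illustration with greedy decoding, not SSD and not any engine's actual implementation. The names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are toy functions that map a token prefix to the next token.

```python
from typing import Callable, List

# Hypothetical stand-in for a language model: maps a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the prefix the target agrees with."""
    out = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(out) < goal:
        # 1) Draft k tokens sequentially with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) Verify. A real engine scores all drafted positions in one batched
        #    forward pass of the target; this sketch loops for clarity.
        accepted: List[int] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            accepted.append(expected)  # either the confirmed guess or the correction
            if expected != proposal[i]:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every guess was accepted, so the target also yields one bonus token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[:goal]


def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + len(prefix)) % 11


def toy_draft(prefix: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 4.
    guess = toy_target(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 11
```

With greedy verification the output is identical to decoding with the target alone; the draft model only changes how many target passes are needed. Sampling-based variants use a probabilistic acceptance rule instead of the exact-match check shown here.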

World models for robotics and understanding dynamics

The discussion then shifts to 'world models,' which are systems designed to learn the dynamics of the world. These models predict how a system will change over time based on its current state and executed actions. This capability allows for generating imagined future outcomes, enabling model-based control, and quantifying surprise or uncertainty. Stannis presented Diffusion Model Predictive Control (DMPC), which uses diffusion models to learn both multi-step action proposals and dynamics models. This approach aims to reduce compounding errors in predictions and simplify the planning algorithm, allowing a simple sampler to outperform existing methods. A key advantage of DMPC is its ability to adapt to novel reward functions and dynamics at runtime, a significant benefit for real-world robotics applications where environmental conditions can change unpredictably.
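The planning loop described above (sample multi-step action proposals, roll them through a dynamics model, score them, act on the best) can be sketched in a few lines. This is a toy illustration under loud assumptions: `propose_actions` and `dynamics` are hand-written stand-ins for what DMPC learns with diffusion models, the 1-D environment is invented, and the names `plan` and `run_episode` are hypothetical.

```python
import random
from typing import Callable, List


def propose_actions(rng: random.Random, horizon: int) -> List[float]:
    """Stand-in for a learned multi-step action proposal (assumption: uniform noise)."""
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]


def dynamics(state: float, action: float) -> float:
    """Stand-in for a learned dynamics model (assumption: 1-D integrator)."""
    return state + action


def plan(state: float, reward: Callable[[float], float], rng: random.Random,
         horizon: int = 5, num_samples: int = 128) -> List[float]:
    """Sample action sequences, score imagined rollouts, return the best sequence."""
    best_seq: List[float] = []
    best_return = float("-inf")
    for _ in range(num_samples):
        seq = propose_actions(rng, horizon)
        s, total = state, 0.0
        for a in seq:
            s = dynamics(s, a)   # imagined next state
            total += reward(s)   # the reward is supplied at planning time
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq


def run_episode(start: float, reward: Callable[[float], float],
                steps: int = 20, seed: int = 0) -> float:
    """Receding-horizon control: replan every step, execute only the first action."""
    rng = random.Random(seed)
    state = start
    for _ in range(steps):
        state = dynamics(state, plan(state, reward, rng)[0])
    return state
```

Because the reward is only consulted inside `plan`, swapping in a different reward function at runtime changes the behavior without retraining the proposal or dynamics stand-ins. For example, `run_episode(0.0, lambda s: -abs(s - 3.0))` steers toward 3, while `run_episode(0.0, lambda s: -abs(s + 2.0))` steers toward -2.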

LeWorldModel and the quest for healthy latent spaces

Following the DMPC discussion, LeWorldModel was presented, focusing on the challenge of learning robust world models without representational collapse. Traditional world models can suffer from optimization landscapes where trivial solutions, such as predicting the same state regardless of action, dominate. To combat this, LeWorldModel, inspired by Yann LeCun's JEPA architecture, introduces the SIGReg regularizer. This regularizer encourages the latent embeddings of predicted states, taken across a batch of data, to follow an isotropic Gaussian distribution (equal variance in every direction). By keeping the latent distribution 'healthy', the model avoids collapse and retains its ability to predict meaningful dynamics. The regularization is compact, requiring only one hyperparameter and one loss term, and the approach allows faster inference (a 50x speedup over competitors on toy tasks) as well as quantification of model error through detected spikes in prediction uncertainty.
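To make the idea of pushing a batch of embeddings toward an isotropic Gaussian concrete, here is a simplified sketch. It is not the regularizer from the paper, whose exact form this summary does not give. As a stand-in it projects the batch onto random unit directions and penalizes each 1-D projection for deviating from zero mean and unit variance. The function names and the moment-matching penalty are assumptions made for illustration.

```python
import math
import random
from typing import List


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def isotropy_penalty(embeddings: List[List[float]], num_directions: int = 16,
                     seed: int = 0) -> float:
    """Average over random directions of mean^2 + (variance - 1)^2 of the projections.

    Simplified moment-matching stand-in (assumption), not the paper's regularizer.
    """
    rng = random.Random(seed)
    n, dim = len(embeddings), len(embeddings[0])
    total = 0.0
    for _ in range(num_directions):
        u = random_unit_vector(dim, rng)
        proj = [sum(e_i * u_i for e_i, u_i in zip(e, u)) for e in embeddings]
        mean = sum(proj) / n
        var = sum((p - mean) ** 2 for p in proj) / n
        total += mean ** 2 + (var - 1.0) ** 2
    return total / num_directions


def make_healthy_batch(n: int = 512, dim: int = 8, seed: int = 1) -> List[List[float]]:
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def make_collapsed_batch(n: int = 512, dim: int = 8) -> List[List[float]]:
    # Every input maps to the same embedding: the trivial solution described above.
    return [[0.5] * dim for _ in range(n)]
```

A collapsed batch has zero variance along every direction, so its penalty is at least about 1, while a batch drawn from a standard Gaussian scores close to 0. In training, such a term would be added to the prediction loss with a single weight, which is consistent with the one-hyperparameter description above.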

Demystifying deep learning generalization

The presentation then addressed theories attempting to explain deep learning's success, particularly generalization. The paper "Deep Learning is Not So Mysterious or Different" by Andrew Gordon Wilson argues that classical theories of generalization, such as PAC-Bayes, can illuminate contemporary phenomena like overparameterization and benign overfitting. Overparameterization, where models with more parameters than data points often generalize better, is explained by considering both improved empirical risk (lower training loss) and increased model compressibility, as larger models can find more efficient encodings of training data. Benign overfitting, the ability of models to fit random noise while still generalizing on structured data, is explained through the lens of 'soft inductive biases'—expressive models that, due to regularization, are biased towards simpler, generalizable solutions. The key takeaway is that understanding and potentially optimizing these inductive biases could lead to significant gains in AI sample efficiency.
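For reference, one standard form of the kind of bound this argument relies on is sketched below. It is a textbook countable-hypothesis (Occam-style) bound, shown as an illustration rather than as the exact statement in the paper. The symbols are assumptions of this sketch: $R(h)$ is the expected risk, $\hat{R}(h)$ the empirical risk on $n$ i.i.d. samples, $\Delta$ the range of the bounded loss, $P(h)$ a prior over hypotheses fixed before seeing data, and $K(h)$ the description length of $h$ in bits.

```latex
% With probability at least 1 - \delta over the draw of the n training samples,
% simultaneously for every hypothesis h in a countable class:
R(h) \;\le\; \hat{R}(h) \;+\; \Delta \sqrt{\frac{\log\frac{1}{P(h)} + \log\frac{1}{\delta}}{2n}}

% Choosing a prior that favors compressible hypotheses, P(h) = 2^{-K(h)}, gives
\log\frac{1}{P(h)} \;=\; K(h)\,\log 2
```

Read this way, a larger model can still have a small bound if it reaches low training loss with a solution that compresses well, which is the compressibility argument summarized above.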

Data-constrained pre-training: Rethinking scaling laws

The final paper explored AI development in a regime where compute is abundant but data is scarce. Current pre-training methods, optimized for compute efficiency, are approaching data limits because internet data grows much more slowly than compute. The research proposed that when data is constrained, techniques such as aggressive regularization, ensembling, and older methods like distillation become paramount. Experiments demonstrated that applying these techniques, particularly a 'joint scaling recipe' combining regularization and ensembling, could match the performance that standard recipes reach only with much larger datasets. For instance, it showed up to a 5x data efficiency win over standard recipes. Furthermore, distillation allowed these data-efficient models to be compressed into smaller, inference-friendly versions without significant performance loss. This suggests a paradigm shift: in data-limited scenarios, focusing on algorithmic efficiency through classical ML techniques can unlock substantial performance gains.
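The two mechanisms named above, ensembling and distillation, can be sketched for a single next-token prediction. This is a minimal illustration, not the paper's recipe. Averaging member probabilities (rather than logits) is an assumption, and the function names `ensemble_probs` and `distillation_loss` are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def ensemble_probs(member_logits: List[List[float]]) -> List[float]:
    """Ensemble prediction: average the members' probability distributions."""
    probs = [softmax(logits) for logits in member_logits]
    k = len(probs)
    return [sum(p[i] for p in probs) / k for i in range(len(probs[0]))]


def distillation_loss(teacher_probs: List[float], student_logits: List[float]) -> float:
    """Cross-entropy of the student against the teacher's soft targets."""
    student = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student))
```

Minimizing `distillation_loss` over the student's parameters pulls a single small model toward the ensemble's averaged prediction, so inference needs only one forward pass instead of one per ensemble member.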

Common Questions

The YC Paper Club aims to create a community of founders and researchers to foster collaboration and discussion around AI papers. It also serves to invigorate the Pioneer space at YC.

Topics

Mentioned in this video

Concepts

Speculative Decoding

An inference technique that aims to speed up token generation by using a smaller model to predict tokens and a larger model to verify them.

Inference

The process of using a trained model to generate outputs, discussed in terms of cost, convenience, and capability.

Overparameterization

A phenomenon where increasing model parameters surprisingly improves generalization, explained through classical theories of generalization.

Distillation

A method to transfer knowledge from a larger model or ensemble to a smaller model, reducing inference compute.

Self-distillation

A form of distillation where a model distills knowledge into another model of the same size, surprisingly leading to further loss improvement.

Diffusion Policy

A paper that heavily influenced the speaker's interest in diffusion models for robotics.

Robotics

The field where DMPC and world models are being applied for control and policy learning.

World Models

Models that learn the dynamics of the world to predict how actions will change the state, enabling capabilities like imagined outcomes and model-based control.

Benign Overfitting

The ability of deep neural networks to fit random noise while generalizing well on structured data, partially explained by Andrew Gordon Wilson's work.

Weight Decay

A regularization technique used to improve model performance in data-constrained settings, with significantly higher values used than in compute-optimal pre-training.

Model Predictive Control

A control strategy that uses a dynamics model to predict future states and optimize actions.

Diffusion Models

Models showing success in generating images, videos, and increasingly in robotics.

Deep Learning

Discussed in the context of generalization, overparameterization, and benign overfitting.

Ensembling

A technique that combines multiple models to improve performance and data efficiency, shown to be effective in modern pre-training.

PAC-Bayes

A classical theory of generalization that bounds test loss with training loss and a compression term.
