Why is inference speed important for AI capabilities?

The speed of inference, or tokens per second, directly translates to the peak intelligence an AI can deliver. As methods become more computationally intensive, faster inference becomes a capability lever, not just a cost or convenience factor.

How does speculative decoding work?

Speculative decoding uses a small, fast 'draft' model to propose multiple tokens and a larger 'target' model to verify them. This is faster because verifying is easier and can be done in parallel with a single forward pass, compared to generating tokens one by one.

What problem does Speculative Speculative Decoding (SSD) solve?

SSD aims to parallelize the sequential drafting and verification steps in speculative decoding. It allows drafting and verification to happen concurrently by predicting the verification outcomes, thereby hiding the latency of drafting.

What is Model Predictive Control (MPC)?

MPC, also known as receding horizon control, uses a dynamics model (or world model) and a planner to predict future states based on a sequence of actions, optimizing for a given objective.

What are world models in AI?

World models are systems, typically large neural networks, that learn the dynamics of the world. They predict how a system or environment will change over time based on current state and actions, enabling capabilities like generating imagined futures.

What is the 'LeWorldModel' approach?

LeWorldModel uses an encoder-decoder architecture and an action-conditioned forecasting module in the latent space. It introduces the SIGReg regularizer to keep latent embeddings healthy and Gaussian-distributed and to prevent representational collapse.

How does Andrew Gordon Wilson's paper address 'overparameterization' in deep learning?

The paper explains that overparameterization improves generalization by simultaneously reducing empirical risk (training loss) and leading to more compressible solutions that can be encoded efficiently, which fits within classical generalization theories like PAC-Bayes.

What is the core question of the paper on data-constrained pre-training?

The paper investigates how to approach pre-training when compute is abundant but data is scarce. It explores scaling recipes that monotonically decrease validation loss to find optimal strategies under these constraints.

What is the benefit of ensembling in data-constrained pre-training?

Ensembling multiple smaller models is shown to be highly data-efficient, achieving lower asymptotic loss compared to a single large regularized model. It represents a true data efficiency win when compute is not a constraint.

How can distillation improve data efficiency in AI models?

Distillation allows knowledge from large, data-efficient, or ensembled models to be transferred to smaller models. This reduces inference compute while retaining a significant portion of the loss improvement, making data efficiency practical.

Key Moments

Inference, Diffusion, World Models, and More | YC Paper Club

Q: How do Diffusion Models contribute to robotics control in DMPC?

DMPC uses diffusion models to learn both multi-step action proposals and multi-step dynamics models. This approach aims to reduce compounding errors and simplify the planning algorithm, allowing for runtime adaptation to new rewards and dynamics.

Y Combinator

Science & Technology | 5 min read | 68 min video

May 28, 2026|129,879 views|4,506|63


TL;DR

AI inference speed is becoming a capability, not just a cost, with speculative decoding and world models promising faster, more intelligent agents.

Key Insights

Inference is becoming a capability lever, especially as reinforcement learning (a wrapper on inference) starts to exceed pre-training compute.

Speculative decoding with its SSD variant parallelizes drafting and verification, achieving speedups by hiding drafting latency and predicting verification outcomes.

Diffusion Model Predictive Control (DMPC) uses diffusion models for multi-step action proposals and dynamics, outperforming existing methods and allowing runtime adaptation to novel rewards and dynamics.

World models aim to learn world dynamics, enabling imagined outcomes, model-based control, and, crucially, surprise quantification, with 'LeWorldModel' offering a potential solution to representation collapse via the 'SIGReg' regularizer.

Deep learning's generalization is not entirely a mystery; classical theories like PAC-Bayes can explain phenomena like overparameterization and benign overfitting by considering compressibility and soft inductive biases.

When data is constrained but compute is abundant, aggressive regularization, ensembling, and distillation, rather than solely scaling model size, offer significant data efficiency gains, potentially improving performance by up to 5x.

Inference as a capability: Beyond cost and convenience

The presentation kicks off by reframing artificial intelligence inference not merely as a cost or convenience factor, but as a crucial capability lever. This shift in perspective is driven by several trends. Firstly, inference costs are already dominating training costs, especially when serving large-scale models to billions of users or generating trillions of tokens. Secondly, within the training process itself, reinforcement learning—which is essentially a form of inference—is beginning to demand more compute than traditional pre-training. The presenter argues that in the very near future, inference speed will be directly equated with peak intelligence delivery, particularly for methods where performance scales with computational 'thinking' time. This capability-centric view motivates research into making inference significantly faster and more powerful.

Accelerating inference with speculative decoding

Vanilla speculative decoding speeds up inference by using a small, fast 'draft' model to generate multiple token guesses, which are then verified by a larger, more accurate 'target' model in a single forward pass. The key insight is that verification is easier than generation. However, drafting and verification are sequentially dependent, which creates a bottleneck. The proposed Speculative Speculative Decoding (SSD) tackles this by running the two in parallel. While the verifier is still processing the current batch, the draft model anticipates the likely verification outcomes and drafts ahead for them, effectively hiding the drafting latency. It uses the draft model's token distributions to predict those outcomes. Because verification is the slower step, the draft model also has time to draft more tokens, which improves overall throughput and latency. Reported results showed SSD outperforming other engines, reaching 300 tokens per second for LLaMA 3 70B on 4 H100s.
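The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of vanilla, greedy speculative decoding only, not SSD and not the engine discussed in the talk. The names `toy_target`, `toy_draft`, `speculative_step`, and `speculative_decode` are hypothetical, and the two "models" are stand-in functions over integer tokens.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function (assumed interface)


def toy_target(ctx: List[int]) -> int:
    """Stand-in for the large model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Stand-in for the small model: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """One draft-then-verify round; returns the tokens accepted in this round."""
    # 1) Draft k tokens one at a time with the cheap model.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) Verify: the target's greedy choice at each of the k + 1 positions.
    #    A real engine reads these off a single batched forward pass; the toy
    #    target is simply called once per position here.
    target_choice = [target(list(prefix) + proposed[:i]) for i in range(k + 1)]

    # 3) Keep draft tokens up to the first disagreement, then append the
    #    target's own token at that position (a bonus token if all matched).
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != target_choice[i]:
            break
        accepted.append(proposed[i])
    accepted.append(target_choice[len(accepted)])
    return accepted


def speculative_decode(prefix: List[int], draft: NextToken, target: NextToken,
                       k: int, max_new: int) -> Tuple[List[int], int]:
    """Repeat rounds until max_new tokens exist; returns (tokens, rounds used)."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[: len(prefix) + max_new], rounds


def greedy_decode(prefix: List[int], target: NextToken, max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target(out))
    return out
```

With these toy models, `speculative_decode([3], toy_draft, toy_target, k=4, max_new=8)` produces the same tokens as `greedy_decode` in three verification rounds instead of eight sequential target steps. The output matches the target's greedy output by construction, since every emitted token is either confirmed or supplied by the target.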

World models for robotics and understanding dynamics

The discussion then shifts to 'world models,' which are systems designed to learn the dynamics of the world. These models predict how a system will change over time based on its current state and executed actions. This capability allows for generating imagined future outcomes, enabling model-based control, and quantifying surprise or uncertainty. Stannis presented Diffusion Model Predictive Control (DMPC), which uses diffusion models to learn both multi-step action proposals and dynamics models. This approach aims to reduce compounding errors in predictions and simplify the planning algorithm, allowing a simple sampler to outperform existing methods. A key advantage of DMPC is its ability to adapt to novel reward functions and dynamics at runtime, a significant benefit for real-world robotics applications where environmental conditions can change unpredictably.
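The planning loop behind model predictive control can be sketched as follows. This is a generic random-shooting planner on a made-up one-dimensional system, not DMPC itself: per the description above, DMPC would replace the uniform action sampler with a learned diffusion action proposal and the hand-written `dynamics` function with a learned multi-step diffusion dynamics model. All names and constants here are illustrative assumptions.

```python
import random
from typing import List


def dynamics(state: float, action: float) -> float:
    """Hypothetical 1-D model: the action shifts the state directly."""
    return state + action


def rollout_cost(state: float, actions: List[float], goal: float) -> float:
    """Roll an action sequence through the model and sum squared distance to goal."""
    total = 0.0
    for a in actions:
        state = dynamics(state, a)
        total += (state - goal) ** 2
    return total


def plan_first_action(state: float, goal: float, horizon: int,
                      n_candidates: int, rng: random.Random) -> float:
    """Sample candidate action sequences and return the first action of the cheapest."""
    best_cost, best_first = float("inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        c = rollout_cost(state, actions, goal)
        if c < best_cost:
            best_cost, best_first = c, actions[0]
    return best_first


def run_mpc(start: float, goal: float, steps: int, horizon: int = 3,
            n_candidates: int = 500, seed: int = 0) -> float:
    """Receding horizon: re-plan at every step, execute only the first action."""
    rng = random.Random(seed)
    state = start
    for _ in range(steps):
        action = plan_first_action(state, goal, horizon, n_candidates, rng)
        state = dynamics(state, action)
    return state
```

Because the objective is only evaluated inside `rollout_cost`, swapping in a different goal or cost at runtime requires no retraining in this sketch, which is the kind of runtime adaptation to new rewards that the summary attributes to DMPC.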

The 'LeWorldModel' and the quest for healthy latent spaces

Following the DMPC discussion, 'LeWorldModel' was presented, focusing on the challenge of learning robust world models without representational collapse. Traditional world models can suffer from optimization landscapes where trivial solutions, like predicting the same state regardless of action, dominate. To combat this, LeWorldModel, inspired by Yann LeCun's JEPA architecture, introduces the 'SIGReg' regularizer. This regularizer encourages the latent embeddings of predicted states to be Gaussian and isotropic (uniform across dimensions) across a batch of data. By keeping the latent distribution 'healthy', the model avoids collapse and retains its ability to predict meaningful dynamics. The regularization is simple, requiring only one hyperparameter and one loss term. The approach was reported to give faster inference (a 50x speedup over competitors on toy tasks) and to let model error be quantified through detected spikes in prediction uncertainty.
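To make the idea of penalizing a collapsed latent batch concrete, here is a simplified stand-in, not the regularizer from the paper. It projects a batch of embeddings onto random unit directions and penalizes each projection for deviating from zero mean and unit variance, which is what an isotropic standard Gaussian would give. The function names, batch sizes, and the choice of moment matching are all assumptions made for illustration.

```python
import math
import random
from typing import List


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Draw a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def isotropy_penalty(batch: List[List[float]], n_directions: int = 32, seed: int = 0) -> float:
    """Average, over random 1-D projections, of mean^2 + (variance - 1)^2."""
    rng = random.Random(seed)
    dim = len(batch[0])
    total = 0.0
    for _ in range(n_directions):
        u = random_unit_vector(dim, rng)
        proj = [sum(z_i * u_i for z_i, u_i in zip(z, u)) for z in batch]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        total += mean ** 2 + (var - 1.0) ** 2
    return total / n_directions


# Illustrative data: a well-spread batch versus a fully collapsed one.
_rng = random.Random(1)
healthy = [[_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(512)]
collapsed = [healthy[0]] * 512  # every embedding identical

healthy_penalty = isotropy_penalty(healthy)
collapsed_penalty = isotropy_penalty(collapsed)
```

In this sketch the collapsed batch has zero variance along every direction, so its penalty is at least about 1, while the Gaussian batch scores close to 0. Added to a prediction loss with a single weight, a term of this kind pushes the encoder away from the trivial constant solution.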

Demystifying deep learning generalization

The presentation then addressed theories attempting to explain deep learning's success, particularly generalization. The paper "Deep Learning is Not So Mysterious or Different" by Andrew Gordon Wilson argues that classical theories of generalization, such as PAC-Bayes, can illuminate contemporary phenomena like overparameterization and benign overfitting. Overparameterization, where models with more parameters than data points often generalize better, is explained by considering both improved empirical risk (lower training loss) and increased model compressibility, as larger models can find more efficient encodings of training data. Benign overfitting, the ability of models to fit random noise while still generalizing on structured data, is explained through the lens of 'soft inductive biases'—expressive models that, due to regularization, are biased towards simpler, generalizable solutions. The key takeaway is that understanding and potentially optimizing these inductive biases could lead to significant gains in AI sample efficiency.
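The link between compressibility and generalization can be written down with a standard Occam-style bound, which is the point-hypothesis special case of this family of results. It is offered as a sketch under stated assumptions and is not necessarily the exact statement used in the paper or the talk. Assume a loss bounded in [0, 1], n i.i.d. training examples, a countable hypothesis class, and a prior P(h) fixed before seeing the data. Then, with probability at least 1 - δ, simultaneously for all h:

```latex
% R(h): expected (test) risk, \hat{R}(h): empirical (training) risk.
% K(h): length in bits of a prefix-free encoding of h (an assumed notation),
% so that P(h) = 2^{-K(h)} is a valid (sub-)prior by Kraft's inequality.
R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{\ln\frac{1}{P(h)} + \ln\frac{1}{\delta}}{2n}}
\qquad\Longrightarrow\qquad
R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{K(h)\,\ln 2 + \ln\frac{1}{\delta}}{2n}}
```

Read this way, a larger model helps if it lowers the training risk while the solution it finds has a shorter description K(h). The raw parameter count does not appear in the bound at all.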

Data-constrained pre-training: Rethinking scaling laws

The final paper explored AI development in a regime where compute is abundant but data is scarce. Current pre-training methods, optimized for compute efficiency, are approaching data limitations as internet data grows much slower than compute. The research proposed that when data is constrained, techniques like aggressive regularization, ensembling, and historical methods like distillation become paramount. Experiments demonstrated that applying these techniques, particularly a 'joint scaling recipe' combining regularization and ensembling, could achieve performance comparable to much larger compute budgets or datasets. For instance, it showed up to a 5x data efficiency win over standard recipes. Furthermore, distillation techniques allowed for compressing these data-efficient models into smaller, inference-friendly versions without significant performance loss. This suggests a paradigm shift where, in data-limited scenarios, focusing on algorithmic efficiency through classical ML techniques can unlock substantial performance gains.
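The mechanics of ensembling and distillation can be illustrated on a single toy prediction. This sketch uses made-up logits for three hypothetical ensemble members and a student with free logits. It shows only the two operations, not the paper's training recipe or results.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def ensemble_probs(member_logits: List[List[float]]) -> List[float]:
    """Average the members' predictive distributions (probabilities, not logits)."""
    probs = [softmax(l) for l in member_logits]
    k = len(probs)
    return [sum(p[i] for p in probs) / k for i in range(len(probs[0]))]


def nll(probs: List[float], label: int) -> float:
    return -math.log(probs[label])


def kl(teacher: List[float], student: List[float]) -> float:
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)


def distill(teacher: List[float], steps: int = 200, lr: float = 0.5) -> List[float]:
    """Fit student logits to the teacher's soft targets by gradient descent.

    The gradient of cross-entropy with respect to the logits is
    softmax(logits) - teacher.
    """
    logits = [0.0] * len(teacher)
    for _ in range(steps):
        p = softmax(logits)
        logits = [z - lr * (p_i - t_i) for z, p_i, t_i in zip(logits, p, teacher)]
    return softmax(logits)


# Hypothetical logits from three independently trained members; true label is 0.
members = [[2.0, 0.5, 0.0], [0.2, 1.5, 0.0], [1.0, 0.0, 0.8]]
label = 0

teacher = ensemble_probs(members)
ensemble_nll = nll(teacher, label)
mean_member_nll = sum(nll(softmax(m), label) for m in members) / len(members)
student = distill(teacher)
```

By Jensen's inequality the ensemble's loss is never worse than the average loss of its members, which is one reason averaging helps. The student recovers the ensemble's distribution with a single set of logits, mirroring how distillation trades the ensemble's inference cost for one smaller model.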


Common Questions

The YC Paper Club aims to create a community of founders and researchers to foster collaboration and discussion around AI papers. It also serves to invigorate the Pioneer space at YC.

Topics

AI & Machine Learning, Technology & Innovation, Science & Mathematics, Inference Optimization, Model Scaling, Robotics Control, Data Efficiency, Generalization In Deep Learning, Pre-training Strategies

Mentioned in this video

Companies

Anthropic

One of the AI companies located in San Francisco.

xAI

Company mentioned as being in Palo Alto/Woodside area.

OpenAI

Mentioned for the emergence of reasoning in 2024 with a model called 'o1'.

Tesla

Company mentioned as being in Palo Alto/Woodside area.

Thinking Machines

Company mentioned as being in Palo Alto/Woodside area.

DeepSeek

Mentioned for reporting reasoning capabilities in 2024.

Organizations

YC Paper Club

The event where researchers and founders gather to discuss AI papers.

Google DeepMind

An AI company located in Palo Alto and mentioned in relation to the diffusion model paper.

Stanford University

Tanishk's affiliation as a grad student.

Q Labs

Ashe's startup, working with Andrew Gordon Wilson on generalization problems.

Locations

Pioneer

The YC location where the event was held, with a mission to be used more.

People

Sam Altman

Mentioned as having run the show at YC during Winter 16.

Greg Brockman

Mentioned as being involved in the early stages of OpenAI.

Isaac Ward

Introduced as presenting the paper on LeWorldModel.

Chris Ray

Leads a lab focused on generalization under fixed data and infinite compute.

Andrew Gordon Wilson

Author of the paper 'Deep Learning is Not So Mysterious or Different', discussed by Ashe.

Konwoo Kim

Presents the paper on data-constrained pre-training with infinite compute.

Software & Apps

Cursor

Mentioned as an AI-related company in San Francisco.

Llama

A language model used in the speculative decoding schematic.

GPT-3

Mentioned as an example of emergence of in-context learning in 2020.

LLaMA 3 70B

A specific model mentioned in the context of achieving high tokens per second during inference.

SIGReg Regularizer

A regularization term used in LeWorldModel to ensure healthy, Gaussian-distributed latent embeddings.

LeWorldModel

A specific world model approach developed out of Yann LeCun's group, focusing on latent space dynamics and SIGReg regularization.

Dino World Model

A world model that performs well, especially on 3D tasks due to its foundational backbone.

Ethos

A large pre-training run mentioned for its continuous improvement.

DCLM

A dataset used for experiments, simulating a data-constrained world.

Concepts

speculative decoding

An inference technique that aims to speed up token generation by using a smaller model to predict tokens and a larger model to verify them.

inference

The process of using a trained model to generate outputs, discussed in terms of cost, convenience, and capability.

Overparameterization

A phenomenon where increasing model parameters surprisingly improves generalization, explained through classical theories of generalization.

Distillation

A method to transfer knowledge from a larger model or ensemble to a smaller model, reducing inference compute.

Self-distillation

A form of distillation where a model distills knowledge into another model of the same size, surprisingly leading to further loss improvement.

Diffusion Policy

A paper that heavily influenced the speaker's interest in diffusion models for robotics.

Robotics

The field where DMPC and world models are being applied for control and policy learning.

World Models

Models that learn the dynamics of the world to predict how actions will change the state, enabling capabilities like imagined outcomes and model-based control.

Benign Overfitting

The ability of deep neural networks to fit random noise while generalizing well on structured data, partially explained by Andrew's work.

Weight Decay

A regularization technique used to improve model performance in data-constrained settings, with significantly higher values used than in compute-optimal pre-training.

Model Predictive Control

A control strategy that uses a dynamics model to predict future states and optimize actions.

diffusion models

Models showing success in generating images, videos, and increasingly in robotics.

Deep Learning

Discussed in the context of generalization, overparameterization, and benign overfitting.

Ensembling

A technique that combines multiple models to improve performance and data efficiency, shown to be effective in modern pre-training.

PAC-Bayes

A classical theory of generalization that bounds test loss with training loss and a compression term.

Products

H100s

Hardware mentioned in the context of fast inference for LLaMA 3 70B.

Studies & Research

Chinchilla scaling laws

Quantified the relationship between parameter count, data size, and compute efficiency for pre-training.
