speculative decoding
A technique used to make language model generation faster by having a smaller model predict draft tokens that a larger model then verifies. Cursor uses 'speculative edits' as a variant.
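The draft-then-verify loop can be sketched in a few lines. This is a minimal illustration with toy stand-in "models" (both functions below are assumptions for demonstration, not a real drafter or verifier); in practice the drafter is a small language model and the verifier is the large target model, which can check all draft tokens in a single parallel forward pass.

```python
# Minimal sketch of greedy speculative decoding with toy stand-in "models".
# In a real system, draft_model is a small LM and target_model a large LM.

def draft_model(prefix, k):
    # Toy drafter (illustrative assumption): cheaply guesses the next k tokens.
    return [(prefix[-1] + 1) % 5 for _ in range(k)] if prefix else [0] * k

def target_model(prefix):
    # Toy verifier (illustrative assumption): the "correct" next token.
    return sum(prefix) % 5

def speculative_step(prefix, k=4):
    """One round: draft k tokens, keep the longest prefix the target agrees with."""
    drafts = draft_model(prefix, k)
    accepted = []
    for t in drafts:
        if target_model(prefix + accepted) == t:
            accepted.append(t)  # target agrees: this token came almost for free
        else:
            # Mismatch: discard the rest of the draft, emit the target's own token.
            accepted.append(target_model(prefix + accepted))
            break
    else:
        # Whole draft accepted: the verifier still yields one bonus token.
        accepted.append(target_model(prefix + accepted))
    return accepted

print(speculative_step([1, 2]))  # drafts [3, 3, 3, 3]; accepts 3, then corrects to 1
```

Each round emits between one and k+1 tokens while invoking the expensive target model only as often as ordinary decoding would, which is where the speedup comes from: the output distribution matches the target model alone, only faster.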
Videos mentioning speculative decoding

DeepSeek V3, SGLang, and the state of Open Model Inference in 2025 (Quantization, MoEs, Pricing)
Latent Space
A technique for speeding up inference by using a draft model to predict tokens, which are then verified by the larger target model. Support for variations exists in SGLang and other frameworks.

Why Compound AI + Open Source will beat Closed AI — with Lin Qiao, CEO of Fireworks AI
Latent Space
A technique Fireworks AI uses to improve inference speed, discussed in relation to reaching 1000 tokens per second and its implementation within the Fire Optimizer.

Cursor Team: Future of Programming with AI | Lex Fridman Podcast #447
Lex Fridman
Discussed as a way to speed up language model generation by having a smaller model predict draft tokens that a larger model then verifies; the Cursor team describes 'speculative edits', their variant of this idea.