Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 10: Inference
Key Moments
Inference is a costly, repeated process that consumes significant compute, yet is often memory-bound due to KV caches, demanding system-level optimizations like quantization and speculative decoding to improve speed and efficiency.
Key Insights
OpenAI reportedly produces 8.6 trillion tokens daily, enough to exceed the 32-trillion-token training set of models like DeepSeek v4 in under four days, highlighting the massive scale of inference costs.
During inference, the attention layer's arithmetic intensity is far lower than the MLP's (around S/2 during prefill and below 1 during generation), making attention a fundamental bottleneck, especially during generation.
KV caching, crucial for re-using computations across generated tokens, can consume more memory than model parameters at high batch sizes, necessitating techniques like Grouped Query Attention (GQA) or Multi-Latent Attention (MLA) to reduce its size.
Speculative decoding, using a smaller 'draft' model to propose tokens and a larger 'target' model to verify them, can offer significant speedups without sacrificing accuracy by exploiting the asymmetry between generation and verification speed.
Techniques like paged attention, inspired by operating system concepts, can mitigate KV cache fragmentation and improve memory utilization by storing caches in non-contiguous blocks, enabling KV cache sharing for common prompts or multiple responses from the same prompt.
Emerging architectures like state space models or diffusion models, designed with inference efficiency in mind, hold significant potential to overcome limitations inherent in transformer architectures that were not optimized for inference.
The growing importance and cost of inference
Inference, the process of generating outputs from a trained language model given a prompt, is rapidly becoming a critical bottleneck due to its increasing scale and repeated nature. Unlike training, which is a one-time cost, inference incurs costs every day a model is served. For example, OpenAI is estimated to produce 8.6 trillion tokens per day, a volume that exceeds the total training data size of models like DeepSeek v4 (32 trillion tokens) in under four days. This escalating demand is further amplified by the rise of agentic AI, where models perform complex reasoning, tool use, and introspection, generating many tokens not for human reading but for internal computation. The number of tokens generated thus correlates directly with compute spent, and there is no inherent limit to this consumption if the problem is complex enough. Consequently, making inference faster, even by 10%, has substantial economic and practical implications for service providers and users alike.
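A quick back-of-envelope check of that claim, using the two figures quoted above (both are reported estimates, not exact numbers):

```python
# Rough check: at ~8.6T inference tokens/day, how many days until output
# volume exceeds a 32T-token training corpus? (Both figures are estimates.)
daily_inference_tokens = 8.6e12   # OpenAI daily token production, reported
training_corpus_tokens = 32e12    # DeepSeek v4 training data size, reported

print(f"{training_corpus_tokens / daily_inference_tokens:.1f} days")  # ~3.7 days
```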
Metrics for evaluating inference speed
Several key metrics help quantify inference speed, each with its own applications and trade-offs. 'Time to First Token' (TTFT) measures the delay before any output appears, crucial for interactive applications to ensure a good user experience. 'Latency' refers to how quickly individual tokens are generated for a single query, also vital for real-time interactions. 'Throughput', measured in tokens per second, indicates how many tokens can be processed across many queries simultaneously; it's essential for batch processing tasks where completing a large volume of work quickly is the priority. While latency and throughput are related, there can be a trade-off: optimizing for one may negatively impact the other.
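A minimal sketch of how the three metrics are computed from per-token timestamps; the streaming-generator interface here is a hypothetical stand-in, not any particular library's API:

```python
import time

def timing_metrics(stream, prompt):
    """Measure TTFT, average per-token latency, and single-query throughput
    for a hypothetical generator that yields one token at a time."""
    start = time.perf_counter()
    stamps = [time.perf_counter() for _ in stream(prompt)]

    ttft = stamps[0] - start                  # time to first token (seconds)
    total = stamps[-1] - start
    latency = total / len(stamps)             # average seconds per token
    throughput = len(stamps) / total          # tokens/second, one query
    return ttft, latency, throughput

# Dummy generator standing in for a real model server:
def dummy_stream(prompt):
    for tok in prompt.split():
        time.sleep(0.01)
        yield tok

print(timing_metrics(dummy_stream, "the quick brown fox jumps"))
```

Batch throughput, the quantity that matters for offline workloads, would instead divide the total tokens produced across all concurrent requests by wall-clock time.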
Inference's fundamental bottleneck: memory-bound attention
The core computational difference between training and inference lies in their parallelization capabilities. During training, all tokens in a sequence can be processed in parallel. In contrast, inference is auto-regressive, generating tokens one by one. This sequential nature prevents parallelization across the sequence dimension, leading to lower arithmetic intensity. Analysis of transformer blocks reveals that while MLPs can be compute-bound if batch sizes are large enough, the attention mechanism, particularly during generation, becomes fundamentally memory-bound. The arithmetic intensity for attention generation is extremely low (less than 1), meaning the compute units are often idle, waiting for data to be fetched from memory. This memory bottleneck is exacerbated by the KV cache, which stores intermediate key and value states for each token to avoid recomputation, and can grow significantly with batch size.
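To make the intensity gap concrete, here is a simplified roofline-style sketch assuming fp16 (2-byte) weights and cache; the layer sizes are illustrative, not the lecture's exact numbers:

```python
def mlp_decode_intensity(batch, d_model, d_ff):
    """FLOPs per byte for one MLP weight matrix during decoding.
    Weights are read from memory once and reused across the whole batch."""
    flops = 2 * batch * d_model * d_ff
    weight_bytes = 2 * d_model * d_ff          # fp16 weights
    return flops / weight_bytes                # ~= batch size

def attn_decode_intensity(seq_len, d_head):
    """FLOPs per byte for one new query attending over its cached keys.
    Batching does not help here: every sequence reads its own KV cache."""
    flops = 2 * seq_len * d_head               # q @ K^T
    cache_bytes = 2 * seq_len * d_head         # fp16 K cache
    return flops / cache_bytes                 # ~= 1

print(mlp_decode_intensity(batch=64, d_model=4096, d_ff=16384))  # 64.0
print(attn_decode_intensity(seq_len=4096, d_head=128))           # 1.0
```

An H100 needs an arithmetic intensity of roughly 300 FLOPs/byte before compute, rather than memory bandwidth, becomes the limit, so an intensity near 1 means the hardware spends almost all its time waiting on memory.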
Optimizing the KV cache: reducing memory footprint
Given that memory, specifically the KV cache, is the primary bottleneck for inference, many optimization techniques focus on reducing its size. Grouped Query Attention (GQA) reduces the number of key-value heads, decreasing KV cache size by a factor of N/K, where N is the original number of heads and K is the number of key-value groups. This directly improves both latency and throughput by reducing memory I/O. Another approach, Multi-Latent Attention (MLA), proposed by DeepSeek, compresses keys and values into a much lower-dimensional latent representation and caches only that compressed form, reconstructing full keys and values when needed. Cross-Layer Attention (CLA) further reduces memory by sharing KV caches across layers rather than computing them independently for each layer. These techniques aim to shrink the KV cache without significantly degrading model accuracy, though the trade-off must always be carefully evaluated.
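To see the N/K effect concretely, here is a small KV cache size calculation; the model shape below is an assumed 13B-scale configuration, not the lecture's exact numbers:

```python
def kv_cache_gib(batch, seq_len, n_layers, n_kv_heads, d_head, bytes_per=2):
    """Total KV cache size: a K and a V tensor per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per / 2**30

shape = dict(batch=64, seq_len=4096, n_layers=40, d_head=128)
mha = kv_cache_gib(n_kv_heads=40, **shape)   # full multi-head attention
gqa = kv_cache_gib(n_kv_heads=8, **shape)    # GQA: N/K = 40/8 = 5x smaller
print(f"MHA: {mha:.0f} GiB  GQA: {gqa:.0f} GiB")  # MHA: 200 GiB  GQA: 40 GiB
```

At this batch size the full cache (200 GiB) already dwarfs the roughly 26 GiB of fp16 weights a 13B model needs, which is exactly the regime described above where the cache, not the parameters, dominates memory.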
Quantization, pruning, and architectural shifts
Beyond KV cache optimization, several other methods enhance inference efficiency. Quantization reduces the precision of model weights and activations (e.g., from FP16 to INT4), decreasing memory usage and speeding up computations. While post-training quantization is common, quantization-aware training can yield better results by simulating quantization errors during the training process. Model pruning involves removing less important weights, neurons, or even entire layers from a large model, followed by fine-tuning to recover accuracy. This can significantly reduce model size. Architectural innovations, such as sliding window attention (limiting attention to recent tokens) or linear attention variants (like those in Mamba or DeltaNet), offer alternatives to the quadratic complexity of full attention, especially for long contexts. These new architectures are often designed with inference efficiency as a primary goal, aiming to overcome the inherent limitations of the original transformer design.
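As a concrete illustration of post-training quantization, here is a minimal symmetric round-to-nearest INT8 sketch; the lecture's examples go further (e.g., INT4), and production methods such as GPTQ are considerably more sophisticated:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor PTQ: store int8 codes plus one fp32 scale."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)   # values land in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32) * 0.02
q, s = quantize_int8(w)
print(f"{w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB, "
      f"mean abs error {np.abs(dequantize(q, s) - w).mean():.2e}")
```

Per-channel scales reduce the rounding error further; quantization-aware training goes one step beyond by simulating exactly this rounding inside the forward pass so the weights adapt to it.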
Speculative decoding and dynamic batching for live systems
Speculative decoding is an elegant method to mask inference latency. It leverages the fact that verifying a sequence of tokens is faster than generating them. A smaller, faster 'draft' model generates a candidate sequence of tokens, which are then processed in parallel by the larger 'target' model for verification. If accepted, the batch of tokens is committed, achieving a speedup. This approach balances the efficiency of the draft model with the accuracy of the target model, offering significant speed gains. For dynamic workloads, such as live chatbots, continuous batching and paged attention are crucial. Continuous batching updates the batch of requests dynamically as new ones arrive and old ones complete. Paged attention, inspired by OS memory management, stores KV caches in non-contiguous memory blocks, reducing fragmentation and enabling efficient sharing of caches for common prompts or multiple responses from the same prompt, further optimizing throughput and memory usage.
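A minimal sketch of one speculative-decoding round. The `draft_next`/`target_next` callables are hypothetical stand-ins for next-token calls, and the acceptance rule shown is exact only for greedy decoding; the published algorithm uses a rejection-sampling test to remain lossless under sampling:

```python
def speculative_step(target_next, draft_next, context, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap per call).
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # 2. Target model evaluates every proposal position. In a real engine
    #    this is ONE batched forward pass over context + draft.
    verify = [target_next(context + draft[:i]) for i in range(k + 1)]

    # 3. Keep the longest agreeing prefix, plus one target token for free.
    out = list(context)
    for i, tok in enumerate(draft):
        if tok != verify[i]:
            out.append(verify[i])    # target's correction replaces the miss
            return out
        out.append(tok)
    out.append(verify[k])            # all k accepted: bonus target token
    return out

# Toy demo: both "models" just repeat the previous token, so every draft
# is accepted and one round commits k + 1 = 5 new tokens.
last = lambda seq: seq[-1]
print(speculative_step(last, last, ["the"]))
```

The speedup comes entirely from step 2 being a single parallel pass over the large model instead of k sequential ones; when the draft model agrees often, most rounds commit several tokens for roughly the cost of one target forward pass.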
Common Questions
What is the primary goal of AI model inference?
The primary goal is to produce accurate and fast responses from a trained model when given a specific prompt. This process is crucial for practical applications beyond research, enabling tasks like chatting, code completion, and agentic actions.
Mentioned in this video
Multi-Latent Attention (MLA): A DeepSeek innovation that compresses key and value projections to reduce KV cache size, improving efficiency.
Grouped Query Attention (GQA): An attention mechanism that reduces KV cache size by using fewer key/value heads, improving latency and throughput.
Cross-Layer Attention (CLA): A technique that shares KV caches across layers, reducing memory usage and improving performance.
KV cache token selection: A method that keeps only a subset of KV cache tokens, using lighter-weight queries to determine which tokens are important.
Quantization-aware training (QAT): A training method where quantization is simulated during the forward pass, adapting weights for better performance post-quantization.
Speculative decoding: A lossless method for speeding up inference by using a cheaper draft model to generate multiple tokens, which are then verified by a larger target model in parallel.
Multi-Query Attention (MQA): An attention variant with a single key/value head (K=1), mentioned as being very fast but generally not used due to poor performance.
Paged attention: A technique introduced in the vLLM paper that divides the KV cache into non-contiguous blocks to manage memory fragmentation and improve efficiency.
Sliding window attention: An attention mechanism that limits the context to the last k tokens, reducing KV cache size and making it suitable for long contexts.
Linear attention: Mentioned as a more powerful alternative to sliding window attention, potentially better for long contexts and capable of representing aspects of sliding window attention.
GPTQ: A quantization technique that uses Hessian information to quantize layer by layer while tracking error propagation, improving accuracy.
vLLM: Mentioned for introducing the paged attention technique to manage KV cache storage and reduce fragmentation.
SGLang: An open-source framework particularly good for agentic workloads, noted as potentially less popular but useful.
13B model: Used as the running example for calculating latency, throughput, and memory usage on an H100.
TensorRT-LLM: Software from NVIDIA that provides fast inference but is described as more narrow in its application.
Orca: An early system that introduced the concept of continuous batching for dynamic workloads in inference.
llama.cpp: A package for running inference on CPUs, highlighted as a popular option for CPU-based inference.
DeepSeek v4: Used as a reference point for token production and training scale, comparing OpenAI's inference needs.
OpenAI: Mentioned for its estimated daily token production, highlighting the scale of inference.
NVIDIA: Mentioned for a paper on model pruning that demonstrated reducing a 15B model to an 8B model with minimal accuracy loss.