Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 10: Inference
Key Moments
Inference is a costly, repeated process that consumes significant compute, yet is often memory-bound due to KV caches, demanding system-level optimizations like quantization and speculative decoding to improve speed and efficiency.
Key Insights
OpenAI reportedly produces 8.6 trillion tokens daily, enough to exceed the 32-trillion-token training set of models like DeepSeek v4 in under four days, highlighting the massive scale of inference costs.
During inference, the attention layer's arithmetic intensity is far lower than the MLP's (around S/2 during prefill and below 1 during generation), making attention a fundamental bottleneck, especially during generation.
KV caching, crucial for re-using computations across generated tokens, can consume more memory than model parameters at high batch sizes, necessitating techniques like Grouped Query Attention (GQA) or Multi-Latent Attention (MLA) to reduce its size.
Speculative decoding, using a smaller 'draft' model to propose tokens and a larger 'target' model to verify them, can offer significant speedups without sacrificing accuracy by exploiting the asymmetry between generation and verification speed.
Techniques like paged attention, inspired by operating system concepts, can mitigate KV cache fragmentation and improve memory utilization by storing caches in non-contiguous blocks, enabling KV cache sharing for common prompts or multiple responses from the same prompt.
Emerging architectures like state space models or diffusion models, designed with inference efficiency in mind, hold significant potential to overcome limitations inherent in transformer architectures that were not optimized for inference.
The growing importance and cost of inference
Inference, the process of generating outputs from a trained language model given a prompt, is rapidly becoming a critical bottleneck due to its increasing scale and repeated nature. Unlike training, which is a one-time cost, inference incurs costs every day a model is served. For example, OpenAI is estimated to produce 8.6 trillion tokens per day, a volume that exceeds the total training data size of models like DeepSeek v4 (32 trillion tokens) in under four days. This escalating demand is further amplified by the rise of agentic AI, where models perform complex reasoning, tool use, and introspection, generating many tokens not for human reading but for internal computation. The number of tokens generated thus correlates directly with compute spent, and there is no inherent limit to this consumption if the problem is complex enough. Consequently, making inference faster, even by 10%, has substantial economic and practical implications for service providers and users alike.
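A quick back-of-envelope check of that claim, using the two figures quoted above (both are reported estimates, not exact numbers):

```python
# Rough check: at ~8.6T inference tokens/day, how many days until output
# volume exceeds a 32T-token training corpus? (Both figures are estimates.)
daily_inference_tokens = 8.6e12   # OpenAI daily token production, reported
training_corpus_tokens = 32e12    # DeepSeek v4 training data size, reported

print(f"{training_corpus_tokens / daily_inference_tokens:.1f} days")  # ~3.7 days
```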
Metrics for evaluating inference speed
Several key metrics help quantify inference speed, each with its own applications and trade-offs. 'Time to First Token' (TTFT) measures the delay before any output appears, crucial for interactive applications to ensure a good user experience. 'Latency' refers to how quickly individual tokens are generated for a single query, also vital for real-time interactions. 'Throughput', measured in tokens per second, indicates how many tokens can be processed across many queries simultaneously; it's essential for batch processing tasks where completing a large volume of work quickly is the priority. While latency and throughput are related, there can be a trade-off: optimizing for one may negatively impact the other.
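A minimal sketch of how the three metrics are computed from per-token timestamps; the streaming-generator interface here is a hypothetical stand-in, not any particular library's API:

```python
import time

def timing_metrics(stream, prompt):
    """Measure TTFT, average per-token latency, and single-query throughput
    for a hypothetical generator that yields one token at a time."""
    start = time.perf_counter()
    stamps = [time.perf_counter() for _ in stream(prompt)]

    ttft = stamps[0] - start                  # time to first token (seconds)
    total = stamps[-1] - start
    latency = total / len(stamps)             # average seconds per token
    throughput = len(stamps) / total          # tokens/second, one query
    return ttft, latency, throughput

# Dummy generator standing in for a real model server:
def dummy_stream(prompt):
    for tok in prompt.split():
        time.sleep(0.01)
        yield tok

print(timing_metrics(dummy_stream, "the quick brown fox jumps"))
```

Batch throughput, the quantity that matters for offline workloads, would instead divide the total tokens produced across all concurrent requests by wall-clock time.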
Inference's fundamental bottleneck: memory-bound attention
The core computational difference between training and inference lies in their parallelization capabilities. During training, all tokens in a sequence can be processed in parallel. In contrast, inference is auto-regressive, generating tokens one by one. This sequential nature prevents parallelization across the sequence dimension, leading to lower arithmetic intensity. Analysis of transformer blocks reveals that while MLPs can be compute-bound if batch sizes are large enough, the attention mechanism, particularly during generation, becomes fundamentally memory-bound. The arithmetic intensity for attention generation is extremely low (less than 1), meaning the compute units are often idle, waiting for data to be fetched from memory. This memory bottleneck is exacerbated by the KV cache, which stores intermediate key and value states for each token to avoid recomputation, and can grow significantly with batch size.
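To make the intensity gap concrete, here is a simplified roofline-style sketch assuming fp16 (2-byte) weights and cache; the layer sizes are illustrative, not the lecture's exact numbers:

```python
def mlp_decode_intensity(batch, d_model, d_ff):
    """FLOPs per byte for one MLP weight matrix during decoding.
    Weights are read from memory once and reused across the whole batch."""
    flops = 2 * batch * d_model * d_ff
    weight_bytes = 2 * d_model * d_ff          # fp16 weights
    return flops / weight_bytes                # ~= batch size

def attn_decode_intensity(seq_len, d_head):
    """FLOPs per byte for one new query attending over its cached keys.
    Batching does not help here: every sequence reads its own KV cache."""
    flops = 2 * seq_len * d_head               # q @ K^T
    cache_bytes = 2 * seq_len * d_head         # fp16 K cache
    return flops / cache_bytes                 # ~= 1

print(mlp_decode_intensity(batch=64, d_model=4096, d_ff=16384))  # 64.0
print(attn_decode_intensity(seq_len=4096, d_head=128))           # 1.0
```

An H100 needs an arithmetic intensity of roughly 300 FLOPs/byte before compute, rather than memory bandwidth, becomes the limit, so an intensity near 1 means the hardware spends almost all its time waiting on memory.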
Optimizing the KV cache: reducing memory footprint
Given that memory, specifically the KV cache, is the primary bottleneck for inference, many optimization techniques focus on reducing its size. Grouped Query Attention (GQA) reduces the number of key-value heads, decreasing KV cache size by a factor of N/K, where N is the original number of heads and K is the number of key-value groups. This directly improves both latency and throughput by reducing memory I/O. Another approach, Multi-Latent Attention (MLA), proposed by DeepSeek, compresses keys and values into a much lower-dimensional latent representation and caches only that compressed form, reconstructing full keys and values when needed. Cross-Layer Attention (CLA) further reduces memory by sharing KV caches across layers rather than computing them independently for each layer. These techniques aim to shrink the KV cache without significantly degrading model accuracy, though the trade-off must always be carefully evaluated.
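To see the N/K effect concretely, here is a small KV cache size calculation; the model shape below is an assumed 13B-scale configuration, not the lecture's exact numbers:

```python
def kv_cache_gib(batch, seq_len, n_layers, n_kv_heads, d_head, bytes_per=2):
    """Total KV cache size: a K and a V tensor per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per / 2**30

shape = dict(batch=64, seq_len=4096, n_layers=40, d_head=128)
mha = kv_cache_gib(n_kv_heads=40, **shape)   # full multi-head attention
gqa = kv_cache_gib(n_kv_heads=8, **shape)    # GQA: N/K = 40/8 = 5x smaller
print(f"MHA: {mha:.0f} GiB  GQA: {gqa:.0f} GiB")  # MHA: 200 GiB  GQA: 40 GiB
```

At this batch size the full cache (200 GiB) already dwarfs the roughly 26 GiB of fp16 weights a 13B model needs, which is exactly the regime described above where the cache, not the parameters, dominates memory.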
Quantization, pruning, and architectural shifts
Beyond KV cache optimization, several other methods enhance inference efficiency. Quantization reduces the precision of model weights and activations (e.g., from FP16 to INT4), decreasing memory usage and speeding up computations. While post-training quantization is common, quantization-aware training can yield better results by simulating quantization errors during the training process. Model pruning involves removing less important weights, neurons, or even entire layers from a large model, followed by fine-tuning to recover accuracy. This can significantly reduce model size. Architectural innovations, such as sliding window attention (limiting attention to recent tokens) or linear attention variants (like those in Mamba or DeltaNet), offer alternatives to the quadratic complexity of full attention, especially for long contexts. These new architectures are often designed with inference efficiency as a primary goal, aiming to overcome the inherent limitations of the original transformer design.
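As a concrete illustration of post-training quantization, here is a minimal symmetric round-to-nearest INT8 sketch; the lecture's examples go further (e.g., INT4), and production methods such as GPTQ are considerably more sophisticated:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor PTQ: store int8 codes plus one fp32 scale."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)   # values land in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32) * 0.02
q, s = quantize_int8(w)
print(f"{w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB, "
      f"mean abs error {np.abs(dequantize(q, s) - w).mean():.2e}")
```

Per-channel scales reduce the rounding error further; quantization-aware training goes one step beyond by simulating exactly this rounding inside the forward pass so the weights adapt to it.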
Speculative decoding and dynamic batching for live systems
Speculative decoding is an elegant method to mask inference latency. It leverages the fact that verifying a sequence of tokens is faster than generating them. A smaller, faster 'draft' model generates a candidate sequence of tokens, which are then processed in parallel by the larger 'target' model for verification. If accepted, the batch of tokens is committed, achieving a speedup. This approach balances the efficiency of the draft model with the accuracy of the target model, offering significant speed gains. For dynamic workloads, such as live chatbots, continuous batching and paged attention are crucial. Continuous batching updates the batch of requests dynamically as new ones arrive and old ones complete. Paged attention, inspired by OS memory management, stores KV caches in non-contiguous memory blocks, reducing fragmentation and enabling efficient sharing of caches for common prompts or multiple responses from the same prompt, further optimizing throughput and memory usage.
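A minimal sketch of one speculative-decoding round. The `draft_next`/`target_next` callables are hypothetical stand-ins for next-token calls, and the acceptance rule shown is exact only for greedy decoding; the published algorithm uses a rejection-sampling test to remain lossless under sampling:

```python
def speculative_step(target_next, draft_next, context, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap per call).
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # 2. Target model evaluates every proposal position. In a real engine
    #    this is ONE batched forward pass over context + draft.
    verify = [target_next(context + draft[:i]) for i in range(k + 1)]

    # 3. Keep the longest agreeing prefix, plus one target token for free.
    out = list(context)
    for i, tok in enumerate(draft):
        if tok != verify[i]:
            out.append(verify[i])    # target's correction replaces the miss
            return out
        out.append(tok)
    out.append(verify[k])            # all k accepted: bonus target token
    return out

# Toy demo: both "models" just repeat the previous token, so every draft
# is accepted and one round commits k + 1 = 5 new tokens.
last = lambda seq: seq[-1]
print(speculative_step(last, last, ["the"]))
```

The speedup comes entirely from step 2 being a single parallel pass over the large model instead of k sequential ones; when the draft model agrees often, most rounds commit several tokens for roughly the cost of one target forward pass.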
Common Questions
What is the primary goal of AI model inference?
The primary goal is to produce accurate and fast responses from a trained model when given a specific prompt. This process is crucial for practical applications beyond research, enabling tasks like chatting, code completion, and agentic actions.
Mentioned in this video
Multi-Latent Attention (MLA): A DeepSeek innovation that compresses key and value projections to reduce KV cache size, improving efficiency.
Grouped Query Attention (GQA): An attention mechanism that reduces KV cache size by using fewer key/value heads, improving latency and throughput.
Cross-Layer Attention (CLA): A technique that shares KV caches across layers, reducing memory usage and improving performance.
KV cache token selection: A method that keeps only a subset of KV cache tokens, using lighter-weight queries to determine which tokens are important.
Quantization-aware training (QAT): A training method where quantization is simulated during the forward pass, adapting weights for better performance post-quantization.
Speculative decoding: A lossless method for speeding up inference by using a cheaper draft model to generate multiple tokens, which are then verified by a larger target model in parallel.
Multi-Query Attention (MQA): An attention variant with a single key/value head (K=1), mentioned as being very fast but generally not used due to poor performance.
Paged attention: A technique introduced in the vLLM paper that divides the KV cache into non-contiguous blocks to manage memory fragmentation and improve efficiency.
Sliding window attention: An attention mechanism that limits the context to the last k tokens, reducing KV cache size and making it suitable for long contexts.
Linear attention: Mentioned as a more powerful alternative to sliding window attention, potentially better for long contexts and capable of representing aspects of sliding window attention.
GPTQ: A quantization technique that uses Hessian information to quantize layer by layer while tracking error propagation, improving accuracy.
vLLM: Mentioned for introducing the paged attention technique to manage KV cache storage and reduce fragmentation.
SGLang: An open-source framework particularly good for agentic workloads, noted as potentially less popular but useful.
13B model: Used as the running example for calculating latency, throughput, and memory usage on an H100.
TensorRT-LLM: Software from NVIDIA that provides fast inference but is described as more narrow in its application.
Orca: An early system that introduced the concept of continuous batching for dynamic workloads in inference.
llama.cpp: A package for running inference on CPUs, highlighted as a popular option for CPU-based inference.
DeepSeek v4: Used as a reference point for token production and training scale, comparing OpenAI's inference needs.
OpenAI: Mentioned for its estimated daily token production, highlighting the scale of inference.
NVIDIA: Mentioned for a paper on model pruning that demonstrated reducing a 15B model to an 8B model with minimal accuracy loss.